1 <chapter id="record-model-domxml">
2 <!-- $Id: recordmodel-domxml.xml,v 1.6 2007-02-21 13:38:22 marc Exp $ -->
3 <title>&dom; &xml; Record Model and Filter Module</title>
6 The record model described in this chapter applies to the fundamental,
8 record type <literal>dom</literal>, introduced in
9 <xref linkend="componentmodulesdom"/>. The &dom; &xml; record model
is experimental, and its inner workings might change in future
11 releases of the &zebra; Information Server.
16 <section id="record-model-domxml-filter">
17 <title>&dom; Record Filter Architecture</title>
The &dom; &xml; filter uses a standard &dom; &xml; structure as
its internal data model, and can therefore parse, index, and display
any &xml; document type. It is well suited to work on
standardized &xml;-based formats such as Dublin Core, MODS, METS,
MARCXML, OAI-PMH, and RSS, and performs equally well on any
non-standard &xml; format.
A parser for binary &marc; records based on the ISO2709 library
standard is provided; it transforms these to the internal
&marcxml; &dom; representation. Other binary document parsers
are planned to follow.
The &dom; filter architecture consists of four
different pipelines, each being a chain of arbitrarily many successive
&xslt; transformations of the internal &dom; &xml;
representations of documents.
41 <figure id="record-model-domxml-architecture-fig">
42 <title>&dom; &xml; filter architecture</title>
45 <imagedata fileref="domfilter.pdf" format="PDF" scale="50"/>
48 <imagedata fileref="domfilter.png" format="PNG"/>
51 <!-- Fall back if none of the images can be used -->
53 [Here there should be a diagram showing the &dom; &xml;
filter architecture, but it seems that your
55 tool chain has not been able to include the diagram in this
63 <table id="record-model-domxml-architecture-table" frame="top">
64 <title>&dom; &xml; filter pipelines overview</title>
70 <entry>Description</entry>
78 <entry><literal>input</literal></entry>
80 <entry>input parsing and initial
81 transformations to common &xml; format</entry>
82 <entry>Input raw &xml; record buffers, &xml; streams and
83 binary &marc; buffers</entry>
84 <entry>Common &xml; &dom;</entry>
87 <entry><literal>extract</literal></entry>
89 <entry>indexing term extraction
90 transformations</entry>
91 <entry>Common &xml; &dom;</entry>
92 <entry>Indexing &xml; &dom;</entry>
95 <entry><literal>store</literal></entry>
97 <entry> transformations before internal document
99 <entry>Common &xml; &dom;</entry>
100 <entry>Storage &xml; &dom;</entry>
103 <entry><literal>retrieve</literal></entry>
105 <entry>multiple document retrieve transformations from
106 storage to different output
107 formats are possible</entry>
108 <entry>Storage &xml; &dom;</entry>
109 <entry>Output &xml; syntax in requested formats</entry>
The &dom; &xml; filter pipelines use &xslt; (and, if supported on
your platform, even &exslt;), thus bringing full &xpath;
support to the indexing, storage, and display rules of not only
&xml; documents but also binary &marc; records.
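<para>
As a rough illustration of &xpath; applied to a binary &marc; record,
the following stylesheet sketch indexes a title field. It assumes that
the &marc; input parser emits &marcxml; in the
<literal>http://www.loc.gov/MARC21/slim</literal> namespace and that
the title sits in field 245, subfield a. The index name
<literal>title</literal> is hypothetical, and the
<literal>z:index</literal> element is the indexing instruction
described in <xref linkend="record-model-domxml-canonical"/>.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:marc="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="marc">

  <!-- suppress default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- assumption: title in MARC field 245, subfield a -->
  <xsl:template
      match="marc:datafield[@tag='245']/marc:subfield[@code='a']">
    <z:index name="title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>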
124 <section id="record-model-domxml-pipeline">
125 <title>&dom; &xml; filter pipeline configuration</title>
128 The experimental, loadable &dom; &xml;/&xslt; filter module
129 <literal>mod-dom.so</literal>
130 is invoked by the <filename>zebra.cfg</filename> configuration statement
132 recordtype.xml: dom.db/filter_dom_conf.xml
134 In this example the &dom; &xml; filter is configured to work
135 on all data files with suffix
136 <filename>*.xml</filename>, where the configuration file is found in the
137 path <filename>db/filter_dom_conf.xml</filename>.
140 <para>The &dom; &xslt; filter configuration file must be
141 valid &xml;. It might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <xmlreader level="1"/>
    <!-- <marc inputcharset="marc-8"/> -->
  </input>
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>
</dom>
The root &xml; element <literal>&lt;dom&gt;</literal> and all other &dom;
&xml; filter elements reside in the namespace
<literal>http://indexdata.com/zebra-2.0</literal>.
All pipeline definition elements - i.e. the
<literal>&lt;input&gt;</literal>,
<literal>&lt;extract&gt;</literal>,
<literal>&lt;store&gt;</literal>, and
<literal>&lt;retrieve&gt;</literal> elements - are optional.
Missing pipeline definitions are simply interpreted as
do-nothing identity pipelines.
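<para>
As an illustration of optional pipelines, the following configuration
sketch leaves out the <literal>&lt;store&gt;</literal> element. Going
by the rule above, the common format would then be stored unchanged.
The stylesheet name <filename>common2dc.xsl</filename> is hypothetical.
</para>
<screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <xmlreader level="1"/>
  </input>
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>
  <!-- no store element: treated as an identity pipeline -->
  <retrieve name="dc">
    <!-- hypothetical stylesheet, reading the common format directly -->
    <xslt stylesheet="common2dc.xsl"/>
  </retrieve>
</dom>
]]></screen>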
All pipeline definition elements may contain zero or more
<literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
&xslt; transformation instructions, which are performed
sequentially from top to bottom.
The paths in the <literal>stylesheet</literal> attributes
are relative to &zebra;'s working directory, or absolute to the file
191 <section id="record-model-domxml-pipeline-input">
192 <title>Input pipeline</title>
The <literal>&lt;input&gt;</literal> pipeline definition element
195 may contain either one &xml; Reader definition
196 <literal><![CDATA[<xmlreader level="1"/>]]></literal>, used to split
197 an &xml; collection input stream into individual &xml; &dom;
198 documents at the prescribed element level,
201 <literal><![CDATA[<marc inputcharset="marc-8"/>]]></literal>, which defines
202 a conversion to &marcxml; format &dom; trees. The allowed values
203 of the <literal>inputcharset</literal> attribute depend on your
204 local <productname>iconv</productname> set-up.
207 Both input parsers deliver individual &dom; &xml; documents to the
208 following chain of zero or more
209 <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
210 &xslt; transformations. At the end of this pipeline, the documents
211 are in the common format, used to feed both the
<literal>&lt;extract&gt;</literal> and
<literal>&lt;store&gt;</literal> pipelines.
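<para>
A possible <literal>&lt;input&gt;</literal> pipeline for binary &marc;
files could look like the following sketch. The stylesheet name
<filename>marcxml2common.xsl</filename> is hypothetical, and the
<literal>marc-8</literal> value is simply the one shown in the
configuration example above.
</para>
<screen><![CDATA[
<input>
  <!-- parse binary MARC records into MARCXML DOM trees -->
  <marc inputcharset="marc-8"/>
  <!-- hypothetical stylesheet mapping MARCXML to the common format -->
  <xslt stylesheet="marcxml2common.xsl"/>
</input>
]]></screen>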
217 <section id="record-model-domxml-pipeline-extract">
218 <title>Extract pipeline</title>
221 <section id="record-model-domxml-pipeline-store">
222 <title>Store pipeline</title>
225 <section id="record-model-domxml-pipeline-retrieve">
226 <title>Retrieve pipeline</title>
229 All named stylesheets defined inside
230 <literal>schema</literal> element tags
231 are for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
unique; they are the literal <literal>schema</literal> or
<literal>element set</literal> names used in
236 <ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
237 <ulink url="&url.sru;">&sru;</ulink> and
238 &z3950; protocol queries.
243 <section id="record-model-domxml-internal">
244 <title>&dom; filter internal record representation</title>
<para>When indexing, an &xml; Reader is invoked to split the input
files into suitable record &xml; pieces. Each record piece is then
transformed to an &xml; &dom; structure, which is essentially the
record model. Only &xslt; transformations can be applied during
indexing, search, and retrieval. Consequently, output formats are
restricted to whatever &xslt; can deliver from the record &xml;
structure, be it other &xml; formats, HTML, or plain text. If
you have <literal>libxslt1</literal> running with &exslt; support,
you can use this functionality inside the &dom;
filter configuration &xslt; stylesheets.
258 <section id="record-model-domxml-canonical">
259 <title>&dom; Canonical Indexing Format</title>
260 <para>The output of the indexing &xslt; stylesheets must contain
261 certain elements in the magic
262 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
263 namespace. The output of the &xslt; indexing transformation is then
264 parsed using &dom; methods, and the contained instructions are
265 performed on the <emphasis>magic elements and their
269 For example, the output of the command
271 xsltproc xsl/oai2index.xsl one-record.xml
273 might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
          z:id="oai:JTRS:CP-3290---Volume-I"
          z:rank="47896"
          z:type="update">
  <z:index name="oai_identifier" type="0">
    oai:JTRS:CP-3290---Volume-I</z:index>
  <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
  <z:index name="oai_setspec" type="0">jtrs</z:index>
  <z:index name="dc_all" type="w">
    <z:index name="dc_title" type="w">Proceedings of the 4th
      International Conference and Exhibition:
      World Congress on Superconductivity - Volume I</z:index>
    <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
      Burnham, Editors</z:index>
  </z:index>
</z:record>
<para>This means the following: From the original &xml; file
<literal>one-record.xml</literal> (or from the &xml; record &dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &xml; record, which is defined by
the <literal>record</literal> element in the magic namespace
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
&zebra; uses the content of
<literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
record ID, and - in case static ranking is set - the content of
<literal>z:rank="47896"</literal> as static rank. Following the
discussion in <xref linkend="administration-ranking"/>
we see that this record is internally ordered
lexicographically according to the value of the string
<literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
The type of action performed during indexing is defined by
<literal>z:type="update"</literal>, with recognized values
<literal>insert</literal>, <literal>update</literal>, and
<literal>delete</literal>.
313 <para>In this example, the following literal indexes are constructed:
322 where the indexing type is defined in the
323 <literal>type</literal> attribute
324 (any value from the standard configuration
325 file <filename>default.idx</filename> will do). Finally, any
<literal>text()</literal> node content recursively contained
inside the <literal>index</literal> element will be filtered through the
appropriate charmap for character normalization, and will be
inserted into the index.
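<para>
To illustrate the recursive handling of text nodes: if an indexing
stylesheet copies markup into the index element, a fragment such as
the following sketch would be expected to contribute all of its words
to the <literal>dc_title</literal> index, since only the contained
<literal>text()</literal> nodes matter. The <literal>title</literal>
and <literal>emph</literal> elements are hypothetical.
</para>
<screen><![CDATA[
<z:index name="dc_title" type="w">
  <!-- hypothetical nested markup; only its text nodes are indexed -->
  <title>World Congress on <emph>Superconductivity</emph></title>
</z:index>
]]></screen>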
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted
literally, byte for byte without any form of character normalization,
into the index named <literal>oai_identifier</literal>. The text
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted using the <literal>w</literal> character
normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
normalization the index will keep the individual words
<literal>kumar</literal>, <literal>krishen</literal>,
<literal>and</literal>, <literal>calvin</literal>,
<literal>burnham</literal>, and <literal>editors</literal>).
Finally, both the texts
<literal>Proceedings of the 4th International Conference and Exhibition:
World Congress on Superconductivity - Volume I</literal>
and
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
the same character normalization map <literal>w</literal>.
354 Finally, this example configuration can be queried using &pqf;
queries, either transported by &z3950; (here using a yaz-client)
358 Z> open localhost:9999
362 Z> f @attr 1=dc_creator Kumar
363 Z> scan @attr 1=dc_creator adam
365 Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
366 Z> scan @attr 1=dc_title abc
extensions <literal>x-pquery</literal> and
371 <literal>x-pScanClause</literal> to
375 http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
376 http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
379 See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
380 configuration, and <xref linkend="gfs-config"/> or the &yaz;
381 <ulink url="&url.yaz.cql;">&cql; section</ulink>
for the details of the &yaz; frontend server.
385 Notice that there are no <filename>*.abs</filename>,
386 <filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
filter configuration files involved in this process, and that the
388 literal index names are used during search and retrieval.
394 <section id="record-model-domxml-conf">
395 <title>&dom; Record Model Configuration</title>
398 <section id="record-model-domxml-index">
399 <title>&dom; Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process is synonymous
with writing an &xslt; stylesheet which produces &xml; output containing the
magic elements discussed in
<xref linkend="record-model-domxml-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our padawans on the right track to the good side of the force.
Stylesheets can be written in the <emphasis>pull</emphasis> or
the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
means that the output &xml; structure is taken as the starting point of
the internal structure of the &xslt; stylesheet, and portions of
the input &xml; are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output &xml; structure. On the other
hand, <emphasis>push</emphasis> &xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &xml; structure, and are triggered to produce some output &xml;
whenever certain conditions in the input &xml; are
met. The <emphasis>pull</emphasis> type is well suited for input
&xml; with strong and well-defined structure and semantics, like the
following &oai; indexing example, whereas the
<emphasis>push</emphasis> type might be the only possible way to
sort out deeply recursive input &xml; formats.
428 A <emphasis>pull</emphasis> stylesheet example used to index
429 &oai; harvested records could use some of the following template
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- match on oai xml record root -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <!-- you might want to use z:rank="{some XSLT function here}" -->
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- OAI indexing templates -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier" type="0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- DC specific indexing templates -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
478 that the names and types of the indexes can be defined in the
479 indexing &xslt; stylesheet <emphasis>dynamically according to
content in the original &xml; records</emphasis>, which opens up
opportunities for great power and wizardry as well as grand
The following excerpt of a <emphasis>push</emphasis> stylesheet
<emphasis>might</emphasis>
be a good idea if you have strict control of the &xml;
input format (thanks to rigorous checking against well-defined and
tight RelaxNG or &xml; Schemas, for example):
<xsl:template name="element-name-indexes">
  <z:index name="{name()}" type="w">
    <xsl:value-of select="'1'"/>
  </z:index>
</xsl:template>
This template creates indexes which have the name of the current
node of any input &xml; file, and assigns a '1' to the index.
502 <literal>find @attr 1=xyz 1</literal>
503 finds all files which contain at least one
<literal>xyz</literal> &xml; element. If you cannot control
which element names the input files contain, you are asking for
disaster and bad karma by using this technique.
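<para>
One way to keep name-derived indexes while bounding the set of index
names is to restrict the match pattern to a fixed list of elements.
The following is only a sketch under that assumption: the choice of
<literal>dc:title</literal>, <literal>dc:creator</literal>, and
<literal>dc:subject</literal>, as well as the <literal>dc_</literal>
prefix, are illustrative and not taken from the distributed examples.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai dc">

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push style, but only for a fixed list of element names;
       produces the index names dc_title, dc_creator, dc_subject -->
  <xsl:template match="dc:title | dc:creator | dc:subject">
    <z:index name="dc_{local-name()}" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
<para>
Because the pattern names every indexed element explicitly, unexpected
elements in the input cannot create new indexes.
</para>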
One variation on the theme of <emphasis>dynamically created
indexes</emphasis> is definitely unwise:
<!-- match on oai xml record root -->
<xsl:template match="/">
  <z:record z:type="update">

    <!-- create dynamic index name from input content -->
    <xsl:variable name="dynamic_content">
      <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
    </xsl:variable>

    <!-- create zillions of indexes with unknown names -->
    <z:index name="{$dynamic_content}" type="w">
      <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
    </z:index>
  </z:record>
</xsl:template>
Don't be tempted to cross
the line to the dark side of the force, padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
538 <section id="record-model-domxml-elementset">
539 <title>&dom; Exchange Formats</title>
An exchange format can be anything which can be the outcome of an
&xslt; transformation, as long as the stylesheet is registered in
the main &dom; &xslt; filter configuration file, see
<xref linkend="record-model-domxml-filter"/>.
In principle anything that can be expressed in &xml;, HTML, or
plain text can be the output of a <literal>schema</literal> or
<literal>element set</literal> directive during search, as long as
the information comes from the
<emphasis>original input record &xml; &dom; tree</emphasis>
(and not the transformed and <emphasis>indexed</emphasis> &xml;!).
553 In addition, internal administrative information from the &zebra;
554 indexer can be accessed during record retrieval. The following
555 example is a summary of the possibilities:
558 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
559 xmlns:z="http://indexdata.dk/zebra/xslt/1"
562 <!-- register internal zebra parameters -->
563 <xsl:param name="id" select="''"/>
564 <xsl:param name="filename" select="''"/>
565 <xsl:param name="score" select="''"/>
566 <xsl:param name="schema" select="''"/>
568 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
<!-- use them for display of internal information -->
571 <xsl:template match="/">
573 <id><xsl:value-of select="$id"/></id>
574 <filename><xsl:value-of select="$filename"/></filename>
575 <score><xsl:value-of select="$score"/></score>
576 <schema><xsl:value-of select="$schema"/></schema>
587 <section id="record-model-domxml-example">
588 <title>&dom; Filter &oai; Indexing Example</title>
The source code tarball contains a working &dom; filter example in
591 the directory <filename>examples/dom-oai/</filename>, which
592 should get you started.
More example data can be harvested from any &oai; compliant server,
596 see details at the &oai;
597 <ulink url="http://www.openarchives.org/">
598 http://www.openarchives.org/</ulink> web site, and the community
600 <ulink url="http://www.openarchives.org/community/index.html">
601 http://www.openarchives.org/community/index.html</ulink>.
604 <ulink url="http://www.oaforum.org/tutorial/">
605 http://www.oaforum.org/tutorial/</ulink>.
617 c) Main "dom" &xslt; filter config file:
618 cat db/filter_dom_conf.xml
620 <?xml version="1.0" encoding="UTF8"?>
622 <schema name="dom" stylesheet="db/dom2dom.xsl" />
623 <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
624 stylesheet="db/dom2index.xsl" />
625 <schema name="dc" stylesheet="db/dom2dc.xsl" />
626 <schema name="dc-short" stylesheet="db/dom2dc_short.xsl" />
627 <schema name="snippet" snippet="25" stylesheet="db/dom2snippet.xsl" />
628 <schema name="help" stylesheet="db/dom2help.xsl" />
632 the paths are relative to the directory where zebra.init is placed
635 The split level decides where the SAX parser shall split the
636 collections of records into individual records, which then are
637 loaded into &dom;, and have the indexing &xslt; stylesheet applied.
The indexing stylesheet is found by its identifier.
641 All the other stylesheets are for presentation after search.
643 - in data/ a short sample of harvested carnivorous plants
644 ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
646 - in root also one single data record - nice for testing the xslt
649 xsltproc db/dom2index.xsl carni*.xml
653 - in db/ a cql2pqf.txt yaz-client config file
654 which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
656 see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
658 - in db/ an indexing &xslt; stylesheet. This is a PULL-type XSLT thing,
659 as it constructs the new &xml; structure by pulling data out of the
660 respective elements/attributes of the old structure.
662 Notice the special zebra namespace, and the special elements in this
663 namespace which indicate to the zebra indexer what to do.
665 <z:record id="67ht7" rank="675" type="update">
666 indicates that a new record with given id and static rank has to be updated.
668 <z:index name="title" type="w">
669 encloses all the text/&xml; which shall be indexed in the index named
670 "title" and of index type "w" (see file default.idx in your zebra
682 <!-- Keep this comment at the end of the file
687 sgml-minimize-attributes:nil
688 sgml-always-quote-attributes:t
691 sgml-parent-document: "zebra.xml"
692 sgml-local-catalogs: nil
693 sgml-namecase-general:t