1 <chapter id="record-model-domxml">
2 <!-- $Id: recordmodel-domxml.xml,v 1.7 2007-02-21 14:15:07 marc Exp $ -->
3 <title>&dom; &xml; Record Model and Filter Module</title>
The record model described in this chapter applies to the fundamental
record type <literal>dom</literal>, introduced in
<xref linkend="componentmodulesdom"/>. The &dom; &xml; record model
is experimental, and its inner workings might change in future
releases of the &zebra; Information Server.
16 <section id="record-model-domxml-filter">
17 <title>&dom; Record Filter Architecture</title>
20 The &dom; &xml; filter uses a standard &dom; &xml; structure as
21 internal data model, and can therefore parse, index, and display
any &xml; document type. It is well suited to work on
23 standardized &xml;-based formats such as Dublin Core, MODS, METS,
24 MARCXML, OAI-PMH, RSS, and performs equally well on any other
25 non-standard &xml; format.
28 A parser for binary &marc; records based on the ISO2709 library
standard is provided; it transforms these to the internal
30 &marcxml; &dom; representation. Other binary document parsers
31 are planned to follow.
35 The &dom; filter architecture consists of four
different pipelines, each being a chain of arbitrarily many successive
37 &xslt; transformations of the internal &dom; &xml;
38 representations of documents.
41 <figure id="record-model-domxml-architecture-fig">
42 <title>&dom; &xml; filter architecture</title>
45 <imagedata fileref="domfilter.pdf" format="PDF" scale="50"/>
48 <imagedata fileref="domfilter.png" format="PNG"/>
51 <!-- Fall back if none of the images can be used -->
53 [Here there should be a diagram showing the &dom; &xml;
filter architecture, but it seems that your
55 tool chain has not been able to include the diagram in this
63 <table id="record-model-domxml-architecture-table" frame="top">
64 <title>&dom; &xml; filter pipelines overview</title>
70 <entry>Description</entry>
78 <entry><literal>input</literal></entry>
80 <entry>input parsing and initial
81 transformations to common &xml; format</entry>
82 <entry>Input raw &xml; record buffers, &xml; streams and
83 binary &marc; buffers</entry>
84 <entry>Common &xml; &dom;</entry>
87 <entry><literal>extract</literal></entry>
89 <entry>indexing term extraction
90 transformations</entry>
91 <entry>Common &xml; &dom;</entry>
92 <entry>Indexing &xml; &dom;</entry>
95 <entry><literal>store</literal></entry>
<entry>transformations before internal document storage</entry>
99 <entry>Common &xml; &dom;</entry>
100 <entry>Storage &xml; &dom;</entry>
103 <entry><literal>retrieve</literal></entry>
105 <entry>multiple document retrieve transformations from
106 storage to different output
107 formats are possible</entry>
108 <entry>Storage &xml; &dom;</entry>
109 <entry>Output &xml; syntax in requested formats</entry>
The &dom; &xml; filter pipelines use &xslt; (and, if supported on
your platform, even &exslt;) and thus bring full &xpath;
support to the indexing, storage, and display rules of not only
&xml; documents but also binary &marc; records.
124 <section id="record-model-domxml-pipeline">
125 <title>&dom; &xml; filter pipeline configuration</title>
128 The experimental, loadable &dom; &xml;/&xslt; filter module
129 <literal>mod-dom.so</literal>
130 is invoked by the <filename>zebra.cfg</filename> configuration statement
132 recordtype.xml: dom.db/filter_dom_conf.xml
134 In this example the &dom; &xml; filter is configured to work
135 on all data files with suffix
136 <filename>*.xml</filename>, where the configuration file is found in the
137 path <filename>db/filter_dom_conf.xml</filename>.
140 <para>The &dom; &xslt; filter configuration file must be
141 valid &xml;. It might look like this:
<?xml version="1.0" encoding="UTF-8"?>
145 <dom xmlns="http://indexdata.com/zebra-2.0">
147 <xmlreader level="1"/>
148 <!-- <marc inputcharset="marc-8"/> -->
151 <xslt stylesheet="common2index.xsl"/>
154 <xslt stylesheet="common2store.xsl"/>
157 <xslt stylesheet="store2dc.xsl"/>
159 <retrieve name="mods">
160 <xslt stylesheet="store2mods.xsl"/>
The root &xml; element <literal>&lt;dom&gt;</literal> and all other &dom;
&xml; filter elements reside in the namespace
<literal>http://indexdata.com/zebra-2.0</literal>.
All pipeline definition elements, i.e. the
<literal>&lt;input&gt;</literal>,
<literal>&lt;extract&gt;</literal>,
<literal>&lt;store&gt;</literal>, and
<literal>&lt;retrieve&gt;</literal> elements, are optional.
Missing pipeline definitions are simply interpreted as
do-nothing identity pipelines.
All pipeline definition elements may contain zero or more
<literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
&xslt; transformation instructions, which are performed
sequentially from top to bottom.
The paths in the <literal>stylesheet</literal> attributes
are relative to &zebra;'s working directory, or absolute from the file
system root.
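<para>
The rules above can be combined as in the following minimal sketch.
The stylesheet file names are hypothetical, and the omitted pipelines
are assumed to behave as the identity pipelines described above. The
<literal>&lt;extract&gt;</literal> pipeline chains two stylesheets,
which run from top to bottom, so the second one works on the output
of the first.
</para>
<screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <xmlreader level="1"/>
  </input>
  <extract>
    <!-- runs first: hypothetical clean-up step -->
    <xslt stylesheet="db/cleanup.xsl"/>
    <!-- runs second, on the output of the first step -->
    <xslt stylesheet="db/common2index.xsl"/>
  </extract>
  <!-- no store or retrieve pipelines: assumed to act as identity pipelines -->
</dom>
]]></screen>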
191 <section id="record-model-domxml-pipeline-input">
192 <title>Input pipeline</title>
The <literal>&lt;input&gt;</literal> pipeline definition element
may contain either one &xml; Reader definition
<literal><![CDATA[<xmlreader level="1"/>]]></literal>, used to split
an &xml; collection input stream into individual &xml; &dom;
documents at the prescribed element level, or one &marc; binary parser definition
201 <literal><![CDATA[<marc inputcharset="marc-8"/>]]></literal>, which defines
202 a conversion to &marcxml; format &dom; trees. The allowed values
203 of the <literal>inputcharset</literal> attribute depend on your
204 local <productname>iconv</productname> set-up.
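<para>
The two alternatives might be written as in the following sketch.
It assumes that the root element of an input file sits at level 0,
so that <literal>level="1"</literal> selects the children of the root
element as individual records; please verify this against your &zebra;
version. The stylesheet file names are hypothetical.
</para>
<screen><![CDATA[
<!-- alternative 1: split an XML collection file into single records -->
<input>
  <xmlreader level="1"/>
  <xslt stylesheet="db/input2common.xsl"/>
</input>

<!-- alternative 2: read binary MARC records encoded in MARC-8 -->
<input>
  <marc inputcharset="marc-8"/>
  <xslt stylesheet="db/marcxml2common.xsl"/>
</input>
]]></screen>
<para>
Under that assumption, alternative 1 applied to an input file of the form
<literal>&lt;collection&gt;&lt;record/&gt;&lt;record/&gt;&lt;/collection&gt;</literal>
would yield one &dom; document per <literal>&lt;record&gt;</literal> element.
</para>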
207 Both input parsers deliver individual &dom; &xml; documents to the
208 following chain of zero or more
209 <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
210 &xslt; transformations. At the end of this pipeline, the documents
211 are in the common format, used to feed both the
<literal>&lt;extract&gt;</literal> and
<literal>&lt;store&gt;</literal> pipelines.
217 <section id="record-model-domxml-pipeline-extract">
218 <title>Extract pipeline</title>
The <literal>&lt;extract&gt;</literal> pipeline takes documents
from any common &dom; &xml; format to the &zebra; specific
indexing &dom; &xml; format.
It may consist of zero or more
<literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
&xslt; transformations, and the outcome is handed to the
&zebra; core to drive the process of building the inverted
indexes. See
<xref linkend="record-model-domxml-canonical-index"/> for
details.
233 <section id="record-model-domxml-pipeline-store">
234 <title>Store pipeline</title>
The <literal>&lt;store&gt;</literal> pipeline takes documents
from any common &dom; &xml; format to the &zebra; specific
storage &dom; &xml; format.
It may consist of zero or more
<literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
&xslt; transformations, and the outcome is handed to the
&zebra; core for deposition into the internal storage system.
244 <section id="record-model-domxml-pipeline-retrieve">
245 <title>Retrieve pipeline</title>
247 Finally, there may be one or more
<literal>&lt;retrieve&gt;</literal> pipeline definitions, each
249 of them again consisting of zero or more
250 <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
251 &xslt; transformations. These are used for document
252 presentation after search, and take the internal storage &dom;
&xml; to the requested output formats during record present requests.
Multiple
<literal>&lt;retrieve&gt;</literal> pipeline definitions
are distinguished by their unique <literal>name</literal>
attributes; these are the literal <literal>schema</literal> or
261 <literal>element set</literal> names used in
262 <ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
263 <ulink url="&url.sru;">&sru;</ulink> and
264 &z3950; protocol queries.
269 <section id="record-model-domxml-canonical-index">
270 <title>Canonical Indexing Format</title>
271 <para>The output of the indexing &xslt; stylesheets must contain
272 certain elements in the magic
273 <literal>xmlns:z="http://indexdata.dk/zebra-2.0"</literal>
274 namespace. The output of the &xslt; indexing transformation is then
275 parsed using &dom; methods, and the contained instructions are
276 performed on the <emphasis>magic elements and their
280 For example, the output of the command
282 xsltproc xsl/oai2index.xsl one-record.xml
284 might look like this:
286 <?xml version="1.0" encoding="UTF-8"?>
287 <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
288 z:id="oai:JTRS:CP-3290---Volume-I"
291 <z:index name="oai_identifier" type="0">
292 oai:JTRS:CP-3290---Volume-I</z:index>
293 <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
294 <z:index name="oai_setspec" type="0">jtrs</z:index>
295 <z:index name="dc_all" type="w">
296 <z:index name="dc_title" type="w">Proceedings of the 4th
297 International Conference and Exhibition:
298 World Congress on Superconductivity - Volume I</z:index>
299 <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
300 Burnham, Editors</z:index>
<para>This means the following: From the original &xml; file
<literal>one-record.xml</literal> (or from the &xml; record &dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &xml; record, which is defined by
the <literal>record</literal> element in the magic namespace
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
&zebra; uses the content of
<literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
record ID, and, in case static ranking is set, the content of
<literal>z:rank="47896"</literal> as static rank. Following the
discussion in <xref linkend="administration-ranking"/>,
we see that this record is internally ordered
lexicographically according to the value of the string
<literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
The type of action performed during indexing is defined by
<literal>z:type="update"</literal>, with recognized values
<literal>insert</literal>, <literal>update</literal>, and
<literal>delete</literal>.
324 <para>In this example, the following literal indexes are constructed:
333 where the indexing type is defined in the
334 <literal>type</literal> attribute
335 (any value from the standard configuration
336 file <filename>default.idx</filename> will do). Finally, any
337 <literal>text()</literal> node content recursively contained
338 inside the <literal>index</literal> will be filtered through the
339 appropriate charmap for character normalization, and will be
340 inserted in the index.
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted
literally, byte for byte without any form of character normalization,
into the index named <literal>oai_identifier</literal>,
the text
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted using the <literal>w</literal> character
normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
normalization the index will keep the individual words
<literal>kumar</literal>, <literal>krishen</literal>,
<literal>and</literal>, <literal>calvin</literal>,
<literal>burnham</literal>, and <literal>editors</literal>), and
finally both the texts
<literal>Proceedings of the 4th International Conference and Exhibition:
World Congress on Superconductivity - Volume I</literal>
and
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
the same character normalization map <literal>w</literal>.
365 Finally, this example configuration can be queried using &pqf;
queries, either transported by &z3950; (here using a yaz-client)
369 Z> open localhost:9999
373 Z> f @attr 1=dc_creator Kumar
374 Z> scan @attr 1=dc_creator adam
376 Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
377 Z> scan @attr 1=dc_title abc
extensions <literal>x-pquery</literal> and
382 <literal>x-pScanClause</literal> to
386 http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
387 http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
390 See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
391 configuration, and <xref linkend="gfs-config"/> or the &yaz;
392 <ulink url="&url.yaz.cql;">&cql; section</ulink>
for the details of the &yaz; frontend server.
396 Notice that there are no <filename>*.abs</filename>,
397 <filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
filter configuration files involved in this process, and that the
399 literal index names are used during search and retrieval.
405 <section id="record-model-domxml-conf">
406 <title>&dom; Record Model Configuration</title>
409 <section id="record-model-domxml-index">
410 <title>&dom; Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process is synonymous
with writing an &xslt; stylesheet which produces &xml; output containing the
magic elements discussed in
<xref linkend="record-model-domxml-canonical-index"/>.
Obviously, there are a million different ways to accomplish this
task, and some comments and code snippets are in order to lead
our padawans on the right track to the good side of the force.
422 Stylesheets can be written in the <emphasis>pull</emphasis> or
423 the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
424 means that the output &xml; structure is taken as starting point of
425 the internal structure of the &xslt; stylesheet, and portions of
426 the input &xml; are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output &xml; structure. On the other
hand, <emphasis>push</emphasis> &xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &xml; structure, and awake to produce some output &xml;
whenever certain conditions in the input &xml; are
met. The <emphasis>pull</emphasis> type is well suited for input
&xml; with strong and well-defined structure and semantics, like the
434 following &oai; indexing example, whereas the
435 <emphasis>push</emphasis> type might be the only possible way to
436 sort out deeply recursive input &xml; formats.
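<para>
To make the distinction concrete, a strictly
<emphasis>pull</emphasis>-style stylesheet might look like the
following sketch. It is an illustration only: the choice of the two
Dublin Core fields is an assumption, and the <literal>z</literal>
namespace URI is simply copied from the example output shown earlier
and must match whatever your &zebra; version expects. A single template
lays out the output record and pulls values from fixed places in the
input. Since <literal>xsl:value-of</literal> in &xslt; 1.0 emits only
the first node of a node set, <literal>xsl:for-each</literal> is used
to cover repeated elements.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai oai_dc dc">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- the output structure is written down once; input is pulled into it -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <!-- one index element per title -->
      <xsl:for-each select="oai:record/oai:metadata/oai_dc:dc/dc:title">
        <z:index name="dc_title" type="w">
          <xsl:value-of select="."/>
        </z:index>
      </xsl:for-each>
      <!-- one index element per creator -->
      <xsl:for-each select="oai:record/oai:metadata/oai_dc:dc/dc:creator">
        <z:index name="dc_creator" type="w">
          <xsl:value-of select="."/>
        </z:index>
      </xsl:for-each>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
]]></screen>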
439 A <emphasis>pull</emphasis> stylesheet example used to index
&oai; harvested records could use some of the following template definitions:
444 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
445 xmlns:z="http://indexdata.dk/zebra/xslt/1"
xmlns:oai="http://www.openarchives.org/OAI/2.0/"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
448 xmlns:dc="http://purl.org/dc/elements/1.1/"
451 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
453 <!-- disable all default text node output -->
454 <xsl:template match="text()"/>
456 <!-- match on oai xml record root -->
457 <xsl:template match="/">
458 <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
<!-- you might want to use z:rank="{some XSLT function here}" -->
461 <xsl:apply-templates/>
<!-- OAI indexing templates -->
466 <xsl:template match="oai:record/oai:header/oai:identifier">
467 <z:index name="oai_identifier" type="0">
468 <xsl:value-of select="."/>
474 <!-- DC specific indexing templates -->
475 <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
476 <z:index name="dc_title" type="w">
477 <xsl:value-of select="."/>
489 that the names and types of the indexes can be defined in the
490 indexing &xslt; stylesheet <emphasis>dynamically according to
content in the original &xml; records</emphasis>, which offers
opportunities for great power and wizardry as well as grand
496 The following excerpt of a <emphasis>push</emphasis> stylesheet
497 <emphasis>might</emphasis>
be a good idea, provided you have strict control of the &xml;
input format (due to rigorous checking against well-defined and
tight RelaxNG or &xml; Schemas, for example):
503 <xsl:template name="element-name-indexes">
504 <z:index name="{name()}" type="w">
505 <xsl:value-of select="'1'"/>
510 This template creates indexes which have the name of the working
511 node of any input &xml; file, and assigns a '1' to the index.
513 <literal>find @attr 1=xyz 1</literal>
514 finds all files which contain at least one
<literal>xyz</literal> &xml; element. If you cannot control
which element names the input files contain, you are asking for
disaster and bad karma by using this technique.
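<para>
A named template does nothing until it is called. The following
sketch shows one way the excerpt could be wired into a complete
<emphasis>push</emphasis> stylesheet. It is not taken from the &zebra;
distribution: the <literal>z</literal> namespace URI is copied from the
examples above, and <literal>local-name()</literal> is used instead of
<literal>name()</literal>, an assumption made here to keep namespace
prefixes and colons out of the index names.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- wrap everything produced below in one indexing record -->
  <xsl:template match="/">
    <z:record z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push step: fires for every element, then recurses into its children -->
  <xsl:template match="*">
    <xsl:call-template name="element-name-indexes"/>
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- emits a marker term '1' into an index named after the current element -->
  <xsl:template name="element-name-indexes">
    <z:index name="{local-name()}" type="w">
      <xsl:value-of select="'1'"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
<para>
Restricting the <literal>match</literal> pattern to a known list of
element names, instead of <literal>*</literal>, keeps the set of
indexes bounded even when the input is not fully under your control.
</para>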
One variation on the theme <emphasis>dynamically created
521 indexes</emphasis> will definitely be unwise:
524 <!-- match on oai xml record root -->
525 <xsl:template match="/">
526 <z:record z:type="update">
528 <!-- create dynamic index name from input content -->
529 <xsl:variable name="dynamic_content">
530 <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
533 <!-- create zillions of indexes with unknown names -->
534 <z:index name="{$dynamic_content}" type="w">
535 <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
542 Don't be tempted to cross
the line to the dark side of the force, padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
549 <section id="record-model-domxml-elementset">
550 <title>&dom; Exchange Formats</title>
552 An exchange format can be anything which can be the outcome of an
&xslt; transformation, as long as the stylesheet is registered in
554 the main &dom; &xslt; filter configuration file, see
555 <xref linkend="record-model-domxml-filter"/>.
556 In principle anything that can be expressed in &xml;, HTML, and
557 TEXT can be the output of a <literal>schema</literal> or
558 <literal>element set</literal> directive during search, as long as
559 the information comes from the
560 <emphasis>original input record &xml; &dom; tree</emphasis>
(and not the transformed and <emphasis>indexed</emphasis> &xml;!).
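<para>
As an illustration of a non-&xml; exchange format, the following
sketch produces plain text. It assumes that the stored record still
contains Dublin Core elements in the
<literal>http://purl.org/dc/elements/1.1/</literal> namespace; the
output layout and the file name mentioned below are hypothetical.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- emit text instead of XML markup -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <!-- one output line per title found anywhere in the record -->
    <xsl:for-each select="//dc:title">
      <xsl:text>Title: </xsl:text>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
    <!-- one output line per creator -->
    <xsl:for-each select="//dc:creator">
      <xsl:text>Creator: </xsl:text>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
<para>
Saved as, say, <filename>db/store2txt.xsl</filename>, such a
stylesheet would be registered inside a
<literal><![CDATA[<retrieve name="txt">]]></literal> pipeline
definition and requested by that name.
</para>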
564 In addition, internal administrative information from the &zebra;
565 indexer can be accessed during record retrieval. The following
566 example is a summary of the possibilities:
569 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
570 xmlns:z="http://indexdata.dk/zebra/xslt/1"
573 <!-- register internal zebra parameters -->
574 <xsl:param name="id" select="''"/>
575 <xsl:param name="filename" select="''"/>
576 <xsl:param name="score" select="''"/>
577 <xsl:param name="schema" select="''"/>
579 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
<!-- use them for display of internal information -->
582 <xsl:template match="/">
584 <id><xsl:value-of select="$id"/></id>
585 <filename><xsl:value-of select="$filename"/></filename>
586 <score><xsl:value-of select="$score"/></score>
587 <schema><xsl:value-of select="$schema"/></schema>
598 <section id="record-model-domxml-example">
599 <title>&dom; Filter &oai; Indexing Example</title>
The source code tarball contains a working &dom; filter example in
602 the directory <filename>examples/dom-oai/</filename>, which
603 should get you started.
More example data can be harvested from any &oai; compliant server,
607 see details at the &oai;
608 <ulink url="http://www.openarchives.org/">
609 http://www.openarchives.org/</ulink> web site, and the community
611 <ulink url="http://www.openarchives.org/community/index.html">
612 http://www.openarchives.org/community/index.html</ulink>.
615 <ulink url="http://www.oaforum.org/tutorial/">
616 http://www.oaforum.org/tutorial/</ulink>.
628 c) Main "dom" &xslt; filter config file:
629 cat db/filter_dom_conf.xml
<?xml version="1.0" encoding="UTF-8"?>
633 <schema name="dom" stylesheet="db/dom2dom.xsl" />
634 <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
635 stylesheet="db/dom2index.xsl" />
636 <schema name="dc" stylesheet="db/dom2dc.xsl" />
637 <schema name="dc-short" stylesheet="db/dom2dc_short.xsl" />
638 <schema name="snippet" snippet="25" stylesheet="db/dom2snippet.xsl" />
639 <schema name="help" stylesheet="db/dom2help.xsl" />
643 the paths are relative to the directory where zebra.init is placed
646 The split level decides where the SAX parser shall split the
647 collections of records into individual records, which then are
648 loaded into &dom;, and have the indexing &xslt; stylesheet applied.
The indexing stylesheet is found by its identifier.
652 All the other stylesheets are for presentation after search.
654 - in data/ a short sample of harvested carnivorous plants
655 ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
657 - in root also one single data record - nice for testing the xslt
660 xsltproc db/dom2index.xsl carni*.xml
664 - in db/ a cql2pqf.txt yaz-client config file
665 which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
667 see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
669 - in db/ an indexing &xslt; stylesheet. This is a PULL-type XSLT thing,
670 as it constructs the new &xml; structure by pulling data out of the
671 respective elements/attributes of the old structure.
673 Notice the special zebra namespace, and the special elements in this
674 namespace which indicate to the zebra indexer what to do.
676 <z:record id="67ht7" rank="675" type="update">
677 indicates that a new record with given id and static rank has to be updated.
679 <z:index name="title" type="w">
680 encloses all the text/&xml; which shall be indexed in the index named
681 "title" and of index type "w" (see file default.idx in your zebra
693 <!-- Keep this comment at the end of the file
698 sgml-minimize-attributes:nil
699 sgml-always-quote-attributes:t
702 sgml-parent-document: "zebra.xml"
703 sgml-local-catalogs: nil
704 sgml-namecase-general:t