1 <chapter id="record-model-domxml">
2 <!-- $Id: recordmodel-domxml.xml,v 1.5 2007-02-21 12:29:52 marc Exp $ -->
3 <title>&dom; &xml; Record Model and Filter Module</title>
6 The record model described in this chapter applies to the fundamental,
8 record type <literal>dom</literal>, introduced in
9 <xref linkend="componentmodulesdom"/>. The &dom; &xml; record model
is experimental, and its inner workings might change in future
11 releases of the &zebra; Information Server.
16 <section id="record-model-domxml-filter">
17 <title>&dom; Record Filter Architecture</title>
20 The &dom; &xml; filter uses a standard &dom; &xml; structure as
21 internal data model, and can therefore parse, index, and display
any &xml; document type. It is well suited to work on
23 standardized &xml;-based formats such as Dublin Core, MODS, METS,
24 MARCXML, OAI-PMH, RSS, and performs equally well on any other
25 non-standard &xml; format.
A parser for binary &marc; records based on the ISO2709 library
standard is provided; it transforms these records to the internal
&marcxml; &dom; representation. Other binary document parsers
are planned to follow.
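<para>
As a minimal sketch, an <literal>input</literal> pipeline reading binary
&marc; records could look roughly as follows. The
<literal>marc</literal> element is taken from the (commented-out) line in
the configuration example later in this chapter; the enclosing
<literal>input</literal> element is an assumption derived from the
pipeline names in the overview table below, and the stylesheet name
<filename>marcxml2common.xsl</filename> is purely hypothetical.
</para>
<screen><![CDATA[
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <!-- parse binary ISO2709 MARC records encoded in MARC-8 -->
    <marc inputcharset="marc-8"/>
    <!-- hypothetical stylesheet: map MARCXML to a common XML format -->
    <xslt stylesheet="marcxml2common.xsl"/>
  </input>
</dom>
]]></screen>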
35 The &dom; filter architecture consists of four
different pipelines, each being a chain of arbitrarily many successive
37 &xslt; transformations of the internal &dom; &xml;
38 representations of documents.
41 <figure id="record-model-domxml-architecture-fig">
42 <title>&dom; &xml; filter architecture</title>
45 <imagedata fileref="domfilter.pdf" format="PDF" scale="50"/>
48 <imagedata fileref="domfilter.png" format="PNG"/>
51 <!-- Fall back if none of the images can be used -->
53 [Here there should be a diagram showing the &dom; &xml;
filter architecture, but it seems that your
55 tool chain has not been able to include the diagram in this
63 <table id="record-model-domxml-architecture-table" frame="top">
64 <title>&dom; &xml; filter pipelines overview</title>
70 <entry>Description</entry>
78 <entry><literal>input</literal></entry>
80 <entry>input parsing and initial
81 transformations to common &xml; format</entry>
82 <entry>raw &xml; record buffers, &xml; streams and
83 binary &marc; buffers</entry>
84 <entry>single &dom; &xml; documents suitable for indexing and
85 internal storage</entry>
88 <entry><literal>extract</literal></entry>
90 <entry>indexing term extraction
91 transformations</entry>
92 <entry>common single &dom; &xml; format</entry>
93 <entry>&zebra; internal indexing &dom; &xml; document</entry>
96 <entry><literal>store</literal></entry>
98 <entry> transformations before internal document
100 <entry>common single &dom; &xml; format</entry>
101 <entry>&zebra; internal storage &dom; &xml; document</entry>
104 <entry><literal>retrieve</literal></entry>
106 <entry>multiple document retrieve transformations from
107 storage to different output
108 formats are possible</entry>
109 <entry>&zebra; internal storage &dom; &xml; document</entry>
110 <entry>output &xml; syntax and requested format</entry>
The &dom; &xml; filter pipelines use &xslt; (and, if supported on
your platform, even &exslt;), and thus bring full &xpath;
support to the indexing, storage, and display rules of not only
&xml; documents but also binary &marc; records.
125 <section id="record-model-domxml-pipeline">
126 <title>&dom; &xml; filter pipeline configuration</title>
129 The experimental, loadable &dom; &xml;/&xslt; filter module
130 <literal>mod-dom.so</literal>
131 is invoked by the <filename>zebra.cfg</filename> configuration statement
133 recordtype.xml: dom.db/filter_dom_conf.xml
In this example, the filter is applied to all data files with suffix
<filename>*.xml</filename>, and the
&dom; &xslt; filter configuration file is found in the
path <filename>db/filter_dom_conf.xml</filename>.
141 <para>The &dom; &xslt; filter configuration file must be
142 valid &xml;. It might look like this:
<?xml version="1.0" encoding="UTF-8"?>
146 <dom xmlns="http://indexdata.com/zebra-2.0">
148 <xmlreader level="1"/>
149 <!-- <marc inputcharset="marc-8"/> -->
152 <xslt stylesheet="common2index.xsl"/>
155 <xslt stylesheet="common2store.xsl"/>
158 <xslt stylesheet="store2dc.xsl"/>
160 <retrieve name="mods">
161 <xslt stylesheet="store2mods.xsl"/>
All named stylesheets defined inside
<literal>schema</literal> element tags
are for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
unique; these are the literal <literal>schema</literal> or
<literal>element set</literal> names used in
<ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
<ulink url="&url.sru;">&sru;</ulink>, and
&z3950; protocol queries.
The paths in the <literal>stylesheet</literal> attributes
are relative to &zebra;'s working directory, or absolute to file
The <literal>&lt;split level="2"/&gt;</literal> element decides where the
&xml; Reader splits
collections of records into individual records, which are then
loaded into &dom; and have the indexing &xslt; stylesheet applied.
190 There must be exactly one indexing &xslt; stylesheet, which is
191 defined by the magic attribute
192 <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
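<para>
Putting the pieces together, a complete filter configuration with all
four pipelines might be laid out as in the following sketch. The
stylesheet file names are those of the configuration example above; the
nesting of the <literal>xslt</literal> steps inside
<literal>input</literal>, <literal>extract</literal>,
<literal>store</literal>, and <literal>retrieve</literal> elements is an
assumption inferred from the pipeline names in the overview table, so
verify it against your installed &zebra; version.
</para>
<screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <!-- input pipeline: split incoming XML into single records -->
  <input>
    <xmlreader level="1"/>
  </input>
  <!-- extract pipeline: produce the indexing document -->
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>
  <!-- store pipeline: produce the document kept in internal storage -->
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>
  <!-- retrieve pipelines: one per schema / element set name -->
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>
</dom>
]]></screen>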
195 <section id="record-model-domxml-internal">
196 <title>&dom; filter internal record representation</title>
<para>When indexing, an &xml; Reader is invoked to split the input
files into suitable record &xml; pieces. Each record piece is then
transformed to an &xml; &dom; structure, which is essentially the
record model. Only &xslt; transformations can be applied during
indexing, search, and retrieval. Consequently, output formats are
restricted to whatever &xslt; can deliver from the record &xml;
structure, be it other &xml; formats, HTML, or plain text. In case
you have <literal>libxslt1</literal> running with &exslt; support,
you can use this functionality inside the &dom;
filter configuration &xslt; stylesheets.
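<para>
As an illustration of &exslt; use inside an indexing stylesheet, the
following template fragment splits a semicolon-separated subject field
into one index entry per term. It is only a sketch: it assumes that your
<literal>libxslt</literal> build provides the &exslt; strings module,
and the input element <literal>dc:subject</literal> and the index name
<literal>dc_subject</literal> are chosen for illustration.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:str="http://exslt.org/strings"
    exclude-result-prefixes="dc str">

  <!-- one dc:subject may hold several terms separated by ';' -->
  <xsl:template match="dc:subject">
    <!-- str:tokenize() returns one node per term -->
    <xsl:for-each select="str:tokenize(., ';')">
      <z:index name="dc_subject" type="w">
        <xsl:value-of select="normalize-space(.)"/>
      </z:index>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
]]></screen>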
210 <section id="record-model-domxml-canonical">
211 <title>&dom; Canonical Indexing Format</title>
212 <para>The output of the indexing &xslt; stylesheets must contain
213 certain elements in the magic
214 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
215 namespace. The output of the &xslt; indexing transformation is then
216 parsed using &dom; methods, and the contained instructions are
217 performed on the <emphasis>magic elements and their
221 For example, the output of the command
223 xsltproc xsl/oai2index.xsl one-record.xml
225 might look like this:
227 <?xml version="1.0" encoding="UTF-8"?>
228 <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
229 z:id="oai:JTRS:CP-3290---Volume-I"
232 <z:index name="oai_identifier" type="0">
233 oai:JTRS:CP-3290---Volume-I</z:index>
234 <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
235 <z:index name="oai_setspec" type="0">jtrs</z:index>
236 <z:index name="dc_all" type="w">
237 <z:index name="dc_title" type="w">Proceedings of the 4th
238 International Conference and Exhibition:
239 World Congress on Superconductivity - Volume I</z:index>
240 <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
241 Burnham, Editors</z:index>
<para>This means the following: From the original &xml; file
<literal>one-record.xml</literal> (or from the &xml; record &dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &xml; record, which is defined by
the <literal>record</literal> element in the magic namespace
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
&zebra; uses the content of
<literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
record ID and, in case static ranking is set, the content of
<literal>z:rank="47896"</literal> as static rank. Following the
discussion in <xref linkend="administration-ranking"/>,
we see that this record is internally ordered
lexicographically according to the value of the string
<literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
The type of action performed during indexing is defined by
<literal>z:type="update"</literal>, with recognized values
<literal>insert</literal>, <literal>update</literal>, and
<literal>delete</literal>.
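<para>
The three record attributes can all be computed by the indexing
stylesheet. The following sketch derives the record ID from the &oai;
identifier and, purely for illustration, a numeric static rank from the
&oai; datestamp (for example <literal>2004-07-09</literal> becomes
<literal>20040709</literal>). Using the datestamp as rank is an
assumption made for this example, not a recommendation.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    exclude-result-prefixes="oai">

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <!-- z:id from the identifier, z:rank from the date with dashes removed -->
    <z:record
        z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
        z:rank="{translate(substring(normalize-space(
                  oai:record/oai:header/oai:datestamp), 1, 10), '-', '')}"
        z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
]]></screen>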
265 <para>In this example, the following literal indexes are constructed:
274 where the indexing type is defined in the
275 <literal>type</literal> attribute
276 (any value from the standard configuration
277 file <filename>default.idx</filename> will do). Finally, any
278 <literal>text()</literal> node content recursively contained
279 inside the <literal>index</literal> will be filtered through the
280 appropriate charmap for character normalization, and will be
281 inserted in the index.
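<para>
Because text is collected recursively, index elements can be nested, as
the <literal>dc_all</literal> element in the output above suggests. In
the following hand-written sketch (record ID, index names, and content
are invented for illustration), each piece of text would end up both in
its own index and in the enclosing <literal>any</literal> index.
</para>
<screen><![CDATA[
<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
          z:id="example-1" z:type="update">
  <!-- type "0": inserted literally, without normalization -->
  <z:index name="identifier" type="0">example-1</z:index>
  <!-- type "w": all nested text is also indexed under "any" -->
  <z:index name="any" type="w">
    <z:index name="title" type="w">Carnivorous plants</z:index>
    <z:index name="creator" type="w">A. Nonymous</z:index>
  </z:index>
</z:record>
]]></screen>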
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted
literally, byte for byte and without any form of character normalization,
into the index named <literal>oai_identifier</literal>; the text
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted using the <literal>w</literal> character
normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
normalization the index will keep the individual words
<literal>kumar</literal>, <literal>krishen</literal>,
<literal>and</literal>, <literal>calvin</literal>,
<literal>burnham</literal>, and <literal>editors</literal>), and
finally both the texts
<literal>Proceedings of the 4th International Conference and Exhibition:
World Congress on Superconductivity - Volume I</literal>
and
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
the same character normalization map <literal>w</literal>.
Finally, this example configuration can be queried using &pqf;
queries, either transported by &z3950; (here using a yaz-client)
310 Z> open localhost:9999
314 Z> f @attr 1=dc_creator Kumar
315 Z> scan @attr 1=dc_creator adam
317 Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
318 Z> scan @attr 1=dc_title abc
extensions <literal>x-pquery</literal> and
323 <literal>x-pScanClause</literal> to
327 http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
328 http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
331 See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
332 configuration, and <xref linkend="gfs-config"/> or the &yaz;
333 <ulink url="&url.yaz.cql;">&cql; section</ulink>
for the details of the &yaz; frontend server.
337 Notice that there are no <filename>*.abs</filename>,
338 <filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
filter configuration files involved in this process, and that the
340 literal index names are used during search and retrieval.
346 <section id="record-model-domxml-conf">
347 <title>&dom; Record Model Configuration</title>
350 <section id="record-model-domxml-index">
351 <title>&dom; Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process amounts to
writing an &xslt; stylesheet which produces &xml; output containing the
magic elements discussed in
<xref linkend="record-model-domxml-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our padawans on the right track to the good side of the force.
Stylesheets can be written in the <emphasis>pull</emphasis> or
the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
means that the output &xml; structure is taken as the starting point of
the internal structure of the &xslt; stylesheet, and portions of
the input &xml; are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output &xml; structure. On the other
hand, <emphasis>push</emphasis> &xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &xml; structure, and which produces output &xml;
whenever certain conditions in the input &xml; are
met. The <emphasis>pull</emphasis> type is well suited for input
&xml; with strong and well-defined structure and semantics, like the
following &oai; indexing example, whereas the
<emphasis>push</emphasis> type might be the only possible way to
sort out deeply recursive input &xml; formats.
380 A <emphasis>pull</emphasis> stylesheet example used to index
381 &oai; harvested records could use some of the following template
385 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
386 xmlns:z="http://indexdata.dk/zebra/xslt/1"
xmlns:oai="http://www.openarchives.org/OAI/2.0/"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
389 xmlns:dc="http://purl.org/dc/elements/1.1/"
392 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
394 <!-- disable all default text node output -->
395 <xsl:template match="text()"/>
397 <!-- match on oai xml record root -->
398 <xsl:template match="/">
399 <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
<!-- you might want to use z:rank="{some XSLT function here}" -->
402 <xsl:apply-templates/>
<!-- OAI indexing templates -->
407 <xsl:template match="oai:record/oai:header/oai:identifier">
408 <z:index name="oai_identifier" type="0">
409 <xsl:value-of select="."/>
415 <!-- DC specific indexing templates -->
416 <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
417 <z:index name="dc_title" type="w">
418 <xsl:value-of select="."/>
that the names and types of the indexes can be defined in the
indexing &xslt; stylesheet <emphasis>dynamically according to
content in the original &xml; records</emphasis>, which offers
opportunities for great power and wizardry as well as grand
The following excerpt of a <emphasis>push</emphasis> stylesheet
<emphasis>might</emphasis>
be a good idea if you have strict control over the &xml;
input format (due to rigorous checking against well-defined and
tight RelaxNG or &xml; Schemas, for example):
444 <xsl:template name="element-name-indexes">
445 <z:index name="{name()}" type="w">
446 <xsl:value-of select="'1'"/>
451 This template creates indexes which have the name of the working
452 node of any input &xml; file, and assigns a '1' to the index.
454 <literal>find @attr 1=xyz 1</literal>
455 finds all files which contain at least one
<literal>xyz</literal> &xml; element. If you cannot control
which element names the input files contain, you are asking for
disaster and bad karma by using this technique.
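<para>
For context, the named template above has to be called from somewhere.
A minimal, self-contained <emphasis>push</emphasis> stylesheet built
around it might look like the following sketch. It assumes that the root
element of each input record carries an <literal>id</literal> attribute
(a hypothetical input feature), and it uses
<literal>local-name()</literal> instead of <literal>name()</literal> so
that namespace prefixes and colons do not end up in index names.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- wrap the whole result in one indexing record -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(/*/@id)}" z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push style: every element met during recursion emits an index -->
  <xsl:template match="*">
    <xsl:call-template name="element-name-indexes"/>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- index named after the current element, holding the constant '1' -->
  <xsl:template name="element-name-indexes">
    <z:index name="{local-name()}" type="w">
      <xsl:value-of select="'1'"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>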
One variation on the theme of <emphasis>dynamically created
indexes</emphasis> is definitely unwise:
465 <!-- match on oai xml record root -->
466 <xsl:template match="/">
467 <z:record z:type="update">
469 <!-- create dynamic index name from input content -->
470 <xsl:variable name="dynamic_content">
471 <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
474 <!-- create zillions of indexes with unknown names -->
475 <z:index name="{$dynamic_content}" type="w">
476 <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
Don't be tempted to cross
the line to the dark side of the force, padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
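<para>
A safer variant of the unwise example keeps the index names fixed and
puts the variable content into the index values instead. The following
sketch reuses the index names <literal>oai_identifier</literal> and
<literal>dc_all</literal> from the earlier example output; it is one
possible arrangement, not the only one.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    exclude-result-prefixes="oai oai_dc">

  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <!-- fixed index name; the identifier is the indexed value -->
      <z:index name="oai_identifier" type="0">
        <xsl:value-of
            select="normalize-space(oai:record/oai:header/oai:identifier)"/>
      </z:index>
      <!-- fixed index name for all Dublin Core text -->
      <z:index name="dc_all" type="w">
        <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
      </z:index>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
]]></screen>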
490 <section id="record-model-domxml-elementset">
491 <title>&dom; Exchange Formats</title>
An exchange format can be anything which can be the outcome of an
&xslt; transformation, as long as the stylesheet is registered in
the main &dom; &xslt; filter configuration file, see
<xref linkend="record-model-domxml-filter"/>.
In principle, anything that can be expressed in &xml;, HTML, or
plain text can be the output of a <literal>schema</literal> or
<literal>element set</literal> directive during search, as long as
the information comes from the
<emphasis>original input record &xml; &dom; tree</emphasis>
(and not the transformed and <emphasis>indexed</emphasis> &xml;!).
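<para>
As a small illustration, a retrieval stylesheet that returns only the
Dublin Core part of a stored &oai; record could be sketched as follows.
It assumes that the stored document is the unchanged &oai; record, which
depends on what your storage transformation does.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    exclude-result-prefixes="oai">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- copy the oai_dc:dc subtree of the record to the output as is -->
  <xsl:template match="/">
    <xsl:copy-of select="oai:record/oai:metadata/oai_dc:dc"/>
  </xsl:template>

</xsl:stylesheet>
]]></screen>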
505 In addition, internal administrative information from the &zebra;
506 indexer can be accessed during record retrieval. The following
507 example is a summary of the possibilities:
510 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
511 xmlns:z="http://indexdata.dk/zebra/xslt/1"
514 <!-- register internal zebra parameters -->
515 <xsl:param name="id" select="''"/>
516 <xsl:param name="filename" select="''"/>
517 <xsl:param name="score" select="''"/>
518 <xsl:param name="schema" select="''"/>
520 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
<!-- use them for display of internal information -->
523 <xsl:template match="/">
525 <id><xsl:value-of select="$id"/></id>
526 <filename><xsl:value-of select="$filename"/></filename>
527 <score><xsl:value-of select="$score"/></score>
528 <schema><xsl:value-of select="$schema"/></schema>
539 <section id="record-model-domxml-example">
540 <title>&dom; Filter &oai; Indexing Example</title>
The source code tarball contains a working &dom; filter example in
543 the directory <filename>examples/dom-oai/</filename>, which
544 should get you started.
More example data can be harvested from any &oai; compliant server,
548 see details at the &oai;
549 <ulink url="http://www.openarchives.org/">
550 http://www.openarchives.org/</ulink> web site, and the community
552 <ulink url="http://www.openarchives.org/community/index.html">
553 http://www.openarchives.org/community/index.html</ulink>.
556 <ulink url="http://www.oaforum.org/tutorial/">
557 http://www.oaforum.org/tutorial/</ulink>.
569 c) Main "dom" &xslt; filter config file:
570 cat db/filter_dom_conf.xml
572 <?xml version="1.0" encoding="UTF8"?>
574 <schema name="dom" stylesheet="db/dom2dom.xsl" />
575 <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
576 stylesheet="db/dom2index.xsl" />
577 <schema name="dc" stylesheet="db/dom2dc.xsl" />
578 <schema name="dc-short" stylesheet="db/dom2dc_short.xsl" />
579 <schema name="snippet" snippet="25" stylesheet="db/dom2snippet.xsl" />
580 <schema name="help" stylesheet="db/dom2help.xsl" />
584 the paths are relative to the directory where zebra.init is placed
587 The split level decides where the SAX parser shall split the
588 collections of records into individual records, which then are
589 loaded into &dom;, and have the indexing &xslt; stylesheet applied.
The indexing stylesheet is found by its identifier.
593 All the other stylesheets are for presentation after search.
595 - in data/ a short sample of harvested carnivorous plants
596 ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
598 - in root also one single data record - nice for testing the xslt
601 xsltproc db/dom2index.xsl carni*.xml
605 - in db/ a cql2pqf.txt yaz-client config file
606 which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
608 see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
610 - in db/ an indexing &xslt; stylesheet. This is a PULL-type XSLT thing,
611 as it constructs the new &xml; structure by pulling data out of the
612 respective elements/attributes of the old structure.
614 Notice the special zebra namespace, and the special elements in this
615 namespace which indicate to the zebra indexer what to do.
617 <z:record id="67ht7" rank="675" type="update">
618 indicates that a new record with given id and static rank has to be updated.
620 <z:index name="title" type="w">
621 encloses all the text/&xml; which shall be indexed in the index named
622 "title" and of index type "w" (see file default.idx in your zebra
634 <!-- Keep this comment at the end of the file
639 sgml-minimize-attributes:nil
640 sgml-always-quote-attributes:t
643 sgml-parent-document: "zebra.xml"
644 sgml-local-catalogs: nil
645 sgml-namecase-general:t