1 <chapter id="record-model-domxml">
2 <!-- $Id: recordmodel-domxml.xml,v 1.3 2007-02-20 14:57:00 marc Exp $ -->
3 <title>&dom; &xml; Record Model and Filter Module</title>
6 The record model described in this chapter applies to the fundamental,
8 record type <literal>dom</literal>, introduced in
9 <xref linkend="componentmodulesdom"/>. The &dom; &xml; record model
10 is experimental, and its inner workings might change in future
11 releases of the &zebra; Information Server.
16 <section id="record-model-domxml-filter">
17 <title>&dom; Record Filter</title>
20 The &dom; &xml; filter uses a standard &dom; &xml; structure as
21 internal data model, and can therefore parse, index, and display
22 any &xml; document type. It is well suited to work on
23 standardized &xml;-based formats such as Dublin Core, MODS, METS,
24 MARCXML, OAI-PMH, RSS, and performs equally well on any other
25 non-standard &xml; format.
28 A parser for binary &marc; records based on the ISO2709 library
29 standard is provided; it transforms these to the internal
30 &marcxml; &dom; representation. Other binary document parsers
31 are planned to follow.
36 <section id="record-model-domxml-architecture">
37 <title>&dom; &xml; filter architecture</title>
40 The internal &dom; &xml; representation can be fed into four
41 different pipelines, consisting of arbitrarily many successive
42 &xslt; transformations.
45 <table id="record-model-domxml-architecture-table" frame="top">
46 <title>&dom; &xml; filter pipelines overview</title>
52 <entry>Description</entry>
60 <entry><literal>input</literal></entry>
62 <entry>input parsing and initial
63 transformations to common &xml; format</entry>
64 <entry>raw &xml; record buffers, &xml; streams and
65 binary &marc; buffers</entry>
66 <entry>single &dom; &xml; documents suitable for indexing and
67 internal storage</entry>
70 <entry><literal>extract</literal></entry>
72 <entry>indexing term extraction
73 transformations</entry>
74 <entry>common single &dom; &xml; format</entry>
75 <entry>&zebra; internal indexing &dom; &xml; document</entry>
78 <entry><literal>store</literal></entry>
80 <entry> transformations before internal document
82 <entry>common single &dom; &xml; format</entry>
83 <entry>&zebra; internal storage &dom; &xml; document</entry>
86 <entry><literal>retrieve</literal></entry>
88 <entry>document retrieve transformations from storage to output
89 syntax and format</entry>
90 <entry>&zebra; internal storage &dom; &xml; document</entry>
91 <entry>requested output syntax and format</entry>
98 The &dom; &xml; filter pipelines use &xslt; (and, if supported on
99 your platform, even &exslt;), and thus bring full &xpath;
100 support to the indexing, storage and display rules of not only
101 &xml; documents, but also binary &marc; records.
106 <section id="record-model-domxml-pipeline">
107 <title>&dom; &xml; filter pipeline configuration</title>
110 The experimental, loadable &dom; &xml;/&xslt; filter module
111 <literal>mod-dom.so</literal>
112 is invoked by the <filename>zebra.cfg</filename> configuration statement
114 recordtype.xml: dom.db/filter_dom_conf.xml
116 In this example, the filter is applied to all data files with suffix
117 <filename>*.xml</filename>, and the
118 &dom; &xslt; filter configuration file is found in the
119 path <filename>db/filter_dom_conf.xml</filename>.
122 <para>The &dom; &xslt; filter configuration file must be
123 valid &xml;. It might look like this:
126 <?xml version="1.0" encoding="UTF-8"?>
127 <dom xmlns="http://indexdata.com/zebra-2.0">
129 <xmlreader level="1"/>
131 <extract name="index">
132 <xslt stylesheet="common2index.xsl"/>
135 <xslt stylesheet="common2store.xsl"/>
138 <xslt stylesheet="store2dc.xsl"/>
146 All named stylesheets defined inside
147 <literal>schema</literal> element tags
148 are for presentation after search, including
149 the indexing stylesheet (which is a great debugging help). The
150 names defined in the <literal>name</literal> attributes must be
151 unique; these are the literal <literal>schema</literal> or
152 <literal>element set</literal> names used in
153 <ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
154 <ulink url="&url.sru;">&sru;</ulink> and
155 &z3950; protocol queries.
156 The paths in the <literal>stylesheet</literal> attributes
157 are relative to &zebra;'s working directory, or absolute to file
161 The <literal><split level="2"/></literal> decides where the
162 &xml; Reader shall split the
163 collections of records into individual records, which then are
164 loaded into &dom;, and have the indexing &xslt; stylesheet applied.
167 There must be exactly one indexing &xslt; stylesheet, which is
168 defined by the magic attribute
169 <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
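<para>
 As a minimal sketch of how several successive &xslt; transformations
 might be chained inside one pipeline, the fragment below places two
 <literal>xslt</literal> steps in the <literal>extract</literal>
 pipeline. The element names are taken from the configuration example
 above; the file name <filename>source2common.xsl</filename> is
 hypothetical, and the assumption that the steps run in document
 order has not been verified against the filter implementation.
</para>
<screen><![CDATA[
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <xmlreader level="1"/>
  </input>
  <extract name="index">
    <!-- step 1 (hypothetical file): bring source records
         into the common XML format -->
    <xslt stylesheet="source2common.xsl"/>
    <!-- step 2: turn the common format into the indexing document;
         assumed to receive the output of step 1 -->
    <xslt stylesheet="common2index.xsl"/>
  </extract>
</dom>
]]></screen>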
172 <section id="record-model-domxml-internal">
173 <title>&dom; filter internal record representation</title>
174 <para>When indexing, an &xml; Reader is invoked to split the input
175 files into suitable record &xml; pieces. Each record piece is then
176 transformed to an &xml; &dom; structure, which is essentially the
177 record model. Only &xslt; transformations can be applied during
178 indexing, search and retrieval. Consequently, output formats are
179 restricted to whatever &xslt; can deliver from the record &xml;
180 structure, be it other &xml; formats, HTML, or plain text. In case
181 you have <literal>libxslt1</literal> running with &exslt; support,
182 you can use this functionality inside the &dom;
183 filter configuration &xslt; stylesheets.
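<para>
 The following sketch illustrates what using an &exslt; function
 inside an indexing stylesheet could look like. It assumes a
 hypothetical input record of the form
 <literal>&lt;record&gt;&lt;id&gt;..&lt;/id&gt;&lt;subjects&gt;a; b; c&lt;/subjects&gt;&lt;/record&gt;</literal>
 and an index named <literal>subject</literal>; neither comes from the
 &zebra; distribution. It relies on the &exslt;
 <literal>str:tokenize</literal> function, which is only available if
 your &xslt; processor was built with &exslt; support.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    exclude-result-prefixes="str">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- hypothetical record root -->
  <xsl:template match="/record">
    <z:record z:id="{normalize-space(id)}" z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- split a semicolon-separated list into one index entry per item -->
  <xsl:template match="subjects">
    <xsl:for-each select="str:tokenize(., ';')">
      <z:index name="subject" type="w">
        <xsl:value-of select="normalize-space(.)"/>
      </z:index>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
]]></screen>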
187 <section id="record-model-domxml-canonical">
188 <title>&dom; Canonical Indexing Format</title>
189 <para>The output of the indexing &xslt; stylesheets must contain
190 certain elements in the magic
191 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
192 namespace. The output of the &xslt; indexing transformation is then
193 parsed using &dom; methods, and the contained instructions are
194 performed on the <emphasis>magic elements and their
198 For example, the output of the command
200 xsltproc xsl/oai2index.xsl one-record.xml
202 might look like this:
204 <?xml version="1.0" encoding="UTF-8"?>
205 <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
206 z:id="oai:JTRS:CP-3290---Volume-I"
209 <z:index name="oai_identifier" type="0">
210 oai:JTRS:CP-3290---Volume-I</z:index>
211 <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
212 <z:index name="oai_setspec" type="0">jtrs</z:index>
213 <z:index name="dc_all" type="w">
214 <z:index name="dc_title" type="w">Proceedings of the 4th
215 International Conference and Exhibition:
216 World Congress on Superconductivity - Volume I</z:index>
217 <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
218 Burnham, Editors</z:index>
223 <para>This means the following: From the original &xml; file
224 <literal>one-record.xml</literal> (or from the &xml; record &dom; of the
225 same form coming from a split input file), the indexing
226 stylesheet produces an indexing &xml; record, which is defined by
227 the <literal>record</literal> element in the magic namespace
228 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
229 &zebra; uses the content of
230 <literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
231 record ID, and - in case static ranking is set - the content of
232 <literal>z:rank="47896"</literal> as static rank. Following the
233 discussion in <xref linkend="administration-ranking"/>
234 we see that this record is internally ordered
235 lexicographically according to the value of the string
236 <literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
237 The type of action performed during indexing is defined by
238 <literal>z:type="update"</literal>, with recognized values
239 <literal>insert</literal>, <literal>update</literal>, and
240 <literal>delete</literal>.
242 <para>In this example, the following literal indexes are constructed:
251 where the indexing type is defined in the
252 <literal>type</literal> attribute
253 (any value from the standard configuration
254 file <filename>default.idx</filename> will do). Finally, any
255 <literal>text()</literal> node content recursively contained
256 inside the <literal>index</literal> will be filtered through the
257 appropriate charmap for character normalization, and will be
258 inserted in the index.
261 Specific to this example, we see that the single word
262 <literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted literally,
263 byte for byte without any form of character normalization,
264 into the index named <literal>oai_identifier</literal>,
266 <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
267 will be inserted using the <literal>w</literal> character
268 normalization defined in <filename>default.idx</filename> into
269 the index <literal>dc_creator</literal> (that is, after character
270 normalization the index will keep the individual words
271 <literal>kumar</literal>, <literal>krishen</literal>,
272 <literal>and</literal>, <literal>calvin</literal>,
273 <literal>burnham</literal>, and <literal>editors</literal>), and
274 finally both the texts
275 <literal>Proceedings of the 4th International Conference and Exhibition:
276 World Congress on Superconductivity - Volume I</literal>
278 <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
279 will be inserted into the index <literal>dc_all</literal> using
280 the same character normalization map <literal>w</literal>.
283 Finally, this example configuration can be queried using &pqf;
284 queries, either transported by &z3950; (here using a yaz-client)
287 Z> open localhost:9999
291 Z> f @attr 1=dc_creator Kumar
292 Z> scan @attr 1=dc_creator adam
294 Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
295 Z> scan @attr 1=dc_title abc
299 extensions <literal>x-pquery</literal> and
300 <literal>x-pScanClause</literal> to
304 http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
305 http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
308 See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
309 configuration, and <xref linkend="gfs-config"/> or the &yaz;
310 <ulink url="&url.yaz.cql;">&cql; section</ulink>
311 for the details of the &yaz; frontend server.
314 Notice that there are no <filename>*.abs</filename>,
315 <filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
316 filter configuration files involved in this process, and that the
317 literal index names are used during search and retrieval.
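<para>
 Returning to the record-level attributes described earlier in this
 section, the following sketch shows one way an indexing stylesheet
 might choose the <literal>z:type</literal> action per record. It
 assumes &oai; input in which deleted records carry
 <literal>status="deleted"</literal> on the header element, and emits
 <literal>delete</literal> for those and <literal>update</literal>
 otherwise. Whether a <literal>delete</literal> record needs any
 <literal>z:index</literal> children is not stated here, so treat
 this as an illustration rather than a tested recipe.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    exclude-result-prefixes="oai">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <!-- pick the indexing action from the OAI header status -->
    <xsl:variable name="action">
      <xsl:choose>
        <xsl:when test="oai:record/oai:header/@status = 'deleted'">delete</xsl:when>
        <xsl:otherwise>update</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="{$action}">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier" type="0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>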
323 <section id="record-model-domxml-conf">
324 <title>&dom; Record Model Configuration</title>
327 <section id="record-model-domxml-index">
328 <title>&dom; Indexing Configuration</title>
330 As mentioned above, there can be only one indexing
331 stylesheet, and configuring the indexing process is synonymous
332 with writing an &xslt; stylesheet which produces &xml; output containing the
333 magic elements discussed in
334 <xref linkend="record-model-domxml-internal"/>.
335 Obviously, there are millions of different ways to accomplish this
336 task, and some comments and code snippets are in order to lead
337 our padawans on the right track to the good side of the force.
340 Stylesheets can be written in the <emphasis>pull</emphasis> or
341 the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
342 means that the output &xml; structure is taken as starting point of
343 the internal structure of the &xslt; stylesheet, and portions of
344 the input &xml; are <emphasis>pulled</emphasis> out and inserted
345 into the right spots of the output &xml; structure. On the other
346 hand, <emphasis>push</emphasis> &xslt; stylesheets recursively
347 call their template definitions, a process which is driven
348 by the input &xml; structure, and produce output &xml;
349 whenever certain conditions in the input are
350 met. The <emphasis>pull</emphasis> type is well-suited for input
351 &xml; with strong and well-defined structure and semantics, like the
352 following &oai; indexing example, whereas the
353 <emphasis>push</emphasis> type might be the only possible way to
354 sort out deeply recursive input &xml; formats.
357 A <emphasis>pull</emphasis> stylesheet example used to index
358 &oai; harvested records could use some of the following template
362 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
363 xmlns:z="http://indexdata.dk/zebra/xslt/1"
364 xmlns:oai="http://www.openarchives.org/OAI/2.0/"
365 xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
366 xmlns:dc="http://purl.org/dc/elements/1.1/"
369 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
371 <!-- disable all default text node output -->
372 <xsl:template match="text()"/>
374 <!-- match on oai xml record root -->
375 <xsl:template match="/">
376 <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
378 <!-- you might want to use z:rank="{some &xslt; function here}" -->
379 <xsl:apply-templates/>
383 <!-- &oai; indexing templates -->
384 <xsl:template match="oai:record/oai:header/oai:identifier">
385 <z:index name="oai_identifier" type="0">
386 <xsl:value-of select="."/>
392 <!-- DC specific indexing templates -->
393 <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
394 <z:index name="dc_title" type="w">
395 <xsl:value-of select="."/>
407 that the names and types of the indexes can be defined in the
408 indexing &xslt; stylesheet <emphasis>dynamically according to
409 content in the original &xml; records</emphasis>, which has
410 opportunities for great power and wizardry as well as grande
414 The following excerpt of a <emphasis>push</emphasis> stylesheet
415 <emphasis>might</emphasis>
416 be a good idea if you have strict control of the &xml;
417 input format (due to rigorous checking against well-defined and
418 tight RelaxNG or &xml; Schemas, for example):
421 <xsl:template name="element-name-indexes">
422 <z:index name="{name()}" type="w">
423 <xsl:value-of select="'1'"/>
428 This template creates indexes which have the name of the working
429 node of any input &xml; file, and assigns a '1' to the index.
431 <literal>find @attr 1=xyz 1</literal>
432 finds all files which contain at least one
433 <literal>xyz</literal> &xml; element. In case you can not control
434 which element names the input files contain, you might ask for
435 disaster and bad karma using this technique.
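<para>
 To show how the named template above might be driven, here is a
 minimal push-style stylesheet sketch that calls it for every element
 of the input. The surrounding templates are illustrative assumptions,
 not taken from the distribution. Note that <literal>name()</literal>
 returns prefixed names such as <literal>dc:title</literal> for
 namespaced input; using <literal>local-name()</literal> instead would
 be one way to avoid colons in index names.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- wrap everything in one indexing record -->
  <xsl:template match="/">
    <z:record z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push style: every element triggers the named template,
       then processing recurses into its children -->
  <xsl:template match="*">
    <xsl:call-template name="element-name-indexes"/>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template name="element-name-indexes">
    <z:index name="{name()}" type="w">
      <xsl:value-of select="'1'"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>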
438 One variation on the theme <emphasis>dynamically created
439 indexes</emphasis> will definitely be unwise:
442 <!-- match on oai xml record root -->
443 <xsl:template match="/">
444 <z:record z:type="update">
446 <!-- create dynamic index name from input content -->
447 <xsl:variable name="dynamic_content">
448 <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
451 <!-- create zillions of indexes with unknown names -->
452 <z:index name="{$dynamic_content}" type="w">
453 <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
460 Don't be tempted to cross
461 the line to the dark side of the force, padawan; this leads
462 to suffering and pain, and universal
463 disintegration of your project schedule.
467 <section id="record-model-domxml-elementset">
468 <title>&dom; Exchange Formats</title>
470 An exchange format can be anything which can be the outcome of an
471 &xslt; transformation, as long as the stylesheet is registered in
472 the main &dom; &xslt; filter configuration file, see
473 <xref linkend="record-model-domxml-filter"/>.
474 In principle anything that can be expressed in &xml;, HTML, and
475 plain text can be the output of a <literal>schema</literal> or
476 <literal>element set</literal> directive during search, as long as
477 the information comes from the
478 <emphasis>original input record &xml; &dom; tree</emphasis>
479 (and not the transformed and <emphasis>indexed</emphasis> &xml;!!).
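<para>
 As a minimal sketch of such an exchange format, the stylesheet below
 returns only the Dublin Core part of a stored &oai; record. It
 assumes that the stored record has the same
 <literal>oai:record</literal> structure as the indexing examples in
 this chapter; it is not one of the stylesheets shipped with &zebra;.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    exclude-result-prefixes="oai">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- copy the Dublin Core container, with all its children,
       unchanged from the original input record -->
  <xsl:template match="/">
    <xsl:copy-of select="oai:record/oai:metadata/oai_dc:dc"/>
  </xsl:template>

</xsl:stylesheet>
]]></screen>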
482 In addition, internal administrative information from the &zebra;
483 indexer can be accessed during record retrieval. The following
484 example is a summary of the possibilities:
487 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
488 xmlns:z="http://indexdata.dk/zebra/xslt/1"
491 <!-- register internal zebra parameters -->
492 <xsl:param name="id" select="''"/>
493 <xsl:param name="filename" select="''"/>
494 <xsl:param name="score" select="''"/>
495 <xsl:param name="schema" select="''"/>
497 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
499 <!-- use them for display of internal information -->
500 <xsl:template match="/">
502 <id><xsl:value-of select="$id"/></id>
503 <filename><xsl:value-of select="$filename"/></filename>
504 <score><xsl:value-of select="$score"/></score>
505 <schema><xsl:value-of select="$schema"/></schema>
516 <section id="record-model-domxml-example">
517 <title>&dom; Filter &oai; Indexing Example</title>
519 The source code tarball contains a working &dom; filter example in
520 the directory <filename>examples/dom-oai/</filename>, which
521 should get you started.
524 More example data can be harvested from any &oai; compliant server,
525 see details at the &oai;
526 <ulink url="http://www.openarchives.org/">
527 http://www.openarchives.org/</ulink> web site, and the community
529 <ulink url="http://www.openarchives.org/community/index.html">
530 http://www.openarchives.org/community/index.html</ulink>.
533 <ulink url="http://www.oaforum.org/tutorial/">
534 http://www.oaforum.org/tutorial/</ulink>.
546 c) Main "dom" &xslt; filter config file:
547 cat db/filter_dom_conf.xml
549 <?xml version="1.0" encoding="UTF-8"?>
551 <schema name="dom" stylesheet="db/dom2dom.xsl" />
552 <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
553 stylesheet="db/dom2index.xsl" />
554 <schema name="dc" stylesheet="db/dom2dc.xsl" />
555 <schema name="dc-short" stylesheet="db/dom2dc_short.xsl" />
556 <schema name="snippet" snippet="25" stylesheet="db/dom2snippet.xsl" />
557 <schema name="help" stylesheet="db/dom2help.xsl" />
561 the paths are relative to the directory where zebra.init is placed
564 The split level decides where the SAX parser shall split the
565 collections of records into individual records, which then are
566 loaded into &dom;, and have the indexing &xslt; stylesheet applied.
568 The indexing stylesheet is found by its identifier.
570 All the other stylesheets are for presentation after search.
572 - in data/ a short sample of harvested carnivorous plants
573 ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
575 - in root also one single data record - nice for testing the xslt
578 xsltproc db/dom2index.xsl carni*.xml
582 - in db/ a cql2pqf.txt yaz-client config file
583 which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
585 see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
587 - in db/ an indexing &xslt; stylesheet. This is a PULL-type XSLT thing,
588 as it constructs the new &xml; structure by pulling data out of the
589 respective elements/attributes of the old structure.
591 Notice the special zebra namespace, and the special elements in this
592 namespace which indicate to the zebra indexer what to do.
594 <z:record id="67ht7" rank="675" type="update">
595 indicates that a new record with given id and static rank has to be updated.
597 <z:index name="title" type="w">
598 encloses all the text/&xml; which shall be indexed in the index named
599 "title" and of index type "w" (see file default.idx in your zebra
611 <!-- Keep this comment at the end of the file
616 sgml-minimize-attributes:nil
617 sgml-always-quote-attributes:t
620 sgml-parent-document: "zebra.xml"
621 sgml-local-catalogs: nil
622 sgml-namecase-general:t