1 <chapter id="record-model-domxml">
2 <!-- $Id: recordmodel-domxml.xml,v 1.2 2007-02-20 14:53:25 marc Exp $ -->
3 <title>&dom; &xml; Record Model and Filter Module</title>
The record model described in this chapter applies to the fundamental
record type <literal>dom</literal>, introduced in
<xref linkend="componentmodulesdom"/>. The &dom; &xml; record model
is experimental, and its inner workings might change in future
releases of the &zebra; Information Server.
16 <section id="record-model-domxml-filter">
17 <title>&dom; Record Filter</title>
The &dom; &xml; filter uses a standard &dom; &xml; structure as its
internal data model, and can therefore parse, index, and display
any &xml; document type. It is well suited to
standardized &xml;-based formats such as Dublin Core, MODS, METS,
MARCXML, OAI-PMH, and RSS, and performs equally well on any
non-standard &xml; format.
A parser for binary &marc; records based on the ISO2709 library
standard is provided; it transforms these records to the internal
&marcxml; &dom; representation. Other binary document parsers
are planned to follow.
36 <section id="record-model-domxml-architecture">
37 <title>&dom; &xml; filter architecture</title>
The internal &dom; &xml; representation can be fed into four
different pipelines, each consisting of arbitrarily many successive
&xslt; transformations.
45 <table id="record-model-domxml-architecture-table" frame="top">
46 <title>&dom; &xml; filter pipelines overview</title>
52 <entry>Description</entry>
60 <entry><literal>input</literal></entry>
62 <entry>input parsing and initial
63 transformations to common &xml; format</entry>
64 <entry>raw &xml; record buffers, &xml; streams and
65 binary &marc; buffers</entry>
66 <entry>single &dom; &xml; documents suitable for indexing and
67 internal storage</entry>
70 <entry><literal>extract</literal></entry>
72 <entry>indexing term extraction
73 transformations</entry>
74 <entry>common single &dom; &xml; format</entry>
75 <entry>&zebra; internal indexing &dom; &xml; document</entry>
78 <entry><literal>store</literal></entry>
80 <entry> transformations before internal document
82 <entry>common single &dom; &xml; format</entry>
83 <entry>&zebra; internal storage &dom; &xml; document</entry>
86 <entry><literal>retrieve</literal></entry>
88 <entry>document retrieve transformations from storage to output
89 syntax and format</entry>
90 <entry>&zebra; internal storage &dom; &xml; document</entry>
91 <entry>requested output syntax and format</entry>
The &dom; &xml; filter pipelines use &xslt; (and, if supported on
your platform, &exslt;), and thus bring full &xpath;
support to the indexing, storage and display rules of not only
&xml; documents, but also binary &marc; records.
106 <section id="record-model-domxml-pipeline">
107 <title>&dom; &xml; filter pipeline configuration</title>
110 The experimental, loadable &dom; &xml;/&xslt; filter module
111 <literal>mod-dom.so</literal>
112 is invoked by the <filename>zebra.cfg</filename> configuration statement
114 recordtype.xml: dom.db/filter_dom_conf.xml
In this example, the filter is applied to all data files with suffix
<filename>*.xml</filename>, and the
&dom; &xslt; filter configuration file is found at the
path <filename>db/filter_dom_conf.xml</filename>.
122 <para>The &dom; &xslt; filter configuration file must be
123 valid &xml;. It might look like this:
128 <xmlreader level="1"/>
130 <extract name="index">
131 <xslt stylesheet="common2index.xsl"/>
134 <xslt stylesheet="common2store.xsl"/>
137 <xslt stylesheet="store2dc.xsl"/>
All named stylesheets defined inside
<literal>schema</literal> element tags
are for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
unique; they are the literal <literal>schema</literal> or
<literal>element set</literal> names used in
<ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
<ulink url="&url.sru;">&sru;</ulink> and
&z3950; protocol queries.
The paths in the <literal>stylesheet</literal> attributes
are relative to &zebra;'s working directory, or absolute to file
The <literal>&lt;split level="2"/&gt;</literal> decides where the
161 &xml; Reader shall split the
162 collections of records into individual records, which then are
163 loaded into &dom;, and have the indexing &xslt; stylesheet applied.
166 There must be exactly one indexing &xslt; stylesheet, which is
167 defined by the magic attribute
168 <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
171 <section id="record-model-domxml-internal">
172 <title>&dom; filter internal record representation</title>
<para>When indexing, an &xml; Reader is invoked to split the input
files into suitable record &xml; pieces. Each record piece is then
transformed to an &xml; &dom; structure, which is essentially the
record model. Only &xslt; transformations can be applied during
indexing, search and retrieval. Consequently, output formats are
restricted to whatever &xslt; can deliver from the record &xml;
structure, be it other &xml; formats, HTML, or plain text. If
you have <literal>libxslt1</literal> running with &exslt; support,
you can use this functionality inside the &dom;
filter configuration &xslt; stylesheets.
186 <section id="record-model-domxml-canonical">
187 <title>&dom; Canonical Indexing Format</title>
188 <para>The output of the indexing &xslt; stylesheets must contain
189 certain elements in the magic
190 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
191 namespace. The output of the &xslt; indexing transformation is then
192 parsed using &dom; methods, and the contained instructions are
193 performed on the <emphasis>magic elements and their
197 For example, the output of the command
199 xsltproc xsl/oai2index.xsl one-record.xml
201 might look like this:
203 <?xml version="1.0" encoding="UTF-8"?>
204 <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
205 z:id="oai:JTRS:CP-3290---Volume-I"
208 <z:index name="oai_identifier" type="0">
209 oai:JTRS:CP-3290---Volume-I</z:index>
210 <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
211 <z:index name="oai_setspec" type="0">jtrs</z:index>
212 <z:index name="dc_all" type="w">
213 <z:index name="dc_title" type="w">Proceedings of the 4th
214 International Conference and Exhibition:
215 World Congress on Superconductivity - Volume I</z:index>
216 <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
217 Burnham, Editors</z:index>
<para>This means the following: From the original &xml; file
<literal>one-record.xml</literal> (or from the &xml; record &dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &xml; record, which is defined by
the <literal>record</literal> element in the magic namespace
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
&zebra; uses the content of
<literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as the internal
record ID and, if static ranking is set, the content of
<literal>z:rank="47896"</literal> as the static rank. Following the
discussion in <xref linkend="administration-ranking"/>,
we see that this record is internally ordered
lexicographically according to the value of the string
<literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
The type of action performed during indexing is defined by
<literal>z:type="update"</literal>, with recognized values
<literal>insert</literal>, <literal>update</literal>, and
<literal>delete</literal>.
241 <para>In this example, the following literal indexes are constructed:
250 where the indexing type is defined in the
251 <literal>type</literal> attribute
252 (any value from the standard configuration
253 file <filename>default.idx</filename> will do). Finally, any
254 <literal>text()</literal> node content recursively contained
255 inside the <literal>index</literal> will be filtered through the
256 appropriate charmap for character normalization, and will be
257 inserted in the index.
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted literally,
byte for byte and without any form of character normalization,
into the index named <literal>oai_identifier</literal>,
265 <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
266 will be inserted using the <literal>w</literal> character
267 normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
normalization the index will keep the individual words
270 <literal>kumar</literal>, <literal>krishen</literal>,
271 <literal>and</literal>, <literal>calvin</literal>,
272 <literal>burnham</literal>, and <literal>editors</literal>), and
273 finally both the texts
274 <literal>Proceedings of the 4th International Conference and Exhibition:
275 World Congress on Superconductivity - Volume I</literal>
277 <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
279 the same character normalization map <literal>w</literal>.
Finally, this example configuration can be queried using &pqf;
queries, either transported by &z3950; (here using a yaz-client)
286 Z> open localhost:9999
290 Z> f @attr 1=dc_creator Kumar
291 Z> scan @attr 1=dc_creator adam
293 Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
294 Z> scan @attr 1=dc_title abc
extensions <literal>x-pquery</literal> and
299 <literal>x-pScanClause</literal> to
303 http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
304 http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
307 See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
308 configuration, and <xref linkend="gfs-config"/> or the &yaz;
309 <ulink url="&url.yaz.cql;">&cql; section</ulink>
for the details of the &yaz; frontend server.
Notice that there are no <filename>*.abs</filename>,
<filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
filter configuration files involved in this process, and that the
literal index names are used during search and retrieval.
322 <section id="record-model-domxml-conf">
323 <title>&dom; Record Model Configuration</title>
326 <section id="record-model-domxml-index">
327 <title>&dom; Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process amounts to
writing an &xslt; stylesheet which produces &xml; output containing the
magic elements discussed in
<xref linkend="record-model-domxml-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our padawans on the right track to the good side of the force.
Stylesheets can be written in the <emphasis>pull</emphasis> or
the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
means that the output &xml; structure is taken as the starting point of
the internal structure of the &xslt; stylesheet, and portions of
the input &xml; are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output &xml; structure. On the other
hand, <emphasis>push</emphasis> &xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &xml; structure, and which produces output &xml;
whenever certain conditions in the input &xml; are
met. The <emphasis>pull</emphasis> type is well suited for input
&xml; with strong and well-defined structure and semantics, like the
following &oai; indexing example, whereas the
<emphasis>push</emphasis> type might be the only possible way to
sort out deeply recursive input &xml; formats.
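The difference between the two styles can be sketched in a single small
stylesheet. This is an illustration only, not taken from the &zebra;
distribution: it assumes flat Dublin Core input, and the choice of
<literal>dc:identifier</literal> as record ID is hypothetical. The
<literal>dc_title</literal> index is filled in pull style, while the
<literal>dc_creator</literal> index is filled in push style.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <xsl:template match="/">
    <!-- assumption: the first dc:identifier serves as record ID -->
    <z:record z:id="{normalize-space(//dc:identifier[1])}" z:type="update">

      <!-- pull style: the output skeleton is written out here, and the
           input is queried with an explicit XPath selection -->
      <z:index name="dc_title" type="w">
        <xsl:value-of select="//dc:title[1]"/>
      </z:index>

      <!-- push style: hand control to the input tree and let the
           matching templates below decide what to emit -->
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- fires whenever a dc:creator element is met, at any depth -->
  <xsl:template match="dc:creator">
    <z:index name="dc_creator" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- suppress the built-in rule that would copy stray text nodes -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

In the pull part, the stylesheet author decides where each piece of output
goes. In the push part, the shape of the input decides how often, and in
which order, the <literal>dc:creator</literal> template runs.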
356 A <emphasis>pull</emphasis> stylesheet example used to index
357 &oai; harvested records could use some of the following template
361 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
362 xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
365 xmlns:dc="http://purl.org/dc/elements/1.1/"
368 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
370 <!-- disable all default text node output -->
371 <xsl:template match="text()"/>
373 <!-- match on oai xml record root -->
374 <xsl:template match="/">
375 <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
      <!-- you might want to use z:rank="{some XSLT function here}" -->
378 <xsl:apply-templates/>
  <!-- OAI indexing templates -->
383 <xsl:template match="oai:record/oai:header/oai:identifier">
384 <z:index name="oai_identifier" type="0">
385 <xsl:value-of select="."/>
391 <!-- DC specific indexing templates -->
392 <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
393 <z:index name="dc_title" type="w">
394 <xsl:value-of select="."/>
that the names and types of the indexes can be defined in the
indexing &xslt; stylesheet <emphasis>dynamically according to
content in the original &xml; records</emphasis>, which opens
opportunities for great power and wizardry as well as grand
The following excerpt of a <emphasis>push</emphasis> stylesheet
<emphasis>might</emphasis>
be a good idea, provided you have strict control of the &xml;
input format (due to rigorous checking against well-defined and
tight RelaxNG or &xml; Schemas, for example):
420 <xsl:template name="element-name-indexes">
421 <z:index name="{name()}" type="w">
422 <xsl:value-of select="'1'"/>
427 This template creates indexes which have the name of the working
428 node of any input &xml; file, and assigns a '1' to the index.
430 <literal>find @attr 1=xyz 1</literal>
431 finds all files which contain at least one
<literal>xyz</literal> &xml; element. If you cannot control
which element names the input files contain, you are asking for
disaster and bad karma by using this technique.
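A named template like <literal>element-name-indexes</literal> does nothing
until it is called. The following sketch shows one way it might be driven in
push style. The <literal>match="*"</literal> template and the use of an
<literal>id</literal> attribute on the root element as record ID are
assumptions made for illustration only.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:template match="/">
    <!-- hypothetical: the root element carries an id attribute -->
    <z:record z:id="{normalize-space(/*/@id)}" z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push style: visit every element, emit one index entry for it,
       then descend into its child elements -->
  <xsl:template match="*">
    <xsl:call-template name="element-name-indexes"/>
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- call-template keeps the current node, so name() is the name
       of the element being visited -->
  <xsl:template name="element-name-indexes">
    <z:index name="{name()}" type="w">
      <xsl:value-of select="'1'"/>
    </z:index>
  </xsl:template>
</xsl:stylesheet>
```

With this arrangement every distinct element name in the input becomes an
index name, which is exactly why the input format needs to be tightly
controlled.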
One variation on the theme <emphasis>dynamically created
indexes</emphasis> will definitely be unwise:
441 <!-- match on oai xml record root -->
442 <xsl:template match="/">
443 <z:record z:type="update">
445 <!-- create dynamic index name from input content -->
446 <xsl:variable name="dynamic_content">
447 <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
450 <!-- create zillions of indexes with unknown names -->
451 <z:index name="{$dynamic_content}" type="w">
452 <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
Don't be tempted to cross
the line to the dark side of the force, padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
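A less hazardous variation, sketched here under the assumption that only a
few Dublin Core fields matter, derives the index name from the element name
rather than from record content, and only for a fixed whitelist. The
whitelist and the <literal>dc_</literal> prefix are illustrative choices, not
&zebra; requirements.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- index names come from element names, never from record content -->
  <xsl:template match="dc:*">
    <xsl:variable name="field" select="local-name()"/>
    <!-- hypothetical whitelist; any other element is silently skipped -->
    <xsl:if test="$field = 'title' or $field = 'creator' or $field = 'subject'">
      <z:index name="dc_{$field}" type="w">
        <xsl:value-of select="."/>
      </z:index>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The set of index names is then bounded by the whitelist, no matter what the
harvested records contain.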
466 <section id="record-model-domxml-elementset">
467 <title>&dom; Exchange Formats</title>
An exchange format can be anything which can be the outcome of an
&xslt; transformation, as long as the stylesheet is registered in
the main &dom; &xslt; filter configuration file, see
<xref linkend="record-model-domxml-filter"/>.
In principle anything that can be expressed in &xml;, HTML, or
plain text can be the output of a <literal>schema</literal> or
<literal>element set</literal> directive during search, as long as
the information comes from the
<emphasis>original input record &xml; &dom; tree</emphasis>
(and not the transformed and <emphasis>indexed</emphasis> &xml;!).
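As an illustration, a retrieval stylesheet producing plain text might look
like the following sketch. It assumes that the stored record is an &oai;
record with Dublin Core metadata, as in the indexing example of this chapter,
and it is not part of the &zebra; distribution.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- plain text output instead of XML -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- emit each title of the record on a line of its own -->
  <xsl:template match="/">
    <xsl:for-each select="oai:record/oai:metadata/oai_dc:dc/dc:title">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Note that all XPath expressions address the original record structure, not
the <literal>z:record</literal> and <literal>z:index</literal> elements
produced by the indexing stylesheet.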
481 In addition, internal administrative information from the &zebra;
482 indexer can be accessed during record retrieval. The following
483 example is a summary of the possibilities:
486 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
487 xmlns:z="http://indexdata.dk/zebra/xslt/1"
490 <!-- register internal zebra parameters -->
491 <xsl:param name="id" select="''"/>
492 <xsl:param name="filename" select="''"/>
493 <xsl:param name="score" select="''"/>
494 <xsl:param name="schema" select="''"/>
496 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
  <!-- use them for display of internal information -->
499 <xsl:template match="/">
501 <id><xsl:value-of select="$id"/></id>
502 <filename><xsl:value-of select="$filename"/></filename>
503 <score><xsl:value-of select="$score"/></score>
504 <schema><xsl:value-of select="$schema"/></schema>
515 <section id="record-model-domxml-example">
516 <title>&dom; Filter &oai; Indexing Example</title>
The source code tarball contains a working &dom; filter example in
the directory <filename>examples/dom-oai/</filename>, which
should get you started.
More example data can be harvested from any &oai; compliant server,
524 see details at the &oai;
525 <ulink url="http://www.openarchives.org/">
526 http://www.openarchives.org/</ulink> web site, and the community
528 <ulink url="http://www.openarchives.org/community/index.html">
529 http://www.openarchives.org/community/index.html</ulink>.
532 <ulink url="http://www.oaforum.org/tutorial/">
533 http://www.oaforum.org/tutorial/</ulink>.
545 c) Main "dom" &xslt; filter config file:
546 cat db/filter_dom_conf.xml
<?xml version="1.0" encoding="UTF-8"?>
550 <schema name="dom" stylesheet="db/dom2dom.xsl" />
551 <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
552 stylesheet="db/dom2index.xsl" />
553 <schema name="dc" stylesheet="db/dom2dc.xsl" />
554 <schema name="dc-short" stylesheet="db/dom2dc_short.xsl" />
555 <schema name="snippet" snippet="25" stylesheet="db/dom2snippet.xsl" />
556 <schema name="help" stylesheet="db/dom2help.xsl" />
560 the paths are relative to the directory where zebra.init is placed
563 The split level decides where the SAX parser shall split the
564 collections of records into individual records, which then are
565 loaded into &dom;, and have the indexing &xslt; stylesheet applied.
The indexing stylesheet is found by its identifier.
569 All the other stylesheets are for presentation after search.
571 - in data/ a short sample of harvested carnivorous plants
572 ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
574 - in root also one single data record - nice for testing the xslt
577 xsltproc db/dom2index.xsl carni*.xml
581 - in db/ a cql2pqf.txt yaz-client config file
582 which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
584 see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
586 - in db/ an indexing &xslt; stylesheet. This is a PULL-type XSLT thing,
587 as it constructs the new &xml; structure by pulling data out of the
588 respective elements/attributes of the old structure.
590 Notice the special zebra namespace, and the special elements in this
591 namespace which indicate to the zebra indexer what to do.
593 <z:record id="67ht7" rank="675" type="update">
594 indicates that a new record with given id and static rank has to be updated.
596 <z:index name="title" type="w">
597 encloses all the text/&xml; which shall be indexed in the index named
598 "title" and of index type "w" (see file default.idx in your zebra
610 <!-- Keep this comment at the end of the file
615 sgml-minimize-attributes:nil
616 sgml-always-quote-attributes:t
619 sgml-parent-document: "zebra.xml"
620 sgml-local-catalogs: nil
621 sgml-namecase-general:t