<chapter id="record-model-domxml">
<!-- $Id: recordmodel-domxml.xml,v 1.4 2007-02-20 15:02:18 marc Exp $ -->
<title>&dom; &xml; Record Model and Filter Module</title>
The record model described in this chapter applies to the fundamental
record type <literal>dom</literal>, introduced in
<xref linkend="componentmodulesdom"/>. The &dom; &xml; record model
is experimental, and its inner workings might change in future
releases of the &zebra; Information Server.
<section id="record-model-domxml-filter">
<title>&dom; Record Filter</title>
The &dom; &xml; filter uses a standard &dom; &xml; structure as its
internal data model, and can therefore parse, index, and display
any &xml; document type. It is well suited to
standardized &xml;-based formats such as Dublin Core, MODS, METS,
MARCXML, OAI-PMH, and RSS, and performs equally well on any other
non-standard &xml; format.
A parser for binary &marc; records based on the ISO2709 library
standard is provided; it transforms these records to the internal
&marcxml; &dom; representation. Other binary document parsers
are planned to follow.
<section id="record-model-domxml-architecture">
<title>&dom; &xml; filter architecture</title>
The internal &dom; &xml; representation can be fed into four
different pipelines, each consisting of arbitrarily many successive
&xslt; transformations.
<table id="record-model-domxml-architecture-table" frame="top">
<title>&dom; &xml; filter pipelines overview</title>
<tgroup cols="4">
<thead>
<row>
<entry>Name</entry>
<entry>Description</entry>
<entry>Input</entry>
<entry>Output</entry>
</row>
</thead>
<tbody>
<row>
<entry><literal>input</literal></entry>
<entry>input parsing and initial
transformations to common &xml; format</entry>
<entry>raw &xml; record buffers, &xml; streams and
binary &marc; buffers</entry>
<entry>single &dom; &xml; documents suitable for indexing and
internal storage</entry>
</row>
<row>
<entry><literal>extract</literal></entry>
<entry>indexing term extraction
transformations</entry>
<entry>common single &dom; &xml; format</entry>
<entry>&zebra; internal indexing &dom; &xml; document</entry>
</row>
<row>
<entry><literal>store</literal></entry>
<entry>transformations before internal document
storage</entry>
<entry>common single &dom; &xml; format</entry>
<entry>&zebra; internal storage &dom; &xml; document</entry>
</row>
<row>
<entry><literal>retrieve</literal></entry>
<entry>multiple document retrieve transformations from
storage to different output
formats are possible</entry>
<entry>&zebra; internal storage &dom; &xml; document</entry>
<entry>output &xml; syntax and requested format</entry>
</row>
</tbody>
</tgroup>
</table>
The &dom; &xml; filter pipelines use &xslt; (and, if supported on
your platform, &exslt;), which brings full &xpath;
support to the indexing, storage, and display rules of not only
&xml; documents but also binary &marc; records.
<section id="record-model-domxml-pipeline">
<title>&dom; &xml; filter pipeline configuration</title>
The experimental, loadable &dom; &xml;/&xslt; filter module
<literal>mod-dom.so</literal>
is invoked by the <filename>zebra.cfg</filename> configuration statement
<screen>
recordtype.xml: dom.db/filter_dom_conf.xml
</screen>
In this example, the &dom; filter is applied to all data files with suffix
<filename>*.xml</filename>, and the
&dom; &xslt; filter configuration file is found at the
path <filename>db/filter_dom_conf.xml</filename>.
<para>The &dom; &xslt; filter configuration file must be
valid &xml;. It might look like this:
<screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <xmlreader level="1"/>
    <!-- <marc inputcharset="marc-8"/> -->
  </input>
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>
</dom>
]]></screen>
All named stylesheets defined inside
<literal>schema</literal> element tags
are for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
unique; these are the literal <literal>schema</literal> or
<literal>element set</literal> names used in
<ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
<ulink url="&url.sru;">&sru;</ulink> and
&z3950; protocol queries.
The paths in the <literal>stylesheet</literal> attributes
are relative to &zebra;'s working directory, or absolute to file
The <literal>&lt;split level="2"/&gt;</literal> decides where the
&xml; Reader shall split the
collections of records into individual records, which are then
loaded into &dom; and have the indexing &xslt; stylesheet applied.
There must be exactly one indexing &xslt; stylesheet, which is
defined by the magic attribute
<literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
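<para>
For binary &marc; input, a configuration along the following lines
might be used. This is only a sketch: the <literal>marc</literal>
element and its <literal>inputcharset</literal> attribute are taken
from the commented-out line in the example configuration above, while
the stylesheet file name <filename>marcxml2common.xsl</filename> is a
hypothetical placeholder. The sketch also assumes that an
<literal>input</literal> pipeline may chain an &xslt; step after the
parser, in line with the pipeline overview table.
</para>
<screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <!-- parse binary ISO2709 records into MARCXML -->
    <marc inputcharset="marc-8"/>
    <!-- hypothetical stylesheet: normalize MARCXML to a common format -->
    <xslt stylesheet="marcxml2common.xsl"/>
  </input>
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
</dom>
]]></screen>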
<section id="record-model-domxml-internal">
<title>&dom; filter internal record representation</title>
<para>When indexing, an &xml; Reader is invoked to split the input
files into suitable record &xml; pieces. Each record piece is then
transformed to an &xml; &dom; structure, which is essentially the
record model. Only &xslt; transformations can be applied during
indexing, search and retrieval. Consequently, output formats are
restricted to whatever &xslt; can deliver from the record &xml;
structure, be it other &xml; formats, HTML, or plain text. If
you have <literal>libxslt1</literal> running with &exslt; support,
you can use this functionality inside the &dom;
filter configuration &xslt; stylesheets.
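<para>
As an illustration of &exslt; use, the following sketch splits a
semicolon-separated subject field into one index entry per value with
the &exslt; function <literal>str:tokenize</literal>. It assumes that
your <literal>libxslt1</literal> build ships the &exslt; strings
module; the input element <literal>dc:subject</literal> and the index
name <literal>dc_subject</literal> are chosen for illustration only.
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:str="http://exslt.org/strings"
                exclude-result-prefixes="dc str"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- "physics; superconductivity" yields two index entries -->
  <xsl:template match="dc:subject">
    <xsl:for-each select="str:tokenize(., ';')">
      <z:index name="dc_subject" type="w">
        <xsl:value-of select="normalize-space(.)"/>
      </z:index>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
]]></screen>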
<section id="record-model-domxml-canonical">
<title>&dom; Canonical Indexing Format</title>
<para>The output of the indexing &xslt; stylesheets must contain
certain elements in the magic
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
namespace. The output of the &xslt; indexing transformation is then
parsed using &dom; methods, and the contained instructions are
performed on the <emphasis>magic elements and their
For example, the output of the command
<screen>
xsltproc xsl/oai2index.xsl one-record.xml
</screen>
might look like this:
<screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
          z:id="oai:JTRS:CP-3290---Volume-I"
          z:rank="47896"
          z:type="update">
  <z:index name="oai_identifier" type="0">
    oai:JTRS:CP-3290---Volume-I</z:index>
  <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
  <z:index name="oai_setspec" type="0">jtrs</z:index>
  <z:index name="dc_all" type="w">
    <z:index name="dc_title" type="w">Proceedings of the 4th
      International Conference and Exhibition:
      World Congress on Superconductivity - Volume I</z:index>
    <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
      Burnham, Editors</z:index>
  </z:index>
</z:record>
]]></screen>
<para>This means the following: From the original &xml; file
<literal>one-record.xml</literal> (or from the &xml; record &dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &xml; record, which is defined by
the <literal>record</literal> element in the magic namespace
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
&zebra; uses the content of
<literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
record ID, and, in case static ranking is set, the content of
<literal>z:rank="47896"</literal> as static rank. Following the
discussion in <xref linkend="administration-ranking"/>,
we see that this record is internally ordered
lexicographically according to the value of the string
<literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
The type of action performed during indexing is defined by
<literal>z:type="update"</literal>, with recognized values
<literal>insert</literal>, <literal>update</literal>, and
<literal>delete</literal>.
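<para>
The action type need not be hard-coded. The following sketch derives
it from the record itself, assuming &oai; harvested input in which
deleted records carry <literal>status="deleted"</literal> on the
header element, as the OAI-PMH protocol specifies. Whether &zebra;
needs further content in the record to carry out a delete is not
covered here, so treat this as a starting point to test against your
own data.
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                exclude-result-prefixes="oai"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <!-- choose the indexing action from the OAI header status -->
    <xsl:variable name="action">
      <xsl:choose>
        <xsl:when test="oai:record/oai:header/@status = 'deleted'">delete</xsl:when>
        <xsl:otherwise>update</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="{$action}">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier" type="0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>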
<para>In this example, the following literal indexes are constructed:
<literal>oai_identifier</literal>, <literal>oai_datestamp</literal>,
<literal>oai_setspec</literal>, <literal>dc_all</literal>,
<literal>dc_title</literal>, and <literal>dc_creator</literal>,
where the indexing type is defined in the
<literal>type</literal> attribute
(any value from the standard configuration
file <filename>default.idx</filename> will do). Finally, any
<literal>text()</literal> node content recursively contained
inside the <literal>index</literal> element will be filtered through the
appropriate charmap for character normalization, and will be
inserted in the index.
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted literally,
byte for byte and without any form of character normalization,
into the index named <literal>oai_identifier</literal>; the text
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted using the <literal>w</literal> character
normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
normalization the index will keep the individual words
<literal>kumar</literal>, <literal>krishen</literal>,
<literal>and</literal>, <literal>calvin</literal>,
<literal>burnham</literal>, and <literal>editors</literal>); and
finally both the texts
<literal>Proceedings of the 4th International Conference and Exhibition:
World Congress on Superconductivity - Volume I</literal>
and
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
the same character normalization map <literal>w</literal>.
Finally, this example configuration can be queried using &pqf;
queries, either transported by &z3950; (here using a yaz-client)
<screen><![CDATA[
Z> open localhost:9999
Z> f @attr 1=dc_creator Kumar
Z> scan @attr 1=dc_creator adam
Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
Z> scan @attr 1=dc_title abc
]]></screen>
or through the extensions <literal>x-pquery</literal> and
<literal>x-pScanClause</literal> to &sru;
<screen><![CDATA[
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
]]></screen>
See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
configuration, and <xref linkend="gfs-config"/> or the &yaz;
<ulink url="&url.yaz.cql;">&cql; section</ulink>
for details on the &yaz; frontend server.
Notice that there are no <filename>*.abs</filename>,
<filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
filter configuration files involved in this process, and that the
literal index names are used during search and retrieval.
<section id="record-model-domxml-conf">
<title>&dom; Record Model Configuration</title>
<section id="record-model-domxml-index">
<title>&dom; Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process amounts to
writing an &xslt; stylesheet which produces &xml; output containing the
magic elements discussed in
<xref linkend="record-model-domxml-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our padawans on the right track to the good side of the force.
Stylesheets can be written in the <emphasis>pull</emphasis> or
the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
means that the output &xml; structure is taken as the starting point of
the internal structure of the &xslt; stylesheet, and portions of
the input &xml; are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output &xml; structure. On the other
hand, <emphasis>push</emphasis> &xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &xml; structure, and wake up to produce some output &xml;
whenever some special conditions in the input &xml; are
met. The <emphasis>pull</emphasis> type is well suited for input
&xml; with strong and well-defined structure and semantics, like the
following &oai; indexing example, whereas the
<emphasis>push</emphasis> type might be the only possible way to
sort out deeply recursive input &xml; formats.
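<para>
To make the <emphasis>push</emphasis> style concrete, the following
sketch lets the input structure drive the processing: each template
fires when its element is met, and a wrapper template nests the
per-field indexes inside a <literal>dc_all</literal> index. The index
names follow the indexing output shown earlier in this chapter; which
Dublin Core elements get their own template is an assumption made for
illustration. Because plain text node output is disabled, only the
fields that have their own template end up in
<literal>dc_all</literal>.
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                exclude-result-prefixes="oai oai_dc dc"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- wrapper: everything indexed below also lands in dc_all -->
  <xsl:template match="oai_dc:dc">
    <z:index name="dc_all" type="w">
      <xsl:apply-templates/>
    </z:index>
  </xsl:template>

  <xsl:template match="dc:title">
    <z:index name="dc_title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <xsl:template match="dc:creator">
    <z:index name="dc_creator" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>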
A <emphasis>pull</emphasis> stylesheet example used to index
&oai; harvested records could use some of the following templates:
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- match on oai xml record root -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <!-- you might want to use z:rank="{some XSLT function here}" -->
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- OAI indexing templates -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier" type="0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- DC specific indexing templates -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
Note that the names and types of the indexes can be defined in the
indexing &xslt; stylesheet <emphasis>dynamically according to
content in the original &xml; records</emphasis>, which opens up
opportunities for great power and wizardry as well as grand disaster.
The following excerpt of a <emphasis>push</emphasis> stylesheet
<emphasis>might</emphasis>
be a good idea, depending on how strictly you control the &xml;
input format (due to rigorous checking against well-defined and
tight RelaxNG or &xml; Schemas, for example):
<screen><![CDATA[
<xsl:template name="element-name-indexes">
  <z:index name="{name()}" type="w">
    <xsl:value-of select="'1'"/>
  </z:index>
</xsl:template>
]]></screen>
This template creates indexes which have the name of the working
node of any input &xml; file, and assigns a '1' to the index.
A query such as
<literal>find @attr 1=xyz 1</literal>
then finds all files which contain at least one
<literal>xyz</literal> &xml; element. If you cannot control
which element names the input files contain, you are asking for
disaster and bad karma using this technique.
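<para>
A named template does nothing until it is called. One way to wire it
up, sketched here under the assumption that every element of the input
should be visited, is a generic element template that calls it and
then descends into the child elements. Note one deliberate deviation
from the excerpt above: this sketch uses
<literal>local-name()</literal> instead of <literal>name()</literal>,
so that namespace prefixes and colons do not leak into the index
names; whether that suits your data is for you to decide.
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- visit every element: index its name, then descend -->
  <xsl:template match="*">
    <xsl:call-template name="element-name-indexes"/>
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- variation of the excerpt above, using local-name() -->
  <xsl:template name="element-name-indexes">
    <z:index name="{local-name()}" type="w">
      <xsl:value-of select="'1'"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>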
One variation on the theme <emphasis>dynamically created
indexes</emphasis> will definitely be unwise:
<screen><![CDATA[
<!-- match on oai xml record root -->
<xsl:template match="/">
  <z:record z:type="update">

    <!-- create dynamic index name from input content -->
    <xsl:variable name="dynamic_content">
      <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
    </xsl:variable>

    <!-- create zillions of indexes with unknown names -->
    <z:index name="{$dynamic_content}" type="w">
      <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
    </z:index>
  </z:record>
</xsl:template>
]]></screen>
Don't be tempted to cross
the line to the dark side of the force, padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
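<para>
If index names do have to depend on the input, a more defensive
approach is to map the input onto a small, fixed set of known names.
The following sketch does this with <literal>xsl:choose</literal> for
Dublin Core elements; the catch-all name <literal>dc_other</literal>
and the choice of mapped elements are assumptions made for
illustration. However varied the input, at most three indexes are
ever created.
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                exclude-result-prefixes="oai oai_dc dc"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <xsl:template match="oai_dc:dc/dc:*">
    <!-- map the element name onto a closed list of index names -->
    <xsl:variable name="index_name">
      <xsl:choose>
        <xsl:when test="local-name() = 'title'">dc_title</xsl:when>
        <xsl:when test="local-name() = 'creator'">dc_creator</xsl:when>
        <xsl:otherwise>dc_other</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <z:index name="{$index_name}" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>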
<section id="record-model-domxml-elementset">
<title>&dom; Exchange Formats</title>
An exchange format can be anything which can be the outcome of an
&xslt; transformation, as long as the stylesheet is registered in
the main &dom; &xslt; filter configuration file, see
<xref linkend="record-model-domxml-filter"/>.
In principle anything that can be expressed in &xml;, HTML, or
plain text can be the output of a <literal>schema</literal> or
<literal>element set</literal> directive during search, as long as
the information comes from the
<emphasis>original input record &xml; &dom; tree</emphasis>
(and not the transformed and <emphasis>indexed</emphasis> &xml;).
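<para>
As a small illustration of a non-&xml; exchange format, the following
sketch of a retrieval stylesheet returns only the titles of a record
as plain text, one per line. It assumes that the stored record is
still an &oai; Dublin Core record; the stylesheet would have to be
registered in the filter configuration under a name of your choosing
before it can be requested.
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <!-- plain text output instead of XML -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- emit each title on its own line -->
  <xsl:template match="dc:title">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
]]></screen>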
In addition, internal administrative information from the &zebra;
indexer can be accessed during record retrieval. The following
example is a summary of the possibilities:
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                version="1.0">

  <!-- register internal zebra parameters -->
  <xsl:param name="id" select="''"/>
  <xsl:param name="filename" select="''"/>
  <xsl:param name="score" select="''"/>
  <xsl:param name="schema" select="''"/>

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- use them for display of internal information -->
  <xsl:template match="/">
    <!-- the wrapper element name is illustrative -->
    <z:zebra>
      <id><xsl:value-of select="$id"/></id>
      <filename><xsl:value-of select="$filename"/></filename>
      <score><xsl:value-of select="$score"/></score>
      <schema><xsl:value-of select="$schema"/></schema>
    </z:zebra>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
<section id="record-model-domxml-example">
<title>&dom; Filter &oai; Indexing Example</title>
The source code tarball contains a working &dom; filter example in
the directory <filename>examples/dom-oai/</filename>, which
should get you started.
More example data can be harvested from any &oai; compliant server;
see details at the &oai;
<ulink url="http://www.openarchives.org/">
http://www.openarchives.org/</ulink> web site, and the community
pages at
<ulink url="http://www.openarchives.org/community/index.html">
http://www.openarchives.org/community/index.html</ulink>.
A tutorial is available at
<ulink url="http://www.oaforum.org/tutorial/">
http://www.oaforum.org/tutorial/</ulink>.
c) Main "dom" &xslt; filter config file:
cat db/filter_dom_conf.xml
<?xml version="1.0" encoding="UTF8"?>
<schema name="dom" stylesheet="db/dom2dom.xsl" />
<schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
stylesheet="db/dom2index.xsl" />
<schema name="dc" stylesheet="db/dom2dc.xsl" />
<schema name="dc-short" stylesheet="db/dom2dc_short.xsl" />
<schema name="snippet" snippet="25" stylesheet="db/dom2snippet.xsl" />
<schema name="help" stylesheet="db/dom2help.xsl" />
the paths are relative to the directory where zebra.init is placed
The split level decides where the SAX parser shall split the
collections of records into individual records, which are then
loaded into &dom; and have the indexing &xslt; stylesheet applied.
The indexing stylesheet is found by its identifier.
All the other stylesheets are for presentation after search.
- in data/ a short sample of harvested carnivorous plants
ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
- in root also one single data record - nice for testing the xslt
xsltproc db/dom2index.xsl carni*.xml
- in db/ a cql2pqf.txt yaz-client config file
which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
- in db/ an indexing &xslt; stylesheet. This is a PULL-type XSLT stylesheet,
as it constructs the new &xml; structure by pulling data out of the
respective elements/attributes of the old structure.
Notice the special zebra namespace, and the special elements in this
namespace which indicate to the zebra indexer what to do.
<z:record id="67ht7" rank="675" type="update">
indicates that the record with the given id and static rank has to be updated.
<z:index name="title" type="w">
encloses all the text/&xml; which shall be indexed in the index named
"title" and of index type "w" (see file default.idx in your zebra
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t