1 <chapter id="record-model-alvisxslt">
2 <!-- $Id: recordmodel-alvisxslt.xml,v 1.11 2006-11-13 14:53:40 marc Exp $ -->
3 <title>ALVIS XML Record Model and Filter Module</title>
7 The record model described in this chapter applies to the fundamental,
9 record type <literal>alvis</literal>, introduced in
10 <xref linkend="componentmodulesalvis"/>. The ALVIS XML record model
is experimental, and its inner workings might change in future
12 releases of the Zebra Information Server.
15 <para> This filter has been developed under the
16 <ulink url="http://www.alvis.info/">ALVIS</ulink> project funded by
17 the European Community under the "Information Society Technologies"
22 <section id="record-model-alvisxslt-filter">
23 <title>ALVIS Record Filter</title>
25 The experimental, loadable Alvis XML/XSLT filter module
26 <literal>mod-alvis.so</literal> is packaged in the GNU/Debian package
27 <literal>libidzebra1.4-mod-alvis</literal>.
28 It is invoked by the <filename>zebra.cfg</filename> configuration statement
30 recordtype.xml: alvis.db/filter_alvis_conf.xml
In this example the filter is applied to all data files with suffix
<filename>*.xml</filename>, and the
Alvis XSLT filter configuration file is found at the
path <filename>db/filter_alvis_conf.xml</filename>.
<para>The Alvis XSLT filter configuration file must be
valid XML. It might look like this (this example is
used for indexing and display of OAI-harvested records):
41 <?xml version="1.0" encoding="UTF-8"?>
43 <schema name="identity" stylesheet="xsl/identity.xsl" />
44 <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
45 stylesheet="xsl/oai2index.xsl" />
46 <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
47 <!-- use split level 2 when indexing whole OAI Record lists -->
48 <split level="2"/>
53 All named stylesheets defined inside
54 <literal>schema</literal> element tags
55 are for presentation after search, including
56 the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
unique; these are the literal <literal>schema</literal> or
59 <literal>element set</literal> names used in
60 <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>,
61 <ulink url="&url.sru;">SRU</ulink> and
62 Z39.50 protocol queries.
63 The paths in the <literal>stylesheet</literal> attributes
are relative to Zebra's working directory, or absolute to file
The <literal>&lt;split level="2"/&gt;</literal> element decides where the
69 XML Reader shall split the
70 collections of records into individual records, which then are
71 loaded into DOM, and have the indexing XSLT stylesheet applied.
74 There must be exactly one indexing XSLT stylesheet, which is
75 defined by the magic attribute
76 <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
79 <section id="record-model-alvisxslt-internal">
80 <title>ALVIS Internal Record Representation</title>
81 <para>When indexing, an XML Reader is invoked to split the input
82 files into suitable record XML pieces. Each record piece is then
83 transformed to an XML DOM structure, which is essentially the
84 record model. Only XSLT transformations can be applied during
85 index, search and retrieval. Consequently, output formats are
86 restricted to whatever XSLT can deliver from the record XML
87 structure, be it other XML formats, HTML, or plain text. In case
88 you have <literal>libxslt1</literal> running with EXSLT support,
89 you can use this functionality inside the Alvis
90 filter configuration XSLT stylesheets.
94 <section id="record-model-alvisxslt-canonical">
95 <title>ALVIS Canonical Indexing Format</title>
96 <para>The output of the indexing XSLT stylesheets must contain
97 certain elements in the magic
98 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
99 namespace. The output of the XSLT indexing transformation is then
100 parsed using DOM methods, and the contained instructions are
101 performed on the <emphasis>magic elements and their
105 For example, the output of the command
107 xsltproc xsl/oai2index.xsl one-record.xml
109 might look like this:
111 <?xml version="1.0" encoding="UTF-8"?>
112 <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
113 z:id="oai:JTRS:CP-3290---Volume-I"
116 <z:index name="oai_identifier" type="0">
117 oai:JTRS:CP-3290---Volume-I</z:index>
118 <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
119 <z:index name="oai_setspec" type="0">jtrs</z:index>
120 <z:index name="dc_all" type="w">
121 <z:index name="dc_title" type="w">Proceedings of the 4th
122 International Conference and Exhibition:
123 World Congress on Superconductivity - Volume I</z:index>
124 <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
125 Burnham, Editors</z:index>
130 <para>This means the following: From the original XML file
131 <literal>one-record.xml</literal> (or from the XML record DOM of the
same form coming from a split input file), the indexing
133 stylesheet produces an indexing XML record, which is defined by
134 the <literal>record</literal> element in the magic namespace
135 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
136 Zebra uses the content of
137 <literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
138 record ID, and - in case static ranking is set - the content of
139 <literal>z:rank="47896"</literal> as static rank. Following the
140 discussion in <xref linkend="administration-ranking"/>
we see that this record is internally ordered
142 lexicographically according to the value of the string
143 <literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
144 The type of action performed during indexing is defined by
<literal>z:type="update"</literal>, with recognized values
146 <literal>insert</literal>, <literal>update</literal>, and
147 <literal>delete</literal>.
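<para>
As a minimal sketch of how the action type could be chosen per record,
the following pull-style stylesheet sets <literal>z:type</literal> to
<literal>delete</literal> when the OAI header carries the
<literal>status="deleted"</literal> attribute defined by OAI-PMH, and to
<literal>update</literal> otherwise. This mapping is an illustrative
assumption; it is not taken from the distributed example stylesheets.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
      <!-- assumption: a deleted OAI record should be removed from the index -->
      <xsl:attribute name="z:type">
        <xsl:choose>
          <xsl:when test="oai:record/oai:header/@status = 'deleted'">delete</xsl:when>
          <xsl:otherwise>update</xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
]]></screen>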
149 <para>In this example, the following literal indexes are constructed:
158 where the indexing type is defined in the
159 <literal>type</literal> attribute
160 (any value from the standard configuration
161 file <filename>default.idx</filename> will do). Finally, any
162 <literal>text()</literal> node content recursively contained
163 inside the <literal>index</literal> will be filtered through the
164 appropriate charmap for character normalization, and will be
165 inserted in the index.
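<para>
To illustrate the recursive handling of text nodes, assume an indexing
stylesheet emitted the fragment below. The record ID and the nested
<literal>title</literal> and <literal>sub</literal> elements are
hypothetical and serve only as an example of markup inside an
<literal>index</literal> element.
</para>
<screen><![CDATA[
<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
          z:id="example-1" z:type="update">
  <!-- hypothetical nested markup: the text() nodes at every depth
       are what gets normalized and inserted into dc_title -->
  <z:index name="dc_title" type="w">
    <title>World Congress on <sub>Superconductivity</sub></title>
  </z:index>
</z:record>
]]></screen>
<para>
Following the rule above, one would expect the words of both text
nodes to end up in the <literal>dc_title</literal> index after
<literal>w</literal> normalization.
</para>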
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted
literally, byte for byte and without any form of character normalization,
into the index named <literal>oai_identifier</literal>,
the text
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted using the <literal>w</literal> character
normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
normalization the index will keep the individual words
<literal>kumar</literal>, <literal>krishen</literal>,
<literal>and</literal>, <literal>calvin</literal>,
<literal>burnham</literal>, and <literal>editors</literal>), and
finally both the texts
<literal>Proceedings of the 4th International Conference and Exhibition:
World Congress on Superconductivity - Volume I</literal>
and
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
the same character normalization map <literal>w</literal>.
Finally, this example configuration can be queried using PQF
queries, either transported by Z39.50 (here using the yaz-client)
194 Z> open localhost:9999
Z> f @attr 1=dc_creator Kumar
Z> scan @attr 1=dc_creator adam
Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
Z> scan @attr 1=dc_title abc
extensions <literal>x-pquery</literal> and
207 <literal>x-pScanClause</literal> to
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
215 See <xref linkend="zebrasrv-sru"/> for more information on SRU/SRW
216 configuration, and <xref linkend="gfs-config"/> or the YAZ
217 <ulink url="&url.yaz.cql;">CQL section</ulink>
for the details of the YAZ frontend server.
221 Notice that there are no <filename>*.abs</filename>,
222 <filename>*.est</filename>, <filename>*.map</filename>, or other GRS-1
filter configuration files involved in this process, and that the
224 literal index names are used during search and retrieval.
230 <section id="record-model-alvisxslt-conf">
231 <title>ALVIS Record Model Configuration</title>
234 <section id="record-model-alvisxslt-index">
235 <title>ALVIS Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process amounts to
writing an XSLT stylesheet which produces XML output containing the
magic elements discussed in
<xref linkend="record-model-alvisxslt-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our padawans on the right track to the good side of the force.
247 Stylesheets can be written in the <emphasis>pull</emphasis> or
248 the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
249 means that the output XML structure is taken as starting point of
250 the internal structure of the XSLT stylesheet, and portions of
251 the input XML are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output XML structure. On the other
hand, <emphasis>push</emphasis> XSLT stylesheets recursively
call their template definitions, a process which is driven
by the input XML structure, and awake to produce some output XML
whenever some special conditions in the input stylesheets are
met. The <emphasis>pull</emphasis> type is well-suited for input
XML with strong and well-defined structure and semantics, like the
259 following OAI indexing example, whereas the
260 <emphasis>push</emphasis> type might be the only possible way to
261 sort out deeply recursive input XML formats.
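<para>
For contrast with the pull style, here is a minimal push-style sketch.
It assumes a hypothetical input format in which
<literal>title</literal> elements (in no namespace) may occur at any
nesting depth and the root element carries an <literal>id</literal>
attribute; the index name <literal>any_title</literal> is likewise
made up for illustration.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- record wrapper; @id is a hypothetical identifier attribute -->
  <xsl:template match="/*">
    <z:record z:id="{@id}" z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push: fires whenever the traversal meets a title element,
       however deeply it is nested in the input -->
  <xsl:template match="title">
    <z:index name="any_title" type="w">
      <xsl:value-of select="."/>
    </z:index>
    <!-- keep descending in case further titles are nested below -->
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
<para>
All other elements fall through to the built-in XSLT template rules,
which simply continue the traversal, so the stylesheet needs no
knowledge of the overall document structure.
</para>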
264 A <emphasis>pull</emphasis> stylesheet example used to index
265 OAI harvested records could use some of the following template
269 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
270 xmlns:z="http://indexdata.dk/zebra/xslt/1"
271 xmlns:oai="http://www.openarchives.org/OAI/2.0/"
272 xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
273 xmlns:dc="http://purl.org/dc/elements/1.1/"
276 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
278 <!-- disable all default text node output -->
279 <xsl:template match="text()"/>
281 <!-- match on oai xml record root -->
282 <xsl:template match="/">
283 <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
285 <!-- you might want to use z:rank="{some XSLT function here}" -->
286 <xsl:apply-templates/>
290 <!-- OAI indexing templates -->
291 <xsl:template match="oai:record/oai:header/oai:identifier">
292 <z:index name="oai_identifier" type="0">
293 <xsl:value-of select="."/>
299 <!-- DC specific indexing templates -->
300 <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
301 <z:index name="dc_title" type="w">
302 <xsl:value-of select="."/>
314 that the names and types of the indexes can be defined in the
315 indexing XSLT stylesheet <emphasis>dynamically according to
content in the original XML records</emphasis>, which opens
opportunities for great power and wizardry as well as grand
The following excerpt of a <emphasis>push</emphasis> stylesheet
<emphasis>might</emphasis>
be a good idea, provided you have strict control of the XML
input format (due to rigorous checking against well-defined and
tight RelaxNG or XML Schema definitions, for example):
328 <xsl:template name="element-name-indexes">
329 <z:index name="{name()}" type="w">
330 <xsl:value-of select="'1'"/>
335 This template creates indexes which have the name of the working
336 node of any input XML file, and assigns a '1' to the index.
338 <literal>find @attr 1=xyz 1</literal>
339 finds all files which contain at least one
<literal>xyz</literal> XML element. In case you cannot control
341 which element names the input files contain, you might ask for
342 disaster and bad karma using this technique.
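<para>
One way to keep this technique under control is to create
element-name indexes only for a fixed list of names. The sketch below
is an assumption about how that could be done; the element names
<literal>title</literal>, <literal>creator</literal> and
<literal>subject</literal>, the mode name <literal>names</literal>,
and the <literal>id</literal> attribute on the root element are all
hypothetical.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- record wrapper; @id is a hypothetical identifier attribute -->
  <xsl:template match="/*">
    <z:record z:id="{@id}" z:type="update">
      <!-- visit every element of the record exactly once -->
      <xsl:apply-templates select="descendant-or-self::*" mode="names"/>
    </z:record>
  </xsl:template>

  <!-- only these (no-namespace) element names may create an index -->
  <xsl:template match="title | creator | subject" mode="names">
    <z:index name="{local-name()}" type="w">
      <xsl:value-of select="'1'"/>
    </z:index>
  </xsl:template>

  <!-- every other element stays silent -->
  <xsl:template match="*" mode="names"/>

</xsl:stylesheet>
]]></screen>
<para>
Because the set of index names is closed, unexpected element names in
the input cannot create new indexes.
</para>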
One variation on the theme <emphasis>dynamically created
346 indexes</emphasis> will definitely be unwise:
349 <!-- match on oai xml record root -->
350 <xsl:template match="/">
351 <z:record z:type="update">
353 <!-- create dynamic index name from input content -->
354 <xsl:variable name="dynamic_content">
355 <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
358 <!-- create zillions of indexes with unknown names -->
359 <z:index name="{$dynamic_content}" type="w">
360 <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
Don't be tempted to cross
the line to the dark side of the force, padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
374 <section id="record-model-alvisxslt-elementset">
375 <title>ALVIS Exchange Formats</title>
An exchange format can be anything which can be the outcome of an
XSLT transformation, as long as the stylesheet is registered in
the main Alvis XSLT filter configuration file, see
<xref linkend="record-model-alvisxslt-filter"/>.
In principle anything that can be expressed in XML, HTML, or
plain text can be the output of a <literal>schema</literal> or
<literal>element set</literal> directive during search, as long as
the information comes from the
<emphasis>original input record XML DOM tree</emphasis>
(and not from the transformed and <emphasis>indexed</emphasis> XML).
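<para>
As a minimal sketch, a presentation stylesheet registered under the
schema name <literal>dc</literal> could simply copy the Dublin Core
part out of the original OAI record. This is an assumption about what
such a stylesheet might contain; the stylesheet actually shipped as
<filename>xsl/oai2dc.xsl</filename> may differ.
</para>
<screen><![CDATA[
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- the input here is the original OAI record,
       not the z:record XML produced by the indexing stylesheet -->
  <xsl:template match="/">
    <xsl:copy-of select="oai:record/oai:metadata/oai_dc:dc"/>
  </xsl:template>

</xsl:stylesheet>
]]></screen>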
389 In addition, internal administrative information from the Zebra
390 indexer can be accessed during record retrieval. The following
391 example is a summary of the possibilities:
394 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
395 xmlns:z="http://indexdata.dk/zebra/xslt/1"
398 <!-- register internal zebra parameters -->
399 <xsl:param name="id" select="''"/>
400 <xsl:param name="filename" select="''"/>
401 <xsl:param name="score" select="''"/>
402 <xsl:param name="schema" select="''"/>
404 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
<!-- use them for display of internal information -->
407 <xsl:template match="/">
409 <id><xsl:value-of select="$id"/></id>
410 <filename><xsl:value-of select="$filename"/></filename>
411 <score><xsl:value-of select="$score"/></score>
412 <schema><xsl:value-of select="$schema"/></schema>
423 <section id="record-model-alvisxslt-example">
424 <title>ALVIS Filter OAI Indexing Example</title>
The source code tarball contains a working Alvis filter example in
427 the directory <filename>examples/alvis-oai/</filename>, which
428 should get you started.
More example data can be harvested from any OAI-compliant server,
432 see details at the OAI
433 <ulink url="http://www.openarchives.org/">
434 http://www.openarchives.org/</ulink> web site, and the community
436 <ulink url="http://www.openarchives.org/community/index.html">
437 http://www.openarchives.org/community/index.html</ulink>.
440 <ulink url="http://www.oaforum.org/tutorial/">
441 http://www.oaforum.org/tutorial/</ulink>.
453 c) Main "alvis" XSLT filter config file:
454 cat db/filter_alvis_conf.xml
<?xml version="1.0" encoding="UTF-8"?>
458 <schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
459 <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
460 stylesheet="db/alvis2index.xsl" />
461 <schema name="dc" stylesheet="db/alvis2dc.xsl" />
462 <schema name="dc-short" stylesheet="db/alvis2dc_short.xsl" />
463 <schema name="snippet" snippet="25" stylesheet="db/alvis2snippet.xsl" />
464 <schema name="help" stylesheet="db/alvis2help.xsl" />
468 the paths are relative to the directory where zebra.init is placed
471 The split level decides where the SAX parser shall split the
472 collections of records into individual records, which then are
473 loaded into DOM, and have the indexing XSLT stylesheet applied.
The indexing stylesheet is found by its identifier.
477 All the other stylesheets are for presentation after search.
479 - in data/ a short sample of harvested carnivorous plants
480 ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
482 - in root also one single data record - nice for testing the xslt
485 xsltproc db/alvis2index.xsl carni*.xml
489 - in db/ a cql2pqf.txt yaz-client config file
490 which is also used in the yaz-server <ulink url="&url.cql;">CQL</ulink>-to-PQF process
492 see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
494 - in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT thing,
495 as it constructs the new XML structure by pulling data out of the
496 respective elements/attributes of the old structure.
498 Notice the special zebra namespace, and the special elements in this
499 namespace which indicate to the zebra indexer what to do.
501 <z:record id="67ht7" rank="675" type="update">
502 indicates that a new record with given id and static rank has to be updated.
504 <z:index name="title" type="w">
505 encloses all the text/XML which shall be indexed in the index named
506 "title" and of index type "w" (see file default.idx in your zebra
518 <!-- Keep this comment at the end of the file
523 sgml-minimize-attributes:nil
524 sgml-always-quote-attributes:t
527 sgml-parent-document: "zebra.xml"
528 sgml-local-catalogs: nil
529 sgml-namecase-general:t