1 <chapter id="record-model-alvisxslt">
2 <!-- $Id: recordmodel-alvisxslt.xml,v 1.19 2007-05-24 13:44:09 adam Exp $ -->
3 <title>ALVIS &acro.xml; Record Model and Filter Module</title>
7 The functionality of this record model has been improved and
8 replaced by the DOM &acro.xml; record model, see
9 <xref linkend="record-model-domxml"/>. The Alvis &acro.xml; record
10 model is considered obsolete, and will eventually be removed
11 from future releases of the &zebra; software.
16 The record model described in this chapter applies to the fundamental,
18 record type <literal>alvis</literal>, introduced in
19 <xref linkend="componentmodulesalvis"/>.
22 <para> This filter has been developed under the
23 <ulink url="http://www.alvis.info/">ALVIS</ulink> project funded by
24 the European Community under the "Information Society Technologies"
29 <section id="record-model-alvisxslt-filter">
30 <title>ALVIS Record Filter</title>
32 The experimental, loadable Alvis &acro.xml;/&acro.xslt; filter module
<literal>mod-alvis.so</literal> is packaged in the Debian GNU/Linux package
34 <literal>libidzebra1.4-mod-alvis</literal>.
35 It is invoked by the <filename>zebra.cfg</filename> configuration statement
37 recordtype.xml: alvis.db/filter_alvis_conf.xml
In this example, the filter is applied to all data files with suffix
<filename>*.xml</filename>, and the
Alvis &acro.xslt; filter configuration file is found in the
path <filename>db/filter_alvis_conf.xml</filename>.
44 <para>The Alvis &acro.xslt; filter configuration file must be
valid &acro.xml;. It might look like this (this example is
used for indexing and display of &acro.oai; harvested records):
48 <?xml version="1.0" encoding="UTF-8"?>
50 <schema name="identity" stylesheet="xsl/identity.xsl" />
51 <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
52 stylesheet="xsl/oai2index.xsl" />
53 <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
54 <!-- use split level 2 when indexing whole &acro.oai; Record lists -->
55 <split level="2"/>
60 All named stylesheets defined inside
61 <literal>schema</literal> element tags
62 are for presentation after search, including
63 the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
unique; these are the literal <literal>schema</literal> or
66 <literal>element set</literal> names used in
67 <ulink url="http://www.loc.gov/standards/sru/srw/">&acro.srw;</ulink>,
68 <ulink url="&url.sru;">&acro.sru;</ulink> and
69 &acro.z3950; protocol queries.
70 The paths in the <literal>stylesheet</literal> attributes
are relative to &zebra;'s working directory, or absolute to file
The <literal><split level="2"/></literal> element decides where the
&acro.xml; Reader shall split the
collections of records into individual records, which are then
loaded into &acro.dom; and have the indexing &acro.xslt; stylesheet applied.
81 There must be exactly one indexing &acro.xslt; stylesheet, which is
82 defined by the magic attribute
83 <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
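<para>
As an illustration only, the <literal>identity</literal> schema in the
configuration above could be backed by the standard &acro.xslt; identity
transformation, which returns the stored record unchanged. This is a
minimal sketch, and not necessarily the content of the
<filename>xsl/identity.xsl</filename> file used in this example:
<screen><![CDATA[
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- copy every node and attribute of the stored record as-is -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
</para>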
86 <section id="record-model-alvisxslt-internal">
87 <title>ALVIS Internal Record Representation</title>
88 <para>When indexing, an &acro.xml; Reader is invoked to split the input
89 files into suitable record &acro.xml; pieces. Each record piece is then
90 transformed to an &acro.xml; &acro.dom; structure, which is essentially the
record model. Only &acro.xslt; transformations can be applied during
indexing, search and retrieval. Consequently, output formats are
93 restricted to whatever &acro.xslt; can deliver from the record &acro.xml;
94 structure, be it other &acro.xml; formats, HTML, or plain text. In case
95 you have <literal>libxslt1</literal> running with E&acro.xslt; support,
96 you can use this functionality inside the Alvis
97 filter configuration &acro.xslt; stylesheets.
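<para>
As a hedged sketch of what such an extension function might be used for,
the following excerpt assumes that the function
<literal>str:tokenize</literal> from the E&acro.xslt; strings module is
available, and that a single <literal>dc:subject</literal> element holds
several terms separated by semicolons. Both the input convention and the
index name <literal>dc_subject</literal> are assumptions made for
illustration, and the template is meant to be merged into a complete
indexing stylesheet:
<screen><![CDATA[
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:str="http://exslt.org/strings"
  xmlns:z="http://indexdata.dk/zebra/xslt/1"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  extension-element-prefixes="str">

  <!-- assumption: one dc:subject holds several ';'-separated terms -->
  <xsl:template match="dc:subject">
    <!-- str:tokenize returns one token element per term -->
    <xsl:for-each select="str:tokenize(., ';')">
      <!-- dc_subject is a hypothetical index name -->
      <z:index name="dc_subject" type="w">
        <xsl:value-of select="normalize-space(.)"/>
      </z:index>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
</para>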
101 <section id="record-model-alvisxslt-canonical">
102 <title>ALVIS Canonical Indexing Format</title>
103 <para>The output of the indexing &acro.xslt; stylesheets must contain
104 certain elements in the magic
105 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
106 namespace. The output of the &acro.xslt; indexing transformation is then
107 parsed using &acro.dom; methods, and the contained instructions are
108 performed on the <emphasis>magic elements and their
112 For example, the output of the command
114 xsltproc xsl/oai2index.xsl one-record.xml
116 might look like this:
118 <?xml version="1.0" encoding="UTF-8"?>
119 <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
120 z:id="oai:JTRS:CP-3290---Volume-I"
122 <z:index name="oai_identifier" type="0">
123 oai:JTRS:CP-3290---Volume-I</z:index>
124 <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
125 <z:index name="oai_setspec" type="0">jtrs</z:index>
126 <z:index name="dc_all" type="w">
127 <z:index name="dc_title" type="w">Proceedings of the 4th
128 International Conference and Exhibition:
129 World Congress on Superconductivity - Volume I</z:index>
130 <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
131 Burnham, Editors</z:index>
136 <para>This means the following: From the original &acro.xml; file
137 <literal>one-record.xml</literal> (or from the &acro.xml; record &acro.dom; of the
138 same form coming from a split input file), the indexing
139 stylesheet produces an indexing &acro.xml; record, which is defined by
140 the <literal>record</literal> element in the magic namespace
141 <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
142 &zebra; uses the content of
143 <literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
144 record ID, and - in case static ranking is set - the content of
145 <literal>z:rank="47896"</literal> as static rank. Following the
146 discussion in <xref linkend="administration-ranking"/>
we see that this record is internally ordered
148 lexicographically according to the value of the string
149 <literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
150 <!-- The type of action performed during indexing is defined by
151 <literal>z:type="update"></literal>, with recognized values
152 <literal>insert</literal>, <literal>update</literal>, and
153 <literal>delete</literal>. -->
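<para>
To make the role of the two attributes concrete, here is a minimal sketch
of a root template which emits both <literal>z:id</literal> and
<literal>z:rank</literal>. The rank heuristic (the number of Dublin Core
elements in the record) is purely hypothetical; how the resulting value
influences the ordering depends on the static rank settings discussed in
<xref linkend="administration-ranking"/>.
<screen><![CDATA[
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:z="http://indexdata.dk/zebra/xslt/1"
  xmlns:oai="http://www.openarchives.org/OAI/2.0/"
  xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <!-- z:rank: hypothetical heuristic, counts the DC elements -->
    <z:record
      z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
      z:rank="{count(oai:record/oai:metadata/oai_dc:dc/*)}">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
</para>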
155 <para>In this example, the following literal indexes are constructed:
164 where the indexing type is defined in the
165 <literal>type</literal> attribute
166 (any value from the standard configuration
167 file <filename>default.idx</filename> will do). Finally, any
168 <literal>text()</literal> node content recursively contained
169 inside the <literal>index</literal> will be filtered through the
170 appropriate char map for character normalization, and will be
171 inserted in the index.
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted literally,
byte for byte without any form of character normalization,
into the index named <literal>oai_identifier</literal>,
179 <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
180 will be inserted using the <literal>w</literal> character
181 normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
183 normalization the index will keep the individual words
184 <literal>kumar</literal>, <literal>krishen</literal>,
185 <literal>and</literal>, <literal>calvin</literal>,
186 <literal>burnham</literal>, and <literal>editors</literal>), and
187 finally both the texts
188 <literal>Proceedings of the 4th International Conference and Exhibition:
189 World Congress on Superconductivity - Volume I</literal>
191 <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
193 the same character normalization map <literal>w</literal>.
196 Finally, this example configuration can be queried using &acro.pqf;
queries, either transported by &acro.z3950; (here using a yaz-client)
200 Z> open localhost:9999
204 Z> f @attr 1=dc_creator Kumar
205 Z> scan @attr 1=dc_creator adam
207 Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
208 Z> scan @attr 1=dc_title abc
212 extensions <literal>x-pquery</literal> and
213 <literal>x-pScanClause</literal> to
214 &acro.sru;, and &acro.srw;
217 http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
218 http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
221 See <xref linkend="zebrasrv-sru"/> for more information on &acro.sru;/&acro.srw;
222 configuration, and <xref linkend="gfs-config"/> or the &yaz;
223 <ulink url="&url.yaz.cql;">&acro.cql; section</ulink>
for the details of the &yaz; frontend server.
227 Notice that there are no <filename>*.abs</filename>,
228 <filename>*.est</filename>, <filename>*.map</filename>, or other &acro.grs1;
filter configuration files involved in this process, and that the
230 literal index names are used during search and retrieval.
236 <section id="record-model-alvisxslt-conf">
237 <title>ALVIS Record Model Configuration</title>
240 <section id="record-model-alvisxslt-index">
241 <title>ALVIS Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process amounts to
writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the
magic elements discussed in
<xref linkend="record-model-alvisxslt-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our Padawans on the right track to the good side of the force.
253 Stylesheets can be written in the <emphasis>pull</emphasis> or
254 the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
255 means that the output &acro.xml; structure is taken as starting point of
256 the internal structure of the &acro.xslt; stylesheet, and portions of
257 the input &acro.xml; are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output &acro.xml; structure. On the other
hand, <emphasis>push</emphasis> &acro.xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &acro.xml; structure, and are triggered to produce some output &acro.xml;
whenever some special conditions in the input &acro.xml; are
met. The <emphasis>pull</emphasis> type is well-suited for input
264 &acro.xml; with strong and well-defined structure and semantics, like the
265 following &acro.oai; indexing example, whereas the
266 <emphasis>push</emphasis> type might be the only possible way to
267 sort out deeply recursive input &acro.xml; formats.
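<para>
For contrast, a minimal <emphasis>push</emphasis> sketch is given here. It
assumes a hypothetical input format in which <literal>section</literal>
elements, each with a <literal>title</literal> child, may be nested to any
depth, and in which the root element carries an <literal>id</literal>
attribute. The index name <literal>section_title</literal> is likewise an
assumption:
<screen><![CDATA[
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <!-- assumption: the root element has an id attribute -->
    <z:record z:id="{normalize-space(/*/@id)}">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- fires for every section, however deeply it is nested -->
  <xsl:template match="section">
    <z:index name="section_title" type="w">
      <xsl:value-of select="title"/>
    </z:index>
    <!-- let the input structure drive the recursion -->
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
The stylesheet never spells out the shape of the output; templates simply
fire whenever the processor meets a matching node in the input.
</para>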
270 A <emphasis>pull</emphasis> stylesheet example used to index
271 &acro.oai; harvested records could use some of the following template
275 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
276 xmlns:z="http://indexdata.dk/zebra/xslt/1"
xmlns:oai="http://www.openarchives.org/OAI/2.0/"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
279 xmlns:dc="http://purl.org/dc/elements/1.1/"
282 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
284 <!-- disable all default text node output -->
285 <xsl:template match="text()"/>
287 <!-- match on oai xml record root -->
288 <xsl:template match="/">
289 <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
290 <!-- you might want to use z:rank="{some &acro.xslt; function here}" -->
291 <xsl:apply-templates/>
295 <!-- &acro.oai; indexing templates -->
296 <xsl:template match="oai:record/oai:header/oai:identifier">
297 <z:index name="oai_identifier" type="0">
298 <xsl:value-of select="."/>
304 <!-- DC specific indexing templates -->
305 <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
306 <z:index name="dc_title" type="w">
307 <xsl:value-of select="."/>
319 that the names and types of the indexes can be defined in the
320 indexing &acro.xslt; stylesheet <emphasis>dynamically according to
321 content in the original &acro.xml; records</emphasis>, which has
322 opportunities for great power and wizardry as well as grande
326 The following excerpt of a <emphasis>push</emphasis> stylesheet
327 <emphasis>might</emphasis>
be a good idea if you have strict control of the &acro.xml;
input format (due to rigorous checking against well-defined and
tight RELAX NG or &acro.xml; Schemas, for example):
333 <xsl:template name="element-name-indexes">
334 <z:index name="{name()}" type="w">
335 <xsl:value-of select="'1'"/>
340 This template creates indexes which have the name of the working
341 node of any input &acro.xml; file, and assigns a '1' to the index.
343 <literal>find @attr 1=xyz 1</literal>
344 finds all files which contain at least one
<literal>xyz</literal> &acro.xml; element. In case you cannot control
346 which element names the input files contain, you might ask for
347 disaster and bad karma using this technique.
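<para>
A named template such as <literal>element-name-indexes</literal> only
runs when it is called. One possible way to invoke it for every element of
the input, given here as a sketch to be placed inside the same stylesheet,
is a catch-all template:
<screen><![CDATA[
<!-- sketch: call the named template for each input element -->
<xsl:template match="*">
  <!-- the current node stays the same inside the called template,
       so name() yields the name of the matched element -->
  <xsl:call-template name="element-name-indexes"/>
  <!-- continue with the child elements -->
  <xsl:apply-templates select="*"/>
</xsl:template>
]]></screen>
Note that <literal>name()</literal> returns the qualified name including
any namespace prefix; <literal>local-name()</literal> could be used
instead if index names without prefixes are preferred.
</para>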
One variation on the theme <emphasis>dynamically created
351 indexes</emphasis> will definitely be unwise:
354 <!-- match on oai xml record root -->
355 <xsl:template match="/">
358 <!-- create dynamic index name from input content -->
359 <xsl:variable name="dynamic_content">
360 <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
363 <!-- create zillions of indexes with unknown names -->
364 <z:index name="{$dynamic_content}" type="w">
365 <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
372 Don't be tempted to cross
373 the line to the dark side of the force, Padawan; this leads
374 to suffering and pain, and universal
375 disintegration of your project schedule.
379 <section id="record-model-alvisxslt-elementset">
380 <title>ALVIS Exchange Formats</title>
382 An exchange format can be anything which can be the outcome of an
&acro.xslt; transformation, as long as the stylesheet is registered in
384 the main Alvis &acro.xslt; filter configuration file, see
385 <xref linkend="record-model-alvisxslt-filter"/>.
In principle anything that can be expressed in &acro.xml;, HTML, or
plain text can be the output of a <literal>schema</literal> or
388 <literal>element set</literal> directive during search, as long as
389 the information comes from the
390 <emphasis>original input record &acro.xml; &acro.dom; tree</emphasis>
391 (and not the transformed and <emphasis>indexed</emphasis> &acro.xml;!!).
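<para>
As a minimal sketch of such an exchange format, the following stylesheet
returns only the Dublin Core part of an &acro.oai; record. It works on the
original input record and is an illustration only, not necessarily the
content of the <filename>xsl/oai2dc.xsl</filename> file registered in the
filter configuration:
<screen><![CDATA[
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:oai="http://www.openarchives.org/OAI/2.0/"
  xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- copy the oai_dc:dc subtree of the original record verbatim -->
  <xsl:template match="/">
    <xsl:copy-of select="oai:record/oai:metadata/oai_dc:dc"/>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
</para>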
394 In addition, internal administrative information from the &zebra;
395 indexer can be accessed during record retrieval. The following
396 example is a summary of the possibilities:
399 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
400 xmlns:z="http://indexdata.dk/zebra/xslt/1"
403 <!-- register internal zebra parameters -->
404 <xsl:param name="id" select="''"/>
405 <xsl:param name="filename" select="''"/>
406 <xsl:param name="score" select="''"/>
407 <xsl:param name="schema" select="''"/>
409 <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
<!-- use them for display of internal information -->
412 <xsl:template match="/">
414 <id><xsl:value-of select="$id"/></id>
415 <filename><xsl:value-of select="$filename"/></filename>
416 <score><xsl:value-of select="$score"/></score>
417 <schema><xsl:value-of select="$schema"/></schema>
428 <section id="record-model-alvisxslt-example">
429 <title>ALVIS Filter &acro.oai; Indexing Example</title>
431 The source code tarball contains a working Alvis filter example in
432 the directory <filename>examples/alvis-oai/</filename>, which
433 should get you started.
436 More example data can be harvested from any &acro.oai; compliant server,
437 see details at the &acro.oai;
438 <ulink url="http://www.openarchives.org/">
439 http://www.openarchives.org/</ulink> web site, and the community
441 <ulink url="http://www.openarchives.org/community/index.html">
442 http://www.openarchives.org/community/index.html</ulink>.
445 <ulink url="http://www.oaforum.org/tutorial/">
446 http://www.oaforum.org/tutorial/</ulink>.
457 <!-- Keep this comment at the end of the file
462 sgml-minimize-attributes:nil
463 sgml-always-quote-attributes:t
466 sgml-parent-document: "zebra.xml"
467 sgml-local-catalogs: nil
468 sgml-namecase-general:t