ALVIS &acro.xml; Record Model and Filter Module
The functionality of this record model has been improved and
replaced by the DOM &acro.xml; record model. The Alvis &acro.xml; record
model is considered obsolete and will eventually be removed
from future releases of the &zebra; software.
The record model described in this chapter applies to the fundamental,
structured &acro.xml;
record type alvis.
This filter has been developed under the
ALVIS project funded by
the European Community under the "Information Society Technologies"
Program (2002-2006).
ALVIS Record Filter
The experimental, loadable Alvis &acro.xml;/&acro.xslt; filter module
mod-alvis.so is packaged in the Debian GNU/Linux package
libidzebra1.4-mod-alvis.
It is invoked by the zebra.cfg configuration statement
recordtype.xml: alvis.db/filter_alvis_conf.xml
In this example, the filter is applied to all data files with the suffix
*.xml, and the
Alvis &acro.xslt; filter configuration file is found at the
path db/filter_alvis_conf.xml.
The Alvis &acro.xslt; filter configuration file must be
valid &acro.xml;. It might look like this (the example is
used for indexing and display of &acro.oai;-harvested records):
<?xml version="1.0" encoding="UTF-8"?>
<schemaInfo>
<schema name="identity" stylesheet="xsl/identity.xsl" />
<schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
stylesheet="xsl/oai2index.xsl" />
<schema name="dc" stylesheet="xsl/oai2dc.xsl" />
<!-- use split level 2 when indexing whole OAI Record lists -->
<split level="2"/>
</schemaInfo>
All named stylesheets defined inside
schema element tags
are available for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the name attributes must be
unique; they are the literal schema or
element set names used in
&acro.srw;,
&acro.sru;, and
&acro.z3950; protocol queries.
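As an illustration, the identity schema in the configuration above could be served by a plain copy-through stylesheet. The following is only a sketch of what xsl/identity.xsl might contain, namely the standard &acro.xslt; 1.0 identity transform; the file actually shipped with &zebra; may differ.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- copy every element, attribute, text node, comment and
       processing instruction of the stored record unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With such a stylesheet registered, requesting the element set name identity would simply return the original record &acro.xml;.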
The paths in the stylesheet attributes
are either relative to &zebra;'s working directory or absolute paths
from the file system root.
The <split level="2"/> element determines where the
&acro.xml; Reader splits
collections of records into individual records, which are then
loaded into &acro.dom; and have the indexing &acro.xslt; stylesheet applied.
There must be exactly one indexing &acro.xslt; stylesheet, which is
defined by the magic attribute
identifier="http://indexdata.dk/zebra/xslt/1".
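To make the split level concrete, here is a sketch of an &acro.oai; ListRecords harvest file. It assumes that the level counts element depth with the document root at depth 0 (which is how libxml2's &acro.xml; Reader reports depth); the dates, URL and elided record contents are placeholders, not data from a real server.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- depth 0: document root -->
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-07-09T12:00:00Z</responseDate>
  <request verb="ListRecords"
           metadataPrefix="oai_dc">http://example.org/oai</request>
  <!-- depth 1: the list wrapper -->
  <ListRecords>
    <!-- depth 2: with split level 2, each record element
         would become one individual record -->
    <record>
      <header><!-- identifier, datestamp, setSpec --></header>
      <metadata><!-- oai_dc:dc payload --></metadata>
    </record>
    <record>
      <header><!-- ... --></header>
      <metadata><!-- ... --></metadata>
    </record>
  </ListRecords>
</OAI-PMH>
```

Under the same assumption, a data file that holds a single record as its root element would not need a split directive at all.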
ALVIS Internal Record Representation
When indexing, an &acro.xml; Reader is invoked to split the input
files into suitable record &acro.xml; pieces. Each record piece is then
transformed to an &acro.xml; &acro.dom; structure, which is essentially the
record model. Only &acro.xslt; transformations can be applied during
indexing, search, and retrieval. Consequently, output formats are
restricted to whatever &acro.xslt; can deliver from the record &acro.xml;
structure, be it other &acro.xml; formats, HTML, or plain text. If
your libxslt1 is built with EXSLT support,
you can use this functionality inside the Alvis
filter configuration &acro.xslt; stylesheets.
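As a hedged sketch of what EXSLT use might look like, the following stylesheet splits a semicolon-separated keyword list with the EXSLT function str:tokenize() and emits one index entry per keyword. The input element keywords and the index name keyword are hypothetical and chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:str="http://exslt.org/strings"
    exclude-result-prefixes="str">

  <!-- suppress default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- hypothetical input element holding lists like "a; b; c" -->
  <xsl:template match="keywords">
    <!-- str:tokenize() returns one token element per piece -->
    <xsl:for-each select="str:tokenize(string(.), ';')">
      <z:index name="keyword" type="w">
        <xsl:value-of select="normalize-space(.)"/>
      </z:index>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Whether the extension functions are available depends on how the local libxslt was built, so a stylesheet like this should be tried with xsltproc before it is put into the filter configuration.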
ALVIS Canonical Indexing Format
The output of the indexing &acro.xslt; stylesheets must contain
certain elements in the magic
xmlns:z="http://indexdata.dk/zebra/xslt/1"
namespace. The output of the &acro.xslt; indexing transformation is then
parsed using &acro.dom; methods, and the contained instructions are
performed on the magic elements and their
subtrees.
For example, the output of the command
xsltproc xsl/oai2index.xsl one-record.xml
might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
z:id="oai:JTRS:CP-3290---Volume-I"
z:rank="47896">
<z:index name="oai_identifier" type="0">
oai:JTRS:CP-3290---Volume-I</z:index>
<z:index name="oai_datestamp" type="0">2004-07-09</z:index>
<z:index name="oai_setspec" type="0">jtrs</z:index>
<z:index name="dc_all" type="w">
<z:index name="dc_title" type="w">Proceedings of the 4th
International Conference and Exhibition:
World Congress on Superconductivity - Volume I</z:index>
<z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
Burnham, Editors</z:index>
</z:index>
</z:record>
This means the following: From the original &acro.xml; file
one-record.xml (or from the &acro.xml; record &acro.dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &acro.xml; record, which is defined by
the record element in the magic namespace
xmlns:z="http://indexdata.dk/zebra/xslt/1".
&zebra; uses the content of
z:id="oai:JTRS:CP-3290---Volume-I" as the internal
record ID and, in case static ranking is set, the content of
z:rank="47896" as the static rank. As a result,
this record is internally ordered
lexicographically according to the value of the string
oai:JTRS:CP-3290---Volume-I47896.
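For orientation, an input file one-record.xml that could give rise to the indexing record above might look roughly as follows. This is a reconstruction from the output shown, assuming a standard &acro.oai; record with Dublin Core metadata; it is not the actual file from the distribution.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:JTRS:CP-3290---Volume-I</identifier>
    <datestamp>2004-07-09</datestamp>
    <setSpec>jtrs</setSpec>
  </header>
  <metadata>
    <oai_dc:dc
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Proceedings of the 4th
        International Conference and Exhibition:
        World Congress on Superconductivity - Volume I</dc:title>
      <dc:creator>Kumar Krishen and *Calvin
        Burnham, Editors</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>
```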
In this example, the following literal indexes are constructed:
oai_identifier
oai_datestamp
oai_setspec
dc_all
dc_title
dc_creator
where the indexing type is defined in the
type attribute
(any value from the standard configuration
file default.idx will do). Finally, any
text() node content recursively contained
inside the index will be filtered through the
appropriate char map for character normalization, and will be
inserted in the index.
Specific to this example, we see that the single word
oai:JTRS:CP-3290---Volume-I will be inserted literally,
byte for byte and without any form of character normalization,
into the index named oai_identifier;
the text
Kumar Krishen and *Calvin Burnham, Editors
will be inserted using the w character
normalization defined in default.idx into
the index dc_creator (that is, after character
normalization the index will keep the individual words
kumar, krishen,
and, calvin,
burnham, and editors); and
finally both the texts
Proceedings of the 4th International Conference and Exhibition:
World Congress on Superconductivity - Volume I
and
Kumar Krishen and *Calvin Burnham, Editors
will be inserted into the index dc_all using
the same character normalization map w.
Finally, this example configuration can be queried using &acro.pqf;
queries, either transported by &acro.z3950; (here using yaz-client)
open localhost:9999
Z> elem dc
Z> form xml
Z>
Z> f @attr 1=dc_creator Kumar
Z> scan @attr 1=dc_creator adam
Z>
Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
Z> scan @attr 1=dc_title abc
or by the proprietary
extensions x-pquery and
x-pScanClause to
&acro.sru; and &acro.srw;.
More information on &acro.sru;/&acro.srw;
configuration and on the &yaz; frontend server can be found
in the corresponding sections of the documentation, including the &yaz;
&acro.cql; section.
Notice that there are no *.abs,
*.est, *.map, or other &acro.grs1;
filter configuration files involved in this process, and that the
literal index names are used during search and retrieval.
ALVIS Record Model Configuration
ALVIS Indexing Configuration
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process amounts to
writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the
magic elements discussed in the section on the canonical indexing
format above.
Obviously, there are a million different ways to accomplish this
task, and some comments and code snippets are in order to lead
our Padawans on the right track to the good side of the force.
Stylesheets can be written in the pull or
the push style: pull
means that the output &acro.xml; structure is taken as the starting point of
the internal structure of the &acro.xslt; stylesheet, and portions of
the input &acro.xml; are pulled out and inserted
into the right spots of the output &acro.xml; structure. On the other
hand, push &acro.xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &acro.xml; structure, and are triggered to produce some output &acro.xml;
whenever some special conditions in the input &acro.xml; are
met. The pull type is well-suited for input
&acro.xml; with strong and well-defined structure and semantics, like the
following &acro.oai; indexing example, whereas the
push type might be the only possible way to
sort out deeply recursive input &acro.xml; formats.
A pull stylesheet example used to index
&acro.oai; harvested records could use some of the following template
definitions:
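A minimal sketch of such a stylesheet, assuming the usual &acro.oai; 2.0 and Dublin Core namespaces and the index names used earlier in this chapter, might look like this. It is an illustration only and is not necessarily identical to the xsl/oai2index.xsl file in the distribution.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai oai_dc dc">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- match on the OAI record root and emit the record wrapper -->
  <xsl:template match="/">
    <z:record
        z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
      <!-- a z:rank attribute could be computed here as well -->
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- OAI header indexing template -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier" type="0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- Dublin Core indexing template -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- further templates for datestamp, setSpec, creator, etc.
       would follow the same pattern -->

</xsl:stylesheet>
```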
Notice also
that the names and types of the indexes can be defined in the
indexing &acro.xslt; stylesheet dynamically according to
content in the original &acro.xml; records, which offers
opportunities for great power and wizardry as well as grand
disaster.
The following excerpt of a push stylesheet
might
be a good idea if you have strict control of the &acro.xml;
input format (due to rigorous checking against well-defined and
tight RelaxNG or &acro.xml; Schemas, for example):
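A sketch of the idea, wrapped in a complete stylesheet so that it can be tried with xsltproc, is given here. The use of local-name() (so that namespace prefixes do not end up in index names) is a choice made for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <xsl:template match="/">
    <z:record>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push style: fires for every element met in the input,
       whatever its name -->
  <xsl:template match="*">
    <!-- the index is named after the current element -->
    <z:index name="{local-name()}" type="w">
      <xsl:value-of select="'1'"/>
    </z:index>
    <!-- continue into the children -->
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```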
This template creates indexes which have the name of the working
node of any input &acro.xml; file, and assigns a '1' to the index.
The example query
find @attr 1=xyz 1
finds all files which contain at least one
xyz &acro.xml; element. If you cannot control
which element names the input files contain, you are asking for
disaster and bad karma by using this technique.
One variation on the theme of dynamically created
indexes will definitely be unwise:
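The anti-pattern in question can be sketched as follows, again assuming &acro.oai; input with Dublin Core metadata: the index name is taken from record content, so every record creates its own index with an unpredictable name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    exclude-result-prefixes="oai oai_dc">

  <!-- match on the OAI record root -->
  <xsl:template match="/">
    <z:record>

      <!-- DO NOT DO THIS: index name derived from input content -->
      <xsl:variable name="dynamic_content"
          select="normalize-space(oai:record/oai:header/oai:identifier)"/>

      <!-- creates zillions of indexes with unknown names -->
      <z:index name="{$dynamic_content}" type="w">
        <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
      </z:index>

    </z:record>
  </xsl:template>

</xsl:stylesheet>
```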
Don't be tempted to cross
the line to the dark side of the force, Padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
ALVIS Exchange Formats
An exchange format can be anything which can be the outcome of an
&acro.xslt; transformation, as long as the stylesheet is registered in
the main Alvis &acro.xslt; filter configuration file, described in
the ALVIS Record Filter section above.
In principle, anything that can be expressed in &acro.xml;, HTML, or
plain text can be the output of a schema or
element set directive during search, as long as
the information comes from the
original input record &acro.xml; &acro.dom; tree
(and not from the transformed and indexed &acro.xml;!).
In addition, internal administrative information from the &zebra;
indexer can be accessed during record retrieval. The following
example is a summary of the possibilities:
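A sketch of such a retrieval stylesheet is shown here. It assumes that the filter passes the internal information as &acro.xslt; parameters named id, filename, score, and schema; these parameter names and the wrapper element z:zebra are assumptions of this illustration and should be checked against the filter's documentation or source before relying on them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <!-- declare the (assumed) internal parameters with empty
       defaults, so the stylesheet also runs outside Zebra -->
  <xsl:param name="id" select="''"/>
  <xsl:param name="filename" select="''"/>
  <xsl:param name="score" select="''"/>
  <xsl:param name="schema" select="''"/>

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- display the internal information for the retrieved record -->
  <xsl:template match="/">
    <z:zebra>
      <id><xsl:value-of select="$id"/></id>
      <filename><xsl:value-of select="$filename"/></filename>
      <score><xsl:value-of select="$score"/></score>
      <schema><xsl:value-of select="$schema"/></schema>
    </z:zebra>
  </xsl:template>

</xsl:stylesheet>
```

Registered under its own schema name in the filter configuration file, a stylesheet along these lines could be requested like any other element set.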
ALVIS Filter &acro.oai; Indexing Example
The source code tarball contains a working Alvis filter example in
the directory examples/alvis-oai/, which
should get you started.
More example data can be harvested from any &acro.oai; compliant server,
see details at the &acro.oai;
http://www.openarchives.org/ web site, and the community
links at
http://www.openarchives.org/community/index.html.
A tutorial can be
found at
http://www.oaforum.org/tutorial/.