&acro.dom; &acro.xml; Record Model and Filter Module
The record model described in this chapter applies to the fundamental,
structured &acro.xml;
record type &acro.dom;. The &acro.dom; &acro.xml; record model
is experimental, and its inner workings might change in future
releases of the &zebra; Information Server.
&acro.dom; Record Filter Architecture
The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
internal data model, and can therefore parse, index, and display
any &acro.xml; document type. It is well suited to work on
standardized &acro.xml;-based formats such as Dublin Core, MODS, METS,
MARCXML, OAI-PMH and RSS, and it performs equally well on any
non-standard &acro.xml; format.
A parser for binary &acro.marc; records based on the ISO2709
standard is provided; it transforms these into the internal
&acro.marcxml; &acro.dom; representation. Other binary document parsers
are planned to follow.
The &acro.dom; filter architecture consists of four
different pipelines, each being a chain of arbitrarily many successive
&acro.xslt; transformations of the internal &acro.dom; &acro.xml;
representations of documents.
&acro.dom; &acro.xml; filter pipelines overview:

input (first): input parsing and initial transformations to a common
&acro.xml; format. Input: raw &acro.xml; record buffers, &acro.xml;
streams and binary &acro.marc; buffers. Output: common &acro.xml; &acro.dom;.

extract (second): indexing term extraction transformations.
Input: common &acro.xml; &acro.dom;. Output: indexing &acro.xml; &acro.dom;.

store (second): transformations before internal document storage.
Input: common &acro.xml; &acro.dom;. Output: storage &acro.xml; &acro.dom;.

retrieve (third): multiple document retrieve transformations from storage
to different output formats are possible.
Input: storage &acro.xml; &acro.dom;. Output: output &acro.xml; syntax in
requested formats.
The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on
your platform, even &acro.exslt;), thus bringing full &acro.xpath;
support to the indexing, storage and display rules of not only
&acro.xml; documents, but also binary &acro.marc; records.
&acro.dom; &acro.xml; filter pipeline configuration
The experimental, loadable &acro.dom; &acro.xml;/&acro.xslt; filter module
mod-dom.so
is invoked by the zebra.cfg configuration statement
recordtype.xml: dom.db/filter_dom_conf.xml
In this example the &acro.dom; &acro.xml; filter is configured to work
on all data files with suffix
*.xml, where the configuration file is found in the
path db/filter_dom_conf.xml.
The &acro.dom; &acro.xslt; filter configuration file must be
valid &acro.xml;.
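A minimal configuration file might look like the following sketch; the stylesheet file names are placeholders, and the retrieve name dc matches the element set name used in the query examples later in this chapter:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <!-- split the incoming XML stream into records at element depth 1 -->
    <xmlreader level="1"/>
  </input>
  <extract>
    <!-- transform the common format into the Zebra indexing format -->
    <xslt stylesheet="dom-index-element.xsl"/>
  </extract>
  <!-- no transformations in <store>: a do-nothing identity pipeline -->
  <store/>
  <retrieve name="dc">
    <!-- present stored records as Dublin Core -->
    <xslt stylesheet="storage-to-dc.xsl"/>
  </retrieve>
</dom>
```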
The root &acro.xml; element <dom> and all other &acro.dom;
&acro.xml; filter elements reside in the namespace
xmlns="http://indexdata.com/zebra-2.0".
All pipeline definition elements - i.e. the
<input>,
<extract>,
<store>, and
<retrieve> elements - are optional.
Missing pipeline definitions are simply interpreted as
do-nothing identity pipelines.
All pipeline definition elements may contain zero or more
<xslt stylesheet="..."/>
&acro.xslt; transformation instructions, which are performed
sequentially from top to bottom.
The paths in the stylesheet attributes
are either relative to &zebra;'s working directory, or absolute paths
from the file system root.
Input pipeline
The <input> pipeline definition element
may contain either one &acro.xml; Reader definition
<xmlreader level="1"/>, used to split
an &acro.xml; collection input stream into individual &acro.xml; &acro.dom;
documents at the prescribed element level,
or one &acro.marc; binary
parsing instruction
<marc inputcharset="marc-8"/>, which defines
a conversion to &acro.marcxml; format &acro.dom; trees. The allowed values
of the inputcharset attribute depend on your
local iconv set-up.
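For instance, the two input variants might be configured as follows (a sketch; the level and inputcharset values are assumptions about your data):

```xml
<!-- split an XML collection stream into records at element depth 1 -->
<input>
  <xmlreader level="1"/>
</input>

<!-- alternatively: parse binary ISO2709 MARC into MARCXML DOM trees -->
<input>
  <marc inputcharset="marc-8"/>
</input>
```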
Both input parsers deliver individual &acro.dom; &acro.xml; documents to the
following chain of zero or more
<xslt stylesheet="..."/>
&acro.xslt; transformations. At the end of this pipeline, the documents
are in the common format used to feed both the
<extract> and
<store> pipelines.
Extract pipeline
The <extract> pipeline transforms documents
from the common &acro.dom; &acro.xml; format to the &zebra;-specific
indexing &acro.dom; &acro.xml; format.
It may consist of zero or more
<xslt stylesheet="..."/>
&acro.xslt; transformations, and the outcome is handed to the
&zebra; core to drive the process of building the inverted
indexes. See the section on the canonical indexing format below for
details.
Store pipeline
The <store> pipeline transforms documents
from the common &acro.dom; &acro.xml; format to the &zebra;-specific
storage &acro.dom; &acro.xml; format.
It may consist of zero or more
<xslt stylesheet="..."/>
&acro.xslt; transformations, and the outcome is handed to the
&zebra; core for deposition into the internal storage system.
Retrieve pipeline
Finally, there may be one or more
<retrieve> pipeline definitions, each
of them again consisting of zero or more
<xslt stylesheet="..."/>
&acro.xslt; transformations. These are used for document
presentation after search, and take the internal storage &acro.dom;
&acro.xml; to the requested output formats during record present
requests.
The possible multiple
<retrieve> pipeline definitions
are distinguished by their unique name
attributes; these are the literal schema or
element set names used in
&acro.srw;,
&acro.sru; and
&acro.z3950; protocol queries.
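Several named retrieve pipelines can thus coexist: a client asking for schema dc is served by the first definition below, one asking for mods by the second (a sketch; the stylesheet file names are placeholders):

```xml
<retrieve name="dc">
  <xslt stylesheet="storage-to-dc.xsl"/>
</retrieve>
<retrieve name="mods">
  <xslt stylesheet="storage-to-mods.xsl"/>
</retrieve>
```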
Canonical Indexing Format
&acro.dom; &acro.xml; indexing comes in two flavors: pure
processing-instruction governed plain &acro.xml; documents, and - very
similar to the Alvis filter indexing format - &acro.xml; documents
containing &acro.xml; <record> and
<index> instructions from the magic
namespace xmlns:z="http://indexdata.com/zebra-2.0".
Processing-instruction governed indexing format
The output of the processing-instruction driven
indexing &acro.xslt; stylesheets must contain
processing instructions named
zebra-2.0.
The output of the &acro.xslt; indexing transformation is then
parsed using &acro.dom; methods, and the contained instructions are
performed on the elements and their
subtrees directly following the processing instructions.
For example, the output of the command
xsltproc dom-index-pi.xsl marc-one.xml
might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<?zebra-2.0 record id=11224466 rank=42?>
<record>
   <?zebra-2.0 index control:0?>
   <control>11224466</control>
   <?zebra-2.0 index any:w title:w title:p title:s?>
   <title>How to program a computer</title>
</record>
Magic element governed indexing format
The output of the indexing &acro.xslt; stylesheets must contain
certain elements in the magic
xmlns:z="http://indexdata.com/zebra-2.0"
namespace. The output of the &acro.xslt; indexing transformation is then
parsed using &acro.dom; methods, and the contained instructions are
performed on the magic elements and their
subtrees.
For example, the output of the command
xsltproc dom-index-element.xsl marc-one.xml
might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<z:record xmlns:z="http://indexdata.com/zebra-2.0"
    z:id="11224466" z:rank="42">
   <z:index name="control:0">11224466</z:index>
   <z:index name="any:w title:w title:p title:s">
      How to program a computer
   </z:index>
</z:record>
Semantics of the indexing formats
Both indexing formats are defined with equal semantics and
behavior in mind:
&zebra; specific instructions are either
processing instructions named
zebra-2.0 or
elements contained in the namespace
xmlns:z="http://indexdata.com/zebra-2.0".
There must be exactly one record
instruction, which sets the scope for the following,
possibly nested index instructions.
The unique record instruction
may have additional attributes id,
rank and type.
Attribute id is the value of the opaque ID
and may be any string not containing the whitespace character
' '.
The rank attribute value must be a
non-negative integer; it is used for static ranking.
The type attribute specifies how the record
is to be treated. The following values may be given for
type:

insert
The record is inserted. If the record already exists, it is
skipped (i.e. not replaced).

replace
The record is replaced. If the record does not already exist,
it is skipped (i.e. not inserted).

delete
The record is deleted. If the record does not already exist,
a warning is issued and the remaining records in the input
stream are skipped.

update
The record is inserted or replaced depending on whether the
record exists or not. This is the default behavior, but it may
be effectively overridden outside the scope of the &acro.dom;
filter by zebraidx commands or extended services updates.

adelete
The record is deleted. If the record does not already exist,
it is skipped (i.e. nothing is deleted).
Requires version 2.0.54 or later.
Requires version 2.0.54 or later.
Note that the value of type is only used to
determine the action if and only if the Zebra indexer is running
in "update" mode (i.e. zebraidx update) or if the specialUpdate
action of the
Extended
Service Update is used.
For this reason a specialUpdate may end up deleting records!
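In the magic-element format, a delete request could then look like this sketch (the z:type attribute name is an assumption, following the z:id and z:rank pattern shown earlier):

```xml
<z:record xmlns:z="http://indexdata.com/zebra-2.0"
          z:id="11224466" z:type="adelete"/>
```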
Multiple and possibly nested index
instructions must contain at least one
indexname:indextype
pair, and may contain multiple such pairs separated by the
whitespace character ' '. In each index
pair, the name and the type of the index are separated by a
colon character ':'.
Any index name consisting of ASCII letters and following the
standard &zebra; rules will do.
Index types are restricted to the values defined in
the standard configuration
file default.idx.
&acro.dom; input documents which do not result in both one
unique valid
record instruction and one or more valid
index instructions cannot be searched and
found. Therefore,
processing of invalid documents is aborted, and any content of
the <extract> and
<store> pipelines is discarded.
A warning is issued in the logs.
The examples work as follows:
from the original &acro.xml; file
marc-one.xml (or from the &acro.xml; record &acro.dom; of the
same form coming from an <input>
pipeline),
the indexing
pipeline <extract>
produces an indexing &acro.xml; record, which is defined by
the record instruction.
&zebra; uses the content of
z:id="11224466"
or
id=11224466
as internal
record ID, and - in case static ranking is set - the content of
rank=42
or
z:rank="42"
as static rank.
In these examples, the following literal indexes are constructed:
any:w
control:0
title:w
title:p
title:s
where the indexing type is defined after the
literal ':' character.
Any value from the standard configuration
file default.idx will do.
Finally, any
text() node content recursively contained
inside the <z:index> element, or any
element following an index processing instruction,
is filtered through the
appropriate character map for character normalization, and
inserted into the named indexes.
This example configuration can then be queried using &acro.pqf;
queries, transported either by &acro.z3950; (here using yaz-client):
open localhost:9999
Z> elem dc
Z> form xml
Z>
Z> find @attr 1=control @attr 4=3 11224466
Z> scan @attr 1=control @attr 4=3 ""
Z>
Z> find @attr 1=title program
Z> scan @attr 1=title ""
Z>
Z> find @attr 1=title @attr 4=2 "How to program a computer"
Z> scan @attr 1=title @attr 4=2 ""
or by the proprietary
extensions x-pquery and
x-pScanClause to
&acro.sru; and &acro.srw;.
See the &acro.sru;/&acro.srw; configuration documentation for more
information, and the &yaz;
&acro.cql; section
for the details of the &yaz; frontend server.
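Assuming the example server above also answers &acro.sru; requests on the same port, the title search could be expressed roughly as follows (a sketch; the &acro.pqf; query string must be URL-encoded in practice):

```
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr 1=title program
```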
Notice that there are no *.abs,
*.est, *.map, or other &acro.grs1;
filter configuration files involved in this process, and that the
literal index names are used during search and retrieval.
If we want to support the usual
bib-1 &acro.z3950; numeric access points, it is a
good idea to choose string index names defined in the default
configuration file tab/bib1.att.
&acro.dom; Record Model Configuration
&acro.dom; Indexing Configuration
As mentioned above, there can be only one indexing pipeline,
and configuring the indexing process amounts to
writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the
magic processing instructions or elements discussed in
the section on the canonical indexing format above.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to
enlighten the wary.
Stylesheets can be written in the pull or
the push style: pull
means that the output &acro.xml; structure is taken as the starting point
for the internal structure of the &acro.xslt; stylesheet, and portions of
the input &acro.xml; are pulled out and inserted
into the right spots of the output &acro.xml; structure.
On the other
hand, push &acro.xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &acro.xml; structure, and is triggered to produce
some output &acro.xml;
whenever special conditions in the input documents are
met. The pull type is well-suited for input
&acro.xml; with strong and well-defined structure and semantics, like the
following &acro.oai; indexing example, whereas the
push type might be the only possible way to
sort out deeply recursive input &acro.xml; formats.
A pull stylesheet example used to index
&acro.oai; harvested records could use some of the following template
definitions:
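For instance (a sketch rather than a complete stylesheet: the &acro.oai; and Dublin Core namespaces are the standard ones, while the index names and XPath expressions are illustrative):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- pull style: start from the output structure and fetch
       the known input fields into the right index elements -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(//oai:identifier)}">
      <z:index name="oai_identifier:0">
        <xsl:value-of select="//oai:identifier"/>
      </z:index>
      <z:index name="title:w title:p any:w">
        <xsl:value-of select="//dc:title"/>
      </z:index>
    </z:record>
  </xsl:template>
</xsl:stylesheet>
```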
&acro.dom; Indexing &acro.marcxml;
The &acro.dom; filter allows indexing of both binary &acro.marc; records
and &acro.marcxml; records, depending on its configuration.
A typical &acro.marcxml; record might look like this:

<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00366nam a22001698a 4500</leader>
  <controlfield tag="001">11224466</controlfield>
  <controlfield tag="003">DLC</controlfield>
  <controlfield tag="005">00000000000000.0</controlfield>
  <controlfield tag="008">910710c19910701nju           00010 eng  </controlfield>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">11224466</subfield>
  </datafield>
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">DLC</subfield>
    <subfield code="c">DLC</subfield>
  </datafield>
  <datafield tag="050" ind1="0" ind2="0">
    <subfield code="a">123-xyz</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2="0">
    <subfield code="a">Jack Collins</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">How to program a computer</subfield>
  </datafield>
  <datafield tag="260" ind1="1" ind2=" ">
    <subfield code="b">Penguin</subfield>
  </datafield>
  <datafield tag="263" ind1=" " ind2=" ">
    <subfield code="a">8710</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">p. cm.</subfield>
  </datafield>
</record>
It is easy to do string manipulation in the &acro.dom;
filter. For example, if you want to drop some leading articles
in the indexing of sort fields, you might want to pick out the
&acro.marcxml; indicator attributes to chop off leading substrings. If
the above &acro.xml; example had the indicator
ind2="8" in the title field
245, i.e.
<datafield tag="245" ind1="1" ind2="8">
  <subfield code="a">How to program a computer</subfield>
</datafield>
one could write a template taking this information into account
to chop the leading characters from the
sorting index title:s like this:

<xsl:template xmlns:m="http://www.loc.gov/MARC21/slim"
    xmlns:z="http://indexdata.com/zebra-2.0"
    match="m:datafield[@tag='245']">
  <z:index name="title:w title:p any:w">
    <xsl:value-of select="m:subfield[@code='a']"/>
  </z:index>
  <xsl:if test="@ind2 > 0">
    <z:index name="title:s">
      <xsl:value-of select="substring(m:subfield[@code='a'], @ind2)"/>
    </z:index>
  </xsl:if>
</xsl:template>
The output of the above &acro.marcxml; and &acro.xslt; excerpt would then be:

<z:index name="title:w title:p any:w">How to program a computer</z:index>
<z:index name="title:s">program a computer</z:index>

and the record would be sorted in the title index under 'P', not 'H'.
&acro.dom; Indexing Wizardry
The names and types of the indexes can be defined in the
indexing &acro.xslt; stylesheet dynamically, according to
content in the original &acro.xml; records, which creates
opportunities for great power and wizardry as well as grand
disaster.
The following excerpt of a push stylesheet
might
be a good idea, depending on how strictly you control the &acro.xml;
input format (due to rigorous checking against well-defined and
tight RelaxNG or &acro.xml; Schemas, for example):
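A sketch of such a template (the generic match pattern is the point of the example; :w is just one possible index type):

```xml
<!-- push style: create one index per element name found in the input -->
<xsl:template xmlns:z="http://indexdata.com/zebra-2.0" match="*">
  <z:index name="{name()}:w">
    <xsl:text>1</xsl:text>
  </z:index>
  <xsl:apply-templates/>
</xsl:template>
```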
This template creates indexes which have the name of the working
node of any input &acro.xml; file, and assigns a '1' to the index.
The example query
find @attr 1=xyz 1
finds all files which contain at least one
xyz &acro.xml; element. If you cannot control
which element names the input files contain, you might be asking for
disaster and bad karma by using this technique.
One variation on the theme of dynamically created
indexes is definitely unwise:
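Such a variation might, for example, derive the index name from element content rather than from the element name (a sketch of what not to do):

```xml
<!-- anti-pattern: the index NAME now depends on record content,
     so every distinct text value spawns a new index -->
<xsl:template xmlns:z="http://indexdata.com/zebra-2.0" match="*">
  <z:index name="{normalize-space(text())}:w">
    <xsl:text>1</xsl:text>
  </z:index>
  <xsl:apply-templates/>
</xsl:template>
```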
Don't be tempted to play overly smart tricks with the power of
&acro.xslt;: the above example will create zillions of
indexes with unpredictable names, resulting in severe &zebra;
index pollution.
Debugging &acro.dom; Filter Configurations
It can be very hard to debug a &acro.dom; filter setup due to the many
successive &acro.marc; syntax translations, &acro.xml; stream splitting and
&acro.xslt; transformations involved. As an aid, you always have the
power of the -s command line switch to the
zebraidx indexing command at hand:
zebraidx -s -c zebra.cfg update some_record_stream.xml
This command line simulates indexing and dumps a lot of debug
information to the logs, telling exactly which transformations
have been applied, what the documents look like after each
transformation, and which record IDs and terms are sent to the indexer.