&acro.dom; &acro.xml; Record Model and Filter Module

The record model described in this chapter applies to the fundamental, structured &acro.xml; record type &acro.dom;, introduced in . The &acro.dom; &acro.xml; record model is experimental, and its inner workings might change in future releases of the &zebra; Information Server.
&acro.dom; Record Filter Architecture

The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as its internal data model, and can therefore parse, index, and display any &acro.xml; document type. It is well suited to work on standardized &acro.xml;-based formats such as Dublin Core, MODS, METS, MARCXML, OAI-PMH, and RSS, and performs equally well on any other non-standard &acro.xml; format. A parser for binary &acro.marc; records based on the ISO2709 library standard is provided; it transforms these into the internal &acro.marcxml; &acro.dom; representation. Other binary document parsers are planned to follow. The &acro.dom; filter architecture consists of four different pipelines, each being a chain of arbitrarily many successive &acro.xslt; transformations of the internal &acro.dom; &acro.xml; representations of documents.
&acro.dom; &acro.xml; filter architecture

[Here there should be a diagram showing the &acro.dom; &acro.xml; filter architecture, but it seems that your tool chain has not been able to include the diagram in this document.]
&acro.dom; &acro.xml; filter pipelines overview

Name: input
When: first
Description: input parsing and initial transformations to common &acro.xml; format
Input: raw &acro.xml; record buffers, &acro.xml; streams, and binary &acro.marc; buffers
Output: Common &acro.xml; &acro.dom;

Name: extract
When: second
Description: indexing term extraction transformations
Input: Common &acro.xml; &acro.dom;
Output: Indexing &acro.xml; &acro.dom;

Name: store
When: second
Description: transformations before internal document storage
Input: Common &acro.xml; &acro.dom;
Output: Storage &acro.xml; &acro.dom;

Name: retrieve
When: third
Description: multiple document retrieve transformations from storage to different output formats are possible
Input: Storage &acro.xml; &acro.dom;
Output: &acro.xml; syntax in requested formats
The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on your platform, even &acro.exslt;), and thus bring full &acro.xpath; support to the indexing, storage, and display rules of not only &acro.xml; documents, but also binary &acro.marc; records.
&acro.dom; &acro.xml; filter pipeline configuration

The experimental, loadable &acro.dom; &acro.xml;/&acro.xslt; filter module mod-dom.so is invoked by the zebra.cfg configuration statement

recordtype.xml: dom.db/filter_dom_conf.xml

In this example the &acro.dom; &acro.xml; filter is configured to work on all data files with suffix *.xml, where the configuration file is found in the path db/filter_dom_conf.xml. The &acro.dom; &acro.xslt; filter configuration file must be valid &acro.xml;; a sketch of such a file is given below. The root &acro.xml; element <dom> and all other &acro.dom; &acro.xml; filter elements reside in the namespace xmlns="http://indexdata.com/zebra-2.0". All pipeline definition elements - i.e. the <input>, <extract>, <store>, and <retrieve> elements - are optional. Missing pipeline definitions are simply interpreted as do-nothing identity pipelines. All pipeline definition elements may contain zero or more &acro.xslt; transformation instructions, which are performed sequentially from top to bottom. The paths in the stylesheet attributes are relative to &zebra;'s working directory, or absolute to the file system root.
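As an illustration only, a minimal configuration along the lines described here might look like the following sketch. The <xmlreader> and <xslt> element names and the level attribute are assumptions (this chapter explicitly names only the pipeline elements, the stylesheet attribute, and the inputcharset attribute), and the stylesheet file names are invented:

<?xml version="1.0" encoding="UTF-8"?>
<!-- db/filter_dom_conf.xml: DOM filter configuration (illustrative sketch) -->
<dom xmlns="http://indexdata.com/zebra-2.0">
  <!-- input pipeline: split the incoming XML stream into individual records -->
  <input>
    <xmlreader level="1"/>
  </input>
  <!-- extract pipeline: produce the Zebra indexing XML -->
  <extract>
    <xslt stylesheet="indexing.xsl"/>
  </extract>
  <!-- empty store pipeline: documents are stored in the common format unchanged -->
  <store/>
  <!-- retrieve pipelines: one per element set / schema name -->
  <retrieve name="dc">
    <xslt stylesheet="dc-view.xsl"/>
  </retrieve>
  <retrieve name="xml">
    <xslt stylesheet="raw-view.xsl"/>
  </retrieve>
</dom>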
Input pipeline

The <input> pipeline definition element may contain either one &acro.xml; Reader definition, used to split an &acro.xml; collection input stream into individual &acro.xml; &acro.dom; documents at the prescribed element level, or one &acro.marc; binary parsing instruction, which defines a conversion to &acro.marcxml; format &acro.dom; trees. The allowed values of the inputcharset attribute depend on your local iconv set-up. Both input parsers deliver individual &acro.dom; &acro.xml; documents to the following chain of zero or more &acro.xslt; transformations. At the end of this pipeline, the documents are in the common format used to feed both the <extract> and <store> pipelines.
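As a sketch, and with the same assumptions about element names as in the configuration example above (the normalization stylesheet name is invented), the two alternative input pipelines might be written as:

<!-- XML input: split the collection stream into individual records,
     then normalize them to the common format -->
<input>
  <xmlreader level="1"/>
  <xslt stylesheet="normalize-to-common.xsl"/>
</input>

<!-- binary MARC input: parse ISO2709 records into MARCXML DOM trees -->
<input>
  <marc inputcharset="marc-8"/>
  <xslt stylesheet="normalize-to-common.xsl"/>
</input>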
Extract pipeline

The <extract> pipeline takes documents from any common &acro.dom; &acro.xml; format to the &zebra;-specific indexing &acro.dom; &acro.xml; format. It may consist of zero or more &acro.xslt; transformations, and the outcome is handed to the &zebra; core to drive the process of building the inverted indexes. See for details.
Store pipeline

The <store> pipeline takes documents from any common &acro.dom; &acro.xml; format to the &zebra;-specific storage &acro.dom; &acro.xml; format. It may consist of zero or more &acro.xslt; transformations, and the outcome is handed to the &zebra; core for deposition into the internal storage system.
Retrieve pipeline

Finally, there may be one or more <retrieve> pipeline definitions, each of them again consisting of zero or more &acro.xslt; transformations. These are used for document presentation after search, and take the internal storage &acro.dom; &acro.xml; to the requested output formats during record present requests. The possible multiple <retrieve> pipeline definitions are distinguished by their unique name attributes; these are the literal schema or element set names used in &acro.srw;, &acro.sru;, and &acro.z3950; protocol queries.
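As a sketch (reusing the invented stylesheet names from the configuration example above), two named retrieve pipelines could be declared as follows; a &acro.z3950; client would then select one of them by element set name (e.g. elem dc in yaz-client), and an &acro.sru; client by the corresponding recordSchema value:

<retrieve name="dc">
  <xslt stylesheet="dc-view.xsl"/>
</retrieve>
<retrieve name="xml">
  <xslt stylesheet="raw-view.xsl"/>
</retrieve>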
Canonical Indexing Format

&acro.dom; &acro.xml; indexing comes in two flavors: pure processing-instruction governed plain &acro.xml; documents, and - very similar to the Alvis filter indexing format - &acro.xml; documents containing &acro.xml; <record> and <index> instructions from the magic namespace xmlns:z="http://indexdata.com/zebra-2.0".
Processing-instruction governed indexing format

The output of the processing-instruction driven indexing &acro.xslt; stylesheets must contain processing instructions named zebra-2.0. The output of the &acro.xslt; indexing transformation is then parsed using &acro.dom; methods, and the contained instructions are performed on the elements and their subtrees directly following the processing instructions. For example, the output of the command xsltproc dom-index-pi.xsl marc-one.xml might look like the record sketched below, carrying the identifier 11224466 and the title 'How to program a computer'.
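A plausible sketch of such output, based on the instruction semantics described below (a single record instruction carrying id and rank, followed by index instructions naming indexname:indextype pairs); the exact processing-instruction spelling and the wrapper element names are assumptions:

<?xml version="1.0" encoding="UTF-8"?>
<!-- record instruction: sets the scope, the opaque id, and the static rank -->
<?zebra-2.0 record id=11224466 rank=42?>
<record>
   <?zebra-2.0 index control:0?>
   <control>11224466</control>
   <?zebra-2.0 index any:w title:w title:p title:s?>
   <title>How to program a computer</title>
</record>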
Magic element governed indexing format

The output of the indexing &acro.xslt; stylesheets must contain certain elements in the magic xmlns:z="http://indexdata.com/zebra-2.0" namespace. The output of the &acro.xslt; indexing transformation is then parsed using &acro.dom; methods, and the contained instructions are performed on the magic elements and their subtrees. For example, the output of the command xsltproc dom-index-element.xsl marc-one.xml might look like the record sketched below, again carrying the identifier 11224466 and the title 'How to program a computer'.
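A plausible sketch of such output, using the z:record and z:index elements and the z:id and z:rank attributes described in the next section; the spelling of the name attribute is an assumption:

<?xml version="1.0" encoding="UTF-8"?>
<z:record xmlns:z="http://indexdata.com/zebra-2.0"
          z:id="11224466" z:rank="42">
   <z:index name="control:0">11224466</z:index>
   <z:index name="any:w title:w title:p title:s">
      How to program a computer
   </z:index>
</z:record>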
Semantics of the indexing formats

Both indexing formats are defined with equal semantics and behavior in mind: &zebra;-specific instructions are either processing instructions named zebra-2.0 or elements contained in the namespace xmlns:z="http://indexdata.com/zebra-2.0".

There must be exactly one record instruction, which sets the scope for the following, possibly nested index instructions. The unique record instruction may have the additional attributes id, rank and type. The id attribute is the value of the opaque ID and may be any string not containing the whitespace character ' '. The rank attribute value must be a non-negative integer. See .

The type attribute specifies how the record is to be treated. The following values may be given for type:

insert - The record is inserted. If the record already exists, it is skipped (i.e. not replaced).

replace - The record is replaced. If the record does not already exist, it is skipped (i.e. not inserted).

delete - The record is deleted. If the record does not already exist, a warning is issued and the rest of the records from the input stream are skipped.

update - The record is inserted or replaced depending on whether the record exists or not. This is the default behavior, but it may be effectively changed from "outside" the scope of the DOM filter by zebraidx commands or extended services updates.

adelete - The record is deleted. If the record does not already exist, it is skipped (i.e. nothing is deleted). Requires version 2.0.54 or later.

Note that the value of type is only used to determine the action when the &zebra; indexer is running in "update" mode (i.e. zebraidx update) or when the specialUpdate action of the Extended Service Update is used. For this reason a specialUpdate may end up deleting records!

Multiple and possibly nested index instructions must contain at least one indexname:indextype pair, and may contain multiple such pairs separated by the whitespace character ' '. In each index pair, the name and the type of the index are separated by a colon character ':'. Any index name consisting of ASCII letters and following the standard &zebra; rules will do, see . Index types are restricted to the values defined in the standard configuration file default.idx, see and for details.

&acro.dom; input documents which do not result in both one unique valid record instruction and one or more valid index instructions cannot be searched and found. Therefore, invalid document processing is aborted, and any content of the <extract> and <store> pipelines is discarded. A warning is issued in the logs.

The examples work as follows: from the original &acro.xml; file marc-one.xml (or from the &acro.xml; record &acro.dom; of the same form coming from an <input> pipeline), the indexing pipeline <extract> produces an indexing &acro.xml; record, which is defined by the record instruction. &zebra; uses the content of z:id="11224466" or id=11224466 as internal record ID, and - in case static ranking is set - the content of rank=42 or z:rank="42" as static rank. In these examples, the following literal indexes are constructed: any:w, control:0, title:w, title:p, and title:s, where the indexing type is defined after the literal ':' character. Any value from the standard configuration file default.idx will do. Finally, any text() node content recursively contained inside the <z:index> element, or in the element following an index processing instruction, will be filtered through the appropriate char map for character normalization, and will be inserted in the named indexes.
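As a small illustration of the type attribute (using the magic-element flavor and the attribute conventions from the examples above; the exact z:type spelling is an assumption), a record that should be removed from the index during a zebraidx update run could be emitted as:

<z:record xmlns:z="http://indexdata.com/zebra-2.0"
          z:id="11224466" z:type="adelete">
   <!-- the record is deleted if it exists, and silently skipped if it does not -->
   <z:index name="control:0">11224466</z:index>
</z:record>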
Finally, this example configuration can be queried using &acro.pqf; queries, either transported by &acro.z3950; (here using a yaz-client):

open localhost:9999
Z> elem dc
Z> form xml
Z>
Z> find @attr 1=control @attr 4=3 11224466
Z> scan @attr 1=control @attr 4=3 ""
Z>
Z> find @attr 1=title program
Z> scan @attr 1=title ""
Z>
Z> find @attr 1=title @attr 4=2 "How to program a computer"
Z> scan @attr 1=title @attr 4=2 ""

or transported by the proprietary extensions x-pquery and x-pScanClause to &acro.sru; and &acro.srw;. See for more information on &acro.sru;/&acro.srw; configuration, and the &yaz; &acro.cql; section for the details of the &yaz; frontend server. Notice that there are no *.abs, *.est, *.map, or other &acro.grs1; filter configuration files involved in this process, and that the literal index names are used during search and retrieval. In case we want to support the usual bib-1 &acro.z3950; numeric access points, it is a good idea to choose string index names defined in the default configuration file tab/bib1.att, see .
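For illustration, the corresponding &acro.sru; requests against a &yaz;-based server typically take the form of URLs like the following (host and port are taken from the yaz-client example above; the parameter values would need to be URL-encoded in practice, and the exact form depends on your &acro.sru; configuration):

http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr 1=title program
http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr 1=title program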
&acro.dom; Record Model Configuration
&acro.dom; Indexing Configuration

As mentioned above, there can be only one indexing pipeline, and configuration of the indexing process is a synonym for writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the magic processing instructions or elements discussed in . Obviously, there are millions of different ways to accomplish this task, and some comments and code snippets are in order to enlighten the wary.

Stylesheets can be written in the pull or the push style: pull means that the output &acro.xml; structure is taken as the starting point of the internal structure of the &acro.xslt; stylesheet, and portions of the input &acro.xml; are pulled out and inserted into the right spots of the output &acro.xml; structure. Push &acro.xslt; stylesheets, on the other hand, recursively call their template definitions, a process which is driven by the input &acro.xml; structure and is triggered to produce some output &acro.xml; whenever certain conditions in the input documents are met. The pull type is well suited for input &acro.xml; with strong and well-defined structure and semantics, like the &acro.oai; indexing example below, whereas the push type might be the only possible way to sort out deeply recursive input &acro.xml; formats. A pull stylesheet used to index &acro.oai; harvested records could use some of the template definitions sketched below.
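An illustrative sketch of such a pull stylesheet, using the magic z:record and z:index elements described earlier; the index names (oai_identifier, dc_any, dc_title) are invented for the example, and the &acro.oai; namespace URIs are the standard OAI-PMH 2.0 and Dublin Core ones:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- match on the OAI record root and set the record scope -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- pull out the OAI identifier -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier:0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- pull out Dublin Core titles -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_any:w dc_title:w dc_title:p dc_title:s">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>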
&acro.dom; Indexing &acro.marcxml;

The &acro.dom; filter allows indexing of both binary &acro.marc; records and &acro.marcxml; records, depending on its configuration. A typical &acro.marcxml; record carries a leader, control fields (such as the record identifier 11224466), and data fields (such as the author 'Jack Collins', the title 'How to program a computer', and the publisher 'Penguin').

String manipulation is easily possible in the &acro.dom; filter. For example, if you want to drop some leading articles in the indexing of sort fields, you might want to pick out the &acro.marcxml; indicator attributes to chop off leading substrings. If the title field 245 of such a record carried an indicator ind2="8", one could write a template that takes this information into account and chops the leading characters from the sorting index title:s, as sketched below. The word and phrase indexes would then receive the full title 'How to program a computer', while the sort index title:s would receive 'program a computer', and the record would be sorted in the title index under 'P', not 'H'.
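An illustrative sketch of the relevant title field and a template producing both the full-title indexes and the chopped sort index; the marc namespace prefix binding and the exact handling of the indicator are assumptions:

<!-- fragment of the MARCXML input: title field with nonfiling indicator -->
<datafield xmlns="http://www.loc.gov/MARC21/slim" tag="245" ind1="1" ind2="8">
  <subfield code="a">How to program a computer</subfield>
</datafield>

<!-- indexing template: full title for word/phrase indexes,
     title chopped at the ind2 position for the sort index title:s -->
<xsl:template match="marc:datafield[@tag='245']"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:z="http://indexdata.com/zebra-2.0"
              xmlns:marc="http://www.loc.gov/MARC21/slim">
  <z:index name="any:w title:w title:p">
    <xsl:value-of select="marc:subfield[@code='a']"/>
  </z:index>
  <z:index name="title:s">
    <xsl:choose>
      <xsl:when test="number(@ind2) &gt; 0">
        <xsl:value-of select="substring(marc:subfield[@code='a'], @ind2)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="marc:subfield[@code='a']"/>
      </xsl:otherwise>
    </xsl:choose>
  </z:index>
</xsl:template>

<!-- resulting index content: title:w and title:p receive
     'How to program a computer', title:s receives 'program a computer' -->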
&acro.dom; Indexing Wizardry

The names and types of the indexes can be defined in the indexing &acro.xslt; stylesheet dynamically, according to content in the original &acro.xml; records, which offers opportunities for great power and wizardry as well as grand disaster. A push stylesheet built around a template like the first one sketched below might be a good idea, provided you have strict control over the &acro.xml; input format (due to rigorous checking against well-defined and tight RelaxNG or &acro.xml; Schemas, for example): such a template creates an index named after the current element of any input &acro.xml; file and assigns the value '1' to it. The example query find @attr 1=xyz 1 then finds all files which contain at least one xyz &acro.xml; element. In case you cannot control which element names the input files contain, you might be asking for disaster and bad karma by using this technique. One variation on the theme of dynamically created indexes, such as the second template sketched below, is definitely unwise: don't be tempted to play overly smart tricks with the power of &acro.xslt;, as it will create zillions of indexes with unpredictable names, resulting in severe &zebra; index pollution.
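The following two alternative templates are illustrative sketches of the techniques discussed above (the first reasonably safe under a tightly controlled input format, the second deliberately showing the unwise variation); they are not meant to be used in the same stylesheet:

<!-- 1: create an index named after each element, with the constant value '1' -->
<xsl:template match="*"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:z="http://indexdata.com/zebra-2.0">
  <z:index name="{local-name()}:w">
    <xsl:text>1</xsl:text>
  </z:index>
  <xsl:apply-templates/>
</xsl:template>

<!-- 2: the unwise variation - index names derived from arbitrary text content,
     creating unpredictable index names and severe index pollution -->
<xsl:template match="*"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:z="http://indexdata.com/zebra-2.0">
  <z:index name="{normalize-space(text())}:w">
    <xsl:value-of select="."/>
  </z:index>
  <xsl:apply-templates/>
</xsl:template>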
Debugging &acro.dom; Filter Configurations

It can be very hard to debug a &acro.dom; filter setup due to the many successive &acro.marc; syntax translations, &acro.xml; stream splitting and &acro.xslt; transformations involved. As an aid, you always have the power of the -s command line switch to the zebraidx indexing command at hand:

zebraidx -s -c zebra.cfg update some_record_stream.xml

This command line simulates indexing and dumps a lot of debug information in the logs, telling exactly which transformations have been applied, what the documents look like after each transformation, and which record IDs and terms are sent to the indexer.