X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Frecordmodel-domxml.xml;h=a9b85db7d726fb7eb9b50cc5679cf01f240fd8f2;hb=cf66499bac7c49c5bdd363a2c927295fa92f547a;hp=9c281873dd179efd5d26982c0a4e37df64ffe47f;hpb=c1152dc950bd0edb1e638f55f71f8f7c20c4f01a;p=idzebra-moved-to-github.git diff --git a/doc/recordmodel-domxml.xml b/doc/recordmodel-domxml.xml index 9c28187..a9b85db 100644 --- a/doc/recordmodel-domxml.xml +++ b/doc/recordmodel-domxml.xml @@ -1,11 +1,11 @@ - + &dom; &xml; Record Model and Filter Module The record model described in this chapter applies to the fundamental, structured &xml; - record type dom, introduced in + record type &dom;, introduced in . The &dom; &xml; record model is experimental, and it's inner workings might change in future releases of the &zebra; Information Server. @@ -19,7 +19,7 @@ The &dom; &xml; filter uses a standard &dom; &xml; structure as internal data model, and can therefore parse, index, and display - any &xml; document type. It is wellsuited to work on + any &xml; document type. It is well suited to work on standardized &xml;-based formats such as Dublin Core, MODS, METS, MARCXML, OAI-PMH, RSS, and performs equally well on any other non-standard &xml; format. @@ -33,7 +33,7 @@ The &dom; filter architecture consists of four - different pipelines, each being a chain of arbitraily many sucessive + different pipelines, each being a chain of arbitrarily many successive &xslt; transformations of the internal &dom; &xml; representations of documents. 
@@ -79,26 +79,25 @@ first input parsing and initial transformations to common &xml; format - raw &xml; record buffers, &xml; streams and + Input raw &xml; record buffers, &xml; streams and binary &marc; buffers - single &dom; &xml; documents suitable for indexing and - internal storage + Common &xml; &dom; extract second indexing term extraction transformations - common single &dom; &xml; format - &zebra; internal indexing &dom; &xml; document + Common &xml; &dom; + Indexing &xml; &dom; store second transformations before internal document storage - common single &dom; &xml; format - &zebra; internal storage &dom; &xml; document + Common &xml; &dom; + Storage &xml; &dom; retrieve @@ -106,8 +105,8 @@ multiple document retrieve transformations from storage to different output formats are possible - &zebra; internal storage &dom; &xml; document - output &xml; syntax and requested format + Storage &xml; &dom; + Output &xml; syntax in requested formats @@ -132,9 +131,9 @@ recordtype.xml: dom.db/filter_dom_conf.xml - In this example on all data files with suffix - *.xml, where the - &dom; &xslt; filter configuration file is found in the + In this example the &dom; &xml; filter is configured to work + on all data files with suffix + *.xml, where the configuration file is found in the path db/filter_dom_conf.xml. @@ -164,54 +163,160 @@ ]]> - - All named stylesheets defined inside - schema element tags - are for presentation after search, including - the indexing stylesheet (which is a great debugging help). The - names defined in the name attributes must be - unique, these are the literal schema or - element set names used in - &srw;, - &sru; and - &z3950; protocol queries. + The root &xml; element <dom> and all other &dom; + &xml; filter elements are residing in the namespace + xmlns="http://indexdata.dk/zebra-2.0". + + + All pipeline definition elements - i.e. the + <input>, + <extract>, + <store>, and + <retrieve> elements - are optional. 
+ Missing pipeline definitions are just interpreted as + do-nothing identity pipelines. + + + All pipeline definition elements may contain zero or more + ]]> + &xslt; transformation instructions, which are performed + sequentially from top to bottom. The paths in the stylesheet attributes - are relative to zebras working directory, or absolute to file + are relative to &zebra;'s working directory, or absolute to the file system root. + + + 
+ Input pipeline - The <split level="2"/> decides where the - &xml; Reader shall split the - collections of records into individual records, which then are - loaded into &dom;, and have the indexing &xslt; stylesheet applied. + The <input> pipeline definition element + may contain either one &xml; Reader definition + ]]>, used to split + an &xml; collection input stream into individual &xml; &dom; + documents at the prescribed element level, + or one &marc; binary + parsing instruction + ]]>, which defines + a conversion to &marcxml; format &dom; trees. The allowed values + of the inputcharset attribute depend on your + local iconv set-up. - There must be exactly one indexing &xslt; stylesheet, which is - defined by the magic attribute - identifier="http://indexdata.dk/zebra/xslt/1". + Both input parsers deliver individual &dom; &xml; documents to the + following chain of zero or more + ]]> + &xslt; transformations. At the end of this pipeline, the documents + are in the common format, used to feed both the + <extract> and + <store> pipelines. +
+ +
+ Extract pipeline + + The <extract> pipeline takes documents + from any common &dom; &xml; format to the &zebra;-specific + indexing &dom; &xml; format. + It may consist of zero or more + ]]> + &xslt; transformations, and the outcome is handed to the + &zebra; core to drive the process of building the inverted + indexes. See + for + details. + + 
-
- &dom; filter internal record representation - When indexing, an &xml; Reader is invoked to split the input - files into suitable record &xml; pieces. Each record piece is then - transformed to an &xml; &dom; structure, which is essentially the - record model. Only &xslt; transformations can be applied during - index, search and retrieval. Consequently, output formats are - restricted to whatever &xslt; can deliver from the record &xml; - structure, be it other &xml; formats, HTML, or plain text. In case - you have libxslt1 running with E&xslt; support, - you can use this functionality inside the &dom; - filter configuration &xslt; stylesheets. +
+ Store pipeline + The <store> pipeline takes documents + from any common &dom; &xml; format to the &zebra;-specific + storage &dom; &xml; format. + It may consist of zero or more + ]]> + &xslt; transformations, and the outcome is handed to the + &zebra; core for deposition into the internal storage system. 
+ +
+ Retrieve pipeline + + Finally, there may be one or more + <retrieve> pipeline definitions, each + of them again consisting of zero or more + ]]> + &xslt; transformations. These are used for document + presentation after search, and take the internal storage &dom; + &xml; to the requested output formats during record present + requests. + + The possible multiple + <retrieve> pipeline definitions + are distinguished by their unique name + attributes, these are the literal schema or + element set names used in + &srw;, + &sru; and + &z3950; protocol queries. +
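The four pipelines described in the added text can be sketched as one filter configuration file. This is only an illustration, not the file shipped with &zebra;: the elements dom, input, extract, store and retrieve, the stylesheet, name and inputcharset attributes, and the namespace are taken from the text above, whereas the element names xmlreader, marc and xslt, the level attribute, the value marc-8, and every stylesheet file name are assumptions, because the corresponding markup is not visible in this diff.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical db/filter_dom_conf.xml; names marked ASSUMED are not
     confirmed by the surrounding text. -->
<dom xmlns="http://indexdata.dk/zebra-2.0">

  <!-- Input pipeline: exactly one parser, then zero or more transformations. -->
  <input>
    <!-- ASSUMED element name: splits the XML stream at element level 1. -->
    <xmlreader level="1"/>
    <!-- Alternative for binary MARC input (ASSUMED element name and value):
         <marc inputcharset="marc-8"/> -->
    <!-- ASSUMED element and file name: brings records to the common format. -->
    <xslt stylesheet="input2common.xsl"/>
  </input>

  <!-- Extract pipeline: common format to indexing format. -->
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>

  <!-- Store pipeline: common format to storage format. -->
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>

  <!-- Retrieve pipelines: one per schema or element set name. -->
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>

</dom>
```

Leaving out any of the four pipeline elements would, according to the text, make that pipeline an identity transformation.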
-
- &dom; Canonical Indexing Format + +
+ Canonical Indexing Format + + + &dom; &xml; indexing comes in two flavors: pure + processing-instruction governed plain &xml; documents, and - very + similar to the Alvis filter indexing format - &xml; documents + containing &xml; <record> and + <index> instructions from the magic + namespace xmlns:z="http://indexdata.dk/zebra-2.0". + + +
+ Processing-instruction governed indexing format + + The output of the processing instruction driven + indexing &xslt; stylesheets must contain + processing instructions named + zebra-2.0. + The output of the &xslt; indexing transformation is then + parsed using &dom; methods, and the contained instructions are + performed on the elements and their + subtrees directly following the processing instructions. + + + For example, the output of the command + + xsltproc dom-index-pi.xsl marc-one.xml + + might look like this: + + + + + + 11224466 + + How to program a computer + + ]]> + + +
+ +
+ Magic element governed indexing format + The output of the indexing &xslt; stylesheets must contain certain elements in the magic - xmlns:z="http://indexdata.dk/zebra/xslt/1" + xmlns:z="http://indexdata.dk/zebra-2.0" namespace. The output of the &xslt; indexing transformation is then parsed using &dom; methods, and the contained instructions are performed on the magic elements and their @@ -219,88 +324,140 @@ For example, the output of the command - - xsltproc xsl/oai2index.xsl one-record.xml + + xsltproc dom-index-element.xsl marc-one.xml might look like this: - <?xml version="1.0" encoding="UTF-8"?> - <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1" - z:id="oai:JTRS:CP-3290---Volume-I" - z:rank="47896" - z:type="update"> - <z:index name="oai_identifier" type="0"> - oai:JTRS:CP-3290---Volume-I</z:index> - <z:index name="oai_datestamp" type="0">2004-07-09</z:index> - <z:index name="oai_setspec" type="0">jtrs</z:index> - <z:index name="dc_all" type="w"> - <z:index name="dc_title" type="w">Proceedings of the 4th - International Conference and Exhibition: - World Congress on Superconductivity - Volume I</z:index> - <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin - Burnham, Editors</z:index> - </z:index> - </z:record> + + + 11224466 + + How to program a computer + + ]]> - This means the following: From the original &xml; file - one-record.xml (or from the &xml; record &dom; of the - same form coming from a splitted input file), the indexing - stylesheet produces an indexing &xml; record, which is defined by - the record element in the magic namespace - xmlns:z="http://indexdata.dk/zebra/xslt/1". +
+ + +
+ Semantics of the indexing formats + + + Both indexing formats are defined with equal semantics and + behavior in mind: + + + &zebra; specific instructions are either + processing instructions named + zebra-2.0 or + elements contained in the namespace + xmlns:z="http://indexdata.dk/zebra-2.0". + + + + There must be exactly one record + instruction, which sets the scope for the following, + possibly nested index instructions. + + + + The unique record instruction + may have additional attributes id and + rank, where the value of the opaque ID + may be any string not containing the whitespace character + ' ', and the rank value must be a + non-negative integer. See + + + + + Multiple and possibly nested index + instructions must contain at least one + indexname:indextype + pair, and may contain multiple such pairs separated by the + whitespace character ' '. In each index + pair, the name and the type of the index are separated by a + colon character ':'. + + + + + Any index name consisting of ASCII letters, and following the + standard &zebra; rules will do, see + . + + + + + Index types are restricted to the values defined in + the standard configuration + file default.idx, see + and + for details. + + + + + &dom; input documents which do not result in both one + unique valid + record instruction and one or more valid + index instructions cannot be searched and + found. Therefore, + invalid document processing is aborted, and any content of + the <extract> and + <store> pipelines is discarded. + A warning is issued in the logs. 
+ + + + + + + The examples work as follows: + From the original &xml; file + marc-one.xml (or from the &xml; record &dom; of the + same form coming from an <input> + pipeline), + the indexing + pipeline <extract> + produces an indexing &xml; record, which is defined by + the record instruction &zebra; uses the content of - z:id="oai:JTRS:CP-3290---Volume-I" as internal + z:id="11224466" + or + id=11224466 + as internal record ID, and - in case static ranking is set - the content of - z:rank="47896" as static rank. Following the - discussion in - we see that this records is internally ordered - lexicographically according to the value of the string - oai:JTRS:CP-3290---Volume-I47896. - The type of action performed during indexing is defined by - z:type="update">, with recognized values - insert, update, and - delete. + rank=42 + or + z:rank="42" + as static rank. - In this example, the following literal indexes are constructed: + + + In these examples, the following literal indexes are constructed: - oai_identifier - oai_datestamp - oai_setspec - dc_all - dc_title - dc_creator + any:w + control:0 + title:w + title:p + title:s - where the indexing type is defined in the - type attribute - (any value from the standard configuration - file default.idx will do). Finally, any + where the indexing type is defined after the + literal ':' character. + Any value from the standard configuration + file default.idx will do. + Finally, any text() node content recursively contained - inside the index will be filtered through the - appropriate charmap for character normalization, and will be - inserted in the index. 
- - - Specific to this example, we see that the single word - oai:JTRS:CP-3290---Volume-I will be literal, - byte for byte without any form of character normalization, - inserted into the index named oai:identifier, - the text - Kumar Krishen and *Calvin Burnham, Editors - will be inserted using the w character - normalization defined in default.idx into - the index dc:creator (that is, after character - normalization the index will keep the inidividual words - kumar, krishen, - and, calvin, - burnham, and editors), and - finally both the texts - Proceedings of the 4th International Conference and Exhibition: - World Congress on Superconductivity - Volume I - and - Kumar Krishen and *Calvin Burnham, Editors - will be inserted into the index dc:all using - the same character normalization map w. + inside the <z:index> element, or any + element following a index processing instruction, + will be filtered through the + appropriate char map for character normalization, and will be + inserted in the named indexes. Finally, this example configuration can be queried using &pqf; @@ -311,21 +468,24 @@ Z> elem dc Z> form xml Z> - Z> f @attr 1=dc_creator Kumar - Z> scan @attr 1=dc_creator adam + Z> find @attr 1=control @attr 4=3 11224466 + Z> scan @attr 1=control @attr 4=3 "" + Z> + Z> find @attr 1=title program + Z> scan @attr 1=title "" Z> - Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity" - Z> scan @attr 1=dc_title abc + Z> find @attr 1=title @attr 4=2 "How to program a computer" + Z> scan @attr 1=title @attr 4=2 "" ]]> or the proprietary - extentions x-pquery and + extensions x-pquery and x-pScanClause to &sru;, and &srw; See for more information on &sru;/&srw; @@ -339,6 +499,16 @@ filter configuration files involves in this process, and that the literal index names are used during search and retrieval. 
+ + In case that we want to support the usual + bib-1 &z3950; numeric access points, it is a + good idea to choose string index names defined in the default + configuration file tab/bib1.att, see + + + +
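For orientation, the magic element flavor of the indexing format discussed in this hunk might look roughly as follows. The record ID 11224466, the rank 42, the title text, the namespace and the index pairs control:0, any:w, title:w, title:p and title:s are taken from the text; the attribute name used to carry the index pairs (name here) is an assumption, since the original markup of the example is not visible in this diff.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of an indexing document in the magic element flavor.
     ASSUMED: the index pairs are carried in an attribute called "name". -->
<z:record xmlns:z="http://indexdata.dk/zebra-2.0"
          z:id="11224466"
          z:rank="42">
  <!-- One name:type pair: the control number, indexed as type 0. -->
  <z:index name="control:0">11224466</z:index>
  <!-- Several pairs separated by a single space: the same text is sent
       to the indexes any:w, title:w, title:p and title:s. -->
  <z:index name="any:w title:w title:p title:s">How to program a computer</z:index>
</z:record>
```

There is exactly one record instruction, and each index instruction lists at least one indexname:indextype pair, which matches the rules given in the semantics list.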
+
@@ -350,14 +520,14 @@
&dom; Indexing Configuration - As mentioned above, there can be only one indexing - stylesheet, and configuration of the indexing process is a synonym + As mentioned above, there can be only one indexing pipeline, + and configuration of the indexing process is a synonym of writing an &xslt; stylesheet which produces &xml; output containing the - magic elements discussed in - . + magic processing instructions or elements discussed in + . Obviously, there are million of different ways to accomplish this - task, and some comments and code snippets are in order to lead - our paduans on the right track to the good side of the force. + task, and some comments and code snippets are in order to + enlighten the wary. Stylesheets can be written in the pull or @@ -365,13 +535,15 @@ means that the output &xml; structure is taken as starting point of the internal structure of the &xslt; stylesheet, and portions of the input &xml; are pulled out and inserted - into the right spots of the output &xml; structure. On the other - side, push &xslt; stylesheets are recursavly + into the right spots of the output &xml; structure. + On the other + side, push &xslt; stylesheets are recursively calling their template definitions, a process which is commanded - by the input &xml; structure, and avake to produce some output &xml; - whenever some special conditions in the input styelsheets are + by the input &xml; structure, and is triggered to produce + some output &xml; + whenever some special conditions in the input stylesheets are met. The pull type is well-suited for input - &xml; with strong and well-defined structure and semantcs, like the + &xml; with strong and well-defined structure and semantics, like the following &oai; indexing example, whereas the push type might be the only possible way to sort out deeply recursive input &xml; formats. 
@@ -383,29 +555,34 @@ + - + + + + - - + + + + - + - + @@ -414,7 +591,7 @@ - + @@ -430,19 +607,19 @@ that the names and types of the indexes can be defined in the indexing &xslt; stylesheet dynamically according to content in the original &xml; records, which has - opportunities for great power and wizardery as well as grande + opportunities for great power and wizardry as well as grande disaster. The following excerpt of a push stylesheet might be a good idea according to your strict control of the &xml; - input format (due to rigerours checking against well-defined and + input format (due to rigorous checking against well-defined and tight RelaxNG or &xml; Schema's, for example): - + @@ -464,7 +641,7 @@ - + @@ -472,7 +649,7 @@ - + @@ -480,13 +657,32 @@ ]]> - Don't be tempted to cross - the line to the dark side of the force, paduan; this leads - to suffering and pain, and universal - disentigration of your project schedule. + Don't be tempted to play too smart tricks with the power of + &xslt;, the above example will create zillions of + indexes with unpredictable names, resulting in severe &zebra; + index pollution..
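A pull stylesheet in the sense described above can be sketched in a few lines. This is not the stylesheet from the diff: the input format (a record element with control and title children) and the fixed rank value are hypothetical, and the name attribute on z:index is assumed rather than confirmed by the visible text. The point is only that the output structure is written out literally, and pieces of the input are pulled into it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal pull-style indexing stylesheet (XSLT 1.0).
     ASSUMED input:
       <record><control>...</control><title>...</title></record> -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra-2.0">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Suppress the built-in copying of text nodes. -->
  <xsl:template match="text()"/>

  <!-- The output skeleton is fixed; values are pulled from the input. -->
  <xsl:template match="/record">
    <!-- z:id is filled by an attribute value template; rank 42 is a placeholder. -->
    <z:record z:id="{control}" z:rank="42">
      <z:index name="control:0">
        <xsl:value-of select="control"/>
      </z:index>
      <z:index name="any:w title:w title:p title:s">
        <xsl:value-of select="title"/>
      </z:index>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
```

Because the index names are literal strings in the stylesheet, this style avoids the index pollution that dynamically generated index names can cause.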
+
+ Debugging &dom; Filter Configurations + + It can be very hard to debug a &dom; filter setup due to the many + successive &marc; syntax translations, &xml; stream splitting and + &xslt; transformations involved. As an aid, you always have the + power of the -s command line switch to the + zebraidx indexing command at hand: + + zebraidx -s -c zebra.cfg update some_record_stream.xml + + This command line simulates indexing and dumps a lot of debug + information in the logs, telling exactly which transformations + have been applied, what the documents look like after each + transformation, and which record ids and terms are sent to the indexer. + + 
+ + + @@ -519,7 +715,7 @@ - + @@ -535,16 +731,18 @@
+ --> + @@ -564,72 +763,6 @@
- - - -