X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Frecordmodel-domxml.xml;h=391b45338c15ce9a10e1e6e02b43b8f3e0526462;hp=bb5b300578cbfe00549ae91bfa5757ba2419d80b;hb=250de4ed23a44f5eb3552db317eef0d0fbe3265c;hpb=c99c50f588fb803362a47a933c988360ab1cd98c diff --git a/doc/recordmodel-domxml.xml b/doc/recordmodel-domxml.xml index bb5b300..391b453 100644 --- a/doc/recordmodel-domxml.xml +++ b/doc/recordmodel-domxml.xml @@ -1,45 +1,44 @@ - - &dom; &xml; Record Model and Filter Module - + &acro.dom; &acro.xml; Record Model and Filter Module + The record model described in this chapter applies to the fundamental, - structured &xml; - record type &dom;, introduced in - . The &dom; &xml; record model - is experimental, and it's inner workings might change in future + structured &acro.xml; + record type &acro.dom;, introduced in + . The &acro.dom; &acro.xml; record model + is experimental, and its inner workings might change in future releases of the &zebra; Information Server.
- &dom; Record Filter Architecture + &acro.dom; Record Filter Architecture - The &dom; &xml; filter uses a standard &dom; &xml; structure as + The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as internal data model, and can therefore parse, index, and display - any &xml; document type. It is well suited to work on - standardized &xml;-based formats such as Dublin Core, MODS, METS, + any &acro.xml; document type. It is well suited to work on + standardized &acro.xml;-based formats such as Dublin Core, MODS, METS, MARCXML, OAI-PMH, RSS, and performs equally well on any other - non-standard &xml; format. + non-standard &acro.xml; format. - A parser for binary &marc; records based on the ISO2709 library + A parser for binary &acro.marc; records based on the ISO2709 library standard is provided, it transforms these to the internal - &marcxml; &dom; representation. Other binary document parsers + &acro.marcxml; &acro.dom; representation. Other binary document parsers are planned to follow. - The &dom; filter architecture consists of four + The &acro.dom; filter architecture consists of four different pipelines, each being a chain of arbitrarily many successive - &xslt; transformations of the internal &dom; &xml; + &acro.xslt; transformations of the internal &acro.dom; &acro.xml; representations of documents.
- &dom; &xml; filter architecture + &acro.dom; &acro.xml; filter architecture @@ -50,7 +49,7 @@ - [Here there should be a diagram showing the &dom; &xml; + [Here there should be a diagram showing the &acro.dom; &acro.xml; filter architecture, but is seems that your tool chain has not been able to include the diagram in this document.] @@ -61,7 +60,7 @@ - &dom; &xml; filter pipelines overview + &acro.dom; &acro.xml; filter pipelines overview @@ -78,26 +77,26 @@ input first input parsing and initial - transformations to common &xml; format - Input raw &xml; record buffers, &xml; streams and - binary &marc; buffers - Common &xml; &dom; + transformations to common &acro.xml; format + Input raw &acro.xml; record buffers, &acro.xml; streams and + binary &acro.marc; buffers + Common &acro.xml; &acro.dom; extract second indexing term extraction transformations - Common &xml; &dom; - Indexing &xml; &dom; + Common &acro.xml; &acro.dom; + Indexing &acro.xml; &acro.dom; store second transformations before internal document storage - Common &xml; &dom; - Storage &xml; &dom; + Common &acro.xml; &acro.dom; + Storage &acro.xml; &acro.dom; retrieve @@ -105,40 +104,40 @@ multiple document retrieve transformations from storage to different output formats are possible - Storage &xml; &dom; - Output &xml; syntax in requested formats + Storage &acro.xml; &acro.dom; + Output &acro.xml; syntax in requested formats
- The &dom; &xml; filter pipelines use &xslt; (and, if supported on - your platform, even &exslt;), which brings full &xpath; + The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on + your platform, even &acro.exslt;), which brings full &acro.xpath; support to the indexing, storage and display rules of not only - &xml; documents, but also binary &marc; records. + &acro.xml; documents, but also binary &acro.marc; records.
- &dom; &xml; filter pipeline configuration + &acro.dom; &acro.xml; filter pipeline configuration - The experimental, loadable &dom; &xml;/&xslt; filter module + The experimental, loadable &acro.dom; &acro.xml;/&acro.xslt; filter module mod-dom.so is invoked by the zebra.cfg configuration statement recordtype.xml: dom.db/filter_dom_conf.xml - In this example the &dom; &xml; filter is configured to work + In this example the &acro.dom; &acro.xml; filter is configured to work on all data files with suffix *.xml, where the configuration file is found in the path db/filter_dom_conf.xml. - The &dom; &xslt; filter configuration file must be - valid &xml;. It might look like this: + The &acro.dom; &acro.xslt; filter configuration file must be + valid &acro.xml;. It might look like this: @@ -147,7 +146,7 @@ - + @@ -164,9 +163,9 @@ - The root &xml; element <dom> and all other &dom; - &xml; filter elements are residing in the namespace - xmlns="http://indexdata.dk/zebra-2.0". + The root &acro.xml; element <dom> and all other &acro.dom; + &acro.xml; filter elements are residing in the namespace + xmlns="http://indexdata.com/zebra-2.0". All pipeline definition elements - i.e. the @@ -180,7 +179,7 @@ All pipeline definition elements may contain zero or more ]]> - &xslt; transformation instructions, which are performed + &acro.xslt; transformation instructions, which are performed sequentially from top to bottom. 
The paths in the stylesheet attributes are relative to zebras working directory, or absolute to the file @@ -192,22 +191,22 @@ Input pipeline The <input> pipeline definition element - may contain either one &xml; Reader definition + may contain either one &acro.xml; Reader definition ]]>, used to split - an &xml; collection input stream into individual &xml; &dom; + an &acro.xml; collection input stream into individual &acro.xml; &acro.dom; documents at the prescribed element level, - or one &marc; binary + or one &acro.marc; binary parsing instruction ]]>, which defines - a conversion to &marcxml; format &dom; trees. The allowed values + a conversion to &acro.marcxml; format &acro.dom; trees. The allowed values of the inputcharset attribute depend on your local iconv set-up. - Both input parsers deliver individual &dom; &xml; documents to the + Both input parsers deliver individual &acro.dom; &acro.xml; documents to the following chain of zero or more ]]> - &xslt; transformations. At the end of this pipeline, the documents + &acro.xslt; transformations. At the end of this pipeline, the documents are in the common format, used to feed both the <extract> and <store> pipelines. @@ -218,11 +217,11 @@ Extract pipeline The <extract> pipeline takes documents - from any common &dom; &xml; format to the &zebra; specific - indexing &dom; &xml; format. + from any common &acro.dom; &acro.xml; format to the &zebra; specific + indexing &acro.dom; &acro.xml; format. It may consist of zero ore more ]]> - &xslt; transformations, and the outcome is handled to the + &acro.xslt; transformations, and the outcome is handled to the &zebra; core to drive the process of building the inverted indexes. See for @@ -233,11 +232,11 @@
Store pipeline The <store> pipeline takes documents - from any common &dom; &xml; format to the &zebra; specific - storage &dom; &xml; format. + from any common &acro.dom; &acro.xml; format to the &zebra; specific + storage &acro.dom; &acro.xml; format. It may consist of zero or more ]]> - &xslt; transformations, and the outcome is handed to the + &acro.xslt; transformations, and the outcome is handed to the &zebra; core for deposition into the internal storage system.
@@ -248,9 +247,9 @@ <retrieve> pipeline definitions, each of them again consisting of zero or more ]]> - &xslt; transformations. These are used for document - presentation after search, and take the internal storage &dom; - &xml; to the requested output formats during record present + &acro.xslt; transformations. These are used for document + presentation after search, and take the internal storage &acro.dom; + &acro.xml; to the requested output formats during record present requests.
@@ -259,9 +258,9 @@ are distinguished by their unique name attributes, these are the literal schema or element set names used in - &srw;, - &sru; and - &z3950; protocol queries. + &acro.srw;, + &acro.sru; and + &acro.z3950; protocol queries.
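As a minimal sketch of the four pipelines described above, a filter configuration file might be laid out as follows. The stylesheet file names and the retrieve names dc and mods are hypothetical placeholders, and the level and inputcharset values are illustrative assumptions, not required settings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: stylesheet paths and retrieve names are hypothetical -->
<dom xmlns="http://indexdata.com/zebra-2.0">

  <!-- input: split an XML collection stream into individual records -->
  <input>
    <xmlreader level="1"/>
    <!-- alternative for binary MARC input (use one parser or the other):
         <marc inputcharset="marc-8"/> -->
    <xslt stylesheet="input2common.xsl"/>
  </input>

  <!-- extract: common format to Zebra indexing format -->
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>

  <!-- store: common format to internal storage format -->
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>

  <!-- retrieve: one pipeline per schema or element set name -->
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>

</dom>
```

Each pipeline may list several xslt elements; they would then run in document order, the output of one feeding the next.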
@@ -270,23 +269,23 @@ Canonical Indexing Format - &dom; &xml; indexing comes in two flavors: pure - processing-instruction governed plain &xml; documents, and - very - similar to the Alvis filter indexing format - &xml; documents - containing &xml; <record> and + &acro.dom; &acro.xml; indexing comes in two flavors: pure + processing-instruction governed plain &acro.xml; documents, and - very + similar to the Alvis filter indexing format - &acro.xml; documents + containing &acro.xml; <record> and <index> instructions from the magic - namespace xmlns:z="http://indexdata.dk/zebra-2.0". + namespace xmlns:z="http://indexdata.com/zebra-2.0".
Processing-instruction governed indexing format The output of the processing instruction driven - indexing &xslt; stylesheets must contain + indexing &acro.xslt; stylesheets must contain processing instructions named zebra-2.0. - The output of the &xslt; indexing transformation is then - parsed using &dom; methods, and the contained instructions are + The output of the &acro.xslt; indexing transformation is then + parsed using &acro.dom; methods, and the contained instructions are performed on the elements and their subtrees directly following the processing instructions. @@ -314,11 +313,11 @@
Magic element governed indexing format - The output of the indexing &xslt; stylesheets must contain + The output of the indexing &acro.xslt; stylesheets must contain certain elements in the magic - xmlns:z="http://indexdata.dk/zebra-2.0" - namespace. The output of the &xslt; indexing transformation is then - parsed using &dom; methods, and the contained instructions are + xmlns:z="http://indexdata.com/zebra-2.0" + namespace. The output of the &acro.xslt; indexing transformation is then + parsed using &acro.dom; methods, and the contained instructions are performed on the magic elements and their subtrees. @@ -355,7 +354,7 @@ processing instructions named zebra-2.0 or elements contained in the namespace - xmlns:z="http://indexdata.dk/zebra-2.0". + xmlns:z="http://indexdata.com/zebra-2.0". @@ -365,13 +364,66 @@ - The unique record instruction - may have additional attributes id and - rank, where the value of the opaque ID - may be any string not containing the whitespace character - ' ', and the rank value must be a + + The unique record instruction + may have additional attributes id, + rank and type. + Attribute id is the value of the opaque ID + and may be any string not containing the whitespace character + ' '. + The rank attribute value must be a non-negative integer. See - + . + The type attribute specifies how the record + is to be treated. The following values may be given for + type: + + + insert + + + The record is inserted. If the record already exists, it is + skipped (i.e. not replaced). + + + + + replace + + + The record is replaced. If the record does not already exist, + it is skipped (i.e. not inserted). + + + + + delete + + + The record is deleted. If the record does not already exist, + it is skipped (i.e. nothing is deleted). + + + + + update + + + The record is inserted or replaced depending on whether the + record exists or not. 
This is the default behavior but may + be effectively changed by "outside" the scope of the DOM + filter by zebraidx commands or extended services updates. + + + + + Note that the value of type is only used to + determine the action if and only if the Zebra indexer is running + in "update" mode (i.e zebraidx update) or if the specialUpdate + action of the + Extended + Service Update is used. + For this reason a specialUpdate may end up deleting records! @@ -400,18 +452,30 @@ for details. + + + &acro.dom; input documents which are not resulting in both one + unique valid + record instruction and one or more valid + index instructions can not be searched and + found. Therefore, + invalid document processing is aborted, and any content of + the <extract> and + <store> pipelines is discarded. + A warning is issued in the logs. + + - The examples work as follows: - From the original &xml; file - marc-one.xml (or from the &xml; record &dom; of the + From the original &acro.xml; file + marc-one.xml (or from the &acro.xml; record &acro.dom; of the same form coming from an <input> pipeline), the indexing pipeline <extract> - produces an indexing &xml; record, which is defined by + produces an indexing &acro.xml; record, which is defined by the record instruction &zebra; uses the content of z:id="11224466" @@ -447,8 +511,8 @@ inserted in the named indexes. 
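To make the record and index instructions concrete, here is a sketch of what an indexing record in the magic element format could look like for the sample record. The z:id value comes from the text above; the index names, the z:rank value and the spelling of z:type as an attribute (by analogy with z:id) are assumptions for illustration.

```xml
<!-- Sketch of an indexing record as produced by the extract pipeline.
     Index names, z:rank and z:type are illustrative assumptions. -->
<z:record xmlns:z="http://indexdata.com/zebra-2.0"
          z:id="11224466" z:rank="42" z:type="update">
  <!-- one unique record instruction, one or more index instructions -->
  <z:index name="control:0">11224466</z:index>
  <z:index name="title:w title:p title:s">How to program a computer</z:index>
  <z:index name="author:w author:p">Jack Collins</z:index>
</z:record>
```

A document lacking either the single record instruction or at least one index instruction would, per the note above, be rejected with a warning in the logs.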
- Finally, this example configuration can be queried using &pqf; - queries, either transported by &z3950;, (here using a yaz-client) + Finally, this example configuration can be queried using &acro.pqf; + queries, either transported by &acro.z3950;, (here using a yaz-client) open localhost:9999 @@ -468,27 +532,27 @@ or the proprietary extensions x-pquery and x-pScanClause to - &sru;, and &srw; + &acro.sru;, and &acro.srw; - See for more information on &sru;/&srw; + See for more information on &acro.sru;/&acro.srw; configuration, and or the &yaz; - &cql; section + &acro.cql; section for the details or the &yaz; frontend server. Notice that there are no *.abs, - *.est, *.map, or other &grs1; + *.est, *.map, or other &acro.grs1; filter configuration files involves in this process, and that the literal index names are used during search and retrieval. In case that we want to support the usual - bib-1 &z3950; numeric access points, it is a + bib-1 &acro.z3950; numeric access points, it is a good idea to choose string index names defined in the default configuration file tab/bib1.att, see @@ -501,15 +565,15 @@
- &dom; Record Model Configuration + &acro.dom; Record Model Configuration
- &dom; Indexing Configuration + &acro.dom; Indexing Configuration As mentioned above, there can be only one indexing pipeline, and configuration of the indexing process is a synonym - of writing an &xslt; stylesheet which produces &xml; output containing the + of writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the magic processing instructions or elements discussed in . Obviously, there are million of different ways to accomplish this @@ -519,32 +583,32 @@ Stylesheets can be written in the pull or the push style: pull - means that the output &xml; structure is taken as starting point of - the internal structure of the &xslt; stylesheet, and portions of - the input &xml; are pulled out and inserted - into the right spots of the output &xml; structure. + means that the output &acro.xml; structure is taken as starting point of + the internal structure of the &acro.xslt; stylesheet, and portions of + the input &acro.xml; are pulled out and inserted + into the right spots of the output &acro.xml; structure. On the other - side, push &xslt; stylesheets are recursively + side, push &acro.xslt; stylesheets are recursively calling their template definitions, a process which is commanded - by the input &xml; structure, and is triggered to produce - some output &xml; + by the input &acro.xml; structure, and is triggered to produce + some output &acro.xml; whenever some special conditions in the input stylesheets are met. The pull type is well-suited for input - &xml; with strong and well-defined structure and semantics, like the - following &oai; indexing example, whereas the + &acro.xml; with strong and well-defined structure and semantics, like the + following &acro.oai; indexing example, whereas the push type might be the only possible way to - sort out deeply recursive input &xml; formats. + sort out deeply recursive input &acro.xml; formats. 
A pull stylesheet example used to index - &oai; harvested records could use some of the following template + &acro.oai; harvested records could use some of the following template definitions: @@ -569,7 +633,7 @@ - + @@ -589,20 +653,120 @@ ]]> +
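A pull stylesheet of the kind described above might be sketched as follows. It assumes that each input document has an oai:record root element carrying oai_dc metadata; the index names are hypothetical and this is not the stylesheet shipped with &zebra;.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai oai_dc dc">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Pull style: the output structure is written out once, and values
       are pulled from fixed places in the input record. -->
  <xsl:template match="/oai:record">
    <z:record z:id="{normalize-space(oai:header/oai:identifier)}">
      <!-- hypothetical index names below -->
      <z:index name="oai_identifier:0">
        <xsl:value-of select="normalize-space(oai:header/oai:identifier)"/>
      </z:index>
      <xsl:for-each select="oai:metadata/oai_dc:dc/dc:title">
        <z:index name="title:w any:w">
          <xsl:value-of select="."/>
        </z:index>
      </xsl:for-each>
      <xsl:for-each select="oai:metadata/oai_dc:dc/dc:creator">
        <z:index name="author:w any:w">
          <xsl:value-of select="."/>
        </z:index>
      </xsl:for-each>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
```

The output skeleton (one z:record with its z:index children) is fixed in the stylesheet, which is what makes this the pull style; only the selected values vary with the input.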
+ + +
+ &acro.dom; Indexing &acro.marcxml; + + The &acro.dom; filter allows indexing of both binary &acro.marc; records + and &acro.marcxml; records, depending on its configuration. + A typical &acro.marcxml; record might look like this: + + + 42 + 00366nam 22001698a 4500 + 11224466 + DLC + 00000000000000.0 + 910710c19910701nju 00010 eng + + 11224466 + + + DLC + DLC + + + 123-xyz + + + Jack Collins + + + How to program a computer + + + Penguin + + + 8710 + + + p. cm. + + + ]]> + + + - Notice also, - that the names and types of the indexes can be defined in the - indexing &xslt; stylesheet dynamically according to - content in the original &xml; records, which has + It is easily possible to make string manipulation in the &acro.dom; + filter. For example, if you want to drop some leading articles + in the indexing of sort fields, you might want to pick out the + &acro.marcxml; indicator attributes to chop of leading substrings. If + the above &acro.xml; example would have an indicator + ind2="8" in the title field + 245, i.e. + + + How to program a computer + + ]]> + + one could write a template taking into account this information + to chop the first 8 characters from the + sorting index title:s like this: + + + + + 0 + + + + + + + + + + + + + + ]]> + + The output of the above &acro.marcxml; and &acro.xslt; excerpt would then be: + + How to program a computer + program a computer + ]]> + + and the record would be sorted in the title index under 'P', not 'H'. + +
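The indicator-driven chopping described above could be sketched like this. The template is an assumption about how such a stylesheet might be written, not the original example: the index names other than title:s are hypothetical, and the surrounding record instruction is left out for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0"
    xmlns:m="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="m">

  <!-- suppress text that is not explicitly indexed -->
  <xsl:template match="text()"/>

  <xsl:template match="m:datafield[@tag='245']">
    <!-- offset from ind2; falls back to 0 when ind2 is blank or missing -->
    <xsl:variable name="chop">
      <xsl:choose>
        <xsl:when test="string(number(@ind2)) = 'NaN'">0</xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="number(@ind2)"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <!-- word and phrase indexes keep the full title -->
    <z:index name="title:w title:p any:w">
      <xsl:value-of select="m:subfield[@code='a']"/>
    </z:index>

    <!-- sort index: substring() counts from 1, so an offset of 8
         keeps everything from the eighth character onwards -->
    <z:index name="title:s">
      <xsl:value-of select="substring(m:subfield[@code='a'], $chop)"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
```

Because XPath positions start at 1, substring(..., 8) applied to the sample title begins at the "p" of "program", and an offset of 0 leaves the title unchanged.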
+ + +
+ &acro.dom; Indexing Wizardry + + The names and types of the indexes can be defined in the + indexing &acro.xslt; stylesheet dynamically according to + content in the original &acro.xml; records, which has opportunities for great power and wizardry as well as grande disaster. The following excerpt of a push stylesheet might - be a good idea according to your strict control of the &xml; + be a good idea according to your strict control of the &acro.xml; input format (due to rigorous checking against well-defined and - tight RelaxNG or &xml; Schema's, for example): + tight RelaxNG or &acro.xml; Schema's, for example): @@ -613,11 +777,11 @@ ]]> This template creates indexes which have the name of the working - node of any input &xml; file, and assigns a '1' to the index. + node of any input &acro.xml; file, and assigns a '1' to the index. The example query find @attr 1=xyz 1 finds all files which contain at least one - xyz &xml; element. In case you can not control + xyz &acro.xml; element. In case you can not control which element names the input files contain, you might ask for disaster and bad karma using this technique. @@ -645,25 +809,44 @@ ]]> Don't be tempted to play too smart tricks with the power of - &xslt;, the above example will create zillions of + &acro.xslt;, the above example will create zillions of indexes with unpredictable names, resulting in severe &zebra; index pollution..
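For illustration, the first template described above, which names an index after each working node, might look like the following sketch. Using local-name() instead of name() is an assumption made here so that namespace prefixes do not end up in index names; the :w index type is likewise illustrative, and the enclosing record instruction is omitted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0">

  <!-- ignore text content; only element names are indexed -->
  <xsl:template match="text()"/>

  <!-- Push style: fires for every element in the input, so the set of
       index names is dictated entirely by the input documents. -->
  <xsl:template match="*">
    <z:index name="{local-name()}:w">1</z:index>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

With input validated against a tight schema the set of index names stays bounded; without that control, every new element name in the data creates a new index.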
+
+ Debugging &acro.dom; Filter Configurations + + It can be very hard to debug a &acro.dom; filter setup due to the many + successive &acro.marc; syntax translations, &acro.xml; stream splitting and + &acro.xslt; transformations involved. As an aid, you always have the + -s command line switch of the + zebraidx indexing command at hand: + + zebraidx -s -c zebra.cfg update some_record_stream.xml + + This command line simulates indexing and dumps a lot of debug + information in the logs, telling exactly which transformations + have been applied, what the documents look like after each + transformation, and which record ids and terms are sent to the indexer. + +
+ + + @@ -683,7 +866,7 @@ - + @@ -699,18 +882,19 @@
+ -->