X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Frecordmodel-domxml.xml;h=391b45338c15ce9a10e1e6e02b43b8f3e0526462;hp=a9b85db7d726fb7eb9b50cc5679cf01f240fd8f2;hb=250de4ed23a44f5eb3552db317eef0d0fbe3265c;hpb=8ade6bf0476b510499488f499156604172b8d1fc diff --git a/doc/recordmodel-domxml.xml b/doc/recordmodel-domxml.xml index a9b85db..391b453 100644 --- a/doc/recordmodel-domxml.xml +++ b/doc/recordmodel-domxml.xml @@ -1,45 +1,44 @@ - - &dom; &xml; Record Model and Filter Module - + &acro.dom; &acro.xml; Record Model and Filter Module + The record model described in this chapter applies to the fundamental, - structured &xml; - record type &dom;, introduced in - . The &dom; &xml; record model - is experimental, and it's inner workings might change in future + structured &acro.xml; + record type &acro.dom;, introduced in + . The &acro.dom; &acro.xml; record model + is experimental, and its inner workings might change in future releases of the &zebra; Information Server.
- &dom; Record Filter Architecture + &acro.dom; Record Filter Architecture - The &dom; &xml; filter uses a standard &dom; &xml; structure as + The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as internal data model, and can therefore parse, index, and display - any &xml; document type. It is well suited to work on - standardized &xml;-based formats such as Dublin Core, MODS, METS, + any &acro.xml; document type. It is well suited to work on + standardized &acro.xml;-based formats such as Dublin Core, MODS, METS, MARCXML, OAI-PMH, RSS, and performs equally well on any other - non-standard &xml; format. + non-standard &acro.xml; format. - A parser for binary &marc; records based on the ISO2709 library + A parser for binary &acro.marc; records based on the ISO2709 library standard is provided, it transforms these to the internal - &marcxml; &dom; representation. Other binary document parsers + &acro.marcxml; &acro.dom; representation. Other binary document parsers are planned to follow. - The &dom; filter architecture consists of four + The &acro.dom; filter architecture consists of four different pipelines, each being a chain of arbitrarily many successive - &xslt; transformations of the internal &dom; &xml; + &acro.xslt; transformations of the internal &acro.dom; &acro.xml; representations of documents.
- &dom; &xml; filter architecture + &acro.dom; &acro.xml; filter architecture @@ -50,7 +49,7 @@ - [Here there should be a diagram showing the &dom; &xml; + [Here there should be a diagram showing the &acro.dom; &acro.xml; filter architecture, but it seems that your tool chain has not been able to include the diagram in this document.] @@ -61,7 +60,7 @@ - &dom; &xml; filter pipelines overview + &acro.dom; &acro.xml; filter pipelines overview @@ -78,26 +77,26 @@ input first input parsing and initial - transformations to common &xml; format - Input raw &xml; record buffers, &xml; streams and - binary &marc; buffers - Common &xml; &dom; + transformations to common &acro.xml; format + Input raw &acro.xml; record buffers, &acro.xml; streams and + binary &acro.marc; buffers + Common &acro.xml; &acro.dom; extract second indexing term extraction transformations - Common &xml; &dom; - Indexing &xml; &dom; + Common &acro.xml; &acro.dom; + Indexing &acro.xml; &acro.dom; store second transformations before internal document storage - Common &xml; &dom; - Storage &xml; &dom; + Common &acro.xml; &acro.dom; + Storage &acro.xml; &acro.dom; retrieve @@ -105,40 +104,40 @@ multiple document retrieve transformations from storage to different output formats are possible - Storage &xml; &dom; - Output &xml; syntax in requested formats + Storage &acro.xml; &acro.dom; + Output &acro.xml; syntax in requested formats
- The &dom; &xml; filter pipelines use &xslt; (and if supported on - your platform, even &exslt;), it brings thus full &xpath; + The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and if supported on + your platform, even &acro.exslt;), it brings thus full &acro.xpath; support to the indexing, storage and display rules of not only - &xml; documents, but also binary &marc; records. + &acro.xml; documents, but also binary &acro.marc; records.
- &dom; &xml; filter pipeline configuration + &acro.dom; &acro.xml; filter pipeline configuration - The experimental, loadable &dom; &xml;/&xslt; filter module + The experimental, loadable &acro.dom; &acro.xml;/&acro.xslt; filter module mod-dom.so is invoked by the zebra.cfg configuration statement recordtype.xml: dom.db/filter_dom_conf.xml - In this example the &dom; &xml; filter is configured to work + In this example the &acro.dom; &acro.xml; filter is configured to work on all data files with suffix *.xml, where the configuration file is found in the path db/filter_dom_conf.xml. - The &dom; &xslt; filter configuration file must be - valid &xml;. It might look like this: + The &acro.dom; &acro.xslt; filter configuration file must be + valid &acro.xml;. It might look like this: @@ -147,7 +146,7 @@ - + @@ -164,9 +163,9 @@ - The root &xml; element <dom> and all other &dom; - &xml; filter elements are residing in the namespace - xmlns="http://indexdata.dk/zebra-2.0". + The root &acro.xml; element <dom> and all other &acro.dom; + &acro.xml; filter elements are residing in the namespace + xmlns="http://indexdata.com/zebra-2.0". All pipeline definition elements - i.e. the @@ -180,7 +179,7 @@ All pipeline definition elements may contain zero or more ]]> - &xslt; transformation instructions, which are performed + &acro.xslt; transformation instructions, which are performed sequentially from top to bottom. 
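To make the element layout described above concrete, here is a minimal sketch of such a filter configuration file. It is an illustration under assumptions, not a verbatim copy of a shipped file: the child element names xmlreader, marc and xslt, and the level and inputcharset values, follow the usual Zebra DOM filter syntax as far as we know it, and all stylesheet file names and retrieve names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Referenced from zebra.cfg by: recordtype.xml: dom.db/filter_dom_conf.xml -->
<dom xmlns="http://indexdata.com/zebra-2.0">

  <!-- input: split an XML collection stream into single-record DOM trees -->
  <input>
    <xmlreader level="1"/>
    <!-- assumed alternative for binary MARC input, used instead of xmlreader:
         <marc inputcharset="marc-8"/> -->
  </input>

  <!-- extract: common format to Zebra indexing format (hypothetical file name) -->
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>

  <!-- store: common format to storage format (hypothetical file name) -->
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>

  <!-- retrieve: one pipeline per schema or element set name (hypothetical names) -->
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>

</dom>
```

Each pipeline element may hold several xslt children; they would run in document order, the output of one feeding the next.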
The paths in the stylesheet attributes are relative to Zebra's working directory, or absolute to the file @@ -192,22 +191,22 @@ Input pipeline The <input> pipeline definition element - may contain either one &xml; Reader definition + may contain either one &acro.xml; Reader definition ]]>, used to split - an &xml; collection input stream into individual &xml; &dom; + an &acro.xml; collection input stream into individual &acro.xml; &acro.dom; documents at the prescribed element level, - or one &marc; binary + or one &acro.marc; binary parsing instruction ]]>, which defines - a conversion to &marcxml; format &dom; trees. The allowed values + a conversion to &acro.marcxml; format &acro.dom; trees. The allowed values of the inputcharset attribute depend on your local iconv set-up. - Both input parsers deliver individual &dom; &xml; documents to the + Both input parsers deliver individual &acro.dom; &acro.xml; documents to the following chain of zero or more ]]> - &xslt; transformations. At the end of this pipeline, the documents + &acro.xslt; transformations. At the end of this pipeline, the documents are in the common format, used to feed both the <extract> and <store> pipelines. @@ -218,11 +217,11 @@ Extract pipeline The <extract> pipeline takes documents - from any common &dom; &xml; format to the &zebra; specific - indexing &dom; &xml; format. + from any common &acro.dom; &acro.xml; format to the &zebra; specific + indexing &acro.dom; &acro.xml; format. It may consist of zero or more ]]> - &xslt; transformations, and the outcome is handled to the + &acro.xslt; transformations, and the outcome is handed to the &zebra; core to drive the process of building the inverted indexes. See for @@ -233,11 +232,11 @@
Store pipeline The <store> pipeline takes documents - from any common &dom; &xml; format to the &zebra; specific - storage &dom; &xml; format. + from any common &acro.dom; &acro.xml; format to the &zebra; specific + storage &acro.dom; &acro.xml; format. It may consist of zero or more ]]> - &xslt; transformations, and the outcome is handled to the + &acro.xslt; transformations, and the outcome is handed to the &zebra; core for deposition into the internal storage system.
@@ -248,9 +247,9 @@ <retrieve> pipeline definitions, each of them again consisting of zero or more ]]> - &xslt; transformations. These are used for document - presentation after search, and take the internal storage &dom; - &xml; to the requested output formats during record present + &acro.xslt; transformations. These are used for document + presentation after search, and take the internal storage &acro.dom; + &acro.xml; to the requested output formats during record present requests.
@@ -259,9 +258,9 @@ are distinguished by their unique name attributes, these are the literal schema or element set names used in - &srw;, - &sru; and - &z3950; protocol queries. + &acro.srw;, + &acro.sru; and + &acro.z3950; protocol queries.
@@ -270,23 +269,23 @@ Canonical Indexing Format - &dom; &xml; indexing comes in two flavors: pure - processing-instruction governed plain &xml; documents, and - very - similar to the Alvis filter indexing format - &xml; documents - containing &xml; <record> and + &acro.dom; &acro.xml; indexing comes in two flavors: pure + processing-instruction governed plain &acro.xml; documents, and - very + similar to the Alvis filter indexing format - &acro.xml; documents + containing &acro.xml; <record> and <index> instructions from the magic - namespace xmlns:z="http://indexdata.dk/zebra-2.0". + namespace xmlns:z="http://indexdata.com/zebra-2.0".
Processing-instruction governed indexing format The output of the processing instruction driven - indexing &xslt; stylesheets must contain + indexing &acro.xslt; stylesheets must contain processing instructions named zebra-2.0. - The output of the &xslt; indexing transformation is then - parsed using &dom; methods, and the contained instructions are + The output of the &acro.xslt; indexing transformation is then + parsed using &acro.dom; methods, and the contained instructions are performed on the elements and their subtrees directly following the processing instructions. @@ -314,11 +313,11 @@
Magic element governed indexing format - The output of the indexing &xslt; stylesheets must contain + The output of the indexing &acro.xslt; stylesheets must contain certain elements in the magic - xmlns:z="http://indexdata.dk/zebra-2.0" - namespace. The output of the &xslt; indexing transformation is then - parsed using &dom; methods, and the contained instructions are + xmlns:z="http://indexdata.com/zebra-2.0" + namespace. The output of the &acro.xslt; indexing transformation is then + parsed using &acro.dom; methods, and the contained instructions are performed on the magic elements and their subtrees. @@ -355,7 +354,7 @@ processing instructions named zebra-2.0 or elements contained in the namespace - xmlns:z="http://indexdata.dk/zebra-2.0". + xmlns:z="http://indexdata.com/zebra-2.0". @@ -365,13 +364,66 @@ - The unique record instruction - may have additional attributes id and - rank, where the value of the opaque ID - may be any string not containing the whitespace character - ' ', and the rank value must be a + + The unique record instruction + may have additional attributes id, + rank and type. + Attribute id is the value of the opaque ID + and may be any string not containing the whitespace character + ' '. + The rank attribute value must be a non-negative integer. See - + . + The type attribute specifies how the record + is to be treated. The following values may be given for + type: + + + insert + + + The record is inserted. If the record already exists, it is + skipped (i.e. not replaced). + + + + + replace + + + The record is replaced. If the record does not already exist, + it is skipped (i.e. not inserted). + + + + + delete + + + The record is deleted. If the record does not already exist, + it is skipped (i.e. nothing is deleted). + + + + + update + + + The record is inserted or replaced depending on whether the + record exists or not. 
This is the default behavior but may + be effectively changed from "outside" the scope of the DOM + filter by zebraidx commands or extended services updates. + + + + + Note that the value of type is used to + determine the action only if the Zebra indexer is running + in "update" mode (i.e. zebraidx update) or if the specialUpdate + action of the + Extended + Service Update is used. + For this reason a specialUpdate may end up deleting records! @@ -402,29 +454,28 @@ - &dom; input documents which are not resulting in both one + &acro.dom; input documents which are not resulting in both one unique valid record instruction and one or more valid index instructions cannot be searched and found. Therefore, invalid document processing is aborted, and any content of the <extract> and - <store> pipelines is discarted. + <store> pipelines is discarded. A warning is issued in the logs. - The examples work as follows: - From the original &xml; file - marc-one.xml (or from the &xml; record &dom; of the + From the original &acro.xml; file + marc-one.xml (or from the &acro.xml; record &acro.dom; of the same form coming from an <input> pipeline), the indexing pipeline <extract> - produces an indexing &xml; record, which is defined by + produces an indexing &acro.xml; record, which is defined by the record instruction &zebra; uses the content of z:id="11224466" @@ -460,8 +511,8 @@ inserted in the named indexes.
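As a hedged illustration of the magic element flavor, an indexing record produced by the <extract> pipeline could look like the sketch below. The z:id value is the one quoted above; the z:rank and z:type values, the index names and the element contents are assumptions chosen for the example, not output captured from a real run.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- One record instruction per document: opaque ID, optional rank and type.
     z:type="update" is the default action described above (assumed spelling). -->
<z:record xmlns:z="http://indexdata.com/zebra-2.0"
          z:id="11224466" z:rank="42" z:type="update">

  <!-- Each index instruction names one or more indexes as name:type pairs;
       the text content is what is inserted into the named indexes. -->
  <z:index name="control:0">11224466</z:index>
  <z:index name="any:w title:w title:p title:s">How to program a computer</z:index>

</z:record>
```

A document lacking the record instruction, or lacking any index instruction, would be rejected as described above.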
- Finally, this example configuration can be queried using &pqf; - queries, either transported by &z3950;, (here using a yaz-client) + Finally, this example configuration can be queried using &acro.pqf; + queries, either transported by &acro.z3950;, (here using a yaz-client) open localhost:9999 @@ -481,27 +532,27 @@ or the proprietary extensions x-pquery and x-pScanClause to - &sru;, and &srw; + &acro.sru;, and &acro.srw; - See for more information on &sru;/&srw; + See for more information on &acro.sru;/&acro.srw; configuration, and or the &yaz; - &cql; section + &acro.cql; section for the details of the &yaz; frontend server. Notice that there are no *.abs, - *.est, *.map, or other &grs1; + *.est, *.map, or other &acro.grs1; filter configuration files involved in this process, and that the literal index names are used during search and retrieval. In case we want to support the usual - bib-1 &z3950; numeric access points, it is a + bib-1 &acro.z3950; numeric access points, it is a good idea to choose string index names defined in the default configuration file tab/bib1.att, see @@ -514,15 +565,15 @@
- &dom; Record Model Configuration + &acro.dom; Record Model Configuration
- &dom; Indexing Configuration + &acro.dom; Indexing Configuration As mentioned above, there can be only one indexing pipeline, and configuration of the indexing process is a synonym - of writing an &xslt; stylesheet which produces &xml; output containing the + of writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the magic processing instructions or elements discussed in . Obviously, there are millions of different ways to accomplish this @@ -532,32 +583,32 @@ Stylesheets can be written in the pull or the push style: pull - means that the output &xml; structure is taken as starting point of - the internal structure of the &xslt; stylesheet, and portions of - the input &xml; are pulled out and inserted - into the right spots of the output &xml; structure. + means that the output &acro.xml; structure is taken as starting point of + the internal structure of the &acro.xslt; stylesheet, and portions of + the input &acro.xml; are pulled out and inserted + into the right spots of the output &acro.xml; structure. On the other - side, push &xslt; stylesheets are recursively + side, push &acro.xslt; stylesheets are recursively calling their template definitions, a process which is commanded - by the input &xml; structure, and is triggered to produce - some output &xml; + by the input &acro.xml; structure, and is triggered to produce + some output &acro.xml; whenever some special conditions in the input stylesheets are met. The pull type is well-suited for input - &xml; with strong and well-defined structure and semantics, like the - following &oai; indexing example, whereas the + &acro.xml; with strong and well-defined structure and semantics, like the + following &acro.oai; indexing example, whereas the push type might be the only possible way to - sort out deeply recursive input &xml; formats. + sort out deeply recursive input &acro.xml; formats.
A pull stylesheet example used to index - &oai; harvested records could use some of the following template + &acro.oai; harvested records could use some of the following template definitions: @@ -582,7 +633,7 @@ - + @@ -602,20 +653,120 @@ ]]> +
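A pull-style indexing stylesheet for harvested OAI records with Dublin Core metadata could be sketched as follows. This is an assumption-laden illustration, not the stylesheet shipped with Zebra: the index names (oai_identifier, dc_title, dc_any) are hypothetical, and only two of the many possible fields are shown. The OAI and Dublin Core namespace URIs are the standard ones.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai oai_dc dc">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Pull style: the output record skeleton is written out here, and the
       wanted pieces of the input are selected explicitly into it. -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">

      <!-- hypothetical index name: exact-match identifier index -->
      <z:index name="oai_identifier:0">
        <xsl:value-of select="normalize-space(oai:record/oai:header/oai:identifier)"/>
      </z:index>

      <!-- one index instruction per dc:title found in the metadata -->
      <xsl:for-each select="oai:record/oai:metadata/oai_dc:dc/dc:title">
        <z:index name="dc_any:w dc_title:w dc_title:p dc_title:s">
          <xsl:value-of select="normalize-space(.)"/>
        </z:index>
      </xsl:for-each>

    </z:record>
  </xsl:template>

</xsl:stylesheet>
```

Because the stylesheet never calls xsl:apply-templates, input elements that are not selected explicitly contribute nothing to the indexing record.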
+ + +
+ &acro.dom; Indexing &acro.marcxml; + + The &acro.dom; filter allows indexing of both binary &acro.marc; records + and &acro.marcxml; records, depending on its configuration. + A typical &acro.marcxml; record might look like this: + + + 42 + 00366nam 22001698a 4500 + 11224466 + DLC + 00000000000000.0 + 910710c19910701nju 00010 eng + + 11224466 + + + DLC + DLC + + + 123-xyz + + + Jack Collins + + + How to program a computer + + + Penguin + + + 8710 + + + p. cm. + + + ]]> + + + + + It is easy to manipulate strings in the &acro.dom; + filter. For example, if you want to drop some leading articles + in the indexing of sort fields, you might want to pick out the + &acro.marcxml; indicator attributes to chop off leading substrings. If + the above &acro.xml; example had an indicator + ind2="8" in the title field + 245, i.e. + + + How to program a computer + + ]]> + + one could write a template taking into account this information + to chop the first 8 characters from the + sorting index title:s like this: + + + + + 0 + + + + + + + + + + + + + + ]]> + + The output of the above &acro.marcxml; and &acro.xslt; excerpt would then be: + + How to program a computer + program a computer + ]]> + + and the record would be sorted in the title index under 'P', not 'H'. + +
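The indicator-driven chopping described above can be sketched as a small, self-contained stylesheet. It is one possible reading of the technique, not necessarily the template the text refers to: the enclosing record template and the index names other than title:s are assumptions. Note that the XPath substring() function is 1-based, so a start position of 8 keeps the text from the eighth character onward, which turns "How to program a computer" into "program a computer".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.com/zebra-2.0"
    xmlns:m="http://www.loc.gov/MARC21/slim"
    exclude-result-prefixes="m">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- suppress default copying of text nodes -->
  <xsl:template match="text()"/>

  <!-- assumed wrapper: one z:record per MARCXML record, keyed on field 001 -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(m:record/m:controlfield[@tag='001'])}">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <xsl:template match="m:datafield[@tag='245']">
    <!-- start position for the sort key: 0 when ind2 is blank,
         zero or not a number, otherwise the numeric value of ind2 -->
    <xsl:variable name="chop">
      <xsl:choose>
        <xsl:when test="not(number(@ind2))">0</xsl:when>
        <xsl:otherwise><xsl:value-of select="number(@ind2)"/></xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <!-- word and phrase indexes keep the full title -->
    <z:index name="title:w title:p any:w">
      <xsl:value-of select="m:subfield[@code='a']"/>
    </z:index>

    <!-- sort index receives the title from position $chop onward -->
    <z:index name="title:s">
      <xsl:value-of select="substring(m:subfield[@code='a'], $chop)"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
```

With a start position of 0 the substring() call returns the whole string, so titles without a usable indicator are sorted unchanged.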
+ + +
+ &acro.dom; Indexing Wizardry - Notice also, - that the names and types of the indexes can be defined in the - indexing &xslt; stylesheet dynamically according to - content in the original &xml; records, which has + The names and types of the indexes can be defined in the + indexing &acro.xslt; stylesheet dynamically according to + content in the original &acro.xml; records, which offers opportunities for great power and wizardry as well as grand disaster. The following excerpt of a push stylesheet might - be a good idea according to your strict control of the &xml; + be a good idea according to your strict control of the &acro.xml; input format (due to rigorous checking against well-defined and - tight RelaxNG or &xml; Schema's, for example): + tight RelaxNG or &acro.xml; Schemas, for example): @@ -626,11 +777,11 @@ ]]> This template creates indexes which have the name of the working - node of any input &xml; file, and assigns a '1' to the index. + node of any input &acro.xml; file, and assigns a '1' to the index. The example query find @attr 1=xyz 1 finds all files which contain at least one - xyz &xml; element. In case you can not control + xyz &acro.xml; element. In case you cannot control which element names the input files contain, you might ask for disaster and bad karma using this technique. @@ -658,18 +809,18 @@ ]]> Don't be tempted to play too smart tricks with the power of - &xslt;, the above example will create zillions of + &acro.xslt;, the above example will create zillions of indexes with unpredictable names, resulting in severe &zebra; index pollution.
- Debuggig &dom; Filter Configurations + Debugging &acro.dom; Filter Configurations - It can be very hard to debug a &dom; filter setup due to the many - sucessive &marc; syntax translations, &xml; stream splitting and - &xslt; transformations involved. As an aid, you have always the + It can be very hard to debug a &acro.dom; filter setup due to the many + successive &acro.marc; syntax translations, &acro.xml; stream splitting and + &acro.xslt; transformations involved. As an aid, you always have the power of the -s command line switch to the zebraidx indexing command at hand: @@ -684,18 +835,18 @@