From c99c50f588fb803362a47a933c988360ab1cd98c Mon Sep 17 00:00:00 2001 From: Marc Cromme Date: Thu, 22 Feb 2007 15:44:19 +0000 Subject: [PATCH] added more instructions to DOM filter docs, spell checked both DOM and Alvis filter docs --- doc/recordmodel-alvisxslt.xml | 110 +++------------ doc/recordmodel-domxml.xml | 310 +++++++++++++++++++---------------------- 2 files changed, 169 insertions(+), 251 deletions(-) diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml index 8eee9b9..f3b69db 100644 --- a/doc/recordmodel-alvisxslt.xml +++ b/doc/recordmodel-alvisxslt.xml @@ -1,14 +1,16 @@ - - + + ALVIS &xml; Record Model and Filter Module - + The functionality of this record model has been improved and - replaced by the DOM &xml; record model. See - . + replaced by the DOM &xml; record model, see + . The Alvis &xml; record + model is considered obsolete, and will eventually be removed + from future releases of the &zebra; software. - + The record model described in this chapter applies to the fundamental, @@ -134,7 +136,7 @@ This means the following: From the original &xml; file one-record.xml (or from the &xml; record &dom; of the - same form coming from a splitted input file), the indexing + same form coming from a split input file), the indexing stylesheet produces an indexing &xml; record, which is defined by the record element in the magic namespace xmlns:z="http://indexdata.dk/zebra/xslt/1". @@ -166,7 +168,7 @@ file default.idx will do). Finally, any text() node content recursively contained inside the index will be filtered through the - appropriate charmap for character normalization, and will be + appropriate char map for character normalization, and will be inserted in the index. 
@@ -179,7 +181,7 @@ will be inserted using the w character normalization defined in default.idx into the index dc:creator (that is, after character - normalization the index will keep the inidividual words + normalization the index will keep the individual words kumar, krishen, and, calvin, burnham, and editors), and @@ -208,7 +210,7 @@ ]]> or the proprietary - extentions x-pquery and + extensions x-pquery and x-pScanClause to &sru;, and &srw; @@ -246,7 +248,7 @@ . Obviously, there are million of different ways to accomplish this task, and some comments and code snippets are in order to lead - our paduans on the right track to the good side of the force. + our Padawan's on the right track to the good side of the force. Stylesheets can be written in the pull or @@ -255,12 +257,12 @@ the internal structure of the &xslt; stylesheet, and portions of the input &xml; are pulled out and inserted into the right spots of the output &xml; structure. On the other - side, push &xslt; stylesheets are recursavly + side, push &xslt; stylesheets are recursively calling their template definitions, a process which is commanded - by the input &xml; structure, and avake to produce some output &xml; - whenever some special conditions in the input styelsheets are + by the input &xml; structure, and are triggered to produce some output &xml; + whenever some special conditions in the input stylesheets are met. The pull type is well-suited for input - &xml; with strong and well-defined structure and semantcs, like the + &xml; with strong and well-defined structure and semantics, like the following &oai; indexing example, whereas the push type might be the only possible way to sort out deeply recursive input &xml; formats. 
@@ -319,14 +321,14 @@ that the names and types of the indexes can be defined in the indexing &xslt; stylesheet dynamically according to content in the original &xml; records, which has - opportunities for great power and wizardery as well as grande + opportunities for great power and wizardry as well as grande disaster. The following excerpt of a push stylesheet might be a good idea according to your strict control of the &xml; - input format (due to rigerours checking against well-defined and + input format (due to rigorous checking against well-defined and tight RelaxNG or &xml; Schema's, for example): Don't be tempted to cross - the line to the dark side of the force, paduan; this leads + the line to the dark side of the force, Padawan; this leads to suffering and pain, and universal - disentigration of your project schedule. + disintegration of your project schedule. @@ -428,12 +430,12 @@
ALVIS Filter &oai; Indexing Example - The sourcecode tarball contains a working Alvis filter example in + The source code tarball contains a working Alvis filter example in the directory examples/alvis-oai/, which should get you started. - More example data can be harvested from any &oai; complient server, + More example data can be harvested from any &oai; compliant server, see details at the &oai; http://www.openarchives.org/ web site, and the community @@ -453,72 +455,6 @@ - - - - + &dom; &xml; Record Model and Filter Module The record model described in this chapter applies to the fundamental, structured &xml; - record type dom, introduced in + record type &dom;, introduced in . The &dom; &xml; record model is experimental, and it's inner workings might change in future releases of the &zebra; Information Server. @@ -19,7 +19,7 @@ The &dom; &xml; filter uses a standard &dom; &xml; structure as internal data model, and can therefore parse, index, and display - any &xml; document type. It is wellsuited to work on + any &xml; document type. It is well suited to work on standardized &xml;-based formats such as Dublin Core, MODS, METS, MARCXML, OAI-PMH, RSS, and performs equally well on any other non-standard &xml; format. @@ -33,7 +33,7 @@ The &dom; filter architecture consists of four - different pipelines, each being a chain of arbitraily many sucessive + different pipelines, each being a chain of arbitrarily many successive &xslt; transformations of the internal &dom; &xml; representations of documents. @@ -166,19 +166,19 @@ The root &xml; element <dom> and all other &dom; &xml; filter elements are residing in the namespace - http://indexdata.com/zebra-2.0. + xmlns="http://indexdata.dk/zebra-2.0". All pipeline definition elements - i.e. the <input>, - <extact>, + <extract>, <store>, and <retrieve> elements - are optional. Missing pipeline definitions are just interpreted do-nothing identity pipelines. 
- All pipeine definition elements may contain zero or more + All pipeline definition elements may contain zero or more ]]> &xslt; transformation instructions, which are performed sequentially from top to bottom. @@ -209,7 +209,7 @@ ]]> &xslt; transformations. At the end of this pipeline, the documents are in the common format, used to feed both the - <extact> and + <extract> and <store> pipelines.
@@ -217,13 +217,13 @@
Extract pipeline - The <extact> pipeline takes documents + The <extract> pipeline takes documents from any common &dom; &xml; format to the &zebra; specific indexing &dom; &xml; format. It may consist of zero ore more ]]> &xslt; transformations, and the outcome is handled to the - &zebra; core to drive the proces of building the inverted + &zebra; core to drive the process of building the inverted indexes. See for details. @@ -233,8 +233,8 @@
Store pipeline The <store> pipeline takes documents - from any common &dom; &xml; format to the &zebra; specific - storage &dom; &xml; format. + from any common &dom; &xml; format to the &zebra; specific + storage &dom; &xml; format. It may consist of zero ore more ]]> &xslt; transformations, and the outcome is handled to the @@ -275,7 +275,7 @@ similar to the Alvis filter indexing format - &xml; documents containing &xml; <record> and <index> instructions from the magic - namespace xmlns:z="http://indexdata.dk/zebra-2.0". + namespace xmlns:z="http://indexdata.dk/zebra-2.0".
@@ -301,9 +301,9 @@ - + 11224466 - + How to program a computer ]]> @@ -333,8 +333,8 @@ - 11224466 - + 11224466 + How to program a computer ]]> @@ -348,41 +348,94 @@ Both indexing formats are defined with equal semantics and - behaviour in mind. + behavior in mind: + + + &zebra; specific instructions are either + processing instructions named + zebra-2.0 or + elements contained in the namespace + xmlns:z="http://indexdata.dk/zebra-2.0". + + + + There must be exactly one record + instruction, which sets the scope for the following, + possibly nested index instructions. + + + + The unique record instruction + may have additional attributes id and + rank, where the value of the opaque ID + may be any string not containing the whitespace character + ' ', and the rank value must be a + non-negative integer. See + + + + + Multiple and possible nested index + instructions must contain at least one + indexname:indextype + pair, and may contain multiple such pairs separated by the + whitespace character ' '. In each index + pair, the name and the type of the index is separated by a + colon character ':'. + + + + + Any index name consisting of ASCII letters, and following the + standard &zebra; rules will do, see + . + + + + + Index types are restricted to the values defined in + the standard configuration + file default.idx, see + and + for details. + + + - This means the following: From the original &xml; file - one-record.xml (or from the &xml; record &dom; of the - same form coming from a splitted input file), the indexing - stylesheet produces an indexing &xml; record, which is defined by - the record element in the magic namespace - xmlns:z="http://indexdata.dk/zebra/xslt/1". 
+ The examples work as follows: + From the original &xml; file + marc-one.xml (or from the &xml; record &dom; of the + same form coming from an <input> + pipeline), + the indexing + pipeline <extract> + produces an indexing &xml; record, which is defined by + the record instruction &zebra; uses the content of - z:id="oai:JTRS:CP-3290---Volume-I" as internal + z:id="11224466" + or + id=11224466 + as internal record ID, and - in case static ranking is set - the content of - z:rank="47896" as static rank. Following the - discussion in - we see that this records is internally ordered - lexicographically according to the value of the string - oai:JTRS:CP-3290---Volume-I47896. - The type of action performed during indexing is defined by - z:type="update">, with recognized values - insert, update, and - delete. + rank=42 + or + z:rank="42" + as static rank. In these examples, the following literal indexes are constructed: any:w - control:w + control:0 title:w title:p title:s where the indexing type is defined after the - literal ':' charaacter. + literal ':' character. Any value from the standard configuration file default.idx will do. Finally, any @@ -390,33 +443,9 @@ inside the <z:index> element, or any element following a index processing instruction, will be filtered through the - appropriate charmap for character normalization, and will be + appropriate char map for character normalization, and will be inserted in the named indexes. 
- - - - Specific to this example, we see that the single word - oai:JTRS:CP-3290---Volume-I will be literal, - byte for byte without any form of character normalization, - inserted into the index named oai:identifier, - the text - Kumar Krishen and *Calvin Burnham, Editors - will be inserted using the w character - normalization defined in default.idx into - the index dc:creator (that is, after character - normalization the index will keep the inidividual words - kumar, krishen, - and, calvin, - burnham, and editors), and - finally both the texts - Proceedings of the 4th International Conference and Exhibition: - World Congress on Superconductivity - Volume I - and - Kumar Krishen and *Calvin Burnham, Editors - will be inserted into the index dc:all using - the same character normalization map w. - Finally, this example configuration can be queried using &pqf; queries, either transported by &z3950;, (here using a yaz-client) @@ -426,21 +455,24 @@ Z> elem dc Z> form xml Z> - Z> f @attr 1=dc_creator Kumar - Z> scan @attr 1=dc_creator adam + Z> find @attr 1=control @attr 4=3 11224466 + Z> scan @attr 1=control @attr 4=3 "" Z> - Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity" - Z> scan @attr 1=dc_title abc + Z> find @attr 1=title program + Z> scan @attr 1=title "" + Z> + Z> find @attr 1=title @attr 4=2 "How to program a computer" + Z> scan @attr 1=title @attr 4=2 "" ]]> or the proprietary - extentions x-pquery and + extensions x-pquery and x-pScanClause to &sru;, and &srw; See for more information on &sru;/&srw; @@ -454,6 +486,13 @@ filter configuration files involves in this process, and that the literal index names are used during search and retrieval. + + In case that we want to support the usual + bib-1 &z3950; numeric access points, it is a + good idea to choose string index names defined in the default + configuration file tab/bib1.att, see + +
@@ -468,14 +507,14 @@
&dom; Indexing Configuration - As mentioned above, there can be only one indexing - stylesheet, and configuration of the indexing process is a synonym + As mentioned above, there can be only one indexing pipeline, + and configuration of the indexing process is a synonym of writing an &xslt; stylesheet which produces &xml; output containing the - magic elements discussed in - . + magic processing instructions or elements discussed in + . Obviously, there are million of different ways to accomplish this - task, and some comments and code snippets are in order to lead - our paduans on the right track to the good side of the force. + task, and some comments and code snippets are in order to + enlighten the wary. Stylesheets can be written in the pull or @@ -483,13 +522,15 @@ means that the output &xml; structure is taken as starting point of the internal structure of the &xslt; stylesheet, and portions of the input &xml; are pulled out and inserted - into the right spots of the output &xml; structure. On the other - side, push &xslt; stylesheets are recursavly + into the right spots of the output &xml; structure. + On the other + side, push &xslt; stylesheets are recursively calling their template definitions, a process which is commanded - by the input &xml; structure, and avake to produce some output &xml; - whenever some special conditions in the input styelsheets are + by the input &xml; structure, and is triggered to produce + some output &xml; + whenever some special conditions in the input stylesheets are met. The pull type is well-suited for input - &xml; with strong and well-defined structure and semantcs, like the + &xml; with strong and well-defined structure and semantics, like the following &oai; indexing example, whereas the push type might be the only possible way to sort out deeply recursive input &xml; formats. 
@@ -501,29 +542,34 @@ + - + + + + - - + + + + - + - + @@ -532,7 +578,7 @@ - + @@ -548,19 +594,19 @@ that the names and types of the indexes can be defined in the indexing &xslt; stylesheet dynamically according to content in the original &xml; records, which has - opportunities for great power and wizardery as well as grande + opportunities for great power and wizardry as well as grande disaster. The following excerpt of a push stylesheet might be a good idea according to your strict control of the &xml; - input format (due to rigerours checking against well-defined and + input format (due to rigorous checking against well-defined and tight RelaxNG or &xml; Schema's, for example): - + @@ -582,7 +628,7 @@ - + @@ -590,7 +636,7 @@ - + @@ -598,10 +644,10 @@ ]]> - Don't be tempted to cross - the line to the dark side of the force, paduan; this leads - to suffering and pain, and universal - disentigration of your project schedule. + Don't be tempted to play too smart tricks with the power of + &xslt;, the above example will create zillions of + indexes with unpredictable names, resulting in severe &zebra; + index pollution..
@@ -654,15 +700,16 @@
+
@@ -682,72 +730,6 @@
- - - -
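Reviewer note on the index instruction rules added in this patch: the new list in recordmodel-domxml.xml says that an index instruction holds one or more indexname:indextype pairs separated by the whitespace character ' ', that the record ID may be any string without ' ', and that the rank must be a non-negative integer. The sketch below illustrates those three rules only. It is not part of Zebra; the function names parse_index_spec and check_record_attributes are hypothetical, and rejecting a second ':' inside a pair is an assumption made here for strictness.

```python
def parse_index_spec(spec):
    """Split an index specification such as 'title:w title:p' into
    (name, type) pairs, following the rules quoted in the patch."""
    pairs = []
    for token in spec.split(" "):
        if not token:
            continue  # tolerate repeated separators
        name, sep, index_type = token.partition(":")
        # Assumption: exactly one ':' per pair, both halves non-empty.
        if not sep or not name or not index_type or ":" in index_type:
            raise ValueError("malformed index pair: %r" % token)
        pairs.append((name, index_type))
    if not pairs:
        raise ValueError("at least one indexname:indextype pair is required")
    return pairs


def check_record_attributes(record_id, rank=None):
    """Validate the optional id and rank attributes of a record instruction."""
    if " " in record_id:
        raise ValueError("record id must not contain ' '")
    if rank is None:
        return record_id, None
    # Rank must be a non-negative integer written with ASCII digits.
    if not (rank.isascii() and rank.isdigit()):
        raise ValueError("rank must be a non-negative integer: %r" % rank)
    return record_id, int(rank)


if __name__ == "__main__":
    print(parse_index_spec("any:w control:0 title:w title:p title:s"))
    print(check_record_attributes("11224466", "42"))
```

The index names and the id and rank values used in the demo call are the ones from the example records in the patch. Whether an index type is actually valid still depends on default.idx, which this sketch does not read.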