From b00064c85119fb3a6ca07f809f41d8f97f192165 Mon Sep 17 00:00:00 2001 From: Marc Cromme Date: Tue, 20 Feb 2007 14:28:31 +0000 Subject: [PATCH] added initial DOM XML filter documentation. Much is missing yet ... --- doc/Makefile.am | 3 +- doc/architecture.xml | 64 ++++- doc/entities.ent | 5 +- doc/recordmodel-alvisxslt.xml | 15 +- doc/recordmodel-domxml.xml | 621 +++++++++++++++++++++++++++++++++++++++++ doc/recordmodel-grs.xml | 10 +- doc/zebra.xml | 3 +- 7 files changed, 710 insertions(+), 11 deletions(-) create mode 100644 doc/recordmodel-domxml.xml diff --git a/doc/Makefile.am b/doc/Makefile.am index 5f6a183..001d575 100644 --- a/doc/Makefile.am +++ b/doc/Makefile.am @@ -1,4 +1,4 @@ -## $Id: Makefile.am,v 1.63 2007-01-15 20:04:34 adam Exp $ +## $Id: Makefile.am,v 1.64 2007-02-20 14:28:31 marc Exp $ docdir=$(datadir)/doc/$(PACKAGE)$(PACKAGE_SUFFIX) SUBDIRS = common @@ -17,6 +17,7 @@ XMLFILES = \ marc_indexing.xml \ querymodel.xml \ quickstart.xml \ + recordmodel-domxml.xml \ recordmodel-alvisxslt.xml \ recordmodel-grs.xml \ manref.xml \ diff --git a/doc/architecture.xml b/doc/architecture.xml index fd89051..cecd978 100644 --- a/doc/architecture.xml +++ b/doc/architecture.xml @@ -1,5 +1,5 @@ - + Overview of &zebra; Architecture
@@ -207,9 +207,64 @@ modules. +
+ &dom; &xml; Record Model and Filter Module
+
+ The &dom; &xml; filter uses a standard &dom; &xml; structure as
+ internal data model, and can thus parse, index, and display
+ any &xml; document.
+
+ A parser for binary &marc; records based on the ISO2709 library
+ standard is provided; it transforms these to the internal
+ &marcxml; &dom; representation.
+
+ The internal &dom; &xml; representation can be fed into four
+ different pipelines, consisting of arbitrarily many successive
+ &xslt; transformations; these are for
+
+ input parsing and initial
+ transformations,
+ indexing term extraction
+ transformations,
+ transformations before internal document
+ storage, and
+ retrieve transformations from storage to output
+ format.
+
+ The &dom; &xml; filter pipelines use &xslt; (and, if supported on
+ your platform, even &exslt;), and thus bring full &xpath;
+ support to the indexing, storage and display rules of not only
+ &xml; documents, but also binary &marc; records.
+
+ Finally, the &dom; &xml; filter allows for static ranking at index
+ time, and for sorting hit lists according to predefined
+ static ranks.
+
+ Details on the experimental &dom; &xml; filter are found in
+ .
+
+ The Debian package libidzebra-2.0-mod-dom
+ contains the &dom; filter module.
+
ALVIS &xml; Record Model and Filter Module + + + The functionality of this record model has been improved and + replaced by the &dom; &xml; record model. See + . + + + The Alvis filter for &xml; files is an &xslt; based input filter. @@ -252,6 +307,13 @@
&grs1; Record Model and Filter Modules + + + The functionality of this record model has been improved and + replaced by the &dom; &xml; record model. See + . + + The &grs1; filter modules described in diff --git a/doc/entities.ent b/doc/entities.ent index 4c82c41..f2150fa 100644 --- a/doc/entities.ent +++ b/doc/entities.ent @@ -1,4 +1,4 @@ - + @@ -7,8 +7,9 @@ - + + diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml index 93ce649..8eee9b9 100644 --- a/doc/recordmodel-alvisxslt.xml +++ b/doc/recordmodel-alvisxslt.xml @@ -1,15 +1,20 @@ - + ALVIS &xml; Record Model and Filter Module - + + + + The functionality of this record model has been improved and + replaced by the DOM &xml; record model. See + . + + The record model described in this chapter applies to the fundamental, structured &xml; record type alvis, introduced in - . The ALVIS &xml; record model - is experimental, and it's inner workings might change in future - releases of the &zebra; Information Server. + . This filter has been developed under the diff --git a/doc/recordmodel-domxml.xml b/doc/recordmodel-domxml.xml new file mode 100644 index 0000000..201299a --- /dev/null +++ b/doc/recordmodel-domxml.xml @@ -0,0 +1,621 @@ + + + &dom; &xml; Record Model and Filter Module + + + The record model described in this chapter applies to the fundamental, + structured &xml; + record type dom, introduced in + . The &dom; &xml; record model + is experimental, and it's inner workings might change in future + releases of the &zebra; Information Server. + + + + +
+ &dom; Record Filter
+
+ The &dom; &xml; filter uses a standard &dom; &xml; structure as
+ internal data model, and can therefore parse, index, and display
+ any &xml; document type. It is well suited to work on
+ standardized &xml;-based formats such as Dublin Core, MODS, METS,
+ MARCXML, OAI-PMH, and RSS, and performs equally well on any other
+ non-standard &xml; format.
+
+ A parser for binary &marc; records based on the ISO2709 library
+ standard is provided; it transforms these to the internal
+ &marcxml; &dom; representation. Other binary document parsers
+ are planned to follow.
+
+ + +
+ &dom; &xml; filter architecture
+
+ The internal &dom; &xml; representation can be fed into four
+ different pipelines, consisting of arbitrarily many successive
+ &xslt; transformations.
+
+ &dom; &xml; filter pipelines overview
+
+ Name | When | Description | Input | Output
+ input | first | input parsing and initial transformations to common &xml; format | raw &xml; record buffers, &xml; streams and binary &marc; buffers | single &dom; &xml; documents suitable for indexing and internal storage
+ extract | second | indexing term extraction transformations | common single &dom; &xml; format | &zebra; internal indexing &dom; &xml; document
+ store | second | transformations before internal document storage | common single &dom; &xml; format | &zebra; internal storage &dom; &xml; document
+ retrieve | third | document retrieve transformations from storage to output syntax and format | &zebra; internal storage &dom; &xml; document | requested output syntax and format
+
+
+ The &dom; &xml; filter pipelines use &xslt; (and, if supported on
+ your platform, even &exslt;), and thus bring full &xpath;
+ support to the indexing, storage and display rules of not only
+ &xml; documents, but also binary &marc; records.
+
+ + +
+ &dom; &xml; filter pipeline configuration
+
+ The experimental, loadable &dom; &xml;/&xslt; filter module
+ mod-dom.so is packaged in the Debian package
+ libidzebra-2.0-mod-dom.
+ It is invoked by the zebra.cfg configuration statement
+
+ recordtype.xml: dom.db/filter_dom_conf.xml
+
+ In this example, the filter is applied to all data files with suffix
+ *.xml, and the
+ &dom; &xslt; filter configuration file is found at the
+ path db/filter_dom_conf.xml.
+
+ The &dom; &xslt; filter configuration file must be
+ valid &xml;. It might look like this (this example is
+ used for indexing and display of &oai; harvested records):
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <schemaInfo>
+   <schema name="identity" stylesheet="xsl/identity.xsl" />
+   <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+           stylesheet="xsl/oai2index.xsl" />
+   <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
+   <!-- use split level 2 when indexing whole &oai; Record lists -->
+   <split level="2"/>
+ </schemaInfo>
+
+ All named stylesheets defined inside
+ schema element tags
+ are for presentation after search, including
+ the indexing stylesheet (which is a great debugging help). The
+ names defined in the name attributes must be
+ unique; these are the literal schema or
+ element set names used in
+ &srw;,
+ &sru; and
+ &z3950; protocol queries.
+ The paths in the stylesheet attributes
+ are either relative to &zebra;'s working directory, or absolute
+ from the file system root.
+
+ The <split level="2"/> element decides where the
+ &xml; Reader shall split the
+ collections of records into individual records, which are then
+ loaded into &dom;, and have the indexing &xslt; stylesheet applied.
+
+ There must be exactly one indexing &xslt; stylesheet, which is
+ defined by the magic attribute
+ identifier="http://indexdata.dk/zebra/xslt/1".
+
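The configuration above names an `identity` schema backed by `xsl/identity.xsl`, but the patch does not show that file. As a minimal sketch, assuming it is nothing more than the standard XSLT 1.0 identity transform (the file's real content is not confirmed by the source), it could look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical content for xsl/identity.xsl: the standard XSLT 1.0
     identity transform. This is an assumption, not taken from the patch. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Copy every attribute and node unchanged, recursing into children -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Requesting the `identity` element set would then return the record as it was stored, which is handy when checking what the other stylesheets receive as input.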
+ &dom; Internal Record Representation
+ When indexing, an &xml; Reader is invoked to split the input
+ files into suitable record &xml; pieces. Each record piece is then
+ transformed to an &xml; &dom; structure, which is essentially the
+ record model. Only &xslt; transformations can be applied during
+ indexing, search and retrieval. Consequently, output formats are
+ restricted to whatever &xslt; can deliver from the record &xml;
+ structure, be it other &xml; formats, HTML, or plain text. In case
+ you have libxslt1 running with &exslt; support,
+ you can use this functionality inside the &dom;
+ filter configuration &xslt; stylesheets.
+
+ +
+ &dom; Canonical Indexing Format
+ The output of the indexing &xslt; stylesheets must contain
+ certain elements in the magic
+ xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ namespace. The output of the &xslt; indexing transformation is then
+ parsed using &dom; methods, and the contained instructions are
+ performed on the magic elements and their
+ subtrees.
+
+ For example, the output of the command
+
+ xsltproc xsl/oai2index.xsl one-record.xml
+
+ might look like this:
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
+           z:id="oai:JTRS:CP-3290---Volume-I"
+           z:rank="47896"
+           z:type="update">
+   <z:index name="oai_identifier" type="0">
+     oai:JTRS:CP-3290---Volume-I</z:index>
+   <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
+   <z:index name="oai_setspec" type="0">jtrs</z:index>
+   <z:index name="dc_all" type="w">
+     <z:index name="dc_title" type="w">Proceedings of the 4th
+       International Conference and Exhibition:
+       World Congress on Superconductivity - Volume I</z:index>
+     <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
+       Burnham, Editors</z:index>
+   </z:index>
+ </z:record>
+
+ This means the following: From the original &xml; file
+ one-record.xml (or from the &xml; record &dom; of the
+ same form coming from a split input file), the indexing
+ stylesheet produces an indexing &xml; record, which is defined by
+ the record element in the magic namespace
+ xmlns:z="http://indexdata.dk/zebra/xslt/1".
+ &zebra; uses the content of
+ z:id="oai:JTRS:CP-3290---Volume-I" as internal
+ record ID, and - in case static ranking is set - the content of
+ z:rank="47896" as static rank. Following the
+ discussion in
+ we see that this record is internally ordered
+ lexicographically according to the value of the string
+ oai:JTRS:CP-3290---Volume-I47896.
+ The type of action performed during indexing is defined by
+ z:type="update", with recognized values
+ insert, update, and
+ delete.
+
+ In this example, the following literal indexes are constructed:
+
+ oai_identifier
+ oai_datestamp
+ oai_setspec
+ dc_all
+ dc_title
+ dc_creator
+
+ where the indexing type is defined in the
+ type attribute
+ (any value from the standard configuration
+ file default.idx will do). Finally, any
+ text() node content recursively contained
+ inside the index will be filtered through the
+ appropriate charmap for character normalization, and will be
+ inserted in the index.
+
+ Specific to this example, we see that the single word
+ oai:JTRS:CP-3290---Volume-I will be inserted literally,
+ byte for byte without any form of character normalization,
+ into the index named oai_identifier;
+ the text
+ Kumar Krishen and *Calvin Burnham, Editors
+ will be inserted using the w character
+ normalization defined in default.idx into
+ the index dc_creator (that is, after character
+ normalization the index will keep the individual words
+ kumar, krishen,
+ and, calvin,
+ burnham, and editors); and
+ finally both the texts
+ Proceedings of the 4th International Conference and Exhibition:
+ World Congress on Superconductivity - Volume I
+ and
+ Kumar Krishen and *Calvin Burnham, Editors
+ will be inserted into the index dc_all using
+ the same character normalization map w.
+
+ Finally, this example configuration can be queried using &pqf;
+ queries, either transported by &z3950; (here using a yaz-client)
+
+ Z> open localhost:9999
+ Z> elem dc
+ Z> form xml
+ Z>
+ Z> f @attr 1=dc_creator Kumar
+ Z> scan @attr 1=dc_creator adam
+ Z>
+ Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
+ Z> scan @attr 1=dc_title abc
+
+ or the proprietary
+ extensions x-pquery and
+ x-pScanClause to
+ &sru; and &srw;
+
+ See for more information on &sru;/&srw;
+ configuration, and or the &yaz;
+ &cql; section
+ for the details of the &yaz; frontend server.
+
+ Notice that there are no *.abs,
+ *.est, *.map, or other &grs1;
+ filter configuration files involved in this process, and that the
+ literal index names are used during search and retrieval.
+
+
+ + +
+ &dom; Record Model Configuration + + +
+ &dom; Indexing Configuration
+
+ As mentioned above, there can be only one indexing
+ stylesheet, and configuring the indexing process is synonymous
+ with writing an &xslt; stylesheet which produces &xml; output containing the
+ magic elements discussed in
+ .
+ Obviously, there are millions of different ways to accomplish this
+ task, and some comments and code snippets are in order to lead
+ our padawans on the right track to the good side of the Force.
+
+ Stylesheets can be written in the pull or
+ the push style: pull
+ means that the output &xml; structure is taken as starting point of
+ the internal structure of the &xslt; stylesheet, and portions of
+ the input &xml; are pulled out and inserted
+ into the right spots of the output &xml; structure. On the other
+ hand, push &xslt; stylesheets
+ call their template definitions recursively, a process which is driven
+ by the input &xml; structure, and which wakes up to produce some output &xml;
+ whenever some special conditions in the input stylesheets are
+ met. The pull type is well suited for input
+ &xml; with strong and well-defined structure and semantics, like the
+ following &oai; indexing example, whereas the
+ push type might be the only possible way to
+ sort out deeply recursive input &xml; formats.
+
+ A pull stylesheet example used to index
+ &oai; harvested records could use some of the following template
+ definitions:
+
+ Notice also
+ that the names and types of the indexes can be defined in the
+ indexing &xslt; stylesheet dynamically according to
+ content in the original &xml; records, which offers
+ opportunities for great power and wizardry as well as great
+ disaster.
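The pull stylesheet itself did not survive in this copy of the patch. As a rough sketch of the pull style only, assuming a single OAI-PMH `record` element with an `oai_dc` payload as input, the output skeleton could drive the stylesheet as shown here. The index names follow the canonical indexing example earlier in the chapter; the XPath expressions and the omission of `z:rank` are assumptions, not the original `xsl/oai2index.xsl`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a pull-style indexing stylesheet; NOT the original oai2index.xsl.
     Assumed input: one OAI-PMH record element carrying oai_dc metadata. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai oai_dc dc">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Disable the built-in rule that copies stray text nodes to the output -->
  <xsl:template match="text()"/>

  <!-- The output structure is written down once; input values are pulled in -->
  <xsl:template match="oai:record">
    <z:record z:id="{normalize-space(oai:header/oai:identifier)}"
              z:type="update">
      <z:index name="oai_identifier" type="0">
        <xsl:value-of select="oai:header/oai:identifier"/>
      </z:index>
      <z:index name="oai_datestamp" type="0">
        <xsl:value-of select="oai:header/oai:datestamp"/>
      </z:index>
      <xsl:for-each select="oai:header/oai:setSpec">
        <z:index name="oai_setspec" type="0">
          <xsl:value-of select="."/>
        </z:index>
      </xsl:for-each>
      <!-- Nested indexes: text inside also lands in the enclosing dc_all -->
      <z:index name="dc_all" type="w">
        <xsl:for-each select="oai:metadata/oai_dc:dc/dc:title">
          <z:index name="dc_title" type="w">
            <xsl:value-of select="."/>
          </z:index>
        </xsl:for-each>
        <xsl:for-each select="oai:metadata/oai_dc:dc/dc:creator">
          <z:index name="dc_creator" type="w">
            <xsl:value-of select="."/>
          </z:index>
        </xsl:for-each>
      </z:index>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
```

Running it with `xsltproc` against a single record, as in the canonical indexing example, is a quick way to check that the output has the expected `z:record` shape before handing the stylesheet to &zebra;.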
+
+ The following excerpt of a push stylesheet
+ might
+ be a good idea, provided you have strict control of the &xml;
+ input format (due to rigorous checking against well-defined and
+ tight RelaxNG or &xml; Schemas, for example):
+
+ This template creates indexes which have the name of the working
+ node of any input &xml; file, and assigns a '1' to the index.
+ The example query
+ find @attr 1=xyz 1
+ finds all files which contain at least one
+ xyz &xml; element. In case you cannot control
+ which element names the input files contain, you might ask for
+ disaster and bad karma using this technique.
+
+ One variation on the theme of dynamically created
+ indexes will definitely be unwise:
+
+ Don't be tempted to cross
+ the line to the dark side of the Force, padawan; this leads
+ to suffering and pain, and universal
+ disintegration of your project schedule.
+
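The push excerpt described above (one index per element name, each carrying the value `1`) is also missing from this copy. A minimal sketch of that idea follows. It is a guess at the shape of such a stylesheet rather than the original excerpt; the use of `local-name()`, the index type `0`, and the absence of a `z:id` attribute are all assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a push-style indexing stylesheet; NOT the original excerpt.
     Every element met in the input creates an index named after it. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Wrap the whole result in one indexing record (no z:id set here;
       add one if your setup needs a stable record ID) -->
  <xsl:template match="/">
    <z:record z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- Push style: the input structure drives which templates fire -->
  <xsl:template match="*">
    <z:index name="{local-name()}" type="0">1</z:index>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Text content is deliberately not indexed in this sketch -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

With such a stylesheet, the query `find @attr 1=xyz 1` from the text would match records containing an `xyz` element. The warning above still applies: with uncontrolled input, the set of index names is whatever the data happens to contain.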
+ +
+ &dom; Exchange Formats
+
+ An exchange format can be anything which can be the outcome of an
+ &xslt; transformation, as long as the stylesheet is registered in
+ the main &dom; &xslt; filter configuration file, see
+ .
+ In principle anything that can be expressed in &xml;, HTML, or
+ plain text can be the output of a schema or
+ element set directive during search, as long as
+ the information comes from the
+ original input record &xml; &dom; tree
+ (and not from the transformed and indexed &xml;).
+
+ In addition, internal administrative information from the &zebra;
+ indexer can be accessed during record retrieval. The following
+ example is a summary of the possibilities:
+
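The summary example referred to above is not present in this copy. The sketch below shows how a retrieval stylesheet might expose such administrative information through top-level XSLT parameters. The parameter names `id`, `filename`, `score`, and `schema` are assumptions modeled on the ALVIS filter documentation, and the output element names are purely illustrative; check both against your &zebra; version before relying on them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a retrieval stylesheet; the parameter names below are ASSUMED
     (modeled on the ALVIS filter), not confirmed by this patch. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <!-- Declare the parameters the indexer is assumed to pass in -->
  <xsl:param name="id" select="''"/>
  <xsl:param name="filename" select="''"/>
  <xsl:param name="score" select="''"/>
  <xsl:param name="schema" select="''"/>

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Emit the administrative values instead of the record content -->
  <xsl:template match="/">
    <z:zebra>
      <id><xsl:value-of select="$id"/></id>
      <filename><xsl:value-of select="$filename"/></filename>
      <score><xsl:value-of select="$score"/></score>
      <schema><xsl:value-of select="$schema"/></schema>
    </z:zebra>
  </xsl:template>

</xsl:stylesheet>
```

To be usable, a stylesheet like this would have to be registered under its own `schema` name in the filter configuration file, like any other exchange format.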
+ +
+ &dom; Filter &oai; Indexing Example
+
+ The source code tarball contains a working &dom; filter example in
+ the directory examples/dom-oai/, which
+ should get you started.
+
+ More example data can be harvested from any &oai; compliant server;
+ see details at the &oai;
+ http://www.openarchives.org/ web site, and the community
+ links at
+ http://www.openarchives.org/community/index.html.
+ There is a tutorial
+ found at
+ http://www.oaforum.org/tutorial/.
+
+ +
+ + +
+ + + + + + + + diff --git a/doc/recordmodel-grs.xml b/doc/recordmodel-grs.xml index 848db70..7ba26d3 100644 --- a/doc/recordmodel-grs.xml +++ b/doc/recordmodel-grs.xml @@ -1,7 +1,15 @@ - + &grs1; Record Model and Filter Modules + + + The functionality of this record model has been improved and + replaced by the DOM &xml; record model. See + . + + + The record model described in this chapter applies to the fundamental, structured diff --git a/doc/zebra.xml b/doc/zebra.xml index 5110f33..540bbd4 100644 --- a/doc/zebra.xml +++ b/doc/zebra.xml @@ -11,7 +11,7 @@ ]> - + &zebra; - User's Guide and Reference @@ -62,6 +62,7 @@ &chap-architecture; &chap-querymodel; &chap-administration; + &chap-recordmodel-domxml; &chap-recordmodel-alvisxslt; &chap-recordmodel-grs; &chap-field-structure; -- 1.7.10.4