X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Frecordmodel-alvisxslt.xml;h=81d3dca04d23c2cadc4c074405076be171f0b1b3;hp=f3b69db806f2be09a86e0a6adda1f32b177f0738;hb=c3ff843e467932c6027a8b3b2ebda7b44612447e;hpb=c99c50f588fb803362a47a933c988360ab1cd98c diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml index f3b69db..81d3dca 100644 --- a/doc/recordmodel-alvisxslt.xml +++ b/doc/recordmodel-alvisxslt.xml @@ -1,171 +1,169 @@ - - - ALVIS &xml; Record Model and Filter Module - - - - The functionality of this record model has been improved and - replaced by the DOM &xml; record model, see - . The Alvis &xml; record - model is considered obsolete, and will eventually be removed - from future releases of the &zebra; software. - - + + ALVIS &acro.xml; Record Model and Filter Module + + + + The functionality of this record model has been improved and + replaced by the DOM &acro.xml; record model, see + . The Alvis &acro.xml; record + model is considered obsolete, and will eventually be removed + from future releases of the &zebra; software. + + The record model described in this chapter applies to the fundamental, - structured &xml; + structured &acro.xml; record type alvis, introduced in . - This filter has been developed under the + This filter has been developed under the ALVIS project funded by the European Community under the "Information Society Technologies" Program (2002-2006). - - + +
ALVIS Record Filter - The experimental, loadable Alvis &xml;/&xslt; filter module - mod-alvis.so is packaged in the GNU/Debian package + The experimental, loadable Alvis &acro.xml;/&acro.xslt; filter module + mod-alvis.so is packaged in the GNU/Debian package libidzebra1.4-mod-alvis. It is invoked by the zebra.cfg configuration statement recordtype.xml: alvis.db/filter_alvis_conf.xml - In this example on all data files with suffix + In this example on all data files with suffix *.xml, where the - Alvis &xslt; filter configuration file is found in the + Alvis &acro.xslt; filter configuration file is found in the path db/filter_alvis_conf.xml. - The Alvis &xslt; filter configuration file must be - valid &xml;. It might look like this (This example is - used for indexing and display of &oai; harvested records): + The Alvis &acro.xslt; filter configuration file must be + valid &acro.xml;. It might look like this (This example is + used for indexing and display of &acro.oai; harvested records): - <?xml version="1.0" encoding="UTF-8"?> - <schemaInfo> - <schema name="identity" stylesheet="xsl/identity.xsl" /> - <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1" - stylesheet="xsl/oai2index.xsl" /> - <schema name="dc" stylesheet="xsl/oai2dc.xsl" /> - <!-- use split level 2 when indexing whole &oai; Record lists --> - <split level="2"/> - </schemaInfo> - + <?xml version="1.0" encoding="UTF-8"?> + <schemaInfo> + <schema name="identity" stylesheet="xsl/identity.xsl" /> + <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1" + stylesheet="xsl/oai2index.xsl" /> + <schema name="dc" stylesheet="xsl/oai2dc.xsl" /> + <!-- use split level 2 when indexing whole OAI Record lists --> + <split level="2"/> + </schemaInfo> + All named stylesheets defined inside - schema element tags + schema element tags are for presentation after search, including the indexing stylesheet (which is a great debugging help). 
The names defined in the name attributes must be - unique, these are the literal schema or - element set names used in - &srw;, - &sru; and - &z3950; protocol queries. + unique, these are the literal schema or + element set names used in + &acro.srw;, + &acro.sru; and + &acro.z3950; protocol queries. The paths in the stylesheet attributes are relative to zebras working directory, or absolute to file system root. The <split level="2"/> decides where the - &xml; Reader shall split the + &acro.xml; Reader shall split the collections of records into individual records, which then are - loaded into &dom;, and have the indexing &xslt; stylesheet applied. + loaded into &acro.dom;, and have the indexing &acro.xslt; stylesheet applied. - There must be exactly one indexing &xslt; stylesheet, which is - defined by the magic attribute + There must be exactly one indexing &acro.xslt; stylesheet, which is + defined by the magic attribute identifier="http://indexdata.dk/zebra/xslt/1".
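The two rules stated above (unique schema names, and exactly one schema carrying the magic indexing identifier) can be checked mechanically before a configuration is handed to the indexer. The following is only a minimal sketch using the Python standard library; the function name `check_filter_config` and the wording of its messages are assumptions made for illustration and are not part of &zebra;.

```python
import xml.etree.ElementTree as ET

# Identifier that marks the indexing stylesheet, as quoted in the text.
INDEX_ID = "http://indexdata.dk/zebra/xslt/1"


def check_filter_config(xml_text):
    """Return a list of problems found in an Alvis filter configuration.

    Hypothetical helper: it only checks the two rules stated in the
    chapter and does not try to load or run any stylesheet.
    """
    root = ET.fromstring(xml_text)
    schemas = root.findall("schema")
    problems = []

    # Rule 1: schema names must be unique.
    names = [s.get("name", "") for s in schemas]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append("duplicate schema names: " + ", ".join(duplicates))

    # Rule 2: exactly one schema carries the indexing identifier.
    indexing = [s for s in schemas if s.get("identifier") == INDEX_ID]
    if len(indexing) != 1:
        problems.append(
            "expected exactly one indexing stylesheet, found %d" % len(indexing)
        )
    return problems


# The example configuration from the text (XML declaration omitted).
EXAMPLE = """
<schemaInfo>
  <schema name="identity" stylesheet="xsl/identity.xsl" />
  <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
          stylesheet="xsl/oai2index.xsl" />
  <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
  <split level="2"/>
</schemaInfo>
"""

# A deliberately broken variant: repeated name, no indexing stylesheet.
BROKEN = """
<schemaInfo>
  <schema name="dc" stylesheet="xsl/a.xsl" />
  <schema name="dc" stylesheet="xsl/b.xsl" />
</schemaInfo>
"""

if __name__ == "__main__":
    print(check_filter_config(EXAMPLE))
    print(check_filter_config(BROKEN))
```

An empty list means both rules hold; the broken variant reports one problem per violated rule.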
- ALVIS Internal Record Representation - When indexing, an &xml; Reader is invoked to split the input - files into suitable record &xml; pieces. Each record piece is then - transformed to an &xml; &dom; structure, which is essentially the - record model. Only &xslt; transformations can be applied during - index, search and retrieval. Consequently, output formats are - restricted to whatever &xslt; can deliver from the record &xml; - structure, be it other &xml; formats, HTML, or plain text. In case - you have libxslt1 running with E&xslt; support, - you can use this functionality inside the Alvis - filter configuration &xslt; stylesheets. + ALVIS Internal Record Representation + When indexing, an &acro.xml; Reader is invoked to split the input + files into suitable record &acro.xml; pieces. Each record piece is then + transformed to an &acro.xml; &acro.dom; structure, which is essentially the + record model. Only &acro.xslt; transformations can be applied during + index, search and retrieval. Consequently, output formats are + restricted to whatever &acro.xslt; can deliver from the record &acro.xml; + structure, be it other &acro.xml; formats, HTML, or plain text. In case + you have libxslt1 running with E&acro.xslt; support, + you can use this functionality inside the Alvis + filter configuration &acro.xslt; stylesheets.
- ALVIS Canonical Indexing Format - The output of the indexing &xslt; stylesheets must contain - certain elements in the magic + ALVIS Canonical Indexing Format + The output of the indexing &acro.xslt; stylesheets must contain + certain elements in the magic xmlns:z="http://indexdata.dk/zebra/xslt/1" - namespace. The output of the &xslt; indexing transformation is then - parsed using &dom; methods, and the contained instructions are - performed on the magic elements and their - subtrees. + namespace. The output of the &acro.xslt; indexing transformation is then + parsed using &acro.dom; methods, and the contained instructions are + performed on the magic elements and their + subtrees. - For example, the output of the command - + For example, the output of the command + xsltproc xsl/oai2index.xsl one-record.xml - + might look like this: <?xml version="1.0" encoding="UTF-8"?> - <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1" - z:id="oai:JTRS:CP-3290---Volume-I" - z:rank="47896" - z:type="update"> - <z:index name="oai_identifier" type="0"> - oai:JTRS:CP-3290---Volume-I</z:index> - <z:index name="oai_datestamp" type="0">2004-07-09</z:index> - <z:index name="oai_setspec" type="0">jtrs</z:index> - <z:index name="dc_all" type="w"> - <z:index name="dc_title" type="w">Proceedings of the 4th - International Conference and Exhibition: - World Congress on Superconductivity - Volume I</z:index> - <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin - Burnham, Editors</z:index> - </z:index> - </z:record> + <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1" + z:id="oai:JTRS:CP-3290---Volume-I" + z:rank="47896"> + <z:index name="oai_identifier" type="0"> + oai:JTRS:CP-3290---Volume-I</z:index> + <z:index name="oai_datestamp" type="0">2004-07-09</z:index> + <z:index name="oai_setspec" type="0">jtrs</z:index> + <z:index name="dc_all" type="w"> + <z:index name="dc_title" type="w">Proceedings of the 4th + International Conference and Exhibition: + World Congress on 
Superconductivity - Volume I</z:index> + <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin + Burnham, Editors</z:index> + </z:index> + </z:record> - This means the following: From the original &xml; file - one-record.xml (or from the &xml; record &dom; of the + This means the following: From the original &acro.xml; file + one-record.xml (or from the &acro.xml; record &acro.dom; of the same form coming from a split input file), the indexing - stylesheet produces an indexing &xml; record, which is defined by + stylesheet produces an indexing &acro.xml; record, which is defined by the record element in the magic namespace xmlns:z="http://indexdata.dk/zebra/xslt/1". - &zebra; uses the content of + &zebra; uses the content of z:id="oai:JTRS:CP-3290---Volume-I" as internal - record ID, and - in case static ranking is set - the content of + record ID, and - in case static ranking is set - the content of z:rank="47896" as static rank. Following the discussion in we see that this record is internally ordered lexicographically according to the value of the string oai:JTRS:CP-3290---Volume-I47896. - The type of action performed during indexing is defined by + In this example, the following literal indexes are constructed: - oai_identifier - oai_datestamp - oai_setspec - dc_all - dc_title - dc_creator + oai_identifier + oai_datestamp + oai_setspec + dc_all + dc_title + dc_creator - where the indexing type is defined in the - type attribute + where the indexing type is defined in the + type attribute (any value from the standard configuration - file default.idx will do). 
Finally, any text() node content recursively contained inside the index will be filtered through the appropriate char map for character normalization, and will be @@ -176,26 +174,26 @@ oai:JTRS:CP-3290---Volume-I will be inserted literally, byte for byte and without any form of character normalization, into the index named oai_identifier, - the text + the text Kumar Krishen and *Calvin Burnham, Editors will be inserted using the w character normalization defined in default.idx into the index dc_creator (that is, after character - normalization the index will keep the individual words - kumar, krishen, + normalization the index will keep the individual words + kumar, krishen, and, calvin, burnham, and editors), and finally both the texts Proceedings of the 4th International Conference and Exhibition: - World Congress on Superconductivity - Volume I + World Congress on Superconductivity - Volume I and - Kumar Krishen and *Calvin Burnham, Editors + Kumar Krishen and *Calvin Burnham, Editors will be inserted into the index dc_all using - the same character normalization map w. + the same character normalization map w. - Finally, this example configuration can be queried using &pqf; - queries, either transported by &z3950;, (here using a yaz-client) + Finally, this example configuration can be queried using &acro.pqf; + queries, either transported by &acro.z3950;, (here using a yaz-client) open localhost:9999 @@ -212,21 +210,21 @@ or the proprietary extensions x-pquery and x-pScanClause to - &sru;, and &srw; + &acro.sru;, and &acro.srw; - See for more information on &sru;/&srw; + See for more information on &acro.sru;/&acro.srw; configuration, and or the &yaz; - &cql; section + &acro.cql; section for details on the &yaz; frontend server. Notice that there are no *.abs, - *.est, *.map, or other &grs1; + *.est, *.map, or other &acro.grs1; filter configuration files involved in this process, and that the literal index names are used during search and retrieval. 
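The way text ends up in the named indexes, including the nested dc_all index that receives the text of its children, can be illustrated outside of &zebra;. The sketch below is an approximation written for this explanation only: the helper `collect_indexes` is hypothetical, and the lowercase word split used for type w is a crude stand-in for the real character maps configured in default.idx.

```python
import re
import xml.etree.ElementTree as ET

# Clark-notation prefix for the magic indexing namespace.
Z = "{http://indexdata.dk/zebra/xslt/1}"


def collect_indexes(record_xml):
    """Return (record id, {index name: [terms]}) for one indexing record.

    Hypothetical helper. Type "w" is approximated by lowercasing and
    splitting on non-alphanumerics; any other type keeps the text as is.
    """
    record = ET.fromstring(record_xml)
    indexes = {}
    # iter() also visits nested z:index elements, so an outer index
    # sees the text of every index it contains.
    for elem in record.iter(Z + "index"):
        text = "".join(elem.itertext())
        if elem.get("type") == "w":
            terms = re.findall(r"[a-z0-9]+", text.lower())
        else:
            terms = [text]
        indexes.setdefault(elem.get("name"), []).extend(terms)
    return record.get(Z + "id"), indexes


# Shortened version of the example record from the text.
RECORD = """<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
    z:id="oai:JTRS:CP-3290---Volume-I" z:rank="47896">
  <z:index name="oai_identifier" type="0">oai:JTRS:CP-3290---Volume-I</z:index>
  <z:index name="dc_all" type="w">
    <z:index name="dc_title" type="w">World Congress on Superconductivity - Volume I</z:index>
    <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin Burnham, Editors</z:index>
  </z:index>
</z:record>"""

if __name__ == "__main__":
    record_id, indexes = collect_indexes(RECORD)
    print(record_id)
    for name, terms in indexes.items():
        print(name, terms)
```

In this sketch dc_creator holds the six words listed in the text, while dc_all holds the words of both the title and the creator, because its text() content is gathered recursively.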
@@ -238,14 +236,14 @@ ALVIS Record Model Configuration -
- ALVIS Indexing Configuration +
+ ALVIS Indexing Configuration As mentioned above, there can be only one indexing stylesheet, and configuration of the indexing process is a synonym - of writing an &xslt; stylesheet which produces &xml; output containing the - magic elements discussed in - . + of writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the + magic elements discussed in + . Obviously, there are millions of different ways to accomplish this task, and some comments and code snippets are in order to lead our Padawans on the right track to the good side of the force. @@ -253,122 +251,121 @@ Stylesheets can be written in the pull or the push style: pull - means that the output &xml; structure is taken as starting point of - the internal structure of the &xslt; stylesheet, and portions of - the input &xml; are pulled out and inserted - into the right spots of the output &xml; structure. On the other - side, push &xslt; stylesheets are recursively + means that the output &acro.xml; structure is taken as starting point of + the internal structure of the &acro.xslt; stylesheet, and portions of + the input &acro.xml; are pulled out and inserted + into the right spots of the output &acro.xml; structure. On the other + side, push &acro.xslt; stylesheets are recursively calling their template definitions, a process which is commanded - by the input &xml; structure, and are triggered to produce some output &xml; + by the input &acro.xml; structure, and are triggered to produce some output &acro.xml; whenever some special conditions in the input stylesheets are met. The pull type is well-suited for input - &xml; with strong and well-defined structure and semantics, like the - following &oai; indexing example, whereas the + &acro.xml; with strong and well-defined structure and semantics, like the + following &acro.oai; indexing example, whereas the push type might be the only possible way to - sort out deeply recursive input &xml; formats. 
+ sort out deeply recursive input &acro.xml; formats. - + A pull stylesheet example used to index - &oai; harvested records could use some of the following template + &acro.oai; harvested records could use some of the following template definitions: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + xmlns:z="http://indexdata.dk/zebra/xslt/1" + xmlns:oai="http://www.openarchives.org/&acro.oai;/2.0/" + xmlns:oai_dc="http://www.openarchives.org/&acro.oai;/2.0/oai_dc/" + xmlns:dc="http://purl.org/dc/elements/1.1/" + version="1.0"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ]]> Notice also, that the names and types of the indexes can be defined in the - indexing &xslt; stylesheet dynamically according to - content in the original &xml; records, which has + indexing &acro.xslt; stylesheet dynamically according to + content in the original &acro.xml; records, which has opportunities for great power and wizardry as well as grande - disaster. + disaster. The following excerpt of a push stylesheet - might - be a good idea according to your strict control of the &xml; + might + be a good idea according to your strict control of the &acro.xml; input format (due to rigorous checking against well-defined and - tight RelaxNG or &xml; Schema's, for example): + tight RelaxNG or &acro.xml; Schema's, for example): - - - - + + + + + ]]> - This template creates indexes which have the name of the working - node of any input &xml; file, and assigns a '1' to the index. - The example query - find @attr 1=xyz 1 + This template creates indexes which have the name of the working + node of any input &acro.xml; file, and assigns a '1' to the index. + The example query + find @attr 1=xyz 1 finds all files which contain at least one - xyz &xml; element. In case you can not control + xyz &acro.xml; element. In case you can not control which element names the input files contain, you might ask for disaster and bad karma using this technique. 
One variation over the theme dynamically created - indexes will definitely be unwise: + indexes will definitely be unwise: - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + ]]> Don't be tempted to cross @@ -376,70 +373,70 @@ to suffering and pain, and universal disintegration of your project schedule. -
+
-
- ALVIS Exchange Formats - +
+ ALVIS Exchange Formats + An exchange format can be anything which can be the outcome of an - &xslt; transformation, as far as the stylesheet is registered in - the main Alvis &xslt; filter configuration file, see + &acro.xslt; transformation, as far as the stylesheet is registered in + the main Alvis &acro.xslt; filter configuration file, see . - In principle anything that can be expressed in &xml;, HTML, and - TEXT can be the output of a schema or - element set directive during search, as long as - the information comes from the - original input record &xml; &dom; tree - (and not the transformed and indexed &xml;!!). + In principle anything that can be expressed in &acro.xml;, HTML, and + TEXT can be the output of a schema or + element set directive during search, as long as + the information comes from the + original input record &acro.xml; &acro.dom; tree + (and not the transformed and indexed &acro.xml;!!). In addition, internal administrative information from the &zebra; indexer can be accessed during record retrieval. The following example is a summary of the possibilities: - - - - - - - - - - - - - - - - - - - - - + xmlns:z="http://indexdata.dk/zebra/xslt/1" + version="1.0"> + + + + + + + + + + + + + + + + + + + + ]]> -
+
-
- ALVIS Filter &oai; Indexing Example - +
+ ALVIS Filter &acro.oai; Indexing Example + The source code tarball contains a working Alvis filter example in the directory examples/alvis-oai/, which - should get you started. + should get you started. - More example data can be harvested from any &oai; compliant server, - see details at the &oai; + More example data can be harvested from any &acro.oai; compliant server, + see details at the &acro.oai; http://www.openarchives.org/ web site, and the community - links at + links at http://www.openarchives.org/community/index.html. There is a tutorial @@ -451,7 +448,7 @@
- + @@ -465,7 +462,7 @@ sgml-always-quote-attributes:t sgml-indent-step:1 sgml-indent-data:t - sgml-parent-document: "zebra.xml" + sgml-parent-document: "idzebra.xml" sgml-local-catalogs: nil sgml-namecase-general:t End: