X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Frecordmodel-alvisxslt.xml;h=c633022cae14edd6c4b1ec81959c4c68d4403eaa;hp=a322f74aa1282fbd16f2fab172db85aa273d1a54;hb=99842ec71f065fd6886daa355923b01d9ce71d26;hpb=495a66ecd5fb966a8bd52f95dc25cde9d673e569 diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml index a322f74..c633022 100644 --- a/doc/recordmodel-alvisxslt.xml +++ b/doc/recordmodel-alvisxslt.xml @@ -1,41 +1,48 @@ - - - ALVIS XML Record Model and Filter Module - + + ALVIS &acro.xml; Record Model and Filter Module + + + + The functionality of this record model has been improved and + replaced by the DOM &acro.xml; record model, see + . The Alvis &acro.xml; record + model is considered obsolete, and will eventually be removed + from future releases of the &zebra; software. + + The record model described in this chapter applies to the fundamental, - structured XML + structured &acro.xml; record type alvis, introduced in - . The ALVIS XML record model - is experimental, and it's inner workings might change in future - releases of the Zebra Information Server. + . This filter has been developed under the ALVIS project funded by the European Community under the "Information Society Technologies" - Programme (2002-2006). + Program (2002-2006). - +
ALVIS Record Filter - The experimental, loadable Alvis XM/XSLT filter module + The experimental, loadable Alvis &acro.xml;/&acro.xslt; filter module mod-alvis.so is packaged in the GNU/Debian package libidzebra1.4-mod-alvis. - It is invoked by the zebra configuration statement + It is invoked by the zebra.cfg configuration statement recordtype.xml: alvis.db/filter_alvis_conf.xml - on all data files with suffix .xml, where the - alvis XSLT filter config file is found in the - path db/filter_alvis_conf.xml + In this example on all data files with suffix + *.xml, where the + Alvis &acro.xslt; filter configuration file is found in the + path db/filter_alvis_conf.xml. - The alvis XSLT filter config file must be - valid XML. It might look like this (used for indexing and display - of OAI harvested records): + The Alvis &acro.xslt; filter configuration file must be + valid &acro.xml;. It might look like this (This example is + used for indexing and display of &acro.oai; harvested records): <?xml version="1.0" encoding="UTF-8"?> <schemaInfo> @@ -56,47 +63,47 @@ names defined in the name attributes must be unique, these are the literal schema or element set names used in - SRW, - SRU and - Z39.50 protocol queries. - The pathes in the stylesheet attributes + &acro.srw;, + &acro.sru; and + &acro.z3950; protocol queries. + The paths in the stylesheet attributes are relative to zebras working directory, or absolute to file system root. The <split level="2"/> decides where the - XML Reader shall split the + &acro.xml; Reader shall split the collections of records into individual records, which then are - loaded into DOM, and have the indexing XSLT stylesheet applied. + loaded into &acro.dom;, and have the indexing &acro.xslt; stylesheet applied. - There must be exactly one indexing XSLT stylesheet, which is + There must be exactly one indexing &acro.xslt; stylesheet, which is defined by the magic attribute identifier="http://indexdata.dk/zebra/xslt/1". - +
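For orientation, a filter configuration of the kind described in this section might be sketched as follows. This is only a sketch under stated assumptions: the child element name schema, the schema names identity and index, and all stylesheet paths are illustrative and not taken from the text above. Only the schemaInfo root, the name, stylesheet and identifier attributes, the magic identifier value, the dc element set name and the split element come from the description.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical Alvis filter configuration; file names are assumptions -->
<schemaInfo>
  <!-- presentation stylesheet, selected by its unique name -->
  <schema name="identity" stylesheet="xsl/identity.xsl"/>
  <!-- the single indexing stylesheet, found by the magic identifier -->
  <schema name="index"
          identifier="http://indexdata.dk/zebra/xslt/1"
          stylesheet="xsl/oai2index.xsl"/>
  <!-- another presentation stylesheet, requested as element set "dc" -->
  <schema name="dc" stylesheet="xsl/oai2dc.xsl"/>
  <!-- split input files into records at element depth 2 -->
  <split level="2"/>
</schemaInfo>
```

The stylesheet paths in this sketch are relative, so they would be resolved against the working directory of the server, as stated above.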
ALVIS Internal Record Representation - When indexing, an XML Reader is invoked to split the input - files into suitable record XML pieces. Each record piece is then - transformed to an XML DOM structire, which is essentially the - record model. Only XSLT transfomations can be applied during + When indexing, an &acro.xml; Reader is invoked to split the input + files into suitable record &acro.xml; pieces. Each record piece is then + transformed to an &acro.xml; &acro.dom; structure, which is essentially the + record model. Only &acro.xslt; transformations can be applied during index, search and retrieval. Consequently, output formats are - restricted to whatever XSLT can deliver from the record XML - structure, be it other XML formats, HTML, or plain text. In case - you have libxslt1 running with EXSLT support, - you can use this functionality inside the alvis - filter configuraiton XSLT stylesheets. + restricted to whatever &acro.xslt; can deliver from the record &acro.xml; + structure, be it other &acro.xml; formats, HTML, or plain text. In case + you have libxslt1 running with E&acro.xslt; support, + you can use this functionality inside the Alvis + filter configuration &acro.xslt; stylesheets. - +
- +
ALVIS Canonical Indexing Format - The output of the indexing XSLT stylesheets must contain + The output of the indexing &acro.xslt; stylesheets must contain certain elements in the magic xmlns:z="http://indexdata.dk/zebra/xslt/1" - namespace. The output of the XSLT indexing transformation is then - parsed using DOM methods, and the contained instructions are + namespace. The output of the &acro.xslt; indexing transformation is then + parsed using &acro.dom; methods, and the contained instructions are performed on the magic elements and their subtrees. @@ -110,159 +117,339 @@ <?xml version="1.0" encoding="UTF-8"?> <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1" z:id="oai:JTRS:CP-3290---Volume-I" - z:rank="47896" - z:type="update"> - <z:index name="oai:identifier" type="0"> + z:rank="47896"> + <z:index name="oai_identifier" type="0"> oai:JTRS:CP-3290---Volume-I</z:index> - <z:index name="oai:datestamp" type="0">2004-07-09</z:index> - <z:index name="oai:setspec" type="0">jtrs</z:index> - <z:index name="dc:all" type="w"> - <z:index name="dc:title" type="w">Proceedings of the 4th + <z:index name="oai_datestamp" type="0">2004-07-09</z:index> + <z:index name="oai_setspec" type="0">jtrs</z:index> + <z:index name="dc_all" type="w"> + <z:index name="dc_title" type="w">Proceedings of the 4th International Conference and Exhibition: World Congress on Superconductivity - Volume I</z:index> - <z:index name="dc:creator" type="w">Kumar Krishen and *Calvin + <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin Burnham, Editors</z:index> </z:index> </z:record> - This means the following: From the original XML file - one-record.xml (or from the XML record DOM of the - same form coming from a splitted input file), the indexing - stylesheet produces an indexing XML record, which is defined by + This means the following: From the original &acro.xml; file + one-record.xml (or from the &acro.xml; record &acro.dom; of the + same form coming from a split input file), the 
indexing + stylesheet produces an indexing &acro.xml; record, which is defined by the record element in the magic namespace xmlns:z="http://indexdata.dk/zebra/xslt/1". - Zebra uses the content of + &zebra; uses the content of z:id="oai:JTRS:CP-3290---Volume-I" as internal record ID, and - in case static ranking is set - the content of z:rank="47896" as static rank. Following the - discussion in XXX we see that this records is internally ordered + discussion in + we see that this record is internally ordered lexicographically according to the value of the string oai:JTRS:CP-3290---Volume-I47896. - The type of action performed during indexing is defined by + - Then the following literal indexes are constructed: + In this example, the following literal indexes are constructed: - oai:identifier - oai:datestamp - oai:setspec - dc:all - dc:title - dc:creator + oai_identifier + oai_datestamp + oai_setspec + dc_all + dc_title + dc_creator where the indexing type is defined in the - type attribute (any value from the standard config - filedefault.idx will do). Finally, any + type attribute + (any value from the standard configuration + file default.idx will do). Finally, any text() node content recursively contained inside the index will be filtered through the - appropriate charmap for character normalization, and will be + appropriate char map for character normalization, and will be inserted in the index. - Notice that there are no .abs, - .est, .map, or other GRS-1 - filter configuration files involves in this process. Notice also, - that the names and types of the indexes can be defined in the - indexing XSLT stylesheet dynamically according to - content in the original XML records, which has - oppertunities for great power and great disaster. 
+ Specific to this example, we see that the single word + oai:JTRS:CP-3290---Volume-I will be inserted literally, + byte for byte without any form of character normalization, + into the index named oai_identifier, + the text + Kumar Krishen and *Calvin Burnham, Editors + will be inserted using the w character + normalization defined in default.idx into + the index dc_creator (that is, after character + normalization the index will keep the individual words + kumar, krishen, + and, calvin, + burnham, and editors), and + finally both the texts + Proceedings of the 4th International Conference and Exhibition: + World Congress on Superconductivity - Volume I + and + Kumar Krishen and *Calvin Burnham, Editors + will be inserted into the index dc_all using + the same character normalization map w. + + + Finally, this example configuration can be queried using &acro.pqf; + queries, either transported by &acro.z3950; (here using a yaz-client) + + Z> open localhost:9999 + Z> elem dc + Z> form xml + Z> + Z> f @attr 1=dc_creator Kumar + Z> scan @attr 1=dc_creator adam + Z> + Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity" + Z> scan @attr 1=dc_title abc + ]]> + + or the proprietary + extensions x-pquery and + x-pScanClause to + &acro.sru; and &acro.srw; + + + + See for more information on &acro.sru;/&acro.srw; + configuration, and or the &yaz; + &acro.cql; section + for the details of the &yaz; frontend server. - - + + Notice that there are no *.abs, + *.est, *.map, or other &acro.grs1; + filter configuration files involved in this process, and that the + literal index names are used during search and retrieval. + +
+
- +
ALVIS Record Model Configuration - +
ALVIS Indexing Configuration - FIXME + + As mentioned above, there can be only one indexing + stylesheet, and configuration of the indexing process is synonymous + with writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the + magic elements discussed in + . + Obviously, there are millions of different ways to accomplish this + task, and some comments and code snippets are in order to lead + our Padawans on the right track to the good side of the force. + + + Stylesheets can be written in the pull or + the push style: pull + means that the output &acro.xml; structure is taken as the starting point of + the internal structure of the &acro.xslt; stylesheet, and portions of + the input &acro.xml; are pulled out and inserted + into the right spots of the output &acro.xml; structure. On the other + hand, push &acro.xslt; stylesheets are recursively + calling their template definitions, a process which is commanded + by the input &acro.xml; structure, and are triggered to produce some output &acro.xml; + whenever some special conditions in the input &acro.xml; are + met. The pull type is well-suited for input + &acro.xml; with strong and well-defined structure and semantics, like the + following &acro.oai; indexing example, whereas the + push type might be the only possible way to + sort out deeply recursive input &acro.xml; formats. + + + A pull stylesheet example used to index + &acro.oai; harvested records could use some of the following template + definitions: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ]]> + - FIXME + + Notice also, + that the names and types of the indexes can be defined in the + indexing &acro.xslt; stylesheet dynamically according to + content in the original &acro.xml; records, which offers + opportunities for great power and wizardry as well as grande + disaster. 
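As an illustration of the pull style described above, a minimal indexing stylesheet for OAI Dublin Core records might look like the following sketch. It is not the stylesheet shipped with the software. It assumes input in the standard OAI-PMH 2.0 and oai_dc namespaces, and it reuses the index names and types from the canonical indexing example (oai_identifier, oai_datestamp, oai_setspec, dc_all, dc_title, dc_creator). Static ranking through z:rank is left out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a pull-style indexing stylesheet; an illustration only -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai oai_dc dc">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- do not copy stray text nodes into the output -->
  <xsl:template match="text()"/>

  <!-- the output structure is written out literally, and input
       values are pulled into it with xsl:value-of -->
  <xsl:template match="oai:record">
    <z:record z:id="{normalize-space(oai:header/oai:identifier)}">
      <z:index name="oai_identifier" type="0">
        <xsl:value-of select="oai:header/oai:identifier"/>
      </z:index>
      <z:index name="oai_datestamp" type="0">
        <xsl:value-of select="oai:header/oai:datestamp"/>
      </z:index>
      <xsl:for-each select="oai:header/oai:setSpec">
        <z:index name="oai_setspec" type="0">
          <xsl:value-of select="."/>
        </z:index>
      </xsl:for-each>
      <!-- text inside the nested indexes also lands in dc_all -->
      <z:index name="dc_all" type="w">
        <xsl:for-each select="oai:metadata/oai_dc:dc/dc:title">
          <z:index name="dc_title" type="w">
            <xsl:value-of select="."/>
          </z:index>
        </xsl:for-each>
        <xsl:for-each select="oai:metadata/oai_dc:dc/dc:creator">
          <z:index name="dc_creator" type="w">
            <xsl:value-of select="."/>
          </z:index>
        </xsl:for-each>
      </z:index>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
```

Such a stylesheet can be tried outside the indexer by running xsltproc against a single record file and comparing the result with the canonical indexing format.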
- FIXME + + The following excerpt of a push stylesheet + might + be a good idea, provided you have strict control of the &acro.xml; + input format (due to rigorous checking against well-defined and + tight RelaxNG or &acro.xml; Schemas, for example): + + + + + + + ]]> + + This template creates indexes which have the name of the working + node of any input &acro.xml; file, and assigns a '1' to the index. + The example query + find @attr 1=xyz 1 + finds all files which contain at least one + xyz &acro.xml; element. If you cannot control + which element names the input files contain, you are asking for + disaster and bad karma by using this technique. - + + One variation on the theme of dynamically created + indexes is definitely unwise: + + + + + + + + + + + + + + + + + + ]]> + + Don't be tempted to cross + the line to the dark side of the force, Padawan; this leads + to suffering and pain, and universal + disintegration of your project schedule. + 
- +
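The push-style technique from the indexing configuration section, where each element name becomes an index name holding the value 1, could be sketched as below. This is a hedged illustration and not the original listing. It assumes that local-name() is preferable to name(), so that namespace prefixes and colons stay out of the index names, and it leaves record ID handling aside.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a push-style stylesheet with dynamically named indexes -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <!-- wrap the whole input record in one indexing record;
       z:id and z:rank are omitted here for brevity -->
  <xsl:template match="/">
    <z:record>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- every element pushes one index entry named after itself,
       then processing continues with its child elements -->
  <xsl:template match="*">
    <z:index name="{local-name()}" type="0">1</z:index>
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- text content itself is not indexed in this sketch -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

With such a stylesheet, the set of index names is dictated entirely by the input, which is why the text above recommends it only for input validated against a tight schema.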
ALVIS Exchange Formats - FIXME - - - - - - - - - + + + + + + + + + + + + + + + + + + + ]]> + + - The indexing stylesheet is found by it's identifier. +
- All the other stylesheets are for presentation after search. +
+ ALVIS Filter &acro.oai; Indexing Example + + The source code tarball contains a working Alvis filter example in + the directory examples/alvis-oai/, which + should get you started. + + + More example data can be harvested from any &acro.oai; compliant server, + see details at the &acro.oai; + + http://www.openarchives.org/ web site, and the community + links at + + http://www.openarchives.org/community/index.html. + There is a tutorial + found at + + http://www.oaforum.org/tutorial/. + +
-- in data/ a short sample of harvested carnivorous plants - ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml +
-- in root also one single data record - nice for testing the xslt - stylesheets, - xsltproc db/alvis2index.xsl carni*.xml - - and so on. - -- in db/ a cql2pqf.txt yaz-client config file - which is also used in the yaz-server CQL-to-PQF process - - see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map - -- in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT thing, - as it constructs the new XML structure by pulling data out of the - respective elements/attributes of the old structure. - - Notice the special zebra namespace, and the special elements in this - namespace which indicate to the zebra indexer what to do. - - - indicates that a new record with given id and static rank has to be updated. - - - encloses all the text/XML which shall be indexed in the index named - "title" and of index type "w" (see file default.idx in your zebra - installation) - - - - - ---> - +