X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Frecordmodel-alvisxslt.xml;h=328bbce68b27c9a79765944d0663605c19a74fea;hb=5ca4e60e990af6ad6b62ebff855d7b642f37c3ec;hp=764190da266a33d214b29a7414a86ba896fd767a;hpb=14a2dbce03d7802ab5b1e57b09d915339bb5fc54;p=idzebra-moved-to-github.git
diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml
index 764190d..328bbce 100644
--- a/doc/recordmodel-alvisxslt.xml
+++ b/doc/recordmodel-alvisxslt.xml
@@ -1,55 +1,448 @@
-
- ALVIS XML Record Model and Filter Module
+
+ ALVIS &xml; Record Model and Filter Module
The record model described in this chapter applies to the fundamental,
- structured XML
+ structured &xml;
record type alvis, introduced in
- . The ALVIS XML record model
+ . The ALVIS &xml; record model
is experimental, and its inner workings might change in future
- releases of the Zebra Information Server.
+ releases of the &zebra; Information Server.
-
-
+ This filter has been developed under the
+ ALVIS project funded by
+ the European Community under the "Information Society Technologies"
+ Program (2002-2006).
+
-
- ALLVIS Record Filter
+
+ ALVIS Record Filter
- The experimental, loadable Alvis XM/XSLT filter module
+ The experimental, loadable Alvis &xml;/XSLT filter module
mod-alvis.so is packaged in the GNU/Debian package
libidzebra1.4-mod-alvis.
+ It is invoked by the zebra.cfg configuration statement
+
+ recordtype.xml: alvis.db/filter_alvis_conf.xml
+
+ In this example, the filter is applied to all data files with suffix
+ *.xml, and the
+ Alvis XSLT filter configuration file is found at the
+ path db/filter_alvis_conf.xml.
+
+ The Alvis XSLT filter configuration file must be
+ valid &xml;. It might look like this (this example is
+ used for indexing and display of OAI-harvested records):
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <schemaInfo>
+ <schema name="identity" stylesheet="xsl/identity.xsl" />
+ <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+ stylesheet="xsl/oai2index.xsl" />
+ <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
+ <!-- use split level 2 when indexing whole OAI Record lists -->
+ <split level="2"/>
+ </schemaInfo>
+
+
+
+ All named stylesheets defined inside
+ schema element tags
+ are for presentation after search, including
+ the indexing stylesheet (which is a great debugging help). The
+ names defined in the name attributes must be
+ unique; these are the literal schema or
+ element set names used in
+ SRW,
+ SRU and
+ Z39.50 protocol queries.
+ The paths in the stylesheet attributes
+ are relative to &zebra;'s working directory, or absolute from the file
+ system root.
+
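+ As a minimal sketch of what a presentation stylesheet such as
+ xsl/identity.xsl could contain (its real content is not shown here, so
+ the body below is an assumption), a plain XSLT 1.0 identity
+ transformation returns the stored record unchanged:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical content for xsl/identity.xsl: an assumption, not the shipped file. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Copy the whole record document to the output without changes. -->
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

+ Requesting the schema or element set name identity would then return
+ the record as it was split from the input file.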
+
+ The <split level="2"/> element decides where the
+ &xml; Reader shall split the
+ collections of records into individual records, which then are
+ loaded into DOM, and have the indexing XSLT stylesheet applied.
+
+
+ There must be exactly one indexing XSLT stylesheet, which is
+ defined by the magic attribute
+ identifier="http://indexdata.dk/zebra/xslt/1".
-
- ALLVIS Internal Record Representation
- FIXME
-
-
-
- ALLVIS Canonical Format
- FIXME
-
-
-
-
-
-
-
- ALLVIS Record Model Configuration
- FIXME
-
-
-
-
+
+ ALVIS Internal Record Representation
+ When indexing, an &xml; Reader is invoked to split the input
+ files into suitable record &xml; pieces. Each record piece is then
+ transformed to an &xml; DOM structure, which is essentially the
+ record model. Only XSLT transformations can be applied during
+ indexing, search and retrieval. Consequently, output formats are
+ restricted to whatever XSLT can deliver from the record &xml;
+ structure, be it other &xml; formats, HTML, or plain text. In case
+ you have libxslt1 running with EXSLT support,
+ you can use this functionality inside the Alvis
+ filter configuration XSLT stylesheets.
+
+
+
+
+ ALVIS Canonical Indexing Format
+ The output of the indexing XSLT stylesheets must contain
+ certain elements in the magic
+ xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ namespace. The output of the XSLT indexing transformation is then
+ parsed using DOM methods, and the contained instructions are
+ performed on the magic elements and their
+ subtrees.
+
+
+ For example, the output of the command
+
+ xsltproc xsl/oai2index.xsl one-record.xml
+
+ might look like this:
+
+ <?xml version="1.0" encoding="UTF-8"?>
+ <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ z:id="oai:JTRS:CP-3290---Volume-I"
+ z:rank="47896"
+ z:type="update">
+ <z:index name="oai_identifier" type="0">
+ oai:JTRS:CP-3290---Volume-I</z:index>
+ <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
+ <z:index name="oai_setspec" type="0">jtrs</z:index>
+ <z:index name="dc_all" type="w">
+ <z:index name="dc_title" type="w">Proceedings of the 4th
+ International Conference and Exhibition:
+ World Congress on Superconductivity - Volume I</z:index>
+ <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
+ Burnham, Editors</z:index>
+ </z:index>
+ </z:record>
+
+
+ This means the following: From the original &xml; file
+ one-record.xml (or from the &xml; record DOM of the
+ same form coming from a split input file), the indexing
+ stylesheet produces an indexing &xml; record, which is defined by
+ the record element in the magic namespace
+ xmlns:z="http://indexdata.dk/zebra/xslt/1".
+ &zebra; uses the content of
+ z:id="oai:JTRS:CP-3290---Volume-I" as internal
+ record ID, and, in case static ranking is set, the content of
+ z:rank="47896" as static rank. Following the
+ discussion in
+ we see that this record is internally ordered
+ lexicographically according to the value of the string
+ oai:JTRS:CP-3290---Volume-I47896.
+ The type of action performed during indexing is defined by
+ z:type="update", with recognized values
+ insert, update, and
+ delete.
+
+ In this example, the following literal indexes are constructed:
+
+ oai_identifier
+ oai_datestamp
+ oai_setspec
+ dc_all
+ dc_title
+ dc_creator
+
+ where the indexing type is defined in the
+ type attribute
+ (any value from the standard configuration
+ file default.idx will do). Finally, any
+ text() node content recursively contained
+ inside the index will be filtered through the
+ appropriate charmap for character normalization, and will be
+ inserted in the index.
+
+
+ Specific to this example, we see that the single word
+ oai:JTRS:CP-3290---Volume-I will be inserted literally,
+ byte for byte without any form of character normalization,
+ into the index named oai_identifier;
+ the text
+ Kumar Krishen and *Calvin Burnham, Editors
+ will be inserted using the w character
+ normalization defined in default.idx into
+ the index dc_creator (that is, after character
+ normalization the index will keep the individual words
+ kumar, krishen,
+ and, calvin,
+ burnham, and editors), and
+ finally both the texts
+ Proceedings of the 4th International Conference and Exhibition:
+ World Congress on Superconductivity - Volume I
+ and
+ Kumar Krishen and *Calvin Burnham, Editors
+ will be inserted into the index dc_all using
+ the same character normalization map w.
+
+
+ Finally, this example configuration can be queried using PQF
+ queries, either transported by Z39.50 (here using yaz-client)
+
Z> open localhost:9999
+ Z> elem dc
+ Z> form xml
+ Z>
+ Z> f @attr 1=dc_creator Kumar
+ Z> scan @attr 1=dc_creator adam
+ Z>
+ Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
+ Z> scan @attr 1=dc_title abc
+ ]]>
+
+ or the proprietary
+ extensions x-pquery and
+ x-pScanClause to
+ SRU and SRW
+
+
+
+ See for more information on SRU/SRW
+ configuration, and or the YAZ
+ CQL section
+ for the details of the YAZ frontend server.
+
+
+ Notice that there are no *.abs,
+ *.est, *.map, or other GRS-1
+ filter configuration files involved in this process, and that the
+ literal index names are used during search and retrieval.
+
+
+
+
+
+
+ ALVIS Record Model Configuration
+
+
+
+ ALVIS Indexing Configuration
+
+ As mentioned above, there can be only one indexing
+ stylesheet, and configuring the indexing process is synonymous
+ with writing an XSLT stylesheet which produces &xml; output containing the
+ magic elements discussed in
+ .
+ Obviously, there are millions of different ways to accomplish this
+ task, and some comments and code snippets are in order to lead
+ our padawans on the right track to the good side of the force.
+
+
+ Stylesheets can be written in the pull or
+ the push style: pull
+ means that the output &xml; structure is taken as the starting point of
+ the internal structure of the XSLT stylesheet, and portions of
+ the input &xml; are pulled out and inserted
+ into the right spots of the output &xml; structure. On the other
+ hand, push XSLT stylesheets recursively
+ call their template definitions, a process which is commanded
+ by the input &xml; structure, and awake to produce some output &xml;
+ whenever some special conditions in the stylesheets are
+ met. The pull type is well-suited for input
+ &xml; with strong and well-defined structure and semantics, like the
+ following OAI indexing example, whereas the
+ push type might be the only possible way to
+ sort out deeply recursive input &xml; formats.
+
+
+ A pull stylesheet example used to index
+ OAI harvested records could use some of the following template
+ definitions:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ]]>
+
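+ A minimal sketch of a pull stylesheet for OAI records follows. The
+ element paths and the choice of indexes are assumptions modelled on the
+ indexing record shown earlier, not the stylesheet shipped with &zebra;;
+ the namespace URIs are the standard OAI-PMH 2.0, oai_dc and Dublin Core
+ ones.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: paths and index names are assumptions for illustration. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai oai_dc dc">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Pull style: the output structure is written out once, and
       values are pulled from fixed places in the input record. -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">

      <!-- Header fields go into literal (type 0) indexes. -->
      <z:index name="oai_identifier" type="0">
        <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
      </z:index>
      <z:index name="oai_datestamp" type="0">
        <xsl:value-of select="oai:record/oai:header/oai:datestamp"/>
      </z:index>

      <!-- Dublin Core fields go into word (type w) indexes; the nested
           text is also collected by the enclosing dc_all index. -->
      <z:index name="dc_all" type="w">
        <xsl:for-each select="oai:record/oai:metadata/oai_dc:dc/dc:title">
          <z:index name="dc_title" type="w">
            <xsl:value-of select="."/>
          </z:index>
        </xsl:for-each>
        <xsl:for-each select="oai:record/oai:metadata/oai_dc:dc/dc:creator">
          <z:index name="dc_creator" type="w">
            <xsl:value-of select="."/>
          </z:index>
        </xsl:for-each>
      </z:index>

    </z:record>
  </xsl:template>

</xsl:stylesheet>
```

+ Running such a stylesheet through xsltproc on a single record is a
+ quick way to check that the z:record output has the expected shape
+ before indexing.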
+
+
+ Notice also
+ that the names and types of the indexes can be defined in the
+ indexing XSLT stylesheet dynamically according to
+ content in the original &xml; records, which offers
+ opportunities for great power and wizardry as well as grand
+ disaster.
+
+
+ The following excerpt of a push stylesheet
+ might
+ be a good idea if you have strict control of the &xml;
+ input format (due to rigorous checking against well-defined and
+ tight RelaxNG or &xml; Schemas, for example):
+
+
+
+
+
+
+ ]]>
+
+ This template creates indexes which have the name of the working
+ node of any input &xml; file, and assigns a '1' to the index.
+ The example query
+ find @attr 1=xyz 1
+ finds all files which contain at least one
+ xyz &xml; element. In case you cannot control
+ which element names the input files contain, you might ask for
+ disaster and bad karma using this technique.
+
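+ A minimal sketch of such a push template, written as a complete
+ stylesheet so that it can be tried with xsltproc. The surrounding root
+ template, the index type 0, and the omission of z:id and z:rank are
+ assumptions made for brevity:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: one index per element name, each holding the value '1'. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Wrap everything produced for one input record in a z:record. -->
  <xsl:template match="/">
    <z:record z:type="update">
      <xsl:apply-templates select="*"/>
    </z:record>
  </xsl:template>

  <!-- Push style: this template fires for every element the input
       happens to contain, and names the index after that element. -->
  <xsl:template match="*">
    <z:index name="{name()}" type="0">1</z:index>
    <!-- Recurse into child elements only; text nodes are ignored. -->
    <xsl:apply-templates select="*"/>
  </xsl:template>

</xsl:stylesheet>
```

+ With this sketch, an input record containing an xyz element yields an
+ index named xyz, which is what the example query above relies on.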
+
+ One variation on the theme of dynamically created
+ indexes will definitely be unwise:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ]]>
+
+ Don't be tempted to cross
+ the line to the dark side of the force, padawan; this leads
+ to suffering and pain, and universal
+ disintegration of your project schedule.
+
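+ One plausible reading of this warning, given here as an assumption and
+ as an example of what to avoid, is a stylesheet that takes index names
+ from record values instead of element names. Every record then creates
+ its own, unpredictable index:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Anti-pattern sketch (assumed, for illustration): do not use. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    exclude-result-prefixes="oai oai_dc">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="/">
    <z:record z:type="update">
      <!-- Unwise: the index name is a value found in the record, so
           the number of index names grows with the number of records. -->
      <z:index name="{normalize-space(oai:record/oai:header/oai:identifier)}"
               type="w">
        <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
      </z:index>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
```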
+
+
+
ALVIS Exchange Formats
- FIXME
-
-
-
+
+ An exchange format can be anything which can be the outcome of an
+ XSLT transformation, as long as the stylesheet is registered in
+ the main Alvis XSLT filter configuration file, see
+ .
+ In principle anything that can be expressed in &xml;, HTML, and
+ TEXT can be the output of a schema or
+ element set directive during search, as long as
+ the information comes from the
+ original input record &xml; DOM tree
+ (and not the transformed and indexed &xml;!).
+
+
+ In addition, internal administrative information from the &zebra;
+ indexer can be accessed during record retrieval. The following
+ example is a summary of the possibilities:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ]]>
+
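+ A minimal sketch of a retrieval stylesheet that reports such
+ administrative values next to the record. It assumes that the filter
+ hands them to the stylesheet as top-level XSLT parameters; the parameter
+ names id, filename, score and schema, and the result, admin and record
+ wrapper elements, are assumptions to be checked against your &zebra;
+ version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: parameter and wrapper element names are assumptions. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Empty defaults keep the stylesheet usable outside the indexer,
       for example when testing it with xsltproc. -->
  <xsl:param name="id" select="''"/>
  <xsl:param name="filename" select="''"/>
  <xsl:param name="score" select="''"/>
  <xsl:param name="schema" select="''"/>

  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <result>
      <!-- Administrative values supplied from outside the record. -->
      <admin>
        <id><xsl:value-of select="$id"/></id>
        <filename><xsl:value-of select="$filename"/></filename>
        <score><xsl:value-of select="$score"/></score>
        <schema><xsl:value-of select="$schema"/></schema>
      </admin>
      <!-- The original input record, copied unchanged. -->
      <record>
        <xsl:copy-of select="."/>
      </record>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

+ Like any other presentation stylesheet, it would need its own schema
+ entry in the filter configuration file before it can be requested by
+ name.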
+
+
+
+
+
+ ALVIS Filter OAI Indexing Example
+
+ The source code tarball contains a working Alvis filter example in
+ the directory examples/alvis-oai/, which
+ should get you started.
+
+
+ More example data can be harvested from any OAI-compliant server,
+ see details at the OAI
+
+ http://www.openarchives.org/ web site, and the community
+ links at
+
+ http://www.openarchives.org/community/index.html.
+ There is a tutorial
+ found at
+
+ http://www.oaforum.org/tutorial/.
+
+
+
+
@@ -72,7 +465,7 @@ c) Main "alvis" XSLT filter config file:
- the pathes are relative to the directory where zebra.init is placed
+ the paths are relative to the directory where zebra.init is placed
and is started up.
The split level decides where the SAX parser shall split the
@@ -94,12 +487,12 @@ c) Main "alvis" XSLT filter config file:
and so on.
- in db/ a cql2pqf.txt yaz-client config file
- which is also used in the yaz-server CQL-to-PQF process
+ which is also used in the yaz-server CQL-to-PQF process
see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
- in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT thing,
- as it constructs the new XML structure by pulling data out of the
+ as it constructs the new &xml; structure by pulling data out of the
respective elements/attributes of the old structure.
Notice the special zebra namespace, and the special elements in this
@@ -109,7 +502,7 @@ c) Main "alvis" XSLT filter config file:
indicates that a new record with given id and static rank has to be updated.
- encloses all the text/XML which shall be indexed in the index named
+ encloses all the text/&xml; which shall be indexed in the index named
"title" and of index type "w" (see file default.idx in your zebra
installation)