X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Frecordmodel-alvisxslt.xml;h=81d3dca04d23c2cadc4c074405076be171f0b1b3;hb=cc72249ff74400f6106897b513c10b932a67feec;hp=a322f74aa1282fbd16f2fab172db85aa273d1a54;hpb=495a66ecd5fb966a8bd52f95dc25cde9d673e569;p=idzebra-moved-to-github.git
diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml
index a322f74..81d3dca 100644
--- a/doc/recordmodel-alvisxslt.xml
+++ b/doc/recordmodel-alvisxslt.xml
@@ -1,268 +1,455 @@
-
- ALVIS XML Record Model and Filter Module
-
+ ALVIS &acro.xml; Record Model and Filter Module
+
+
+
+ The functionality of this record model has been improved and
+ replaced by the DOM &acro.xml; record model, see
+ . The Alvis &acro.xml; record
+ model is considered obsolete, and will eventually be removed
+ from future releases of the &zebra; software.
+
+
The record model described in this chapter applies to the fundamental,
- structured XML
+ structured &acro.xml;
record type alvis, introduced in
- . The ALVIS XML record model
- is experimental, and it's inner workings might change in future
- releases of the Zebra Information Server.
+ .
- This filter has been developed under the
+ This filter has been developed under the
ALVIS project funded by
the European Community under the "Information Society Technologies"
- Programme (2002-2006).
+ Program (2002-2006).
-
-
-
+
+
+
ALVIS Record Filter
- The experimental, loadable Alvis XM/XSLT filter module
- mod-alvis.so is packaged in the GNU/Debian package
+ The experimental, loadable Alvis &acro.xml;/&acro.xslt; filter module
+ mod-alvis.so is packaged in the GNU/Debian package
libidzebra1.4-mod-alvis.
- It is invoked by the zebra configuration statement
+ It is invoked by the zebra.cfg configuration statement
recordtype.xml: alvis.db/filter_alvis_conf.xml
- on all data files with suffix .xml, where the
- alvis XSLT filter config file is found in the
- path db/filter_alvis_conf.xml
+ In this example, the filter is applied to all data files with suffix
+ *.xml, where the
+ Alvis &acro.xslt; filter configuration file is found in the
+ path db/filter_alvis_conf.xml.
- The alvis XSLT filter config file must be
- valid XML. It might look like this (used for indexing and display
- of OAI harvested records):
+ The Alvis &acro.xslt; filter configuration file must be
+ valid &acro.xml;. It might look like this (this example is
+ used for indexing and display of &acro.oai; harvested records):
- <?xml version="1.0" encoding="UTF-8"?>
- <schemaInfo>
- <schema name="identity" stylesheet="xsl/identity.xsl" />
- <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
- stylesheet="xsl/oai2index.xsl" />
- <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
- <!-- use split level 2 when indexing whole OAI Record lists -->
- <split level="2"/>
- </schemaInfo>
-
+ <?xml version="1.0" encoding="UTF-8"?>
+ <schemaInfo>
+ <schema name="identity" stylesheet="xsl/identity.xsl" />
+ <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+ stylesheet="xsl/oai2index.xsl" />
+ <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
+ <!-- use split level 2 when indexing whole OAI Record lists -->
+ <split level="2"/>
+ </schemaInfo>
+
All named stylesheets defined inside
- schema element tags
+ schema element tags
are for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the name attributes must be
- unique, these are the literal schema or
- element set names used in
- SRW,
- SRU and
- Z39.50 protocol queries.
- The pathes in the stylesheet attributes
+ unique; these are the literal schema or
+ element set names used in
+ &acro.srw;,
+ &acro.sru; and
+ &acro.z3950; protocol queries.
+ The paths in the stylesheet attributes
are relative to zebras working directory, or absolute to file
system root.
The <split level="2"/> decides where the
- XML Reader shall split the
+ &acro.xml; Reader shall split the
collections of records into individual records, which then are
- loaded into DOM, and have the indexing XSLT stylesheet applied.
+ loaded into &acro.dom;, and have the indexing &acro.xslt; stylesheet applied.
- There must be exactly one indexing XSLT stylesheet, which is
- defined by the magic attribute
+ There must be exactly one indexing &acro.xslt; stylesheet, which is
+ defined by the magic attribute
identifier="http://indexdata.dk/zebra/xslt/1".
-
- ALVIS Internal Record Representation
- When indexing, an XML Reader is invoked to split the input
- files into suitable record XML pieces. Each record piece is then
- transformed to an XML DOM structire, which is essentially the
- record model. Only XSLT transfomations can be applied during
- index, search and retrieval. Consequently, output formats are
- restricted to whatever XSLT can deliver from the record XML
- structure, be it other XML formats, HTML, or plain text. In case
- you have libxslt1 running with EXSLT support,
- you can use this functionality inside the alvis
- filter configuraiton XSLT stylesheets.
+
+ ALVIS Internal Record Representation
+ When indexing, an &acro.xml; Reader is invoked to split the input
+ files into suitable record &acro.xml; pieces. Each record piece is then
+ transformed to an &acro.xml; &acro.dom; structure, which is essentially the
+ record model. Only &acro.xslt; transformations can be applied during
+ index, search and retrieval. Consequently, output formats are
+ restricted to whatever &acro.xslt; can deliver from the record &acro.xml;
+ structure, be it other &acro.xml; formats, HTML, or plain text. In case
+ you have libxslt1 running with EXSLT support,
+ you can use this functionality inside the Alvis
+ filter configuration &acro.xslt; stylesheets.
-
+
-
- ALVIS Canonical Indexing Format
- The output of the indexing XSLT stylesheets must contain
- certain elements in the magic
+
+ ALVIS Canonical Indexing Format
+ The output of the indexing &acro.xslt; stylesheets must contain
+ certain elements in the magic
xmlns:z="http://indexdata.dk/zebra/xslt/1"
- namespace. The output of the XSLT indexing transformation is then
- parsed using DOM methods, and the contained instructions are
- performed on the magic elements and their
- subtrees.
+ namespace. The output of the &acro.xslt; indexing transformation is then
+ parsed using &acro.dom; methods, and the contained instructions are
+ performed on the magic elements and their
+ subtrees.
- For example, the output of the command
-
+ For example, the output of the command
+
xsltproc xsl/oai2index.xsl one-record.xml
-
+
might look like this:
<?xml version="1.0" encoding="UTF-8"?>
- <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
- z:id="oai:JTRS:CP-3290---Volume-I"
- z:rank="47896"
- z:type="update">
- <z:index name="oai:identifier" type="0">
- oai:JTRS:CP-3290---Volume-I</z:index>
- <z:index name="oai:datestamp" type="0">2004-07-09</z:index>
- <z:index name="oai:setspec" type="0">jtrs</z:index>
- <z:index name="dc:all" type="w">
- <z:index name="dc:title" type="w">Proceedings of the 4th
- International Conference and Exhibition:
- World Congress on Superconductivity - Volume I</z:index>
- <z:index name="dc:creator" type="w">Kumar Krishen and *Calvin
- Burnham, Editors</z:index>
- </z:index>
- </z:record>
+ <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
+ z:id="oai:JTRS:CP-3290---Volume-I"
+ z:rank="47896">
+ <z:index name="oai_identifier" type="0">
+ oai:JTRS:CP-3290---Volume-I</z:index>
+ <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
+ <z:index name="oai_setspec" type="0">jtrs</z:index>
+ <z:index name="dc_all" type="w">
+ <z:index name="dc_title" type="w">Proceedings of the 4th
+ International Conference and Exhibition:
+ World Congress on Superconductivity - Volume I</z:index>
+ <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
+ Burnham, Editors</z:index>
+ </z:index>
+ </z:record>
- This means the following: From the original XML file
- one-record.xml (or from the XML record DOM of the
- same form coming from a splitted input file), the indexing
- stylesheet produces an indexing XML record, which is defined by
+ This means the following: From the original &acro.xml; file
+ one-record.xml (or from the &acro.xml; record &acro.dom; of the
+ same form coming from a split input file), the indexing
+ stylesheet produces an indexing &acro.xml; record, which is defined by
the record element in the magic namespace
xmlns:z="http://indexdata.dk/zebra/xslt/1".
- Zebra uses the content of
+ &zebra; uses the content of
z:id="oai:JTRS:CP-3290---Volume-I" as internal
- record ID, and - in case static ranking is set - the content of
+ record ID, and - in case static ranking is set - the content of
z:rank="47896" as static rank. Following the
- discussion in XXX we see that this records is internally ordered
+ discussion in
+ we see that this record is internally ordered
lexicographically according to the value of the string
oai:JTRS:CP-3290---Volume-I47896.
- The type of action performed during indexing is defined by
+
- Then the following literal indexes are constructed:
+ In this example, the following literal indexes are constructed:
- oai:identifier
- oai:datestamp
- oai:setspec
- dc:all
- dc:title
- dc:creator
+ oai_identifier
+ oai_datestamp
+ oai_setspec
+ dc_all
+ dc_title
+ dc_creator
- where the indexing type is defined in the
- type attribute (any value from the standard config
- filedefault.idx will do). Finally, any
+ where the indexing type is defined in the
+ type attribute
+ (any value from the standard configuration
+ file default.idx will do). Finally, any
text() node content recursively contained
inside the index will be filtered through the
- appropriate charmap for character normalization, and will be
+ appropriate char map for character normalization, and will be
inserted in the index.
- Notice that there are no .abs,
- .est, .map, or other GRS-1
- filter configuration files involves in this process. Notice also,
- that the names and types of the indexes can be defined in the
- indexing XSLT stylesheet dynamically according to
- content in the original XML records, which has
- oppertunities for great power and great disaster.
+ Specific to this example, we see that the single word
+ oai:JTRS:CP-3290---Volume-I will be inserted literally,
+ byte for byte without any form of character normalization,
+ into the index named oai_identifier.
+ The text
+ Kumar Krishen and *Calvin Burnham, Editors
+ will be inserted using the w character
+ normalization defined in default.idx into
+ the index dc_creator (that is, after character
+ normalization the index will keep the individual words
+ kumar, krishen,
+ and, calvin,
+ burnham, and editors).
+ Finally, both the texts
+ Proceedings of the 4th International Conference and Exhibition:
+ World Congress on Superconductivity - Volume I
+ and
+ Kumar Krishen and *Calvin Burnham, Editors
+ will be inserted into the index dc_all using
+ the same character normalization map w.
+
+
+ Finally, this example configuration can be queried using &acro.pqf;
+ queries, either transported by &acro.z3950;, (here using a yaz-client)
+
+ open localhost:9999
+ Z> elem dc
+ Z> form xml
+ Z>
+ Z> f @attr 1=dc_creator Kumar
+ Z> scan @attr 1=dc_creator adam
+ Z>
+ Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
+ Z> scan @attr 1=dc_title abc
+ ]]>
+
+ or the proprietary
+ extensions x-pquery and
+ x-pScanClause to
+ &acro.sru;, and &acro.srw;
+
+
+
+ See for more information on &acro.sru;/&acro.srw;
+ configuration, and or the &yaz;
+ &acro.cql; section
+ for the details of the &yaz; frontend server.
+
+
+ Notice that there are no *.abs,
+ *.est, *.map, or other &acro.grs1;
+ filter configuration files involved in this process, and that the
+ literal index names are used during search and retrieval.
-
-
+
+
-
+
ALVIS Record Model Configuration
-
- ALVIS Indexing Configuration
- FIXME
+
+ ALVIS Indexing Configuration
+
+ As mentioned above, there can be only one indexing
+ stylesheet, and configuring the indexing process amounts to
+ writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the
+ magic elements discussed in
+ .
+ Obviously, there are millions of different ways to accomplish this
+ task, and some comments and code snippets are in order to lead
+ our Padawans on the right track to the good side of the force.
- FIXME
+
+ Stylesheets can be written in the pull or
+ the push style: pull
+ means that the output &acro.xml; structure is taken as starting point of
+ the internal structure of the &acro.xslt; stylesheet, and portions of
+ the input &acro.xml; are pulled out and inserted
+ into the right spots of the output &acro.xml; structure. On the other
+ hand, push &acro.xslt; stylesheets recursively
+ call their template definitions, a process which is driven
+ by the input &acro.xml; structure, and are triggered to produce some output &acro.xml;
+ whenever some special conditions in the input &acro.xml; are
+ met. The pull type is well-suited for input
+ &acro.xml; with strong and well-defined structure and semantics, like the
+ following &acro.oai; indexing example, whereas the
+ push type might be the only possible way to
+ sort out deeply recursive input &acro.xml; formats.
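+
+ The push style described above can be sketched as follows. This is a
+ minimal illustration, not a stylesheet shipped with &zebra;: the input
+ element name title and the id attribute on
+ the document element are hypothetical, and only the z:record and
+ z:index elements in the magic namespace come from this chapter.
+
+ ```xml
+ <?xml version="1.0" encoding="UTF-8"?>
+ <xsl:stylesheet version="1.0"
+     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+     xmlns:z="http://indexdata.dk/zebra/xslt/1">
+   <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
+
+   <!-- Override the built-in rule that would copy every text node -->
+   <xsl:template match="text()"/>
+
+   <!-- The document element opens the indexing record; the input
+        structure then drives which templates fire.
+        The id attribute is a hypothetical record identifier. -->
+   <xsl:template match="/*">
+     <z:record z:id="{@id}">
+       <xsl:apply-templates/>
+     </z:record>
+   </xsl:template>
+
+   <!-- Fires wherever a (hypothetical) title element occurs,
+        at any nesting depth -->
+   <xsl:template match="title">
+     <z:index name="title" type="w">
+       <xsl:value-of select="."/>
+     </z:index>
+     <xsl:apply-templates/>
+   </xsl:template>
+ </xsl:stylesheet>
+ ```
+
+ Elements without a matching template fall through to the built-in
+ &acro.xslt; rule, which simply recurses into their children, so the
+ recursion needs no explicit driver.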
- FIXME
+
+ A pull stylesheet example used to index
+ &acro.oai; harvested records could use some of the following template
+ definitions:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ]]>
+
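+ A minimal pull-style sketch of such template definitions might look
+ as follows. The index names are those of the example indexing record
+ shown earlier in this chapter; the input paths assume &acro.oai; records
+ carrying oai_dc metadata and are assumptions about the input, not
+ taken from the distributed stylesheet. Static ranking
+ (z:rank) is left out.
+
+ ```xml
+ <?xml version="1.0" encoding="UTF-8"?>
+ <xsl:stylesheet version="1.0"
+     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+     xmlns:z="http://indexdata.dk/zebra/xslt/1"
+     xmlns:oai="http://www.openarchives.org/OAI/2.0/"
+     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
+     xmlns:dc="http://purl.org/dc/elements/1.1/"
+     exclude-result-prefixes="oai oai_dc dc">
+   <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
+
+   <!-- Override the built-in rule that would copy every text node -->
+   <xsl:template match="text()"/>
+
+   <!-- The output structure is laid out here; values are pulled
+        from the input record into the right spots -->
+   <xsl:template match="oai:record">
+     <z:record z:id="{normalize-space(oai:header/oai:identifier)}">
+       <z:index name="oai_identifier" type="0">
+         <xsl:value-of select="oai:header/oai:identifier"/>
+       </z:index>
+       <z:index name="oai_datestamp" type="0">
+         <xsl:value-of select="oai:header/oai:datestamp"/>
+       </z:index>
+       <xsl:for-each select="oai:header/oai:setSpec">
+         <z:index name="oai_setspec" type="0">
+           <xsl:value-of select="."/>
+         </z:index>
+       </xsl:for-each>
+       <!-- Nested indexes: text lands in dc_all as well as in the
+            inner dc_title / dc_creator index -->
+       <z:index name="dc_all" type="w">
+         <xsl:for-each select="oai:metadata/oai_dc:dc/dc:title">
+           <z:index name="dc_title" type="w">
+             <xsl:value-of select="."/>
+           </z:index>
+         </xsl:for-each>
+         <xsl:for-each select="oai:metadata/oai_dc:dc/dc:creator">
+           <z:index name="dc_creator" type="w">
+             <xsl:value-of select="."/>
+           </z:index>
+         </xsl:for-each>
+       </z:index>
+     </z:record>
+   </xsl:template>
+ </xsl:stylesheet>
+ ```
+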
-
-
-
- ALVIS Exchange Formats
- FIXME
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ]]>
+
+ Don't be tempted to cross
+ the line to the dark side of the force, Padawan; this leads
+ to suffering and pain, and universal
+ disintegration of your project schedule.
+
+
-
- indicates that a new record with given id and static rank has to be updated.
+
+ ALVIS Exchange Formats
+
+ An exchange format can be anything which can be the outcome of an
+ &acro.xslt; transformation, as long as the stylesheet is registered in
+ the main Alvis &acro.xslt; filter configuration file, see
+ .
+ In principle anything that can be expressed in &acro.xml;, HTML, and
+ TEXT can be the output of a schema or
+ element set directive during search, as long as
+ the information comes from the
+ original input record &acro.xml; &acro.dom; tree
+ (and not the transformed and indexed &acro.xml;!!).
+
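+ The simplest exchange format is the unchanged input record, as suggested
+ by the identity schema registered in the example
+ filter configuration. The contents of xsl/identity.xsl
+ are not shown in this chapter; the standard &acro.xslt; identity transform,
+ given here as an assumption about what such a stylesheet contains, would be:
+
+ ```xml
+ <?xml version="1.0" encoding="UTF-8"?>
+ <xsl:stylesheet version="1.0"
+     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
+   <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
+
+   <!-- Copy every attribute and node of the record as it is -->
+   <xsl:template match="@*|node()">
+     <xsl:copy>
+       <xsl:apply-templates select="@*|node()"/>
+     </xsl:copy>
+   </xsl:template>
+ </xsl:stylesheet>
+ ```
+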
+
+ In addition, internal administrative information from the &zebra;
+ indexer can be accessed during record retrieval. The following
+ example is a summary of the possibilities:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ]]>
+
+
-
- encloses all the text/XML which shall be indexed in the index named
- "title" and of index type "w" (see file default.idx in your zebra
- installation)
+
+
+ ALVIS Filter &acro.oai; Indexing Example
+
+ The source code tarball contains a working Alvis filter example in
+ the directory examples/alvis-oai/, which
+ should get you started.
+
+
+ More example data can be harvested from any &acro.oai; compliant server,
+ see details at the &acro.oai;
+
+ http://www.openarchives.org/ web site, and the community
+ links at
+
+ http://www.openarchives.org/community/index.html.
+ There is a tutorial
+ found at
+
+ http://www.oaforum.org/tutorial/.
+
+
-
+
-
--->
+
@@ -275,7 +462,7 @@ c) Main "alvis" XSLT filter config file:
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
- sgml-parent-document: "zebra.xml"
+ sgml-parent-document: "idzebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
End: