Added clean-local sections to fix make distcheck

[idzebra-moved-to-github.git] / doc / recordmodel-alvisxslt.xml
diff --git a/doc/recordmodel-alvisxslt.xml b/doc/recordmodel-alvisxslt.xml

index 764190d..48823e4 100644 (file)
--- a/doc/recordmodel-alvisxslt.xml
+++ b/doc/recordmodel-alvisxslt.xml
@@ -1,126 +1,458 @@
- <chapter id="record-model-alvisxslt">
-  <!-- $Id: recordmodel-alvisxslt.xml,v 1.1 2006-02-15 11:07:47 marc Exp $ -->
-  <title>ALVIS XML Record Model and Filter Module</title>
-  
+<chapter id="record-model-alvisxslt">
+  <!-- $Id: recordmodel-alvisxslt.xml,v 1.19 2007-05-24 13:44:09 adam Exp $ -->
+  <title>ALVIS &acro.xml; Record Model and Filter Module</title>
+
+     <warning>
+      <para>
+        The functionality of this record model has been improved and
+        replaced by the DOM &acro.xml; record model, see 
+        <xref linkend="record-model-domxml"/>.  The Alvis &acro.xml; record
+        model is considered obsolete, and will eventually be removed
+        from future releases of the &zebra; software. 
+      </para>
+     </warning>  
  
    <para>
     The record model described in this chapter applies to the fundamental,
  
    <para>
     The record model described in this chapter applies to the fundamental,
-   structured XML
+   structured &acro.xml;
     record type <literal>alvis</literal>, introduced in
     record type <literal>alvis</literal>, introduced in
-   <xref linkend="componentmodulesalvis"/>. The ALVIS XML record model
-   is experimental, and it's inner workings might change in future
-   releases of the Zebra Information Server.
+   <xref linkend="componentmodulesalvis"/>.
    </para>
  
    </para>
  
-  
-
+  <para> This filter has been developed under the 
+   <ulink url="http://www.alvis.info/">ALVIS</ulink> project funded by
+   the European Community under the "Information Society Technologies"
+   Program (2002-2006).
+  </para>
    
    
    
    
-  <sect1 id="record-model-alvisxslt-filter">
-   <title>ALLVIS Record Filter</title>
+  <section id="record-model-alvisxslt-filter">
+   <title>ALVIS Record Filter</title>
     <para>
     <para>
-    The experimental, loadable  Alvis XM/XSLT filter module
+    The experimental, loadable  Alvis &acro.xml;/&acro.xslt; filter module
     <literal>mod-alvis.so</literal> is packaged in the GNU/Debian package
      <literal>libidzebra1.4-mod-alvis</literal>.
     <literal>mod-alvis.so</literal> is packaged in the GNU/Debian package
      <literal>libidzebra1.4-mod-alvis</literal>.
+    It is invoked by the <filename>zebra.cfg</filename> configuration statement
+    <screen>
+     recordtype.xml: alvis.db/filter_alvis_conf.xml
+    </screen>
+    In this example on all data files with suffix 
+    <filename>*.xml</filename>, where the
+    Alvis &acro.xslt; filter configuration file is found in the
+    path <filename>db/filter_alvis_conf.xml</filename>.
+   </para>
+   <para>The Alvis &acro.xslt; filter configuration file must be
+    valid &acro.xml;. It might look like this (This example is
+    used for indexing and display of &acro.oai; harvested records):
+    <screen>
+    &lt;?xml version="1.0" encoding="UTF-8"?&gt;
+      &lt;schemaInfo&gt;
+        &lt;schema name="identity" stylesheet="xsl/identity.xsl" /&gt;
+        &lt;schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
+            stylesheet="xsl/oai2index.xsl" /&gt;
+        &lt;schema name="dc" stylesheet="xsl/oai2dc.xsl" /&gt;
+        &lt;!-- use split level 2 when indexing whole &acro.oai; Record lists --&gt;
+        &lt;split level="2"/&gt;
+      &lt;/schemaInfo&gt;
+    </screen> 
+   </para>
+   <para>
+    All named stylesheets defined inside
+    <literal>schema</literal> element tags 
+    are for presentation after search, including
+    the indexing stylesheet (which is a great debugging help). The
+    names defined in the <literal>name</literal> attributes must be
+    unique, these are the literal <literal>schema</literal> or 
+    <literal>element set</literal> names used in 
+      <ulink url="http://www.loc.gov/standards/sru/srw/">&acro.srw;</ulink>,
+      <ulink url="&url.sru;">&acro.sru;</ulink> and
+    &acro.z3950; protocol queries.
+    The paths in the <literal>stylesheet</literal> attributes
+    are relative to zebras working directory, or absolute to file
+    system root.
+   </para>
+   <para>
+    The <literal>&lt;split level="2"/&gt;</literal> decides where the
+    &acro.xml; Reader shall split the
+    collections of records into individual records, which then are
+    loaded into &acro.dom;, and have the indexing &acro.xslt; stylesheet applied.
+   </para>
+   <para>
+    There must be exactly one indexing &acro.xslt; stylesheet, which is
+    defined by the magic attribute  
+    <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
     </para>
  
     </para>
  
-   <sect2 id="record-model-alvisxslt-internal">
-    <title>ALLVIS Internal Record Representation</title>   
-    <para>FIXME</para>
-   </sect2>
-
-   <sect2 id="record-model-alvisxslt-canonical">
-    <title>ALLVIS Canonical Format</title>   
-    <para>FIXME</para>
-   </sect2>
-
-
-  </sect1>
-
-
-  <sect1 id="record-model-alvisxslt-conf">
-   <title>ALLVIS Record Model Configuration</title>
-   <para>FIXME</para>
-
-
-
-  <sect2 id="record-model-alvisxslt-exchange">
+   <section id="record-model-alvisxslt-internal">
+    <title>ALVIS Internal Record Representation</title>   
+    <para>When indexing, an &acro.xml; Reader is invoked to split the input
+    files into suitable record &acro.xml; pieces. Each record piece is then
+    transformed to an &acro.xml; &acro.dom; structure, which is essentially the
+    record model. Only &acro.xslt; transformations can be applied during
+    index, search and retrieval. Consequently, output formats are
+    restricted to whatever &acro.xslt; can deliver from the record &acro.xml;
+    structure, be it other &acro.xml; formats, HTML, or plain text. In case
+    you have <literal>libxslt1</literal> running with E&acro.xslt; support,
+    you can use this functionality inside the Alvis
+    filter configuration &acro.xslt; stylesheets.
+    </para>
+   </section>
+
+   <section id="record-model-alvisxslt-canonical">
+    <title>ALVIS Canonical Indexing Format</title>   
+    <para>The output of the indexing &acro.xslt; stylesheets must contain
+    certain elements in the magic 
+     <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
+    namespace. The output of the &acro.xslt; indexing transformation is then
+    parsed using &acro.dom; methods, and the contained instructions are
+    performed on the <emphasis>magic elements and their
+    subtrees</emphasis>.
+    </para>
+    <para>
+    For example, the output of the command
+     <screen>  
+      xsltproc xsl/oai2index.xsl one-record.xml
+     </screen> 
+     might look like this:
+     <screen>
+      &lt;?xml version="1.0" encoding="UTF-8"?&gt;
+      &lt;z:record xmlns:z="http://indexdata.dk/zebra/xslt/1" 
+           z:id="oai:JTRS:CP-3290---Volume-I" 
+           z:rank="47896"&gt;
+       &lt;z:index name="oai_identifier" type="0"&gt;
+                oai:JTRS:CP-3290---Volume-I&lt;/z:index&gt;
+       &lt;z:index name="oai_datestamp" type="0"&gt;2004-07-09&lt;/z:index&gt;
+       &lt;z:index name="oai_setspec" type="0"&gt;jtrs&lt;/z:index&gt;
+       &lt;z:index name="dc_all" type="w"&gt;
+          &lt;z:index name="dc_title" type="w"&gt;Proceedings of the 4th 
+                International Conference and Exhibition:
+                World Congress on Superconductivity - Volume I&lt;/z:index&gt;
+          &lt;z:index name="dc_creator" type="w"&gt;Kumar Krishen and *Calvin
+                Burnham, Editors&lt;/z:index&gt;
+       &lt;/z:index&gt;
+     &lt;/z:record&gt;
+     </screen>
+    </para>
+    <para>This means the following: From the original &acro.xml; file 
+     <literal>one-record.xml</literal> (or from the &acro.xml; record &acro.dom; of the
+     same form coming from a split input file), the indexing
+     stylesheet produces an indexing &acro.xml; record, which is defined by
+     the <literal>record</literal> element in the magic namespace
+     <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
+     &zebra; uses the content of 
+     <literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
+     record ID, and - in case static ranking is set - the content of 
+     <literal>z:rank="47896"</literal> as static rank. Following the
+     discussion in <xref linkend="administration-ranking"/>
+     we see that this records is internally ordered
+     lexicographically according to the value of the string
+     <literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
+     <!-- The type of action performed during indexing is defined by
+     <literal>z:type="update"&gt;</literal>, with recognized values
+     <literal>insert</literal>, <literal>update</literal>, and 
+     <literal>delete</literal>. --> 
+    </para>
+    <para>In this example, the following literal indexes are constructed:
+     <screen>
+       oai_identifier
+       oai_datestamp
+       oai_setspec
+       dc_all
+       dc_title
+       dc_creator
+     </screen>
+     where the indexing type is defined in the 
+     <literal>type</literal> attribute 
+     (any value from the standard configuration
+     file <filename>default.idx</filename> will do). Finally, any 
+     <literal>text()</literal> node content recursively contained
+     inside the <literal>index</literal> will be filtered through the
+     appropriate char map for character normalization, and will be
+     inserted in the index.
+    </para>
+    <para>
+     Specific to this example, we see that the single word
+     <literal>oai:JTRS:CP-3290---Volume-I</literal> will be literal,
+     byte for byte without any form of character normalization,
+     inserted into the index named <literal>oai:identifier</literal>,
+     the text 
+     <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
+     will be inserted using the <literal>w</literal> character
+     normalization defined in <filename>default.idx</filename> into
+     the index <literal>dc:creator</literal> (that is, after character
+     normalization the index will keep the individual words 
+     <literal>kumar</literal>, <literal>krishen</literal>, 
+     <literal>and</literal>, <literal>calvin</literal>,
+     <literal>burnham</literal>, and <literal>editors</literal>), and
+     finally both the texts
+     <literal>Proceedings of the 4th International Conference and Exhibition:
+      World Congress on Superconductivity - Volume I</literal> 
+     and
+     <literal>Kumar Krishen and *Calvin Burnham, Editors</literal> 
+     will be inserted into the index <literal>dc:all</literal> using
+     the same character normalization map <literal>w</literal>. 
+    </para>
+    <para>
+     Finally, this example configuration can be queried using &acro.pqf;
+     queries, either transported by &acro.z3950;, (here using a yaz-client) 
+     <screen>
+      <![CDATA[
+      Z> open localhost:9999
+      Z> elem dc
+      Z> form xml
+      Z>
+      Z> f @attr 1=dc_creator Kumar
+      Z> scan @attr 1=dc_creator adam
+      Z>
+      Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
+      Z> scan @attr 1=dc_title abc
+      ]]>
+     </screen>
+     or the proprietary
+     extensions <literal>x-pquery</literal> and
+     <literal>x-pScanClause</literal> to
+     &acro.sru;, and &acro.srw;
+     <screen>
+      <![CDATA[
+      http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
+      http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
+      ]]>
+     </screen>
+     See <xref linkend="zebrasrv-sru"/> for more information on &acro.sru;/&acro.srw;
+     configuration, and <xref linkend="gfs-config"/> or the &yaz;
+     <ulink url="&url.yaz.cql;">&acro.cql; section</ulink>
+     for the details or the &yaz; frontend server.
+    </para>
+    <para>
+     Notice that there are no <filename>*.abs</filename>,
+     <filename>*.est</filename>, <filename>*.map</filename>, or other &acro.grs1;
+     filter configuration files involves in this process, and that the
+     literal index names are used during search and retrieval.
+    </para>
+   </section>
+  </section>
+
+
+  <section id="record-model-alvisxslt-conf">
+   <title>ALVIS Record Model Configuration</title>
+
+
+  <section id="record-model-alvisxslt-index">
+   <title>ALVIS Indexing Configuration</title>
+    <para>
+     As mentioned above, there can be only one indexing
+     stylesheet, and configuration of the indexing process is a synonym
+     of writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the
+     magic elements discussed in  
+     <xref linkend="record-model-alvisxslt-internal"/>. 
+     Obviously, there are million of different ways to accomplish this
+     task, and some comments and code snippets are in order to lead
+     our Padawan's on the right track to the  good side of the force.
+    </para>
+    <para>
+     Stylesheets can be written in the <emphasis>pull</emphasis> or
+     the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
+     means that the output &acro.xml; structure is taken as starting point of
+     the internal structure of the &acro.xslt; stylesheet, and portions of
+     the input &acro.xml; are <emphasis>pulled</emphasis> out and inserted
+     into the right spots of the output &acro.xml; structure. On the other
+     side, <emphasis>push</emphasis> &acro.xslt; stylesheets are recursively
+     calling their template definitions, a process which is commanded
+     by the input &acro.xml; structure, and are triggered to produce some output &acro.xml;
+     whenever some special conditions in the input stylesheets are
+     met. The <emphasis>pull</emphasis> type is well-suited for input
+     &acro.xml; with strong and well-defined structure and semantics, like the
+     following &acro.oai; indexing example, whereas the
+     <emphasis>push</emphasis> type might be the only possible way to
+     sort out deeply recursive input &acro.xml; formats.
+    </para>
+    <para> 
+     A <emphasis>pull</emphasis> stylesheet example used to index
+     &acro.oai; harvested records could use some of the following template
+     definitions:
+     <screen>
+      <![CDATA[
+      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+       xmlns:z="http://indexdata.dk/zebra/xslt/1"
+       xmlns:oai="http://www.openarchives.org/&acro.oai;/2.0/" 
+       xmlns:oai_dc="http://www.openarchives.org/&acro.oai;/2.0/oai_dc/" 
+       xmlns:dc="http://purl.org/dc/elements/1.1/"
+       version="1.0">
+
+       <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
+
+        <!-- disable all default text node output -->
+        <xsl:template match="text()"/>
+
+         <!-- match on oai xml record root -->
+         <xsl:template match="/">    
+          <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
+           <!-- you might want to use z:rank="{some &acro.xslt; function here}" --> 
+           <xsl:apply-templates/>
+          </z:record>
+         </xsl:template>
+
+         <!-- &acro.oai; indexing templates -->
+         <xsl:template match="oai:record/oai:header/oai:identifier">
+          <z:index name="oai_identifier" type="0">
+           <xsl:value-of select="."/>
+          </z:index>    
+         </xsl:template>
+
+         <!-- etc, etc -->
+
+         <!-- DC specific indexing templates -->
+         <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
+          <z:index name="dc_title" type="w">
+           <xsl:value-of select="."/>
+          </z:index>
+         </xsl:template>
+
+         <!-- etc, etc -->
+ 
+      </xsl:stylesheet>
+      ]]>
+     </screen>
+    </para>
+    <para>
+     Notice also,
+     that the names and types of the indexes can be defined in the
+     indexing &acro.xslt; stylesheet <emphasis>dynamically according to
+     content in the original &acro.xml; records</emphasis>, which has
+     opportunities for great power and wizardry as well as grande
+     disaster.  
+    </para>
+    <para>
+     The following excerpt of a <emphasis>push</emphasis> stylesheet
+     <emphasis>might</emphasis> 
+     be a good idea according to your strict control of the &acro.xml;
+     input format (due to rigorous checking against well-defined and
+     tight RelaxNG or &acro.xml; Schema's, for example):
+     <screen>
+      <![CDATA[
+      <xsl:template name="element-name-indexes">     
+       <z:index name="{name()}" type="w">
+        <xsl:value-of select="'1'"/>
+       </z:index>
+      </xsl:template>
+      ]]>
+     </screen>
+     This template creates indexes which have the name of the working 
+     node of any input  &acro.xml; file, and assigns a '1' to the index.
+     The example query 
+     <literal>find @attr 1=xyz 1</literal> 
+     finds all files which contain at least one
+     <literal>xyz</literal> &acro.xml; element. In case you can not control
+     which element names the input files contain, you might ask for
+     disaster and bad karma using this technique.
+    </para>
+    <para>
+     One variation over the theme <emphasis>dynamically created
+     indexes</emphasis> will definitely be unwise:
+     <screen>
+      <![CDATA[  
+      <!-- match on oai xml record root -->
+      <xsl:template match="/">    
+       <z:record>
+      
+        <!-- create dynamic index name from input content --> 
+        <xsl:variable name="dynamic_content">
+         <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
+        </xsl:variable>
+        
+        <!-- create zillions of indexes with unknown names -->
+        <z:index name="{$dynamic_content}" type="w">
+         <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
+        </z:index>          
+       </z:record>
+       
+      </xsl:template>
+      ]]>
+     </screen>
+     Don't be tempted to cross
+     the line to the dark side of the force, Padawan; this leads
+     to suffering and pain, and universal
+     disintegration of your project schedule.
+    </para>
+  </section>
+
+  <section id="record-model-alvisxslt-elementset">
     <title>ALVIS Exchange Formats</title>
     <title>ALVIS Exchange Formats</title>
-   <para>FIXME</para>
-  </sect2>
-
-  </sect1>
+   <para>
+     An exchange format can be anything which can be the outcome of an
+     &acro.xslt; transformation, as far as the stylesheet is registered in
+     the main Alvis &acro.xslt; filter configuration file, see
+     <xref linkend="record-model-alvisxslt-filter"/>.
+     In principle anything that can be expressed in  &acro.xml;, HTML, and
+     TEXT can be the output of a <literal>schema</literal> or 
+    <literal>element set</literal> directive during search, as long as
+     the information comes from the 
+     <emphasis>original input record &acro.xml; &acro.dom; tree</emphasis>
+     (and not the transformed and <emphasis>indexed</emphasis> &acro.xml;!!).
+    </para>
+    <para>
+     In addition, internal administrative information from the &zebra;
+     indexer can be accessed during record retrieval. The following
+     example is a summary of the possibilities:
+     <screen>
+      <![CDATA[  
+      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+       xmlns:z="http://indexdata.dk/zebra/xslt/1"
+       version="1.0">
+
+       <!-- register internal zebra parameters -->       
+       <xsl:param name="id" select="''"/>
+       <xsl:param name="filename" select="''"/>
+       <xsl:param name="score" select="''"/>
+       <xsl:param name="schema" select="''"/>
+           
+       <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
+
+       <!-- use then for display of internal information -->
+       <xsl:template match="/">
+         <z:zebra>
+           <id><xsl:value-of select="$id"/></id>
+           <filename><xsl:value-of select="$filename"/></filename>
+           <score><xsl:value-of select="$score"/></score>
+           <schema><xsl:value-of select="$schema"/></schema>
+         </z:zebra>
+       </xsl:template>
+
+      </xsl:stylesheet>
+      ]]>
+     </screen>
+    </para>
+
+  </section>
+
+  <section id="record-model-alvisxslt-example">
+   <title>ALVIS Filter &acro.oai; Indexing Example</title>
+   <para>
+     The source code tarball contains a working Alvis filter example in
+     the directory <filename>examples/alvis-oai/</filename>, which
+     should get you started.  
+    </para>
+    <para>
+     More example data can be harvested from any &acro.oai; compliant server,
+     see details at the  &acro.oai; 
+     <ulink url="http://www.openarchives.org/">
+      http://www.openarchives.org/</ulink> web site, and the community
+      links at 
+     <ulink url="http://www.openarchives.org/community/index.html">
+      http://www.openarchives.org/community/index.html</ulink>.
+     There is a  tutorial
+     found at
+     <ulink url="http://www.oaforum.org/tutorial/">
+      http://www.oaforum.org/tutorial/</ulink>.
+    </para>
+   </section>
+
+  </section>
  
    
   </chapter>
  
  
  
    
   </chapter>
  
  
-<!--
-
-c)  Main "alvis" XSLT filter config file:
-  cat db/filter_alvis_conf.xml 
-
-  <?xml version="1.0" encoding="UTF8"?>
-  <schemaInfo>
-    <schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
-    <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
-            stylesheet="db/alvis2index.xsl" />
-    <schema name="dc" stylesheet="db/alvis2dc.xsl" />
-    <schema name="dc-short" stylesheet="db/alvis2dc_short.xsl" />
-    <schema name="snippet" snippet="25" stylesheet="db/alvis2snippet.xsl" />
-    <schema name="help" stylesheet="db/alvis2help.xsl" />
-    <split level="1"/>
-  </schemaInfo>
-
-  the pathes are relative to the directory where zebra.init is placed
-  and is started up.
-
-  The split level decides where the SAX parser shall split the
-  collections of records into individual records, which then are
-  loaded into DOM, and have the indexing XSLT stylesheet applied.
-
-  The indexing stylesheet is found by it's identifier.
-
-  All the other stylesheets are for presentation after search.
-
-- in data/ a short sample of harvested carnivorous plants
-  ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
-
-- in root also one single data record - nice for testing the xslt
-  stylesheets,
-  
-  xsltproc db/alvis2index.xsl carni*.xml
-
-  and so on.
-
-- in db/ a cql2pqf.txt yaz-client config file 
-  which is also used in the yaz-server <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF process
-
-   see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
-
-- in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT thing,
-  as it constructs the new XML structure by pulling data out of the
-  respective elements/attributes of the old structure.
-
-  Notice the special zebra namespace, and the special elements in this
-  namespace which indicate to the zebra indexer what to do.
-
-  <z:record id="67ht7" rank="675" type="update">
-  indicates that a new record with given id and static rank has to be updated. 
-
-  <z:index name="title" type="w">
-   encloses all the text/XML which shall be indexed in the index named
-   "title" and of index type "w" (see  file default.idx in your zebra
-   installation) 
-
-
-   </para>
-
-   <para>
--->
-
-
-
  
   <!-- Keep this comment at the end of the file
   Local variables:
  
   <!-- Keep this comment at the end of the file
   Local variables: