doc/recordmodel-alvisxslt.xml

   1  <chapter id="record-model-alvisxslt">
   2   <!-- $Id: recordmodel-alvisxslt.xml,v 1.7 2006-04-25 12:26:26 marc Exp $ -->
   3   <title>ALVIS XML Record Model and Filter Module</title>
   4
   5
   6   <para>
   7    The record model described in this chapter applies to the fundamental,
   8    structured XML
   9    record type <literal>alvis</literal>, introduced in
  10    <xref linkend="componentmodulesalvis"/>. The ALVIS XML record model
  11    is experimental, and its inner workings might change in future
  12    releases of the Zebra Information Server.
  13   </para>
  14
  15   <para> This filter has been developed under the
  16    <ulink url="http://www.alvis.info/">ALVIS</ulink> project funded by
  17    the European Community under the "Information Society Technologies"
  18    Program (2002-2006).
  19   </para>
  20
  21
  22   <sect1 id="record-model-alvisxslt-filter">
  23    <title>ALVIS Record Filter</title>
  24    <para>
  25     The experimental, loadable Alvis XML/XSLT filter module
  26     <literal>mod-alvis.so</literal> is packaged in the Debian GNU/Linux package
  27     <literal>libidzebra1.4-mod-alvis</literal>.
  28     It is invoked by the <filename>zebra.cfg</filename> configuration statement
  29     <screen>
  30      recordtype.xml: alvis.db/filter_alvis_conf.xml
  31     </screen>
  32     In this example the filter is applied to all data files with suffix
  33     <filename>*.xml</filename>, and the
  34     Alvis XSLT filter configuration file is found at the
  35     path <filename>db/filter_alvis_conf.xml</filename>.
  36    </para>
  37    <para>The Alvis XSLT filter configuration file must be
  38     valid XML. It might look like this (this example is
  39     used for indexing and display of OAI harvested records):
  40     <screen>
  41     &lt;?xml version="1.0" encoding="UTF-8"?&gt;
  42       &lt;schemaInfo&gt;
  43         &lt;schema name="identity" stylesheet="xsl/identity.xsl" /&gt;
  44         &lt;schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
  45             stylesheet="xsl/oai2index.xsl" /&gt;
  46         &lt;schema name="dc" stylesheet="xsl/oai2dc.xsl" /&gt;
  47         &lt;!-- use split level 2 when indexing whole OAI Record lists --&gt;
  48         &lt;split level="2"/&gt;
  49       &lt;/schemaInfo&gt;
  50     </screen>
  51    </para>
  52    <para>
  53     All named stylesheets defined inside
  54     <literal>schema</literal> element tags
  55     are for presentation after search, including
  56     the indexing stylesheet (which is a great debugging help). The
  57     names defined in the <literal>name</literal> attributes must be
  58     unique; these are the literal <literal>schema</literal> or
  59     <literal>element set</literal> names used in
  60       <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>,
  61       <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink> and
  62     Z39.50 protocol queries.
  63     The paths in the <literal>stylesheet</literal> attributes
  64     are relative to Zebra's working directory, or absolute from the file
  65     system root.
  66    </para>
  67    <para>
  68     The <literal>&lt;split level="2"/&gt;</literal> element decides where the
  69     XML Reader splits the
  70     collections of records into individual records, which are then
  71     loaded into a DOM and have the indexing XSLT stylesheet applied.
  72    </para>
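   <para>
    As a sketch of what the split level means, assume that the level
    counts element nesting depth, with the document root element at
    depth 0. This reading is an assumption, based on the comment in the
    configuration example and on the fact that the indexing stylesheet
    shown later in this chapter addresses
    <literal>oai:record</literal> directly below the document root.
    In an OAI-PMH <literal>ListRecords</literal> response, the
    individual <literal>record</literal> elements then sit at depth 2:
    <screen>
     <![CDATA[
     <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">  <!-- depth 0 -->
       <responseDate>...</responseDate>
       <request>...</request>
       <ListRecords>                                         <!-- depth 1 -->
         <record>...</record>  <!-- depth 2: becomes one Zebra record -->
         <record>...</record>  <!-- depth 2: becomes another Zebra record -->
       </ListRecords>
     </OAI-PMH>
     ]]>
    </screen>
    Under that assumption, each <literal>record</literal> subtree would
    be handed to the indexing stylesheet as a separate document, which
    is also the form of a single-record test file passed to
    <literal>xsltproc</literal>.
   </para>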
  73    <para>
  74     There must be exactly one indexing XSLT stylesheet, which is
  75     defined by the magic attribute
  76     <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
  77    </para>
  78
  79    <sect2 id="record-model-alvisxslt-internal">
  80     <title>ALVIS Internal Record Representation</title>
  81     <para>When indexing, an XML Reader is invoked to split the input
  82     files into suitable record XML pieces. Each record piece is then
  83     transformed to an XML DOM structure, which is essentially the
  84     record model. Only XSLT transformations can be applied during
  85     indexing, search and retrieval. Consequently, output formats are
  86     restricted to whatever XSLT can deliver from the record XML
  87     structure, be it other XML formats, HTML, or plain text. In case
  88     you have <literal>libxslt1</literal> running with EXSLT support,
  89     you can use this functionality inside the Alvis
  90     filter configuration XSLT stylesheets.
  91     </para>
  92    </sect2>
  93
  94    <sect2 id="record-model-alvisxslt-canonical">
  95     <title>ALVIS Canonical Indexing Format</title>
  96     <para>The output of the indexing XSLT stylesheets must contain
  97     certain elements in the magic
  98      <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
  99     namespace. The output of the XSLT indexing transformation is then
 100     parsed using DOM methods, and the contained instructions are
 101     performed on the <emphasis>magic elements and their
 102     subtrees</emphasis>.
 103     </para>
 104     <para>
 105     For example, the output of the command
 106      <screen>
 107       xsltproc xsl/oai2index.xsl one-record.xml
 108      </screen>
 109      might look like this:
 110      <screen>
 111       &lt;?xml version="1.0" encoding="UTF-8"?&gt;
 112       &lt;z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
 113            z:id="oai:JTRS:CP-3290---Volume-I"
 114            z:rank="47896"
 115            z:type="update"&gt;
 116        &lt;z:index name="oai:identifier" type="0"&gt;
 117                 oai:JTRS:CP-3290---Volume-I&lt;/z:index&gt;
 118        &lt;z:index name="oai:datestamp" type="0"&gt;2004-07-09&lt;/z:index&gt;
 119        &lt;z:index name="oai:setspec" type="0"&gt;jtrs&lt;/z:index&gt;
 120        &lt;z:index name="dc:all" type="w"&gt;
 121           &lt;z:index name="dc:title" type="w"&gt;Proceedings of the 4th
 122                 International Conference and Exhibition:
 123                 World Congress on Superconductivity - Volume I&lt;/z:index&gt;
 124           &lt;z:index name="dc:creator" type="w"&gt;Kumar Krishen and *Calvin
 125                 Burnham, Editors&lt;/z:index&gt;
 126        &lt;/z:index&gt;
 127      &lt;/z:record&gt;
 128      </screen>
 129     </para>
 130     <para>This means the following: From the original XML file
 131      <literal>one-record.xml</literal> (or from the XML record DOM of the
 132      same form coming from a split input file), the indexing
 133      stylesheet produces an indexing XML record, which is defined by
 134      the <literal>record</literal> element in the magic namespace
 135      <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
 136      Zebra uses the content of
 137      <literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
 138      record ID, and - in case static ranking is set - the content of
 139      <literal>z:rank="47896"</literal> as static rank. Following the
 140      discussion in <xref linkend="administration-ranking"/>
 141      we see that this record is internally ordered
 142      lexicographically according to the value of the string
 143      <literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
 144      The type of action performed during indexing is defined by
 145      <literal>z:type="update"</literal>, with recognized values
 146      <literal>insert</literal>, <literal>update</literal>, and
 147      <literal>delete</literal>.
 148     </para>
 149     <para>In this example, the following literal indexes are constructed:
 150      <screen>
 151        oai:identifier
 152        oai:datestamp
 153        oai:setspec
 154        dc:all
 155        dc:title
 156        dc:creator
 157      </screen>
 158      where the indexing type is defined in the
 159      <literal>type</literal> attribute
 160      (any value from the standard configuration
 161      file <filename>default.idx</filename> will do). Finally, any
 162      <literal>text()</literal> node content recursively contained
 163      inside the <literal>index</literal> element will be filtered through the
 164      appropriate charmap for character normalization, and will be
 165      inserted in the index.
 166     </para>
 167     <para>
 168      Specific to this example, we see that the single word
 169      <literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted
 170      literally, byte for byte and without any form of character
 171      normalization, into the index named <literal>oai:identifier</literal>.
 172      The text
 173      <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
 174      will be inserted using the <literal>w</literal> character
 175      normalization defined in <filename>default.idx</filename> into
 176      the index <literal>dc:creator</literal> (that is, after character
 177      normalization the index will keep the individual words
 178      <literal>kumar</literal>, <literal>krishen</literal>,
 179      <literal>and</literal>, <literal>calvin</literal>,
 180      <literal>burnham</literal>, and <literal>editors</literal>).
 181      Finally, both the texts
 182      <literal>Proceedings of the 4th International Conference and Exhibition:
 183       World Congress on Superconductivity - Volume I</literal>
 184      and
 185      <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
 186      will be inserted into the index <literal>dc:all</literal> using
 187      the same character normalization map <literal>w</literal>.
 188     </para>
 189     <para>
 190      Finally, this example configuration can be queried using PQF
 191      queries, either transported by Z39.50 (here using yaz-client)
 192      <screen>
 193       <![CDATA[
 194       Z> open localhost:9999
 195       Z> elem dc
 196       Z> form xml
 197       Z>
 198       Z> f @attr 1=dc:creator Kumar
 199       Z> scan @attr 1=dc:creator adam
 200       Z>
 201       Z> f @attr 1=dc:title @attr 4=2 "proceeding congress superconductivity"
 202       Z> scan @attr 1=dc:title abc
 203       ]]>
 204      </screen>
 205      or by the proprietary
 206      extensions <literal>x-pquery</literal> and
 207      <literal>x-pScanClause</literal> to
 208      SRU and SRW
 209      <screen>
 210       <![CDATA[
 211       http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc%3Acreator+%40attr+4%3D6+%22the
 212       http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc:date+@attr+4=2+a
 213       ]]>
 214      </screen>
 215      See <xref linkend="server-sru"/> for more information on SRU/SRW
 216      configuration, and <xref linkend="gfs-config"/> or
 217      <ulink url="http://www.indexdata.dk/yaz/doc/tools.tkl#tools.cql">
 218       the YAZ manual CQL section</ulink>
 219      for the details
 220      of the YAZ frontend server
 221      <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>
 222      configuration.
 223     </para>
 224     <para>
 225      Notice that there are no <filename>*.abs</filename>,
 226      <filename>*.est</filename>, <filename>*.map</filename>, or other GRS-1
 227      filter configuration files involved in this process, and that the
 228      literal index names are used during search and retrieval.
 229     </para>
 230    </sect2>
 231   </sect1>
 232
 233
 234   <sect1 id="record-model-alvisxslt-conf">
 235    <title>ALVIS Record Model Configuration</title>
 236
 237
 238   <sect2 id="record-model-alvisxslt-index">
 239    <title>ALVIS Indexing Configuration</title>
 240     <para>
 241      As mentioned above, there can be only one indexing
 242      stylesheet, and configuring the indexing process is synonymous
 243      with writing an XSLT stylesheet which produces XML output containing the
 244      magic elements discussed in
 245      <xref linkend="record-model-alvisxslt-canonical"/>.
 246      Obviously, there are millions of different ways to accomplish this
 247      task, and some comments and code snippets are in order to lead
 248      our padawans on the right track to the good side of the force.
 249     </para>
 250     <para>
 251      Stylesheets can be written in the <emphasis>pull</emphasis> or
 252      the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
 253      means that the output XML structure is taken as starting point of
 254      the internal structure of the XSLT stylesheet, and portions of
 255      the input XML are <emphasis>pulled</emphasis> out and inserted
 256      into the right spots of the output XML structure. On the other
 257      hand, <emphasis>push</emphasis> XSLT stylesheets recursively
 258      call their template definitions, a process which is driven
 259      by the input XML structure, and awake to produce some output XML
 260      whenever some special conditions in the input XML are
 261      met. The <emphasis>pull</emphasis> type is well-suited for input
 262      XML with strong and well-defined structure and semantics, like the
 263      following OAI indexing example, whereas the
 264      <emphasis>push</emphasis> type might be the only possible way to
 265      sort out deeply recursive input XML formats.
 266     </para>
 267     <para>
 268      A <emphasis>pull</emphasis> stylesheet example used to index
 269      OAI harvested records could use some of the following template
 270      definitions:
 271      <screen>
 272       <![CDATA[
 273       <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 274        xmlns:z="http://indexdata.dk/zebra/xslt/1"
 275        xmlns:oai="http://www.openarchives.org/OAI/2.0/"
 276        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
 277        xmlns:dc="http://purl.org/dc/elements/1.1/"
 278        version="1.0">
 279
 280        <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
 281
 282         <!-- disable all default text node output -->
 283         <xsl:template match="text()"/>
 284
 285          <!-- match on oai xml record root -->
 286          <xsl:template match="/">
 287           <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
 288            z:type="update">
 289            <!-- you might want to use z:rank="{some XSLT function here}" -->
 290            <xsl:apply-templates/>
 291           </z:record>
 292          </xsl:template>
 293
 294          <!-- OAI indexing templates -->
 295          <xsl:template match="oai:record/oai:header/oai:identifier">
 296           <z:index name="oai:identifier" type="0">
 297            <xsl:value-of select="."/>
 298           </z:index>
 299          </xsl:template>
 300
 301          <!-- etc, etc -->
 302
 303          <!-- DC specific indexing templates -->
 304          <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
 305           <z:index name="dc:title" type="w">
 306            <xsl:value-of select="."/>
 307           </z:index>
 308          </xsl:template>
 309
 310          <!-- etc, etc -->
 311
 312       </xsl:stylesheet>
 313       ]]>
 314      </screen>
 315     </para>
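    <para>
     The sample indexing output shown earlier nests the
     <literal>dc:title</literal> and <literal>dc:creator</literal>
     indexes inside a <literal>dc:all</literal> index. One way such
     nesting could be produced is a wrapper template around the
     <literal>oai_dc:dc</literal> element. The following is a minimal
     sketch under that assumption, and not necessarily how the
     stylesheet <filename>xsl/oai2index.xsl</filename> does it:
     <screen>
      <![CDATA[
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:z="http://indexdata.dk/zebra/xslt/1"
       xmlns:oai="http://www.openarchives.org/OAI/2.0/"
       xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       version="1.0">

       <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

       <!-- disable all default text node output -->
       <xsl:template match="text()"/>

       <xsl:template match="/">
        <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
         z:type="update">
         <xsl:apply-templates/>
        </z:record>
       </xsl:template>

       <!-- wrap every index created below oai_dc:dc in a dc:all index -->
       <xsl:template match="oai:record/oai:metadata/oai_dc:dc">
        <z:index name="dc:all" type="w">
         <xsl:apply-templates/>
        </z:index>
       </xsl:template>

       <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
        <z:index name="dc:title" type="w">
         <xsl:value-of select="."/>
        </z:index>
       </xsl:template>

       <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:creator">
        <z:index name="dc:creator" type="w">
         <xsl:value-of select="."/>
        </z:index>
       </xsl:template>

      </xsl:stylesheet>
      ]]>
     </screen>
     Because default text node output is disabled, only the text
     emitted by the <literal>dc:title</literal> and
     <literal>dc:creator</literal> templates ends up inside
     <literal>dc:all</literal>; other Dublin Core elements would need
     templates of their own to contribute.
    </para>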
 316     <para>
 317      Notice also
 318      that the names and types of the indexes can be defined in the
 319      indexing XSLT stylesheet <emphasis>dynamically according to
 320      content in the original XML records</emphasis>, which opens
 321      opportunities for great power and wizardry as well as grand
 322      disaster.
 323     </para>
 324     <para>
 325      The following excerpt of a <emphasis>push</emphasis> stylesheet
 326      <emphasis>might</emphasis>
 327      be a good idea, provided you have strict control of the XML
 328      input format (due to rigorous checking against well-defined and
 329      tight RelaxNG or XML Schemas, for example):
 330      <screen>
 331       <![CDATA[
 332       <xsl:template name="element-name-indexes">
 333        <z:index name="{name()}" type="w">
 334         <xsl:value-of select="'1'"/>
 335        </z:index>
 336       </xsl:template>
 337       ]]>
 338      </screen>
 339      This template creates indexes which have the name of the current
 340      node of any input XML file, and assigns a '1' to the index.
 341      The example query
 342      <literal>find @attr 1=xyz 1</literal>
 343      finds all records which contain at least one
 344      <literal>xyz</literal> XML element. In case you cannot control
 345      which element names the input files contain, you might ask for
 346      disaster and bad karma using this technique.
 347     </para>
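    <para>
     A named template only produces output when it is called. The
     following sketch shows one cautious way to drive the
     <literal>element-name-indexes</literal> template: it is called
     only for a fixed list of element names, so the set of indexes
     stays bounded whatever the input contains. The element names
     <literal>title</literal>, <literal>creator</literal>, and
     <literal>subject</literal> are hypothetical and stand for
     elements without namespace in your own input format:
     <screen>
      <![CDATA[
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:z="http://indexdata.dk/zebra/xslt/1"
       version="1.0">

       <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

       <!-- disable all default text node output -->
       <xsl:template match="text()"/>

       <xsl:template match="/">
        <z:record z:type="update">
         <xsl:apply-templates/>
        </z:record>
       </xsl:template>

       <!-- hypothetical whitelist: only these elements create indexes -->
       <xsl:template match="title | creator | subject">
        <!-- call-template keeps the current node, so name() in the
             called template is the name of the matched element -->
        <xsl:call-template name="element-name-indexes"/>
        <xsl:apply-templates/>
       </xsl:template>

       <xsl:template name="element-name-indexes">
        <z:index name="{name()}" type="w">
         <xsl:value-of select="'1'"/>
        </z:index>
       </xsl:template>

      </xsl:stylesheet>
      ]]>
     </screen>
     Replacing the whitelist pattern with <literal>match="*"</literal>
     would give the unrestricted behaviour described above, with one
     index per distinct element name found in the input.
    </para>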
 348     <para>
 349      One variation on the theme <emphasis>dynamically created
 350      indexes</emphasis> will definitely be unwise:
 351      <screen>
 352       <![CDATA[
 353       <!-- match on oai xml record root -->
 354       <xsl:template match="/">
 355        <z:record z:type="update">
 356
 357         <!-- create dynamic index name from input content -->
 358         <xsl:variable name="dynamic_content">
 359          <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
 360         </xsl:variable>
 361
 362         <!-- create zillions of indexes with unknown names -->
 363         <z:index name="{$dynamic_content}" type="w">
 364          <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
 365         </z:index>
 366        </z:record>
 367
 368       </xsl:template>
 369       ]]>
 370      </screen>
 371      Don't be tempted to cross
 372      the line to the dark side of the force, padawan; this leads
 373      to suffering and pain, and universal
 374      disintegration of your project schedule.
 375     </para>
 376   </sect2>
 377
 378   <sect2 id="record-model-alvisxslt-elementset">
 379    <title>ALVIS Exchange Formats</title>
 380    <para>
 381      An exchange format can be anything which can be the outcome of an
 382      XSLT transformation, as long as the stylesheet is registered in
 383      the main Alvis XSLT filter configuration file, see
 384      <xref linkend="record-model-alvisxslt-filter"/>.
 385      In principle anything that can be expressed in XML, HTML, and
 386      TEXT can be the output of a <literal>schema</literal> or
 387     <literal>element set</literal> directive during search, as long as
 388      the information comes from the
 389      <emphasis>original input record XML DOM tree</emphasis>
 390      (and not the transformed and <emphasis>indexed</emphasis> XML!).
 391     </para>
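    <para>
     The simplest exchange format returns the original record
     unchanged. The configuration example registers a stylesheet
     <filename>xsl/identity.xsl</filename> whose content is not shown
     in this chapter; a conventional XSLT 1.0 identity transformation,
     given here as a sketch and not necessarily identical to the
     distributed file, looks like this:
     <screen>
      <![CDATA[
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       version="1.0">

       <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

       <!-- copy every attribute and node, then recurse into it -->
       <xsl:template match="@*|node()">
        <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
       </xsl:template>

      </xsl:stylesheet>
      ]]>
     </screen>
     More selective exchange formats can start from this skeleton and
     add templates which override the copy rule for individual
     elements.
    </para>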
 392     <para>
 393      In addition, internal administrative information from the Zebra
 394      indexer can be accessed during record retrieval. The following
 395      example is a summary of the possibilities:
 396      <screen>
 397       <![CDATA[
 398       <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 399        xmlns:z="http://indexdata.dk/zebra/xslt/1"
 400        version="1.0">
 401
 402        <!-- register internal zebra parameters -->
 403        <xsl:param name="id" select="''"/>
 404        <xsl:param name="filename" select="''"/>
 405        <xsl:param name="score" select="''"/>
 406        <xsl:param name="schema" select="''"/>
 407
 408        <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
 409
 410        <!-- use them for display of internal information -->
 411        <xsl:template match="/">
 412          <z:zebra>
 413            <id><xsl:value-of select="$id"/></id>
 414            <filename><xsl:value-of select="$filename"/></filename>
 415            <score><xsl:value-of select="$score"/></score>
 416            <schema><xsl:value-of select="$schema"/></schema>
 417          </z:zebra>
 418        </xsl:template>
 419
 420       </xsl:stylesheet>
 421       ]]>
 422      </screen>
 423     </para>
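    <para>
     To make the stylesheet above available during retrieval, it has
     to be registered in the filter configuration file like any other
     exchange format. In the following sketch, the schema name
     <literal>zebra</literal> and the path
     <filename>xsl/zebra.xsl</filename> are hypothetical choices; the
     other entries repeat the configuration example from
     <xref linkend="record-model-alvisxslt-filter"/>:
     <screen>
      <![CDATA[
      <?xml version="1.0" encoding="UTF-8"?>
      <schemaInfo>
        <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
            stylesheet="xsl/oai2index.xsl" />
        <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
        <!-- hypothetical name and path for the stylesheet shown above -->
        <schema name="zebra" stylesheet="xsl/zebra.xsl" />
        <split level="2"/>
      </schemaInfo>
      ]]>
     </screen>
     With such an entry in place, the element set would be selected in
     yaz-client with <literal>elem zebra</literal>, in the same way as
     <literal>elem dc</literal> in the query examples.
    </para>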
 424
 425   </sect2>
 426
 427   <sect2 id="record-model-alvisxslt-example">
 428    <title>ALVIS Filter OAI Indexing Example</title>
 429    <para>
 430      The source code tarball contains a working Alvis filter example in
 431      the directory <filename>examples/alvis-oai/</filename>, which
 432      should get you started.
 433     </para>
 434     <para>
 435      More example data can be harvested from any OAI compliant server;
 436      see details at the OAI
 437      <ulink url="http://www.openarchives.org/">
 438       http://www.openarchives.org/</ulink> web site, and the community
 439       links at
 440      <ulink url="http://www.openarchives.org/community/index.html">
 441       http://www.openarchives.org/community/index.html</ulink>.
 442      There is a tutorial
 443      found at
 444      <ulink url="http://www.oaforum.org/tutorial/">
 445       http://www.oaforum.org/tutorial/</ulink>.
 446     </para>
 447    </sect2>
 448
 449   </sect1>
 450
 451
 452  </chapter>
 453
 454
 455 <!--
 456
 457 c)  Main "alvis" XSLT filter config file:
 458   cat db/filter_alvis_conf.xml
 459
 460   <?xml version="1.0" encoding="UTF8"?>
 461   <schemaInfo>
 462     <schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
 463     <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
 464             stylesheet="db/alvis2index.xsl" />
 465     <schema name="dc" stylesheet="db/alvis2dc.xsl" />
 466     <schema name="dc-short" stylesheet="db/alvis2dc_short.xsl" />
 467     <schema name="snippet" snippet="25" stylesheet="db/alvis2snippet.xsl" />
 468     <schema name="help" stylesheet="db/alvis2help.xsl" />
 469     <split level="1"/>
 470   </schemaInfo>
 471
 472   the paths are relative to the directory where zebra.init is placed
 473   and is started up.
 474
 475   The split level decides where the SAX parser shall split the
 476   collections of records into individual records, which then are
 477   loaded into DOM, and have the indexing XSLT stylesheet applied.
 478
 479   The indexing stylesheet is found by its identifier.
 480
 481   All the other stylesheets are for presentation after search.
 482
 483 - in data/ a short sample of harvested carnivorous plants
 484   ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
 485
 486 - in root also one single data record - nice for testing the xslt
 487   stylesheets,
 488
 489   xsltproc db/alvis2index.xsl carni*.xml
 490
 491   and so on.
 492
 493 - in db/ a cql2pqf.txt yaz-client config file
 494   which is also used in the yaz-server <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF process
 495
 496    see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
 497
 498 - in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT thing,
 499   as it constructs the new XML structure by pulling data out of the
 500   respective elements/attributes of the old structure.
 501
 502   Notice the special zebra namespace, and the special elements in this
 503   namespace which indicate to the zebra indexer what to do.
 504
 505   <z:record id="67ht7" rank="675" type="update">
 506   indicates that a new record with given id and static rank has to be updated.
 507
 508   <z:index name="title" type="w">
 509    encloses all the text/XML which shall be indexed in the index named
 510    "title" and of index type "w" (see  file default.idx in your zebra
 511    installation)
 512
 513
 514    </para>
 515
 516    <para>
 517 -->
 518
 519
 520
 521
 522  <!-- Keep this comment at the end of the file
 523  Local variables:
 524  mode: sgml
 525  sgml-omittag:t
 526  sgml-shorttag:t
 527  sgml-minimize-attributes:nil
 528  sgml-always-quote-attributes:t
 529  sgml-indent-step:1
 530  sgml-indent-data:t
 531  sgml-parent-document: "zebra.xml"
 532  sgml-local-catalogs: nil
 533  sgml-namecase-general:t
 534  End:
 535  -->