doc/recordmodel-domxml.xml

   1 <chapter id="record-model-domxml">
   2   <!-- $Id: recordmodel-domxml.xml,v 1.7 2007-02-21 14:15:07 marc Exp $ -->
   3   <title>&dom; &xml; Record Model and Filter Module</title>
   4
   5   <para>
   6    The record model described in this chapter applies to the fundamental,
   7    structured &xml;
   8    record type <literal>dom</literal>, introduced in
   9    <xref linkend="componentmodulesdom"/>. The &dom; &xml; record model
  10    is experimental, and its inner workings might change in future
  11    releases of the &zebra; Information Server.
  12   </para>
  13
  14
  15
  16   <section id="record-model-domxml-filter">
  17    <title>&dom; Record Filter Architecture</title>
  18
  19      <para>
  20       The &dom; &xml; filter uses a standard &dom; &xml; structure as
  21       internal data model, and can therefore parse, index, and display
  22       any &xml; document type. It is well suited to work on
  23       standardized &xml;-based formats such as Dublin Core, MODS, METS,
  24       MARCXML, OAI-PMH, and RSS, and performs equally well on any
  25       non-standard &xml; format.
  26     </para>
  27     <para>
  28       A parser for binary &marc; records based on the ISO2709 library
  29       standard is provided; it transforms these to the internal
  30       &marcxml; &dom; representation. Other binary document parsers
  31       are planned to follow.
  32     </para>
  33
  34     <para>
  35       The &dom; filter architecture consists of four
  36       different pipelines, each being a chain of arbitrarily many successive
  37       &xslt; transformations of the internal &dom; &xml;
  38       representations of documents.
  39     </para>
  40
  41     <figure id="record-model-domxml-architecture-fig">
  42       <title>&dom; &xml; filter architecture</title>
  43       <mediaobject>
  44        <imageobject>
  45          <imagedata fileref="domfilter.pdf" format="PDF" scale="50"/>
  46         </imageobject>
  47         <imageobject>
  48           <imagedata fileref="domfilter.png" format="PNG"/>
  49         </imageobject>
  50         <textobject>
  51         <!-- Fall back if none of the images can be used -->
  52         <phrase>
  53           [Here there should be a diagram showing the &dom; &xml;
  54            filter architecture, but it seems that your
  55            tool chain has not been able to include the diagram in this
  56            document.]
  57          </phrase>
  58         </textobject>
  59       </mediaobject>
  60      </figure>
  61
  62
  63     <table id="record-model-domxml-architecture-table" frame="top">
  64       <title>&dom; &xml; filter pipelines overview</title>
  65       <tgroup cols="5">
  66        <thead>
  67         <row>
  68          <entry>Name</entry>
  69          <entry>When</entry>
  70          <entry>Description</entry>
  71          <entry>Input</entry>
  72          <entry>Output</entry>
  73         </row>
  74        </thead>
  75
  76        <tbody>
  77         <row>
  78          <entry><literal>input</literal></entry>
  79          <entry>first</entry>
  80          <entry>input parsing and initial
  81           transformations to common &xml; format</entry>
  82          <entry>Input raw &xml; record buffers, &xml;  streams and
  83                 binary &marc; buffers</entry>
  84          <entry>Common &xml; &dom;</entry>
  85         </row>
  86         <row>
  87          <entry><literal>extract</literal></entry>
  88          <entry>second</entry>
  89          <entry>indexing term extraction
  90           transformations</entry>
  91          <entry>Common &xml; &dom;</entry>
  92          <entry>Indexing &xml; &dom;</entry>
  93         </row>
  94         <row>
  95          <entry><literal>store</literal></entry>
  96          <entry>second</entry>
  97          <entry> transformations before internal document
  98           storage</entry>
  99          <entry>Common &xml; &dom;</entry>
 100          <entry>Storage &xml; &dom;</entry>
 101         </row>
 102         <row>
 103          <entry><literal>retrieve</literal></entry>
 104          <entry>third</entry>
 105          <entry>multiple document retrieve transformations from
 106           storage to different output
 107           formats are possible</entry>
 108          <entry>Storage &xml; &dom;</entry>
 109          <entry>Output &xml; syntax in requested formats</entry>
 110         </row>
 111        </tbody>
 112       </tgroup>
 113      </table>
 114
 115     <para>
 116       The &dom; &xml; filter pipelines use &xslt; (and, if supported on
 117       your platform, even &exslt;), and thus bring full &xpath;
 118       support to the indexing, storage and display rules of not only
 119       &xml; documents, but also binary &marc; records.
 120     </para>
 121    </section>
 122
 123
 124    <section id="record-model-domxml-pipeline">
 125     <title>&dom; &xml; filter pipeline configuration</title>
 126
 127    <para>
 128     The experimental, loadable  &dom; &xml;/&xslt; filter module
 129    <literal>mod-dom.so</literal>
 130     is invoked by the <filename>zebra.cfg</filename> configuration statement
 131     <screen>
 132      recordtype.xml: dom.db/filter_dom_conf.xml
 133     </screen>
 134     In this example the &dom; &xml; filter is configured to work
 135     on all data files with suffix
 136     <filename>*.xml</filename>, where the configuration file is found in the
 137     path <filename>db/filter_dom_conf.xml</filename>.
 138    </para>
 139
 140    <para>The &dom; &xslt; filter configuration file must be
 141     valid &xml;. It might look like this:
 142     <screen>
 143     <![CDATA[
 144     <?xml version="1.0" encoding="UTF-8"?>
 145     <dom xmlns="http://indexdata.com/zebra-2.0">
 146       <input>
 147         <xmlreader level="1"/>
 148         <!-- <marc inputcharset="marc-8"/> -->
 149       </input>
 150       <extract>
 151          <xslt stylesheet="common2index.xsl"/>
 152       </extract>
 153       <store>
 154          <xslt stylesheet="common2store.xsl"/>
 155       </store>
 156       <retrieve name="dc">
 157         <xslt stylesheet="store2dc.xsl"/>
 158       </retrieve>
 159       <retrieve name="mods">
 160         <xslt stylesheet="store2mods.xsl"/>
 161       </retrieve>
 162     </dom>
 163     ]]>
 164     </screen>
 165    </para>
 166    <para>
 167      The root &xml; element <literal>&lt;dom&gt;</literal> and all other &dom;
 168      &xml; filter elements are residing in the namespace
 169      <literal>http://indexdata.com/zebra-2.0</literal>.
 170    </para>
 171    <para>
 172     All pipeline definition elements - i.e. the
 173      <literal>&lt;input&gt;</literal>,
 174      <literal>&lt;extract&gt;</literal>,
 175      <literal>&lt;store&gt;</literal>, and
 176      <literal>&lt;retrieve&gt;</literal> elements - are optional.
 177      Missing pipeline definitions are simply interpreted as
 178      do-nothing identity pipelines.
 179    </para>
 180    <para>
 181     All pipeline definition elements may contain zero or more
 182     <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
 183     &xslt; transformation instructions, which are performed
 184     sequentially from top to bottom.
 185     The paths in the <literal>stylesheet</literal> attributes
 186     are relative to &zebra;'s working directory, or absolute to the file
 187     system root.
 188    </para>
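   <para>
    As a sketch of such a chain, the following
    <literal>&lt;extract&gt;</literal> pipeline applies two stylesheets
    in sequence, the second one receiving the output of the first.
    Both stylesheet file names are hypothetical and serve only as
    illustration:
    <screen>
    <![CDATA[
    <extract>
      <!-- first step: clean up the common format (hypothetical file) -->
      <xslt stylesheet="db/normalize.xsl"/>
      <!-- second step: produce the indexing format (hypothetical file) -->
      <xslt stylesheet="db/common2index.xsl"/>
    </extract>
    ]]>
    </screen>
   </para>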
 189
 190
 191    <section id="record-model-domxml-pipeline-input">
 192     <title>Input pipeline</title>
 193    <para>
 194     The <literal>&lt;input&gt;</literal> pipeline definition element
 195     may contain either one &xml; Reader definition
 196     <literal><![CDATA[<xmlreader level="1"/>]]></literal>, used to split
 197     an &xml; collection input stream into individual &xml; &dom;
 198     documents at the prescribed element level,
 199     or one &marc; binary
 200     parsing instruction
 201     <literal><![CDATA[<marc inputcharset="marc-8"/>]]></literal>, which defines
 202     a conversion to &marcxml; format &dom; trees. The allowed values
 203     of the <literal>inputcharset</literal> attribute depend on your
 204     local <productname>iconv</productname> set-up.
 205    </para>
 206    <para>
 207     Both input parsers deliver individual &dom; &xml; documents to the
 208     following chain of zero or more
 209     <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
 210     &xslt; transformations. At the end of this pipeline, the documents
 211     are in the common format, used to feed both the
 212      <literal>&lt;extract&gt;</literal> and
 213      <literal>&lt;store&gt;</literal> pipelines.
 214    </para>
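   <para>
    For instance, an input pipeline for binary &marc; records might be
    sketched as follows, assuming a hypothetical stylesheet
    <filename>marcxml2common.xsl</filename> that maps &marcxml; to
    your common format:
    <screen>
    <![CDATA[
    <input>
      <!-- parse binary MARC records, converting from the MARC-8 charset -->
      <marc inputcharset="marc-8"/>
      <!-- hypothetical stylesheet: MARCXML to common format -->
      <xslt stylesheet="marcxml2common.xsl"/>
    </input>
    ]]>
    </screen>
   </para>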
 215    </section>
 216
 217    <section id="record-model-domxml-pipeline-extract">
 218     <title>Extract pipeline</title>
 219      <para>
 220        The <literal>&lt;extract&gt;</literal> pipeline takes documents
 221        from any common &dom; &xml; format to the &zebra; specific
 222         indexing &dom; &xml; format.
 223        It may consist of zero or more
 224        <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
 225        &xslt; transformations, and the outcome is handed to the
 226        &zebra; core to drive the process of building the inverted
 227        indexes. See
 228        <xref linkend="record-model-domxml-canonical-index"/> for
 229        details.
 230      </para>
 231    </section>
 232
 233    <section id="record-model-domxml-pipeline-store">
 234     <title>Store pipeline</title>
 235     <para>The <literal>&lt;store&gt;</literal> pipeline takes documents
 236        from any common &dom; &xml; format to the &zebra; specific
 237         storage &dom; &xml; format.
 238        It may consist of zero or more
 239        <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
 240        &xslt; transformations, and the outcome is handed to the
 241        &zebra; core for deposition into the internal storage system.</para>
 242     </section>
 243
 244    <section id="record-model-domxml-pipeline-retrieve">
 245     <title>Retrieve pipeline</title>
 246     <para>
 247       Finally, there may be one or more
 248       <literal>&lt;retrieve&gt;</literal> pipeline definitions, each
 249       of them again consisting of zero or more
 250       <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
 251        &xslt; transformations. These are used for document
 252       presentation after search, and take the internal storage &dom;
 253       &xml; to the requested output formats during record present
 254       requests.
 255     </para>
 256     <para>
 257      Multiple
 258      <literal>&lt;retrieve&gt;</literal> pipeline definitions
 259      are distinguished by their unique <literal>name</literal>
 260      attributes; these are the literal <literal>schema</literal> or
 261      <literal>element set</literal> names used in
 262       <ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
 263       <ulink url="&url.sru;">&sru;</ulink> and
 264       &z3950; protocol queries.
 265    </para>
 266    </section>
 267
 268
 269    <section id="record-model-domxml-canonical-index">
 270     <title>Canonical Indexing Format</title>
 271     <para>The output of the indexing &xslt; stylesheets must contain
 272     certain elements in the magic
 273      <literal>xmlns:z="http://indexdata.dk/zebra-2.0"</literal>
 274     namespace. The output of the &xslt; indexing transformation is then
 275     parsed using &dom; methods, and the contained instructions are
 276     performed on the <emphasis>magic elements and their
 277     subtrees</emphasis>.
 278     </para>
 279     <para>
 280     For example, the output of the command
 281      <screen>
 282       xsltproc xsl/oai2index.xsl one-record.xml
 283      </screen>
 284      might look like this:
 285      <screen>
 286       &lt;?xml version="1.0" encoding="UTF-8"?&gt;
 287       &lt;z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
 288            z:id="oai:JTRS:CP-3290---Volume-I"
 289            z:rank="47896"
 290            z:type="update"&gt;
 291        &lt;z:index name="oai_identifier" type="0"&gt;
 292                 oai:JTRS:CP-3290---Volume-I&lt;/z:index&gt;
 293        &lt;z:index name="oai_datestamp" type="0"&gt;2004-07-09&lt;/z:index&gt;
 294        &lt;z:index name="oai_setspec" type="0"&gt;jtrs&lt;/z:index&gt;
 295        &lt;z:index name="dc_all" type="w"&gt;
 296           &lt;z:index name="dc_title" type="w"&gt;Proceedings of the 4th
 297                 International Conference and Exhibition:
 298                 World Congress on Superconductivity - Volume I&lt;/z:index&gt;
 299           &lt;z:index name="dc_creator" type="w"&gt;Kumar Krishen and *Calvin
 300                 Burnham, Editors&lt;/z:index&gt;
 301        &lt;/z:index&gt;
 302      &lt;/z:record&gt;
 303      </screen>
 304     </para>
 305     <para>This means the following: From the original &xml; file
 306      <literal>one-record.xml</literal> (or from the &xml; record &dom; of the
 307      same form coming from a split input file), the indexing
 308      stylesheet produces an indexing &xml; record, which is defined by
 309      the <literal>record</literal> element in the magic namespace
 310      <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
 311      &zebra; uses the content of
 312      <literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
 313      record ID, and - in case static ranking is set - the content of
 314      <literal>z:rank="47896"</literal> as static rank. Following the
 315      discussion in <xref linkend="administration-ranking"/>
 316      we see that this record is internally ordered
 317      lexicographically according to the value of the string
 318      <literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
 319      The type of action performed during indexing is defined by
 320      <literal>z:type="update"</literal>, with recognized values
 321      <literal>insert</literal>, <literal>update</literal>, and
 322      <literal>delete</literal>.
 323     </para>
 324     <para>In this example, the following literal indexes are constructed:
 325      <screen>
 326        oai_identifier
 327        oai_datestamp
 328        oai_setspec
 329        dc_all
 330        dc_title
 331        dc_creator
 332      </screen>
 333      where the indexing type is defined in the
 334      <literal>type</literal> attribute
 335      (any value from the standard configuration
 336      file <filename>default.idx</filename> will do). Finally, any
 337      <literal>text()</literal> node content recursively contained
 338      inside the <literal>index</literal> will be filtered through the
 339      appropriate charmap for character normalization, and will be
 340      inserted in the index.
 341     </para>
 342     <para>
 343      Specific to this example, we see that the single word
 344      <literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted literally,
 345      byte for byte without any form of character normalization,
 346      into the index named <literal>oai_identifier</literal>,
 347      the text
 348      <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
 349      will be inserted using the <literal>w</literal> character
 350      normalization defined in <filename>default.idx</filename> into
 351      the index <literal>dc_creator</literal> (that is, after character
 352      normalization the index will keep the individual words
 353      <literal>kumar</literal>, <literal>krishen</literal>,
 354      <literal>and</literal>, <literal>calvin</literal>,
 355      <literal>burnham</literal>, and <literal>editors</literal>), and
 356      finally both the texts
 357      <literal>Proceedings of the 4th International Conference and Exhibition:
 358       World Congress on Superconductivity - Volume I</literal>
 359      and
 360      <literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
 361      will be inserted into the index <literal>dc_all</literal> using
 362      the same character normalization map <literal>w</literal>.
 363     </para>
 364     <para>
 365      Finally, this example configuration can be queried using &pqf;
 366      queries, either transported by &z3950; (here using yaz-client)
 367      <screen>
 368       <![CDATA[
 369       Z> open localhost:9999
 370       Z> elem dc
 371       Z> form xml
 372       Z>
 373       Z> f @attr 1=dc_creator Kumar
 374       Z> scan @attr 1=dc_creator adam
 375       Z>
 376       Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
 377       Z> scan @attr 1=dc_title abc
 378       ]]>
 379      </screen>
 380      or the proprietary
 381      extensions <literal>x-pquery</literal> and
 382      <literal>x-pScanClause</literal> to
 383      &sru;, and &srw;
 384      <screen>
 385       <![CDATA[
 386       http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
 387       http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
 388       ]]>
 389      </screen>
 390      See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
 391      configuration, and <xref linkend="gfs-config"/> or the &yaz;
 392      <ulink url="&url.yaz.cql;">&cql; section</ulink>
 393      for details on the &yaz; frontend server.
 394     </para>
 395     <para>
 396      Notice that there are no <filename>*.abs</filename>,
 397      <filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
 398      filter configuration files involved in this process, and that the
 399      literal index names are used during search and retrieval.
 400     </para>
 401    </section>
 402   </section>
 403
 404
 405   <section id="record-model-domxml-conf">
 406    <title>&dom; Record Model Configuration</title>
 407
 408
 409   <section id="record-model-domxml-index">
 410    <title>&dom; Indexing Configuration</title>
 411     <para>
 412      As mentioned above, there can be only one indexing
 413      stylesheet, and configuring the indexing process is synonymous
 414      with writing an &xslt; stylesheet which produces &xml; output containing the
 415      magic elements discussed in
 416      <xref linkend="record-model-domxml-canonical-index"/>.
 417      Obviously, there are millions of different ways to accomplish this
 418      task, and some comments and code snippets are in order to lead
 419      our padawans on the right track to the good side of the force.
 420     </para>
 421     <para>
 422      Stylesheets can be written in the <emphasis>pull</emphasis> or
 423      the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
 424      means that the output &xml; structure is taken as starting point of
 425      the internal structure of the &xslt; stylesheet, and portions of
 426      the input &xml; are <emphasis>pulled</emphasis> out and inserted
 427      into the right spots of the output &xml; structure. On the other
 428      hand, <emphasis>push</emphasis> &xslt; stylesheets recursively
 429      call their template definitions, a process which is driven
 430      by the input &xml; structure, and awake to produce some output &xml;
 431      whenever some special conditions in the input stylesheets are
 432      met. The <emphasis>pull</emphasis> type is well-suited for input
 433      &xml; with strong and well-defined structure and semantics, like the
 434      following &oai; indexing example, whereas the
 435      <emphasis>push</emphasis> type might be the only possible way to
 436      sort out deeply recursive input &xml; formats.
 437     </para>
 438     <para>
 439      A <emphasis>pull</emphasis> stylesheet example used to index
 440      &oai; harvested records could use some of the following template
 441      definitions:
 442      <screen>
 443       <![CDATA[
 444       <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 445        xmlns:z="http://indexdata.dk/zebra/xslt/1"
 446        xmlns:oai="http://www.openarchives.org/OAI/2.0/"
 447        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
 448        xmlns:dc="http://purl.org/dc/elements/1.1/"
 449        version="1.0">
 450
 451        <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
 452
 453         <!-- disable all default text node output -->
 454         <xsl:template match="text()"/>
 455
 456          <!-- match on oai xml record root -->
 457          <xsl:template match="/">
 458           <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
 459            z:type="update">
 460            <!-- you might want to use z:rank="{some XSLT function here}" -->
 461            <xsl:apply-templates/>
 462           </z:record>
 463          </xsl:template>
 464
 465          <!-- OAI indexing templates -->
 466          <xsl:template match="oai:record/oai:header/oai:identifier">
 467           <z:index name="oai_identifier" type="0">
 468            <xsl:value-of select="."/>
 469           </z:index>
 470          </xsl:template>
 471
 472          <!-- etc, etc -->
 473
 474          <!-- DC specific indexing templates -->
 475          <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
 476           <z:index name="dc_title" type="w">
 477            <xsl:value-of select="."/>
 478           </z:index>
 479          </xsl:template>
 480
 481          <!-- etc, etc -->
 482
 483       </xsl:stylesheet>
 484       ]]>
 485      </screen>
 486     </para>
 487     <para>
 488      Notice also
 489      that the names and types of the indexes can be defined in the
 490      indexing &xslt; stylesheet <emphasis>dynamically according to
 491      content in the original &xml; records</emphasis>, which opens up
 492      opportunities for great power and wizardry as well as grand
 493      disaster.
 494     </para>
 495     <para>
 496      The following excerpt of a <emphasis>push</emphasis> stylesheet
 497      <emphasis>might</emphasis>
 498      be a good idea if you have strict control of the &xml;
 499      input format (due to rigorous checking against well-defined and
 500      tight RelaxNG or &xml; Schemas, for example):
 501      <screen>
 502       <![CDATA[
 503       <xsl:template name="element-name-indexes">
 504        <z:index name="{name()}" type="w">
 505         <xsl:value-of select="'1'"/>
 506        </z:index>
 507       </xsl:template>
 508       ]]>
 509      </screen>
 510      This template creates indexes which have the name of the current
 511      node of any input &xml; file, and assigns a '1' to the index.
 512      The example query
 513      <literal>find @attr 1=xyz 1</literal>
 514      finds all files which contain at least one
 515      <literal>xyz</literal> &xml; element. In case you cannot control
 516      which element names the input files contain, you might ask for
 517      disaster and bad karma using this technique.
 518     </para>
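    <para>
     Since the template above is a named template, it only takes
     effect when it is called. A minimal sketch of a calling template,
     assuming it lives in a stylesheet whose root template emits the
     <literal>z:record</literal> element and applies templates to the
     input, could look like this:
     <screen>
      <![CDATA[
      <!-- visit every element, emit one index entry for it, then recurse -->
      <xsl:template match="*">
       <xsl:call-template name="element-name-indexes"/>
       <xsl:apply-templates select="*"/>
      </xsl:template>
      ]]>
     </screen>
     Selecting only child elements in the recursion keeps text nodes
     out of the output.
    </para>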
 519     <para>
 520      One variation on the theme <emphasis>dynamically created
 521      indexes</emphasis> will definitely be unwise:
 522      <screen>
 523       <![CDATA[
 524       <!-- match on oai xml record root -->
 525       <xsl:template match="/">
 526        <z:record z:type="update">
 527
 528         <!-- create dynamic index name from input content -->
 529         <xsl:variable name="dynamic_content">
 530          <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
 531         </xsl:variable>
 532
 533         <!-- create zillions of indexes with unknown names -->
 534         <z:index name="{$dynamic_content}" type="w">
 535          <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
 536         </z:index>
 537        </z:record>
 538
 539       </xsl:template>
 540       ]]>
 541      </screen>
 542      Don't be tempted to cross
 543      the line to the dark side of the force, padawan; this leads
 544      to suffering and pain, and universal
 545      disintegration of your project schedule.
 546     </para>
 547   </section>
 548
 549   <section id="record-model-domxml-elementset">
 550    <title>&dom; Exchange Formats</title>
 551    <para>
 552      An exchange format can be anything which can be the outcome of an
 553      &xslt; transformation, as long as the stylesheet is registered in
 554      the main &dom; &xslt; filter configuration file, see
 555      <xref linkend="record-model-domxml-filter"/>.
 556      In principle anything that can be expressed in  &xml;, HTML, and
 557      TEXT can be the output of a <literal>schema</literal> or
 558     <literal>element set</literal> directive during search, as long as
 559      the information comes from the
 560      <emphasis>original input record &xml; &dom; tree</emphasis>
 561      (and not the transformed and <emphasis>indexed</emphasis> &xml;).
 562     </para>
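    <para>
     As a minimal sketch of an exchange format, the identity
     stylesheet below returns the stored record unchanged and can
     serve as a starting point for more selective output formats.
     It is assumed here to be saved under the hypothetical file name
     <filename>store2raw.xsl</filename> and registered in a
     <literal>&lt;retrieve&gt;</literal> pipeline with a name of your
     choice:
     <screen>
      <![CDATA[
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       version="1.0">

       <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

       <!-- copy every node and attribute of the stored record as is -->
       <xsl:template match="@*|node()">
        <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
       </xsl:template>

      </xsl:stylesheet>
      ]]>
     </screen>
    </para>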
 563     <para>
 564      In addition, internal administrative information from the &zebra;
 565      indexer can be accessed during record retrieval. The following
 566      example is a summary of the possibilities:
 567      <screen>
 568       <![CDATA[
 569       <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 570        xmlns:z="http://indexdata.dk/zebra/xslt/1"
 571        version="1.0">
 572
 573        <!-- register internal zebra parameters -->
 574        <xsl:param name="id" select="''"/>
 575        <xsl:param name="filename" select="''"/>
 576        <xsl:param name="score" select="''"/>
 577        <xsl:param name="schema" select="''"/>
 578
 579        <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
 580
 581        <!-- use them for display of internal information -->
 582        <xsl:template match="/">
 583          <z:zebra>
 584            <id><xsl:value-of select="$id"/></id>
 585            <filename><xsl:value-of select="$filename"/></filename>
 586            <score><xsl:value-of select="$score"/></score>
 587            <schema><xsl:value-of select="$schema"/></schema>
 588          </z:zebra>
 589        </xsl:template>
 590
 591       </xsl:stylesheet>
 592       ]]>
 593      </screen>
 594     </para>
 595
 596   </section>
 597
 598   <section id="record-model-domxml-example">
 599    <title>&dom; Filter &oai; Indexing Example</title>
 600    <para>
 601      The source code tarball contains a working &dom; filter example in
 602      the directory <filename>examples/dom-oai/</filename>, which
 603      should get you started.
 604     </para>
 605     <para>
 606      More example data can be harvested from any &oai; compliant server,
 607      see details at the  &oai;
 608      <ulink url="http://www.openarchives.org/">
 609       http://www.openarchives.org/</ulink> web site, and the community
 610       links at
 611      <ulink url="http://www.openarchives.org/community/index.html">
 612       http://www.openarchives.org/community/index.html</ulink>.
 613      There is a  tutorial
 614      found at
 615      <ulink url="http://www.oaforum.org/tutorial/">
 616       http://www.oaforum.org/tutorial/</ulink>.
 617     </para>
 618    </section>
 619
 620   </section>
 621
 622
 623  </chapter>
 624
 625
 626 <!--
 627
 628 c)  Main "dom" &xslt; filter config file:
 629   cat db/filter_dom_conf.xml
 630
 631   <?xml version="1.0" encoding="UTF8"?>
 632   <schemaInfo>
 633     <schema name="dom" stylesheet="db/dom2dom.xsl" />
 634     <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
 635             stylesheet="db/dom2index.xsl" />
 636     <schema name="dc" stylesheet="db/dom2dc.xsl" />
 637     <schema name="dc-short" stylesheet="db/dom2dc_short.xsl" />
 638     <schema name="snippet" snippet="25" stylesheet="db/dom2snippet.xsl" />
 639     <schema name="help" stylesheet="db/dom2help.xsl" />
 640     <split level="1"/>
 641   </schemaInfo>
 642
 643   the paths are relative to the directory where zebra.init is placed
 644   and is started up.
 645
 646   The split level decides where the SAX parser shall split the
 647   collections of records into individual records, which then are
 648   loaded into &dom;, and have the indexing &xslt; stylesheet applied.
 649
 650   The indexing stylesheet is found by it's identifier.
 651
 652   All the other stylesheets are for presentation after search.
 653
 654 - in data/ a short sample of harvested carnivorous plants
 655   ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
 656
 657 - in root also one single data record - nice for testing the xslt
 658   stylesheets,
 659
 660   xsltproc db/dom2index.xsl carni*.xml
 661
 662   and so on.
 663
 664 - in db/ a cql2pqf.txt yaz-client config file
 665   which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
 666
 667    see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
 668
 669 - in db/ an indexing &xslt; stylesheet. This is a PULL-type XSLT thing,
 670   as it constructs the new &xml; structure by pulling data out of the
 671   respective elements/attributes of the old structure.
 672
 673   Notice the special zebra namespace, and the special elements in this
 674   namespace which indicate to the zebra indexer what to do.
 675
 676   <z:record id="67ht7" rank="675" type="update">
 677   indicates that a new record with given id and static rank has to be updated.
 678
 679   <z:index name="title" type="w">
 680    encloses all the text/&xml; which shall be indexed in the index named
 681    "title" and of index type "w" (see  file default.idx in your zebra
 682    installation)
 683
 684
 685    </para>
 686
 687    <para>
 688 -->
 689
 690
 691
 692
 693  <!-- Keep this comment at the end of the file
 694  Local variables:
 695  mode: sgml
 696  sgml-omittag:t
 697  sgml-shorttag:t
 698  sgml-minimize-attributes:nil
 699  sgml-always-quote-attributes:t
 700  sgml-indent-step:1
 701  sgml-indent-data:t
 702  sgml-parent-document: "zebra.xml"
 703  sgml-local-catalogs: nil
 704  sgml-namecase-general:t
 705  End:
 706  -->