added missing domfilter.eps to make rules, such that it is included in the distributi...
[idzebra-moved-to-github.git] / doc / recordmodel-domxml.xml
index 8e9b969..2c89976 100644 (file)
@@ -1,5 +1,5 @@
 <chapter id="record-model-domxml">
-  <!-- $Id: recordmodel-domxml.xml,v 1.4 2007-02-20 15:02:18 marc Exp $ -->
+  <!-- $Id: recordmodel-domxml.xml,v 1.8 2007-02-21 15:03:30 marc Exp $ -->
   <title>&dom; &xml; Record Model and Filter Module</title>
 
   <para>
@@ -14,7 +14,7 @@
   
   
   <section id="record-model-domxml-filter">
-   <title>&dom; Record Filter</title>
+   <title>&dom; Record Filter Architecture</title>
 
      <para>
       The &dom; &xml; filter uses a standard &dom; &xml; structure as
       &marcxml; &dom; representation. Other binary document parsers
       are planned to follow.  
     </para>
-   </section>
-
-
-   <section id="record-model-domxml-architecture">
-    <title>&dom; &xml; filter architecture</title>   
 
     <para>
-      The internal &dom; &xml; representation can be fed into four
-      different pipelines, consisting of arbitraily many sucessive
-      &xslt; transformations.
+      The &dom; filter architecture consists of four
+      different pipelines, each being a chain of arbitrarily many successive
+      &xslt; transformations of the internal &dom; &xml;
+      representations of documents.
     </para>
 
+    <figure id="record-model-domxml-architecture-fig">
+      <title>&dom; &xml; filter architecture</title>
+      <mediaobject>
+       <imageobject>
+         <imagedata fileref="domfilter.pdf" format="PDF" scale="50"/>
+        </imageobject>
+        <imageobject>
+          <imagedata fileref="domfilter.png" format="PNG"/>
+        </imageobject>
+        <textobject>
+        <!-- Fall back if none of the images can be used -->
+        <phrase>
+          [Here there should be a diagram showing the &dom; &xml;
+           filter architecture, but it seems that your
+           tool chain has not been able to include the diagram in this
+           document.]
+         </phrase>
+        </textobject>
+      </mediaobject>
+     </figure>
+
+
     <table id="record-model-domxml-architecture-table" frame="top">
       <title>&dom; &xml; filter pipelines overview</title>
       <tgroup cols="5">
          <entry>first</entry>
          <entry>input parsing and initial
           transformations to common &xml; format</entry>
-         <entry>raw &xml; record buffers, &xml;  streams and 
+         <entry>Input raw &xml; record buffers, &xml;  streams and 
                 binary &marc; buffers</entry>
-         <entry>single &dom; &xml; documents suitable for indexing and
-                internal storage</entry>
+         <entry>Common &xml; &dom;</entry>
         </row>
         <row>
          <entry><literal>extract</literal></entry>
          <entry>second</entry>
          <entry>indexing term extraction
           transformations</entry>
-         <entry>common single &dom; &xml; format</entry>
-         <entry>&zebra; internal indexing &dom; &xml; document</entry>
+         <entry>Common &xml; &dom;</entry>
+         <entry>Indexing &xml; &dom;</entry>
         </row>
         <row>
          <entry><literal>store</literal></entry>
          <entry>second</entry>
          <entry> transformations before internal document
           storage</entry>
-         <entry>common single &dom; &xml; format</entry>
-         <entry>&zebra; internal storage &dom; &xml; document</entry>
+         <entry>Common &xml; &dom;</entry>
+         <entry>Storage &xml; &dom;</entry>
         </row>
         <row>
          <entry><literal>retrieve</literal></entry>
          <entry>multiple document retrieve transformations from
           storage to different output
           formats are possible</entry>
-         <entry>&zebra; internal storage &dom; &xml; document</entry>
-         <entry>output &xml; syntax and requested format</entry>
+         <entry>Storage &xml; &dom;</entry>
+         <entry>Output &xml; syntax in requested formats</entry>
         </row>
        </tbody>
       </tgroup>
     <screen>
      recordtype.xml: dom.db/filter_dom_conf.xml
     </screen>
-    In this example on all data files with suffix 
-    <filename>*.xml</filename>, where the
-    &dom; &xslt; filter configuration file is found in the
+    In this example the &dom; &xml; filter is configured to work 
+    on all data files with suffix 
+    <filename>*.xml</filename>, where the configuration file is found in the
     path <filename>db/filter_dom_conf.xml</filename>.
    </para>
 
     ]]>
     </screen>
    </para>
-
    <para>
-    All named stylesheets defined inside
-    <literal>schema</literal> element tags 
-    are for presentation after search, including
-    the indexing stylesheet (which is a great debugging help). The
-    names defined in the <literal>name</literal> attributes must be
-    unique, these are the literal <literal>schema</literal> or 
-    <literal>element set</literal> names used in 
-      <ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
-      <ulink url="&url.sru;">&sru;</ulink> and
-    &z3950; protocol queries.
+     The root &xml; element <literal>&lt;dom&gt;</literal> and all other &dom;
+     &xml; filter elements are residing in the namespace 
+     <literal>http://indexdata.com/zebra-2.0</literal>.
+   </para>
+   <para>
+    All pipeline definition elements - i.e. the
+     <literal>&lt;input&gt;</literal>,
+     <literal>&lt;extract&gt;</literal>,
+     <literal>&lt;store&gt;</literal>, and 
+     <literal>&lt;retrieve&gt;</literal> elements - are optional.
+     Missing pipeline definitions are simply interpreted as
+     do-nothing identity pipelines.
+   </para>
+   <para>
+    All pipeline definition elements may contain zero or more 
+    <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+    &xslt; transformation instructions, which are performed
+    sequentially from top to bottom.
     The paths in the <literal>stylesheet</literal> attributes
-    are relative to zebras working directory, or absolute to file
+    are relative to &zebra;'s working directory, or absolute to the file
     system root.
    </para>
+
+
+   <section id="record-model-domxml-pipeline-input">
+    <title>Input pipeline</title>   
    <para>
-    The <literal>&lt;split level="2"/&gt;</literal> decides where the
-    &xml; Reader shall split the
-    collections of records into individual records, which then are
-    loaded into &dom;, and have the indexing &xslt; stylesheet applied.
+    The <literal>&lt;input&gt;</literal> pipeline definition element
+    may contain either one &xml; Reader definition
+    <literal><![CDATA[<xmlreader level="1"/>]]></literal>, used to split
+    an &xml; collection input stream into individual &xml; &dom;
+    documents at the prescribed element level, 
+    or one &marc; binary
+    parsing instruction
+    <literal><![CDATA[<marc inputcharset="marc-8"/>]]></literal>, which defines
+    a conversion to &marcxml; format &dom; trees. The allowed values
+    of the <literal>inputcharset</literal> attribute depend on your
+    local <productname>iconv</productname> set-up.
    </para>
    <para>
-    There must be exactly one indexing &xslt; stylesheet, which is
-    defined by the magic attribute  
-    <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
+    Both input parsers deliver individual &dom; &xml; documents to the
+    following chain of zero or more  
+    <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+    &xslt; transformations. At the end of this pipeline, the documents
+    are in the common format, used to feed both the 
+     <literal>&lt;extract&gt;</literal> and 
+     <literal>&lt;store&gt;</literal> pipelines.
    </para>
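+   <para>
+    As a minimal sketch, assuming binary &marc; input in MARC-8 encoding,
+    an input pipeline could be written as follows. The stylesheet path
+    <filename>marcxml-to-common.xsl</filename> is a hypothetical
+    placeholder, not a file shipped with &zebra;.
+    <screen>
+     <![CDATA[
+     <input>
+       <!-- convert binary MARC records to MARCXML DOM trees -->
+       <marc inputcharset="marc-8"/>
+       <!-- hypothetical stylesheet: MARCXML to the common format -->
+       <xslt stylesheet="marcxml-to-common.xsl"/>
+     </input>
+     ]]>
+    </screen>
+    For &xml; collection input, the <literal>&lt;marc&gt;</literal>
+    element would be replaced by an
+    <literal><![CDATA[<xmlreader level="1"/>]]></literal> definition.
+   </para>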
+   </section>
+
+   <section id="record-model-domxml-pipeline-extract">
+    <title>Extract pipeline</title>   
+     <para>
+       The <literal>&lt;extract&gt;</literal> pipeline takes documents
+       from any common &dom; &xml; format to the &zebra; specific
+        indexing &dom; &xml; format.
+       It may consist of zero or more 
+       <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+       &xslt; transformations, and the outcome is handed to the
+       &zebra; core to drive the process of building the inverted
+       indexes. See
+       <xref linkend="record-model-domxml-canonical-index"/> for
+       details.
+     </para>
+   </section>
 
-   <section id="record-model-domxml-internal">
-    <title>&dom; filter internal record representation</title>   
-    <para>When indexing, an &xml; Reader is invoked to split the input
-    files into suitable record &xml; pieces. Each record piece is then
-    transformed to an &xml; &dom; structure, which is essentially the
-    record model. Only &xslt; transformations can be applied during
-    index, search and retrieval. Consequently, output formats are
-    restricted to whatever &xslt; can deliver from the record &xml;
-    structure, be it other &xml; formats, HTML, or plain text. In case
-    you have <literal>libxslt1</literal> running with E&xslt; support,
-    you can use this functionality inside the &dom;
-    filter configuration &xslt; stylesheets.
+   <section id="record-model-domxml-pipeline-store">
+    <title>Store pipeline</title>   
+    <para>
+       The <literal>&lt;store&gt;</literal> pipeline takes documents
+       from any common &dom; &xml; format to the &zebra; specific
+        storage &dom; &xml; format.
+       It may consist of zero or more 
+       <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+       &xslt; transformations, and the outcome is handed to the
+       &zebra; core for deposition into the internal storage system.
+    </para>
+    </section>
+
+   <section id="record-model-domxml-pipeline-retrieve">
+    <title>Retrieve pipeline</title>   
+    <para>
+      Finally, there may be one or more 
+      <literal>&lt;retrieve&gt;</literal> pipeline definitions, each
+      of them again consisting of zero or more
+      <literal><![CDATA[<xslt stylesheet="path/file.xsl"/>]]></literal>
+       &xslt; transformations. These are used for document
+      presentation after search, and take the internal storage &dom;
+      &xml; to the requested output formats during record present
+      requests.  
     </para>
+    <para>
+     Multiple 
+     <literal>&lt;retrieve&gt;</literal> pipeline definitions
+     are distinguished by their unique <literal>name</literal>
+     attributes; these are the literal <literal>schema</literal> or 
+     <literal>element set</literal> names used in 
+      <ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
+      <ulink url="&url.sru;">&sru;</ulink> and
+      &z3950; protocol queries.
+   </para>
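+   <para>
+    As an illustration only, two retrieve pipelines might be declared
+    side by side as sketched here. The names <literal>dc</literal> and
+    <literal>brief</literal> and both stylesheet paths are hypothetical
+    placeholders chosen for this sketch.
+    <screen>
+     <![CDATA[
+     <!-- selected when a client requests schema or element set "dc" -->
+     <retrieve name="dc">
+       <xslt stylesheet="store-to-dc.xsl"/>  <!-- hypothetical path -->
+     </retrieve>
+     <!-- selected when a client requests schema or element set "brief" -->
+     <retrieve name="brief">
+       <xslt stylesheet="store-to-brief.xsl"/>  <!-- hypothetical path -->
+     </retrieve>
+     ]]>
+    </screen>
+    A client would then pick one of the pipelines by using the matching
+    name as schema or element set name in its request.
+   </para>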
    </section>
 
-   <section id="record-model-domxml-canonical">
-    <title>&dom; Canonical Indexing Format</title>   
+
+   <section id="record-model-domxml-canonical-index">
+    <title>Canonical Indexing Format</title>
+
+    <para>
+     &dom; &xml; indexing comes in two flavors: pure
+     processing-instruction governed plain &xml; documents, and - very
+     similar to the Alvis filter indexing format - &xml; documents
+     containing &xml; <literal>&lt;record&gt;</literal> and
+     <literal>&lt;index&gt;</literal> instructions from the magic
+     namespace <literal>xmlns:z="http://indexdata.com/zebra-2.0"</literal>. 
+    </para>
+
+   <section id="record-model-domxml-canonical-index-pi">
+    <title>Processing-instruction governed indexing format</title>
+      <para>The output of the processing instruction driven 
+      indexing &xslt; stylesheets must contain
+      processing instructions named 
+       <literal>zebra-2.0</literal>. 
+      The output of the &xslt; indexing transformation is then
+      parsed using &dom; methods, and the contained instructions are
+      performed on the <emphasis>elements and their
+      subtrees directly following the processing instructions</emphasis>.
+      </para>
+      <para>
+     For example, the output of the command
+     <screen>  
+       xsltproc dom-index-pi.xsl marc-one.xml
+     </screen> 
+     might look like this:
+     <screen>
+      <![CDATA[
+      <?xml version="1.0" encoding="UTF-8"?>
+      <?zebra-2.0 record id=11224466 rank=42?>
+      <record>
+        <?zebra-2.0 index control:w?>
+        <control>11224466</control>
+        <?zebra-2.0 index title:w title:p title:s any:w?>
+        <title>How to program a computer</title>
+      </record>
+      ]]>
+     </screen>
+    </para>
+   </section>
+
+   <section id="record-model-domxml-canonical-index-element">
+    <title>Magic element governed indexing format</title>
+   
     <para>The output of the indexing &xslt; stylesheets must contain
     certain elements in the magic 
-     <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
+     <literal>xmlns:z="http://indexdata.com/zebra-2.0"</literal>
     namespace. The output of the &xslt; indexing transformation is then
     parsed using &dom; methods, and the contained instructions are
     performed on the <emphasis>magic elements and their
     </para>
     <para>
     For example, the output of the command
-     <screen>  
-      xsltproc xsl/oai2index.xsl one-record.xml
+     <screen>   
+      xsltproc dom-index-element.xsl marc-one.xml 
      </screen> 
      might look like this:
      <screen>
-      &lt;?xml version="1.0" encoding="UTF-8"?&gt;
-      &lt;z:record xmlns:z="http://indexdata.dk/zebra/xslt/1" 
-           z:id="oai:JTRS:CP-3290---Volume-I" 
-           z:rank="47896"
-           z:type="update"&gt;
-       &lt;z:index name="oai_identifier" type="0"&gt;
-                oai:JTRS:CP-3290---Volume-I&lt;/z:index&gt;
-       &lt;z:index name="oai_datestamp" type="0"&gt;2004-07-09&lt;/z:index&gt;
-       &lt;z:index name="oai_setspec" type="0"&gt;jtrs&lt;/z:index&gt;
-       &lt;z:index name="dc_all" type="w"&gt;
-          &lt;z:index name="dc_title" type="w"&gt;Proceedings of the 4th 
-                International Conference and Exhibition:
-                World Congress on Superconductivity - Volume I&lt;/z:index&gt;
-          &lt;z:index name="dc_creator" type="w"&gt;Kumar Krishen and *Calvin
-                Burnham, Editors&lt;/z:index&gt;
-       &lt;/z:index&gt;
-     &lt;/z:record&gt;
+      <![CDATA[
+      <?xml version="1.0" encoding="UTF-8"?>
+      <z:record xmlns:z="http://indexdata.com/zebra-2.0" 
+                z:id="11224466" z:rank="42">
+          <z:index name="control:w">11224466</z:index>
+          <z:index name="title:w title:p title:s any:w">
+                    How to program a computer</z:index>
+      </z:record>
+      ]]>
      </screen>
     </para>
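+    <para>
+     A stylesheet producing output of this kind could look roughly like
+     the following sketch. It assumes, purely for illustration, an input
+     record of the form
+     <literal><![CDATA[<record><control>..</control><title>..</title></record>]]></literal>
+     and a fixed rank of 42; it is not the
+     <filename>dom-index-element.xsl</filename> stylesheet used above.
+     <screen>
+      <![CDATA[
+      <?xml version="1.0" encoding="UTF-8"?>
+      <xsl:stylesheet version="1.0"
+          xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+          xmlns:z="http://indexdata.com/zebra-2.0">
+        <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
+
+        <!-- assumed input layout: /record/control and /record/title -->
+        <xsl:template match="/record">
+          <!-- the record id is taken from the control field;
+               the rank value is a hypothetical constant -->
+          <z:record z:id="{control}" z:rank="42">
+            <z:index name="control:w">
+              <xsl:value-of select="control"/>
+            </z:index>
+            <z:index name="title:w title:p title:s any:w">
+              <xsl:value-of select="title"/>
+            </z:index>
+          </z:record>
+        </xsl:template>
+      </xsl:stylesheet>
+      ]]>
+     </screen>
+     Running such a stylesheet through <command>xsltproc</command> is a
+     convenient way to inspect the indexing instructions before loading
+     any records.
+    </para>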
+   </section>
+
+
+   <section id="record-model-domxml-canonical-index-semantics">
+    <title>Semantics of the indexing formats</title>
+
+    <para>
+     Both indexing formats are defined with equal semantics and
+     behaviour in mind. 
+    </para>
+
+    
     <para>This means the following: From the original &xml; file 
      <literal>one-record.xml</literal> (or from the &xml; record &dom; of the
      same form coming from a splitted input file), the indexing
      <literal>insert</literal>, <literal>update</literal>, and 
      <literal>delete</literal>. 
     </para>
-    <para>In this example, the following literal indexes are constructed:
+    
+
+    <para>In these examples, the following literal indexes are constructed:
      <screen>
-       oai_identifier
-       oai_datestamp
-       oai_setspec
-       dc_all
-       dc_title
-       dc_creator
+       any:w
+       control:w
+       title:w
+       title:p
+       title:s
      </screen>
-     where the indexing type is defined in the 
-     <literal>type</literal> attribute 
-     (any value from the standard configuration
-     file <filename>default.idx</filename> will do). Finally, any 
+     where the indexing type is defined after the 
+     literal <literal>':'</literal> character.  
+     Any value from the standard configuration
+     file <filename>default.idx</filename> will do.
+     Finally, any 
      <literal>text()</literal> node content recursively contained
-     inside the <literal>index</literal> will be filtered through the
+     inside the <literal>&lt;z:index&gt;</literal> element, or any
+     element following an <literal>index</literal> processing instruction,
+     will be filtered through the
      appropriate charmap for character normalization, and will be
-     inserted in the index.
+     inserted in the named indexes.
     </para>
+
+    
     <para>
      Specific to this example, we see that the single word
      <literal>oai:JTRS:CP-3290---Volume-I</literal> will be literal,
      filter configuration files involves in this process, and that the
      literal index names are used during search and retrieval.
     </para>
+    
+   </section>
+
    </section>
   </section>