Use entity idcommon rather than common
[idzebra-moved-to-github.git] / doc / recordmodel-domxml.xml
index bb5b300..bf31b74 100644 (file)
@@ -1,7 +1,7 @@
 <chapter id="record-model-domxml">
-  <!-- $Id: recordmodel-domxml.xml,v 1.9 2007-02-22 15:44:19 marc Exp $ -->
+  <!-- $Id: recordmodel-domxml.xml,v 1.13 2007-03-21 19:37:00 adam Exp $ -->
   <title>&dom; &xml; Record Model and Filter Module</title>
-
+  
   <para>
    The record model described in this chapter applies to the fundamental,
    structured &xml;
          </para>
        </listitem>
        <listitem>
-         <para>The unique <literal>record</literal> instruction
-           may have additional attributes <literal>id</literal> and
-            <literal>rank</literal>, where the value of the opaque ID
-            may be any string not containing the whitespace character 
-            <literal>' '</literal>, and the rank value must be a
+         <para>
+            The unique <literal>record</literal> instruction
+           may have additional attributes <literal>id</literal>,
+            <literal>rank</literal> and <literal>type</literal>.
+            Attribute <literal>id</literal> is the value of the opaque ID
+            and may be any string not containing the whitespace character 
+            <literal>' '</literal>.
+            The <literal>rank</literal> attribute value must be a
             non-negative integer. See 
-            <xref linkend="administration-ranking"/>
+            <xref linkend="administration-ranking"/>.
+            The <literal>type</literal> attribute specifies how the record
+            is to be treated. The following values may be given for 
+            <literal>type</literal>:
+            <variablelist>
+             <varlistentry>
+              <term><literal>insert</literal></term>
+              <listitem>
+               <para>
+                The record is inserted. If the record already exists, it is
+                skipped (i.e. not replaced).
+               </para>
+              </listitem>
+             </varlistentry>
+             <varlistentry>
+              <term><literal>replace</literal></term>
+              <listitem>
+               <para>
+                The record is replaced. If the record does not already exist,
+                it is skipped (i.e. not inserted).
+               </para>
+              </listitem>
+             </varlistentry>
+             <varlistentry>
+              <term><literal>delete</literal></term>
+              <listitem>
+               <para>
+                The record is deleted. If the record does not already exist,
+                it is skipped (i.e. nothing is deleted).
+               </para>
+              </listitem>
+             </varlistentry>
+             <varlistentry>
+              <term><literal>update</literal></term>
+              <listitem>
+               <para>
+                The record is inserted or replaced depending on whether the
+                record exists or not. This is the default behavior, but it
+                may be overridden from outside the &dom; filter by
+                zebraidx commands or extended services updates.
+               </para>
+              </listitem>
+             </varlistentry>
+            </variablelist>
+          Note that the value of <literal>type</literal> determines the
+          action only if the Zebra indexer is running
+          in "update" mode (i.e. zebraidx update) or if the specialUpdate
+          action of the
+          <link linkend="administration-extended-services-z3950">Extended
+          Service Update</link> is used.
+          For this reason a specialUpdate may end up deleting records!
          </para>
        </listitem>
        <listitem>
          <xref linkend="fields-and-charsets"/> for details.
          </para>
        </listitem>
+       <listitem>
+         <para> 
+         &dom; input documents which do not result in both one
+         unique valid 
+         <literal>record</literal> instruction and one or more valid 
+         <literal>index</literal> instructions cannot be searched and
+         found. Therefore,
+         processing of an invalid document is aborted, and any content of
+         the <literal>&lt;extract&gt;</literal> and 
+         <literal>&lt;store&gt;</literal> pipelines is discarded.
+          A warning is issued in the logs. 
+         </para>
+       </listitem>
       </itemizedlist>
     </para>
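+    <para>
+     As a minimal sketch of the attributes described above, a
+     <literal>record</literal> instruction carrying all three might look
+     as follows. The <literal>z:</literal> prefix on the attribute names,
+     the namespace URI (copied from a stylesheet fragment elsewhere in
+     this chapter) and all values are assumptions made for illustration,
+     not a confirmed excerpt from a working configuration:
+     <screen>
+      <![CDATA[
+      <!-- hypothetical record instruction; ID, rank and index name are made up -->
+      <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
+                z:id="oai:example:0001" z:rank="42" z:type="insert">
+        <z:index name="oai_identifier:0">oai:example:0001</z:index>
+      </z:record>
+      ]]>
+     </screen>
+     With <literal>type</literal> set to <literal>insert</literal>, a
+     record whose ID already exists would be skipped rather than replaced.
+    </para>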
-
     
     <para>The examples work as follows: 
      From the original &xml; file 
 
          <!-- OAI indexing templates -->
          <xsl:template match="oai:record/oai:header/oai:identifier">
-          <z:index name="oai_identifier;0">
+          <z:index name="oai_identifier:0">
            <xsl:value-of select="."/>
           </z:index>    
          </xsl:template>
       ]]>
      </screen>
     </para>
+  </section>
+
+
+  <section id="record-model-domxml-index-marc">
+   <title>&dom; Indexing &marcxml;</title>
+    <para>
+      The &dom; filter allows indexing of both binary &marc; records
+      and &marcxml; records, depending on its configuration.
+      A typical &marcxml; record might look like this:
+      <screen>  
+      <![CDATA[
+      <record xmlns="http://www.loc.gov/MARC21/slim">
+       <rank>42</rank>
+       <leader>00366nam  22001698a 4500</leader>
+       <controlfield tag="001">   11224466   </controlfield>
+       <controlfield tag="003">DLC  </controlfield>
+       <controlfield tag="005">00000000000000.0  </controlfield>
+       <controlfield tag="008">910710c19910701nju           00010 eng    </controlfield>
+       <datafield tag="010" ind1=" " ind2=" ">
+         <subfield code="a">   11224466 </subfield>
+       </datafield>
+       <datafield tag="040" ind1=" " ind2=" ">
+         <subfield code="a">DLC</subfield>
+         <subfield code="c">DLC</subfield>
+       </datafield>
+       <datafield tag="050" ind1="0" ind2="0">
+         <subfield code="a">123-xyz</subfield>
+       </datafield>
+       <datafield tag="100" ind1="1" ind2="0">
+         <subfield code="a">Jack Collins</subfield>
+       </datafield>
+       <datafield tag="245" ind1="1" ind2="0">
+         <subfield code="a">How to program a computer</subfield>
+       </datafield>
+       <datafield tag="260" ind1="1" ind2=" ">
+         <subfield code="a">Penguin</subfield>
+       </datafield>
+       <datafield tag="263" ind1=" " ind2=" ">
+         <subfield code="a">8710</subfield>
+       </datafield>
+       <datafield tag="300" ind1=" " ind2=" ">
+         <subfield code="a">p. cm.</subfield>
+       </datafield>
+      </record>
+      ]]>
+      </screen>
+    </para>
+
     <para>
-     Notice also,
-     that the names and types of the indexes can be defined in the
+      String manipulation is easy in the &dom;
+      filter. For example, if you want to drop some leading articles
+      in the indexing of sort fields, you might want to pick out the 
+      &marcxml; indicator attributes to chop off leading substrings. If
+      the above &xml; example had an indicator
+      <literal>ind2="8"</literal> in the title field 
+      <literal>245</literal>, i.e.
+      <screen>  
+      <![CDATA[
+       <datafield tag="245" ind1="1" ind2="8">
+         <subfield code="a">How to program a computer</subfield>
+       </datafield>
+      ]]>
+      </screen>
+      one could write a template taking this information into account
+      to start the sorting index <literal>title:s</literal> at character
+      position <literal>8</literal>, skipping the leading "How to ", like this:
+      <screen>  
+      <![CDATA[
+      <xsl:template match="m:datafield[@tag='245']">
+        <xsl:variable name="chop">
+          <xsl:choose>
+            <xsl:when test="not(number(@ind2))">0</xsl:when>
+            <xsl:otherwise><xsl:value-of select="number(@ind2)"/></xsl:otherwise>
+          </xsl:choose>
+        </xsl:variable>  
+
+        <z:index name="title:w title:p any:w">
+           <xsl:value-of select="m:subfield[@code='a']"/>
+        </z:index>
+
+        <z:index name="title:s">
+          <xsl:value-of select="substring(m:subfield[@code='a'], $chop)"/>
+        </z:index>
+
+      </xsl:template> 
+      ]]>
+      </screen>
+      The output of the above &marcxml; and &xslt; excerpt would then be:
+      <screen>  
+      <![CDATA[
+        <z:index name="title:w title:p any:w">How to program a computer</z:index>
+        <z:index name="title:s">program a computer</z:index>
+      ]]>
+      </screen>
+      and the record would be sorted in the title index under 'P', not 'H'.
+    </para>
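+    <para>
+     XPath's <literal>substring()</literal> counts positions from 1, so
+     <literal>substring(s, 8)</literal> drops seven characters, not eight.
+     If the indicator is instead meant as the number of leading
+     characters to skip (an assumption about the intended use), a variant
+     of the template could add one to the start position. The variable
+     name <literal>skip</literal> is hypothetical, and this is only a sketch:
+     <screen>
+      <![CDATA[
+      <xsl:template match="m:datafield[@tag='245']">
+        <!-- characters to skip; a blank or non-numeric ind2 counts as 0 -->
+        <xsl:variable name="skip">
+          <xsl:choose>
+            <xsl:when test="not(number(@ind2))">0</xsl:when>
+            <xsl:otherwise><xsl:value-of select="number(@ind2)"/></xsl:otherwise>
+          </xsl:choose>
+        </xsl:variable>
+
+        <z:index name="title:s">
+          <!-- XPath positions start at 1, so begin at $skip + 1 -->
+          <xsl:value-of select="substring(m:subfield[@code='a'], $skip + 1)"/>
+        </z:index>
+      </xsl:template>
+      ]]>
+     </screen>
+     With <literal>ind2="7"</literal> on the title field, this variant
+     indexes "program a computer" in <literal>title:s</literal>.
+    </para>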
+  </section>
+
+
+  <section id="record-model-domxml-index-wizzard">
+   <title>&dom; Indexing Wizardry</title>
+    <para>
+     The names and types of the indexes can be defined in the
      indexing &xslt; stylesheet <emphasis>dynamically according to
      content in the original &xml; records</emphasis>, which has
      opportunities for great power and wizardry as well as grande
     </para>
   </section>
 
+  <section id="record-model-domxml-debug">
+   <title>Debugging &dom; Filter Configurations</title>
+   <para>
+    It can be very hard to debug a &dom; filter setup due to the many
+    successive &marc; syntax translations, &xml; stream splitting and 
+    &xslt; transformations involved. As an aid, you can always use
+    the <literal>-s</literal> command line switch to the 
+    <literal>zebraidx</literal> indexing command: 
+    <screen>
+     zebraidx -s -c zebra.cfg update some_record_stream.xml
+    </screen>
+    This command line simulates indexing and dumps a lot of debug
+    information in the logs, telling exactly which transformations
+    have been applied, what the documents look like after each
+    transformation, and which record ids and terms are sent to the indexer.
+   </para>
+  </section>
+
+  <!--
   <section id="record-model-domxml-elementset">
    <title>&dom; Exchange Formats</title>
    <para>
        xmlns:z="http://indexdata.dk/zebra/xslt/1"
        version="1.0">
 
-       <!-- register internal zebra parameters -->       
+       <!- - register internal zebra parameters - ->       
        <xsl:param name="id" select="''"/>
        <xsl:param name="filename" select="''"/>
        <xsl:param name="score" select="''"/>
            
        <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
 
-       <!-- use then for display of internal information -->
+       <!- - use then for display of internal information - ->
        <xsl:template match="/">
          <z:zebra>
            <id><xsl:value-of select="$id"/></id>
     </para>
 
   </section>
+  -->
 
   <!--
   <section id="record-model-domxml-example">