X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Ffield-structure.xml;h=4079205a6308d9f2cfa401a5a1e2157819eb1f4b;hp=758542b3ed84da1979baa1dcd1e15f414324f872;hb=1b8e1d7dfece31918056f76819c18675ed6e781e;hpb=7b25277add2aae5caabee02213911aeeb65030c8

diff --git a/doc/field-structure.xml b/doc/field-structure.xml
index 758542b..4079205 100644
--- a/doc/field-structure.xml
+++ b/doc/field-structure.xml
@@ -1,11 +1,11 @@
  <chapter id="fields-and-charsets">
-  <!-- $Id: field-structure.xml,v 1.2 2006-09-05 12:01:31 adam Exp $ -->
+  <!-- $Id: field-structure.xml,v 1.12 2007-02-02 09:58:39 marc Exp $ -->
   <title>Field Structure and Character Sets
   </title>
   
   <para>
    In order to provide a flexible approach to national character set
-   handling, Zebra allows the administrator to configure the set up the
+   handling, &zebra; allows the administrator to set up the
    system to handle any 8-bit character set &mdash; including sets that
    require multi-octet diacritics or other multi-octet characters. The
    definition of a character set includes a specification of the
@@ -76,26 +76,162 @@
 	search containing space characters as a word proximity search.
        </para>
       </listitem></varlistentry>
+
+     <varlistentry id="default.idx.firstinfield">
+      <term>firstinfield <replaceable>boolean</replaceable></term>
+      <listitem>
+       <para>
+	This directive enables or disables first-in-field indexing.
+	The value of the <replaceable>boolean</replaceable> should be 0
+	(disable) or 1 (enable).
+       </para>
+      </listitem></varlistentry>
+
+     <varlistentry id="default.idx.alwaysmatches">
+      <term>alwaysmatches <replaceable>boolean</replaceable></term>
+      <listitem>
+       <para>
+	This directive enables or disables alwaysmatches indexing.
+	The value of the <replaceable>boolean</replaceable> should be 0
+	(disable) or 1 (enable).
+       </para>
+      </listitem></varlistentry>
+
      <varlistentry>
       <term>charmap <replaceable>filename</replaceable></term>
       <listitem>
        <para>
 	This is the filename of the character
 	map to be used for this index for field type.
+        See <xref linkend="character-map-files"/> for details.
        </para>
       </listitem></varlistentry>
     </variablelist>
    </para>
+   <para>
+    The following three excerpts are from the standard
+    <filename>tab/default.idx</filename> configuration file. Note
+    that <literal>index</literal> and <literal>sort</literal>
+    are grouping directives: the directives that follow each one
+    apply to it:
+    <screen>
+     # Traditional word index
+     # Used if completeness is 'incomplete field' (@attr 6=1) and
+     # structure is word/phrase/word-list/free-form-text/document-text
+     index w
+     completeness 0
+     position 1
+     alwaysmatches 1
+     firstinfield 1
+     charmap string.chr
+
+     ...
+
+     # Null map index (no mapping at all)
+     # Used if structure=key (@attr 4=3)
+     index 0
+     completeness 0
+     position 1
+     charmap @
+
+     ...
+
+     # Sort register
+     sort s
+     completeness 1
+     charmap string.chr
+    </screen>
+   </para>
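+   <para>
+    As a hedged sketch of how these directives combine, a site-specific
+    index could be declared along the following lines. The index name
+    <literal>x</literal> and the file name
+    <filename>custom.chr</filename> are hypothetical and are not part
+    of the standard <filename>tab/default.idx</filename>:
+    <screen>
+     # Hypothetical word index with its own character map
+     index x
+     completeness 0
+     position 1
+     alwaysmatches 0
+     firstinfield 0
+     charmap custom.chr
+    </screen>
+   </para>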
   </section>
 
   <section id="character-map-files">
    <title>The character map file format</title>
    <para>
-    The contents of the character map files are structured as follows:
+    Character map files define the word tokenization and character
+    normalization performed before text is inserted into the
+    inverted indexes. &zebra; ships with the predefined character map
+    files <filename>tab/*.chr</filename>. Users may add or modify
+    maps according to their needs.
    </para>
 
+   <table id="character-map-table" frame="top">
+     <title>Character maps predefined in &zebra;</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>File name</entry>
+         <entry>Intended type</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+       <tbody>
+        <row>
+         <entry><literal>numeric.chr</literal></entry>
+         <entry><literal>:n</literal></entry>
+         <entry>Numeric digit tokenization and normalization map. All
+         characters not in the set <literal>-{0-9}.,</literal> are
+         suppressed. Floating-point numbers are handled correctly,
+         but numbers in scientific (exponential) notation are not.</entry>
+        </row>
+        <row>
+         <entry><literal>scan.chr</literal></entry>
+         <entry><literal>:w</literal> or <literal>:p</literal></entry>
+         <entry>Word tokenization character map for Scandinavian
+         languages. It resembles the generic word tokenization
+         character map <literal>tab/string.chr</literal>; the main
+         differences are the sorting of the special characters
+        <literal>üzæäøöå</literal> and equivalence maps according to
+         Scandinavian language rules.</entry>
+        </row>
+        <row>
+         <entry><literal>string.chr</literal></entry>
+         <entry><literal>:w</literal> or <literal>:p</literal></entry>
+         <entry>General word tokenization and normalization character
+         map, mostly useful for English texts. Use it as the starting
+         point for your own language-specific maps.</entry>
+        </row>
+        <row>
+         <entry><literal>urx.chr</literal></entry>
+         <entry><literal>:u</literal></entry>
+         <entry>URL parsing and tokenization character map.</entry>
+        </row>
+        <row>
+         <entry><literal>@</literal></entry>
+         <entry><literal>:0</literal></entry>
+         <entry>Do-nothing character map used for literal binary
+         indexing. No file is associated with it, and no
+         normalization or tokenization is performed at all.</entry>
+        </row>
+      </tbody>
+     </tgroup>
+   </table>
+
    <para>
+    The contents of the character map files are structured as follows:
     <variablelist>
+     <varlistentry>
+      <term>encoding <replaceable>encoding-name</replaceable></term>
+      <listitem>
+       <para>
+	This directive must be at the very beginning of the file, and it
+        specifies the character encoding used in the entire file. If
+        omitted, the encoding <literal>ISO-8859-1</literal> is assumed.
+       </para>
+       <para>
+        For example, the test file
+          <literal>test/rusmarc/tab/string.chr</literal> contains the following
+        encoding directive:
+        <screen>
+         encoding koi8-r
+        </screen>
+          and the test file
+          <literal>test/charmap/string.utf8.chr</literal> is encoded
+          in UTF-8:
+        <screen>
+         encoding utf-8
+        </screen>
+       </para>
+      </listitem></varlistentry>
 
      <varlistentry>
       <term>lowercase <replaceable>value-set</replaceable></term>
@@ -149,16 +285,30 @@
 	</itemizedlist>
 
        </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        lowercase normalization and sorting order:
+        <screen>
+         lowercase {0-9}{a-y}üzæäøöå
+        </screen>
+       </para>
       </listitem></varlistentry>
      <varlistentry>
       <term>uppercase <replaceable>value-set</replaceable></term>
       <listitem>
        <para>
 	This directive introduces the
-	upper-case equivalencis to the value set (if any). The number and
+	upper-case equivalences to the value set (if any). The number and
 	order of the entries in the list should be the same as in the
 	<literal>lowercase</literal> directive.
        </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        uppercase equivalent:
+        <screen>
+         uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
+        </screen>
+       </para>
       </listitem></varlistentry>
      <varlistentry>
       <term>space <replaceable>value-set</replaceable></term>
@@ -173,6 +323,13 @@
 	<literal>uppercase</literal> and <literal>lowercase</literal>
 	directives.
        </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        space instruction:
+        <screen><![CDATA[
+         space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
+        ]]></screen>
+       </para>
       </listitem></varlistentry>
      <varlistentry>
       <term>map <replaceable>value-set</replaceable>
@@ -183,7 +340,7 @@
 	members of the value-set on the left to the character on the
 	right. The character on the right must occur in the value
 	set (the <literal>lowercase</literal> directive) of the
-	character set, but it may be a paranthesis-enclosed
+	character set, but it may be a parenthesis-enclosed
 	multi-octet character. This directive may be used to map
 	diacritics to their base characters, or to map HTML-style
 	character-representations to their natural form, etc. The
@@ -192,6 +349,37 @@
 	transformations. See section <xref
 	 linkend="leading-articles"/>.
        </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains, among others,
+        the following map instructions, which ensure that Danish
+        special characters encoded as HTML entities are mapped to the
+        equivalent Latin-1 characters:
+        <screen><![CDATA[
+         map (&aelig;)      æ
+         map (&oslash;)     ø
+         map (&aring;)      å
+        ]]></screen>
+       </para>
+      </listitem></varlistentry>
+     <varlistentry>
+      <term>equivalent <replaceable>value-set</replaceable></term>
+      <listitem>
+       <para>
+	This directive introduces equivalence classes of characters
+	and/or strings for sorting purposes only. It resembles the map
+	directive, but it does not affect search and retrieval indexing;
+	it affects only the sorting order under present requests.
+       </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        equivalent sorting instructions, which can be uncommented:
+        <screen><![CDATA[
+         # equivalent æä(ae)
+         # equivalent øö(oe)
+         # equivalent å(aa)
+         # equivalent uü
+        ]]></screen>
+       </para>
       </listitem></varlistentry>
     </variablelist>
    </para>
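+   <para>
+    Taken together, a small character map file might look like the
+    following sketch. It only reuses directives and values quoted above
+    from <literal>scan.chr</literal>; it is an illustration under the
+    assumption that the file is saved in UTF-8, not a complete or
+    tested map:
+    <screen><![CDATA[
+     # Illustrative fragment only; not a shipped file
+     encoding utf-8
+     lowercase {0-9}{a-y}üzæäøöå
+     uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
+     space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
+     map (&aelig;)      æ
+     map (&oslash;)     ø
+     map (&aring;)      å
+    ]]></screen>
+   </para>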
@@ -201,7 +389,7 @@
    <para>
     In addition to specifying sort orders, space (blank) handling,
     and upper/lowercase folding, you can also use the character map
-    files to make Zebra ignore leading articles in sorting records,
+    files to make &zebra; ignore leading articles in sorting records,
     or when doing complete field searching.
    </para>
    <para>
@@ -240,6 +428,7 @@
     would both produce the same results.
    </para>
   </section>
+
  </chapter>
  <!-- Keep this comment at the end of the file
  Local variables: