diff --git a/doc/field-structure.xml b/doc/field-structure.xml
index 6eda6a5..a19838e 100644

Field Structure and Character Sets

In order to provide a flexible approach to national character set handling, &zebra; allows the administrator to set up the system to handle any 8-bit character set — including sets that require multi-octet diacritics or other multi-octet characters. The definition of a character set includes a specification of the

...

special-purpose fields such as WWW-style linkages (URx).

Zebra 1.3, and Zebra versions 2.0.18 and earlier, required the field type to be a single character, e.g. w (for word) and p (for phrase). Zebra 2.0.20 and later allow field types to be any string. This allows for greater flexibility; in particular, per-locale (language) fields can be defined.

Version 2.0.20 of Zebra can also be configured, per field, to use the ICU library to perform tokenization and normalization of strings. This is an alternative to the "charmap" files, which have been part of Zebra since its first release.
The default.idx file

The field types, and hence character sets, are associated with data elements by the indexing rules (say title:w) in the various filters. Fields are defined in a field definition file which, by default, is called default.idx. This file provides the association between field type codes and the character map files (with the .chr suffix). The format of the .idx file is as follows:

...

(non-space characters) separated by single space characters (normalized to " " on display). When completeness is disabled, each word is indexed as a separate entry. Complete subfield indexing is most useful for fields which are typically browsed (e.g., titles, authors, or subjects), or instances where a match on a complete subfield is essential (e.g., exact title searching). For fields where completeness is disabled, the search engine will interpret a search containing space characters as a word proximity search.

...

charmap filename

This is the filename of the character map to be used for this field type. See the Charmap Files section below for details.

icuchain filename

Specifies the filename with ICU tokenization and normalization rules. See the ICU Chain Files section below for details. Using icuchain for a field type is an alternative to charmap. It does not make sense to define both icuchain and charmap for the same field type.

Field types

Following are three excerpts of the standard tab/default.idx configuration file.
Notice that index and sort are grouping directives, which bind all following directives to them:

# Traditional word index
# Used if completeness is 'incomplete field' (@attr 6=1) and
# structure is word/phrase/word-list/free-form-text/document-text
index w
completeness 0
position 1
alwaysmatches 1
firstinfield 1
charmap string.chr

...

# Null map index (no mapping at all)
# Used if structure=key (@attr 4=3)
index 0
completeness 0
position 1
charmap @

...

# Sort register
sort s
completeness 1
charmap string.chr
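The effect of the completeness flag shown in these excerpts can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not Zebra's actual code, and the function name is invented:

```python
def index_entries(field, completeness):
    """Sketch of Zebra-style completeness handling: with completeness 1
    the whole subfield becomes a single entry (words joined by single
    spaces); with completeness 0 every word is indexed on its own."""
    words = field.split()
    if completeness:
        return [" ".join(words)]  # one entry for the complete subfield
    return words                  # one entry per word

print(index_entries("The  Old Man and the Sea", 1))
# -> ['The Old Man and the Sea']
print(index_entries("The  Old Man and the Sea", 0))
# -> ['The', 'Old', 'Man', 'and', 'the', 'Sea']
```

With completeness disabled, a query containing spaces is then naturally interpreted as a proximity search over the individual word entries.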
Charmap Files

The character map files are used to define the word tokenization and character normalization performed before inserting text into the inverse indexes. &zebra; ships with the predefined character map files tab/*.chr. Users are allowed to add and/or modify maps according to their needs.

Character maps predefined in &zebra;:

numeric.chr (type :n)
Numeric digit tokenization and normalization map. All characters not in the set -{0-9}., will be suppressed. Note that floating point numbers are processed fine, but scientific exponential numbers are trashed.

scan.chr (type :w or :p)
Word tokenization character map for Scandinavian languages. It resembles the generic word tokenization character map tab/string.chr; the main differences are the sorting of the special characters üzæäøöå and equivalence maps according to Scandinavian language rules.

string.chr (type :w or :p)
General word tokenization and normalization character map, mostly useful for English texts. Use this to derive your own language tokenization and normalization maps.

urx.chr (type :u)
URL parsing and tokenization character map.

@ (type :0)
Do-nothing character map used for literal binary indexing. There is no file associated with it, and no normalization or tokenization is performed at all.
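As a quick illustration of the numeric.chr behaviour described in the table above — a sketch in Python, not Zebra's implementation, with an invented function name:

```python
def numeric_normalize(text):
    """Mimic a numeric.chr-style map: suppress every character that is
    not in the set -{0-9}., before the value is indexed."""
    allowed = set("-0123456789.,")
    return "".join(ch for ch in text if ch in allowed)

print(numeric_normalize("Price: -12,345.60 USD"))  # -> -12,345.60
print(numeric_normalize("1.5e10"))  # -> 1.510 (the exponent marker is lost)
```

The second call shows why scientific exponential numbers are "trashed": the e is simply suppressed like any other non-numeric character.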
The contents of the character map files are structured as follows:

encoding encoding-name

This directive must be at the very beginning of the file, and it specifies the character encoding used in the entire file. If omitted, the encoding ISO-8859-1 is assumed.

For example, one of the test files found at test/rusmarc/tab/string.chr contains the following encoding directive:

encoding koi8-r

and the test file test/charmap/string.utf8.chr is encoded in UTF-8:

encoding utf-8

lowercase value-set

...

Curly braces {} may be used to enclose ranges of single characters (possibly using the escape convention described in the preceding point), e.g., {a-z} to introduce the standard range of ASCII characters. Note that the interpretation of such a range depends on the concrete representation in your local, physical character set.

Parentheses () may be used to enclose multi-byte characters, e.g., diacritics or special national combinations (e.g., Spanish "ll"). When found in the input stream (or a search term), these characters are viewed and sorted as a single character, with a sorting value depending on the position of the group in the value definition.

...

For example, scan.chr contains the following lowercase normalization and sorting order:

lowercase {0-9}{a-y}üzæäøöå

uppercase value-set

This directive introduces the uppercase equivalences to the value set (if any). The number and order of the entries in the list should be the same as in the lowercase directive.
For example, scan.chr contains the following uppercase equivalent:

uppercase {0-9}{A-Y}ÜZÆÄØÖÅ

space value-set

... uppercase and lowercase directives.

For example, scan.chr contains the following space instruction:

space {\001-\040}!"#$%&'()*+,-./:;<=>?@\[\\]^_`\{|}~

map value-set

... members of the value-set on the left to the character on the right. The character on the right must occur in the value set (the lowercase directive) of the character set, but it may be a parenthesis-enclosed multi-octet character. This directive may be used to map diacritics to their base characters, or to map HTML-style character representations to their natural form, etc. The map directive can also be used to ignore leading articles in searching and/or sorting, and to perform other special transformations.

For example, scan.chr contains, among others, the following map instructions to make sure that HTML-entity-encoded Danish special characters are mapped to the equivalent Latin-1 characters:

map (&aelig;)æ
map (&oslash;)ø
map (&aring;)å

In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, you can also use the character map files to make &zebra; ignore leading articles in sorting records, or when doing complete field searching.

This is done using the map directive in the character map file. In a nutshell, what you do is map certain sequences of characters, when they occur in the beginning of a field, to a space. Assuming that the character "@" is defined as a space character in your file, you can do:

map (^The\s) @
map (^the\s) @

The effect of these directives is to map either 'the' or 'The', followed by a space character, to a space. The hat character ^ denotes beginning-of-field only when complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just as any other character.
Because the default.idx file can be used to associate different character maps with different indexing types -- and you can create additional indexing types, should the need arise -- it is possible to specify that leading articles should be ignored either in sorting, in complete-field searching, or both.

If you ignore certain prefixes in sorting, then these will be eliminated from the index, and sorting will take place as if they weren't there. However, if you set the system up to ignore certain prefixes in searching, then these are deleted both from the indexes and from query terms when the client specifies complete-field searching. This has the effect that a search for 'the science journal' and a search for 'science journal' would both produce the same results.

equivalent value-set

This directive introduces equivalence classes of strings for searching purposes only. It is a one-to-many conversion that takes place only during search, before the map directive kicks in.

For example, given:

equivalent æä(ae)

a search for äsel will match any of æsel, äsel, and aesel.
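The interplay of the map and equivalent directives can be sketched in Python. This is a toy model of the behaviour described above, not Zebra's implementation, and all names are invented:

```python
# Toy model of a .chr file: 'map' folds sequences to a base form at
# index and search time; 'equivalent' expands a search term into all
# members of an equivalence class at search time only.
MAP = {"^The ": " ", "^the ": " ", "&aelig;": "æ"}
EQUIVALENT = ["æ", "ä", "ae"]  # one equivalence class, as in the example

def normalize(field):
    """Apply map directives; '^' anchors at the beginning of the field,
    mirroring complete-subfield and sort indexing."""
    for src, dst in MAP.items():
        if src.startswith("^"):
            if field.startswith(src[1:]):
                field = dst + field[len(src) - 1:]
        else:
            field = field.replace(src, dst)
    return field.strip().lower()

def search_variants(term):
    """Expand every occurrence of a class member into all members."""
    variants = {term}
    for a in EQUIVALENT:
        for b in EQUIVALENT:
            variants |= {v.replace(a, b) for v in variants}
    return variants

print(normalize("The Science Journal"))   # -> science journal
print(sorted(search_variants("äsel")))
```

Note that "Theory of Sets" is left alone by the leading-article maps, because the \s in (^The\s) requires a space after the article.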
ICU Chain Files

The ICU chain files define a chain of rules which specify the conversion process to be carried out for each record string for indexing.

Both searching and sorting are based on the sort normalization that ICU provides. This means that scan and sort will return terms in the sort order given by ICU.

Zebra uses the ICU wrapper provided by YAZ. Refer to the yaz-icu man page for documentation about the ICU chain rules.
Use the yaz-icu program to test your icuchain rules.

Indexing Greek text

Consider a system where all "regular" text is to be indexed as Greek (locale: EL). We would have to change our index type file to read:

# Index greek words
index w
completeness 0
position 1
alwaysmatches 1
firstinfield 1
icuchain greek.xml
..

The ICU chain file greek.xml could look as follows:

...

Zebra is shipped with a field types file icu.idx, which is an ICU chain version of default.idx.

MARCXML indexing using ICU

The directory examples/marcxml includes a complete sample with MARCXML records that are DOM XML indexed using ICU chain rules. Study the README in the marcxml directory for details.

Field structure debugging using the special zebra:: element set

At times it is very hard to figure out what exactly has been indexed, how, and in which indexes. Using the indexing stylesheet of the Alvis filter, one can at least see which portion of the record went into which index, but a similar aid does not exist for all other indexing filters.

Starting with Zebra version 2.0.4-2 or newer, one can use the special zebra:: element set name, which is only defined for the SUTRS and XML record formats.

Z> f @attr 1=dc_all minutter
Z> format sutrs
Z> elements zebra::
Z> s 1+1

will display all indexed tokens from all indexed fields of the first record, in SUTRS record syntax, whereas

Z> f @attr 1=dc_all minutter
Z> format xml
Z> elements zebra::dc_publisher
Z> s 1+1
Z> elements zebra::dc_publisher:p
Z> s 1+1

displays in XML record syntax only the content of the zebra string index dc_publisher, or even only the type p phrase-indexed part of it.
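What an ICU chain does at index time can be approximated in plain Python. The steps below (strip control characters, tokenize into words, remove punctuation, lowercase) are typical chain rules; this sketch only imitates them and does not call the ICU library:

```python
import unicodedata

def chain_sketch(text):
    """Imitate a simple ICU chain: drop control characters, split the
    text into word tokens, strip punctuation, and lowercase."""
    # remove control characters (category Cc)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    tokens = []
    for raw in text.split():  # word tokenization
        # remove punctuation (categories P*)
        tok = "".join(ch for ch in raw
                      if not unicodedata.category(ch).startswith("P"))
        if tok:
            tokens.append(tok.lower())  # casemap to lowercase
    return tokens

print(chain_sketch("Καλημέρα, Κόσμε!\x07"))  # -> ['καλημέρα', 'κόσμε']
```

In a real deployment the tokenization and normalization come from the ICU rules themselves; run yaz-icu on sample input to see what a given chain file actually produces.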
+