X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Ffield-structure.xml;h=3a0a5f2535027830e1d86c71c7b7cfe00c834376;hb=a92270aafb3ba7b336bc2334ed7c44c631c1cb29;hp=6318dbe3f805bd686986fb077f2d3131edb5caf6;hpb=b9c1a6fcf5c4821d0190efdecbc14ea5d6c96aec;p=idzebra-moved-to-github.git

diff --git a/doc/field-structure.xml b/doc/field-structure.xml
index 6318dbe..3a0a5f2 100644
--- a/doc/field-structure.xml
+++ b/doc/field-structure.xml
@@ -1,5 +1,5 @@
  <chapter id="fields-and-charsets">
-  <!-- $Id: field-structure.xml,v 1.6 2006-11-23 09:03:50 marc Exp $ -->
+  <!-- $Id: field-structure.xml,v 1.8 2006-11-28 13:05:57 marc Exp $ -->
   <title>Field Structure and Character Sets
   </title>
   
@@ -103,6 +103,7 @@
        <para>
 	This is the filename of the character
 	map to be used for this index for field type.
+        See <xref linkend="character-map-files"/> for details.
        </para>
       </listitem></varlistentry>
     </variablelist>
@@ -112,10 +113,67 @@
   <section id="character-map-files">
    <title>The character map file format</title>
    <para>
-    The contents of the character map files are structured as follows:
+    The character map files define the word tokenization and
+    character normalization performed before text is inserted into
+    the inverted indexes. Zebra ships with the predefined character map
+    files <filename>tab/*.chr</filename>. Users may add or modify
+    maps according to their needs.
    </para>
 
+   <table id="querymodel-attribute-sets-table" frame="top">
+     <title>Character maps predefined in Zebra</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>File name</entry>
+         <entry>Intended type</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+       <tbody>
+        <row>
+         <entry><literal>numeric.chr</literal></entry>
+         <entry><literal>:n</literal></entry>
+         <entry>Numeric digit tokenization and normalization map. All
+         characters not in the set <literal>-{0-9}.,</literal> are
+         suppressed. Note that floating point numbers are handled
+         correctly, but numbers in scientific (exponential) notation are mangled.</entry>
+        </row>
+        <row>
+         <entry><literal>scan.chr</literal></entry>
+         <entry><literal>:w</literal> or <literal>:p</literal></entry>
+         <entry>Word tokenization character map for Scandinavian
+         languages. It resembles the generic word tokenization
+         character map <literal>tab/string.chr</literal>; the main
+         differences are the sort order of the special characters
+        <literal>üzæäøöå</literal> and the equivalence maps, which follow
+         Scandinavian language rules.</entry>
+        </row>
+        <row>
+         <entry><literal>string.chr</literal></entry>
+         <entry><literal>:w</literal> or <literal>:p</literal></entry>
+         <entry>General word tokenization and normalization character
+         map, mostly useful for English texts. Use it as a starting point
+         for your own language-specific tokenization and normalization maps.</entry>
+        </row>
+        <row>
+         <entry><literal>urx.chr</literal></entry>
+         <entry><literal>:u</literal></entry>
+         <entry>URL parsing and tokenization character map.</entry>
+        </row>
+        <row>
+         <entry><literal>@</literal></entry>
+         <entry><literal>:0</literal></entry>
+         <entry>Do-nothing character map used for literal binary
+         indexing. No file is associated with it, and no
+         normalization or tokenization is performed.</entry>
+        </row>
+      </tbody>
+     </tgroup>
+   </table>
+
    <para>
+    The contents of the character map files are structured as follows:
     <variablelist>
 
      <varlistentry>
@@ -170,16 +228,30 @@
 	</itemizedlist>
 
        </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        lowercase normalization and sorting order:
+        <screen>
+         lowercase {0-9}{a-y}üzæäøöå
+        </screen>
+       </para>
       </listitem></varlistentry>
      <varlistentry>
       <term>uppercase <replaceable>value-set</replaceable></term>
       <listitem>
        <para>
 	This directive introduces the
-	upper-case equivalencis to the value set (if any). The number and
+	upper-case equivalences to the value set (if any). The number and
 	order of the entries in the list should be the same as in the
 	<literal>lowercase</literal> directive.
        </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        uppercase equivalents:
+        <screen>
+         uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
+        </screen>
+       </para>
       </listitem></varlistentry>
      <varlistentry>
       <term>space <replaceable>value-set</replaceable></term>
@@ -194,6 +266,13 @@
 	<literal>uppercase</literal> and <literal>lowercase</literal>
 	directives.
        </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        space instruction:
+        <screen><![CDATA[
+         space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
+        ]]></screen>
+       </para>
       </listitem></varlistentry>
      <varlistentry>
       <term>map <replaceable>value-set</replaceable>
@@ -204,7 +283,7 @@
 	members of the value-set on the left to the character on the
 	right. The character on the right must occur in the value
 	set (the <literal>lowercase</literal> directive) of the
-	character set, but it may be a paranthesis-enclosed
+	character set, but it may be a parenthesis-enclosed
 	multi-octet character. This directive may be used to map
 	diacritics to their base characters, or to map HTML-style
 	character-representations to their natural form, etc. The
@@ -213,6 +292,37 @@
 	transformations. See section <xref
 	 linkend="leading-articles"/>.
        </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        map instructions, among others, to make sure that HTML
+        entity-encoded Danish special characters are mapped to the
+        equivalent Latin-1 characters:
+        <screen><![CDATA[
+         map (&aelig;)      æ
+         map (&oslash;)     ø
+         map (&aring;)      å
+        ]]></screen>
+       </para>
+      </listitem></varlistentry>
+     <varlistentry>
+      <term>equivalent <replaceable>value-set</replaceable></term>
+      <listitem>
+       <para>
+	This directive introduces equivalence classes of characters
+	and/or strings for sorting purposes only. It resembles the
+	<literal>map</literal> directive, but it does not affect search
+	and retrieval indexing, only the sort order under present requests.
+       </para>
+       <para>
+        For example, <literal>scan.chr</literal> contains the following
+        equivalent sorting instructions, which are commented out by default:
+        <screen><![CDATA[
+         # equivalent æä(ae)
+         # equivalent øö(oe)
+         # equivalent å(aa)
+         # equivalent uü
+        ]]></screen>
+       </para>
       </listitem></varlistentry>
     </variablelist>
    </para>
@@ -262,134 +372,6 @@
    </para>
   </section>
 
-  <section id="default-idx-zebra">
-   <title>Accessing Zebra internal record data using 
-    the <literal>zebra::</literal> element sets</title>
-   <para>
-    Starting with <literal>Zebra</literal> version
-    <literal>2.0.4-2</literal> or newer, one has the possibility to
-    use the special
-    <literal>zebra::data</literal>,
-    <literal>zebra::meta</literal> and 
-    <literal>zebra::index</literal> element set names.
-   </para>
-   <note>
-    <para>
-     Usage of the <literal>zebra::</literal> element sets accesses
-     record data directly from the internal storage, and will
-     therefore work exactly the same way, irrespectively of indexing
-     filter used. 
-    </para>
-    <para>
-     These element set names are optimized for retrieval speed, and
-     will perform better than using for example
-     <literal>alvis</literal> filter XSLT based extraction of small
-     parts of the records.  
-    </para>
-   </note>
-   <para>
-    For example, to  fetch the raw binary record data stored in the
-    zebra internal storage, or on the filesystem, the following
-    commands can be issued:
-    <screen>
-      Z> f @attr 1=title my
-      Z> format xml
-      Z> elements zebra::data
-      Z> s 1+1
-      Z> format sutrs
-      Z> s 1+1
-      Z> format usmarc
-      Z> s 1+1
-    </screen>
-    </para>
-   <note>
-    <para>
-     The special 
-     <literal>zebra::data</literal> element set name is 
-     defined for any record syntax, but will always fetch  
-     the raw record data in exactly the original form. No record syntax
-     specific transformations will be applied to the raw record data. 
-    </para>
-   </note>
-   <para>
-    Also, Zebra internal metadata about the record can be accessed: 
-    <screen>
-      Z> f @attr 1=title my
-      Z> format xml
-      Z> elements zebra::meta::sysno
-      Z> s 1+1
-    </screen> 
-    displays in <literal>XML</literal> record syntax only internal
-    record system number, whereas 
-    <screen>
-      Z> f @attr 1=title my
-      Z> format xml
-      Z> elements zebra::meta
-      Z> s 1+1
-    </screen> 
-    displays all available metadata on the record. These include sytem
-      number, database name,  indexed filename,  filter used for indexing,
-      score and static ranking information and finally bytesize of record.
-   </para>
-   <note>
-    <para>
-     The special 
-     <literal>zebra::meta</literal> element set names are only 
-     defined for
-     <literal>SUTRS</literal> and <literal>XML</literal> record
-     syntaxes. 
-    </para>
-   </note>
-   <para>
-    Sometimes, it is very hard to figure out what exactly has been
-    indexed how and in which indexes. Using the indexing stylesheet of
-    the Alvis filter, one can at least see which portion of the record
-    went into which index, but a similar aid does not exist for all
-    other indexing filters.  
-   </para>
-   <para>
-    The special
-    <literal>zebra::index</literal> element set names are provided to
-    access information on per record indexed fields. For example, the
-    queries 
-    <screen>
-      Z> f @attr 1=title my
-      Z> format sutrs
-      Z> elements zebra::index
-      Z> s 1+1
-    </screen>
-    will display all indexed tokens from all indexed fields of the
-    first record, and it will display in <literal>SUTRS</literal>
-    record syntax, whereas 
-    <screen>
-      Z> f @attr 1=title my
-      Z> format xml
-      Z> elements zebra::index::title
-      Z> s 1+1
-      Z> elements zebra::index::title:p
-      Z> s 1+1
-    </screen> 
-    displays in <literal>XML</literal> record syntax only the content
-      of the zebra string index <literal>title</literal>, or
-      even only the type <literal>p</literal> phrase indexed part of it.
-   </para>
-   <note>
-    <para>
-     The special <literal>zebra::index</literal> 
-     element set names are only 
-     defined for
-     <literal>SUTRS</literal> and <literal>XML</literal> record
-     syntaxes. 
-    </para>
-    <para> Trying to access numeric <literal>Bib-1</literal> use
-    attributes or trying to access non-existent zebra intern string
-    access points will result in a
-    <literal>
-     Diagnostic [25]: Specified element set name not valid for specified database
-    </literal>
-    </para>
-   </note>
-  </section>
  </chapter>
  <!-- Keep this comment at the end of the file
  Local variables: