X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Ffield-structure.xml;fp=doc%2Ffield-structure.xml;h=c354795e21e4afc1652f56a833860fcf2f04fd8b;hp=0000000000000000000000000000000000000000;hb=37dc985516f52f34fc8434cc8beb982bb0c8988f;hpb=819007639f67bdf6a147a8fc5e66c7fbad9ada6a

diff --git a/doc/field-structure.xml b/doc/field-structure.xml
new file mode 100644
index 0000000..c354795
--- /dev/null
+++ b/doc/field-structure.xml
@@ -0,0 +1,257 @@
+ <chapter id="fields-and-charsets">
+  <!-- $Id: field-structure.xml,v 1.1 2006-09-03 21:37:26 adam Exp $ -->
+  <title>Field Structure and Character Sets
+  </title>
+  
+  <para>
+   In order to provide a flexible approach to national character set
+   handling, Zebra allows the administrator to configure the set up the
+   system to handle any 8-bit character set &mdash; including sets that
+   require multi-octet diacritics or other multi-octet characters. The
+   definition of a character set includes a specification of the
+   permissible values, their sort order (this affects the display in the
+   SCAN function), and relationships between upper- and lowercase
+   characters. Finally, the definition includes the specification of
+   space characters for the set.
+  </para>
+
+  <para>
+   The operator can define different character sets for different fields,
+   typical examples being standard text fields, numerical fields, and
+   special-purpose fields such as WWW-style linkages (URx).
+  </para>
+
+  <section id="default-idx-file">
+   <title>The default.idx file</title>
+   <para>
+    The field types, and hence character sets, are associated with data
+    elements by the .abs files (see above).
+    The file <literal>default.idx</literal>
+    provides the association between field type codes (as used in the .abs
+    files) and the character map files (with the .chr suffix). The format
+    of the .idx file is as follows
+   </para>
+
+   <para>
+    <variablelist>
+
+     <varlistentry>
+      <term>index <emphasis>field type code</emphasis></term>
+      <listitem>
+       <para>
+	This directive introduces a new search index code.
+	The argument is a one-character code to be used in the
+	.abs files to select this particular index type. An index, roughly,
+	corresponds to a particular structure attribute during search. Refer
+	to <xref linkend="search"/>.
+       </para>
+      </listitem></varlistentry>
+     <varlistentry>
+      <term>sort <emphasis>field code type</emphasis></term>
+      <listitem>
+       <para>
+	This directive introduces a 
+	sort index. The argument is a one-character code to be used in the
+	.abs fie to select this particular index type. The corresponding
+	use attribute must be used in the sort request to refer to this
+	particular sort index. The corresponding character map (see below)
+	is used in the sort process.
+       </para>
+      </listitem></varlistentry>
+     <varlistentry>
+      <term>completeness <emphasis>boolean</emphasis></term>
+      <listitem>
+       <para>
+	This directive enables or disables complete field indexing.
+	The value of the <emphasis>boolean</emphasis> should be 0
+	(disable) or 1. If completeness is enabled, the index entry will
+	contain the complete contents of the field (up to a limit), with words
+	(non-space characters) separated by single space characters
+	(normalized to " " on display). When completeness is
+	disabled, each word is indexed as a separate entry. Complete subfield
+	indexing is most useful for fields which are typically browsed (eg.
+	titles, authors, or subjects), or instances where a match on a
+	complete subfield is essential (eg. exact title searching). For fields
+	where completeness is disabled, the search engine will interpret a
+	search containing space characters as a word proximity search.
+       </para>
+      </listitem></varlistentry>
+     <varlistentry>
+      <term>charmap <emphasis>filename</emphasis></term>
+      <listitem>
+       <para>
+	This is the filename of the character
+	map to be used for this index for field type.
+       </para>
+      </listitem></varlistentry>
+    </variablelist>
+   </para>
+  </section>
+
+  <section id="character-map-files">
+   <title>The character map file format</title>
+   <para>
+    The contents of the character map files are structured as follows:
+   </para>
+
+   <para>
+    <variablelist>
+
+     <varlistentry>
+      <term>lowercase <emphasis>value-set</emphasis></term>
+      <listitem>
+       <para>
+	This directive introduces the basic value set of the field type.
+	The format is an ordered list (without spaces) of the
+	characters which may occur in "words" of the given type.
+	The order of the entries in the list determines the
+	sort order of the index. In addition to single characters, the
+	following combinations are legal:
+       </para>
+
+       <para>
+
+	<itemizedlist>
+	 <listitem>
+	  <para>
+	   Backslashes may be used to introduce three-digit octal, or
+	   two-digit hex representations of single characters
+	   (preceded by <literal>x</literal>).
+	   In addition, the combinations
+	   \\, \\r, \\n, \\t, \\s (space &mdash; remember that real
+	   space-characters may not occur in the value definition), and
+	   \\ are recognized, with their usual interpretation.
+	  </para>
+	 </listitem>
+
+	 <listitem>
+	  <para>
+	   Curly braces {} may be used to enclose ranges of single
+	   characters (possibly using the escape convention described in the
+	   preceding point), eg. {a-z} to introduce the
+	   standard range of ASCII characters.
+	   Note that the interpretation of such a range depends on
+	   the concrete representation in your local, physical character set.
+	  </para>
+	 </listitem>
+
+	 <listitem>
+	  <para>
+	   paranthesises () may be used to enclose multi-byte characters -
+	   eg. diacritics or special national combinations (eg. Spanish
+	   "ll"). When found in the input stream (or a search term),
+	   these characters are viewed and sorted as a single character, with a
+	   sorting value depending on the position of the group in the value
+	   statement.
+	  </para>
+	 </listitem>
+
+	</itemizedlist>
+
+       </para>
+      </listitem></varlistentry>
+     <varlistentry>
+      <term>uppercase <emphasis>value-set</emphasis></term>
+      <listitem>
+       <para>
+	This directive introduces the
+	upper-case equivalencis to the value set (if any). The number and
+	order of the entries in the list should be the same as in the
+	<literal>lowercase</literal> directive.
+       </para>
+      </listitem></varlistentry>
+     <varlistentry>
+      <term>space <emphasis>value-set</emphasis></term>
+      <listitem>
+       <para>
+	This directive introduces the character
+	which separate words in the input stream. Depending on the
+	completeness mode of the field in question, these characters either
+	terminate an index entry, or delimit individual "words" in
+	the input stream. The order of the elements is not significant &mdash;
+	otherwise the representation is the same as for the
+	<literal>uppercase</literal> and <literal>lowercase</literal>
+	directives.
+       </para>
+      </listitem></varlistentry>
+     <varlistentry>
+      <term>map <emphasis>value-set</emphasis>
+       <emphasis>target</emphasis></term>
+      <listitem>
+       <para>
+	This directive introduces a mapping between each of the
+	members of the value-set on the left to the character on the
+	right. The character on the right must occur in the value
+	set (the <literal>lowercase</literal> directive) of the
+	character set, but it may be a paranthesis-enclosed
+	multi-octet character. This directive may be used to map
+	diacritics to their base characters, or to map HTML-style
+	character-representations to their natural form, etc. The
+	map directive can also be used to ignore leading articles in
+	searching and/or sorting, and to perform other special
+	transformations. See section <xref
+	 linkend="leading-articles"/>.
+       </para>
+      </listitem></varlistentry>
+    </variablelist>
+   </para>
+  </section>
+  <section id="leading-articles">
+   <title>Ignoring leading articles</title>
+   <para>
+    In addition to specifying sort orders, space (blank) handling,
+    and upper/lowercase folding, you can also use the character map
+    files to make Zebra ignore leading articles in sorting records,
+    or when doing complete field searching.
+   </para>
+   <para>
+    This is done using the <literal>map</literal> directive in the
+    character map file. In a nutshell, what you do is map certain
+    sequences of characters, when they occur <emphasis> in the
+     beginning of a field</emphasis>, to a space. Assuming that the
+    character "@" is defined as a space character in your file, you
+    can do:
+    <screen>
+     map (^The\s) @
+     map (^the\s) @
+    </screen>
+    The effect of these directives is to map either 'the' or 'The',
+    followed by a space character, to a space. The hat ^ character
+    denotes beginning-of-field only when complete-subfield indexing
+    or sort indexing is taking place; otherwise, it is treated just
+    as any other character.
+   </para>
+   <para>
+    Because the <literal>default.idx</literal> file can be used to
+    associate different character maps with different indexing types
+    -- and you can create additional indexing types, should the need
+    arise -- it is possible to specify that leading articles should
+    be ignored either in sorting, in complete-field searching, or
+    both.
+   </para>
+   <para>
+    If you ignore certain prefixes in sorting, then these will be
+    eliminated from the index, and sorting will take place as if
+    they weren't there. However, if you set the system up to ignore
+    certain prefixes in <emphasis>searching</emphasis>, then these
+    are deleted both from the indexes and from query terms, when the
+    client specifies complete-field searching. This has the effect
+    that a search for 'the science journal' and 'science journal'
+    would both produce the same results.
+   </para>
+  </section>
+ </chapter>
+ <!-- Keep this comment at the end of the file
+ Local variables:
+ mode: sgml
+ sgml-omittag:t
+ sgml-shorttag:t
+ sgml-minimize-attributes:nil
+ sgml-always-quote-attributes:t
+ sgml-indent-step:1
+ sgml-indent-data:t
+ sgml-parent-document: "zebra.xml"
+ sgml-local-catalogs: nil
+ sgml-namecase-general:t
+ End:
+ -->