<chapter id="fields-and-charsets">
- <!-- $Id: field-structure.xml,v 1.1 2006-09-03 21:37:26 adam Exp $ -->
<title>Field Structure and Character Sets
</title>
<para>
In order to provide a flexible approach to national character set
- handling, Zebra allows the administrator to configure the set up the
+ handling, &zebra; allows the administrator to configure the set up the
system to handle any 8-bit character set — including sets that
require multi-octet diacritics or other multi-octet characters. The
definition of a character set includes a specification of the
special-purpose fields such as WWW-style linkages (URx).
</para>
+ <para>
+ Zebra 1.3 and Zebra versions 2.0.18 and earlier required that the field
+ type is a single character, e.g. <literal>w</literal> (for word), and
+ <literal>p</literal> for phrase. Zebra 2.0.20 and later allow field types
+ to be any string. This allows for greater flexibility - in particular
+ per-locale (language) fields can be defined.
+ </para>
+
+ <para>
+ Version 2.1 of Zebra can also be configured - per field - to use the
+ <ulink url="&url.icu;">ICU</ulink> library to perform tokenization and
+ normalization of strings. This is an alternative to the "charmap"
+ files which has been part of Zebra since its first release.
+ </para>
+
<section id="default-idx-file">
<title>The default.idx file</title>
<para>
The field types, and hence character sets, are associated with data
- elements by the .abs files (see above).
- The file <literal>default.idx</literal>
- provides the association between field type codes (as used in the .abs
- files) and the character map files (with the .chr suffix). The format
+ elements by the indexing rules (say <literal>title:w</literal>) in the
+ various filters. Fields are defined in a field definition file which,
+ by default, is called <filename>default.idx</filename>.
+ This file provides the association between field type codes
+ and the character map files (with the .chr suffix). The format
of the .idx file is as follows
</para>
-
+
<para>
<variablelist>
<varlistentry>
- <term>index <emphasis>field type code</emphasis></term>
+ <term>index <replaceable>field type code</replaceable></term>
<listitem>
<para>
This directive introduces a new search index code.
The argument is a one-character code to be used in the
.abs files to select this particular index type. An index, roughly,
corresponds to a particular structure attribute during search. Refer
- to <xref linkend="search"/>.
+ to <xref linkend="zebrasrv-search"/>.
</para>
</listitem></varlistentry>
<varlistentry>
- <term>sort <emphasis>field code type</emphasis></term>
+ <term>sort <replaceable>field code type</replaceable></term>
<listitem>
<para>
This directive introduces a
</para>
</listitem></varlistentry>
<varlistentry>
- <term>completeness <emphasis>boolean</emphasis></term>
+ <term>completeness <replaceable>boolean</replaceable></term>
<listitem>
<para>
This directive enables or disables complete field indexing.
- The value of the <emphasis>boolean</emphasis> should be 0
+ The value of the <replaceable>boolean</replaceable> should be 0
(disable) or 1. If completeness is enabled, the index entry will
contain the complete contents of the field (up to a limit), with words
(non-space characters) separated by single space characters
(normalized to " " on display). When completeness is
disabled, each word is indexed as a separate entry. Complete subfield
- indexing is most useful for fields which are typically browsed (eg.
+ indexing is most useful for fields which are typically browsed (e.g.,
titles, authors, or subjects), or instances where a match on a
- complete subfield is essential (eg. exact title searching). For fields
+ complete subfield is essential (e.g., exact title searching). For fields
where completeness is disabled, the search engine will interpret a
search containing space characters as a word proximity search.
</para>
</listitem></varlistentry>
+
+ <varlistentry id="default.idx.firstinfield">
+ <term>firstinfield <replaceable>boolean</replaceable></term>
+ <listitem>
+ <para>
+ This directive enables or disables first-in-field indexing.
+ The value of the <replaceable>boolean</replaceable> should be 0
+ (disable) or 1.
+ </para>
+ </listitem></varlistentry>
+
+ <varlistentry id="default.idx.alwaysmatches">
+ <term>alwaysmatches <replaceable>boolean</replaceable></term>
+ <listitem>
+ <para>
+ This directive enables or disables alwaysmatches indexing.
+ The value of the <replaceable>boolean</replaceable> should be 0
+ (disable) or 1.
+ </para>
+ </listitem></varlistentry>
+
<varlistentry>
- <term>charmap <emphasis>filename</emphasis></term>
+ <term>charmap <replaceable>filename</replaceable></term>
<listitem>
<para>
This is the filename of the character
map to be used for this index for field type.
+ See <xref linkend="character-map-files"/> for details.
+ </para>
+ </listitem></varlistentry>
+
+ <varlistentry>
+ <term>icuchain <replaceable>filename</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the filename with ICU tokenization and
+ normalization rules.
+ See <xref linkend="icuchain-files"/> for details.
+ Using icuchain for a field type is an alternative to
+ charmap. It does not make sense to define both
+ icuchain and charmap for the same field type.
</para>
</listitem></varlistentry>
</variablelist>
</para>
+ <example id="field-types">
+ <title>Field types</title>
+ <para>
+ Following are three excerpts of the standard
+ <filename>tab/default.idx</filename> configuration file. Notice
+ that the <literal>index</literal> and <literal>sort</literal>
+ are grouping directives, which bind all other following directives
+ to them:
+ <screen>
+ # Traditional word index
+ # Used if completeness is 'incomplete field' (@attr 6=1) and
+ # structure is word/phrase/word-list/free-form-text/document-text
+ index w
+ completeness 0
+ position 1
+ alwaysmatches 1
+ firstinfield 1
+ charmap string.chr
+
+ ...
+
+ # Null map index (no mapping at all)
+ # Used if structure=key (@attr 4=3)
+ index 0
+ completeness 0
+ position 1
+ charmap @
+
+ ...
+
+ # Sort register
+ sort s
+ completeness 1
+ charmap string.chr
+ </screen>
+ </para>
+ </example>
</section>
<section id="character-map-files">
- <title>The character map file format</title>
+ <title>Charmap Files</title>
<para>
- The contents of the character map files are structured as follows:
+ The character map files are used to define the word tokenization
+ and character normalization performed before inserting text into
+ the inverse indexes. &zebra; ships with the predefined character map
+ files <filename>tab/*.chr</filename>. Users are allowed to add
+ and/or modify maps according to their needs.
</para>
+ <table id="character-map-table" frame="top">
+ <title>Character maps predefined in &zebra;</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>File name</entry>
+ <entry>Intended type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>numeric.chr</literal></entry>
+ <entry><literal>:n</literal></entry>
+ <entry>Numeric digit tokenization and normalization map. All
+ characters not in the set <literal>-{0-9}.,</literal> will be
+ suppressed. Note that floating point numbers are processed
+ fine, but scientific exponential numbers are trashed.</entry>
+ </row>
+ <row>
+ <entry><literal>scan.chr</literal></entry>
+ <entry><literal>:w or :p</literal></entry>
+ <entry>Word tokenization char map for Scandinavian
+ languages. This one resembles the generic word tokenization
+ character map <literal>tab/string.chr</literal>, the main
+ differences are sorting of the special characters
+ <literal>üzæäøöå</literal> and equivalence maps according to
+ Scandinavian language rules.</entry>
+ </row>
+ <row>
+ <entry><literal>string.chr</literal></entry>
+ <entry><literal>:w or :p</literal></entry>
+ <entry>General word tokenization and normalization character
+ map, mostly useful for English texts. Use this to derive your
+ own language tokenization and normalization derivatives.</entry>
+ </row>
+ <row>
+ <entry><literal>urx.chr</literal></entry>
+ <entry><literal>:u</literal></entry>
+ <entry>URL parsing and tokenization character map.</entry>
+ </row>
+ <row>
+ <entry><literal>@</literal></entry>
+ <entry><literal>:0</literal></entry>
+ <entry>Do-nothing character map used for literal binary
+ indexing. There is no existing file associated to it, and
+ there is no normalization or tokenization performed at all.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
<para>
+ The contents of the character map files are structured as follows:
<variablelist>
+ <varlistentry>
+ <term>encoding <replaceable>encoding-name</replaceable></term>
+ <listitem>
+ <para>
+ This directive must be at the very beginning of the file, and it
+ specifies the character encoding used in the entire file. If
+ omitted, the encoding <literal>ISO-8859-1</literal> is assumed.
+ </para>
+ <para>
+ For example, one of the test files found at
+ <literal>test/rusmarc/tab/string.chr</literal> contains the following
+ encoding directive:
+ <screen>
+ encoding koi8-r
+ </screen>
+ and the test file
+ <literal>test/charmap/string.utf8.chr</literal> is encoded
+ in UTF-8:
+ <screen>
+ encoding utf-8
+ </screen>
+ </para>
+ </listitem></varlistentry>
<varlistentry>
- <term>lowercase <emphasis>value-set</emphasis></term>
+ <term>lowercase <replaceable>value-set</replaceable></term>
<listitem>
<para>
This directive introduces the basic value set of the field type.
<para>
Curly braces {} may be used to enclose ranges of single
characters (possibly using the escape convention described in the
- preceding point), eg. {a-z} to introduce the
+ preceding point), e.g., {a-z} to introduce the
standard range of ASCII characters.
Note that the interpretation of such a range depends on
the concrete representation in your local, physical character set.
<listitem>
<para>
- paranthesises () may be used to enclose multi-byte characters -
- eg. diacritics or special national combinations (eg. Spanish
+ parentheses () may be used to enclose multi-byte characters -
+ e.g., diacritics or special national combinations (e.g., Spanish
"ll"). When found in the input stream (or a search term),
these characters are viewed and sorted as a single character, with a
sorting value depending on the position of the group in the value
</itemizedlist>
</para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ lowercase normalization and sorting order:
+ <screen>
+ lowercase {0-9}{a-y}üzæäøöå
+ </screen>
+ </para>
</listitem></varlistentry>
<varlistentry>
- <term>uppercase <emphasis>value-set</emphasis></term>
+ <term>uppercase <replaceable>value-set</replaceable></term>
<listitem>
<para>
This directive introduces the
- upper-case equivalencis to the value set (if any). The number and
+ upper-case equivalences to the value set (if any). The number and
order of the entries in the list should be the same as in the
<literal>lowercase</literal> directive.
</para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ uppercase equivalent:
+ <screen>
+ uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
+ </screen>
+ </para>
</listitem></varlistentry>
<varlistentry>
- <term>space <emphasis>value-set</emphasis></term>
+ <term>space <replaceable>value-set</replaceable></term>
<listitem>
<para>
This directive introduces the character
<literal>uppercase</literal> and <literal>lowercase</literal>
directives.
</para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ space instruction:
+ <screen><![CDATA[
+ space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
+ ]]></screen>
+ </para>
</listitem></varlistentry>
<varlistentry>
- <term>map <emphasis>value-set</emphasis>
- <emphasis>target</emphasis></term>
+ <term>map <replaceable>value-set</replaceable>
+ <replaceable>target</replaceable></term>
<listitem>
<para>
This directive introduces a mapping between each of the
members of the value-set on the left to the character on the
right. The character on the right must occur in the value
set (the <literal>lowercase</literal> directive) of the
- character set, but it may be a paranthesis-enclosed
+ character set, but it may be a parenthesis-enclosed
multi-octet character. This directive may be used to map
diacritics to their base characters, or to map HTML-style
character-representations to their natural form, etc. The
map directive can also be used to ignore leading articles in
searching and/or sorting, and to perform other special
- transformations. See section <xref
- linkend="leading-articles"/>.
+ transformations.
+ </para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ map instructions among others, to make sure that HTML entity
+ encoded Danish special characters are mapped to the
+ equivalent Latin-1 characters:
+ <screen><![CDATA[
+ map (æ) æ
+ map (ø) ø
+ map (å) å
+ ]]></screen>
+ </para>
+ <para>
+ In addition to specifying sort orders, space (blank) handling,
+ and upper/lowercase folding, you can also use the character map
+ files to make &zebra; ignore leading articles in sorting records,
+ or when doing complete field searching.
+ </para>
+ <para>
+ This is done using the <literal>map</literal> directive in the
+ character map file. In a nutshell, what you do is map certain
+ sequences of characters, when they occur <emphasis> in the
+ beginning of a field</emphasis>, to a space. Assuming that the
+ character "@" is defined as a space character in your file, you
+ can do:
+ <screen>
+ map (^The\s) @
+ map (^the\s) @
+ </screen>
+ The effect of these directives is to map either 'the' or 'The',
+ followed by a space character, to a space. The hat ^ character
+ denotes beginning-of-field only when complete-subfield indexing
+ or sort indexing is taking place; otherwise, it is treated just
+ as any other character.
+ </para>
+ <para>
+ Because the <literal>default.idx</literal> file can be used to
+ associate different character maps with different indexing types
+ -- and you can create additional indexing types, should the need
+ arise -- it is possible to specify that leading articles should
+ be ignored either in sorting, in complete-field searching, or
+ both.
+ </para>
+ <para>
+ If you ignore certain prefixes in sorting, then these will be
+ eliminated from the index, and sorting will take place as if
+ they weren't there. However, if you set the system up to ignore
+ certain prefixes in <emphasis>searching</emphasis>, then these
+ are deleted both from the indexes and from query terms, when the
+ client specifies complete-field searching. This has the effect
+ that a search for 'the science journal' and 'science journal'
+ would both produce the same results.
+ </para>
+ </listitem></varlistentry>
+ <varlistentry>
+ <term>equivalent <replaceable>value-set</replaceable></term>
+ <listitem>
+ <para>
+ This directive introduces equivalence classes of characters
+ and/or strings for sorting purposes only. It resembles the map
+ directive, but does not affect search and retrieval indexing,
+ but only sorting order under present requests.
+ </para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ equivalent sorting instructions, which can be uncommented:
+ <screen><![CDATA[
+ # equivalent æä(ae)
+ # equivalent øö(oe)
+ # equivalent å(aa)
+ # equivalent uü
+ ]]></screen>
</para>
</listitem></varlistentry>
</variablelist>
</para>
</section>
- <section id="leading-articles">
- <title>Ignoring leading articles</title>
+
+ <section id="icuchain-files">
+ <title>ICU Chain Files</title>
<para>
- In addition to specifying sort orders, space (blank) handling,
- and upper/lowercase folding, you can also use the character map
- files to make Zebra ignore leading articles in sorting records,
- or when doing complete field searching.
+ The <ulink url="&url.icu;">ICU</ulink> chain files defines a
+ <emphasis>chain</emphasis> of rules
+ which specify the conversion process to be carried out for each
+ record string for indexing.
</para>
<para>
- This is done using the <literal>map</literal> directive in the
- character map file. In a nutshell, what you do is map certain
- sequences of characters, when they occur <emphasis> in the
- beginning of a field</emphasis>, to a space. Assuming that the
- character "@" is defined as a space character in your file, you
- can do:
- <screen>
- map (^The\s) @
- map (^the\s) @
- </screen>
- The effect of these directives is to map either 'the' or 'The',
- followed by a space character, to a space. The hat ^ character
- denotes beginning-of-field only when complete-subfield indexing
- or sort indexing is taking place; otherwise, it is treated just
- as any other character.
+ Both searching and sorting is based on the <emphasis>sort</emphasis>
+ normalization that ICU provides. This means that scan and sort will
+ return terms in the sort order given by ICU.
</para>
<para>
- Because the <literal>default.idx</literal> file can be used to
- associate different character maps with different indexing types
- -- and you can create additional indexing types, should the need
- arise -- it is possible to specify that leading articles should
- be ignored either in sorting, in complete-field searching, or
- both.
+ Zebra is using YAZ' ICU wrapper. Refer to the
+ <ulink url="&url.yaz.yaz-icu;">yaz-icu man page</ulink> for
+ documentation about the ICU chain rules.
</para>
+ <tip>
+ <para>
+ Use the yaz-icu program to test your icuchain rules.
+ </para>
+ </tip>
+ <example id="indexing-greek-example"><title>Indexing Greek text</title>
+ <para>
+ Consider a system where all "regular" text is to be indexed
+ using as Greek (locale: EL).
+ We would have to change our index type file - to read
+ <screen>
+ # Index greek words
+ index w
+ completeness 0
+ position 1
+ alwaysmatches 1
+ firstinfield 1
+ icuahain greek.xml
+ ..
+ </screen>
+ The ICU chain file <filename>greek.xml</filename> could look
+ as follows:
+ <screen><![CDATA[
+ <icu_chain locale="el">
+ <transform rule="[:Control:] Any-Remove"/>
+ <tokenize rule="l"/>
+ <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
+ <display/>
+ <casemap rule="l"/>
+ </icu_chain>
+ ]]></screen>
+ </para>
+ </example>
<para>
- If you ignore certain prefixes in sorting, then these will be
- eliminated from the index, and sorting will take place as if
- they weren't there. However, if you set the system up to ignore
- certain prefixes in <emphasis>searching</emphasis>, then these
- are deleted both from the indexes and from query terms, when the
- client specifies complete-field searching. This has the effect
- that a search for 'the science journal' and 'science journal'
- would both produce the same results.
+ Zebra is shipped with a field types file <filename>icu.idx</filename>
+ which is an ICU chain version of <filename>default.idx</filename>.
</para>
+
+ <example id="indexing-marcxml-example"><title>MARCXML indexing using ICU</title>
+ <para>
+ The directory <filename>examples/marcxml</filename> includes
+ a complete sample with MARCXML records that are DOM XML indexed
+ using ICU chain rules. Study the
+ <filename>README</filename> in the <filename>marcxml</filename>
+ directory for details.
+ </para>
+ </example>
</section>
+
</chapter>
<!-- Keep this comment at the end of the file
Local variables: