<chapter id="fields-and-charsets">
- <!-- $Id: field-structure.xml,v 1.6 2006-11-23 09:03:50 marc Exp $ -->
+ <!-- $Id: field-structure.xml,v 1.12 2007-02-02 09:58:39 marc Exp $ -->
<title>Field Structure and Character Sets
</title>
<para>
In order to provide a flexible approach to national character set
- handling, Zebra allows the administrator to configure the set up the
+ handling, &zebra; allows the administrator to set up the
system to handle any 8-bit character set — including sets that
require multi-octet diacritics or other multi-octet characters. The
definition of a character set includes a specification of the
<para>
This is the filename of the character
map to be used for this index for field type.
+ See <xref linkend="character-map-files"/> for details.
</para>
</listitem></varlistentry>
</variablelist>
</para>
+ <para>
+ The following three excerpts are taken from the standard
+ <filename>tab/default.idx</filename> configuration file. Notice
+ that <literal>index</literal> and <literal>sort</literal>
+ are grouping directives, and the directives following each of
+ them are bound to it:
+ <screen>
+ # Traditional word index
+ # Used if completeness is 'incomplete field' (@attr 6=1) and
+ # structure is word/phrase/word-list/free-form-text/document-text
+ index w
+ completeness 0
+ position 1
+ alwaysmatches 1
+ firstinfield 1
+ charmap string.chr
+
+ ...
+
+ # Null map index (no mapping at all)
+ # Used if structure=key (@attr 4=3)
+ index 0
+ completeness 0
+ position 1
+ charmap @
+
+ ...
+
+ # Sort register
+ sort s
+ completeness 1
+ charmap string.chr
+ </screen>
+ </para>
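+ <para>
+ As a minimal sketch, a site that wants its word index to follow
+ Scandinavian tokenization rules could bind the
+ <literal>w</literal> group to <filename>scan.chr</filename>
+ instead. The fragment below is an illustration only and is not
+ part of the shipped <filename>tab/default.idx</filename>; it
+ assumes that the other directives of the group stay as in the
+ excerpt above:
+ <screen>
+ # Hypothetical variant of the word index, using the Scandinavian map
+ index w
+ completeness 0
+ position 1
+ alwaysmatches 1
+ firstinfield 1
+ charmap scan.chr
+ </screen>
+ </para>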
</section>
<section id="character-map-files">
<title>The character map file format</title>
<para>
- The contents of the character map files are structured as follows:
+ The character map files define the word tokenization
+ and character normalization performed before text is inserted into
+ the inverted indexes. &zebra; ships with the predefined character map
+ files <filename>tab/*.chr</filename>. Users may add
+ or modify maps according to their needs.
</para>
+ <table id="character-map-table" frame="top">
+ <title>Character maps predefined in &zebra;</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>File name</entry>
+ <entry>Intended type</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>numeric.chr</literal></entry>
+ <entry><literal>:n</literal></entry>
+ <entry>Numeric digit tokenization and normalization map. All
+ characters not in the set <literal>-{0-9}.,</literal> are
+ suppressed. Floating-point numbers are handled correctly, but
+ numbers in scientific (exponential) notation are mangled.</entry>
+ </row>
+ <row>
+ <entry><literal>scan.chr</literal></entry>
+ <entry><literal>:w</literal> or <literal>:p</literal></entry>
+ <entry>Word tokenization character map for Scandinavian
+ languages. It resembles the generic word tokenization
+ character map <literal>tab/string.chr</literal>. The main
+ differences are the sorting of the special characters
+ <literal>üzæäøöå</literal> and the equivalence maps, which follow
+ Scandinavian language rules.</entry>
+ </row>
+ <row>
+ <entry><literal>string.chr</literal></entry>
+ <entry><literal>:w</literal> or <literal>:p</literal></entry>
+ <entry>General word tokenization and normalization character
+ map, mostly useful for English texts. Use it as the starting point
+ for your own language-specific tokenization and normalization maps.</entry>
+ </row>
+ <row>
+ <entry><literal>urx.chr</literal></entry>
+ <entry><literal>:u</literal></entry>
+ <entry>URL parsing and tokenization character map.</entry>
+ </row>
+ <row>
+ <entry><literal>@</literal></entry>
+ <entry><literal>:0</literal></entry>
+ <entry>Do-nothing character map used for literal binary
+ indexing. No file is associated with it, and
+ no normalization or tokenization is performed at all.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
<para>
+ The contents of the character map files are structured as follows:
<variablelist>
+ <varlistentry>
+ <term>encoding <replaceable>encoding-name</replaceable></term>
+ <listitem>
+ <para>
+ This directive must be at the very beginning of the file, and it
+ specifies the character encoding used in the entire file. If
+ omitted, the encoding <literal>ISO-8859-1</literal> is assumed.
+ </para>
+ <para>
+ For example, one of the test files found at
+ <literal>test/rusmarc/tab/string.chr</literal> contains the following
+ encoding directive:
+ <screen>
+ encoding koi8-r
+ </screen>
+ and the test file
+ <literal>test/charmap/string.utf8.chr</literal> is encoded
+ in UTF-8:
+ <screen>
+ encoding utf-8
+ </screen>
+ </para>
+ </listitem></varlistentry>
<varlistentry>
<term>lowercase <replaceable>value-set</replaceable></term>
</itemizedlist>
</para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ lowercase normalization and sorting order:
+ <screen>
+ lowercase {0-9}{a-y}üzæäøöå
+ </screen>
+ </para>
</listitem></varlistentry>
<varlistentry>
<term>uppercase <replaceable>value-set</replaceable></term>
<listitem>
<para>
This directive introduces the
- upper-case equivalencis to the value set (if any). The number and
+ upper-case equivalences to the value set (if any). The number and
order of the entries in the list should be the same as in the
<literal>lowercase</literal> directive.
</para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ uppercase equivalent:
+ <screen>
+ uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
+ </screen>
+ </para>
</listitem></varlistentry>
<varlistentry>
<term>space <replaceable>value-set</replaceable></term>
<literal>uppercase</literal> and <literal>lowercase</literal>
directives.
</para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ space instruction:
+ <screen><![CDATA[
+ space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
+ ]]></screen>
+ </para>
</listitem></varlistentry>
<varlistentry>
<term>map <replaceable>value-set</replaceable>
members of the value-set on the left to the character on the
right. The character on the right must occur in the value
set (the <literal>lowercase</literal> directive) of the
- character set, but it may be a paranthesis-enclosed
+ character set, but it may be a parenthesis-enclosed
multi-octet character. This directive may be used to map
diacritics to their base characters, or to map HTML-style
character-representations to their natural form, etc. The
transformations. See section <xref
linkend="leading-articles"/>.
</para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ map instructions among others, to make sure that HTML entity
+ encoded Danish special characters are mapped to the
+ equivalent Latin-1 characters:
+ <screen><![CDATA[
+ map (&aelig;) æ
+ map (&oslash;) ø
+ map (&aring;) å
+ ]]></screen>
+ </para>
+ </listitem></varlistentry>
+ <varlistentry>
+ <term>equivalent <replaceable>value-set</replaceable></term>
+ <listitem>
+ <para>
+ This directive introduces equivalence classes of characters
+ and/or strings for sorting purposes only. It resembles the
+ <literal>map</literal> directive, but it does not affect search
+ and retrieval indexing; it only affects the sort order used for
+ present requests.
+ </para>
+ <para>
+ For example, <literal>scan.chr</literal> contains the following
+ equivalent sorting instructions, which are commented out and can
+ be enabled by uncommenting them:
+ <screen><![CDATA[
+ # equivalent æä(ae)
+ # equivalent øö(oe)
+ # equivalent å(aa)
+ # equivalent uü
+ ]]></screen>
+ </para>
</listitem></varlistentry>
</variablelist>
</para>
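+ <para>
+ Taken together, the directives above are enough to sketch a small
+ custom map. The following file is hypothetical: the name
+ <filename>tab/german.chr</filename> and its contents are
+ assumptions made for illustration, not a map shipped with
+ &zebra;. It only shows how the directives fit together. The
+ <literal>uppercase</literal> list has the same number and order
+ of entries as the <literal>lowercase</literal> list, and every
+ <literal>map</literal> target occurs in the
+ <literal>lowercase</literal> value set:
+ <screen><![CDATA[
+ encoding iso-8859-1
+ # Hypothetical tab/german.chr, for illustration only
+ lowercase {0-9}{a-z}äöü
+ uppercase {0-9}{A-Z}ÄÖÜ
+ space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
+ map (&auml;) ä
+ map (&ouml;) ö
+ map (&uuml;) ü
+ ]]></screen>
+ </para>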
<para>
In addition to specifying sort orders, space (blank) handling,
and upper/lowercase folding, you can also use the character map
- files to make Zebra ignore leading articles in sorting records,
+ files to make &zebra; ignore leading articles in sorting records,
or when doing complete field searching.
</para>
<para>
</para>
</section>
- <section id="default-idx-zebra">
- <title>Accessing Zebra internal record data using
- the <literal>zebra::</literal> element sets</title>
- <para>
- Starting with <literal>Zebra</literal> version
- <literal>2.0.4-2</literal> or newer, one has the possibility to
- use the special
- <literal>zebra::data</literal>,
- <literal>zebra::meta</literal> and
- <literal>zebra::index</literal> element set names.
- </para>
- <note>
- <para>
- Usage of the <literal>zebra::</literal> element sets accesses
- record data directly from the internal storage, and will
- therefore work exactly the same way, irrespectively of indexing
- filter used.
- </para>
- <para>
- These element set names are optimized for retrieval speed, and
- will perform better than using for example
- <literal>alvis</literal> filter XSLT based extraction of small
- parts of the records.
- </para>
- </note>
- <para>
- For example, to fetch the raw binary record data stored in the
- zebra internal storage, or on the filesystem, the following
- commands can be issued:
- <screen>
- Z> f @attr 1=title my
- Z> format xml
- Z> elements zebra::data
- Z> s 1+1
- Z> format sutrs
- Z> s 1+1
- Z> format usmarc
- Z> s 1+1
- </screen>
- </para>
- <note>
- <para>
- The special
- <literal>zebra::data</literal> element set name is
- defined for any record syntax, but will always fetch
- the raw record data in exactly the original form. No record syntax
- specific transformations will be applied to the raw record data.
- </para>
- </note>
- <para>
- Also, Zebra internal metadata about the record can be accessed:
- <screen>
- Z> f @attr 1=title my
- Z> format xml
- Z> elements zebra::meta::sysno
- Z> s 1+1
- </screen>
- displays in <literal>XML</literal> record syntax only internal
- record system number, whereas
- <screen>
- Z> f @attr 1=title my
- Z> format xml
- Z> elements zebra::meta
- Z> s 1+1
- </screen>
- displays all available metadata on the record. These include sytem
- number, database name, indexed filename, filter used for indexing,
- score and static ranking information and finally bytesize of record.
- </para>
- <note>
- <para>
- The special
- <literal>zebra::meta</literal> element set names are only
- defined for
- <literal>SUTRS</literal> and <literal>XML</literal> record
- syntaxes.
- </para>
- </note>
- <para>
- Sometimes, it is very hard to figure out what exactly has been
- indexed how and in which indexes. Using the indexing stylesheet of
- the Alvis filter, one can at least see which portion of the record
- went into which index, but a similar aid does not exist for all
- other indexing filters.
- </para>
- <para>
- The special
- <literal>zebra::index</literal> element set names are provided to
- access information on per record indexed fields. For example, the
- queries
- <screen>
- Z> f @attr 1=title my
- Z> format sutrs
- Z> elements zebra::index
- Z> s 1+1
- </screen>
- will display all indexed tokens from all indexed fields of the
- first record, and it will display in <literal>SUTRS</literal>
- record syntax, whereas
- <screen>
- Z> f @attr 1=title my
- Z> format xml
- Z> elements zebra::index::title
- Z> s 1+1
- Z> elements zebra::index::title:p
- Z> s 1+1
- </screen>
- displays in <literal>XML</literal> record syntax only the content
- of the zebra string index <literal>title</literal>, or
- even only the type <literal>p</literal> phrase indexed part of it.
- </para>
- <note>
- <para>
- The special <literal>zebra::index</literal>
- element set names are only
- defined for
- <literal>SUTRS</literal> and <literal>XML</literal> record
- syntaxes.
- </para>
- <para> Trying to access numeric <literal>Bib-1</literal> use
- attributes or trying to access non-existent zebra intern string
- access points will result in a
- <literal>
- Diagnostic [25]: Specified element set name not valid for specified database
- </literal>
- </para>
- </note>
- </section>
</chapter>
<!-- Keep this comment at the end of the file
Local variables: