Field Structure and Character Sets
In order to provide a flexible approach to national character set
handling, &zebra; allows the administrator to configure the
system to handle any 8-bit character set — including sets that
require multi-octet diacritics or other multi-octet characters. The
definition of a character set includes a specification of the
permissible values, their sort order (this affects the display in the
SCAN function), and relationships between upper- and lowercase
characters. Finally, the definition includes the specification of
space characters for the set.
The operator can define different character sets for different fields,
typical examples being standard text fields, numerical fields, and
special-purpose fields such as WWW-style linkages (URx).
Zebra 1.3 and Zebra versions 2.0.18 and earlier required the field
type to be a single character, e.g. w (for word) and
p (for phrase). Zebra 2.0.20 and later allow field types
to be any string. This allows for greater flexibility - in particular,
per-locale (language) fields can be defined.
Version 2.0.20 of Zebra can also be configured - per field - to use the
ICU library to perform tokenization and
normalization of strings. This is an alternative to the "charmap"
files which have been part of Zebra since its first release.
The default.idx file
The field types, and hence character sets, are associated with data
elements by the indexing rules (say title:w) in the
various filters. Fields are defined in a field definition file which,
by default, is called default.idx.
This file provides the association between field type codes
and the character map files (with the .chr suffix). The format
of the .idx file is as follows:
index field type code
This directive introduces a new search index code.
The argument is a one-character code to be used in the
.abs files to select this particular index type. An index, roughly,
corresponds to a particular structure attribute during search. Refer
to .
sort field type code
This directive introduces a
sort index. The argument is a one-character code to be used in the
.abs files to select this particular index type. The corresponding
use attribute must be used in the sort request to refer to this
particular sort index. The corresponding character map (see below)
is used in the sort process.
completeness boolean
This directive enables or disables complete field indexing.
The value of the boolean should be 0
(disable) or 1 (enable). If completeness is enabled, the index entry will
contain the complete contents of the field (up to a limit), with words
(non-space characters) separated by single space characters
(normalized to " " on display). When completeness is
disabled, each word is indexed as a separate entry. Complete subfield
indexing is most useful for fields which are typically browsed (e.g.,
titles, authors, or subjects), or instances where a match on a
complete subfield is essential (e.g., exact title searching). For fields
where completeness is disabled, the search engine will interpret a
search containing space characters as a word proximity search.
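As a sketch of the difference (plain Python for illustration; this is not Zebra's implementation), the two completeness modes turn a field value into index entries roughly like this:

```python
def index_entries(field_value, completeness):
    # Normalize: fold to lowercase and split on runs of space characters.
    words = field_value.lower().split()
    if completeness:
        # Complete field indexing: one entry holding the whole field,
        # with words separated by single spaces.
        return [" ".join(words)]
    # Word indexing: every word becomes its own index entry.
    return words

print(index_entries("Zen and the  Art", 1))  # ['zen and the art']
print(index_entries("Zen and the  Art", 0))  # ['zen', 'and', 'the', 'art']
```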
firstinfield boolean
This directive enables or disables first-in-field indexing.
The value of the boolean should be 0
(disable) or 1 (enable).
alwaysmatches boolean
This directive enables or disables alwaysmatches indexing.
The value of the boolean should be 0
(disable) or 1 (enable).
charmap filename
This is the filename of the character
map to be used for this field type.
See for details.
icuchain filename
Specifies the filename with ICU tokenization and
normalization rules.
See for details.
Using icuchain for a field type is an alternative to
charmap. It does not make sense to define both
icuchain and charmap for the same field type.
Field types
The following are three excerpts from the standard
tab/default.idx configuration file. Notice
that index and sort
are grouping directives, which bind all subsequent directives
to them:
# Traditional word index
# Used if completeness is 'incomplete field' (@attr 6=1) and
# structure is word/phrase/word-list/free-form-text/document-text
index w
completeness 0
position 1
alwaysmatches 1
firstinfield 1
charmap string.chr
...
# Null map index (no mapping at all)
# Used if structure=key (@attr 4=3)
index 0
completeness 0
position 1
charmap @
...
# Sort register
sort s
completeness 1
charmap string.chr
Charmap Files
The character map files are used to define the word tokenization
and character normalization performed before inserting text into
the inverted indexes. &zebra; ships with the predefined character map
files tab/*.chr. Users are allowed to add
and/or modify maps according to their needs.
Character maps predefined in &zebra;:

numeric.chr (intended type :n)
Numeric digit tokenization and normalization map. All
characters not in the set -{0-9}., will be
suppressed. Note that floating point numbers are processed
fine, but scientific exponential numbers are trashed.

scan.chr (intended type :w or :p)
Word tokenization character map for Scandinavian
languages. It resembles the generic word tokenization
character map tab/string.chr; the main
differences are the sorting of the special characters
üzæäøöå and equivalence maps according to
Scandinavian language rules.

string.chr (intended type :w or :p)
General word tokenization and normalization character
map, mostly useful for English texts. Use this to derive your
own language tokenization and normalization maps.

urx.chr (intended type :u)
URL parsing and tokenization character map.

@ (intended type :0)
Do-nothing character map used for literal binary
indexing. There is no file associated with it, and
no normalization or tokenization is performed at all.
The contents of the character map files are structured as follows:
encoding encoding-name
This directive must be at the very beginning of the file, and it
specifies the character encoding used in the entire file. If
omitted, the encoding ISO-8859-1 is assumed.
For example, one of the test files found at
test/rusmarc/tab/string.chr contains the following
encoding directive:
encoding koi8-r
and the test file
test/charmap/string.utf8.chr is encoded
in UTF-8:
encoding utf-8
lowercase value-set
This directive introduces the basic value set of the field type.
The format is an ordered list (without spaces) of the
characters which may occur in "words" of the given type.
The order of the entries in the list determines the
sort order of the index. In addition to single characters, the
following combinations are legal:
Backslashes may be used to introduce three-digit octal or
two-digit hex representations of single characters
(the hex form preceded by x).
In addition, the combinations
\\, \r, \n, \t and \s (space — remember that real
space characters may not occur in the value definition)
are recognized, with their usual interpretation.
Curly braces {} may be used to enclose ranges of single
characters (possibly using the escape convention described in the
preceding point), e.g., {a-z} to introduce the
standard range of ASCII characters.
Note that the interpretation of such a range depends on
the concrete representation in your local, physical character set.
Parentheses () may be used to enclose multi-byte characters -
e.g., diacritics or special national combinations (e.g., Spanish
"ll"). When found in the input stream (or a search term),
these characters are viewed and sorted as a single character, with a
sorting value depending on the position of the group in the value
statement.
For example, scan.chr contains the following
lowercase normalization and sorting order:
lowercase {0-9}{a-y}üzæäøöå
uppercase value-set
This directive introduces the
upper-case equivalences to the value set (if any). The number and
order of the entries in the list should be the same as in the
lowercase directive.
For example, scan.chr contains the following
uppercase equivalent:
uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
space value-set
This directive introduces the characters
which separate words in the input stream. Depending on the
completeness mode of the field in question, these characters either
terminate an index entry, or delimit individual "words" in
the input stream. The order of the elements is not significant —
otherwise the representation is the same as for the
uppercase and lowercase
directives.
For example, scan.chr contains the following
space instruction:
space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
map value-set target
This directive maps each of the
members of the value-set on the left to the character on the
right. The character on the right must occur in the value
set (the lowercase directive) of the
character set, but it may be a parenthesis-enclosed
multi-octet character. This directive may be used to map
diacritics to their base characters, or to map HTML-style
character-representations to their natural form, etc. The
map directive can also be used to ignore leading articles in
searching and/or sorting, and to perform other special
transformations.
For example, scan.chr contains the following
map instructions among others, to make sure that HTML entity
encoded Danish special characters are mapped to the
equivalent Latin-1 characters:
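Entries of the following shape perform such a mapping (shown here for illustration; consult the shipped scan.chr for its exact lines):

map (&aelig;)    æ
map (&oslash;)   ø
map (&aring;)    å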
In addition to specifying sort orders, space (blank) handling,
and upper/lowercase folding, you can also use the character map
files to make &zebra; ignore leading articles in sorting records,
or when doing complete field searching.
This is done using the map directive in the
character map file. In a nutshell, what you do is map certain
sequences of characters, when they occur in the
beginning of a field, to a space. Assuming that the
character "@" is defined as a space character in your file, you
can do:
map (^The\s) @
map (^the\s) @
The effect of these directives is to map either 'the' or 'The',
followed by a space character, to a space. The hat ^ character
denotes beginning-of-field only when complete-subfield indexing
or sort indexing is taking place; otherwise, it is treated just
as any other character.
Because the default.idx file can be used to
associate different character maps with different indexing types
-- and you can create additional indexing types, should the need
arise -- it is possible to specify that leading articles should
be ignored either in sorting, in complete-field searching, or
both.
If you ignore certain prefixes in sorting, then these will be
eliminated from the index, and sorting will take place as if
they weren't there. However, if you set the system up to ignore
certain prefixes in searching, then these
are deleted both from the indexes and from query terms, when the
client specifies complete-field searching. This has the effect
that a search for 'the science journal' and 'science journal'
would both produce the same results.
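The effect can be sketched in Python (for illustration only; Zebra performs this inside its charmap machinery):

```python
import re

def complete_field_key(term):
    # Sketch of "map (^The\s) @" / "map (^the\s) @" with "@" declared
    # as a space character: a leading article becomes a space, and the
    # key is then normalized (lowercased, spaces collapsed and trimmed).
    term = re.sub(r"^[Tt]he\s+", " ", term)
    return " ".join(term.lower().split())

print(complete_field_key("The Science Journal"))  # science journal
print(complete_field_key("science journal"))      # science journal
```

Both terms normalize to the same complete-field key, so both searches match the same records.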
equivalent value-set
This directive introduces equivalence classes of characters
and/or strings for sorting purposes only. It resembles the map
directive, but it does not affect search and retrieval indexing;
it only affects the sorting order used when sorting is requested.
For example, scan.chr contains the following
equivalent sorting instructions, which can be uncommented:
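Entries of the following shape (illustrative; check the shipped scan.chr for the exact lines) make the special characters sort together with their two-letter equivalents:

equivalent æä(ae)
equivalent øö(oe)
equivalent å(aa)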
ICU Chain Files
The ICU chain files define a
chain of rules
which specify the conversion process to be carried out for each
record string for indexing.
Both searching and sorting are based on the sort
normalization that ICU provides. This means that scan and sort will
return terms in the sort order given by ICU.
Zebra uses the YAZ ICU wrapper. Refer to the
yaz-icu man page for
documentation about the ICU chain rules.
Use the yaz-icu program to test your icuchain rules.
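For example, assuming a chain file named greek.xml and a sample file input.txt (hypothetical names), the chain can be exercised from the command line; yaz-icu reads text on standard input and prints the resulting tokens:

yaz-icu -c greek.xml < input.txt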
Indexing Greek text
Consider a system where all "regular" text is to be indexed
as Greek (locale: el).
We would have to change our index type file to read
# Index greek words
index w
completeness 0
position 1
alwaysmatches 1
firstinfield 1
icuchain greek.xml
...
The ICU chain file greek.xml could look
as follows:
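A minimal sketch of such a chain (illustrative rules modelled on the YAZ ICU chain format; not a verbatim copy of any shipped file):

<icu_chain locale="el">
  <!-- strip control characters before tokenization -->
  <transform rule="[:Control:] Any-Remove"/>
  <!-- split the text into words -->
  <tokenize rule="w"/>
  <!-- drop whitespace and punctuation tokens -->
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <!-- fold terms to lowercase -->
  <casemap rule="l"/>
</icu_chain>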
Zebra is shipped with a field types file icu.idx
which is an ICU chain version of default.idx.
MARCXML indexing using ICU
The directory examples/marcxml includes
a complete sample with MARCXML records that are DOM XML indexed
using ICU chain rules. Study the
README in the marcxml
directory for details.