Field Structure and Character Sets

In order to provide a flexible approach to national character set handling, &zebra; allows the administrator to configure the system to handle any 8-bit character set, including sets that require multi-octet diacritics or other multi-octet characters. The definition of a character set includes a specification of the permissible values, their sort order (this affects the display in the SCAN function), and relationships between upper- and lowercase characters. Finally, the definition includes the specification of space characters for the set.

The operator can define different character sets for different fields, typical examples being standard text fields, numerical fields, and special-purpose fields such as WWW-style linkages (URx).

Zebra 1.3, and Zebra versions 2.0.18 and earlier, required that the field type be a single character, e.g. w (for word) and p (for phrase). Zebra 2.0.20 and later allow field types to be any string. This allows for greater flexibility; in particular, per-locale (language) fields can be defined. Version 2.0.20 of Zebra can also be configured, per field, to use the ICU library to perform tokenization and normalization of strings. This is an alternative to the "charmap" files which have been part of &zebra; since its first release.
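To illustrate the string-valued field types, the following is a hedged sketch of a per-locale word index as it might appear in default.idx; the type code w-el and the chain file greek.xml are hypothetical names chosen for this example, not shipped defaults:

    # Hypothetical per-locale field type (requires Zebra 2.0.20 or
    # later, where type codes may be arbitrary strings rather than
    # single characters)
    index w-el
    completeness 0
    position 1
    alwaysmatches 1
    firstinfield 1
    icuchain greek.xml

A filter could then select this type with an indexing rule such as title:w-el, keeping the plain w type for other languages.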
The default.idx file

The field types, and hence character sets, are associated with data elements by the indexing rules (say title:w) in the various filters. Fields are defined in a field definition file which, by default, is called default.idx. This file provides the association between field type codes and the character map files (with the .chr suffix). The format of the .idx file is as follows:

index field type code
    This directive introduces a new search index code. The argument is a
    one-character code to be used in the .abs files to select this
    particular index type. An index, roughly, corresponds to a particular
    structure attribute during search.

sort field code type
    This directive introduces a sort index. The argument is a
    one-character code to be used in the .abs files to select this
    particular index type. The corresponding use attribute must be used
    in the sort request to refer to this particular sort index. The
    corresponding character map (see below) is used in the sort process.

completeness boolean
    This directive enables or disables complete field indexing. The value
    of the boolean should be 0 (disable) or 1 (enable). If completeness
    is enabled, the index entry will contain the complete contents of the
    field (up to a limit), with words (non-space characters) separated by
    single space characters (normalized to " " on display). When
    completeness is disabled, each word is indexed as a separate entry.
    Complete subfield indexing is most useful for fields which are
    typically browsed (e.g., titles, authors, or subjects), or instances
    where a match on a complete subfield is essential (e.g., exact title
    searching). For fields where completeness is disabled, the search
    engine will interpret a search containing space characters as a word
    proximity search.

firstinfield boolean
    This directive enables or disables first-in-field indexing. The value
    of the boolean should be 0 (disable) or 1 (enable).

alwaysmatches boolean
    This directive enables or disables alwaysmatches indexing. The value
    of the boolean should be 0 (disable) or 1 (enable).

charmap filename
    This is the filename of the character map to be used for this field
    type. See the description of character map files below for details.

icuchain filename
    Specifies the filename with ICU tokenization and normalization rules.
    See the section on ICU chain files below for details. Using icuchain
    for a field type is an alternative to charmap. It does not make sense
    to define both icuchain and charmap for the same field type.

Field types

Following are three excerpts of the standard tab/default.idx configuration file. Notice that index and sort are grouping directives, which bind all following directives to them:

    # Traditional word index
    # Used if completeness is 'incomplete field' (@attr 6=1) and
    # structure is word/phrase/word-list/free-form-text/document-text
    index w
    completeness 0
    position 1
    alwaysmatches 1
    firstinfield 1
    charmap string.chr

    ...

    # Null map index (no mapping at all)
    # Used if structure=key (@attr 4=3)
    index 0
    completeness 0
    position 1
    charmap @

    ...

    # Sort register
    sort s
    completeness 1
    charmap string.chr
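To make the association concrete, here is a hedged sketch of how a record description (.abs) file might bind a MARC title subfield to these field types; the element names are illustrative, not taken from a shipped profile:

    # Hypothetical .abs excerpt: subfield 245$a is indexed both as
    # individual words (w, incomplete field) and as a complete
    # phrase (p, complete field) under the Title index
    elm 245       titleStatement        -
    elm 245/a     titleStatement/title  title:w,title:p

With a rule like this, incomplete-field searches (@attr 6=1) match individual words from the subfield, while complete-subfield searches match the single phrase entry produced by the p type.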
Charmap Files

The character map files are used to define the word tokenization and character normalization performed before inserting text into the inverse indexes. &zebra; ships with the predefined character map files tab/*.chr. Users are allowed to add and/or modify maps according to their needs.

Character maps predefined in &zebra;:

numeric.chr (intended type :n)
    Numeric digit tokenization and normalization map. All characters not
    in the set -{0-9}., will be suppressed. Note that floating point
    numbers are processed fine, but scientific exponential numbers are
    trashed.

scan.chr (intended type :w or :p)
    Word tokenization character map for Scandinavian languages. It
    resembles the generic word tokenization character map
    tab/string.chr; the main differences are the sorting of the special
    characters üzæäøöå and equivalence maps according to Scandinavian
    language rules.

string.chr (intended type :w or :p)
    General word tokenization and normalization character map, mostly
    useful for English texts. Use this to derive your own language
    tokenization and normalization derivatives.

urx.chr (intended type :u)
    URL parsing and tokenization character map.

@ (intended type :0)
    Do-nothing character map used for literal binary indexing. There is
    no actual file associated with it, and no normalization or
    tokenization is performed at all.
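The maps are tied to field types in the field definition file, along these lines (an illustrative sketch based on the table above; consult your installed tab/default.idx for the exact entries):

    # Field type n picks up the numeric map...
    index n
    completeness 0
    charmap numeric.chr

    # ...and field type u the URL map
    index u
    completeness 0
    charmap urx.chr

An indexing rule such as isbn:n or url:u in a filter then selects the corresponding map indirectly, by naming the field type.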
The contents of the character map files are structured as follows:

encoding encoding-name
    This directive must be at the very beginning of the file, and it
    specifies the character encoding used in the entire file. If omitted,
    the encoding ISO-8859-1 is assumed. For example, one of the test
    files found at test/rusmarc/tab/string.chr contains the following
    encoding directive:

        encoding koi8-r

    and the test file test/charmap/string.utf8.chr is encoded in UTF-8:

        encoding utf-8

lowercase value-set
    This directive introduces the basic value set of the field type. The
    format is an ordered list (without spaces) of the characters which
    may occur in "words" of the given type. The order of the entries in
    the list determines the sort order of the index. In addition to
    single characters, the following combinations are legal:

    Backslashes may be used to introduce three-digit octal or two-digit
    hex representations (the latter preceded by x) of single characters.
    In addition, the combinations \\ (backslash), \r, \n, \t, and \s
    (space; remember that real space characters may not occur in the
    value definition) are recognized, with their usual interpretation.

    Curly braces {} may be used to enclose ranges of single characters
    (possibly using the escape convention described in the preceding
    point), e.g., {a-z} to introduce the standard range of ASCII
    characters. Note that the interpretation of such a range depends on
    the concrete representation in your local, physical character set.

    Parentheses () may be used to enclose multi-byte characters, e.g.,
    diacritics or special national combinations (e.g., Spanish "ll").
    When found in the input stream (or a search term), these characters
    are viewed and sorted as a single character, with a sorting value
    depending on the position of the group in the value statement.

    For example, scan.chr contains the following lowercase normalization
    and sorting order:

        lowercase {0-9}{a-y}üzæäøöå

uppercase value-set
    This directive introduces the upper-case equivalences to the value
    set (if any). The number and order of the entries in the list should
    be the same as in the lowercase directive. For example, scan.chr
    contains the following uppercase equivalents:

        uppercase {0-9}{A-Y}ÜZÆÄØÖÅ

space value-set
    This directive introduces the characters which separate words in the
    input stream. Depending on the completeness mode of the field in
    question, these characters either terminate an index entry or delimit
    individual "words" in the input stream. The order of the elements is
    not significant; otherwise the representation is the same as for the
    uppercase and lowercase directives. For example, scan.chr contains
    the following space instruction:

        space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~

map value-set target
    This directive introduces a mapping between each of the members of
    the value-set on the left and the character on the right. The
    character on the right must occur in the value set (the lowercase
    directive) of the character set, but it may be a parenthesis-enclosed
    multi-octet character. This directive may be used to map diacritics
    to their base characters, or to map HTML-style character
    representations to their natural form, etc. The map directive can
    also be used to ignore leading articles in searching and/or sorting,
    and to perform other special transformations.
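Putting the directives together, the following is a minimal, hedged sketch of a complete character map file (a hypothetical example for illustration, not a shipped file):

    # Hypothetical minimal charmap: ASCII words, case folding,
    # and one diacritic folded to its base character
    encoding iso-8859-1
    lowercase {0-9}{a-z}
    uppercase {0-9}{A-Z}
    space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
    # Fold the acute-accented e to plain e in the index
    map é e

A file like this would be registered against a field type with a charmap directive in default.idx, as shown in the excerpts above.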
For example, scan.chr contains map instructions along the following lines (among others) to make sure that HTML entity encoded Danish special characters are mapped to the equivalent Latin-1 characters:

    map (&aelig;)   æ
    map (&oslash;)  ø
    map (&aring;)   å

In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, you can also use the character map files to make &zebra; ignore leading articles in sorting records, or when doing complete field searching. This is done using the map directive in the character map file. In a nutshell, what you do is map certain sequences of characters, when they occur in the beginning of a field, to a space. Assuming that the character "@" is defined as a space character in your file, you can do:

    map (^The\s) @
    map (^the\s) @

The effect of these directives is to map either 'the' or 'The', followed by a space character, to a space. The hat character ^ denotes beginning-of-field only when complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just as any other character.

Because the default.idx file can be used to associate different character maps with different indexing types (and you can create additional indexing types, should the need arise), it is possible to specify that leading articles should be ignored either in sorting, in complete-field searching, or both.

If you ignore certain prefixes in sorting, then these will be eliminated from the index, and sorting will take place as if they weren't there. However, if you set the system up to ignore certain prefixes in searching, then these are deleted both from the indexes and from query terms, when the client specifies complete-field searching. This has the effect that a search for 'the science journal' and a search for 'science journal' both produce the same results.

equivalent value-set
    This directive introduces equivalence classes of characters and/or
    strings for sorting purposes only. It resembles the map directive,
    but it does not affect search and retrieval indexing; it affects only
    the sort order. For example, scan.chr contains the following
    equivalent sorting instructions, which can be uncommented:

        # equivalent æä(ae)
        # equivalent øö(oe)
        # equivalent å(aa)
        # equivalent uü
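For instance, to ignore leading articles in sorting only, one could register a dedicated sort type bound to a local copy of the character map containing the article mappings. This is a hedged sketch; the type code t and the filename sort-articles.chr are hypothetical:

    # Hypothetical default.idx excerpt: sort type "t" uses a local
    # charmap whose map directives strip leading articles
    sort t
    completeness 1
    charmap sort-articles.chr

Search indexes would keep using the unmodified map, so the articles remain searchable as ordinary words while being ignored when result sets are sorted.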
ICU Chain Files

The ICU chain files define a chain of rules which specify the conversion process to be carried out for each record string for indexing. Both searching and sorting are based on the sort normalization that ICU provides. This means that scan and sort will return terms in the sort order given by ICU.

Zebra uses YAZ's ICU wrapper. Refer to the yaz-icu man page for documentation about the ICU chain rules. Use the yaz-icu program to test your icuchain rules (see the example at the end of this section).

Indexing Greek text

Consider a system where all "regular" text is to be indexed as Greek (locale: EL). We would have to change our index type file to read:

    # Index greek words
    index w
    completeness 0
    position 1
    alwaysmatches 1
    firstinfield 1
    icuchain greek.xml

    ...

The ICU chain file greek.xml could look as follows (a sketch along the lines of the ICU chain examples shipped with Zebra):

    <icu_chain locale="el">
      <transliterate rule="[:Control:] Any-Remove"/>
      <tokenize rule="l"/>
      <transliterate rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
      <display/>
      <casemap rule="l"/>
    </icu_chain>

Zebra is shipped with a field types file icu.idx, which is an ICU chain version of default.idx.

MARCXML indexing using ICU

The directory examples/marcxml includes a complete sample with MARCXML records that are DOM XML indexed using ICU chain rules. Study the README in the marcxml directory for details.
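As mentioned above, the yaz-icu program can be used to test a chain file before indexing. A hedged sketch of such a test run (the input string is arbitrary):

    # Feed a sample string through the chain and inspect the
    # tokens the chain produces
    echo "ΖΩΓΡΑΦΙΚΗ και γλυπτική" | yaz-icu -c greek.xml

yaz-icu prints the tokens produced by the chain, which makes it easy to verify tokenization and case folding before running zebraidx against real records.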