X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Ffield-structure.xml;h=4079205a6308d9f2cfa401a5a1e2157819eb1f4b;hp=758542b3ed84da1979baa1dcd1e15f414324f872;hb=1b8e1d7dfece31918056f76819c18675ed6e781e;hpb=7b25277add2aae5caabee02213911aeeb65030c8 diff --git a/doc/field-structure.xml b/doc/field-structure.xml index 758542b..4079205 100644 --- a/doc/field-structure.xml +++ b/doc/field-structure.xml @@ -1,11 +1,11 @@ - + Field Structure and Character Sets In order to provide a flexible approach to national character set - handling, Zebra allows the administrator to configure the set up the + handling, &zebra; allows the administrator to configure the set up the system to handle any 8-bit character set — including sets that require multi-octet diacritics or other multi-octet characters. The definition of a character set includes a specification of the @@ -76,26 +76,162 @@ search containing space characters as a word proximity search. + + + firstinfield boolean + + + This directive enables or disables first-in-field indexing. + The value of the boolean should be 0 + (disable) or 1. + + + + + alwaysmatches boolean + + + This directive enables or disables alwaysmatches indexing. + The value of the boolean should be 0 + (disable) or 1. + + + charmap filename This is the filename of the character map to be used for this index for field type. + See for details. + + Following are three excerpts of the standard + tab/default.idx configuration file. Notice + that the index and sort + are grouping directives, which bind all other following directives + to them: + + # Traditional word index + # Used if completenss is 'incomplete field' (@attr 6=1) and + # structure is word/phrase/word-list/free-form-text/document-text + index w + completeness 0 + position 1 + alwaysmatches 1 + firstinfield 1 + charmap string.chr + + ... 
+ + # Null map index (no mapping at all) + # Used if structure=key (@attr 4=3) + index 0 + completeness 0 + position 1 + charmap @ + + ... + + # Sort register + sort s + completeness 1 + charmap string.chr + +
The character map file format - The contents of the character map files are structured as follows: + The character map files are used to define the word tokenization + and character normalization performed before inserting text into + the inverse indexes. &zebra; ships with the predefined character map + files tab/*.chr. Users are allowed to add + and/or modify maps according to their needs. + + Character maps predefined in &zebra; + + + + File name + Intended type + Description + + + + + numeric.chr + :n + Numeric digit tokenization and normalization map. All + characters not in the set -{0-9}., will be + suppressed. Note that floating point numbers are processed + fine, but scientific exponential numbers are trashed. + + + scan.chr + :w or :p + Word tokenization char map for Scandinavian + languages. This one resembles the generic word tokenization + character map tab/string.chr, the main + differences are sorting of the special characters + üzæäøöå and equivalence maps according to + Scandinavian language rules. + + + string.chr + :w or :p + General word tokenization and normalization character + map, mostly useful for English texts. Use this to derive your + own language tokenization and normalization derivatives. + + + urx.chr + :u + URL parsing and tokenization character map. + + + @ + :0 + Do-nothing character map used for literal binary + indexing. There is no existing file associated to it, and + there is no normalization or tokenization performed at all. + + + +
+ + The contents of the character map files are structured as follows: + + encoding encoding-name + + + This directive must be at the very beginning of the file, and it + specifies the character encoding used in the entire file. If + omitted, the encoding ISO-8859-1 is assumed. + + + For example, one of the test files found at + test/rusmarc/tab/string.chr contains the following + encoding directive: + + encoding koi8-r + + and the test file + test/charmap/string.utf8.chr is encoded + in UTF-8: + + encoding utf-8 + + + lowercase value-set @@ -149,16 +285,30 @@ + + For example, scan.chr contains the following + lowercase normalization and sorting order: + + lowercase {0-9}{a-y}üzæäøöå + + uppercase value-set This directive introduces the - upper-case equivalencis to the value set (if any). The number and + upper-case equivalences to the value set (if any). The number and order of the entries in the list should be the same as in the lowercase directive. + + For example, scan.chr contains the following + uppercase equivalent: + + uppercase {0-9}{A-Y}ÜZÆÄØÖÅ + + space value-set @@ -173,6 +323,13 @@ uppercase and lowercase directives. + + For example, scan.chr contains the following + space instruction: + ?@\[\\]^_`\{|}~ + ]]> + map value-set @@ -183,7 +340,7 @@ members of the value-set on the left to the character on the right. The character on the right must occur in the value set (the lowercase directive) of the - character set, but it may be a paranthesis-enclosed + character set, but it may be a parenthesis-enclosed multi-octet character. This directive may be used to map diacritics to their base characters, or to map HTML-style character-representations to their natural form, etc. The @@ -192,6 +349,37 @@ transformations. See section . 
+ + For example, scan.chr contains the following + map instructions among others, to make sure that HTML entity + encoded Danish special characters are mapped to the + equivalent Latin-1 characters: + + + + + equivalent value-set + + + This directive introduces equivalence classes of characters + and/or strings for sorting purposes only. It resembles the map + directive, but does not affect search and retrieval indexing, + but only sorting order under present requests. + + + For example, scan.chr contains the following + equivalent sorting instructions, which can be uncommented: + + @@ -201,7 +389,7 @@ In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, you can also use the character map - files to make Zebra ignore leading articles in sorting records, + files to make &zebra; ignore leading articles in sorting records, or when doing complete field searching. @@ -240,6 +428,7 @@ would both produce the same results.
+
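Note on the directives documented in this patch: the interplay of lowercase, uppercase, space and map can be sketched with a small toy model. The Python below is only an illustration under stated assumptions, not &zebra;'s implementation. The class ToyCharMap, the helper expand, the simplified {x-y} range syntax, the space set and the single &oslash; map entry are all hypothetical; only the lowercase and uppercase value sets are taken from the scan.chr excerpts quoted above. Characters found in no set are simply dropped here, which is also an assumption.

```python
import re

# Simplified value-set syntax (assumption): literal characters plus {x-y} ranges.
_RANGE = re.compile(r"\{(.)-(.)\}")


def expand(value_set):
    """Expand a value set such as '{0-9}{a-y}z' into an ordered list of characters."""
    chars = []
    pos = 0
    while pos < len(value_set):
        m = _RANGE.match(value_set, pos)
        if m:
            lo, hi = m.group(1), m.group(2)
            chars.extend(chr(c) for c in range(ord(lo), ord(hi) + 1))
            pos = m.end()
        else:
            chars.append(value_set[pos])
            pos += 1
    return chars


class ToyCharMap:
    """Hypothetical stand-in for a character map file; not Zebra's own code."""

    def __init__(self, lowercase, uppercase, space, maps=None):
        lower = expand(lowercase)
        upper = expand(uppercase)
        if len(lower) != len(upper):
            raise ValueError("lowercase and uppercase sets must line up")
        # Position in the lowercase set doubles as the sort rank.
        self.rank = {ch: i for i, ch in enumerate(lower)}
        # Upper-case equivalents fold onto the lowercase entry at the same position.
        self.fold = dict(zip(upper, lower))
        self.space = set(space)
        self.maps = dict(maps or {})

    def tokenize(self, text):
        """Apply map entries, fold case, and split on space characters."""
        for src, dst in self.maps.items():
            text = text.replace(src, dst)
        words, current = [], []
        for ch in text:
            ch = self.fold.get(ch, ch)
            if ch in self.rank:
                current.append(ch)
            elif ch in self.space and current:
                words.append("".join(current))
                current = []
            # Characters in neither set are dropped (toy-model assumption).
        if current:
            words.append("".join(current))
        return words

    def sort_key(self, word):
        """Sort key following the order of the lowercase value set."""
        return [self.rank[ch] for ch in word]


scan_like = ToyCharMap(
    lowercase="{0-9}{a-y}üzæäøöå",   # from the scan.chr excerpt
    uppercase="{0-9}{A-Y}ÜZÆÄØÖÅ",   # from the scan.chr excerpt
    space=" \t\n.,;:!?",             # hypothetical space set
    maps={"&oslash;": "ø"},          # hypothetical map entry
)

words = scan_like.tokenize("Ærø, ZEBRA über r&oslash;d")
ordered = sorted(words, key=scan_like.sort_key)
```

With these inputs, words holds the normalized tokens in input order, and ordered places "über" before "zebra" and "zebra" before "ærø", mirroring the position of ü, z and æ in the lowercase value set.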