X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Ffield-structure.xml;h=a1de6dd2ef95db362b83fd705fb5a42ae99847d9;hb=2ce46f160259c9452405b68489c16654919cd16c;hp=bd46d2a5d71c222d4e0715148854099c1cb22044;hpb=e70a548d193a5187b8074c439f2d7fa687a8e8c4;p=idzebra-moved-to-github.git

diff --git a/doc/field-structure.xml b/doc/field-structure.xml
index bd46d2a..a1de6dd 100644
--- a/doc/field-structure.xml
+++ b/doc/field-structure.xml

@@ -1,5 +1,5 @@

 Field Structure and Character Sets

@@ -103,6 +103,7 @@

 This is the filename of the character map to be used for this index field
 type. See the section "The character map file format" below for details.
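 As a sketch of how this setting is used in practice, assuming the
 default.idx-style syntax of Zebra's field structure files (the surrounding
 values here are illustrative, not a verbatim excerpt), a word index
 normalized through string.chr could be declared like this:

     # hypothetical field structure entry for the word index type
     index w
     completeness 0
     position 1
     charmap string.chr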
@@ -112,11 +113,91 @@

 The character map file format

 The character map files are used to define the word tokenization and
 character normalization performed before inserting text into the inverse
 indexes. Zebra ships with the predefined character map files tab/*.chr.
 Users may add and/or modify maps according to their needs.

 Character maps predefined in Zebra

  File name     Intended type   Description
  ------------  --------------  --------------------------------------------------
  numeric.chr   :n              Numeric digit tokenization and normalization map.
                                All characters not in the set -{0-9}., will be
                                suppressed. Note that floating-point numbers are
                                processed correctly, but numbers in scientific
                                (exponential) notation are mangled.

  scan.chr      :w or :p        Word tokenization character map for Scandinavian
                                languages. It resembles the generic word
                                tokenization character map tab/string.chr; the
                                main differences are the sorting of the special
                                characters üzæäøöå and equivalence maps that
                                follow Scandinavian language rules.

  string.chr    :w or :p        General word tokenization and normalization
                                character map, mostly useful for English text.
                                Use it as the basis for deriving your own
                                language-specific tokenization and normalization
                                maps.

  urx.chr       :u              URL parsing and tokenization character map.

  @             :0              Do-nothing character map used for literal binary
                                indexing. No file is associated with it, and no
                                normalization or tokenization is performed at all.
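 Extending the earlier sketch, the "Intended type" column above translates
 into one charmap line per index type in a default.idx-style file. The
 pairings below follow the table, but the exact entries shipped with your
 installation may differ, and scan.chr would replace string.chr for
 Scandinavian material:

     # hypothetical pairings of index types and the predefined character maps
     index p
     charmap string.chr

     index u
     charmap urx.chr

     index n
     charmap numeric.chr

     # literal binary indexing through the do-nothing map
     index 0
     charmap @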
 The contents of the character map files are structured as follows:

 encoding encoding-name

    This directive must be at the very beginning of the file, and it
    specifies the character encoding used in the entire file. If omitted,
    the encoding ISO-8859-1 is assumed.

    For example, one of the test files found at test/rusmarc/tab/string.chr
    contains the following encoding directive:

        encoding koi8-r

    and the test file test/charmap/string.utf8.chr is encoded in UTF-8:

        encoding utf-8

 lowercase value-set

@@ -170,16 +251,30 @@

    For example, scan.chr contains the following lowercase normalization and
    sorting order:

        lowercase {0-9}{a-y}üzæäøöå

 uppercase value-set

    This directive introduces the upper-case equivalences to the value set
    (if any). The number and order of the entries in the list should be the
    same as in the lowercase directive.

    For example, scan.chr contains the following uppercase equivalent:

        uppercase {0-9}{A-Y}ÜZÆÄØÖÅ

 space value-set

@@ -194,6 +289,13 @@

    ... uppercase and lowercase directives.

    For example, scan.chr contains the following space instruction:

        space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~

 map value-set

@@ -204,7 +306,7 @@

    ... members of the value-set on the left to the character on the right.
    The character on the right must occur in the value set (the lowercase
    directive) of the character set, but it may be a parenthesis-enclosed
    multi-octet character. This directive may be used to map diacritics to
    their base characters, or to map HTML-style character representations to
    their natural form, etc. The map directive can also be used to ignore
    leading articles in searching and/or sorting, and to perform other special

@@ -213,6 +315,37 @@

    ... transformations. See the relevant section for details.

    For example, among other things scan.chr contains map instructions that
    make sure HTML-entity encoded Danish special characters are mapped to the
    equivalent Latin-1 characters.

 equivalent value-set

    This directive introduces equivalence classes of characters and/or
    strings for sorting purposes only. It resembles the map directive, but it
    does not affect search and retrieval indexing; it affects only the
    sorting order used when sort requests are processed.

    For example, scan.chr contains equivalent sorting instructions for these
    characters; they are commented out and can be uncommented to enable this
    behaviour.
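 As an illustration of what such map and equivalent instructions look like,
 the lines below are an assumed approximation rather than verbatim scan.chr
 content, taking '#' to introduce a comment as in the shipped maps:

     map (&aelig;)     æ
     map (&oslash;)    ø
     map (&aring;)     å

     # sorting equivalences, commented out by default; remove the leading
     # '#' to enable them
     #equivalent æä(ae)
     #equivalent øö(oe)
     #equivalent å(aa)

 Putting the directives together, a minimal custom character map might be
 laid out along these lines. This is only a sketch with assumed contents; in
 practice a new map is best derived from a copy of tab/string.chr:

     encoding iso-8859-1
     lowercase {0-9}{a-z}æøå
     uppercase {0-9}{A-Z}ÆØÅ
     space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
     map (&aelig;) æ
     map (&oslash;) ø
     map (&aring;) å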