X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Ffield-structure.xml;fp=doc%2Ffield-structure.xml;h=3a0a5f2535027830e1d86c71c7b7cfe00c834376;hp=bd46d2a5d71c222d4e0715148854099c1cb22044;hb=a92270aafb3ba7b336bc2334ed7c44c631c1cb29;hpb=9757a9ac857180889850aec2756595c04501aeb7

diff --git a/doc/field-structure.xml b/doc/field-structure.xml
index bd46d2a..3a0a5f2 100644
--- a/doc/field-structure.xml
+++ b/doc/field-structure.xml

@@ -1,5 +1,5 @@
Field Structure and Character Sets

@@ -103,6 +103,7 @@
This is the filename of the character map to be used for this index field type.
See the section "The character map file format" below for details.
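
As a rough illustration of where that filename goes, an index-type definition in
Zebra's tab/default.idx points at its character map with a charmap line. The
entries below are a sketch following the shipped file's conventions; the exact
directives and values vary between Zebra versions, so consult your own
tab/default.idx:

    # word index type "w", normalized through string.chr
    index w
    completeness 0
    position 1
    charmap string.chr

    # numeric index type "n", normalized through numeric.chr
    index n
    completeness 0
    charmap numeric.chr

    # literal binary index type "0": the do-nothing map "@"
    index 0
    completeness 0
    charmap @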
@@ -112,10 +113,67 @@
The character map file format

The character map files are used to define the word tokenization and character
normalization performed before inserting text into the inverted indexes. Zebra
ships with the predefined character map files tab/*.chr. Users may add and/or
modify maps according to their needs.

Character maps predefined in Zebra (file name, intended type, description):

    numeric.chr  (:n)
        Numeric digit tokenization and normalization map. All characters not
        in the set -{0-9}., are suppressed. Note that floating-point numbers
        are processed fine, but scientific exponential numbers are trashed.

    scan.chr  (:w or :p)
        Word tokenization character map for Scandinavian languages. It
        resembles the generic word tokenization character map tab/string.chr;
        the main differences are the sorting of the special characters
        üzæäøöå and equivalence maps according to Scandinavian language rules.

    string.chr  (:w or :p)
        General word tokenization and normalization character map, mostly
        useful for English texts. Use it as the basis for your own language
        tokenization and normalization maps.

    urx.chr  (:u)
        URL parsing and tokenization character map.

    @  (:0)
        Do-nothing character map used for literal binary indexing. No file is
        associated with it, and no normalization or tokenization is performed
        at all.
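
The intended type is the index type that indexing rules attach to a field, and
the type in turn selects the character map configured for it. The following is
a sketch only: the tag paths, element names, and attribute names are made up
for illustration, and the exact .abs syntax is documented elsewhere in this
manual. A GRS-1 .abs file might route fields through different maps like this:

    # hypothetical element lines; the ":w", ":p", ":n" suffixes pick the
    # index type, and thereby the character map bound to that type
    elm 245/*/a   title   Title:w,Title:p   # word + phrase (string.chr or scan.chr)
    elm 260/*/c   date    Date:n            # numeric tokenization (numeric.chr)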
The contents of the character map files are structured as follows:

@@ -170,16 +228,30 @@
    For example, scan.chr contains the following lowercase normalization and
    sorting order:

        lowercase {0-9}{a-y}üzæäøöå

uppercase value-set

    This directive introduces the upper-case equivalences to the value set
    (if any). The number and order of the entries in the list should be the
    same as in the lowercase directive.

    For example, scan.chr contains the following uppercase equivalents:

        uppercase {0-9}{A-Y}ÜZÆÄØÖÅ

space value-set

@@ -194,6 +266,13 @@
    … uppercase and lowercase directives.

    For example, scan.chr contains the following space instruction:

        space {\001-\040}!"#$%&'()*+,-./:;<=>?@\[\\]^_`\{|}~

map value-set

@@ -204,7 +283,7 @@
    … members of the value-set on the left to the character on the right. The
    character on the right must occur in the value set (the lowercase
    directive) of the character set, but it may be a parenthesis-enclosed
    multi-octet character. This directive may be used to map diacritics to
    their base characters, or to map HTML-style character representations to
    their natural form, etc. The map directive can also be used to ignore
    leading articles in searching and/or sorting, and to perform other special
    transformations; see the section on ignoring leading articles.

@@ -213,6 +292,37 @@
    For example, scan.chr contains map instructions, among others, that make
    sure HTML-entity-encoded Danish special characters are mapped to the
    equivalent Latin-1 characters; a sketch of such lines is given at the end
    of this section.

equivalent value-set

    This directive introduces equivalence classes of characters and/or strings
    for sorting purposes only. It resembles the map directive, but it does not
    affect search and retrieval indexing; it affects only the sorting order of
    present requests.

    For example, scan.chr ships with equivalent sorting instructions, commented
    out by default, which can be uncommented; a sketch of such lines is also
    given at the end of this section.
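
The map and equivalent examples referred to above look roughly like the
following lines. This is a sketch of the kind of content found in tab/scan.chr
rather than a verbatim copy, so check the shipped file for the exact entries:

    # map HTML-entity-encoded Danish characters to the equivalent Latin-1
    # characters; parenthesized left-hand sides are multi-character strings
    map (&aelig;)    æ
    map (&oslash;)   ø
    map (&aring;)    å

    # sorting-only equivalence classes, shipped commented out
    # equivalent æä(ae)
    # equivalent øö(oe)
    # equivalent å(aa)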