X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Ffield-structure.xml;h=4079205a6308d9f2cfa401a5a1e2157819eb1f4b;hp=3a0a5f2535027830e1d86c71c7b7cfe00c834376;hb=1b8e1d7dfece31918056f76819c18675ed6e781e;hpb=a92270aafb3ba7b336bc2334ed7c44c631c1cb29

diff --git a/doc/field-structure.xml b/doc/field-structure.xml
index 3a0a5f2..4079205 100644
--- a/doc/field-structure.xml
+++ b/doc/field-structure.xml
@@ -1,11 +1,11 @@
-
+
 Field Structure and Character Sets
 In order to provide a flexible approach to national character set
- handling, Zebra allows the administrator to configure the set up the
+ handling, &zebra; allows the administrator to configure the set up the
 system to handle any 8-bit character set — including sets that
 require multi-octet diacritics or other multi-octet characters. The
 definition of a character set includes a specification of the
@@ -108,6 +108,40 @@
+
+ Following are three excerpts of the standard
+ tab/default.idx configuration file. Notice
+ that the index and sort
+ are grouping directives, which bind all other following directives
+ to them:
+
+ # Traditional word index
+ # Used if completenss is 'incomplete field' (@attr 6=1) and
+ # structure is word/phrase/word-list/free-form-text/document-text
+ index w
+ completeness 0
+ position 1
+ alwaysmatches 1
+ firstinfield 1
+ charmap string.chr
+
+ ...
+
+ # Null map index (no mapping at all)
+ # Used if structure=key (@attr 4=3)
+ index 0
+ completeness 0
+ position 1
+ charmap @
+
+ ...
+
+ # Sort register
+ sort s
+ completeness 1
+ charmap string.chr
+
+
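The grouping behaviour described in the hunk above (each `index` or `sort` line binds the directives that follow it) can be sketched in Python. This is a minimal illustration and not Zebra's own parser: the function name `parse_idx` and the dictionary layout are hypothetical, and it assumes that comments start with `#` at the beginning of a line and that each directive is a `name value` pair on one line.

```python
def parse_idx(text):
    """Group directives under the preceding 'index' or 'sort' line.

    Hypothetical helper for illustration; not part of Zebra.
    """
    groups = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no directives.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if key in ("index", "sort"):
            # A grouping directive opens a new group.
            current = {"kind": key, "name": value, "settings": {}}
            groups.append(current)
        elif current is not None:
            # Every other directive binds to the most recent group.
            current["settings"][key] = value
    return groups


# The three excerpts quoted in the hunk, with the "..." gaps left out.
SAMPLE = """\
# Traditional word index
index w
completeness 0
position 1
alwaysmatches 1
firstinfield 1
charmap string.chr

# Null map index (no mapping at all)
index 0
completeness 0
position 1
charmap @

# Sort register
sort s
completeness 1
charmap string.chr
"""

groups = parse_idx(SAMPLE)
for group in groups:
    print(group["kind"], group["name"], group["settings"])
```

Directives that appear before any `index` or `sort` line are silently ignored in this sketch; whether Zebra accepts such lines is not stated in the text above.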
@@ -115,13 +149,13 @@
 The character map files are used to define the word tokenization
 and character normalization performed before inserting text into
- the inverse indexes. Zebra ships with the predefined character map
+ the inverse indexes. &zebra; ships with the predefined character map
 files tab/*.chr. Users are allowed to add
 and/or modify maps according to their needs.
-
- Character maps predefined in Zebra
+
+ Character maps predefined in &zebra;
@@ -175,6 +209,29 @@
 The contents of the character map files are structured as follows:
+
+ encoding encoding-name
+
+
+ This directive must be at the very beginning of the file, and it
+ specifies the character encoding used in the entire file. If
+ omitted, the encoding ISO-8859-1 is assumed.
+
+
+ For example, one of the test files found at
+ test/rusmarc/tab/string.chr contains the following
+ encoding directive:
+
+ encoding koi8-r
+
+ and the test file
+ test/charmap/string.utf8.chr is encoded
+ in UTF-8:
+
+ encoding utf-8
+
+
+
 lowercase value-set
@@ -332,7 +389,7 @@
 In addition to specifying sort orders, space (blank) handling,
 and upper/lowercase folding, you can also use the character map
- files to make Zebra ignore leading articles in sorting records,
+ files to make &zebra; ignore leading articles in sorting records,
 or when doing complete field searching.
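
The `encoding` directive documented in the hunk at line 209 has a simple lookup rule: use the named encoding if the directive opens the file, otherwise fall back to ISO-8859-1. The Python sketch below illustrates that rule only. The function name `chr_file_encoding` is hypothetical and not a Zebra API, and allowing blank lines or `#` comment lines before the directive is an assumption, since the text only says the directive must be "at the very beginning of the file".

```python
DEFAULT_ENCODING = "ISO-8859-1"  # fallback stated in the documentation text


def chr_file_encoding(data: bytes) -> str:
    """Return the encoding named by a leading 'encoding' directive.

    Hypothetical helper for illustration; not part of Zebra.
    """
    # Decode as latin-1 only to peek at the directive: every byte value
    # maps to a character, so this step cannot fail.
    for raw in data.decode("latin-1").splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments may precede the directive.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) == 2 and parts[0] == "encoding":
            return parts[1]
        # The first real directive is something else, so the default applies.
        return DEFAULT_ENCODING
    return DEFAULT_ENCODING


print(chr_file_encoding(b"encoding koi8-r\n"))
print(chr_file_encoding(b"lowercase {0-9}\n"))
```

A caller could read a `.chr` file in binary mode, pass the bytes to this function, and then decode the whole file with the returned name.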