X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Ffield-structure.xml;fp=doc%2Ffield-structure.xml;h=c354795e21e4afc1652f56a833860fcf2f04fd8b;hp=0000000000000000000000000000000000000000;hb=37dc985516f52f34fc8434cc8beb982bb0c8988f;hpb=819007639f67bdf6a147a8fc5e66c7fbad9ada6a

diff --git a/doc/field-structure.xml b/doc/field-structure.xml
new file mode 100644
index 0000000..c354795
--- /dev/null
+++ b/doc/field-structure.xml
@@ -0,0 +1,257 @@
+
+
+ Field Structure and Character Sets
+
+
+
+ In order to provide a flexible approach to national character set
+ handling, Zebra allows the administrator to set up the
+ system to handle any 8-bit character set, including sets that
+ require multi-octet diacritics or other multi-octet characters. The
+ definition of a character set includes a specification of the
+ permissible values, their sort order (this affects the display in the
+ SCAN function), and relationships between upper- and lowercase
+ characters. Finally, the definition includes the specification of
+ space characters for the set.
+
+
+
+ The operator can define different character sets for different fields,
+ typical examples being standard text fields, numerical fields, and
+ special-purpose fields such as WWW-style linkages (URx).
+
+
+
+ The default.idx file
+
+ The field types, and hence character sets, are associated with data
+ elements by the .abs files (see above).
+ The file default.idx
+ provides the association between field type codes (as used in the .abs
+ files) and the character map files (with the .chr suffix). The format
+ of the .idx file is as follows:
+
+
+
+
+
+
+ index field type code
+
+
+ This directive introduces a new search index code.
+ The argument is a one-character code to be used in the
+ .abs files to select this particular index type. An index, roughly,
+ corresponds to a particular structure attribute during search. Refer
+ to .
+
+
+
+ sort field type code
+
+
+ This directive introduces a
+ sort index. The argument is a one-character code to be used in the
+ .abs file to select this particular index type. The corresponding
+ use attribute must be used in the sort request to refer to this
+ particular sort index. The corresponding character map (see below)
+ is used in the sort process.
+
+
+
+ completeness boolean
+
+
+ This directive enables or disables complete field indexing.
+ The value of the boolean should be 0
+ (disable) or 1 (enable). If completeness is enabled, the index entry will
+ contain the complete contents of the field (up to a limit), with words
+ (non-space characters) separated by single space characters
+ (normalized to " " on display). When completeness is
+ disabled, each word is indexed as a separate entry. Complete subfield
+ indexing is most useful for fields which are typically browsed (e.g.
+ titles, authors, or subjects), or instances where a match on a
+ complete subfield is essential (e.g. exact title searching). For fields
+ where completeness is disabled, the search engine will interpret a
+ search containing space characters as a word proximity search.
+
+
+
+ charmap filename
+
+
+ This is the filename of the character
+ map to be used for this index field type.
+
+
+
+
+
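The directives above can be combined into a file along the following lines. This is a minimal sketch, not a copy of any shipped configuration: the one-character codes (w, p, n, s) and the file names string.chr and numeric.chr are assumptions chosen for illustration, as is the use of # for comment lines. It also assumes that the completeness and charmap directives apply to the most recent index or sort directive.

```
# Hypothetical default.idx sketch; codes and file names are illustrative.

# Word index: each word in the field becomes a separate entry.
index w
completeness 0
charmap string.chr

# Complete-field index: the whole field is one entry (e.g. for title browsing).
index p
completeness 1
charmap string.chr

# Numeric fields get their own character map.
index n
completeness 0
charmap numeric.chr

# Sort index, selected from the .abs file with the code "s".
sort s
completeness 1
charmap string.chr
```

Keeping one charmap per field type means a numeric or URx field can be tokenized and ordered differently from ordinary text without touching the text map.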
+ +
+ The character map file format
+
+ The contents of the character map files are structured as follows:
+
+
+
+
+
+
+ lowercase value-set
+
+
+ This directive introduces the basic value set of the field type.
+ The format is an ordered list (without spaces) of the
+ characters which may occur in "words" of the given type.
+ The order of the entries in the list determines the
+ sort order of the index. In addition to single characters, the
+ following combinations are legal:
+
+
+
+
+
+
+
+ Backslashes may be used to introduce three-digit octal, or
+ two-digit hex representations of single characters
+ (preceded by x).
+ In addition, the combinations
+ \\, \r, \n, \t, and \s (space; remember that real
+ space-characters may not occur in the value definition)
+ are recognized, with their usual interpretation.
+
+
+
+
+
+ Curly braces {} may be used to enclose ranges of single
+ characters (possibly using the escape convention described in the
+ preceding point), e.g. {a-z} to introduce the
+ standard range of lowercase ASCII letters.
+ Note that the interpretation of such a range depends on
+ the concrete representation in your local, physical character set.
+
+
+
+
+
+ Parentheses () may be used to enclose multi-byte characters,
+ e.g. diacritics or special national combinations (e.g. Spanish
+ "ll"). When found in the input stream (or a search term),
+ these characters are viewed and sorted as a single character, with a
+ sorting value depending on the position of the group in the value
+ statement.
+
+
+
+
+
+
+
+
+ uppercase value-set
+
+
+ This directive introduces the
+ uppercase equivalents of the value set (if any). The number and
+ order of the entries in the list should be the same as in the
+ lowercase directive.
+
+
+
+ space value-set
+
+
+ This directive introduces the characters
+ which separate words in the input stream. Depending on the
+ completeness mode of the field in question, these characters either
+ terminate an index entry, or delimit individual "words" in
+ the input stream. The order of the elements is not significant;
+ otherwise the representation is the same as for the
+ uppercase and lowercase
+ directives.
+
+
+
+ map value-set
+ target
+
+
+ This directive introduces a mapping between each of the
+ members of the value-set on the left to the character on the
+ right. The character on the right must occur in the value
+ set (the lowercase directive) of the
+ character set, but it may be a parenthesis-enclosed
+ multi-octet character. This directive may be used to map
+ diacritics to their base characters, or to map HTML-style
+ character-representations to their natural form, etc. The
+ map directive can also be used to ignore leading articles in
+ searching and/or sorting, and to perform other special
+ transformations. See section .
+
+
+
+
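As an illustration of the four directives, a small character map might look like the sketch below. It is an assumption-laden example rather than a shipped file: the file name, the choice of Spanish "ll" as a single sorting unit, the two HTML-style entities, and the use of # for comment lines are all made up for illustration.

```
# Hypothetical character map sketch (e.g. a file named spanish.chr).

# Value set in sort order: digits, then letters, with "ll" treated
# as a single character that sorts between "l" and "m".
lowercase {0-9}{a-l}(ll){m-z}

# Uppercase equivalents: same number and order of entries as above.
uppercase {0-9}{A-L}(LL){M-Z}

# Word separators: space, tab, carriage return, newline.
space \s\t\r\n

# Map HTML-style representations to characters of the lowercase set.
map (&eacute;) e
map (&ntilde;) n
```

Note that each map target (e, n) appears in the lowercase value set, as the map directive requires, and that the position of (ll) in the list is what gives it its sorting value.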
+
+ Ignoring leading articles
+
+ In addition to specifying sort orders, space (blank) handling,
+ and upper/lowercase folding, you can also use the character map
+ files to make Zebra ignore leading articles in sorting records,
+ or when doing complete field searching.
+
+
+ This is done using the map directive in the
+ character map file. In a nutshell, what you do is map certain
+ sequences of characters, when they occur at the
+ beginning of a field, to a space. Assuming that the
+ character "@" is defined as a space character in your file, you
+ can do:
+
+ map (^The\s) @
+ map (^the\s) @
+
+ The effect of these directives is to map either 'the' or 'The',
+ followed by a space character, to a space. The caret (^) character
+ denotes beginning-of-field only when complete-subfield indexing
+ or sort indexing is taking place; otherwise, it is treated just
+ like any other character.
+
+
+ Because the default.idx file can be used to
+ associate different character maps with different indexing types
+ (and you can create additional indexing types, should the need
+ arise), it is possible to specify that leading articles should
+ be ignored either in sorting, in complete-field searching, or
+ both.
+
+
+ If you ignore certain prefixes in sorting, then these will be
+ eliminated from the index, and sorting will take place as if
+ they weren't there. However, if you set the system up to ignore
+ certain prefixes in searching, then these
+ are deleted both from the indexes and from query terms, when the
+ client specifies complete-field searching. This has the effect
+ that searches for 'the science journal' and 'science journal'
+ produce the same results.
+
+
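To ignore leading articles in sorting only, one possible arrangement is to give the sort index its own character map while the search index keeps the regular one. The sketch below shows both files in a single listing; the codes p and s, the file names string.chr and sort-articles.chr, and the # comment lines are assumptions for illustration, and sort-articles.chr is imagined as a copy of the regular map with the lines shown added.

```
# --- default.idx (hypothetical sketch) ---

# Complete-field search index: articles are kept.
index p
completeness 1
charmap string.chr

# Sort index: uses the map that drops leading articles.
sort s
completeness 1
charmap sort-articles.chr

# --- sort-articles.chr (hypothetical; additions to a copy of string.chr) ---

# "@" must be declared as a space character for the maps below to work.
space \s\t\r\n@

# Map a leading article plus its trailing space to a space character.
map (^The\s) @
map (^the\s) @
```

Pointing the index p entry at sort-articles.chr as well would extend the same behaviour to complete-field searching.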
+
+