X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Frecordmodel.xml;h=5846b4bce904a8290a81ac82c82ca3fe2922b405;hp=e052f12e5b3218284191075410fea08fcdd37506;hb=7c3a0352f0492609a3b6b26b63a72b0b2d207aab;hpb=710144c9d5a2577034d300ef6beb334d5145d657;ds=sidebyside diff --git a/doc/recordmodel.xml b/doc/recordmodel.xml index e052f12..5846b4b 100644 --- a/doc/recordmodel.xml +++ b/doc/recordmodel.xml @@ -1,5 +1,5 @@ - + The Record Model @@ -1786,174 +1786,216 @@ special-purpose fields such as WWW-style linkages (URx). - - The field types, and hence character sets, are associated with data - elements by the .abs files (see above). - The file default.idx - provides the association between field type codes (as used in the .abs - files) and the character map files (with the .chr suffix). The format - of the .idx file is as follows - - - - - - - index field type code - - - This directive introduces a new search index code. - The argument is a one-character code to be used in the - .abs files to select this particular index type. An index, roughly, - corresponds to a particular structure attribute during search. Refer - to . - - - - sort field code type - - - This directive introduces a - sort index. The argument is a one-character code to be used in the - .abs fie to select this particular index type. The corresponding - use attribute must be used in the sort request to refer to this - particular sort index. The corresponding character map (see below) - is used in the sort process. - - - - completeness boolean - - - This directive enables or disables complete field indexing. - The value of the boolean should be 0 - (disable) or 1. If completeness is enabled, the index entry will - contain the complete contents of the field (up to a limit), with words - (non-space characters) separated by single space characters - (normalized to " " on display). When completeness is - disabled, each word is indexed as a separate entry. Complete subfield - indexing is most useful for fields which are typically browsed (eg. - titles, authors, or subjects), or instances where a match on a - complete subfield is essential (eg. exact title searching). For fields - where completeness is disabled, the search engine will interpret a - search containing space characters as a word proximity search. - - - - charmap filename - - - This is the filename of the character - map to be used for this index for field type. - - - - - - - The contents of the character map files are structured as follows: - - - - - - - lowercase value-set - - - This directive introduces the basic value set of the field type. - The format is an ordered list (without spaces) of the - characters which may occur in "words" of the given type. - The order of the entries in the list determines the - sort order of the index. In addition to single characters, the - following combinations are legal: - - - - - - - - Backslashes may be used to introduce three-digit octal, or - two-digit hex representations of single characters - (preceded by x). - In addition, the combinations - \\, \\r, \\n, \\t, \\s (space — remember that real - space-characters may not occur in the value definition), and - \\ are recognized, with their usual interpretation. - - - - - - Curly braces {} may be used to enclose ranges of single - characters (possibly using the escape convention described in the - preceding point), eg. {a-z} to introduce the - standard range of ASCII characters. - Note that the interpretation of such a range depends on - the concrete representation in your local, physical character set. - - - - - - paranthesises () may be used to enclose multi-byte characters - - eg. diacritics or special national combinations (eg. Spanish - "ll"). When found in the input stream (or a search term), - these characters are viewed and sorted as a single character, with a - sorting value depending on the position of the group in the value - statement. - - + + The default.idx file + + The field types, and hence character sets, are associated with data + elements by the .abs files (see above). + The file default.idx + provides the association between field type codes (as used in the .abs + files) and the character map files (with the .chr suffix). The format + of the .idx file is as follows + - + + + + + index field type code + + + This directive introduces a new search index code. + The argument is a one-character code to be used in the + .abs files to select this particular index type. An index, roughly, + corresponds to a particular structure attribute during search. Refer + to . + + + + sort field code type + + + This directive introduces a + sort index. The argument is a one-character code to be used in the + .abs fie to select this particular index type. The corresponding + use attribute must be used in the sort request to refer to this + particular sort index. The corresponding character map (see below) + is used in the sort process. + + + + completeness boolean + + + This directive enables or disables complete field indexing. + The value of the boolean should be 0 + (disable) or 1. If completeness is enabled, the index entry will + contain the complete contents of the field (up to a limit), with words + (non-space characters) separated by single space characters + (normalized to " " on display). When completeness is + disabled, each word is indexed as a separate entry. Complete subfield + indexing is most useful for fields which are typically browsed (eg. + titles, authors, or subjects), or instances where a match on a + complete subfield is essential (eg. exact title searching). For fields + where completeness is disabled, the search engine will interpret a + search containing space characters as a word proximity search. + + + + charmap filename + + + This is the filename of the character + map to be used for this index for field type. + + + + + - - - - uppercase value-set - - - This directive introduces the - upper-case equivalencis to the value set (if any). The number and - order of the entries in the list should be the same as in the - lowercase directive. - - - - space value-set - - - This directive introduces the character - which separate words in the input stream. Depending on the - completeness mode of the field in question, these characters either - terminate an index entry, or delimit individual "words" in - the input stream. The order of the elements is not significant — - otherwise the representation is the same as for the - uppercase and lowercase - directives. - - - - map value-set - target - - - This directive introduces a - mapping between each of the members of the value-set on the left to - the character on the right. The character on the right must occur in - the value set (the lowercase directive) of - the character set, but - it may be a paranthesis-enclosed multi-octet character. This directive - may be used to map diacritics to their base characters, or to map - HTML-style character-representations to their natural form, etc. - - - - + + The character map file format + + The contents of the character map files are structured as follows: + + + + + + lowercase value-set + + + This directive introduces the basic value set of the field type. + The format is an ordered list (without spaces) of the + characters which may occur in "words" of the given type. + The order of the entries in the list determines the + sort order of the index. In addition to single characters, the + following combinations are legal: + + + + + + + + Backslashes may be used to introduce three-digit octal, or + two-digit hex representations of single characters + (preceded by x). + In addition, the combinations + \\, \\r, \\n, \\t, \\s (space — remember that real + space-characters may not occur in the value definition), and + \\ are recognized, with their usual interpretation. + + + + + + Curly braces {} may be used to enclose ranges of single + characters (possibly using the escape convention described in the + preceding point), eg. {a-z} to introduce the + standard range of ASCII characters. + Note that the interpretation of such a range depends on + the concrete representation in your local, physical character set. + + + + + + paranthesises () may be used to enclose multi-byte characters - + eg. diacritics or special national combinations (eg. Spanish + "ll"). When found in the input stream (or a search term), + these characters are viewed and sorted as a single character, with a + sorting value depending on the position of the group in the value + statement. + + + + + + + + + uppercase value-set + + + This directive introduces the + upper-case equivalencis to the value set (if any). The number and + order of the entries in the list should be the same as in the + lowercase directive. + + + + space value-set + + + This directive introduces the character + which separate words in the input stream. Depending on the + completeness mode of the field in question, these characters either + terminate an index entry, or delimit individual "words" in + the input stream. The order of the elements is not significant — + otherwise the representation is the same as for the + uppercase and lowercase + directives. + + + + map value-set + target + + + This directive introduces a + mapping between each of the members of the value-set on the left to + the character on the right. The character on the right must occur in + the value set (the lowercase directive) of + the character set, but + it may be a paranthesis-enclosed multi-octet character. This directive + may be used to map diacritics to their base characters, or to map + HTML-style character-representations to their natural form, etc. The map directive + can also be used to ignore leading articles in searching and/or sorting, and to perform + other special transformations. See section . + + + + + + + Ignoring leading articles + + In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, + you can also use the character map files to make Zebra ignore leading articles in sorting + records, or when doing complete field searching. + + + This is done using the map directive in the character map file. In a + nutshell, what you do is map certain sequences of characters, when they occur + in the beginning of a field, to a space. Assuming that the character "@" is + defined as a space character in your file, you can do: + + map (^The\s) @ + map (^the\s) @ + + The effect of these directives is to map either 'the' or 'The', followed by a space + character, to a space. The hat ^ character denotes beginning-of-field only when + complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just + as any other character. + + + Because the default.idx file can be used to associate different + character maps with different indexing types -- and you can create additional indexing + types, should the need arise -- it is possible to specify that leading articles should be + ignored either in sorting, in complete-field searching, or both. + + + If you ignore certain prefixes in sorting, then these will be eliminated from the index, + and sorting will take place as if they weren't there. However, if you set the system up + to ignore certain prefixes in searching, then these are deleted both + from the indexes and from query terms, when the client specifies complete-field + searching. This has the effect that a search for 'the science journal' and 'science + journal' would both produce the same results. + + -