X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Frecordmodel.xml;h=5846b4bce904a8290a81ac82c82ca3fe2922b405;hp=efd69da250ac5932a740e3f59e2759272422cf9c;hb=7c3a0352f0492609a3b6b26b63a72b0b2d207aab;hpb=f4bb896e485ca3ce8f3b14d5199f79ba90f6b2f0 diff --git a/doc/recordmodel.xml b/doc/recordmodel.xml index efd69da..5846b4b 100644 --- a/doc/recordmodel.xml +++ b/doc/recordmodel.xml @@ -1,5 +1,5 @@ - + The Record Model @@ -477,11 +477,12 @@ - begin type [parameter ... ] + begin type [parameter ... ] Begin a new - data element. The type is one of the following: + data element. The type is one of + the following: @@ -492,7 +493,7 @@ name of the schema that describes the structure of the record, eg. gils or wais (see below). The begin record call should precede - any other use of the begin statement. + any other use of the begin statement. @@ -512,7 +513,7 @@ Begin a new node in a variant tree. The parameters are - class type value. + class type value. @@ -521,7 +522,7 @@ - data + data parameter Create a data element. The concatenated arguments make @@ -530,28 +531,41 @@ the layout (whitespace) of the data should be retained for transmission. The option -element - tag wraps the data up in - the tag. + tag wraps the data up in + the tag. The use of the -element option is equivalent to - preceding the command with a begin - element command, and following - it with the end command. + preceding the command with a begin + element command, and following + it with the end command. - end [type] + end [type] Close a tagged element. If no parameter is given, the last element on the stack is terminated. The first parameter, if any, is a type name, similar - to the begin statement. - For the element type, a tag + to the begin statement. + For the element type, a tag name can be provided to terminate a specific tag. + + + unread no + + + Move the input pointer to the offset of first character that + match rule given by no. 
+ The first rule from left-to-right is numbered zero, + the second rule is named 1 and so on. + + + + @@ -571,23 +585,21 @@ /^Subject:/ BODY /$/ { data -element title $1 } /^Date:/ BODY /$/ { data -element lastModified $1 } /\n\n/ BODY END { - begin element bodyOfDisplay - begin variant body iana "text/plain" - data -text $1 - end record + begin element bodyOfDisplay + begin variant body iana "text/plain" + data -text $1 + end record } - If Zebra is compiled with support for Tcl (Tool Command Language) - enabled, the statements described above are supplemented with a complete + If Zebra is compiled with support for Tcl enabled, the statements + described above are supplemented with a complete scripting environment, including control structures (conditional expressions and loop constructs), and powerful string manipulation - mechanisms for modifying the elements of a record. Tcl is a popular - scripting environment, with several tutorials available both online - and in hardcopy. + mechanisms for modifying the elements of a record. @@ -1183,23 +1195,23 @@ esetname G gils-g.est esetname F @ - elm (1,10) rank - - elm (1,12) url - - elm (1,14) localControlNumber Local-number - elm (1,16) dateOfLastModification Date/time-last-modified - elm (2,1) title w:!,p:! - elm (4,1) controlIdentifier Identifier-standard - elm (2,6) abstract Abstract - elm (4,51) purpose ! - elm (4,52) originator - - elm (4,53) accessConstraints ! - elm (4,54) useConstraints ! - elm (4,70) availability - - elm (4,70)/(4,90) distributor - - elm (4,70)/(4,90)/(2,7) distributorName ! - elm (4,70)/(4,90)/(2,10 distributorOrganization ! - elm (4,70)/(4,90)/(4,2) distributorStreetAddress ! - elm (4,70)/(4,90)/(4,3) distributorCity ! + elm (1,10) rank - + elm (1,12) url - + elm (1,14) localControlNumber Local-number + elm (1,16) dateOfLastModification Date/time-last-modified + elm (2,1) title w:!,p:! + elm (4,1) controlIdentifier Identifier-standard + elm (2,6) abstract Abstract + elm (4,51) purpose ! 
+ elm (4,52) originator - + elm (4,53) accessConstraints ! + elm (4,54) useConstraints ! + elm (4,70) availability - + elm (4,70)/(4,90) distributor - + elm (4,70)/(4,90)/(2,7) distributorName ! + elm (4,70)/(4,90)/(2,10) distributorOrganization ! + elm (4,70)/(4,90)/(4,2) distributorStreetAddress ! + elm (4,70)/(4,90)/(4,3) distributorCity ! @@ -1774,174 +1786,216 @@ special-purpose fields such as WWW-style linkages (URx). - - The field types, and hence character sets, are associated with data - elements by the .abs files (see above). - The file default.idx - provides the association between field type codes (as used in the .abs - files) and the character map files (with the .chr suffix). The format - of the .idx file is as follows - - - - - - - index field type code - - - This directive introduces a new search index code. - The argument is a one-character code to be used in the - .abs files to select this particular index type. An index, roughly, - corresponds to a particular structure attribute during search. Refer - to . - - - - sort field code type - - - This directive introduces a - sort index. The argument is a one-character code to be used in the - .abs fie to select this particular index type. The corresponding - use attribute must be used in the sort request to refer to this - particular sort index. The corresponding character map (see below) - is used in the sort process. - - - - completeness boolean - - - This directive enables or disables complete field indexing. - The value of the boolean should be 0 - (disable) or 1. If completeness is enabled, the index entry will - contain the complete contents of the field (up to a limit), with words - (non-space characters) separated by single space characters - (normalized to " " on display). When completeness is - disabled, each word is indexed as a separate entry. Complete subfield - indexing is most useful for fields which are typically browsed (eg. 
- titles, authors, or subjects), or instances where a match on a - complete subfield is essential (eg. exact title searching). For fields - where completeness is disabled, the search engine will interpret a - search containing space characters as a word proximity search. - - - - charmap filename - - - This is the filename of the character - map to be used for this index for field type. - - - - - - - The contents of the character map files are structured as follows: - - - - - - - lowercase value-set - - - This directive introduces the basic value set of the field type. - The format is an ordered list (without spaces) of the - characters which may occur in "words" of the given type. - The order of the entries in the list determines the - sort order of the index. In addition to single characters, the - following combinations are legal: - - - - - - - - Backslashes may be used to introduce three-digit octal, or - two-digit hex representations of single characters - (preceded by x). - In addition, the combinations - \\, \\r, \\n, \\t, \\s (space — remember that real - space-characters may not occur in the value definition), and - \\ are recognized, with their usual interpretation. - - - - - - Curly braces {} may be used to enclose ranges of single - characters (possibly using the escape convention described in the - preceding point), eg. {a-z} to introduce the - standard range of ASCII characters. - Note that the interpretation of such a range depends on - the concrete representation in your local, physical character set. - - - - - - paranthesises () may be used to enclose multi-byte characters - - eg. diacritics or special national combinations (eg. Spanish - "ll"). When found in the input stream (or a search term), - these characters are viewed and sorted as a single character, with a - sorting value depending on the position of the group in the value - statement. 
- - + + The default.idx file + + The field types, and hence character sets, are associated with data + elements by the .abs files (see above). + The file default.idx + provides the association between field type codes (as used in the .abs + files) and the character map files (with the .chr suffix). The format + of the .idx file is as follows + - + + + + + index field type code + + + This directive introduces a new search index code. + The argument is a one-character code to be used in the + .abs files to select this particular index type. An index, roughly, + corresponds to a particular structure attribute during search. Refer + to . + + + + sort field code type + + + This directive introduces a + sort index. The argument is a one-character code to be used in the + .abs file to select this particular index type. The corresponding + use attribute must be used in the sort request to refer to this + particular sort index. The corresponding character map (see below) + is used in the sort process. + + + + completeness boolean + + + This directive enables or disables complete field indexing. + The value of the boolean should be 0 + (disable) or 1. If completeness is enabled, the index entry will + contain the complete contents of the field (up to a limit), with words + (non-space characters) separated by single space characters + (normalized to " " on display). When completeness is + disabled, each word is indexed as a separate entry. Complete subfield + indexing is most useful for fields which are typically browsed (eg. + titles, authors, or subjects), or instances where a match on a + complete subfield is essential (eg. exact title searching). For fields + where completeness is disabled, the search engine will interpret a + search containing space characters as a word proximity search. + + + + charmap filename + + + This is the filename of the character + map to be used for this index for field type.
+ + + + + - - - - uppercase value-set - - - This directive introduces the - upper-case equivalencis to the value set (if any). The number and - order of the entries in the list should be the same as in the - lowercase directive. - - - - space value-set - - - This directive introduces the character - which separate words in the input stream. Depending on the - completeness mode of the field in question, these characters either - terminate an index entry, or delimit individual "words" in - the input stream. The order of the elements is not significant — - otherwise the representation is the same as for the - uppercase and lowercase - directives. - - - - map value-set - target - - - This directive introduces a - mapping between each of the members of the value-set on the left to - the character on the right. The character on the right must occur in - the value set (the lowercase directive) of - the character set, but - it may be a paranthesis-enclosed multi-octet character. This directive - may be used to map diacritics to their base characters, or to map - HTML-style character-representations to their natural form, etc. - - - - + + The character map file format + + The contents of the character map files are structured as follows: + + + + + + lowercase value-set + + + This directive introduces the basic value set of the field type. + The format is an ordered list (without spaces) of the + characters which may occur in "words" of the given type. + The order of the entries in the list determines the + sort order of the index. In addition to single characters, the + following combinations are legal: + + + + + + + + Backslashes may be used to introduce three-digit octal, or + two-digit hex representations of single characters + (preceded by x). + In addition, the combinations + \\, \\r, \\n, \\t, \\s (space — remember that real + space-characters may not occur in the value definition), and + \\ are recognized, with their usual interpretation. 
+ + + + + + Curly braces {} may be used to enclose ranges of single + characters (possibly using the escape convention described in the + preceding point), eg. {a-z} to introduce the + standard range of ASCII characters. + Note that the interpretation of such a range depends on + the concrete representation in your local, physical character set. + + + + + + Parentheses () may be used to enclose multi-byte characters - + eg. diacritics or special national combinations (eg. Spanish + "ll"). When found in the input stream (or a search term), + these characters are viewed and sorted as a single character, with a + sorting value depending on the position of the group in the value + statement. + + + + + + + + + uppercase value-set + + + This directive introduces the + upper-case equivalents to the value set (if any). The number and + order of the entries in the list should be the same as in the + lowercase directive. + + + + space value-set + + + This directive introduces the characters + that separate words in the input stream. Depending on the + completeness mode of the field in question, these characters either + terminate an index entry, or delimit individual "words" in + the input stream. The order of the elements is not significant; + otherwise the representation is the same as for the + uppercase and lowercase + directives. + + + + map value-set + target + + + This directive introduces a + mapping between each of the members of the value-set on the left to + the character on the right. The character on the right must occur in + the value set (the lowercase directive) of + the character set, but + it may be a parenthesis-enclosed multi-octet character. This directive + may be used to map diacritics to their base characters, or to map + HTML-style character-representations to their natural form, etc. The map directive + can also be used to ignore leading articles in searching and/or sorting, and to perform + other special transformations. See section .
+ + + + + + + Ignoring leading articles + + In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, + you can also use the character map files to make Zebra ignore leading articles in sorting + records, or when doing complete field searching. + + + This is done using the map directive in the character map file. In a + nutshell, what you do is map certain sequences of characters, when they occur + in the beginning of a field, to a space. Assuming that the character "@" is + defined as a space character in your file, you can do: + + map (^The\s) @ + map (^the\s) @ + + The effect of these directives is to map either 'the' or 'The', followed by a space + character, to a space. The hat ^ character denotes beginning-of-field only when + complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just + as any other character. + + + Because the default.idx file can be used to associate different + character maps with different indexing types -- and you can create additional indexing + types, should the need arise -- it is possible to specify that leading articles should be + ignored either in sorting, in complete-field searching, or both. + + + If you ignore certain prefixes in sorting, then these will be eliminated from the index, + and sorting will take place as if they weren't there. However, if you set the system up + to ignore certain prefixes in searching, then these are deleted both + from the indexes and from query terms, when the client specifies complete-field + searching. This has the effect that a search for 'the science journal' and 'science + journal' would both produce the same results. + + -
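The effect described in the "Ignoring leading articles" section can be illustrated outside Zebra with a small Python sketch. This is not Zebra's implementation: the function name `complete_field_key`, the sample titles, and the assumptions that the character map folds case and collapses runs of space characters are all hypothetical, chosen only to mimic what the two `map` directives are said to do during complete-field or sort indexing.

```python
import re

# Hypothetical stand-ins for the two directives quoted in the text:
#   map (^The\s) @   and   map (^the\s) @
# "^" anchors at the beginning of the field, as the text says it does
# during complete-subfield or sort indexing.
LEADING_ARTICLES = [re.compile(r"^The\s"), re.compile(r"^the\s")]


def complete_field_key(field: str) -> str:
    """Return an illustrative complete-field key for one field value."""
    for pattern in LEADING_ARTICLES:
        # Map the article plus its trailing space to a single space.
        field = pattern.sub(" ", field, count=1)
    # Assumption: the character map folds case, and runs of space
    # characters collapse to one, with leading/trailing spaces dropped.
    return " ".join(field.split()).lower()


titles = ["The Science Journal", "Nature", "the Lancet", "Theory of Games"]

# Sorting by the key places titles as if the leading article were absent.
sorted_titles = sorted(titles, key=complete_field_key)
```

In this sketch a query term and an indexed field are normalized the same way, which is why a search for 'the science journal' and one for 'science journal' would land on the same key. "Theory of Games" is left alone because the pattern requires a space character after the article.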