X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Frecordmodel.xml;h=5846b4bce904a8290a81ac82c82ca3fe2922b405;hp=8ad29e8c01e62584efbabadceecbdfd62d19b06f;hb=7c3a0352f0492609a3b6b26b63a72b0b2d207aab;hpb=d2e692248eac6469ef7a3a3f8044010cb5cc1da7 diff --git a/doc/recordmodel.xml b/doc/recordmodel.xml index 8ad29e8..5846b4b 100644 --- a/doc/recordmodel.xml +++ b/doc/recordmodel.xml @@ -1,5 +1,5 @@ - + The Record Model @@ -33,7 +33,7 @@ - + When records are accessed by the system, they are represented in their local, or native format. This might be SGML or HTML files, @@ -477,13 +477,14 @@ - begin type [parameter ... ] + begin type [parameter ... ] Begin a new - data element. The type is one of the following: + data element. The type is one of + the following: - + record @@ -492,7 +493,7 @@ name of the schema that describes the structure of the record, eg. gils or wais (see below). The begin record call should precede - any other use of the begin statement. + any other use of the begin statement. @@ -512,7 +513,7 @@ Begin a new node in a variant tree. The parameters are - class type value. + class type value. @@ -521,7 +522,7 @@ - data + data parameter Create a data element. The concatenated arguments make @@ -530,28 +531,41 @@ the layout (whitespace) of the data should be retained for transmission. The option -element - tag wraps the data up in - the tag. + tag wraps the data up in + the tag. The use of the -element option is equivalent to - preceding the command with a begin - element command, and following - it with the end command. + preceding the command with a begin + element command, and following + it with the end command. - end [type] + end [type] Close a tagged element. If no parameter is given, the last element on the stack is terminated. The first parameter, if any, is a type name, similar - to the begin statement. - For the element type, a tag + to the begin statement. 
+ For the element type, a tag name can be provided to terminate a specific tag. + + + unread no + + + Move the input pointer to the offset of the first character that + matches the rule given by no. + The first rule from left to right is numbered zero, + the second rule is numbered 1, and so on. + + + + @@ -571,23 +585,21 @@ /^Subject:/ BODY /$/ { data -element title $1 } /^Date:/ BODY /$/ { data -element lastModified $1 } /\n\n/ BODY END { - begin element bodyOfDisplay - begin variant body iana "text/plain" - data -text $1 - end record + begin element bodyOfDisplay + begin variant body iana "text/plain" + data -text $1 + end record } - If Zebra is compiled with support for Tcl (Tool Command Language) - enabled, the statements described above are supplemented with a complete + If Zebra is compiled with support for Tcl enabled, the statements + described above are supplemented with a complete scripting environment, including control structures (conditional expressions and loop constructs), and powerful string manipulation - mechanisms for modifying the elements of a record. Tcl is a popular - scripting environment, with several tutorials available both online - and in hardcopy. + mechanisms for modifying the elements of a record. @@ -692,35 +704,35 @@ Which of the two elements are transmitted to the client by the server depends on the specifications provided by the client, if any. - + In practice, each variant node is associated with a triple of class, type, value, corresponding to the variant mechanism of Z39.50. - + - + Data Elements - + Data nodes have no children (they are always leaf nodes in the record tree). - + - + - + Configuring Your Data Model - + The following sections describe the configuration files that govern the internal management of data records. The system searches for the files @@ -770,7 +782,7 @@ known. - + The variant set which is used in the profile. 
This provides a @@ -884,12 +896,12 @@ The file may contain the following directives: - + - + - name symbolic-name + name symbolic-name (m) This provides a shorthand name or @@ -898,17 +910,17 @@ - reference OID-name + reference OID-name (m) The reference name of the OID for the profile. The reference names can be found in the util - module of YAZ. + module of YAZ. - attset filename + attset filename (m) The attribute set that is used for @@ -917,7 +929,7 @@ - tagset filename + tagset filename (o) The tag set (if any) that describe @@ -926,7 +938,7 @@ - varset filename + varset filename (o) The variant set used in the profile. @@ -934,25 +946,27 @@ - maptab filename + maptab filename (o,r) This points to a conversion table that might be used if the client asks for the record in a different schema from the native one. - + + - marc filename + marc filename (o) Points to a file containing parameters - for representing the record contents in the ISO2709 syntax. Read the - description of the MARC representation facility below. + for representing the record contents in the ISO2709 syntax. + Read the description of the MARC representation facility below. - + + - esetname name filename + esetname name filename (o,r) Associates the @@ -960,9 +974,10 @@ given in place of the filename, this corresponds to a null mapping for the given element set name. - + + - any tags + any tags (o) This directive specifies a list of attributes @@ -972,49 +987,74 @@ provides an efficient way of supporting free-text searching across all elements. However, it does increase the size of the index significantly. The attributes can be qualified with a structure, as in - the elm directive below. + the elm directive below. - + + - elm path name attributes + elm path name attributes (o,r) Adds an element to the abstract record syntax of the schema. - The path follows the + The path follows the syntax which is suggested by the Z39.50 document - that is, a sequence - of tags separated by slashes (/). 
Each tag is given as a + of tags separated by slashes (/). Each tag is given as a comma-separated pair of tag type and -value surrounded by parenthesis. - The name is the name of the element, and - the attributes + The name is the name of the element, and + the attributes specifies which attributes to use when indexing the element in a comma-separated list. A ! in place of the attribute name is equivalent to specifying an attribute name identical to the element name. A - in place of the attribute name specifies that no indexing is to take place for the given element. - The attributes can be qualified with field - types to specify which + The attributes can be qualified with field + types to specify which character set should govern the indexing procedure for that field. The same data element may be indexed into several different fields, using different character set definitions. See the . - The default field type is "w" for word. + The default field type is w for + word. - + + + + + xelm xpath attributes + + + Specifies indexing for record nodes given by + xpath. Unlike the directive + elm, this directive allows you to index attribute + contents. The xpath uses + a syntax similar to XPath. The attributes + have the same syntax and meaning as in the elm directive, except that the operator + ! refers to the nodes selected by xpath. + + + + + - encoding encodingname + encoding encodingname This directive specifies character encoding for external records. For records such as XML that specifies encoding within the file via a header this directive is ignored. If neither this directive is given, nor an encoding is set - within external records, ISO-8859-1 encoding is assmed. + within external records, ISO-8859-1 encoding is assumed. 
- xpath enable/disable + xpath enable/disable If this directive is followed by enable, @@ -1024,6 +1064,103 @@ + + + + + + + systag + systemTag + actualTag + + + + Specifies what information, if any, Zebra should + automatically include in retrieval records for the + ``system fields'' that it supports. + systemTag may + be any of the following: + + + rank + + An integer indicating the relevance-ranking score + assigned to the record. + + + + sysno + + An automatically generated identifier for the record, + unique within this database. It is represented by the + <localControlNumber> element in + XML and the (1,14) tag in GRS-1. + + + + size + + The size, in bytes, of the retrieved record. + + + + + + The actualTag parameter may be + none to indicate that the named element + should be omitted from retrieval records. + + + @@ -1058,23 +1195,23 @@ esetname G gils-g.est esetname F @ - elm (1,10) rank - - elm (1,12) url - - elm (1,14) localControlNumber Local-number - elm (1,16) dateOfLastModification Date/time-last-modified - elm (2,1) title w:!,p:! - elm (4,1) controlIdentifier Identifier-standard - elm (2,6) abstract Abstract - elm (4,51) purpose ! - elm (4,52) originator - - elm (4,53) accessConstraints ! - elm (4,54) useConstraints ! - elm (4,70) availability - - elm (4,70)/(4,90) distributor - - elm (4,70)/(4,90)/(2,7) distributorName ! - elm (4,70)/(4,90)/(2,10 distributorOrganization ! - elm (4,70)/(4,90)/(4,2) distributorStreetAddress ! - elm (4,70)/(4,90)/(4,3) distributorCity ! + elm (1,10) rank - + elm (1,12) url - + elm (1,14) localControlNumber Local-number + elm (1,16) dateOfLastModification Date/time-last-modified + elm (2,1) title w:!,p:! + elm (4,1) controlIdentifier Identifier-standard + elm (2,6) abstract Abstract + elm (4,51) purpose ! + elm (4,52) originator - + elm (4,53) accessConstraints ! + elm (4,54) useConstraints ! + elm (4,70) availability - + elm (4,70)/(4,90) distributor - + elm (4,70)/(4,90)/(2,7) distributorName ! 
+ elm (4,70)/(4,90)/(2,10) distributorOrganization ! + elm (4,70)/(4,90)/(4,2) distributorStreetAddress ! + elm (4,70)/(4,90)/(4,3) distributorCity ! @@ -1085,7 +1222,7 @@ The Attribute Set (.att) Files - This file type describes the Use elements of + This file type describes the Use elements of an attribute set. It contains the following directives. @@ -1093,7 +1230,7 @@ - name symbolic-name + name symbolic-name (m) This provides a shorthand name or @@ -1102,24 +1239,24 @@ - reference OID-name + reference OID-name (m) The reference name of the OID for the attribute set. - The reference names can be found in the util - module of YAZ. + The reference names can be found in the util + module of YAZ. - include filename + include filename (o,r) This directive is used to include another attribute set as a part of the current one. This is used when a new attribute set is defined as an extension to another set. For instance, many new attribute sets are defined as extensions - to the bib-1 set. + to the bib-1 set. This is an important feature of the retrieval system of Z39.50, as it ensures the highest possible level of interoperability, as those access points of your database which are @@ -1129,15 +1266,15 @@ att - att-value att-name [local-value] + att-value att-name [local-value] (o,r) This repeatable directive introduces a new attribute to the set. The attribute value is stored in the index (unless a - local-value is + local-value is given, in which case this is stored). The name is used to refer to the - attribute from the abstract syntax. + attribute from the abstract syntax. @@ -1649,174 +1786,216 @@ special-purpose fields such as WWW-style linkages (URx). - - The field types, and hence character sets, are associated with data - elements by the .abs files (see above). - The file default.idx - provides the association between field type codes (as used in the .abs - files) and the character map files (with the .chr suffix). 
The format - of the .idx file is as follows - - - - - - - index field type code - - - This directive introduces a new search index code. - The argument is a one-character code to be used in the - .abs files to select this particular index type. An index, roughly, - corresponds to a particular structure attribute during search. Refer - to . - - - - sort field code type - - - This directive introduces a - sort index. The argument is a one-character code to be used in the - .abs fie to select this particular index type. The corresponding - use attribute must be used in the sort request to refer to this - particular sort index. The corresponding character map (see below) - is used in the sort process. - - - - completeness boolean - - - This directive enables or disables complete field indexing. - The value of the boolean should be 0 - (disable) or 1. If completeness is enabled, the index entry will - contain the complete contents of the field (up to a limit), with words - (non-space characters) separated by single space characters - (normalized to " " on display). When completeness is - disabled, each word is indexed as a separate entry. Complete subfield - indexing is most useful for fields which are typically browsed (eg. - titles, authors, or subjects), or instances where a match on a - complete subfield is essential (eg. exact title searching). For fields - where completeness is disabled, the search engine will interpret a - search containing space characters as a word proximity search. - - - - charmap filename - - - This is the filename of the character - map to be used for this index for field type. - - - - - - - The contents of the character map files are structured as follows: - - - - - - - lowercase value-set - - - This directive introduces the basic value set of the field type. - The format is an ordered list (without spaces) of the - characters which may occur in "words" of the given type. 
- The order of the entries in the list determines the - sort order of the index. In addition to single characters, the - following combinations are legal: - - - - - - - - Backslashes may be used to introduce three-digit octal, or - two-digit hex representations of single characters - (preceded by x). - In addition, the combinations - \\, \\r, \\n, \\t, \\s (space — remember that real - space-characters may not occur in the value definition), and - \\ are recognized, with their usual interpretation. - - - - - - Curly braces {} may be used to enclose ranges of single - characters (possibly using the escape convention described in the - preceding point), eg. {a-z} to introduce the - standard range of ASCII characters. - Note that the interpretation of such a range depends on - the concrete representation in your local, physical character set. - - - - - - paranthesises () may be used to enclose multi-byte characters - - eg. diacritics or special national combinations (eg. Spanish - "ll"). When found in the input stream (or a search term), - these characters are viewed and sorted as a single character, with a - sorting value depending on the position of the group in the value - statement. - - + + The default.idx file + + The field types, and hence character sets, are associated with data + elements by the .abs files (see above). + The file default.idx + provides the association between field type codes (as used in the .abs + files) and the character map files (with the .chr suffix). The format + of the .idx file is as follows + - + + + + + index field type code + + + This directive introduces a new search index code. + The argument is a one-character code to be used in the + .abs files to select this particular index type. An index, roughly, + corresponds to a particular structure attribute during search. Refer + to . + + + + sort field code type + + + This directive introduces a + sort index. 
The argument is a one-character code to be used in the + .abs file to select this particular index type. The corresponding + use attribute must be used in the sort request to refer to this + particular sort index. The corresponding character map (see below) + is used in the sort process. + + + + completeness boolean + + + This directive enables or disables complete field indexing. + The value of the boolean should be 0 + (disable) or 1 (enable). If completeness is enabled, the index entry will + contain the complete contents of the field (up to a limit), with words + (non-space characters) separated by single space characters + (normalized to " " on display). When completeness is + disabled, each word is indexed as a separate entry. Complete subfield + indexing is most useful for fields which are typically browsed (eg. + titles, authors, or subjects), or instances where a match on a + complete subfield is essential (eg. exact title searching). For fields + where completeness is disabled, the search engine will interpret a + search containing space characters as a word proximity search. + + + + charmap filename + + + This is the filename of the character + map to be used for this index for field type. + + + + + - - - - uppercase value-set - - - This directive introduces the - upper-case equivalencis to the value set (if any). The number and - order of the entries in the list should be the same as in the - lowercase directive. - - - - space value-set - - - This directive introduces the character - which separate words in the input stream. Depending on the - completeness mode of the field in question, these characters either - terminate an index entry, or delimit individual "words" in - the input stream. The order of the elements is not significant — - otherwise the representation is the same as for the - uppercase and lowercase - directives. 
- - - - map value-set - target - - - This directive introduces a - mapping between each of the members of the value-set on the left to - the character on the right. The character on the right must occur in - the value set (the lowercase directive) of - the character set, but - it may be a paranthesis-enclosed multi-octet character. This directive - may be used to map diacritics to their base characters, or to map - HTML-style character-representations to their natural form, etc. - - - - + + The character map file format + + The contents of the character map files are structured as follows: + + + + + + lowercase value-set + + + This directive introduces the basic value set of the field type. + The format is an ordered list (without spaces) of the + characters which may occur in "words" of the given type. + The order of the entries in the list determines the + sort order of the index. In addition to single characters, the + following combinations are legal: + + + + + + + + Backslashes may be used to introduce three-digit octal, or + two-digit hex representations of single characters + (preceded by x). + In addition, the combinations + \\, \\r, \\n, \\t, \\s (space; remember that real + space-characters may not occur in the value definition), and + \\ are recognized, with their usual interpretation. + + + + + + Curly braces {} may be used to enclose ranges of single + characters (possibly using the escape convention described in the + preceding point), eg. {a-z} to introduce the + standard range of ASCII characters. + Note that the interpretation of such a range depends on + the concrete representation in your local, physical character set. + + + + + + Parentheses () may be used to enclose multi-byte characters - + eg. diacritics or special national combinations (eg. Spanish + "ll"). 
When found in the input stream (or a search term), + these characters are viewed and sorted as a single character, with a + sorting value depending on the position of the group in the value + statement. + + + + + + + + + uppercase value-set + + + This directive introduces the + upper-case equivalents of the value set (if any). The number and + order of the entries in the list should be the same as in the + lowercase directive. + + + + space value-set + + + This directive introduces the characters + which separate words in the input stream. Depending on the + completeness mode of the field in question, these characters either + terminate an index entry, or delimit individual "words" in + the input stream. The order of the elements is not significant; + otherwise the representation is the same as for the + uppercase and lowercase + directives. + + + + map value-set + target + + + This directive introduces a + mapping between each of the members of the value-set on the left to + the character on the right. The character on the right must occur in + the value set (the lowercase directive) of + the character set, but + it may be a parenthesis-enclosed multi-octet character. This directive + may be used to map diacritics to their base characters, or to map + HTML-style character-representations to their natural form, etc. The map directive + can also be used to ignore leading articles in searching and/or sorting, and to perform + other special transformations. See section . + + + + + + + Ignoring leading articles + + In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, + you can also use the character map files to make Zebra ignore leading articles in sorting + records, or when doing complete field searching. + + + This is done using the map directive in the character map file. In a + nutshell, what you do is map certain sequences of characters, when they occur + at the beginning of a field, to a space. 
Assuming that the character "@" is + defined as a space character in your file, you can do: + + map (^The\s) @ + map (^the\s) @ + + The effect of these directives is to map either 'the' or 'The', followed by a space + character, to a space. The hat ^ character denotes beginning-of-field only when + complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just + as any other character. + + + Because the default.idx file can be used to associate different + character maps with different indexing types -- and you can create additional indexing + types, should the need arise -- it is possible to specify that leading articles should be + ignored either in sorting, in complete-field searching, or both. + + + If you ignore certain prefixes in sorting, then these will be eliminated from the index, + and sorting will take place as if they weren't there. However, if you set the system up + to ignore certain prefixes in searching, then these are deleted both + from the indexes and from query terms, when the client specifies complete-field + searching. This has the effect that a search for 'the science journal' and 'science + journal' would both produce the same results. + + -
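The effect of the two map directives added in the last hunk can be illustrated with a small Python sketch. This is not Zebra's implementation: the function name normalize_field, the whitespace collapsing, and the lower-case folding are assumptions made only to show how mapping a leading "The " or "the " to a space makes two titles compare equal in a complete-field or sort index.

```python
import re

# Hypothetical stand-ins for "map (^The\s) @" and "map (^the\s) @";
# "@" is assumed to be declared as a space character in the .chr file.
ARTICLE_RULES = [re.compile(r"^The\s"), re.compile(r"^the\s")]


def normalize_field(value: str) -> str:
    """Return an illustrative complete-field key for value (assumed behaviour)."""
    for rule in ARTICLE_RULES:
        # "^" anchors at beginning-of-field, which the text says only applies
        # for complete-subfield or sort indexing.
        value = rule.sub(" ", value, count=1)
    # Assumption: runs of space characters collapse and case is folded,
    # roughly what the space and lowercase/uppercase directives provide.
    return " ".join(value.split()).lower()


titles = ["The Science Journal", "Science Journal", "Nature", "the Lancet"]
# Sorting by the normalized key places "the Lancet" under "l", not "t".
print(sorted(titles, key=normalize_field))
```

With these assumed rules, 'the science journal' and 'science journal' produce the same key, which matches the behaviour described for complete-field searching. A word that merely starts with "The", such as "Theory", is left alone because the pattern requires a following space character.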