X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Frecordmodel.xml;h=e30bf103cefe813645f38d204020f45497a9b452;hb=3d775a219c4cc3382851dd82174680724a2f3202;hp=b438f33a51c3e678833fce486a27a640ac8e288a;hpb=49f49aa27c8d63cea66dfb5a9e06e5735c835509;p=idzebra-moved-to-github.git diff --git a/doc/recordmodel.xml b/doc/recordmodel.xml index b438f33..e30bf10 100644 --- a/doc/recordmodel.xml +++ b/doc/recordmodel.xml @@ -1,105 +1,20 @@ - - The Record Model + + GRS Record Model and Filter Modules - - The Zebra system is designed to support a wide range of data management - applications. The system can be configured to handle virtually any - kind of structured data. Each record in the system is associated with - a record schema which lends context to the data - elements of the record. - Any number of record schemas can coexist in the system. - Although it may be wise to use only a single schema within - one database, the system poses no such restrictions. - The record model described in this chapter applies to the fundamental, structured record type grs, introduced in - . - + . - - Records pass through three different states during processing in the - system. - - - - - - - - - When records are accessed by the system, they are represented - in their local, or native format. This might be SGML or HTML files, - News or Mail archives, MARC records. If the system doesn't already - know how to read the type of data you need to store, you can set up an - input filter by preparing conversion rules based on regular - expressions and possibly augmented by a flexible scripting language - (Tcl). - The input filter produces as output an internal representation, - a tree structure. - - - - - - - When records are processed by the system, they are represented - in a tree-structure, constructed by tagged data elements hanging off a - root node. The tagged elements may contain data or yet more tagged - elements in a recursive structure. 
The system performs various - actions on this tree structure (indexing, element selection, schema - mapping, etc.), - - - - - - - Before transmitting records to the client, they are first - converted from the internal structure to a form suitable for exchange - over the network - according to the Z39.50 standard. - - - - - - - - - Local Representation - - - As mentioned earlier, Zebra places few restrictions on the type of - data that you can index and manage. Generally, whatever the form of - the data, it is parsed by an input filter specific to that format, and - turned into an internal structure that Zebra knows how to handle. This - process takes place whenever the record is accessed - for indexing and - retrieval. - - - - The RecordType parameter in the zebra.cfg file, or - the -t option to the indexer tells Zebra how to - process input records. - Two basic types of processing are available - raw text and structured - data. Raw text is just that, and it is selected by providing the - argument text to Zebra. Structured records are - all handled internally using the basic mechanisms described in the - subsequent sections. - Zebra can read structured records in many different formats. - How this is done is governed by additional parameters after the - "grs" keyword, separated by "." characters. - + + GRS Record Filters - Four basic subtypes to the grs type are + Many basic subtypes of the grs type are currently available: @@ -109,38 +24,62 @@ grs.sgml - This is the canonical input format — - described below. It is a simple SGML-like syntax. + This is the canonical input format + described . It is using + simple SGML-like syntax. + - grs.regx.filter + grs.marc - This enables a user-supplied input - filter. The mechanisms of these filters are described below. + This allows Zebra to read + records in the ISO2709 (MARC) encoding standard. 
+ + + The loadable grs.marc filter module + is packaged in the GNU/Debian package + libidzebra1.4-mod-grs-marc + - grs.tcl.filter + grs.marcxml - Similar to grs.regx but using Tcl for rules. + This allows Zebra to read + records in the ISO2709??? (MARCXML) encoding standard. + + The loadable grs.marcxml filter module + is also contained in the GNU/Debian package + libidzebra1.4-mod-grs-marc + - grs.marc.abstract syntax + grs.danbib - This allows Zebra to read - records in the ISO2709 (MARC) encoding standard. In this case, the - last parameter abstract syntax names the - .abs file (see below) - which describes the specific MARC structure of the input record as - well as the indexing rules. + The grs.danbib filter parses DanBib + records, a danish MARC record variant called DANMARC. + DanBib is the Danish Union Catalogue hosted by the + Danish Bibliographic Centre (DBC). + + The loadable grs.danbib filter module + is packages in the GNU/Debian package + libidzebra1.4-mod-grs-danbib. @@ -148,18 +87,55 @@ grs.xml - This filter reads XML records. Only one record per file + This filter reads XML records and uses Expat to + parse them and convert them into IDZebra's internal + grs record model. + Only one record per file is supported. The filter is only available if Zebra/YAZ is compiled with EXPAT support. + + The loadable grs.xml filter module + is packagged in the GNU/Debian package + libidzebra1.4-mod-grs-xml + + + + + grs.regx + + + This enables a user-supplied Regular Expressions input + filter described in + . + + + The loadable grs.regx filter module + is packaged in the GNU/Debian package + libidzebra1.4-mod-grs-regx + + + + + grs.tcl + + + Similar to grs.regx but using Tcl for rules, described in + . 
+ + + The loadable grs.tcl filter module + is also packaged in the GNU/Debian package + libidzebra1.4-mod-grs-regx + - - Canonical Input Format + + GRS Canonical Input Format Although input data can take any form, it is sometimes useful to @@ -239,7 +215,7 @@ makes up the total record. In the canonical input format, the root tag should contain the name of the schema that lends context to the elements of the record - (see ). + (see ). The following is a GILS record that contains only a single element (strictly speaking, that makes it an illegal GILS record, since the GILS profile includes several mandatory @@ -359,8 +335,8 @@ - - Input Filters + + GRS REGX And TCL Input Filters In order to handle general input formats, Zebra allows the @@ -477,11 +453,12 @@ - begin type [parameter ... ] + begin type [parameter ... ] Begin a new - data element. The type is one of the following: + data element. The type is one of + the following: @@ -492,7 +469,7 @@ name of the schema that describes the structure of the record, eg. gils or wais (see below). The begin record call should precede - any other use of the begin statement. + any other use of the begin statement. @@ -512,7 +489,7 @@ Begin a new node in a variant tree. The parameters are - class type value. + class type value. @@ -521,7 +498,7 @@ - data + data parameter Create a data element. The concatenated arguments make @@ -530,28 +507,41 @@ the layout (whitespace) of the data should be retained for transmission. The option -element - tag wraps the data up in - the tag. + tag wraps the data up in + the tag. The use of the -element option is equivalent to - preceding the command with a begin - element command, and following - it with the end command. + preceding the command with a begin + element command, and following + it with the end command. - end [type] + end [type] Close a tagged element. If no parameter is given, the last element on the stack is terminated. 
The first parameter, if any, is a type name, similar - to the begin statement. - For the element type, a tag + to the begin statement. + For the element type, a tag name can be provided to terminate a specific tag. + + + unread no + + + Move the input pointer to the offset of first character that + match rule given by no. + The first rule from left-to-right is numbered zero, + the second rule is named 1 and so on. + + + + @@ -571,31 +561,29 @@ /^Subject:/ BODY /$/ { data -element title $1 } /^Date:/ BODY /$/ { data -element lastModified $1 } /\n\n/ BODY END { - begin element bodyOfDisplay - begin variant body iana "text/plain" - data -text $1 - end record + begin element bodyOfDisplay + begin variant body iana "text/plain" + data -text $1 + end record } - If Zebra is compiled with support for Tcl (Tool Command Language) - enabled, the statements described above are supplemented with a complete + If Zebra is compiled with support for Tcl enabled, the statements + described above are supplemented with a complete scripting environment, including control structures (conditional expressions and loop constructs), and powerful string manipulation - mechanisms for modifying the elements of a record. Tcl is a popular - scripting environment, with several tutorials available both online - and in hardcopy. + mechanisms for modifying the elements of a record. - - Internal Representation + + GRS Internal Record Representation When records are manipulated by the system, they're represented in a @@ -718,12 +706,13 @@ - - Configuring Your Data Model + + GRS Record Model Configuration The following sections describe the configuration files that govern - the internal management of data records. The system searches for the files + the internal management of grs records. + The system searches for the files in the directories specified by the profilePath setting in the zebra.cfg file. @@ -1017,8 +1006,8 @@ elm, this directive allows you to index attribute contents. 
The xpath uses a syntax similar to XPath. The attributes - have same syntax and meaning as directive elm, except that ! - refers to the nodes selected by xpath. + have same syntax and meaning as directive elm, except that operator + ! refers to the nodes selected by xpath. + + + + + systag + systemTag + actualTag + + + + Specifies what information, if any, Zebra should + automatically include in retrieval records for the + ``system fields'' that it supports. + systemTag may + be any of the following: + + + rank + + An integer indicating the relevance-ranking score + assigned to the record. + + + + sysno + + An automatically generated identifier for the record, + unique within this database. It is represented by the + <localControlNumber> element in + XML and the (1,14) tag in GRS-1. + + + + size + + The size, in bytes, of the retrieved record. + + + + + + The actualTag parameter may be + none to indicate that the named element + should be omitted from retrieval records. + + + @@ -1086,23 +1187,23 @@ esetname G gils-g.est esetname F @ - elm (1,10) rank - - elm (1,12) url - - elm (1,14) localControlNumber Local-number - elm (1,16) dateOfLastModification Date/time-last-modified - elm (2,1) title w:!,p:! - elm (4,1) controlIdentifier Identifier-standard - elm (2,6) abstract Abstract - elm (4,51) purpose ! - elm (4,52) originator - - elm (4,53) accessConstraints ! - elm (4,54) useConstraints ! - elm (4,70) availability - - elm (4,70)/(4,90) distributor - - elm (4,70)/(4,90)/(2,7) distributorName ! - elm (4,70)/(4,90)/(2,10 distributorOrganization ! - elm (4,70)/(4,90)/(4,2) distributorStreetAddress ! - elm (4,70)/(4,90)/(4,3) distributorCity ! + elm (1,10) rank - + elm (1,12) url - + elm (1,14) localControlNumber Local-number + elm (1,16) dateOfLastModification Date/time-last-modified + elm (2,1) title w:!,p:! + elm (4,1) controlIdentifier Identifier-standard + elm (2,6) abstract Abstract + elm (4,51) purpose ! + elm (4,52) originator - + elm (4,53) accessConstraints ! 
+ elm (4,54) useConstraints ! + elm (4,70) availability - + elm (4,70)/(4,90) distributor - + elm (4,70)/(4,90)/(2,7) distributorName ! + elm (4,70)/(4,90)/(2,10) distributorOrganization ! + elm (4,70)/(4,90)/(4,2) distributorStreetAddress ! + elm (4,70)/(4,90)/(4,3) distributorCity ! @@ -1677,181 +1778,233 @@ special-purpose fields such as WWW-style linkages (URx). - - The field types, and hence character sets, are associated with data - elements by the .abs files (see above). - The file default.idx - provides the association between field type codes (as used in the .abs - files) and the character map files (with the .chr suffix). The format - of the .idx file is as follows - - - - - - - index field type code - - - This directive introduces a new search index code. - The argument is a one-character code to be used in the - .abs files to select this particular index type. An index, roughly, - corresponds to a particular structure attribute during search. Refer - to . - - - - sort field code type - - - This directive introduces a - sort index. The argument is a one-character code to be used in the - .abs fie to select this particular index type. The corresponding - use attribute must be used in the sort request to refer to this - particular sort index. The corresponding character map (see below) - is used in the sort process. - - - - completeness boolean - - - This directive enables or disables complete field indexing. - The value of the boolean should be 0 - (disable) or 1. If completeness is enabled, the index entry will - contain the complete contents of the field (up to a limit), with words - (non-space characters) separated by single space characters - (normalized to " " on display). When completeness is - disabled, each word is indexed as a separate entry. Complete subfield - indexing is most useful for fields which are typically browsed (eg. - titles, authors, or subjects), or instances where a match on a - complete subfield is essential (eg. 
exact title searching). For fields - where completeness is disabled, the search engine will interpret a - search containing space characters as a word proximity search. - - - - charmap filename - - - This is the filename of the character - map to be used for this index for field type. - - - - - - - The contents of the character map files are structured as follows: - - - - - - - lowercase value-set - - - This directive introduces the basic value set of the field type. - The format is an ordered list (without spaces) of the - characters which may occur in "words" of the given type. - The order of the entries in the list determines the - sort order of the index. In addition to single characters, the - following combinations are legal: - - - - - - - - Backslashes may be used to introduce three-digit octal, or - two-digit hex representations of single characters - (preceded by x). - In addition, the combinations - \\, \\r, \\n, \\t, \\s (space — remember that real - space-characters may not occur in the value definition), and - \\ are recognized, with their usual interpretation. - - - - - - Curly braces {} may be used to enclose ranges of single - characters (possibly using the escape convention described in the - preceding point), eg. {a-z} to introduce the - standard range of ASCII characters. - Note that the interpretation of such a range depends on - the concrete representation in your local, physical character set. - - - - - - paranthesises () may be used to enclose multi-byte characters - - eg. diacritics or special national combinations (eg. Spanish - "ll"). When found in the input stream (or a search term), - these characters are viewed and sorted as a single character, with a - sorting value depending on the position of the group in the value - statement. - - + + The default.idx file + + The field types, and hence character sets, are associated with data + elements by the .abs files (see above). 
+ The file default.idx + provides the association between field type codes (as used in the .abs + files) and the character map files (with the .chr suffix). The format + of the .idx file is as follows + - + + + + + index field type code + + + This directive introduces a new search index code. + The argument is a one-character code to be used in the + .abs files to select this particular index type. An index, roughly, + corresponds to a particular structure attribute during search. Refer + to . + + + + sort field code type + + + This directive introduces a + sort index. The argument is a one-character code to be used in the + .abs fie to select this particular index type. The corresponding + use attribute must be used in the sort request to refer to this + particular sort index. The corresponding character map (see below) + is used in the sort process. + + + + completeness boolean + + + This directive enables or disables complete field indexing. + The value of the boolean should be 0 + (disable) or 1. If completeness is enabled, the index entry will + contain the complete contents of the field (up to a limit), with words + (non-space characters) separated by single space characters + (normalized to " " on display). When completeness is + disabled, each word is indexed as a separate entry. Complete subfield + indexing is most useful for fields which are typically browsed (eg. + titles, authors, or subjects), or instances where a match on a + complete subfield is essential (eg. exact title searching). For fields + where completeness is disabled, the search engine will interpret a + search containing space characters as a word proximity search. + + + + charmap filename + + + This is the filename of the character + map to be used for this index for field type. + + + + + - - - - uppercase value-set - - - This directive introduces the - upper-case equivalencis to the value set (if any). 
The number and - order of the entries in the list should be the same as in the - lowercase directive. - - - - space value-set - - - This directive introduces the character - which separate words in the input stream. Depending on the - completeness mode of the field in question, these characters either - terminate an index entry, or delimit individual "words" in - the input stream. The order of the elements is not significant — - otherwise the representation is the same as for the - uppercase and lowercase - directives. - - - - map value-set - target - - - This directive introduces a - mapping between each of the members of the value-set on the left to - the character on the right. The character on the right must occur in - the value set (the lowercase directive) of - the character set, but - it may be a paranthesis-enclosed multi-octet character. This directive - may be used to map diacritics to their base characters, or to map - HTML-style character-representations to their natural form, etc. - - - - + + The character map file format + + The contents of the character map files are structured as follows: + + + + + + lowercase value-set + + + This directive introduces the basic value set of the field type. + The format is an ordered list (without spaces) of the + characters which may occur in "words" of the given type. + The order of the entries in the list determines the + sort order of the index. In addition to single characters, the + following combinations are legal: + + + + + + + + Backslashes may be used to introduce three-digit octal, or + two-digit hex representations of single characters + (preceded by x). + In addition, the combinations + \\, \\r, \\n, \\t, \\s (space — remember that real + space-characters may not occur in the value definition), and + \\ are recognized, with their usual interpretation. 
+ + + + + + Curly braces {} may be used to enclose ranges of single + characters (possibly using the escape convention described in the + preceding point), eg. {a-z} to introduce the + standard range of ASCII characters. + Note that the interpretation of such a range depends on + the concrete representation in your local, physical character set. + + + + + + paranthesises () may be used to enclose multi-byte characters - + eg. diacritics or special national combinations (eg. Spanish + "ll"). When found in the input stream (or a search term), + these characters are viewed and sorted as a single character, with a + sorting value depending on the position of the group in the value + statement. + + + + + + + + + uppercase value-set + + + This directive introduces the + upper-case equivalencis to the value set (if any). The number and + order of the entries in the list should be the same as in the + lowercase directive. + + + + space value-set + + + This directive introduces the character + which separate words in the input stream. Depending on the + completeness mode of the field in question, these characters either + terminate an index entry, or delimit individual "words" in + the input stream. The order of the elements is not significant — + otherwise the representation is the same as for the + uppercase and lowercase + directives. + + + + map value-set + target + + + This directive introduces a mapping between each of the + members of the value-set on the left to the character on the + right. The character on the right must occur in the value + set (the lowercase directive) of the + character set, but it may be a paranthesis-enclosed + multi-octet character. This directive may be used to map + diacritics to their base characters, or to map HTML-style + character-representations to their natural form, etc. The + map directive can also be used to ignore leading articles in + searching and/or sorting, and to perform other special + transformations. See section . 
+ + + + + + + Ignoring leading articles + + In addition to specifying sort orders, space (blank) handling, + and upper/lowercase folding, you can also use the character map + files to make Zebra ignore leading articles in sorting records, + or when doing complete field searching. + + + This is done using the map directive in the + character map file. In a nutshell, what you do is map certain + sequences of characters, when they occur in the + beginning of a field, to a space. Assuming that the + character "@" is defined as a space character in your file, you + can do: + + map (^The\s) @ + map (^the\s) @ + + The effect of these directives is to map either 'the' or 'The', + followed by a space character, to a space. The hat ^ character + denotes beginning-of-field only when complete-subfield indexing + or sort indexing is taking place; otherwise, it is treated just + as any other character. + + + Because the default.idx file can be used to + associate different character maps with different indexing types + -- and you can create additional indexing types, should the need + arise -- it is possible to specify that leading articles should + be ignored either in sorting, in complete-field searching, or + both. + + + If you ignore certain prefixes in sorting, then these will be + eliminated from the index, and sorting will take place as if + they weren't there. However, if you set the system up to ignore + certain prefixes in searching, then these + are deleted both from the indexes and from query terms, when the + client specifies complete-field searching. This has the effect + that a search for 'the science journal' and 'science journal' + would both produce the same results. + + - - - Exchange Formats + + GRS Exchange Formats - Converting records from the internal structure to en exchange format + Converting records from the internal structure to an exchange format is largely an automatic process. 
Currently, the following exchange formats are supported: @@ -1919,6 +2072,7 @@ + SOIF. Support for this syntax is experimental, and is currently @@ -1926,10 +2080,9 @@ abstract syntaxes can be mapped to the SOIF format, although nested elements are represented by concatenation of the tag names at each level. - - +
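
The leading-article handling that this patch adds under "Ignoring leading articles" can be illustrated with a small simulation. This is only a sketch of the described effect, not Zebra's implementation: the function name `complete_field_key` is hypothetical, and the case folding and space collapsing are simplified assumptions standing in for the `lowercase`, `uppercase` and `space` directives of a real character map file.

```python
import re

# Hypothetical stand-in for the two directives quoted in the patch:
#   map (^The\s) @
#   map (^the\s) @
# where "@" is assumed to be declared as a space character.
LEADING_ARTICLE = re.compile(r"^(The|the)\s")


def complete_field_key(field: str) -> str:
    """Approximate the key stored for a complete-field or sort index."""
    # The mapping applies only at the beginning of the field.
    mapped = LEADING_ARTICLE.sub(" ", field, count=1)
    # Assumption: case folding and collapsing of space characters stand in
    # for the lowercase/uppercase/space directives of a real .chr file.
    return " ".join(mapped.lower().split())


if __name__ == "__main__":
    for title in ("The Science Journal", "science journal", "Bathe the cat"):
        print(repr(title), "->", repr(complete_field_key(title)))
```

Under these assumptions, a title with a leading article and the same title without it reduce to the same key, which matches the patch's statement that the two searches produce the same results. An article in the middle of a field is left alone, because the hat anchors the match to the beginning of the field.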