X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Frecordmodel-grs.xml;h=853410a0f203c748cb88382960d5dbd51fd7fee4;hp=848db70c79103831e452ce1ec662859b96a6733b;hb=dcda88860b03641b6900d43135ca769f005105e8;hpb=b19b79e382ef8196f1625763db1af3a82b1e0c81 diff --git a/doc/recordmodel-grs.xml b/doc/recordmodel-grs.xml index 848db70..853410a 100644 --- a/doc/recordmodel-grs.xml +++ b/doc/recordmodel-grs.xml @@ -1,6 +1,13 @@ - - &grs1; Record Model and Filter Modules + &acro.grs1; Record Model and Filter Modules + + + + The functionality of this record model has been improved and + replaced by the DOM &acro.xml; record model. See + . + + The record model described in this chapter applies to the fundamental, @@ -11,7 +18,7 @@
- &grs1; Record Filters + &acro.grs1; Record Filters Many basic subtypes of the grs type are currently available: @@ -25,7 +32,7 @@ This is the canonical input format described . It is using - simple &sgml;-like syntax. + simple &acro.sgml;-like syntax. @@ -34,22 +41,22 @@ This allows &zebra; to read - records in the ISO2709 (&marc;) encoding standard. + records in the ISO2709 (&acro.marc;) encoding standard. Last parameter type names the .abs file (see below) - which describes the specific &marc; structure of the input record as + which describes the specific &acro.marc; structure of the input record as well as the indexing rules. - The grs.marc uses an internal represtantion - which is not &xml; conformant. In particular &marc; tags are - presented as elements with the same name. And &xml; elements + The grs.marc uses an internal representation + which is not &acro.xml; conformant. In particular &acro.marc; tags are + presented as elements with the same name. And &acro.xml; elements may not start with digits. Therefore this filter is only - suitable for systems returning &grs1; and &marc; records. For &xml; + suitable for systems returning &acro.grs1; and &acro.marc; records. For &acro.xml; use grs.marcxml filter instead (see below). - The loadable grs.marc filter module - is packaged in the GNU/Debian package + The loadable grs.marc filter module + is packaged in the GNU/Debian package libidzebra2.0-mod-grs-marc @@ -61,14 +68,14 @@ This allows &zebra; to read ISO2709 encoded records. Last parameter type names the .abs file (see below) - which describes the specific &marc; structure of the input record as + which describes the specific &acro.marc; structure of the input record as well as the indexing rules. The internal representation for grs.marcxml - is the same as for &marcxml;. - It slightly more complicated to work with than - grs.marc but &xml; conformant. + is the same as for &acro.marcxml;. 
+ It is slightly more complicated to work with than + grs.marc but is &acro.xml; conformant. The loadable grs.marcxml filter module @@ -81,20 +88,20 @@ grs.xml - This filter reads &xml; records and uses + This filter reads &acro.xml; records and uses Expat to - parse them and convert them into ID&zebra;'s internal + parse them and convert them into ID&zebra;'s internal grs record model. - Only one record per file is supported, due to the fact &xml; does + Only one record per file is supported, because &acro.xml; does not allow two documents to "follow" each other (there is no way to know when a document is finished). This filter is only available if &zebra; is compiled with EXPAT support. The loadable grs.xml filter module - is packagged in the GNU/Debian package + is packaged in the GNU/Debian package libidzebra2.0-mod-grs-xml - + @@ -115,7 +122,7 @@ grs.tcl.filter - Similar to grs.regx but using Tcl for rules, described in + Similar to grs.regx but using Tcl for rules, described in . @@ -130,14 +137,14 @@
- &grs1; Canonical Input Format + &acro.grs1; Canonical Input Format Although input data can take any form, it is sometimes useful to describe the record processing capabilities of the system in terms of a single, canonical input format that gives access to the full spectrum of structure and flexibility in the system. In &zebra;, this - canonical format is an "&sgml;-like" syntax. + canonical format is an "&acro.sgml;-like" syntax. @@ -157,16 +164,16 @@ <Distributor> - <Name> USGS/WRD </Name> - <Organization> USGS/WRD </Organization> - <Street-Address> - U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW - </Street-Address> - <City> ALBUQUERQUE </City> - <State> NM </State> - <Zip-Code> 87102 </Zip-Code> - <Country> USA </Country> - <Telephone> (505) 766-5560 </Telephone> + <Name> USGS/WRD </Name> + <Organization> USGS/WRD </Organization> + <Street-Address> + U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW + </Street-Address> + <City> ALBUQUERQUE </City> + <State> NM </State> + <Zip-Code> 87102 </Zip-Code> + <Country> USA </Country> + <Telephone> (505) 766-5560 </Telephone> </Distributor> @@ -174,12 +181,12 @@ @@ -215,7 +222,7 @@ contains only a single element (strictly speaking, that makes it an illegal GILS record, since the GILS profile includes several mandatory elements - &zebra; does not validate the contents of a record against - the &z3950; profile, however - it merely attempts to match up elements + the &acro.z3950; profile, however - it merely attempts to match up elements of a local representation with the given schema): @@ -223,7 +230,7 @@ <gils> - <title>Zen and the Art of Motorcycle Maintenance</title> + <title>Zen and the Art of Motorcycle Maintenance</title> </gils> @@ -240,7 +247,7 @@ textual data elements which might appear in different languages, and images which may appear in different formats or layouts. The variant system in &zebra; is essentially a representation of - the variant mechanism of &z3950;-1995. + the variant mechanism of &acro.z3950;-1995. 
@@ -320,7 +327,7 @@ The title element above comes in two variants. Both have the IANA body type "text/plain", but one is in English, and the other in - Danish. The client, using the element selection mechanism of &z3950;, + Danish. The client, using the element selection mechanism of &acro.z3950;, can retrieve information about the available variant forms of data elements, or it can select specific variants based on the requirements of the end-user. @@ -331,7 +338,7 @@
- &grs1; REGX And TCL Input Filters + &acro.grs1; REGX And TCL Input Filters In order to handle general input formats, &zebra; allows the @@ -352,7 +359,7 @@ type regx, argument filter-filename). - + Generally, an input filter consists of a sequence of rules, where each rule consists of a sequence of expressions, followed by an action. The @@ -360,7 +367,7 @@ and the actions normally contribute to the generation of an internal representation of the record. - + An expression can be either of the following: @@ -408,7 +415,7 @@ Matches regular expression pattern reg from the input record. The operators supported are the same - as for regular expression queries. Refer to + as for regular expression queries. Refer to . @@ -460,13 +467,13 @@ data element. The type is one of the following: - + record Begin a new record. The following parameter should be the - name of the schema that describes the structure of the record, eg. + name of the schema that describes the structure of the record, e.g., gils or wais (see below). The begin record call should precede any other use of the begin statement. @@ -561,10 +568,10 @@ /^Subject:/ BODY /$/ { data -element title $1 } /^Date:/ BODY /$/ { data -element lastModified $1 } /\n\n/ BODY END { - begin element bodyOfDisplay - begin variant body iana "text/plain" - data -text $1 - end record + begin element bodyOfDisplay + begin variant body iana "text/plain" + data -text $1 + end record } @@ -583,7 +590,7 @@
- &grs1; Internal Record Representation + &acro.grs1; Internal Record Representation When records are manipulated by the system, they're represented in a @@ -597,9 +604,9 @@ - ROOT - TITLE "Zen and the Art of Motorcycle Maintenance" - AUTHOR "Robert Pirsig" + ROOT + TITLE "Zen and the Art of Motorcycle Maintenance" + AUTHOR "Robert Pirsig" @@ -612,11 +619,11 @@ - ROOT - TITLE "Zen and the Art of Motorcycle Maintenance" - AUTHOR - FIRST-NAME "Robert" - SURNAME "Pirsig" + ROOT + TITLE "Zen and the Art of Motorcycle Maintenance" + AUTHOR + FIRST-NAME "Robert" + SURNAME "Pirsig" @@ -680,38 +687,38 @@ Which of the two elements are transmitted to the client by the server depends on the specifications provided by the client, if any. - + In practice, each variant node is associated with a triple of class, - type, value, corresponding to the variant mechanism of &z3950;. + type, value, corresponding to the variant mechanism of &acro.z3950;. - +
- +
Data Elements - + Data nodes have no children (they are always leaf nodes in the record tree). - + - +
- +
- +
- &grs1; Record Model Configuration - + &acro.grs1; Record Model Configuration + The following sections describe the configuration files that govern - the internal management of grs records. + the internal management of grs records. The system searches for the files in the directories specified by the profilePath setting in the zebra.cfg file. @@ -728,7 +735,7 @@ @@ -737,7 +744,7 @@ - The object identifier of the &z3950; schema associated + The object identifier of the &acro.z3950; schema associated with the ARS, so that it can be referred to by the client. @@ -759,7 +766,7 @@ known. - + The variant set which is used in the profile. This provides a @@ -774,7 +781,7 @@ ask for a subset of the data elements contained in a record. Element set names, in the retrieval module, are mapped to element specifications, which contain information equivalent to the - Espec-1 syntax of &z3950;. + Espec-1 syntax of &acro.z3950;. @@ -788,15 +795,15 @@ Possibly, a set of rules describing the mapping of elements to a - &marc; representation. + &acro.marc; representation. - + A list of element descriptions (this is the actual ARS of the - schema, in &z3950; terms), which lists the ways in which the various + schema, in &acro.z3950; terms), which lists the ways in which the various tags can be used and organized hierarchically. @@ -822,7 +829,7 @@ The number of different file types may appear daunting at first, but - each type corresponds fairly clearly to a single aspect of the &z3950; + each type corresponds fairly clearly to a single aspect of the &acro.z3950; retrieval facilities. Further, the average database administrator, who is simply reusing an existing profile for which tables already exist, shouldn't have to worry too much about the contents of these tables. @@ -840,27 +847,27 @@ file. Some settings are optional (o), while others again are mandatory (m). - +
- +
The Abstract Syntax (.abs) Files - + - The name of this file type is slightly misleading in &z3950; terms, + The name of this file type is slightly misleading in &acro.z3950; terms, since, apart from the actual abstract syntax of the profile, it also includes most of the other definitions that go into a database profile. - + - When a record in the canonical, &sgml;-like format is read from a file + When a record in the canonical, &acro.sgml;-like format is read from a file or from the database, the first tag of the file should reference the profile that governs the layout of the record. If the first tag of the record is, say, <gils>, the system will look for the profile definition in the file gils.abs. Profile definitions are cached, so they only have to be read once - during the lifespan of the current process. + during the lifespan of the current process. @@ -869,14 +876,14 @@ introduces the profile, and should always be called first thing when introducing a new record. - + The file may contain the following directives: - + - + name symbolic-name @@ -938,7 +945,7 @@ (o) Points to a file containing parameters for representing the record contents in the ISO2709 syntax. - Read the description of the &marc; representation facility below. + Read the description of the &acro.marc; representation facility below. @@ -974,7 +981,7 @@ (o,r) Adds an element to the abstract record syntax of the schema. The path follows the - syntax which is suggested by the &z3950; document - that is, a sequence + syntax which is suggested by the &acro.z3950; document - that is, a sequence of tags separated by slashes (/). Each tag is given as a comma-separated pair of tag type and -value surrounded by parenthesis. 
The name is the name of the element, and @@ -996,7 +1003,7 @@ - + xelm xpath attributes @@ -1021,8 +1028,8 @@ melm field$subfield attributes - This directive is specifically for &marc;-formatted records, - ingested either in the form of &marcxml; documents, or in the + This directive is specifically for &acro.marc;-formatted records, + ingested either in the form of &acro.marcxml; documents, or in the ISO2709/Z39.2 format using the grs.marcxml input filter. You can specify indexing rules for any subfield, or you can leave off the $subfield part and specify default rules @@ -1038,11 +1045,11 @@ This directive specifies character encoding for external records. - For records such as &xml; that specifies encoding within the + For records such as &acro.xml; that specifies encoding within the file via a header this directive is ignored. If neither this directive is given, nor an encoding is set within external records, ISO-8859-1 encoding is assumed. - + @@ -1051,60 +1058,60 @@ If this directive is followed by enable, then extra indexing is performed to allow for XPath-like queries. - If this directive is not specified - equivalent to + If this directive is not specified - equivalent to disable - no extra XPath-indexing is performed. - @@ -1117,7 +1124,7 @@ Specifies what information, if any, &zebra; should - automatically include in retrieval records for the + automatically include in retrieval records for the ``system fields'' that it supports. systemTag may be any of the following: @@ -1125,24 +1132,24 @@ rank - An integer indicating the relevance-ranking score - assigned to the record. - + An integer indicating the relevance-ranking score + assigned to the record. + sysno - An automatically generated identifier for the record, - unique within this database. It is represented by the - <localControlNumber> element in - &xml; and the (1,14) tag in &grs1;. - + An automatically generated identifier for the record, + unique within this database. 
It is represented by the + <localControlNumber> element in + &acro.xml; and the (1,14) tag in &acro.grs1;. + size - The size, in bytes, of the retrieved record. - + The size, in bytes, of the retrieved record. + @@ -1155,7 +1162,7 @@ - + The mechanism for controlling indexing is not adequate for @@ -1163,7 +1170,7 @@ configuration table eventually. - + The following is an excerpt from the abstract syntax file for the GILS profile. @@ -1195,7 +1202,7 @@ elm (4,1) controlIdentifier Identifier-standard elm (2,6) abstract Abstract elm (4,51) purpose ! - elm (4,52) originator - + elm (4,52) originator - elm (4,53) accessConstraints ! elm (4,54) useConstraints ! elm (4,70) availability - @@ -1215,10 +1222,10 @@ This file type describes the Use elements of - an attribute set. - It contains the following directives. + an attribute set. + It contains the following directives. - + @@ -1250,7 +1257,7 @@ set. For instance, many new attribute sets are defined as extensions to the bib-1 set. This is an important feature of the retrieval - system of &z3950;, as it ensures the highest possible level of + system of &acro.z3950;, as it ensures the highest possible level of interoperability, as those access points of your database which are derived from the external set (say, bib-1) can be used even by clients who are unaware of the new set. @@ -1266,7 +1273,7 @@ attribute value is stored in the index (unless a local-value is given, in which case this is stored). The name is used to refer to the - attribute from the abstract syntax. + attribute from the abstract syntax. @@ -1302,7 +1309,7 @@ This file type defines the tagset of the profile, possibly by referencing other tag sets (most tag sets, for instance, will include - tagsetG and tagsetM from the &z3950; specification. The file may + tagsetG and tagsetM from the &acro.z3950; specification. The file may contain the following directives. 
@@ -1542,7 +1549,7 @@ The element set specification files describe a selection of a subset of the elements of a database record. The element selection mechanism is equivalent to the one supplied by the Espec-1 - syntax of the &z3950; specification. + syntax of the &acro.z3950; specification. In fact, the internal representation of an element set specification is identical to the Espec-1 structure, and we'll refer you to the description of that structure for most of @@ -1556,7 +1563,7 @@ otherwise is noted. - + The directives available in the element set file are as follows: @@ -1583,7 +1590,7 @@ provides a default variant request for use when the individual element requests (see below) do not contain a variant request. Variant requests consist of a blank-separated list of - variant components. A variant compont is a comma-separated, + variant components. A variant component is a comma-separated, parenthesized triple of variant class, type, and value (the two former values being represented as integers). The value can currently only be entered as a string (this will change to depend on the definition of @@ -1683,21 +1690,21 @@ a schema that differs from the native schema of the record. For instance, a client might only know how to process WAIS records, while the database record is represented in a more specific schema, such as - GILS. In this module, a mapping of data to one of the &marc; formats is + GILS. In this module, a mapping of data to one of the &acro.marc; formats is also thought of as a schema mapping (mapping the elements of the - record into fields consistent with the given &marc; specification, prior + record into fields consistent with the given &acro.marc; specification, prior to actually converting the data to the ISO2709). This use of the - object identifier for &usmarc; as a schema identifier represents an + object identifier for &acro.usmarc; as a schema identifier represents an overloading of the OID which might not be entirely proper. 
However, it represents the dual role of schema and record syntax which - is assumed by the &marc; family in &z3950;. + is assumed by the &acro.marc; family in &acro.z3950;. @@ -1740,7 +1747,7 @@
- The &marc; (ISO2709) Representation (.mar) Files + The &acro.marc; (ISO2709) Representation (.mar) Files This file provides rules for representing a record in the ISO2709 @@ -1749,16 +1756,16 @@
 - &grs1; Exchange Formats + &acro.grs1; Exchange Formats Converting records from the internal structure to an exchange format @@ -1770,7 +1777,7 @@ - &grs1;. The internal representation is based on &grs1;/&xml;, so the + &acro.grs1;. The internal representation is based on &acro.grs1;/&acro.xml;, so the conversion here is straightforward. The system will create applied variant and supported variant lists as required, if a record contains variant information. @@ -1779,34 +1786,34 @@ - &xml;. The internal representation is based on &grs1;/&xml; so - the mapping is trivial. Note that &xml; schemas, preprocessing + &acro.xml;. The internal representation is based on &acro.grs1;/&acro.xml; so + the mapping is trivial. Note that &acro.xml; schemas, processing instructions and comments are not part of the internal representation - and therefore will never be part of a generated &xml; record. + and therefore will never be part of a generated &acro.xml; record. Future versions of &zebra; will support that. - &sutrs;. Again, the mapping is fairly straightforward. Indentation + &acro.sutrs;. Again, the mapping is fairly straightforward. Indentation is used to show the hierarchical structure of the record. All - "&grs1;" type records support both the &grs1; and &sutrs; + "&acro.grs1;" type records support both the &acro.grs1; and &acro.sutrs; representations. - + - ISO2709-based formats (&usmarc;, etc.). Only records with a + ISO2709-based formats (&acro.usmarc;, etc.). Only records with a two-level structure (corresponding to fields and subfields) can be directly mapped to ISO2709. 
For records with a different structuring - (eg., GILS), the representation in a structure like &usmarc; involves a + (e.g., GILS), the representation in a structure like &acro.usmarc; involves a schema-mapping (see ), to an - "implied" &usmarc; schema (implied, + "implied" &acro.usmarc; schema (implied, because there is no formal schema which specifies the use of the - &usmarc; fields outside of ISO2709). The resultant, two-level record is + &acro.usmarc; fields outside of ISO2709). The resultant, two-level record is then mapped directly from the internal representation to ISO2709. See the GILS schema definition files for a detailed example of this approach. @@ -1829,7 +1836,7 @@ - + SOIF. Support for this syntax is experimental, and is currently @@ -1839,186 +1846,186 @@ level. - +
- +
 - Extended indexing of &marc; records - - Extended indexing of &marc; records will help you if you need index a + Extended indexing of &acro.marc; records + + Extended indexing of &acro.marc; records will help you if you need to index a combination of subfields, or index only a part of the whole field, - or use during indexing process embedded fields of &marc; record. + or use embedded fields of an &acro.marc; record during indexing. - - Extended indexing of &marc; records additionally allows: + + Extended indexing of &acro.marc; records additionally allows: - + - to index data in LEADER of &marc; record + to index data in the LEADER of an &acro.marc; record - + to index data in control fields (with fixed length) - + to use the values of indicators during indexing - + - to index linked fields for UNI&marc; based formats + to index linked fields for UNI&acro.marc; based formats - + - + Compared with the simple indexing process, extended indexing - may increase (about 2-3 times) the time of indexing process for &marc; + may increase (about 2-3 times) the indexing time for &acro.marc; records. - + 
The index-formula - + At the beginning, we have to define the term - index-formula for &marc; records. This term helps - to understand the notation of extended indexing of &marc; records by &zebra;. + index-formula for &acro.marc; records. This term helps + to understand the notation of extended indexing of &acro.marc; records by &zebra;. Our definition is based on the document "The table - of conformity for &z3950; use attributes and R&usmarc; fields". - The document is available only in russian language. - + of conformity for &acro.z3950; use attributes and R&acro.usmarc; fields". + The document is available only in Russian. + The index-formula is the combination of subfields presented in such a way: - + 71-00$a, $g, $h ($c){.$b ($c)} , (1) - + - We know that &zebra; supports a &bib1; attribute - right truncation. - In this case, the index-formula (1) consists from + We know that &zebra; supports a &acro.bib1; attribute - right truncation. + In this case, the index-formula (1) consists of forms, defined in the same way as (1) - + 71-00$a, $g, $h 71-00$a, $g 71-00$a - + - The original &marc; record may be without some elements, which included in index-formula. + The original &acro.marc; record may lack some of the elements included in the index-formula. - + This notation includes such operands as: - + # It means a whitespace character. - + - The position may contain any value, defined by - &marc; format. + &acro.marc; format. For example, index-formula - + 70-#1$a, $g , (2) - - includes - + + includes + 700#1$a, $g 701#1$a, $g 702#1$a, $g - + - + {...} The repeatable elements are defined in curly braces {}. For example, index-formula - + 71-00$a, $g, $h ($c){.$b ($c)} , (3) - + includes - + 71-00$a, $g, $h ($c). $b ($c) 71-00$a, $g, $h ($c). $b ($c). $b ($c) 71-00$a, $g, $h ($c). $b ($c). $b ($c). $b ($c) - + - + - All another operands are the same as accepted in &marc; world. + All other operands are the same as those accepted in the &acro.marc; world.
- +
Notation of <emphasis>index-formula</emphasis> for &zebra; - - + + Extended indexing overloads the path of the elm definition in the abstract syntax file of &zebra; (.abs file). It means that names beginning with "mc-" are interpreted by &zebra; as index-formula. The database index is created and - linked with access point (&bib1; use attribute) + linked with access point (&acro.bib1; use attribute) according to this formula. - + For example, index-formula - + 71-00$a, $g, $h ($c){.$b ($c)} , (4) - + in .abs file looks like: - + mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)} - - + + The notation of index-formula uses the operands: - + _ It means a whitespace character. - + . The position may contain any value, defined by - &marc; format. For example, + &acro.marc; format. For example, index-formula - + 70-#1$a, $g , (5) - + matches mc-70._1_$a,_$g_ and includes - + 700_1_$a,_$g_ 701_1_$a,_$g_ @@ -2026,21 +2033,21 @@ - + {...} The repeatable elements are defined in curly braces {}. For example, index-formula - + 71#00$a, $g, $h ($c) {.$b ($c)} , (6) - - matches + + matches mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)} and includes - + 71.00_$a,_$g,_$h_(_$c_).$b_(_$c_) 71.00_$a,_$g,_$h_(_$c_).$b_(_$c_).$b_(_$c_) @@ -2048,120 +2055,120 @@ - + <...> Embedded index-formula (for linked fields) is between <>. For example, index-formula - + 4--#-$170-#1$a, $g ($c) , (7) - + matches mc-4.._._$1<70._1_$a,_$g_(_$c_)>_ and includes - + 463_._$1<70._1_$a,_$g_(_$c_)>_ - + - + - All another operands are the same as accepted in &marc; world. + All other operands are the same as those accepted in the &acro.marc; world.
Examples - + - + - + indexing LEADER - + You need to use the keyword "ldr" to index the leader. For example, indexing data from the 6th and 7th positions of LEADER - + elm mc-ldr[6] Record-type ! elm mc-ldr[7] Bib-level ! - + - + - + indexing data from control fields - + indexing date (the time added to database) - + - elm mc-008[0-5] Date/time-added-to-db ! + elm mc-008[0-5] Date/time-added-to-db ! - - or for R&usmarc; (this data included in 100th field) - + + or for R&acro.usmarc; (this data is included in the 100th field) + elm mc-100___$a[0-7]_ Date/time-added-to-db ! - + - + - + using indicators while indexing - For R&usmarc; index-formula + For R&acro.usmarc; index-formula 70-#1$a, $g matches - + elm 70._1_$a,_$g_ Author !:w,!:p - - When &zebra; finds a field according to + + When &zebra; finds a field matching the "70." pattern, it checks the indicators. In this case the value of the first indicator doesn't matter, but the value of - second one must be whitespace, in another case a field is not + the second one must be whitespace; otherwise the field is not indexed. - + - - indexing embedded (linked) fields for UNI&marc; based + + indexing embedded (linked) fields for UNI&acro.marc; based formats - - For R&usmarc; index-formula + + For R&acro.usmarc; index-formula 4--#-$170-#1$a, $g ($c) matches - + elm mc-4.._._$1<70._1_$a,_$g_(_$c_)>_ Author !:w,!:p - + Data are extracted from the record if the field matches the "4.._." pattern and the data in the linked field match the embedded index-formula 70._1_$a,_$g_(_$c_). - + - + - - + + 
- +