diff --git a/doc/field-structure.xml b/doc/field-structure.xml
index 6eda6a5..a19838e 100644

Field Structure and Character Sets

In order to provide a flexible approach to national character set handling, &zebra; allows the administrator to set up the system to handle any 8-bit character set — including sets that require multi-octet diacritics or other multi-octet characters. The definition of a character set includes a specification of the

...

special-purpose fields such as WWW-style linkages (URx).

Zebra 1.3, and Zebra versions 2.0.18 and earlier, required the field type to be a single character, e.g. w (for word) and p (for phrase). Zebra 2.0.20 and later allow field types to be any string. This allows for greater flexibility; in particular, per-locale (language) fields can be defined.

Version 2.0.20 of Zebra can also be configured, per field, to use the ICU library to perform tokenization and normalization of strings. This is an alternative to the "charmap" files, which have been part of Zebra since its first release.
The default.idx file

The field types, and hence character sets, are associated with data elements by the indexing rules (say title:w) in the various filters. Fields are defined in a field definition file which, by default, is called default.idx. This file provides the association between field type codes and the character map files (with the .chr suffix). The format of the .idx file is as follows:

...

(non-space characters) separated by single space characters (normalized to " " on display). When completeness is disabled, each word is indexed as a separate entry. Complete subfield indexing is most useful for fields which are typically browsed (e.g., titles, authors, or subjects), or instances where a match on a complete subfield is essential (e.g., exact title searching). For fields where completeness is disabled, the search engine will interpret a search containing space characters as a word proximity search.

...

charmap filename

This is the filename of the character map to be used for this field type. See the Charmap Files section below for details.

icuchain filename

Specifies the filename with ICU tokenization and normalization rules. See the ICU Chain Files section below for details. Using icuchain for a field type is an alternative to charmap. It does not make sense to define both icuchain and charmap for the same field type.

Field types

Following are three excerpts of the standard tab/default.idx configuration file.
Notice that index and sort are grouping directives, which bind all following directives to them:

# Traditional word index
# Used if completeness is 'incomplete field' (@attr 6=1) and
# structure is word/phrase/word-list/free-form-text/document-text
index w
completeness 0
position 1
alwaysmatches 1
firstinfield 1
charmap string.chr

...

# Null map index (no mapping at all)
# Used if structure=key (@attr 4=3)
index 0
completeness 0
position 1
charmap @

...

# Sort register
sort s
completeness 1
charmap string.chr
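The effect of the completeness flag shown in these excerpts can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not Zebra's actual code, and the function name is invented:

```python
def index_entries(field, completeness):
    """Sketch of Zebra-style completeness handling: with completeness 1
    the whole subfield becomes a single entry (words joined by single
    spaces); with completeness 0 every word is indexed on its own."""
    words = field.split()
    if completeness:
        return [" ".join(words)]  # one entry for the complete subfield
    return words                  # one entry per word

print(index_entries("The  Old Man and the Sea", 1))
# -> ['The Old Man and the Sea']
print(index_entries("The  Old Man and the Sea", 0))
# -> ['The', 'Old', 'Man', 'and', 'the', 'Sea']
```

With completeness disabled, a query containing spaces is then naturally interpreted as a proximity search over the individual word entries.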
Charmap Files

The character map files are used to define the word tokenization and character normalization performed before inserting text into the inverse indexes. &zebra; ships with the predefined character map files tab/*.chr. Users are allowed to add and/or modify maps according to their needs.

Character maps predefined in &zebra;:

numeric.chr (type :n)
Numeric digit tokenization and normalization map. All characters not in the set -{0-9}., will be suppressed. Note that floating point numbers are processed fine, but scientific exponential numbers are trashed.

scan.chr (type :w or :p)
Word tokenization character map for Scandinavian languages. It resembles the generic word tokenization character map tab/string.chr; the main differences are the sorting of the special characters üzæäøöå and equivalence maps according to Scandinavian language rules.

string.chr (type :w or :p)
General word tokenization and normalization character map, mostly useful for English texts. Use this to derive your own language tokenization and normalization maps.

urx.chr (type :u)
URL parsing and tokenization character map.

@ (type :0)
Do-nothing character map used for literal binary indexing. There is no file associated with it, and no normalization or tokenization is performed at all.
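As a quick illustration of the numeric.chr behaviour described in the table above — a sketch in Python, not Zebra's implementation, with an invented function name:

```python
def numeric_normalize(text):
    """Mimic a numeric.chr-style map: suppress every character that is
    not in the set -{0-9}., before the value is indexed."""
    allowed = set("-0123456789.,")
    return "".join(ch for ch in text if ch in allowed)

print(numeric_normalize("Price: -12,345.60 USD"))  # -> -12,345.60
print(numeric_normalize("1.5e10"))  # -> 1.510 (the exponent marker is lost)
```

The second call shows why scientific exponential numbers are "trashed": the e is simply suppressed like any other non-numeric character.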
The contents of the character map files are structured as follows:

encoding encoding-name

This directive must be at the very beginning of the file, and it specifies the character encoding used in the entire file. If omitted, the encoding ISO-8859-1 is assumed.

For example, one of the test files found at test/rusmarc/tab/string.chr contains the following encoding directive:

encoding koi8-r

and the test file test/charmap/string.utf8.chr is encoded in UTF-8:

encoding utf-8

lowercase value-set

...

Curly braces {} may be used to enclose ranges of single characters (possibly using the escape convention described in the preceding point), e.g., {a-z} to introduce the standard range of ASCII characters. Note that the interpretation of such a range depends on the concrete representation in your local, physical character set.

Parentheses () may be used to enclose multi-byte characters, e.g., diacritics or special national combinations (e.g., Spanish "ll"). When found in the input stream (or a search term), these characters are viewed and sorted as a single character, with a sorting value depending on the position of the group in the value definition.

...

For example, scan.chr contains the following lowercase normalization and sorting order:

lowercase {0-9}{a-y}üzæäøöå

uppercase value-set

This directive introduces the uppercase equivalences to the value set (if any). The number and order of the entries in the list should be the same as in the lowercase directive.
For example, scan.chr contains the following uppercase equivalent:

uppercase {0-9}{A-Y}ÜZÆÄØÖÅ

space value-set

... uppercase and lowercase directives.

For example, scan.chr contains the following space instruction:

space {\001-\040}!"#$%&'()*+,-./:;<=>?@\[\\]^_`\{|}~

map value-set

... members of the value-set on the left to the character on the right. The character on the right must occur in the value set (the lowercase directive) of the character set, but it may be a parenthesis-enclosed multi-octet character. This directive may be used to map diacritics to their base characters, or to map HTML-style character representations to their natural form, etc. The map directive can also be used to ignore leading articles in searching and/or sorting, and to perform other special transformations.

For example, scan.chr contains, among others, the following map instructions to make sure that HTML-entity-encoded Danish special characters are mapped to the equivalent Latin-1 characters:

map (&aelig;)æ
map (&oslash;)ø
map (&aring;)å

In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, you can also use the character map files to make &zebra; ignore leading articles in sorting records, or when doing complete field searching.

This is done using the map directive in the character map file. In a nutshell, what you do is map certain sequences of characters, when they occur in the beginning of a field, to a space. Assuming that the character "@" is defined as a space character in your file, you can do:

map (^The\s) @
map (^the\s) @

The effect of these directives is to map either 'the' or 'The', followed by a space character, to a space. The hat character ^ denotes beginning-of-field only when complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just as any other character.
Because the default.idx file can be used to associate different character maps with different indexing types -- and you can create additional indexing types, should the need arise -- it is possible to specify that leading articles should be ignored either in sorting, in complete-field searching, or both.

If you ignore certain prefixes in sorting, then these will be eliminated from the index, and sorting will take place as if they weren't there. However, if you set the system up to ignore certain prefixes in searching, then these are deleted both from the indexes and from query terms when the client specifies complete-field searching. This has the effect that a search for 'the science journal' and a search for 'science journal' would both produce the same results.

equivalent value-set

This directive introduces equivalence classes of strings for searching purposes only. It is a one-to-many conversion that takes place only during search, before the map directive kicks in.

For example, given:

equivalent æä(ae)

a search for äsel will match any of æsel, äsel, and aesel.
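The interplay of the map and equivalent directives can be sketched in Python. This is a toy model of the behaviour described above, not Zebra's implementation, and all names are invented:

```python
# Toy model of a .chr file: 'map' folds sequences to a base form at
# index and search time; 'equivalent' expands a search term into all
# members of an equivalence class at search time only.
MAP = {"^The ": " ", "^the ": " ", "&aelig;": "æ"}
EQUIVALENT = ["æ", "ä", "ae"]  # one equivalence class, as in the example

def normalize(field):
    """Apply map directives; '^' anchors at the beginning of the field,
    mirroring complete-subfield and sort indexing."""
    for src, dst in MAP.items():
        if src.startswith("^"):
            if field.startswith(src[1:]):
                field = dst + field[len(src) - 1:]
        else:
            field = field.replace(src, dst)
    return field.strip().lower()

def search_variants(term):
    """Expand every occurrence of a class member into all members."""
    variants = {term}
    for a in EQUIVALENT:
        for b in EQUIVALENT:
            variants |= {v.replace(a, b) for v in variants}
    return variants

print(normalize("The Science Journal"))   # -> science journal
print(sorted(search_variants("äsel")))
```

Note that "Theory of Sets" is left alone by the leading-article maps, because the \s in (^The\s) requires a space after the article.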
ICU Chain Files

The ICU chain files define a chain of rules which specify the conversion process to be carried out for each record string for indexing.

Both searching and sorting are based on the sort normalization that ICU provides. This means that scan and sort will return terms in the sort order given by ICU.

Zebra uses the ICU wrapper provided by YAZ. Refer to the yaz-icu man page for documentation about the ICU chain rules.
Use the yaz-icu program to test your icuchain rules.

Indexing Greek text

Consider a system where all "regular" text is to be indexed as Greek (locale: EL). We would have to change our index type file to read:

# Index greek words
index w
completeness 0
position 1
alwaysmatches 1
firstinfield 1
icuchain greek.xml
..

The ICU chain file greek.xml could look as follows:

...

Zebra is shipped with a field types file icu.idx, which is an ICU chain version of default.idx.

MARCXML indexing using ICU

The directory examples/marcxml includes a complete sample with MARCXML records that are DOM XML indexed using ICU chain rules. Study the README in the marcxml directory for details.

Field structure debugging using the special zebra:: element set

At times it is very hard to figure out what exactly has been indexed, how, and in which indexes. Using the indexing stylesheet of the Alvis filter, one can at least see which portion of the record went into which index, but a similar aid does not exist for all other indexing filters.

Starting with Zebra version 2.0.4-2 or newer, one can use the special zebra:: element set name, which is only defined for the SUTRS and XML record formats.

Z> f @attr 1=dc_all minutter
Z> format sutrs
Z> elements zebra::
Z> s 1+1

will display all indexed tokens from all indexed fields of the first record, in SUTRS record syntax, whereas

Z> f @attr 1=dc_all minutter
Z> format xml
Z> elements zebra::dc_publisher
Z> s 1+1
Z> elements zebra::dc_publisher:p
Z> s 1+1

displays in XML record syntax only the content of the zebra string index dc_publisher, or even only the type p phrase-indexed part of it.
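What an ICU chain does at index time can be approximated in plain Python. The steps below (strip control characters, tokenize into words, remove punctuation, lowercase) are typical chain rules; this sketch only imitates them and does not call the ICU library:

```python
import unicodedata

def chain_sketch(text):
    """Imitate a simple ICU chain: drop control characters, split the
    text into word tokens, strip punctuation, and lowercase."""
    # remove control characters (category Cc)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    tokens = []
    for raw in text.split():  # word tokenization
        # remove punctuation (categories P*)
        tok = "".join(ch for ch in raw
                      if not unicodedata.category(ch).startswith("P"))
        if tok:
            tokens.append(tok.lower())  # casemap to lowercase
    return tokens

print(chain_sketch("Καλημέρα, Κόσμε!\x07"))  # -> ['καλημέρα', 'κόσμε']
```

In a real deployment the tokenization and normalization come from the ICU rules themselves; run yaz-icu on sample input to see what a given chain file actually produces.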
+