From bd964f3a7291ef3171b917348142472384b636cf Mon Sep 17 00:00:00 2001
From: Adam Dickmeiss
Date: Wed, 19 Dec 2007 09:30:29 +0000
Subject: [PATCH] Added some material about ICU chains.

---
 doc/administration.xml  |  15 +++-
 doc/field-structure.xml | 201 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 162 insertions(+), 54 deletions(-)

diff --git a/doc/administration.xml b/doc/administration.xml
index cffae1e..e8e9840 100644
--- a/doc/administration.xml
+++ b/doc/administration.xml
@@ -1,5 +1,5 @@
-
+
 Administrating &zebra;
+ Field Structure and Character Sets
@@ -21,17 +21,33 @@
 special-purpose fields such as WWW-style linkages (URx).
+
+ Zebra 1.3 and Zebra 2.0 series require that the field type is
+ a single character, e.g. w (for word), and
+ p for phrase. Zebra 2.1 allows field types to
+ be any string. This allows for greater flexibility - in particular
+ per-locale (language) fields can be defined.
+
+
+
+ Version 2.1 of Zebra can also be configured - per field - to use the
+ ICU library to perform tokenization and
+ normalization of strings. This is an alternative to the "charmap"
+ files, which have been part of Zebra since its first release.
+
+
 The default.idx file
 The field types, and hence character sets, are associated with data
- elements by the .abs files (see above).
- The file default.idx
- provides the association between field type codes (as used in the .abs
- files) and the character map files (with the .chr suffix). The format
+ elements by the indexing rules (say title:w) in the
+ various filters. Fields are defined in a field definition file which,
+ by default, is called default.idx.
+ This file provides the association between field type codes
+ and the character map files (with the .chr suffix). The format
 of the .idx file is as follows
-
+
@@ -106,15 +122,30 @@
 See for details.
+
+
+ icuchain filename
+
+
+ Specifies the filename with ICU tokenization and
+ normalization rules.
+ See for details.
+ Using icuchain for a field type is an alternative to
+ charmap. It does not make sense to define both
+ icuchain and charmap for the same field type.
+
+
-
- Following are three excerpts of the standard
- tab/default.idx configuration file. Notice
- that the index and sort
- are grouping directives, which bind all other following directives
- to them:
-
+
+ Field types
+
+ Following are three excerpts of the standard
+ tab/default.idx configuration file. Notice
+ that the index and sort
+ are grouping directives, which bind all other following directives
+ to them:
+
 # Traditional word index
 # Used if completeness is 'incomplete field' (@attr 6=1) and
 # structure is word/phrase/word-list/free-form-text/document-text
@@ -140,12 +171,13 @@
 sort s
 completeness 1
 charmap string.chr
-
-
+
+
+
- The character map file format
+ Charmap Files
 The character map files are used to define the word tokenization
 and character normalization performed before inserting text into
@@ -346,8 +378,7 @@
 character-representations to their natural form, etc. The
 map directive can also be used to ignore
 leading articles in searching and/or sorting, and to perform other special
- transformations. See section .
+ transformations.
 For example, scan.chr contains the following
@@ -359,6 +390,47 @@
 map (ø) ø
 map (å) å
 ]]>
+
+
+ In addition to specifying sort orders, space (blank) handling,
+ and upper/lowercase folding, you can also use the character map
+ files to make &zebra; ignore leading articles in sorting records,
+ or when doing complete field searching.
+
+
+ This is done using the map directive in the
+ character map file. In a nutshell, what you do is map certain
+ sequences of characters, when they occur in the
+ beginning of a field, to a space. Assuming that the
+ character "@" is defined as a space character in your file, you
+ can do:
+
+ map (^The\s) @
+ map (^the\s) @
+
+ The effect of these directives is to map either 'the' or 'The',
+ followed by a space character, to a space. The hat ^ character
+ denotes beginning-of-field only when complete-subfield indexing
+ or sort indexing is taking place; otherwise, it is treated just
+ as any other character.
+
+
+ Because the default.idx file can be used to
+ associate different character maps with different indexing types
+ -- and you can create additional indexing types, should the need
+ arise -- it is possible to specify that leading articles should
+ be ignored either in sorting, in complete-field searching, or
+ both.
+
+
+ If you ignore certain prefixes in sorting, then these will be
+ eliminated from the index, and sorting will take place as if
+ they weren't there. However, if you set the system up to ignore
+ certain prefixes in searching, then these
+ are deleted both from the indexes and from query terms, when the
+ client specifies complete-field searching. This has the effect
+ that a search for 'the science journal' and 'science journal'
+ would both produce the same results.
@@ -384,49 +456,72 @@
-
- Ignoring leading articles + +
+ ICU Chain Files
- In addition to specifying sort orders, space (blank) handling,
- and upper/lowercase folding, you can also use the character map
- files to make &zebra; ignore leading articles in sorting records,
- or when doing complete field searching.
+ An ICU chain file defines a
+ chain of rules
+ which specify the conversion process to be carried out for each
+ record string for indexing.
- This is done using the map directive in the
- character map file. In a nutshell, what you do is map certain
- sequences of characters, when they occur in the
- beginning of a field, to a space. Assuming that the
- character "@" is defined as a space character in your file, you
- can do:
-
- map (^The\s) @
- map (^the\s) @
-
- The effect of these directives is to map either 'the' or 'The',
- followed by a space character, to a space. The hat ^ character
- denotes beginning-of-field only when complete-subfield indexing
- or sort indexing is taking place; otherwise, it is treated just
- as any other character.
+ Both searching and sorting are based on the sort
+ normalization that ICU provides. This means that scan and sort will
+ return terms in the sort order given by ICU.
- Because the default.idx file can be used to
- associate different character maps with different indexing types
- -- and you can create additional indexing types, should the need
- arise -- it is possible to specify that leading articles should
- be ignored either in sorting, in complete-field searching, or
- both.
+ Zebra uses YAZ' ICU wrapper. Refer to the
+ yaz-icu man page for
+ documentation about the ICU chain rules.
+
+
+ Use the yaz-icu program to test your icuchain rules.
+
+
+ Indexing Greek text
+
+ Consider a system where all "regular" text is to be indexed
+ as Greek (locale: EL).
+ We would have to change our index type file - to read
+
+ # Index Greek words
+ index w
+ completeness 0
+ position 1
+ alwaysmatches 1
+ firstinfield 1
+ icuchain greek.xml
+ ..
+
+ The ICU chain file greek.xml could look
+ as follows:
+
+
+
+
+
+
+
+ ]]>
+
+
- If you ignore certain prefixes in sorting, then these will be
- eliminated from the index, and sorting will take place as if
- they weren't there. However, if you set the system up to ignore
- certain prefixes in searching, then these
- are deleted both from the indexes and from query terms, when the
- client specifies complete-field searching. This has the effect
- that a search for 'the science journal' and 'science journal'
- would both produce the same results.
+ Zebra is shipped with a field types file icu.idx,
+ which is an ICU chain version of default.idx.
+
+ MARCXML indexing using ICU
+
+ The directory examples/marcxml includes
+ a complete sample with MARCXML records that are DOM XML indexed
+ using ICU chain rules. Study the
+ README in the marcxml
+ directory for details.
+
+
-- 1.7.10.4