X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Ffield-structure.xml;h=3a0a5f2535027830e1d86c71c7b7cfe00c834376;hb=a92270aafb3ba7b336bc2334ed7c44c631c1cb29;hp=6318dbe3f805bd686986fb077f2d3131edb5caf6;hpb=b9c1a6fcf5c4821d0190efdecbc14ea5d6c96aec;p=idzebra-moved-to-github.git
diff --git a/doc/field-structure.xml b/doc/field-structure.xml
index 6318dbe..3a0a5f2 100644
--- a/doc/field-structure.xml
+++ b/doc/field-structure.xml
@@ -1,5 +1,5 @@
- + Field Structure and Character Sets
@@ -103,6 +103,7 @@
This is the filename of the character map to be used for this index for field type. + See for details.
@@ -112,10 +113,67 @@
The character map file format - The contents of the character map files are structured as follows: + The character map files are used to define the word tokenization + and character normalization performed before inserting text into + the inverse indexes. Zebra ships with the predefined character map + files tab/*.chr. Users are allowed to add + and/or modify maps according to their needs. + + Character maps predefined in Zebra + + + + File name + Intended type + Description + + + + + numeric.chr + :n + Numeric digit tokenization and normalization map. All + characters not in the set -{0-9}., will be + suppressed. Note that floating point numbers are processed + fine, but scientific exponential numbers are trashed. + + + scan.chr + :w or :p + Word tokenization char map for Scandinavian + languages. This one resembles the generic word tokenization + character map tab/string.chr, the main + differences are sorting of the special characters + üzæäøöå and equivalence maps according to + Scandinavian language rules. + + + string.chr + :w or :p + General word tokenization and normalization character + map, mostly useful for English texts. Use this to derive your + own language tokenization and normalization derivatives. + + + urx.chr + :u + URL parsing and tokenization character map. + + + @ + :0 + Do-nothing character map used for literal binary + indexing. There is no existing file associated to it, and + there is no normalization or tokenization performed at all. + + + +
+ + The contents of the character map files are structured as follows: @@ -170,16 +228,30 @@ + + For example, scan.chr contains the following + lowercase normalization and sorting order: + + lowercase {0-9}{a-y}üzæäøöå + + uppercase value-set This directive introduces the - upper-case equivalencis to the value set (if any). The number and + upper-case equivalences to the value set (if any). The number and order of the entries in the list should be the same as in the lowercase directive. + + For example, scan.chr contains the following + uppercase equivalent: + + uppercase {0-9}{A-Y}ÜZÆÄØÖÅ + + space value-set @@ -194,6 +266,13 @@ uppercase and lowercase directives. + + For example, scan.chr contains the following + space instruction: + ?@\[\\]^_`\{|}~ + ]]> + map value-set @@ -204,7 +283,7 @@ members of the value-set on the left to the character on the right. The character on the right must occur in the value set (the lowercase directive) of the - character set, but it may be a paranthesis-enclosed + character set, but it may be a parenthesis-enclosed multi-octet character. This directive may be used to map diacritics to their base characters, or to map HTML-style character-representations to their natural form, etc. The @@ -213,6 +292,37 @@ transformations. See section . + + For example, scan.chr contains the following + map instructions among others, to make sure that HTML entity + encoded Danish special characters are mapped to the + equivalent Latin-1 characters: + + + + + equivalent value-set + + + This directive introduces equivalence classes of characters + and/or strings for sorting purposes only. It resembles the map + directive, but does not affect search and retrieval indexing, + but only sorting order under present requests. + + + For example, scan.chr contains the following + equivalent sorting instructions, which can be uncommented: + + @@ -262,134 +372,6 @@
-
- Accessing Zebra internal record data using - the <literal>zebra::</literal> element sets - - Starting with Zebra version - 2.0.4-2 or newer, one has the possibility to - use the special - zebra::data, - zebra::meta and - zebra::index element set names. - - - - Usage of the zebra:: element sets accesses - record data directly from the internal storage, and will - therefore work exactly the same way, irrespectively of indexing - filter used. - - - These element set names are optimized for retrieval speed, and - will perform better than using for example - alvis filter XSLT based extraction of small - parts of the records. - - - - For example, to fetch the raw binary record data stored in the - zebra internal storage, or on the filesystem, the following - commands can be issued: - - Z> f @attr 1=title my - Z> format xml - Z> elements zebra::data - Z> s 1+1 - Z> format sutrs - Z> s 1+1 - Z> format usmarc - Z> s 1+1 - - - - - The special - zebra::data element set name is - defined for any record syntax, but will always fetch - the raw record data in exactly the original form. No record syntax - specific transformations will be applied to the raw record data. - - - - Also, Zebra internal metadata about the record can be accessed: - - Z> f @attr 1=title my - Z> format xml - Z> elements zebra::meta::sysno - Z> s 1+1 - - displays in XML record syntax only internal - record system number, whereas - - Z> f @attr 1=title my - Z> format xml - Z> elements zebra::meta - Z> s 1+1 - - displays all available metadata on the record. These include sytem - number, database name, indexed filename, filter used for indexing, - score and static ranking information and finally bytesize of record. - - - - The special - zebra::meta element set names are only - defined for - SUTRS and XML record - syntaxes. - - - - Sometimes, it is very hard to figure out what exactly has been - indexed how and in which indexes. 
Using the indexing stylesheet of - the Alvis filter, one can at least see which portion of the record - went into which index, but a similar aid does not exist for all - other indexing filters. - - - The special - zebra::index element set names are provided to - access information on per record indexed fields. For example, the - queries - - Z> f @attr 1=title my - Z> format sutrs - Z> elements zebra::index - Z> s 1+1 - - will display all indexed tokens from all indexed fields of the - first record, and it will display in SUTRS - record syntax, whereas - - Z> f @attr 1=title my - Z> format xml - Z> elements zebra::index::title - Z> s 1+1 - Z> elements zebra::index::title:p - Z> s 1+1 - - displays in XML record syntax only the content - of the zebra string index title, or - even only the type p phrase indexed part of it. - - - - The special zebra::index - element set names are only - defined for - SUTRS and XML record - syntaxes. - - Trying to access numeric Bib-1 use - attributes or trying to access non-existent zebra intern string - access points will result in a - - Diagnostic [25]: Specified element set name not valid for specified database - - - -