From b7b3b09b5bf04a832b9602d4717d7e1eb512079c Mon Sep 17 00:00:00 2001 From: Marc Cromme Date: Fri, 25 May 2007 12:30:27 +0000 Subject: [PATCH] added ICU urls and a section on ICU tokenization and normalization --- doc/book.xml | 60 +++++++++++++++++++++++++++--- doc/pazpar2_conf.xml | 100 ++++++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 147 insertions(+), 13 deletions(-) diff --git a/doc/book.xml b/doc/book.xml index 69fc0d9..95edb40 100644 --- a/doc/book.xml +++ b/doc/book.xml @@ -9,7 +9,7 @@ %common; ]> - + Pazpar2 - User's Guide and Reference @@ -19,6 +19,9 @@ Adam Dickmeiss + + Marc Cromme + &version; &copyright-year; @@ -151,6 +154,24 @@ + International + Components for Unicode (ICU) + + + ICU provides Unicode support for non-English languages with + character sets outside the range of 7-bit ASCII, like + Greek, Russian, German and French. Pazpar2 uses the ICU + Unicode character conversions, Unicode normalization, case + folding and other fundamental operations needed in + tokenization, normalization and ranking of records. + + + Compiling, linking, and usage of the ICU libraries is optional, + but strongly recommended for usage in an international + environment. + + + @@ -196,6 +217,7 @@ apt-get install libyaz-dev + apt-get install libicu36-dev With these packages installed, the usual configure + make @@ -208,7 +230,8 @@ Using pazpar2 - This chapter provides a general introduction to the use and deployment of pazpar2. + This chapter provides a general introduction to the use and + deployment of pazpar2.
@@ -225,8 +248,8 @@ functionality, but it isn't a requirement -- you can choose to use pazpar2 entirely as a backend to your regular server-side scripting. When you do use pazpar2 in conjunction - with browser scripting (JavaScript/Ajax, Flash, applets, etc.), there are - special considerations. + with browser scripting (JavaScript/Ajax, Flash, applets, + etc.), there are special considerations. @@ -410,7 +433,8 @@ metasearching is really, really hard. If you want to build a project with pazpar2, and you need access to resources with non-standard interfaces, we can help. We run gateways to more than - 2,000 popular, commercial databases and other resources, making it simple + 2,000 popular, commercial databases and other resources, + making it simple to plug them directly into pazpar2. For a small annual fee per database, we can help you establish connections to your licensed resources. Meanwhile, you can help! If you build your own @@ -430,6 +454,32 @@ implement it.
+ +
+ Unicode Compliance + + Pazpar2 is Unicode compliant and language and locale aware to + the extent that the backend Z39.50 targets in use are. Just a few badly + behaving targets can spoil the search experience considerably + if, for example, Greek, Russian or other non-7-bit-ASCII + search terms are entered. In these cases some targets return + records irrelevant to the query, and the result screens will be + cluttered with noise. + + + While noise from misbehaving targets cannot be removed, it can + be reduced using truly Unicode-based ranking. This is an + option which is available to the system administrator if ICU + support is compiled into Pazpar2, see + for details. + + + In addition, the ICU tokenization and normalization rules must + be defined in the master configuration file described in + . + +
+
diff --git a/doc/pazpar2_conf.xml b/doc/pazpar2_conf.xml index e2be4c0..6db0999 100644 --- a/doc/pazpar2_conf.xml +++ b/doc/pazpar2_conf.xml @@ -8,7 +8,7 @@ %common; ]> - + Pazpar2 @@ -116,6 +116,72 @@ + icu_chain + + + Definition of ICU tokenization and normalization rules, + which are used if ICU support is compiled in. The 'id' + attribute is currently not used, and the 'locale' + attribute must be set to one of the locale strings + defined in ICU. The child elements listed below can be + in any order, except the 'index' element, which logically + belongs at the end of the list. The stated tokenization, + normalization and charmapping instructions are performed + in order from top to bottom. + + + casemap + + + The attribute 'rule' defines the direction of the + per-character casemapping; allowed values are "l" + (lower), "u" (upper), "t" (title). + + + + normalize + + + Normalization and transformation of tokens follow + the rules defined in the 'rule' attribute. For + possible values we refer to the extensive ICU + documentation found at the + ICU + transformation home page. Set filtering + principles are explained at the + ICU set and + filtering page. + + + + tokenize + + + Tokenization is the only rule in the ICU chain + which splits one token into multiple tokens. The + 'rule' attribute may have the following values: + "s" (sentence), "l" (line-break), "w" (word), and + "c" (character), the latter probably not being + very useful in a running Pazpar2 installation. + + + + index + + + Finally, the 'index' element instruction - without + any 'rule' attribute - is used to store the tokens + after chain processing in the relevance ranking + unit of Pazpar2. It will always be the last + instruction in the chain. + + + + + + + + service @@ -144,10 +210,13 @@ This is the name of the data element. It is matched - against the 'type' attribute of the 'metadata' element + against the 'type' attribute of the + 'metadata' element in the normalized record. 
A warning is produced if - metdata elements with an unknown name are found in the - normalized record. This name is also used to represent + metadata elements with an unknown name are + found in the + normalized record. This name is also used to + represent data elements in the records returned by the webservice API, and to name sort lists and browse facets. @@ -194,11 +263,13 @@ rank - Specifies that this element is to be used to help rank + Specifies that this element is to be used to + help rank records against the user's query (when ranking is requested). The value is an integer, used as a multiplier against the basic TF*IDF score. A value of - 1 is the base, higher values give additional weight to + 1 is the base, higher values give additional + weight to elements of this type. The default is '0', which excludes this element from the rank calculation. @@ -212,7 +283,8 @@ termlist, or browse facet. Values are tabulated from incoming records, and a highscore of values (with their associated frequency) is made available to the - client through the webservice API. The possible values + client through the webservice API. + The possible values are 'yes' and 'no' (default). @@ -258,6 +330,18 @@ + + + + @@ -473,7 +557,7 @@ - + -- 1.7.10.4
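
Note on the icu_chain documentation added in this patch: as a rough illustration only, a chain that follows the described top-to-bottom processing order might look like the sketch below. The element names (icu_chain, casemap, normalize, tokenize, index), the 'id', 'locale' and 'rule' attributes, and the "l" values for tokenize and casemap come from the text of the patch. The exact attribute syntax, the id value, the locale "el", and the two normalize rule strings are assumptions made for this example and should be checked against the ICU transformation and set documentation before use.

```xml
<!-- Hypothetical sketch, not taken from the patch itself. -->
<icu_chain id="el:word" locale="el">
  <!-- Assumed ICU transform rule: drop control characters first. -->
  <normalize rule="[:Control:] Any-Remove"/>
  <!-- Split the input at line-break boundaries ("l"). -->
  <tokenize rule="l"/>
  <!-- Assumed ICU transform rule: strip whitespace and punctuation from each token. -->
  <normalize rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <!-- Lower-case every token ("l"). -->
  <casemap rule="l"/>
  <!-- Store the resulting tokens for relevance ranking; no 'rule' attribute, always last. -->
  <index/>
</icu_chain>
```

The tokenize step is placed before the second normalize step because, per the description above, tokenization is the only instruction that splits one token into several, so whitespace has to survive until the split has happened.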