From 383965c54b667bba902d92d86b9eef4f3fa1810f Mon Sep 17 00:00:00 2001
From: Adam Dickmeiss
Date: Fri, 5 Oct 2012 14:57:34 +0200
Subject: [PATCH] Separate chapter about ranking

---
 doc/book.xml         | 72 +++++++++++++++++++++++++++++++++++++++++-
 doc/pazpar2_conf.xml | 85 +++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 148 insertions(+), 9 deletions(-)

diff --git a/doc/book.xml b/doc/book.xml
index 9f96d75..294031e 100644
--- a/doc/book.xml
+++ b/doc/book.xml
@@ -825,7 +825,77 @@
-
+
+ Relevance ranking
+
+ Pazpar2 uses a variant of the term frequency–inverse document frequency
+ (Tf-idf) ranking algorithm.
+
+
+ The Tf-part is straightforward to calculate and is based on the
+ documents that Pazpar2 fetches. The idf-part, however, is trickier
+ since the corpus at hand is ONLY the relevant documents and not
+ irrelevant ones. Pazpar2 does not have the full corpus -- only the
+ documents that match a particular search.
+
+
+ Computation of the Tf-part is based on the normalized documents.
+ The length, the position and terms are thus normalized at this point.
+ Also, the computation is performed for each document received from the
+ target, before merging takes place. The result of a Tf-computation is
+ added to the Tf-total of a cluster. Thus, if a document occurs twice,
+ the Tf-part is doubled. That, however, can be adjusted, because the
+ Tf-part may be divided by the number of documents in a cluster.
+
+
+ The algorithm used by Pazpar2 has two phases. In phase one
+ Pazpar2 computes a tf-array. This is done as records are
+ fetched from the database. In this phase, the rank weight
+ w and the rank tweaks lead,
+ follow and length are in use.
+
+
+ 0)
+       w[i] += w[i] * follow / (1+log2(d))
+   // length: length of field (that is, number of terms)
+   if (length strategy is "linear")
+      tf[i] += w[i] / length;
+   else if (length strategy is "log")
+      tf[i] += w[i] / log2(length);
+   else if (length strategy is "none")
+      tf[i] += w[i];
+ ]]>
+
+ In phase two, the idf-array and the final score are computed.
+ This is done for each cluster as part of each show command.
+ The rank tweak cluster is in use here.
+
+ 0)
+      idf[i] = log(1 + doctotal / dococcur[i])
+   else
+      idf[i] = 0;
+
+   relevance = 0;
+   for i = 1, .., N: (each term)
+      if (cluster is "no")
+         tf[i] = tf[i] / cluster_size;
+      relevance += 100000 * tf[i] * idf[i];
+ ]]>
+
diff --git a/doc/pazpar2_conf.xml b/doc/pazpar2_conf.xml
index d7de16f..cbad09d 100644
--- a/doc/pazpar2_conf.xml
+++ b/doc/pazpar2_conf.xml
@@ -268,7 +268,7 @@
 M [F N] where M is an integer, used as a
- multiplier against the basic TF*IDF score. A value of
+ weight against the basic TF*IDF score. A value of
 1 is the base, higher values give additional weight to elements of this type. The default is '0', which excludes this element from the rank calculation.
@@ -289,6 +289,8 @@
 The per field rank was introduced in Pazpar2 1.6.15. Earlier releases only allowed a rank value M (simple integer).
+ See for more
+ about ranking.
@@ -585,18 +587,85 @@
 rank
- Customizes the ranking (relevance) algorithm.
- Attribute 'cluster' is a boolean
- that controls whether Pazpar2 should boost ranking for merged
- records. Is 'yes' by default. A value of 'no' will make
- Pazpar2 average ranking of each record in a cluster.
+ Customizes the ranking (relevance) algorithm. Also known as
+ rank tweaks. The rank element
+ accepts the following attributes, all of which are optional:
+
+
+ cluster
+
+
+ Attribute 'cluster' is a boolean
+ that controls whether Pazpar2 should boost ranking for merged
+ records. Is 'yes' by default. A value of 'no' will make
+ Pazpar2 average ranking of each record in a cluster.
+
+
+
+
+ debug
+
+
+ Attribute 'debug' is a boolean
+ that controls whether Pazpar2 should include details
+ about ranking for each document in the show command's
+ response. Enable by using value "yes", disable by using
+ value "no" (default).
+
+
+
+
+ follow
+
+
+ Attribute 'follow' is a floating point number greater than
+ or equal to 0. A positive number will boost weight for terms
+ that occur close to each other (proximity, distance).
+ A value of 1 will double the weight if two terms are in
+ proximity distance of 1 (next to each other). The default
+ value of 'follow' is 0 (order will not affect weight).
+
+
+
+
+ lead
+
+
+ Attribute 'lead' is a floating point number.
+ It controls whether term weight is reduced by position
+ from the start of a metadata field. A positive value of 'lead'
+ will reduce the weight of a term the further it appears from the lead
+ of the field. Default value is 0 (no reduction of weight by
+ position).
+
+
+
+
+ length
+
+
+ Attribute 'length' determines how/if term weight should be
+ divided by length of metadata field. A value of "linear" will
+ divide by length. A value of "log" will divide by log2(length).
+ A value of "none" will leave term weight as is (no division).
+ Default value is "linear".
+
+
+
+
- This configuration was added in pazpar2 1.6.18.
+ Refer to to see how
+ these tweaks are used in computation of score.
+
+
+ Customization of ranking algorithm was introduced with
+ Pazpar2 1.6.18. The semantics of some of the fields changed
+ in versions up to 1.6.21.
-
+
 sort-default
-- 
1.7.10.4
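
The two phases described in the "Relevance ranking" section of this patch can be sketched in Python. This is a minimal illustration under stated assumptions, not Pazpar2's implementation. The function names (follow_boost, length_normalize, relevance) are hypothetical. The sketch multiplies tf by idf, as in standard TF-IDF, and averages over the cluster when cluster is "no", as the description of the rank element states. The lead tweak is left out because its formula does not appear in the text.

```python
import math


def follow_boost(weight, follow, distance):
    """Phase one: boost a term weight by its proximity to another query term.

    `distance` is the number of positions between the two terms;
    a distance of 1 means the terms are next to each other.
    """
    if distance > 0:
        weight += weight * follow / (1 + math.log2(distance))
    return weight


def length_normalize(weight, length, strategy="linear"):
    """Phase one: divide a term weight by the field length (number of terms)."""
    if strategy == "linear":
        return weight / length
    if strategy == "log":
        # Assumption: length >= 2, since log2(1) == 0 would divide by zero.
        return weight / math.log2(length)
    if strategy == "none":
        return weight
    raise ValueError(f"unknown length strategy: {strategy!r}")


def relevance(tf, dococcur, doctotal, cluster_size, cluster=True):
    """Phase two: combine per-term tf totals and idf into one cluster score.

    tf[i] is the tf total of term i for the cluster, dococcur[i] the number
    of documents containing term i, doctotal the number of documents seen.
    cluster=True models cluster="yes" (merged records are boosted);
    cluster=False models cluster="no" (tf is averaged over the cluster).
    """
    score = 0.0
    for tf_i, occur in zip(tf, dococcur):
        idf = math.log(1 + doctotal / occur) if occur > 0 else 0.0
        if not cluster:
            tf_i = tf_i / cluster_size
        # Assumption: tf and idf are multiplied, as in standard TF-IDF.
        score += 100000 * tf_i * idf
    return score
```

With follow set to 1, follow_boost(1.0, 1.0, 1) returns 2.0, which agrees with the statement that a value of 1 doubles the weight of two adjacent terms.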