X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fbook.xml;fp=doc%2Fbook.xml;h=294031ec2925e60c1b308264b69254e7ceb970e8;hb=383965c54b667bba902d92d86b9eef4f3fa1810f;hp=9f96d75a39d3f3fdad14e70eb20d3451c304428d;hpb=821fa07cd1abcc22928db9a9a80eecde9c0dd360;p=pazpar2-moved-to-github.git
diff --git a/doc/book.xml b/doc/book.xml
index 9f96d75..294031e 100644
--- a/doc/book.xml
+++ b/doc/book.xml
@@ -825,7 +825,77 @@
-
+
+ Relevance ranking
+
+ Pazpar2 uses a variant of the term frequency-inverse document frequency
+ (Tf-idf) ranking algorithm.
+
+
+ The Tf-part is straightforward to calculate and is based on the
+ documents that Pazpar2 fetches. The idf-part, however, is trickier
+ because the corpus at hand contains only the relevant documents and
+ none of the irrelevant ones. Pazpar2 does not have the full corpus, only
+ the documents that match a particular search.
+
+
+ Computation of the Tf-part is based on the normalized documents.
+ The length, the position and the terms are thus normalized at this point.
+ The computation is also performed for each document received from the
+ target, before merging takes place. The result of a TF-computation is
+ added to the TF-total of a cluster. Thus, if a document occurs twice,
+ the TF-part is doubled. That can be adjusted, however, because the
+ TF-part may be divided by the number of documents in a cluster.
+
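+ The effect of duplicate documents on the TF-total can be illustrated with
+ a small Python sketch. It is not taken from the Pazpar2 sources: the
+ function name cluster_tf and the flag divide_by_cluster_size are
+ hypothetical, and each document is assumed to carry one TF value per
+ query term.

```python
def cluster_tf(document_tfs, divide_by_cluster_size=False):
    """Sum per-document TF vectors into one TF-total for the cluster.

    document_tfs: list of equal-length lists, one TF value per query term.
    divide_by_cluster_size: hypothetical switch standing in for the
    adjustment that divides by the number of documents in the cluster.
    """
    total = [0.0] * len(document_tfs[0])
    for tf in document_tfs:
        for i, value in enumerate(tf):
            total[i] += value
    if divide_by_cluster_size:
        total = [value / len(document_tfs) for value in total]
    return total


# The same document received twice (for example from two targets).
doc = [0.5, 0.25]
summed = cluster_tf([doc, doc])
averaged = cluster_tf([doc, doc], divide_by_cluster_size=True)
```

+ Without the division the duplicate doubles every TF value; with it, the
+ cluster scores the same as a single copy of the document.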
+
+ The algorithm used by Pazpar2 has two phases. In phase one
+ Pazpar2 computes a tf-array. This is done as records are
+ fetched from the database. This phase uses the rank weight
+ w and the rank tweaks lead,
+ follow and length.
+
+
+ 0)
+ w[i] += w[i] * follow / (1+log2(d)
+ // length: length of field (number of terms that is)
+ if (length strategy is "linear")
+ tf[i] += w[i] / length;
+ else if (length strategy is "log")
+ tf[i] += w[i] / log2(length);
+ else if (length strategy is "none")
+ tf[i] += w[i];
+ ]]>
+
+ In phase two, the idf-array and the final score are computed.
+ This is done for each cluster as part of each show command.
+ The rank tweak cluster is in use here.
+
+ 0)
+ idf[i] = log(1 + doctotal / dococcur[i])
+ else
+ idf[i] = 0;
+
+ relevance = 0;
+ for i = 1, .., N: (each term)
+ if (cluster is "yes")
+ tf[i] = tf[i] / cluster_size;
+ relevance += 100000 * tf[i] / idf[i];
+ ]]>
+