doc/zvrank.txt

   1 Zvrank: an experimental ranking algorithm. See doc/zvrank.txt and
   2 source in index/zvrank.c. Enable this by using rank: zvrank in zebra.cfg.
   3 Contributed by Johannes Leveling <Johannes.Leveling at
   4 fernuni-hagen.de>
   5
   6
   7 This is an experimental addon (or patch) for the Zebra server
   8 to faciliate ranked searches with the Vector Space Model.
   9 To install the addon, copy the file 'rank1.c' in the index
  10 directory of the Zebra distribution to 'rank1.c.orig' and
  11 replace 'rank1.c' with 'zrank1.c'. Then recompile Zebra.
  12 (make sure you link with the math library, so add the linker option
  13 '-lm'  if necessary).
  14
  15
  16 Introduction:
  17 =============
  18 In the vector space model, queries and documents are
  19 represented as vectors of term weights. Positive weights
  20 characterise terms assigned to a document while a zero weight
  21 is used for terms that do not occur in the document.
  22 The most popular method to obtain term weights is the tf-idf
  23 weighting schema, where the weight of a term is calculated as
  24 the product of a term factor (tf) and an inverse document factor
  25 (idf). Combining the tf and idf factors (usually the product)
  26 and normalizing them yields a tf-idf vector (or term weight vector)
  27 modelling a document (or query).
  28 The similarity score between a query and a document (or the
  29 similarity between two documents) depends on the term weight vectors and is
  30 then computed by a similarity function combining these two
  31 weight vectors (most often the cosine between the vectors is computed).
  32
  33 Weighting functions:
  34 ====================
  35 1) term frequency functions:
  36    locc denotes the in document frequency of a term (local frequency)
  37
  38    none ('n'):      tf = locc
  39    binary ('b'):    tf = 1
  40    max_norm ('m'):  tf = locc/max_locc
  41    aug_norm ('a'):  tf = K + (1 - K) * (locc/max_locc)  ; K = 0.5
  42    square ('s'):    tf = locc * locc
  43    log ('l'):       tf = log(locc) + 1
  44
  45 2) inverse document functions:
  46    gocc is the database frequency of a term (global frequncy)
  47    num_docs is the number of documents in the database
  48
  49    none ('n'):      idf = 1
  50    tfidf ('t'):     idf = log (num_docs / gocc)
  51    prob ('p'):      idf = log ((num_docs - gocc) / gocc)
  52    freq ('f'):      idf = 1 / n
  53    squared ('s'):   idf = log (num_docs / gocc) ^ 2
  54
  55 3) weight normalisation functions:
  56    wt = tf * idf
  57
  58    none ('n'):      nwt = wt
  59    sum ('s'):       nwt = wt / sum ( wt_i )
  60    cosine ('c'):    nwt = wt / sqrt ( sum ( wt_i ^2 ))
  61    fourth ('f'):    nwt = wt / sum ( wt_i ^ 4 )
  62    max ('m'):       nwt = wt / max ( wt_i )
  63
  64
  65 For example, the string 'atc' indicates that the
  66 augmented normalized term frequency ('a'), the
  67 usual tfidf weight ('t') and cosine normalisation ('c') is to be
  68 used for term weight caomputation.
  69
  70 *) similarity function:
  71    - cosine ('c')
  72    others are (TODO, not yet implemented):
  73    - pivoted unqiue normalisation ('p')
  74    - jaccard ('j')
  75    - dice ('d')
  76    - minkwoski ('m')
  77
  78 (Note: at the moment, there are 6 * 5 * 5 * 6 * 5 * 5 (* 1)
  79 ranking schemes selectable, but not all of them work).
  80
  81
  82 Example query (for yaz-client):
  83 ===============================
  84   find @attr 1=4 @attr 2=102 @attr 4=105 "where do i find literature on medical databases"
  85 Relation attribute 102 indicates that the results should be ranked by relevance
  86 and the query should be treated as free text (or wordlist).
  87
  88
  89
  90 Pitfalls and problems:
  91 ======================
  92 - The name of the ranking schema should be read from the Zebra
  93   configuration file.
  94 - Some weighting schemas require values to be calculated
  95   that are assigned constant values in this addon.
  96   For example, the db_f_max is the maximum frequency of a term
  97   in the database.
  98 - Ranking (or weight computation) is done online, e. g. immediately before
  99   records are retrieved. A faster way to get term weights would be to store
 100   them within inverted files. Adding this feature to Zebra would require major
 101   changes for the indexing process (as far as I can tell).
 102 - Stopwords are frequent words considered not useful for indexing
 103   documents (for example, "and", "the", "above" are common stopwords).
 104   Often these words do not carry semantic information. A stopword
 105   list might considerably reduce the size of the index and will probably
 106   enhance precision for natural language queries. In Zebra, stopword removal
 107   is possible with input filters.
 108 - Stemming is often used to map various morphologic forms of a
 109   concept to a single representation (for example, "countries" and
 110   "country" should both be stemmed to "country"). Using stemming for indexing
 111   is used to increase recall. In Zebra, stemming should be part of input
 112   filtering.
 113
 114
 115 Literature:
 116 ===========
 117 * Sebastian Hammer and Adam Dickmeiss and Heikki Levanto and Mike Taylor,
 118   Zebra - user's guide and reference,
 119   IndexData, 1995-2003.
 120
 121 * Gerard Salton and Chris Buckley,
 122   "Term Weighting Approaches in Automatic Text Retrieval",
 123   Dept. of Computer Science, Cornell University,
 124   TR 87-881, November 1987.
 125   Also appeared in:
 126   Information Processing and Management, vol. 24, no. 5, pp. 513--523, 1988.
 127
 128 * Justin Zobel and Alistair Moffat,
 129   Exploring the Similarity Space,
 130   SIGIR Forum 32(1), p. 18-34, 1998.
 131   http://citeseer.nj.nec.com/zobel98exploring.html
 132
 133 * SMART related documents:
 134   http://www.dcs.gla.ac.uk/ftp/pub/Bruin/HTML/SMART.HTM
 135
 136
 137
 138 Nomenclature / glossary:
 139 ========================
 140 - Database and collection are used as synonyms
 141 - A Document is the indexed part of a record that is referred to in a query
 142   (for a title search, the title entry)
 143
 144 * Ranking schema
 145   A ranking schema is identified by a seven character string
 146   (note: this may change in the future).
 147   The first three characters indicate the function to be used
 148   for the documents term frequency factor, the documents
 149   inverse document frequency factor and the function to combine
 150   these factors to assign a term weight.
 151   The middle character is the delimiter '-'.
 152   The last three characters indicate the functions for query term weighting.
 153   Note that different functions may be used for query and document vectors.
 154
 155   The default similarity function calculates the cosine
 156   between a document term vector and the query term vector.
 157
 158 * Term:
 159   an atomic concept used for indexing (a string),
 160   for example a word, a phrase or a number
 161 * Document:
 162   In Zebra, any part of a record that has index terms assigned to
 163   it. As data can (sometimes) be structured, document also refers to the
 164   smallest substructure with index terms (for example, a newspaper
 165   article may be structured into its title, abstract and its body of
 166   text components, which can be considered as different documents).
 167 * Term weighting function:
 168   the function used to combine and normalize tf and idf
 169 * Term frequency factor (tf) / Local weight:
 170   It indicates how important a term is to a particular document
 171   and depends on how many times a term occurs in a document.
 172 * Inverse document factor (idf) / Global weight:
 173   The global weight indicates how important a term is to the entire
 174   database based on the number of documents in which the term occurs.
 175   The inverse document frequency is based on the observation that a less
 176   frequently occurring term has better properties discriminating
 177   documents than a term that in more documents.
 178 * Normalisation:
 179   the normalisation function tries to compensate for ranking discrepancies
 180   (for example different document lengths).
 181 * Score:
 182   The score of a document indicates its similarity to the query
 183   (0 <= score <=1000)
 184 * Rank:
 185   The rank of a document is the position in the ranked result set.
 186   (first document: rank 1, etc.)
 187
 188 TODO:
 189 =====
 190 - replace 'fprintf' with 'yaz_log'
 191 - produce small test database and test cases
 192 - extend naming schema to eight chars (include similarity functions)
 193 - warn if schema is not fully available (e.g. if 'num_docs' or 'tf_max'
 194   are used)
 195 - optimize implementation (!)
 196 - Some structure elements are declared as integers ('int').
 197   For larger databases, they might have to be declared as 'unsigned long int'
 198 - 'd_f_max_str' and 'f_max_str' are not really needed (in DS, RS)
 199 - 'db_num_docs' is the number of documents in the database.
 200   (probably the number of sysnos)
 201   This is computed for the DBs Explain record as 'recordCount'
 202   and should be avaialable as reg-> ... -> recordCount
 203 - 'db_terms' is the number of distinct terms used in the database.
 204   (probably the number of keys)
 205 - maybe maximum frequencies can be computed on-the-fly for result sets
 206   (in contrast to the whole set of records)
 207