src/relevance-todo-heikki.txt

Relevancy stuff - status 20-Jan-2014 - How to get going again?

This summary is also in PAZ-917.

I have done some ranking-related work, and it now looks like we might not
continue with it. This quick summary states what I have done and what I would
do next, so that we can pick the ball up again if need be.

Added a new setting, native_score, which can be a field name for the score. If
specified, this is the field that contains the ranking score from the back end.
These scores are normalized to a range that is close to 1.0 .. 0.0, by
minimizing the squared distance from the 1/position curve.

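One way to read "minimizing the squared distance from the 1/position curve" is
as an ordinary least-squares fit. The sketch below is only an illustration: it
assumes a linear mapping norm = a * score + b, which may not be the transform
the real code uses, and the name fit_to_position_curve is made up for this note.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from relevance.c.
 * scores[i] is the back-end score of the record at position i+1.
 * Chooses a and b so that the sum over i of
 *     (a * scores[i] + b - 1/(i+1))^2
 * is as small as possible (closed-form linear least squares).
 * Returns 0 on success, -1 if the fit is degenerate
 * (fewer than two records, or all scores equal). */
static int fit_to_position_curve(const double *scores, size_t n,
                                 double *a, double *b)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, det;
    size_t i;

    assert(scores && a && b);
    if (n < 2)
        return -1;
    for (i = 0; i < n; i++) {
        double x = scores[i];
        double y = 1.0 / (double) (i + 1);   /* target: 1/position */
        sx += x;
        sy += y;
        sxx += x * x;
        sxy += x * y;
    }
    det = (double) n * sxx - sx * sx;
    if (det == 0.0)
        return -1;
    *a = ((double) n * sxy - sx * sy) / det;
    *b = (sy - *a * sx) / (double) n;
    return 0;
}
```

With a and b in hand, each record's normalized score would be
a * scores[i] + b. Back-end scores that already follow the 1/position shape
land on the curve exactly; others end up close to the 1.0 .. 0.0 range without
being clamped to it.
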
The setting can also be the special value "position", which can be used when
the target returns the records in relevancy order but without a numeric score.
In that case the score is a guess based on 1/position. There is also another
magic value, "internal", which uses our TF/IDF ranking, normalized the same way
as the back-end scores.

The normalizing works fine, as long as records have scores from the back end.
For our own TF/IDF ranking, things don't work so well yet, as it works on the
cluster level, not on individual records. I haven't quite sorted out how to do
the TF/IDF calculation on a record level. We probably need to duplicate the
ranking code and keep score vectors per record as well as per cluster, so as to
keep the current behavior as the default. For now there is a dirty hack that
puts the cluster score in the records too.

The record scores are supposed to be combined into cluster scores, so that
clusters can be sorted. This is not yet done, but should not be much work. At
the moment each cluster gets one of the record scores directly. Once this is
done, we can define new setting(s) to adjust the cluster scoring: first by
selecting an algorithm (max, avg, sum, or some form of decaying sum, such as
largest score + half the second largest + a quarter of the next largest, etc.),
and then by adjustment parameters that give some targets extra weight (at least
when averaging) or an extra boost (to indicate they tend to have better
results).

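The candidate algorithms above could be sketched roughly as follows. None of
this exists in the code yet; the enum, the function names, and the fixed decay
factor of 0.5 are assumptions made for illustration, and the per-target weight
and boost parameters are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names, for illustration only. */
enum combine_method { COMBINE_MAX, COMBINE_AVG, COMBINE_SUM, COMBINE_DECAY };

/* qsort comparator that orders doubles from largest to smallest */
static int cmp_desc(const void *pa, const void *pb)
{
    double a = *(const double *) pa, b = *(const double *) pb;
    return (a < b) - (a > b);
}

/* Combines the n record scores of one cluster into a cluster score.
 * Note: COMBINE_DECAY sorts scores[] in place. */
static double combine_scores(double *scores, size_t n, enum combine_method m)
{
    double total = 0.0, factor = 1.0;
    size_t i;

    assert(scores || n == 0);
    if (n == 0)
        return 0.0;
    switch (m) {
    case COMBINE_MAX:
        total = scores[0];
        for (i = 1; i < n; i++)
            if (scores[i] > total)
                total = scores[i];
        return total;
    case COMBINE_AVG:
    case COMBINE_SUM:
        for (i = 0; i < n; i++)
            total += scores[i];
        return m == COMBINE_AVG ? total / (double) n : total;
    case COMBINE_DECAY:
        /* largest + half the second largest + a quarter of the next ... */
        qsort(scores, n, sizeof *scores, cmp_desc);
        for (i = 0; i < n; i++, factor *= 0.5)
            total += factor * scores[i];
        return total;
    }
    return 0.0;
}
```

With normalized record scores the decaying sum stays below twice the best
score, so a cluster with many weak records cannot swamp one with a single
strong record, which plain sum would allow.
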
Before starting to code anything much, we obviously need tests. There is a
decent test framework, so it should not be many days' work to make a number of
test cases: first for the native ranking, then for the normalized TF/IDF (once
we get that coded), and then for merging record scores into cluster scores.


* * *


How does relevancy ranking work in pz2
Need to understand it before I can change it to work on individual records

Data structures

struct relevance {
    int *doc_frequency_vec;
    int *term_frequency_vec_tmp;
    int *term_pos;
    int vec_len;
    struct word_entry *entries;
    ...
    struct norm_client *norm;   // my list of (sub)records for normalizing, one list per client
}

struct word_entry {
    const char *norm_str;
    const char *display_str;
    int termno;
    char *ccl_field;
    struct word_entry *next;
}

// Find the norm_client entry for this client, or create one if not there
struct norm_client *findnorm( struct relevance *rel, struct client* client)

// Add all records from a cluster into the list for that client, for normalizing later
static void setup_norm_record( struct relevance *rel,  struct record_cluster *clust)

// find the word_entry that matches the norm_str
// if found, sets up entries->ccl_field, and weight
static struct word_entry *word_entry_match(struct relevance *r,
                                           const char *norm_str,
                                           const char *rank, int *weight)

// Put <match> tags around the words in the record's text
// not called from inside relevance.c at all! Called from session.c:2051,
// ingest_to_cluster(). Can probably be ignored for this summary.
int relevance_snippet(struct relevance *r,
                      const char *words, const char *name,
                      WRBUF w_snippet)

// not called from inside relevance.c!
// Seems to implement the decay and follow stuff, adjusting term weights within a field
// Called from session.c:2286, ingest_to_cluster(), in if(rank), with the comment
// "ranking of _all_ fields enabled".
void relevance_countwords(struct relevance *r, struct record_cluster *cluster,
                          const char *words, const char *rank,
                          const char *name)

// Recurses through a RPN query, pulls out the terms we want for ranking
// Appends each word to relevance->entries with normalized string,
// ccl_field, termno, and display_str.
// Ok, here we decide which terms we are interested in!
// called from relevance_create_ccl(), (and recursively from itself)
static void pull_terms(struct relevance *res, struct ccl_rpn_node *n)

// Clears the relevance->doc_frequency_vec
void relevance_clear(struct relevance *r)

// Sets up the relevance structure. Gets lots of controlling params.
// Pulls terms, which gives the vec_len, then mallocs relevance->doc_frequency_vec,
// term_frequency_vec_tmp, and term_pos. Calls relevance_clear to clear the doc_frequency_vec.
struct relevance *relevance_create_ccl(pp2_charset_fact_t pft,
                                       struct ccl_rpn_node *query,
                                       int rank_cluster,
                                       double follow_factor, double lead_decay,
                                       int length_divide)

// kills the nmem, freeing all memory.
void relevance_destroy(struct relevance **rp)

// Adds the values from src into the dst, for both term_frequency_vec and
// term_frequency_vecf. Both src and dst are clusters.
// Called from reclists.c:419 merge_cluster()
void relevance_mergerec(struct relevance *r, struct record_cluster *dst,
                        const struct record_cluster *src)

// Adds a new cluster to the relevance stuff
// mallocs rec->term_frequency_vec and _vecf for the cluster, and clears them to zeroes
// Called from reclists.c:458 new_cluster()
void relevance_newrec(struct relevance *r, struct record_cluster *rec)

// increments relevance->doc_frequency_vec[i] for each i that has something in the
// cluster->term_frequency_vec[i], i=1..vec_len, and increments doc_frequency_vec[0].
// called from session.c:2330, ingest_to_cluster(), near the end
void relevance_donerecord(struct relevance *r, struct record_cluster *cluster)

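As I read it, the bookkeeping in relevance_donerecord amounts to the loop
below. This is a reconstruction from the description, not a copy of the real
function: the two structs are cut-down stand-ins, and the sketch assumes that
vec_len counts slot 0, so the term slots are 1 .. vec_len-1.

```c
#include <assert.h>

/* Cut-down stand-ins for struct relevance and struct record_cluster,
 * holding only the fields this sketch touches. */
struct rel_sketch {
    int *doc_frequency_vec;
    int vec_len;            /* assumed to include slot 0 */
};

struct cluster_sketch {
    int *term_frequency_vec;
};

/* For every query term that occurs in the cluster, count one more
 * "document" containing that term. Slot 0 counts all clusters seen. */
static void donerecord_sketch(struct rel_sketch *r,
                              const struct cluster_sketch *c)
{
    int i;

    assert(r && c);
    for (i = 1; i < r->vec_len; i++)
        if (c->term_frequency_vec[i] > 0)
            r->doc_frequency_vec[i]++;
    r->doc_frequency_vec[0]++;
}
```

So doc_frequency_vec[0] ends up as the total number of clusters ingested, and
doc_frequency_vec[i] as the number of clusters that contain term i at least
once.
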
// Calculates an idfvec from relevance->doc_frequency_vec (basically 1/doc_frequency_vec,
// times doc_frequency_vec[0]).
// Then loops through all clusters, and for each calculates score from each term
// rec->term_frequency_vec[i] * idfvec[i]. Sums these as the cluster score.
// If rank_cluster is set, divides the sum by the count, getting avg score.
// Then calls normalize_scores.
// Called from session.c:1319 show_range_start().
void relevance_prepare_read(struct relevance *rel, struct reclist *reclist)


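The scoring step described for relevance_prepare_read can be sketched for a
single cluster like this. It follows the "basically" formula from the comment
above and nothing more: the real code may scale or smooth the idf differently,
the function and parameter names are invented, and record_count stands for
whatever "the count" is in the real code (assumed here to be the number of
records in the cluster).

```c
#include <assert.h>

/* Hypothetical helper, not from relevance.c.
 * doc_frequency_vec[0] is the total number of clusters,
 * doc_frequency_vec[i] the number of clusters containing term i,
 * term_frequency_vec[i] the frequency of term i in this cluster.
 * vec_len is assumed to include slot 0. */
static double cluster_score_sketch(const int *doc_frequency_vec,
                                   const int *term_frequency_vec,
                                   int vec_len, int record_count,
                                   int rank_cluster)
{
    double sum = 0.0;
    int i;

    assert(doc_frequency_vec && term_frequency_vec);
    for (i = 1; i < vec_len; i++) {
        if (doc_frequency_vec[i] > 0) {
            /* idf: rare terms get a larger factor */
            double idf = (double) doc_frequency_vec[0]
                       / (double) doc_frequency_vec[i];
            sum += term_frequency_vec[i] * idf;
        }
    }
    /* with rank_cluster set, use the average instead of the sum */
    if (rank_cluster && record_count > 0)
        sum /= (double) record_count;
    return sum;
}
```

For example, with 4 clusters in total, a term found in 2 of them gets idf 2 and
a term found in only 1 gets idf 4, so a single hit on the rare term counts as
much as two hits on the common one.
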
TODO - Read through ingest_to_cluster, and summarize how the ranking actually
works. That's a long routine, 400 lines. A quick read didn't show all that much.

So, basically we have
  - relevance->entries
     - Set up in pull_terms, updated in word_entry_match
  - relevance->doc_frequency_vec
     - Set up with zeroes in relevance_create_ccl
     - Updated in relevance_donerecord, based on the cluster->term_frequency_vec
  - cluster->term_frequency_vec
     - Set up and zeroed in relevance_newrec
     - Updated in relevance_mergerec

* * *