- <para>
- The default <literal>rank-1</literal> ranking module implements a
- TF/IDF (Term Frequecy over Inverse Document Frequency) like
- algorithm. In contrast to the usual defintion of TF/IDF
- algorithms, which only considers searching in one full-text
- index, this one works on multiple indexes at the same time.
- More precisely,
- &zebra; does boolean queries and searches in specific addressed
- indexes (there are inverted indexes pointing from terms in the
- dictionary to documents and term positions inside documents).
- It works like this:
- <variablelist>
- <varlistentry>
- <term>Query Components</term>
- <listitem>
- <para>
- First, the boolean query is dismantled into it's principal components,
- i.e. atomic queries where one term is looked up in one index.
- For example, the query
- <screen>
- @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
- </screen>
- is a boolean AND between the atomic parts
- <screen>
- @attr 2=102 @attr 1=1010 Utah
- </screen>
- and
- <screen>
- @attr 2=102 @attr 1=1018 Springer
- </screen>
- which gets processed each for itself.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Atomic hit lists</term>
- <listitem>
- <para>
- Second, for each atomic query, the hit list of documents is
- computed.
- </para>
- <para>
- In this example, two hit lists for each index
- <literal>@attr 1=1010</literal> and
- <literal>@attr 1=1018</literal> are computed.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Atomic scores</term>
- <listitem>
- <para>
- Third, each document in the hit list is assigned a score (_if_ ranking
- is enabled and requested in the query) using a TF/IDF scheme.
- </para>
- <para>
- In this example, both atomic parts of the query assign the magic
- <literal>@attr 2=102</literal> relevance attribute, and are
- to be used in the relevance ranking functions.
- </para>
- <para>
- It is possible to apply dynamic ranking on only parts of the
- PQF query:
- <screen>
- @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
- </screen>
- searches for all documents which have the term 'Utah' on the
- body of text, and which have the term 'Springer' in the publisher
- field, and sort them in the order of the relevance ranking made on
- the body-of-text index only.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Hit list merging</term>
- <listitem>
- <para>
- Fourth, the atomic hit lists are merged according to the boolean
- conditions to a final hit list of documents to be returned.
- </para>
- <para>
- This step is always performed, independently of the fact that
- dynamic ranking is enabled or not.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Document score computation</term>
- <listitem>
- <para>
- Fifth, the total score of a document is computed as a linear
- combination of the atomic scores of the atomic hit lists
- </para>
- <para>
- Ranking weights may be used to pass a value to a ranking
- algorithm, using the non-standard BIB-1 attribute type 9.
- This allows one branch of a query to use one value while
- another branch uses a different one. For example, we can search
- for <literal>utah</literal> in the
- <literal>@attr 1=4</literal> index with weight 30, as
- well as in the <literal>@attr 1=1010</literal> index with weight 20:
- <screen>
- @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
- </screen>
- </para>
- <para>
- The default weight is
- sqrt(1000) ~ 34 , as the Z39.50 standard prescribes that the top score
- is 1000 and the bottom score is 0, encoded in integers.
- </para>
- <warning>
- <para>
- The ranking-weight feature is experimental. It may change in future
- releases of zebra.
- </para>
- </warning>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Re-sorting of hit list</term>
- <listitem>
- <para>
- Finally, the final hit list is re-ordered according to scores.
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
-
-
-<!--
-Still need to describe the exact TF/IDF formula. Here's the info, need -->
-<!--to extract it in human readable form .. MC
-
-static int calc (void *set_handle, zint sysno, zint staticrank,
- int *stop_flag)
-{
- int i, lo, divisor, score = 0;
- struct rank_set_info *si = (struct rank_set_info *) set_handle;
-
- if (!si->no_rank_entries)
- return -1; /* ranking not enabled for any terms */
-
- for (i = 0; i < si->no_entries; i++)
- {
- yaz_log(log_level, "calc: i=%d rank_flag=%d lo=%d",
- i, si->entries[i].rank_flag, si->entries[i].local_occur);
- if (si->entries[i].rank_flag && (lo = si->entries[i].local_occur))
- score += (8+log2_int (lo)) * si->entries[i].global_inv *
- si->entries[i].rank_weight;
- }
- divisor = si->no_rank_entries * (8+log2_int (si->last_pos/si->no_entries));
- score = score / divisor;
- yaz_log(log_level, "calc sysno=" ZINT_FORMAT " score=%d", sysno, score);
- if (score > 1000)
- score = 1000;
- /* reset the counts for the next term */
- for (i = 0; i < si->no_entries; i++)
- si->entries[i].local_occur = 0;
- return score;
-}
-
-
-where lo = si->entries[i].local_occur is the local documents term-within-index frequency, si->entries[i].global_inv represents the IDF part (computed in static void *begin()), and
-si->entries[i].rank_weight is the weight assigner per index (default 34, or set in the @attr 9=xyz magic)
-
-Finally, the IDF part is computed as:
-
-static void *begin (struct zebra_register *reg,
- void *class_handle, RSET rset, NMEM nmem,
- TERMID *terms, int numterms)
-{
- struct rank_set_info *si =
- (struct rank_set_info *) nmem_malloc (nmem,sizeof(*si));
- int i;
-
- yaz_log(log_level, "rank-1 begin");
- si->no_entries = numterms;
- si->no_rank_entries = 0;
- si->nmem=nmem;
- si->entries = (struct rank_term_info *)
- nmem_malloc (si->nmem, sizeof(*si->entries)*numterms);
- for (i = 0; i < numterms; i++)
- {
- zint g = rset_count(terms[i]->rset);
- yaz_log(log_level, "i=%d flags=%s '%s'", i,
- terms[i]->flags, terms[i]->name );
- if (!strncmp (terms[i]->flags, "rank,", 5))
- {
- const char *cp = strstr(terms[i]->flags+4, ",w=");
- si->entries[i].rank_flag = 1;
- if (cp)
- si->entries[i].rank_weight = atoi (cp+3);
- else
- si->entries[i].rank_weight = 34; /* sqrroot of 1000 */
- yaz_log(log_level, " i=%d weight=%d g="ZINT_FORMAT, i,
- si->entries[i].rank_weight, g);
- (si->no_rank_entries)++;
- }
- else
- si->entries[i].rank_flag = 0;
- si->entries[i].local_occur = 0; /* FIXME */
- si->entries[i].global_occur = g;
- si->entries[i].global_inv = 32 - log2_int (g);
- yaz_log(log_level, " global_inv = %d g = " ZINT_FORMAT,
- (int) (32-log2_int (g)), g);
- si->entries[i].term = terms[i];
- si->entries[i].term_index=i;
- terms[i]->rankpriv = &(si->entries[i]);
- }
- return si;
-}
-
-
-where g = rset_count(terms[i]->rset) is the count of all documents in this specific index hit list, and the IDF part then is
-
- si->entries[i].global_inv = 32 - log2_int (g);
- -->
+ <para>
+ The &zebra; server supports <emphasis>updating</emphasis> of the index
+ structures. That is, you can add, modify, or remove records from
+ databases managed by &zebra; without rebuilding the entire index.
+ Since this process involves modifying structured files with
+ cross-references between blocks of data, the update process is
+ inherently sensitive to system crashes and process interruptions:
+ anything but a successfully completed update will leave the
+ register files in an unknown state, and you will essentially have
+ no recourse but to re-index everything or to restore the register
+ files from a backup medium.
+ Further, while the update process is active, users cannot be
+ allowed to access the system, as the contents of the register files
+ may change unpredictably.
+ </para>
+
+ <para>
+ You can solve these problems by enabling the shadow register system in
+ &zebra;.
+ During the updating procedure, <literal>zebraidx</literal> will temporarily
+ write changes to the involved files in a set of "shadow
+ files", without modifying the files that are accessed by the
+ active server processes. If the update procedure is interrupted by a
+ system crash or a signal, you simply repeat the procedure - the
+ register files have not been changed or damaged, and the partially
+ written shadow files are automatically deleted before the new updating
+ procedure commences.
+ </para>
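+
+ <para>
+ For example, if an update run is interrupted, you simply issue
+ the same command again (the record path shown here is
+ illustrative):
+ <screen>
+ $ zebraidx update /d1/records     (interrupted by a crash)
+ $ zebraidx update /d1/records     (just run it again)
+ </screen>
+ </para>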
+
+ <para>
+ At the end of the updating procedure (or in a separate operation, if
+ you so desire), the system enters a "commit mode". First,
+ any active server processes are forced to access those blocks that
+ have been changed from the shadow files rather than from the main
+ register files; the unmodified blocks are still accessed at their
+ normal location (the shadow files are not a complete copy of the
+ register files - they only contain those parts that have actually been
+ modified). If the commit process is interrupted at any point, the
+ server processes will continue to access the shadow files until you
+ can repeat the commit procedure and complete the writing of data to
+ the main register files. You can perform
+ multiple update operations to the registers before you commit the
+ changes to the system files, or you can execute the commit operation
+ at the end of each update operation. When the commit phase has
+ completed successfully, any running server processes are instructed to
+ switch their operations to the new, operational register, and the
+ temporary shadow files are deleted.
+ </para>
+
+ </sect2>
+
+ <sect2 id="shadow-registers-how-to-use">
+ <title>How to Use Shadow Register Files</title>
+
+ <para>
+ The first step is to allocate space on your system for the shadow
+ files.
+ You do this by adding a <literal>shadow</literal> entry to the
+ <literal>zebra.cfg</literal> file.
+ The syntax of the <literal>shadow</literal> entry is exactly the
+ same as for the <literal>register</literal> entry
+ (see <xref linkend="register-location"/>).
+ The location of the shadow area should be
+ <emphasis>different</emphasis> from the location of the main register
+ area (if you have specified one - remember that if you provide no
+ <literal>register</literal> setting, the default register area is the
+ working directory of the server and indexing processes).
+ </para>
+
+ <para>
+ The following excerpt from a <literal>zebra.cfg</literal> file shows
+ one example of a setup that configures both the main register
+ location and the shadow file area.
+ Note that two directories or partitions have been set aside
+ for the shadow file area. You can specify any number of directories
+ for each of the file areas, but remember that there should be no
+ overlaps between the directories used for the main registers and the
+ shadow files, respectively.
+ </para>
+ <para>
+
+ <screen>
+ register: /d1:500M
+ shadow: /scratch1:100M /scratch2:200M
+ </screen>
+
+ </para>
+
+ <para>
+ When shadow files are enabled, an extra command is available at the
+ <literal>zebraidx</literal> command line.
+ In order to make changes to the system take effect for the
+ users, you'll have to submit a "commit" command after an
+ update operation (or a sequence of update operations).
+ </para>
+
+ <para>
+
+ <screen>
+ $ zebraidx update /d1/records
+ $ zebraidx commit
+ </screen>
+
+ </para>
+
+ <para>
+ Or you can execute multiple updates before committing the changes:
+ </para>
+
+ <para>
+
+ <screen>
+ $ zebraidx -g books update /d1/records /d2/more-records
+ $ zebraidx -g fun update /d3/fun-records
+ $ zebraidx commit
+ </screen>
+
+ </para>
+
+ <para>
+ If one of the update operations above had been interrupted, the commit
+ operation on the last line would fail: <literal>zebraidx</literal>
+ will not let you commit changes that would destroy the running register.
+ You'll have to rerun all of the update operations since your last
+ commit operation, before you can commit the new changes.
+ </para>
+
+ <para>
+ Similarly, if the commit operation fails, <literal>zebraidx</literal>
+ will not let you start a new update operation before you have
+ successfully repeated the commit operation.
+ The server processes will keep accessing the shadow files rather
+ than the (possibly damaged) blocks of the main register files
+ until the commit operation has successfully completed.
+ </para>
+
+ <para>
+ You should be aware that update operations may take slightly longer
+ when the shadow register system is enabled, since more file access
+ operations are involved. Further, while the disk space required for
+ the shadow register data is modest for a small update operation, you
+ may prefer to disable the system if you are adding a very large number
+ of records to an already very large database (we use the terms
+ <emphasis>large</emphasis> and <emphasis>modest</emphasis>
+ very loosely here, since every application will have a
+ different perception of size).
+ To update the system without the use of the shadow files,
+ simply run <literal>zebraidx</literal> with the <literal>-n</literal>
+ option (note that you do not have to execute the
+ <emphasis>commit</emphasis> command of <literal>zebraidx</literal>
+ when you temporarily disable the use of the shadow registers in
+ this fashion).
+ Note also that, just as when the shadow registers are not enabled,
+ server processes will be barred from accessing the main register
+ while the update procedure takes place.
+ </para>
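+
+ <para>
+ For example, to bypass the shadow registers for a single large
+ load (the record path is illustrative):
+ <screen>
+ $ zebraidx -n update /d1/many-records
+ </screen>
+ No <literal>commit</literal> operation is needed afterwards.
+ </para>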
+
+ </sect2>
+
+ </sect1>
+
+
+ <sect1 id="administration-ranking">
+ <title>Relevance Ranking and Sorting of Result Sets</title>
+
+ <sect2 id="administration-overview">
+ <title>Overview</title>
+ <para>
+ The default ordering of a result set is left up to the server,
+ which inside &zebra; means sorting in ascending document ID order.
+ This is rarely the order in which a human wants to browse a
+ sometimes quite large hit set. Ranking and sorting come to the
+ rescue.
+ </para>
+
+ <para>
+ In cases where a good presentation ordering can be computed at
+ indexing time, we can use a fixed <literal>static ranking</literal>
+ scheme, which is provided for the <literal>alvis</literal>
+ indexing filter. This defines a fixed ordering of hit lists,
+ independently of the query issued.
+ </para>
+
+ <para>
+ There are cases, however, where the relevance of the documents in
+ a hit set depends strongly on the query being processed.
+ Simply put, <literal>dynamic relevance ranking</literal>
+ sorts a set of retrieved records such that those most likely to be
+ relevant to your request are retrieved first.
+ Internally, &zebra; retrieves all documents that satisfy your
+ query, and re-orders the hit list to arrange them based on
+ a measurement of similarity between your query and the content of
+ each record.
+ </para>
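+
+ <para>
+ In PQF queries, dynamic ranking is requested with the relevance
+ attribute <literal>@attr 2=102</literal>; for example (the index
+ and search term here are illustrative):
+ <screen>
+ @attr 2=102 @attr 1=4 utah
+ </screen>
+ </para>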
+
+ <para>
+ Finally, there are situations where hit sets of documents should be
+ <literal>sorted</literal> during query time according to the
+ lexicographical ordering of certain sort indexes created at
+ indexing time.
+ </para>
+ </sect2>