doc/administration.xml

   1 <chapter id="administration">
   2  <title>Administrating &zebra;</title>
   3  <!-- ### It's a bit daft that this chapter (which describes half of
   4           the configuration-file formats) is separated from
   5           "recordmodel-grs.xml" (which describes the other half) by the
   6           instructions on running zebraidx and zebrasrv.  Some careful
   7           re-ordering is required here.
   8  -->
   9
  10  <para>
  11   Unlike many simpler retrieval systems, &zebra; supports safe, incremental
  12   updates to an existing index.
  13  </para>
  14
  15  <para>
  16   Normally, when &zebra; modifies the index it reads a number of records
  17   that you specify.
  18   Depending on your specifications and on the contents of each record
   19   one of the following events takes place for each record:
  20   <variablelist>
  21
  22    <varlistentry>
  23     <term>Insert</term>
  24     <listitem>
  25      <para>
   26       The record is indexed as if it had never been indexed before.
  27       Either the &zebra; system doesn't know how to identify the record or
  28       &zebra; can identify the record but didn't find it to be already indexed.
  29      </para>
  30     </listitem>
  31    </varlistentry>
  32    <varlistentry>
  33     <term>Modify</term>
  34     <listitem>
  35      <para>
  36       The record has already been indexed.
  37       In this case either the contents of the record or the location
  38       (file) of the record indicates that it has been indexed before.
  39      </para>
  40     </listitem>
  41    </varlistentry>
  42    <varlistentry>
  43     <term>Delete</term>
  44     <listitem>
  45      <para>
   46       The record is deleted from the index. As in the modify case,
   47       &zebra; must be able to identify the record.
  48      </para>
  49     </listitem>
  50    </varlistentry>
  51   </variablelist>
  52  </para>
  53
  54  <para>
   55   Please note that in both the modify and delete cases the &zebra;
  56   indexer must be able to generate a unique key that identifies the record
  57   in question (more on this below).
  58  </para>
  59
  60  <para>
  61   To administrate the &zebra; retrieval system, you run the
  62   <literal>zebraidx</literal> program.
   63   This program supports a number of options, which are preceded by a dash,
   64   and a few commands (not preceded by a dash).
  65 </para>
  66
  67  <para>
  68   Both the &zebra; administrative tool and the &acro.z3950; server share a
  69   set of index files and a global configuration file.
  70   The name of the configuration file defaults to
  71   <literal>zebra.cfg</literal>.
  72   The configuration file includes specifications on how to index
  73   various kinds of records and where the other configuration files
  74   are located. <literal>zebrasrv</literal> and <literal>zebraidx</literal>
  75   <emphasis>must</emphasis> be run in the directory where the
  76   configuration file lives unless you indicate the location of the
   77   configuration file with the <literal>-c</literal> option.
  78  </para>
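
       <para>
        As a minimal sketch, assuming a hypothetical configuration file
        <filename>/etc/zebra/zebra.cfg</filename> and a hypothetical
        directory <filename>records</filename> (both names are chosen only
        for illustration), an update started from outside the configuration
        directory might look like this:
        <screen>
         $ zebraidx -c /etc/zebra/zebra.cfg update records
        </screen>
        The <literal>-c</literal> option and the <literal>update</literal>
        command are the ones described in this chapter; everything else in
        the command line is an assumption.
       </para>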
  79
  80  <sect1 id="record-types">
  81   <title>Record Types</title>
  82
  83   <para>
   84    Indexing is a per-record process, in which an insert, modify, or delete
   85    will occur. Before a record is indexed, search keys are extracted from
   86    whatever the layout of the original record might be (SGML, HTML, text, etc.).
  87    The &zebra; system currently supports two fundamental types of records:
  88    structured and simple text.
  89    To specify a particular extraction process, use either the
  90    command line option <literal>-t</literal> or specify a
  91    <literal>recordType</literal> setting in the configuration file.
  92   </para>
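
        <para>
         For illustration only, assuming a hypothetical directory
         <filename>records</filename> that holds &acro.sgml;-like structured
         records, the extraction process could be selected on the command
         line:
         <screen>
          $ zebraidx -t grs.sgml update records
         </screen>
         or, equivalently, in the configuration file:
         <screen>
          recordType: grs.sgml
         </screen>
        </para>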
  93
  94  </sect1>
  95
  96  <sect1 id="zebra-cfg">
  97   <title>The &zebra; Configuration File</title>
  98
  99   <para>
  100    The &zebra; configuration file, read by <literal>zebraidx</literal> and
  101    <literal>zebrasrv</literal>, defaults to <literal>zebra.cfg</literal>
  102    unless another file is specified with the <literal>-c</literal> option.
 103   </para>
 104
 105   <para>
 106    You can edit the configuration file with a normal text editor.
  107    Parameter names and values are separated by colons in the file. Lines
 108    starting with a hash sign (<literal>#</literal>) are
 109    treated as comments.
 110   </para>
 111
 112   <para>
 113    If you manage different sets of records that share common
 114    characteristics, you can organize the configuration settings for each
 115    type into "groups".
 116    When <literal>zebraidx</literal> is run and you wish to address a
 117    given group you specify the group name with the <literal>-g</literal>
 118    option.
 119    In this case settings that have the group name as their prefix
 120    will be used by <literal>zebraidx</literal>.
 121    If no <literal>-g</literal> option is specified, the settings
 122    without prefix are used.
 123   </para>
 124
 125   <para>
 126    In the configuration file, the group name is placed before the option
 127    name itself, separated by a dot (.). For instance, to set the record type
 128    for group <literal>public</literal> to <literal>grs.sgml</literal>
 129    (the &acro.sgml;-like format for structured records) you would write:
 130   </para>
 131
 132   <para>
 133    <screen>
 134     public.recordType: grs.sgml
 135    </screen>
 136   </para>
 137
 138   <para>
 139    To set the default value of the record type to <literal>text</literal>
 140    write:
 141   </para>
 142
 143   <para>
 144    <screen>
 145     recordType: text
 146    </screen>
 147   </para>
 148
 149   <para>
 150    The available configuration settings are summarized below. They will be
 151    explained further in the following sections.
 152   </para>
 153
 154   <!--
 155    FIXME - Didn't Adam make something to have multiple databases in multiple dirs...
 156   -->
 157
 158   <para>
 159    <variablelist>
 160
 161     <varlistentry>
 162      <term>
 163       <emphasis>group</emphasis>
 164       .recordType[<emphasis>.name</emphasis>]:
 165       <replaceable>type</replaceable>
 166      </term>
 167      <listitem>
 168       <para>
 169        Specifies how records with the file extension
 170        <emphasis>name</emphasis> should be handled by the indexer.
 171        This option may also be specified as a command line option
 172        (<literal>-t</literal>). Note that if you do not specify a
 173        <emphasis>name</emphasis>, the setting applies to all files.
 174        In general, the record type specifier consists of the elements (each
 175        element separated by dot), <emphasis>fundamental-type</emphasis>,
 176        <emphasis>file-read-type</emphasis> and arguments. Currently, two
 177        fundamental types exist, <literal>text</literal> and
 178        <literal>grs</literal>.
 179       </para>
 180      </listitem>
 181     </varlistentry>
 182     <varlistentry>
 183      <term><emphasis>group</emphasis>.recordId:
 184      <replaceable>record-id-spec</replaceable></term>
 185      <listitem>
 186       <para>
 187        Specifies how the records are to be identified when updated. See
 188        <xref linkend="locating-records"/>.
 189       </para>
 190      </listitem>
 191     </varlistentry>
 192     <varlistentry>
 193      <term><emphasis>group</emphasis>.database:
 194      <replaceable>database</replaceable></term>
 195      <listitem>
 196       <para>
 197        Specifies the &acro.z3950; database name.
 198        <!-- FIXME - now we can have multiple databases in one server. -H -->
 199       </para>
 200      </listitem>
 201     </varlistentry>
 202     <varlistentry>
 203      <term><emphasis>group</emphasis>.storeKeys:
 204      <replaceable>boolean</replaceable></term>
 205      <listitem>
 206       <para>
 207        Specifies whether key information should be saved for a given
  208        group of records. If you plan to update or delete records of this
  209        type later, this should be specified as 1; otherwise it
 210        should be 0 (default), to save register space.
 211        <!-- ### this is the first mention of "register" -->
 212        See <xref linkend="file-ids"/>.
 213       </para>
 214      </listitem>
 215     </varlistentry>
 216     <varlistentry>
 217      <term><emphasis>group</emphasis>.storeData:
 218       <replaceable>boolean</replaceable></term>
 219      <listitem>
 220       <para>
 221        Specifies whether the records should be stored internally
 222        in the &zebra; system files.
 223        If you want to maintain the raw records yourself,
 224        this option should be false (0).
 225        If you want &zebra; to take care of the records for you, it
  226        should be true (1).
 227       </para>
 228      </listitem>
 229     </varlistentry>
 230     <varlistentry>
 231      <!-- ### probably a better place to define "register" -->
 232      <term>register: <replaceable>register-location</replaceable></term>
 233      <listitem>
 234       <para>
 235        Specifies the location of the various register files that &zebra; uses
 236        to represent your databases.
 237        See <xref linkend="register-location"/>.
 238       </para>
 239      </listitem>
 240     </varlistentry>
 241     <varlistentry>
 242      <term>shadow: <replaceable>register-location</replaceable></term>
 243      <listitem>
 244       <para>
 245        Enables the <emphasis>safe update</emphasis> facility of &zebra;, and
 246        tells the system where to place the required, temporary files.
 247        See <xref linkend="shadow-registers"/>.
 248       </para>
 249      </listitem>
 250     </varlistentry>
 251     <varlistentry>
 252      <term>lockDir: <replaceable>directory</replaceable></term>
 253      <listitem>
 254       <para>
 255        Directory in which various lock files are stored.
 256       </para>
 257      </listitem>
 258     </varlistentry>
 259     <varlistentry>
 260      <term>keyTmpDir: <replaceable>directory</replaceable></term>
 261      <listitem>
 262       <para>
 263        Directory in which temporary files used during zebraidx's update
 264        phase are stored.
 265       </para>
 266      </listitem>
 267     </varlistentry>
 268     <varlistentry>
 269      <term>setTmpDir: <replaceable>directory</replaceable></term>
 270      <listitem>
 271       <para>
 272        Specifies the directory that the server uses for temporary result sets.
 273        If not specified <literal>/tmp</literal> will be used.
 274       </para>
 275      </listitem>
 276     </varlistentry>
 277     <varlistentry>
 278      <term>profilePath: <replaceable>path</replaceable></term>
 279      <listitem>
 280       <para>
 281        Specifies a path of profile specification files.
  282        The path is composed of one or more directories separated by
  283        colons, similar to <literal>PATH</literal> on UNIX systems.
 284       </para>
 285      </listitem>
 286     </varlistentry>
 287
 288      <varlistentry>
 289       <term>modulePath: <replaceable>path</replaceable></term>
 290       <listitem>
 291        <para>
 292         Specifies a path of record filter modules.
  293         The path is composed of one or more directories separated by
  294         colons, similar to <literal>PATH</literal> on UNIX systems.
 295         The 'make install' procedure typically puts modules in
 296         <filename>/usr/local/lib/idzebra-2.0/modules</filename>.
 297        </para>
 298       </listitem>
 299      </varlistentry>
 300
 301      <varlistentry>
 302       <term>index: <replaceable>filename</replaceable></term>
 303       <listitem>
 304        <para>
  305         Defines the file that holds field structure
 306         definitions. If omitted, the file <filename>default.idx</filename>
 307         is read.
 308         Refer to <xref linkend="default-idx-file"/> for
 309         more information.
 310        </para>
 311       </listitem>
 312      </varlistentry>
 313
 314      <varlistentry>
 315       <term>staticrank: <replaceable>integer</replaceable></term>
 316       <listitem>
 317        <para>
  318         Specifies whether static ranking is enabled (1) or
  319         disabled (0). If omitted, it is disabled, corresponding
  320         to a value of 0.
  321         Refer to <xref linkend="administration-ranking-static"/>.
 322        </para>
 323       </listitem>
 324      </varlistentry>
 325
 326
 327      <varlistentry>
  328       <term>estimatehits: <replaceable>integer</replaceable></term>
 329       <listitem>
 330        <para>
  331         Controls whether &zebra; should calculate approximate hit counts and
  332         at which hit count it is to be enabled.
  333         A value of 0 disables approximate hit counts.
  334         For a positive value, the approximate hit count is enabled
  335         if the hit count is known to be larger than <replaceable>integer</replaceable>.
 336        </para>
 337        <para>
 338         Approximate hit counts can also be triggered by a particular
 339         attribute in a query.
 340         Refer to <xref linkend="querymodel-zebra-global-attr-limit"/>.
 341        </para>
 342       </listitem>
 343      </varlistentry>
 344
 345     <varlistentry>
 346      <term>attset: <replaceable>filename</replaceable></term>
 347      <listitem>
 348       <para>
 349         Specifies the filename(s) of attribute set files for use in
 350         searching. In many configurations <filename>bib1.att</filename>
 351         is used, but that is not required. If Classic Explain
  352         attributes are to be used for searching,
  353         <filename>explain.att</filename> must be given.
  354         The path to att-files in general can be given using
  355         the <literal>profilePath</literal> setting.
 356         See also <xref linkend="attset-files"/>.
 357       </para>
 358      </listitem>
 359     </varlistentry>
 360     <varlistentry>
 361      <term>memMax: <replaceable>size</replaceable></term>
 362      <listitem>
 363       <para>
 364        Specifies <replaceable>size</replaceable> of internal memory
 365        to use for the zebraidx program.
 366        The amount is given in megabytes - default is 4 (4 MB).
 367        The more memory, the faster large updates happen, up to about
 368        half the free memory available on the computer.
 369       </para>
 370      </listitem>
 371     </varlistentry>
 372     <varlistentry>
 373      <term>tempfiles: <replaceable>Yes/Auto/No</replaceable></term>
 374      <listitem>
 375       <para>
  376        Tells &zebra; if it should use temporary files when indexing. The
  377        default is Auto, in which case &zebra; uses temporary files only
  378        if it would need more than <replaceable>memMax</replaceable>
  379        megabytes of memory. This should be good for most uses.
 380       </para>
 381      </listitem>
 382     </varlistentry>
 383
 384     <varlistentry>
 385      <term>root: <replaceable>dir</replaceable></term>
 386      <listitem>
 387       <para>
 388        Specifies a directory base for &zebra;. All relative paths
 389        given (in profilePath, register, shadow) are based on this
 390        directory. This setting is useful if your &zebra; server
 391        is running in a different directory from where
 392        <literal>zebra.cfg</literal> is located.
 393       </para>
 394      </listitem>
 395     </varlistentry>
 396
 397     <varlistentry>
 398      <term>passwd: <replaceable>file</replaceable></term>
 399      <listitem>
 400       <para>
  401        Specifies a file with description of user accounts for &zebra;.
  402        The format is similar to that of Apache's htpasswd files
  403        and UNIX passwd files. Non-empty lines not beginning with
  404        # are considered account lines. There is one account per line.
  405        A line consists of fields separated by a single colon character.
  406        The first field is the username, the second is the password.
 407       </para>
 408      </listitem>
 409     </varlistentry>
 410
 411     <varlistentry>
 412      <term>passwd.c: <replaceable>file</replaceable></term>
 413      <listitem>
 414       <para>
 415        Specifies a file with description of user accounts for &zebra;.
  416        The file format is similar to that used by the passwd directive except
  417        that the passwords are encrypted. Use Apache's htpasswd or similar
 418        for maintenance.
 419       </para>
 420      </listitem>
 421     </varlistentry>
 422
 423     <varlistentry>
 424      <term>perm.<replaceable>user</replaceable>:
 425      <replaceable>permstring</replaceable></term>
 426      <listitem>
 427       <para>
  428        Specifies permissions (privileges) for a user who is allowed
  429        to access &zebra; via the passwd system. There are two kinds
  430        of permissions currently: read (r) and write (w). By default
  431        users not listed in a permission directive are given the read
  432        privilege. To specify permissions for a user with no
  433        username (&acro.z3950; anonymous style), use
  434        <literal>anonymous</literal>. The permstring consists of
 435        a sequence of characters. Include character <literal>w</literal>
 436        for write/update access, <literal>r</literal> for read access and
 437        <literal>a</literal> to allow anonymous access through this account.
 438       </para>
 439      </listitem>
 440     </varlistentry>
 441
 442     <varlistentry>
 443       <term>dbaccess <replaceable>accessfile</replaceable></term>
 444       <listitem>
 445         <para>
 446           Names a file which lists database subscriptions for individual users.
  447           The access file should consist of lines of the form <literal>username:
  448           dbnames</literal>, where dbnames is a list of database names, separated by
  449           '+'. No whitespace is allowed in the database list.
 450         </para>
 451       </listitem>
 452     </varlistentry>
 453
 454    </variablelist>
 455   </para>
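
        <para>
         To tie the settings above together, here is a sketch of a small
         <literal>zebra.cfg</literal>. All paths, sizes, the group name
         <literal>public</literal> and the user name <literal>admin</literal>
         are assumptions made for illustration; only the setting names and
         value formats come from the list above.
         <screen>
          # Global settings (no group prefix)
          profilePath: /usr/local/idzebra/tab
          attset: bib1.att
          register: /var/lib/zebra/register:10G
          lockDir: /var/lib/zebra/lock
          keyTmpDir: /var/lib/zebra/tmp
          setTmpDir: /var/lib/zebra/sets
          memMax: 64

          # Settings for the hypothetical group "public"
          public.recordType: grs.sgml
          public.database: textbase
          public.storeData: 1
          public.storeKeys: 1

          # Access control: accounts live in a separate file
          passwd: zebra.passwd
          perm.anonymous: r
          perm.admin: rw
         </screen>
         Under the same assumptions, the account file
         <filename>zebra.passwd</filename> would hold one
         colon-separated account per line:
         <screen>
          # username:password
          admin:secret
         </screen>
         and a database access file, as named by the
         <literal>dbaccess</literal> setting, would list the databases each
         user may use, joined by <literal>+</literal> without whitespace:
         <screen>
          admin: textbase+gils
         </screen>
        </para>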
 456
 457  </sect1>
 458
 459  <sect1 id="locating-records">
 460   <title>Locating Records</title>
 461
 462   <para>
 463    The default behavior of the &zebra; system is to reference the
 464    records from their original location, i.e. where they were found when you
  465    ran <literal>zebraidx</literal>.
  466    That is, when a client wishes to retrieve a record
  467    following a search operation, the files are accessed from the place
  468    where you originally put them. If you remove the files (without
  469    running <literal>zebraidx</literal> again), the server will return
 470    diagnostic number 14 (``System error in presenting records'') to
 471    the client.
 472   </para>
 473
 474   <para>
 475    If your input files are not permanent - for example if you retrieve
 476    your records from an outside source, or if they were temporarily
 477    mounted on a CD-ROM drive,
 478    you may want &zebra; to make an internal copy of them. To do this,
 479    you specify 1 (true) in the <literal>storeData</literal> setting. When
 480    the &acro.z3950; server retrieves the records they will be read from the
 481    internal file structures of the system.
 482   </para>
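
        <para>
         For example, assuming a hypothetical group named
         <literal>cdrom</literal> whose source files will not stay
         available, the internal copy could be enabled for that group only:
         <screen>
          cdrom.storeData: 1
         </screen>
         Leaving out the group prefix would apply the setting to records
         indexed without the <literal>-g</literal> option.
        </para>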
 483
 484  </sect1>
 485
 486  <sect1 id="simple-indexing">
 487   <title>Indexing with no Record IDs (Simple Indexing)</title>
 488
 489   <para>
  490    If you have a set of records that are not expected to change over time,
  491    you can build your database without record IDs.
 492    This indexing method uses less space than the other methods and
 493    is simple to use.
 494   </para>
 495
 496   <para>
 497    To use this method, you simply omit the <literal>recordId</literal> entry
 498    for the group of files that you index. To add a set of records you use
 499    <literal>zebraidx</literal> with the <literal>update</literal> command. The
 500    <literal>update</literal> command will always add all of the records that it
 501    encounters to the index - whether they have already been indexed or
  502    not. If the set of indexed files changes, you should delete all of the
 503    index files, and build a new index from scratch.
 504   </para>
 505
 506   <para>
 507    Consider a system in which you have a group of text files called
 508    <literal>simple</literal>.
 509    That group of records should belong to a &acro.z3950; database called
 510    <literal>textbase</literal>.
 511    The following <literal>zebra.cfg</literal> file will suffice:
 512   </para>
 513   <para>
 514
 515    <screen>
 516     profilePath: /usr/local/idzebra/tab
 517     attset: bib1.att
 518     simple.recordType: text
 519     simple.database: textbase
 520    </screen>
 521
 522   </para>
 523
 524   <para>
  525    Since the existing records in an index cannot be addressed by their
 526    IDs, it is impossible to delete or modify records when using this method.
 527   </para>
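
        <para>
         With the configuration above in place, the records could be added
         with a command along these lines. The directory
         <filename>/data/textfiles</filename> is a hypothetical location
         used only for illustration:
         <screen>
          $ zebraidx -g simple update /data/textfiles
         </screen>
         Because no <literal>recordId</literal> is set, running the same
         command a second time would add every record again rather than
         replace the earlier copies.
        </para>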
 528
 529  </sect1>
 530
 531  <sect1 id="file-ids">
 532   <title>Indexing with File Record IDs</title>
 533
 534   <para>
  535    If you have a set of files that regularly change over time (old files
  536    are deleted, new ones are added, or existing files are modified), you
 537    can benefit from using the <emphasis>file ID</emphasis>
 538    indexing methodology.
 539    Examples of this type of database might include an index of WWW
 540    resources, or a USENET news spool area.
 541    Briefly speaking, the file key methodology uses the directory paths
 542    of the individual records as a unique identifier for each record.
 543    To perform indexing of a directory with file keys, again, you specify
 544    the top-level directory after the <literal>update</literal> command.
 545    The command will recursively traverse the directories and compare
  546    each one with whatever has been indexed before in that same directory.
 547    If a file is new (not in the previous version of the directory) it
 548    is inserted into the registers; if a file was already indexed and
 549    it has been modified since the last update, the index is also
 550    modified; if a file has been removed since the last
 551    visit, it is deleted from the index.
 552   </para>
 553
 554   <para>
 555    The resulting system is easy to administrate. To delete a record you
 556    simply have to delete the corresponding file (say, with the
 557    <literal>rm</literal> command). And to add records you create new
 558    files (or directories with files). For your changes to take effect
 559    in the register you must run <literal>zebraidx update</literal> with
 560    the same directory root again. This mode of operation requires more
 561    disk space than simpler indexing methods, but it makes it easier for
 562    you to keep the index in sync with a frequently changing set of data.
 563    If you combine this system with the <emphasis>safe update</emphasis>
 564    facility (see below), you never have to take your server off-line for
 565    maintenance or register updating purposes.
 566   </para>
 567
 568   <para>
 569    To enable indexing with pathname IDs, you must specify
 570    <literal>file</literal> as the value of <literal>recordId</literal>
 571    in the configuration file. In addition, you should set
 572    <literal>storeKeys</literal> to <literal>1</literal>, since the &zebra;
 573    indexer must save additional information about the contents of each record
 574    in order to modify the indexes correctly at a later time.
 575   </para>
 576
 577    <!--
 578     FIXME - There must be a simpler way to do this with Adams string tags -H
 579      -->
 580
 581   <para>
 582    For example, to update records of group <literal>esdd</literal>
 583    located below
 584    <literal>/data1/records/</literal> you should type:
 585    <screen>
 586     $ zebraidx -g esdd update /data1/records
 587    </screen>
 588   </para>
 589
 590   <para>
 591    The corresponding configuration file includes:
 592    <screen>
 593     esdd.recordId: file
 594     esdd.recordType: grs.sgml
 595     esdd.storeKeys: 1
 596    </screen>
 597   </para>
 598
 599   <note>
 600    <para>You cannot start out with a group of records with simple
 601     indexing (no record IDs as in the previous section) and then later
  602     enable file record IDs. &zebra; must know from the first time that you
 603     index the group that
 604     the files should be indexed with file record IDs.
 605    </para>
 606    </note>
 607
 608   <para>
 609    You cannot explicitly delete records when using this method (using the
  610    <literal>delete</literal> command of <literal>zebraidx</literal>). Instead
 611    you have to delete the files from the file system (or move them to a
 612    different location)
 613    and then run <literal>zebraidx</literal> with the
 614    <literal>update</literal> command.
 615   </para>
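
        <para>
         Continuing the <literal>esdd</literal> example, removing a record
         might therefore look like this. The file name
         <filename>obsolete.sgml</filename> is hypothetical:
         <screen>
          $ rm /data1/records/obsolete.sgml
          $ zebraidx -g esdd update /data1/records
         </screen>
         The second command uses the same directory root as the original
         update, so that the indexer can notice which files have
         disappeared.
        </para>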
 616   <!-- ### what happens if a file contains multiple records? -->
 617 </sect1>
 618
 619  <sect1 id="generic-ids">
 620   <title>Indexing with General Record IDs</title>
 621
 622   <para>
 623    When using this method you construct an (almost) arbitrary, internal
 624    record key based on the contents of the record itself and other system
 625    information. If you have a group of records that explicitly associates
 626    an ID with each record, this method is convenient. For example, the
  627    record format may contain a title or an ID number, unique within the group.
 628    In either case you specify the &acro.z3950; attribute set and use-attribute
 629    location in which this information is stored, and the system looks at
 630    that field to determine the identity of the record.
 631   </para>
 632
 633   <para>
 634    As before, the record ID is defined by the <literal>recordId</literal>
 635    setting in the configuration file. The value of the record ID specification
 636    consists of one or more tokens separated by whitespace. The resulting
 637    ID is represented in the index by concatenating the tokens and
  638    separating them by the ASCII character with value 1.
 639   </para>
 640
 641   <para>
 642    There are three kinds of tokens:
 643    <variablelist>
 644
 645     <varlistentry>
 646      <term>Internal record info</term>
 647      <listitem>
 648       <para>
 649        The token refers to a key that is
 650        extracted from the record. The syntax of this token is
 651        <literal>(</literal> <emphasis>set</emphasis> <literal>,</literal>
 652        <emphasis>use</emphasis> <literal>)</literal>,
 653        where <emphasis>set</emphasis> is the
  654        attribute set name and <emphasis>use</emphasis> is the
 655        name or value of the attribute.
 656       </para>
 657      </listitem>
 658     </varlistentry>
 659     <varlistentry>
 660      <term>System variable</term>
 661      <listitem>
 662       <para>
 663        The system variables are preceded by
 664
 665        <screen>
 666         $
 667        </screen>
 668        and immediately followed by the system variable name, which
  669        may be one of
 670        <variablelist>
 671
 672         <varlistentry>
 673          <term>group</term>
 674          <listitem>
 675           <para>
 676            Group name.
 677           </para>
 678          </listitem>
 679         </varlistentry>
 680         <varlistentry>
 681          <term>database</term>
 682          <listitem>
 683           <para>
 684            Current database specified.
 685           </para>
 686          </listitem>
 687         </varlistentry>
 688         <varlistentry>
 689          <term>type</term>
 690          <listitem>
 691           <para>
 692            Record type.
 693           </para>
 694          </listitem>
 695         </varlistentry>
 696        </variablelist>
 697       </para>
 698      </listitem>
 699     </varlistentry>
 700     <varlistentry>
 701      <term>Constant string</term>
 702      <listitem>
 703       <para>
 704        A string used as part of the ID &mdash; surrounded
  705        by single or double quotes.
 706       </para>
 707      </listitem>
 708     </varlistentry>
 709    </variablelist>
 710   </para>
 711
 712   <para>
 713    For instance, the sample GILS records that come with the &zebra;
 714    distribution contain a unique ID in the data tagged Control-Identifier.
 715    The data is mapped to the &acro.bib1; use attribute Identifier-standard
 716    (code 1007). To use this field as a record id, specify
 717    <literal>(bib1,Identifier-standard)</literal> as the value of the
 718    <literal>recordId</literal> in the configuration file.
  719    If you have other record types that use the same field for a
  720    different purpose, you might add the record type
  721    (or group or database name) to the record id of the GILS
 722    records as well, to prevent matches with other types of records.
 723    In this case the recordId might be set like this:
 724
 725    <screen>
 726     gils.recordId: $type (bib1,Identifier-standard)
 727    </screen>
 728
 729   </para>
 730
 731   <para>
 732    (see <xref linkend="grs"/>
 733     for details of how the mapping between elements of your records and
 734     searchable attributes is established).
 735   </para>
 736
 737   <para>
 738    As for the file record ID case described in the previous section,
 739    updating your system is simply a matter of running
 740    <literal>zebraidx</literal>
 741    with the <literal>update</literal> command. However, the update with general
 742    keys is considerably slower than with file record IDs, since all files
 743    visited must be (re)read to discover their IDs.
 744   </para>
 745
 746   <para>
 747    As you might expect, when using the general record IDs
 748    method, you can only add or modify existing records with the
 749    <literal>update</literal> command.
  750    If you wish to delete records, you must use the
 751    <literal>delete</literal> command, with a directory as a parameter.
 752    This will remove all records that match the files below that root
 753    directory.
 754   </para>
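
        <para>
         As a sketch of the whole cycle for the <literal>gils</literal>
         group, assuming that <literal>storeKeys</literal> is needed here
         as it is for any group that will later be updated or deleted, the
         configuration might contain:
         <screen>
          gils.recordId: $type (bib1,Identifier-standard)
          gils.storeKeys: 1
         </screen>
         Records could then be added or modified, and later removed, with
         commands like the following. The directories
         <filename>/data/gils</filename> and
         <filename>/data/gils/withdrawn</filename> are hypothetical:
         <screen>
          $ zebraidx -g gils update /data/gils
          $ zebraidx -g gils delete /data/gils/withdrawn
         </screen>
        </para>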
 755
 756  </sect1>
 757
 758  <sect1 id="register-location">
 759   <title>Register Location</title>
 760
 761   <para>
 762    Normally, the index files that form dictionaries, inverted
 763    files, record info, etc., are stored in the directory where you run
 764    <literal>zebraidx</literal>. If you wish to store these, possibly large,
 765    files somewhere else, you must add the <literal>register</literal>
 766    entry to the <literal>zebra.cfg</literal> file.
 767    Furthermore, the &zebra; system allows its file
 768    structures to span multiple file systems, which is useful for
 769    managing very large databases.
 770   </para>
 771
 772   <para>
 773    The value of the <literal>register</literal> setting is a sequence
 774    of tokens. Each token takes the form:
 775
 776    <screen>
 777     <emphasis>dir</emphasis><literal>:</literal><emphasis>size</emphasis>
 778    </screen>
 779
 780    The <emphasis>dir</emphasis> specifies a directory in which index files
 781    will be stored and the <emphasis>size</emphasis> specifies the maximum
 782    size of all files in that directory. The &zebra; indexer system fills
  783    each directory in the order specified and uses the next specified
 784    directories as needed.
 785    The <emphasis>size</emphasis> is an integer followed by a qualifier
 786    code,
 787    <literal>b</literal> for bytes,
  788    <literal>k</literal> for kilobytes,
  789    <literal>M</literal> for megabytes, or
  790    <literal>G</literal> for gigabytes.
 791    Specifying a negative value disables the checking (it still needs the unit,
 792    use <literal>-1b</literal>).
 793   </para>
 794
 795   <para>
 796    For instance, if you have allocated three disks for your register, and
 797    the first disk is mounted
 798    on <literal>/d1</literal> and has 2 GB of free space, the
 799    second, mounted on <literal>/d2</literal>, has 3.6 GB, and the third,
 800    on which you have more space than you bother to worry about, is mounted on
 801    <literal>/d3</literal>, you could put this entry in your configuration file:
 802
 803    <screen>
 804     register: /d1:2G /d2:3600M /d3:-1b
 805    </screen>
 806   </para>
 807
 808   <para>
 809    Note that &zebra; does not verify that the amount of space specified is
 810    actually available on the directory (file system) specified - it is
 811    your responsibility to ensure that enough space is available, and that
 812    other applications do not attempt to use the free space. In a large
 813    production system, it is recommended that you allocate one or more
 814    file systems exclusively to the &zebra; register files.
 815   </para>
 816
 817  </sect1>
 818
 819  <sect1 id="shadow-registers">
 820   <title>Safe Updating - Using Shadow Registers</title>
 821
 822   <sect2 id="shadow-registers-description">
 823    <title>Description</title>
 824
 825    <para>
 826     The &zebra; server supports <emphasis>updating</emphasis> of the index
 827     structures. That is, you can add, modify, or remove records from
 828     databases managed by &zebra; without rebuilding the entire index.
 829     Since this process involves modifying structured files with various
 830     references between blocks of data in the files, the update process
 831     is inherently sensitive to system crashes, or to process interruptions:
 832     Anything but a successfully completed update process will leave the
 833     register files in an unknown state, and you will essentially have no
 834     recourse but to re-index everything, or to restore the register files
 835     from a backup medium.
 836     Further, while the update process is active, users cannot be
 837     allowed to access the system, as the contents of the register files
 838     may change unpredictably.
 839    </para>
 840
 841    <para>
 842     You can solve these problems by enabling the shadow register system in
 843     &zebra;.
 844     During the updating procedure, <literal>zebraidx</literal> will temporarily
 845     write changes to the involved files in a set of "shadow
 846     files", without modifying the files that are accessed by the
 847     active server processes. If the update procedure is interrupted by a
 848     system crash or a signal, you simply repeat the procedure - the
 849     register files have not been changed or damaged, and the partially
 850     written shadow files are automatically deleted before the new updating
 851     procedure commences.
 852    </para>
 853
 854    <para>
 855     At the end of the updating procedure (or in a separate operation, if
 856     you so desire), the system enters a "commit mode". First,
 857     any active server processes are forced to access those blocks that
 858     have been changed from the shadow files rather than from the main
 859     register files; the unmodified blocks are still accessed at their
 860     normal location (the shadow files are not a complete copy of the
 861     register files - they only contain those parts that have actually been
 862     modified). If the commit process is interrupted at any point,
 863     the server processes will continue to access the
 864     shadow files until you can repeat the commit procedure and complete
 865     the writing of data to the main register files. You can perform
 866     multiple update operations to the registers before you commit the
 867     changes to the system files, or you can execute the commit operation
 868     at the end of each update operation. When the commit phase has
 869     completed successfully, any running server processes are instructed to
 870     switch their operations to the new, operational register, and the
 871     temporary shadow files are deleted.
 872    </para>
 873
 874   </sect2>
 875
 876   <sect2 id="shadow-registers-how-to-use">
 877    <title>How to Use Shadow Register Files</title>
 878
 879    <para>
 880     The first step is to allocate space on your system for the shadow
 881     files.
 882     You do this by adding a <literal>shadow</literal> entry to the
 883     <literal>zebra.cfg</literal> file.
 884     The syntax of the <literal>shadow</literal> entry is exactly the
 885     same as for the <literal>register</literal> entry
 886     (see <xref linkend="register-location"/>).
 887      The location of the shadow area should be
 888      <emphasis>different</emphasis> from the location of the main register
 889      area (if you have specified one - remember that if you provide no
 890      <literal>register</literal> setting, the default register area is the
 891      working directory of the server and indexing processes).
 892    </para>
 893
 894    <para>
 895     The following excerpt from a <literal>zebra.cfg</literal> file shows
 896     one example of a setup that configures both the main register
 897     location and the shadow file area.
 898     Note that two directories or partitions have been set aside
 899     for the shadow file area. You can specify any number of directories
 900     for each of the file areas, but remember that there should be no
 901     overlaps between the directories used for the main registers and the
 902     shadow files, respectively.
 903    </para>
 904    <para>
 905
 906     <screen>
 907      register: /d1:500M
 908      shadow: /scratch1:100M /scratch2:200M
 909     </screen>
 910
 911    </para>
 912
 913    <para>
 914     When shadow files are enabled, an extra command is available at the
 915     <literal>zebraidx</literal> command line.
 916     In order to make changes to the system take effect for the
 917     users, you'll have to submit a "commit" command after a
 918     (sequence of) update operation(s).
 919    </para>
 920
 921    <para>
 922
 923     <screen>
 924      $ zebraidx update /d1/records
 925      $ zebraidx commit
 926     </screen>
 927
 928    </para>
 929
 930    <para>
 931     Or you can execute multiple updates before committing the changes:
 932    </para>
 933
 934    <para>
 935
 936     <screen>
 937      $ zebraidx -g books update /d1/records  /d2/more-records
 938      $ zebraidx -g fun update /d3/fun-records
 939      $ zebraidx commit
 940     </screen>
 941
 942    </para>
 943
 944    <para>
 945     If one of the update operations above had been interrupted, the commit
 946     operation on the last line would fail: <literal>zebraidx</literal>
 947     will not let you commit changes that would destroy the running register.
 948     You'll have to rerun all of the update operations since your last
 949     commit operation, before you can commit the new changes.
 950    </para>
 951
 952    <para>
 953     Similarly, if the commit operation fails, <literal>zebraidx</literal>
 954     will not let you start a new update operation before you have
 955     successfully repeated the commit operation.
 956     The server processes will keep accessing the shadow files rather
 957     than the (possibly damaged) blocks of the main register files
 958     until the commit operation has successfully completed.
 959    </para>
 960
 961    <para>
 962     You should be aware that update operations may take slightly longer
 963     when the shadow register system is enabled, since more file access
 964     operations are involved. Further, while the disk space required for
 965     the shadow register data is modest for a small update operation, you
 966     may prefer to disable the system if you are adding a very large number
 967     of records to an already very large database (we use the terms
 968     <emphasis>large</emphasis> and <emphasis>modest</emphasis>
 969     very loosely here, since every application will have a
 970     different perception of size).
 971     To update the system without the use of the shadow files,
 972     simply run <literal>zebraidx</literal> with the <literal>-n</literal>
 973     option (note that you do not have to execute the
 974     <emphasis>commit</emphasis> command of <literal>zebraidx</literal>
 975     when you temporarily disable the use of the shadow registers in
 976     this fashion).
 977     Note also that, just as when the shadow registers are not enabled,
 978     server processes will be barred from accessing the main register
 979     while the update procedure takes place.
 980    </para>
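   <para>
    For instance, a one-off bulk load that bypasses the shadow files might
    look like the following sketch. The record directory is simply the one
    used in the examples above, and no <literal>commit</literal> step
    follows:

    <screen>
     $ zebraidx -n update /d1/records
    </screen>

   </para>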
 981
 982   </sect2>
 983
 984  </sect1>
 985
 986
 987  <sect1 id="administration-ranking">
 988   <title>Relevance Ranking and Sorting of Result Sets</title>
 989
 990   <sect2 id="administration-overview">
 991    <title>Overview</title>
 992    <para>
 993     The default ordering of a result set is left up to the server,
 994     which inside &zebra; means sorting in ascending document ID order.
 995     This is not always the order in which humans want to browse the sometimes
 996     quite large hit sets. Ranking and sorting come to the rescue.
 997    </para>
 998
 999    <para>
1000     In cases where a good presentation ordering can be computed at
1001     indexing time, we can use a fixed <literal>static ranking</literal>
1002     scheme, which is provided for the <literal>alvis</literal>
1003     indexing filter. This defines a fixed ordering of hit lists,
1004     independently of the query issued.
1005    </para>
1006
1007    <para>
1008     There are cases, however, where relevance of hit set documents is
1009     highly dependent on the query processed.
1010     Simply put, <literal>dynamic relevance ranking</literal>
1011     sorts a set of retrieved records such that those most likely to be
1012     relevant to your request are retrieved first.
1013     Internally, &zebra; retrieves all documents that satisfy your
1014     query, and re-orders the hit list to arrange them based on
1015     a measurement of similarity between your query and the content of
1016     each record.
1017    </para>
1018
1019    <para>
1020     Finally, there are situations where hit sets of documents should be
1021     <literal>sorted</literal> during query time according to the
1022     lexicographical ordering of certain sort indexes created at
1023     indexing time.
1024    </para>
1025   </sect2>
1026
1027
1028  <sect2 id="administration-ranking-static">
1029   <title>Static Ranking</title>
1030
1031    <para>
1032     &zebra; internally uses inverted indexes to look up term occurrences
1033     in documents. Multiple queries from different indexes can be
1034     combined by the binary boolean operations <literal>AND</literal>,
1035     <literal>OR</literal> and/or <literal>NOT</literal> (which
1036     is in fact a binary <literal>AND NOT</literal> operation).
1037     To ensure fast query execution
1038     speed, all indexes have to be sorted in the same order.
1039    </para>
1040    <para>
1041     The indexes are normally sorted according to document
1042     <literal>ID</literal> in
1043     ascending order, and any query which does not invoke a special
1044     re-ranking function will therefore retrieve the result set in
1045     document
1046     <literal>ID</literal>
1047     order.
1048    </para>
1049    <para>
1050     If one defines the
1051     <screen>
1052     staticrank: 1
1053     </screen>
1054     directive in the main core &zebra; configuration file, the internal document
1055     keys used for ordering are augmented by a preceding integer, which
1056     contains the static rank of a given document, and the index lists
1057     are ordered
1058     first by ascending static rank,
1059     then by ascending document <literal>ID</literal>.
1060     Zero
1061     is the ``best'' rank, as it occurs at the
1062     beginning of the list; higher numbers represent worse scores.
1063    </para>
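   <para>
    As an illustration only, with static ranks and document
    <literal>ID</literal>s invented for this example, the entries of an
    index list would be visited in the following order:
    <screen>
    static rank   document ID
    0             17
    0             42
    1             3
    5             8
    </screen>
    Document 3 has the lowest <literal>ID</literal>, but it comes after
    documents 17 and 42 because its static rank is worse.
   </para>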
1064    <para>
1065     The experimental <literal>alvis</literal> filter provides a
1066     directive to fetch static rank information out of the indexed &acro.xml;
1067     records, so that <emphasis>all</emphasis> hit sets are ordered
1068     by <emphasis>ascending</emphasis> static
1069     rank and, for documents that have the same static rank,
1070     by <emphasis>ascending</emphasis> document <literal>ID</literal>.
1071     See <xref linkend="record-model-alvisxslt"/> for the gory details.
1072    </para>
1073     </sect2>
1074
1075
1076  <sect2 id="administration-ranking-dynamic">
1077   <title>Dynamic Ranking</title>
1078    <para>
1079     In order to fiddle with the static rank order, it is necessary to
1080     invoke additional re-ranking/re-ordering using dynamic
1081     ranking or score functions. These functions return positive
1082     integer scores, where <emphasis>highest</emphasis> score is
1083     ``best'';
1084     hit sets are sorted according to <emphasis>descending</emphasis>
1085     scores (in contrast
1086     to the index lists which are sorted according to
1087     ascending rank number and document ID).
1088    </para>
1089    <para>
1090     Dynamic ranking is enabled by a directive like one of the
1091     following in the &zebra; configuration file (use only one of these at a time!):
1092     <screen>
1093     rank: rank-1        # default TF-IDF like
1094     rank: rank-static   # dummy do-nothing
1095     </screen>
1096    </para>
1097
1098    <para>
1099     Dynamic ranking is done at query time rather than
1100     indexing time (this is why we
1101     call it ``dynamic ranking'' in the first place).
1102     It is invoked by adding
1103     the &acro.bib1; relation attribute with
1104     value ``relevance'' to the &acro.pqf; query (that is,
1105     <literal>@attr&nbsp;2=102</literal>, see also
1106     <ulink url="&url.z39.50;bib1.html">
1107      The &acro.bib1; Attribute Set Semantics</ulink>, also in
1108       <ulink url="&url.z39.50.attset.bib1;">HTML</ulink>).
1109     To find all articles with the word <literal>Eoraptor</literal> in
1110     the title, and present them relevance ranked, issue the &acro.pqf; query:
1111     <screen>
1112      @attr 2=102 @attr 1=4 Eoraptor
1113     </screen>
1114    </para>
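   <para>
    In an interactive client session, written here in the same
    <literal>Z></literal> prompt notation that this chapter uses for its
    &acro.cql; examples, the query could be issued roughly as follows.
    This is a sketch which assumes that a connection to the server is
    already open and that the client accepts a
    <literal>querytype prefix</literal> command to select &acro.pqf;:
    <screen>
     Z> querytype prefix
     Z> f @attr 2=102 @attr 1=4 Eoraptor
    </screen>
   </para>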
1115
1116     <sect3 id="administration-ranking-dynamic-rank1">
1117      <title>Dynamically ranking using &acro.pqf; queries with the 'rank-1'
1118       algorithm</title>
1119
1120    <para>
1121      The default <literal>rank-1</literal> ranking module implements a
1122      TF/IDF (Term Frequency over Inverse Document Frequency) like
1123      algorithm. In contrast to the usual definition of TF/IDF
1124      algorithms, which only considers searching in one full-text
1125      index, this one works on multiple indexes at the same time.
1126      More precisely,
1127      &zebra; does boolean queries and searches in specific addressed
1128      indexes (there are inverted indexes pointing from terms in the
1129      dictionary to documents and term positions inside documents).
1130      It works like this:
1131      <variablelist>
1132       <varlistentry>
1133        <term>Query Components</term>
1134        <listitem>
1135         <para>
1136          First, the boolean query is dismantled into its principal components,
1137          i.e. atomic queries where one term is looked up in one index.
1138          For example, the query
1139          <screen>
1140         @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
1141          </screen>
1142          is a boolean AND between the atomic parts
1143          <screen>
1144        @attr 2=102 @attr 1=1010 Utah
1145          </screen>
1146           and
1147          <screen>
1148        @attr 2=102 @attr 1=1018 Springer
1149          </screen>
1150          each of which is processed separately.
1151         </para>
1152        </listitem>
1153       </varlistentry>
1154
1155       <varlistentry>
1156        <term>Atomic hit lists</term>
1157        <listitem>
1158         <para>
1159          Second, for each atomic query, the hit list of documents is
1160          computed.
1161         </para>
1162         <para>
1163          In this example, two hit lists are computed, one for each of the indexes
1164          <literal>@attr 1=1010</literal>  and
1165          <literal>@attr 1=1018</literal>.
1166         </para>
1167        </listitem>
1168       </varlistentry>
1169
1170       <varlistentry>
1171        <term>Atomic scores</term>
1172        <listitem>
1173         <para>
1174          Third, each document in the hit list is assigned a score (<emphasis>if</emphasis> ranking
1175          is enabled and requested in the query) using a TF/IDF scheme.
1176         </para>
1177         <para>
1178          In this example, both atomic parts of the query carry the magic
1179          <literal>@attr 2=102</literal> relevance attribute, so both are
1180          used by the relevance ranking function.
1181         </para>
1182         <para>
1183          It is possible to apply dynamic ranking on only parts of the
1184          &acro.pqf; query:
1185          <screen>
1186           @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
1187          </screen>
1188          searches for all documents which have the term 'Utah' in the
1189          body of text, and which have the term 'Springer' in the publisher
1190          field, and sorts them in the order of the relevance ranking made on
1191          the body-of-text index only.
1192         </para>
1193        </listitem>
1194       </varlistentry>
1195
1196       <varlistentry>
1197        <term>Hit list merging</term>
1198        <listitem>
1199         <para>
1200          Fourth, the atomic hit lists are merged according to the boolean
1201          conditions to a final hit list of documents to be returned.
1202         </para>
1203         <para>
1204         This step is always performed, regardless of whether
1205         dynamic ranking is enabled.
1206         </para>
1207        </listitem>
1208       </varlistentry>
1209
1210       <varlistentry>
1211        <term>Document score computation</term>
1212        <listitem>
1213         <para>
1214          Fifth, the total score of a document is computed as a linear
1215          combination of the atomic scores of the atomic hit lists.
1216         </para>
1217         <para>
1218          Ranking weights may be used to pass a value to a ranking
1219          algorithm, using the non-standard &acro.bib1; attribute type 9.
1220          This allows one branch of a query to use one value while
1221          another branch uses a different one.  For example, we can search
1222          for <literal>utah</literal> in the
1223          <literal>@attr 1=4</literal> index with weight 30, as
1224          well as for <literal>city</literal> in the <literal>@attr 1=1010</literal> index with weight 20:
1225          <screen>
1226          @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
1227          </screen>
1228         </para>
1229         <para>
1230          The default weight is 34, intended as an approximation of
1231          sqrt(1000) (about 31.6), as the &acro.z3950; standard prescribes that the top score
1232          is 1000 and the bottom score is 0, encoded in integers.
1233         </para>
1234         <warning>
1235          <para>
1236           The ranking-weight feature is experimental. It may change in future
1237           releases of &zebra;.
1238          </para>
1239         </warning>
1240        </listitem>
1241       </varlistentry>
1242
1243       <varlistentry>
1244        <term>Re-sorting of hit list</term>
1245        <listitem>
1246         <para>
1247          Finally, the merged hit list is re-ordered according to the computed scores.
1248         </para>
1249        </listitem>
1250       </varlistentry>
1251      </variablelist>
1252
1253
1254 <!--
1255 Still need to describe the exact TF/IDF formula. Here's the info, need -->
1256 <!--to extract it in human readable form .. MC
1257
1258 static int calc (void *set_handle, zint sysno, zint staticrank,
1259                  int *stop_flag)
1260 {
1261     int i, lo, divisor, score = 0;
1262     struct rank_set_info *si = (struct rank_set_info *) set_handle;
1263
1264     if (!si->no_rank_entries)
1265         return -1;   /* ranking not enabled for any terms */
1266
1267     for (i = 0; i < si->no_entries; i++)
1268     {
1269         yaz_log(log_level, "calc: i=%d rank_flag=%d lo=%d",
1270                 i, si->entries[i].rank_flag, si->entries[i].local_occur);
1271         if (si->entries[i].rank_flag && (lo = si->entries[i].local_occur))
1272             score += (8+log2_int (lo)) * si->entries[i].global_inv *
1273                 si->entries[i].rank_weight;
1274     }
1275     divisor = si->no_rank_entries * (8+log2_int (si->last_pos/si->no_entries));
1276     score = score / divisor;
1277     yaz_log(log_level, "calc sysno=" ZINT_FORMAT " score=%d", sysno, score);
1278     if (score > 1000)
1279         score = 1000;
1280     /* reset the counts for the next term */
1281     for (i = 0; i < si->no_entries; i++)
1282         si->entries[i].local_occur = 0;
1283     return score;
1284 }
1285
1286
1287 where lo = si->entries[i].local_occur is the local documents term-within-index frequency, si->entries[i].global_inv represents the IDF part (computed in static void *begin()), and
1288 si->entries[i].rank_weight is the weight assigner per index (default 34, or set in the @attr 9=xyz magic)
1289
1290 Finally, the IDF part is computed as:
1291
1292 static void *begin (struct zebra_register *reg,
1293                     void *class_handle, RSET rset, NMEM nmem,
1294                     TERMID *terms, int numterms)
1295 {
1296     struct rank_set_info *si =
1297         (struct rank_set_info *) nmem_malloc (nmem,sizeof(*si));
1298     int i;
1299
1300     yaz_log(log_level, "rank-1 begin");
1301     si->no_entries = numterms;
1302     si->no_rank_entries = 0;
1303     si->nmem=nmem;
1304     si->entries = (struct rank_term_info *)
1305         nmem_malloc (si->nmem, sizeof(*si->entries)*numterms);
1306     for (i = 0; i < numterms; i++)
1307     {
1308         zint g = rset_count(terms[i]->rset);
1309         yaz_log(log_level, "i=%d flags=%s '%s'", i,
1310                 terms[i]->flags, terms[i]->name );
1311         if  (!strncmp (terms[i]->flags, "rank,", 5))
1312         {
1313             const char *cp = strstr(terms[i]->flags+4, ",w=");
1314             si->entries[i].rank_flag = 1;
1315             if (cp)
1316                 si->entries[i].rank_weight = atoi (cp+3);
1317             else
1318               si->entries[i].rank_weight = 34; /* sqrroot of 1000 */
1319             yaz_log(log_level, " i=%d weight=%d g="ZINT_FORMAT, i,
1320                      si->entries[i].rank_weight, g);
1321             (si->no_rank_entries)++;
1322         }
1323         else
1324             si->entries[i].rank_flag = 0;
1325         si->entries[i].local_occur = 0;  /* FIXME */
1326         si->entries[i].global_occur = g;
1327         si->entries[i].global_inv = 32 - log2_int (g);
1328         yaz_log(log_level, " global_inv = %d g = " ZINT_FORMAT,
1329                 (int) (32-log2_int (g)), g);
1330         si->entries[i].term = terms[i];
1331         si->entries[i].term_index=i;
1332         terms[i]->rankpriv = &(si->entries[i]);
1333     }
1334     return si;
1335 }
1336
1337
1338 where g = rset_count(terms[i]->rset) is the count of all documents in this specific index hit list, and the IDF part then is
1339
1340  si->entries[i].global_inv = 32 - log2_int (g);
1341    -->
1342
1343    </para>
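   <para>
    As a rough sketch of the arithmetic, reconstructed from the
    <literal>calc</literal> and <literal>begin</literal> functions of the
    <literal>rank-1</literal> module as quoted in the comments of this
    chapter's XML source (so it may not match every release), the score of
    a document is:
    <screen>
    score = sum over ranked terms i with lo(i) > 0 of
              (8 + log2(lo(i))) * (32 - log2(g(i))) * weight(i)
            divided by
              R * (8 + log2(last_pos / N))
    </screen>
    Here lo(i) is the number of occurrences of term i in the document,
    g(i) is the number of documents in the hit list of term i,
    weight(i) is the ranking weight (default 34), R is the number of terms
    for which ranking was requested, N is the number of terms in the
    query, and log2 is an integer logarithm. The meaning of
    <literal>last_pos</literal> is not spelled out in that excerpt. The
    result is capped at 1000.
   </para>
   <para>
    With numbers invented purely for illustration, and assuming that the
    integer logarithm rounds down, a single ranked term with lo = 4,
    g = 1024, the default weight, and last_pos / N = 256 would give:
    <screen>
    (8 + 2) * (32 - 10) * 34 = 7480
    1 * (8 + 8)              = 16
    7480 / 16                = 467   (integer division)
    </screen>
   </para>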
1344
1345
1346     <para>
1347     The <literal>rank-1</literal> algorithm
1348     does not use the static rank
1349     information in the list keys, and will produce the same ordering
1350     with or without static ranking enabled.
1351     </para>
1352
1353
1354     <!--
1355     <sect3 id="administration-ranking-dynamic-rank1">
1356      <title>Dynamically ranking &acro.pqf; queries with the 'rank-static'
1357       algorithm</title>
1358     <para>
1359     The dummy <literal>rank-static</literal> reranking/scoring
1360     function returns just
1361     <literal>score = max int - staticrank</literal>
1362     in order to preserve the static ordering of hit sets that would
1363     have been produced had it not been invoked.
1364     Obviously, to combine static and dynamic ranking usefully,
1365     it is necessary
1366     to make a new ranking
1367     function; this is left
1368     as an exercise for the reader.
1369    </para>
1370     </sect3>
1371     -->
1372
1373    <warning>
1374      <para>
1375       <literal>Dynamic ranking</literal> is not compatible
1376       with <literal>estimated hit sizes</literal>, as all documents in
1377       a hit set must be accessed to compute the correct placing in a
1378       ranking sorted list. Therefore the use attribute setting
1379       <literal>@attr&nbsp;2=102</literal> clashes with
1380       <literal>@attr&nbsp;9=integer</literal>.
1381      </para>
1382    </warning>
1383
1384    <!--
1385     we might want to add ranking like this:
1386     UNPUBLISHED:
1387     Simple BM25 Extension to Multiple Weighted Fields
1388     Stephen Robertson, Hugo Zaragoza and Michael Taylor
1389     Microsoft Research
1390     ser@microsoft.com
1391     hugoz@microsoft.com
1392     mitaylor2microsoft.com
1393    -->
1394
1395     </sect3>
1396
1397     <sect3 id="administration-ranking-dynamic-cql">
1398      <title>Dynamically ranking &acro.cql; queries</title>
1399      <para>
1400       Dynamic ranking can be enabled during server-side &acro.cql;
1401       query expansion by adding <literal>@attr&nbsp;2=102</literal>
1402       chunks to the &acro.cql; config file. For example
1403       <screen>
1404        relationModifier.relevant                = 2=102
1405       </screen>
1406       invokes dynamic ranking each time a &acro.cql; query of the form
1407       <screen>
1408        Z> querytype cql
1409        Z> f alvis.text =/relevant house
1410       </screen>
1411       is issued. Dynamic ranking can also be automatically used on
1412       specific &acro.cql; indexes by (for example) setting
1413       <screen>
1414        index.alvis.text                        = 1=text 2=102
1415       </screen>
1416       which then invokes dynamic ranking each time a &acro.cql; query of the form
1417       <screen>
1418        Z> querytype cql
1419        Z> f alvis.text = house
1420       </screen>
1421       is issued.
1422      </para>
1423
1424     </sect3>
1425
1426     </sect2>
1427
1428
1429  <sect2 id="administration-ranking-sorting">
1430   <title>Sorting</title>
1431    <para>
1432      &zebra; sorts efficiently using special sorting indexes
1433      (type=<literal>s</literal>), so each sortable index must be known
1434      at indexing time and specified in the configuration of record
1435      indexing.  For example, to enable sorting according to the &acro.bib1;
1436      <literal>Date/time-added-to-db</literal> field, one could add the line
1437      <screen>
1438         xelm /*/@created               Date/time-added-to-db:s
1439      </screen>
1440      to any <literal>.abs</literal> record-indexing configuration file.
1441      Similarly, one could add an indexing element of the form
1442      <screen><![CDATA[
1443       <z:index name="date-modified" type="s">
1444        <xsl:value-of select="some/xpath"/>
1445       </z:index>
1446       ]]></screen>
1447      to any <literal>alvis</literal>-filter indexing stylesheet.
1448      </para>
1449      <para>
1450       Sorting can be specified at searching time using a query term
1451       carrying the non-standard
1452       &acro.bib1; attribute-type <literal>7</literal>.  This removes the
1453       need to send a &acro.z3950; <literal>Sort Request</literal>
1454       separately, and can dramatically improve latency when the client
1455       and server are on separate networks.
1456       The sorting part of the query is separate from the rest of the
1457       query - the actual search specification - and must be combined
1458       with it using OR.
1459      </para>
1460      <para>
1461       A sorting subquery needs two attributes: an index (such as a
1462       &acro.bib1; type-1 attribute) specifying which index to sort on, and a
1463       type-7 attribute whose value is <literal>1</literal> for
1464       ascending sorting, or <literal>2</literal> for descending.  The
1465       term associated with the sorting attribute is the priority of
1466       the sort key, where <literal>0</literal> specifies the primary
1467       sort key, <literal>1</literal> the secondary sort key, and so
1468       on.
1469      </para>
1470     <para>For example, a search for water, sorted by title (ascending),
1471     is expressed by the &acro.pqf; query
1472      <screen>
1473      @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
1474      </screen>
1475       whereas a search for water, sorted by title ascending,
1476      then date descending would be
1477      <screen>
1478      @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
1479      </screen>
1480     </para>
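    <para>
     Following the same pattern, and assuming (as in the example above)
     that <literal>@attr 1=30</literal> addresses the date index, a search
     for water sorted by date only, in descending order, could be
     sketched as
     <screen>
     @or @attr 1=1016 water @attr 7=2 @attr 1=30 0
     </screen>
     where the term <literal>0</literal> marks the date as the primary
     sort key.
    </para>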
1481     <para>
1482      Notice the fundamental differences between <literal>dynamic
1483      ranking</literal> and <literal>sorting</literal>: there can be
1484      only one ranking function defined and configured; but multiple
1485      sorting indexes can be specified dynamically at search
1486      time. Ranking does not need to use specific indexes, so
1487      dynamic ranking can be enabled and disabled without
1488      re-indexing, whereas sorting indexes need to be
1489      defined before indexing.
1490      </para>
1491
1492  </sect2>
1493
1494
1495  </sect1>
1496
1497  <sect1 id="administration-extended-services">
1498   <title>Extended Services: Remote Insert, Update and Delete</title>
1499
1500    <note>
1501     <para>
1502      Extended services are only supported when accessing the &zebra;
1503      server using the <ulink url="&url.z39.50;">&acro.z3950;</ulink>
1504      protocol. The <ulink url="&url.sru;">&acro.sru;</ulink> protocol does
1505      not support extended services.
1506     </para>
1507    </note>
1508
1509   <para>
1510     The extended services are not enabled by default in &zebra;, because
1511     they modify the system. &zebra; can be configured
1512     to allow anybody to
1513     search, and to allow updates only for a particular admin user,
1514     in the main &zebra; configuration file <filename>zebra.cfg</filename>.
1515     For user <literal>admin</literal>, you could use:
1516     <screen>
1517      perm.anonymous: r
1518      perm.admin: rw
1519      passwd: passwordfile
1520     </screen>
1521     And in the password file
1522     <filename>passwordfile</filename>, you have to specify users and
1523     encrypted passwords as colon-separated strings.
1524     Use a tool like <filename>htpasswd</filename>
1525     to maintain the encrypted passwords.
1526     <screen>
1527      admin:secret
1528     </screen>
1529     It is essential to configure  &zebra; to store records internally,
1530     and to support
1531     modifications and deletion of records:
1532     <screen>
1533      storeData: 1
1534      storeKeys: 1
1535     </screen>
1536     The general record type should be set to any record filter which
1537     is able to parse &acro.xml; records. You may use either of the two
1538     declarations (but not both simultaneously!):
1539     <screen>
1540      recordType: dom.filter_dom_conf.xml
1541      # recordType: grs.xml
1542     </screen>
1543     Notice the difference from the specific instructions
1544     <screen>
1545      recordType.xml: dom.filter_dom_conf.xml
1546      # recordType.xml: grs.xml
1547     </screen>
1548     which only work when indexing XML files from the filesystem using
1549     the <literal>*.xml</literal> naming convention.
1550    </para>
1551    <para>
    To enable transaction-safe shadow indexing,
    which is especially important for this kind of operation, set
1554     <screen>
1555      shadow: directoryname: size (e.g. 1000M)
1556     </screen>
1557      See <xref linkend="zebra-cfg"/> for additional information on
1558      these configuration options.
1559    </para>
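   <para>
    Taken together, the settings discussed above could be combined into a
    <filename>zebra.cfg</filename> fragment along the following lines.
    This is only a sketch: the shadow directory name
    <filename>shadow1</filename> and its size are placeholders, and the
    filter file name is simply the one used in the example above.
    <screen>
     # anybody may search, only the admin user may update
     perm.anonymous: r
     perm.admin: rw
     passwd: passwordfile

     # keep records and keys so that records can be modified and deleted
     storeData: 1
     storeKeys: 1

     # one general XML-capable filter for all incoming records
     recordType: dom.filter_dom_conf.xml

     # transaction-safe updates (placeholder directory and size)
     shadow: shadow1:1000M
    </screen>
   </para>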
1560    <note>
1561     <para>
1562      It is not possible to carry information about record types or
1563      similar to &zebra; when using extended services, due to
1564      limitations of the <ulink url="&url.z39.50;">&acro.z3950;</ulink>
     protocol. Therefore, indexing filters cannot be chosen on a
1566      per-record basis. One and only one general &acro.xml; indexing filter
1567      must be defined.
1568      <!-- but because it is represented as an OID, we would need some
1569      form of proprietary mapping scheme between record type strings and
1570      OIDs. -->
1571      <!--
1572      However, as a minimum, it would be extremely useful to enable
1573      people to use &acro.marc21;, assuming grs.marcxml.marc21 as a record
1574      type.
1575      -->
1576     </para>
1577    </note>
1578
1579
1580    <sect2 id="administration-extended-services-z3950">
1581     <title>Extended services in the &acro.z3950; protocol</title>
1582
1583     <para>
     The <ulink url="&url.z39.50;">&acro.z3950;</ulink> standard allows
     servers to accept special binary <emphasis>extended services</emphasis>
     protocol packages, which may be used to insert, update and delete
     records on servers. These packages carry control and update
     information, which is encoded in seven package fields:
1589     </para>
1590
1591     <table id="administration-extended-services-z3950-table" frame="top">
1592      <title>Extended services &acro.z3950; Package Fields</title>
1593       <tgroup cols="3">
1594        <thead>
1595        <row>
1596          <entry>Parameter</entry>
1597          <entry>Value</entry>
1598          <entry>Notes</entry>
1599         </row>
1600       </thead>
1601        <tbody>
1602         <row>
1603          <entry><literal>type</literal></entry>
1604          <entry><literal>'update'</literal></entry>
1605          <entry>Must be set to trigger extended services</entry>
1606         </row>
1607         <row>
1608          <entry><literal>action</literal></entry>
1609          <entry><literal>string</literal></entry>
1610         <entry>
1611          Extended service action type with
1612          one of four possible values: <literal>recordInsert</literal>,
1613          <literal>recordReplace</literal>,
1614          <literal>recordDelete</literal>,
1615          and <literal>specialUpdate</literal>
1616         </entry>
1617         </row>
1618         <row>
1619          <entry><literal>record</literal></entry>
1620          <entry><literal>&acro.xml; string</literal></entry>
1621          <entry>An &acro.xml; formatted string containing the record</entry>
1622         </row>
1623        <row>
1624         <entry><literal>syntax</literal></entry>
1625         <entry><literal>'xml'</literal></entry>
1626         <entry>XML/SUTRS/MARC. GRS-1 not supported.
1627          The default filter (record type) as given by recordType in
1628          zebra.cfg is used to parse the record.</entry>
1629        </row>
1630         <row>
1631          <entry><literal>recordIdOpaque</literal></entry>
1632          <entry><literal>string</literal></entry>
1633          <entry>
1634          Optional client-supplied, opaque record
1635          identifier used under insert operations.
1636         </entry>
1637         </row>
1638         <row>
         <entry><literal>recordIdNumber</literal></entry>
         <entry><literal>positive number</literal></entry>
         <entry>&zebra;'s internal system number,
         not allowed for <literal>recordInsert</literal> or
         <literal>specialUpdate</literal> actions which result in fresh
         record inserts.
1645         </entry>
1646         </row>
1647         <row>
1648          <entry><literal>databaseName</literal></entry>
1649          <entry><literal>database identifier</literal></entry>
1650         <entry>
1651          The name of the database to which the extended services should be
1652          applied.
1653         </entry>
1654         </row>
1655       </tbody>
1656       </tgroup>
1657      </table>
1658
1659
1660    <para>
1661     The <literal>action</literal> parameter can be any of
1662     <literal>recordInsert</literal> (will fail if the record already exists),
1663     <literal>recordReplace</literal> (will fail if the record does not exist),
1664     <literal>recordDelete</literal> (will fail if the record does not
1665        exist), and
    <literal>specialUpdate</literal> (will insert or update the record
       as needed; record deletion is not possible).
1668    </para>
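   <para>
    The success and failure rules above can be summarised in a small,
    self-contained &acro.php; sketch. It is a toy in-memory model and not
    &zebra;'s implementation: the function name
    <literal>applyAction</literal> and the array-based store are made up
    for illustration only.
    <screen>
    <![CDATA[
     <?php
     // Toy model of the four action types. $store maps a record
     // identifier to record content. Returns true on success.
     function applyAction(array &$store, string $action, string $id,
                          ?string $record = null): bool
     {
         $exists = array_key_exists($id, $store);
         switch ($action) {
             case 'recordInsert':   // fails if the record already exists
                 if ($exists) return false;
                 $store[$id] = $record;
                 return true;
             case 'recordReplace':  // fails if the record does not exist
                 if (!$exists) return false;
                 $store[$id] = $record;
                 return true;
             case 'recordDelete':   // fails if the record does not exist
                 if (!$exists) return false;
                 unset($store[$id]);
                 return true;
             case 'specialUpdate':  // inserts or updates as needed
                 $store[$id] = $record;
                 return true;
             default:               // unknown action
                 return false;
         }
     }

     $store = array();
     var_dump(applyAction($store, 'recordInsert',  'id1234', '<r/>')); // bool(true)
     var_dump(applyAction($store, 'recordInsert',  'id1234', '<r/>')); // bool(false)
     var_dump(applyAction($store, 'recordReplace', 'id9999', '<r/>')); // bool(false)
     var_dump(applyAction($store, 'specialUpdate', 'id9999', '<r/>')); // bool(true)
     var_dump(applyAction($store, 'recordDelete',  'id1234'));         // bool(true)
     var_dump(applyAction($store, 'recordDelete',  'id1234'));         // bool(false)
     ]]>
    </screen>
   </para>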
1669
1670     <para>
1671      During all actions, the
1672      usual rules for internal record ID generation apply, unless an
1673      optional <literal>recordIdNumber</literal> &zebra; internal ID or a
1674     <literal>recordIdOpaque</literal> string identifier is assigned.
1675      The default ID generation is
1676      configured using the <literal>recordId:</literal> from
1677      <filename>zebra.cfg</filename>.
1678      See <xref linkend="zebra-cfg"/>.
1679     </para>
1680
1681    <para>
    Setting the <literal>recordIdNumber</literal> parameter,
    which must be an existing &zebra; internal system ID number, is not
    allowed during any <literal>recordInsert</literal> or
    <literal>specialUpdate</literal> action resulting in fresh record
    inserts.
1687     </para>
1688
1689     <para>
1690      When retrieving existing
1691      records indexed with &acro.grs1; indexing filters, the &zebra; internal
1692      ID number is returned in the field
1693     <literal>/*/id:idzebra/localnumber</literal> in the namespace
1694     <literal>xmlns:id="http://www.indexdata.dk/zebra/"</literal>,
1695     where it can be picked up for later record updates or deletes.
1696     </para>
1697
1698     <para>
1699      A new element set for retrieval of internal record
1700      data has been added, which can be used to access minimal records
1701      containing only the <literal>recordIdNumber</literal> &zebra;
1702      internal ID, or the <literal>recordIdOpaque</literal> string
1703      identifier. This works for any indexing filter used.
1704      See <xref linkend="special-retrieval"/>.
1705     </para>
1706
1707    <para>
     The <literal>recordIdOpaque</literal> string parameter
     is a client-supplied, opaque record
     identifier, which may be used in
     insert, update and delete operations. The
     client software is responsible for assigning these identifiers to
     records. This identifier replaces
     &zebra;'s own automatic identifier generation with a unique
     mapping from <literal>recordIdOpaque</literal> to the
     &zebra; internal <literal>recordIdNumber</literal>.
     <emphasis>The opaque <literal>recordIdOpaque</literal> string
     identifiers
      are not visible in retrieval records, nor are they
      searchable, so the value of this parameter is
      questionable. It serves mostly as a convenient mapping from
      application domain string identifiers to &zebra; internal IDs.
1723      </emphasis>
1724     </para>
1725    </sect2>
1726
1727
1728  <sect2 id="administration-extended-services-yaz-client">
1729   <title>Extended services from yaz-client</title>
1730
1731    <para>
1732     We can now start a yaz-client admin session and create a database:
1733    <screen>
1734     <![CDATA[
1735      $ yaz-client localhost:9999 -u admin/secret
1736      Z> adm-create
1737      ]]>
1738    </screen>
    Now that the <literal>Default</literal> database has been created,
    we can insert an &acro.xml; file (<filename>esdd0006.grs</filename>
    from <filename>example/gils/records</filename>) and index it:
1742    <screen>
1743     <![CDATA[
1744      Z> update insert id1234 esdd0006.grs
1745      ]]>
1746    </screen>
    The third parameter, <literal>id1234</literal> here,
    is the <literal>recordIdOpaque</literal> package field.
1749    </para>
1750    <para>
    Actually, we should have a way to specify "no opaque record id" for
    yaz-client's update command. We'll fix that.
1753    </para>
1754    <para>
1755     The newly inserted record can be searched as usual:
1756     <screen>
1757     <![CDATA[
1758      Z> f utah
1759      Sent searchRequest.
1760      Received SearchResponse.
1761      Search was a success.
1762      Number of hits: 1, setno 1
1763      SearchResult-1: term=utah cnt=1
1764      records returned: 0
1765      Elapsed: 0.014179
1766      ]]>
1767     </screen>
1768    </para>
1769    <para>
1770      Let's delete the beast, using the same
1771      <literal>recordIdOpaque</literal> string parameter:
1772     <screen>
1773     <![CDATA[
1774      Z> update delete id1234
1775      No last record (update ignored)
1776      Z> update delete 1 esdd0006.grs
1777      Got extended services response
1778      Status: done
1779      Elapsed: 0.072441
1780      Z> f utah
1781      Sent searchRequest.
1782      Received SearchResponse.
1783      Search was a success.
1784      Number of hits: 0, setno 2
1785      SearchResult-1: term=utah cnt=0
1786      records returned: 0
1787      Elapsed: 0.013610
1788      ]]>
1789      </screen>
1790     </para>
1791     <para>
    If the shadow register is enabled in your
    <filename>zebra.cfg</filename>,
    you must run the <literal>adm-commit</literal> command
1795     <screen>
1796     <![CDATA[
1797      Z> adm-commit
1798      ]]>
1799     </screen>
     after each update session in order to write your changes from the
     shadow to the live register space.
1802    </para>
1803  </sect2>
1804
1805
1806  <sect2 id="administration-extended-services-yaz-php">
1807   <title>Extended services from yaz-php</title>
1808
1809    <para>
1810     Extended services are also available from the &yaz; &acro.php; client layer. An
    example of a &yaz;-&acro.php; extended service transaction is given here:
1812     <screen>
1813     <![CDATA[
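     // Open the connection first. Host, port and credentials are
     // assumptions here, matching the yaz-client example above.
     $yaz = yaz_connect('localhost:9999',
                        array('user' => 'admin', 'password' => 'secret'));

     $record = '<record><title>A fine specimen of a record</title></record>';

     $options = array('action' => 'recordInsert',
                      'syntax' => 'xml',
                      'record' => $record,
                      'databaseName' => 'mydatabase'
                     );

     // Queue the update and the commit, then send both to the server
     yaz_es($yaz, 'update', $options);
     yaz_es($yaz, 'commit', array());
     yaz_wait();

     if ($error = yaz_error($yaz))
       echo "$error";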
1828      ]]>
1829     </screen>
1830     </para>
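    <para>
     Updating and deleting follow the same pattern. The sketch below
     assumes that the package fields listed in
     <xref linkend="administration-extended-services-z3950-table"/> are
     passed as option keys of the same name, and that the server runs on
     <literal>localhost:9999</literal> with the <literal>admin</literal>
     user from the earlier examples. The identifier
     <literal>id1234</literal> and the database name are placeholders.
     The record is passed along with the delete request as well,
     mirroring the yaz-client session shown earlier.
     <screen>
     <![CDATA[
      <?php
      // Assumed connection details, matching the yaz-client example
      $yaz = yaz_connect('localhost:9999',
                         array('user' => 'admin', 'password' => 'secret'));

      $record = '<record><title>A revised specimen of a record</title></record>';

      // Insert or update under a client-chosen opaque identifier
      yaz_es($yaz, 'update', array('action'         => 'specialUpdate',
                                   'syntax'         => 'xml',
                                   'record'         => $record,
                                   'recordIdOpaque' => 'id1234',
                                   'databaseName'   => 'mydatabase'));
      yaz_wait();
      if ($error = yaz_error($yaz))
        echo "update failed: $error\n";

      // Delete the record again, addressed by the same identifier
      yaz_es($yaz, 'update', array('action'         => 'recordDelete',
                                   'syntax'         => 'xml',
                                   'record'         => $record,
                                   'recordIdOpaque' => 'id1234',
                                   'databaseName'   => 'mydatabase'));
      yaz_es($yaz, 'commit', array());
      yaz_wait();
      if ($error = yaz_error($yaz))
        echo "delete failed: $error\n";

      yaz_close($yaz);
      ]]>
     </screen>
    </para>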
1831     </sect2>
1832
1833    <sect2 id="administration-extended-services-debugging">
1834     <title>Extended services debugging guide</title>
1835     <para>
     When debugging ES over PHP we recommend the following order of tests:
1837     </para>
1838
1839     <itemizedlist>
1840      <listitem>
1841       <para>
       Make sure you have a nice record on your filesystem, which you can
       index from the filesystem using the <literal>zebraidx</literal> command.
1844        Do it exactly as you planned, using one of the GRS-1 filters,
1845        or the DOMXML filter.
1846        When this works, proceed.
1847       </para>
1848      </listitem>
1849      <listitem>
1850       <para>
       Check that your server setup is OK before you have written a single
       line of PHP using ES.
       Take the same record from the filesystem, and send it as ES via
       <literal>yaz-client</literal> as described in
       <xref linkend="administration-extended-services-yaz-client"/>,
       and
       remember the <literal>-a</literal> option, which tells you what
       goes over the wire! Note also the section on permissions:
       try
1860        <screen>
1861         perm.anonymous: rw
1862        </screen>
       in <literal>zebra.cfg</literal> to make sure you do not run into
       permission problems (but never expose such an insecure setup on the
       Internet!). Then, make sure to set the general
1866        <literal>recordType</literal> instruction, pointing correctly
1867        to the GRS-1 filters,
1868        or the DOMXML filters.
1869       </para>
1870      </listitem>
1871      <listitem>
1872       <para>
       If you insist on using the <literal>sysno</literal> in the
       <literal>recordIdNumber</literal> setting,
       please make sure you perform only updates and deletes. &zebra;'s internal
1876        system number is not allowed for
1877        <literal>recordInsert</literal> or
1878        <literal>specialUpdate</literal> actions
1879        which result in fresh record inserts.
1880       </para>
1881      </listitem>
1882      <listitem>
1883       <para>
       If the <literal>shadow register</literal> is enabled in your
       <literal>zebra.cfg</literal>, you must remember to run the
1886        <screen>
1887         Z> adm-commit
1888        </screen>
1889        command as well.
1890       </para>
1891      </listitem>
1892      <listitem>
1893       <para>
1894        If this works, then proceed to do the same thing in your PHP script.
1895       </para>
1896      </listitem>
1897     </itemizedlist>
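
    <para>
     Once the steps above work from <literal>yaz-client</literal>, it can
     help to check the diagnostics after every single request in the PHP
     script instead of only at the end. The helper below is a minimal
     sketch: the name <literal>esStep</literal> is hypothetical, and
     <literal>$yaz</literal> is assumed to be a connection returned by
     <literal>yaz_connect</literal>.
     <screen>
     <![CDATA[
      <?php
      // Hypothetical helper: send one extended-services request and
      // report the error code, message and additional info on failure.
      function esStep($yaz, string $type, array $options): bool
      {
          yaz_es($yaz, $type, $options);
          yaz_wait();
          if (yaz_errno($yaz) != 0) {
              error_log(sprintf("ES %s failed: %s (%s)", $type,
                                yaz_error($yaz), yaz_addinfo($yaz)));
              return false;
          }
          return true;
      }
      ]]>
     </screen>
     A script would then call <literal>esStep($yaz, 'update', $options)</literal>
     and only proceed to <literal>esStep($yaz, 'commit', array())</literal>
     when the update has succeeded.
    </para>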
1898
1899
1900    </sect2>
1901
1902  </sect1>
1903
1904 </chapter>
1905
1906  <!-- Keep this comment at the end of the file
1907  Local variables:
1908  mode: sgml
1909  sgml-omittag:t
1910  sgml-shorttag:t
1911  sgml-minimize-attributes:nil
1912  sgml-always-quote-attributes:t
1913  sgml-indent-step:1
1914  sgml-indent-data:t
1915  sgml-parent-document: "zebra.xml"
1916  sgml-local-catalogs: nil
1917  sgml-namecase-general:t
1918  End:
1919  -->