doc/examples.xml

   1 <chapter id="examples">
   2  <!-- $Id: examples.xml,v 1.18 2002-12-01 23:26:26 mike Exp $ -->
   3  <title>Example Configurations</title>
   4
   5  <sect1>
   6   <title>Overview</title>
   7
   8   <para>
   9    <literal>zebraidx</literal> and <literal>zebrasrv</literal> are both
  10    driven by a master configuration file, which may refer to other
  11    subsidiary configuration files.  By default, they try to use
  12    <filename>zebra.cfg</filename> in the working directory as the
  13    master file; but this can be changed using the <literal>-c</literal>
  14    option to specify an alternative master configuration file.
  15   </para>
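       <para>
        As a brief illustration, assuming an alternative master file with
        the hypothetical name <filename>/etc/zebra/dino.cfg</filename>,
        both programs could be pointed at it like this:
        <screen>
         zebraidx -c /etc/zebra/dino.cfg update records
         zebrasrv -c /etc/zebra/dino.cfg
        </screen>
       </para>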
  16   <para>
  17    The master configuration file tells Zebra:
  18    <itemizedlist>
  19
  20     <listitem>
  21      <para>
  22       Where to find subsidiary configuration files, including both
  23       those that are named explicitly and a few <quote>magic</quote> files such
  24       as <literal>default.idx</literal>,
  25       which specifies the default indexing rules.
  26      </para>
  27     </listitem>
  28
  29     <listitem>
  30      <para>
  31       What record schemas to support.  (Subsidiary files specify how
  32       to index the contents of records in those schemas, and what
  33       format to use when presenting records in those schemas to client
  34       software.)
  35      </para>
  36     </listitem>
  37
  38     <listitem>
  39      <para>
  40       What attribute sets to recognise in searches.  (Subsidiary files
  41       specify how to interpret the attributes in terms
  42       of the indexes that are created on the records.)
  43      </para>
  44     </listitem>
  45
  46     <listitem>
  47      <para>
  48       Policy details such as what type of input format to expect when
  49       adding new records, what low-level indexing algorithm to use,
  50       how to identify potential duplicate records, etc.
  51      </para>
  52     </listitem>
  53
  54    </itemizedlist>
  55   </para>
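       <para>
        As a rough sketch only, a master file touching on several of
        these points might look something like the following.  The
        values are assumptions chosen for illustration, not every item
        in the list is represented, and the directive names should be
        checked against the <filename>zebra.cfg</filename> reference
        documentation for your version of Zebra:
        <screen>
         # Where to look for subsidiary configuration files
         profilePath: .:../../tab
         # An attribute set to recognise in searches
         attset: bib1.att
         # The input format to expect when adding new records
         recordType: grs.sgml
         # How to recognise a record that has been seen before (assumed value)
         recordId: file
        </screen>
       </para>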
  56   <para>
  57    Now let's see what goes in the <literal>zebra.cfg</literal> file
  58    for some example configurations.
  59   </para>
  60  </sect1>
  61
  62  <sect1 id="example1">
  63   <title>Example 1: XML Indexing And Searching</title>
  64
  65   <para>
  66    This example shows how Zebra can be used with absolutely minimal
  67    configuration to index a body of
  68    <ulink url="http://www.w3.org/XML/">XML</ulink>
  69    documents, and search them using
  70    <ulink url="http://www.w3.org/TR/xpath">XPath</ulink>
  71    expressions to specify access points.
  72   </para>
  73   <para>
  74    Go to the <literal>examples/zthes</literal> subdirectory
  75    of the distribution archive.
  76    There you will find a <literal>Makefile</literal> that will
  77    populate the <literal>records</literal> subdirectory with a file of
  78    <ulink url="http://zthes.z3950.org/">Zthes</ulink>
  79    records representing a taxonomic hierarchy of dinosaurs.  (The
  80    records are generated from the family tree in the file
  81    <literal>dino.tree</literal>.)
  82    Type <literal>make records/dino.xml</literal>
  83    to make the XML data file.
  84    (Or you could just type <literal>make</literal> to build the XML
  85    data file, create the database and populate it with the taxonomic
  86    records all in one shot - but then you wouldn't learn anything,
  87    would you?  :-)
  88   </para>
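       <para>
        Typed at a shell prompt, and assuming the distribution archive
        has been unpacked into the current directory, that step amounts
        to:
        <screen>
         cd examples/zthes
         make records/dino.xml
        </screen>
       </para>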
  89   <para>
  90    Now we need to create a Zebra database to hold and index the XML
  91    records.  We do this with the
  92    Zebra indexer, <literal>zebraidx</literal>, which is
  93    driven by the <literal>zebra.cfg</literal> configuration file.
  94    For our purposes, we don't need any
  95    special behaviour - we can use the defaults - so we can start with a
  96    minimal file that just tells <literal>zebraidx</literal> where to
  97    find the default indexing rules, and how to parse the records:
  98    <screen>
  99     profilePath: .:../../tab
 100     recordType: grs.sgml
 101    </screen>
 102   </para>
 103   <para>
 104    That's all you need for a minimal Zebra configuration.  Now you can
 105    roll the XML records into the database and build the indexes:
 106    <screen>
 107     zebraidx update records
 108    </screen>
 109   </para>
 110   <para>
 111    Now start the server.  Like the indexer, its behaviour is
 112    controlled by the
 113    <literal>zebra.cfg</literal> file; and like the indexer, it works
 114    just fine with this minimal configuration.
 115    <screen>
 116         zebrasrv
 117    </screen>
 118    By default, the server listens on TCP port 9999, although
 119    this can easily be changed - see
 120    <xref linkend="zebrasrv"/>.
 121   </para>
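       <para>
        For instance, <literal>zebrasrv</literal> accepts a listener
        address on its command line.  A minimal sketch, assuming that
        port 2100 (an arbitrary choice) is free on your machine:
        <screen>
         zebrasrv @:2100
        </screen>
        A client on the same machine would then connect with something
        like <literal>yaz-client localhost:2100</literal>.  The exact
        address syntax is described in the server reference.
       </para>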
 122   <para>
 123    Now you can use the Z39.50 client program of your choice to execute
 124    XPath-based boolean queries and fetch the XML records that satisfy
 125    them:
 126    <screen>
 127     $ yaz-client @:9999
 128     Connecting...Ok.
 129     Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
 130     Number of hits: 1
 131     Z&gt; format xml
 132     Z&gt; show 1
 133     &lt;Zthes&gt;
 134      &lt;termId&gt;22&lt;/termId&gt;
 135      &lt;termName&gt;Sauroposeidon&lt;/termName&gt;
 136      &lt;termType&gt;PT&lt;/termType&gt;
 137      &lt;termNote&gt;The tallest known dinosaur (18m)&lt;/termNote&gt;
 138      &lt;relation&gt;
 139       &lt;relationType&gt;BT&lt;/relationType&gt;
 140       &lt;termId&gt;21&lt;/termId&gt;
 141       &lt;termName&gt;Brachiosauridae&lt;/termName&gt;
 142       &lt;termType&gt;PT&lt;/termType&gt;
 143      &lt;/relation&gt;
 144
 145       &lt;idzebra xmlns="http://www.indexdata.dk/zebra/"&gt;
 146         &lt;size&gt;300&lt;/size&gt;
 147         &lt;localnumber&gt;23&lt;/localnumber&gt;
 148         &lt;filename&gt;records/dino.xml&lt;/filename&gt;
 149       &lt;/idzebra&gt;
 150     &lt;/Zthes&gt;
 151    </screen>
 152   </para>
 153   <para>
 154    Now wasn't that nice and easy?
 155   </para>
 156  </sect1>
 157
 158
 159  <sect1 id="example2">
 160   <title>Example 2: Supporting Interoperable Searches</title>
 161
 162   <para>
 163    The problem with the previous example is that you need to know the
 164    structure of the documents in order to find them.  For example,
 165    when we wanted to find the record for the taxon
 166    <foreignphrase role="taxon">Sauroposeidon</foreignphrase>,
 167    we had to formulate a complex XPath
 168    <literal>/Zthes/termName</literal>
 169    which embodies the knowledge that taxon names are specified in a
 170    <literal>&lt;termName&gt;</literal> element inside the top-level
 171    <literal>&lt;Zthes&gt;</literal> element.
 172   </para>
 173   <para>
 174    This is bad not just because it requires a lot of typing, but more
 175    significantly because it ties searching semantics to the physical
 176    structure of the searched records.  You can't use the same search
 177    specification to search two databases if their internal
 178    representations are different.  Consider a different taxonomy
 179    database in which the records have taxon names specified
 180    inside a <literal>&lt;name&gt;</literal> element nested within a
 181    <literal>&lt;identification&gt;</literal> element
 182    inside a top-level <literal>&lt;taxon&gt;</literal> element: then
 183    you'd need to search for them using
 184    <literal>1=/taxon/identification/name</literal>.
 185   </para>
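       <para>
        To make the contrast concrete, here is the same search expressed
        against each of the two databases.  The second database and its
        element names are hypothetical, as described above:
        <screen>
         Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
         Z&gt; find @attr 1=/taxon/identification/name Sauroposeidon
        </screen>
        A client that wants to search both databases has to know both
        paths in advance.
       </para>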
 186   <para>
 187    How, then, can we build broadcasting Information Retrieval
 188    applications that look for records in many different databases?
 189    The Z39.50 protocol offers a powerful and general solution to this:
 190    abstract <quote>access points</quote>.  In the Z39.50 model, an access point
 191    is simply a point at which searches can be directed.  Nothing is
 192    said about implementation: in a given database, an access point
 193    might be implemented as an index, a path into physical records, an
 194    algorithm for interrogating relational tables or whatever works.
 195    The only important point is that the semantics of an access
 196    point are fixed and well defined.
 197   </para>
 198   <para>
 199    For convenience, access points are gathered into <firstterm>attribute
 200    sets</firstterm>.  For example, the BIB-1 attribute set is supposed to
 201    contain bibliographic access points such as author, title, subject
 202    and ISBN; the GEO attribute set contains access points pertaining
 203    to geospatial information (bounding coordinates, stratum, latitude
 204    resolution, etc.); the CIMI
 205    attribute set contains access points to do with museum collections
 206    (provenance, inscriptions, etc.).
 207   </para>
 208   <para>
 209    In practice, the BIB-1 attribute set has tended to be a dumping
 210    ground for all sorts of access points, so that, for example, it
 211    includes some geospatial access points as well as strictly
 212    bibliographic ones.  Nevertheless, this model
 213    allows a layer of abstraction over the physical representation of
 214    records in databases.
 215   </para>
 216   <para>
 217    In the BIB-1 attribute set, a taxon name is probably best
 218    interpreted as a title - that is, a phrase that identifies the item
 219    in question.  BIB-1 represents title searches by
 220    access point 4.  (See
 221    <ulink url="ftp://ftp.loc.gov/pub/z3950/defs/bib1.txt"
 222         >The BIB-1 Attribute Set Semantics</ulink>.)
 223    So we need to configure our dinosaur database so that searches for
 224    BIB-1 access point 4 look in the
 225    <literal>&lt;termName&gt;</literal> element,
 226    inside the top-level
 227    <literal>&lt;Zthes&gt;</literal> element.
 228   </para>
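       <para>
        Once such a configuration is in place, the intent is that a
        client could issue a search like the following without knowing
        anything about the structure of the records.  This is a sketch
        of the goal rather than a transcript of a real session, and it
        is not expected to work against the minimal configuration of
        Example 1:
        <screen>
         Z&gt; find @attr 1=4 Sauroposeidon
        </screen>
       </para>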
 229   <para>
 230    ### Here's where it all goes to pieces.  The current arrangement is
 231    very awkward (and somewhat embarrassing) to describe, and the new
 232    arrangement hasn't actually been implemented yet.
 233   </para>
 234   <para>
 235    This is a two-step process.  First, we need to tell Zebra that we
 236    want to support the BIB-1 attribute set.  Then we need to tell it
 237    which elements of its record pertain to access point 4.
 238   </para>
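       <para>
        For the first step, one plausible approach (an assumption,
        given the caveat above) is to add an <literal>attset</literal>
        line to the minimal <filename>zebra.cfg</filename> from
        Example 1, naming the BIB-1 attribute set file found in the
        <literal>tab</literal> directory:
        <screen>
         profilePath: .:../../tab
         attset: bib1.att
         recordType: grs.sgml
        </screen>
       </para>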
 239   <para>
 240    We need to create an <link linkend="abs-file">Abstract Syntax
 241    file</link> named after the document element of the records we're
 242    working with, plus a <literal>.abs</literal> suffix - in this case,
 243    <literal>Zthes.abs</literal> - as follows:
 244   </para>
 245   <itemizedlist>
 246    <listitem>
 247     <para>
 248
 249     </para>
 250    </listitem>
 251    <listitem>
 252     <para>
 253     </para>
 254    </listitem>
 255   </itemizedlist>
 256  </sect1>
 257 </chapter>
 258
 259
 260 <!--
 261         The simplest hello-world example could go like this:
 262
 263         Index the document
 264
 265         <book>
 266            <title>The art of motorcycle maintenance</title>
 267            <subject scheme="Dewey">zen</subject>
 268         </book>
 269
 270         And search it like
 271
 272         f @attr 1=/book/title motorcycle
 273
 274         f @attr 1=/book/subject[@scheme=Dewey] zen
 275
 276         If you suddenly decide you want broader interop, you can add
 277         an abs file (more or less like this):
 278
 279         attset bib1.att
 280         tagset tagsetg.tag
 281
 282         elm (2,1)       title   title
 283         elm (2,21)      subject  subject
 284 -->
 285
 286 <!--
 287 How to include images:
 288
 289         <mediaobject>
 290           <imageobject>
 291             <imagedata fileref="system.eps" format="eps">
 292           </imageobject>
 293           <imageobject>
 294             <imagedata fileref="system.gif" format="gif">
 295           </imageobject>
 296           <textobject>
 297             <phrase>The Multi-Lingual Search System Architecture</phrase>
 298           </textobject>
 299           <caption>
 300             <para>
 301               <emphasis role="strong">
 302                 The Multi-Lingual Search System Architecture.
 303               </emphasis>
 304               <para>
 305                 Network connections across local area networks are
 306                 represented by straight lines, and those over the
 307                 internet by jagged lines.
 308           </caption>
 309         </mediaobject>
 310
 311 Where the three <*object> thingies inside the top-level <mediaobject>
 312 are decreasingly preferred versions to include depending on what the
 313 rendering engine can handle.  I generated the EPS version of the image
 314 by exporting a line-drawing done in TGIF, then converted that to the
 315 GIF using a shell-script called "epstogif" which used an appallingly
 316 baroque sequence of conversions, which I would prefer not to pollute
 317 the Zebra build environment with:
 318
 319         #!/bin/sh
 320
 321         # Yes, what follows is stupidly convoluted, but I can't find a
 322         # more straightforward path from the EPS generated by tgif's
 323         # "Print" command into a browser-friendly format.
 324
 325         file=`echo "$1" | sed 's/\.eps//'`
 326         ps2pdf "$1" "$file".pdf
 327         pdftopbm "$file".pdf "$file"
 328         pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
 329         rm -f "$file".pdf "$file"-000001.pbm
 330
 331 -->
 332
 333  <!-- Keep this comment at the end of the file
 334  Local variables:
 335  mode: sgml
 336  sgml-omittag:t
 337  sgml-shorttag:t
 338  sgml-minimize-attributes:nil
 339  sgml-always-quote-attributes:t
 340  sgml-indent-step:1
 341  sgml-indent-data:t
 342  sgml-parent-document: "zebra.xml"
 343  sgml-local-catalogs: nil
 344  sgml-namecase-general:t
 345  End:
 346  -->