doc/examples.xml

<chapter id="examples">
 <!-- $Id: examples.xml,v 1.17 2002-11-08 17:00:57 mike Exp $ -->
 <title>Example Configurations</title>

 <sect1>
  <title>Overview</title>

  <para>
   <literal>zebraidx</literal> and <literal>zebrasrv</literal> are both
   driven by a master configuration file, which may refer to other
   subsidiary configuration files.  By default, both programs use
   <filename>zebra.cfg</filename> in the working directory as the
   master file; the <literal>-c</literal> option specifies an
   alternative master configuration file.
  </para>
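  <para>
   As a minimal sketch of the <literal>-c</literal> option, both
   programs could be pointed at the same alternative master file.
   The file name <filename>test.cfg</filename> is a hypothetical
   example, not a file shipped with the distribution:
   <screen>
    zebraidx -c test.cfg update records
    zebrasrv -c test.cfg
   </screen>
  </para>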
  <para>
   The master configuration file tells Zebra:
   <itemizedlist>

    <listitem>
     <para>
      Where to find subsidiary configuration files, including
      <literal>default.idx</literal>,
      which specifies the default indexing rules.
     </para>
    </listitem>

    <listitem>
     <para>
      What attribute sets to recognise in searches.
     </para>
    </listitem>

    <listitem>
     <para>
      Policy details such as what record type to expect, what
      low-level indexing algorithm to use, how to identify potential
      duplicate records, etc.
     </para>
    </listitem>

   </itemizedlist>
  </para>
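  <para>
   The first two kinds of setting, plus one policy setting, might
   look roughly like this in a <literal>zebra.cfg</literal> file.
   This is an illustrative sketch only: the
   <literal>profilePath</literal> and <literal>recordType</literal>
   lines are taken from Example 1 below, while the
   <literal>attset</literal> line and the comment lines are
   assumptions that should be checked against your Zebra version.
   <screen>
    # Where to find subsidiary files such as default.idx
    profilePath: .:../../tab
    # Attribute set to recognise in searches (assumed directive)
    attset: bib1.att
    # Policy: what record type to expect
    recordType: grs.sgml
   </screen>
  </para>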
  <para>
   Now let's see what goes in the <literal>zebra.cfg</literal> file
   for some example configurations.
  </para>
 </sect1>

 <sect1 id="example1">
  <title>Example 1: XML Indexing And Searching</title>

  <para>
   This example shows how Zebra can be used with absolutely minimal
   configuration to index a body of
   <ulink url="http://www.w3.org/XML/">XML</ulink>
   documents, and search them using
   <ulink url="http://www.w3.org/TR/xpath">XPath</ulink>
   expressions to specify access points.
  </para>
  <para>
   Go to the <literal>examples/zthes</literal> subdirectory
   of the distribution archive.
   There you will find a <literal>Makefile</literal> that will
   populate the <literal>records</literal> subdirectory with a file of
   <ulink url="http://zthes.z3950.org/">Zthes</ulink>
   records representing a taxonomic hierarchy of dinosaurs.  (The
   records are generated from the family tree in the file
   <literal>dino.tree</literal>.)
   Type <literal>make records/dino.xml</literal>
   to make the XML data file.
  </para>
  <para>
   Now we need to create a Zebra database to hold and index the XML
   records.  We do this with the
   Zebra indexer, <literal>zebraidx</literal>, which is
   driven by the <literal>zebra.cfg</literal> configuration file.
   For our purposes, we don't need any
   special behaviour - we can use the defaults - so we start with a
   minimal file that just tells <literal>zebraidx</literal> where to
   find the default indexing rules, and how to parse the records:
   <screen>
    profilePath: .:../../tab
    recordType: grs.sgml
   </screen>
  </para>
  <para>
   That's all you need for a minimal Zebra configuration.  Now you can
   roll the XML records into the database and build the indexes:
   <screen>
    zebraidx update records
   </screen>
  </para>
  <para>
   Now start the server.  Like the indexer, its behaviour is
   controlled by the
   <literal>zebra.cfg</literal> file; and like the indexer, it works
   just fine with this minimal configuration.
   <screen>
    zebrasrv
   </screen>
   By default, the server listens on TCP port 9999, although
   this can easily be changed - see
   <xref linkend="zebrasrv"/>.
  </para>
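  <para>
   For instance, a different port might be selected by giving the
   server a listener address on its command line.  This sketch
   assumes that <literal>zebrasrv</literal> accepts an address in
   the same <literal>tcp:@:port</literal> form that the client
   uses; the port number 2100 is an arbitrary example.
   <screen>
    zebrasrv tcp:@:2100
   </screen>
   A client would then connect with
   <literal>yaz-client tcp:@:2100</literal>.
  </para>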
  <para>
   Now you can use the Z39.50 client program of your choice to execute
   XPath-based boolean queries and fetch the XML records that satisfy
   them:
   <screen>
    $ yaz-client tcp:@:9999
    Connecting...Ok.
    Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
    Number of hits: 1
    Z&gt; format xml
    Z&gt; show 1
    &lt;Zthes&gt;
     &lt;termId&gt;22&lt;/termId&gt;
     &lt;termName&gt;Sauroposeidon&lt;/termName&gt;
     &lt;termType&gt;PT&lt;/termType&gt;
     &lt;relation&gt;
      &lt;relationType&gt;BT&lt;/relationType&gt;
      &lt;termId&gt;21&lt;/termId&gt;
      &lt;termName&gt;Brachiosauridae&lt;/termName&gt;
      &lt;termType&gt;PT&lt;/termType&gt;
     &lt;/relation&gt;

      &lt;idzebra xmlns="http://www.indexdata.dk/zebra/"&gt;
        &lt;size&gt;245&lt;/size&gt;
        &lt;localnumber&gt;23&lt;/localnumber&gt;
        &lt;filename&gt;records/dino.xml&lt;/filename&gt;
      &lt;/idzebra&gt;
    &lt;/Zthes&gt;
   </screen>
  </para>
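  <para>
   Other elements of the record can presumably be used as access
   points in the same way.  The following queries are a sketch based
   only on the element names visible in the record above; they
   assume that nested element paths are indexed like top-level
   ones, and the responses are omitted because they depend on the
   data.
   <screen>
    Z&gt; find @attr 1=/Zthes/termId 22
    Z&gt; find @attr 1=/Zthes/relation/termName Brachiosauridae
   </screen>
  </para>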
  <para>
   Now wasn't that easy?
  </para>
 </sect1>


 <sect1 id="example2">
  <title>Example 2: Supporting Interoperable Searches</title>

  <para>
   The problem with the previous example is that you need to know the
   structure of the documents in order to find them.  For example,
   when we wanted to find the record for the taxon
   <foreignphrase role="taxon">Sauroposeidon</foreignphrase>,
   we had to formulate a complex XPath
   <literal>/Zthes/termName</literal>
   which embodies the knowledge that taxon names are specified in a
   <literal>&lt;termName&gt;</literal> element inside the top-level
   <literal>&lt;Zthes&gt;</literal> element.
  </para>
  <para>
   This is bad not just because it requires a lot of typing, but more
   significantly because it ties searching semantics to the physical
   structure of the searched records.  You can't use the same search
   specification to search two databases if their internal
   representations are different.  Consider an alternative taxonomy
   database in which the records have taxon names specified
   inside a <literal>&lt;name&gt;</literal> element nested within a
   <literal>&lt;identification&gt;</literal> element
   inside a top-level <literal>&lt;taxon&gt;</literal> element: then
   you'd need to search for them using
   <literal>1=/taxon/identification/name</literal>.
  </para>
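  <para>
   To make the mismatch concrete, here is how the same logical
   search would have to be phrased against each database.  The
   second database is the hypothetical one described above, so its
   query is purely illustrative.
   <screen>
    Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
    Z&gt; find @attr 1=/taxon/identification/name Sauroposeidon
   </screen>
   A client that broadcasts one query to both servers cannot send
   either of these unchanged.
  </para>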
  <para>
   How, then, can we build broadcasting Information Retrieval
   applications that look for records in many different databases?
   The Z39.50 protocol offers a powerful and general solution to this:
   abstract <quote>access points</quote>.  In the Z39.50 model, an access point
   is simply a point at which searches can be directed.  Nothing is
   said about implementation: in a given database, an access point
   might be implemented as an index, a path into physical records, an
   algorithm for interrogating relational tables, or whatever works.
   The key point is that the semantics of an access point are fixed
   and well defined.
  </para>
  <para>
   For convenience, access points are gathered into <firstterm>attribute
   sets</firstterm>.  For example, the BIB-1 attribute set is supposed to
   contain bibliographic access points such as author, title, subject
   and ISBN; the GEO attribute set contains access points pertaining
   to geospatial information (bounding coordinates, stratum, latitude
   resolution, etc.); the CIMI
   attribute set contains access points to do with museum collections
   (provenance, inscriptions, etc.).
  </para>
  <para>
   In practice, the BIB-1 attribute set has tended to be a dumping
   ground for all sorts of access points, so that, for example, it
   includes some geospatial access points as well as strictly
   bibliographic ones.  Nevertheless, the key point is that this model
   allows a layer of abstraction over the physical representation of
   records in databases.
  </para>
  <para>
   In the BIB-1 attribute set, a taxon name is probably best
   interpreted as a title - that is, a phrase that identifies the item
   in question.  BIB-1 represents title searches by
   access point 4.  (See
   <ulink url="ftp://ftp.loc.gov/pub/z3950/defs/bib1.txt"
        >The BIB-1 Attribute Set Semantics</ulink>.)
   So we need to configure our dinosaur database so that searches for
   BIB-1 access point 4 look in the
   <literal>&lt;termName&gt;</literal> element,
   inside the top-level
   <literal>&lt;Zthes&gt;</literal> element.
  </para>
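  <para>
   Once the database is configured that way, the earlier search
   could be expressed without any reference to the record structure.
   As a sketch, assuming the configuration described in this section
   is in place:
   <screen>
    Z&gt; find @attr 1=4 Sauroposeidon
   </screen>
   The same query could be sent unchanged to any other server that
   maps BIB-1 access point 4 onto its own records.
  </para>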
  <para>
   This is a two-step process.  First, we need to tell Zebra that we
   want to support the BIB-1 attribute set.  Then we need to tell it
   which elements of its records pertain to access point 4.
  </para>
  <para>
   We need to create an <link linkend="abs-file">Abstract Syntax
   file</link> named after the document element of the records we're
   working with, plus a <literal>.abs</literal> suffix - in this case,
   <literal>Zthes.abs</literal> - as follows:
  </para>
  <itemizedlist>
   <listitem>
    <para>

    </para>
   </listitem>
   <listitem>
    <para>
    </para>
   </listitem>
  </itemizedlist>
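  <para>
   As a rough, unverified sketch of what
   <literal>Zthes.abs</literal> might contain: the
   <literal>attset</literal> line follows the draft kept in the
   comments at the end of this file, while the
   <literal>xpath enable</literal> and <literal>xelm</literal> lines
   are assumptions about the
   <link linkend="abs-file">Abstract Syntax file</link> format and
   should be checked against that description before use.
   <screen>
    attset bib1.att
    xpath enable
    xelm /Zthes/termName    title:w
   </screen>
   The first line corresponds to the first step (supporting BIB-1),
   and the last line to the second step (mapping the
   <literal>&lt;termName&gt;</literal> element to the title access
   point).
  </para>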
 </sect1>
</chapter>


<!--
        The simplest hello-world example could go like this:

        Index the document

        <book>
           <title>The art of motorcycle maintenance</title>
           <subject scheme="Dewey">zen</subject>
        </book>

        And search it like

        f @attr 1=/book/title motorcycle

        f @attr 1=/book/subject[@scheme=Dewey] zen

        If you suddenly decide you want broader interop, you can add
        an abs file (more or less like this):

        attset bib1.att
        tagset tagsetg.tag

        elm (2,1)       title   title
        elm (2,21)      subject  subject
-->

<!--
How to include images:

        <mediaobject>
          <imageobject>
            <imagedata fileref="system.eps" format="eps"/>
          </imageobject>
          <imageobject>
            <imagedata fileref="system.gif" format="gif"/>
          </imageobject>
          <textobject>
            <phrase>The Multi-Lingual Search System Architecture</phrase>
          </textobject>
          <caption>
            <para>
              <emphasis role="strong">
                The Multi-Lingual Search System Architecture.
              </emphasis>
            </para>
            <para>
              Network connections across local area networks are
              represented by straight lines, and those over the
              internet by jagged lines.
            </para>
          </caption>
        </mediaobject>

Where the three <*object> thingies inside the top-level <mediaobject>
are decreasingly preferred versions to include depending on what the
rendering engine can handle.  I generated the EPS version of the image
by exporting a line-drawing done in TGIF, then converted that to the
GIF using a shell-script called "epstogif" which used an appallingly
baroque sequence of conversions, which I would prefer not to pollute
the Zebra build environment with:

        #!/bin/sh

        # Yes, what follows is stupidly convoluted, but I can't find a
        # more straightforward path from the EPS generated by tgif's
        # "Print" command into a browser-friendly format.

        file=`echo "$1" | sed 's/\.eps//'`
        ps2pdf "$1" "$file".pdf
        pdftopbm "$file".pdf "$file"
        pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
        rm -f "$file".pdf "$file"-000001.pbm

-->

 <!-- Keep this comment at the end of the file
 Local variables:
 mode: sgml
 sgml-omittag:t
 sgml-shorttag:t
 sgml-minimize-attributes:nil
 sgml-always-quote-attributes:t
 sgml-indent-step:1
 sgml-indent-data:t
 sgml-parent-document: "zebra.xml"
 sgml-local-catalogs: nil
 sgml-namecase-general:t
 End:
 -->