doc/examples.xml

   1 <chapter id="examples">
   2  <!-- $Id: examples.xml,v 1.9 2002-10-16 20:33:31 mike Exp $ -->
   3  <title>Example Configurations</title>
   4
   5  <sect1>
   6   <title>Overview</title>
   7
   8   <para>
   9    <literal>zebraidx</literal> and <literal>zebrasrv</literal> are both
  10    driven by a master configuration file, which may refer to other
  11    subsidiary configuration files.  By default, they try to use
  12    <filename>zebra.cfg</filename> in the working directory as the
  13    master file; but this can be changed using the <literal>-c</literal>
  14    option to specify an alternative master configuration file.
  15   </para>
  16   <para>
  17    The master configuration file tells Zebra:
  18    <itemizedlist>
  19
  20     <listitem>
  21      <para>
  22       Where to find subsidiary configuration files, including
  23       <literal>default.idx</literal>
  24       which specifies the default indexing rules.
  25      </para>
  26     </listitem>
  27
  28     <listitem>
  29      <para>
  30       What attribute sets to recognise in searches.
  31      </para>
  32     </listitem>
  33
  34     <listitem>
  35      <para>
  36       Policy details such as what record type to expect, what
  37       low-level indexing algorithm to use, how to identify potential
  38       duplicate records, etc.
  39      </para>
  40     </listitem>
  41
  42    </itemizedlist>
  43   </para>
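       <para>
        As a rough sketch of how these three concerns might appear in a
        master file (the directive values below are illustrative
        assumptions, not copied from a shipped configuration):
        <screen>
         # Where to find subsidiary configuration files
         profilePath: .:../../tab:../../../yaz/tab
         # Attribute set to recognise in searches
         attset: bib1.att
         # Policy: what record type to expect, how to identify duplicates
         recordType: grs.sgml
         recordId: file
        </screen>
       </para>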
  44   <para>
  45    Now let's see what goes in the <literal>zebra.cfg</literal> file
  46    for some example configurations.
  47   </para>
  48  </sect1>
  49
  50  <sect1 id="example1">
  51   <title>Example 1: XML Indexing And Searching</title>
  52
  53   <para>
  54    This example shows how Zebra can be used with absolutely minimal
  55    configuration to index a body of
  56    <ulink url="http://www.w3.org/xml/###">XML</ulink>
  57    documents, and search them using
  58    <ulink url="http://www.w3.org/xpath/###">XPath</ulink>
  59    expressions to specify access points.
  60   </para>
  61   <para>
  62    Go to the <literal>examples/dinosauricon</literal> subdirectory
  63    of the distribution archive.
  64    There you will find a <literal>records</literal> subdirectory,
  65    which contains some raw XML data to be added to the database: in
  66    this case, a single file, <literal>genera.xml</literal>,
  67    which contains information about all the known dinosaur genera as of
  68    August 2002.
  69   </para>
  70   <para>
  71    Now we need to create the Zebra database, which we do with the
  72    Zebra indexer, <literal>zebraidx</literal>, which is
  73    driven by the <literal>zebra.cfg</literal> configuration file.
  74    For our purposes, we don't need any
  75    special behaviour - we can use the defaults - so we start with a
  76    minimal file that just tells <literal>zebraidx</literal> where to
  77    find the default indexing rules, and how to parse the records:
  78    <screen>
  79     profilePath: .:../../tab:../../../yaz/tab
  80     recordType: grs.sgml
  81    </screen>
  82   </para>
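       <para>
        The same file can carry comments, which Zebra ignores.  As a
        sketch, an annotated version of the minimal file might read as
        follows (the comment wording is ours, added for illustration):
        <screen>
         # Where to look for subsidiary configuration files
         profilePath: .:../../tab:../../../yaz/tab
         # Treat incoming records as SGML-like (here, XML) documents
         recordType: grs.sgml
        </screen>
       </para>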
  83   <para>
  84    That's all you need for a minimal Zebra configuration.  Now you can
  85    roll the XML records into the database and build the indexes:
  86    <screen>
  87     zebraidx update records
  88    </screen>
  89   </para>
  90   <para>
  91    Now start the server.  Like the indexer, its behaviour is
  92    controlled by the
  93    <literal>zebra.cfg</literal> file; and like the indexer, it works
  94    just fine with this minimal configuration.
  95    <screen>
  96     zebrasrv
  97    </screen>
  98    By default, the server listens on TCP port 9999, although
  99    this can easily be changed - see
 100    <xref linkend="zebrasrv"/>.
 101   </para>
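       <para>
        For instance, the server accepts a listener address on the
        command line.  As a sketch, the following would listen on port
        2100 instead (the port number is an arbitrary choice for
        illustration):
        <screen>
         zebrasrv tcp:@:2100
        </screen>
       </para>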
 102   <para>
 103    Now you can use the Z39.50 client program of your choice to execute
 104    XPath-based boolean queries and fetch the XML records that satisfy
 105    them:
 106    <screen>
 107     $ yaz-client tcp:@:9999
 108     Connecting...Ok.
 109     Z&gt; find @attr 1=/GENUS/SPECIES/AUTHOR/@name Wedel
 110     Number of hits: 1
 111     Z&gt; format xml
 112     Z&gt; show 1
 113     &lt;GENUS name="Sauroposeidon" type="with"&gt;
 114      &lt;MEANING&gt;lizard Poseidon &lt;LOW&gt;(Greek god of, among other things, earthquakes)&lt;/LOW&gt;&lt;/MEANING&gt;
 115      &lt;SPECIES name="proteles"&gt;
 116       &lt;AUTHOR type="vide" name="Franklin" year="2000"&gt;&lt;/AUTHOR&gt;
 117       &lt;AUTHOR name="Wedel, Cifelli, Sanders"&gt;&lt;/AUTHOR&gt;
 118      &lt;/SPECIES&gt;
 119      &lt;PLACE name="Oklahoma"&gt;&lt;/PLACE&gt;
 120      &lt;TIME value="Albian"&gt;&lt;/TIME&gt;
 121      &lt;LENGTH value="30" q="1"&gt;&lt;/LENGTH&gt;
 122      &lt;REMAINS content="rib, cervical vertebrae"&gt;&lt;/REMAINS&gt;
 123      &lt;ESSAY&gt;
 124       &lt;P&gt; This new &lt;NOMEN name="Brachiosaurus"&gt;&lt;/NOMEN&gt;-like &lt;LINK content="dinosaur"&gt;&lt;/LINK&gt;
 125       was perhaps the tallest. With its head raised, it stood 60 feet (nearly
 126       20 m) tall. &lt;/P&gt;
 127      &lt;/ESSAY&gt;
 128
 129       &lt;idzebra xmlns="http://www.indexdata.dk/zebra/"&gt;
 130         &lt;size&gt;593&lt;/size&gt;
 131         &lt;localnumber&gt;891&lt;/localnumber&gt;
 132         &lt;filename&gt;records/genera.xml&lt;/filename&gt;
 133       &lt;/idzebra&gt;
 134     &lt;/GENUS&gt;
 135    </screen>
 136   </para>
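       <para>
        Boolean operators combine such access points in the same prefix
        notation.  As a sketch, assuming the <literal>name</literal>
        attribute of the <literal>&lt;PLACE&gt;</literal> element can be
        searched in the same way as the author name (the record above
        suggests this, but we have not demonstrated it), a conjunction
        might look like this:
        <screen>
         Z&gt; find @and @attr 1=/GENUS/SPECIES/AUTHOR/@name Wedel @attr 1=/GENUS/PLACE/@name Oklahoma
        </screen>
        The server's response is omitted because this query was not run
        for this text.
       </para>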
 137   <para>
 138    Now wasn't that easy?
 139   </para>
 140  </sect1>
 141
 142
 143  <sect1 id="example2">
 144   <title>Example 2: Supporting Interoperable Searches</title>
 145
 146   <para>
 147    The problem with the previous example is that you need to know the
 148    structure of the documents in order to find them.  For example,
 149    when we wanted to know the genera for which Matt Wedel is an
 150    author, we had to formulate a complex XPath
 151    <literal>1=/GENUS/SPECIES/AUTHOR/@name</literal>
 152    which embodies the knowledge that author names are specified in the
 153    <literal>name</literal> attribute of the
 154    <literal>&lt;AUTHOR&gt;</literal> element,
 155    which is inside the
 156    <literal>&lt;SPECIES&gt;</literal> element,
 157    which in turn is inside the top-level
 158    <literal>&lt;GENUS&gt;</literal> element.
 159   </para>
 160   <para>
 161    This is bad not just because it requires a lot of typing, but more
 162    significantly because it ties searching semantics to the physical
 163    structure of the searched records.  You can't use the same search
 164    specification to search two databases if their internal
 165    representations are different.  Consider an alternative dinosaur
 166    database in which the records have author names specified
 167    inside an <literal>&lt;authorName&gt;</literal> element directly
 168    inside a top-level <literal>&lt;taxon&gt;</literal> element: then
 169    you'd need to search for them using
 170    <literal>1=/taxon/authorName</literal>.
 171   </para>
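       <para>
        Side by side, the same author search would have to be phrased
        differently for each database.  The second query is aimed at the
        hypothetical alternative database just described, so it is a
        sketch rather than a real session:
        <screen>
         Z&gt; find @attr 1=/GENUS/SPECIES/AUTHOR/@name Wedel
         Z&gt; find @attr 1=/taxon/authorName Wedel
        </screen>
       </para>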
 172   <para>
 173    How, then, can we build broadcasting Information Retrieval
 174    applications that look for records in many different databases?
 175    The Z39.50 protocol offers a powerful and general solution to this:
 176    abstract <quote>access points</quote>.  In the Z39.50 model, an access point
 177    is simply a point at which searches can be directed.  Nothing is
 178    said about implementation: in a given database, an access point
 179    might be implemented as an index, a path into physical records, an
 180    algorithm for interrogating relational tables or whatever works.
 181    The key point is that the semantics of an access point are fixed
 182    and well defined.
 183   </para>
 184   <para>
 185    For convenience, access points are gathered into <firstterm>attribute
 186    sets</firstterm>.  For example, the BIB-1 attribute set is supposed to
 187    contain bibliographic access points such as author, title, subject
 188    and ISBN; the GEO attribute set contains access points pertaining
 189    to geospatial information (bounding box, ###, etc.); the CIMI
 190    attribute set contains access points to do with museum collections
 191    (provenance, inscriptions, etc.).
 192   </para>
 193   <para>
 194    In practice, the BIB-1 attribute set has tended to be a dumping
 195    ground for all sorts of access points, so that, for example, it
 196    includes some geospatial access points as well as strictly
 197    bibliographic ones.  Nevertheless, the key point is that this model
 198    allows a layer of abstraction over the physical representation of
 199    records in databases.
 200   </para>
 201   <para>
 202    In the BIB-1 attribute set, an author search is represented by
 203    access point 1003.  (See
 204    <ulink url="###bib1-semantics"/>.)
 205    So we need to configure our dinosaur database so that searches for
 206    BIB-1 access point 1003 look at the
 207    <literal>name</literal> attribute of the
 208    <literal>&lt;AUTHOR&gt;</literal> element,
 209    inside the
 210    <literal>&lt;SPECIES&gt;</literal> element,
 211    inside the top-level
 212    <literal>&lt;GENUS&gt;</literal> element.
 213   </para>
 214   <para>
 215    This is a two-step process.  First, we need to tell Zebra that we
 216    want to support the BIB-1 attribute set.  Then we need to tell it
 217    which elements of its records pertain to access point 1003.
 218   </para>
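       <para>
        Once both steps are done, the aim is that a client can run the
        author search with no knowledge of the record structure.  As a
        sketch of the intended result (not a session captured from a
        configured server):
        <screen>
         Z&gt; find @attr 1=1003 Wedel
        </screen>
       </para>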
 219  </sect1>
 220 </chapter>
 221
 222
 223 <!--
 224   <para>
 225    You may have noticed as <literal>zebraidx</literal> was building
 226    the database that it issued a warning, which we ignored at the
 227    time:
 228    <screen>
 229     $ zebraidx update records
 230     00:45:46-08/10: ../../index/zebraidx(5016) [warn] records/genera.xml:0 Couldn't open GENUS.abs [No such file or directory]
 231    </screen>
 232    FIXME ### This needs more text
 233   </para>
 234 -->
 235
 236 <!--
 237
 238    <listitem>
 239     <para>
 240      The master configuration file, <literal>zebra.cfg</literal>,
 241      which is as short and simple as it can be:
 242      <screen>
 243         # $Header: /home/cvsroot/idis/doc/examples.xml,v 1.9 2002-10-16 20:33:31 mike Exp $
 244         # Bare-bones master configuration file for Zebra
 245         profilePath: .:../../tab:../../../yaz/tab
 246      </screen>
 247      Apart from the comments, which are ignored, all this specifies is
 248      that the server should recognise the attribute set described in
 249      the file called
 250      <literal>bib1.att</literal>.
 251      ### What is an attribute set?
 252     </para>
 253    </listitem>
 254
 255    <listitem>
 256     <para>
 257      The BIB-1 attribute set configuration file,
 258      <literal>bib1.att</literal>, which is also as short as possible:
 259      <screen>
 260         # $Header: /home/cvsroot/idis/doc/examples.xml,v 1.9 2002-10-16 20:33:31 mike Exp $
 261         # Bare-bones BIB-1 attribute set file for Zebra
 262         reference Bib-1
 263      </screen>
 264      Apart from the comments, all this specifies is that the reference of
 265      the attribute set described by this file is
 266      <literal>Bib-1</literal>, a name recognised by the system as
 267      referring to a well-known opaque identifier that is transmitted
 268      by clients as part of their searches.
 269      ### Yeuch!  Surely we can say that better!
 270     </para>
 271     <para>
 272      ### Can't we somehow say this trivial thing in the main
 273      configuration file?
 274     </para>
 275    </listitem>
 276 -->
 277
 278 <!--
 279         The simplest hello-world example could go like this:
 280
 281         Index the document
 282
 283         <book>
 284            <title>The art of motorcycle maintenance</title>
 285            <subject scheme="Dewey">zen</subject>
 286         </book>
 287
 288         And search it like
 289
 290         f @attr 1=/book/title motorcycle
 291
 292         f @attr 1=/book/subject[@scheme=Dewey] zen
 293
 294         If you suddenly decide you want broader interop, you can add
 295         an abs file (more or less like this):
 296
 297         attset bib1.att
 298         tagset tagsetg.tag
 299
 300         elm (2,1)       title   title
 301         elm (2,21)      subject  subject
 302 -->
 303
 304 <!--
 305 How to include images:
 306
 307         <mediaobject>
 308           <imageobject>
 309             <imagedata fileref="system.eps" format="eps"/>
 310           </imageobject>
 311           <imageobject>
 312             <imagedata fileref="system.gif" format="gif"/>
 313           </imageobject>
 314           <textobject>
 315             <phrase>The Multi-Lingual Search System Architecture</phrase>
 316           </textobject>
 317           <caption>
 318             <para>
 319               <emphasis role="strong">
 320                 The Multi-Lingual Search System Architecture.
 321               </emphasis>
 322               </para>
 323               <para>Network connections across local area networks are
 324                 represented by straight lines, and those over the
 325                 internet by jagged lines.</para>
 326           </caption>
 327         </mediaobject>
 328
 329 Where the three <*object> thingies inside the top-level <mediaobject>
 330 are decreasingly preferred versions to include, depending on what the
 331 rendering engine can handle.  I generated the EPS version of the image
 332 by exporting a line-drawing done in TGIF, then converted that to the
 333 GIF using a shell-script called "epstogif" which used an appallingly
 334 baroque sequence of conversions, which I would prefer not to pollute
 335 the Zebra build environment with:
 336
 337         #!/bin/sh
 338
 339         # Yes, what follows is stupidly convoluted, but I can't find a
 340         # more straightforward path from the EPS generated by tgif's
 341         # "Print" command into a browser-friendly format.
 342
 343         file=`echo "$1" | sed 's/\.eps//'`
 344         ps2pdf "$1" "$file".pdf
 345         pdftopbm "$file".pdf "$file"
 346         pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
 347         rm -f "$file".pdf "$file"-000001.pbm
 348
 349 -->
 350
 351  <!-- Keep this comment at the end of the file
 352  Local variables:
 353  mode: sgml
 354  sgml-omittag:t
 355  sgml-shorttag:t
 356  sgml-minimize-attributes:nil
 357  sgml-always-quote-attributes:t
 358  sgml-indent-step:1
 359  sgml-indent-data:t
 360  sgml-parent-document: "zebra.xml"
 361  sgml-local-catalogs: nil
 362  sgml-namecase-general:t
 363  End:
 364  -->