doc/examples.xml

   1  <chapter id="examples">
   2   <title>Example Configurations</title>
   3
   4   <sect1 id="examples-overview">
   5    <title>Overview</title>
   6
   7    <para>
   8     <command>zebraidx</command> and
   9     <command>zebrasrv</command> are both
  10     driven by a master configuration file, which may refer to other
  11     subsidiary configuration files.  By default, they use
  12     <filename>zebra.cfg</filename> in the working directory as the
  13     master file, but the <literal>-c</literal> option can be used
  14     to specify an alternative master configuration file.
  15    </para>
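         <para>
          As a minimal sketch, both programs could be pointed at an alternative
          master file as shown below.  The filename
          <filename>other.cfg</filename> and the <literal>records</literal>
          directory are purely illustrative; note that the
          <literal>-c</literal> option precedes the
          <command>zebraidx</command> command name.
          <screen>
           zebraidx -c other.cfg update records
           zebrasrv -c other.cfg
          </screen>
         </para>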
  16    <para>
  17     The master configuration file tells &zebra;:
  18     <itemizedlist>
  19
  20      <listitem>
  21       <para>
  22        Where to find subsidiary configuration files, including both
  23        those that are named explicitly and a few <quote>magic</quote> files such
  24        as <literal>default.idx</literal>,
  25        which specifies the default indexing rules.
  26       </para>
  27      </listitem>
  28
  29      <listitem>
  30       <para>
  31        What record schemas to support.  (Subsidiary files specify how
  32        to index the contents of records in those schemas, and what
  33        format to use when presenting records in those schemas to client
  34        software.)
  35       </para>
  36      </listitem>
  37
  38      <listitem>
  39       <para>
  40        What attribute sets to recognise in searches.  (Subsidiary files
  41        specify how to interpret the attributes in terms
  42        of the indexes that are created on the records.)
  43       </para>
  44      </listitem>
  45
  46      <listitem>
  47       <para>
  48        Policy details such as what type of input format to expect when
  49        adding new records, what low-level indexing algorithm to use,
  50        how to identify potential duplicate records, etc.
  51       </para>
  52      </listitem>
  53
  54     </itemizedlist>
  55    </para>
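         <para>
          A hypothetical master file touching on several of these points might
          look roughly like the sketch below.  The particular values are
          assumptions chosen for illustration, not recommendations, and the
          list of settings is far from complete.
          <screen>
           # Where to look for subsidiary configuration files
           profilePath: .:../../tab
           # Attribute set to recognise in searches
           attset: bib1.att
           # Input format to expect when adding new records
           recordType: grs.sgml
           # How to recognise a record that has been seen before
           recordId: file
          </screen>
         </para>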
  56    <para>
  57     Now let's see what goes in the <literal>zebra.cfg</literal> file
  58     for some example configurations.
  59    </para>
  60   </sect1>
  61
  62   <sect1 id="example1">
  63    <title>Example 1: &acro.xml; Indexing And Searching</title>
  64
  65    <para>
  66     This example shows how &zebra; can be used with absolutely minimal
  67     configuration to index a body of
  68     <ulink url="&url.xml;">&acro.xml;</ulink>
  69     documents, and search them using
  70     <ulink url="&url.xpath;">XPath</ulink>
  71     expressions to specify access points.
  72    </para>
  73    <para>
  74     Go to the <literal>examples/zthes</literal> subdirectory
  75     of the distribution archive.
  76     There you will find a <literal>Makefile</literal> that will
  77     populate the <literal>records</literal> subdirectory with a file of
  78     <ulink url="http://zthes.z3950.org/">Zthes</ulink>
  79     records representing a taxonomic hierarchy of dinosaurs.  (The
  80     records are generated from the family tree in the file
  81     <literal>dino.tree</literal>.)
  82     Type <literal>make records/dino.xml</literal>
  83     to make the &acro.xml; data file.
  84     (Or you could just type <literal>make dino</literal> to build the &acro.xml;
  85     data file, create the database and populate it with the taxonomic
  86     records all in one shot - but then you wouldn't learn anything,
  87     would you?  :-)
  88    </para>
  89    <para>
  90     Now we need to create a &zebra; database to hold and index the &acro.xml;
  91     records.  We do this with the
  92     &zebra; indexer, <command>zebraidx</command>, which is
  93     driven by the <literal>zebra.cfg</literal> configuration file.
  94     For our purposes, we don't need any
  95     special behaviour - we can use the defaults - so we can start with a
  96     minimal file that just tells <command>zebraidx</command> where to
  97     find the default indexing rules, and how to parse the records:
  98     <screen>
  99      profilePath: .:../../tab
 100      recordType: grs.sgml
 101     </screen>
 102    </para>
 103    <para>
 104     That's all you need for a minimal &zebra; configuration.  Now you can
 105     roll the &acro.xml; records into the database and build the indexes:
 106     <screen>
 107      zebraidx update records
 108     </screen>
 109    </para>
 110    <para>
 111     Now start the server.  Like the indexer, its behaviour is
 112     controlled by the
 113     <literal>zebra.cfg</literal> file; and like the indexer, it works
 114     just fine with this minimal configuration.
 115     <screen>
 116      zebrasrv
 117     </screen>
 118     By default, the server listens on TCP port 9999, although
 119     this can easily be changed - see
 120     <xref linkend="zebrasrv"/>.
 121    </para>
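         <para>
          For instance, assuming you wanted the server to listen on port 2100
          instead (the port number here is arbitrary), a listener address
          could be given on the command line:
          <screen>
           zebrasrv tcp:@:2100
          </screen>
          A client would then connect to <literal>@:2100</literal> rather
          than <literal>@:9999</literal>.
         </para>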
 122    <para>
 123     Now you can use the &acro.z3950; client program of your choice to execute
 124     XPath-based boolean queries and fetch the &acro.xml; records that satisfy
 125     them:
 126     <screen>
 127      $ yaz-client @:9999
 128      Connecting...Ok.
 129      Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
 130      Number of hits: 1
 131      Z&gt; format xml
 132      Z&gt; show 1
 133      &lt;Zthes&gt;
 134      &lt;termId&gt;22&lt;/termId&gt;
 135      &lt;termName&gt;Sauroposeidon&lt;/termName&gt;
 136      &lt;termType&gt;PT&lt;/termType&gt;
 137      &lt;termNote&gt;The tallest known dinosaur (18m)&lt;/termNote&gt;
 138      &lt;relation&gt;
 139      &lt;relationType&gt;BT&lt;/relationType&gt;
 140      &lt;termId&gt;21&lt;/termId&gt;
 141      &lt;termName&gt;Brachiosauridae&lt;/termName&gt;
 142      &lt;termType&gt;PT&lt;/termType&gt;
 143      &lt;/relation&gt;
 144
 145      &lt;idzebra xmlns="http://www.indexdata.dk/zebra/"&gt;
 146      &lt;size&gt;300&lt;/size&gt;
 147      &lt;localnumber&gt;23&lt;/localnumber&gt;
 148      &lt;filename&gt;records/dino.xml&lt;/filename&gt;
 149      &lt;/idzebra&gt;
 150      &lt;/Zthes&gt;
 151     </screen>
 152    </para>
 153    <para>
 154     Now wasn't that nice and easy?
 155    </para>
 156   </sect1>
 157
 158
 159   <sect1 id="example2">
 160    <title>Example 2: Supporting Interoperable Searches</title>
 161
 162    <para>
 163     The problem with the previous example is that you need to know the
 164     structure of the documents in order to find them.  For example,
 165     when we wanted to find the record for the taxon
 166     <foreignphrase role="taxon">Sauroposeidon</foreignphrase>,
 167     we had to formulate the XPath expression
 168     <literal>/Zthes/termName</literal>
 169     which embodies the knowledge that taxon names are specified in a
 170     <literal>&lt;termName&gt;</literal> element inside the top-level
 171     <literal>&lt;Zthes&gt;</literal> element.
 172    </para>
 173    <para>
 174     This is bad not just because it requires a lot of typing, but more
 175     significantly because it ties searching semantics to the physical
 176     structure of the searched records.  You can't use the same search
 177     specification to search two databases if their internal
 178     representations are different.  Consider a different taxonomy
 179     database in which the records have taxon names specified
 180     inside a <literal>&lt;name&gt;</literal> element nested within a
 181     <literal>&lt;identification&gt;</literal> element
 182     inside a top-level <literal>&lt;taxon&gt;</literal> element: then
 183     you'd need to search for them using
 184     <literal>1=/taxon/identification/name</literal>.
 185    </para>
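         <para>
          To make the contrast concrete, the same logical search would have
          to be phrased differently for each database.  The second query
          below is hypothetical: it assumes a database whose records have the
          <literal>&lt;taxon&gt;</literal> structure just described.
          <screen>
           Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
           Z&gt; find @attr 1=/taxon/identification/name Sauroposeidon
          </screen>
         </para>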
 186    <para>
 187     How, then, can we build broadcasting Information Retrieval
 188     applications that look for records in many different databases?
 189     The &acro.z3950; protocol offers a powerful and general solution to this:
 190     abstract <quote>access points</quote>.  In the &acro.z3950; model, an access point
 191     is simply a point at which searches can be directed.  Nothing is
 192     said about implementation: in a given database, an access point
 193     might be implemented as an index, a path into physical records, an
 194     algorithm for interrogating relational tables or whatever works.
 195     The only important thing is that the semantics of an access
 196     point is fixed and well defined.
 197    </para>
 198    <para>
 199     For convenience, access points are gathered into <firstterm>attribute
 200      sets</firstterm>.  For example, the &acro.bib1; attribute set is supposed to
 201     contain bibliographic access points such as author, title, subject
 202     and ISBN; the GEO attribute set contains access points pertaining
 203     to geospatial information (bounding coordinates, stratum, latitude
 204     resolution, etc.); the CIMI
 205     attribute set contains access points to do with museum collections
 206     (provenance, inscriptions, etc.).
 207    </para>
 208    <para>
 209     In practice, the &acro.bib1; attribute set has tended to be a dumping
 210     ground for all sorts of access points, so that, for example, it
 211     includes some geospatial access points as well as strictly
 212     bibliographic ones.  Nevertheless, this model
 213     allows a layer of abstraction over the physical representation of
 214     records in databases.
 215    </para>
 216    <para>
 217     In the &acro.bib1; attribute set, a taxon name is probably best
 218     interpreted as a title - that is, a phrase that identifies the item
 219     in question.  &acro.bib1; represents title searches by
 220     access point 4.  (See
 221     <ulink url="&url.z39.50.bib1.semantics;">The &acro.bib1; Attribute
 222      Set Semantics</ulink>.)
 223     So we need to configure our dinosaur database so that searches for
 224     &acro.bib1; access point 4 look in the
 225     <literal>&lt;termName&gt;</literal> element,
 226     inside the top-level
 227     <literal>&lt;Zthes&gt;</literal> element.
 228    </para>
 229    <para>
 230     This is a two-step process.  First, we need to tell &zebra; that we
 231     want to support the &acro.bib1; attribute set.  Then we need to tell it
 232     which elements of its records pertain to access point 4.
 233    </para>
 234    <para>
 235     We need to create an <link linkend="abs-file">Abstract Syntax
 236      file</link> named after the document element of the records we're
 237     working with, plus a <literal>.abs</literal> suffix - in this case,
 238     <literal>Zthes.abs</literal> - as follows:
 239    </para>
 240    <programlistingco>
 241     <areaspec>
 242      <area id="attset.zthes" coords="2"/>
 243      <area id="attset.attset" coords="3"/>
 244      <area id="termId" coords="7"/>
 245      <area id="termName" coords="8"/>
 246     </areaspec>
 247    <programlisting>
 248     attset zthes.att
 249     attset bib1.att
 250     xpath enable
 251     systag sysno none
 252
 253     xelm /Zthes/termId              termId:w
 254     xelm /Zthes/termName            termName:w,title:w
 255     xelm /Zthes/termQualifier       termQualifier:w
 256     xelm /Zthes/termType            termType:w
 257     xelm /Zthes/termLanguage        termLanguage:w
 258     xelm /Zthes/termNote            termNote:w
 259     xelm /Zthes/termCreatedDate     termCreatedDate:w
 260     xelm /Zthes/termCreatedBy       termCreatedBy:w
 261     xelm /Zthes/termModifiedDate    termModifiedDate:w
 262     xelm /Zthes/termModifiedBy      termModifiedBy:w
 263    </programlisting>
 264    <calloutlist>
 265     <callout arearefs="attset.zthes">
 266      <para>
 267       Declare Thesaurus attribute set. See <filename>zthes.att</filename>.
 268      </para>
 269     </callout>
 270     <callout arearefs="attset.attset">
 271      <para>
 272       Declare &acro.bib1; attribute set. See <filename>bib1.att</filename> in
 273       &zebra;'s <filename>tab</filename> directory.
 274      </para>
 275     </callout>
 276     <callout arearefs="termId">
 277      <para>
 278       This <literal>xelm</literal> directive selects the contents of nodes
 279       matching the XPath expression <literal>/Zthes/termId</literal>. The contents
 280       (CDATA) will be word searchable by Zthes attribute termId (value 1001).
 281      </para>
 282     </callout>
 283     <callout arearefs="termName">
 284      <para>
 285       Make <literal>termName</literal> word searchable by both
 286       Zthes attribute termName (1002) and &acro.bib1; attribute title (4).
 287      </para>
 288     </callout>
 289    </calloutlist>
 290   </programlistingco>
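         <para>
          Re-indexing might be done along the following lines.  This is a
          sketch that assumes the same <literal>records</literal> directory
          as in Example 1: <literal>zebraidx init</literal> discards the
          existing register before the records are loaded again, and the
          server is then restarted so that it picks up the new indexes.
          <screen>
           zebraidx init
           zebraidx update records
           zebrasrv
          </screen>
         </para>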
 291    <para>
 292     After re-indexing, we can search the database using the &acro.bib1;
 293     title attribute, as follows:
 294     <screen>
 295      Z&gt; form xml
 296      Z&gt; f @attr 1=4 Eoraptor
 297      Sent searchRequest.
 298      Received SearchResponse.
 299      Search was a success.
 300      Number of hits: 1, setno 1
 301      SearchResult-1: Eoraptor(1)
 302      records returned: 0
 303      Elapsed: 0.106896
 304      Z&gt; s
 305      Sent presentRequest (1+1).
 306      Records: 1
 307      [Default]Record type: &acro.xml;
 308      &lt;Zthes&gt;
 309      &lt;termId&gt;2&lt;/termId&gt;
 310      &lt;termName&gt;Eoraptor&lt;/termName&gt;
 311      &lt;termType&gt;PT&lt;/termType&gt;
 312      &lt;termNote&gt;The most basal known dinosaur&lt;/termNote&gt;
 313      ...
 314     </screen>
 315    </para>
 316   </sect1>
 317  </chapter>
 318
 319
 320  <!--
 321  The simplest hello-world example could go like this:
 322
 323  Index the document
 324
 325  <book>
 326  <title>The art of motorcycle maintenance</title>
 327  <subject scheme="Dewey">zen</subject>
 328         </book>
 329
 330  And search it like
 331
 332  f @attr 1=/book/title motorcycle
 333
 334  f @attr 1=/book/subject[@scheme=Dewey] zen
 335
 336  If you suddenly decide you want broader interop, you can add
 337  an abs file (more or less like this):
 338
 339  attset bib1.att
 340  tagset tagsetg.tag
 341
 342  elm (2,1)       title   title
 343  elm (2,21)      subject  subject
 344  -->
 345
 346  <!--
 347  How to include images:
 348
 349  <mediaobject>
 350  <imageobject>
 351  <imagedata fileref="system.eps" format="eps">
 352           </imageobject>
 353  <imageobject>
 354  <imagedata fileref="system.gif" format="gif">
 355           </imageobject>
 356  <textobject>
 357  <phrase>The Multi-Lingual Search System Architecture</phrase>
 358           </textobject>
 359  <caption>
 360  <para>
 361  <emphasis role="strong">
 362  The Multi-Lingual Search System Architecture.
 363               </emphasis>
 364  <para>
 365  Network connections across local area networks are
 366  represented by straight lines, and those over the
 367  internet by jagged lines.
 368           </caption>
 369         </mediaobject>
 370
 371  Where the three <*object> thingies inside the top-level <mediaobject>
 372  are decreasingly preferred version to include depending on what the
 373  rendering engine can handle.  I generated the EPS version of the image
 374  by exporting a line-drawing done in TGIF, then converted that to the
 375  GIF using a shell-script called "epstogif" which used an appallingly
 376  baroque sequence of conversions, which I would prefer not to pollute
 377  the &zebra; build environment with:
 378
 379  #!/bin/sh
 380
 381  # Yes, what follows is stupidly convoluted, but I can't find a
 382  # more straightforward path from the EPS generated by tgif's
 383  # "Print" command into a browser-friendly format.
 384
 385  file=`echo "$1" | sed 's/\.eps//'`
 386  ps2pdf "$1" "$file".pdf
 387  pdftopbm "$file".pdf "$file"
 388  pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
 389  rm -f "$file".pdf "$file"-000001.pbm
 390
 391  -->
 392
 393  <!-- Keep this comment at the end of the file
 394  Local variables:
 395  mode: sgml
 396  sgml-omittag:t
 397  sgml-shorttag:t
 398  sgml-minimize-attributes:nil
 399  sgml-always-quote-attributes:t
 400  sgml-indent-step:1
 401  sgml-indent-data:t
 402  sgml-parent-document: "idzebra.xml"
 403  sgml-local-catalogs: nil
 404  sgml-namecase-general:t
 405  End:
 406  -->