doc/examples.xml

   1 <chapter id="examples">
   2  <!-- $Id: examples.xml,v 1.27 2007-05-24 13:44:09 adam Exp $ -->
   3  <title>Example Configurations</title>
   4
   5  <sect1 id="examples-overview">
   6   <title>Overview</title>
   7
   8   <para>
   9    <command>zebraidx</command> and
  10    <command>zebrasrv</command> are both
  11    driven by a master configuration file, which may refer to other
  12    subsidiary configuration files.  By default, they try to use
  13    <filename>zebra.cfg</filename> in the working directory as the
  14    master file; but this can be changed using the <literal>-c</literal>
  15    option to specify an alternative master configuration file.
  16   </para>
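     <para>
      As a minimal sketch, assuming a master file kept at the
      hypothetical path <filename>/etc/zebra/dino.cfg</filename>
      (the path is an illustration only, not part of the distribution),
      both programs could be pointed at it like this:
      <screen>
       zebraidx -c /etc/zebra/dino.cfg update records
       zebrasrv -c /etc/zebra/dino.cfg
      </screen>
      Note that the <literal>-c</literal> option comes before the
      <literal>update</literal> command of <command>zebraidx</command>.
     </para>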
  17   <para>
  18    The master configuration file tells &zebra;:
  19    <itemizedlist>
  20
  21     <listitem>
  22      <para>
  23       Where to find subsidiary configuration files, including both
  24       those that are named explicitly and a few <quote>magic</quote> files such
  25       as <literal>default.idx</literal>,
  26       which specifies the default indexing rules.
  27      </para>
  28     </listitem>
  29
  30     <listitem>
  31      <para>
  32       What record schemas to support.  (Subsidiary files specify how
  33       to index the contents of records in those schemas, and what
  34       format to use when presenting records in those schemas to client
  35       software.)
  36      </para>
  37     </listitem>
  38
  39     <listitem>
  40      <para>
  41       What attribute sets to recognise in searches.  (Subsidiary files
  42       specify how to interpret the attributes in terms
  43       of the indexes that are created on the records.)
  44      </para>
  45     </listitem>
  46
  47     <listitem>
  48      <para>
  49       Policy details such as what type of input format to expect when
  50       adding new records, what low-level indexing algorithm to use,
  51       how to identify potential duplicate records, etc.
  52      </para>
  53     </listitem>
  54
  55    </itemizedlist>
  56   </para>
  57   <para>
  58    Now let's see what goes in the <literal>zebra.cfg</literal> file
  59    for some example configurations.
  60   </para>
  61  </sect1>
  62
  63  <sect1 id="example1">
  64   <title>Example 1: &acro.xml; Indexing And Searching</title>
  65
  66   <para>
  67    This example shows how &zebra; can be used with absolutely minimal
  68    configuration to index a body of
  69    <ulink url="&url.xml;">&acro.xml;</ulink>
  70    documents, and search them using
  71    <ulink url="&url.xpath;">XPath</ulink>
  72    expressions to specify access points.
  73   </para>
  74   <para>
  75    Go to the <literal>examples/zthes</literal> subdirectory
  76    of the distribution archive.
  77    There you will find a <literal>Makefile</literal> that will
  78    populate the <literal>records</literal> subdirectory with a file of
  79    <ulink url="http://zthes.z3950.org/">Zthes</ulink>
  80    records representing a taxonomic hierarchy of dinosaurs.  (The
  81    records are generated from the family tree in the file
  82    <literal>dino.tree</literal>.)
  83    Type <literal>make records/dino.xml</literal>
  84    to make the &acro.xml; data file.
  85    (Or you could just type <literal>make dino</literal> to build the &acro.xml;
  86    data file, create the database and populate it with the taxonomic
  87    records all in one shot - but then you wouldn't learn anything,
  88    would you?  :-)
  89   </para>
  90   <para>
  91    Now we need to create a &zebra; database to hold and index the &acro.xml;
  92    records.  We do this with the
  93    &zebra; indexer, <command>zebraidx</command>, which is
  94    driven by the <literal>zebra.cfg</literal> configuration file.
  95    For our purposes, we don't need any
  96    special behaviour - we can use the defaults - so we can start with a
  97    minimal file that just tells <command>zebraidx</command> where to
  98    find the default indexing rules, and how to parse the records:
  99    <screen>
 100     profilePath: .:../../tab
 101     recordType: grs.sgml
 102    </screen>
 103   </para>
 104   <para>
 105    That's all you need for a minimal &zebra; configuration.  Now you can
 106    roll the &acro.xml; records into the database and build the indexes:
 107    <screen>
 108     zebraidx update records
 109    </screen>
 110   </para>
 111   <para>
 112    Now start the server.  Like the indexer, its behaviour is
 113    controlled by the
 114    <literal>zebra.cfg</literal> file; and like the indexer, it works
 115    just fine with this minimal configuration.
 116    <screen>
 117         zebrasrv
 118    </screen>
 119    By default, the server listens on TCP port 9999, although
 120    this can easily be changed - see
 121    <xref linkend="zebrasrv"/>.
 122   </para>
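     <para>
      As a sketch of one such change: <command>zebrasrv</command>
      accepts a listener address on the command line, so it could be
      asked to listen on all interfaces on port 2100 (a port number
      chosen here purely for illustration) as follows:
      <screen>
       zebrasrv @:2100
      </screen>
      A client would then connect to <literal>@:2100</literal> instead.
      The rest of this example assumes the default port.
     </para>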
 123   <para>
 124    Now you can use the &acro.z3950; client program of your choice to execute
 125    XPath-based boolean queries and fetch the &acro.xml; records that satisfy
 126    them:
 127    <screen>
 128     $ yaz-client @:9999
 129     Connecting...Ok.
 130     Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
 131     Number of hits: 1
 132     Z&gt; format xml
 133     Z&gt; show 1
 134     &lt;Zthes&gt;
 135      &lt;termId&gt;22&lt;/termId&gt;
 136      &lt;termName&gt;Sauroposeidon&lt;/termName&gt;
 137      &lt;termType&gt;PT&lt;/termType&gt;
 138      &lt;termNote&gt;The tallest known dinosaur (18m)&lt;/termNote&gt;
 139      &lt;relation&gt;
 140       &lt;relationType&gt;BT&lt;/relationType&gt;
 141       &lt;termId&gt;21&lt;/termId&gt;
 142       &lt;termName&gt;Brachiosauridae&lt;/termName&gt;
 143       &lt;termType&gt;PT&lt;/termType&gt;
 144      &lt;/relation&gt;
 145
 146       &lt;idzebra xmlns="http://www.indexdata.dk/zebra/"&gt;
 147         &lt;size&gt;300&lt;/size&gt;
 148         &lt;localnumber&gt;23&lt;/localnumber&gt;
 149         &lt;filename&gt;records/dino.xml&lt;/filename&gt;
 150       &lt;/idzebra&gt;
 151     &lt;/Zthes&gt;
 152    </screen>
 153   </para>
 154   <para>
 155    Now wasn't that nice and easy?
 156   </para>
 157  </sect1>
 158
 159
 160  <sect1 id="example2">
 161   <title>Example 2: Supporting Interoperable Searches</title>
 162
 163   <para>
 164    The problem with the previous example is that you need to know the
 165    structure of the documents in order to find them.  For example,
 166    when we wanted to find the record for the taxon
 167    <foreignphrase role="taxon">Sauroposeidon</foreignphrase>,
 168    we had to formulate a complex XPath
 169    <literal>/Zthes/termName</literal>
 170    which embodies the knowledge that taxon names are specified in a
 171    <literal>&lt;termName&gt;</literal> element inside the top-level
 172    <literal>&lt;Zthes&gt;</literal> element.
 173   </para>
 174   <para>
 175    This is bad not just because it requires a lot of typing, but more
 176    significantly because it ties searching semantics to the physical
 177    structure of the searched records.  You can't use the same search
 178    specification to search two databases if their internal
 179    representations are different.  Consider a different taxonomy
 180    database in which the records have taxon names specified
 181    inside a <literal>&lt;name&gt;</literal> element nested within a
 182    <literal>&lt;identification&gt;</literal> element
 183    inside a top-level <literal>&lt;taxon&gt;</literal> element: then
 184    you'd need to search for them using
 185    <literal>1=/taxon/identification/name</literal>.
 186   </para>
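     <para>
      Put side by side, the same search for the same taxon would have
      to be phrased differently for each database.  In this sketch the
      first query is the one used in Example 1, and the second targets
      the hypothetical taxonomy database just described:
      <screen>
       Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
       Z&gt; find @attr 1=/taxon/identification/name Sauroposeidon
      </screen>
      A client that wants to search both databases has to know both
      record structures in advance.
     </para>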
 187   <para>
 188    How, then, can we build broadcasting Information Retrieval
 189    applications that look for records in many different databases?
 190    The &acro.z3950; protocol offers a powerful and general solution to this:
 191    abstract <quote>access points</quote>.  In the &acro.z3950; model, an access point
 192    is simply a point at which searches can be directed.  Nothing is
 193    said about implementation: in a given database, an access point
 194    might be implemented as an index, a path into physical records, an
 195    algorithm for interrogating relational tables or whatever works.
 196    The only important thing is that the semantics of an access
 197    point is fixed and well defined.
 198   </para>
 199   <para>
 200    For convenience, access points are gathered into <firstterm>attribute
 201    sets</firstterm>.  For example, the &acro.bib1; attribute set is supposed to
 202    contain bibliographic access points such as author, title, subject
 203    and ISBN; the GEO attribute set contains access points pertaining
 204    to geospatial information (bounding coordinates, stratum, latitude
 205    resolution, etc.); the CIMI
 206    attribute set contains access points to do with museum collections
 207    (provenance, inscriptions, etc.).
 208   </para>
 209   <para>
 210    In practice, the &acro.bib1; attribute set has tended to be a dumping
 211    ground for all sorts of access points, so that, for example, it
 212    includes some geospatial access points as well as strictly
 213    bibliographic ones.  Nevertheless, this model
 214    allows a layer of abstraction over the physical representation of
 215    records in databases.
 216   </para>
 217   <para>
 218    In the &acro.bib1; attribute set, a taxon name is probably best
 219    interpreted as a title - that is, a phrase that identifies the item
 220    in question.  &acro.bib1; represents title searches by
 221    access point 4.  (See
 222    <ulink url="&url.z39.50.bib1.semantics;">The &acro.bib1; Attribute
 223     Set Semantics</ulink>.)
 224    So we need to configure our dinosaur database so that searches for
 225    &acro.bib1; access point 4 look in the
 226    <literal>&lt;termName&gt;</literal> element,
 227    inside the top-level
 228    <literal>&lt;Zthes&gt;</literal> element.
 229   </para>
 230   <para>
 231    This is a two-step process.  First, we need to tell &zebra; that we
 232    want to support the &acro.bib1; attribute set.  Then we need to tell it
 233    which elements of its records pertain to access point 4.
 234    </para>
 235    <para>
 236    We need to create an <link linkend="abs-file">Abstract Syntax
 237    file</link> named after the document element of the records we're
 238     working with, plus a <literal>.abs</literal> suffix - in this case,
 239     <literal>Zthes.abs</literal> - as follows:
 240    </para>
 241    <programlistingco>
 242     <areaspec>
 243      <area id="attset.zthes" coords="2"/>
 244      <area id="attset.attset" coords="3"/>
 245      <area id="termId" coords="7"/>
 246      <area id="termName" coords="8"/>
 247     </areaspec>
 248     <programlisting>
 249 attset zthes.att
 250 attset bib1.att
 251 xpath enable
 252 systag sysno none
 253
 254 xelm /Zthes/termId              termId:w
 255 xelm /Zthes/termName            termName:w,title:w
 256 xelm /Zthes/termQualifier       termQualifier:w
 257 xelm /Zthes/termType            termType:w
 258 xelm /Zthes/termLanguage        termLanguage:w
 259 xelm /Zthes/termNote            termNote:w
 260 xelm /Zthes/termCreatedDate     termCreatedDate:w
 261 xelm /Zthes/termCreatedBy       termCreatedBy:w
 262 xelm /Zthes/termModifiedDate    termModifiedDate:w
 263 xelm /Zthes/termModifiedBy      termModifiedBy:w
 264     </programlisting>
 265    <calloutlist>
 266     <callout arearefs="attset.zthes">
 267      <para>
 268       Declare the Thesaurus attribute set. See <filename>zthes.att</filename>.
 269      </para>
 270     </callout>
 271     <callout arearefs="attset.attset">
 272      <para>
 273       Declare &acro.bib1; attribute set. See <filename>bib1.att</filename> in
 274       &zebra;'s <filename>tab</filename> directory.
 275      </para>
 276     </callout>
 277     <callout arearefs="termId">
 278      <para>
 279       This xelm directive selects contents of nodes by XPath expression
 280       <literal>/Zthes/termId</literal>. The contents (CDATA) will be
 281       word searchable by Zthes attribute termId (value 1001).
 282      </para>
 283     </callout>
 284     <callout arearefs="termName">
 285      <para>
 286       Make <literal>termName</literal> word searchable by both
 287       Zthes attribute termName (1002) and &acro.bib1; attribute title (4).
 288      </para>
 289     </callout>
 290    </calloutlist>
 291   </programlistingco>
 292    <para>
 293     After re-indexing, we can search the database using the &acro.bib1;
 294     title attribute (access point 4), as follows:
 295     <screen>
 296 Z&gt; form xml
 297 Z&gt; f @attr 1=4 Eoraptor
 298 Sent searchRequest.
 299 Received SearchResponse.
 300 Search was a success.
 301 Number of hits: 1, setno 1
 302 SearchResult-1: Eoraptor(1)
 303 records returned: 0
 304 Elapsed: 0.106896
 305 Z&gt; s
 306 Sent presentRequest (1+1).
 307 Records: 1
 308 [Default]Record type: &acro.xml;
 309 &lt;Zthes&gt;
 310  &lt;termId&gt;2&lt;/termId&gt;
 311  &lt;termName&gt;Eoraptor&lt;/termName&gt;
 312  &lt;termType&gt;PT&lt;/termType&gt;
 313  &lt;termNote&gt;The most basal known dinosaur&lt;/termNote&gt;
 314  ...
 315     </screen>
 316    </para>
 317  </sect1>
 318 </chapter>
 319
 320
 321 <!--
 322         The simplest hello-world example could go like this:
 323
 324         Index the document
 325
 326         <book>
 327            <title>The art of motorcycle maintenance</title>
 328            <subject scheme="Dewey">zen</subject>
 329         </book>
 330
 331         And search it like
 332
 333         f @attr 1=/book/title motorcycle
 334
 335         f @attr 1=/book/subject[@scheme=Dewey] zen
 336
 337         If you suddenly decide you want broader interop, you can add
 338         an abs file (more or less like this):
 339
 340         attset bib1.att
 341         tagset tagsetg.tag
 342
 343         elm (2,1)       title   title
 344         elm (2,21)      subject  subject
 345 -->
 346
 347 <!--
 348 How to include images:
 349
 350         <mediaobject>
 351           <imageobject>
 352             <imagedata fileref="system.eps" format="eps">
 353           </imageobject>
 354           <imageobject>
 355             <imagedata fileref="system.gif" format="gif">
 356           </imageobject>
 357           <textobject>
 358             <phrase>The Multi-Lingual Search System Architecture</phrase>
 359           </textobject>
 360           <caption>
 361             <para>
 362               <emphasis role="strong">
 363                 The Multi-Lingual Search System Architecture.
 364               </emphasis>
 365               <para>
 366                 Network connections across local area networks are
 367                 represented by straight lines, and those over the
 368                 internet by jagged lines.
 369           </caption>
 370         </mediaobject>
 371
 372 Where the three <*object> thingies inside the top-level <mediaobject>
 373 are decreasingly preferred version to include depending on what the
 374 rendering engine can handle.  I generated the EPS version of the image
 375 by exporting a line-drawing done in TGIF, then converted that to the
 376 GIF using a shell-script called "epstogif" which used an appallingly
 377 baroque sequence of conversions, which I would prefer not to pollute
 378 the &zebra; build environment with:
 379
 380         #!/bin/sh
 381
 382         # Yes, what follows is stupidly convoluted, but I can't find a
 383         # more straightforward path from the EPS generated by tgif's
 384         # "Print" command into a browser-friendly format.
 385
 386         file=`echo "$1" | sed 's/\.eps//'`
 387         ps2pdf "$1" "$file".pdf
 388         pdftopbm "$file".pdf "$file"
 389         pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
 390         rm -f "$file".pdf "$file"-000001.pbm
 391
 392 -->
 393
 394  <!-- Keep this comment at the end of the file
 395  Local variables:
 396  mode: sgml
 397  sgml-omittag:t
 398  sgml-shorttag:t
 399  sgml-minimize-attributes:nil
 400  sgml-always-quote-attributes:t
 401  sgml-indent-step:1
 402  sgml-indent-data:t
 403  sgml-parent-document: "zebra.xml"
 404  sgml-local-catalogs: nil
 405  sgml-namecase-general:t
 406  End:
 407  -->