doc/introduction.xml

   1 <chapter id="introduction">
   2  <!-- $Id: introduction.xml,v 1.33 2006-06-13 13:45:08 marc Exp $ -->
   3  <title>Introduction</title>
   4
   5  <sect1>
   6   <title>Overview</title>
   7
   8   <para>
   9    <ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
  10    is a high-performance, general-purpose structured text
  11    indexing and retrieval engine. It reads records in a
  12    variety of input formats (e.g. email, XML, MARC) and provides access
  13    to them through a powerful combination of boolean search
  14    expressions and relevance-ranked free-text queries.
  15   </para>
  16
  17   <para>
  18    Zebra supports large databases (tens of millions of records,
  19    tens of gigabytes of data). It allows safe, incremental
  20    database updates on live systems. Because Zebra supports
  21    the industry-standard information retrieval protocol, Z39.50,
  22    you can search Zebra databases using an enormous variety of
  23    programs and toolkits, both commercial and free, which understand
  24    this protocol.  Application libraries are available to allow
  25    bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
  26    Basic, Python, PHP and more - see
  27    <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
  28    for more information on some of these client toolkits.
  29   </para>
  30
  31   <para>
  32    This document is an introduction to the Zebra system. It explains
  33    how to compile the software, how to prepare your first database,
  34    and how to configure the server to give you the
  35    functionality that you need.
  36   </para>
  37  </sect1>
  38
  39  <sect1 id="features">
  40   <title>Features</title>
  41
  42   <para>
  43    This is an overview of some of Zebra's most important features:
  44   </para>
  45
  46   <para>
  47    <itemizedlist>
  48
  49     <listitem>
  50      <para>
  51       Very large databases: logical files can be
  52       automatically partitioned over multiple disks.
  53      </para>
  54     </listitem>
  55
  56     <listitem>
  57      <para>
  58       Arbitrarily complex records.  The internal data format
  59       is a structured format conceptually similar to XML or GRS-1,
  60       which allows lists, nested structured data elements and
  61       variant forms of data.
  62      </para>
  63     </listitem>
  64
  65     <listitem>
  66      <para>
  67       Robust updating - records can be added and deleted ``on the fly''
  68       without rebuilding the index from scratch.
  69       Records can be safely updated even while users are accessing
  70       the server.
  71       The update procedure is tolerant of crashes or hard interrupts
  72       during database updating - data can be reconstructed following
  73       a crash.
  74      </para>
  75     </listitem>
  76
  77     <listitem>
  78      <para>
  79       Configurable to understand many input formats.
  80       A system of input filters driven by
  81       regular expressions allows most ASCII-based
  82       data formats to be easily processed.
  83       SGML, XML, ISO2709 (MARC), and raw text are also
  84       supported.
  85      </para>
  86     </listitem>
  87
  88     <listitem>
  89      <para>
  90       Searching supports a powerful combination of boolean queries as
  91       well as relevance-ranking (free-text) queries.  Truncation,
  92       masking, full regular expression matching and "approximate
  93       matching" (e.g. spelling mistakes) are all handled.
  94      </para>
  95     </listitem>
  96
  97     <listitem>
  98       <para>
  99         Index-only databases: data can be, and usually is, imported
 100         into Zebra's own storage, but Zebra can also refer to
 101         external files, building and maintaining indexes of "live"
 102         collections.
 103       </para>
 104     </listitem>
 105
 106     <listitem>
 107      <para>
 108       Zebra is written in portable C, so it runs on most Unix-like systems
 109       as well as Windows NT.  A binary distribution for Windows NT is
 110       available at
 111       <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>,
 112       and pre-built packages are available for
 113       <!--- some Linux
 114       distributions:
 115       Red Hat 7.x RPMs at
 116       <ulink url="http://ftp.indexdata.dk/pub/zebra/RedHat7.X/"/>
 117       and Debian packages at
 118       -->
 119       <literal>Debian GNU/Linux</literal> at
 120       <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/"/>.
 121      </para>
 122     </listitem>
 123
 124    </itemizedlist>
 125
 126   </para>
 127
 128   <para>
 129    Z39.50 protocol support:
 130   </para>
 131
 132   <para>
 133    <itemizedlist>
 134     <listitem>
 135      <para>
 136       Protocol facilities: Init, Search, Present (retrieval),
 137       Segmentation (support for very large records), Delete, Scan
 138       (index browsing), Sort, Close and support for the ``update''
 139       Extended Service to add or replace an existing XML record.
 140         <!-- Adam says:
 141              * Supported
 142              You can insert/delete/replace an XML record given an
 143              "external" ID.  Actually this way of doing ES Update was
 144              meant for an OAI application that Ian Ibbotson had in
 145              mind to implement. The "update" command in YAZ client
 146              implements this on the client side. My plan is to make
 147              this available in ZOOM "extended" soon..
 148         -->
 149      </para>
 150     </listitem>
 151
 152     <listitem>
 153      <para>
 154       Piggy-backed presents are honored in the search request - that
 155       is, a subset of the found records can be returned directly with
 156       a search response, enabling search and retrieval to happen in a
 157       single round-trip.
 158      </para>
 159     </listitem>
 160
 161     <listitem>
 162      <para>
 163       Named result sets are supported.
 164      </para>
 165     </listitem>
 166
 167     <listitem>
 168      <para>
 169       Easily configured to support different application profiles, with
 170       tables for attribute sets, tag sets, and abstract syntaxes.
 171       Additional tables control facilities such as element mappings to
 172       different schemas (e.g. GILS-to-USMARC).
 173      </para>
 174     </listitem>
 175
 176     <listitem>
 177      <para>
 178       Complex composition specifications using Espec-1 (partial support).
 179       Element sets are defined using the Espec-1 capability,
 180       and are specified in configuration files as simple element
 181       requests (and, optionally, variant requests).
 182      </para>
 183     </listitem>
 184
 185     <listitem>
 186      <para>
 187       Multiple record syntaxes
 188       for data retrieval: GRS-1, SUTRS,
 189       XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
 190       and schemas on the fly.
 191      </para>
 192     </listitem>
 193
 194    </itemizedlist>
 195
 196   </para>
 197
 198  </sect1>
 199
 200   <sect1 id="apps">
 201   <title>Applications</title>
 202   <para>
 203    Zebra has been deployed in numerous applications, in both the
 204    academic and commercial worlds, in application domains as diverse
 205    as bibliographic catalogues, geospatial information, structured
 206    vocabulary browsing, government information locators, civic
 207    information systems, environmental observations, museum information
 208    and web indexes.
 209   </para>
 210   <para>
 211    Notable applications include the following:
 212   </para>
 213
 214   <sect2>
 215    <title>DADS - the DTV Article Database Service</title>
 216    <para>
 217     DADS is a huge database of more than ten million records, totalling
 218     over ten gigabytes of data.  The records are metadata about academic
 219     journal articles, primarily scientific; about 10% of these
 220     metadata records link to the full text of the articles they
 221     describe, a body of about a terabyte of information (although the
 222     full text is not indexed).
 223    </para>
 224    <para>
 225     It allows students and researchers at DTU (Danmarks Tekniske
 226     Universitet, the Technical University of Denmark) to find and order
 227     articles from multiple databases in a single query.  The database
 228     contains literature on all engineering subjects.  It's available
 229     on-line through a web gateway, though currently only to registered
 230     users.
 231    </para>
 232    <para>
 233     More information can be found at
 234     <ulink url="http://www.dtv.dk/"/> and
 235     <ulink url="http://dads.dtv.dk"/>.
 236    </para>
 237   </sect2>
 238
 239   <sect2>
 240    <title>Infonet Eprints</title>
 241    <para>
 242      The InfoNet Eprints service from the
 243      <ulink url="http://www.dtv.dk/">
 244       Technical Knowledge Center of Denmark</ulink>
 245      provides access to documents stored in
 246      eprint/preprint servers and institutional research archives around
 247      the world. The service is based on Open Archives Initiative metadata
 248      harvesting of selected scientific archives around the world. These
 249      open archives offer free and unrestricted access to their contents.
 250     </para>
 251    <para>
 252     Infonet Eprints currently holds 1.4 million records from 16 archives.
 253     The online search facility is found at
 254     <ulink url="http://preprints.cvt.dk"/>.
 255    </para>
 256   </sect2>
 257
 258   <sect2>
 259    <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
 260    <para>
 261     Fernuniversit&#x00E4;t Hagen in Germany have developed a natural
 262     language interface for access to library databases.
 263     <!-- <ulink
 264     url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/> -->
 265     In order to evaluate this interface for recall and precision, they
 266     chose Zebra as the basis for measuring retrieval effectiveness.  The Zebra
 267     server contains a copy of the GIRT database, consisting of more
 268     than 76,000 records in SGML format (bibliographic records from
 269     social science), which are mapped to MARC for presentation.
 270    </para>
 271    <para>
 272     (GIRT is the German Indexing and Retrieval Testdatabase.  It is a
 273     standard German-language test database for intelligent indexing
 274     and retrieval systems.  See
 275     <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>.)
 276    </para>
 277    <para>
 278     Evaluation will take place as part of the TREC/CLEF campaign 2003
 279     <ulink url="http://clef.iei.pi.cnr.it"/>.
 280     <!-- or <ulink url="http://www4.eurospider.ch/CLEF/"/> -->
 281    </para>
 282    <para>
 283     For more information, contact Johannes Leveling
 284     <email>Johannes.Leveling@FernUni-Hagen.De</email>.
 285    </para>
 286   </sect2>
 287
 288   <sect2>
 289    <title>ULS (Union List of Serials)</title>
 290    <para>
 291     The M25 Systems Team
 292     has created a union catalogue for the periodicals of the
 293     twenty-one constituent libraries of the University of London and
 294     the University of Westminster
 295     (<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
 296     They have achieved this using an
 297     unusual architecture, which they describe as a
 298     ``non-distributed virtual union catalogue''.
 299    </para>
 300    <para>
 301     The member libraries send in data files representing their
 302     periodicals, including both brief bibliographic data and summary
 303     holdings.  Then 21 individual Z39.50 targets are created, each
 304     using Zebra, and all mounted on a single hardware server.
 305     The live service provides a web gateway allowing Z39.50 searching
 306     of all of the targets or a selection of them.  Zebra's small
 307     footprint allows a relatively modest system to comfortably host
 308     the 21 servers.
 309    </para>
 310    <para>
 311     More information can be found at
 312     <ulink url="http://www.m25lib.ac.uk/ULS/"/>.
 313    </para>
 314   </sect2>
 315
 316   <sect2>
 317    <title>Various web indexes</title>
 318    <para>
 319     Zebra has been used by a variety of institutions to construct
 320     indexes of large web sites, typically in the region of tens of
 321     millions of pages.  In this role, it functions somewhat similarly
 322     to the engine of Google or AltaVista, but for a selected intranet
 323     or a subset of the whole Web.
 324    </para>
 325    <para>
 326     For example, Liverpool University's web-search facility (available on
 327     the home page at
 328     <ulink url="http://www.liv.ac.uk/"/>
 329     and many sub-pages) works by relevance-searching a Zebra database
 330     which is populated by the Harvest-NG web-crawling software.
 331    </para>
 332    <para>
 333     For more information on Liverpool University's intranet search
 334     architecture, contact John Gilbertson
 335     <email>jgilbert@liverpool.ac.uk</email>.
 336    </para>
 337    <para>
 338     Kang-Jin Lee
 339     <email>lee@arco.de</email>
 340     has recently modified the Harvest web indexer to use Zebra as
 341     its native repository engine.  His comments on the switch over
 342     from the old engine are revealing:
 343     <blockquote>
 344      <para>
 345       The first results after some testing with Zebra are very
 346       promising.  The tests were done with around 220,000 SOIF files,
 347       which occupies 1.6GB of disk space.
 348      </para>
 349      <para>
 350       Building the index from scratch takes around one hour with Zebra
 351       where [old-engine] needs around five hours.  While [old-engine]
 352       blocks search requests when updating its index, Zebra can still
 353       answer search requests.
 354       [...]
 355       Zebra supports incremental indexing which will speed up indexing
 356       even further.
 357      </para>
 358      <para>
 359       While the search time of [old-engine] varies from some seconds
 360       to some minutes depending how expensive the query is, Zebra
 361       usually takes around one to three seconds, even for expensive
 362       queries.
 363       [...]
 364       Zebra can search more than 100 times faster than [old-engine]
 365       and can process multiple search requests simultaneously
 366      </para>
 367      <para>
 368       I am very happy to see such nice software available under GPL.
 369      </para>
 370     </blockquote>
 371    </para>
 372   </sect2>
 373  </sect1>
 374
 375
 376  <sect1 id="support">
 377   <title>Support</title>
 378   <para>
 379    You can get support for Zebra from at least three sources.
 380   </para>
 381   <para>
 382    First, there's the Zebra web site at
 383    <ulink url="http://indexdata.dk/zebra/"/>,
 384    which always has the most recent version available for download.
 385    If you have a problem with Zebra, the first thing to do is see
 386    whether it's fixed in the current release.
 387   </para>
 388   <para>
 389    Second, there's the Zebra mailing list.  Its home page at
 390    <ulink url="http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist"/>
 391    includes a complete archive of all messages that have ever been
 392    posted on the list.  The Zebra mailing list is used both for
 393    announcements from the authors (new
 394    releases, bug fixes, etc.) and general discussion.  You are welcome
 395    to seek support there.  Join by filling in the form on the list home page.
 396   </para>
 397   <para>
 398    Third, it's possible to buy a commercial support contract, with
 399    well-defined service levels and response times, from Index Data.
 400    See
 401    <ulink url="http://indexdata.dk/support/"/>
 402    for details.
 403   </para>
 404  </sect1>
 405
 406
 407  <sect1 id="future">
 408   <title>Future Directions</title>
 409
 410   <para>
 411    These are some of the plans that we have for the software in the near
 412    and far future, ordered approximately as we expect to work on them.
 413   </para>
 414
 415   <para>
 416    <itemizedlist>
 417
 418     <listitem>
 419      <para>
 420        Improved support for XML in search and retrieval. Eventually,
 421        the goal is for Zebra to pull double duty as a flexible
 422        information retrieval engine and high-performance XML
 423        repository.  The recent addition of XPath searching is one
 424        example of the kind of enhancement we're working on.
 425      </para>
 426      <para>
 427        There is also the experimental <literal>ALVIS XSLT</literal>
 428        XML input filter, which unleashes the full power of DOM-based
 429        XSLT transformations during indexing and record retrieval. Work
 430        on this filter has been sponsored by the ALVIS EU project
 431        <ulink url="http://www.alvis.info/alvis/"/>. We expect this filter to
 432        mature soon, as it is planned to be included in the version 1.4
 433        release of Zebra.
 434      </para>
 435     </listitem>
 436
 437     <listitem>
 438      <para>
 439        Access to the search engine through a SOAP/RPC API to allow the
 440        construction of applications without requiring Z39.50 tools.
 441        <!--
 442       This will shortly be available by means of Index Data's
 443         <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>-to-Z39.50 gateway, currently in beta test.
 444        -->
 445        Experimental support for the
 446        Search/Retrieve Via URL (<ulink url="&url.sru;">SRU</ulink>)
 447        <ulink url="&url.sru;"/>
 448        REST web service, and the
 449        Search/Retrieve Web Service (<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>)
 450        <ulink url="http://www.loc.gov/standards/sru/srw/"/>
 451        SOAP web service, has recently been added to the YAZ/Zebra
 452        combination, including server-side Common Query Language (<ulink url="&url.cql;">CQL</ulink>)
 453        <ulink url="&url.cql;"/> parsing
 454        and configuration. We are still looking for a sponsor for further testing,
 455        documentation and packaging of this exciting component.
 456      </para>
 457     </listitem>
 458
 459     <listitem>
 460      <para>
 461        Finalisation and documentation of Zebra's C programming
 462        API, allowing updates, database management and other functions
 463        not readily expressed in Z39.50.  We will also consider
 464        exposing the API through SOAP.
 465      </para>
 466     </listitem>
 467
 468     <listitem>
 469      <para>
 470        Support for the use of Perl both for access to the Zebra API
 471        and for building extension ``plug-ins'' such as input filters.
 472        The code for this has been contributed to the source tree by
 473        Peter Popovics
 474        <email>pop@technomat.hu</email>,
 475        and is in the process of being integrated and tested.
 476      </para>
 477     </listitem>
 478
 479     <listitem>
 480      <para>
 481        Improved free-text searching. We're first and foremost octet jockeys and
 482        we're actively looking for organisations or people who'd like
 483        to contribute experience in relevance ranking and text
 484        searching.
 485      </para>
 486     </listitem>
 487
 488    </itemizedlist>
 489   </para>
 490
 491   <para>
 492    Programmers thrive on user feedback. If you are interested in a
 493    facility that you don't see mentioned here, or if there's something
 494    you think we could do better, please drop us a mail.  Better still,
 495    implement it and send us the patches.
 496   </para>
 497   <para>
 498    If you think it's all really neat, you're welcome to drop us a line
 499    saying that, too. You can email us at
 500    <email>info@indexdata.dk</email>
 501    or check the contact info at the end of this manual.
 502   </para>
 503
 504  </sect1>
 505 </chapter>
 506  <!-- Keep this comment at the end of the file
 507  Local variables:
 508  mode: sgml
 509  sgml-omittag:t
 510  sgml-shorttag:t
 511  sgml-minimize-attributes:nil
 512  sgml-always-quote-attributes:t
 513  sgml-indent-step:1
 514  sgml-indent-data:t
 515  sgml-parent-document: "zebra.xml"
 516  sgml-local-catalogs: nil
 517  sgml-namecase-general:t
 518  End:
 519  -->