doc/introduction.xml

   1 <chapter id="introduction">
   2  <!-- $Id: introduction.xml,v 1.19 2002-10-20 14:02:03 mike Exp $ -->
   3  <title>Introduction</title>
   4
   5  <sect1>
   6   <title>Overview</title>
   7
   8   <para>
   9    <ulink url="http://indexdata.dk/zebra/">
  10      Zebra</ulink>
  11    is a high-performance, general-purpose structured text
  12    indexing and retrieval engine. It reads structured records in a
  13    variety of input formats (e.g. email, XML, MARC) and provides access
  14    to them through a powerful combination of boolean search
  15    expressions and relevance-ranked free-text queries.
  16   </para>
  17
  18   <para>
  19    Zebra supports large databases (tens of millions of records,
  20    tens of gigabytes of data). It allows safe, incremental
  21    database updates on live systems. Because Zebra supports
  22    the industry-standard information retrieval protocol, Z39.50,
  23    you can search Zebra databases using an enormous variety of
  24    programs and toolkits, both commercial and free, which understand
  25    this protocol.  Application libraries are available to allow
  26    bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
  27    Basic, Python, PHP and more - see
  28    <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
  29    for more information on some of these client toolkits.
  30   </para>
  31
  32   <para>
  33    This document is an introduction to the Zebra system. It explains
  34    how to compile the software, how to prepare your first database,
  35    and how to configure the server to give you the
  36    functionality that you need.
  37   </para>
  38  </sect1>
  39
  40  <sect1 id="features">
  41   <title>Features</title>
  42
  43   <para>
  44    This is an overview of some of Zebra's most important features:
  45   </para>
  46
  47   <para>
  48    <itemizedlist>
  49
  50     <listitem>
  51      <para>
  52       Very large databases: files for indexes, etc. can be
  53       automatically partitioned over multiple disks.
  54      </para>
  55     </listitem>
  56
  57     <listitem>
  58      <para>
  59       Arbitrarily complex records.  The internal data format
  60       is a structured format conceptually similar to XML or GRS-1,
  61       which allows lists, nested structured data elements and
  62       variant forms of data.
  63      </para>
  64     </listitem>
  65
  66     <listitem>
  67      <para>
  68       Robust updating - records can be added and deleted "on the fly"
  69       without rebuilding the index from scratch.
  70       Records can be safely updated even while users are accessing
  71       the server.
  72       The update procedure is tolerant of crashes or hard interrupts
  73       during database updating - data can be reconstructed following
  74       a crash.
  75      </para>
  76     </listitem>
  77
  78     <listitem>
  79      <para>
  80       Configurable to understand many input formats.
  81       A system of input filters driven by
  82       regular expressions allows most ASCII-based
  83       data formats to be easily processed.
  84       SGML, XML, ISO2709 (MARC), and raw text are also
  85       supported.
  86      </para>
  87     </listitem>
  88
  89     <listitem>
  90      <para>
  91       Searching supports a powerful combination of boolean queries
  92       and relevance-ranked (free-text) queries.  Truncation,
  93       masking, full regular expression matching and "approximate
  94       matching" (e.g. spelling mistakes) are all handled.
  95      </para>
  96     </listitem>
  97
  98     <listitem>
  99       <para>
 100         Index-only databases: data can be, and usually is, imported
 101         into Zebra's own storage, but Zebra can also refer to
 102         external files, building and maintaining indexes of "live"
 103         collections.
 104       </para>
 105     </listitem>
 106
 107     <listitem>
 108      <para>
 109       Zebra is written in portable C, so it runs on most Unix-like systems
 110       as well as Windows NT.  A binary distribution for Windows NT is
 111       available at
 112       <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>,
 113       and pre-built packages are available for some Linux
 114       distributions:
 115       Red Hat 7.x RPMs at
 116       <ulink url="http://ftp.indexdata.dk/pub/zebra/RedHat7.X/"/>
 117       and Debian packages at
 118       <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/"/>.
 119      </para>
 120     </listitem>
 121
 122    </itemizedlist>
 123
 124   </para>
 125
 126   <para>
 127    Z39.50 protocol support:
 128   </para>
 129
 130   <para>
 131    <itemizedlist>
 132     <listitem>
 133      <para>
 134       Protocol facilities: Init, Search, Present (retrieval),
 135       Segmentation (support for very large records), Delete, Scan
 136       (index browsing), Sort, Close and support for the "update"
 137       Extended Service to add a new XML record or replace an existing one.
 138         <!-- Adam says:
 139              * Supported
 140              You can insert/delete/replace an XML record given an
 141              "external" ID.  Actually this way of doing ES Update was
 142              meant for an OAI application that Ian Ibbotson had in
 143              mind to implement. The "update" command in YAZ client
 144              implements this on the client side. My plan is to make
 145              this available in ZOOM "extended" soon..
 146         -->
 147      </para>
 148     </listitem>
 149
 150     <listitem>
 151      <para>
 152       Piggy-backed presents are honored in the search request - that
 153       is, a subset of the found records can be returned directly with
 154       a search response, enabling search and retrieval to happen in a
 155       single round-trip.
 156      </para>
 157     </listitem>
 158
 159     <listitem>
 160      <para>
 161       Named result sets are supported.
 162      </para>
 163     </listitem>
 164
 165     <listitem>
 166      <para>
 167       Easily configured to support different application profiles, with
 168       tables for attribute sets, tag sets, and abstract syntaxes.
 169       Additional tables control facilities such as element mappings to
 170       different schemas (e.g. GILS-to-USMARC).
 171      </para>
 172     </listitem>
 173
 174     <listitem>
 175      <para>
 176       Complex composition specifications using Espec-1 (partial support).
 177       Element sets are defined using the Espec-1 capability,
 178       and are specified in configuration files as simple element
 179       requests (and, optionally, variant requests).
 180      </para>
 181     </listitem>
 182
 183     <listitem>
 184      <para>
 185       Multiple record syntaxes
 186       for data retrieval: GRS-1, SUTRS,
 187       XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
 188       and schemas on the fly.
 189      </para>
 190     </listitem>
 191
 192    </itemizedlist>
 193
 194   </para>
 195
 196  </sect1>
 197
 198   <sect1 id="apps">
 199   <title>Applications</title>
 200   <para>
 201    Zebra has been deployed in numerous applications, in both the
 202    academic and commercial worlds, in application domains as diverse
 203    as bibliographic catalogues, geospatial information, structured
 204    vocabulary browsing, government information locators, civic
 205    information systems, environmental observations, museum information
 206    and web indexes.
 207   </para>
 208   <para>
 209    Notable applications include the following:
 210   </para>
 211
 212   <sect2>
 213    <title>DADS - the DTV Article Database Service</title>
 214    <para>
 215     DADS is a huge database of more than ten million records, totalling
 216     over ten gigabytes of data.  The records are metadata about academic
 217     journal articles, primarily scientific; about 10% of these
 218     metadata records link to the full text of the articles they
 219     describe, a body of about a terabyte of information (although the
 220     full text is not indexed).
 221    </para>
 222    <para>
 223     It allows students and researchers at DTU (Danmarks Tekniske
 224     Universitet, the Technical University of Denmark) to find and order
 225     articles from multiple databases in a single query.  The database
 226     contains literature on all engineering subjects.  It's available
 227     on-line through a web gateway, though currently only to registered
 228     users.
 229    </para>
 230    <para>
 231     More information can be found at
 232     <ulink url="http://www.dtv.dk/help/dads/index_e.htm"/>.
 233    </para>
 234   </sect2>
 235
 236   <sect2>
 237    <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
 238    <para>
 239     Fernuniversität Hagen in Germany have developed a natural
 240     language interface for access to library databases
 241     (<ulink url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/>).
 242     In order to evaluate this interface for recall and precision, they
 243     chose Zebra as the underlying retrieval engine.  The Zebra
 244     server contains a copy of the GIRT database, consisting of more
 245     than 76,000 records in SGML format (bibliographic records from
 246     social science), which are mapped to MARC for presentation.
 247    </para>
 248    <para>
 249     (GIRT is the German Indexing and Retrieval Testdatabase.  It is a
 250     standard German-language test database for intelligent indexing
 251     and retrieval systems.  See
 252     <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>.)
 253    </para>
 254    <para>
 255     Evaluation will take place as part of the TREC/CLEF campaign 2003
 256     (<ulink url="http://clef.iei.pi.cnr.it"/> or <ulink url="http://www4.eurospider.ch/CLEF/"/>).
 257    </para>
 258    <para>
 259     For more information, contact Johannes Leveling
 260     <email>Johannes.Leveling@FernUni-Hagen.De</email>.
 261    </para>
 262   </sect2>
 263
 264   <sect2>
 265    <title>ULS (Union List of Serials)</title>
 266    <para>
 267     The M25-Link systems team
 268     (<ulink url="http://www.m25lib.ac.uk/M25link/"/>)
 269     are involved in a project called ULS to provide a union catalogue
 270     for periodicals in 21 member libraries.  They do this with an
 271     unusual architecture which they call a
 272     "non-distributed virtual union catalogue".
 273    </para>
 274    <para>
 275     The member libraries send in data files representing their
 276     periodicals, including both brief bibliographic data and summary
 277     holdings.  Then 21 individual Z39.50 targets are created, each
 278     using Zebra, and all mounted on a single hardware server.
 279     The live service provides a web gateway allowing Z39.50 searching
 280     of all of the targets or a selection of them.  Zebra's small
 281     footprint allows a relatively modest system to comfortably host
 282     the 21 servers.
 283    </para>
 284    <para>
 285     More information can be found at
 286     <ulink url="http://www.m25lib.ac.uk/ULS/"/>.
 287    </para>
 288   </sect2>
 289
 290   <sect2>
 291    <title>Various web indexes</title>
 292    <para>
 293     Zebra has been used by a variety of institutions to construct
 294     indexes of large web sites, typically in the region of tens of
 295     millions of pages.  In this role, it functions somewhat similarly
 296     to the engine of Google or AltaVista, but for a selected intranet
 297     or a subset of the whole Web.
 298    </para>
 299    <para>
 300     For example, Liverpool University's web-search facility (available
 301     on the home page at
 302     <ulink url="http://www.liv.ac.uk/"/>
 303     and on many sub-pages) works by relevance-searching a Zebra database
 304     which is populated by the Harvest-NG web-crawling software.
 305    </para>
 306    <para>
 307     For more information, contact John Gilbertson
 308     <email>jgilbert@liverpool.ac.uk</email>.
 309    </para>
 310   </sect2>
 311  </sect1>
 312
 313
 314  <sect1 id="support">
 315   <title>Support</title>
 316   <para>
 317    You can get support for Zebra from at least three sources.
 318   </para>
 319   <para>
 320    First, there's the Zebra web site at
 321    <ulink url="http://indexdata.dk/zebra/"/>,
 322    which always has the most recent version available for download.
 323    If you have a problem with Zebra, the first thing to do is see
 324    whether it's fixed in the current release.
 325   </para>
 326   <para>
 327    Second, there's the Zebra mailing list.  Its home page at
 328    <ulink url="http://indexdata.dk/mailman/listinfo/zebralist"/>
 329    includes a complete archive of all messages that have ever been
 330    posted on the list.  The Zebra mailing list is used both for
 331    announcements from the authors (new
 332    releases, bug fixes, etc.) and general discussion.  You are welcome
 333    to seek support there.  Join by sending email to
 334    <email>zebra-request@indexdata.dk</email>. Put the word
 335    <literal>subscribe</literal> in the body of the message.
 336   </para>
 337   <para>
 338    Third, it's possible to buy a commercial support contract, with
 339    well-defined service levels and response times, from Index Data.
 340    See
 341    <ulink url="http://indexdata.dk/support/?lang=en"/>
 342    <!-- ### compare this page with http://indexdata.dk/support2/ -->
 343    for details.
 344   </para>
 345  </sect1>
 346
 347
 348  <sect1 id="future">
 349   <title>Future Directions</title>
 350
 351   <para>
 352    These are some of the plans that we have for the software in the near
 353    and far future, ordered approximately as we expect to work on them.
 354   </para>
 355
 356   <para>
 357    <itemizedlist>
 358
 359     <listitem>
 360      <para>
 361        Improved support for XML in search and retrieval. Eventually,
 362        the goal is for Zebra to pull double duty as a flexible
 363        information retrieval engine and high-performance XML
 364        repository.
 365      </para>
 366      <para>
 367        Partially done.
 368      </para>
 369     </listitem>
 370
 371     <listitem>
 372      <para>
 373        Access to the search engine through a SOAP/RPC API to allow the
 374        construction of applications without requiring Z39.50 tools.
 375      </para>
 376      <para>
 377        Partially done, thanks to the new SRW/Z39.50 gateway.
 378      </para>
 379     </listitem>
 380
 381     <listitem>
 382      <para>
 383        Finalisation and documentation of Zebra's C programming
 384        API, allowing updates, database management and other functions
 385        not readily expressed in Z39.50.  We will also consider
 386        exposing the API through SOAP.
 387      </para>
 388     </listitem>
 389
 390     <listitem>
 391      <para>
 392        Improved free-text searching. We're first and foremost octet jockeys and
 393        we're actively looking for organisations or people who'd like
 394        to contribute experience in relevance ranking and text
 395        searching.
 396      </para>
 397     </listitem>
 398
 399    </itemizedlist>
 400   </para>
 401
 402   <para>
 403    Programmers thrive on user feedback. If you are interested in a
 404    facility that you don't see mentioned here, or if there's something
 405    you think we could do better, please drop us a mail.  Better still,
 406    implement it and send us the patches.
 407   </para>
 408   <para>
 409    If you think it's all really neat, you're welcome to drop us a line
 410    saying that, too. You can email us at
 411    <email>info@indexdata.dk</email>
 412    or check the contact info at the end of this manual.
 413   </para>
 414
 415  </sect1>
 416 </chapter>
 417  <!-- Keep this comment at the end of the file
 418  Local variables:
 419  mode: sgml
 420  sgml-omittag:t
 421  sgml-shorttag:t
 422  sgml-minimize-attributes:nil
 423  sgml-always-quote-attributes:t
 424  sgml-indent-step:1
 425  sgml-indent-data:t
 426  sgml-parent-document: "zebra.xml"
 427  sgml-local-catalogs: nil
 428  sgml-namecase-general:t
 429  End:
 430  -->