doc/introduction.xml

   1 <chapter id="introduction">
   2  <!-- $Id: introduction.xml,v 1.15 2002-10-10 14:27:18 heikki Exp $ -->
   3  <title>Introduction</title>
   4
   5  <sect1>
   6   <title>Overview</title>
   7
   8   <para>
   9    <ulink url="http://www.indexdata.dk/zebra/">
  10      Zebra</ulink>
  11    is a high-performance, general-purpose structured text
  12    indexing and retrieval engine. It reads structured records in a
  13    variety of input formats (e.g. email, XML, MARC) and provides access
  14    to them through a powerful combination of boolean search
  15    expressions and relevance-ranked free-text queries.
  16   </para>
  17
  18   <para>
  19    Zebra supports large databases (tens of millions of records,
  20    tens of gigabytes of data). It allows safe, incremental
  21    database updates on live systems. Because Zebra supports
  22    the industry-standard information retrieval protocol, Z39.50,
  23    you can search Zebra databases using an enormous variety of
  24    programs and toolkits, both commercial and free, which understand
  25    this protocol.  Application libraries are available to allow
  26    bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
  27    Basic, Python, PHP and more - see
  28    <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
  29    for more information on some of these client toolkits.
  30   </para>
  31
  32   <para>
  33    This document is an introduction to the Zebra system. It explains
  34    how to compile the software, how to prepare your first database,
  35    and how to configure the server to give you the
  36    functionality that you need.
  37   </para>
  38  </sect1>
  39
  40  <sect1 id="features">
  41   <title>Features</title>
  42
  43   <para>
  44    This is an overview of some of Zebra's most important features:
  45   </para>
  46
  47   <para>
  48    <itemizedlist>
  49
  50     <listitem>
  51      <para>
  52       Very large databases: files for indexes, etc. can be
  53       automatically partitioned over multiple disks.
  54      </para>
  55     </listitem>
  56
  57     <listitem>
  58      <para>
  59       Arbitrarily complex records.  The internal data format
  60       is a structured format conceptually similar to XML or GRS-1,
  61       which allows lists, nested structured data elements and
  62       variant forms of data.
  63      </para>
  64     </listitem>
  65
  66     <listitem>
  67      <para>
  68       Robust updating - records can be added and deleted <quote>on the fly</quote>
  69       without rebuilding the index from scratch.
  70       Records can be safely updated even while users are accessing
  71       the server.
  72       The update procedure is tolerant of crashes or hard interrupts
  73       during database updating - data can be reconstructed following
  74       a crash.
  75      </para>
  76     </listitem>
  77
  78     <listitem>
  79      <para>
  80       Configurable to understand many input formats.
  81       A system of input filters driven by
  82       regular expressions allows most ASCII-based
  83       data formats to be easily processed.
  84       SGML, XML, ISO2709 (MARC), and raw text are also
  85       supported.
  86      </para>
  87     </listitem>
  88
  89     <listitem>
  90      <para>
  91       Searching supports a powerful combination of boolean queries as
  92       well as relevance-ranking (free-text) queries.  Truncation,
  93       masking, full regular expression matching and "approximate
  94       matching" (e.g. spelling mistakes) are all handled.
  95      </para>
  96     </listitem>
  97
  98     <listitem>
  99      <para>
 100       Index-only databases: data can be, and usually is, imported
 101       into Zebra's own storage, but Zebra can also refer to
 102       external files, building and maintaining indexes of "live"
 103       collections.
 104      </para>
 105     </listitem>
 106
 107     <listitem>
 108      <para>
 109       Zebra is written in portable C, so it runs on most Unix-like systems
 110       as well as Windows NT.  A binary distribution for Windows NT is
 111       available at
 112       <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>.
 113      </para>
 114     </listitem>
 115
 116    </itemizedlist>
 117
 118   </para>
 119
 120   <para>
 121    Z39.50 protocol support:
 122   </para>
 123
 124   <para>
 125    <itemizedlist>
 126     <listitem>
 127      <para>
 128       Protocol facilities: Init, Search, Present (retrieval),
 129       Segmentation (support for very large records), Delete, Scan
 130       (index browsing), Sort, Close and some Extended Services.
 131      </para>
 132     </listitem>
 133
 134     <listitem>
 135      <para>
 136       Piggy-backed presents are honored in the search request - that
 137       is, a subset of the found records can be returned directly with
 138       a search response, enabling search and retrieval to happen in a
 139       single round-trip.
 140      </para>
 141     </listitem>
 142
 143     <listitem>
 144      <para>
 145       Named result sets are supported.
 146      </para>
 147     </listitem>
 148
 149     <listitem>
 150      <para>
 151       Easily configured to support different application profiles, with
 152       tables for attribute sets, tag sets, and abstract syntaxes.
 153       Additional tables control facilities such as element mappings to
 154       different schemas (e.g. GILS-to-USMARC).
 155      </para>
 156     </listitem>
 157
 158     <listitem>
 159      <para>
 160       Complex composition specifications using Espec-1 (partial support).
 161       Element sets are defined using the Espec-1 capability,
 162       and are specified in configuration files as simple element
 163       requests (and, optionally, variant requests).
 164      </para>
 165     </listitem>
 166
 167     <listitem>
 168      <para>
 169       Multiple record syntaxes
 170       for data retrieval: GRS-1, SUTRS,
 171       XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
 172       and schemas on the fly.
 173      </para>
 174     </listitem>
 175
 176    </itemizedlist>
 177
 178   </para>
 179
 180  </sect1>
 181
 182  <sect1 id="apps">
 183   <title>Applications</title>
 184   <para>
 185    Zebra has been deployed in numerous applications, in both the
 186    academic and commercial worlds, in application domains as diverse
 187    as bibliographic catalogues, geospatial information, structured
 188    vocabulary browsing, government information locators, civic
 189    information systems, environmental observations, museum information
 190    and web indexes.
 191   </para>
 192   <para>
 193    Notable applications include the following:
 194   </para>
 195
 196   <sect2>
 197    <title>DADS - the DTV Article Database Service</title>
 198    <para>
 199     DADS is a huge database of more than ten million records, totalling
 200     over ten gigabytes of data.  The records are metadata about academic
 201     journal articles, primarily scientific; about 10% of these
 202     metadata records link to the full text of the articles they
 203     describe, a body of about a terabyte of information (although the
 204     full text is not indexed).
 205    </para>
 206    <para>
 207     It allows students and researchers at DTU (Danmarks Tekniske
 208     Universitet, the Technical University of Denmark) to find and order
 209     articles from multiple databases in a single query.  The database
 210     contains literature on all engineering subjects.  It's available
 211     on-line through a web gateway, though currently only to registered
 212     users.
 213    </para>
 214    <para>
 215     More information can be found at
 216     <ulink url="http://www.dtv.dk/help/dads/index_e.htm"/>.
 217    </para>
 218   </sect2>
 219
 220   <sect2>
 221    <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
 222    <para>
 223     Fernuniversität Hagen in Germany have developed a natural
 224     language interface for access to library databases
 225     (<ulink url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/>).
 226     In order to evaluate this interface for recall and precision, they
 227     chose Zebra as the basis for measuring retrieval effectiveness.  The Zebra
 228     server contains a copy of the GIRT database, consisting of more
 229     than 76,000 records in SGML format (bibliographic records from
 230     social science), which are mapped to MARC for presentation.
 231    </para>
 232    <para>
 233     (GIRT is the German Indexing and Retrieval Testdatabase.  It is a
 234     standard German-language test database for intelligent indexing
 235     and retrieval systems.  See
 236     <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>.)
 237    </para>
 238    <para>
 239     Evaluation will take place as part of the TREC/CLEF campaign 2003
 240     (<ulink url="http://clef.iei.pi.cnr.it"/> or <ulink url="http://www4.eurospider.ch/CLEF/"/>).
 241    </para>
 242    <para>
 243     For more information, contact Johannes Leveling
 244     <email>Johannes.Leveling@FernUni-Hagen.De</email>.
 245    </para>
 246   </sect2>
 247
 248   <sect2>
 249    <title>ULS (Union List of Serials)</title>
 250    <para>
 251     The M25-Link systems team
 252     (<ulink url="http://www.m25lib.ac.uk/M25link/"/>)
 253     are involved in a project called ULS to provide a union catalogue
 254     for periodicals in 21 member libraries.  They do this with an
 255     unusual architecture which they call a
 256     <quote>non-distributed virtual union catalogue</quote>.
 257    </para>
 258    <para>
 259     The member libraries send in data files representing their
 260     periodicals, including both brief bibliographic data and summary
 261     holdings.  Then 21 individual Z39.50 targets are created, each
 262     using Zebra, and all mounted on a single hardware server.
 263     The live service provides a web gateway allowing Z39.50 searching
 264     of all of the targets or a selection of them.  Zebra's small
 265     footprint allows a relatively modest system to comfortably host
 266     the 21 servers.
 267    </para>
 268    <para>
 269     More information can be found at
 270     <ulink url="http://www.m25lib.ac.uk/ULS/"/>.
 271    </para>
 272   </sect2>
 273
 274   <sect2>
 275    <title>Various web indexes</title>
 276    <para>
 277     Zebra has been used by a variety of institutions to construct
 278     indexes of large web sites, typically in the region of tens of
 279     millions of pages.  In this role, it functions somewhat similarly
 280     to the engine of Google or AltaVista, but for a selected intranet
 281     or a subset of the whole Web.
 282    </para>
 283    <para>
 284     For example, Liverpool University's web-search facility (available on
 285     the home page at
 286     <ulink url="http://www.liv.ac.uk/"/>
 287     and on many sub-pages) works by relevance-searching a Zebra database
 288     which is populated by the Harvest-NG web-crawling software.
 289    </para>
 290    <para>
 291     For more information, contact John Gilbertson
 292     <email>jgilbert@liverpool.ac.uk</email>.
 293    </para>
 294   </sect2>
 295  </sect1>
 296
 297
 298  <sect1 id="support">
 299   <title>Support</title>
 300   <para>
 301    You can get support for Zebra from at least three sources.
 302   </para>
 303   <para>
 304    First, there's the Zebra web site at
 305    <ulink url="http://www.indexdata.dk/zebra/"/>,
 306    which always has the most recent version available for download.
 307    If you have a problem with Zebra, the first thing to do is see
 308    whether it's fixed in the current release.
 309   </para>
 310   <para>
 311    Second, there's the Zebra mailing list.  Its home page at
 312    <ulink url="http://www.indexdata.dk/mailman/listinfo/zebralist"/>
 313    includes a complete archive of all messages that have ever been
 314    posted on the list.  The Zebra mailing list is used both for
 315    announcements from the authors (new
 316    releases, bug fixes, etc.) and general discussion.  You are welcome
 317    to seek support there.  Join by sending email to
 318    <email>zebra-request@indexdata.dk</email>. Put the word 'subscribe'
 319    in the body of the message.
 320    <!-- zebra-subscribe-###@mailman.indexdata.dk-->
 321   </para>
 322   <para>
 323    Third, it's possible to buy a commercial support contract, with
 324    well-defined service levels and response times, from Index Data.
 325    See
 326    <ulink url="http://www.indexdata.dk/support/?lang=en"/>
 327    <!-- ulink url="http://www.indexdata.dk/support/###"/-->
 328    for details.
 329   </para>
 330  </sect1>
 331
 332
 333  <sect1 id="future">
 334   <title>Future Directions</title>
 335
 336   <para>
 337    These are some of the plans that we have for the software in the near
 338    and far future, ordered approximately as we expect to work on them.
 339   </para>
 340
 341   <para>
 342    <itemizedlist>
 343
 344     <listitem>
 345      <para>
 346        Improved support for XML in search and retrieval. Eventually,
 347        the goal is for Zebra to pull double duty as a flexible
 348        information retrieval engine and high-performance XML
 349        repository.
 350      </para>
 351      <para>
 352        This is partially done.
 353      </para>
 354     </listitem>
 355
 356     <listitem>
 357      <para>
 358        Access to the search engine through a SOAP/RPC API to allow the
 359        construction of applications without requiring Z39.50 tools.
 360      </para>
 361      <para>
 362        This is partially done, thanks to the new SRW/Z39.50 gateway.
 363      </para>
 364     </listitem>
 365
 366     <listitem>
 367      <para>
 368        Finalisation and documentation of Zebra's C programming
 369        API, allowing updates, database management and other functions
 370        not readily expressed in Z39.50.  We will also consider
 371        exposing the API through SOAP.
 372      </para>
 373     </listitem>
 374
 375     <listitem>
 376      <para>
 377        Improved free-text searching. We're first and foremost octet jockeys and
 378        we're actively looking for organisations or people who'd like
 379        to contribute experience in relevance ranking and text
 380        searching.
 381      </para>
 382     </listitem>
 383
 384    </itemizedlist>
 385   </para>
 386
 387   <para>
 388    Programmers thrive on user feedback. If you are interested in a
 389    facility that you don't see mentioned here, or if there's something
 390    you think we could do better, please drop us a mail.  Better still,
 391    implement it and send us the patches.
 392   </para>
 393   <para>
 394    If you think it's all really neat, you're welcome to drop us a line
 395    saying that, too. You can email us at
 396    <email>info@indexdata.dk</email>
 397    or check the contact info at the end of this manual.
 398   </para>
 399
 400  </sect1>
 401 </chapter>
 402  <!-- Keep this comment at the end of the file
 403  Local variables:
 404  mode: sgml
 405  sgml-omittag:t
 406  sgml-shorttag:t
 407  sgml-minimize-attributes:nil
 408  sgml-always-quote-attributes:t
 409  sgml-indent-step:1
 410  sgml-indent-data:t
 411  sgml-parent-document: "zebra.xml"
 412  sgml-local-catalogs: nil
 413  sgml-namecase-general:t
 414  End:
 415  -->