doc/introduction.xml

   1 <chapter id="introduction">
   2  <!-- $Id: introduction.xml,v 1.39 2006-09-03 21:37:26 adam Exp $ -->
   3  <title>Introduction</title>
   4
   5  <section id="overview">
   6   <title>Overview</title>
   7
   8   <para>
   9    <ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
  10    is a high-performance, general-purpose structured text
  11    indexing and retrieval engine. It reads records in a
  12    variety of input formats (e.g. email, XML, MARC) and provides access
  13    to them through a powerful combination of boolean search
  14    expressions and relevance-ranked free-text queries.
  15   </para>
  16
  17   <para>
  18    Zebra supports large databases (tens of millions of records,
  19    tens of gigabytes of data). It allows safe, incremental
  20    database updates on live systems. Because Zebra supports
  21    the industry-standard information retrieval protocol, Z39.50,
  22    you can search Zebra databases using an enormous variety of
  23    programs and toolkits, both commercial and free, which understand
  24    this protocol.  Application libraries are available to allow
  25    bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
  26    Basic, Python, PHP and more - see
  27    <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
  28    for more information on some of these client toolkits.
  29   </para>
  30
  31   <para>
  32    This document is an introduction to the Zebra system. It explains
  33    how to compile the software, how to prepare your first database,
  34    and how to configure the server to give you the
  35    functionality that you need.
  36   </para>
  37  </section>
  38
  39  <section id="features">
  40   <title>Features</title>
  41
  42   <para>
  43    This is an overview of some of Zebra's most important features:
  44   </para>
  45
  46   <para>
  47    <itemizedlist>
  48
  49     <listitem>
  50      <para>
  51       Very large databases: logical files can be
  52       automatically partitioned over multiple disks.
  53      </para>
  54     </listitem>
  55
  56     <listitem>
  57      <para>
  58       Arbitrarily complex records.  The internal data format
  59       is a structured format conceptually similar to XML or GRS-1,
  60       which allows lists, nested structured data elements and
  61       variant forms of data.
  62      </para>
  63     </listitem>
  64
  65     <listitem>
  66      <para>
  67       Robust updating - records can be added and deleted "on the fly"
  68       without rebuilding the index from scratch.
  69       Records can be safely updated even while users are accessing
  70       the server.
  71       The update procedure is tolerant of crashes or hard interrupts
  72       during database updating - data can be reconstructed following
  73       a crash.
  74      </para>
  75     </listitem>
  76
  77     <listitem>
  78      <para>
  79       Configurable to understand many input formats.
  80       A system of input filters driven by
  81       regular expressions allows most ASCII-based
  82       data formats to be easily processed.
  83       SGML, XML, ISO2709 (MARC), and raw text are also
  84       supported.
  85      </para>
  86     </listitem>
  87
  88     <listitem>
  89      <para>
  90       Searching supports a powerful combination of boolean queries as
  91       well as relevance-ranking (free-text) queries.  Truncation,
  92       masking, full regular expression matching and "approximate
  93       matching" (e.g. spelling mistakes) are all handled.
  94      </para>
  95     </listitem>
  96
  97     <listitem>
  98       <para>
  99         Index-only databases: data can be, and usually is, imported
 100         into Zebra's own storage, but Zebra can also refer to
 101         external files, building and maintaining indexes of "live"
 102         collections.
 103       </para>
 104     </listitem>
 105
 106     <listitem>
 107      <para>
 108       Zebra is written in portable C, so it runs on most Unix-like systems
 109       as well as Windows NT.  A binary distribution for Windows NT is
 110       available at
 111       <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>,
 112       and pre-built packages are available for
 113       <!--- some Linux
 114       distributions:
 115       Red Hat 7.x RPMs at
 116       <ulink url="http://ftp.indexdata.dk/pub/zebra/RedHat7.X/"/>
 117       and Debian packages at
 118       -->
 119       <literal>Debian GNU/Linux</literal> at
 120       <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/"/>.
 121      </para>
 122     </listitem>
 123
 124    </itemizedlist>
 125
 126   </para>
 127
 128   <para>
 129      <ulink url="&url.z39.50;">Z39.50</ulink> protocol support:
 130   </para>
 131
 132   <para>
 133    <itemizedlist>
 134     <listitem>
 135      <para>
 136       Protocol facilities: Init, Search, Present (retrieval),
 137       Segmentation (support for very large records), Delete, Scan
 138       (index browsing), Sort, Close and support for the "update"
 139       Extended Service to add or replace an existing XML record.
 140      </para>
 141     </listitem>
 142
 143     <listitem>
 144      <para>
 145       Piggy-backed presents are honored in the search request - that
 146       is, a subset of the found records can be returned directly with
 147       a search response, enabling search and retrieval to happen in a
 148       single round-trip.
 149      </para>
 150     </listitem>
 151
 152     <listitem>
 153      <para>
 154       Named result sets are supported.
 155      </para>
 156     </listitem>
 157
 158     <listitem>
 159      <para>
 160       Easily configured to support different application profiles, with
 161       tables for attribute sets, tag sets, and abstract syntaxes.
 162       Additional tables control facilities such as element mappings to
 163       different schemas (e.g. GILS-to-USMARC).
 164      </para>
 165     </listitem>
 166
 167     <listitem>
 168      <para>
 169       Complex composition specifications using Espec-1 (partial support).
 170       Element sets are defined using the Espec-1 capability,
 171       and are specified in configuration files as simple element
 172       requests (and, optionally, variant requests).
 173      </para>
 174     </listitem>
 175
 176     <listitem>
 177      <para>
 178       Multiple record syntaxes
 179       for data retrieval: GRS-1, SUTRS,
 180       XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
 181       and schemas on the fly.
 182      </para>
 183     </listitem>
 184
 185    </itemizedlist>
 186
 187   </para>
 188
 189
 190   <para>
 191     <ulink url="&url.sru;">SRU</ulink> Web Service support:
 192   </para>
 193   <para>
 194    <itemizedlist>
 195     <listitem>
 196      <para>
 197        The protocol operations <literal>explain</literal>,
 198        <literal>searchRetrieve</literal> and <literal>scan</literal>
 199        are supported.
 200      </para>
 201     </listitem>
 202     <listitem>
 203      <para>
 204        <ulink url="&url.cql;">CQL</ulink> to internal query model RPN
 205        conversion is supported.
 206      </para>
 207     </listitem>
 208     <listitem>
 209      <para>
 210        Multiple XML record formats
 211        for data retrieval are supported, modelled on the GRS-1, SUTRS
 212        and MARC record formats. Records can be mapped between record
 213        schemas on the fly. Arbitrarily complex XSLT transformations
 214       can be applied during record retrieval if one uses the
 215        <literal>alvis</literal> filter module.
 216      </para>
 217     </listitem>
 218     <listitem>
 219      <para>
 220        Additional PQF query syntax for
 221        <literal>searchRetrieve</literal>
 222        and <literal>scan</literal> operations is supported.
 223      </para>
 224     </listitem>
 225
 226    </itemizedlist>
 227
 228   </para>
 229
 230
 231  </section>
 232
 233   <section id="introduction-apps">
 234   <title>References and Zebra-based Applications</title>
 235   <para>
 236    Zebra has been deployed in numerous applications, in both the
 237    academic and commercial worlds, in application domains as diverse
 238    as bibliographic catalogues, geospatial information, structured
 239    vocabulary browsing, government information locators, civic
 240    information systems, environmental observations, museum information
 241    and web indexes.
 242   </para>
 243   <para>
 244    Notable applications include the following:
 245   </para>
 246
 247
 248   <section id="koha-ils">
 249    <title>Koha free open-source ILS</title>
 250    <para>
 251      <ulink url="http://www.koha.org/">Koha</ulink> is a full-featured
 252      open-source ILS, initially developed  in
 253      New Zealand by Katipo Communications Ltd, and first deployed in
 254      January of 2000 for Horowhenua Library Trust. It is currently
 255      maintained by a team of software providers and library technology
 256      staff from around the globe.
 257     </para>
 258     <para>
 259      <ulink url="http://liblime.com/">LibLime</ulink>,
 260      a company that markets and supports Koha, is adding the Zebra
 261      database server to the new Koha 3.0 release to drive its
 262      bibliographic database.
 263     </para>
 264     <para>
 265      In early 2005, the Koha project development team began looking at
 266      ways to improve MARC support and overcome scalability limitations
 267      in the Koha 2.x series. After extensive evaluations of the best
 268      of the Open Source textual database engines - including MySQL
 269      full-text searching, PostgreSQL, Lucene and Plucene - the team
 270      selected Zebra.
 271     </para>
 272     <para>
 273      "Zebra completely eliminates scalability limitations, because it
 274      can support tens of millions of records," explained Joshua
 275      Ferraro, LibLime's Technology President and Koha's Project
 276      Release Manager. "Our performance tests showed search results in
 277      under a second for databases with over 5 million records on a
 278      modest i386 900MHz test server."
 279     </para>
 280     <para>
 281      "Zebra also includes support for true boolean search expressions
 282      and relevance-ranked free-text queries, both of which the Koha
 283      2.x series lack. Zebra also supports incremental and safe
 284      database updates, which allow on-the-fly record
 285      management. Finally, since Zebra has at its heart the Z39.50
 286      protocol, it greatly improves Koha's support for that critical
 287      library standard."
 288     </para>
 289     <para>
 290      Although the bibliographic database will be moved to Zebra, Koha
 291      3.0 will continue to use a relational SQL-based database design
 292      for the 'factual' database. "Relational database managers have
 293      their strengths, in spite of their inability to handle large
 294      numbers of bibliographic records efficiently," summed up Ferraro,
 295      "We're taking the best from both worlds in our redesigned Koha
 296      3.0."
 297      </para>
 298      <para>
 299      See also LibLime's newsletter article
 300       <ulink url="http://www.liblime.com/newsletter/2006/01/features/koha-earns-its-stripes/">
 301      Koha Earns its Stripes</ulink>.
 302      </para>
 303    </section>
 304
 305   <section id="emilda-ils">
 306    <title>Emilda open source ILS</title>
 307    <para>
 308      <ulink url="http://www.emilda.org/">Emilda</ulink>
 309      is a complete Integrated Library System, released under the
 310      GNU General Public License. It has a
 311      full-featured Web-OPAC, allowing comprehensive system management
 312      from virtually any computer with an Internet connection, a
 313      template-based layout allowing anyone to alter the visual
 314      appearance of Emilda, and
 315      XML-based language support for fast and easy portability to
 316      virtually any language.
 317      Currently, Emilda is used at three schools in Espoo, Finland.
 318     </para>
 319     <para>
 320      In addition, 100% MARC compatibility has been achieved using the
 321      Zebra server from Index Data as the backend server.
 322     </para>
 323    </section>
 324
 325   <section id="reindex-ils">
 326    <title>ReIndex.Net web based ILS</title>
 327     <para>
 328      <ulink url="http://www.reindex.net/index.php?lang=en">Reindex.net</ulink>
 329      is a net-based library service offering all
 330      traditional functions at a very high level plus many new
 331      services. Reindex.net is a comprehensive and powerful web system
 332      based on standards such as XML and Z39.50.
 333      Reindex supports MARC21, danMARC or Dublin Core with
 334      UTF-8 encoding.
 335     </para>
 336     <para>
 337      Reindex.net runs on Debian GNU/Linux with Zebra and SimpleServer
 338      from Index
 339      Data for bibliographic data. The relational database system
 340      Sybase 9 XML is used for
 341      administrative data.
 342      Internally, MARCXML is used for bibliographic records. Updates
 343      use Z39.50 extended services.
 344     </para>
 345    </section>
 346
 347    <section id="dads-article-database">
 348     <title>DADS - the DTV Article Database
 349      Service</title>
 350     <para>
 351     DADS is a huge database of more than ten million records, totalling
 352     over ten gigabytes of data.  The records are metadata about academic
 353     journal articles, primarily scientific; about 10% of these
 354     metadata records link to the full text of the articles they
 355     describe, a body of about a terabyte of information (although the
 356     full text is not indexed).
 357    </para>
 358    <para>
 359     It allows students and researchers at DTU (Danmarks Tekniske
 360     Universitet, the Technical University of Denmark) to find and order
 361     articles from multiple databases in a single query.  The database
 362     contains literature on all engineering subjects.  It's available
 363     on-line through a web gateway, though currently only to registered
 364     users.
 365    </para>
 366    <para>
 367     More information can be found at
 368     <ulink url="http://www.dtv.dk/"/> and
 369     <ulink url="http://dads.dtv.dk"/>
 370    </para>
 371   </section>
 372
 373   <section id="infonet-eprints">
 374    <title>Infonet Eprints</title>
 375    <para>
 376      The InfoNet Eprints service from the
 377      <ulink url="http://www.dtv.dk/">
 378       Technical Knowledge Center of Denmark</ulink>
 379      provides access to documents stored in
 380      eprint/preprint servers and institutional research archives around
 381      the world. The service is based on Open Archives Initiative metadata
 382      harvesting of selected scientific archives around the world. These
 383      open archives offer free and unrestricted access to their contents.
 384     </para>
 385    <para>
 386     Infonet Eprints currently holds 1.4 million records from 16 archives.
 387     The online search facility is found at
 388     <ulink url="http://preprints.cvt.dk"/>.
 389    </para>
 390   </section>
 391
 392   <section id="alvis-project">
 393    <title>Alvis</title>
 394    <para>
 395      The <ulink url="http://www.alvis.info/alvis/">Alvis</ulink> EU
 396      project run under the 6th Framework (IST-1-002068-STP)
 397      is building a semantic-based peer-to-peer search engine. A
 398      consortium of eleven partners from six different European
 399      Community countries plus Switzerland and China contributes
 400      expertise in a broad range of specialties including network
 401      topologies, routing algorithms, linguistic analysis and
 402      bioinformatics.
 403     </para>
 404     <para>
 405      The Zebra information retrieval indexing machine is used inside
 406      the Alvis framework to
 407      manage huge collections of natural language processed and
 408      enhanced XML data, coming from a topic-relevant web crawl.
 409      In this application, Zebra swallows and manages 37GB of XML data
 410      in about 4 hours, resulting in search times of a fraction of a
 411      second.
 412      </para>
 413    </section>
 414
 415
 416   <section id="uls">
 417    <title>ULS (Union List of Serials)</title>
 418    <para>
 419     The M25 Systems Team
 420     has created a union catalogue for the periodicals of the
 421     twenty-one constituent libraries of the University of London and
 422     the University of Westminster
 423     (<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
 424     They have achieved this using an
 425     unusual architecture, which they describe as a
 426     "non-distributed virtual union catalogue".
 427    </para>
 428    <para>
 429     The member libraries send in data files representing their
 430     periodicals, including both brief bibliographic data and summary
 431     holdings.  Then 21 individual Z39.50 targets are created, each
 432     using Zebra, and all mounted on the single hardware server.
 433     The live service provides a web gateway allowing Z39.50 searching
 434     of all of the targets or a selection of them.  Zebra's small
 435     footprint allows a relatively modest system to comfortably host
 436     the 21 servers.
 437    </para>
 438    <para>
 439     More information can be found at
 440     <ulink url="http://www.m25lib.ac.uk/ULS/"/>
 441    </para>
 442   </section>
 443
 444   <section id="nli">
 445    <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
 446    <para>
 447     Fernuniversit&#x00E4;t Hagen in Germany have developed a natural
 448     language interface for access to library databases.
 449     <!-- <ulink
 450     url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/> -->
 451     In order to evaluate this interface for recall and precision, they
 452     chose Zebra as the basis for retrieval effectiveness.  The Zebra
 453     server contains a copy of the GIRT database, consisting of more
 454     than 76000 records in SGML format (bibliographic records from
 455     social science), which are mapped to MARC for presentation.
 456    </para>
 457    <para>
 458     (GIRT is the German Indexing and Retrieval Testdatabase.  It is a
 459     standard German-language test database for intelligent indexing
 460     and retrieval systems.  See
 461     <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>)
 462    </para>
 463    <para>
 464     Evaluation will take place as part of the TREC/CLEF campaign 2003
 465     <ulink url="http://clef.iei.pi.cnr.it"/>.
 466     <!-- or <ulink url="http://www4.eurospider.ch/CLEF/"/> -->
 467    </para>
 468    <para>
 469     For more information, contact Johannes Leveling
 470     <email>Johannes.Leveling@FernUni-Hagen.De</email>
 471    </para>
 472   </section>
 473
 474   <section id="various-web-indexes">
 475    <title>Various web indexes</title>
 476    <para>
 477     Zebra has been used by a variety of institutions to construct
 478     indexes of large web sites, typically in the region of tens of
 479     millions of pages.  In this role, it functions somewhat similarly
 480     to the engine of Google or AltaVista, but for a selected intranet
 481     or a subset of the whole Web.
 482    </para>
 483    <para>
 484     For example, Liverpool University's web-search facility (see on
 485     the home page at
 486     <ulink url="http://www.liv.ac.uk/"/>
 487     and many sub-pages) works by relevance-searching a Zebra database
 488     which is populated by the Harvest-NG web-crawling software.
 489    </para>
 490    <para>
 491     For more information on Liverpool University's intranet search
 492     architecture, contact John Gilbertson
 493     <email>jgilbert@liverpool.ac.uk</email>
 494    </para>
 495    <para>
 496     Kang-Jin Lee
 497     has recently modified the Harvest web indexer to use Zebra as
 498     its native repository engine.  His comments on the switch over
 499     from the old engine are revealing:
 500     <blockquote>
 501      <para>
 502       The first results after some testing with Zebra are very
 503       promising.  The tests were done with around 220,000 SOIF files,
 504       which occupies 1.6GB of disk space.
 505      </para>
 506      <para>
 507       Building the index from scratch takes around one hour with Zebra
 508       where [old-engine] needs around five hours.  While [old-engine]
 509       blocks search requests when updating its index, Zebra can still
 510       answer search requests.
 511       [...]
 512       Zebra supports incremental indexing which will speed up indexing
 513       even further.
 514      </para>
 515      <para>
 516       While the search time of [old-engine] varies from some seconds
 517       to some minutes depending how expensive the query is, Zebra
 518       usually takes around one to three seconds, even for expensive
 519       queries.
 520       [...]
 521       Zebra can search more than 100 times faster than [old-engine]
 522       and can process multiple search requests simultaneously
 523      </para>
 524      <para>
 525       I am very happy to see such nice software available under GPL.
 526      </para>
 527     </blockquote>
 528    </para>
 529   </section>
 530  </section>
 531
 532
 533  <section id="introduction-support">
 534   <title>Support</title>
 535   <para>
 536    You can get support for Zebra from at least three sources.
 537   </para>
 538   <para>
 539    First, there's the Zebra web site at
 540    <ulink url="http://indexdata.dk/zebra/"/>,
 541    which always has the most recent version available for download.
 542    If you have a problem with Zebra, the first thing to do is see
 543    whether it's fixed in the current release.
 544   </para>
 545   <para>
 546    Second, there's the Zebra mailing list.  Its home page at
 547    <ulink url="http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist"/>
 548    includes a complete archive of all messages that have ever been
 549    posted on the list.  The Zebra mailing list is used both for
 550    announcements from the authors (new
 551    releases, bug fixes, etc.) and general discussion.  You are welcome
 552    to seek support there.  Join by filling in the form on the list home page.
 553   </para>
 554   <para>
 555    Third, it's possible to buy a commercial support contract, with
 556    well defined service levels and response times, from Index Data.
 557    See
 558    <ulink url="http://indexdata.dk/support/"/>
 559    for details.
 560   </para>
 561  </section>
 562
 563
 564  <section id="future">
 565   <title>Future Directions</title>
 566
 567   <para>
 568    These are some of the plans that we have for the software in the near
 569    and far future, ordered approximately as we expect to work on them.
 570   </para>
 571
 572   <para>
 573    <itemizedlist>
 574
 575     <listitem>
 576      <para>
 577        Improved support for XML in search and retrieval. Eventually,
 578        the goal is for Zebra to pull double duty as a flexible
 579        information retrieval engine and high-performance XML
 580        repository.  The recent addition of XPath searching is one
 581        example of the kind of enhancement we're working on.
 582      </para>
 583      <para>
 584        There is also the experimental <literal>ALVIS XSLT</literal>
 585        XML input filter, which unleashes the full power of DOM based
 586        XSLT transformations during indexing and record retrieval. Work
 587        on this filter has been sponsored by the ALVIS EU project
 588        <ulink url="http://www.alvis.info/alvis/"/>. We expect this filter to
 589        mature soon, as it is planned to be included in the version 1.4
 590        release of Zebra.
 591      </para>
 592     </listitem>
 593
 594     <listitem>
 595      <para>
 596        Access to the search engine through a SOAP/RPC API to allow the
 597        construction of applications without requiring Z39.50 tools.
 598        <!--
 599       This will shortly be available by means of Index Data's
 600         <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>-to-Z39.50 gateway, currently in beta test.
 601        -->
 602        Experimental support for the
 603        Search/Retrieve Via URL (<ulink url="&url.sru;">SRU</ulink>)
 604        <ulink url="&url.sru;"/>
 605        REST web service and the
 606        Search/Retrieve Web Service (<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>)
 607        <ulink url="http://www.loc.gov/standards/sru/srw/"/>
 608        SOAP web service has recently been added to the YAZ/Zebra
 609        combination, including server-side Common Query Language (<ulink url="&url.cql;">CQL</ulink>)
 610        <ulink url="&url.cql;"/> parsing
 611        and configuration. We have yet to find a sponsor for further testing,
 612        documentation and packaging of this exciting component.
 613      </para>
 614     </listitem>
 615
 616     <listitem>
 617      <para>
 618        Finalisation and documentation of Zebra's C programming
 619        API, allowing updates, database management and other functions
 620        not readily expressed in Z39.50.  We will also consider
 621        exposing the API through SOAP.
 622      </para>
 623     </listitem>
 624
 625     <listitem>
 626      <para>
 627        Support for the use of Perl both for access to the Zebra API
 628        and for building extension "plug-ins" such as input filters.
 629        The code for this has been contributed to the source tree by
 630        Peter Popovics
 631        <email>pop@technomat.hu</email>,
 632        and is in the process of being integrated and tested.
 633      </para>
 634     </listitem>
 635
 636     <listitem>
 637      <para>
 638        Improved free-text searching. We're first and foremost octet jockeys and
 639        we're actively looking for organisations or people who'd like
 640        to contribute experience in relevance ranking and text
 641        searching.
 642      </para>
 643     </listitem>
 644
 645    </itemizedlist>
 646   </para>
 647
 648   <para>
 649    Programmers thrive on user feedback. If you are interested in a
 650    facility that you don't see mentioned here, or if there's something
 651    you think we could do better, please drop us a mail.  Better still,
 652    implement it and send us the patches.
 653   </para>
 654   <para>
 655    If you think it's all really neat, you're welcome to drop us a line
 656    saying that, too. You can email us at
 657    <email>info@indexdata.dk</email>
 658    or check the contact info at the end of this manual.
 659   </para>
 660
 661  </section>
 662 </chapter>
 663  <!-- Keep this comment at the end of the file
 664  Local variables:
 665  mode: sgml
 666  sgml-omittag:t
 667  sgml-shorttag:t
 668  sgml-minimize-attributes:nil
 669  sgml-always-quote-attributes:t
 670  sgml-indent-step:1
 671  sgml-indent-data:t
 672  sgml-parent-document: "zebra.xml"
 673  sgml-local-catalogs: nil
 674  sgml-namecase-general:t
 675  End:
 676  -->