<chapter id="introduction">
- <!-- $Id: introduction.xml,v 1.7 2002-08-05 08:27:05 quinn Exp $ -->
+ <!-- $Id: introduction.xml,v 1.23 2002-12-02 15:11:49 mike Exp $ -->
<title>Introduction</title>
<sect1>
<title>Overview</title>
<para>
- The
- <ulink url="http://www.indexdata.dk/zebra/">
- Zebra</ulink>
- server is a high-performance, general-purpose structured text
- indexing and retrieval engine. It reads structured records in a
- variety of input formats (eg. email, XML, MARC) and allows access
- to them through exact boolean search expressions and
- relevance-ranked free-text queries.
- </para>
-
- <para>
- Zebra supports large databases (more than ten gigabytes of data,
- tens of millions of records). It supports incremental, safe
- database updates on live systems. You can access data stored in
- Zebra using a variety of Index Data tools (eg. YAZ and PHP/YAZ) as
- well as commercial and freeware Z39.50 clients and toolkits.
- </para>
+ <ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
+ is a high-performance, general-purpose structured text
+ indexing and retrieval engine. It reads records in a
+ variety of input formats (e.g. email, XML, MARC) and provides access
+ to them through a powerful combination of boolean search
+ expressions and relevance-ranked free-text queries.
+ </para>
<para>
- This document is an introduction to the Zebra system. It will tell you
- how to compile the software, and how to prepare your first database.
- It also explains how the server can be configured to give you the
- functionality that you need.
+ Zebra supports large databases (tens of millions of records,
+ tens of gigabytes of data). It allows safe, incremental
+ database updates on live systems. Because Zebra supports
+ the industry-standard information retrieval protocol, Z39.50,
+ you can search Zebra databases using an enormous variety of
+ programs and toolkits, both commercial and free, which understand
+ this protocol. Application libraries are available to allow
+ bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
+ Basic, Python, PHP and more - see
+ <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
+ for more information on some of these client toolkits.
</para>
-
+
<para>
-
- If you find the software interesting, you should visit the
- <ulink url="http://www.indexdata.dk/zebra/">
- Zebra web site</ulink>, where you can join the
- <ulink url="http://www.indexdata.dk/mailman/listinfo/zebralist">
- mailing-list</ulink>
- by sending email to
+ This document is an introduction to the Zebra system. It explains
+ how to compile the software, how to prepare your first database,
+ and how to configure the server to give you the
+ functionality that you need.
</para>
-
</sect1>
<sect1 id="features">
<title>Features</title>
<para>
- This is an overview of some of the most important features of the
- system.
+ This is an overview of some of Zebra's most important features:
</para>
<para>
<listitem>
<para>
- Supports large databases - files for indices, etc. can be
+ Very large databases: logical files can be
automatically partitioned over multiple disks.
</para>
</listitem>
<listitem>
<para>
- Supports arbitrarily complex records - base input format is an
- SGML-like syntax which allows nested (structured) data elements, as
- well as variant forms of data.
+ Arbitrarily complex records. The internal data format
+ is a structured format conceptually similar to XML or GRS-1,
+ which allows lists, nested structured data elements and
+ variant forms of data.
</para>
</listitem>
<listitem>
<para>
- Robust updating - records can be added and deleted without
- rebuilding the index from scratch.
+ Robust updating - records can be added and deleted "on the fly"
+ without rebuilding the index from scratch.
+ Records can be safely updated even while users are accessing
+ the server.
The update procedure is tolerant to crashes or hard interrupts
- during register updating - registers can be reconstructed following
+ during database updating - data can be reconstructed following
a crash.
- Registers can be safely updated even while users are accessing
- the server.
</para>
</listitem>
<listitem>
<para>
- Supports random storage formats. A system of input filters driven by
- regular expressions allows you to easily process most ASCII-based
- data formats. SGML, XML, ISO2709 (MARC), and raw text are also
+ Configurable to understand many input formats.
+ A system of input filters driven by
+ regular expressions allows most ASCII-based
+ data formats to be easily processed.
+ SGML, XML, ISO2709 (MARC), and raw text are also
supported.
</para>
</listitem>
<listitem>
<para>
- Supports boolean queries as well as relevance-ranking (free-text)
- searching. Right truncation and masking in terms are supported, as
- well as full regular expressions.
+ Searching supports a powerful combination of boolean queries
+ and relevance-ranked (free-text) queries. Truncation,
+ masking, full regular expression matching and "approximate
+ matching" (e.g. spelling mistakes) are all handled.
</para>
</listitem>
<listitem>
<para>
- Can import the data into Zebras own storage, or just refer to
- external files (good for building indexes of "live"
- collections).
+ Index-only databases: data can be, and usually is, imported
+ into Zebra's own storage, but Zebra can also refer to
+ external files, building and maintaining indexes of "live"
+ collections.
</para>
</listitem>
<listitem>
<para>
- Supports multiple concrete syntaxes
- for record exchange (depending on the configuration): GRS-1, SUTRS,
- XML, ISO2709 (*MARC). Records can be mapped between record syntaxes
- and schema on the fly.
- </para>
- </listitem>
-
- <listitem>
- <para>
- Supports approximate matching in registers (ie. spelling mistakes,
- etc).
- </para>
- </listitem>
-
- <listitem>
- <para>
Zebra is written in portable C, so it runs on most Unix-like systems
- as well as Windows NT - a binary distribution for Windows NT is available.
+ as well as Windows NT. A binary distribution for Windows NT is
+ available at
+ <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>,
+ and pre-built packages are available for some Linux
+ distributions:
+ Red Hat 7.x RPMs at
+ <ulink url="http://ftp.indexdata.dk/pub/zebra/RedHat7.X/"/>
+ and Debian packages at
+ <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/"/>.
</para>
</listitem>
<itemizedlist>
<listitem>
<para>
- Protocol facilities: Init, Search, Retrieve, Delete, Browse and Sort.
+ Protocol facilities: Init, Search, Present (retrieval),
+ Segmentation (support for very large records), Delete, Scan
+ (index browsing), Sort, Close and support for the "update"
+ Extended Service to add or replace an existing XML record.
+ <!-- Adam says:
+ * Supported
+ You can insert/delete/replace an XML record given an
+ "external" ID. Actually this way of doing ES Update was
+ meant for an OAI application that Ian Ibbotson had in
+ mind to implement. The "update" command in YAZ client
+ implements this on the client side. My plan is to make
+ this available in ZOOM "extended" soon..
+ -->
</para>
</listitem>
<listitem>
<para>
- Piggy-backed presents are honored in the search-request.
+ Piggy-backed presents are honored in the search request - that
+ is, a subset of the found records can be returned directly with
+ a search response, enabling search and retrieval to happen in a
+ single round-trip.
</para>
</listitem>
Named result sets are supported.
</para>
</listitem>
+
<listitem>
<para>
Easily configured to support different application profiles, with
<listitem>
<para>
- Complex composition specifications using Espec-1 are partially
- supported (simple element requests only).
+ Complex composition specifications using Espec-1 (partial support).
+ Element sets are defined using the Espec-1 capability,
+ and are specified in configuration files as simple element
+ requests (and, optionally, variant requests).
</para>
</listitem>
<listitem>
<para>
- Element Set Names are defined using the Espec-1 capability of the
- system, and are given in configuration files as simple element
- requests (and possibly variant requests).
+ Multiple record syntaxes
+ for data retrieval: GRS-1, SUTRS,
+ XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
+ and schemas on the fly.
</para>
</listitem>
</sect1>
+ <sect1 id="apps">
+ <title>Applications</title>
+ <para>
+ Zebra has been deployed in numerous applications, in both the
+ academic and commercial worlds, in application domains as diverse
+ as bibliographic catalogues, geospatial information, structured
+ vocabulary browsing, government information locators, civic
+ information systems, environmental observations, museum information
+ and web indexes.
+ </para>
+ <para>
+ Notable applications include the following:
+ </para>
+
+ <sect2>
+ <title>DADS - the DTV Article Database Service</title>
+ <para>
+ DADS is a huge database of more than ten million records, totalling
+ over ten gigabytes of data. The records are metadata about academic
+ journal articles, primarily scientific; about 10% of these
+ metadata records link to the full text of the articles they
+ describe, a body of about a terabyte of information (although the
+ full text is not indexed).
+ </para>
+ <para>
+ It allows students and researchers at DTU (Danmarks Tekniske
+ Universitet, the Technical University of Denmark) to find and order
+ articles from multiple databases in a single query. The database
+ contains literature on all engineering subjects. It's available
+ on-line through a web gateway, though currently only to registered
+ users.
+ </para>
+ <para>
+ More information can be found at
+ <ulink url="http://www.dtv.dk/help/dads/index_e.htm"/>.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
+ <para>
+ Fernuniversität Hagen in Germany have developed a natural
+ language interface for access to library databases
+ (<ulink url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/>).
+ In order to evaluate this interface for recall and precision, they
+ chose Zebra as the basis for measuring retrieval effectiveness. The Zebra
+ server contains a copy of the GIRT database, consisting of more
+ than 76,000 records in SGML format (bibliographic records from
+ social science), which are mapped to MARC for presentation.
+ </para>
+ <para>
+ (GIRT is the German Indexing and Retrieval Testdatabase. It is a
+ standard German-language test database for intelligent indexing
+ and retrieval systems. See
+ <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>)
+ </para>
+ <para>
+ Evaluation will take place as part of the TREC/CLEF campaign 2003
+ (<ulink url="http://clef.iei.pi.cnr.it"/> or
+ <ulink url="http://www4.eurospider.ch/CLEF/"/>).
+ </para>
+ <para>
+ For more information, contact Johannes Leveling
+ <email>Johannes.Leveling@FernUni-Hagen.De</email>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>ULS (Union List of Serials)</title>
+ <para>
+ The M25-Link systems team
+ (<ulink url="http://www.m25lib.ac.uk/M25link/"/>)
+ are involved in a project called ULS to provide a union catalogue
+ for periodicals in 21 member libraries. They do this with an
+ unusual architecture which they call a
+ "non-distributed virtual union catalogue".
+ </para>
+ <para>
+ The member libraries send in data files representing their
+ periodicals, including both brief bibliographic data and summary
+ holdings. Then 21 individual Z39.50 targets are created, each
+ using Zebra, and all mounted on a single hardware server.
+ The live service provides a web gateway allowing Z39.50 searching
+ of all of the targets or a selection of them. Zebra's small
+ footprint allows a relatively modest system to comfortably host
+ the 21 servers.
+ </para>
+ <para>
+ More information can be found at
+ <ulink url="http://www.m25lib.ac.uk/ULS/"/>.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Various web indexes</title>
+ <para>
+ Zebra has been used by a variety of institutions to construct
+ indexes of large web sites, typically in the region of tens of
+ millions of pages. In this role, it functions somewhat similarly
+ to the engine of Google or AltaVista, but for a selected intranet
+ or a subset of the whole Web.
+ </para>
+ <para>
+ For example, Liverpool University's web-search facility (available
+ from the home page at
+ <ulink url="http://www.liv.ac.uk/"/>
+ and many sub-pages) works by relevance-searching a Zebra database
+ which is populated by the Harvest-NG web-crawling software.
+ </para>
+ <para>
+ For more information on Liverpool University's intranet search
+ architecture, contact John Gilbertson
+ <email>jgilbert@liverpool.ac.uk</email>
+ </para>
+ <para>
+ Kang-Jin Lee
+ <email>lee@arco.de</email>
+ has recently modified the Harvest web indexer to use Zebra as
+ its native repository engine. His comments on the switch over
+ from the old engine are revealing:
+ <blockquote>
+ <para>
+ The first results after some testing with Zebra are very
+ promising. The tests were done with around 220,000 SOIF files,
+ which occupies 1.6GB of disk space.
+ </para>
+ <para>
+ Building the index from scratch takes around one hour with Zebra
+ where [old-engine] needs around five hours. While [old-engine]
+ blocks search requests when updating its index, Zebra can still
+ answer search requests.
+ [...]
+ Zebra supports incremental indexing which will speed up indexing
+ even further.
+ </para>
+ <para>
+ While the search time of [old-engine] varies from some seconds
+ to some minutes depending how expensive the query is, Zebra
+ usually takes around one to three seconds, even for expensive
+ queries.
+ [...]
+ Zebra can search more than 100 times faster than [old-engine]
+ and can process multiple search requests simultaneously
+ </para>
+ <para>
+ I am very happy to see such nice software available under GPL.
+ </para>
+ </blockquote>
+ </para>
+ </sect2>
+ </sect1>
+
+
+ <sect1 id="support">
+ <title>Support</title>
+ <para>
+ You can get support for Zebra from at least three sources.
+ </para>
+ <para>
+ First, there's the Zebra web site at
+ <ulink url="http://indexdata.dk/zebra/"/>,
+ which always has the most recent version available for download.
+ If you have a problem with Zebra, the first thing to do is see
+ whether it's fixed in the current release.
+ </para>
+ <para>
+ Second, there's the Zebra mailing list. Its home page at
+ <ulink url="http://indexdata.dk/mailman/listinfo/zebralist"/>
+ includes a complete archive of all messages that have ever been
+ posted on the list. The Zebra mailing list is used both for
+ announcements from the authors (new
+ releases, bug fixes, etc.) and general discussion. You are welcome
+ to seek support there. Join by sending email to
+ <email>zebra-request@indexdata.dk</email> with the word
+ <literal>subscribe</literal> in the body of the message.
+ </para>
+ <para>
+ Third, it's possible to buy a commercial support contract, with
+ well-defined service levels and response times, from Index Data.
+ See
+ <ulink url="http://indexdata.dk/support2/"/>
+ for details.
+ </para>
+ </sect1>
+
+
<sect1 id="future">
- <title>Future Work</title>
+ <title>Future Directions</title>
<para>
These are some of the plans that we have for the software in the near
- and far future, approximately ordered after their relative importance.
+ and far future, ordered approximately as we expect to work on them.
</para>
<para>
Improved support for XML in search and retrieval. Eventually,
the goal is for Zebra to pull double duty as a flexible
information retrieval engine and high-performance XML
- repository.
+ repository. The recent addition of XPath searching is one
+ example of the kind of enhancement we're working on.
</para>
</listitem>
<listitem>
<para>
- Access to search engine through SOAP/RPC API to allow the
+ Access to the search engine through a SOAP/RPC API to allow the
construction of applications without requiring Z39.50 tools.
+ This will shortly be available by means of Index Data's
+ SRW-to-Z39.50 gateway, currently in beta test.
</para>
</listitem>
<listitem>
<para>
- Finalisation, documentation of the Zebra API. Consider
- exposing the API through SOAP as well (allowing updates,
- database management).
+ Finalisation and documentation of Zebra's C programming
+ API, allowing updates, database management and other functions
+ not readily expressed in Z39.50. We will also consider
+ exposing the API through SOAP.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Support for the use of Perl both for access to the Zebra API
+ and for building extension "plug-ins" such as input filters.
+ The code for this has been contributed to the source tree by
+ Peter Popovics
+ <email>pop@indexdata.dk</email>,
+ and is in the process of being integrated and tested.
</para>
</listitem>
<para>
Programmers thrive on user feedback. If you are interested in a
facility that you don't see mentioned here, or if there's something
- you think we could do better, please drop us a mail.
+ you think we could do better, please drop us a mail. Better still,
+ implement it and send us the patches.
+ </para>
+ <para>
If you think it's all really neat, you're welcome to drop us a line
- saying that, too. You'll find contact info at the end of this file.
+ saying that, too. You can email us at
+ <email>info@indexdata.dk</email>
+ or check the contact info at the end of this manual.
</para>
</sect1>