<chapter id="introduction">
- <!-- $Id: introduction.xml,v 1.6 2002-08-02 19:26:55 adam Exp $ -->
+ <!-- $Id: introduction.xml,v 1.12 2002-08-30 01:17:10 mike Exp $ -->
<title>Introduction</title>
<sect1>
<title>Overview</title>
<para>
- The
<ulink url="http://www.indexdata.dk/zebra/">
Zebra</ulink>
- system is a fielded free-text indexing and retrieval engine with a
- Z39.50 front-end. You can use our various toolkits or any commercial
- or free-ware Z39.50 client to access data stored in Zebra.
+ is a high-performance, general-purpose structured text
+ indexing and retrieval engine. It reads structured records in a
+ variety of input formats (e.g. email, XML, MARC) and provides access
+ to them through a powerful combination of boolean search
+ expressions and relevance-ranked free-text queries.
</para>
-
- <para>
- FIXME - not a "first step" but a part of a complete system! -H
- </para>
-
+
<para>
- The Zebra server is our first step towards the development of a fully
- configurable, open information system. Eventually, it will be paired
- off with a powerful Z39.50 client to support complex information
- management tasks within almost any application domain. We're making
- the server available now because it's no fun to be in the open
- information retrieval business all by yourself. We want to allow
- people with interesting data to make their things
- available in interesting ways, without having to start out
- by implementing yet another protocol stack from scratch.
+ Zebra supports large databases (tens of millions of records,
+ tens of gigabytes of data). It allows safe, incremental
+ database updates on live systems. Because Zebra supports
+ the industry-standard information retrieval protocol, Z39.50,
+ you can search Zebra databases using an enormous variety of
+ programs and toolkits, both commercial and free, which understand
+ this protocol. Application libraries are available to allow
+ bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
+ Basic, Python, PHP and more - see
+ <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
+ for more information on some of these client toolkits.
</para>
-
+
<para>
- This document is an introduction to the Zebra system. It will tell you
- how to compile the software, and how to prepare your first database.
- It also explains how the server can be configured to give you the
+ This document is an introduction to the Zebra system. It explains
+ how to compile the software, how to prepare your first database,
+ and how to configure the server to give you the
functionality that you need.
</para>
<para>
-
- If you find the software interesting, you should visit the
- <ulink url="http://www.indexdata.dk/zebra/">
- Zebra web site</ulink>, where you can join the
+ If you use Zebra, you should visit its
+ <ulink url="http://www.indexdata.dk/zebra/">web site</ulink>,
+ where you can join the
<ulink url="http://www.indexdata.dk/mailman/listinfo/zebralist">
mailing-list</ulink>
by sending email to
+ <email>### zebra-subscribe@mailman.indexdata.dk</email>
</para>
</sect1>
<title>Features</title>
<para>
- This is a list of some of the most important features of the
- system.
+ This is an overview of some of Zebra's most important features:
</para>
<para>
<listitem>
<para>
- Supports large databases - files for indices, etc. can be
+ Very large databases: files for indexes, etc. can be
automatically partitioned over multiple disks.
</para>
</listitem>
<listitem>
<para>
- Supports arbitrarily complex records - base input format is an
- SGML-like syntax which allows nested (structured) data elements, as
- well as variant forms of data.
+ Arbitrarily complex records. The internal data format
+ is a structured format conceptually similar to XML or GRS-1,
+ which allows nested structured data elements and
+ variant forms of data.
</para>
</listitem>
<listitem>
<para>
- Robust updating - records can be added and deleted without
- rebuilding the index from scratch.
+ Robust updating - records can be added and deleted "on the fly"
+ without rebuilding the index from scratch.
+ Records can be safely updated even while users are accessing
+ the server.
The update procedure is tolerant of crashes or hard interrupts
- during register updating - registers can be reconstructed following
+ during database updating - data can be reconstructed following
a crash.
- Registers can be safely updated even while users are accessing
- the server.
</para>
</listitem>
<listitem>
<para>
- Supports random storage formats. A system of input filters driven by
+ Configurable to understand many input formats.
+ A system of input filters driven by
regular expressions allows you to easily process most ASCII-based
data formats. SGML, XML, ISO2709 (MARC), and raw text are also
supported.
<listitem>
<para>
- Supports boolean queries as well as relevance-ranking (free-text)
- searching. Right truncation and masking in terms are supported, as
- well as full regular expressions.
+ Searching supports a powerful combination of boolean queries
+ and relevance-ranked (free-text) queries. Truncation,
+ masking, full regular expression matching and "approximate
+ matching" (e.g. tolerating spelling mistakes) are all supported.
</para>
</listitem>
<listitem>
<para>
- Can import the data into Zebras own storage, or just refer to
- external files (html pages).
+ Index-only databases: data can be, and usually is, imported
+ into Zebra's own storage, but Zebra can also refer to
+ external files, building and maintaining indexes of "live"
+ collections.
</para>
</listitem>
<listitem>
<para>
- Supports multiple concrete syntaxes
- for record exchange (depending on the configuration): GRS-1, SUTRS,
- XML, ISO2709 (*MARC). Records can be mapped between record syntaxes
- and schema on the fly.
- </para>
- </listitem>
-
- <listitem>
- <para>
- Supports approximate matching in registers (ie. spelling mistakes,
- etc).
- </para>
- </listitem>
-
- <listitem>
- <para>
Zebra is written in portable C, so it runs on most Unix-like systems
- as well as Windows NT - a binary distribution for Windows NT is available.
+ as well as Windows NT. A binary distribution for Windows NT is
+ available.
</para>
</listitem>
</para>
<para>
- Protocol support:
+ Z39.50 protocol support:
</para>
<para>
<itemizedlist>
<listitem>
<para>
- Protocol facilities: Init, Search, Retrieve, Delete, Browse and Sort.
+ Protocol facilities: Init, Search, Present (retrieval), Delete,
+ Scan (index browsing) and Sort.
</para>
</listitem>
Named result sets are supported.
</para>
</listitem>
+
<listitem>
<para>
Easily configured to support different application profiles, with
<listitem>
<para>
- Complex composition specifications using Espec-1 are partially
- supported (simple element requests only).
+ Complex composition specifications using Espec-1 (partial support).
+ Element sets are defined using the Espec-1 capability,
+ and are specified in configuration files as simple element
+ requests (and, optionally, variant requests).
</para>
</listitem>
<listitem>
<para>
- Element Set Names are defined using the Espec-1 capability of the
- system, and are given in configuration files as simple element
- requests (and possibly variant requests).
+ Multiple record syntaxes
+ for data retrieval: GRS-1, SUTRS,
+ XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
+ and schemas on the fly.
</para>
</listitem>
</sect1>
+ <sect1 id="apps">
+ <title>Applications</title>
+ <para>
+ Zebra has been deployed in numerous applications, in both the
+ academic and commercial worlds, in application domains as diverse
+ as bibliographic catalogues, geospatial information, structured
+ vocabulary browsing, government information locators, civic
+ information systems, environmental observations, museum information
+ and web indexes.
+ </para>
+ <para>
+ Notable applications include the following:
+ </para>
+
+ <sect2>
+ <title>DADS - the DTV Article Database Service</title>
+ <para>
+ DADS is a huge database of more than ten million records, totalling
+ over ten gigabytes of data. The records are metadata about academic
+ journal articles, primarily scientific; about 10% of these
+ metadata records link to the full text of the articles they
+ describe, a body of about a terabyte of information (although the
+ full text is not indexed).
+ </para>
+ <para>
+ It allows students and researchers at DTU (Danmarks Tekniske
+ Universitet, the Technical University of Denmark) to find and order
+ articles from multiple databases in a single query. The database
+ contains literature on all engineering subjects. It's available
+ on-line through a web gateway, though currently only to registered
+ users.
+ </para>
+ <para>
+ More information can be found at
+ <ulink url="http://www.dtv.dk/help/dads/index_e.htm"/>
+ </para>
+ </sect2>
+
+<!--
+Envelope-to: zebra@miketaylor.org.uk
+From: Johannes Leveling <Johannes.Leveling@FernUni-Hagen.de>
+Content-Type: text/plain; charset=iso-8859-1
+Date: Thu, 29 Aug 2002 19:19:55 +0200
+To: zebra@miketaylor.org.uk
+Subject: [Zebralist] Looking for Deployment Stories
+In-Reply-To: <200208281002.LAA16526@seatbooker.net>
+X-Virus-Scanned: by AMaViS perl-11
+X-MIME-Autoconverted: from quoted-printable to 8bit by localhost.localdomain id g7TLWR905724
+
+Mike Taylor writes:
+ > People,
+ >
+ > In collaboration with Sebastian, Adam and Heikki, I am reworking some
+ > parts of the Zebra documentation in preparation for the forthcoming
+ > release. One area I am keen to expand on is (briefly) describing
+ > interesting applications of Zebra. If you've deployed it in a way
+ > that you consider interesting, I'd love to hear from you, however
+ > briefly. Think of this as a chance to get some free publicity for
+ > your application in the Zebra documentation.
+ >
+ > Replies off-list to <zebra@miketaylor.org.uk>, please.
+ >
+ > _/|_ _______________________________________________________________
+ > /o ) \/ Mike Taylor <mike@miketaylor.org.uk> www.miketaylor.org.uk
+ > )_v__/\ There are some good things you can never have too much of.
+ >
+ >
+ > _______________________________________________
+ > Zebralist mailing list
+ > Zebralist@indexdata.dk
+ > http://www.indexdata.dk/mailman/listinfo/zebralist
+ >
+Intersting?
+We have developed a natural language interface (NLI-Z39.50) for access
+to library databases at the Fernuniversität Hagen, Germany
+(http://ki212.fernuni-hagen.de/nli/NLI.html).
+To prepare formal information retrieval evaluation,
+we chose the Zebra server as the basis for
+evaluating retrieval effectiveness (measuring recall
+and precision for the GIRT database). The Zebra database
+consists of more than 76000 records in SGML format (bibliographic
+records from social science), which are mapped to MARC for presentation.
+Evaluation will take place as part of the TREC/CLEF campaign 2003
+(see http://clef.iei.pi.cnr.it or http://www4.eurospider.ch/CLEF/).
+
+
+Johannes Leveling Praktische Informatik VII/KI
+ FernUniversität Hagen
+
+Email : Johannes.Leveling@FernUni-Hagen.De
+Tel. : +49 2331 987-4525
+
+-->
+
+ <sect2>
+ <title>Various web indexes</title>
+ <para>
+ Zebra has been used by a variety of institutions to construct
+ indexes of large web sites, typically in the region of tens of
+ millions of pages. In this role, it functions somewhat similarly
+ to the engine of Google or AltaVista, but for a selected intranet
+ or subset of the whole Web.
+ </para>
+ <para>
+ ### examples, details and numbers, please!
+ </para>
+ </sect2>
+ </sect1>
+
<sect1 id="future">
- <title>Future Work</title>
+ <title>Future Directions</title>
<para>
These are some of the plans that we have for the software in the near
- and far future, approximately ordered after their relative importance.
- Items marked with an
- asterisk will be implemented before the
- last beta release.
- FIXME - What are the current plans?
+ and far future, ordered approximately as we expect to work on them.
</para>
<para>
<listitem>
<para>
- *Finalize the data element <emphasis>include</emphasis> facility
- to support multimedia data elements in records.
+ Improved support for XML in search and retrieval. Eventually,
+ the goal is for Zebra to pull double duty as a flexible
+ information retrieval engine and high-performance XML
+ repository.
</para>
</listitem>
<listitem>
<para>
- Add more sophisticated relevance ranking mechanisms.
- Add support for soundex and stemming.
- Add relevance <emphasis>feedback</emphasis> support.
+ Access to the search engine through a SOAP/RPC API to allow the
+ construction of applications without requiring Z39.50 tools.
</para>
</listitem>
<listitem>
<para>
- Complete EXPLAIN support.
+ Finalisation and documentation of Zebra's C programming
+ API, allowing updates, database management and other functions
+ not readily expressed in Z39.50. We will also consider
+ exposing the API through SOAP.
</para>
</listitem>
<listitem>
<para>
- Add support for very large records by implementing segmentation and/or
- variant pieces.
+ Improved free-text searching. We're first and foremost octet jockeys and
+ we're actively looking for organisations or people who'd like
+ to contribute experience in relevance ranking and text
+ searching.
</para>
</listitem>
- <listitem>
- <para>
- Support the Item Update extended service of the protocol.
- </para>
- </listitem>
-
- <listitem>
- <para>
- We want to add a management system that allows you to
- control your databases and configuration tables from a graphical
- interface.
- </para>
- </listitem>
</itemizedlist>
</para>
<para>
Programmers thrive on user feedback. If you are interested in a
facility that you don't see mentioned here, or if there's something
- you think we could do better, please drop us a mail.
+ you think we could do better, please drop us a mail. Better still,
+ implement it and send us the patches.
+ </para>
+ <para>
If you think it's all really neat, you're welcome to drop us a line
saying that, too. You'll find contact info at the end of this file.
</para>