<chapter id="introduction">
<!-- $Id: introduction.xml,v 1.19 2002-10-20 14:02:03 mike Exp $ -->
<title>Introduction</title>
<title>Overview</title>
<ulink url="http://indexdata.dk/zebra/">
is a high-performance, general-purpose structured text
indexing and retrieval engine. It reads structured records in a
variety of input formats (e.g. email, XML, MARC) and provides access
to them through a powerful combination of boolean search
expressions and relevance-ranked free-text queries.
Zebra supports large databases (tens of millions of records,
tens of gigabytes of data). It allows safe, incremental
database updates on live systems. Because Zebra supports
the industry-standard information retrieval protocol, Z39.50,
you can search Zebra databases using an enormous variety of
programs and toolkits, both commercial and free, which understand
this protocol. Application libraries are available to allow
bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
Basic, Python, PHP and more - see
<ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
for more information on some of these client toolkits.
This document is an introduction to the Zebra system. It explains
how to compile the software, how to prepare your first database,
and how to configure the server to give you the
functionality that you need.
<title>Features</title>
This is an overview of some of Zebra's most important features:
Very large databases: files for indexes, etc. can be
automatically partitioned over multiple disks.
Arbitrarily complex records. The internal data format
is a structured format conceptually similar to XML or GRS-1,
which allows lists, nested structured data elements and
variant forms of data.
Robust updating - records can be added and deleted ``on the fly''
without rebuilding the index from scratch.
Records can be safely updated even while users are accessing
The update procedure is tolerant of crashes or hard interrupts
during database updating - data can be reconstructed following
Configurable to understand many input formats.
A system of input filters driven by
regular expressions allows most ASCII-based
data formats to be easily processed.
SGML, XML, ISO2709 (MARC), and raw text are also
Searching supports a powerful combination of boolean queries as
well as relevance-ranking (free-text) queries. Truncation,
masking, full regular expression matching and "approximate
matching" (e.g. spelling mistakes) are all handled.
Index-only databases: data can be, and usually is, imported
into Zebra's own storage, but Zebra can also refer to
external files, building and maintaining indexes of "live"
Zebra is written in portable C, so it runs on most Unix-like systems
as well as Windows NT. A binary distribution for Windows NT is
<ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>,
and pre-built packages are available for some Linux
<ulink url="http://ftp.indexdata.dk/pub/zebra/RedHat7.X/"/>
and Debian packages at
<ulink url="http://ftp.indexdata.dk/pub/zebra/debian/"/>
Z39.50 protocol support:
Protocol facilities: Init, Search, Present (retrieval),
Segmentation (support for very large records), Delete, Scan
(index browsing), Sort, Close and support for the ``update''
Extended Service to add or replace an existing XML record.
You can insert/delete/replace an XML record given an
"external" ID. Actually this way of doing ES Update was
meant for an OAI application that Ian Ibbotson had in
mind to implement. The "update" command in YAZ client
implements this on the client side. My plan is to make
this available in ZOOM "extended" soon.
Piggy-backed presents are honored in the search request - that
is, a subset of the found records can be returned directly with
a search response, enabling search and retrieval to happen in a
Named result sets are supported.
Easily configured to support different application profiles, with
tables for attribute sets, tag sets, and abstract syntaxes.
Additional tables control facilities such as element mappings to
different schemas (e.g., GILS-to-USMARC).
Complex composition specifications using Espec-1 (partial support).
Element sets are defined using the Espec-1 capability,
and are specified in configuration files as simple element
requests (and, optionally, variant requests).
Multiple record syntaxes
for data retrieval: GRS-1, SUTRS,
XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
and schemas on the fly.
<title>Applications</title>
Zebra has been deployed in numerous applications, in both the
academic and commercial worlds, in application domains as diverse
as bibliographic catalogues, geospatial information, structured
vocabulary browsing, government information locators, civic
information systems, environmental observations, museum information
Notable applications include the following:
<title>DADS - the DTV Article Database Service</title>
DADS is a huge database of more than ten million records, totalling
over ten gigabytes of data. The records are metadata about academic
journal articles, primarily scientific; about 10% of these
metadata records link to the full text of the articles they
describe, a body of about a terabyte of information (although the
full text is not indexed).
It allows students and researchers at DTU (Danmarks Tekniske
Universitet, the Technical University of Denmark) to find and order
articles from multiple databases in a single query. The database
contains literature on all engineering subjects. It's available
on-line through a web gateway, though currently only to registered
More information can be found at
<ulink url="http://www.dtv.dk/help/dads/index_e.htm"/>
<title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
Fernuniversität Hagen in Germany have developed a natural
language interface for access to library databases.
<ulink url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/>
In order to evaluate this interface for recall and precision, they
chose Zebra as the basis for retrieval effectiveness. The Zebra
server contains a copy of the GIRT database, consisting of more
than 76000 records in SGML format (bibliographic records from
social science), which are mapped to MARC for presentation.
(GIRT is the German Indexing and Retrieval Testdatabase. It is a
standard German-language test database for intelligent indexing
and retrieval systems. See
<ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>)
Evaluation will take place as part of the TREC/CLEF campaign 2003
<ulink url="http://clef.iei.pi.cnr.it"/> or <ulink url="http://www4.eurospider.ch/CLEF/"/>
For more information, contact Johannes Leveling
<email>Johannes.Leveling@FernUni-Hagen.De</email>
<title>ULS (Union List of Serials)</title>
The M25-Link systems team
(<ulink url="http://www.m25lib.ac.uk/M25link/"/>)
are involved in a project called ULS to provide a union catalogue
for periodicals in 21 member libraries. They do this with an
unusual architecture which they call a
``non-distributed virtual union catalogue''.
The member libraries send in data files representing their
periodicals, including both brief bibliographic data and summary
holdings. Then 21 individual Z39.50 targets are created, each
using Zebra, and all mounted on a single hardware server.
The live service provides a web gateway allowing Z39.50 searching
of all of the targets or a selection of them. Zebra's small
footprint allows a relatively modest system to comfortably host
More information can be found at
<ulink url="http://www.m25lib.ac.uk/ULS/"/>
<title>Various web indexes</title>
Zebra has been used by a variety of institutions to construct
indexes of large web sites, typically in the region of tens of
millions of pages. In this role, it functions somewhat similarly
to the engine of Google or AltaVista, but for a selected intranet
or a subset of the whole Web.
For example, Liverpool University's web-search facility (see on
<ulink url="http://www.liv.ac.uk/"/>
and many sub-pages) works by relevance-searching a Zebra database
which is populated by the Harvest-NG web-crawling software.
For more information, contact John Gilbertson
<email>jgilbert@liverpool.ac.uk</email>
<title>Support</title>
You can get support for Zebra from at least three sources.
First, there's the Zebra web site at
<ulink url="http://indexdata.dk/zebra/"/>,
which always has the most recent version available for download.
If you have a problem with Zebra, the first thing to do is see
whether it's fixed in the current release.
Second, there's the Zebra mailing list. Its home page at
<ulink url="http://indexdata.dk/mailman/listinfo/zebralist"/>
includes a complete archive of all messages that have ever been
posted on the list. The Zebra mailing list is used both for
announcements from the authors (new
releases, bug fixes, etc.) and general discussion. You are welcome
to seek support there. Join by sending email to
<email>zebra-request@indexdata.dk</email>. Put the word
<literal>subscribe</literal> in the body of the message.
Third, it's possible to buy a commercial support contract, with
well-defined service levels and response times, from Index Data.
<ulink url="http://indexdata.dk/support/?lang=en"/>
<!-- ### compare this page with http://indexdata.dk/support2/ -->
<title>Future Directions</title>
These are some of the plans that we have for the software in the near
and far future, ordered approximately as we expect to work on them.
Improved support for XML in search and retrieval. Eventually,
the goal is for Zebra to pull double duty as a flexible
information retrieval engine and high-performance XML
Access to search engine through SOAP/RPC API to allow the
construction of applications without requiring Z39.50 tools.
### Partially done, thanks to the new SRW/Z39.50 gateway.
Finalisation and documentation of Zebra's C programming
API, allowing updates, database management and other functions
not readily expressed in Z39.50. We will also consider
exposing the API through SOAP.
Improved free-text searching. We're first and foremost octet jockeys and
we're actively looking for organisations or people who'd like
to contribute experience in relevance ranking and text
Programmers thrive on user feedback. If you are interested in a
facility that you don't see mentioned here, or if there's something
you think we could do better, please drop us an email. Better still,
implement it and send us the patches.
If you think it's all really neat, you're welcome to drop us a line
saying that, too. You can email us at
<email>info@indexdata.dk</email>
or check the contact info at the end of this manual.
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t