X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fintroduction.xml;h=29d759dc84706a10dfaa74e2ee2e8fbbb2e909ae;hb=1b368282fac5d740ffc22c049e2b4881ca8d3a1a;hp=03401df0eeea89962fa1ddb5757eb71e0d911527;hpb=bffe964768496135023ab242d6b468558fa1c2be;p=idzebra-moved-to-github.git diff --git a/doc/introduction.xml b/doc/introduction.xml index 03401df..29d759d 100644 --- a/doc/introduction.xml +++ b/doc/introduction.xml @@ -1,15 +1,14 @@ - + Introduction - +
Overview - - Zebra + Zebra is a high-performance, general-purpose structured text - indexing and retrieval engine. It reads structured records in a + indexing and retrieval engine. It reads records in a variety of input formats (eg. email, XML, MARC) and provides access to them through a powerful combination of boolean search expressions and relevance-ranked free-text queries. @@ -24,8 +23,8 @@ programs and toolkits, both commercial and free, which understand this protocol. Application libraries are available to allow bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual - Basic, Python, PHP and more - see - the ZOOM web site + Basic, Python, PHP and more - see the + ZOOM web site for more information on some of these client toolkits. @@ -35,157 +34,224 @@ and how to configure the server to give you the functionality that you need. - - - If you use Zebra, you should visit its - web site, - where you can join the - - mailing-list - by sending email to - ### zebra-subscribe@mailman.indexdata.dk - - - +
- - Features +
+ Zebra Features Overview - - This is an overview of some of Zebra's most important features: - - - - - - - - Very large databases: files for indexes, etc. can be - automatically partitioned over multiple disks. - - - - - - Arbitrarily complex records. The internal data format - is an structured format conceptually similar to XML or GRS-1, - which allows nested structured data elements and - variant forms of data. - - - - - Robust updating - records can be added and deleted ``on the fly'' + + Zebra Features Overview + + + + Feature + Availability + Notes + Reference + + + + + Boolean query language + CQL and RPN/PQF + The type-1 Reverse Polish Notation (RPN) + and its textual representation, Prefix Query Format (PQF), are + supported. The Common Query Language (CQL) can be configured as + a mapping from CQL to RPN/PQF + + + + + Operation types + Z39.50/SRU explain, search, and scan + + + + + Recursive boolean query tree + CQL and RPN/PQF + Both CQL and RPN/PQF allow atomic query parts (APT) to + be combined into complex boolean query trees + + + + Large databases + 64-bit file pointers ensure that register files can exceed + the 2 GB limit. Logical files can be + automatically partitioned over multiple disks, thus allowing for + large databases. + + + + + Complex semi-structured Documents + XML and GRS-1 Documents + Both XML and GRS-1 documents exhibit a DOM-like internal + representation allowing for complex indexing and display rules + + + + Database updates + live, incremental updates + Robust updating - records can be added and deleted ``on the fly'' without rebuilding the index from scratch. Records can be safely updated even while users are accessing the server. The update procedure is tolerant to crashes or hard interrupts during database updating - data can be reconstructed following - a crash. - - - - - - Configurable to understand many input formats. 
- A system of input filters driven by - regular expressions allows you to easily process most ASCII-based - data formats. SGML, XML, ISO2709 (MARC), and raw text are also - supported. - - - - - - Searching supports a powerful combination of boolean queries as - well as relevance-ranking (free-text) queries. Truncation, - masking, full regular expression matching and "approximate - matching" (eg. spelling mistakes) are all supported. - - - - - - Index-only databases: data can be, and usually is, imported + a crash. + + + + Input document formats + XML, SGML, Text, ISO2709 (MARC) + + A system of input filters driven by + regular expressions allows most ASCII-based + data formats to be easily processed. + SGML, XML, ISO2709 (MARC), and raw text are also + supported. + + + + Relevance ranking + TF-IDF like + Relevance-ranking of free-text queries is supported + using a TF-IDF like algorithm. + + + + Document storage + Index-only, Key storage, Document storage + Data can be, and usually is, imported into Zebra's own storage, but Zebra can also refer to external files, building and maintaining indexes of "live" - collections. - - - - - - Zebra is written in portable C, so it runs on most Unix-like systems - as well as Windows NT. A binary distribution for Windows NT is - available. - - - - - - - - - Z39.50 protocol support: - - - - - - - Protocol facilities: Init, Search, Present (retrieval), Delete, - Scan (index browsing) and Sort. - - - - - - Piggy-backed presents are honored in the search-request. - - - - - - Named result sets are supported. - - - - - - Easily configured to support different application profiles, with - tables for attribute sets, tag sets, and abstract syntaxes. - Additional tables control facilities such as element mappings to - different schema (eg., GILS-to-USMARC). - - - - - - Complex composition specifications using Espec-1 (partial support). 
- Element sets are defined using the Espec-1 capability, - and are specified in configuration files as simple element - requests (and, optionally, variant requests). - - - - - - Multiple record syntaxes + collections. + + + + Regular expression matching + Regexp + Full regular expression matching and "approximate + matching" (e.g. spelling mistake corrections) are handled. + + + + Search truncation + + + + + + Remote update + Z39.50 extended services + + + + + Supported Platforms + UNIX, Linux, Windows (NT/2000/2003/XP) + Zebra is written in portable C, so it runs on most + Unix-like systems as well as Windows (NT/2000/2003/XP). Binary + distributions are + available for Debian GNU/Linux and Windows + + + + Z39.50 + Z39.50 protocol support + Protocol facilities: Init, Search, Present (retrieval), + Segmentation (support for very large records), Delete, Scan + (index browsing), Sort, Close and support for the ``update'' + Extended Service to add or replace an existing XML + record. Piggy-backed presents are honored in the search + request. Named result sets are supported. + + + + Record Syntaxes + + Multiple record syntaxes for data retrieval: GRS-1, SUTRS, XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes - and schemas on the fly. - - + and schemas on the fly. + + + + Web Service support + SRU GET/POST/SOAP + The protocol operations explain, + searchRetrieve and scan + are supported. Conversion from CQL to the internal + RPN query model is supported. Extended RPN queries + for search/retrieve and scan are supported. + + + +
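The "Recursive boolean query tree" row of the feature table can be illustrated with a small sketch. The Python below is not part of Zebra; the class and function names (Term, Bool, to_pqf) are hypothetical and chosen only for this illustration. It renders a nested boolean tree of atomic query parts as a PQF string, using the Bib-1 use attributes 4 (title) and 1003 (author). Terms that contain quotes or backslashes are rejected rather than escaped, so that the sketch makes no claim about PQF escaping rules.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class Term:
    """Atomic query part: one search term plus (type, value) attribute pairs."""
    text: str
    attrs: Tuple[Tuple[int, int], ...] = ()


@dataclass
class Bool:
    """Boolean node combining two sub-queries; op is 'and', 'or' or 'not'."""
    op: str
    left: "Node"
    right: "Node"


Node = Union[Term, Bool]


def to_pqf(node: Node) -> str:
    """Render a query tree in prefix order, operator first."""
    if isinstance(node, Term):
        if '"' in node.text or "\\" in node.text:
            # Escaping is deliberately not handled in this sketch.
            raise ValueError("term contains characters this sketch does not escape")
        attrs = " ".join(f"@attr {t}={v}" for t, v in node.attrs)
        return f'{attrs} "{node.text}"'.strip()
    if node.op not in ("and", "or", "not"):
        raise ValueError(f"unknown boolean operator: {node.op}")
    return f"@{node.op} {to_pqf(node.left)} {to_pqf(node.right)}"


# title=zebra AND (title=indexing OR author=smith)
tree = Bool(
    "and",
    Term("zebra", ((1, 4),)),
    Bool("or", Term("indexing", ((1, 4),)), Term("smith", ((1, 1003),))),
)
print(to_pqf(tree))
```

Because PQF is a prefix notation, the recursion needs no parentheses: each operator is simply followed by its two operands, which may themselves be operators.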
+ -
- -
- +
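The feature table lists the SRU operations explain, searchRetrieve and scan over GET. As a minimal sketch of what such requests look like, the Python below builds the three GET URLs with the standard library only. The base URL, the SRU version (1.1) and the index name "title" are assumptions made for this example; what a given Zebra server actually accepts depends on its configuration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; replace with the address of a real SRU-enabled server.
BASE_URL = "http://localhost:9999/"


def sru_url(operation, **params):
    """Build an SRU GET URL for one of the three operations named in the table."""
    if operation not in ("explain", "searchRetrieve", "scan"):
        raise ValueError(f"unsupported SRU operation: {operation}")
    # SRU version 1.1 is an assumption of this sketch.
    query = {"operation": operation, "version": "1.1", **params}
    return BASE_URL + "?" + urlencode(query)


explain_url = sru_url("explain")
# The CQL query is URL-encoded by urlencode; "title" is an assumed index name.
search_url = sru_url("searchRetrieve", query='title = "zebra"', maximumRecords=10)
scan_url = sru_url("scan", scanClause="title = zebra")

for url in (explain_url, search_url, scan_url):
    print(url)

# Decoding the query string again recovers the original CQL text.
decoded = parse_qs(urlsplit(search_url).query)
print(decoded["query"][0])
```

The URLs can be fetched with any HTTP client; the server answers with an XML response document.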
- - Applications +
+ References and Zebra based Applications Zebra has been deployed in numerous applications, in both the academic and commercial worlds, in application domains as diverse @@ -198,9 +264,110 @@ Notable applications include the following: - - DADS - the DTV Article Database Service + +
+ Koha free open-source ILS + + Koha is a full-featured + open-source ILS, initially developed in + New Zealand by Katipo Communications Ltd, and first deployed in + January of 2000 for Horowhenua Library Trust. It is currently + maintained by a team of software providers and library technology + staff from around the globe. + + + LibLime, + a company that markets and supports Koha, uses the Zebra + database server in the new Koha 3.0 release + to drive its bibliographic database. + + + In early 2005, the Koha project development team began looking at + ways to improve MARC support and overcome scalability limitations + in the Koha 2.x series. After extensive evaluations of the best + of the Open Source textual database engines - including MySQL + full-text searching, PostgreSQL, Lucene and Plucene - the team + selected Zebra. + + + "Zebra completely eliminates scalability limitations, because it + can support tens of millions of records," explained Joshua + Ferraro, LibLime's Technology President and Koha's Project + Release Manager. "Our performance tests showed search results in + under a second for databases with over 5 million records on a + modest i386 900Mhz test server." + + + "Zebra also includes support for true boolean search expressions + and relevance-ranked free-text queries, both of which the Koha + 2.x series lack. Zebra also supports incremental and safe + database updates, which allow on-the-fly record + management. Finally, since Zebra has at its heart the Z39.50 + protocol, it greatly improves Koha's support for that critical + library standard." + + + Although the bibliographic database will be moved to Zebra, Koha + 3.0 will continue to use a relational SQL-based database design + for the 'factual' database. 
"Relational database managers have + their strengths, in spite of their inability to handle large + numbers of bibliographic records efficiently," summed up Ferraro, + "We're taking the best from both worlds in our redesigned Koha + 3.0." + + + See also LibLime's newsletter article + + Koha Earns its Stripes. + +
+ +
+ Emilda open source ILS + Emilda + is a complete Integrated Library System, released under the + GNU General Public License. It has a + full-featured Web-OPAC, allowing comprehensive system management + from virtually any computer with an Internet connection, a + template-based layout allowing anyone to alter the visual + appearance of Emilda, and an + XML-based language system for fast and easy portability to virtually any + language. + Currently, Emilda is used at three schools in Espoo, Finland. + + + In addition, 100% MARC compatibility has been achieved by using the + Zebra Server from Index Data as the backend server. + +
+ +
+ ReIndex.Net web-based ILS + + Reindex.net + is a net-based library service offering all + traditional functions at a very high level plus many new + services. Reindex.net is a comprehensive and powerful web system + based on standards such as XML and Z39.50. + Reindex supports MARC21, danMARC or Dublin Core with + UTF-8 encoding. + + + Reindex.net runs on Debian GNU/Linux with Zebra and Simpleserver + from Index + Data for bibliographic data. The relational database system + Sybase 9 XML is used for + administrative data. + Internally, MARCXML is used for bibliographic records. Updates + use Z39.50 extended services. + +
+ +
+ DADS - the DTV Article Database + Service + DADS is a huge database of more than ten million records, totalling over ten gigabytes of data. The records are metadata about academic journal articles, primarily scientific; about 10% of these @@ -218,82 +385,203 @@ More information can be found at - + and + - +
- +
+ NLI-Z39.50 - a Natural Language Interface for Libraries + + Fernuniversität Hagen in Germany have developed a natural + language interface for access to library databases. + + In order to evaluate this interface for recall and precision, they + chose Zebra as the basis for retrieval effectiveness. The Zebra + server contains a copy of the GIRT database, consisting of more + than 76000 records in SGML format (bibliographic records from + social science), which are mapped to MARC for presentation. + + + (GIRT is the German Indexing and Retrieval Testdatabase. It is a + standard German-language test database for intelligent indexing + and retrieval systems. See + ) + + + Evaluation will take place as part of the TREC/CLEF campaign 2003 + . + + + + For more information, contact Johannes Leveling + Johannes.Leveling@FernUni-Hagen.De + +
- +
Various web indexes Zebra has been used by a variety of institutions to construct indexes of large web sites, typically in the region of tens of millions of pages. In this role, it functions somewhat similarly to the engine of Google or AltaVista, but for a selected intranet - or subset of the whole Web. + or a subset of the whole Web. + + + For example, Liverpool University's web-search facility (see on + the home page at + + and many sub-pages) works by relevance-searching a Zebra database + which is populated by the Harvest-NG web-crawling software. - ### examples, details and numbers, please! + For more information on Liverpool University's intranet search + architecture, contact John Gilbertson + jgilbert@liverpool.ac.uk - - + + Kang-Jin Lee + has recently modified the Harvest web indexer to use Zebra as + its native repository engine. His comments on the switchover + from the old engine are revealing: +
+ + The first results after some testing with Zebra are very + promising. The tests were done with around 220,000 SOIF files, + which occupies 1.6GB of disk space. + + + Building the index from scratch takes around one hour with Zebra + where [old-engine] needs around five hours. While [old-engine] + blocks search requests when updating its index, Zebra can still + answer search requests. + [...] + Zebra supports incremental indexing which will speed up indexing + even further. + + + While the search time of [old-engine] varies from some seconds + to some minutes depending how expensive the query is, Zebra + usually takes around one to three seconds, even for expensive + queries. + [...] + Zebra can search more than 100 times faster than [old-engine] + and can process multiple search requests simultaneously + + + I am very happy to see such nice software available under GPL. + +
+
+
+
+ + +
+ Support + + You can get support for Zebra from at least three sources. + + + First, there's the Zebra web site at + , + which always has the most recent version available for download. + If you have a problem with Zebra, the first thing to do is see + whether it's fixed in the current release. + + + Second, there's the Zebra mailing list. Its home page at + + includes a complete archive of all messages that have ever been + posted on the list. The Zebra mailing list is used both for + announcements from the authors (new + releases, bug fixes, etc.) and general discussion. You are welcome + to seek support there. Join by filling in the form on the list home page. + + + Third, it's possible to buy a commercial support contract, with + well-defined service levels and response times, from Index Data. + See + + for details. + +
- + +
Future Directions @@ -309,14 +597,17 @@ Tel. : +49 2331 987-4525 Improved support for XML in search and retrieval. Eventually, the goal is for Zebra to pull double duty as a flexible information retrieval engine and high-performance XML - repository. + repository. The recent addition of XPath searching is one + example of the kind of enhancement we're working on. - - - - Access to search engine through SOAP/RPC API to allow the - construction of applications without requiring Z39.50 tools. + There is also the experimental ALVIS XSLT + XML input filter, which unleashes the full power of DOM-based + XSLT transformations during indexing and record retrieval. Work + on this filter has been sponsored by the ALVIS EU project + . We expect this filter to + mature soon, as it is planned to be included in the version 2.0 + release of Zebra. @@ -349,10 +640,12 @@ Tel. : +49 2331 987-4525 If you think it's all really neat, you're welcome to drop us a line - saying that, too. You'll find contact info at the end of this file. + saying that, too. You can email us at + info@indexdata.dk + or check the contact info at the end of this manual. - +