X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fintroduction.xml;h=83032319ae55e3168c9ae17c720876f6e3c27a9b;hb=d6c8724ed6eeb2625a6ee316f2130817a314fbe7;hp=b9a68d2e4c9001cd6799715fd9f7d42650d090b8;hpb=79e9818dfb6b9a0a04bdd6bc6467c8dae3b8f493;p=idzebra-moved-to-github.git diff --git a/doc/introduction.xml b/doc/introduction.xml index b9a68d2..8303231 100644 --- a/doc/introduction.xml +++ b/doc/introduction.xml @@ -1,5 +1,5 @@ - + Introduction @@ -52,8 +52,7 @@ Features - This is an overview of some of the most important features of the - system. + This is an overview of some of Zebra's most important features: @@ -61,34 +60,36 @@ - Supports large databases - files for indexes, etc. can be + Very large databases: files for indexes, etc. can be automatically partitioned over multiple disks. - Supports arbitrarily complex records - base input format is an - SGML-like syntax which allows nested (structured) data elements, as - well as variant forms of data. + Arbitrarily complex records. The internal data format + is a structured format conceptually similar to XML or GRS-1, + which allows nested structured data elements and + variant forms of data. - Robust updating - records can be added and deleted without - rebuilding the index from scratch. + Robust updating - records can be added and deleted ``on the fly'' + without rebuilding the index from scratch. + Records can be safely updated even while users are accessing + the server. The update procedure is tolerant to crashes or hard interrupts - during register updating - registers can be reconstructed following + during database updating - data can be reconstructed following a crash. - Registers can be safely updated even while users are accessing - the server. - Supports random storage formats. A system of input filters driven by + Configurable to understand many input formats. + A system of input filters driven by regular expressions allows you to easily process most ASCII-based data formats. 
SGML, XML, ISO2709 (MARC), and raw text are also supported. @@ -97,40 +98,27 @@ - Supports boolean queries as well as relevance-ranking (free-text) - searching. Right truncation and masking in terms are supported, as - well as full regular expressions. + Searching supports a powerful combination of boolean queries as + well as relevance-ranking (free-text) queries. Truncation, + masking, full regular expression matching and "approximate + matching" (eg. spelling mistakes) are all supported. - Can import the data into Zebras own storage, or just refer to - external files (good for building indexes of "live" - collections). + Index-only databases: data can be, and usually is, imported + into Zebra's own storage, but Zebra can also refer to + external files, building and maintaining indexes of "live" + collections. - Supports multiple concrete syntaxes - for record exchange (depending on the configuration): GRS-1, SUTRS, - XML, ISO2709 (*MARC). Records can be mapped between record syntaxes - and schema on the fly. - - - - - - Supports approximate matching in registers (ie. spelling mistakes, - etc). - - - - - Zebra is written in portable C, so it runs on most Unix-like systems - as well as Windows NT - a binary distribution for Windows NT is available. + as well as Windows NT. A binary distribution for Windows NT is + available. @@ -146,7 +134,8 @@ - Protocol facilities: Init, Search, Retrieve, Delete, Browse and Sort. + Protocol facilities: Init, Search, Present (retrieval), Delete, + Scan (index browsing) and Sort. @@ -161,6 +150,7 @@ Named result sets are supported. + Easily configured to support different application profiles, with @@ -172,16 +162,19 @@ - Complex composition specifications using Espec-1 are partially - supported (simple element requests only). + Complex composition specifications using Espec-1 (partial support). 
+ Element sets are defined using the Espec-1 capability, + and are specified in configuration files as simple element + requests (and, optionally, variant requests). - Element Set Names are defined using the Espec-1 capability of the - system, and are given in configuration files as simple element - requests (and possibly variant requests). + Multiple record syntaxes + for data retrieval: GRS-1, SUTRS, + XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes + and schemas on the fly. @@ -191,12 +184,15 @@ - + Applications Zebra has been deployed in numerous applications, in both the academic and commercial worlds, in application domains as diverse - as bibliographic information, geospatial, ### (Help, guys!) + as bibliographic catalogues, geospatial information, structured + vocabulary browsing, government information locators, civic + information systems, environmental observations, museum information + and web indexes. Notable applications include the following: @@ -205,7 +201,7 @@ DADS - the DTV Article Database Service - DADS is a huge database of more than ten million records, totally + DADS is a huge database of more than ten million records, totalling over ten gigabytes of data. The records are metadata about academic journal articles, primarily scientific; about 10% of these metadata records link to the full text of the articles they @@ -213,12 +209,67 @@ full text is not indexed.) - It allows students and researchers at DTU (###) to find and order + It allows students and researchers at DTU (Danmarks Tekniske + Universitet, the Technical College of Denmark) to find and order articles from multiple databases in a single query. The database contains literature on all engineering subjects. It's available - on-line through a web gateway at - http://www.dtv.dk/search/index_e.htm - though currently only to registered users. + on-line through a web gateway, though currently only to registered + users. 
+ + + More information can be found at + + + + + + NLI-Z39.50 - a Natural Language Interface for Libraries + + Fernuniversität Hagen in Germany have developed a natural + language interface for access to library databases. + + In order to evaluate this interface for recall and precision, they + chose Zebra as the basis for retrieval effectiveness. The Zebra + server contains a copy of the GIRT database, consisting of more + than 76000 records in SGML format (bibliographic records from + social science), which are mapped to MARC for presentation. + + + (GIRT is the German Indexing and Retrieval Testdatabase. It is a + standard German-language test database for intelligent indexing + and retrieval systems. See + + + + Evaluation will take place as part of the TREC/CLEF campaign 2003 + + + + For more information, contact Johannes Leveling + Johannes.Leveling@FernUni-Hagen.De + + + + + ULS (Union List of Serials) + + The London School of Economics (### I think) + are involved in a project called ULS to provide a union catalogue + for periodicals in 21 member libraries. They do this with an + unusual architecture which they call a + ``non-distributed virtual union catalogue''. + + + The member libraries send in data files representing their + periodicals, including both brief bibliographic data and summary + holdings. Then 21 individual Z39.50 targets are created, each + using Zebra, and all mounted on the single hardware server. + The live service provides a web gateway allowing Z39.50 searching + of all 21 targets or a selection of them. + + + More information can be found at + @@ -232,17 +283,25 @@ or subset of the whole Web. - ### examples, details and numbers, please! + For example, Liverpool University's web-search facility (see on + the home page at + + and many sub-pages) works by relevance-searching a Zebra database + which is populated by the Harvest-NG web-crawling software. 
+ + + For more information, contact John Gilbertson + jgilbert@liverpool.ac.uk - Future Work + Future Directions These are some of the plans that we have for the software in the near - and far future, approximately ordered after their relative importance. + and far future, ordered approximately as we expect to work on them. @@ -266,9 +325,10 @@ - Finalisation, documentation of the Zebra API. Consider - exposing the API through SOAP as well (allowing updates, - database management). + Finalisation and documentation of Zebra's C programming + API, allowing updates, database management and other functions + not readily expressed in Z39.50. We will also consider + exposing the API through SOAP. @@ -287,7 +347,10 @@ Programmers thrive on user feedback. If you are interested in a facility that you don't see mentioned here, or if there's something - you think we could do better, please drop us a mail. + you think we could do better, please drop us a mail. Better still, + implement it and send us the patches. + + If you think it's all really neat, you're welcome to drop us a line saying that, too. You'll find contact info at the end of this file.