X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Farchitecture.xml;h=b6fe7cf140f8f6bde45b9c40678336cbc4c76117;hb=7253cbefce93c35a083505e14d97b3ae24c0a66b;hp=1def20e55667f09607cba1183db604ebd752b898;hpb=2c0ee3249ef46031064a0e8e7d63bd400317f5e9;p=idzebra-moved-to-github.git diff --git a/doc/architecture.xml b/doc/architecture.xml index 1def20e..b6fe7cf 100644 --- a/doc/architecture.xml +++ b/doc/architecture.xml @@ -1,11 +1,10 @@ - + Overview of Zebra Architecture - - +
Local Representation - + As mentioned earlier, Zebra places few restrictions on the type of data that you can index and manage. Generally, whatever the form of @@ -30,9 +29,9 @@ "grs" keyword, separated by "." characters. --> - +
- +
Main Components The Zebra system is designed to support a wide range of data management @@ -46,68 +45,121 @@ The Zebra indexer and information retrieval server consists of the - following main applications: the zebraidx - indexing maintenance utility, and the zebrasrv - information query and retireval server. Both are using some of the + following main applications: the zebraidx + indexing maintenance utility, and the zebrasrv + information query and retrieval server. Both are using some of the same main components, which are presented here. - This virtual package installs all the necessary packages to start + The virtual Debian package idzebra-2.0 + installs all the necessary packages to start working with Zebra - including utility programs, development libraries, - documentation and modules. - idzebra1.4 + documentation and modules. - - Core Zebra Module Containing Common Functionality +
+ Core Zebra Libraries Containing Common Functionality - - loads external filter modules used for presenting - the recods in a search response. - - executes search requests in PQF/RPN, which are handed over from - the YAZ server frontend API - - calls resorting/reranking algorithms on the hit sets - - returns - possibly ranked - result sets, hit - numbers, and the like internal data to the YAZ server backend API. - + The core Zebra module is the meat of the zebraidx + indexing maintenance utility, and the zebrasrv + information query and retrieval server binaries. In short, the core + libraries are responsible for + + + Dynamic Loading + + of external filter modules, in case the application is + not compiled statically. These filter modules define indexing, + search and retrieval capabilities of the various input formats. + + + + + Index Maintenance + + Zebra maintains Term Dictionaries and ISAM index + entries in inverted index structures kept on disk. These are + optimized for fast insert, update and delete, as well as good + search performance. + + + + + Search Evaluation + + by execution of search requests expressed in PQF/RPN + data structures, which are handed over from + the YAZ server frontend API. Search evaluation includes + construction of hit lists according to boolean combinations + of simpler searches. Fast performance is achieved by careful + use of index structures, and by evaluating specific index hit + lists in correct order. + + + + + Ranking and Sorting + + + components call resorting/re-ranking algorithms on the hit + sets. These might also be pre-sorted not only using the + assigned document IDs, but also using assigned static rank + information. + + + + + Record Presentation + + returns - possibly ranked - result sets, hit + numbers, and the like internal data to the YAZ server backend API + for shipping to the client. Each individual filter module + implements its own specific presentation formats. 
+ + + + + - This package contains all run-time libraries for Zebra. - libidzebra1.4 - This package includes documentation for Zebra in PDF and HTML. - idzebra1.4-doc - This package includes common essential Zebra configuration files - idzebra1.4-common + The Debian package libidzebra-2.0 + contains all run-time libraries for Zebra, the + documentation in PDF and HTML is found in + idzebra-2.0-doc, and + idzebra-2.0-common + includes common essential Zebra configuration files. - +
- +
Zebra Indexer - the core Zebra indexer which - - loads external filter modules used for indexing data records of - different type. - - creates, updates and drops databases and indexes + The zebraidx + indexing maintenance utility + loads external filter modules used for indexing data records of + different type, and creates, updates and drops databases and + indexes according to the rules defined in the filter modules. - This package contains Zebra utilities such as the zebraidx indexer - utility and the zebrasrv server. - idzebra1.4-utils + The Debian package idzebra-2.0-utils contains + the zebraidx utility. - +
- +
Zebra Searcher/Retriever - the core Zebra searcher/retriever which + This is the executable which runs the Z39.50/SRU/SRW server and + glues together the core libraries and the filter modules to one + great Information Retrieval server application. - This package contains Zebra utilities such as the zebraidx indexer - utility and the zebrasrv server, and their associated man pages. - idzebra1.4-utils + The Debian package idzebra-2.0-utils contains + the zebrasrv utility. - +
- +
YAZ Server Frontend The YAZ server frontend is @@ -117,101 +169,155 @@ In addition to Z39.50 requests, the YAZ server frontend acts - as HTTP server, honouring - SRW SOAP requests, and SRU REST requests. Moreover, it can - translate inco ming CQL queries to PQF/RPN queries, if + as HTTP server, honoring + SRU SOAP + requests, and + SRU REST + requests. Moreover, it can + translate incoming + CQL + queries to + PQF + queries, if correctly configured. - YAZ is a toolkit that allows you to develop software using the - ANSI Z39.50/ISO23950 standard for information retrieval. - SRW/ SRU - libyazthread.so - libyaz.so - libyaz + YAZ + is an Open Source + toolkit that allows you to develop software using the + ANSI Z39.50/ISO23950 standard for information retrieval. + It is packaged in the Debian packages + yaz and libyaz. - +
- +
Record Models and Filter Modules - all filter modules which do indexing and record display filtering: -This virtual package contains all base IDZebra filter modules. EMPTY ??? - libidzebra1.4-modules + The hard work of knowing what to index, + how to do it, and which + part of the records to send in a search/retrieve response is + implemented in + various filter modules. It is their responsibility to define the + exact indexing and record display filtering rules. + + + The virtual Debian package + libidzebra-2.0-modules installs all base filter + modules. - + +
TEXT Record Model and Filter Module - Plain ASCII text filter - + Plain ASCII text filter. TODO: add information here. - +
- +
GRS Record Model and Filter Modules - - - - grs.danbib GRS filters of various kind (*.abs files) -IDZebra filter grs.danbib (DBC DanBib records) - This package includes grs.danbib filter which parses DanBib records. - DanBib is the Danish Union Catalogue hosted by DBC - (Danish Bibliographic Centre). - libidzebra1.4-mod-grs-danbib - - - - grs.marc - - grs.marcxml - This package includes the grs.marc and grs.marcxml filters that allows - IDZebra to read MARC records based on ISO2709. - - libidzebra1.4-mod-grs-marc - - - grs.regx - - grs.tcl GRS TCL scriptable filter - This package includes the grs.regx and grs.tcl filters. - libidzebra1.4-mod-grs-regx - - - - grs.sgml - libidzebra1.4-mod-grs-sgml not packaged yet ?? - - - grs.xml - This package includes the grs.xml filter which uses Expat to - parse records in XML and turn them into IDZebra's internal grs node. - libidzebra1.4-mod-grs-xml + The GRS filter modules described in + + are all based on the Z39.50 specifications, and it is absolutely + mandatory to have the reference pages on BIB-1 attribute sets at + hand when configuring GRS filters. The GRS filters come in + different flavors, and a short introduction is needed here. + GRS filters of various kinds have also been called ABS filters due + to the *.abs configuration file suffix. + + + The grs.marc and + grs.marcxml filters are suited to parse and + index binary and XML versions of traditional library MARC records + based on the ISO2709 standard. The Debian package for both + filters is + libidzebra-2.0-mod-grs-marc. + + + GRS TCL scriptable filters for extensive user configuration come + in two flavors: a regular expression filter + grs.regx using TCL regular expressions, and + a general scriptable TCL filter called + grs.tcl. + Both are included in the + libidzebra-2.0-mod-grs-regx Debian package. - + + A general purpose SGML filter is called + grs.sgml. 
This filter is not yet packaged, + but is planned to be in the + libidzebra-2.0-mod-grs-sgml Debian package. + + + The Debian package + libidzebra-2.0-mod-grs-xml includes the + grs.xml filter which uses Expat to + parse records in XML and turn them into IDZebra's internal GRS node + trees. Also have a look at the Alvis XML/XSLT filter described in + the next section. + + 
- +
ALVIS Record Model and Filter Module - - - alvis Experimental Alvis XSLT filter - mod-alvis.so - libidzebra1.4-mod-alvis + The Alvis filter for XML files is an XSLT based input + filter. + It indexes element and attribute content of any conceivable XML format + using full XPath support, a feature which the standard Zebra + GRS SGML and XML filters lacked. The indexed documents are + parsed into a standard XML DOM tree, which restricts record size + according to availability of memory. + + + The Alvis filter + uses XSLT display stylesheets, which let + the Zebra DB administrator associate multiple, different views with + the same XML document type. These views are chosen on the fly at + search time. + + + In addition, the Alvis filter configuration is not bound to the + arcane BIB-1 Z39.50 library catalogue indexing traditions and + folklore, and is therefore easier to understand. + + + Finally, the Alvis filter allows for static ranking at index + time, and for sorting hit lists according to predefined + static ranks. This imposes no overhead at all: both + search and indexing still perform in + O(1), irrespective of document + collection size. This feature resembles Google's pre-ranking using + their PageRank algorithm. + + + Details on the experimental Alvis XSLT filter are found in + . + + + The Debian package libidzebra-2.0-mod-alvis + contains the Alvis filter module. - + 
- + + SAFARI filter module TODO: add information here. - +
+ --> -
+
-
+ - +
Indexing and Retrieval Workflow @@ -261,8 +367,160 @@ IDZebra filter grs.danbib (DBC DanBib records) - +
+
+ Retrieval of Zebra internal record data + + Starting with Zebra version 2.0.5, it is + possible to use a special element set which has the prefix + zebra::. + + + Using this element will, regardless of record type, return + Zebra's internal index structure/data for a record. + In particular, the regular record filters are not invoked when + these are in use. + This can in some cases make the retrieval faster than regular + retrieval operations (for MARC, XML etc). + + + Special Retrieval Elements + + + + Element Set + Description + Syntax + + + + + zebra::meta::sysno + Get Zebra record system ID + XML and SUTRS + + + zebra::data + Get raw record + all + + + zebra::meta + Get Zebra record internal metadata + XML and SUTRS + + + zebra::index + Get all indexed keys for record + XML and SUTRS + + + + zebra::index::f + + + Get indexed keys for field f for record + + XML and SUTRS + + + + zebra::index::f:t + + + Get indexed keys for field f + and type t for record + + XML and SUTRS + + + + 
+ + For example, to fetch the raw binary record data stored in the + Zebra internal storage, or on the filesystem, the following + commands can be issued: + + Z> f @attr 1=title my + Z> format xml + Z> elements zebra::data + Z> s 1+1 + Z> format sutrs + Z> s 1+1 + Z> format usmarc + Z> s 1+1 + + + + The special + zebra::data element set name is + defined for any record syntax, but will always fetch + the raw record data in exactly the original form. No record syntax + specific transformations will be applied to the raw record data. + + + Also, Zebra internal metadata about the record can be accessed: + + Z> f @attr 1=title my + Z> format xml + Z> elements zebra::meta::sysno + Z> s 1+1 + + displays in XML record syntax only the internal + record system number, whereas + + Z> f @attr 1=title my + Z> format xml + Z> elements zebra::meta + Z> s 1+1 + + displays all available metadata on the record. These include system + number, database name, indexed filename, filter used for indexing, + score and static ranking information and finally byte size of the record. + + + Sometimes, it is very hard to figure out what exactly has been + indexed how and in which indexes. Using the indexing stylesheet of + the Alvis filter, one can at least see which portion of the record + went into which index, but a similar aid does not exist for all + other indexing filters. + + + The special + zebra::index element set names are provided to + access information on per record indexed fields. For example, the + queries + + Z> f @attr 1=title my + Z> format sutrs + Z> elements zebra::index + Z> s 1+1 + + will display all indexed tokens from all indexed fields of the + first record, and it will display in SUTRS + record syntax, whereas + + Z> f @attr 1=title my + Z> format xml + Z> elements zebra::index::title + Z> s 1+1 + Z> elements zebra::index::title:p + Z> s 1+1 + + displays in XML record syntax only the content + of the Zebra string index title, or + even only the type p phrase indexed part of it. 
+ + + + Trying to access numeric Bib-1 use + attributes or trying to access non-existent Zebra internal string + access points will result in a Diagnostic 25: Specified element set + name not valid for specified database. + + + 
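The special element set names listed in the "Special Retrieval Elements" table follow a small, regular pattern. The following is a minimal Python sketch of that pattern only; it is not part of Zebra or YAZ. The function name `parse_zebra_element_set` and the returned dictionary layout are hypothetical, and the sketch assumes that `sysno` is the only sub-item of `zebra::meta`, since it is the only one the table names.

```python
import re

# Hypothetical helper, not a Zebra/YAZ API: it only mirrors the element
# set names listed in the "Special Retrieval Elements" table.
_ZEBRA_ESN = re.compile(
    r"^zebra::(?:"
    r"(?P<data>data)"
    r"|(?P<meta>meta)(?:::(?P<meta_item>sysno))?"
    r"|(?P<index>index)(?:::(?P<field>[^:]+)(?::(?P<type>[^:]+))?)?"
    r")$"
)


def parse_zebra_element_set(name):
    """Split a zebra:: element set name into its parts.

    Returns None when the name does not match any form from the table.
    """
    match = _ZEBRA_ESN.match(name)
    if match is None:
        return None
    if match.group("data"):
        return {"kind": "data"}
    if match.group("meta"):
        # item is "sysno" for zebra::meta::sysno, None for plain zebra::meta
        return {"kind": "meta", "item": match.group("meta_item")}
    # zebra::index, zebra::index::f or zebra::index::f:t
    return {
        "kind": "index",
        "field": match.group("field"),
        "type": match.group("type"),
    }


if __name__ == "__main__":
    for esn in ("zebra::data", "zebra::meta::sysno", "zebra::index::title:p"):
        print(esn, "->", parse_zebra_element_set(esn))
```

A client-side check of this kind could flag a malformed name before a request is sent. Whether a given field or type actually exists in a database is still decided by the server, which answers with Diagnostic 25 as described above.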