diff --git a/doc/architecture.xml b/doc/architecture.xml
index 37afaee..b6fe7cf 100644
--- a/doc/architecture.xml
+++ b/doc/architecture.xml
@@ -1,11 +1,10 @@
-
+
 Overview of Zebra Architecture
-
-
+
Local Representation - + As mentioned earlier, Zebra places few restrictions on the type of data that you can index and manage. Generally, whatever the form of @@ -30,62 +29,9 @@ "grs" keyword, separated by "." characters. --> - - - - Indexing and Retrieval Workflow - - - Records pass through three different states during processing in the - system. - - - - - - - - - When records are accessed by the system, they are represented - in their local, or native format. This might be SGML or HTML files, - News or Mail archives, MARC records. If the system doesn't already - know how to read the type of data you need to store, you can set up an - input filter by preparing conversion rules based on regular - expressions and possibly augmented by a flexible scripting language - (Tcl). - The input filter produces as output an internal representation, - a tree structure. +
- - - - - - When records are processed by the system, they are represented - in a tree-structure, constructed by tagged data elements hanging off a - root node. The tagged elements may contain data or yet more tagged - elements in a recursive structure. The system performs various - actions on this tree structure (indexing, element selection, schema - mapping, etc.), - - - - - - - Before transmitting records to the client, they are first - converted from the internal structure to a form suitable for exchange - over the network - according to the Z39.50 standard. - - - - - - -
- - - +
Main Components The Zebra system is designed to support a wide range of data management @@ -99,68 +45,121 @@ The Zebra indexer and information retrieval server consists of the - following main applications: the zebraidx - indexing maintenance utility, and the zebrasrv - information query and retireval server. Both are using some of the + following main applications: the zebraidx + indexing maintenance utility, and the zebrasrv + information query and retrieval server. Both are using some of the same main components, which are presented here. - This virtual package installs all the necessary packages to start + The virtual Debian package idzebra-2.0 + installs all the necessary packages to start working with Zebra - including utility programs, development libraries, - documentation and modules. - idzebra1.4 + documentation and modules. - - Core Zebra Module Containing Common Functionality +
Core Zebra Libraries Containing Common Functionality
- - loads external filter modules used for presenting
- the records in a search response.
- - executes search requests in PQF/RPN, which are handed over from
- the YAZ server frontend API
- - calls resorting/reranking algorithms on the hit sets
- - returns - possibly ranked - result sets, hit
- numbers, and similar internal data to the YAZ server backend API.
-
+ The core Zebra module is the meat of the zebraidx
+ indexing maintenance utility, and the zebrasrv
+ information query and retrieval server binaries. In short, the core
+ libraries are responsible for
+
+ Dynamic Loading
+
+ of external filter modules, in case the application is
+ not compiled statically. These filter modules define indexing,
+ search and retrieval capabilities of the various input formats.
+
+ Index Maintenance
+
+ Zebra maintains Term Dictionaries and ISAM index
+ entries in inverted index structures kept on disk. These are
+ optimized for fast insert, update and delete, as well as good
+ search performance.
+
+ Search Evaluation
+
+ by execution of search requests expressed in PQF/RPN
+ data structures, which are handed over from
+ the YAZ server frontend API. Search evaluation includes
+ construction of hit lists according to boolean combinations
+ of simpler searches. Fast performance is achieved by careful
+ use of index structures, and by evaluating specific index hit
+ lists in the correct order.
+
+ Ranking and Sorting
+
+ components call resorting/re-ranking algorithms on the hit
+ sets. These might also be pre-sorted, not only using the
+ assigned document IDs, but also using assigned static rank
+ information.
+
+ Record Presentation
+
+ returns - possibly ranked - result sets, hit
+ numbers, and similar internal data to the YAZ server backend API
+ for shipping to the client. Each individual filter module
+ implements its own specific presentation formats.
+ + + + + - This package contains all run-time libraries for Zebra. - libidzebra1.4 - This package includes documentation for Zebra in PDF and HTML. - idzebra1.4-doc - This package includes common essential Zebra configuration files - idzebra1.4-common + The Debian package libidzebra-2.0 + contains all run-time libraries for Zebra, the + documentation in PDF and HTML is found in + idzebra-2.0-doc, and + idzebra-2.0-common + includes common essential Zebra configuration files. - +
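The Search Evaluation step described above - combining sorted inverted-index hit lists with boolean operators - can be illustrated with a small sketch. This is a toy in-memory Python model, not Zebra's C implementation (Zebra keeps its term dictionaries and ISAM posting lists on disk); the documents and tokenization are invented for illustration.

```python
# Toy model of boolean search evaluation over an inverted index.
# Not Zebra code: everything here lives in memory.

def build_index(docs):
    """docs: {docid: text}. Returns term -> ascending docid list."""
    index = {}
    for docid, text in sorted(docs.items()):
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(docid)
    return index

def intersect(a, b):
    """AND: merge-intersect two ascending docid lists in one pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge two ascending docid lists, dropping duplicates."""
    return sorted(set(a) | set(b))

docs = {1: "plant and soil", 2: "soil science", 3: "plant biology"}
idx = build_index(docs)
print(intersect(idx["plant"], idx["soil"]))  # -> [1]
```

Because every posting list is kept in the same (docid-ascending) order, AND and OR reduce to single merge passes - the property the section above attributes to Zebra's index structures.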
- +
Zebra Indexer
- the core Zebra indexer which
- - loads external filter modules used for indexing data records of
- different type.
- - creates, updates and drops databases and indexes
+ The zebraidx
+ indexing maintenance utility
+ loads external filter modules used for indexing data records of
+ different types, and creates, updates and drops databases and
+ indexes
according to the rules defined in the filter modules.
- This package contains Zebra utilities such as the zebraidx indexer
- utility and the zebrasrv server.
- idzebra1.4-utils
+ The Debian package idzebra-2.0-utils contains
+ the zebraidx utility.
- +
Zebra Searcher/Retriever
- the core Zebra searcher/retriever which
+ This is the executable which runs the Z39.50/SRU/SRW server and
+ glues together the core libraries and the filter modules into one
+ complete information retrieval server application.
- This package contains Zebra utilities such as the zebraidx indexer
- utility and the zebrasrv server, and their associated man pages.
- idzebra1.4-utils
+ The Debian package idzebra-2.0-utils contains
+ the zebrasrv utility.
- +
YAZ Server Frontend The YAZ server frontend is @@ -170,488 +169,358 @@ In addition to Z39.50 requests, the YAZ server frontend acts - as HTTP server, honouring - SRW SOAP requests, and SRU REST requests. Moreover, it can - translate inco ming CQL queries to PQF/RPN queries, if + as HTTP server, honoring + SRU SOAP + requests, and + SRU REST + requests. Moreover, it can + translate incoming + CQL + queries to + PQF + queries, if correctly configured. - YAZ is a toolkit that allows you to develop software using the - ANSI Z39.50/ISO23950 standard for information retrieval. - SRW/ SRU - libyazthread.so - libyaz.so - libyaz + YAZ + is an Open Source + toolkit that allows you to develop software using the + ANSI Z39.50/ISO23950 standard for information retrieval. + It is packaged in the Debian packages + yaz and libyaz. - +
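The CQL-to-PQF translation mentioned above is driven by a mapping file whose entries (in the style `index.cql.all = 1=text`, as shown in the CQL-to-PQF notes elsewhere in this chapter) tie CQL indexes to Type-1/RPN attributes. The following sketch is a toy parser and single-term translator, not YAZ's actual converter; the config text and index names are illustrative only.

```python
# Toy CQL-index -> PQF-attribute mapping, in the spirit of the
# cql2pqf.txt mapping file. Not the YAZ implementation.

def parse_mapping(config_text):
    """Parse lines like 'index.cql.all = 1=text' into
    {'cql.all': '1=text'}. '#' lines and blanks are skipped."""
    mapping = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, attrs = line.partition("=")
        # key looks like "index.cql.all"; keep the "set.index" part
        _, _, name = key.strip().partition(".")
        mapping[name] = attrs.strip()
    return mapping

def cql_term_to_pqf(mapping, index, term):
    """Render a single-term CQL query against one index as PQF."""
    return "@attr %s %s" % (mapping[index], term)

cfg = "index.cql.all = 1=text\nindex.cql.title = 1=4"
m = parse_mapping(cfg)
print(cql_term_to_pqf(m, "cql.all", "soil"))  # -> @attr 1=text soil
```

A real converter also handles relations, relation modifiers, position, structure and truncation attributes; this sketch only shows the index-to-use-attribute step.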
- +
Record Models and Filter Modules - all filter modules which do indexing and record display filtering: -This virtual package contains all base IDZebra filter modules. EMPTY ??? - libidzebra1.4-modules + The hard work of knowing what to index, + how to do it, and which + part of the records to send in a search/retrieve response is + implemented in + various filter modules. It is their responsibility to define the + exact indexing and record display filtering rules. + + + The virtual Debian package + libidzebra-2.0-modules installs all base filter + modules. - + +
TEXT Record Model and Filter Module - Plain ASCII text filter - + Plain ASCII text filter. TODO: add information here. - +
- +
GRS Record Model and Filter Modules
-
- grs.danbib GRS filters of various kind (*.abs files)
- IDZebra filter grs.danbib (DBC DanBib records)
- This package includes grs.danbib filter which parses DanBib records.
- DanBib is the Danish Union Catalogue hosted by DBC
- (Danish Bibliographic Centre).
- libidzebra1.4-mod-grs-danbib
-
- grs.marc
- grs.marcxml
- This package includes the grs.marc and grs.marcxml filters that allow
- IDZebra to read MARC records based on ISO2709.
- libidzebra1.4-mod-grs-marc
-
- grs.regx
- grs.tcl GRS TCL scriptable filter
- This package includes the grs.regx and grs.tcl filters.
- libidzebra1.4-mod-grs-regx
-
- grs.sgml
- libidzebra1.4-mod-grs-sgml not packaged yet ??
-
- grs.xml
- This package includes the grs.xml filter which uses Expat to
- parse records in XML and turn them into IDZebra's internal grs node.
- libidzebra1.4-mod-grs-xml
+ The GRS filter modules described in
+
+ are all based on the Z39.50 specifications, and it is absolutely
+ mandatory to have the reference pages on BIB-1 attribute sets at
+ hand when configuring GRS filters. The GRS filters come in
+ different flavors, and a short introduction is needed here.
+ GRS filters of various kinds have also been called ABS filters due
+ to the *.abs configuration file suffix.
+
+ The grs.marc and
+ grs.marcxml filters are suited to parse and
+ index binary and XML versions of traditional library MARC records
+ based on the ISO2709 standard. The Debian package for both
+ filters is
+ libidzebra-2.0-mod-grs-marc.
+
+ GRS TCL scriptable filters for extensive user configuration come
+ in two flavors: a regular expression filter
+ grs.regx using TCL regular expressions, and
+ a general scriptable TCL filter called
+ grs.tcl; both are included in the
+ libidzebra-2.0-mod-grs-regx Debian package.
+
+ A general purpose SGML filter is called
+ grs.sgml.
This filter is not yet packaged,
+ but planned to be in the
+ libidzebra-2.0-mod-grs-sgml Debian package.
+
+ The Debian package
+ libidzebra-2.0-mod-grs-xml includes the
+ grs.xml filter which uses Expat to
+ parse records in XML and turn them into IDZebra's internal GRS node
+ trees. Also have a look at the Alvis XML/XSLT filter described in
+ the next section.
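The idea behind a regular-expression input filter like grs.regx - turning flat text records into a tree of tagged elements - can be sketched briefly. Zebra's real filter is configured with TCL regular expressions; the Python sketch below, with an invented `Field: value` record layout, only illustrates the concept.

```python
import re

# Sketch of a regex-driven input filter in the spirit of grs.regx.
# The "Field: value" record layout is invented for illustration.

RULE = re.compile(r"^(\w+):\s*(.*)$")

def filter_record(text):
    """Return a list of (tag, data) pairs - a flat stand-in for
    Zebra's internal tree of tagged elements."""
    tree = []
    for line in text.splitlines():
        m = RULE.match(line)
        if m:
            tree.append((m.group(1).lower(), m.group(2)))
    return tree

record = "Title: Soil science\nAuthor: N.N."
print(filter_record(record))
# -> [('title', 'Soil science'), ('author', 'N.N.')]
```

In the real filters the rules live in configuration files, so adding support for a new flat format means writing expressions, not C code.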
- +
ALVIS Record Model and Filter Module
-
- alvis Experimental Alvis XSLT filter
- mod-alvis.so
- libidzebra1.4-mod-alvis
+ The Alvis filter for XML files is an XSLT based input
+ filter.
+ It indexes element and attribute content of any XML format
+ using full XPATH support, a feature which the standard Zebra
+ GRS SGML and XML filters lacked. The indexed documents are
+ parsed into a standard XML DOM tree, which restricts record size
+ according to the availability of memory.
+
+ The Alvis filter
+ uses XSLT display stylesheets, which let
+ the Zebra DB administrator associate multiple different views with
+ the same XML document type. These views are chosen on-the-fly at
+ search time.
+
+ In addition, the Alvis filter configuration is not bound to the
+ arcane BIB-1 Z39.50 library catalogue indexing traditions and
+ folklore, and is therefore easier to understand.
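The path-driven indexing that the Alvis filter provides can be sketched with Python's standard library. This is not the Alvis module: Alvis gets full XPath via XSLT, while `xml.etree`'s `iterfind` only supports a small XPath subset, and the document and index names here are invented.

```python
import xml.etree.ElementTree as ET

# Sketch of path-driven XML indexing: parse a record into a DOM-like
# tree, then pull out element content per named index.

def index_by_path(xml_text, paths):
    """paths: {indexname: elementtree path}. Returns
    indexname -> list of matching element texts."""
    root = ET.fromstring(xml_text)
    return {name: [e.text for e in root.iterfind(path)]
            for name, path in paths.items()}

doc = "<record><title>Soil</title><author><name>N.N.</name></author></record>"
print(index_by_path(doc, {"title": "title", "author": "author/name"}))
# -> {'title': ['Soil'], 'author': ['N.N.']}
```

The point mirrored here is that the whole record is first parsed into a tree (hence the memory bound on record size), after which any element or attribute can be exposed to any index.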
- a) CQL set prefixes are specified using the correct CQL/ SRW/U - prefixes for the required index sets, or user-invented prefixes for - special index sets. An index set in CQL is roughly speaking equivalent to a - namespace specifier in XML. +
- b) The default index set to be used if none explicitely mentioned - c) Index mapping definitions of the form +
+ Indexing and Retrieval Workflow - index.cql.all = 1=text + + Records pass through three different states during processing in the + system. + - which means that the index "all" from the set "cql" is mapped on the - bib-1 RPN query "@attr 1=text" (where "text" is some existing index - in zebra, see indexing stylesheet) + - d) Relation mapping from CQL relations to bib-1 RPN "@attr 2= " stuff + + + + + When records are accessed by the system, they are represented + in their local, or native format. This might be SGML or HTML files, + News or Mail archives, MARC records. If the system doesn't already + know how to read the type of data you need to store, you can set up an + input filter by preparing conversion rules based on regular + expressions and possibly augmented by a flexible scripting language + (Tcl). + The input filter produces as output an internal representation, + a tree structure. - e) Relation modifier mapping from CQL relations to bib-1 RPN "@attr - 2= " stuff + + + - f) Position attributes + + When records are processed by the system, they are represented + in a tree-structure, constructed by tagged data elements hanging off a + root node. The tagged elements may contain data or yet more tagged + elements in a recursive structure. The system performs various + actions on this tree structure (indexing, element selection, schema + mapping, etc.), - g) structure attributes + + + - h) truncation attributes + + Before transmitting records to the client, they are first + converted from the internal structure to a form suitable for exchange + over the network - according to the Z39.50 standard. + + - See - http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map for config - file details. + + +
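The three record states above - native local format, internal tagged tree, exchange form for the network - can be sketched as a tiny pipeline. The mini-formats and field names below are invented purely for illustration; they stand in for real input filters and Z39.50 record encodings.

```python
# Sketch of the three record states: native form in, internal tagged
# tree for processing, exchange form out. Formats are invented.

def to_internal(native):
    """State 1 -> 2: an 'input filter' turning a flat native record
    into a list of (tag, data) pairs (a stand-in for the tree)."""
    return [tuple(part.split("=", 1)) for part in native.split(";")]

def to_exchange(tree):
    """State 2 -> 3: encode the internal tree for transmission."""
    return "\n".join("%s %s" % (tag, data) for tag, data in tree)

native = "title=Soil;author=N.N."
tree = to_internal(native)       # indexing etc. would operate here
print(to_exchange(tree))
```

Indexing, element selection and schema mapping all happen on the middle representation, which is why the same native record can be served in several exchange formats.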
+
Retrieval of Zebra internal record data
+
+ Starting with Zebra version 2.0.5 or newer, it is
+ possible to use a special element set which has the prefix
+ zebra::.
+
-
- Static and Dynamic Ranking
- Zebra internally uses inverted indexes to look up term occurrences
- in documents. Multiple queries from different indexes can be
- combined by the binary boolean operations AND, OR and/or NOT (which
- is in fact a binary AND NOT operation). To ensure fast query execution
- speed, all indexes have to be sorted in the same order.
-
- The indexes are normally sorted according to document ID in
- ascending order, and any query which does not invoke a special
- re-ranking function will therefore retrieve the result set in document ID
- order.
-
- If one defines the
-
- staticrank: 1
-
- directive in the main core Zebra config file, the internal document
- keys used for ordering are augmented by a preceding integer, which
- contains the static rank of a given document, and the index lists
- are ordered
- - first by ascending static rank
- - then by ascending document ID.
-
- This implies that the default rank "0" is the best rank at the
- beginning of the list, and "max int" is the worst static rank.
-
- The "alvis" and the experimental "xslt" filters provide a
- directive to fetch static rank information out of the indexed XML
- records, thus making _all_ hit sets ordered by ascending static
- rank, and for those docs which have the same static rank, ordered
- by ascending doc ID.
- If one wants to do a little fiddling with the static rank order,
- one has to invoke additional re-ranking/re-ordering using dynamic
- reranking or score functions.
These functions return positive
- integer scores, where _highest_ score is best, which means that the
- hit sets will be sorted according to _descending_ scores (in contrast
- to the index lists, which are sorted according to _ascending_ rank
- number and document ID)
-
- Those are defined in the zebra C source files
-
- "rank-1" : zebra/index/rank1.c
- default TF/IDF like zebra dynamic ranking
- "rank-static" : zebra/index/rankstatic.c
- do-nothing dummy static ranking (this is just to prove
- that the static rank can be used in dynamic ranking functions)
- "zvrank" : zebra/index/zvrank.c
- many different dynamic TF/IDF ranking functions
-
- These are enabled in the zebra config file by a directive like:
-
- rank: rank-static
-
- Notice that "rank-1" and "zvrank" do not use the static rank
- information in the list keys, and will produce the same ordering
- with or without static ranking enabled.
-
- The dummy "rank-static" reranking/scoring function returns just
- score = max int - staticrank
- in order to preserve the ordering of hit sets with and without its
- call.
-
- Obviously, one wants to make a new ranking function which combines
- static and dynamic ranking, which is left as an exercise for the
- reader .. (Wray, this is yours ...)
-
+ Using this element set will, regardless of record type, return
+ Zebra's internal index structure/data for a record.
+ In particular, the regular record filters are not invoked when
+ these are in use.
+ This can in some cases make retrieval faster than regular
+ retrieval operations (for MARC, XML, etc.).
- - + + Special Retrieval Elements + + + + Element Set + Description + Syntax + + + + + zebra::meta::sysno + Get Zebra record system ID + XML and SUTRS + + + zebra::data + Get raw record + all + + + zebra::meta + Get Zebra record internal metadata + XML and SUTRS + + + zebra::index + Get all indexed keys for record + XML and SUTRS + + + + zebra::index::f + + + Get indexed keys for field f for record + + XML and SUTRS + + + + zebra::index::f:t + + + Get indexed keys for field f + and type t for record + + XML and SUTRS + + + +
- yazserver frontend config file - - db/yazserver.xml - - Setup of listening ports, and virtual zebra servers. - Note path to server-side CQL-to-PQF config file, and to - SRW explain config section. - - The path is relative to the directory where zebra.init is placed - and is started up. The other pathes are relative to , - which in this case is the same. - - see: http://www.indexdata.com/yaz/doc/server.vhosts.tkl - + For example, to fetch the raw binary record data stored in the + zebra internal storage, or on the filesystem, the following + commands can be issued: + + Z> f @attr 1=title my + Z> format xml + Z> elements zebra::data + Z> s 1+1 + Z> format sutrs + Z> s 1+1 + Z> format usmarc + Z> s 1+1 + + + + The special + zebra::data element set name is + defined for any record syntax, but will always fetch + the raw record data in exactly the original form. No record syntax + specific transformations will be applied to the raw record data. - - Z39.50 searching: - - search like this (using client-side CQL-to-PQF conversion): - - yaz-client -q db/cql2pqf.txt localhost:9999 - > format xml - > querytype cql2rpn - > f text=(plant and soil) - > s 1 - > elements dc - > s 1 - > elements index - > s 1 - > elements alvis - > s 1 - > elements snippet - > s 1 - - - search like this (using server-side CQL-to-PQF conversion): - (the only difference is "querytype cql" instead of - "querytype cql2rpn" and the call without specifying a local - conversion file) - - yaz-client localhost:9999 - > format xml - > querytype cql - > f text=(plant and soil) - > s 1 - > elements dc - > s 1 - > elements index - > s 1 - > elements alvis - > s 1 - > elements snippet - > s 1 - - NEW: static relevance ranking - see examples in alvis2index.xsl - - > f text = /relevant (plant and soil) - > elem dc - > s 1 - - > f title = /relevant a - > elem dc - > s 1 - - - - SRW/U searching - Surf into http://localhost:9999 - - firefox http://localhost:9999 - - gives you an explain record. 
Unfortunately, the data found in the - CQL-to-PQF text file must be added by hand-craft into the explain - section of the yazserver.xml file. Too bad, but this is all extreme - new alpha stuff, and a lot of work has yet to be done .. - - Searching via SRU: surf into the URL (lines broken here - concat on - URL line) - - - see number of hits: - http://localhost:9999/?version=1.1&operation=searchRetrieve - &query=text=(plant%20and%20soil) - - - - fetch record 5-7 in DC format - http://localhost:9999/?version=1.1&operation=searchRetrieve - &query=text=(plant%20and%20soil) - &startRecord=5&maximumRecords=2&recordSchema=dc - - - - even search using PQF queries using the extended verb "x-pquery", - which is special to YAZ/Zebra - - http://localhost:9999/?version=1.1&operation=searchRetrieve - &x-pquery=@attr%201=text%20@and%20plant%20soil - - More info: read the fine manuals at http://www.loc.gov/z3950/agency/zing/srw/ -278,280d299 - Search via SRW: - read the fine manual at - http://www.loc.gov/z3950/agency/zing/srw/ - - -and so on. The list of available indexes is found in db/cql2pqf.txt - - -7) How do you add to the index attributes of any other type than "w"? -I mean, in the context of making CQL queries. Let's say I want a date -attribute in there, so that one could do date > 20050101 in CQL. - -Currently for example 'date-modified' is of type 'w'. - -The 2-seconds-of-though solution: - - in alvis2index.sl: - - - - - -But here's the catch...doesn't the use of the 'd' type require -structure type 'date' (@attr 4=5) in PQF? But then...how does that -reflect in the CQL->RPN/PQF mapping - does it really work if I just -change the type of an element in alvis2index.sl? I would think not...? 
- - - - - Kimmo - - -Either do: - - f @attr 4=5 @attr 1=date-modified 20050713 - -or do - - -Either do: - - f @attr 4=5 @attr 1=date-modified 20050713 - -or do - -querytype cql - - f date-modified=20050713 - - f date-modified=20050713 - - Search ERROR 121 4 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @att -r 4=1 @attr 2=3 @attr "1=date-modified" 20050713 - - - - f date-modified eq 20050713 - -Search OK 23 3 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=5 - @attr 2=3 @attr "1=date-modified" 20050713 - - + Also, Zebra internal metadata about the record can be accessed: + + Z> f @attr 1=title my + Z> format xml + Z> elements zebra::meta::sysno + Z> s 1+1 + + displays in XML record syntax only internal + record system number, whereas + + Z> f @attr 1=title my + Z> format xml + Z> elements zebra::meta + Z> s 1+1 + + displays all available metadata on the record. These include sytem + number, database name, indexed filename, filter used for indexing, + score and static ranking information and finally bytesize of record. - -E) EXTENDED SERVICE LIFE UPDATES - -The extended services are not enabled by default in zebra - due to the -fact that they modify the system. - -In order to allow anybody to update, use -perm.anonymous: rw -in zebra.cfg. - -Or, even better, allow only updates for a particular admin user. For -user 'admin', you could use: -perm.admin: rw -passwd: passwordfile - -And in passwordfile, specify users and passwords .. -admin:secret - -We can now start a yaz-client admin session and create a database: - -$ yaz-client localhost:9999 -u admin/secret -Authentication set to Open (admin/secret) -Connecting...OK. -Sent initrequest. -Connection accepted by v3 target. 
-ID : 81 -Name : Zebra Information Server/GFS/YAZ -Version: Zebra 1.4.0/1.63/2.1.9 -Options: search present delSet triggerResourceCtrl scan sort -extendedServices namedResultSets -Elapsed: 0.007046 -Z> adm-create -Admin request -Got extended services response -Status: done -Elapsed: 0.045009 -: -Now Default was created.. We can now insert an XML file (esdd0006.grs -from example/gils/records) and index it: - -Z> update insert 1 esdd0006.grs -Got extended services response -Status: done -Elapsed: 0.438016 - -The 3rd parameter.. 1 here .. is the opaque record id from Ext update. -It a record ID that _we_ assign to the record in question. If we do not -assign one the usual rules for match apply (recordId: from zebra.cfg). - -Actually, we should have a way to specify "no opaque record id" for -yaz-client's update command.. We'll fix that. - -Elapsed: 0.438016 -Z> f utah -Sent searchRequest. -Received SearchResponse. -Search was a success. -Number of hits: 1, setno 1 -SearchResult-1: term=utah cnt=1 -records returned: 0 -Elapsed: 0.014179 - -Let's delete the beast: -Z> update delete 1 -No last record (update ignored) -Z> update delete 1 esdd0006.grs -Got extended services response -Status: done -Elapsed: 0.072441 -Z> f utah -Sent searchRequest. -Received SearchResponse. -Search was a success. -Number of hits: 0, setno 2 -SearchResult-1: term=utah cnt=0 -records returned: 0 -Elapsed: 0.013610 - -If shadow register is enabled you must run the adm-commit command in -order write your changes.. - + Sometimes, it is very hard to figure out what exactly has been + indexed how and in which indexes. Using the indexing stylesheet of + the Alvis filter, one can at least see which portion of the record + went into which index, but a similar aid does not exist for all + other indexing filters. - - - -
---> + + The special + zebra::index element set names are provided to + access information on per record indexed fields. For example, the + queries + + Z> f @attr 1=title my + Z> format sutrs + Z> elements zebra::index + Z> s 1+1 + + will display all indexed tokens from all indexed fields of the + first record, and it will display in SUTRS + record syntax, whereas + + Z> f @attr 1=title my + Z> format xml + Z> elements zebra::index::title + Z> s 1+1 + Z> elements zebra::index::title:p + Z> s 1+1 + + displays in XML record syntax only the content + of the zebra string index title, or + even only the type p phrase indexed part of it. + + + + Trying to access numeric Bib-1 use + attributes or trying to access non-existent zebra intern string + access points will result in a Diagnostic 25: Specified element set + 'name not valid for specified database. + + +