X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fadministration.xml;h=829ef7591505428863477952c443e459617edb76;hb=656a766f96dd92939c3604a7bf88f2355d040fc8;hp=be92e8e0893b8d105638a623d5194f0b45dca4dc;hpb=24cf42a15df56f9fe2436eedef816212b9d4fb17;p=idzebra-moved-to-github.git diff --git a/doc/administration.xml b/doc/administration.xml index be92e8e..829ef75 100644 --- a/doc/administration.xml +++ b/doc/administration.xml @@ -1,9 +1,9 @@ - + Administrating Zebra @@ -94,7 +94,7 @@ - + The Zebra Configuration File @@ -281,20 +281,67 @@ Specifies a path of profile specification files. The path is composed of one or more directories separated by - colon. Similar to PATH for UNIX systems. + colon. Similar to PATH for UNIX systems. + + + modulePath: path + + + Specifies a path of record filter modules. + The path is composed of one or more directories separated by + colon. Similar to PATH for UNIX systems. + The 'make install' procedure typically puts modules in + /usr/local/lib/idzebra-2.0/modules. + + + + + + staticrank: integer + + + Specifies whether static ranking is enabled (1) or + disabled (0). If omitted, it is disabled - corresponding + to a value of 0. + Refer to . + + + + + + + estimatehits: integer + + + Controls whether Zebra should calculate approximate hit counts and + at which hit count it is to be enabled. + A value of 0 disables approximate hit counts. + For a positive value, approximate hit counts are enabled + if the hit count is known to be larger than integer. + + + Approximate hit counts can also be triggered by a particular + attribute in a query. + Refer to . + + + + attset: filename - Specifies the filename(s) of attribute set files for use in - searching. At least the Bib-1 set should be loaded - (bib1.att). - The profilePath setting is used to look for - the specified files. - See + Specifies the filename(s) of attribute set files for use in + searching. In many configurations bib1.att + is used, but that is not required.
If Classic Explain + attributes are to be used for searching, + explain.att must be given. + The path to att-files in general can be given using + the profilePath setting. + See also . @@ -305,6 +352,19 @@ Specifies size of internal memory to use for the zebraidx program. The amount is given in megabytes - default is 4 (4 MB). + The more memory, the faster large updates happen, up to about + half the free memory available on the computer. + + + + + tempfiles: Yes/Auto/No + + + Tells Zebra whether it should use temporary files when indexing. The + default is Auto, in which case Zebra uses temporary files only + if it would need more than memMax + megabytes of memory. This should be good for most uses. @@ -323,23 +383,62 @@ - tagsysno: 0|1 + passwd: file - Species whether Zebra should include system-number data in XML - and GRS-1 records returned to clients, represented by the - <localControlNumber> element in XML - and the (1,14) tag in GRS-1. - The content of these elements is an internally-generated - integer uniquely identifying the record within its database. - It is included by default but may be turned off, with - tagsysno: 0 for databases in which a local - control number is explicitly specified in the input records - themselves. + Specifies a file with descriptions of user accounts for Zebra. + The format is similar to that known from Apache's htpasswd files + and UNIX passwd files. Non-empty lines not beginning with + # are considered account lines. There is one account per line. + A line consists of fields separated by a single colon character. + The first field is the username, the second is the password. + + passwd.c: file + + + Specifies a file with descriptions of user accounts for Zebra. + The file format is similar to that used by the passwd directive except + that the passwords are encrypted. Use Apache's htpasswd or similar + for maintenance.
+ + + + + + perm.user: + permstring + + + Specifies permissions (privileges) for a user who is allowed + to access Zebra via the passwd system. There are two kinds + of permissions currently: read (r) and write (w). By default + users not listed in a permission directive are given the read + privilege. To specify permissions for a user with no + username, or for Z39.50 anonymous-style access, use + anonymous. The permstring consists of + a sequence of characters. Include character w + for write/update access, r for read access and + a to allow anonymous access through this account. + + + + + + dbaccess accessfile + + + Names a file which lists database subscriptions for individual users. + The access file should consist of lines of the form username: + dbnames, where dbnames is a list of database names, separated by + '+'. No whitespace is allowed in the database list. + + + + @@ -351,7 +450,7 @@ The default behavior of the Zebra system is to reference the records from their original location, i.e. where they were found when you - ran zebraidx. + run zebraidx. That is, when a client wishes to retrieve a record following a search operation, the files are accessed from the place where you originally put them - if you remove the files (without @@ -402,7 +501,7 @@ - profilePath: /usr/local/yaz + profilePath: /usr/local/idzebra/tab attset: bib1.att simple.recordType: text simple.database: textbase @@ -618,7 +717,7 @@ - (see + (see for details of how the mapping between elements of your records and searchable attributes is established). @@ -706,7 +805,7 @@ Safe Updating - Using Shadow Registers - + Description @@ -760,7 +859,7 @@ - + How to Use Shadow Register Files @@ -792,7 +891,6 @@ register: /d1:500M - shadow: /scratch1:100M /scratch2:200M @@ -870,8 +968,846 @@ + + + + Relevance Ranking and Sorting of Result Sets + + + Overview + + The default ordering of a result set is left up to the server, + which inside Zebra means sorting in ascending document ID order.
+ This is not always the order in which humans want to browse the sometimes + quite large hit sets. Ranking and sorting come to the rescue. + + + + In cases where a good presentation ordering can be computed at + indexing time, we can use a fixed static ranking + scheme, which is provided for the alvis + indexing filter. This defines a fixed ordering of hit lists, + independently of the query issued. + + + + There are cases, however, where relevance of hit set documents is + highly dependent on the query processed. + Simply put, dynamic relevance ranking + sorts a set of retrieved records such that those most likely to be + relevant to your request are retrieved first. + Internally, Zebra retrieves all documents that satisfy your + query, and re-orders the hit list to arrange them based on + a measurement of similarity between your query and the content of + each record. + + + + Finally, there are situations where hit sets of documents should be + sorted during query time according to the + lexicographical ordering of certain sort indexes created at + indexing time. + + + + + + Static Ranking + + + Zebra internally uses inverted indexes to look up term occurrences + in documents. Multiple queries from different indexes can be + combined by the binary boolean operations AND, + OR and/or NOT (which + is in fact a binary AND NOT operation). + To ensure fast query execution + speed, all indexes have to be sorted in the same order. + + + The indexes are normally sorted according to document + ID in + ascending order, and any query which does not invoke a special + re-ranking function will therefore retrieve the result set in + document + ID + order.
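The reason a common sort order matters can be illustrated outside of Zebra. The following is a minimal Python sketch, not Zebra's actual implementation: it assumes each index yields a plain list of document IDs in ascending order, and the function names `intersect_sorted` and `difference_sorted` are made up for this illustration. With both lists in the same order, AND and AND NOT each need only a single linear pass.

```python
def intersect_sorted(a, b):
    """AND: keys present in both sorted lists, found in one linear pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def difference_sorted(a, b):
    """AND NOT: keys of sorted list a that do not occur in sorted list b."""
    out, j = [], 0
    for key in a:
        # Advance b until it reaches or passes the current key.
        while j < len(b) and b[j] < key:
            j += 1
        if j == len(b) or b[j] != key:
            out.append(key)
    return out


# Hypothetical hit lists for two terms, both in ascending document ID order.
utah = [2, 5, 9, 14]
springer = [5, 7, 14, 20]
print(intersect_sorted(utah, springer))   # [5, 14]
print(difference_sorted(utah, springer))  # [2, 9]
```

The result lists come out in the same ascending order as the inputs, which is why a query without re-ranking returns hits in document ID order.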
+ + + If one defines the + + staticrank: 1 + + directive in the main Zebra configuration file, the internal document + keys used for ordering are augmented by a preceding integer, which + contains the static rank of a given document, and the index lists + are ordered + first by ascending static rank, + then by ascending document ID. + Zero + is the ``best'' rank, as it occurs at the + beginning of the list; higher numbers represent worse scores. + + + The experimental alvis filter provides a + directive to fetch static rank information out of the indexed XML + records, thus making all hit sets ordered + by ascending static + rank and, for those documents which have the same static rank, + by ascending document ID. + See for the gory details. + + + + + + Dynamic Ranking + + In order to fiddle with the static rank order, it is necessary to + invoke additional re-ranking/re-ordering using dynamic + ranking or score functions. These functions return positive + integer scores, where the highest score is + ``best''; + hit sets are sorted according to descending + scores (in contrast + to the index lists which are sorted according to + ascending rank number and document ID). + + + Dynamic ranking is enabled by a directive like one of the + following in the Zebra configuration file (use only one of these at a time!): + + rank: rank-1 # default TF-IDF like + rank: rank-static # dummy do-nothing + + + + Dynamic ranking is done at query time rather than + indexing time (this is why we + call it ``dynamic ranking'' in the first place). + It is invoked by adding + the Bib-1 relation attribute with + value ``relevance'' to the PQF query (that is, + @attr 2=102, see also + + The BIB-1 Attribute Set Semantics, also in + HTML).
+ To find all articles with the word Eoraptor in + the title, and present them relevance ranked, issue the PQF query: + + @attr 2=102 @attr 1=4 Eoraptor + + + + + Dynamically ranking using PQF queries with the 'rank-1' + algorithm + + + The default rank-1 ranking module implements a + TF/IDF (Term Frequency over Inverse Document Frequency) like + algorithm. In contrast to the usual definition of TF/IDF + algorithms, which only consider searching in one full-text + index, this one works on multiple indexes at the same time. + More precisely, + Zebra does boolean queries and searches in specific addressed + indexes (there are inverted indexes pointing from terms in the + dictionary to documents and term positions inside documents). + It works like this: + + + Query Components + + + First, the boolean query is dismantled into its principal components, + i.e. atomic queries where one term is looked up in one index. + For example, the query + + @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer + + is a boolean AND between the atomic parts + + @attr 2=102 @attr 1=1010 Utah + + and + + @attr 2=102 @attr 1=1018 Springer + + which are each processed separately. + + + + + + Atomic hit lists + + + Second, for each atomic query, the hit list of documents is + computed. + + + In this example, two hit lists, one for each of the indexes + @attr 1=1010 and + @attr 1=1018, are computed. + + + + + + Atomic scores + + + Third, each document in the hit list is assigned a score (if ranking + is enabled and requested in the query) using a TF/IDF scheme. + + + In this example, both atomic parts of the query carry the magic + @attr 2=102 relevance attribute, and are + therefore used in the relevance ranking functions.
+ + + It is possible to apply dynamic ranking to only parts of the + PQF query: + + @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer + + searches for all documents which have the term 'Utah' in the + body of text, and which have the term 'Springer' in the publisher + field, and sorts them in the order of the relevance ranking made on + the body-of-text index only. + + + + + + Hit list merging + + + Fourth, the atomic hit lists are merged according to the boolean + conditions to a final hit list of documents to be returned. + + + This step is always performed, regardless of whether + dynamic ranking is enabled or not. + + + + + + Document score computation + + + Fifth, the total score of a document is computed as a linear + combination of the atomic scores of the atomic hit lists. + + + Ranking weights may be used to pass a value to a ranking + algorithm, using the non-standard BIB-1 attribute type 9. + This allows one branch of a query to use one value while + another branch uses a different one. For example, we can search + for utah in the + @attr 1=4 index with weight 30, as + well as for city in the @attr 1=1010 index with weight 20: + + @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city + + + + The default weight is + sqrt(1000) ~ 34, as the Z39.50 standard prescribes that the top score + is 1000 and the bottom score is 0, encoded in integers. + + + + The ranking-weight feature is experimental. It may change in future + releases of Zebra. + + + + + + + Re-sorting of hit list + + + Finally, the merged hit list is re-ordered according to scores. + + + + + + + + + + + + + + The rank-1 algorithm + does not use the static rank + information in the list keys, and will produce the same ordering + with or without static ranking enabled. + + + + + + + + Dynamic ranking is not compatible + with estimated hit sizes, as all documents in + a hit set must be accessed to compute the correct placing in a + ranking sorted list.
Therefore the relation attribute setting + @attr 2=102 clashes with + @attr 9=integer. + + + + + + + + + Dynamically ranking CQL queries + + Dynamic ranking can be enabled during server-side CQL + query expansion by adding @attr 2=102 + chunks to the CQL config file. For example + + relationModifier.relevant = 2=102 + + invokes dynamic ranking each time a CQL query of the form + + Z> querytype cql + Z> f alvis.text =/relevant house + + is issued. Dynamic ranking can also be automatically used on + specific CQL indexes by (for example) setting + + index.alvis.text = 1=text 2=102 + + which then invokes dynamic ranking each time a CQL query of the form + + Z> querytype cql + Z> f alvis.text = house + + is issued. + + + + + + + + + Sorting + + Zebra sorts efficiently using special sorting indexes + (type=s), so each sortable index must be known + at indexing time, specified in the configuration of record + indexing. For example, to enable sorting according to the BIB-1 + Date/time-added-to-db field, one could add the line + + xelm /*/@created Date/time-added-to-db:s + + to any .abs record-indexing configuration file. + Similarly, one could add an indexing element of the form + + + + + to any alvis-filter indexing stylesheet. + + + Sorting can be specified at search time using a query term + carrying the non-standard + BIB-1 attribute-type 7. This removes the + need to send a Z39.50 Sort Request + separately, and can dramatically improve latency when the client + and server are on separate networks. + The sorting part of the query is separate from the rest of the + query - the actual search specification - and must be combined + with it using OR. + + + A sorting subquery needs two attributes: an index (such as a + BIB-1 type-1 attribute) specifying which index to sort on, and a + type-7 attribute whose value is 1 for + ascending sorting, or 2 for descending.
The + term associated with the sorting attribute is the priority of + the sort key, where 0 specifies the primary + sort key, 1 the secondary sort key, and so + on. + + For example, a search for water, sorted by title (ascending), + is expressed by the PQF query + + @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 + + whereas a search for water, sorted by title ascending, + then date descending would be + + @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1 + + + + Notice the fundamental differences between dynamic + ranking and sorting: there can be + only one ranking function defined and configured, but multiple + sorting indexes can be specified dynamically at search + time. Ranking does not need to use specific indexes, so + dynamic ranking can be enabled and disabled without + re-indexing, whereas sorting indexes need to be + defined before indexing. + + + + + + + + + Extended Services: Remote Insert, Update and Delete + + + + Extended services are only supported when accessing the Zebra + server using the Z39.50 + protocol. The SRU protocol does + not support extended services. + + + + + The extended services are not enabled by default in Zebra, because + they modify the system. Zebra can be configured + to allow anybody to + search, and to allow updates only for a particular admin user + in the main Zebra configuration file zebra.cfg. + For user admin, you could use: + + perm.anonymous: r + perm.admin: rw + passwd: passwordfile + + And in the password file + passwordfile, you have to specify users and + passwords as colon-separated strings. + For encrypted passwords, use the passwd.c directive instead and + maintain the file with a tool like htpasswd.
+ + admin:secret + + It is essential to configure Zebra to store records internally, + and to support + modifications and deletion of records: + + storeData: 1 + storeKeys: 1 + + The general record type should be set to any record filter which + is able to parse XML records; you may use either of the two + declarations (but not both simultaneously): + + recordType: grs.xml + # recordType: alvis.filter_alvis_config.xml + + To enable transaction-safe shadow indexing, + which is especially important for this kind of operation, set + + shadow: directoryname: size (e.g. 1000M) + + See for additional information on + these configuration options. + + + + It is not possible to carry information about record types or + similar to Zebra when using extended services, due to + limitations of the Z39.50 + protocol. Therefore, indexing filters cannot be chosen on a + per-record basis. One and only one general XML indexing filter + must be defined. + + + + + + + + Extended services in the Z39.50 protocol + + + The Z39.50 standard allows + servers to accept special binary extended services + protocol packages, which may be used to insert, update and delete + records on servers. These carry control and update + information to the servers, which is encoded in seven package fields: + + + + Extended services Z39.50 Package Fields + + + + Parameter + Value + Notes + + + + + type + 'update' + Must be set to trigger extended services + + + action + string + + Extended service action type with + one of four possible values: recordInsert, + recordReplace, + recordDelete, + and specialUpdate + + + + record + XML string + An XML formatted string containing the record + + + syntax + 'xml' + Only XML record syntax is supported + + + recordIdOpaque + string + + Optional client-supplied, opaque record + identifier used under insert operations.
+ + + + recordIdNumber + positive number + Zebra's internal system number, + not allowed for recordInsert or + specialUpdate actions which result in fresh + record inserts. + + + + databaseName + database identifier + + The name of the database to which the extended services should be + applied. + + + + +
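The field rules in the table above can be checked on the client before a package is sent. The following is a minimal Python sketch under stated assumptions: the package is modelled as a plain dictionary keyed by the field names from the table, and the function name `check_package` and its messages are hypothetical, not part of Zebra or YAZ. It only covers rules the table states outright.

```python
# The four action values listed in the package field table.
ACTIONS = {"recordInsert", "recordReplace", "recordDelete", "specialUpdate"}


def check_package(pkg):
    """Return a list of problems found in an update package (empty if none)."""
    problems = []
    if pkg.get("type") != "update":
        problems.append("type must be 'update'")
    if pkg.get("action") not in ACTIONS:
        problems.append("unknown action")
    if pkg.get("syntax") != "xml":
        problems.append("syntax must be 'xml'")
    number = pkg.get("recordIdNumber")
    if number is not None:
        if not isinstance(number, int) or number <= 0:
            problems.append("recordIdNumber must be a positive number")
        # The table forbids recordIdNumber on fresh inserts; only the
        # recordInsert case can be detected without asking the server.
        if pkg.get("action") == "recordInsert":
            problems.append("recordIdNumber is not allowed for recordInsert")
    return problems


package = {
    "type": "update",
    "action": "recordInsert",
    "syntax": "xml",
    "record": "<record/>",
    "recordIdOpaque": "id1234",
    "databaseName": "Default",
}
print(check_package(package))  # []
```

Whether a specialUpdate results in a fresh insert is only known to the server, so this sketch deliberately leaves that case unchecked.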
+ + + + The action parameter can be any of + recordInsert (will fail if the record already exists), + recordReplace (will fail if the record does not exist), + recordDelete (will fail if the record does not + exist), and + specialUpdate (will insert or update the record + as needed; record deletion is not possible). + + + + During all actions, the + usual rules for internal record ID generation apply, unless an + optional recordIdNumber Zebra internal ID or a + recordIdOpaque string identifier is assigned. + The default ID generation is + configured using the recordId: directive in + zebra.cfg. + See . + + + + Setting of the recordIdNumber parameter, + which must be an existing Zebra internal system ID number, is not + allowed during any recordInsert or + specialUpdate action resulting in fresh record + inserts. + + + + When retrieving existing + records indexed with GRS indexing filters, the Zebra internal + ID number is returned in the field + /*/id:idzebra/localnumber in the namespace + xmlns:id="http://www.indexdata.dk/zebra/", + where it can be picked up for later record updates or deletes. + + + + A new element set for retrieval of internal record + data has been added, which can be used to access minimal records + containing only the recordIdNumber Zebra + internal ID, or the recordIdOpaque string + identifier. This works for any indexing filter used. + See . + + + + The recordIdOpaque string parameter + is a client-supplied, opaque record + identifier, which may be used in + insert, update and delete operations. The + client software is responsible for assigning these to + records. This identifier will + replace Zebra's own automatic identifier generation with a unique + mapping from recordIdOpaque to the + Zebra internal recordIdNumber. + The opaque recordIdOpaque string + identifiers + are not visible in retrieval records, nor are they + searchable, so the value of this parameter is + questionable.
It serves mostly as a convenient mapping from + application-domain string identifiers to Zebra internal IDs. + + +
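The action semantics and the opaque-to-internal ID mapping described in this section can be modelled in a few lines. The following Python sketch is a toy in-memory stand-in, not Zebra code: the class name `ToyRegister`, its attributes, and the choice of exceptions are all assumptions made for illustration, and it keys every record by a recordIdOpaque string only.

```python
class ToyRegister:
    """In-memory model of one database, keyed by internal record number."""

    def __init__(self):
        self.records = {}     # recordIdNumber -> record text
        self.opaque = {}      # recordIdOpaque -> recordIdNumber
        self.next_number = 1  # next internal number to hand out

    def update(self, action, record, opaque_id):
        number = self.opaque.get(opaque_id)
        exists = number is not None
        if action == "recordInsert" and exists:
            raise KeyError("record already exists")
        if action in ("recordReplace", "recordDelete") and not exists:
            raise KeyError("record does not exist")
        if action == "recordDelete":
            del self.records[number]
            del self.opaque[opaque_id]
            return number
        if action not in ("recordInsert", "recordReplace", "specialUpdate"):
            raise ValueError("unknown action")
        if not exists:
            # Fresh insert: assign a new internal number to the opaque ID.
            number = self.next_number
            self.next_number += 1
            self.opaque[opaque_id] = number
        self.records[number] = record
        return number


reg = ToyRegister()
first = reg.update("recordInsert", "<r>one</r>", "id1234")
again = reg.update("specialUpdate", "<r>two</r>", "id1234")
print(first == again)  # True: same opaque ID, same internal number
reg.update("recordDelete", None, "id1234")
```

In this model specialUpdate never deletes, and a second recordInsert with the same opaque identifier fails, matching the rules listed above.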
+ + + + Extended services from yaz-client + + + We can now start a yaz-client admin session and create a database: + + Z> adm-create + + + Now that the Default database has been created, + we can insert an XML file (esdd0006.grs + from example/gils/records) and index it: + + Z> update insert id1234 esdd0006.grs + + + The 3rd parameter - id1234 here - + is the recordIdOpaque package field. + + + Actually, we should have a way to specify "no opaque record id" for + yaz-client's update command. We'll fix that. + + + The newly inserted record can be searched as usual: + + Z> f utah + Sent searchRequest. + Received SearchResponse. + Search was a success. + Number of hits: 1, setno 1 + SearchResult-1: term=utah cnt=1 + records returned: 0 + Elapsed: 0.014179 + + + + + Let's delete the beast, using the same + recordIdOpaque string parameter: + + Z> update delete id1234 + No last record (update ignored) + Z> update delete 1 esdd0006.grs + Got extended services response + Status: done + Elapsed: 0.072441 + Z> f utah + Sent searchRequest. + Received SearchResponse. + Search was a success. + Number of hits: 0, setno 2 + SearchResult-1: term=utah cnt=0 + records returned: 0 + Elapsed: 0.013610 + + + + + If the shadow register is enabled in your + zebra.cfg, + you must run the adm-commit command + + Z> adm-commit + + + after each update session in order to write your changes from the + shadow to the live register space. + + + + + + Extended services from yaz-php + + + Extended services are also available from the YAZ PHP client layer. An + example of a YAZ-PHP extended service transaction is given here: + + A fine specimen of a record'; + + $options = array('action' => 'recordInsert', + 'syntax' => 'xml', + 'record' => $record, + 'databaseName' => 'mydatabase' + ); + + yaz_es($yaz, 'update', $options); + yaz_es($yaz, 'commit', array()); + yaz_wait(); + + if ($error = yaz_error($yaz)) + echo "$error"; + + + + +
+
+