X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fadministration.xml;h=1dd6a228b9a8b8e31cb67492ba605b2259015e7e;hb=b6ff969813b5ac4c0a6b266979469b0cc24201fd;hp=be92e8e0893b8d105638a623d5194f0b45dca4dc;hpb=24cf42a15df56f9fe2436eedef816212b9d4fb17;p=idzebra-moved-to-github.git diff --git a/doc/administration.xml b/doc/administration.xml index be92e8e..1dd6a22 100644 --- a/doc/administration.xml +++ b/doc/administration.xml @@ -1,9 +1,9 @@ - + Administrating Zebra @@ -305,6 +305,19 @@ Specifies size of internal memory to use for the zebraidx program. The amount is given in megabytes - default is 4 (4 MB). + The more memory, the faster large updates happen, up to about + half the free memory available on the computer. + + + + + tempfiles: Yes/Auto/No + + + Tells zebra if it should use temporary files when indexing. The + default is Auto, in which case zebra uses temporary files only + if it would need more than memMax + megabytes of memory. This should be good for most uses. @@ -323,23 +336,61 @@ - tagsysno: 0|1 + passwd: file + + + Specifies a file with description of user accounts for Zebra. + The format is similar to that known to Apache's htpasswd files + and UNIX' passwd files. Non-empty lines not beginning with + # are considered account lines. There is one account per line. + A line consists of fields separated by a single colon character. + First field is username, second is password. + + + + + + passwd.c: file + + + Specifies a file with description of user accounts for Zebra. + File format is similar to that used by the passwd directive except + that the passwords are encrypted. Use Apache's htpasswd or similar + for maintenance. + + + + + + perm.user: + permstring - Species whether Zebra should include system-number data in XML - and GRS-1 records returned to clients, represented by the - <localControlNumber> element in XML - and the (1,14) tag in GRS-1. 
- The content of these elements is an internally-generated - integer uniquely identifying the record within its database. - It is included by default but may be turned off, with - tagsysno: 0 for databases in which a local - control number is explicitly specified in the input records - themselves. + Specifies permissions (privileges) for a user who is allowed + to access Zebra via the passwd system. There are two kinds + of permissions currently: read (r) and write (w). By default + users not listed in a permission directive are given the read + privilege. To specify permissions for a user with no + username, or Z39.50 anonymous style, use + anonymous. The permstring consists of + a sequence of characters. Include character w + for write/update access, r for read access. + + dbaccess accessfile + + + Names a file which lists database subscriptions for individual users. + The access file should consist of lines of the form username: + dbnames, where dbnames is a list of database names, separated by + '+'. No whitespace is allowed in the database list. + + + + @@ -402,7 +453,7 @@ - profilePath: /usr/local/yaz + profilePath: /usr/local/idzebra/tab attset: bib1.att simple.recordType: text simple.database: textbase @@ -618,7 +669,7 @@ - (see + (see for details of how the mapping between elements of your records and searchable attributes is established). @@ -792,7 +843,6 @@ register: /d1:500M - shadow: /scratch1:100M /scratch2:200M @@ -870,8 +920,526 @@ + + + + Relevance Ranking and Sorting of Result Sets + + + Overview + + The default ordering of a result set is left up to the server, + which inside Zebra means sorting in ascending document ID order. + This is not always the order in which humans want to browse the sometimes + quite large hit sets. Ranking and sorting come to the rescue. + + + + In cases where a good presentation ordering can be computed at + indexing time, we can use a fixed static ranking + scheme, which is provided for the alvis + indexing filter. 
This defines a fixed ordering of hit lists, + independently of the query issued. + + + + There are cases, however, where the relevance of hit set documents is + highly dependent on the query processed. + Simply put, dynamic relevance ranking + sorts a set of retrieved + records such + that those most likely to be relevant to your request are + retrieved first. + Internally, Zebra retrieves all documents that satisfy your + query, and re-orders the hit list to arrange them based on + a measurement of similarity between your query and the content of + each record. + + + + Finally, there are situations where hit sets of documents should be + sorted during query time according to the + lexicographical ordering of certain sort indexes created at + indexing time. + + + + + + Static Ranking + + + Zebra internally uses inverted indexes to look up term occurrences + in documents. Multiple queries from different indexes can be + combined by the binary boolean operations AND, + OR and/or NOT (which + is in fact a binary AND NOT operation). + To ensure fast query execution, + all indexes have to be sorted in the same order. + + + The indexes are normally sorted according to document + ID in + ascending order, and any query which does not invoke a special + re-ranking function will therefore retrieve the result set in + document + ID + order. + + + If one defines the + + staticrank: 1 + + directive in the main core Zebra config file, the internal document + keys used for ordering are augmented by a preceding integer, which + contains the static rank of a given document, and the index lists + are ordered + first by ascending static rank, + then by ascending document ID. + Zero + is the ``best'' rank, as it occurs at the + beginning of the list; higher numbers represent worse scores. 
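The key ordering described above can be modelled in a few lines of Python. This is only an illustrative sketch: the `IndexKey` tuple and the `order_hits` function are hypothetical names chosen for this example, and they do not correspond to Zebra's internal data structures.

```python
from typing import NamedTuple


class IndexKey(NamedTuple):
    """Hypothetical model of an index list key with staticrank: 1 enabled."""
    static_rank: int  # 0 is the best rank; higher numbers are worse
    doc_id: int       # internal document ID


def order_hits(keys):
    # Tuples compare field by field, so this sorts first by ascending
    # static rank and then by ascending document ID.
    return sorted(keys)


hits = [IndexKey(2, 10), IndexKey(0, 42), IndexKey(0, 7), IndexKey(1, 3)]
ordered = order_hits(hits)
```

Here the two documents with static rank 0 come first, ordered by document ID, followed by the documents with ranks 1 and 2.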
+ + + The experimental alvis filter provides a + directive to fetch static rank information out of the indexed XML + records, thus making all hit sets ordered + by ascending static + rank, and for those documents which have the same static rank, ordered + by ascending document ID. + See for the gory details. + + + + + + Dynamic Ranking + + In order to fiddle with the static rank order, it is necessary to + invoke additional re-ranking/re-ordering using dynamic + ranking or score functions. These functions return positive + integer scores, where the highest score is + ``best''; + hit sets are sorted according to + descending + scores (in contrast + to the index lists which are sorted according to + ascending rank number and document ID). + + + Dynamic ranking is enabled by a directive like one of the + following in the zebra config file (use only one of these at a time!): + + rank: rank-1 # default TF-IDF like + rank: rank-static # dummy do-nothing + rank: zvrank # configurable, experimental TF-IDF like + + Notice that the rank-1 and + zvrank do not use the static rank + information in the list keys, and will produce the same ordering + with or without static ranking enabled. + + + The dummy rank-static reranking/scoring + function returns just + score = max int - staticrank + in order to preserve the static ordering of hit sets that would + have been produced had it not been invoked. + Obviously, to combine static and dynamic ranking usefully, + it is necessary + to make a new ranking + function; this is left + as an exercise for the reader. + + + + + Dynamic ranking is done at query time rather than + indexing time (this is why we + call it ``dynamic ranking'' in the first place). + It is invoked by adding + the Bib-1 relation attribute with + value ``relevance'' to the PQF query (that is, + @attr 2=102, see also + + The BIB-1 Attribute Set Semantics). 
+ To find all articles with the word Eoraptor in + the title, and present them relevance ranked, issue the PQF query: + + @attr 2=102 @attr 1=4 Eoraptor + + + + + The default rank-1 ranking module implements a + TF-IDF (Term Frequency-Inverse Document Frequency) like algorithm. + + + + + Notice that dynamic ranking is not compatible + with estimated hit sizes, as all documents in + a hit set must be accessed to compute the correct placing in a + rank-sorted list. Therefore the attribute setting + @attr 2=102 clashes with + @attr 9=integer. + + + + + It is possible to apply dynamic ranking on parts of the PQF query + alone: + + Z> f @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer + + searches for all documents which have the term 'Utah' in the + body of text, and which have the term 'Springer' in the publisher + field, and sorts them in the order of the relevance ranking made on + the body of text index only. + + + Rank weight is a way to pass a value to a ranking algorithm - so that + one APT has one value - while another has a different one. For + example, we can + search for 'utah' in use attribute 'title' with weight 30, as + well as in use attribute 'any' with weight 20. + + Z> f @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah + + + + + The rank weight feature is experimental. It may change in future + releases of zebra, and is not production-ready. + + + + + Notice that dynamic ranking can be enabled in + server-side CQL query expansion by adding @attr + 2=102 to the CQL config file. For example + + relationModifier.relevant = 2=102 + + invokes dynamic ranking each time a CQL query of the form + + Z> querytype cql + Z> f alvis.text =/relevant house + + is issued. Dynamic ranking can be enabled on specific CQL indexes + by (for example) setting + + index.alvis.text = 1=text 2=102 + + which then invokes dynamic ranking each time a CQL query of the form + + Z> querytype cql + Z> f alvis.text = house + + is issued. 
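To make the idea of query-dependent scoring concrete, here is a minimal Python sketch of a generic TF-IDF style ranking over a toy document set. It is an assumption-laden illustration, not Zebra's rank-1 or zvrank code: the function names, the weighting formula and the sample documents are all invented for this example. It does follow the two properties stated above: every document in the hit set is scored, and hits are sorted by descending score.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score every document that contains all query terms.

    docs maps a document ID to its list of tokens. The formula is a
    generic TF-IDF variant chosen for illustration only.
    """
    n_docs = len(docs)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # Document frequency: the number of documents containing each term.
    df = {t: sum(1 for c in counts.values() if c[t] > 0) for t in query_terms}
    scores = {}
    for doc_id, c in counts.items():
        if all(c[t] > 0 for t in query_terms):  # document is in the hit set
            scores[doc_id] = sum(
                c[t] * math.log(1 + n_docs / df[t]) for t in query_terms
            )
    return scores


def rank_hits(scores):
    # Highest score first; ties fall back to ascending document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    1: "utah geology survey".split(),
    2: "utah utah water report".split(),
    3: "nevada water report".split(),
}
scores = tf_idf_scores(["utah"], docs)
ranked = rank_hits(scores)
```

Document 2 mentions the query term twice and is therefore placed ahead of document 1, while document 3 is not part of the hit set at all.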
+ + + + + + + Sorting + + Sorting is enabled in the configuration of record indexing. For + example, to enable sorting according to the BIB-1 + Date/time-added-to-db field, one could add the line + + xelm /*/@created Date/time-added-to-db:s + + to any .abs record indexing config file, or + similarly, one could add an indexing element of the form + + + + + to any alvis indexing rule. + + + To trigger a sorting on a pre-defined sorting index of type + s, we can issue a sort using the BIB-1 + embedded sort attribute type 7. + The embedded sort is a way to specify sort within a query - thus + removing the need to send a Z39.50 Sort + Request separately. + + + The value after attribute type 7 is + 1 (=ascending), or 2 + (=descending). + The attributes+term (APT) node is separate from the rest of the + PQF query, and must be @or'ed. + The term associated with this attribute is the sorting level, + where + 0 specifies the primary sort key, + 1 the secondary sort key, and so on. + + For example, a search for water, sort by title (ascending), + is expressed by the PQF query + + Z> f @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 + + whereas a search for water, sort by title ascending, + then date descending would be + + Z> f @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1 + + + + Notice the fundamental differences between dynamic + ranking and sorting: there can only + be one ranking function defined and configured, but multiple + sorting indexes can be specified dynamically at search + time. Ranking does not need to use specific indexes, which means that + dynamic ranking can be enabled and disabled without + re-indexing. On the other hand, sorting indexes need to be + defined before indexing. + + + + + + + + + Extended Services: Remote Insert, Update and Delete + + + The extended services are not enabled by default in zebra - due to the + fact that they modify the system. 
+ In order to allow anybody to update, use + + perm.anonymous: rw + + in the main zebra configuration file zebra.cfg. + Or, even better, allow updates only for a particular admin user. For + user admin, you could use: + + perm.admin: rw + passwd: passwordfile + + And in passwordfile, specify users and + passwords as colon-separated strings: + + admin:secret + + + + We can now start a yaz-client admin session and create a database: + + Z> adm-create + + + Now that the Default database has been created, + we can insert an XML file (esdd0006.grs + from example/gils/records) and index it: + + Z> update insert 1 esdd0006.grs + + + The third parameter - 1 here - + is the opaque record ID from Ext update. + It is a record ID that we assign to the record + in question. If we do not + assign one, the usual rules for match apply (recordId: from zebra.cfg). + + + Actually, we should have a way to specify "no opaque record id" for + yaz-client's update command. We'll fix that. + + + The newly inserted record can be searched as usual: + + Z> f utah + Sent searchRequest. + Received SearchResponse. + Search was a success. + Number of hits: 1, setno 1 + SearchResult-1: term=utah cnt=1 + records returned: 0 + Elapsed: 0.014179 + + + + + Let's delete the beast: + + Z> update delete 1 + No last record (update ignored) + Z> update delete 1 esdd0006.grs + Got extended services response + Status: done + Elapsed: 0.072441 + Z> f utah + Sent searchRequest. + Received SearchResponse. + Search was a success. + Number of hits: 0, setno 2 + SearchResult-1: term=utah cnt=0 + records returned: 0 + Elapsed: 0.013610 + + + + + If the shadow register is enabled in your + zebra.cfg, + you must run the adm-commit command + + Z> adm-commit + + + after each update session in order to write your changes from the + shadow to the live register space. + + + Extended services are also available from the YAZ client layer. 
An + example of a YAZ-PHP extended service transaction is given here: + + A fine specimen of a record'; + + $options = array('action' => 'recordInsert', + 'syntax' => 'xml', + 'record' => $record, + 'databaseName' => 'mydatabase' + ); + + yaz_es($yaz, 'update', $options); + yaz_es($yaz, 'commit', array()); + yaz_wait(); + + if ($error = yaz_error($yaz)) + echo "$error"; + + + The action parameter can be any of + recordInsert (will fail if the record already exists), + recordReplace (will fail if the record does not exist), + recordDelete (will fail if the record does not + exist), and + specialUpdate (will insert or update the record + as needed). + + + If a record is inserted + using the action recordInsert, + one can specify the optional + recordIdOpaque parameter, which is a + client-supplied, opaque record identifier. This identifier will + replace zebra's own automagic identifier generation. + + + When using the action recordReplace or + recordDelete, one must specify the additional + recordIdNumber parameter, which must be an + existing Zebra internal system ID number. When retrieving existing + records, the ID number is returned in the field + /*/id:idzebra/localnumber in the namespace + xmlns:id="http://www.indexdata.dk/zebra/", + where it can be picked up for later record updates or deletes. + + + + + + YAZ Frontend Virtual Hosts + + zebrasrv uses the YAZ server frontend and + supports multiple virtual servers behind multiple listening sockets. + + &zebrasrv-virtual; + + + Section "Virtual Hosts" in the YAZ manual. + http://www.indexdata.dk/yaz/doc/server.vhosts.tkl + + + + + + Server Side CQL to PQF Query Translation + + Using the + <cql2rpn>l2rpn.txt</cql2rpn> + YAZ Frontend Virtual + Hosts option, one can configure + the YAZ Frontend CQL-to-PQF + converter, specifying the interpretation of various + CQL + indexes, relations, etc. in terms of Type-1 query attributes. 
+ + + + For example, using server-side CQL-to-PQF conversion, one might + query a zebra server like this: + + Z> querytype cql + Z> find text=(plant and soil) + + + and - if properly configured - even dynamic relevance ranking can + be performed using CQL query syntax: + + Z> find text = /relevant (plant and soil) + + + + + + By the way, the same configuration can be used to + search using client-side CQL-to-PQF conversion + (the only difference is querytype cql2rpn + instead of + querytype cql, and the call specifying a local + conversion file): + + Z> querytype cql2rpn + Z> find text=(plant and soil) + + + + + + Exhaustive information can be found in the + section "Specification of CQL to RPN mappings" in the YAZ manual, + + http://www.indexdata.dk/yaz/doc/tools.tkl#tools.cql.map, + and shall therefore not be repeated here. + + + + + +
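As a rough illustration of what such a mapping does, the following Python sketch reads a few `key = value` lines in the style of the `index.alvis.text` and `relationModifier.relevant` settings quoted earlier and turns a single-term CQL search into a PQF string. It is a toy under stated assumptions, not the YAZ converter: the function names are hypothetical, the `#` comment handling is assumed, the sample configuration is shortened for the example, and boolean operators and parentheses are not handled at all.

```python
def parse_cql_map(text):
    """Parse 'key = value' lines into a dict, skipping blanks and # comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, because values contain '=' themselves.
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping


def attrs_to_pqf(attr_spec):
    # "1=text 2=102" becomes "@attr 1=text @attr 2=102"
    return " ".join(f"@attr {attr}" for attr in attr_spec.split())


def simple_cql_to_pqf(index, term, mapping, modifier=None):
    """Translate 'index = term' or 'index =/modifier term' to PQF."""
    parts = []
    if modifier is not None:
        parts.append(attrs_to_pqf(mapping[f"relationModifier.{modifier}"]))
    parts.append(attrs_to_pqf(mapping[f"index.{index}"]))
    parts.append(term)
    return " ".join(parts)


# Hypothetical, shortened configuration used only for this example.
CONFIG = """
# toy CQL-to-PQF mapping
index.alvis.text = 1=text
relationModifier.relevant = 2=102
"""

mapping = parse_cql_map(CONFIG)
plain = simple_cql_to_pqf("alvis.text", "house", mapping)
ranked = simple_cql_to_pqf("alvis.text", "house", mapping, modifier="relevant")
```

With this toy mapping, `alvis.text = house` yields `@attr 1=text house`, and `alvis.text =/relevant house` yields `@attr 2=102 @attr 1=text house`, which adds the relevance relation attribute to the query.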