X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fadministration.xml;h=7303d30a4dc637a73207c1de70a78440714f34b2;hb=558bf94a5f36eb89b0ca7ac4780b641da852c36b;hp=5ebfcd3ed4ac714ec8ef689d2be7fb00535b5dc4;hpb=79e9818dfb6b9a0a04bdd6bc6467c8dae3b8f493;p=idzebra-moved-to-github.git diff --git a/doc/administration.xml b/doc/administration.xml index 5ebfcd3..7303d30 100644 --- a/doc/administration.xml +++ b/doc/administration.xml @@ -1,7 +1,13 @@ - + Administrating Zebra - + + Unlike many simpler retrieval systems, Zebra supports safe, incremental updates to an existing index. @@ -100,7 +106,7 @@ You can edit the configuration file with a normal text editor. parameter names and values are separated by colons in the file. Lines - starting with a hash sign (#) are + starting with a hash sign (#) are treated as comments. @@ -146,9 +152,9 @@ explained further in the following sections. - + @@ -156,7 +162,7 @@ group - .recordType[.name]: + .recordType[.name]: type @@ -190,7 +196,7 @@ Specifies the Z39.50 database name. - FIXME - now we can have multiple databases in one server. -H + @@ -203,6 +209,7 @@ group of records. If you plan to update/delete this type of records later this should be specified as 1; otherwise it should be 0 (default), to save register space. + See . @@ -222,6 +229,7 @@ + register: register-location @@ -253,7 +261,7 @@ keyTmpDir: directory - Directory in which temporary files used during zebraidx' update + Directory in which temporary files used during zebraidx's update phase are stored. @@ -268,7 +276,7 @@ - profilePath: path + profilePath: path Specifies a path of profile specification files. @@ -297,6 +305,19 @@ Specifies size of internal memory to use for the zebraidx program. The amount is given in megabytes - default is 4 (4 MB). + The more memory, the faster large updates happen, up to about + half the free memory available on the computer. + + + + + tempfiles: Yes/Auto/No + + + Tells zebra if it should use temporary files when indexing. The + default is Auto, in which case zebra uses temporary files only + if it would need more that memMax + megabytes of memory. This should be good for most uses. @@ -307,13 +328,69 @@ Specifies a directory base for Zebra. All relative paths given (in profilePath, register, shadow) are based on this - directory. This setting is useful if if you Zebra server + directory. This setting is useful if your Zebra server is running in a different directory from where zebra.cfg is located. + + passwd: file + + + Specifies a file with description of user accounts for Zebra. + The format is similar to that known to Apache's htpasswd files + and UNIX' passwd files. Non-empty lines not beginning with + # are considered account lines. There is one account per-line. + A line consists of fields separate by a single colon character. + First field is username, second is password. + + + + + + passwd.c: file + + + Specifies a file with description of user accounts for Zebra. + File format is similar to that used by the passwd directive except + that the password are encrypted. Use Apache's htpasswd or similar + for maintenanace. + + + + + + perm.user: + permstring + + + Specifies permissions (priviledge) for a user that are allowed + to access Zebra via the passwd system. There are two kinds + of permissions currently: read (r) and write(w). By default + users not listed in a permission directive are given the read + priviledge. To specify permissions for a user with no + username, or Z39.50 anonymous style use + anonymous. 
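For example, a minimal access-control setup in zebra.cfg could look like
the following sketch (the password file name here is arbitrary, and the
admin account is assumed to exist in that file):

   passwd: zebra.passwd
   perm.anonymous: r
   perm.admin: rw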
The permstring consists of + a sequence of characters. Include character w + for write/update access, r for read access. + + + + + + dbaccess accessfile + + + Names a file which lists database subscriptions for individual users. + The access file should consists of lines of the form username: + dbnames, where dbnames is a list of database names, seprated by + '+'. No whitespace is allowed in the database list. + + + + @@ -325,12 +402,13 @@ The default behavior of the Zebra system is to reference the records from their original location, i.e. where they were found when you - ran zebraidx. + run zebraidx. That is, when a client wishes to retrieve a record following a search operation, the files are accessed from the place where you originally put them - if you remove the files (without - running zebraidx again, the client - will receive a diagnostic message. + running zebraidx again, the server will return + diagnostic number 14 (``System error in presenting records'') to + the client. @@ -375,7 +453,7 @@ - profilePath: /usr/local/yaz + profilePath: /usr/local/idzebra/tab attset: bib1.att simple.recordType: text simple.database: textbase @@ -436,9 +514,9 @@ in order to modify the indexes correctly at a later time. - - FIXME - There must be a simpler way to do this with Adams string tags -H - + For example, to update records of group esdd @@ -475,6 +553,7 @@ and then run zebraidx with the update command. + @@ -590,7 +669,7 @@ - (see + (see for details of how the mapping between elements of your records and searchable attributes is established). @@ -764,7 +843,6 @@ register: /d1:500M - shadow: /scratch1:100M /scratch2:200M @@ -776,14 +854,13 @@ In order to make changes to the system take effect for the users, you'll have to submit a "commit" command after a (sequence of) update operation(s). - You can ask the indexer to commit the changes immediately - after the update operation: - $ zebraidx update /d1/records update /d2/more-records commit + $ zebraidx update /d1/records + $ zebraidx commit @@ -795,7 +872,7 @@ - $ zebraidx -g books update /d1/records update /d2/more-records + $ zebraidx -g books update /d1/records /d2/more-records $ zebraidx -g fun update /d3/fun-records $ zebraidx commit @@ -843,8 +920,684 @@ + + + + Relevance Ranking and Sorting of Result Sets + + + Overview + + The default ordering of a result set is left up to the server, + which inside Zebra means sorting in ascending document ID order. + This is not always the order humans want to browse the sometimes + quite large hit sets. Ranking and sorting comes to the rescue. + + + + In cases where a good presentation ordering can be computed at + indexing time, we can use a fixed static ranking + scheme, which is provided for the alvis + indexing filter. This defines a fixed ordering of hit lists, + independently of the query issued. + + + + There are cases, however, where relevance of hit set documents is + highly dependent on the query processed. + Simply put, dynamic relevance ranking + sorts a set of retrieved records such that those most likely to be + relevant to your request are retrieved first. + Internally, Zebra retrieves all documents that satisfy your + query, and re-orders the hit list to arrange them based on + a measurement of similarity between your query and the content of + each record. + + + + Finally, there are situations where hit sets of documents should be + sorted during query time according to the + lexicographical ordering of certain sort indexes created at + indexing time. 
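As a quick preview, and using attribute values that are explained in
detail in the following sections, the same title search can be issued
(1) with the default document-ID ordering, (2) with dynamic relevance
ranking, and (3) combined with an ascending sort on title (the search
term here is an arbitrary example):

   @attr 1=4 dinosaur
   @attr 2=102 @attr 1=4 dinosaur
   @or @attr 1=4 dinosaur @attr 7=1 @attr 1=4 0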
Static Ranking

Zebra internally uses inverted indexes to look up term occurrences in
documents. Multiple queries from different indexes can be combined by
the binary boolean operations AND, OR and/or NOT (which is in fact a
binary AND NOT operation). To ensure fast query execution speed, all
indexes have to be sorted in the same order.

The indexes are normally sorted according to document ID in ascending
order, and any query which does not invoke a special re-ranking
function will therefore retrieve the result set in document ID order.

If one defines the

   staticrank: 1

directive in the main core Zebra configuration file, the internal
document keys used for ordering are augmented by a preceding integer,
which contains the static rank of a given document, and the index
lists are ordered first by ascending static rank, then by ascending
document ID. Zero is the ``best'' rank, as it occurs at the beginning
of the list; higher numbers represent worse scores.

The experimental alvis filter provides a directive to fetch static
rank information out of the indexed XML records, thus ordering all hit
sets by ascending static rank and, for those documents which have the
same static rank, by ascending document ID. See the description of the
alvis filter for the gory details.


Dynamic Ranking

In order to fiddle with the static rank order, it is necessary to
invoke additional re-ranking/re-ordering using dynamic ranking or
score functions. These functions return positive integer scores, where
the highest score is ``best''; hit sets are sorted according to
descending scores (in contrast to the index lists, which are sorted
according to ascending rank number and document ID).

Dynamic ranking is enabled by a directive like one of the following in
the zebra configuration file (use only one of these at a time!):

   rank: rank-1        # default TF-IDF like
   rank: rank-static   # dummy do-nothing

Dynamic ranking is done at query time rather than indexing time (this
is why we call it ``dynamic ranking'' in the first place ...). It is
invoked by adding the Bib-1 relation attribute with value
``relevance'' to the PQF query (that is, @attr 2=102; see also The
BIB-1 Attribute Set Semantics, also available in HTML). To find all
articles with the word Eoraptor in the title, and present them
relevance ranked, issue the PQF query:

   @attr 2=102 @attr 1=4 Eoraptor


Dynamically ranking using PQF queries with the 'rank-1' algorithm

The default rank-1 ranking module implements a TF/IDF (Term Frequency
over Inverse Document Frequency) like algorithm. In contrast to the
usual definition of TF/IDF algorithms, which only consider searching
in one full-text index, this one works on multiple indexes at the same
time. More precisely, Zebra does boolean queries and searches in
specific addressed indexes (there are inverted indexes pointing from
terms in the dictionary to documents and term positions inside
documents). It works like this:

Query Components

First, the boolean query is dismantled into its principal components,
i.e. atomic queries where one term is looked up in one index. For
example, the query

   @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer

is a boolean AND between the atomic parts

   @attr 2=102 @attr 1=1010 Utah

and

   @attr 2=102 @attr 1=1018 Springer

each of which is processed on its own.
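To look at the two components in isolation, they can simply be issued
as separate searches in a yaz-client session (a sketch; the target is
whatever database you have indexed):

   Z> f @attr 2=102 @attr 1=1010 Utah
   Z> f @attr 2=102 @attr 1=1018 Springer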
+ + + + + + Atomic hit lists + + + Second, for each atomic query, the hit list of documents is + computed. + + + In this example, two hit lists for each index + @attr 1=1010 and + @attr 1=1018 are computed. + + + + + + Atomic scores + + + Third, each document in the hit list is assigned a score (_if_ ranking + is enabled and requested in the query) using a TF/IDF scheme. + + + In this example, both atomic parts of the query assign the magic + @attr 2=102 relevance attribute, and are + to be used in the relevance ranking functions. + + + It is possible to apply dynamic ranking on only parts of the + PQF query: + + @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer + + searches for all documents which have the term 'Utah' on the + body of text, and which have the term 'Springer' in the publisher + field, and sort them in the order of the relevance ranking made on + the body-of-text index only. + + + + + + Hit list merging + + + Fourth, the atomic hit lists are merged according to the boolean + conditions to a final hit list of documents to be returned. + + + This step is always performed, independently of the fact that + dynamic ranking is enabled or not. + + + + + + Document score computation + + + Fifth, the total score of a document is computed as a linear + combination of the atomic scores of the atomic hit lists + + + Ranking weights may be used to pass a value to a ranking + algorithm, using the non-standard BIB-1 attribute type 9. + This allows one branch of a query to use one value while + another branch uses a different one. For example, we can search + for utah in the + @attr 1=4 index with weight 30, as + well as in the @attr 1=1010 index with weight 20: + + @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city + + + + The default weight is + sqrt(1000) ~ 34 , as the Z39.50 standard prescribes that the top score + is 1000 and the bottom score is 0, encoded in integers. + + + + The ranking-weight feature is experimental. It may change in future + releases of zebra. + + + + + + + Re-sorting of hit list + + + Finally, the final hit list is re-ordered according to scores. + + + + + + + + + + + + + + The rank-1 algorithm + does not use the static rank + information in the list keys, and will produce the same ordering + with or without static ranking enabled. + + + + + + + + + + Dynamic ranking is not compatible + with estimated hit sizes, as all documents in + a hit set must be accessed to compute the correct placing in a + ranking sorted list. Therefore the use attribute setting + @attr 2=102 clashes with + @attr 9=integer. + + + + + + + + Dynamically ranking CQL queries + + Dynamic ranking can be enabled during sever side CQL + query expansion by adding @attr 2=102 + chunks to the CQL config file. For example + + relationModifier.relevant = 2=102 + + invokes dynamic ranking each time a CQL query of the form + + Z> querytype cql + Z> f alvis.text =/relevant house + + is issued. Dynamic ranking can also be automatically used on + specific CQL indexes by (for example) setting + + index.alvis.text = 1=text 2=102 + + which then invokes dynamic ranking each time a CQL query of the form + + Z> querytype cql + Z> f alvis.text = house + + is issued. + + + + + + + + + Sorting + + Zebra sorts efficiently using special sorting indexes + (type=s; so each sortable index must be known + at indexing time, specified in the configuration of record + indexing. 
For example, to enable sorting according to the BIB-1
Date/time-added-to-db field, one could add the line

   xelm /*/@created               Date/time-added-to-db:s

to any .abs record-indexing configuration file. Similarly, a sort
index can be declared in any alvis-filter indexing stylesheet by
adding an indexing element of sort type s (see the description of the
alvis filter for the element syntax).

Sorting can be specified at searching time using a query term carrying
the non-standard BIB-1 attribute-type 7. This removes the need to send
a Z39.50 Sort Request separately, and can dramatically improve latency
when the client and server are on separate networks. The sorting part
of the query is separate from the rest of the query - the actual
search specification - and must be combined with it using OR.

A sorting subquery needs two attributes: an index (such as a BIB-1
type-1 attribute) specifying which index to sort on, and a type-7
attribute whose value is 1 for ascending sorting, or 2 for descending.
The term associated with the sorting attribute is the priority of the
sort key, where 0 specifies the primary sort key, 1 the secondary sort
key, and so on.

For example, a search for water, sorted by title (ascending), is
expressed by the PQF query

   @or @attr 1=1016 water @attr 7=1 @attr 1=4 0

whereas a search for water, sorted by title ascending, then date
descending, would be

   @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1

Notice the fundamental differences between dynamic ranking and
sorting: there can be only one ranking function defined and
configured, but multiple sorting indexes can be specified dynamically
at search time. Ranking does not need to use specific indexes, so
dynamic ranking can be enabled and disabled without re-indexing,
whereas sorting indexes need to be defined before indexing.


Extended Services: Remote Insert, Update and Delete

The extended services are not enabled by default in zebra, since they
modify the system. In order to allow anybody to update, use

   perm.anonymous: rw

in the main zebra configuration file zebra.cfg. Or, even better, allow
updates only for a particular admin user. For user admin, you could
use:

   perm.admin: rw
   passwd: passwordfile

And in passwordfile, specify users and passwords as colon-separated
strings:

   admin:secret

We can now start a yaz-client admin session and create a database:

   Z> adm-create

Now that the Default database has been created, we can insert an XML
file (esdd0006.grs from example/gils/records) and index it:

   Z> update insert 1 esdd0006.grs

The 3rd parameter - 1 here - is the opaque record ID from Ext update.
It is a record ID that we assign to the record in question. If we do
not assign one, the usual rules for match apply (recordId: from
zebra.cfg).

Actually, we should have a way to specify "no opaque record id" for
yaz-client's update command. We'll fix that.

The newly inserted record can be searched as usual:

   Z> f utah
   Sent searchRequest.
   Received SearchResponse.
   Search was a success.
   Number of hits: 1, setno 1
   SearchResult-1: term=utah cnt=1
   records returned: 0
   Elapsed: 0.014179

Let's delete the beast:

   Z> update delete 1
   No last record (update ignored)
   Z> update delete 1 esdd0006.grs
   Got extended services response
   Status: done
   Elapsed: 0.072441
   Z> f utah
   Sent searchRequest.
   Received SearchResponse.
   Search was a success.
   Number of hits: 0, setno 2
   SearchResult-1: term=utah cnt=0
   records returned: 0
   Elapsed: 0.013610

If the shadow register is enabled in your zebra.cfg, you must run the
adm-commit command

   Z> adm-commit

after each update session in order to write your changes from the
shadow to the live register space.

Extended services are also available from the YAZ client layer. An
example of a YAZ-PHP extended service transaction is given here:

   /* the record is an XML document passed as a string; any record
      markup accepted by the target database will do */
   $record = '<record><title>A fine specimen of a record</title></record>';

   $options = array('action' => 'recordInsert',
                    'syntax' => 'xml',
                    'record' => $record,
                    'databaseName' => 'mydatabase'
                   );

   yaz_es($yaz, 'update', $options);
   yaz_es($yaz, 'commit', array());
   yaz_wait();

   if ($error = yaz_error($yaz))
       echo "$error";

The action parameter can be any of recordInsert (will fail if the
record already exists), recordReplace (will fail if the record does
not exist), recordDelete (will fail if the record does not exist), and
specialUpdate (will insert or update the record as needed).

If a record is inserted using the action recordInsert, one can specify
the optional recordIdOpaque parameter, which is a client-supplied,
opaque record identifier. This identifier will replace zebra's own
automagic identifier generation.

When using the action recordReplace or recordDelete, one must specify
the additional recordIdNumber parameter, which must be an existing
Zebra internal system ID number. When retrieving existing records, the
ID number is returned in the field /*/id:idzebra/localnumber in the
namespace xmlns:id="http://www.indexdata.dk/zebra/", where it can be
picked up for later record updates or deletes.


YAZ Frontend Virtual Hosts

zebrasrv uses the YAZ server frontend and supports multiple virtual
servers behind multiple listening sockets.

See Section "Virtual Hosts" in the YAZ manual:
http://www.indexdata.dk/yaz/doc/server.vhosts.tkl
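For orientation, a virtual-host setup in the YAZ frontend server XML
configuration format looks roughly like the following sketch; the host
names, port and file names are placeholders, and the authoritative
element reference is the YAZ manual section cited above:

   <yazgfs>
     <listen id="public">tcp:@:9999</listen>
     <server id="books" listenref="public">
       <host>books.example.com</host>
       <config>zebra-books.cfg</config>
     </server>
     <server id="fun" listenref="public">
       <host>fun.example.com</host>
       <config>zebra-fun.cfg</config>
     </server>
   </yazgfs>

Such a file is then passed to the server at startup, e.g. as
zebrasrv -f yazserver.xml, where the -f option selects the XML
configuration.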