X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Fadministration.xml;h=b95db6619112ccf3c6a809d3822e27eec6e4b30b;hp=beba1663bf6b30735b2b021cecfb45a5ccace814;hb=HEAD;hpb=37dc985516f52f34fc8434cc8beb982bb0c8988f diff --git a/doc/administration.xml b/doc/administration.xml index beba166..b95db66 100644 --- a/doc/administration.xml +++ b/doc/administration.xml @@ -1,1452 +1,1475 @@ - - - Administrating Zebra - + + Administrating &zebra; + - - Unlike many simpler retrieval systems, Zebra supports safe, incremental - updates to an existing index. - - - - Normally, when Zebra modifies the index it reads a number of records - that you specify. - Depending on your specifications and on the contents of each record - one the following events take place for each record: - - - - Insert - - - The record is indexed as if it never occurred before. - Either the Zebra system doesn't know how to identify the record or - Zebra can identify the record but didn't find it to be already indexed. - - - - - Modify - - - The record has already been indexed. - In this case either the contents of the record or the location - (file) of the record indicates that it has been indexed before. - - - - - Delete - - - The record is deleted from the index. As in the - update-case it must be able to identify the record. - - - - - - - - Please note that in both the modify- and delete- case the Zebra - indexer must be able to generate a unique key that identifies the record - in question (more on this below). - - - - To administrate the Zebra retrieval system, you run the - zebraidx program. - This program supports a number of options which are preceded by a dash, - and a few commands (not preceded by dash). - - - - Both the Zebra administrative tool and the Z39.50 server share a - set of index files and a global configuration file. - The name of the configuration file defaults to - zebra.cfg. 
- The configuration file includes specifications on how to index - various kinds of records and where the other configuration files - are located. zebrasrv and zebraidx - must be run in the directory where the - configuration file lives unless you indicate the location of the - configuration file by option -c. - - - - Record Types - - - Indexing is a per-record process, in which either insert/modify/delete - will occur. Before a record is indexed search keys are extracted from - whatever might be the layout the original record (sgml,html,text, etc..). - The Zebra system currently supports two fundamental types of records: - structured and simple text. - To specify a particular extraction process, use either the - command line option -t or specify a - recordType setting in the configuration file. - - - - - - The Zebra Configuration File - - - The Zebra configuration file, read by zebraidx and - zebrasrv defaults to zebra.cfg - unless specified by -c option. - - - - You can edit the configuration file with a normal text editor. - parameter names and values are separated by colons in the file. Lines - starting with a hash sign (#) are - treated as comments. - - - - If you manage different sets of records that share common - characteristics, you can organize the configuration settings for each - type into "groups". - When zebraidx is run and you wish to address a - given group you specify the group name with the -g - option. - In this case settings that have the group name as their prefix - will be used by zebraidx. - If no -g option is specified, the settings - without prefix are used. - - - - In the configuration file, the group name is placed before the option - name itself, separated by a dot (.). 
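The settings format just described — `name: value` pairs, `#` comment lines, and group-prefixed names that fall back to the unprefixed setting when no group-specific value exists — can be sketched in a few lines of Python. This is purely an illustration of the lookup behaviour the text describes, not Zebra's own parser; the helper names are invented for the example.

```python
def parse_cfg(text):
    """Parse zebra.cfg-style text: 'name: value' lines, '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        name, _, value = line.partition(":")
        settings[name.strip()] = value.strip()
    return settings


def lookup(settings, name, group=None):
    """A group-prefixed setting wins; otherwise fall back to the plain name."""
    if group and f"{group}.{name}" in settings:
        return settings[f"{group}.{name}"]
    return settings.get(name)


cfg = parse_cfg("""
# example configuration
recordType: text
public.recordType: grs.sgml
""")
print(lookup(cfg, "recordType", group="public"))  # grs.sgml
print(lookup(cfg, "recordType"))                  # text
```

A group with no specific setting (say, run with `-g esdd`) would resolve `recordType` to the unprefixed default, `text`.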
For instance, to set the record type - for group public to grs.sgml - (the SGML-like format for structured records) you would write: - - - - - public.recordType: grs.sgml - - - - - To set the default value of the record type to text - write: - - - - - recordType: text - - - - The available configuration settings are summarized below. They will be - explained further in the following sections. + Unlike many simpler retrieval systems, &zebra; supports safe, incremental + updates to an existing index. - - - + + Normally, when &zebra; modifies the index it reads a number of records + that you specify. + Depending on your specifications and on the contents of each record + one the following events take place for each record: - - - - group - .recordType[.name]: - type - - - - Specifies how records with the file extension - name should be handled by the indexer. - This option may also be specified as a command line option - (-t). Note that if you do not specify a - name, the setting applies to all files. - In general, the record type specifier consists of the elements (each - element separated by dot), fundamental-type, - file-read-type and arguments. Currently, two - fundamental types exist, text and - grs. - - - - - group.recordId: - record-id-spec - - - Specifies how the records are to be identified when updated. See - . - - - - - group.database: - database - - - Specifies the Z39.50 database name. - - - - - - group.storeKeys: - boolean - - - Specifies whether key information should be saved for a given - group of records. If you plan to update/delete this type of - records later this should be specified as 1; otherwise it - should be 0 (default), to save register space. - - See . - - - - - group.storeData: - boolean - - - Specifies whether the records should be stored internally - in the Zebra system files. - If you want to maintain the raw records yourself, - this option should be false (0). 
- If you want Zebra to take care of the records for you, it - should be true(1). - - - - - - register: register-location - - - Specifies the location of the various register files that Zebra uses - to represent your databases. - See . - - - - - shadow: register-location - - - Enables the safe update facility of Zebra, and - tells the system where to place the required, temporary files. - See . - - - - - lockDir: directory - - - Directory in which various lock files are stored. - - - - - keyTmpDir: directory - - - Directory in which temporary files used during zebraidx's update - phase are stored. - - - - - setTmpDir: directory - - - Specifies the directory that the server uses for temporary result sets. - If not specified /tmp will be used. - - - - - profilePath: path - - - Specifies a path of profile specification files. - The path is composed of one or more directories separated by - colon. Similar to PATH for UNIX systems. - - - - - attset: filename - - - Specifies the filename(s) of attribute set files for use in - searching. At least the Bib-1 set should be loaded - (bib1.att). - The profilePath setting is used to look for - the specified files. - See - - - - - memMax: size - - - Specifies size of internal memory - to use for the zebraidx program. - The amount is given in megabytes - default is 4 (4 MB). - The more memory, the faster large updates happen, up to about - half the free memory available on the computer. - - - - - tempfiles: Yes/Auto/No - - - Tells zebra if it should use temporary files when indexing. The - default is Auto, in which case zebra uses temporary files only - if it would need more that memMax - megabytes of memory. This should be good for most uses. - - - - root: dir + Insert - Specifies a directory base for Zebra. All relative paths - given (in profilePath, register, shadow) are based on this - directory. This setting is useful if your Zebra server - is running in a different directory from where - zebra.cfg is located. 
+ The record is indexed as if it never occurred before. + Either the &zebra; system doesn't know how to identify the record or + &zebra; can identify the record but didn't find it to be already indexed. - - passwd: file + Modify - Specifies a file with description of user accounts for Zebra. - The format is similar to that known to Apache's htpasswd files - and UNIX' passwd files. Non-empty lines not beginning with - # are considered account lines. There is one account per-line. - A line consists of fields separate by a single colon character. - First field is username, second is password. + The record has already been indexed. + In this case either the contents of the record or the location + (file) of the record indicates that it has been indexed before. - - passwd.c: file + Delete - Specifies a file with description of user accounts for Zebra. - File format is similar to that used by the passwd directive except - that the password are encrypted. Use Apache's htpasswd or similar - for maintenance. + The record is deleted from the index. As in the + update-case it must be able to identify the record. - - - perm.user: - permstring - - - Specifies permissions (priviledge) for a user that are allowed - to access Zebra via the passwd system. There are two kinds - of permissions currently: read (r) and write(w). By default - users not listed in a permission directive are given the read - privilege. To specify permissions for a user with no - username, or Z39.50 anonymous style use - anonymous. The permstring consists of - a sequence of characters. Include character w - for write/update access, r for read access. - - - - - - dbaccess accessfile - - - Names a file which lists database subscriptions for individual users. - The access file should consists of lines of the form username: - dbnames, where dbnames is a list of database names, seprated by - '+'. No whitespace is allowed in the database list. 
- - - - - - - - - Locating Records - - - The default behavior of the Zebra system is to reference the - records from their original location, i.e. where they were found when you - run zebraidx. - That is, when a client wishes to retrieve a record - following a search operation, the files are accessed from the place - where you originally put them - if you remove the files (without - running zebraidx again, the server will return - diagnostic number 14 (``System error in presenting records'') to - the client. - - - - If your input files are not permanent - for example if you retrieve - your records from an outside source, or if they were temporarily - mounted on a CD-ROM drive, - you may want Zebra to make an internal copy of them. To do this, - you specify 1 (true) in the storeData setting. When - the Z39.50 server retrieves the records they will be read from the - internal file structures of the system. - - - - - - Indexing with no Record IDs (Simple Indexing) - - - If you have a set of records that are not expected to change over time - you may can build your database without record IDs. - This indexing method uses less space than the other methods and - is simple to use. - - - - To use this method, you simply omit the recordId entry - for the group of files that you index. To add a set of records you use - zebraidx with the update command. The - update command will always add all of the records that it - encounters to the index - whether they have already been indexed or - not. If the set of indexed files change, you should delete all of the - index files, and build a new index from scratch. - - - - Consider a system in which you have a group of text files called - simple. - That group of records should belong to a Z39.50 database called - textbase. 
- The following zebra.cfg file will suffice: - - - - - profilePath: /usr/local/idzebra/tab - attset: bib1.att - simple.recordType: text - simple.database: textbase - - - - - Since the existing records in an index can not be addressed by their - IDs, it is impossible to delete or modify records when using this method. - - - - - - Indexing with File Record IDs - - - If you have a set of files that regularly change over time: Old files - are deleted, new ones are added, or existing files are modified, you - can benefit from using the file ID - indexing methodology. - Examples of this type of database might include an index of WWW - resources, or a USENET news spool area. - Briefly speaking, the file key methodology uses the directory paths - of the individual records as a unique identifier for each record. - To perform indexing of a directory with file keys, again, you specify - the top-level directory after the update command. - The command will recursively traverse the directories and compare - each one with whatever have been indexed before in that same directory. - If a file is new (not in the previous version of the directory) it - is inserted into the registers; if a file was already indexed and - it has been modified since the last update, the index is also - modified; if a file has been removed since the last - visit, it is deleted from the index. - - - - The resulting system is easy to administrate. To delete a record you - simply have to delete the corresponding file (say, with the - rm command). And to add records you create new - files (or directories with files). For your changes to take effect - in the register you must run zebraidx update with - the same directory root again. This mode of operation requires more - disk space than simpler indexing methods, but it makes it easier for - you to keep the index in sync with a frequently changing set of data. 
- If you combine this system with the safe update - facility (see below), you never have to take your server off-line for - maintenance or register updating purposes. - - - To enable indexing with pathname IDs, you must specify - file as the value of recordId - in the configuration file. In addition, you should set - storeKeys to 1, since the Zebra - indexer must save additional information about the contents of each record - in order to modify the indexes correctly at a later time. + Please note that in both the modify- and delete- case the &zebra; + indexer must be able to generate a unique key that identifies the record + in question (more on this below). - - - For example, to update records of group esdd - located below - /data1/records/ you should type: - - $ zebraidx -g esdd update /data1/records - + To administrate the &zebra; retrieval system, you run the + zebraidx program. + This program supports a number of options which are preceded by a dash, + and a few commands (not preceded by dash). - + - The corresponding configuration file includes: - - esdd.recordId: file - esdd.recordType: grs.sgml - esdd.storeKeys: 1 - + Both the &zebra; administrative tool and the &acro.z3950; server share a + set of index files and a global configuration file. + The name of the configuration file defaults to + zebra.cfg. + The configuration file includes specifications on how to index + various kinds of records and where the other configuration files + are located. zebrasrv and zebraidx + must be run in the directory where the + configuration file lives unless you indicate the location of the + configuration file by option -c. - - - You cannot start out with a group of records with simple - indexing (no record IDs as in the previous section) and then later - enable file record Ids. Zebra must know from the first time that you - index the group that - the files should be indexed with file record IDs. 
+ + + Record Types + + + Indexing is a per-record process, in which either insert/modify/delete + will occur. Before a record is indexed search keys are extracted from + whatever might be the layout the original record (sgml,html,text, etc..). + The &zebra; system currently supports two fundamental types of records: + structured and simple text. + To specify a particular extraction process, use either the + command line option -t or specify a + recordType setting in the configuration file. - - - - You cannot explicitly delete records when using this method (using the - delete command to zebraidx. Instead - you have to delete the files from the file system (or move them to a - different location) - and then run zebraidx with the - update command. - - - - - - Indexing with General Record IDs - - - When using this method you construct an (almost) arbitrary, internal - record key based on the contents of the record itself and other system - information. If you have a group of records that explicitly associates - an ID with each record, this method is convenient. For example, the - record format may contain a title or a ID-number - unique within the group. - In either case you specify the Z39.50 attribute set and use-attribute - location in which this information is stored, and the system looks at - that field to determine the identity of the record. - - - - As before, the record ID is defined by the recordId - setting in the configuration file. The value of the record ID specification - consists of one or more tokens separated by whitespace. The resulting - ID is represented in the index by concatenating the tokens and - separating them by ASCII value (1). - - - - There are three kinds of tokens: - - - - Internal record info - - - The token refers to a key that is - extracted from the record. The syntax of this token is - ( set , - use ), - where set is the - attribute set name use is the - name or value of the attribute. 
- - - - - System variable - - - The system variables are preceded by - - - $ - - and immediately followed by the system variable name, which - may one of - - - - group - - - Group name. - - - - - database - - - Current database specified. - - - - - type - - - Record type. - - - - - - - - - Constant string - - - A string used as part of the ID — surrounded - by single- or double quotes. - - - - - - - - For instance, the sample GILS records that come with the Zebra - distribution contain a unique ID in the data tagged Control-Identifier. - The data is mapped to the Bib-1 use attribute Identifier-standard - (code 1007). To use this field as a record id, specify - (bib1,Identifier-standard) as the value of the - recordId in the configuration file. - If you have other record types that uses the same field for a - different purpose, you might add the record type - (or group or database name) to the record id of the gils - records as well, to prevent matches with other types of records. - In this case the recordId might be set like this: - - - gils.recordId: $type (bib1,Identifier-standard) - - - - - - (see - for details of how the mapping between elements of your records and - searchable attributes is established). - - - - As for the file record ID case described in the previous section, - updating your system is simply a matter of running - zebraidx - with the update command. However, the update with general - keys is considerably slower than with file record IDs, since all files - visited must be (re)read to discover their IDs. - - - - As you might expect, when using the general record IDs - method, you can only add or modify existing records with the - update command. - If you wish to delete records, you must use the, - delete command, with a directory as a parameter. - This will remove all records that match the files below that root - directory. 
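As noted above, the resulting record ID is represented in the index by concatenating the resolved tokens with ASCII value 1 as the separator. A minimal sketch of that convention, with hypothetical token values (only the separator rule comes from the text):

```python
def make_record_id(tokens):
    """Concatenate resolved recordId tokens, separated by ASCII value 1."""
    return "\x01".join(tokens)


# e.g. recordId: $type (bib1,Identifier-standard)
# -> the record type plus the record's Identifier-standard value
# (the identifier below is an invented example)
rid = make_record_id(["grs.sgml", "US-GILS-12345"])
print(repr(rid))  # 'grs.sgml\x01US-GILS-12345'
```

Prefixing the `$type` token, as in the GILS example above, keeps records of different types from colliding even when they carry the same identifier value.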
- - - - - - Register Location - - - Normally, the index files that form dictionaries, inverted - files, record info, etc., are stored in the directory where you run - zebraidx. If you wish to store these, possibly large, - files somewhere else, you must add the register - entry to the zebra.cfg file. - Furthermore, the Zebra system allows its file - structures to span multiple file systems, which is useful for - managing very large databases. - - - - The value of the register setting is a sequence - of tokens. Each token takes the form: - - - dir:size. - - - The dir specifies a directory in which index files - will be stored and the size specifies the maximum - size of all files in that directory. The Zebra indexer system fills - each directory in the order specified and use the next specified - directories as needed. - The size is an integer followed by a qualifier - code, - b for bytes, - k for kilobytes. - M for megabytes, - G for gigabytes. - - - - For instance, if you have allocated two disks for your register, and - the first disk is mounted - on /d1 and has 2GB of free space and the - second, mounted on /d2 has 3.6 GB, you could - put this entry in your configuration file: - - - register: /d1:2G /d2:3600M - - - - - - Note that Zebra does not verify that the amount of space specified is - actually available on the directory (file system) specified - it is - your responsibility to ensure that enough space is available, and that - other applications do not attempt to use the free space. In a large - production system, it is recommended that you allocate one or more - file system exclusively to the Zebra register files. - - - - - - Safe Updating - Using Shadow Registers - - - Description - + + + + + The &zebra; Configuration File + + + The &zebra; configuration file, read by zebraidx and + zebrasrv defaults to zebra.cfg + unless specified by -c option. + + - The Zebra server supports updating of the index - structures. 
That is, you can add, modify, or remove records from - databases managed by Zebra without rebuilding the entire index. - Since this process involves modifying structured files with various - references between blocks of data in the files, the update process - is inherently sensitive to system crashes, or to process interruptions: - Anything but a successfully completed update process will leave the - register files in an unknown state, and you will essentially have no - recourse but to re-index everything, or to restore the register files - from a backup medium. - Further, while the update process is active, users cannot be - allowed to access the system, as the contents of the register files - may change unpredictably. + You can edit the configuration file with a normal text editor. + parameter names and values are separated by colons in the file. Lines + starting with a hash sign (#) are + treated as comments. - + - You can solve these problems by enabling the shadow register system in - Zebra. - During the updating procedure, zebraidx will temporarily - write changes to the involved files in a set of "shadow - files", without modifying the files that are accessed by the - active server processes. If the update procedure is interrupted by a - system crash or a signal, you simply repeat the procedure - the - register files have not been changed or damaged, and the partially - written shadow files are automatically deleted before the new updating - procedure commences. + If you manage different sets of records that share common + characteristics, you can organize the configuration settings for each + type into "groups". + When zebraidx is run and you wish to address a + given group you specify the group name with the -g + option. + In this case settings that have the group name as their prefix + will be used by zebraidx. + If no -g option is specified, the settings + without prefix are used. 
- + - At the end of the updating procedure (or in a separate operation, if - you so desire), the system enters a "commit mode". First, - any active server processes are forced to access those blocks that - have been changed from the shadow files rather than from the main - register files; the unmodified blocks are still accessed at their - normal location (the shadow files are not a complete copy of the - register files - they only contain those parts that have actually been - modified). If the commit process is interrupted at any point during the - commit process, the server processes will continue to access the - shadow files until you can repeat the commit procedure and complete - the writing of data to the main register files. You can perform - multiple update operations to the registers before you commit the - changes to the system files, or you can execute the commit operation - at the end of each update operation. When the commit phase has - completed successfully, any running server processes are instructed to - switch their operations to the new, operational register, and the - temporary shadow files are deleted. + In the configuration file, the group name is placed before the option + name itself, separated by a dot (.). For instance, to set the record type + for group public to grs.sgml + (the &acro.sgml;-like format for structured records) you would write: - - - - - How to Use Shadow Register Files - + - The first step is to allocate space on your system for the shadow - files. - You do this by adding a shadow entry to the - zebra.cfg file. - The syntax of the shadow entry is exactly the - same as for the register entry - (see ). - The location of the shadow area should be - different from the location of the main register - area (if you have specified one - remember that if you provide no - register setting, the default register area is the - working directory of the server and indexing processes). 
+ + public.recordType: grs.sgml + - + - The following excerpt from a zebra.cfg file shows - one example of a setup that configures both the main register - location and the shadow file area. - Note that two directories or partitions have been set aside - for the shadow file area. You can specify any number of directories - for each of the file areas, but remember that there should be no - overlaps between the directories used for the main registers and the - shadow files, respectively. + To set the default value of the record type to text + write: + - - register: /d1:500M - shadow: /scratch1:100M /scratch2:200M + recordType: text - - + - When shadow files are enabled, an extra command is available at the - zebraidx command line. - In order to make changes to the system take effect for the - users, you'll have to submit a "commit" command after a - (sequence of) update operation(s). + The available configuration settings are summarized below. They will be + explained further in the following sections. - + + + - + + + + + group + .recordType[.name]: + type + + + + Specifies how records with the file extension + name should be handled by the indexer. + This option may also be specified as a command line option + (-t). Note that if you do not specify a + name, the setting applies to all files. + In general, the record type specifier consists of the elements (each + element separated by dot), fundamental-type, + file-read-type and arguments. Currently, two + fundamental types exist, text and + grs. + + + + + group.recordId: + record-id-spec + + + Specifies how the records are to be identified when updated. See + . + + + + + group.database: + database + + + Specifies the &acro.z3950; database name. + + + + + + group.storeKeys: + boolean + + + Specifies whether key information should be saved for a given + group of records. If you plan to update/delete this type of + records later this should be specified as 1; otherwise it + should be 0 (default), to save register space. 
+ + See . + + + + + group.storeData: + boolean + + + Specifies whether the records should be stored internally + in the &zebra; system files. + If you want to maintain the raw records yourself, + this option should be false (0). + If you want &zebra; to take care of the records for you, it + should be true(1). + + + + + + register: register-location + + + Specifies the location of the various register files that &zebra; uses + to represent your databases. + See . + + + + + shadow: register-location + + + Enables the safe update facility of &zebra;, and + tells the system where to place the required, temporary files. + See . + + + + + lockDir: directory + + + Directory in which various lock files are stored. + + + + + keyTmpDir: directory + + + Directory in which temporary files used during zebraidx's update + phase are stored. + + + + + setTmpDir: directory + + + Specifies the directory that the server uses for temporary result sets. + If not specified /tmp will be used. + + + + + profilePath: path + + + Specifies a path of profile specification files. + The path is composed of one or more directories separated by + colon. Similar to PATH for UNIX systems. + + + + + + modulePath: path + + + Specifies a path of record filter modules. + The path is composed of one or more directories separated by + colon. Similar to PATH for UNIX systems. + The 'make install' procedure typically puts modules in + /usr/local/lib/idzebra-2.0/modules. + + + + + + index: filename + + + Defines the filename which holds fields structure + definitions. If omitted, the file default.idx + is read. + Refer to for + more information. + + + + + + sortmax: integer + + + Specifies the maximum number of records that will be sorted + in a result set. If the result set contains more than + integer records, records after the + limit will not be sorted. If omitted, the default value is + 1,000. + + + + + + staticrank: integer + + + Enables whether static ranking is to be enabled (1) or + disabled (0). 
If omitted, it is disabled - corresponding + to a value of 0. + Refer to . + + + + + + + estimatehits: integer + + + Controls whether &zebra; should calculate approximate hit counts and + at which hit count it is to be enabled. + A value of 0 disables approximate hit counts. + For a positive value approximate hit count is enabled + if it is known to be larger than integer. + + + Approximate hit counts can also be triggered by a particular + attribute in a query. + Refer to . + + + + + + attset: filename + + + Specifies the filename(s) of attribute set files for use in + searching. In many configurations bib1.att + is used, but that is not required. If Classic Explain + attributes is to be used for searching, + explain.att must be given. + The path to att-files in general can be given using + profilePath setting. + See also . + + + + + memMax: size + + + Specifies size of internal memory + to use for the zebraidx program. + The amount is given in megabytes - default is 4 (4 MB). + The more memory, the faster large updates happen, up to about + half the free memory available on the computer. + + + + + tempfiles: Yes/Auto/No + + + Tells zebra if it should use temporary files when indexing. The + default is Auto, in which case zebra uses temporary files only + if it would need more that memMax + megabytes of memory. This should be good for most uses. + + + + + + root: dir + + + Specifies a directory base for &zebra;. All relative paths + given (in profilePath, register, shadow) are based on this + directory. This setting is useful if your &zebra; server + is running in a different directory from where + zebra.cfg is located. + + + + + + passwd: file + + + Specifies a file with description of user accounts for &zebra;. + The format is similar to that known to Apache's htpasswd files + and UNIX' passwd files. Non-empty lines not beginning with + # are considered account lines. There is one account per-line. 
+ A line consists of fields separate by a single colon character. + First field is username, second is password. + + + + + + passwd.c: file + + + Specifies a file with description of user accounts for &zebra;. + File format is similar to that used by the passwd directive except + that the password are encrypted. Use Apache's htpasswd or similar + for maintenance. + + + + + + perm.user: + permstring + + + Specifies permissions (privilege) for a user that are allowed + to access &zebra; via the passwd system. There are two kinds + of permissions currently: read (r) and write(w). By default + users not listed in a permission directive are given the read + privilege. To specify permissions for a user with no + username, or &acro.z3950; anonymous style use + anonymous. The permstring consists of + a sequence of characters. Include character w + for write/update access, r for read access and + a to allow anonymous access through this account. + + + + + + dbaccess: accessfile + + + Names a file which lists database subscriptions for individual users. + The access file should consists of lines of the form + username: dbnames, where dbnames is a list of + database names, separated by '+'. No whitespace is allowed in the + database list. + + + + + + encoding: charsetname + + + Tells &zebra; to interpret the terms in Z39.50 queries as + having been encoded using the specified character + encoding. The default is ISO-8859-1; one + useful alternative is UTF-8. + + + + + + storeKeys: value + + + Specifies whether &zebra; keeps a copy of indexed keys. + Use a value of 1 to enable; 0 to disable. If storeKeys setting is + omitted, it is enabled. Enabled storeKeys + are required for updating and deleting records. Disable only + storeKeys to save space and only plan to index data once. + + + + + + storeData: value + + + Specifies whether &zebra; keeps a copy of indexed records. + Use a value of 1 to enable; 0 to disable. If storeData setting is + omitted, it is enabled. 
A storeData setting of 0 (disabled) makes
&zebra; fetch records from their original location in the file
system, using filename, file offset and file length. For the
DOM and ALVIS filters, the storeData setting is ignored.

Locating Records

The default behavior of the &zebra; system is to reference the
records from their original location, i.e. where they were found when you
run zebraidx.
That is, when a client wishes to retrieve a record
following a search operation, the files are accessed from the place
where you originally put them - if you remove the files (without
running zebraidx again), the server will return
diagnostic number 14 (``System error in presenting records'') to
the client.

If your input files are not permanent - for example if you retrieve
your records from an outside source, or if they were temporarily
mounted on a CD-ROM drive -
you may want &zebra; to make an internal copy of them. To do this,
you specify 1 (true) in the storeData setting. When
the &acro.z3950; server retrieves the records they will be read from the
internal file structures of the system.

Indexing with no Record IDs (Simple Indexing)

If you have a set of records that are not expected to change over time,
you can build your database without record IDs.
This indexing method uses less space than the other methods and
is simple to use.

To use this method, you simply omit the recordId entry
for the group of files that you index. To add a set of records you use
zebraidx with the update command. The
update command will always add all of the records that it
encounters to the index - whether they have already been indexed or
not. If the set of indexed files changes, you should delete all of the
index files and build a new index from scratch.

Consider a system in which you have a group of text files called
simple.
That group of records should belong to a &acro.z3950; database called
textbase.
The following zebra.cfg file will suffice:

profilePath: /usr/local/idzebra/tab
attset: bib1.att
simple.recordType: text
simple.database: textbase

Since the existing records in an index cannot be addressed by their
IDs, it is impossible to delete or modify records when using this method.

Indexing with File Record IDs

If you have a set of files that regularly change over time - old files
are deleted, new ones are added, or existing files are modified - you
can benefit from using the file ID
indexing methodology.
Examples of this type of database might include an index of WWW
resources, or a USENET news spool area.
Briefly speaking, the file key methodology uses the directory paths
of the individual records as a unique identifier for each record.
To perform indexing of a directory with file keys, again, you specify
the top-level directory after the update command.
The command will recursively traverse the directories and compare
each one with whatever has been indexed before in that same directory.
If a file is new (not in the previous version of the directory) it
is inserted into the registers; if a file was already indexed and
it has been modified since the last update, the index is also
modified; if a file has been removed since the last
visit, it is deleted from the index.

The resulting system is easy to administrate. To delete a record you
simply have to delete the corresponding file (say, with the
rm command). And to add records you create new
files (or directories with files). For your changes to take effect
in the register you must run zebraidx update with
the same directory root again.
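Under file record IDs, record maintenance reduces to ordinary file operations followed by a re-run of the indexer. A hypothetical session (the paths and group name are invented for illustration) might look like:

```
rm /data1/records/obsolete.xml          # delete a record
cp ~/incoming/fresh.xml /data1/records/ # add a record
zebraidx -g esdd update /data1/records  # re-sync the register
```

Only the changed files are re-processed; untouched files in the directory tree are left alone.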
This mode of operation requires more
disk space than simpler indexing methods, but it makes it easier for
you to keep the index in sync with a frequently changing set of data.
If you combine this system with the safe update
facility (see below), you never have to take your server off-line for
maintenance or register updating purposes.

To enable indexing with pathname IDs, you must specify
file as the value of recordId
in the configuration file. In addition, you should set
storeKeys to 1, since the &zebra;
indexer must save additional information about the contents of each record
in order to modify the indexes correctly at a later time.

For example, to update records of group esdd
located below
/data1/records/ you should type:

$ zebraidx -g esdd update /data1/records

The corresponding configuration file includes:

esdd.recordId: file
esdd.recordType: grs.sgml
esdd.storeKeys: 1

You cannot start out with a group of records with simple
indexing (no record IDs as in the previous section) and then later
enable file record IDs. &zebra; must know from the first time that you
index the group that
the files should be indexed with file record IDs.

You cannot explicitly delete records when using this method (using the
delete command of zebraidx). Instead
you have to delete the files from the file system (or move them to a
different location)
and then run zebraidx with the
update command.
Indexing with General Record IDs

When using this method you construct an (almost) arbitrary, internal
record key based on the contents of the record itself and other system
information. If you have a group of records that explicitly associates
an ID with each record, this method is convenient. For example, the
record format may contain a title or an ID-number, unique within the group.
+ In either case you specify the &acro.z3950; attribute set and use-attribute + location in which this information is stored, and the system looks at + that field to determine the identity of the record. - - - - + + As before, the record ID is defined by the recordId + setting in the configuration file. The value of the record ID specification + consists of one or more tokens separated by whitespace. The resulting + ID is represented in the index by concatenating the tokens and + separating them by ASCII value (1). + - - Relevance Ranking and Sorting of Result Sets + + There are three kinds of tokens: + + + + Internal record info + + + The token refers to a key that is + extracted from the record. The syntax of this token is + ( set , + use ), + where set is the + attribute set name use is the + name or value of the attribute. + + + + + System variable + + + The system variables are preceded by + + + $ + + and immediately followed by the system variable name, which + may one of + + + + group + + + Group name. + + + + + database + + + Current database specified. + + + + + type + + + Record type. + + + + + + + + + Constant string + + + A string used as part of the ID — surrounded + by single- or double quotes. + + + + + - - Overview - The default ordering of a result set is left up to the server, - which inside Zebra means sorting in ascending document ID order. - This is not always the order humans want to browse the sometimes - quite large hit sets. Ranking and sorting comes to the rescue. + For instance, the sample GILS records that come with the &zebra; + distribution contain a unique ID in the data tagged Control-Identifier. + The data is mapped to the &acro.bib1; use attribute Identifier-standard + (code 1007). To use this field as a record id, specify + (bib1,Identifier-standard) as the value of the + recordId in the configuration file. 
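The token concatenation described above can be sketched in a few lines of Python; the token values below are invented for illustration, and the real implementation is internal to &zebra;:

```python
# Sketch: build an internal record ID from already-resolved recordId
# tokens, joined by the ASCII 0x01 separator (as described above).
def make_record_id(tokens):
    return "\x01".join(tokens)

# e.g. recordId: $type (bib1,Identifier-standard)
# with a hypothetical record type and identifier value:
rid = make_record_id(["grs.sgml", "UNEP-GRID-0042"])
```

The separator byte cannot occur in normal textual data, which is what makes the concatenated ID unambiguous.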
If you have other record types that use the same field for a
different purpose, you might add the record type
(or group or database name) to the record ID of the gils
records as well, to prevent matches with other types of records.
In this case the recordId might be set like this:

gils.recordId: $type (bib1,Identifier-standard)

(see
for details of how the mapping between elements of your records and
searchable attributes is established).

As for the file record ID case described in the previous section,
updating your system is simply a matter of running
zebraidx
with the update command. However, updating with general
keys is considerably slower than with file record IDs, since all files
visited must be (re)read to discover their IDs.

As you might expect, when using the general record IDs
method, you can only add or modify existing records with the
update command.
If you wish to delete records, you must use the
delete command, with a directory as a parameter.
This will remove all records that match the files below that root
directory.

Register Location

Normally, the index files that form dictionaries, inverted
files, record info, etc., are stored in the directory where you run
zebraidx. If you wish to store these, possibly large,
files somewhere else, you must add the register
entry to the zebra.cfg file.
Furthermore, the &zebra; system allows its file
structures to span multiple file systems, which is useful for
managing very large databases.

The value of the register setting is a sequence
of tokens. Each token takes the form:

dir:size

The dir specifies a directory in which index files
will be stored and the size specifies the maximum
size of all files in that directory. The &zebra; indexer system fills
each directory in the order specified and uses the next specified
directories as needed.
The size is an integer followed by a qualifier
code:
b for bytes,
k for kilobytes,
M for megabytes,
G for gigabytes.
Specifying a negative value disables the checking (it still needs the unit;
use -1b).
+ - If one defines the + For instance, if you have allocated three disks for your register, and + the first disk is mounted + on /d1 and has 2GB of free space, the + second, mounted on /d2 has 3.6 GB, and the third, + on which you have more space than you bother to worry about, mounted on + /d3 you could put this entry in your configuration file: + - staticrank: 1 - - directive in the main core Zebra configuration file, the internal document - keys used for ordering are augmented by a preceding integer, which - contains the static rank of a given document, and the index lists - are ordered - first by ascending static rank, - then by ascending document ID. - Zero - is the ``best'' rank, as it occurs at the - beginning of the list; higher numbers represent worse scores. + register: /d1:2G /d2:3600M /d3:-1b + + - The experimental alvis filter provides a - directive to fetch static rank information out of the indexed XML - records, thus making all hit sets ordered - after ascending static - rank, and for those doc's which have the same static rank, ordered - after ascending doc ID. - See for the gory details. + Note that &zebra; does not verify that the amount of space specified is + actually available on the directory (file system) specified - it is + your responsibility to ensure that enough space is available, and that + other applications do not attempt to use the free space. In a large + production system, it is recommended that you allocate one or more + file system exclusively to the &zebra; register files. - + - - Dynamic Ranking - - In order to fiddle with the static rank order, it is necessary to - invoke additional re-ranking/re-ordering using dynamic - ranking or score functions. These functions return positive - integer scores, where highest score is - ``best''; - hit sets are sorted according to descending - scores (in contrary - to the index lists which are sorted according to - ascending rank number and document ID). 
- - - Dynamic ranking is enabled by a directive like one of the - following in the zebra configuration file (use only one of these a time!): - - rank: rank-1 # default TDF-IDF like - rank: rank-static # dummy do-nothing - - - - - Dynamic ranking is done at query time rather than - indexing time (this is why we - call it ``dynamic ranking'' in the first place ...) - It is invoked by adding - the Bib-1 relation attribute with - value ``relevance'' to the PQF query (that is, - @attr 2=102, see also - - The BIB-1 Attribute Set Semantics, also in - HTML). - To find all articles with the word Eoraptor in - the title, and present them relevance ranked, issue the PQF query: - - @attr 2=102 @attr 1=4 Eoraptor - - + + Safe Updating - Using Shadow Registers - - Dynamically ranking using PQF queries with the 'rank-1' - algorithm + + Description - - The default rank-1 ranking module implements a - TF/IDF (Term Frequecy over Inverse Document Frequency) like - algorithm. In contrast to the usual defintion of TF/IDF - algorithms, which only considers searching in one full-text - index, this one works on multiple indexes at the same time. - More precisely, - Zebra does boolean queries and searches in specific addressed - indexes (there are inverted indexes pointing from terms in the - dictionary to documents and term positions inside documents). - It works like this: - - - Query Components - - - First, the boolean query is dismantled into it's principal components, - i.e. atomic queries where one term is looked up in one index. - For example, the query - - @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer - - is a boolean AND between the atomic parts - - @attr 2=102 @attr 1=1010 Utah - - and - - @attr 2=102 @attr 1=1018 Springer - - which gets processed each for itself. - - - - - - Atomic hit lists - - - Second, for each atomic query, the hit list of documents is - computed. - - - In this example, two hit lists for each index - @attr 1=1010 and - @attr 1=1018 are computed. 
- - - - - - Atomic scores - - - Third, each document in the hit list is assigned a score (_if_ ranking - is enabled and requested in the query) using a TF/IDF scheme. - - - In this example, both atomic parts of the query assign the magic - @attr 2=102 relevance attribute, and are - to be used in the relevance ranking functions. - - - It is possible to apply dynamic ranking on only parts of the - PQF query: - - @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer - - searches for all documents which have the term 'Utah' on the - body of text, and which have the term 'Springer' in the publisher - field, and sort them in the order of the relevance ranking made on - the body-of-text index only. - - - - - - Hit list merging - - - Fourth, the atomic hit lists are merged according to the boolean - conditions to a final hit list of documents to be returned. - - - This step is always performed, independently of the fact that - dynamic ranking is enabled or not. - - - - - - Document score computation - - - Fifth, the total score of a document is computed as a linear - combination of the atomic scores of the atomic hit lists - - - Ranking weights may be used to pass a value to a ranking - algorithm, using the non-standard BIB-1 attribute type 9. - This allows one branch of a query to use one value while - another branch uses a different one. For example, we can search - for utah in the - @attr 1=4 index with weight 30, as - well as in the @attr 1=1010 index with weight 20: - - @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city - - - - The default weight is - sqrt(1000) ~ 34 , as the Z39.50 standard prescribes that the top score - is 1000 and the bottom score is 0, encoded in integers. - - - - The ranking-weight feature is experimental. It may change in future - releases of zebra. - - - - - - - Re-sorting of hit list - - - Finally, the final hit list is re-ordered according to scores. 
- - - - - - - - + + The &zebra; server supports updating of the index + structures. That is, you can add, modify, or remove records from + databases managed by &zebra; without rebuilding the entire index. + Since this process involves modifying structured files with various + references between blocks of data in the files, the update process + is inherently sensitive to system crashes, or to process interruptions: + Anything but a successfully completed update process will leave the + register files in an unknown state, and you will essentially have no + recourse but to re-index everything, or to restore the register files + from a backup medium. + Further, while the update process is active, users cannot be + allowed to access the system, as the contents of the register files + may change unpredictably. + + + + You can solve these problems by enabling the shadow register system in + &zebra;. + During the updating procedure, zebraidx will temporarily + write changes to the involved files in a set of "shadow + files", without modifying the files that are accessed by the + active server processes. If the update procedure is interrupted by a + system crash or a signal, you simply repeat the procedure - the + register files have not been changed or damaged, and the partially + written shadow files are automatically deleted before the new updating + procedure commences. + + + + At the end of the updating procedure (or in a separate operation, if + you so desire), the system enters a "commit mode". First, + any active server processes are forced to access those blocks that + have been changed from the shadow files rather than from the main + register files; the unmodified blocks are still accessed at their + normal location (the shadow files are not a complete copy of the + register files - they only contain those parts that have actually been + modified). 
If the commit process is interrupted at any point during the + commit process, the server processes will continue to access the + shadow files until you can repeat the commit procedure and complete + the writing of data to the main register files. You can perform + multiple update operations to the registers before you commit the + changes to the system files, or you can execute the commit operation + at the end of each update operation. When the commit phase has + completed successfully, any running server processes are instructed to + switch their operations to the new, operational register, and the + temporary shadow files are deleted. + + + + + + How to Use Shadow Register Files + + + The first step is to allocate space on your system for the shadow + files. + You do this by adding a shadow entry to the + zebra.cfg file. + The syntax of the shadow entry is exactly the + same as for the register entry + (see ). + The location of the shadow area should be + different from the location of the main register + area (if you have specified one - remember that if you provide no + register setting, the default register area is the + working directory of the server and indexing processes). + + + + The following excerpt from a zebra.cfg file shows + one example of a setup that configures both the main register + location and the shadow file area. + Note that two directories or partitions have been set aside + for the shadow file area. You can specify any number of directories + for each of the file areas, but remember that there should be no + overlaps between the directories used for the main registers and the + shadow files, respectively. + + + + + register: /d1:500M + shadow: /scratch1:100M /scratch2:200M + + + + + + When shadow files are enabled, an extra command is available at the + zebraidx command line. + In order to make changes to the system take effect for the + users, you'll have to submit a "commit" command after a + (sequence of) update operation(s). 
$ zebraidx update /d1/records
$ zebraidx commit

Or you can execute multiple updates before committing the changes:

$ zebraidx -g books update /d1/records /d2/more-records
$ zebraidx -g fun update /d3/fun-records
$ zebraidx commit

If one of the update operations above had been interrupted, the commit
operation on the last line would fail: zebraidx
will not let you commit changes that would destroy the running register.
You'll have to rerun all of the update operations since your last
commit operation, before you can commit the new changes.

Similarly, if the commit operation fails, zebraidx
will not let you start a new update operation before you have
successfully repeated the commit operation.
The server processes will keep accessing the shadow files rather
than the (possibly damaged) blocks of the main register files
until the commit operation has successfully completed.

You should be aware that update operations may take slightly longer
when the shadow register system is enabled, since more file access
operations are involved. Further, while the disk space required for
the shadow register data is modest for a small update operation, you
may prefer to disable the system if you are adding a very large number
of records to an already very large database (we use the terms
large and modest
very loosely here, since every application will have a
different perception of size).
To update the system without the use of the shadow files,
simply run zebraidx with the -n
option (note that you do not have to execute the
commit command of zebraidx
when you temporarily disable the use of the shadow registers in
this fashion).
Note also that, just as when the shadow registers are not enabled,
server processes will be barred from accessing the main register
while the update procedure takes place.
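The update-then-commit cycle lends itself to scripting. A minimal sketch (group and directory names are taken from the examples above) that only commits when every update succeeds:

```
#!/bin/sh
# Run e.g. nightly; commit only if all updates succeed.
zebraidx -g books update /d1/records /d2/more-records &&
zebraidx -g fun update /d3/fun-records &&
zebraidx commit
```

If any update fails, the commit step is skipped, the main register stays untouched, and the partially written shadow files are cleaned up automatically the next time the procedure runs.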
Relevance Ranking and Sorting of Result Sets

Overview

The default ordering of a result set is left up to the server,
which inside &zebra; means sorting in ascending document ID order.
This is not always the order in which humans want to browse the
sometimes quite large hit sets. Ranking and sorting come to the rescue.

In cases where a good presentation ordering can be computed at
indexing time, we can use a fixed static ranking
scheme, which is provided for the alvis
indexing filter. This defines a fixed ordering of hit lists,
independently of the query issued.

There are cases, however, where relevance of hit set documents is
highly dependent on the query processed.
Simply put, dynamic relevance ranking
sorts a set of retrieved records such that those most likely to be
relevant to your request are retrieved first.
Internally, &zebra; retrieves all documents that satisfy your
query, and re-orders the hit list to arrange them based on
a measurement of similarity between your query and the content of
each record.

Finally, there are situations where hit sets of documents should be
sorted during query time according to the
lexicographical ordering of certain sort indexes created at
indexing time.

Static Ranking

&zebra; internally uses inverted indexes to look up term frequencies
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations AND,
OR and/or NOT (which
is in fact a binary AND NOT operation).
To ensure fast query execution
speed, all indexes have to be sorted in the same order.
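The requirement that all indexes be sorted in the same order is what makes boolean merging cheap; the classic linear-time intersection of two docID-sorted posting lists can be sketched as follows (a generic illustration, not &zebra;'s actual code):

```python
# Sketch: AND-merge of two document-ID-sorted posting lists,
# as done conceptually by inverted-index engines.
def and_merge(a, b):
    i = j = 0
    hits = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            hits.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return hits
```

Because both lists advance monotonically, each posting is touched at most once; if the two indexes were sorted differently, this single pass would be impossible.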
The indexes are normally sorted according to document
ID in
ascending order, and any query which does not invoke a special
re-ranking function will therefore retrieve the result set in
document
ID
order.

If one defines the

staticrank: 1

directive in the main core &zebra; configuration file, the internal document
keys used for ordering are augmented by a preceding integer, which
contains the static rank of a given document, and the index lists
are ordered
first by ascending static rank,
then by ascending document ID.
Zero
is the ``best'' rank, as it occurs at the
beginning of the list; higher numbers represent worse scores.

The experimental alvis filter provides a
directive to fetch static rank information out of the indexed &acro.xml;
records, thus making all hit sets ordered
by ascending static
rank, with documents that have the same static rank ordered
by ascending document ID.
See for the gory details.

Dynamic Ranking

In order to fiddle with the static rank order, it is necessary to
invoke additional re-ranking/re-ordering using dynamic
ranking or score functions. These functions return positive
integer scores, where the highest score is
``best'';
hit sets are sorted according to descending
scores (in contrast
to the index lists, which are sorted according to
ascending rank number and document ID).

Dynamic ranking is enabled by a directive like one of the
following in the zebra configuration file (use only one of these at a time!):

rank: rank-1 # default TF-IDF like
rank: rank-static # dummy do-nothing

Dynamic ranking is done at query time rather than
indexing time (this is why we
call it ``dynamic ranking'' in the first place ...)
It is invoked by adding
the &acro.bib1; relation attribute with
value ``relevance'' to the &acro.pqf; query (that is,
@attr 2=102, see also

The &acro.bib1; Attribute Set Semantics, also in
HTML).
To find all articles with the word Eoraptor in
the title, and present them relevance ranked, issue the &acro.pqf; query:

@attr 2=102 @attr 1=4 Eoraptor

The default rank-1 ranking module implements a
TF/IDF (Term Frequency over Inverse Document Frequency) like
algorithm. In contrast to the usual definition of TF/IDF
algorithms, which only consider searching in one full-text
index, this one works on multiple indexes at the same time.
More precisely,
&zebra; does boolean queries and searches in specific addressed
indexes (there are inverted indexes pointing from terms in the
dictionary to documents and term positions inside documents).
It works like this:

Query Components

First, the boolean query is dismantled into its principal components,
i.e. atomic queries where one term is looked up in one index.
For example, the query

@attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer

is a boolean AND between the atomic parts

@attr 2=102 @attr 1=1010 Utah

and

@attr 2=102 @attr 1=1018 Springer

each of which is processed by itself.

Atomic hit lists

Second, for each atomic query, the hit list of documents is
computed.

In this example, two hit lists for each index
@attr 1=1010 and
@attr 1=1018 are computed.

Atomic scores

Third, each document in the hit list is assigned a score (if ranking
is enabled and requested in the query) using a TF/IDF scheme.

In this example, both atomic parts of the query assign the magic
@attr 2=102 relevance attribute, and are
to be used in the relevance ranking functions.
+ + + It is possible to apply dynamic ranking on only parts of the + &acro.pqf; query: + + @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer + + searches for all documents which have the term 'Utah' on the + body of text, and which have the term 'Springer' in the publisher + field, and sort them in the order of the relevance ranking made on + the body-of-text index only. + + + + + + Hit list merging + + + Fourth, the atomic hit lists are merged according to the boolean + conditions to a final hit list of documents to be returned. + + + This step is always performed, independently of the fact that + dynamic ranking is enabled or not. + + + + + + Document score computation + + + Fifth, the total score of a document is computed as a linear + combination of the atomic scores of the atomic hit lists + + + Ranking weights may be used to pass a value to a ranking + algorithm, using the non-standard &acro.bib1; attribute type 9. + This allows one branch of a query to use one value while + another branch uses a different one. For example, we can search + for utah in the + @attr 1=4 index with weight 30, as + well as in the @attr 1=1010 index with weight 20: + + @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city + + + + The default weight is + sqrt(1000) ~ 34 , as the &acro.z3950; standard prescribes that the top score + is 1000 and the bottom score is 0, encoded in integers. + + + + The ranking-weight feature is experimental. It may change in future + releases of zebra. + + + + + + + Re-sorting of hit list + + + Finally, the final hit list is re-ordered according to scores. + + + + + - - + + + The rank-1 algorithm + does not use the static rank + information in the list keys, and will produce the same ordering + with or without static ranking enabled. + + + + + + + + Dynamic ranking is not compatible + with estimated hit sizes, as all documents in + a hit set must be accessed to compute the correct placing in a + ranking sorted list. 
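The five steps above can be sketched end-to-end in a few lines; the scoring formula here is a generic TF/IDF-style stand-in, not &zebra;'s exact rank-1 computation, and the postings data is invented:

```python
import math

# Sketch: score one atomic hit list with a generic TF/IDF formula,
# then linearly combine atomic scores (weights play the role of the
# non-standard attribute-type 9 ranking weights described above).
def atomic_scores(postings, n_docs):
    # postings: {docid: term frequency} for one term in one index
    idf = math.log(n_docs / len(postings))
    return {doc: tf * idf for doc, tf in postings.items()}

def combine(hit_list, score_maps, weights):
    # re-sort the merged hit list by the weighted sum of atomic scores
    return sorted(
        hit_list,
        key=lambda d: sum(w * s.get(d, 0.0)
                          for w, s in zip(weights, score_maps)),
        reverse=True,
    )

s1 = atomic_scores({1: 3, 2: 1}, n_docs=10)   # e.g. @attr 1=4 term
s2 = atomic_scores({2: 2, 3: 1}, n_docs=10)   # e.g. @attr 1=1010 term
ranked = combine([1, 2, 3], [s1, s2], weights=[10, 40])
```

Raising the weight of the second atomic query promotes documents that match it, which is exactly the effect the attribute-type 9 mechanism is described as having.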
Therefore the use attribute setting
+ @attr 2=102 clashes with
+ @attr 9=integer.
+ 
+ 
+ 
+ 
- Dynamically ranking CQL queries
+ Dynamically ranking &acro.cql; queries
 
- Dynamic ranking can be enabled during sever side CQL
+ Dynamic ranking can be enabled during server side &acro.cql;
query expansion by adding @attr 2=102
- chunks to the CQL config file. For example
+ chunks to the &acro.cql; config file. For example
 
relationModifier.relevant = 2=102
 
- invokes dynamic ranking each time a CQL query of the form
+ invokes dynamic ranking each time a &acro.cql; query of the form
 
Z> querytype cql
Z> f alvis.text =/relevant house
 
is issued. Dynamic ranking can also be automatically used on
- specific CQL indexes by (for example) setting
+ specific &acro.cql; indexes by (for example) setting
 
index.alvis.text = 1=text 2=102
 
- which then invokes dynamic ranking each time a CQL query of the form
+ which then invokes dynamic ranking each time a &acro.cql; query of the form
 
Z> querytype cql
Z> f alvis.text = house
 
is issued.
 
- 
+ 
 
- 
+ 
 
- Sorting
- 
- Zebra sorts efficiently using special sorting indexes
+ 
+ Sorting
+ 
+ &zebra; sorts efficiently using special sorting indexes
(type=s), so each sortable index must be
known at indexing time, specified in the configuration of record
- indexing. For example, to enable sorting according to the BIB-1
+ indexing. For example, to enable sorting according to the &acro.bib1;
Date/time-added-to-db field, one could add the line
 
- xelm /*/@created Date/time-added-to-db:s
+ xelm /*/@created Date/time-added-to-db:s
 
to any .abs record-indexing configuration file.
Similarly, one could add an indexing element of the form
- 
- 
- 
+ 
+ 
]]>
to any alvis-filter indexing stylesheet.
- 
- 
- Indexing can be specified at searching time using a query term
- carrying the non-standard
- BIB-1 attribute-type 7. This removes the
- need to send a Z39.50 Sort Request
- separately, and can dramatically improve latency when the client
- and server are on separate networks.
- The sorting part of the query is separate from the rest of the
- query - the actual search specification - and must be combined
- with it using OR.
- 
- A sorting subquery needs two attributes: an index (such as a
- BIB-1 type-1 attribute) specifying which index to sort on, and a
- type-7 attribute whose value is be 1 for
- ascending sorting, or 2 for descending. The
- term associated with the sorting attribute is the priority of
- the sort key, where 0 specifies the primary
- sort key, 1 the secondary sort key, and so
- on.
- 
+ 
+ 
+ Indexing can be specified at searching time using a query term
+ carrying the non-standard
+ &acro.bib1; attribute-type 7. This removes the
+ need to send a &acro.z3950; Sort Request
+ separately, and can dramatically improve latency when the client
+ and server are on separate networks.
+ The sorting part of the query is separate from the rest of the
+ query - the actual search specification - and must be combined
+ with it using OR.
+ 
+ 
+ A sorting subquery needs two attributes: an index (such as a
+ &acro.bib1; type-1 attribute) specifying which index to sort on, and a
+ type-7 attribute whose value is 1 for
+ ascending sorting, or 2 for descending. The
+ term associated with the sorting attribute is the priority of
+ the sort key, where 0 specifies the primary
+ sort key, 1 the secondary sort key, and so
+ on.
+ For example, a search for water, sort by title (ascending), - is expressed by the PQF query + is expressed by the &acro.pqf; query - @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 + @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 - whereas a search for water, sort by title ascending, + whereas a search for water, sort by title ascending, then date descending would be - @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1 + @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1 Notice the fundamental differences between dynamic - ranking and sorting: there can be + ranking and sorting: there can be only one ranking function defined and configured; but multiple sorting indexes can be specified dynamically at search time. Ranking does not need to use specific indexes, so dynamic ranking can be enabled and disabled without re-indexing; whereas, sorting indexes need to be defined before indexing. - + + + - + - + + Extended Services: Remote Insert, Update and Delete - - Extended Services: Remote Insert, Update and Delete - - Extended services are only supported when accessing the Zebra - server using the Z39.50 - protocol. The SRU protocol does + Extended services are only supported when accessing the &zebra; + server using the &acro.z3950; + protocol. The &acro.sru; protocol does not support extended services. - - + + The extended services are not enabled by default in zebra - due to the - fact that they modify the system. Zebra can be configured + fact that they modify the system. &zebra; can be configured to allow anybody to search, and to allow only updates for a particular admin user in the main zebra configuration file zebra.cfg. @@ -1456,15 +1479,15 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci perm.admin: rw passwd: passwordfile - And in the password file + And in the password file passwordfile, you have to specify users and - encrypted passwords as colon separated strings. 
- Use a tool like htpasswd
- to maintain the encrypted passwords.
- 
+ encrypted passwords as colon separated strings.
+ Use a tool like htpasswd
+ to maintain the encrypted passwords.
+ 
admin:secret
 
- It is essential to configure Zebra to store records internally,
+ It is essential to configure &zebra; to store records internally,
and to support modifications and deletion of records:
 
@@ -1472,300 +1495,374 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci
storeData: 1
storeKeys: 1
 
The general record type should be set to any record filter which
- is able to parse XML records, you may use any of the two
+ is able to parse &acro.xml; records; you may use either of the two
declarations (but not both simultaneously!)
 
- 
- recordType: grs.xml
- # recordType: alvis.filter_alvis_config.xml
+ 
+ recordType: dom.filter_dom_conf.xml
+ # recordType: grs.xml
+ 
+ Notice the difference from the specific instructions
+ 
+ recordType.xml: dom.filter_dom_conf.xml
+ # recordType.xml: grs.xml
+ 
+ which only work when indexing XML files from the filesystem using
+ the *.xml naming convention.
+ 
To enable transaction-safe shadow indexing,
which is extra important for this kind of operation, set
 
shadow: directoryname: size (e.g. 1000M)
 
+ See for additional information on
+ these configuration options.
+ 
It is not possible to carry information about record types or
- similar to Zebra when using extended services, due to
- limitations of the Z39.50
+ similar to &zebra; when using extended services, due to
+ limitations of the &acro.z3950;
protocol. Therefore, indexing filters cannot be chosen on a
- per-record basis. One and only one general XML indexing filter
- must be defined.
+ per-record basis. One and only one general &acro.xml; indexing filter
+ must be defined.
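Putting these pieces together, a minimal zebra.cfg sketch enabling extended services could look like the following. This is illustrative only; the password file name, the shadow directory name and size, and the filter configuration file are placeholders to adapt to your installation:

```
perm.anonymous: r
perm.admin: rw
passwd: passwordfile
storeData: 1
storeKeys: 1
recordType: dom.filter_dom_conf.xml
shadow: shadowdir:1000M
```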
- Extended services in the Z39.50 protocol + Extended services in the &acro.z3950; protocol - The Z39.50 standard allows + The &acro.z3950; standard allows servers to accept special binary extended services protocol packages, which may be used to insert, update and delete records into servers. These carry control and update - information to the servers, which are encoded in seven package fields: + information to the servers, which are encoded in seven package fields: - Extended services Z39.50 Package Fields - - + Extended services &acro.z3950; Package Fields + + - Parameter - Value - Notes - + Parameter + Value + Notes + - - - type - 'update' - Must be set to trigger extended services - - - action - string + + + type + 'update' + Must be set to trigger extended services + + + action + string - Extended service action type with + Extended service action type with one of four possible values: recordInsert, recordReplace, recordDelete, and specialUpdate - - - record - XML string - An XML formatted string containing the record - - - syntax - 'xml' - Only XML record syntax is supported - - - recordIdOpaque - string - - Optional client-supplied, opaque record + + + record + &acro.xml; string + An &acro.xml; formatted string containing the record + + + syntax + 'xml' + XML/SUTRS/MARC. GRS-1 not supported. + The default filter (record type) as given by recordType in + zebra.cfg is used to parse the record. + + + recordIdOpaque + string + + Optional client-supplied, opaque record identifier used under insert operations. - - - recordIdNumber - positive number - Zebra's internal system number, only for update - actions. + + + recordIdNumber + positive number + &zebra;'s internal system number, + not allowed for recordInsert or + specialUpdate actions which result in fresh + record inserts. 
- - - databaseName - database identifier + + + databaseName + database identifier - The name of the database to which the extended services should be + The name of the database to which the extended services should be applied. - + - -
+ 
+ 
- 
- The action parameter can be any of
- recordInsert (will fail if the record already exists),
- recordReplace (will fail if the record does not exist),
- recordDelete (will fail if the record does not
- exist), and
- specialUpdate (will insert or update the record
- as needed).
- 
+ 
+ The action parameter can be any of
+ recordInsert (will fail if the record already exists),
+ recordReplace (will fail if the record does not exist),
+ recordDelete (will fail if the record does not
+ exist), and
+ specialUpdate (will insert or update the record
+ as needed; record deletion is not possible with this action).
+ 
 
- During a recordInsert action, the
+ During all actions, the
usual rules for internal record ID generation apply, unless an
- optional recordIdNumber Zebra internal ID or a
- recordIdOpaque string identifier is assigned.
+ optional recordIdNumber &zebra; internal ID or a
+ recordIdOpaque string identifier is assigned.
The default ID generation is configured
using the recordId: directive in
- zebra.cfg.
+ zebra.cfg.
+ See .
 
- 
- The actions recordReplace or
- recordDelete need specification of the additional
- recordIdNumber parameter, which must be an
- existing Zebra internal system ID number, or the optional
- recordIdOpaque string parameter.
+ 
+ Setting of the recordIdNumber parameter,
+ which must be an existing &zebra; internal system ID number, is not
+ allowed during any recordInsert or
+ specialUpdate action resulting in fresh record
+ inserts.
 
When retrieving existing
- records indexed with GRS indexing filters, the Zebra internal
+ records indexed with &acro.grs1; indexing filters, the &zebra; internal
ID number is returned in the field
- /*/id:idzebra/localnumber in the namespace
- xmlns:id="http://www.indexdata.dk/zebra/",
- where it can be picked up for later record updates or deletes.
+ /*/id:idzebra/localnumber in the namespace
+ xmlns:id="http://www.indexdata.dk/zebra/",
+ where it can be picked up for later record updates or deletes.
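For illustration, a retrieved record might then carry the internal ID in a structure shaped like this. This is a schematic example only: the surrounding element name and the ID value are invented, and the exact layout depends on the record and the indexing filter used; only the /*/id:idzebra/localnumber path and the namespace come from the text above:

```
<myrecord xmlns:id="http://www.indexdata.dk/zebra/">
   ...
   <id:idzebra>
      <id:localnumber>17</id:localnumber>
   </id:idzebra>
</myrecord>
```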
+ 
- Records indexed with the alvis filter
- have similar means to discover the internal Zebra ID.
+ A new element set for retrieval of internal record
+ data has been added, which can be used to access minimal records
+ containing only the recordIdNumber &zebra;
+ internal ID, or the recordIdOpaque string
+ identifier. This works for any indexing filter used.
+ See .
 
- 
- 
+ 
+ 
The recordIdOpaque string
parameter is a client-supplied, opaque record
- identifier, which may be used under
+ identifier, which may be used under
insert, update and delete operations. The
client software is responsible for assigning these to
records. This identifier will
replace &zebra;'s own automagic identifier generation with a unique
- mapping from recordIdOpaque to the
- Zebra internal recordIdNumber.
+ mapping from recordIdOpaque to the
+ &zebra; internal recordIdNumber.
The opaque recordIdOpaque string
- identifiers
+ identifiers
are not visible in retrieval records, nor are they searchable, so the
value of this parameter is questionable. It serves mostly as a
convenient mapping from
- application domain string identifiers to Zebra internal ID's.
+ application domain string identifiers to &zebra; internal ID's.
+ 
- 
- 
- Extended services from yaz-client
- 
- We can now start a yaz-client admin session and create a database:
- 
- adm-create
- ]]>
- 
- Now the Default database was created,
- we can insert an XML file (esdd0006.grs
- from example/gils/records) and index it:
- 
- update insert id1234 esdd0006.grs
- ]]>
- 
- The 3rd parameter - id1234 here -
- is the recordIdOpaque package field.
- 
- 
- Actually, we should have a way to specify "no opaque record id" for
- yaz-client's update command.. We'll fix that.
- 
- 
- The newly inserted record can be searched as usual:
- 
- f utah
- Sent searchRequest.
- Received SearchResponse.
- Search was a success.
- Number of hits: 1, setno 1
- SearchResult-1: term=utah cnt=1
- records returned: 0
- Elapsed: 0.014179
- ]]>
- 
- 
- 
- Let's delete the beast, using the same
+ 
+ Extended services from yaz-client
+ 
+ 
+ We can now start a yaz-client admin session and create a database:
+ 
+ adm-create
+ ]]>
+ 
+ Now that the Default database has been created,
+ we can insert an &acro.xml; file (esdd0006.grs
+ from example/gils/records) and index it:
+ 
+ update insert id1234 esdd0006.grs
+ ]]>
+ 
+ The 3rd parameter - id1234 here -
+ is the recordIdOpaque package field.
+ 
+ 
+ Actually, we should have a way to specify "no opaque record id" for
+ yaz-client's update command. We'll fix that.
+ 
+ 
+ The newly inserted record can be searched as usual:
+ 
+ f utah
+ Sent searchRequest.
+ Received SearchResponse.
+ Search was a success.
+ Number of hits: 1, setno 1
+ SearchResult-1: term=utah cnt=1
+ records returned: 0
+ Elapsed: 0.014179
+ ]]>
+ 
+ 
+ 
+ Let's delete the beast, using the same
recordIdOpaque string parameter:
- 
- update delete id1234
- No last record (update ignored)
- Z> update delete 1 esdd0006.grs
- Got extended services response
- Status: done
- Elapsed: 0.072441
- Z> f utah
- Sent searchRequest.
- Received SearchResponse.
- Search was a success.
- Number of hits: 0, setno 2
- SearchResult-1: term=utah cnt=0
- records returned: 0
- Elapsed: 0.013610
- ]]>
+ 
+ update delete id1234
+ No last record (update ignored)
+ Z> update delete 1 esdd0006.grs
+ Got extended services response
+ Status: done
+ Elapsed: 0.072441
+ Z> f utah
+ Sent searchRequest.
+ Received SearchResponse.
+ Search was a success.
+ Number of hits: 0, setno 2
+ SearchResult-1: term=utah cnt=0
+ records returned: 0
+ Elapsed: 0.013610
+ ]]>
 
- If shadow register is enabled in your
- zebra.cfg,
- you must run the adm-commit command
- 
- adm-commit
- ]]>
- 
+ If shadow register is enabled in your
+ zebra.cfg,
+ you must run the adm-commit command
+ 
+ adm-commit
+ ]]>
+ 
after each update session in order to write your changes from the shadow
to the live register space.
 
- 
- 
- Extended services from yaz-php
- 
- Extended services are also available from the YAZ PHP client layer. An
- example of an YAZ-PHP extended service transaction is given here:
- 
- A fine specimen of a record';
- 
- $options = array('action' => 'recordInsert',
- 'syntax' => 'xml',
- 'record' => $record,
- 'databaseName' => 'mydatabase'
- );
- 
- yaz_es($yaz, 'update', $options);
- yaz_es($yaz, 'commit', array());
- yaz_wait();
- 
- if ($error = yaz_error($yaz))
- echo "$error";
- ]]>
- 
- 
- 
+ 
+ Extended services from yaz-php
+ 
+ Extended services are also available from the &yaz; &acro.php; client layer. An
+ example of a &yaz;-&acro.php; extended service transaction is given here:
+ 
+ A fine specimen of a record';
+ 
+ $options = array('action' => 'recordInsert',
+ 'syntax' => 'xml',
+ 'record' => $record,
+ 'databaseName' => 'mydatabase'
+ );
+ 
+ yaz_es($yaz, 'update', $options);
+ yaz_es($yaz, 'commit', array());
+ yaz_wait();
+ 
+ if ($error = yaz_error($yaz))
+ echo "$error";
+ ]]>
+ 
+ 
+ 
+ 
+ Extended services debugging guide
+ 
+ When debugging ES over PHP we recommend the following order of tests:
+ 
+ 
+ 
+ 
+ Make sure you have a nice record on your filesystem, which you can
+ index from the filesystem by use of the zebraidx command.
+ Do it exactly as you planned, using one of the GRS-1 filters,
+ or the DOMXML filter.
+ When this works, proceed.
+ 
+ 
+ 
+ 
+ Check that your server setup is OK before you even write a single
+ line of PHP using ES.
+ Take the same record from the file system, and send it as an ES via
+ yaz-client as described in
+ ,
+ and
+ remember the -a option which tells you what
+ goes over the wire! Notice also the section on permissions:
+ try
+ 
+ perm.anonymous: rw
+ 
+ in zebra.cfg to make sure you do not run into
+ permission problems (but never expose such an insecure setup on the
+ internet!). Then, make sure to set the general
+ recordType instruction, pointing correctly
+ to the GRS-1 filters,
+ or the DOMXML filters.
+ 
+ 
+ 
+ 
+ If you insist on using the sysno in the
+ recordIdNumber setting,
+ please make sure you do only updates and deletes.
&zebra;'s internal
+ system number is not allowed for
+ recordInsert or
+ specialUpdate actions
+ which result in fresh record inserts.
+ 
+ 
+ 
+ 
+ If shadow register is enabled in your
+ zebra.cfg, you must remember to run the
+ 
+ Z> adm-commit
+ 
+ command as well.
+ 
+ 
+ 
+ 
+ If this works, then proceed to do the same thing in your PHP script.
+ 
+ 
+ 
+ 
+ 
+ 
+ 
- -
+