From 972bceaa6386f904bc3e4845f1c5598656c5c6f2 Mon Sep 17 00:00:00 2001 From: Adam Dickmeiss Date: Mon, 24 Jun 2013 10:37:29 +0200 Subject: [PATCH] Documentation re-indent, remove trailing whitespace --- doc/administration.xml | 2894 ++++++++++++++++++++--------------------- doc/architecture.xml | 146 +-- doc/examples.xml | 650 ++++----- doc/field-structure.xml | 2 +- doc/idzebra.xml | 8 +- doc/indexdata.xml | 12 +- doc/installation.xml | 122 +- doc/introduction.xml | 1549 +++++++++++----------- doc/license.xml | 22 +- doc/querymodel.xml | 1154 ++++++++-------- doc/quickstart.xml | 198 +-- doc/recordmodel-alvisxslt.xml | 392 +++--- doc/recordmodel-domxml.xml | 1384 ++++++++++---------- doc/recordmodel-grs.xml | 470 +++---- doc/tutorial.xml | 160 +-- doc/zebraidx.xml | 28 +- doc/zebrasrv.xml | 348 ++--- 17 files changed, 4712 insertions(+), 4827 deletions(-) diff --git a/doc/administration.xml b/doc/administration.xml index 762ba7d..b95db66 100644 --- a/doc/administration.xml +++ b/doc/administration.xml @@ -1,289 +1,289 @@ - - Administrating &zebra; - + + Administrating &zebra; + - - Unlike many simpler retrieval systems, &zebra; supports safe, incremental - updates to an existing index. - - - - Normally, when &zebra; modifies the index it reads a number of records - that you specify. - Depending on your specifications and on the contents of each record - one the following events take place for each record: - - - - Insert - - - The record is indexed as if it never occurred before. - Either the &zebra; system doesn't know how to identify the record or - &zebra; can identify the record but didn't find it to be already indexed. - - - - - Modify - - - The record has already been indexed. - In this case either the contents of the record or the location - (file) of the record indicates that it has been indexed before. - - - - - Delete - - - The record is deleted from the index. As in the - update-case it must be able to identify the record. - - - - - - - - Please note that in both the modify- and delete- case the &zebra; - indexer must be able to generate a unique key that identifies the record - in question (more on this below). - - - - To administrate the &zebra; retrieval system, you run the - zebraidx program. - This program supports a number of options which are preceded by a dash, - and a few commands (not preceded by dash). - - - - Both the &zebra; administrative tool and the &acro.z3950; server share a - set of index files and a global configuration file. - The name of the configuration file defaults to - zebra.cfg. - The configuration file includes specifications on how to index - various kinds of records and where the other configuration files - are located. zebrasrv and zebraidx - must be run in the directory where the - configuration file lives unless you indicate the location of the - configuration file by option -c. - - - - Record Types - - - Indexing is a per-record process, in which either insert/modify/delete - will occur. Before a record is indexed search keys are extracted from - whatever might be the layout the original record (sgml,html,text, etc..). - The &zebra; system currently supports two fundamental types of records: - structured and simple text. - To specify a particular extraction process, use either the - command line option -t or specify a - recordType setting in the configuration file. - - - - - - The &zebra; Configuration File - - - The &zebra; configuration file, read by zebraidx and - zebrasrv defaults to zebra.cfg - unless specified by -c option. 
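+
+  For orientation, a minimal zebra.cfg might look like the
+  following sketch; the path is hypothetical, and each of these settings is
+  explained later in this chapter:
+
+   profilePath: /usr/local/idzebra/tab
+   attset: bib1.att
+   recordType: text
+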
- - - - You can edit the configuration file with a normal text editor. - parameter names and values are separated by colons in the file. Lines - starting with a hash sign (#) are - treated as comments. - - - If you manage different sets of records that share common - characteristics, you can organize the configuration settings for each - type into "groups". - When zebraidx is run and you wish to address a - given group you specify the group name with the -g - option. - In this case settings that have the group name as their prefix - will be used by zebraidx. - If no -g option is specified, the settings - without prefix are used. + Unlike many simpler retrieval systems, &zebra; supports safe, incremental + updates to an existing index. - - - In the configuration file, the group name is placed before the option - name itself, separated by a dot (.). For instance, to set the record type - for group public to grs.sgml - (the &acro.sgml;-like format for structured records) you would write: - - - - - public.recordType: grs.sgml - - - - - To set the default value of the record type to text - write: - - - - - recordType: text - - - - - The available configuration settings are summarized below. They will be - explained further in the following sections. - - - - + + Normally, when &zebra; modifies the index it reads a number of records + that you specify. + Depending on your specifications and on the contents of each record + one the following events take place for each record: - - - - group - .recordType[.name]: - type - - - - Specifies how records with the file extension - name should be handled by the indexer. - This option may also be specified as a command line option - (-t). Note that if you do not specify a - name, the setting applies to all files. - In general, the record type specifier consists of the elements (each - element separated by dot), fundamental-type, - file-read-type and arguments. Currently, two - fundamental types exist, text and - grs. - - - - - group.recordId: - record-id-spec - - - Specifies how the records are to be identified when updated. See - . - - - - - group.database: - database - - - Specifies the &acro.z3950; database name. - - - - - - group.storeKeys: - boolean - - - Specifies whether key information should be saved for a given - group of records. If you plan to update/delete this type of - records later this should be specified as 1; otherwise it - should be 0 (default), to save register space. - - See . - - - - - group.storeData: - boolean - - - Specifies whether the records should be stored internally - in the &zebra; system files. - If you want to maintain the raw records yourself, - this option should be false (0). - If you want &zebra; to take care of the records for you, it - should be true(1). - - - - - - register: register-location - - - Specifies the location of the various register files that &zebra; uses - to represent your databases. - See . - - - - - shadow: register-location - - - Enables the safe update facility of &zebra;, and - tells the system where to place the required, temporary files. - See . - - - - - lockDir: directory - - - Directory in which various lock files are stored. - - - + - keyTmpDir: directory + Insert - Directory in which temporary files used during zebraidx's update - phase are stored. + The record is indexed as if it never occurred before. + Either the &zebra; system doesn't know how to identify the record or + &zebra; can identify the record but didn't find it to be already indexed. 
- setTmpDir: directory + Modify - Specifies the directory that the server uses for temporary result sets. - If not specified /tmp will be used. + The record has already been indexed. + In this case either the contents of the record or the location + (file) of the record indicates that it has been indexed before. - profilePath: path + Delete - Specifies a path of profile specification files. - The path is composed of one or more directories separated by - colon. Similar to PATH for UNIX systems. + The record is deleted from the index. As in the + update-case it must be able to identify the record. + + + + + Please note that in both the modify- and delete- case the &zebra; + indexer must be able to generate a unique key that identifies the record + in question (more on this below). + + + + To administrate the &zebra; retrieval system, you run the + zebraidx program. + This program supports a number of options which are preceded by a dash, + and a few commands (not preceded by dash). + + + + Both the &zebra; administrative tool and the &acro.z3950; server share a + set of index files and a global configuration file. + The name of the configuration file defaults to + zebra.cfg. + The configuration file includes specifications on how to index + various kinds of records and where the other configuration files + are located. zebrasrv and zebraidx + must be run in the directory where the + configuration file lives unless you indicate the location of the + configuration file by option -c. + + + + Record Types + + + Indexing is a per-record process, in which either insert/modify/delete + will occur. Before a record is indexed search keys are extracted from + whatever might be the layout the original record (sgml,html,text, etc..). + The &zebra; system currently supports two fundamental types of records: + structured and simple text. + To specify a particular extraction process, use either the + command line option -t or specify a + recordType setting in the configuration file. + + + + + + The &zebra; Configuration File + + + The &zebra; configuration file, read by zebraidx and + zebrasrv defaults to zebra.cfg + unless specified by -c option. + + + + You can edit the configuration file with a normal text editor. + parameter names and values are separated by colons in the file. Lines + starting with a hash sign (#) are + treated as comments. + + + + If you manage different sets of records that share common + characteristics, you can organize the configuration settings for each + type into "groups". + When zebraidx is run and you wish to address a + given group you specify the group name with the -g + option. + In this case settings that have the group name as their prefix + will be used by zebraidx. + If no -g option is specified, the settings + without prefix are used. + + + + In the configuration file, the group name is placed before the option + name itself, separated by a dot (.). For instance, to set the record type + for group public to grs.sgml + (the &acro.sgml;-like format for structured records) you would write: + + + + + public.recordType: grs.sgml + + + + + To set the default value of the record type to text + write: + + + + + recordType: text + + + + + The available configuration settings are summarized below. They will be + explained further in the following sections. + + + + + + + + + + group + .recordType[.name]: + type + + + + Specifies how records with the file extension + name should be handled by the indexer. + This option may also be specified as a command line option + (-t). 
Note that if you do not specify a + name, the setting applies to all files. + In general, the record type specifier consists of the elements (each + element separated by dot), fundamental-type, + file-read-type and arguments. Currently, two + fundamental types exist, text and + grs. + + + + + group.recordId: + record-id-spec + + + Specifies how the records are to be identified when updated. See + . + + + + + group.database: + database + + + Specifies the &acro.z3950; database name. + + + + + + group.storeKeys: + boolean + + + Specifies whether key information should be saved for a given + group of records. If you plan to update/delete this type of + records later this should be specified as 1; otherwise it + should be 0 (default), to save register space. + + See . + + + + + group.storeData: + boolean + + + Specifies whether the records should be stored internally + in the &zebra; system files. + If you want to maintain the raw records yourself, + this option should be false (0). + If you want &zebra; to take care of the records for you, it + should be true(1). + + + + + + register: register-location + + + Specifies the location of the various register files that &zebra; uses + to represent your databases. + See . + + + + + shadow: register-location + + + Enables the safe update facility of &zebra;, and + tells the system where to place the required, temporary files. + See . + + + + + lockDir: directory + + + Directory in which various lock files are stored. + + + + + keyTmpDir: directory + + + Directory in which temporary files used during zebraidx's update + phase are stored. + + + + + setTmpDir: directory + + + Specifies the directory that the server uses for temporary result sets. + If not specified /tmp will be used. + + + + + profilePath: path + + + Specifies a path of profile specification files. + The path is composed of one or more directories separated by + colon. Similar to PATH for UNIX systems. + + + modulePath: path @@ -315,11 +315,11 @@ sortmax: integer - Specifies the maximum number of records that will be sorted - in a result set. If the result set contains more than - integer records, records after the - limit will not be sorted. If omitted, the default value is - 1,000. + Specifies the maximum number of records that will be sorted + in a result set. If the result set contains more than + integer records, records after the + limit will not be sorted. If omitted, the default value is + 1,000. @@ -355,1093 +355,1003 @@ - - attset: filename - - + + attset: filename + + Specifies the filename(s) of attribute set files for use in searching. In many configurations bib1.att is used, but that is not required. If Classic Explain attributes is to be used for searching, explain.att must be given. - The path to att-files in general can be given using + The path to att-files in general can be given using profilePath setting. See also . - - - - - memMax: size - - - Specifies size of internal memory - to use for the zebraidx program. - The amount is given in megabytes - default is 4 (4 MB). - The more memory, the faster large updates happen, up to about - half the free memory available on the computer. - - - - - tempfiles: Yes/Auto/No - - - Tells zebra if it should use temporary files when indexing. The - default is Auto, in which case zebra uses temporary files only - if it would need more that memMax - megabytes of memory. This should be good for most uses. - - - + + + + + memMax: size + + + Specifies size of internal memory + to use for the zebraidx program. 
+ The amount is given in megabytes - default is 4 (4 MB). + The more memory, the faster large updates happen, up to about + half the free memory available on the computer. + + + + + tempfiles: Yes/Auto/No + + + Tells zebra if it should use temporary files when indexing. The + default is Auto, in which case zebra uses temporary files only + if it would need more that memMax + megabytes of memory. This should be good for most uses. + + + - - root: dir - - - Specifies a directory base for &zebra;. All relative paths - given (in profilePath, register, shadow) are based on this - directory. This setting is useful if your &zebra; server - is running in a different directory from where - zebra.cfg is located. - - - + + root: dir + + + Specifies a directory base for &zebra;. All relative paths + given (in profilePath, register, shadow) are based on this + directory. This setting is useful if your &zebra; server + is running in a different directory from where + zebra.cfg is located. + + + - - passwd: file - - - Specifies a file with description of user accounts for &zebra;. - The format is similar to that known to Apache's htpasswd files - and UNIX' passwd files. Non-empty lines not beginning with - # are considered account lines. There is one account per-line. - A line consists of fields separate by a single colon character. - First field is username, second is password. - - - + + passwd: file + + + Specifies a file with description of user accounts for &zebra;. + The format is similar to that known to Apache's htpasswd files + and UNIX' passwd files. Non-empty lines not beginning with + # are considered account lines. There is one account per-line. + A line consists of fields separate by a single colon character. + First field is username, second is password. + + + - - passwd.c: file - - - Specifies a file with description of user accounts for &zebra;. - File format is similar to that used by the passwd directive except - that the password are encrypted. Use Apache's htpasswd or similar - for maintenance. - - - + + passwd.c: file + + + Specifies a file with description of user accounts for &zebra;. + File format is similar to that used by the passwd directive except + that the password are encrypted. Use Apache's htpasswd or similar + for maintenance. + + + - - perm.user: - permstring - - - Specifies permissions (privilege) for a user that are allowed - to access &zebra; via the passwd system. There are two kinds - of permissions currently: read (r) and write(w). By default - users not listed in a permission directive are given the read - privilege. To specify permissions for a user with no - username, or &acro.z3950; anonymous style use + + perm.user: + permstring + + + Specifies permissions (privilege) for a user that are allowed + to access &zebra; via the passwd system. There are two kinds + of permissions currently: read (r) and write(w). By default + users not listed in a permission directive are given the read + privilege. To specify permissions for a user with no + username, or &acro.z3950; anonymous style use anonymous. The permstring consists of - a sequence of characters. Include character w - for write/update access, r for read access and - a to allow anonymous access through this account. - - - + a sequence of characters. Include character w + for write/update access, r for read access and + a to allow anonymous access through this account. + + + - + dbaccess: accessfile - - Names a file which lists database subscriptions for individual users. 
- The access file should consists of lines of the form - username: dbnames, where dbnames is a list of - database names, separated by '+'. No whitespace is allowed in the - database list. - + + Names a file which lists database subscriptions for individual users. + The access file should consists of lines of the form + username: dbnames, where dbnames is a list of + database names, separated by '+'. No whitespace is allowed in the + database list. + - + - + encoding: charsetname - - Tells &zebra; to interpret the terms in Z39.50 queries as - having been encoded using the specified character - encoding. The default is ISO-8859-1; one - useful alternative is UTF-8. - + + Tells &zebra; to interpret the terms in Z39.50 queries as + having been encoded using the specified character + encoding. The default is ISO-8859-1; one + useful alternative is UTF-8. + - + - + storeKeys: value - - Specifies whether &zebra; keeps a copy of indexed keys. - Use a value of 1 to enable; 0 to disable. If storeKeys setting is - omitted, it is enabled. Enabled storeKeys - are required for updating and deleting records. Disable only - storeKeys to save space and only plan to index data once. - + + Specifies whether &zebra; keeps a copy of indexed keys. + Use a value of 1 to enable; 0 to disable. If storeKeys setting is + omitted, it is enabled. Enabled storeKeys + are required for updating and deleting records. Disable only + storeKeys to save space and only plan to index data once. + - + - + storeData: value - - Specifies whether &zebra; keeps a copy of indexed records. - Use a value of 1 to enable; 0 to disable. If storeData setting is - omitted, it is enabled. A storeData setting of 0 (disabled) makes - Zebra fetch records from the original locaction in the file - system using filename, file offset and file length. For the - DOM and ALVIS filter, the storeData setting is ignored. - + + Specifies whether &zebra; keeps a copy of indexed records. + Use a value of 1 to enable; 0 to disable. If storeData setting is + omitted, it is enabled. A storeData setting of 0 (disabled) makes + Zebra fetch records from the original locaction in the file + system using filename, file offset and file length. For the + DOM and ALVIS filter, the storeData setting is ignored. + - + - - - - - - - Locating Records - - - The default behavior of the &zebra; system is to reference the - records from their original location, i.e. where they were found when you - run zebraidx. - That is, when a client wishes to retrieve a record - following a search operation, the files are accessed from the place - where you originally put them - if you remove the files (without - running zebraidx again, the server will return - diagnostic number 14 (``System error in presenting records'') to - the client. - - - - If your input files are not permanent - for example if you retrieve - your records from an outside source, or if they were temporarily - mounted on a CD-ROM drive, - you may want &zebra; to make an internal copy of them. To do this, - you specify 1 (true) in the storeData setting. When - the &acro.z3950; server retrieves the records they will be read from the - internal file structures of the system. - - - - - - Indexing with no Record IDs (Simple Indexing) - - - If you have a set of records that are not expected to change over time - you may can build your database without record IDs. - This indexing method uses less space than the other methods and - is simple to use. 
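+
+  In practice the whole cycle is short. A sketch, where
+  records is a hypothetical input directory; the
+  init command, if your zebraidx
+  version provides it, wipes the register for a from-scratch rebuild:
+
+   $ zebraidx init
+   $ zebraidx update records
+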
- - - - To use this method, you simply omit the recordId entry - for the group of files that you index. To add a set of records you use - zebraidx with the update command. The - update command will always add all of the records that it - encounters to the index - whether they have already been indexed or - not. If the set of indexed files change, you should delete all of the - index files, and build a new index from scratch. - - - - Consider a system in which you have a group of text files called - simple. - That group of records should belong to a &acro.z3950; database called - textbase. - The following zebra.cfg file will suffice: - - - - - profilePath: /usr/local/idzebra/tab - attset: bib1.att - simple.recordType: text - simple.database: textbase - + + - - - - Since the existing records in an index can not be addressed by their - IDs, it is impossible to delete or modify records when using this method. - - - - - - Indexing with File Record IDs - - - If you have a set of files that regularly change over time: Old files - are deleted, new ones are added, or existing files are modified, you - can benefit from using the file ID - indexing methodology. - Examples of this type of database might include an index of WWW - resources, or a USENET news spool area. - Briefly speaking, the file key methodology uses the directory paths - of the individual records as a unique identifier for each record. - To perform indexing of a directory with file keys, again, you specify - the top-level directory after the update command. - The command will recursively traverse the directories and compare - each one with whatever have been indexed before in that same directory. - If a file is new (not in the previous version of the directory) it - is inserted into the registers; if a file was already indexed and - it has been modified since the last update, the index is also - modified; if a file has been removed since the last - visit, it is deleted from the index. - - - - The resulting system is easy to administrate. To delete a record you - simply have to delete the corresponding file (say, with the - rm command). And to add records you create new - files (or directories with files). For your changes to take effect - in the register you must run zebraidx update with - the same directory root again. This mode of operation requires more - disk space than simpler indexing methods, but it makes it easier for - you to keep the index in sync with a frequently changing set of data. - If you combine this system with the safe update - facility (see below), you never have to take your server off-line for - maintenance or register updating purposes. - - - - To enable indexing with pathname IDs, you must specify - file as the value of recordId - in the configuration file. In addition, you should set - storeKeys to 1, since the &zebra; - indexer must save additional information about the contents of each record - in order to modify the indexes correctly at a later time. - - - + + + + Locating Records - - For example, to update records of group esdd - located below - /data1/records/ you should type: - - $ zebraidx -g esdd update /data1/records - - - - - The corresponding configuration file includes: - - esdd.recordId: file - esdd.recordType: grs.sgml - esdd.storeKeys: 1 - - - - - You cannot start out with a group of records with simple - indexing (no record IDs as in the previous section) and then later - enable file record Ids. 
&zebra; must know from the first time that you - index the group that - the files should be indexed with file record IDs. - - - - - You cannot explicitly delete records when using this method (using the - delete command to zebraidx. Instead - you have to delete the files from the file system (or move them to a - different location) - and then run zebraidx with the - update command. - - - - - - Indexing with General Record IDs - - - When using this method you construct an (almost) arbitrary, internal - record key based on the contents of the record itself and other system - information. If you have a group of records that explicitly associates - an ID with each record, this method is convenient. For example, the - record format may contain a title or a ID-number - unique within the group. - In either case you specify the &acro.z3950; attribute set and use-attribute - location in which this information is stored, and the system looks at - that field to determine the identity of the record. - - - - As before, the record ID is defined by the recordId - setting in the configuration file. The value of the record ID specification - consists of one or more tokens separated by whitespace. The resulting - ID is represented in the index by concatenating the tokens and - separating them by ASCII value (1). - - - - There are three kinds of tokens: - - - - Internal record info - - - The token refers to a key that is - extracted from the record. The syntax of this token is - ( set , - use ), - where set is the - attribute set name use is the - name or value of the attribute. - - - - - System variable - - - The system variables are preceded by - - - $ - - and immediately followed by the system variable name, which - may one of - - - - group - - - Group name. - - - - - database - - - Current database specified. - - - - - type - - - Record type. - - - - - - - - - Constant string - - - A string used as part of the ID — surrounded - by single- or double quotes. - - - - - - - - For instance, the sample GILS records that come with the &zebra; - distribution contain a unique ID in the data tagged Control-Identifier. - The data is mapped to the &acro.bib1; use attribute Identifier-standard - (code 1007). To use this field as a record id, specify - (bib1,Identifier-standard) as the value of the - recordId in the configuration file. - If you have other record types that uses the same field for a - different purpose, you might add the record type - (or group or database name) to the record id of the gils - records as well, to prevent matches with other types of records. - In this case the recordId might be set like this: - - - gils.recordId: $type (bib1,Identifier-standard) - - - - - - (see - for details of how the mapping between elements of your records and - searchable attributes is established). - - - - As for the file record ID case described in the previous section, - updating your system is simply a matter of running - zebraidx - with the update command. However, the update with general - keys is considerably slower than with file record IDs, since all files - visited must be (re)read to discover their IDs. - - - - As you might expect, when using the general record IDs - method, you can only add or modify existing records with the - update command. - If you wish to delete records, you must use the, - delete command, with a directory as a parameter. - This will remove all records that match the files below that root - directory. 
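+
+  For instance, with the gils group configured as above,
+  records whose source files once lived below a now-obsolete directory
+  could be removed with a sketch like the following (the path is
+  hypothetical):
+
+   $ zebraidx -g gils delete /data1/records/obsolete
+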
- - - - - - Register Location - - - Normally, the index files that form dictionaries, inverted - files, record info, etc., are stored in the directory where you run - zebraidx. If you wish to store these, possibly large, - files somewhere else, you must add the register - entry to the zebra.cfg file. - Furthermore, the &zebra; system allows its file - structures to span multiple file systems, which is useful for - managing very large databases. - - - - The value of the register setting is a sequence - of tokens. Each token takes the form: - - dir:size - - The dir specifies a directory in which index files - will be stored and the size specifies the maximum - size of all files in that directory. The &zebra; indexer system fills - each directory in the order specified and use the next specified - directories as needed. - The size is an integer followed by a qualifier - code, - b for bytes, - k for kilobytes. - M for megabytes, - G for gigabytes. - Specifying a negative value disables the checking (it still needs the unit, - use -1b). - - - - For instance, if you have allocated three disks for your register, and - the first disk is mounted - on /d1 and has 2GB of free space, the - second, mounted on /d2 has 3.6 GB, and the third, - on which you have more space than you bother to worry about, mounted on - /d3 you could put this entry in your configuration file: - - - register: /d1:2G /d2:3600M /d3:-1b - - - - - Note that &zebra; does not verify that the amount of space specified is - actually available on the directory (file system) specified - it is - your responsibility to ensure that enough space is available, and that - other applications do not attempt to use the free space. In a large - production system, it is recommended that you allocate one or more - file system exclusively to the &zebra; register files. - - - - - - Safe Updating - Using Shadow Registers - - - Description - - The &zebra; server supports updating of the index - structures. That is, you can add, modify, or remove records from - databases managed by &zebra; without rebuilding the entire index. - Since this process involves modifying structured files with various - references between blocks of data in the files, the update process - is inherently sensitive to system crashes, or to process interruptions: - Anything but a successfully completed update process will leave the - register files in an unknown state, and you will essentially have no - recourse but to re-index everything, or to restore the register files - from a backup medium. - Further, while the update process is active, users cannot be - allowed to access the system, as the contents of the register files - may change unpredictably. + The default behavior of the &zebra; system is to reference the + records from their original location, i.e. where they were found when you + run zebraidx. + That is, when a client wishes to retrieve a record + following a search operation, the files are accessed from the place + where you originally put them - if you remove the files (without + running zebraidx again, the server will return + diagnostic number 14 (``System error in presenting records'') to + the client. - + - You can solve these problems by enabling the shadow register system in - &zebra;. - During the updating procedure, zebraidx will temporarily - write changes to the involved files in a set of "shadow - files", without modifying the files that are accessed by the - active server processes. 
If the update procedure is interrupted by a - system crash or a signal, you simply repeat the procedure - the - register files have not been changed or damaged, and the partially - written shadow files are automatically deleted before the new updating - procedure commences. + If your input files are not permanent - for example if you retrieve + your records from an outside source, or if they were temporarily + mounted on a CD-ROM drive, + you may want &zebra; to make an internal copy of them. To do this, + you specify 1 (true) in the storeData setting. When + the &acro.z3950; server retrieves the records they will be read from the + internal file structures of the system. - + + + + + Indexing with no Record IDs (Simple Indexing) + - At the end of the updating procedure (or in a separate operation, if - you so desire), the system enters a "commit mode". First, - any active server processes are forced to access those blocks that - have been changed from the shadow files rather than from the main - register files; the unmodified blocks are still accessed at their - normal location (the shadow files are not a complete copy of the - register files - they only contain those parts that have actually been - modified). If the commit process is interrupted at any point during the - commit process, the server processes will continue to access the - shadow files until you can repeat the commit procedure and complete - the writing of data to the main register files. You can perform - multiple update operations to the registers before you commit the - changes to the system files, or you can execute the commit operation - at the end of each update operation. When the commit phase has - completed successfully, any running server processes are instructed to - switch their operations to the new, operational register, and the - temporary shadow files are deleted. + If you have a set of records that are not expected to change over time + you may can build your database without record IDs. + This indexing method uses less space than the other methods and + is simple to use. - - - - - How to Use Shadow Register Files - + - The first step is to allocate space on your system for the shadow - files. - You do this by adding a shadow entry to the - zebra.cfg file. - The syntax of the shadow entry is exactly the - same as for the register entry - (see ). - The location of the shadow area should be - different from the location of the main register - area (if you have specified one - remember that if you provide no - register setting, the default register area is the - working directory of the server and indexing processes). + To use this method, you simply omit the recordId entry + for the group of files that you index. To add a set of records you use + zebraidx with the update command. The + update command will always add all of the records that it + encounters to the index - whether they have already been indexed or + not. If the set of indexed files change, you should delete all of the + index files, and build a new index from scratch. - + - The following excerpt from a zebra.cfg file shows - one example of a setup that configures both the main register - location and the shadow file area. - Note that two directories or partitions have been set aside - for the shadow file area. You can specify any number of directories - for each of the file areas, but remember that there should be no - overlaps between the directories used for the main registers and the - shadow files, respectively. 
+ Consider a system in which you have a group of text files called + simple. + That group of records should belong to a &acro.z3950; database called + textbase. + The following zebra.cfg file will suffice: - + - register: /d1:500M - shadow: /scratch1:100M /scratch2:200M + profilePath: /usr/local/idzebra/tab + attset: bib1.att + simple.recordType: text + simple.database: textbase - + - + - When shadow files are enabled, an extra command is available at the - zebraidx command line. - In order to make changes to the system take effect for the - users, you'll have to submit a "commit" command after a - (sequence of) update operation(s). + Since the existing records in an index can not be addressed by their + IDs, it is impossible to delete or modify records when using this method. - + + + + + Indexing with File Record IDs + - - - $ zebraidx update /d1/records - $ zebraidx commit - - + If you have a set of files that regularly change over time: Old files + are deleted, new ones are added, or existing files are modified, you + can benefit from using the file ID + indexing methodology. + Examples of this type of database might include an index of WWW + resources, or a USENET news spool area. + Briefly speaking, the file key methodology uses the directory paths + of the individual records as a unique identifier for each record. + To perform indexing of a directory with file keys, again, you specify + the top-level directory after the update command. + The command will recursively traverse the directories and compare + each one with whatever have been indexed before in that same directory. + If a file is new (not in the previous version of the directory) it + is inserted into the registers; if a file was already indexed and + it has been modified since the last update, the index is also + modified; if a file has been removed since the last + visit, it is deleted from the index. - + - Or you can execute multiple updates before committing the changes: + The resulting system is easy to administrate. To delete a record you + simply have to delete the corresponding file (say, with the + rm command). And to add records you create new + files (or directories with files). For your changes to take effect + in the register you must run zebraidx update with + the same directory root again. This mode of operation requires more + disk space than simpler indexing methods, but it makes it easier for + you to keep the index in sync with a frequently changing set of data. + If you combine this system with the safe update + facility (see below), you never have to take your server off-line for + maintenance or register updating purposes. - + - - - $ zebraidx -g books update /d1/records /d2/more-records - $ zebraidx -g fun update /d3/fun-records - $ zebraidx commit - - + To enable indexing with pathname IDs, you must specify + file as the value of recordId + in the configuration file. In addition, you should set + storeKeys to 1, since the &zebra; + indexer must save additional information about the contents of each record + in order to modify the indexes correctly at a later time. - + + + - If one of the update operations above had been interrupted, the commit - operation on the last line would fail: zebraidx - will not let you commit changes that would destroy the running register. - You'll have to rerun all of the update operations since your last - commit operation, before you can commit the new changes. 
+ For example, to update records of group esdd + located below + /data1/records/ you should type: + + $ zebraidx -g esdd update /data1/records + - + - Similarly, if the commit operation fails, zebraidx - will not let you start a new update operation before you have - successfully repeated the commit operation. - The server processes will keep accessing the shadow files rather - than the (possibly damaged) blocks of the main register files - until the commit operation has successfully completed. + The corresponding configuration file includes: + + esdd.recordId: file + esdd.recordType: grs.sgml + esdd.storeKeys: 1 + - + + + You cannot start out with a group of records with simple + indexing (no record IDs as in the previous section) and then later + enable file record Ids. &zebra; must know from the first time that you + index the group that + the files should be indexed with file record IDs. + + + - You should be aware that update operations may take slightly longer - when the shadow register system is enabled, since more file access - operations are involved. Further, while the disk space required for - the shadow register data is modest for a small update operation, you - may prefer to disable the system if you are adding a very large number - of records to an already very large database (we use the terms - large and modest - very loosely here, since every application will have a - different perception of size). - To update the system without the use of the the shadow files, - simply run zebraidx with the -n - option (note that you do not have to execute the - commit command of zebraidx - when you temporarily disable the use of the shadow registers in - this fashion. - Note also that, just as when the shadow registers are not enabled, - server processes will be barred from accessing the main register - while the update procedure takes place. + You cannot explicitly delete records when using this method (using the + delete command to zebraidx. Instead + you have to delete the files from the file system (or move them to a + different location) + and then run zebraidx with the + update command. - - - - + + + + Indexing with General Record IDs - - Relevance Ranking and Sorting of Result Sets - - - Overview - The default ordering of a result set is left up to the server, - which inside &zebra; means sorting in ascending document ID order. - This is not always the order humans want to browse the sometimes - quite large hit sets. Ranking and sorting comes to the rescue. + When using this method you construct an (almost) arbitrary, internal + record key based on the contents of the record itself and other system + information. If you have a group of records that explicitly associates + an ID with each record, this method is convenient. For example, the + record format may contain a title or a ID-number - unique within the group. + In either case you specify the &acro.z3950; attribute set and use-attribute + location in which this information is stored, and the system looks at + that field to determine the identity of the record. - - In cases where a good presentation ordering can be computed at - indexing time, we can use a fixed static ranking - scheme, which is provided for the alvis - indexing filter. This defines a fixed ordering of hit lists, - independently of the query issued. + + As before, the record ID is defined by the recordId + setting in the configuration file. The value of the record ID specification + consists of one or more tokens separated by whitespace. 
The resulting
+      ID is represented in the index by concatenating the tokens and
+      separating them by ASCII value (1).
+
+
+      There are three kinds of tokens:
+
+
+       Internal record info
+
+        The token refers to a key that is
+        extracted from the record. The syntax of this token is
+        ( set ,
+        use ),
+        where set is the
+        attribute set name and use is the
+        name or value of the attribute.
+
+
+       System variable
+
+        The system variables are preceded by
+
+         $
+
+        and immediately followed by the system variable name, which
+        may be one of
+
+
+         group
+
+          Group name.
+
+
+         database
+
+          Current database specified.
+
+
+         type
+
+          Record type.
+
+
+
+       Constant string
+
+        A string used as part of the ID — surrounded
+        by single or double quotes.
+
+
+
+      For instance, the sample GILS records that come with the &zebra;
+      distribution contain a unique ID in the data tagged Control-Identifier.
+      The data is mapped to the &acro.bib1; use attribute Identifier-standard
+      (code 1007). To use this field as a record ID, specify
+      (bib1,Identifier-standard) as the value of the
+      recordId in the configuration file.
+      If you have other record types that use the same field for a
+      different purpose, you might add the record type
+      (or group or database name) to the record ID of the gils
+      records as well, to prevent matches with other types of records.
+      In this case the recordId might be set like this:
+
+       gils.recordId: $type (bib1,Identifier-standard)
+
+
+      (see
+      for details of how the mapping between elements of your records and
+      searchable attributes is established).
+
+
+      As for the file record ID case described in the previous section,
+      updating your system is simply a matter of running
+      zebraidx
+      with the update command.
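+
+      Putting these pieces together, a zebra.cfg group for
+      such records might look like the following sketch (the
+      gils group and the recordId line
+      are from the example above; the other two lines follow the earlier
+      esdd example, since storeKeys must
+      be enabled for records you intend to update or delete later):
+
+       gils.recordId: $type (bib1,Identifier-standard)
+       gils.recordType: grs.sgml
+       gils.storeKeys: 1
+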
However, the update with general + keys is considerably slower than with file record IDs, since all files + visited must be (re)read to discover their IDs. + - The experimental alvis filter provides a - directive to fetch static rank information out of the indexed &acro.xml; - records, thus making all hit sets ordered - after ascending static - rank, and for those doc's which have the same static rank, ordered - after ascending doc ID. - See for the gory details. + As you might expect, when using the general record IDs + method, you can only add or modify existing records with the + update command. + If you wish to delete records, you must use the, + delete command, with a directory as a parameter. + This will remove all records that match the files below that root + directory. - + + + + Register Location - - Dynamic Ranking - In order to fiddle with the static rank order, it is necessary to - invoke additional re-ranking/re-ordering using dynamic - ranking or score functions. These functions return positive - integer scores, where highest score is - ``best''; - hit sets are sorted according to descending - scores (in contrary - to the index lists which are sorted according to - ascending rank number and document ID). + Normally, the index files that form dictionaries, inverted + files, record info, etc., are stored in the directory where you run + zebraidx. If you wish to store these, possibly large, + files somewhere else, you must add the register + entry to the zebra.cfg file. + Furthermore, the &zebra; system allows its file + structures to span multiple file systems, which is useful for + managing very large databases. + - Dynamic ranking is enabled by a directive like one of the - following in the zebra configuration file (use only one of these a time!): - - rank: rank-1 # default TDF-IDF like - rank: rank-static # dummy do-nothing - + The value of the register setting is a sequence + of tokens. Each token takes the form: + + dir:size + + The dir specifies a directory in which index files + will be stored and the size specifies the maximum + size of all files in that directory. The &zebra; indexer system fills + each directory in the order specified and use the next specified + directories as needed. + The size is an integer followed by a qualifier + code, + b for bytes, + k for kilobytes. + M for megabytes, + G for gigabytes. + Specifying a negative value disables the checking (it still needs the unit, + use -1b). - + - Dynamic ranking is done at query time rather than - indexing time (this is why we - call it ``dynamic ranking'' in the first place ...) - It is invoked by adding - the &acro.bib1; relation attribute with - value ``relevance'' to the &acro.pqf; query (that is, - @attr 2=102, see also - - The &acro.bib1; Attribute Set Semantics, also in - HTML). - To find all articles with the word Eoraptor in - the title, and present them relevance ranked, issue the &acro.pqf; query: + For instance, if you have allocated three disks for your register, and + the first disk is mounted + on /d1 and has 2GB of free space, the + second, mounted on /d2 has 3.6 GB, and the third, + on which you have more space than you bother to worry about, mounted on + /d3 you could put this entry in your configuration file: + - @attr 2=102 @attr 1=4 Eoraptor + register: /d1:2G /d2:3600M /d3:-1b - - Dynamically ranking using &acro.pqf; queries with the 'rank-1' - algorithm - - The default rank-1 ranking module implements a - TF/IDF (Term Frequecy over Inverse Document Frequency) like - algorithm. 
In contrast to the usual definition of TF/IDF - algorithms, which only considers searching in one full-text - index, this one works on multiple indexes at the same time. - More precisely, - &zebra; does boolean queries and searches in specific addressed - indexes (there are inverted indexes pointing from terms in the - dictionary to documents and term positions inside documents). - It works like this: - - - Query Components - - - First, the boolean query is dismantled into its principal components, - i.e. atomic queries where one term is looked up in one index. - For example, the query - - @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer - - is a boolean AND between the atomic parts - - @attr 2=102 @attr 1=1010 Utah - - and - - @attr 2=102 @attr 1=1018 Springer - - which gets processed each for itself. - - - - - - Atomic hit lists - - - Second, for each atomic query, the hit list of documents is - computed. - - - In this example, two hit lists for each index - @attr 1=1010 and - @attr 1=1018 are computed. - - - - - - Atomic scores - - - Third, each document in the hit list is assigned a score (_if_ ranking - is enabled and requested in the query) using a TF/IDF scheme. - - - In this example, both atomic parts of the query assign the magic - @attr 2=102 relevance attribute, and are - to be used in the relevance ranking functions. - - - It is possible to apply dynamic ranking on only parts of the - &acro.pqf; query: - - @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer - - searches for all documents which have the term 'Utah' on the - body of text, and which have the term 'Springer' in the publisher - field, and sort them in the order of the relevance ranking made on - the body-of-text index only. - - - - - - Hit list merging - - - Fourth, the atomic hit lists are merged according to the boolean - conditions to a final hit list of documents to be returned. - - - This step is always performed, independently of the fact that - dynamic ranking is enabled or not. - - - - - - Document score computation - - - Fifth, the total score of a document is computed as a linear - combination of the atomic scores of the atomic hit lists - - - Ranking weights may be used to pass a value to a ranking - algorithm, using the non-standard &acro.bib1; attribute type 9. - This allows one branch of a query to use one value while - another branch uses a different one. For example, we can search - for utah in the - @attr 1=4 index with weight 30, as - well as in the @attr 1=1010 index with weight 20: - - @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city - - - - The default weight is - sqrt(1000) ~ 34 , as the &acro.z3950; standard prescribes that the top score - is 1000 and the bottom score is 0, encoded in integers. - - - - The ranking-weight feature is experimental. It may change in future - releases of zebra. - - - - - - - Re-sorting of hit list - - - Finally, the final hit list is re-ordered according to scores. - - - - - - - - - + Note that &zebra; does not verify that the amount of space specified is + actually available on the directory (file system) specified - it is + your responsibility to ensure that enough space is available, and that + other applications do not attempt to use the free space. In a large + production system, it is recommended that you allocate one or more + file system exclusively to the &zebra; register files. + + + + Safe Updating - Using Shadow Registers + + + Description + + + The &zebra; server supports updating of the index + structures. 
That is, you can add, modify, or remove records from + databases managed by &zebra; without rebuilding the entire index. + Since this process involves modifying structured files with various + references between blocks of data in the files, the update process + is inherently sensitive to system crashes, or to process interruptions: + Anything but a successfully completed update process will leave the + register files in an unknown state, and you will essentially have no + recourse but to re-index everything, or to restore the register files + from a backup medium. + Further, while the update process is active, users cannot be + allowed to access the system, as the contents of the register files + may change unpredictably. + + + + You can solve these problems by enabling the shadow register system in + &zebra;. + During the updating procedure, zebraidx will temporarily + write changes to the involved files in a set of "shadow + files", without modifying the files that are accessed by the + active server processes. If the update procedure is interrupted by a + system crash or a signal, you simply repeat the procedure - the + register files have not been changed or damaged, and the partially + written shadow files are automatically deleted before the new updating + procedure commences. + + + + At the end of the updating procedure (or in a separate operation, if + you so desire), the system enters a "commit mode". First, + any active server processes are forced to access those blocks that + have been changed from the shadow files rather than from the main + register files; the unmodified blocks are still accessed at their + normal location (the shadow files are not a complete copy of the + register files - they only contain those parts that have actually been + modified). If the commit process is interrupted at any point during the + commit process, the server processes will continue to access the + shadow files until you can repeat the commit procedure and complete + the writing of data to the main register files. You can perform + multiple update operations to the registers before you commit the + changes to the system files, or you can execute the commit operation + at the end of each update operation. When the commit phase has + completed successfully, any running server processes are instructed to + switch their operations to the new, operational register, and the + temporary shadow files are deleted. + + + + + + How to Use Shadow Register Files + + + The first step is to allocate space on your system for the shadow + files. + You do this by adding a shadow entry to the + zebra.cfg file. + The syntax of the shadow entry is exactly the + same as for the register entry + (see ). + The location of the shadow area should be + different from the location of the main register + area (if you have specified one - remember that if you provide no + register setting, the default register area is the + working directory of the server and indexing processes). + + + + The following excerpt from a zebra.cfg file shows + one example of a setup that configures both the main register + location and the shadow file area. + Note that two directories or partitions have been set aside + for the shadow file area. You can specify any number of directories + for each of the file areas, but remember that there should be no + overlaps between the directories used for the main registers and the + shadow files, respectively. 
+
+
+    register: /d1:500M
+    shadow: /scratch1:100M /scratch2:200M
+
+
+
+   When shadow files are enabled, an extra command is available at the
+   zebraidx command line.
+   In order to make changes to the system take effect for the
+   users, you'll have to submit a "commit" command after a
+   (sequence of) update operation(s).
+
+
+    $ zebraidx update /d1/records
+    $ zebraidx commit
+
+
+
+   Or you can execute multiple updates before committing the changes:
+
+
+    $ zebraidx -g books update /d1/records /d2/more-records
+    $ zebraidx -g fun update /d3/fun-records
+    $ zebraidx commit
+
+
+
+   If one of the update operations above had been interrupted, the commit
+   operation on the last line would fail: zebraidx
+   will not let you commit changes that would destroy the running register.
+   You'll have to rerun all of the update operations since your last
+   commit operation, before you can commit the new changes.
+
+
+   Similarly, if the commit operation fails, zebraidx
+   will not let you start a new update operation before you have
+   successfully repeated the commit operation.
+   The server processes will keep accessing the shadow files rather
+   than the (possibly damaged) blocks of the main register files
+   until the commit operation has successfully completed.
+
+
+   You should be aware that update operations may take slightly longer
+   when the shadow register system is enabled, since more file access
+   operations are involved. Further, while the disk space required for
+   the shadow register data is modest for a small update operation, you
+   may prefer to disable the system if you are adding a very large number
+   of records to an already very large database (we use the terms
+   large and modest
+   very loosely here, since every application will have a
+   different perception of size).
+   To update the system without the use of the shadow files,
+   simply run zebraidx with the -n
+   option (note that you do not have to execute the
+   commit command of zebraidx
+   when you temporarily disable the use of the shadow registers in
+   this fashion).
+   Note also that, just as when the shadow registers are not enabled,
+   server processes will be barred from accessing the main register
+   while the update procedure takes place.
+
+
+
+
+  Relevance Ranking and Sorting of Result Sets
+
+
+   Overview
+
+    The default ordering of a result set is left up to the server,
+    which inside &zebra; means sorting in ascending document ID order.
+    This is not always the order in which humans want to browse
+    sometimes quite large hit sets. Ranking and sorting come to the rescue.
+
+
+    In cases where a good presentation ordering can be computed at
+    indexing time, we can use a fixed static ranking
+    scheme, which is provided for the alvis
+    indexing filter. This defines a fixed ordering of hit lists,
+    independently of the query issued.
+
+
+    There are cases, however, where the relevance of hit set documents is
+    highly dependent on the query processed.
+    Simply put, dynamic relevance ranking
+    sorts a set of retrieved records such that those most likely to be
+    relevant to your request are retrieved first.
+    Internally, &zebra; retrieves all documents that satisfy your
+    query, and re-orders the hit list to arrange them based on
+    a measurement of similarity between your query and the content of
+    each record.
   Finally, there are situations where hit sets of documents should be
   sorted during query time according to the
   lexicographical ordering of certain sort indexes created at
   indexing time.

  Static Ranking

   &zebra; uses internally inverted indexes to look up term frequencies
   in documents. Multiple queries from different indexes can be
   combined by the binary boolean operations AND,
   OR and/or NOT (which
   is in fact a binary AND NOT operation).
   To ensure fast query execution
   speed, all indexes have to be sorted in the same order.

   The indexes are normally sorted according to document ID in
   ascending order, and any query which does not invoke a special
   re-ranking function will therefore retrieve the result set in
   document ID order.

   If one defines the

    staticrank: 1

   directive in the main core &zebra; configuration file, the internal
   document keys used for ordering are augmented by a preceding integer,
   which contains the static rank of a given document, and the index
   lists are ordered first by ascending static rank, then by ascending
   document ID. Zero is the ``best'' rank, as it occurs at the
   beginning of the list; higher numbers represent worse scores.

   The experimental alvis filter provides a
   directive to fetch static rank information out of the indexed &acro.xml;
   records, thus making all hit sets ordered by ascending static
   rank and, for those documents which have the same static rank,
   by ascending document ID.
   See for the gory details.

  Dynamic Ranking

   In order to fiddle with the static rank order, it is necessary to
   invoke additional re-ranking/re-ordering using dynamic
   ranking or score functions. These functions return positive
   integer scores, where the highest score is ``best'';
   hit sets are sorted according to descending
   scores (contrary to the index lists, which are sorted according to
   ascending rank number and document ID).

   Dynamic ranking is enabled by a directive like one of the
   following in the zebra configuration file (use only one of these at
   a time!):

    rank: rank-1        # default TF-IDF like
    rank: rank-static   # dummy do-nothing

   Dynamic ranking is done at query time rather than
   indexing time (this is why we
   call it ``dynamic ranking'' in the first place ...).
   It is invoked by adding
   the &acro.bib1; relation attribute with
   value ``relevance'' to the &acro.pqf; query (that is,
   @attr 2=102, see also
   The &acro.bib1; Attribute Set Semantics, also in
   HTML).
   To find all articles with the word Eoraptor in
   the title, and present them relevance ranked, issue the &acro.pqf; query:

    @attr 2=102 @attr 1=4 Eoraptor

   The default rank-1 ranking module implements a
   TF/IDF (Term Frequency over Inverse Document Frequency) like
   algorithm. In contrast to the usual definition of TF/IDF
   algorithms, which only consider searching in one full-text
   index, this one works on multiple indexes at the same time.
   More precisely,
   &zebra; does boolean queries and searches in specific addressed
   indexes (there are inverted indexes pointing from terms in the
   dictionary to documents and term positions inside documents).
   It works like this:

   Query Components

    First, the boolean query is dismantled into its principal components,
    i.e.
atomic queries, where one term is looked up in one index.
    For example, the query

     @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer

    is a boolean AND between the atomic parts

     @attr 2=102 @attr 1=1010 Utah

    and

     @attr 2=102 @attr 1=1018 Springer

    each of which is processed by itself.

   Atomic hit lists

    Second, for each atomic query, the hit list of documents is
    computed.

    In this example, two hit lists for each index
    @attr 1=1010 and
    @attr 1=1018 are computed.

   Atomic scores

    Third, each document in the hit list is assigned a score (if ranking
    is enabled and requested in the query) using a TF/IDF scheme.

    In this example, both atomic parts of the query assign the magic
    @attr 2=102 relevance attribute, and are
    thus used in the relevance ranking functions.

    It is possible to apply dynamic ranking on only parts of the
    &acro.pqf; query:

     @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer

    searches for all documents which have the term 'Utah' in the
    body of text, and which have the term 'Springer' in the publisher
    field, and sorts them in the order of the relevance ranking made on
    the body-of-text index only.

   Hit list merging

    Fourth, the atomic hit lists are merged according to the boolean
    conditions to a final hit list of documents to be returned.

    This step is always performed, regardless of whether dynamic
    ranking is enabled or not.

   Document score computation

    Fifth, the total score of a document is computed as a linear
    combination of the atomic scores of the atomic hit lists.

    Ranking weights may be used to pass a value to a ranking
    algorithm, using the non-standard &acro.bib1; attribute type 9.
    This allows one branch of a query to use one value while
    another branch uses a different one. For example, we can search
    for utah in the
    @attr 1=4 index with weight 30, as
    well as in the @attr 1=1010 index with weight 20:

     @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city

    The default weight is
    sqrt(1000) ~ 34, as the &acro.z3950; standard prescribes that the top
    score is 1000 and the bottom score is 0, encoded in integers.

    The ranking-weight feature is experimental. It may change in future
    releases of zebra.

   Re-sorting of hit list

    Finally, the final hit list is re-ordered according to scores.

    The rank-1 algorithm
    does not use the static rank
    information in the list keys, and will produce the same ordering
    with or without static ranking enabled.

    Dynamic ranking is not compatible
    with estimated hit sizes, as all documents in
    a hit set must be accessed to compute the correct placing in a
    ranking sorted list. Therefore the use attribute setting
    @attr 2=102 clashes with
    @attr 9=integer.
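    Pulling the above together, here is a minimal sketch of a complete
    dynamic ranking session; the search terms and weights are
    illustrative only, and assume a zebra.cfg with
    the rank-1 module enabled as shown earlier:

     rank: rank-1

     Z> find @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 utah
     Z> show 1+10

    The find request computes scores as described in the five steps
    above, and show then presents the records in
    descending score order, best matches first.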

@@ -1454,7 +1364,7 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci
     relationModifier.relevant = 2=102

-    invokes dynamic ranking each time a &acro.cql; query of the form
+    invokes dynamic ranking each time a &acro.cql; query of the form

     Z> querytype cql
     Z> f alvis.text =/relevant house

@@ -1464,90 +1374,90 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci
     index.alvis.text = 1=text 2=102

-    which then invokes dynamic ranking each time a &acro.cql; query of the form
+    which then invokes dynamic ranking each time a &acro.cql; query of the form

     Z> querytype cql
     Z> f alvis.text = house

     is issued.

  Sorting

   &zebra; sorts efficiently using special sorting indexes (type=s),
   so each sortable index must be known at indexing time, specified
   in the configuration of record indexing.
   For example, to enable sorting according to the &acro.bib1;
   Date/time-added-to-db field, one could add the line

    xelm /*/@created               Date/time-added-to-db:s

   to any .abs record-indexing configuration file.
   Similarly, one could add an indexing element of the form

    ]]>

   to any alvis-filter indexing stylesheet.

   Sorting can be specified at searching time using a query term
   carrying the non-standard
   &acro.bib1; attribute-type 7. This removes the
   need to send a &acro.z3950; Sort Request
   separately, and can dramatically improve latency when the client
   and server are on separate networks.
   The sorting part of the query is separate from the rest of the
   query - the actual search specification - and must be combined
   with it using OR.

   A sorting subquery needs two attributes: an index (such as a
   &acro.bib1; type-1 attribute) specifying which index to sort on, and a
   type-7 attribute whose value is 1 for
   ascending sorting, or 2 for descending. The
   term associated with the sorting attribute is the priority of
   the sort key, where 0 specifies the primary
   sort key, 1 the secondary sort key, and so
   on.
+ For example, a search for water, sort by title (ascending), - is expressed by the &acro.pqf; query + is expressed by the &acro.pqf; query - @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 + @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 - whereas a search for water, sort by title ascending, + whereas a search for water, sort by title ascending, then date descending would be - @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1 + @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1 Notice the fundamental differences between dynamic - ranking and sorting: there can be + ranking and sorting: there can be only one ranking function defined and configured; but multiple sorting indexes can be specified dynamically at search time. Ranking does not need to use specific indexes, so dynamic ranking can be enabled and disabled without re-indexing; whereas, sorting indexes need to be defined before indexing. - + + + - + - + + Extended Services: Remote Insert, Update and Delete - - Extended Services: Remote Insert, Update and Delete - Extended services are only supported when accessing the &zebra; @@ -1556,8 +1466,8 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci not support extended services. - - + + The extended services are not enabled by default in zebra - due to the fact that they modify the system. &zebra; can be configured to allow anybody to @@ -1569,15 +1479,15 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci perm.admin: rw passwd: passwordfile - And in the password file + And in the password file passwordfile, you have to specify users and - encrypted passwords as colon separated strings. - Use a tool like htpasswd - to maintain the encrypted passwords. - + encrypted passwords as colon separated strings. + Use a tool like htpasswd + to maintain the encrypted passwords. + admin:secret - It is essential to configure &zebra; to store records internally, + It is essential to configure &zebra; to store records internally, and to support modifications and deletion of records: @@ -1587,15 +1497,15 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci The general record type should be set to any record filter which is able to parse &acro.xml; records, you may use any of the two declarations (but not both simultaneously!) - + recordType: dom.filter_dom_conf.xml # recordType: grs.xml Notice the difference to the specific instructions - + recordType.xml: dom.filter_dom_conf.xml # recordType.xml: grs.xml - + which only work when indexing XML files from the filesystem using the *.xml naming convention. @@ -1605,8 +1515,8 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci shadow: directoryname: size (e.g. 1000M) - See for additional information on - these configuration options. + See for additional information on + these configuration options. @@ -1615,14 +1525,14 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci limitations of the &acro.z3950; protocol. Therefore, indexing filters can not be chosen on a per-record basis. One and only one general &acro.xml; indexing filter - must be defined. + must be defined. @@ -1636,41 +1546,41 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci servers to accept special binary extended services protocol packages, which may be used to insert, update and delete records into servers. 
These carry control and update - information to the servers, which are encoded in seven package fields: + information to the servers, which are encoded in seven package fields: Extended services &acro.z3950; Package Fields - - + + - Parameter - Value - Notes - + Parameter + Value + Notes + - - - type - 'update' - Must be set to trigger extended services - - - action - string + + + type + 'update' + Must be set to trigger extended services + + + action + string - Extended service action type with + Extended service action type with one of four possible values: recordInsert, recordReplace, recordDelete, and specialUpdate - - - record - &acro.xml; string - An &acro.xml; formatted string containing the record - + + + record + &acro.xml; string + An &acro.xml; formatted string containing the record + syntax 'xml' @@ -1678,74 +1588,74 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci The default filter (record type) as given by recordType in zebra.cfg is used to parse the record. - - recordIdOpaque - string - + + recordIdOpaque + string + Optional client-supplied, opaque record identifier used under insert operations. - - - recordIdNumber - positive number - &zebra;'s internal system number, - not allowed for recordInsert or + + + recordIdNumber + positive number + &zebra;'s internal system number, + not allowed for recordInsert or specialUpdate actions which result in fresh record inserts. - - - databaseName - database identifier + + + databaseName + database identifier - The name of the database to which the extended services should be + The name of the database to which the extended services should be applied. - + - -
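   Note that the record field of such a package is simply a serialized
   &acro.xml; document. A minimal hypothetical payload - the element names
   are invented for this example and carry no special meaning to
   &zebra; - might look like:

    <record>
     <title>An example title</title>
     <creator>An example creator</creator>
    </record>

   It is parsed on arrival by the default filter given by the
   recordType setting in zebra.cfg.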
+ + - - The action parameter can be any of - recordInsert (will fail if the record already exists), - recordReplace (will fail if the record does not exist), - recordDelete (will fail if the record does not - exist), and - specialUpdate (will insert or update the record - as needed, record deletion is not possible). - + + The action parameter can be any of + recordInsert (will fail if the record already exists), + recordReplace (will fail if the record does not exist), + recordDelete (will fail if the record does not + exist), and + specialUpdate (will insert or update the record + as needed, record deletion is not possible). + During all actions, the usual rules for internal record ID generation apply, unless an optional recordIdNumber &zebra; internal ID or a - recordIdOpaque string identifier is assigned. + recordIdOpaque string identifier is assigned. The default ID generation is configured using the recordId: from - zebra.cfg. - See . + zebra.cfg. + See . - - Setting of the recordIdNumber parameter, - which must be an existing &zebra; internal system ID number, is not - allowed during any recordInsert or + + Setting of the recordIdNumber parameter, + which must be an existing &zebra; internal system ID number, is not + allowed during any recordInsert or specialUpdate action resulting in fresh record - inserts. + inserts. When retrieving existing - records indexed with &acro.grs1; indexing filters, the &zebra; internal + records indexed with &acro.grs1; indexing filters, the &zebra; internal ID number is returned in the field - /*/id:idzebra/localnumber in the namespace - xmlns:id="http://www.indexdata.dk/zebra/", - where it can be picked up for later record updates or deletes. + /*/id:idzebra/localnumber in the namespace + xmlns:id="http://www.indexdata.dk/zebra/", + where it can be picked up for later record updates or deletes. - + A new element set for retrieval of internal record data has been added, which can be used to access minimal records @@ -1755,131 +1665,131 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci See . - + The recordIdOpaque string parameter is an client-supplied, opaque record - identifier, which may be used under + identifier, which may be used under insert, update and delete operations. The client software is responsible for assigning these to records. This identifier will replace zebra's own automagic identifier generation with a unique - mapping from recordIdOpaque to the + mapping from recordIdOpaque to the &zebra; internal recordIdNumber. The opaque recordIdOpaque string - identifiers + identifiers are not visible in retrieval records, nor are searchable, so the value of this parameter is questionable. It serves mostly as a convenient mapping from application domain string identifiers to &zebra; internal ID's. - + - - - Extended services from yaz-client - - We can now start a yaz-client admin session and create a database: - - adm-create - ]]> - - Now the Default database was created, - we can insert an &acro.xml; file (esdd0006.grs - from example/gils/records) and index it: - - update insert id1234 esdd0006.grs - ]]> - - The 3rd parameter - id1234 here - - is the recordIdOpaque package field. - - - Actually, we should have a way to specify "no opaque record id" for - yaz-client's update command.. We'll fix that. - - - The newly inserted record can be searched as usual: - - f utah - Sent searchRequest. - Received SearchResponse. - Search was a success. 
- Number of hits: 1, setno 1 - SearchResult-1: term=utah cnt=1 - records returned: 0 - Elapsed: 0.014179 - ]]> - - - - Let's delete the beast, using the same + + Extended services from yaz-client + + + We can now start a yaz-client admin session and create a database: + + adm-create + ]]> + + Now the Default database was created, + we can insert an &acro.xml; file (esdd0006.grs + from example/gils/records) and index it: + + update insert id1234 esdd0006.grs + ]]> + + The 3rd parameter - id1234 here - + is the recordIdOpaque package field. + + + Actually, we should have a way to specify "no opaque record id" for + yaz-client's update command.. We'll fix that. + + + The newly inserted record can be searched as usual: + + f utah + Sent searchRequest. + Received SearchResponse. + Search was a success. + Number of hits: 1, setno 1 + SearchResult-1: term=utah cnt=1 + records returned: 0 + Elapsed: 0.014179 + ]]> + + + + Let's delete the beast, using the same recordIdOpaque string parameter: - - update delete id1234 - No last record (update ignored) - Z> update delete 1 esdd0006.grs - Got extended services response - Status: done - Elapsed: 0.072441 - Z> f utah - Sent searchRequest. - Received SearchResponse. - Search was a success. - Number of hits: 0, setno 2 - SearchResult-1: term=utah cnt=0 - records returned: 0 - Elapsed: 0.013610 - ]]> + + update delete id1234 + No last record (update ignored) + Z> update delete 1 esdd0006.grs + Got extended services response + Status: done + Elapsed: 0.072441 + Z> f utah + Sent searchRequest. + Received SearchResponse. + Search was a success. + Number of hits: 0, setno 2 + SearchResult-1: term=utah cnt=0 + records returned: 0 + Elapsed: 0.013610 + ]]> - If shadow register is enabled in your - zebra.cfg, - you must run the adm-commit command - - adm-commit - ]]> - + If shadow register is enabled in your + zebra.cfg, + you must run the adm-commit command + + adm-commit + ]]> + after each update session in order write your changes from the shadow to the life register space. - - + + - - - Extended services from yaz-php - - Extended services are also available from the &yaz; &acro.php; client layer. An - example of an &yaz;-&acro.php; extended service transaction is given here: - - A fine specimen of a record'; - - $options = array('action' => 'recordInsert', - 'syntax' => 'xml', - 'record' => $record, - 'databaseName' => 'mydatabase' - ); - - yaz_es($yaz, 'update', $options); - yaz_es($yaz, 'commit', array()); - yaz_wait(); - - if ($error = yaz_error($yaz)) - echo "$error"; - ]]> - + + Extended services from yaz-php + + + Extended services are also available from the &yaz; &acro.php; client layer. An + example of an &yaz;-&acro.php; extended service transaction is given here: + + A fine specimen of a record'; + + $options = array('action' => 'recordInsert', + 'syntax' => 'xml', + 'record' => $record, + 'databaseName' => 'mydatabase' + ); + + yaz_es($yaz, 'update', $options); + yaz_es($yaz, 'commit', array()); + yaz_wait(); + + if ($error = yaz_error($yaz)) + echo "$error"; + ]]> + - + Extended services debugging guide @@ -1890,29 +1800,29 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci - Make sure you have a nice record on your filesystem, which you can + Make sure you have a nice record on your filesystem, which you can index from the filesystem by use of the zebraidx command. Do it exactly as you planned, using one of the GRS-1 filters, - or the DOMXML filter. + or the DOMXML filter. When this works, proceed. 
- Check that your server setup is OK before you even coded one single + Check that your server setup is OK before you even coded one single line PHP using ES. - Take the same record form the file system, and send as ES via + Take the same record form the file system, and send as ES via yaz-client like described in , and remember the -a option which tells you what goes over the wire! Notice also the section on permissions: - try + try perm.anonymous: rw - in zebra.cfg to make sure you do not run into - permission problems (but never expose such an insecure setup on the + in zebra.cfg to make sure you do not run into + permission problems (but never expose such an insecure setup on the internet!!!). Then, make sure to set the general recordType instruction, pointing correctly to the GRS-1 filters, @@ -1921,19 +1831,19 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci - If you insist on using the sysno in the - recordIdNumber setting, - please make sure you do only updates and deletes. Zebra's internal + If you insist on using the sysno in the + recordIdNumber setting, + please make sure you do only updates and deletes. Zebra's internal system number is not allowed for - recordInsert or - specialUpdate actions + recordInsert or + specialUpdate actions which result in fresh record inserts. - If shadow register is enabled in your - zebra.cfg, you must remember running the + If shadow register is enabled in your + zebra.cfg, you must remember running the Z> adm-commit @@ -1950,9 +1860,9 @@ where g = rset_count(terms[i]->rset) is the count of all documents in this speci -
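   As a summary of this checklist, a zebra.cfg for a
   local extended services test might combine the directives discussed
   above as follows; the filter configuration file, shadow location and
   size are placeholders to be adapted to your installation:

    # illustrative test setup - adapt paths and sizes
    recordType: dom.filter_dom_conf.xml   # one general XML indexing filter
    storeKeys: 1                          # keep keys so records can be updated/deleted
    storeData: 1                          # store the records internally
    shadow: shadow:500M                   # remember adm-commit after each session
    perm.anonymous: rw                    # testing only - never expose this setup publicly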
+
-
+
- - + + + --> @@ -66,7 +66,7 @@ &chap-recordmodel-alvisxslt; &chap-recordmodel-grs; &chap-field-structure; - + Reference @@ -77,11 +77,11 @@ &manref; - + &app-license; &gpl2; &app-indexdata; - + +
+ Overview + + &zebra; is a free, fast, friendly information management system. It can + index records in &acro.xml;/&acro.sgml;, &acro.marc;, e-mail archives and many other + formats, and quickly find them using a combination of boolean + searching and relevance ranking. Search-and-retrieve applications can + be written using &acro.api;s in a wide variety of languages, communicating + with the &zebra; server using industry-standard information-retrieval + protocols or web services. + + + &zebra; is licensed Open Source, and can be + deployed by anyone for any purpose without license fees. The C source + code is open to anybody to read and change under the GPL license. + + + &zebra; is a networked component which acts as a + reliable &acro.z3950; server + for both record/document search, presentation, insert, update and + delete operations. In addition, it understands the &acro.sru; family of + webservices, which exist in &acro.rest; &acro.get;/&acro.post; and truly + &acro.soap; flavors. + + + &zebra; is available as MS Windows 2003 Server (32 bit) self-extracting + package as well as GNU/Debian Linux (32 bit and 64 bit) precompiled + packages. It has been deployed successfully on other Unix systems, + including Sun Sparc, HP Unix, and many variants of Linux and BSD + based systems. + + + http://www.indexdata.com/zebra/ + http://ftp.indexdata.dk/pub/zebra/win32/ + http://ftp.indexdata.dk/pub/zebra/debian/ + + + + &zebra; + is a high-performance, general-purpose structured text + indexing and retrieval engine. It reads records in a + variety of input formats (e.g. email, &acro.xml;, &acro.marc;) and provides access + to them through a powerful combination of boolean search + expressions and relevance-ranked free-text queries. + + + &zebra; supports large databases (tens of millions of records, + tens of gigabytes of data). It allows safe, incremental + database updates on live systems. Because &zebra; supports + the industry-standard information retrieval protocol, &acro.z3950;, + you can search &zebra; databases using an enormous variety of + programs and toolkits, both commercial and free, which understand + this protocol. Application libraries are available to allow + bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual + Basic, Python, &acro.php; and more - see the + &acro.zoom; web site + for more information on some of these client toolkits. + + + + This document is an introduction to the &zebra; system. It explains + how to compile the software, how to prepare your first database, + and how to configure the server to give you the + functionality that you need. + +
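   As a small foretaste of the chapters that follow, bringing up a
   first database usually amounts to two commands; the record directory
   and port number below are placeholders:

    $ zebraidx update /data/records
    $ zebrasrv @:9999

   after which any &acro.z3950; client, such as yaz-client, can connect
   to port 9999 and start searching.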
+ +
+ &zebra; Features Overview
&zebra; Document Model - - &zebra; document model - - - - - - - - Feature - Availability - Notes - Reference - - - - - Complex semi-structured Documents - &acro.xml; and &acro.grs1; Documents - Both &acro.xml; and &acro.grs1; documents exhibit a &acro.dom; like internal - representation allowing for complex indexing and display rules - and - - - - Input document formats - &acro.xml;, &acro.sgml;, Text, ISO2709 (&acro.marc;) - - A system of input filters driven by - regular expressions allows most ASCII-based - data formats to be easily processed. - &acro.sgml;, &acro.xml;, ISO2709 (&acro.marc;), and raw text are also - supported. - - - - Document storage - Index-only, Key storage, Document storage - Data can be, and usually is, imported - into &zebra;'s own storage, but &zebra; can also refer to - external files, building and maintaining indexes of "live" - collections. - - - - - -
+ + &zebra; document model + + + + + + + + Feature + Availability + Notes + Reference + + + + + Complex semi-structured Documents + &acro.xml; and &acro.grs1; Documents + Both &acro.xml; and &acro.grs1; documents exhibit a &acro.dom; like internal + representation allowing for complex indexing and display rules + and + + + + Input document formats + &acro.xml;, &acro.sgml;, Text, ISO2709 (&acro.marc;) + + A system of input filters driven by + regular expressions allows most ASCII-based + data formats to be easily processed. + &acro.sgml;, &acro.xml;, ISO2709 (&acro.marc;), and raw text are also + supported. + + + + Document storage + Index-only, Key storage, Document storage + Data can be, and usually is, imported + into &zebra;'s own storage, but &zebra; can also refer to + external files, building and maintaining indexes of "live" + collections. + + + + + +
&zebra; Index Scanning - - &zebra; index scanning - - - - - - - - Feature - Availability - Notes - Reference - - - - - Scan - term suggestions - Scan on a given named index returns all the - indexed terms in lexicographical order near the given start - term. This can be used to create drop-down menus and search - suggestions. - and - - - - - Facetted browsing +
+ &zebra; index scanning + + + + + + + + Feature + Availability + Notes + Reference + + + + + Scan + term suggestions + Scan on a given named index returns all the + indexed terms in lexicographical order near the given start + term. This can be used to create drop-down menus and search + suggestions. + and + + + + + Facetted browsing available Zebra 2.1 and allows retrieval of facets for a result set. - - - - - Drill-down or refine-search - partially - scanning in result sets can be used to implement - drill-down in search clients - - - - -
+ + + + + Drill-down or refine-search + partially + scanning in result sets can be used to implement + drill-down in search clients + + + + +
&zebra; Document Presentation - - &zebra; document presentation - - - - - - - - Feature - Availability - Notes - Reference - - - - - Hit count - yes - Search results include at any time the total hit count of a given - query, either exact computed, or approximative, in case that the - hit count exceeds a possible pre-defined hit set truncation - level. - - and - - - - - Paged result sets - yes - Paging of search requests and present/display request - can return any successive number of records from any start - position in the hit set, i.e. it is trivial to provide search - results in successive pages of any size. - - - - &acro.xml; document transformations - &acro.xslt; based - Record presentation can be performed in many - pre-defined &acro.xml; data - formats, where the original &acro.xml; records are on-the-fly transformed - through any preconfigured &acro.xslt; transformation. It is therefore - trivial to present records in short/full &acro.xml; views, transforming to - RSS, Dublin Core, or other &acro.xml; based data formats, or transform - records to XHTML snippets ready for inserting in XHTML pages. - - - - - Binary record transformations - &acro.marc;, &acro.usmarc;, &acro.marc21; and &acro.marcxml; - post-filter record transformations - - - - Record Syntaxes - - Multiple record syntaxes - for data retrieval: &acro.grs1;, &acro.sutrs;, - &acro.xml;, ISO2709 (&acro.marc;), etc. Records can be mapped between - record syntaxes and schemas on the fly. - - - - &zebra; internal metadata - yes - &zebra; internal document metadata can be fetched in - &acro.sutrs; and &acro.xml; record syntaxes. Those are useful in client - applications. - - - - &zebra; internal raw record data - yes - &zebra; internal raw, binary record data can be fetched in - &acro.sutrs; and &acro.xml; record syntaxes, leveraging %zebra; to a - binary storage system - - - - &zebra; internal record field data - yes - &zebra; internal record field data can be fetched in - &acro.sutrs; and &acro.xml; record syntaxes. This makes very fast minimal - record data displays possible. - - - - -
+ + &zebra; document presentation + + + + + + + + Feature + Availability + Notes + Reference + + + + + Hit count + yes + Search results include at any time the total hit count of a given + query, either exact computed, or approximative, in case that the + hit count exceeds a possible pre-defined hit set truncation + level. + + and + + + + + Paged result sets + yes + Paging of search requests and present/display request + can return any successive number of records from any start + position in the hit set, i.e. it is trivial to provide search + results in successive pages of any size. + + + + &acro.xml; document transformations + &acro.xslt; based + Record presentation can be performed in many + pre-defined &acro.xml; data + formats, where the original &acro.xml; records are on-the-fly transformed + through any preconfigured &acro.xslt; transformation. It is therefore + trivial to present records in short/full &acro.xml; views, transforming to + RSS, Dublin Core, or other &acro.xml; based data formats, or transform + records to XHTML snippets ready for inserting in XHTML pages. + + + + + Binary record transformations + &acro.marc;, &acro.usmarc;, &acro.marc21; and &acro.marcxml; + post-filter record transformations + + + + Record Syntaxes + + Multiple record syntaxes + for data retrieval: &acro.grs1;, &acro.sutrs;, + &acro.xml;, ISO2709 (&acro.marc;), etc. Records can be mapped between + record syntaxes and schemas on the fly. + + + + &zebra; internal metadata + yes + &zebra; internal document metadata can be fetched in + &acro.sutrs; and &acro.xml; record syntaxes. Those are useful in client + applications. + + + + &zebra; internal raw record data + yes + &zebra; internal raw, binary record data can be fetched in + &acro.sutrs; and &acro.xml; record syntaxes, leveraging %zebra; to a + binary storage system + + + + &zebra; internal record field data + yes + &zebra; internal record field data can be fetched in + &acro.sutrs; and &acro.xml; record syntaxes. This makes very fast minimal + record data displays possible. + + + + +
&zebra; Sorting and Ranking - - &zebra; sorting and ranking - - - - - - - - Feature - Availability - Notes - Reference - - - - - Sort - numeric, lexicographic - Sorting on the basis of alpha-numeric and numeric data - is supported. Alphanumeric sorts can be configured for - different data encodings and locales for European languages. - and - - - - Combined sorting - yes - Sorting on the basis of combined sorts ­ e.g. combinations of - ascending/descending sorts of lexicographical/numeric/date field data - is supported - - - - Relevance ranking - TF-IDF like - Relevance-ranking of free-text queries is supported - using a TF-IDF like algorithm. - - - - Static pre-ranking - yes - Enables pre-index time ranking of documents where hit - lists are ordered first by ascending static rank, then by - ascending document ID. - - - - -
+ + &zebra; sorting and ranking + + + + + + + + Feature + Availability + Notes + Reference + + + + + Sort + numeric, lexicographic + Sorting on the basis of alpha-numeric and numeric data + is supported. Alphanumeric sorts can be configured for + different data encodings and locales for European languages. + and + + + + Combined sorting + yes + Sorting on the basis of combined sorts ­ e.g. combinations of + ascending/descending sorts of lexicographical/numeric/date field data + is supported + + + + Relevance ranking + TF-IDF like + Relevance-ranking of free-text queries is supported + using a TF-IDF like algorithm. + + + + Static pre-ranking + yes + Enables pre-index time ranking of documents where hit + lists are ordered first by ascending static rank, then by + ascending document ID. + + + + +
@@ -449,264 +425,264 @@ &zebra; Live Updates - - &zebra; live updates - - - - - - - - Feature - Availability - Notes - Reference - - - - - Incremental and batch updates - - It is possible to schedule record inserts/updates/deletes in any - quantity, from single individual handled records to batch updates - in strikes of any size, as well as total re-indexing of all records - from file system. - - - - Remote updates - &acro.z3950; extended services - Updates can be performed from remote locations using the - &acro.z3950; extended services. Access to extended services can be - login-password protected. - and - - - - Live updates - transaction based - Data updates are transaction based and can be performed - on running &zebra; systems. Full searchability is preserved - during life data update due to use of shadow disk areas for - update operations. Multiple update transactions at the same - time are lined up, to be performed one after each other. Data - integrity is preserved. - - - - -
+ + &zebra; live updates + + + + + + + + Feature + Availability + Notes + Reference + + + + + Incremental and batch updates + + It is possible to schedule record inserts/updates/deletes in any + quantity, from single individual handled records to batch updates + in strikes of any size, as well as total re-indexing of all records + from file system. + + + + Remote updates + &acro.z3950; extended services + Updates can be performed from remote locations using the + &acro.z3950; extended services. Access to extended services can be + login-password protected. + and + + + + Live updates + transaction based + Data updates are transaction based and can be performed + on running &zebra; systems. Full searchability is preserved + during life data update due to use of shadow disk areas for + update operations. Multiple update transactions at the same + time are lined up, to be performed one after each other. Data + integrity is preserved. + + + + +
-
- &zebra; Networked Protocols - - - &zebra; networked protocols - - - - - - - - Feature - Availability - Notes - Reference - - - - - Fundamental operations - &acro.z3950;/&acro.sru; explain, - search, scan, and - update - - - - - &acro.z3950; protocol support - yes - Protocol facilities supported are: - init, search, - present (retrieval), - Segmentation (support for very large records), - delete, scan - (index browsing), sort, - close and support for the update - Extended Service to add or replace an existing &acro.xml; - record. Piggy-backed presents are honored in the search - request. Named result sets are supported. - - - - Web Service support - &acro.sru; - The protocol operations explain, - searchRetrieve and scan - are supported. &acro.cql; to internal - query model &acro.rpn; - conversion is supported. Extended RPN queries - for search/retrieve and scan are supported. - - - - -
+
+ &zebra; Networked Protocols + + + &zebra; networked protocols + + + + + + + + Feature + Availability + Notes + Reference + + + + + Fundamental operations + &acro.z3950;/&acro.sru; explain, + search, scan, and + update + + + + + &acro.z3950; protocol support + yes + Protocol facilities supported are: + init, search, + present (retrieval), + Segmentation (support for very large records), + delete, scan + (index browsing), sort, + close and support for the update + Extended Service to add or replace an existing &acro.xml; + record. Piggy-backed presents are honored in the search + request. Named result sets are supported. + + + + Web Service support + &acro.sru; + The protocol operations explain, + searchRetrieve and scan + are supported. &acro.cql; to internal + query model &acro.rpn; + conversion is supported. Extended RPN queries + for search/retrieve and scan are supported. + + + + +
&zebra; Data Size and Scalability - - &zebra; data size and scalability - - - - - - - - Feature - Availability - Notes - Reference - - - - - No of records - 40-60 million - - - - - Data size - 100 GB of record data - &zebra; based applications have successfully indexed up - to 100 GB of record data - - - - Scale out - multiple discs - - - - - Performance - O(n * log N) - &zebra; query speed and performance is affected roughly by - O(log N), - where N is the total database size, and by - O(n), where n is the - specific query hit set size. - - - - Average search times - - Even on very large size databases hit rates of 20 queries per - seconds with average query answering time of 1 second are possible, - provided that the boolean queries are constructed sufficiently - precise to result in hit sets of the order of 1000 to 5.000 - documents. - - - - Large databases - 64 bit file pointers - 64 file pointers assure that register files can extend - the 2 GB limit. Logical files can be - automatically partitioned over multiple disks, thus allowing for - large databases. - - - - -
+ + &zebra; data size and scalability + + + + + + + + Feature + Availability + Notes + Reference + + + + + No of records + 40-60 million + + + + + Data size + 100 GB of record data + &zebra; based applications have successfully indexed up + to 100 GB of record data + + + + Scale out + multiple discs + + + + + Performance + O(n * log N) + &zebra; query speed and performance is affected roughly by + O(log N), + where N is the total database size, and by + O(n), where n is the + specific query hit set size. + + + + Average search times + + Even on very large size databases hit rates of 20 queries per + seconds with average query answering time of 1 second are possible, + provided that the boolean queries are constructed sufficiently + precise to result in hit sets of the order of 1000 to 5.000 + documents. + + + + Large databases + 64 bit file pointers + 64 file pointers assure that register files can extend + the 2 GB limit. Logical files can be + automatically partitioned over multiple disks, thus allowing for + large databases. + + + + +
&zebra; Supported Platforms - - &zebra; supported platforms - - - - - - - - Feature - Availability - Notes - Reference - - - - - Linux - - GNU Linux (32 and 64bit), journaling Reiser or (better) - JFS file system - on disks. NFS file systems are not supported. - GNU/Debian Linux packages are available - - - - Unix - tar-ball - &zebra; is written in portable C, so it runs on most - Unix-like systems. - Usual tar-ball install possible on many major Unix systems - - - - Windows - NT/2000/2003/XP - &zebra; runs as well on Windows (NT/2000/2003/XP). - Windows installer packages available - - - - -
+ + &zebra; supported platforms + + + + + + + + Feature + Availability + Notes + Reference + + + + + Linux + + GNU Linux (32 and 64bit), journaling Reiser or (better) + JFS file system + on disks. NFS file systems are not supported. + GNU/Debian Linux packages are available + + + + Unix + tar-ball + &zebra; is written in portable C, so it runs on most + Unix-like systems. + Usual tar-ball install possible on many major Unix systems + + + + Windows + NT/2000/2003/XP + &zebra; runs as well on Windows (NT/2000/2003/XP). + Windows installer packages available + + + + +
- -
- + + +
- References and &zebra; based Applications - - &zebra; has been deployed in numerous applications, in both the - academic and commercial worlds, in application domains as diverse - as bibliographic catalogues, Geo-spatial information, structured - vocabulary browsing, government information locators, civic - information systems, environmental observations, museum information - and web indexes. - - - Notable applications include the following: - - - -
- Koha free open-source ILS + References and &zebra; based Applications + + &zebra; has been deployed in numerous applications, in both the + academic and commercial worlds, in application domains as diverse + as bibliographic catalogues, Geo-spatial information, structured + vocabulary browsing, government information locators, civic + information systems, environmental observations, museum information + and web indexes. + + Notable applications include the following: + + + +
+ Koha free open-source ILS + Koha is a full-featured - open-source ILS, initially developed in + open-source ILS, initially developed in New Zealand by Katipo Communications Ltd, and first deployed in January of 2000 for Horowhenua Library Trust. It is currently maintained by a team of software providers and library technology - staff from around the globe. + staff from around the globe. - LibLime, + LibLime, a company that is marketing and supporting Koha, adds in the new release of Koha 3.0 the &zebra; database server to drive its bibliographic database. @@ -717,7 +693,7 @@ in the Koha 2.x series. After extensive evaluations of the best of the Open Source textual database engines - including MySQL full-text searching, PostgreSQL, Lucene and Plucene - the team - selected &zebra;. + selected &zebra;. "&zebra; completely eliminates scalability limitations, because it @@ -725,7 +701,7 @@ Ferraro, LibLime's Technology President and Koha's Project Release Manager. "Our performance tests showed search results in under a second for databases with over 5 million records on a - modest i386 900Mhz test server." + modest i386 900Mhz test server." "&zebra; also includes support for true boolean search expressions @@ -734,37 +710,37 @@ database updates, which allow on-the-fly record management. Finally, since &zebra; has at its heart the &acro.z3950; protocol, it greatly improves Koha's support for that critical - library standard." + library standard." - + Although the bibliographic database will be moved to &zebra;, Koha 3.0 will continue to use a relational SQL-based database design for the 'factual' database. "Relational database managers have their strengths, in spite of their inability to handle large numbers of bibliographic records efficiently," summed up Ferraro, "We're taking the best from both worlds in our redesigned Koha - 3.0. - - + 3.0. + + See also LibLime's newsletter article - - Koha Earns its Stripes. - + + Koha Earns its Stripes. +
-
- Kete Open Source Digital Library and Archiving software - +
+ Kete Open Source Digital Library and Archiving software + Kete is a digital object - management repository, initially developed in + management repository, initially developed in New Zealand. Initial development has been a partnership between the Horowhenua Library Trust and Katipo Communications Ltd. funded as part of the Community Partnership Fund in 2006. Kete is purpose built software to enable communities to build their own digital - libraries, archives and repositories. + libraries, archives and repositories. It is based on Ruby-on-Rails and MySQL, and integrates the &zebra; server @@ -773,20 +749,20 @@ application. See how Kete manages - Zebra. - - + url="http://kete.net.nz/documentation/topics/show/139-managing-zebra">manages + Zebra. + + Why does Kete wants to use Zebra?? Speed, Scalability and easy - integration with Koha. Read their - detailed - reasoning here. + integration with Koha. Read their + detailed + reasoning here.
-
- ReIndex.Net web based ILS +
+ ReIndex.Net web based ILS Reindex.net is a netbased library service offering all @@ -794,16 +770,16 @@ services. Reindex.net is a comprehensive and powerful WEB system based on standards such as &acro.xml; and &acro.z3950;. updates. Reindex supports &acro.marc21;, dan&acro.marc; eller Dublin Core with - UTF8-encoding. + UTF8-encoding. Reindex.net runs on GNU/Debian Linux with &zebra; and Simpleserver - from Index + from Index Data for bibliographic data. The relational database system Sybase 9 &acro.xml; is used for - administrative data. + administrative data. Internally &acro.marcxml; is used for bibliographical records. Update - utilizes &acro.z3950; extended services. + utilizes &acro.z3950; extended services.
@@ -811,115 +787,114 @@ DADS - the DTV Article Database Service - DADS is a huge database of more than ten million records, totalling - over ten gigabytes of data. The records are metadata about academic - journal articles, primarily scientific; about 10% of these - metadata records link to the full text of the articles they - describe, a body of about a terabyte of information (although the - full text is not indexed.) - - - It allows students and researchers at DTU (Danmarks Tekniske - Universitet, the Technical College of Denmark) to find and order - articles from multiple databases in a single query. The database - contains literature on all engineering subjects. It's available - on-line through a web gateway, though currently only to registered - users. - - - More information can be found at - and - - -
+ DADS is a huge database of more than ten million records, totalling + over ten gigabytes of data. The records are metadata about academic + journal articles, primarily scientific; about 10% of these + metadata records link to the full text of the articles they + describe, a body of about a terabyte of information (although the + full text is not indexed.) +
+ + It allows students and researchers at DTU (Danmarks Tekniske + Universitet, the Technical College of Denmark) to find and order + articles from multiple databases in a single query. The database + contains literature on all engineering subjects. It's available + on-line through a web gateway, though currently only to registered + users. + + + More information can be found at + and + + +
-
- ULS (Union List of Serials) - - The M25 Systems Team - has created a union catalogue for the periodicals of the - twenty-one constituent libraries of the University of London and - the University of Westminster - (). - They have achieved this using an - unusual architecture, which they describe as a - ``non-distributed virtual union catalogue''. - - - The member libraries send in data files representing their - periodicals, including both brief bibliographic data and summary - holdings. Then 21 individual &acro.z3950; targets are created, each - using &zebra;, and all mounted on the single hardware server. - The live service provides a web gateway allowing &acro.z3950; searching - of all of the targets or a selection of them. &zebra;'s small - footprint allows a relatively modest system to comfortably host - the 21 servers. - - - More information can be found at - - -
+
+ ULS (Union List of Serials) + + The M25 Systems Team + has created a union catalogue for the periodicals of the + twenty-one constituent libraries of the University of London and + the University of Westminster + (). + They have achieved this using an + unusual architecture, which they describe as a + ``non-distributed virtual union catalogue''. + + + The member libraries send in data files representing their + periodicals, including both brief bibliographic data and summary + holdings. Then 21 individual &acro.z3950; targets are created, each + using &zebra;, and all mounted on the single hardware server. + The live service provides a web gateway allowing &acro.z3950; searching + of all of the targets or a selection of them. &zebra;'s small + footprint allows a relatively modest system to comfortably host + the 21 servers. + + + More information can be found at + + +
-
- Various web indexes - - &zebra; has been used by a variety of institutions to construct - indexes of large web sites, typically in the region of tens of - millions of pages. In this role, it functions somewhat similarly - to the engine of Google or AltaVista, but for a selected intranet - or a subset of the whole Web. - - - For example, Liverpool University's web-search facility (see on - the home page at - - and many sub-pages) works by relevance-searching a &zebra; database - which is populated by the Harvest-NG web-crawling software. - - - For more information on Liverpool university's intranet search - architecture, contact John Gilbertson - jgilbert@liverpool.ac.uk - - - Kang-Jin Lee - has recently modified the Harvest web indexer to use &zebra; as - its native repository engine. His comments on the switch over - from the old engine are revealing: -
- - The first results after some testing with &zebra; are very - promising. The tests were done with around 220,000 SOIF files, - which occupies 1.6GB of disk space. - - - Building the index from scratch takes around one hour with &zebra; - where [old-engine] needs around five hours. While [old-engine] - blocks search requests when updating its index, &zebra; can still - answer search requests. - [...] - &zebra; supports incremental indexing which will speed up indexing - even further. - - - While the search time of [old-engine] varies from some seconds - to some minutes depending how expensive the query is, &zebra; - usually takes around one to three seconds, even for expensive - queries. - [...] - &zebra; can search more than 100 times faster than [old-engine] - and can process multiple search requests simultaneously - - - I am very happy to see such nice software available under GPL. - -
-
+
+ Various web indexes + + &zebra; has been used by a variety of institutions to construct + indexes of large web sites, typically in the region of tens of + millions of pages. In this role, it functions somewhat similarly + to the engine of Google or AltaVista, but for a selected intranet + or a subset of the whole Web. + + + For example, Liverpool University's web-search facility (see on + the home page at + + and many sub-pages) works by relevance-searching a &zebra; database + which is populated by the Harvest-NG web-crawling software. + + + For more information on Liverpool university's intranet search + architecture, contact John Gilbertson + jgilbert@liverpool.ac.uk + + + Kang-Jin Lee + has recently modified the Harvest web indexer to use &zebra; as + its native repository engine. His comments on the switch over + from the old engine are revealing: +
+ + The first results after some testing with &zebra; are very + promising. The tests were done with around 220,000 SOIF files, + which occupies 1.6GB of disk space. + + + Building the index from scratch takes around one hour with &zebra; + where [old-engine] needs around five hours. While [old-engine] + blocks search requests when updating its index, &zebra; can still + answer search requests. + [...] + &zebra; supports incremental indexing which will speed up indexing + even further. + + + While the search time of [old-engine] varies from some seconds + to some minutes depending how expensive the query is, &zebra; + usually takes around one to three seconds, even for expensive + queries. + [...] + &zebra; can search more than 100 times faster than [old-engine] + and can process multiple search requests simultaneously + + + I am very happy to see such nice software available under GPL. + +
+
+
-
- - +
Support @@ -941,8 +916,8 @@ releases, bug fixes, etc.) and general discussion. You are welcome to seek support there. Join by filling the form on the list home page. -
- +
+ - @@ -1622,13 +1622,13 @@ &zebra; Extension Rank Weight Attribute (type 9) Rank weight is a way to pass a value to a ranking algorithm - so - that one &acro.apt; has one value - while another as a different one. + that one &acro.apt; has one value - while another as a different one. See also . For example, searching for utah in title with weight 30 as well - as any with weight 20: - + as any with weight 20: + Z> find @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah @@ -1637,23 +1637,23 @@
&zebra; Extension Term Reference Attribute (type 10) - &zebra; supports the searchResult-1 facility. + &zebra; supports the searchResult-1 facility. If the Term Reference Attribute (type 10) is given, that specifies a subqueryId value returned as part of the search result. It is a way for a client to name an &acro.apt; part of a - query. + query. Experimental. Do not use in production code. - + - +
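    A hedged sketch of its use (the subqueryId values 1 and 2 are
    arbitrary labels): a client could tag the two branches of a query
    in order to recognize them in the returned searchResult-1 details,

     Z> find @or @attr 10=1 @attr 1=4 utah @attr 10=2 @attr 1=1010 city

    but, as noted, the facility is experimental and its behaviour may
    change.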
- - - + + +
Local Approximative Limit Attribute (type 11) @@ -1676,7 +1676,7 @@ For example, we might be interested in exact hit count for a, but - for b we allow hit count estimates for 1000 and higher. + for b we allow hit count estimates for 1000 and higher. Z> find @and a @attr 11=1000 b @@ -1686,7 +1686,7 @@ The estimated hit count facility makes searches faster, as one only needs to process large hit lists partially. It is mostly used in huge databases, where you you want trade - exactness of hit counts against speed of execution. + exactness of hit counts against speed of execution. @@ -1694,7 +1694,7 @@ Do not use approximative hit count limits in conjunction with relevance ranking, as re-sorting of the result set only works when the entire result set has - been processed. + been processed.
@@ -1704,7 +1704,7 @@ By default &zebra; computes precise hit counts for a query as a whole. Setting attribute 12 makes it perform approximative - hit counts instead. It has the same semantics as + hit counts instead. It has the same semantics as estimatehits for the . @@ -1717,7 +1717,7 @@ Do not use approximative hit count limits in conjunction with relevance ranking, as re-sorting of the result set only works when the entire result set has - been processed. + been processed. @@ -1728,7 +1728,7 @@ &zebra; specific Scan Extensions to all Attribute Sets &zebra; extends the Bib1 attribute types, and these extensions are - recognized regardless of attribute + recognized regardless of attribute set used in a scan operation query. @@ -1757,44 +1757,44 @@ -
- + +
&zebra; Extension Result Set Narrow (type 8) If attribute Result Set Narrow (type 8) is given for scan, the value is the name of a - result set. Each hit count in scan is - @and'ed with the result set given. + result set. Each hit count in scan is + @and'ed with the result set given. - Consider for example + Consider for example the case of scanning all title fields around the scanterm mozart, then refining the scan by issuing a filtering query for amadeus to - restrict the scan to the result set of the query: + restrict the scan to the result set of the query: - Z> scan @attr 1=4 mozart - ... - * mozart (43) - mozartforskningen (1) - mozartiana (1) - mozarts (16) - ... - Z> f @attr 1=4 amadeus - ... - Number of hits: 15, setno 2 - ... - Z> scan @attr 1=4 @attr 8=2 mozart - ... - * mozart (14) - mozartforskningen (0) - mozartiana (0) - mozarts (1) - ... + Z> scan @attr 1=4 mozart + ... + * mozart (43) + mozartforskningen (1) + mozartiana (1) + mozarts (16) + ... + Z> f @attr 1=4 amadeus + ... + Number of hits: 15, setno 2 + ... + Z> scan @attr 1=4 @attr 8=2 mozart + ... + * mozart (14) + mozartforskningen (0) + mozartiana (0) + mozarts (1) + ... - + &zebra; 2.0.2 and later is able to skip 0 hit counts. This, however, is known not to scale if the number of terms to skip is high. @@ -1808,16 +1808,16 @@ The &zebra; Extension Approximative Limit (type 12) is a way to enable approximate hit counts for scan hit counts, in the same - way as for search hit counts. + way as for search hit counts.
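    The scan extensions can be combined. A short sketch, assuming the
    first query creates result set 1 and using an arbitrary estimation
    threshold of 1000:

     Z> f @attr 1=4 amadeus
     Z> scan @attr 1=4 @attr 8=1 @attr 12=1000 mozart

    This narrows the scan hit counts to the amadeus
    result set while permitting approximate counts above the threshold.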
- +
&zebra; special &acro.idxpath; Attribute Set for &acro.grs1; indexing - The attribute-set idxpath consists of a single - Use (type 1) attribute. All non-use attributes behave as normal. + The attribute-set idxpath consists of a single + Use (type 1) attribute. All non-use attributes behave as normal. This feature is enabled when defining the @@ -1836,10 +1836,10 @@
-      &acro.idxpath; Use Attributes (type = 1)
+      &acro.idxpath; Use Attributes (type = 1)
      
       This attribute set allows one to search &acro.grs1; filter indexed
-      records by &acro.xpath; like structured index names.
+      records by &acro.xpath; like structured index names.
      

@@ -1848,7 +1848,7 @@
       index names, which might clash with your own index names.
      

-     
+     
      &zebra; specific &acro.idxpath; Use Attributes (type 1)
      
      
@@ -1899,22 +1899,22 @@
       See tab/idxpath.att for more information.
      
      
-      Search for all documents starting with root element
+      Search for all documents starting with root element
       /root (either using the numeric or the string
       use attributes):
       
-       Z> find @attrset idxpath @attr 1=1 @attr 4=3 root/
-       Z> find @attr idxpath 1=1 @attr 4=3 root/
-       Z> find @attr 1=_XPATH_BEGIN @attr 4=3 root/
+       Z> find @attrset idxpath @attr 1=1 @attr 4=3 root/
+       Z> find @attr idxpath 1=1 @attr 4=3 root/
+       Z> find @attr 1=_XPATH_BEGIN @attr 4=3 root/
      
      
-      Search for all documents where specific nested &acro.xpath;
+      Search for all documents where specific nested &acro.xpath;
       /c1/c2/../cn exists. Notice the very
       counter-intuitive reverse notation!
       
-       Z> find @attrset idxpath @attr 1=1 @attr 4=3 cn/cn-1/../c1/
-       Z> find @attr 1=_XPATH_BEGIN @attr 4=3 cn/cn-1/../c1/
+       Z> find @attrset idxpath @attr 1=1 @attr 4=3 cn/cn-1/../c1/
+       Z> find @attr 1=_XPATH_BEGIN @attr 4=3 cn/cn-1/../c1/
      

@@ -1925,19 +1925,19 @@
      
      
-      Search for CDATA string anothertext in any
-      attribute:
-      
+      Search for CDATA string anothertext in any
+      attribute:
+      
       Z> find @attrset idxpath @attr 1=1015 anothertext
       Z> find @attr 1=_XPATH_ATTR_CDATA anothertext
      
      
-      Search for all documents with have an &acro.xml; element node
-      including an &acro.xml; attribute named creator
-      
-      Z> find @attrset idxpath @attr 1=3 @attr 4=3 creator
-      Z> find @attr 1=_XPATH_ATTR_NAME @attr 4=3 creator
+      Search for all documents which have an &acro.xml; element node
+      including an &acro.xml; attribute named creator
+      
+      Z> find @attrset idxpath @attr 1=3 @attr 4=3 creator
+      Z> find @attr 1=_XPATH_ATTR_NAME @attr 4=3 creator
      

@@ -1951,7 +1951,7 @@
       Scanning is supported on all idxpath indexes, both
       specified as numeric use attributes, or as string
-      index names.
+      index names.
       
       Z> scan @attrset idxpath @attr 1=1016 text
       Z> scan @attr 1=_XPATH_ATTR_CDATA anothertext
- Mapping from &acro.pqf; atomic &acro.apt; queries to &zebra; internal + <title>Mapping from &acro.pqf; atomic &acro.apt; queries to &zebra; internal register indexes The rules for &acro.pqf; &acro.apt; mapping are rather tricky to grasp in the @@ -1972,19 +1972,19 @@ internal register or string index to use, according to the use attribute or access point specified in the query. Thereafter we deal with the rules for determining the correct structure type of - the named register. + the named register. -
- Mapping of &acro.pqf; &acro.apt; access points - +
+ Mapping of &acro.pqf; &acro.apt; access points + &zebra; understands four fundamental different types of access - points, of which only the + points, of which only the numeric use attribute type access points are defined by the &acro.z3950; standard. All other access point types are &zebra; specific, and non-portable. - +
Access point name mapping @@ -1996,86 +1996,86 @@ GrammarNotes - - - - Use attribute - numeric - [1-9][1-9]* - directly mapped to string index name - - - String index name - string - [a-zA-Z](\-?[a-zA-Z0-9])* - normalized name is used as internal string index name - - - &zebra; internal index name - zebra - _[a-zA-Z](_?[a-zA-Z0-9])* - hardwired internal string index name - - - &acro.xpath; special index - XPath - /.* - special xpath search for &acro.grs1; indexed records - + + + + Use attribute + numeric + [1-9][1-9]* + directly mapped to string index name + + + String index name + string + [a-zA-Z](\-?[a-zA-Z0-9])* + normalized name is used as internal string index name + + + &zebra; internal index name + zebra + _[a-zA-Z](_?[a-zA-Z0-9])* + hardwired internal string index name + + + &acro.xpath; special index + XPath + /.* + special xpath search for &acro.grs1; indexed records +
-    
+    
      
-      Attribute set names and
+      Attribute set names and
       string index names are normalized
       according to the following rules: all single hyphens
       '-' are stripped, and all upper case letters are folded to lower
       case.
     
     
-      Numeric use attributes are mapped
+      Numeric use attributes are mapped
       to the &zebra; internal
       string index according to the attribute set definition in use.
       The default attribute set is &acro.bib1;, and may be omitted in the
       &acro.pqf; query.
     
     
       According to normalization and numeric use attribute mapping, it
       follows that the following &acro.pqf; queries are considered
       equivalent (assuming the default configuration has not been altered):
       
-        Z> find @attr 1=Body-of-text serenade
-        Z> find @attr 1=bodyoftext serenade
-        Z> find @attr 1=BodyOfText serenade
-        Z> find @attr 1=bO-d-Y-of-tE-x-t serenade
-        Z> find @attr 1=1010 serenade
-        Z> find @attrset bib1 @attr 1=1010 serenade
-        Z> find @attrset bib1 @attr 1=1010 serenade
-        Z> find @attrset Bib1 @attr 1=1010 serenade
-        Z> find @attrset b-I-b-1 @attr 1=1010 serenade
-      
-    
+        Z> find @attr 1=Body-of-text serenade
+        Z> find @attr 1=bodyoftext serenade
+        Z> find @attr 1=BodyOfText serenade
+        Z> find @attr 1=bO-d-Y-of-tE-x-t serenade
+        Z> find @attr 1=1010 serenade
+        Z> find @attrset bib1 @attr 1=1010 serenade
+        Z> find @attrset bib1 @attr 1=1010 serenade
+        Z> find @attrset Bib1 @attr 1=1010 serenade
+        Z> find @attrset b-I-b-1 @attr 1=1010 serenade
+      
+    

-    
+    
       The
       numerical
-      use attributes (type 1)
+      use attributes (type 1)
       are interpreted according to the attribute sets which have been
       loaded in the zebra.cfg file, and are matched
       against specific fields as specified in the .abs
       file which describes the profile of the records which have been loaded.
-      If no use attribute is provided, a default of
+      If no use attribute is provided, a default of
       &acro.bib1; Use Any (1016) is assumed.
       The predefined use attribute sets can be reconfigured by tweaking
       the configuration files
-      tab/*.att, and
+      tab/*.att, and
       new attribute sets can be defined by adding similar files in the
-      configuration path profilePath of the server.
-    
+      configuration path profilePath of the server.
+    

     
       String indexes can be accessed directly,
@@ -2091,10 +2091,10 @@
       &zebra; internal indexes can be accessed directly,
       according to the same rules as the user defined
-      string indexes. The only difference is that
+      string indexes. The only difference is that
       &zebra; internal index names are hardwired,
       all uppercase and
-      must start with the character '_'.
+      must start with the character '_'.

@@ -2102,7 +2102,7 @@
       available using the &acro.grs1; filter for indexing.
       These access point names must start with the character
       '/', they are not
-      normalized, but passed unaltered to the &zebra; internal
+      normalized, but passed unaltered to the &zebra; internal
       &acro.xpath; engine. See .


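
      To illustrate the four access point types side by side, the
      following queries use a numeric use attribute, a user-defined
      string index, a &zebra; internal index, and an &acro.xpath;
      access point, respectively. This is a sketch: the names
      title and /gils/title depend entirely on your own
      configuration, whereas _ALLRECORDS is a hardwired
      internal index:
      
       Z> find @attr 1=4 sonata
       Z> find @attr 1=title sonata
       Z> find @attr 1=_ALLRECORDS @attr 2=103 ''
       Z> find @attr 1=/gils/title sonata
      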
-
- Mapping of &acro.pqf; &acro.apt; structure and completeness to + <section id="querymodel-pqf-apt-mapping-structuretype"> + <title>Mapping of &acro.pqf; &acro.apt; structure and completeness to register type - + Internally &zebra; has in its default configuration several - different types of registers or indexes, whose tokenization and - character normalization rules differ. This reflects the fact that + different types of registers or indexes, whose tokenization and + character normalization rules differ. This reflects the fact that searching fundamental different tokens like dates, numbers, - bitfields and string based text needs different rule sets. + bitfields and string based text needs different rule sets. @@ -2136,7 +2136,7 @@ - phrase (@attr 4=1), word (@attr 4=2), + phrase (@attr 4=1), word (@attr 4=2), word-list (@attr 4=6), free-form-text (@attr 4=105), or document-text (@attr 4=106) @@ -2146,7 +2146,7 @@ - phrase (@attr 4=1), word (@attr 4=2), + phrase (@attr 4=1), word (@attr 4=2), word-list (@attr 4=6), free-form-text (@attr 4=105), or document-text (@attr 4=106) @@ -2196,59 +2196,59 @@ overruled overruled special - Internal record ID register, used whenever + Internal record ID register, used whenever Relation Always Matches (@attr 2=103) is specified
- + - - - If a Structure attribute of - Phrase is used in conjunction with a - Completeness attribute of - Complete (Sub)field, the term is matched - against the contents of the phrase (long word) register, if one - exists for the given Use attribute. - A phrase register is created for those fields in the - &acro.grs1; *.abs file that contains a - p-specifier. + + + If a Structure attribute of + Phrase is used in conjunction with a + Completeness attribute of + Complete (Sub)field, the term is matched + against the contents of the phrase (long word) register, if one + exists for the given Use attribute. + A phrase register is created for those fields in the + &acro.grs1; *.abs file that contains a + p-specifier. - Z> scan @attr 1=Title @attr 4=1 @attr 6=3 beethoven + Z> scan @attr 1=Title @attr 4=1 @attr 6=3 beethoven ... bayreuther festspiele (1) * beethoven bibliography database (1) benny carter (1) ... - Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography" + Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography" ... Number of hits: 0, setno 5 ... - Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography database" + Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography database" ... Number of hits: 1, setno 6 - - + + - - If Structure=Phrase is - used in conjunction with Incomplete Field - the - default value for Completeness, the - search is directed against the normal word registers, but if the term - contains multiple words, the term will only match if all of the words - are found immediately adjacent, and in the given order. - The word search is performed on those fields that are indexed as - type w in the &acro.grs1; *.abs file. + + If Structure=Phrase is + used in conjunction with Incomplete Field - the + default value for Completeness, the + search is directed against the normal word registers, but if the term + contains multiple words, the term will only match if all of the words + are found immediately adjacent, and in the given order. + The word search is performed on those fields that are indexed as + type w in the &acro.grs1; *.abs file. - Z> scan @attr 1=Title @attr 4=1 @attr 6=1 beethoven + Z> scan @attr 1=Title @attr 4=1 @attr 6=1 beethoven ... - beefheart (1) + beefheart (1) * beethoven (18) - beethovens (7) + beethovens (7) ... - Z> find @attr 1=Title @attr 4=1 @attr 6=1 beethoven + Z> find @attr 1=Title @attr 4=1 @attr 6=1 beethoven ... Number of hits: 18, setno 1 ... @@ -2256,74 +2256,74 @@ ... Number of hits: 2, setno 2 ... - - + + - - If the Structure attribute is - Word List, - Free-form Text, or - Document Text, the term is treated as a - natural-language, relevance-ranked query. - This search type uses the word register, i.e. those fields - that are indexed as type w in the - &acro.grs1; *.abs file. - + + If the Structure attribute is + Word List, + Free-form Text, or + Document Text, the term is treated as a + natural-language, relevance-ranked query. + This search type uses the word register, i.e. those fields + that are indexed as type w in the + &acro.grs1; *.abs file. + - - If the Structure attribute is - Numeric String the term is treated as an integer. - The search is performed on those fields that are indexed - as type n in the &acro.grs1; + + If the Structure attribute is + Numeric String the term is treated as an integer. + The search is performed on those fields that are indexed + as type n in the &acro.grs1; *.abs file. - + - - If the Structure attribute is - URX the term is treated as a URX (URL) entity. 
- The search is performed on those fields that are indexed as type - u in the *.abs file. - + + If the Structure attribute is + URX the term is treated as a URX (URL) entity. + The search is performed on those fields that are indexed as type + u in the *.abs file. + - - If the Structure attribute is - Local Number the term is treated as - native &zebra; Record Identifier. - + + If the Structure attribute is + Local Number the term is treated as + native &zebra; Record Identifier. + - - If the Relation attribute is - Equals (default), the term is matched - in a normal fashion (modulo truncation and processing of - individual words, if required). - If Relation is Less Than, - Less Than or Equal, - Greater than, or Greater than or - Equal, the term is assumed to be numerical, and a - standard regular expression is constructed to match the given - expression. - If Relation is Relevance, - the standard natural-language query processor is invoked. - + + If the Relation attribute is + Equals (default), the term is matched + in a normal fashion (modulo truncation and processing of + individual words, if required). + If Relation is Less Than, + Less Than or Equal, + Greater than, or Greater than or + Equal, the term is assumed to be numerical, and a + standard regular expression is constructed to match the given + expression. + If Relation is Relevance, + the standard natural-language query processor is invoked. + - - For the Truncation attribute, - No Truncation is the default. - Left Truncation is not supported. - Process # in search term is supported, as is - Regxp-1. - Regxp-2 enables the fault-tolerant (fuzzy) - search. As a default, a single error (deletion, insertion, - replacement) is accepted when terms are matched against the register - contents. - + + For the Truncation attribute, + No Truncation is the default. + Left Truncation is not supported. + Process # in search term is supported, as is + Regxp-1. + Regxp-2 enables the fault-tolerant (fuzzy) + search. As a default, a single error (deletion, insertion, + replacement) is accepted when terms are matched against the register + contents. + -
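
      For example, a fault-tolerant search using Regxp-2 truncation,
      which by default forgives a single spelling error (the
      misspelled term is illustrative):
      
       Z> find @attr 5=103 @attr 1=4 bethoven
      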
+
&zebra; Regular Expressions in Truncation Attribute (type = 5) - + Each term in a query is interpreted as a regular expression if the truncation value is either Regxp-1 (@attr 5=102) @@ -2350,29 +2350,29 @@ - + The above operands can be combined with the following operators: - + Regular Expression Operators x* - Matches x zero or more times. + Matches x zero or more times. Priority: high. x+ - Matches x one or more times. + Matches x one or more times. Priority: high. x? - Matches x zero or once. + Matches x zero or once. Priority: high. @@ -2390,16 +2390,16 @@ The order of evaluation may be changed by using parentheses. - -
- + + + If the first character of the Regxp-2 query is a plus character (+) it marks the beginning of a section with non-standard specifiers. The next plus character marks the end of the section. Currently &zebra; only supports one specifier, the error tolerance, - which consists one digit. + which consists one digit. @@ -2427,19 +2427,19 @@
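
      Following the description above, the tolerance can be carried in
      the term itself. This is a sketch derived from the plus-delimited
      specifier section syntax described here - asking for up to two
      errors when matching an illustrative term:
      
       Z> find @attr 5=103 @attr 1=4 +2+beethoven
      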
- + @@ -2449,66 +2449,66 @@ Using the <cql2rpn>l2rpn.txt</cql2rpn> - &yaz; Frontend Virtual + &yaz; Frontend Virtual Hosts option, one can configure the &yaz; Frontend &acro.cql;-to-&acro.pqf; - converter, specifying the interpretation of various + converter, specifying the interpretation of various &acro.cql; indexes, relations, etc. in terms of Type-1 query attributes. - + For example, using server-side &acro.cql;-to-&acro.pqf; conversion, one might query a zebra server like this: - querytype cql Z> find text=(plant and soil) ]]> - and - if properly configured - even static relevance ranking can - be performed using &acro.cql; query syntax: + and - if properly configured - even static relevance ranking can + be performed using &acro.cql; query syntax: - find text = /relevant (plant and soil) ]]> - + - By the way, the same configuration can be used to + By the way, the same configuration can be used to search using client-side &acro.cql;-to-&acro.pqf; conversion: - (the only difference is querytype cql2rpn - instead of + (the only difference is querytype cql2rpn + instead of querytype cql, and the call specifying a local conversion file) - querytype cql2rpn Z> find text=(plant and soil) ]]> - + Exhaustive information can be found in the Section &acro.cql; to &acro.rpn; conversion in the &yaz; manual. - - - + - + - In this section, we will test the system by indexing a small set of - sample GILS records that are included with the &zebra; distribution, - running a &zebra; server against the newly created database, and - searching the indexes with a client that connects to that server. - - - Go to the examples/gils subdirectory of the - distribution archive. The 48 test records are located in the sub - directory records. To index these, type: - - zebraidx update records - - - - - In this command, the word update is followed - by the name of a directory: zebraidx updates all - files in the hierarchy rooted at that directory. - - - - If your indexing command was successful, you are now ready to - fire up a server. To start a server on port 2100, type: - - - zebrasrv @:2100 - - - + + + In this section, we will test the system by indexing a small set of + sample GILS records that are included with the &zebra; distribution, + running a &zebra; server against the newly created database, and + searching the indexes with a client that connects to that server. + + + Go to the examples/gils subdirectory of the + distribution archive. The 48 test records are located in the sub + directory records. To index these, type: + + zebraidx update records + + + + + In this command, the word update is followed + by the name of a directory: zebraidx updates all + files in the hierarchy rooted at that directory. + + + + If your indexing command was successful, you are now ready to + fire up a server. To start a server on port 2100, type: + + + zebrasrv @:2100 + - - The &zebra; index that you have just created has a single database - named Default. - The database contains records structured according to - the GILS profile, and the server will - return records in &acro.usmarc;, &acro.grs1;, or &acro.sutrs; format depending - on what the client asks for. - - - - To test the server, you can use any &acro.z3950; client. 
- For instance, you can use the demo command-line client that comes - with &yaz;: - - - - yaz-client localhost:2100 - - - - - When the client has connected, you can type: - - - - - Z> find surficial - Z> show 1 - - - - - The default retrieval syntax for the client is &acro.usmarc;, and the - default element set is F (``full record''). To - try other formats and element sets for the same record, try: - - - - Z>format sutrs - Z>show 1 - Z>format grs-1 - Z>show 1 - Z>format xml - Z>show 1 - Z>elements B - Z>show 1 - - - - - You may notice that more fields are returned when your - client requests &acro.sutrs;, &acro.grs1; or &acro.xml; records. - This is normal - not all of the GILS data elements have mappings in - the &acro.usmarc; record format. - - - If you've made it this far, you know that your installation is - working, but there's a certain amount of voodoo going on - for - example, the mysterious incantations in the - zebra.cfg file. In order to help us understand - these fully, the next chapter will work through a series of - increasingly complex example configurations. - - - + + + The &zebra; index that you have just created has a single database + named Default. + The database contains records structured according to + the GILS profile, and the server will + return records in &acro.usmarc;, &acro.grs1;, or &acro.sutrs; format depending + on what the client asks for. + + + + To test the server, you can use any &acro.z3950; client. + For instance, you can use the demo command-line client that comes + with &yaz;: + + + + yaz-client localhost:2100 + + + + + When the client has connected, you can type: + + + + + Z> find surficial + Z> show 1 + + + + + The default retrieval syntax for the client is &acro.usmarc;, and the + default element set is F (``full record''). To + try other formats and element sets for the same record, try: + + + + Z>format sutrs + Z>show 1 + Z>format grs-1 + Z>show 1 + Z>format xml + Z>show 1 + Z>elements B + Z>show 1 + + + + + You may notice that more fields are returned when your + client requests &acro.sutrs;, &acro.grs1; or &acro.xml; records. + This is normal - not all of the GILS data elements have mappings in + the &acro.usmarc; record format. + + + + If you've made it this far, you know that your installation is + working, but there's a certain amount of voodoo going on - for + example, the mysterious incantations in the + zebra.cfg file. In order to help us understand + these fully, the next chapter will work through a series of + increasingly complex example configurations. + + + + insert, update, and + delete. --> In this example, the following literal indexes are constructed: - oai_identifier - oai_datestamp - oai_setspec - dc_all - dc_title - dc_creator + oai_identifier + oai_datestamp + oai_setspec + dc_all + dc_title + dc_creator - where the indexing type is defined in the - type attribute + where the indexing type is defined in the + type attribute (any value from the standard configuration - file default.idx will do). Finally, any + file default.idx will do). 
Finally, any text() node content recursively contained inside the index will be filtered through the appropriate char map for character normalization, and will be @@ -174,26 +174,26 @@ oai:JTRS:CP-3290---Volume-I will be literal, byte for byte without any form of character normalization, inserted into the index named oai:identifier, - the text + the text Kumar Krishen and *Calvin Burnham, Editors will be inserted using the w character normalization defined in default.idx into the index dc:creator (that is, after character - normalization the index will keep the individual words - kumar, krishen, + normalization the index will keep the individual words + kumar, krishen, and, calvin, burnham, and editors), and finally both the texts Proceedings of the 4th International Conference and Exhibition: - World Congress on Superconductivity - Volume I + World Congress on Superconductivity - Volume I and - Kumar Krishen and *Calvin Burnham, Editors + Kumar Krishen and *Calvin Burnham, Editors will be inserted into the index dc:all using - the same character normalization map w. + the same character normalization map w. Finally, this example configuration can be queried using &acro.pqf; - queries, either transported by &acro.z3950;, (here using a yaz-client) + queries, either transported by &acro.z3950;, (here using a yaz-client) open localhost:9999 @@ -236,14 +236,14 @@ ALVIS Record Model Configuration -
- ALVIS Indexing Configuration +
+ ALVIS Indexing Configuration As mentioned above, there can be only one indexing stylesheet, and configuration of the indexing process is a synonym of writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the - magic elements discussed in - . + magic elements discussed in + . Obviously, there are million of different ways to accomplish this task, and some comments and code snippets are in order to lead our Padawan's on the right track to the good side of the force. @@ -265,51 +265,51 @@ push type might be the only possible way to sort out deeply recursive input &acro.xml; formats. - + A pull stylesheet example used to index &acro.oai; harvested records could use some of the following template definitions: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + xmlns:z="http://indexdata.dk/zebra/xslt/1" + xmlns:oai="http://www.openarchives.org/&acro.oai;/2.0/" + xmlns:oai_dc="http://www.openarchives.org/&acro.oai;/2.0/oai_dc/" + xmlns:dc="http://purl.org/dc/elements/1.1/" + version="1.0"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ]]> @@ -317,29 +317,29 @@ Notice also, that the names and types of the indexes can be defined in the indexing &acro.xslt; stylesheet dynamically according to - content in the original &acro.xml; records, which has + content in the original &acro.xml; records, which has opportunities for great power and wizardry as well as grande - disaster. + disaster. The following excerpt of a push stylesheet - might + might be a good idea according to your strict control of the &acro.xml; input format (due to rigorous checking against well-defined and tight RelaxNG or &acro.xml; Schema's, for example): - - - - + + + + + ]]> - This template creates indexes which have the name of the working + This template creates indexes which have the name of the working node of any input &acro.xml; file, and assigns a '1' to the index. - The example query - find @attr 1=xyz 1 + The example query + find @attr 1=xyz 1 finds all files which contain at least one xyz &acro.xml; element. In case you can not control which element names the input files contain, you might ask for @@ -347,25 +347,25 @@ One variation over the theme dynamically created - indexes will definitely be unwise: + indexes will definitely be unwise: - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + ]]> Don't be tempted to cross @@ -373,19 +373,19 @@ to suffering and pain, and universal disintegration of your project schedule. -
+
-
- ALVIS Exchange Formats - +
+ ALVIS Exchange Formats + An exchange format can be anything which can be the outcome of an &acro.xslt; transformation, as far as the stylesheet is registered in the main Alvis &acro.xslt; filter configuration file, see . In principle anything that can be expressed in &acro.xml;, HTML, and - TEXT can be the output of a schema or - element set directive during search, as long as - the information comes from the + TEXT can be the output of a schema or + element set directive during search, as long as + the information comes from the original input record &acro.xml; &acro.dom; tree (and not the transformed and indexed &acro.xml;!!). @@ -394,49 +394,49 @@ indexer can be accessed during record retrieval. The following example is a summary of the possibilities: - - - - - - - - - - - - - - - - - - - - - + xmlns:z="http://indexdata.dk/zebra/xslt/1" + version="1.0"> + + + + + + + + + + + + + + + + + + + + ]]> -
+
-
- ALVIS Filter &acro.oai; Indexing Example - +
+ ALVIS Filter &acro.oai; Indexing Example + The source code tarball contains a working Alvis filter example in the directory examples/alvis-oai/, which - should get you started. + should get you started. More example data can be harvested from any &acro.oai; compliant server, - see details at the &acro.oai; + see details at the &acro.oai; http://www.openarchives.org/ web site, and the community - links at + links at http://www.openarchives.org/community/index.html. There is a tutorial @@ -448,7 +448,7 @@
- + @@ -462,7 +462,7 @@ sgml-always-quote-attributes:t sgml-indent-step:1 sgml-indent-data:t - sgml-parent-document: "zebra.xml" + sgml-parent-document: "idzebra.xml" sgml-local-catalogs: nil sgml-namecase-general:t End: diff --git a/doc/recordmodel-domxml.xml b/doc/recordmodel-domxml.xml index 5aab33e..16d7001 100644 --- a/doc/recordmodel-domxml.xml +++ b/doc/recordmodel-domxml.xml @@ -1,6 +1,6 @@ - + &acro.dom; &acro.xml; Record Model and Filter Module - + The record model described in this chapter applies to the fundamental, structured &acro.xml; @@ -10,174 +10,174 @@ releases of the &zebra; Information Server. - - + +
&acro.dom; Record Filter Architecture - - The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as - internal data model, and can therefore parse, index, and display - any &acro.xml; document type. It is well suited to work on - standardized &acro.xml;-based formats such as Dublin Core, MODS, METS, - MARCXML, OAI-PMH, RSS, and performs equally well on any other - non-standard &acro.xml; format. - - - A parser for binary &acro.marc; records based on the ISO2709 library - standard is provided, it transforms these to the internal - &acro.marcxml; &acro.dom; representation. Other binary document parsers - are planned to follow. - + + The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as + internal data model, and can therefore parse, index, and display + any &acro.xml; document type. It is well suited to work on + standardized &acro.xml;-based formats such as Dublin Core, MODS, METS, + MARCXML, OAI-PMH, RSS, and performs equally well on any other + non-standard &acro.xml; format. + + + A parser for binary &acro.marc; records based on the ISO2709 library + standard is provided, it transforms these to the internal + &acro.marcxml; &acro.dom; representation. Other binary document parsers + are planned to follow. + - - The &acro.dom; filter architecture consists of four - different pipelines, each being a chain of arbitrarily many successive - &acro.xslt; transformations of the internal &acro.dom; &acro.xml; - representations of documents. - + + The &acro.dom; filter architecture consists of four + different pipelines, each being a chain of arbitrarily many successive + &acro.xslt; transformations of the internal &acro.dom; &acro.xml; + representations of documents. + -
- &acro.dom; &acro.xml; filter architecture - - - - - - - - - - - [Here there should be a diagram showing the &acro.dom; &acro.xml; - filter architecture, but is seems that your - tool chain has not been able to include the diagram in this - document.] - - - -
- - - - &acro.dom; &acro.xml; filter pipelines overview - - - - Name - When - Description - Input - Output - - - - - - input - first - input parsing and initial - transformations to common &acro.xml; format - Input raw &acro.xml; record buffers, &acro.xml; streams and - binary &acro.marc; buffers - Common &acro.xml; &acro.dom; - - - extract - second - indexing term extraction - transformations - Common &acro.xml; &acro.dom; - Indexing &acro.xml; &acro.dom; - - - store - second - transformations before internal document - storage - Common &acro.xml; &acro.dom; - Storage &acro.xml; &acro.dom; - - - retrieve - third - multiple document retrieve transformations from - storage to different output - formats are possible - Storage &acro.xml; &acro.dom; - Output &acro.xml; syntax in requested formats - - - -
+
+      &acro.dom; &acro.xml; filter architecture
+      
+      
+       
+        
+       
+       
+        
+         
+          [Here there should be a diagram showing the &acro.dom; &acro.xml;
+          filter architecture, but it seems that your
+          tool chain has not been able to include the diagram in this
+          document.]
+         
+        
+       
+ + + + &acro.dom; &acro.xml; filter pipelines overview + + + + Name + When + Description + Input + Output + + + + + + input + first + input parsing and initial + transformations to common &acro.xml; format + Input raw &acro.xml; record buffers, &acro.xml; streams and + binary &acro.marc; buffers + Common &acro.xml; &acro.dom; + + + extract + second + indexing term extraction + transformations + Common &acro.xml; &acro.dom; + Indexing &acro.xml; &acro.dom; + + + store + second + transformations before internal document + storage + Common &acro.xml; &acro.dom; + Storage &acro.xml; &acro.dom; + + + retrieve + third + multiple document retrieve transformations from + storage to different output + formats are possible + Storage &acro.xml; &acro.dom; + Output &acro.xml; syntax in requested formats + + + +
- - The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and if supported on - your platform, even &acro.exslt;), it brings thus full &acro.xpath; - support to the indexing, storage and display rules of not only - &acro.xml; documents, but also binary &acro.marc; records. - -
+ + The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and if supported on + your platform, even &acro.exslt;), it brings thus full &acro.xpath; + support to the indexing, storage and display rules of not only + &acro.xml; documents, but also binary &acro.marc; records. + +
-
- &acro.dom; &acro.xml; filter pipeline configuration +
+ &acro.dom; &acro.xml; filter pipeline configuration The experimental, loadable &acro.dom; &acro.xml;/&acro.xslt; filter module - mod-dom.so + mod-dom.so is invoked by the zebra.cfg configuration statement recordtype.xml: dom.db/filter_dom_conf.xml - In this example the &acro.dom; &acro.xml; filter is configured to work - on all data files with suffix + In this example the &acro.dom; &acro.xml; filter is configured to work + on all data files with suffix *.xml, where the configuration file is found in the path db/filter_dom_conf.xml. The &acro.dom; &acro.xslt; filter configuration file must be valid &acro.xml;. It might look like this: - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + - ]]> + ]]> - The root &acro.xml; element <dom> and all other &acro.dom; - &acro.xml; filter elements are residing in the namespace - xmlns="http://indexdata.com/zebra-2.0". + The root &acro.xml; element <dom> and all other &acro.dom; + &acro.xml; filter elements are residing in the namespace + xmlns="http://indexdata.com/zebra-2.0". All pipeline definition elements - i.e. the - <input>, - <extract>, - <store>, and - <retrieve> elements - are optional. - Missing pipeline definitions are just interpreted - do-nothing identity pipelines. + <input>, + <extract>, + <store>, and + <retrieve> elements - are optional. + Missing pipeline definitions are just interpreted + do-nothing identity pipelines. - All pipeline definition elements may contain zero or more + All pipeline definition elements may contain zero or more ]]> &acro.xslt; transformation instructions, which are performed sequentially from top to bottom. @@ -188,80 +188,80 @@
- Input pipeline - - The <input> pipeline definition element - may contain either one &acro.xml; Reader definition - ]]>, used to split - an &acro.xml; collection input stream into individual &acro.xml; &acro.dom; - documents at the prescribed element level, - or one &acro.marc; binary - parsing instruction - ]]>, which defines - a conversion to &acro.marcxml; format &acro.dom; trees. The allowed values - of the inputcharset attribute depend on your - local iconv set-up. - - - Both input parsers deliver individual &acro.dom; &acro.xml; documents to the - following chain of zero or more - ]]> - &acro.xslt; transformations. At the end of this pipeline, the documents - are in the common format, used to feed both the - <extract> and + Input pipeline + + The <input> pipeline definition element + may contain either one &acro.xml; Reader definition + ]]>, used to split + an &acro.xml; collection input stream into individual &acro.xml; &acro.dom; + documents at the prescribed element level, + or one &acro.marc; binary + parsing instruction + ]]>, which defines + a conversion to &acro.marcxml; format &acro.dom; trees. The allowed values + of the inputcharset attribute depend on your + local iconv set-up. + + + Both input parsers deliver individual &acro.dom; &acro.xml; documents to the + following chain of zero or more + ]]> + &acro.xslt; transformations. At the end of this pipeline, the documents + are in the common format, used to feed both the + <extract> and <store> pipelines. - +
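
        As a sketch, an <input> definition for &acro.xml;
        streams, and an alternative one for binary &acro.marc; records,
        might look like this (the attribute values are illustrative and
        depend on your data):
        
         <input>
           <xmlreader level="1"/>
         </input>
         
         <input>
           <marc inputcharset="marc-8"/>
         </input>
        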
      Extract pipeline
      
        The <extract> pipeline takes documents
        from any common &acro.dom; &acro.xml; format to the &zebra; specific
        indexing &acro.dom; &acro.xml; format.
        It may consist of zero or more
        ]]>
        &acro.xslt; transformations, and the outcome is handed to the
        &zebra; core to drive the process of building the inverted
        indexes. See
         for
        details.
      
      Store pipeline
        The <store> pipeline takes documents
        from any common &acro.dom; &acro.xml; format to the &zebra; specific
        storage &acro.dom; &acro.xml; format.
        It may consist of zero or more
        ]]>
        &acro.xslt; transformations, and the outcome is handed to the
        &zebra; core for deposition into the internal storage system.
+ Store pipeline + The <store> pipeline takes documents + from any common &acro.dom; &acro.xml; format to the &zebra; specific + storage &acro.dom; &acro.xml; format. + It may consist of zero ore more + ]]> + &acro.xslt; transformations, and the outcome is handled to the + &zebra; core for deposition into the internal storage system. +
- Retrieve pipeline + Retrieve pipeline - Finally, there may be one or more - <retrieve> pipeline definitions, each - of them again consisting of zero or more - ]]> - &acro.xslt; transformations. These are used for document - presentation after search, and take the internal storage &acro.dom; - &acro.xml; to the requested output formats during record present - requests. + Finally, there may be one or more + <retrieve> pipeline definitions, each + of them again consisting of zero or more + ]]> + &acro.xslt; transformations. These are used for document + presentation after search, and take the internal storage &acro.dom; + &acro.xml; to the requested output formats during record present + requests. - The possible multiple + The possible multiple <retrieve> pipeline definitions are distinguished by their unique name - attributes, these are the literal schema or - element set names used in - &acro.srw;, - &acro.sru; and - &acro.z3950; protocol queries. - + attributes, these are the literal schema or + element set names used in + &acro.srw;, + &acro.sru; and + &acro.z3950; protocol queries. +
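
        For example, with a retrieve pipeline named dc defined in
        the filter configuration, a client can select it at present
        time via the element set name (a yaz-client sketch; the name
        is illustrative):
        
         Z> format xml
         Z> elements dc
         Z> show 1
        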
@@ -277,303 +277,303 @@ namespace xmlns:z="http://indexdata.com/zebra-2.0". -
- Processing-instruction governed indexing format - - The output of the processing instruction driven +
+ Processing-instruction governed indexing format + + The output of the processing instruction driven indexing &acro.xslt; stylesheets must contain - processing instructions named - zebra-2.0. + processing instructions named + zebra-2.0. The output of the &acro.xslt; indexing transformation is then parsed using &acro.dom; methods, and the contained instructions are performed on the elements and their - subtrees directly following the processing instructions. - - - For example, the output of the command - + subtrees directly following the processing instructions. + + + For example, the output of the command + xsltproc dom-index-pi.xsl marc-one.xml - - might look like this: - - - - - - 11224466 - - How to program a computer + + might look like this: + + + + + + 11224466 + + How to program a computer - ]]> - - -
+ ]]> + +
+
-
- Magic element governed indexing format - - The output of the indexing &acro.xslt; stylesheets must contain - certain elements in the magic - xmlns:z="http://indexdata.com/zebra-2.0" - namespace. The output of the &acro.xslt; indexing transformation is then - parsed using &acro.dom; methods, and the contained instructions are - performed on the magic elements and their - subtrees. - - - For example, the output of the command - - xsltproc dom-index-element.xsl marc-one.xml - - might look like this: - - - - 11224466 - - How to program a computer +
+ Magic element governed indexing format + + The output of the indexing &acro.xslt; stylesheets must contain + certain elements in the magic + xmlns:z="http://indexdata.com/zebra-2.0" + namespace. The output of the &acro.xslt; indexing transformation is then + parsed using &acro.dom; methods, and the contained instructions are + performed on the magic elements and their + subtrees. + + + For example, the output of the command + + xsltproc dom-index-element.xsl marc-one.xml + + might look like this: + + + + 11224466 + + How to program a computer - ]]> - - -
+ ]]> +
+
+
-
- Semantics of the indexing formats +
+ Semantics of the indexing formats - - Both indexing formats are defined with equal semantics and - behavior in mind: - + + Both indexing formats are defined with equal semantics and + behavior in mind: + - &zebra; specific instructions are either + &zebra; specific instructions are either processing instructions named zebra-2.0 or elements contained in the namespace xmlns:z="http://indexdata.com/zebra-2.0". - + - There must be exactly one record - instruction, which sets the scope for the following, - possibly nested index instructions. - + There must be exactly one record + instruction, which sets the scope for the following, + possibly nested index instructions. + - - The unique record instruction - may have additional attributes id, - rank and type. - Attribute id is the value of the opaque ID - and may be any string not containing the whitespace character - ' '. - The rank attribute value must be a - non-negative integer. See - . - The type attribute specifies how the record - is to be treated. The following values may be given for - type: - - - insert - - - The record is inserted. If the record already exists, it is - skipped (i.e. not replaced). - - - - - replace - - - The record is replaced. If the record does not already exist, - it is skipped (i.e. not inserted). - - - - - delete - - - The record is deleted. If the record does not already exist, - a warning issued and rest of records are skipped in - from the input stream. - - - - - update - - - The record is inserted or replaced depending on whether the - record exists or not. This is the default behavior but may - be effectively changed by "outside" the scope of the DOM - filter by zebraidx commands or extended services updates. - - - - - adelete - - - The record is deleted. If the record does not already exist, - it is skipped (i.e. nothing is deleted). - - - - Requires version 2.0.54 or later. - - - - - - Note that the value of type is only used to - determine the action if and only if the Zebra indexer is running - in "update" mode (i.e zebraidx update) or if the specialUpdate - action of the - Extended + + The unique record instruction + may have additional attributes id, + rank and type. + Attribute id is the value of the opaque ID + and may be any string not containing the whitespace character + ' '. + The rank attribute value must be a + non-negative integer. See + . + The type attribute specifies how the record + is to be treated. The following values may be given for + type: + + + insert + + + The record is inserted. If the record already exists, it is + skipped (i.e. not replaced). + + + + + replace + + + The record is replaced. If the record does not already exist, + it is skipped (i.e. not inserted). + + + + + delete + + + The record is deleted. If the record does not already exist, + a warning issued and rest of records are skipped in + from the input stream. + + + + + update + + + The record is inserted or replaced depending on whether the + record exists or not. This is the default behavior but may + be effectively changed by "outside" the scope of the DOM + filter by zebraidx commands or extended services updates. + + + + + adelete + + + The record is deleted. If the record does not already exist, + it is skipped (i.e. nothing is deleted). + + + + Requires version 2.0.54 or later. 
+ + + + + + Note that the value of type is only used to + determine the action if and only if the Zebra indexer is running + in "update" mode (i.e zebraidx update) or if the specialUpdate + action of the + Extended Service Update is used. - For this reason a specialUpdate may end up deleting records! - + For this reason a specialUpdate may end up deleting records! + - Multiple and possible nested index - instructions must contain at least one + Multiple and possible nested index + instructions must contain at least one indexname:indextype - pair, and may contain multiple such pairs separated by the + pair, and may contain multiple such pairs separated by the whitespace character ' '. In each index - pair, the name and the type of the index is separated by a + pair, the name and the type of the index is separated by a colon character ':'. - + - + Any index name consisting of ASCII letters, and following the - standard &zebra; rules will do, see + standard &zebra; rules will do, see . - + - + Index types are restricted to the values defined in the standard configuration file default.idx, see - and + and for details. - + - + &acro.dom; input documents which are not resulting in both one - unique valid - record instruction and one or more valid + unique valid + record instruction and one or more valid index instructions can not be searched and found. Therefore, invalid document processing is aborted, and any content of - the <extract> and + the <extract> and <store> pipelines is discarded. - A warning is issued in the logs. - + A warning is issued in the logs. + - - - The examples work as follows: - From the original &acro.xml; file - marc-one.xml (or from the &acro.xml; record &acro.dom; of the - same form coming from an <input> - pipeline), - the indexing - pipeline <extract> - produces an indexing &acro.xml; record, which is defined by - the record instruction - &zebra; uses the content of - z:id="11224466" - or - id=11224466 - as internal - record ID, and - in case static ranking is set - the content of - rank=42 - or - z:rank="42" - as static rank. - - + - In these examples, the following literal indexes are constructed: - + The examples work as follows: + From the original &acro.xml; file + marc-one.xml (or from the &acro.xml; record &acro.dom; of the + same form coming from an <input> + pipeline), + the indexing + pipeline <extract> + produces an indexing &acro.xml; record, which is defined by + the record instruction + &zebra; uses the content of + z:id="11224466" + or + id=11224466 + as internal + record ID, and - in case static ranking is set - the content of + rank=42 + or + z:rank="42" + as static rank. + + + + In these examples, the following literal indexes are constructed: + any:w control:0 title:w title:p title:s - - where the indexing type is defined after the - literal ':' character. - Any value from the standard configuration - file default.idx will do. - Finally, any - text() node content recursively contained - inside the <z:index> element, or any - element following a index processing instruction, - will be filtered through the - appropriate char map for character normalization, and will be - inserted in the named indexes. 
- - - Finally, this example configuration can be queried using &acro.pqf; - queries, either transported by &acro.z3950;, (here using a yaz-client) - - open localhost:9999 - Z> elem dc - Z> form xml - Z> - Z> find @attr 1=control @attr 4=3 11224466 - Z> scan @attr 1=control @attr 4=3 "" - Z> - Z> find @attr 1=title program - Z> scan @attr 1=title "" - Z> - Z> find @attr 1=title @attr 4=2 "How to program a computer" - Z> scan @attr 1=title @attr 4=2 "" - ]]> - - or the proprietary - extensions x-pquery and - x-pScanClause to - &acro.sru;, and &acro.srw; - - - - See for more information on &acro.sru;/&acro.srw; - configuration, and or the &yaz; - &acro.cql; section - for the details or the &yaz; frontend server. - - - Notice that there are no *.abs, - *.est, *.map, or other &acro.grs1; - filter configuration files involves in this process, and that the - literal index names are used during search and retrieval. - - - In case that we want to support the usual - bib-1 &acro.z3950; numeric access points, it is a - good idea to choose string index names defined in the default - configuration file tab/bib1.att, see - - - -
+ + where the indexing type is defined after the + literal ':' character. + Any value from the standard configuration + file default.idx will do. + Finally, any + text() node content recursively contained + inside the <z:index> element, or any + element following a index processing instruction, + will be filtered through the + appropriate char map for character normalization, and will be + inserted in the named indexes. + + + Finally, this example configuration can be queried using &acro.pqf; + queries, either transported by &acro.z3950;, (here using a yaz-client) + + open localhost:9999 + Z> elem dc + Z> form xml + Z> + Z> find @attr 1=control @attr 4=3 11224466 + Z> scan @attr 1=control @attr 4=3 "" + Z> + Z> find @attr 1=title program + Z> scan @attr 1=title "" + Z> + Z> find @attr 1=title @attr 4=2 "How to program a computer" + Z> scan @attr 1=title @attr 4=2 "" + ]]> + + or the proprietary + extensions x-pquery and + x-pScanClause to + &acro.sru;, and &acro.srw; + + + + See for more information on &acro.sru;/&acro.srw; + configuration, and or the &yaz; + &acro.cql; section + for the details or the &yaz; frontend server. + + + Notice that there are no *.abs, + *.est, *.map, or other &acro.grs1; + filter configuration files involves in this process, and that the + literal index names are used during search and retrieval. + + + In case that we want to support the usual + bib-1 &acro.z3950; numeric access points, it is a + good idea to choose string index names defined in the default + configuration file tab/bib1.att, see + + + +
@@ -583,14 +583,14 @@ &acro.dom; Record Model Configuration -
- &acro.dom; Indexing Configuration +
+ &acro.dom; Indexing Configuration As mentioned above, there can be only one indexing pipeline, and configuration of the indexing process is a synonym of writing an &acro.xslt; stylesheet which produces &acro.xml; output containing the - magic processing instructions or elements discussed in - . + magic processing instructions or elements discussed in + . Obviously, there are million of different ways to accomplish this task, and some comments and code snippets are in order to enlighten the wary. @@ -601,11 +601,11 @@ means that the output &acro.xml; structure is taken as starting point of the internal structure of the &acro.xslt; stylesheet, and portions of the input &acro.xml; are pulled out and inserted - into the right spots of the output &acro.xml; structure. + into the right spots of the output &acro.xml; structure. On the other side, push &acro.xslt; stylesheets are recursively calling their template definitions, a process which is commanded - by the input &acro.xml; structure, and is triggered to produce + by the input &acro.xml; structure, and is triggered to produce some output &acro.xml; whenever some special conditions in the input stylesheets are met. The pull type is well-suited for input @@ -614,187 +614,187 @@ push type might be the only possible way to sort out deeply recursive input &acro.xml; formats. - + A pull stylesheet example used to index &acro.oai; harvested records could use some of the following template definitions: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + xmlns:z="http://indexdata.com/zebra-2.0" + xmlns:oai="http://www.openarchives.org/&acro.oai;/2.0/" + xmlns:oai_dc="http://www.openarchives.org/&acro.oai;/2.0/oai_dc/" + xmlns:dc="http://purl.org/dc/elements/1.1/" + version="1.0"> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ]]> -
+
-
- &acro.dom; Indexing &acro.marcxml; +
+ &acro.dom; Indexing &acro.marcxml; - The &acro.dom; filter allows indexing of both binary &acro.marc; records - and &acro.marcxml; records, depending on its configuration. - A typical &acro.marcxml; record might look like this: - + The &acro.dom; filter allows indexing of both binary &acro.marc; records + and &acro.marcxml; records, depending on its configuration. + A typical &acro.marcxml; record might look like this: + - 42 - 00366nam 22001698a 4500 - 11224466 - DLC - 00000000000000.0 - 910710c19910701nju 00010 eng - - 11224466 - - - DLC - DLC - - - 123-xyz - - - Jack Collins - - - How to program a computer - - - Penguin - - - 8710 - - - p. cm. - - + 42 + 00366nam 22001698a 4500 + 11224466 + DLC + 00000000000000.0 + 910710c19910701nju 00010 eng + + 11224466 + + + DLC + DLC + + + 123-xyz + + + Jack Collins + + + How to program a computer + + + Penguin + + + 8710 + + + p. cm. + + ]]> - + - It is easily possible to make string manipulation in the &acro.dom; - filter. For example, if you want to drop some leading articles - in the indexing of sort fields, you might want to pick out the - &acro.marcxml; indicator attributes to chop of leading substrings. If - the above &acro.xml; example would have an indicator - ind2="8" in the title field - 245, i.e. - + It is easily possible to make string manipulation in the &acro.dom; + filter. For example, if you want to drop some leading articles + in the indexing of sort fields, you might want to pick out the + &acro.marcxml; indicator attributes to chop of leading substrings. If + the above &acro.xml; example would have an indicator + ind2="8" in the title field + 245, i.e. + - How to program a computer - + + How to program a computer + ]]> - - one could write a template taking into account this information - to chop the first 8 characters from the - sorting index title:s like this: - + + one could write a template taking into account this information + to chop the first 8 characters from the + sorting index title:s like this: + - - - 0 - - - - - - - - - - - - - + + + 0 + + + + + + + + + + + + + ]]> - - The output of the above &acro.marcxml; and &acro.xslt; excerpt would then be: - + + The output of the above &acro.marcxml; and &acro.xslt; excerpt would then be: + How to program a computer - program a computer + How to program a computer + program a computer ]]> - - and the record would be sorted in the title index under 'P', not 'H'. + + and the record would be sorted in the title index under 'P', not 'H'. -
+
-
- &acro.dom; Indexing Wizardry +
+ &acro.dom; Indexing Wizardry The names and types of the indexes can be defined in the indexing &acro.xslt; stylesheet dynamically according to - content in the original &acro.xml; records, which has + content in the original &acro.xml; records, which has opportunities for great power and wizardry as well as grande - disaster. + disaster. The following excerpt of a push stylesheet - might + might be a good idea according to your strict control of the &acro.xml; input format (due to rigorous checking against well-defined and tight RelaxNG or &acro.xml; Schema's, for example): - - - - + + + + + ]]> - This template creates indexes which have the name of the working + This template creates indexes which have the name of the working node of any input &acro.xml; file, and assigns a '1' to the index. - The example query - find @attr 1=xyz 1 + The example query + find @attr 1=xyz 1 finds all files which contain at least one xyz &acro.xml; element. In case you can not control which element names the input files contain, you might ask for @@ -802,25 +802,25 @@ One variation over the theme dynamically created - indexes will definitely be unwise: + indexes will definitely be unwise: - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + ]]> Don't be tempted to play too smart tricks with the power of @@ -828,104 +828,104 @@ indexes with unpredictable names, resulting in severe &zebra; index pollution.. -
- -
      Debugging &acro.dom; Filter Configurations
      
        It can be very hard to debug a &acro.dom; filter setup due to the many
        successive &acro.marc; syntax translations, &acro.xml; stream splitting and
        &acro.xslt; transformations involved. As an aid, you have always the
        power of the -s command line switch to the
        zebraidx indexing command at your hand:
        
        zebraidx -s -c zebra.cfg update some_record_stream.xml
        
        This command line simulates indexing and dumps a lot of debug
        information in the logs, telling exactly which transformations
        have been applied, what the documents look like after each
        transformation, and which record ids and terms are sent to the indexer.
      
+
- + --> - - + @@ -939,7 +939,7 @@ sgml-always-quote-attributes:t sgml-indent-step:1 sgml-indent-data:t - sgml-parent-document: "zebra.xml" + sgml-parent-document: "idzebra.xml" sgml-local-catalogs: nil sgml-namecase-general:t End: diff --git a/doc/recordmodel-grs.xml b/doc/recordmodel-grs.xml index c4ff6c7..853410a 100644 --- a/doc/recordmodel-grs.xml +++ b/doc/recordmodel-grs.xml @@ -1,13 +1,13 @@ &acro.grs1; Record Model and Filter Modules - - - The functionality of this record model has been improved and - replaced by the DOM &acro.xml; record model. See - . - - + + + The functionality of this record model has been improved and + replaced by the DOM &acro.xml; record model. See + . + + The record model described in this chapter applies to the fundamental, @@ -32,7 +32,7 @@ This is the canonical input format described . It is using - simple &acro.sgml;-like syntax. + simple &acro.sgml;-like syntax. @@ -41,7 +41,7 @@ This allows &zebra; to read - records in the ISO2709 (&acro.marc;) encoding standard. + records in the ISO2709 (&acro.marc;) encoding standard. Last parameter type names the .abs file (see below) which describes the specific &acro.marc; structure of the input record as @@ -55,8 +55,8 @@ use grs.marcxml filter instead (see below). - The loadable grs.marc filter module - is packaged in the GNU/Debian package + The loadable grs.marc filter module + is packaged in the GNU/Debian package libidzebra2.0-mod-grs-marc @@ -74,7 +74,7 @@ The internal representation for grs.marcxml is the same as for &acro.marcxml;. - It slightly more complicated to work with than + It slightly more complicated to work with than grs.marc but &acro.xml; conformant. @@ -90,7 +90,7 @@ This filter reads &acro.xml; records and uses Expat to - parse them and convert them into ID&zebra;'s internal + parse them and convert them into ID&zebra;'s internal grs record model. Only one record per file is supported, due to the fact &acro.xml; does not allow two documents to "follow" each other (there is no way @@ -101,7 +101,7 @@ The loadable grs.xml filter module is packaged in the GNU/Debian package libidzebra2.0-mod-grs-xml - + @@ -122,7 +122,7 @@ grs.tcl.filter - Similar to grs.regx but using Tcl for rules, described in + Similar to grs.regx but using Tcl for rules, described in . @@ -164,16 +164,16 @@ <Distributor> - <Name> USGS/WRD </Name> - <Organization> USGS/WRD </Organization> - <Street-Address> - U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW - </Street-Address> - <City> ALBUQUERQUE </City> - <State> NM </State> - <Zip-Code> 87102 </Zip-Code> - <Country> USA </Country> - <Telephone> (505) 766-5560 </Telephone> + <Name> USGS/WRD </Name> + <Organization> USGS/WRD </Organization> + <Street-Address> + U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW + </Street-Address> + <City> ALBUQUERQUE </City> + <State> NM </State> + <Zip-Code> 87102 </Zip-Code> + <Country> USA </Country> + <Telephone> (505) 766-5560 </Telephone> </Distributor> @@ -181,12 +181,12 @@ @@ -230,7 +230,7 @@ <gils> - <title>Zen and the Art of Motorcycle Maintenance</title> + <title>Zen and the Art of Motorcycle Maintenance</title> </gils> @@ -359,7 +359,7 @@ type regx, argument filter-filename). - + Generally, an input filter consists of a sequence of rules, where each rule consists of a sequence of expressions, followed by an action. The @@ -367,7 +367,7 @@ and the actions normally contribute to the generation of an internal representation of the record. 
- + An expression can be either of the following: @@ -415,7 +415,7 @@ Matches regular expression pattern reg from the input record. The operators supported are the same - as for regular expression queries. Refer to + as for regular expression queries. Refer to . @@ -467,7 +467,7 @@ data element. The type is one of the following: - + record @@ -568,10 +568,10 @@ /^Subject:/ BODY /$/ { data -element title $1 } /^Date:/ BODY /$/ { data -element lastModified $1 } /\n\n/ BODY END { - begin element bodyOfDisplay - begin variant body iana "text/plain" - data -text $1 - end record + begin element bodyOfDisplay + begin variant body iana "text/plain" + data -text $1 + end record }
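To make the rule structure concrete, a minimal sketch of such a mail filter could start as follows (the record and element names are illustrative); the BEGIN rule opens the record to which the per-line rules shown above then contribute data elements:

   BEGIN { begin record mail }

   /^From:/    BODY /$/ { data -element from $1 }
   /^Subject:/ BODY /$/ { data -element title $1 }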
@@ -604,9 +604,9 @@
- ROOT
- TITLE "Zen and the Art of Motorcycle Maintenance"
- AUTHOR "Robert Pirsig"
+ ROOT
+ TITLE "Zen and the Art of Motorcycle Maintenance"
+ AUTHOR "Robert Pirsig"
@@ -619,11 +619,11 @@
- ROOT
- TITLE "Zen and the Art of Motorcycle Maintenance"
- AUTHOR
- FIRST-NAME "Robert"
- SURNAME "Pirsig"
+ ROOT
+ TITLE "Zen and the Art of Motorcycle Maintenance"
+ AUTHOR
+ FIRST-NAME "Robert"
+ SURNAME "Pirsig"
@@ -687,38 +687,38 @@
 Which of the two elements is transmitted to the client by the server depends
 on the specifications provided by the client, if any.
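For illustration, a canonical-format input record that a filter could map to the second, nested representation might look like this (the element names are hypothetical):

   <book>
      <title>Zen and the Art of Motorcycle Maintenance</title>
      <author>
         <first-name>Robert</first-name>
         <surname>Pirsig</surname>
      </author>
   </book>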
- + In practice, each variant node is associated with a triple of class, type, value, corresponding to the variant mechanism of &acro.z3950;. - + - +
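The mail filter example above already shows how such a triple is attached to a node: in the action

   begin variant body iana "text/plain"

the class is body, the type is iana, and the value is "text/plain", marking the data that follows as a plain-text variant of the message body.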
Data Elements - + Data nodes have no children (they are always leaf nodes in the record tree). - + - +
- + - +
&acro.grs1; Record Model Configuration - + The following sections describe the configuration files that govern - the internal management of grs records. + the internal management of grs records. The system searches for the files in the directories specified by the profilePath setting in the zebra.cfg file. @@ -735,7 +735,7 @@ @@ -766,7 +766,7 @@ known. - + The variant set which is used in the profile. This provides a @@ -800,7 +800,7 @@ - + A list of element descriptions (this is the actual ARS of the schema, in &acro.z3950; terms), which lists the ways in which the various @@ -847,19 +847,19 @@ file. Some settings are optional (o), while others again are mandatory (m). - +
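For example, a zebra.cfg fragment pointing both zebraidx and zebrasrv at the profile files could read as follows (the directory names are illustrative; on a Unix-style setup the value is a colon-separated list of directories):

   profilePath: ./tab:/usr/local/share/idzebra/tab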
- +
The Abstract Syntax (.abs) Files - + The name of this file type is slightly misleading in &acro.z3950; terms, since, apart from the actual abstract syntax of the profile, it also includes most of the other definitions that go into a database profile. - + When a record in the canonical, &acro.sgml;-like format is read from a file or from the database, the first tag of the file should reference the @@ -867,7 +867,7 @@ record is, say, <gils>, the system will look for the profile definition in the file gils.abs. Profile definitions are cached, so they only have to be read once - during the lifespan of the current process. + during the lifespan of the current process. @@ -876,14 +876,14 @@ introduces the profile, and should always be called first thing when introducing a new record. - + The file may contain the following directives: - + - + name symbolic-name @@ -1003,7 +1003,7 @@ - + xelm xpath attributes @@ -1049,7 +1049,7 @@ file via a header this directive is ignored. If neither this directive is given, nor an encoding is set within external records, ISO-8859-1 encoding is assumed. - + @@ -1058,60 +1058,60 @@ If this directive is followed by enable, then extra indexing is performed to allow for XPath-like queries. - If this directive is not specified - equivalent to + If this directive is not specified - equivalent to disable - no extra XPath-indexing is performed. - @@ -1124,7 +1124,7 @@ Specifies what information, if any, &zebra; should - automatically include in retrieval records for the + automatically include in retrieval records for the ``system fields'' that it supports. systemTag may be any of the following: @@ -1132,24 +1132,24 @@ rank - An integer indicating the relevance-ranking score - assigned to the record. - + An integer indicating the relevance-ranking score + assigned to the record. + sysno - An automatically generated identifier for the record, - unique within this database. It is represented by the - <localControlNumber> element in - &acro.xml; and the (1,14) tag in &acro.grs1;. - + An automatically generated identifier for the record, + unique within this database. It is represented by the + <localControlNumber> element in + &acro.xml; and the (1,14) tag in &acro.grs1;. + size - The size, in bytes, of the retrieved record. - + The size, in bytes, of the retrieved record. + @@ -1162,7 +1162,7 @@ - + The mechanism for controlling indexing is not adequate for @@ -1170,7 +1170,7 @@ configuration table eventually. - + The following is an excerpt from the abstract syntax file for the GILS profile. @@ -1202,7 +1202,7 @@ elm (4,1) controlIdentifier Identifier-standard elm (2,6) abstract Abstract elm (4,51) purpose ! - elm (4,52) originator - + elm (4,52) originator - elm (4,53) accessConstraints ! elm (4,54) useConstraints ! elm (4,70) availability - @@ -1222,10 +1222,10 @@ This file type describes the Use elements of - an attribute set. - It contains the following directives. + an attribute set. + It contains the following directives. - + @@ -1273,7 +1273,7 @@ attribute value is stored in the index (unless a local-value is given, in which case this is stored). The name is used to refer to the - attribute from the abstract syntax. + attribute from the abstract syntax. @@ -1563,7 +1563,7 @@ otherwise is noted. - + The directives available in the element set file are as follows: @@ -1701,10 +1701,10 @@ @@ -1756,9 +1756,9 @@
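Putting a few of these directives together, a minimal .abs sketch for a profile whose records start with a <myprofile> tag (so the file would be named myprofile.abs; the profile name is illustrative, the elm lines are borrowed from the GILS excerpt above, and the attset line assumes the usual bib1.att file is available on the profilePath) might look like:

   name     myprofile
   attset   bib1.att
   encoding iso-8859-1
   xpath    enable

   elm (2,6)  abstract   Abstract
   elm (4,51) purpose    !
   elm (4,52) originator -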
@@ -1836,7 +1836,7 @@ - + SOIF. Support for this syntax is experimental, and is currently @@ -1846,48 +1846,48 @@ level. - + - +
Extended indexing of &acro.marc; records
- +
Extended indexing of &acro.marc; records will help you if you need to
index a combination of subfields, index only a part of a whole field,
or use embedded fields of a &acro.marc; record during the indexing process.
- +
Extended indexing of &acro.marc; records additionally allows:
- +
to index data in the LEADER of a &acro.marc; record
- +
to index data in control fields (with fixed length)
- +
to use the values of indicators during indexing
- +
to index linked fields for UNI&acro.marc;-based formats
- +
- +
Compared with the simple indexing process, extended indexing may
increase the indexing time for &acro.marc; records by a factor of
about 2-3.
- +
The index-formula
- +
First, we have to define the term index-formula
for &acro.marc; records. This term helps in understanding the notation of
extended indexing of &acro.marc; records by &zebra;.
Our definition is based on the document
"The table of conformity for &acro.z3950; use attributes and
R&acro.usmarc; fields".
The document is available only in Russian.
- +
An index-formula is a combination of subfields presented in
the following way:
- +
71-00$a, $g, $h ($c){.$b ($c)} , (1)
- +
We know that &zebra; supports the &acro.bib1; right-truncation attribute.
In this case, the index-formula (1) consists of
forms defined in the same way as (1):
- +
71-00$a, $g, $h
71-00$a, $g
71-00$a
- +
The original &acro.marc; record may lack some of the elements
included in the index-formula.
- +
This notation includes the following operands:
- +
#
Denotes a whitespace character.
- +
-
The position may contain any value defined by the
&acro.marc; format. For example, the index-formula
- +
70-#1$a, $g , (2)
- -
includes
- +
700#1$a, $g
701#1$a, $g
702#1$a, $g
- +
- +
{...}
Repeatable elements are enclosed in curly
brackets {}. For example, the index-formula
- +
71-00$a, $g, $h ($c){.$b ($c)} , (3)
- +
includes
- +
71-00$a, $g, $h ($c). $b ($c)
71-00$a, $g, $h ($c). $b ($c). $b ($c)
71-00$a, $g, $h ($c). $b ($c). $b ($c). $b ($c)
- +
- +
All other operands are the same as those commonly used in the
&acro.marc; world.
- +
Notation of <emphasis>index-formula</emphasis> for &zebra;
- -
Extended indexing overloads the path of the
elm definition in the &zebra; abstract syntax file
(.abs file). It means that names beginning with
"mc-" are interpreted by &zebra; as an
index-formula. The database index is created and linked with
an access point (a &acro.bib1; use attribute) according to this formula.
- +
For example, the index-formula
- +
71-00$a, $g, $h ($c){.$b ($c)} , (4)
- +
looks like this in the .abs file:
- +
mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)}
- -
+ +
The notation of an index-formula uses the following operands:
- +
_
Denotes a whitespace character.
- +
.
The position may contain any value defined by the
&acro.marc; format. For example, the index-formula
- +
70-#1$a, $g , (5)
- +
matches mc-70._1_$a,_$g_ and includes
- +
700_1_$a,_$g_
701_1_$a,_$g_
- +
{...}
Repeatable elements are enclosed in curly
brackets {}. For example, the index-formula
- +
71#00$a, $g, $h ($c) {.$b ($c)} , (6)
- -
matches
mc-71.00_$a,_$g,_$h_(_$c_){.$b_(_$c_)} and
includes
- +
71.00_$a,_$g,_$h_(_$c_).$b_(_$c_)
71.00_$a,_$g,_$h_(_$c_).$b_(_$c_).$b_(_$c_)
- +
<...>
An embedded index-formula (for linked fields) is enclosed
in <>. For example, the index-formula
- +
4--#-$170-#1$a, $g ($c) , (7)
- +
matches mc-4.._._$1<70._1_$a,_$g_(_$c_)>_ and
includes
- +
463_._$1<70._1_$a,_$g_(_$c_)>_
- +
- +
All other operands are the same as those commonly used in the
&acro.marc; world.
- +
Examples
- +
- +
- +
indexing LEADER
- +
You need to use the keyword "ldr" to index the leader. For example,
to index the data in the 6th and 7th positions of the LEADER:
- +
elm mc-ldr[6] Record-type !
elm mc-ldr[7] Bib-level !
- +
- +
- +
indexing data from control fields
- +
indexing the date (the time the record was added to the database)
- +
- elm mc-008[0-5] Date/time-added-to-db !
- +
or for R&acro.usmarc; (these data are included in field 100)
- +
elm mc-100___$a[0-7]_ Date/time-added-to-db !
- +
- +
- +
using indicators while indexing
For R&acro.usmarc; the index-formula
70-#1$a, $g
matches
- +
elm 70._1_$a,_$g_ Author !:w,!:p
- -
When &zebra; finds a field matching the
"70." pattern it checks the indicators. In this case
the value of the first indicator doesn't matter, but the value of the
second one must be whitespace; otherwise the field is not
indexed.
- +
- +
indexing embedded (linked) fields for UNI&acro.marc; based formats
- -
For R&acro.usmarc; the index-formula
4--#-$170-#1$a, $g ($c)
matches
- +
_ Author !:w,!:p
]]>
- +
Data are extracted from the record if the field matches the
"4.._." pattern and the data in the linked
field match the embedded index-formula
70._1_$a,_$g_(_$c_).
- +
- +
- -
+ +
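After adding such mc- definitions to an .abs file, the records must be re-indexed before the new indexes become searchable, for instance (assuming a rusmarc.abs profile is available on the profilePath; the directory name is illustrative):

   zebraidx -c zebra.cfg -t grs.marc.rusmarc update marc-records/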
- +