X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;ds=inline;f=doc%2Fadministration.xml;h=b95db6619112ccf3c6a809d3822e27eec6e4b30b;hb=HEAD;hp=5e3b545b96db5681dc7ccda11394fd541233336c;hpb=51fc70e752ec936e8815d639b34dd0fab17e0aab;p=idzebra-moved-to-github.git
diff --git a/doc/administration.xml b/doc/administration.xml
index 5e3b545..b95db66 100644
--- a/doc/administration.xml
+++ b/doc/administration.xml
@@ -1,849 +1,1869 @@
-
-
- Administrating Zebra
-
-
- Unlike many simpler retrieval systems, Zebra supports safe, incremental
- updates to an existing index.
-
-
-
- Normally, when Zebra modifies the index it reads a number of records
- that you specify.
- Depending on your specifications and on the contents of each record
- one of the following events takes place for each record:
-
-
-
- Insert
-
-
- The record is indexed as if it never occurred before.
- Either the Zebra system doesn't know how to identify the record or
- Zebra can identify the record but didn't find it to be already indexed.
-
-
-
-
- Modify
-
-
- The record has already been indexed.
- In this case either the contents of the record or the location
- (file) of the record indicates that it has been indexed before.
-
-
-
-
- Delete
-
-
- The record is deleted from the index. As in the
- modify case, the indexer must be able to identify the record.
-
-
-
-
-
-
-
- Please note that in both the modify and delete cases the Zebra
- indexer must be able to generate a unique key that identifies the record
- in question (more on this below).
-
-
-
- To administrate the Zebra retrieval system, you run the
- zebraidx program.
- This program supports a number of options which are preceded by a dash,
- and a few commands (not preceded by dash).
-
-
-
- Both the Zebra administrative tool and the Z39.50 server share a
- set of index files and a global configuration file.
- The name of the configuration file defaults to
- zebra.cfg.
- The configuration file includes specifications on how to index
- various kinds of records and where the other configuration files
- are located. zebrasrv and zebraidx
- must be run in the directory where the
- configuration file lives unless you indicate the location of the
- configuration file by option -c.
-
-
-
- Record Types
-
-
- Indexing is a per-record process, in which each record is either
- inserted, modified, or deleted. Before a record is indexed, search keys
- are extracted from whatever the layout of the original record might be
- (SGML, HTML, text, etc.).
- The Zebra system currently supports two fundamental types of records:
- structured and simple text.
- To specify a particular extraction process, use either the
- command line option -t or specify a
- recordType setting in the configuration file.
-
-
-
-
-
- The Zebra Configuration File
-
-
- The Zebra configuration file, read by zebraidx and
- zebrasrv defaults to zebra.cfg
- unless specified with the -c option.
-
-
-
- You can edit the configuration file with a normal text editor.
- Parameter names and values are separated by colons in the file. Lines
- starting with a hash sign (#) are
- treated as comments.
-
-
-
- If you manage different sets of records that share common
- characteristics, you can organize the configuration settings for each
- type into "groups".
- When zebraidx is run and you wish to address a
- given group you specify the group name with the -g
- option.
- In this case settings that have the group name as their prefix
- will be used by zebraidx.
- If no -g option is specified, the settings
- without prefix are used.
-
-
-
- In the configuration file, the group name is placed before the option
- name itself, separated by a dot (.). For instance, to set the record type
- for group public to grs.sgml
- (the SGML-like format for structured records) you would write:
-
-
-
-
- public.recordType: grs.sgml
-
-
-
-
- To set the default value of the record type to text
- write:
-
-
-
-
- recordType: text
-
-
-
+
+ Administrating &zebra;
+
+
- The available configuration settings are summarized below. They will be
- explained further in the following sections.
+ Unlike many simpler retrieval systems, &zebra; supports safe, incremental
+ updates to an existing index.
-
-
-
+
+ Normally, when &zebra; modifies the index it reads a number of records
+ that you specify.
+ Depending on your specifications and on the contents of each record
+ one of the following events takes place for each record:
-
-
-
- group
- .recordType[.name]:
- type
-
-
-
- Specifies how records with the file extension
- name should be handled by the indexer.
- This option may also be specified as a command line option
- (-t). Note that if you do not specify a
- name, the setting applies to all files.
- In general, the record type specifier consists of the elements (each
- element separated by dot), fundamental-type,
- file-read-type and arguments. Currently, two
- fundamental types exist, text and
- grs.
-
-
-
-
- group.recordId:
- record-id-spec
-
-
- Specifies how the records are to be identified when updated. See
- .
-
-
-
-
- group.database:
- database
-
-
- Specifies the Z39.50 database name.
-
-
-
-
-
- group.storeKeys:
- boolean
-
-
- Specifies whether key information should be saved for a given
- group of records. If you plan to update/delete this type of
- records later this should be specified as 1; otherwise it
- should be 0 (default), to save register space.
- See .
-
-
-
-
- group.storeData:
- boolean
-
-
- Specifies whether the records should be stored internally
- in the Zebra system files.
- If you want to maintain the raw records yourself,
- this option should be false (0).
- If you want Zebra to take care of the records for you, it
- should be true (1).
-
-
-
-
- register: register-location
-
-
- Specifies the location of the various register files that Zebra uses
- to represent your databases.
- See .
-
-
-
-
- shadow: register-location
-
-
- Enables the safe update facility of Zebra, and
- tells the system where to place the required, temporary files.
- See .
-
-
-
-
- lockDir: directory
-
-
- Directory in which various lock files are stored.
-
-
-
-
- keyTmpDir: directory
-
-
- Directory in which temporary files used during zebraidx's update
- phase are stored.
-
-
-
-
- setTmpDir: directory
-
-
- Specifies the directory that the server uses for temporary result sets.
- If not specified /tmp will be used.
-
-
-
-
- profilePath: path
-
-
- Specifies a path of profile specification files.
- The path is composed of one or more directories separated by
- colon. Similar to PATH for UNIX systems.
-
-
-
+
- attset: filename
+ Insert
- Specifies the filename(s) of attribute set files for use in
- searching. At least the Bib-1 set should be loaded
- (bib1.att).
- The profilePath setting is used to look for
- the specified files.
- See
+ The record is indexed as if it never occurred before.
+ Either the &zebra; system doesn't know how to identify the record or
+ &zebra; can identify the record but didn't find it to be already indexed.
- memMax: size
+ Modify
- Specifies size of internal memory
- to use for the zebraidx program.
- The amount is given in megabytes - default is 4 (4 MB).
+ The record has already been indexed.
+ In this case either the contents of the record or the location
+ (file) of the record indicates that it has been indexed before.
-
- root: dir
+ Delete
- Specifies a directory base for Zebra. All relative paths
- given (in profilePath, register, shadow) are based on this
- directory. This setting is useful if your Zebra server
- is running in a different directory from where
- zebra.cfg is located.
+ The record is deleted from the index. As in the
+ modify case, the indexer must be able to identify the record.
-
-
-
-
-
- Locating Records
-
-
- The default behavior of the Zebra system is to reference the
- records from their original location, i.e. where they were found when you
- ran zebraidx.
- That is, when a client wishes to retrieve a record
- following a search operation, the files are accessed from the place
- where you originally put them - if you remove the files (without
- running zebraidx again), the client
- will receive a diagnostic message.
-
-
-
- If your input files are not permanent - for example if you retrieve
- your records from an outside source, or if they were temporarily
- mounted on a CD-ROM drive,
- you may want Zebra to make an internal copy of them. To do this,
- you specify 1 (true) in the storeData setting. When
- the Z39.50 server retrieves the records they will be read from the
- internal file structures of the system.
-
-
-
-
-
- Indexing with no Record IDs (Simple Indexing)
-
-
- If you have a set of records that are not expected to change over time
- you can build your database without record IDs.
- This indexing method uses less space than the other methods and
- is simple to use.
-
-
-
- To use this method, you simply omit the recordId entry
- for the group of files that you index. To add a set of records you use
- zebraidx with the update command. The
- update command will always add all of the records that it
- encounters to the index - whether they have already been indexed or
- not. If the set of indexed files changes, you should delete all of the
- index files, and build a new index from scratch.
-
-
-
- Consider a system in which you have a group of text files called
- simple.
- That group of records should belong to a Z39.50 database called
- textbase.
- The following zebra.cfg file will suffice:
-
-
-
-
- profilePath: /usr/local/yaz
- attset: bib1.att
- simple.recordType: text
- simple.database: textbase
-
-
-
- Since the existing records in an index cannot be addressed by their
- IDs, it is impossible to delete or modify records when using this method.
+ Please note that in both the modify and delete cases the &zebra;
+ indexer must be able to generate a unique key that identifies the record
+ in question (more on this below).
-
-
-
-
- Indexing with File Record IDs
-
-
- If you have a set of files that regularly change over time: Old files
- are deleted, new ones are added, or existing files are modified, you
- can benefit from using the file ID
- indexing methodology.
- Examples of this type of database might include an index of WWW
- resources, or a USENET news spool area.
- Briefly speaking, the file key methodology uses the directory paths
- of the individual records as a unique identifier for each record.
- To perform indexing of a directory with file keys, again, you specify
- the top-level directory after the update command.
- The command will recursively traverse the directories and compare
- each one with whatever has been indexed before in that same directory.
- If a file is new (not in the previous version of the directory) it
- is inserted into the registers; if a file was already indexed and
- it has been modified since the last update, the index is also
- modified; if a file has been removed since the last
- visit, it is deleted from the index.
-
-
+
- The resulting system is easy to administrate. To delete a record you
- simply have to delete the corresponding file (say, with the
- rm command). And to add records you create new
- files (or directories with files). For your changes to take effect
- in the register you must run zebraidx update with
- the same directory root again. This mode of operation requires more
- disk space than simpler indexing methods, but it makes it easier for
- you to keep the index in sync with a frequently changing set of data.
- If you combine this system with the safe update
- facility (see below), you never have to take your server off-line for
- maintenance or register updating purposes.
+ To administrate the &zebra; retrieval system, you run the
+ zebraidx program.
+ This program supports a number of options which are preceded by a dash,
+ and a few commands (not preceded by dash).
-
+
- To enable indexing with pathname IDs, you must specify
- file as the value of recordId
- in the configuration file. In addition, you should set
- storeKeys to 1, since the Zebra
- indexer must save additional information about the contents of each record
- in order to modify the indexes correctly at a later time.
+ Both the &zebra; administrative tool and the &acro.z3950; server share a
+ set of index files and a global configuration file.
+ The name of the configuration file defaults to
+ zebra.cfg.
+ The configuration file includes specifications on how to index
+ various kinds of records and where the other configuration files
+ are located. zebrasrv and zebraidx
+ must be run in the directory where the
+ configuration file lives unless you indicate the location of the
+ configuration file by option -c.
-
+
+
+ Record Types
+
+
+ Indexing is a per-record process, in which each record is either
+ inserted, modified, or deleted. Before a record is indexed, search keys
+ are extracted from whatever the layout of the original record might be
+ (SGML, HTML, text, etc.).
+ The &zebra; system currently supports two fundamental types of records:
+ structured and simple text.
+ To specify a particular extraction process, use either the
+ command line option -t or specify a
+ recordType setting in the configuration file.
+
+
+
+
+
+ The &zebra; Configuration File
+
+
+ The &zebra; configuration file, read by zebraidx and
+ zebrasrv defaults to zebra.cfg
+ unless specified with the -c option.
+
+
+
+ You can edit the configuration file with a normal text editor.
+ Parameter names and values are separated by colons in the file. Lines
+ starting with a hash sign (#) are
+ treated as comments.
+
+
+
+ If you manage different sets of records that share common
+ characteristics, you can organize the configuration settings for each
+ type into "groups".
+ When zebraidx is run and you wish to address a
+ given group you specify the group name with the -g
+ option.
+ In this case settings that have the group name as their prefix
+ will be used by zebraidx.
+ If no -g option is specified, the settings
+ without prefix are used.
+
+
+
+ In the configuration file, the group name is placed before the option
+ name itself, separated by a dot (.). For instance, to set the record type
+ for group public to grs.sgml
+ (the &acro.sgml;-like format for structured records) you would write:
+
+
+
+
+ public.recordType: grs.sgml
+
+
+
+
+ To set the default value of the record type to text
+ write:
+
+
+
+
+ recordType: text
+
+
+
+
+ The available configuration settings are summarized below. They will be
+ explained further in the following sections.
+
+
+ FIXME - Didn't Adam make something to have multiple databases in multiple dirs...
+ -->
-
- For example, to update records of group esdd
- located below
- /data1/records/ you should type:
-
- $ zebraidx -g esdd update /data1/records
-
-
-
-
- The corresponding configuration file includes:
-
- esdd.recordId: file
- esdd.recordType: grs.sgml
- esdd.storeKeys: 1
-
-
-
-
- You cannot start out with a group of records with simple
- indexing (no record IDs as in the previous section) and then later
- enable file record IDs. Zebra must know from the first time that you
- index the group that
- the files should be indexed with file record IDs.
+
+
+
+
+
+ group
+ .recordType[.name]:
+ type
+
+
+
+ Specifies how records with the file extension
+ name should be handled by the indexer.
+ This option may also be specified as a command line option
+ (-t). Note that if you do not specify a
+ name, the setting applies to all files.
+ In general, the record type specifier consists of the elements (each
+ element separated by dot), fundamental-type,
+ file-read-type and arguments. Currently, two
+ fundamental types exist, text and
+ grs.
+
+
+
+
+ group.recordId:
+ record-id-spec
+
+
+ Specifies how the records are to be identified when updated. See
+ .
+
+
+
+
+ group.database:
+ database
+
+
+ Specifies the &acro.z3950; database name.
+
+
+
+
+
+ group.storeKeys:
+ boolean
+
+
+ Specifies whether key information should be saved for a given
+ group of records. If you plan to update/delete this type of
+ records later this should be specified as 1; otherwise it
+ should be 0 (default), to save register space.
+
+ See .
+
+
+
+
+ group.storeData:
+ boolean
+
+
+ Specifies whether the records should be stored internally
+ in the &zebra; system files.
+ If you want to maintain the raw records yourself,
+ this option should be false (0).
+ If you want &zebra; to take care of the records for you, it
+ should be true (1).
+
+
+
+
+
+ register: register-location
+
+
+ Specifies the location of the various register files that &zebra; uses
+ to represent your databases.
+ See .
+
+
+
+
+ shadow: register-location
+
+
+ Enables the safe update facility of &zebra;, and
+ tells the system where to place the required, temporary files.
+ See .
+
+
+
+
+ lockDir: directory
+
+
+ Directory in which various lock files are stored.
+
+
+
+
+ keyTmpDir: directory
+
+
+ Directory in which temporary files used during zebraidx's update
+ phase are stored.
+
+
+
+
+ setTmpDir: directory
+
+
+ Specifies the directory that the server uses for temporary result sets.
+ If not specified /tmp will be used.
+
+
+
+
+ profilePath: path
+
+
+ Specifies a path of profile specification files.
+ The path is composed of one or more directories separated by
+ colon. Similar to PATH for UNIX systems.
+
+
+
+
+
+ modulePath: path
+
+
+ Specifies a path of record filter modules.
+ The path is composed of one or more directories separated by
+ colon. Similar to PATH for UNIX systems.
+ The 'make install' procedure typically puts modules in
+ /usr/local/lib/idzebra-2.0/modules.
+
+
+
+
+
+ index: filename
+
+
+ Defines the filename which holds field structure
+ definitions. If omitted, the file default.idx
+ is read.
+ Refer to for
+ more information.
+
+
+
+
+
+ sortmax: integer
+
+
+ Specifies the maximum number of records that will be sorted
+ in a result set. If the result set contains more than
+ integer records, records after the
+ limit will not be sorted. If omitted, the default value is
+ 1,000.
+
+
+
+
+
+ staticrank: integer
+
+
+ Specifies whether static ranking is enabled (1) or
+ disabled (0). If omitted, it is disabled - corresponding
+ to a value of 0.
+ Refer to .
+
+
+
+
+
+
+ estimatehits: integer
+
+
+ Controls whether &zebra; should calculate approximate hit counts and
+ at which hit count it is to be enabled.
+ A value of 0 disables approximate hit counts.
+ For a positive value, approximate hit counts are enabled
+ when the count is known to be larger than integer.
+
+
+ Approximate hit counts can also be triggered by a particular
+ attribute in a query.
+ Refer to .
+
+
+
+
+
+ attset: filename
+
+
+ Specifies the filename(s) of attribute set files for use in
+ searching. In many configurations bib1.att
+ is used, but that is not required. If Classic Explain
+ attributes are to be used for searching,
+ explain.att must be given.
+ The path to att-files can in general be given using the
+ profilePath setting.
+ See also .
+
+
+
+
+ memMax: size
+
+
+ Specifies size of internal memory
+ to use for the zebraidx program.
+ The amount is given in megabytes - default is 4 (4 MB).
+ The more memory, the faster large updates happen, up to about
+ half the free memory available on the computer.
+
+
+
+
+ tempfiles: Yes/Auto/No
+
+
+ Tells &zebra; whether it should use temporary files when indexing. The
+ default is Auto, in which case &zebra; uses temporary files only
+ if it would need more than memMax
+ megabytes of memory. This should be good for most uses.
+
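Taken together, the two indexing-resource settings above can be sketched in a zebra.cfg fragment; the values here are illustrative only, not recommendations:

```
# zebra.cfg fragment (illustrative values)
# let zebraidx use up to 64 MB for key buffering; fall back to
# temporary files automatically if more would be needed
memMax: 64
tempfiles: Auto
```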
+
+
+
+
+ root: dir
+
+
+ Specifies a directory base for &zebra;. All relative paths
+ given (in profilePath, register, shadow) are based on this
+ directory. This setting is useful if your &zebra; server
+ is running in a different directory from where
+ zebra.cfg is located.
+
+
+
+
+
+ passwd: file
+
+
+ Specifies a file with description of user accounts for &zebra;.
+ The format is similar to that known to Apache's htpasswd files
+ and UNIX passwd files. Non-empty lines not beginning with
+ # are considered account lines. There is one account per line.
+ A line consists of fields separated by a single colon character.
+ The first field is the username, the second the password.
+
+
+
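A sketch of such an account file, following the one-account-per-line, colon-separated format described above (the usernames and passwords are invented for illustration):

```
# zebra.passwd - username:password, one account per line
admin:s3cret
guest:guest
```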
+
+
+ passwd.c: file
+
+
+ Specifies a file with description of user accounts for &zebra;.
+ File format is similar to that used by the passwd directive except
+ that the password are encrypted. Use Apache's htpasswd or similar
+ for maintenance.
+
+
+
+
+
+ perm.user:
+ permstring
+
+
+ Specifies permissions (privileges) for a user who is allowed
+ to access &zebra; via the passwd system. There are currently two
+ kinds of permissions: read (r) and write (w). By default,
+ users not listed in a permission directive are given the read
+ privilege. To specify permissions for a user with no
+ username (&acro.z3950; anonymous style), use
+ anonymous. The permstring consists of
+ a sequence of characters. Include character w
+ for write/update access, r for read access and
+ a to allow anonymous access through this account.
+
+
+
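For instance (the account names are hypothetical), read-only anonymous access combined with a read/write account could be declared as:

```
# anonymous users may read; 'admin' may read and update
perm.anonymous: r
perm.admin: rw
```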
+
+
+ dbaccess: accessfile
+
+
+ Names a file which lists database subscriptions for individual users.
+ The access file should consist of lines of the form
+ username: dbnames, where dbnames is a list of
+ database names, separated by '+'. No whitespace is allowed in the
+ database list.
+
+
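A minimal sketch of such an access file; the usernames and database names are invented:

```
# username: database names joined by '+', no whitespace in the list
alice: textbase+gils
bob: textbase
```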
+
+
+
+ encoding: charsetname
+
+
+ Tells &zebra; to interpret the terms in Z39.50 queries as
+ having been encoded using the specified character
+ encoding. The default is ISO-8859-1; one
+ useful alternative is UTF-8.
+
+
+
+
+
+ storeKeys: value
+
+
+ Specifies whether &zebra; keeps a copy of indexed keys.
+ Use a value of 1 to enable; 0 to disable. If the storeKeys setting is
+ omitted, it is enabled. Enabled storeKeys
+ are required for updating and deleting records. Disable
+ storeKeys only to save space, and only if you plan to index the data once.
+
+
+
+
+
+ storeData: value
+
+
+ Specifies whether &zebra; keeps a copy of indexed records.
+ Use a value of 1 to enable; 0 to disable. If the storeData setting is
+ omitted, it is enabled. A storeData setting of 0 (disabled) makes
+ Zebra fetch records from their original location in the file
+ system using the filename, file offset and file length. For the
+ DOM and ALVIS filters, the storeData setting is ignored.
+
+
+
+
+
+
+
+
+
+ Locating Records
+
+
+ The default behavior of the &zebra; system is to reference the
+ records from their original location, i.e. where they were found when you
+ ran zebraidx.
+ That is, when a client wishes to retrieve a record
+ following a search operation, the files are accessed from the place
+ where you originally put them - if you remove the files (without
+ running zebraidx again), the server will return
+ diagnostic number 14 (``System error in presenting records'') to
+ the client.
+
+
+
+ If your input files are not permanent - for example if you retrieve
+ your records from an outside source, or if they were temporarily
+ mounted on a CD-ROM drive,
+ you may want &zebra; to make an internal copy of them. To do this,
+ you specify 1 (true) in the storeData setting. When
+ the &acro.z3950; server retrieves the records they will be read from the
+ internal file structures of the system.
+
+
+
+
+
+ Indexing with no Record IDs (Simple Indexing)
+
+
+ If you have a set of records that are not expected to change over time
+ you can build your database without record IDs.
+ This indexing method uses less space than the other methods and
+ is simple to use.
+
+
+
+ To use this method, you simply omit the recordId entry
+ for the group of files that you index. To add a set of records you use
+ zebraidx with the update command. The
+ update command will always add all of the records that it
+ encounters to the index - whether they have already been indexed or
+ not. If the set of indexed files changes, you should delete all of the
+ index files, and build a new index from scratch.
+
+
+
+ Consider a system in which you have a group of text files called
+ simple.
+ That group of records should belong to a &acro.z3950; database called
+ textbase.
+ The following zebra.cfg file will suffice:
+
+
+
+
+ profilePath: /usr/local/idzebra/tab
+ attset: bib1.att
+ simple.recordType: text
+ simple.database: textbase
+
+
+
+
+
+ Since the existing records in an index cannot be addressed by their
+ IDs, it is impossible to delete or modify records when using this method.
+
+
+
+
+
+ Indexing with File Record IDs
+
+
+ If you have a set of files that regularly change over time: Old files
+ are deleted, new ones are added, or existing files are modified, you
+ can benefit from using the file ID
+ indexing methodology.
+ Examples of this type of database might include an index of WWW
+ resources, or a USENET news spool area.
+ Briefly speaking, the file key methodology uses the directory paths
+ of the individual records as a unique identifier for each record.
+ To perform indexing of a directory with file keys, again, you specify
+ the top-level directory after the update command.
+ The command will recursively traverse the directories and compare
+ each one with whatever has been indexed before in that same directory.
+ If a file is new (not in the previous version of the directory) it
+ is inserted into the registers; if a file was already indexed and
+ it has been modified since the last update, the index is also
+ modified; if a file has been removed since the last
+ visit, it is deleted from the index.
+
+
+
+ The resulting system is easy to administrate. To delete a record you
+ simply have to delete the corresponding file (say, with the
+ rm command). And to add records you create new
+ files (or directories with files). For your changes to take effect
+ in the register you must run zebraidx update with
+ the same directory root again. This mode of operation requires more
+ disk space than simpler indexing methods, but it makes it easier for
+ you to keep the index in sync with a frequently changing set of data.
+ If you combine this system with the safe update
+ facility (see below), you never have to take your server off-line for
+ maintenance or register updating purposes.
+
+
+
+ To enable indexing with pathname IDs, you must specify
+ file as the value of recordId
+ in the configuration file. In addition, you should set
+ storeKeys to 1, since the &zebra;
+ indexer must save additional information about the contents of each record
+ in order to modify the indexes correctly at a later time.
+
+
+
+
+
+ For example, to update records of group esdd
+ located below
+ /data1/records/ you should type:
+
+ $ zebraidx -g esdd update /data1/records
+
+
+
+
+ The corresponding configuration file includes:
+
+ esdd.recordId: file
+ esdd.recordType: grs.sgml
+ esdd.storeKeys: 1
+
+
+
+
+ You cannot start out with a group of records with simple
+ indexing (no record IDs as in the previous section) and then later
+ enable file record IDs. &zebra; must know from the first time that you
+ index the group that
+ the files should be indexed with file record IDs.
+
-
-
- You cannot explicitly delete records when using this method (using the
- delete command to zebraidx). Instead
- you have to delete the files from the file system (or move them to a
- different location)
- and then run zebraidx with the
- update command.
-
-
-
-
- Indexing with General Record IDs
-
-
- When using this method you construct an (almost) arbitrary, internal
- record key based on the contents of the record itself and other system
- information. If you have a group of records that explicitly associates
- an ID with each record, this method is convenient. For example, the
- record format may contain a title or a ID-number - unique within the group.
- In either case you specify the Z39.50 attribute set and use-attribute
- location in which this information is stored, and the system looks at
- that field to determine the identity of the record.
-
-
-
- As before, the record ID is defined by the recordId
- setting in the configuration file. The value of the record ID specification
- consists of one or more tokens separated by whitespace. The resulting
- ID is represented in the index by concatenating the tokens and
- separating them by ASCII value (1).
-
-
-
- There are three kinds of tokens:
-
-
-
- Internal record info
-
-
- The token refers to a key that is
- extracted from the record. The syntax of this token is
- ( set ,
- use ),
- where set is the
- attribute set name and use is the
- name or value of the attribute.
-
-
-
-
- System variable
-
-
- The system variables are preceded by
-
-
- $
-
- and immediately followed by the system variable name, which
- may be one of
-
-
-
- group
-
-
- Group name.
-
-
-
-
- database
-
-
- Current database specified.
-
-
-
-
- type
-
-
- Record type.
-
-
-
-
-
-
-
-
- Constant string
-
-
- A string used as part of the ID — surrounded
- by single- or double quotes.
-
-
-
-
-
-
-
- For instance, the sample GILS records that come with the Zebra
- distribution contain a unique ID in the data tagged Control-Identifier.
- The data is mapped to the Bib-1 use attribute Identifier-standard
- (code 1007). To use this field as a record id, specify
- (bib1,Identifier-standard) as the value of the
- recordId in the configuration file.
- If you have other record types that use the same field for a
- different purpose, you might add the record type
- (or group or database name) to the record id of the gils
- records as well, to prevent matches with other types of records.
- In this case the recordId might be set like this:
-
-
- gils.recordId: $type (bib1,Identifier-standard)
-
-
-
-
-
- (see
- for details of how the mapping between elements of your records and
- searchable attributes is established).
-
-
-
- As for the file record ID case described in the previous section,
- updating your system is simply a matter of running
- zebraidx
- with the update command. However, the update with general
- keys is considerably slower than with file record IDs, since all files
- visited must be (re)read to discover their IDs.
-
-
-
- As you might expect, when using the general record IDs
- method, you can only add or modify existing records with the
- update command.
- If you wish to delete records, you must use the
- delete command, with a directory as a parameter.
- This will remove all records that match the files below that root
- directory.
-
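As a sketch, reusing the gils group from the example above together with a hypothetical record directory, such a deletion could look like:

```
$ zebraidx -g gils delete /data1/records
```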
-
-
-
-
- Register Location
-
-
- Normally, the index files that form dictionaries, inverted
- files, record info, etc., are stored in the directory where you run
- zebraidx. If you wish to store these, possibly large,
- files somewhere else, you must add the register
- entry to the zebra.cfg file.
- Furthermore, the Zebra system allows its file
- structures to span multiple file systems, which is useful for
- managing very large databases.
-
-
-
- The value of the register setting is a sequence
- of tokens. Each token takes the form:
-
-
- dir:size.
-
-
- The dir specifies a directory in which index files
- will be stored and the size specifies the maximum
- size of all files in that directory. The Zebra indexer system fills
- each directory in the order specified and uses the next specified
- directories as needed.
- The size is an integer followed by a qualifier
- code:
- b for bytes,
- k for kilobytes,
- M for megabytes,
- G for gigabytes.
-
-
-
- For instance, if you have allocated two disks for your register, and
- the first disk is mounted
- on /d1 and has 2GB of free space and the
- second, mounted on /d2 has 3.6 GB, you could
- put this entry in your configuration file:
-
-
- register: /d1:2G /d2:3600M
-
-
-
-
-
- Note that Zebra does not verify that the amount of space specified is
- actually available on the directory (file system) specified - it is
- your responsibility to ensure that enough space is available, and that
- other applications do not attempt to use the free space. In a large
- production system, it is recommended that you allocate one or more
- file systems exclusively to the Zebra register files.
-
-
-
-
-
- Safe Updating - Using Shadow Registers
-
-
- Description
-
-
- The Zebra server supports updating of the index
- structures. That is, you can add, modify, or remove records from
- databases managed by Zebra without rebuilding the entire index.
- Since this process involves modifying structured files with various
- references between blocks of data in the files, the update process
- is inherently sensitive to system crashes, or to process interruptions:
- Anything but a successfully completed update process will leave the
- register files in an unknown state, and you will essentially have no
- recourse but to re-index everything, or to restore the register files
- from a backup medium.
- Further, while the update process is active, users cannot be
- allowed to access the system, as the contents of the register files
- may change unpredictably.
-
-
-
- You can solve these problems by enabling the shadow register system in
- Zebra.
- During the updating procedure, zebraidx will temporarily
- write changes to the involved files in a set of "shadow
- files", without modifying the files that are accessed by the
- active server processes. If the update procedure is interrupted by a
- system crash or a signal, you simply repeat the procedure - the
- register files have not been changed or damaged, and the partially
- written shadow files are automatically deleted before the new updating
- procedure commences.
-
-
-
- At the end of the updating procedure (or in a separate operation, if
- you so desire), the system enters a "commit mode". First,
- any active server processes are forced to access those blocks that
- have been changed from the shadow files rather than from the main
- register files; the unmodified blocks are still accessed at their
- normal location (the shadow files are not a complete copy of the
- register files - they only contain those parts that have actually been
- modified). If the commit process is interrupted at any point during the
- commit process, the server processes will continue to access the
- shadow files until you can repeat the commit procedure and complete
- the writing of data to the main register files. You can perform
- multiple update operations to the registers before you commit the
- changes to the system files, or you can execute the commit operation
- at the end of each update operation. When the commit phase has
- completed successfully, any running server processes are instructed to
- switch their operations to the new, operational register, and the
- temporary shadow files are deleted.
-
-
-
-
-
- How to Use Shadow Register Files
-
-
- The first step is to allocate space on your system for the shadow
- files.
- You do this by adding a shadow entry to the
- zebra.cfg file.
- The syntax of the shadow entry is exactly the
- same as for the register entry
- (see ).
- The location of the shadow area should be
- different from the location of the main register
- area (if you have specified one - remember that if you provide no
- register setting, the default register area is the
- working directory of the server and indexing processes).
+
+
+ You cannot explicitly delete records when using this method (that is,
+ using the delete command of zebraidx). Instead
+ you have to delete the files from the file system (or move them to a
+ different location)
+ and then run zebraidx with the
+ update command.
+
+
+
+
+
+ Indexing with General Record IDs
+
+
+ When using this method you construct an (almost) arbitrary, internal
+ record key based on the contents of the record itself and other system
+ information. If you have a group of records that explicitly associates
+ an ID with each record, this method is convenient. For example, the
+ record format may contain a title or an ID number, unique within the group.
+ In either case you specify the &acro.z3950; attribute set and use-attribute
+ location in which this information is stored, and the system looks at
+ that field to determine the identity of the record.
-
+
- The following excerpt from a zebra.cfg file shows
- one example of a setup that configures both the main register
- location and the shadow file area.
- Note that two directories or partitions have been set aside
- for the shadow file area. You can specify any number of directories
- for each of the file areas, but remember that there should be no
- overlaps between the directories used for the main registers and the
- shadow files, respectively.
+ As before, the record ID is defined by the recordId
+ setting in the configuration file. The value of the record ID specification
+ consists of one or more tokens separated by whitespace. The resulting
+ ID is represented in the index by concatenating the tokens and
+ separating them by ASCII value (1).
+
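The concatenation described above can be sketched in Python (an illustration of the ASCII value 1 separator only, not Zebra's actual C implementation; the token values shown are invented):

```python
# Illustration only (not Zebra's code): build an internal record ID
# by joining the already-resolved token values with ASCII value 1.
def make_record_id(tokens):
    return b"\x01".join(t.encode() for t in tokens)

# e.g. a $type token followed by an identifier extracted from the record
print(make_record_id(["grs.sgml", "oai:example:1234"]))
```

The separator byte cannot occur in normal textual field values, which is why it is a safe delimiter between tokens.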
-
+ There are three kinds of tokens:
+
+
+
+ Internal record info
+
+
+ The token refers to a key that is
+ extracted from the record. The syntax of this token is
+ ( set ,
+ use ),
+ where set is the
+ attribute set name and use is the
+ name or value of the attribute.
+
+
+
+
+ System variable
+
+
+ The system variables are preceded by
+
+
+ $
+
+ and immediately followed by the system variable name, which
+ may be one of
+
+
+
+ group
+
+
+ Group name.
+
+
+
+
+ database
+
+
+ Current database specified.
+
+
+
+
+ type
+
+
+ Record type.
+
+
+
+
+
+
+
+
+ Constant string
+
+
+ A string used as part of the ID — surrounded
+ by single- or double quotes.
+
+
+
+
+
+
+
+ For instance, the sample GILS records that come with the &zebra;
+ distribution contain a unique ID in the data tagged Control-Identifier.
+ The data is mapped to the &acro.bib1; use attribute Identifier-standard
+ (code 1007). To use this field as a record id, specify
+ (bib1,Identifier-standard) as the value of the
+ recordId in the configuration file.
+ If you have other record types that use the same field for a
+ different purpose, you might add the record type
+ (or group or database name) to the record id of the gils
+ records as well, to prevent matches with other types of records.
+ In this case the recordId might be set like this:
+
- register: /d1:500M
-
- shadow: /scratch1:100M /scratch2:200M
+ gils.recordId: $type (bib1,Identifier-standard)
-
+
-
+
- When shadow files are enabled, an extra command is available at the
- zebraidx command line.
- In order to make changes to the system take effect for the
- users, you'll have to submit a "commit" command after a
- (sequence of) update operation(s).
+ (see
+ for details of how the mapping between elements of your records and
+ searchable attributes is established).
-
+
-
+ As for the file record ID case described in the previous section,
+ updating your system is simply a matter of running
+ zebraidx
+ with the update command. However, the update with general
+ keys is considerably slower than with file record IDs, since all files
+ visited must be (re)read to discover their IDs.
+
+
+
+ As you might expect, when using the general record IDs
+ method, you can only add or modify existing records with the
+ update command.
+ If you wish to delete records, you must use the
+ delete command, with a directory as a parameter.
+ This will remove all records that match the files below that root
+ directory.
+
+
+
+
+
+ Register Location
+
+
+ Normally, the index files that form dictionaries, inverted
+ files, record info, etc., are stored in the directory where you run
+ zebraidx. If you wish to store these, possibly large,
+ files somewhere else, you must add the register
+ entry to the zebra.cfg file.
+ Furthermore, the &zebra; system allows its file
+ structures to span multiple file systems, which is useful for
+ managing very large databases.
+
+
+
+ The value of the register setting is a sequence
+ of tokens. Each token takes the form:
+
+ dir:size
+
+ The dir specifies a directory in which index files
+ will be stored and the size specifies the maximum
+ size of all files in that directory. The &zebra; indexer system fills
+ each directory in the order specified and uses the next specified
+ directories as needed.
+ The size is an integer followed by a qualifier
+ code:
+ b for bytes,
+ k for kilobytes,
+ M for megabytes,
+ G for gigabytes.
+ Specifying a negative value disables the size check (the unit is still
+ required; use -1b).
+
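The dir:size token format can be illustrated with a small Python parser (a sketch of how the tokens are interpreted; Zebra does this internally in C, and the helper name here is invented):

```python
# Hypothetical helper showing how register tokens of the form dir:size
# are interpreted.  Units: b(ytes), k(ilobytes), M(egabytes), G(igabytes).
UNITS = {"b": 1, "k": 1024, "M": 1024**2, "G": 1024**3}

def parse_register_token(token):
    directory, size = token.rsplit(":", 1)
    number, unit = int(size[:-1]), size[-1]
    if number < 0:             # e.g. -1b disables the size check
        return directory, None
    return directory, number * UNITS[unit]

print(parse_register_token("/d1:2G"))   # ('/d1', 2147483648)
print(parse_register_token("/d3:-1b"))  # ('/d3', None)
```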
+
+
+ For instance, if you have allocated three disks for your register, and
+ the first disk is mounted
+ on /d1 and has 2GB of free space, the
+ second, mounted on /d2 has 3.6 GB, and the third,
+ on which you have more space than you bother to worry about, mounted on
+ /d3 you could put this entry in your configuration file:
+
- $ zebraidx update /d1/records
- $ zebraidx commit
+ register: /d1:2G /d2:3600M /d3:-1b
-
-
+
+
+ Note that &zebra; does not verify that the amount of space specified is
+ actually available on the directory (file system) specified - it is
+ your responsibility to ensure that enough space is available, and that
+ other applications do not attempt to use the free space. In a large
+ production system, it is recommended that you allocate one or more
+ file systems exclusively to the &zebra; register files.
+
+
+
+
+
+ Safe Updating - Using Shadow Registers
+
+
+ Description
+
+
+ The &zebra; server supports updating of the index
+ structures. That is, you can add, modify, or remove records from
+ databases managed by &zebra; without rebuilding the entire index.
+ Since this process involves modifying structured files with various
+ references between blocks of data in the files, the update process
+ is inherently sensitive to system crashes, or to process interruptions:
+ Anything but a successfully completed update process will leave the
+ register files in an unknown state, and you will essentially have no
+ recourse but to re-index everything, or to restore the register files
+ from a backup medium.
+ Further, while the update process is active, users cannot be
+ allowed to access the system, as the contents of the register files
+ may change unpredictably.
+
+
+
+ You can solve these problems by enabling the shadow register system in
+ &zebra;.
+ During the updating procedure, zebraidx will temporarily
+ write changes to the involved files in a set of "shadow
+ files", without modifying the files that are accessed by the
+ active server processes. If the update procedure is interrupted by a
+ system crash or a signal, you simply repeat the procedure - the
+ register files have not been changed or damaged, and the partially
+ written shadow files are automatically deleted before the new updating
+ procedure commences.
+
+
+
+ At the end of the updating procedure (or in a separate operation, if
+ you so desire), the system enters a "commit mode". First,
+ any active server processes are forced to access those blocks that
+ have been changed from the shadow files rather than from the main
+ register files; the unmodified blocks are still accessed at their
+ normal location (the shadow files are not a complete copy of the
+ register files - they only contain those parts that have actually been
+ modified). If the commit process is interrupted at any point during the
+ commit process, the server processes will continue to access the
+ shadow files until you can repeat the commit procedure and complete
+ the writing of data to the main register files. You can perform
+ multiple update operations to the registers before you commit the
+ changes to the system files, or you can execute the commit operation
+ at the end of each update operation. When the commit phase has
+ completed successfully, any running server processes are instructed to
+ switch their operations to the new, operational register, and the
+ temporary shadow files are deleted.
+
+
+
+
+
+ How to Use Shadow Register Files
+
+
+ The first step is to allocate space on your system for the shadow
+ files.
+ You do this by adding a shadow entry to the
+ zebra.cfg file.
+ The syntax of the shadow entry is exactly the
+ same as for the register entry
+ (see ).
+ The location of the shadow area should be
+ different from the location of the main register
+ area (if you have specified one - remember that if you provide no
+ register setting, the default register area is the
+ working directory of the server and indexing processes).
+
+
+
+ The following excerpt from a zebra.cfg file shows
+ one example of a setup that configures both the main register
+ location and the shadow file area.
+ Note that two directories or partitions have been set aside
+ for the shadow file area. You can specify any number of directories
+ for each of the file areas, but remember that there should be no
+ overlaps between the directories used for the main registers and the
+ shadow files, respectively.
+
+
+
+
+ register: /d1:500M
+ shadow: /scratch1:100M /scratch2:200M
+
+
+
+
+
+ When shadow files are enabled, an extra command is available at the
+ zebraidx command line.
+ In order to make changes to the system take effect for the
+ users, you'll have to submit a "commit" command after a
+ (sequence of) update operation(s).
+
+
+
+
+
+ $ zebraidx update /d1/records
+ $ zebraidx commit
+
+
+
+
+
+ Or you can execute multiple updates before committing the changes:
+
+
+
+
+
+ $ zebraidx -g books update /d1/records /d2/more-records
+ $ zebraidx -g fun update /d3/fun-records
+ $ zebraidx commit
+
+
+
+
+
+ If one of the update operations above had been interrupted, the commit
+ operation on the last line would fail: zebraidx
+ will not let you commit changes that would destroy the running register.
+ You'll have to rerun all of the update operations since your last
+ commit operation, before you can commit the new changes.
+
+
+
+ Similarly, if the commit operation fails, zebraidx
+ will not let you start a new update operation before you have
+ successfully repeated the commit operation.
+ The server processes will keep accessing the shadow files rather
+ than the (possibly damaged) blocks of the main register files
+ until the commit operation has successfully completed.
+
+
+
+ You should be aware that update operations may take slightly longer
+ when the shadow register system is enabled, since more file access
+ operations are involved. Further, while the disk space required for
+ the shadow register data is modest for a small update operation, you
+ may prefer to disable the system if you are adding a very large number
+ of records to an already very large database (we use the terms
+ large and modest
+ very loosely here, since every application will have a
+ different perception of size).
+ To update the system without the use of the shadow files,
+ simply run zebraidx with the -n
+ option (note that you do not have to execute the
+ commit command of zebraidx
+ when you temporarily disable the use of the shadow registers in
+ this fashion).
+ Note also that, just as when the shadow registers are not enabled,
+ server processes will be barred from accessing the main register
+ while the update procedure takes place.
+
+
+
+
+
+
+
+
+ Relevance Ranking and Sorting of Result Sets
+
+
+ Overview
+
+ The default ordering of a result set is left up to the server,
+ which inside &zebra; means sorting in ascending document ID order.
+ This is not always the order humans want to browse the sometimes
+ quite large hit sets. Ranking and sorting comes to the rescue.
+
+
+
+ In cases where a good presentation ordering can be computed at
+ indexing time, we can use a fixed static ranking
+ scheme, which is provided for the alvis
+ indexing filter. This defines a fixed ordering of hit lists,
+ independently of the query issued.
+
+
+
+ There are cases, however, where relevance of hit set documents is
+ highly dependent on the query processed.
+ Simply put, dynamic relevance ranking
+ sorts a set of retrieved records such that those most likely to be
+ relevant to your request are retrieved first.
+ Internally, &zebra; retrieves all documents that satisfy your
+ query, and re-orders the hit list to arrange them based on
+ a measurement of similarity between your query and the content of
+ each record.
+
+
+
+ Finally, there are situations where hit sets of documents should be
+ sorted during query time according to the
+ lexicographical ordering of certain sort indexes created at
+ indexing time.
+
+
+
+
+
+ Static Ranking
+
+
+ &zebra; uses internally inverted indexes to look up term frequencies
+ in documents. Multiple queries from different indexes can be
+ combined by the binary boolean operations AND,
+ OR and/or NOT (which
+ is in fact a binary AND NOT operation).
+ To ensure fast query execution
+ speed, all indexes have to be sorted in the same order.
+
+
+ The indexes are normally sorted according to document
+ ID in
+ ascending order, and any query which does not invoke a special
+ re-ranking function will therefore retrieve the result set in
+ document
+ ID
+ order.
+
+
+ If one defines the
+
+ staticrank: 1
+
+ directive in the main core &zebra; configuration file, the internal document
+ keys used for ordering are augmented by a preceding integer, which
+ contains the static rank of a given document, and the index lists
+ are ordered
+ first by ascending static rank,
+ then by ascending document ID.
+ Zero
+ is the ``best'' rank, as it occurs at the
+ beginning of the list; higher numbers represent worse scores.
+
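The resulting ordering can be shown with a tiny Python sketch (made-up data; Zebra keeps these composite keys inside its index files, not in Python):

```python
# Sketch of the ordering described above: with staticrank enabled,
# each index entry is keyed by (static_rank, doc_id), both ascending,
# so rank 0 ("best") documents come first.  Data is invented.
entries = [(2, 17), (0, 42), (1, 3), (0, 7)]   # (static_rank, doc_id)
hit_list = sorted(entries)
print(hit_list)   # [(0, 7), (0, 42), (1, 3), (2, 17)]
```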
+
+ The experimental alvis filter provides a
+ directive to fetch static rank information out of the indexed &acro.xml;
+ records, thus making all hit sets ordered
+ by ascending static
+ rank and, for those documents which have the same static rank,
+ by ascending document ID.
+ See for the gory details.
+
+
+
+
+
+ Dynamic Ranking
+
+ In order to fiddle with the static rank order, it is necessary to
+ invoke additional re-ranking/re-ordering using dynamic
+ ranking or score functions. These functions return positive
+ integer scores, where highest score is
+ ``best'';
+ hit sets are sorted according to descending
+ scores (in contrast
+ to the index lists, which are sorted according to
+ ascending rank number and document ID).
+
+
+ Dynamic ranking is enabled by a directive like one of the
+ following in the zebra configuration file (use only one of these at a time!):
+
+ rank: rank-1 # default TF/IDF like
+ rank: rank-static # dummy do-nothing
+
+
+
+
+ Dynamic ranking is done at query time rather than
+ indexing time (this is why we
+ call it ``dynamic ranking'' in the first place ...)
+ It is invoked by adding
+ the &acro.bib1; relation attribute with
+ value ``relevance'' to the &acro.pqf; query (that is,
+ @attr 2=102, see also
+
+ The &acro.bib1; Attribute Set Semantics, also in
+ HTML).
+ To find all articles with the word Eoraptor in
+ the title, and present them relevance ranked, issue the &acro.pqf; query:
+
+ @attr 2=102 @attr 1=4 Eoraptor
+
+
+
+
+ Dynamically ranking using &acro.pqf; queries with the 'rank-1'
+ algorithm
+
+
+ The default rank-1 ranking module implements a
+ TF/IDF (Term Frequency / Inverse Document Frequency) like
+ algorithm. In contrast to the usual definition of TF/IDF
+ algorithms, which only considers searching in one full-text
+ index, this one works on multiple indexes at the same time.
+ More precisely,
+ &zebra; does boolean queries and searches in specific addressed
+ indexes (there are inverted indexes pointing from terms in the
+ dictionary to documents and term positions inside documents).
+ It works like this:
+
+
+ Query Components
+
+
+ First, the boolean query is dismantled into its principal components,
+ i.e. atomic queries where one term is looked up in one index.
+ For example, the query
+
+ @attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
+
+ is a boolean AND between the atomic parts
+
+ @attr 2=102 @attr 1=1010 Utah
+
+ and
+
+ @attr 2=102 @attr 1=1018 Springer
+
+ each of which is processed by itself.
+
+
+
+
+
+ Atomic hit lists
+
+
+ Second, for each atomic query, the hit list of documents is
+ computed.
+
+
+ In this example, two hit lists are computed: one for the index
+ @attr 1=1010 and one for
+ @attr 1=1018.
+
+
+
+
+
+ Atomic scores
+
+
+ Third, each document in the hit list is assigned a score (if ranking
+ is enabled and requested in the query) using a TF/IDF scheme.
+
+
+ In this example, both atomic parts of the query assign the magic
+ @attr 2=102 relevance attribute, and are
+ to be used in the relevance ranking functions.
+
+
+ It is possible to apply dynamic ranking on only parts of the
+ &acro.pqf; query:
+
+ @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
+
+ searches for all documents which have the term 'Utah' in the
+ body of text, and which have the term 'Springer' in the publisher
+ field, and sort them in the order of the relevance ranking made on
+ the body-of-text index only.
+
+
+
+
+
+ Hit list merging
+
+
+ Fourth, the atomic hit lists are merged according to the boolean
+ conditions to a final hit list of documents to be returned.
+
+
+ This step is always performed, regardless of whether dynamic
+ ranking is enabled.
+
+
+
+
+
+ Document score computation
+
+
+ Fifth, the total score of a document is computed as a linear
+ combination of the atomic scores of the atomic hit lists.
+
+
+ Ranking weights may be used to pass a value to a ranking
+ algorithm, using the non-standard &acro.bib1; attribute type 9.
+ This allows one branch of a query to use one value while
+ another branch uses a different one. For example, we can search
+ for utah in the
+ @attr 1=4 index with weight 30, as
+ well as in the @attr 1=1010 index with weight 20:
+
+ @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
+
+
+
+ The default weight is
+ sqrt(1000) ~ 34, as the &acro.z3950; standard prescribes that the top score
+ is 1000 and the bottom score is 0, encoded in integers.
+
+
+
+ The ranking-weight feature is experimental. It may change in future
+ releases of zebra.
+
+
+
+
+
+
+ Re-sorting of hit list
+
+
+ Finally, the final hit list is re-ordered according to scores.
+
+
+
+
+
+
+
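The steps above can be sketched as a toy model (invented postings data and an illustrative scoring formula; this is not the rank-1 source code):

```python
import math

# Toy postings: index name -> term -> {doc_id: term frequency}
postings = {
    "title":     {"utah": {1: 2, 4: 1}},
    "publisher": {"springer": {1: 1, 2: 3}},
}
N = 10  # assumed total number of documents in the toy collection

def atomic_hits(index, term):
    """Steps 2-3: hit list for one atomic query, each document
    assigned a TF/IDF-like score (formula is illustrative only)."""
    hits = postings[index][term]
    idf = math.log(N / len(hits))
    return {doc: tf * idf for doc, tf in hits.items()}

def and_merge(a, b):
    """Steps 4-5: AND-merge the atomic hit lists and combine the
    atomic scores as a linear combination (equal weights here)."""
    return {doc: a[doc] + b[doc] for doc in a.keys() & b.keys()}

merged = and_merge(atomic_hits("title", "utah"),
                   atomic_hits("publisher", "springer"))
# Step 6: re-sort the final hit list by descending score
ranked = sorted(merged, key=merged.get, reverse=True)
print(ranked)   # only doc 1 matches both atomic queries
```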
+
+
+ The rank-1 algorithm
+ does not use the static rank
+ information in the list keys, and will produce the same ordering
+ with or without static ranking enabled.
+
+
+
+
+
+
+
+ Dynamic ranking is not compatible
+ with estimated hit sizes, as all documents in
+ a hit set must be accessed to compute the correct placing in a
+ ranking sorted list. Therefore the use attribute setting
+ @attr 2=102 clashes with
+ @attr 9=integer.
+
+
+
+
+
+
+
+
+ Dynamically ranking &acro.cql; queries
+
+ Dynamic ranking can be enabled during server-side &acro.cql;
+ query expansion by adding @attr 2=102
+ chunks to the &acro.cql; config file. For example
+
+ relationModifier.relevant = 2=102
+
+ invokes dynamic ranking each time a &acro.cql; query of the form
+
+ Z> querytype cql
+ Z> f alvis.text =/relevant house
+
+ is issued. Dynamic ranking can also be automatically used on
+ specific &acro.cql; indexes by (for example) setting
+
+ index.alvis.text = 1=text 2=102
+
+ which then invokes dynamic ranking each time a &acro.cql; query of the form
+
+ Z> querytype cql
+ Z> f alvis.text = house
+
+ is issued.
+
+
+
+
+
+
+
+
+ Sorting
+
+ &zebra; sorts efficiently using special sorting indexes
+ (type=s), so each sortable index must be known
+ at indexing time and specified in the record-indexing
+ configuration. For example, to enable sorting according to the &acro.bib1;
+ Date/time-added-to-db field, one could add the line
+
+ xelm /*/@created Date/time-added-to-db:s
+
+ to any .abs record-indexing configuration file.
+ Similarly, one could add an indexing element of the form
+
+
+
+ ]]>
+ to any alvis-filter indexing stylesheet.
+
+
+ Sorting can be specified at search time using a query term
+ carrying the non-standard
+ &acro.bib1; attribute-type 7. This removes the
+ need to send a &acro.z3950; Sort Request
+ separately, and can dramatically improve latency when the client
+ and server are on separate networks.
+ The sorting part of the query is separate from the rest of the
+ query - the actual search specification - and must be combined
+ with it using OR.
+
+
+ A sorting subquery needs two attributes: an index (such as a
+ &acro.bib1; type-1 attribute) specifying which index to sort on, and a
+ type-7 attribute whose value is 1 for
+ ascending sorting, or 2 for descending. The
+ term associated with the sorting attribute is the priority of
+ the sort key, where 0 specifies the primary
+ sort key, 1 the secondary sort key, and so
+ on.
+
+ For example, a search for water, sort by title (ascending),
+ is expressed by the &acro.pqf; query
+
+ @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
+
+ whereas a search for water, sort by title ascending,
+ then date descending would be
+
+ @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
+
+
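The primary/secondary priority scheme amounts to an ordinary multi-key sort. As a Python illustration (toy data, not Zebra internals), sorting by title ascending and then date descending looks like this:

```python
# Toy records: (title, date, doc_id), mirroring the second PQF
# example above: primary key title ascending, secondary key date
# descending.  Negating the numeric date reverses its ordering.
records = [("water cycle", 1998, 1),
           ("aquifers",    2001, 2),
           ("water cycle", 2003, 3)]
ordered = sorted(records, key=lambda r: (r[0], -r[1]))
print([r[2] for r in ordered])   # [2, 3, 1]
```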
+
+ Notice the fundamental differences between dynamic
+ ranking and sorting: there can be
+ only one ranking function defined and configured; but multiple
+ sorting indexes can be specified dynamically at search
+ time. Ranking does not need to use specific indexes, so
+ dynamic ranking can be enabled and disabled without
+ re-indexing; whereas, sorting indexes need to be
+ defined before indexing.
+
+
+
+
+
+
+
+
+ Extended Services: Remote Insert, Update and Delete
+
+
+
+ Extended services are only supported when accessing the &zebra;
+ server using the &acro.z3950;
+ protocol. The &acro.sru; protocol does
+ not support extended services.
+
+
+
- Or you can execute multiple updates before committing the changes:
+ The extended services are not enabled by default in zebra, since
+ they modify the system. &zebra; can be configured
+ to allow anybody to
+ search, and to allow updates only for a particular admin user,
+ in the main zebra configuration file zebra.cfg.
+ For user admin, you could use:
+
+ perm.anonymous: r
+ perm.admin: rw
+ passwd: passwordfile
+
+ And in the password file
+ passwordfile, you have to specify users and
+ encrypted passwords as colon separated strings.
+ Use a tool like htpasswd
+ to maintain the encrypted passwords.
+
+ admin:secret
+
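If htpasswd is not at hand, an entry in one of the formats it supports can be generated directly; the sketch below produces the {SHA} scheme (what htpasswd -s emits). This only illustrates the file format: whether a given Zebra build accepts this scheme depends on its password support, and traditional crypt(3) entries may be required instead.

```python
import base64
import hashlib

# Illustration of the user:hash line format only, using htpasswd's
# {SHA} scheme.  Check which schemes your Zebra build accepts;
# crypt(3) entries are the traditional choice.
def sha_entry(user, password):
    digest = hashlib.sha1(password.encode()).digest()
    return "%s:{SHA}%s" % (user, base64.b64encode(digest).decode())

print(sha_entry("admin", "secret"))
# admin:{SHA}5en6G6MezRroT3XKqkdPOmY/BfQ=
```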
+ It is essential to configure &zebra; to store records internally,
+ and to support
+ modifications and deletion of records:
+
+ storeData: 1
+ storeKeys: 1
+
+ The general record type should be set to any record filter which
+ is able to parse &acro.xml; records; you may use either of the two
+ declarations (but not both simultaneously!)
+
+ recordType: dom.filter_dom_conf.xml
+ # recordType: grs.xml
+
+ Notice the difference from the record-type-specific declarations
+
+ recordType.xml: dom.filter_dom_conf.xml
+ # recordType.xml: grs.xml
+
+ which only work when indexing XML files from the filesystem using
+ the *.xml naming convention.
-
-
+ To enable transaction-safe shadow indexing,
+ which is especially important for this kind of operation, set
- $ zebraidx -g books update /d1/records /d2/more-records
- $ zebraidx -g fun update /d3/fun-records
- $ zebraidx commit
+ shadow: directoryname: size (e.g. 1000M)
-
-
-
-
- If one of the update operations above had been interrupted, the commit
- operation on the last line would fail: zebraidx
- will not let you commit changes that would destroy the running register.
- You'll have to rerun all of the update operations since your last
- commit operation, before you can commit the new changes.
-
-
-
- Similarly, if the commit operation fails, zebraidx
- will not let you start a new update operation before you have
- successfully repeated the commit operation.
- The server processes will keep accessing the shadow files rather
- than the (possibly damaged) blocks of the main register files
- until the commit operation has successfully completed.
-
-
-
- You should be aware that update operations may take slightly longer
- when the shadow register system is enabled, since more file access
- operations are involved. Further, while the disk space required for
- the shadow register data is modest for a small update operation, you
- may prefer to disable the system if you are adding a very large number
- of records to an already very large database (we use the terms
- large and modest
- very loosely here, since every application will have a
- different perception of size).
- To update the system without the use of the the shadow files,
- simply run zebraidx with the -n
- option (note that you do not have to execute the
- commit command of zebraidx
- when you temporarily disable the use of the shadow registers in
- this fashion.
- Note also that, just as when the shadow registers are not enabled,
- server processes will be barred from accessing the main register
- while the update procedure takes place.
-
-
-
-
-
-
-
+ See for additional information on
+ these configuration options.
+
+
+
+ It is not possible to carry information about record types or
+ similar to &zebra; when using extended services, due to
+ limitations of the &acro.z3950;
+ protocol. Therefore, indexing filters cannot be chosen on a
+ per-record basis. One and only one general &acro.xml; indexing filter
+ must be defined.
+
+
+
+
+
+
+
+ Extended services in the &acro.z3950; protocol
+
+
+ The &acro.z3950; standard allows
+ servers to accept special binary extended services
+ protocol packages, which may be used to insert, update and delete
+ records into servers. These carry control and update
+ information to the servers, which are encoded in seven package fields:
+
+
+
+ Extended services &acro.z3950; Package Fields
+
+
+
+ Parameter
+ Value
+ Notes
+
+
+
+
+ type
+ 'update'
+ Must be set to trigger extended services
+
+
+ action
+ string
+
+ Extended service action type with
+ one of four possible values: recordInsert,
+ recordReplace,
+ recordDelete,
+ and specialUpdate
+
+
+
+ record
+ &acro.xml; string
+ An &acro.xml; formatted string containing the record
+
+
+ syntax
+ 'xml'
+ XML/SUTRS/MARC. GRS-1 not supported.
+ The default filter (record type) as given by recordType in
+ zebra.cfg is used to parse the record.
+
+
+ recordIdOpaque
+ string
+
+ Optional client-supplied, opaque record
+ identifier used under insert operations.
+
+
+
+ recordIdNumber
+ positive number
+ &zebra;'s internal system number,
+ not allowed for recordInsert or
+ specialUpdate actions which result in fresh
+ record inserts.
+
+
+
+ databaseName
+ database identifier
+
+ The name of the database to which the extended services should be
+ applied.
+
+
+
+
+
+
+
+
+ The action parameter can be any of
+ recordInsert (will fail if the record already exists),
+ recordReplace (will fail if the record does not exist),
+ recordDelete (will fail if the record does not
+ exist), and
+ specialUpdate (will insert or update the record
+ as needed, record deletion is not possible).
+
+
+
+ During all actions, the
+ usual rules for internal record ID generation apply, unless an
+ optional recordIdNumber &zebra; internal ID or a
+ recordIdOpaque string identifier is assigned.
+ The default ID generation is
+ configured using the recordId: directive in
+ zebra.cfg.
+ See .
+
+
+
+ Setting of the recordIdNumber parameter,
+ which must be an existing &zebra; internal system ID number, is not
+ allowed during any recordInsert or
+ specialUpdate action resulting in fresh record
+ inserts.
+
+
+
+ When retrieving existing
+ records indexed with &acro.grs1; indexing filters, the &zebra; internal
+ ID number is returned in the field
+ /*/id:idzebra/localnumber in the namespace
+ xmlns:id="http://www.indexdata.dk/zebra/",
+ where it can be picked up for later record updates or deletes.
+
+
+
+ A new element set for retrieval of internal record
+ data has been added, which can be used to access minimal records
+ containing only the recordIdNumber &zebra;
+ internal ID, or the recordIdOpaque string
+ identifier. This works for any indexing filter used.
+ See .
+
+
+
+ The recordIdOpaque string parameter
+ is a client-supplied, opaque record
+ identifier, which may be used under
+ insert, update and delete operations. The
+ client software is responsible for assigning these to
+ records. This identifier will
+ replace zebra's own automagic identifier generation with a unique
+ mapping from recordIdOpaque to the
+ &zebra; internal recordIdNumber.
+ The opaque recordIdOpaque string
+ identifiers
+ are not visible in retrieval records, nor are
+ searchable, so the value of this parameter is
+ questionable. It serves mostly as a convenient mapping from
+ application domain string identifiers to &zebra; internal ID's.
+
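+ To illustrate, here is a minimal &yaz;-&acro.php; sketch of an update
+ under a client-assigned opaque identifier. The server address, database
+ name and identifier are hypothetical, and it assumes that
+ yaz_es() accepts a recordIdOpaque
+ option mirroring the extended-services parameter described above:
+
+ ```php
+ <?php
+ // Sketch only: update a record under a client-assigned opaque
+ // identifier instead of Zebra's internal system number.
+ $yaz = yaz_connect('localhost:9999');       // hypothetical server address
+
+ $record = '<record>A fine specimen of a record</record>';
+
+ $options = array('action'         => 'specialUpdate', // insert-or-update
+                  'syntax'         => 'xml',
+                  'record'         => $record,
+                  'databaseName'   => 'mydatabase',     // hypothetical
+                  'recordIdOpaque' => 'id1234');        // client-assigned key
+
+ yaz_es($yaz, 'update', $options);
+ yaz_es($yaz, 'commit', array());            // needed with shadow registers
+ yaz_wait();
+
+ if ($error = yaz_error($yaz))
+     echo "$error";
+ ?>
+ ```
+
+ A later update of the same record can then reuse the same
+ recordIdOpaque value instead of tracking &zebra;'s
+ internal system number.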
+
+
+
+
+
+ Extended services from yaz-client
+
+
+ We can now start a yaz-client admin session and create a database:
+
+ adm-create
+ ]]>
+
+ Now that the Default database has been created,
+ we can insert an &acro.xml; file (esdd0006.grs
+ from example/gils/records) and index it:
+
+ update insert id1234 esdd0006.grs
+ ]]>
+
+ The third parameter, id1234 here,
+ is the recordIdOpaque package field.
+
+
+ Actually, we should have a way to specify "no opaque record id" for
+ yaz-client's update command. We'll fix that.
+
+
+ The newly inserted record can be searched as usual:
+
+ f utah
+ Sent searchRequest.
+ Received SearchResponse.
+ Search was a success.
+ Number of hits: 1, setno 1
+ SearchResult-1: term=utah cnt=1
+ records returned: 0
+ Elapsed: 0.014179
+ ]]>
+
+
+
+ Let's delete the beast, using the same
+ recordIdOpaque string parameter:
+
+ update delete id1234
+ No last record (update ignored)
+ Z> update delete 1 esdd0006.grs
+ Got extended services response
+ Status: done
+ Elapsed: 0.072441
+ Z> f utah
+ Sent searchRequest.
+ Received SearchResponse.
+ Search was a success.
+ Number of hits: 0, setno 2
+ SearchResult-1: term=utah cnt=0
+ records returned: 0
+ Elapsed: 0.013610
+ ]]>
+
+
+
+ If the shadow register is enabled in your
+ zebra.cfg,
+ you must run the adm-commit command
+
+ adm-commit
+ ]]>
+
+ after each update session in order to write your changes from the
+ shadow to the live register space.
+
+
+
+
+
+ Extended services from yaz-php
+
+
+ Extended services are also available from the &yaz; &acro.php; client layer. An
+ example of a &yaz;-&acro.php; extended service transaction is given here:
+
+ $record = '<record>A fine specimen of a record</record>';
+
+ $options = array('action' => 'recordInsert',
+ 'syntax' => 'xml',
+ 'record' => $record,
+ 'databaseName' => 'mydatabase'
+ );
+
+ yaz_es($yaz, 'update', $options);
+ yaz_es($yaz, 'commit', array());
+ yaz_wait();
+
+ if ($error = yaz_error($yaz))
+ echo "$error";
+ ]]>
+
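+ Deletion follows the same pattern. A minimal sketch, assuming the
+ record was previously inserted under the hypothetical opaque
+ identifier id1234 and that yaz_es()
+ accepts the recordIdOpaque option:
+
+ ```php
+ <?php
+ // Sketch only: delete by client-assigned opaque identifier.
+ $yaz = yaz_connect('localhost:9999');       // hypothetical server address
+
+ $options = array('action'         => 'recordDelete',
+                  'databaseName'   => 'mydatabase',     // hypothetical
+                  'recordIdOpaque' => 'id1234');        // assigned at insert time
+
+ yaz_es($yaz, 'update', $options);
+ yaz_es($yaz, 'commit', array());            // needed with shadow registers
+ yaz_wait();
+
+ if ($error = yaz_error($yaz))
+     echo "$error";
+ ?>
+ ```
+
+ Note that recordDelete fails if the record does not
+ exist, in line with the action semantics described earlier in this section.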
+
+
+
+
+ Extended services debugging guide
+
+ When debugging ES over PHP we recommend the following order of tests:
+
+
+
+
+
+ Make sure you have a nice record on your filesystem, which you can
+ index from the filesystem using the zebraidx command.
+ Do it exactly as you planned, using one of the GRS-1 filters,
+ or the DOMXML filter.
+ When this works, proceed.
+
+
+
+
+ Check that your server setup is OK before you write a single
+ line of PHP using ES.
+ Take the same record from the file system, and send it as an ES via
+ yaz-client as described in
+ ,
+ and
+ remember the -a option which tells you what
+ goes over the wire! Notice also the section on permissions:
+ try
+
+ perm.anonymous: rw
+
+ in zebra.cfg to make sure you do not run into
+ permission problems (but never expose such an insecure setup on the
+ internet!). Then, make sure to set the general
+ recordType instruction, pointing correctly
+ to the GRS-1 filters,
+ or the DOMXML filters.
+
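+ For such a throwaway test setup, the relevant
+ zebra.cfg lines might look like this (a sketch;
+ the filter name and shadow size are placeholders you must adapt to
+ your own installation):
+
+ ```
+ # permissive test setup - never expose this on the internet
+ perm.anonymous: rw
+ # indexing filter; grs.sgml is a placeholder, use your own
+ recordType: grs.sgml
+ # optional shadow register; remember adm-commit after updates
+ shadow: shadow:100M
+ ```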
+
+
+
+ If you insist on using the sysno in the
+ recordIdNumber setting,
+ please make sure you do only updates and deletes. Zebra's internal
+ system number is not allowed for
+ recordInsert or
+ specialUpdate actions
+ which result in fresh record inserts.
+
+
+
+
+ If the shadow register is enabled in your
+ zebra.cfg, you must remember to run the
+
+ Z> adm-commit
+
+ command as well.
+
+
+
+
+ If this works, then proceed to do the same thing in your PHP script.
+
+
+
+
+
+
+
+
+
+
+