Administrating &zebra;
Unlike many simpler retrieval systems, &zebra; supports safe, incremental
updates to an existing index.
Normally, when &zebra; modifies the index it reads a number of records
that you specify.
Depending on your specifications and on the contents of each record,
one of the following events takes place for each record:
Insert
The record is indexed as if it never occurred before.
Either the &zebra; system doesn't know how to identify the record or
&zebra; can identify the record but didn't find it to be already indexed.
Modify
The record has already been indexed.
In this case either the contents of the record or the location
(file) of the record indicates that it has been indexed before.
Delete
The record is deleted from the index. As in the
modify case, &zebra; must be able to identify the record.
Please note that in both the modify and delete cases, the &zebra;
indexer must be able to generate a unique key that identifies the record
in question (more on this below).
To administrate the &zebra; retrieval system, you run the
zebraidx program.
This program supports a number of options, which are preceded by a dash,
and a few commands (not preceded by a dash).
Both the &zebra; administrative tool and the &acro.z3950; server share a
set of index files and a global configuration file.
The name of the configuration file defaults to
zebra.cfg.
The configuration file includes specifications on how to index
various kinds of records and where the other configuration files
are located. zebrasrv and zebraidx
must be run in the directory where the
configuration file lives unless you indicate the location of the
configuration file with the -c option.
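For example, assuming a hypothetical configuration file installed as
/usr/local/etc/zebra.cfg, the indexer could be invoked from any
directory like this:
$ zebraidx -c /usr/local/etc/zebra.cfg update records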
Record Types
Indexing is a per-record process, in which either insert/modify/delete
will occur. Before a record is indexed, search keys are extracted from
whatever the layout of the original record might be (SGML, HTML, text, etc.).
The &zebra; system currently supports two fundamental types of records:
structured and simple text.
To specify a particular extraction process, use either the
command line option -t or specify a
recordType setting in the configuration file.
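As a sketch, assuming structured &acro.sgml; records in a hypothetical
records directory, the two styles look like this:
$ zebraidx -t grs.sgml update records
or, equivalently, in the configuration file:
recordType: grs.sgml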
The &zebra; Configuration File
The &zebra; configuration file, read by zebraidx and
zebrasrv, defaults to zebra.cfg
unless specified with the -c option.
You can edit the configuration file with a normal text editor.
Parameter names and values are separated by colons in the file. Lines
starting with a hash sign (#) are
treated as comments.
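A minimal illustration of the syntax (the values shown are examples only):
# this is a comment
profilePath: /usr/local/idzebra/tab
recordType: text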
If you manage different sets of records that share common
characteristics, you can organize the configuration settings for each
type into "groups".
When zebraidx is run and you wish to address a
given group you specify the group name with the -g
option.
In this case settings that have the group name as their prefix
will be used by zebraidx.
If no -g option is specified, the settings
without prefix are used.
In the configuration file, the group name is placed before the option
name itself, separated by a dot (.). For instance, to set the record type
for group public to grs.sgml
(the &acro.sgml;-like format for structured records) you would write:
public.recordType: grs.sgml
To set the default value of the record type to text
write:
recordType: text
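With the two settings above, hypothetical invocations of
zebraidx would select one or the other:
$ zebraidx -g public update records   # uses public.recordType: grs.sgml
$ zebraidx update records             # uses recordType: text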
The available configuration settings are summarized below. They will be
explained further in the following sections.
group.recordType[.name]: type
Specifies how records with the file extension
name should be handled by the indexer.
This option may also be specified as a command line option
(-t). Note that if you do not specify a
name, the setting applies to all files.
In general, the record type specifier consists of the elements (each
element separated by a dot): fundamental-type,
file-read-type and arguments. Currently, two
fundamental types exist, text and
grs.
group.recordId: record-id-spec
Specifies how the records are to be identified when updated. See
the sections on record IDs below.
group.database: database
Specifies the &acro.z3950; database name.
group.storeKeys: boolean
Specifies whether key information should be saved for a given
group of records. If you plan to update or delete records of this
type later, this should be set to 1; otherwise it should be 0
(the default), to save register space.
See the section on indexing with file record IDs below.
group.storeData: boolean
Specifies whether the records should be stored internally
in the &zebra; system files.
If you want to maintain the raw records yourself,
this option should be false (0).
If you want &zebra; to take care of the records for you, it
should be true (1).
register: register-location
Specifies the location of the various register files that &zebra; uses
to represent your databases.
See the Register Location section below.
shadow: register-location
Enables the safe update facility of &zebra;, and
tells the system where to place the required, temporary files.
See the section on safe updating below.
lockDir: directory
Directory in which various lock files are stored.
keyTmpDir: directory
Directory in which temporary files used during zebraidx's update
phase are stored.
setTmpDir: directory
Specifies the directory that the server uses for temporary result sets.
If not specified, /tmp will be used.
profilePath: path
Specifies a path of profile specification files.
The path is composed of one or more directories separated by
colons, similar to PATH on UNIX systems.
modulePath: path
Specifies a path of record filter modules.
The path is composed of one or more directories separated by
colons, similar to PATH on UNIX systems.
The 'make install' procedure typically puts modules in
/usr/local/lib/idzebra-2.0/modules.
index: filename
Defines the filename which holds fields structure
definitions. If omitted, the file default.idx
is read.
Refer to the chapter on field structure and character sets for
more information.
sortmax: integer
Specifies the maximum number of records that will be sorted
in a result set. If the result set contains more than
integer records, records after the
limit will not be sorted. If omitted, the default value is
1,000.
staticrank: integer
Specifies whether static ranking is enabled (1) or
disabled (0). If omitted, it is disabled, corresponding
to a value of 0.
Refer to the section on static ranking below.
estimatehits: integer
Controls whether &zebra; should calculate approximate hit counts and
at which hit count it is to be enabled.
A value of 0 disables approximate hit counts.
For a positive value, approximate hit counts are enabled
if the hit count is known to be larger than integer.
Approximate hit counts can also be triggered by a particular
attribute in a query.
attset: filename
Specifies the filename(s) of attribute set files for use in
searching. In many configurations bib1.att
is used, but that is not required. If Classic Explain
attributes are to be used for searching,
explain.att must be given.
The path to att-files in general can be given using the
profilePath setting.
memMax: size
Specifies the size of internal memory
to use for the zebraidx program.
The amount is given in megabytes - default is 4 (4 MB).
The more memory, the faster large updates happen, up to about
half the free memory available on the computer.
tempfiles: Yes/Auto/No
Tells zebra if it should use temporary files when indexing. The
default is Auto, in which case zebra uses temporary files only
if it would need more than memMax
megabytes of memory. This should be good for most uses.
root: dir
Specifies a directory base for &zebra;. All relative paths
given (in profilePath, register, shadow) are based on this
directory. This setting is useful if your &zebra; server
is running in a different directory from where
zebra.cfg is located.
passwd: file
Specifies a file with description of user accounts for &zebra;.
The format is similar to that of Apache's htpasswd files
and UNIX passwd files. Non-empty lines not beginning with
# are considered account lines. There is one account per line.
A line consists of fields separated by a single colon character.
The first field is the username, the second is the password.
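A sketch of such a password file, with purely illustrative accounts:
admin:secret
guest:guest123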
passwd.c: file
Specifies a file with description of user accounts for &zebra;.
The file format is similar to that used by the passwd directive, except
that the passwords are encrypted. Use Apache's htpasswd or a
similar tool for maintenance.
perm.user: permstring
Specifies the permissions (privileges) for a user allowed
to access &zebra; via the passwd system. There are currently two kinds
of permissions: read (r) and write (w). By default,
users not listed in a permission directive are given the read
privilege. To specify permissions for a user with no
username, or for &acro.z3950; anonymous-style access, use
anonymous. The permstring consists of
a sequence of characters: include the character w
for write/update access, r for read access, and
a to allow anonymous access through this account.
dbaccess: accessfile
Names a file which lists database subscriptions for individual users.
The access file should consist of lines of the form
username: dbnames, where dbnames is a list of
database names, separated by '+'. No whitespace is allowed in the
database list.
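A sketch of such an access file, with hypothetical users and databases:
admin: db1+db2+db3
anonymous: public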
encoding: charsetname
Tells &zebra; to interpret the terms in Z39.50 queries as
having been encoded using the specified character
encoding. The default is ISO-8859-1; one
useful alternative is UTF-8.
storeKeys: value
Specifies whether &zebra; keeps a copy of indexed keys.
Use a value of 1 to enable; 0 to disable. If the storeKeys setting is
omitted, it is enabled. Enabled storeKeys
are required for updating and deleting records. Disable
storeKeys only to save space, and only if you plan to index the data once.
storeData: value
Specifies whether &zebra; keeps a copy of indexed records.
Use a value of 1 to enable; 0 to disable. If the storeData setting is
omitted, it is enabled. A storeData setting of 0 (disabled) makes
&zebra; fetch records from their original location in the file
system, using the filename, file offset and file length. For the
DOM and ALVIS filters, the storeData setting is ignored.
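Putting several of these settings together, a small but complete
zebra.cfg might look like this (all values are illustrative):
# zebra.cfg - example only
profilePath: /usr/local/idzebra/tab
attset: bib1.att
register: /d1:2G
shadow: /scratch1:400M
public.recordType: grs.sgml
public.database: textbase
public.storeKeys: 1
public.storeData: 1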
Locating Records
The default behavior of the &zebra; system is to reference the
records from their original location, i.e. where they were found when you
ran zebraidx.
That is, when a client wishes to retrieve a record
following a search operation, the files are accessed from the place
where you originally put them. If you remove the files without
running zebraidx again, the server will return
diagnostic number 14 (``System error in presenting records'') to
the client.
If your input files are not permanent - for example if you retrieve
your records from an outside source, or if they were temporarily
mounted on a CD-ROM drive,
you may want &zebra; to make an internal copy of them. To do this,
you specify 1 (true) in the storeData setting. When
the &acro.z3950; server retrieves the records they will be read from the
internal file structures of the system.
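For example, for a hypothetical group cdrom whose source
files disappear after indexing:
cdrom.storeData: 1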
Indexing with no Record IDs (Simple Indexing)
If you have a set of records that are not expected to change over time
you can build your database without record IDs.
This indexing method uses less space than the other methods and
is simple to use.
To use this method, you simply omit the recordId entry
for the group of files that you index. To add a set of records you use
zebraidx with the update command. The
update command will always add all of the records that it
encounters to the index - whether they have already been indexed or
not. If the set of indexed files changes, you should delete all of the
index files, and build a new index from scratch.
Consider a system in which you have a group of text files called
simple.
That group of records should belong to a &acro.z3950; database called
textbase.
The following zebra.cfg file will suffice:
profilePath: /usr/local/idzebra/tab
attset: bib1.att
simple.recordType: text
simple.database: textbase
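With this configuration, indexing a hypothetical directory of text
files is a single command:
$ zebraidx -g simple update /data/simple-records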
Since the existing records in an index cannot be addressed by their
IDs, it is impossible to delete or modify records when using this method.
Indexing with File Record IDs
If you have a set of files that regularly change over time (old files
are deleted, new ones are added, or existing files are modified), you
can benefit from using the file ID
indexing methodology.
Examples of this type of database might include an index of WWW
resources, or a USENET news spool area.
Briefly speaking, the file key methodology uses the directory paths
of the individual records as a unique identifier for each record.
To perform indexing of a directory with file keys, again, you specify
the top-level directory after the update command.
The command will recursively traverse the directories and compare
each one with whatever has been indexed before in that same directory.
If a file is new (not in the previous version of the directory) it
is inserted into the registers; if a file was already indexed and
it has been modified since the last update, the index is also
modified; if a file has been removed since the last
visit, it is deleted from the index.
The resulting system is easy to administrate. To delete a record you
simply have to delete the corresponding file (say, with the
rm command). And to add records you create new
files (or directories with files). For your changes to take effect
in the register you must run zebraidx update with
the same directory root again. This mode of operation requires more
disk space than simpler indexing methods, but it makes it easier for
you to keep the index in sync with a frequently changing set of data.
If you combine this system with the safe update
facility (see below), you never have to take your server off-line for
maintenance or register updating purposes.
To enable indexing with pathname IDs, you must specify
file as the value of recordId
in the configuration file. In addition, you should set
storeKeys to 1, since the &zebra;
indexer must save additional information about the contents of each record
in order to modify the indexes correctly at a later time.
For example, to update records of group esdd
located below
/data1/records/ you should type:
$ zebraidx -g esdd update /data1/records
The corresponding configuration file includes:
esdd.recordId: file
esdd.recordType: grs.sgml
esdd.storeKeys: 1
You cannot start out with a group of records with simple
indexing (no record IDs, as in the previous section) and then later
enable file record IDs. &zebra; must know, from the first time that you
index the group, that
the files should be indexed with file record IDs.
You cannot explicitly delete records when using this method (using the
delete command of zebraidx). Instead,
you have to delete the files from the file system (or move them to a
different location)
and then run zebraidx with the
update command.
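As a sketch of this workflow, using the esdd group from
above and a hypothetical file name:
$ rm /data1/records/obsolete.sgml
$ zebraidx -g esdd update /data1/records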
Indexing with General Record IDs
When using this method you construct an (almost) arbitrary, internal
record key based on the contents of the record itself and other system
information. If you have a group of records that explicitly associates
an ID with each record, this method is convenient. For example, the
record format may contain a title or an ID number, unique within the group.
In either case you specify the &acro.z3950; attribute set and use-attribute
location in which this information is stored, and the system looks at
that field to determine the identity of the record.
As before, the record ID is defined by the recordId
setting in the configuration file. The value of the record ID specification
consists of one or more tokens separated by whitespace. The resulting
ID is represented in the index by concatenating the tokens and
separating them with the character of ASCII value 1.
There are three kinds of tokens:
Internal record info
The token refers to a key that is
extracted from the record. The syntax of this token is
(set, use),
where set is the
attribute set name and use is the
name or value of the attribute.
System variable
The system variables are preceded by
$
and immediately followed by the system variable name, which
may be one of
group
The group name.
database
The database currently specified.
type
The record type.
Constant string
A string used as part of the ID, surrounded
by single or double quotes.
For instance, the sample GILS records that come with the &zebra;
distribution contain a unique ID in the data tagged Control-Identifier.
The data is mapped to the &acro.bib1; use attribute Identifier-standard
(code 1007). To use this field as a record id, specify
(bib1,Identifier-standard) as the value of the
recordId in the configuration file.
If you have other record types that use the same field for a
different purpose, you might add the record type
(or group or database name) to the record ID of the gils
records as well, to prevent matches with other types of records.
In this case the recordId might be set like this:
gils.recordId: $type (bib1,Identifier-standard)
(see the record model chapters for details of how the mapping between
elements of your records and
searchable attributes is established).
As for the file record ID case described in the previous section,
updating your system is simply a matter of running
zebraidx
with the update command. However, the update with general
keys is considerably slower than with file record IDs, since all files
visited must be (re)read to discover their IDs.
As you might expect, when using the general record IDs
method, you can only add or modify existing records with the
update command.
If you wish to delete records, you must use the
delete command, with a directory as a parameter.
This will remove all records that match the files below that root
directory.
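For instance, reusing the gils group from above, the
following hypothetical invocation removes all records matching files
below the given directory:
$ zebraidx -g gils delete /data1/records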
Register Location
Normally, the index files that form dictionaries, inverted
files, record info, etc., are stored in the directory where you run
zebraidx. If you wish to store these, possibly large,
files somewhere else, you must add the register
entry to the zebra.cfg file.
Furthermore, the &zebra; system allows its file
structures to span multiple file systems, which is useful for
managing very large databases.
The value of the register setting is a sequence
of tokens. Each token takes the form:
dir:size
The dir specifies a directory in which index files
will be stored and the size specifies the maximum
size of all files in that directory. The &zebra; indexer system fills
each directory in the order specified and uses the next specified
directories as needed.
The size is an integer followed by a qualifier
code:
b for bytes,
k for kilobytes,
M for megabytes,
G for gigabytes.
Specifying a negative value disables the size check (a unit is still
required, so use -1b).
For instance, if you have allocated three disks for your register, and
the first disk is mounted
on /d1 and has 2GB of free space, the
second, mounted on /d2 has 3.6 GB, and the third,
on which you have more space than you bother to worry about, mounted on
/d3 you could put this entry in your configuration file:
register: /d1:2G /d2:3600M /d3:-1b
Note that &zebra; does not verify that the amount of space specified is
actually available on the directory (file system) specified - it is
your responsibility to ensure that enough space is available, and that
other applications do not attempt to use the free space. In a large
production system, it is recommended that you allocate one or more
file systems exclusively to the &zebra; register files.
Safe Updating - Using Shadow Registers
Description
The &zebra; server supports updating of the index
structures. That is, you can add, modify, or remove records from
databases managed by &zebra; without rebuilding the entire index.
Since this process involves modifying structured files with various
references between blocks of data in the files, the update process
is inherently sensitive to system crashes, or to process interruptions:
Anything but a successfully completed update process will leave the
register files in an unknown state, and you will essentially have no
recourse but to re-index everything, or to restore the register files
from a backup medium.
Further, while the update process is active, users cannot be
allowed to access the system, as the contents of the register files
may change unpredictably.
You can solve these problems by enabling the shadow register system in
&zebra;.
During the updating procedure, zebraidx will temporarily
write changes to the involved files in a set of "shadow
files", without modifying the files that are accessed by the
active server processes. If the update procedure is interrupted by a
system crash or a signal, you simply repeat the procedure - the
register files have not been changed or damaged, and the partially
written shadow files are automatically deleted before the new updating
procedure commences.
At the end of the updating procedure (or in a separate operation, if
you so desire), the system enters a "commit mode". First,
any active server processes are forced to access those blocks that
have been changed from the shadow files rather than from the main
register files; the unmodified blocks are still accessed at their
normal location (the shadow files are not a complete copy of the
register files - they only contain those parts that have actually been
modified). If the commit process is interrupted at any point, the
server processes will continue to access the
shadow files until you can repeat the commit procedure and complete
the writing of data to the main register files. You can perform
multiple update operations to the registers before you commit the
changes to the system files, or you can execute the commit operation
at the end of each update operation. When the commit phase has
completed successfully, any running server processes are instructed to
switch their operations to the new, operational register, and the
temporary shadow files are deleted.
How to Use Shadow Register Files
The first step is to allocate space on your system for the shadow
files.
You do this by adding a shadow entry to the
zebra.cfg file.
The syntax of the shadow entry is exactly the
same as for the register entry
(see the Register Location section above).
The location of the shadow area should be
different from the location of the main register
area (if you have specified one - remember that if you provide no
register setting, the default register area is the
working directory of the server and indexing processes).
The following excerpt from a zebra.cfg file shows
one example of a setup that configures both the main register
location and the shadow file area.
Note that two directories or partitions have been set aside
for the shadow file area. You can specify any number of directories
for each of the file areas, but remember that there should be no
overlaps between the directories used for the main registers and the
shadow files, respectively.
register: /d1:500M
shadow: /scratch1:100M /scratch2:200M
When shadow files are enabled, an extra command is available at the
zebraidx command line.
In order to make changes to the system take effect for the
users, you'll have to submit a "commit" command after a
(sequence of) update operation(s).
$ zebraidx update /d1/records
$ zebraidx commit
Or you can execute multiple updates before committing the changes:
$ zebraidx -g books update /d1/records /d2/more-records
$ zebraidx -g fun update /d3/fun-records
$ zebraidx commit
If one of the update operations above had been interrupted, the commit
operation on the last line would fail: zebraidx
will not let you commit changes that would destroy the running register.
You'll have to rerun all of the update operations since your last
commit operation, before you can commit the new changes.
Similarly, if the commit operation fails, zebraidx
will not let you start a new update operation before you have
successfully repeated the commit operation.
The server processes will keep accessing the shadow files rather
than the (possibly damaged) blocks of the main register files
until the commit operation has successfully completed.
You should be aware that update operations may take slightly longer
when the shadow register system is enabled, since more file access
operations are involved. Further, while the disk space required for
the shadow register data is modest for a small update operation, you
may prefer to disable the system if you are adding a very large number
of records to an already very large database (we use the terms
large and modest
very loosely here, since every application will have a
different perception of size).
To update the system without the use of the shadow files,
simply run zebraidx with the -n
option. Note that you do not have to execute the
commit command of zebraidx
when you temporarily disable the use of the shadow registers in
this fashion.
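For instance, reusing the directory from the earlier examples:
$ zebraidx -n update /d1/records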
Note also that, just as when the shadow registers are not enabled,
server processes will be barred from accessing the main register
while the update procedure takes place.
Relevance Ranking and Sorting of Result Sets
Overview
The default ordering of a result set is left up to the server,
which inside &zebra; means sorting in ascending document ID order.
This is not always the order in which humans want to browse the sometimes
quite large hit sets. Ranking and sorting come to the rescue.
In cases where a good presentation ordering can be computed at
indexing time, we can use a fixed static ranking
scheme, which is provided for the alvis
indexing filter. This defines a fixed ordering of hit lists,
independently of the query issued.
There are cases, however, where relevance of hit set documents is
highly dependent on the query processed.
Simply put, dynamic relevance ranking
sorts a set of retrieved records such that those most likely to be
relevant to your request are retrieved first.
Internally, &zebra; retrieves all documents that satisfy your
query, and re-orders the hit list to arrange them based on
a measurement of similarity between your query and the content of
each record.
Finally, there are situations where hit sets of documents should be
sorted during query time according to the
lexicographical ordering of certain sort indexes created at
indexing time.
Static Ranking
&zebra; internally uses inverted indexes to look up term frequencies
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations AND,
OR and/or NOT (which
is in fact a binary AND NOT operation).
To ensure fast query execution
speed, all indexes have to be sorted in the same order.
The indexes are normally sorted according to document
ID in
ascending order, and any query which does not invoke a special
re-ranking function will therefore retrieve the result set in
document
ID
order.
If one defines the
staticrank: 1
directive in the main core &zebra; configuration file, the internal document
keys used for ordering are augmented by a preceding integer, which
contains the static rank of a given document, and the index lists
are ordered
first by ascending static rank,
then by ascending document ID.
Zero
is the ``best'' rank, as it occurs at the
beginning of the list; higher numbers represent worse scores.
The experimental alvis filter provides a
directive to fetch static rank information out of the indexed &acro.xml;
records, thus making all hit sets ordered
by ascending static
rank and, for those documents which have the same static rank,
by ascending doc ID.
See the ALVIS filter documentation for the gory details.
Dynamic Ranking
In order to fiddle with the static rank order, it is necessary to
invoke additional re-ranking/re-ordering using dynamic
ranking or score functions. These functions return positive
integer scores, where highest score is
``best'';
hit sets are sorted according to descending
scores (contrary
to the index lists, which are sorted according to
ascending rank number and document ID).
Dynamic ranking is enabled by a directive like one of the
following in the zebra configuration file (use only one of these at a time!):
rank: rank-1 # default TF-IDF like
rank: rank-static # dummy do-nothing
Dynamic ranking is done at query time rather than
indexing time (this is why we
call it ``dynamic ranking'' in the first place ...).
It is invoked by adding
the &acro.bib1; relation attribute with
value ``relevance'' to the &acro.pqf; query (that is,
@attr 2=102, see also
The &acro.bib1; Attribute Set Semantics, also in
HTML).
To find all articles with the word Eoraptor in
the title, and present them relevance ranked, issue the &acro.pqf; query:
@attr 2=102 @attr 1=4 Eoraptor
Dynamically ranking using &acro.pqf; queries with the 'rank-1'
algorithm
The default rank-1 ranking module implements a
TF/IDF (Term Frequency over Inverse Document Frequency) like
algorithm. In contrast to the usual definition of TF/IDF
algorithms, which only considers searching in one full-text
index, this one works on multiple indexes at the same time.
More precisely,
&zebra; does boolean queries and searches in specific addressed
indexes (there are inverted indexes pointing from terms in the
dictionary to documents and term positions inside documents).
It works like this:
Query Components
First, the boolean query is dismantled into its principal components,
i.e. atomic queries where one term is looked up in one index.
For example, the query
@attr 2=102 @and @attr 1=1010 Utah @attr 1=1018 Springer
is a boolean AND between the atomic parts
@attr 2=102 @attr 1=1010 Utah
and
@attr 2=102 @attr 1=1018 Springer
each of which is processed by itself.
Atomic hit lists
Second, for each atomic query, the hit list of documents is
computed.
In this example, two hit lists, one for each index
@attr 1=1010 and
@attr 1=1018, are computed.
Atomic scores
Third, each document in the hit list is assigned a score (if ranking
is enabled and requested in the query) using a TF/IDF scheme.
In this example, both atomic parts of the query assign the magic
@attr 2=102 relevance attribute, and are
to be used in the relevance ranking functions.
It is possible to apply dynamic ranking on only parts of the
&acro.pqf; query:
@and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
searches for all documents which have the term 'Utah' in the
body of text, and which have the term 'Springer' in the publisher
field, and sort them in the order of the relevance ranking made on
the body-of-text index only.
Hit list merging
Fourth, the atomic hit lists are merged according to the boolean
conditions to a final hit list of documents to be returned.
This step is always performed, independently of whether
dynamic ranking is enabled or not.
Document score computation
Fifth, the total score of a document is computed as a linear
combination of the atomic scores of the atomic hit lists.
Ranking weights may be used to pass a value to a ranking
algorithm, using the non-standard &acro.bib1; attribute type 9.
This allows one branch of a query to use one value while
another branch uses a different one. For example, we can search
for utah in the
@attr 1=4 index with weight 30, as
well as in the @attr 1=1010 index with weight 20:
@attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
The default weight is
sqrt(1000), approximately 34, as the &acro.z3950; standard prescribes
that the top score is 1000 and the bottom score is 0, encoded in integers.
The ranking-weight feature is experimental. It may change in future
releases of zebra.
Re-sorting of hit list
Finally, the final hit list is re-ordered according to scores.
The rank-1 algorithm
does not use the static rank
information in the list keys, and will produce the same ordering
with or without static ranking enabled.
Dynamic ranking is not compatible
with estimated hit sizes, as all documents in
a hit set must be accessed to compute the correct placing in a
ranking sorted list. Therefore the use attribute setting
@attr 2=102 clashes with
@attr 9=integer.
Dynamically ranking &acro.cql; queries
Dynamic ranking can be enabled during server side &acro.cql;
query expansion by adding @attr 2=102
chunks to the &acro.cql; config file. For example
relationModifier.relevant = 2=102
invokes dynamic ranking each time a &acro.cql; query of the form
Z> querytype cql
Z> f alvis.text =/relevant house
is issued. Dynamic ranking can also be automatically used on
specific &acro.cql; indexes by (for example) setting
index.alvis.text = 1=text 2=102
which then invokes dynamic ranking each time a &acro.cql; query of the form
Z> querytype cql
Z> f alvis.text = house
is issued.
Sorting
&zebra; sorts efficiently using special sorting indexes
(type=s), so each sortable index must be known
at indexing time and specified in the configuration of record
indexing. For example, to enable sorting according to the &acro.bib1;
Date/time-added-to-db field, one could add the line
xelm /*/@created Date/time-added-to-db:s
to any .abs record-indexing configuration file.
Similarly, one could add a corresponding indexing element
to any alvis-filter indexing stylesheet.
Sorting can be specified at search time using a query term
carrying the non-standard
&acro.bib1; attribute-type 7. This removes the
need to send a &acro.z3950; Sort Request
separately, and can dramatically improve latency when the client
and server are on separate networks.
The sorting part of the query is separate from the rest of the
query - the actual search specification - and must be combined
with it using OR.
A sorting subquery needs two attributes: an index (such as a
&acro.bib1; type-1 attribute) specifying which index to sort on, and a
type-7 attribute whose value is 1 for
ascending sorting, or 2 for descending. The
term associated with the sorting attribute is the priority of
the sort key, where 0 specifies the primary
sort key, 1 the secondary sort key, and so
on.
For example, a search for water, sort by title (ascending),
is expressed by the &acro.pqf; query
@or @attr 1=1016 water @attr 7=1 @attr 1=4 0
whereas a search for water, sort by title ascending,
then date descending would be
@or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
Notice the fundamental differences between dynamic
ranking and sorting: there can be
only one ranking function defined and configured; but multiple
sorting indexes can be specified dynamically at search
time. Ranking does not need to use specific indexes, so
dynamic ranking can be enabled and disabled without
re-indexing; whereas, sorting indexes need to be
defined before indexing.
Extended Services: Remote Insert, Update and Delete
Extended services are only supported when accessing the &zebra;
server using the &acro.z3950;
protocol. The &acro.sru; protocol does
not support extended services.
The extended services are not enabled by default in zebra, since
they modify the system. In the main zebra configuration file
zebra.cfg, &zebra; can be configured to allow anybody to
search, and to allow updates only for a particular admin user.
For user admin, you could use:
perm.anonymous: r
perm.admin: rw
passwd: passwordfile
And in the password file
passwordfile, you have to specify users and
encrypted passwords as colon-separated strings.
Use a tool like htpasswd
to maintain the encrypted passwords.
admin:secret
It is essential to configure &zebra; to store records internally,
and to support
modifications and deletion of records:
storeData: 1
storeKeys: 1
The general record type should be set to any record filter which
is able to parse &acro.xml; records; you may use either of the two
declarations below (but not both simultaneously!):
recordType: dom.filter_dom_conf.xml
# recordType: grs.xml
Notice the difference from the specific instructions
recordType.xml: dom.filter_dom_conf.xml
# recordType.xml: grs.xml
which only work when indexing XML files from the filesystem using
the *.xml naming convention.
To enable transaction-safe shadow indexing,
which is extra important for this kind of operation, set
shadow: directoryname: size (e.g. 1000M)
See the configuration sections above for additional information on
these configuration options.
It is not possible to carry information about record types or
similar to &zebra; when using extended services, due to
limitations of the &acro.z3950;
protocol. Therefore, indexing filters can not be chosen on a
per-record basis. One and only one general &acro.xml; indexing filter
must be defined.
Extended services in the &acro.z3950; protocol
The &acro.z3950; standard allows
servers to accept special binary extended services
protocol packages, which may be used to insert, update and delete
records on servers. These packages carry control and update
information to the servers, encoded in seven package fields:
Extended services &acro.z3950; Package Fields
type ('update')
Must be set to trigger extended services.
action (string)
Extended service action type with
one of four possible values: recordInsert,
recordReplace,
recordDelete,
and specialUpdate.
record (&acro.xml; string)
An &acro.xml; formatted string containing the record.
syntax ('xml')
XML/SUTRS/MARC. GRS-1 not supported.
The default filter (record type) as given by recordType in
zebra.cfg is used to parse the record.
recordIdOpaque (string)
Optional client-supplied, opaque record
identifier used under insert operations.
recordIdNumber (positive number)
&zebra;'s internal system number,
not allowed for recordInsert or
specialUpdate actions which result in fresh
record inserts.
databaseName (database identifier)
The name of the database to which the extended services should be
applied.
The action parameter can be any of
recordInsert (will fail if the record already exists),
recordReplace (will fail if the record does not exist),
recordDelete (will fail if the record does not
exist), and
specialUpdate (will insert or update the record
as needed; record deletion is not possible).
During all actions, the
usual rules for internal record ID generation apply, unless an
optional recordIdNumber &zebra; internal ID or a
recordIdOpaque string identifier is assigned.
The default ID generation is
configured using the recordId setting in
zebra.cfg.
See the section on locating records above.
Setting of the recordIdNumber parameter,
which must be an existing &zebra; internal system ID number, is not
allowed during any recordInsert or
specialUpdate action resulting in fresh record
inserts.
When retrieving existing
records indexed with &acro.grs1; indexing filters, the &zebra; internal
ID number is returned in the field
/*/id:idzebra/localnumber in the namespace
xmlns:id="http://www.indexdata.dk/zebra/",
where it can be picked up for later record updates or deletes.
A new element set for retrieval of internal record
data has been added, which can be used to access minimal records
containing only the recordIdNumber &zebra;
internal ID, or the recordIdOpaque string
identifier. This works for any indexing filter used.
See the section on special retrieval elements.
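As a sketch of what such a retrieval might look like from
yaz-client, assuming the element set name
zebra::meta::sysno for the internal ID (the exact
names are documented with the special retrieval elements):
Z> f utah
Z> format sutrs
Z> elements zebra::meta::sysno
Z> show 1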
The recordIdOpaque string parameter
is a client-supplied, opaque record
identifier, which may be used under
insert, update and delete operations. The
client software is responsible for assigning these to
records. This identifier will
replace zebra's own automagic identifier generation with a unique
mapping from recordIdOpaque to the
&zebra; internal recordIdNumber.
The opaque recordIdOpaque string
identifiers
are not visible in retrieval records, nor are they
searchable, so the value of this parameter is
questionable. It serves mostly as a convenient mapping from
application domain string identifiers to &zebra; internal IDs.
Extended services from yaz-client
We can now start a yaz-client admin session and create a database:
Z> adm-create
Now that the Default database has been created,
we can insert an &acro.xml; file (esdd0006.grs
from example/gils/records) and index it:
Z> update insert id1234 esdd0006.grs
The 3rd parameter - id1234 here -
is the recordIdOpaque package field.
Actually, we should have a way to specify "no opaque record id" for
yaz-client's update command. We'll fix that.
The newly inserted record can be searched as usual:
Z> f utah
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 1, setno 1
SearchResult-1: term=utah cnt=1
records returned: 0
Elapsed: 0.014179
Let's delete the beast, using the same
recordIdOpaque string parameter:
Z> update delete id1234
No last record (update ignored)
Z> update delete 1 esdd0006.grs
Got extended services response
Status: done
Elapsed: 0.072441
Z> f utah
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 0, setno 2
SearchResult-1: term=utah cnt=0
records returned: 0
Elapsed: 0.013610
If shadow register is enabled in your
zebra.cfg,
you must run the adm-commit command
Z> adm-commit
after each update session in order to write your changes from the
shadow to the live register space.
Extended services from yaz-php
Extended services are also available from the &yaz; &acro.php; client layer. An
example of a &yaz;-&acro.php; extended service transaction is given
here (the connection details at the top are illustrative):
<?php
// The target address, user and password are assumptions - adjust them
// to match your own zebra.cfg and password file.
$yaz = yaz_connect('localhost:9999/mydatabase',
                   array('user' => 'admin', 'password' => 'secret'));
// Any well-formed XML record will do; this wrapping is illustrative.
$record = '<record><title>A fine specimen of a record</title></record>';
$options = array('action' => 'recordInsert',
                 'syntax' => 'xml',
                 'record' => $record,
                 'databaseName' => 'mydatabase'
                 );
yaz_es($yaz, 'update', $options);  // queue the update package
yaz_es($yaz, 'commit', array());   // queue a commit package
yaz_wait();                        // send the queued packages to the server
if ($error = yaz_error($yaz))
    echo "$error";
?>
Extended services debugging guide
When debugging ES over PHP, we recommend the following order of tests:
Make sure you have a nice record on your filesystem, which you can
index from the filesystem using the zebraidx command.
Do it exactly as you planned, using one of the GRS-1 filters,
or the DOMXML filter.
When this works, proceed.
Check that your server setup is OK before you write a single
line of PHP using ES.
Take the same record from the file system, and send it as an ES request
via yaz-client, as described in the section on extended services from
yaz-client above, and
remember the -a option, which shows you what
goes over the wire! Notice also the section on permissions:
try
perm.anonymous: rw
in zebra.cfg to make sure you do not run into
permission problems (but never expose such an insecure setup on the
internet!!!). Then, make sure to set the general
recordType instruction, pointing correctly
to the GRS-1 filters,
or the DOMXML filters.
If you insist on using the sysno in the
recordIdNumber setting,
please make sure you do only updates and deletes. Zebra's internal
system number is not allowed for
recordInsert or
specialUpdate actions
which result in fresh record inserts.
If shadow register is enabled in your
zebra.cfg, you must remember to run the
Z> adm-commit
command as well.
If this works, then proceed to do the same thing in your PHP script.