<chapter id="administration">
- <!-- $Id: administration.xml,v 1.50 2007-02-02 11:10:08 marc Exp $ -->
<title>Administrating &zebra;</title>
<!-- ### It's a bit daft that this chapter (which describes half of
the configuration-file formats) is separated from
</para>
<para>
- Both the &zebra; administrative tool and the &z3950; server share a
+ Both the &zebra; administrative tool and the &acro.z3950; server share a
set of index files and a global configuration file.
The name of the configuration file defaults to
<literal>zebra.cfg</literal>.
In the configuration file, the group name is placed before the option
name itself, separated by a dot (.). For instance, to set the record type
for group <literal>public</literal> to <literal>grs.sgml</literal>
- (the &sgml;-like format for structured records) you would write:
+ (the &acro.sgml;-like format for structured records) you would write:
</para>
<para>
<replaceable>database</replaceable></term>
<listitem>
<para>
- Specifies the &z3950; database name.
+ Specifies the &acro.z3950; database name.
<!-- FIXME - now we can have multiple databases in one server. -H -->
</para>
</listitem>
</varlistentry>
<varlistentry>
+ <term>index: <replaceable>filename</replaceable></term>
+ <listitem>
+ <para>
+ Defines the filename which holds fields structure
+ definitions. If omitted, the file <filename>default.idx</filename>
+ is read.
+ Refer to <xref linkend="default-idx-file"/> for
+ more information.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>sortmax: <replaceable>integer</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the maximum number of records that will be sorted
+ in a result set. If the result set contains more than
+ <replaceable>integer</replaceable> records, records after the
+ limit will not be sorted. If omitted, the default value is
+ 1,000.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
<term>staticrank: <replaceable>integer</replaceable></term>
<listitem>
<para>
<varlistentry>
- <term>estimatehits:: <replaceable>integer</replaceable></term>
+ <term>estimatehits: <replaceable>integer</replaceable></term>
<listitem>
<para>
- Controls whether &zebra; should calculate approximite hit counts and
+ Controls whether &zebra; should calculate approximate hit counts and
at which hit count it is to be enabled.
- A value of 0 disables approximiate hit counts.
- For a positive value approximaite hit count is enabled
+ A value of 0 disables approximate hit counts.
+ For a positive value approximate hit count is enabled
if it is known to be larger than <replaceable>integer</replaceable>.
</para>
<para>
<replaceable>permstring</replaceable></term>
<listitem>
<para>
- Specifies permissions (priviledge) for a user that are allowed
+ Specifies permissions (privilege) for a user that are allowed
to access &zebra; via the passwd system. There are two kinds
of permissions currently: read (r) and write(w). By default
users not listed in a permission directive are given the read
privilege. To specify permissions for a user with no
- username, or &z3950; anonymous style use
+ username, or &acro.z3950; anonymous style use
<literal>anonymous</literal>. The permstring consists of
a sequence of characters. Include character <literal>w</literal>
for write/update access, <literal>r</literal> for read access and
</varlistentry>
<varlistentry>
- <term>dbaccess <replaceable>accessfile</replaceable></term>
+ <term>dbaccess: <replaceable>accessfile</replaceable></term>
<listitem>
<para>
Names a file which lists database subscriptions for individual users.
- The access file should consists of lines of the form <literal>username:
- dbnames</literal>, where dbnames is a list of database names, seprated by
- '+'. No whitespace is allowed in the database list.
+ The access file should consists of lines of the form
+ <literal>username: dbnames</literal>, where dbnames is a list of
+ database names, separated by '+'. No whitespace is allowed in the
+ database list.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>encoding: <replaceable>charsetname</replaceable></term>
+ <listitem>
+ <para>
+ Tells &zebra; to interpret the terms in Z39.50 queries as
+ having been encoded using the specified character
+ encoding. The default is <literal>ISO-8859-1</literal>; one
+ useful alternative is <literal>UTF-8</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>storeKeys: <replaceable>value</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether &zebra; keeps a copy of indexed keys.
+ Use a value of 1 to enable; 0 to disable. If storeKeys setting is
+ omitted, it is enabled. Enabled storeKeys
+ are required for updating and deleting records. Disable only
+ storeKeys to save space and only plan to index data once.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>storeData: <replaceable>value</replaceable></term>
+ <listitem>
+ <para>
+ Specifies whether &zebra; keeps a copy of indexed records.
+ Use a value of 1 to enable; 0 to disable. If storeData setting is
+ omitted, it is enabled. A storeData setting of 0 (disabled) makes
+ Zebra fetch records from the original locaction in the file
+ system using filename, file offset and file length. For the
+ DOM and ALVIS filter, the storeData setting is ignored.
</para>
</listitem>
</varlistentry>
mounted on a CD-ROM drive,
you may want &zebra; to make an internal copy of them. To do this,
you specify 1 (true) in the <literal>storeData</literal> setting. When
- the &z3950; server retrieves the records they will be read from the
+ the &acro.z3950; server retrieves the records they will be read from the
internal file structures of the system.
</para>
<para>
Consider a system in which you have a group of text files called
<literal>simple</literal>.
- That group of records should belong to a &z3950; database called
+ That group of records should belong to a &acro.z3950; database called
<literal>textbase</literal>.
The following <literal>zebra.cfg</literal> file will suffice:
</para>
information. If you have a group of records that explicitly associates
an ID with each record, this method is convenient. For example, the
record format may contain a title or a ID-number - unique within the group.
- In either case you specify the &z3950; attribute set and use-attribute
+ In either case you specify the &acro.z3950; attribute set and use-attribute
location in which this information is stored, and the system looks at
that field to determine the identity of the record.
</para>
<para>
For instance, the sample GILS records that come with the &zebra;
distribution contain a unique ID in the data tagged Control-Identifier.
- The data is mapped to the &bib1; use attribute Identifier-standard
+ The data is mapped to the &acro.bib1; use attribute Identifier-standard
(code 1007). To use this field as a record id, specify
<literal>(bib1,Identifier-standard)</literal> as the value of the
<literal>recordId</literal> in the configuration file.
The value of the <literal>register</literal> setting is a sequence
of tokens. Each token takes the form:
- <screen>
- <emphasis>dir</emphasis><literal>:</literal><emphasis>size</emphasis>.
- </screen>
+ <emphasis>dir</emphasis><literal>:</literal><emphasis>size</emphasis>
The <emphasis>dir</emphasis> specifies a directory in which index files
will be stored and the <emphasis>size</emphasis> specifies the maximum
<literal>k</literal> for kilobytes.
<literal>M</literal> for megabytes,
<literal>G</literal> for gigabytes.
+ Specifying a negative value disables the checking (it still needs the unit,
+ use <literal>-1b</literal>).
</para>
<para>
- For instance, if you have allocated two disks for your register, and
+ For instance, if you have allocated three disks for your register, and
the first disk is mounted
- on <literal>/d1</literal> and has 2GB of free space and the
- second, mounted on <literal>/d2</literal> has 3.6 GB, you could
- put this entry in your configuration file:
+ on <literal>/d1</literal> and has 2GB of free space, the
+ second, mounted on <literal>/d2</literal> has 3.6 GB, and the third,
+ on which you have more space than you bother to worry about, mounted on
+ <literal>/d3</literal> you could put this entry in your configuration file:
<screen>
- register: /d1:2G /d2:3600M
+ register: /d1:2G /d2:3600M /d3:-1b
</screen>
-
</para>
<para>
<title>Static Ranking</title>
<para>
- &zebra; uses internally inverted indexes to look up term occurencies
+ &zebra; uses internally inverted indexes to look up term frequencies
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations <literal>AND</literal>,
<literal>OR</literal> and/or <literal>NOT</literal> (which
</para>
<para>
The experimental <literal>alvis</literal> filter provides a
- directive to fetch static rank information out of the indexed &xml;
+ directive to fetch static rank information out of the indexed &acro.xml;
records, thus making <emphasis>all</emphasis> hit sets ordered
after <emphasis>ascending</emphasis> static
rank, and for those doc's which have the same static rank, ordered
indexing time (this is why we
call it ``dynamic ranking'' in the first place ...)
It is invoked by adding
- the &bib1; relation attribute with
- value ``relevance'' to the &pqf; query (that is,
+ the &acro.bib1; relation attribute with
+ value ``relevance'' to the &acro.pqf; query (that is,
<literal>@attr 2=102</literal>, see also
<ulink url="&url.z39.50;bib1.html">
- The &bib1; Attribute Set Semantics</ulink>, also in
+ The &acro.bib1; Attribute Set Semantics</ulink>, also in
<ulink url="&url.z39.50.attset.bib1;">HTML</ulink>).
To find all articles with the word <literal>Eoraptor</literal> in
- the title, and present them relevance ranked, issue the &pqf; query:
+ the title, and present them relevance ranked, issue the &acro.pqf; query:
<screen>
@attr 2=102 @attr 1=4 Eoraptor
</screen>
</para>
<sect3 id="administration-ranking-dynamic-rank1">
- <title>Dynamically ranking using &pqf; queries with the 'rank-1'
+ <title>Dynamically ranking using &acro.pqf; queries with the 'rank-1'
algorithm</title>
<para>
The default <literal>rank-1</literal> ranking module implements a
TF/IDF (Term Frequecy over Inverse Document Frequency) like
- algorithm. In contrast to the usual defintion of TF/IDF
+ algorithm. In contrast to the usual definition of TF/IDF
algorithms, which only considers searching in one full-text
index, this one works on multiple indexes at the same time.
More precisely,
<term>Query Components</term>
<listitem>
<para>
- First, the boolean query is dismantled into it's principal components,
+ First, the boolean query is dismantled into its principal components,
i.e. atomic queries where one term is looked up in one index.
For example, the query
<screen>
</para>
<para>
It is possible to apply dynamic ranking on only parts of the
- &pqf; query:
+ &acro.pqf; query:
<screen>
@and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
</screen>
</para>
<para>
Ranking weights may be used to pass a value to a ranking
- algorithm, using the non-standard &bib1; attribute type 9.
+ algorithm, using the non-standard &acro.bib1; attribute type 9.
This allows one branch of a query to use one value while
another branch uses a different one. For example, we can search
for <literal>utah</literal> in the
</para>
<para>
The default weight is
- sqrt(1000) ~ 34 , as the &z3950; standard prescribes that the top score
+ sqrt(1000) ~ 34 , as the &acro.z3950; standard prescribes that the top score
is 1000 and the bottom score is 0, encoded in integers.
</para>
<warning>
<!--
<sect3 id="administration-ranking-dynamic-rank1">
- <title>Dynamically ranking &pqf; queries with the 'rank-static'
+ <title>Dynamically ranking &acro.pqf; queries with the 'rank-static'
algorithm</title>
<para>
The dummy <literal>rank-static</literal> reranking/scoring
</sect3>
<sect3 id="administration-ranking-dynamic-cql">
- <title>Dynamically ranking &cql; queries</title>
+ <title>Dynamically ranking &acro.cql; queries</title>
<para>
- Dynamic ranking can be enabled during sever side &cql;
+ Dynamic ranking can be enabled during sever side &acro.cql;
query expansion by adding <literal>@attr 2=102</literal>
- chunks to the &cql; config file. For example
+ chunks to the &acro.cql; config file. For example
<screen>
relationModifier.relevant = 2=102
</screen>
- invokes dynamic ranking each time a &cql; query of the form
+ invokes dynamic ranking each time a &acro.cql; query of the form
<screen>
Z> querytype cql
Z> f alvis.text =/relevant house
</screen>
is issued. Dynamic ranking can also be automatically used on
- specific &cql; indexes by (for example) setting
+ specific &acro.cql; indexes by (for example) setting
<screen>
index.alvis.text = 1=text 2=102
</screen>
- which then invokes dynamic ranking each time a &cql; query of the form
+ which then invokes dynamic ranking each time a &acro.cql; query of the form
<screen>
Z> querytype cql
Z> f alvis.text = house
&zebra; sorts efficiently using special sorting indexes
(type=<literal>s</literal>; so each sortable index must be known
at indexing time, specified in the configuration of record
- indexing. For example, to enable sorting according to the &bib1;
+ indexing. For example, to enable sorting according to the &acro.bib1;
<literal>Date/time-added-to-db</literal> field, one could add the line
<screen>
xelm /*/@created Date/time-added-to-db:s
<para>
Indexing can be specified at searching time using a query term
carrying the non-standard
- &bib1; attribute-type <literal>7</literal>. This removes the
- need to send a &z3950; <literal>Sort Request</literal>
+ &acro.bib1; attribute-type <literal>7</literal>. This removes the
+ need to send a &acro.z3950; <literal>Sort Request</literal>
separately, and can dramatically improve latency when the client
and server are on separate networks.
The sorting part of the query is separate from the rest of the
</para>
<para>
A sorting subquery needs two attributes: an index (such as a
- &bib1; type-1 attribute) specifying which index to sort on, and a
+ &acro.bib1; type-1 attribute) specifying which index to sort on, and a
type-7 attribute whose value is be <literal>1</literal> for
ascending sorting, or <literal>2</literal> for descending. The
term associated with the sorting attribute is the priority of
on.
</para>
<para>For example, a search for water, sort by title (ascending),
- is expressed by the &pqf; query
+ is expressed by the &acro.pqf; query
<screen>
@or @attr 1=1016 water @attr 7=1 @attr 1=4 0
</screen>
<note>
<para>
Extended services are only supported when accessing the &zebra;
- server using the <ulink url="&url.z39.50;">&z3950;</ulink>
- protocol. The <ulink url="&url.sru;">&sru;</ulink> protocol does
+ server using the <ulink url="&url.z39.50;">&acro.z3950;</ulink>
+ protocol. The <ulink url="&url.sru;">&acro.sru;</ulink> protocol does
not support extended services.
</para>
</note>
storeKeys: 1
</screen>
The general record type should be set to any record filter which
- is able to parse &xml; records, you may use any of the two
+ is able to parse &acro.xml; records, you may use any of the two
declarations (but not both simultaneously!)
<screen>
- recordType: grs.xml
- # recordType: alvis.filter_alvis_config.xml
+ recordType: dom.filter_dom_conf.xml
+ # recordType: grs.xml
</screen>
+ Notice the difference to the specific instructions
+ <screen>
+ recordType.xml: dom.filter_dom_conf.xml
+ # recordType.xml: grs.xml
+ </screen>
+ which only work when indexing XML files from the filesystem using
+ the <literal>*.xml</literal> naming convention.
+ </para>
+ <para>
To enable transaction safe shadow indexing,
which is extra important for this kind of operation, set
<screen>
<para>
It is not possible to carry information about record types or
similar to &zebra; when using extended services, due to
- limitations of the <ulink url="&url.z39.50;">&z3950;</ulink>
+ limitations of the <ulink url="&url.z39.50;">&acro.z3950;</ulink>
protocol. Therefore, indexing filters can not be chosen on a
- per-record basis. One and only one general &xml; indexing filter
+ per-record basis. One and only one general &acro.xml; indexing filter
must be defined.
<!-- but because it is represented as an OID, we would need some
form of proprietary mapping scheme between record type strings and
OIDs. -->
<!--
However, as a minimum, it would be extremely useful to enable
- people to use &marc21;, assuming grs.marcxml.marc21 as a record
+ people to use &acro.marc21;, assuming grs.marcxml.marc21 as a record
type.
-->
</para>
<sect2 id="administration-extended-services-z3950">
- <title>Extended services in the &z3950; protocol</title>
+ <title>Extended services in the &acro.z3950; protocol</title>
<para>
- The <ulink url="&url.z39.50;">&z3950;</ulink> standard allows
+ The <ulink url="&url.z39.50;">&acro.z3950;</ulink> standard allows
servers to accept special binary <emphasis>extended services</emphasis>
protocol packages, which may be used to insert, update and delete
records into servers. These carry control and update
</para>
<table id="administration-extended-services-z3950-table" frame="top">
- <title>Extended services &z3950; Package Fields</title>
+ <title>Extended services &acro.z3950; Package Fields</title>
<tgroup cols="3">
<thead>
<row>
</row>
<row>
<entry><literal>record</literal></entry>
- <entry><literal>&xml; string</literal></entry>
- <entry>An &xml; formatted string containing the record</entry>
- </row>
- <row>
- <entry><literal>syntax</literal></entry>
- <entry><literal>'xml'</literal></entry>
- <entry>Only &xml; record syntax is supported</entry>
+ <entry><literal>&acro.xml; string</literal></entry>
+ <entry>An &acro.xml; formatted string containing the record</entry>
</row>
+ <row>
+ <entry><literal>syntax</literal></entry>
+ <entry><literal>'xml'</literal></entry>
+ <entry>XML/SUTRS/MARC. GRS-1 not supported.
+ The default filter (record type) as given by recordType in
+ zebra.cfg is used to parse the record.</entry>
+ </row>
<row>
<entry><literal>recordIdOpaque</literal></entry>
<entry><literal>string</literal></entry>
<entry>
- Optional client-supplied, opaque record
+ Optional client-supplied, opaque record
identifier used under insert operations.
</entry>
</row>
<para>
When retrieving existing
- records indexed with &grs1; indexing filters, the &zebra; internal
+ records indexed with &acro.grs1; indexing filters, the &zebra; internal
ID number is returned in the field
<literal>/*/id:idzebra/localnumber</literal> in the namespace
<literal>xmlns:id="http://www.indexdata.dk/zebra/"</literal>,
]]>
</screen>
Now the <literal>Default</literal> database was created,
- we can insert an &xml; file (esdd0006.grs
+ we can insert an &acro.xml; file (esdd0006.grs
from example/gils/records) and index it:
<screen>
<![CDATA[
<title>Extended services from yaz-php</title>
<para>
- Extended services are also available from the &yaz; &php; client layer. An
- example of an &yaz;-&php; extended service transaction is given here:
+ Extended services are also available from the &yaz; &acro.php; client layer. An
+ example of an &yaz;-&acro.php; extended service transaction is given here:
<screen>
<![CDATA[
$record = '<record><title>A fine specimen of a record</title></record>';
</screen>
</para>
</sect2>
+
+ <sect2 id="administration-extended-services-debugging">
+ <title>Extended services debugging guide</title>
+ <para>
+ When debugging ES over PHP we recommend the following order of tests:
+ </para>
+
+ <itemizedlist>
+ <listitem>
+ <para>
+ Make sure you have a nice record on your filesystem, which you can
+ index from the filesystem by use of the zebraidx command.
+ Do it exactly as you planned, using one of the GRS-1 filters,
+ or the DOMXML filter.
+ When this works, proceed.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Check that your server setup is OK before you even coded one single
+ line PHP using ES.
+ Take the same record form the file system, and send as ES via
+ <literal>yaz-client</literal> like described in
+ <xref linkend="administration-extended-services-yaz-client"/>,
+ and
+ remember the <literal>-a</literal> option which tells you what
+ goes over the wire! Notice also the section on permissions:
+ try
+ <screen>
+ perm.anonymous: rw
+ </screen>
+ in <literal>zebra.cfg</literal> to make sure you do not run into
+ permission problems (but never expose such an insecure setup on the
+ internet!!!). Then, make sure to set the general
+ <literal>recordType</literal> instruction, pointing correctly
+ to the GRS-1 filters,
+ or the DOMXML filters.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If you insist on using the <literal>sysno</literal> in the
+ <literal>recordIdNumber</literal> setting,
+ please make sure you do only updates and deletes. Zebra's internal
+ system number is not allowed for
+ <literal>recordInsert</literal> or
+ <literal>specialUpdate</literal> actions
+ which result in fresh record inserts.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If <literal>shadow register</literal> is enabled in your
+ <literal>zebra.cfg</literal>, you must remember running the
+ <screen>
+ Z> adm-commit
+ </screen>
+ command as well.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ If this works, then proceed to do the same thing in your PHP script.
+ </para>
+ </listitem>
+ </itemizedlist>
+
+
+ </sect2>
+
</sect1>
</chapter>