Charmap work

diff --git a/doc/zebra.sgml b/doc/zebra.sgml

index f268655..b97e539 100644
--- a/doc/zebra.sgml
+++ b/doc/zebra.sgml
@@ -1,13 +1,13 @@
  <!doctype linuxdoc system>
  
  <!--
-  $Id: zebra.sgml,v 1.20 1996-03-18 10:48:13 quinn Exp $
+  $Id: zebra.sgml,v 1.27 1996-06-04 08:21:13 quinn Exp $
  -->
  
  <article>
  <title>Zebra Server - Administrator's Guide and Reference
  <author><htmlurl url="http://www.indexdata.dk/" name="Index Data">, <tt><htmlurl url="mailto:info@index.ping.dk" name="info@index.ping.dk"></>
-<date>$Revision: 1.20 $
+<date>$Revision: 1.27 $
  <abstract>
  The Zebra information server combines a versatile fielded/free-text
  search engine with a Z39.50-1995 frontend to provide a powerful and flexible
@@ -49,7 +49,7 @@ mailing-list by sending Email to <tt/zebra-request@index.ping.dk/.
  <sect1>Features
  
  <p>
-This is a listof some of the most important features of the
+This is a list of some of the most important features of the
  system.
  
  <itemize>
@@ -71,6 +71,11 @@ SGML-like syntax which allows nested (structured) data elements, as
  well as variant forms of data.
  
  <item>
+Supports arbitrary storage formats. A system of input filters driven by
+regular expressions allows you to easily process most ASCII-based
+data formats.
+
+<item>
  Supports boolean queries as well as relevance-ranking (free-text)
  searching. Right truncation and masking in terms are supported, as
  well as full regular expressions.
@@ -82,6 +87,10 @@ ISO2709 (*MARC). Records can be mapped between record syntaxes and
  schema on the fly.
  
  <item>
+Supports approximate matching in registers (i.e. spelling mistakes,
+etc.).
+
+<item>
  Protocol support:
  
  <itemize>
@@ -139,11 +148,6 @@ last beta release.
  <itemize>
  
  <item>
-*Allow the system to handle other input formats. Specifically
-MARC records and general, structured ASCII records (such as mail/news
-files) parameterized by regular expressions.
-
-<item>
  *Complete the support for variants. Finalize support for the WAIS
  retrieval methodology.
  
@@ -159,7 +163,7 @@ Add index and data compression to save disk space.
  
  <item>
  Add more sophisticated relevance ranking mechanisms. Add support for soundex
-and stemming. Add relevance feedback support.
+and stemming. Add relevance <it/feedback/ support.
  
  <item>
  Add Explain support.
@@ -172,10 +176,6 @@ variant pieces.
  Support the Item Update extended service of the protocol.
  
  <item>
-The Zebra search engine supports approximate string matching in the
-index. We'd like to find a way to support and control this from RPN.
-
-<item>
  We want to add a management system that allows you to
  control your databases and configuration tables from a graphical
  interface. We'll probably use Tcl/Tk to stay platform-independent.
@@ -191,25 +191,13 @@ contact info at the end of this file.
  <sect>Compiling the software
  
  <p>
-Zebra uses the YAZ package to implement Z39.50, so you
-have to compile YAZ before going further. Specifically, Zebra uses
-the YAZ header files in <tt>yaz/include/..</tt> and its public library
-<tt>yaz/lib/libyaz.a</tt>.
-
-As with YAZ, an ANSI C compiler is required in order to compile the Zebra
+An ANSI C compiler is required to compile the Zebra
  server system &mdash; <tt/gcc/ works fine if your own system doesn't
  provide an adequate compiler.
  
-Unpack the Zebra software. You might put Zebra in the same directory level
-as YAZ, for example if YAZ is placed in ..<tt>/src/yaz-xxx</tt>, then
-Zebra is placed in ..<tt>/src/zebra-yyy</tt>.
-
-Edit the top-level <tt>Makefile</tt> in the Zebra directory in which
-you specify the location of YAZ by setting make variables.
-The <tt>OSILIB</tt> should be empty if YAZ wasn't compiled with
-MOSI support. Some systems, such as Solaris, have separate socket
-libraries and for those systems you need to specify the
-<tt>NETLIB</tt> variable.
+Unpack the distribution archive. In some cases, you may want to edit
+the top-level <tt/Makefile/, e.g. to select a different C compiler, or
+to specify machine-specific libraries in the <bf/NETLIB/ variable.
  
  When you are done editing the <tt>Makefile</tt> type:
  <tscreen><verb>
@@ -223,16 +211,16 @@ If successful, two executables have been created in the sub-directory
  <tag><tt>zebraidx</tt></tag> The administrative tool for the search index.
  </descrip>
  
-<sect>Quick Start
-
+<sect>Quick Start 
  <p>
-This section will get you started quickly! We will try to index a few sample
-GILS records that are included with the Zebra distribution. Go to the
-<tt>test</tt> subdirectory. There you will find a configuration
+In this section, we will test the system by indexing a small set of sample
+GILS records that are included with the software distribution. Go to the
+<tt>test</tt> subdirectory of the distribution archive. There you will
+find a configuration
  file named <tt>zebra.cfg</tt> with the following contents:
  <tscreen><verb>
  # Where are the YAZ tables located.
-profilePath: /usr/local/yaz
+profilePath: ../../yaz/tab ../tab
  
  # Files that describe the attribute sets supported.
  attset: bib1.att
@@ -240,7 +228,8 @@ attset: gils.att
  </verb></tscreen>
  
  Now, edit the file and set <tt>profilePath</tt> to the path of the
-YAZ profile tables (sub directory <tt>tab</tt> of YAZ).
+YAZ profile tables (sub directory <tt>tab</tt> of the YAZ distribution
+archive).
  
  The 48 test records are located in the sub directory <tt>records</tt>.
  To index these, type:
@@ -258,8 +247,11 @@ fire up a server. To start a server on port 2100, type:
  $ ../index/zebrasrv tcp:@:2100
  </verb></tscreen>
  
-The Zebra index that you've just made has one database called Default. It will
-return either USMARC, GRS-1, or SUTRS depending on what your client asks
+The Zebra index that you have just created has a single database
+named <tt/Default/. The database contains records structured according to
+the GILS profile, and the server will
+return records in either USMARC, GRS-1, or SUTRS depending
+on what your client asks
  for.
  
  To test the server, you can use any Z39.50 client (1992 or later). For
@@ -285,9 +277,16 @@ Z>format sutrs
  Z>show 1
  Z>format grs-1
  Z>show 1
+Z>elements B
+Z>show 1
  </verb></tscreen>
  
-If you've made it this far, there's a reasonably good chance that
+<it>NOTE: You may notice that more fields are returned when your
+client requests SUTRS or GRS-1 records. When retrieving GILS records,
+this is normal - not all of the GILS data elements have mappings in
+the USMARC record format.</it>
+
+If you've made it this far, there's a good chance that
  you've got through the compilation OK.
  
  <sect>Administrating Zebra<label id="administrating">
@@ -353,7 +352,7 @@ Parameter names and values are separated by colons in the file. Lines
  starting with a hash sign (<tt/&num;/) are treated as comments.
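
The file format described above (colon-separated names and values, with hash-sign comments) can be illustrated with a short Python sketch. This is not Zebra's own parser: the function name `parse_zebra_cfg`, the choice to collect repeated settings into a list, and the choice to skip lines without a colon are all assumptions made for illustration. The sample lines are taken from the configuration excerpts shown elsewhere in this document.

```python
def parse_zebra_cfg(text):
    """Parse 'name: value' lines, skipping blank lines and '#' comments.

    Hypothetical helper, not part of Zebra. Repeated names (such as
    several attset lines) are collected into a list in file order.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # assumption: ignore lines that have no colon
        settings.setdefault(name.strip(), []).append(value.strip())
    return settings


sample = """\
# Where are the YAZ tables located.
profilePath: ../../yaz/tab ../tab

# Files that describe the attribute sets supported.
attset: bib1.att
attset: gils.att
simple.database: textbase
"""

cfg = parse_zebra_cfg(sample)
```

A group-qualified setting such as `simple.database` is kept under its full name here; how Zebra itself resolves group names is not modelled.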
  
  If you manage different sets of records that share common
-caracteristics, you can organize the configuration settings for each
+characteristics, you can organize the configuration settings for each
  type into &dquot;groups&dquot;.
  When <tt>zebraidx</tt> is run and you wish to address a given group
  you specify the group name with the <tt>-g</tt> option. In this case
@@ -381,49 +380,56 @@ The available configuration settings are summarized below. They will be
  explained further in the following sections.
  
  <descrip>
-<tag>&lsqb;<it>group</it>.&rsqb;recordType&lsqb;<it>.name</it>&rsqb;</tag>
+<tag><it>group</it>.recordType&lsqb;<it>.name</it>&rsqb;</tag>
   Specifies how records with the file extension <it>name</it> should
   be handled by the indexer. This option may also be specified
   as a command line option (<tt>-t</tt>). Note that if you do not
   specify a <it/name/, the setting applies to all files.
-<tag>&lsqb;<it>group</it>.&rsqb;recordId</tag>
- Specifies how the records are to be identified when updated.
-<tag>&lsqb;<it>group</it>.&rsqb;database</tag>
+<tag><it>group</it>.recordId</tag>
+ Specifies how the records are to be identified when updated. See
+section <ref id="locating-records" name="Locating Records">.
+<tag><it>group</it>.database</tag>
   Specifies the Z39.50 database name.
-<tag>&lsqb;<it>group</it>.&rsqb;storeKeys</tag>
+<tag><it>group</it>.storeKeys</tag>
   Specifies whether key information should be saved for a given
   group of records. If you plan to update/delete this type of
   records later this should be specified as 1; otherwise it
- should be 0 (default), to save register space.
-<tag>&lsqb;<it>group</it>.&rsqb;storeData</tag>
+ should be 0 (default), to save register space. See section
+<ref id="file-ids" name="Indexing With File Record IDs">.
+<tag><it>group</it>.storeData</tag>
   Specifies whether the records should be stored internally
   in the Zebra system files. If you want to maintain the raw records yourself,
   this option should be false (0). If you want Zebra to take care of the records
   for you, it should be true(1).
  <tag>register</tag> 
   Specifies the location of the various register files that Zebra uses
- to represent your databases.
+ to represent your databases. See section
+<ref id="register-location" name="Register Location">.
  <tag>shadow</tag>
   Enables the <it/safe update/ facility of Zebra, and tells the system
- where to place the required, temporary files.
+ where to place the required, temporary files. See section
+<ref id="shadow-registers" name="Safe Updating - Using Shadow Registers">.
+<tag>lockPath</tag>
+ Directory in which various lock files are stored.
  <tag>tempSetPath</tag>
   Specifies the directory that the server uses for temporary result sets.
   If not specified <tt>/tmp</tt> will be used.
  <tag>profilePath</tag>
- Specifies the location of profile specification paths.
+ Specifies the location of profile specification files.
  <tag>attset</tag> 
   Specifies the filename(s) of attribute set files for use in
   searching. At least the Bib-1 set should be loaded (<tt/bib1.att/).
   The <tt/profilePath/ setting is used to look for the specified files.
+ See section <ref id="attset-files" name="The Attribute Set Files">.
  </descrip>
  
-<sect1>Locating Records
+<sect1>Locating Records<label id="locating-records">
  <p>
  The default behaviour of the Zebra system is to reference the
  records from their original location, i.e. where they were found when you
  ran <tt/zebraidx/. That is, when a client wishes to retrieve a record
  following a search operation, the files are accessed from the place
-where you originally put them - if you remove the files (whithout
+where you originally put them - if you remove the files (without
  running <tt/zebraidx/ again), the client will receive a diagnostic
  message.
  
@@ -465,7 +471,7 @@ simple.database: textbase
  Since the existing records in an index can not be addressed by their
  IDs, it is impossible to delete or modify records when using this method.
  
-<sect1>Indexing with File Record IDs
+<sect1>Indexing with File Record IDs<label id="file-ids">
  
  <p>
  If you have a set of files that regularly change over time: Old files
@@ -817,7 +823,9 @@ Registers">).
  
  </descrip>
  
-<sect>Running the Z39.50 Server (zebrasrv)
+<sect>The Z39.50 Server
+
+<sect1>Running the Z39.50 Server (zebrasrv)
  
  <p>
  <bf/Syntax/
@@ -857,7 +865,12 @@ privileged port.
  
  <tag>-w <it/working-directory/</tag>Change working directory.
  
-<tag/-i/Run under the Internet superserver, <tt/inetd/.
+<tag>-i <it/minutes/</tag>Run under the Internet superserver, <tt/inetd/.
+
+<tag>-t <it/timeout/</tag>Set the idle session timeout (default 60 minutes).
+
+<tag>-k <it/kilobytes/</tag>Set the (approximate) maximum size of
+present response messages. Default is 1024 Kb (1 Mb).
  </descrip>
  
  A <it/listener-address/ consists of a transport mode followed by a
@@ -908,6 +921,92 @@ a dedicated IR server account.
  The default behavior for <tt/zebrasrv/ is to establish a single TCP/IP
  listener, for the Z39.50 protocol, on port 9999.
  
+<sect1>Z39.50 Protocol Support and Behavior
+
+<sect2>Initialization
+
+<p>
+During initialization, the server will negotiate to version 3 of the
+Z39.50 protocol, and the option bits for Search, Present, Scan,
+NamedResultSets, and concurrentOperations will be set, if requested by
+the client. The maximum PDU size is negotiated down to a maximum of
+1Mb by default.
+
+<sect2>Search
+
+<p>
+The supported query types are 1 and 101. All operators except PROXIMITY
+are currently supported. Queries can be arbitrarily complex. Named
+result sets are supported, and result sets can be used as operands
+with no limitations. Searches may span multiple databases.
+
+The server has full support for piggy-backed present requests (see
+also the following section).
+
+<bf/Use/ attributes are interpreted according to the attribute sets which
+have been loaded in the <tt/zebra.cfg/ file, and are matched against
+specific fields as specified in the <tt/.abs/ file which describes the
+profile of the records which have been loaded. If no <bf/Use/
+attribute is provided, a default of <bf/Any/ is assumed.
+
+If a <bf/Structure/ attribute of <bf/Phrase/ is used in conjunction with a
+<bf/Completeness/ attribute of <bf/Complete (Sub)field/, the term is
+matched against the contents of a phrase (long word) register, if one
+exists for the given <bf/Use/ attribute. If <bf/Structure/=<bf/Phrase/
+is used in conjunction with <bf/Incomplete Field/ (the default value
+for <bf/Completeness/), the search is directed against the normal word
+registers, but if the term contains multiple words, the term will only
+match if all of the words are found immediately adjacent, and in the
+given order. If the <bf/Structure/ attribute is <bf/Word List/,
+<bf/Free-form Text/, or <bf/Document Text/, the term is treated as a
+natural-language, relevance-ranked query.
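
The adjacency rule for a multi-word phrase searched against the word registers can be sketched in Python as follows. This is only an illustration of the rule "all words immediately adjacent, in the given order"; the helper names `index_field` and `phrase_matches`, the whitespace tokenization, and the in-memory position lists are assumptions and do not describe Zebra's register layout.

```python
def index_field(text):
    """Map each lower-cased word to the list of its token offsets."""
    positions = {}
    for offset, token in enumerate(text.lower().split()):
        positions.setdefault(token, []).append(offset)
    return positions


def phrase_matches(positions, words):
    """True if the words occur immediately adjacent and in the given order."""
    if not words:
        return False
    first, rest = words[0], words[1:]
    for start in positions.get(first, []):
        # Word number i of the phrase must sit exactly i tokens after the first.
        if all(start + i in positions.get(w, ()) for i, w in enumerate(rest, 1)):
            return True
    return False


field = index_field("Zebra information server combines a versatile search engine")
in_order = phrase_matches(field, ["information", "server"])    # adjacent, in order
reversed_order = phrase_matches(field, ["server", "information"])
with_gap = phrase_matches(field, ["zebra", "server"])          # one word in between
```

Only the first call succeeds: the second has the words in the wrong order, and the third has another word between the two terms.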
+
+If the <bf/Relation/ attribute is <bf/Equals/ (default), the term is
+matched in a normal fashion (modulo truncation and processing of
+individual words, if required). If <bf/Relation/ is <bf/Less Than/,
+<bf/Less Than or Equal/, <bf/Greater Than/, or <bf/Greater Than or
+Equal/, the term is assumed to be numerical, and a standard regular
+expression is constructed to match the given expression. If
+<bf/Relation/ is <bf/Relevance/, the standard natural-language query
+processor is invoked.
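
To make the idea of matching a numerical relation with a regular expression concrete, here is a minimal Python sketch for the "greater than" case. The text above does not say how Zebra builds its expression, so this is one possible construction under stated assumptions: terms are non-negative integers written without leading zeros, and the function name `greater_than_regex` is hypothetical.

```python
import re


def greater_than_regex(n):
    """Compile a regex matching decimal integers strictly greater than n.

    Assumes n >= 0 and that candidate terms have no leading zeros.
    """
    digits = str(n)
    k = len(digits)
    # Any number with more digits than n is larger.
    alternatives = [r"[1-9][0-9]{%d,}" % k]
    # Same length: share a prefix with n, then one larger digit, then anything.
    for i, ch in enumerate(digits):
        d = int(ch)
        if d < 9:
            alternatives.append(
                "%s[%d-9][0-9]{%d}" % (digits[:i], d + 1, k - i - 1)
            )
    return re.compile("(?:" + "|".join(alternatives) + ")")


rx = greater_than_regex(42)
matches_43 = bool(rx.fullmatch("43"))
matches_42 = bool(rx.fullmatch("42"))
matches_100 = bool(rx.fullmatch("100"))
```

The other relations could be built along the same lines, for example "less than" by choosing a smaller digit at the first differing position.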
+
+For the <bf/Truncation/ attribute, <bf/No Truncation/ is the default.
+<bf/Left Truncation/ is not supported. <bf/Process &num;/ is supported, as
+is <bf/Regxp-1/. <bf/Regxp-2/ enables the fault-tolerant (fuzzy)
+search. As a default, a single error (deletion, insertion,
+replacement) is accepted when terms are matched against the register
+contents.
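
The "single error" tolerance can be illustrated with a plain edit-distance check in Python. This is a sketch of the matching criterion only (at most one deletion, insertion, or replacement); it is not how the Zebra engine searches its registers, and the names `edit_distance`, `fuzzy_lookup`, and the sample `register` list are assumptions for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete a character of a
                curr[j - 1] + 1,            # insert a character of b
                prev[j - 1] + (ca != cb),   # replace, or keep if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_lookup(term, register, max_errors=1):
    """Return the register entries within max_errors edits of term."""
    return [w for w in register if edit_distance(term, w) <= max_errors]


register = ["zebra", "zebras", "cobra", "index", "indexes"]
exact_and_near = fuzzy_lookup("zebra", register)   # ['zebra', 'zebras']
misspelled = fuzzy_lookup("indx", register)        # ['index']
```

A linear scan over every term is used here for clarity; a real index would need a more efficient way to enumerate near matches.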
+
+<sect2>Present
+
+<p>
+The present facility is supported in a standard fashion. The requested
+record syntax is matched against the ones supported by the profile of
+each record retrieved. If no record syntax is given, SUTRS is the
+default. The requested element set name, again, is matched against any
+provided by the relevant record profiles.
+
+<sect2>Scan
+
+<p>
+The attribute combinations provided with the TermListAndStartPoint are
+processed in the same way as operands in a query (see above).
+Currently, only the term and the globalOccurrences are returned with
+the TermInfo structure.
+
+<sect2>Close
+
+<p>
+If a Close PDU is received, the server will respond with a Close PDU
+with reason=FINISHED, no matter which protocol version was negotiated
+during initialization. If the protocol version is 3 or more, the
+server will generate a Close PDU under certain circumstances,
+including a session timeout (60 minutes by default), and certain kinds of
+protocol errors. Once a Close PDU has been sent, the protocol
+association is considered broken, and the transport connection will be
+closed immediately upon receipt of further data, or following a short
+timeout.
+
  <sect>The Record Model
  
  <p>
@@ -1479,7 +1578,7 @@ elm (4,70)/(4,90)/(4,2) distributorStreetAddress    !
  elm (4,70)/(4,90)/(4,3) distributorCity             !
  </verb></tscreen>
  
-<sect2>The Attribute Set (.att) Files
+<sect2>The Attribute Set (.att) Files<label id="attset-files">
  
  <p>
  This file type describes the <bf/Use/ elements of an attribute set.
@@ -1806,7 +1905,7 @@ belonging to the Explain schema.
  <sect>License
  
  <p>
-Copyright &copy; 1995, Index Data.
+Copyright &copy; 1995,1996 Index Data.
  
  All rights reserved.