Bug fix: Leading and trailing white space weren't removed in scan tokens.

diff --git a/doc/zebra.sgml b/doc/zebra.sgml
index 923d865..a50f22b 100644
--- a/doc/zebra.sgml
+++ b/doc/zebra.sgml
@@ -1,13 +1,13 @@
  <!doctype linuxdoc system>
  
  <!--
-  $Id: zebra.sgml,v 1.24 1996-04-17 09:21:29 quinn Exp $
+  $Id: zebra.sgml,v 1.34 1997-01-02 10:49:30 quinn Exp $
  -->
  
  <article>
  <title>Zebra Server - Administrator's Guide and Reference
  <author><htmlurl url="http://www.indexdata.dk/" name="Index Data">, <tt><htmlurl url="mailto:info@index.ping.dk" name="info@index.ping.dk"></>
-<date>$Revision: 1.24 $
+<date>$Revision: 1.34 $
  <abstract>
  The Zebra information server combines a versatile fielded/free-text
  search engine with a Z39.50-1995 frontend to provide a powerful and flexible
@@ -49,7 +49,7 @@ mailing-list by sending Email to <tt/zebra-request@index.ping.dk/.
  <sect1>Features
  
  <p>
-This is a listof some of the most important features of the
+This is a list of some of the most important features of the
  system.
  
  <itemize>
@@ -71,6 +71,11 @@ SGML-like syntax which allows nested (structured) data elements, as
  well as variant forms of data.
  
  <item>
+Supports arbitrary storage formats. A system of input filters driven by
+regular expressions allows you to easily process most ASCII-based
+data formats.
+
+<item>
  Supports boolean queries as well as relevance-ranking (free-text)
  searching. Right truncation and masking in terms are supported, as
  well as full regular expressions.
@@ -82,6 +87,10 @@ ISO2709 (*MARC). Records can be mapped between record syntaxes and
  schema on the fly.
  
  <item>
+Supports approximate matching in registers (i.e. spelling mistakes,
+etc.).
+
+<item>
  Protocol support:
  
  <itemize>
@@ -139,11 +148,6 @@ last beta release.
  <itemize>
  
  <item>
-*Allow the system to handle other input formats. Specifically
-MARC records and general, structured ASCII records (such as mail/news
-files) parameterized by regular expressions.
-
-<item>
  *Complete the support for variants. Finalize support for the WAIS
  retrieval methodology.
  
@@ -155,11 +159,8 @@ data elements in records.
  *Port the system to Windows NT.
  
  <item>
-Add index and data compression to save disk space.
-
-<item>
  Add more sophisticated relevance ranking mechanisms. Add support for soundex
-and stemming. Add relevance feedback support.
+and stemming. Add relevance <it/feedback/ support.
  
  <item>
  Add Explain support.
@@ -172,10 +173,6 @@ variant pieces.
  Support the Item Update extended service of the protocol.
  
  <item>
-The Zebra search engine supports approximate string matching in the
-index. We'd like to find a way to support and control this from RPN.
-
-<item>
  We want to add a management system that allows you to
  control your databases and configuration tables from a graphical
  interface. We'll probably use Tcl/Tk to stay platform-independent.
@@ -225,6 +222,9 @@ profilePath: ../../yaz/tab ../tab
  # Files that describe the attribute sets supported.
  attset: bib1.att
  attset: gils.att
+
+# Name of character map file.
+charMap: scan.chr
  </verb></tscreen>
  
  Now, edit the file and set <tt>profilePath</tt> to the path of the
@@ -234,11 +234,11 @@ archive).
  The 48 test records are located in the sub directory <tt>records</tt>.
  To index these, type:
  <tscreen><verb>
-$ ../index/zebraidx -t grs update records
+$ ../index/zebraidx -t grs.sgml update records
  </verb></tscreen>
  
  In the command above the option <tt>-t</tt> specified the record
-type &mdash; in this case <tt>grs</tt>. The word <tt>update</tt> followed
+type &mdash; in this case <tt>grs.sgml</tt>. The word <tt>update</tt> followed
  by a directory root updates all files below that directory node.
  
  If your indexing command was successful, you are now ready to
@@ -361,13 +361,12 @@ by <tt>zebraidx</tt>. If no <tt/-g/ option is specified, the settings
  with no prefix are used.
  
  In the configuration file, the group name is placed before the option
-name
-itself, separated by a dot (.). For instance, to set the record type
-for group <tt/public/ to <tt/grs/ (the common format for structured
+name itself, separated by a dot (.). For instance, to set the record type
+for group <tt/public/ to <tt/grs.sgml/ (the SGML-like format for structured
  records) you would write:
  
  <tscreen><verb>
-public.recordType: grs
+public.recordType: grs.sgml
  </verb></tscreen>
  
  To set the default value of the record type to <tt/text/ write:
@@ -384,8 +383,12 @@ explained further in the following sections.
   Specifies how records with the file extension <it>name</it> should
   be handled by the indexer. This option may also be specified
   as a command line option (<tt>-t</tt>). Note that if you do not
- specify a <it/name/, the setting applies to all files.
-<tag><it>group</it>.recordId</tag>
+ specify a <it/name/, the setting applies to all files. In general,
+ the record type specifier consists of the following elements,
+ separated by dots: <it>fundamental-type</it>,
+ <it>file-read-type</it> and arguments. Currently, two
+ fundamental types exist, <tt>text</tt> and <tt>grs</tt>.
+ <tag><it>group</it>.recordId</tag>
   Specifies how the records are to be identified when updated. See
  section <ref id="locating-records" name="Locating Records">.
  <tag><it>group</it>.database</tag>
@@ -409,9 +412,12 @@ section <ref id="locating-records" name="Locating Records">.
   Enables the <it/safe update/ facility of Zebra, and tells the system
   where to place the required, temporary files. See section
  <ref id="shadow-registers" name="Safe Updating - Using Shadow Registers">.
-<tag>lockPath</tag>
+<tag>lockDir</tag>
   Directory in which various lock files are stored.
-<tag>tempSetPath</tag>
+<tag>keyTmpDir</tag>
+ Directory in which temporary files used during the update phase of
+ <tt/zebraidx/ are stored.
+<tag>setTmpDir</tag>
   Specifies the directory that the server uses for temporary result sets.
   If not specified <tt>/tmp</tt> will be used.
  <tag>profilePath</tag>
@@ -421,8 +427,13 @@ section <ref id="locating-records" name="Locating Records">.
   searching. At least the Bib-1 set should be loaded (<tt/bib1.att/).
   The <tt/profilePath/ setting is used to look for the specified files.
   See section <ref id="attset-files" name="The Attribute Set Files">
+<tag>charMap</tag>
+ Specifies the filename of a character mapping. Zebra searches the
+ directories given in <tt>profilePath</tt> to locate this file.
+<tag>memMax</tag>
+ Specifies the amount of internal memory to use for the zebraidx program.
+ The amount is given in megabytes; the default is 4 (4 MB).
  </descrip>
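
The dotted record type specifier described under <tt>recordType</tt> above can be illustrated with a small Python sketch. This is not Zebra code: the function name <tt>parse_record_type</tt> and the <tt>RecordType</tt> tuple are hypothetical, and keeping everything after the second dot as a single argument string is an assumption.

```python
from typing import NamedTuple, Optional


class RecordType(NamedTuple):
    """Illustrative container for the parts of a record type specifier."""
    fundamental: str            # "text" or "grs"
    file_read: Optional[str]    # e.g. "sgml" or "regx"; None if absent
    arguments: Optional[str]    # e.g. a filter filename; None if absent


def parse_record_type(spec: str) -> RecordType:
    # Split on at most two dots; the remainder is kept as one argument
    # string (an assumption, since the manual does not say how
    # arguments containing dots are treated).
    parts = spec.split(".", 2)
    fundamental = parts[0]
    if fundamental not in ("text", "grs"):
        raise ValueError(f"unknown fundamental type: {fundamental!r}")
    file_read = parts[1] if len(parts) > 1 else None
    arguments = parts[2] if len(parts) > 2 else None
    return RecordType(fundamental, file_read, arguments)


if __name__ == "__main__":
    for spec in ("text", "grs.sgml", "grs.regx.myfilter"):
        print(spec, "->", parse_record_type(spec))
```

Here <tt>myfilter</tt> is a made-up filter name used only for illustration.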
-
  <sect1>Locating Records<label id="locating-records">
  <p>
  The default behaviour of the Zebra system is to reference the
@@ -823,7 +834,9 @@ Registers">).
  
  </descrip>
  
-<sect>Running the Z39.50 Server (zebrasrv)
+<sect>The Z39.50 Server
+
+<sect1>Running the Z39.50 Server (zebrasrv)
  
  <p>
  <bf/Syntax/
@@ -863,7 +876,12 @@ privileged port.
  
  <tag>-w <it/working-directory/</tag>Change working directory.
  
-<tag/-i/Run under the Internet superserver, <tt/inetd/.
+<tag>-i <it/minutes/</tag>Run under the Internet superserver, <tt/inetd/.
+
+<tag>-t <it/timeout/</tag>Set the idle session timeout (default 60 minutes).
+
+<tag>-k <it/kilobytes/</tag>Set the (approximate) maximum size of
+present response messages. The default is 1024 KB (1 MB).
  </descrip>
  
  A <it/listener-address/ consists of a transport mode followed by a
@@ -914,6 +932,212 @@ a dedicated IR server account.
  The default behavior for <tt/zebrasrv/ is to establish a single TCP/IP
  listener, for the Z39.50 protocol, on port 9999.
  
+<sect1>Z39.50 Protocol Support and Behavior
+
+<sect2>Initialization
+
+<p>
+During initialization, the server will negotiate to version 3 of the
+Z39.50 protocol, and the option bits for Search, Present, Scan,
+NamedResultSets, and concurrentOperations will be set, if requested by
+the client. The maximum PDU size is negotiated down to a maximum of
+1 MB by default.
+
+<sect2>Search
+
+<p>
+The supported query types are 1 and 101. All operators are currently
+supported except that only proximity units of type "word" are supported
+for the proximity operator. Queries can be arbitrarily complex. Named
+result sets are supported, and result sets can be used as operands with
+no limitations. Searches may span multiple databases.
+
+The server has full support for piggy-backed present requests (see
+also the following section).
+
+<bf/Use/ attributes are interpreted according to the attribute sets which
+have been loaded in the <tt/zebra.cfg/ file, and are matched against
+specific fields as specified in the <tt/.abs/ file which describes the
+profile of the records which have been loaded. If no <bf/Use/
+attribute is provided, a default of <bf/Any/ is assumed.
+
+If a <bf/Structure/ attribute of <bf/Phrase/ is used in conjunction with a
+<bf/Completeness/ attribute of <bf/Complete (Sub)field/, the term is
+matched against the contents of a phrase (long word) register, if one
+exists for the given <bf/Use/ attribute. If <bf/Structure/=<bf/Phrase/
+is used in conjunction with <bf/Incomplete Field/ - the default value
+for <bf/Completeness/, the search is directed against the normal word
+registers, but if the term contains multiple words, the term will only
+match if all of the words are found immediately adjacent, and in the
+given order. If the <bf/Structure/ attribute is <bf/Word List/,
+<bf/Free-form Text/, or <bf/Document Text/, the term is treated as a
+natural-language, relevance-ranked query.
+
+If the <bf/Relation/ attribute is <bf/Equals/ (default), the term is
+matched in a normal fashion (modulo truncation and processing of
+individual words, if required). If <bf/Relation/ is <bf/Less Than/,
+<bf/Less Than or Equal/, <bf/Greater Than/, or <bf/Greater Than or
+Equal/, the term is assumed to be numerical, and a standard regular
+expression is constructed to match the given expression. If
+<bf/Relation/ is <bf/Relevance/, the standard natural-language query
+processor is invoked.
+
+For the <bf/Truncation/ attribute, <bf/No Truncation/ is the default.
+<bf/Left Truncation/ is not supported. <bf/Process &num;/ is supported, as
+is <bf/Regxp-1/. <bf/Regxp-2/ enables the fault-tolerant (fuzzy)
+search. As a default, a single error (deletion, insertion,
+replacement) is accepted when terms are matched against the register
+contents.
+
+Zebra interprets queries in one of the following ways:
+<descrip>
+<tag>1 Phrase search</tag>
+ Each token separated by white space is truncated according to the
+ value of the truncation attribute. If the completeness attribute
+ is <bf/complete subfield/, the search is directed to the phrase
+ register. For other completeness attribute values the term is split
+ into tokens according to the white-space specification in the
+ character map. Only records in which each token exists in the order
+ specified are matched.
+<tag>2 Word search</tag>
+ The token is truncated according to the value of the truncation attribute.
+ The completeness attribute is ignored.
+<tag>3 Ranked search</tag>
+ Each token separated by white space is truncated according to the value
+ of the truncation attribute. The completeness attribute is ignored.
+<tag>4 Numeric relation</tag>
+ The token should consist of decimal digits. The integer is matched
+ against integers in the register according to the relation attribute.
+ The truncation and completeness attributes are ignored.
+<tag>5 Document identifier</tag>
+ The token consists of exactly one document identifier. The
+ truncation and completeness attributes are ignored.
+</descrip>
+
+For ranked searches the result sets are ranked and a score
+is associated with each record. Result sets from the other
+four types are not ranked.
+
+Combinations of the structure attribute and the relation attribute
+determine how the query is interpreted. The two following tables
+define how.
+
+<verb>
+                              Structure Attribute (4)
+                        none     phrase(1)  word(2)   word list(6)
+
+           none          1         1         2         3
+           =   (3)       1         1         2         3
+           <   (1)       4         4         4         4
+Relation   <=  (2)       4         4         4         4
+Attribute  >=  (4)       4         4         4         4
+ (2)       >   (5)       4         4         4         4
+           <>  (6)       -         -         -         -
+           rel (102)     3         3         3         3
+           other         1         1         2         3
+ 
+</verb>
+
+<verb>
+                              Structure Attribute (4)
+                       free-form- document- local-    string
+                       text       text      number 
+                       (105)      (106)     (107)     (108)
+           none          3         3         5         1
+           =   (3)       3         3         5         1
+           <   (1)       4         4         5         4
+ Relation  <=  (2)       4         4         5         4
+ Attribute >=  (4)       4         4         5         4
+ (2)       >   (5)       4         4         5         4
+           <>  (6)       -         -         5         -
+           rel (102)     3         3         5         3
+           other         3         3         5         1
+
+</verb>
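
The two tables above can be read as a single lookup from a (relation, structure) pair to one of the five interpretations. The Python sketch below is only an illustration of that reading and is not taken from the Zebra sources; the function name <tt>interpret</tt> is hypothetical, <tt>None</tt> stands for an absent attribute, and structure values that do not appear in the tables are rejected.

```python
from typing import Optional

# Interpretation when the relation is absent, "=" (3) or "other".
# Keys are Bib-1 structure attribute values; None means "no structure".
BY_STRUCTURE = {
    None: 1,   # none          -> phrase search
    1: 1,      # phrase        -> phrase search
    2: 2,      # word          -> word search
    6: 3,      # word list     -> ranked search
    105: 3,    # free-form text -> ranked search
    106: 3,    # document text -> ranked search
    107: 5,    # local number  -> document identifier
    108: 1,    # string        -> phrase search
}


def interpret(relation: Optional[int], structure: Optional[int]) -> Optional[int]:
    """Return the interpretation (1-5), or None for a '-' cell in the tables."""
    if structure not in BY_STRUCTURE:
        raise ValueError(f"structure {structure!r} is not covered by the tables")
    if structure == 107:          # local number: always 5, whatever the relation
        return 5
    if relation == 6:             # <> is marked '-' for every other structure
        return None
    if relation in (1, 2, 4, 5):  # <, <=, >=, > : numeric relation
        return 4
    if relation == 102:           # relevance: ranked search
        return 3
    return BY_STRUCTURE[structure]  # none, '=' (3), or any other relation


if __name__ == "__main__":
    print(interpret(None, None))   # no attributes given
    print(interpret(102, 1))       # relevance + phrase
    print(interpret(1, 107))       # less-than + local number
```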
+
+<sect3>Regular expressions
+<p>
+
+Each term in a query is interpreted as a regular expression if
+the truncation value is either <bf/Regxp-1/ (102) or <bf/Regxp-2/ (103).
+Both query types follow the same syntax with the operands:
+<descrip>
+<tag/x/ Matches the character <it/x/.
+<tag/./ Matches any character.
+<tag><tt/[/..<tt/]/</tag> Matches the set of characters specified;
+ such as <tt/[abc]/ or <tt/[a-c]/.
+</descrip>
+and the operators:
+<descrip>
+<tag/x*/ Matches <it/x/ zero or more times. Priority: high.
+<tag/x+/ Matches <it/x/ one or more times. Priority: high.
+<tag/x?/ Matches <it/x/ zero or one time. Priority: high.
+<tag/xy/ Matches <it/x/, then <it/y/. Priority: medium.
+<tag/x|y/ Matches either <it/x/ or <it/y/. Priority: low.
+</descrip>
+The order of evaluation may be changed by using parentheses.
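
The operators listed above behave the same way in most regular expression dialects, so their priorities can be tried out with Python's <tt>re</tt> module. This is only an illustration: Zebra uses its own matcher, and matching the whole term (<tt>re.fullmatch</tt>) is an assumption made for the sketch.

```python
import re


def matches(pattern: str, term: str) -> bool:
    """True if the whole term matches the pattern (whole-term matching is an assumption)."""
    return re.fullmatch(pattern, term) is not None


# (pattern, term, expected result)
EXAMPLES = [
    ("ab*", "a", True),        # * binds to b only: a, then zero or more b
    ("ab+", "a", False),       # + requires at least one b
    ("ab?", "abb", False),     # ? allows at most one b
    ("[a-c].", "bz", True),    # character set, then any character
    ("ab|cd", "cd", True),     # | has the lowest priority: (ab) or (cd)
    ("ab|cd", "abd", False),
    ("a(b|c)d", "abd", True),  # parentheses change the order of evaluation
]

if __name__ == "__main__":
    for pattern, term, expected in EXAMPLES:
        result = matches(pattern, term)
        print(f"{pattern!r} vs {term!r}: {result}")
        assert result == expected
```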
+
+If the first character of the <bf/Regxp-2/ query is a plus character
+(<tt/+/), it marks the beginning of a section with non-standard
+specifiers. The next plus character marks the end of the section.
+Currently Zebra only supports one specifier, the error tolerance,
+which consists of one digit.
+
+Since the plus operator is normally a suffix operator, this addition to
+the query syntax doesn't violate the syntax for standard regular
+expressions.
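
As a rough illustration of the specifier section described above, the following Python sketch separates a leading error tolerance from the rest of a <bf/Regxp-2/ term. It is not Zebra's parser: the function name <tt>split_error_tolerance</tt> is hypothetical, and the fallback of one error is taken from the default stated for fuzzy searching earlier in this section.

```python
from typing import Tuple


def split_error_tolerance(term: str, default: int = 1) -> Tuple[int, str]:
    """Return (error tolerance, remaining regular expression)."""
    if not term.startswith("+"):
        # No specifier section: use the default tolerance.
        return default, term
    end = term.find("+", 1)          # the next plus ends the section
    if end == -1:
        raise ValueError("specifier section is not terminated by '+'")
    spec = term[1:end]
    # The only specifier described is a single decimal digit.
    if len(spec) != 1 or spec not in "0123456789":
        raise ValueError(f"unsupported specifier: {spec!r}")
    return int(spec), term[end + 1:]


if __name__ == "__main__":
    print(split_error_tolerance("+2+informat.*"))
    print(split_error_tolerance("retrieval"))
```

Under these assumptions a term such as <tt>+2+informat.*</tt> would request a tolerance of two errors for the expression <tt>informat.*</tt>.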
+
+<sect3>Query examples
+<p>
+Phrase search for <bf/information retrieval/ in the title-register:
+<verb>
+ @attr 1=4 "information retrieval"
+</verb>
+
+Ranked search for the same thing:
+<verb>
+ @attr 1=4 @attr 2=102 "Information retrieval"
+</verb>
+
+Phrase search with a regular expression:
+<verb>
+ @attr 1=4 @attr 5=102 "informat.* retrieval"
+</verb>
+
+Ranked search with a regular expression:
+<verb>
+ @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval"
+</verb>
+
+<sect2>Present
+<p>
+The present facility is supported in a standard fashion. The requested
+record syntax is matched against the ones supported by the profile of
+each record retrieved. If no record syntax is given, SUTRS is the
+default. The requested element set name, again, is matched against any
+provided by the relevant record profiles.
+
+<sect2>Scan
+
+<p>
+The attribute combinations provided with the TermListAndStartPoint are
+processed in the same way as operands in a query (see above).
+Currently, only the term and the globalOccurrences are returned with
+the TermInfo structure.
+
+<sect2>Close
+
+<p>
+If a Close PDU is received, the server will respond with a Close PDU
+with reason=FINISHED, no matter which protocol version was negotiated
+during initialization. If the protocol version is 3 or more, the
+server will generate a Close PDU under certain circumstances,
+including a session timeout (60 minutes by default), and certain kinds of
+protocol errors. Once a Close PDU has been sent, the protocol
+association is considered broken, and the transport connection will be
+closed immediately upon receipt of further data, or following a short
+timeout.
+
  <sect>The Record Model
  
  <p>
@@ -925,6 +1149,10 @@ record. Any number of record schema can coexist in the system.
  Although it may be wise to use only a single schema within
  one database, the system poses no such restrictions.
  
+The record model described in this chapter applies to the fundamental
+record type <tt>grs</tt> as introduced in
+section <ref id="record-types" name="Record Types">.
+
  Records pass through three different states during processing in the
  system.
  
@@ -968,6 +1196,9 @@ a single, canonical input format that gives access to the full
  spectrum of structure and flexibility in the system. In Zebra, this
  canonical format is an &dquot;SGML-like&dquot; syntax.
  
+To use the canonical format, specify <tt>grs.sgml</tt> as the record
+type.
+
  Consider a record describing an information resource (such a record is
  sometimes known as a <it/locator record/). It might contain a field
  describing the distributor of the information resource, which might in
@@ -1102,7 +1333,10 @@ work with.
  
  Input filters are ASCII files, generally with the suffix <tt/.flt/.
  The system looks for the files in the directories given in the
-<bf/profilePath/ setting in the <tt/zebra.cfg/ file.
+<bf/profilePath/ setting in the <tt/zebra.cfg/ file. The record type
+for the filter is <tt>grs.regx.</tt><it>filter-filename</it>
+(fundamental type <tt>grs</tt>, file read type <tt>regx</tt>, argument
+<it>filter-filename</it>).
  
  Generally, an input filter consists of a sequence of rules, where each
  rule consists of a sequence of expressions, followed by an action. The
@@ -1429,16 +1663,28 @@ given element set name with an element selection file. If an (@) is
  given in place of the filename, this corresponds to a null mapping for
  the given element set name.
  
-<tag>elm <it/path name attribute/</tag> (o,r) Adds an element
+<tag>any <it/tags/</tag> (o) This directive specifies a list of
+attributes which should be appended to the attribute list given for each
+element. The effect is to make every single element in the abstract
+syntax searchable by way of the given attributes. This directive
+provides an efficient way of supporting free-text searching across all
+elements. However, it does increase the size of the index
+significantly. The attributes can be qualified with a structure, as in
+the <bf/elm/ directive below.
+
+<tag>elm <it/path name attributes/</tag> (o,r) Adds an element
  to the abstract record syntax of the schema. The <it/path/ follows the
  syntax which is suggested by the Z39.50 document - that is, a sequence
  of tags separated by slashes (/). Each tag is given as a
  comma-separated pair of tag type and -value surrounded by parenthesis.
-The <it/name/ is the name of the element, and the <it/attribute/
-specifies what attribute to use when indexing the element. A ! in
+The <it/name/ is the name of the element, and <it/attributes/ is a
+comma-separated list specifying which attributes to use when indexing
+the element. A ! in
  place of the attribute name is equivalent to specifying an attribute
  name identical to the element name. A - in place of the attribute name
-specifies that no indexing is to take place for the given element.
+specifies that no indexing is to take place for the given element. The
+attributes can be qualified with a &dquot;p&dquot; or &dquot;w&dquot;
+to specify phrase (complete field) or word indexing, respectively.
  </descrip>
  
  <it>
@@ -1812,7 +2058,7 @@ belonging to the Explain schema.
  <sect>License
  
  <p>
-Copyright &copy; 1995, Index Data.
+Copyright &copy; 1995,1996 Index Data.
  
  All rights reserved.