X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fzebra.sgml;h=7b83012b5d3ec0793793bae434dd4fc24bd6935e;hb=c9ac021f1381269609f3654384698f398cf46b96;hp=322e8cbe46a0d36eea61acafa41141eb7c4118e3;hpb=7b559154955fd4ac7f18657344241ca83443b3a2;p=idzebra-moved-to-github.git diff --git a/doc/zebra.sgml b/doc/zebra.sgml index 322e8cb..7b83012 100644 --- a/doc/zebra.sgml +++ b/doc/zebra.sgml @@ -1,13 +1,13 @@
Zebra Server - Administrator's Guide and Reference <author><htmlurl url="http://www.indexdata.dk/" name="Index Data">, <tt><htmlurl url="mailto:info@index.ping.dk" name="info@index.ping.dk"></> -<date>$Revision: 1.22 $ +<date>$Revision: 1.35 $ <abstract> The Zebra information server combines a versatile fielded/free-text search engine with a Z39.50-1995 frontend to provide a powerful and flexible @@ -49,7 +49,7 @@ mailing-list by sending Email to <tt/zebra-request@index.ping.dk/. <sect1>Features <p> -This is a listof some of the most important features of the +This is a list of some of the most important features of the system. <itemize> @@ -71,6 +71,11 @@ SGML-like syntax which allows nested (structured) data elements, as well as variant forms of data. <item> +Supports arbitrary storage formats. A system of input filters driven by +regular expressions allows you to easily process most ASCII-based +data formats. + +<item> Supports boolean queries as well as relevance-ranking (free-text) searching. Right truncation and masking in terms are supported, as well as full regular expressions. @@ -82,6 +87,10 @@ ISO2709 (*MARC). Records can be mapped between record syntaxes and schema on the fly. <item> +Supports approximate matching in registers (i.e. spelling mistakes, +etc.). + +<item> Protocol support: <itemize> @@ -139,11 +148,6 @@ last beta release. <itemize> <item> -*Allow the system to handle other input formats. Specifically -MARC records and general, structured ASCII records (such as mail/news -files) parameterized by regular expressions. - -<item> *Complete the support for variants. Finalize support for the WAIS retrieval methodology. @@ -155,11 +159,8 @@ data elements in records. *Port the system to Windows NT. <item> -Add index and data compression to save disk space. - -<item> Add more sophisticated relevance ranking mechanisms. Add support for soundex -and stemming. Add relevance feedback support. +and stemming. Add relevance <it/feedback/ support. 
<item> Add Explain support. @@ -172,10 +173,6 @@ variant pieces. Support the Item Update extended service of the protocol. <item> -The Zebra search engine supports approximate string matching in the -index. We'd like to find a way to support and control this from RPN. - -<item> We want to add a management system that allows you to control your databases and configuration tables from a graphical interface. We'll probably use Tcl/Tk to stay platform-independent. @@ -191,25 +188,13 @@ contact info at the end of this file. <sect>Compiling the software <p> -Zebra uses the YAZ package to implement Z39.50, so you -have to compile YAZ before going further. Specifically, Zebra uses -the YAZ header files in <tt>yaz/include/..</tt> and its public library -<tt>yaz/lib/libyaz.a</tt>. - -As with YAZ, an ANSI C compiler is required in order to compile the Zebra +An ANSI C compiler is required to compile the Zebra server system — <tt/gcc/ works fine if your own system doesn't provide an adequate compiler. -Unpack the Zebra software. You might put Zebra in the same directory level -as YAZ, for example if YAZ is placed in ..<tt>/src/yaz-xxx</tt>, then -Zebra is placed in ..<tt>/src/zebra-yyy</tt>. - -Edit the top-level <tt>Makefile</tt> in the Zebra directory in which -you specify the location of YAZ by setting make variables. -The <tt>OSILIB</tt> should be empty if YAZ wasn't compiled with -MOSI support. Some systems, such as Solaris, have separate socket -libraries and for those systems you need to specify the -<tt>NETLIB</tt> variable. +Unpack the distribution archive. In some cases, you may want to edit +the top-level <tt/Makefile/, eg. to select a different C compiler, or +to specify machine-specific libraries in the <bf/ELIBS/ variable. When you are done editing the <tt>Makefile</tt> type: <tscreen><verb> @@ -223,8 +208,7 @@ If successful, two executables have been created in the sub-directory <tag><tt>zebraidx</tt></tag> The administrative tool for the search index. 
</descrip> -<sect>Quick Start - +<sect>Quick Start <p> In this section, we will test the system by indexing a small set of sample GILS records that are included with the software distribution. Go to the @@ -238,6 +222,9 @@ profilePath: ../../yaz/tab ../tab # Files that describe the attribute sets supported. attset: bib1.att attset: gils.att + +# Name of character map file. +charMap: scan.chr </verb></tscreen> Now, edit the file and set <tt>profilePath</tt> to the path of the @@ -247,11 +234,11 @@ archive). The 48 test records are located in the sub directory <tt>records</tt>. To index these, type: <tscreen><verb> -$ ../index/zebraidx -t grs update records +$ ../index/zebraidx -t grs.sgml update records </verb></tscreen> In the command above the option <tt>-t</tt> specified the record -type &mdash; in this case <tt>grs</tt>. The word <tt>update</tt> followed +type &mdash; in this case <tt>grs.sgml</tt>. The word <tt>update</tt> followed by a directory root updates all files below that directory node. If your indexing command was successful, you are now ready to @@ -261,7 +248,7 @@ $ ../index/zebrasrv tcp:@:2100 </verb></tscreen> The Zebra index that you have just created has a single database -named <ztt/Default/. The database contains records structured according to +named <tt/Default/. The database contains records structured according to the GILS profile, and the server will return records in either USMARC, GRS-1, or SUTRS depending on what your client asks @@ -374,13 +361,12 @@ by <tt>zebraidx</tt>. If no <tt/-g/ option is specified, the settings with no prefix are used. In the configuration file, the group name is placed before the option -name -itself, separated by a dot (.). +name itself, separated by a dot (.). 
For instance, to set the record type +for group <tt/public/ to <tt/grs.sgml/ (the SGML-like format for structured records) you would write: <tscreen><verb> -public.recordType: grs +public.recordType: grs.sgml </verb></tscreen> To set the default value of the record type to <tt/text/ write: @@ -397,8 +383,12 @@ explained further in the following sections. Specifies how records with the file extension <it>name</it> should be handled by the indexer. This option may also be specified as a command line option (<tt>-t</tt>). Note that if you do not - specify a <it/name/, the setting applies to all files. -<tag><it>group</it>.recordId</tag> + specify a <it/name/, the setting applies to all files. In general, + the record type specifier consists of the elements (each + element separated by dot), <it>fundamental-type</it>, + <it>file-read-type</it> and arguments. Currently, two + fundamental types exist, <tt>text</tt> and <tt>grs</tt>. + <tag><it>group</it>.recordId</tag> Specifies how the records are to be identified when updated. See section <ref id="locating-records" name="Locating Records">. <tag><it>group</it>.database</tag> @@ -422,7 +412,12 @@ section <ref id="locating-records" name="Locating Records">. Enables the <it/safe update/ facility of Zebra, and tells the system where to place the required, temporary files. See section <ref id="shadow-registers" name="Safe Updating - Using Shadow Registers">. -<tag>tempSetPath</tag> +<tag>lockDir</tag> + Directory in which various lock files are stored. +<tag>keyTmpDir</tag> + Directory in which temporary files used during zebraidx' update + phase are stored. +<tag>setTmpDir</tag> Specifies the directory that the server uses for temporary result sets. If not specified <tt>/tmp</tt> will be used. <tag>profilePath</tag> @@ -432,9 +427,14 @@ section <ref id="locating-records" name="Locating Records">. searching. At least the Bib-1 set should be loaded (<tt/bib1.att/). 
The <tt/profilePath/ setting is used to look for the specified files. See section <ref id="attset-files" name="The Attribute Set Files"> +<tag>charMap</tag> + Specifies the filename of a character mapping. Zebra uses the path, + <tt>profilePath</tt>, to locate this file. +<tag>memMax</tag> + Specifies the size of internal memory to use for the zebraidx program. The + amount is given in megabytes - default is 4 (4 MB). </descrip> - -<sect1>Locating Records<label="locating-records"> +<sect1>Locating Records<label id="locating-records"> <p> The default behaviour of the Zebra system is to reference the records from their original location, i.e. where they were found when you @@ -834,7 +834,9 @@ Registers">). </descrip> -<sect>Running the Z39.50 Server (zebrasrv) +<sect>The Z39.50 Server + +<sect1>Running the Z39.50 Server (zebrasrv) <p> <bf/Syntax/ @@ -874,7 +876,12 @@ privileged port. <tag>-w <it/working-directory/</tag>Change working directory. -<tag/-i/Run under the Internet superserver, <tt/inetd/. +<tag>-i <it/minutes/</tag>Run under the Internet superserver, <tt/inetd/. + +<tag>-t <it/timeout/</tag>Set the idle session timeout (default 60 minutes). + +<tag>-k <it/kilobytes/</tag>Set the (approximate) maximum size of +present response messages. Default is 1024 Kb (1 Mb). </descrip> A <it/listener-address/ consists of a transport mode followed by a @@ -925,6 +932,212 @@ a dedicated IR server account. The default behavior for <tt/zebrasrv/ is to establish a single TCP/IP listener, for the Z39.50 protocol, on port 9999. +<sect1>Z39.50 Protocol Support and Behavior + +<sect2>Initialization + +<p> +During initialization, the server will negotiate to version 3 of the +Z39.50 protocol, and the option bits for Search, Present, Scan, +NamedResultSets, and concurrentOperations will be set, if requested by +the client. The maximum PDU size is negotiated down to a maximum of +1Mb by default. + +<sect2>Search + +<p> +The supported query types are 1 and 101. 
All operators are currently +supported except that only proximity units of type "word" are supported +for the proximity operator. Queries can be arbitrarily complex. Named +result sets are supported, and result sets can be used as operands with +no limitations. Searches may span multiple databases. + +The server has full support for piggy-backed present requests (see +also the following section). + +<bf/Use/ attributes are interpreted according to the attribute sets which +have been loaded in the <tt/zebra.cfg/ file, and are matched against +specific fields as specified in the <tt/.abs/ file which describes the +profile of the records which have been loaded. If no <bf/Use/ +attribute is provided, a default of <bf/Any/ is assumed. + +If a <bf/Structure/ attribute of <bf/Phrase/ is used in conjunction with a +<bf/Completeness/ attribute of <bf/Complete (Sub)field/, the term is +matched against the contents of a phrase (long word) register, if one +exists for the given <bf/Use/ attribute. If <bf/Structure/=<bf/Phrase/ +is used in conjunction with <bf/Incomplete Field/ - the default value +for <bf/Completeness/, the search is directed against the normal word +registers, but if the term contains multiple words, the term will only +match if all of the words are found immediately adjacent, and in the +given order. If the <bf/Structure/ attribute is <bf/Word List/, +<bf/Free-form Text/, or <bf/Document Text/, the term is treated as a +natural-language, relevance-ranked query. + +If the <bf/Relation/ attribute is <bf/Equals/ (default), the term is +matched in a normal fashion (modulo truncation and processing of +individual words, if required). If <bf/Relation/ is <bf/Less Than/, +<bf/Less Than or Equal/, <bf/Greater than/, or <bf/Greater than or +Equal/, the term is assumed to be numerical, and a standard regular +expression is constructed to match the given expression. If +<bf/Relation/ is <bf/Relevance/, the standard natural-language query +processor is invoked. 
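As a rough illustration of how the attributes described above combine in a query, the following Python sketch (not part of Zebra) assembles a query string in the prefix notation used in the query examples of this guide. The helper name `pqf` and the `ATTR_TYPES` table are hypothetical; the attribute type numbers are the standard Bib-1 ones (Use=1, Relation=2, Structure=4, Completeness=6), and the sketch assumes that the term contains no double quotes.

```python
# Bib-1 attribute type numbers; the dictionary itself is illustrative only.
ATTR_TYPES = {"use": 1, "relation": 2, "structure": 4, "completeness": 6}


def pqf(term, **attrs):
    """Build a prefix-notation query string for a single term.

    Hypothetical helper: assumes the term contains no double quotes.
    Attributes are emitted in the order in which they are passed.
    """
    parts = ["@attr %d=%d" % (ATTR_TYPES[name], value)
             for name, value in attrs.items()]
    parts.append('"%s"' % term)
    return " ".join(parts)


# Title (Use=4), Structure=Phrase (1), Completeness=Complete subfield (2):
# per the text above, this is directed at the phrase register if one exists.
print(pqf("information retrieval", use=4, structure=1, completeness=2))

# Title (Use=4), Relation=Relevance (102): a relevance-ranked query.
print(pqf("information retrieval", use=4, relation=102))
```

Leaving out an attribute simply leaves the server default in force, for example a Use attribute of Any or a Relation of Equals.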
+ +For the <bf/Truncation/ attribute, <bf/No Truncation/ is the default. +<bf/Left Truncation/ is not supported. <bf/Process #/ is supported, as +is <bf/Regxp-1/. <bf/Regxp-2/ enables the fault-tolerant (fuzzy) +search. By default, a single error (deletion, insertion, +replacement) is accepted when terms are matched against the register +contents. + +Zebra interprets queries in one of the following ways: +<descrip> +<tag>1 Phrase search</tag> + Each token separated by white space is truncated according to the + value of the truncation attribute. If the completeness attribute + is <bf/complete subfield/ the search is directed to the phrase + register. For other completeness attribute values the term is split + into tokens according to the white-space specification in the + character map. Only records in which each token exists in the order + specified are matched. +<tag>2 Word search</tag> + The token is truncated according to the value of the truncation attribute. + The completeness attribute is ignored. +<tag>3 Ranked search</tag> + Each token separated by white space is truncated according to the value + of the truncation attribute. The completeness attribute is ignored. +<tag>4 Numeric relation</tag> + The token should consist of decimal digits. The integer is matched + against integers in the register according to the relation attribute. + The truncation and completeness attributes are ignored. +<tag>5 Document identifier</tag> + The token consists of exactly one document identifier. The + truncation and completeness attributes are ignored. +</descrip> + +For ranked searches the result sets are ranked and a score +is associated with each record. All other result sets from the +remaining four types are non-ranked. + +Combinations of the structure attribute and the relation attribute +determine how the query is interpreted. The following two tables +define how. 
+
+<verb>
+                       Structure Attribute (4)
+                       none    phrase(1)   word(2)   word list(6)
+
+                none   1       1           2         3
+               = (3)   1       1           2         3
+               < (1)   4       4           4         4
+Relation      <= (2)   4       4           4         4
+Attribute     >= (4)   4       4           4         4
+  (2)          > (5)   4       4           4         4
+              <> (6)   -       -           -         -
+           rel (102)   3       3           3         3
+               other   1       1           2         3
+
+</verb>
+
+<verb>
+                       Structure Attribute (4)
+                       free-form-   document-   local-    string
+                       text         text        number
+                       (105)        (106)       (107)     (108)
+                none   3            3           5         1
+               = (3)   3            3           5         1
+               < (1)   4            4           5         4
+Relation      <= (2)   4            4           5         4
+Attribute     >= (4)   4            4           5         4
+  (2)          > (5)   4            4           5         4
+              <> (6)   -            -           5         -
+           rel (102)   3            3           5         3
+               other   3            3           5         1
+
+</verb>
+
+<sect3>Regular expressions +<p> + +Each term in a query is interpreted as a regular expression if +the truncation value is either <bf/Regxp-1/ (102) or <bf/Regxp-2/ (103). +Both query types follow the same syntax with the operands: +<descrip> +<tag/x/ Matches the character <it/x/. +<tag/./ Matches any character. +<tag><tt/[/..<tt/]/</tag> Matches the set of characters specified; + such as <tt/[abc]/ or <tt/[a-c]/. +</descrip> +and the operators: +<descrip> +<tag/x*/ Matches <it/x/ zero or more times. Priority: high. +<tag/x+/ Matches <it/x/ one or more times. Priority: high. +<tag/x?/ Matches <it/x/ zero or one time. Priority: high. +<tag/xy/ Matches <it/x/, then <it/y/. Priority: medium. +<tag/x|y/ Matches either <it/x/ or <it/y/. Priority: low. +</descrip> +The order of evaluation may be changed by using parentheses. + +If the first character of the <bf/Regxp-2/ query is a plus character +(<tt/+/) it marks the beginning of a section with non-standard +specifiers. The next plus character marks the end of the section. +Currently Zebra only supports one specifier, the error tolerance, +which consists of one digit. + +Since the plus operator is normally a suffix operator, the addition to +the query syntax doesn't violate the syntax for standard regular +expressions. 
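To make the specifier section of a Regxp-2 term concrete, here is a minimal Python sketch (not Zebra's own parser) that separates the error tolerance from the pattern. The function name `split_regxp2_term` is hypothetical, and the exact layout `+N+pattern` is an assumption read off the description above: a leading plus, one digit, and a closing plus. When no specifier section is present, the sketch falls back to the default of a single accepted error.

```python
import re

# Assumed layout of the specifier section: "+", one digit, "+".
_SPECIFIER = re.compile(r"\+(\d)\+")


def split_regxp2_term(term, default_errors=1):
    """Return (error_tolerance, pattern) for a Regxp-2 term.

    Illustrative only: this mirrors the description in the text,
    not the actual Zebra implementation.
    """
    match = _SPECIFIER.match(term)
    if match is None:
        # No specifier section: the whole term is the pattern.
        return default_errors, term
    return int(match.group(1)), term[match.end():]


print(split_regxp2_term("+2+informat.*"))  # tolerance 2, pattern informat.*
print(split_regxp2_term("informat.*"))     # default tolerance, same pattern
```

A term such as `a+b` is left untouched, because the plus sign there follows an operand and so acts as the ordinary suffix operator.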
+ +<sect3>Query examples +<p> +Phrase search for <bf/information retrieval/ in the title-register: +<verb> + @attr 1=4 "information retrieval" +</verb> + +Ranked search for the same thing: +<verb> + @attr 1=4 @attr 2=102 "Information retrieval" +</verb> + +Phrase search with a regular expression: +<verb> + @attr 1=4 @attr 5=102 "informat.* retrieval" +</verb> + +Ranked search with a regular expression: +<verb> + @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval" +</verb> + +<sect2>Present +<p> +The present facility is supported in a standard fashion. The requested +record syntax is matched against the ones supported by the profile of +each record retrieved. If no record syntax is given, SUTRS is the +default. The requested element set name, again, is matched against any +provided by the relevant record profiles. + +<sect2>Scan + +<p> +The attribute combinations provided with the TermListAndStartPoint are +processed in the same way as operands in a query (see above). +Currently, only the term and the globalOccurrences are returned with +the TermInfo structure. + +<sect2>Close + +<p> +If a Close PDU is received, the server will respond with a Close PDU +with reason=FINISHED, no matter which protocol version was negotiated +during initialization. If the protocol version is 3 or more, the +server will generate a Close PDU under certain circumstances, +including a session timeout (60 minutes by default), and certain kinds of +protocol errors. Once a Close PDU has been sent, the protocol +association is considered broken, and the transport connection will be +closed immediately upon receipt of further data, or following a short +timeout. + <sect>The Record Model <p> @@ -936,6 +1149,10 @@ record. Any number of record schema can coexist in the system. Although it may be wise to use only a single schema within one database, the system poses no such restrictions. 
+The record model described in this chapter applies to the fundamental +record type <tt>grs</tt> as introduced in +section <ref id="record-types" name="Record Types">. + Records pass through three different states during processing in the system. @@ -979,6 +1196,9 @@ a single, canonical input format that gives access to the full spectrum of structure and flexibility in the system. In Zebra, this canonical format is an &dquot;SGML-like&dquot; syntax. +To use the canonical format, specify <tt>grs.sgml</tt> as the record +type. + Consider a record describing an information resource (such a record is sometimes known as a <it/locator record/). It might contain a field describing the distributor of the information resource, which might in @@ -1113,7 +1333,10 @@ work with. Input filters are ASCII files, generally with the suffix <tt/.flt/. The system looks for the files in the directories given in the -<bf/profilePath/ setting in the <tt/zebra.cfg/ file. +<bf/profilePath/ setting in the <tt/zebra.cfg/ file. The record type +for the filter is <tt>grs.regx.</tt><it>filter-filename</it> +(fundamental type <tt>grs</tt>, file read type <tt>regx</tt>, argument +<it>filter-filename</it>). Generally, an input filter consists of a sequence of rules, where each rule consists of a sequence of expressions, followed by an action. The @@ -1440,16 +1663,28 @@ given element set name with an element selection file. If an (@) is given in place of the filename, this corresponds to a null mapping for the given element set name. -<tag>elm <it/path name attribute/</tag> (o,r) Adds an element +<tag>any <it/tags/</tag> (o) This directive specifies a list of +attributes which should be appended to the attribute list given for each +element. The effect is to make every single element in the abstract +syntax searchable by way of the given attributes. This directive +provides an efficient way of supporting free-text searching across all +elements. 
However, it does increase the size of the index +significantly. The attributes can be qualified with a structure, as in +the <bf/elm/ directive below. + +<tag>elm <it/path name attributes/</tag> (o,r) Adds an element to the abstract record syntax of the schema. The <it/path/ follows the syntax which is suggested by the Z39.50 document - that is, a sequence of tags separated by slashes (/). Each tag is given as a comma-separated pair of tag type and -value surrounded by parenthesis. -The <it/name/ is the name of the element, and the <it/attribute/ -specifies what attribute to use when indexing the element. A ! in +The <it/name/ is the name of the element, and <it/attributes/ +is a comma-separated list specifying which attributes to use when +indexing the element. A ! in place of the attribute name is equivalent to specifying an attribute name identical to the element name. A - in place of the attribute name -specifies that no indexing is to take place for the given element. +specifies that no indexing is to take place for the given element. The +attributes can be qualified with a &dquot;p&dquot; or &dquot;w&dquot; +to specify either phrase (complete field) or word indexing. </descrip> <it> @@ -1823,7 +2058,7 @@ belonging to the Explain schema. <sect>License <p> -Copyright © 1995, Index Data. +Copyright © 1995,1996 Index Data. All rights reserved.