doc/querymodel.xml

   1  <chapter id="querymodel">
   2   <!-- $Id: querymodel.xml,v 1.4 2006-06-14 13:44:15 adam Exp $ -->
   3   <title>Query Model</title>
   4
   5   <sect1 id="querymodel-overview">
   6    <title>Query Model Overview</title>
   7
   8    <para>
   9     Zebra was born as a networked Information Retrieval engine adhering
  10     to the international standards
  11     <ulink url="&url.z39.50;">Z39.50</ulink> and
  12     <ulink url="&url.sru;">SRU</ulink>,
  13     and implements the query model defined there.
  14     Unfortunately, the Z39.50 query model defines only a binary
  15     encoded representation, which is used as transport packaging in
  16     the Z39.50 protocol layer. This representation is not human
  17     readable, and it offers no convenient way to specify queries.
  18    </para>
  19    <!-- tell about RPN - include link to YAZ
  20         url.yaz.pqf -->
  21    <para>
  22     Therefore, Index Data has defined a textual representation of the
  23     RPN query: the <literal>Prefix Query Format</literal>, or
  24     <literal>PQF</literal> for short, which has since been adopted by other
  25     parties developing Z39.50 software. It is also often referred to as
  26     <literal>Prefix Query Notation</literal>, or
  27     <literal>PQN</literal> for short, and is explained in
  28     <xref linkend="querymodel-pqf"/>.
  29    </para>
  30
  31    <!-- PQF/RPN is natively supported. CQL is NOT . So we need a map -->
  32    <para>
  33     In addition, Zebra can be configured to understand and map the
  34     <literal>Common Query Language</literal>
  35     (<ulink url="&url.cql;">CQL</ulink>)
  36     to PQF. For an introduction to the mapping to the internal query
  37     representation, see
  38     <xref linkend="querymodel-cql-to-pqf"/>.
  39    </para>
  40   </sect1>
  41
  42   <sect1 id="querymodel-pqf">
  43    <title>Prefix Query Format structure and syntax</title>
  44    <para>
  45     The <ulink url="&url.yaz.pqf;">PQF grammar</ulink>
  46     is documented in the YAZ manual, and will not be
  47     repeated here. During search, this textual PQF representation
  48     is always mapped to the equivalent Zebra internal
  49     query parse tree.
  50    </para>
  51
  52    <sect2 id="querymodel-pqf-tree">
  53     <title>PQF tree structure</title>
  54     <para>
  55      The PQF parse tree - or the equivalent textual representation -
  56      may start with one specification of the
  57      <emphasis>attribute set</emphasis> used. It is followed by a query
  58      tree, which
  59      consists of <emphasis>atomic query parts</emphasis>, optionally
  60      paired by <emphasis>boolean binary operators</emphasis>, and
  61      finally <emphasis>recursively combined</emphasis> into
  62      complex query trees.
  63     </para>
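         <para>
          As a sketch of how these parts fit together, the following query
          names the attribute set once and then combines three atomic queries
          with two nested operators. It assumes that the database has indexes
          for the bib-1 use attributes 4 (title) and 1003 (author); the search
          terms are purely illustrative.
          <screen>
           Z> find @attrset bib-1 @and @attr 1=4 information @or @attr 1=1003 smith @attr 1=1003 jones
          </screen>
          Read in prefix order, <literal>@and</literal> takes the title query
          as its first operand and the whole <literal>@or</literal> subtree
          as its second.
         </para>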
  64
  65     <sect3 id="querymodel-attribute-sets">
  66      <title>Attribute sets</title>
  67      <para>
  68       Attribute sets define the exact meaning and semantics of queries
  69       issued. Zebra comes with some predefined attribute set
  70       definitions; others can easily be defined and added to the
  71       configuration.
  72       <note>
  73        The Zebra internal query processing is modeled after
  74        the <literal>Bib1</literal> attribute set, and the non-use
  75        attributes type 2-9 are hard-wired in. It is therefore essential
  76        to be familiar with <xref linkend="querymodel-bib1"/>.
  77       </note>
  78      </para>
  79
  80      <table id="querymodel-attribute-sets-table">
  81       <caption>Attribute sets predefined in Zebra</caption>
  82        <!--
  83        <thead>
  84        <tr><td>one</td><td>two</td></tr>
  85       </thead>
  86        -->
  87        <tbody>
  88         <tr>
  89          <td><emphasis>exp-1</emphasis></td>
  90          <td><literal>Explain</literal> attribute set</td>
  91          <td>Special attribute set used on the special automagic
  92           <literal>IR-Explain-1</literal> database to gain information on
  93           server capabilities, database names, and database
  94           semantics.</td>
  95         </tr>
  96         <tr>
  97          <td><emphasis>bib-1</emphasis></td>
  98          <td><literal>Bib1</literal> attribute set</td>
  99          <td>Standard PQF query language attribute set which defines the
 100           semantics of Z39.50 searching. In addition, all of the
 101           non-use attributes (type 2-9) define the Zebra internal query
 102           processing.</td>
 103         </tr>
 104         <tr>
 105          <td><emphasis>gils</emphasis></td>
 106          <td><literal>GILS</literal> attribute set</td>
 107          <td>Extension to the <literal>Bib1</literal> attribute set.</td>
 108         </tr>
 109        </tbody>
 110      </table>
 111     </sect3>
 112
 113     <sect3 id="querymodel-boolean-operators">
 114      <title>Boolean operators</title>
 115      <para>
 116       A pair of subquery trees, or of atomic queries, is combined
 117       using the standard boolean operators into new query trees.
 118      </para>
 119
 120      <table id="querymodel-boolean-operators-table">
 121       <caption>Boolean operators</caption>
 122        <!--
 123        <thead>
 124        <tr><td>one</td><td>two</td></tr>
 125       </thead>
 126        -->
 127        <tbody>
 128         <tr><td><emphasis>@and</emphasis></td>
 129          <td>binary <literal>AND</literal> operator</td>
 130          <td>Set intersection of the hit sets of two queries</td>
 131         </tr>
 132         <tr><td><emphasis>@or</emphasis></td>
 133          <td>binary <literal>OR</literal> operator</td>
 134          <td>Set union of the hit sets of two queries</td>
 135         </tr>
 136         <tr><td><emphasis>@not</emphasis></td>
 137          <td>binary <literal>AND NOT</literal> operator</td>
 138          <td>Set difference: hits of the first query that are not hits of the second</td>
 139         </tr>
 140         <tr><td><emphasis>@prox</emphasis></td>
 141          <td>binary <literal>PROXIMITY</literal> operator</td>
 142          <td>Set intersection of the hit sets of two queries. In
 143           addition, the intersection set is purged of all
 144           documents which do not satisfy the requested query
 145           term proximity. Usually a proper subset of the AND
 146           operation.</td>
 147         </tr>
 148        </tbody>
 149      </table>
 150
 151      <para>
 152       For example, we can combine the terms
 153       <emphasis>information</emphasis> and <emphasis>retrieval</emphasis>
 154       into different searches in the default index of the default
 155       attribute set as follows.
 156       Querying for the union of all documents containing the
 157       terms <emphasis>information</emphasis> OR
 158       <emphasis>retrieval</emphasis>:
 159       <screen>
 160        Z> find @or information retrieval
 161       </screen>
 162      </para>
 163      <para>
 164       Querying for the intersection of all documents containing the
 165       terms <emphasis>information</emphasis> AND
 166       <emphasis>retrieval</emphasis>.
 167       The hit set is a subset of that of the corresponding
 168       OR query:
 169       <screen>
 170        Z> find @and information retrieval
 171       </screen>
 172      </para>
 173      <para>
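 174       Querying for the intersection of all documents containing the
 175       terms <emphasis>information</emphasis> AND
 176       <emphasis>retrieval</emphasis>, taking proximity into account.
 177       The <literal>@prox</literal> operator takes six parameters
 177       (exclusion, distance, ordered, relation, which-code, unit-code)
 177       as defined by the PQF grammar; here they request a distance of
 177       at most 3 words, in any order.
 177       The hit set is a subset of that of the corresponding
 178       AND query:
 179       <screen>
 180        Z> find @prox 0 3 0 2 k 2 information retrieval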
 181       </screen>
 182      </para>
 183      <para>
 184       Querying for the intersection of all documents containing the
 185       terms <emphasis>information</emphasis> AND
 186       <emphasis>retrieval</emphasis>, in the same order and near each
 187       other as described in the term list.
 188       The hit set is a subset of that of the corresponding
 189       PROXIMITY query:
 190       <screen>
 191        Z> find "information retrieval"
 192       </screen>
 193      </para>
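          <para>
           The remaining operator, <literal>@not</literal>, can be sketched
           along the same lines. Assuming the same default index, this query
           asks for documents containing <emphasis>information</emphasis>
           but not <emphasis>retrieval</emphasis>, so its hit set is a subset
           of the hits for <emphasis>information</emphasis> alone:
           <screen>
            Z> find @not information retrieval
           </screen>
          </para>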
 194     </sect3>
 195
 196
 197     <sect3 id="querymodel-atomic-queries">
 198      <title>Atomic queries</title>
 199      <para>
 200       Atomic queries are the query parts which work on one access point
 201       only. These consist of an <literal>attribute list</literal>
 202       followed by a <literal>single term</literal> or a
 203       <literal>quoted term list</literal>.
 204      </para>
 205      <para>
 206       Unsupplied non-use attributes type 2-9 are either inherited from
 207       higher nodes in the query tree, or are set to Zebra's default values.
 208       See <xref linkend="querymodel-bib1"/> for details.
 209      </para>
 210
 211      <table id="querymodel-atomic-queries-table">
 212       <caption>Atomic queries</caption>
 213        <!--
 214        <thead>
 215        <tr><td>one</td><td>two</td></tr>
 216       </thead>
 217        -->
 218        <tbody>
 219         <tr><td><emphasis>attribute list</emphasis></td>
 220          <td>List of <literal>orthogonal</literal> attributes</td>
 221          <td>Any of the orthogonal attribute types may be omitted;
 222           omitted types are inherited from higher query tree nodes or, if not
 223           inherited, are set to the default Zebra configuration values.
 224          </td>
 225         </tr>
 226         <tr><td><emphasis>term</emphasis></td>
 227          <td>single <literal>term</literal>
 228           or <literal>quoted term list</literal>   </td>
 229          <td>Here the search terms or list of search terms is added
 230           to the query</td>
 231         </tr>
 232        </tbody>
 233      </table>
 234      <para>
 235       Querying for the term <emphasis>information</emphasis> in the
 236       default index using the default attribute set, the server choice
 237       of access point/index, and the default non-use attributes:
 238       <screen>
 239        Z> find "information"
 240       </screen>
 241      </para>
 242      <para>
 243       Equivalent query fully specified:
 244       <screen>
 245        Z> find @attrset bib-1 @attr 1=1017 @attr 2=3 @attr 3=3 @attr 4=1 @attr 5=100 @attr 6=1 "information"
 246       </screen>
 247      </para>
 248
 249      <para>
 250       Finding all documents which have empty titles. Notice that the
 251       empty term must be quoted, but is otherwise legal.
 252       <screen>
 253        Z> find @attr 1=4 ""
 254       </screen>
 255      </para>
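          <para>
           The inheritance of attributes from higher nodes can be illustrated
           with a small sketch. Assuming a title index behind bib-1 use
           attribute 4, the two queries below should be equivalent: in the
           first, the attribute is given once above the
           <literal>@and</literal> node and inherited by both terms, while in
           the second it is repeated on each atomic query.
           <screen>
            Z> find @attr 1=4 @and information retrieval
            Z> find @and @attr 1=4 information @attr 1=4 retrieval
           </screen>
          </para>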
 256
 257     </sect3>
 258
 259     <sect3 id="querymodel-use-string">
 260      <title>Zebra's special use attribute type 1 of form 'string'</title>
 261      <para>
 262       The numeric <literal>use (type 1)</literal> attribute is usually
 263       referred to from a given
 264       attribute set. In addition, Zebra lets you use
 265       <emphasis>any internal index
 266        name defined in your configuration</emphasis>
 267       as use attribute value. This is a great feature for
 268       debugging, and when you do
 269       not need the complexity of defined use attribute values. It is
 270       the preferred way of accessing Zebra indexes directly.
 271      </para>
 272      <para>
 273       Finding all documents which have the term list "information
 274       retrieval" in a Zebra index, using its internal full string name:
 275       <screen>
 276        Z> find @attr 1=sometext "information retrieval"
 277       </screen>
 278      </para>
 279      <para>
 280       Searching the bib-1 use attribute 54 using its string name:
 281       <screen>
 282        Z> find @attr 1=Code-language eng
 283       </screen>
 284      </para>
 285      <para>
 286       Searching in any silly string index - if it's defined in your
 287       indexation rules and can be parsed by the PQF parser.
 288       This is definitely not the recommended use of
 289       this facility, as it might confuse your users with some very
 290       unexpected results.
 291       <screen>
 292        Z> find @attr 1=silly/xpath/alike[@index]/name "information retrieval"
 293       </screen>
 294      </para>
 295      <para>
 296       See <xref linkend="querymodel-bib1-mapping"/> for details, and
 297       <xref linkend="server-sru"/>
 298       for the SRU PQF query extension using string names as a fast
 299       debugging facility.
 300      </para>
 301     </sect3>
 302
 303     <sect3 id="querymodel-use-xpath">
 304      <title>Zebra's special use attribute type 1 of form 'XPath'
 305       for GRS filters</title>
 306      <para>
 307       As we have seen above, it is possible (albeit seldom a great
 308       idea) to emulate
 309       <ulink url="http://www.w3.org/TR/xpath">XPath 1.0</ulink> based
 310       search by defining <literal>use (type 1)</literal>
 311       <emphasis>string</emphasis> attributes which in appearance
 312       <emphasis>resemble XPath queries</emphasis>. There are two
 313       problems with this approach: first, the XPath-look-alike has to
 314       be defined at indexing time, so no new undefined
 315       XPath queries can be entered at search time; and second, it might
 316       confuse users very much that an XPath-alike index name in fact
 317       gets populated from a possibly entirely different XML element
 318       than it pretends to access.
 319      </para>
 320      <para>
 321       When using the <literal>GRS Record Model</literal>
 322       (see  <xref linkend="record-model-grs"/>), we have the
 323       possibility to embed <emphasis>live</emphasis>
 324       XPath expressions
 325       in the PQF queries, which are here called
 326       <literal>use (type 1)</literal> <emphasis>xpath</emphasis>
 327       attributes. To use them, you must set the
 328       <literal>xpath enable</literal> directive in your
 329       <literal>.abs</literal> configuration files.
 330      </para>
 331      <note>
 332       Only a <emphasis>very</emphasis> restricted subset of the
 333       <ulink url="http://www.w3.org/TR/xpath">XPath 1.0</ulink>
 334       standard is supported as the GRS record model is simpler than
 335       a full XML DOM structure. See the following examples for
 336       possibilities.
 337      </note>
 338      <para>
 339       Finding all documents which have the term "content"
 340       inside a text node found in a specific XML DOM
 341       <emphasis>subtree</emphasis>, whose starting element is
 342       addressed by XPath:
 343       <screen>
 344        Z> find @attr 1=/root content
 345        Z> find @attr 1=/root/first content
 346       </screen>
 347       <emphasis>Notice that the
 348        XPath must be absolute, i.e., must start with '/', and that the
 349        XPath <literal>descendant-or-self</literal> axis followed by a
 350        text node selection <literal>text()</literal> is implicitly
 351        appended to the stated XPath.
 352       </emphasis>
 353       It follows that the above searches are interpreted as:
 354       <screen>
 355        Z> find @attr 1=/root//text() content
 356        Z> find @attr 1=/root/first//text() content
 357       </screen>
 358      </para>
 359
 360      <para>
 361       The addressing XPath can be filtered by a predicate working on exact
 362       string values in
 363       attributes (in the XML sense). The following returns all documents which
 364       have the term "english" contained in one of the text subnodes of
 365       the subtree defined by the XPath
 366       <literal>/record/title[@lang='en']</literal>:
 367       <screen>
 368        Z> find @attr 1=/record/title[@lang='en'] english
 369       </screen>
 370      </para>
 371
 372      <para>
 373       Combining numeric indexes, boolean expressions,
 374       and xpath based searches is possible:
 375       <screen>
 376        Z> find @attr 1=/record/title @and foo bar
 377        Z> find @and @attr 1=/record/title foo @attr 1=4 bar
 378       </screen>
 379      </para>
 380      <para>
 381       Escaping PQF keywords and other non-parseable XPath constructs
 382       with <literal>'{ }'</literal> to prevent syntax errors:
 383       <screen>
 384        Z> find @attr {1=/root/first[@attr='danish']} content
 385        Z> find @attr {1=/root/second[@attr='danish lake']} content
 386        Z> find @attr {1=/root/third[@attr='dansk s\xc3\xb8']} content
 387       </screen>
 388      </para>
 389      <warning>
 390       It is worth mentioning that these dynamically performed XPath
 391       queries are a performance bottleneck, as no optimized
 392       specialized indexes can be used. Therefore, avoid the use of
 393       this facility when speed is essential and the database content
 394       size is medium to large.
 395      </warning>
 396     </sect3>
 397
 398    </sect2>
 399
 400    <sect2 id="querymodel-exp1">
 401     <title>Explain Attribute Set</title>
 402     <para>
 403      The Z39.50 standard defines the
 404      <ulink url="&url.z39.50.explain;">Explain</ulink> attribute set
 405      <literal>exp-1</literal>, which is used to discover information
 406      about a server's search semantics and functional capabilities.
 407      Zebra exposes a "classic"
 408      Explain database by base name <literal>IR-Explain-1</literal>, which
 409      is populated with system internal information.
 410     </para>
 411    <para>
 412      The attribute-set <literal>exp-1</literal> consists of a single
 413      <literal>Use (type 1)</literal> attribute.
 414     </para>
 415     <para>
 416      In addition, the non-Use
 417      <literal>bib-1</literal> attributes, that is, the types
 418      <literal>Relation</literal>, <literal>Position</literal>,
 419      <literal>Structure</literal>, <literal>Truncation</literal>,
 420      and <literal>Completeness</literal> are imported from
 421      the <literal>bib-1</literal> attribute set, and may be used
 422      within any explain query.
 423     </para>
 424
 425     <sect3 id="querymodel-exp1-use">
 426     <title>Use Attributes (type = 1)</title>
 427      <para>
 428       The following Explain search attributes are supported:
 429       <literal>ExplainCategory</literal> (@attr 1=1),
 430       <literal>DatabaseName</literal> (@attr 1=3),
 431       <literal>DateAdded</literal> (@attr 1=9),
 432       <literal>DateChanged</literal> (@attr 1=10).
 433      </para>
 434      <para>
 435       A search in the use attribute  <literal>ExplainCategory</literal>
 436       supports only these predefined values:
 437       <literal>CategoryList</literal>, <literal>TargetInfo</literal>,
 438       <literal>DatabaseInfo</literal>, <literal>AttributeDetails</literal>.
 439      </para>
 440      <para>
 441       See <filename>tab/explain.att</filename> and the
 442       <ulink url="&url.z39.50;">Z39.50</ulink> standard
 443       for more information.
 444      </para>
 445     </sect3>
 446
 447     <sect3>
 448      <title>Explain searches with yaz-client</title>
 449      <para>
 450       Classic Explain only defines retrieval of Explain information
 451       via ASN.1. Practically no Z39.50 clients support this. Fortunately
 452       they don't have to - Zebra allows retrieval of this information
 453       in other formats:
 454       <literal>SUTRS</literal>, <literal>XML</literal>,
 455       <literal>GRS-1</literal> and  <literal>ASN.1</literal> Explain.
 456      </para>
 457
 458      <para>
 459       List supported categories to find out which explain commands are
 460       supported:
 461       <screen>
 462        Z> base IR-Explain-1
 463        Z> find @attr exp1 1=1 categorylist
 464        Z> form sutrs
 465        Z> show 1+2
 466       </screen>
 467      </para>
 468
 469      <para>
 470       Get target info, that is, investigate which databases exist at
 471       this server endpoint:
 472       <screen>
 473        Z> base IR-Explain-1
 474        Z> find @attr exp1 1=1 targetinfo
 475        Z> form xml
 476        Z> show 1+1
 477        Z> form grs-1
 478        Z> show 1+1
 479        Z> form sutrs
 480        Z> show 1+1
 481       </screen>
 482      </para>
 483
 484      <para>
 485       List all supported databases. The number of hits
 486       is the number of databases found, which most commonly are the
 487       following two:
 488       the <literal>Default</literal> and the
 489       <literal>IR-Explain-1</literal> databases.
 490       <screen>
 491        Z> base IR-Explain-1
 492        Z> find @attr exp1 1=1 databaseinfo
 493        Z> form sutrs
 494        Z> show 1+2
 495       </screen>
 496      </para>
 497
 498      <para>
 499       Get database info record for database <literal>Default</literal>.
 500       <screen>
 501        Z> base IR-Explain-1
 502        Z> find @and @attr exp1 1=1 databaseinfo @attr exp1 1=3 Default
 503       </screen>
 504       Identical query with explicitly specified attribute set:
 505       <screen>
 506        Z> base IR-Explain-1
 507        Z> find @attrset exp1 @and @attr 1=1 databaseinfo @attr 1=3 Default
 508       </screen>
 509      </para>
 510
 511      <para>
 512       Get attribute details record for database
 513       <literal>Default</literal>.
 514       This query is very useful to study the internal Zebra indexes.
 515       If records have been indexed using the <literal>alvis</literal>
 516       XSLT filter, the string representation names of the known indexes can be
 517       found.
 518       <screen>
 519        Z> base IR-Explain-1
 520        Z> find @and @attr exp1 1=1 attributedetails @attr exp1 1=3 Default
 521       </screen>
 522       Identical query with explicitly specified attribute set:
 523       <screen>
 524        Z> base IR-Explain-1
 525        Z> find @attrset exp1 @and @attr 1=1 attributedetails @attr 1=3 Default
 526       </screen>
 527      </para>
 528     </sect3>
 529
 530    </sect2>
 531
 532    <sect2 id="querymodel-bib1">
 533     <title>Bib1 Attribute Set</title>
 534     <para>
 535      Something about querying to be written ..
 536     </para>
 537     <para>
 538      Most of the information contained in this section is an excerpt of
 539      the <literal>ATTRIBUTE SET BIB-1 (Z39.50-1995)
 540       SEMANTICS</literal>,
 541      found at <ulink url="&url.z39.50.attset.bib1.1995;">The BIB-1
 542       Attribute Set Semantics</ulink> from 1995, also in an updated
 543      <ulink url="&url.z39.50.attset.bib1;">Bib-1
 544       Attribute Set</ulink>
 545      version from 2003. Index Data is not the copyright holder of this
 546      information.
 547     </para>
 548
 549
 550    <sect3 id="querymodel-bib1-use">
 551      <title>Use Attributes (type = 1)</title>
 552     </sect3>
 553
 554     <para>
 555      Phrase search for <emphasis>information retrieval</emphasis> in
 556      the title-register:
 557      <screen>
 558       Z> find @attr 1=4 "information retrieval"
 559      </screen>
 560     </para>
 561
 562
 563     <sect3 id="querymodel-bib1-relation">
 564      <title>Relation Attributes (type = 2)</title>
 565     </sect3>
 566     <para>
 567      Supported relations: less than (1), less than or equal (2), equal (3, the
 568      default if omitted), greater or equal (4), and greater than (5). Unsupported: not equal (6).
 569
 570      The following relation attributes are also supported: relevance (102).
 571      <!-- always-matches (103) not supported for all indexes -->
 572
 573      All operations are based on a lexicographical ordering,
 574      <emphasis>except</emphasis> in the case of the
 575      following structure attributes: numeric (109).
 576
 577
 578     </para>
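         <para>
          As an illustration, and assuming a date index behind bib-1 use
          attribute 30, a greater-or-equal search (relation 4) could look
          like the first query below, which compares terms
          lexicographically. The second query adds the numeric structure
          attribute (109) to request numeric comparison; this presumes that
          the index has been configured as a numeric index.
          <screen>
           Z> find @attr 1=30 @attr 2=4 1990
           Z> find @attr 1=30 @attr 2=4 @attr 4=109 1990
          </screen>
         </para>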
 579
 580     <para>
 581      Ranked search for <emphasis>information retrieval</emphasis> in
 582      the title-register
 583      (see <xref linkend="administration-ranking"/> for the gory details):
 584      <screen>
 585       Z> find @attr 1=4 @attr 2=102 "information retrieval"
 586      </screen>
 587     </para>
 588
 589     <sect3 id="querymodel-bib1-position">
 590      <title>Position Attributes (type = 3)</title>
 591      <para>
 592       Only the value any position (3) is supported. The values first in field (1)
 593       and first in subfield (2) are unsupported, but using them
 594       does not trigger an error.
 595       <!-- It should -->
           </para>
 596     </sect3>
 597
 598     <sect3 id="querymodel-bib1-structure">
 599      <title>Structure Attributes (type = 4)</title>
 600      <!-- See tab/default.idx -->
 601     </sect3>
 602
 603     <para>
 604      For example, in
 605      the GILS schema (<literal>gils.abs</literal>), the
 606      west-bounding-coordinate is indexed as type <literal>n</literal>,
 607      and is therefore searched by specifying
 608      <emphasis>structure</emphasis>=<emphasis>Numeric String</emphasis>.
 609      To match all those records with west-bounding-coordinate greater
 610      than -114 we use the following query:
 611      <screen>
 612       Z> find @attr 4=109 @attr 2=5 @attr gils 1=2038 -114
 613      </screen>
 614     </para>
 615
 616     <sect3 id="querymodel-bib1-truncation">
 617      <title>Truncation Attributes (type = 5)</title>
 618      <para>
 619       Supported are: No truncation (100), which is the default,
 620       Right truncation (1), Left truncation (2),
 621       Left&amp;Right truncation (3),
 622       Process <literal>#</literal> in term (101), which maps
 623       each # to <literal>.*</literal>,
 624       Regexp-1 (102), normal regular expressions, and Regexp-2 (103), regular expressions with fuzzy matching.
 625       <!--
 626       Special 104, 105, 106 are deprecated and will be removed! -->
           </para>
 627
 628     </sect3>
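         <para>
          The following queries sketch some of the truncation values against
          a title index (bib-1 use attribute 4); the terms are illustrative
          only. The first uses right truncation and matches terms beginning
          with <emphasis>inform</emphasis>, the second uses left truncation
          and matches terms ending in <emphasis>ation</emphasis>, and the
          third uses value 101 so that the <literal>#</literal> is processed
          as a wildcard inside the term.
          <screen>
           Z> find @attr 1=4 @attr 5=1 inform
           Z> find @attr 1=4 @attr 5=2 ation
           Z> find @attr 1=4 @attr 5=101 "inf#mation"
          </screen>
         </para>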
 629
 630     <sect3 id="querymodel-bib1-completeness">
 631     <title>Completeness Attributes (type = 6)</title>
 632      <para>
 633       This attribute is ONLY used if structure w, p is to be
 634       chosen. Completeness is ignored if neither w nor p is to be
 635       used.
 636       Incomplete field (1) is the default and makes Zebra use
 637       register type w.
 638       Complete subfield (2) and complete field (3) both trigger
 639       search field type p.
           </para>
 640     </sect3>
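         <para>
          A minimal sketch of the difference, assuming a title index
          (bib-1 use attribute 4) that is configured with both a word (w)
          and a phrase (p) register: the first query is evaluated against
          the word register, the second against the phrase register.
          <screen>
           Z> find @attr 1=4 @attr 6=1 "information retrieval"
           Z> find @attr 1=4 @attr 6=3 "information retrieval"
          </screen>
         </para>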
 641    </sect2>
 642
 643
 644    <sect2 id="querymodel-zebra-attr-search">
 645     <title>Zebra specific Search Extensions to all Attribute Sets</title>
 646     <para>
 647      Zebra extends the Bib1 attribute types, and these extensions are
 648      recognized regardless of attribute
 649      set used in a <literal>search</literal> operation query.
 650     </para>
 651
 652      <table id="querymodel-zebra-attr-search-table">
 653       <caption>Zebra Search Attribute Extensions</caption>
 654        <thead>
 655         <tr>
 656          <td><emphasis>Name and Type</emphasis></td>
 657          <td>Operation</td>
 658          <td>Zebra version</td>
 659         </tr>
 660       </thead>
 661        <tbody>
 662         <tr>
 663          <td><emphasis>Embedded Sort (type 7)</emphasis></td>
 664          <td>search</td>
 665          <td>1.1</td>
 666         </tr>
 667         <tr>
 668          <td><emphasis>Term Set (type 8)</emphasis></td>
 669          <td>search</td>
 670          <td>1.1</td>
 671         </tr>
 672         <tr>
 673          <td><emphasis>Rank weight  (type 9)</emphasis></td>
 674          <td>search</td>
 675          <td>1.1</td>
 676         </tr>
 677         <tr>
 678          <td><emphasis>Approx Limit (type 9)</emphasis></td>
 679          <td>search</td>
 680          <td>1.4</td>
 681         </tr>
 682         <tr>
 683          <td><emphasis>Term Reference (type 10)</emphasis></td>
 684          <td>search</td>
 685          <td>1.4</td>
 686         </tr>
 687        </tbody>
 688       </table>
 689
 690     <sect3 id="querymodel-zebra-attr-sorting">
 691      <title>Zebra Extension Embedded Sort Attribute (type 7)</title>
 692     </sect3>
 693     <para>
 694      The embedded sort is a way to specify sort within a query - thus
 695      removing the need to send a Sort Request separately. It is both
 696      faster and does not require clients to deal with the Sort
 697      Facility.
 698     </para>
 699     <para>
 700      The possible values after attribute <literal>type 7</literal> are
 701      <literal>1</literal> (ascending) and
 702      <literal>2</literal> (descending).
 703      The attributes+term (APT) node is separate from the
 704      rest and must be <literal>@or</literal>'ed.
 705      The term associated with the APT is the sort level as an integer,
 706      where <literal>0</literal> means primary sort,
 707      <literal>1</literal> means secondary sort, and so forth.
 708      See also <xref linkend="administration-ranking"/>.
 709     </para>
 710     <para>
 711      For example, searching for water, sort by title (ascending)
 712      <screen>
 713       Z> find @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
 714      </screen>
 715     </para>
 716     <para>
 717      Or, searching for water, sort by title ascending, then date descending
 718      <screen>
 719       Z> find @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
 720      </screen>
 721     </para>
 722
 723     <sect3 id="querymodel-zebra-attr-estimation">
 724      <title>Zebra Extension Term Set Attribute (type 8)</title>
 725     </sect3>
 726     <para>
 727      The Term Set feature is a facility that allows a search to store
 728      hitting terms in a "pseudo" result set; thus a search (as usual) plus
 729      a scan-like facility. It requires a client that can do named result
 730      sets, since the search generates two result sets. The value for
 731      attribute 8 is the name of a result set (string). The terms in
 732      the named term set are returned as SUTRS records.
 733     </para>
 734     <para>
 735      For example, searching  for u in title, right truncated, and
 736      storing the result in term set named 'aset'
 737      <screen>
 738       Z> find @attr 5=1 @attr 1=4 @attr 8=aset u
 739      </screen>
 740     </para>
 741     <warning>
 742      The model has one serious flaw: we don't know the size of the term
 743      set. Experimental. Do not use in production code.
 744     </warning>
 745
 746     <sect3 id="querymodel-zebra-attr-weight">
 747      <title>Zebra Extension Rank Weight Attribute (type 9)</title>
 748     </sect3>
 749     <para>
 750      Rank weight is a way to pass a value to a ranking algorithm, so
 751      that one APT has one value while another has a different one.
 752      See also <xref linkend="administration-ranking"/>.
 753     </para>
 754     <para>
 755      For example, searching  for utah in title with weight 30 as well
 756      as any with weight 20:
 757      <screen>
 758       Z> find @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
 759      </screen>
 760     </para>
 761
 762     <sect3 id="querymodel-zebra-attr-limit">
 763      <title>Zebra Extension Approximative Limit Attribute (type 9)</title>
 764     </sect3>
 765     <para>
 766      Newer Zebra versions normally estimate the hit count for every APT
 767      (leaf) in the query tree. These hit counts are returned as part of
 768      the searchResult-1 facility in the binary encoded Z39.50 search
 769      response packages.
 770     </para>
 771     <para>
 772      By setting a limit for the APT we can make Zebra switch to an
 773      approximate hit count when a certain hit count limit is
 774      reached. A value of zero means exact hit count.
 775     </para>
 776     <para>
 777      For example, we might be interested in an exact hit count for a, but
 778      for b we allow hit count estimates for 1000 and higher:
 779      <screen>
 780       Z> find @and a @attr 9=1000 b
 781      </screen>
 782     </para>
 783     <note>
 784      The estimated hit count facility makes searches faster, as one
 785      only needs to process large hit lists partially.
 786     </note>
 787     <warning>
    This facility clashes with rank weight, because there all
    documents in the hit lists need to be examined for scoring and
    re-sorting.
    It is an experimental extension. Do not use in production code.
 793     </warning>
 794
 795     <sect3 id="querymodel-zebra-attr-termref">
    <title>Zebra Extension Term Reference Attribute (type 10)</title>
 797     </sect3>
 798     <para>
    Zebra supports the searchResult-1 facility. If attribute 10 is
    given, its value specifies a subqueryId that is returned as part of
    the search result. This lets a client name an individual APT within
    a query.
 803     </para>
 804     <!--
 805     <para>
 806      <screen>
 807      </screen>
 808     </para>
 809     -->
 810     <warning>
 811      Experimental. Do not use in production code.
 812     </warning>
 813
 814
 815    </sect2>
 816
 817
 818    <sect2 id="querymodel-zebra-attr-scan">
   <title>Zebra-specific Scan Extensions to all Attribute Sets</title>
 820     <para>
    Zebra extends the Bib1 attribute types, and these extensions are
    recognized regardless of the attribute
    set used in a <literal>scan</literal> operation query.
 824     </para>
 825      <table id="querymodel-zebra-attr-scan-table">
     <caption>Zebra Scan Attribute Extensions</caption>
 827        <thead>
 828         <tr>
 829          <td><emphasis>Name and Type</emphasis></td>
 830          <td>Operation</td>
 831          <td>Zebra version</td>
 832         </tr>
 833       </thead>
 834        <tbody>
 835         <tr>
 836          <td><emphasis>Result Set Narrow (type 8)</emphasis></td>
 837          <td>scan</td>
 838          <td>1.3</td>
 839         </tr>
 840         <tr>
 841          <td><emphasis>Approximative Limit (type 9)</emphasis></td>
 842          <td>scan</td>
 843          <td>1.4</td>
 844         </tr>
 845        </tbody>
 846       </table>
 847
 848     <sect3 id="querymodel-zebra-attr-xyz">
    <title>Zebra Extension Result Set Narrow (type 8)</title>
 850     </sect3>
 851     <para>
    If attribute 8 is given for scan, the value is the name of a
    result set. Each hit count in scan is @and'ed with the given
    result set.
 855     </para>
 856     <!--
 857     <para>
 858      <screen>
 859      </screen>
 860     </para>
 861     -->
 862     <warning>
 863      Experimental and buggy. Definitely not to be used in production code.
 864     </warning>
 865
 866     <sect3 id="querymodel-zebra-attr-xyz">
    <title>Zebra Extension Approximative Limit (type 9)</title>
 868     </sect3>
 869     <para>
    As for search, the approximative limit is a way to enable
    approximate hit counts in scan results.
 872     </para>
 873     <!--
 874     <para>
 875      <screen>
 876      </screen>
 877     </para>
 878     -->
 879     <warning>
 880      Experimental. Do not use in production code.
 881     </warning>
 882
 883
 884    </sect2>
 885
 886
 887    <sect2 id="querymodel-bib1-mapping">
 888     <title>Mapping from Bib1 Attributes to Zebra internal
 889      register indexes</title>
 890     <para>
 891      TO-DO
 892      </para>
 893
 894
 895      <!-- see in util/zebramap.c
 896       int zebra_maps_attr
 897
 898   if (completeness_value == 2 || completeness_value == 3)
 899         *complete_flag = 1;
 900     else
 901         *complete_flag = 0;
 902     *reg_id = 0;
 903
 904     *sort_flag =(sort_relation_value > 0) ? 1 : 0;
 905     *search_type = "phrase";
 906     strcpy(rank_type, "void");
 907     if (relation_value == 102)
 908     {
 909         if (weight_value == -1)
 910             weight_value = 34;
 911         sprintf(rank_type, "rank,w=%d,u=%d", weight_value, use_value);
 912     }
 913     if (relation_value == 103)
 914     {
 915         *search_type = "always";
 916         *reg_id = 'w';
 917         return 0;
 918     }
 919     if (*complete_flag)
 920         *reg_id = 'p';
 921     else
 922         *reg_id = 'w';
 923     switch (structure_value)
 924     {
 925     case 6:   /* word list */
 926         *search_type = "and-list";
 927         break;
 928     case 105: /* free-form-text */
 929         *search_type = "or-list";
 930         break;
 931     case 106: /* document-text */
 932         *search_type = "or-list";
 933         break;
 934     case -1:
 935     case 1:   /* phrase */
 936     case 2:   /* word */
 937     case 108: /* string */
 938         *search_type = "phrase";
 939         break;
 940    case 107: /* local-number */
 941         *search_type = "local";
 942         *reg_id = 0;
 943         break;
 944     case 109: /* numeric string */
 945         *reg_id = 'n';
 946         *search_type = "numeric";
 947         break;
 948     case 104: /* urx */
 949         *reg_id = 'u';
 950         *search_type = "phrase";
 951         break;
 952     case 3:   /* key */
 953         *reg_id = '0';
 954         *search_type = "phrase";
 955         break;
 956     case 4:  /* year */
 957         *reg_id = 'y';
 958         *search_type = "phrase";
 959         break;
 960     case 5:  /* date */
 961         *reg_id = 'd';
 962         *search_type = "phrase";
 963         break;
 964     default:
 965         return -1;
 966     }
 967     return 0;
 968
 969      -->
 970
 971
 972     <para>
 973      <emphasis>Use</emphasis> attributes are interpreted according to the
 974      attribute sets which have been loaded in the
 975     <literal>zebra.cfg</literal> file, and are matched against specific
 976      fields as specified in the <literal>.abs</literal> file which
 977      describes the profile of the records which have been loaded.
 978      If no Use attribute is provided, a default of Bib-1 Any is assumed.
 979     </para>
 980
 981     <para>
 982      If a <emphasis>Structure</emphasis> attribute of
 983      <emphasis>Phrase</emphasis> is used in conjunction with a
 984      <emphasis>Completeness</emphasis> attribute of
 985      <emphasis>Complete (Sub)field</emphasis>, the term is matched
 986      against the contents of the phrase (long word) register, if one
 987      exists for the given <emphasis>Use</emphasis> attribute.
    A phrase register is created for those fields in the
    <literal>.abs</literal> file that contain a
    <literal>p</literal>-specifier.
 991      <!-- ### whatever the hell _that_ is -->
 992     </para>
 993
 994     <para>
    If <emphasis>Structure</emphasis>=<emphasis>Phrase</emphasis> is
    used in conjunction with <emphasis>Incomplete Field</emphasis> (the
    default value for <emphasis>Completeness</emphasis>), the
 998      search is directed against the normal word registers, but if the term
 999      contains multiple words, the term will only match if all of the words
1000      are found immediately adjacent, and in the given order.
1001      The word search is performed on those fields that are indexed as
1002      type <literal>w</literal> in the <literal>.abs</literal> file.
1003     </para>
1004
1005     <para>
1006      If the <emphasis>Structure</emphasis> attribute is
1007      <emphasis>Word List</emphasis>,
1008      <emphasis>Free-form Text</emphasis>, or
1009      <emphasis>Document Text</emphasis>, the term is treated as a
1010      natural-language, relevance-ranked query.
1011      This search type uses the word register, i.e. those fields
1012      that are indexed as type <literal>w</literal> in the
1013      <literal>.abs</literal> file.
1014     </para>
1015
1016     <para>
1017      If the <emphasis>Structure</emphasis> attribute is
1018      <emphasis>Numeric String</emphasis> the term is treated as an integer.
1019      The search is performed on those fields that are indexed
1020      as type <literal>n</literal> in the <literal>.abs</literal> file.
1021     </para>
1022
1023     <para>
1024      If the <emphasis>Structure</emphasis> attribute is
1025      <emphasis>URx</emphasis> the term is treated as a URX (URL) entity.
1026      The search is performed on those fields that are indexed as type
1027      <literal>u</literal> in the <literal>.abs</literal> file.
1028     </para>
1029
1030     <para>
1031      If the <emphasis>Structure</emphasis> attribute is
    <emphasis>Local Number</emphasis> the term is treated as
    a native Zebra Record Identifier.
1034     </para>
1035
1036     <para>
1037      If the <emphasis>Relation</emphasis> attribute is
1038      <emphasis>Equals</emphasis> (default), the term is matched
1039      in a normal fashion (modulo truncation and processing of
1040      individual words, if required).
1041      If <emphasis>Relation</emphasis> is <emphasis>Less Than</emphasis>,
1042      <emphasis>Less Than or Equal</emphasis>,
    <emphasis>Greater Than</emphasis>, or <emphasis>Greater Than or
     Equal</emphasis>, the term is assumed to be numerical, and a
    standard regular expression is constructed to match the values
    that satisfy the relation.
1047      If <emphasis>Relation</emphasis> is <emphasis>Relevance</emphasis>,
1048      the standard natural-language query processor is invoked.
1049     </para>
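   <para>
    The register selection described in the preceding paragraphs can be
    summarized in a short Python sketch. It follows the logic of the
    <literal>zebra_maps_attr</literal> function in
    <literal>util/zebramap.c</literal>, but the Python names
    (<literal>map_attributes</literal>, <literal>FIXED</literal>,
    <literal>BY_COMPLETENESS</literal>) are assumptions made for
    illustration, and the value -1 is used here to mean that an
    attribute was not given.
   </para>
   <programlisting><![CDATA[
```python
# Illustrative sketch of the attribute-to-register mapping; it follows
# zebra_maps_attr in util/zebramap.c. Function and constant names are
# hypothetical, and -1 stands for "attribute not given".

# Structure values that force a specific register and search type.
FIXED = {
    107: (None, "local"),    # local number: no register is selected
    109: ("n", "numeric"),   # numeric string
    104: ("u", "phrase"),    # URx
    3: ("0", "phrase"),      # key
    4: ("y", "phrase"),      # year
    5: ("d", "phrase"),      # date
}

# Structure values that keep the register chosen by completeness.
BY_COMPLETENESS = {
    6: "and-list",    # word list
    105: "or-list",   # free-form text
    106: "or-list",   # document text
    -1: "phrase",     # no structure attribute
    1: "phrase",      # phrase
    2: "phrase",      # word
    108: "phrase",    # string
}


def map_attributes(structure=-1, completeness=-1, relation=-1):
    """Return (register_id, search_type) for one APT."""
    if relation == 103:                      # handled before the structure
        return ("w", "always")
    if structure in FIXED:
        return FIXED[structure]
    if structure in BY_COMPLETENESS:
        # Completeness 2 or 3 selects the phrase register, else the word one.
        register = "p" if completeness in (2, 3) else "w"
        return (register, BY_COMPLETENESS[structure])
    raise ValueError(f"unsupported structure attribute: {structure}")


if __name__ == "__main__":
    print(map_attributes(structure=1, completeness=3))
```
]]></programlisting>
   <para>
    Under these assumptions, a phrase structure with a complete
    (sub)field completeness selects the <literal>p</literal> register,
    while the same structure without a completeness attribute selects
    the <literal>w</literal> register.
   </para>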
1050
1051     <para>
1052      For the <emphasis>Truncation</emphasis> attribute,
1053      <emphasis>No Truncation</emphasis> is the default.
1054      <emphasis>Left Truncation</emphasis> is not supported.
1055      <emphasis>Process # in search term</emphasis> is supported, as is
1056      <emphasis>Regxp-1</emphasis>.
    <emphasis>Regxp-2</emphasis> enables the fault-tolerant (fuzzy)
    search. By default, a single error (deletion, insertion, or
    replacement) is accepted when terms are matched against the register
    contents.
1061     </para>
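   <para>
    To make the notion of a single error concrete, the following Python
    sketch computes the classic edit distance, in which each deletion,
    insertion, or replacement counts as one error, and accepts a
    candidate when the distance does not exceed the tolerance. It
    illustrates the matching criterion only and makes no claim about
    how Zebra implements it; the function names are hypothetical.
   </para>
   <programlisting><![CDATA[
```python
# Illustrative sketch of "at most one error"; not Zebra's implementation.


def edit_distance(a: str, b: str) -> int:
    """Number of deletions, insertions and replacements turning a into b."""
    prev = list(range(len(b) + 1))           # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        cur = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                 # delete ca
                cur[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),    # replace ca by cb, or keep it
            ))
        prev = cur
    return prev[-1]


def fuzzy_match(term: str, candidate: str, max_errors: int = 1) -> bool:
    """True if candidate is within max_errors edits of term."""
    return edit_distance(term, candidate) <= max_errors


if __name__ == "__main__":
    for candidate in ("zebre", "zebr", "zzebra", "zbre"):
        print(candidate, fuzzy_match("zebra", candidate))
```
]]></programlisting>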
1062    </sect2>
1063
1064    <sect2  id="querymodel-regular">
1065     <title>Zebra Regular Expressions in Truncation Attribute (type = 5)</title>
1066
1067     <para>
1068      Each term in a query is interpreted as a regular expression if
1069      the truncation value is either <emphasis>Regxp-1 (@attr 5=102)</emphasis>
1070      or <emphasis>Regxp-2 (@attr 5=103)</emphasis>.
1071      Both query types follow the same syntax with the operands:
1072     </para>
1073
1074      <table id="querymodel-regular-operands-table">
1075       <caption>Regular Expression Operands</caption>
1076        <!--
1077        <thead>
1078        <tr><td>one</td><td>two</td></tr>
1079       </thead>
1080        -->
1081        <tbody>
1082         <tr>
1083          <td><emphasis>x</emphasis></td>
1084          <td>Matches the character <emphasis>x</emphasis>.</td>
1085         </tr>
1086         <tr>
1087          <td><emphasis>.</emphasis></td>
1088          <td>Matches any character.</td>
1089         </tr>
1090         <tr>
1091          <td><emphasis>[ .. ]</emphasis></td>
1092          <td>Matches the set of characters specified;
1093          such as <literal>[abc]</literal> or <literal>[a-c]</literal>.</td>
1094         </tr>
1095        </tbody>
1096       </table>
1097
1098     <para>
1099      The above operands can be combined with the following operators:
1100     </para>
1101
1102
1103      <table id="querymodel-regular-operators-table">
1104       <caption>Regular Expression Operators</caption>
1105        <!--
1106        <thead>
1107        <tr><td>one</td><td>two</td></tr>
1108       </thead>
1109        -->
1110        <tbody>
1111         <tr>
1112          <td><emphasis>x*</emphasis></td>
1113          <td>Matches <emphasis>x</emphasis> zero or more times.
1114           Priority: high.</td>
1115         </tr>
1116         <tr>
1117          <td><emphasis>x+</emphasis></td>
1118          <td>Matches <emphasis>x</emphasis> one or more times.
1119           Priority: high.</td>
1120         </tr>
1121         <tr>
1122          <td><emphasis>x?</emphasis></td>
        <td>Matches <emphasis>x</emphasis> zero times or once.
1124           Priority: high.</td>
1125         </tr>
1126         <tr>
1127          <td><emphasis>xy</emphasis></td>
1128          <td> Matches <emphasis>x</emphasis>, then <emphasis>y</emphasis>.
1129          Priority: medium.</td>
1130         </tr>
1131         <tr>
1132          <td><emphasis>x|y</emphasis></td>
1133          <td> Matches either <emphasis>x</emphasis> or <emphasis>y</emphasis>.
1134          Priority: low.</td>
1135         </tr>
1136         <tr>
1137          <td><emphasis>( )</emphasis></td>
1138          <td>The order of evaluation may be changed by using parentheses.</td>
1139         </tr>
1140        </tbody>
1141       </table>
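   <para>
    The priorities in the table determine how an expression is grouped
    when no parentheses are given. The Python sketch below demonstrates
    the grouping with Python's <literal>re</literal> module, which is
    used here only as a stand-in because it applies the same relative
    priorities to these operators. It is not Zebra's regular expression
    engine and may differ from it in other respects.
   </para>
   <programlisting><![CDATA[
```python
import re

# Python's re module stands in for Zebra's regular expression engine here;
# only the operator priorities from the table are being illustrated.


def matches(pattern: str, term: str) -> bool:
    """True if the whole term matches the pattern."""
    return re.fullmatch(pattern, term) is not None


# (pattern, term, expected result)
CASES = [
    ("ab*", "abbb", True),      # '*' applies to 'b' only: a(b*)
    ("ab*", "abab", False),     # ... so it does not mean (ab)*
    ("(ab)*", "abab", True),    # parentheses change the grouping
    ("ab|cd", "cd", True),      # '|' has the lowest priority: (ab)|(cd)
    ("ab|cd", "abd", False),    # ... so it does not mean a(b|c)d
    ("a(b|c)d", "abd", True),
    ("informat.*", "information", True),
]

if __name__ == "__main__":
    for pattern, term, expected in CASES:
        assert matches(pattern, term) is expected
        print(f"{pattern!r} against {term!r}: {expected}")
```
]]></programlisting>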
1142
1143     <para>
1144      If the first character of the <emphasis>Regxp-2</emphasis> query
1145      is a plus character (<literal>+</literal>) it marks the
1146      beginning of a section with non-standard specifiers.
1147      The next plus character marks the end of the section.
    Currently Zebra only supports one specifier, the error tolerance,
    which consists of one digit.
1150     </para>
1151
1152     <para>
    Since the plus operator is normally a suffix operator, this addition to
    the query syntax doesn't violate the syntax for standard regular
    expressions.
1156     </para>
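   <para>
    One way to read the specifier rule is sketched below in Python: a
    term that starts with a plus character is split into the single
    digit error tolerance and the remaining regular expression, and any
    other term keeps the default tolerance of one error. The function
    name and the error handling for malformed sections are assumptions
    made for this example, not a description of Zebra's parser.
   </para>
   <programlisting><![CDATA[
```python
# Illustrative sketch; the function name and the error handling are
# assumptions, not Zebra's actual parser.


def split_regxp2_term(term: str, default_errors: int = 1):
    """Split a Regxp-2 term into (error_tolerance, regular_expression)."""
    if not term.startswith("+"):
        return default_errors, term          # no specifier section
    end = term.find("+", 1)                  # the next plus closes the section
    if end == -1:
        raise ValueError("specifier section is not terminated by '+'")
    spec = term[1:end]
    if len(spec) != 1 or spec not in "0123456789":
        raise ValueError(f"expected a single digit, got {spec!r}")
    return int(spec), term[end + 1:]


if __name__ == "__main__":
    print(split_regxp2_term("+2+informat.*"))
    print(split_regxp2_term("informat.*"))
```
]]></programlisting>
   <para>
    Under this reading, the term <literal>+2+informat.*</literal> asks
    for the expression <literal>informat.*</literal> with an error
    tolerance of two.
   </para>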
1157
1158     <para>
    For example, a phrase search with regular expressions in
    the title-register is performed like this:
1161      <screen>
1162       Z> find @attr 1=4 @attr 5=102 "informat.* retrieval"
1163      </screen>
1164     </para>
1165
1166     <para>
1167      Combinations with other attributes are possible. For example, a
1168      ranked search with a regular expression
    (see <xref linkend="administration-ranking"/> for the gory details):
1170      <screen>
1171       Z> find @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval"
1172      </screen>
1173     </para>
1174    </sect2>
1175
1176
1177    <!--
1178    <para>
1179     The RecordType parameter in the <literal>zebra.cfg</literal> file, or
1180     the <literal>-t</literal> option to the indexer tells Zebra how to
1181     process input records.
1182     Two basic types of processing are available - raw text and structured
1183     data. Raw text is just that, and it is selected by providing the
1184     argument <emphasis>text</emphasis> to Zebra. Structured records are
1185     all handled internally using the basic mechanisms described in the
1186     subsequent sections.
1187     Zebra can read structured records in many different formats.
1188    </para>
1189    -->
1190   </sect1>
1191
1192
1193   <sect1 id="querymodel-cql-to-pqf">
1194    <title>Server Side CQL to PQF Query Translation</title>
1195    <para>
   Using the
   <literal>&lt;cql2rpn&gt;l2rpn.txt&lt;/cql2rpn&gt;</literal>
   YAZ Frontend Virtual Hosts option, one can configure the YAZ
   Frontend CQL-to-PQF converter, specifying the interpretation of various
1202     <ulink url="&url.cql;">CQL</ulink>
1203     indexes, relations, etc. in terms of Type-1 query attributes.
1204     <!-- The  yaz-client config file -->
1205    </para>
1206    <para>
   For example, using server-side CQL-to-PQF conversion, one might
   query a Zebra server like this:
1209     <screen>
1210     <![CDATA[
1211      yaz-client localhost:9999
1212      Z> querytype cql
1213      Z> find text=(plant and soil)
1214      ]]>
1215     </screen>
    If properly configured, even static relevance ranking can
    be performed using CQL query syntax:
1218     <screen>
1219     <![CDATA[
1220      Z> find text = /relevant (plant and soil)
1221      ]]>
1222      </screen>
1223    </para>
1224
1225    <para>
   The same configuration can also be used to search with
   client-side CQL-to-PQF conversion. The only differences are
   <literal>querytype cql2rpn</literal> instead of
   <literal>querytype cql</literal>, and the <literal>-q</literal>
   option naming a local conversion file:
1232     <screen>
1233     <![CDATA[
1234      yaz-client -q local/cql2pqf.txt localhost:9999
1235      Z> querytype cql2rpn
1236      Z> find text=(plant and soil)
1237      ]]>
1238      </screen>
1239    </para>
1240
1241    <para>
   Exhaustive information can be found in the section
   "Specification of CQL to RPN mappings" in the YAZ manual,
   <ulink url="http://www.indexdata.dk/yaz/doc/tools.tkl#tools.cql.map">
    http://www.indexdata.dk/yaz/doc/tools.tkl#tools.cql.map</ulink>,
   and is therefore not repeated here.
1247    </para>
1248   <!--
1249   <para>
1250     See
1251       <ulink url="http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html">
1252       http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html</ulink>
1253     for the Maintenance Agency's work-in-progress mapping of Dublin Core
1254     indexes to Attribute Architecture (util, XD and BIB-2)
1255     attributes.
1256    </para>
1257    -->
1258  </sect1>
1259
1260
1261
1262 </chapter>
1263
1264  <!-- Keep this comment at the end of the file
1265  Local variables:
1266  mode: sgml
1267  sgml-omittag:t
1268  sgml-shorttag:t
1269  sgml-minimize-attributes:nil
1270  sgml-always-quote-attributes:t
1271  sgml-indent-step:1
1272  sgml-indent-data:t
1273  sgml-parent-document: "zebra.xml"
1274  sgml-local-catalogs: nil
1275  sgml-namecase-general:t
1276  End:
1277  -->