doc/querymodel.xml

   1  <chapter id="querymodel">
   2   <!-- $Id: querymodel.xml,v 1.5 2006-06-14 13:57:45 marc Exp $ -->
   3   <title>Query Model</title>
   4
   5   <sect1 id="querymodel-overview">
   6    <title>Query Model Overview</title>
   7
   8
   9    <sect2 id="querymodel-query-languages">
  10     <title>Query Languages</title>
  11
  12     <para>
  13      Zebra was designed as a networked Information Retrieval engine adhering
  14      to the international standards
  15      <ulink url="&url.z39.50;">Z39.50</ulink> and
  16      <ulink url="&url.sru;">SRU</ulink>,
  17      and implements the query model defined there.
  18      Unfortunately, the Z39.50 query model defines only a binary
  19      encoded representation, which is used as transport packaging in
  20      the Z39.50 protocol layer. This representation is not human
  21      readable, nor does it define any convenient way to specify queries.
  22     </para>
  23    <!-- tell about RPN - include link to YAZ
  24         url.yaz.pqf -->
  25
  26    <sect3 id="querymodel-query-languages-pqf">
  27     <title>Prefix Query Format (PQF)</title>
  28
  29    <para>
  30      Index Data has defined a textual representation, the
  31      <literal>Prefix Query Format</literal>, or
  32      <literal>PQF</literal> for short, which has since been adopted by other
  33      parties developing Z39.50 software. It is also often referred to as
  34      <literal>Prefix Query Notation</literal>, or
  35      <literal>PQN</literal> for short, and is explained in detail in
  36      <xref linkend="querymodel-pqf"/>.
  37     </para>
  38    </sect3>
  39
  40
  41    <!-- PQF/RPN is natively supported. CQL is NOT . So we need a map -->
  42    <sect3 id="querymodel-query-languages-cql">
  43     <title>Common Query Language (CQL)</title>
  44    <para>
  45      In addition, Zebra can be configured to understand and map the
  46      <literal>Common Query Language</literal>
  47      (<ulink url="&url.cql;">CQL</ulink>)
  48      to PQF. For an introduction to the mapping to the internal query
  49      representation, see
  50      <xref linkend="querymodel-cql-to-pqf"/>.
  51     </para>
  52    </sect3>
  53
  54    </sect2>
  55
  56    <sect2 id="querymodel-query-types">
  57     <title>Query types</title>
  58     <para>
  59     </para>
  60
  61     <sect3 id="querymodel-query-type-explain">
  62      <title>Explain Queries</title>
  63      <para>
  64      </para>
  65     </sect3>
  66
  67     <sect3 id="querymodel-query-type-search">
  68      <title>Search Queries</title>
  69      <para>
  70      </para>
  71     </sect3>
  72
  73     <sect3 id="querymodel-query-type-scan">
  74      <title>Scan Queries</title>
  75      <para>
  76      </para>
  77     </sect3>
  78
  79    </sect2>
  80
  81  </sect1>
  82
  83
  84   <sect1 id="querymodel-pqf">
  85    <title>Prefix Query Format structure and syntax</title>
  86    <para>
  87     The <ulink url="&url.yaz.pqf;">PQF grammar</ulink>
  88     is documented in the YAZ manual, and is not
  89     repeated here. During search, this textual PQF representation
  90     is always mapped to the equivalent Zebra internal
  91     query parse tree.
  92    </para>
  93
  94    <sect2 id="querymodel-pqf-tree">
  95     <title>PQF tree structure</title>
  96     <para>
  97      The PQF parse tree - or the equivalent textual representation -
  98      may start with one specification of the
  99      <emphasis>attribute set</emphasis> used. It is followed by a query
 100      tree, which
 101      consists of <emphasis>atomic query parts</emphasis>, optionally
 102      paired by <emphasis>boolean binary operators</emphasis>, and
 103      finally <emphasis>recursively combined</emphasis> into
 104      complex query trees.
 105     </para>
 106
 107     <sect3 id="querymodel-attribute-sets">
 108      <title>Attribute sets</title>
 109      <para>
 110       Attribute sets define the exact meaning and semantics of queries
 111       issued. Zebra comes with some predefined attribute set
 112       definitions; others can easily be defined and added to the
 113       configuration.
 114       <note>
 115        The Zebra internal query processing is modeled after
 116        the <literal>Bib1</literal> attribute set, and the non-use
 117        attribute types 2-6 are hard-wired in. It is therefore essential
 118        to be familiar with <xref linkend="querymodel-bib1"/>.
 119       </note>
 120      </para>
 121
 122      <table id="querymodel-attribute-sets-table">
 123       <caption>Attribute sets predefined in Zebra</caption>
 124        <!--
 125        <thead>
 126        <tr><td>one</td><td>two</td></tr>
 127       </thead>
 128        -->
 129        <tbody>
 130         <tr>
 131          <td><emphasis>exp-1</emphasis></td>
 132          <td><literal>Explain</literal> attribute set</td>
 133          <td>Special attribute set used on the special automagic
 134           <literal>IR-Explain-1</literal> database to gain information on
 135           server capabilities, database names, and database
 136           semantics.</td>
 137         </tr>
 138         <tr>
 139          <td><emphasis>bib-1</emphasis></td>
 140          <td><literal>Bib1</literal> attribute set</td>
 141          <td>Standard PQF query language attribute set which defines the
 142           semantics of Z39.50 searching. In addition, all of the
 143           non-use attributes (types 2-9) define the Zebra internal query
 144           processing.</td>
 145         </tr>
 146         <tr>
 147          <td><emphasis>gils</emphasis></td>
 148          <td><literal>GILS</literal> attribute set</td>
 149          <td>Extension to the <literal>Bib1</literal> attribute set.</td>
 150         </tr>
 151        </tbody>
 152      </table>
 153     </sect3>
 154
 155     <sect3 id="querymodel-boolean-operators">
 156      <title>Boolean operators</title>
 157      <para>
 158       A pair of subquery trees, or of atomic queries, is combined
 159       using the standard boolean operators into new query trees.
 160      </para>
 161
 162      <table id="querymodel-boolean-operators-table">
 163       <caption>Boolean operators</caption>
 164        <!--
 165        <thead>
 166        <tr><td>one</td><td>two</td></tr>
 167       </thead>
 168        -->
 169        <tbody>
 170         <tr><td><emphasis>@and</emphasis></td>
 171          <td>binary <literal>AND</literal> operator</td>
 172          <td>Set intersection of the hit sets of two atomic queries</td>
 173         </tr>
 174         <tr><td><emphasis>@or</emphasis></td>
 175          <td>binary <literal>OR</literal> operator</td>
 176          <td>Set union of the hit sets of two atomic queries</td>
 177         </tr>
 178         <tr><td><emphasis>@not</emphasis></td>
 179          <td>binary <literal>AND NOT</literal> operator</td>
 180          <td>Set difference of the hit sets of two atomic queries</td>
 181         </tr>
 182         <tr><td><emphasis>@prox</emphasis></td>
 183          <td>binary <literal>PROXIMITY</literal> operator</td>
 184          <td>Set intersection of the hit sets of two atomic queries. In
 185           addition, all documents which do not satisfy the requested
 186           query term proximity are purged from the intersection set.
 187           The result is usually a proper subset of the result of the
 188           AND operation.</td>
 189         </tr>
 190        </tbody>
 191      </table>
 192
 193      <para>
 194       For example, we can combine the terms
 195       <emphasis>information</emphasis> and <emphasis>retrieval</emphasis>
 196       into different searches in the default index of the default
 197       attribute set as follows.
 198       Querying for the union of all documents containing the
 199       terms <emphasis>information</emphasis> OR
 200       <emphasis>retrieval</emphasis>:
 201       <screen>
 202        Z> find @or information retrieval
 203       </screen>
 204      </para>
 205      <para>
 206       Querying for the intersection of all documents containing the
 207       terms <emphasis>information</emphasis> AND
 208       <emphasis>retrieval</emphasis>.
 209       The hit set is a subset of that of the corresponding
 210       OR query.
 211       <screen>
 212        Z> find @and information retrieval
 213       </screen>
 214      </para>
 215      <para>
 216       Querying for the intersection of all documents containing the
 217       terms <emphasis>information</emphasis> AND
 218       <emphasis>retrieval</emphasis>, taking proximity into account.
 219       The hit set is a subset of that of the corresponding AND query.
 220       The proximity parameters are defined in the YAZ PQF grammar.
 221       <screen>
 222        Z> find @prox 0 3 0 2 k 2 information retrieval
 223       </screen>
 224      </para>
 225      <para>
 226       Querying for the intersection of all documents containing the
 227       terms <emphasis>information</emphasis> AND
 228       <emphasis>retrieval</emphasis>, in the same order and near each
 229       other, as given in the term list.
 230       The hit set is a subset of that of the corresponding
 231       PROXIMITY query.
 232       <screen>
 233        Z> find "information retrieval"
 234       </screen>
 235      </para>
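          <para>
           As a sketch of how the operators nest (the terms are
           illustrative only, not taken from any particular database):
           every operator takes exactly two operands in prefix order,
           and an operand may itself be a complete operator expression.
           The first query below would read as
           (information OR data) AND retrieval, the second as
           (information AND retrieval) AND NOT database.
           <screen>
            Z> find @and @or information data retrieval
            Z> find @not @and information retrieval database
           </screen>
          </para>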
 236     </sect3>
 237
 238
 239     <sect3 id="querymodel-atomic-queries">
 240      <title>Atomic queries</title>
 241      <para>
 242       Atomic queries are the query parts which work on one access point
 243       only. These consist of an <literal>attribute list</literal>
 244       followed by a <literal>single term</literal> or a
 245       <literal>quoted term list</literal>.
 246      </para>
 247      <para>
 248       Unsupplied non-use attribute types 2-9 are either inherited from
 249       higher nodes in the query tree, or are set to Zebra's default values.
 250       See <xref linkend="querymodel-bib1"/> for details.
 251      </para>
 252
 253      <table id="querymodel-atomic-queries-table">
 254       <caption>Atomic queries</caption>
 255        <!--
 256        <thead>
 257        <tr><td>one</td><td>two</td></tr>
 258       </thead>
 259        -->
 260        <tbody>
 261         <tr><td><emphasis>attribute list</emphasis></td>
 262          <td>List of <literal>orthogonal</literal> attributes</td>
 263          <td>Any of the orthogonal attribute types may be omitted,
 264           these are inherited from higher query tree nodes, or if not
 265           inherited, are set to the default Zebra configuration values.
 266          </td>
 267         </tr>
 268         <tr><td><emphasis>term</emphasis></td>
 269          <td>single <literal>term</literal>
 270           or <literal>quoted term list</literal>   </td>
 271          <td>Here the search terms or list of search terms is added
 272           to the query</td>
 273         </tr>
 274        </tbody>
 275      </table>
 276      <para>
 277       Querying for the term <emphasis>information</emphasis> in the
 278       default index using the default attribute set, the server choice
 279       of access point/index, and the default non-use attributes:
 280       <screen>
 281        Z> find "information"
 282       </screen>
 283      </para>
 284      <para>
 285       Equivalent query fully specified:
 286       <screen>
 287        Z> find @attrset bib-1 @attr 1=1017 @attr 2=3 @attr 3=3 @attr 4=1 @attr 5=100 @attr 6=1 "information"
 288       </screen>
 289      </para>
 290
 291      <para>
 292       Finding all documents which have empty titles. Notice that the
 293       empty term must be quoted, but is otherwise legal.
 294       <screen>
 295        Z> find @attr 1=4 ""
 296       </screen>
 297      </para>
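          <para>
           A sketch of the attribute inheritance described above,
           assuming the YAZ PQF parser's behaviour that attributes placed
           in front of an operator apply to every atomic query below it
           (the terms are illustrative only). Under that assumption, both
           queries search for the two terms in the title index:
           <screen>
            Z> find @attr 1=4 @and information retrieval
            Z> find @and @attr 1=4 information @attr 1=4 retrieval
           </screen>
          </para>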
 298
 299     </sect3>
 300
 301     <sect3 id="querymodel-use-string">
 302      <title>Zebra's special use attribute type 1 of form 'string'</title>
 303      <para>
 304       The numeric <literal>use (type 1)</literal> attribute is usually
 305       referred to from a given
 306       attribute set. In addition, Zebra lets you use
 307       <emphasis>any internal index
 308        name defined in your configuration</emphasis>
 309       as use attribute value. This is a useful feature for
 310       debugging, and when you do
 311       not need the complexity of defined use attribute values. It is
 312       the preferred way of accessing Zebra indexes directly.
 313      </para>
 314      <para>
 315       Finding all documents which have the term list "information
 316       retrieval" in a Zebra index, using its internal full string name:
 317       <screen>
 318        Z> find @attr 1=sometext "information retrieval"
 319       </screen>
 320      </para>
 321      <para>
 322       Searching the bib-1 use attribute 54 using its string name:
 323       <screen>
 324        Z> find @attr 1=Code-language eng
 325       </screen>
 326      </para>
 327      <para>
 328       Searching in any silly string index - if it's defined in your
 329       indexation rules and can be parsed by the PQF parser.
 330       This is definitely not the recommended use of
 331       this facility, as it might confuse your users with some very
 332       unexpected results.
 333       <screen>
 334        Z> find @attr 1=silly/xpath/alike[@index]/name "information retrieval"
 335       </screen>
 336      </para>
 337      <para>
 338       See <xref linkend="querymodel-bib1-mapping"/> for details, and
 339       <xref linkend="server-sru"/>
 340       for the SRU PQF query extension using string names as a fast
 341       debugging facility.
 342      </para>
 343     </sect3>
 344
 345     <sect3 id="querymodel-use-xpath">
 346      <title>Zebra's special use attribute type 1 of form 'XPath'
 347       for GRS filters</title>
 348      <para>
 349       As we have seen above, it is possible (albeit seldom a great
 350       idea) to emulate
 351       <ulink url="http://www.w3.org/TR/xpath">XPath 1.0</ulink> based
 352       search by defining <literal>use (type 1)</literal>
 353       <emphasis>string</emphasis> attributes which in appearance
 354       <emphasis>resemble XPath queries</emphasis>. There are two
 355       problems with this approach: first, the XPath look-alike has to
 356       be defined at indexing time, so no new, undefined
 357       XPath queries can be entered at search time; and second, it might
 358       confuse users that an XPath-like index name may in fact
 359       be populated from a possibly entirely different XML element
 360       than it pretends to access.
 361      </para>
 362      <para>
 363       When using the <literal>GRS Record Model</literal>
 364       (see <xref linkend="record-model-grs"/>), we have the
 365       possibility to embed <emphasis>live</emphasis>
 366       XPath expressions
 367       in the PQF queries, which are here called
 368       <literal>use (type 1)</literal> <emphasis>xpath</emphasis>
 369       attributes. You must set the
 370       <literal>xpath enable</literal> directive in your
 371       <literal>.abs</literal> config files.
 372      </para>
 373      <note>
 374       Only a <emphasis>very</emphasis> restricted subset of the
 375       <ulink url="http://www.w3.org/TR/xpath">XPath 1.0</ulink>
 376       standard is supported as the GRS record model is simpler than
 377       a full XML DOM structure. See the following examples for
 378       possibilities.
 379      </note>
 380      <para>
 381       Finding all documents which have the term "content"
 382       inside a text node found in a specific XML DOM
 383       <emphasis>subtree</emphasis>, whose starting element is
 384       addressed by XPath.
 385       <screen>
 386        Z> find @attr 1=/root content
 387        Z> find @attr 1=/root/first content
 388       </screen>
 389       <emphasis>Notice that the
 390        XPath must be absolute, i.e., must start with '/', and that the
 391        XPath <literal>descendant-or-self</literal> axis followed by a
 392        text node selection <literal>text()</literal> is implicitly
 393        appended to the stated XPath.
 394       </emphasis>
 395       It follows that the above searches are interpreted as:
 396       <screen>
 397        Z> find @attr 1=/root//text() content
 398        Z> find @attr 1=/root/first//text() content
 399       </screen>
 400      </para>
 401
 402      <para>
 403       The addressing XPath can be filtered by a predicate working on exact
 404       string values in
 405       attributes (in the XML sense). The following returns all documents which
 406       have the term "english" contained in any text subnode of
 407       the subtree defined by the XPath
 408       <literal>/record/title[@lang='en']</literal>:
 409       <screen>
 410        Z> find @attr 1=/record/title[@lang='en'] english
 411       </screen>
 412      </para>
 413
 414      <para>
 415       Combining numeric indexes, boolean expressions,
 416       and xpath based searches is possible:
 417       <screen>
 418        Z> find @attr 1=/record/title @and foo bar
 419        Z> find @and @attr 1=/record/title foo @attr 1=4 bar
 420       </screen>
 421      </para>
 422      <para>
 423       Escaping PQF keywords and other non-parseable XPath constructs
 424       with <literal>'{ }'</literal> to prevent syntax errors:
 425       <screen>
 426        Z> find @attr {1=/root/first[@attr='danish']} content
 427        Z> find @attr {1=/root/second[@attr='danish lake']} content
 428        Z> find @attr {1=/root/third[@attr='dansk s\xc3\xb8']} content
 429       </screen>
 430      </para>
 431      <warning>
 432       It is worth mentioning that these dynamically performed XPath
 433       queries are a performance bottleneck, as no optimized
 434       specialized indexes can be used. Therefore, avoid the use of
 435       this facility when speed is essential and the database content
 436       size is medium to large.
 437      </warning>
 438     </sect3>
 439
 440    </sect2>
 441
 442    <sect2 id="querymodel-exp1">
 443     <title>Explain Attribute Set</title>
 444     <para>
 445      The Z39.50 standard defines the
 446      <ulink url="&url.z39.50.explain;">Explain</ulink> attribute set
 447      <literal>exp-1</literal>, which is used to discover information
 448      about a server's search semantics and functional capabilities.
 449      Zebra exposes a "classic"
 450      Explain database under the base name <literal>IR-Explain-1</literal>, which
 451      is populated with system internal information.
 452     </para>
 453    <para>
 454      The attribute-set <literal>exp-1</literal> consists of a single
 455      <literal>Use (type 1)</literal> attribute.
 456     </para>
 457     <para>
 458      In addition, the non-Use
 459      <literal>bib-1</literal> attributes, that is, the types
 460      <literal>Relation</literal>, <literal>Position</literal>,
 461      <literal>Structure</literal>, <literal>Truncation</literal>,
 462      and <literal>Completeness</literal> are imported from
 463      the <literal>bib-1</literal> attribute set, and may be used
 464      within any explain query.
 465     </para>
 466
 467     <sect3 id="querymodel-exp1-use">
 468     <title>Use Attributes (type = 1)</title>
 469      <para>
 470       The following Explain search attributes are supported:
 471       <literal>ExplainCategory</literal> (@attr 1=1),
 472       <literal>DatabaseName</literal> (@attr 1=3),
 473       <literal>DateAdded</literal> (@attr 1=9),
 474       <literal>DateChanged</literal> (@attr 1=10).
 475      </para>
 476      <para>
 477       A search in the use attribute  <literal>ExplainCategory</literal>
 478       supports only these predefined values:
 479       <literal>CategoryList</literal>, <literal>TargetInfo</literal>,
 480       <literal>DatabaseInfo</literal>, <literal>AttributeDetails</literal>.
 481      </para>
 482      <para>
 483       See <filename>tab/explain.att</filename> and the
 484       <ulink url="&url.z39.50;">Z39.50</ulink> standard
 485       for more information.
 486      </para>
 487     </sect3>
 488
 489     <sect3>
 490      <title>Explain searches with yaz-client</title>
 491      <para>
 492       Classic Explain only defines retrieval of Explain information
 493       via ASN.1. Practically no Z39.50 clients support this. Fortunately,
 494       they don't have to - Zebra allows retrieval of this information
 495       in other formats:
 496       <literal>SUTRS</literal>, <literal>XML</literal>,
 497       <literal>GRS-1</literal> and  <literal>ASN.1</literal> Explain.
 498      </para>
 499
 500      <para>
 501       List supported categories to find out which explain commands are
 502       supported:
 503       <screen>
 504        Z> base IR-Explain-1
 505        Z> find @attr exp1 1=1 categorylist
 506        Z> form sutrs
 507        Z> show 1+2
 508       </screen>
 509      </para>
 510
 511      <para>
 512       Get target info, that is, investigate which databases exist at
 513       this server endpoint:
 514       <screen>
 515        Z> base IR-Explain-1
 516        Z> find @attr exp1 1=1 targetinfo
 517        Z> form xml
 518        Z> show 1+1
 519        Z> form grs-1
 520        Z> show 1+1
 521        Z> form sutrs
 522        Z> show 1+1
 523       </screen>
 524      </para>
 525
 526      <para>
 527       List all supported databases. The number of hits
 528       is the number of databases found, which most commonly are the
 529       following two:
 530       the <literal>Default</literal> and the
 531       <literal>IR-Explain-1</literal> databases.
 532       <screen>
 533        Z> base IR-Explain-1
 534        Z> find @attr exp1 1=1 databaseinfo
 535        Z> form sutrs
 536        Z> show 1+2
 537       </screen>
 538      </para>
 539
 540      <para>
 541       Get database info record for database <literal>Default</literal>.
 542       <screen>
 543        Z> base IR-Explain-1
 544        Z> find @and @attr exp1 1=1 databaseinfo @attr exp1 1=3 Default
 545       </screen>
 546       Identical query with explicitly specified attribute set:
 547       <screen>
 548        Z> base IR-Explain-1
 549        Z> find @attrset exp1 @and @attr 1=1 databaseinfo @attr 1=3 Default
 550       </screen>
 551      </para>
 552
 553      <para>
 554       Get attribute details record for database
 555       <literal>Default</literal>.
 556       This query is very useful to study the internal Zebra indexes.
 557       If records have been indexed using the <literal>alvis</literal>
 558       XSLT filter, the string representation names of the known indexes can be
 559       found.
 560       <screen>
 561        Z> base IR-Explain-1
 562        Z> find @and @attr exp1 1=1 attributedetails @attr exp1 1=3 Default
 563       </screen>
 564       Identical query with explicitly specified attribute set:
 565       <screen>
 566        Z> base IR-Explain-1
 567        Z> find @attrset exp1 @and @attr 1=1 attributedetails @attr 1=3 Default
 568       </screen>
 569      </para>
 570     </sect3>
 571
 572    </sect2>
 573
 574    <sect2 id="querymodel-bib1">
 575     <title>Bib1 Attribute Set</title>
 576     <para>
 577      Something about querying to be written ..
 578     </para>
 579     <para>
 580      Most of the information contained in this section is an excerpt of
 581      the <literal>ATTRIBUTE SET BIB-1 (Z39.50-1995)
 582       SEMANTICS</literal>,
 583      found at <ulink url="&url.z39.50.attset.bib1.1995;">The BIB-1
 584       Attribute Set Semantics</ulink> from 1995, and also in an updated
 585      <ulink url="&url.z39.50.attset.bib1;">Bib-1
 586       Attribute Set</ulink>
 587      version from 2003. Index Data is not the copyright holder of this
 588      information.
 589     </para>
 590
 591
 592    <sect3 id="querymodel-bib1-use">
 593      <title>Use Attributes (type = 1)</title>
 594
 595     <para>
 596      Phrase search for <emphasis>information retrieval</emphasis> in
 597      the title-register:
 598      <screen>
 599       Z> find @attr 1=4 "information retrieval"
 600      </screen>
 601     </para>
 602
 603     <para>
 604      See also <xref linkend="querymodel-use-string"/> and
 605      <xref linkend="querymodel-use-xpath"/> for
 606      alternative access to the Zebra internal index names and XPath queries.
 607     </para>
 608     </sect3>
 609
 610
 611     <sect3 id="querymodel-bib1-relation">
 612      <title>Relation Attributes (type = 2)</title>
 613        <para>
 614      Supported operations: = (default, if omitted), &lt;, &gt;, &lt;=, &gt;=.
 615      Unsupported: Not equal.
 616
 617      The following relation attributes are also supported: relevance (102).
 618      <!-- always-matches (103) not supported for all indexes -->
 619
 620      All operations are based on a lexicographical ordering,
 621      <emphasis>except</emphasis> in the case of the
 622      following structure attributes: numeric(109).
 623     </para>
 624
 625     <para>
 626      Ranked search for <emphasis>information retrieval</emphasis> in
 627      the title-register
 628      (see <xref linkend="administration-ranking"/> for the gory details):
 629      <screen>
 630       Z> find @attr 1=4 @attr 2=102 "information retrieval"
 631      </screen>
 632     </para>
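         <para>
          As a further sketch, assuming the standard Bib1 relation
          values (1 for less than, 2 for less than or equal, 3 for
          equal, 4 for greater than or equal, and 5 for greater than)
          and an index on the date use attribute 30; the term
          <emphasis>1990</emphasis> is illustrative only. Without the
          numeric structure attribute the comparison is lexicographical:
          <screen>
           Z> find @attr 1=30 @attr 2=2 1990
           Z> find @attr 1=30 @attr 2=4 1990
          </screen>
         </para>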
 633     </sect3>
 634
 635     <sect3 id="querymodel-bib1-position">
 636      <title>Position Attributes (type = 3)</title>
 637      <para>
 638       Only the value any position(3) is supported. First in field(1)
 639       and first in subfield(2) are unsupported, but using them
 640       does not trigger an error.
 641       <!-- It should -->
 642       </para>
 643     </sect3>
 644
 645     <sect3 id="querymodel-bib1-structure">
 646      <title>Structure Attributes (type = 4)</title>
 647      <!-- See tab/default.idx -->
 648
 649     <para>
 650      For example, in
 651      the GILS schema (<literal>gils.abs</literal>), the
 652      west-bounding-coordinate is indexed as type <literal>n</literal>,
 653      and is therefore searched by specifying
 654      <emphasis>structure</emphasis>=<emphasis>Numeric String</emphasis>.
 655      To match all those records with west-bounding-coordinate greater
 656      than -114 we use the following query:
 657      <screen>
 658       Z> find @attr 4=109 @attr 2=5 @attr gils 1=2038 -114
 659      </screen>
 660     </para>
 661     </sect3>
 662
 663     <sect3 id="querymodel-bib1-truncation">
 664      <title>Truncation Attributes (type = 5)</title>
 665      <para>
 666       Supported are: No truncation(100), which is the default,
 667       Right truncation(1), Left truncation(2),
 668       Left&amp;Right truncation(3),
 669       Process <literal>#</literal> in term(101), which maps
 670       each # to <literal>.*</literal>,
 671       Regexp-1(102) (normal regular expression), and Regexp-2(103) (regular expression with fuzzy matching).
 672       <!--
 673       Special 104, 105, 106 are deprecated and will be removed! -->
 674       </para>
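          <para>
           The truncation values could be used as follows. This is a
           sketch only, which assumes a title index (use attribute 4),
           illustrative terms, and the Bib1 value 101 for
           <literal>#</literal> processing. In order: right truncation,
           left truncation, left and right truncation,
           <literal>#</literal> processing, and a regular expression:
           <screen>
            Z> find @attr 1=4 @attr 5=1 inform
            Z> find @attr 1=4 @attr 5=2 ation
            Z> find @attr 1=4 @attr 5=3 format
            Z> find @attr 1=4 @attr 5=101 inf#tion
            Z> find @attr 1=4 @attr 5=102 inform.*
           </screen>
          </para>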
 675     </sect3>
 676
 677     <sect3 id="querymodel-bib1-completeness">
 678     <title>Completeness Attributes (type = 6)</title>
 679      <para>
 680       This attribute is ONLY used if structure w or p is to be
 681       chosen. Completeness is ignored if neither w nor p is to be
 682       used.
 683       Incomplete field(1) is the default and makes Zebra use
 684       register type w.
 685       Complete subfield(2) and complete field(3) both trigger
 686       search field type p.
 687      </para>
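          <para>
           A sketch of the difference, assuming that both a word (w) and
           a phrase (p) register are configured for the title index; the
           term is illustrative only. Under that assumption the first
           query would use register type w, while the second would use
           register type p and so should match on the complete field:
           <screen>
            Z> find @attr 1=4 @attr 6=1 "information retrieval"
            Z> find @attr 1=4 @attr 6=3 "information retrieval"
           </screen>
          </para>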
 688     </sect3>
 689    </sect2>
 690
 691
 692    <sect2 id="querymodel-zebra-attr-search">
 693     <title>Zebra specific Search Extensions to all Attribute Sets</title>
 694     <para>
 695      Zebra extends the Bib1 attribute types, and these extensions are
 696      recognized regardless of the attribute
 697      set used in a <literal>search</literal> operation query.
 698     </para>
 699
 700      <table id="querymodel-zebra-attr-search-table">
 701       <caption>Zebra Search Attribute Extensions</caption>
 702        <thead>
 703         <tr>
 704          <td><emphasis>Name and Type</emphasis></td>
 705          <td>Operation</td>
 706          <td>Zebra version</td>
 707         </tr>
 708       </thead>
 709        <tbody>
 710         <tr>
 711          <td><emphasis>Embedded Sort (type 7)</emphasis></td>
 712          <td>search</td>
 713          <td>1.1</td>
 714         </tr>
 715         <tr>
 716          <td><emphasis>Term Set (type 8)</emphasis></td>
 717          <td>search</td>
 718          <td>1.1</td>
 719         </tr>
 720         <tr>
 721          <td><emphasis>Rank weight  (type 9)</emphasis></td>
 722          <td>search</td>
 723          <td>1.1</td>
 724         </tr>
 725         <tr>
 726          <td><emphasis>Approx Limit (type 9)</emphasis></td>
 727          <td>search</td>
 728          <td>1.4</td>
 729         </tr>
 730         <tr>
 731          <td><emphasis>Term Reference (type 10)</emphasis></td>
 732          <td>search</td>
 733          <td>1.4</td>
 734         </tr>
 735        </tbody>
 736       </table>
 737
 738     <sect3 id="querymodel-zebra-attr-sorting">
 739      <title>Zebra Extension Embedded Sort Attribute (type 7)</title>
 740     <para>
 741      The embedded sort is a way to specify sorting within a query - thus
 742      removing the need to send a Sort Request separately. It is
 743      faster and does not require clients to deal with the Sort
 744      Facility.
 745     </para>
 746     <para>
 747      The possible values after attribute <literal>type 7</literal> are
 748      <literal>1</literal> ascending and
 749      <literal>2</literal> descending.
 750      The attributes+term (APT) node is separate from the
 751      rest and must be <literal>@or</literal>'ed.
 752      The term associated with APT is the sorting level in integers,
 753      where <literal>0</literal> means primary sort,
 754      <literal>1</literal> means secondary sort, and so forth.
 755      See also <xref linkend="administration-ranking"/>.
 756     </para>
 757     <para>
 758      For example, searching for water, sorted by title (ascending):
 759      <screen>
 760       Z> find @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
 761      </screen>
 762     </para>
 763     <para>
 764      Or, searching for water, sorted by title ascending, then date descending:
 765      <screen>
 766       Z> find @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
 767      </screen>
 768     </para>
 769     </sect3>
 770
 771     <sect3 id="querymodel-zebra-attr-estimation">
 772      <title>Zebra Extension Term Set Attribute (type 8)</title>
 773     <para>
 774      The Term Set feature is a facility that allows a search to store
 775      hitting terms in a "pseudo" result set; thus a search (as usual) +
 776      a scan-like facility. It requires a client that can do named result
 777      sets, since the search generates two result sets. The value for
 778      attribute 8 is the name of a result set (string). The terms in
 779      the named term set are returned as SUTRS records.
 780     </para>
 781     <para>
 782      For example, searching for u in title, right truncated, and
 783      storing the result in a term set named 'aset':
 784      <screen>
 785       Z> find @attr 5=1 @attr 1=4 @attr 8=aset u
 786      </screen>
 787     </para>
 788     <warning>
 789      The model has one serious flaw: we don't know the size of the term
 790      set. This feature is experimental. Do not use in production code.
 791     </warning>
 792     </sect3>
 793
 794     <sect3 id="querymodel-zebra-attr-weight">
 795      <title>Zebra Extension Rank Weight Attribute (type 9)</title>
 796     <para>
 797      Rank weight is a way to pass a value to a ranking algorithm - so
 798      that one APT has one value - while another has a different one.
 799      See also <xref linkend="administration-ranking"/>.
 800     </para>
 801     <para>
 802      For example, searching for utah in title with weight 30 as well
 803      as any with weight 20:
 804      <screen>
 805       Z> find @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
 806      </screen>
 807     </para>
 808     </sect3>
 809
 810     <sect3 id="querymodel-zebra-attr-limit">
 811      <title>Zebra Extension Approximative Limit Attribute (type 9)</title>
 812     </sect3>
 813     <para>
 814      Newer Zebra versions normally estimate the hit count for every APT
 815      (leaf) in the query tree. These hit counts are returned as part of
 816      the searchResult-1 facility in the binary encoded Z39.50 search
 817      response packages.
 818     </para>
 819     <para>
 820      By setting a limit for the APT we can make Zebra switch to an
 821      approximate hit count once a certain hit count limit is
 822      reached. A value of zero means exact hit count.
 823     </para>
 824     <para>
 825      For example, we might be interested in the exact hit count for a, but
 826      for b we allow hit count estimates for 1000 and higher:
 827      <screen>
 828       Z> find @and a @attr 9=1000 b
 829      </screen>
 830     </para>
 831     <note>
 832      The estimated hit count facility makes searches faster, as only
 833      part of a large hit list needs to be processed.
 834     </note>
 835     <warning>
 836      This facility clashes with rank weight, because ranking requires
 837      all documents in the hit lists to be examined for scoring and
 838      re-sorting.
 839      It is an experimental
 840      extension. Do not use in production code.
 841     </warning>
 842
 843     <sect3 id="querymodel-zebra-attr-termref">
 844      <title>Zebra Extension Term Reference Attribute (type 10)</title>
 845     </sect3>
 846     <para>
 847      Zebra supports the searchResult-1 facility. If attribute 10 is
 848      given, it specifies a subqueryId value returned as part of the
 849      search result. It is a way for a client to name an APT part of a
 850      query.
 851     </para>
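          <para>
           As a sketch only: assuming the attribute value is passed through as
           the subqueryId, a client could label the two leaves of a query as
           follows. The labels <literal>t1</literal> and <literal>t2</literal>
           are hypothetical names chosen for illustration.
           <screen>
            Z> find @and @attr 10=t1 @attr 1=4 water @attr 10=t2 @attr 1=4 soil
           </screen>
          </para>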
 852     <!--
 853     <para>
 854      <screen>
 855      </screen>
 856     </para>
 857     -->
 858     <warning>
 859      Experimental. Do not use in production code.
 860     </warning>
 861
 862
 863    </sect2>
 864
 865
 866    <sect2 id="querymodel-zebra-attr-scan">
 867     <title>Zebra-specific Scan Extensions to all Attribute Sets</title>
 868     <para>
 869      Zebra extends the Bib1 attribute types, and these extensions are
 870      recognized regardless of the attribute
 871      set used in a <literal>scan</literal> operation query.
 872     </para>
 873      <table id="querymodel-zebra-attr-scan-table">
 874       <caption>Zebra Scan Attribute Extensions</caption>
 875        <thead>
 876         <tr>
 877          <td><emphasis>Name and Type</emphasis></td>
 878          <td>Operation</td>
 879          <td>Zebra version</td>
 880         </tr>
 881       </thead>
 882        <tbody>
 883         <tr>
 884          <td><emphasis>Result Set Narrow (type 8)</emphasis></td>
 885          <td>scan</td>
 886          <td>1.3</td>
 887         </tr>
 888         <tr>
 889          <td><emphasis>Approximative Limit (type 9)</emphasis></td>
 890          <td>scan</td>
 891          <td>1.4</td>
 892         </tr>
 893        </tbody>
 894       </table>
 895
 896     <sect3 id="querymodel-zebra-attr-xyz">
 897      <title>Zebra Extension Result Set Narrow (type 8)</title>
 898     </sect3>
 899     <para>
 900      If attribute 8 is given for scan, the value is the name of a
 901      result set. Each hit count in scan is @and'ed with the result set
 902      given.
 903     </para>
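          <para>
           A possible session, given as a sketch: a search first creates a
           result set, and the following scan is narrowed by it. This assumes
           that the client named the result set of the first search
           <literal>1</literal>; the search terms are illustrative only.
           <screen>
            Z> find @attr 1=4 amadeus
            Z> scan @attr 1=4 @attr 8=1 mozart
           </screen>
          </para>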
 904     <!--
 905     <para>
 906      <screen>
 907      </screen>
 908     </para>
 909     -->
 910     <warning>
 911      Experimental and buggy. Definitely not to be used in production code.
 912     </warning>
 913
 914     <sect3 id="querymodel-zebra-attr-xyz">
 915      <title>Zebra Extension Approximative Limit (type 9)</title>
 916     </sect3>
 917     <para>
 918      As for search, the approximative limit is a way to enable
 919      approximate hit counts in scan results.
 920     </para>
 921     <!--
 922     <para>
 923      <screen>
 924      </screen>
 925     </para>
 926     -->
 927     <warning>
 928      Experimental. Do not use in production code.
 929     </warning>
 930
 931
 932    </sect2>
 933
 934
 935    <sect2 id="querymodel-bib1-mapping">
 936     <title>Mapping from Bib1 Attributes to Zebra internal
 937      register indexes</title>
 938     <para>
 939      TO-DO
 940      </para>
 941
 942
 943      <!-- see in util/zebramap.c
 944       int zebra_maps_attr
 945
 946   if (completeness_value == 2 || completeness_value == 3)
 947         *complete_flag = 1;
 948     else
 949         *complete_flag = 0;
 950     *reg_id = 0;
 951
 952     *sort_flag =(sort_relation_value > 0) ? 1 : 0;
 953     *search_type = "phrase";
 954     strcpy(rank_type, "void");
 955     if (relation_value == 102)
 956     {
 957         if (weight_value == -1)
 958             weight_value = 34;
 959         sprintf(rank_type, "rank,w=%d,u=%d", weight_value, use_value);
 960     }
 961     if (relation_value == 103)
 962     {
 963         *search_type = "always";
 964         *reg_id = 'w';
 965         return 0;
 966     }
 967     if (*complete_flag)
 968         *reg_id = 'p';
 969     else
 970         *reg_id = 'w';
 971     switch (structure_value)
 972     {
 973     case 6:   /* word list */
 974         *search_type = "and-list";
 975         break;
 976     case 105: /* free-form-text */
 977         *search_type = "or-list";
 978         break;
 979     case 106: /* document-text */
 980         *search_type = "or-list";
 981         break;
 982     case -1:
 983     case 1:   /* phrase */
 984     case 2:   /* word */
 985     case 108: /* string */
 986         *search_type = "phrase";
 987         break;
 988    case 107: /* local-number */
 989         *search_type = "local";
 990         *reg_id = 0;
 991         break;
 992     case 109: /* numeric string */
 993         *reg_id = 'n';
 994         *search_type = "numeric";
 995         break;
 996     case 104: /* urx */
 997         *reg_id = 'u';
 998         *search_type = "phrase";
 999         break;
1000     case 3:   /* key */
1001         *reg_id = '0';
1002         *search_type = "phrase";
1003         break;
1004     case 4:  /* year */
1005         *reg_id = 'y';
1006         *search_type = "phrase";
1007         break;
1008     case 5:  /* date */
1009         *reg_id = 'd';
1010         *search_type = "phrase";
1011         break;
1012     default:
1013         return -1;
1014     }
1015     return 0;
1016
1017      -->
1018
1019
1020     <para>
1021      <emphasis>Use</emphasis> attributes are interpreted according to the
1022      attribute sets which have been loaded in the
1023     <literal>zebra.cfg</literal> file, and are matched against specific
1024      fields as specified in the <literal>.abs</literal> file which
1025      describes the profile of the records which have been loaded.
1026      If no Use attribute is provided, a default of Bib-1 Any is assumed.
1027     </para>
1028
1029     <para>
1030      If a <emphasis>Structure</emphasis> attribute of
1031      <emphasis>Phrase</emphasis> is used in conjunction with a
1032      <emphasis>Completeness</emphasis> attribute of
1033      <emphasis>Complete (Sub)field</emphasis>, the term is matched
1034      against the contents of the phrase (long word) register, if one
1035      exists for the given <emphasis>Use</emphasis> attribute.
1036      A phrase register is created for those fields in the
1037      <literal>.abs</literal> file that contain a
1038      <literal>p</literal>-specifier.
1039      <!-- ### whatever the hell _that_ is -->
1040     </para>
1041
1042     <para>
1043      If <emphasis>Structure</emphasis>=<emphasis>Phrase</emphasis> is
1044      used in conjunction with <emphasis>Incomplete Field</emphasis> (the
1045      default value for <emphasis>Completeness</emphasis>), the
1046      search is directed against the normal word registers, but if the term
1047      contains multiple words, the term will only match if all of the words
1048      are found immediately adjacent, and in the given order.
1049      The word search is performed on those fields that are indexed as
1050      type <literal>w</literal> in the <literal>.abs</literal> file.
1051     </para>
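          <para>
           To illustrate the two phrase cases above, here is a sketch using the
           standard Bib-1 values Structure <literal>4=1</literal> (Phrase),
           Completeness <literal>6=3</literal> (Complete Field) and
           <literal>6=1</literal> (Incomplete Subfield). It assumes a title
           field that is indexed in both the phrase and the word register.
           The first query is matched against the phrase register, the second
           against the word register with the words required to be adjacent:
           <screen>
            Z> find @attr 1=4 @attr 4=1 @attr 6=3 "information retrieval"
            Z> find @attr 1=4 @attr 4=1 @attr 6=1 "information retrieval"
           </screen>
          </para>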
1052
1053     <para>
1054      If the <emphasis>Structure</emphasis> attribute is
1055      <emphasis>Word List</emphasis>,
1056      <emphasis>Free-form Text</emphasis>, or
1057      <emphasis>Document Text</emphasis>, the term is treated as a
1058      natural-language, relevance-ranked query.
1059      This search type uses the word register, i.e. those fields
1060      that are indexed as type <literal>w</literal> in the
1061      <literal>.abs</literal> file.
1062     </para>
1063
1064     <para>
1065      If the <emphasis>Structure</emphasis> attribute is
1066      <emphasis>Numeric String</emphasis> the term is treated as an integer.
1067      The search is performed on those fields that are indexed
1068      as type <literal>n</literal> in the <literal>.abs</literal> file.
1069     </para>
1070
1071     <para>
1072      If the <emphasis>Structure</emphasis> attribute is
1073      <emphasis>URx</emphasis> the term is treated as a URX (URL) entity.
1074      The search is performed on those fields that are indexed as type
1075      <literal>u</literal> in the <literal>.abs</literal> file.
1076     </para>
1077
1078     <para>
1079      If the <emphasis>Structure</emphasis> attribute is
1080      <emphasis>Local Number</emphasis> the term is treated as a
1081      native Zebra Record Identifier.
1082     </para>
1083
1084     <para>
1085      If the <emphasis>Relation</emphasis> attribute is
1086      <emphasis>Equals</emphasis> (default), the term is matched
1087      in a normal fashion (modulo truncation and processing of
1088      individual words, if required).
1089      If <emphasis>Relation</emphasis> is <emphasis>Less Than</emphasis>,
1090      <emphasis>Less Than or Equal</emphasis>,
1091      <emphasis>Greater than</emphasis>, or <emphasis>Greater than or
1092       Equal</emphasis>, the term is assumed to be numerical, and a
1093      standard regular expression is constructed to match the given
1094      expression.
1095      If <emphasis>Relation</emphasis> is <emphasis>Relevance</emphasis>,
1096      the standard natural-language query processor is invoked.
1097     </para>
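          <para>
           As a sketch of a numerical relation search, the following query uses
           the standard Bib-1 Relation value <literal>2=4</literal> (Greater
           than or Equal) together with Structure <literal>4=109</literal>
           (Numeric String). It assumes that the date field (Use attribute 30)
           is indexed as type <literal>n</literal> in the
           <literal>.abs</literal> file; the term is illustrative only.
           <screen>
            Z> find @attr 1=30 @attr 4=109 @attr 2=4 1990
           </screen>
          </para>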
1098
1099     <para>
1100      For the <emphasis>Truncation</emphasis> attribute,
1101      <emphasis>No Truncation</emphasis> is the default.
1102      <emphasis>Left Truncation</emphasis> is not supported.
1103      <emphasis>Process # in search term</emphasis> is supported, as is
1104      <emphasis>Regxp-1</emphasis>.
1105      <emphasis>Regxp-2</emphasis> enables the fault-tolerant (fuzzy)
1106      search. As a default, a single error (deletion, insertion,
1107      replacement) is accepted when terms are matched against the register
1108      contents.
1109     </para>
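          <para>
           For illustration, the following sketch shows a right-truncated
           search (<literal>5=1</literal>) and a search with
           <emphasis>Process # in search term</emphasis>
           (<literal>5=101</literal>), both in the title field. The attribute
           values are the standard Bib-1 ones; the terms are made up.
           <screen>
            Z> find @attr 1=4 @attr 5=1 inform
            Z> find @attr 1=4 @attr 5=101 "inform#tion"
           </screen>
          </para>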
1110    </sect2>
1111
1112    <sect2  id="querymodel-regular">
1113     <title>Zebra Regular Expressions in Truncation Attribute (type = 5)</title>
1114
1115     <para>
1116      Each term in a query is interpreted as a regular expression if
1117      the truncation value is either <emphasis>Regxp-1 (@attr 5=102)</emphasis>
1118      or <emphasis>Regxp-2 (@attr 5=103)</emphasis>.
1119      Both query types follow the same syntax with the operands:
1120     </para>
1121
1122      <table id="querymodel-regular-operands-table">
1123       <caption>Regular Expression Operands</caption>
1124        <!--
1125        <thead>
1126        <tr><td>one</td><td>two</td></tr>
1127       </thead>
1128        -->
1129        <tbody>
1130         <tr>
1131          <td><emphasis>x</emphasis></td>
1132          <td>Matches the character <emphasis>x</emphasis>.</td>
1133         </tr>
1134         <tr>
1135          <td><emphasis>.</emphasis></td>
1136          <td>Matches any character.</td>
1137         </tr>
1138         <tr>
1139          <td><emphasis>[ .. ]</emphasis></td>
1140          <td>Matches the set of characters specified;
1141          such as <literal>[abc]</literal> or <literal>[a-c]</literal>.</td>
1142         </tr>
1143        </tbody>
1144       </table>
1145
1146     <para>
1147      The above operands can be combined with the following operators:
1148     </para>
1149
1150
1151      <table id="querymodel-regular-operators-table">
1152       <caption>Regular Expression Operators</caption>
1153        <!--
1154        <thead>
1155        <tr><td>one</td><td>two</td></tr>
1156       </thead>
1157        -->
1158        <tbody>
1159         <tr>
1160          <td><emphasis>x*</emphasis></td>
1161          <td>Matches <emphasis>x</emphasis> zero or more times.
1162           Priority: high.</td>
1163         </tr>
1164         <tr>
1165          <td><emphasis>x+</emphasis></td>
1166          <td>Matches <emphasis>x</emphasis> one or more times.
1167           Priority: high.</td>
1168         </tr>
1169         <tr>
1170          <td><emphasis>x?</emphasis></td>
1171          <td> Matches <emphasis>x</emphasis> zero or one time.
1172           Priority: high.</td>
1173         </tr>
1174         <tr>
1175          <td><emphasis>xy</emphasis></td>
1176          <td> Matches <emphasis>x</emphasis>, then <emphasis>y</emphasis>.
1177          Priority: medium.</td>
1178         </tr>
1179         <tr>
1180          <td><emphasis>x|y</emphasis></td>
1181          <td> Matches either <emphasis>x</emphasis> or <emphasis>y</emphasis>.
1182          Priority: low.</td>
1183         </tr>
1184         <tr>
1185          <td><emphasis>( )</emphasis></td>
1186          <td>The order of evaluation may be changed by using parentheses.</td>
1187         </tr>
1188        </tbody>
1189       </table>
1190
1191     <para>
1192      If the first character of the <emphasis>Regxp-2</emphasis> query
1193      is a plus character (<literal>+</literal>) it marks the
1194      beginning of a section with non-standard specifiers.
1195      The next plus character marks the end of the section.
1196      Currently Zebra only supports one specifier, the error tolerance,
1197      which consists of one digit.
1198     </para>
1199
1200     <para>
1201      Since the plus operator is normally a suffix operator, this addition to
1202      the query syntax doesn't violate the syntax for standard regular
1203      expressions.
1204     </para>
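          <para>
           Following the description above, a <emphasis>Regxp-2</emphasis>
           query that raises the error tolerance to two might be written as in
           this sketch. The term and the tolerance value are chosen for
           illustration only.
           <screen>
            Z> find @attr 1=4 @attr 5=103 "+2+informat"
           </screen>
          </para>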
1205
1206     <para>
1207      For example, a phrase search with regular expressions in
1208      the title-register is performed like this:
1209      <screen>
1210       Z> find @attr 1=4 @attr 5=102 "informat.* retrieval"
1211      </screen>
1212     </para>
1213
1214     <para>
1215      Combinations with other attributes are possible. For example, a
1216      ranked search with a regular expression
1217      (see <xref linkend="administration-ranking"/> for the gory details):
1218      <screen>
1219       Z> find @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval"
1220      </screen>
1221     </para>
1222    </sect2>
1223
1224
1225    <!--
1226    <para>
1227     The RecordType parameter in the <literal>zebra.cfg</literal> file, or
1228     the <literal>-t</literal> option to the indexer tells Zebra how to
1229     process input records.
1230     Two basic types of processing are available - raw text and structured
1231     data. Raw text is just that, and it is selected by providing the
1232     argument <emphasis>text</emphasis> to Zebra. Structured records are
1233     all handled internally using the basic mechanisms described in the
1234     subsequent sections.
1235     Zebra can read structured records in many different formats.
1236    </para>
1237    -->
1238   </sect1>
1239
1240
1241   <sect1 id="querymodel-cql-to-pqf">
1242    <title>Server Side CQL to PQF Query Translation</title>
1243    <para>
1244     Using the
1245     <literal>&lt;cql2rpn&gt;l2rpn.txt&lt;/cql2rpn&gt;</literal>
1246       YAZ Frontend Virtual
1247     Hosts option, one can configure
1248     the YAZ Frontend CQL-to-PQF
1249     converter, specifying the interpretation of various
1250     <ulink url="&url.cql;">CQL</ulink>
1251     indexes, relations, etc. in terms of Type-1 query attributes.
1252     <!-- The  yaz-client config file -->
1253    </para>
1254    <para>
1255     For example, using server-side CQL-to-PQF conversion, one might
1256     query a Zebra server like this:
1257     <screen>
1258     <![CDATA[
1259      yaz-client localhost:9999
1260      Z> querytype cql
1261      Z> find text=(plant and soil)
1262      ]]>
1263     </screen>
1264      and - if properly configured - even static relevance ranking can
1265      be performed using CQL query syntax:
1266     <screen>
1267     <![CDATA[
1268      Z> find text = /relevant (plant and soil)
1269      ]]>
1270      </screen>
1271    </para>
1272
1273    <para>
1274     The same configuration can be used to
1275     search using client-side CQL-to-PQF conversion.
1276     The only differences are <literal>querytype cql2rpn</literal>
1277     instead of
1278     <literal>querytype cql</literal>, and the <literal>-q</literal> option
1279     specifying a local conversion file:
1280     <screen>
1281     <![CDATA[
1282      yaz-client -q local/cql2pqf.txt localhost:9999
1283      Z> querytype cql2rpn
1284      Z> find text=(plant and soil)
1285      ]]>
1286      </screen>
1287    </para>
1288
1289    <para>
1290     Exhaustive information can be found in the
1291     section "Specification of CQL to RPN mappings" in the YAZ manual,
1292     <ulink url="http://www.indexdata.dk/yaz/doc/tools.tkl#tools.cql.map">
1293      http://www.indexdata.dk/yaz/doc/tools.tkl#tools.cql.map</ulink>,
1294    and is therefore not repeated here.
1295    </para>
1296   <!--
1297   <para>
1298     See
1299       <ulink url="http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html">
1300       http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html</ulink>
1301     for the Maintenance Agency's work-in-progress mapping of Dublin Core
1302     indexes to Attribute Architecture (util, XD and BIB-2)
1303     attributes.
1304    </para>
1305    -->
1306  </sect1>
1307
1308
1309
1310 </chapter>
1311
1312  <!-- Keep this comment at the end of the file
1313  Local variables:
1314  mode: sgml
1315  sgml-omittag:t
1316  sgml-shorttag:t
1317  sgml-minimize-attributes:nil
1318  sgml-always-quote-attributes:t
1319  sgml-indent-step:1
1320  sgml-indent-data:t
1321  sgml-parent-document: "zebra.xml"
1322  sgml-local-catalogs: nil
1323  sgml-namecase-general:t
1324  End:
1325  -->