doc/recordmodel.xml

   1  <chapter id="record-model">
   2   <!-- $Id: recordmodel.xml,v 1.5 2002-08-28 09:39:45 mike Exp $ -->
   3   <title>The Record Model</title>
   4
   5   <para>
   6    The Zebra system is designed to support a wide range of data management
   7    applications. The system can be configured to handle virtually any
   8    kind of structured data. Each record in the system is associated with
   9    a <emphasis>record schema</emphasis> which lends context to the data
  10    elements of the record.
  11    Any number of record schemas can coexist in the system.
  12    Although it may be wise to use only a single schema within
  13    one database, the system poses no such restrictions.
  14   </para>
  15
  16   <para>
  17    The record model described in this chapter applies to the fundamental,
  18    structured
  19    record type <literal>grs</literal> as introduced in
  20    <xref linkend="record-types"/>.
  21    FIXME - Need to describe the simple string-tag model, or at least
  22    refer to it here. -H
  23   </para>
  24
  25   <para>
  26    Records pass through three different states during processing in the
  27    system.
  28   </para>
  29
  30   <para>
  31
  32    <itemizedlist>
  33     <listitem>
  34
  35      <para>
  36       When records are accessed by the system, they are represented
  37       in their local, or native format. This might be SGML or HTML files,
  38       News or Mail archives, or MARC records. If the system doesn't already
  39       know how to read the type of data you need to store, you can set up an
  40       input filter by preparing conversion rules based on regular
  41       expressions and possibly augmented by a flexible scripting language
  42       (Tcl).
  43       The input filter produces as output an internal representation:
  44
  45      </para>
  46     </listitem>
  47     <listitem>
  48
  49      <para>
  50       When records are processed by the system, they are represented
  51       as a tree structure, built from tagged data elements hanging off a
  52       root node. The tagged elements may contain data or yet more tagged
  53       elements in a recursive structure. The system performs various
  54       actions on this tree structure (indexing, element selection, schema
  55       mapping, etc.).
  56
  57      </para>
  58     </listitem>
  59     <listitem>
  60
  61      <para>
  62       Before records are transmitted to the client, they are
  63       converted from the internal structure to a form suitable for exchange
  64       over the network, according to the Z39.50 standard.
  65      </para>
  66     </listitem>
  67
  68    </itemizedlist>
  69
  70   </para>
  71
  72   <sect1 id="local-representation">
  73    <title>Local Representation</title>
  74
  75    <para>
  76     As mentioned earlier, Zebra places few restrictions on the type of
  77     data that you can index and manage. Generally, whatever the form of
  78     the data, it is parsed by an input filter specific to that format, and
  79     turned into an internal structure that Zebra knows how to handle. This
  80     process takes place whenever the record is accessed - for indexing and
  81     retrieval.
  82    </para>
  83
  84    <para>
  85     The RecordType parameter in the <literal>zebra.cfg</literal> file, or
  86     the <literal>-t</literal> option to the indexer tells Zebra how to
  87     process input records.
  88     Two basic types of processing are available - raw text and structured
  89     data. Raw text is just that, and it is selected by providing the
  90     argument <emphasis>text</emphasis> to Zebra. Structured records are
  91     all handled internally using the basic mechanisms described in the
  92     subsequent sections.
  93     Zebra can read structured records in many different formats.
  94     How this is done is governed by additional parameters after the
  95     "grs" keyword, separated by "." characters.
  96    </para>
  97
  98    <para>
  99     Five basic subtypes of the <emphasis>grs</emphasis> type are
 100     currently available:
 101    </para>
 102
 103    <para>
 104     <variablelist>
 105      <varlistentry>
 106       <term>grs.sgml</term>
 107       <listitem>
 108        <para>
 109         This is the canonical input format &mdash;
 110         described below. It is a simple SGML-like syntax.
 111        </para>
 112       </listitem>
 113      </varlistentry>
 114      <varlistentry>
 115       <term>grs.regx.<emphasis>filter</emphasis></term>
 116       <listitem>
 117        <para>
 118         This enables a user-supplied input
 119         filter. The mechanisms of these filters are described below.
 120        </para>
 121       </listitem>
 122      </varlistentry>
 123      <varlistentry>
 124       <term>grs.tcl.<emphasis>filter</emphasis></term>
 125       <listitem>
 126        <para>
 127         Similar to grs.regx but using Tcl for rules.
 128        </para>
 129       </listitem>
 130      </varlistentry>
 131      <varlistentry>
 132       <term>grs.marc.<emphasis>abstract syntax</emphasis></term>
 133       <listitem>
 134        <para>
 135         This allows Zebra to read
 136         records in the ISO2709 (MARC) encoding standard. In this case, the
 137         last parameter <emphasis>abstract syntax</emphasis> names the
 138         <literal>.abs</literal> file (see below)
 139         which describes the specific MARC structure of the input record as
 140         well as the indexing rules.
 141        </para>
 142       </listitem>
 143      </varlistentry>
 144      <varlistentry>
 145       <term>grs.xml</term>
 146       <listitem>
 147        <para>
 148         This filter reads XML records. Only one record per file
 149         is supported. The filter is only available if Zebra/YAZ
 150         is compiled with EXPAT support.
 151        </para>
 152       </listitem>
 153      </varlistentry>
 154
 155     </variablelist>
 156    </para>
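        <para>
         As a minimal sketch, assuming an indexer binary named
         <literal>zebraidx</literal> and a placeholder directory called
         <literal>records</literal>, the canonical format could be selected
         in <literal>zebra.cfg</literal> like this:
        </para>

        <para>

         <screen>
          recordType: grs.sgml
         </screen>

        </para>

        <para>
         or, equivalently, with the <literal>-t</literal> option when the
         indexer is run:
        </para>

        <para>

         <screen>
          zebraidx -t grs.sgml update records
         </screen>

        </para>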
 157
 158    <sect2>
 159     <title>Canonical Input Format</title>
 160
 161     <para>
 162      Although input data can take any form, it is sometimes useful to
 163      describe the record processing capabilities of the system in terms of
 164      a single, canonical input format that gives access to the full
 165      spectrum of structure and flexibility in the system. In Zebra, this
 166      canonical format is an "SGML-like" syntax.
 167     </para>
 168
 169     <para>
 170      To use the canonical format specify <literal>grs.sgml</literal> as
 171      the record type.
 172     </para>
 173
 174     <para>
 175      Consider a record describing an information resource (such a record is
 176      sometimes known as a <emphasis>locator record</emphasis>).
 177      It might contain a field describing the distributor of the
 178      information resource, which might in turn be partitioned into
 179      various fields providing details about the distributor, like this:
 180     </para>
 181
 182     <para>
 183
 184      <screen>
 185       &#60;Distributor&#62;
 186        &#60;Name&#62; USGS/WRD &#60;/Name&#62;
 187        &#60;Organization&#62; USGS/WRD &#60;/Organization&#62;
 188        &#60;Street-Address&#62;
 189         U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
 190        &#60;/Street-Address&#62;
 191        &#60;City&#62; ALBUQUERQUE &#60;/City&#62;
 192        &#60;State&#62; NM &#60;/State&#62;
 193        &#60;Zip-Code&#62; 87102 &#60;/Zip-Code&#62;
 194        &#60;Country&#62; USA &#60;/Country&#62;
 195        &#60;Telephone&#62; (505) 766-5560 &#60;/Telephone&#62;
 196       &#60;/Distributor&#62;
 197      </screen>
 198
 199     </para>
 200
 201     <note>
 202      <para>
 203       The indentation used above is used to illustrate how Zebra
 204       interprets the mark-up. The indentation, in itself, has no
 205       significance to the parser for the canonical input format, which
 206       discards superfluous whitespace.
 207      </para>
 208     </note>
 209     <para>
 210      The keywords surrounded by &lt;...&gt; are
 211      <emphasis>tags</emphasis>, while the sections of text
 212      in between are the <emphasis>data elements</emphasis>.
 213      A data element is characterized by its location in the tree
 214      that is made up by the nested elements.
 215      Each element is terminated by a closing tag, beginning
 216      with <literal>&#60;/</literal> and containing the same symbolic
 217      tag-name as the corresponding opening tag.
 218      The general closing tag, <literal>&#60;/&#62;</literal>,
 219      terminates the element started by the last opening tag. The
 220      structuring of elements is significant.
 221      The element <emphasis>Telephone</emphasis>,
 222      for instance, may be indexed and presented to the client differently,
 223      depending on whether it appears inside the
 224      <emphasis>Distributor</emphasis> element, or some other,
 225      structured data element such as a <emphasis>Supplier</emphasis> element.
 226     </para>
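         <para>
          For illustration, the following fragment (a shortened, hypothetical
          version of the record above) closes the two inner elements with the
          general closing tag and the outer element with its named closing
          tag:
         </para>

         <para>

          <screen>
           &#60;Distributor&#62;
            &#60;Name&#62; USGS/WRD &#60;/&#62;
            &#60;City&#62; ALBUQUERQUE &#60;/&#62;
           &#60;/Distributor&#62;
          </screen>

         </para>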
 227
 228     <sect3>
 229      <title>Record Root</title>
 230
 231      <para>
 232       The first tag in a record describes the root node of the tree that
 233       makes up the total record. In the canonical input format, the root tag
 234       should contain the name of the schema that lends context to the
 235       elements of the record
 236       (see <xref linkend="internal-representation"/>).
 237       The following is a GILS record that
 238       contains only a single element. Strictly speaking, that makes it an
 239       illegal GILS record, since the GILS profile includes several mandatory
 240       elements. Zebra does not validate the contents of a record against
 241       the Z39.50 profile, however; it merely attempts to match up elements
 242       of a local representation with the given schema:
 243      </para>
 244
 245      <para>
 246
 247       <screen>
 248        &#60;gils&#62;
 249        &#60;title&#62;Zen and the Art of Motorcycle Maintenance&#60;/title&#62;
 250        &#60;/gils&#62;
 251       </screen>
 252
 253      </para>
 254
 255     </sect3>
 256
 257     <sect3>
 258      <title>Variants</title>
 259
 260      <para>
 261       Zebra allows you to provide individual data elements in a number of
 262       <emphasis>variant forms</emphasis>. Examples of variant forms are
 263       textual data elements which might appear in different languages, and
 264       images which may appear in different formats or layouts.
 265       The variant system in Zebra is essentially a representation of
 266       the variant mechanism of Z39.50-1995.
 267      </para>
 268
 269      <para>
 270       The following is an example of a title element which occurs in two
 271       different languages.
 272      </para>
 273
 274      <para>
 275
 276       <screen>
 277        &#60;title&#62;
 278        &#60;var lang lang "eng"&#62;
 279        Zen and the Art of Motorcycle Maintenance&#60;/&#62;
 280        &#60;var lang lang "dan"&#62;
 281        Zen og Kunsten at Vedligeholde en Motorcykel&#60;/&#62;
 282        &#60;/title&#62;
 283       </screen>
 284
 285      </para>
 286
 287      <para>
 288       The syntax of the <emphasis>variant element</emphasis> is
 289       <literal>&lt;var class type value&gt;</literal>.
 290       The available values for the <emphasis>class</emphasis> and
 291       <emphasis>type</emphasis> fields are given by the variant set
 292       that is associated with the current schema
 293       (see <xref linkend="variant-set"/>).
 294      </para>
 295
 296      <para>
 297       Variant elements are terminated by the general end-tag &#60;/&#62;, by
 298       the variant end-tag &#60;/var&#62;, by the appearance of another variant
 299       tag with the same <emphasis>class</emphasis> and
 300       <emphasis>value</emphasis> settings, or by the
 301       appearance of another, normal tag. In other words, the end-tags for
 302       the variants used in the example above could have been omitted.
 303      </para>
 304
 305      <para>
 306       Variant elements can be nested. The element
 307      </para>
 308
 309      <para>
 310
 311       <screen>
 312        &#60;title&#62;
 313        &#60;var lang lang "eng"&#62;&#60;var body iana "text/plain"&#62;
 314        Zen and the Art of Motorcycle Maintenance
 315        &#60;/title&#62;
 316       </screen>
 317
 318      </para>
 319
 320      <para>
 321       associates two variant components with the variant list for the title
 322       element.
 323      </para>
 324
 325      <para>
 326       Given the nesting rules described above, we could write
 327      </para>
 328
 329      <para>
 330
 331       <screen>
 332        &#60;title&#62;
 333        &#60;var body iana "text/plain"&#62;
 334        &#60;var lang lang "eng"&#62;
 335        Zen and the Art of Motorcycle Maintenance
 336        &#60;var lang lang "dan"&#62;
 337        Zen og Kunsten at Vedligeholde en Motorcykel
 338        &#60;/title&#62;
 339       </screen>
 340
 341      </para>
 342
 343      <para>
 344       The title element above comes in two variants. Both have the IANA body
 345       type "text/plain", but one is in English, and the other in
 346       Danish. The client, using the element selection mechanism of Z39.50,
 347       can retrieve information about the available variant forms of data
 348       elements, or it can select specific variants based on the requirements
 349       of the end-user.
 350      </para>
 351
 352     </sect3>
 353
 354    </sect2>
 355
 356    <sect2>
 357     <title>Input Filters</title>
 358
 359     <para>
 360      In order to handle general input formats, Zebra allows the
 361      operator to define filters which read individual records in their
 362      native format and produce an internal representation that the system
 363      can work with.
 364     </para>
 365
 366     <para>
 367      Input filters are ASCII files, generally with the suffix
 368      <literal>.flt</literal>.
 369      The system looks for the files in the directories given in the
 370      <emphasis>profilePath</emphasis> setting in the
 371      <literal>zebra.cfg</literal> file.
 372      The record type for the filter is
 373      <literal>grs.regx.</literal><emphasis>filter-filename</emphasis>
 374      (fundamental type <literal>grs</literal>, file read
 375      type <literal>regx</literal>, argument
 376      <emphasis>filter-filename</emphasis>).
 377     </para>
 378
 379     <para>
 380      Generally, an input filter consists of a sequence of rules, where each
 381      rule consists of a sequence of expressions, followed by an action. The
 382      expressions are evaluated against the contents of the input record,
 383      and the actions normally contribute to the generation of an internal
 384      representation of the record.
 385     </para>
 386
 387     <para>
 388      An expression can be any of the following:
 389     </para>
 390
 391     <para>
 392      <variablelist>
 393
 394       <varlistentry>
 395        <term>INIT</term>
 396        <listitem>
 397         <para>
 398          The action associated with this expression is evaluated
 399          exactly once in the lifetime of the application, before any records
 400          are read. It can be used in conjunction with an action that
 401          initializes tables or other resources that are used in the processing
 402          of input records.
 403         </para>
 404        </listitem>
 405       </varlistentry>
 406       <varlistentry>
 407        <term>BEGIN</term>
 408        <listitem>
 409         <para>
 410          Matches the beginning of the record. It can be used to
 411          initialize variables, etc. Typically, the
 412          <emphasis>BEGIN</emphasis> rule is also used
 413          to establish the root node of the record.
 414         </para>
 415        </listitem>
 416       </varlistentry>
 417       <varlistentry>
 418        <term>END</term>
 419        <listitem>
 420         <para>
 421          Matches the end of the record, when all of the contents
 422          of the record have been processed.
 423         </para>
 424        </listitem>
 425       </varlistentry>
 426       <varlistentry>
 427        <term>/pattern/</term>
 428        <listitem>
 429         <para>
 430          Matches a string of characters from the input record.
 431         </para>
 432        </listitem>
 433       </varlistentry>
 434       <varlistentry>
 435        <term>BODY</term>
 436        <listitem>
 437         <para>
 438          This keyword may only be used between two patterns.
 439          It matches everything between (not including) those patterns.
 440         </para>
 441        </listitem>
 442       </varlistentry>
 443       <varlistentry>
 444        <term>FINISH</term>
 445        <listitem>
 446         <para>
 447          The action associated with this expression is evaluated
 448          once, before the application terminates. It can be used to release
 449          system resources, typically ones allocated in the
 450          <emphasis>INIT</emphasis> step.
 451         </para>
 452        </listitem>
 453       </varlistentry>
 454      </variablelist>
 455     </para>
 456
 457     <para>
 458      An action is surrounded by curly braces (&lcub;...&rcub;), and
 459      consists of a sequence of statements. Statements may be separated
 460      by newlines or semicolons (;).
 461      Within actions, the strings that matched the expressions
 462      immediately preceding the action can be referred to as
 463      &dollar;0, &dollar;1, &dollar;2, etc.
 464     </para>
 465
 466     <para>
 467      The available statements are:
 468     </para>
 469
 470     <para>
 471      <variablelist>
 472
 473       <varlistentry>
 474        <term>begin <emphasis>type &lsqb;parameter ... &rsqb;</emphasis></term>
 475        <listitem>
 476         <para>
 477          Begin a new
 478          data element. The type is one of the following:
 479          <variablelist>
 480
 481           <varlistentry>
 482            <term>record</term>
 483            <listitem>
 484             <para>
 485              Begin a new record. The following parameter should be the
 486              name of the schema that describes the structure of the record, e.g.
 487              <literal>gils</literal> or <literal>wais</literal> (see below).
 488              The <literal>begin record</literal> call should precede
 489              any other use of the <emphasis>begin</emphasis> statement.
 490             </para>
 491            </listitem>
 492           </varlistentry>
 493           <varlistentry>
 494            <term>element</term>
 495            <listitem>
 496             <para>
 497              Begin a new tagged element. The parameter is the
 498              name of the tag. If the tag is not matched anywhere in the tagsets
 499              referenced by the current schema, it is treated as a local string
 500              tag.
 501             </para>
 502            </listitem>
 503           </varlistentry>
 504           <varlistentry>
 505            <term>variant</term>
 506            <listitem>
 507             <para>
 508              Begin a new node in a variant tree. The parameters are
 509              <emphasis>class type value</emphasis>.
 510             </para>
 511            </listitem>
 512           </varlistentry>
 513          </variablelist>
 514         </para>
 515        </listitem>
 516       </varlistentry>
 517       <varlistentry>
 518        <term>data</term>
 519        <listitem>
 520         <para>
 521          Create a data element. The concatenated arguments make
 522          up the value of the data element.
 523          The option <literal>-text</literal> signals that
 524          the layout (whitespace) of the data should be retained for
 525          transmission.
 526          The option <literal>-element</literal>
 527          <emphasis>tag</emphasis> wraps the data up in
 528          the <emphasis>tag</emphasis>.
 529          The use of the <literal>-element</literal> option is equivalent to
 530          preceding the command with a <emphasis>begin
 531           element</emphasis> command, and following
 532          it with the <emphasis>end</emphasis> command.
 533         </para>
 534        </listitem>
 535       </varlistentry>
 536       <varlistentry>
 537        <term>end <emphasis>&lsqb;type&rsqb;</emphasis></term>
 538        <listitem>
 539         <para>
 540          Close a tagged element. If no parameter is given,
 541          the last element on the stack is terminated.
 542          The first parameter, if any, is a type name, similar
 543          to the <emphasis>begin</emphasis> statement.
 544          For the <emphasis>element</emphasis> type, a tag
 545          name can be provided to terminate a specific tag.
 546         </para>
 547        </listitem>
 548       </varlistentry>
 549      </variablelist>
 550     </para>
 551
 552     <para>
 553      The following input filter reads a Usenet news file, producing a
 554      record in the WAIS schema. Note that the body of a news posting is
 555      separated from the list of headers by a blank line (or rather a
 556      sequence of two newline characters).
 557     </para>
 558
 559     <para>
 560
 561      <screen>
 562       BEGIN                { begin record wais }
 563
 564       /^From:/ BODY /$/    { data -element name $1 }
 565       /^Subject:/ BODY /$/ { data -element title $1 }
 566       /^Date:/ BODY /$/    { data -element lastModified $1 }
 567       /\n\n/ BODY END      {
 568         begin element bodyOfDisplay
 569         begin variant body iana "text/plain"
 570         data -text $1
 571         end record
 572       }
 573      </screen>
 574
 575     </para>
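         <para>
          As noted in the description of the <emphasis>data</emphasis>
          statement, the <literal>-element</literal> option is shorthand for an
          explicit <emphasis>begin element</emphasis> and
          <emphasis>end</emphasis> pair. As a sketch under that description
          (not a tested filter), the Subject rule from the filter above could
          presumably be written out in full like this:
         </para>

         <para>

          <screen>
           /^Subject:/ BODY /$/ {
             begin element title
             data $1
             end element
           }
          </screen>

         </para>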
 576
 577     <para>
 578      If Zebra is compiled with support for Tcl (Tool Command Language)
 579      enabled, the statements described above are supplemented with a complete
 580      scripting environment, including control structures (conditional
 581      expressions and loop constructs), and powerful string manipulation
 582      mechanisms for modifying the elements of a record. Tcl is a popular
 583      scripting environment, with several tutorials available both online
 584      and in hardcopy.
 585     </para>
 586
 587    </sect2>
 588
 589   </sect1>
 590
 591   <sect1 id="internal-representation">
 592    <title>Internal Representation</title>
 593
 594    <para>
 595     When records are manipulated by the system, they're represented in a
 596     tree-structure, with data elements at the leaf nodes, and tags or
 597     variant components at the non-leaf nodes. The root-node identifies the
 598     schema that lends context to the tagging and structuring of the
 599     record. Imagine a simple record, consisting of a 'title' element and
 600     an 'author' element:
 601    </para>
 602
 603    <para>
 604
 605     <screen>
 606            TITLE     "Zen and the Art of Motorcycle Maintenance"
 607      ROOT
 608            AUTHOR    "Robert Pirsig"
 609     </screen>
 610
 611    </para>
 612
 613    <para>
 614     A slightly more complex record would have the author element consist
 615     of two elements, a surname and a first name:
 616    </para>
 617
 618    <para>
 619
 620     <screen>
 621            TITLE     "Zen and the Art of Motorcycle Maintenance"
 622      ROOT
 623                      FIRST-NAME "Robert"
 624            AUTHOR
 625                      SURNAME    "Pirsig"
 626     </screen>
 627
 628    </para>
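        <para>
         In the canonical input format, a record with this structure might be
         written as follows. The schema name <literal>book</literal> and the
         tag names are hypothetical, chosen only to mirror the tree above; the
         surname is reached from the root by way of the author element:
        </para>

        <para>

         <screen>
          &#60;book&#62;
           &#60;title&#62;Zen and the Art of Motorcycle Maintenance&#60;/title&#62;
           &#60;author&#62;
            &#60;first-name&#62;Robert&#60;/first-name&#62;
            &#60;surname&#62;Pirsig&#60;/surname&#62;
           &#60;/author&#62;
          &#60;/book&#62;
         </screen>

        </para>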
 629
 630    <para>
 631     The root of the record will refer to the record schema that describes
 632     the structuring of this particular record. The schema defines the
 633     element tags (TITLE, FIRST-NAME, etc.) that may occur in the record, as
 634     well as the structuring (SURNAME should appear below AUTHOR, etc.). In
 635     addition, the schema establishes element set names that are used by
 636     the client to request a subset of the elements of a given record. The
 637     schema may also establish rules for converting the record to a
 638     different schema, by stating, for each element, a mapping to a
 639     different tag path.
 640    </para>
 641
 642    <sect2>
 643     <title>Tagged Elements</title>
 644
 645     <para>
 646      A data element is characterized by its tag, and its position in the
 647      structure of the record. For instance, while the tag "telephone
 648      number" may be used in different places in a record, we may need to
 649      distinguish between these occurrences, both for searching and
 650      presentation purposes. Although the phone numbers for the
 651      "customer" and the "service provider" are both
 652      instances of the same type of resource (a telephone number), it
 653      is essential that they be kept separate. The record schema provides
 654      the structure of the record, and names each data element (defined by
 655      the sequence of tags - the tag path - by which the element can be
 656      reached from the root of the record).
 657     </para>
 658
 659    </sect2>
 660
 661    <sect2>
 662     <title>Variants</title>
 663
 664     <para>
 665      The children of a tag node may be either more tag nodes, a data node
 666      (possibly accompanied by tag nodes),
 667      or a tree of variant nodes. The children of  variant nodes are either
 668      more variant nodes or a data node (possibly accompanied by more
 669      variant nodes). Each leaf node, which is normally a
 670      data node, corresponds to a <emphasis>variant form</emphasis> of the
 671      tagged element identified by the tag which parents the variant tree.
 672      The following title element occurs in two different languages:
 673     </para>
 674
 675     <para>
 676
 677      <screen>
 678             VARIANT LANG=ENG  "War and Peace"
 679       TITLE
 680             VARIANT LANG=DAN  "Krig og Fred"
 681      </screen>
 682
 683     </para>
 684
 685     <para>
 686      Which of the two elements are transmitted to the client by the server
 687      depends on the specifications provided by the client, if any.
 688     </para>
 689
 690     <para>
 691      In practice, each variant node is associated with a triple of class,
 692      type, value, corresponding to the variant mechanism of Z39.50.
 693     </para>
 694
 695    </sect2>
 696
 697    <sect2>
 698     <title>Data Elements</title>
 699
 700     <para>
 701      Data nodes have no children (they are always leaf nodes in the record
 702      tree).
 703     </para>
 704
 705     <note>
 706      <para>
 707       FIXME! Documentation needs extension here about types of nodes - numerical,
 708       textual, etc., plus the various types of inclusion notes.
 709      </para>
 710     </note>
 711
 712    </sect2>
 713
 714   </sect1>
 715
 716   <sect1 id="data-model">
 717    <title>Configuring Your Data Model</title>
 718
 719    <para>
 720     The following sections describe the configuration files that govern
 721     the internal management of data records. The system searches for the files
 722     in the directories specified by the <emphasis>profilePath</emphasis>
 723     setting in the <literal>zebra.cfg</literal> file.
 724    </para>
 725
 726    <sect2>
 727     <title>The Abstract Syntax</title>
 728
 729     <para>
 730      The abstract syntax definition (also known as an Abstract Record
 731      Structure, or ARS) is the focal point of the
 732      record schema description. For a given schema, the ABS file may state any
 733      or all of the following:
 734     </para>
 735
 736     <para>
 737      FIXME - Need a diagram here, or a simple explanation how it all hangs together -H
 738     </para>
 739
 740     <para>
 741
 742      <itemizedlist>
 743       <listitem>
 744
 745        <para>
 746         The object identifier of the Z39.50 schema associated
 747         with the ARS, so that it can be referred to by the client.
 748        </para>
 749       </listitem>
 750
 751       <listitem>
 752        <para>
 753         The attribute set (which can possibly be a compound of multiple
 754         sets) which applies in the profile. This is used when indexing and
 755         searching the records belonging to the given profile.
 756        </para>
 757       </listitem>
 758
 759       <listitem>
 760        <para>
 761         The Tag set (again, this can consist of several different sets).
 762         This is used when reading the records from a file, to recognize the
 763         different tags, and when transmitting the record to the client -
 764         mapping the tags to their numerical representation, if they are
 765         known.
 766        </para>
 767       </listitem>
 768
 769       <listitem>
 770        <para>
 771         The variant set which is used in the profile. This provides a
 772         vocabulary for specifying the <emphasis>forms</emphasis> of
 773         data that appear inside the records.
 774        </para>
 775       </listitem>
 776
 777       <listitem>
 778        <para>
 779         Element set names, which are a shorthand way for the client to
 780         ask for a subset of the data elements contained in a record. Element
 781         set names, in the retrieval module, are mapped to <emphasis>element
 782          specifications</emphasis>, which contain information equivalent to the
 783         <emphasis>Espec-1</emphasis> syntax of Z39.50.
 784        </para>
 785       </listitem>
 786
 787       <listitem>
 788        <para>
 789         Map tables, which may specify mappings to
 790         <emphasis>other</emphasis> database profiles, if desired.
 791        </para>
 792       </listitem>
 793
 794       <listitem>
 795        <para>
 796         Possibly, a set of rules describing the mapping of elements to a
 797         MARC representation.
 798
 799        </para>
 800       </listitem>
 801
 802       <listitem>
 803        <para>
 804         A list of element descriptions (this is the actual ARS of the
 805         schema, in Z39.50 terms), which lists the ways in which the various
 806         tags can be used and organized hierarchically.
 807        </para>
 808       </listitem>
 809
 810      </itemizedlist>
 811
 812     </para>
 813
 814     <para>
 815      Several of the entries above simply refer to other files, which
 816      describe the given objects.
 817     </para>
 818
 819    </sect2>
 820
 821    <sect2>
 822     <title>The Configuration Files</title>
 823
 824     <para>
 825      This section describes the syntax and use of the various tables which
 826      are used by the retrieval module.
 827     </para>
 828
 829     <para>
 830      The number of different file types may appear daunting at first, but
 831      each type corresponds fairly clearly to a single aspect of the Z39.50
 832      retrieval facilities. Further, the average database administrator,
 833      who is simply reusing an existing profile for which tables already
 834      exist, shouldn't have to worry too much about the contents of these tables.
 835     </para>
 836
 837     <para>
 838      Generally, the files are simple ASCII files, which can be maintained
 839      using any text editor. Blank lines, and lines beginning with a (&num;) are
 840      ignored. Any characters on a line following a (&num;) are also ignored.
 841      All other lines contain <emphasis>directives</emphasis>, which provide
 842      some setting or value to the system.
 843      Generally, settings are characterized by a single
 844      keyword, identifying the setting, followed by a number of parameters.
 845      Some settings are repeatable (r), while others may occur only once in a
 846      file. Some settings are optional (o), while others again are
 847      mandatory (m).
 848     </para>
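
   <para>
    As a minimal sketch of these conventions, the fragment below combines
    a comment line, a blank line, and two directives borrowed from the
    GILS profile excerpt shown later in this chapter. The comment text
    is illustrative only.
   </para>

   <para>

    <screen>
     # This whole line is a comment and is ignored

     name gils              # the rest of this line is ignored
     reference GILS-schema
    </screen>

   </para>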
 849
 850    </sect2>
 851
 852    <sect2>
 853     <title>The Abstract Syntax (.abs) Files</title>
 854
 855     <para>
 856      The name of this file type is slightly misleading in Z39.50 terms,
 857      since, apart from the actual abstract syntax of the profile, it also
 858      includes most of the other definitions that go into a database
 859      profile.
 860     </para>
 861
 862     <para>
 863      When a record in the canonical, SGML-like format is read from a file
 864      or from the database, the first tag of the file should reference the
 865      profile that governs the layout of the record. If the first tag of the
 866      record is, say, <literal>&lt;gils&gt;</literal>, the system will look
 867      for the profile definition in the file <literal>gils.abs</literal>.
 868      Profile definitions are cached, so they only have to be read once
 869      during the lifespan of the current process.
 870     </para>
 871
 872     <para>
 873      When writing your own input filters, the
 874      <emphasis>record-begin</emphasis> command
 875      introduces the profile, and should always be called first thing when
 876      introducing a new record.
 877     </para>
 878
 879     <para>
 880      The file may contain the following directives:
 881     </para>
 882
 883     <para>
 884      <variablelist>
 885
 886       <varlistentry>
 887        <term>name <emphasis>symbolic-name</emphasis></term>
 888        <listitem>
 889         <para>
 890          (m) This provides a shorthand name or
 891          description for the profile. Mostly useful for diagnostic purposes.
 892         </para>
 893        </listitem>
 894       </varlistentry>
 895       <varlistentry>
 896        <term>reference <emphasis>OID-name</emphasis></term>
 897        <listitem>
 898         <para>
 899          (m) The reference name of the OID for the profile.
 900          The reference names can be found in the <emphasis>util</emphasis>
 901          module of <emphasis>YAZ</emphasis>.
 902         </para>
 903        </listitem>
 904       </varlistentry>
 905       <varlistentry>
 906        <term>attset <emphasis>filename</emphasis></term>
 907        <listitem>
 908         <para>
 909          (m) The attribute set that is used for
 910          indexing and searching records belonging to this profile.
 911         </para>
 912        </listitem>
 913       </varlistentry>
 914       <varlistentry>
 915        <term>tagset <emphasis>filename</emphasis></term>
 916        <listitem>
 917         <para>
        (o) The tag set (if any) that describes
        the fields of the records.
 920         </para>
 921        </listitem>
 922       </varlistentry>
 923       <varlistentry>
 924        <term>varset <emphasis>filename</emphasis></term>
 925        <listitem>
 926         <para>
 927          (o) The variant set used in the profile.
 928         </para>
 929        </listitem>
 930       </varlistentry>
 931       <varlistentry>
 932        <term>maptab <emphasis>filename</emphasis></term>
 933        <listitem>
 934         <para>
 935          (o,r) This points to a
 936          conversion table that might be used if the client asks for the record
 937          in a different schema from the native one.
 938         </para>
 939        </listitem></varlistentry>
 940       <varlistentry>
 941        <term>marc <emphasis>filename</emphasis></term>
 942        <listitem>
 943         <para>
 944          (o) Points to a file containing parameters
 945          for representing the record contents in the ISO2709 syntax. Read the
 946          description of the MARC representation facility below.
 947         </para>
 948        </listitem></varlistentry>
 949       <varlistentry>
 950        <term>esetname <emphasis>name filename</emphasis></term>
 951        <listitem>
 952         <para>
 953          (o,r) Associates the
 954          given element set name with an element selection file. If an (@) is
 955          given in place of the filename, this corresponds to a null mapping for
 956          the given element set name.
 957         </para>
 958        </listitem></varlistentry>
 959       <varlistentry>
 960        <term>any <emphasis>tags</emphasis></term>
 961        <listitem>
 962         <para>
 963          (o) This directive specifies a list of attributes
 964          which should be appended to the attribute list given for each
 965          element. The effect is to make every single element in the abstract
 966          syntax searchable by way of the given attributes. This directive
 967          provides an efficient way of supporting free-text searching across all
 968          elements. However, it does increase the size of the index
 969          significantly. The attributes can be qualified with a structure, as in
 970          the <emphasis>elm</emphasis> directive below.
 971         </para>
 972        </listitem></varlistentry>
 973       <varlistentry>
 974        <term>elm <emphasis>path name attributes</emphasis></term>
 975        <listitem>
 976         <para>
 977          (o,r) Adds an element to the abstract record syntax of the schema.
 978          The <emphasis>path</emphasis> follows the
 979          syntax which is suggested by the Z39.50 document - that is, a sequence
 980          of tags separated by slashes (/). Each tag is given as a
        comma-separated pair of tag type and tag value, surrounded by parentheses.
 982          The <emphasis>name</emphasis> is the name of the element, and
 983          the <emphasis>attributes</emphasis>
 984          specifies which attributes to use when indexing the element in a
 985          comma-separated list.
 986          A ! in place of the attribute name is equivalent to
 987          specifying an attribute name identical to the element name.
 988          A - in place of the attribute name
 989          specifies that no indexing is to take place for the given element.
 990          The attributes can be qualified with <emphasis>field
 991           types</emphasis> to specify which
 992          character set should govern the indexing procedure for that field.
 993          The same data element may be indexed into several different
 994          fields, using different character set definitions.
        See <xref linkend="field-structure-and-character-sets"/>.
 996          The default field type is "w" for <emphasis>word</emphasis>.
 997         </para>
 998        </listitem></varlistentry>
 999       <varlistentry>
1000        <term>encoding <emphasis>encodingname</emphasis></term>
1001        <listitem>
1002         <para>
1003          This directive specifies character encoding for external records.
        For records such as XML, which specify their encoding within the
        file via a header, this directive is ignored.
        If neither this directive is given nor an encoding is set
        within the external records, ISO-8859-1 encoding is assumed.
1008          </para>
1009        </listitem>
1010       </varlistentry>
1011       <varlistentry>
1012        <term>xpath <emphasis>enable/disable</emphasis></term>
1013        <listitem>
1014         <para>
1015          If this directive is followed by <literal>enable</literal>,
1016          then extra indexing is performed to allow for XPath-like queries.
1017          If this directive is not specified - equivalent to
1018          <literal>disable</literal> - no extra XPath-indexing is performed.
1019         </para>
1020        </listitem>
1021       </varlistentry>
1022      </variablelist>
1023     </para>
1024
1025     <note>
1026      <para>
1027       The mechanism for controlling indexing is not adequate for
1028       complex databases, and will probably be moved into a separate
1029       configuration table eventually.
1030      </para>
1031     </note>
1032
1033     <para>
1034      The following is an excerpt from the abstract syntax file for the GILS
1035      profile.
1036     </para>
1037
1038     <para>
1039
1040      <screen>
1041       name gils
1042       reference GILS-schema
1043       attset gils.att
1044       tagset gils.tag
1045       varset var1.var
1046
1047       maptab gils-usmarc.map
1048
1049       # Element set names
1050
1051       esetname VARIANT gils-variant.est  # for WAIS-compliance
1052       esetname B gils-b.est
1053       esetname G gils-g.est
1054       esetname F @
1055
1056       elm (1,10)              rank                        -
1057       elm (1,12)              url                         -
1058       elm (1,14)              localControlNumber     Local-number
1059       elm (1,16)              dateOfLastModification Date/time-last-modified
1060       elm (2,1)               title                       w:!,p:!
1061       elm (4,1)               controlIdentifier      Identifier-standard
1062       elm (2,6)               abstract               Abstract
1063       elm (4,51)              purpose                     !
1064       elm (4,52)              originator                  -
1065       elm (4,53)              accessConstraints           !
1066       elm (4,54)              useConstraints              !
1067       elm (4,70)              availability                -
1068       elm (4,70)/(4,90)       distributor                 -
1069       elm (4,70)/(4,90)/(2,7) distributorName             !
     elm (4,70)/(4,90)/(2,10) distributorOrganization    !
1071       elm (4,70)/(4,90)/(4,2) distributorStreetAddress    !
1072       elm (4,70)/(4,90)/(4,3) distributorCity             !
1073      </screen>
1074
1075     </para>
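
   <para>
    In the excerpt above, the <literal>title</literal> element carries the
    attribute list <literal>w:!,p:!</literal>: the element is indexed twice
    under an attribute named after the element itself, once with field
    type <literal>w</literal> and once with field type <literal>p</literal>.
    The <literal>encoding</literal> and <literal>xpath</literal> directives
    do not appear in the excerpt. Assuming they are written like the other
    directives, with the keyword first and one directive per line, a profile
    could add lines such as the following (a sketch, not taken from the
    distributed GILS file):
   </para>

   <para>

    <screen>
     encoding ISO-8859-1
     xpath enable
    </screen>

   </para>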
1076
1077    </sect2>
1078
1079    <sect2 id="attset-files">
1080     <title>The Attribute Set (.att) Files</title>
1081
1082     <para>
1083      This file type describes the <emphasis>Use</emphasis> elements of
1084      an attribute set.
1085      It contains the following directives.
1086     </para>
1087
1088     <para>
1089      <variablelist>
1090       <varlistentry>
1091        <term>name <emphasis>symbolic-name</emphasis></term>
1092        <listitem>
1093         <para>
1094          (m) This provides a shorthand name or
1095          description for the attribute set.
1096          Mostly useful for diagnostic purposes.
1097         </para>
1098        </listitem></varlistentry>
1099       <varlistentry>
1100        <term>reference <emphasis>OID-name</emphasis></term>
1101        <listitem>
1102         <para>
1103          (m) The reference name of the OID for
1104          the attribute set.
1105          The reference names can be found in the <emphasis>util</emphasis>
1106          module of <emphasis>YAZ</emphasis>.
1107         </para>
1108        </listitem></varlistentry>
1109       <varlistentry>
1110        <term>include <emphasis>filename</emphasis></term>
1111        <listitem>
1112         <para>
1113          (o,r) This directive is used to
1114          include another attribute set as a part of the current one. This is
1115          used when a new attribute set is defined as an extension to another
1116          set. For instance, many new attribute sets are defined as extensions
1117          to the <emphasis>bib-1</emphasis> set.
1118          This is an important feature of the retrieval
1119          system of Z39.50, as it ensures the highest possible level of
1120          interoperability, as those access points of your database which are
1121          derived from the external set (say, bib-1) can be used even by clients
1122          who are unaware of the new set.
1123         </para>
1124        </listitem></varlistentry>
1125       <varlistentry>
1126        <term>att
1127         <emphasis>att-value att-name &lsqb;local-value&rsqb;</emphasis></term>
1128        <listitem>
1129         <para>
1130          (o,r) This
1131          repeatable directive introduces a new attribute to the set. The
1132          attribute value is stored in the index (unless a
1133          <emphasis>local-value</emphasis> is
1134          given, in which case this is stored). The name is used to refer to the
1135          attribute from the <emphasis>abstract syntax</emphasis>.
1136         </para>
1137        </listitem></varlistentry>
1138      </variablelist>
1139     </para>
1140
1141     <para>
1142      This is an excerpt from the GILS attribute set definition.
1143      Notice how the file describing the <emphasis>bib-1</emphasis>
1144      attribute set is referenced.
1145     </para>
1146
1147     <para>
1148
1149      <screen>
1150       name gils
1151       reference GILS-attset
1152       include bib1.att
1153
1154       att 2001          distributorName
1155       att 2002          indextermsControlled
1156       att 2003          purpose
1157       att 2004          accessConstraints
1158       att 2005          useConstraints
1159      </screen>
1160
1161     </para>
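
   <para>
    None of the entries above uses the optional
    <emphasis>local-value</emphasis>. As a purely hypothetical illustration
    (the attribute name and both numbers are invented for this sketch), the
    following line would introduce attribute 2006 to the set, but store the
    value 1016 in the index in its place:
   </para>

   <para>

    <screen>
     att 2006          exampleLocalAttribute   1016
    </screen>

   </para>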
1162
1163    </sect2>
1164
1165    <sect2>
1166     <title>The Tag Set (.tag) Files</title>
1167
1168     <para>
1169      This file type defines the tagset of the profile, possibly by
1170      referencing other tag sets (most tag sets, for instance, will include
    tagsetG and tagsetM from the Z39.50 specification). The file may
1172      contain the following directives.
1173     </para>
1174
1175     <para>
1176      <variablelist>
1177
1178       <varlistentry>
1179        <term>name <emphasis>symbolic-name</emphasis></term>
1180        <listitem>
1181         <para>
1182          (m) This provides a shorthand name or
1183          description for the tag set. Mostly useful for diagnostic purposes.
1184         </para>
1185        </listitem></varlistentry>
1186       <varlistentry>
1187        <term>reference <emphasis>OID-name</emphasis></term>
1188        <listitem>
1189         <para>
1190          (o) The reference name of the OID for the tag set.
1191          The reference names can be found in the <emphasis>util</emphasis>
1192          module of <emphasis>YAZ</emphasis>.
1193          The directive is optional, since not all tag sets
1194          are registered outside of their schema.
1195         </para>
1196        </listitem></varlistentry>
1197       <varlistentry>
1198        <term>type <emphasis>integer</emphasis></term>
1199        <listitem>
1200         <para>
1201          (m) The type number of the tagset within the schema
1202          profile (note: this specification really should belong to the .abs
1203          file. This will be fixed in a future release).
1204         </para>
1205        </listitem></varlistentry>
1206       <varlistentry>
1207        <term>include <emphasis>filename</emphasis></term>
1208        <listitem>
1209         <para>
1210          (o,r) This directive is used
1211          to include the definitions of other tag sets into the current one.
1212         </para>
1213        </listitem></varlistentry>
1214       <varlistentry>
1215        <term>tag <emphasis>number names type</emphasis></term>
1216        <listitem>
1217         <para>
1218          (o,r) Introduces a new tag to the set.
1219          The <emphasis>number</emphasis> is the tag number as used
1220          in the protocol (there is currently no mechanism for
1221          specifying string tags at this point, but this would be quick
1222          work to add).
1223          The <emphasis>names</emphasis> parameter is a list of names
1224          by which the tag should be recognized in the input file format.
1225          The names should be separated by slashes (/).
1226          The <emphasis>type</emphasis> is the recommended data type of
1227          the tag.
1228          It should be one of the following:
1229
1230          <itemizedlist>
1231           <listitem>
1232            <para>
1233             structured
1234            </para>
1235           </listitem>
1236
1237           <listitem>
1238            <para>
1239             string
1240            </para>
1241           </listitem>
1242
1243           <listitem>
1244            <para>
1245             numeric
1246            </para>
1247           </listitem>
1248
1249           <listitem>
1250            <para>
1251             bool
1252            </para>
1253           </listitem>
1254
1255           <listitem>
1256            <para>
1257             oid
1258            </para>
1259           </listitem>
1260
1261           <listitem>
1262            <para>
1263             generalizedtime
1264            </para>
1265           </listitem>
1266
1267           <listitem>
1268            <para>
1269             intunit
1270            </para>
1271           </listitem>
1272
1273           <listitem>
1274            <para>
1275             int
1276            </para>
1277           </listitem>
1278
1279           <listitem>
1280            <para>
1281             octetstring
1282            </para>
1283           </listitem>
1284
1285           <listitem>
1286            <para>
1287             null
1288            </para>
1289           </listitem>
1290
1291          </itemizedlist>
1292
1293         </para>
1294        </listitem></varlistentry>
1295      </variablelist>
1296     </para>
1297
1298     <para>
1299      The following is an excerpt from the TagsetG definition file.
1300     </para>
1301
1302     <para>
1303      <screen>
1304       name tagsetg
1305       reference TagsetG
1306       type 2
1307
1308       tag       1       title           string
1309       tag       2       author          string
1310       tag       3       publicationPlace string
1311       tag       4       publicationDate string
1312       tag       5       documentId      string
1313       tag       6       abstract        string
1314       tag       7       name            string
1315       tag       8       date            generalizedtime
1316       tag       9       bodyOfDisplay   string
1317       tag       10      organization    string
1318      </screen>
1319     </para>
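
   <para>
    The excerpt gives each tag a single name. A tag that should be
    recognized under several names in the input lists them separated by
    slashes. The alternative names in the sketch below
    (<literal>ti</literal>, <literal>au</literal>) are hypothetical and
    serve only to show the syntax:
   </para>

   <para>
    <screen>
     tag       1       title/ti        string
     tag       2       author/au       string
    </screen>
   </para>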
1320
1321    </sect2>
1322
1323    <sect2 id="variant-set">
1324     <title>The Variant Set (.var) Files</title>
1325
1326     <para>
1327      The variant set file is a straightforward representation of the
1328      variant set definitions associated with the protocol. At present, only
1329      the <emphasis>Variant-1</emphasis> set is known.
1330     </para>
1331
1332     <para>
1333      These are the directives allowed in the file.
1334     </para>
1335
1336     <para>
1337      <variablelist>
1338
1339       <varlistentry>
1340        <term>name <emphasis>symbolic-name</emphasis></term>
1341        <listitem>
1342         <para>
1343          (m) This provides a shorthand name or
1344          description for the variant set. Mostly useful for diagnostic purposes.
1345         </para>
1346        </listitem></varlistentry>
1347       <varlistentry>
1348        <term>reference <emphasis>OID-name</emphasis></term>
1349        <listitem>
1350         <para>
1351          (o) The reference name of the OID for
1352          the variant set, if one is required. The reference names can be found
1353          in the <emphasis>util</emphasis> module of <emphasis>YAZ</emphasis>.
1354         </para>
1355        </listitem></varlistentry>
1356       <varlistentry>
1357        <term>class <emphasis>integer class-name</emphasis></term>
1358        <listitem>
1359         <para>
1360          (m,r) Introduces a new
1361          class to the variant set.
1362         </para>
1363        </listitem></varlistentry>
1364       <varlistentry>
1365        <term>type <emphasis>integer type-name datatype</emphasis></term>
1366        <listitem>
1367         <para>
        (m,r) Adds a
1369          new type to the current class (the one introduced by the most recent
1370          <emphasis>class</emphasis> directive).
1371          The type names belong to the same name space as the one used
1372          in the tag set definition file.
1373         </para>
1374        </listitem></varlistentry>
1375      </variablelist>
1376     </para>
1377
1378     <para>
1379      The following is an excerpt from the file describing the variant set
1380      <emphasis>Variant-1</emphasis>.
1381     </para>
1382
1383     <para>
1384
1385      <screen>
1386       name variant-1
1387       reference Variant-1
1388
1389       class 1 variantId
1390
1391       type      1       variantId               octetstring
1392
1393       class 2 body
1394
1395       type      1       iana                    string
1396       type      2       z39.50                  string
1397       type      3       other                   string
1398      </screen>
1399
1400     </para>
1401
1402    </sect2>
1403
1404    <sect2>
1405     <title>The Element Set (.est) Files</title>
1406
1407     <para>
1408      The element set specification files describe a selection of a subset
1409      of the elements of a database record. The element selection mechanism
1410      is equivalent to the one supplied by the <emphasis>Espec-1</emphasis>
1411      syntax of the Z39.50 specification.
1412      In fact, the internal representation of an element set
1413      specification is identical to the <emphasis>Espec-1</emphasis> structure,
1414      and we'll refer you to the description of that structure for most of
1415      the detailed semantics of the directives below.
1416     </para>
1417
1418     <note>
1419      <para>
1420       Not all of the Espec-1 functionality has been implemented yet.
1421       The fields that are mentioned below all work as expected, unless
     otherwise noted.
1423      </para>
1424     </note>
1425
1426     <para>
1427      The directives available in the element set file are as follows:
1428     </para>
1429
1430     <para>
1431      <variablelist>
1432       <varlistentry>
1433        <term>defaultVariantSetId <emphasis>OID-name</emphasis></term>
1434        <listitem>
1435         <para>
1436          (o) If variants are used in
        the following, this should provide the name of the variant set used
1438          (it's not currently possible to specify a different set in the
1439          individual variant request). In almost all cases (certainly all
1440          profiles known to us), the name
1441          <literal>Variant-1</literal> should be given here.
1442         </para>
1443        </listitem></varlistentry>
1444       <varlistentry>
1445        <term>defaultVariantRequest <emphasis>variant-request</emphasis></term>
1446        <listitem>
1447         <para>
1448          (o) This directive
1449          provides a default variant request for
1450          use when the individual element requests (see below) do not contain a
1451          variant request. Variant requests consist of a blank-separated list of
        variant components. A variant component is a comma-separated,
1453          parenthesized triple of variant class, type, and value (the two former
1454          values being represented as integers). The value can currently only be
1455          entered as a string (this will change to depend on the definition of
1456          the variant in question). The special value (@) is interpreted as a
1457          null value, however.
1458         </para>
1459        </listitem></varlistentry>
1460       <varlistentry>
1461        <term>simpleElement
1462         <emphasis>path &lsqb;'variant' variant-request&rsqb;</emphasis></term>
1463        <listitem>
1464         <para>
1465          (o,r) This corresponds to a simple element request
1466          in <emphasis>Espec-1</emphasis>.
1467          The path consists of a sequence of tag-selectors, where each of
1468          these can consist of either:
1469         </para>
1470
1471         <para>
1472          <itemizedlist>
1473           <listitem>
1474            <para>
1475             A simple tag, consisting of a comma-separated type-value pair in
           parentheses, possibly followed by a colon (:) followed by an
1477             occurrences-specification (see below). The tag-value can be a number
1478             or a string. If the first character is an apostrophe ('), this
1479             forces the value to be interpreted as a string, even if it
1480             appears to be numerical.
1481            </para>
1482           </listitem>
1483
1484           <listitem>
1485            <para>
1486             A WildThing, represented as a question mark (?), possibly
1487             followed by a colon (:) followed by an occurrences
1488             specification (see below).
1489            </para>
1490           </listitem>
1491
1492           <listitem>
1493            <para>
           A WildPath, represented as an asterisk (*). Note that the last
           element of the path should not be a WildPath (WildPaths don't
           work in this version).
1497            </para>
1498           </listitem>
1499
1500          </itemizedlist>
1501
1502         </para>
1503
1504         <para>
1505          The occurrences-specification can be either the string
1506          <literal>all</literal>, the string <literal>last</literal>, or
1507          an explicit value-range. The value-range is represented as
1508          an integer (the starting point), possibly followed by a
1509          plus (+) and a second integer (the number of elements, default
1510          being one).
1511         </para>
1512
1513         <para>
1514          The variant-request has the same syntax as the defaultVariantRequest
1515          above. Note that it may sometimes be useful to give an empty variant
1516          request, simply to disable the default for a specific set of fields
1517          (we aren't certain if this is proper <emphasis>Espec-1</emphasis>,
1518          but it works in this implementation).
1519         </para>
1520        </listitem></varlistentry>
1521      </variablelist>
1522     </para>
1523
1524     <para>
1525      The following is an example of an element specification belonging to
1526      the GILS profile.
1527     </para>
1528
1529     <para>
1530
1531      <screen>
1532       simpleelement (1,10)
1533       simpleelement (1,12)
1534       simpleelement (2,1)
1535       simpleelement (1,14)
1536       simpleelement (4,1)
1537       simpleelement (4,52)
1538      </screen>
1539
1540     </para>
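
   <para>
    The example above uses only plain tag paths. The sketch below exercises
    the rest of the syntax described in this section: default variant
    settings, occurrences-specifications, a WildThing, and a per-element
    variant request. It assumes that the tag-selectors of a path are
    separated by slashes, as in the <emphasis>elm</emphasis> paths of the
    .abs files. The variant values are invented for illustration; the class
    and type numbers (class 2, type 1) are the <literal>body</literal> and
    <literal>iana</literal> entries of the Variant-1 excerpt.
   </para>

   <para>

    <screen>
     defaultVariantSetId Variant-1
     defaultVariantRequest (2,1,text/plain)

     simpleElement (2,1)
     simpleElement (4,70)/(4,90):all
     simpleElement (4,70)/?:last
     simpleElement (2,6):1+2
     simpleElement (1,12) variant (2,1,text/html)
    </screen>

   </para>

   <para>
    Here <literal>:1+2</literal> selects two occurrences starting at the
    first, and the last line overrides the default variant request for a
    single element.
   </para>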
1541
1542    </sect2>
1543
1544    <sect2 id="schema-mapping">
1545     <title>The Schema Mapping (.map) Files</title>
1546
1547     <para>
1548      Sometimes, the client might want to receive a database record in
1549      a schema that differs from the native schema of the record. For
1550      instance, a client might only know how to process WAIS records, while
1551      the database record is represented in a more specific schema, such as
1552      GILS. In this module, a mapping of data to one of the MARC formats is
1553      also thought of as a schema mapping (mapping the elements of the
1554      record into fields consistent with the given MARC specification, prior
    to actually converting the data to the ISO2709 format). This use of the
1556      object identifier for USMARC as a schema identifier represents an
1557      overloading of the OID which might not be entirely proper. However,
1558      it represents the dual role of schema and record syntax which
1559      is assumed by the MARC family in Z39.50.
1560     </para>
1561
1562     <para>
1563      <emphasis>NOTE: FIXME! The schema-mapping functions are so far limited to a
1564       straightforward mapping of elements. This should be extended with
1565       mechanisms for conversions of the element contents, and conditional
1566       mappings of elements based on the record contents.</emphasis>
1567     </para>
1568
1569     <para>
1570      These are the directives of the schema mapping file format:
1571     </para>
1572
1573     <para>
1574      <variablelist>
1575
1576       <varlistentry>
1577        <term>targetName <emphasis>name</emphasis></term>
1578        <listitem>
1579         <para>
1580          (m) A symbolic name for the target schema
1581          of the table. Useful mostly for diagnostic purposes.
1582         </para>
1583        </listitem></varlistentry>
1584       <varlistentry>
1585        <term>targetRef <emphasis>OID-name</emphasis></term>
1586        <listitem>
1587         <para>
1588          (m) An OID name for the target schema.
1589          This is used, for instance, by a server receiving a request to present
1590          a record in a different schema from the native one.
1591          The name, again, is found in the <emphasis>oid</emphasis>
1592          module of <emphasis>YAZ</emphasis>.
1593         </para>
1594        </listitem></varlistentry>
1595       <varlistentry>
1596        <term>map <emphasis>element-name target-path</emphasis></term>
1597        <listitem>
1598         <para>
1599          (o,r) Adds
1600          an element mapping rule to the table.
1601         </para>
1602        </listitem></varlistentry>
1603      </variablelist>
1604     </para>
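
   <para>
    As a hedged sketch of how the three directives fit together, the
    fragment below maps two elements of the GILS profile to a MARC-oriented
    target schema. The section does not define the notation of
    <emphasis>target-path</emphasis>; the example assumes that it uses the
    same parenthesized tag type and value pairs as the
    <emphasis>elm</emphasis> paths. The target name, the OID name, and the
    target paths are all assumptions made for illustration.
   </para>

   <para>

    <screen>
     targetName usmarc
     targetRef  USmarc

     map title                (3,245)/(3,a)
     map controlIdentifier    (3,001)
    </screen>

   </para>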
1605
1606    </sect2>
1607
1608    <sect2>
1609     <title>The MARC (ISO2709) Representation (.mar) Files</title>
1610
1611     <para>
1612      This file provides rules for representing a record in the ISO2709
1613      format. The rules pertain mostly to the values of the constant-length
1614      header of the record.
1615     </para>
1616
1617     <para>
1618      <emphasis>NOTE: FIXME! This will be described better. We're in the process of
1619       re-evaluating and most likely changing the way that MARC records are
1620       handled by the system.</emphasis>
1621     </para>
1622
1623    </sect2>
1624
1625    <sect2 id="field-structure-and-character-sets">
1626     <title>Field Structure and Character Sets
1627     </title>
1628
1629     <para>
1630      In order to provide a flexible approach to national character set
    handling, Zebra allows the administrator to set up the
1632      system to handle any 8-bit character set &mdash; including sets that
1633      require multi-octet diacritics or other multi-octet characters. The
1634      definition of a character set includes a specification of the
1635      permissible values, their sort order (this affects the display in the
1636      SCAN function), and relationships between upper- and lowercase
1637      characters. Finally, the definition includes the specification of
1638      space characters for the set.
1639     </para>
1640
1641     <para>
1642      The operator can define different character sets for different fields,
1643      typical examples being standard text fields, numerical fields, and
1644      special-purpose fields such as WWW-style linkages (URx).
1645     </para>
1646
1647     <para>
1648      The field types, and hence character sets, are associated with data
1649      elements by the .abs files (see above).
1650      The file <literal>default.idx</literal>
1651      provides the association between field type codes (as used in the .abs
1652      files) and the character map files (with the .chr suffix). The format
    of the .idx file is as follows:
1654     </para>
1655
1656     <para>
1657      <variablelist>
1658
1659       <varlistentry>
1660        <term>index <emphasis>field type code</emphasis></term>
1661        <listitem>
1662         <para>
1663          This directive introduces a new search index code.
1664          The argument is a one-character code to be used in the
1665          .abs files to select this particular index type. An index, roughly,
1666          corresponds to a particular structure attribute during search. Refer
1667          to <xref linkend="search"/>.
1668         </para>
1669        </listitem></varlistentry>
1670       <varlistentry>
      <term>sort <emphasis>field type code</emphasis></term>
1672        <listitem>
1673         <para>
1674          This directive introduces a
1675          sort index. The argument is a one-character code to be used in the
1676          .abs file to select this particular index type. The corresponding
1677          use attribute must be used in the sort request to refer to this
1678          particular sort index. The corresponding character map (see below)
1679          is used in the sort process.
1680         </para>
1681        </listitem></varlistentry>
1682       <varlistentry>
1683        <term>completeness <emphasis>boolean</emphasis></term>
1684        <listitem>
1685         <para>
1686          This directive enables or disables complete field indexing.
1687          The value of the <emphasis>boolean</emphasis> should be 0
1688          (disable) or 1 (enable). If completeness is enabled, the index entry will
1689          contain the complete contents of the field (up to a limit), with words
1690          (non-space characters) separated by single space characters
1691          (normalized to " " on display). When completeness is
1692          disabled, each word is indexed as a separate entry. Complete subfield
1693          indexing is most useful for fields which are typically browsed (e.g.
1694          titles, authors, or subjects), or instances where a match on a
1695          complete subfield is essential (e.g. exact title searching). For fields
1696          where completeness is disabled, the search engine will interpret a
1697          search containing space characters as a word proximity search.
1698         </para>
1699        </listitem></varlistentry>
1700       <varlistentry>
1701        <term>charmap <emphasis>filename</emphasis></term>
1702        <listitem>
1703         <para>
1704          This is the filename of the character
1705          map to be used for this field type.
1706         </para>
1707        </listitem></varlistentry>
1708      </variablelist>
1709     </para>
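
         <para>
          As a minimal sketch of the directives described above, an .idx
          file might look like the following. The type codes
          (<literal>w</literal>, <literal>p</literal>, <literal>n</literal>,
          <literal>u</literal>, <literal>s</literal>) and the character map
          file names are illustrative assumptions, not a prescribed
          configuration; the codes only need to match those used in
          your .abs files.
         </para>

         <screen>
          # Word index: each word is indexed as a separate entry
          index w
          completeness 0
          charmap string.chr

          # Phrase index: the complete field is indexed as one entry
          index p
          completeness 1
          charmap string.chr

          # Numeric fields
          index n
          completeness 0
          charmap numeric.chr

          # WWW-style linkages (URx)
          index u
          completeness 0
          charmap urx.chr

          # Sort index, sorted according to the same character map
          sort s
          completeness 1
          charmap string.chr
         </screen>

         <para>
          In this sketch, the word and phrase indexes share one character
          map and differ only in their completeness setting, which is one
          way to make the same field both searchable by word and
          browsable as a whole.
         </para>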
1710
1711     <para>
1712      The contents of the character map files are structured as follows:
1713     </para>
1714
1715     <para>
1716      <variablelist>
1717
1718       <varlistentry>
1719        <term>lowercase <emphasis>value-set</emphasis></term>
1720        <listitem>
1721         <para>
1722          This directive introduces the basic value set of the field type.
1723          The format is an ordered list (without spaces) of the
1724          characters which may occur in "words" of the given type.
1725          The order of the entries in the list determines the
1726          sort order of the index. In addition to single characters, the
1727          following combinations are legal:
1728         </para>
1729
1730         <para>
1731
1732          <itemizedlist>
1733           <listitem>
1734            <para>
1735             A backslash may be used to introduce a three-digit octal or a
1736             two-digit hexadecimal (preceded by <literal>x</literal>)
1737             representation of a single character.
1738             In addition, the combinations <literal>\\</literal>,
1739             <literal>\r</literal>, <literal>\n</literal>, <literal>\t</literal>,
1740             and <literal>\s</literal> (space; remember that real space
1741             characters may not occur in the value definition) are recognized, with their usual interpretation.
1742            </para>
1743           </listitem>
1744
1745           <listitem>
1746            <para>
1747             Curly braces &lcub;&rcub; may be used to enclose ranges of single
1748             characters (possibly using the escape convention described in the
1749             preceding point), e.g. &lcub;a-z&rcub; to introduce the
1750             standard range of lowercase ASCII letters.
1751             Note that the interpretation of such a range depends on
1752             the concrete representation in your local, physical character set.
1753            </para>
1754           </listitem>
1755
1756           <listitem>
1757            <para>
1758             Parentheses () may be used to enclose multi-byte characters,
1759             e.g. diacritics or special national combinations (e.g. Spanish
1760             "ll"). When found in the input stream (or a search term),
1761             these characters are viewed and sorted as a single character, with a
1762             sorting value depending on the position of the group in the value
1763             statement.
1764            </para>
1765           </listitem>
1766
1767          </itemizedlist>
1768
1769         </para>
1770        </listitem></varlistentry>
1771       <varlistentry>
1772        <term>uppercase <emphasis>value-set</emphasis></term>
1773        <listitem>
1774         <para>
1775          This directive introduces the
1776          uppercase equivalents of the value set (if any). The number and
1777          order of the entries in the list should be the same as in the
1778          <literal>lowercase</literal> directive.
1779         </para>
1780        </listitem></varlistentry>
1781       <varlistentry>
1782        <term>space <emphasis>value-set</emphasis></term>
1783        <listitem>
1784         <para>
1785          This directive introduces the characters
1786          which separate words in the input stream. Depending on the
1787          completeness mode of the field in question, these characters either
1788          terminate an index entry, or delimit individual "words" in
1789          the input stream. The order of the elements is not significant &mdash;
1790          otherwise the representation is the same as for the
1791          <literal>uppercase</literal> and <literal>lowercase</literal>
1792          directives.
1793         </para>
1794        </listitem></varlistentry>
1795       <varlistentry>
1796        <term>map <emphasis>value-set</emphasis>
1797         <emphasis>target</emphasis></term>
1798        <listitem>
1799         <para>
1800          This directive introduces a
1801          mapping from each of the members of the value-set on the left to
1802          the character on the right. The character on the right must occur in
1803          the value set (the <literal>lowercase</literal> directive) of
1804          the character set, but
1805          it may be a parenthesis-enclosed multi-octet character. This directive
1806          may be used to map diacritics to their base characters, or to map
1807          HTML-style character-representations to their natural form, etc.
1808         </para>
1809        </listitem></varlistentry>
1810      </variablelist>
1811     </para>
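
         <para>
          The directives above can be combined as in the following sketch
          of a character map file. The particular value sets are
          assumptions chosen for illustration: the punctuation listed
          under <literal>space</literal> is arbitrary, and the octal code
          <literal>\351</literal> stands for a lowercase e with acute
          accent only if the data is encoded in ISO 8859-1.
         </para>

         <screen><![CDATA[
          # Characters that may occur in words, listed in sort order
          lowercase {0-9}{a-z}

          # Uppercase equivalents, same number and order as above
          uppercase {0-9}{A-Z}

          # Word separators: control characters, space (octal 040)
          # and some punctuation
          space {\001-\040}.,;:!?

          # Map an HTML-style representation (a multi-byte group) and an
          # accented letter to a base character
          map (&eacute;) e
          map \351 e
         ]]></screen>

         <para>
          Both <literal>map</literal> targets are legal here because
          <literal>e</literal> occurs in the <literal>lowercase</literal>
          value set.
         </para>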
1812
1813    </sect2>
1814
1815   </sect1>
1816
1817   <sect1 id="formats">
1818    <title>Exchange Formats</title>
1819
1820    <para>
1821     Converting records from the internal structure to an exchange format
1822     is largely an automatic process. Currently, the following exchange
1823     formats are supported:
1824    </para>
1825
1826    <para>
1827     <itemizedlist>
1828      <listitem>
1829       <para>
1830        GRS-1. The internal representation is based on GRS-1/XML, so the
1831        conversion here is straightforward. The system will create
1832        applied variant and supported variant lists as required, if a record
1833        contains variant information.
1834       </para>
1835      </listitem>
1836
1837      <listitem>
1838       <para>
1839        XML. The internal representation is based on GRS-1/XML so
1840        the mapping is trivial. Note that XML schemas, processing
1841        instructions and comments are not part of the internal representation
1842        and therefore will never be part of a generated XML record.
1843        Future versions of Zebra will support them.
1844       </para>
1845      </listitem>
1846
1847      <listitem>
1848       <para>
1849        SUTRS (Simple Unstructured Text Record Syntax). Again, the mapping
1850        is fairly straightforward. Indentation is used to show the
1851        hierarchical structure of the record. All
1852        "GRS" type records support both the GRS-1 and SUTRS
1853        representations.
1854       </para>
1855      </listitem>
1856
1857      <listitem>
1858       <para>
1859        ISO2709-based formats (USMARC, etc.). Only records with a
1860        two-level structure (corresponding to fields and subfields) can be
1861        directly mapped to ISO2709. For records with a different structuring
1862        (e.g., GILS), the representation in a structure like USMARC involves a
1863        schema-mapping (see <xref linkend="schema-mapping"/>), to an
1864        "implied" USMARC schema (implied,
1865        because there is no formal schema which specifies the use of the
1866        USMARC fields outside of ISO2709). The resultant, two-level record is
1867        then mapped directly from the internal representation to ISO2709. See
1868        the GILS schema definition files for a detailed example of this
1869        approach.
1870       </para>
1871      </listitem>
1872
1873      <listitem>
1874       <para>
1875        Explain. This representation is only available for records
1876        belonging to the Explain schema.
1877       </para>
1878      </listitem>
1879
1880      <listitem>
1881       <para>
1882        Summary. This ASN.1-based structure is only available for records
1883        belonging to the Summary schema, or schemas which provide a mapping
1884        to this schema (see the description of the schema mapping facility
1885        above).
1886       </para>
1887      </listitem>
1888
1889      <listitem>
1890       <para>
1891        SOIF. Support for this syntax is experimental, and is currently
1892        keyed to a private Index Data OID (1.2.840.10003.5.1000.81.2). All
1893        abstract syntaxes can be mapped to the SOIF format, although nested
1894        elements are represented by concatenation of the tag names at each
1895        level.
1896        FIXME - Is this used anywhere ? -H
1897       </para>
1898      </listitem>
1899
1900     </itemizedlist>
1901    </para>
1902   </sect1>
1903
1904  </chapter>
1905  <!-- Keep this comment at the end of the file
1906  Local variables:
1907  mode: sgml
1908  sgml-omittag:t
1909  sgml-shorttag:t
1910  sgml-minimize-attributes:nil
1911  sgml-always-quote-attributes:t
1912  sgml-indent-step:1
1913  sgml-indent-data:t
1914  sgml-parent-document: "zebra.xml"
1915  sgml-local-catalogs: nil
1916  sgml-namecase-general:t
1917  End:
1918  -->