doc/architecture.xml

   1  <chapter id="architecture">
   2   <!-- $Id: architecture.xml,v 1.19 2007-02-02 09:58:39 marc Exp $ -->
   3   <title>Overview of &zebra; Architecture</title>
   4
   5   <section id="architecture-representation">
   6    <title>Local Representation</title>
   7
   8    <para>
   9     As mentioned earlier, &zebra; places few restrictions on the type of
  10     data that you can index and manage. Generally, whatever the form of
  11     the data, it is parsed by an input filter specific to that format, and
  12     turned into an internal structure that &zebra; knows how to handle. This
  13     process takes place whenever the record is accessed - for indexing and
  14     retrieval.
  15    </para>
  16
  17    <para>
  18     The <literal>recordType</literal> parameter in the
  19     <literal>zebra.cfg</literal> file, or the <literal>-t</literal> option
  20     to the indexer, tells &zebra; how to process input records.
  21     Two basic types of processing are available - raw text and structured
  22     data. Raw text is just that, and it is selected by providing the
  23     argument <emphasis>text</emphasis> to &zebra;. Structured records are
  24     all handled internally using the basic mechanisms described in the
  25     subsequent sections.
  26     &zebra; can read structured records in many different formats.
  27     <!--
  28     How this is done is governed by additional parameters after the
  29     "grs" keyword, separated by "." characters.
  30     -->
  31    </para>
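
        <para>
         As a minimal sketch, assuming a directory named
         <filename>records</filename> that holds XML files (the directory
         name is only illustrative), the record type could be set in
         <literal>zebra.cfg</literal>:
         <screen>
           recordType: grs.xml
         </screen>
         or given on the command line when running the indexer:
         <screen>
           zebraidx -t grs.xml update records
         </screen>
        </para>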
  32   </section>
  33
  34   <section id="architecture-maincomponents">
  35    <title>Main Components</title>
  36    <para>
  37     The &zebra; system is designed to support a wide range of data management
  38     applications. The system can be configured to handle virtually any
  39     kind of structured data. Each record in the system is associated with
  40     a <emphasis>record schema</emphasis> which lends context to the data
  41     elements of the record.
  42     Any number of record schemas can coexist in the system.
  43     Although it may be wise to use only a single schema within
  44     one database, the system imposes no such restriction.
  45    </para>
  46    <para>
  47     The &zebra; indexer and information retrieval server consists of
  48     two main applications: the <command>zebraidx</command>
  49     indexing maintenance utility, and the <command>zebrasrv</command>
  50     information query and retrieval server. Both use some of the
  51     same main components, which are presented here.
  52    </para>
  53    <para>
  54     The virtual Debian package <literal>idzebra-2.0</literal>
  55     installs all the necessary packages to start
  56     working with &zebra; - including utility programs, development libraries,
  57     documentation and modules.
  58   </para>
  59
  60    <section id="componentcore">
  61     <title>Core &zebra; Libraries Containing Common Functionality</title>
  62     <para>
  63      The core &zebra; libraries make up the bulk of the <command>zebraidx</command>
  64     indexing maintenance utility, and the <command>zebrasrv</command>
  65     information query and retrieval server binaries. In short, the core
  66     libraries are responsible for
  67      <variablelist>
  68       <varlistentry>
  69        <term>Dynamic Loading</term>
  70        <listitem>
  71         <para>of external filter modules, in case the application is
  72         not compiled statically. These filter modules define indexing,
  73         search and retrieval capabilities of the various input formats.
  74         </para>
  75        </listitem>
  76       </varlistentry>
  77       <varlistentry>
  78        <term>Index Maintenance</term>
  79        <listitem>
  80         <para> &zebra; maintains Term Dictionaries and ISAM index
  81         entries in inverted index structures kept on disk. These are
  82         optimized for fast insert, update and delete, as well as good
  83         search performance.
  84         </para>
  85        </listitem>
  86       </varlistentry>
  87       <varlistentry>
  88        <term>Search Evaluation</term>
  89        <listitem>
  90         <para>by execution of search requests expressed in PQF/RPN
  91          data structures, which are handed over from
  92          the YAZ server frontend API. Search evaluation includes
  93          construction of hit lists according to boolean combinations
  94          of simpler searches. Fast performance is achieved by careful
  95          use of index structures, and by evaluating specific index hit
  96          lists in the correct order.
  97         </para>
  98        </listitem>
  99       </varlistentry>
 100       <varlistentry>
 101        <term>Ranking and Sorting</term>
 102        <listitem>
 103         <para>
 104          components call re-sorting and re-ranking algorithms on the hit
 105          sets. Hit sets may also be pre-sorted, using not only the
 106          assigned document IDs but also assigned static rank
 107          information.
 108         </para>
 109        </listitem>
 110       </varlistentry>
 111       <varlistentry>
 112        <term>Record Presentation</term>
 113        <listitem>
 114         <para>returns (possibly ranked) result sets, hit
 115          counts, and similar internal data to the YAZ server backend API
 116          for shipping to the client. Each individual filter module
 117          implements its own specific presentation formats.
 118         </para>
 119        </listitem>
 120       </varlistentry>
 121      </variablelist>
 122      </para>
 123     <para>
 124      The Debian package <literal>libidzebra-2.0</literal>
 125      contains all run-time libraries for &zebra;, the
 126      documentation in PDF and HTML is found in
 127      <literal>idzebra-2.0-doc</literal>, and
 128      <literal>idzebra-2.0-common</literal>
 129      includes common essential &zebra; configuration files.
 130     </para>
 131    </section>
 132
 133
 134    <section id="componentindexer">
 135     <title>&zebra; Indexer</title>
 136     <para>
 137      The  <command>zebraidx</command>
 138      indexing maintenance utility
 139      loads external filter modules used for indexing data records of
 140      different type, and creates, updates and drops databases and
 141      indexes according to the rules defined in the filter modules.
 142     </para>
 143     <para>
 144      The Debian  package <literal>idzebra-2.0-utils</literal> contains
 145      the  <command>zebraidx</command> utility.
 146     </para>
 147    </section>
 148
 149    <section id="componentsearcher">
 150     <title>&zebra; Searcher/Retriever</title>
 151     <para>
 152      This is the executable which runs the Z39.50/SRU/SRW server and
 153      glues together the core libraries and the filter modules into one
 154      Information Retrieval server application.
 155     </para>
 156     <para>
 157      The Debian  package <literal>idzebra-2.0-utils</literal> contains
 158      the  <command>zebrasrv</command> utility.
 159     </para>
 160    </section>
 161
 162    <section id="componentyazserver">
 163     <title>YAZ Server Frontend</title>
 164     <para>
 165      The YAZ server frontend is
 166      a full-fledged stateful Z39.50 server that accepts client
 167      connections and forwards search and scan requests to the
 168      &zebra; core indexer.
 169     </para>
 170     <para>
 171      In addition to Z39.50 requests, the YAZ server frontend acts
 172      as an HTTP server, honoring
 173       <ulink url="&url.srw;">SRU SOAP</ulink>
 174      requests, and
 175      <ulink url="&url.sru;">SRU REST</ulink>
 176      requests. Moreover, it can
 177      translate incoming
 178      <ulink url="&url.cql;">CQL</ulink>
 179      queries to
 180      <ulink url="&url.yaz.pqf;">PQF</ulink>
 181       queries, if
 182      correctly configured.
 183     </para>
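         <para>
          As a rough sketch of such a translation, assume a CQL-to-PQF
          mapping in which the CQL index <literal>dc.title</literal> is
          bound to the Bib-1 use attribute 4 (title); this mapping is an
          assumption made for the example. A CQL query such as
          <screen>
            dc.title = dinosaur
          </screen>
          would then be turned into a PQF query along the lines of
          <screen>
            @attr 1=4 dinosaur
          </screen>
          The exact result depends on the configured mapping, which may
          add further attributes (relation, structure, and so on).
         </para>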
 184     <para>
 185      <ulink url="&url.yaz;">YAZ</ulink>
 186      is an Open Source
 187      toolkit that allows you to develop software using the
 188      ANSI Z39.50/ISO 23950 standard for information retrieval.
 189      It is packaged in the Debian packages
 190      <literal>yaz</literal> and <literal>libyaz</literal>.
 191     </para>
 192    </section>
 193
 194    <section id="componentmodules">
 195     <title>Record Models and Filter Modules</title>
 196     <para>
 197      The hard work of knowing <emphasis>what</emphasis> to index,
 198      <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
 199      part of the records to send in a search/retrieve response is
 200      implemented in
 201      various filter modules. It is their responsibility to define the
 202      exact indexing and record display filtering rules.
 203      </para>
 204      <para>
 205      The virtual Debian package
 206      <literal>libidzebra-2.0-modules</literal> installs all base filter
 207      modules.
 208     </para>
 209
 210
 211    <section id="componentmodulesalvis">
 212     <title>ALVIS XML Record Model and Filter Module</title>
 213      <para>
 214       The Alvis filter for XML files is an XSLT-based input
 215       filter.
 216       It indexes element and attribute content of any conceivable XML format
 217       using full XPath support, a feature which the standard &zebra;
 218       GRS SGML and XML filters lack. The indexed documents are
 219       parsed into a standard XML DOM tree, which limits record size
 220       to what fits in available memory.
 221     </para>
 222     <para>
 223       The Alvis filter
 224       uses XSLT display stylesheets, which let
 225       the &zebra; DB administrator associate multiple different views with
 226       the same XML document type. These views are chosen on the fly at
 227       search time.
 228      </para>
 229     <para>
 230       In addition, the Alvis filter configuration is not bound to the
 231       arcane  BIB-1 Z39.50 library catalogue indexing traditions and
 232       folklore, and is therefore easier to understand.
 233     </para>
 234     <para>
 235       Finally, the Alvis filter allows for static ranking at index
 236       time, and for sorting hit lists according to predefined
 237       static ranks. This imposes no overhead at all: both
 238       search and indexing still perform
 239       <emphasis>O(1)</emphasis>, irrespective of document
 240       collection size. This feature resembles Google's pre-ranking using
 241       its PageRank algorithm.
 242     </para>
 243     <para>
 244       Details on the experimental Alvis XSLT filter are found in
 245       <xref linkend="record-model-alvisxslt"/>.
 246       </para>
 247      <para>
 248       The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
 249       contains the Alvis filter module.
 250      </para>
 251     </section>
 252
 253    <section id="componentmodulesgrs">
 254     <title>GRS Record Model and Filter Modules</title>
 255     <para>
 256     The GRS filter modules described in
 257     <xref linkend="grs"/>
 258     are all based on the Z39.50 specifications, and it is absolutely
 259     mandatory to have the reference pages on BIB-1 attribute sets at
 260     hand when configuring GRS filters. The GRS filters come in
 261     different flavors, and a short introduction is needed here.
 262     GRS filters of various kinds have also been called ABS filters due
 263     to the <filename>*.abs</filename> configuration file suffix.
 264     </para>
 265     <para>
 266       The <emphasis>grs.marc</emphasis> and
 267       <emphasis>grs.marcxml</emphasis> filters are suited to parse and
 268       index binary and XML versions of traditional library MARC records
 269       based on the ISO2709 standard. The Debian package for both
 270       filters is
 271      <literal>libidzebra-2.0-mod-grs-marc</literal>.
 272     </para>
 273     <para>
 274       GRS Tcl scriptable filters for extensive user configuration come
 275      in two flavors: a regular expression filter
 276      <emphasis>grs.regx</emphasis> using Tcl regular expressions, and
 277      a general scriptable Tcl filter called
 278      <emphasis>grs.tcl</emphasis>.
 279      Both are included in the
 280      <literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
 281     </para>
 282     <para>
 283       A general purpose SGML filter is called
 284      <emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
 285      but is planned for the
 286      <literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
 287     </para>
 288     <para>
 289       The Debian  package
 290       <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
 291       <emphasis>grs.xml</emphasis> filter which uses <ulink
 292       url="&url.expat;">Expat</ulink> to
 293       parse records in XML and turn them into ID&zebra;'s internal GRS node
 294       trees. See also the Alvis XML/XSLT filter described in
 295       <xref linkend="componentmodulesalvis"/>.
 296     </para>
 297    </section>
 298
 299    <section id="componentmodulestext">
 300     <title>TEXT Record Model and Filter Module</title>
 301     <para>
 302       Plain ASCII text filter. TODO: add information here.
 303     </para>
 304    </section>
 305
 306     <!--
 307    <section id="componentmodulessafari">
 308     <title>SAFARI Record Model and Filter Module</title>
 309     <para>
 310      SAFARI filter module TODO: add information here.
 311     </para>
 312    </section>
 313     -->
 314
 315    </section>
 316
 317   </section>
 318
 319
 320   <section id="architecture-workflow">
 321    <title>Indexing and Retrieval Workflow</title>
 322
 323   <para>
 324    Records pass through three different states during processing in the
 325    system.
 326   </para>
 327
 328   <para>
 329
 330    <itemizedlist>
 331     <listitem>
 332
 333      <para>
 334       When records are accessed by the system, they are represented
 335       in their local, or native, format. This might be SGML or HTML files,
 336       News or Mail archives, or MARC records. If the system doesn't already
 337       know how to read the type of data you need to store, you can set up an
 338       input filter by preparing conversion rules based on regular
 339       expressions and possibly augmented by a flexible scripting language
 340       (Tcl).
 341       The input filter produces as output an internal representation,
 342       a tree structure.
 343
 344      </para>
 345     </listitem>
 346     <listitem>
 347
 348      <para>
 349       When records are processed by the system, they are represented
 350       in a tree structure, constructed from tagged data elements hanging off a
 351       root node. The tagged elements may contain data or yet more tagged
 352       elements in a recursive structure. The system performs various
 353       actions on this tree structure (indexing, element selection, schema
 354       mapping, etc.).
 355
 356      </para>
 357     </listitem>
 358     <listitem>
 359
 360      <para>
 361       Before records are transmitted to the client, they are first
 362       converted from the internal structure to a form suitable for exchange
 363       over the network, according to the Z39.50 standard.
 364      </para>
 365     </listitem>
 366
 367    </itemizedlist>
 368
 369   </para>
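
       <para>
        As an illustration only (the record and the notation below are
        made up for this sketch and are not actual &zebra; output), a
        small XML record such as
        <screen><![CDATA[
          <record>
           <title>My title</title>
           <author><name>A. Writer</name></author>
          </record>
        ]]></screen>
        would correspond to an internal tree of tagged elements along
        these lines, with data held at the leaves:
        <screen>
          record              (root node)
            title    "My title"
            author
              name   "A. Writer"
        </screen>
       </para>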
 370   </section>
 371
 372   <section id="special-retrieval">
 373    <title>Retrieval of &zebra; internal record data</title>
 374    <para>
 375     Starting with &zebra; version 2.0.5, it is
 376     possible to use special element sets which have the prefix
 377     <literal>zebra::</literal>.
 378    </para>
 379    <para>
 380     Using such an element set will, regardless of record type, return
 381     &zebra;'s internal index structure/data for a record.
 382     In particular, the regular record filters are not invoked when
 383     these are in use.
 384     This can in some cases make retrieval faster than regular
 385     retrieval operations (for MARC, XML, etc.).
 386    </para>
 387    <table id="special-retrieval-types">
 388     <title>Special Retrieval Elements</title>
 389     <tgroup cols="3">
 390      <thead>
 391       <row>
 392        <entry>Element Set</entry>
 393        <entry>Description</entry>
 394        <entry>Syntax</entry>
 395       </row>
 396      </thead>
 397      <tbody>
 398       <row>
 399        <entry><literal>zebra::meta::sysno</literal></entry>
 400        <entry>Get &zebra; record system ID</entry>
 401        <entry>XML and SUTRS</entry>
 402       </row>
 403       <row>
 404        <entry><literal>zebra::data</literal></entry>
 405        <entry>Get raw record</entry>
 406        <entry>all</entry>
 407       </row>
 408       <row>
 409        <entry><literal>zebra::meta</literal></entry>
 410        <entry>Get &zebra; record internal metadata</entry>
 411        <entry>XML and SUTRS</entry>
 412       </row>
 413       <row>
 414        <entry><literal>zebra::index</literal></entry>
 415        <entry>Get all indexed keys for record</entry>
 416        <entry>XML and SUTRS</entry>
 417       </row>
 418       <row>
 419        <entry>
 420         <literal>zebra::index::</literal><replaceable>f</replaceable>
 421        </entry>
 422        <entry>
 423         Get indexed keys for field <replaceable>f</replaceable> for record
 424        </entry>
 425        <entry>XML and SUTRS</entry>
 426       </row>
 427       <row>
 428        <entry>
 429         <literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
 430        </entry>
 431        <entry>
 432         Get indexed keys for field <replaceable>f</replaceable>
 433           and type <replaceable>t</replaceable> for record
 434        </entry>
 435        <entry>XML and SUTRS</entry>
 436       </row>
 437      </tbody>
 438     </tgroup>
 439    </table>
 440    <para>
 441     For example, to fetch the raw binary record data stored in the
 442     &zebra; internal storage, or on the filesystem, the following
 443     commands can be issued:
 444     <screen>
 445       Z> f @attr 1=title my
 446       Z> format xml
 447       Z> elements zebra::data
 448       Z> s 1+1
 449       Z> format sutrs
 450       Z> s 1+1
 451       Z> format usmarc
 452       Z> s 1+1
 453     </screen>
 454     </para>
 455    <para>
 456     The special
 457     <literal>zebra::data</literal> element set name is
 458     defined for any record syntax, but will always fetch
 459     the raw record data in exactly the original form. No record syntax
 460     specific transformations will be applied to the raw record data.
 461    </para>
 462    <para>
 463     Also, &zebra; internal metadata about the record can be accessed:
 464     <screen>
 465       Z> f @attr 1=title my
 466       Z> format xml
 467       Z> elements zebra::meta::sysno
 468       Z> s 1+1
 469     </screen>
 470     displays, in <literal>XML</literal> record syntax, only the internal
 471     record system number, whereas
 472     <screen>
 473       Z> f @attr 1=title my
 474       Z> format xml
 475       Z> elements zebra::meta
 476       Z> s 1+1
 477     </screen>
 478     displays all available metadata on the record. These include system
 479     number, database name, indexed filename, filter used for indexing,
 480     score and static ranking information, and finally the byte size of the record.
 481    </para>
 482    <para>
 483     Sometimes it is very hard to figure out exactly what has been
 484     indexed, how, and in which indexes. Using the indexing stylesheet of
 485     the Alvis filter, one can at least see which portion of the record
 486     went into which index, but no similar aid exists for the
 487     other indexing filters.
 488    </para>
 489    <para>
 490     The special
 491     <literal>zebra::index</literal> element set names are provided to
 492     access information on per record indexed fields. For example, the
 493     queries
 494     <screen>
 495       Z> f @attr 1=title my
 496       Z> format sutrs
 497       Z> elements zebra::index
 498       Z> s 1+1
 499     </screen>
 500     will display all indexed tokens from all indexed fields of the
 501     first record, and it will display in <literal>SUTRS</literal>
 502     record syntax, whereas
 503     <screen>
 504       Z> f @attr 1=title my
 505       Z> format xml
 506       Z> elements zebra::index::title
 507       Z> s 1+1
 508       Z> elements zebra::index::title:p
 509       Z> s 1+1
 510     </screen>
 511     displays, in <literal>XML</literal> record syntax, only the content
 512       of the &zebra; string index <literal>title</literal>, or
 513       even only the part of it indexed as type <literal>p</literal> (phrase).
 514    </para>
 515    <note>
 516     <para>
 517      Trying to access numeric <literal>Bib-1</literal> use
 518      attributes or trying to access non-existent &zebra; internal string
 519      access points will result in a Diagnostic 25: Specified element set
 520      name not valid for specified database.
 521     </para>
 522    </note>
 523   </section>
 524
 525  </chapter>
 526
 527  <!-- Keep this comment at the end of the file
 528  Local variables:
 529  mode: sgml
 530  sgml-omittag:t
 531  sgml-shorttag:t
 532  sgml-minimize-attributes:nil
 533  sgml-always-quote-attributes:t
 534  sgml-indent-step:1
 535  sgml-indent-data:t
 536  sgml-parent-document: "zebra.xml"
 537  sgml-local-catalogs: nil
 538  sgml-namecase-general:t
 539  End:
 540  -->