doc/architecture.xml

   1  <chapter id="architecture">
   2   <!-- $Id: architecture.xml,v 1.26 2007-12-19 13:35:39 adam Exp $ -->
   3   <title>Overview of &zebra; Architecture</title>
   4
   5   <section id="architecture-representation">
   6    <title>Local Representation</title>
   7
   8    <para>
   9     As mentioned earlier, &zebra; places few restrictions on the type of
  10     data that you can index and manage. Generally, whatever the form of
  11     the data, it is parsed by an input filter specific to that format, and
  12     turned into an internal structure that &zebra; knows how to handle. This
  13     process takes place whenever the record is accessed - for indexing and
  14     retrieval.
  15    </para>
  16
  17    <para>
  18     The RecordType parameter in the <literal>zebra.cfg</literal> file, or
  19     the <literal>-t</literal> option to the indexer, tells &zebra; how to
  20     process input records.
  21     Two basic types of processing are available - raw text and structured
  22     data. Raw text is just that, and it is selected by providing the
  23     argument <emphasis>text</emphasis> to &zebra;. Structured records are
  24     all handled internally using the basic mechanisms described in the
  25     subsequent sections.
  26     &zebra; can read structured records in many different formats.
  27     <!--
  28     How this is done is governed by additional parameters after the
  29     "grs" keyword, separated by "." characters.
  30     -->
  31    </para>
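   <para>
    As a minimal sketch of the two selection mechanisms described above,
    a <literal>zebra.cfg</literal> fragment might look as follows. The
    <literal>profilePath</literal> value and the choice of the
    <literal>grs.xml</literal> filter are illustrative assumptions, not
    settings prescribed by this chapter:
    <screen>
      # zebra.cfg (hypothetical paths and filter choice)
      profilePath: .:./tab
      recordType: grs.xml
    </screen>
    The same kind of choice can be made for a single run with the
    <literal>-t</literal> option; here raw text processing is selected
    for a hypothetical <filename>records/</filename> directory:
    <screen>
      zebraidx -t text update records/
    </screen>
   </para>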
  32   </section>
  33
  34   <section id="architecture-maincomponents">
  35    <title>Main Components</title>
  36    <para>
  37     The &zebra; system is designed to support a wide range of data management
  38     applications. The system can be configured to handle virtually any
  39     kind of structured data. Each record in the system is associated with
  40     a <emphasis>record schema</emphasis> which lends context to the data
  41     elements of the record.
  42     Any number of record schemas can coexist in the system.
  43     Although it may be wise to use only a single schema within
  44     one database, the system poses no such restrictions.
  45    </para>
  46    <para>
  47     The &zebra; indexer and information retrieval server consists of the
  48     following main applications: the <command>zebraidx</command>
  49     indexing maintenance utility, and the <command>zebrasrv</command>
  50     information query and retrieval server. Both use some of the
  51     same main components, which are presented here.
  52    </para>
  53    <para>
  54     The virtual Debian package <literal>idzebra-2.0</literal>
  55     installs all the necessary packages to start
  56     working with &zebra; - including utility programs, development libraries,
  57     documentation and modules.
  58   </para>
  59
  60    <section id="componentcore">
  61     <title>Core &zebra; Libraries Containing Common Functionality</title>
  62     <para>
  63      The core &zebra; module is the meat of the <command>zebraidx</command>
  64     indexing maintenance utility, and the <command>zebrasrv</command>
  65     information query and retrieval server binaries. In short, the core
  66     libraries are responsible for
  67      <variablelist>
  68       <varlistentry>
  69        <term>Dynamic Loading</term>
  70        <listitem>
  71         <para>of external filter modules, in case the application is
  72         not compiled statically. These filter modules define indexing,
  73         search and retrieval capabilities of the various input formats.
  74         </para>
  75        </listitem>
  76       </varlistentry>
  77       <varlistentry>
  78        <term>Index Maintenance</term>
  79        <listitem>
  80         <para> &zebra; maintains Term Dictionaries and ISAM index
  81         entries in inverted index structures kept on disk. These are
  82         optimized for fast insert, update and delete, as well as good
  83         search performance.
  84         </para>
  85        </listitem>
  86       </varlistentry>
  87       <varlistentry>
  88        <term>Search Evaluation</term>
  89        <listitem>
  90         <para>by execution of search requests expressed in &acro.pqf;/&acro.rpn;
  91          data structures, which are handed over from
  92          the &yaz; server frontend &acro.api;. Search evaluation includes
  93          construction of hit lists according to boolean combinations
  94          of simpler searches. Fast performance is achieved by careful
  95          use of index structures, and by evaluating specific index hit
  96          lists in the correct order.
  97         </para>
  98        </listitem>
  99       </varlistentry>
 100       <varlistentry>
 101        <term>Ranking and Sorting</term>
 102        <listitem>
 103         <para>
 104          components call resorting/re-ranking algorithms on the hit
 105          sets. These might also be pre-sorted not only using the
 106          assigned document IDs, but also using assigned static rank
 107          information.
 108         </para>
 109        </listitem>
 110       </varlistentry>
 111       <varlistentry>
 112        <term>Record Presentation</term>
 113        <listitem>
 114         <para>returns - possibly ranked - result sets, hit
 115          numbers, and similar internal data to the &yaz; server backend &acro.api;
 116          for shipping to the client. Each individual filter module
 117          implements its own specific presentation formats.
 118         </para>
 119        </listitem>
 120       </varlistentry>
 121      </variablelist>
 122      </para>
 123     <para>
 124      The Debian package <literal>libidzebra-2.0</literal>
 125      contains all run-time libraries for &zebra;, the
 126      documentation in PDF and HTML is found in
 127      <literal>idzebra-2.0-doc</literal>, and
 128      <literal>idzebra-2.0-common</literal>
 129      includes common essential &zebra; configuration files.
 130     </para>
 131    </section>
 132
 133
 134    <section id="componentindexer">
 135     <title>&zebra; Indexer</title>
 136     <para>
 137      The  <command>zebraidx</command>
 138      indexing maintenance utility
 139      loads external filter modules used for indexing data records of
 140      different type, and creates, updates and drops databases and
 141      indexes according to the rules defined in the filter modules.
 142     </para>
 143     <para>
 144      The Debian  package <literal>idzebra-2.0-utils</literal> contains
 145      the  <command>zebraidx</command> utility.
 146     </para>
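    <para>
     A typical indexing session could be sketched as follows, assuming
     a <literal>zebra.cfg</literal> file in the current directory and
     a hypothetical <filename>records/</filename> directory holding the
     input records:
     <screen>
       zebraidx -c zebra.cfg init
       zebraidx -c zebra.cfg update records/
       zebraidx -c zebra.cfg commit
     </screen>
     The <literal>commit</literal> step is only relevant when shadow
     registers are enabled in the configuration; without them, changes
     are written to the register directly by <literal>update</literal>.
    </para>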
 147    </section>
 148
 149    <section id="componentsearcher">
 150     <title>&zebra; Searcher/Retriever</title>
 151     <para>
 152      This is the executable which runs the &acro.z3950;/&acro.sru;/&acro.srw; server and
 153      glues together the core libraries and the filter modules into one
 154      great Information Retrieval server application.
 155     </para>
 156     <para>
 157      The Debian  package <literal>idzebra-2.0-utils</literal> contains
 158      the  <command>zebrasrv</command> utility.
 159     </para>
 160    </section>
 161
 162    <section id="componentyazserver">
 163     <title>&yaz; Server Frontend</title>
 164     <para>
 165      The &yaz; server frontend is
 166      a full-fledged stateful &acro.z3950; server taking client
 167      connections, and forwarding search and scan requests to the
 168      &zebra; core indexer.
 169     </para>
 170     <para>
 171      In addition to &acro.z3950; requests, the &yaz; server frontend acts
 172      as an HTTP server, honoring
 173       <ulink url="&url.sru;">&acro.sru; &acro.soap;</ulink>
 174      requests, and
 175      &acro.sru; &acro.rest;
 176      requests. Moreover, it can
 177      translate incoming
 178      <ulink url="&url.cql;">&acro.cql;</ulink>
 179      queries to
 180      <ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>
 181       queries, if
 182      correctly configured.
 183     </para>
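    <para>
     As a hedged sketch of what "correctly configured" can mean here,
     the &yaz; frontend server can be given an &acro.xml; configuration
     file that names a &acro.cql; to &acro.pqf; mapping file. The file
     names <filename>yazgfs.xml</filename> and
     <filename>pqf.properties</filename> and the port number below are
     assumptions chosen for illustration:
     <screen><![CDATA[
       <yazgfs>
         <listen id="public">tcp:@:9999</listen>
         <server id="main" listenref="public">
           <!-- Zebra configuration used by this server -->
           <config>zebra.cfg</config>
           <!-- CQL to PQF mapping rules (hypothetical file name) -->
           <cql2rpn>pqf.properties</cql2rpn>
         </server>
       </yazgfs>
     ]]></screen>
     The server would then be started with
     <literal>zebrasrv -f yazgfs.xml</literal>.
    </para>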
 184     <para>
 185      <ulink url="&url.yaz;">&yaz;</ulink>
 186      is an Open Source
 187      toolkit that allows you to develop software using the
 188      &acro.ansi; &acro.z3950;/ISO23950 standard for information retrieval.
 189      It is packaged in the Debian packages
 190      <literal>yaz</literal> and <literal>libyaz</literal>.
 191     </para>
 192    </section>
 193
 194    <section id="componentmodules">
 195     <title>Record Models and Filter Modules</title>
 196     <para>
 197      The hard work of knowing <emphasis>what</emphasis> to index,
 198      <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
 199      part of the records to send in a search/retrieve response is
 200      implemented in
 201      various filter modules. It is their responsibility to define the
 202      exact indexing and record display filtering rules.
 203      </para>
 204      <para>
 205      The virtual Debian package
 206      <literal>libidzebra-2.0-modules</literal> installs all base filter
 207      modules.
 208     </para>
 209
 210    <section id="componentmodulesdom">
 211     <title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
 212      <para>
 213       The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
 214       internal data model, and can thus parse, index, and display
 215       any &acro.xml; document.
 216     </para>
 217     <para>
 218       A parser for binary &acro.marc; records based on the ISO2709 library
 219       standard is provided; it transforms these to the internal
 220       &acro.marcxml; &acro.dom; representation.
 221     </para>
 222     <para>
 223       The internal &acro.dom; &acro.xml; representation can be fed into four
 224       different pipelines, consisting of arbitrarily many successive
 225       &acro.xslt; transformations; these are for
 226      <itemizedlist>
 227        <listitem><para>input parsing and initial
 228           transformations,</para></listitem>
 229        <listitem><para>indexing term extraction
 230           transformations,</para></listitem>
 231        <listitem><para>transformations before internal document
 232           storage, and </para></listitem>
 233        <listitem><para>retrieve transformations from storage to output
 234           format.</para></listitem>
 235       </itemizedlist>
 236     </para>
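    <para>
     The four pipelines can be pictured with a filter configuration
     along the following lines. This is a sketch only: the stylesheet
     file names and the retrieve name <literal>dc</literal> are
     hypothetical, and the authoritative syntax is given in the
     &acro.dom; &acro.xml; record model chapter.
     <screen><![CDATA[
       <dom xmlns="http://indexdata.com/zebra-2.0">
         <!-- 1. input parsing and initial transformations -->
         <input>
           <xmlreader level="1"/>
         </input>
         <!-- 2. indexing term extraction -->
         <extract>
           <xslt stylesheet="index.xsl"/>
         </extract>
         <!-- 3. transformations before internal storage -->
         <store>
           <xslt stylesheet="store.xsl"/>
         </store>
         <!-- 4. retrieval from storage to an output format -->
         <retrieve name="dc">
           <xslt stylesheet="to-dc.xsl"/>
         </retrieve>
       </dom>
     ]]></screen>
     Each pipeline may hold several <literal>xslt</literal> steps in
     sequence, which is what "arbitrarily many successive
     transformations" refers to.
    </para>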
 237     <para>
 238       The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on
 239       your platform, even &acro.exslt;). This brings full &acro.xpath;
 240       support to the indexing, storage and display rules of not only
 241       &acro.xml; documents, but also binary &acro.marc; records.
 242     </para>
 243     <para>
 244       Finally, the &acro.dom; &acro.xml; filter allows for static ranking at index
 245       time, and for sorting hit lists according to predefined
 246       static ranks.
 247     </para>
 248     <para>
 249       Details on the experimental &acro.dom; &acro.xml; filter are found in
 250       <xref linkend="record-model-domxml"/>.
 251       </para>
 252      <para>
 253       The Debian package <literal>libidzebra-2.0-mod-dom</literal>
 254       contains the &acro.dom; filter module.
 255      </para>
 256     </section>
 257
 258    <section id="componentmodulesalvis">
 259     <title>ALVIS &acro.xml; Record Model and Filter Module</title>
 260      <note>
 261       <para>
 262         The functionality of this record model has been improved and
 263         replaced by the &acro.dom; &acro.xml; record model. See
 264         <xref linkend="componentmodulesdom"/>.
 265       </para>
 266      </note>
 267
 268      <para>
 269       The Alvis filter for &acro.xml; files is an &acro.xslt; based input
 270       filter.
 271       It indexes element and attribute content of any conceivable &acro.xml; format
 272       using full &acro.xpath; support, a feature which the standard &zebra;
 273       &acro.grs1; &acro.sgml; and &acro.xml; filters lacked. The indexed documents are
 274       parsed into a standard &acro.xml; &acro.dom; tree, which restricts record size
 275       according to availability of memory.
 276     </para>
 277     <para>
 278       The Alvis filter
 279       uses &acro.xslt; display stylesheets, which let
 280       the &zebra; DB administrator associate multiple, different views on
 281       the same &acro.xml; document type. These views are chosen on the fly at
 282       search time.
 283      </para>
 284     <para>
 285       In addition, the Alvis filter configuration is not bound to the
 286       arcane  &acro.bib1; &acro.z3950; library catalogue indexing traditions and
 287       folklore, and is therefore easier to understand.
 288     </para>
 289     <para>
 290       Finally, the Alvis  filter allows for static ranking at index
 291       time, and for sorting hit lists according to predefined
 292       static ranks. This imposes no overhead at all; both
 293       search and indexing still perform
 294       <emphasis>O(1)</emphasis> irrespective of document
 295       collection size. This feature resembles Google's pre-ranking using
 296       their PageRank algorithm.
 297     </para>
 298     <para>
 299       Details on the experimental Alvis &acro.xslt; filter are found in
 300       <xref linkend="record-model-alvisxslt"/>.
 301       </para>
 302      <para>
 303       The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
 304       contains the Alvis filter module.
 305      </para>
 306     </section>
 307
 308    <section id="componentmodulesgrs">
 309     <title>&acro.grs1; Record Model and Filter Modules</title>
 310      <note>
 311       <para>
 312         The functionality of this record model has been improved and
 313         replaced by the &acro.dom; &acro.xml; record model. See
 314         <xref linkend="componentmodulesdom"/>.
 315       </para>
 316      </note>
 317     <para>
 318     The &acro.grs1; filter modules described in
 319     <xref linkend="grs"/>
 320     are all based on the &acro.z3950; specifications, and it is absolutely
 321     mandatory to have the reference pages on &acro.bib1; attribute sets
 322     at hand when configuring &acro.grs1; filters. The GRS filters come in
 323     different flavors, and a short introduction is needed here.
 324     &acro.grs1; filters of various kinds have also been called ABS filters due
 325     to the <filename>*.abs</filename> configuration file suffix.
 326     </para>
 327     <para>
 328       The <emphasis>grs.marc</emphasis> and
 329       <emphasis>grs.marcxml</emphasis> filters are suited to parse and
 330       index binary and &acro.xml; versions of traditional library &acro.marc; records
 331       based on the ISO2709 standard. The Debian package for both
 332       filters is
 333      <literal>libidzebra-2.0-mod-grs-marc</literal>.
 334     </para>
 335     <para>
 336       &acro.grs1; TCL scriptable filters for extensive user configuration come
 337      in two flavors: a regular expression filter
 338      <emphasis>grs.regx</emphasis> using TCL regular expressions, and
 339      a general scriptable TCL filter called
 340      <emphasis>grs.tcl</emphasis>.
 341      Both are included in the
 342      <literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
 343     </para>
 344     <para>
 345       A general purpose &acro.sgml; filter is called
 346      <emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
 347      but is planned for the
 348      <literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
 349     </para>
 350     <para>
 351       The Debian  package
 352       <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
 353       <emphasis>grs.xml</emphasis> filter which uses <ulink
 354       url="&url.expat;">Expat</ulink> to
 355       parse records in &acro.xml; and turn them into ID&zebra;'s internal &acro.grs1; node
 356       trees. See also the Alvis &acro.xml;/&acro.xslt; filter described in
 357       <xref linkend="componentmodulesalvis"/>.
 358     </para>
 359    </section>
 360
 361    <section id="componentmodulestext">
 362     <title>TEXT Record Model and Filter Module</title>
 363     <para>
 364       Plain ASCII text filter. TODO: add information here.
 365     </para>
 366    </section>
 367
 368     <!--
 369    <section id="componentmodulessafari">
 370     <title>SAFARI Record Model and Filter Module</title>
 371     <para>
 372      SAFARI filter module TODO: add information here.
 373     </para>
 374    </section>
 375     -->
 376
 377    </section>
 378
 379   </section>
 380
 381
 382   <section id="architecture-workflow">
 383    <title>Indexing and Retrieval Workflow</title>
 384
 385   <para>
 386    Records pass through three different states during processing in the
 387    system.
 388   </para>
 389
 390   <para>
 391
 392    <itemizedlist>
 393     <listitem>
 394
 395      <para>
 396       When records are accessed by the system, they are represented
 397       in their local, or native format. This might be &acro.sgml; or HTML files,
 398       News or Mail archives, or &acro.marc; records. If the system doesn't already
 399       know how to read the type of data you need to store, you can set up an
 400       input filter by preparing conversion rules based on regular
 401       expressions and possibly augmented by a flexible scripting language
 402       (Tcl).
 403       The input filter produces as output an internal representation,
 404       a tree structure.
 405
 406      </para>
 407     </listitem>
 408     <listitem>
 409
 410      <para>
 411       When records are processed by the system, they are represented
 412       in a tree structure constructed from tagged data elements hanging off a
 413       root node. The tagged elements may contain data or yet more tagged
 414       elements in a recursive structure. The system performs various
 415       actions on this tree structure (indexing, element selection, schema
 416       mapping, etc.).
 417
 418      </para>
 419     </listitem>
 420     <listitem>
 421
 422      <para>
 423       Before records are transmitted to the client, they are first
 424       converted from the internal structure to a form suitable for exchange
 425       over the network - according to the &acro.z3950; standard.
 426      </para>
 427     </listitem>
 428
 429    </itemizedlist>
 430
 431   </para>
 432   </section>
 433
 434   <section id="special-retrieval">
 435    <title>Retrieval of &zebra; internal record data</title>
 436    <para>
 437     Starting with <literal>&zebra;</literal> version 2.0.5, it is
 438     possible to use a special element set which has the prefix
 439     <literal>zebra::</literal>.
 440    </para>
 441    <para>
 442     Using this element set will, regardless of record type, return
 443     &zebra;'s internal index structure/data for a record.
 444     In particular, the regular record filters are not invoked when
 445     these are in use.
 446     This can in some cases make the retrieval faster than regular
 447     retrieval operations (for &acro.marc;, &acro.xml; etc).
 448    </para>
 449    <table id="special-retrieval-types">
 450     <title>Special Retrieval Elements</title>
 451     <tgroup cols="3">
 452      <thead>
 453       <row>
 454        <entry>Element Set</entry>
 455        <entry>Description</entry>
 456        <entry>Syntax</entry>
 457       </row>
 458      </thead>
 459      <tbody>
 460       <row>
 461        <entry><literal>zebra::meta::sysno</literal></entry>
 462        <entry>Get &zebra; record system ID</entry>
 463        <entry>&acro.xml; and &acro.sutrs;</entry>
 464       </row>
 465       <row>
 466        <entry><literal>zebra::data</literal></entry>
 467        <entry>Get raw record</entry>
 468        <entry>all</entry>
 469       </row>
 470       <row>
 471        <entry><literal>zebra::meta</literal></entry>
 472        <entry>Get &zebra; record internal metadata</entry>
 473        <entry>&acro.xml; and &acro.sutrs;</entry>
 474       </row>
 475       <row>
 476        <entry><literal>zebra::index</literal></entry>
 477        <entry>Get all indexed keys for record</entry>
 478        <entry>&acro.xml; and &acro.sutrs;</entry>
 479       </row>
 480       <row>
 481        <entry>
 482         <literal>zebra::index::</literal><replaceable>f</replaceable>
 483        </entry>
 484        <entry>
 485         Get indexed keys for field <replaceable>f</replaceable> for record
 486        </entry>
 487        <entry>&acro.xml; and &acro.sutrs;</entry>
 488       </row>
 489       <row>
 490        <entry>
 491         <literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
 492        </entry>
 493        <entry>
 494         Get indexed keys for field <replaceable>f</replaceable>
 495           and type <replaceable>t</replaceable> for record
 496        </entry>
 497        <entry>&acro.xml; and &acro.sutrs;</entry>
 498       </row>
 499       <row>
 500        <entry>
 501         <literal>zebra::snippet</literal>
 502        </entry>
 503        <entry>
 504         Get snippet for record for one or more indexes (f1,f2,..).
 505         This includes a phrase from the original
 506         record at the point where a match occurs (for a query). By default,
 507         a few terms before and after the match are included in the snippet. The
 508         matching terms are enclosed within element
 509         <literal>&lt;s&gt;</literal>. The snippet facility requires
 510         Zebra 2.0.16 or later.
 511        </entry>
 512        <entry>&acro.xml; and &acro.sutrs;</entry>
 513       </row>
 514       <row>
 515        <entry>
 516         <literal>zebra::facet::</literal><replaceable>f1</replaceable>:<replaceable>t1</replaceable>,<replaceable>f2</replaceable>:<replaceable>t2</replaceable>,..
 517        </entry>
 518        <entry>
 519         Get facet of a result set. The facet result is returned
 520         as if it were a normal record, while in reality it is a
 521         summary of the most "important" terms in a result set for the fields
 522         given.
 523         The facet facility first appeared in Zebra 2.0.20.
 524        </entry>
 525        <entry>&acro.xml;</entry>
 526       </row>
 527      </tbody>
 528     </tgroup>
 529    </table>
 530    <para>
 531     For example, to fetch the raw binary record data stored in the
 532     &zebra; internal storage, or on the filesystem, the following
 533     commands can be issued:
 534     <screen>
 535       Z> f @attr 1=title my
 536       Z> format xml
 537       Z> elements zebra::data
 538       Z> s 1+1
 539       Z> format sutrs
 540       Z> s 1+1
 541       Z> format usmarc
 542       Z> s 1+1
 543     </screen>
 544     </para>
 545    <para>
 546     The special
 547     <literal>zebra::data</literal> element set name is
 548     defined for any record syntax, but will always fetch
 549     the raw record data in exactly the original form. No record syntax
 550     specific transformations will be applied to the raw record data.
 551    </para>
 552    <para>
 553     Also, &zebra; internal metadata about the record can be accessed:
 554     <screen>
 555       Z> f @attr 1=title my
 556       Z> format xml
 557       Z> elements zebra::meta::sysno
 558       Z> s 1+1
 559     </screen>
 560     displays only the internal record system number, in
 561     <literal>&acro.xml;</literal> record syntax, whereas
 562     <screen>
 563       Z> f @attr 1=title my
 564       Z> format xml
 565       Z> elements zebra::meta
 566       Z> s 1+1
 567     </screen>
 568     displays all available metadata on the record. These include the system
 569     number, database name, indexed filename, filter used for indexing,
 570     score and static ranking information, and finally the byte size of the record.
 571    </para>
 572    <para>
 573     Sometimes it is very hard to figure out exactly what has been
 574     indexed, how, and in which indexes. Using the indexing stylesheet of
 575     the Alvis filter, one can at least see which portion of the record
 576     went into which index, but no similar aid exists for the
 577     other indexing filters.
 578    </para>
 579    <para>
 580     The special
 581     <literal>zebra::index</literal> element set names are provided to
 582     access information on per-record indexed fields. For example, the
 583     queries
 584     <screen>
 585       Z> f @attr 1=title my
 586       Z> format sutrs
 587       Z> elements zebra::index
 588       Z> s 1+1
 589     </screen>
 590     will display all indexed tokens from all indexed fields of the
 591     first record, and it will display in <literal>&acro.sutrs;</literal>
 592     record syntax, whereas
 593     <screen>
 594       Z> f @attr 1=title my
 595       Z> format xml
 596       Z> elements zebra::index::title
 597       Z> s 1+1
 598       Z> elements zebra::index::title:p
 599       Z> s 1+1
 600     </screen>
 601     displays in <literal>&acro.xml;</literal> record syntax only the content
 602       of the zebra string index <literal>title</literal>, or
 603       even only the type <literal>p</literal> phrase indexed part of it.
 604    </para>
 605    <note>
 606     <para>
 607      Trying to access numeric <literal>&acro.bib1;</literal> use
 608      attributes or trying to access non-existent &zebra; internal string
 609      access points will result in a Diagnostic 25: Specified element set
 610      name not valid for specified database.
 611     </para>
 612    </note>
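   <para>
    The snippet and facet element sets listed in
    <xref linkend="special-retrieval-types"/> are requested in the same
    way. The session below is only a sketch: it assumes that an index
    named <literal>title</literal> with index type <literal>w</literal>
    exists in the database, and it does not show any output, since that
    depends on the indexed data.
    <screen>
      Z> f @attr 1=title my
      Z> format xml
      Z> elements zebra::snippet
      Z> s 1+1
      Z> elements zebra::facet::title:w
      Z> s 1+1
    </screen>
    The facet request uses the
    <replaceable>f</replaceable>:<replaceable>t</replaceable> field and
    type notation from the table; several such pairs can be given,
    separated by commas.
   </para>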
 613   </section>
 614
 615  </chapter>
 616
 617  <!-- Keep this comment at the end of the file
 618  Local variables:
 619  mode: sgml
 620  sgml-omittag:t
 621  sgml-shorttag:t
 622  sgml-minimize-attributes:nil
 623  sgml-always-quote-attributes:t
 624  sgml-indent-step:1
 625  sgml-indent-data:t
 626  sgml-parent-document: "zebra.xml"
 627  sgml-local-catalogs: nil
 628  sgml-namecase-general:t
 629  End:
 630  -->