doc/architecture.xml

   1  <chapter id="architecture">
   2   <!-- $Id: architecture.xml,v 1.4 2006-02-15 12:08:48 marc Exp $ -->
   3   <title>Overview of Zebra Architecture</title>
   4
   5
   6   <sect1 id="architecture-representation">
   7    <title>Local Representation</title>
   8
   9    <para>
  10     As mentioned earlier, Zebra places few restrictions on the type of
  11     data that you can index and manage. Generally, whatever the form of
  12     the data, it is parsed by an input filter specific to that format, and
  13     turned into an internal structure that Zebra knows how to handle. This
  14     process takes place whenever the record is accessed - for indexing and
  15     retrieval.
  16    </para>
  17
  18    <para>
  19     The RecordType parameter in the <literal>zebra.cfg</literal> file, or
   20     the <literal>-t</literal> option to the indexer, tells Zebra how to
  21     process input records.
  22     Two basic types of processing are available - raw text and structured
  23     data. Raw text is just that, and it is selected by providing the
  24     argument <emphasis>text</emphasis> to Zebra. Structured records are
  25     all handled internally using the basic mechanisms described in the
  26     subsequent sections.
  27     Zebra can read structured records in many different formats.
  28     <!--
  29     How this is done is governed by additional parameters after the
  30     "grs" keyword, separated by "." characters.
  31     -->
  32    </para>
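   <para>
    As a minimal sketch, assuming a directory named
    <literal>records</literal> (a hypothetical name chosen for this
    example) that holds the input files, raw text processing could be
    selected in <literal>zebra.cfg</literal> like this:
   </para>
   <screen>
    # zebra.cfg: treat all input records as raw text
    recordType: text
   </screen>
   <para>
    The same choice could instead be made on the indexer command line:
   </para>
   <screen>
    zebraidx -t text update records
   </screen>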
  33   </sect1>
  34
  35   <sect1 id="architecture-workflow">
  36    <title>Indexing and Retrieval Workflow</title>
  37
  38   <para>
  39    Records pass through three different states during processing in the
  40    system.
  41   </para>
  42
  43   <para>
  44
  45    <itemizedlist>
  46     <listitem>
  47
  48      <para>
  49       When records are accessed by the system, they are represented
  50       in their local, or native format. This might be SGML or HTML files,
   51       News or Mail archives, or MARC records. If the system doesn't already
  52       know how to read the type of data you need to store, you can set up an
  53       input filter by preparing conversion rules based on regular
  54       expressions and possibly augmented by a flexible scripting language
  55       (Tcl).
  56       The input filter produces as output an internal representation,
  57       a tree structure.
  58
  59      </para>
  60     </listitem>
  61     <listitem>
  62
  63      <para>
  64       When records are processed by the system, they are represented
   65       in a tree structure, constructed of tagged data elements hanging off a
   66       root node. The tagged elements may contain data or yet more tagged
   67       elements in a recursive structure. The system performs various
   68       actions on this tree structure (indexing, element selection, schema
   69       mapping, etc.).
  70
  71      </para>
  72     </listitem>
  73     <listitem>
  74
  75      <para>
   76       Before records are transmitted to the client, they are first
  77       converted from the internal structure to a form suitable for exchange
  78       over the network - according to the Z39.50 standard.
  79      </para>
  80     </listitem>
  81
  82    </itemizedlist>
  83
  84   </para>
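  <para>
   To illustrate the tree structure described above, consider the
   following small XML record. This is only a sketch: the element names
   are hypothetical and are not taken from any Zebra schema. After
   passing through an input filter, such a record would be held as a
   root node (<literal>record</literal>) with tagged elements hanging
   off it, one of which (<literal>author</literal>) contains a further
   tagged element.
  </para>
  <screen><![CDATA[
   <record>
    <title>An example title</title>
    <author>
     <name>A. N. Author</name>
    </author>
   </record>
  ]]></screen>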
  85   </sect1>
  86
  87
  88   <sect1 id="architecture-maincomponents">
  89    <title>Main Components</title>
  90    <para>
  91     The Zebra system is designed to support a wide range of data management
  92     applications. The system can be configured to handle virtually any
  93     kind of structured data. Each record in the system is associated with
  94     a <emphasis>record schema</emphasis> which lends context to the data
  95     elements of the record.
  96     Any number of record schemas can coexist in the system.
  97     Although it may be wise to use only a single schema within
  98     one database, the system poses no such restrictions.
  99    </para>
 100    <para>
 101     The Zebra indexer and information retrieval server consists of the
 102     following main applications: the <literal>zebraidx</literal>
 103     indexing maintenance utility, and the <literal>zebrasrv</literal>
  104     information query and retrieval server. Both use some of the
 105     same main components, which are presented here.
 106    </para>
 107    <para>
  108     The virtual package <literal>idzebra1.4</literal> installs all
  109     the necessary packages to start working with Zebra, including
  110     utility programs, development libraries, documentation and
  111     modules.
 112   </para>
 113
 114    <sect2 id="componentcore">
 115     <title>Core Zebra Module Containing Common Functionality</title>
 116     <para>
  117      The core module loads external filter modules used for
  118      presenting the records in a search response; executes search
  119      requests in PQF/RPN, which are handed over from the YAZ server
  120      frontend API; calls resorting/reranking algorithms on the hit
  121      sets; and returns result sets (possibly ranked), hit numbers,
  122      and similar internal data to the YAZ server backend
  123      API.
 124     </para>
 125     <para>
  126      The package <literal>libidzebra1.4</literal> contains all
  127      run-time libraries for Zebra.
  128      The package <literal>idzebra1.4-doc</literal> includes
  129      documentation for Zebra in PDF and HTML.
  130      The package <literal>idzebra1.4-common</literal> includes
  131      common essential Zebra configuration files.
 132     </para>
 133    </sect2>
 134
 135
 136    <sect2 id="componentindexer">
 137     <title>Zebra Indexer</title>
 138     <para>
  139      The core Zebra indexer loads external filter
  140      modules used for indexing data records of
  141      different types, and creates, updates and drops
  142      databases and indexes.
 143     </para>
 144     <para>
  145      The package <literal>idzebra1.4-utils</literal> contains Zebra
  146      utilities such as the zebraidx indexer utility and the zebrasrv
  147      server.
 148     </para>
 149    </sect2>
 150
 151    <sect2 id="componentsearcher">
 152     <title>Zebra Searcher/Retriever</title>
 153     <para>
 154      the core Zebra searcher/retriever which
 155     </para>
 156     <para>
  157      The package <literal>idzebra1.4-utils</literal> contains Zebra
  158      utilities such as the zebraidx indexer utility and the zebrasrv
  159      server, and their associated man pages.
 160     </para>
 161    </sect2>
 162
 163    <sect2 id="componentyazserver">
 164     <title>YAZ Server Frontend</title>
 165     <para>
 166      The YAZ server frontend is
  167      a full-fledged stateful Z39.50 server that accepts client
  168      connections and forwards search and scan requests to the
 169      Zebra core indexer.
 170     </para>
 171     <para>
 172      In addition to Z39.50 requests, the YAZ server frontend acts
  173      as an HTTP server, honouring
  174       <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> SOAP requests, and <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink> REST requests. Moreover, it can
  175      translate incoming <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> queries to PQF/RPN queries, if
 176      correctly configured.
 177     </para>
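    <para>
     As an illustrative sketch only, assuming a server listening on
     <literal>localhost</literal> port 9999 and a CQL index named
     <literal>text</literal> that is mapped in the server's CQL-to-PQF
     configuration (host, port and index name are assumptions made for
     this example), an SRU request carrying a CQL query could look like
     this:
    </para>
    <screen>
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=text=(plant%20and%20soil)
    </screen>
    <para>
     With such a mapping in place, the CQL query
     <literal>text=(plant and soil)</literal> would correspond to a
     PQF/RPN query along the lines of
     <literal>@attr 1=text @and plant soil</literal>.
    </para>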
 178     <para>
  179     YAZ is a toolkit that allows you to develop software using the
  180     ANSI Z39.50/ISO23950 standard for information retrieval, as well as
  181     <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/<ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>.
  182     Related libraries and packages:
  183     <literal>libyazthread.so</literal>, <literal>libyaz.so</literal>,
  184     <literal>libyaz</literal>.
 185     </para>
 186    </sect2>
 187
 188    <sect2 id="componentmodules">
 189     <title>Record Models and Filter Modules</title>
 190     <para>
  191       The filter modules perform indexing and record display filtering.
  192       The virtual package <literal>libidzebra1.4-modules</literal>
  193       contains all base IDZebra filter modules. <!-- EMPTY ??? -->
 194     </para>
 195
 196    <sect3 id="componentmodulestext">
 197     <title>TEXT Record Model and Filter Module</title>
 198     <para>
 199       Plain ASCII text filter
 200      <!--
 201      <literal>text module missing as deb file<literal>
 202      -->
 203     </para>
 204    </sect3>
 205
 206    <sect3 id="componentmodulesgrs">
 207     <title>GRS Record Model and Filter Modules</title>
 208     <para>
  209     See <xref linkend="record-model-grs"/>.
  210     GRS filters of various kinds (*.abs files):
  211     <itemizedlist>
  212      <listitem><para>
  213       <literal>grs.danbib</literal> parses DanBib records. DanBib is the
  214       Danish Union Catalogue hosted by DBC (Danish Bibliographic Centre).
  215       Package: <literal>libidzebra1.4-mod-grs-danbib</literal>.
  216      </para></listitem>
  217      <listitem><para>
  218       <literal>grs.marc</literal> and <literal>grs.marcxml</literal> allow
  219       IDZebra to read MARC records based on ISO2709.
  220       Package: <literal>libidzebra1.4-mod-grs-marc</literal>.
  221      </para></listitem>
  222      <listitem><para>
  223       <literal>grs.regx</literal> and <literal>grs.tcl</literal>, the
  224       latter being the Tcl-scriptable GRS filter.
  225       Package: <literal>libidzebra1.4-mod-grs-regx</literal>.
  226      </para></listitem>
  227      <listitem><para>
  228       <literal>grs.sgml</literal>. Package:
  229       <literal>libidzebra1.4-mod-grs-sgml</literal> (not packaged yet?).
  230      </para></listitem>
  231      <listitem><para>
  232       <literal>grs.xml</literal> uses <ulink url="http://expat.sourceforge.net/">Expat</ulink> to parse
  233       records in XML and turn them into IDZebra's internal grs node.
  234       Package: <literal>libidzebra1.4-mod-grs-xml</literal>.
  235      </para></listitem>
  236     </itemizedlist>
  237
  238
 239     </para>
 240    </sect3>
 241
 242    <sect3 id="componentmodulesalvis">
 243     <title>ALVIS Record Model and Filter Module</title>
 244      <para>
  245       See <xref linkend="record-model-alvisxslt"/>.
  246       The experimental Alvis XSLT filter <literal>alvis</literal> is
  247       provided by the module <literal>mod-alvis.so</literal> in the
  248       package <literal>libidzebra1.4-mod-alvis</literal>.
 249      </para>
 250     </sect3>
 251
 252    <sect3 id="componentmodulessafari">
 253     <title>SAFARI Record Model and Filter Module</title>
 254     <para>
 255      - safari
 256      <!--
 257      <literal>safari module missing as deb file<literal>
 258      -->
 259     </para>
 260    </sect3>
 261
 262    </sect2>
 263
 264    <!--
 265    <sect2 id="componentconfig">
 266     <title>Configuration Files</title>
 267     <para>
 268      - yazserver XML based config file
 269      - core Zebra ascii based config files
 270      - filter module config files in many flavours
 271      - <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> to PQF ascii based config file
 272     </para>
 273    </sect2>
 274    -->
 275   </sect1>
 276
 277   <!--
 278
 279
 280   <sect1 id="cqltopqf">
 281    <title>Server Side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> To PQF Conversion</title>
 282    <para>
 283   The cql2pqf.txt yaz-client config file, which is also used in the
  284   yaz-server <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF process, is used to drive
 285   org.z3950.zing.cql.<ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>Node's toPQF() back-end and the YAZ <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF
 286   converter.  This specifies the interpretation of various <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>
 287   indexes, relations, etc. in terms of Type-1 query attributes.
 288
 289   This configuration file generates queries using BIB-1 attributes.
 290   See http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html
 291   for the Maintenance Agency's work-in-progress mapping of Dublin Core
 292   indexes to Attribute Architecture (util, XD and BIB-2)
 293   attributes.
 294
 295   a) <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> set prefixes  are specified using the correct <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>/ <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/U
 296   prefixes for the required index sets, or user-invented prefixes for
 297   special index sets. An index set in <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> is roughly speaking equivalent to a
 298   namespace specifier in XML.
 299
  300   b) The default index set to be used if none is explicitly mentioned
 301
 302   c) Index mapping definitions of the form
 303
 304       index.cql.all  = 1=text
 305
 306   which means that the index "all" from the set "cql" is mapped on the
 307   bib-1 RPN query "@attr 1=text" (where "text" is some existing index
 308   in zebra, see indexing stylesheet)
 309
 310   d) Relation mapping from <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> relations to bib-1 RPN "@attr 2= " stuff
 311
 312   e) Relation modifier mapping from <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> relations to bib-1 RPN "@attr
 313   2= " stuff
 314
 315   f) Position attributes
 316
 317   g) structure attributes
 318
 319   h) truncation attributes
 320
 321   See
 322   http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map for config
 323   file details.
 324
 325
 326    </para>
 327   </sect1>
 328
 329
 330   <sect1 id="ranking">
 331    <title>Static and Dynamic Ranking</title>
 332    <para>
  333       Zebra internally uses inverted indexes to look up term occurrences
 334   in documents. Multiple queries from different indexes can be
 335   combined by the binary boolean operations AND, OR and/or NOT (which
 336   is in fact a binary AND NOT operation). To ensure fast query execution
 337   speed, all indexes have to be sorted in the same order.
 338
 339   The indexes are normally sorted according to document ID in
 340   ascending order, and any query which does not invoke a special
 341   re-ranking function will therefore retrieve the result set in document ID
 342   order.
 343
 344   If one defines the
 345
 346     staticrank: 1
 347
 348   directive in the main core Zebra config file, the internal document
  349   keys used for ordering are augmented by a preceding integer, which
 350   contains the static rank of a given document, and the index lists
 351   are ordered
 352     - first by ascending static rank
 353     - then by ascending document ID.
 354
 355   This implies that the default rank "0" is the best rank at the
 356   beginning of the list, and "max int" is the worst static rank.
 357
  358   The "alvis" and the experimental "xslt" filters provide a
  359   directive to fetch static rank information out of the indexed XML
  360   records, thus making _all_ hit sets ordered by ascending static
  361   rank, and for those docs which have the same static rank, ordered
  362   by ascending doc ID.
  363   If one wants to do a little fiddling with the static rank order,
  364   one has to invoke additional re-ranking/re-ordering using dynamic
  365   reranking or score functions. These functions return positive
  366   integer scores, where the _highest_ score is best, which means that the
  367   hit sets will be sorted according to _descending_ scores (in contrast
  368   to the index lists which are sorted according to _ascending_ rank
  369   number and document ID).
 370
 371
 372   Those are defined in the zebra C source files
 373
 374    "rank-1" : zebra/index/rank1.c
 375               default TF/IDF like zebra dynamic ranking
 376    "rank-static" : zebra/index/rankstatic.c
 377               do-nothing dummy static ranking (this is just to prove
 378               that the static rank can be used in dynamic ranking functions)
 379    "zvrank" : zebra/index/zvrank.c
 380               many different dynamic TF/IDF ranking functions
 381
  382    They are enabled in the zebra config file by a directive like:
 383
 384    rank: rank-static
 385
 386    Notice that the "rank-1" and "zvrank" do not use the static rank
 387    information in the list keys, and will produce the same ordering
  388    with or without static ranking enabled.
 389
 390    The dummy "rank-static" reranking/scoring function returns just
 391      score = max int - staticrank
  392    in order to preserve the ordering of hit sets with and without its
 393    call.
 394
 395    Obviously, one wants to make a new ranking function, which combines
 396    static and dynamic ranking, which is left as an exercise for the
  397    reader .. (Wray, this is yours ...)
 398
 399
 400    </para>
 401
 402
 403    <para>
 404     yazserver frontend config file
 405
 406   db/yazserver.xml
 407
 408   Setup of listening ports, and virtual zebra servers.
 409   Note path to server-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF config file, and to
 410    <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> explain config section.
 411
 412   The <directory> path is relative to the directory where zebra.init is placed
  413   and is started up. The other paths are relative to <directory>,
 414   which in this case is the same.
 415
 416   see: http://www.indexdata.com/yaz/doc/server.vhosts.tkl
 417
 418    </para>
 419
 420    <para>
 421  Z39.50 searching:
 422
 423   search like this (using client-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF conversion):
 424
 425   yaz-client -q db/cql2pqf.txt localhost:9999
 426   > format xml
 427   > querytype cql2rpn
 428   > f text=(plant and soil)
 429   > s 1
 430   > elements dc
 431   > s 1
 432   > elements index
 433   > s 1
 434   > elements alvis
 435   > s 1
 436   > elements snippet
 437   > s 1
 438
 439
 440   search like this (using server-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF conversion):
 441   (the only difference is "querytype cql" instead of
 442    "querytype cql2rpn" and the call without specifying a local
 443   conversion file)
 444
 445   yaz-client localhost:9999
 446  > format xml
 447   > querytype cql
 448   > f text=(plant and soil)
 449   > s 1
 450   > elements dc
 451   > s 1
 452   > elements index
 453   > s 1
 454   > elements alvis
 455   > s 1
 456   > elements snippet
 457   > s 1
 458
 459   NEW: static relevance ranking - see examples in alvis2index.xsl
 460
 461   > f text = /relevant (plant and soil)
 462   > elem dc
 463   > s 1
 464
 465   > f title = /relevant a
 466   > elem dc
 467   > s 1
 468
 469
 470
 471  <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/U searching
 472  Surf into http://localhost:9999
 473
 474  firefox http://localhost:9999
 475
 476  gives you an explain record. Unfortunately, the data found in the
  477  <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF text file must be added by hand into the explain
  478  section of the yazserver.xml file. Too bad, but this is all extremely
  479  new alpha stuff, and a lot of work has yet to be done ..
 480
 481  Searching via  <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>: surf into the URL (lines broken here - concat on
 482  URL line)
 483
 484  - see number of hits:
 485  http://localhost:9999/?version=1.1&operation=searchRetrieve
 486                        &query=text=(plant%20and%20soil)
 487
 488
  489  - fetch records 5-6 in DC format
 490  http://localhost:9999/?version=1.1&operation=searchRetrieve
 491                        &query=text=(plant%20and%20soil)
 492                        &startRecord=5&maximumRecords=2&recordSchema=dc
 493
 494
 495  - even search using PQF queries using the extended verb "x-pquery",
 496    which is special to YAZ/Zebra
 497
 498  http://localhost:9999/?version=1.1&operation=searchRetrieve
 499                        &x-pquery=@attr%201=text%20@and%20plant%20soil
 500
 501  More info: read the fine manuals at http://www.loc.gov/z3950/agency/zing/srw/
 503  Search via  <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>:
 504  read the fine manual at
 505  http://www.loc.gov/z3950/agency/zing/srw/
 506
 507
 508 and so on. The list of available indexes is found in db/cql2pqf.txt
 509
 510
 511 7) How do you add to the index attributes of any other type than "w"?
 512 I mean, in the context of making <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> queries. Let's say I want a date
 513 attribute in there, so that one could do date > 20050101 in <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>.
 514
 515 Currently for example 'date-modified' is of type 'w'.
 516
  517 The 2-seconds-of-thought solution:
 518
 519      in alvis2index.sl:
 520
 521   <z:index name="date-modified" type="d">
 522       <xsl:value-of
 523            select="acquisition/acquisitionData/modifiedDate"/>
 524     </z:index>
 525
 526 But here's the catch...doesn't the use of the 'd' type require
 527 structure type 'date' (@attr 4=5) in PQF? But then...how does that
 528 reflect in the <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>->RPN/PQF mapping - does it really work if I just
 529 change the type of an element in alvis2index.sl? I would think not...?
 530
 531
 532
 533
 534               Kimmo
 535
 536
 537 Either do:
 538
 539    f @attr 4=5 @attr 1=date-modified 20050713
 540
 541 or do
 542
 543
 550 querytype cql
 551
 552  f date-modified=20050713
 553
 554  f date-modified=20050713
 555
 556  Search ERROR 121 4 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @att
 557 r 4=1 @attr 2=3 @attr "1=date-modified" 20050713
 558
 559
 560
 561  f date-modified eq 20050713
 562
 563 Search OK 23 3 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=5
 564  @attr 2=3 @attr "1=date-modified" 20050713
 565
 566
 567    </para>
 568
 569    <para>
  570 E) EXTENDED SERVICE LIVE UPDATES
 571
 572 The extended services are not enabled by default in zebra - due to the
 573 fact that they modify the system.
 574
 575 In order to allow anybody to update, use
 576 perm.anonymous: rw
 577 in zebra.cfg.
 578
 579 Or, even better, allow only updates for a particular admin user. For
 580 user 'admin', you could use:
 581 perm.admin: rw
 582 passwd: passwordfile
 583
 584 And in passwordfile, specify users and passwords ..
 585 admin:secret
 586
 587 We can now start a yaz-client admin session and create a database:
 588
 589 $ yaz-client localhost:9999 -u admin/secret
 590 Authentication set to Open (admin/secret)
 591 Connecting...OK.
 592 Sent initrequest.
 593 Connection accepted by v3 target.
 594 ID     : 81
 595 Name   : Zebra Information Server/GFS/YAZ
 596 Version: Zebra 1.4.0/1.63/2.1.9
 597 Options: search present delSet triggerResourceCtrl scan sort
 598 extendedServices namedResultSets
 599 Elapsed: 0.007046
 600 Z> adm-create
 601 Admin request
 602 Got extended services response
 603 Status: done
 604 Elapsed: 0.045009
 605 :
 606 Now Default was created..  We can now insert an XML file (esdd0006.grs
 607 from example/gils/records) and index it:
 608
 609 Z> update insert 1 esdd0006.grs
 610 Got extended services response
 611 Status: done
 612 Elapsed: 0.438016
 613
 614 The 3rd parameter.. 1 here .. is the opaque record id from Ext update.
  615 It is a record ID that _we_ assign to the record in question. If we do not
 616 assign one the usual rules for match apply (recordId: from zebra.cfg).
 617
 618 Actually, we should have a way to specify "no opaque record id" for
 619 yaz-client's update command.. We'll fix that.
 620
 621 Elapsed: 0.438016
 622 Z> f utah
 623 Sent searchRequest.
 624 Received SearchResponse.
 625 Search was a success.
 626 Number of hits: 1, setno 1
 627 SearchResult-1: term=utah cnt=1
 628 records returned: 0
 629 Elapsed: 0.014179
 630
 631 Let's delete the beast:
 632 Z> update delete 1
 633 No last record (update ignored)
 634 Z> update delete 1 esdd0006.grs
 635 Got extended services response
 636 Status: done
 637 Elapsed: 0.072441
 638 Z> f utah
 639 Sent searchRequest.
 640 Received SearchResponse.
 641 Search was a success.
 642 Number of hits: 0, setno 2
 643 SearchResult-1: term=utah cnt=0
 644 records returned: 0
 645 Elapsed: 0.013610
 646
 647 If shadow register is enabled you must run the adm-commit command in
  648 order to write your changes..
 649
 650    </para>
 651
 652
 653
 654   </sect1>
 655 -->
 656
 657  </chapter>
 658
 659  <!-- Keep this comment at the end of the file
 660  Local variables:
 661  mode: sgml
 662  sgml-omittag:t
 663  sgml-shorttag:t
 664  sgml-minimize-attributes:nil
 665  sgml-always-quote-attributes:t
 666  sgml-indent-step:1
 667  sgml-indent-data:t
 668  sgml-parent-document: "zebra.xml"
 669  sgml-local-catalogs: nil
 670  sgml-namecase-general:t
 671  End:
 672  -->