doc/tutorial.xml

   1 <chapter id="tutorial">
   2  <!-- $Id: tutorial.xml,v 1.4 2008-02-07 12:36:35 marc Exp $ -->
   3  <title>Tutorial</title>
   4
   5
   6  <sect1 id="tutorial-oai">
   7   <title>A first &acro.oai; indexing example</title>
   8
   9  <para>
  10   In this section, we will test the system by indexing a small set of
  11   sample &acro.oai; records that are included with the &zebra; distribution,
  12   running a &zebra; server against the newly created database, and
  13   searching the indexes with a client that connects to that server.
  14  </para>
  15  <para>
  16   Go to the <literal>examples/oai-pmh</literal> subdirectory of the
  17   distribution archive, or make a deep copy of the Debian installation
  18    directory
  19   <literal>/usr/share/idzebra-2.0-examples/oai-pmh</literal>.
  20    An XML file containing multiple &acro.oai;
  21    records is located in the
  22    subdirectory <literal>examples/oai-pmh/data</literal>.
  23  </para>
  24  <para>
  25     Additional &acro.oai; test records can be downloaded by running a shell
  26     script (you may want to abort the script once you have waited
  27     longer than it takes your coffee to brew).
  28   <screen>
  29      cd data
  30      ./fetch_OAI_data.sh
  31      cd ../
  32   </screen>
  33  </para>
  34  <para>
  35     To index these &acro.oai; records, type:
  36   <screen>
  37     zebraidx-2.0 -c conf/zebra.cfg init
  38     zebraidx-2.0 -c conf/zebra.cfg update data
  39     zebraidx-2.0 -c conf/zebra.cfg commit
  40   </screen>
  41    If you have not installed &zebra; yet but have compiled the
  42     binaries from this tarball, use the following command form:
  43   <screen>
  44     ../../index/zebraidx -c conf/zebra.cfg this and that
  45   </screen>
  46    On some systems the &zebra; binaries are installed under
  47    generic names; there you need to use the following command form:
  48   <screen>
  49     zebraidx -c conf/zebra.cfg this and that
  50   </screen>
  51  </para>
  52
  53  <para>
  54   In this command, the word <literal>update</literal> is followed
  55   by the name of a directory: <literal>zebraidx</literal> updates all
  56   files in the hierarchy rooted at <literal>data</literal>.
  57   The command option
  58   <literal>-c conf/zebra.cfg</literal> points to the proper
  59   configuration file.
  60  </para>
  61
  62  <para>
  63    You might ask yourself how &acro.xml; content is indexed using &acro.xslt;
  64    stylesheets: to satisfy your curiosity, you might want to run the
  65    indexing transformation on an example debugging &acro.oai; record.
  66    <screen>
  67     xsltproc conf/oai2index.xsl data/debug-record.xml
  68    </screen>
  69     Here you see the &acro.oai; record transformed into the indexing
  70     &acro.xml; format. &zebra; creates several inverted indexes,
  71     and their names and types are clearly visible in the indexing
  72     &acro.xml; format.
  73  </para>
  74
  75  <para>
  76   If your indexing command was successful, you are now ready to
  77   fire up a server. To start a server on port 9999, type:
  78   <screen>
  79    zebrasrv-2.0 -c conf/zebra.cfg  @:9999
  80   </screen>
  81  </para>
  82
  83  <para>
  84   The &zebra; index that you have just created has a single database
  85   named <literal>Default</literal>.
  86   The database contains several &acro.oai; records, and the server will
  87   return records in the &acro.xml; format only. The indexing machinery
  88   split the input file into individual records behind the scenes.
  89  </para>
  90
  91
  92  </sect1>
  93
  94  <sect1 id="tutorial-oai-sru-pqf">
  95   <title>Searching the &acro.oai; database by web service</title>
  96
  97   <para>
  98     &zebra; has a built-in web service, which is close to the
  99     &acro.sru; standard web service. We use it to access our new
 100     database using any &acro.xml;-enabled web browser.
 101     This service uses the &acro.pqf; query language.
 102     In a later
 103     section we show how to run a fully compliant &acro.sru; server,
 104     including support for the query language &acro.cql;.
 105    </para>
 106
 107    <para>
 108     Searching and retrieving &acro.xml; records is easy.
 109     For example, to search for the term
 110     <literal>the</literal>, point your browser at the
 111     following URL:
 112     <ulink
 113     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the">
 114    http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the</ulink>
 115    </para>
 116
 117    <warning>
 118     <para>
 119      These URLs won't work unless you have indexed the example data
 120      and started a &zebra; server as outlined in the previous section.
 121     </para>
 122    </warning>
 123
 124    <para>
 125     To actually retrieve a record, we need to change
 126     our URL to the following:
 127    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 128    http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 129    </ulink>
 130    </para>
 131
 132    <para>
 133     This way we can page through our result set in chunks of records.
 134     For example, we access the 6th to the 10th record using the URL
 135    <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=6&amp;maximumRecords=5&amp;recordSchema=dc">
 136    http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=6&amp;maximumRecords=5&amp;recordSchema=dc
 137    </ulink>
 138   </para>
 139
 140 <!--
 141    relation tests:
 142
 143     <ulink url="">
 144
 145    http://localhost:9999/?version=1.1&amp;operation=searchRetrieve
 146                       &amp;x-pquery=title%3Cthe
 147 -->
 148  </sect1>
 149
 150  <sect1 id="tutorial-oai-sru-present">
 151   <title>Presenting search results in different formats</title>
 152
 153    <para>
 154     &zebra; uses &acro.xslt; stylesheets for both &acro.xml; record
 155     indexing and
 156     display retrieval. In this example installation, there are two
 157     retrieval schemas defined in
 158     <literal>conf/dom-conf.xml</literal>:
 159     the <literal>dc</literal> schema implemented in
 160     <literal>conf/oai2dc.xsl</literal>, and
 161     the <literal>zebra</literal> schema implemented in
 162     <literal>conf/oai2zebra.xsl</literal>.
 163     The URLs for accessing both are the same, except for the different
 164     value of the <literal>recordSchema</literal> parameter:
 165     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 166      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 167     </ulink>
 168     and
 169     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra">
 170      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra
 171     </ulink>
 172     If you are curious, you can run the &acro.xslt; transformations
 173     by hand to see that they really do the magic:
 174     <screen>
 175      xsltproc conf/oai2dc.xsl data/debug-record.xml
 176      xsltproc conf/oai2zebra.xsl data/debug-record.xml
 177      </screen>
 178     Notice also that the &zebra;-specific parameters are injected by
 179     the engine when retrieving data; therefore some of the attributes
 180     in the <literal>zebra</literal> retrieval schema are not filled
 181     when running the transformation from the command line.
 182    </para>
 183
 184
 185    <para>
 186     In addition to the user-defined retrieval schemas, one can always
 187     choose from many built-in schemas. If one is only
 188     interested in the &zebra; internal metadata about a certain
 189     record, one uses the <literal>zebra::meta</literal> schema.
 190     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::meta">
 191      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::meta
 192     </ulink>
 193    </para>
 194
 195    <para>
 196     The <literal>zebra::data</literal> schema is used to retrieve the
 197     original stored &acro.oai; &acro.xml; record.
 198     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::data">
 199      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::data
 200     </ulink>
 201    </para>
 202
 203  </sect1>
 204
 205  <sect1 id="tutorial-oai-sru-searches">
 206   <title>More interesting searches</title>
 207
 208    <para>
 209     The &acro.oai; indexing example defines many different index
 210     names. A study of the <literal>conf/oai2index.xsl</literal>
 211     stylesheet reveals the following word type indexes (i.e. those
 212     with suffix <literal>:w</literal>):
 213     <screen>
 214      any:w
 215      dc_title:w
 216      dc_creator:w
 217      dc_subject:w
 218      dc_description:w
 219      dc_contributor:w
 220      dc_publisher:w
 221      dc_language:w
 222      dc_rights:w
 223     </screen>
 224     By default, searches access the <literal>any:w</literal> index,
 225     but we can direct searches to any access point by constructing the
 226     correct &acro.pqf; query. For example, to search in titles only,
 227     we use
 228     <ulink
 229     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@attr
 230     1=dc_title the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 231      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@attr
 232     1=dc_title the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 233     </ulink>
 234    </para>
 235
 236    <para>
 237     Similarly, we can direct searches to the other defined indexes, or we
 238     can create boolean combinations of searches on different
 239     indexes. In this case we search for <literal>the</literal> in
 240     <literal>dc_title</literal> and for <literal>fish</literal> in
 241     <literal>dc_description</literal> using the query
 242     <literal>@and @attr 1=dc_title the @attr 1=dc_description fish</literal>.
 243     <ulink
 244     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@and
 245     @attr 1=dc_title the
 246     @attr 1=dc_description
 247     fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 248      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@and
 249      @attr 1=dc_title the
 250      @attr 1=dc_description fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 251     </ulink>
 252    </para>
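
   <para>
    The &acro.pqf; queries above contain spaces and the
    <literal>@</literal> character, which not every HTTP client encodes
    on its own. As a minimal sketch, assuming <literal>curl</literal> is
    installed and the server from the previous section is listening on
    port 9999, a small shell function can leave the URL encoding to
    <literal>curl</literal>. The name <literal>sru_pqf</literal> is a
    hypothetical helper chosen for this illustration; it is not part of
    the &zebra; distribution.
    <screen>
     # sru_pqf QUERY [SCHEMA]: hypothetical helper, not shipped with Zebra.
     # Sends a searchRetrieve request with a PQF query to localhost:9999
     # and asks for the first record; curl URL-encodes each value.
     sru_pqf() {
       curl -s -G "http://localhost:9999/" \
            --data-urlencode "version=1.1" \
            --data-urlencode "operation=searchRetrieve" \
            --data-urlencode "x-pquery=$1" \
            --data-urlencode "startRecord=1" \
            --data-urlencode "maximumRecords=1" \
            --data-urlencode "recordSchema=${2:-dc}"
     }
    </screen>
    It could then be called with the boolean query shown above; the
    optional second argument selects a different record schema:
    <screen>
     sru_pqf '@and @attr 1=dc_title the @attr 1=dc_description fish'
     sru_pqf '@attr 1=dc_title the' zebra
    </screen>
   </para>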
 253
 254
 255  </sect1>
 256
 257  <sect1 id="tutorial-oai-sru-zebra-indexess">
 258   <title>Investigating the content of the indexes</title>
 259
 260    <para>
 261     How does the magic work? What is inside the indexes? Why is a certain
 262     record found by a search, and another not? The answer is in the
 263     inverted indexes. You can easily investigate them using the
 264     special &zebra; schema
 265     <literal>zebra::index::fieldname</literal>. In this example you
 266     can see that the <literal>dc_title</literal> index has both word
 267     (type <literal>:w</literal>) and phrase (type
 268     <literal>:p</literal>)
 269     indexed fields:
 270     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::index::dc_title">
 271      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::index::dc_title
 272     </ulink>
 273    </para>
 274
 275    <para>
 276     But where in the indexes did the term match for the query occur?
 277     This is easily answered with the special &zebra; schema
 278     <literal>zebra::snippet</literal>. The matching terms are
 279     enclosed in <literal>&lt;s&gt;</literal> tags.
 280     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet">
 281      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet
 282     </ulink>
 283    </para>
 284
 285    <para>
 286     How can I refine my search? Which interesting search terms are
 287     found inside my hit set? Try the special  &zebra; schema
 288     <literal>zebra::facet::fieldname:type</literal>. In this case, we
 289     investigate additional search terms for the
 290     <literal>dc_title:w</literal> index.
 291     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::dc_title:w">
 292      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::dc_title:w
 293     </ulink>
 294    </para>
 295
 296    <para>
 297     One can ask for multiple facets. Here, we want them from phrase
 298     indexes of type
 299     <literal>:p</literal>.
 300     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::dc_publisher:p,dc_title:p">
 301      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::dc_publisher:p,dc_title:p
 302     </ulink>
 303    </para>
 304
 305  </sect1>
 306
 307
 308  <sect1 id="tutorial-oai-sru-yazfrontend">
 309   <title>Setting up a correct &acro.sru; web service</title>
 310
 311    <para>
 312        The &acro.sru; specification mandates that the &acro.cql; query
 313        language is supported and properly configured. Also, the server
 314        needs to be able to emit a proper &acro.explain; &acro.xml;
 315        record, which is used to determine the capabilities of the
 316        specific server instance.
 317     </para>
 318
 319    <para>
 320     In this example configuration we exploit the similarities between
 321     the &acro.explain; record and the &acro.cql; query language
 322     configuration: we generate the latter from the former using an
 323     &acro.xslt; transformation.
 324     <screen>
 325      xsltproc conf/explain2cqlpqftxt.xsl conf/explain.xml > conf/cql2pqf.txt
 326     </screen>
 327    </para>
 328
 329    <para>
 330     Then we are all set to start the &acro.sru;/&acro.z3950; server including
 331     &acro.pqf; and &acro.cql; query configuration. It uses the &yaz; frontend
 332     server configuration; just type
 333     <screen>
 334      zebrasrv -f conf/yazserver.xml
 335      </screen>
 336     </para>
 337
 338    <para>
 339     First, we'd like to be sure that we can see the  &acro.explain;
 340     &acro.xml; response correctly. You might use either of these equivalent
 341     requests:
 342     <ulink
 343      url="http://localhost:9999">http://localhost:9999
 344     </ulink>
 345     <ulink
 346      url="http://localhost:9999/?version=1.1&amp;operation=explain">
 347      http://localhost:9999/?version=1.1&amp;operation=explain
 348     </ulink>
 349
 350    </para>
 351
 352    <para>
 353     Now we can issue true &acro.sru; requests. For example,
 354     <literal>dc.title=the
 355     and dc.description=fish</literal> results in the following page
 356     <ulink
 357     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the
 358     and dc.description=fish
 359     &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 360      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the
 361      and dc.description=fish &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 362     </ulink>
 363    </para>
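
   <para>
    The &acro.cql; query above also contains spaces and
    <literal>=</literal> characters inside the value of the
    <literal>query</literal> parameter. The following is a minimal
    sketch, assuming <literal>curl</literal> is installed and the
    server started with <literal>conf/yazserver.xml</literal> is
    listening on port 9999, of sending the same request from the
    command line. The function name <literal>sru_cql</literal> and its
    argument order are hypothetical, chosen only for this illustration.
    <screen>
     # sru_cql QUERY [START] [COUNT]: hypothetical helper, not shipped
     # with Zebra. Sends a CQL searchRetrieve request to localhost:9999;
     # START and COUNT default to 1, and curl URL-encodes each value.
     sru_cql() {
       curl -s -G "http://localhost:9999/" \
            --data-urlencode "version=1.1" \
            --data-urlencode "operation=searchRetrieve" \
            --data-urlencode "query=$1" \
            --data-urlencode "startRecord=${2:-1}" \
            --data-urlencode "maximumRecords=${3:-1}" \
            --data-urlencode "recordSchema=dc"
     }
    </screen>
    It could be used to repeat the request above, or to page through a
    larger result set:
    <screen>
     sru_cql 'dc.title=the and dc.description=fish'
     sru_cql 'dc.title=the' 6 5
    </screen>
   </para>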
 364
 365    <para>
 366     Scanning indexes is also part of the &acro.sru; service. For example,
 367     scanning the <literal>dc.title</literal> index gives us an idea
 368     of what search terms are found there
 369     <ulink
 370     url="http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.title=fish">
 371      http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.title=fish
 372     </ulink>,
 373     whereas
 374    <ulink
 375     url="http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.identifier=fish">
 376 http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.identifier=fish
 377    </ulink>
 378     accesses the indexed identifiers.
 379    </para>
 380
 381    <para>
 382     In addition, all &zebra; internal special element sets or record
 383     schemas of the form
 384     <literal>zebra::</literal> just work right out of the box:
 385     <ulink
 386     url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the
 387     and dc.description=fish
 388     &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet">
 389      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the
 390      and dc.description=fish &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet
 391     </ulink>
 392    </para>
 393
 394
 395
 396  </sect1>
 397
 398
 399   <sect1 id="tutorial-oai-z3950">
 400    <title>Searching the &acro.oai; database by &acro.z3950; protocol</title>
 401
 402    <para>
 403     In this section we repeat the searches and presents we have done so
 404     far, this time using the binary &acro.z3950; protocol. You can use any
 405     &acro.z3950; client.
 406     For instance, you can use the demo command-line client that comes
 407     with &yaz;.
 408    </para>
 409    <para>
 410     Connecting to the server is done by the command
 411   <screen>
 412      yaz-client localhost:9999
 413     </screen>
 414    </para>
 415
 416    <para>
 417     When the client has connected, you can type:
 418     <screen>
 419      Z> format xml
 420      Z> querytype prefix
 421      Z> elements oai
 422      Z> find the
 423      Z> show 1+1
 424     </screen>
 425    </para>
 426
 427    <para>
 428     &acro.z3950; presents using presentation stylesheets:
 429     <screen>
 430      Z> elements dc
 431      Z> show 2+1
 432
 433      Z> elements zebra
 434      Z> show 3+1
 435     </screen>
 436    </para>
 437
 438    <para>
 439     &acro.z3950; built-in &zebra; presents (in this configuration only if
 440     started without yaz-frontendserver):
 441
 442     <screen>
 443      Z> elements zebra::meta
 444      Z> show 4+1
 445
 446      Z> elements zebra::meta::sysno
 447      Z> show 5+1
 448
 449      Z> format sutrs
 450      Z> show 5+1
 451      Z> format xml
 452
 453      Z> elements zebra::index
 454      Z> show 6+1
 455
 456      Z> elements zebra::snippet
 457      Z> show 7+1
 458
 459      Z> elements zebra::facet::any:w
 460      Z> show 1+1
 461
 462      Z> elements zebra::facet::dc_publisher:p,dc_title:p
 463      Z> show 1+1
 464    </screen>
 465    </para>
 466
 467    <para>
 468     &acro.z3950; searches targeted at specific indexes and boolean
 469     combinations of these can be issued as well.
 470
 471     <screen>
 472      Z> elements dc
 473      Z> find @attr 1=oai_identifier @attr 4=3 oai:caltechcstr.library.caltech.edu:4
 474      Z> show 1+1
 475
 476      Z> find @attr 1=oai_datestamp @attr 4=3 2001-04-20
 477      Z> show 1+1
 478
 479      Z> find @attr 1=oai_setspec @attr 4=3 7374617475733D756E707562
 480      Z> show 1+1
 481
 482      Z> find @attr 1=dc_title communication
 483      Z> show 1+1
 484
 485      Z> find @attr 1=dc_identifier @attr 4=3
 486      http://resolver.caltech.edu/CaltechCSTR:1986.5228-tr-86
 487      Z> show 1+1
 488     </screen>
 489    etc, etc.
 490    </para>
 491
 492    <para>
 493     &acro.z3950; scan:
 494     <screen>
 495      yaz-client localhost:9999
 496      Z> format xml
 497      Z> querytype prefix
 498      Z> scan @attr 1=oai_identifier @attr 4=3 oai
 499      Z> scan @attr 1=oai_datestamp @attr 4=3 1
 500      Z> scan @attr 1=oai_setspec @attr 4=3 2000
 501      Z>
 502      Z> scan @attr 1=dc_title communication
 503      Z> scan @attr 1=dc_identifier @attr 4=3 a
 504    </screen>
 505    </para>
 506
 507    <para>
 508     &acro.z3950; search using server-side CQL conversion:
 509     <screen>
 510    Z> format xml
 511    Z> querytype cql
 512    Z> elements dc
 513    Z>
 514    Z> find harry
 515    Z>
 516    Z> find dc.creator = the
 517    Z> find dc.creator = the
 518    Z> find dc.title = the
 519    Z>
 520    Z> find dc.description &lt; the
 521    Z> find dc.title &gt; some
 522    Z>
 523    Z> find dc.identifier="http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78"
 524    Z> find dc.relation = something
 525    </screen>
 526    </para>
 527
 528    <!--
 529    etc, etc. Notice that  all indexes defined by 'type="0"' in the
 530    indexing style  sheet must be searched using the 'eq'
 531    relation.
 532
 533    Z> find title <> and
 534
 535    fails as well.  ???
 536    -->
 537
 538    <tip>
 539    <para>
 540     &acro.z3950; scan using server-side CQL conversion:
 541    unfortunately, this will never work, as it is not supported by the
 542    &acro.z3950; standard.
 543    If you want to scan using server-side CQL conversion, you need to
 544    make an SRW connection using yaz-client, or an
 545    SRU connection using REST web services; any browser will do.
 546    </para>
 547    </tip>
 548
 549    <tip>
 550    <para>
 551    All indexes defined by 'type="0"' in the
 552    indexing stylesheet must be searched using the '@attr 4=3'
 553    structure attribute instruction.
 554    </para>
 555    </tip>
 556
 557    <para>
 558    Notice that searching and scanning the indexes
 559    <literal>dc_contributor</literal>,  <literal>dc_language</literal>,
 560    <literal>dc_rights</literal>, and <literal>dc_source</literal>
 561    might fail, simply because none of the records in the small example set
 562    have these fields set, and consequently, these indexes might not
 563    have been created.
 564    </para>
 565
 566  </sect1>
 567
 568
 569
 570
 571
 572
 573
 574 </chapter>
 575
 576  <!-- Keep this comment at the end of the file
 577  Local variables:
 578  mode: sgml
 579  sgml-omittag:t
 580  sgml-shorttag:t
 581  sgml-minimize-attributes:nil
 582  sgml-always-quote-attributes:t
 583  sgml-indent-step:1
 584  sgml-indent-data:t
 585  sgml-parent-document: "zebra.xml"
 586  sgml-local-catalogs: nil
 587  sgml-namecase-general:t
 588  End:
 589  -->