doc/tutorial.xml

   1  <chapter id="tutorial">
   2   <title>Tutorial</title>
   3
   4
   5   <sect1 id="tutorial-oai">
   6    <title>A first &acro.oai; indexing example</title>
   7
   8    <para>
   9     In this section, we will test the system by indexing a small set of
  10     sample &acro.oai; records that are included with the &zebra; distribution,
  11     running a &zebra; server against the newly created database, and
  12     searching the indexes with a client that connects to that server.
  13    </para>
  14    <para>
  15     Go to the <literal>examples/oai-pmh</literal> subdirectory of the
  16     distribution archive, or make a deep copy of the Debian installation
  17     directory
  18     <literal>/usr/share/idzebra-2.0-examples/oai-pmh</literal>.
  19     An &acro.xml; file containing multiple &acro.oai;
  20     records is located in the subdirectory
  21     <literal>examples/oai-pmh/data</literal>.
  22    </para>
  23    <para>
  24     Additional &acro.oai; test records can be downloaded by running a shell
  25     script (you may want to abort the script once you have waited
  26     longer than it takes your coffee to brew).
  27     <screen>
  28      cd data
  29      ./fetch_OAI_data.sh
  30      cd ../
  31     </screen>
  32    </para>
  33    <para>
  34     To index these &acro.oai; records, type:
  35     <screen>
  36      zebraidx-2.0 -c conf/zebra.cfg init
  37      zebraidx-2.0 -c conf/zebra.cfg update data
  38      zebraidx-2.0 -c conf/zebra.cfg commit
  39     </screen>
  40     If you have not installed &zebra; yet but have compiled the
  41     binaries from this tarball, use the following command form:
  42     <screen>
  43      ../../index/zebraidx -c conf/zebra.cfg this and that
  44     </screen>
  45     On some systems the &zebra; binaries are installed under
  46     generic names; in that case, use the following command form:
  47     <screen>
  48      zebraidx -c conf/zebra.cfg this and that
  49     </screen>
  50    </para>
  51
  52    <para>
  53     In this command, the word <literal>update</literal> is followed
  54     by the name of a directory: <literal>zebraidx</literal> updates all
  55     files in the hierarchy rooted at <literal>data</literal>.
  56     The command option
  57     <literal>-c conf/zebra.cfg</literal> points to the proper
  58     configuration file.
  59    </para>
  60
  61    <para>
  62     You might ask yourself how &acro.xml; content is indexed using &acro.xslt;
  63     stylesheets: to satisfy your curiosity, you might want to run the
  64     indexing transformation on an example debugging &acro.oai; record.
  65     <screen>
  66      xsltproc conf/oai2index.xsl data/debug-record.xml
  67     </screen>
  68     Here you see the &acro.oai; record transformed into the indexing
  69     &acro.xml; format. &zebra; creates several inverted indexes,
  70     and their names and types are clearly visible in the indexing
  71     &acro.xml; format.
  72    </para>
  73
  74    <para>
  75     If your indexing command was successful, you are now ready to
  76     fire up a server. To start a server on port 9999, type:
  77     <screen>
  78      zebrasrv-2.0 -c conf/zebra.cfg  @:9999
  79     </screen>
  80    </para>
  81
  82    <para>
  83     The &zebra; index that you have just created has a single database
  84     named <literal>Default</literal>.
  85     The database contains several &acro.oai; records, and the server will
  86     return records in the &acro.xml; format only. The indexing machinery
  87     split the input into individual records behind the scenes.
  88    </para>
  89
  90
  91   </sect1>
  92
  93   <sect1 id="tutorial-oai-sru-pqf">
  94    <title>Searching the &acro.oai; database by web service</title>
  95
  96    <para>
  97     &zebra; has a built-in web service, which is close to the
  98     &acro.sru; standard web service. We use it to access our new
  99     database using any &acro.xml;-enabled web browser.
 100     This service uses the &acro.pqf; query language.
 101     In a later
 102     section we show how to run a fully compliant &acro.sru; server,
 103     including support for the query language &acro.cql;.
 104    </para>
 105
 106    <para>
 107     Searching and retrieving &acro.xml; records is easy.
 108     For example, to search for the term
 109     <literal>the</literal>, point your
 110     browser at this link:
 111     <ulink
 112      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the">
 113      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the</ulink>
 114    </para>
 115
 116    <warning>
 117     <para>
 118      These URLs won't work unless you have indexed the example data
 119      and started a &zebra; server as outlined in the previous section.
 120     </para>
 121    </warning>
 122
 123    <para>
 124     To actually retrieve one record, we need to alter
 125     our URL to the following:
 126     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 127      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 128     </ulink>
 129    </para>
 130
 131    <para>
 132     This way we can page through our result set in chunks of records.
 133     For example, we access the 6th to the 10th record using the URL
 134     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=6&amp;maximumRecords=5&amp;recordSchema=dc">
 135      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=6&amp;maximumRecords=5&amp;recordSchema=dc
 136     </ulink>
 137    </para>
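        <para>
         The paging arithmetic above can be sketched in a few lines of
         Python. This is only an illustration, assuming the tutorial server
         is reachable at <literal>http://localhost:9999/</literal>; the
         helper names <literal>search_url</literal> and
         <literal>page_url</literal> are hypothetical and not part of
         &zebra; or &yaz;.
        </para>
        <programlisting><![CDATA[
```python
from urllib.parse import urlencode

# Assumption: the tutorial server from the previous section listens here.
BASE_URL = "http://localhost:9999/"


def search_url(pquery, start_record=1, maximum_records=1, record_schema="dc"):
    """Build a searchRetrieve URL that uses the x-pquery parameter."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "x-pquery": pquery,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
        "recordSchema": record_schema,
    }
    # urlencode percent-encodes each value and joins the pairs.
    return BASE_URL + "?" + urlencode(params)


def page_url(pquery, page, page_size=5):
    """Return the URL for the given 1-based page of the result set."""
    start = (page - 1) * page_size + 1
    return search_url(pquery, start_record=start, maximum_records=page_size)


if __name__ == "__main__":
    # Page 2 with five records per page covers records 6 to 10.
    print(page_url("the", page=2))
```
]]></programlisting>
        <para>
         With a page size of 5, page 2 starts at record 6, which
         reproduces the URL shown above.
        </para>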
 138
 139    <!--
 140    relation tests:
 141
 142    <ulink url="">
 143
 144    http://localhost:9999/?version=1.1&amp;operation=searchRetrieve
 145    &amp;x-pquery=title%3Cthe
 146    -->
 147   </sect1>
 148
 149   <sect1 id="tutorial-oai-sru-present">
 150    <title>Presenting search results in different formats</title>
 151
 152    <para>
 153     &zebra; uses &acro.xslt; stylesheets for both &acro.xml; record
 154     indexing and
 155     display retrieval. In this example installation, there are two
 156     retrieval schemas defined in
 157     <literal>conf/dom-conf.xml</literal>:
 158     the <literal>dc</literal> schema implemented in
 159     <literal>conf/oai2dc.xsl</literal>, and
 160     the <literal>zebra</literal> schema implemented in
 161     <literal>conf/oai2zebra.xsl</literal>.
 162     The URLs for accessing both are the same, except for the different
 163     value of the <literal>recordSchema</literal> parameter:
 164     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 165      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 166     </ulink>
 167     and
 168     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra">
 169      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra
 170     </ulink>
 171     For the curious, one can see that the &acro.xslt; transformations
 172     really do the magic.
 173     <screen>
 174      xsltproc conf/oai2dc.xsl data/debug-record.xml
 175      xsltproc conf/oai2zebra.xsl data/debug-record.xml
 176     </screen>
 177     Notice also that the &zebra;-specific parameters are injected by
 178     the engine when retrieving data; therefore, some of the attributes
 179     in the <literal>zebra</literal> retrieval schema are not filled
 180     when running the transformation from the command line.
 181    </para>
 182
 183
 184    <para>
 185     In addition to the user-defined retrieval schemas, one can always
 186     choose from many built-in schemas. If one is only
 187     interested in the &zebra; internal metadata about a certain
 188     record, one uses the <literal>zebra::meta</literal> schema.
 189     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::meta">
 190      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::meta
 191     </ulink>
 192    </para>
 193
 194    <para>
 195     The <literal>zebra::data</literal> schema is used to retrieve the
 196     original stored &acro.oai; &acro.xml; record.
 197     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::data">
 198      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::data
 199     </ulink>
 200    </para>
 201
 202   </sect1>
 203
 204   <sect1 id="tutorial-oai-sru-searches">
 205    <title>More interesting searches</title>
 206
 207    <para>
 208     The &acro.oai; indexing example defines many different index
 209     names. A study of the <literal>conf/oai2index.xsl</literal>
 210     stylesheet reveals the following word-type indexes (i.e. those
 211     with suffix <literal>:w</literal>):
 212     <screen>
 213      any:w
 214      title:w
 215      author:w
 216      subject-heading:w
 217      description:w
 218      contributor:w
 219      publisher:w
 220      language:w
 221      rights:w
 222     </screen>
 223     By default, searches access the <literal>any:w</literal> index,
 224     but we can direct searches to any access point by constructing the
 225     correct &acro.pqf; query. For example, to search in titles only,
 226     we use
 227     <ulink
 228      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@attr 1=title the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 229      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@attr 1=title the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 230     </ulink>
 231    </para>
 232
 233    <para>
 234     Similarly, we can direct searches to the other indexes defined, or we
 235     can create boolean combinations of searches on different
 236     indexes. In this case we search for <literal>the</literal> in
 237     <literal>title</literal> and for <literal>fish</literal> in
 238     <literal>description</literal> using the query
 239     <literal>@and @attr 1=title the @attr 1=description fish</literal>.
 240     <ulink
 241      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@and @attr 1=title the @attr 1=description fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 242      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@and @attr 1=title the @attr 1=description fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 243     </ulink>
 244    </para>
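        <para>
         The &acro.pqf; queries above contain spaces and other reserved
         characters. Most browsers encode them silently, but scripts
         should percent-encode the query before placing it in a URL. The
         following Python sketch shows one way to do this. The helper
         names are hypothetical, and the sketch assumes single-word search
         terms that need no &acro.pqf; quoting.
        </para>
        <programlisting><![CDATA[
```python
from urllib.parse import quote


def attr_term(index, term):
    """PQF fragment that searches one term in a named index (use attribute 1)."""
    return f"@attr 1={index} {term}"


def pqf_and(left, right):
    """Combine two PQF fragments with the boolean @and operator."""
    return f"@and {left} {right}"


def encode_pquery(pqf):
    """Percent-encode a PQF query for use as the x-pquery URL value."""
    # safe="" also encodes "@" and "=", so nothing in the value can be
    # mistaken for URL syntax.
    return quote(pqf, safe="")


if __name__ == "__main__":
    query = pqf_and(attr_term("title", "the"), attr_term("description", "fish"))
    print(query)
    print(encode_pquery(query))
```
]]></programlisting>
        <para>
         The first line printed is the readable query used in this section;
         the second is the same query in a form that is safe to append
         after <literal>x-pquery=</literal>.
        </para>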
 245
 246
 247   </sect1>
 248
 249   <sect1 id="tutorial-oai-sru-zebra-indexes">
 250    <title>Investigating the content of the indexes</title>
 251
 252    <para>
 253     How does the magic work? What is inside the indexes? Why is a certain
 254     record found by a search, and another not? The answer is in the
 255     inverted indexes. You can easily investigate them using the
 256     special &zebra; schema
 257     <literal>zebra::index::fieldname</literal>. In this example you
 258     can see that the <literal>title</literal> index has both word
 259     (type <literal>:w</literal>) and phrase (type
 260     <literal>:p</literal>)
 261     indexed fields:
 262     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::index::title">
 263      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::index::title
 264     </ulink>
 265    </para>
 266
 267    <para>
 268     But where in the indexes did the term match for the query occur?
 269     This is easily answered with the special &zebra; schema
 270     <literal>zebra::snippet</literal>. The matching terms are
 271     encapsulated by <literal>&lt;s&gt;</literal> tags.
 272     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet">
 273      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet
 274     </ulink>
 275    </para>
 276
 277    <para>
 278     How can I refine my search? Which interesting search terms are
 279     found inside my hit set? Try the special  &zebra; schema
 280     <literal>zebra::facet::fieldname:type</literal>. In this case, we
 281     investigate additional search terms for the
 282     <literal>title:w</literal> index.
 283     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::title:w">
 284      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::title:w
 285     </ulink>
 286    </para>
 287
 288    <para>
 289     One can ask for multiple facets. Here, we want them from phrase
 290     indexes of type
 291     <literal>:p</literal>.
 292     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::publisher:p,title:p">
 293      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::publisher:p,title:p
 294     </ulink>
 295    </para>
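        <para>
         The facet schema name follows a regular pattern: the prefix
         <literal>zebra::facet::</literal> followed by a comma-separated
         list of <literal>fieldname:type</literal> pairs. A small Python
         sketch can assemble it. The function names are hypothetical, and
         the base URL assumes the tutorial server started earlier.
        </para>
        <programlisting><![CDATA[
```python
from urllib.parse import urlencode

# Assumption: the tutorial server from the first section listens here.
BASE_URL = "http://localhost:9999/"


def facet_schema(*fields):
    """Build a zebra::facet schema name from (index name, index type) pairs."""
    if not fields:
        raise ValueError("at least one (index, type) pair is required")
    return "zebra::facet::" + ",".join(f"{name}:{kind}" for name, kind in fields)


def facet_url(pquery, *fields):
    """Build a searchRetrieve URL that requests facets for the hit set."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "x-pquery": pquery,
        "startRecord": 1,
        "maximumRecords": 1,
        "recordSchema": facet_schema(*fields),
    }
    # safe=":," leaves the schema name readable, as in the URLs above.
    return BASE_URL + "?" + urlencode(params, safe=":,")


if __name__ == "__main__":
    print(facet_url("the", ("publisher", "p"), ("title", "p")))
```
]]></programlisting>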
 296
 297   </sect1>
 298
 299
 300   <sect1 id="tutorial-oai-sru-yazfrontend">
 301    <title>Setting up a correct &acro.sru; web service</title>
 302
 303    <para>
 304     The &acro.sru; specification mandates that the &acro.cql; query
 305     language is supported and properly configured. Also, the server
 306     needs to be able to emit a proper  &acro.explain; &acro.xml;
 307     record, which is used to determine the capabilities of the
 308     specific server instance.
 309    </para>
 310
 311    <para>
 312     In this example configuration we exploit the similarities between
 313     the &acro.explain; record and the &acro.cql; query language
 314     configuration: we generate the latter from the former using an
 315     &acro.xslt; transformation.
 316     <screen>
 317      xsltproc conf/explain2cqlpqftxt.xsl conf/explain.xml > conf/cql2pqf.txt
 318     </screen>
 319    </para>
 320
 321    <para>
 322     We are all set to start the &acro.sru;/&acro.z3950; server including
 323     &acro.pqf; and &acro.cql; query configuration. It uses the &yaz; frontend
 324     server configuration; just type
 325     <screen>
 326      zebrasrv -f conf/yazserver.xml
 327     </screen>
 328    </para>
 329
 330    <para>
 331     First, we'd like to be sure that we can see the  &acro.explain;
 332     &acro.xml; response correctly. You might use either of these equivalent
 333     requests:
 334     <ulink
 335      url="http://localhost:9999">http://localhost:9999
 336     </ulink>
 337     or
 338     <ulink
 339      url="http://localhost:9999/?version=1.1&amp;operation=explain">
 340      http://localhost:9999/?version=1.1&amp;operation=explain
 341     </ulink>
 342
 343    </para>
 344
 345    <para>
 346     Now we can issue true &acro.sru; requests. For example,
 347     <literal>dc.title=the
 348      and dc.description=fish</literal> results in the following page
 349     <ulink
 350      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the and dc.description=fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 351      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the and dc.description=fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 352     </ulink>
 353    </para>
 354
 355    <para>
 356     Scanning indexes is also part of the &acro.sru; server's business. For example,
 357     scanning the <literal>dc.title</literal> index gives us an idea
 358     of which search terms are found there
 359     <ulink
 360      url="http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.title=fish">
 361      http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.title=fish
 362     </ulink>,
 363     whereas
 364     <ulink
 365      url="http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.identifier=fish">
 366      http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.identifier=fish
 367     </ulink>
 368     accesses the indexed identifiers.
 369    </para>
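        <para>
         The &acro.cql; search and scan requests above differ only in the
         <literal>operation</literal> parameter and in the name of the
         parameter that carries the query. The following Python sketch
         builds both kinds of URL with proper percent-encoding. The
         function names are hypothetical, and the base URL assumes the
         server started with <literal>conf/yazserver.xml</literal> as
         described above.
        </para>
        <programlisting><![CDATA[
```python
from urllib.parse import quote, urlencode

# Assumption: the SRU server configured above listens here.
BASE_URL = "http://localhost:9999/"


def sru_url(operation, **params):
    """Build an SRU 1.1 request URL for the given operation."""
    fields = {"version": "1.1", "operation": operation, **params}
    # quote_via=quote encodes spaces as %20 rather than "+".
    return BASE_URL + "?" + urlencode(fields, quote_via=quote)


def cql_search_url(cql, start_record=1, maximum_records=1, record_schema="dc"):
    """searchRetrieve request carrying a CQL query."""
    return sru_url(
        "searchRetrieve",
        query=cql,
        startRecord=start_record,
        maximumRecords=maximum_records,
        recordSchema=record_schema,
    )


def scan_url(scan_clause):
    """scan request; the scan clause names the index and a start term."""
    return sru_url("scan", scanClause=scan_clause)


if __name__ == "__main__":
    print(cql_search_url("dc.title=the and dc.description=fish"))
    print(scan_url("dc.title=fish"))
```
]]></programlisting>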
 370
 371    <para>
 372     In addition, all &zebra; internal special element sets or record
 373     schemas of the form
 374     <literal>zebra::</literal> just work right out of the box:
 375     <ulink
 376      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the and dc.description=fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet">
 377      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the and dc.description=fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet
 378     </ulink>
 379    </para>
 380
 381
 382
 383   </sect1>
 384
 385
 386   <sect1 id="tutorial-oai-z3950">
 387    <title>Searching the &acro.oai; database by &acro.z3950; protocol</title>
 388
 389    <para>
 390     In this section we repeat the searches and presents we have done so
 391     far, this time using the binary &acro.z3950; protocol. You can use any
 392     &acro.z3950; client.
 393     For instance, you can use the demo command-line client that comes
 394     with &yaz;.
 395    </para>
 396    <para>
 397     Connecting to the server is done by the command
 398     <screen>
 399      yaz-client localhost:9999
 400     </screen>
 401    </para>
 402
 403    <para>
 404     When the client has connected, you can type:
 405     <screen>
 406      Z> format xml
 407      Z> querytype prefix
 408      Z> elements oai
 409      Z> find the
 410      Z> show 1+1
 411     </screen>
 412    </para>
 413
 414    <para>
 415     &acro.z3950; presents using presentation stylesheets:
 416     <screen>
 417      Z> elements dc
 418      Z> show 2+1
 419
 420      Z> elements zebra
 421      Z> show 3+1
 422     </screen>
 423    </para>
 424
 425    <para>
 426     &acro.z3950; presents using built-in &zebra; element sets (in this
 427     configuration, only if started without the &yaz; frontend server):
 428
 429     <screen>
 430      Z> elements zebra::meta
 431      Z> show 4+1
 432
 433      Z> elements zebra::meta::sysno
 434      Z> show 5+1
 435
 436      Z> format sutrs
 437      Z> show 5+1
 438      Z> format xml
 439
 440      Z> elements zebra::index
 441      Z> show 6+1
 442
 443      Z> elements zebra::snippet
 444      Z> show 7+1
 445
 446      Z> elements zebra::facet::any:w
 447      Z> show 1+1
 448
 449      Z> elements zebra::facet::publisher:p,title:p
 450      Z> show 1+1
 451     </screen>
 452    </para>
 453
 454    <para>
 455     &acro.z3950; searches targeted at specific indexes and boolean
 456     combinations of these can be issued as well.
 457
 458     <screen>
 459      Z> elements dc
 460      Z> find @attr 1=oai_identifier @attr 4=3 oai:caltechcstr.library.caltech.edu:4
 461      Z> show 1+1
 462
 463      Z> find @attr 1=oai_datestamp @attr 4=3 2001-04-20
 464      Z> show 1+1
 465
 466      Z> find @attr 1=oai_setspec @attr 4=3 7374617475733D756E707562
 467      Z> show 1+1
 468
 469      Z> find @attr 1=title communication
 470      Z> show 1+1
 471
 472      Z> find @attr 1=identifier @attr 4=3
 473      http://resolver.caltech.edu/CaltechCSTR:1986.5228-tr-86
 474      Z> show 1+1
 475     </screen>
 476     and so on.
 477    </para>
 478
 479    <para>
 480     &acro.z3950; scan:
 481     <screen>
 482      yaz-client localhost:9999
 483      Z> format xml
 484      Z> querytype prefix
 485      Z> scan @attr 1=oai_identifier @attr 4=3 oai
 486      Z> scan @attr 1=oai_datestamp @attr 4=3 1
 487      Z> scan @attr 1=oai_setspec @attr 4=3 2000
 488      Z>
 489      Z> scan @attr 1=title communication
 490      Z> scan @attr 1=identifier @attr 4=3 a
 491     </screen>
 492    </para>
 493
 494    <para>
 495     &acro.z3950; search using server-side CQL conversion:
 496     <screen>
 497      Z> format xml
 498      Z> querytype cql
 499      Z> elements dc
 500      Z>
 501      Z> find harry
 502      Z>
 503      Z> find dc.creator = the
 504      Z> find dc.creator = the
 505      Z> find dc.title = the
 506      Z>
 507      Z> find dc.description &lt; the
 508      Z> find dc.title &gt; some
 509      Z>
 510      Z> find dc.identifier="http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78"
 511      Z> find dc.relation = something
 512     </screen>
 513    </para>
 514
 515    <!--
 516    etc, etc. Notice that  all indexes defined by 'type="0"' in the
 517    indexing style  sheet must be searched using the 'eq'
 518    relation.
 519
 520    Z> find title <> and
 521
 522    fails as well.  ???
 523    -->
 524
 525    <tip>
 526     <para>
 527      &acro.z3950; scan using server-side CQL conversion will
 528      <emphasis>never</emphasis> work, as it is not supported by the
 529      &acro.z3950; standard.
 530      If you want to scan using server-side CQL conversion, you need to
 531      make an SRW connection using yaz-client, or an
 532      SRU connection using REST Web Services; any browser will do.
 533     </para>
 534    </tip>
 535
 536    <tip>
 537     <para>
 538      All indexes defined by 'type="0"' in the
 539      indexing stylesheet must be searched using the '@attr 4=3'
 540      structure attribute instruction.
 541     </para>
 542    </tip>
 543
 544    <para>
 545     Notice that searching and scanning on the indexes
 546     <literal>contributor</literal>, <literal>language</literal>,
 547     <literal>rights</literal>, and <literal>source</literal>
 548     might fail, simply because none of the records in the small example set
 549     have these fields set, and consequently, these indexes might not
 550     have been created.
 551    </para>
 552
 553   </sect1>
 554
 555  </chapter>
 556
 557
 558  <!-- Keep this comment at the end of the file
 559  Local variables:
 560  mode: sgml
 561  sgml-omittag:t
 562  sgml-shorttag:t
 563  sgml-minimize-attributes:nil
 564  sgml-always-quote-attributes:t
 565  sgml-indent-step:1
 566  sgml-indent-data:t
 567  sgml-parent-document: "idzebra.xml"
 568  sgml-local-catalogs: nil
 569  sgml-namecase-general:t
 570  End:
 571  -->