doc/tutorial.xml

   1  <chapter id="tutorial">
   2   <title>Tutorial</title>
   3
   4
   5   <sect1 id="tutorial-oai">
   6    <title>A first &acro.oai; indexing example</title>
   7
   8    <para>
   9     In this section, we will test the system by indexing a small set of
  10     sample &acro.oai; records that are included with the &zebra; distribution,
  11     running a &zebra; server against the newly created database, and
  12     searching the indexes with a client that connects to that server.
  13    </para>
  14    <para>
  15     Go to the <literal>examples/oai-pmh</literal> subdirectory of the
  16     distribution archive, or make a deep copy of the Debian installation
  17     directory
  18     <literal>/usr/share/idzebra-2.0-examples/oai-pmh</literal>.
  19     An XML file containing multiple &acro.oai;
  20     records is located in the subdirectory
  21     <literal>examples/oai-pmh/data</literal>.
  22    </para>
  23    <para>
  24     Additional &acro.oai; test records can be downloaded by running a
  25     shell script (you may want to abort the script if it runs longer
  26     than it takes to brew your coffee).
  27     <screen>
  28      cd data
  29      ./fetch_OAI_data.sh
  30      cd ../
  31     </screen>
  32    </para>
  33    <para>
  34     To index these &acro.oai; records, type:
  35     <screen>
  36      zebraidx-2.0 -c conf/zebra.cfg init
  37      zebraidx-2.0 -c conf/zebra.cfg update data
  38      zebraidx-2.0 -c conf/zebra.cfg commit
  39     </screen>
  40     If you have not installed &zebra; yet but have compiled the
  41     binaries from this tarball, use the following command form:
  42     <screen>
  43      ../../index/zebraidx -c conf/zebra.cfg this and that
  44     </screen>
  45     On some systems the &zebra; binaries are installed under
  46     generic names. In that case, use the following command form:
  47     <screen>
  48      zebraidx -c conf/zebra.cfg this and that
  49     </screen>
  50    </para>
  51
  52    <para>
  53     In the second command, the word <literal>update</literal> is followed
  54     by the name of a directory: <literal>zebraidx</literal> updates all
  55     files in the hierarchy rooted at <literal>data</literal>.
  56     The command option
  57     <literal>-c conf/zebra.cfg</literal> points to the proper
  58     configuration file.
  59    </para>
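        <para>
         The three command forms above differ only in the name of the
         indexer binary. As a minimal sketch (the helper names
         <literal>first_available</literal> and
         <literal>print_index_commands</literal> are hypothetical and not
         part of the &zebra; distribution), a pair of shell functions can
         pick whichever binary is available and print the matching
         indexing commands without running them:
         <screen>
          # Print the first argument that names an executable command.
          first_available() {
            for candidate in "$@"; do
              if command -v "$candidate" >/dev/null; then
                printf '%s\n' "$candidate"
                return 0
              fi
            done
            return 1
          }

          # Print (do not run) the init, update and commit commands
          # for the indexer binary given as $1.
          print_index_commands() {
            for step in init "update data" commit; do
              printf '%s -c conf/zebra.cfg %s\n' "$1" "$step"
            done
          }
         </screen>
         For example,
         <literal>print_index_commands "$(first_available zebraidx-2.0 zebraidx ../../index/zebraidx)"</literal>
         prints the three commands for the first binary found, ready to
         be reviewed and pasted into the shell.
        </para>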
  60
  61    <para>
  62     You might ask yourself how &acro.xml; content is indexed using &acro.xslt;
  63     stylesheets: to satisfy your curiosity, you might want to run the
  64     indexing transformation on an example debugging &acro.oai; record.
  65     <screen>
  66      xsltproc conf/oai2index.xsl data/debug-record.xml
  67     </screen>
  68     Here you see the &acro.oai; record transformed into the indexing
  69     &acro.xml; format. &zebra; creates several inverted indexes,
  70     and their names and types are clearly visible in the indexing
  71     &acro.xml; format.
  72    </para>
  73
  74    <para>
  75     If your indexing command was successful, you are now ready to
  76     fire up a server. To start a server on port 9999, type:
  77     <screen>
  78      zebrasrv-2.0 -c conf/zebra.cfg  @:9999
  79     </screen>
  80    </para>
  81
  82    <para>
  83     The &zebra; index that you have just created has a single database
  84     named <literal>Default</literal>.
  85     The database contains several &acro.oai; records, and the server will
  86     return records in the &acro.xml; format only. The indexing machinery
  87     split the input into individual records behind the scenes.
  88    </para>
  89
  90
  91   </sect1>
  92
  93   <sect1 id="tutorial-oai-sru-pqf">
  94    <title>Searching the &acro.oai; database by web service</title>
  95
  96    <para>
  97     &zebra; has a built-in web service, which is close to the
  98     &acro.sru; standard web service. We use it to access our new
  99     database using any &acro.xml;-enabled web browser.
 100     This service uses the &acro.pqf; query language.
 101     In a later
 102     section we show how to run a fully compliant &acro.sru; server,
 103     including support for the query language &acro.cql;.
 104    </para>
 105
 106    <para>
 107     Searching and retrieving &acro.xml; records is easy.
 108     For example, to search for the term
 109     <literal>the</literal>, point your
 110     browser at this link:
 111     <ulink
 112      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the">
 113      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the</ulink>
 114    </para>
 115
 116    <warning>
 117     <para>
 118      These URLs won't work unless you have indexed the example data
 119      and started a &zebra; server as outlined in the previous section.
 120     </para>
 121    </warning>
 122
 123    <para>
 124     To actually retrieve one record, we need to alter
 125     our URL to the following:
 126     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 127      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 128     </ulink>
 129    </para>
 130
 131    <para>
 132     This way we can page through our result set in chunks of records.
 133     For example, we access the 6th to the 10th record using the URL
 134     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=6&amp;maximumRecords=5&amp;recordSchema=dc">
 135      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=6&amp;maximumRecords=5&amp;recordSchema=dc
 136     </ulink>
 137    </para>
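        <para>
         The paging arithmetic is easy to script. As a minimal sketch,
         assuming page numbers that count from one and a fixed page size
         (the function name <literal>start_record</literal> is
         hypothetical), the following shell function computes the value of
         the <literal>startRecord</literal> parameter for a given page:
         <screen>
          # $1 = page number (counting from 1), $2 = records per page
          start_record() {
            printf '%s\n' "$(( ($1 - 1) * $2 + 1 ))"
          }

          start_record 2 5   # prints 6, the first record of the second page
         </screen>
         The page size itself is passed unchanged as the value of
         <literal>maximumRecords</literal>.
        </para>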
 138
 139    <!--
 140    relation tests:
 141
 142    <ulink url="">
 143
 144    http://localhost:9999/?version=1.1&amp;operation=searchRetrieve
 145    &amp;x-pquery=title%3Cthe
 146    -->
 147   </sect1>
 148
 149   <sect1 id="tutorial-oai-sru-present">
 150    <title>Presenting search results in different formats</title>
 151
 152    <para>
 153     &zebra; uses &acro.xslt; stylesheets for both &acro.xml; record
 154     indexing and
 155     display retrieval. In this example installation, there are two
 156     retrieval schemas defined in
 157     <literal>conf/dom-conf.xml</literal>:
 158     the <literal>dc</literal> schema implemented in
 159     <literal>conf/oai2dc.xsl</literal>, and
 160     the <literal>zebra</literal> schema implemented in
 161     <literal>conf/oai2zebra.xsl</literal>.
 162     The URLs for accessing both are the same, except for the different
 163     value of the <literal>recordSchema</literal> parameter:
 164     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 165      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 166     </ulink>
 167     and
 168     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra">
 169      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra
 170     </ulink>
 171     For the curious, running the &acro.xslt; transformations by hand
 172     shows that they really do the magic:
 173     <screen>
 174      xsltproc conf/oai2dc.xsl data/debug-record.xml
 175      xsltproc conf/oai2zebra.xsl data/debug-record.xml
 176     </screen>
 177     Notice also that the &zebra;-specific parameters are injected by
 178     the engine when retrieving data; therefore, some of the attributes
 179     in the <literal>zebra</literal> retrieval schema are not filled
 180     when running the transformation from the command line.
 181    </para>
 182
 183
 184    <para>
 185     In addition to the user-defined retrieval schemas, one can always
 186     choose from many built-in schemas. If one is only
 187     interested in the &zebra; internal metadata about a certain
 188     record, one uses the <literal>zebra::meta</literal> schema.
 189     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::meta">
 190      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::meta
 191     </ulink>
 192    </para>
 193
 194    <para>
 195     The <literal>zebra::data</literal> schema is used to retrieve the
 196     original stored &acro.oai; &acro.xml; record.
 197     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::data">
 198      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::data
 199     </ulink>
 200    </para>
 201
 202   </sect1>
 203
 204   <sect1 id="tutorial-oai-sru-searches">
 205    <title>More interesting searches</title>
 206
 207    <para>
 208     The &acro.oai; indexing example defines many different index
 209     names. A study of the <literal>conf/oai2index.xsl</literal>
 210     stylesheet reveals the following word-type indexes (i.e. those
 211     with suffix <literal>:w</literal>):
 212     <screen>
 213      any:w
 214      title:w
 215      author:w
 216      subject:w
 217      description:w
 218      contributor:w
 219      publisher:w
 220      language:w
 221      rights:w
 222     </screen>
 223     By default, searches access the <literal>any:w</literal> index,
 224     but we can direct searches to any access point by constructing the
 225     correct &acro.pqf; query. For example, to search in titles only,
 226     we use
 227     <ulink
 228      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@attr
 229      1=title the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 230      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@attr
 231      1=title the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 232     </ulink>
 233    </para>
 234
 235    <para>
 236     Similarly, we can direct searches to the other indexes defined, or we
 237     can create Boolean combinations of searches on different
 238     indexes. In this case we search for <literal>the</literal> in
 239     <literal>title</literal> and for <literal>fish</literal> in
 240     <literal>description</literal> using the query
 241     <literal>@and @attr 1=title the @attr 1=description fish</literal>.
 242     <ulink
 243      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@and
 244      @attr 1=title the
 245      @attr 1=description
 246      fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 247      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=@and
 248      @attr 1=title the
 249      @attr 1=description fish&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 250     </ulink>
 251    </para>
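        <para>
         The &acro.pqf; queries above contain spaces and
         <literal>@</literal> characters. Web browsers usually
         percent-encode such characters on their own, but other HTTP
         clients may require an already encoded URL. The following is a
         minimal sketch of such an encoder in portable shell, assuming
         ASCII-only queries (the function name
         <literal>urlencode</literal> is our own and not part of &zebra;):
         <screen>
          # Percent-encode every character except the RFC 3986
          # unreserved ones (letters, digits, '.', '_', '~', '-').
          urlencode() {
            rest=$1
            encoded=
            while [ -n "$rest" ]; do
              tail=${rest#?}           # everything after the first character
              char=${rest%"$tail"}     # the first character itself
              rest=$tail
              case $char in
                [A-Za-z0-9._~-]) encoded=$encoded$char ;;
                *) encoded=$encoded$(printf '%%%02X' "'$char") ;;
              esac
            done
            printf '%s\n' "$encoded"
          }

          urlencode '@attr 1=title the'   # prints %40attr%201%3Dtitle%20the
         </screen>
         The encoded string can then be used as the value of the
         <literal>x-pquery</literal> parameter.
        </para>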
 252
 253
 254   </sect1>
 255
 256   <sect1 id="tutorial-oai-sru-zebra-indexes">
 257    <title>Investigating the content of the indexes</title>
 258
 259    <para>
 260     How does the magic work? What is inside the indexes? Why is a certain
 261     record found by a search, and another not? The answer is in the
 262     inverted indexes. You can easily investigate them using the
 263     special &zebra; schema
 264     <literal>zebra::index::fieldname</literal>. In this example you
 265     can see that the <literal>title</literal> index has both word
 266     (type <literal>:w</literal>) and phrase (type
 267     <literal>:p</literal>)
 268     indexed fields:
 269     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::index::title">
 270      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::index::title
 271     </ulink>
 272    </para>
 273
 274    <para>
 275     But where in the indexes did the query terms match?
 276     This is easily answered with the special &zebra; schema
 277     <literal>zebra::snippet</literal>. The matching terms are
 278     encapsulated by <literal>&lt;s&gt;</literal> tags.
 279     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet">
 280      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet
 281     </ulink>
 282    </para>
 283
 284    <para>
 285     How can I refine my search? Which interesting search terms are
 286     found inside my hit set? Try the special  &zebra; schema
 287     <literal>zebra::facet::fieldname:type</literal>. In this case, we
 288     investigate additional search terms for the
 289     <literal>title:w</literal> index.
 290     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::title:w">
 291      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::title:w
 292     </ulink>
 293    </para>
 294
 295    <para>
 296     One can ask for multiple facets. Here, we want them from phrase
 297     indexes of type
 298     <literal>:p</literal>.
 299     <ulink url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::publisher:p,title:p">
 300      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;x-pquery=the&amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::facet::publisher:p,title:p
 301     </ulink>
 302    </para>
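        <para>
         As the example shows, several facets are requested by joining
         <literal>fieldname:type</literal> pairs with commas after the
         <literal>zebra::facet::</literal> prefix. A minimal sketch of a
         shell helper that assembles such a schema name (the function name
         <literal>facet_schema</literal> is hypothetical):
         <screen>
          # Join "field:type" arguments into a zebra::facet:: schema name.
          facet_schema() {
            list=
            for spec in "$@"; do
              list=${list:+$list,}$spec   # add a comma only between items
            done
            printf 'zebra::facet::%s\n' "$list"
          }

          facet_schema publisher:p title:p   # prints zebra::facet::publisher:p,title:p
         </screen>
         The output is used as the value of the
         <literal>recordSchema</literal> parameter.
        </para>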
 303
 304   </sect1>
 305
 306
 307   <sect1 id="tutorial-oai-sru-yazfrontend">
 308    <title>Setting up a correct &acro.sru; web service</title>
 309
 310    <para>
 311     The &acro.sru; specification mandates that the &acro.cql; query
 312     language is supported and properly configured. Also, the server
 313     needs to be able to emit a proper  &acro.explain; &acro.xml;
 314     record, which is used to determine the capabilities of the
 315     specific server instance.
 316    </para>
 317
 318    <para>
 319     In this example configuration we exploit the similarities between
 320     the &acro.explain; record and the &acro.cql; query language
 321     configuration: we generate the latter from the former using an
 322     &acro.xslt; transformation.
 323     <screen>
 324      xsltproc conf/explain2cqlpqftxt.xsl conf/explain.xml > conf/cql2pqf.txt
 325     </screen>
 326    </para>
 327
 328    <para>
 329     We are all set to start the &acro.sru;/&acro.z3950; server including
 330     &acro.pqf; and &acro.cql; query configuration. It uses the &yaz; frontend
 331     server configuration; just type:
 332     <screen>
 333      zebrasrv -f conf/yazserver.xml
 334     </screen>
 335    </para>
 336
 337    <para>
 338     First, we'd like to be sure that we can see the  &acro.explain;
 339     &acro.xml; response correctly. You might use either of these equivalent
 340     requests:
 341     <ulink
 342      url="http://localhost:9999">http://localhost:9999
 343     </ulink>
 344     <ulink
 345      url="http://localhost:9999/?version=1.1&amp;operation=explain">
 346      http://localhost:9999/?version=1.1&amp;operation=explain
 347     </ulink>
 348
 349    </para>
 350
 351    <para>
 352     Now we can issue true &acro.sru; requests. For example,
 353     <literal>dc.title=the
 354      and dc.description=fish</literal> results in the following page
 355     <ulink
 356      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the
 357      and dc.description=fish
 358      &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc">
 359      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the
 360      and dc.description=fish &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=dc
 361     </ulink>
 362    </para>
 363
 364    <para>
 365     Scanning indexes is also part of what an &acro.sru; server does. For example,
 366     scanning the <literal>dc.title</literal> index gives us an idea
 367     of what search terms are found there
 368     <ulink
 369      url="http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.title=fish">
 370      http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.title=fish
 371     </ulink>,
 372     whereas
 373     <ulink
 374      url="http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.identifier=fish">
 375      http://localhost:9999/?version=1.1&amp;operation=scan&amp;scanClause=dc.identifier=fish
 376     </ulink>
 377     accesses the indexed identifiers.
 378    </para>
 379
 380    <para>
 381     In addition, all &zebra; internal special element sets or record
 382     schemas of the form
 383     <literal>zebra::</literal> just work right out of the box:
 384     <ulink
 385      url="http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the
 386      and dc.description=fish
 387      &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet">
 388      http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=dc.title=the
 389      and dc.description=fish &amp;startRecord=1&amp;maximumRecords=1&amp;recordSchema=zebra::snippet
 390     </ulink>
 391    </para>
 392
 393
 394
 395   </sect1>
 396
 397
 398   <sect1 id="tutorial-oai-z3950">
 399    <title>Searching the &acro.oai; database by &acro.z3950; protocol</title>
 400
 401    <para>
 402     In this section we repeat the searches and presents we have done so
 403     far, this time using the binary &acro.z3950; protocol. You can use any
 404     &acro.z3950; client.
 405     For instance, you can use the demo command-line client that comes
 406     with &yaz;.
 407    </para>
 408    <para>
 409     Connecting to the server is done by the command
 410     <screen>
 411      yaz-client localhost:9999
 412     </screen>
 413    </para>
 414
 415    <para>
 416     When the client has connected, you can type:
 417     <screen>
 418      Z> format xml
 419      Z> querytype prefix
 420      Z> elements oai
 421      Z> find the
 422      Z> show 1+1
 423     </screen>
 424    </para>
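        <para>
         The same session can also be replayed without typing. The sketch
         below assumes that your <literal>yaz-client</literal> accepts a
         command file through its <literal>-f</literal> option; the helper
         name <literal>write_session</literal> is hypothetical. It only
         writes the command file and does not contact the server:
         <screen>
          # $1 = command file to write, $2 = element set name, $3 = PQF query
          write_session() {
            printf '%s\n' \
              'format xml' \
              'querytype prefix' \
              "elements $2" \
              "find $3" \
              'show 1+1' \
              'quit' > "$1"
          }
         </screen>
         After <literal>write_session session.txt dc the</literal>, running
         <literal>yaz-client -f session.txt localhost:9999</literal> should
         replay the search against the server started earlier.
        </para>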
 425
 426    <para>
 427     &acro.z3950; presents using presentation stylesheets:
 428     <screen>
 429      Z> elements dc
 430      Z> show 2+1
 431
 432      Z> elements zebra
 433      Z> show 3+1
 434     </screen>
 435    </para>
 436
 437    <para>
 438     &acro.z3950; built-in &zebra; presents (in this configuration only if
 439     started without yaz-frontendserver):
 440
 441     <screen>
 442      Z> elements zebra::meta
 443      Z> show 4+1
 444
 445      Z> elements zebra::meta::sysno
 446      Z> show 5+1
 447
 448      Z> format sutrs
 449      Z> show 5+1
 450      Z> format xml
 451
 452      Z> elements zebra::index
 453      Z> show 6+1
 454
 455      Z> elements zebra::snippet
 456      Z> show 7+1
 457
 458      Z> elements zebra::facet::any:w
 459      Z> show 1+1
 460
 461      Z> elements zebra::facet::publisher:p,title:p
 462      Z> show 1+1
 463     </screen>
 464    </para>
 465
 466    <para>
 467     &acro.z3950; searches targeted at specific indexes and boolean
 468     combinations of these can be issued as well.
 469
 470     <screen>
 471      Z> elements dc
 472      Z> find @attr 1=oai_identifier @attr 4=3 oai:caltechcstr.library.caltech.edu:4
 473      Z> show 1+1
 474
 475      Z> find @attr 1=oai_datestamp @attr 4=3 2001-04-20
 476      Z> show 1+1
 477
 478      Z> find @attr 1=oai_setspec @attr 4=3 7374617475733D756E707562
 479      Z> show 1+1
 480
 481      Z> find @attr 1=title communication
 482      Z> show 1+1
 483
 484      Z> find @attr 1=identifier @attr 4=3
 485      http://resolver.caltech.edu/CaltechCSTR:1986.5228-tr-86
 486      Z> show 1+1
 487     </screen>
 488     etc, etc.
 489    </para>
 490
 491    <para>
 492     &acro.z3950; scan:
 493     <screen>
 494      yaz-client localhost:9999
 495      Z> format xml
 496      Z> querytype prefix
 497      Z> scan @attr 1=oai_identifier @attr 4=3 oai
 498      Z> scan @attr 1=oai_datestamp @attr 4=3 1
 499      Z> scan @attr 1=oai_setspec @attr 4=3 2000
 500      Z>
 501      Z> scan @attr 1=title communication
 502      Z> scan @attr 1=identifier @attr 4=3 a
 503     </screen>
 504    </para>
 505
 506    <para>
 507     &acro.z3950; search using server-side CQL conversion:
 508     <screen>
 509      Z> format xml
 510      Z> querytype cql
 511      Z> elements dc
 512      Z>
 513      Z> find harry
 514      Z>
 515      Z> find dc.creator = the
 516      Z> find dc.creator = the
 517      Z> find dc.title = the
 518      Z>
 519      Z> find dc.description &lt; the
 520      Z> find dc.title &gt; some
 521      Z>
 522      Z> find dc.identifier="http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78"
 523      Z> find dc.relation = something
 524     </screen>
 525    </para>
 526
 527    <!--
 528    etc, etc. Notice that  all indexes defined by 'type="0"' in the
 529    indexing style  sheet must be searched using the 'eq'
 530    relation.
 531
 532    Z> find title <> and
 533
 534    fails as well.  ???
 535    -->
 536
 537    <tip>
 538     <para>
 539      &acro.z3950; scan using server-side &acro.cql; conversion:
 540      unfortunately, this will <emphasis>never</emphasis> work, as it is not supported by the
 541      &acro.z3950; standard.
 542      If you want to scan using server-side &acro.cql; conversion, you need to
 543      make an SRW connection using yaz-client, or an
 544      &acro.sru; connection using REST Web Services (any browser will do).
 545     </para>
 546    </tip>
 547
 548    <tip>
 549     <para>
 550      All indexes defined by <literal>type="0"</literal> in the
 551      indexing stylesheet must be searched using the <literal>@attr 4=3</literal>
 552      structure attribute instruction.
 553     </para>
 554    </tip>
 555
 556    <para>
 557     Notice that searching and scanning on the indexes
 558     <literal>contributor</literal>, <literal>language</literal>,
 559     <literal>rights</literal>, and <literal>source</literal>
 560     might fail, simply because none of the records in the small example set
 561     have these fields set, and consequently, these indexes might not
 562     have been created.
 563    </para>
 564
 565   </sect1>
 566
 567  </chapter>
 568
 569
 570  <!-- Keep this comment at the end of the file
 571  Local variables:
 572  mode: sgml
 573  sgml-omittag:t
 574  sgml-shorttag:t
 575  sgml-minimize-attributes:nil
 576  sgml-always-quote-attributes:t
 577  sgml-indent-step:1
 578  sgml-indent-data:t
 579  sgml-parent-document: "idzebra.xml"
 580  sgml-local-catalogs: nil
 581  sgml-namecase-general:t
 582  End:
 583  -->