Tutorial
A first &acro.oai; indexing example
In this section, we will test the system by indexing a small set of
sample &acro.oai; records that are included with the &zebra; distribution,
running a &zebra; server against the newly created database, and
searching the indexes with a client that connects to that server.
Go to the examples/oai-pmh subdirectory of the
distribution archive, or make a deep copy of the Debian installation
directory
/usr/share/idzebra-2.0-examples/oai-pmh.
An &acro.xml; file containing multiple &acro.oai;
records is located in the subdirectory
examples/oai-pmh/data.
Additional &acro.oai; test records can be downloaded by running a shell
script (feel free to abort the script once you have waited
longer than it takes your coffee to brew).
cd data
./fetch_OAI_data.sh
cd ../
To index these &acro.oai; records, type:
zebraidx-2.0 -c conf/zebra.cfg init
zebraidx-2.0 -c conf/zebra.cfg update data
zebraidx-2.0 -c conf/zebra.cfg commit
If you have not installed &zebra; yet but have compiled the
binaries from this tarball, use the following command form, where
this and that stands for the arguments shown above
(init, update data, or commit):
../../index/zebraidx -c conf/zebra.cfg this and that
On some systems the &zebra; binaries are installed under their
generic names. In that case, use the following command form:
zebraidx -c conf/zebra.cfg this and that
In this command, the word update is followed
by the name of a directory: zebraidx updates all
files in the hierarchy rooted at data.
The command option
-c conf/zebra.cfg points to the proper
configuration file.
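The three indexing steps above can also be scripted. The following is a minimal sketch in Python, not part of the &zebra; distribution: the function names zebraidx_commands and run_all are made up for illustration, and the binary name defaults to zebraidx-2.0 as used in this section (pass zebraidx or ../../index/zebraidx if that is what your system provides).

```python
import shlex
import subprocess


def zebraidx_commands(config="conf/zebra.cfg", data_dir="data",
                      binary="zebraidx-2.0"):
    """Return the init, update and commit commands as argument lists."""
    base = [binary, "-c", config]
    return [
        base + ["init"],              # create an empty index
        base + ["update", data_dir],  # index every file below data_dir
        base + ["commit"],            # same commit step as in the tutorial
    ]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_all() to actually execute them.
    for argv in zebraidx_commands():
        print(shlex.join(argv))
```

Printing the commands first makes it easy to check the binary name and paths before anything touches the index.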
You might ask yourself how &acro.xml; content is indexed using &acro.xslt;
stylesheets: to satisfy your curiosity, you might want to run the
indexing transformation on an example debugging &acro.oai; record.
xsltproc conf/oai2index.xsl data/debug-record.xml
Here you see the &acro.oai; record transformed into the indexing
&acro.xml; format. &zebra; creates several inverted indexes,
and their names and types are clearly visible in the indexing
&acro.xml; format.
If your indexing command was successful, you are now ready to
fire up a server. To start a server on port 9999, type:
zebrasrv-2.0 -c conf/zebra.cfg @:9999
The &zebra; index that you have just created has a single database
named Default.
The database contains several &acro.oai; records, and the server will
return records in the &acro.xml; format only. The indexing machinery
split the input into individual records behind the scenes.
Searching the &acro.oai; database by web service
&zebra; has a built-in web service, which is close to the
&acro.sru; standard web service. We use it to access our new
database using any &acro.xml;-enabled web browser.
This service uses the &acro.pqf; query language.
In a later
section we show how to run a fully compliant &acro.sru; server,
including support for the query language &acro.cql;.
Searching and retrieving &acro.xml; records is easy. For example,
to search for the term the, point your
browser at this link:
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the
These URLs won't work unless you have indexed the example data
and started a &zebra; server as outlined in the previous section.
To actually retrieve a record, we need to alter
our URL to the following:
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc
This way we can page through our result set in chunks of records.
For example, we access the 6th to the 10th record using the URL
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=6&maximumRecords=5&recordSchema=dc
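The paging pattern above can be sketched in Python. This is only an illustration, assuming the server from the previous section is listening on localhost:9999; the helper names search_url and page_urls are hypothetical and not part of &zebra;.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9999/"  # server started earlier in this tutorial


def search_url(pquery, start=1, count=1, schema="dc"):
    """Build a searchRetrieve URL that uses the x-pquery parameter."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "x-pquery": pquery,
        "startRecord": start,      # 1-based position of the first record
        "maximumRecords": count,   # size of the chunk to return
        "recordSchema": schema,
    }
    return BASE_URL + "?" + urlencode(params)


def page_urls(pquery, total, page_size=5):
    """Return one URL per chunk covering records 1..total."""
    return [
        search_url(pquery, start, min(page_size, total - start + 1))
        for start in range(1, total + 1, page_size)
    ]


if __name__ == "__main__":
    for url in page_urls("the", total=12):
        print(url)
```

The second URL produced by page_urls("the", total=12) is the one shown above for records 6 to 10; the last chunk is shortened so that it never asks for records beyond the total.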
Presenting search results in different formats
&zebra; uses &acro.xslt; stylesheets for both &acro.xml; record
indexing and
display retrieval. In this example installation, there are two
retrieval schemas defined in
conf/dom-conf.xml:
the dc schema implemented in
conf/oai2dc.xsl, and
the zebra schema implemented in
conf/oai2zebra.xsl.
The URLs for accessing both are the same, except for the different
value of the recordSchema parameter:
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc
and
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra
For the curious: to see that the &acro.xslt; transformations
really do the magic, run them from the command line.
xsltproc conf/oai2dc.xsl data/debug-record.xml
xsltproc conf/oai2zebra.xsl data/debug-record.xml
Notice also that the &zebra;-specific parameters are injected by
the engine when retrieving data; therefore, some of the attributes
in the zebra retrieval schema are not filled in
when running the transformation from the command line.
In addition to the user-defined retrieval schemas, one can always
choose from many built-in schemas. If one is only
interested in the &zebra; internal metadata about a certain
record, one uses the zebra::meta schema.
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::meta
The zebra::data schema is used to retrieve the
original stored &acro.oai; &acro.xml; record.
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::data
More interesting searches
The &acro.oai; indexing example defines many different index
names. A study of the conf/oai2index.xsl
stylesheet reveals the following word type indexes (i.e. those
with suffix :w):
any:w
title:w
author:w
subject:w
description:w
contributor:w
publisher:w
language:w
rights:w
By default, searches access the any:w index,
but we can direct searches to any access point by constructing the
correct &acro.pqf; query. For example, to search in titles only,
we use
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr%201=title%20the&startRecord=1&maximumRecords=1&recordSchema=dc
Similarly, we can direct searches to the other indexes defined, or we
can create boolean combinations of searches on different
indexes. In this case we search for the in
title and for fish in
description using the query
@and @attr 1=title the @attr 1=description fish.
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@and%20@attr%201=title%20the%20@attr%201=description%20fish&startRecord=1&maximumRecords=1&recordSchema=dc
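Because &acro.pqf; queries contain spaces and punctuation, they have to be percent-encoded before they go into a URL. The sketch below shows one way to assemble and encode the queries discussed above. The helpers attr_term, pqf_and and pquery_url are hypothetical names chosen for this example, and the quoting of multi-word terms is a simplification.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:9999/"  # server address used in this tutorial


def attr_term(index, term):
    """PQF for one term on one access point, e.g. '@attr 1=title the'."""
    if " " in term:
        term = f'"{term}"'  # keep multi-word terms together as one term
    return f"@attr 1={index} {term}"


def pqf_and(*clauses):
    """Combine clauses with the prefix operator @and.

    @and takes exactly two operands, so more clauses are nested:
    '@and @and A B C'.
    """
    query = clauses[0]
    for clause in clauses[1:]:
        query = f"@and {query} {clause}"
    return query


def pquery_url(pquery, schema="dc"):
    """searchRetrieve URL with the PQF query safely percent-encoded."""
    # quote() with safe='' also encodes '@' and '=', which is more than
    # browsers do for spaces alone, but the server decodes both forms.
    return (f"{BASE_URL}?version=1.1&operation=searchRetrieve"
            f"&x-pquery={quote(pquery, safe='')}"
            f"&startRecord=1&maximumRecords=1&recordSchema={schema}")


if __name__ == "__main__":
    query = pqf_and(attr_term("title", "the"),
                    attr_term("description", "fish"))
    print(query)
    print(pquery_url(query))
```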
Investigating the content of the indexes
How does the magic work? What is inside the indexes? Why is a certain
record found by a search, and another not? The answer is in the
inverted indexes. You can easily investigate them using the
special &zebra; schema
zebra::index::fieldname. In this example you
can see that the title index has both word
(type :w) and phrase (type
:p)
indexed fields:
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::index::title
But where in the indexes did the term match for the query occur?
This is easily answered with the special &zebra; schema
zebra::snippet. The matching terms are
enclosed in <s> tags.
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::snippet
How can I refine my search? Which interesting search terms are
found inside my hit set? Try the special &zebra; schema
zebra::facet::fieldname:type. In this case, we
investigate additional search terms for the
title:w index.
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::title:w
One can ask for multiple facets. Here, we want them from phrase
indexes of type :p.
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::publisher:p,title:p
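The special schema names used in this section follow a regular pattern, which the following sketch makes explicit. The function names are hypothetical and only the name patterns shown above (zebra::index::fieldname and zebra::facet::fieldname:type, with commas between multiple facets) are assumed.

```python
def index_schema(field):
    """Schema name that shows the indexed terms of one field."""
    return f"zebra::index::{field}"


def facet_schema(*indexes):
    """Schema name for one or more facets.

    Each index is a (field, type) pair such as ("title", "w") for the
    word index or ("title", "p") for the phrase index.
    """
    if not indexes:
        raise ValueError("at least one (field, type) pair is required")
    return "zebra::facet::" + ",".join(f"{field}:{kind}"
                                       for field, kind in indexes)


if __name__ == "__main__":
    print(index_schema("title"))
    print(facet_schema(("title", "w")))
    print(facet_schema(("publisher", "p"), ("title", "p")))
```

The resulting string is used as the value of the recordSchema parameter in the URLs above.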
Setting up a correct &acro.sru; web service
The &acro.sru; specification mandates that the &acro.cql; query
language is supported and properly configured. Also, the server
needs to be able to emit a proper &acro.explain; &acro.xml;
record, which is used to determine the capabilities of the
specific server instance.
In this example configuration we exploit the similarities between
the &acro.explain; record and the &acro.cql; query language
configuration: we generate the latter from the former using an
&acro.xslt; transformation.
xsltproc conf/explain2cqlpqftxt.xsl conf/explain.xml > conf/cql2pqf.txt
We are all set to start the &acro.sru;/&acro.z3950; server including
&acro.pqf; and &acro.cql; query configuration. It uses the &yaz; frontend
server configuration. Just type
zebrasrv -f conf/yazserver.xml
First, we'd like to be sure that we can see the &acro.explain;
&acro.xml; response correctly. You might use either of these equivalent
requests:
http://localhost:9999
http://localhost:9999/?version=1.1&operation=explain
Now we can issue true &acro.sru; requests. For example, the &acro.cql; query
dc.title=the
and dc.description=fish is sent with the following URL:
http://localhost:9999/?version=1.1&operation=searchRetrieve&query=dc.title=the%20and%20dc.description=fish&startRecord=1&maximumRecords=1&recordSchema=dc
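A &acro.cql; query needs the same care as a &acro.pqf; query when it is placed in a URL. The sketch below builds simple index=term clauses, quotes terms that contain anything other than letters and digits, and encodes the result. The helper names are hypothetical, and only the simple "and" combination used in this section is covered.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9999/"  # server started with conf/yazserver.xml


def cql_clause(index, term):
    """Return 'index=term', double-quoting the term when necessary."""
    if not term.isalnum():
        # Inside a quoted CQL string, backslash escapes the next character.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        term = f'"{escaped}"'
    return f"{index}={term}"


def cql_and(*clauses):
    """Join clauses with the CQL boolean operator 'and'."""
    return " and ".join(clauses)


def sru_search_url(cql, start=1, count=1, schema="dc"):
    """searchRetrieve URL carrying a CQL query in the 'query' parameter."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql,
        "startRecord": start,
        "maximumRecords": count,
        "recordSchema": schema,
    }
    # urlencode() turns spaces into '+' and '=' into '%3D'.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    query = cql_and(cql_clause("dc.title", "the"),
                    cql_clause("dc.description", "fish"))
    print(query)
    print(sru_search_url(query))
```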
Scanning indexes is also part of the &acro.sru; server's business. For example,
scanning the dc.title index gives us an idea
of which search terms are found there:
http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.title=fish
whereas
http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.identifier=fish
accesses the indexed identifiers.
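Scan requests differ from search requests only in the operation name and in carrying a scanClause instead of a query. A small sketch, with the hypothetical helper name scan_url, assuming the same server address as above:

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9999/"  # same server as in the requests above


def scan_url(index, term):
    """Build an SRU scan URL that starts scanning 'index' at 'term'."""
    params = {
        "version": "1.1",
        "operation": "scan",
        "scanClause": f"{index}={term}",
    }
    # safe="=" leaves the '=' inside the scan clause readable, as in the
    # URLs shown in this section; other special characters are encoded.
    return BASE_URL + "?" + urlencode(params, safe="=")


if __name__ == "__main__":
    for index in ("dc.title", "dc.identifier"):
        print(scan_url(index, "fish"))
```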
In addition, all &zebra; internal special element sets or record
schemas of the form
zebra:: work right out of the box:
http://localhost:9999/?version=1.1&operation=searchRetrieve&query=dc.title=the%20and%20dc.description=fish&startRecord=1&maximumRecords=1&recordSchema=zebra::snippet
Searching the &acro.oai; database by &acro.z3950; protocol
In this section we repeat the searches and presents we have done so
far, this time using the binary &acro.z3950; protocol. You can use any
&acro.z3950; client.
For instance, you can use the demo command-line client that comes
with &yaz;.
Connecting to the server is done by the command
yaz-client localhost:9999
When the client has connected, you can type:
Z> format xml
Z> querytype prefix
Z> elements oai
Z> find the
Z> show 1+1
&acro.z3950; presents using presentation stylesheets:
Z> elements dc
Z> show 2+1
Z> elements zebra
Z> show 3+1
&acro.z3950; built-in &zebra; presents (in this configuration, these work only if
the server is started without the &yaz; frontend server):
Z> elements zebra::meta
Z> show 4+1
Z> elements zebra::meta::sysno
Z> show 5+1
Z> format sutrs
Z> show 5+1
Z> format xml
Z> elements zebra::index
Z> show 6+1
Z> elements zebra::snippet
Z> show 7+1
Z> elements zebra::facet::any:w
Z> show 1+1
Z> elements zebra::facet::publisher:p,title:p
Z> show 1+1
&acro.z3950; searches targeted at specific indexes and boolean
combinations of these can be issued as well.
Z> elements dc
Z> find @attr 1=oai_identifier @attr 4=3 oai:caltechcstr.library.caltech.edu:4
Z> show 1+1
Z> find @attr 1=oai_datestamp @attr 4=3 2001-04-20
Z> show 1+1
Z> find @attr 1=oai_setspec @attr 4=3 7374617475733D756E707562
Z> show 1+1
Z> find @attr 1=title communication
Z> show 1+1
Z> find @attr 1=identifier @attr 4=3 http://resolver.caltech.edu/CaltechCSTR:1986.5228-tr-86
Z> show 1+1
and so on.
&acro.z3950; scan:
yaz-client localhost:9999
Z> format xml
Z> querytype prefix
Z> scan @attr 1=oai_identifier @attr 4=3 oai
Z> scan @attr 1=oai_datestamp @attr 4=3 1
Z> scan @attr 1=oai_setspec @attr 4=3 2000
Z>
Z> scan @attr 1=title communication
Z> scan @attr 1=identifier @attr 4=3 a
&acro.z3950; search using server-side CQL conversion:
Z> format xml
Z> querytype cql
Z> elements dc
Z>
Z> find harry
Z>
Z> find dc.creator = the
Z> find dc.title = the
Z>
Z> find dc.description < the
Z> find dc.title > some
Z>
Z> find dc.identifier="http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78"
Z> find dc.relation = something
&acro.z3950; scan using server-side &acro.cql; conversion:
unfortunately, this will never work, as it is not supported by the
&acro.z3950; standard.
If you want to scan using server-side &acro.cql; conversion, you need to
make an SRW connection using yaz-client, or an
&acro.sru; connection using REST Web Services; any browser will do.
All indexes defined by 'type="0"' in the
indexing stylesheet must be searched using the '@attr 4=3'
structure attribute instruction.
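The rule above is easy to forget when typing queries by hand. The following sketch generates yaz-client find commands and adds '@attr 4=3' where required. The set TYPE_ZERO_INDEXES is an assumption taken from the example sessions above, not read from the stylesheet; check conf/oai2index.xsl for the authoritative list. The function name find_command is hypothetical.

```python
# Indexes that the sessions above search with '@attr 4=3'.
# Assumption for illustration; verify against conf/oai2index.xsl.
TYPE_ZERO_INDEXES = {
    "oai_identifier",
    "oai_datestamp",
    "oai_setspec",
    "identifier",
}


def find_command(index, term):
    """Return a yaz-client 'find' command for a PQF search on one index."""
    attrs = f"@attr 1={index}"
    if index in TYPE_ZERO_INDEXES:
        attrs += " @attr 4=3"  # structure attribute for type "0" indexes
    return f"find {attrs} {term}"


if __name__ == "__main__":
    print(find_command("oai_datestamp", "2001-04-20"))
    print(find_command("title", "communication"))
```

The printed lines can be typed at the Z> prompt after format xml and querytype prefix have been set.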
Notice that searching and scanning on the indexes
contributor, language,
rights, and source
might fail, simply because none of the records in the small example set
have these fields set, and consequently, these indexes might not
have been created.