Tutorial

A first &acro.oai; indexing example

In this section, we will test the system by indexing a small set of sample &acro.oai; records that are included with the &zebra; distribution, running a &zebra; server against the newly created database, and searching the indexes with a client that connects to that server.

Go to the examples/oai-pmh subdirectory of the distribution archive, or make a deep copy of the Debian installation directory /usr/share/idzebra-2.0-examples/oai-pmh. An XML file containing multiple &acro.oai; records is located in the subdirectory examples/oai-pmh/data.

Additional &acro.oai; test records can be downloaded by running a shell script (you may want to abort the script once it has run longer than it takes your coffee to brew):

    cd data
    ./fetch_OAI_data.sh
    cd ../

To index these &acro.oai; records, type:

    zebraidx-2.0 -c conf/zebra.cfg init
    zebraidx-2.0 -c conf/zebra.cfg update data
    zebraidx-2.0 -c conf/zebra.cfg commit

If you have not installed &zebra; yet but have compiled the binaries from this tarball, use the following command form, where this and that stands for the subcommand and its arguments:

    ../../index/zebraidx -c conf/zebra.cfg this and that

On some systems the &zebra; binaries are installed under generic names; there you need to use the following command form:

    zebraidx -c conf/zebra.cfg this and that

In the update command, the word update is followed by the name of a directory: zebraidx updates all files in the hierarchy rooted at data. The command option -c conf/zebra.cfg points to the proper configuration file.

You might ask yourself how &acro.xml; content is indexed using &acro.xslt; stylesheets. To satisfy your curiosity, you can run the indexing transformation on an example debugging &acro.oai; record:

    xsltproc conf/oai2index.xsl data/debug-record.xml

Here you see the &acro.oai; record transformed into the indexing &acro.xml; format. &zebra; creates several inverted indexes, and their names and types are clearly visible in the indexing &acro.xml; format.
If your indexing command was successful, you are now ready to fire up a server. To start a server on port 9999, type:

    zebrasrv-2.0 -c conf/zebra.cfg @:9999

The &zebra; index that you have just created has a single database named Default. The database contains several &acro.oai; records, and the server will return records in the &acro.xml; format only. The indexer split the input file into individual records behind the scenes.

Searching the &acro.oai; database by web service

&zebra; has a built-in web service, which is close to the &acro.sru; standard web service. We use it to access our new database with any &acro.xml;-enabled web browser. This service uses the &acro.pqf; query language. In a later section we show how to run a fully compliant &acro.sru; server, including support for the query language &acro.cql;.

Searching and retrieving &acro.xml; records is easy. For example, to search for the term the, point your browser at this URL:

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the

The URLs in this tutorial will not work unless you have indexed the example data and started a &zebra; server as outlined in the previous section.

To actually retrieve one record, we alter the URL as follows:

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc

This way we can page through our result set in chunks of records. For example, we access the 6th to the 10th record using the URL

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=6&maximumRecords=5&recordSchema=dc

Presenting search results in different formats

&zebra; uses &acro.xslt; stylesheets for both &acro.xml; record indexing and display retrieval.
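The search and paging URLs shown so far differ only in their query parameters, so they can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the &zebra; distribution: the function names `search_url` and `page_urls` and the `hits` argument (the total number of hits, which you would read from a previous response) are hypothetical and chosen for illustration. It assumes the server from this section is listening on localhost:9999.

```python
from urllib.parse import quote, urlencode

# Assumption: the server was started as shown above, on port 9999.
BASE_URL = "http://localhost:9999/"


def search_url(pqf, start=None, count=None, schema=None, base=BASE_URL):
    """Build a searchRetrieve URL that carries a PQF query in x-pquery."""
    params = {"version": "1.1", "operation": "searchRetrieve", "x-pquery": pqf}
    if start is not None:
        params["startRecord"] = start
    if count is not None:
        params["maximumRecords"] = count
    if schema is not None:
        params["recordSchema"] = schema
    # quote_via=quote percent-encodes spaces as %20 instead of '+'.
    return base + "?" + urlencode(params, quote_via=quote)


def page_urls(pqf, hits, page_size, schema="dc"):
    """Yield one URL per chunk of records; record positions start at 1."""
    for start in range(1, hits + 1, page_size):
        count = min(page_size, hits - start + 1)
        yield search_url(pqf, start, count, schema)


if __name__ == "__main__":
    # Records 6 to 10, as in the paging example above.
    print(search_url("the", start=6, count=5, schema="dc"))
    # All pages of a hypothetical 12-hit result set, five records at a time.
    for url in page_urls("the", hits=12, page_size=5):
        print(url)
```

Changing only the `schema` argument produces the URLs for the different retrieval schemas discussed in this section.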
In this example installation, there are two retrieval schemas defined in conf/dom-conf.xml: the dc schema implemented in conf/oai2dc.xsl, and the zebra schema implemented in conf/oai2zebra.xsl. The URLs for accessing both are the same, except for the different value of the recordSchema parameter:

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc

and

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra

For the curious, one can see that the &acro.xslt; transformations really do the magic:

    xsltproc conf/oai2dc.xsl data/debug-record.xml
    xsltproc conf/oai2zebra.xsl data/debug-record.xml

Notice also that the &zebra;-specific parameters are injected by the engine when retrieving data. Therefore some of the attributes in the zebra retrieval schema are not filled when running the transformation from the command line.

In addition to the user-defined retrieval schemas, one can always choose from many built-in schemas. If one is only interested in the &zebra; internal metadata about a certain record, one uses the zebra::meta schema:

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::meta

The zebra::data schema is used to retrieve the original stored &acro.oai; &acro.xml; record:

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::data

More interesting searches

The &acro.oai; indexing example defines many different index names. A study of the conf/oai2index.xsl stylesheet reveals the following word type indexes (i.e. those with suffix :w):

    any:w
    title:w
    author:w
    subject:w
    description:w
    contributor:w
    publisher:w
    language:w
    rights:w

By default, searches access the any:w index, but we can direct searches to any access point by constructing the correct &acro.pqf; query.
For example, to search in titles only, we use

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr 1=title the&startRecord=1&maximumRecords=1&recordSchema=dc

Similarly, we can direct searches to the other indexes defined, or create boolean combinations of searches on different indexes. In this case we search for the in title and for fish in description using the query @and @attr 1=title the @attr 1=description fish.

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@and @attr 1=title the @attr 1=description fish&startRecord=1&maximumRecords=1&recordSchema=dc

Investigating the content of the indexes

How does the magic work? What is inside the indexes? Why is a certain record found by a search, and another not? The answer is in the inverted indexes. You can easily investigate them using the special &zebra; schema zebra::index::fieldname. In this example you can see that the title index has both word (type :w) and phrase (type :p) indexed fields:

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::index::title

But where in the indexes did the term match for the query occur? This is easily answered with the special &zebra; schema zebra::snippet. The matching terms are enclosed in <s> tags.

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::snippet

How can I refine my search? Which interesting search terms are found inside my hit set? Try the special &zebra; schema zebra::facet::fieldname:type. In this case, we investigate additional search terms for the title:w index.

    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::title:w

One can ask for multiple facets. Here, we want them from phrase indexes of type :p.
    http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::publisher:p,title:p

Setting up a correct &acro.sru; web service

The &acro.sru; specification mandates that the &acro.cql; query language is supported and properly configured. Also, the server needs to be able to emit a proper &acro.explain; &acro.xml; record, which is used to determine the capabilities of the specific server instance.

In this example configuration we exploit the similarities between the &acro.explain; record and the &acro.cql; query language configuration: we generate the latter from the former using an &acro.xslt; transformation.

    xsltproc conf/explain2cqlpqftxt.xsl conf/explain.xml > conf/cql2pqf.txt

We are all set to start the &acro.sru;/&acro.z3950; server including &acro.pqf; and &acro.cql; query configuration. It uses the &yaz; frontend server configuration; just type

    zebrasrv -f conf/yazserver.xml

First, we would like to be sure that we can see the &acro.explain; &acro.xml; response correctly. You can use either of these equivalent requests:

    http://localhost:9999
    http://localhost:9999/?version=1.1&operation=explain

Now we can issue true &acro.sru; requests. For example, dc.title=the and dc.description=fish results in the following page:

    http://localhost:9999/?version=1.1&operation=searchRetrieve&query=dc.title=the and dc.description=fish&startRecord=1&maximumRecords=1&recordSchema=dc

Scanning of indexes is also part of the &acro.sru; server business. For example, scanning the dc.title index gives us an idea of what search terms are found there

    http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.title=fish

whereas

    http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.identifier=fish

accesses the indexed identifiers.
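The &acro.cql; queries above contain spaces and equals signs, which a browser percent-encodes silently but which a script has to encode itself. The following Python sketch builds the explain, searchRetrieve, and scan requests from this section with the standard library. The helper name `sru_url` is hypothetical, and the sketch assumes the server was started with conf/yazserver.xml on localhost:9999 as described above.

```python
from urllib.parse import quote, urlencode

# Assumption: zebrasrv -f conf/yazserver.xml is listening on port 9999.
BASE_URL = "http://localhost:9999/"


def sru_url(operation, base=BASE_URL, **params):
    """Build an SRU 1.1 request URL with percent-encoded parameter values."""
    fields = {"version": "1.1", "operation": operation}
    fields.update(params)
    # quote_via=quote encodes a space as %20 and '=' inside a value as %3D.
    return base + "?" + urlencode(fields, quote_via=quote)


if __name__ == "__main__":
    print(sru_url("explain"))
    print(sru_url(
        "searchRetrieve",
        query="dc.title=the and dc.description=fish",
        startRecord=1,
        maximumRecords=1,
        recordSchema="dc",
    ))
    print(sru_url("scan", scanClause="dc.title=fish"))
```

The encoded form of the search request is the same request as the human-readable URL shown above; only the characters inside the query value are escaped.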
In addition, all &zebra; internal special element sets or record schemas of the form zebra:: just work right out of the box:

    http://localhost:9999/?version=1.1&operation=searchRetrieve&query=dc.title=the and dc.description=fish&startRecord=1&maximumRecords=1&recordSchema=zebra::snippet

Searching the &acro.oai; database by &acro.z3950; protocol

In this section we repeat the searches and presents done so far, this time using the binary &acro.z3950; protocol. You can use any &acro.z3950; client. For instance, you can use the demo command-line client that comes with &yaz;.

Connecting to the server is done by the command

    yaz-client localhost:9999

When the client has connected, you can type:

    Z> format xml
    Z> querytype prefix
    Z> elements oai
    Z> find the
    Z> show 1+1

&acro.z3950; presents using presentation stylesheets:

    Z> elements dc
    Z> show 2+1
    Z> elements zebra
    Z> show 3+1

&acro.z3950; built-in &zebra; presents (in this configuration, only if the server is started without the &yaz; frontend server):

    Z> elements zebra::meta
    Z> show 4+1
    Z> elements zebra::meta::sysno
    Z> show 5+1
    Z> format sutrs
    Z> show 5+1
    Z> format xml
    Z> elements zebra::index
    Z> show 6+1
    Z> elements zebra::snippet
    Z> show 7+1
    Z> elements zebra::facet::any:w
    Z> show 1+1
    Z> elements zebra::facet::publisher:p,title:p
    Z> show 1+1

&acro.z3950; searches targeted at specific indexes, and boolean combinations of these, can be issued as well:

    Z> elements dc
    Z> find @attr 1=oai_identifier @attr 4=3 oai:caltechcstr.library.caltech.edu:4
    Z> show 1+1
    Z> find @attr 1=oai_datestamp @attr 4=3 2001-04-20
    Z> show 1+1
    Z> find @attr 1=oai_setspec @attr 4=3 7374617475733D756E707562
    Z> show 1+1
    Z> find @attr 1=title communication
    Z> show 1+1
    Z> find @attr 1=identifier @attr 4=3 http://resolver.caltech.edu/CaltechCSTR:1986.5228-tr-86
    Z> show 1+1

and so on.
&acro.z3950; scan:

    yaz-client localhost:9999
    Z> format xml
    Z> querytype prefix
    Z> scan @attr 1=oai_identifier @attr 4=3 oai
    Z> scan @attr 1=oai_datestamp @attr 4=3 1
    Z> scan @attr 1=oai_setspec @attr 4=3 2000
    Z>
    Z> scan @attr 1=title communication
    Z> scan @attr 1=identifier @attr 4=3 a

&acro.z3950; search using server-side &acro.cql; conversion:

    Z> format xml
    Z> querytype cql
    Z> elements dc
    Z>
    Z> find harry
    Z>
    Z> find dc.creator = the
    Z> find dc.title = the
    Z>
    Z> find dc.description < the
    Z> find dc.title > some
    Z>
    Z> find dc.identifier="http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78"
    Z> find dc.relation = something

&acro.z3950; scan using server-side &acro.cql; conversion: unfortunately, this will never work, as it is not supported by the &acro.z3950; standard. If you want to scan using server-side &acro.cql; conversion, you need to make an SRW connection using yaz-client, or an &acro.sru; connection using REST web services, for which any browser will do.

All indexes defined by 'type="0"' in the indexing stylesheet must be searched using the '@attr 4=3' structure attribute instruction.

Notice that searching and scanning on the indexes contributor, language, rights, and source might fail, simply because none of the records in the small example set have these fields set, and consequently these indexes might not have been created.
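The &acro.pqf; queries used throughout this tutorial follow a regular pattern: an '@attr 1=' index selector, an optional '@attr 4=3' structure attribute for the 'type="0"' indexes, and the term. As an illustration only, the following Python sketch assembles such query strings. The helper names `pqf_term` and `pqf_and` are hypothetical, and the quoting rule (wrap a term that contains whitespace in double quotes, reject terms that already contain a double quote) is a simplifying assumption rather than a full treatment of &acro.pqf; escaping.

```python
def pqf_term(index, term, exact=False):
    """Return a PQF query for one term on a named index.

    exact=True adds the '@attr 4=3' structure attribute, which this
    tutorial requires for indexes declared with type="0".
    """
    if '"' in term:
        # Simplifying assumption: escaping embedded quotes is out of scope.
        raise ValueError("terms containing double quotes are not supported")
    parts = ["@attr 1=" + index]
    if exact:
        parts.append("@attr 4=3")
    # A term with whitespace must be quoted to stay a single PQF term.
    if any(ch.isspace() for ch in term):
        term = '"' + term + '"'
    parts.append(term)
    return " ".join(parts)


def pqf_and(left, right):
    """Combine two PQF queries with the boolean @and operator."""
    return "@and " + left + " " + right


if __name__ == "__main__":
    print(pqf_term("oai_datestamp", "2001-04-20", exact=True))
    print(pqf_and(pqf_term("title", "the"), pqf_term("description", "fish")))
```

The resulting strings can be pasted after find or scan in a yaz-client session with querytype prefix, or percent-encoded into the x-pquery parameter of the web service.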