X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Ftutorial.xml;h=8d79bd866b5deaf53142f9ce14f3f1354ccbca29;hp=b336ac188f719dea56cb057ab0736b250b635b63;hb=dcda88860b03641b6900d43135ca769f005105e8;hpb=fd448e894bd9a4b3caed66542c017706fee83712 diff --git a/doc/tutorial.xml b/doc/tutorial.xml index b336ac1..8d79bd8 100644 --- a/doc/tutorial.xml +++ b/doc/tutorial.xml @@ -1,352 +1,572 @@ - - - Tutorial - - - - A first &acro.oai; indexing example - - - In this section, we will test the system by indexing a small set of - sample &acro.oai; records that are included with the &zebra; distribution, - running a &zebra; server against the newly created database, and - searching the indexes with a client that connects to that server. - - - Go to the examples/oai-pmh subdirectory of the - distribution archive, or make a deep copy of the Debian installation - directory - /usr/share/idzebra-2.0.-examples/oai-pmh. - An XML file containing multiple &acro.oai; - records is located in the sub - directory examples/oai-pmh/data. To index these, type: - - zebraidx -c conf/zebra.cfg init - zebraidx -c conf/zebra.cfg update data/oai-caltech.xml - zebraidx -c conf/zebra.cfg commit - - In case you have not installed zebra yet but have compiled the + + Tutorial + + + + A first &acro.oai; indexing example + + + In this section, we will test the system by indexing a small set of + sample &acro.oai; records that are included with the &zebra; distribution, + running a &zebra; server against the newly created database, and + searching the indexes with a client that connects to that server. + + + Go to the examples/oai-pmh subdirectory of the + distribution archive, or make a deep copy of the Debian installation + directory + /usr/share/idzebra-2.0-examples/oai-pmh. + An XML file containing multiple &acro.oai; + records is located in the sub + directory examples/oai-pmh/data. 
+ + + Additional OAI test records can be downloaded by running a shell + script (you may want to abort the script when you have waited + longer than your coffee takes to brew). + + cd data + ./fetch_OAI_data.sh + cd ../ + + + + To index these &acro.oai; records, type: + + zebraidx-2.0 -c conf/zebra.cfg init + zebraidx-2.0 -c conf/zebra.cfg update data + zebraidx-2.0 -c conf/zebra.cfg commit + + In case you have not installed zebra yet but have compiled the binaries from this tarball, use the following command form: - - ../../index/zebraidx -c conf/zebra.cfg this and that - - - - - In this command, the word update is followed - by the name of a directory: zebraidx updates all - files in the hierarchy rooted at that directory. The command option - -c conf/zebra.cfg points to the proper - configuration file. - - - - You might ask yourself how &acro.xml; content is indexed using &acro.xslt; - stylesheets: to satisfy your curiosity, you might want to run the - indexing transformation on an example debugging &acro.oai; record. - - xsltproc conf/oai2index.xsl data/debug-record.xml - + + ../../index/zebraidx -c conf/zebra.cfg this and that + + On some systems the &zebra; binaries are installed under the + generic names; in that case, use the following command form: + + zebraidx -c conf/zebra.cfg this and that + + + + + In this command, the word update is followed + by the name of a directory: zebraidx updates all + files in the hierarchy rooted at data. + The command option + -c conf/zebra.cfg points to the proper + configuration file. + + + + You might ask yourself how &acro.xml; content is indexed using &acro.xslt; + stylesheets: to satisfy your curiosity, you might want to run the + indexing transformation on an example debugging &acro.oai; record. + + xsltproc conf/oai2index.xsl data/debug-record.xml + Here you see the &acro.oai; record transformed into the indexing &acro.xml; format. 
&zebra; is creating several inverted indexes, and their names and types are clearly visible in the indexing &acro.xml; format. - - - - If your indexing command was successful, you are now ready to - fire up a server. To start a server on port 9999, type: - - zebrasrv -c conf/zebra.cfg @:9999 - - - - - The &zebra; index that you have just created has a single database - named Default. - The database contains several &acro.oai; records, and the server will - return records in the &acro.xml; format only. The indexing machine - di the splitting into individual records just behind the scenes. - - - - To test the server, you can use any &acro.z3950; client. - For instance, you can use the demo command-line client that comes - with &yaz;; we start the SRU/SRW/Z39.50 server in PQF mode only: - - - - yaz-client localhost:9999 - - - - - When the client has connected, you can type: - - - - - Z> format xml - Z> elements oai - Z> find the - Z> show 1+1 - - - - + + + + Presenting search results in different formats + + + &zebra; uses &acro.xslt; stylesheets for both &acro.xml; record + indexing and + display retrieval. In this example installation, there are two + retrieval schemas defined in + conf/dom-conf.xml: + the dc schema implemented in + conf/oai2dc.xsl, and + the zebra schema implemented in + conf/oai2zebra.xsl. + The URLs for accessing both are the same, except for the different + value of the recordSchema parameter: + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=dc + + and + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra + + For the curious, one can see that the &acro.xslt; transformations + really do the magic. 
+ + xsltproc conf/oai2dc.xsl data/debug-record.xml + xsltproc conf/oai2zebra.xsl data/debug-record.xml + + Notice also that the &zebra;-specific parameters are injected by + the engine when retrieving data; therefore some of the attributes + in the zebra retrieval schema are not filled + when running the transformation from the command line. + + + + + In addition to the user-defined retrieval schemas, one can always + choose from many built-in schemas. In case one is only + interested in the &zebra; internal metadata about a certain + record, one uses the zebra::meta schema. + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::meta + + + + + The zebra::data schema is used to retrieve the + original stored &acro.oai; &acro.xml; record. + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::data + + + + + + + More interesting searches + + + The &acro.oai; indexing example defines many different index + names. A study of the conf/oai2index.xsl + stylesheet reveals the following word type indexes (i.e. those + with suffix :w): + + any:w + title:w + author:w + subject:w + description:w + contributor:w + publisher:w + language:w + rights:w + + By default, searches access the any:w index, + but we can direct searches to any access point by constructing the + correct &acro.pqf; query. For example, to search in titles only, + we use + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr + 1=title the&startRecord=1&maximumRecords=1&recordSchema=dc + + + + + Similarly, we can direct searches to the other indexes defined. Or we + can create boolean combinations of searches on different + indexes. In this case we search for the in + title and for fish in + description using the query + @and @attr 1=title the @attr 1=description fish. 
+ + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@and + @attr 1=title the + @attr 1=description fish&startRecord=1&maximumRecords=1&recordSchema=dc + + + + + + + + Investigating the content of the indexes + + + How does the magic work? What is inside the indexes? Why is a certain + record found by a search, and another not? The answer is in the + inverted indexes. You can easily investigate them using the + special &zebra; schema + zebra::index::fieldname. In this example you + can see that the title index has both word + (type :w) and phrase (type + :p) + indexed fields: + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::index::title + + + + + But where in the indexes did the term match for the query occur? + Easily answered with the special &zebra; schema + zebra::snippet. The matching terms are + encapsulated by <s> tags. + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::snippet + + + + + How can I refine my search? Which interesting search terms are + found inside my hit set? Try the special &zebra; schema + zebra::facet::fieldname:type. In this case, we + investigate additional search terms for the + title:w index. + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::title:w + + + + + One can ask for multiple facets. Here, we want them from phrase + indexes of type + :p. + + http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=the&startRecord=1&maximumRecords=1&recordSchema=zebra::facet::publisher:p,title:p + + + + + + + + Setting up a correct &acro.sru; web service + + + The &acro.sru; specification mandates that the &acro.cql; query + language is supported and properly configured. 
Also, the server + needs to be able to emit a proper &acro.explain; &acro.xml; + record, which is used to determine the capabilities of the + specific server instance. + + + + In this example configuration we exploit the similarities between + the &acro.explain; record and the &acro.cql; query language + configuration: we generate the latter from the former using an + &acro.xslt; transformation. + + xsltproc conf/explain2cqlpqftxt.xsl conf/explain.xml > conf/cql2pqf.txt + + + + + We are all set to start the &acro.sru;/&acro.z3950; server including + &acro.pqf; and &acro.cql; query configuration. It uses the &yaz; frontend + server configuration - just type + + zebrasrv -f conf/yazserver.xml + + + + + First, we'd like to be sure that we can see the &acro.explain; + &acro.xml; response correctly. You might use either of these equivalent + requests: + http://localhost:9999 + + + http://localhost:9999/?version=1.1&operation=explain + + + + + + Now we can issue true &acro.sru; requests. For example, + dc.title=the + and dc.description=fish results in the following page + + http://localhost:9999/?version=1.1&operation=searchRetrieve&query=dc.title=the + and dc.description=fish &startRecord=1&maximumRecords=1&recordSchema=dc + + + + + Scan of indexes is a part of the &acro.sru; server business. For example, + scanning the dc.title index gives us an idea + what search terms are found there + + http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.title=fish + , + whereas + + http://localhost:9999/?version=1.1&operation=scan&scanClause=dc.identifier=fish + + accesses the indexed identifiers. 
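The request URLs above contain spaces and `=` signs inside the CQL query, which a browser silently escapes but a script has to encode itself. A minimal sketch, assuming Python 3 and the `http://localhost:9999/` server used in this tutorial; the function names `sru_search_url` and `sru_scan_url` are illustrative and not part of &zebra; or &yaz;:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Server address taken from the tutorial; adjust it for your own setup.
BASE_URL = "http://localhost:9999/"


def sru_search_url(cql_query, record_schema="dc", start_record=1, maximum_records=1):
    """Build an SRU 1.1 searchRetrieve URL with a percent-encoded CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
        "recordSchema": record_schema,
    }
    # urlencode escapes spaces, '=' and other reserved characters in the values.
    return BASE_URL + "?" + urlencode(params)


def sru_scan_url(scan_clause):
    """Build an SRU 1.1 scan URL for a CQL scan clause such as dc.title=fish."""
    params = {"version": "1.1", "operation": "scan", "scanClause": scan_clause}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    url = sru_search_url("dc.title=the and dc.description=fish")
    print(url)
    # Decoding the URL again recovers the original CQL query unchanged.
    print(parse_qs(urlsplit(url).query)["query"][0])
    print(sru_scan_url("dc.title=fish"))
```

The printed URLs can be pasted into a browser or handed to any HTTP client; passing `record_schema="zebra::snippet"` requests one of the special &zebra; schemas instead of `dc`.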
+ + + + In addition, all &zebra; internal special element sets or record + schemas of the form + zebra:: just work right out of the box + + http://localhost:9999/?version=1.1&operation=searchRetrieve&query=dc.title=the + and dc.description=fish &startRecord=1&maximumRecords=1&recordSchema=zebra::snippet + + + + + + + + + + Searching the &acro.oai; database by &acro.z3950; protocol + + + In this section we repeat the searches and presents we have done so + far using the binary &acro.z3950; protocol. You can use any + &acro.z3950; client. + For instance, you can use the demo command-line client that comes + with &yaz;. + + + Connecting to the server is done by the command + + yaz-client localhost:9999 + + + + + When the client has connected, you can type: + + Z> format xml + Z> querytype prefix + Z> elements oai + Z> find the + Z> show 1+1 + + + + + &acro.z3950; presents using presentation stylesheets: + + Z> elements dc + Z> show 2+1 + + Z> elements zebra + Z> show 3+1 + + + + + &acro.z3950; built-in Zebra presents (in this configuration only if + started without yaz-frontendserver): + + + Z> elements zebra::meta + Z> show 4+1 + + Z> elements zebra::meta::sysno + Z> show 5+1 + + Z> format sutrs + Z> show 5+1 + Z> format xml + + Z> elements zebra::index + Z> show 6+1 + + Z> elements zebra::snippet + Z> show 7+1 + + Z> elements zebra::facet::any:w + Z> show 1+1 + + Z> elements zebra::facet::publisher:p,title:p + Z> show 1+1 + + + + + &acro.z3950; searches targeted at specific indexes and boolean + combinations of these can be issued as well. 
+ + + Z> elements dc + Z> find @attr 1=oai_identifier @attr 4=3 oai:caltechcstr.library.caltech.edu:4 + Z> show 1+1 + + Z> find @attr 1=oai_datestamp @attr 4=3 2001-04-20 + Z> show 1+1 + + Z> find @attr 1=oai_setspec @attr 4=3 7374617475733D756E707562 + Z> show 1+1 + + Z> find @attr 1=title communication + Z> show 1+1 + + Z> find @attr 1=identifier @attr 4=3 + http://resolver.caltech.edu/CaltechCSTR:1986.5228-tr-86 + Z> show 1+1 + + etc, etc. + + + + &acro.z3950; scan: + + yaz-client localhost:9999 + Z> format xml + Z> querytype prefix + Z> scan @attr 1=oai_identifier @attr 4=3 oai + Z> scan @attr 1=oai_datestamp @attr 4=3 1 + Z> scan @attr 1=oai_setspec @attr 4=3 2000 + Z> + Z> scan @attr 1=title communication + Z> scan @attr 1=identifier @attr 4=3 a + + + + + &acro.z3950; search using server-side CQL conversion: + + Z> format xml + Z> querytype cql + Z> elements dc + Z> + Z> find harry + Z> + Z> find dc.creator = the + Z> find dc.creator = the + Z> find dc.title = the + Z> + Z> find dc.description < the + Z> find dc.title > some + Z> + Z> find dc.identifier="http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78" + Z> find dc.relation = something + + + + + + + + &acro.z3950; scan using server-side CQL conversion - + unfortunately, this will _never_ work as it is not supported by the + &acro.z3950; standard. + If you want to use scan using server-side CQL conversion, you need to + make an SRW connection using yaz-client, or an + SRU connection using REST Web Services - any browser will do. + + + + + + All indexes defined by 'type="0"' in the + indexing style sheet must be searched using the '@attr 4=3' + structure attribute instruction. + + + + + Notice that searching and scanning on the indexes + contributor, language, + rights, and source + might fail, simply because none of the records in the small example set + have these fields set, and consequently, these indexes might not + have been created. 
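The &acro.pqf; queries used in this section all follow the same pattern: an `@attr 1=` index name, an optional `@attr 4=3` structure attribute for the `type="0"` indexes, and the search term. A minimal sketch of a helper that composes such query strings, assuming Python 3; the names `pqf_attr_query`, `pqf_and`, and the `key_structure` parameter are hypothetical, and terms containing double quotes are rejected rather than escaped to keep the sketch simple:

```python
def pqf_term(term):
    """Return the term as-is, or double-quoted when it contains whitespace."""
    if '"' in term:
        # Simplification of this sketch: embedded quotes are not handled.
        raise ValueError("terms containing double quotes are not supported here")
    return f'"{term}"' if any(ch.isspace() for ch in term) else term


def pqf_attr_query(index, term, key_structure=False):
    """Compose '@attr 1=<index> [@attr 4=3] <term>'.

    Set key_structure=True for indexes declared with type="0" in the
    indexing stylesheet, which need the '@attr 4=3' structure attribute.
    """
    parts = ["@attr", f"1={index}"]
    if key_structure:
        parts += ["@attr", "4=3"]
    parts.append(pqf_term(term))
    return " ".join(parts)


def pqf_and(left, right):
    """Combine two PQF sub-queries with the prefix boolean operator @and."""
    return f"@and {left} {right}"


if __name__ == "__main__":
    print(pqf_attr_query("oai_datestamp", "2001-04-20", key_structure=True))
    print(pqf_and(pqf_attr_query("title", "the"),
                  pqf_attr_query("description", "fish")))
```

The resulting strings can be typed after `find` or `scan` in yaz-client when `querytype prefix` is active.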
+ + + + + -Z39.50 scan using server side CQL conversion: - - Unfortunately, this will _never_ work as it is not supported by the - Z39.50 standard. - If you want to use scan using server side CQL conversion, you need to - make an SRW connection using yaz-client, or a - SRU connection using REST Web Services - any browser will do. - - -SRU Explain ZeeRex response: - - http://localhost:9999/ - http://localhost:9999/?version=1.1&operation=explain - - -SRU Search Retrieve records: - - http://localhost:9999/?version=1.1&operation=searchRetrieve - &query=creator=adam - - http://localhost:9999/?version=1.1&operation=searchRetrieve - &query=date=1978-01-01 - &startRecord=1&maximumRecords=1&recordSchema=dc - - http://localhost:9999/?version=1.1&operation=searchRetrieve - &query=dc.title=the - - http://localhost:9999/?version=1.1&operation=searchRetrieve - &query=description=the - - - relation tests: - - http://localhost:9999/?version=1.1&operation=searchRetrieve - &query=title%3Cthe - - -SRU scan: - - http://localhost:9999/?version=1.1&operation=scan&scanClause=title=a - http://localhost:9999/?version=1.1&operation=scan - &scanClause=identifier%20eq%20a - - Notice: you need to use the 'eq' relation for all @attr 4=3 indexes - - - -SRW explain with CQL index points: - - Z> open http://localhost:9999 - Z> explain - - Notice: when opening a connection using the 'http.//' prefix, yaz-client - uses SRW SOAP connections, and 'form xml' and 'querytype cql' are - implicitely set. 
- - -SRW search using implicit server side CQL: - - Z> open http://localhost:9999 - Z> find identifier eq - "http://resolver.caltech.edu/CaltechCSTR:1978.2276-tr-78" - Z> find description < the - - - In SRW connection mode, the follwing fails due to problem in yaz-client: - Z> elements dc - Z> s 1+1 - - -SRW scan using implicit server side CQL: - - yaz-client http://localhost:9999 - Z> scan title = communication - Z> scan identifier eq a - - Notice: you need to use the 'eq' relation for all @attr 4=3 indexes - - - - ---> - - - - - - - - Requesting &acro.oai; records in &zebra; specific formats - - - - - - -