X-Git-Url: http://git.indexdata.com/?p=idzebra-moved-to-github.git;a=blobdiff_plain;f=doc%2Fexamples.xml;h=ebbac178df224830168ebc8367c93f26a4136753;hp=153eaed73afd22adc7d3a3215094ca8c7a4782f2;hb=27bdd6aa26843aeac89f635ed495996088d8e8aa;hpb=e57e9b19f4257768441a8025d08ce0e3430c7d78 diff --git a/doc/examples.xml b/doc/examples.xml index 153eaed..ebbac17 100644 --- a/doc/examples.xml +++ b/doc/examples.xml @@ -1,200 +1,321 @@ - Example Configurations - + Overview - zebraidx and zebrasrv are both + zebraidx and + zebrasrv are both driven by a master configuration file, which may refer to other subsidiary configuration files. By default, they try to use zebra.cfg in the working directory as the - master file; but this can be changed using the -t + master file; but this can be changed using the -c option to specify an alternative master configuration file. - The master configuration file tells Zebra: + The master configuration file tells &zebra;: - Where to find the default indexing rules (### default.idx) + Where to find subsidiary configuration files, including both + those that are named explicitly and a few ``magic'' files such + as default.idx, + which specifies the default indexing rules. - ### Something to do with explain.abs?! + What record schemas to support. (Subsidiary files specify how + to index the contents of records in those schemas, and what + format to use when presenting records in those schemas to client + software.) - ### Where to find other configuration files, e.g. searches using - BIB-1 attributes require a bib1.att configuration file (even if - the access point is actually an XPath expression). These are - searched for in the working directory unless otherwise - specified. + What attribute sets to recognise in searches. (Subsidiary files + specify how to interpret the attributes in terms + of the indexes that are created on the records.) 
+ + + + + + Policy details such as what type of input format to expect when + adding new records, what low-level indexing algorithm to use, + how to identify potential duplicate records, etc. + + Now let's see what goes in the zebra.cfg file + for some example configurations. + - Example 1: Minimal Configuration + Example 1: &acro.xml; Indexing And Searching - This example shows how Zebra can be used with absolutely minimal - configuration to index a body of XML documents, and search them - using XPath expressions to specify access points. - - - Go to the zebra/examples/dinosauricon directory. - There you will find a records subdirectory, - which contains some raw XML data to be added to the database: in - this case, two files, genera.xml and - taxa.xml, which contain information about all - the known dinosaur genera as of August 2002. + This example shows how &zebra; can be used with absolutely minimal + configuration to index a body of + &acro.xml; + documents, and search them using + XPath + expressions to specify access points. - Now we need to create the Zebra database, which we do with the - Zebra indexer, zebraidx. This program's - behaviour is driven by a configuration life, generally called - zebra.cfg, although this can be changed with the - -c option. For our purposes, we don't need any - special behaviour - we can use the defaults - so an empty - configuration will do just fine. We can either create an empty - zebra.cfg or specify the name of an existing - empty file using, for example, -c /dev/null. + Go to the examples/zthes subdirectory + of the distribution archive. + There you will find a Makefile that will + populate the records subdirectory with a file of + Zthes + records representing a taxonomic hierarchy of dinosaurs. (The + records are generated from the family tree in the file + dino.tree.) + Type make records/dino.xml + to make the &acro.xml; data file. 
+ (Or you could just type make dino to build the &acro.xml; + data file, create the database and populate it with the taxonomic + records all in one shot - but then you wouldn't learn anything, + would you? :-) - In this case, we'll use an empty zebra.cfg so - we can add more configuration to it later. + Now we need to create a &zebra; database to hold and index the &acro.xml; + records. We do this with the + &zebra; indexer, zebraidx, which is + driven by the zebra.cfg configuration file. + For our purposes, we don't need any + special behaviour - we can use the defaults - so we can start with a + minimal file that just tells zebraidx where to + find the default indexing rules, and how to parse the records: + + profilePath: .:../../tab + recordType: grs.sgml + - That's all you need for a minimal Zebra configuration. Now you can - roll the XML records into the database and build the indexes: + That's all you need for a minimal &zebra; configuration. Now you can + roll the &acro.xml; records into the database and build the indexes: - zebraidx -t grs.sgml update records + zebraidx update records - (### What does "grs.sgml" actually mean?) Now start the server. Like the indexer, its behaviour is - controlled by a configuration file, generally - zebra.cfg; and like the indexer, it works just - fine with an empty configuration. + controlled by the + zebra.cfg file; and like the indexer, it works + just fine with this minimal configuration. zebrasrv By default, the server listens on IP port number 9999, although - this can easily be changed. + this can easily be changed - see + . - Now you can use the Z39.50 client program of your choice to execute - XPath-based boolean queries and fetch the XML records that satisfy + Now you can use the &acro.z3950; client program of your choice to execute + XPath-based boolean queries and fetch the &acro.xml; records that satisfy them: - Z> open tcp:@:9999 - Connecting...Ok. 
- Z> find @attr 1=/GENUS/MEANING @or vertebra jaw - Number of hits: 1 - Z> format xml - Z> show 1 - Z> show 1 - <GENUS name="Hudiesaurus" type="with" xmlns:idzebra="http://www.indexdata.dk/zebra/"> - <MEANING> - butterfly <LOW>vertebra</LOW> lizard - </MEANING> - <LENGTH value="30"></LENGTH> - <PLACE name="China"></PLACE> - <REMAINS content="4 teeth, forelimb, first dorsal vertebra"></REMAINS> - <SPECIES name="sinojapanorum" status="nudum"> - <AUTHOR name="Dong" year="1997"></AUTHOR> - <MEANING> - Chinese-Japanese - </MEANING> - </SPECIES> - <idzebra:size>359</idzebra:size><idzebra:localnumber>447</idzebra:localnumber><idzebra:filename>records/genera.xml</idzebra:filename></GENUS> + $ yaz-client @:9999 + Connecting...Ok. + Z> find @attr 1=/Zthes/termName Sauroposeidon + Number of hits: 1 + Z> format xml + Z> show 1 + <Zthes> + <termId>22</termId> + <termName>Sauroposeidon</termName> + <termType>PT</termType> + <termNote>The tallest known dinosaur (18m)</termNote> + <relation> + <relationType>BT</relationType> + <termId>21</termId> + <termName>Brachiosauridae</termName> + <termType>PT</termType> + </relation> + + <idzebra xmlns="http://www.indexdata.dk/zebra/"> + <size>300</size> + <localnumber>23</localnumber> + <filename>records/dino.xml</filename> + </idzebra> + </Zthes> - Now wasn't that easy? + Now wasn't that nice and easy? 
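The minimal configuration from Example 1 can be sketched in Python as follows. This is an illustrative sketch, not part of the Zebra distribution: the helper names `parse_cfg` and `write_cfg` are hypothetical, and the parser assumes only the simple `key: value` layout shown in the example rather than reproducing Zebra's own configuration reader.

```python
import os
import tempfile

# The two settings shown in Example 1, verbatim.
MINIMAL_CFG = "profilePath: .:../../tab\nrecordType: grs.sgml\n"


def parse_cfg(text):
    """Read simple 'key: value' lines (a simplification, not Zebra's parser)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so the path list keeps its colons.
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def write_cfg(directory, text=MINIMAL_CFG):
    """Write the text to zebra.cfg inside the given directory."""
    path = os.path.join(directory, "zebra.cfg")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        cfg_path = write_cfg(workdir)
        with open(cfg_path, encoding="utf-8") as handle:
            settings = parse_cfg(handle.read())
    # profilePath is a colon-separated list of directories to search.
    print(settings["profilePath"].split(":"))
    print(settings["recordType"])
```

Splitting on the first colon only matters here because the value of `profilePath` is itself colon-separated: the current directory followed by `../../tab`.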
+ - Example 2: Adding Some Configuration + Example 2: Supporting Interoperable Searches - You may have noticed as zebraidx was building - the database that it issued several warnings, which we ignored at - the time: - -zebraidx -t grs.sgml update records -02:12:32-30/08: zebraidx(18151) [warn] default.idx [No such file or directory] -02:12:32-30/08: zebraidx(18151) [warn] Couldn't open explain.abs [No such file or directory] -02:12:32-30/08: zebraidx(18151) [warn] records/genera.xml:0 Couldn't open GENUS.abs [No such file or directory] -02:12:32-30/08: zebraidx(18151) [warn] records/genera.xml:0 Unknown register type: 0 -02:12:32-30/08: zebraidx(18151) [warn] records/genera.xml:0 Unknown register type: w -02:12:35-30/08: zebraidx(18151) [warn] records/taxa.xml:0 Couldn't open TAXON.abs [No such file or directory] - - And the server issued several more as the client connected to it, - then searched for and retrieved a record: - -02:17:10-30/08: zebrasrv(18165) [warn] default.idx [No such file or directory] -02:17:10-30/08: zebrasrv(18165) [warn] Couldn't open explain.abs [No such file or directory] -02:17:57-30/08: zebrasrv(18165) [warn] Unknown register type: w -02:18:42-30/08: zebrasrv(18165) [warn] Couldn't open GENUS.abs [No such file or directory] - + The problem with the previous example is that you need to know the + structure of the documents in order to find them. For example, + when we wanted to find the record for the taxon + Sauroposeidon, + we had to formulate a complex XPath + /Zthes/termName + which embodies the knowledge that taxon names are specified in a + <termName> element inside the top-level + <Zthes> element. + + + This is bad not just because it requires a lot of typing, but more + significantly because it ties searching semantics to the physical + structure of the searched records. You can't use the same search + specification to search two databases if their internal + representations are different. 
Consider a different taxonomy + database in which the records have taxon names specified + inside a <name> element nested within an + <identification> element + inside a top-level <taxon> element: then + you'd need to search for them using + 1=/taxon/identification/name. + + + How, then, can we build broadcasting Information Retrieval + applications that look for records in many different databases? + The &acro.z3950; protocol offers a powerful and general solution to this: + abstract ``access points''. In the &acro.z3950; model, an access point + is simply a point at which searches can be directed. Nothing is + said about implementation: in a given database, an access point + might be implemented as an index, a path into physical records, an + algorithm for interrogating relational tables or whatever works. + The only important thing is that the semantics of an access + point is fixed and well defined. + + For convenience, access points are gathered into attribute + sets. For example, the &acro.bib1; attribute set is supposed to + contain bibliographic access points such as author, title, subject + and ISBN; the GEO attribute set contains access points pertaining + to geospatial information (bounding coordinates, stratum, latitude + resolution, etc.); the CIMI + attribute set contains access points to do with museum collections + (provenance, inscriptions, etc.). + + + In practice, the &acro.bib1; attribute set has tended to be a dumping + ground for all sorts of access points, so that, for example, it + includes some geospatial access points as well as strictly + bibliographic ones. Nevertheless, this model + allows a layer of abstraction over the physical representation of + records in databases. + + + In the &acro.bib1; attribute set, a taxon name is probably best + interpreted as a title - that is, a phrase that identifies the item + in question. &acro.bib1; represents title searches by + access point 4. 
(See + The &acro.bib1; Attribute + Set Semantics) + So we need to configure our dinosaur database so that searches for + &acro.bib1; access point 4 look in the + <termName> element, + inside the top-level + <Zthes> element. + + + This is a two-step process. First, we need to tell &zebra; that we + want to support the &acro.bib1; attribute set. Then we need to tell it + which elements of its records pertain to access point 4. + + + We need to create an Abstract Syntax + file named after the document element of the records we're + working with, plus a .abs suffix - in this case, + Zthes.abs - as follows: + + + + + + + + + +attset zthes.att +attset bib1.att +xpath enable +systag sysno none + +xelm /Zthes/termId termId:w +xelm /Zthes/termName termName:w,title:w +xelm /Zthes/termQualifier termQualifier:w +xelm /Zthes/termType termType:w +xelm /Zthes/termLanguage termLanguage:w +xelm /Zthes/termNote termNote:w +xelm /Zthes/termCreatedDate termCreatedDate:w +xelm /Zthes/termCreatedBy termCreatedBy:w +xelm /Zthes/termModifiedDate termModifiedDate:w +xelm /Zthes/termModifiedBy termModifiedBy:w + + + + + Declare the Zthes (thesaurus) attribute set. See zthes.att. + + + + + Declare the &acro.bib1; attribute set. See bib1.att in + &zebra;'s tab directory. + + + + + This xelm directive selects the contents of nodes matching the XPath expression + /Zthes/termId. The contents (CDATA) will be + word-searchable by the Zthes attribute termId (value 1001). + + + + + Make termName word-searchable by both the + Zthes attribute termName (1002) and the &acro.bib1; attribute title (4). + + + + + + After re-indexing, we can search the database using the &acro.bib1; + title attribute, as follows: + +Z> form xml +Z> f @attr 1=4 Eoraptor +Sent searchRequest. +Received SearchResponse. +Search was a success. +Number of hits: 1, setno 1 +SearchResult-1: Eoraptor(1) +records returned: 0 +Elapsed: 0.106896 +Z> s +Sent presentRequest (1+1). 
+Records: 1 +[Default]Record type: &acro.xml; +<Zthes> + <termId>2</termId> + <termName>Eoraptor</termName> + <termType>PT</termType> + <termNote>The most basal known dinosaur</termNote> + ... + + -
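The contrast between the structural searches of Example 1 and the abstract access point of Example 2 can be sketched as a small query builder. This is a sketch under stated assumptions: the database labels `zthes` and `taxon` and all function names are hypothetical, and the code only formats query strings of the kind typed at the `Z>` prompt; it does not talk to a server.

```python
# XPath location of the taxon name in each record layout discussed in the text.
# The keys "zthes" and "taxon" are hypothetical labels for the two databases.
NAME_PATHS = {
    "zthes": "/Zthes/termName",
    "taxon": "/taxon/identification/name",
}

# BIB-1 access point used for title searches in the text.
BIB1_TITLE = 4


def pqf(access_point, term):
    """Format a query string with a use attribute (type 1) and a search term."""
    if any(ch.isspace() for ch in term):
        # Multi-word terms are wrapped in double quotes; embedded quotes
        # are not escaped in this sketch.
        term = '"' + term + '"'
    return f"@attr 1={access_point} {term}"


def structural_queries(term):
    """One query per database, each tied to that database's record layout."""
    return {db: pqf(path, term) for db, path in NAME_PATHS.items()}


def abstract_query(term):
    """A single query that any database mapping access point 4 can answer."""
    return pqf(BIB1_TITLE, term)


if __name__ == "__main__":
    for db, query in structural_queries("Eoraptor").items():
        print(db, "->", query)
    print("any ->", abstract_query("Eoraptor"))
```

The structural form needs one query per record layout, whereas the abstract form stays the same for every database whose configuration maps access point 4 onto the right element.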