X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fexamples.xml;h=10867efb3d4740be36c35d425368e6266e18522f;hb=807d2c445492c36b95b7ef1b3202ccdc0e302aa5;hp=2735cb2412f1fc87a642f0fe40c78f4d24f43fa6;hpb=bffe964768496135023ab242d6b468558fa1c2be;p=idzebra-moved-to-github.git diff --git a/doc/examples.xml b/doc/examples.xml index 2735cb2..10867ef 100644 --- a/doc/examples.xml +++ b/doc/examples.xml @@ -1,5 +1,5 @@ - + Example Configurations @@ -19,106 +19,119 @@ - Where to find the default indexing rules (### default.idx) + Where to find subsidiary configuration files, including + default.idx + which specifies the default indexing rules. - ### Something to do with explain.abs?! + What attribute sets to recognise in searches. - ### Where to find other configuration files, e.g. searches using - BIB-1 attributes require a bib1.att configuration file (even if - the access point is actually an XPath expression). These are - searched for in the working directory unless otherwise - specified. + Policy details such as what record type to expect, what + low-level indexing algorithm to use, how to identify potential + duplicate records, etc. + + Now let's see what goes in the zebra.cfg file + for some example configurations. + - - Example 1: Minimal Configuration + + Example 1: XML Indexing And Searching This example shows how Zebra can be used with absolutely minimal - configuration to index a body of XML documents, and search them - using XPath expressions to specify access points. + configuration to index a body of + XML + documents, and search them using + XPath + expressions to specify access points. - Go to the zebra/examples/dinosauricon directory. + Go to the examples/dinosauricon subdirectory + of the distribution archive. There you will find a records subdirectory, which contains some raw XML data to be added to the database: in - this case, two files, genera.xml and - taxa.xml, which contain information about all - the known dinosaur genera as of August 2002. 
+ this case, a single file, genera.xml, + which contains information about all the known dinosaur genera as of + August 2002. Now we need to create the Zebra database, which we do with the - Zebra indexer, zebraidx. This program's - behaviour is driven by a configuration life, generally called - zebra.cfg, although this can be changed with the - -c option. For our purposes, we don't need any - special behaviour - we can use the defaults - so an empty - configuration will do just fine. We can either create an empty - zebra.cfg or specify the name of an existing - empty file using, for example, -c /dev/null. - - - In this case, we'll use an empty zebra.cfg so - we can add more configuration to it later. + Zebra indexer, zebraidx, which is + driven by the zebra.cfg configuration file. + For our purposes, we don't need any + special behaviour - we can use the defaults - so we start with a + minimal file that just tells zebraidx where to + find the default indexing rules, and how to parse the records: + + profilePath: .:../../tab:../../../yaz/tab + recordType: grs.sgml + That's all you need for a minimal Zebra configuration. Now you can roll the XML records into the database and build the indexes: - zebraidx -t grs.sgml update records + zebraidx update records - (### What does "grs.sgml" actually mean?) Now start the server. Like the indexer, its behaviour is - controlled by a configuration file, generally - zebra.cfg; and like the indexer, it works just - fine with an empty configuration. + controlled by the + zebra.cfg file; and like the indexer, it works + just fine with this minimal configuration. zebrasrv By default, the server listens on IP port number 9999, although - this can easily be changed. + this can easily be changed - see + . Now you can use the Z39.50 client program of your choice to execute XPath-based boolean queries and fetch the XML records that satisfy them: - Z> open tcp:@:9999 - Connecting...Ok.
- Z> find @attr 1=/GENUS/MEANING @or vertebra jaw - Number of hits: 1 - Z> format xml - Z> show 1 - Z> show 1 - <GENUS name="Hudiesaurus" type="with" xmlns:idzebra="http://www.indexdata.dk/zebra/"> - <MEANING> - butterfly <LOW>vertebra</LOW> lizard - </MEANING> - <LENGTH value="30"></LENGTH> - <PLACE name="China"></PLACE> - <REMAINS content="4 teeth, forelimb, first dorsal vertebra"></REMAINS> - <SPECIES name="sinojapanorum" status="nudum"> - <AUTHOR name="Dong" year="1997"></AUTHOR> - <MEANING> - Chinese-Japanese - </MEANING> - </SPECIES> - <idzebra:size>359</idzebra:size><idzebra:localnumber>447</idzebra:localnumber><idzebra:filename>records/genera.xml</idzebra:filename></GENUS> + $ yaz-client tcp:@:9999 + Connecting...Ok. + Z> find @attr 1=/GENUS/SPECIES/AUTHOR/@name Wedel + Number of hits: 1 + Z> format xml + Z> show 1 + <GENUS name="Sauroposeidon" type="with"> + <MEANING>lizard Poseidon <LOW>(Greek god of, among other things, earthquakes)</LOW></MEANING> + <SPECIES name="proteles"> + <AUTHOR type="vide" name="Franklin" year="2000"></AUTHOR> + <AUTHOR name="Wedel, Cifelli, Sanders"></AUTHOR> + </SPECIES> + <PLACE name="Oklahoma"></PLACE> + <TIME value="Albian"></TIME> + <LENGTH value="30" q="1"></LENGTH> + <REMAINS content="rib, cervical vertebrae"></REMAINS> + <ESSAY> + <P> This new <NOMEN name="Brachiosaurus"></NOMEN>-like <LINK content="dinosaur"></LINK> + was perhaps the tallest. With its head raised, it stood 60 feet (nearly + 20 m) tall. </P> + </ESSAY> + + <idzebra xmlns="http://www.indexdata.dk/zebra/"> + <size>593</size> + <localnumber>891</localnumber> + <filename>records/genera.xml</filename> + </idzebra> + </GENUS> @@ -126,33 +139,112 @@ - - Example 2: Adding Some Configuration + + Example 2: Supporting Interoperable Searches + + + The problem with the previous example is that you need to know the + structure of the documents in order to find them. 
For example, + when we wanted to know the genera for which Matt Wedel is an + author + (Sauroposeidon proteles), + we had to formulate a complex XPath + 1=/GENUS/SPECIES/AUTHOR/@name + which embodies the knowledge that author names are specified in the + name attribute of the + <AUTHOR> element, + which is inside the + <SPECIES> element, + which in turn is inside the top-level + <GENUS> element. + + + This is bad not just because it requires a lot of typing, but more + significantly because it ties searching semantics to the physical + structure of the searched records. You can't use the same search + specification to search two databases if their internal + representations are different. Consider an alternative dinosaur + database in which the records have author names specified + inside an <authorName> element directly + inside a top-level <taxon> element: then + you'd need to search for them using + 1=/taxon/authorName + + + How, then, can we build broadcasting Information Retrieval + applications that look for records in many different databases? + The Z39.50 protocol offers a powerful and general solution to this: + abstract ``access points''. In the Z39.50 model, an access point + is simply a point at which searches can be directed. Nothing is + said about implementation: in a given database, an access point + might be implemented as an index, a path into physical records, an + algorithm for interrogating relational tables or whatever works. + The key point is that the semantics of an access point are fixed + and well defined. + + + For convenience, access points are gathered into attribute + sets. For example, the BIB-1 attribute set is supposed to + contain bibliographic access points such as author, title, subject + and ISBN; the GEO attribute set contains access points pertaining + to geospatial information (bounding box, ###, etc.); the CIMI + attribute set contains access points to do with museum collections + (provenance, inscriptions, etc.) 
+ + + In practice, the BIB-1 attribute set has tended to be a dumping + ground for all sorts of access points, so that, for example, it + includes some geospatial access points as well as strictly + bibliographic ones. Nevertheless, the key point is that this model + allows a layer of abstraction over the physical representation of + records in databases. + + + In the BIB-1 attribute set, an author search is represented by + access point 1003. (See + ) + So we need to configure our dinosaur database so that searches for + BIB-1 access point 1003 look at the + name attribute of the + <AUTHOR> element, + inside the + <SPECIES> element, + inside the top-level + <GENUS> element. + + + This is a two-step process. First, we need to tell Zebra that we + want to support the BIB-1 attribute set. Then we need to tell it + which elements of its records pertain to access point 1003.
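The abstraction described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Zebra configuration and not part of any Zebra or YAZ API: the names `ACCESS_POINT_PATHS`, `resolve` and `broadcast`, and the database labels, are hypothetical. The two paths and the access point number 1003 are the ones used in the text. Each database keeps its own mapping from an abstract access point to a physical path, so a client can ask the same question of every database.

```python
# Hypothetical illustration only: none of these names are Zebra or YAZ APIs.
BIB1_AUTHOR = 1003  # BIB-1 access point for author searches (from the text)

# Each database maps abstract access points to its own physical record paths.
# The database labels are made up; the paths are the two quoted in the text.
ACCESS_POINT_PATHS = {
    "dinosauricon": {BIB1_AUTHOR: "/GENUS/SPECIES/AUTHOR/@name"},
    "alternative": {BIB1_AUTHOR: "/taxon/authorName"},
}


def resolve(database, access_point):
    """Return the record path that implements an access point in one database."""
    try:
        return ACCESS_POINT_PATHS[database][access_point]
    except KeyError:
        raise LookupError(
            f"{database!r} does not support access point {access_point}"
        ) from None


def broadcast(access_point, term):
    """Build one path-based query per database for the same abstract search."""
    return {
        database: f"@attr 1={paths[access_point]} {term}"
        for database, paths in ACCESS_POINT_PATHS.items()
        if access_point in paths
    }


if __name__ == "__main__":
    for database, query in broadcast(BIB1_AUTHOR, "Wedel").items():
        print(database, "->", query)
```

The caller only ever names access point 1003. The knowledge of where author names live stays with each database, which is the separation that the two configuration steps are meant to achieve inside Zebra.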