X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fexamples.xml;h=7a5b015e1ccee9aec5ff78077aa6428e2b422831;hb=c5971ebf8a88865ed9a1f7c8cf9daa22544f07be;hp=3a49b6c0ffc5c0faaa248c681783b1b6896c713a;hpb=496125ec5b13de49f4c34a2eb548467cfba0159a;p=idzebra-moved-to-github.git
diff --git a/doc/examples.xml b/doc/examples.xml
index 3a49b6c..7a5b015 100644
--- a/doc/examples.xml
+++ b/doc/examples.xml
@@ -1,159 +1,394 @@
-
-
- Example Configurations
-
-
- Overview
-
-
- zebraidx and zebrasrv are both
- driven by a master configuration file, which may refer to other
- subsidiary configuration files. By default, they try to use
- zebra.cfg in the working directory as the
- master file; but this can be changed using the -t
- option to specify an alternative master configuration file.
-
-
- The master configuration file tells Zebra:
-
-
-
+
+ Example Configurations
+
+
+ Overview
+
+
+ zebraidx and
+ zebrasrv are both
+ driven by a master configuration file, which may refer to other
+ subsidiary configuration files. By default, they try to use
+ zebra.cfg in the working directory as the
+ master file; but this can be changed using the -c
+ option to specify an alternative master configuration file.
+
+
+ The master configuration file tells &zebra;:
+
+
+
+
+ Where to find subsidiary configuration files, including both
+ those that are named explicitly and a few ``magic'' files such
+ as default.idx,
+ which specifies the default indexing rules.
+
+
+
+
+
+ What record schemas to support. (Subsidiary files specify how
+ to index the contents of records in those schemas, and what
+ format to use when presenting records in those schemas to client
+ software.)
+
+
+
+
+
+ What attribute sets to recognise in searches. (Subsidiary files
+ specify how to interpret the attributes in terms
+ of the indexes that are created on the records.)
+
+
+
+
+
+ Policy details such as what type of input format to expect when
+ adding new records, what low-level indexing algorithm to use,
+ how to identify potential duplicate records, etc.
+
+
+
+
+
+
+ Now let's see what goes in the zebra.cfg file
+ for some example configurations.
+
+
+
+
+ Example 1: &acro.xml; Indexing And Searching
+
+
+ This example shows how &zebra; can be used with absolutely minimal
+ configuration to index a body of
+ &acro.xml;
+ documents, and search them using
+ XPath
+ expressions to specify access points.
+
+
+ Go to the examples/zthes subdirectory
+ of the distribution archive.
+ There you will find a Makefile that will
+ populate the records subdirectory with a file of
+ Zthes
+ records representing a taxonomic hierarchy of dinosaurs. (The
+ records are generated from the family tree in the file
+ dino.tree.)
+ Type make records/dino.xml
+ to make the &acro.xml; data file.
+ (Or you could just type make dino to build the &acro.xml;
+ data file, create the database and populate it with the taxonomic
+ records all in one shot - but then you wouldn't learn anything,
+ would you? :-)
+
+
+ Now we need to create a &zebra; database to hold and index the &acro.xml;
+ records. We do this with the
+ &zebra; indexer, zebraidx, which is
+ driven by the zebra.cfg configuration file.
+ For our purposes, we don't need any
+ special behaviour - we can use the defaults - so we can start with a
+ minimal file that just tells zebraidx where to
+ find the default indexing rules, and how to parse the records:
+
+ profilePath: .:../../tab
+ recordType: grs.sgml
+
+
+
+ That's all you need for a minimal &zebra; configuration. Now you can
+ roll the &acro.xml; records into the database and build the indexes:
+
+ zebraidx update records
+
+
+
+ Now start the server. Like the indexer, its behaviour is
+ controlled by the
+ zebra.cfg file; and like the indexer, it works
+ just fine with this minimal configuration.
+
+ zebrasrv
+
+ By default, the server listens on TCP port 9999, although
+ this can easily be changed - see
+ .
+
+
+ Now you can use the &acro.z3950; client program of your choice to execute
+ XPath-based boolean queries and fetch the &acro.xml; records that satisfy
+ them:
+
+ $ yaz-client @:9999
+ Connecting...Ok.
+ Z> find @attr 1=/Zthes/termName Sauroposeidon
+ Number of hits: 1
+ Z> format xml
+ Z> show 1
+ <Zthes>
+ <termId>22</termId>
+ <termName>Sauroposeidon</termName>
+ <termType>PT</termType>
+ <termNote>The tallest known dinosaur (18m)</termNote>
+ <relation>
+ <relationType>BT</relationType>
+ <termId>21</termId>
+ <termName>Brachiosauridae</termName>
+ <termType>PT</termType>
+ </relation>
+
+ <idzebra xmlns="http://www.indexdata.dk/zebra/">
+ <size>300</size>
+ <localnumber>23</localnumber>
+ <filename>records/dino.xml</filename>
+ </idzebra>
+ </Zthes>
+
+
+
+ Now wasn't that nice and easy?
+
+
+
+
+
+ Example 2: Supporting Interoperable Searches
+
+
+ The problem with the previous example is that you need to know the
+ structure of the documents in order to find them. For example,
+ when we wanted to find the record for the taxon
+ Sauroposeidon,
+ we had to formulate a complex XPath
+ /Zthes/termName
+ which embodies the knowledge that taxon names are specified in a
+ <termName> element inside the top-level
+ <Zthes> element.
+
+
+ This is bad not just because it requires a lot of typing, but more
+ significantly because it ties searching semantics to the physical
+ structure of the searched records. You can't use the same search
+ specification to search two databases if their internal
+ representations are different. Consider a different taxonomy
+ database in which the records have taxon names specified
+ inside a <name> element nested within an
+ <identification> element
+ inside a top-level <taxon> element: then
+ you'd need to search for them using
+ 1=/taxon/identification/name
+
+
+ How, then, can we build broadcasting Information Retrieval
+ applications that look for records in many different databases?
+ The &acro.z3950; protocol offers a powerful and general solution to this:
+ abstract ``access points''. In the &acro.z3950; model, an access point
+ is simply a point at which searches can be directed. Nothing is
+ said about implementation: in a given database, an access point
+ might be implemented as an index, a path into physical records, an
+ algorithm for interrogating relational tables or whatever works.
+ The only important thing is that the semantics of an access
+ point is fixed and well defined.
+
+
+ For convenience, access points are gathered into attribute
+ sets. For example, the &acro.bib1; attribute set is supposed to
+ contain bibliographic access points such as author, title, subject
+ and ISBN; the GEO attribute set contains access points pertaining
+ to geospatial information (bounding coordinates, stratum, latitude
+ resolution, etc.); the CIMI
+ attribute set contains access points to do with museum collections
+ (provenance, inscriptions, etc.).
+
+
+ In practice, the &acro.bib1; attribute set has tended to be a dumping
+ ground for all sorts of access points, so that, for example, it
+ includes some geospatial access points as well as strictly
+ bibliographic ones. Nevertheless, this model
+ allows a layer of abstraction over the physical representation of
+ records in databases.
+
+
+ In the &acro.bib1; attribute set, a taxon name is probably best
+ interpreted as a title - that is, a phrase that identifies the item
+ in question. &acro.bib1; represents title searches by
+ access point 4. (See
+ The &acro.bib1; Attribute
+ Set Semantics.)
+ So we need to configure our dinosaur database so that searches for
+ &acro.bib1; access point 4 look in the
+ <termName> element,
+ inside the top-level
+ <Zthes> element.
+
+
+ This is a two-step process. First, we need to tell &zebra; that we
+ want to support the &acro.bib1; attribute set. Then we need to tell it
+ which elements of its records pertain to access point 4.
+
+
+ We need to create an Abstract Syntax
+ file named after the document element of the records we're
+ working with, plus a .abs suffix - in this case,
+ Zthes.abs - as follows:
+
+
+
+
+
+
+
+
+
+ attset zthes.att
+ attset bib1.att
+ xpath enable
+ systag sysno none
+
+ xelm /Zthes/termId termId:w
+ xelm /Zthes/termName termName:w,title:w
+ xelm /Zthes/termQualifier termQualifier:w
+ xelm /Zthes/termType termType:w
+ xelm /Zthes/termLanguage termLanguage:w
+ xelm /Zthes/termNote termNote:w
+ xelm /Zthes/termCreatedDate termCreatedDate:w
+ xelm /Zthes/termCreatedBy termCreatedBy:w
+ xelm /Zthes/termModifiedDate termModifiedDate:w
+ xelm /Zthes/termModifiedBy termModifiedBy:w
+
+
+
- Where to find the default indexing rules (### default.idx)
+ Declare the Thesaurus attribute set. See zthes.att.
-
-
-
+
+
- ### Something to do with explain.abs?!
+ Declare the &acro.bib1; attribute set. See bib1.att in
+ &zebra;'s tab directory.
-
-
-
+
+
+
+ This xelm directive selects the contents of nodes matching the XPath
+ expression /Zthes/termId. The contents (CDATA) will be
+ word searchable by the Zthes attribute termId (value 1001).
+
+
+
- ### Where to find other configuration files, e.g. searches using
- BIB-1 attributes require a bib1.att configuration file (even if
- the access point is actually an XPath expression). These are
- searched for in the working directory unless otherwise
- specified.
+ Make termName word searchable by both the
+ Zthes attribute termName (1002) and the &acro.bib1; attribute title (4).
-
-
-
-
-
-
-
- First Example: Minimal Configuration
-
-
- This example shows how Zebra can be used, with absolutely minimal
- configuration, to index a body of XML documents, and search them
- using XPath expressions to specify access points.
-
-
- Go to the
- zebra/examples/dinosauricon
- directory. There you will find two significant files:
-
-
-
-
-
- The records subdirectory, which contains the
- raw XML data to be added to the database: in this case, just one
- file, genera.xml, which contains information
- about all the known dinosaur genera as of October 2000.
-
-
-
-
-
-
- The master configuration file, zebra.cfg,
- which is as short and simple as it can be:
-
-
- # $Header: /home/cvsroot/idis/doc/examples.xml,v 1.2 2002-08-29 16:30:22 mike Exp $
- # Bare-bones master configuration file for Zebra
- profilePath: .:../../tab:../../../yaz/tab
-
- Apart from the comments, which are ignored, all this specifies is
- that the server should recognise the attribute set described in
- the file called
- bib1.att.
-
-
-
-
-
-
-
-
- That's all you need for a minimal Zebra configuration. Now you can
- roll the XML records into the database and build the indexes:
-
- zebraidx -t grs.sgml update records
-
-
- and start the server which, by default listens on port 9999:
-
- zebrasrv
-
-
-
- Now you can use the Z39.50 client program of your choice to execute
- XPath-based boolean queries and fetch the XML records that satisfy
- them:
-
- Z> open tcp:@:9999
- Connecting...Ok.
- Z> find @attr 1=/GENUS/MEANING @or vertebra jaw
- Number of hits: 1
- Z> format xml
- Z> show 1
- Z> show 1
- <GENUS name="Hudiesaurus" type="with" xmlns:idzebra="http://www.indexdata.dk/zebra/">
- <MEANING>
- butterfly <LOW>vertebra</LOW> lizard
- </MEANING>
- <LENGTH value="30"></LENGTH>
- <PLACE name="China"></PLACE>
- <REMAINS content="4 teeth, forelimb, first dorsal vertebra"></REMAINS>
- <SPECIES name="sinojapanorum" status="nudum">
- <AUTHOR name="Dong" year="1997"></AUTHOR>
- <MEANING>
- Chinese-Japanese
- </MEANING>
- </SPECIES>
- <idzebra:size>359</idzebra:size><idzebra:localnumber>447</idzebra:localnumber><idzebra:filename>records/genera.xml</idzebra:filename></GENUS>
-
-
-
-
-
+
+
+
+
+ After re-indexing, we can search the database using the &acro.bib1;
+ title attribute, as follows:
+
+ Z> form xml
+ Z> f @attr 1=4 Eoraptor
+ Sent searchRequest.
+ Received SearchResponse.
+ Search was a success.
+ Number of hits: 1, setno 1
+ SearchResult-1: Eoraptor(1)
+ records returned: 0
+ Elapsed: 0.106896
+ Z> s
+ Sent presentRequest (1+1).
+ Records: 1
+ [Default]Record type: &acro.xml;
+ <Zthes>
+ <termId>2</termId>
+ <termName>Eoraptor</termName>
+ <termType>PT</termType>
+ <termNote>The most basal known dinosaur</termNote>
+ ...
+
+
+
+
+
+
+
+
+
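The access-point mapping that Example 2 sets up in Zthes.abs can be illustrated with a small Python sketch. This is only a toy model of the idea, assuming that each xelm rule names an XPath and a comma-separated list of index names with a type suffix such as :w. It does not use any Zebra API and is not how Zebra indexes records internally. The function names parse_xelm_rules and build_index are hypothetical, and the rules and record are copied from the listings in the patch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A few xelm rules copied from the Zthes.abs listing in the patch.
ABS_RULES = """\
xelm /Zthes/termId termId:w
xelm /Zthes/termName termName:w,title:w
xelm /Zthes/termNote termNote:w
"""

# Record text taken from the Eoraptor search result in the patch.
RECORD = """\
<Zthes>
 <termId>2</termId>
 <termName>Eoraptor</termName>
 <termType>PT</termType>
 <termNote>The most basal known dinosaur</termNote>
</Zthes>
"""


def parse_xelm_rules(text):
    """Return {xpath: [index_name, ...]} from 'xelm PATH name:type,...' lines."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "xelm":
            continue  # skip attset, xpath, systag and blank lines
        _, path, targets = parts
        # Drop the ':w' type suffix; this toy model only does word indexing.
        rules[path] = [target.split(":")[0] for target in targets.split(",")]
    return rules


def build_index(record_xml, rules):
    """Map each index name to the set of lower-cased words found at its paths."""
    root = ET.fromstring(record_xml)
    index = defaultdict(set)
    for path, names in rules.items():
        steps = path.strip("/").split("/")
        if steps[0] != root.tag:
            continue  # rule is for a different document element
        # ElementTree paths are relative to the root, so drop the first step.
        elems = root.findall("/".join(steps[1:])) if len(steps) > 1 else [root]
        for elem in elems:
            for word in (elem.text or "").lower().split():
                for name in names:
                    index[name].add(word)
    return index


if __name__ == "__main__":
    index = build_index(RECORD, parse_xelm_rules(ABS_RULES))
    # 'title' is the name that the patch maps to Bib-1 access point 4.
    print("eoraptor" in index["title"])
```

The point of the sketch is that the same word ends up reachable under two names, termName and title, so a client can search by the abstract title access point without knowing that the data lives in /Zthes/termName.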