Example Configurations
Overview
zebraidx and
zebrasrv are both
driven by a master configuration file, which may refer to other
subsidiary configuration files. By default, they try to use
zebra.cfg in the working directory as the
master file; but this can be changed using the -c
option to specify an alternative master configuration file.
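A minimal sketch of the -c option in use. The file name conf/dino.cfg is hypothetical, and the two settings are the ones from the minimal configuration in Example 1; the indexer is only invoked if it is installed and a records directory exists.

```shell
# Hypothetical location for an alternative master configuration file.
CONFIG=conf/dino.cfg
mkdir -p "$(dirname "$CONFIG")"

# Same two settings as the minimal configuration in Example 1.
cat > "$CONFIG" <<'EOF'
profilePath: .:../../tab
recordType: grs.sgml
EOF

# Point the indexer at that file instead of ./zebra.cfg.
# Skipped when zebraidx is not installed or there is nothing to index.
if command -v zebraidx >/dev/null 2>&1 && [ -d records ]; then
    zebraidx -c "$CONFIG" update records
fi

# The server takes the same option:
#   zebrasrv -c conf/dino.cfg
```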
The master configuration file tells &zebra;:
Where to find subsidiary configuration files, including both
those that are named explicitly and a few ``magic'' files such
as default.idx,
which specifies the default indexing rules.
What record schemas to support. (Subsidiary files specify how
to index the contents of records in those schemas, and what
format to use when presenting records in those schemas to client
software.)
What attribute sets to recognise in searches. (Subsidiary files
specify how to interpret the attributes in terms
of the indexes that are created on the records.)
Policy details such as what type of input format to expect when
adding new records, what low-level indexing algorithm to use,
how to identify potential duplicate records, etc.
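As a rough illustration of these kinds of setting, the sketch below writes a small master file. Only profilePath and recordType are taken from this chapter; the attset, isam and recordId lines and their values are assumptions that should be checked against the &zebra; configuration reference, and the file name zebra-sketch.cfg is hypothetical.

```shell
# Hypothetical file name, so that a real zebra.cfg is not overwritten.
cat > zebra-sketch.cfg <<'EOF'
# Where to look for subsidiary configuration files such as default.idx
profilePath: .:../../tab
# Attribute set to recognise in searches (assumed directive and value)
attset: bib1.att
# Input format to expect when adding new records
recordType: grs.sgml
# Low-level index structure (assumed directive and value)
isam: b
# How records are identified when they are added again (assumed value)
recordId: file
EOF
```

The examples that follow need only the profilePath and recordType lines.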
Now let's see what goes in the zebra.cfg file
for some example configurations.
Example 1: &acro.xml; Indexing And Searching
This example shows how &zebra; can be used with absolutely minimal
configuration to index a body of
&acro.xml;
documents, and search them using
XPath
expressions to specify access points.
Go to the examples/zthes subdirectory
of the distribution archive.
There you will find a Makefile that will
populate the records subdirectory with a file of
Zthes
records representing a taxonomic hierarchy of dinosaurs. (The
records are generated from the family tree in the file
dino.tree.)
Type make records/dino.xml
to make the &acro.xml; data file.
(Or you could just type make dino to build the &acro.xml;
data file, create the database and populate it with the taxonomic
records all in one shot - but then you wouldn't learn anything,
would you? :-)
Now we need to create a &zebra; database to hold and index the &acro.xml;
records. We do this with the
&zebra; indexer, zebraidx, which is
driven by the zebra.cfg configuration file.
For our purposes, we don't need any
special behaviour - we can use the defaults - so we can start with a
minimal file that just tells zebraidx where to
find the default indexing rules, and how to parse the records:
profilePath: .:../../tab
recordType: grs.sgml
That's all you need for a minimal &zebra; configuration. Now you can
roll the &acro.xml; records into the database and build the indexes:
zebraidx update records
Now start the server. Like the indexer, its behaviour is
controlled by the
zebra.cfg file; and like the indexer, it works
just fine with this minimal configuration.
zebrasrv
By default, the server listens on TCP port 9999, although
this can easily be changed - see
the zebrasrv reference documentation.
Now you can use the &acro.z3950; client program of your choice to execute
XPath-based boolean queries and fetch the &acro.xml; records that satisfy
them:
$ yaz-client @:9999
Connecting...Ok.
Z> find @attr 1=/Zthes/termName Sauroposeidon
Number of hits: 1
Z> format xml
Z> show 1
<Zthes>
<termId>22</termId>
<termName>Sauroposeidon</termName>
<termType>PT</termType>
<termNote>The tallest known dinosaur (18m)</termNote>
<relation>
<relationType>BT</relationType>
<termId>21</termId>
<termName>Brachiosauridae</termName>
<termType>PT</termType>
</relation>
<idzebra xmlns="http://www.indexdata.dk/zebra/">
<size>300</size>
<localnumber>23</localnumber>
<filename>records/dino.xml</filename>
</idzebra>
</Zthes>
Now wasn't that nice and easy?
Example 2: Supporting Interoperable Searches
The problem with the previous example is that you need to know the
structure of the documents in order to find them. For example,
when we wanted to find the record for the taxon
Sauroposeidon,
we had to formulate a complex XPath
/Zthes/termName
which embodies the knowledge that taxon names are specified in a
<termName> element inside the top-level
<Zthes> element.
This is bad not just because it requires a lot of typing, but more
significantly because it ties searching semantics to the physical
structure of the searched records. You can't use the same search
specification to search two databases if their internal
representations are different. Consider a different taxonomy
database in which the records have taxon names specified
inside a <name> element nested within an
<identification> element
inside a top-level <taxon> element: then
you'd need to search for them using
1=/taxon/identification/name
How, then, can we build broadcasting Information Retrieval
applications that look for records in many different databases?
The &acro.z3950; protocol offers a powerful and general solution to this:
abstract ``access points''. In the &acro.z3950; model, an access point
is simply a point at which searches can be directed. Nothing is
said about implementation: in a given database, an access point
might be implemented as an index, a path into physical records, an
algorithm for interrogating relational tables or whatever works.
The only important thing is that the semantics of an access
point is fixed and well defined.
For convenience, access points are gathered into attribute
sets. For example, the &acro.bib1; attribute set is supposed to
contain bibliographic access points such as author, title, subject
and ISBN; the GEO attribute set contains access points pertaining
to geospatial information (bounding coordinates, stratum, latitude
resolution, etc.); the CIMI
attribute set contains access points to do with museum collections
(provenance, inscriptions, etc.).
In practice, the &acro.bib1; attribute set has tended to be a dumping
ground for all sorts of access points, so that, for example, it
includes some geospatial access points as well as strictly
bibliographic ones. Nevertheless, this model
allows a layer of abstraction over the physical representation of
records in databases.
In the &acro.bib1; attribute set, a taxon name is probably best
interpreted as a title - that is, a phrase that identifies the item
in question. &acro.bib1; represents title searches by
access point 4. (See
The &acro.bib1; Attribute
Set Semantics.)
So we need to configure our dinosaur database so that searches for
&acro.bib1; access point 4 look in the
<termName> element,
inside the top-level
<Zthes> element.
This is a two-step process. First, we need to tell &zebra; that we
want to support the &acro.bib1; attribute set. Then we need to tell it
which elements of its records pertain to access point 4.
We need to create an Abstract Syntax
file named after the document element of the records we're
working with, plus a .abs suffix - in this case,
Zthes.abs - as follows:
attset zthes.att
attset bib1.att
xpath enable
systag sysno none
xelm /Zthes/termId termId:w
xelm /Zthes/termName termName:w,title:w
xelm /Zthes/termQualifier termQualifier:w
xelm /Zthes/termType termType:w
xelm /Zthes/termLanguage termLanguage:w
xelm /Zthes/termNote termNote:w
xelm /Zthes/termCreatedDate termCreatedDate:w
xelm /Zthes/termCreatedBy termCreatedBy:w
xelm /Zthes/termModifiedDate termModifiedDate:w
xelm /Zthes/termModifiedBy termModifiedBy:w
The first line, attset zthes.att, declares the Thesaurus attribute set. See zthes.att.
The second line, attset bib1.att, declares the &acro.bib1; attribute set. See bib1.att in
&zebra;'s tab directory.
The xelm directive for /Zthes/termId selects the contents of nodes matching that XPath
expression. The contents (CDATA) will be
word searchable by Zthes attribute termId (value 1001).
The xelm directive for /Zthes/termName makes termName word searchable by both
Zthes attribute termName (1002) and &acro.bib1; attribute title (4).
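Once Zthes.abs is in place, the database has to be rebuilt before the new access points can be searched. A minimal sketch, assuming it is acceptable to discard the existing register and that the commands are run in the directory holding zebra.cfg and the records subdirectory; the reindex function name is only for illustration.

```shell
# Rebuild the register so that the rules in Zthes.abs take effect.
reindex() {
    if command -v zebraidx >/dev/null 2>&1 && [ -f zebra.cfg ] && [ -d records ]; then
        # Discard the existing register, then index the records again.
        zebraidx init &&
        zebraidx update records
    else
        echo "skipped: zebraidx, zebra.cfg or records/ not found"
    fi
}

reindex
```

Starting from an empty register avoids ending up with two copies of each record, since the minimal configuration does nothing to match new records against ones already indexed.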
After re-indexing, we can search the database using the &acro.bib1;
title attribute, as follows:
Z> form xml
Z> f @attr 1=4 Eoraptor
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 1, setno 1
SearchResult-1: Eoraptor(1)
records returned: 0
Elapsed: 0.106896
Z> s
Sent presentRequest (1+1).
Records: 1
[Default]Record type: &acro.xml;
<Zthes>
<termId>2</termId>
<termName>Eoraptor</termName>
<termType>PT</termType>
<termNote>The most basal known dinosaur</termNote>
...
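To connect this back to the second taxonomy database imagined earlier: the sketch below writes an abstract syntax file for it, assuming its document element is <taxon> (so the file would be named taxon.abs) and reusing only directives that appear in Zthes.abs. That database is hypothetical, and so is this file.

```shell
# Hypothetical abstract syntax file for the <taxon> record format.
# It maps the taxon name onto the same bib1 title access point (4).
cat > taxon.abs <<'EOF'
attset bib1.att
xpath enable
systag sysno none
xelm /taxon/identification/name title:w
EOF
```

With a mapping like this, a client could send the same @attr 1=4 search to both databases without knowing that one stores names in /Zthes/termName and the other in /taxon/identification/name.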