1 <chapter id="examples">
2 <!-- $Id: examples.xml,v 1.27 2007-05-24 13:44:09 adam Exp $ -->
3 <title>Example Configurations</title>
5 <sect1 id="examples-overview">
6 <title>Overview</title>
9 <command>zebraidx</command> and
10 <command>zebrasrv</command> are both
11 driven by a master configuration file, which may refer to other
12 subsidiary configuration files. By default, they try to use
13 <filename>zebra.cfg</filename> in the working directory as the
14 master file; but this can be changed using the <literal>-c</literal>
15 option to specify an alternative master configuration file.
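<para>
As a brief illustration, assuming an alternative master file at the
hypothetical path <filename>/etc/zebra/dino.cfg</filename>, both
programs could be pointed at it like this:
</para>
<screen>
zebraidx -c /etc/zebra/dino.cfg update records
zebrasrv -c /etc/zebra/dino.cfg
</screen>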
18 The master configuration file tells &zebra;:
23 Where to find subsidiary configuration files, including both
24 those that are named explicitly and a few ``magic'' files such
25 as <literal>default.idx</literal>,
26 which specifies the default indexing rules.
32 What record schemas to support. (Subsidiary files specify how
33 to index the contents of records in those schemas, and what
34 format to use when presenting records in those schemas to client
41 What attribute sets to recognise in searches. (Subsidiary files
42 specify how to interpret the attributes in terms
43 of the indexes that are created on the records.)
49 Policy details such as what type of input format to expect when
50 adding new records, what low-level indexing algorithm to use,
51 how to identify potential duplicate records, etc.
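<para>
To make these categories more concrete, a master file touching on most
of them might look roughly like the sketch below. The values shown are
illustrative assumptions, not settings taken from the examples in this
chapter; check the configuration reference for the directives that
your version of &zebra; supports.
</para>
<screen>
# Where to look for subsidiary files such as default.idx
profilePath: .:../../tab
# Attribute set to recognise in searches
attset: bib1.att
# Input format to expect when adding records (assumed value)
recordType: grs.sgml
# How to recognise a record that has been seen before (assumed value)
recordId: file
</screen>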
58 Now let's see what goes in the <literal>zebra.cfg</literal> file
59 for some example configurations.
64 <title>Example 1: &acro.xml; Indexing And Searching</title>
67 This example shows how &zebra; can be used with absolutely minimal
68 configuration to index a body of
69 <ulink url="&url.xml;">&acro.xml;</ulink>
70 documents, and search them using
71 <ulink url="&url.xpath;">XPath</ulink>
72 expressions to specify access points.
75 Go to the <literal>examples/zthes</literal> subdirectory
76 of the distribution archive.
77 There you will find a <literal>Makefile</literal> that will
78 populate the <literal>records</literal> subdirectory with a file of
79 <ulink url="http://zthes.z3950.org/">Zthes</ulink>
80 records representing a taxonomic hierarchy of dinosaurs. (The
81 records are generated from the family tree in the file
82 <literal>dino.tree</literal>.)
83 Type <literal>make records/dino.xml</literal>
84 to make the &acro.xml; data file.
85 (Or you could just type <literal>make dino</literal> to build the &acro.xml;
86 data file, create the database and populate it with the taxonomic
87 records all in one shot - but then you wouldn't learn anything,
91 Now we need to create a &zebra; database to hold and index the &acro.xml;
92 records. We do this with the
93 &zebra; indexer, <command>zebraidx</command>, which is
94 driven by the <literal>zebra.cfg</literal> configuration file.
95 For our purposes, we don't need any
96 special behaviour - we can use the defaults - so we can start with a
97 minimal file that just tells <command>zebraidx</command> where to
98 find the default indexing rules, and how to parse the records:
100 profilePath: .:../../tab
105 That's all you need for a minimal &zebra; configuration. Now you can
106 roll the &acro.xml; records into the database and build the indexes:
108 zebraidx update records
112 Now start the server. Like the indexer, its behaviour is
114 <literal>zebra.cfg</literal> file; and like the indexer, it works
115 just fine with this minimal configuration.
119 By default, the server listens on TCP port 9999, although
120 this can easily be changed - see
121 <xref linkend="zebrasrv"/>.
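<para>
For instance, the server could probably be started on a different
port by giving it a listener address on the command line. The port
number 2100 here is an arbitrary choice for illustration:
</para>
<screen>
zebrasrv @:2100
</screen>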
124 Now you can use the &acro.z3950; client program of your choice to execute
125 XPath-based boolean queries and fetch the &acro.xml; records that satisfy
130 Z> find @attr 1=/Zthes/termName Sauroposeidon
135 &lt;termId&gt;22&lt;/termId&gt;
136 &lt;termName&gt;Sauroposeidon&lt;/termName&gt;
137 &lt;termType&gt;PT&lt;/termType&gt;
138 &lt;termNote&gt;The tallest known dinosaur (18m)&lt;/termNote&gt;
140 &lt;relationType&gt;BT&lt;/relationType&gt;
141 &lt;termId&gt;21&lt;/termId&gt;
142 &lt;termName&gt;Brachiosauridae&lt;/termName&gt;
143 &lt;termType&gt;PT&lt;/termType&gt;
146 &lt;idzebra xmlns="http://www.indexdata.dk/zebra/"&gt;
147 &lt;size&gt;300&lt;/size&gt;
148 &lt;localnumber&gt;23&lt;/localnumber&gt;
149 &lt;filename&gt;records/dino.xml&lt;/filename&gt;
155 Now wasn't that nice and easy?
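<para>
A boolean variant of such a search, combining two XPath access points,
might look like the following session. This is a sketch only: it
assumes the <command>yaz-client</command> program, a server running
on the local machine on the default port, and that the
<literal>termType</literal> element is searchable in the same way
as <literal>termName</literal>.
</para>
<screen>
$ yaz-client localhost:9999
Z> format xml
Z> find @and @attr 1=/Zthes/termName Sauroposeidon @attr 1=/Zthes/termType PT
Z> show 1
</screen>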
160 <sect1 id="example2">
161 <title>Example 2: Supporting Interoperable Searches</title>
164 The problem with the previous example is that you need to know the
165 structure of the documents in order to find them. For example,
166 when we wanted to find the record for the taxon
167 <foreignphrase role="taxon">Sauroposeidon</foreignphrase>,
168 we had to formulate a complex XPath
169 <literal>/Zthes/termName</literal>
170 which embodies the knowledge that taxon names are specified in a
171 <literal>&lt;termName&gt;</literal> element inside the top-level
172 <literal>&lt;Zthes&gt;</literal> element.
175 This is bad not just because it requires a lot of typing, but more
176 significantly because it ties searching semantics to the physical
177 structure of the searched records. You can't use the same search
178 specification to search two databases if their internal
179 representations are different. Consider a different taxonomy
180 database in which the records have taxon names specified
181 inside a <literal>&lt;name&gt;</literal> element nested within a
182 <literal>&lt;identification&gt;</literal> element
183 inside a top-level <literal>&lt;taxon&gt;</literal> element: then
184 you'd need to search for them using
185 <literal>1=/taxon/identification/name</literal>
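<para>
Side by side, the same logical question would have to be phrased
differently for each database. The second query is hypothetical and
assumes a server holding records in the alternative
<literal>taxon</literal> layout just described:
</para>
<screen>
Z> find @attr 1=/Zthes/termName Sauroposeidon
Z> find @attr 1=/taxon/identification/name Sauroposeidon
</screen>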
188 How, then, can we build broadcasting Information Retrieval
189 applications that look for records in many different databases?
190 The &acro.z3950; protocol offers a powerful and general solution to this:
191 abstract ``access points''. In the &acro.z3950; model, an access point
192 is simply a point at which searches can be directed. Nothing is
193 said about implementation: in a given database, an access point
194 might be implemented as an index, a path into physical records, an
195 algorithm for interrogating relational tables or whatever works.
196 The only important thing is that the semantics of an access
197 point is fixed and well defined.
200 For convenience, access points are gathered into <firstterm>attribute
201 sets</firstterm>. For example, the &acro.bib1; attribute set is supposed to
202 contain bibliographic access points such as author, title, subject
203 and ISBN; the GEO attribute set contains access points pertaining
204 to geospatial information (bounding coordinates, stratum, latitude
205 resolution, etc.); the CIMI
206 attribute set contains access points to do with museum collections
207 (provenance, inscriptions, etc.)
210 In practice, the &acro.bib1; attribute set has tended to be a dumping
211 ground for all sorts of access points, so that, for example, it
212 includes some geospatial access points as well as strictly
213 bibliographic ones. Nevertheless, this model
214 allows a layer of abstraction over the physical representation of
215 records in databases.
218 In the &acro.bib1; attribute set, a taxon name is probably best
219 interpreted as a title - that is, a phrase that identifies the item
220 in question. &acro.bib1; represents title searches by
222 <ulink url="&url.z39.50.bib1.semantics;">The &acro.bib1; Attribute
223 Set Semantics</ulink>)
224 So we need to configure our dinosaur database so that searches for
225 &acro.bib1; access point 4 look in the
226 <literal>&lt;termName&gt;</literal> element,
228 <literal>&lt;Zthes&gt;</literal> element.
231 This is a two-step process. First, we need to tell &zebra; that we
232 want to support the &acro.bib1; attribute set. Then we need to tell it
233 which elements of its record pertain to access point 4.
236 We need to create an <link linkend="abs-file">Abstract Syntax
237 file</link> named after the document element of the records we're
238 working with, plus a <literal>.abs</literal> suffix - in this case,
239 <literal>Zthes.abs</literal> - as follows:
243 <area id="attset.zthes" coords="2"/>
244 <area id="attset.attset" coords="3"/>
245 <area id="termId" coords="7"/>
246 <area id="termName" coords="8"/>
254 xelm /Zthes/termId termId:w
255 xelm /Zthes/termName termName:w,title:w
256 xelm /Zthes/termQualifier termQualifier:w
257 xelm /Zthes/termType termType:w
258 xelm /Zthes/termLanguage termLanguage:w
259 xelm /Zthes/termNote termNote:w
260 xelm /Zthes/termCreatedDate termCreatedDate:w
261 xelm /Zthes/termCreatedBy termCreatedBy:w
262 xelm /Zthes/termModifiedDate termModifiedDate:w
263 xelm /Zthes/termModifiedBy termModifiedBy:w
266 <callout arearefs="attset.zthes">
268 Declare the Thesaurus attribute set. See <filename>zthes.att</filename>.
271 <callout arearefs="attset.attset">
273 Declare &acro.bib1; attribute set. See <filename>bib1.att</filename> in
274 &zebra;'s <filename>tab</filename> directory.
277 <callout arearefs="termId">
279 This <literal>xelm</literal> directive selects the contents of nodes matching the XPath expression
280 <literal>/Zthes/termId</literal>. The contents (CDATA) will be
281 word-searchable via the Zthes attribute termId (value 1001).
284 <callout arearefs="termName">
286 Make <literal>termName</literal> word-searchable by both
287 Zthes attribute termName (1002) and &acro.bib1; attribute title (4).
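<para>
The same technique is what makes searches interoperable. As a sketch,
the differently structured taxonomy database imagined earlier could
map its own name element onto the same &acro.bib1; title access point
with a hypothetical <filename>taxon.abs</filename> file along these
lines, after which a title search would work unchanged against either
database:
</para>
<screen>
# taxon.abs (hypothetical file for the alternative record layout)
attset bib1.att
xelm /taxon/identification/name  title:w
</screen>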
293 After re-indexing, we can search the database using the &acro.bib1;
294 title attribute, as follows:
297 Z> f @attr 1=4 Eoraptor
299 Received SearchResponse.
300 Search was a success.
301 Number of hits: 1, setno 1
302 SearchResult-1: Eoraptor(1)
306 Sent presentRequest (1+1).
308 [Default]Record type: &acro.xml;
310 &lt;termId&gt;2&lt;/termId&gt;
311 &lt;termName&gt;Eoraptor&lt;/termName&gt;
312 &lt;termType&gt;PT&lt;/termType&gt;
313 &lt;termNote&gt;The most basal known dinosaur&lt;/termNote&gt;
322 The simplest hello-world example could go like this:
327 <title>The art of motorcycle maintenance</title>
328 <subject scheme="Dewey">zen</subject>
333 f @attr 1=/book/title motorcycle
335 f @attr 1=/book/subject[@scheme=Dewey] zen
337 If you suddenly decide you want broader interop, you can add
338 an abs file (more or less like this):
343 elm (2,1) title title
344 elm (2,21) subject subject
348 How to include images:
352 <imagedata fileref="system.eps" format="eps">
355 <imagedata fileref="system.gif" format="gif">
358 <phrase>The Multi-Lingual Search System Architecture</phrase>
362 <emphasis role="strong">
363 The Multi-Lingual Search System Architecture.
366 Network connections across local area networks are
367 represented by straight lines, and those over the
368 internet by jagged lines.
372 Where the three <*object> thingies inside the top-level <mediaobject>
373 are decreasingly preferred version to include depending on what the
374 rendering engine can handle. I generated the EPS version of the image
375 by exporting a line-drawing done in TGIF, then converted that to the
376 GIF using a shell-script called "epstogif" which used an appallingly
377 baroque sequence of conversions, which I would prefer not to pollute
378 the &zebra; build environment with:
382 # Yes, what follows is stupidly convoluted, but I can't find a
383 # more straightforward path from the EPS generated by tgif's
384 # "Print" command into a browser-friendly format.
386 file=`echo "$1" | sed 's/\.eps//'`
387 ps2pdf "$1" "$file".pdf
388 pdftopbm "$file".pdf "$file"
389 pnmscale 0.50 < "$file"-000001.pbm | pnmcrop | ppmtogif
390 rm -f "$file".pdf "$file"-000001.pbm
394 <!-- Keep this comment at the end of the file
399 sgml-minimize-attributes:nil
400 sgml-always-quote-attributes:t
403 sgml-parent-document: "zebra.xml"
404 sgml-local-catalogs: nil
405 sgml-namecase-general:t