<!doctype linuxdoc system>
<title>Serving GILS Records with Zebra
<author><htmlurl url="http://www.indexdata.dk/" name="Index Data">, <tt><htmlurl url="mailto:info@index.ping.dk" name="info@index.ping.dk"></>

This document explains how to set up a simple database of Government
Information Locator Records using the Zebra retrieval engine and
Z39.50 server (version 1.0a6 or later).

Zebra is a powerful and versatile information management system
which allows you to construct arbitrarily complex record structures
and maintain efficient, robust databases.

Since the internal data modeling tools of Zebra are based on the
Z39.50-1995 standard, the system is well-suited to support a complex
database profile such as the one specified by the GILS profile.
Because GILS is expected by many to play an important role in the
evolving, global information society, and because the GILS profile has
proven useful to a number of different applications outside of its
dedicated domain, the public distribution of Zebra includes a
configuration set-up which makes it a simple matter to establish a new
GILS database.

This document, which is a supplement to the general documentation set
for Zebra, explains in simple terms how you can easily set up your own
GILS-compliant database service.

<sect>Retrieving and Unpacking Zebra

The first step is to download the software. If you are using a WWW
browser, you can point it at the Zebra distribution archive at
<tt>&lt;<htmlurl url="http://www.indexdata.dk/zebra.html"
name="http://www.indexdata.dk/zebra.html">&gt;</tt>, and follow the link named
<it/Download the latest version of the software (xxx)/, where <it/xxx/
is the current version of Zebra.

If you use an FTP client, you can use normal, anonymous FTP. Connect
to the host <tt/ftp.indexdata.dk/, log in as <tt/ftp/, and give your
email address as the password. Then type <tt>cd index/yaz</tt>, and
use the <tt/dir/ command to locate the current version of Zebra. The
file will be named <tt/zebra-xxx/, where <tt/xxx/ is the current
version of the software. Remember to use the <tt/bin/ command before
using <tt/get/ to download the software.

Once the distribution archive has been downloaded, it must be
decompressed. To do this, use the <tt/gunzip/ command (if your
system doesn't have the <tt/gunzip/ program, you will need to acquire
it separately). Finally, use the command <tt>tar xvf
&lt;file&gt;</tt> to unpack the archive.
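
As a rough sketch of the two unpacking steps, the shell function below
decompresses an archive and feeds it to <tt/tar/ in one pass. The
function name <tt/unpack_archive/ is an assumption made for
illustration and is not part of the Zebra distribution.

```shell
# Hypothetical helper (not part of the Zebra distribution): decompress a
# gzip-compressed tar archive and unpack it into the current directory.
unpack_archive() {
    archive="$1"
    # gunzip -c writes the decompressed data to standard output and leaves
    # the compressed file untouched; "tar xvf -" reads the archive from
    # standard input and lists each file as it is extracted.
    gunzip -c "$archive" | tar xvf -
}
```

Call it with the name of the file you downloaded as its only argument.
Unlike running <tt/gunzip/ and <tt/tar/ separately, this variant keeps
the compressed archive on disk.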

If you downloaded the source version of the software (this is the only
option today, although we expect to release binary versions for Linux,
SunOS, and Digital Unix shortly), you will have to compile Zebra
before you can use it.

On many of the major versions of the Unix operating system, compiling
Zebra is a simple matter of typing <tt/make/ in the top-level
distribution directory (this is the directory that was created when
you executed <tt/tar/). Normally, Zebra compiles cleanly at least on
Linux, Digital Unix (DEC OSF/1), and IBM AIX. On certain platforms
(such as SunOS), you will need to edit the top-level <tt/Makefile/ to
set the <tt/ELIBS/ variable to include the Berkeley socket
libraries. On other Unix platforms, you <it/may/ need to modify
Makefiles or header files, but in general, we have found Zebra to be
easily portable across modern Unix versions. You do need an ANSI C
compliant compiler (you'll see a long list of syntax errors during the
compile if your default compiler is not ANSI C), but again, this is
standard on most modern Unix systems. If you don't have one, the
GNU C compiler is freely available for many systems.

<sect>The First GILS Database

Having successfully acquired the software, it's time to try it out.
The directory <tt/test/ under the main distribution directory contains
a small sample database of GILS records.

<it>NOTE: The records included in the distribution are part of a
sample set provided by the US Geological Survey, as a service to GILS
implementors. They are included for testing and demonstrating the
software, and neither the USGS, Index Data, nor anyone else should be
held responsible for their contents.</it>

If you <tt/cd/ to the <tt/test/ directory, the first thing to notice
is the file <tt/zebra.cfg/. There has to be a file like this present
whenever you run Zebra - it establishes various settings and defaults,
and we'll return to its contents below (a detailed
description is found in the general Zebra documentation file).

The subdirectory <tt/records/ contains the sample records. We'll get

The first order of business is to index the sample records, and create
the access files required by the Z39.50 server. To do this, position
yourself in the <tt/test/ directory, and type the command

$ ../index/zebraidx update records

The indexing program will respond with a stream of control
information, and when it completes, the database is ready. To start
the Z39.50 server, type the command <tt>../index/zebrasrv</tt>.

Assuming that nothing unfortunate happened, you are now running a
GILS-compliant Z39.50 server on port 9999 of your local machine
(to learn how to run the server on a different port, and redirect the
diagnostic output to a file, consult the section on <it/Running
zebrasrv/ in the general documentation).
The database containing the sample records is named <tt/Default/.

To test the server, you can use any compatible Z39.50 client. You can
also use the simple demonstration client which is included with Zebra
itself. To do this, start a new session on your machine (or put the
server in the background). Change to the directory <tt>yaz/client</tt>
under the main Zebra distribution directory. Now execute the command

$ ./client tcp:localhost:9999

If all went well, the client will tell you that it has established an
association with your test server. To test it, try out these commands:

The default retrieval syntax for the client is USMARC. To try other
formats for the same record, try:

You can learn more about the sample client by reading the <tt/README/
file in the <tt/yaz/ directory.

The GILS profile is only concerned with the communication that takes
place between two compliant systems. It doesn't mandate how the client
application should behave, and it doesn't tell you how you should
maintain and process data at the server side. Specifically, while the
profile specifies a number of different exchange formats for retrieval

For the purposes of this discussion, we will be using a simple,
SGML-like representation of the GILS record structure. There is
nothing magical or sacrosanct about this format, but it is easy to
read and write, and because of its resemblance to SGML and HTML, it is
familiar to many people. If you would like to use a different, local
representation for your GILS records, you can read the general Zebra
documentation to learn how to establish a custom input filter for your
particular record format.

In the SGML-like syntax, each record should begin with the tag
<tt/&lt;gils&gt;/. This selects the GILS profile, and provides context
for the content tags which follow. Similarly, each record should
finish with the end-tag <tt/&etago;gils&gt;/.

The body of the record is made up of a sequence of tagged elements,
reflecting the <it/abstract record syntax/ of the GILS profile. Some
of these elements contain simple data, or text, while others contain
more tagged elements - these are complex, or constructed, data
elements. The tag names generally correspond to the tag names provided
in the GILS profile. Capitalization is ignored in tag names, as are
dashes (-). Hence, <tt/local-subject-index/ is equivalent to
<tt/LocalSubjectIndex/ which is the same as <tt/LOCALSUBJECTINDEX/.
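
The equivalence rule for tag names can be illustrated with a few lines
of shell. This is only a sketch of the rule as stated above, not
Zebra's own code, and the function name <tt/normalize_tag/ is
hypothetical.

```shell
# Illustration only: mimic the tag-name rule described above by deleting
# dashes and folding to lower case. This is not Zebra's own code.
normalize_tag() {
    printf '%s\n' "$1" | tr -d '-' | tr '[:upper:]' '[:lower:]'
}

# All three spellings print the same normalized name.
normalize_tag local-subject-index
normalize_tag LocalSubjectIndex
normalize_tag LOCALSUBJECTINDEX
```

Each of the three calls prints <tt/localsubjectindex/, which is why the
spellings are interchangeable in a record.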

It is useful to look at the records in the <tt>test/records</tt>
directory as examples of how SGML-formatted GILS records can look.
Note that whitespace is generally ignored, so you can choose whatever
record layout suits you best.

<sect>The Zebra Configuration File

As mentioned, the Zebra indexer and server always look for the file
<tt/zebra.cfg/ in their current working directory (unless they are
told to look for it elsewhere with the <tt/-c/ option). The example
file in the <tt/test/ directory represents little more than the bare
minimum for such a file. We find the
following to be a powerful setup for a GILS-like database (everything
preceded by (#) is ignored by the software):

# Sample configuration file for GILS database

# Where are the configuration files located?
profilePath: /usr/local/lib/zebra

# Load attribute sets for searching

# Records are identified by their path in the file system

# Store information about records to allow deletion and updating

# Records are structured

# Where to store the indexes
register: /datadisk/index:500M

# Where to store temporary data while merging with register
shadow: /datadisk/shadow:500M

If you like, you can paste this file straight into a <tt/zebra.cfg/
file ready for your own use (with a bit of editing of the pathnames).
In the following, we'll explain the individual settings. For the full
story on the <tt/zebra.cfg/ file and the configuration options of
Zebra, you should read the general documentation.
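
For reference, the following shell fragment writes out a complete
configuration file along these lines. The <tt/recordId: file/ setting
is taken from this document; the attribute set file names
(<tt/bib1.att/ and <tt/gils.att/), the value <tt/1/ for
<tt/storeKeys/, and the value <tt/grs/ for <tt/recordType/ are
assumptions, so compare them with the <tt/zebra.cfg/ in the <tt/test/
directory before relying on them. The <tt/CFG/ variable is a
hypothetical name for the output file.

```shell
# Write a sample zebra.cfg. Values marked "assumption" are not confirmed
# by this document; compare with test/zebra.cfg in the distribution.
CFG=${CFG:-zebra.cfg}
cat > "$CFG" <<'EOF'
# Where are the configuration files located?
profilePath: /usr/local/lib/zebra

# Load attribute sets for searching (file names are an assumption)
attset: bib1.att
attset: gils.att

# Records are identified by their path in the file system
recordId: file

# Store information about records to allow deletion and updating
# (the value 1 is an assumption)
storeKeys: 1

# Records are structured (the value grs is an assumption)
recordType: grs

# Where to store the indexes
register: /datadisk/index:500M

# Where to store temporary data while merging with register
shadow: /datadisk/shadow:500M
EOF
```

The quoted <tt/'EOF'/ delimiter keeps the shell from expanding anything
inside the file body, so the text is written exactly as shown.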

<tag/profilePath/ This field tells Zebra where to look for the
configuration files. In the distribution, these files are located in
the <tt/tab/ directory, but you may wish to put them someplace else
for convenience. If necessary, you can provide multiple directory
paths, separated by (:).

<tag/attset/ This field tells the Zebra server which attribute sets it
should support for searching. You could get by with just loading the
GILS set, but if you load BIB-1 as well, Zebra will support both sets
for those GILS attributes that are inherited from BIB-1.

<tag/recordId/ The <tt/recordId: file/ setting tells Zebra that
individual records should be identified by the physical files in which
they are located. In this mode, your database will always (after an
update operation) reflect the contents of the directory (or

<tag/storeKeys/ This setting tells Zebra to store additional
information about each record, to facilitate updating. In combination
with the <tt/recordId: file/ setting, this is a very convenient
maintenance option. If you maintain your records as individual files
in a directory tree, you have only to run <tt/zebraidx/ with the
top-level directory as an argument. If new files are added, they are
entered into the database. If they are modified, the indexes are
changed accordingly, and if they are deleted from the filesystem (or
renamed), the indexes are also updated correctly, the next time you

<tag/recordType/ This setting selects the type of processing which is
to take place when a record is accessed by the indexer or the Z39.50
server. GRS stands for <it/Generic Record Syntax/, and signals that
the records are structured.

<tag/register/ In the first test above, you may have noticed that
<tt/zebraidx/ created a number of files in the working directory. Some
of these files, which contain the indexing information for the
database, can grow quite large, and it is sometimes useful to place
them in a separate directory or file system. You should provide the
path of the directory followed by a colon (:), followed by the maximum
number of megabytes (M) or kilobytes (K) of disk space that Zebra is
allowed to use in the given directory. If you specify more than one
directory:size combination <it/on the same line/, Zebra will fill up
each directory from left to right. This feature is essential if your
database is so large that the registers cannot fit into a single
partition of your disk.
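
To make the directory:size format concrete, here is a small sketch that
splits a <tt/register/ value into its parts and prints them in the
order in which they would be filled. The function name, the example
paths, and the use of whitespace to separate the combinations are
assumptions; the general documentation is the authority on the exact
syntax.

```shell
# Hypothetical helper: print each directory and its size limit from a
# register value, one pair per line, in left-to-right (fill) order.
list_register_dirs() {
    # Unquoted $1 is split on whitespace: one directory:size pair per word.
    for pair in $1; do
        dir=${pair%:*}     # everything before the last colon
        size=${pair##*:}   # everything after the last colon
        printf '%s %s\n' "$dir" "$size"
    done
}

# Example value with two (made-up) partitions.
list_register_dirs "/disk1/index:500M /disk2/index:800M"
```

With the example value, the first line of output names
<tt>/disk1/index</tt> with a limit of <tt/500M/, and the second names
<tt>/disk2/index</tt> with a limit of <tt/800M/.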

<tag/shadow/ The format of this setting is the same as for the one
above. If you provide one or more directories for the &dquot;shadow
system&dquot;, you enable the safe updating system of the Zebra
indexer. When changes to the records are merged into the register
files, the files are not changed immediately. Instead, the changes are
written into separate files, or &dquot;shadow files&dquot;. At the end
of the merging process, or in a separate operation, the changes are
&dquot;committed&dquot;, and written into the register files
themselves. This final step is carried out by the command <tt/zebraidx
commit/ - the <tt/commit/ directive can also be given at the end of
the same command line as the <tt/update/ directive. The shadow file
system can consume a lot of disk space - particularly in a large
update operation which involves almost all of the index - but the
benefits are substantial. Without the shadow system, a crash or other
interruption during an update leaves the registers in an unknown
state and effectively renders them useless, which can be unfortunate
if the index is very large. The use of the shadow system greatly
reduces the risk of an index being damaged in this way. Further, when
the shadow system is enabled, your clients may access the Zebra server
without interruption throughout the update and commit procedures -
Zebra will ensure that the parts of the register accessed by the
server are always consistent.
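
The two-step procedure described for the shadow system can be sketched
as a shell function. The name <tt/update_then_commit/ and the
<tt/ZEBRAIDX/ variable are hypothetical; the default path is the one
used from the <tt/test/ directory earlier in this document.

```shell
# Path to the indexer; override ZEBRAIDX if it lives elsewhere.
ZEBRAIDX=${ZEBRAIDX:-../index/zebraidx}

# Merge changes from a records directory into the shadow files, then
# commit them to the register files only if the update succeeded.
update_then_commit() {
    "$ZEBRAIDX" update "$1" || return 1
    "$ZEBRAIDX" commit
}
```

Run it as <tt/update_then_commit records/. Keeping the steps separate
lets you skip the commit when the update reports a problem; the single
command <tt>../index/zebraidx update records commit</tt> combines both
steps, as noted above.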

<sect>Creating Your Own Database

Whenever we create a new database with Zebra, we find it useful to
first set up a new, empty directory. This directory will contain the
configuration file, the lock files maintained by Zebra (unless you
specify a different location for these), and any logs of updates and
server runs that you may wish to keep around. The first thing to do is
set up the <tt/zebra.cfg/ file for your database. You can copy the one
from the <tt/test/ directory, or you can create a new one using the
example settings described in the previous section. Once you get your
server up and running, you may want to read the description of the
<tt/zebra.cfg/ file in the general documentation, to set up additional
defaults for database names, etc.

If you copy one of these files, you should be careful to update the
pathnames to reflect the setup of your own database. In particular, if
you want to specify one or more directories for the register files
and/or the shadow files, you should make sure that these directories
exist and are accessible to the user ID which will run the Zebra

You need to make sure that your GILS records are available, too. For
small to medium-sized databases (say, fewer than 100,000 records), it
is sometimes preferable to maintain the records as individual files
somewhere in the file system. Zebra will, by default, access these
files directly whenever the user requests to see a specific record.
However, you can set up Zebra to maintain the database records in
other ways, too. Consult the general documentation for details.

Finally, you need to run <tt/zebraidx/ to create the index files, and
start up the server, <tt/zebrasrv/ (the server can be run from
<tt/inetd/ if required), and you are in business.
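
Put together, the steps in this section might look like the sketch
below. The variable names and default paths (<tt/ZEBRA_HOME/,
<tt/DB_DIR/, <tt/RECORDS/) and the function name <tt/setup_database/
are assumptions chosen for illustration. The function only creates the
directory, copies the sample configuration, and indexes the records;
starting <tt/zebrasrv/ is left as a separate step.

```shell
# All three locations are assumptions; adjust them to your installation.
ZEBRA_HOME=${ZEBRA_HOME:-$HOME/zebra}      # unpacked, compiled distribution
DB_DIR=${DB_DIR:-$HOME/gils-db}            # new, empty database directory
RECORDS=${RECORDS:-$HOME/gils-records}     # directory tree of GILS records

setup_database() {
    # Run in a subshell so the caller's working directory is unchanged.
    (
        mkdir -p "$DB_DIR" || exit 1
        # Start from the sample configuration; edit its pathnames afterwards.
        cp "$ZEBRA_HOME/test/zebra.cfg" "$DB_DIR/zebra.cfg" || exit 1
        cd "$DB_DIR" || exit 1
        # zebraidx looks for zebra.cfg in the current working directory.
        "$ZEBRA_HOME/index/zebraidx" update "$RECORDS"
    )
}
```

Once <tt/setup_database/ has returned successfully, change to the
database directory and start <tt/zebrasrv/ from there, so that it finds
the same <tt/zebra.cfg/.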

To access the data, you can use a dedicated Z39.50 client, or you can
set up a WWW/Z39.50 gateway to allow common WWW browsers to search
package includes a good, free gateway that you can experiment with.