doc/gils.sgml

   1 <!doctype linuxdoc system>
   2
   3 <!--
   4   $Id: gils.sgml,v 1.1 1996-05-07 11:19:13 quinn Exp $
   5 -->
   6
   7 <article>
   8 <title>Serving GILS Records with Zebra
   9 <author><htmlurl url="http://www.indexdata.dk/" name="Index Data">, <tt><htmlurl url="mailto:info@index.ping.dk" name="info@index.ping.dk"></>
  10 <date>$Revision: 1.1 $
  11 <abstract>
  12 This document explains how to set up a simple database of Government
  13 Information Locator Records using the Zebra retrieval engine and
  14 Z39.50 server (version 1.0b1 or later).
  15 </abstract>
  16
  17 <sect>Introduction
  18
  19 <p>
  20 Zebra is a powerful and versatile information management system,
  21 allowing you to construct arbitrary record structures and manage
  22 efficient, robust databases.
  23
  24 Since the internal data modeling tools of Zebra are based on the
  25 Z39.50-1995 standard, the system is well-suited to support a complex
  26 database profile such as the one specified by the GILS profile.
  27 Because GILS is expected by many to play an important role in the
  28 evolving, global information society, and because the GILS profile has
  29 proven useful to a number of different applications outside of its
  30 dedicated domain, the public distribution of Zebra includes a
  31 configuration set-up which makes it a simple matter to establish a new
  32 GILS database.
  33
  34 This document, which is a supplement to the general documentation set
  35 for Zebra, explains in simple terms how you can easily set up your own
  36 GILS-compliant database service.
  37
  38 <sect>Retrieving and Unpacking Zebra
  39
  40 <p>
  41 The first step is to download the software. If you are using a WWW
  42 browser, you can point it at the Zebra distribution archive at
  43 <htmlurl url="http://www.indexdata.dk/zebra.html"
  44 name="http://www.indexdata.dk/zebra.html">, and follow the link named
  45 <it/Download the latest version of the software (xxx)/, where <it/xxx/
  46 is the current version of Zebra.
  47
  48 If you use an FTP client, you can use normal, anonymous FTP. Connect
  49 to the host <tt/ftp.indexdata.dk/, log in as <tt/ftp/, and give your
  50 e-mail address as the password. Then type <tt>cd index/yaz</tt>, and
  51 use the <tt/dir/ command to locate the current version of Zebra. The
  52 file will be named <tt/zebra-xxx/, where <tt/xxx/ is the current
  53 version of the software. Remember to use the <tt/bin/ command before
  54 using <tt/get/ to download the software.
  55
  56 Once the distribution archive has been downloaded, it must be
  57 decompressed. For this, use the <tt/gunzip/ command (if your
  58 system doesn't have the <tt/gunzip/ program, you will need to acquire
  59 it separately). Finally, use the command <tt>tar xvf
  60 &lt;file&gt;</tt> to unpack the archive.
  61
  62 If you downloaded the source version of the software (this is the only
  63 option today, although we expect to release binary versions for Linux,
  64 SunOS, and Digital Unix shortly), you will need to compile it first.
  65
  66 On many of the major versions of the Unix operating system, compiling
  67 Zebra is a simple matter of typing <tt/make/ in the top-level
  68 distribution directory (this is the directory that was created when
  69 you executed <tt/tar/). Normally, Zebra compiles cleanly at least on
  70 Linux, Digital Unix (DEC OSF/1), and IBM AIX. On certain platforms
  71 (such as SunOS), you will need to edit the top-level <tt/Makefile/ to
  72 set the <tt/NETLIB/ variable to include the &dquot;Berkeley Socket
  73 Libraries&dquot;. On other Unix platforms, you <it/may/ need to modify
  74 Makefiles or header files, but in general, we have found Zebra to be
  75 easily portable across modern Unix versions. You do need an ANSI C
  76 compliant compiler (you'll see a long list of syntax errors during the
  77 compile if your default compiler is not ANSI C), but again, this is
  78 standard on most modern Unix systems. If you don't have one, the
  79 free GNU C compiler is available for many systems.
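
As a rough sketch, the unpacking and build steps described above might
look like this at the shell prompt. The archive name
<tt>zebra-xxx.tar.gz</tt> and the directory name <tt>zebra-xxx</tt> are
assumptions made for illustration (<tt/xxx/ stands for the version you
downloaded); substitute the names you actually see on your system.

<tscreen><verb>
gunzip zebra-xxx.tar.gz
tar xvf zebra-xxx.tar
cd zebra-xxx
make
</verb></tscreen>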
  80
  81 <sect>The First GILS Database
  82
  83 <p>
  84 Having successfully acquired the software, it's time to try it out.
  85 The directory <tt/test/ under the main distribution directory contains
  86 a small sample database of GILS records.
  87
  88 <it>NOTE: The records included in the distribution are part of a
  89 sample set provided by the US Geological Survey, as a service to GILS
  90 implementors. They are included for testing and demonstrating the
  91 software, and neither the USGS, Index Data, nor anyone else should be
  92 held responsible for their contents.</it>
  93
  94 If you <tt/cd/ to the <tt/test/ directory, the first thing to notice
  95 is the file <tt/zebra.cfg/. There has to be a file like this present
  96 whenever you use Zebra - it establishes various settings and defaults,
  97 and we'll return to its contents below (a detailed
  98 description is found in the general Zebra documentation file).
  99
 100 The subdirectory <tt/records/ contains the sample records. We'll get
 101 back to them, too.
 102
 103 The first order of business is to index the sample records, and create
 104 the access files required by the Z39.50 server. To do this, position
 105 yourself in the <tt/test/ directory, and type the command
 106 <tt>../index/zebraidx update records</tt>.
 107
 108 The indexing program will respond with a stream of control
 109 information, and when it completes, the database is ready. To start
 110 the Z39.50 server, type the command <tt>../index/zebrasrv</tt>.
 111
 112 Assuming that nothing unfortunate happened, you are now running a
 113 GILS-compliant Z39.50 server on port 9999 on your local machine
 114 (to learn how to run the server at a different port, and redirect the
 115 diagnostic output to a file, consult the section on <it/Running
 116 zebrasrv/ in the general documentation).
 117 The database containing the sample records is named <tt/Default/.
 118
 119 To test the server, you can use any compatible Z39.50 client. You can
 120 also use the simple demonstration client which is included with Zebra
 121 itself. To do this, start a new session on your machine (or put the
 122 server in the background). Change to the directory <tt>yaz/client</tt>
 123 under the main Zebra distribution directory. Now execute the command
 124 <tt>./client tcp:localhost:9999</tt>.
 125
 126 If all went well, the client will tell you that it has established an
 127 association with your test server. To test it, try out these commands:
 128
 129 <tscreen><verb>
 130 Z> find surficial
 131 Z> show 1
 132 </verb></tscreen>
 133
 134 The default retrieval syntax for the client is USMARC. To try other
 135 formats for the same record, try:
 136
 137 <tscreen><verb>
 138 Z> format sutrs
 139 Z> show 1
 140 Z> format grs-1
 141 Z> show 1
 142 Z> elements B
 143 Z> show 1
 144 </verb></tscreen>
 145
 146 You can learn more about the sample client by reading the <tt/README/
 147 file in the <tt/yaz/ directory.
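
The steps in this section can be summarised as follows, assuming that
you start each session in the top-level distribution directory. This is
only a sketch that collects the commands already described above; the
comment lines are annotations, not something you need to type.

<tscreen><verb>
# Session 1: index the sample records, then start the server
cd test
../index/zebraidx update records
../index/zebrasrv

# Session 2: connect with the demonstration client
cd yaz/client
./client tcp:localhost:9999
</verb></tscreen>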
 148
 149 <sect>The Records
 150
 151 <p>
 152 The GILS profile is only concerned with the communication that takes
 153 place between two compliant systems. It doesn't mandate how the client
 154 application should behave, and it doesn't tell you how you should
 155 maintain and process data at the server side. Specifically, while the
 156 profile specifies a number of different exchange formats for retrieval
 157 records, it does not dictate how the records are stored at the server.
 158
 159 For the purposes of this discussion, we will be using a simple,
 160 SGML-like representation of the GILS record structure. There is
 161 nothing magical or sacrosanct about this format, but it is easy to
 162 read and write, and because of its semblance of SGML and HTML, it is
 163 familiar to many people. If you would like to use a different, local
 164 representation for your GILS records, you can read the general Zebra
 165 documentation to learn how to establish a custom input filter for your
 166 particular record format.
 167
 168 In the SGML-like syntax, each record should begin with the tag
 169 <tt/&lt;gils&gt;/. This selects the GILS profile, and provides context
 170 for the content tags which follow. Similarly, each record should
 171 finish with the end-tag <tt/&etago;gils&gt;/.
 172
 173 The body of the record is made up of a sequence of tagged elements,
 174 reflecting the <it/abstract record syntax/ of the GILS profile. Some
 175 of these elements contain simple data, or text, while others contain
 176 more tagged elements - these are complex, or constructed, data
 177 elements. The tag names generally correspond to the tag names provided
 178 in the GILS profile. Capitalization is ignored in tag names, as are
 179 dashes (-). Hence, <tt/local-subject-index/ is equivalent to
 180 <tt/LocalSubjectIndex/ which is the same as <tt/LOCALSUBJECTINDEX/.
 181
 182 It is useful to look at the records in the <tt>test/records</tt> directory as
 183 examples of how SGML-formatted GILS records can look. Note that
 184 whitespace is generally ignored, so you can choose whatever layout of
 185 your records suits you best. Note also that in some cases, the records
 186 are generated automatically rather than typed in by a human.
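
To give an impression of the general shape, a very small record might
look something like the sketch below. The content is hypothetical and
is not taken from the sample set, and the use of an end-tag for each
inner element is an assumption based on ordinary SGML conventions; the
files in <tt>test/records</tt> remain the authoritative examples.

<tscreen><verb>
<gils>
<title>A hypothetical sample record&etago;title>
<abstract>
  A short description of the information resource.
&etago;abstract>
&etago;gils>
</verb></tscreen>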
 187
 188 <sect>The Zebra Configuration File
 189
 190 <p>
 191 As mentioned, the Zebra indexer and server always look for the file
 192 <tt/zebra.cfg/ in their current working directory (unless they are
 193 told to look for it elsewhere with the <tt/-c/ option). The example
 194 file in the <tt/test/ directory represents all but the bare minimum
 195 for such a file. While it may seem daunting at first, we find the
 196 following to be a powerful setup for a GILS-like database (everything
 197 preceded by (#) is ignored by the software):
 198
 199 <tscreen><verb>
 200 #
 201 # Sample configuration file for GILS database
 202 #
 203
 204 # Where are the configuration tables located?
 205 profilePath: /usr/local/lib/zebra
 206
 207 # Load attribute sets for searching
 208 attset bib1.att
 209 attset gils.att
 210
 211 # Records are identified by their path in the file system
 212 recordId: file
 213
 214 # Store information about records to allow deletion and updating
 215 storeKeys: 1
 216
 217 # Records are structured
 218 recordType: grs
 219
 220 # Where to store the indexes
 221 register: /datadisk/index:500M
 222
 223 # Where to store temporary data while merging with register
 224 shadow: /datadisk/shadow:500M
 225 </verb></tscreen>
 226
 227 If you like, you can paste this file straight into a <tt/zebra.cfg/
 228 file ready for your own use (with a bit of editing of the pathnames).
 229 In the following, we'll explain the individual settings. For the full
 230 story on the <tt/zebra.cfg/ file and the configuration options of
 231 Zebra, you should read the general documentation.
 232
 233 <descrip>
 234
 235 <tag/profilePath/ This field tells Zebra where to look for the
 236 configuration files. In the distribution, these files are located in
 237 the <tt/tab/ directory, but you may wish to put them someplace else
 238 for convenience. If necessary, you can provide multiple directory
 239 paths, separated by (:).
 240
 241 <tag/attset/ This field tells the Zebra server which attribute sets it
 242 should support for searching. You could get by with just loading the
 243 GILS set, but if you load BIB-1 as well, Zebra will support both sets
 244 for those GILS attributes that are inherited from BIB-1.
 245
 246 <tag/recordId/ The <tt/recordId: file/ setting tells Zebra that
 247 individual records should be identified by the physical files in which
 248 they are located. In this mode, your database will always (after an
 249 update operation) reflect the contents of the directory (or
 250 directories).
 251
 252 <tag/storeKeys/ This setting tells Zebra to store additional
 253 information about each record, to facilitate updating. In combination
 254 with the <tt/recordId: file/ setting, this is a very convenient
 255 maintenance option. If you maintain your records as individual files
 256 in a directory tree, you have only to run <tt/zebraidx/ with the
 257 top-level directory as an argument. If new files are added, they are
 258 entered into the database. If they are modified, the indexes are
 259 changed accordingly, and if they are deleted from the filesystem (or
 260 renamed), the indexes are also updated correctly, the next time you
 261 run <tt/zebraidx/.
 262
 263 <tag/recordType/ This setting selects the type of processing which is
 264 to take place when a record is accessed by the indexer or the Z39.50
 265 server. GRS stands for <it/Generic Record Syntax/, and signals that
 266 the records are structured.
 267
 268 <tag/register/ In the first test above, you may have noticed that
 269 <tt/zebraidx/ created a number of files in the working directory. Some
 270 of these files, which contain the indexing information for the
 271 database, can grow quite large, and it is sometimes useful to place
 272 them in a separate directory or file system. You should provide the
 273 path of the directory followed by a colon (:), followed by the maximum
 274 number of megabytes (M) or kilobytes (K) of disk space that Zebra is
 275 allowed to use in the given directory. If you specify more than one
 276 directory:size combination <it/on the same line/, Zebra will fill up
 277 each directory, one at a time. This feature is essential if your
 278 database is so large that the registers cannot fit into a single
 279 partition of your disk.
 280
 281 <tag/shadow/ The format of this setting is the same as for the one
 282 above. If you provide one or more directories for the &dquot;shadow
 283 system&dquot;, you enable the safe updating system of the Zebra
 284 indexer. When changes to the records are merged into the register
 285 files, the files are not changed immediately. Instead, the changes are
 286 written into separate files, or &dquot;shadow files&dquot;. At the end
 287 of the merging process, or in a separate operation, the changes are
 288 &dquot;committed&dquot;, and written into the register files
 289 themselves. This final step is carried out by the command <tt/zebraidx
 290 commit/ - the <tt/commit/ directive can also be given on the same
 291 command line as the <tt/update/ directive - at the end of the command
 292 line. The shadow file system can consume a lot of disk space -
 293 particularly in a large update operation which involves almost all of
 294 the index, but the benefits are substantial. Without it, if the system crashes
 295 during an update procedure, or the process is otherwise interrupted,
 296 the registers are left in an unknown state, and are effectively
 297 rendered useless - this can be unfortunate if the index is very large,
 298 but the use of the shadow system greatly reduces the risk of an index
 299 being damaged in this way. Further, when the shadow system is enabled,
 300 your clients may access the Zebra server without interruption
 301 throughout the update and commit procedures - Zebra will ensure that
 302 the parts of the register accessed by the server are always
 303 consistent.
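
As an illustration of the settings described above, a configuration
that searches two table directories and spreads the registers over two
disks might contain lines like the following. All of the paths are
hypothetical, and separating the directory:size pairs with a space is
an assumption; check the general documentation for the exact syntax.

<tscreen><verb>
# Two directories for configuration tables, separated by a colon
profilePath: /usr/local/lib/zebra:/home/zebra/tab

# Two register directories on the same line, filled one at a time
register: /disk1/index:500M /disk2/index:500M

# Shadow files enable the safe updating system
shadow: /disk3/shadow:500M
</verb></tscreen>

With a shadow directory configured, an update followed directly by a
commit could then be run from the <tt/test/ directory as:

<tscreen><verb>
../index/zebraidx update records commit
</verb></tscreen>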
 304 </descrip>
 305 <sect>Creating Your Own Database
 306
 307 <p>
 308
 309 </article>