doc/gils.sgml

   1 <!doctype linuxdoc system>
   2
   3 <article>
   4 <title>Serving GILS Records with Zebra
   5 <author><htmlurl url="http://www.indexdata.dk/" name="Index Data">, <tt><htmlurl url="mailto:info@index.ping.dk" name="info@index.ping.dk"></>
   6 <date>$Revision: 1.5 $
   7 <abstract>
   8 This document explains how to set up a simple database of Government
   9 Information Locator Records using the Zebra retrieval engine and
  10 Z39.50 server (version 1.0a6 or later).
  11 </abstract>
  12
  13 <sect>Introduction
  14
  15 <p>
  16 Zebra is a powerful and versatile information management system
  17 which allows you to construct arbitrarily complex record structures
  18 and manage
  19 efficient, robust databases.
  20
  21 Since the internal data modeling tools of Zebra are based on the
  22 Z39.50-1995 standard, the system is well-suited to support a complex
  23 database profile such as the one specified by the GILS profile.
  24 Because GILS is expected by many to play an important role in the
  25 evolving, global information society, and because the GILS profile has
  26 proven useful to a number of different applications outside of its
  27 dedicated domain, the public distribution of Zebra includes a
  28 configuration set-up which makes it a simple matter to establish a new
  29 GILS database.
  30
  31 This document, which is a supplement to the general documentation set
  32 for Zebra, explains in simple terms how you can easily set up your own
  33 GILS-compliant database service.
  34
  35 <sect>Retrieving and Unpacking Zebra
  36
  37 <p>
  38 The first step is to download the software. If you are using a WWW
  39 browser, you can point it at the Zebra distribution archive at
  40 <tt>&lt;<htmlurl url="http://www.indexdata.dk/zebra.html"
  41 name="http://www.indexdata.dk/zebra.html">&gt;</tt>, and follow the link named
  42 <it/Download the latest version of the software (xxx)/, where <it/xxx/
  43 is the current version of Zebra.
  44
  45 If you use an FTP client, you can use normal, anonymous FTP. Connect
  46 to the host <tt/ftp.indexdata.dk/, log in as <tt/ftp/, and give your
   47 e-mail address as the password. Then type <tt>cd index/yaz</tt>, and
  48 use the <tt/dir/ command to locate the current version of Zebra. The
  49 file will be named <tt/zebra-xxx/, where <tt/xxx/ is the current
  50 version of the software. Remember to use the <tt/bin/ command before
  51 using <tt/get/ to download the software.
  52
   53 Once the distribution archive has been downloaded, it must be
   54 decompressed. To do this, use the <tt/gunzip/ command (if your
   55 system doesn't have the <tt/gunzip/ program, you will need to acquire
   56 it separately). Finally, use the command <tt>tar xvf
  57 &lt;file&gt;</tt> to unpack the archive.
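
     As a sketch, assuming the downloaded archive is a gzip-compressed
     tar file named <tt/zebra-xxx.tar.gz/ (the suffix is an assumption;
     substitute the name of the file you actually retrieved), the two
     steps look like this:

     <tscreen><verb>
     $ gunzip zebra-xxx.tar.gz
     $ tar xvf zebra-xxx.tar
     </verb></tscreen>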
  58
  59 If you downloaded the source version of the software (this is the only
  60 option today, although we expect to release binary versions for Linux,
  61 SunOS, and Digital Unix shortly), you will have to compile Zebra
  62 before you can use it.
  63
   64 On many of the major versions of the Unix operating system, compiling
  65 Zebra is a simple matter of typing <tt/make/ in the top-level
  66 distribution directory (this is the directory that was created when
  67 you executed <tt/tar/). Normally, Zebra compiles cleanly at least on
  68 Linux, Digital Unix (DEC OSF/1), and IBM AIX. On certain platforms
  69 (such as SunOS), you will need to edit the top-level <tt/Makefile/ to
  70 set the <tt/ELIBS/ variable to include the Berkeley Socket
  71 Libraries. On other Unix platforms, you <it/may/ need to modify
  72 Makefiles or header files, but in general, we have found Zebra to be
   73 easily portable across modern Unix versions. You do need an ANSI C
   74 compiler (you'll see a long list of syntax errors during the
   75 compile if your default compiler is not ANSI C), but again, this is
   76 standard on most modern Unix systems. If you don't have one, the
   77 GNU C compiler is freely available for many systems.
  78
  79 <sect>The First GILS Database
  80
  81 <p>
  82 Having successfully acquired the software, it's time to try it out.
  83 The directory <tt/test/ under the main distribution directory contains
  84 a small sample database of GILS records.
  85
  86 <it>NOTE: The records included in the distribution are part of a
  87 sample set provided by the US Geological Survey, as a service to GILS
  88 implementors. They are included for testing and demonstrating the
   89 software, and neither the USGS, Index Data, nor anyone else should be
  90 held responsible for their contents.</it>
  91
  92 If you <tt/cd/ to the <tt/test/ directory, the first thing to notice
  93 is the file <tt/zebra.cfg/. There has to be a file like this present
  94 whenever you run Zebra - it establishes various settings and defaults,
  95 and we'll return to its contents below (a detailed
  96 description is found in the general Zebra documentation file).
  97
  98 The subdirectory <tt/records/ contains the sample records. We'll get
  99 back to them, too.
 100
 101 The first order of business is to index the sample records, and create
 102 the access files required by the Z39.50 server. To do this, position
 103 yourself in the <tt/test/ directory, and type the command
 104
 105 <tscreen><verb>
 106 $ ../index/zebraidx update records
 107 </verb></tscreen>
 108
 109 The indexing program will respond with a stream of control
 110 information, and when it completes, the database is ready. To start
 111 the Z39.50 server, type the command <tt>../index/zebrasrv</tt>.
 112
 113 Assuming that nothing unfortunate happened, you are now running a
  114 GILS-compliant Z39.50 server on port 9999 on your local machine
  115 (to learn how to run the server on a different port, and redirect the
 116 diagnostic output to a file, consult the section on <it/Running
 117 zebrasrv/ in the general documentation).
 118 The database containing the sample records is named <tt/Default/.
 119
 120 To test the server, you can use any compatible Z39.50 client. You can
 121 also use the simple demonstration client which is included with Zebra
 122 itself. To do this, start a new session on your machine (or put the
 123 server in the background). Change to the directory <tt>yaz/client</tt>
 124 under the main Zebra distribution directory. Now execute the command
 125
 126 <tscreen><verb>
 127 $ ./client tcp:localhost:9999
 128 </verb></tscreen>
 129
 130 If all went well, the client will tell you that it has established an
 131 association with your test server. To test it, try out these commands:
 132
 133 <tscreen><verb>
 134 Z> find surficial
 135 Z> show 1
 136 </verb></tscreen>
 137
 138 The default retrieval syntax for the client is USMARC. To try other
 139 formats for the same record, try:
 140
 141 <tscreen><verb>
  142 Z> format sutrs
  143 Z> show 1
  144 Z> format grs-1
  145 Z> show 1
  146 Z> elements B
  147 Z> show 1
 148 </verb></tscreen>
 149
 150 You can learn more about the sample client by reading the <tt/README/
 151 file in the <tt/yaz/ directory.
 152
 153 <sect>The Records
 154
 155 <p>
 156 The GILS profile is only concerned with the communication that takes
 157 place between two compliant systems. It doesn't mandate how the client
 158 application should behave, and it doesn't tell you how you should
  159 maintain and process data at the server side. Specifically, while the
  160 profile specifies a number of different exchange formats for retrieval
  161 records, it does not prescribe how you store them locally.
 162
 163 For the purposes of this discussion, we will be using a simple,
 164 SGML-like representation of the GILS record structure. There is
 165 nothing magical or sacrosanct about this format, but it is easy to
 166 read and write, and because of its semblance of SGML and HTML, it is
 167 familiar to many people. If you would like to use a different, local
 168 representation for your GILS records, you can read the general Zebra
 169 documentation to learn how to establish a custom input filter for your
 170 particular record format.
 171
 172 In the SGML-like syntax, each record should begin with the tag
 173 <tt/&lt;gils&gt;/. This selects the GILS profile, and provides context
 174 for the content tags which follow. Similarly, each record should
 175 finish with the end-tag <tt/&etago;gils&gt;/.
 176
  177 The body of the record is made up of a sequence of tagged elements,
 178 reflecting the <it/abstract record syntax/ of the GILS profile. Some
 179 of these elements contain simple data, or text, while others contain
 180 more tagged elements - these are complex, or constructed, data
 181 elements. The tag names generally correspond to the tag names provided
 182 in the GILS profile. Capitalization is ignored in tag names, as are
 183 dashes (-). Hence, <tt/local-subject-index/ is equivalent to
 184 <tt/LocalSubjectIndex/ which is the same as <tt/LOCALSUBJECTINDEX/.
 185
  186 It is useful to look at the records in the <tt>test/records</tt>
  187 directory as examples of how SGML-formatted GILS records can look.
  188 Note that whitespace is generally ignored, so you can choose whatever
  189 record layout suits you best.
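
     To make the structure concrete, here is a minimal sketch of a
     record in this syntax. The tag names are taken from the GILS
     profile, but the content is invented for the example, and the
     selection of elements is only an assumption about what a small
     record might contain; the files in <tt>test/records</tt> remain
     the authoritative examples.

     <tscreen><verb>
     <gils>
     <Title> Sample Locator Record &etago;Title>
     <Originator> Example Agency &etago;Originator>
     <Local-Subject-Index> geology; maps &etago;Local-Subject-Index>
     <Abstract>
     A short description of the information resource.
     &etago;Abstract>
     &etago;gils>
     </verb></tscreen>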
 190
 191 <sect>The Zebra Configuration File
 192
 193 <p>
 194 As mentioned, the Zebra indexer and server always look for the file
 195 <tt/zebra.cfg/ in their current working directory (unless they are
 196 told to look for it elsewhere with the <tt/-c/ option). The example
  197 file in the <tt/test/ directory contains little more than the bare
  198 minimum for such a file. We find the
  199 following to be a powerful setup for a GILS-like database (anything
  200 following a &num; is ignored by the software):
 201
 202 <tscreen><verb>
 203 #
 204 # Sample configuration file for GILS database
 205 #
 206
 207 # Where are the configuration files located?
 208 profilePath: /usr/local/lib/zebra
 209
 210 # Load attribute sets for searching
 211 attset bib1.att
 212 attset gils.att
 213
 214 # Records are identified by their path in the file system
 215 recordId: file
 216
 217 # Store information about records to allow deletion and updating
 218 storeKeys: 1
 219
 220 # Records are structured
 221 recordType: grs
 222
 223 # Where to store the indexes
 224 register: /datadisk/index:500M
 225
 226 # Where to store temporary data while merging with register
 227 shadow: /datadisk/shadow:500M
 228 </verb></tscreen>
 229
 230 If you like, you can paste this file straight into a <tt/zebra.cfg/
 231 file ready for your own use (with a bit of editing of the pathnames).
 232 In the following, we'll explain the individual settings. For the full
 233 story on the <tt/zebra.cfg/ file and the configuration options of
 234 Zebra, you should read the general documentation.
 235
 236 <descrip>
 237
 238 <tag/profilePath/ This field tells Zebra where to look for the
 239 configuration files. In the distribution, these files are located in
 240 the <tt/tab/ directory, but you may wish to put them someplace else
 241 for convenience. If necessary, you can provide multiple directory
  242 paths, separated by colons (:).
 243
 244 <tag/attset/ This field tells the Zebra server which attribute sets it
 245 should support for searching. You could get by with just loading the
 246 GILS set, but if you load BIB-1 as well, Zebra will support both sets
 247 for those GILS attributes that are inherited from BIB-1.
 248
 249 <tag/recordId/ The <tt/recordId: file/ setting tells Zebra that
 250 individual records should be identified by the physical files in which
 251 they are located. In this mode, your database will always (after an
 252 update operation) reflect the contents of the directory (or
 253 directories).
 254
 255 <tag/storeKeys/ This setting tells Zebra to store additional
 256 information about each record, to facilitate updating. In combination
 257 with the <tt/recordId: file/ setting, this is a very convenient
 258 maintenance option. If you maintain your records as individual files
 259 in a directory tree, you have only to run <tt/zebraidx/ with the
 260 top-level directory as an argument. If new files are added, they are
 261 entered into the database. If they are modified, the indexes are
 262 changed accordingly, and if they are deleted from the filesystem (or
 263 renamed), the indexes are also updated correctly, the next time you
 264 run <tt/zebraidx/.
 265
 266 <tag/recordType/ This setting selects the type of processing which is
 267 to take place when a record is accessed by the indexer or the Z39.50
 268 server. GRS stands for <it/Generic Record Syntax/, and signals that
 269 the records are structured.
 270
  271 <tag/register/ In the first test above, you may have noticed that
  272 <tt/zebraidx/ created a number of files in the working directory. Some
  273 of these files, which contain the indexing information for the
  274 database, can grow quite large, and it is sometimes useful to place
  275 them in a separate directory or file system. You should provide the
  276 path of the directory followed by a colon (:), followed by the maximum
  277 number of megabytes (M) or kilobytes (K) of disk space that Zebra is
 278 allowed to use in the given directory. If you specify more than one
 279 directory:size combination <it/on the same line/, Zebra will fill up
 280 each directory from left to right. This feature is essential if your
 281 database is so large that the registers cannot fit into a single
 282 partition of your disk.
 283
 284 <tag/shadow/ The format of this setting is the same as for the one
  285 above. If you provide one or more directories for the &dquot;shadow
  286 system&dquot;, you enable the safe updating system of the Zebra
  287 indexer. When changes to the records are merged into the register
  288 files, the files are not changed immediately. Instead, the changes are
  289 written into separate files, or &dquot;shadow files&dquot;. At the end
  290 of the merging process, or in a separate operation, the changes are
  291 &dquot;committed&dquot;, and written into the register files
  292 themselves. This final step is carried out by the command <tt/zebraidx
  293 commit/; the <tt/commit/ directive can also be given at the end of the
  294 same command line as the <tt/update/ directive.
  295 The shadow file system can consume a lot of disk space,
  296 particularly in a large update operation which involves almost all of
  297 the index, but the benefits are substantial. Without it, if the system
  298 crashes during an update, or the process is otherwise interrupted,
  299 the registers are left in an unknown state, and are effectively
  300 rendered useless. This can be unfortunate if the index is very large,
  301 and the use of the shadow system greatly reduces the risk of an index
  302 being damaged in this way. Further, when the shadow system is enabled,
  303 your clients may access the Zebra server without interruption
  304 throughout the update and commit procedures: Zebra will ensure that
 305 the parts of the register accessed by the server are always
 306 consistent.
 307
 308 </descrip>
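
     To illustrate the multiple-directory forms described above, here
     is a sketch of a <tt/zebra.cfg/ fragment. All directory names and
     sizes are made up for the example, and we assume that the
     directory:size pairs on a line are separated by blanks; consult
     the general documentation for the exact syntax.

     <tscreen><verb>
     # Two search paths for the configuration files (hypothetical)
     profilePath: /usr/local/lib/zebra:/home/gils/tab

     # Fill /disk1/index first, then continue in /disk2/index
     register: /disk1/index:500M /disk2/index:1000M

     # Keep the shadow files on a separate partition
     shadow: /scratch/shadow:500M
     </verb></tscreen>

     With a shadow directory configured, the update and the commit can
     be run as two separate commands, here shown for a directory named
     <tt/records/ as in the test setup:

     <tscreen><verb>
     $ zebraidx update records
     $ zebraidx commit
     </verb></tscreen>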
 309
 310 <sect>Creating Your Own Database
 311
 312 <p>
 313 Whenever we create a new database with Zebra, we find it useful to
 314 first set up a new, empty directory. This directory will contain the
 315 configuration file, the lock files maintained by Zebra (unless you
 316 specify a different location for these), and any logs of updates and
 317 server runs that you may wish to keep around. The first thing to do is
 318 set up the <tt/zebra.cfg/ file for your database. You can copy the one
 319 from the <tt/test/ directory, or you can create a new one using the
 320 example settings described in the previous section. Once you get your
 321 server up and running, you may want to read the description of the
 322 <tt/zebra.cfg/ file in the general documentation, to set up additional
 323 defaults for database names, etc.
 324
 325 If you copy one of these files, you should be careful to update the
 326 pathnames to reflect the setup of your own database. In particular, if
 327 you want to specify one or more directories for the register files
 328 and/or the shadow files, you should make sure that these directories
 329 exist and are accessible to the user ID which will run the Zebra
 330 processes.
 331
 332 You need to make sure that your GILS records are available, too. For
  333 small to medium-sized (say, fewer than 100,000 records) databases, it
 334 is sometimes preferable to maintain the records as individual files
 335 somewhere in the file system. Zebra will, by default, access these
 336 files directly whenever the user requests to see a specific record.
 337 However, you can set up Zebra to maintain the database records in
 338 other ways, too. Consult the general documentation for details.
 339
 340 Finally, you need to run <tt/zebraidx/ to create the index files, and
  341 start up the server, <tt/zebrasrv/ (the server can be run from
  342 <tt/inetd/ if required), and you are in business.
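
     The steps in this section can be sketched as the following
     session. The database directory (<tt>/home/gils/db</tt>), the
     record directory (<tt>/home/gils/records</tt>) and the location of
     the unpacked Zebra distribution (<tt>/usr/local/src/zebra</tt>)
     are assumptions made for the example; substitute your own paths.
     If you enable the shadow system in your <tt/zebra.cfg/, remember
     to commit the update as well.

     <tscreen><verb>
     $ mkdir /home/gils/db
     $ cd /home/gils/db
     $ cp /usr/local/src/zebra/test/zebra.cfg .
     $ vi zebra.cfg
     $ /usr/local/src/zebra/index/zebraidx update /home/gils/records
     $ /usr/local/src/zebra/index/zebrasrv
     </verb></tscreen>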
 343
 344 To access the data, you can use a dedicated Z39.50 client, or you can
 345 set up a WWW/Z39.50 gateway to allow common WWW browsers to search
 346 your data. CNIDR's
 347 Isite
 348 package includes a good, free gateway that you can experiment with.
 349
 350 </article>