X-Git-Url: http://git.indexdata.com/?a=blobdiff_plain;f=doc%2Fbook.xml;h=41dfd01d725fb493232d7daeed0fa49617a04014;hb=5b556e3b9d95a6e249ffd66d38da5c33f9b00d1d;hp=d6e6f8d355072044c5dcdbc43f20e069925f8f38;hpb=8fc15e69384e20bb9c305a84683c36311bdec9f3;p=metaproxy-moved-to-github.git diff --git a/doc/book.xml b/doc/book.xml index d6e6f8d..41dfd01 100644 --- a/doc/book.xml +++ b/doc/book.xml @@ -1,4 +1,24 @@ - + + + + + + %common; + + + + +]> + + Metaproxy - User's Guide and Reference @@ -9,16 +29,20 @@ 2006 - Index Data + Index Data ApS Metaproxy is a universal router, proxy and encapsulated metasearcher for information retrieval protocols. It accepts, processes, interprets and redirects requests from IR clients using - standard protocols such as ANSI/NISO Z39.50 (and in the future SRU - and SRW), as well as functioning as a limited - HTTP server. Metaproxy is configured by an XML file which + standard protocols such as + ANSI/NISO Z39.50 + (and in the future SRU + and SRW), as + well as functioning as a limited + HTTP server. + Metaproxy is configured by an XML file which specifies how the software should function in terms of routes that the request packets can take through the proxy, each step on a route being an instantiation of a filter. Filters come in many @@ -33,6 +57,16 @@ should not at this stage redistribute the code without explicit written permission from the copyright holders, Index Data ApS. + + + + + + + + + + @@ -40,39 +74,55 @@ Introduction - - Metaproxy - is a standalone program that acts as a universal router, proxy and - encapsulated metasearcher for information retrieval protocols such - as Z39.50, and in the future SRU and SRW. To clients, it acts as a - server of these - protocols: it can be searched, records can be retrieved from it, - etc. To servers, it acts as a client: it searches in them, - retrieves records from them, etc. 
it satisfies its clients' - requests by transforming them, multiplexing them, forwarding them - on to zero or more servers, merging the results, transforming - them, and delivering them back to the client. In addition, it - acts as a simple HTTP server; support for further protocols can be - added in a modular fashion, through the creation of new filters. - - - Anything goes in! - Anything goes out! - Cold bananas, fish, pyjamas, - Mutton, beef and trout! + + Metaproxy + is a standalone program that acts as a universal router, proxy and + encapsulated metasearcher for information retrieval protocols such + as Z39.50, and in the future + SRU and SRW. + To clients, it acts as a server of these protocols: it can be searched, + records can be retrieved from it, etc. + To servers, it acts as a client: it searches in them, + retrieves records from them, etc. It satisfies its clients' + requests by transforming them, multiplexing them, forwarding them + on to zero or more servers, merging the results, transforming + them, and delivering them back to the client. In addition, it + acts as a simple HTTP server; support + for further protocols can be added in a modular fashion, through the + creation of new filters. + + + Anything goes in! + Anything goes out! + Fish, bananas, cold pyjamas, + Mutton, beef and trout! - attributed to Cole Porter. - - - Metaproxy is a more capable alternative to - YAZ Proxy, - being more powerful, flexible, configurable and extensible. Among - its many advantages over the older, more pedestrian work are - support for multiplexing (encapsulated metasearching), routing by - database name, authentication and authorisation and serving local - files via HTTP. Equally significant, its modular architecture - facilitites the creation of pluggable modules implementing further - functionality. - + + + Metaproxy is a more capable alternative to + YAZ Proxy, + being more powerful, flexible, configurable and extensible. 
Among + its many advantages over the older, more pedestrian work are + support for multiplexing (encapsulated metasearching), routing by + database name, authentication and authorisation and serving local + files via HTTP. Equally significant, its modular architecture + facilitates the creation of pluggable modules implementing further + functionality. + + + This manual will briefly describe Metaproxy's licensing situation + before giving an overview of its architecture, then discussing the + key concept of a filter in some depth and giving an overview of + the various filter types, then discussing the configuration file + format. After this come several optional chapters which may be + freely skipped: a detailed discussion of virtual databases and + multi-database searching, some notes on writing extensions + (additional filter types) and a high-level description of the + source code. Finally comes the reference guide, which contains + instructions for invoking the metaproxy + program, and detailed information on each type of filter, + including examples. + @@ -81,8 +131,8 @@ The Metaproxy Licence - No decision has yet been made on the terms under which - Metaproxy will be distributed. + No decision has yet been made on the terms under which + Metaproxy will be distributed. It is possible that, unlike other Index Data products, metaproxy may not be released under a @@ -95,8 +145,316 @@ + + Installation + + Metaproxy depends on the following tools/libraries: + + YAZ++ + + + This is a C++ library based on YAZ. + + + + Libxslt + + This is an XSLT processor - based on + Libxml2. Both Libxml2 and + Libxslt must be installed with the development components + (header files, etc.) as well as the run-time libraries. + + + + Boost + + + The popular C++ library. Initial versions of Metaproxy + were built with 1.33.0. Version 1.33.1 works too. + + + + + + + In order to compile Metaproxy, a modern C++ compiler is + required. 
Boost, in particular, requires the C++ compiler + to support the newest language features. Refer to Boost + Compiler Status + for more information. + + + We have successfully built Metaproxy using the compilers + GCC version 4.0 and + Microsoft Visual Studio 2003/2005. + + + 
+ Installation on Unix (from Source) + + Here is a quick step-by-step guide on how to compile all the + tools that Metaproxy uses. Most systems provide at least some of the required + tools as binary packages. If, for example, Libxml2/libxslt are already + installed as development packages, use those (and omit compilation). + + + + Libxml2/libxslt: + + + gunzip -c libxml2-version.tar.gz|tar xf - + cd libxml2-version + ./configure + make + su + make install + + + gunzip -c libxslt-version.tar.gz|tar xf - + cd libxslt-version + ./configure + make + su + make install + + + YAZ/YAZ++: + + + gunzip -c yaz-version.tar.gz|tar xf - + cd yaz-version + ./configure + make + su + make install + + + gunzip -c yazpp-version.tar.gz|tar xf - + cd yazpp-version + ./configure + make + su + make install + + + Boost: + + + gunzip -c boost-version.tar.gz|tar xf - + cd boost-version + ./configure + make + su + make install + + + Metaproxy: + + + gunzip -c metaproxy-version.tar.gz|tar xf - + cd metaproxy-version + ./configure + make + su + make install + + 
+ +
+ Installation on Debian GNU/Linux + + All dependencies for Metaproxy are available as + Debian + packages for the sarge (stable in 2005) and etch (testing in 2005) + distributions. + + + The procedure for Debian-based systems, such as + Ubuntu, is probably similar. + + + There is currently no official Debian package for YAZ++, + and the Debian package for YAZ is probably too old. + Update the /etc/apt/sources.list + to include the Index Data repository. + See YAZ' Download Debian + for more information. + + + apt-get install libxslt1-dev + apt-get install libyazpp-dev + apt-get install libboost-dev + apt-get install libboost-thread-dev + apt-get install libboost-date-time-dev + apt-get install libboost-program-options-dev + apt-get install libboost-test-dev + + + With these packages installed, the usual configure + make + procedure can be used for Metaproxy as outlined in + Installation on Unix (from Source). + 
+ +
+ Installation on Windows + + Metaproxy can be compiled with Microsoft + Visual Studio. + Versions 2003 (C 7.1) and 2005 (C 8.0) are known to work. + 
+ Boost + + Get Boost from its home page. + You also need Boost Jam (an alternative to make). + That's also available from the Boost home page. + The files to be downloaded are called something like: + boost_1_33-1.exe + and + boost-jam-3.1.12-1-ntx86.zip. + Unpack Boost Jam first. Put bjam.exe + in your system path. Open a command prompt and ensure + that it can be found automatically. If not, check the PATH. + The Boost .exe is a self-extracting archive containing the + complete source for Boost. Compile that source with + Boost Jam. + The compilation takes a while. + For Visual Studio 2003, use + + bjam "-sTOOLS=vc-7_1" + + Here vc-7_1 refers to a "Toolset" (compiler system). + For Visual Studio 2005, use + + bjam "-sTOOLS=vc-8_0" + + To install the libraries in a common place, use + + bjam "-sTOOLS=vc-7_1" install + + (or vc-8_0 for VS 2005). + + + By default, the Boost build process installs the resulting + libraries and header files in + \boost\lib, \boost\include. + + + For more information about installing Boost refer to the + getting started + pages. + 
+
+ Libxslt + + Libxslt can be downloaded + for Windows from + here. + + + Libxslt has other dependencies, but these can all be downloaded + from the same site. Get the following: + iconv, zlib, libxml2, libxslt. + 
+
+ YAZ + + YAZ can be downloaded + for Windows from + here. + +
+ +
+ YAZ++ + + Get YAZ++ as well. + Version 1.0 or later is required. For now get it from + Index Data's + Snapshot area. + + + YAZ++ includes NMAKE makefiles, similar to those found in the + YAZ package. + +
+ +
+ Metaproxy + + Metaproxy is shipped with NMAKE makefiles as well - similar + to those found in the YAZ++/YAZ packages. Adjust the makefile + to point to the proper locations of Boost, Libxslt, Libxml2, + zlib, iconv, yaz and yazpp. + + + + DEBUG + + If set to 1, the software is + compiled with debugging libraries (code generation is + multi-threaded debug DLL). + If set to 0, the software is compiled with release libraries + (code generation is multi-threaded DLL). + + + + + BOOST + + + Boost install location. + + + + + + BOOST_VERSION + + + Boost version (replace . with _). + + + + + + BOOST_TOOLSET + + + Boost toolset. + + + + + + LIBXSLT_DIR, + LIBXML2_DIR .. + + + Specify the locations of Libxslt, libiconv and + libxml2. + + + + + + + + After successful compilation you'll find + metaproxy.exe in the + bin directory. + 
+ + +
+
+ The Metaproxy Architecture @@ -337,7 +695,7 @@ <literal>multi</literal> (mp::filter::Multi) - Performs multicast searching. + Performs multi-database searching. See the extended discussion of virtual databases and multi-database searching below. @@ -584,12 +942,11 @@ file (included in the distribution as metaproxy/etc/config0.xml). This file defines a very simple configuration that simply proxies - to whatever backend server the client requests, but logs each + to whatever back-end server the client requests, but logs each request and response. This can be useful for debugging complex client-server dialogues. - + @@ -624,7 +981,7 @@ a log filter that emits a message for each request; they are then fed into a z3950_client filter, which forwards the requests to the client-specified - backend Z39.509 server. When the response arrives, it is handed + back-end Z39.50 server. When the response arrives, it is handed back to the log filter, which emits another message; and then to the front-end filter, which returns the response to the client. @@ -644,18 +1001,365 @@ Two of Metaproxy's filters are concerned with multiple-database operations. Of these, virt_db can work alone to control the routing of searches to one of a number of servers, - while multi can work with the output of - virt_db to perform multicast searching, merging - the results into a unified result-set. The interaction between - these two filters is necessarily complex, reflecting the real - complexity of multicast searching in a protocol such as Z39.50 - that separates initialisation from searching, with the database to - search known only during the latter operation. + while multi can work together with + virt_db to perform multi-database searching, merging + the results into a unified result-set - ``metasearch in a box''. 
+ + + The interaction between + these two filters is necessarily complex: it reflects the real, + irreducible complexity of multi-database searching in a protocol such + as Z39.50 that separates initialisation from searching, and in + which the database to be searched is not known at initialisation + time. + + + It's possible to use these filters without understanding the + details of their functioning and the interaction between them; the + next two sections of this chapter are ``HOWTO'' guides for doing + just that. However, debugging complex configurations will require + a deeper understanding, which the last two sections of this + chapter attempt to provide. + + + + 
+ Virtual databases with the <literal>virt_db</literal> filter + + Working alone, the purpose of the + virt_db + filter is to route search requests to one of a selection of + back-end databases. In this way, a single Z39.50 endpoint + (running Metaproxy) can provide access to several different + underlying services, including those that would otherwise be + inaccessible due to firewalls. In many useful configurations, the + back-end databases are local to the Metaproxy installation, but + the software does not enforce this, and any valid Z39.50 servers + may be used as back-ends. + + + For example, a virt_db + filter could be set up so that searches in the virtual database + ``lc'' are forwarded to the Library of Congress bibliographic + catalogue server, and searches in the virtual database ``marc'' + are forwarded to the toy database of MARC records that Index Data + hosts for testing purposes. A virt_db + configuration to make this switch would look like this: + + <virtual> + <database>lc</database> + <target>z3950.loc.gov:7090/voyager</target> + </virtual> + <virtual> + <database>marc</database> + <target>indexdata.dk/marc</target> + </virtual> +]]> + As well as being useful in its own right, this filter also provides + the foundation for multi-database searching.
+ + +
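As a rough illustration of the routing that virt_db performs, the mapping in the configuration above can be modelled as a lookup table from virtual database name to back-end targets. This is a minimal Python sketch and not Metaproxy code: the names VIRTUAL_DATABASES and resolve_targets are hypothetical, and raising an error for an unknown database name is an assumption, not a description of the diagnostic Metaproxy actually returns.

```python
# Illustrative model only; not Metaproxy source code.
from typing import Dict, List

# Hypothetical in-memory form of the virt_db configuration shown above:
# virtual database name -> list of back-end targets.
VIRTUAL_DATABASES: Dict[str, List[str]] = {
    "lc": ["z3950.loc.gov:7090/voyager"],
    "marc": ["indexdata.dk/marc"],
}


def resolve_targets(database: str,
                    table: Dict[str, List[str]] = VIRTUAL_DATABASES) -> List[str]:
    """Return a copy of the back-end targets for a virtual database name."""
    try:
        return list(table[database])
    except KeyError:
        # Assumption: an unknown virtual database is treated as an error.
        raise LookupError(f"unknown virtual database: {database!r}") from None


if __name__ == "__main__":
    print(resolve_targets("lc"))
```

A virtual database with more than one target would simply map to a longer list in this model, which is the situation the multi filter is designed to handle.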
+ Multi-database search with the <literal>multi</literal> filter + + To arrange for Metaproxy to broadcast searches to multiple back-end + servers, the configuration needs to include two components: a + virt_db + filter that specifies multiple + <target> + elements, and a subsequent + multi + filter. Here, for example, is a complete configuration that + broadcasts searches to both the Library of Congress catalogue and + Index Data's tiny testing database of MARC records: + + + + + + + + 10 + @:9000 + + + + lc + z3950.loc.gov:7090/voyager + + + marc + indexdata.dk/marc + + + all + z3950.loc.gov:7090/voyager + indexdata.dk/marc + + + + + 30 + + + +]]> + + (Using a + virt_db + filter that specifies multiple + <target> + elements but without a subsequent + multi + filter yields surprising and undesirable results, as will be + described below. Don't do that.) + + + Metaproxy can be invoked with this configuration as follows: + + ../src/metaproxy --config config-simple-multi.xml + + And thereafter, Z39.50 clients can connect to the running server + (on port 9000, as specified in the configuration) and search in + any of the databases + lc (the Library of Congress catalogue), + marc (Index Data's test database of MARC records) + or + all (both of these). As an example, a session + using the YAZ command-line client yaz-client is + here included (edited for brevity and clarity): + + base lc +Z> find computer +Search was a success. +Number of hits: 10000, setno 1 +Elapsed: 5.521070 +Z> base marc +Z> find computer +Search was a success. +Number of hits: 10, setno 3 +Elapsed: 0.060187 +Z> base all +Z> find computer +Search was a success. +Number of hits: 10010, setno 4 +Elapsed: 2.237648 +Z> show 1 +[marc]Record type: USmarc +001 11224466 +003 DLC +005 00000000000000.0 +008 910710c19910701nju 00010 eng +010 $a 11224466 +040 $a DLC $c DLC +050 00 $a 123-xyz +100 10 $a Jack Collins +245 10 $a How to program a computer +260 1 $a Penguin +263 $a 8710 +300 $a p. cm. 
+Elapsed: 0.119612 +Z> show 2 +[VOYAGER]Record type: USmarc +001 13339105 +005 20041229102447.0 +008 030910s2004 caua 000 0 eng +035 $a (DLC) 2003112666 +906 $a 7 $b cbc $c orignew $d 4 $e epcn $f 20 $g y-gencatlg +925 0 $a acquire $b 1 shelf copy $x policy default +955 $a pc10 2003-09-10 $a pv12 2004-06-23 to SSCD; $h sj05 2004-11-30 $e sj05 2004-11-30 to Shelf. +010 $a 2003112666 +020 $a 0761542892 +040 $a DLC $c DLC $d DLC +050 00 $a MLCM 2004/03312 (G) +245 10 $a 007, everything or nothing : $b Prima's official strategy guide / $c created by Kaizen Media Group. +246 3 $a Double-O-seven, everything or nothing +246 30 $a Prima's official strategy guide +260 $a Roseville, CA : $b Prima Games, $c c2004. +300 $a 161 p. : $b col. ill. ; $c 28 cm. +500 $a "Platforms: Nintendo GameCube, Macintosh, PC, PlayStation 2 computer entertainment system, Xbox"--P. [4] of cover. +650 0 $a Video games. +710 2 $a Kaizen Media Group. +856 42 $3 Publisher description $u http://www.loc.gov/catdir/description/random052/2003112666.html +Elapsed: 0.150623 +Z> +]]> + + As can be seen, the first record in the result set is from the + Index Data test database, and the second from the Library of + Congress database. The result-set continues alternating records + round-robin style until the point where one of the databases' + records are exhausted. + + + This example uses only two back-end databases; more may be used. + There is no limitation imposed on the number of databases that may + be metasearched in this way: issues of resource usage and + administrative complexity dictate the practical limits. + + + What happens when one of the databases doesn't respond? By default, + the entire multi-database search fails, and the appropriate + diagnostic is returned to the client. 
This is usually appropriate + during development, when technicians need maximum information, but + can be inconvenient in deployment, when users typically don't want + to be bothered with problems of this kind and prefer just to get + the records from the databases that are available. To obtain this + latter behaviour add an empty + <hideunavailable> + element inside the + multi filter: + + + + ]]> + + Under this regime, an error is reported to the client only if + all the databases in a multi-database search + are unavailable. + +
+ + +
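The merging behaviour described in this section can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not Metaproxy's implementation: the names total_hits, round_robin and SearchFailed are hypothetical, a back-end that does not respond is represented by None, and the sketch assumes that once one result set is exhausted the remaining records simply follow in order (the text above does not say what happens at that point).

```python
# Illustrative model only; not Metaproxy source code.
from itertools import zip_longest
from typing import Dict, List, Optional


class SearchFailed(Exception):
    """Stands in for the diagnostic returned to the client."""


def total_hits(hits: Dict[str, Optional[int]],
               hide_unavailable: bool = False) -> int:
    """Sum per-target hit counts; None marks a target that did not respond."""
    available = [count for count in hits.values() if count is not None]
    if not available:
        # Even with hide_unavailable, failure of every target is an error.
        raise SearchFailed("no database responded")
    if len(available) < len(hits) and not hide_unavailable:
        # Default behaviour: one unavailable target fails the whole search.
        raise SearchFailed("at least one database did not respond")
    return sum(available)


def round_robin(result_sets: List[List[str]]) -> List[str]:
    """Interleave records from each result set, one record per set per turn."""
    missing = object()  # sentinel marking an exhausted result set
    merged: List[str] = []
    for row in zip_longest(*result_sets, fillvalue=missing):
        merged.extend(record for record in row if record is not missing)
    return merged


if __name__ == "__main__":
    print(total_hits({"lc": 10000, "marc": 10}))
    print(round_robin([["marc-1", "marc-2"], ["lc-1", "lc-2", "lc-3"]]))
```

With the hit counts from the session above (10000 for lc and 10 for marc), the model gives the same combined count of 10010 that the all database reported.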
+ What's going on? + + Lark's vomit + + This section goes into a level of technical detail that is + probably not necessary in order to configure and use Metaproxy. + It is provided only for those who like to know how things work. + You should feel free to skip on to the next section if this one + doesn't seem like fun. + + + + Hold on tight - this may get a little hairy. + + + In the general course of things, a Z39.50 Init request may carry + with it an otherInfo packet of type VAL_PROXY, + whose value indicates the address of a Z39.50 server to which the + ultimate connection is to be made. (This otherInfo packet is + supported by YAZ-based Z39.50 clients and servers, but has not yet + been ratified by the Maintenance Agency and so is not widely used + in non-Index Data software. We're working on it.) + The VAL_PROXY packet functions + analogously to the absoluteURI-style Request-URI used with the GET + method when a web browser asks a proxy to forward its request: see + the + Request-URI + section of + the HTTP 1.1 specification. + + + Within Metaproxy, Search requests that are part of the same + session as an Init request that carries a + VAL_PROXY otherInfo are also annotated with the + same information. The role of the virt_db + filter is to rewrite this otherInfo packet dependent on the + virtual database that the client wants to search. + + + When Metaproxy receives a Z39.50 Init request from a client, it + doesn't immediately forward that request to the back-end server. + Why not? Because it doesn't know which + back-end server to forward it to until the client sends a Search + request that specifies the database that it wants to search in. + Instead, it just treasures the Init request up in its heart; and, + later, the first time the client does a search on one of the + specified virtual databases, a connection is forged to the + appropriate server and the Init request is forwarded to it. 
If, + later in the session, the same client searches in a different + virtual database, then a connection is forged to the server that + hosts it, and the same cached Init request is forwarded there, + too. + + + All of this clever Init-delaying is done by the + frontend_net filter. The + virt_db filter knows nothing about it; in + fact, because the Init request that is received from the client + doesn't get forwarded until a Search request is received, the + virt_db filter (and the + z3950_client filter behind it) doesn't even get + invoked at Init time. The only thing that a + virt_db filter ever does is rewrite the + VAL_PROXY otherInfo in the requests that pass + through it. + + + It is possible for a virt_db filter to contain + multiple + <target> + elements. What does this mean? Only that the filter will add + multiple VAL_PROXY otherInfo packets to the + Search requests that pass through it. That's because the virtual + DB filter is dumb, and does exactly what it's told - no more, no + less. + If a Search request with multiple VAL_PROXY + otherInfo packets reaches a z3950_client + filter, this is an error. That filter doesn't know how to deal + with multiple targets, so it will either just pick one and search + in it, or (better) fail with an error message. + + + The multi filter comes to the rescue! This is + the only filter that knows how to deal with multiple + VAL_PROXY otherInfo packets, and it does so by + making multiple copies of the entire Search request: one for each + VAL_PROXY. Each of these new copies is then + passed down through the remaining filters in the route. (The + copies are handled in parallel through the + spawning of new threads.) Since the copies each have only one + VAL_PROXY otherInfo, they can be handled by the + z3950_client filter, which happily deals with + each one individually. 
When the results of the individual + searches come back up to the multi filter, it + merges them into a single Search response, which is what + eventually makes it back to the client. + +
+ + +
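The fan-out performed by the multi filter, as described above, can be modelled as follows. This is only a sketch under stated assumptions, not Metaproxy's C++ implementation: SearchRequest, fan_out and the search_one callback are hypothetical names, the targets tuple stands in for the VAL_PROXY otherInfo packets, and the merged response is reduced to a combined hit count for brevity.

```python
# Illustrative model only; not Metaproxy source code.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class SearchRequest:
    query: str
    targets: Tuple[str, ...]  # stands in for the VAL_PROXY otherInfo packets


def fan_out(request: SearchRequest,
            search_one: Callable[[SearchRequest], int]) -> int:
    """Copy the request once per target, search in parallel, merge hit counts."""
    # One copy of the whole request per target, each carrying a single target.
    copies = [replace(request, targets=(target,)) for target in request.targets]
    # Each copy is handled on its own worker thread.
    with ThreadPoolExecutor(max_workers=max(1, len(copies))) as pool:
        counts = list(pool.map(search_one, copies))
    # Merge the individual responses into a single result.
    return sum(counts)


if __name__ == "__main__":
    fake_hits = {"z3950.loc.gov:7090/voyager": 10000, "indexdata.dk/marc": 10}
    request = SearchRequest("computer", tuple(fake_hits))
    print(fan_out(request, lambda copy: fake_hits[copy.targets[0]]))
```

The search_one callback plays the role of the remaining filters in the route: because every copy carries exactly one target, it never has to cope with multiple targets, which mirrors the reason the z3950_client filter can handle each copy individually.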
+ A picture is worth a thousand words (but only five hundred on 64-bit architectures) + + + + + + + + + + + + [Here there should be a diagram showing the progress of + packets through the filters during a simple virtual-database + search and a multi-database search, but it seems that your + toolchain has not been able to include the diagram in this + document. This is because of LaTeX suckage. Time to move to + OpenOffice. Yes, really.] + + + + + + 
@@ -954,8 +1658,7 @@ &manref; - - +