<?xml version="1.0" standalone="no"?>
-<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1//EN"
- "http://www.oasis-open.org/docbook/xml/4.1/docbookx.dtd"
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+ "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"
[
<!ENTITY % local SYSTEM "local.ent">
%local;
<author>
<firstname>Jakub</firstname><surname>Skoczen</surname>
</author>
+ <author>
+ <firstname>Mike</firstname><surname>Taylor</surname>
+ </author>
+ <author>
+ <firstname>Dennis</firstname><surname>Schafroth</surname>
+ </author>
<releaseinfo>&version;</releaseinfo>
<copyright>
<year>©right-year;</year>
</copyright>
<abstract>
<simpara>
- Pazpar2 is a high-performance, user interface-independent, data
- model-independent metasearching
- middle-ware featuring merging, relevance ranking, record sorting,
+ Pazpar2 is a high-performance metasearch engine featuring
+ merging, relevance ranking, record sorting,
and faceted results.
+ It is middleware: it has no user interface of its own, but can be
+ configured and controlled by an XML-over-HTTP web-service to provide
+ metasearching functionality behind any user interface.
</simpara>
<simpara>
This document is a guide and reference to Pazpar2 version &version;.
<inlinemediaobject>
<imageobject>
<imagedata fileref="common/id.png" format="PNG"/>
- </imageobject>
- <imageobject>
- <imagedata fileref="common/id.eps" format="EPS"/>
- </imageobject>
- </inlinemediaobject>
+ </imageobject>
+ <imageobject>
+ <imagedata fileref="common/id.eps" format="EPS"/>
+ </imageobject>
+ </inlinemediaobject>
</simpara>
</abstract>
</bookinfo>
-
+
<chapter id="introduction">
<title>Introduction</title>
- <para>
- Pazpar2 is a stand-alone metasearch client with a web-service API, designed
- to be used either from a browser-based client (JavaScript, Flash, Java,
- etc.), from server-side code, or any combination of the two.
- Pazpar2 is a highly optimized client designed to
- search many resources in parallel. It implements record merging,
- relevance-ranking and sorting by arbitrary data content, and facet
- analysis for browsing purposes. It is designed to be data model
- independent, and is capable of working with MARC, DublinCore, or any
- other <ulink url="&url.xml;">XML</ulink>-structured response format
- -- <ulink url="&url.xslt;">XSLT</ulink> is used to normalize and extract
- data from retrieval records for display and analysis. It can be used
- against any server which supports the
- <ulink url="&url.z39.50;">Z39.50</ulink> and <ulink url="&url.sru;">SRU/SRW</ulink>
- protocol. Proprietary
- backend modules can be used to support a large number of other protocols
- (please contact Index Data for further information about this).
- </para>
- <para>
- Additional functionality such as
- user management, attractive displays are expected to be implemented by
- applications that use Pazpar2. Pazpar2 is user interface independent.
- Its functionality is exposed through a simple REST-style web-service API,
- designed to be simple to use from an Ajax-enabled browser, Flash
- animation, Java applet, etc., or from a higher-level server-side language
- like PHP or Java. Because session information can be shared between
- browser-based logic and your server-side scripting, there is tremendous
- flexibility in how you implement your business logic on top of Pazpar2.
- </para>
- <para>
- Once you launch a search in Pazpar2, the operation continues behind the
- scenes. Pazpar2 connects to servers, carries out searches, and
- retrieves, deduplicates, and stores results internally. Your application
- code may periodically inquire about the status of an ongoing operation,
- and ask to see records or other result set facets. Result becomes
- available immediately, and it is easy to build end-user interfaces which
- feel extremely responsive, even when searching more than 100 servers
- concurrently.
- </para>
- <para>
- Pazpar2 is designed to be highly configurable. Incoming records are
- normalized to XML/UTF-8, and then further normalized using XSLT to a
- simple internal representation that is suitable for analysis. By
- providing XSLT stylesheets for different kinds of result records, you
- can tune Pazpar2 to work against different kinds of information
- retrieval servers. Finally, metadata is extracted, in a configurable
- way, from this internal record, to support display, merging, ranking,
- result set facets, and sorting. Pazpar2 is not bound to a specific model
- of metadata, such as DublinCore or MARC -- by providing the right
- configuration, it can work with a number of different kinds of data in
- support of many different applications.
- </para>
- <para>
- Pazpar2 is designed to be efficient and scalable. You can set it up to
- search several hundred targets in parallel, or you can use it to support
- hundreds of concurrent users. It is implemented with the same attention
- to performance and economy that we use in our indexing engines, so that
- you can focus on building your application, without worrying about the
- details of metasearch logic. You can devote all of your attention to
- usability and let Pazpar2 do what it does best -- metasearch.
- </para>
- <para>
- If you wish to connect to commercial or other databases which do not
- support open standards, please contact Index Data. We have a licensing
- agreement with a third party vendor which will enable Pazpar2 to access
- thousands of online databases, in addition to the vast number of catalogs
- and online services that support the Z39.50/SRU/SRW protocols.
- </para>
- <para>
- Pazpar2 is our attempt to re-think the traditional paradigms for
- implementing and deploying metasearch logic, with an uncompromising
- approach to performance, and attempting to make maximum use of the
- capabilities of modern browsers. The demo user interface that
- accompanies the distribution is but one example. If you think of new
- ways of using Pazpar2, we hope you'll share them with us, and if we
- can provide assistance with regards to training, design, programming,
- integration with different backends, hosting, or support, please don't
- hesitate to contact us. If you'd like to see functionality in Pazpar2
- that is not there today, please don't hesitate to contact us. It may
- already be in our development pipeline, or there might be a
- possibility for you to help out by sponsoring development time or
- code. Either way, get in touch and we will give you straight answers.
- </para>
- <para>
- Enjoy!
- </para>
- <para>
- Pazpar2 is covered by the GNU license version 2.
- See <xref linkend="license"/> for further information.
- </para>
+
+ <section id="what.pazpar2.is">
+ <title>What Pazpar2 is</title>
+ <para>
+ Pazpar2 is a stand-alone metasearch engine with a web-service API, designed
+ to be used either from a browser-based client (JavaScript, Flash,
+ Java applet,
+ etc.), from server-side code, or any combination of the two.
+ Pazpar2 is a highly optimized client designed to
+ search many resources in parallel. It implements record merging,
+ relevance-ranking and sorting by arbitrary data content, and facet
+ analysis for browsing purposes. It is designed to be data-model
+ independent, and is capable of working with MARC, DublinCore, or any
+ other <ulink url="&url.xml;">XML</ulink>-structured response format
+ -- <ulink url="&url.xslt;">XSLT</ulink> is used to normalize and extract
+ data from retrieval records for display and analysis. It can be used
+ against any server which supports the
+ <ulink url="&url.z39.50;">Z39.50</ulink>, <ulink url="&url.sru;">SRU/SRW</ulink>
+ or <ulink url="&url.solr;">SOLR</ulink> protocol. Proprietary
+ backend modules can function as connectors between these standard
+ protocols and any non-standard API, including web-site scraping, to
+ support a large number of other protocols.
+ </para>
+ <para>
+ Additional functionality such as
+ user management and attractive displays are expected to be implemented by
+ applications that use Pazpar2. Pazpar2 itself is user-interface independent.
+ Its functionality is exposed through a simple XML-based web-service API,
+ designed to be easy to use from an Ajax-enabled browser, Flash
+ animation, Java applet, etc., or from a higher-level server-side language
+ like PHP, Perl or Java. Because session information can be shared between
+ browser-based logic and server-side scripting, there is tremendous
+ flexibility in how you implement application-specific logic on top
+ of Pazpar2.
+ </para>
+ <para>
+ Once you launch a search in Pazpar2, the operation continues behind the
+ scenes. Pazpar2 connects to servers, carries out searches, and
+ retrieves, deduplicates, and stores results internally. Your application
+ code may periodically inquire about the status of an ongoing operation,
+ and ask to see records or result set facets. Results become
+ available immediately, and it is easy to build end-user interfaces than
+ feel extremely responsive, even when searching more than 100 servers
+ concurrently.
+ </para>
+ <para>
+ Pazpar2 is designed to be highly configurable. Incoming records are
+ normalized to XML/UTF-8, and then further normalized using XSLT to a
+ simple internal representation that is suitable for analysis. By
+ providing XSLT stylesheets for different kinds of result records, you
+ can configure Pazpar2 to work against different kinds of information
+ retrieval servers. Finally, metadata is extracted in a configurable
+ way from this internal record, to support display, merging, ranking,
+ result set facets, and sorting. Pazpar2 is not bound to a specific model
+ of metadata, such as DublinCore or MARC: by providing the right
+ configuration, it can work with any combination of different kinds of data
+ in support of many different applications.
+ </para>
+ <para>
+ Pazpar2 is designed to be efficient and scalable. You can set it up to
+ search several hundred targets in parallel, or you can use it to support
+ hundreds of concurrent users. It is implemented with the same attention
+ to performance and economy that we use in our indexing engines, so that
+ you can focus on building your application without worrying about the
+ details of metasearch logic. You can devote all of your attention to
+ usability and let Pazpar2 do what it does best -- metasearch.
+ </para>
+ <para>
+ Pazpar2 is our attempt to re-think the traditional paradigms for
+ implementing and deploying metasearch logic, with an uncompromising
+ approach to performance, and attempting to make maximum use of the
+ capabilities of modern browsers. The demo user interface that
+ accompanies the distribution is but one example. If you think of new
+ ways of using Pazpar2, we hope you'll share them with us, and if we
+ can provide assistance with regards to training, design, programming,
+ integration with different backends, hosting, or support, please don't
+ hesitate to contact us. If you'd like to see functionality in Pazpar2
+ that is not there today, please don't hesitate to contact us. It may
+ already be in our development pipeline, or there might be a
+ possibility for you to help out by sponsoring development time or
+ code. Either way, get in touch and we will give you straight answers.
+ </para>
+ <para>
+ Enjoy!
+ </para>
+ <para>
+ Pazpar2 is covered by the GNU General Public License (GPL) version 2.
+ See <xref linkend="license"/> for further information.
+ </para>
+ </section>
+
+ <section id="connectors">
+ <title>Connectors to non-standard databases</title>
+ <para>
+ If you need to access commercial or open access resources that don't support
+ Z39.50 or SRU, one approach would be to use a tool like <ulink
+ url="&url.simpleserver;">SimpleServer</ulink> to build a
+ gateway. An easier option is to use Index Data's <ulink
+ url="&url.mkc;">MasterKey Connect</ulink>
+ service, which will expose virtually <emphasis>any</emphasis> resource
+ through Z39.50/SRU, dead easy to integrate with Pazpar2.
+ The service is hosted, so all you have to do is to let us
+ know which resources you are interested in, and we operate the gateways,
+ or Connectors for you for a low annual charge.
+ Types of resources supported include
+ commercial databases, free online resources, and even local resources;
+ almost anything that can be accessed through a web-facing user
+ interface can be accessed in this way.
+ Contact <email>info@indexdata.com</email> for more information.
+ See <xref linkend="masterkey_connect"/> for an example.
+ </para>
+ </section>
+
+ <section id="name">
+ <title>A note on the name Pazpar2</title>
+ <para>
+ The name Pazpar2 derives from three sources. One one hand, it is
+ Index Data's second major piece of software that does parallel
+ searching of Z39.50 targets. On the other, it is a near-homophone
+ of Passpartout, the ever-helpful servant in Jules Verne's novel
+ Around the World in Eighty Days (who helpfully uses the language
+ of his master). Finally, "passe par tout" means something like
+ "passes through anything" in French -- on other words, a universal
+ solution, or if you like a MasterKey.
+ </para>
+ </section>
</chapter>
<chapter id="installation">
<title>Installation</title>
<para>
- The Pazpar2 package is very small. It includes documentation as well
+ The Pazpar2 package includes documentation as well
as the Pazpar2 server. The package also includes a simple user
- interface test1 which consists of a single HTML page and a single
+ interface called "test1", which consists of a single HTML page and a single
JavaScript file to illustrate the use of Pazpar2.
</para>
<para>
Pazpar2 depends on the following tools/libraries:
<variablelist>
<varlistentry><term><ulink url="&url.yaz;">YAZ</ulink></term>
- <listitem>
- <para>
- The popular Z39.50 toolkit for the C language.
- YAZ <emphasis>must</emphasis> be compiled with Libxml2/Libxslt support.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ The popular Z39.50 toolkit for the C language.
+ YAZ <emphasis>must</emphasis> be compiled with Libxml2/Libxslt support.
+ </para>
+ </listitem>
</varlistentry>
<varlistentry><term><ulink url="&url.icu;">International
- Components for Unicode (ICU)</ulink></term>
- <listitem>
- <para>
- ICU provides Unicode support for non-English languages with
- character sets outside the range of 7bit ASCII, like
- Greek, Russian, German and French. Pazpar2 uses the ICU
- Unicode character conversions, Unicode normalization, case
- folding and other fundamental operations needed in
- tokenization, normalization and ranking of records.
- </para>
- <para>
- Compiling, linking, and usage of the ICU libraries is optional,
- but strongly recommended for usage in an international
- environment.
- </para>
- </listitem>
+ Components for Unicode (ICU)</ulink></term>
+ <listitem>
+ <para>
+ ICU provides Unicode support for non-English languages with
+ character sets outside the range of 7bit ASCII, like
+ Greek, Russian, German and French. Pazpar2 uses the ICU
+ Unicode character conversions, Unicode normalization, case
+ folding and other fundamental operations needed in
+ tokenization, normalization and ranking of records.
+ </para>
+ <para>
+ Compiling, linking, and usage of the ICU libraries is optional,
+ but strongly recommended for usage in an international
+ environment.
+ </para>
+ </listitem>
</varlistentry>
</variablelist>
</para>
</para>
<section id="installation.unix">
- <title>Installation on Unix (from Source)</title>
+ <title>Installation from source on Unix (including Linux, MacOS, etc.)</title>
<para>
The latest source code for Pazpar2 is available from
<ulink url="&url.pazpar2.download;"/>.
- Only few systems have none of the required
- tools binary packages.
- If, for example, Libxml2/libXSLT libraries
- are already installed as development packages use these.
+ Most Unix-based operating systems have the required
+ tools available as binary packages.
+ For example, if Libxml2/libXSLT libraries
+ are already installed as development packages, use these.
</para>
-
+
<para>
- Ensure that the development libraries + header files are
+ Ensure that the development libraries and header files are
available on your system before compiling Pazpar2. For installation
- of YAZ, refer to the YAZ installation chapter.
+ of YAZ, refer to the Installation chapter of the YAZ manual at
+ <ulink url="&url.yaz.install;"/>.
+ </para>
+ <para>
+ Once the dependencies are in place, Pazpar2 can be unpacked and
+ installed as follows:
</para>
<screen>
- gunzip -c pazpar2-version.tar.gz|tar xf -
- cd pazpar2-version
+ tar xzf pazpar2-VERSION.tar.gz
+ cd pazpar2-VERSION
./configure
make
- su
- make install
+ sudo make install
</screen>
<para>
The <literal>make install</literal> will install manpages as well as the
- Pazpar2 server, <literal>pazpar2</literal>,
+ Pazpar2 server, <literal>pazpar2</literal>,
in PREFIX<literal>/sbin</literal>.
By default, PREFIX is <literal>/usr/local/</literal> . This can be
changed with configure option <option>--prefix</option>.
</section>
<section id="installation.win32">
- <title>Installation on Windows (from Source)</title>
- <para>
- Pazpar2 can be built for Windows using
- <ulink url="&url.vstudio;">Microsoft Visual Studio</ulink>.
- The support files for building YAZ on Windows are located in the
- <filename>win</filename> directory. The compilation is performed
- using the <filename>win/makefile</filename> which is to be
- processed by the NMAKE utility part of Visual Studio.
- </para>
- <para>
- Ensure that the development libraries + header files are
- available on your system before compiling Pazpar2. For installation
- of YAZ, refer to the YAZ installation chapter.
- It is easiest if YAZ and Pazpar2 are unpacked in the same
- directory (side-by-side).
- </para>
- <para>
- The compilation is tuned by editing the makefile of Pazpar2.
- The process is similar to YAZ. Adjust the various directories
- <literal>YAZ_DIR</literal>, <literal>ZLIB_DIR</literal>, ..
- </para>
- <para>
- Compile Pazpar2 by invoking <application>nmake</application> in
- the <filename>win</filename> directory.
- The resulting binaries of the build process are located in the
- <filename>bin</filename> of the Pazpar2 source
- tree - including the <filename>pazpar2.exe</filename> and necessary DLLs.
- </para>
- <para>
- The Windows version of Pazpar2 is a console application. It may
- be installed as a Windows Service by adding option
- <literal>-install</literal> for the pazpar2 program. This will
- register Pazpar2 as a service and use the other options provided
- in the same invocation. For example:
- <screen>
- cd \MyPazpar2\etc
- ..\bin\pazpar2 -install -c pazpar2.cfg -l pazpar2.log
- </screen>
- The Pazpar2 service may now be controlled via the Service Control
- Panel. It may be unregistered by passing the <literal>-remove</literal>
- option. Example:
- <screen>
- cd \MyPazpar2\etc
- ..\bin\pazpar2 -remove
- </screen>
- </para>
+ <title>Installation from source on Windows</title>
+ <para>
+ Pazpar2 can be built for Windows using
+ <ulink url="&url.vstudio;">Microsoft Visual Studio</ulink>.
+ The support files for building YAZ on Windows are located in the
+ <filename>win</filename> directory. The compilation is performed
+ using the <filename>win/makefile</filename> which is to be
+ processed by the NMAKE utility part of Visual Studio.
+ </para>
+ <para>
+ Ensure that the development libraries and header files are
+ available on your system before compiling Pazpar2. For installation
+ of YAZ, refer to
+ the Installation chapter of the YAZ manual at
+ <ulink url="&url.yaz.install;"/>.
+ It is easiest if YAZ and Pazpar2 are unpacked in the same
+ directory (side-by-side).
+ </para>
+ <para>
+ The compilation is tuned by editing the makefile of Pazpar2.
+ The process is similar to YAZ. Adjust the various directories
+ <literal>YAZ_DIR</literal>, <literal>ZLIB_DIR</literal>, etc.,
+ as required.
+ </para>
+ <para>
+ Compile Pazpar2 by invoking <application>nmake</application> in
+ the <filename>win</filename> directory.
+ The resulting binaries of the build process are located in the
+ <filename>bin</filename> of the Pazpar2 source
+ tree - including the <filename>pazpar2.exe</filename> and necessary DLLs.
+ </para>
+ <para>
+ The Windows version of Pazpar2 is a console application. It may
+ be installed as a Windows Service by adding option
+ <literal>-install</literal> for the pazpar2 program. This will
+ register Pazpar2 as a service and use the other options provided
+ in the same invocation. For example:
+ <screen>
+ cd \MyPazpar2\etc
+ ..\bin\pazpar2 -install -f pazpar2.cfg -l pazpar2.log
+ </screen>
+ The Pazpar2 service may now be controlled via the Service Control
+ Panel. It may be unregistered by passing the <literal>-remove</literal>
+ option. Example:
+ <screen>
+ cd \MyPazpar2\etc
+ ..\bin\pazpar2 -remove
+ </screen>
+ </para>
</section>
<section id="installation.test1">
- <title>Installation of test1 interface</title>
+ <title>Installation of test interfaces</title>
<para>
- In this section we outline how to install a simple interface that
- is part of the Pazpar2 source package. Note that Debian users can
- save time by just installing package <literal>pazpar2-test1</literal>.
+ In this section we show how to make available the set of simple
+ interfaces that are part of the Pazpar2 source package, and which
+ demonstrate some ways to use Pazpar2. (Note that Debian users can
+ save time by just installing the package <literal>pazpar2-test1</literal>.)
</para>
<para>
- A web server must be installed and running on the system, such as Apache.
+ A web server, such as Apache, must be installed and running on the system.
</para>
<para>
<screen>
cd etc
cp pazpar2.cfg.dist pazpar2.cfg
- ../src/pazpar2 -f pazpar2.cfg -t edu.xml
+ ../src/pazpar2 -f pazpar2.cfg
</screen>
And on Windows:
<screen>
cd etc
copy pazpar2.cfg.dist pazpar2.cfg
- ..\bin\pazpar2 -f pazpar2.cfg -t edu.xml
+ ..\bin\pazpar2 -f pazpar2.cfg
</screen>
- This will start a Pazpar2 listener on port 8004. It will proxy
- HTTP requests to localhost - port 80, which we assume will be the regular
+ This will start a Pazpar2 listener on port 9004. It will proxy
+ HTTP requests to port 80 on localhost, which we assume will be the regular
HTTP server on the system. Inspect and modify pazpar2.cfg as needed
- if this is to be changed. The -t option specifies the list of targets
+ if this is to be changed. The pazpar2.cfg file includes settings from the
+ file <filename>settings/edu.xml</filename>
to use for searches.
</para>
+
<para>
- Make a new console and move to the other stuff.
- For more information about pazpar2 options refer to the manpage.
+ The test UIs are located in <literal>www</literal>. Ensure that this
+ directory is available to the web server by copying
+ <literal>www</literal> to the document root,
+ using Apache's <literal>Alias</literal> directive, or
+ creating a symbolic link: for example, on a Debian or Ubuntu
+ system with Apache2 installed from the standard package, you might
+ make the link as follows:
+ <screen>
+ cd .../pazpar2
+ sudo ln -s `pwd`/www /var/www/pazpar2-demo
+ </screen>
</para>
<para>
- The test1 UI is located in <literal>www/test1</literal>. Ensure this
- directory is available to the web server by either copying
- <literal>test1</literal> to the document root, create a symlink or
- use Apache's <literal>Alias</literal> directive.
+ This makes the test applications visible at
+ <ulink url="http://localhost/pazpar2-demo/"/>
+ but they can not be run successfully from that URL, as they submit
+ search requests back to the server form which they were served,
+ and Apache2 doesn't know how to handle them. Instead, the test
+ applications must be accessed from Pazpar2 itself, acting as a
+ proxy to Apache2, at the URL
+ <ulink url="http://localhost:9004/pazpar2-demo/"/>
</para>
<para>
- The interface test1 interface should now be available on port 8004.
+ From here, the demo applications can be
+ accessed: <literal>test1</literal>, <literal>test2</literal> and
+ <literal>jsdemo</literal>
+ are pure HTML+JavaScript setups, needing no server-side
+ intelligence;
+ <literal>demo</literal>
+ requires PHP on the server.
</para>
<para>
- If you don't see the test1 interface. See if test1 is really available
- on the same URL but on port 80. If it's not, the Apache configuration
- (or other) is not correct.
+ If you don't see the test interfaces, check whether they are available
+ on port 80 (i.e. directly from the Apache2 server). If not, the
+ Apache configuration is incorrect.
</para>
<para>
In order to use Apache as frontend for the interface on port 80
- for public access etc., refer to
+ for public access etc., refer to
<xref linkend="installation.apache2proxy"/>.
</para>
</section>
<section id="installation.debian">
- <title>Installation on Debian GNU/Linux</title>
+ <title>Installation on Debian GNU/Linux and Ubuntu</title>
<para>
- Index Data provides Debian packages for Pazpar2. These are prepared
- for Debian versions Etch and Lenny (as of 2007).
- These packages are available at
- <ulink url="&url.pazpar2.download.debian;"/>.
+ Index Data provides Debian and Ubuntu packages for Pazpar2.
+ As of February 2010, these
+ are prepared for Debian versions Etch, Lenny and Squeeze; and for
+ Ubuntu versions 8.04 (hardy), 8.10 (intrepid), 9.04 (jaunty) and
+ 9.10 (karmic). These packages are available at
+ <ulink url="&url.pazpar2.download.debian;"/> and
+ <ulink url="&url.pazpar2.download.ubuntu;"/>.
</para>
</section>
<section id="installation.apache2proxy">
<title>Apache 2 Proxy</title>
<para>
- Apache 2 has a
- <ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html">
+ Apache 2 has a
+ <ulink
+ url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html">
proxy module
- </ulink> which allows Pazpar2 to become a backend to an Apache 2
+ </ulink>
+ which allows Pazpar2 to become a backend to an Apache 2
based web service. The Apache 2 proxy must operate in the
<emphasis>Reverse</emphasis> Proxy mode.
</para>
-
+
<para>
On a Debian based Apache 2 system, the relevant modules can
be enabled with:
<screen>
- sudo a2enmod proxy_http
+ sudo a2enmod proxy_http proxy_balancer
</screen>
</para>
<para>
- Traditionally Pazpar2 interprets URL paths with suffix
+ Traditionally Pazpar2 interprets URL paths with suffix
<literal>/search.pz2</literal>.
- The
- <ulink
- url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass"
- >ProxyPass</ulink> directive of Apache must be used to map a URL path
+ The
+ <ulink
+ url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass">
+ ProxyPass
+ </ulink>
+ directive of Apache must be used to map a URL path
the the Pazpar2 server (listening port).
</para>
<screen><![CDATA[
<IfModule mod_proxy.c>
ProxyRequests Off
-
+
<Proxy *>
AddDefaultCharset off
Order deny,allow
Allow from all
</Proxy>
-
+
ProxyPass /myportal/search.pz2 http://localhost:8004/search.pz2
ProxyVia Off
</IfModule>
<title>Using Pazpar2</title>
<para>
This chapter provides a general introduction to the use and
- deployment of Pazpar2.
+ deployment of Pazpar2.
</para>
<section id="architecture">
with the server from which the enclosing HTML page or object
originated, Pazpar2 is designed so that it can act as a transparent
proxy in front of an existing webserver (see <xref
- linkend="pazpar2_conf"/> for details).
+ linkend="pazpar2_conf"/> for details).
In this mode, all regular
HTTP requests are transparently passed through to your webserver,
while Pazpar2 only intercepts search-related webservice requests.
The intermediate, internal representation of the record looks like
this:
<screen><![CDATA[
- <record xmlns="http://www.indexdata.com/pazpar2/1.0"
- mergekey="title The Shining author King, Stephen">
-
- <metadata type="title">The Shining</metadata>
+ <record xmlns="http://www.indexdata.com/pazpar2/1.0"
+ mergekey="title The Shining author King, Stephen">
- <metadata type="author">King, Stephen</metadata>
+ <metadata type="title" rank="2">The Shining</metadata>
- <metadata type="kind">ebook</metadata>
+ <metadata type="author">King, Stephen</metadata>
- <!-- ... and so on -->
- </record>
- ]]></screen>
+ <metadata type="kind">ebook</metadata>
+ <!-- ... and so on -->
+ </record>
+]]></screen>
As you can see, there isn't much to it. There are really only a few
important elements to this file.
records are never merged. The 'metadata' elements provide the meat
of the elements -- the content. the 'type' attribute is used to
match each element against processing rules that determine what
- happens to the data element next.
+ happens to the data element next. The attribute, 'rank' specifies
+ specifies a multipler for ranking for this element.
</para>
<para>
in the retrieval record ultimately drives merging, sorting, ranking,
the extraction of browse facets, and display, all configurable.
</para>
+
+ <para>
+ Pazpar2 1.6.37 and later also allows already clustered records to
+ be ingested. Suppose a database already clusters for us and we would like
+ to keep that cluster for Pazpar2. In that case we can generate a
+ <literal>cluster</literal> wrapper element that holds individual
+ <literal>record</literal> elements.
+ </para>
+ <para>
+ Cluster record example:
+ <screen><![CDATA[
+ <cluster xmlns="http://www.indexdata.com/pazpar2/1.0">
+ <record>
+ <metadata type="title" rank="2">The Shining</metadata>
+ <metadata type="author">King, Stephen</metadata>
+ <metadata type="kind">ebook</metadata>
+ </record>
+ <record>
+ <metadata type="title" rank="2">The Shining</metadata>
+ <metadata type="author">King, Stephen</metadata>
+ <metadata type="kind">audio</metadata>
+ </record>
+ </cluster>
+ ]]></screen>
+ </para>
</section>
<section id="client">
search. You start a new search using the 'search' command. Once the
search has been started, you can follow its progress using the
'stat', 'bytarget', 'termlist', or 'show' commands. Detailed records
- can be fetched using the 'record' command.
+ can be fetched using the 'record' command.
</para>
</section>
§-ajaxdev;
- <section id="nonstandard">
- <title>Connecting to non-standard resources</title>
- <para>
- Pazpar2 uses Z39.50 as its switchboard language -- i.e. as far as it
- is concerned, all resources speak Z39.50, or its webservices derivatives,
- SRU/SRW. It is, however, equipped
- to handle a broad range of different server behavior, through
- configurable query mapping and record normalization. If you develop
- configuration, stylesheets, etc., for a new type of resources, we
- encourage you to share your work. But you can also use Pazpar2 to
- connect to hundreds of resources that do not support standard
- protocols.
- </para>
-
- <para>
- For a growing number of resources, Z39.50 is all you need. Over the
- last few years, a number of commercial, full-text resources have
- implemented Z39.50. These can be used through Pazpar2 with little or
- no effort. Resources that use non-standard record formats will
- require a bit of XSLT work, but that's all.
- </para>
-
- <para>
- But what about resources that don't support Z39.50 at all? Some resources might
- support OpenSearch, private, XML/HTTP-based protocols, or something
- else entirely. Some databases exist only as web user interfaces and
- will require screen-scraping. Still others exist only as static
- files, or perhaps as databases supporting the OAI-PMH protocol.
- There is hope! Read on.
- </para>
-
- <para>
- Index Data continues to advocate the support of open standards. We
- work with database vendors to support standards, so you don't have
- to worry about programming against non-standard services. We also
- provide tools (see <ulink
- url="http://www.indexdata.com/simpleserver">SimpleServer</ulink>)
- which make it comparatively easy to build gateways against servers
- with non-standard behavior. Again, we encourage you to share any
- work you do in this direction.
- </para>
-
- <para>
- But the bottom line is that working with non-standard resources in
- metasearching is really, really hard. If you want to build a
- project with Pazpar2, and you need access to resources with
- non-standard interfaces, we can help. We run gateways to more than
- 2,000 popular, commercial databases and other resources,
- making it simple
- to plug them directly into Pazpar2. For a small annual fee per
- database, we can help you establish connections to your licensed
- resources. Meanwhile, you can help! If you build your own
- standards-compliant gateways, host them for others, or share the
- code! And tell your vendors that they can save everybody money and
- increase the appeal of their resources by supporting standards.
- </para>
-
- <para>
- There are those who will ask us why we are using Z39.50 as our
- switchboard language rather than a different protocol. Basically,
- we believe that Z39.50 is presently the most widely implemented
- information retrieval protocol that has the level of functionality
- required to support a good metasearching experience (structured
- searching, structured, well-defined results). It is also compact and
- efficient, and there is a very broad range of tools available to
- implement it.
- </para>
- </section>
-
<section id="unicode">
<title>Unicode Compliance</title>
<para>
</para>
<para>
In addition, the ICU tokenization and normalization rules must
- be defined in the master configuration file described in
+ be defined in the master configuration file described in
<xref linkend="config-server"/>.
</para>
</section>
+ <section id="load_balancing">
+ <title>Load balancing</title>
+ <para>
+ Just like any web server, Pazpar2, can be load balanced by a standard
+ hardware or software load balancer as long as the session stickiness
+ is ensured. If you are already running the Apache2 web server in front
+ of Pazpar2 and use the apache mod_proxy module to 'relay' client
+ requests to Pazpar2, this set up can be easily extended to include
+ load balancing capabilites.
+ To do so you need to enable the
+ <ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html">
+ mod_proxy_balance
+ </ulink>
+ module in your Apache2 installation.
+ </para>
+
+ <para>
+ On a Debian based Apache 2 system, the relevant modules can
+ be enabled with:
+ <screen>
+ sudo a2enmod proxy_http
+ </screen>
+ </para>
+
+ <para>
+ The mod_proxy_balancer can pass all 'sessionsticky' requests to the
+ same backend worker as long as the requests are marked with the
+ originating worker's ID (called 'route'). If the Pazpar2 serverID is
+ configured (by setting an 'id' attribute on the 'server' element in
+ the Pazpar2 configuration file) Pazpar2 will append it to the
+ 'session' element returned during the 'init' in a mod_proxy_balancer
+ compatible manner.
+ Since the 'session' is then re-sent by the client (for all pazpar2
+ request besides 'init'), the balancer can use the marker to pass
+ the request to the right route. To do so the balancer needs to be
+ configured to inspect the 'session' parameter.
+ </para>
+
+ <example id="load_balancing.example">
+ <title>Apache 2 load balancing configuration</title>
+ <para>
+ Having 4 Pazpar2 instances running on the same host, port range of
+ 8004-8007 and serverIDs of: pz1, pz2, pz3 and pz4 respectively we
+ could use the following Apache 2 configuration to expose a single
+ pazpar2 'endpoint' on a standard
+ (<filename>/pazpar2/search.pz2</filename>) location:
+
+ <screen><![CDATA[
+ <Proxy *>
+ AddDefaultCharset off
+ Order deny,allow
+ Allow from all
+ </Proxy>
+ ProxyVia Off
+
+ # 'route' has to match the configured pazpar2 server ID
+ <Proxy balancer://pz2cluster>
+ BalancerMember http://localhost:8004 route=pz1
+ BalancerMember http://localhost:8005 route=pz2
+ BalancerMember http://localhost:8006 route=pz3
+ BalancerMember http://localhost:8007 route=pz4
+ </Proxy>
+
+ # route is resent in the 'session' param which has the form:
+ # 'sessid.serverid', understandable by the mod_proxy_load_balancer
+ # this is not going to work if the client tampers with the 'session' param
+ ProxyPass /pazpar2/search.pz2 balancer://pz2cluster lbmethod=byrequests stickysession=session nofailover=On
+ ]]></screen>
+
+ The 'ProxyPass' line sets up a reverse proxy for request
+ ‘/pazpar2/search.pz2’ and delegates all requests to the load balancer
+ (virtual worker) with name ‘pz2cluster’.
+ Sticky sessions are enabled and implemented using the ‘session’ parameter.
+ The ‘Proxy’ section lists all the servers (real workers) which the
+ load balancer can use.
+ </para>
+
+ </example>
+
+ </section>
+
+ <section id="relevance_ranking">
+ <title>Relevance ranking</title>
+ <para>
+ Pazpar2 uses a variant of the fterm frequency–inverse document frequency
+ (Tf-idf) ranking algorithm.
+ </para>
+ <para>
+ The Tf-part is straightforward to calculate and is based on the
+ documents that Pazpar2 fetches. The idf-part, however, is more tricky
+ since the corpus at hand is ONLY the relevant documents and not
+ irrelevant ones. Pazpar2 does not have the full corpus -- only the
+ documents that match a particular search.
+ </para>
+ <para>
+ Computatation of the Tf-part is based on the normalized documents.
+ The length, the position and terms are thus normalized at this point.
+ Also the computation if performed for each document received from the
+ target - before merging takes place. The result of a TF-compuation is
+ added to the TF-total of a cluster. Thus, if a document occurs twice,
+ then the TF-part is doubled. That, however, can be adjusted, because the
+ TF-part may be divided by the number of documents in a cluster.
+ </para>
+ <para>
+ The algorithm used by Pazpar2 has two phases. In phase one
+ Pazpar2 computes a tf-array .. This is being done as records are
+ fetched form the database. In this case, the rank weigth
+ <literal>w</literal>, the and rank tweaks <literal>lead</literal>,
+ <literal>follow</literal> and <literal>length</literal>.
+
+ </para>
+ <screen><![CDATA[
+ tf[1,2,..N] = 0;
+ foreach document in a cluster
+ foreach field
+ w[1,2,..N] = 0;
+ for i = 1, .. N: (each term)
+ foreach pos (where term i occurs in field)
+ // w is configured weight for field
+ // pos is position of term in field
+ w[i] += w / (1 + log2(1+lead*pos))
+ if (d > 0)
+ w[i] += w[i] * follow / (1+log2(d)
+ // length: length of field (number of terms that is)
+ if (length strategy is "linear")
+ tf[i] += w[i] / length;
+ else if (length strategy is "log")
+ tf[i] += w[i] / log2(length);
+ else if (length strategy is "none")
+ tf[i] += w[i];
+ ]]></screen>
+ <para>
+ In phase two, the idf-array is computed and the final score
+ is computed. This is done for each cluster as part of each show command.
+ The rank tweak <literal>cluster</literal> is in use here.
+ </para>
+ <screen><![CDATA[
+ // dococcur[i]: number of records where term occurs
+ // doctotal: number of records
+ for i = 1, .., N (each term)
+ if (dococcur[i] > 0)
+ idf[i] = log(1 + doctotal / dococcur[i])
+ else
+ idf[i] = 0;
+
+ relevance = 0;
+ for i = 1, .., N: (each term)
+ if (cluster is "yes")
+ tf[i] = tf[i] / cluster_size;
+ relevance += 100000 * tf[i] / idf[i];
+ ]]></screen>
+ <para>
+ For controlling the ranking parameters, refer to the
+ <link linkend="service-rank">rank</link> element of the
+ service definition.
+ Refer to the <link linkend="metadata-rank">rank</link> attribute
+ of the metadata element for how to control ranking for individual
+ metadata fields.
+ </para>
+ </section> <!-- relevance_ranking -->
+
+ <section id="masterkey_connect">
+ <title>Pazpar2 and MasterKey Connect</title>
+ <para>
+ MasterKey Connect is a hosted connector, or gateway, service that exposes
+ whatever searchable resources you need. Since the service exposes all
+ resources using Z39.50 (or SRU), it is easy to set up Pazpar2 to use the
+ service. In particular, since all connectors expose basically the same core
+ behavior, it is a good use of Pazpar2's mechanism for managing default
+ behaviors across similar databases.
+ </para>
+ <para>
+ After installation of Pazpar2, the directory
+ <filename>/etc/pazpar2/settings/mkc</filename> (location may
+ vary depending on installation preferences) contains an example setup that
+ searches two different resources through a MasterKey Connect demo account.
+ The file mkc.xml contains default parameters that will work for all
+ MasterKey Connect resources (if you decide to become a customer of the
+ service, you will substitute your own account credentials for
+ the guest/guest). The other files contain specific information about
+ a couple of demonstration resources.
+ </para>
+
+ <para>
+ To play with the demo, just create a symlink from
+ <filename>/etc/pazpar2/services-enabled/default.xml</filename>
+ to <filename>/etc/pazpar2/services-available/mkc.xml</filename>.
+ And restart Pazpar2. You should now be able to search the two demo
+ resources using JSDemo or any user interface of your choice.
+ If you are interested in learning more about MasterKey Connect, or to
+ try out the service for free against your favorite online resource, just
+ contact us at <email>info@indexdata.com</email>.
+ </para>
+ </section>
+
</chapter> <!-- Using Pazpar2 -->
<reference id="reference">
&manref;
</reference>
- <appendix id="license"><title>License</title>
-
- <para>
- Pazpar2,
- Copyright © ©right-year; Index Data.
- </para>
-
- <para>
- Pazpar2 is free software; you can redistribute it and/or modify it under
- the terms of the GNU General Public License as published by the Free
- Software Foundation; either version 2, or (at your option) any later
- version.
- </para>
-
- <para>
- Pazpar2 is distributed in the hope that it will be useful, but WITHOUT ANY
- WARRANTY; without even the implied warranty of MERCHANTABILITY or
- FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
- for more details.
- </para>
-
- <para>
- You should have received a copy of the GNU General Public License
- along with Pazpar2; see the file LICENSE. If not, write to the
- Free Software Foundation,
- 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- </para>
+ <appendix id="license">
+ <title>License</title>
+
+ <para>
+ Pazpar2,
+ Copyright © ©right-year; Index Data.
+ </para>
+
+ <para>
+ Pazpar2 is free software; you can redistribute it and/or modify it under
+ the terms of the GNU General Public License as published by the Free
+ Software Foundation; either version 2, or (at your option) any later
+ version.
+ </para>
+
+ <para>
+ Pazpar2 is distributed in the hope that it will be useful, but WITHOUT ANY
+ WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+ for more details.
+ </para>
+
+ <para>
+ You should have received a copy of the GNU General Public License
+ along with Pazpar2; see the file LICENSE. If not, write to the
+ Free Software Foundation,
+ 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ </para>
</appendix>
&gpl2;
-
+
</book>
- <!-- Keep this comment at the end of the file
- Local variables:
- mode: sgml
- sgml-omittag:t
- sgml-shorttag:t
- sgml-minimize-attributes:nil
- sgml-always-quote-attributes:t
- sgml-indent-step:1
- sgml-indent-data:t
- sgml-parent-document: nil
- sgml-local-catalogs: nil
- sgml-namecase-general:t
- End:
- -->
+<!-- Keep this comment at the end of the file
+Local variables:
+mode: nxml
+nxml-child-indent: 1
+End:
+-->