From c50b7223e10de52e713be64559129ea89e8ed601 Mon Sep 17 00:00:00 2001 From: Mike Taylor Date: Sun, 1 Dec 2002 23:26:26 +0000 Subject: [PATCH] All sorts of minor and semi-major improvements. Remove harvest.mbox -- its content is now incorporated. --- doc/examples.xml | 52 +++++--- doc/harvest.mbox | 360 -------------------------------------------------- doc/installation.xml | 26 ++-- doc/introduction.xml | 73 +++++++--- doc/quickstart.xml | 71 ++++------ 5 files changed, 130 insertions(+), 452 deletions(-) delete mode 100644 doc/harvest.mbox diff --git a/doc/examples.xml b/doc/examples.xml index 10cbeb5..dc95e12 100644 --- a/doc/examples.xml +++ b/doc/examples.xml @@ -1,5 +1,5 @@ - + Example Configurations @@ -19,23 +19,35 @@ - Where to find subsidiary configuration files, including - default.idx + Where to find subsidiary configuration files, including both + those that are named explicitly and a few ``magic'' files such + as default.idx, which specifies the default indexing rules. - What attribute sets to recognise in searches. + What record schemas to support. (Subsidiary files specify how + to index the contents of records in those schemas, and what + format to use when presenting records in those schemas to client + software.) - Policy details such as what record type to expect, what - low-level indexing algorithm to use, how to identify potential - duplicate records, etc. + What attribute sets to recognise in searches. (Subsidiary files + specify how to interpret the attributes in terms + of the indexes that are created on the records.) + + + + + + Policy details such as what type of input format to expect when + adding new records, what low-level indexing algorithm to use, + how to identify potential duplicate records, etc. @@ -69,6 +81,10 @@ dino.tree.) Type make records/dino.xml to make the XML data file. 
+ (Or you could just type make to build the XML + data file, create the database and populate it with the taxonomic + records all in one shot - but then you wouldn't learn anything, + would you? :-) Now we need to create a Zebra database to hold and index the XML @@ -76,7 +92,7 @@ Zebra indexer, zebraidx, which is driven by the zebra.cfg configuration file. For our purposes, we don't need any - special behaviour - we can use the defaults - so we start with a + special behaviour - we can use the defaults - so we can start with a minimal file that just tells zebraidx where to find the default indexing rules, and how to parse the records: @@ -108,7 +124,7 @@ XPath-based boolean queries and fetch the XML records that satisfy them: - $ yaz-client tcp:@:9999 + $ yaz-client @:9999 Connecting...Ok. Z> find @attr 1=/Zthes/termName Sauroposeidon Number of hits: 1 @@ -118,6 +134,7 @@ <termId>22</termId> <termName>Sauroposeidon</termName> <termType>PT</termType> + <termNote>The tallest known dinosaur (18m)</termNote> <relation> <relationType>BT</relationType> <termId>21</termId> @@ -126,7 +143,7 @@ </relation> <idzebra xmlns="http://www.indexdata.dk/zebra/"> - <size>245</size> + <size>300</size> <localnumber>23</localnumber> <filename>records/dino.xml</filename> </idzebra> @@ -134,7 +151,7 @@ - Now wasn't that easy? + Now wasn't that nice and easy? @@ -158,7 +175,7 @@ significantly because it ties searching semantics to the physical structure of the searched records. You can't use the same search specification to search two databases if their internal - representations are different. Consider an alternative taxonomy + representations are different. 
Consider a different taxonomy database in which the records have taxon names specified inside a <name> element nested within a <identification> element @@ -175,8 +192,8 @@ said about implementation: in a given database, an access point might be implemented as an index, a path into physical records, an algorithm for interrogating relational tables or whatever works. - The key point is that the semantics of an access point are fixed - and well defined. + The only important point is that the semantics of an access + point are fixed and well defined. For convenience, access points are gathered into attribute @@ -192,7 +209,7 @@ In practice, the BIB-1 attribute set has tended to be a dumping ground for all sorts of access points, so that, for example, it includes some geospatial access points as well as strictly - bibliographic ones. Nevertheless, the key point is that this model + bibliographic ones. Nevertheless, this model allows a layer of abstraction over the physical representation of records in databases. @@ -210,6 +227,11 @@ <Zthes> element. + ### Here's where it all goes to pieces. The current arrangement is + very awkward (and somewhat embarrassing) to describe, and the new + arrangement hasn't actually been implemented yet. + + This is a two-step process. First, we need to tell Zebra that we want to support the BIB-1 attribute set. Then we need to tell it which elements of its record pertain to access point 4. 
diff --git a/doc/harvest.mbox b/doc/harvest.mbox deleted file mode 100644 index 0f38a3a..0000000 --- a/doc/harvest.mbox +++ /dev/null @@ -1,360 +0,0 @@ -From zebralist-admin@indexdata.dk Sun Nov 24 23:16:24 2002 -MIME-Version: 1.0 -Envelope-to: zebra@miketaylor.org.uk -Content-Type: text/plain; - charset="us-ascii" -From: Kang-Jin Lee -To: zebralist@indexdata.dk -User-Agent: KMail/1.4.3 -X-Spam-Level: -Subject: [Zebralist] Some progress on Harvest's move to Zebra -Sender: zebralist-admin@indexdata.dk -X-BeenThere: zebralist@indexdata.dk -X-Mailman-Version: 2.0.11 -Precedence: bulk -List-Help: -List-Post: -List-Subscribe: , - -List-Id: Zebra Information Server -List-Unsubscribe: , - -List-Archive: -Date: Sun, 24 Nov 2002 20:45:19 +0100 -X-Spam-Status: No, hits=-1.0 required=5.0 tests=AWL version=2.20 -X-Spam-Level: -X-MIME-Autoconverted: from quoted-printable to 8bit by localhost.localdomain id gAONGNK15639 - -Hi, - -I finished first steps to use Zebra as fulltext engine for Harvest -(http://harvest.sourceforge.net/). The performance boost after -some testing are quite impressive. - -Here is my article I wrote for the Harvest mailinglist. - -Many thanks for Zebra. - ------------------------------------------------------- -Hi, - -The first results after some testing with Zebra are very promising. - -The tests were done with around 220 000 SOIF files, which occupies -1.6GB of disk space. - -Building the index from scratch takes around one hour with Zebra where -Glimpse needs around five hours. - -While glimpse blocks search requests when updating its index, Zebra -can still answer search requests. - -While the search time of glimpse varies from some seconds to some -minutes depending how expensive the query is, Zebra usually takes -around one to three seconds, even for expensive queries. - -Glimpse' index occupies around 250MB of disk space, Zebra's index -takes around 570MB. - -Zebra supports incremental indexing which will speed up indexing even -further. 
- -There are still potential for faster searches when necessary, using -tweaks on apache. - -On the other hand, modeling data is not complete, yet. - -To sum it up: -- Zebra indexes data five times faster than Glimpse -- Zebra doesn't cause downtimes for indexupdate -- Zebra's search time doesn't jump from seconds to minutes for no - obvious reason, but stays constant within a range of one to three - seconds -- Zebra can search more than 100 times faster than Glimpse -- Zebra can process multiple search requests simultaneously -- Zebra can speed up indexing by using incremental indexing -- Glimpse's index size is only around half of the Zebra's index - -kj ------------------------------------------------------- - -_______________________________________________ -Zebralist mailing list -Zebralist@indexdata.dk -http://www.indexdata.dk/mailman/listinfo/zebralist - -From mike@miketaylor.org.uk Sun Nov 24 23:41:14 2002 -Date: Sun, 24 Nov 2002 23:41:13 GMT -From: Mike Taylor -X-Was-To: lee@arco.de -X-Was-CC: zebralist@indexdata.dk -Cc: mike@localhost.localdomain -In-reply-to: <200211242045.19196.lee@arco.de> (message from Kang-Jin Lee on - Sun, 24 Nov 2002 20:45:19 +0100) -Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra - -> Date: Sun, 24 Nov 2002 20:45:19 +0100 -> From: Kang-Jin Lee -> -> Here is my article I wrote for the Harvest mailinglist. - -Hi K-J, - -It's nice to read all this good stuff about Zebra! I'm currently -working on changes to the documentation for the next Zebra release, -and I'd love to include a lightly-edited version of your message in -the new document. (Basically, I'd obscure the name of your old -engine, so it's clear that we're trying to say good things about Zebra -rather than score points off a competitor.) Would it be OK for me to -quote you? If yes in principle, then I'll run the actual wording past -you before submitting it. 
- -Thanks, - - _/|_ _______________________________________________________________ -/o ) \/ Mike Taylor www.miketaylor.org.uk -)_v__/\ "You question the worthiness of my code? I should kill you - where you stand!" -- Klingon Programming Mantra - -From lee@arco.de Mon Nov 25 10:02:13 2002 -MIME-Version: 1.0 -Envelope-to: mike@miketaylor.org.uk -Content-Type: text/plain; - charset="iso-8859-15" -From: Kang-Jin Lee -To: Mike Taylor -Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra -Date: Mon, 25 Nov 2002 08:27:42 +0100 -User-Agent: KMail/1.4.3 -In-Reply-To: <200211242340.gAONefg15769@localhost.localdomain> -X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 -X-Spam-Level: -Content-Length: 836 -X-MIME-Autoconverted: from quoted-printable to 8bit by seatbooker.net id JAA28796 - -Hi, - -On Monday 25 November 2002 00:40, you wrote: -> > Date: Sun, 24 Nov 2002 20:45:19 +0100 -> > From: Kang-Jin Lee -> > -> > Here is my article I wrote for the Harvest mailinglist. -> -> Hi K-J, -> -> It's nice to read all this good stuff about Zebra! I'm currently -> working on changes to the documentation for the next Zebra release, -> and I'd love to include a lightly-edited version of your message in -> the new document. (Basically, I'd obscure the name of your old -> engine, so it's clear that we're trying to say good things about Zebra -> rather than score points off a competitor.) Would it be OK for me to -> quote you? If yes in principle, then I'll run the actual wording past -> you before submitting it. - -You are welcome to do this. - -I am very happy to see such a nice software available under GPL. - -Thanks. 
- -kj - -From zebralist-admin@indexdata.dk Mon Nov 25 11:13:10 2002 -MIME-Version: 1.0 -Envelope-to: zebra@miketaylor.org.uk -From: Pete -X-X-Sender: qq15@uxa.liv.ac.uk -To: Kang-Jin Lee -cc: zebralist@indexdata.dk -Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra -In-Reply-To: <200211242045.19196.lee@arco.de> -Content-Type: TEXT/PLAIN; charset=US-ASCII -X-Spam-Level: -Sender: zebralist-admin@indexdata.dk -X-BeenThere: zebralist@indexdata.dk -X-Mailman-Version: 2.0.11 -Precedence: bulk -List-Help: -List-Post: -List-Subscribe: , - -List-Id: Zebra Information Server -List-Unsubscribe: , - -List-Archive: -Date: Mon, 25 Nov 2002 10:19:37 +0000 (GMT) -X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 -X-Spam-Level: -Content-Length: 2853 - -On Sun, 24 Nov 2002, Kang-Jin Lee wrote: - ->Hi, -> ->I finished first steps to use Zebra as fulltext engine for Harvest ->(http://harvest.sourceforge.net/). The performance boost after ->some testing are quite impressive. - -Hi ... I'd almost forgotten that the Harvest project is still active. - -We had a heap of challenges with our Harvest setup and with the -time taken to index and search ... we switched to using -Harvest-NG as the "reaper/gatherer" and modified Zebra to -work with SOIF and our own ranking algorithm - it's been in -service for over 6 months now. - -We had challenges with both speed of gathering and with -speed of indexing and searching but most seem to be -"managable" now. - -We offered our modifications to Zebra to Indexdata who -offered to look at them since the latest release of Zebra -is sufficiently different at the code level to make it -non-trivial for us to apply our code modifications to -it. - - -Cheers - -Pete Mallinson - -> ->Here is my article I wrote for the Harvest mailinglist. -> ->Many thanks for Zebra. -> ->------------------------------------------------------ ->Hi, -> ->The first results after some testing with Zebra are very promising. 
-> ->The tests were done with around 220 000 SOIF files, which occupies ->1.6GB of disk space. -> ->Building the index from scratch takes around one hour with Zebra where ->Glimpse needs around five hours. -> ->While glimpse blocks search requests when updating its index, Zebra ->can still answer search requests. -> ->While the search time of glimpse varies from some seconds to some ->minutes depending how expensive the query is, Zebra usually takes ->around one to three seconds, even for expensive queries. -> ->Glimpse' index occupies around 250MB of disk space, Zebra's index ->takes around 570MB. -> ->Zebra supports incremental indexing which will speed up indexing even ->further. -> ->There are still potential for faster searches when necessary, using ->tweaks on apache. -> ->On the other hand, modeling data is not complete, yet. -> ->To sum it up: ->- Zebra indexes data five times faster than Glimpse ->- Zebra doesn't cause downtimes for indexupdate ->- Zebra's search time doesn't jump from seconds to minutes for no -> obvious reason, but stays constant within a range of one to three -> seconds ->- Zebra can search more than 100 times faster than Glimpse ->- Zebra can process multiple search requests simultaneously ->- Zebra can speed up indexing by using incremental indexing ->- Glimpse's index size is only around half of the Zebra's index -> ->kj ->------------------------------------------------------ -> ->_______________________________________________ ->Zebralist mailing list ->Zebralist@indexdata.dk ->http://www.indexdata.dk/mailman/listinfo/zebralist -> - - - -_______________________________________________ -Zebralist mailing list -Zebralist@indexdata.dk -http://www.indexdata.dk/mailman/listinfo/zebralist - -From zebralist-admin@indexdata.dk Mon Nov 25 21:39:59 2002 -MIME-Version: 1.0 -Envelope-to: zebra@miketaylor.org.uk -Content-Type: text/plain; - charset="iso-8859-1" -From: Kang-Jin Lee -To: Pete -Subject: Re: [Zebralist] Some progress on Harvest's 
move to Zebra -User-Agent: KMail/1.4.3 -In-Reply-To: -Cc: zebralist@indexdata.dk -X-Spam-Level: -Sender: zebralist-admin@indexdata.dk -X-BeenThere: zebralist@indexdata.dk -X-Mailman-Version: 2.0.11 -Precedence: bulk -List-Help: -List-Post: -List-Subscribe: , - -List-Id: Zebra Information Server -List-Unsubscribe: , - -List-Archive: -Date: Mon, 25 Nov 2002 20:39:47 +0100 -X-Spam-Status: No, hits=-3.2 required=5.0 tests=IN_REP_TO,AWL version=2.20 -X-Spam-Level: -X-MIME-Autoconverted: from quoted-printable to 8bit by localhost.localdomain id gAPLdwK18535 - -Hi, - -On Monday 25 November 2002 11:19, Pete wrote: - -> On Sun, 24 Nov 2002, Kang-Jin Lee wrote: - -> >I finished first steps to use Zebra as fulltext engine for Harvest -> >(http://harvest.sourceforge.net/). The performance boost after -> >some testing are quite impressive. -> -> Hi ... I'd almost forgotten that the Harvest project is still active. - -It seems that everybody has forgotten Harvest. :-) - -> We had a heap of challenges with our Harvest setup and with the -> time taken to index and search ... we switched to using -> Harvest-NG as the "reaper/gatherer" and modified Zebra to -> work with SOIF and our own ranking algorithm - it's been in -> service for over 6 months now. - -I am very interested in your setup. Would it be possible to send -your configuration files and modifications to me? -I made some small modifications to soif.flt and am still wondering -which query I should use. It would be very nice if I don't have to -reinvent the wheel. - -> We had challenges with both speed of gathering and with -> speed of indexing and searching but most seem to be -> "managable" now. - -How big is your gatherer? - -> We offered our modifications to Zebra to Indexdata who -> offered to look at them since the latest release of Zebra -> is sufficiently different at the code level to make it -> non-trivial for us to apply our code modifications to -> it. - -I would like to take a look at the modifications, too. 
- -Thanks. - -kj - - -_______________________________________________ -Zebralist mailing list -Zebralist@indexdata.dk -http://www.indexdata.dk/mailman/listinfo/zebralist - diff --git a/doc/installation.xml b/doc/installation.xml index 05b9ab5..fd1e873 100644 --- a/doc/installation.xml +++ b/doc/installation.xml @@ -1,5 +1,5 @@ - + Installation An ANSI C compiler is required to compile the Zebra @@ -11,7 +11,8 @@ Unpack the distribution archive. The configure shell script attempts to guess correct values for various system-dependent variables used during compilation. - It uses those values to create a 'Makefile' in each directory of Zebra. + It uses those values to create a Makefile in each + directory of Zebra. @@ -26,7 +27,7 @@ The configure script attempts to use C compiler specified by the CC environment variable. - If not set, cc or GNU C will be used. + If this is not set, cc or GNU C will be used. The CFLAGS environment variable holds options to be passed to the C compiler. If you're using a Bourne-shell compatible shell you may pass something like this: @@ -34,27 +35,26 @@ CC=/opt/ccs/bin/cc CFLAGS=-O ./configure - - The configure script takes a number of arguments, you can see - them all with + + + The configure script supports various options: you can see what they + are with ./configure --help - - When configured, build the software by typing: - + Once the build environment is configured, build the software by + typing: make - - If successful, two executables are created in the sub-directory - index. + If the build is successful, two executables are created in the + sub-directory index: @@ -85,7 +85,7 @@ By default this will install the Zebra executables in /usr/local/bin, and the standard configuration files in - /usr/local/share/zebra + /usr/local/share/idzebra You can override this with the --prefix option to configure. 
diff --git a/doc/introduction.xml b/doc/introduction.xml index ad1b558..475c3e5 100644 --- a/doc/introduction.xml +++ b/doc/introduction.xml @@ -1,15 +1,14 @@ - + Introduction Overview - - Zebra + Zebra is a high-performance, general-purpose structured text - indexing and retrieval engine. It reads structured records in a + indexing and retrieval engine. It reads records in a variety of input formats (eg. email, XML, MARC) and provides access to them through a powerful combination of boolean search expressions and relevance-ranked free-text queries. @@ -49,7 +48,7 @@ - Very large databases: files for indexes, etc. can be + Very large databases: logical files can be automatically partitioned over multiple disks. @@ -57,7 +56,7 @@ Arbitrarily complex records. The internal data format - is an structured format conceptually similar to XML or GRS-1, + is a structured format conceptually similar to XML or GRS-1, which allows lists, nested structured data elements and variant forms of data. @@ -304,9 +303,45 @@ which is populated by the Harvest-NG web-crawling software. - For more information, contact John Gilbertson + For more information on Liverpool university's intranet search + architecture, contact John Gilbertson jgilbert@liverpool.ac.uk + + Kang-Jin Lee + lee@arco.de, + has recently modified the Harvest-NG web crawler to use Zebra as + its native repository engine. His comments on the switch over + from the old engine are revealing: +
+ + The first results after some testing with Zebra are very + promising. The tests were done with around 220,000 SOIF files, + which occupies 1.6GB of disk space. + + + Building the index from scratch takes around one hour with Zebra + where [old-engine] needs around five hours. While [old-engine] + blocks search requests when updating its index, Zebra can still + answer search requests. + [...] + Zebra supports incremental indexing which will speed up indexing + even further. + + + While the search time of [old-engine] varies from some seconds + to some minutes depending how expensive the query is, Zebra + usually takes around one to three seconds, even for expensive + queries. + [...] + Zebra can search more than 100 times faster than [old-engine] + and can process multiple search requests simultaneously + + + I am very happy to see such nice software available under GPL. + +
+
@@ -331,7 +366,7 @@ announcements from the authors (new releases, bug fixes, etc.) and general discussion. You are welcome to seek support there. Join by sending email to - zebra-request@indexdata.dk. Put the word + zebra-request@indexdata.dk with the word subscribe in the body of the message.
@@ -360,20 +395,17 @@ Improved support for XML in search and retrieval. Eventually, the goal is for Zebra to pull double duty as a flexible information retrieval engine and high-performance XML - repository. - - - ### Partially done. + repository. The recent addition of XPath searching is one + example of the kind of enhancement we're working on. - Access to search engine through SOAP/RPC API to allow the + Access to the search engine through SOAP/RPC API to allow the construction of applications without requiring Z39.50 tools. - - - ### Partially done, thanks to the new SRW/Z39.50 gateway. + This will shortly be available by means of Index Data's + SRW-to-Z39.50 gateway, currently in beta test. @@ -388,6 +420,15 @@ + Support for the use of Perl both for access to the Zebra API + and for building extension ``plug-ins'' such as input filters. + The code for this has been contributed to the source tree, and + is in the process of being integrated and tested. + + + + + Improved free-text searching. We're first and foremost octet jockeys and we're actively looking for organisations or people who'd like to contribute experience in relevance ranking and text diff --git a/doc/quickstart.xml b/doc/quickstart.xml index d7f0d00..1aae924 100644 --- a/doc/quickstart.xml +++ b/doc/quickstart.xml @@ -1,54 +1,27 @@ - + Quick Start - - - In this section, we will test the system by indexing a small set of sample - GILS records that are included with the software distribution. Go to the - examples/gils subdirectory of the distribution archive. - There you will find a configuration - file named zebra.cfg with the following contents: - - - # Where the schema files, attribute files, etc are located. - profilePath: ../../tab - - # Files that describe the attribute sets supported. 
attset: bib1.att - attset: gils.att - attset: explain.att - - recordtype: grs.sgml - isam: c - + + In this section, we will test the system by indexing a small set of + sample GILS records that are included with the Zebra distribution, + running Zebra as a server against the newly created database, and + searching the indexes with a client that connects to that server. - - - - The 48 test records are located in the sub directory - records. To index these, type: - + Go to the examples/gils subdirectory of the + distribution archive. The 48 test records are located in the sub + directory records. To index these, type: zebraidx update records - In the command above, the word update followed - by a directory root updates all files below that directory node. + In this command, the word update is followed + by the name of a directory: zebraidx updates all + files in the hierarchy rooted at that directory. @@ -56,7 +29,7 @@ fire up a server. To start a server on port 2100, type: - zebrasrv tcp:@:2100 + zebrasrv @:2100 @@ -66,17 +39,18 @@ named Default. The database contains records structured according to the GILS profile, and the server will - return records in either either USMARC, GRS-1, or SUTRS depending - on what your client asks for. + return records in USMARC, GRS-1, or SUTRS format depending + on what the client asks for. To test the server, you can use any Z39.50 client. - For instance, you can use the demo client that comes with YAZ: + For instance, you can use the demo command-line client that comes + with YAZ: - yaz-client tcp:localhost:2100 + yaz-client localhost:2100 @@ -92,8 +66,9 @@ 
- The default retrieval syntax for the client is USMARC. To try other - formats for the same record, try: + The default retrieval syntax for the client is USMARC, and the + default element set is F (``full record''). To + try other formats and element sets for the same record, try: @@ -110,8 +85,8 @@ You may notice that more fields are returned when your - client requests SUTRS or GRS-1 records. When retrieving GILS records, - this is normal - not all of the GILS data elements have mappings in + client requests SUTRS, GRS-1 or XML records. + This is normal - not all of the GILS data elements have mappings in the USMARC record format. -- 1.7.10.4