All sorts of minor and semi-major improvements.

author Mike Taylor <mike@indexdata.com>

Sun, 1 Dec 2002 23:26:26 +0000 (23:26 +0000)

committer Mike Taylor <mike@indexdata.com>

Sun, 1 Dec 2002 23:26:26 +0000 (23:26 +0000)
diff --git a/doc/examples.xml b/doc/examples.xml
index 10cbeb5..dc95e12 100644
--- a/doc/examples.xml
+++ b/doc/examples.xml
@@ -1,5 +1,5 @@
  <chapter id="examples">
- <!-- $Id: examples.xml,v 1.17 2002-11-08 17:00:57 mike Exp $ -->
+ <!-- $Id: examples.xml,v 1.18 2002-12-01 23:26:26 mike Exp $ -->
   <title>Example Configurations</title>
  
   <sect1>
@@ -19,23 +19,35 @@
  
      <listitem>
       <para>
-      Where to find subsidiary configuration files, including
-      <literal>default.idx</literal>
+      Where to find subsidiary configuration files, including both
+      those that are named explicitly and a few ``magic'' files such
+      as <literal>default.idx</literal>,
        which specifies the default indexing rules.
       </para>
      </listitem>
  
      <listitem>
       <para>
-      What attribute sets to recognise in searches.
+      What record schemas to support.  (Subsidiary files specify how
+      to index the contents of records in those schemas, and what
+      format to use when presenting records in those schemas to client
+      software.)
       </para>
      </listitem>
  
      <listitem>
       <para>
-      Policy details such as what record type to expect, what
-      low-level indexing algorithm to use, how to identify potential
-      duplicate records, etc.
+      What attribute sets to recognise in searches.  (Subsidiary files
+      specify how to interpret the attributes in terms
+      of the indexes that are created on the records.)
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
+      Policy details such as what type of input format to expect when
+      adding new records, what low-level indexing algorithm to use,
+      how to identify potential duplicate records, etc.
       </para>
      </listitem>
  
@@ -69,6 +81,10 @@
     <literal>dino.tree</literal>.)
     Type <literal>make records/dino.xml</literal>
     to make the XML data file.
+   (Or you could just type <literal>make</literal> to build the XML
+   data file, create the database and populate it with the taxonomic
+   records all in one shot - but then you wouldn't learn anything,
+   would you?  :-)
    </para>
    <para>
     Now we need to create a Zebra database to hold and index the XML
@@ -76,7 +92,7 @@
     Zebra indexer, <literal>zebraidx</literal>, which is
     driven by the <literal>zebra.cfg</literal> configuration file.
     For our purposes, we don't need any
-   special behaviour - we can use the defaults - so we start with a
+   special behaviour - we can use the defaults - so we can start with a
     minimal file that just tells <literal>zebraidx</literal> where to
     find the default indexing rules, and how to parse the records:
     <screen>
@@ -108,7 +124,7 @@
     XPath-based boolean queries and fetch the XML records that satisfy
     them:
     <screen>
-    $ yaz-client tcp:@:9999
+    $ yaz-client @:9999
      Connecting...Ok.
      Z&gt; find @attr 1=/Zthes/termName Sauroposeidon
      Number of hits: 1
@@ -118,6 +134,7 @@
       &lt;termId&gt;22&lt;/termId&gt;
       &lt;termName&gt;Sauroposeidon&lt;/termName&gt;
       &lt;termType&gt;PT&lt;/termType&gt;
+     &lt;termNote&gt;The tallest known dinosaur (18m)&lt;/termNote&gt;
       &lt;relation&gt;
        &lt;relationType&gt;BT&lt;/relationType&gt;
        &lt;termId&gt;21&lt;/termId&gt;
@@ -126,7 +143,7 @@
       &lt;/relation&gt;
  
        &lt;idzebra xmlns="http://www.indexdata.dk/zebra/"&gt;
-       &lt;size&gt;245&lt;/size&gt;
+       &lt;size&gt;300&lt;/size&gt;
         &lt;localnumber&gt;23&lt;/localnumber&gt;
         &lt;filename&gt;records/dino.xml&lt;/filename&gt;
        &lt;/idzebra&gt;
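
Note on the hunk above: the query `find @attr 1=/Zthes/termName Sauroposeidon` selects records by a path into their XML. The following Python fragment is only a rough sketch of what such a path selects, using an abridged copy of the record from the session. It is not how Zebra evaluates queries: the `matches` helper is hypothetical, the enclosing `<Zthes>` root is assumed from the path in the query, and exact string equality stands in for whatever term matching Zebra really applies.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record shown in the yaz-client session; the
# enclosing <Zthes> root is an assumption based on the query path.
RECORD = """\
<Zthes>
 <termId>22</termId>
 <termName>Sauroposeidon</termName>
 <termType>PT</termType>
 <termNote>The tallest known dinosaur (18m)</termNote>
</Zthes>
"""


def matches(record_xml, path, term):
    """Hypothetical helper: True if an element selected by an absolute
    path such as /Zthes/termName has text exactly equal to term."""
    root = ET.fromstring(record_xml)
    steps = path.strip("/").split("/")
    # The first step of the path must name the record's root element.
    if not steps or steps[0] != root.tag:
        return False
    rest = "/".join(steps[1:])
    nodes = root.findall("./" + rest) if rest else [root]
    return any((node.text or "").strip() == term for node in nodes)


print(matches(RECORD, "/Zthes/termName", "Sauroposeidon"))
```

The point of the sketch is only that the search is tied to the physical layout of the record: change the element names and the same path stops matching.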
@@ -134,7 +151,7 @@
     </screen>
    </para>
    <para>
-   Now wasn't that easy?
+   Now wasn't that nice and easy?
    </para>
   </sect1>
  
@@ -158,7 +175,7 @@
     significantly because it ties searching semantics to the physical
     structure of the searched records.  You can't use the same search
     specification to search two databases if their internal
-   representations are different.  Consider an alternative taxonomy
+   representations are different.  Consider a different taxonomy
     database in which the records have taxon names specified
     inside a <literal>&lt;name&gt;</literal> element nested within a
     <literal>&lt;identification&gt;</literal> element
@@ -175,8 +192,8 @@
     said about implementation: in a given database, an access point
     might be implemented as an index, a path into physical records, an
     algorithm for interrogating relational tables or whatever works.
-   The key point is that the semantics of an access point are fixed
-   and well defined.
+   The only important point is that the semantics of an access
+   point are fixed and well defined.
    </para>
    <para>
     For convenience, access points are gathered into <firstterm>attribute
@@ -192,7 +209,7 @@
     In practice, the BIB-1 attribute set has tended to be a dumping
     ground for all sorts of access points, so that, for example, it
     includes some geospatial access points as well as strictly
-   bibliographic ones.  Nevertheless, the key point is that this model
+   bibliographic ones.  Nevertheless, this model
     allows a layer of abstraction over the physical representation of
     records in databases.
    </para>
@@ -210,6 +227,11 @@
     <literal>&lt;Zthes&gt;</literal> element.
    </para>
    <para>
+   ### Here's where it all goes to pieces.  The current arrangement is
+   very awkward (and somewhat embarrassing) to describe, and the new
+   arrangement hasn't actually been implemented yet.
+  </para>
+  <para>
     This is a two-step process.  First, we need to tell Zebra that we
     want to support the BIB-1 attribute set.  Then we need to tell it
     which elements of its record pertain to access point 4.
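
Note on the access-point text in this file: the idea that one abstract access point can be backed by different physical paths in different databases can be sketched as below. Everything here is illustrative and assumed: the table layout, the `search` helper, the database names and the `<taxon>` root element are hypothetical, and this is not Zebra's configuration format (which the text above says is still in flux).

```python
import xml.etree.ElementTree as ET

# Hypothetical per-database tables: access point number -> element path
# relative to the record root.  Access point 4 is the one named in the text.
ACCESS_POINTS = {
    "zthes": {4: "termName"},
    "alt": {4: "identification/name"},
}

# One sample record per database; the <taxon> root is an invented name.
RECORDS = {
    "zthes": "<Zthes><termName>Sauroposeidon</termName></Zthes>",
    "alt": "<taxon><identification><name>Sauroposeidon</name>"
           "</identification></taxon>",
}


def search(database, access_point, term):
    """Hypothetical helper: resolve the access point to this database's
    own path, then test the record for an exact text match."""
    path = ACCESS_POINTS[database].get(access_point)
    if path is None:
        return False  # this database does not support the access point
    root = ET.fromstring(RECORDS[database])
    return any((node.text or "").strip() == term for node in root.findall(path))


# The same abstract query works against both physical layouts.
for db in RECORDS:
    print(db, search(db, 4, "Sauroposeidon"))
```

The client only ever names access point 4; each database decides privately which elements that means.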
diff --git a/doc/harvest.mbox b/doc/harvest.mbox
deleted file mode 100644
index 0f38a3a..0000000
--- a/doc/harvest.mbox
+++ /dev/null
@@ -1,360 +0,0 @@
-From zebralist-admin@indexdata.dk  Sun Nov 24 23:16:24 2002
-MIME-Version: 1.0
-Envelope-to: zebra@miketaylor.org.uk
-Content-Type: text/plain;
-  charset="us-ascii"
-From: Kang-Jin Lee <lee@arco.de>
-To: zebralist@indexdata.dk
-User-Agent: KMail/1.4.3
-X-Spam-Level: 
-Subject: [Zebralist] Some progress on Harvest's move to Zebra
-Sender: zebralist-admin@indexdata.dk
-X-BeenThere: zebralist@indexdata.dk
-X-Mailman-Version: 2.0.11
-Precedence: bulk
-List-Help: <mailto:zebralist-request@indexdata.dk?subject=help>
-List-Post: <mailto:zebralist@indexdata.dk>
-List-Subscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=subscribe>
-List-Id: Zebra Information Server <zebralist.indexdata.dk>
-List-Unsubscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=unsubscribe>
-List-Archive: <http://www.indexdata.dk/pipermail/zebralist/>
-Date: Sun, 24 Nov 2002 20:45:19 +0100
-X-Spam-Status: No, hits=-1.0 required=5.0 tests=AWL version=2.20
-X-Spam-Level: 
-X-MIME-Autoconverted: from quoted-printable to 8bit by localhost.localdomain id gAONGNK15639
-
-Hi,
-
-I finished first steps to use Zebra as fulltext engine for Harvest
-(http://harvest.sourceforge.net/). The performance boost after
-some testing are quite impressive.
-
-Here is my article I wrote for the Harvest mailinglist.
-
-Many thanks for Zebra.
-
-------------------------------------------------------
-Hi,
-
-The first results after some testing with Zebra are very promising.
-
-The tests were done with around 220 000 SOIF files, which occupies
-1.6GB of disk space.
-
-Building the index from scratch takes around one hour with Zebra where
-Glimpse needs around five hours.
-
-While glimpse blocks search requests when updating its index, Zebra
-can still answer search requests.
-
-While the search time of glimpse varies from some seconds to some
-minutes depending how expensive the query is, Zebra usually takes
-around one to three seconds, even for expensive queries.
-
-Glimpse' index occupies around 250MB of disk space, Zebra's index
-takes around 570MB.
-
-Zebra supports incremental indexing which will speed up indexing even
-further.
-
-There are still potential for faster searches when necessary, using
-tweaks on apache.
-
-On the other hand, modeling data is not complete, yet.
-
-To sum it up:
-- Zebra indexes data five times faster than Glimpse
-- Zebra doesn't cause downtimes for indexupdate
-- Zebra's search time doesn't jump from seconds to minutes for no
-  obvious reason, but stays constant within a range of one to three
-  seconds
-- Zebra can search more than 100 times faster than Glimpse
-- Zebra can process multiple search requests simultaneously
-- Zebra can speed up indexing by using incremental indexing
-- Glimpse's index size is only around half of the Zebra's index
-
-kj
-------------------------------------------------------
-
-_______________________________________________
-Zebralist mailing list
-Zebralist@indexdata.dk
-http://www.indexdata.dk/mailman/listinfo/zebralist
-
-From mike@miketaylor.org.uk  Sun Nov 24 23:41:14 2002
-Date: Sun, 24 Nov 2002 23:41:13 GMT
-From: Mike Taylor <mike@miketaylor.org.uk>
-X-Was-To: lee@arco.de
-X-Was-CC: zebralist@indexdata.dk
-Cc: mike@localhost.localdomain
-In-reply-to: <200211242045.19196.lee@arco.de> (message from Kang-Jin Lee on
-       Sun, 24 Nov 2002 20:45:19 +0100)
-Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra
-
-> Date: Sun, 24 Nov 2002 20:45:19 +0100
-> From: Kang-Jin Lee <lee@arco.de>
-> 
-> Here is my article I wrote for the Harvest mailinglist.
-
-Hi K-J,
-
-It's nice to read all this good stuff about Zebra!  I'm currently
-working on changes to the documentation for the next Zebra release,
-and I'd love to include a lightly-edited version of your message in
-the new document.  (Basically, I'd obscure the name of your old
-engine, so it's clear that we're trying to say good things about Zebra
-rather than score points off a competitor.)  Would it be OK for me to
-quote you?  If yes in principle, then I'll run the actual wording past
-you before submitting it.
-
-Thanks,
-
- _/|_   _______________________________________________________________
-/o ) \/  Mike Taylor   <mike@miketaylor.org.uk>   www.miketaylor.org.uk
-)_v__/\  "You question the worthiness of my code?  I should kill you
-        where you stand!" -- Klingon Programming Mantra
-
-From lee@arco.de Mon Nov 25 10:02:13 2002
-MIME-Version: 1.0
-Envelope-to: mike@miketaylor.org.uk
-Content-Type: text/plain;
-  charset="iso-8859-15"
-From: Kang-Jin Lee <lee@arco.de>
-To: Mike Taylor <mike@miketaylor.org.uk>
-Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra
-Date: Mon, 25 Nov 2002 08:27:42 +0100
-User-Agent: KMail/1.4.3
-In-Reply-To: <200211242340.gAONefg15769@localhost.localdomain>
-X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
-X-Spam-Level: 
-Content-Length: 836
-X-MIME-Autoconverted: from quoted-printable to 8bit by seatbooker.net id JAA28796
-
-Hi,
-
-On Monday 25 November 2002 00:40, you wrote:
-> > Date: Sun, 24 Nov 2002 20:45:19 +0100
-> > From: Kang-Jin Lee <lee@arco.de>
-> >
-> > Here is my article I wrote for the Harvest mailinglist.
->
-> Hi K-J,
->
-> It's nice to read all this good stuff about Zebra!  I'm currently
-> working on changes to the documentation for the next Zebra release,
-> and I'd love to include a lightly-edited version of your message in
-> the new document.  (Basically, I'd obscure the name of your old
-> engine, so it's clear that we're trying to say good things about Zebra
-> rather than score points off a competitor.)  Would it be OK for me to
-> quote you?  If yes in principle, then I'll run the actual wording past
-> you before submitting it.
-
-You are welcome to do this.
-
-I am very happy to see such a nice software available under GPL.
-
-Thanks.
-
-kj
-
-From zebralist-admin@indexdata.dk  Mon Nov 25 11:13:10 2002
-MIME-Version: 1.0
-Envelope-to: zebra@miketaylor.org.uk
-From: Pete <P.D.Mallinson@liverpool.ac.uk>
-X-X-Sender: qq15@uxa.liv.ac.uk
-To: Kang-Jin Lee <lee@arco.de>
-cc: zebralist@indexdata.dk
-Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra
-In-Reply-To: <200211242045.19196.lee@arco.de>
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-X-Spam-Level: 
-Sender: zebralist-admin@indexdata.dk
-X-BeenThere: zebralist@indexdata.dk
-X-Mailman-Version: 2.0.11
-Precedence: bulk
-List-Help: <mailto:zebralist-request@indexdata.dk?subject=help>
-List-Post: <mailto:zebralist@indexdata.dk>
-List-Subscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=subscribe>
-List-Id: Zebra Information Server <zebralist.indexdata.dk>
-List-Unsubscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=unsubscribe>
-List-Archive: <http://www.indexdata.dk/pipermail/zebralist/>
-Date: Mon, 25 Nov 2002 10:19:37 +0000 (GMT)
-X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
-X-Spam-Level: 
-Content-Length: 2853
-
-On Sun, 24 Nov 2002, Kang-Jin Lee wrote:
-
->Hi,
->
->I finished first steps to use Zebra as fulltext engine for Harvest
->(http://harvest.sourceforge.net/). The performance boost after
->some testing are quite impressive.
-
-Hi ... I'd almost forgotten that the Harvest project is still active.
-
-We had a heap of challenges with our Harvest setup and with the
-time taken to index and search ... we switched to using
-Harvest-NG as the "reaper/gatherer" and modified Zebra to
-work with SOIF and our own ranking algorithm - it's been in
-service for over 6 months now.
-
-We had challenges with both speed of gathering and with
-speed of indexing and searching but most seem to be
-"managable" now.
-
-We offered our modifications to Zebra to Indexdata who
-offered to look at them since the latest release of Zebra
-is sufficiently different at the code level to make it
-non-trivial for us to apply our code modifications to
-it.
-
-
-Cheers
-
-Pete Mallinson
-
->
->Here is my article I wrote for the Harvest mailinglist.
->
->Many thanks for Zebra.
->
->------------------------------------------------------
->Hi,
->
->The first results after some testing with Zebra are very promising.
->
->The tests were done with around 220 000 SOIF files, which occupies
->1.6GB of disk space.
->
->Building the index from scratch takes around one hour with Zebra where
->Glimpse needs around five hours.
->
->While glimpse blocks search requests when updating its index, Zebra
->can still answer search requests.
->
->While the search time of glimpse varies from some seconds to some
->minutes depending how expensive the query is, Zebra usually takes
->around one to three seconds, even for expensive queries.
->
->Glimpse' index occupies around 250MB of disk space, Zebra's index
->takes around 570MB.
->
->Zebra supports incremental indexing which will speed up indexing even
->further.
->
->There are still potential for faster searches when necessary, using
->tweaks on apache.
->
->On the other hand, modeling data is not complete, yet.
->
->To sum it up:
->- Zebra indexes data five times faster than Glimpse
->- Zebra doesn't cause downtimes for indexupdate
->- Zebra's search time doesn't jump from seconds to minutes for no
->  obvious reason, but stays constant within a range of one to three
->  seconds
->- Zebra can search more than 100 times faster than Glimpse
->- Zebra can process multiple search requests simultaneously
->- Zebra can speed up indexing by using incremental indexing
->- Glimpse's index size is only around half of the Zebra's index
->
->kj
->------------------------------------------------------
->
->_______________________________________________
->Zebralist mailing list
->Zebralist@indexdata.dk
->http://www.indexdata.dk/mailman/listinfo/zebralist
->
-
-
-
-_______________________________________________
-Zebralist mailing list
-Zebralist@indexdata.dk
-http://www.indexdata.dk/mailman/listinfo/zebralist
-
-From zebralist-admin@indexdata.dk  Mon Nov 25 21:39:59 2002
-MIME-Version: 1.0
-Envelope-to: zebra@miketaylor.org.uk
-Content-Type: text/plain;
-  charset="iso-8859-1"
-From: Kang-Jin Lee <lee@arco.de>
-To: Pete <P.D.Mallinson@liverpool.ac.uk>
-Subject: Re: [Zebralist] Some progress on Harvest's move to Zebra
-User-Agent: KMail/1.4.3
-In-Reply-To: <Pine.GSO.4.44.0211251007060.15395-100000@uxa.liv.ac.uk>
-Cc: zebralist@indexdata.dk
-X-Spam-Level: 
-Sender: zebralist-admin@indexdata.dk
-X-BeenThere: zebralist@indexdata.dk
-X-Mailman-Version: 2.0.11
-Precedence: bulk
-List-Help: <mailto:zebralist-request@indexdata.dk?subject=help>
-List-Post: <mailto:zebralist@indexdata.dk>
-List-Subscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=subscribe>
-List-Id: Zebra Information Server <zebralist.indexdata.dk>
-List-Unsubscribe: <http://www.indexdata.dk/mailman/listinfo/zebralist>,
-       <mailto:zebralist-request@indexdata.dk?subject=unsubscribe>
-List-Archive: <http://www.indexdata.dk/pipermail/zebralist/>
-Date: Mon, 25 Nov 2002 20:39:47 +0100
-X-Spam-Status: No, hits=-3.2 required=5.0 tests=IN_REP_TO,AWL version=2.20
-X-Spam-Level: 
-X-MIME-Autoconverted: from quoted-printable to 8bit by localhost.localdomain id gAPLdwK18535
-
-Hi,
-
-On Monday 25 November 2002 11:19, Pete wrote:
-
-> On Sun, 24 Nov 2002, Kang-Jin Lee wrote:
-
-> >I finished first steps to use Zebra as fulltext engine for Harvest
-> >(http://harvest.sourceforge.net/). The performance boost after
-> >some testing are quite impressive.
->
-> Hi ... I'd almost forgotten that the Harvest project is still active.
-
-It seems that everybody has forgotten Harvest. :-)
-
-> We had a heap of challenges with our Harvest setup and with the
-> time taken to index and search ... we switched to using
-> Harvest-NG as the "reaper/gatherer" and modified Zebra to
-> work with SOIF and our own ranking algorithm - it's been in
-> service for over 6 months now.
-
-I am very interested in your setup. Would it be possible to send
-your configuration files and modifications to me?
-I made some small modifications to soif.flt and am still wondering
-which query I should use. It would be very nice if I don't have to
-reinvent the wheel.
-
-> We had challenges with both speed of gathering and with
-> speed of indexing and searching but most seem to be
-> "managable" now.
-
-How big is your gatherer?
-
-> We offered our modifications to Zebra to Indexdata who
-> offered to look at them since the latest release of Zebra
-> is sufficiently different at the code level to make it
-> non-trivial for us to apply our code modifications to
-> it.
-
-I would like to take a look at the modifications, too.
-
-Thanks.
-
-kj
-
-
-_______________________________________________
-Zebralist mailing list
-Zebralist@indexdata.dk
-http://www.indexdata.dk/mailman/listinfo/zebralist
-
diff --git a/doc/installation.xml b/doc/installation.xml
index 05b9ab5..fd1e873 100644
--- a/doc/installation.xml
+++ b/doc/installation.xml
@@ -1,5 +1,5 @@
  <chapter id="installation">
- <!-- $Id: installation.xml,v 1.5 2002-10-08 08:09:43 mike Exp $ -->
+ <!-- $Id: installation.xml,v 1.6 2002-12-01 23:26:26 mike Exp $ -->
   <title>Installation</title>
   <para>
    An ANSI C compiler is required to compile the Zebra
@@ -11,7 +11,8 @@
    Unpack the distribution archive. The <literal>configure</literal>
    shell script attempts to guess correct values for various
    system-dependent variables used during compilation.
-  It uses those values to create a 'Makefile' in each directory of Zebra.
+  It uses those values to create a <literal>Makefile</literal> in each
+  directory of Zebra.
   </para>
   
   <para>
@@ -26,7 +27,7 @@
   <para>
    The configure script attempts to use C compiler specified by
    the <literal>CC</literal> environment variable.
-  If not set, <literal>cc</literal> or GNU C will be used.
+  If this is not set, <literal>cc</literal> or GNU C will be used.
    The <literal>CFLAGS</literal> environment variable holds
    options to be passed to the C compiler. If you're using a
    Bourne-shell compatible shell you may pass something like this:
@@ -34,27 +35,26 @@
    <screen>
    CC=/opt/ccs/bin/cc CFLAGS=-O ./configure
    </screen>
-  
-  The configure script takes a number of arguments, you can see
-  them all with
+ </para>
+ <para>
+  The configure script supports various options: you can see what they
+  are with
    <screen>
    ./configure --help
    </screen>
-
   </para>
   
   <para>
-  When configured, build the software by typing:
-  
+  Once the build environment is configured, build the software by
+  typing:
    <screen>
    make
    </screen>
- 
   </para>
   
   <para>
-  If successful, two executables are created in the sub-directory
-  <literal>index</literal>.
+  If the build is successful, two executables are created in the
+  sub-directory <literal>index</literal>:
    <variablelist>
     
     <varlistentry>
@@ -85,7 +85,7 @@
    By default this will install the Zebra executables in 
    <filename>/usr/local/bin</filename>,
    and the standard configuration files in 
-  <filename>/usr/local/share/zebra</filename>
+  <filename>/usr/local/share/idzebra</filename>
    You can override this with the <literal>--prefix</literal> option
    to configure.
   </para>
diff --git a/doc/introduction.xml b/doc/introduction.xml
index ad1b558..475c3e5 100644
--- a/doc/introduction.xml
+++ b/doc/introduction.xml
@@ -1,15 +1,14 @@
  <chapter id="introduction">
- <!-- $Id: introduction.xml,v 1.21 2002-11-08 17:00:57 mike Exp $ -->
+ <!-- $Id: introduction.xml,v 1.22 2002-12-01 23:26:26 mike Exp $ -->
   <title>Introduction</title>
   
   <sect1>
    <title>Overview</title>
    
    <para>
-   <ulink url="http://indexdata.dk/zebra/">
-     Zebra</ulink>
+   <ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
     is a high-performance, general-purpose structured text
-   indexing and retrieval engine. It reads structured records in a
+   indexing and retrieval engine. It reads records in a
     variety of input formats (eg. email, XML, MARC) and provides access
     to them through a powerful combination of boolean search
     expressions and relevance-ranked free-text queries.
@@ -49,7 +48,7 @@
  
      <listitem>
       <para>
-      Very large databases: files for indexes, etc. can be
+      Very large databases: logical files can be
        automatically partitioned over multiple disks.
       </para>
      </listitem>
@@ -57,7 +56,7 @@
      <listitem>
       <para>
        Arbitrarily complex records.  The internal data format
-      is an structured format conceptually similar to XML or GRS-1,
+      is a structured format conceptually similar to XML or GRS-1,
        which allows lists, nested structured data elements and
        variant forms of data.
       </para>
@@ -304,9 +303,45 @@
      which is populated by the Harvest-NG web-crawling software.
     </para>
     <para>
-    For more information, contact John Gilbertson
+    For more information on Liverpool University's intranet search
+    architecture, contact John Gilbertson
      <email>jgilbert@liverpool.ac.uk</email>
     </para>
+   <para>
+    Kang-Jin Lee
+    <email>lee@arco.de</email>,
+    has recently modified the Harvest-NG web crawler to use Zebra as
+    its native repository engine.  His comments on the switch over
+    from the old engine are revealing:
+    <blockquote>
+     <para>
+      The first results after some testing with Zebra are very
+      promising.  The tests were done with around 220,000 SOIF files,
+      which occupies 1.6GB of disk space.
+     </para>
+     <para>
+      Building the index from scratch takes around one hour with Zebra
+      where [old-engine] needs around five hours.  While [old-engine]
+      blocks search requests when updating its index, Zebra can still
+      answer search requests.
+      [...]
+      Zebra supports incremental indexing which will speed up indexing
+      even further.
+     </para>
+     <para>
+      While the search time of [old-engine] varies from some seconds
+      to some minutes depending how expensive the query is, Zebra
+      usually takes around one to three seconds, even for expensive
+      queries.
+      [...]
+      Zebra can search more than 100 times faster than [old-engine]
+      and can process multiple search requests simultaneously
+     </para>
+     <para>
+      I am very happy to see such nice software available under GPL.
+     </para>
+    </blockquote>
+   </para>
    </sect2>
   </sect1>
  
@@ -331,7 +366,7 @@
     announcements from the authors (new
     releases, bug fixes, etc.) and general discussion.  You are welcome
     to seek support there.  Join by sending email to
-   <email>zebra-request@indexdata.dk</email>. Put the word
+   <email>zebra-request@indexdata.dk</email> with the word
     <literal>subscribe</literal> in the body of the message.
    </para>
    <para>
@@ -360,20 +395,17 @@
         Improved support for XML in search and retrieval. Eventually,
         the goal is for Zebra to pull double duty as a flexible
         information retrieval engine and high-performance XML
-       repository.
-     </para>
-     <para>
-       ### Partially done.
+       repository.  The recent addition of XPath searching is one
+       example of the kind of enhancement we're working on.
       </para>
      </listitem>
  
      <listitem>
       <para>
-       Access to search engine through SOAP/RPC API to allow the
+       Access to the search engine through SOAP/RPC API to allow the
         construction of applications without requiring Z39.50 tools.
-     </para>
-     <para>
-       ### Partially done, thanks to the new SRW/Z39.50 gateway.
+       This will shortly be available by means of Index Data's
+       SRW-to-Z39.50 gateway, currently in beta test.
       </para>
      </listitem>
  
@@ -388,6 +420,15 @@
  
      <listitem>
       <para>
+       Support for the use of Perl both for access to the Zebra API
+       and for building extension ``plug-ins'' such as input filters.
+       The code for this has been contributed to the source tree, and
+       is in the process of being integrated and tested.
+     </para>
+    </listitem>
+
+    <listitem>
+     <para>
         Improved free-text searching. We're first and foremost octet jockeys and
         we're actively looking for organisations or people who'd like
         to contribute experience in relevance ranking and text
diff --git a/doc/quickstart.xml b/doc/quickstart.xml

index d7f0d00..1aae924 100644
--- a/doc/quickstart.xml
+++ b/doc/quickstart.xml
@@ -1,54 +1,27 @@
  <chapter id="quick-start">
- <!-- $Id: quickstart.xml,v 1.7 2002-10-30 14:35:09 adam Exp $ -->
+ <!-- $Id: quickstart.xml,v 1.8 2002-12-01 23:26:26 mike Exp $ -->
   <title>Quick Start </title>
- 
- <!--
-  FIXME - Start with the new improved example scripts that run 
-  without any configuration file changes!
-       ### do we want this now we have "examples.html"? - mike, 15/10/02
- -->
  
   <para>
-  In this section, we will test the system by indexing a small set of sample
-  GILS records that are included with the software distribution. Go to the
-  <literal>examples/gils</literal> subdirectory of the distribution archive.
-  There you will find a configuration
-  file named <literal>zebra.cfg</literal> with the following contents:
-  
-  <screen>
-   # Where the schema files, attribute files, etc are located.
-   profilePath: ../../tab
-
-   # Files that describe the attribute sets supported.
-   attset: bib1.att
-   attset: gils.att
-   attset: explain.att
-
-   recordtype: grs.sgml
-   isam: c
-  </screen>
+  <!-- ### ulink to GILS profile: what's the URL? -->
+  In this section, we will test the system by indexing a small set of
+  sample GILS records that are included with the Zebra distribution,
+  running Zebra as a server against the newly created database, and
+  searching the indexes with a client that connects to that server.
   </para>
-
- <!--  No longer necessary
- <para>
-  If necessary, edit the file and set <literal>profilePath</literal> to the path of the
-  YAZ profile tables (sub directory <literal>tab</literal> of the YAZ
-  distribution archive).
- </para>
- -->
- 
   <para>
-  The 48 test records are located in the sub directory
-  <literal>records</literal>. To index these, type:
-  
+  Go to the <literal>examples/gils</literal> subdirectory of the
+  distribution archive.  The 48 test records are located in the sub
+  directory <literal>records</literal>. To index these, type:
    <screen>
     zebraidx update records
    </screen>
   </para>
   
   <para>
-  In the command above, the word <literal>update</literal> followed
-  by a directory root updates all files below that directory node.
+  In this command, the word <literal>update</literal> is followed
+  by the name of a directory: <literal>zebraidx</literal> updates all
+  files in the hierarchy rooted at that directory.
   </para>
   
   <para>
@@ -56,7 +29,7 @@
    fire up a server. To start a server on port 2100, type:
    
    <screen>
-   zebrasrv tcp:@:2100
+   zebrasrv @:2100
    </screen>
    
   </para>
@@ -66,17 +39,18 @@
    named <literal>Default</literal>.
    The database contains records structured according to
    the GILS profile, and the server will
-  return records in either either USMARC, GRS-1, or SUTRS depending
-  on what your client asks for.
+  return records in USMARC, GRS-1, or SUTRS format depending
+  on what the client asks for.
   </para>
   
   <para>
    To test the server, you can use any Z39.50 client.
-  For instance, you can use the demo client that comes with YAZ:
+  For instance, you can use the demo command-line client that comes
+  with YAZ:
   </para>
   <para>
    <screen>
-   yaz-client tcp:localhost:2100
+   yaz-client localhost:2100
    </screen>
   </para>
   
@@ -92,8 +66,9 @@
   </para>
   
   <para>
-  The default retrieval syntax for the client is USMARC. To try other
-  formats for the same record, try:
+  The default retrieval syntax for the client is USMARC, and the
+  default element set is <literal>F</literal> (``full record''). To
+  try other formats and element sets for the same record, try:
   </para>
   <para>
    <screen>
@@ -110,8 +85,8 @@
   
   <note>
    <para>You may notice that more fields are returned when your
-   client requests SUTRS or GRS-1 records. When retrieving GILS records,
-   this is normal - not all of the GILS data elements have mappings in
+   client requests SUTRS, GRS-1 or XML records.
+   This is normal - not all of the GILS data elements have mappings in
     the USMARC record format.
    </para>
   </note>
doc/examples.xml
doc/harvest.mbox [deleted file]
doc/installation.xml
doc/introduction.xml
doc/quickstart.xml