README

zmbot: a simple Web harvesting robot for Zebra.

Introduction

  zmbot is a simple web harvester written in Tcl. The following
  summarizes its features:

  o Simple administration. One script does the job and no external
    database is required to operate.

  o Interruptible. Harvesting may safely be stopped/interrupted at any
    point.

  o Gentle harvesting. By default a site is visited once per minute,
    and robots.txt is honored.

  o Concurrent harvesting (jobs) in one process and one thread.

  o Inspects the Content-Type header to determine the structure of a page.

  o Written in Tcl and quite portable. (Some may not consider this a
    feature; a Perl version is welcome!)

  o Creates simple XML output. One file per URL.

  The robot is started from the command line and takes one or more URLs
  as parameters. Options, prefixed with a minus sign, alter the behaviour
  of the harvesting. The following options are supported:

   -j jobs    The maximum number of concurrent HTTP sessions; default 5 jobs.

   -i idle    Idle time in milliseconds between visits to the same site;
              default 60000 = 60 seconds.

   -c count   Maximum distance from the original URL as given on the command
              line; default 50.

   -d domain  Only sites matching domain are visited. The domain given is
              a Tcl glob expression (e.g. *.somewhere.com). Remember to
              quote the domain when given on the command line so that your
              shell doesn't expand it. This option may be repeated, thus
              allowing you to specify many "allowed" domains.

   -r rules   Specifies a file with rules. See the rules file for an
              example.
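
  The quoting advice for -d can be tried out without running the robot.
  The sketch below is only an illustration: matches_domain is a
  hypothetical helper that uses a shell "case" glob as a stand-in for
  Tcl's glob matching (the two agree for simple "*" patterns), and it is
  not how zmbot itself performs the check.

```shell
# Stand-in for a domain check: a shell 'case' glob, which behaves like a
# Tcl glob for simple '*' patterns. This is NOT zmbot's actual code.
matches_domain() {
    # $1 = glob pattern, $2 = host name
    case "$2" in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

matches_domain '*.somewhere.com' www.somewhere.com && echo "www.somewhere.com: allowed"
matches_domain '*.somewhere.com' somewhere.com     || echo "somewhere.com: not allowed"

# Why quoting matters: in a directory containing files that match the
# pattern, the shell expands an unquoted glob before the robot sees it.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.dk" "$demo_dir/b.dk"
(cd "$demo_dir" && printf 'unquoted: %s\n' *.dk)    # expands to a.dk b.dk
(cd "$demo_dir" && printf 'quoted:   %s\n' '*.dk')  # stays as *.dk
rm -r "$demo_dir"
```

  Note that a pattern such as *.somewhere.com does not match the bare
  name somewhere.com, since the dot must be present in the host name.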

  Example 1: Harvest up to three links away from www.somewhere.com, world-wide:
   ./robot.tcl -c 3 http://www.somewhere.com/

  Example 2: Harvest the site www.somewhere.com only:
   ./robot.tcl -d www.somewhere.com http://www.somewhere.com/

  Example 3: Harvest up to two clicks from www.a.dk and www.b.dk in the dk domain:
   ./robot.tcl -d '*.dk' -c 2 http://www.a.dk/ http://www.b.dk/
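
  The options may also be combined. A minimal sketch, assuming the -i
  value is given in thousandths of a second, as the default
  (60000 = 60 seconds) implies. idle_arg is a hypothetical helper and not
  part of zmbot; the command is printed rather than run.

```shell
# Convert a delay in seconds to the value passed to -i.
# Assumes -i is in thousandths of a second (60000 = 60 seconds).
idle_arg() {
    echo $(( $1 * 1000 ))
}

# Print the command rather than run it, so the sketch works without zmbot:
# two concurrent jobs, 30 seconds between visits to the same site,
# dk domain only, at most two links away from the start URL.
echo ./robot.tcl -j 2 -i "$(idle_arg 30)" -d "'*.dk'" -c 2 http://www.a.dk/
```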

  The zmbot robot creates three directories, visited, unvisited and bad,
  for visited pages, unvisited pages and bad pages respectively. The
  visited area holds keywords and metadata for all successfully retrieved
  pages. The unvisited area serves as a "todo" list of pages to be visited
  in the future. The bad area holds pages that for some reason cannot be
  retrieved: nonexistent, permission denied, disallowed by robots.txt, etc.
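
  Progress of a harvest can be followed by counting the files in the
  three areas. The sketch below assumes that the directories live in the
  current working directory and that each page is stored as one regular
  file somewhere below its area; count_files is a hypothetical helper,
  not part of zmbot.

```shell
# Print the number of regular files below directory $1 (0 if it is missing).
count_files() {
    if [ -d "$1" ]; then
        find "$1" -type f | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# One line per area: name followed by file count.
for area in visited unvisited bad; do
    printf '%-10s %s\n' "$area" "$(count_files "$area")"
done
```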

Installation:

  $ ./configure
  $ make

  The configure script looks for the Tcl shell, tclsh, to determine the
  location of Tcl and its configuration file tclConfig.sh. To manually specify
  Tcl's location, add --with-tclconfig and specify the directory where
  tclConfig.sh is installed. For example:
    ./configure --with-tclconfig=/usr/local/lib
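
  If configure cannot find Tcl, the directory holding tclConfig.sh can
  be searched for by hand. A minimal sketch follows; find_tclconfig is a
  hypothetical helper and the candidate directories are guesses that
  will differ between systems.

```shell
# Print the first directory among the arguments that contains tclConfig.sh.
find_tclconfig() {
    for dir in "$@"; do
        if [ -f "$dir/tclConfig.sh" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate directories are guesses; adjust them for your system.
if tcl_dir=$(find_tclconfig /usr/local/lib /usr/lib /usr/lib64 /usr/lib/tcl8.6); then
    echo "./configure --with-tclconfig=$tcl_dir"
else
    echo "tclConfig.sh not found in the candidate directories"
fi
```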