The RootNode specification facility described in Section 4.2 provides a basic set of default enumeration actions for RootNodes. Often it is useful to enumerate beyond the default limits -- for example, to increase the enumeration limit beyond 250 URLs, or to allow site boundaries to be crossed when enumerating HTML links. Starting with Harvest Version 1.1, it is possible to specify these and other aspects of enumeration, using the following syntax (which is backwards-compatible with Harvest Version 1.0):
    <RootNodes>
    URL EnumSpec
    URL EnumSpec
    ...
    </RootNodes>
where EnumSpec is on a single line (using ``\'' to escape
linefeeds), with the following syntax:
    URL=Number[,URL-Filter-filename] \
    Host=Number[,Host-Filter-filename] \
    Access=TypeList \
    Delay=Number \
    Depth=Number
The EnumSpec modifiers are all optional, and have the following meanings:
In an Access TypeList, use a ``|'' character between type names to
allow multiple access methods. For example,
``Access=HTTP|FTP|Gopher'' will follow HTTP, FTP, and Gopher URLs
while enumerating an HTTP RootNode URL.
URL-Max and Host-Max are the Number values given to URL= and Host=.
By default, URL-Max is 250, URL-Filter imposes no limit, Host-Max
is 1, Host-Filter imposes no limit, Access is HTTP only, Delay is
1 second, and Depth is zero. There is no way to specify an unlimited
value for URL-Max or Host-Max.
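The way omitted modifiers fall back to these defaults can be sketched in Python. This is only an illustration of the syntax described above, not the Gatherer's own parser: the function name parse_enum_spec, the dictionary keys, and the strict error handling are all assumptions made for the example.

```python
# Hypothetical sketch: parse an EnumSpec string and fill in the documented
# defaults. Names and error handling are assumptions, not Harvest code.
DEFAULTS = {
    "url_max": 250,       # URL-Max
    "url_filter": None,   # URL-Filter: no filter file, so no limit
    "host_max": 1,        # Host-Max
    "host_filter": None,  # Host-Filter: no filter file, so no limit
    "access": ["HTTP"],   # Access
    "delay": 1,           # Delay, in seconds
    "depth": 0,           # Depth
}


def parse_enum_spec(spec):
    """Return a settings dict for one EnumSpec, applying defaults."""
    settings = dict(DEFAULTS)
    settings["access"] = list(DEFAULTS["access"])
    # A backslash followed by a linefeed continues the logical line.
    for token in spec.replace("\\\n", " ").split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError("expected Name=Value, got: " + token)
        if key in ("URL", "Host"):
            # Number, optionally followed by ",filter-filename"
            number, _, filter_file = value.partition(",")
            settings[key.lower() + "_max"] = int(number)
            settings[key.lower() + "_filter"] = filter_file or None
        elif key == "Access":
            # Type names are separated by "|"
            settings["access"] = value.split("|")
        elif key in ("Delay", "Depth"):
            settings[key.lower()] = int(value)
        else:
            raise ValueError("unknown EnumSpec modifier: " + key)
    return settings


example = parse_enum_spec("URL=500,url.filter \\\n Host=3 Access=HTTP|FTP Depth=2")
```

In this sketch, example["delay"] stays at 1 because Delay was not given, while example["url_max"] becomes 500 and example["access"] becomes ["HTTP", "FTP"]. The filter file name url.filter is made up for the example.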
A filter file has the following syntax:
    Deny  regex
    Allow regex
Note that regex uses the standard UNIX ``regex'' syntax (as defined by the POSIX standard), not the csh ``globbing'' syntax. For example, you would use ``.*abc'' to indicate any string ending with ``abc'', not ``*abc''.
As an example, the following URL-Filter file would allow all URLs
except those matching the regular expression ``/gatherers/.*'':
    Deny  /gatherers/.*
    Allow .*
The URL-Filter regular expressions are matched only on the name portion
of each URL. Host-Filter regular expressions are matched on the
``hostname:port'' portion of each URL. The order of the Allow and
Deny entries is important, since the filters are applied sequentially from
first to last. So, for example, if you list ``Allow .*'' first, no
subsequent Deny expressions will be used, since this Allow filter
will allow all entries.
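The first-match behavior described above can be sketched in Python. This is a minimal illustration, not the Gatherer's implementation: the names parse_filter and is_allowed are hypothetical, Python's re module stands in for POSIX regex matching, and allowing a name that matches no entry is an assumption, since the text does not say what happens in that case.

```python
import re


def parse_filter(text):
    """Parse 'Deny regex' / 'Allow regex' lines into ordered rules."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        action, pattern = line.split(None, 1)
        if action not in ("Allow", "Deny"):
            raise ValueError("unknown filter action: " + action)
        rules.append((action, re.compile(pattern)))
    return rules


def is_allowed(rules, name, default=True):
    """Apply rules from first to last; the first matching entry decides.

    `default` covers names that match no entry (an assumption here).
    """
    for action, regex in rules:
        if regex.search(name):
            return action == "Allow"
    return default


# The example URL-Filter file from the text.
rules = parse_filter("Deny /gatherers/.*\nAllow .*\n")

# The same entries in the opposite order: the Deny entry is never reached.
reversed_rules = parse_filter("Allow .*\nDeny /gatherers/.*\n")
```

With the entries in the documented order, is_allowed(rules, "/gatherers/index.html") is False and is_allowed(rules, "/papers/index.html") is True. With reversed_rules, both calls return True, which is why the ordering of entries matters.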