The Gatherer is able to ``explode'' a resource into a stream of content summaries. This is useful for files that contain manually-generated information that may describe one or more resources, or for building a gateway between various structured formats and SOIF (see Appendix B).
This example demonstrates an exploder for the Linux Software Map (LSM) format. LSM files contain structured information (like the author, location, etc.) about software available for the Linux operating system. A demo of our LSM Gatherer and Broker is available.
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-2
% ./RunGatherer
To view the configuration file for this Gatherer, look at example-2.cf. Notice that the Gatherer has its own Lib-Directory (see Section 4.7.1 for help on writing configuration files). The library directory contains the typing and candidate selection customizations for Essence. In this example, we've only customized the candidate selection step. lib/stoplist.cf defines the types that Essence should not index. This example uses an empty stoplist.cf file to direct Essence to index all files.
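The effect of the stoplist on candidate selection can be pictured with a minimal Python sketch. This is not Essence's implementation: the function names and the one-type-per-line file layout are assumptions made for illustration only.

```python
def load_stoplist(text):
    """Return the set of type names found in a stoplist.

    Assumed (hypothetical) layout: one type name per line; blank lines
    and lines starting with '#' are ignored.
    """
    types = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            types.add(line)
    return types


def select_candidates(typed_objects, stoplist):
    """Keep only the (url, type) pairs whose type is not stoplisted."""
    return [(url, t) for url, t in typed_objects if t not in stoplist]


objects = [
    ("ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm", "LSM"),
    ("ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz", "GNUCompressedTar"),
]

# An empty stoplist, as in this example, rejects nothing.
print(select_candidates(objects, load_stoplist("")) == objects)
```

With a stoplist that names a type, every object of that type would be dropped before summarizing; with the empty file used here, every typed object is indexed.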
The Gatherer retrieves each of the LeafNode URLs, which are all Linux
Software Map files from the Linux FTP archive tsx-11.mit.edu. The
Gatherer recognizes that a ``.lsm'' file is LSM type because of the
naming heuristic present in lib/byname.cf. The LSM type is a
``nested'' type as specified in the Essence source code. Exploder programs
(named TypeName.unnest) are run on nested types rather than the usual
summarizers. The LSM.unnest program is the standard exploder program
that takes an LSM file and generates one or more corresponding SOIF
objects. When the Gatherer finishes, it contains one or more
corresponding SOIF objects for the software described
within each LSM file.
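The naming heuristic and the exploder step described above can be sketched in Python. This is an illustration under stated assumptions, not the LSM.unnest program: the rule table stands in for lib/byname.cf, the entry is supplied as already-parsed (name, value) pairs rather than real LSM syntax, the target URL is assumed to be built from the first Site line, the first Path line, and File (which is what the sample output in this section suggests), and a tab is assumed after the colon in each attribute line.

```python
import re

# Hypothetical stand-in for one naming rule of the kind kept in
# lib/byname.cf: a type name paired with a filename pattern.
BYNAME_RULES = [("LSM", re.compile(r"\.lsm$"))]


def type_by_name(url):
    """Return the first type whose pattern matches the URL, else None."""
    for type_name, pattern in BYNAME_RULES:
        if pattern.search(url):
            return type_name
    return None


def soif(template, url, attrs):
    """Format one SOIF object; each size is the value's length in bytes."""
    lines = ["@%s { %s" % (template, url)]
    for name, value in attrs:
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"


def explode(fields):
    """Turn the parsed fields of one entry into a SOIF object.

    The URL construction (first Site, first Path, File) is an assumption
    drawn from the sample output, not from LSM.unnest itself.
    """
    d = dict(fields)
    site = d["Site"].splitlines()[0]
    path = d["Path"].splitlines()[0]
    url = "ftp://%s%s/%s" % (site, path, d["File"])
    return soif("FILE", url, fields)


entry = [
    ("Title", "Section 2, 3, 4, 5, 7, and 9 man pages for Linux"),
    ("Site", "ftp.cs.unc.edu\nsunsite.unc.edu\ntsx-11.mit.edu"),
    ("Path", "/pub/faith/linux\n/pub/Linux/docs/linux-doc-project/man-pages\n"
             "/pub/linux/docs/linux-doc-project"),
    ("File", "man-pages-1.4.tar.gz"),
]

lsm_url = "ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm"
if type_by_name(lsm_url) == "LSM":
    # Nested types go to the exploder instead of a summarizer.
    print(explode(entry), end="")
```

A real exploder would also parse the LSM file itself and could emit more than one object per file; the sketch covers only the one-entry case.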
After the Gatherer has finished, it starts the Gatherer daemon, which serves the content summaries. To view the content summaries, type:
% gather localhost 9222 | more
Because tsx-11.mit.edu is a popular and heavily loaded archive, the Gatherer often won't be able to retrieve the LSM files. If you suspect that something went wrong, look in log.errors and log.gatherer to try to determine the problem.
The following two SOIF objects were generated by this Gatherer. The first object summarizes the LSM file itself, and the second object summarizes the software described in the LSM file.
@FILE { ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm
Time-to-Live{7}: 9676800
Last-Modification-Time{9}: 781931042
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 2
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Type{3}: LSM
Update-Time{9}: 781931042
File-Size{3}: 848
MD5{32}: 67377f3ea214ab680892c82906081caf
}
@FILE { ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz
Time-to-Live{7}: 9676800
Last-Modification-Time{9}: 781931042
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 2
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Update-Time{9}: 781931042
Type{16}: GNUCompressedTar
Title{48}: Section 2, 3, 4, 5, 7, and 9 man pages for Linux
Version{3}: 1.4
Description{124}: Man pages for Linux. Mostly section 2 is complete. Section
3 has over 200 man pages, but it still far from being finished.
Author{27}: Linux Documentation Project
AuthorEmail{11}: DOC channel
Maintainer{9}: Rik Faith
MaintEmail{16}: faith@cs.unc.edu
Site{45}: ftp.cs.unc.edu
sunsite.unc.edu
tsx-11.mit.edu
Path{94}: /pub/faith/linux
/pub/Linux/docs/linux-doc-project/man-pages
/pub/linux/docs/linux-doc-project
File{20}: man-pages-1.4.tar.gz
FileSize{4}: 170k
CopyPolicy{47}: Public Domain or otherwise freely distributable
Keywords{10}: man
pages
Entered{24}: Sun Sep 11 19:52:06 1994
EnteredBy{9}: Rik Faith
CheckedEmail{16}: faith@cs.unc.edu
}
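In the objects above, the number in braces after each attribute name is the size of the value in bytes, which is why values such as Site and Path can span several lines. The following Python sketch shows one way a client might read such a stream by honoring those counts. It is not part of Harvest; the function name and the regular expressions are our own, and it accepts either a space or a tab after the colon as an assumption about the delimiter.

```python
import re

HEADER = re.compile(rb"@(\w+) \{ (\S+)\n")
ATTR = re.compile(rb"([\w-]+)\{(\d+)\}:[ \t]")


def parse_soif(data):
    """Parse a bytes stream of SOIF objects into (template, url, attrs).

    Each value is read by its declared byte count, so embedded newlines
    stay part of the value.
    """
    objects = []
    pos = 0
    while True:
        header = HEADER.search(data, pos)
        if header is None:
            return objects
        template, url = header.group(1).decode(), header.group(2).decode()
        pos = header.end()
        attrs = {}
        while True:
            attr = ATTR.match(data, pos)
            if attr is None:
                break  # not an attribute line; the closing "}" is expected here
            size = int(attr.group(2))
            start = attr.end()
            attrs[attr.group(1).decode()] = data[start:start + size].decode()
            pos = start + size
            # Skip the newline that terminates the value, if present.
            if data[pos:pos + 1] == b"\n":
                pos += 1
        objects.append((template, url, attrs))


sample = (
    b"@FILE { ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz\n"
    b"Version{3}:\t1.4\n"
    b"Site{45}:\tftp.cs.unc.edu\nsunsite.unc.edu\ntsx-11.mit.edu\n"
    b"}\n"
)
for template, url, attrs in parse_soif(sample):
    print(template, url, attrs["Site"].splitlines())
```

Reading by byte count rather than by line is the design point here: a line-oriented reader would mistake the second and third Site lines for malformed attributes.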
We've also built a Gatherer that explodes about a half-dozen index files from various PC archives into more than 25,000 content summaries. Each of these index files contains hundreds of one-line descriptions of PC software distributions that are available via anonymous FTP. We have a demo available via the Web.