Package org.supermind.crawl

Interface Summary
CrawlSeedSource	Source of URLs to seed a crawl.
FetchedURLs	Database of URLs that have already been fetched.
FetchList	A data structure for managing the list of URLs to fetch.
LastModifiedDB	Database of URLs' last-modified times.
PostFetchProcessor	Processes a fetched web page.

Class Summary
CachingFetchedURLs	Caches URL checksums via a chained scatter table.
CrawlTool	Command-line tool to run a crawl.
DefaultFetchedURLs	Default implementation of `FetchedURLs`.
DefaultFetchList	Fetch list that assigns higher priority to hosts with many pending pages.
Fetcher	Fetcher.
FetcherOutput	An entry in the fetcher's output.
FetcherThread	Thread that performs the actual fetching.
FileCrawlSeedSource	Seeds a crawl from a file.
HostQueue	Queue of `ScheduledURL`s which belong to a particular host.
InMemoryFetchedURLs	Stores fetched URLs in a `HashSet`.
LastModifiedFetchedURLs	Records URLs and their last-modified times.
LoggingPostFetchProcessor
NutchFetchListCrawlSeedSource	Uses a Nutch FetchList to seed a crawl.
NutchSegmentPostFetchProcessor	Persists a fetched page to a Nutch segment.
PostFetchProcessorChain	Chain of `PostFetchProcessor`s.
ScheduledURL	A URL to be fetched.
SeedURL
TestDefaultFetchList
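
The summary for `DefaultFetchList` says it favors hosts with many pending pages. The block below is a minimal sketch of that idea only, not the package's implementation: the class `HostPriorityFetchListSketch`, its `add` and `next` methods, and the alphabetical tie-break are all hypothetical, and plain `java.net.URI` values stand in for `ScheduledURL` and `HostQueue`, whose actual signatures are not shown on this page.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/**
 * Hypothetical sketch: hands out URLs from whichever host currently has
 * the most pending URLs. Not the real DefaultFetchList.
 */
final class HostPriorityFetchListSketch {
    // Pending URLs grouped by host; each per-host queue is FIFO.
    // Invariant: no queue stored in this map is ever empty.
    private final Map<String, Queue<URI>> pendingByHost = new HashMap<>();

    /** Queues a URL under its host name. */
    void add(URI url) {
        String host = url.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        pendingByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /**
     * Returns the next URL from the host with the most pending URLs,
     * or null when nothing is pending. Ties are broken by host name
     * (an assumption made here so the order is deterministic).
     */
    URI next() {
        String busiest = null;
        int max = 0;
        // Linear scan over hosts; simple, but O(number of hosts) per call.
        for (Map.Entry<String, Queue<URI>> e : pendingByHost.entrySet()) {
            int size = e.getValue().size();
            boolean larger = size > max;
            boolean tieButEarlier = size == max && busiest != null
                    && e.getKey().compareTo(busiest) < 0;
            if (larger || tieButEarlier) {
                busiest = e.getKey();
                max = size;
            }
        }
        if (busiest == null) {
            return null;
        }
        Queue<URI> queue = pendingByHost.get(busiest);
        URI url = queue.poll();
        if (queue.isEmpty()) {
            pendingByHost.remove(busiest); // keep the map free of empty queues
        }
        return url;
    }
}
```

With two URLs queued for `a.example` and one for `b.example`, this sketch drains `a.example` first. A production fetch list would likely replace the linear scan with a priority structure and add per-host politeness delays, neither of which is modeled here.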