Package org.supermind.crawl

Interface Summary
CrawlSeedSource Source of URLs to seed a crawl.
FetchedURLs Database of URLs that have already been fetched.
FetchList A data structure for managing the list of URLs to fetch.
LastModifiedDB Database of URLs' last modified times.
PostFetchProcessor Processes a fetched web page.
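The FetchedURLs entry above describes a store of URLs that have already been fetched. A minimal sketch of that idea follows. The summary does not show the real method signatures, so the interface name `FetchedUrlStore`, its two methods, and the set-backed implementation are all hypothetical stand-ins rather than the package's actual API.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical contract for a "have we fetched this URL?" store (not the real FetchedURLs API). */
interface FetchedUrlStore {
    /** Records the URL as fetched; returns true if it had not been recorded before. */
    boolean markFetched(String url);

    /** Returns true if the URL was recorded earlier. */
    boolean wasFetched(String url);
}

/** Simplest possible implementation: keeps every URL in an in-memory set. */
final class SetBackedFetchedUrlStore implements FetchedUrlStore {
    private final Set<String> seen = new HashSet<>();

    @Override
    public boolean markFetched(String url) {
        // Set.add returns false when the element is already present.
        return seen.add(url);
    }

    @Override
    public boolean wasFetched(String url) {
        return seen.contains(url);
    }
}

final class FetchedUrlStoreDemo {
    public static void main(String[] args) {
        FetchedUrlStore store = new SetBackedFetchedUrlStore();
        String[] candidates = {
            "http://example.com/a", "http://example.com/b", "http://example.com/a"
        };
        for (String url : candidates) {
            // Only URLs seen for the first time would be handed to a fetcher.
            System.out.println((store.markFetched(url) ? "fetch " : "skip  ") + url);
        }
    }
}
```

An in-memory set like this is not thread-safe and grows with the crawl, so a real store would likely need synchronization and a more compact or persistent representation.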

Class Summary
CachingFetchedURLs Caches URL checksums via a chained scatter table.
CrawlTool Command-line tool to run a crawl.
DefaultFetchedURLs Default implementation of FetchedURLs.
DefaultFetchList Fetch list that assigns higher priority to hosts with many pending pages.
Fetcher Fetcher.
FetcherOutput An entry in the fetcher's output.
FetcherThread Thread that performs the actual fetching.
FileCrawlSeedSource Seed a crawl from a file.
HostQueue Queue of ScheduledURLs which belong to a particular host.
InMemoryFetchedURLs Saves fetched URLs to a HashSet.
LastModifiedFetchedURLs Records URLs and their last modified times.
LoggingPostFetchProcessor  
NutchFetchListCrawlSeedSource Uses a Nutch FetchList to seed a crawl.
NutchSegmentPostFetchProcessor Persists a fetched page to a Nutch segment.
PostFetchProcessorChain Chain of PostFetchProcessors.
ScheduledURL A URL to be fetched.
SeedURL  
TestDefaultFetchList
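
The DefaultFetchList and HostQueue entries above describe per-host queues of pending URLs, with priority given to hosts that have many pages waiting. The sketch below illustrates one way such a policy could work. It is an assumption-laden illustration, not the package's implementation: the class name `HostPriorityFetchListSketch`, its `add` and `next` methods, the use of plain strings instead of ScheduledURL objects, and the tie-breaking rule are all hypothetical.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical fetch list: always serves the host with the most pending URLs. */
final class HostPriorityFetchListSketch {
    // One FIFO queue per host; LinkedHashMap keeps ties deterministic
    // (the host that was queued first wins).
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    /** Appends the URL to the queue of its host. */
    void add(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).addLast(url);
    }

    /** Returns the next URL from the busiest host, or null when nothing is pending. */
    String next() {
        String busiestHost = null;
        Deque<String> busiest = null;
        for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
            // Strict comparison: on a tie the earlier-inserted host is kept.
            if (busiest == null || entry.getValue().size() > busiest.size()) {
                busiestHost = entry.getKey();
                busiest = entry.getValue();
            }
        }
        if (busiest == null) {
            return null;
        }
        String url = busiest.pollFirst();
        if (busiest.isEmpty()) {
            queues.remove(busiestHost); // drop drained hosts so the scan stays short
        }
        return url;
    }

    public static void main(String[] args) {
        HostPriorityFetchListSketch list = new HostPriorityFetchListSketch();
        list.add("http://a.example/1");
        list.add("http://b.example/1");
        list.add("http://b.example/2");
        list.add("http://b.example/3");
        for (String url = list.next(); url != null; url = list.next()) {
            System.out.println(url);
        }
    }
}
```

The linear scan over hosts keeps the sketch short; a heap keyed on queue length would scale better. Per-host politeness delays, which a production crawler would normally enforce, are not modeled here.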