Interface Summary | |
---|---|
CrawlSeedSource | Source of URLs to seed a crawl. |
FetchedURLs | Database of URLs that have already been fetched. |
FetchList | A data structure for managing the list of URLs to fetch. |
LastModifiedDB | Database of URLs' last modified times. |
PostFetchProcessor | Processes a fetched web page. |

Class Summary | |
---|---|
CachingFetchedURLs | Caches URL checksums via a chained scatter table. |
CrawlTool | Command-line tool to run a crawl. |
DefaultFetchedURLs | Default implementation of FetchedURLs. |
DefaultFetchList | FetchList that assigns higher priority to hosts with many pending pages. |
Fetcher | Fetcher. |
FetcherOutput | An entry in the fetcher's output. |
FetcherThread | Thread that performs the actual fetching. |
FileCrawlSeedSource | Seed a crawl from a file. |
HostQueue | Queue of ScheduledURLs which belong to a particular host. |
InMemoryFetchedURLs | Saves fetched URLs to a HashSet. |
LastModifiedFetchedURLs | Records URLs and their last modified times. |
LoggingPostFetchProcessor | |
NutchFetchListCrawlSeedSource | Uses a Nutch FetchList to seed a crawl. |
NutchSegmentPostFetchProcessor | Persists a fetched page to a Nutch segment. |
PostFetchProcessorChain | Chain of PostFetchProcessors. |
ScheduledURL | A URL to be fetched. |
SeedURL | |
TestDefaultFetchList | |
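
The summaries above describe a fetch list that favors hosts with many pending pages and a fetched-URL record backed by a HashSet. The following is a minimal sketch of those two ideas only. The class names `HostPriorityFetchListSketch` and `InMemoryFetchedSketch`, and every method on them, are hypothetical and are not the signatures of `DefaultFetchList`, `FetchList`, `InMemoryFetchedURLs`, or `FetchedURLs`, which this summary does not show. The host is passed in explicitly to keep the sketch free of URL parsing.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a HashSet-backed record of fetched URLs.
class InMemoryFetchedSketch {
    private final Set<String> fetched = new HashSet<>();

    /** Records the URL; returns false if it had already been recorded. */
    boolean markFetched(String url) {
        return fetched.add(url);
    }

    boolean contains(String url) {
        return fetched.contains(url);
    }
}

// Hypothetical fetch list: one FIFO queue per host, and next() serves
// the host that currently has the most pending URLs.
class HostPriorityFetchListSketch {
    private final Map<String, Deque<String>> pendingByHost = new HashMap<>();

    void add(String host, String url) {
        pendingByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /** Returns the next URL to fetch, or null when nothing is pending. */
    String next() {
        String busiestHost = null;
        int maxPending = 0;
        // Linear scan for the host with the longest queue; ties are
        // broken arbitrarily by map iteration order.
        for (Map.Entry<String, Deque<String>> entry : pendingByHost.entrySet()) {
            int pending = entry.getValue().size();
            if (pending > maxPending) {
                maxPending = pending;
                busiestHost = entry.getKey();
            }
        }
        if (busiestHost == null) {
            return null;
        }
        Deque<String> queue = pendingByHost.get(busiestHost);
        String url = queue.poll();
        if (queue.isEmpty()) {
            pendingByHost.remove(busiestHost); // drop drained hosts
        }
        return url;
    }
}
```

In this sketch a caller would pull URLs with `next()`, skip any for which `contains(url)` is already true, and call `markFetched(url)` after a successful fetch. The linear scan keeps the example short; a real implementation would more likely maintain a priority structure over the host queues.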