Package org.supermind.crawl.scope

Interface Summary
ScopeFilter<T> Determines the scope of an operation.
 

Class Summary
AbstractScope<T> Limits the scope of an operation through ScopeFilters.
FetchListScope Scope to determine what URLs are added to a FetchList.
FetchListScope.Input  
MapFileContentSeenFilter Writes MD5s to a MapFile for easy comparison.
NutchUrlFLFilter Filters URLs using Nutch's URLFilters.
OneExternalLinkFLFilter Allows a URL if its parent has the same host as its seed.
ParentPrefixPathFLFilter Allows a URL if it has the same path or host as its parent (originating page).
ParseScope Scope to determine which fetched URLs are parsed.
PostFetchScope Scope to determine which fetched URLs are processed by PostFetchProcessors.
PostFetchScope.Input  
SameParentHostFLFilter Allows a URL if it has the same host as its parent (originating page).
SameParentPathFLFilter Allows a URL if it has the same path as its parent (originating page).
SameParentTLDFLFilter Allows a URL if it has the same TLD (top-level domain) as its parent (originating page).
SizeConstrainedFLFilter Limits a crawl to a fixed number of pages.
WebDBContentSeenFilter Uses Nutch's WebDB and a page's MD5 hash to determine whether a page has been seen.
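
The summary above describes a scope as something that limits an operation through a set of filters. The following is a minimal sketch of that idea only; it does not reproduce this package's API. Every name in it (`ScopeSketch`, `Filter`, `Link`, `SameHostFilter`, `SizeLimitFilter`, `Scope`, `accept`, `inScope`) is hypothetical, and the actual method signatures of ScopeFilter<T> and AbstractScope<T> may differ. It assumes a filter is a yes/no test on an input, and that a scope admits an input only when every filter accepts it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class ScopeSketch {

    /** Hypothetical stand-in for a scope filter: a yes/no test on an input. */
    interface Filter<T> {
        boolean accept(T input);
    }

    /** Hypothetical input: a candidate URL plus the page it was found on. */
    static final class Link {
        final URI url;
        final URI parent;

        Link(String url, String parent) {
            this.url = URI.create(url);
            this.parent = URI.create(parent);
        }
    }

    /** Accepts a link only if its host matches the host of its parent page. */
    static final class SameHostFilter implements Filter<Link> {
        public boolean accept(Link link) {
            String host = link.url.getHost();
            return host != null && host.equalsIgnoreCase(link.parent.getHost());
        }
    }

    /** Accepts links until a fixed budget has been used up. */
    static final class SizeLimitFilter implements Filter<Link> {
        private final int limit;
        private int accepted = 0;

        SizeLimitFilter(int limit) {
            this.limit = limit;
        }

        public boolean accept(Link link) {
            if (accepted >= limit) {
                return false;
            }
            accepted++;
            return true;
        }
    }

    /** Hypothetical scope: an input is in scope only if every filter accepts it. */
    static final class Scope<T> {
        private final List<Filter<T>> filters = new ArrayList<>();

        Scope<T> add(Filter<T> filter) {
            filters.add(filter);
            return this;
        }

        boolean inScope(T input) {
            // Filters run in insertion order and stop at the first rejection.
            for (Filter<T> filter : filters) {
                if (!filter.accept(input)) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Scope<Link> scope = new Scope<Link>()
                .add(new SameHostFilter())
                .add(new SizeLimitFilter(2));

        System.out.println(scope.inScope(new Link("http://a.example/1", "http://a.example/")));
        System.out.println(scope.inScope(new Link("http://b.example/1", "http://a.example/")));
    }
}
```

In this sketch the filter order matters: because evaluation stops at the first rejection, placing the host check before the size limit means off-host links do not consume the page budget.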