org.supermind.crawl.scope
Class WebDBContentSeenFilter

java.lang.Object
  extended by org.supermind.crawl.scope.WebDBContentSeenFilter
All Implemented Interfaces:
ScopeFilter<FetcherOutput>

public class WebDBContentSeenFilter
extends java.lang.Object
implements ScopeFilter<FetcherOutput>

Uses Nutch's WebDB and a page's MD5 hash to determine whether a page has already been seen.
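
The idea of a content-seen check can be illustrated with a minimal sketch. It does not use the Nutch API: an in-memory set stands in for the WebDB lookup, and the class name Md5SeenSketch, its constant values, and the choice to reject already-seen content are all assumptions made for illustration, not the behavior of WebDBContentSeenFilter itself.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a content-seen filter; not the real class.
class Md5SeenSketch {
    // Hypothetical verdict values; the real ScopeFilter constants may differ.
    static final int ALLOW = 1;
    static final int REJECT = -1;

    // Assumption: an in-memory set replaces the WebDB lookup.
    private final Set<String> seenDigests = new HashSet<>();

    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumption: unseen content is allowed and recorded, seen content is rejected.
    int filter(byte[] content) {
        // Set.add returns false when the digest was already present.
        return seenDigests.add(md5Hex(content)) ? ALLOW : REJECT;
    }
}
```

Because the check is keyed on the digest of the content rather than on the URL, two pages with identical bytes are treated as the same page in this sketch, whatever their addresses.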


Field Summary
 
Fields inherited from interface org.supermind.crawl.scope.ScopeFilter
ABSTAIN, ALLOW, REJECT
 
Constructor Summary
WebDBContentSeenFilter(org.apache.nutch.fs.NutchFileSystem nfs, java.lang.String directory)
           
 
Method Summary
 int filter(FetcherOutput fetcherOutput)
          Filter the input.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WebDBContentSeenFilter

public WebDBContentSeenFilter(org.apache.nutch.fs.NutchFileSystem nfs,
                              java.lang.String directory)
Method Detail

filter

public int filter(FetcherOutput fetcherOutput)
Description copied from interface: ScopeFilter
Filter the input. Possible return values are ScopeFilter.ALLOW, ScopeFilter.REJECT and ScopeFilter.ABSTAIN.

Specified by:
filter in interface ScopeFilter<FetcherOutput>
Returns:
one of ScopeFilter.ALLOW, ScopeFilter.REJECT or ScopeFilter.ABSTAIN
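
The three possible return values suggest that a caller may consult several filters in turn. The sketch below shows one way this could be done, under stated assumptions: SketchFilter is a hypothetical stand-in for ScopeFilter, the constant values are invented, and the "first non-abstaining verdict wins" rule is an illustrative choice rather than documented behavior of this package.

```java
import java.util.List;

// Hypothetical stand-in for ScopeFilter; constant values are assumptions.
interface SketchFilter<T> {
    int ALLOW = 1;
    int REJECT = -1;
    int ABSTAIN = 0;

    int filter(T input);
}

// Assumption: filters are consulted in order and the first verdict
// other than ABSTAIN decides the outcome.
class FirstVerdictChain<T> implements SketchFilter<T> {
    private final List<SketchFilter<T>> filters;

    FirstVerdictChain(List<SketchFilter<T>> filters) {
        this.filters = filters;
    }

    @Override
    public int filter(T input) {
        for (SketchFilter<T> f : filters) {
            int verdict = f.filter(input);
            if (verdict != ABSTAIN) {
                return verdict;
            }
        }
        // Every filter abstained, so the chain abstains as well.
        return ABSTAIN;
    }
}
```

Returning ABSTAIN when no filter has an opinion leaves the final decision to the caller, which can then apply whatever default policy it prefers.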