org.supermind.crawl
Class CachingFetchedURLs

java.lang.Object
  extended by org.supermind.crawl.CachingFetchedURLs
All Implemented Interfaces:
FetchedURLs

public class CachingFetchedURLs
extends java.lang.Object
implements FetchedURLs

Caches URL checksums in a chained scatter table. When the cache holds too many URLs, evicted checksums are persisted.
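
To make the description above concrete, here is a minimal sketch of a chained scatter table for 64-bit checksums. It is not this class's implementation: the name ScatterTableSketch, the fixed capacity, the in-memory persisted list standing in for real storage, and the policy of flushing the whole table once it is full are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and eviction policy are assumptions, not the real class.
class ScatterTableSketch {
    private static final int NIL = -1;

    private final long[] keys;     // stored checksums
    private final int[] next;      // next slot in the same chain, or NIL
    private final boolean[] used;  // whether a slot currently holds a checksum
    private int size;
    private int free;              // cursor that scans downward for an unused slot

    // Stand-in for persistent storage of evicted checksums.
    final List<Long> persisted = new ArrayList<>();

    ScatterTableSketch(int capacity) {
        keys = new long[capacity];
        next = new int[capacity];
        used = new boolean[capacity];
        clear();
    }

    private void clear() {
        Arrays.fill(next, NIL);
        Arrays.fill(used, false);
        size = 0;
        free = keys.length - 1;
    }

    // Home slot of a checksum; floorMod keeps the index non-negative.
    private int home(long key) {
        return (int) Math.floorMod(key, (long) keys.length);
    }

    boolean contains(long key) {
        int i = home(key);
        if (used[i]) {
            // Every key with this home slot is reachable along the chain.
            for (; i != NIL; i = next[i]) {
                if (keys[i] == key) return true;
            }
        }
        return persisted.contains(key); // fall back to the evicted checksums
    }

    void insert(long key) {
        if (contains(key)) return;
        if (size == keys.length) {
            // Table is full: move every checksum to the persisted store.
            for (long k : keys) persisted.add(k);
            clear();
        }
        int i = home(key);
        if (used[i]) {
            // Collision: walk to the end of the chain and link in a free slot.
            while (next[i] != NIL) i = next[i];
            while (used[free]) free--;
            next[i] = free;
            i = free;
        }
        keys[i] = key;
        used[i] = true;
        size++;
    }
}
```

Collisions are chained inside the table itself, so no per-entry node objects are allocated. Flushing the whole table at once sidesteps deletion from coalesced chains, which a real eviction policy would have to handle.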


Field Summary
 
Fields inherited from interface org.supermind.crawl.FetchedURLs
LOG
 
Constructor Summary
CachingFetchedURLs()
           
 
Method Summary
 void close()
           
 boolean contains(java.net.URL url)
          Has the URL already been fetched?
 ScheduledURL get(long id)
          Get a persisted URL.
protected  long getChecksum(java.net.URL url)
          Create a 64-bit checksum by merging a 32-bit host checksum with the URL's 32-bit checksum.
 void init()
           
 void insert(ScheduledURL url, org.apache.nutch.protocol.ProtocolOutput output)
          Insert a fetched URL.
 void setChecksum(java.util.zip.Checksum checksum)
           
 void setPurger(ScatterPurger purger)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CachingFetchedURLs

public CachingFetchedURLs()
Method Detail

close

public void close()
           throws java.io.IOException
Specified by:
close in interface FetchedURLs
Throws:
java.io.IOException

contains

public boolean contains(java.net.URL url)
Description copied from interface: FetchedURLs
Has the URL already been fetched?

Specified by:
contains in interface FetchedURLs
Returns:
true if the URL has already been fetched, false otherwise

get

public ScheduledURL get(long id)
Description copied from interface: FetchedURLs
Get a persisted URL. (optional operation)

Specified by:
get in interface FetchedURLs
Parameters:
id - ScheduledURL's id
Returns:
the ScheduledURL, or null if it does not exist

getChecksum

protected long getChecksum(java.net.URL url)
Create a 64-bit checksum by merging a 32-bit host checksum with the URL's 32-bit checksum. Because the host checksum occupies the most significant bits, URLs can be easily sorted by host.

Parameters:
url - the URL to checksum
Returns:
the 64-bit checksum, with the host checksum in the upper 32 bits
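
The merge described for getChecksum can be sketched as follows. This is an illustration under assumptions, not the class's actual code: it uses java.util.zip.CRC32 for both halves (the real class accepts any java.util.zip.Checksum through setChecksum), hashes the UTF-8 bytes of the host and of the full URL string, and the names ChecksumSketch, crc32 and merge are hypothetical.

```java
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical sketch; CRC32 and the hashed strings are assumptions.
final class ChecksumSketch {
    // 32-bit checksum of a string, returned as a long in the range 0..2^32-1.
    static long crc32(String s) {
        Checksum c = new CRC32();
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    // Host checksum in the upper 32 bits, URL checksum in the lower 32 bits.
    static long merge(String host, String url) {
        return (crc32(host) << 32) | crc32(url);
    }

    // Convenience overload for java.net.URL, mirroring the documented parameter type.
    static long merge(URL url) {
        return merge(url.getHost(), url.toString());
    }
}
```

All URLs from one host share the same upper 32 bits, so sorting the merged values places them next to each other. The upper half may set the sign bit, so Long.compareUnsigned is the safer comparator if the numeric order itself matters.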

init

public void init()
          throws java.io.IOException
Specified by:
init in interface FetchedURLs
Throws:
java.io.IOException

insert

public void insert(ScheduledURL url,
                   org.apache.nutch.protocol.ProtocolOutput output)
Description copied from interface: FetchedURLs
Insert a fetched URL.

Specified by:
insert in interface FetchedURLs
Parameters:
url - the fetched URL
output - protocol output

setChecksum

public void setChecksum(java.util.zip.Checksum checksum)

setPurger

public void setPurger(ScatterPurger purger)