org.supermind.crawl
Class DefaultFetchList

java.lang.Object
  extended by org.supermind.crawl.DefaultFetchList
All Implemented Interfaces:
FetchList

public class DefaultFetchList
extends java.lang.Object
implements FetchList

Fetchlist that assigns higher priority to hosts with many pending pages.
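
The ordering described above can be illustrated with a small, self-contained sketch. It is not the actual DefaultFetchList implementation, whose internals are not documented here; `PendingHost` and `newQueue` are hypothetical names, and the sketch simply assumes that "higher priority" means the host with the most pending pages is handed out first.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class PendingPrioritySketch {
    // Hypothetical stand-in for a per-host queue; not the real HostQueue class.
    static final class PendingHost {
        final String host;
        final int pending;

        PendingHost(String host, int pending) {
            this.host = host;
            this.pending = pending;
        }
    }

    // Orders hosts so that the one with the most pending pages is polled first.
    static PriorityQueue<PendingHost> newQueue() {
        return new PriorityQueue<>(
            Comparator.comparingInt((PendingHost h) -> h.pending).reversed());
    }

    public static void main(String[] args) {
        PriorityQueue<PendingHost> queue = newQueue();
        queue.add(new PendingHost("a.example", 3));
        queue.add(new PendingHost("b.example", 12));
        queue.add(new PendingHost("c.example", 7));
        // Polls b.example (12), then c.example (7), then a.example (3).
        while (!queue.isEmpty()) {
            PendingHost h = queue.poll();
            System.out.println(h.host + " " + h.pending);
        }
    }
}
```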


Field Summary
protected  LastModifiedDB lastModifiedDb
          LastModifiedDB.
protected static java.util.logging.Logger LOG
           
 
Constructor Summary
DefaultFetchList()
           
 
Method Summary
 void close()
          Release resources.
 boolean contains(java.net.URL url)
          Does the fetchlist contain this url?
 int getCurrentSize()
          Total number of URLs this fetchlist currently contains.
protected  long getNextAvailable(long timeTaken)
          When a HostQueue is next available.
 void init()
          Initialize resources.
 HostQueue next()
          Get next HostQueue.
 void queue(ScheduledURL parent, java.net.URL url)
          Add a ScheduledURL to the fetchlist.
 void release(HostQueue hostQueue, int popped, long timeTaken)
          Release HostQueue from use.
 void setLastModifiedDb(LastModifiedDB lastModifiedDb)
           
 void setWaitFactor(int waitFactor)
          The time taken to download a chunk of pages from a host is multiplied by this waitFactor to determine how soon a host can be accessed again.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lastModifiedDb

protected LastModifiedDB lastModifiedDb
LastModifiedDB.


LOG

protected static java.util.logging.Logger LOG
Constructor Detail

DefaultFetchList

public DefaultFetchList()
Method Detail

close

public void close()
Description copied from interface: FetchList
Release resources.

Specified by:
close in interface FetchList

contains

public boolean contains(java.net.URL url)
Description copied from interface: FetchList
Does the fetchlist contain this url?

Specified by:
contains in interface FetchList
Returns:
true if the fetchlist contains the url

getCurrentSize

public int getCurrentSize()
Description copied from interface: FetchList
Total number of URLs this fetchlist currently contains. This value is only updated when HostQueues are released via FetchList.release(org.supermind.crawl.HostQueue, int, long).

Specified by:
getCurrentSize in interface FetchList
Returns:
number of URLs

getNextAvailable

protected long getNextAvailable(long timeTaken)
When a HostQueue is next available.

Parameters:
timeTaken - time taken to download a page
Returns:
date (in milliseconds)

init

public void init()
Description copied from interface: FetchList
Initialize resources.

Specified by:
init in interface FetchList

next

public HostQueue next()
Description copied from interface: FetchList
Get next HostQueue.

Specified by:
next in interface FetchList
Returns:
next HostQueue, or null if none of the HostQueues have any URLs

queue

public void queue(ScheduledURL parent,
                  java.net.URL url)
Description copied from interface: FetchList
Add a ScheduledURL to the fetchlist. Multiple threads may call this method concurrently, so implementing classes must synchronize access accordingly.

Specified by:
queue in interface FetchList
Parameters:
parent - originating url
url - url to queue
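
One way an implementation might meet the synchronization requirement is sketched below. This is an assumption-laden illustration, not the code of DefaultFetchList: the class `SynchronizedQueueSketch`, its `pending` and `url` helpers, and the per-host map are all hypothetical, and the `parent` argument of the real method is omitted for brevity.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SynchronizedQueueSketch {
    // Pending URLs grouped by host name.
    private final Map<String, Deque<URL>> byHost = new HashMap<>();

    // synchronized: concurrent callers cannot corrupt the map or the per-host deques.
    public synchronized void queue(URL url) {
        byHost.computeIfAbsent(url.getHost(), h -> new ArrayDeque<>()).add(url);
    }

    // Number of URLs currently queued for the given host.
    public synchronized int pending(String host) {
        Deque<URL> q = byHost.get(host);
        return q == null ? 0 : q.size();
    }

    // Helper that converts a string to a URL without a checked exception.
    static URL url(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedQueueSketch list = new SynchronizedQueueSketch();
        Runnable producer = () -> {
            for (int i = 0; i < 1000; i++) {
                list.queue(url("http://a.example/" + i));
            }
        };
        Thread t1 = new Thread(producer);
        Thread t2 = new Thread(producer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(list.pending("a.example")); // prints 2000
    }
}
```

Method-level `synchronized` is the simplest choice; a real implementation could use finer-grained locking if contention on `queue` became a bottleneck.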

release

public void release(HostQueue hostQueue,
                    int popped,
                    long timeTaken)
Description copied from interface: FetchList
Release HostQueue from use. This method must be called when HostQueue#pop() completes.

Specified by:
release in interface FetchList
Parameters:
popped - number of URLs popped from the queue
timeTaken - total time taken to download the popped urls
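
The calling sequence implied by init(), next(), release() and close() might look like the following sketch. `FetchListLike` and `HostQueueLike` are hypothetical stand-ins that mirror only the signatures documented on this page; the return type of `pop()` and the `singleHost` toy implementation are assumptions made so the example runs on its own.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class FetchLoopSketch {
    // Hypothetical stand-ins; the real HostQueue#pop() return type is not documented here.
    interface HostQueueLike {
        String pop();
    }

    interface FetchListLike {
        void init();
        HostQueueLike next();
        void release(HostQueueLike hostQueue, int popped, long timeTaken);
        void close();
    }

    // Drains the fetchlist: take a queue, pop from it, and always release it afterwards.
    static int drain(FetchListLike fetchList) {
        int total = 0;
        fetchList.init();
        try {
            HostQueueLike hq;
            while ((hq = fetchList.next()) != null) {
                long start = System.currentTimeMillis();
                int popped = 0;
                try {
                    while (hq.pop() != null) {
                        popped++; // a real crawler would download the page here
                    }
                } finally {
                    // Report how many URLs were popped and how long the batch took.
                    fetchList.release(hq, popped, System.currentTimeMillis() - start);
                }
                total += popped;
            }
        } finally {
            fetchList.close();
        }
        return total;
    }

    // Toy in-memory fetchlist with a single host, for demonstration only.
    static FetchListLike singleHost(String... urls) {
        Deque<String> pending = new ArrayDeque<>(Arrays.asList(urls));
        HostQueueLike queue = pending::poll;
        return new FetchListLike() {
            public void init() {}
            public HostQueueLike next() { return pending.isEmpty() ? null : queue; }
            public void release(HostQueueLike hq, int popped, long timeTaken) {}
            public void close() {}
        };
    }

    public static void main(String[] args) {
        int n = drain(singleHost("http://a.example/1", "http://a.example/2"));
        System.out.println(n); // prints 2
    }
}
```

Placing `release` in a `finally` block keeps the host queue from staying locked if a download throws.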

setLastModifiedDb

public void setLastModifiedDb(LastModifiedDB lastModifiedDb)

setWaitFactor

public void setWaitFactor(int waitFactor)
The time taken to download a chunk of pages from a host is multiplied by this waitFactor to determine how soon a host can be accessed again.

Parameters:
waitFactor - multiplier applied to the download time
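
The relationship between the download time and waitFactor can be sketched as below. The exact formula used by getNextAvailable(long) is not shown on this page, so this assumes the simplest reading: the host becomes available again once `timeTaken * waitFactor` milliseconds have passed. `nextAvailable` and its `nowMillis` parameter are hypothetical.

```java
public class WaitFactorSketch {
    // Assumed formula: next availability = now + (time taken * wait factor).
    static long nextAvailable(long nowMillis, long timeTakenMillis, int waitFactor) {
        return nowMillis + timeTakenMillis * waitFactor;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // A batch that took 2 seconds with waitFactor 3 yields a 6 second pause.
        long next = nextAvailable(now, 2_000L, 3);
        System.out.println("wait ms: " + (next - now)); // prints "wait ms: 6000"
    }
}
```

Under this reading, a larger waitFactor makes the crawler more polite toward slow hosts, since hosts that take longer to respond are revisited less often.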