org.supermind.crawl
Class FetcherThread

java.lang.Object
  extended by java.lang.Thread
      extended by org.supermind.crawl.FetcherThread
All Implemented Interfaces:
java.lang.Runnable

public class FetcherThread
extends java.lang.Thread

Thread that performs the actual fetching.

Each FetcherThread has its own FetchList and FetchedURLs. URLs are assigned to FetcherThreads by host, so all URLs from a given host are always assigned to the same FetcherThread.
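
The host-based assignment described above could be sketched as follows. This is only an illustration, assuming the thread is chosen by hashing the host name; the page does not say how the assignment is actually computed. HostPartitionSketch and threadIndexFor are hypothetical names and are not part of org.supermind.crawl.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of org.supermind.crawl.
class HostPartitionSketch {
    private final int threadCount;

    HostPartitionSketch(int threadCount) {
        if (threadCount <= 0) {
            throw new IllegalArgumentException("threadCount must be positive");
        }
        this.threadCount = threadCount;
    }

    /** Returns the index of the thread responsible for the URL's host. */
    int threadIndexFor(URI uri) {
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + uri);
        }
        // Assumption: hash partitioning. Host names are case-insensitive,
        // so normalize before hashing; floorMod keeps the index non-negative.
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), threadCount);
    }

    public static void main(String[] args) {
        HostPartitionSketch sketch = new HostPartitionSketch(4);
        int first = sketch.threadIndexFor(URI.create("http://example.com/a"));
        int second = sketch.threadIndexFor(URI.create("http://example.com/b?page=2"));
        System.out.println(first == second); // prints true: same host, same thread
    }
}
```

Because the index depends only on the host, every URL of a host ends up in the same per-thread list, which is what lets a single thread keep all the state for that host.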

URLs queued for fetching are not added directly to the respective fetchlists but to a holding area, which is checked at regular intervals. This allows each thread's FetchList and FetchedURLs to operate in a single-threaded model.
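
The holding-area pattern can be illustrated with a small sketch. It assumes a thread-safe queue as the holding area and a plain list as the thread-private fetchlist; HoldingAreaSketch, enqueue, and drainBatch are hypothetical names, and the real class may implement this differently (for example through its linkQueue and linkQueueBatchSize fields).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration only; not part of org.supermind.crawl.
class HoldingAreaSketch {
    // Shared holding area: any thread may add URLs here.
    private final ConcurrentLinkedQueue<String> holdingArea = new ConcurrentLinkedQueue<>();
    // Touched only by the owning thread, so it needs no synchronization.
    private final List<String> fetchList = new ArrayList<>();
    private final int batchSize;

    HoldingAreaSketch(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Safe to call from any thread. */
    void enqueue(String url) {
        holdingArea.add(url);
    }

    /**
     * Called only by the owning thread at regular intervals. Moves at most
     * batchSize URLs into the private fetchlist and returns how many moved.
     */
    int drainBatch() {
        int moved = 0;
        String url;
        while (moved < batchSize && (url = holdingArea.poll()) != null) {
            fetchList.add(url);
            moved++;
        }
        return moved;
    }

    /** Owner-thread view of the fetchlist, copied for inspection. */
    List<String> fetchListSnapshot() {
        return new ArrayList<>(fetchList);
    }
}
```

Only the holding area is shared between threads; the fetchlist is read and written by one thread alone, which is the single-threaded model the description refers to.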


Nested Class Summary
 
Nested classes/interfaces inherited from class java.lang.Thread
java.lang.Thread.State, java.lang.Thread.UncaughtExceptionHandler
 
Field Summary
protected  boolean continueWaiting
           
protected  FetchedURLs fetchedUrls
           
protected  Fetcher fetcher
           
protected  FetchList fetchList
           
protected  FetchListScope fetchListScope
           
protected  FetchListScope.Input flScopeIn
           
protected  java.util.LinkedHashMap<java.net.URL,ScheduledURL> linkQueue
           
protected  int linkQueueBatchSize
           
protected  java.util.logging.Logger LOG
           
protected  ParseScope parseScope
           
protected  PostFetchScope.Input postFetchInput
           
protected  PostFetchScope postFetchScope
           
protected  boolean waiting
           
 
Fields inherited from class java.lang.Thread
MAX_PRIORITY, MIN_PRIORITY, NORM_PRIORITY
 
Constructor Summary
FetcherThread(Fetcher fetcher)
           
 
Method Summary
protected  void addOutlinksToFetchlist(ScheduledURL parent, org.apache.nutch.parse.Parse parse)
          Add outlinks to fetchlist.
protected  org.apache.nutch.parse.ParseStatus handleFetch(ScheduledURL scheduledURL, org.apache.nutch.protocol.ProtocolOutput output)
           
 void run()
           
 void setFetchedURLs(FetchedURLs fetchedUrls)
           
 void setFetcher(Fetcher fetcher)
           
 void setFetchList(FetchList fetchList)
           
 void setFetchListScope(FetchListScope fetchListScope)
           
 void setParseScope(ParseScope parseScope)
           
 void setPostFetchScope(PostFetchScope postFetchScope)
           
 
Methods inherited from class java.lang.Thread
activeCount, checkAccess, countStackFrames, currentThread, destroy, dumpStack, enumerate, getAllStackTraces, getContextClassLoader, getDefaultUncaughtExceptionHandler, getId, getName, getPriority, getStackTrace, getState, getThreadGroup, getUncaughtExceptionHandler, holdsLock, interrupt, interrupted, isAlive, isDaemon, isInterrupted, join, join, join, resume, setContextClassLoader, setDaemon, setDefaultUncaughtExceptionHandler, setName, setPriority, setUncaughtExceptionHandler, sleep, sleep, start, stop, stop, suspend, toString, yield
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

continueWaiting

protected boolean continueWaiting

fetchedUrls

protected FetchedURLs fetchedUrls

fetcher

protected Fetcher fetcher

fetchList

protected FetchList fetchList

fetchListScope

protected FetchListScope fetchListScope

flScopeIn

protected final FetchListScope.Input flScopeIn

linkQueue

protected java.util.LinkedHashMap<java.net.URL,ScheduledURL> linkQueue

linkQueueBatchSize

protected int linkQueueBatchSize

LOG

protected final java.util.logging.Logger LOG

parseScope

protected ParseScope parseScope

postFetchInput

protected final PostFetchScope.Input postFetchInput

postFetchScope

protected PostFetchScope postFetchScope

waiting

protected boolean waiting

Constructor Detail

FetcherThread

public FetcherThread(Fetcher fetcher)

Method Detail

addOutlinksToFetchlist

protected void addOutlinksToFetchlist(ScheduledURL parent,
                                      org.apache.nutch.parse.Parse parse)
Add outlinks to fetchlist.

Parameters:
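parent - the scheduled URL whose content was parsed
parse - the parse result for parent, from which the outlinks are taken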

handleFetch

protected org.apache.nutch.parse.ParseStatus handleFetch(ScheduledURL scheduledURL,
                                                         org.apache.nutch.protocol.ProtocolOutput output)
                                                  throws java.io.IOException
Throws:
java.io.IOException

run

public void run()
Specified by:
run in interface java.lang.Runnable
Overrides:
run in class java.lang.Thread

setFetchedURLs

public void setFetchedURLs(FetchedURLs fetchedUrls)

setFetcher

public void setFetcher(Fetcher fetcher)

setFetchList

public void setFetchList(FetchList fetchList)

setFetchListScope

public void setFetchListScope(FetchListScope fetchListScope)

setParseScope

public void setParseScope(ParseScope parseScope)

setPostFetchScope

public void setPostFetchScope(PostFetchScope postFetchScope)