org.supermind.crawl.http
Class RobotRulesParser

java.lang.Object
  extended by org.supermind.crawl.http.RobotRulesParser

public class RobotRulesParser
extends java.lang.Object

This class handles the parsing of robots.txt files. It produces RobotRulesParser.RobotRuleSet objects, which describe which paths a crawler is permitted to download from a site.


Nested Class Summary
static class RobotRulesParser.RobotRuleSet
          This class holds the rules which were parsed from a robots.txt file, and can test paths against those rules.
 
Field Summary
static java.util.logging.Logger LOG
           
 
Constructor Summary
RobotRulesParser()
           
RobotRulesParser(java.lang.String[] robotNames)
          Creates a new RobotRulesParser which will use the supplied robotNames when choosing which stanza to follow in robots.txt files.
 
Method Summary
(package private) static RobotRulesParser.RobotRuleSet getEmptyRules()
          Returns a RobotRuleSet object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
(package private) static RobotRulesParser.RobotRuleSet getForbidAllRules()
          Returns a RobotRuleSet object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
static boolean isAllowed(java.net.URL url)
          Returns true if the supplied url may be fetched according to the rules in the host's robots.txt file.
static void main(java.lang.String[] argv)
          Command-line entry point for testing.
(package private)  RobotRulesParser.RobotRuleSet parseRules(byte[] robotContent)
          Returns a RobotRulesParser.RobotRuleSet object which encapsulates the rules parsed from the supplied robotContent.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final java.util.logging.Logger LOG
Constructor Detail

RobotRulesParser

public RobotRulesParser()

RobotRulesParser

public RobotRulesParser(java.lang.String[] robotNames)
Creates a new RobotRulesParser which will use the supplied robotNames when choosing which stanza to follow in robots.txt files. Any name in the array may be matched. The order of the robotNames determines precedence: if multiple names match, only the rules associated with the robot name at the smallest index are used.
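The precedence rule above can be sketched as follows. This is an illustrative stand-in, not the class's actual implementation: the helper class and method names here are hypothetical, and a real parser would also handle case normalization and wildcard agents when scanning User-agent lines.

```java
import java.util.Arrays;
import java.util.List;

public class AgentPrecedence {
    // Returns the index of the highest-precedence (lowest-index) robot name
    // that appears among the matched User-agent values, or -1 if none match.
    static int bestMatch(String[] robotNames, List<String> matchedAgents) {
        int best = -1;
        for (String agent : matchedAgents) {
            int idx = Arrays.asList(robotNames).indexOf(agent.toLowerCase());
            if (idx != -1 && (best == -1 || idx < best)) {
                best = idx;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] names = {"supercrawler", "crawler", "*"};
        // Both "crawler" and "*" stanzas match; "crawler" has the smaller index,
        // so its rules win.
        int idx = bestMatch(names, Arrays.asList("crawler", "*"));
        System.out.println(names[idx]); // prints "crawler"
    }
}
```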

Method Detail

getEmptyRules

static RobotRulesParser.RobotRuleSet getEmptyRules()
Returns a RobotRuleSet object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.


getForbidAllRules

static RobotRulesParser.RobotRuleSet getForbidAllRules()
Returns a RobotRuleSet object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.


isAllowed

public static boolean isAllowed(java.net.URL url)
                         throws org.apache.nutch.protocol.ProtocolException,
                                java.io.IOException
Returns true if the supplied url may be fetched according to the rules in the host's robots.txt file.

Throws:
org.apache.nutch.protocol.ProtocolException
java.io.IOException

main

public static void main(java.lang.String[] argv)
Command-line entry point for testing.


parseRules

RobotRulesParser.RobotRuleSet parseRules(byte[] robotContent)
Returns a RobotRulesParser.RobotRuleSet object which encapsulates the rules parsed from the supplied robotContent.
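A minimal sketch of the kind of rule set parseRules produces: Disallow prefixes are collected from the chosen stanza, and a path is allowed unless it begins with a disallowed prefix. The SimpleRuleSet class here is a simplified, hypothetical stand-in for RobotRulesParser.RobotRuleSet; a real parser must also select the correct User-agent stanza first.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleRuleSet {
    private final List<String> disallowed = new ArrayList<>();

    // Collects Disallow prefixes from robots.txt content, ignoring comments.
    static SimpleRuleSet parse(String robotContent) {
        SimpleRuleSet rules = new SimpleRuleSet();
        for (String line : robotContent.split("\n")) {
            int hash = line.indexOf('#');
            if (hash != -1) line = line.substring(0, hash);
            line = line.trim();
            if (line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String prefix = line.substring(9).trim();
                if (!prefix.isEmpty()) rules.disallowed.add(prefix);
            }
        }
        return rules;
    }

    // A path is allowed unless some Disallow prefix matches it.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleRuleSet rules =
            parse("User-agent: *\nDisallow: /private/\nDisallow: /tmp\n");
        System.out.println(rules.isAllowed("/index.html"));   // true
        System.out.println(rules.isAllowed("/private/data")); // false
    }
}
```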