java.lang.Object
  org.supermind.crawl.http.RobotRulesParser
public class RobotRulesParser
This class handles the parsing of robots.txt files.
It emits RobotRulesParser.RobotRuleSet objects, which describe the download
permissions parsed from a robots.txt file.
Nested Class Summary | |
---|---|
static class | RobotRulesParser.RobotRuleSet: This class holds the rules which were parsed from a robots.txt file, and can test paths against those rules. |
Field Summary | |
---|---|
static java.util.logging.Logger | LOG |
Constructor Summary | |
---|---|
RobotRulesParser() | |
RobotRulesParser(java.lang.String[] robotNames) | Creates a new RobotRulesParser which will use the supplied robotNames when choosing which stanza to follow in robots.txt files. |
Method Summary | |
---|---|
(package private) static RobotRulesParser.RobotRuleSet | getEmptyRules(): Returns a RobotRuleSet object appropriate for use when the robots.txt file is empty or missing; all requests are allowed. |
(package private) static RobotRulesParser.RobotRuleSet | getForbidAllRules(): Returns a RobotRuleSet object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed. |
static boolean | isAllowed(java.net.URL url) |
static void | main(java.lang.String[] argv): Command-line main for testing. |
(package private) RobotRulesParser.RobotRuleSet | parseRules(byte[] robotContent): Returns a RobotRulesParser.RobotRuleSet object which encapsulates the rules parsed from the supplied robotContent. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final java.util.logging.Logger LOG
Constructor Detail |
---|
public RobotRulesParser()
public RobotRulesParser(java.lang.String[] robotNames)
Creates a new RobotRulesParser which will use the supplied robotNames
when choosing which stanza to follow in robots.txt files. Any name in the
array may be matched. The order of the robotNames determines the precedence:
if several names are matched, only the rules associated with the robot name
having the smallest index will be used.
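The precedence rule can be illustrated with a small standalone sketch. It is not the code of RobotRulesParser: the class AgentPrecedenceSketch and the method bestMatchIndex are hypothetical names, and matching by case-insensitive equality is an assumption made for this example only.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of RobotRulesParser. It only illustrates
// the "smallest index wins" precedence described in the constructor docs.
final class AgentPrecedenceSketch {

    /**
     * Returns the index in robotNames of the highest-precedence name that
     * also appears among the User-agent values found in a robots.txt file,
     * or -1 if no name matches. Case-insensitive equality is an assumption
     * of this sketch; the real parser may match differently.
     */
    static int bestMatchIndex(String[] robotNames, List<String> agentsInFile) {
        // Walk robotNames in order, so the first hit has the smallest index.
        for (int i = 0; i < robotNames.length; i++) {
            String wanted = robotNames[i].toLowerCase(Locale.ROOT);
            for (String agent : agentsInFile) {
                if (agent.trim().toLowerCase(Locale.ROOT).equals(wanted)) {
                    return i;
                }
            }
        }
        return -1;
    }
}
```

With robotNames of {"mybot", "crawler", "*"} and a file that contains stanzas for "*" and "Crawler", this sketch selects index 1, so only the "crawler" stanza would be followed.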
Method Detail |
---|
static RobotRulesParser.RobotRuleSet getEmptyRules()
Returns a RobotRuleSet object appropriate for use when the robots.txt
file is empty or missing; all requests are allowed.
static RobotRulesParser.RobotRuleSet getForbidAllRules()
Returns a RobotRuleSet object appropriate for use when the robots.txt
file is not fetched due to a 403/Forbidden response; all requests are
disallowed.
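A caller that fetches robots.txt might choose between these two rule sets roughly as sketched below. The class RobotsFetchPolicySketch, its Policy enum, and the method policyFor are hypothetical and exist only for illustration. The handling of status codes other than 200, 403, and 404 is an assumption of this sketch, not something this page documents.

```java
// Hypothetical sketch, not part of RobotRulesParser: maps the outcome of a
// robots.txt fetch to the kind of rule set a crawler would want to use.
final class RobotsFetchPolicySketch {

    enum Policy { ALLOW_ALL, FORBID_ALL, PARSE_BODY }

    static Policy policyFor(int httpStatus, byte[] body) {
        if (httpStatus == 403) {
            // Documented case for getForbidAllRules(): all requests disallowed.
            return Policy.FORBID_ALL;
        }
        if (httpStatus == 404 || body == null || body.length == 0) {
            // Documented case for getEmptyRules(): file empty or missing.
            return Policy.ALLOW_ALL;
        }
        if (httpStatus == 200) {
            // A real caller would hand the body to parseRules(byte[]).
            return Policy.PARSE_BODY;
        }
        // Assumption of this sketch: any other status is treated as "missing".
        return Policy.ALLOW_ALL;
    }
}
```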
public static boolean isAllowed(java.net.URL url) throws org.apache.nutch.protocol.ProtocolException, java.io.IOException
Throws:
org.apache.nutch.protocol.ProtocolException
java.io.IOException
public static void main(java.lang.String[] argv)
Command-line main for testing.
RobotRulesParser.RobotRuleSet parseRules(byte[] robotContent)
Returns a RobotRulesParser.RobotRuleSet object which encapsulates the
rules parsed from the supplied robotContent.
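Because parseRules is package private, the following standalone sketch shows one simplified way that robots.txt bytes could be turned into testable rules. ParseRulesSketch, disallowedPrefixes, and isPathAllowed are hypothetical names. The sketch assumes UTF-8 content, matches a single agent name by case-insensitive equality, and ignores Allow lines, so it should not be read as the behavior of RobotRulesParser itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt handling for illustration only.
final class ParseRulesSketch {

    /** Collects Disallow path prefixes from stanzas addressed to the given agent. */
    static List<String> disallowedPrefixes(byte[] robotContent, String agent) {
        List<String> prefixes = new ArrayList<>();
        String text = new String(robotContent, StandardCharsets.UTF_8);
        boolean inMatchingStanza = false;
        boolean lastLineWasAgent = false;
        for (String raw : text.split("\r?\n")) {
            // Strip comments, then skip lines without a "field: value" shape.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same stanza.
                boolean matches = value.equalsIgnoreCase(agent);
                inMatchingStanza = lastLineWasAgent ? (inMatchingStanza || matches) : matches;
                lastLineWasAgent = true;
            } else {
                lastLineWasAgent = false;
                if (inMatchingStanza && field.equals("disallow") && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isPathAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

An empty Disallow value is skipped here, which leaves every path allowed for that stanza, in line with the usual reading of robots.txt.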