diggin/diggin-robotrules

parser/handler for Robots Exclusion Protocol (robots.txt and more)

Language: PHP

v0.8.1 2014-06-21 16:40 UTC

README

PHP parser/handler for the Robots Exclusion Protocol (robots.txt and more)

Features

  • implements http://www.robotstxt.org/norobots-rfc.txt

    • [DONE] "3.2.2 The Allow and Disallow lines", covered as test cases
    • [DONE] "4. Examples", covered as test cases
  • passes Nutch's test code (ref)

    • [DONE] @see tests/Diggin/RobotRules/Imported/NutchTest.php
  • parses and handles HTML meta tags
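
Section 3.2.2 of the draft describes first-match evaluation: the Allow and Disallow lines of a record are checked in order, the first line whose path is a prefix of the requested path decides, and a path that matches nothing is allowed. The following is a minimal standalone sketch of that rule, not the library's implementation. The function name `isPathAllowed` and the `[type, path]` array shape are hypothetical, and the sketch skips the draft's %xx decoding step and treats an empty value as matching nothing.

```php
<?php
// Sketch of first-match evaluation for Allow/Disallow lines.
// isPathAllowed() and the [type, path] rule shape are hypothetical,
// not part of Diggin\RobotRules.

/**
 * @param array $lines ordered list of [type, path] pairs from one record
 */
function isPathAllowed(array $lines, string $path): bool
{
    // /robots.txt itself is always allowed.
    if ($path === '/robots.txt') {
        return true;
    }
    foreach ($lines as [$type, $prefix]) {
        // Assumption: an empty value (e.g. "Disallow:") matches nothing.
        if ($prefix === '') {
            continue;
        }
        // Plain prefix comparison; %xx decoding is omitted in this sketch.
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return strtolower($type) === 'allow';
        }
    }
    // No line matched: the default is to allow.
    return true;
}

$rules = [
    ['Disallow', '/org/plans.html'],
    ['Allow', '/org/'],
    ['Disallow', '/'],
];

var_dump(isPathAllowed($rules, '/org/about.html')); // bool(true)
var_dump(isPathAllowed($rules, '/org/plans.html')); // bool(false)
var_dump(isPathAllowed($rules, '/index.html'));     // bool(false)
```

Because the first match wins, the order of the lines matters: moving `Disallow: /` to the top of this record would block every path except /robots.txt.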

ToDos

USAGE

<?php
use Diggin\RobotRules\Accepter\TxtAccepter;
use Diggin\RobotRules\Parser\TxtStringParser;

$robotstxt = <<<'ROBOTS'
# sample robots.txt
User-agent: YourCrawlerName
Disallow:

User-agent: *
Disallow: /aaa/ #comment
ROBOTS;

$accepter = new TxtAccepter;
$accepter->setRules(TxtStringParser::parse($robotstxt));

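// 'foo' has no record of its own, so the "User-agent: *" record applies.
$accepter->setUserAgent('foo');
var_dump($accepter->isAllow('/aaa/')); // false
var_dump($accepter->isAllow('/b.html')); // true

// 'YourCrawlerName' has its own record with an empty Disallow, so nothing is blocked.
$accepter->setUserAgent('YourCrawlerName');
var_dump($accepter->isAllow('/aaa/')); // true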
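
The results above depend on which record is chosen for a given user agent. As a rough illustration, the draft's selection rule (first record whose User-agent value contains the robot's name, case-insensitively, otherwise the first `*` record, otherwise no restrictions) might be sketched as below. This is an assumption-laden standalone example, not how TxtAccepter is implemented; `selectRecord` and the record array shape are hypothetical.

```php
<?php
// Sketch of record selection by user agent.
// selectRecord() and the ['agents' => ..., 'rules' => ...] shape are
// hypothetical, not part of Diggin\RobotRules.

function selectRecord(array $records, string $robotName): ?array
{
    $fallback = null;
    foreach ($records as $record) {
        foreach ($record['agents'] as $agent) {
            if ($agent === '*') {
                // Remember only the first wildcard record.
                $fallback = $fallback ?? $record;
            } elseif ($robotName !== '' && stripos($agent, $robotName) !== false) {
                // First record naming this robot wins immediately.
                return $record;
            }
        }
    }
    // Null means no applicable record, i.e. access is not restricted.
    return $fallback;
}

$records = [
    ['agents' => ['YourCrawlerName'], 'rules' => [['Disallow', '']]],
    ['agents' => ['*'], 'rules' => [['Disallow', '/aaa/']]],
];

var_dump(selectRecord($records, 'foo')['agents']);             // the "*" record
var_dump(selectRecord($records, 'yourcrawlername')['agents']); // the named record
```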

INSTALL

Diggin_RobotRules follows PSR-0, so register the Diggin\RobotRules namespace with your class loader.
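
If you do not use a framework class loader, a small PSR-0 autoloader along these lines should work. This is only a sketch: the helper name `psr0Path` is hypothetical, and the base directory below is an assumed location that you would replace with wherever the library's Diggin/ directory actually lives. With Composer, requiring vendor/autoload.php makes this unnecessary.

```php
<?php
// Sketch of a PSR-0 autoloader. psr0Path() is a hypothetical helper and
// $baseDir is an assumed path, not something defined by the library.

function psr0Path(string $className, string $baseDir): string
{
    $className = ltrim($className, '\\');
    $fileName = '';
    $lastNsPos = strrpos($className, '\\');
    if ($lastNsPos !== false) {
        // Namespace separators become directory separators.
        $namespace = substr($className, 0, $lastNsPos);
        $className = substr($className, $lastNsPos + 1);
        $fileName = str_replace('\\', '/', $namespace) . '/';
    }
    // Underscores in the class name (not the namespace) also become separators.
    $fileName .= str_replace('_', '/', $className) . '.php';

    return rtrim($baseDir, '/') . '/' . $fileName;
}

// Assumed location of the library sources; adjust to your layout.
$baseDir = __DIR__ . '/path/to/diggin-robotrules/src';

spl_autoload_register(function (string $class) use ($baseDir): void {
    // Only handle classes from this library's namespace.
    if (strpos(ltrim($class, '\\'), 'Diggin\\RobotRules\\') !== 0) {
        return;
    }
    $file = psr0Path($class, $baseDir);
    if (is_file($file)) {
        require $file;
    }
});
```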

To install via Composer:

  • $ php composer.phar require diggin/diggin-robotrules "dev-master"

License

Diggin_RobotRules is licensed under the New BSD License.

References & alternatives in other languages