bopoda / robots-txt-parser
PHP class for parsing robots.txt files according to the Google and Yandex specifications.
Installs: 165 272
Dependents: 2
Suggesters: 0
Security: 0
Stars: 44
Watchers: 7
Forks: 17
Open Issues: 7
Language: PHP
Requires
- php: >=5.4.0
- ext-mbstring: *
Requires (Dev)
- phpunit/phpunit: >=3.7
This package is auto-updated.
Last update: 2024-10-28 12:20:53 UTC
README
RobotsTxtParser — PHP class for parsing all directives of robots.txt files.
RobotsTxtValidator — PHP class for checking whether a URL is allowed or disallowed according to robots.txt rules.
Try the online demo of RobotsTxtParser on live domains.
Parsing is carried out according to the rules of the Google & Yandex specifications:
Last improvements:
- Parse the Clean-param directive according to the clean-param syntax.
- Delete comments (everything following the '#' character, up to the first line break, is disregarded); see the sketch after this list.
- Improved parsing of the Host directive: as a cross-section directive it should refer to the user-agent '*', and if there are multiple Host directives, search engines take the value of the first one.
- Removed unused methods from the class, refactored the code, and corrected the visibility of class properties.
- Added more test cases, including test cases for all the new functionality.
- Added the RobotsTxtValidator class to check whether a URL is allowed for crawling.
- With version 2.0, the speed of RobotsTxtParser was significantly improved.
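As a quick illustration of the comment stripping and Host precedence described above, the sketch below (relying only on getRules(), which is shown in the usage examples further down) parses a file containing comments and two Host lines. Per the rules above, the comment text should never appear in the parsed rules, and only the first Host value should be kept.

$parser = new RobotsTxtParser("
    User-agent: *
    # this whole line is a comment and is disregarded
    Disallow: /private # so is everything after this '#'
    Host: example.com
    Host: mirror.example.com
");

$rules = $parser->getRules();

// Expected (per the rules described above): the disallow list contains
// '/private' without the comment text, and 'host' is 'example.com',
// because only the first Host directive is taken into account.
var_dump($rules);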
Supported Directives:
- DIRECTIVE_ALLOW = 'allow';
- DIRECTIVE_DISALLOW = 'disallow';
- DIRECTIVE_HOST = 'host';
- DIRECTIVE_SITEMAP = 'sitemap';
- DIRECTIVE_USERAGENT = 'user-agent';
- DIRECTIVE_CRAWL_DELAY = 'crawl-delay';
- DIRECTIVE_CLEAN_PARAM = 'clean-param';
- DIRECTIVE_NOINDEX = 'noindex';
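The lowercase values of these directives are the keys you will see in the array returned by getRules(). A minimal sketch, assuming the directives are exposed as public class constants on RobotsTxtParser (the plain string values such as 'disallow' work just as well):

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$rules = $parser->getRules();

// Read the generic '*' section using the directive constants; isset() keeps
// the sketch compatible with the php >= 5.4 requirement.
$disallow = isset($rules['*'][RobotsTxtParser::DIRECTIVE_DISALLOW])
    ? $rules['*'][RobotsTxtParser::DIRECTIVE_DISALLOW]
    : array();
$crawlDelay = isset($rules['*'][RobotsTxtParser::DIRECTIVE_CRAWL_DELAY])
    ? $rules['*'][RobotsTxtParser::DIRECTIVE_CRAWL_DELAY]
    : null;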
Installation
Install the latest version with
composer require bopoda/robots-txt-parser
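After installing, include Composer's autoloader before using the classes. A minimal bootstrap sketch (vendor/autoload.php is the standard Composer autoload path; the class names are used exactly as in the examples below):

require __DIR__ . '/vendor/autoload.php';

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$validator = new RobotsTxtValidator($parser->getRules());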
Run tests
Run the PHPUnit tests using the command
php vendor/bin/phpunit
Usage example
You can start the parser by getting the content of a robots.txt file from a website:
$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
var_dump($parser->getRules());
Or simply use the contents of the file as input (i.e. when the content is already cached):
$parser = new RobotsTxtParser(" User-Agent: * Disallow: /ajax Disallow: /search Clean-param: param1 /path/file.php User-agent: Yahoo Disallow: / Host: example.com Host: example2.com "); var_dump($parser->getRules());
This will output:
array(2) {
  ["*"]=>
  array(3) {
    ["disallow"]=>
    array(2) {
      [0]=>
      string(5) "/ajax"
      [1]=>
      string(7) "/search"
    }
    ["clean-param"]=>
    array(1) {
      [0]=>
      string(21) "param1 /path/file.php"
    }
    ["host"]=>
    string(11) "example.com"
  }
  ["yahoo"]=>
  array(1) {
    ["disallow"]=>
    array(1) {
      [0]=>
      string(1) "/"
    }
  }
}
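Note that the rules are keyed by user agent in lowercase ('yahoo' above). If you only want to peek at a bot's section yourself, a simplified lookup with a fallback to the '*' section could look like the sketch below; real allow/disallow decisions, including proper user-agent matching, should be left to RobotsTxtValidator.

// Simplified, illustrative lookup only; use RobotsTxtValidator for real checks.
$rules = $parser->getRules();

$userAgent = strtolower('Yahoo');
$botRules = isset($rules[$userAgent]) ? $rules[$userAgent] : array();
if (!$botRules && isset($rules['*'])) {
    $botRules = $rules['*']; // fall back to the generic section
}

var_dump($botRules['disallow']); // ["/"] for Yahoo with the rules above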
To validate a URL, use the RobotsTxtValidator class:
$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$validator = new RobotsTxtValidator($parser->getRules());

$url = '/';
$userAgent = 'MyAwesomeBot';

if ($validator->isUrlAllow($url, $userAgent)) {
    // Crawl the site URL and do nice stuff
}
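Building on the inline example above, the validator can be queried for several URLs in a row. Assuming the prefix matching described by the Google & Yandex specifications, '/ajax' and anything under '/search' should come back as disallowed for a bot that falls into the '*' section, while '/' stays allowed (the bot name here is only an illustration):

$parser = new RobotsTxtParser("
    User-Agent: *
    Disallow: /ajax
    Disallow: /search
");
$validator = new RobotsTxtValidator($parser->getRules());

foreach (array('/', '/ajax', '/search/results') as $url) {
    echo $url . ' is ' . ($validator->isUrlAllow($url, 'MyAwesomeBot') ? 'allowed' : 'disallowed') . PHP_EOL;
}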
Contribution
Feel free to create a PR in this repository. Please follow the PSR coding style.
See the list of contributors who participated in this project.
Final Notes:
Please use version 2.0+, which follows the same rules but offers significantly better performance.