wizardcompass / robots-txt-parser
A PHP library to parse and analyze robots.txt files with validation and URL fetching capabilities
Requires
- php: >=8.1
- guzzlehttp/guzzle: ^7.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.87
- phpstan/phpstan: ^2.1
- phpunit/phpunit: ^10.0
README
A comprehensive PHP library for parsing and analyzing robots.txt files. This package provides functionality to fetch, parse, and validate robots.txt files with support for large files and streaming downloads.
Features
- 🚀 Fast parsing of robots.txt content with detailed statistics
- 🌐 Smart URL fetching with automatic robots.txt path resolution
- 🔄 HTTP streaming with Guzzle for reliable downloads and size limits
- ✅ Validation with detailed error and warning reporting
- 📊 Comprehensive analysis including directive counts by type and user agent
- 🛡️ Size protection with Google's 500KB limit enforcement
- 🔧 Timeout handling and redirect support
- 📈 Performance optimized for large files (tested with files of 5MB and larger)
Installation
Install via Composer:
composer require wizardcompass/robots-txt-parser
Requirements
- PHP 8.1 or higher
- Guzzle HTTP 7.0+ (for URL fetching)
Quick Start
<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

// Parse from string
$robotsTxt = "User-agent: *\nDisallow: /admin\nAllow: /public";
$result = $parser->parse($robotsTxt);

// Parse from URL (automatically appends /robots.txt)
$result = $parser->parseFromUrl('https://example.com');

// Validate syntax
$validation = $parser->validate($robotsTxt);
Usage Examples
Basic Parsing
<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

$content = <<<'EOT'
# Example robots.txt
User-agent: *
Disallow: /admin
Disallow: /private
Allow: /public
Crawl-delay: 10
User-agent: Googlebot
Allow: /admin/public
Disallow: /admin/private
Sitemap: https://example.com/sitemap.xml
EOT;

$result = $parser->parse($content);

echo "File size: " . $result['size'] . " bytes\n";
echo "Comments: " . $result['comment_count'] . "\n";
echo "User agents: " . $result['record_counts']['by_type']['user_agent'] . "\n";
echo "Disallow rules: " . $result['record_counts']['by_type']['disallow'] . "\n";

// Access user-agent specific data
foreach ($result['record_counts']['by_useragent'] as $userAgent => $counts) {
    echo "User-agent '{$userAgent}' has {$counts['disallow']} disallow rules\n";
}

// Access sitemap information
echo "Sitemaps found: " . count($result['sitemaps']) . "\n";
foreach ($result['sitemaps'] as $sitemap) {
    echo "Sitemap: {$sitemap}\n";
}
Fetching from URL
The parser handles robots.txt URL resolution automatically: pass any URL from a site and it fetches the robots.txt file from that site's root.
<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

try {
    // All of these will fetch from https://example.com/robots.txt
    $result = $parser->parseFromUrl('https://example.com');
    $result = $parser->parseFromUrl('https://example.com/');
    $result = $parser->parseFromUrl('https://example.com/some/page');
    $result = $parser->parseFromUrl('https://example.com/robots.txt'); // explicit

    // Custom timeout (60 seconds instead of default 30)
    $result = $parser->parseFromUrl('https://example.com', 60);

    echo "Status: " . $result['status'] . "\n";
    echo "Redirected: " . ($result['redirected'] ? 'Yes' : 'No') . "\n";
    echo "Size: " . number_format($result['size_kib'], 2) . " KB\n";
    echo "User agents found: " . count($result['record_counts']['by_useragent']) . "\n";

    if (isset($result['size_limit_exceeded'])) {
        echo "Warning: File exceeded 500KB limit and was truncated\n";
    }
} catch (InvalidArgumentException $e) {
    echo "Invalid URL: " . $e->getMessage() . "\n";
} catch (RuntimeException $e) {
    echo "Failed to fetch: " . $e->getMessage() . "\n";
}
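The resolution itself boils down to extracting the scheme and host and appending /robots.txt. Here is a rough sketch of that idea in plain PHP; it is not the library's internal code, and the helper name is made up for illustration:

<?php

// Hypothetical helper for illustration only; the parser performs this resolution internally.
function resolveRobotsTxtUrl(string $url): string
{
    $parts = parse_url($url);

    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        throw new InvalidArgumentException("Invalid URL: {$url}");
    }

    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    return "{$parts['scheme']}://{$parts['host']}{$port}/robots.txt";
}

echo resolveRobotsTxtUrl('https://example.com/some/page'); // https://example.com/robots.txt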
Validation
<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

$content = <<<'EOT'
User-agent: *
Disallow /admin
Crawl-delay: invalid
Sitemap: not-a-url
Unknown-directive: value
EOT;

$validation = $parser->validate($content);

if (!$validation['is_valid']) {
    echo "Validation failed!\n\n";
    foreach ($validation['errors'] as $error) {
        echo "❌ Error: $error\n";
    }
}

if (!empty($validation['warnings'])) {
    echo "\nWarnings:\n";
    foreach ($validation['warnings'] as $warning) {
        echo "⚠️ Warning: $warning\n";
    }
}
Advanced Options
<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

// Parse with additional metadata
$result = $parser->parse($content, [
    'status' => 404,      // HTTP status code
    'redirected' => true  // Whether the request was redirected
]);

// Check if file exceeds Google's size limit
if ($result['over_google_limit']) {
    echo "Warning: File exceeds Google's 500KB recommendation\n";
}

// Analyze directive distribution
$recordCounts = $result['record_counts']['by_type'];
echo "Directive breakdown:\n";
echo "- User-agent: {$recordCounts['user_agent']}\n";
echo "- Disallow: {$recordCounts['disallow']}\n";
echo "- Allow: {$recordCounts['allow']}\n";
echo "- Sitemap: {$recordCounts['sitemap']}\n";
echo "- Crawl-delay: {$recordCounts['crawl_delay']}\n";
echo "- Other: {$recordCounts['other']}\n";
Response Format
Parse Results
[
    'comment_count' => 2,              // Number of comment lines
    'over_google_limit' => false,      // Whether file exceeds 500KB
    'record_counts' => [
        'by_type' => [
            'allow' => 1,
            'crawl_delay' => 1,
            'disallow' => 3,
            'noindex' => 0,
            'other' => 0,
            'sitemap' => 1,
            'user_agent' => 2
        ],
        'by_useragent' => [
            '*' => [
                'allow' => 1,
                'crawl_delay' => 1,
                'disallow' => 2,
                'noindex' => 0,
                'other' => 0
            ],
            'Googlebot' => [
                'allow' => 0,
                'crawl_delay' => 0,
                'disallow' => 1,
                'noindex' => 0,
                'other' => 0
            ]
        ]
    ],
    'sitemaps' => [                    // Array of sitemap URLs found
        'https://example.com/sitemap.xml',
        'https://example.com/sitemap-news.xml'
    ],
    'redirected' => false,             // Whether URL was redirected
    'size' => 150,                     // File size in bytes
    'size_kib' => 0.146484375,         // File size in KiB
    'status' => 200                    // HTTP status code
]
Validation Results
[
    'is_valid' => false,               // Overall validation status
    'warnings' => [                    // Non-critical issues
        'Line 5: Unknown directive "unknown-directive"'
    ],
    'errors' => [                      // Critical syntax errors
        'Line 2: Invalid syntax - missing colon: "Disallow /admin"',
        'Line 3: Crawl-delay value must be a non-negative number'
    ]
]
Size Limits and Performance
This library implements Google's recommended 500KB size limit for robots.txt files:
- Parsing: Files of any size can be parsed, with a warning flag for files exceeding 500KB
- URL Fetching: Downloads are automatically terminated at 500KB using Guzzle streaming (see the sketch after this list)
- Performance: Optimized for large files with efficient string processing
- Memory: Uses HTTP streaming to minimize memory usage
- Reliability: Guzzle HTTP client provides robust error handling and redirect support
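The streaming cap described above can be approximated with plain Guzzle calls. The following is a minimal sketch of the idea, not the library's internal code; the chunk size and variable names are illustrative assumptions:

<?php

require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Sketch: stream a robots.txt download and stop reading once 500KB have arrived.
$maxBytes = 500 * 1024;

$client = new Client(['timeout' => 30, 'allow_redirects' => true]);
$response = $client->get('https://example.com/robots.txt', ['stream' => true]);

$body = $response->getBody();
$content = '';

while (!$body->eof() && strlen($content) < $maxBytes) {
    $content .= $body->read(8192); // read in small chunks to keep memory usage flat
}

// If the stream was not exhausted, the loop stopped because of the size cap.
$sizeLimitExceeded = !$body->eof();
$content = substr($content, 0, $maxBytes);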
Supported Directives
- User-agent: Specifies which web crawler the rules apply to
- Disallow: Specifies paths that should not be crawled
- Allow: Explicitly allows crawling of specific paths
- Crawl-delay / Crawldelay: Specifies the delay between requests
- Sitemap: Specifies the location of sitemap files
- Noindex: Prevents indexing of specific paths
- Request-rate: Controls the request rate (marked as "other")
- Visit-time: Specifies preferred visit times (marked as "other")
- Host: Specifies the preferred host (marked as "other")
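As an illustration, here is a short sketch that feeds some of the less common directives through parse() and reads the counts back out of the documented record_counts structure; the expected values in the comments assume the classification listed above:

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

$content = <<<'EOT'
User-agent: *
Disallow: /private
Request-rate: 1/10
Visit-time: 0600-0845
Host: www.example.com
EOT;

$result = $parser->parse($content);

// Request-rate, Visit-time, and Host are tallied under "other" per the list above.
echo "Other directives: " . $result['record_counts']['by_type']['other'] . "\n";   // expected: 3
echo "Disallow rules: " . $result['record_counts']['by_type']['disallow'] . "\n";  // expected: 1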
Error Handling
The library throws standard PHP exceptions:
- InvalidArgumentException: For invalid URLs or parameters
- RuntimeException: For network errors or file access issues
Always wrap URL operations in try-catch blocks for robust error handling.
Testing
Run the test suite:
# Install dependencies
composer install

# Run tests
composer test

# Run tests with coverage
composer test-coverage
The test suite includes:
- Unit tests for all parsing functionality
- Large file tests (up to 5MB)
- Performance benchmarks
- Edge case validation
- URL fetching simulation
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Inspiration
This project was inspired by HTTP Archive's robots.txt analysis capabilities and aims to provide the same level of detailed parsing and validation for PHP applications.
Changelog
v1.0.0
- Initial release
- Basic parsing functionality
- URL fetching with size limits
- Comprehensive validation
- Large file support