wizardcompass/robots-txt-parser

A PHP library to parse and analyze robots.txt files with validation and URL fetching capabilities



A comprehensive PHP library for parsing and analyzing robots.txt files. This package provides functionality to fetch, parse, and validate robots.txt files with support for large files and streaming downloads.

Features

  • 🚀 Fast parsing of robots.txt content with detailed statistics
  • 🌐 Smart URL fetching with automatic robots.txt path resolution
  • 🔄 HTTP streaming with Guzzle for reliable downloads and size limits
  • ✅ Validation with detailed error and warning reporting
  • 📊 Comprehensive analysis including directive counts by type and user agent
  • 🛡️ Size protection with Google's 500KB limit enforcement
  • 🔧 Timeout handling and redirect support
  • 📈 Performance optimized for large files (tested up to 5MB+)

Installation

Install via Composer:

composer require wizardcompass/robots-txt-parser

Requirements

  • PHP 8.1 or higher
  • Guzzle HTTP 7.0+ (for URL fetching)

Quick Start

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

// Parse from string
$robotsTxt = "User-agent: *\nDisallow: /admin\nAllow: /public";
$result = $parser->parse($robotsTxt);

// Parse from URL (automatically appends /robots.txt)
$result = $parser->parseFromUrl('https://example.com');

// Validate syntax
$validation = $parser->validate($robotsTxt);

Usage Examples

Basic Parsing

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

$content = <<<'EOT'
# Example robots.txt
User-agent: *
Disallow: /admin
Disallow: /private
Allow: /public
Crawl-delay: 10

User-agent: Googlebot
Allow: /admin/public
Disallow: /admin/private

Sitemap: https://example.com/sitemap.xml
EOT;

$result = $parser->parse($content);

echo "File size: " . $result['size'] . " bytes\n";
echo "Comments: " . $result['comment_count'] . "\n";
echo "User agents: " . $result['record_counts']['by_type']['user_agent'] . "\n";
echo "Disallow rules: " . $result['record_counts']['by_type']['disallow'] . "\n";

// Access user-agent specific data
foreach ($result['record_counts']['by_useragent'] as $userAgent => $counts) {
    echo "User-agent '{$userAgent}' has {$counts['disallow']} disallow rules\n";
}

// Access sitemap information
echo "Sitemaps found: " . count($result['sitemaps']) . "\n";
foreach ($result['sitemaps'] as $sitemap) {
    echo "Sitemap: {$sitemap}\n";
}

Fetching from URL

The parser handles robots.txt URL resolution for you: pass any URL and it fetches the robots.txt file from the root of that domain.

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

try {
    // All of these will fetch from https://example.com/robots.txt
    $result = $parser->parseFromUrl('https://example.com');
    $result = $parser->parseFromUrl('https://example.com/');
    $result = $parser->parseFromUrl('https://example.com/some/page');
    $result = $parser->parseFromUrl('https://example.com/robots.txt'); // explicit
    
    // Custom timeout (60 seconds instead of default 30)
    $result = $parser->parseFromUrl('https://example.com', 60);
    
    echo "Status: " . $result['status'] . "\n";
    echo "Redirected: " . ($result['redirected'] ? 'Yes' : 'No') . "\n";
    echo "Size: " . number_format($result['size_kib'], 2) . " KB\n";
    echo "User agents found: " . count($result['record_counts']['by_useragent']) . "\n";
    
    if (isset($result['size_limit_exceeded'])) {
        echo "Warning: File exceeded 500KB limit and was truncated\n";
    }
    
} catch (InvalidArgumentException $e) {
    echo "Invalid URL: " . $e->getMessage() . "\n";
} catch (RuntimeException $e) {
    echo "Failed to fetch: " . $e->getMessage() . "\n";
}

Validation

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

$content = <<<'EOT'
User-agent: *
Disallow /admin
Crawl-delay: invalid
Sitemap: not-a-url
Unknown-directive: value
EOT;

$validation = $parser->validate($content);

if (!$validation['is_valid']) {
    echo "Validation failed!\n\n";
    
    foreach ($validation['errors'] as $error) {
        echo "❌ Error: $error\n";
    }
}

if (!empty($validation['warnings'])) {
    echo "\nWarnings:\n";
    foreach ($validation['warnings'] as $warning) {
        echo "⚠️  Warning: $warning\n";
    }
}

Advanced Options

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

// Parse with additional metadata
$result = $parser->parse($content, [
    'status' => 404,      // HTTP status code
    'redirected' => true  // Whether the request was redirected
]);

// Check if file exceeds Google's size limit
if ($result['over_google_limit']) {
    echo "Warning: File exceeds Google's 500KB recommendation\n";
}

// Analyze directive distribution
$recordCounts = $result['record_counts']['by_type'];
echo "Directive breakdown:\n";
echo "- User-agent: {$recordCounts['user_agent']}\n";
echo "- Disallow: {$recordCounts['disallow']}\n";
echo "- Allow: {$recordCounts['allow']}\n";
echo "- Sitemap: {$recordCounts['sitemap']}\n";
echo "- Crawl-delay: {$recordCounts['crawl_delay']}\n";
echo "- Other: {$recordCounts['other']}\n";

Response Format

Parse Results

[
    'comment_count' => 2,              // Number of comment lines
    'over_google_limit' => false,      // Whether file exceeds 500KB
    'record_counts' => [
        'by_type' => [
            'allow' => 1,
            'crawl_delay' => 1,
            'disallow' => 3,
            'noindex' => 0,
            'other' => 0,
            'sitemap' => 1,
            'user_agent' => 2
        ],
        'by_useragent' => [
            '*' => [
                'allow' => 1,
                'crawl_delay' => 1,
                'disallow' => 2,
                'noindex' => 0,
                'other' => 0
            ],
            'Googlebot' => [
                'allow' => 0,
                'crawl_delay' => 0,
                'disallow' => 1,
                'noindex' => 0,
                'other' => 0
            ]
        ]
    ],
    'sitemaps' => [                    // Array of sitemap URLs found
        'https://example.com/sitemap.xml',
        'https://example.com/sitemap-news.xml'
    ],
    'redirected' => false,             // Whether URL was redirected
    'size' => 150,                     // File size in bytes
    'size_kib' => 0.146484375,        // File size in KiB
    'status' => 200                    // HTTP status code
]

Validation Results

[
    'is_valid' => false,              // Overall validation status
    'warnings' => [                   // Non-critical issues
        'Line 5: Unknown directive "unknown-directive"'
    ],
    'errors' => [                     // Critical syntax errors
        'Line 2: Invalid syntax - missing colon: "Disallow /admin"',
        'Line 3: Crawl-delay value must be a non-negative number'
    ]
]

Size Limits and Performance

This library implements Google's recommended 500KB size limit for robots.txt files:

  • Parsing: Files of any size can be parsed, with a warning flag for files exceeding 500KB
  • URL Fetching: Downloads are automatically terminated at 500KB using Guzzle streaming
  • Performance: Optimized for large files with efficient string processing
  • Memory: Uses HTTP streaming to minimize memory usage
  • Reliability: Guzzle HTTP client provides robust error handling and redirect support
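
A minimal sketch of working with the size-related flags described above. It only uses keys shown in the Response Format section (over_google_limit, size_limit_exceeded, size_kib); https://example.com is a placeholder URL.

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();
$result = $parser->parseFromUrl('https://example.com');

// Set when the parsed content exceeds Google's 500KB recommendation
if ($result['over_google_limit']) {
    echo "robots.txt is larger than 500KB; crawlers may ignore the excess\n";
}

// Set when the download itself was cut off at the 500KB streaming limit
if (isset($result['size_limit_exceeded'])) {
    echo "Download was truncated at 500KB\n";
}

echo "Parsed " . number_format($result['size_kib'], 2) . " KiB\n";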

Supported Directives

  • User-agent: Specifies which web crawler the rules apply to
  • Disallow: Specifies paths that should not be crawled
  • Allow: Explicitly allows crawling of specific paths
  • Crawl-delay / Crawldelay: Specifies delay between requests
  • Sitemap: Specifies the location of sitemap files
  • Noindex: Prevents indexing of specific paths
  • Request-rate: Controls request rate (marked as "other")
  • Visit-time: Specifies preferred visit times (marked as "other")
  • Host: Specifies preferred host (marked as "other")
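
The sketch below illustrates how the directives listed above land in the count buckets, assuming the grouping follows the Response Format section: Crawl-delay and Noindex have their own counters, while Request-rate, Visit-time, and Host are counted under "other".

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

$content = <<<'EOT'
User-agent: *
Disallow: /tmp
Crawl-delay: 5
Noindex: /drafts
Request-rate: 1/10
Visit-time: 0600-0845
Host: example.com
Sitemap: https://example.com/sitemap.xml
EOT;

$result = $parser->parse($content);
$byType = $result['record_counts']['by_type'];

// Crawl-delay and Noindex have dedicated buckets...
echo "Crawl-delay: {$byType['crawl_delay']}\n";
echo "Noindex: {$byType['noindex']}\n";

// ...while Request-rate, Visit-time, and Host are counted as "other"
echo "Other: {$byType['other']}\n";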

Error Handling

The library uses proper PHP exceptions:

  • InvalidArgumentException: For invalid URLs or parameters
  • RuntimeException: For network errors or file access issues

Always wrap URL operations in try-catch blocks for robust error handling.
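
A minimal sketch of that pattern, distinguishing the two exception types (the same exceptions appear in the Fetching from URL example above):

<?php

use WizardCompass\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();

try {
    $result = $parser->parseFromUrl('not a valid url');
} catch (InvalidArgumentException $e) {
    // Invalid URL or parameter, raised before any network request
    echo "Bad input: " . $e->getMessage() . "\n";
} catch (RuntimeException $e) {
    // Network failure, timeout, or unreadable response
    echo "Fetch failed: " . $e->getMessage() . "\n";
}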

Testing

Run the test suite:

# Install dependencies
composer install

# Run tests
composer test

# Run tests with coverage
composer test-coverage

The test suite includes:

  • Unit tests for all parsing functionality
  • Large file tests (up to 5MB)
  • Performance benchmarks
  • Edge case validation
  • URL fetching simulation

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Inspiration

This project was inspired by HTTP Archive's robots.txt analysis capabilities and aims to provide the same level of detailed parsing and validation for PHP applications.

Changelog

v1.0.0

  • Initial release
  • Basic parsing functionality
  • URL fetching with size limits
  • Comprehensive validation
  • Large file support