leopoletto/robots-txt-parser

A comprehensive PHP package for parsing robots.txt files, including support for meta tags and X-Robots-Tag HTTP headers

This library parses and analyzes robots.txt files to help you understand their structure and content, with support for X-Robots-Tag HTTP headers and robots meta tags from HTML pages.

Note: This library is designed for parsing and analyzing robots.txt files to understand their structure. It does not validate whether a specific bot can crawl a specific URL.

Installation

Install via Composer:

composer require leopoletto/robots-txt-parser

Requirements

  • PHP 8.2 or higher

Dependencies

  • Guzzle HTTP Client
  • Illuminate Collections

Quick Start

use Leopoletto\RobotsTxtParser\RobotsTxtParser;

// Instantiate the parser
$parser = new RobotsTxtParser();

// Configure your bot's user agent (required for parseUrl)
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com/mybot');

// Parse from URL
$response = $parser->parseUrl('https://example.com');

Configuration

Before parsing from a URL, you must configure your bot's user agent. This is used when making HTTP requests.

Method 1: Using configureUserAgent()

$parser->configureUserAgent('BotName', '1.0', 'https://example.com/bot');
// Results in: Mozilla/5.0 (compatible; BotName/1.0; https://example.com/bot)

Method 2: Using setUserAgent()

$parser->setUserAgent('MyCustomUserAgent/1.0');

Parsing Methods

The library provides three methods for parsing robots.txt content:

1. Parse from URL (parseUrl)

Parses robots.txt from a URL and also extracts:

  • X-Robots-Tag HTTP headers from the robots.txt response
  • Meta tags (robots, googlebot, googlebot-news) from the HTML page if a non-robots.txt URL is provided

$parser = new RobotsTxtParser();
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com');

// Parse from any URL (will automatically fetch /robots.txt)
$response = $parser->parseUrl('https://example.com');
// or
$response = $parser->parseUrl('https://example.com/robots.txt');

$records = $response->records();

What parseUrl returns:

  • All robots.txt directives (User-agent, Allow, Disallow, Crawl-delay, Sitemap)
  • X-Robots-Tag headers from the robots.txt response
  • Meta tags from the HTML page (if parsing a non-robots.txt URL)
  • Comments and syntax errors

2. Parse from File (parseFile)

Parses a robots.txt file from the local filesystem.

$parser = new RobotsTxtParser();
$response = $parser->parseFile('/path/to/robots.txt');

$records = $response->records();

3. Parse from Text (parseText)

Parses robots.txt content directly from a string.

$parser = new RobotsTxtParser();
$content = "User-agent: *\nDisallow: /admin/";
$response = $parser->parseText($content);

$records = $response->records();

Accessing Parsed Data

All parsing methods return a Response object with the following methods:

Basic Information

$response = $parser->parseUrl('https://example.com');

// Get the size of the parsed content in bytes
$size = $response->size();

// Get all records as a collection
$records = $response->records();

// Get the total number of parsed lines
$totalLines = $records->lines();

User Agents

Get all user agents and their directives:

// Get all user agents
$userAgents = $records->userAgents()->toArray();

// Get a specific user agent
$googlebot = $records->userAgents('Googlebot')->toArray();

Example output:

{
    "*": {
        "line": 19,
        "userAgent": "*",
        "allow": [
            {
                "line": 20,
                "directive": "allow",
                "path": "/researchtools/ose/$"
            }
        ],
        "disallow": [
            {
                "line": 32,
                "directive": "disallow",
                "path": "/admin/"
            }
        ],
        "crawlDelay": []
    },
    "GPTBot": {
        "line": 11,
        "userAgent": "GPTBot",
        "allow": [],
        "disallow": [
            {
                "line": 12,
                "directive": "disallow",
                "path": "/blog/"
            }
        ],
        "crawlDelay": []
    }
}

Directives

Get specific directive types:

// Get all disallowed paths
$disallowed = $records->disallowed()->toArray();

// Get disallowed paths for a specific user agent
$disallowed = $records->disallowed('Googlebot')->toArray();

// Get all allowed paths
$allowed = $records->allowed()->toArray();

// Get crawl delays
$crawlDelays = $records->crawlDelay()->toArray();

Example output:

[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/"
    },
    {
        "line": 33,
        "directive": "disallow",
        "path": "/private/"
    }
]
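
As noted earlier, the library reports directives but does not decide whether a given bot may crawl a given URL. If you need such a check, you can build one on top of the parsed data. The sketch below is a deliberately naive prefix match using only the disallowed() accessor shown above; it ignores wildcard patterns, $ anchors, and longest-match precedence, so treat it as an illustration rather than a compliant matcher.

use Leopoletto\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com/mybot');

$records = $parser->parseUrl('https://example.com')->records();

// Naive rule: a path is considered blocked if it starts with any disallowed prefix.
$path = '/admin/settings';
$blocked = false;

foreach ($records->disallowed('MyBot')->toArray() as $rule) {
    if ($rule['path'] !== '' && str_starts_with($path, $rule['path'])) {
        $blocked = true;
        break;
    }
}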

Display User Agent Information

When you want to see which user agents apply to each directive:

// Show user agents as an array for each directive
$disallowed = $records->displayUserAgent()->disallowed()->toArray();

Example output:

[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": ["*", "GPT-User"]
    }
]

When querying by a specific user agent with displayUserAgent(), directives are expanded:

// Expand directives for all user agents in the same group
$disallowed = $records->displayUserAgent()->disallowed('*')->toArray();

Example output:

[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": "*"
    },
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": "GPT-User"
    }
]

Sitemaps

$sitemaps = $records->sitemaps()->toArray();

Example output:

[
    {
        "line": 52,
        "url": "https://example.com/sitemap.xml",
        "valid": true
    }
]
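
For example, assuming the structure shown above, you could keep only the sitemap entries that were flagged as valid:

// Keep only sitemap entries flagged as valid (structure as in the example output).
$validSitemaps = array_values(array_filter(
    $records->sitemaps()->toArray(),
    fn (array $sitemap): bool => $sitemap['valid']
));

// Extract just the URLs
$urls = array_column($validSitemaps, 'url');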

Comments

$comments = $records->comments()->toArray();

Example output:

[
    {
        "line": 1,
        "comment": "File last updated May 5, 2025"
    }
]

X-Robots-Tag Headers (from parseUrl)

When parsing from a URL, you can access X-Robots-Tag HTTP headers:

$headers = $records->headersDirectives()->toArray();

Example output:

[
    {
        "X-Robots-Tag": ["all"]
    }
]
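
A minimal sketch, assuming the header entries are keyed by header name as in the example output, for checking whether the response carried a noindex directive:

// Look for "noindex" among the X-Robots-Tag values (structure as shown above).
$noindex = false;

foreach ($records->headersDirectives()->toArray() as $header) {
    if (in_array('noindex', $header['X-Robots-Tag'] ?? [], true)) {
        $noindex = true;
        break;
    }
}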

Meta Tags (from parseUrl)

When parsing a URL other than robots.txt itself, you can also access the robots meta tags from the HTML page:

$metaTags = $records->metaTagsDirectives()->toArray();

Example output:

[
    [
        "index",
        "follow",
        "max-image-preview:large",
        "max-snippet:-1",
        "max-video-preview:-1"
    ]
]
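
Since each entry is a flat list of directive values (as in the example output), you could merge them and look for specific directives; a minimal sketch:

// Flatten all meta tag directive lists into one array (structure as shown above).
$directives = array_merge(...$records->metaTagsDirectives()->toArray());

$indexable = !in_array('noindex', $directives, true)
    && !in_array('none', $directives, true);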

Syntax Errors

Check for parsing errors:

$errors = $records->syntaxErrors()->toArray();

Example output:

[
    {
        "line": 5,
        "message": "Directive must follow a user agent"
    }
]
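
These are easy to surface in a report; for example:

// Print each syntax error with its line number (structure as shown above).
foreach ($records->syntaxErrors()->toArray() as $error) {
    echo sprintf("Line %d: %s\n", $error['line'], $error['message']);
}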

Complete Example

Here's a complete example showing all available data:

use Leopoletto\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com/mybot');

// Parse from URL
$response = $parser->parseUrl('https://example.com');
$records = $response->records();

// Build comprehensive response
$data = [
    'size' => $response->size(),
    'lines' => $records->lines(),
    'user_agents' => $records->userAgents()->toArray(),
    'disallowed' => $records->displayUserAgent()->disallowed()->toArray(),
    'allowed' => $records->allowed()->toArray(),
    'crawls_delay' => $records->crawlDelay()->toArray(),
    'sitemaps' => $records->sitemaps()->toArray(),
    'comments' => $records->comments()->toArray(),
    'html' => $records->metaTagsDirectives()->toArray(),      // From parseUrl only
    'headers' => $records->headersDirectives()->toArray(),    // From parseUrl only
    'errors' => $records->syntaxErrors()->toArray(),
];

return response()->json($data);

See public/example.json for a complete example of the output structure.

User Agent Groups

The library correctly handles consecutive User-agent declarations, which in the robots.txt format form a group that shares the same directives:

User-agent: *
User-agent: GPT-User
Disallow: /admin/

Both * and GPT-User will have the same directives. When you query by either user agent, you'll get the same results:

$disallowed1 = $records->disallowed('*')->toArray();
$disallowed2 = $records->disallowed('GPT-User')->toArray();
// Both return the same directives

Features

  • ✅ Parse robots.txt from URL, file, or text
  • ✅ Extract X-Robots-Tag HTTP headers
  • ✅ Extract robots meta tags from HTML pages
  • ✅ Handle consecutive User-agent declarations (groups)
  • ✅ Efficient storage (no duplicate directives)
  • ✅ Support for all standard directives (Allow, Disallow, Crawl-delay, Sitemap)
  • ✅ Comments and syntax error detection
  • ✅ Memory-efficient streaming for large files
  • ✅ Comprehensive test coverage

Credits

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

The MIT License (MIT). Please see License File for more information.