leopoletto/robots-txt-parser

A comprehensive PHP package for parsing robots.txt files, including support for meta tags and X-Robots-Tag HTTP headers

This library parses and analyzes robots.txt files to help you understand their structure and content, with support for X-Robots-Tag HTTP headers and robots meta tags from HTML pages.

Note: This library is designed for parsing and analyzing robots.txt files to understand their structure. It does not validate whether a specific bot can crawl a specific URL.

Installation

Install via Composer:

composer require leopoletto/robots-txt-parser

Requirements

  • PHP 8.2 or higher

Dependencies

  • Guzzle HTTP Client
  • Illuminate Collections

Quick Start

use Leopoletto\RobotsTxtParser\RobotsTxtParser;

// Instantiate the parser
$parser = new RobotsTxtParser();

// Configure your bot's user agent (required for parseUrl)
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com/mybot');

// Parse from URL
$response = $parser->parseUrl('https://example.com');

Configuration

Before parsing from a URL, you must configure your bot's user agent. This is used when making HTTP requests.

Method 1: Using configureUserAgent()

$parser->configureUserAgent('BotName', '1.0', 'https://example.com/bot');
// Results in: Mozilla/5.0 (compatible; BotName/1.0; https://example.com/bot)

Method 2: Using setUserAgent()

$parser->setUserAgent('MyCustomUserAgent/1.0');

Parsing Methods

The library provides three methods for parsing robots.txt content:

1. Parse from URL (parseUrl)

Parses robots.txt from a URL and also extracts:

  • X-Robots-Tag HTTP headers from the robots.txt response
  • Meta tags (robots, googlebot, googlebot-news) from the HTML page if a non-robots.txt URL is provided

$parser = new RobotsTxtParser();
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com');

// Parse from any URL (will automatically fetch /robots.txt)
$response = $parser->parseUrl('https://example.com');
// or
$response = $parser->parseUrl('https://example.com/robots.txt');

$records = $response->records();

What parseUrl returns:

  • All robots.txt directives (User-agent, Allow, Disallow, Crawl-delay, Sitemap)
  • X-Robots-Tag headers from the robots.txt response
  • Meta tags from the HTML page (if parsing a non-robots.txt URL)
  • Comments and syntax errors

2. Parse from File (parseFile)

Parses a robots.txt file from the local filesystem.

$parser = new RobotsTxtParser();
$response = $parser->parseFile('/path/to/robots.txt');

$records = $response->records();

3. Parse from Text (parseText)

Parses robots.txt content directly from a string.

$parser = new RobotsTxtParser();
$content = "User-agent: *\nDisallow: /admin/";
$response = $parser->parseText($content);

$records = $response->records();

Accessing Parsed Data

All parsing methods return a Response object with the following methods:

Basic Information

$response = $parser->parseUrl('https://example.com');

// Get the size of the parsed content in bytes
$size = $response->size();

// Get all records as a collection
$records = $response->records();

// Get the total number of parsed lines
$totalLines = $records->lines();

User Agents

Get all user agents and their directives:

// Get all user agents
$userAgents = $records->userAgents()->toArray();

// Get a specific user agent
$googlebot = $records->userAgents('Googlebot')->toArray();

Example output:

{
    "*": {
        "line": 19,
        "userAgent": "*",
        "allow": [
            {
                "line": 20,
                "directive": "allow",
                "path": "/researchtools/ose/$"
            }
        ],
        "disallow": [
            {
                "line": 32,
                "directive": "disallow",
                "path": "/admin/"
            }
        ],
        "crawlDelay": []
    },
    "GPTBot": {
        "line": 11,
        "userAgent": "GPTBot",
        "allow": [],
        "disallow": [
            {
                "line": 12,
                "directive": "disallow",
                "path": "/blog/"
            }
        ],
        "crawlDelay": []
    }
}

Directives

Get specific directive types:

// Get all disallowed paths
$disallowed = $records->disallowed()->toArray();

// Get disallowed paths for a specific user agent
$disallowed = $records->disallowed('Googlebot')->toArray();

// Get all allowed paths
$allowed = $records->allowed()->toArray();

// Get crawl delays
$crawlDelays = $records->crawlDelay()->toArray();

Example output:

[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/"
    },
    {
        "line": 33,
        "directive": "disallow",
        "path": "/private/"
    }
]
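
As noted earlier, the library reports directives but does not decide whether a given bot may crawl a given URL. If you need such a check, you can build one on top of the parsed data. The sketch below is a deliberately naive prefix match using only the disallowed() accessor shown above; it ignores wildcard patterns, $ anchors, and longest-match precedence, so treat it as an illustration rather than a compliant matcher.

use Leopoletto\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com/mybot');

$records = $parser->parseUrl('https://example.com')->records();

// Naive rule: a path is considered blocked if it starts with any disallowed prefix.
$path = '/admin/settings';
$blocked = false;

foreach ($records->disallowed('MyBot')->toArray() as $rule) {
    if ($rule['path'] !== '' && str_starts_with($path, $rule['path'])) {
        $blocked = true;
        break;
    }
}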

Display User Agent Information

When you want to see which user agents apply to each directive:

// Show user agents as an array for each directive
$disallowed = $records->displayUserAgent()->disallowed()->toArray();

Example output:

[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": ["*", "GPT-User"]
    }
]

When querying by a specific user agent with displayUserAgent(), directives are expanded:

// Expand directives for all user agents in the same group
$disallowed = $records->displayUserAgent()->disallowed('*')->toArray();

Example output:

[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": "*"
    },
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": "GPT-User"
    }
]

Sitemaps

$sitemaps = $records->sitemaps()->toArray();

Example output:

[
    {
        "line": 52,
        "url": "https://example.com/sitemap.xml",
        "valid": true
    }
]
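
For example, assuming the structure shown above, you could keep only the sitemap entries that were flagged as valid:

// Keep only sitemap entries flagged as valid (structure as in the example output).
$validSitemaps = array_values(array_filter(
    $records->sitemaps()->toArray(),
    fn (array $sitemap): bool => $sitemap['valid']
));

// Extract just the URLs
$urls = array_column($validSitemaps, 'url');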

Comments

$comments = $records->comments()->toArray();

Example output:

[
    {
        "line": 1,
        "comment": "File last updated May 5, 2025"
    }
]

X-Robots-Tag Headers (from parseUrl)

When parsing from a URL, you can access X-Robots-Tag HTTP headers:

$headers = $records->headersDirectives()->toArray();

Example output:

[
    {
        "X-Robots-Tag": ["all"]
    }
]
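
A minimal sketch, assuming the header entries are keyed by header name as in the example output, for checking whether the response carried a noindex directive:

// Look for "noindex" among the X-Robots-Tag values (structure as shown above).
$noindex = false;

foreach ($records->headersDirectives()->toArray() as $header) {
    if (in_array('noindex', $header['X-Robots-Tag'] ?? [], true)) {
        $noindex = true;
        break;
    }
}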

Meta Tags (from parseUrl)

When parsing a URL other than robots.txt itself, you can also access the robots meta tags from the HTML page:

$metaTags = $records->metaTagsDirectives()->toArray();

Example output:

[
    [
        "index",
        "follow",
        "max-image-preview:large",
        "max-snippet:-1",
        "max-video-preview:-1"
    ]
]
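
Since each entry is a flat list of directive values (as in the example output), you could merge them and look for specific directives; a minimal sketch:

// Flatten all meta tag directive lists into one array (structure as shown above).
$directives = array_merge(...$records->metaTagsDirectives()->toArray());

$indexable = !in_array('noindex', $directives, true)
    && !in_array('none', $directives, true);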

Syntax Errors

Check for parsing errors:

$errors = $records->syntaxErrors()->toArray();

Example output:

[
    {
        "line": 5,
        "message": "Directive must follow a user agent"
    }
]
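
These are easy to surface in a report; for example:

// Print each syntax error with its line number (structure as shown above).
foreach ($records->syntaxErrors()->toArray() as $error) {
    echo sprintf("Line %d: %s\n", $error['line'], $error['message']);
}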

Complete Example

Here's a complete example showing all available data:

use Leopoletto\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com/mybot');

// Parse from URL
$response = $parser->parseUrl('https://example.com');
$records = $response->records();

// Build comprehensive response
$data = [
    'size' => $response->size(),
    'lines' => $records->lines(),
    'user_agents' => $records->userAgents()->toArray(),
    'disallowed' => $records->displayUserAgent()->disallowed()->toArray(),
    'allowed' => $records->allowed()->toArray(),
    'crawls_delay' => $records->crawlDelay()->toArray(),
    'sitemaps' => $records->sitemaps()->toArray(),
    'comments' => $records->comments()->toArray(),
    'html' => $records->metaTagsDirectives()->toArray(),      // From parseUrl only
    'headers' => $records->headersDirectives()->toArray(),    // From parseUrl only
    'errors' => $records->syntaxErrors()->toArray(),
];

return response()->json($data);

See public/example.json for a complete example of the output structure.

User Agent Groups

The library correctly handles consecutive User-agent declarations, which in the robots.txt format form a group that shares the same directives:

User-agent: *
User-agent: GPT-User
Disallow: /admin/

Both * and GPT-User will have the same directives. When you query by either user agent, you'll get the same results:

$disallowed1 = $records->disallowed('*')->toArray();
$disallowed2 = $records->disallowed('GPT-User')->toArray();
// Both return the same directives

Features

  • ✅ Parse robots.txt from URL, file, or text
  • ✅ Extract X-Robots-Tag HTTP headers
  • ✅ Extract robots meta tags from HTML pages
  • ✅ Handle consecutive User-agent declarations (groups)
  • ✅ Efficient storage (no duplicate directives)
  • ✅ Support for all standard directives (Allow, Disallow, Crawl-delay, Sitemap)
  • ✅ Comments and syntax error detection
  • ✅ Memory-efficient streaming for large files
  • ✅ Comprehensive test coverage

Credits

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

The MIT License (MIT). Please see License File for more information.