tonsoo / php-crawler
pkg:composer/tonsoo/php-crawler
Requires
- php: ^8.4
- ext-curl: *
- ext-dom: *
- ext-xmlwriter: *
- league/uri: ^7.8
- nesbot/carbon: ^3.11
Suggests
- guzzlehttp/psr7: Optional: standards-based URI parsing/resolution
- symfony/dom-crawler: Optional: richer DOM traversal APIs
README
A small, dependency-light PHP crawler that walks a site and generates XML sitemaps. It follows links, respects meta robots directives, and ships with a sitemap extension that can write a single sitemap or rotate into multiple files with an index.
Requirements
- PHP 8.4+
- Extensions: ext-dom, ext-curl, ext-xmlwriter
Installation
composer require tonsoo/php-crawler
Quick Start
```php
<?php

use Tonsoo\PhpCrawler\Extensions\SitemapExtension;
use Tonsoo\PhpCrawler\Sitemap\SitemapGenerator;
use Tonsoo\PhpCrawler\Sitemap\Writers\RotatingSitemapWriter;

require __DIR__ . '/vendor/autoload.php';

crawler()
    ->preserveHost()
    ->respectCanonical(false)
    ->maxPages(1000)
    ->extension(
        new SitemapExtension(
            generator: new SitemapGenerator(
                writer: new RotatingSitemapWriter(
                    directory: __DIR__ . '/sitemap'
                )
            )
        )
    )
    ->start('https://example.com');
```
This will crawl https://example.com, write sitemap.xml (or sitemap-2.xml, sitemap-3.xml, etc.), and produce a sitemap-index.xml once multiple sitemap files are created.
Crawler Configuration
The crawler is configured via a fluent API on Crawler:
```php
crawler()
    ->displayCrawls(true)
    ->displayMemoryInfo(true)
    ->respectNoIndex(true)
    ->respectNoFollow(true)
    ->respectCanonical(true)
    ->preserveScheme(true)
    ->preserveHost(true)
    ->maxPages(5000)
    ->start('https://example.com');
```
What these options do
- displayCrawls(true): toggles crawl logging (currently not used by the built-in logger).
- displayMemoryInfo(true): toggles memory logging (currently not used by the built-in logger).
- respectNoIndex(true): honors <meta name="robots" content="noindex"> (default: true).
- respectNoFollow(true): honors <meta name="robots" content="nofollow"> (default: true).
- respectCanonical(true): uses the canonical URL for link resolution (default: true).
- preserveScheme(true): stays on the same scheme (http vs https) (default: true).
- preserveHost(true): stays on the same host (default: true).
- maxPages(5000): stops after a page limit (default: null = unlimited).
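For example, to let a crawl follow links across hosts and schemes while still capping the run, these options can be combined. A minimal sketch, assuming that passing false simply disables the corresponding restriction:

```php
crawler()
    ->preserveHost(false)   // follow links that point to other hosts
    ->preserveScheme(false) // allow switching between http and https
    ->maxPages(500)         // stop after 500 pages instead of crawling unbounded
    ->start('https://example.com');
```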
Sitemap Generation
Single sitemap
```php
use Tonsoo\PhpCrawler\Sitemap\SitemapGenerator;
use Tonsoo\PhpCrawler\Sitemap\Writers\XmlSitemapWriter;
use Tonsoo\PhpCrawler\Extensions\SitemapExtension;

crawler()
    ->extension(
        new SitemapExtension(
            generator: new SitemapGenerator(
                writer: new XmlSitemapWriter(
                    path: __DIR__ . '/sitemap/sitemap.xml'
                )
            )
        )
    )
    ->start('https://example.com');
```
Rotating sitemap + index
```php
use Tonsoo\PhpCrawler\Sitemap\SitemapGenerator;
use Tonsoo\PhpCrawler\Sitemap\Writers\RotatingSitemapWriter;
use Tonsoo\PhpCrawler\Extensions\SitemapExtension;

crawler()
    ->extension(
        new SitemapExtension(
            generator: new SitemapGenerator(
                writer: new RotatingSitemapWriter(
                    directory: __DIR__ . '/sitemap',
                    baseName: 'sitemap',
                    extension: 'xml',
                    maxUrls: 50000
                )
            )
        )
    )
    ->start('https://example.com');
```
Notes:
- RotatingSitemapWriter requires the directory to already exist (see the sketch below).
- The index file is written only when more than one sitemap file is created.
- The index stores the sitemap filenames (relative paths), not absolute URLs.
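Since the writer does not create the output directory for you, a minimal sketch of preparing it with plain PHP before starting the crawl:

```php
use Tonsoo\PhpCrawler\Extensions\SitemapExtension;
use Tonsoo\PhpCrawler\Sitemap\SitemapGenerator;
use Tonsoo\PhpCrawler\Sitemap\Writers\RotatingSitemapWriter;

$directory = __DIR__ . '/sitemap';

// RotatingSitemapWriter expects this directory to already exist
if (!is_dir($directory)) {
    mkdir($directory, 0775, true);
}

crawler()
    ->extension(
        new SitemapExtension(
            generator: new SitemapGenerator(
                writer: new RotatingSitemapWriter(directory: $directory)
            )
        )
    )
    ->start('https://example.com');
```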
Events
You can subscribe to crawler events to observe or extend behavior:
```php
use Tonsoo\PhpCrawler\Events\OnCrawled;
use Tonsoo\PhpCrawler\Events\OnFinish;
use Tonsoo\PhpCrawler\Events\OnLinkFound;
use Tonsoo\PhpCrawler\Events\OnMismatchContent;
use Tonsoo\PhpCrawler\Events\OnMissingHtmlBody;
use Tonsoo\PhpCrawler\Events\OnStart;

crawler()
    ->onStart(fn (OnStart $event) => print("Starting\n"))
    ->onLinkFound(fn (OnLinkFound $event) => print("{$event->url} -> {$event->link}\n"))
    ->onCrawled(fn (OnCrawled $event) => print("Crawled {$event->page->uri}\n"))
    ->onMissingHtmlBody(fn (OnMissingHtmlBody $event) => print("No HTML: {$event->url}\n"))
    ->onMismatchContent(fn (OnMismatchContent $event) => print("Wrong content type: {$event->url}\n"))
    ->onFinish(fn (OnFinish $event) => print("Done: {$event->totalPages} pages\n"))
    ->start('https://example.com');
```
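Beyond logging, events can also be used to collect data during a crawl. A minimal sketch, assuming the event properties shown above, that gathers every crawled URI into an array:

```php
use Tonsoo\PhpCrawler\Events\OnCrawled;

$urls = [];

crawler()
    ->onCrawled(function (OnCrawled $event) use (&$urls) {
        // Record each crawled page's URI for later processing
        $urls[] = (string) $event->page->uri;
    })
    ->start('https://example.com');

// For example, dump the visited URLs to a plain-text file
file_put_contents(__DIR__ . '/crawled-urls.txt', implode("\n", $urls) . "\n");
```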
Custom HTTP Client, Logger, and Analyzer
You can plug in your own implementations:
```php
use Tonsoo\PhpCrawler\Http\HttpClientInterface;
use Tonsoo\PhpCrawler\Logger\LoggerInterface;
use Tonsoo\PhpCrawler\Analysis\PageAnalyzerInterface;

crawler()
    ->httpClient(new YourHttpClient())
    ->logger(new YourLogger())
    ->pageAnalyzer(new YourAnalyzer())
    ->start('https://example.com');
```
Defaults:
- HTTP client: CurlHttpClient (follows redirects, 4s connect/total timeout, custom UA string).
- Logger: ConsoleLogger (timestamps to stdout).
- Analyzer: DomDocumentPageAnalyzer (DOM + XPath).
Interfaces to implement:
- HttpClientInterface::fetch(string $url): Result
- LoggerInterface::log(string $message): void
- PageAnalyzerInterface::analyze(Result $result, bool $respectNoIndex, bool $respectNoFollow): PageAnalysis
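As a concrete example, a file-based logger only needs the single log() method from LoggerInterface. This is a hypothetical sketch (FileLogger is not part of the package):

```php
use Tonsoo\PhpCrawler\Logger\LoggerInterface;

// Hypothetical replacement for the default ConsoleLogger:
// appends timestamped messages to a file instead of writing to stdout.
final class FileLogger implements LoggerInterface
{
    public function __construct(private string $path)
    {
    }

    public function log(string $message): void
    {
        file_put_contents($this->path, date('c') . ' ' . $message . PHP_EOL, FILE_APPEND);
    }
}

crawler()
    ->logger(new FileLogger(__DIR__ . '/crawl.log'))
    ->start('https://example.com');
```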
Error Handling
If maxPages is set and the crawler reaches the limit, it throws LimitExceededException after finishing the crawl loop:
```php
use Tonsoo\PhpCrawler\Crawler\Exception\LimitExceededException;

try {
    crawler()->maxPages(100)->start('https://example.com');
} catch (LimitExceededException $e) {
    // handle limit reached
}
```
Crawling Behavior
The crawler only processes pages that return an HTML body with a text/html content type. If a page has no HTML body or a non-HTML content type, it is skipped and the corresponding event is emitted.
The crawler collects links from <a href="..."> elements and normalizes them. It will:
- Resolve relative URLs against the current page
- Drop fragments (the #... part)
- Ignore non-HTTP(S) schemes
- Optionally restrict links by host and scheme
- Optionally respect noindex/nofollow meta tags (from <meta name="robots">)
- Use canonical URLs when enabled
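To make the fragment, scheme, and host rules above concrete, here is an illustrative, deliberately simplified sketch in plain PHP. It is not the package's implementation (normalizeLink is a hypothetical helper), only a rough outline of the same rules:

```php
function normalizeLink(string $link, string $currentUrl): ?string
{
    // Drop the fragment (#...) part
    $link = explode('#', $link, 2)[0];
    if ($link === '') {
        return null;
    }

    $base  = parse_url($currentUrl);
    $parts = parse_url($link);

    // Resolve relative URLs against the current page (naive path handling)
    if (!isset($parts['scheme'])) {
        $path = str_starts_with($link, '/')
            ? $link
            : rtrim(dirname($base['path'] ?? '/'), '/') . '/' . $link;

        return $base['scheme'] . '://' . $base['host'] . $path;
    }

    // Ignore non-HTTP(S) schemes such as mailto: or javascript:
    if (!in_array($parts['scheme'], ['http', 'https'], true)) {
        return null;
    }

    // Optionally restrict links to the same host and scheme
    if (($parts['host'] ?? null) !== ($base['host'] ?? null)) {
        return null;
    }
    if ($parts['scheme'] !== ($base['scheme'] ?? null)) {
        return null;
    }

    return $link;
}
```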
This crawler does not parse robots.txt.
Example Script
See examples/crawler.php for a full working example.