tonsoo / php-crawler
pkg:composer/tonsoo/php-crawler
Requires
- php: ^8.4
- ext-curl: *
- ext-dom: *
- ext-xmlwriter: *
- league/uri: ^7.8
- nesbot/carbon: ^3.11
Suggests
- guzzlehttp/psr7: Optional: standards-based URI parsing/resolution
- symfony/dom-crawler: Optional: richer DOM traversal APIs
README
A small, dependency-light PHP crawler that walks a site and generates XML sitemaps. It follows links, respects meta robots directives, and ships with a sitemap extension that can write a single sitemap or rotate into multiple files with an index.
Requirements
- PHP 8.4+
- Extensions: ext-dom, ext-curl, ext-xmlwriter
Installation
composer require tonsoo/php-crawler
Quick Start
```php
<?php

use Tonsoo\PhpCrawler\Extensions\SitemapExtension;
use Tonsoo\PhpCrawler\Sitemap\SitemapGenerator;
use Tonsoo\PhpCrawler\Sitemap\Writers\RotatingSitemapWriter;

require __DIR__ . '/vendor/autoload.php';

crawler()
    ->preserveHost()
    ->respectCanonical(false)
    ->maxPages(1000)
    ->extension(
        new SitemapExtension(
            generator: new SitemapGenerator(
                writer: new RotatingSitemapWriter(
                    directory: __DIR__ . '/sitemap'
                )
            )
        )
    )
    ->start('https://example.com');
```
This will crawl https://example.com, write sitemap.xml (or sitemap-2.xml, sitemap-3.xml, etc.), and produce a sitemap-index.xml once multiple sitemap files are created.
Crawler Configuration
The crawler is configured via a fluent API on Crawler:
```php
crawler()
    ->displayCrawls(true)
    ->displayMemoryInfo(true)
    ->respectNoIndex(true)
    ->respectNoFollow(true)
    ->respectCanonical(true)
    ->preserveScheme(true)
    ->preserveHost(true)
    ->maxPages(5000)
    ->start('https://example.com');
```
What these options do
- displayCrawls(true): toggles crawl logging (currently not used by the built-in logger).
- displayMemoryInfo(true): toggles memory logging (currently not used by the built-in logger).
- respectNoIndex(true): honors <meta name="robots" content="noindex"> (default: true).
- respectNoFollow(true): honors <meta name="robots" content="nofollow"> (default: true).
- respectCanonical(true): uses the canonical URL for link resolution (default: true).
- preserveScheme(true): stays on the same scheme (http vs https) (default: true).
- preserveHost(true): stays on the same host (default: true).
- maxPages(5000): stops after a page limit (default: null = unlimited).
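For example, to let a crawl follow links across hosts and schemes while still capping the run, these options can be combined. A minimal sketch, assuming that passing false simply disables the corresponding restriction:

```php
crawler()
    ->preserveHost(false)   // follow links that point to other hosts
    ->preserveScheme(false) // allow switching between http and https
    ->maxPages(500)         // stop after 500 pages instead of crawling unbounded
    ->start('https://example.com');
```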
Sitemap Generation
Single sitemap
```php
use Tonsoo\PhpCrawler\Sitemap\SitemapGenerator;
use Tonsoo\PhpCrawler\Sitemap\Writers\XmlSitemapWriter;
use Tonsoo\PhpCrawler\Extensions\SitemapExtension;

crawler()
    ->extension(
        new SitemapExtension(
            generator: new SitemapGenerator(
                writer: new XmlSitemapWriter(
                    path: __DIR__ . '/sitemap/sitemap.xml'
                )
            )
        )
    )
    ->start('https://example.com');
```
Rotating sitemap + index
```php
use Tonsoo\PhpCrawler\Sitemap\SitemapGenerator;
use Tonsoo\PhpCrawler\Sitemap\Writers\RotatingSitemapWriter;
use Tonsoo\PhpCrawler\Extensions\SitemapExtension;

crawler()
    ->extension(
        new SitemapExtension(
            generator: new SitemapGenerator(
                writer: new RotatingSitemapWriter(
                    directory: __DIR__ . '/sitemap',
                    baseName: 'sitemap',
                    extension: 'xml',
                    maxUrls: 50000
                )
            )
        )
    )
    ->start('https://example.com');
```
Notes:
- RotatingSitemapWriter requires the directory to already exist (see the sketch below).
- The index file is written only when more than one sitemap file is created.
- The index stores the sitemap filenames (relative paths), not absolute URLs.
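Since the writer does not create the output directory for you, a minimal sketch of preparing it with plain PHP before starting the crawl:

```php
use Tonsoo\PhpCrawler\Extensions\SitemapExtension;
use Tonsoo\PhpCrawler\Sitemap\SitemapGenerator;
use Tonsoo\PhpCrawler\Sitemap\Writers\RotatingSitemapWriter;

$directory = __DIR__ . '/sitemap';

// RotatingSitemapWriter expects this directory to already exist
if (!is_dir($directory)) {
    mkdir($directory, 0775, true);
}

crawler()
    ->extension(
        new SitemapExtension(
            generator: new SitemapGenerator(
                writer: new RotatingSitemapWriter(directory: $directory)
            )
        )
    )
    ->start('https://example.com');
```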
Events
You can subscribe to crawler events to observe or extend behavior:
```php
use Tonsoo\PhpCrawler\Events\OnCrawled;
use Tonsoo\PhpCrawler\Events\OnFinish;
use Tonsoo\PhpCrawler\Events\OnLinkFound;
use Tonsoo\PhpCrawler\Events\OnMismatchContent;
use Tonsoo\PhpCrawler\Events\OnMissingHtmlBody;
use Tonsoo\PhpCrawler\Events\OnStart;

crawler()
    ->onStart(fn (OnStart $event) => print("Starting\n"))
    ->onLinkFound(fn (OnLinkFound $event) => print("{$event->url} -> {$event->link}\n"))
    ->onCrawled(fn (OnCrawled $event) => print("Crawled {$event->page->uri}\n"))
    ->onMissingHtmlBody(fn (OnMissingHtmlBody $event) => print("No HTML: {$event->url}\n"))
    ->onMismatchContent(fn (OnMismatchContent $event) => print("Wrong content type: {$event->url}\n"))
    ->onFinish(fn (OnFinish $event) => print("Done: {$event->totalPages} pages\n"))
    ->start('https://example.com');
```
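Beyond logging, events can also be used to collect data during a crawl. A minimal sketch, assuming the event properties shown above, that gathers every crawled URI into an array:

```php
use Tonsoo\PhpCrawler\Events\OnCrawled;

$urls = [];

crawler()
    ->onCrawled(function (OnCrawled $event) use (&$urls) {
        // Record each crawled page's URI for later processing
        $urls[] = (string) $event->page->uri;
    })
    ->start('https://example.com');

// For example, dump the visited URLs to a plain-text file
file_put_contents(__DIR__ . '/crawled-urls.txt', implode("\n", $urls) . "\n");
```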
Custom HTTP Client, Logger, and Analyzer
You can plug in your own implementations:
```php
use Tonsoo\PhpCrawler\Http\HttpClientInterface;
use Tonsoo\PhpCrawler\Logger\LoggerInterface;
use Tonsoo\PhpCrawler\Analysis\PageAnalyzerInterface;

crawler()
    ->httpClient(new YourHttpClient())
    ->logger(new YourLogger())
    ->pageAnalyzer(new YourAnalyzer())
    ->start('https://example.com');
```
Defaults:
- HTTP client: CurlHttpClient (follows redirects, 4s connect/total timeout, custom UA string).
- Logger: ConsoleLogger (timestamps to stdout).
- Analyzer: DomDocumentPageAnalyzer (DOM + XPath).
Interfaces to implement:
- HttpClientInterface::fetch(string $url): Result
- LoggerInterface::log(string $message): void
- PageAnalyzerInterface::analyze(Result $result, bool $respectNoIndex, bool $respectNoFollow): PageAnalysis
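As a concrete example, a file-based logger only needs the single log() method from LoggerInterface. This is a hypothetical sketch (FileLogger is not part of the package):

```php
use Tonsoo\PhpCrawler\Logger\LoggerInterface;

// Hypothetical replacement for the default ConsoleLogger:
// appends timestamped messages to a file instead of writing to stdout.
final class FileLogger implements LoggerInterface
{
    public function __construct(private string $path)
    {
    }

    public function log(string $message): void
    {
        file_put_contents($this->path, date('c') . ' ' . $message . PHP_EOL, FILE_APPEND);
    }
}

crawler()
    ->logger(new FileLogger(__DIR__ . '/crawl.log'))
    ->start('https://example.com');
```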
Error Handling
If maxPages is set and the crawler reaches the limit, it throws LimitExceededException after finishing the crawl loop:
```php
use Tonsoo\PhpCrawler\Crawler\Exception\LimitExceededException;

try {
    crawler()->maxPages(100)->start('https://example.com');
} catch (LimitExceededException $e) {
    // handle limit reached
}
```
Crawling Behavior
The crawler only processes pages that return an HTML body with a text/html content type. If a page has no HTML body or a non-HTML content type, it is skipped and the corresponding event is emitted.
The crawler collects links from <a href="..."> elements and normalizes them. It will:
- Resolve relative URLs against the current page
- Drop fragments (the #... part)
- Ignore non-HTTP(S) schemes
- Optionally restrict links by host and scheme
- Optionally respect noindex/nofollow meta tags (from <meta name="robots">)
- Use canonical URLs when enabled
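To make the fragment, scheme, and host rules above concrete, here is an illustrative, deliberately simplified sketch in plain PHP. It is not the package's implementation (normalizeLink is a hypothetical helper), only a rough outline of the same rules:

```php
function normalizeLink(string $link, string $currentUrl): ?string
{
    // Drop the fragment (#...) part
    $link = explode('#', $link, 2)[0];
    if ($link === '') {
        return null;
    }

    $base  = parse_url($currentUrl);
    $parts = parse_url($link);

    // Resolve relative URLs against the current page (naive path handling)
    if (!isset($parts['scheme'])) {
        $path = str_starts_with($link, '/')
            ? $link
            : rtrim(dirname($base['path'] ?? '/'), '/') . '/' . $link;

        return $base['scheme'] . '://' . $base['host'] . $path;
    }

    // Ignore non-HTTP(S) schemes such as mailto: or javascript:
    if (!in_array($parts['scheme'], ['http', 'https'], true)) {
        return null;
    }

    // Optionally restrict links to the same host and scheme
    if (($parts['host'] ?? null) !== ($base['host'] ?? null)) {
        return null;
    }
    if ($parts['scheme'] !== ($base['scheme'] ?? null)) {
        return null;
    }

    return $link;
}
```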
This crawler does not parse robots.txt.
Example Script
See examples/crawler.php for a full working example.