vdb / php-spider
A configurable and extensible PHP web spider
Installs: 178 316
Dependents: 7
Suggesters: 0
Security: 0
Stars: 1 341
Watchers: 82
Forks: 232
Type: application
pkg:composer/vdb/php-spider
Requires
- php: >=8.0
- ext-dom: *
- ext-pcntl: *
- guzzlehttp/guzzle: ^6.0.0||^7.0.0
- spatie/robots-txt: ^2.0
- symfony/css-selector: ^3.0.0||^4.0.0||^5.0.0||^6.0||^7.0||^8.0
- symfony/dom-crawler: ^3.0.0||^4.0.0||^5.0.0||^6.0||^7.0||^8.0
- symfony/event-dispatcher: ^4.0.0||^5.0.0||^6.0||^7.0||^8.0
- symfony/finder: ^3.0.0||^4.0.0||^5.0.0||^6.0||^7.0||^8.0
- vdb/uri: ^0.3.2
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.69.0
- pdepend/pdepend: ^2.16.1
- phan/phan: ^4.0||^5.0||^6.0
- phpmd/phpmd: ^2.0.0
- phpunit/phpunit: ^9.0.0
- squizlabs/php_codesniffer: ^4.0.0
- dev-master
- v0.7.6
- v0.7.5
- v0.7.4
- v0.7.3
- v0.7.2
- v0.7.1
- v0.7.0
- v0.6.3
- v0.6.2
- v0.6.1
- v0.6.0
- v0.5.2
- v0.5.1
- v0.5.0
- v0.4.4
- v0.4.3
- v0.4.2
- v0.4.1
- v0.4
- v0.3
- v0.2
- v0.1
- dev-feature/cp-setup-versions
- dev-copilot/add-examples-documentation
- dev-copilot/improve-convenience-docs
- dev-feature/optimize-copilot-setup
- dev-copilot/refactor-discovererset-set-usage
- dev-fix/check-script-workflow-filter
- dev-copilot/add-property-deprecation-notices
- dev-copilot/sub-pr-144
- dev-feature/fix-act
- dev-copilot/enhance-readme-fluent-configuration
- dev-copilot/emit-deprecation-warning-in-set
- dev-copilot/remove-unused-sphinx-docs
- dev-copilot/add-runtime-deprecation-warning
- dev-copilot/start-phase-3
- dev-copilot/continue-phase-2-work
- dev-copilot/fix-missing-semicolon
- dev-copilot/review-code-architecture-test-scripts
- dev-copilot/check-valid-links-only
- dev-copilot/follow-internal-redirects
- dev-copilot/update-pr-submission-requirements
- dev-copilot/add-prefetch-filter-cache
- dev-copilot/allow-square-bracket-notation
- dev-feature/rename-act-check
- dev-feature/copilot-config
- dev-copilot/sub-pr-123
- dev-feature/green-metrics
- dev-SpeksForks-master
- dev-feature/extract-discovereduris
This package is auto-updated.
Last update: 2026-01-18 19:53:58 UTC
README
PHP-Spider Features
- supports two traversal algorithms: breadth-first and depth-first
- supports crawl depth limiting, queue size limiting and max downloads limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as robots.txt and domain limiting
- supports custom URI filters, both prefetch (URI) and postfetch (Resource content); see the sketch below
- supports caching downloaded resources with configurable max age (see example and documentation)
- supports custom request handling logic
- supports Basic, Digest and NTLM HTTP authentication. See example.
- comes with a useful set of persistence handlers (memory, file)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy
This spider does not support JavaScript.
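As an illustration of the custom filter support listed above, here is a minimal sketch of a prefetch filter. The PrivatePathFilter class is hypothetical, and it assumes the PreFetchFilterInterface::match() contract used by the bundled prefetch filters (returning true filters the URI out); check the shipped filters for the exact signature.

use VDB\Spider\Filter\PreFetchFilterInterface;
use VDB\Spider\Uri\DiscoveredUri;

// Hypothetical example: skip any URI whose path contains "/private/".
// Assumes match() returns true when the URI should be filtered out,
// as the bundled prefetch filters do.
class PrivatePathFilter implements PreFetchFilterInterface
{
    public function match(DiscoveredUri $uri): bool
    {
        return str_contains((string)$uri->getPath(), '/private/');
    }
}

$spider->getDiscovererSet()->addFilter(new PrivatePathFilter());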
Installation
The easiest way to install PHP-Spider is with Composer. Find it on Packagist.
$ composer require vdb/php-spider
Usage
This is a very simple example; the code can be found in example/example_simple.php. For a more complete, real-world example with logging, caching and filters, see example/example_complex.php.
Note that by default, the spider stops processing when it encounters a 4XX or 5XX error response. To set the spider up to keep processing, see the link checker example. It uses a custom request handler that configures the default Guzzle request handler to not fail on 4XX and 5XX responses.
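For reference, here is a minimal sketch of that idea (once you have created the spider as shown below). It assumes the default GuzzleRequestHandler exposes setClient() and the downloader exposes setRequestHandler(); the link checker example remains the authoritative version.

use GuzzleHttp\Client;
use VDB\Spider\RequestHandler\GuzzleRequestHandler;

// Give the default request handler a Guzzle client that does not throw
// on 4XX/5XX responses, so the spider keeps processing past broken links.
// setClient()/setRequestHandler() are assumptions; see the link checker example.
$requestHandler = new GuzzleRequestHandler();
$requestHandler->setClient(new Client(['http_errors' => false]));
$spider->getDownloader()->setRequestHandler($requestHandler);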
First create the spider
$spider = new Spider('http://www.dmoz.org');
Add a URI discoverer. Without it, the spider does nothing. In this case, we want all <a> nodes from a certain <div>
$spider->addDiscoverer(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));
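If you prefer CSS selectors over XPath, the package also ships a CSS-based discoverer; assuming the bundled CssSelectorDiscoverer, the equivalent would be:

use VDB\Spider\Discoverer\CssSelectorDiscoverer;

// Same selection as the XPath example, expressed as a CSS selector
$spider->addDiscoverer(new CssSelectorDiscoverer('div#catalogs a'));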
Set some sane options for this example. In this case, we only get the first 10 items from the start page.
$spider->setMaxDepth(1);
$spider->setMaxQueueSize(10);
Add a listener to collect stats from the Spider and the QueueManager. Other components dispatch events you can subscribe to as well.
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);
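You can also attach plain listeners for individual events. A minimal sketch follows; the constant name and event payload are assumptions based on VDB\Spider\Event\SpiderEvents, so verify them against that class before relying on this.

use Symfony\Component\EventDispatcher\GenericEvent;
use VDB\Spider\Event\SpiderEvents;

// Hypothetical: log each URI as it is about to be requested. Which
// component's dispatcher fires a given event varies, so check the docs.
$spider->getDispatcher()->addListener(
    SpiderEvents::SPIDER_CRAWL_PRE_REQUEST,
    function (GenericEvent $event) {
        echo "\n Requesting: " . $event->getArgument('uri');
    }
);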
Execute the crawl
$spider->crawl();
When crawling is done, we could get some info about the crawl
echo "\n ENQUEUED: " . count($statsHandler->getQueued()); echo "\n SKIPPED: " . count($statsHandler->getFiltered()); echo "\n FAILED: " . count($statsHandler->getFailed()); echo "\n PERSISTED: " . count($statsHandler->getPersisted());
Finally, we could do some processing on the downloaded resources. In this example, we echo the title of each resource.
echo "\n\nDOWNLOADED RESOURCES: "; foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) { echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text(); }
Fluent Configuration
For most common settings, you can configure the spider fluently via convenience methods on Spider and keep related configuration in one place.
use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use VDB\Spider\Filter\Prefetch\AllowedHostsFilter;
use VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler;
use VDB\Spider\QueueManager\QueueManagerInterface;

$spider = new Spider('https://example.com');

// Configure limits and traversal in one place
$spider
    ->setDownloadLimit(50)   // Max resources to download
    ->setTraversalAlgorithm(QueueManagerInterface::ALGORITHM_BREADTH_FIRST)
    ->setMaxDepth(2)         // Max discovery depth
    ->setMaxQueueSize(500)   // Max URIs in queue
    ->setPersistenceHandler(new FileSerializedResourcePersistenceHandler(__DIR__.'/results'))
    ->addDiscoverer(new XPathExpressionDiscoverer('//a'))    // Add discoverers
    ->addFilter(new AllowedHostsFilter(['example.com']));    // Add prefetch filters

// Optional: enable politeness policy (delay between requests to same domain)
$spider->enablePolitenessPolicy(100);

$spider->crawl();
Using Cache to Skip Already Downloaded Resources
To avoid re-downloading resources that are already cached (useful for incremental crawls):
use VDB\Spider\Filter\Prefetch\CachedResourceFilter;
use VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler;

// Use a fixed spider ID to share cache across runs
$spiderId = 'my-spider-cache';
$spider = new Spider('http://example.com', null, null, null, $spiderId);

// Set up file persistence
$resultsPath = __DIR__ . '/cache';
$spider->getDownloader()->setPersistenceHandler(
    new FileSerializedResourcePersistenceHandler($resultsPath)
);

// Add cache filter - skip resources downloaded within the last hour
$maxAgeSeconds = 3600; // 1 hour (set to 0 to always use cache)
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAgeSeconds);
$spider->getDiscovererSet()->addFilter($cacheFilter);

$spider->crawl();
For more details, see the CachedResourceFilter documentation and example.
Contributing
Contributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a Pull Request. The Symfony documentation contains an excellent guide on how to do that properly: Submitting a Patch.
There are a few requirements for a Pull Request to be accepted:
- Follow the coding standards: PHP-Spider follows the coding standards defined in the PSR-0, PSR-1 and PSR-2 Coding Style Guides;
- Prove that the code works with unit tests and that coverage remains 100%;
Note: An easy way to check whether your code conforms to PHP-Spider's standards is to run the script bin/static-analysis, which is part of this repo. It runs the following tools, configured for PHP-Spider: PHP CodeSniffer, PHP Mess Detector and PHP Copy/Paste Detector.
Note: To run PHPUnit with coverage, and to check that coverage is 100%, you can run bin/coverage-enforce.
Local Testing with GitHub Actions
You can run the full CI pipeline locally using nektos/act:
# Fast path: run the full workflow with PHP 8.0 (recommended)
./bin/check
Or use the underlying act wrapper directly:
# Run all tests locally
./bin/act

# Run a specific PHP version locally
./bin/act --matrix php-versions:8.0

# Run a specific job or view available workflows
./bin/act -l
For more details, see .github/LOCAL_TESTING.md.
Support
For things like reporting bugs and requesting features, it is best to create an issue here on GitHub. It is even better to accompany it with a Pull Request. ;-)
License
PHP-Spider is licensed under the MIT license.