webimage / spider
Crawl website
0.0.6
2024-08-30 11:23 UTC
Requires
- guzzlehttp/guzzle: ^7.9.2
- monolog/monolog: 1.24.0
- symfony/browser-kit: ^5.4
- symfony/dom-crawler: ^4.4
- symfony/http-client: ^5.4
- webimage/core: ^1.4
README
A wrapper for Symfony/Browser-Kit that allows a URL to be downloaded, cached, and crawled.
Usage
    use WebImage\Spider\UrlFetcher;
    use Symfony\Component\HttpClient\HttpClient;

    $logger  = new \Monolog\Logger('spider');
    $fetcher = new UrlFetcher('/path/to/cache', $logger, HttpClient::create());
    $result  = $fetcher->fetch(new Url('https://www.domain.com'));
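The example above passes a bare Monolog logger; to actually see fetch activity you will usually want to attach a handler. A minimal sketch using Monolog's StreamHandler (the log path is only an illustration):

    use Monolog\Logger;
    use Monolog\Handler\StreamHandler;

    // Write the spider's log output to a file; any Monolog handler works here.
    $logger = new Logger('spider');
    $logger->pushHandler(new StreamHandler('/path/to/spider.log', Logger::DEBUG));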
It's a good idea to create the HttpClient with a User-Agent header, for example:
    use Symfony\Component\HttpClient\HttpClient;

    HttpClient::create([
        'headers' => [
            'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        ]
    ]);
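Such a client can then be passed to UrlFetcher in place of the bare HttpClient::create() call from the first example (the User-Agent string below is just a placeholder for whatever identifies your crawler):

    use WebImage\Spider\UrlFetcher;
    use Symfony\Component\HttpClient\HttpClient;

    // Illustrative User-Agent; substitute the string that identifies your crawler.
    $client = HttpClient::create([
        'headers' => [
            'User-Agent' => 'MyCrawler/1.0'
        ]
    ]);

    // Same constructor arguments as above, but with the customised client.
    $fetcher = new UrlFetcher('/path/to/cache', new \Monolog\Logger('spider'), $client);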
A crawler can be set up to crawl URLs recursively by registering onFetch(FetchHandlerInterface) or onFetchCallback(...) listeners.
    use WebImage\Spider\FetchResponseEvent;

    /** @var \WebImage\Spider\UrlFetcher $fetcher */
    $fetcher->onFetchCallback(function (FetchResponseEvent $ev) {
        // Perform some logic here, then queue another URL
        $ev->getTarget()->fetch(new Url('https://www.another.com/path'));
    });
Calling fetch(...) from an onFetch(...) or onFetchCallback(...) listener pushes the URL onto a stack that is processed recursively, in the order the URLs are added.
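The example above covers only the callback form. A handler registered via onFetch(FetchHandlerInterface) looks roughly like the sketch below; note that the method name and signature required by FetchHandlerInterface are not shown in this README, so handle(FetchResponseEvent $ev) is an assumption here:

    use WebImage\Spider\FetchHandlerInterface;
    use WebImage\Spider\FetchResponseEvent;

    // Hypothetical handler class; check FetchHandlerInterface for the real
    // method name and signature - handle() below is assumed for illustration.
    class FollowLinksHandler implements FetchHandlerInterface
    {
        public function handle(FetchResponseEvent $ev)
        {
            // Queue another URL; it is pushed onto the fetch stack and
            // processed after the current page.
            $ev->getTarget()->fetch(new Url('https://www.another.com/path'));
        }
    }

    /** @var \WebImage\Spider\UrlFetcher $fetcher */
    $fetcher->onFetch(new FollowLinksHandler());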