webimage/spider

Crawl website

0.0.6 2024-08-30 11:23 UTC



README

A wrapper for Symfony's BrowserKit that allows URLs to be downloaded, cached, and crawled.

Usage

use WebImage\Spider\UrlFetcher;
use WebImage\Spider\Url; // assumed namespace for the Url value object used below
use Symfony\Component\HttpClient\HttpClient;
// The fetcher needs a cache directory, a PSR-3 logger, and an HTTP client
$logger = new \Monolog\Logger('spider');
$fetcher = new UrlFetcher('/path/to/cache', $logger, HttpClient::create());
$result = $fetcher->fetch(new Url('https://www.domain.com'));

It is a good idea to create the HttpClient with a User-Agent header, for example:

use Symfony\Component\HttpClient\HttpClient;
// Pass this client as the third argument to UrlFetcher
$client = HttpClient::create([
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    ]
]);

The fetcher can be set up to recursively crawl URLs by registering onFetch(FetchHandlerInterface) or onFetchCallback(callable) listeners.

use WebImage\Spider\FetchResponseEvent;
use WebImage\Spider\Url; // assumed namespace, as above
/** @var \WebImage\Spider\UrlFetcher $fetcher */
$fetcher->onFetchCallback(function(FetchResponseEvent $ev) {
    // Perform some logic here, then queue another URL for crawling
    $ev->getTarget()->fetch(new Url('https://www.another.com/path'));
});
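
For reusable logic, the same idea can be expressed as a handler object passed to onFetch(...). The following is only a sketch: it assumes FetchHandlerInterface declares a single onFetch(FetchResponseEvent) method, which may differ from the actual interface, so check the package source for the real signature.

use WebImage\Spider\FetchHandlerInterface;
use WebImage\Spider\FetchResponseEvent;
use WebImage\Spider\Url; // assumed namespace

class LinkQueueingHandler implements FetchHandlerInterface
{
    // Assumed method name; verify against the interface definition
    public function onFetch(FetchResponseEvent $ev): void
    {
        // Queue a follow-up URL based on the fetched response
        $ev->getTarget()->fetch(new Url('https://www.another.com/other-path'));
    }
}

/** @var \WebImage\Spider\UrlFetcher $fetcher */
$fetcher->onFetch(new LinkQueueingHandler());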

URLs queued from within onFetch(...) and onFetchCallback(...) listeners are pushed onto a stack and processed recursively, in the order they were added.
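
Because each fetch(...) call inside a listener pushes more work onto the same stack, unguarded recursion can loop forever on sites whose pages link back to one another. Below is a minimal sketch of one way to guard against that with a visited-URL array; the URLs are hypothetical and only the fetch(...) and onFetchCallback(...) calls shown above are assumed.

use WebImage\Spider\FetchResponseEvent;
use WebImage\Spider\Url; // assumed namespace

$visited = [];
$queueOnce = function (string $url, FetchResponseEvent $ev) use (&$visited) {
    // Only queue URLs that have not been queued before
    if (!isset($visited[$url])) {
        $visited[$url] = true;
        $ev->getTarget()->fetch(new Url($url));
    }
};

/** @var \WebImage\Spider\UrlFetcher $fetcher */
$fetcher->onFetchCallback(function (FetchResponseEvent $ev) use ($queueOnce) {
    // Extract links from the response here, then queue each one at most once
    $queueOnce('https://www.domain.com/some-path', $ev);
});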