radowoj / crawla
Simple web crawler based on Symfony components and Guzzle
v0.4.0
2022-07-03 09:20 UTC
Requires
- php: ^8.1
- guzzlehttp/guzzle: ^7.4
- symfony/css-selector: ^6.1
- symfony/dom-crawler: ^6.1
Requires (Dev)
- phpunit/phpunit: ^9.5
This package is auto-updated.
Last update: 2025-04-29 00:44:04 UTC
README
Installation
Via composer
$ composer require radowoj/crawla
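Before diving into the full examples below, a minimal crawl might look like the following. This is a sketch pieced together from the API calls shown in the examples on this page; the entry-point URL is just a placeholder, not anything the package requires.

```php
<?php

require_once('vendor/autoload.php');

//entry-point URL is a placeholder - use any page you want to crawl
$crawler = new \Radowoj\Crawla\Crawler('https://example.com/');

//follow every link ('a' is the default selector) up to 1 level deep
$crawler->setLinkSelector('a');
$crawler->crawl(1);

//list the URLs that were visited
var_dump($crawler->getVisited()->all());
```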
Example 1 - get titles, commit counts and READMEs from pages linked from an entry point
<?php

use Symfony\Component\DomCrawler\Crawler as DomCrawler;

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://github.com/radowoj'
);

$dataGathered = [];

//configure our crawler
//first - set the CSS selector for links that should be visited
$crawler->setLinkSelector('span.pinned-repo-item-content span.d-block a.text-bold')

    //second - customize the Guzzle client used for requests
    ->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100
    ]))

    //third - define what should be done when a page has been visited
    ->setPageVisitedCallback(function (DomCrawler $domCrawler) use (&$dataGathered) {
        //the callback is called for every visited page, including the base URL,
        //so make sure repo data is gathered only on repo pages
        if (!preg_match('/radowoj\/\w+/', $domCrawler->getUri())) {
            return;
        }

        $readme = $domCrawler->filter('#readme');

        $dataGathered[] = [
            'title' => trim($domCrawler->filter('span[itemprop="about"]')->text()),
            'commits' => trim($domCrawler->filter('li.commits span.num')->text()),
            'readme' => $readme->count() ? trim($readme->text()) : '',
        ];
    });

//now crawl, following links up to 1 level deep from the entry point
$crawler->crawl(1);

var_dump($dataGathered);
var_dump($crawler->getVisited()->all());
Example 2 - simple site map
<?php

require_once('../vendor/autoload.php');

$crawler = new \Radowoj\Crawla\Crawler(
    'https://developer.github.com/'
);

//configure our crawler - customize the Guzzle client used for requests
$crawler->setClient(new GuzzleHttp\Client([
        GuzzleHttp\RequestOptions::DELAY => 100
    ]))

    //set the link selector (all links - this is the default value)
    ->setLinkSelector('a');

//crawl up to 1 level deep
$crawler->crawl(1);

//get URLs of all visited pages
var_dump($crawler->getVisited()->all());

//get URLs that were too deep to visit
var_dump($crawler->getTooDeep()->all());