gyaaniguy/pcrawl

PHP web scraping and crawling library with support for multiple clients, fast parsing, debugging, and on-the-fly changes to client options.


PCrawl

PCrawl is a PHP library for crawling and scraping web pages.
It supports multiple HTTP clients (cURL, Guzzle) and provides options to debug, modify, and parse responses.

Features

  • Rapidly create custom clients. Fluently change clients and client options, such as the user agent, with method chaining.
  • Modify responses using reusable callback functions (see the sketch after this list).
  • Debug responses using different criteria: HTTP code, regex, etc.
  • Parse responses using the QueryPath library. Several convenience functions are provided.
  • Fluent API. Debuggers, clients, and response-modifier objects can be swapped on the fly!
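
For example, a reusable modification can be written as an ordinary PHP closure and applied to any response body. This is only a minimal sketch of the idea in plain PHP, using the getBody() call from the example below; PCrawl's own response-modifier classes may expose a different interface.

// A reusable callback: strip <script> blocks from an HTML body.
$stripScripts = function (string $html): string {
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
};

// Apply it to the body of any fetched response:
// $cleanHtml = $stripScripts($res->getBody());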

Full Example

We'll fetch a page that fails, detect the failure with a debugger, and finally change a client option so the page is fetched correctly.

  • Set up some clients
// simple clients.
$gu = new GuzzleClient();

// Custom client that does not follow redirects.
$uptightNoRedirectClient = new CurlClient();
$uptightNoRedirectClient->setRedirects(0); // disable redirects

// Custom client: a thin wrapper around the curl client that upgrades URLs to HTTPS.
class ConvertToHttpsClient extends CurlClient
{
    public function get(string $url, array $options = []): PResponse
    {
        $url = str_replace('http://', 'https://', $url);
        return parent::get($url, $options);
    }
}
  • Let's create some debugger objects
$redirectDetector = new ResponseDebug();
$redirectDetector->setMustNotExistHttpCodes([301, 302, 303, 307, 308]); // fail if the response is a redirect
$fullPageDetector = new ResponseDebug();
$fullPageDetector->setMustExistRegex(['#</html>#']); // fail if the closing </html> tag is missing
  • Start fetching!

For testing, we fetch the page with a client that does not follow redirects, then use the redirectDetector to detect the 301. If a redirect is detected, we change the client option to follow redirects and fetch again.

$req = new Request();
$url = "http://www.whatsmyua.info";
$req->setClient($uptightNoRedirectClient);
$count = 0;
do {
    $res = $req->get($url);
    $redirectDetector->setResponse($res);
    if ($redirectDetector->isFail()) {
        // The response was a redirect: report it, then enable redirects and retry.
        var_dump($redirectDetector->getFailDetail());
        $uptightNoRedirectClient->setRedirects(1);
        $res = $req->get($url);
    }
} while ($redirectDetector->isFail() && $count++ < 1); // retry at most once

Use the fullPageDetector to check that the page is complete,
then parse the response body using the parser.

if ($fullPageDetector->setResponse($res)->isFail()) {
    var_dump($fullPageDetector->getFailDetail());
} else {
    $parser = new ParserCommon($res->getBody()); 
    $h1 = $parser->find('h1')->text();
    $htmlClass = $parser->find('html')->attr('class');
}

Note: the debuggers, clients, and parsers can all be reused; a short sketch follows.
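
As a minimal sketch of reuse, using only the calls shown above, the same debugger and a fresh parser can be pointed at another response:

// Reuse the same request object and debugger for a second URL.
$res2 = $req->get("https://example.com");
if (!$fullPageDetector->setResponse($res2)->isFail()) {
    $parser2 = new ParserCommon($res2->getBody());
    $title = $parser2->find('title')->text();
}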

Detailed Usage

Usage is divided into the following sections:

Installation

  • Composer:
composer init   # for new projects. 
composer config minimum-stability dev # Will be removed once stable.
composer require gyaaniguy/pcrawl
composer update
include __DIR__ . '/vendor/autoload.php'; #in PHP
  • github:
git clone git@github.com:gyaaniguy/PCrawl.git # clone repo 
cd PCrawl 
composer update # update composer 
mv ../PCrawl /desired/location # Move dir to desired location.
require __DIR__ . '/../PCrawl/vendor/autoload.php'; # in PHP
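
After installing, a quick smoke test. This sketch assumes the classes used in the example above; adjust namespaces and use statements to match the library's actual ones.

<?php
require __DIR__ . '/vendor/autoload.php';

$req = new Request();
$req->setClient(new CurlClient());
$res = $req->get('https://example.com');
echo substr($res->getBody(), 0, 200); // print the start of the fetched HTML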

TODO list

  • Leverage Guzzle's asynchronous request support (see the plain-Guzzle sketch below)
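
For reference, this is what Guzzle's asynchronous API looks like on its own; PCrawl does not expose it yet, so this sketch uses Guzzle directly.

// Plain Guzzle async usage (outside PCrawl).
$client = new \GuzzleHttp\Client();
$promise = $client->getAsync('https://example.com')->then(
    function ($response) {
        echo $response->getStatusCode();
    }
);
$promise->wait();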

Standards

  • PSR-12 coding standard
  • PHPUnit tests