johnroyer / crawler-php
Crawler implemented in PHP
0.3.6
2024-02-08 05:56 UTC
Requires
- php: ^8.1|^8.2
- ext-intl: *
- ext-mbstring: *
- guzzlehttp/guzzle: ^7.5
- johnroyer/url-normalizer: ^2.1.0
- symfony/css-selector: ^6.2
- symfony/dom-crawler: ^6.2
Requires (Dev)
- phpunit/phpunit: ^9.0
- squizlabs/php_codesniffer: ^3.6
README
A simple web crawler.
Note: this is a side project. Do NOT use it in production.
Usage
Create a handler by extending AbstractHandler, and set the domain that the handler should handle:
    class MyHandler extends \Zeroplex\Crawler\Handler\AbstractHandler
    {
        public function getDomain(): string
        {
            return 'test.com';
        }

        public function shouldFetch(\Psr\Http\Message\RequestInterface $request): bool
        {
            if (1 === preg_match('/(css|js|jpg|png|gif)$/', $request->getUri())) {
                // ignore css, js and common images
                return false;
            }
            return true;
        }

        public function handle(\Psr\Http\Message\ResponseInterface $response): void
        {
            // get content using $response->getBody()->getContents()
        }
    }
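The suffix check in shouldFetch() runs against the whole URL, so a versioned asset such as style.css?v=2 would not be skipped. As a minimal sketch of one alternative, the helper below compares only the file extension of the URL path, using PHP's built-in parse_url() and pathinfo(). The function name isStaticAsset and its default extension list are assumptions for illustration and are not part of this package.

```php
<?php

/**
 * Hypothetical helper, not part of crawler-php: reports whether the path
 * of a URL ends in one of the given file extensions.
 */
function isStaticAsset(string $url, array $extensions = ['css', 'js', 'jpg', 'png', 'gif']): bool
{
    // Look only at the path so that query strings and fragments are ignored.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        // No path component (or an unparsable URL): nothing to filter on.
        return false;
    }

    // Compare case-insensitively so that IMG.PNG is treated like img.png.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return $extension !== '' && in_array($extension, $extensions, true);
}
```

Inside shouldFetch(), this could be called as isStaticAsset((string) $request->getUri()), returning false from shouldFetch() when the helper returns true.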
Then set up the crawler and run it:
    $crawler = new \Zeroplex\Crawler\Crawler();
    $crawler->setDelay(0)
        ->setTimeout(3)
        ->setFollowRedirect(true)
        ->setUserAgent('Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/100.1');

    // register the handler defined above
    $crawler->addHandler(new MyHandler());

    // URL to start
    $crawler->run('https://test.com');
Extending
For example, implement a URL queue with Predis.
Install Predis with Composer:
composer require predis/predis
Implement UrlQueueInterface:
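    class RedisQueue implements Zeroplex\Crawler\UrlQueue\UrlQueueInterface
    {
        // Redis list key; the name is an arbitrary choice for this example
        private const KEY = 'crawler:urls';

        private $redis;

        public function __construct(string $host, int $port)
        {
            $this->redis = new \Predis\Client(['host' => $host, 'port' => $port]);
        }

        public function push(string $url): void
        {
            $this->redis->lpush(self::KEY, [$url]);
        }

        public function pop(): string
        {
            // note: LPOP yields null on an empty list, which would violate the string return type
            return $this->redis->lpop(self::KEY);
        }

        // and so on
    }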
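Before wiring up Redis, it can help to exercise the same push()/pop() shape in memory. The sketch below is a hypothetical stand-in built on PHP's SplQueue. The full method list of UrlQueueInterface is not shown here, so the class deliberately does not declare `implements`. The class name InMemoryQueue, the isEmpty() method, the duplicate skipping, and the exception thrown on an empty queue are all assumptions for illustration.

```php
<?php

/**
 * Hypothetical in-memory queue for local experiments. It mirrors only the
 * push()/pop() shape of the Redis example and is not part of crawler-php.
 */
final class InMemoryQueue
{
    private \SplQueue $queue;

    /** @var array<string, true> URLs that were already queued once */
    private array $seen = [];

    public function __construct()
    {
        $this->queue = new \SplQueue();
    }

    public function push(string $url): void
    {
        if (isset($this->seen[$url])) {
            return; // already queued once, so skip the duplicate
        }
        $this->seen[$url] = true;
        $this->queue->enqueue($url);
    }

    public function pop(): string
    {
        if ($this->queue->isEmpty()) {
            // Assumption: signal an empty queue with an exception.
            throw new \UnderflowException('URL queue is empty');
        }
        return $this->queue->dequeue();
    }

    public function isEmpty(): bool
    {
        return $this->queue->isEmpty();
    }
}
```

Unlike the LPUSH/LPOP pair in the Redis example, which takes the most recently pushed URL first, this sketch returns URLs in the order they were pushed.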