thingston / crawler
Web crawler based on PHP Guzzle HTTP Client with concurrency support for faster operation.
Requires
- php: ^7.1
- doctrine/dbal: ^2.8
- guzzlehttp/guzzle: ^6.3
- jakubkulhan/chrome-devtools-protocol: ^1.0
- jwage/purl: ^0.0.10
- league/flysystem: ^1.0
- monolog/monolog: ^1.23
- psr/http-message: ^1.0
- psr/log: ^1.0
- symfony/css-selector: ^4.1
- symfony/dom-crawler: ^4.1
- t1gor/robots-txt-parser: ^0.2.4
- zendframework/zend-feed: ^2.10
Requires (Dev)
- phpunit/phpunit: ^7.4
- squizlabs/php_codesniffer: ^3.3
- symfony/var-dumper: ^4.1
README
Web crawler based on PHP Guzzle HTTP Client with concurrency support for faster operation. Includes support for any content-type download, link profiler and response observers.
Requirements
Thingston Crawler requires:
- PHP 7.1 or above.
Installation
Add Thingston Crawler to any PHP project using Composer:
composer require thingston/crawler
Getting Started
Create a new Crawler instance and invoke its start method with any public URI:
use Thingston\Crawler;

$crawler = new Crawler();
$crawler->start('https://www.wikipedia.org/');
To process the results of a crawl, you may add as many Observers as you need.
An Observer is a concrete class that implements Thingston\Crawler\Observer\ObserverInterface.
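The methods of ObserverInterface are not listed in this README, so the sketch below does not implement it. It only illustrates the general idea the crawler builds on, using Guzzle 6 directly: requests are sent concurrently through GuzzleHttp\Pool and each successful response is handed to a callback. The function name fetchConcurrently, its parameters and the callback signature are hypothetical and are not part of Thingston Crawler.

```php
<?php

// Assumes Guzzle 6 was installed through Composer (it is a dependency of this package).
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Psr\Http\Message\ResponseInterface;

/**
 * Hypothetical helper, not part of Thingston Crawler: fetches the given URIs
 * concurrently and passes every successful response to $onResponse.
 */
function fetchConcurrently(Client $client, array $uris, callable $onResponse, int $concurrency = 5): void
{
    // Lazily build one GET request per URI, keeping the array key as the pool index.
    $requests = function () use ($uris) {
        foreach ($uris as $key => $uri) {
            yield $key => new Request('GET', $uri);
        }
    };

    $pool = new Pool($client, $requests(), [
        // Maximum number of requests in flight at the same time.
        'concurrency' => $concurrency,
        // Called once per successful response; plays the role of an observer here.
        'fulfilled' => function (ResponseInterface $response, $index) use ($uris, $onResponse) {
            $onResponse($uris[$index], $response);
        },
        // Called when a request fails (connection error, or a 4xx/5xx status by default).
        'rejected' => function ($reason, $index) use ($uris) {
            error_log(sprintf('Failed to fetch %s: %s', $uris[$index], $reason->getMessage()));
        },
    ]);

    // Start the transfers and block until all of them have settled.
    $pool->promise()->wait();
}
```

Calling fetchConcurrently(new Client(), ['https://www.wikipedia.org/'], $callback) would invoke $callback with the URI and its PSR-7 response. A real Observer registered with the crawler receives results through the library's own interface, which may differ from this callback.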
Reporting Issues
If you find an issue with this code, please open a ticket in GitHub Issues at https://github.com/thingston/crawler/issues.
Contributors
Open Source is made of contributions. If you want to contribute to Thingston, please follow these steps:
- Fork the latest version into your own repository.
- Write your changes or additions and commit them.
- Follow the PSR-2 coding style standard.
- Make sure your changes are fully covered by unit tests.
- Go to GitHub Pull Requests at https://github.com/thingston/crawler/pulls and create a new request.
Thank you!
Changes and Versioning
All relevant changes to this code are recorded in a separate log file.
Version numbers follow recommendations from Semantic Versioning.
License
Thingston code is maintained under The MIT License.