ddliu / spider
Lightweight spider for the web.
pkg:composer/ddliu/spider
Requires
- ddliu/filecache: ~0.1
- ddliu/normurl: ~0.1.1
- ddliu/requery: ~0.1
- ddliu/wildcards: 0.1.*
- monolog/monolog: ~1.11
- symfony/css-selector: ~2.5
- symfony/dom-crawler: ~2.5
README
A flexible spider in PHP.
Concepts
A spider contains many processors called pipes. You can pass as many tasks as you like to the spider; each task goes through these pipes and gets processed.
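As a minimal sketch of this model (using only the API documented in the sections below), here is a spider with a single closure pipe that prints each task's URL:

use ddliu\spider\Spider;

// A single pipe: a plain closure that receives the spider and the current task.
(new Spider())
    ->pipe(function($spider, $task) {
        // Tasks are array-accessible; a task added as a URL string carries 'url'.
        echo $task['url'], "\n";
    })
    ->addTask('http://example.com')
    ->run();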
Installation
composer require ddliu/spider
Requirements
- PHP 5.3+
- curl (for RequestPipe)
Dependencies
See composer.json.
Usage
use ddliu\spider\Spider;
use ddliu\spider\Pipe\NormalizeUrlPipe;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    ->pipe(function($spider, $task) {
        $task['$dom']->filter('a')->each(function($a) use ($task) {
            $href = $a->attr('href');
            $task->fork($href);
        });
    })
    // the entry task
    ->addTask('http://example.com')
    ->run()
    ->report();
Find more examples in the examples folder.
Spider
The Spider class.
Options
- limit: maximum number of tasks to run
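A sketch of setting the limit, assuming the Spider constructor accepts an options array (the exact signature is not documented here, so treat it as an assumption and check the Spider class):

use ddliu\spider\Spider;

// Hypothetical: assumes options are passed to the constructor as an array.
$spider = new Spider(array(
    'limit' => 1000,   // stop after 1000 tasks
));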
Methods
- pipe($pipe): add a pipe
- addTask($task): add a task
- run(): run the spider
- report(): write report to log
Task
A task contains the data array and some helper functions.
The Task class implements the ArrayAccess interface, so you can access its data like an array.
Methods
- fork($task): add a sub task to the spider
- ignore(): ignore the task
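For example, a sketch of a closure pipe that uses both helpers together with array access (the 'url' attribute follows the built-in pipes documented below; the forked path is purely illustrative):

use ddliu\spider\Spider;

(new Spider())
    ->pipe(function($spider, $task) {
        // Drop anything that is not an HTTP(S) URL.
        if (strpos($task['url'], 'http') !== 0) {
            $task->ignore();
            return;
        }
        // Queue a hypothetical related page as a sub task.
        $task->fork($task['url'] . '/about');
    })
    ->addTask('http://example.com')
    ->run();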
Pipes
Pipes define how each task is processed.
A pipe can be a function:
function($spider, $task) {}
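For example, a sketch of a function pipe that logs the size of each fetched page (assuming it is added after RequestPipe, which fills $task['content']):

// Add with ->pipe($logSizePipe) after ->pipe(new RequestPipe()).
$logSizePipe = function($spider, $task) {
    // $task['content'] is filled in by RequestPipe (see below).
    $spider->logger->info('fetched', array(
        'url' => $task['url'],
        'bytes' => strlen($task['content']),
    ));
};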
A pipe can also extend BasePipe:
use ddliu\spider\Pipe\BasePipe;

class MyPipe extends BasePipe {
    public function run($spider, $task) {
        // process the task...
    }
}
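A custom pipe is then added like any other pipe:

(new Spider())
    ->pipe(new MyPipe())
    ->addTask('http://example.com')
    ->run();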
Useful Pipes
NormalizeUrlPipe
Normalize $task['url'].
new NormalizeUrlPipe()
RequestPipe
Start an HTTP request with $task['url'] and save the result in $task['content'].
new RequestPipe(array(
    'useragent' => 'myspider',
    'timeout' => 10
));
FileCachePipe
Cache the result of another pipe (e.g. RequestPipe).
$requestPipe = new RequestPipe();
$cacheForReqPipe = new FileCachePipe($requestPipe, [
    'input' => 'url',
    'output' => 'content',
    'root' => '/path/to/cache/root',
]);
RetryPipe
Retry a pipe on failure.
$requestPipe = new RequestPipe();
$retryForReqPipe = new RetryPipe($requestPipe, [
    'count' => 10,
]);
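A wrapped pipe such as $cacheForReqPipe or $retryForReqPipe is then added to the spider in place of the pipe it wraps; a sketch following the Usage example above:

(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe($retryForReqPipe)       // used in place of a bare RequestPipe
    ->pipe(new DomCrawlerPipe())
    ->addTask('http://example.com')
    ->run();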
DomCrawlerPipe
Create a DomCrawler from $task['content']. Access it as $task['$dom'] in the following pipes.
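For example, a sketch of a follow-up pipe that extracts the page title (filter() and text() are standard symfony/dom-crawler methods):

use ddliu\spider\Spider;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    ->pipe(function($spider, $task) {
        // $task['$dom'] is a symfony/dom-crawler Crawler; filter() takes a CSS selector.
        $title = $task['$dom']->filter('title');
        if (count($title)) {
            $spider->logger->info('title: ' . $title->text());
        }
    })
    ->addTask('http://example.com')
    ->run();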
ReportPipe
Report every 10 minutes.
new ReportPipe(array(
    'seconds' => 600
))
Logging
$spider->logger is an instance of Monolog\Logger. You can add logging handlers to it before starting the spider:
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

$spider->logger->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));
TODO/Ideas
- Real-world examples.
- Running tasks concurrently (with pthreads?).
Alternative
Use the golang version for better performance!