ddliu / spider
Lightweight spider for the web.
Requires
- ddliu/filecache: ~0.1
- ddliu/normurl: ~0.1.1
- ddliu/requery: ~0.1
- ddliu/wildcards: 0.1.*
- monolog/monolog: ~1.11
- symfony/css-selector: ~2.5
- symfony/dom-crawler: ~2.5
A flexible spider in PHP.
Concepts
A spider contains a chain of processors called pipes. You can pass as many tasks as you like to the spider; each task goes through these pipes in order and gets processed.
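The idea can be sketched in a few lines of plain PHP. This is a conceptual illustration only, not the library's implementation: `MiniSpider` is a hypothetical class, tasks are modeled with `ArrayObject`, and the `length` key is invented for the example.

```php
<?php
// Conceptual sketch only: NOT the implementation of ddliu/spider.
// MiniSpider is a hypothetical name used for illustration.
class MiniSpider
{
    private $pipes = array();
    private $tasks = array();

    // Register a processor; returns $this so calls can be chained.
    public function pipe($pipe)
    {
        $this->pipes[] = $pipe;
        return $this;
    }

    // Queue a task. The task is an ArrayObject so pipes can modify it in place.
    public function addTask($url)
    {
        $this->tasks[] = new ArrayObject(array('url' => $url));
        return $this;
    }

    // Pass every queued task through every pipe, in registration order.
    public function run()
    {
        $results = array();
        while ($this->tasks) {
            $task = array_shift($this->tasks);
            foreach ($this->pipes as $pipe) {
                call_user_func($pipe, $this, $task);
            }
            $results[] = $task;
        }
        return $results;
    }
}

$results = (new MiniSpider())
    ->pipe(function ($spider, $task) {
        // First pipe: lower-case the URL.
        $task['url'] = strtolower($task['url']);
    })
    ->pipe(function ($spider, $task) {
        // Second pipe: sees the value written by the first pipe.
        $task['length'] = strlen($task['url']);
    })
    ->addTask('HTTP://Example.COM')
    ->run();
```

Each pipe receives the same task object, so data written by one pipe is visible to the next.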
Installation
composer require ddliu/spider
Requirements
- PHP 5.3+
- curl (for RequestPipe)
Dependencies
See composer.json.
Usage
use ddliu\spider\Spider;
use ddliu\spider\Pipe\NormalizeUrlPipe;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    ->pipe(function($spider, $task) {
        // fork a sub task for every link found on the page
        $task['$dom']->filter('a')->each(function($a) use ($task) {
            $href = $a->attr('href');
            $task->fork($href);
        });
    })
    // the entry task
    ->addTask('http://example.com')
    ->run()
    ->report();

Find more examples in the examples folder.
Spider
The Spider class.
Options
- limit: maximum number of tasks to run
Methods
- pipe($pipe): add a pipe
- addTask($task): add a task
- run(): run the spider
- report(): write report to log
Task
A task contains the data array and some helper functions.
The Task class implements the ArrayAccess interface, so you can access its data like an array.
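For readers unfamiliar with ArrayAccess, here is a minimal sketch of a class that supports array-style access. `SketchTask` is a hypothetical stand-in, not the library's Task class, and it is written for PHP 8.0+ because of the typed method signatures.

```php
<?php
// Illustrative stand-in for a task object (hypothetical name).
// Requires PHP 8.0+ because of the `mixed` type declarations.
class SketchTask implements ArrayAccess
{
    private array $data;

    public function __construct(array $data = [])
    {
        $this->data = $data;
    }

    public function offsetExists(mixed $offset): bool
    {
        return isset($this->data[$offset]);
    }

    public function offsetGet(mixed $offset): mixed
    {
        // Return null for unknown keys instead of raising a warning.
        return $this->data[$offset] ?? null;
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        if ($offset === null) {
            $this->data[] = $value;          // supports $task[] = $value
        } else {
            $this->data[$offset] = $value;
        }
    }

    public function offsetUnset(mixed $offset): void
    {
        unset($this->data[$offset]);
    }
}

$task = new SketchTask(['url' => 'http://example.com']);
$task['content'] = '<html></html>';   // calls offsetSet()
$hasUrl = isset($task['url']);        // calls offsetExists()
$url = $task['url'];                  // calls offsetGet()
unset($task['content']);              // calls offsetUnset()
```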
Methods
- fork($task): add a sub task to the spider
- ignore(): ignore the task
Pipes
Pipes define how each task is processed.
A pipe can be a function:
function($spider, $task) {}
Or a class that extends BasePipe:
use ddliu\spider\Pipe\BasePipe;

class MyPipe extends BasePipe {
    public function run($spider, $task) {
        // process the task...
    }
}
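As a sketch of the function form, the pipe below reads `$task['url']` and stores the lower-cased host name on the task. The `host` key is a hypothetical field chosen for this example, and `ArrayObject` with a `null` spider stands in for the real objects so the snippet runs without the library.

```php
<?php
// A function pipe: receives the spider and the current task.
// The 'host' key is a hypothetical field chosen for this example.
$hostPipe = function ($spider, $task) {
    $host = parse_url($task['url'], PHP_URL_HOST);
    // parse_url() returns null when the URL has no host component
    // and false when the URL is seriously malformed.
    $task['host'] = is_string($host) ? strtolower($host) : null;
};

// Stand-ins so the pipe can be exercised without the library:
// ArrayObject supports array-style access, as Task does.
$task = new ArrayObject(array('url' => 'http://Example.com/a/b?x=1'));
$hostPipe(null, $task);
```

Because the pipe only relies on array-style access, it can be tested in isolation before being passed to `pipe()`.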
Useful Pipes
NormalizeUrlPipe
Normalize $task['url'].
new NormalizeUrlPipe()
RequestPipe
Start an HTTP request with $task['url'] and save the result in $task['content'].
new RequestPipe(array(
    'useragent' => 'myspider',
    'timeout' => 10
));
FileCachePipe
Cache the output of a pipe (e.g. RequestPipe).
$requestPipe = new RequestPipe();
$cacheForReqPipe = new FileCachePipe($requestPipe, [
    'input' => 'url',
    'output' => 'content',
    'root' => '/path/to/cache/root',
]);
RetryPipe
Retry on failure.
$requestPipe = new RequestPipe();
$retryForReqPipe = new RetryPipe($requestPipe, [
    'count' => 10,
]);
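The wrapping idea behind a retry pipe can be sketched with a plain closure. This is an illustration under assumptions, not the library's RetryPipe: `retrying()` is a hypothetical helper, and it assumes a failing pipe signals failure by throwing an exception, which may differ from how the library detects failures.

```php
<?php
// Hypothetical helper, not part of ddliu/spider: wraps a pipe and calls it
// again when it throws, up to $count attempts in total.
function retrying($pipe, $count)
{
    if ($count < 1) {
        throw new InvalidArgumentException('count must be at least 1');
    }
    return function ($spider, $task) use ($pipe, $count) {
        $lastError = null;
        for ($attempt = 1; $attempt <= $count; $attempt++) {
            try {
                return call_user_func($pipe, $spider, $task);
            } catch (Exception $e) {
                $lastError = $e;   // remember the failure and try again
            }
        }
        throw $lastError;          // every attempt failed
    };
}

// A flaky pipe that fails twice before succeeding.
$calls = 0;
$flaky = function ($spider, $task) use (&$calls) {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException('temporary failure');
    }
    $task['content'] = 'ok';
};

// ArrayObject and null stand in for the real task and spider.
$task = new ArrayObject(array('url' => 'http://example.com'));
$pipe = retrying($flaky, 10);
$pipe(null, $task);
```

The wrapper has the same `($spider, $task)` signature as the pipe it wraps, which is what lets wrappers of this kind be used anywhere a pipe is expected.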
DomCrawlerPipe
Create a DomCrawler from $task['content']. Access it with $task['$dom'] in the following pipes.
ReportPipe
Report at a fixed interval, for example every 10 minutes:
new ReportPipe(array( 'seconds' => 600 ))
Logging
$spider->logger is an instance of Monolog\Logger. You can add logging handlers to it before starting the spider:
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$spider->logger->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));
TODO/Ideas
- Real world examples.
- Running tasks concurrently (with pthreads?).
Alternative
Use the Go version for better performance!