sleimanx2 / grawler
A guided HTML crawler with media metadata extraction
Requires
- php: >=5.5
- fabpot/goutte: 3.1.*
- google/apiclient: 1.*
- hassankhan/config: 0.8.*
- njasm/soundcloud: 2.2.*
- vimeo/vimeo-api: 1.2.*
- vlucas/phpdotenv: 2.2.*
Requires (Dev)
- mockery/mockery: ^0.9.4
- phpunit/php-code-coverage: ^2.1
- phpunit/phpunit: ~4.0
README
Install
Via Composer
$ composer require sleimanx2/grawler
Basic Usage
getting the page DOM
require_once('vendor/autoload.php');
$client = new Bowtie\Grawler\Client();
$grawler = $client->download('http://example.com');
finding basic attributes
$grawler->title();
// provide a CSS path to find the attribute
$grawler->body($path = '.main-content');
// extracts meta keywords (array)
$grawler->keywords();
// extracts meta description
$grawler->description();
finding media
$grawler->images('.content img');
$grawler->videos('iframe');
$grawler->audio('.audio iframe');
Resolving media attributes
To resolve media attributes, you need to load the providers' configuration (see Configuration).
videos
Current video resolvers: YouTube, Vimeo
// resolve all videos at once
$videos = $grawler->videos('iframe')->resolve();
Then you can access each video's attributes as follows:
foreach ($videos as $video) {
    $video->id;          // the video provider id
    $video->title;
    $video->description;
    $video->url;
    $video->embedUrl;
    $video->images;      // Collection of Image instances
    $video->author;
    $video->authorId;
    $video->duration;
    $video->provider;    // video source
}
You can also resolve videos individually as follows:
$videos = $grawler->videos('iframe');
foreach ($videos as $video) {
    $video->resolve();
    $video->title;
    //...
}
audio
Current audio resolvers: SoundCloud
// resolve all audio at once
$audio = $grawler->audio('.audio iframe')->resolve();
Then you can access each track's attributes as follows:
foreach ($audio as $track) {
    $track->id;          // the audio provider id
    $track->title;
    $track->description;
    $track->url;
    $track->embedUrl;
    $track->images;      // Collection of cover photo instances
    $track->author;
    $track->authorId;
    $track->duration;
    $track->provider;    // audio source
}
You can also resolve audio individually as follows:
$audio = $grawler->audio('.audio iframe');
foreach ($audio as $track) {
    $track->resolve();
    $track->title;
    //...
}
Resolving page urls
$links = $grawler->links('.main thumb a');
foreach ($links as $link) {
    print $link;
    // or
    print $link->uri;
    // or
    print $link->getUri();
}
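When following extracted links, it can help to drop duplicates and off-site URLs first. The sketch below is not part of grawler: `filterSameHost` is a hypothetical helper that works on plain URI strings (for example, values collected from `$link->getUri()`), and it assumes those URIs are absolute.

```php
<?php
// Hypothetical helper, not part of grawler: keeps only the unique URIs
// whose host matches $host (case-insensitive). Relative URIs have no host
// component and are therefore dropped.
function filterSameHost(array $uris, $host)
{
    $kept = [];
    foreach ($uris as $uri) {
        $uriHost = parse_url($uri, PHP_URL_HOST);
        if (is_string($uriHost) && strcasecmp($uriHost, $host) === 0) {
            $kept[$uri] = true; // array keys de-duplicate the URIs
        }
    }
    return array_keys($kept);
}
```

The filtered list could then be passed, one URI at a time, to `$client->download()`.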
Configuration
Client Config
Set user agent
$client->agent('Googlebot/2.1')->download('http://example.com');
Recommended reading: http://webmasters.stackexchange.com/questions/6205/what-user-agent-should-i-set
Set request auth
$client->auth('me', '**');
You can change the auth type as follows:
$client->auth('me', '**', $type = 'basic');
Set request method
$client->method('post');
Grawler config
By default, grawler tries to read the following environment variables:
GRAWLER_YOUTUBE_KEY
GRAWLER_VIMEO_KEY
GRAWLER_VIMEO_SECRET
GRAWLER_SOUNDCLOUD_KEY
GRAWLER_SOUNDCLOUD_SECRET
If you don't use environment variables, you can load the configuration as follows:
$config = [
    'youtubeKey'       => '',
    'vimeoKey'         => '',
    'vimeoSecret'      => '',
    'soundcloudKey'    => '',
    'soundcloudSecret' => '',
];
$grawler->loadConfig($config);
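If you want to mix both approaches, one option is to build the config array from the environment and override individual keys by hand. The sketch below is only an illustration: `grawlerConfigFromEnv` is a hypothetical helper, not part of grawler, and the mapping between the environment variable names and the config keys is an assumption inferred from their names.

```php
<?php
// Hypothetical helper, not part of grawler. The env-name-to-key mapping is
// an assumption based on the names listed in this README.
function grawlerConfigFromEnv(array $overrides = [])
{
    $map = [
        'youtubeKey'       => 'GRAWLER_YOUTUBE_KEY',
        'vimeoKey'         => 'GRAWLER_VIMEO_KEY',
        'vimeoSecret'      => 'GRAWLER_VIMEO_SECRET',
        'soundcloudKey'    => 'GRAWLER_SOUNDCLOUD_KEY',
        'soundcloudSecret' => 'GRAWLER_SOUNDCLOUD_SECRET',
    ];

    $config = [];
    foreach ($map as $key => $envName) {
        $value = getenv($envName);          // false when the variable is unset
        $config[$key] = ($value === false) ? '' : $value;
    }

    // Explicit overrides win over values read from the environment.
    return array_merge($config, $overrides);
}
```

The result has the same shape as the `$config` array above, so it could be handed to `$grawler->loadConfig()`.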
Testing
$ phpunit --testsuite unit
$ phpunit --testsuite integration
NB: you should set your provider keys (YouTube, Vimeo, SoundCloud...) to run the integration tests.
Contributing
Please see CONTRIBUTING
Security
If you discover any security related issues, please email sleiman@bowtie.land instead of using the issue tracker.
License
The MIT License (MIT). Please see License File for more information.