sleimanx2/grawler

A guided html crawler with media meta extraction

0.2.4 2019-09-29 11:29 UTC

This package is auto-updated.

Last update: 2024-04-14 06:42:19 UTC


README


Install

Via Composer

$ composer require sleimanx2/grawler

Basic Usage

getting the page dom
require_once('vendor/autoload.php');

$client = new Bowtie\Grawler\Client();

$grawler = $client->download('http://example.com');

finding basic attributes
$grawler->title();
// provide a css path to find the attribute
$grawler->body($path = '.main-content');
// extracts meta keywords (array)
$grawler->keywords();
// extracts meta description 
$grawler->description();

finding media
$grawler->images('.content img');
$grawler->videos('iframe');
$grawler->audio('.audio iframe');
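
As a minimal sketch, the basic attribute calls above can be gathered into one array. The helper name `summarizePage` is hypothetical and not part of the package; it only assumes that its argument exposes the `title()`, `body()`, `keywords()` and `description()` methods shown above.

```php
<?php
// Hypothetical helper (not part of grawler): collects the basic page
// attributes into a single array. $grawler is assumed to be the object
// returned by $client->download().
function summarizePage($grawler, string $bodyPath = '.main-content'): array
{
    return [
        'title'       => $grawler->title(),
        'body'        => $grawler->body($bodyPath),   // css path to the body element
        'keywords'    => $grawler->keywords(),        // array of meta keywords
        'description' => $grawler->description(),
    ];
}
```

Usage would look like `$summary = summarizePage($client->download('http://example.com'));`.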

Resolving media attributes

In order to resolve media attributes, you need to load the providers' configuration (see Grawler config).

videos

Current video resolvers: YouTube, Vimeo

// resolve all videos at once 
$videos = $grawler->videos('iframe')->resolve();

Then you can access the video attributes as follows:

foreach($videos as $video)
{
  $video->id; // the video provider id
  $video->title;
  $video->description;
  $video->url;
  $video->embedUrl;
  $video->images; // Collection of Image instances
  $video->author;
  $video->authorId;
  $video->duration;
  $video->provider; //video source
}

You can also resolve videos individually as follows:

$videos = $grawler->videos('iframe');

foreach($videos as $video)
{
  $video->resolve();
  $video->title;
  //...
}
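
When items are resolved one at a time, a single failing provider lookup does not have to stop the loop. The sketch below assumes that `resolve()` may throw on failure, which the README does not state; `resolveEach` is a hypothetical helper, not part of the package.

```php
<?php
// Hypothetical helper (not part of grawler): resolves each media item
// individually and keeps going when one of them fails.
// Assumption: resolve() signals failure by throwing.
function resolveEach(iterable $items): array
{
    $resolved = [];
    $failed   = [];

    foreach ($items as $item) {
        try {
            $item->resolve();
            $resolved[] = $item;
        } catch (\Throwable $e) {
            // Keep the item and the error message so the caller can log or retry.
            $failed[] = ['item' => $item, 'error' => $e->getMessage()];
        }
    }

    return ['resolved' => $resolved, 'failed' => $failed];
}
```

For example, `$result = resolveEach($grawler->videos('iframe'));` would leave the successfully resolved videos in `$result['resolved']`.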

audio

Current audio resolvers: SoundCloud

// resolve all audio at once 
$audio = $grawler->audio('.audio iframe')->resolve();

Then you can access the audio attributes as follows:

foreach($audio as $track)
{
  $track->id; // the audio provider id
  $track->title;
  $track->description;
  $track->url;
  $track->embedUrl;
  $track->images; // Collection of cover photo instances
  $track->author;
  $track->authorId;
  $track->duration;
  $track->provider; // audio source
}

You can also resolve audio individually as follows:

$audio = $grawler->audio('.audio iframe');

foreach($audio as $track)
{
  $track->resolve();
  $track->title;
  //...
}

Resolving page urls

$links = $grawler->links('.main thumb a');

foreach($links as $link)
{
  print $link;
  //or
  print $link->uri;
  //or
  print $link->getUri();
}
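
A guided crawl often follows only links that stay on the same site. The sketch below filters a list of links by host using PHP's built-in `parse_url`. `sameHostLinks` is a hypothetical helper, not part of the package, and it assumes each link can be cast to a string, as the `print $link` call above suggests.

```php
<?php
// Hypothetical helper (not part of grawler): keeps only the links whose
// host matches $host. Relative or malformed URLs are skipped.
function sameHostLinks(iterable $links, string $host): array
{
    $kept = [];

    foreach ($links as $link) {
        // Assumption: a link prints as its URI, so a string cast works.
        $uri      = (string) $link;
        $linkHost = parse_url($uri, PHP_URL_HOST);

        // parse_url returns null when there is no host and false on
        // seriously malformed input; both are skipped here.
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $kept[] = $uri;
        }
    }

    return $kept;
}
```

For example, `sameHostLinks($grawler->links('.main thumb a'), 'example.com')` would drop links that point to other hosts.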

Configuration

Client Config

Set user agent
$client->agent('Googlebot/2.1')->download('http://example.com');

Recommended: http://webmasters.stackexchange.com/questions/6205/what-user-agent-should-i-set

Set request auth
$client->auth('me', '**');

You can change the auth type as follows:

$client->auth('me', '**', $type = 'basic');

Set request method
$client->method('post');

Grawler config

By default, the grawler tries to read these environment variables:

GRAWLER_YOUTUBE_KEY

GRAWLER_VIMEO_KEY
GRAWLER_VIMEO_SECRET

GRAWLER_SOUNDCLOUD_KEY
GRAWLER_SOUNDCLOUD_SECRET

If you don't use environment variables, you can load the configuration as follows:

$config = [
  'youtubeKey' => '',

  'vimeoKey'    => '',
  'vimeoSecret' => '',

  'soundcloudKey'    => '',
  'soundcloudSecret' => '',
];

$grawler->loadConfig($config);
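
The two configuration routes can also be combined. The sketch below builds the config array from the environment variables listed above and lets explicit values override them. `grawlerConfigFromEnv` is a hypothetical helper, not part of the package, and pairing each config key with the variable of the matching name is an assumption based on the two lists above.

```php
<?php
// Hypothetical helper (not part of grawler): builds the array expected by
// loadConfig() from environment variables, with optional overrides.
function grawlerConfigFromEnv(array $overrides = []): array
{
    // Assumed mapping between config keys and environment variables.
    $map = [
        'youtubeKey'       => 'GRAWLER_YOUTUBE_KEY',
        'vimeoKey'         => 'GRAWLER_VIMEO_KEY',
        'vimeoSecret'      => 'GRAWLER_VIMEO_SECRET',
        'soundcloudKey'    => 'GRAWLER_SOUNDCLOUD_KEY',
        'soundcloudSecret' => 'GRAWLER_SOUNDCLOUD_SECRET',
    ];

    $config = [];
    foreach ($map as $key => $envName) {
        $value        = getenv($envName);            // false when the variable is unset
        $config[$key] = $value === false ? '' : $value;
    }

    // Explicit values win over the environment.
    return array_merge($config, $overrides);
}
```

A call such as `$grawler->loadConfig(grawlerConfigFromEnv(['youtubeKey' => 'my-key']));` would then take the YouTube key from the argument and the rest from the environment.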

Testing

$ phpunit --testsuite unit
$ phpunit --testsuite integration

NB: you must set your provider keys (YouTube, Vimeo, SoundCloud...) to run the integration tests.

Contributing

Please see CONTRIBUTING

Security

If you discover any security related issues, please email sleiman@bowtie.land instead of using the issue tracker.

License

The MIT License (MIT). Please see License File for more information.