malahierba-lab/web-harvester

Laravel HTTP Client with JavaScript capabilities

1.2.2 2016-08-27 16:56 UTC

README

A tool for getting information from external websites. Powered by PhantomJS and the malahierba.cl dev team.

Installation

Add to your composer.json:

{
    "require": {
        "malahierba-lab/web-harvester": "1.*"
    }
}

Then run the composer update command.

After installation, register the service provider by adding it to the providers section of config/app.php:

Malahierba\WebHarvester\WebHarvesterServiceProvider::class

Now publish the config file by running php artisan vendor:publish

Configuration

Laravel Web Harvester runs on the PhantomJS headless WebKit browser. PhantomJS is bundled as a binary, so before you can use this package you need to specify your OS. This is done in the config file config/webharvester.php.

Set the environment option to one of the supported values:

  • linux-i686-32
  • linux-i686-64
  • macosx
  • windows

example: 'environment' => 'macosx'
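
If the same code base is deployed to more than one OS, the value could be derived at runtime instead of hard-coded. The sketch below is only an illustration: guessWebHarvesterEnvironment is a hypothetical helper that is not part of this package, and mapping 32-bit and 64-bit Linux to linux-i686-32 and linux-i686-64 is an assumption based on the option names.

```php
<?php

/**
 * Map an OS family and integer size to one of the environment values
 * listed above. Hypothetical helper; the Linux 32/64-bit mapping is an
 * assumption based on the option names.
 */
function guessWebHarvesterEnvironment(string $osFamily, int $intSize): ?string
{
    switch ($osFamily) {
        case 'Darwin':
            return 'macosx';
        case 'Windows':
            return 'windows';
        case 'Linux':
            // PHP_INT_SIZE is 8 on 64-bit builds and 4 on 32-bit builds
            return $intSize >= 8 ? 'linux-i686-64' : 'linux-i686-32';
        default:
            return null; // none of the listed options matches this platform
    }
}

// PHP_OS_FAMILY requires PHP 7.2 or later
$environment = guessWebHarvesterEnvironment(PHP_OS_FAMILY, PHP_INT_SIZE);
```

The result could then be used as the value of 'environment' in the config file, with a fallback for the null case.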

Usage

Important: For documentation purposes, the examples below always assume that you have imported the library into your namespace with use Malahierba\WebHarvester;

Get WebPage Components

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->load($url)) {

    //Page Title
    $title                   = $webharvester->getTitle();

    //Page Description
    $description             = $webharvester->getDescription();

    //Get Status Code (if the URL redirects to another web page, returns the status code of the final page)
    $status_code             = $webharvester->getStatusCode();

    //Page Featured Image as URL
    $featured_image_url      = $webharvester->getFeaturedImage();

    //Page Featured Image as Base64
    $featured_image_base_64  = $webharvester->getFeaturedImage('base64');

    //Page real URL (if $url redirects to another URL, returns the final one)
    $real_url                = $webharvester->getRealURL();

    //Site Name
    $sitename                = $webharvester->getSiteName();
}

Get expected behavior of the Robot (based on meta name="robots")

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->load($url)) {

    //check for index
    if ($webharvester->isIndexable()) {

        //...some code

    }

    //check for follow
    if ($webharvester->isFollowable()) {

        //...some code
        
    }

}
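
For readers unfamiliar with the robots meta tag, the following sketch shows how its content value is conventionally read. It is not the package's implementation: robotsDirectives is a hypothetical helper, and it assumes the common convention that a page is indexable and followable unless noindex, nofollow, or none says otherwise.

```php
<?php

/**
 * Interpret the content attribute of <meta name="robots">.
 * Hypothetical helper, not part of the package.
 */
function robotsDirectives(?string $content): array
{
    // Split a value such as "noindex, nofollow" into lowercase tokens
    $tokens = array_filter(array_map('trim', explode(',', strtolower((string) $content))));

    // "none" is shorthand for "noindex, nofollow"
    $none = in_array('none', $tokens, true);

    return [
        'indexable'  => !$none && !in_array('noindex', $tokens, true),
        'followable' => !$none && !in_array('nofollow', $tokens, true),
    ];
}

$directives = robotsDirectives('NOINDEX, follow');
// $directives['indexable'] is false, $directives['followable'] is true
```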

Get links found in a WebPage (useful for web crawlers, web spiders, etc.)

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->load($url)) {

    //all full links as array

    $links = $webharvester->getLinks();  //retrieve an array with found links

    //all links as array, with the query component removed (from the character "?" onwards)

    $links = $webharvester->getLinks([
        'remove' => ['query']
    ]);

    //retrieve links as array of objects (properties: url, follow)
    //if follow is false, the source website marked that link as nofollow (rel='nofollow')

    $links = $webharvester->getLinks(['only_urls' => false]); //default true

}

Important: For security reasons, links with embedded JavaScript are not included in the output array.
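
A crawler would typically queue only the links it is allowed to follow. The sketch below shows one way to do that with the object form of the result. followableUrls is a hypothetical helper, and it assumes each item exposes public url and follow properties as described in the comments above. The sample array stands in for a real getLinks(['only_urls' => false]) result.

```php
<?php

/**
 * Return the unique URLs of links whose follow flag is true.
 * Hypothetical helper; assumes public "url" and "follow" properties.
 */
function followableUrls(array $links): array
{
    $urls = [];
    foreach ($links as $link) {
        if (!empty($link->follow)) {
            $urls[] = $link->url;
        }
    }

    // Drop duplicates and reindex from zero
    return array_values(array_unique($urls));
}

// Sample data standing in for $webharvester->getLinks(['only_urls' => false])
$sample = [
    (object) ['url' => 'http://someurl/a', 'follow' => true],
    (object) ['url' => 'http://someurl/b', 'follow' => false],
    (object) ['url' => 'http://someurl/a', 'follow' => true],
];

$queue = followableUrls($sample); // ['http://someurl/a']
```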

Get WebPage Raw Content

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->load($url)) {
    $raw = $webharvester->content();
}

Take a Screenshot of a WebPage

$url = 'http://someurl';
$webharvester = new WebHarvester;

//Check if we can process the URL and Load it
if ($webharvester->takeScreenshot($url)) {
    $image_base_64 = $webharvester->content();  //return a base64 string
}
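
To store the screenshot on disk, the base64 string has to be decoded first. A minimal sketch follows, assuming content() returns bare base64 data with no data: URI prefix. This README does not state the image format, so the file extension is left to the caller. saveBase64Image is a hypothetical helper, and the payload here is a stand-in rather than a real screenshot.

```php
<?php

/**
 * Decode a base64 string and write the resulting bytes to $path.
 * Returns false if the input is not valid base64 or the write fails.
 * Hypothetical helper, not part of the package.
 */
function saveBase64Image(string $base64, string $path): bool
{
    // Strict mode makes base64_decode return false on invalid characters
    $bytes = base64_decode($base64, true);
    if ($bytes === false) {
        return false;
    }

    return file_put_contents($path, $bytes) !== false;
}

// Stand-in payload; in real use this would be $webharvester->content()
$path = tempnam(sys_get_temp_dir(), 'shot');
$ok = saveBase64Image(base64_encode('fake image bytes'), $path);
```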

Setup Options

You can customize the webharvester with the following setter methods:

$webharvester = new WebHarvester;

//Custom User Agent
$webharvester->setUserAgent('your user agent');

//Ignore SSL Errors
$webharvester->setIgnoreSSLErrors(true);

//Resource Timeout (in milliseconds)
$webharvester->setResourceTimeout(3000);

//Wait after load (in milliseconds)
$webharvester->setWaitAfterLoad(3000);  // <- useful for getting async content

Licence

This project is released under the MIT licence. For more information, please read the LICENCE file.