w3zone/crawler


Write Less, Do More.

Installation

composer require w3zone/crawler

Requirements

  • node.js > 4.x
  • libcurl
  • php-curl
  • node.js request module:
    npm install request

Usage

require_once 'vendor/autoload.php';

use w3zone\Crawler\{Crawler, Services\phpCurl};

$crawler = new Crawler(new phpCurl);

$link = 'http://www.example.com';

// run() returns an array: [statusCode, body, headers, cookies]
// get() accepts either a URL string or an array of [url, query string]
$homePage = $crawler->get($link)->dumpHeaders()->run();

$response = $crawler->get($link)->dumpHeaders()->cookies($homePage['cookies'], 'w+r')->run();
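
Since run() returns an array, each piece of the response can be read directly:

echo $homePage['statusCode']; // e.g. 200
echo $homePage['body'];       // the raw HTML of the page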

Available Services

  • phpCurl
    use w3zone\Crawler\Services\phpCurl;
  • nodejsRequest
    use w3zone\Crawler\Services\nodejsRequest;
  • cliCurl
    use w3zone\Crawler\Services\cliCurl;
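
Services are interchangeable. A minimal sketch, assuming each one is passed to the Crawler constructor as in the Usage section (the nodejsRequest service needs node.js and the request module):

use w3zone\Crawler\{Crawler, Services\nodejsRequest};

// Same chainable API, different HTTP backend.
$crawler = new Crawler(new nodejsRequest);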

Available Methods

  • Get
    Crawler::get(mixed $arguments);
    sets the request method to GET;
    accepts either a URL string or an array of [url, query string].

  • Post
    Crawler::post(mixed $arguments);
    sets the request method to POST;
    accepts an array of options, for example:

$arguments = [
    'url' => 'www.example.com/login',
    'data' => [
        'username' => '',
        'password' => ''
    ]
];
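
A minimal sketch of firing it, assuming post() chains like get():

$response = $crawler->post($arguments)->run();
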
  • Json
    Crawler::json(void)
    an easy way to create a JSON request.

  • XML
    Crawler::xml(void)
    an easy way to create an XML request.
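
A hedged sketch, assuming json() and xml() chain before run() like the other modifiers (the endpoint URL here is hypothetical):

$response = $crawler
    ->post(['url' => 'http://www.example.com/api', 'data' => ['key' => 'value']])
    ->json()
    ->run();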

  • Referer
    Crawler::referer(string $referer)
    sets the Referer header for the current request.
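
A minimal sketch, assuming referer() chains like the other modifiers:

$response = $crawler->get($link)->referer('http://www.example.com')->run();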

  • Headers
    Crawler::headers(array $headers)
    sets additional request headers;
    note that this method overrides headers set by the json() and xml() methods.
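
A minimal sketch; the header format shown here (header name => value) is an assumption:

$response = $crawler
    ->get($link)
    ->headers(['X-Requested-With' => 'XMLHttpRequest'])
    ->run();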

  • DumpHeaders
    Crawler::dumpHeaders(void)
    includes the response headers in the returned response array.

  • Proxy
    Crawler::proxy(mixed $proxy)
    sets the request proxy IP and proxy type;
    accepts either an array of proxy IP and type, or an IP string:

$proxy = [
    'ip' => 'xx.xx.xx.xx:xx',
    'type' => 'socks5'
];

If you pass the IP as a plain string, the proxy type defaults to HTTP.
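
A minimal sketch of both forms, assuming proxy() chains like the other modifiers:

// Array form: explicit proxy type.
$response = $crawler->get($link)->proxy($proxy)->run();

// String form: the type defaults to HTTP.
$response = $crawler->get($link)->proxy('xx.xx.xx.xx:xx')->run();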

  • Cookies
    Crawler::cookies(string $file, string $mode)
    sets the request cookies; the first argument is a cookie string,
    the second argument is the cookie mode.
    Available modes:
    -- w : write only
    -- r : read only
    -- w+r : read and write
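
A minimal sketch of carrying cookies between requests, following the Usage section:

// Capture cookies from the first response, then send them back read/write.
$homePage = $crawler->get($link)->dumpHeaders()->run();
$response = $crawler->get($link)->cookies($homePage['cookies'], 'w+r')->run();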

  • Initialize
    Crawler::initialize(array $arguments)
    initializes or re-initializes your request;
    note that this method will overwrite the other options.
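
A minimal sketch; passing raw CURLOPT_* constants is an assumption drawn from the GitHub example below, and applies to the phpCurl service:

$response = $crawler
    ->get($link)
    ->initialize([CURLOPT_FOLLOWLOCATION => true, CURLOPT_TIMEOUT => 30])
    ->run();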

  • Run
    Crawler::run(void)
    fires the request and returns the response array.

Examples

Quick example of logging in to GitHub:

require_once 'vendor/autoload.php';

use w3zone\Crawler\{Crawler, Services\phpCurl};

$crawler = new Crawler(new phpCurl);

$url = 'https://github.com/login';
$response = $crawler->get($url)->dumpHeaders()->run();

preg_match('#<input name="authenticity_token".*?value="(.*?)"#', $response['body'], $authenticity_token);

$url = 'https://github.com/session';
$post['commit'] = 'Sign in';
$post['utf8'] = '✓';
$post['authenticity_token'] = $authenticity_token[1];
$post['login'] = 'valid email';
$post['password'] = '';

$response = $crawler
    ->post(['url' => $url, 'data' => $post])
    ->cookies($response['cookies'], 'w+r')
    ->initialize([
        CURLOPT_FOLLOWLOCATION => true
    ])
    ->dumpHeaders()
    ->run();
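
Since dumpHeaders() is set, the response can be checked afterwards; what a successful login returns is an assumption here:

// Assumption: a successful login ends on a 200 page after following redirects.
if ($response['statusCode'] === 200) {
    echo 'Logged in';
}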