danielbehrendt/web-scraper

This package can be used to scrape elements from websites.

v1.1.0 2020-04-25 06:05 UTC




This package can be used to scrape elements from a website. It wraps the fantastic Spatie Crawler, as I didn't want to reinvent the wheel by building yet another crawler.

The crawler is preconfigured to crawl only internal URLs from a given starting point.

Installation

This package can be installed with Composer:

composer require danielbehrendt/web-scraper

Client options

Client options for the crawler can be set by passing them to the constructor of the web scraper:

$webScraper = new WebScraper([
    'allow_redirects' => true
]);

Since Spatie Crawler uses Guzzle under the hood, all Guzzle request options can be passed.
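For example, a sketch passing a few common Guzzle request options (the values below are chosen purely for illustration):

```php
<?php

use DanielBehrendt\WebScraper\WebScraper;

// Illustrative values only; any Guzzle request option is accepted here.
$webScraper = new WebScraper([
    'allow_redirects' => true,
    'timeout'         => 10,    // abort a request after 10 seconds
    'verify'          => true,  // verify SSL certificates
    'headers'         => [
        'Accept-Language' => 'en',
    ],
]);
```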

Crawler options

The following options of the Spatie Crawler can be set via the options argument of the getResults method:

$results = $webScraper->getResults(
    $url,
    [
        'userAgent' => 'my-agent',
        'concurrency' => 1,
        'maximumCrawlCount' => 5,
        'maximumDepth' => 5,
        'maximumResponseSize' => 1024 * 1024 * 3,
        'delayBetweenRequests' => 150,
        'parseableMimeTypes' => [
            'text/html', 'text/plain',
        ],
    ]
);

Usage

The web scraper can be instantiated like this:

use DanielBehrendt\WebScraper\WebScraper;

$webScraper = new WebScraper();

Getting results

$url must be absolute, starting with a scheme.

$results = $webScraper->getResults($url);

The getResults method returns a Laravel Collection, so all of its methods are available.
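As a sketch, results can be filtered and counted with standard Collection methods. Note that the exact shape of each result depends on the active scraper; the 'statusCode' key below is an assumption based on the default HeaderScraper and may differ in practice:

```php
<?php

use DanielBehrendt\WebScraper\WebScraper;

$webScraper = new WebScraper();
$results = $webScraper->getResults('https://example.com');

// Standard Laravel Collection methods; 'statusCode' is an assumed key.
$errorPages = $results
    ->filter(fn ($page) => ($page['statusCode'] ?? 200) >= 400)
    ->values();

echo $errorPages->count();
```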

Scrapers

A scraper instance can be set via the setScraper method.

$webScraper->setScraper(new MarkupScraper());

This package comes with some Scrapers:

  • HeaderScraper: returns the headers and status code of each crawled page (default scraper)
  • MarkupScraper: scrapes some markup of each crawled page (headers and status code will not be returned)
  • UnencryptedEmailScraper: scrapes unencrypted email addresses in the markup of each crawled page (headers and status code will not be returned)

Full example:

<?php

use DanielBehrendt\WebScraper\WebScraper;
use DanielBehrendt\WebScraper\Scrapers\MarkupScraper;

$webScraper = new WebScraper();

$results = $webScraper
    ->setScraper(new MarkupScraper())
    ->getResults('https://httpbin.org/html');

Add your own scraper

You can define your own scraper and set it via the setScraper method. The passed object must extend the abstract \DanielBehrendt\WebScraper\Scrapers\BaseScraper class and implement a getElementSelectors method.

Example:

<?php

use DanielBehrendt\WebScraper\Scrapers\BaseScraper;

class CustomScraper extends BaseScraper
{
    /**
     * @return array
     */
    public function getElementSelectors(): array
    {
        return [
            'h1' => [
                'filter' => '//h1/text()',
            ],
            'h2' => [
                'filter' => '//h2/text()',
            ],
            'h3' => [
                'filter' => '//h3/text()',
            ],
        ];
    }
}
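The custom scraper defined above could then be used like any of the bundled scrapers (a minimal sketch; the target URL is just an example):

```php
<?php

use DanielBehrendt\WebScraper\WebScraper;

// CustomScraper is the example class defined above.
$results = (new WebScraper())
    ->setScraper(new CustomScraper())
    ->getResults('https://httpbin.org/html');

// Each result should contain the elements matched by the
// 'h1', 'h2' and 'h3' XPath selectors.
```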

CLI

This package also comes with some CLI commands:

  • ./console web-scraper:header
  • ./console web-scraper:markup
  • ./console web-scraper:unencrypted-email

Each command corresponds to one of the Scrapers mentioned above.

Changelog

Please see CHANGELOG for more information on what has changed recently.

License

The MIT License (MIT). Please see License File for more information.