danielbehrendt / web-scraper
This package can be used to scrape elements from websites.
Requires
- php: ^7.4
- spatie/crawler: ^4.7
- symfony/console: ^5.0
- tightenco/collect: ^7.6
Requires (Dev)
- phpunit/phpunit: ^9
- symfony/process: ^5.0
- symfony/var-dumper: ^5.0
This package is auto-updated.
Last update: 2025-03-15 20:06:59 UTC
README
This package can be used to scrape elements from a website. It wraps the fantastic Spatie Crawler (as I didn't want to reinvent the wheel, in terms of building just another crawler).
The crawler is preconfigured to crawl only internal URLs, starting from a given URL.
Installation
This package can be installed with Composer:
```bash
composer require danielbehrendt/web-scraper
```
Client options
Client options for the crawler can be set by passing them to the constructor of the web scraper:
```php
$webScraper = new WebScraper([
    'allow_redirects' => true,
]);
```
As Spatie Crawler uses Guzzle under the hood, all Guzzle request options can be passed.
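As a sketch, a few other common Guzzle request options could be passed the same way (the particular values below are only illustrative):

```php
<?php

use DanielBehrendt\WebScraper\WebScraper;

// Illustrative Guzzle request options; any option Guzzle accepts should work.
$webScraper = new WebScraper([
    'allow_redirects' => true,
    'timeout'         => 10,   // give up on a request after 10 seconds
    'connect_timeout' => 5,    // give up if no connection within 5 seconds
    'headers'         => [
        'Accept-Language' => 'en',
    ],
]);
```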
Crawler options
The following options (of the Spatie Crawler) can be set via the options argument:
```php
$results = $webScraper->getResults(
    $url,
    [
        'userAgent' => 'my-agent',
        'concurrency' => 1,
        'maximumCrawlCount' => 5,
        'maximumDepth' => 5,
        'maximumResponseSize' => 1024 * 1024 * 3,
        'delayBetweenRequests' => 150,
        'parseableMimeTypes' => [
            'text/html', 'text/plain',
        ],
    ]
);
```
Usage
The web scraper can be instantiated like this:
```php
use DanielBehrendt\WebScraper\WebScraper;

$webScraper = new WebScraper();
```
Getting results
$url must be absolute, starting with a scheme.

```php
$results = $webScraper->getResults($url);
```

The getResults method returns a Laravel Collection, so all available Collection methods are supported.
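For instance, the results could be filtered and counted with standard Collection methods. This is only a sketch: the 'statusCode' key used below is an assumption about the result structure and may differ from what the configured scraper actually returns.

```php
<?php

use DanielBehrendt\WebScraper\WebScraper;

$webScraper = new WebScraper();
$results = $webScraper->getResults('https://httpbin.org');

// Standard Laravel Collection methods; the 'statusCode' key is assumed here.
$okPages = $results->filter(fn ($page) => ($page['statusCode'] ?? null) === 200);

echo $okPages->count();
```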
Scrapers
A scraper instance can be set via the setScraper method:

```php
$webScraper->setScraper(new MarkupScraper());
```
This package comes with the following scrapers:

- HeaderScraper: returns the headers and status code of each crawled page (default scraper)
- MarkupScraper: scrapes some markup of each crawled page (headers and status code will not be returned)
- UnencryptedEmailScraper: scrapes unencrypted emails in the markup of each crawled page (headers and status code will not be returned)
Full example:
```php
<?php

use DanielBehrendt\WebScraper\WebScraper;
use DanielBehrendt\WebScraper\Scrapers\MarkupScraper;

$webScraper = new WebScraper();

$results = $webScraper
    ->setScraper(new MarkupScraper())
    ->getResults('https://httpbin.org/html');
```
Add your own scraper
You can define your own scraper and set it via the setScraper method. The passed object must extend the abstract \DanielBehrendt\WebScraper\Scrapers\BaseScraper class and must implement a getElementSelectors method.
Example:
```php
<?php

use DanielBehrendt\WebScraper\Scrapers\BaseScraper;

class CustomScraper extends BaseScraper
{
    /**
     * @return array
     */
    public function getElementSelectors(): array
    {
        return [
            'h1' => [
                'filter' => '//h1/text()',
            ],
            'h2' => [
                'filter' => '//h2/text()',
            ],
            'h3' => [
                'filter' => '//h3/text()',
            ],
        ];
    }
}
```
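A custom scraper defined this way could then be used like any built-in one. This is a sketch; it assumes the CustomScraper class above is autoloadable, and the URL is only a placeholder:

```php
<?php

use DanielBehrendt\WebScraper\WebScraper;

$webScraper = new WebScraper();

// Plug in the custom scraper, then crawl as usual.
$results = $webScraper
    ->setScraper(new CustomScraper())
    ->getResults('https://httpbin.org/html');

// $results is a Laravel Collection holding the h1/h2/h3 matches per page.
```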
CLI
This package also comes with some CLI commands:
```bash
./console web-scraper:header
./console web-scraper:markup
./console web-scraper:unencrypted-email
```

Each command corresponds to one of the scrapers mentioned above.
Changelog
Please see CHANGELOG for more information on what has changed recently.
License
The MIT License (MIT). Please see License File for more information.