mjorgens / web-crawler
A PHP web crawler library
V1.0.3
2021-02-15 17:22 UTC
Requires
- php: ^7.2
- guzzlehttp/guzzle: ^6.0 || ^7.0
- guzzlehttp/psr7: ^1.0
- illuminate/database: ^6.20.15 || ^7.30.4 || ^8.25.0
- symfony/dom-crawler: ^4.0 || ^5.0
Requires (Dev)
- phpunit/phpunit: ^8.0 || ^9.0
- squizlabs/php_codesniffer: ^3.5
This package is auto-updated.
Last update: 2024-04-11 01:16:20 UTC
README
This is a PHP library that takes a starting URL, parses the page's HTML, and extracts the URLs it contains. It then follows those URLs and parses each page in turn until the maximum number of URLs is reached.
Requirements
PHP 7.2 or higher (see the full dependency list above).
Installation
The recommended way to install this library is through Composer.
composer require mjorgens/web-crawler
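
Alternatively, you can declare the dependency in your project's composer.json and run composer install. The version constraint below is illustrative; it matches the V1.0.3 release shown above.

```json
{
    "require": {
        "mjorgens/web-crawler": "^1.0"
    }
}
```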
Usage
// Note: the Uri class comes from guzzlehttp/psr7; the Crawler class is
// assumed to live in the package's Mjorgens\Crawler namespace.
use GuzzleHttp\Psr7\Uri;
use Mjorgens\Crawler\Crawler;

$repository = new \Mjorgens\Crawler\CrawledRepository\CrawledMemoryRepository(); // The collection of pages
$url = new Uri('https://example.com'); // Starting url
$maxUrls = 5; // Max number of urls to crawl

Crawler::create()
    ->setRepository($repository)
    ->setMaxCrawl($maxUrls)
    ->startCrawling($url); // Start the crawler

foreach ($repository as $page) {
    echo $page->url;
    echo $page->html;
}