mjorgens / web-crawler

A PHP web crawler library

github.com/mjorgens/web-crawler

Installs: 12

Dependents: 0

Suggesters: 0

Security: 0

Stars: 0

Watchers: 0

Forks: 0

Open Issues: 0

V1.0.3 2021-02-15 17:22 UTC

Requires

php: ^7.2
guzzlehttp/guzzle: ^6.0 || ^7.0
guzzlehttp/psr7: ^1.0
illuminate/database: ^6.20.15 || ^7.30.4 || ^8.25.0
symfony/dom-crawler: ^4.0 || ^5.0

Requires (Dev)

phpunit/phpunit: ^8.0 || ^9.0
squizlabs/php_codesniffer: ^3.5

Suggests

None

Provides

None

Conflicts

None

Replaces

None

MIT cfcc25d0b657d8d74bc7f9d82f9b4f8af88c2a86

Marc Jorgensen <marcjorgensen.woop@gmail.com>

dev-master
V1.0.3
v1.0.2
v1.0.1
v1.0
dev-version-bump

This package is auto-updated.

Last update: 2025-06-11 03:48:54 UTC

README

This PHP library takes a starting URL, parses the page's HTML, and extracts the URLs it contains. It then follows those URLs and parses each linked page in turn, stopping once the maximum number of URLs has been reached.
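The crawl loop described above can be illustrated with a small standalone sketch. This is not the library's implementation: `extractLinks`, `crawl`, and the `$fetch` callable are hypothetical names chosen for illustration, the breadth-first visiting order is an assumption, and the sketch uses PHP's built-in `DOMDocument` rather than the library's dependencies. Relative links are skipped to keep it short.

```php
<?php
declare(strict_types=1);

/**
 * Extract absolute http(s) link targets from an HTML string.
 * Relative links are ignored in this sketch.
 */
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $dom = new DOMDocument();
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match('#^https?://#i', $href) === 1) {
            $links[] = $href;
        }
    }
    return $links;
}

/**
 * Breadth-first crawl: fetch the start URL, queue every link found,
 * and stop once $maxUrls pages have been stored.
 * $fetch is a hypothetical callable(string $url): string returning HTML.
 */
function crawl(string $startUrl, int $maxUrls, callable $fetch): array
{
    $queue = [$startUrl];
    $pages = []; // url => html
    while ($queue !== [] && count($pages) < $maxUrls) {
        $url = array_shift($queue);
        if (isset($pages[$url])) {
            continue; // already crawled
        }
        $html = $fetch($url);
        $pages[$url] = $html;
        foreach (extractLinks($html) as $link) {
            if (!isset($pages[$link])) {
                $queue[] = $link;
            }
        }
    }
    return $pages;
}

// An in-memory "site" stands in for real HTTP requests.
$site = [
    'https://example.com'   => '<a href="https://example.com/a">A</a><a href="https://example.com/b">B</a>',
    'https://example.com/a' => '<a href="https://example.com">Home</a><a href="https://example.com/c">C</a>',
    'https://example.com/b' => '<p>No links</p>',
    'https://example.com/c' => '<p>No links</p>',
];
$fetch = function (string $url) use ($site): string {
    return $site[$url] ?? '';
};

$pages = crawl('https://example.com', 3, $fetch);
echo implode(PHP_EOL, array_keys($pages)), PHP_EOL;
```

With a limit of 3, the sketch stores the start page and its two direct links, then stops before reaching `https://example.com/c`. Passing the fetcher as a callable keeps the loop testable without network access.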

Requirements

PHP ^7.2, plus the Composer packages listed under Requires above.

Installation

The recommended way to install this library is through Composer.

composer require mjorgens/web-crawler

Usage

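// Assumed imports: Uri comes from guzzlehttp/psr7 (a listed dependency); the Crawler
// namespace is inferred from the repository class, so verify it against the source.
use GuzzleHttp\Psr7\Uri;
use Mjorgens\Crawler\Crawler;
use Mjorgens\Crawler\CrawledRepository\CrawledMemoryRepository;

$repository = new CrawledMemoryRepository(); // In-memory collection of crawled pages
$url = new Uri('https://example.com');       // Starting URL
$maxUrls = 5;                                // Maximum number of URLs to crawl

Crawler::create()
    ->setRepository($repository)
    ->setMaxCrawl($maxUrls)
    ->startCrawling($url); // Start the crawler

// Print the URL and raw HTML of each crawled page
foreach ($repository as $page) {
    echo $page->url, PHP_EOL;
    echo $page->html, PHP_EOL;
}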