Crawl all internal links found on a website
Last update: 2023-01-07 17:17:25 UTC
This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple urls concurrently.
Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.
This package can be installed via Composer:
composer require spatie/crawler
The crawler can be instantiated like this:
    Crawler::create()
        ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
        ->startCrawling($url);
The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver interface:
    /**
     * Called when the crawler will crawl the given url.
     *
     * @param \Spatie\Crawler\Url $url
     */
    public function willCrawl(Url $url);

    /**
     * Called when the crawler has crawled the given url.
     *
     * @param \Spatie\Crawler\Url $url
     * @param \Psr\Http\Message\ResponseInterface $response
     * @param \Spatie\Crawler\Url $foundOn
     */
    public function hasBeenCrawled(Url $url, $response, Url $foundOn = null);

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling();
The package will make an educated guess as to where Chrome is installed on your system. You can also manually pass the location of the Chrome binary to
You can tell the crawler not to visit certain urls by using the setCrawlProfile function. That function expects an object that implements the CrawlProfile interface:
    /*
     * Determine if the given url should be crawled.
     */
    public function shouldCrawl(Url $url): bool;
This package comes with three CrawlProfiles out of the box:
- CrawlAllUrls: this profile will crawl all urls on all pages, including urls to an external site.
- CrawlInternalUrls: this profile will only crawl the internal urls on the pages of a host.
- CrawlSubdomainUrls: this profile will only crawl the internal urls of a host and the urls of its subdomains.
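To make the difference between the three profiles concrete, here is a minimal standalone sketch of the host comparison each one implies. It does not use the package's classes. The function name mirrors the shouldCrawl method above, but the `$profile` strings, the `$baseUrl` parameter and the suffix-based subdomain check are assumptions made for illustration, not the package's actual implementation.

```php
<?php

/**
 * Decide whether $url should be crawled, given the site being crawled.
 * Hypothetical helper for illustration; not part of spatie/crawler.
 */
function shouldCrawl(string $profile, string $baseUrl, string $url): bool
{
    // parse_url() returns null when the url has no host; cast to '' in that case.
    $baseHost = strtolower((string) parse_url($baseUrl, PHP_URL_HOST));
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));

    switch ($profile) {
        case 'all':
            // Crawl everything, including links to external sites.
            return true;
        case 'internal':
            // Only urls on exactly the same host.
            return $host === $baseHost;
        case 'subdomains':
            // The same host, or any host ending in ".<base host>".
            $suffix = '.' . $baseHost;

            return $host === $baseHost
                || substr($host, -strlen($suffix)) === $suffix;
        default:
            throw new InvalidArgumentException("Unknown profile: {$profile}");
    }
}
```

With `https://example.com` as the base url, a link to `https://blog.example.com/` is rejected by the 'internal' rule but accepted by the 'subdomains' rule, while a link to `https://other.org/` is only accepted by 'all'.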
To improve the speed of the crawl the package concurrently crawls 10 urls by default. If you want to change that number you can use the setConcurrency method:
    Crawler::create()
        ->setConcurrency(1) // now all urls will be crawled one by one
By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the number of urls the crawler should crawl you can use the setMaximumCrawlCount method:
    // stop crawling after 5 urls
    Crawler::create()
        ->setMaximumCrawlCount(5)
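The effect of a maximum crawl count can be sketched without the package: the following standalone example walks a small in-memory link graph breadth-first and stops once the limit is reached. The function `crawlWithLimit` and the `$links` array are hypothetical and exist only for this illustration; the package's own traversal order and bookkeeping may differ.

```php
<?php

/**
 * Breadth-first walk over an in-memory link graph, stopping once
 * $maximumCrawlCount urls have been visited.
 * Hypothetical helper for illustration; not part of spatie/crawler.
 *
 * @param array<string, string[]> $links url => urls linked from that page
 * @return string[] the urls visited, in crawl order
 */
function crawlWithLimit(array $links, string $startUrl, int $maximumCrawlCount): array
{
    $queue = [$startUrl];
    $seen = [$startUrl => true];
    $crawled = [];

    while ($queue !== [] && count($crawled) < $maximumCrawlCount) {
        $url = array_shift($queue);
        $crawled[] = $url;

        // Queue every link found on this page that has not been seen yet.
        foreach ($links[$url] ?? [] as $next) {
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }

    return $crawled;
}

// A made-up site with four pages.
$links = [
    '/' => ['/a', '/b'],
    '/a' => ['/c', '/'],
    '/b' => ['/c'],
    '/c' => [],
];

$visited = crawlWithLimit($links, '/', 3);
```

Here `$visited` holds three urls even though the made-up site has four pages; raising the limit above four lets the walk finish on its own once the queue is empty.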
By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler you can use the
When crawling a site the crawler will put urls to be crawled in a queue. By default this queue is stored in memory using the built in
When a site is very large you may want to store that queue elsewhere, maybe a database. In such cases you can write your own crawl queue.
A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueue\CrawlQueue interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler.
    Crawler::create()
        ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueue\CrawlQueue>)
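As a rough idea of what storing the queue in a database could look like, here is a minimal sketch backed by SQLite through PDO. It is an assumption-laden illustration: the class `SqliteUrlQueue`, its method names and the `crawl_queue` table are all hypothetical, and the class does not implement the package's CrawlQueue interface, whose methods are not listed in this document. A real custom queue would have to implement that interface's actual methods.

```php
<?php

/**
 * Hypothetical database-backed url queue, for illustration only.
 * Requires the pdo_sqlite extension; not part of spatie/crawler.
 */
final class SqliteUrlQueue
{
    /** @var PDO */
    private $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
        $this->pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $this->pdo->exec(
            'CREATE TABLE IF NOT EXISTS crawl_queue (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT NOT NULL UNIQUE,
                processed INTEGER NOT NULL DEFAULT 0
            )'
        );
    }

    /** Add a url; urls that are already queued are silently ignored. */
    public function add(string $url): void
    {
        $statement = $this->pdo->prepare(
            'INSERT OR IGNORE INTO crawl_queue (url) VALUES (:url)'
        );
        $statement->execute(['url' => $url]);
    }

    /** Return the oldest url that has not been processed, or null if none is left. */
    public function nextPending(): ?string
    {
        $url = $this->pdo
            ->query('SELECT url FROM crawl_queue WHERE processed = 0 ORDER BY id LIMIT 1')
            ->fetchColumn();

        return $url === false ? null : (string) $url;
    }

    public function markAsProcessed(string $url): void
    {
        $statement = $this->pdo->prepare(
            'UPDATE crawl_queue SET processed = 1 WHERE url = :url'
        );
        $statement->execute(['url' => $url]);
    }
}

// An in-memory database keeps the example self-contained;
// a file path or another PDO driver would persist the queue.
$queue = new SqliteUrlQueue(new PDO('sqlite::memory:'));
$queue->add('https://example.com/');
$queue->add('https://example.com/about');
$queue->add('https://example.com/'); // duplicate, ignored
```

Keeping a `processed` flag rather than deleting rows means an already crawled url is never queued a second time, because the UNIQUE constraint still sees it.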
Please see CHANGELOG for more information on what has changed recently.
Please see CONTRIBUTING for details.
To run the tests you'll have to start the included node based server first in a separate terminal window.
    cd tests/server
    npm install
    ./start_server.sh
With the server running, you can start testing.
If you discover any security related issues, please email email@example.com instead of using the issue tracker.
You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.
Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.
We publish all received postcards on our company website.
Does your business depend on our contributions? Reach out and support us on Patreon. All pledges will be dedicated to allocating workforce on maintenance and new awesome stuff.
The MIT License (MIT). Please see License File for more information.