ronappleton/webcrawler

Web crawler for crawling indexed sites, i.e. options and indexes.



README

Simple web crawler for retrieving site links

This is a simple web crawler package, designed to take a website and extract the files it can find from the HTML the site provides.

By default, crawling is restricted to the source domain; this can be altered using the restrict_domain option of the crawl method, as sketched below.
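The package's wider API is not documented here, so the following is a minimal sketch only: the crawl method and the restrict_domain option are mentioned above, but the Crawler class name, its namespace, and the option format are assumptions made for illustration.

    <?php

    // Hypothetical class and namespace; only crawl() and the
    // restrict_domain option are confirmed by this README.
    use RonAppleton\WebCrawler\Crawler;

    require 'vendor/autoload.php';

    $crawler = new Crawler();

    // Default behaviour: the crawl stays on the source domain.
    $crawler->crawl('https://example.com');

    // Assumed option format for following links off the source domain.
    $crawler->crawl('https://example.com', ['restrict_domain' => false]);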

It was built to handle known self-linking sites, although I will add controls to prevent external crawling when required.

It is simple to use, and solves some of the issues other people have had trying to build simple crawlers.

Supported

  • Scanning and retrieving web pages.
  • Reading and extracting all links in a web page.
  • Deducing whether a link points to another directory or to a file.
  • Storing file and directory locations (web locations).
  • Handling relative and absolute URLs.
  • Timing crawls.
  • Providing a minimal count statistic.
  • Exporting the collected data as an array.
  • Exporting the collected data as JSON (see the sketch after this list).
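The export, timing, and count features above suggest a usage pattern like the following; the accessor names here (toArray, toJson) are illustrative guesses, not confirmed API.

    <?php

    use RonAppleton\WebCrawler\Crawler; // hypothetical name, as above

    require 'vendor/autoload.php';

    $crawler = new Crawler();
    $crawler->crawl('https://example.com');

    // Assumed accessors for the exports listed above.
    $links = $crawler->toArray(); // collected locations as a PHP array
    $json  = $crawler->toJson();  // the same data encoded as JSON

    echo count($links) . " locations found\n";
    file_put_contents('crawl-results.json', $json);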

Warning

Use this at your own risk. Please don't crawl sites whose owners are not expecting it; the risk is all yours.

Simple Test Script

A simple script for testing is included.