ronappleton/webcrawler

Web crawler for crawling indexed sites, i.e. options and indexes.



README

Simple web crawler for retrieving site links

This is a simple web crawler package, designed to take a website and extract the files it can find from the HTML the site provides.

By default, crawling is restricted to the source domain; this can be altered using the restrict_domain option of the crawl method, as sketched below.
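The package's wider API is not documented here, so the following is a minimal sketch only: the crawl method and the restrict_domain option are mentioned above, but the Crawler class name, its namespace, and the option format are assumptions made for illustration.

    <?php

    // Hypothetical class and namespace; only crawl() and the
    // restrict_domain option are confirmed by this README.
    use RonAppleton\WebCrawler\Crawler;

    require 'vendor/autoload.php';

    $crawler = new Crawler();

    // Default behaviour: the crawl stays on the source domain.
    $crawler->crawl('https://example.com');

    // Assumed option format for following links off the source domain.
    $crawler->crawl('https://example.com', ['restrict_domain' => false]);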

It was built to handle known self-linking sites, although I will add controls to prevent external crawling when required.

It is simple to use, and solves some of the issues other people have had trying to build simple crawlers.

Supported

  • Scanning and retrieving web pages.
  • Reading and extracting all links in a web page.
  • Deducing whether a link points to another directory or to a file.
  • Storing file and directory locations (web locations).
  • Handling relative and absolute URLs.
  • Timing crawls.
  • Providing a minimal count statistic.
  • Exporting the collected data as an array.
  • Exporting the collected data as JSON (see the sketch after this list).
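The export, timing, and count features above suggest a usage pattern like the following; the accessor names here (toArray, toJson) are illustrative guesses, not confirmed API.

    <?php

    use RonAppleton\WebCrawler\Crawler; // hypothetical name, as above

    require 'vendor/autoload.php';

    $crawler = new Crawler();
    $crawler->crawl('https://example.com');

    // Assumed accessors for the exports listed above.
    $links = $crawler->toArray(); // collected locations as a PHP array
    $json  = $crawler->toJson();  // the same data encoded as JSON

    echo count($links) . " locations found\n";
    file_put_contents('crawl-results.json', $json);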

Warning

Use this at your own risk. Please don't crawl sites whose owners are not expecting it; the risk is all yours.

Simple Test Script

A simple script for testing is included.