piedweb/crawler

A web crawler to check a few SEO basics.



CLI SEO Pocket Crawler


Use the collected data in your favorite spreadsheet software or retrieve them via your favorite language.
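For example, the exported CSV can be sliced directly from the command line. The column layout below is a hypothetical sample, not the crawler's actual schema; adapt the column index to your own data.csv:

```shell
# Hypothetical sample with the same shape as a crawler export:
# a header row followed by one line per crawled URL.
printf 'url,status\nhttps://piedweb.com/,200\nhttps://piedweb.com/seo,200\n' > sample.csv

# Print the first column (here, the URL) of every data row.
awk -F, 'NR > 1 { print $1 }' sample.csv
```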

French documentation available: https://piedweb.com/seo/crawler

Install

Via Packagist

$ composer create-project piedweb/crawler

Usage

Crawler CLI

$ bin/console crawler:go $start

Arguments:

  start                            Define where the crawl starts. Eg: https://piedweb.com
                                   You can also specify an id from a previous crawl; other options will then be ignored.
                                   You can use `last` to resume the most recent (just stopped) crawl.

Options:

  -l, --limit=LIMIT                Define a depth limit. [default: 5]
  -i, --ignore=IGNORE              Virtual robots.txt to respect (a string or a URL).
  -u, --user-agent=USER-AGENT      Define the user agent used during the crawl. [default: "SEO Pocket Crawler - PiedWeb.com/seo/crawler"]
  -w, --wait=WAIT                  Time to wait between two requests, in microseconds (100000 = 0.1 s). [default: 100000]
  -c, --cache-method=CACHE-METHOD  Define the cache method used during the crawl. [default: 2]
  -r, --restart=RESTART            Restart a previous crawl: 1 = fresh restart, 2 = restart from cache.
  -h, --help                       Display this help message
  -q, --quiet                      Do not output any message
  -V, --version                    Display this application version
      --ansi                       Force ANSI output
      --no-ansi                    Disable ANSI output
  -n, --no-interaction             Do not ask any interactive question
  -v|vv|vvv, --verbose             Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
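Putting the options together, a typical invocation might look like this (the start URL is the example from the arguments above; the option values are purely illustrative):

```shell
# Crawl piedweb.com up to depth 3, with a custom user agent,
# waiting 0.2 s (200000 microseconds) between requests.
bin/console crawler:go https://piedweb.com --limit=3 --wait=200000 \
    --user-agent="My Crawler - example.com"
```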



Extract All External Links in 1s from a previous crawl

$ bin/console crawler:external $id [--host]
    --id
        id from a previous crawl
        You can use `last` to show external links from the last crawl.

    --host, -ho
        Flag to return only the hosts.
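A sketch of how this might be combined with standard tools, assuming a previous crawl exists:

```shell
# List external hosts from the most recent crawl,
# counting how often each host appears.
bin/console crawler:external last --host | sort | uniq -c | sort -rn
```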

Calculate Page Rank

This will update the previously generated data.csv. You can then explore your website with the PoC pagerank.html (served locally, e.g. with npx http-server -c-1 --port 3000).

$ bin/console crawler:pagerank $id
    --id
        id from a previous crawl
        You can use `last` to calculate the page rank from the last crawl.
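The two steps above can be chained, assuming a previous crawl exists:

```shell
# Recompute the page rank for the last crawl (updates its data.csv)...
bin/console crawler:pagerank last

# ...then serve the PoC visualizer locally and open pagerank.html.
npx http-server -c-1 --port 3000
```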

Testing

$ composer test

Todo

  • Better Links Harvesting and Recording (record context (list, nav, sentence...))
  • Transform the PoC (Page Rank Visualizer)
  • Complex Page Rank Calculator (with 301, canonical, nofollow, etc.)

Contributing

Please see contributing

Credits

License

The MIT License (MIT). Please see License File for more information.
