piedweb / crawler
Web Crawler to check a few SEO basics.
0.1.804
2024-10-20 15:43 UTC
Requires
- php: >=8.3
- league/csv: ^9.8
- piedweb/curl: *
- piedweb/extractor: *
- piedweb/text-analyzer: *
- symfony/console: ^6.4|^7
- voku/stringy: ^6.5
README
CLI Seo Pocket Crawler
Web Crawler to check a few SEO basics.
Use the collected data in your favorite spreadsheet software, or retrieve it programmatically in your favorite language (see the PHP sketch below).
French documentation available: https://piedweb.com/seo/crawler
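Since the crawl results end up in a CSV file (data.csv, see the Page Rank section below) and the project already depends on league/csv, reading them back from PHP is straightforward. A minimal sketch; the file path and the assumption of a header row are illustrative, adjust them to your actual crawl output:

<?php
// read-crawl.php - minimal sketch reading a crawl's data.csv with league/csv
require 'vendor/autoload.php';

use League\Csv\Reader;

// Assumption: data.csv sits next to this script and starts with a header row.
$csv = Reader::createFromPath('data.csv', 'r');
$csv->setHeaderOffset(0); // first row holds the column names

foreach ($csv->getRecords() as $record) {
    // each $record is an associative array keyed by the header row
    print_r($record);
}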
Install
Via Packagist
$ composer create-project piedweb/crawler
Usage
Crawler CLI
$ bin/console crawler:go $start
Arguments:
start Define where the crawl starts. E.g.: https://piedweb.com
You can specify an id from a previous crawl; other options will then be ignored.
You can use `last` to resume the most recent crawl (one that was just stopped).
Options:
-l, --limit=LIMIT Define a depth limit [default: 5]
-i, --ignore=IGNORE Virtual robots.txt to respect (can be a string or a URL).
-u, --user-agent=USER-AGENT Define the user agent used during the crawl. [default: "SEO Pocket Crawler - PiedWeb.com/seo/crawler"]
-w, --wait=WAIT Time to wait between two requests, in microseconds. Default: 100000 (0.1 s). [default: 100000]
-c, --cache-method=CACHE-METHOD Define the cache method used during the crawl. [default: 2]
-r, --restart=RESTART Restart a previous crawl. Values: 1 = fresh restart, 2 = restart from cache
-h, --help Display this help message
-q, --quiet Do not output any message
-V, --version Display this application version
--ansi Force ANSI output
--no-ansi Disable ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
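For example, a shallow crawl that waits 0.2 s between requests (the URL and the option values are only illustrative):
$ bin/console crawler:go https://piedweb.com --limit=3 --wait=200000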
Extract All External Links in 1s from a previous crawl
$ bin/console crawler:external $id [--host]
--id
id from a previous crawl
You can use `last` to show external links from the last crawl.
--host (-ho)
Flag to return only the hosts.
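For example, to list only the external hosts found by the most recent crawl:
$ bin/console crawler:external last --host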
Calculate Page Rank
Will update the previously generated data.csv. You can then explore your website with the PoC pagerank.html (serve it locally, e.g. npx http-server -c-1 --port 3000).
$ bin/console crawler:pagerank $id
--id
id from a previous crawl
You can use `last` to calculate the page rank from the last crawl.
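For example, to compute the page rank for the most recent crawl and then browse the visualizer (run the server from the folder containing pagerank.html):
$ bin/console crawler:pagerank last
$ npx http-server -c-1 --port 3000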
Testing
$ composer test
Todo
- Better link harvesting and recording (record the context: list, nav, sentence...)
- Transform the PoC (Page Rank Visualizer)
- Complex Page Rank Calculator (with 301, canonical, nofollow, etc.)
Contributing
Please see the contributing guidelines.
Credits
License
The MIT License (MIT). Please see License File for more information.