Web crawler to check a few SEO basics.

CLI SEO Pocket Crawler

Open the collected data in your favorite spreadsheet software or process it with your favorite programming language.
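As a rough illustration of processing the collected data programmatically, here is a minimal Python sketch. It assumes the export is a comma-separated file with a header row. The file path and any column name you pass in are placeholders, not names taken from the crawler.

```python
import csv
from pathlib import Path


def load_rows(csv_path):
    """Read a CSV export into a list of dicts keyed by the header row.

    Assumption: the file is comma-separated and UTF-8 encoded.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count how many rows share each value of `column`.

    `column` is whatever header name exists in your export (hypothetical here).
    Rows that lack the column are counted under the empty string.
    """
    counts = {}
    for row in rows:
        value = row.get(column, "")
        counts[value] = counts.get(value, 0) + 1
    return counts
```

Calling `count_by(load_rows(path), column)` with a path and a header name from your own export gives a quick per-value tally without opening a spreadsheet.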

French documentation is available at https://piedweb.com/seo/crawler


Via Packagist

$ composer create-project piedweb/crawler


Crawler CLI

$ bin/console crawler:go $start


  start                            Where the crawl starts, e.g. https://piedweb.com
                                   You can also pass the id of a previous crawl; the other options are then ignored.
                                   Use `last` to continue the last crawl (one that was just stopped).


  -l, --limit=LIMIT                Maximum crawl depth. [default: 5]
  -i, --ignore=IGNORE              Virtual robots.txt to respect (can be a string or a URL).
  -u, --user-agent=USER-AGENT      Define the user-agent used during the crawl. [default: "SEO Pocket Crawler - PiedWeb.com/seo/crawler"]
  -w, --wait=WAIT                  Time to wait between two requests, in microseconds (0.1 s by default). [default: 100000]
  -c, --cache-method=CACHE-METHOD  Cache method to use. [default: 2]
  -r, --restart=RESTART            Restart a previous crawl. Values: 1 = fresh restart, 2 = restart from cache
  -h, --help                       Display this help message
  -q, --quiet                      Do not output any message
  -V, --version                    Display this application version
      --ansi                       Force ANSI output
      --no-ansi                    Disable ANSI output
  -n, --no-interaction             Do not ask any interactive question
  -v|vv|vvv, --verbose             Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
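To make the idea of a "virtual robots.txt" passed as a string more concrete, the sketch below uses Python's standard `urllib.robotparser` to evaluate rules that never lived on the target server. It only illustrates the concept; it is not how the crawler itself handles `--ignore`, and the rules and URLs are made up.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse robots.txt rules given as a plain string (no HTTP request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(rules, url, user_agent="*"):
    """Return True when the parsed rules let `user_agent` fetch `url`."""
    return rules.can_fetch(user_agent, url)


# Hypothetical rules: block everything under /private/ for every user agent.
VIRTUAL_ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

With these rules, `is_allowed(build_rules(VIRTUAL_ROBOTS), "https://example.com/private/page")` is `False`, while the same check on `https://example.com/blog/` is `True`.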

Extract All External Links in 1s from a previous crawl

$ bin/console crawler:external $id [--host]
        The id of a previous crawl.
        You can use `last` to show external links from the last crawl.

    --host -ho
        Flag to output only the hosts of the external links.
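If you prefer to reduce a list of external links to hosts yourself, a minimal Python sketch could look like the following. It assumes one absolute URL per entry. The deduplication and ordering shown here are choices made for the example, not a description of what `--host` does internally.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return the distinct hosts found in `urls`, in order of first appearance."""
    hosts = []
    seen = set()
    for url in urls:
        host = urlsplit(url).hostname  # None for relative or malformed URLs
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts
```

Entries without a host, such as relative paths, are skipped rather than raising an error.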

Calculate Page Rank

Updates the previously generated data.csv. You can then explore your website with the PoC pagerank.html, served from a local server (for example, npx http-server -c-1 --port 3000).

$ bin/console crawler:pagerank $id
        id from a previous crawl
        You can use `last` to calculate page rank from the last crawl.


$ composer test


  • Better Links Harvesting and Recording (record context (list, nav, sentence...))
  • Transform the PoC (Page Rank Visualizer)
  • Complex Page Rank Calculator (with 301, canonical, nofollow, etc.)


Please see contributing



The MIT License (MIT). Please see License File for more information.

