piedweb / crawler
Web Crawler to check a few SEO basics.
0.1.804
2024-10-20 15:43 UTC
Requires
- php: >=8.3
- league/csv: ^9.8
- piedweb/curl: *
- piedweb/extractor: *
- piedweb/text-analyzer: *
- symfony/console: ^6.4|^7
- voku/stringy: ^6.5
README
CLI Seo Pocket Crawler
Web Crawler to check a few SEO basics.
Use the collected data in your favorite spreadsheet software, or retrieve it programmatically in your favorite language (see the PHP sketch below).
French documentation available: https://piedweb.com/seo/crawler
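Since the crawl results end up in a CSV file (data.csv, see the Page Rank section below) and the project already depends on league/csv, reading them back from PHP is straightforward. A minimal sketch; the file path and the assumption of a header row are illustrative, adjust them to your actual crawl output:

<?php
// read-crawl.php - minimal sketch reading a crawl's data.csv with league/csv
require 'vendor/autoload.php';

use League\Csv\Reader;

// Assumption: data.csv sits next to this script and starts with a header row.
$csv = Reader::createFromPath('data.csv', 'r');
$csv->setHeaderOffset(0); // first row holds the column names

foreach ($csv->getRecords() as $record) {
    // each $record is an associative array keyed by the header row
    print_r($record);
}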
Install
Via Packagist
$ composer create-project piedweb/crawler
Usage
Crawler CLI
$ bin/console crawler:go $start
Arguments:
start Define where the crawl starts. E.g.: https://piedweb.com
You can specify an id from a previous crawl; other options will then be ignored.
You can use `last` to resume the most recent crawl (one that was just stopped).
Options:
-l, --limit=LIMIT Define a depth limit [default: 5]
-i, --ignore=IGNORE Virtual robots.txt to respect (can be a string or a URL).
-u, --user-agent=USER-AGENT Define the user agent used during the crawl. [default: "SEO Pocket Crawler - PiedWeb.com/seo/crawler"]
-w, --wait=WAIT Time to wait between two requests, in microseconds. Default: 100000 (0.1 s). [default: 100000]
-c, --cache-method=CACHE-METHOD Define the cache method used during the crawl. [default: 2]
-r, --restart=RESTART Restart a previous crawl. Values: 1 = fresh restart, 2 = restart from cache
-h, --help Display this help message
-q, --quiet Do not output any message
-V, --version Display this application version
--ansi Force ANSI output
--no-ansi Disable ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
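For example, a shallow crawl that waits 0.2 s between requests (the URL and the option values are only illustrative):
$ bin/console crawler:go https://piedweb.com --limit=3 --wait=200000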
Extract All External Links in 1s from a previous crawl
$ bin/console crawler:external $id [--host]
--id
id from a previous crawl
You can use `last` to show external links from the last crawl.
--host (-ho)
Flag to return only the hosts.
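For example, to list only the external hosts found by the most recent crawl:
$ bin/console crawler:external last --host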
Calculate Page Rank
Will update the previously generated data.csv. You can then explore your website with the PoC pagerank.html (serve it locally, e.g. npx http-server -c-1 --port 3000).
$ bin/console crawler:pagerank $id
--id
id from a previous crawl
You can use `last` to calculate the page rank from the last crawl.
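For example, to compute the page rank for the most recent crawl and then browse the visualizer (run the server from the folder containing pagerank.html):
$ bin/console crawler:pagerank last
$ npx http-server -c-1 --port 3000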
Testing
$ composer test
Todo
- Better link harvesting and recording (record the context: list, nav, sentence...)
- Transform the PoC (Page Rank Visualizer)
- Complex Page Rank Calculator (with 301, canonical, nofollow, etc.)
Contributing
Please see the contributing guidelines.
Credits
License
The MIT License (MIT). Please see License File for more information.