languagewire / html-dumper
A library which downloads pages as static HTML files and assets and dumps them on disk
Requires
- php: >=7.2
- ext-dom: *
- guzzlehttp/guzzle: ^6.5 || ^7.5
Requires (Dev)
- monolog/monolog: ^2.8
- phpspec/prophecy-phpunit: ^2.0
- phpstan/phpstan: ^1.8
- phpunit/phpcov: ^8.2
- phpunit/phpunit: ^9.5
- squizlabs/php_codesniffer: ^4.0
README
HtmlDumper is a PHP library which downloads a copy of an HTML page and its assets into a target directory.
- Downloads HTML source code and transforms all URIs into relative paths, creating an updated index.html file.
- Parses HTML and fetches relevant resources:
  - Stylesheets, scripts, images, videos
  - Also works with assets located within CSS files.
- Removes anchor links to external pages.
- Does not crawl pages beyond the initial URL.
// Assumes Composer's autoloader has already been loaded, e.g. require 'vendor/autoload.php';
$url = "https://example.com";
$targetDirectory = "/tmp/htmldump";

$downloader = new \LanguageWire\HtmlDumper\Service\PageDownloader();

if ($downloader->download($url, $targetDirectory)) {
    echo "Successfully downloaded $url to $targetDirectory";
}
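To picture what "transforms all URIs into relative paths" in the feature list means, here is a minimal sketch using only PHP's DOM extension. It is not HtmlDumper's actual implementation: the function name `rewriteAssetUris()`, the `assets` directory name, and the choice to flatten every asset to its base file name are all assumptions made for illustration.

```php
<?php
// Illustrative sketch only, NOT the library's code.
// rewriteAssetUris() and the "assets" directory are hypothetical.

function rewriteAssetUris(string $html, string $assetDir = 'assets'): string
{
    // Collect libxml warnings (HTML5 tags, sloppy markup) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tag name => attribute that holds the asset URI.
    $targets = ['img' => 'src', 'script' => 'src', 'link' => 'href'];

    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            // Only treat <link rel="stylesheet"> as an asset.
            if ($tag === 'link' && strtolower($node->getAttribute('rel')) !== 'stylesheet') {
                continue;
            }

            // Drop scheme, host and query string; keep only the path.
            $path = parse_url($node->getAttribute($attr), PHP_URL_PATH);
            if (!is_string($path)) {
                continue;
            }

            $fileName = basename($path);
            if ($fileName === '') {
                continue;
            }

            // Point the attribute at the local copy, relative to index.html.
            $node->setAttribute($attr, $assetDir . '/' . $fileName);
        }
    }

    return $doc->saveHTML();
}
```

Flattening to the base file name keeps the sketch short, but two assets with the same file name in different remote folders would collide. A real tool needs a collision-free naming scheme and also has to download each asset to the path it writes into the HTML.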
Requirements
- PHP 7.2+
- PHP DOM Extension
- Composer
Installation
The recommended way to install HtmlDumper is through Composer.
composer require languagewire/html-dumper
Development
The build/ folder contains a Dockerfile that sets up all dependencies needed for local development and runs the unit tests and linters.
Customize build/.env like this:
cd build
cp .env.template .env
nano .env
Then run ./build.sh within the build/ folder:
cd build
./build.sh
License
HtmlDumper is made available under the MIT License (MIT). Please see the LICENSE file for more information.