languagewire/html-dumper

A library which downloads pages as static HTML files and assets and dumps them on disk

1.0.1 2022-11-18 15:10 UTC

This package is auto-updated.

Last update: 2024-04-21 11:09:27 UTC


README


HtmlDumper is a PHP library which downloads a copy of an HTML page and its assets into a target directory.

  • Downloads HTML source code and transforms all URIs into relative paths, creating an updated index.html file.
  • Parses the HTML and fetches the resources it references:
    • Stylesheets, scripts, images, and videos.
    • Assets referenced from within CSS files.
  • Removes anchor links to external pages.
  • Does not crawl pages beyond the initial URL.

Basic usage:

$url = "https://example.com";
$targetDirectory = "/tmp/htmldump";

$downloader = new \LanguageWire\HtmlDumper\Service\PageDownloader();
if ($downloader->download($url, $targetDirectory)) {
    echo "Successfully downloaded $url to $targetDirectory";
}
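
The feature list mentions that assets referenced from CSS files are fetched and that URIs are rewritten to relative paths. The sketch below illustrates one way those two steps could work. It is not taken from HtmlDumper's source: the helper names `extractCssUrls` and `localPathFor`, as well as the `assets/<host>/...` directory layout, are assumptions made purely for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of HtmlDumper's API): collect the targets
 * of url(...) references found in a CSS string.
 */
function extractCssUrls(string $css): array
{
    // Matches url(foo), url('foo') and url("foo"), with optional whitespace.
    preg_match_all('/url\(\s*[\'"]?([^\'")]+?)[\'"]?\s*\)/i', $css, $matches);

    // Skip inline data: URIs, since there is nothing to download for them.
    $urls = array_filter($matches[1], static function (string $u): bool {
        return stripos($u, 'data:') !== 0;
    });

    // Drop duplicates and reindex the result.
    return array_values(array_unique($urls));
}

/**
 * Hypothetical helper (not part of HtmlDumper's API): map an asset URL to a
 * relative path on disk. The "assets/<host>/<path>" layout is an assumption.
 */
function localPathFor(string $assetUrl): string
{
    $path = parse_url($assetUrl, PHP_URL_PATH);

    // URLs without a file name (no path, or a trailing slash) get index.html.
    if (!is_string($path) || $path === '' || substr($path, -1) === '/') {
        $path = ($path ?: '/') . 'index.html';
    }

    $host = parse_url($assetUrl, PHP_URL_HOST);

    // The query string is intentionally ignored when building the file name.
    return 'assets/' . ($host ?: 'local') . $path;
}

$css = 'body { background: url("https://cdn.example.com/img/bg.png?v=2"); }
@font-face { src: url( https://cdn.example.com/fonts/a.woff ); }
.icon { background: url(data:image/png;base64,AAAA); }';

foreach (extractCssUrls($css) as $assetUrl) {
    echo $assetUrl, ' -> ', localPathFor($assetUrl), PHP_EOL;
}
```

A regular expression is enough for a sketch like this, but it will miss edge cases such as escaped parentheses or quoted data URIs; a real implementation would more likely rely on a proper CSS parser.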

Requirements

Installation

The recommended way to install HtmlDumper is through Composer.

composer require languagewire/html-dumper

Development

The build/ folder contains a Dockerfile which sets up all dependencies needed for local development and runs the unit tests and linters.

Customize build/.env like this:

cd build
cp .env.template .env
nano .env

Then run ./build.sh from within the build/ folder:

cd build
./build.sh

License

HtmlDumper is made available under the MIT License (MIT). Please see the LICENSE file for more information.