futureplc / html-dom-document
A drop-in replacement for DOMDocument that handles HTML5 documents.
Requires
- php: ^8.3
- ext-dom: *
- symfony/css-selector: ^7.0
Requires (Dev)
- ext-libxml: *
- friendsofphp/php-cs-fixer: ^3.21.1
- phpunit/phpunit: ^11.3.1
- spatie/ray: ^1.28
This package is auto-updated.
Last update: 2024-11-02 06:46:53 UTC
README
The HTMLDocument package has one primary purpose: to act as a stand-in replacement for the core DOMDocument and related DOM classes that come with PHP.
⚠️ If you just need to crawl the DOM and not manipulate it in-place, consider using a package like the Symfony DOM Crawler component.
While PHP's built-in DOM classes are a great way to parse XML, they quickly fall apart when trying to parse modern HTML5 markup. This package makes HTML more intuitive to work with and handles some of the quirks behind the scenes.
This package provides a series of classes to replace the DOM ones in a backward-compatible fashion but with a tighter interface and additional utilities bundled in to make working with HTML a breeze. These classes will return instances of the equivalent HTML* class instead of the DOM* one:
- DOMDocument -> HTMLDocument
- DOMElement -> HTMLElement
- DOMNode -> HTMLElement
- DOMText -> HTMLText
- DOMNodeList -> HTMLNodeList
- DOMXPath -> HTMLXPath
Installation
You can install the package via Composer:
composer require futureplc/html-dom-document
Features
Sensible return values
There's nothing more annoying than having to check union types on every operation because of PHP's legacy of using falsey return types. We've sorted this by making sure there are sensible defaults:
- If a return value expects DOMNodeList or false, we'll return an empty DOMNodeList if there are no values to return
- If a return value could be a string or false, we'll either throw an exception on failure or return an empty string
- No more differentiating between DOMNode and DOMElement; we have a single HTMLElement class that handles all scenarios of the two combined
You'll notice this philosophy throughout the interface - if there's a sensible type to return, we'll ensure you get that instead of dealing with unions.
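To make the idea concrete, here is a minimal sketch of the same normalisation written against the native DOM classes only. The function names `queryNodes()` and `saveHtmlOrFail()` are hypothetical and are not part of this package; they only illustrate turning a "value or false" return into a single predictable type.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: runs an XPath query and always returns an array.
 */
function queryNodes(DOMXPath $xpath, string $expression): array
{
    // Native DOMXPath::query() returns false (and raises a warning) for an
    // invalid expression; "@" silences the warning so it can be normalised.
    $result = @$xpath->query($expression);

    return $result === false ? [] : iterator_to_array($result);
}

/**
 * Hypothetical helper: serialises a document, throwing instead of returning false.
 */
function saveHtmlOrFail(DOMDocument $doc): string
{
    $html = $doc->saveHTML();
    if ($html === false) {
        throw new RuntimeException('Could not serialise the document.');
    }

    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>Hello</p>');
$xpath = new DOMXPath($doc);

$paragraphs = queryNodes($xpath, '//p'); // one DOMElement
$invalid = queryNodes($xpath, '//[');    // invalid expression, so an empty array
$html = saveHtmlOrFail($doc);
```

Callers can then loop over the result or use the string directly without a `=== false` check on every call.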
Easily create HTML documents and elements
DOMDocument has a terse, antiquated interface that requires a lot of setup and repetition for even basic, commonly needed tasks such as creating a DOMElement from a plain HTML string.
All the old DOMDocument-style methods still work, so you can drop this package in as a replacement for existing DOMDocument implementations. However, we have added new ways to create HTML documents and elements without the verbosity usually required for some operations.
$dom = new HTMLDocument();
$dom->loadHTML($html);
$dom = HTMLDocument::fromHTML($html);
$dom = HTMLDocument::loadFromFile($filePath);
$element = HTMLElement::fromNode($domNode);
$element = HTMLElement::fromHTML($html);
$element = $dom->createElement('p', 'This is a paragraph.');
$element = $dom->createElementFromNode($domNode);
$element = $dom->createElementFromHTML('<p>This is a paragraph.</p>');
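For comparison, this is roughly the setup needed to get an element from an HTML string using only the native classes. It is a sketch of one common approach (parse into a scratch document, then import the node); `elementFromHtml()` is a hypothetical name, and this is not necessarily how the package does it internally.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: builds a DOMElement owned by $target from an HTML string.
 */
function elementFromHtml(DOMDocument $target, string $html): DOMElement
{
    $scratch = new DOMDocument();

    // Collect libxml errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so there is exactly one root, and stop libxml from
    // adding <html>/<body> wrappers or a doctype.
    $scratch->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $first = $scratch->documentElement->firstElementChild;
    if ($first === null) {
        throw new InvalidArgumentException('No element found in the HTML string.');
    }

    // Nodes belong to the document that created them, so import a deep copy.
    return $target->importNode($first, true);
}

$dom = new DOMDocument();
$element = elementFromHtml($dom, '<p>This is a paragraph.</p>');
```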
Additional behaviour to support HTML5
The majority of the custom behaviour to allow DOMDocument to parse any HTML string comes from a series of "middleware" classes that manipulate the HTML before it's loaded and before it's emitted as a plain HTML string again.
These middleware do various things, such as:
- Assuming HTML5 behaviour if no <!doctype> is present, by adding one
- Ignoring LibXML errors (as LibXML complains about certain HTML5 tags even though it can parse them properly)
- Treating <template> and <script> tags as verbatim so their contents aren't changed by the rest of the document
These will be enabled by default if you use the HTMLDocument class, but you can disable them as needed.
- Calling ->withoutMiddleware() without any arguments before loading the HTML will result in no middleware applying, essentially resulting in just the additional utility methods with none of the extra HTML5 support
- Calling ->withoutMiddleware(MiddlewareName::class), using the class name of a middleware, will disable that specific one
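As an illustration of the kind of pre-load transformation described above, here is a minimal sketch of the "add a doctype if none is present" step as a plain function. The name `ensureDoctype()` and its exact behaviour are assumptions for illustration; the package's actual middleware classes may work differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical pre-load step: prepend an HTML5 doctype when none is present.
 */
function ensureDoctype(string $html): string
{
    // Doctypes are case-insensitive and may be preceded by whitespace.
    if (preg_match('/^\s*<!doctype\b/i', $html) === 1) {
        return $html;
    }

    return "<!doctype html>\n" . $html;
}

$withoutDoctype = ensureDoctype('<p>Hello</p>');
$withDoctype = ensureDoctype('<!DOCTYPE html><p>Hello</p>');
```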
Getting a plain HTML string back out of DOMDocument can be tricky when you only need part of the document, such as a single element, so we have added some options to make it easier.
$html = (string) $dom; // Cast the HTMLDocument to a string
$html = $dom->saveHTML();
$html = (string) $element; // Cast the HTMLElement to a string
$html = $element->saveHTML();
$html = $element->getInnerHTML(); // Gets the HTML of the element without the wrapping node
$html = $element->getOuterHTML(); // Gets the HTML of the element with the wrapping node
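The difference between inner and outer HTML can be sketched with the native classes as follows. The helper names `outerHtml()` and `innerHtml()` are hypothetical and only show what the two notions mean; they are not the package's implementation.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: the node itself, including its own tag. */
function outerHtml(DOMNode $node): string
{
    return (string) $node->ownerDocument->saveHTML($node);
}

/** Hypothetical helper: only the node's children, without the wrapping tag. */
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= (string) $node->ownerDocument->saveHTML($child);
    }

    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p>Hello <b>world</b></p></div>');
$p = $doc->getElementsByTagName('p')->item(0);

$outer = outerHtml($p); // the <p> element with its tags
$inner = innerHtml($p); // just the text and the <b> element
```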
Check if HTML5
If you need to know whether you're working with an HTML5 document or not, the isHTML5() method will tell you.
$dom->isHTML5(); // true
Void elements
If working with HTML5, you may want to know if a given node is a "void element", meaning it needs no closing tag. This can be checked with the isVoidElement() method.
$element->isVoidElement(); // true
Normally when saving the HTML, DOMDocument would output void elements as <example></example>, but this package will output them as <example>, even for custom elements, maintaining how they were input originally.
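For reference, the HTML standard defines a fixed set of void elements. The sketch below checks a tag name against that set; `isVoidTag()` is a hypothetical helper, and it deliberately does not cover the package's additional behaviour of remembering custom elements that were void in the input.

```php
<?php
declare(strict_types=1);

// Void elements as listed in the HTML standard.
const VOID_TAGS = [
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
];

/** Hypothetical helper: true if the tag name is a standard void element. */
function isVoidTag(string $tagName): bool
{
    // HTML tag names are case-insensitive.
    return in_array(strtolower($tagName), VOID_TAGS, true);
}

$brIsVoid = isVoidTag('BR');
$divIsVoid = isVoidTag('div');
```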
Working with attributes
The HTMLElement class has a series of methods to help you work with attributes on elements.
$element->getAttributes(); // Returns an array of all attributes
$element->getAttribute('class'); // Returns the value of the class attribute
$element->setAttribute('class', 'foo'); // Sets the class attribute to "foo"
$element->addAttribute('class', 'foo'); // Adds the "foo" value as a space-separated value to the class attribute, appending it if the attribute already exists
$element->removeAttribute('rel'); // Removes the rel attribute entirely
$element->removeAttribute('rel', 'noreferrer'); // Removes the "noreferrer" value from the rel attribute if it exists - if the attribute is now empty, it will be removed entirely
$element->toggleAttribute('checked'); // Toggles the "checked" attribute
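The space-separated add and remove behaviour described in the comments above can be sketched on a native DOMElement like this. `addAttributeValue()` and `removeAttributeValue()` are hypothetical names, and skipping duplicate values is an assumption rather than documented package behaviour.

```php
<?php
declare(strict_types=1);

/** Splits an attribute value into its whitespace-separated tokens. */
function attributeTokens(DOMElement $element, string $name): array
{
    return preg_split('/\s+/', $element->getAttribute($name), -1, PREG_SPLIT_NO_EMPTY);
}

/** Hypothetical helper: append a token, creating the attribute if needed. */
function addAttributeValue(DOMElement $element, string $name, string $value): void
{
    $tokens = attributeTokens($element, $name);
    if (!in_array($value, $tokens, true)) { // assumption: no duplicates
        $tokens[] = $value;
    }
    $element->setAttribute($name, implode(' ', $tokens));
}

/** Hypothetical helper: remove a token, dropping the attribute once it is empty. */
function removeAttributeValue(DOMElement $element, string $name, string $value): void
{
    $tokens = array_values(array_diff(attributeTokens($element, $name), [$value]));
    if ($tokens === []) {
        $element->removeAttribute($name);
        return;
    }
    $element->setAttribute($name, implode(' ', $tokens));
}

$doc = new DOMDocument();
$link = $doc->createElement('a');

addAttributeValue($link, 'rel', 'noopener');
addAttributeValue($link, 'rel', 'noreferrer');
$afterAdd = $link->getAttribute('rel');

removeAttributeValue($link, 'rel', 'noreferrer');
$afterRemove = $link->getAttribute('rel');

removeAttributeValue($link, 'rel', 'noopener');
$stillHasRel = $link->hasAttribute('rel');
```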
As we often work with CSS classes in HTML, there are also some methods to help with this.
$element->getClassList(); // Returns an array of CSS classes
$element->setClassList(['foo', 'bar']); // Sets the CSS classes
$element->hasClass('foo'); // Returns true if the element has the class "foo"
$element->addClass('baz'); // Adds the class "baz"
$element->removeClass('bar'); // Removes the class "bar"
Removing parts of a document
There are some helpful utilities for quickly removing parts of a document as required.
$element->withoutSelector('p'); // Removes all child `<p>` elements
$element->withoutComments(); // Removes all HTML comments
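To show what removing comments involves underneath, here is a minimal sketch using only native DOMXPath. The function name `removeComments()` is hypothetical, and this is an illustration of the technique rather than the package's code.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: removes every comment node from the document. */
function removeComments(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);

    // Copy the matches to an array first so removals cannot disturb iteration.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        $comment->parentNode->removeChild($comment);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div><!-- internal note --><p>Keep me</p></div>');
removeComments($doc);
$html = $doc->saveHTML();
```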
Utility methods
There are a couple of additional utility methods to help build attribute strings from PHP arrays.
Utility::attribute() will take a single key/value pair and turn it into an HTML attribute, regardless of whether the value is a string, array, or boolean. A boolean value can be used to conditionally add attributes.
Utility::attribute('class', ['foo', 'bar']); // class="foo bar"
Utility::attribute('id', 'baz'); // id="baz"
Utility::attribute('required', true); // required
Utility::attributes() will take this further by doing the same with an array of key/value pairs, turning them into a single HTML attribute string.
Utility::attributes([
    'class' => ['foo', 'bar'],
    'id' => 'baz',
    'required' => true,
    'checked' => false,
]); // class="foo bar" id="baz" required
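The rules described above (arrays joined with spaces, `true` rendered as a bare attribute name, `false` omitted) can be sketched in plain PHP as follows. `renderAttribute()` and `renderAttributes()` are hypothetical stand-ins, and the escaping via `htmlspecialchars()` is an assumption; the package's own Utility class may differ in detail.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for a single key/value pair. */
function renderAttribute(string $name, string|array|bool $value): string
{
    if ($value === false) {
        return '';    // false: leave the attribute out entirely
    }
    if ($value === true) {
        return $name; // true: bare boolean attribute
    }
    if (is_array($value)) {
        $value = implode(' ', $value);
    }

    return sprintf('%s="%s"', $name, htmlspecialchars($value, ENT_QUOTES));
}

/** Hypothetical stand-in for an array of key/value pairs. */
function renderAttributes(array $pairs): string
{
    $parts = [];
    foreach ($pairs as $name => $value) {
        $rendered = renderAttribute((string) $name, $value);
        if ($rendered !== '') {
            $parts[] = $rendered;
        }
    }

    return implode(' ', $parts);
}

$attributes = renderAttributes([
    'class' => ['foo', 'bar'],
    'id' => 'baz',
    'required' => true,
    'checked' => false,
]);
```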
Utility::nodeMapRecursive() gives the ability to run a callback on every node in a document, including all child nodes. You can use this callback to inspect the nodes, modify them, replace one node with another entirely, or remove them from the document.
This is also available on HTMLElement and HTMLDocument objects through the mapRecursive method.
$dom = HTMLDocument::fromHTML('<p><span>foo</span></p>');
// Make sure every element has a class of "bar"
$dom->mapRecursive(function ($node) {
    if ($node instanceof HTMLElement) {
        $node->setAttribute('class', 'bar');
    }
});
// <p class="bar"><span class="bar">foo</span></p>
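The traversal itself is a plain depth-first walk. Below is a minimal sketch with native DOM classes; `walk()` is a hypothetical name, and unlike the package's method it only visits nodes and does not handle replacement or removal through the callback's return value.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: calls $callback on $node and on every descendant. */
function walk(DOMNode $node, callable $callback): void
{
    $callback($node);

    if (!$node->hasChildNodes()) {
        return;
    }

    // Snapshot the children so changes made by the callback are safe.
    foreach (iterator_to_array($node->childNodes) as $child) {
        walk($child, $callback);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<p><span>foo</span></p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Make sure every element has a class of "bar"; text nodes are skipped.
walk($doc->documentElement, function (DOMNode $node): void {
    if ($node instanceof DOMElement) {
        $node->setAttribute('class', 'bar');
    }
});

$html = trim((string) $doc->saveHTML($doc->documentElement));
```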
Utility::countRootNodes() will tell you how many root nodes are in a document.
Utility::countRootNodes('<p>foo</p>'); // 1
Utility::countRootNodes('<p>foo</p><p>bar</p>'); // 2
If working with source HTML that contains multiple root nodes, you can use the Utility::wrap($html) and Utility::unwrap($html) methods to ensure a single root node or remove the root node, respectively.
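One way to count root nodes with the native classes is to wrap the fragment in a temporary element and count that element's children. This is a sketch under the assumption that only element nodes count as roots; `countRootElements()` is a hypothetical name and may not match the package's exact rules for text or comment nodes.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: counts the top-level elements in an HTML fragment. */
function countRootElements(string $html): int
{
    $doc = new DOMDocument();

    $previous = libxml_use_internal_errors(true);
    // The temporary <div> guarantees a single root for the parser.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->documentElement->childElementCount;
}

$single = countRootElements('<p>foo</p>');
$double = countRootElements('<p>foo</p><p>bar</p>');
```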
Working with CSS classes
The HTMLElement class has several methods to help you work with CSS classes.
$element->setClassList(['foo', 'bar']);
$element->getClassList(); // ['foo', 'bar']
$element->hasClass('foo'); // true
$element->addClass('baz'); // ['foo', 'bar', 'baz']
$element->removeClass('baz'); // ['foo', 'bar']
Toggling boolean attributes
When you need to toggle a boolean attribute on or off, the toggleAttribute() method is available.
$element = HTMLElement::fromHTML('<input type="checkbox">');
$element->toggleAttribute('checked'); // <input type="checkbox" checked>
$element->toggleAttribute('checked'); // <input type="checkbox">
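The toggle logic is small enough to sketch on a native DOMElement. `toggleBooleanAttribute()` is a hypothetical helper written for illustration; it returns whether the attribute is present after the call.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: flips a boolean attribute and reports the new state. */
function toggleBooleanAttribute(DOMElement $element, string $name): bool
{
    if ($element->hasAttribute($name)) {
        $element->removeAttribute($name);
        return false;
    }

    // Boolean attributes carry no meaningful value, so an empty one is enough.
    $element->setAttribute($name, '');
    return true;
}

$doc = new DOMDocument();
$input = $doc->createElement('input');
$input->setAttribute('type', 'checkbox');

$first = toggleBooleanAttribute($input, 'checked');  // attribute added
$second = toggleBooleanAttribute($input, 'checked'); // attribute removed
```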
Querying on CSS selectors and XPath
Most people working with HTML know how to use most CSS selectors, but many have never touched XPath. We've added handy querySelector() and querySelectorAll() methods to the HTMLDocument and HTMLElement classes, allowing you to use CSS selectors directly to get the needed elements, courtesy of the Symfony CSS Selector package.
$dom->querySelector('head > title'); // Returns the first `<title>` element
$dom->querySelectorAll('.foo'); // Returns all elements with the class `foo`
If you still need to work with XPath, there is a convenient xpath() method on both HTMLDocument and HTMLElement classes.
$dom->xpath('//a'); // Returns all `<a>` elements
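For readers less familiar with XPath, the two CSS selectors shown above correspond roughly to the hand-written expressions below, run here with native DOMXPath. These expressions are written by hand for illustration; the XPath that the Symfony CSS Selector component generates may look different.

```php
<?php
declare(strict_types=1);

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><head><title>Demo</title></head>'
    . '<body><p class="foo bar">a</p><p class="foobar">b</p></body></html>'
);
$xpath = new DOMXPath($doc);

// CSS "head > title": a <title> that is a direct child of <head>.
$titles = $xpath->query('//head/title');

// CSS ".foo": pad the class list with spaces so "foo" matches only as a
// whole token and not as a prefix of "foobar".
$fooElements = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]"
);
```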
Working with text nodes
Working with text nodes can be tricky if you ever want to change something in the text to another node entirely. The replaceTextWithNode() method on HTMLText lets you do just that.
This is particularly useful if you use the Utility::nodeMapRecursive() function, which will traverse through text nodes.
$textNode->replaceTextWithNode('example', HTMLElement::fromHTML('<strong>example</strong>'));
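The underlying technique is to split the text node around the match and swap the middle piece for the new node. Here is a minimal sketch with native DOMText::splitText(); `replaceTextWithElement()` is a hypothetical name, it replaces only the first occurrence, and it assumes single-byte (ASCII) text because splitText() counts characters while strpos() counts bytes.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: replaces the first occurrence of $needle with a node. */
function replaceTextWithElement(DOMText $text, string $needle, DOMElement $replacement): bool
{
    $offset = strpos($text->data, $needle); // assumption: ASCII text
    if ($offset === false) {
        return false;
    }

    // After the first split, $match starts exactly at the needle.
    $match = $text->splitText($offset);
    // After the second split, $match holds only the needle itself.
    $match->splitText(strlen($needle));
    $match->parentNode->replaceChild($replacement, $match);

    return true;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>an example here</p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$paragraph = $doc->documentElement;

$strong = $doc->createElement('strong', 'example');
$replaced = replaceTextWithElement($paragraph->firstChild, 'example', $strong);

$html = trim((string) $doc->saveHTML($paragraph));
```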
Other Notes
HTMLDocument also has some other benefits over DOMDocument:
- Tags with an XML-style namespace are maintained, whereas DOMDocument would typically keep only the last part of the tag name. This is useful when working with standards such as Edge Side Includes, which use markup such as <esi:include src="..." />
- Attributes starting with @ are maintained, whereas DOMDocument would typically remove them. This is useful when working with HTML that has Alpine.js or Vue.js markup such as <button @click="doSomething">
- Any void tags in the input HTML will also be output as void tags
Drawbacks
Because of all the extra checks and type conversions, this package is a bit slower than the native DOMDocument classes. However, the difference is negligible in most cases, and the benefits of the additional features and ease of use far outweigh the performance hit unless you are processing millions of large HTML documents at once.
Testing
composer test
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Please see CONTRIBUTING for details.
Security Vulnerabilities
Please review our security policy on how to report security vulnerabilities.
Credits
License
The MIT License (MIT). Please see License File for more information.