daniesy/dominator

A simple HTML parser for PHP.

v0.0.19 2025-05-21 11:16 UTC

This package is auto-updated.

Last update: 2025-05-21 11:17:31 UTC


README

A robust, fast, and fully-featured HTML5 parser and query engine for PHP. Parse, traverse, manipulate, and query HTML documents with ease, supporting all modern HTML5 features, error recovery, namespaces, and more.

Features

  • Full HTML5 parsing: Handles all standard HTML5 elements, void/self-closing tags, comments, CDATA, script/style content, and multiple doctype variants.
  • XML declaration support: Preserves and exports XML declarations (e.g., <?xml version="1.0" encoding="utf-8"?>) at the start of the document.
  • Error recovery: Gracefully parses malformed or broken HTML, just like browsers do.
  • Entity decoding: Decodes HTML entities in text and attributes.
  • Whitespace normalization: Optionally normalizes whitespace in text nodes.
  • Namespaces: Supports XML/HTML namespaces (e.g., svg:rect).
  • Attribute handling: Handles quoted/unquoted and boolean attributes.
  • Query engine: Powerful CSS-like selectors for finding nodes (call querySelectorAll, querySelector, or getElementsByTagName directly on any Node).
  • Node manipulation: Add, remove, set attributes, change text, and more (Node).
  • Performance: Optimized for large and deeply nested documents.
  • Comprehensive tests: Includes extensive tests for all features and edge cases.
  • Robust CSS parser: Parses CSS rules and at-rules (including nested and simple at-rules like @media and @font-face).
  • CSS inlining: Optionally inlines CSS styles as style attributes when exporting HTML (Node::toInlinedHtml).
  • Pretty-print and minify HTML: Export HTML as minified (default) or pretty-printed (indented, human-readable) with Node::toHtml(false).

Installation

Install via Composer:

composer require daniesy/dominator

Or include the src/ files directly in your project.

Usage

Basic Parsing

use Daniesy\DOMinator\DOMinator;

$html = '<?xml version="1.0" encoding="utf-8"?>\n<div id="main"><p>Hello <b>World</b></p></div>';
$root = DOMinator::read($html);

Traversing the DOM

foreach ($root->children as $child) {
    echo $child->tag; // e.g., 'div'
}

Querying with CSS Selectors (DOM-like)

// Find all elements with class="foo"
$nodes = $root->querySelectorAll('.foo');
// Attribute selectors:
// Exact match
$adminNodes = $root->querySelectorAll('[data-role="admin"]');
// Space-separated word match
$adminWordNodes = $root->querySelectorAll('[data-role~="admin"]');
// Substring match
$adminSubstringNodes = $root->querySelectorAll('[data-role*="admin"]');
// Attribute presence
$withPlaceholder = $root->querySelectorAll('[placeholder]');
// Comma-separated (OR) selectors
$iconLinks = $root->querySelectorAll('link[rel="shortcut icon"], link[rel="icon"]');

// Access by index using item() method
echo $nodes->item(0)->innerText;
// Get the number of nodes
echo $nodes->length;
// Iterate over nodes
foreach ($nodes as $node) {
    echo $node->innerText;
}

// Find the first <span> element
$span = $root->querySelector('span');
if ($span) {
    echo $span->innerText;
}

// Find all <div> elements (case-insensitive)
$divs = $root->getElementsByTagName('div');
// Access by index
echo $divs->item(0)->innerText;
// Iterate over nodes
foreach ($divs as $div) {
    echo $div->innerText;
}

Manipulating Nodes

$node = $nodes->item(0);
$node->setAttribute('id', 'new-id');
$node->innerText = 'Updated text';
$node->remove(); // Remove from parent

CSS Parsing and Selector Matching

use Daniesy\DOMinator\CssParser;

$css = '@media (max-width:600px) { body { background: #fff; } }\n@font-face { font-family: test; src: url(test.woff); }\ndiv.foo#bar { color: green; }';
$rules = CssParser::parse($css);
// $rules[0]['type'] === 'at' for @media, $rules[1]['type'] === 'at' for @font-face, $rules[2]['type'] === 'rule' for div.foo#bar

// Selector matching:
use Daniesy\DOMinator\Nodes\Node;
$node = new Node('div', ['class' => 'foo bar', 'id' => 'bar']);
CssParser::matches('div.foo#bar', $node); // true
CssParser::matches('.baz', $node); // false
  • CssParser::parse($css) parses a CSS string into an array of rules and at-rules (including nested and simple at-rules).
  • CssParser::matches($selector, $node) checks if a node matches a CSS selector (supports tag, class, id, compound, and descendant selectors).

Exporting Back to HTML

// Minified (default)
$html = $root->toHtml();
// Pretty-printed (indented, human-readable)
$prettyHtml = $root->toHtml(false);
// Inline CSS styles (simple selectors only)
$inlinedHtml = $root->toInlinedHtml();
$prettyInlinedHtml = $root->toInlinedHtml(false);

Handling Namespaces

$html = '<svg:rect width="100" height="100"/>';
$root = DOMinator::read($html);
$svg = $root->children->item(0); // Direct access to children array is still available
echo $svg->namespace; // 'svg'
echo $svg->tag;       // 'rect'

// Alternatively, you can use querySelector
$svg = $root->querySelector('svg\\:rect');
echo $svg->namespace; // 'svg'
echo $svg->tag;       // 'rect'

Parsing Options

  • DOMinator::read($html, $normalizeWhitespace = false, $preprocess = null)
    • Parses the given HTML string into a DOM tree.
    • Supports input with an XML declaration (e.g., <?xml ...?>).
    • New: Accepts an optional $preprocess callback. If provided, this function will be called with the HTML string before parsing. You can use this to preprocess, sanitize, or transform the HTML as needed.
    • Example:
      $root = DOMinator::read($html, false, function($input) {
          // Preprocess or sanitize $input
          return str_replace('foo', 'bar', $input);
      });
  • Node::toHtml($minify = true)
    • If $minify is false, outputs pretty-printed HTML with indentation and newlines.
    • If $minify is true (default), outputs minified HTML.
  • Node::toInlinedHtml($minify = true)
    • Inlines simple CSS rules from <style> tags as inline style attributes.
    • Supports only tag, class, and id selectors (no combinators or advanced selectors).
    • Removes <style> tags from the output.
    • Use $minify = false for pretty-printed output.

API Reference

DOMinator

  • DOMinator::read(string $html, bool $normalizeWhitespace = false): Node
    • Parses HTML and returns the root node.

Node

  • Properties:
    • tag: Tag name (e.g., 'div')
    • namespace: Namespace prefix (e.g., 'svg')
    • attributes: Associative array of attributes
    • children: Array of child nodes
    • innerText: Text content (for text, comment, or CDATA nodes)
    • isComment, isCdata: Node type flags
    • parent: Parent node
    • xmlDeclaration: XML declaration string if present (e.g., <?xml version="1.0"?>)
  • Methods:
    • setAttribute($name, $value): Set or update attribute
    • removeAttribute($name): Remove attribute
    • remove(): Remove node from parent
    • toHtml(): Export node and children as HTML
    • toInlinedHtml($minify = true): Export HTML with inlined CSS styles
    • querySelectorAll($selector): Returns a NodeList of matching nodes (CSS selector)
    • querySelector($selector): Returns the first matching node or null
    • getElementsByTagName($tag): Returns a NodeList of nodes with the given tag name (case-insensitive)

NodeList

  • Properties:
    • length: The number of nodes in the list
  • Methods:
    • item($index): Returns the node at the specified index, or null if the index is out of range
    • count(): Returns the number of nodes in the list (implements Countable)
    • getIterator(): Returns an iterator for the nodes in the list (implements IteratorAggregate)

Testing

Run all tests with PHPUnit:

vendor/bin/phpunit tests

Examples

See tests/DOMinatorTest.php, tests/NodeTest.php, and tests/QueryTest.php for comprehensive usage and edge cases.

License

MIT License. See LICENSE file.

Author: Daniesy

Contributions and issues welcome!