html_inspector/html_inspector

Fast HTML parser and resolver for Internationalized Resource Identifiers (IRI)

Package info

codeberg.org/Jumping-Beaver/HTML_Inspector_for_PHP

Type: php-ext

Ext name: ext-html_inspector

pkg:composer/html_inspector/html_inspector

1 2025-08-01 10:08 UTC

This package is not auto-updated.

Last update: 2026-03-27 16:32:05 UTC


README

These are PHP bindings for HTML Inspector.

Example

<?php

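// Prints the resolved target of every anchor, skipping same-document fragment links.
function extract_anchors(string $html_utf8, string $document_uri): void
{
    $doc = new HtmlInspector\HtmlDocument($html_utf8);

    // Look up the first <base> element in <head>; iterate() yields -1 if there is none.
    $base_node = $doc->select(0)->child()->name('html')->child()->name('head')->child()
        ->name('base')->iterate();
    $base = HtmlInspector\resolve_iri($doc->get_attribute($base_node, 'href'), $document_uri);
    // Fall back to the document URI if no base could be resolved.
    $base ??= $document_uri;

    // Select <a> descendants, excluding those whose href starts with '#'.
    $selector = $doc->select(0)->descendant()->name('a')->attribute_starts_with('href', '#')->not();
    while (($node_a = $selector->iterate()) !== -1) {
        $href = $doc->get_attribute($node_a, 'href');
        $uri = HtmlInspector\resolve_iri($href, $base);
        print("$uri\n");
    }
}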

Design decisions

PHP iterators are currently not implemented

I have gone back and forth on whether to implement PHP iterators for looping through nodes. The way PHP implements iterators is awkward. Firstly, two redundant implementations are needed: one to support looping with foreach and one to implement the Iterator interface. Secondly, the interface requires two methods, next (which has no return value) and current, instead of just one. As a result, we have to cache both the current value and the validity state of the iterator, and current conditionally has to perform one implicit iteration. Python is an example of a more elegant design: a single __next__ method both advances the iterator and returns the current value.

Another complication is how to encode the non-existence of a node. With PHP iterators, we would need to use the value false and add union type hints and a corresponding check to the get_* methods to allow a concise syntax. Without iterators, we can use the value -1 and pass it to the C functions without further checks.
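To make the trade-off concrete, here is a minimal userland sketch of what adapting a single-method iterate() API to PHP's Iterator interface involves. It is not part of the extension: FakeSelector and SelectorIterator are hypothetical names used only for illustration, FakeSelector merely imitates a selector by returning node numbers followed by -1, and the code assumes PHP 8.0 or later. The real bindings are written in C, so the actual effort would differ in detail.

```php
<?php

// Hypothetical stand-in for a selector: iterate() returns the next node
// number, or -1 once the selection is exhausted.
final class FakeSelector
{
    private int $pos = 0;

    /** @param int[] $nodes */
    public function __construct(private array $nodes) {}

    public function iterate(): int
    {
        return $this->nodes[$this->pos++] ?? -1;
    }
}

// Hypothetical adapter that exposes the single-method API through Iterator.
final class SelectorIterator implements Iterator
{
    private ?int $cached = null; // cached current node; null means "not fetched yet"
    private int $key = 0;

    public function __construct(private FakeSelector $selector) {}

    // Performs the implicit iteration on first access and caches the result,
    // so that valid() and current() agree on the same node.
    private function fetch(): int
    {
        return $this->cached ??= $this->selector->iterate();
    }

    public function current(): mixed { return $this->fetch(); }
    public function key(): mixed { return $this->key; }
    public function valid(): bool { return $this->fetch() !== -1; }
    public function rewind(): void {} // a forward-only selector cannot be rewound

    public function next(): void
    {
        $this->fetch();       // make sure the current node has been consumed
        $this->cached = null; // force a fresh iterate() on the next access
        $this->key++;
    }
}

// Direct loop with the -1 sentinel: one method, no cached state.
$selector = new FakeSelector([3, 7, 9]);
while (($node = $selector->iterate()) !== -1) {
    print("direct: $node\n");
}

// The same traversal via foreach needs the whole adapter above.
foreach (new SelectorIterator(new FakeSelector([3, 7, 9])) as $node) {
    print("foreach: $node\n");
}
```

Both loops visit the same nodes, but the adapter needs five methods and a cache to do what the while loop does with one call per step.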