sinnbeck/html-ast

Create an AST from an HTML string

dev-main 2025-04-14 11:14 UTC


README

An HTML AST (Abstract Syntax Tree) parser written in PHP.
Inspired by the AST parser in TempestPHP (by Brent Roose), this library provides three pieces: a lexer that tokenizes HTML strings, an AST parser that converts the tokens into a tree structure, and a printer that outputs well-formatted (indented) HTML.

Note: This package requires PHP 8.2 or higher.

Features

  • Built-in Lexer: Tokenizes raw HTML input.
  • AST Parser: Converts tokenized HTML into an Abstract Syntax Tree for easier analysis and manipulation.
  • HTML Printer: Renders the AST back into properly indented HTML code.

Requirements

  • PHP version 8.2 or later.
  • Composer (for installation via Packagist).

Installation

You can install html-ast via Composer. From your project root, run:

composer require sinnbeck/html-ast

Alternatively, if you prefer to clone the repository directly:

git clone https://github.com/sinnbeck/html-ast.git
cd html-ast
composer install

Usage

The package is organized into three main components: the Lexer, the AST Parser, and the Printer. Below are basic examples of how to use each.

Lexing

The lexer tokenizes an HTML string. Tokens represent the smallest meaningful elements of the HTML (such as tags, attributes, and text).

use Sinnbeck\HtmlAst\Lexer\Lexer;

// Provide your HTML string
$html = '<div class="container"><p>Hello, world!</p></div>';

// Create a Lexer instance from the string
$lexer = Lexer::fromString($html);

// Lex the HTML string into tokens
$tokens = $lexer->lex();

// Optionally, inspect the tokens:
print_r($tokens);
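
To make the idea of a token concrete, here is a minimal, self-contained sketch of a tokenizer. It is an illustration only and is not the Lexer shipped with this package: the function name toyTokenize and the token shapes ('open', 'close', 'text') are assumptions made for this example, and the real lexer's token format may differ.

```php
<?php

declare(strict_types=1);

// Illustration only: a toy tokenizer, NOT the html-ast Lexer.
// The function name and the token array shape are hypothetical.
function toyTokenize(string $html): array
{
    // Alternatives: closing tag | opening tag with optional attributes | run of text.
    $pattern = '~</([a-z][a-z0-9]*)>|<([a-z][a-z0-9]*)([^>]*)>|([^<]+)~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $m) {
        if (($m[1] ?? '') !== '') {
            // Group 1 is only filled by the closing-tag alternative.
            $tokens[] = ['type' => 'close', 'value' => $m[1]];
        } elseif (($m[2] ?? '') !== '') {
            // Groups 2 and 3 hold the tag name and the raw attribute string.
            $tokens[] = [
                'type' => 'open',
                'value' => $m[2],
                'attributes' => trim($m[3] ?? ''),
            ];
        } else {
            // Anything else is plain text between tags.
            $tokens[] = ['type' => 'text', 'value' => $m[4] ?? ''];
        }
    }

    return $tokens;
}

$toyTokens = toyTokenize('<div class="container"><p>Hello, world!</p></div>');
```

For the sample string this yields five tokens: two opening tags, one text run, and two closing tags. A real lexer also has to handle comments, self-closing tags, and quoted attribute values containing ">", which this sketch ignores.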

Parsing

The AST parser converts the token list into a tree structure, where each node represents an HTML element, text node, or comment.

use Sinnbeck\HtmlAst\Ast\Parser;

// Create an AST parser instance with the tokens from the lexer
$ast = Parser::make($tokens);

// Parse tokens into an AST (node tree)
$nodes = $ast->parse();

// Optionally, inspect the node tree:
print_r($nodes);
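
The step from a flat token list to a tree can be sketched with a stack of currently open elements. The code below is a conceptual illustration under stated assumptions, not the package's Parser: the ToyNode class, the toyParse function, and the [type, value] token pairs are all hypothetical names invented for this example.

```php
<?php

declare(strict_types=1);

// Illustration only: a toy tree builder, NOT the html-ast Parser.
// ToyNode, toyParse and the [type, value] token pairs are hypothetical.
final class ToyNode
{
    /** @var array<int, ToyNode|string> child elements and text runs, in order */
    public array $children = [];

    public function __construct(public readonly string $tag)
    {
    }
}

function toyParse(array $tokens): ToyNode
{
    $root = new ToyNode('#root');
    $open = [$root]; // stack of elements that have been opened but not yet closed

    foreach ($tokens as [$type, $value]) {
        $parent = end($open); // innermost open element

        if ($type === 'open') {
            $node = new ToyNode($value);
            $parent->children[] = $node;
            $open[] = $node; // later tokens become children of this node
        } elseif ($type === 'close') {
            if (count($open) > 1) {
                array_pop($open); // never pop the synthetic root
            }
        } else {
            $parent->children[] = $value; // text node
        }
    }

    return $root;
}

$toyTree = toyParse([
    ['open', 'div'],
    ['open', 'p'],
    ['text', 'Hello, world!'],
    ['close', 'p'],
    ['close', 'div'],
]);
```

Each opening tag pushes a node onto the stack and each closing tag pops one, so nesting in the token stream becomes nesting in the tree. The sketch does not check that closing tags match the element being closed.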

Printing

The printer takes an HTML input or the resulting AST and renders it as neatly formatted HTML. This is useful for ensuring consistent formatting after transformations.

use Sinnbeck\HtmlAst\Printer;

// Create a Printer instance from the node tree and render it as HTML
echo Printer::make($nodes)->render();

To indent every line by additional levels, pass the number of levels to render():

use Sinnbeck\HtmlAst\Printer;

// Indents everything by 1 extra indentation level
echo Printer::make($nodes)->render(1);

By default, the output is indented with 4 spaces. To change this, call ->indentWith() with the indent string you want:

use Sinnbeck\HtmlAst\Printer;

// Indents with tab instead of 4 spaces
echo Printer::make($nodes)->indentWith("\t")->render();
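
How a starting level and an indent string interact can be shown with a small recursive sketch. This is not the package's Printer: the toyRender function and the nested-array node shape are assumptions for illustration, and the real printer's line layout may differ.

```php
<?php

declare(strict_types=1);

// Illustration only: a toy indenting renderer, NOT the html-ast Printer.
// toyRender and the ['tag' => ..., 'children' => [...]] node shape are hypothetical.
function toyRender(array|string $node, int $level = 0, string $indent = '    '): string
{
    // Every line is prefixed with the indent string repeated $level times.
    $pad = str_repeat($indent, $level);

    if (is_string($node)) {
        return $pad . $node . "\n"; // text node on its own line
    }

    $out = $pad . '<' . $node['tag'] . ">\n";
    foreach ($node['children'] as $child) {
        $out .= toyRender($child, $level + 1, $indent); // children sit one level deeper
    }

    return $out . $pad . '</' . $node['tag'] . ">\n";
}

$toyNode = ['tag' => 'div', 'children' => [['tag' => 'p', 'children' => ['Hello']]]];

echo toyRender($toyNode);          // 4-space indent, starting at level 0
echo toyRender($toyNode, 1, "\t"); // tab indent, everything shifted by one level
```

The starting level shifts the whole output, while the indent string controls what one level looks like, so the two settings are independent of each other.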

Testing

The repository includes tests under the tests directory, using Pest PHP as the testing framework and Symfony's VarDumper for debugging. To run tests, execute:

composer test

This command runs the full test suite, which covers lexing, parsing, and printing.

Todo

  • Add line numbers to tokens (Lexer)
  • Introduce an HTML validator to ensure that the HTML structure conforms to expected standards
  • Implement a node visitor pattern to allow modification or transformation of the AST

Contributing

Contributions to html-ast are welcome. If you would like to contribute, please follow these steps:

  1. Fork the repository.
  2. Create a feature branch:
    git checkout -b feature/your-feature-name
  3. Make your changes and add tests.
  4. Format all files:
    ./vendor/bin/pint
  5. Commit your changes:
    git commit -am 'Add new feature'
  6. Push the branch:
    git push origin feature/your-feature-name
  7. Open a pull request explaining your changes.

Please adhere to the coding standards and test all changes before submitting a pull request.

License

This project is licensed under the MIT License.