acseo/domscribe

A PHP library for converting HTML to semantic Markdown, preserving structure and meaning

github.com/acseo/domscribe

Last update: 2026-03-09 10:56:51 UTC


Domscribe is a PHP library that converts HTML to semantic Markdown, preserving the structure and meaning of the original content.

It is a PHP port of domscribe-python, which is itself based on dom-to-semantic-markdown.

🚀 Features

  • Semantic preservation: Maintains the semantic structure of HTML during conversion
  • Complex structure handling: Handles nested lists, tables, and other complex HTML structures
  • Highly customizable: Extensive options to tailor the conversion process
  • Main content extraction: Automatically identifies and extracts the main content from web pages
  • LLM-friendly output: Adds annotations, such as table column identifiers, that make the Markdown easier for language models to process
  • Well-tested: Comprehensive test suite with PHPUnit
  • Modern PHP: Uses PHP 8.0+ features with strict typing

📦 Installation

Install via Composer:

composer require acseo/domscribe

📖 Basic Usage

<?php

use Domscribe\Converter;

// Simple conversion
$html = "<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>";
$markdown = Converter::htmlToMarkdown($html);
echo $markdown;
// Output:
// # Hello, World!
//
// This is a **test**.

🎯 Advanced Usage

Using Conversion Options

use Domscribe\Converter;
use Domscribe\ConversionOptions;

$html = '<html><body><main><h1>Main Content</h1><p>Some text</p></main></body></html>';

// Using an array
$options = [
    'extract_main_content' => true,
    'refify_urls' => true,
    'keep_html' => ['div', 'span'],
    'debug' => false,
];

$markdown = Converter::htmlToMarkdown($html, $options);

// Or using ConversionOptions object
$options = new ConversionOptions();
$options->extractMainContent = true;
$options->refifyUrls = true;
$options->keepHtml = ['div', 'span'];

$markdown = Converter::htmlToMarkdown($html, $options);

Available Options

Option                     Type              Default  Description
websiteDomain              ?string           null     Website domain to strip from URLs
extractMainContent         bool              false    Automatically extract main content
refifyUrls                 bool              false    Convert links to reference style
urlMap                     array             []       Map of URLs to replace
debug                      bool              false    Enable debug logging
enableTableColumnTracking  bool              true     Add colId comments to table cells
keepHtml                   array             []       HTML tags to preserve
includeMetaData            string|bool|null  null     Include metadata from the HTML head
overrideElementProcessing  callable|null     null     Custom element processing callback
processUnhandledElement    callable|null     null     Custom callback for unhandled elements
overrideNodeRenderer       callable|null     null     Custom node renderer callback
renderCustomNode           callable|null     null     Callback that renders custom AST nodes
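
The array form shown earlier uses snake_case keys (`extract_main_content`), while the `ConversionOptions` properties and the table above use camelCase. The sketch below shows one way such keys could be translated. `snakeToCamel` and `normalizeOptionKeys` are hypothetical helpers written for illustration; they are not part of Domscribe's API, and the library's actual mapping may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: turn a snake_case option key into camelCase.
 */
function snakeToCamel(string $key): string
{
    // "extract_main_content" -> "Extract_Main_Content" -> "ExtractMainContent"
    $studly = str_replace('_', '', ucwords($key, '_'));

    return lcfirst($studly);
}

/**
 * Hypothetical helper: rename every key of an options array.
 *
 * @param array<string, mixed> $options
 * @return array<string, mixed>
 */
function normalizeOptionKeys(array $options): array
{
    $normalized = [];
    foreach ($options as $key => $value) {
        $normalized[snakeToCamel((string) $key)] = $value;
    }

    return $normalized;
}

$normalized = normalizeOptionKeys([
    'extract_main_content' => true,
    'refify_urls' => true,
    'keep_html' => ['div', 'span'],
]);

// Prints the keys extractMainContent, refifyUrls, keepHtml.
print_r(array_keys($normalized));
```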

🎨 Examples

Convert Complex HTML

use Domscribe\Converter;

$html = <<<HTML
<div>
    <h1>My Blog Post</h1>
    <p>Here's a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2
            <ol>
                <li>Subitem 2.1</li>
                <li>Subitem 2.2</li>
            </ol>
        </li>
        <li>Item 3</li>
    </ul>
    <blockquote>
        <p>This is a quote.</p>
    </blockquote>
</div>
HTML;

$markdown = Converter::htmlToMarkdown($html);
echo $markdown;

Output:

# My Blog Post

Here's a paragraph with **bold** and *italic* text.

- Item 1
- Item 2
  1. Subitem 2.1
  2. Subitem 2.2
- Item 3

> This is a quote.

Extract Main Content

use Domscribe\Converter;

$html = <<<HTML
<html>
    <body>
        <header>Header content</header>
        <nav>Navigation</nav>
        <main>
            <h1>Main Article</h1>
            <p>This is the main content.</p>
        </main>
        <footer>Footer content</footer>
    </body>
</html>
HTML;

$options = ['extract_main_content' => true];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:

# Main Article

This is the main content.
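
To give a feel for what main-content extraction involves, here is a minimal sketch that uses only PHP's bundled DOM extension. It simply picks the first `<main>` element and falls back to `<body>`. This is an assumption made for illustration: `findMainElement` is a hypothetical function, and Domscribe's own detection may use different or richer heuristics.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: return the first <main> element, falling back to <body>.
 */
function findMainElement(string $html): ?DOMElement
{
    // libxml's HTML parser reports HTML5 tags such as <main> as errors;
    // collect them internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (['main', 'body'] as $tagName) {
        $candidate = $document->getElementsByTagName($tagName)->item(0);
        if ($candidate instanceof DOMElement) {
            return $candidate;
        }
    }

    return null;
}

$html = <<<HTML
<html>
    <body>
        <header>Header content</header>
        <nav>Navigation</nav>
        <main>
            <h1>Main Article</h1>
            <p>This is the main content.</p>
        </main>
        <footer>Footer content</footer>
    </body>
</html>
HTML;

$main = findMainElement($html);

// Collapse runs of whitespace so the extracted text prints on one line.
$text = $main === null
    ? ''
    : trim((string) preg_replace('/\s+/', ' ', $main->textContent));

echo $text, PHP_EOL;
```

The header, navigation, and footer text never reach `$text` because they sit outside the selected element.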

Convert URLs to Reference Style

use Domscribe\Converter;

$html = <<<HTML
<p>
    Check out <a href="https://example.com">this site</a> and
    <a href="https://example.org">another site</a>.
    Here's <a href="https://example.com">the first site</a> again.
</p>
HTML;

$options = ['refify_urls' => true];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:

Check out [this site][1] and [another site][2].
Here's [the first site][1] again.

[1]: https://example.com
[2]: https://example.org
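
The numbering in the output above reuses `[1]` for the repeated URL. A minimal sketch of that deduplication, operating on inline Markdown links with a regular expression, might look like the following. `refifyLinks` is a hypothetical function, not Domscribe's implementation, which may well work on the AST rather than on strings.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: rewrite inline Markdown links as numbered references,
 * reusing the same number when a URL appears more than once.
 */
function refifyLinks(string $markdown): string
{
    // Maps each URL to its reference number, in order of first appearance.
    $references = [];

    $body = preg_replace_callback(
        '/\[([^\]]+)\]\(([^)\s]+)\)/',
        function (array $match) use (&$references): string {
            [, $text, $url] = $match;
            // The first sighting of a URL gets the next free number.
            if (!isset($references[$url])) {
                $references[$url] = count($references) + 1;
            }

            return sprintf('[%s][%d]', $text, $references[$url]);
        },
        $markdown
    );

    $definitions = [];
    foreach ($references as $url => $number) {
        $definitions[] = sprintf('[%d]: %s', $number, $url);
    }

    // Append the reference definitions only when at least one link was found.
    return $definitions === []
        ? (string) $body
        : $body . "\n\n" . implode("\n", $definitions);
}

$inline = 'Check out [this site](https://example.com) and '
    . '[another site](https://example.org). '
    . "Here's [the first site](https://example.com) again.";

$result = refifyLinks($inline);
echo $result, PHP_EOL;
```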

Preserve Specific HTML Tags

use Domscribe\Converter;

$html = '<p>This is <span class="highlight">highlighted</span> text.</p>';
$options = ['keep_html' => ['span']];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:

This is <span class="highlight">highlighted</span> text.

Tables with Column Identifiers

use Domscribe\Converter;

$html = <<<HTML
<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Alice</td>
            <td>30</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>25</td>
        </tr>
    </tbody>
</table>
HTML;

$markdown = Converter::htmlToMarkdown($html);
echo $markdown;

Output:

| Name <!-- colId: 1 --> | Age <!-- colId: 2 --> |
| --- | --- |
| Alice <!-- colId: 1 --> | 30 <!-- colId: 2 --> |
| Bob <!-- colId: 1 --> | 25 <!-- colId: 2 --> |
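
The column identifiers follow a simple pattern: each cell carries the 1-based index of its column, so a reader or model can match a cell to its header without counting pipes. The sketch below reproduces the output format shown above with plain string handling. `renderRow` and `renderTable` are hypothetical names, and this is not how Domscribe renders tables internally.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: render one table row, tagging each cell with a
 * 1-based column identifier.
 *
 * @param list<string> $cells
 */
function renderRow(array $cells): string
{
    $rendered = [];
    foreach (array_values($cells) as $index => $cell) {
        $rendered[] = sprintf('%s <!-- colId: %d -->', $cell, $index + 1);
    }

    return '| ' . implode(' | ', $rendered) . ' |';
}

/**
 * Illustrative only: render a header, a separator row, and the body rows.
 *
 * @param list<string>       $header
 * @param list<list<string>> $rows
 */
function renderTable(array $header, array $rows): string
{
    $lines = [
        renderRow($header),
        // Separator row: one "---" per column.
        '| ' . implode(' | ', array_fill(0, count($header), '---')) . ' |',
    ];
    foreach ($rows as $row) {
        $lines[] = renderRow($row);
    }

    return implode("\n", $lines);
}

$table = renderTable(['Name', 'Age'], [['Alice', '30'], ['Bob', '25']]);
echo $table, PHP_EOL;
```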

🔧 Working with the AST

Domscribe provides access to the Abstract Syntax Tree (AST) for advanced use cases:

use Domscribe\Converter;

$html = '<h1>Title</h1><p>Text with <a href="https://example.com">link</a></p>';

// Convert HTML to AST
$ast = Converter::htmlToMarkdownAst($html);

// Find specific nodes in the AST
$link = Converter::findInMarkdownAst($ast, function ($node) {
    return isset($node['type']) && $node['type'] === 'link';
});

// Find all nodes of a certain type
$allLinks = Converter::findAllInMarkdownAst($ast, function ($node) {
    return isset($node['type']) && $node['type'] === 'link';
});

// Convert AST back to Markdown string
$markdown = Converter::markdownAstToString($ast);
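
The predicates above test `$node['type']`, which suggests nodes are associative arrays. Under that assumption, a depth-first search over such a tree can be sketched in a few lines. The node shape used here (`content` holding child nodes, `href` on links) is hypothetical and chosen only for the example, and `findAllNodes` is not a Domscribe function.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: collect every node matching $predicate, depth first.
 * A "node" here is any array carrying a 'type' key; nested arrays are
 * searched regardless of which key holds them.
 *
 * @param array<mixed> $tree
 * @return list<array<mixed>>
 */
function findAllNodes(array $tree, callable $predicate): array
{
    $matches = [];

    if (isset($tree['type']) && $predicate($tree)) {
        $matches[] = $tree;
    }

    foreach ($tree as $value) {
        if (is_array($value)) {
            array_push($matches, ...findAllNodes($value, $predicate));
        }
    }

    return $matches;
}

// Hypothetical AST shape, loosely following the HTML in the example above.
$ast = [
    ['type' => 'heading', 'level' => 1, 'content' => [
        ['type' => 'text', 'content' => 'Title'],
    ]],
    ['type' => 'paragraph', 'content' => [
        ['type' => 'text', 'content' => 'Text with '],
        ['type' => 'link', 'href' => 'https://example.com', 'content' => [
            ['type' => 'text', 'content' => 'link'],
        ]],
    ]],
];

$links = findAllNodes($ast, fn (array $node): bool => $node['type'] === 'link');

echo count($links), ' link(s), first href: ', $links[0]['href'], PHP_EOL;
```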

🧪 Running Tests

# Install dependencies
composer install

# Run tests
composer test

# Run with coverage
./vendor/bin/phpunit --coverage-html coverage

# Run static analysis
composer phpstan

# Check code style
composer cs-check

# Fix code style
composer cs-fix

🏗️ Architecture

The library is organized into several key components:

  • Converter: Main entry point and orchestrator
  • HtmlToMarkdownAst: Converts HTML DOM to Markdown AST
  • MarkdownAstToString: Converts AST to Markdown string
  • DomUtils: DOM manipulation and content extraction utilities
  • UrlUtils: URL processing and reference-style conversion
  • AstUtils: AST traversal and manipulation utilities
  • ConversionOptions: Configuration object for customization
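
The split between HtmlToMarkdownAst and MarkdownAstToString listed above can be pictured as a two-stage pipeline: parse the DOM into a tree of nodes, then render that tree to text. The sketch below is a deliberately tiny version of that idea, built on PHP's bundled DOM extension and handling only `<h1>`, `<p>`, `<strong>`, and text. The function names, the `TAG_TO_NODE_TYPE` map, and the node shape are all assumptions for illustration and do not reflect Domscribe's internals.

```php
<?php
declare(strict_types=1);

// Hypothetical mapping from HTML tag names to AST node types.
const TAG_TO_NODE_TYPE = ['h1' => 'heading', 'p' => 'paragraph', 'strong' => 'bold'];

/**
 * Stage 1 (illustrative): turn the children of a DOM node into a small AST.
 *
 * @return list<array<string, mixed>>
 */
function domToAst(DOMNode $parent): array
{
    $ast = [];
    foreach ($parent->childNodes as $child) {
        if ($child instanceof DOMText) {
            $ast[] = ['type' => 'text', 'content' => $child->textContent];
            continue;
        }
        if (!$child instanceof DOMElement) {
            continue; // Skip comments and other non-element nodes.
        }

        $type = TAG_TO_NODE_TYPE[$child->tagName] ?? null;
        if ($type !== null) {
            $ast[] = ['type' => $type, 'content' => domToAst($child)];
        } else {
            // Unknown element: keep its children, drop the wrapper.
            array_push($ast, ...domToAst($child));
        }
    }

    return $ast;
}

/**
 * Stage 2 (illustrative): render the AST produced by domToAst() as Markdown.
 *
 * @param list<array<string, mixed>> $ast
 */
function astToMarkdown(array $ast): string
{
    $output = '';
    foreach ($ast as $node) {
        $inner = is_array($node['content'])
            ? astToMarkdown($node['content'])
            : $node['content'];

        $output .= match ($node['type']) {
            'heading' => '# ' . $inner . "\n\n",
            'paragraph' => $inner . "\n\n",
            'bold' => '**' . $inner . '**',
            default => $inner,
        };
    }

    return $output;
}

$document = new DOMDocument();
$document->loadHTML('<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>');

$body = $document->getElementsByTagName('body')->item(0);
if (!$body instanceof DOMElement) {
    throw new RuntimeException('The parsed document has no <body> element.');
}

$markdown = trim(astToMarkdown(domToAst($body)));
echo $markdown, PHP_EOL;
```

Keeping the AST as an intermediate form is what makes operations such as searching for links or swapping in a custom renderer possible without re-parsing the HTML.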

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

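🙏 Credits

Domscribe is a PHP port of domscribe-python, which is itself based on dom-to-semantic-markdown.

🔗 Related Projects

  • domscribe-python
  • dom-to-semantic-markdown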

📞 Support

For issues, questions, or contributions, please use the GitHub issue tracker.