README

A powerful PHP library for converting HTML to semantic Markdown, preserving the structure and meaning of the original content.

This library is a PHP port of domscribe-python, which itself is based on dom-to-semantic-markdown.

🚀 Features

Semantic preservation: Maintains the semantic structure of HTML during conversion
Complex structure handling: Handles nested lists, tables, and other complex HTML structures
Highly customizable: Extensive options to tailor the conversion process
Main content extraction: Automatically identifies and extracts the main content from web pages
LLM-friendly output: Optimized for large language model (LLM) processing, with annotations such as table column identifiers
Well-tested: Comprehensive test suite with PHPUnit
Modern PHP: Uses PHP 8.0+ features with strict typing

📦 Installation

Install via Composer:

composer require acseo/domscribe

📖 Basic Usage

<?php

use Domscribe\Converter;

// Simple conversion
$html = "<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>";
$markdown = Converter::htmlToMarkdown($html);
echo $markdown;
// Output:
// # Hello, World!
//
// This is a **test**.

🎯 Advanced Usage

Using Conversion Options

use Domscribe\Converter;
use Domscribe\ConversionOptions;

$html = '<html><body><main><h1>Main Content</h1><p>Some text</p></main></body></html>';

// Using an array
$options = [
    'extract_main_content' => true,
    'refify_urls' => true,
    'keep_html' => ['div', 'span'],
    'debug' => false,
];

$markdown = Converter::htmlToMarkdown($html, $options);

// Or using a ConversionOptions object
$options = new ConversionOptions();
$options->extractMainContent = true;
$options->refifyUrls = true;
$options->keepHtml = ['div', 'span'];

$markdown = Converter::htmlToMarkdown($html, $options);
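
The array form uses snake_case keys (`extract_main_content`), while the `ConversionOptions` properties are camelCase (`extractMainContent`). How the library maps one to the other is not documented here. The following is a minimal sketch of one possible mapping, not Domscribe's actual code; `snakeToCamel` and `normalizeOptionKeys` are hypothetical names used only for illustration.

```php
<?php

/**
 * Hypothetical helper: converts a snake_case option key to camelCase,
 * e.g. "extract_main_content" becomes "extractMainContent".
 */
function snakeToCamel(string $key): string
{
    // ucwords() with "_" as delimiter capitalises the first letter of each segment.
    return lcfirst(str_replace('_', '', ucwords($key, '_')));
}

/**
 * Hypothetical helper: rewrites every key of an options array to camelCase.
 *
 * @param array<string, mixed> $options
 * @return array<string, mixed>
 */
function normalizeOptionKeys(array $options): array
{
    $normalized = [];
    foreach ($options as $key => $value) {
        $normalized[snakeToCamel((string) $key)] = $value;
    }
    return $normalized;
}

$options = normalizeOptionKeys([
    'extract_main_content' => true,
    'refify_urls' => true,
    'keep_html' => ['div', 'span'],
]);

print_r(array_keys($options));
```

Keys that contain no underscore, such as `debug`, pass through unchanged.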

Available Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `websiteDomain` | `?string` | `null` | Website domain to strip from URLs |
| `extractMainContent` | `bool` | `false` | Automatically extract main content |
| `refifyUrls` | `bool` | `false` | Convert URLs to reference-style links |
| `urlMap` | `array` | `[]` | Map of URLs to replace |
| `debug` | `bool` | `false` | Enable debug logging |
| `enableTableColumnTracking` | `bool` | `true` | Add colId comments to table cells |
| `keepHtml` | `array` | `[]` | HTML tags to preserve |
| `includeMetaData` | `string\|bool\|null` | `null` | Include metadata from HTML head |
| `overrideElementProcessing` | `callable\|null` | `null` | Callback that overrides element processing |
| `processUnhandledElement` | `callable\|null` | `null` | Callback for elements the converter does not handle |
| `overrideNodeRenderer` | `callable\|null` | `null` | Callback that overrides rendering of AST nodes |
| `renderCustomNode` | `callable\|null` | `null` | Callback that renders custom AST nodes |

🎨 Examples

Convert Complex HTML

use Domscribe\Converter;

$html = <<<HTML
<div>
    <h1>My Blog Post</h1>
    <p>Here's a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2
            <ol>
                <li>Subitem 2.1</li>
                <li>Subitem 2.2</li>
            </ol>
        </li>
        <li>Item 3</li>
    </ul>
    <blockquote>
        <p>This is a quote.</p>
    </blockquote>
</div>
HTML;

$markdown = Converter::htmlToMarkdown($html);
echo $markdown;

Output:

# My Blog Post

Here's a paragraph with **bold** and *italic* text.

- Item 1
- Item 2
  1. Subitem 2.1
  2. Subitem 2.2
- Item 3

> This is a quote.

Extract Main Content

use Domscribe\Converter;

$html = <<<HTML
<html>
    <body>
        <header>Header content</header>
        <nav>Navigation</nav>
        <main>
            <h1>Main Article</h1>
            <p>This is the main content.</p>
        </main>
        <footer>Footer content</footer>
    </body>
</html>
HTML;

$options = ['extract_main_content' => true];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:

# Main Article

This is the main content.

Convert URLs to Reference Style

use Domscribe\Converter;

$html = <<<HTML
<p>
    Check out <a href="https://example.com">this site</a> and
    <a href="https://example.org">another site</a>.
    Here's <a href="https://example.com">the first site</a> again.
</p>
HTML;

$options = ['refify_urls' => true];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:

Check out [this site][1] and [another site][2].
Here's [the first site][1] again.

[1]: https://example.com
[2]: https://example.org
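
Reference-style output keeps the prose compact, but downstream code sometimes needs the URLs back next to their link text. The following is a minimal sketch in plain PHP that works on the output shown above; it does not use Domscribe, and `collectReferenceDefinitions` and `resolveReferenceLinks` are hypothetical helper names. It assumes definitions have the form `[label]: url` on their own line.

```php
<?php

/**
 * Hypothetical helper: collects "[label]: url" definitions from Markdown text.
 *
 * @return array<int|string, string> map of reference label to URL
 */
function collectReferenceDefinitions(string $markdown): array
{
    $definitions = [];
    // Match whole lines such as "[1]: https://example.com".
    preg_match_all('/^\[([^\]]+)\]:\s*(\S+)\s*$/m', $markdown, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $definitions[$match[1]] = $match[2];
    }
    return $definitions;
}

/**
 * Hypothetical helper: lists every "[text][label]" link with its resolved URL.
 * The URL is null when no matching definition exists.
 *
 * @return list<array{text: string, url: string|null}>
 */
function resolveReferenceLinks(string $markdown): array
{
    $definitions = collectReferenceDefinitions($markdown);
    $links = [];
    preg_match_all('/\[([^\]]+)\]\[([^\]]+)\]/', $markdown, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $links[] = ['text' => $match[1], 'url' => $definitions[$match[2]] ?? null];
    }
    return $links;
}

$markdown = <<<MD
Check out [this site][1] and [another site][2].
Here's [the first site][1] again.

[1]: https://example.com
[2]: https://example.org
MD;

foreach (resolveReferenceLinks($markdown) as $link) {
    echo $link['text'], ' => ', $link['url'] ?? '(unresolved)', PHP_EOL;
}
```

Both links labelled `[1]` resolve to the same URL, which is the point of the reference style: each URL is written once however often it is linked.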

Preserve Specific HTML Tags

use Domscribe\Converter;

$html = '<p>This is <span class="highlight">highlighted</span> text.</p>';
$options = ['keep_html' => ['span']];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;

Output:

This is <span class="highlight">highlighted</span> text.

Tables with Column Identifiers

use Domscribe\Converter;

$html = <<<HTML
<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Alice</td>
            <td>30</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>25</td>
        </tr>
    </tbody>
</table>
HTML;

$markdown = Converter::htmlToMarkdown($html);
echo $markdown;

Output:

| Name <!-- colId: 1 --> | Age <!-- colId: 2 --> |
| --- | --- |
| Alice <!-- colId: 1 --> | 30 <!-- colId: 2 --> |
| Bob <!-- colId: 1 --> | 25 <!-- colId: 2 --> |
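
The colId comments let a consumer match each cell to its column without counting pipe characters. The following is a rough sketch in plain PHP that reads the output shown above back into records; it is independent of Domscribe, and `parseAnnotatedRow` and `tableToRecords` are hypothetical helper names. It assumes every annotated cell ends with a `<!-- colId: N -->` comment, as in the example.

```php
<?php

/**
 * Hypothetical helper: maps one annotated row to [colId => cell text].
 * Rows without colId comments (such as the "| --- |" separator) yield [].
 *
 * @return array<int, string>
 */
function parseAnnotatedRow(string $row): array
{
    $cells = [];
    // A cell looks like "| Alice <!-- colId: 1 -->".
    preg_match_all('/\|\s*(.*?)\s*<!--\s*colId:\s*(\d+)\s*-->/', $row, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $cells[(int) $match[2]] = $match[1];
    }
    return $cells;
}

/**
 * Hypothetical helper: treats the first row as the header and keys every
 * later row by header text, matching cells through their column id.
 *
 * @return list<array<string, string>>
 */
function tableToRecords(string $table): array
{
    $lines = preg_split('/\R/', trim($table));
    $header = parseAnnotatedRow(array_shift($lines));
    $records = [];
    foreach ($lines as $line) {
        $cells = parseAnnotatedRow($line);
        if ($cells === []) {
            continue; // separator or unannotated row
        }
        $record = [];
        foreach ($cells as $colId => $text) {
            // Fall back to a generic name if the header lacks this column id.
            $record[$header[$colId] ?? "col{$colId}"] = $text;
        }
        $records[] = $record;
    }
    return $records;
}

$table = <<<MD
| Name <!-- colId: 1 --> | Age <!-- colId: 2 --> |
| --- | --- |
| Alice <!-- colId: 1 --> | 30 <!-- colId: 2 --> |
| Bob <!-- colId: 1 --> | 25 <!-- colId: 2 --> |
MD;

print_r(tableToRecords($table));
```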

🔧 Working with the AST

Domscribe provides access to the Abstract Syntax Tree (AST) for advanced use cases:

use Domscribe\Converter;

$html = '<h1>Title</h1><p>Text with <a href="https://example.com">link</a></p>';

// Convert HTML to AST
$ast = Converter::htmlToMarkdownAst($html);

// Find specific nodes in the AST
$link = Converter::findInMarkdownAst($ast, function ($node) {
    return isset($node['type']) && $node['type'] === 'link';
});

// Find all nodes of a certain type
$allLinks = Converter::findAllInMarkdownAst($ast, function ($node) {
    return isset($node['type']) && $node['type'] === 'link';
});

// Convert AST back to Markdown string
$markdown = Converter::markdownAstToString($ast);
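
To make the predicate-based search concrete, here is a self-contained sketch of a depth-first search over a nested array. It is not Domscribe's implementation. The sample tree is an assumed shape: only the `type` key appears in the example above, while the `content`, `href`, and `level` keys are guesses for illustration. `findAllNodes` is a hypothetical name.

```php
<?php

/**
 * Hypothetical helper: depth-first search that returns every array in the
 * tree for which $predicate returns true.
 *
 * @param array<mixed> $tree
 * @param callable(array<mixed>): bool $predicate
 * @return list<array<mixed>>
 */
function findAllNodes(array $tree, callable $predicate): array
{
    $found = [];
    if ($predicate($tree)) {
        $found[] = $tree;
    }
    foreach ($tree as $value) {
        if (is_array($value)) {
            // Recurse into child nodes and lists of child nodes alike.
            $found = array_merge($found, findAllNodes($value, $predicate));
        }
    }
    return $found;
}

// Assumed AST shape, for illustration only.
$ast = [
    ['type' => 'heading', 'level' => 1, 'content' => [
        ['type' => 'text', 'content' => 'Title'],
    ]],
    ['type' => 'paragraph', 'content' => [
        ['type' => 'text', 'content' => 'Text with '],
        ['type' => 'link', 'href' => 'https://example.com', 'content' => [
            ['type' => 'text', 'content' => 'link'],
        ]],
    ]],
];

// Same predicate style as the example above: match nodes by their 'type' key.
$isLink = fn (array $node): bool => ($node['type'] ?? null) === 'link';

$links = findAllNodes($ast, $isLink);
echo count($links), ' link node(s) found', PHP_EOL;
```

Because the walker recurses into every nested array, it does not depend on the name of the key that holds child nodes.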

🧪 Running Tests

# Install dependencies
composer install

# Run tests
composer test

# Run with coverage
./vendor/bin/phpunit --coverage-html coverage

# Run static analysis
composer phpstan

# Check code style
composer cs-check

# Fix code style
composer cs-fix

🏗️ Architecture

The library is organized into several key components:

Converter: Main entry point and orchestrator
HtmlToMarkdownAst: Converts HTML DOM to Markdown AST
MarkdownAstToString: Converts AST to Markdown string
DomUtils: DOM manipulation and content extraction utilities
UrlUtils: URL processing and reference-style conversion
AstUtils: AST traversal and manipulation utilities
ConversionOptions: Configuration object for customization

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add some amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Credits

Original TypeScript library: dom-to-semantic-markdown
Python port: domscribe-python
PHP port by ACSEO

🔗 Related Projects

domscribe-python - Python version
dom-to-semantic-markdown - Original TypeScript version

📞 Support

For issues, questions, or contributions, please use the GitHub issue tracker.
