acseo / domscribe
A PHP library for converting HTML to semantic Markdown, preserving structure and meaning
Requires
- php: >=8.0
- ext-dom: *
- ext-libxml: *
- masterminds/html5: ^2.7
Requires (Dev)
- phpstan/phpstan: ^1.10
- phpunit/phpunit: ^9.5
- squizlabs/php_codesniffer: ^3.7
This package is auto-updated.
Last update: 2026-03-09 10:56:51 UTC
README
A powerful PHP library for converting HTML to semantic Markdown, preserving the structure and meaning of the original content.
This library is a PHP port of domscribe-python, which itself is based on dom-to-semantic-markdown.
Features
- Semantic preservation: Maintains the semantic structure of HTML during conversion
- Complex structure handling: Handles nested lists, tables, and other complex HTML structures
- Highly customizable: Extensive options to tailor the conversion process
- Main content extraction: Automatically identifies and extracts the main content from web pages
- LLM-friendly output: Optimized for Language Model processing with special annotations
- Well-tested: Comprehensive test suite with PHPUnit
- Modern PHP: Uses PHP 8.0+ features with strict typing
Installation
Install via Composer:
composer require acseo/domscribe
Basic Usage
<?php

use Domscribe\Converter;

// Simple conversion
$html = "<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>";
$markdown = Converter::htmlToMarkdown($html);

echo $markdown;
// Output:
// # Hello, World!
//
// This is a **test**.
Advanced Usage
Using Conversion Options
use Domscribe\Converter;
use Domscribe\ConversionOptions;

$html = '<html><body><main><h1>Main Content</h1><p>Some text</p></main></body></html>';

// Using an array
$options = [
    'extract_main_content' => true,
    'refify_urls' => true,
    'keep_html' => ['div', 'span'],
    'debug' => false,
];
$markdown = Converter::htmlToMarkdown($html, $options);

// Or using a ConversionOptions object
$options = new ConversionOptions();
$options->extractMainContent = true;
$options->refifyUrls = true;
$options->keepHtml = ['div', 'span'];
$markdown = Converter::htmlToMarkdown($html, $options);
Available Options
| Option | Type | Default | Description |
|---|---|---|---|
| websiteDomain | ?string | null | Website domain to strip from URLs |
| extractMainContent | bool | false | Automatically extract main content |
| refifyUrls | bool | false | Convert to reference-style links |
| urlMap | array | [] | Map of URLs to replace |
| debug | bool | false | Enable debug logging |
| enableTableColumnTracking | bool | true | Add colId comments to table cells |
| keepHtml | array | [] | HTML tags to preserve |
| includeMetaData | string\|bool\|null | null | Include metadata from HTML head |
| overrideElementProcessing | callable\|null | null | Custom element processing callback |
| processUnhandledElement | callable\|null | null | Custom unhandled element callback |
| overrideNodeRenderer | callable\|null | null | Custom node renderer callback |
| renderCustomNode | callable\|null | null | Custom node renderer callback |
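The array form shown under "Using Conversion Options" uses snake_case keys (`extract_main_content`), while the table lists camelCase property names (`extractMainContent`). Assuming the two differ only in casing, which the examples suggest but the README does not state, the correspondence can be sketched with a hypothetical helper. `snakeToCamel` is not part of Domscribe.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Domscribe): converts a snake_case
 * array key into the matching camelCase property name.
 */
function snakeToCamel(string $key): string
{
    // "extract_main_content" -> "Extract_Main_Content" -> "ExtractMainContent"
    $studly = str_replace('_', '', ucwords($key, '_'));

    // Lower-case the first letter to get camelCase.
    return lcfirst($studly);
}

foreach (['extract_main_content', 'refify_urls', 'keep_html', 'debug'] as $key) {
    echo $key, ' => ', snakeToCamel($key), "\n";
}
```

For the four keys used in the example, this yields `extractMainContent`, `refifyUrls`, `keepHtml`, and `debug`.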
Examples
Convert Complex HTML
use Domscribe\Converter;

$html = <<<HTML
<div>
    <h1>My Blog Post</h1>
    <p>Here's a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2
            <ol>
                <li>Subitem 2.1</li>
                <li>Subitem 2.2</li>
            </ol>
        </li>
        <li>Item 3</li>
    </ul>
    <blockquote>
        <p>This is a quote.</p>
    </blockquote>
</div>
HTML;

$markdown = Converter::htmlToMarkdown($html);
echo $markdown;
Output:
# My Blog Post

Here's a paragraph with **bold** and *italic* text.

- Item 1
- Item 2
  1. Subitem 2.1
  2. Subitem 2.2
- Item 3

> This is a quote.
Extract Main Content
use Domscribe\Converter;

$html = <<<HTML
<html>
<body>
    <header>Header content</header>
    <nav>Navigation</nav>
    <main>
        <h1>Main Article</h1>
        <p>This is the main content.</p>
    </main>
    <footer>Footer content</footer>
</body>
</html>
HTML;

$options = ['extract_main_content' => true];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;
Output:
# Main Article
This is the main content.
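The README does not describe how the main content is detected. As a rough illustration only, the sketch below assumes the simplest possible heuristic: take the first `<main>` element and fall back to `<body>`. `extractMainText` is a hypothetical function built on PHP's DOM extension, not Domscribe's algorithm.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical heuristic (not Domscribe's algorithm): returns the text of
 * the first <main> element, falling back to <body> when there is none.
 */
function extractMainText(string $html): string
{
    // libxml reports HTML5 tags such as <main> as errors; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (['main', 'body'] as $tag) {
        $element = $doc->getElementsByTagName($tag)->item(0);
        if ($element !== null) {
            // Collapse runs of whitespace left over from the markup.
            return trim((string) preg_replace('/\s+/', ' ', $element->textContent));
        }
    }

    return '';
}

$html = '<html><body><header>Header content</header><nav>Navigation</nav>'
    . '<main><h1>Main Article</h1> <p>This is the main content.</p></main>'
    . '<footer>Footer content</footer></body></html>';

$text = extractMainText($html);
echo $text, "\n";
```

With this input the header, navigation, and footer text are left out, because only the `<main>` subtree is read.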
Convert URLs to Reference Style
use Domscribe\Converter;

$html = <<<HTML
<p>
    Check out <a href="https://example.com">this site</a> and
    <a href="https://example.org">another site</a>.
    Here's <a href="https://example.com">the first site</a> again.
</p>
HTML;

$options = ['refify_urls' => true];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;
Output:
Check out [this site][1] and [another site][2]. Here's [the first site][1] again.

[1]: https://example.com
[2]: https://example.org
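In the output above, the repeated URL reuses reference `[1]` instead of getting a new number. A minimal sketch of that numbering scheme follows, assuming references are numbered by first appearance of each distinct URL. `refifyLinks` is a hypothetical function for illustration, not the library's implementation.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Domscribe): assigns one reference
 * number per distinct URL, in order of first appearance.
 *
 * @param list<array{text: string, href: string}> $links
 * @return array{inline: list<string>, definitions: list<string>}
 */
function refifyLinks(array $links): array
{
    $numbers = []; // URL => reference number
    $inline = [];

    foreach ($links as $link) {
        $href = $link['href'];
        if (!isset($numbers[$href])) {
            // First time this URL is seen: give it the next number.
            $next = count($numbers) + 1;
            $numbers[$href] = $next;
        }
        $inline[] = sprintf('[%s][%d]', $link['text'], $numbers[$href]);
    }

    $definitions = [];
    foreach ($numbers as $url => $number) {
        $definitions[] = sprintf('[%d]: %s', $number, $url);
    }

    return ['inline' => $inline, 'definitions' => $definitions];
}

$result = refifyLinks([
    ['text' => 'this site', 'href' => 'https://example.com'],
    ['text' => 'another site', 'href' => 'https://example.org'],
    ['text' => 'the first site', 'href' => 'https://example.com'],
]);

echo implode("\n", $result['inline']), "\n\n", implode("\n", $result['definitions']), "\n";
```

The first and third links both come out with reference `1`, and only two definitions are emitted.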
Preserve Specific HTML Tags
use Domscribe\Converter;

$html = '<p>This is <span class="highlight">highlighted</span> text.</p>';

$options = ['keep_html' => ['span']];
$markdown = Converter::htmlToMarkdown($html, $options);
echo $markdown;
Output:
This is <span class="highlight">highlighted</span> text.
Tables with Column Identifiers
use Domscribe\Converter;

$html = <<<HTML
<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Alice</td>
            <td>30</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>25</td>
        </tr>
    </tbody>
</table>
HTML;

$markdown = Converter::htmlToMarkdown($html);
echo $markdown;
Output:
| Name <!-- colId: 1 --> | Age <!-- colId: 2 --> |
| --- | --- |
| Alice <!-- colId: 1 --> | 30 <!-- colId: 2 --> |
| Bob <!-- colId: 1 --> | 25 <!-- colId: 2 --> |
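Each cell in the output above carries a comment with its 1-based column index, presumably so that a cell can be matched to its column even when the row is read in isolation. The sketch below reproduces that format under the assumption that the index is simply the cell's position in its row. `renderTableWithColIds` is a hypothetical function, not Domscribe's renderer.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical renderer (not Domscribe's code): tags every cell with its
 * 1-based column index as an HTML comment.
 *
 * @param list<string> $header
 * @param list<list<string>> $rows
 */
function renderTableWithColIds(array $header, array $rows): string
{
    $renderRow = static function (array $cells): string {
        $tagged = [];
        foreach (array_values($cells) as $index => $cell) {
            $tagged[] = sprintf('%s <!-- colId: %d -->', $cell, $index + 1);
        }
        return '| ' . implode(' | ', $tagged) . ' |';
    };

    $lines = [$renderRow($header)];
    // Separator row: one "---" per header column.
    $lines[] = '| ' . implode(' | ', array_fill(0, count($header), '---')) . ' |';
    foreach ($rows as $row) {
        $lines[] = $renderRow($row);
    }

    return implode("\n", $lines);
}

$table = renderTableWithColIds(['Name', 'Age'], [['Alice', '30'], ['Bob', '25']]);
echo $table, "\n";
```

According to the options table, setting `enableTableColumnTracking` to `false` is the documented way to control these comments in Domscribe itself.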
Working with the AST
Domscribe provides access to the Abstract Syntax Tree (AST) for advanced use cases:
use Domscribe\Converter;

$html = '<h1>Title</h1><p>Text with <a href="https://example.com">link</a></p>';

// Convert HTML to AST
$ast = Converter::htmlToMarkdownAst($html);

// Find specific nodes in the AST
$link = Converter::findInMarkdownAst($ast, function ($node) {
    return isset($node['type']) && $node['type'] === 'link';
});

// Find all nodes of a certain type
$allLinks = Converter::findAllInMarkdownAst($ast, function ($node) {
    return isset($node['type']) && $node['type'] === 'link';
});

// Convert AST back to Markdown string
$markdown = Converter::markdownAstToString($ast);
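The predicates above read `$node['type']`, which suggests that nodes are arrays. The README does not document the rest of the node layout, so the following is only a sketch of how a predicate-based search over such a tree could work. It assumes child nodes are stored under a `content` key, and both `findAllNodes` and the sample tree are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical depth-first search over a nested-array AST.
 * Assumption: child nodes live under a 'content' key; Domscribe's
 * real node layout may differ.
 *
 * @param list<array<string, mixed>> $nodes
 * @return list<array<string, mixed>>
 */
function findAllNodes(array $nodes, callable $predicate): array
{
    $matches = [];
    foreach ($nodes as $node) {
        if ($predicate($node)) {
            $matches[] = $node;
        }
        // Recurse only when 'content' holds child nodes rather than a string.
        $children = $node['content'] ?? null;
        if (is_array($children)) {
            $matches = array_merge($matches, findAllNodes($children, $predicate));
        }
    }
    return $matches;
}

// Hand-written sample tree, for illustration only.
$sampleAst = [
    ['type' => 'heading', 'level' => 1, 'content' => [
        ['type' => 'text', 'content' => 'Title'],
    ]],
    ['type' => 'paragraph', 'content' => [
        ['type' => 'text', 'content' => 'Text with '],
        ['type' => 'link', 'href' => 'https://example.com', 'content' => [
            ['type' => 'text', 'content' => 'link'],
        ]],
    ]],
];

$links = findAllNodes($sampleAst, fn (array $n): bool => ($n['type'] ?? null) === 'link');
$texts = findAllNodes($sampleAst, fn (array $n): bool => ($n['type'] ?? null) === 'text');

echo count($links), " link node(s), ", count($texts), " text node(s)\n";
```

A find-first variant would return as soon as the predicate matches instead of collecting every match.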
Running Tests
# Install dependencies
composer install

# Run tests
composer test

# Run with coverage
./vendor/bin/phpunit --coverage-html coverage

# Run static analysis
composer phpstan

# Check code style
composer cs-check

# Fix code style
composer cs-fix
Architecture
The library is organized into several key components:
- Converter: Main entry point and orchestrator
- HtmlToMarkdownAst: Converts HTML DOM to Markdown AST
- MarkdownAstToString: Converts AST to Markdown string
- DomUtils: DOM manipulation and content extraction utilities
- UrlUtils: URL processing and reference-style conversion
- AstUtils: AST traversal and manipulation utilities
- ConversionOptions: Configuration object for customization
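The component list describes a two-stage pipeline: HTML DOM to Markdown AST, then AST to Markdown string. The toy version below illustrates that split with PHP's DOM extension. It handles only `h1`, `p`, and `strong`/`b`, and every function name and node shape in it is an assumption made for the sketch, not the library's internals.

```php
<?php
declare(strict_types=1);

// Stage 1 (hypothetical): walk the DOM and build a nested-array AST.
function domToAst(DOMNode $parent): array
{
    $ast = [];
    foreach ($parent->childNodes as $child) {
        if ($child instanceof DOMText) {
            $ast[] = ['type' => 'text', 'content' => (string) $child->nodeValue];
        } elseif ($child instanceof DOMElement) {
            $type = match ($child->tagName) {
                'h1' => 'heading',
                'p' => 'paragraph',
                'strong', 'b' => 'bold',
                default => 'group', // unhandled tags keep only their children
            };
            $ast[] = ['type' => $type, 'content' => domToAst($child)];
        }
    }
    return $ast;
}

// Stage 2 (hypothetical): serialize the AST to a Markdown string.
function astToMarkdown(array $ast): string
{
    $out = '';
    foreach ($ast as $node) {
        $inner = is_array($node['content'])
            ? astToMarkdown($node['content'])
            : $node['content'];
        $out .= match ($node['type']) {
            'heading' => '# ' . $inner . "\n\n",
            'paragraph' => $inner . "\n\n",
            'bold' => '**' . $inner . '**',
            default => $inner,
        };
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadHTML('<h1>Hello, World!</h1><p>This is a <strong>test</strong>.</p>');
$body = $doc->getElementsByTagName('body')->item(0);

$markdown = trim(astToMarkdown(domToAst($body)));
echo $markdown, "\n";
```

Keeping the AST as an intermediate step is what makes it possible to inspect or filter nodes between parsing and rendering.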
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Credits
- Original TypeScript library: dom-to-semantic-markdown
- Python port: domscribe-python
- PHP port by ACSEO
Related Projects
- domscribe-python - Python version
- dom-to-semantic-markdown - Original TypeScript version
Support
For issues, questions, or contributions, please use the GitHub issue tracker.