dealnews / datocms-html-to-structured-text
Convert HTML to DatoCMS Structured Text (DAST) format
pkg:composer/dealnews/datocms-html-to-structured-text
Requires
- php: ^8.2
- ext-dom: *
- ext-libxml: *
Requires (Dev)
- phpunit/phpunit: ^11.5
README
Convert HTML to DatoCMS Structured Text (DAST format). PHP port of the official JavaScript library.
Requirements
- PHP 8.2+
- DOM extension
- libxml extension
- Composer
Installation
composer require dealnews/datocms-html-to-structured-text
Basic Usage
    <?php
    require_once 'vendor/autoload.php';

    use DealNews\HtmlToStructuredText\Converter;

    // Create converter instance
    $converter = new Converter();

    // Simple HTML
    $html = '<h1>DatoCMS</h1><p>The best <strong>headless CMS</strong>.</p>';
    $dast = $converter->convert($html);

    // Returns:
    // [
    //     'schema' => 'dast',
    //     'document' => [
    //         'type' => 'root',
    //         'children' => [...]
    //     ]
    // ]
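For orientation, the sketch below spells out what the elided `children` might contain for the HTML above, assuming the output follows the published DAST node shapes (a `heading` with a `level`, a `paragraph`, and `span` nodes with optional `marks`). The array is written by hand for illustration and is not captured library output; `dast_plain_text()` is a hypothetical helper, not part of this package.

```php
<?php
declare(strict_types=1);

// Hand-written illustration of a DAST document; not captured library output.
$dast = [
    'schema' => 'dast',
    'document' => [
        'type' => 'root',
        'children' => [
            [
                'type' => 'heading',
                'level' => 1,
                'children' => [['type' => 'span', 'value' => 'DatoCMS']],
            ],
            [
                'type' => 'paragraph',
                'children' => [
                    ['type' => 'span', 'value' => 'The best '],
                    ['type' => 'span', 'marks' => ['strong'], 'value' => 'headless CMS'],
                    ['type' => 'span', 'value' => '.'],
                ],
            ],
        ],
    ],
];

/**
 * Hypothetical helper: concatenates the text of every span under $node.
 */
function dast_plain_text(array $node): string
{
    if (($node['type'] ?? null) === 'span') {
        return (string) ($node['value'] ?? '');
    }
    $text = '';
    foreach ($node['children'] ?? [] as $child) {
        $text .= dast_plain_text($child);
    }
    return $text;
}

echo dast_plain_text($dast['document']), PHP_EOL;
```

A recursive walk like this is usually enough for quick checks in tests, since every DAST node is a plain array with a `type` and, for container nodes, a `children` list.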
Features
- ✅ Converts HTML to valid DAST documents
- ✅ Supports all standard HTML elements
- ✅ Custom handlers for specialized conversions
- ✅ DOM preprocessing hooks
- ✅ Configurable allowed blocks, marks, and heading levels
- ✅ Mark extraction from inline CSS styles
- ✅ URL resolution with <base> tag support
- ✅ Type-safe with comprehensive PHPDoc
Supported Elements
Block Elements
| HTML | DAST Node | Notes |
|---|---|---|
| <h1> - <h6> | heading | Level extracted from tag |
| <p> | paragraph | |
| <ul>, <ol> | list | Style: bulleted/numbered |
| <li> | listItem | |
| <blockquote> | blockquote | |
| <pre>, <code> | code | Language from class attribute |
| <hr> | thematicBreak | |
Inline Elements
| HTML | Mark | Notes |
|---|---|---|
| <strong>, <b> | strong | |
| <em>, <i> | emphasis | |
| <u> | underline | |
| <s>, <strike> | strikethrough | |
| <mark> | highlight | |
| <code> (inline) | code | In paragraph context |
| <a> | link | With URL and optional meta |
| <br> | span with \n | |
Ignored Elements
Scripts, styles, and media elements are ignored: <script>, <style>, <video>, <audio>, <iframe>, <embed>
Advanced Usage
Custom Handlers
Override default conversion for specific elements:
    use DealNews\HtmlToStructuredText\Converter;
    use DealNews\HtmlToStructuredText\Options;
    use DealNews\HtmlToStructuredText\Handlers;

    $converter = new Converter();
    $options = new Options();

    // Custom h1 handler - adds prefix to all h1 headings
    $options->handlers['h1'] = function (
        callable $create_node,
        \DOMNode $node,
        $context
    ) {
        // Use default handler
        $result = Handlers::heading($create_node, $node, $context);

        // Modify result
        if (isset($result['children'][0]['value'])) {
            $result['children'][0]['value'] = '★ ' . $result['children'][0]['value'];
        }

        return $result;
    };

    $html = '<h1>Important</h1>';
    $dast = $converter->convert($html, $options);
    // H1 will have "★ Important" as text
Preprocessing
Modify the DOM before conversion:
    $options = new Options();

    // Convert all <div> tags to <p> tags
    $options->preprocess = function (\DOMDocument $doc): void {
        // Copy the live node list first so replacements do not disturb iteration
        $divs = [];
        foreach ($doc->getElementsByTagName('div') as $div) {
            $divs[] = $div;
        }

        foreach ($divs as $div) {
            $p = $doc->createElement('p');
            while ($div->firstChild) {
                $p->appendChild($div->firstChild);
            }
            $div->parentNode->replaceChild($p, $div);
        }
    };

    $html = '<div>Content</div>';
    $dast = $converter->convert($html, $options);
    // Div becomes paragraph in DAST
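Because a preprocess callback only receives a `\DOMDocument`, it can be developed and tested on its own before it is assigned to `$options->preprocess`. The sketch below is one such callback, under the assumption that unwanted markup carries a class named `ad`; that class name and the `$remove_ads` variable are made up for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical preprocess callback: drops every element carrying class "ad".
$remove_ads = function (\DOMDocument $doc): void {
    $xpath = new \DOMXPath($doc);
    // Match "ad" as a whole class token, not as a substring of another class.
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " ad ")]';
    // Copy matches to an array first so removal does not disturb iteration.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }
};

// Exercise the callback on its own, without the converter.
$previous = libxml_use_internal_errors(true);
$doc = new \DOMDocument();
$doc->loadHTML('<div><p>Keep</p><p class="ad banner">Skip</p></div>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

$remove_ads($doc);
$remaining = $doc->getElementsByTagName('p')->length;
echo $remaining, PHP_EOL;
```

Once the callback behaves as expected in isolation, it would be wired in with `$options->preprocess = $remove_ads;`.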
Configuring Allowed Blocks
Control which block types are allowed:
    $options = new Options();
    $options->allowed_blocks = ['paragraph', 'list']; // Only paragraphs and lists

    $html = '<h1>Title</h1><p>Text</p>';
    $dast = $converter->convert($html, $options);
    // H1 will be converted to paragraph
Configuring Allowed Marks
Control which text marks are allowed:
    $options = new Options();
    $options->allowed_marks = ['strong']; // Only bold

    $html = '<p><strong>Bold</strong> and <em>italic</em></p>';
    $dast = $converter->convert($html, $options);
    // Only strong mark will be applied, emphasis ignored
Configuring Heading Levels
Control which heading levels are preserved:
    $options = new Options();
    $options->allowed_heading_levels = [1, 2]; // Only H1 and H2

    $html = '<h1>H1</h1><h3>H3</h3>';
    $dast = $converter->convert($html, $options);
    // H3 will be converted to paragraph
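One way to check that the three allow-lists above had the intended effect is to take an inventory of what a converted document actually contains. The sketch below assumes the usual DAST array shape (`type`, optional `level`, optional `marks`, optional `children`); `dast_inventory()` is a hypothetical helper and `$document` is a hand-written stand-in for `$dast['document']`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: records every node type, mark, and heading level
 * found under $node into $seen.
 */
function dast_inventory(array $node, array &$seen): void
{
    if (isset($node['type'])) {
        $seen['types'][$node['type']] = true;
    }
    if (($node['type'] ?? null) === 'heading' && isset($node['level'])) {
        $seen['levels'][$node['level']] = true;
    }
    foreach ($node['marks'] ?? [] as $mark) {
        $seen['marks'][$mark] = true;
    }
    foreach ($node['children'] ?? [] as $child) {
        dast_inventory($child, $seen);
    }
}

// Hand-written stand-in for $dast['document'] so the sketch runs on its own.
$document = [
    'type' => 'root',
    'children' => [
        ['type' => 'heading', 'level' => 1, 'children' => [['type' => 'span', 'value' => 'H1']]],
        ['type' => 'paragraph', 'children' => [
            ['type' => 'span', 'marks' => ['strong'], 'value' => 'Bold'],
            ['type' => 'span', 'value' => ' and italic'],
        ]],
    ],
];

$seen = ['types' => [], 'marks' => [], 'levels' => []];
dast_inventory($document, $seen);

// Anything left here is a mark that the allow-list should have filtered out.
$unexpected_marks = array_diff(array_keys($seen['marks']), ['strong']);
```

An empty `$unexpected_marks` array means every mark in the document is on the allow-list; the same comparison works for `$seen['types']` and `$seen['levels']`.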
Options Reference
Options Class
    class Options
    {
        // Whether to preserve newlines in text
        public bool $newlines = false;

        // Custom handler overrides
        public array $handlers = [];

        // Preprocessing function
        public $preprocess = null;

        // Allowed block types
        public array $allowed_blocks = [
            'blockquote', 'code', 'heading', 'link', 'list'
        ];

        // Allowed mark types
        public array $allowed_marks = [
            'strong', 'code', 'emphasis', 'underline',
            'strikethrough', 'highlight'
        ];

        // Allowed heading levels (1-6)
        public array $allowed_heading_levels = [1, 2, 3, 4, 5, 6];
    }
API Reference
Converter::convert(string $html, ?Options $options = null): ?array
Converts HTML string to DAST document.
Parameters:
- $html - HTML string to convert
- $options - Optional conversion options
Returns: DAST document array or null if empty
Throws: ConversionError if conversion fails
Converter::convertDocument(\DOMDocument $doc, ?Options $options = null): ?array
Converts a DOMDocument to DAST (for pre-parsed HTML).
Parameters:
- $doc - DOMDocument to convert
- $options - Optional conversion options
Returns: DAST document array or null if empty
Throws: ConversionError if conversion fails
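When using `convertDocument()`, the caller is responsible for parsing the HTML. The sketch below shows one common way to build the `\DOMDocument` with the DOM extension, assuming UTF-8 input. `load_html_document()` is a hypothetical helper, and the prepended `<meta>` element is a widely used workaround for libxml's default encoding guess, not something this package is known to require.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: parses an HTML fragment into a DOMDocument,
 * treating the input as UTF-8 and silencing libxml parse warnings.
 */
function load_html_document(string $html): \DOMDocument
{
    $previous = libxml_use_internal_errors(true);

    $doc = new \DOMDocument();
    // The meta element tells libxml the encoding; without a hint it may
    // decode the bytes as ISO-8859-1 and mangle non-ASCII characters.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );

    // Discard collected warnings and restore the caller's error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = load_html_document('<p>Café</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

The resulting document would then be passed along as `$converter->convertDocument($doc, $options)`.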
Special Features
Code Block Language Detection
The library extracts programming language from code block class names:
    $html = '<pre><code class="language-javascript">const x = 1;</code></pre>';
    $dast = $converter->convert($html);
    // Result will have: ['type' => 'code', 'language' => 'javascript', 'code' => 'const x = 1;']
Default prefix is language- but can be customized in context.
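The prefix matching described above can be pictured with a small standalone function. This is a sketch of the idea only, not the library's implementation; `language_from_class()` and its `$prefix` parameter are hypothetical names.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the language named by the first class
 * that starts with $prefix, or null when no class matches.
 */
function language_from_class(string $class_attr, string $prefix = 'language-'): ?string
{
    $classes = preg_split('/\s+/', trim($class_attr), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($classes as $class) {
        // Require at least one character after the prefix.
        if (str_starts_with($class, $prefix) && strlen($class) > strlen($prefix)) {
            return substr($class, strlen($prefix));
        }
    }
    return null;
}

echo language_from_class('hljs language-javascript'), PHP_EOL;
```

Passing a different `$prefix`, such as `lang-`, models the customization mentioned above.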
Link Meta Extraction
Link meta attributes (target, rel, title) are extracted:
    $html = '<a href="https://example.com" target="_blank" rel="noopener">Link</a>';
    $dast = $converter->convert($html);
    // Result will have meta array: [['id' => 'target', 'value' => '_blank'], ...]
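To make the meta format concrete, the sketch below builds the same `id`/`value` pairs directly from a DOM element. It illustrates the shape shown in the comment above and is not the library's own code; `link_meta()` is a hypothetical helper, and the attribute order in its output is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: collects target, rel, and title from an anchor
 * into a list of ['id' => ..., 'value' => ...] pairs.
 */
function link_meta(\DOMElement $anchor): array
{
    $meta = [];
    foreach (['target', 'rel', 'title'] as $name) {
        if ($anchor->hasAttribute($name)) {
            $meta[] = ['id' => $name, 'value' => $anchor->getAttribute($name)];
        }
    }
    return $meta;
}

$doc = new \DOMDocument();
$doc->loadHTML('<a href="https://example.com" target="_blank" rel="noopener">Link</a>');
$meta = link_meta($doc->getElementsByTagName('a')->item(0));
print_r($meta);
```

Attributes that are absent from the element are simply left out, so a bare `<a href="...">` yields an empty list.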
Inline Style Mark Extraction
The library can extract marks from inline CSS styles:
    $html = '<span style="font-weight: bold">Bold via style</span>';
    $dast = $converter->convert($html);
    // Creates span with strong mark
Supported style properties:
- font-weight: bold or font-weight > 400 → strong
- font-style: italic → emphasis
- text-decoration: underline → underline
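The mapping above can be sketched as a small parser over the `style` attribute. This is an illustration of the three rules only, not the library's implementation; `marks_from_style()` is a hypothetical helper, and it handles only simple `property: value` declarations.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: maps simple inline style declarations to DAST marks.
 */
function marks_from_style(string $style): array
{
    $marks = [];
    foreach (explode(';', $style) as $declaration) {
        if (!str_contains($declaration, ':')) {
            continue; // Skip empty or malformed declarations.
        }
        [$property, $value] = array_map('trim', explode(':', $declaration, 2));
        $property = strtolower($property);
        $value = strtolower($value);

        $is_bold = $value === 'bold' || (is_numeric($value) && (int) $value > 400);
        if ($property === 'font-weight' && $is_bold) {
            $marks[] = 'strong';
        } elseif ($property === 'font-style' && $value === 'italic') {
            $marks[] = 'emphasis';
        } elseif ($property === 'text-decoration' && $value === 'underline') {
            $marks[] = 'underline';
        }
    }
    // Drop duplicates while keeping a plain zero-based list.
    return array_values(array_unique($marks));
}

print_r(marks_from_style('font-weight: 700; font-style: italic'));
```

Note that a weight of exactly 400 produces no mark, matching the "> 400" rule in the list.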
URL Resolution with Base Tag
The <base> tag is respected for relative URL resolution:
    $html = '<base href="https://example.com/"><a href="/page">Link</a>';
    $dast = $converter->convert($html);
    // Link URL will be resolved to: https://example.com/page
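For readers unfamiliar with how a relative `href` combines with a base URL, here is a deliberately simplified resolver. It is a sketch under stated assumptions: `resolve_url()` is a hypothetical function, the base is assumed to be an absolute URL with a scheme and host, and "." or ".." segments are not normalized. The library's actual resolution logic may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical, simplified resolver. It does not normalize "." or ".."
 * segments and is not a full RFC 3986 implementation.
 */
function resolve_url(string $href, string $base): string
{
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href; // Already absolute (or unparseable): leave untouched.
    }

    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (str_starts_with($href, '//')) {
        return $parts['scheme'] . ':' . $href; // Protocol-relative.
    }
    if (str_starts_with($href, '/')) {
        return $origin . $href; // Root-relative.
    }

    // Relative to the directory portion of the base path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolve_url('/page', 'https://example.com/'), PHP_EOL;
```

The root-relative branch is the one exercised by the `<base>` example above.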
Error Handling
The library throws ConversionError exceptions when conversion fails:
    use DealNews\HtmlToStructuredText\ConversionError;

    try {
        $dast = $converter->convert($html);
    } catch (ConversionError $e) {
        echo "Conversion failed: " . $e->getMessage();
        $node = $e->getNode(); // Get problematic DOM node if available
    }
Edge Cases
Whitespace Handling
- Single whitespace-only spans are removed when wrapped
- Newlines in text are preserved if $options->newlines = true
- In headings, newlines are converted to spaces (headings can't have line breaks)
Nested Lists
Nested lists are fully supported:
$html = '<ul><li>Item<ul><li>Nested</li></ul></li></ul>'; // Converts correctly to nested list structure
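As a rough picture of what "nested list structure" means in DAST terms, the sketch below writes out a plausible result for the HTML above by hand, assuming the standard DAST shape where a `listItem` holds a `paragraph` followed by a child `list`. It is not captured library output, and `list_depth()` is a hypothetical helper for checking nesting.

```php
<?php
declare(strict_types=1);

// Hand-written illustration; not captured library output.
$list = [
    'type' => 'list',
    'style' => 'bulleted',
    'children' => [[
        'type' => 'listItem',
        'children' => [
            ['type' => 'paragraph', 'children' => [['type' => 'span', 'value' => 'Item']]],
            [
                'type' => 'list',
                'style' => 'bulleted',
                'children' => [[
                    'type' => 'listItem',
                    'children' => [
                        ['type' => 'paragraph', 'children' => [['type' => 'span', 'value' => 'Nested']]],
                    ],
                ]],
            ],
        ],
    ]],
];

/**
 * Hypothetical helper: returns how many list nodes are nested on the
 * deepest path at or below $node.
 */
function list_depth(array $node): int
{
    $deepest = 0;
    foreach ($node['children'] ?? [] as $child) {
        $deepest = max($deepest, list_depth($child));
    }
    return $deepest + (($node['type'] ?? null) === 'list' ? 1 : 0);
}

echo list_depth($list), PHP_EOL;
```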
Mixed Inline/Block Content
Links and other hybrid elements are handled correctly:
$html = '<a href="#"><span>Inline</span><p>Block</p></a>'; // Properly splits into separate nodes
Differences from JavaScript Version
- No Promises: PHP handlers return directly (synchronous)
- No Hast: Works directly with PHP DOMDocument instead of intermediate tree
- Array Structure: DAST nodes are arrays (not objects)
- Error Handling: Uses exceptions instead of rejection
Development
Running Tests
    composer install
    ./vendor/bin/phpunit
Current test coverage: 86%+
Running Examples
    php examples/basic.php
    php examples/custom_handlers.php
    php examples/preprocessing.php
License
BSD 3-Clause License - see LICENSE file for details
Credits
This is a PHP port of the official DatoCMS HTML to Structured Text JavaScript library.
Ported and maintained by DealNews.
Related Projects
- datocms-structured-text-to-html-string - Convert DAST to HTML (the inverse operation)