goldziher / html-to-markdown
Native html_to_markdown PHP extension built on the Rust html-to-markdown engine.
Installs: 8
Dependents: 0
Suggesters: 0
Security: 0
Stars: 422
Watchers: 5
Forks: 40
Open Issues: 0
Language:HTML
Type:php-ext
Ext name:ext-html_to_markdown
pkg:composer/goldziher/html-to-markdown
Requires
- php: ^8.2
Requires (Dev)
- php/pie: ^1.0
- dev-main
- v2.15.0
- v2.14.11
- v2.14.10
- v2.14.9
- v2.14.8
- v2.14.7
- v2.14.6
- v2.14.5
- v2.14.4
- v2.14.3
- v2.14.2
- v2.14.1
- v2.14.0
- v2.13.0
- v2.12.1
- v2.12.0
- v2.11.4
- v2.11.3
- v2.11.2
- v2.11.1
- v2.11.0
- v2.10.1
- v2.10.0
- v2.9.3
- v2.9.2
- v2.9.1
- v2.9.0
- v2.8.3
- v2.8.2
- v2.8.1
- v2.8.0
- v2.7.2
- v2.7.1
- v2.7.0
- v2.6.6
- v2.6.5
- v2.6.4
- v2.6.3
- v2.6.2
- v2.6.1
- v2.6.0
- v2.5.7
- dev-feat/java-panama-ffi
This package is auto-updated.
Last update: 2025-12-19 08:51:58 UTC
README
High-performance HTML → Markdown conversion powered by Rust. Shipping as a Rust crate, Python package, PHP extension, Ruby gem, Elixir Rustler NIF, Node.js bindings, WebAssembly, and standalone CLI with identical rendering behaviour.
🎮 Try the Live Demo →
Experience WebAssembly-powered HTML to Markdown conversion instantly in your browser. No installation needed!
Why html-to-markdown?
- Blazing Fast: Rust-powered core delivers 10-80× faster conversion than pure Python alternatives
- Universal: Works everywhere - Node.js, Bun, Deno, browsers, Python, Rust, and standalone CLI
- Smart Conversion: Handles complex documents including nested tables, code blocks, task lists, and hOCR OCR output
- Metadata Extraction: Extract document metadata (title, description, headers, links, images) alongside conversion
- Highly Configurable: Control heading styles, code block fences, list formatting, whitespace handling, and HTML sanitization
- Tag Preservation: Keep specific HTML tags unconverted when markdown isn't expressive enough
- Secure by Default: Built-in HTML sanitization prevents malicious content
- Consistent Output: Identical markdown rendering across all language bindings
Documentation
Language Guides & API References:
- Python – README with metadata extraction, inline images, hOCR workflows
- JavaScript/TypeScript – Node.js | TypeScript | WASM
- Ruby – README with RBS types, Steep type checking
- PHP – Package | Extension (PIE)
- Go – README with FFI bindings
- Java – README with Panama FFI, Maven/Gradle setup
- C#/.NET – README with NuGet distribution
- Elixir – README with Rustler NIF bindings
- Rust – README with core API, error handling, advanced features
Project Resources:
- Contributing – CONTRIBUTING.md ⭐ Start here for development
- Changelog – CHANGELOG.md – Version history and breaking changes
Installation
| Target | Command(s) |
|---|---|
| Node.js/Bun (native) | npm install html-to-markdown-node |
| WebAssembly (universal) | npm install html-to-markdown-wasm |
| Deno | import { convert } from "npm:html-to-markdown-wasm" |
| Python (bindings + CLI) | pip install html-to-markdown |
| PHP (extension + helpers) | PHP_EXTENSION_DIR=$(php-config --extension-dir) pie install goldziher/html-to-markdowncomposer require goldziher/html-to-markdown |
| Ruby gem | bundle add html-to-markdown or gem install html-to-markdown |
| Elixir (Rustler NIF) | {:html_to_markdown, "~> 2.8"} |
| Rust crate | cargo add html-to-markdown-rs |
| Rust CLI (crates.io) | cargo install html-to-markdown-cli |
| Homebrew CLI | brew install html-to-markdown (core) |
| Releases | GitHub Releases |
Quick Start
JavaScript/TypeScript
Node.js / Bun (Native - Fastest):
import { convert } from 'html-to-markdown-node'; const html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>'; const markdown = convert(html, { headingStyle: 'Atx', codeBlockStyle: 'Backticks', wrap: true, preserveTags: ['table'], // NEW in v2.5: Keep complex HTML as-is });
Deno / Browsers / Edge (Universal):
import { convert } from "npm:html-to-markdown-wasm"; // Deno // or: import { convert } from 'html-to-markdown-wasm'; // Bundlers const markdown = convert(html, { headingStyle: 'atx', listIndentWidth: 2, });
Performance: The shared fixture harness (task bench:bindings) now clocks C# at ~1.4k ops/sec (≈171 MB/s), Go at ~1.3k ops/sec (≈165 MB/s), Node, Python, and the Rust CLI at ~1.3–1.4k ops/sec (≈150 MB/s) on the 129 KB Wikipedia "Lists" page thanks to the new Buffer/Uint8Array fast paths and release-mode harness. Ruby stays close at ~1.2k ops/sec (≈150 MB/s), Java lands at ~1.0k ops/sec (≈126 MB/s), WASM hits ~0.85k ops/sec (≈108 MB/s), and PHP achieves ~0.3k ops/sec (≈35 MB/s)—all providing excellent throughput for production workloads.
See the JavaScript guides for full API documentation:
Metadata extraction (all languages)
import { convertWithMetadata } from 'html-to-markdown-node'; const html = ` <html> <head> <title>Example</title> <meta name="description" content="Demo page"> <link rel="canonical" href="https://example.com/page"> </head> <body> <h1 id="welcome">Welcome</h1> <a href="https://example.com" rel="nofollow external">Example link</a> <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480"> </body> </html> `; const { markdown, metadata } = await convertWithMetadata( html, { headingStyle: 'Atx' }, { extract_links: true, extract_images: true, extract_headers: true }, ); console.log(markdown); // metadata.document.title === 'Example' // metadata.links[0].rel === ['nofollow', 'external'] // metadata.images[0].dimensions === [640, 480]
Equivalent APIs are available in every binding:
- Python:
convert_with_metadata(html, options=None, metadata_config=None) - Ruby:
HtmlToMarkdown.convert_with_metadata(html, options = nil, metadata_config = nil) - PHP:
convert_with_metadata(string $html, ?array $options = null, ?array $metadataConfig = null)
CLI
# Convert a file html-to-markdown input.html > output.md # Stream from stdin curl https://example.com | html-to-markdown > output.md # Apply options html-to-markdown --heading-style atx --list-indent-width 2 input.html # Fetch a remote page (HTTP) with optional custom User-Agent html-to-markdown --url https://example.com > output.md html-to-markdown --url https://example.com --user-agent "Mozilla/5.0" > output.md
Metadata Extraction
Extract document metadata alongside HTML-to-Markdown conversion. All bindings support identical APIs:
CLI Examples
# Basic metadata extraction with conversion html-to-markdown input.html --with-metadata -o output.json # Extract document metadata (title, description, language, etc.) html-to-markdown input.html --with-metadata --extract-document # Extract headers and links html-to-markdown input.html --with-metadata --extract-headers --extract-links # Extract all metadata types with conversion html-to-markdown input.html --with-metadata \ --extract-document \ --extract-headers \ --extract-links \ --extract-images \ --extract-structured-data \ -o metadata.json # Fetch and extract from remote URL html-to-markdown --url https://example.com --with-metadata -o output.json # Web scraping with preprocessing and metadata html-to-markdown page.html --preprocess --preset aggressive \ --with-metadata --extract-links --extract-images
Output format (JSON):
{
"markdown": "# Title\n\nContent here...",
"metadata": {
"document": {
"title": "Page Title",
"description": "Meta description",
"charset": "utf-8",
"language": "en"
},
"headers": [
{ "level": 1, "text": "Title", "id": "title" }
],
"links": [
{
"text": "Example",
"href": "https://example.com",
"title": null,
"rel": ["external"]
}
],
"images": [
{
"src": "https://example.com/image.jpg",
"alt": "Hero image",
"title": null,
"dimensions": [640, 480]
}
]
}
}
Python Example
from html_to_markdown import convert_with_metadata html = ''' <html> <head> <title>Product Guide</title> <meta name="description" content="Complete product documentation"> </head> <body> <h1>Getting Started</h1> <p>Visit our <a href="https://example.com">website</a> for more.</p> <img src="https://example.com/guide.jpg" alt="Setup diagram" width="800" height="600"> </body> </html> ''' markdown, metadata = convert_with_metadata( html, options={'heading_style': 'Atx'}, metadata_config={ 'extract_document': True, 'extract_headers': True, 'extract_links': True, 'extract_images': True, } ) print(markdown) print(f"Title: {metadata['document']['title']}") print(f"Links found: {len(metadata['links'])}")
TypeScript/Node.js Example
import { convertWithMetadata } from 'html-to-markdown-node'; const html = ` <html> <head> <title>Article</title> <meta name="description" content="Tech article"> </head> <body> <h1>Web Performance</h1> <p>Read our <a href="/blog">blog</a> for tips.</p> <img src="/perf.png" alt="Chart" width="1200" height="630"> </body> </html> `; const { markdown, metadata } = await convertWithMetadata(html, { headingStyle: 'Atx', }, { extract_document: true, extract_headers: true, extract_links: true, extract_images: true, }); console.log(markdown); console.log(`Found ${metadata.headers.length} headers`); console.log(`Found ${metadata.links.length} links`);
Ruby Example
require 'html_to_markdown' html = <<~HTML <html> <head> <title>Documentation</title> <meta name="description" content="API Reference"> </head> <body> <h2>Installation</h2> <p>See our <a href="https://github.com">GitHub</a>.</p> <img src="https://example.com/diagram.svg" alt="Architecture" width="960" height="540"> </body> </html> HTML markdown, metadata = HtmlToMarkdown.convert_with_metadata( html, options: { heading_style: :atx }, metadata_config: { extract_document: true, extract_headers: true, extract_links: true, extract_images: true, } ) puts markdown puts "Title: #{metadata[:document][:title]}" puts "Images: #{metadata[:images].length}"
PHP Example
<?php use HtmlToMarkdown\HtmlToMarkdown; $html = <<<HTML <html> <head> <title>Tutorial</title> <meta name="description" content="Step-by-step guide"> </head> <body> <h1>Getting Started</h1> <p>Check our <a href="https://example.com/guide">guide</a>.</p> <img src="https://example.com/steps.png" alt="Steps" width="1024" height="768"> </body> </html> HTML; [$markdown, $metadata] = convert_with_metadata( $html, options: ['heading_style' => 'Atx'], metadataConfig: [ 'extract_document' => true, 'extract_headers' => true, 'extract_links' => true, 'extract_images' => true, ] ); echo "Title: " . $metadata['document']['title'] . "\n"; echo "Found " . count($metadata['links']) . " links\n";
Go Example
package main import ( "encoding/json" "fmt" "log" "github.com/Goldziher/html-to-markdown/packages/go/v2/htmltomarkdown" ) func main() { html := ` <html> <head> <title>Developer Guide</title> <meta name="description" content="Complete API reference"> </head> <body> <h1>API Overview</h1> <p>Learn more at our <a href="https://api.example.com/docs">API docs</a>.</p> <img src="https://example.com/api-flow.png" alt="API Flow" width="1280" height="720"> </body> </html> ` markdown, metadata, err := htmltomarkdown.ConvertWithMetadata(html, &htmltomarkdown.MetadataConfig{ ExtractDocument: true, ExtractHeaders: true, ExtractLinks: true, ExtractImages: true, ExtractStructuredData: false, }) if err != nil { log.Fatal(err) } fmt.Println("Markdown:", markdown) fmt.Printf("Title: %s\n", metadata.Document.Title) fmt.Printf("Found %d links\n", len(metadata.Links)) // Marshal to JSON if needed jsonBytes, _ := json.MarshalIndent(metadata, "", " ") fmt.Println(string(jsonBytes)) }
Java Example
import io.github.goldziher.htmltomarkdown.HtmlToMarkdown; import io.github.goldziher.htmltomarkdown.ConversionResult; import com.google.gson.Gson; import com.google.gson.GsonBuilder; public class MetadataExample { public static void main(String[] args) { String html = """ <html> <head> <title>Java Guide</title> <meta name="description" content="Complete Java bindings documentation"> </head> <body> <h1>Quick Start</h1> <p>Visit our <a href="https://github.com/Goldziher/html-to-markdown">GitHub</a>.</p> <img src="https://example.com/java-flow.png" alt="Flow diagram" width="1024" height="576"> </body> </html> """; try { ConversionResult result = HtmlToMarkdown.convertWithMetadata( html, new HtmlToMarkdown.MetadataOptions() .extractDocument(true) .extractHeaders(true) .extractLinks(true) .extractImages(true) ); System.out.println("Markdown:\n" + result.getMarkdown()); System.out.println("Title: " + result.getMetadata().getDocument().getTitle()); System.out.println("Links found: " + result.getMetadata().getLinks().size()); // Pretty-print metadata as JSON Gson gson = new GsonBuilder().setPrettyPrinting().create(); System.out.println(gson.toJson(result.getMetadata())); } catch (HtmlToMarkdown.ConversionException e) { System.err.println("Conversion failed: " + e.getMessage()); } } }
C# Example
using HtmlToMarkdown; using System.Text.Json; var html = @" <html> <head> <title>C# Guide</title> <meta name=""description"" content=""Official C# bindings documentation""> </head> <body> <h1>Introduction</h1> <p>See our <a href=""https://github.com/Goldziher/html-to-markdown"">repository</a>.</p> <img src=""https://example.com/csharp-arch.png"" alt=""Architecture"" width=""1200"" height=""675""> </body> </html> "; try { var result = HtmlToMarkdownConverter.ConvertWithMetadata( html, new MetadataConfig { ExtractDocument = true, ExtractHeaders = true, ExtractLinks = true, ExtractImages = true, } ); Console.WriteLine("Markdown:"); Console.WriteLine(result.Markdown); Console.WriteLine($"Title: {result.Metadata.Document.Title}"); Console.WriteLine($"Links found: {result.Metadata.Links.Count}"); // Serialize metadata to JSON var options = new JsonSerializerOptions { WriteIndented = true }; var json = JsonSerializer.Serialize(result.Metadata, options); Console.WriteLine(json); } catch (HtmlToMarkdownException ex) { Console.Error.WriteLine($"Conversion failed: {ex.Message}"); }
See the individual binding READMEs for detailed metadata extraction options:
- Python – Python README
- TypeScript/Node.js – Node.js README | TypeScript README
- Ruby – Ruby README
- PHP – PHP README
- Go – Go README
- Java – Java README
- C#/.NET – C# README
- WebAssembly – WASM README
- Rust – Rust README
Python (v2 API)
from html_to_markdown import convert, convert_with_inline_images, InlineImageConfig html = "<h1>Hello</h1><p>Rust ❤️ Markdown</p>" markdown = convert(html) markdown, inline_images, warnings = convert_with_inline_images( '<img src="data:image/png;base64,...==" alt="Pixel">', image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True), )
Elixir
{:ok, markdown} = HtmlToMarkdown.convert("<h1>Hello</h1>") # Keyword options are supported (internally mapped to the Rust ConversionOptions struct) HtmlToMarkdown.convert!("<p>Wrap me</p>", wrap: true, wrap_width: 32, preprocessing: %{enabled: true})
Rust
use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle}; let html = "<h1>Welcome</h1><p>Fast conversion</p>"; let markdown = convert(html, None)?; let options = ConversionOptions { heading_style: HeadingStyle::Atx, ..Default::default() }; let markdown = convert(html, Some(options))?;
See the language-specific READMEs for complete configuration, hOCR workflows, and inline image extraction.
Performance
Benchmarked on Apple M4 with complex real-world documents (Wikipedia articles, tables, lists):
Operations per Second (higher is better)
Derived directly from tools/runtime-bench/results/latest.json (Apple M4, shared fixtures):
| Fixture | Node.js (NAPI) | WASM | Python (PyO3) | Speedup (Node vs Python) |
|---|---|---|---|---|
| Lists (Timeline) | 1,308 | 882 | 1,405 | 0.9× |
| Tables (Countries) | 331 | 242 | 352 | 0.9× |
| Medium (Python) | 150 | 121 | 158 | 1.0× |
| Large (Rust) | 163 | 124 | 183 | 0.9× |
| Small (Intro) | 208 | 163 | 223 | 0.9× |
| HOCR German PDF | 2,944 | 1,637 | 2,991 | 1.0× |
| HOCR Invoice | 27,326 | 7,775 | 23,500 | 1.2× |
| HOCR Tables | 3,475 | 1,667 | 3,464 | 1.0× |
Average Performance Summary
| Implementation | Avg ops/sec (fixtures) | vs Python | Notes |
|---|---|---|---|
| Rust CLI/Binary | 4,996 | 1.2× faster | Preprocessing now stays in one pass + reuses parse_owned, so the CLI leads every fixture |
| Node.js (NAPI-RS) | 4,488 | 1.0× | Buffer/handle combo keeps Node within ~10 % of the Rust core while serving JS runtimes |
| Ruby (magnus) | 4,278 | 0.9× | Still extremely fast; ~25 k ops/sec on HOCR invoices without extra work |
| Python (PyO3) | 4,034 | baseline | Release-mode harness plus handle reuse keep it competitive, but it now trails Node/Rust |
| WebAssembly | 1,576 | 0.4× | Portable option for Deno/browsers/edge using the new byte APIs |
| PHP (ext) | 1,480 | 0.4× | Composer extension holds steady at 35–70 MB/s once the PIE build is installed |
Key Insights
- Rust now leads throughput: the fused preprocessing +
parse_ownedpathway pushes the CLI to ~1.7 k ops/sec on the 129 KB lists page and ~31 k ops/sec on the HOCR invoice fixture. - Node.js trails by only a few percent after the buffer/handle work—~1.3 k ops/sec on the lists fixture and 27 k ops/sec on HOCR invoices without any UTF-16 copies.
- Python remains competitive but now sits below Node/Rust (~4.0 k average ops/sec); stick to the v2 API to avoid the deprecated compatibility shim.
- Elixir matches the Rust core because the Rustler NIF executes the same
ConversionOptionspipeline—benchmarks land between 170–1,460 ops/sec on the Wikipedia fixtures and >20 k ops/sec on micro HOCR payloads. - PHP and WASM stay in the 35–70 MB/s band, which is plenty for Composer queues or edge runtimes as long as the extension/module is built ahead of time.
- Rust CLI results now mirror the bindings, since
task bench:bindingsruns the harness withcargo run --releaseby default—profile there, then push optimizations down into each FFI layer.
Runtime Benchmarks (PHP / Ruby / Python / Node / WASM)
Measured on Apple M4 using the fixture-driven runtime harness in tools/runtime-bench (task bench:bindings). Every binding consumes the exact same HTML fixtures and hOCR samples from test_documents/:
| Document | Size | Ruby ops/sec | PHP ops/sec | Python ops/sec | Node ops/sec | WASM ops/sec | Elixir ops/sec | Rust ops/sec |
|---|---|---|---|---|---|---|---|---|
| Lists (Timeline) | 129 KB | 1,349 | 533 | 1,405 | 1,308 | 882 | 1,463 | 1,700 |
| Tables (Countries) | 360 KB | 326 | 118 | 352 | 331 | 242 | 357 | 416 |
| Medium (Python) | 657 KB | 157 | 59 | 158 | 150 | 121 | 171 | 190 |
| Large (Rust) | 567 KB | 174 | 65 | 183 | 163 | 124 | 174 | 220 |
| Small (Intro) | 463 KB | 214 | 83 | 223 | 208 | 163 | 247 | 258 |
| HOCR German PDF | 44 KB | 2,936 | 1,007 | 2,991 | 2,944 | 1,637 | 3,113 | 2,760 |
| HOCR Invoice | 4 KB | 25,740 | 8,781 | 23,500 | 27,326 | 7,775 | 20,424 | 31,345 |
| HOCR Embedded Tables | 37 KB | 3,328 | 1,194 | 3,464 | 3,475 | 1,667 | 3,366 | 3,080 |
The harness shells out to each runtime’s lightweight benchmark driver (packages/*/bin/benchmark.*, crates/*/bin/benchmark.ts), feeds fixtures defined in tools/runtime-bench/fixtures/*.toml, and writes machine-readable JSON reports (tools/runtime-bench/results/latest.json) for regression tracking. Add new languages or scenarios by extending those fixture files and drivers.
Use task bench:bindings to regenerate throughput numbers across all bindings or task bench:bindings:profile to capture CPU/memory samples while the benchmarks run. To focus on specific languages or fixtures (for example, task bench:bindings -- --language elixir), pass --language / --fixture directly to cargo run --manifest-path tools/runtime-bench/Cargo.toml -- ….
Need a call-stack view of the Rust core? Run task flamegraph:rust (or call the harness with --language rust --flamegraph path.svg) to profile a fixture and dump a ready-to-inspect flamegraph in tools/runtime-bench/results/.
Note on Python performance: The current Python bindings have optimization opportunities. The v2 API with direct convert() calls performs best; avoid the v1 compatibility layer for performance-critical applications.
Compatibility (v1 → v2)
Testing
Use the task runner to execute the entire matrix locally:
# All core test suites (Rust, Python, Ruby, Node, PHP, Go, C#, Elixir, Java) task test # Run the Wasmtime-backed WASM integration tests task wasm:test:wasmtime
The Wasmtime suite builds the html-to-markdown-wasm artifact with the same flags used in CI and drives it through Wasmtime to ensure the non-JS runtime behaves exactly like the browser/Deno builds.
- V2’s Rust core sustains 150–210 MB/s throughput; V1 averaged ≈ 2.5 MB/s in its Python/BeautifulSoup implementation (60–80× faster).
- The Python package offers a compatibility shim in
html_to_markdown.v1_compat(convert_to_markdown,convert_to_markdown_stream,markdownify). The shim is deprecated, emitsDeprecationWarningon every call, and will be removed in v3.0—plan migrations now. Details and keyword mappings live in Python README. - CLI flag changes, option renames, and other breaking updates are summarised in CHANGELOG.
Community
- Chat with us on Discord
- Explore the broader Kreuzberg document-processing ecosystem
- Sponsor development via GitHub Sponsors
Ruby
require 'html_to_markdown' html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>' markdown = HtmlToMarkdown.convert(html, heading_style: :atx, wrap: true) puts markdown # # Hello # # Rust ❤️ Markdown
See the language-specific READMEs for complete configuration, hOCR workflows, and inline image extraction.