helgesverre / markdown
A dirty-tricks Markdown parser (PHP FFI -> md4c).
Requires
- php: ^8.5
- ext-ffi: *
Requires (Dev)
- carthage-software/mago: ^1.30
- league/commonmark: ^2.7
- phpbench/phpbench: ^1.6
- phpunit/phpunit: ^13.2
- symfony/yaml: ^8.0
- tempest/markdown: ^1.0
README
A fast PHP Markdown parser backed by md4c through PHP FFI.
It renders GitHub-flavored Markdown, supports front matter and heading TOCs, and ships prebuilt native libraries so normal installs do not need a C compiler.
Install
composer require helgesverre/markdown
Requirements:
- PHP 8.5+
- ext-ffi
- ffi.enable=1 for web/FPM use, or an opcache preload setup
Bundled native artifacts are selected at runtime:
| Platform | Artifact |
|---|---|
| macOS Apple Silicon + Intel | lib/darwin/libmd4cshim.dylib |
| Linux x86-64 | lib/linux-x86_64/libmd4cshim.so |
| Linux aarch64 | lib/linux-aarch64/libmd4cshim.so |
| Windows x64 | lib/windows-x86_64/md4cshim.dll |
HelgeSverre\Markdown\Ffi\Library::path() resolves libraries in this order:
1. $MARKDOWN_FFI_LIB
2. the bundled lib/<platform>/ binary
3. a local native/ build
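The lookup order can be pictured with a small sketch. This is not the code behind Library::path(): the function name resolveLibraryPath(), the rule that an override must point at an existing file, and the native/ file name are all assumptions made for illustration.

```php
<?php

/**
 * Hypothetical sketch of the lookup order described above.
 * Not the library's actual implementation.
 *
 * @param list<string> $fallbacks Candidate paths, highest priority first.
 */
function resolveLibraryPath(array $fallbacks): ?string
{
    // 1. An explicit override through the environment wins
    //    (assumption: it is only honoured when the file exists).
    $override = getenv('MARKDOWN_FFI_LIB');
    if (is_string($override) && $override !== '' && is_file($override)) {
        return $override;
    }

    // 2. and 3. Otherwise take the first candidate that exists on disk.
    foreach ($fallbacks as $candidate) {
        if (is_file($candidate)) {
            return $candidate;
        }
    }

    return null;
}

$path = resolveLibraryPath([
    __DIR__ . '/lib/linux-x86_64/libmd4cshim.so', // bundled binary
    __DIR__ . '/native/libmd4cshim.so',           // local build (file name assumed)
]);
```

Setting MARKDOWN_FFI_LIB is the natural way to point the parser at a custom build without touching the bundled files.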
Usage
Render HTML
    use HelgeSverre\Markdown\Markdown;

    $html = Markdown::toHtml("# Hello\n\n- a\n- b\n");

    $htmls = Markdown::toHtmlBatch([
        "# One\n",
        "# Two\n",
    ]);
toHtml() is the fast path: Markdown in, HTML out. toHtmlBatch() packs many documents into one native call and renders them across a C thread pool where pthreads are available.
For explicit lifecycle and options, construct the parser directly:
    use HelgeSverre\Markdown\Data\Dialect;
    use HelgeSverre\Markdown\Parser;

    $parser = new Parser(
        dialect: Dialect::GitHub,
        safe: false,
        xhtml: false,
    );

    $html = $parser->toHtml("# Hello\n");
Parse Documents
parse() strips YAML front matter, renders the body, injects GitHub-style heading ids, and returns a ParsedMarkdown value with HTML, front matter, and TOC data.
    use HelgeSverre\Markdown\Markdown;

    $doc = <<<MD
    ---
    title: Hello World
    tags: [php, markdown]
    ---

    # Introduction

    ## Getting started
    MD;

    $result = Markdown::parse($doc);

    $result->html;
    $result->frontmatter; // ['title' => 'Hello World', 'tags' => ['php', 'markdown']]
    $result->toc;         // [['level' => 1, 'text' => 'Introduction', 'slug' => 'introduction'], ...]
    (string) $result;     // same as $result->html
Malformed front matter degrades to an empty array. Heading ids are lower-cased, ASCII-folded, and de-duplicated with suffixes like intro-1.
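As a rough illustration of those slug rules, here is a minimal sketch. The class name HeadingSlugger, the tiny folding table, and the character filter are assumptions; the library's real folding and edge-case handling may differ.

```php
<?php

/**
 * Hypothetical slugger illustrating the rules above: lower-case,
 * ASCII-fold, then de-duplicate with numeric suffixes.
 */
final class HeadingSlugger
{
    /** @var array<string, int> How many times each base slug has been seen. */
    private array $seen = [];

    public function slug(string $text): string
    {
        // Fold a few accented letters to ASCII (a small illustrative subset).
        $folded = strtr($text, [
            'é' => 'e', 'è' => 'e', 'ø' => 'o', 'å' => 'a', 'ü' => 'u',
            'É' => 'E', 'Ø' => 'O',
        ]);
        $slug = strtolower($folded);

        // Drop everything except letters, digits, spaces, hyphens and underscores.
        $slug = preg_replace('/[^a-z0-9 _-]+/', '', $slug);

        // Turn each remaining space into a hyphen.
        $slug = str_replace(' ', '-', trim($slug));

        // First occurrence stays bare; later ones get -1, -2, ...
        $count = $this->seen[$slug] ?? 0;
        $this->seen[$slug] = $count + 1;

        return $count === 0 ? $slug : $slug . '-' . $count;
    }
}

$slugger = new HeadingSlugger();
echo $slugger->slug('Intro'), PHP_EOL; // intro
echo $slugger->slug('Intro'), PHP_EOL; // intro-1
```

The counter is kept per slugger instance, so a fresh instance per document keeps suffixes from leaking between documents.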
Front matter is decoded by a vendored libyaml FFI path (parsed to JSON in C, then json_decoded) — no pure-PHP YAML parser is involved. Inputs libyaml's walker does not support — anchors/aliases and << merge keys — degrade to an empty array, the same as malformed YAML.
Date scalars are strings. A bare date: 2026-06-05 in front matter is returned as the string "2026-06-05" (matching PECL yaml, spyc, and dallgoot). This differs from symfony/yaml's default, which resolves it to an integer Unix timestamp. Quote or post-process if you need a different type.
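If you want a date object rather than a string, post-processing is a one-liner with PHP's built-in date classes. The sketch below assumes a front matter array shaped like the one above; the date key and the UTC time zone are illustrative choices, not library behaviour.

```php
<?php

// Front matter as described above: the date arrives as a plain string.
$frontmatter = ['title' => 'Hello World', 'date' => '2026-06-05'];

// The leading "!" resets unspecified fields, so the time part is 00:00:00.
$date = DateTimeImmutable::createFromFormat(
    '!Y-m-d',
    $frontmatter['date'],
    new DateTimeZone('UTC'),
);

if ($date === false) {
    throw new InvalidArgumentException('Unexpected date format in front matter');
}

echo $date->format(DATE_ATOM), PHP_EOL; // 2026-06-05T00:00:00+00:00
```

Calling $date->getTimestamp() on the result gives an integer if you need to match the symfony/yaml default.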
Options
    use HelgeSverre\Markdown\Data\Dialect;
    use HelgeSverre\Markdown\Parser;

    new Parser(
        dialect: Dialect::GitHub, // or Dialect::CommonMark
        safe: true,               // strip raw HTML
        xhtml: true,              // emit <br /> / <hr />
    );
BatchParser accepts the same options. The Markdown facade uses the defaults.
Benchmarks
Run the full suite with:
composer bench
Fresh run from this checkout: PHP 8.5.5, Darwin arm64, PHPBench, opcache + tracing JIT + FFI preload. Full generated tables live in results/RESULTS.md, with machine-readable rows in results/results.json. The default corpus caps at ~256 KB (realistic document sizes plus two real-world corpora); the 1 MB and 8 MB scaling tiers are opt-in via composer bench:stress (run composer corpus first to generate them).
HTML Throughput Snapshot
toHtml() (render only) against the default corpus:
| Corpus | helgesverre/markdown | league/commonmark GFM | tempest/markdown |
|---|---|---|---|
| doc-128kb.md (135 KB) | 0.71 ms / 196 MB/s | 42.14 ms / 3.3 MB/s | 10.94 ms / 12.6 MB/s |
| commonmark-spec.md (165 KB) | 0.86 ms / 196 MB/s | 28.78 ms / 5.9 MB/s | — (threw) |
| tempest-docs.md (252 KB) | 0.84 ms / 308 MB/s | 26.24 ms / 9.8 MB/s | 42.25 ms / 6.1 MB/s |
On the 252 KB Tempest docs corpus, the render fast path measured about 31x faster than league/commonmark GFM and about 50x faster than tempest/markdown. The full parse() pipeline (front matter + render + heading anchors + TOC) is benchmarked too — on that corpus it runs in ~1.12 ms (231 MB/s), still ~24x faster than league/commonmark GFM.
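The speedup figures are plain ratios of mean render times. A short sketch of the arithmetic, using the tempest-docs.md means from the table above (the speedup() helper is only for illustration):

```php
<?php

// Mean render times in milliseconds for tempest-docs.md, copied from the table above.
$meansMs = [
    'helgesverre/markdown' => 0.84,
    'league/commonmark GFM' => 26.24,
    'tempest/markdown' => 42.25,
];

/** Speedup of the fast parser over the slow one, as a ratio of mean times. */
function speedup(float $slowMs, float $fastMs): float
{
    return $slowMs / $fastMs;
}

$baseline = $meansMs['helgesverre/markdown'];
foreach ($meansMs as $name => $ms) {
    // Prints each parser's mean and how many times slower it is than the baseline.
    printf("%-24s %6.2f ms  %5.1fx\n", $name, $ms, speedup($ms, $baseline));
}
```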
Front Matter
extract() pulls the YAML front matter without rendering the body (vendored libyaml in C → JSON → json_decode):
| Approach | Mean | Renders body? |
|---|---|---|
| helgesverre/markdown extract only | 8.84 us | no |
| helgesverre/markdown full parse | 31.86 us | yes |
| symfony/yaml floor | 307.81 us | no |
| league/commonmark front matter only | 344.33 us | no |
| tempest/markdown lex (no render) | 402.79 us | no |
| tempest/markdown full parse | 939.14 us | yes |
Front-matter extraction measured about 35x faster than the symfony/yaml floor and about 39x faster than league/commonmark's dedicated front-matter parser. (tempest/markdown has no dedicated front-matter API — lex() is its cheapest path, full parse() its idiomatic one.)
Memory numbers in the benchmark output need context: this parser renders into a short-lived C heap buffer before copying HTML back into PHP, so PHP's memory metrics undercount part of its transient native allocation. Pure-PHP parsers keep their work on the Zend heap.
How It Works
The hot path is one FFI call into a small C shim around md4c:
    char* md2html(const char* input, size_t input_len, size_t* out_len,
                  unsigned int parser_flags, unsigned int renderer_flags);
    void md2html_free(char* p);
md4c renders through callbacks internally, but those callbacks stay in C. PHP passes a byte string in, receives one allocated HTML buffer back, copies it with FFI::string(), and frees it.
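From the PHP side, that round trip might look roughly like the sketch below. The C declarations are the ones shown above; everything else is an assumption for illustration: the function name renderViaShim(), passing 0 for both flag arguments, and treating a NULL return as failure. The library itself picks the library path and flags for you.

```php
<?php

/**
 * Sketch of the one-call render path. $libPath and the zero flag values
 * are assumptions; this is not the library's own binding code.
 */
function renderViaShim(string $libPath, string $markdown): string
{
    // C declarations copied from the shim signature shown above.
    $ffi = FFI::cdef(
        'char* md2html(const char* input, size_t input_len, size_t* out_len,
                       unsigned int parser_flags, unsigned int renderer_flags);
         void md2html_free(char* p);',
        $libPath,
    );

    // Out-parameter that receives the length of the rendered HTML.
    $outLen = $ffi->new('size_t');
    $ptr = $ffi->md2html($markdown, strlen($markdown), FFI::addr($outLen), 0, 0);

    if (FFI::isNull($ptr)) {
        // Assumption: a NULL pointer signals failure.
        throw new RuntimeException('md2html returned NULL');
    }

    try {
        // Copy exactly out_len bytes from the native buffer into a PHP string.
        return FFI::string($ptr, $outLen->cdata);
    } finally {
        // The buffer was allocated in C, so hand it back to the shim.
        $ffi->md2html_free($ptr);
    }
}
```

Passing the explicit length on both sides of the call is what lets input and output carry embedded NUL bytes safely.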
Front matter uses the same one-call shape: yaml2json() walks libyaml's event stream into a single JSON string in C, which PHP json_decodes — no per-node FFI crossings. libyaml is vendored and statically linked into the shim, so the shipped binaries carry no external runtime dependency.
For production, bench/preload.php can warm an FFI::load() scope through opcache preload. Without preload, the library falls back to FFI::cdef() automatically.
The shim also includes a small correctness pass for md4c's permissive autolinks: explicit links whose text is itself an autolinkable URL can otherwise become invalid nested anchors. The pass collapses that generated shape while preserving user-supplied raw nested anchors.
Build From Source
Most users do not need this. Build from source when hacking on the C shim or targeting an unshipped platform.
    composer build      # current platform -> native/
    composer build:all  # all shipped platforms -> lib/
composer build needs a local C compiler. composer build:all uses clang for the macOS universal binary and zig cc for Linux and Windows cross-builds.
Scripts
| Command | What it does |
|---|---|
| composer test | Run PHPUnit |
| composer check | Run the CI correctness smoke gate |
| composer bench | Run PHPBench and regenerate results/ |
| composer bench:stress | Run the throughput bench against the 1 MB / 8 MB tiers |
| composer examples | Run every example script |
| composer build | Build the native shim for this platform |
| composer build:all | Cross-build shipped libraries |
| composer format:check | Check formatting with Mago |
| composer lint | Run Mago lint |
Tests
composer test
The suite covers GFM rendering, dialect/safe/XHTML options, generated anchor collapse without raw HTML rewrites, document parsing, front matter, heading slugs and TOCs, structural parity against league/commonmark, batch-vs-sequential output, shipped-library binding, hostile inputs, embedded NUL bytes, and leak checks.
CI runs the shipped binaries on Linux and macOS, keeps an experimental Windows shipped-binary job, and also builds the Linux shim from source.
Alternatives
- league/commonmark is the mature pure-PHP default. If you want extensibility and no native artifact, start there.
- tempest/markdown is a good fit inside the Tempest ecosystem, especially if you want its bundled syntax highlighting and heading behavior.
License
MIT. Bundled under their own MIT licenses: md4c (Martin Mitáš) for Markdown parsing and libyaml (Kirill Simonov et al.) for front-matter YAML — see THIRD_PARTY.md.