taufik-nurrohman / markdown
Obviously, a Markdown parser.
Requires
- php: >=7.1
Last update: 2024-11-18 02:07:38 UTC
With 90% compliance to CommonMark 0.31.2 specifications.
Motivation
I appreciate the Parsedown project for its simplicity and speed. It uses only a single class file to convert Markdown syntax to HTML. However, given the decrease in Parsedown project activity over time, I assume that it is now in the state of “feature complete”. It still has some bugs to fix, and with the recent release of PHP version 8.1, some of the PHP syntax there has become obsolete.
There is actually a draft for Parsedown version 2.0, but it is no longer made as a single class file. It’s broken down into components. The goal, I think, is to make it easy to add functionality without breaking what’s already in the core. For others, it may be of great use, but I see it as a form of similarity to the features provided by CommonMark. Because of that, if I want to upgrade, it might be more optimal to just switch to CommonMark.
I’m not into things like that. As someone who needs a function to convert Markdown syntax to HTML, that kind of flexibility is completely unnecessary to me. I just want to convert Markdown syntax to HTML for once and then move on. It was fulfilled by Parsedown version 1.8, but it seems that it is no longer being actively maintained.
The goal of this project is to use it in my Markdown extension for Mecha in the future. Previously, I wanted to develop this converter directly into the extension, but my friend advised me to create this project separately as it might have potential to be used by other developers beyond the Mecha CMS developers.
Usage
This converter can be installed using Composer, but it doesn’t need any other dependencies and just uses Composer’s ability to automatically include files. Those of you who don’t use Composer should be able to include the from.php and to.php files directly into your application without any problems.
Using Composer
From the command line interface, navigate to your project folder then run this command:
composer require taufik-nurrohman/markdown
Require the generated auto-loader file in your application:
```php
<?php

use function x\markdown\from as from_markdown;
use function x\markdown\to as to_markdown;

require 'vendor/autoload.php';

echo from_markdown('# asdf {#asdf}'); // Returns `'<h1 id="asdf">asdf</h1>'`
```
Using File
Require the from.php and to.php files in your application:

```php
<?php

use function x\markdown\from as from_markdown;
use function x\markdown\to as to_markdown;

require 'from.php';
require 'to.php';

echo from_markdown('# asdf {#asdf}'); // Returns `'<h1 id="asdf">asdf</h1>'`
```
The to.php file is optional and is used to convert HTML to Markdown. If you just want to convert Markdown to HTML, you don’t need to include this file. This feature is experimental and is provided as a complementary feature, in the same way that the function json_encode() exists alongside the function json_decode(). The Markdown result may not satisfy everyone, but it can be discussed further.
Options
```php
/**
 * Convert Markdown string to HTML string.
 *
 * @param null|string $value Your Markdown string.
 * @param bool $block If this option is set to `false`, Markdown block syntax will be ignored.
 * @return null|string
 */
from(?string $value, bool $block = true): ?string;
```

```php
/**
 * Convert HTML string to Markdown string.
 *
 * @param null|string $value Your HTML string.
 * @param bool $block If this option is set to `false`, HTML block syntax will be stripped out.
 * @return null|string
 */
to(?string $value, bool $block = true): ?string;
```
Dialect
Over time, the history of Mecha has slowly shaped my Markdown writing style. The Markdown extension used by Mecha was first built with Michel Fortin’s Markdown converter (which I believe is the very first PHP port of the Markdown converter originally written in Perl by John Gruber). That lasted until the release of Mecha version 1.2.3, when I decided to switch to Parsedown because it was quite popular at the time. It can also do the conversion process much faster. Emanuil Rusev’s way of detecting the block type by reading the first character is, in my opinion, very clever and efficient.
Attributes
My Markdown converter supports a more extensive attribute syntax, including a mix of .class and #id attribute syntax, and a mix of key=value attribute syntax:
Inline attributes always win over native syntax attributes and pre-defined attributes:
Emphasis
CommonMark’s emphasis (and strong emphasis) specifications almost drove me crazy! 🤯
Implementing that level of strictness would delay a stable release of this project even further. I actually understand the parsing strategy very well, but turning it into minimal PHP code is hard for me. To speed up the completion of the project, I decided to relax the emphasis (and strong emphasis) rules.
The results will not completely follow CommonMark’s emphasis (and strong emphasis) specifications, but I promise that the HTML results will still make sense, especially for those who have never read the specifications.
Rule 1: The same type of emphasis can be nested only if one or both sides of the child emphasis begin and/or end with white-space or punctuation.
This will create nested emphasis:
This will not:
Rule 2: For conditions where the emphasis types are different, Rule 1 does not apply.
Rule 3: For conditions where the emphasis markers are different, Rule 1 does not apply.
Rule 4: The opening delimiter must not be followed by a white-space and the closing delimiter must not be preceded by a white-space in order for it to be a valid emphasis token.
Rule 5: The emphasis token cannot be empty.
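Rules 4 and 5 can be sketched as a small check on the text between an opening and a closing delimiter. This is only an illustration and not the converter’s actual code; the function name is_valid_emphasis_content() is hypothetical, and the nesting rules (Rules 1 to 3) are not covered here.

```php
<?php

// Hypothetical helper, not part of this library. `$content` is the text found
// between an opening and a closing emphasis delimiter.
function is_valid_emphasis_content(string $content): bool {
    // Rule 5: the emphasis token cannot be empty.
    if ("" === $content) {
        return false;
    }
    // Rule 4: the opening delimiter must not be followed by white-space…
    if (1 === preg_match('/\A\s/', $content)) {
        return false;
    }
    // … and the closing delimiter must not be preceded by white-space.
    if (1 === preg_match('/\s\z/', $content)) {
        return false;
    }
    return true;
}

var_dump(is_valid_emphasis_content('asdf'));  // bool(true)
var_dump(is_valid_emphasis_content(' asdf')); // bool(false)
var_dump(is_valid_emphasis_content(''));      // bool(false)
```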
Links
Relative links and absolute links with the server’s host name will be treated as internal links, otherwise they will be treated as external links and will automatically get rel="nofollow" and target="_blank" attributes.
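The internal/external decision can be pictured roughly as follows. This is a sketch under assumptions, not the converter’s implementation: the functions is_internal_link() and link_attributes() and the $host parameter are hypothetical, and treating links that have a scheme but no host (such as mailto:) as external is a guess on my part.

```php
<?php

// Hypothetical helper: decide whether a link is internal for a given host name.
function is_internal_link(string $url, string $host): bool {
    $parts = parse_url($url);
    if (!is_array($parts)) {
        return false; // Assumption: a URL that cannot be parsed counts as external.
    }
    if (isset($parts['host'])) {
        // Absolute link: internal only if it points to the server's host name.
        return 0 === strcasecmp($parts['host'], $host);
    }
    // No host: a relative link is internal, unless it carries a scheme.
    return !isset($parts['scheme']);
}

// Hypothetical helper: build the attributes of the resulting anchor element.
function link_attributes(string $url, string $host): string {
    $out = ' href="' . htmlspecialchars($url) . '"';
    if (!is_internal_link($url, $host)) {
        $out .= ' rel="nofollow" target="_blank"';
    }
    return $out;
}

echo '<a' . link_attributes('/about', 'example.com') . '>asdf</a>', "\n";
echo '<a' . link_attributes('https://github.com', 'example.com') . '>asdf</a>', "\n";
```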
Notes
Notes follow Markdown Extra’s notes syntax, but with slightly different HTML output to match Mecha’s common naming style. Multi-line notes don’t have to be indented by four spaces as required by Markdown Extra. A single space or tab is enough to continue the note.
Soft Break
Soft breaks are collapsed to spaces in non-critical parts such as in paragraphs and list items:
Code Block
I try to avoid conflict between different Markdown dialects and try to support whatever dialect you are using. For example, since I originally used Markdown Extra, I am used to adding an info string with a dot prefix to the fenced code block syntax. This is not supported by Parsedown (or rather, Parsedown doesn’t care about the pattern of the given info string and simply adds a language- prefix to it, since CommonMark also doesn’t give implementors special rules for processing the info string in fenced code block syntax).
Here’s how the code block results compare across each Markdown converter:
Markdown Extra
Parsedown Extra
Mine
HTML Block
CommonMark doesn’t care about the DOM and therefore also doesn’t care whether an HTML element is perfectly balanced or not. Unlike the original Markdown syntax specification, which doesn’t allow you to convert Markdown syntax inside an HTML block, the CommonMark specification doesn’t limit such a case. It cares about blank lines around the lines that look like an HTML block tag, as specified in Section 4.6, type 6.
Any text that comes after the opening and/or closing of an HTML block is treated as raw text and is not processed as Markdown syntax. A blank line is required to end the raw HTML block state:
The exceptions are types 1, 2, 3, 4, and 5, where a line break is enough to end the raw HTML block state:
The examples below will generate predictable HTML, but not because this converter cares about the existing HTML tag balance:
You will understand why when you add a number of blank lines at any point in the HTML block:
Markdown Extra features the markdown attribute on HTML to allow you to convert Markdown syntax to HTML in an HTML block. In this converter, the feature will not work. For now, I have no plans to add such a feature, in order to avoid DOM parsing tasks as much as possible. This also lets me avoid depending on the PHP dom extension.
However, if you add a blank line, it’s as if the feature works (although the markdown attribute is still there, it doesn’t affect the HTML when rendered in the browser window). If you’re used to adding a blank line after the opening HTML block tag and before the closing HTML block tag, you should be okay.
Opening an inline HTML element will not trigger the raw HTML block state unless the opening and closing tags stand alone on a single line. This is explained in Section 4.6, type 7:
Since CommonMark doesn’t care about HTML structure, the examples below will also conform to the specification, even if they result in broken HTML. However, these are very rarely intentionally written by hand, so such cases are very unlikely to occur:
Image Block
Markdown was created before the HTML5 era. When the <figure> element was introduced, people started using it to display an image with a caption. Most Markdown converters will convert image syntax that stands alone on a single line to an image element wrapped in a paragraph element. My converter wraps it in a figure element instead because, for now, a figure element seems more desirable in this situation.
Paragraphs that appear below it will be taken as the image caption if you indent them by fewer than four spaces.
FYI, this pattern is also valid in ordinary Markdown files, so it degrades gracefully when parsed by other Markdown converters.
List Block
List blocks follow the CommonMark specifications with one exception: if the next ordered list item uses a number that is less than the number of the previous ordered list item, a new list block will be created. This is different from the original specification, which does not care about the literal value of the number.
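The numbering exception can be illustrated with a tiny grouping routine. It only demonstrates the rule on a plain sequence of item numbers and is not how the converter parses lists; the function name split_ordered_lists() is hypothetical.

```php
<?php

// Hypothetical helper: group ordered list item numbers into separate lists.
// A new list starts whenever a number is less than the previous one.
function split_ordered_lists(array $numbers): array {
    $lists = [];
    $previous = null;
    foreach ($numbers as $number) {
        if (null === $previous || $number < $previous) {
            $lists[] = []; // Start a new list block
        }
        $lists[count($lists) - 1][] = $number;
        $previous = $number;
    }
    return $lists;
}

// Items numbered `1.`, `2.`, `3.`, `1.`, `2.` end up in two list blocks.
print_r(split_ordered_lists([1, 2, 3, 1, 2]));
```

Equal numbers (for example `1.`, `1.`, `1.`) stay in the same list, since the rule only mentions numbers that are less than the previous one.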
Table Block
Table blocks follow the Markdown Extra’s table block syntax. However, there are a few additional features and rules:
- The actual number of columns follows the number of columns in the table header separator. If you have columns in table header and/or table data with a number that exceeds the actual number of columns, the excess columns will be discarded. If you have columns in table header and/or table data with a number that is less than the actual number of columns, several empty columns will be added automatically to the right side.
- Literal pipe characters in table columns must be escaped. Exceptions are those that appear in code span and attribute values of raw HTML tags.
- Header-less table is supported, but may not be compatible with other Markdown converters. Consider using this feature as rarely as possible, unless you have no plans to switch to other Markdown converters in the future.
- Table caption is supported and can be created using the same syntax as the image block’s caption syntax.
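The column rule in the first item above can be sketched like this. It is a minimal illustration, assuming each row has already been split into cells; normalize_table_row() is a hypothetical name and not a function of this library.

```php
<?php

// Hypothetical helper: make a row match the number of columns defined by the
// table header separator.
function normalize_table_row(array $cells, int $columns): array {
    // Discard the excess columns…
    $cells = array_slice($cells, 0, $columns);
    // … or add empty columns to the right side.
    return array_pad($cells, $columns, "");
}

print_r(normalize_table_row(['a', 'b', 'c'], 2)); // Keeps `a` and `b`
print_r(normalize_table_row(['a'], 3)); // Adds two empty columns
```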
XSS
This converter is intended only to convert Markdown syntax to HTML based on the CommonMark specification. It doesn’t care about your user input. I have no intention of adding any special security features in the future, sorry. The attribute syntax feature may be a security risk for you if you want to use this converter on your comment entries, for example:
There should be many specialized PHP applications already that have specific tasks to deal with XSS, so consider post-processing the generated HTML markup before putting it out to the web:
Tests
Clone this repository into the root of a web server that supports PHP, and then open the test/from.php and test/to.php files with your browser to see the results and the performance of this converter in various cases.
Tweaks
Not all Markdown dialects are supported for various reasons. Some of the modification methods below can be implemented to add features that you might find in other Markdown converters.
Your Markdown content is represented as variable $value. If you modify the content before the function from_markdown() is called, it means that you modify the Markdown content before it is converted. If you modify the content after the function from_markdown() is called, it means that you modify the results of the Markdown conversion.
Globally Reusable Functions
To make from_markdown() and to_markdown() functions reusable globally, use this method:

```php
<?php

require 'from.php';
require 'to.php';

// Or, if you are using Composer…
// require 'vendor/autoload.php';

function from_markdown(...$v) {
    return x\markdown\from(...$v);
}

function to_markdown(...$v) {
    return x\markdown\to(...$v);
}
```
XHTML to HTML5
This converter escapes invalid HTML elements and takes care of HTML special characters that you put in the Markdown attribute syntax, so it is safe to replace ' />' with '>' directly from the results of the Markdown conversion:

```php
$value = from_markdown($value);

$value = strtr($value, [' />' => '>']);

echo $value;
```
Strike
This method allows you to add strike-through syntax, as you may have already noticed in the GFM specification:
```php
$value = from_markdown($value);

$value = preg_replace('/((?<![~])[~]{1,2}(?![~]))([^~]+)\1/', '<del>$2</del>', $value);

echo $value;
```
Task List
I am against the task list feature because it encourages the bad practice of abusing the form input element. Although it displays a check box interface correctly from the presentation side, I still believe that input elements should ideally be used inside a form element. There are several Unicode symbols that are more suitable and easier to read in the Markdown source, like ☐ and ☒, which means that this feature can actually be achieved using the existing list feature:
```markdown
- ☒ asdf
- ☐ asdf
- ☐ asdf
```
In case you need it, or don’t want to update your existing task list syntax in your Markdown files, here’s the hack:
```php
$value = from_markdown($value);

$value = strtr($value, [
    '<li><p>[ ] ' => '<li><p>☐ ',
    '<li><p>[x] ' => '<li><p>☒ ',
    '<li>[ ] ' => '<li>☐ ',
    '<li>[x] ' => '<li>☒ '
]);

echo $value;
```
Pre-Defined Abbreviations, Notes, and References
By appending abbreviations, notes, and references to the Markdown content, you get the effect of a pre-defined abbreviations, notes, and references feature. They should be placed at the end of the Markdown content because, according to the link reference definitions specification, the first declared reference always takes precedence:
```php
$abbreviations = [
    'CSS' => 'Cascading Style Sheet',
    'HTML' => 'Hyper Text Markup Language',
    'JS' => 'JavaScript'
];

$references = [
    'mecha-cms' => ['https://github.com/mecha-cms', 'Mecha CMS', []],
    'taufik-nurrohman' => ['https://github.com/taufik-nurrohman', 'Taufik Nurrohman', []],
];

$suffix = "";

if (!empty($abbreviations)) {
    foreach ($abbreviations as $k => $v) {
        $k = strtr($k, [
            '[' => '\[',
            ']' => '\]'
        ]);
        $v = trim(preg_replace('/\s+/', ' ', $v));
        $suffix .= "\n*[" . $k . ']: ' . $v;
    }
}

if (!empty($references)) {
    foreach ($references as $k => $v) {
        [$link, $title, $attributes] = $v;
        $k = strtr($k, [
            '[' => '\[',
            ']' => '\]'
        ]);
        if ("" === $link || false !== strpos($link, ' ')) {
            $link = '<' . $link . '>';
        }
        $reference = '[' . $k . ']: ' . $link;
        if (!empty($title)) {
            $reference .= " '" . strtr($title, ["'" => "\\'"]) . "'";
        }
        if (!empty($attributes)) {
            foreach ($attributes as $kk => &$vv) {
                // `{.asdf}`
                if ('class' === $kk) {
                    $vv = '.' . trim(preg_replace('/\s+/', '.', $vv));
                    continue;
                }
                // `{#asdf}`
                if ('id' === $kk) {
                    $vv = '#' . $vv;
                    continue;
                }
                // `{asdf}`
                if (true === $vv) {
                    $vv = $kk;
                    continue;
                }
                // `{asdf=""}`
                if ("" === $vv) {
                    $vv = $kk . '=""';
                    continue;
                }
                // `{asdf='asdf'}`
                $vv = $kk . "='" . strtr($vv, ["'" => "\\'"]) . "'";
            }
            unset($vv);
            sort($attributes);
            $attributes = trim(strtr(implode(' ', $attributes), [
                ' #' => '#',
                ' .' => '.'
            ]));
            $reference .= ' {' . $attributes . '}';
        }
        $suffix .= "\n" . $reference;
    }
}

$value = from_markdown($value . "\n" . $suffix);

echo $value;
```
Pre-Defined Header’s ID
Add an automatic id attribute to headers level 2 through 6 if it’s not set, and then prepend an anchor element that points to it:

```php
$value = from_markdown($value);

if ($value && false !== strpos($value, '</h')) {
    $value = preg_replace_callback('/<(h[2-6])(\s(?>"[^"]*"|\'[^\']*\'|[^>])*)?>([\s\S]+?)<\/\1>/', static function ($m) {
        // The header already has an `id` attribute: re-use it
        if (!empty($m[2]) && false !== strpos($m[2], 'id=') && preg_match('/\bid=("[^"]+"|\'[^\']+\'|[^\/>\s]+)/', $m[2], $n)) {
            if ('"' === $n[1][0] && '"' === substr($n[1], -1)) {
                $id = substr($n[1], 1, -1);
            } else if ("'" === $n[1][0] && "'" === substr($n[1], -1)) {
                $id = substr($n[1], 1, -1);
            } else {
                $id = $n[1];
            }
            $m[3] = '<a href="#' . htmlspecialchars($id) . '" style="text-decoration: none;">⚓</a> ' . $m[3];
            return '<' . $m[1] . $m[2] . '>' . $m[3] . '</' . $m[1] . '>';
        }
        // Otherwise, generate an `id` from the header content
        $id = trim(preg_replace('/[^a-z\x{4e00}-\x{9fa5}\d]+/u', '-', strtolower($m[3])), '-');
        $m[3] = '<a href="#' . htmlspecialchars($id) . '" style="text-decoration: none;">⚓</a> ' . $m[3];
        return '<' . $m[1] . ($m[2] ?? "") . ' id="' . htmlspecialchars($id) . '">' . $m[3] . '</' . $m[1] . '>';
    }, $value);
}

echo $value;
```
Idea: Embed Syntax
The CommonMark specification for automatic links doesn’t restrict the URL protocol to specific types. It only specifies the pattern, so we can take advantage of the automatic link syntax to render it as a kind of “embed” syntax, which you can then turn into a chunk of HTML elements.
I’m sure this idea has never been done before and that’s why I want to be the first to mention it. But I’m not going to integrate this feature directly into my converter to keep it slim. I just want to give you a couple of ideas.
Be aware that these tweaks are very naive, as they will directly convert the “embed” syntax without taking the block type into account. You may need to use this filter to replace the “embed” syntax only in certain block types, e.g. to ignore the “embed” syntax inside a fenced code block syntax.
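One possible way to skip fenced code blocks is to match them in the same regular expression and return them untouched. The sketch below does this for a YouTube-style “embed” syntax; it is only an illustration, the function name embed_youtube_outside_fences() is hypothetical, the fence pattern is a simplification of the CommonMark rules, and indented code blocks are not handled at all.

```php
<?php

// Hypothetical helper, not part of the converter. Replaces `<youtube:ID>`
// lines with an `<iframe>`, except inside fenced code blocks.
function embed_youtube_outside_fences(string $value): string {
    $pattern = '/'
        // 1st alternative: a whole fenced code block (or an unclosed one that
        // runs to the end of the content). Group 1 captures the opening fence.
        . '^[ ]{0,3}(([`~])\2{2,})[^\n]*\n[\s\S]*?(?:^[ ]{0,3}\1\2*[ \t]*$|\z)'
        . '|'
        // 2nd alternative: the “embed” syntax. Group 3 captures the video ID.
        . '^[ ]{0,3}<youtube:([^>\s]+)>[ \t]*$'
        . '/m';
    $result = preg_replace_callback($pattern, static function ($m) {
        if ("" !== $m[1]) {
            return $m[0]; // A fenced code block: keep it as it is
        }
        return '<iframe src="https://www.youtube.com/embed/' . htmlspecialchars($m[3]) . '"></iframe>';
    }, $value);
    // Fall back to the original content if the regular expression fails
    return null === $result ? $value : $result;
}

$value = "<youtube:dQw4w9WgXcQ>\n\n```\n<youtube:dQw4w9WgXcQ>\n```\n";

// Only the first line is converted; the one inside the fence is kept.
echo embed_youtube_outside_fences($value);
```

The result can then be passed to from_markdown() as in the other tweaks.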
YouTube Video Embed
An embed syntax to display a YouTube video by video ID.
<youtube:dQw4w9WgXcQ>
```php
$value = preg_replace('/^[ ]{0,3}<youtube:([^>]+)>\s*$/m', '<iframe src="https://www.youtube.com/embed/$1"></iframe>', $value);

$value = from_markdown($value);

echo $value;
```
GitHub Gist Embed
An embed syntax to display a GitHub gist by gist ID.
<gist:9c96049ca6c66e30e50793f5aef4818b>
```php
$value = preg_replace('/^[ ]{0,3}<gist:([^>]+)>\s*$/m', '<script src="https://gist.github.com/taufik-nurrohman/$1.js"></script>', $value);

$value = from_markdown($value);

echo $value;
```
Form Embed
An embed syntax to display an HTML form that was generated from the server side, with a reference ID of 18a4596d42c and a title parameter to customize the HTML form title.
<form:18a4596d42c?title=Form+Title>
```php
$value = preg_replace_callback('/^[ ]{0,3}<form:([^#>?]+)([?][^#>]*)?([#][^>]*)?>\s*$/m', static function ($m) {
    $path = $m[1];
    $value = "";
    parse_str(substr($m[2] ?? "", 1), $state);
    $value .= '<form action="/form/' . $path . '" method="post">';
    if (!empty($state['title'])) {
        $value .= '<h1>' . $state['title'] . '</h1>';
    }
    // … etc.
    // Be careful not to include blank line(s), or the raw HTML block state will end before the HTML form is complete!
    $value .= '</form>';
    return $value;
}, $value);

$value = from_markdown($value);

echo $value;
```
Idea: Note Block
Several people have discussed this feature, and I think I like this answer the most. The syntax is compatible with native Markdown syntax, which is nice to look at directly through the Markdown source, even when it gets rendered to HTML:
```markdown
------------------------------

**NOTE:** asdf asdf asdf

------------------------------
```

```markdown
------------------------------

**NOTE:**

asdf asdf asdf asdf asdf asdf asdf asdf

asdf asdf asdf asdf

------------------------------
```
Most Markdown converters will render the syntax above to this HTML, which is still acceptable as a note block from its presentation, despite its broken semantics:
<hr /><p><strong>NOTE:</strong> asdf asdf asdf</p><hr />
<hr /><p><strong>NOTE:</strong></p><p>asdf asdf asdf asdf asdf asdf asdf asdf</p><p>asdf asdf asdf asdf</p><hr />
With regular expressions, you can improve its semantics:
```php
$value = from_markdown($value);

$value = preg_replace_callback('/<hr\s*\/?>(<p><strong>NOTE:<\/strong>[\s\S]*?<\/p>)<hr\s*\/?>/', static function ($m) {
    return '<div role="note">' . $m[1] . '</div>';
}, $value);

echo $value;
```
License
This library is licensed under the MIT License. Please consider donating 💰 if you benefit financially from this library.
Links
- Autumn image sample by @blmiers2
- Emoticon image sample by @emoticons4u (web archive)