Obviously, a Markdown parser.

Last update: 2024-11-18 02:07:38 UTC


With 90% compliance with the CommonMark 0.31.2 specification.

Motivation

I appreciate the Parsedown project for its simplicity and speed. It uses only a single class file to convert Markdown syntax to HTML. However, given the decrease in Parsedown project activity over time, I assume that it is now in a “feature complete” state. It still has some bugs to fix, and with the recent release of PHP version 8.1, some of the PHP syntax there has become obsolete.

There is actually a draft for Parsedown version 2.0, but it is no longer made as a single class file. It’s broken down into components. The goal, I think, is to make it easy to add functionality without breaking what’s already in the core. For others, it may be of great use, but I see it as similar to the features already provided by CommonMark. Because of that, if I wanted to upgrade, it might make more sense to just switch to CommonMark.

I’m not into things like that. As someone who needs a function to convert Markdown syntax to HTML, that kind of flexibility is completely unnecessary to me. I just want to convert Markdown syntax to HTML once and then move on. That need was fulfilled by Parsedown version 1.8, but it seems that it is no longer being actively maintained.

The goal of this project is to use it in my Markdown extension for Mecha in the future. Previously, I wanted to develop this converter directly into the extension, but my friend advised me to create this project separately, as it might be useful to other developers beyond the Mecha CMS developers.

Usage

This converter can be installed using Composer, but it doesn’t need any other dependencies and just uses Composer’s ability to automatically include files. Those of you who don’t use Composer should be able to include the from.php and to.php files directly into your application without any problems.

Using Composer

From the command line interface, navigate to your project folder then run this command:

composer require taufik-nurrohman/markdown

Require the generated auto-loader file in your application:

<?php

use function x\markdown\from as from_markdown;
use function x\markdown\to as to_markdown;

require 'vendor/autoload.php';

echo from_markdown('# asdf {#asdf}'); // Returns `'<h1 id="asdf">asdf</h1>'`

Using File

Require the from.php and to.php files in your application:

<?php

use function x\markdown\from as from_markdown;
use function x\markdown\to as to_markdown;

require 'from.php';
require 'to.php';

echo from_markdown('# asdf {#asdf}'); // Returns `'<h1 id="asdf">asdf</h1>'`

The to.php file is optional and is used to convert HTML to Markdown. If you just want to convert Markdown to HTML, you don’t need to include this file. This feature is experimental and is provided as a complementary feature, just as json_encode() exists alongside json_decode(). The Markdown result may not satisfy everyone, but it can be discussed further.

Options

/**
 * Convert Markdown string to HTML string.
 *
 * @param null|string $value Your Markdown string.
 * @param bool $block If this option is set to `false`, Markdown block syntax will be ignored.
 * @return null|string
 */
from(?string $value, bool $block = true): ?string;
/**
 * Convert HTML string to Markdown string.
 *
 * @param null|string $value Your HTML string.
 * @param bool $block If this option is set to `false`, HTML block syntax will be stripped out.
 * @return null|string
 */
to(?string $value, bool $block = true): ?string;

Dialect

Over time, the history of Mecha has slowly shaped my Markdown writing style. The Markdown extension used by Mecha was first built with Michel Fortin’s Markdown converter (which I believe is the very first PHP port of the Markdown converter originally written in Perl by John Gruber). That lasted until the release of Mecha version 1.2.3, when I decided to switch to Parsedown because it was quite popular at the time. It can also do the conversion process much faster. Emanuil Rusev’s way of detecting the block type by reading the first character is, in my opinion, very clever and efficient.

Attributes

My Markdown converter supports a more extensive attribute syntax, including a mix of .class and #id attribute syntax, and a mix of key=value attribute syntax:

Inline attributes always win over native syntax attributes and pre-defined attributes:

Emphasis

CommonMark’s emphasis (and strong emphasis) specifications almost drove me crazy! 🤯

Implementing that level of strictness would delay a stable release even further. I actually understand the parsing strategy very well, but turning it into minimal PHP code is hard for me. In order to speed up the completion of the project, I decided to reduce the strictness of the emphasis (and strong emphasis) specifications.

They will not completely follow CommonMark’s emphasis (and strong emphasis) specifications, but I promise that the HTML results will still make sense, especially for those who have never read the specifications.

Rule 1: The same type of emphasis can be nested only if one or both sides of the child emphasis begin and/or end with white-space or punctuation.

This will create nested emphasis:

This will not:

Rule 2: For conditions where the emphasis types are different, Rule 1 does not apply.

Rule 3: For conditions where the emphasis markers are different, Rule 1 does not apply.

Rule 4: The opening delimiter must not be followed by a white-space and the closing delimiter must not be preceded by a white-space in order for it to be a valid emphasis token.

Rule 5: The emphasis token cannot be empty.
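
As a rough illustration of Rule 4 and Rule 5 only, here is a minimal sketch that checks the text found between an opening and a closing delimiter. The function name is hypothetical and this is not the converter’s actual code; the nesting rules (Rule 1 to Rule 3) are not covered.

```php
<?php

// Hypothetical helper, not part of the converter. `$content` is assumed to be
// the raw text between the opening and the closing emphasis delimiter.
function is_valid_emphasis_content(string $content): bool {
    // Rule 5: the emphasis token cannot be empty
    if ("" === $content) {
        return false;
    }
    // Rule 4: no white-space right after the opening delimiter
    if (preg_match('/\A\s/', $content)) {
        return false;
    }
    // Rule 4: no white-space right before the closing delimiter
    if (preg_match('/\s\z/', $content)) {
        return false;
    }
    return true;
}

var_dump(is_valid_emphasis_content('asdf'));  // bool(true)
var_dump(is_valid_emphasis_content(' asdf')); // bool(false)
var_dump(is_valid_emphasis_content(''));      // bool(false)
```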

Links

Relative links and absolute links with the server’s host name will be treated as internal links, otherwise they will be treated as external links and will automatically get rel="nofollow" and target="_blank" attributes.
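
The distinction above can be sketched roughly as follows. Both function names are hypothetical, `$host` is assumed to be your server’s host name, and treating links that have a scheme but no host (such as `mailto:`) as external is my assumption for this sketch, not a documented behavior of the converter.

```php
<?php

// Hypothetical helper: decide whether a link counts as internal for `$host`.
function is_internal_link(string $link, string $host): bool {
    $parts = parse_url($link);
    // Seriously malformed URL: treated as external here (assumption)
    if (false === $parts) {
        return false;
    }
    if (!isset($parts['host'])) {
        // No host and no scheme means a relative link, a query, or a fragment
        return !isset($parts['scheme']);
    }
    // Absolute link: internal only if it points to the same host name
    return 0 === strcasecmp($parts['host'], $host);
}

// Hypothetical helper: extra attributes an external link would receive.
function link_attributes(string $link, string $host): array {
    if (is_internal_link($link, $host)) {
        return [];
    }
    return ['rel' => 'nofollow', 'target' => '_blank'];
}

var_dump(is_internal_link('/about', 'example.com'));             // bool(true)
var_dump(is_internal_link('https://github.com', 'example.com')); // bool(false)
```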

Notes

Notes follow Markdown Extra’s notes syntax but with slightly different HTML output to match Mecha’s common naming style. Multi-line notes don’t have to be indented by four spaces as required by Markdown Extra. A space or tab is enough to continue the note.

Soft Break

Soft breaks are collapsed to spaces in non-critical parts such as in paragraphs and list items:

Code Block

I try to avoid conflict between different Markdown dialects and try to support whatever dialect you are using. For example, since I originally used Markdown Extra, I am used to adding an info string with a dot prefix to the fenced code block syntax. This is not supported by Parsedown (or rather, Parsedown doesn’t care about the pattern of the given info string and simply adds the language- prefix to it, since CommonMark also doesn’t give implementors special rules for processing the info string in fenced code block syntax).

Here’s how the code block results compare across each Markdown converter:

Markdown Extra

Parsedown Extra

Mine

HTML Block

CommonMark doesn’t care about the DOM and therefore also doesn’t care if an HTML element is perfectly balanced or not. Unlike the original Markdown syntax specification, which doesn’t allow you to convert Markdown syntax inside an HTML block, the CommonMark specification doesn’t limit such a case. It cares about blank lines around the lines that look like an HTML block tag, as specified in Section 4.6, type 6.

Any text that comes after the opening and/or closing of an HTML block is treated as raw text and is not processed as Markdown syntax. A blank line is required to end the raw HTML block state:

Types 1, 2, 3, 4, and 5 are the exception. A line break is enough to end the raw HTML block state:

The examples below will generate predictable HTML code, but not because this converter cares about the existing HTML tag balance:

You will understand why when you add a number of blank lines at any point in the HTML block:

Markdown Extra features the markdown attribute on HTML to allow you to convert Markdown syntax to HTML in an HTML block. In this converter, the feature will not work. For now, I have no plans to add such a feature, because I want to avoid DOM parsing tasks as much as possible. This also lets me avoid depending on the PHP dom extension.

However, if you add a blank line, it’s as if the feature works (although the markdown attribute is still there, it doesn’t affect the HTML when rendered in the browser window). If you’re used to adding a blank line after the opening HTML block tag and before the closing HTML block tag, you should be okay.

Opening an inline HTML element will not trigger the raw HTML block state unless the opening and closing tags stand alone on a single line. This is explained in Section 4.6, type 7:

Since CommonMark doesn’t care about HTML structure, the examples below will also conform to the specification, even if they result in broken HTML. However, these are very rarely intentionally written by hand, so such cases are very unlikely to occur:

Image Block

Markdown was created before the HTML5 era. When the <figure> element was introduced, people started using it to display an image with a caption. Most Markdown converters will convert image syntax that stands alone on a single line as an image element wrapped in a paragraph element in the output. My converter instead wraps it in a figure element, because a figure element seems more desirable in this situation.

Paragraphs that appear below it will be taken as the image caption if you prepend a number of spaces less than 4.

This pattern is also valid in ordinary Markdown files, so it degrades gracefully when parsed by other Markdown converters.

List Block

List blocks follow the CommonMark specifications with one exception: if the next ordered list item uses a number that is less than the number of the previous ordered list item, a new list block will be created. This is different from the original specification, which does not care about the literal value of the number.
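
To make the exception concrete, here is a minimal sketch that groups a sequence of ordered list item numbers into separate list blocks whenever the next number is less than the previous one. The function name is hypothetical and the sketch only models the rule itself; it is not how the converter parses list blocks internally.

```php
<?php

// Hypothetical helper: `$numbers` holds the literal numbers of consecutive
// ordered list items, in source order.
function split_ordered_lists(array $numbers): array {
    $lists = [];
    $current = [];
    foreach ($numbers as $number) {
        // A number lower than the previous one starts a new list block
        if ($current && $number < end($current)) {
            $lists[] = $current;
            $current = [];
        }
        $current[] = $number;
    }
    if ($current) {
        $lists[] = $current;
    }
    return $lists;
}

// `1. 2. 3. 1. 2.` becomes two list blocks: `[1, 2, 3]` and `[1, 2]`
print_r(split_ordered_lists([1, 2, 3, 1, 2]));
```

Equal numbers (for example `1. 1. 1.`) stay in the same list block in this sketch, since the rule only mentions a number that is less than the previous one.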

Table Block

Table blocks follow Markdown Extra’s table block syntax. However, there are a few additional features and rules:

  • The actual number of columns follows the number of columns in the table header separator. If you have columns in table header and/or table data with a number that exceeds the actual number of columns, the excess columns will be discarded. If you have columns in table header and/or table data with a number that is less than the actual number of columns, several empty columns will be added automatically to the right side.
  • Literal pipe characters in table columns must be escaped. Exceptions are those that appear in code span and attribute values of raw HTML tags.
  • Header-less table is supported, but may not be compatible with other Markdown converters. Consider using this feature as rarely as possible, unless you have no plans to switch to other Markdown converters in the future.
  • Table caption is supported and can be created using the same syntax as the image block’s caption syntax.
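
The column rule in the first item above can be sketched like this. The function name is hypothetical, and `$count` is assumed to be the number of columns found in the table header separator; this is an illustration of the rule, not the converter’s code.

```php
<?php

// Hypothetical helper: make a row of cells match the actual number of columns.
function normalize_table_row(array $cells, int $count): array {
    // Excess columns are discarded
    $cells = array_slice(array_values($cells), 0, $count);
    // Missing columns are added as empty columns on the right side
    return array_pad($cells, $count, "");
}

print_r(normalize_table_row(['a', 'b', 'c'], 2)); // `['a', 'b']`
print_r(normalize_table_row(['a'], 3)); // `['a', '', '']`
```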

XSS

This converter is intended only to convert Markdown syntax to HTML based on the CommonMark specification. It doesn’t care about your user input. I have no intention of adding any special security features in the future, sorry. The attribute syntax feature may be a security risk for you if you want to use this converter on your comment entries, for example:

There should be many specialized PHP applications already that have specific tasks to deal with XSS, so consider post-processing the generated HTML markup before putting it out to the web:

Tests

Clone this repository into the root of your web server that supports PHP, and then you can open the test/from.php and test/to.php files with your browser to see the result and the performance of this converter in various cases.

Tweaks

Not all Markdown dialects are supported for various reasons. Some of the modification methods below can be implemented to add features that you might find in other Markdown converters.

Your Markdown content is represented as variable $value. If you modify the content before the function from_markdown() is called, it means that you modify the Markdown content before it is converted. If you modify the content after the function from_markdown() is called, it means that you modify the results of the Markdown conversion.
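
The before/after distinction can be sketched as a small wrapper. The function name `convert_with_tweaks()` is hypothetical, and the converter is passed in as a callable so that the sketch stays self-contained; in practice you would pass your `from_markdown()` function as `$convert`.

```php
<?php

// Hypothetical wrapper: `$before` tweaks receive the Markdown source,
// `$after` tweaks receive the HTML result of the conversion.
function convert_with_tweaks(?string $value, callable $convert, array $before = [], array $after = []): ?string {
    foreach ($before as $tweak) {
        $value = $tweak($value); // Modifies the Markdown content
    }
    $value = $convert($value);
    foreach ($after as $tweak) {
        $value = $tweak($value); // Modifies the conversion result
    }
    return $value;
}

// A stand-in converter, used here only to keep the example runnable
$convert = static function ($value) {
    return '<p>' . $value . '<br /></p>';
};

echo convert_with_tweaks('asdf', $convert, ['strtoupper'], [
    static function ($value) {
        return strtr($value, [' />' => '>']);
    }
]);
```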

Globally Reusable Functions

To make from_markdown() and to_markdown() functions reusable globally, use this method:

<?php

require 'from.php';
require 'to.php';

// Or, if you are using Composer…
// require 'vendor/autoload.php';

function from_markdown(...$v) {
    return x\markdown\from(...$v);
}

function to_markdown(...$v) {
    return x\markdown\to(...$v);
}

XHTML to HTML5

This converter escapes invalid HTML elements and takes care of HTML special characters that you put in the Markdown attribute syntax, so it is safe to replace ' />' with '>' directly from the results of the Markdown conversion:

$value = from_markdown($value);

$value = strtr($value, [' />' => '>']);

echo $value;

Strike

This method allows you to add strike-through syntax, as found in the GFM specification:

$value = from_markdown($value);

$value = preg_replace('/((?<![~])[~]{1,2}(?![~]))([^~]+)\1/', '<del>$2</del>', $value);

echo $value;

Task List

I am against the task list feature because it promotes the bad practice of abusing the form input element. Although from the presentation side it displays a check box interface correctly, I still believe that input elements should ideally be used inside a form element. There are several Unicode symbols that are more suitable and easier to read from the Markdown source, like ☐ and ☒, which means that this feature can actually be made using the existing list feature:

- ☒ asdf
- ☐ asdf
- ☐ asdf

In case you need it, or don’t want to update your existing task list syntax in your Markdown files, here’s the hack:

$value = from_markdown($value);

$value = strtr($value, [
    '<li><p>[ ] ' => '<li><p>&#x2610; ',
    '<li><p>[x] ' => '<li><p>&#x2612; ',
    '<li>[ ] ' => '<li>&#x2610; ',
    '<li>[x] ' => '<li>&#x2612; '
]);

echo $value;

Pre-Defined Abbreviations, Notes, and References

By inserting abbreviations, notes, and references at the end of the Markdown content, it will be as if you had a pre-defined abbreviations, notes, and references feature. They should be placed at the end of the Markdown content because, according to the link reference definitions specification, the first declared reference always takes precedence:

$abbreviations = [
    'CSS' => 'Cascading Style Sheet',
    'HTML' => 'Hyper Text Markup Language',
    'JS' => 'JavaScript'
];

$references = [
    'mecha-cms' => ['https://github.com/mecha-cms', 'Mecha CMS', []],
    'taufik-nurrohman' => ['https://github.com/taufik-nurrohman', 'Taufik Nurrohman', []],
];

$suffix = "";

if (!empty($abbreviations)) {
    foreach ($abbreviations as $k => $v) {
        $k = strtr($k, [
            '[' => '\[',
            ']' => '\]'
        ]);
        $v = trim(preg_replace('/\s+/', ' ', $v));
        $suffix .= "\n*[" . $k . ']: ' . $v;
    }
}

if (!empty($references)) {
    foreach ($references as $k => $v) {
        [$link, $title, $attributes] = $v;
        $k = strtr($k, [
            '[' => '\[',
            ']' => '\]'
        ]);
        if ("" === $link || false !== strpos($link, ' ')) {
            $link = '<' . $link . '>';
        }
        $reference = '[' . $k . ']: ' . $link;
        if (!empty($title)) {
            $reference .= " '" . strtr($title, ["'" => "\\'"]) . "'";
        }
        if (!empty($attributes)) {
            foreach ($attributes as $kk => &$vv) {
                // `{.asdf}`
                if ('class' === $kk) {
                    $vv = '.' . trim(preg_replace('/\s+/', '.', $vv));
                    continue;
                }
                // `{#asdf}`
                if ('id' === $kk) {
                    $vv = '#' . $vv;
                    continue;
                }
                // `{asdf}`
                if (true === $vv) {
                    $vv = $kk;
                    continue;
                }
                // `{asdf=""}`
                if ("" === $vv) {
                    $vv = $kk . '=""';
                    continue;
                }
                // `{asdf='asdf'}`
                $vv = $kk . "='" . strtr($vv, ["'" => "\\'"]) . "'";
            }
            unset($vv);
            sort($attributes);
            $attributes = trim(strtr(implode(' ', $attributes), [
                ' #' => '#',
                ' .' => '.'
            ]));
            $reference .= ' {' . $attributes . '}';
        }
        $suffix .= "\n" . $reference;
    }
}

$value = from_markdown($value . "\n" . $suffix);

echo $value;

Pre-Defined Header’s ID

Add an automatic id attribute to headers level 2 through 6 if it’s not set, and then prepend an anchor element that points to it:

$value = from_markdown($value);

if ($value && false !== strpos($value, '</h')) {
    $value = preg_replace_callback('/<(h[2-6])(\s(?>"[^"]*"|\'[^\']*\'|[^>])*)?>([\s\S]+?)<\/\1>/', static function ($m) {
        if (!empty($m[2]) && false !== strpos($m[2], 'id=') && preg_match('/\bid=("[^"]+"|\'[^\']+\'|[^\/>\s]+)/', $m[2], $n)) {
            if ('"' === $n[1][0] && '"' === substr($n[1], -1)) {
                $id = substr($n[1], 1, -1);
            } else if ("'" === $n[1][0] && "'" === substr($n[1], -1)) {
                $id = substr($n[1], 1, -1);
            } else {
                $id = $n[1];
            }
            $m[3] = '<a href="#' . htmlspecialchars($id) . '" style="text-decoration: none;">⚓</a> ' . $m[3];
            return '<' . $m[1] . $m[2] . '>' . $m[3] . '</' . $m[1] . '>';
        }
        // Strip inline HTML tags first, so that tag names do not end up in the generated ID
        $id = trim(preg_replace('/[^a-z\x{4e00}-\x{9fa5}\d]+/u', '-', strtolower(strip_tags($m[3]))), '-');
        $m[3] = '<a href="#' . htmlspecialchars($id) . '" style="text-decoration: none;">⚓</a> ' . $m[3];
        return '<' . $m[1] . ($m[2] ?? "") . ' id="' . htmlspecialchars($id) . '">' . $m[3] . '</' . $m[1] . '>';
    }, $value);
}

echo $value;

Idea: Embed Syntax

The CommonMark specification for automatic links doesn’t limit specific types of URL protocols. It just specifies the pattern, so we can take advantage of the automatic link syntax to render it as a kind of “embed” syntax, which you can then turn into a chunk of HTML elements.

I’m sure this idea has never been done before and that’s why I want to be the first to mention it. But I’m not going to integrate this feature directly into my converter to keep it slim. I just want to give you a couple of ideas.

Be aware that these tweaks are very naive, as they will directly convert the “embed” syntax without taking the block type into account. You may need to apply these filters only in certain block types, e.g. to ignore the “embed” syntax inside a fenced code block.
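
One possible way to skip fenced code blocks is sketched below. The function name `filter_outside_fences()` is hypothetical, and the fence detection is deliberately simplified (it ignores indented code blocks, block quotes, and list nesting), so treat it as a starting point rather than a complete solution.

```php
<?php

// Hypothetical helper: apply `$filter` only to the parts of the Markdown
// source that are outside of fenced code blocks.
function filter_outside_fences(string $value, callable $filter): string {
    $out = [];
    $buffer = []; // Lines collected outside of fenced code blocks
    $fence = null; // The opening fence marker while inside a fenced code block
    foreach (explode("\n", $value) as $line) {
        if (null === $fence) {
            if (preg_match('/^[ ]{0,3}(`{3,}|~{3,})/', $line, $m)) {
                // A fenced code block starts: filter what was collected so far
                if ($buffer) {
                    $out[] = $filter(implode("\n", $buffer));
                    $buffer = [];
                }
                $fence = $m[1];
                $out[] = $line;
                continue;
            }
            $buffer[] = $line;
            continue;
        }
        // Inside a fenced code block: keep the line untouched
        $out[] = $line;
        // The closing fence uses the same character and is at least as long
        if (preg_match('/^[ ]{0,3}(`{3,}|~{3,})[ \t]*$/', $line, $m) && $m[1][0] === $fence[0] && strlen($m[1]) >= strlen($fence)) {
            $fence = null;
        }
    }
    if ($buffer) {
        $out[] = $filter(implode("\n", $buffer));
    }
    return implode("\n", $out);
}

$value = "<youtube:dQw4w9WgXcQ>\n\n```\n<youtube:dQw4w9WgXcQ>\n```";

// Only the first line is converted; the line inside the code block is kept
echo filter_outside_fences($value, static function ($part) {
    return preg_replace('/^[ ]{0,3}<youtube:([^>]+)>[^\S\n]*$/m', '<iframe src="https://www.youtube.com/embed/$1"></iframe>', $part);
});
```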

YouTube Video Embed

An embed syntax to display a YouTube video by video ID.

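<youtube:dQw4w9WgXcQ>

// Match trailing white-space on the same line only, so the blank line after the embed is kept
$value = preg_replace('/^[ ]{0,3}<youtube:([^>]+)>[^\S\n]*$/m', '<iframe src="https://www.youtube.com/embed/$1"></iframe>', $value);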

$value = from_markdown($value);

echo $value;

GitHub Gist Embed

An embed syntax to display a GitHub gist by gist ID.

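<gist:9c96049ca6c66e30e50793f5aef4818b>

// Match trailing white-space on the same line only, so the blank line after the embed is kept
$value = preg_replace('/^[ ]{0,3}<gist:([^>]+)>[^\S\n]*$/m', '<script src="https://gist.github.com/taufik-nurrohman/$1.js"></script>', $value);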

$value = from_markdown($value);

echo $value;

Form Embed

An embed syntax to display an HTML form that was generated on the server side, with a reference ID of 18a4596d42c and a title parameter to customize the HTML form title.

<form:18a4596d42c?title=Form+Title>

$value = preg_replace_callback('/^[ ]{0,3}<form:([^#>?]+)([?][^#>]*)?([#][^>]*)?>[^\S\n]*$/m', static function ($m) {
    $path = $m[1];
    $value = "";
    // Parse the query string part (without the leading `?`) into `$state`
    parse_str(substr($m[2] ?? "", 1), $state);
    // Escape values taken from the Markdown source before putting them into the HTML
    $value .= '<form action="/form/' . htmlspecialchars($path) . '" method="post">';
    if (!empty($state['title']) && is_string($state['title'])) {
        $value .= '<h1>' . htmlspecialchars($state['title']) . '</h1>';
    }
    // … etc.
    // Be careful not to include blank line(s), or the raw HTML block state will end before the HTML form is complete!
    $value .= '</form>';
    return $value;
}, $value);

$value = from_markdown($value);

echo $value;

Idea: Note Block

Several people have discussed this feature, and I think I like this answer the most. The syntax is compatible with native Markdown syntax, which is nice to look at directly through the Markdown source, even when it gets rendered to HTML:

------------------------------

  **NOTE:** asdf asdf asdf

------------------------------

------------------------------

  **NOTE:**

  asdf asdf asdf asdf
  asdf asdf asdf asdf

  asdf asdf asdf asdf

------------------------------

Most Markdown converters will render the syntax above to this HTML, which is still acceptable as a note block from its presentation, despite its broken semantics:

<hr /><p><strong>NOTE:</strong> asdf asdf asdf</p><hr />
<hr /><p><strong>NOTE:</strong></p><p>asdf asdf asdf asdf asdf asdf asdf asdf</p><p>asdf asdf asdf asdf</p><hr />

With regular expressions, you can improve its semantics:

$value = from_markdown($value);

$value = preg_replace_callback('/<hr\s*\/?>(<p><strong>NOTE:<\/strong>[\s\S]*?<\/p>)<hr\s*\/?>/', static function ($m) {
    return '<div role="note">' . $m[1] . '</div>';
}, $value);

echo $value;

License

This library is licensed under the MIT License. Please consider donating 💰 if you benefit financially from this library.