lastdragon-ru/diy-parser

Yet another package to simplify writing fast and memory-effective parsers that can parse infinite strings.

dev-main 2025-08-21 10:54 UTC

This package is auto-updated.

Last update: 2025-08-21 10:54:56 UTC


README

There are several tools to generate full-featured parsers even for PHP1. They are overkill when you just need to parse something simple. In such cases, you might decide to create your own parser. There are a lot of articles/examples on the web, and actually it is not too difficult as you may think. This is yet another package to simplify writing fast and memory-effective parsers that can parse infinite strings.

Requirements

Requirement Constraint Supported by
PHP ^8.4 HEAD , 9.2.0
^8.3 HEAD , 9.2.0

Installation

composer require lastdragon-ru/diy-parser

Introduction

As an example, I will show how to create a parser for mathematical expressions like 2 - (1 + 2) / 3 and how to calculate them as a bonus.

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Examples;

use LastDragon_ru\DiyParser\Docs\Calculator\Calculator;
use LastDragon_ru\LaraASP\Dev\App\Example;

Example::dump(
    (new Calculator())->calculate('2 - (1 + 2) / 3'),
);

The (new Calculator())->calculate('2 - (1 + 2) / 3') is:

1

Our mathematical expression will be a string, subject to the following rules:

  • numbers (positive integers only)
  • operators (addition +, subtraction -, multiplication *, and division *)
  • sub-expressions inside ( and )
  • space(s) that can be used as delimiters to increase readability
  • Numbers and sub-expressions should be separated by an operator

We will start from the end, from the complete implementation of the parser, and then I will describe what is going inside it. So

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Calculator;

use LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNodeChild;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNodeFactory;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorAdditionNode;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorDivisionNode;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorMultiplicationNode;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorNode;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorSubtractionNode;
use LastDragon_ru\DiyParser\Iterables\TransactionalIterable;
use LastDragon_ru\DiyParser\Tokenizer\Token;
use LastDragon_ru\DiyParser\Tokenizer\Tokenizer;
use LogicException;

use function filter_var;
use function preg_match;

use const FILTER_NULL_ON_FAILURE;
use const FILTER_VALIDATE_INT;

class Parser {
    public function __construct() {
        // empty
    }

    public function parse(string $pattern): ?ExpressionNode {
        $node = null;

        try {
            $iterable = (new Tokenizer(Name::class))->tokenize([$pattern]);
            $node     = $this->parseExpression($iterable);
        } catch (LogicException) {
            // The `$pattern` is not a valid expression
        }

        return $node;
    }

    /**
     * @param iterable<mixed, Token<Name>> $iterable
     */
    protected function parseExpression(iterable $iterable): ?ExpressionNode {
        $iterable = new TransactionalIterable($iterable, 64, 1);
        $factory  = new ExpressionNodeFactory();

        while ($iterable->valid()) {
            $factory->push($this->parseExpressionChild($iterable));
        }

        return $factory->create();
    }

    /**
     * @param TransactionalIterable<Token<Name>> $iterable
     */
    protected function parseExpressionChild(TransactionalIterable $iterable): ?ExpressionNodeChild {
        return $this->parseSubExpression($iterable)
            ?? $this->parseOperator($iterable)
            ?? $this->parseNumber($iterable)
            ?? $this->parseSpace($iterable);
    }

    /**
     * @param TransactionalIterable<Token<Name>> $iterable
     */
    protected function parseSubExpression(TransactionalIterable $iterable): ?ExpressionNode {
        // Is `(`?
        if ($iterable[0]?->is(Name::LeftParenthesis) !== true) {
            return null;
        }

        // Begin
        $iterable->begin();
        $iterable->next();

        // Parse
        $node    = null;
        $factory = new ExpressionNodeFactory();

        while ($iterable->valid()) {
            // Is `)`?
            if ($iterable[0]?->is(Name::RightParenthesis) === true) {
                $node = $factory->create();

                $iterable->next();

                break;
            }

            // Child
            $factory->push(
                $this->parseExpressionChild($iterable),
            );
        }

        // Commit
        $iterable->end($node);

        // Return
        return $node;
    }

    /**
     * @param TransactionalIterable<Token<Name>> $iterable
     */
    protected function parseOperator(TransactionalIterable $iterable): ?OperatorNode {
        $node = match ($iterable[0]->name ?? null) {
            Name::Plus     => new OperatorAdditionNode(),
            Name::Minus    => new OperatorSubtractionNode(),
            Name::Asterisk => new OperatorMultiplicationNode(),
            Name::Slash    => new OperatorDivisionNode(),
            default        => null,
        };

        if ($node !== null) {
            $iterable->next();
        }

        return $node;
    }

    /**
     * @param TransactionalIterable<Token<Name>> $iterable
     */
    protected function parseNumber(TransactionalIterable $iterable): ?NumberNode {
        // String?
        if ($iterable[0]?->is(null) !== true) {
            return null;
        }

        // Number?
        $number = $iterable[0]->value;
        $number = preg_match('/^[0-9]+$/u', $number) === 1
            ? filter_var($number, FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE)
            : null;
        $node   = $number !== null ? new NumberNode($number) : null;

        if ($node !== null) {
            $iterable->next();
        }

        // Return
        return $node;
    }

    /**
     * @param TransactionalIterable<Token<Name>> $iterable
     */
    protected function parseSpace(TransactionalIterable $iterable): null {
        // Only spaces allowed here
        if ($iterable[0]?->is(Name::Space) !== true) {
            throw new LogicException('The string is not a mathematical expression.');
        } else {
            $iterable->next();
        }

        return null;
    }
}

And the resulting AST:

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Examples;

use LastDragon_ru\DiyParser\Docs\Calculator\Parser;
use LastDragon_ru\LaraASP\Dev\App\Example;

Example::dump(
    (new Parser())->parse('2 - (1 + 2) / 3'),
);

The (new Parser())->parse('2 - (1 + 2) / 3') is:

LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {
  +children: [
    LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
      +value: 2
    },
    LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorSubtractionNode {},
    LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {
      +children: [
        LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
          +value: 1
        },
        LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorAdditionNode {},
        LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
          +value: 2
        },
      ]
    },
    LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorDivisionNode {},
    LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
      +value: 3
    },
  ]
}

Tokenizer

First, we should split the input string into tokens that is happening inside the Parser::parse(). The Tokenizer accept a backed enum(s) and uses their values as delimiters (value can be a single character or a string). According to our rules, we need the following tokens:

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Calculator;

enum Name: string {
    case Plus             = '+';
    case Minus            = '-';
    case Asterisk         = '*';
    case Slash            = '/';
    case Space            = ' ';
    case LeftParenthesis  = '(';
    case RightParenthesis = ')';
}

After tokenization, we will have the list of tokens for further processing. Note that Token::$name contains the enum case or null for strings.

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Examples;

use LastDragon_ru\DiyParser\Docs\Calculator\Name;
use LastDragon_ru\DiyParser\Tokenizer\Tokenizer;
use LastDragon_ru\LaraASP\Dev\App\Example;

use function iterator_to_array;

$input     = '2 - (1 + 2) / 3';
$tokenizer = new Tokenizer(Name::class);
$tokens    = $tokenizer->tokenize([$input]);

Example::dump(iterator_to_array($tokens));
Example output

The iterator_to_array($tokens) is:

[
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: null
    +value: "2"
    +offset: 0
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1
      +name: "Space"
      +value: " "
    }
    +value: " "
    +offset: 1
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
      +name: "Minus"
      +value: "-"
    }
    +value: "-"
    +offset: 2
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
    +value: " "
    +offset: 3
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
      +name: "LeftParenthesis"
      +value: "("
    }
    +value: "("
    +offset: 4
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: null
    +value: "1"
    +offset: 5
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
    +value: " "
    +offset: 6
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
      +name: "Plus"
      +value: "+"
    }
    +value: "+"
    +offset: 7
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
    +value: " "
    +offset: 8
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: null
    +value: "2"
    +offset: 9
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
      +name: "RightParenthesis"
      +value: ")"
    }
    +value: ")"
    +offset: 10
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
    +value: " "
    +offset: 11
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
      +name: "Slash"
      +value: "/"
    }
    +value: "/"
    +offset: 12
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
    +value: " "
    +offset: 13
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: null
    +value: "3"
    +offset: 14
  },
]

Parser

Next, we pass tokens into Parser::parseExpression() where we create the instance of TransactionalIterable and start converting tokens into AST. The TransactionalIterable is very important - it's not only return current/next/previous token and to navigate back and forward, but support transactions to process nested structures (e.g. sub-expressions) and rollback if parsing failed (eg if ) is missed).

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Examples;

use LastDragon_ru\DiyParser\Docs\Calculator\Name;
use LastDragon_ru\DiyParser\Iterables\TransactionalIterable;
use LastDragon_ru\DiyParser\Tokenizer\Tokenizer;
use LastDragon_ru\LaraASP\Dev\App\Example;

$input    = '1 + 2';
$tokens   = (new Tokenizer(Name::class))->tokenize([$input]);
$iterable = new TransactionalIterable($tokens, 5, 5);

Example::dump($iterable[0]);
Example::dump($iterable[4]);

$iterable->next(2);
$iterable->begin();     // start nested

Example::dump($iterable[-2]);
Example::dump($iterable[0]);
Example::dump($iterable[2]);

$iterable->rollback();  // oops

Example::dump($iterable[0]);
Example output

The $iterable[0] is:

LastDragon_ru\DiyParser\Tokenizer\Token {
  +name: null
  +value: "1"
  +offset: 0
}

The $iterable[4] is:

LastDragon_ru\DiyParser\Tokenizer\Token {
  +name: null
  +value: "2"
  +offset: 4
}

The $iterable[-2] is:

LastDragon_ru\DiyParser\Tokenizer\Token {
  +name: null
  +value: "1"
  +offset: 0
}

The $iterable[0] is:

LastDragon_ru\DiyParser\Tokenizer\Token {
  +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
    +name: "Plus"
    +value: "+"
  }
  +value: "+"
  +offset: 2
}

The $iterable[2] is:

LastDragon_ru\DiyParser\Tokenizer\Token {
  +name: null
  +value: "2"
  +offset: 4
}

The $iterable[0] is:

LastDragon_ru\DiyParser\Tokenizer\Token {
  +name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
    +name: "Plus"
    +value: "+"
  }
  +value: "+"
  +offset: 2
}

All other methods just create AST nodes and check that the expression is correct. You may notice the ExpressionNodeFactory class, which is a subclass of NodeParentFactory. In our case, the ExpressionNodeFactory helps to simplify the code, and it also checks that an operator between numbers/expressions (one of our requirements).

In the general case, the main reason of NodeParentFactory is merging sequence of child nodes (of the same class) into one node. The Tokenizer may generate a lot of "string" tokens (especially if escaping is supported), but we usually want only one (abc) node in AST instead of multiple (a, b, and c) nodes. If you use static analysis tools like PHPStan, the class will guaranties the type safety too.

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Examples;

use LastDragon_ru\DiyParser\Ast\NodeChild;
use LastDragon_ru\DiyParser\Ast\NodeParentFactory;
use LastDragon_ru\DiyParser\Ast\NodeParentImpl;
use LastDragon_ru\DiyParser\Ast\NodeString;
use LastDragon_ru\LaraASP\Dev\App\Example;
use Override;

// phpcs:disable PSR1.Files.SideEffects
// phpcs:disable PSR1.Classes.ClassDeclaration.MultipleClasses

/**
 * @implements NodeChild<ParentNode>
 */
class ChildNode extends NodeString implements NodeChild {
    // empty
}

/**
 * @extends NodeParentImpl<ChildNode>
 */
class ParentNode extends NodeParentImpl {
    // empty
}

/**
 * @extends NodeParentFactory<ParentNode, ChildNode>
 */
class ParentNodeFactory extends NodeParentFactory {
    /**
     * @inheritDoc
     */
    #[Override]
    protected function onCreate(array $children): ?object {
        return $children !== [] ? new ParentNode($children) : null;
    }

    /**
     * @inheritDoc
     */
    #[Override]
    protected function onPush(array $children, ?object $node): bool {
        return true;
    }
}

$factory = new ParentNodeFactory();

$factory->push(new ChildNode('a'));
$factory->push(new ChildNode('b'));
$factory->push(new ChildNode('c'));

Example::dump($factory->create()); // create and reset
Example::dump($factory->create()); // `null`

The $factory->create() is:

LastDragon_ru\DiyParser\Docs\Examples\ParentNode {
  +children: [
    LastDragon_ru\DiyParser\Docs\Examples\ChildNode {
      +string: "abc"
    },
  ]
}

The $factory->create() is:

null

AST

As you can see, our AST for mathematical expressions doesn't have references to parent/next/previous nodes. It makes the parser simple, and reduce memory usage. The Cursor class can be used to navigate the AST in all directions. Please note that your nodes must implement NodeParent and NodeChild if you want to use the cursor.

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Examples;

use LastDragon_ru\DiyParser\Ast\Cursor;
use LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode;
use LastDragon_ru\DiyParser\Docs\Calculator\Parser;
use LastDragon_ru\LaraASP\Dev\App\Example;

use function assert;

// Parse
$ast = (new Parser())->parse('2 - (1 + 2) / 3');

assert($ast instanceof ExpressionNode);

// Create the cursor
$cursor = new Cursor($ast);

// Children can be iterated directly
foreach ($cursor as $child) {
    if ($child->node instanceof ExpressionNode) {
        Example::dump($child->node);
        break;
    }
}

// Also possible to get n-th child
Example::dump($cursor[2]);

// And next/previous
Example::dump($cursor[2]->next->node ?? null);
Example::dump($cursor[2]->previous->node ?? null);
Example output
LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {
  +children: [
    LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
      +value: 1
    },
    LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorAdditionNode {},
    LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
      +value: 2
    },
  ]
}

The $cursor[2] is:

LastDragon_ru\DiyParser\Ast\Cursor {
  +node: LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {#1
    +children: [
      LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
        +value: 1
      },
      LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorAdditionNode {},
      LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
        +value: 2
      },
    ]
  }
  +parent: LastDragon_ru\DiyParser\Ast\Cursor {
    +node: LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {
      +children: [
        LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
          +value: 2
        },
        LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorSubtractionNode {},
        LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {#1},
        LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorDivisionNode {},
        LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
          +value: 3
        },
      ]
    }
    +parent: null
    +index: null
  }
  +index: 2
}

The $cursor[2]->next->node ?? null is:

LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorDivisionNode {}

The $cursor[2]->previous->node ?? null is:

LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorSubtractionNode {}

Escape

Of course, it is supported. You just need to specify the token to use as escaping string:

<?php declare(strict_types = 1);

namespace LastDragon_ru\DiyParser\Docs\Examples;

use LastDragon_ru\DiyParser\Tokenizer\Tokenizer;
use LastDragon_ru\LaraASP\Dev\App\Example;

use function iterator_to_array;

// phpcs:disable PSR1.Files.SideEffects
// phpcs:disable PSR1.Classes.ClassDeclaration.MultipleClasses

enum Name: string {
    case Slash     = '/';
    case Backslash = '\\';
}

$input     = 'a/b\\/\\c';
$tokenizer = new Tokenizer(Name::class, Name::Backslash);
$tokens    = $tokenizer->tokenize([$input]);

Example::dump(iterator_to_array($tokens));

The iterator_to_array($tokens) is:

[
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: null
    +value: "a"
    +offset: 0
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Examples\Name {
      +name: "Slash"
      +value: "/"
    }
    +value: "/"
    +offset: 1
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: null
    +value: "b"
    +offset: 2
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: null
    +value: "/"
    +offset: 4
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: LastDragon_ru\DiyParser\Docs\Examples\Name {
      +name: "Backslash"
      +value: "\"
    }
    +value: "\"
    +offset: 5
  },
  LastDragon_ru\DiyParser\Tokenizer\Token {
    +name: null
    +value: "c"
    +offset: 6
  },
]

Limitations

Internally, the package uses preg_match() with u modifier to split string(s) into tokens because this is probably the faster way in PHP (if you have a better way, please create issue/pr/etc). The con is internal encoding is always UTF-8. Thus, if you need to parse strings in other encoding, you must convert them into UTF-8 before parsing.

Multiple tokens that are the same string/character are not supported yet. Moreover, if conflicted, no error will be reported. The priority is undefined.

The input string potentially may have infinite length. But, Token::$offset has int type so its max value is PHP_INT_MAX. Also, it may be limited by the size of the TransactionalIterable buffers (actual for nested nodes).

There are no line numbers, only Token::$offset from the beginning of the string. If you need them, you need to add token(s) for EOL and process/calculate within the parser.

There is also no special code to check nested/closed parentheses or throw an error if string cannot be parsed. It also should be done within parser implementation.

Bonus

For calculate mathematical expressions, we are using shunting yard algorithm to convert our AST into Reverse Polish Notation. Implementation is ever simpler than usual, because (...) represented as sub-nodes and no need any special actions, please see the ExpressionNode class.

Upgrading

Please follow Upgrade Guide.

Contributing

This package is the part of Awesome Set of Packages for Laravel. Please use the main repository to report issues, send pull requests, or ask questions.

Footnotes

  1. https://en.wikipedia.org/wiki/Comparison_of_parser_generators