README

Use regular expressions to split a given string into tokens.

Requirements

PHP >= 8.3

Installation

The best way to install interitty/tokenizer is using Composer:

composer require interitty/tokenizer

Tokenizer usage

Tokenization needs two things: the string to be tokenized and a map from token types to the regular expressions that match them. A simple tokenizer that splits a string into numbers, whitespace, and word characters can look like the following code.

$tokenizer = new Tokenizer('say 123');
$tokenizer->map = [
    'number' => '~^\d+~',
    'whitespace' => '~^\s+~',
    'string' => '~^\w+~'
];
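To illustrate how such a map can drive tokenization, here is a minimal, self-contained sketch in plain PHP. It is not the library's implementation: the function name sketchTokenize and the [type, value] array shape are hypothetical, and the sketch assumes patterns are tried in map order at the start of the remaining input.

```php
<?php

/**
 * Hypothetical helper, not part of interitty/tokenizer.
 * Splits $input into [type, value] pairs by trying each anchored
 * pattern in map order against the start of the remaining input.
 *
 * @param array<string, string> $map token type => anchored regular expression
 * @return list<array{0: string, 1: string}>
 */
function sketchTokenize(string $input, array $map): array
{
    $tokens = [];
    while ($input !== '') {
        $matched = false;
        foreach ($map as $type => $pattern) {
            // Every pattern starts with ^, so it can only match at the current position.
            if (preg_match($pattern, $input, $match) === 1 && $match[0] !== '') {
                $tokens[] = [$type, $match[0]];
                // Drop the matched lexeme and continue with the rest of the input.
                $input = substr($input, strlen($match[0]));
                $matched = true;
                break;
            }
        }
        if ($matched === false) {
            throw new RuntimeException('No pattern matches: ' . $input);
        }
    }
    return $tokens;
}

$tokens = sketchTokenize('say 123', [
    'number' => '~^\d+~',
    'whitespace' => '~^\s+~',
    'string' => '~^\w+~',
]);
```

In this sketch the order of the map matters: \w also matches digits, so 'number' is listed before 'string' to make sure that '123' is reported as a number.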

Processing the tokens

Tokens can be read by calling the next method repeatedly until a token of type TOKEN_END is returned. The current method returns the token most recently returned by next.

$tokens = [];
do {
    $token = $tokenizer->next();
    $tokens[] = $token;

    assert($token === $tokenizer->current());
} while ($token->getType() !== Token::TOKEN_END);

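The resulting $tokens array would look like the following. Because the loop stores each token before checking its type, the array also ends with a TOKEN_END token, which is omitted here.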

[
    new Token('string', 'say', 1, 1),
    new Token('whitespace', ' ', 1, 4),
    new Token('number', '123', 1, 5),
]
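The last two constructor arguments in the listing above look like a 1-based line and column of the token start. That reading is an assumption, not something this README states. Under that assumption, the positions could be tracked as in the following sketch, where advancePosition is a hypothetical helper and not part of the library.

```php
<?php

/**
 * Hypothetical helper, not part of interitty/tokenizer.
 * Returns the 1-based [line, column] that follows $lexeme when the
 * lexeme starts at the given line and column.
 *
 * @return array{0: int, 1: int}
 */
function advancePosition(int $line, int $column, string $lexeme): array
{
    $newlines = substr_count($lexeme, "\n");
    if ($newlines === 0) {
        // Same line: the column moves forward by the lexeme length.
        return [$line, $column + strlen($lexeme)];
    }
    // The lexeme spans lines: the column restarts after its last newline.
    $lastNewline = strrpos($lexeme, "\n");
    return [$line + $newlines, strlen($lexeme) - $lastNewline];
}

$afterSay = advancePosition(1, 1, 'say');
$afterSpace = advancePosition($afterSay[0], $afterSay[1], ' ');
```

Starting at line 1, column 1, the lexeme 'say' moves the position to column 4 and the following space to column 5, which agrees with the listing.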

Skipping unnecessary tokens

In some cases it is useful to skip certain token types automatically and move on to the next token. The addSkippedTokenType and setSkippedTokenTypes methods serve this purpose. The TOKEN_END token cannot be skipped.

$tokenizer->addSkippedTokenType('whitespace');

$string = '';
do {
    $token = $tokenizer->next();
    $string .= $token->getValue();
} while ($token->getType() !== Token::TOKEN_END);
assert('say123' === $string);
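The skipping rule can be pictured as a filter over the token stream. The following self-contained sketch only illustrates that idea on plain [type, value] arrays. The name sketchSkip and the 'end' type string are hypothetical stand-ins, not the library's API or the real value of Token::TOKEN_END.

```php
<?php

/**
 * Hypothetical helper, not part of interitty/tokenizer.
 * Drops tokens whose type is listed in $skippedTypes, but always keeps
 * the end token, even if its type was listed by mistake.
 *
 * @param list<array{0: string, 1: string}> $tokens
 * @param list<string> $skippedTypes
 * @return list<array{0: string, 1: string}>
 */
function sketchSkip(array $tokens, array $skippedTypes, string $endType = 'end'): array
{
    $kept = [];
    foreach ($tokens as [$type, $value]) {
        if ($type !== $endType && in_array($type, $skippedTypes, true)) {
            continue; // skipped type: move on to the next token
        }
        $kept[] = [$type, $value];
    }
    return $kept;
}

$kept = sketchSkip(
    [['string', 'say'], ['whitespace', ' '], ['number', '123'], ['end', '']],
    ['whitespace', 'end']
);
// Join the remaining token values, as the loop above does.
$string = implode('', array_column($kept, 1));
```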

Expecting tokens

The tokenizer includes an expect helper that checks a token against an expected type and, optionally, an expected value. This simplifies and unifies the checking process.

$tokenizer = new Tokenizer('{some code}');
$tokenizer->map = [
    'brackets' => '~^[{}]~',
    'code' => '~^[^{}]+~',
];

$tokenizer->expect($tokenizer->next(), 'brackets', '{');
$tokenizer->expect($tokenizer->next(), 'code');

$code = $tokenizer->current()->getValue();

$tokenizer->expect($tokenizer->next(), 'brackets', '}');
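This README does not say how expect reacts to a mismatch. The following self-contained sketch shows one plausible shape for such a check, assuming it throws an exception. The function sketchExpect, the [type, value] array shape, and the choice of UnexpectedValueException are all assumptions made for illustration.

```php
<?php

/**
 * Hypothetical helper, not part of interitty/tokenizer.
 * Returns the token unchanged when it has the expected type and, if a
 * value is given, the expected value. Throws otherwise (an assumption).
 *
 * @param array{0: string, 1: string} $token
 * @return array{0: string, 1: string}
 */
function sketchExpect(array $token, string $type, ?string $value = null): array
{
    [$actualType, $actualValue] = $token;
    if ($actualType !== $type) {
        throw new UnexpectedValueException(
            sprintf('Expected token type "%s", got "%s"', $type, $actualType)
        );
    }
    // A null $value means that any value of the expected type is accepted.
    if ($value !== null && $actualValue !== $value) {
        throw new UnexpectedValueException(
            sprintf('Expected token value "%s", got "%s"', $value, $actualValue)
        );
    }
    return $token;
}

$tokens = [['brackets', '{'], ['code', 'some code'], ['brackets', '}']];

sketchExpect($tokens[0], 'brackets', '{');
$code = sketchExpect($tokens[1], 'code')[1];
sketchExpect($tokens[2], 'brackets', '}');
```

Returning the checked token lets the caller read its value in the same expression, which keeps parsing code compact.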

BaseTokenizerParser usage

One way to use the Tokenizer is through BaseTokenizerParser, which parses a given string into a stream of tokens. It is useful for validating that a string conforms to the expected grammar and for parsing it into a structured array.

This functionality is used in interitty/pacc.

BaseParser usage

When a custom Tokenizer implementation is needed, the BaseParser abstract class allows implementing custom logic for working with the current and next Token, along with a custom mechanism for handling the tokenType and tokenLexeme.
