interitty / tokenizer
Use regular expressions to split a given string into tokens.
Requires
- php: ~8.3
- dg/composer-cleaner: ~2.2
Requires (Dev)
- interitty/code-checker: ~1.0
- interitty/phpunit: ~1.0
README
Requirements
- PHP >= 8.3
Installation
The best way to install interitty/tokenizer is using Composer:
composer require interitty/tokenizer
Tokenizer usage
The tokenization process needs two things: a map that pairs each token type with the regular expression matching it, and the string to be tokenized.
A simple tokenizer that separates a string into numbers, whitespace, and letters can look like the following code.
$tokenizer = new Tokenizer('say 123');
$tokenizer->map = [
'number' => '~^\d+~',
'whitespace' => '~^\s+~',
'string' => '~^\w+~'
];
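To illustrate why every pattern in the map starts with `^`, here is a minimal standalone sketch of map-driven tokenization. It does not use or reproduce the internals of interitty/tokenizer; the `sketchTokenize` function and the array shape of its result are assumptions made for illustration only.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of interitty/tokenizer: splits $input by
 * trying each anchored regex of $map against the remaining input.
 *
 * @param array<string, string> $map token type => anchored regex
 * @return list<array{type: string, value: string}>
 */
function sketchTokenize(string $input, array $map): array
{
    $tokens = [];
    $rest = $input;
    while ($rest !== '') {
        $matched = false;
        foreach ($map as $type => $regex) {
            // The leading ^ forces the match to begin at the start of $rest.
            if (preg_match($regex, $rest, $match) === 1 && $match[0] !== '') {
                $tokens[] = ['type' => $type, 'value' => $match[0]];
                $rest = substr($rest, strlen($match[0]));
                $matched = true;
                break;
            }
        }
        if ($matched === false) {
            throw new RuntimeException('No pattern matches near "' . substr($rest, 0, 10) . '"');
        }
    }
    return $tokens;
}

$tokens = sketchTokenize('say 123', [
    'number' => '~^\d+~',
    'whitespace' => '~^\s+~',
    'string' => '~^\w+~',
]);
```

In this sketch the first matching pattern wins, so the order of the map matters: `\w` also matches digits, which is why `number` is listed before `string`.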
Processing the tokens
Tokens can be accessed by iterating with the next and current methods until the TOKEN_END token appears.
$tokens = [];
do {
$token = $tokenizer->next();
$tokens[] = $token;
assert($token === $tokenizer->current());
} while ($token->getType() !== Token::TOKEN_END);
The resulting array of $tokens would look like the following (the loop also collects the final TOKEN_END token, which is not shown).
[
new Token('string', 'say', 1, 1),
new Token('whitespace', ' ', 1, 4),
new Token('number', '123', 1, 5),
]
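The last two constructor arguments in the listing above appear to be a 1-based line and column, which is an assumption read off the example values rather than something the README states. Under that assumption, a position could be derived from a byte offset roughly as follows; `sketchPosition` is a hypothetical helper, not a library function.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: converts a 0-based byte offset into a 1-based
 * [line, column] pair.
 *
 * @return array{int, int}
 */
function sketchPosition(string $input, int $offset): array
{
    $before = substr($input, 0, $offset);
    // Every newline before the offset starts a new line.
    $line = substr_count($before, "\n") + 1;
    $lastNewline = strrpos($before, "\n");
    // The column counts from the character after the last newline.
    $column = $lastNewline === false ? $offset + 1 : $offset - $lastNewline;
    return [$line, $column];
}

$positions = [
    sketchPosition('say 123', 0), // start of "say"
    sketchPosition('say 123', 3), // the space
    sketchPosition('say 123', 4), // start of "123"
];
```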
Skipping unnecessary tokens
In some cases, it may be useful to automatically skip some tokens and move on to others.
For that purpose, there are the addSkippedTokenType and setSkippedTokenTypes methods.
The TOKEN_END token can't be skipped.
$tokenizer->addSkippedTokenType('whitespace');
$string = '';
do {
$token = $tokenizer->next();
$string .= $token->getValue();
} while ($token->getType() !== Token::TOKEN_END);
assert('say123' === $string);
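The effect of skipping can be pictured as filtering a token list by type. The sketch below uses plain arrays in place of Token objects and is only an illustration of the idea; how the library implements skipping internally is not described in this README.

```php
<?php

declare(strict_types=1);

// Plain arrays stand in for Token objects in this sketch.
$tokens = [
    ['type' => 'string', 'value' => 'say'],
    ['type' => 'whitespace', 'value' => ' '],
    ['type' => 'number', 'value' => '123'],
];
$skipped = ['whitespace'];

// Keep only the tokens whose type is not in the skipped list.
$kept = array_values(array_filter(
    $tokens,
    static fn (array $token): bool => !in_array($token['type'], $skipped, true)
));

// Concatenate the remaining values, as the loop above does.
$string = implode('', array_column($kept, 'value'));
```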
Expecting tokens
The tokenizer includes a helper to expect the correct token type and value. This can simplify and unify the checking process.
$tokenizer = new Tokenizer('{some code}');
$tokenizer->map = [
'brackets' => '~^[{}]~',
'code' => '~^[^{}]+~',
];
$tokenizer->expect($tokenizer->next(), 'brackets', '{');
$tokenizer->expect($tokenizer->next(), 'code');
$code = $tokenizer->current()->getValue();
$tokenizer->expect($tokenizer->next(), 'brackets', '}');
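The kind of check such a helper performs can be sketched as a standalone function. This is a hypothetical stand-in, not the library's expect method: the name `sketchExpect`, the array-based token, and the choice to throw an UnexpectedValueException on a mismatch are all assumptions, since the README does not say how a failed expectation is reported.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for an expect-style check: verifies the token type
 * and, when given, the token value.
 *
 * @param array{type: string, value: string} $token
 */
function sketchExpect(array $token, string $type, ?string $value = null): void
{
    if ($token['type'] !== $type) {
        throw new UnexpectedValueException(
            sprintf('Expected token type "%s", got "%s"', $type, $token['type'])
        );
    }
    // A null $value means any value of the right type is acceptable.
    if ($value !== null && $token['value'] !== $value) {
        throw new UnexpectedValueException(
            sprintf('Expected token value "%s", got "%s"', $value, $token['value'])
        );
    }
}

// Passes silently: both the type and the value match.
sketchExpect(['type' => 'brackets', 'value' => '{'], 'brackets', '{');

// Fails: the type differs, so the sketch throws.
$failed = false;
try {
    sketchExpect(['type' => 'code', 'value' => 'some code'], 'brackets');
} catch (UnexpectedValueException $exception) {
    $failed = true;
}
```

Centralizing the check this way keeps error messages uniform across a parser, which matches the stated goal of simplifying and unifying the checking process.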
BaseTokenizerParser usage
One possible use of the Tokenizer is in the BaseTokenizerParser, which provides the functionality of parsing the given string into a stream of tokens. It can be useful for validating that a given string is compatible with the expected grammar and for parsing it into a structured array.
This functionality is used in interitty/pacc.
BaseParser usage
When a custom implementation of the Tokenizer is needed, there is the abstract BaseParser class. It allows implementing custom logic for working with the current and next Token, and a custom mechanism for working with the tokenType and tokenLexeme.