remorhaz / php-unilex
Unilex: lexical analyzer generator with Unicode support written in PHP
Requires
- php: ~8.1.0 || ~8.2.0 || ~8.3.0
- nikic/php-parser: ^4.12 || ^5
- phpdocumentor/reflection-docblock: ^4.3 || ^5
- remorhaz/int-rangesets: ^0.3
- remorhaz/ucd: ^0.3
- symfony/console: ^6.1 || ^7
- thecodingmachine/safe: ^1.3.1 || ^2
Requires (Dev)
- bamarni/composer-bin-plugin: ^1.8
- phpunit/phpunit: ^10.1 || ^11
README
UniLex is a lexical analyzer generator (similar to lex and flex) with Unicode support.
It's written in PHP and generates PHP code.
[WIP] Work in progress
Requirements
- PHP 8.1, 8.2 or 8.3
License
The UniLex library is licensed under the MIT license.
Installation
Install it like any other Composer library:
composer require remorhaz/php-unilex
Usage
Quick start by example
Let's imagine we want to write a simple calculator and need a lexer (lexical analyzer) that provides a stream of IDs, numbers and operators. Create a new Composer project and execute the following command from the project directory:
composer require --dev remorhaz/php-unilex
The next step is to create a lexer specification in the LexerSpec.php file. We use the @lexToken tag in comments to specify the regular expression for a token:
<?php
/**
 * @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
 * @lexTargetClass TokenMatcher
 * @lexHeader
 */

const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;

/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context->setNewToken(TOKEN_ID);

/** @lexToken /[+\-*\/]/ */
$context->setNewToken(TOKEN_OPERATOR);

/** @lexToken /[0-9]+/ */
$context->setNewToken(TOKEN_NUMBER);
The next step is to build a token matcher from the specification:
vendor/bin/unilex LexerSpec.php > TokenMatcher.php
Now we have a compiled token matcher in the TokenMatcher.php file. Let's use it to read all tokens from the buffer:
<?php

use Remorhaz\UniLex\Lexer\TokenFactory;
use Remorhaz\UniLex\Lexer\TokenReader;
use Remorhaz\UniLex\Unicode\CharBufferFactory;

require_once "vendor/autoload.php";
require_once "TokenMatcher.php";

$buffer = CharBufferFactory::createFromString("x+2*3");
$tokenReader = new TokenReader($buffer, new TokenMatcher, new TokenFactory(0xFF));

do {
    $token = $tokenReader->read();
    echo "Token ID: {$token->getType()}\n";
} while (!$token->isEoi());
When run, this script outputs:
Token ID: 1
Token ID: 2
Token ID: 3
Token ID: 2
Token ID: 3
Token ID: 255
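The numeric IDs in this output are the constants from the specification, and 255 is the end-of-input type that the script passes to TokenFactory as 0xFF. A small lookup table can make such output easier to read. The following is only a minimal sketch: the helper name tokenName() and the label strings are hypothetical and not part of UniLex.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of UniLex): maps a token type to a label.
 * The keys mirror the quick-start constants; 0xFF is the EOI type.
 */
function tokenName(int $type): string
{
    $names = [
        1 => 'ID',        // TOKEN_ID
        2 => 'OPERATOR',  // TOKEN_OPERATOR
        3 => 'NUMBER',    // TOKEN_NUMBER
        0xFF => 'EOI',    // type passed to TokenFactory in the script above
    ];

    return $names[$type] ?? "UNKNOWN({$type})";
}

// Token types in the order the quick-start script prints them for "x+2*3".
foreach ([1, 2, 3, 2, 3, 0xFF] as $type) {
    echo tokenName($type), "\n";
}
```

Inside the read loop, the same helper could be called as tokenName($token->getType()).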
Let's go a bit further and make it possible to retrieve the text representation of every token from the input buffer. We need to modify the lexer specification so that it attaches the matched text to each non-EOI token as an attribute:
<?php
/**
 * @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
 * @lexTargetClass TokenMatcher
 * @lexHeader
 */

const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;

/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context
    ->setNewToken(TOKEN_ID)
    ->setTokenAttribute('text', $context->getSymbolString());

/** @lexToken /[+\-*\/]/ */
$context
    ->setNewToken(TOKEN_OPERATOR)
    ->setTokenAttribute('text', $context->getSymbolString());

/** @lexToken /[0-9]+/ */
$context
    ->setNewToken(TOKEN_NUMBER)
    ->setTokenAttribute('text', $context->getSymbolString());
After rebuilding the token matcher with the CLI utility, we need to modify the output loop of our example program so that it prints the text along with the token IDs:
do {
    $token = $tokenReader->read();
    echo
        "Token ID: {$token->getType()}",
        $token->isEoi() ? "\n" : " / '{$token->getAttribute('text')}'\n";
} while (!$token->isEoi());
Now the program prints:
Token ID: 1 / 'x'
Token ID: 2 / '+'
Token ID: 3 / '2'
Token ID: 2 / '*'
Token ID: 3 / '3'
Token ID: 255
CLI
You can use the command-line utility to build a token matcher from a specification:
vendor/bin/unilex path/to/spec/LexerSpec.php path/to/target/TokenMatcher.php --desc="My example matcher."
Specification
A specification is a PHP file that is split into parts by DocBlock comments with special tags. A special variable $context holds the context object, which implements the \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface interface. The current implementation also uses an int variable $char that contains the current symbol (TODO: should be moved into context object).
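To make the "split by DocBlock comments" idea concrete, here is a minimal sketch that cuts a PHP source string into parts at every DocBlock containing an @lex... tag, using PHP's built-in tokenizer. It is an illustration only and is not how UniLex itself reads specifications (the package depends on nikic/php-parser and phpdocumentor/reflection-docblock, so its real approach likely differs). The function name splitByLexDocBlocks() is hypothetical, and the sketch records tag names only, not their arguments.

```php
<?php

declare(strict_types=1);

/**
 * Illustration only: splits PHP source into parts, one part per DocBlock
 * that contains at least one @lex... tag.
 *
 * @return list<array{tags: list<string>, code: string}>
 */
function splitByLexDocBlocks(string $source): array
{
    $blocks = [];
    $current = null;

    foreach (token_get_all($source) as $token) {
        // Tokens are either [id, text, line] arrays or single-character strings.
        $id = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;

        if ($id === T_DOC_COMMENT && preg_match_all('/@lex\w+/', $text, $m) > 0) {
            // A tagged DocBlock closes the previous part and opens a new one.
            if ($current !== null) {
                $blocks[] = $current;
            }
            $current = ['tags' => $m[0], 'code' => ''];
            continue;
        }
        if ($current !== null) {
            $current['code'] .= $text;
        }
    }
    if ($current !== null) {
        $blocks[] = $current;
    }

    // Trim surrounding whitespace from every code part.
    return array_map(
        static fn (array $b): array => ['tags' => $b['tags'], 'code' => trim($b['code'])],
        $blocks,
    );
}

$spec = <<<'PHP'
<?php
/** @lexHeader */
const TOKEN_NUMBER = 3;

/** @lexToken /[0-9]+/ */
$context->setNewToken(TOKEN_NUMBER);
PHP;

foreach (splitByLexDocBlocks($spec) as $block) {
    echo implode(' ', $block['tags']), ' => ', $block['code'], "\n";
}
```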
@lexHeader
This block can contain namespace and use statements that will be used during matcher generation.
@lexBeforeMatch
This block is executed before the matching procedure begins and can be used to initialize additional variables.
@lexOnTransition
This block is executed on each symbol matched by a token's regular expression.
@lexToken /regexp/
This block is executed when the given regular expression matches the input buffer. Most commonly it just sets up a new token in the context object.
@lexMode 'mode_name'
This tag tells the parser that the accompanying @lexToken expression matches only if the current lexical mode is mode_name. The lexical mode can be switched with the $context->setMode('mode_name') method. Lexical modes make it possible to have several "sub-grammars" in one specification (e.g. some tokens can be recognized only in comments or strings).
@lexOnError
This block is executed if the matcher fails to match any of the tokens' regular expressions. By default it just returns false.