dbeurive / lexer
This project implements a simple lexer.
Requires (Dev)
- phpdocumentor/phpdocumentor: 2.*
- phpunit/phpunit: 5.5.*
Last update: 2025-01-13 13:33:53 UTC
README
This repository contains the implementation of a basic lexer.
A lexer explodes a given string into a list of tokens.
Installation
From the command line:
composer require dbeurive/lexer
Alternatively, to include this package in your project, edit your composer.json file and add the following entry:
"require": {
    "dbeurive/lexer": "*"
}
Synopsis
use dbeurive\Lexer\Lexer;
use dbeurive\Lexer\Token;

// Transformer for variables: "$gName" becomes "GLOBAL_name", "$lName" becomes "LOCAL_name".
$varProcessor = function(array $inMatches) {
    $name = strtolower($inMatches[2]);
    switch (strtoupper($inMatches[1])) {
        case 'L': return 'LOCAL_' . $name;
        case 'G': return 'GLOBAL_' . $name;
    }
    throw new \Exception("Impossible error!");
};

$specifications = array(
    array('/[0-9]+/', 'numeric'),
    array('/\\$([lg])([a-z0-9]+)/i', 'variable', $varProcessor),
    array('/[a-z]{2,}/i', 'function'),
    array('/(\\+|\\-|\\*|\\/)/', 'operator'),
    array('/\\(/', 'open_bracket'),
    array('/\\)/', 'close_bracket'),
    // Blanks are ignored: the transformer returns null.
    array('/(\\s+|\\r?\\n)/', 'blank', function(array $m) { return null; })
);

try {
    $lexer = new Lexer($specifications);
    $text = '$gConstant1 + sin($lCoef1) / cos($lcoef2) * $gTemp - tan(21)';
    $tokens = $lexer->lex($text);
} catch (\Exception $e) {
    print "ERROR: " . $e->getMessage() . "\n";
    exit(1);
}

/** @var Token $_token */
foreach ($tokens as $_token) {
    printf("%s %s\n", $_token->type, $_token->value);
}
Specifications
Description
The lexer is configured with a list of token specifications:
array(
<token specification>,
<token specification>,
...
)
Each token specification is an array that contains 2 or 3 elements.
<token specification> = array(<regexp>, <type>, [<transformer callback>])
- The first element is a regular expression that describes the token.
- The second element is a name that identifies the type of the token.
- The optional third element is a function that is applied to the token's value before it is returned.
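As a rough illustration of the structure described above, the following sketch checks whether an array has the shape of a token specification. The helper name isTokenSpecification is hypothetical and is not part of this package; it only mirrors the three rules listed above.

```php
<?php
// Hypothetical helper (not part of dbeurive/lexer): checks the shape
// array(<regexp>, <type>, [<transformer callback>]).
function isTokenSpecification($inSpec)
{
    if (!is_array($inSpec)) {
        return false;
    }
    $inSpec = array_values($inSpec);
    $count = count($inSpec);
    if ($count < 2 || $count > 3) {
        return false;
    }
    // First element: a regular expression that compiles.
    if (!is_string($inSpec[0]) || @preg_match($inSpec[0], '') === false) {
        return false;
    }
    // Second element: the name that identifies the type of the token.
    if (!is_string($inSpec[1])) {
        return false;
    }
    // Optional third element: the transformer callback.
    return $count === 2 || is_callable($inSpec[2]);
}

$ignore = function(array $m) { return null; };

var_dump(isTokenSpecification(array('/[0-9]+/', 'numeric')));               // bool(true)
var_dump(isTokenSpecification(array('/\\s+/', 'blank', $ignore)));          // bool(true)
var_dump(isTokenSpecification(array('/[0-9]+/')));                          // bool(false)
var_dump(isTokenSpecification(array('/[0-9]+/', 'numeric', 'not a callable'))); // bool(false)
```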
WARNING
Make sure to double every backslash character ("\") within the regular expressions that define the tokens. That is: '/\s/' becomes '/\\s/'.
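The effect of the doubling can be checked directly in PHP. The sketch below relies only on PHP's rules for single-quoted string literals, not on this package. In such a literal both spellings yield the same pattern, so the doubled form is mainly the unambiguous one. It becomes necessary when the regular expression itself must contain a backslash followed by another backslash or by a quote.

```php
<?php
$single  = '/\s/';   // a lone backslash before "s" is kept as is
$doubled = '/\\s/';  // "\\" is the escape sequence for one backslash

var_dump($single === $doubled);        // bool(true)
var_dump(strlen($doubled));            // int(4)
var_dump(preg_match($doubled, "a b")); // int(1)

// To match one literal backslash, the regular expression needs "\\",
// so the PHP literal contains four backslashes.
$backslash = '/\\\\/';
var_dump(strlen($backslash));                 // int(4)
var_dump(preg_match($backslash, 'C:\\temp')); // int(1)
```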
The signature of the optional third element (<transformer callback>) must be:
mixed|null function(array $inMatches)
The array ($inMatches) passed to the function comes from the processing of the regular expression that describes the token.
- The first element of the array ($inMatches[0]) contains the text that matched the full pattern.
- The second element of the array ($inMatches[1]) contains the text that matched the first captured parenthesized subpattern.
- The third element of the array ($inMatches[2]) contains the text that matched the second captured parenthesized subpattern.
- ... and so on.
See the description of the PHP function preg_match().
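To make the content of $inMatches concrete, here is a small sketch that calls preg_match() directly with the "variable" regular expression from the synopsis. It assumes that the lexer hands the callback the same array that preg_match() fills in, which is what the description above suggests.

```php
<?php
// Same regular expression as the "variable" token of the synopsis.
$regexp = '/\\$([lg])([a-z0-9]+)/i';

$inMatches = array();
if (preg_match($regexp, '$gConstant1', $inMatches) === 1) {
    print $inMatches[0] . "\n"; // $gConstant1 (full pattern)
    print $inMatches[1] . "\n"; // g           (first subpattern)
    print $inMatches[2] . "\n"; // Constant1   (second subpattern)
}
```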
- If the function returns the value null, then the detected token is "ignored". That is: it will not be inserted into the list of extracted tokens.
- If the function returns a non-null value, then the token is inserted into the list of detected tokens. The value of the inserted token will be the value returned by the function (<transformer callback>).
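The two rules above can be sketched as follows. The helper resolveTokenValue is hypothetical and is not the package's code. It also assumes that a token declared without a callback keeps the matched text as its value.

```php
<?php
// Hypothetical helper (not part of dbeurive/lexer): computes the value of a
// token from the matches and an optional transformer callback.
function resolveTokenValue(array $inMatches, $inTransformer = null)
{
    if (null === $inTransformer) {
        return $inMatches[0]; // assumption: no callback means "keep the matched text"
    }
    return call_user_func($inTransformer, $inMatches);
}

$ignore = function(array $m) { return null; };
$upper  = function(array $m) { return strtoupper($m[0]); };

$cases = array(
    array('sin', $upper),  // transformed
    array(' ',   $ignore), // ignored
    array('21',  null)     // kept as is
);

$values = array();
foreach ($cases as $case) {
    $value = resolveTokenValue(array($case[0]), $case[1]);
    if (null !== $value) { // a null value means "ignore this token"
        $values[] = $value;
    }
}

print implode(',', $values) . "\n"; // SIN,21
```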
Very important note
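Be aware that the order in which the tokens are declared is important.
The following example (example 2) illustrates this point: the same text is lexed twice, with the specifications for "AA" and "A" declared in opposite orders.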
use dbeurive\Lexer\Lexer;
use dbeurive\Lexer\Token;

$text = 'AAAA AA';

// ---------------------------------------------------------
// TEST 1
// ---------------------------------------------------------

$specifications = array(
    array('/AA/', 'type A2'),
    array('/A/', 'type A1'),
    array('/(\\s+|\\r?\\n)/', 'blank', function(array $m) { return null; })
);

try {
    $lexer = new Lexer($specifications);
    $tokens = $lexer->lex($text);
} catch (\Exception $e) {
    print "ERROR: " . $e->getMessage() . "\n";
    exit(1);
}

print "Test1: $text\n\n";
dumpToken($tokens);
print "\n";

// ---------------------------------------------------------
// TEST 2
// ---------------------------------------------------------

$specifications = array(
    array('/A/', 'type A1'),
    array('/AA/', 'type A2'),
    array('/(\\s+|\\r?\\n)/', 'blank', function(array $m) { return null; })
);

try {
    $lexer = new Lexer($specifications);
    $tokens = $lexer->lex($text);
} catch (\Exception $e) {
    print "ERROR: " . $e->getMessage() . "\n";
    exit(1);
}

print "Test2: $text\n\n";
dumpToken($tokens);
exit(0);

// Prints one token per line, with the types right-aligned.
function dumpToken(array $inTokens) {
    $max = 0;
    /** @var Token $_token */
    foreach ($inTokens as $_token) {
        $max = strlen($_token->type) > $max ? strlen($_token->type) : $max;
    }
    /** @var Token $_token */
    foreach ($inTokens as $_token) {
        printf("%{$max}s %s\n", $_token->type, $_token->value);
    }
}
The result is:
Test1: AAAA AA
type A2 AA
type A2 AA
type A2 AA
Test2: AAAA AA
type A1 A
type A1 A
type A1 A
type A1 A
type A1 A
type A1 A
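The result above is consistent with a lexer that, at each position in the text, tries the specifications in their order of declaration and keeps the first one that matches. The sketch below is a hypothetical stand-in written under that assumption. It is not the code of this package, and the function name sketchLex is invented for the illustration.

```php
<?php
// Hypothetical stand-in for the lexer (NOT the code of dbeurive/lexer).
// Assumption: the first specification that matches at the current offset wins.
function sketchLex(array $inSpecifications, $inText)
{
    $tokens = array();
    $offset = 0;
    $length = strlen($inText);
    while ($offset < $length) {
        $matched = false;
        foreach ($inSpecifications as $spec) {
            // The "A" modifier anchors the match at $offset.
            if (preg_match($spec[0] . 'A', $inText, $m, 0, $offset) !== 1 || '' === $m[0]) {
                continue;
            }
            $value = isset($spec[2]) ? call_user_func($spec[2], $m) : $m[0];
            if (null !== $value) { // a null value means "ignore this token"
                $tokens[] = array($spec[1], $value);
            }
            $offset += strlen($m[0]);
            $matched = true;
            break;
        }
        if (!$matched) {
            throw new \Exception("Unexpected character at offset $offset");
        }
    }
    return $tokens;
}

$text  = 'AAAA AA';
$blank = array('/\\s+/', 'blank', function(array $m) { return null; });
$a1    = array('/A/', 'type A1');
$a2    = array('/AA/', 'type A2');

$test1 = sketchLex(array($a2, $a1, $blank), $text);
$test2 = sketchLex(array($a1, $a2, $blank), $text);

print count($test1) . "\n"; // 3 (all of type A2)
print count($test2) . "\n"; // 6 (all of type A1)
```

In the second run the pattern /A/ is declared first and always matches, so /AA/ is never reached.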
API
Constructor
/**
 * Lexer constructor.
 * @param array $inSpecifications This array represents the token specifications.
 *        Each element of this array is an array that specifies a token.
 *        It contains 2 or 3 elements.
 *        - First element: a regular expression that describes the token.
 *        - Second element: the name of the token.
 *        - Third element: an optional callback function.
 *          The signature of this function must be:
 *          null|string function(array $inMatches)
 * @throws \Exception
 */
public function __construct(array $inSpecifications)
Please see the section "Specifications" for a detailed description of the parameter.
lex()
/**
 * Explode a given string into a list of tokens.
 * @param string $inString The string to explode into tokens.
 * @return array The method returns a list of tokens.
 *         Each element of the returned list is an instance of the class Token.
 * @throws \Exception
 * @see Token
 */
public function lex($inString)
This method "parses" a given text and returns the list of detected tokens as an array.
Each element of the returned array is an instance of the class \dbeurive\Lexer\Token.
/**
 * Class Token
 *
 * This class implements a token.
 *
 * @package dbeurive\Lexer
 */
class Token
{
    /** @var null|mixed Token's value. */
    public $value = null;
    /** @var null|string Token's type. */
    public $type = null;

    /**
     * Token constructor.
     * @param string $inOptValue The token's value.
     * @param string $inOptType The token's type.
     */
    public function __construct($inOptValue=null, $inOptType=null) {
        $this->value = $inOptValue;
        $this->type = $inOptType;
    }
}
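Since a token only exposes the two public properties type and value, the list returned by lex() can be processed with plain loops. The sketch below uses a stand-in copy of the Token class so that it runs without installing the package. The list of tokens is built by hand for the illustration and is not actual output of the lexer.

```php
<?php
// Stand-in copy of the Token class shown above, so that this sketch is
// self-contained. With the package installed, use \dbeurive\Lexer\Token.
class Token
{
    public $value = null;
    public $type = null;

    public function __construct($inOptValue = null, $inOptType = null)
    {
        $this->value = $inOptValue;
        $this->type = $inOptType;
    }
}

// Hand-built list, in the shape of what lex() returns.
// Note the order of the constructor's arguments: value first, then type.
$tokens = array(
    new Token('sin', 'function'),
    new Token('(', 'open_bracket'),
    new Token('21', 'numeric'),
    new Token(')', 'close_bracket')
);

// Collect the values of the tokens of a given type.
$numbers = array();
/** @var Token $_token */
foreach ($tokens as $_token) {
    if ('numeric' === $_token->type) {
        $numbers[] = $_token->value;
    }
}

print implode(',', $numbers) . "\n"; // 21
```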