nabu-3 / lexer
nabu-3 Lexer library to generate and analyze lexical expressions
Requires
- php: >=7.2
- ext-mbstring: >=7.2
- nabu-3/minimal-class: dev-master
Requires (Dev)
- phpunit/phpunit: ^8.1
This package is auto-updated.
Last update: 2025-01-11 15:27:33 UTC
README
This is a Lexer library written in PHP that analyzes lexical expressions and produces a tokenized representation together with a data structure describing the interpreted content.
The Lexer supports Unicode strings and Regular Expressions.
Installation
The Lexer library requires PHP 7.2 or higher and the native mbstring extension.
The library is distributed through Packagist as a standard Composer package. To use it, you only need to require it via Composer:
composer require nabu-3/lexer
Basic usage
To start using this library you need to include the standard autoload.php file that is maintained by composer:
require_once 'vendor/autoload.php';
Then create a CNabuCustomLexer object and give it a Lexer Data storage, as follows:
    use nabu\lexer\CNabuCustomLexer;
    use nabu\lexer\data\CNabuLexerData;

    $lexer = CNabuCustomLexer::getLexer();
    $lexer->setData(new CNabuLexerData());
This gives you a custom lexer to which you can add rules and with which you can analyze your sample strings.
The Keyword Rule
The most basic rule is the Keyword Rule. It parses a keyword and returns the tokenized result.
Below, a basic sample using the Keyword Rule:
    require_once 'vendor/autoload.php';

    use nabu\lexer\CNabuCustomLexer;
    use nabu\lexer\data\CNabuLexerData;
    use nabu\lexer\rules\CNabuLexerRuleKeyword;

    $lexer = CNabuCustomLexer::getLexer();
    $lexer->setData(new CNabuLexerData());

    $keyword_rule = CNabuLexerRuleKeyword::createFromDescriptor(
        $lexer,
        array(
            'keyword' => 'RULE',
            'method' => 'ignore case'
        )
    );
    $lexer->registerRule('keyword_rule', $keyword_rule);

    // The sample is written in mixed case on purpose, so that the token
    // can be told apart from the uppercase keyword.
    $keyword_rule->applyRuleToContent('Rule is the basics');
    var_export($keyword_rule->getTokens());
    echo "\n";
The allowed methods are 'ignore case' and 'literal':
- 'ignore case' matches the keyword regardless of letter case. Internally, both strings (sample and keyword) are converted to lowercase and compared. If they match, the rule is considered covered and returns true.
- 'literal' requires every character to match the keyword exactly; the rule is covered only if all characters match literally.
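The two methods can be illustrated with a minimal, standalone sketch. It does not use the library: the function match_keyword is a hypothetical helper written only for this example, and it assumes the keyword is compared against the beginning of the sample string.

```php
<?php
// Hypothetical helper, NOT part of nabu-3/lexer.
// Returns the matched prefix of $content, or null when the keyword does not match.
function match_keyword(string $content, string $keyword, string $method): ?string
{
    // Take as many characters from the sample as the keyword has (Unicode-aware).
    $candidate = mb_substr($content, 0, mb_strlen($keyword));

    if ($method === 'ignore case') {
        // Lowercase both sides before comparing, as described above.
        $matches = mb_strtolower($candidate) === mb_strtolower($keyword);
    } else {
        // 'literal': every character must be identical.
        $matches = $candidate === $keyword;
    }

    // The token is cut from the sample, so it keeps the sample's letter case.
    return $matches ? $candidate : null;
}

var_export(match_keyword('Rule is the basics', 'RULE', 'ignore case')); // 'Rule'
echo "\n";
var_export(match_keyword('Rule is the basics', 'RULE', 'literal'));     // NULL
echo "\n";
```

Because the token is taken from the sample rather than from the keyword, an 'ignore case' match preserves whatever casing the sample used.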
You can run this sample from the terminal by typing:
php samples/basic_sample_01.php
After running this sample, the terminal shows the list of parsed tokens:
    array (
      0 => 'Rule',
    )
Note that the list contains only one item because the Keyword Rule matches a single occurrence of the keyword. Because the rule method is 'ignore case', the token reproduces the sample source string rather than the keyword attribute.
The Regular Expression Rule
This rule is widely applicable to polymorphic strings or dynamic structures whose content can only be interpreted with a regular expression. As with the Keyword Rule, you can apply the match as 'literal' or 'ignore case'. With 'ignore case', the '/i' modifier is applied when the regular expression is evaluated with preg_match.
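As a rough illustration of how the method maps to the '/i' modifier, here is a standalone sketch built directly on preg_match. The function match_regex is hypothetical and not part of the library. It assumes that the expression is anchored at the start of the sample, that the 'u' modifier is used for Unicode input, and that the pattern contains no unescaped '/' character.

```php
<?php
// Hypothetical helper, NOT part of nabu-3/lexer.
// Returns the text matched at the start of $content, or null when nothing matches.
function match_regex(string $content, string $pattern, string $method): ?string
{
    // 'u' enables UTF-8 handling; 'i' is added only for the 'ignore case' method.
    $modifiers = 'u' . ($method === 'ignore case' ? 'i' : '');

    // Assumption: the pattern holds no unescaped '/' delimiter.
    $regex = '/^(?:' . $pattern . ')/' . $modifiers;

    return preg_match($regex, $content, $found) === 1 ? $found[0] : null;
}

var_export(match_regex('RUle is the basics', 'rule', 'ignore case')); // 'RUle'
echo "\n";
var_export(match_regex('RUle is the basics', 'rule', 'literal'));     // NULL
echo "\n";
var_export(match_regex('RUle is the basics', '\w+', 'literal'));      // 'RUle'
echo "\n";
```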
Below, a basic example using the Regular Expression Rule:
    require_once 'vendor/autoload.php';

    use nabu\lexer\CNabuCustomLexer;
    use nabu\lexer\data\CNabuLexerData;
    use nabu\lexer\rules\CNabuLexerRuleRegEx;

    $lexer = CNabuCustomLexer::getLexer();
    $lexer->setData(new CNabuLexerData());

    $regex_rule = CNabuLexerRuleRegEx::createFromDescriptor(
        $lexer,
        array(
            'match' => '\\w+',
            'method' => 'ignore case'
        )
    );
    $lexer->registerRule('regex_rule', $regex_rule);

    $regex_rule->applyRuleToContent('RUle is the basics');
    var_export($regex_rule->getTokens());
    echo "\n";
The allowed methods are the same as for the Keyword Rule.
You can run this sample from the terminal by typing:
php samples/basic_sample_02.php
After running this sample, the terminal shows the list of parsed tokens:
    array (
      0 => 'RUle',
    )
Note that the list contains only one item because the Regular Expression Rule matches a single occurrence of the expression. The token reproduces the matched part of the sample source string, so it keeps the letter case of the sample.
Block rules
Block rules can group rules of any kind and apply the resulting list as a case, a sequence, or a repetition.
The Case Rule
This rule treats a list of rules like a switch/case statement. You define the list and apply the rule. If the sample string matches at least one of the listed rules, the first rule that matches is applied and the evaluation stops there.
Below, a basic example using the Case Rule:
    require_once 'vendor/autoload.php';

    use nabu\lexer\CNabuCustomLexer;
    use nabu\lexer\data\CNabuLexerData;
    use nabu\lexer\rules\CNabuLexerRuleGroup;

    $lexer = CNabuCustomLexer::getLexer();
    $lexer->setData(new CNabuLexerData());

    $case_rule = CNabuLexerRuleGroup::createFromDescriptor(
        $lexer,
        array(
            'method' => 'case',
            'group' => array(
                array(
                    'keyword' => 'Rule',
                    'method' => 'ignore case'
                ),
                array(
                    'keyword' => 'are',
                    'method' => 'ignore case'
                ),
                array(
                    'keyword' => 'the',
                    'method' => 'ignore case'
                ),
                array(
                    'keyword' => 'basics',
                    'method' => 'literal'
                )
            )
        )
    );
    $lexer->registerRule('case_rule', $case_rule);

    $case_rule->applyRuleToContent('The basics are Rules?');
    var_export($case_rule->getTokens());
    echo "\n";
You can run this sample from the terminal by typing:
php samples/block_sample_01.php
After running this sample, the terminal shows the list of parsed tokens:
    array (
      0 => 'The',
    )
Note that the list contains only one item because the Case Rule applies only the first rule in the list that matches.
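The "first match wins" behavior can be sketched without the library. In the following standalone example, keyword_rule and apply_case are hypothetical helpers: each rule is modeled as a closure that returns the matched token or null, which is an assumption made for illustration and not the way the library represents rules.

```php
<?php
// Hypothetical helpers, NOT part of nabu-3/lexer.

// Builds a keyword rule as a closure returning the matched token or null.
function keyword_rule(string $keyword, bool $ignoreCase): callable
{
    return function (string $content) use ($keyword, $ignoreCase): ?string {
        $candidate = mb_substr($content, 0, mb_strlen($keyword));
        $equal = $ignoreCase
            ? mb_strtolower($candidate) === mb_strtolower($keyword)
            : $candidate === $keyword;

        return $equal ? $candidate : null;
    };
}

// Tries each rule in declaration order and stops at the first one that matches.
function apply_case(array $rules, string $content): ?array
{
    foreach ($rules as $rule) {
        $token = $rule($content);
        if ($token !== null) {
            return [$token];
        }
    }

    return null; // no rule in the list matched
}

$rules = [
    keyword_rule('Rule', true),
    keyword_rule('are', true),
    keyword_rule('the', true),
    keyword_rule('basics', false),
];

var_export(apply_case($rules, 'The basics are Rules?'));
echo "\n";
```

With the sample string above, the first two rules fail and the third one ('the', ignoring case) matches, so the result holds the single token 'The'.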
The Sequence Rule
Sequence Rules are declared like Case Rules, with two differences: the method is 'sequence', and you can define a tokenizer expression that acts as a separator between the rules involved in the sequence.
Below, a basic example using the Sequence Rule:
    require_once 'vendor/autoload.php';

    use nabu\lexer\CNabuCustomLexer;
    use nabu\lexer\data\CNabuLexerData;
    use nabu\lexer\rules\CNabuLexerRuleGroup;

    $lexer = CNabuCustomLexer::getLexer();
    $lexer->setData(new CNabuLexerData());

    $sequence_rule = CNabuLexerRuleGroup::createFromDescriptor(
        $lexer,
        array(
            'method' => 'sequence',
            'tokenizer' => array(
                'method' => 'literal',
                'match' => '\s+',
            ),
            'group' => array(
                array(
                    'keyword' => 'the',
                    'method' => 'ignore case'
                ),
                array(
                    'keyword' => 'basics',
                    'method' => 'literal'
                ),
                array(
                    'keyword' => 'are',
                    'method' => 'ignore case'
                ),
                array(
                    'keyword' => 'Rules',
                    'method' => 'ignore case'
                )
            )
        )
    );
    $lexer->registerRule('sequence_rule', $sequence_rule);

    $sequence_rule->applyRuleToContent("The basics are\tRules?");
    var_export($sequence_rule->getTokens());
    echo "\n";
Note the two differences from the Case Rule:
- The method is 'sequence'.
- We add a tokenizer attribute that contains an explicit rule declaration (in this case, a Regular Expression Rule). This rule is applied before each iteration over the list of rules.
You can run this sample from the terminal by typing:
php samples/block_sample_02.php
After running this sample, the terminal shows the list of parsed tokens:
    array (
      0 => 'The',
      1 => ' ',
      2 => 'basics',
      3 => ' ',
      4 => 'are',
      5 => ' ',
      6 => 'Rules',
    )
Note that the list contains all the words in the sample string because the Sequence Rule tries to match the full list in the order in which it is declared. If one rule fails, the sequence stops and resets the token list to NULL, so that no partially parsed tokens are returned.
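The all-or-nothing behavior of a sequence can be sketched as follows. This is a standalone approximation, not the library's implementation: apply_sequence is a hypothetical function, the rules and the tokenizer are reduced to plain regular expressions, and the separator is assumed to be required only between consecutive rules, as the sample output suggests.

```php
<?php
// Hypothetical helper, NOT part of nabu-3/lexer.
// Matches every pattern in order, with a separator between consecutive patterns.
// Returns the collected tokens, or null as soon as any step fails.
function apply_sequence(array $patterns, string $separator, string $content): ?array
{
    $tokens = [];
    $rest = $content;

    foreach (array_values($patterns) as $index => $pattern) {
        // Assumption: the separator is only expected between two rules.
        $steps = $index > 0 ? [$separator, $pattern] : [$pattern];

        foreach ($steps as $step) {
            if (preg_match('/^(?:' . $step . ')/u', $rest, $found) !== 1) {
                return null; // discard everything collected so far
            }
            $tokens[] = $found[0];
            // preg_match returns byte strings, so byte-based substr/strlen is consistent.
            $rest = substr($rest, strlen($found[0]));
        }
    }

    return $tokens;
}

// (?i) makes a single pattern case-insensitive, mirroring 'ignore case'.
$patterns = ['(?i)the', 'basics', '(?i)are', '(?i)rules'];

var_export(apply_sequence($patterns, '\s+', "The basics are\tRules?"));
echo "\n";
var_export(apply_sequence($patterns, '\s+', 'The basic are Rules?')); // NULL
echo "\n";
```

In the second call the literal pattern 'basics' fails on 'basic are', so the tokens gathered up to that point are dropped and the function returns null.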
The Repeat Rule
Repeat Rules define a cardinality for a rule. The cardinality can be given as a minimum and a maximum value, or as a fixed value. The allowed formats are:
- Fixed cardinality: any natural number starting at 0. This is applied as 'repeat exactly n times', where n is the selected number.
- Range: a pair of values in the form 'm..n', where m is a natural number starting at 0 and n is a natural number starting at m. This means 'repeat between m and n times'. If the rule matches fewer than m times, the evaluation fails. If the rule stops matching after at least m but fewer than n repetitions, the evaluation succeeds. If the repetition count reaches n, the evaluation stops and finishes successfully.
- Infinite: in this case, you use 'n' as the value. Internally, this is translated to 1..n and handled as a Range, as explained above, which means 'at least once, then as many times as the rule keeps matching'.
Like Sequence Rules, Repeat Rules support a tokenizer that acts as a separator between iterations of the rule.
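To make the three formats concrete, the following standalone sketch normalizes a cardinality string into a minimum and a maximum. The function parse_cardinality is hypothetical and only mirrors the description above. Representing "no upper limit" as null, and accepting a literal 'n' as the upper bound of a range, are assumptions made for this example.

```php
<?php
// Hypothetical helper, NOT part of nabu-3/lexer.
// Returns [min, max]; max is null when there is no upper limit.
function parse_cardinality(string $repeat): array
{
    // Infinite: 'n' alone is treated as 1..n (at least once, no upper limit).
    if ($repeat === 'n') {
        return [1, null];
    }

    // Fixed cardinality: 'repeat exactly n times'.
    if (preg_match('/^\d+$/D', $repeat) === 1) {
        return [(int) $repeat, (int) $repeat];
    }

    // Range 'm..n'; a literal 'n' as upper bound is an assumption of this sketch.
    if (preg_match('/^(\d+)\.\.(\d+|n)$/D', $repeat, $parts) === 1) {
        $min = (int) $parts[1];
        $max = $parts[2] === 'n' ? null : (int) $parts[2];
        if ($max === null || $max >= $min) {
            return [$min, $max];
        }
    }

    throw new InvalidArgumentException("Invalid cardinality: $repeat");
}

var_export(parse_cardinality('1..4')); // minimum 1, maximum 4
echo "\n";
var_export(parse_cardinality('3'));    // minimum 3, maximum 3
echo "\n";
var_export(parse_cardinality('n'));    // minimum 1, no maximum
echo "\n";
```

A repetition loop can then count successful matches, stop when the maximum is reached, and report success only if the count is at least the minimum.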
Below, a basic example using the Repeat Rule:
    require_once 'vendor/autoload.php';

    use nabu\lexer\CNabuCustomLexer;
    use nabu\lexer\data\CNabuLexerData;
    use nabu\lexer\rules\CNabuLexerRuleRepeat;

    $lexer = CNabuCustomLexer::getLexer();
    $lexer->setData(new CNabuLexerData());

    $repeat_rule = CNabuLexerRuleRepeat::createFromDescriptor(
        $lexer,
        array(
            'repeat' => '1..4',
            'tokenizer' => array(
                'method' => 'literal',
                'match' => '\s+'
            ),
            'rule' => array(
                'method' => 'ignore case',
                'match' => '[a-zA-Z]+'
            )
        )
    );
    $lexer->registerRule('repeat_rule', $repeat_rule);

    $repeat_rule->applyRuleToContent("The basics are\tRules?");
    var_export($repeat_rule->getTokens());
    echo "\n";
You can run this sample from the terminal by typing:
php samples/block_sample_03.php
This sample produces a result similar to that of the Sequence Rule sample above, but the rules involved are less restrictive: the rule matches any sequence of lowercase or uppercase letters, repeated between 1 and 4 times. As a result, any other phrase containing at least one word also matches this rule, up to a limit of four words.
    array (
      0 => 'The',
      1 => ' ',
      2 => 'basics',
      3 => ' ',
      4 => 'are',
      5 => ' ',
      6 => 'Rules',
    )