ascetik/regex

Regex handling

v0.2.0 2024-01-28 14:20 UTC

Last update: 2024-01-28 14:21:39 UTC


README

use Ascetik\Regex\Core\Regex;

Regex

An Object Oriented way to handle Regular Expressions

Release notes

v0.2.0 : match with offset capture

The main improvement is the ability to use the PREG_OFFSET_CAPTURE flag on a match. Using this flag with a Regex instance gives access to the position of each occurrence in the matched subject string.

Some changes are internal optimizations.

Options and Partial/Global modes are handled like in the previous version.

Minor breaking change : the MatchOperator::apply() method is replaced by the capture() method.

This package only provides preg_match* functionalities for now. Replacements are not available yet.

Basic usage

First example : check if a string matches a regex pattern

$valid = Regex::from('/([a-z]+)/')->match('test')->capture()->isValid();

var_dump($valid); // true

An invalid match would return false.

The matching parts of the tested subject are available :

$occurences = Regex::from('/([a-z]+)/')->match('test')->capture()->occurences();

var_dump($occurences->content()); // StringMatchSet<StringOccurence['test']>
var_dump($occurences->flatContent()); // [0 => 'test']

An invalid match would return empty arrays in both cases.

With PHP's preg_match() and preg_match_all(), the first element of '&$matches' contains the text matching the full regex pattern. This package calls it a "report", available this way :

$report = Regex::from('/([a-z]+)/')->match('test')->capture()->report();

echo $report->content(); // 'test'

An invalid match would return a message indicating the reason for the failure : 'unmatch' for a string that does not match, or the result of preg_last_error_msg() if an error occurred.
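For comparison, here is a minimal sketch of the same distinction using only native PHP functions, without any class from this package. It shows where the "report" text comes from in '$matches' and how a non-matching subject differs from an error. The variable names are illustrative only, and preg_last_error_msg() assumes PHP 8.0 or later.

```php
<?php
// Native PHP only; no package classes are used here.
$pattern = '/([a-z]+)/';
$subject = 'test';

$fullMatch = null;
$firstGroup = null;
$failure = null;

$result = preg_match($pattern, $subject, $matches);

if ($result === 1) {
    // $matches[0] is the text matching the full pattern (the "report"),
    // $matches[1] is the text captured by the first group.
    $fullMatch = $matches[0];
    $firstGroup = $matches[1];
} elseif ($result === 0) {
    // The subject simply does not match; no error happened.
    $failure = 'unmatch';
} else {
    // preg_match() returned false: the PCRE engine reported an error.
    $failure = preg_last_error_msg();
}
```

With this pattern and subject, the first branch is taken and both '$fullMatch' and '$firstGroup' hold 'test'.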

Here is the standard decomposition of all this process :

$regex = Regex::from('/([a-z]+)/'); // This factory method parses the input string to get delimiter, main pattern and options

$operator = $regex->match('test'); // this is a MatchOperator, providing new options to configure the match-check result format.

$capture = $operator->capture(); // this is a BasicCapture, from Capture interface family

$valid = $capture->isValid(); // simple boolean

$report = $capture->report(); // this is a StringReport, from MatchReport family
echo $capture->content(); // simple string from this kind of Capture instance

$occurences = $capture->occurences(); // StringOccurenceSet instance, from OccurencesContainer interface family
$container = $occurences->content(); // StringOccurence[]
$flatContainer = $occurences->flatContent(); // string[]

// $occurences instance would be empty on invalid result.

$occurence = $container[0]; // StringOccurence instance, none if invalid or no matches

These examples use the default behavior, with no preg_match flag. Some of these outputs may differ slightly depending on the flag in use. All Capture implementations share a basic behavior; some of them add specific ones.

Pattern parsing

Some checks are made when a Regex instance is initialised. Both options and delimiters are parsed from the given pattern and converted to instances.

Some restrictions are applied silently : wrong values are either ignored or replaced.

Using options

The given pattern is parsed to get the delimiter, the main pattern and any options you may have added. All options listed in the php.net documentation are available.

The available options are described in an enumeration. Any unknown option, except 'g' for global, is ignored.

The 'g' option is handled separately and sets the Regex instance as "global".

// to build a global multiline case-insensitive regex :
$optionRegex = Regex::from('/([a-b]+)\-?/gmi');
$optionRegex->isGlobal(); // true
echo $optionRegex->pattern()->expression(); // '/([a-b]+)\-?/gim'

// to turn a global Regex into a partial one
$partialRegex = $optionRegex->partial();
$partialRegex->isGlobal(); // false
echo $partialRegex->pattern()->expression(); // '/([a-b]+)\-?/im'

// and the other way round
$globalRegex = $partialRegex->global();
echo $globalRegex->pattern()->expression(); // '/([a-b]+)\-?/gim'
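The parsing step described above can be pictured with the following sketch. It is not the package's implementation : splitPattern() is a hypothetical helper written for this illustration, the list of known modifiers is a partial assumption, and the input is assumed to be a well-formed pattern using the same character as opening and closing delimiter. It requires PHP 8.0 or later for str_contains().

```php
<?php
// Hypothetical helper, NOT part of ascetik/regex: splits a delimited
// pattern into delimiter, body, PCRE options and a separate "global" marker.
function splitPattern(string $input): array
{
    // Assumed subset of PCRE modifiers accepted by PHP.
    $known = ['i', 'm', 's', 'x', 'A', 'D', 'S', 'U', 'X', 'J', 'u'];

    $delimiter = $input[0];
    $end = strrpos($input, $delimiter);       // position of the closing delimiter
    $body = substr($input, 1, $end - 1);      // text between both delimiters
    $flags = substr($input, $end + 1);        // everything after the closing delimiter

    // 'g' is not a PCRE modifier, so it is extracted separately.
    $global = str_contains($flags, 'g');

    // Unknown letters (including 'g') are silently dropped.
    $options = array_values(array_filter(
        str_split($flags),
        fn (string $flag): bool => in_array($flag, $known, true)
    ));

    return [
        'delimiter' => $delimiter,
        'body' => $body,
        'options' => $options,
        'global' => $global,
    ];
}

$parts = splitPattern('/([a-b]+)\-?/gmi');
```

Here '$parts' holds the delimiter '/', the body '([a-b]+)\-?', the options 'm' and 'i', and a global marker set to true. The sketch keeps options in input order; the package itself may order them differently.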

Delimiters

Delimiters are described in an enumeration. When the pattern is parsed, delimiters have to match and be listed as a Delimiter enum case.

Some checks are made :

  • If a delimiter is either not listed or missing, a forward slash is used as the default delimiter.
  • If the delimiters at the start and at the end are not the same, both are replaced by the default delimiter where possible.

However, users of this package are expected to provide a pattern that would work with the native PHP regex functions.

Advanced matching features

The previous examples worked without any flag. This version adds support for a first flag on a match test.

Match Scope

A match test may target either the first matching element or all matching elements of a string. For a Regex instance, the default scope is partial.

For global match test :

$globalRegex = $regex->global();

This feature will change in the next release, breaking some code that uses this method. For now, it stays as it is until I have implemented all basic functionalities. (Nobody cares anyway, I am the only one who wants to use this thing...)
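In native PHP terms, the partial and global scopes correspond to preg_match() and preg_match_all(). The following sketch uses only those two functions, with an illustrative pattern and subject, to show the difference in results.

```php
<?php
// Native PHP only; pattern and subject are illustrative.
$pattern = '/[a-z]+/';
$subject = 'foo bar baz';

// Partial scope: the search stops at the first match.
preg_match($pattern, $subject, $first);
// $first[0] is 'foo'

// Global scope: every match is collected, and the count is returned.
$count = preg_match_all($pattern, $subject, $all);
// $count is 3 and $all[0] is ['foo', 'bar', 'baz']
```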

Match with offset

A MatchOperator instance provides a way to use the '$offset' preg_match* parameter :

$operator = Regex::from('/([a-z]+)/')
   ->match('test');
echo $operator->atIndex(2)->capture()->content(); // prints 'st'

The method atIndex() returns a new instance of MatchOperator.

The MatchOperator::atIndex() method works with any kind of Capture.
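As a point of reference, this is how the '$offset' parameter behaves with the native preg_match() function. The sketch below does not use the package; it only illustrates the underlying PHP behavior, including the fact that an offset is not equivalent to cutting the subject string.

```php
<?php
// Native PHP only: the 5th argument of preg_match() is the start offset in bytes.
$subject = 'test';

preg_match('/([a-z]+)/', $subject, $fromStart);
preg_match('/([a-z]+)/', $subject, $fromTwo, 0, 2);
// $fromStart[0] is 'test', $fromTwo[0] is 'st'

// An offset is not the same as cutting the string: '^' still anchors
// at the real start of the subject, so this search finds nothing.
$anchored = preg_match('/^st/', $subject, $unused, 0, 2);
// $anchored is 0
```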

Using PREG_OFFSET_CAPTURE flag

The 'flag' is the optional 4th parameter of the preg_match()/preg_match_all() functions.

This version provides the ability to use the PREG_OFFSET_CAPTURE flag in order to retrieve matching chunks and their position in the subject source string.

Using the PREG_OFFSET_CAPTURE flag changes the result substantially. The first difference is that the "report" may contain multiple matches. The second difference is that each match is no longer a plain string but an array holding a string and an integer.

To handle this kind of output, this package provides dedicated Capture, report and occurrence implementations.
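To make the shape of that result concrete, here is what the native preg_match() function returns with and without the flag. This sketch uses only PHP built-ins; the subject '42 test' is an illustrative choice so that the match does not start at position 0.

```php
<?php
// Native PHP only; no package classes are used here.
$pattern = '/([a-z]+)/';
$subject = '42 test';

// Without flag: each entry of $plain is a string.
preg_match($pattern, $subject, $plain);
// $plain is ['test', 'test']

// With PREG_OFFSET_CAPTURE: each entry is [matched text, byte offset].
preg_match($pattern, $subject, $indexed, PREG_OFFSET_CAPTURE);
// $indexed is [['test', 3], ['test', 3]]
```

Note that the offsets are expressed in bytes, not in characters, which matters for multibyte subjects.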

Here is an example :

$occurences = Regex::from('/([a-z]+)/')
   ->match('test')
   ->offsetCapture() // now we call offsetCapture() instead of capture()
   ->occurences();

var_dump($occurences->content()); // IndexedMatchSet<IndexedOccurence['test', 0]>
var_dump($occurences->flatContent()); // [0 => 'test']

Here is the decomposition :

$regex = Regex::from('/([a-z]+)/'); // This factory method parses the input string to get delimiter, main pattern and options

$operator = $regex->match('test'); // this is a MatchOperator, providing new options to configure the match-check result format.

$capture = $operator->offsetCapture(); // this is an IndexedCapture, from Capture interface family

$valid = $capture->isValid(); // still a simple boolean

$report = $capture->report(); // this is either an IndexedReport for a successful check or a StringReport in case of failure.
echo $capture->content(); // IndexedOccurenceSet if valid, string otherwise

$occurences = $capture->occurences(); // IndexedOccurenceSet instance, from OccurencesContainer interface family
$container = $occurences->content(); // IndexedOccurence[]
$flatContainer = $occurences->flatContent(); // string[]

$occurenceExample = $container[0]; // IndexedOccurence instance, none if invalid
echo $occurenceExample->content(); // 'test', for our example
echo $occurenceExample->index(); // 0 in our example, matching string position in the input subject.

As you can see, there are some slight differences. The main mechanism is always the same, using different strategies related to the match mode and the flag in use.

Next features

  1. More flags! Coming next : PREG_UNMATCHED_AS_NULL.

  2. See "issues" below.

  3. Regex replacement : there are 3 ways to process replacements with regular expressions, sometimes using different types of parameters. The existing implementation should only need adjustments to support them.

  4. Maybe a step-by-step regex pattern builder...

  5. And some other things I forgot, obviously!
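Regarding item 3, the three native replacement functions are presumably preg_replace(), preg_replace_callback() and preg_replace_callback_array(); that reading is an assumption, since the package does not support replacements yet. The sketch below only shows how these built-ins differ in the parameters they take, with an illustrative subject.

```php
<?php
// Native PHP only; the package does not provide replacements yet.
$subject = 'foo 12 bar 7';

// 1. Replacement string, with $1 referring to the first captured group.
$withString = preg_replace('/(\d+)/', '<$1>', $subject);
// 'foo <12> bar <7>'

// 2. Callback receiving the matches of each occurrence.
$withCallback = preg_replace_callback(
    '/\d+/',
    fn (array $m): string => (string) ($m[0] * 2),
    $subject
);
// 'foo 24 bar 14'

// 3. Map of pattern => callback, applied one pattern after the other.
$withMap = preg_replace_callback_array(
    [
        '/\d+/' => fn (array $m): string => '#',
        '/[a-z]+/' => fn (array $m): string => strtoupper($m[0]),
    ],
    $subject
);
// 'FOO # BAR #'
```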

Issues

This version is only a draft. Development proceeds step by step, keeping release versions under 1.0.

The current implementation of MatchOperator makes no difference between preg_match and preg_match_all results. According to the PHP documentation, the second function allows more flags. It would be a better idea to use a specific MatchOperator implementation in this case, with methods adapted to each available flag.

The next release will come after the implementation of the PREG_UNMATCHED_AS_NULL flag, carrying some breaking changes concerning partial/global modes in order to provide different operator implementations.
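The extra flags in question are PREG_PATTERN_ORDER (the default) and PREG_SET_ORDER, which only preg_match_all() accepts. The following native PHP sketch, with an illustrative pattern, shows how they change the layout of the result, which is why a dedicated operator could make sense.

```php
<?php
// Native PHP only; pattern and subject are illustrative.
$pattern = '/(\w)(\d)/';
$subject = 'a1 b2';

// PREG_PATTERN_ORDER (default): one array per group, across all matches.
preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);
// [['a1', 'b2'], ['a', 'b'], ['1', '2']]

// PREG_SET_ORDER: one array per match, holding its own groups.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);
// [['a1', 'a', '1'], ['b2', 'b', '2']]
```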
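For readers unfamiliar with that flag, here is its effect on the native preg_match() function. This is only an illustration of the PHP behavior the future release would have to wrap, not a preview of the package API.

```php
<?php
// Native PHP only; the optional middle group (b)? does not participate in the match.
$pattern = '/(a)(b)?(c)/';
$subject = 'ac';

// Default: an unmatched group is reported as an empty string.
preg_match($pattern, $subject, $default);
// ['ac', 'a', '', 'c']

// PREG_UNMATCHED_AS_NULL: the same group is reported as null instead.
preg_match($pattern, $subject, $asNull, PREG_UNMATCHED_AS_NULL);
// ['ac', 'a', null, 'c']
```

The flag makes it possible to tell a group that matched an empty string from a group that did not match at all.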