lucleroy/php-regex

PHP Regular Expressions Builder

0.5.1 2022-12-27 14:14 UTC

This package is auto-updated.

Last update: 2024-12-27 18:28:40 UTC


README

PHP library with fluent interface to build regular expressions.

Table of contents

Introduction

Here is a simple example that creates a regular expression to recognize a PHP hexadecimal number (example: 0x1ff).

$regex = Regex::create()
    ->literal('0')->chars('xX')->digit(16)->atLeastOne()
    ->getRegex();

This code is equivalent to:

$regex = '/0[xX][0-9a-fA-F]+/m';

Requirements

PHP 5.5 or more.

Installation (with Composer)

Add the following to the require section of your composer.json file

"lucleroy/php-regex": "*"

and run composer update.

Usage

Workflow

Create a Regex object with Regex::create:

use LucLeroy\Regex;

require 'vendor/autoload.php';

$regex = Regex::create();

Build the regular expression:

$regex->literal('0')->chars('xX')->digit(16)->atLeastOne();

Retrieve the PHP Regular Expression string:

echo $regex->getRegex();              // /0[xX][0-9a-fA-F]+/m
echo $regex->getUtf8Regex();          // /0[xX][0-9a-fA-F]+/mu
echo $regex->getOptimizedRegex();     // /0[xX][0-9a-fA-F]+/mS
echo $regex->getUtf8OptimizedRegex(); // /0[xX][0-9a-fA-F]+/muS

By default, the resulting string is surrounded with '/'. You can change this character:

echo $regex->getRegex('%');              // %0[xX][0-9a-fA-F]+%m
echo $regex->getUtf8Regex('%');          // %0[xX][0-9a-fA-F]+%mu
echo $regex->getOptimizedRegex('%');     // %0[xX][0-9a-fA-F]+%mS
echo $regex->getUtf8OptimizedRegex('%'); // %0[xX][0-9a-fA-F]+%muS

The choosen character is automatically escaped:

$regex = Regex::create()
    ->digit()->atLeastOne()->literal('%/')->digit()->atLeastOne()->literal('%');

echo $regex->getRegex();    // /\d+%\/\d+%/m
echo $regex->getRegex('%'); // %\d+\%/\d+\%%m

Note: when you convert a Regex instance to a string, you get the raw regular expression string. With the preceding example :

echo "$regex"; // \d+%/\d+%

Literal Characters

Use Regex::literal to match literal characters. Special characters are automatically escaped:

echo Regex::create()
    ->literal('1+1=2'); // 1\+1\=2

The expression created by Regex::literal is indivisible: when you put a quantifier next to it, it applies to the whole expression and not only to the last character:

echo Regex::create()
    ->literal('ab')->anyTimes(); // (?:ab)*

echo Regex::create()
    ->literal('a')->literal('b')->anyTimes(); // ab*

Character Sets

Use Regex::chars to match chars in a character set. Use two dots to specify a range of characters.

echo Regex::create()
    ->chars('0..9-A..Z'); // [0-9\-A-Z]

If you want to match characters that are not in a specified set, use Regex::notChars:

echo Regex::create()
    ->notChars('0..9'); // [^0-9]

If you need to add special characters to a character set, you can provide an instance of Charset to the methods Regex::chars and Regex::notChars. For example, the following code matches letters and tabulations:

echo Regex::create()
    ->chars(Charset::create()->chars('a..zA..Z')->tab()); // [a-zA-Z\t]

You can use the following methods to match non-printable characters:

You can use shorthands for common character classes:

In addition, you can pass a base (from 2 to 26) to Charset::digit and Charset::notDigit:

echo Regex::create()
    ->chars(Charset::create()->digit()); // [\d]

echo Regex::create()
    ->chars(Charset::create()->digit(2)); // [01]

echo Regex::create()
    ->chars(Charset::create()->digit(16)); // [0-9a-fA-F]

You can match control characters (ASCII codes from 1 to 26) with Charset::control:

echo Regex::create()
    ->chars(Charset::create()->control('C')); // [\cC]

You can match an ANSI character with Charset::ansi:

echo Regex::create()
    ->chars(Charset::create()->ansi(0x7f)); // [\x7F]

You can match a range of ANSI characters with Charset::ansiRange:

echo Regex::create()
    ->chars(Charset::create()->ansiRange(0x20, 0x7f)); // [\x20-\x7F]

Finally, Charset provides some methods to work with Unicode characters.

Use Charset::extendedUnicode to match a Unicode grapheme:

echo Regex::create()
    ->chars(Charset::create()->extendedUnicode()); // [\X]

Use Charset::unicodeChar to match a specific unicode point:

echo Regex::create()
    ->chars(Charset::create()->unicodeChar(0x2122)); // [\x{2122}]

Use Charset::unicodeCharRange to match a range of unicode points:

echo Regex::create()
    ->chars(Charset::create()->unicodeCharRange(0x80, 0xff)); // [\x{80}-\x{FF}]

Use Charset::unicode to match a a Unicode class or category. For your convenience, a Unicode class with Unicode properties is provided:

echo Regex::create()
    ->chars(Charset::create()->unicode(Unicode::Letter)); // [\pL]

Note : all the methods of Charset are available in Regex:

echo Regex::create()
    ->digit();        // \d

echo Regex::create()
    ->digit(8);       // [0-7]

Match any character

If you want to match any character, use Regex::anyChar:

echo Regex::create()
    ->anyChar(); // (?s:.)

Note that the regular expression generated by the previous method matches also newlines. If you don't want to match newlines, use the method Regex::notNewline:

echo Regex::create()
    ->notNewline();   // .

Anchors

To match at the start of the string or at the end of the string, use Regex:startOfString and Regex::endOfString.

echo Regex::create()
    ->startOfString()->literal('123')->endOfString(); // \A123\z

The preceding method matches only at the string ends. If you want to match at the start of a line or at the end of a line, use Regex:startOfLine and Regex::endOfLine.

echo Regex::create()
    ->startOfLine()->literal('123')->endOfLine(); // ^123$

You can match at a word boundary with Regex::wordLimit. To match a position which is not a word boundary, use Regex::notWordLimit:

echo Regex::create()
    ->wordLimit();    // \b

echo Regex::create()
    ->notWordLimit(); // \B

Alternation

Use Regex::alt to create an alternation. There are several ways to provide each choice.

Firstly, you can pass choices as arguments:

$choices = [
    Regex::create()->literal('b'),
    Regex::create()->literal('c')
];

echo Regex::create()
    ->literal('a')
    ->alt($choices);  // a(?:b|c)

Secondly, you can give to the method the number of choices, which are taken from the previous expressions:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->literal('c')
    ->alt(2);       // a(?:b|c)

Finally, you can mark the position of the first choice with Regex::start and give no argument to the Regex::alt method:

echo Regex::create()
    ->literal('a')
    ->start()
    ->literal('b')
    ->literal('c')
    ->alt();       // a(?:b|c)

If you want to create an alternation with literals only, you can use Regex::literalAlt:

echo Regex::create()
    ->literalAlt(['one', 'two', 'three']);  // one|two|three

Quantifiers

Use Regex::optional to match an optional expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->optional();     // ab?

Use Regex::anyTimes to match any number of consecutive occurences of the previous expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->anyTimes();     // ab*

Use Regex::atLeastOne to match at least one occurences of the previous expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->atLeastOne();   // ab+

Use Regex::atLeast to match a minimum number of occurences of the previous expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->atLeast(2);     // ab{2,}

Use Regex::between to match a number of occurences of the previous expression between two numbers:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->between(2,5);   // ab{2,5}

Use Regex::times to match a precise number of occurences of the previous expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->times(2);   // ab{2}

Note: instead of add the quantifier to the previous expression, you can provide a Regex instance as last argument of each of these methods.

Greedy, Lazy, Possessive Quantifiers

In the previous examples, the quantifiers are greedy. This is the default behavior. More precisely, a quantifier can have 4 modes: GREEDY, LAZY, POSSESSIVE, and UNDEFINED. When the regular expression string is generated, a quantifier with the UNDEFINED mode is considered as GREEDY. UNDEFINED is the default mode but you can use Regex::greedy, Regex::lazy and Regex::possessive on an empty Regex (just after the creation) to modify the default behavior:

echo Regex::create()
    ->lazy()
    ->literal('a')
    ->anyTimes()
    ->literal('b')
    ->anyTimes();     // a*?b*?

The same methods can be used after a quantifier to change its behavior:

echo Regex::create()
    ->lazy()
    ->literal('a')
    ->anyTimes()
    ->greedy()
    ->literal('b')
    ->anyTimes();    // a*b*?

You can also change the behavior of all quantifiers of a group:

echo Regex::create()
    ->literal('a')->literal('b')->optional()->group(2)->anyTimes()
    ->literal('c')->anyTimes()
    ->alt(2)
    ->lazy();  // (?:ab?)*?|c*?

In the previous example, you can notice that the behavior does not apply to the optional quantifier. You can use Regex::greedyRecursive, Regex::lazyRecursive and Regex::possessiveRecursive to apply the behavior recursively:

echo Regex::create()
    ->literal('a')->literal('b')->optional()->group(2)->anyTimes()
    ->literal('c')->anyTimes()
    ->alt(2)
    ->lazyRecursive();  // (?:ab??)*?|c*?

When applied to a group, all these methods modify the behavior of a quantifier only if it has the UNDEFINED mode. In the example, if the optional quantifier is set to GREEDY, it retains its behavior:

echo Regex::create()
    ->literal('a')->literal('b')->optional()->greedy()->group(2)->anyTimes()
    ->literal('c')->anyTimes()
    ->alt(2)
    ->lazyRecursive();  // (?:ab?)*?|c*?

Grouping and Capturing

By default, when the library needs to create a group, it is not captured. To capture an expression, you must use Regex::capture:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->literal('c')
    ->alt(2)->capture();  // a(b|c)

To create a named group, give an argument to Regex::capture:

echo Regex::create()
    ->literal('a')->capture('myname'); // (?P<myname>a)

You can group several expressions with Regex::group. As with Regex::alt, you can specify the expressions to group by using the Regex::start method or by giving the number of expressions to group or by giving directly the expression (a Regex instance):

echo Regex::create()
    ->literal('a')
    ->start()
    ->literal('b')
    ->literal('c')
    ->group()->capture();        // a(bc)

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->literal('c')
    ->group(2)->capture();       // a(bc)

$group = Regex::create()->literal('b')->literal('c');
echo Regex::create()
    ->literal('a')
    ->group($group)->capture();  // a(bc)

Backreferences

Use Regex::ref to make a backreference:

echo Regex::create()
    ->literal('a')->anyTimes()->capture()
    ->literal('-')
    ->ref(1);  // (a*)\-\g{1}

echo Regex::create()
    ->literal('a')->anyTimes()->capture('myname')
    ->literal('-')
    ->ref('myname');  // (?P<myname>a*)\-(?P=myname)

Atomic grouping

Use Regex::atomic to make an atomic group:

echo Regex::create()
    ->literal('a')->anyTimes()
    ->atomic();  // (?>a*)

Lookahead, Lookbehind

Use Regex::after, Regex::notAfter, Regex::before, Regex::notBefore:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->after();        // a(?=b)

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->notAfter();     // a(?!b)

echo Regex::create()
    ->literal('a')
    ->before()
    ->literal('b');   // (?<=a)b

echo Regex::create()
    ->literal('a')
    ->notBefore()
    ->literal('b');   // (?<!a)b

Conditionals

Create a conditional with Regex::cond. This method must be preceded by a condition, an expression to match when the condition is true, and an optional expression to match when the condition is false.

Use Regex::match to check if a captured group matches:

echo Regex::create()
    ->literal('a')->capture()->optional()
    ->match(1)
    ->literal('b')
    ->literal('c')
    ->cond();       // (a)?(?(1)b|c)

echo Regex::create()
    ->literal('a')->capture('myname')->optional()
    ->match('myname')
    ->literal('b')
    ->literal('c')
    ->cond();       // (?P<myname>a)?(?(myname)b|c)

Regex::match can also be used outside of a conditional. In this case, the regular expression fails if captured group does not match:

echo Regex::create()
    ->literal('a')->capture()->optional()
    ->match(1);     // (a)?(?(1)|(?!))

The others allowed conditions are Regex::after, Regex::notAfter, Regex::before, Regex::notBefore:

echo Regex::create()
    ->literal('a')->before()
    ->literal('b')
    ->literal('c')
    ->cond();      // (?(?<=a)b|c)

If you want the 'else' expression to match nothing, you can remove the 'else' expression:

echo Regex::create()
    ->literal('a')->before()
    ->literal('b')
    ->cond();      // (?(?<=a)b|)

If you want the 'then' expression to match nothing, you can use Regex::notCond to inverse the condition:

echo Regex::create()
    ->literal('a')->before()
    ->literal('c')
    ->notCond();   // (?(?<=a)|c)

You can also use Regex::nothing:

echo Regex::create()
    ->literal('a')->before()
    ->nothing()
    ->literal('c')
    ->cond();      // (?(?<=a)|c)

Case sensitivity

By default, the regular expression is case sensitive. Use Regex::caseSensitive or Regex::caseInsensitive to change this behavior. Each of these methods accepts an optional boolean argument. If this argument is false, the behavior is inverted: $regex->caseSensitive(false) is equivalent to $regex->caseInsensitive().

These methods change the behavior of the last expression:

echo Regex::create()
    ->literal('a')
    ->literal('b')
    ->caseInsensitive()
    ->literal('c');   // a(?i)b(?-i)c

When used at the beginning of the Regex, the whole expression is affected:

echo Regex::create()
    ->caseInsensitive()
    ->literal('a')
    ->literal('b')
    ->literal('c');   // (?i)abc(?-i)

Recursion

Use Regex::matchRecursive to match recursively the whole pattern. This example matches balanced parentheses:

echo Regex::create()
    ->literal('(')
    ->start()
        ->notChars('()')->atLeastOne()->atomic()
        ->matchRecursive()->anyTimes()
    ->alt()
    ->literal(')');   // \((?:(?>[^\(\)]+)|(?:(?R))*)\)

Special Expressions

Regex::crlf matches a Carriage Return followed by a Line Feed (Windows line breaks):

echo Regex::create()
    ->crlf();   // \r\n

Regex:unsignedIntRange matches a nonnegative integer in a given range. The third parameters specify how leading zeros are handled:

echo Regex::create()
    ->unsignedIntRange(1, 12);       // 1[0-2]|0?[1-9]   leadings zeros are optional

echo Regex::create()
    ->unsignedIntRange(1, 12, true); // 1[0-2]|0[1-9]    leadings zeros are required
    
echo Regex::create()
    ->unsignedIntRange(1, 12, false); // 1[0-2]|[1-9]    leadings zeros are not accepted

Note that in any case, the number of digits cannot exceed the number of digits of the maximum value.