s9e / regexp-builder
Single-purpose library that generates regular expressions that match a list of strings.
Requires
- php: >=8.1
- lib-pcre: >=7.2
Requires (Dev)
- phpunit/phpunit: >=9.1
README
s9e\RegexpBuilder is a single-purpose library that generates a regular expression that matches a given list of strings. It is best suited for efficiently finding a list of literals in a text.
Simply put, given ['foo', 'bar', 'baz'] as input, the library will generate ba[rz]|foo, a regular expression that can match any of the strings foo, bar, or baz.
Usage
$builder = new s9e\RegexpBuilder\Builder;
echo '/', $builder->build(['foo', 'bar', 'baz']), '/';
/ba[rz]|foo/
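As a rough sketch of how the result might be used to find the literals in a text, the snippet below hardcodes the regexp body shown above (in real code it would be the return value of build()) and relies only on PHP's built-in PCRE functions. The variable names and the sample text are illustrative assumptions. Because the generated expression contains a top-level alternation, it is wrapped in a non-capturing group before anything is added around it.

```php
<?php
// Regexp body copied from the output above; normally this would come from $builder->build().
$regexp = 'ba[rz]|foo';

// Wrap the alternation in a non-capturing group so that \b applies to every branch.
$pattern = '/\b(?:' . $regexp . ')\b/';

// Illustrative sample text (assumption, not from the library's documentation).
$text = 'foo and bar walked into a bazaar';
preg_match_all($pattern, $text, $matches);

// Contains 'foo' and 'bar'; 'bazaar' is rejected by the trailing \b.
print_r($matches[0]);
```

Without the group, \b would bind only to the first and last alternatives, which is rarely what is intended.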
Examples
UTF-8 input with UTF-8 output
$builder = new s9e\RegexpBuilder\Builder([
    'input'  => 'Utf8',
    'output' => 'Utf8'
]);
echo '/', $builder->build(['☺', '☹']), '/u';
/[☹☺]/u
Raw input with raw output
Note that the output is shown here in quoted-printable (MIME) encoding because raw bytes cannot be displayed as UTF-8. Raw output is most suitable when the result is saved in binary form, e.g. in a data cache.
$builder = new s9e\RegexpBuilder\Builder([
    'input'  => 'Bytes',
    'output' => 'Bytes'
]);
echo '/', quoted_printable_encode($builder->build(['☺', '☹'])), '/';
/=E2=98[=B9=BA]/
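To make the raw form more concrete, the following sketch decodes the quoted-printable text shown above back into bytes and uses it directly with preg_match(). It does not call the library; the encoded string is copied from the output above and the variable names are illustrative.

```php
<?php
// Quoted-printable form copied from the output above.
$encoded = '/=E2=98[=B9=BA]/';

// Decoding yields the raw regexp, with literal bytes inside the character class.
$regexp = quoted_printable_decode($encoded);
echo bin2hex($regexp), "\n"; // 2fe2985bb9ba5d2f

// UTF-8 bytes for U+263A and for U+263B (the latter is not in the list).
$listed   = "\xE2\x98\xBA";
$unlisted = "\xE2\x98\xBB";

// No u flag: the pattern and the subject are both treated as plain bytes.
$matchesListed   = preg_match($regexp, 'mood: ' . $listed);
$matchesUnlisted = preg_match($regexp, 'mood: ' . $unlisted);
var_dump($matchesListed, $matchesUnlisted);
```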
Raw input with PHP output
For PHP regular expressions that do not use the u flag. PHP output is most suitable for regexps that are embedded in PHP sources, in conjunction with var_export(). The output itself is ASCII, with non-ASCII and non-printable characters escaped.
$builder = new s9e\RegexpBuilder\Builder([
    'input'  => 'Bytes',
    'output' => 'PHP'
]);
echo '/', $builder->build(['☺', '☹']), '/';
/\xE2\x98[\xB9\xBA]/
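The pairing with var_export() mentioned above could look something like the sketch below. The regexp is hardcoded from the output above rather than built by the library, and the surrounding "return ...;" statement is only an assumed way of embedding it in a generated source file.

```php
<?php
// ASCII-only regexp copied from the output above.
$regexp = '/\xE2\x98[\xB9\xBA]/';

// var_export() produces a PHP string literal; backslashes are doubled as needed.
$source = 'return ' . var_export($regexp, true) . ';';
echo $source, "\n";

// The escapes are interpreted by PCRE, so the pattern matches the raw UTF-8 bytes of U+263A.
$matched = preg_match($regexp, "I am \xE2\x98\xBA today");
var_dump($matched);
```

Since the pattern contains only printable ASCII, the generated source file stays readable and is not affected by the file's encoding.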
UTF-8 input with PHP output
For PHP regular expressions that use the u flag.
$builder = new s9e\RegexpBuilder\Builder([
    'input'  => 'Utf8',
    'output' => 'PHP'
]);
echo '/', $builder->build(['☺', '☹']), '/u';
/[\x{2639}\x{263A}]/u
UTF-8 input with JavaScript output
For JavaScript regular expressions that do not use the u flag and need the higher codepoints to be split into surrogates. The regexp itself uses only ASCII characters, with non-ASCII and non-printable characters escaped.
$builder = new s9e\RegexpBuilder\Builder([
    'input'        => 'Utf8',
    'inputOptions' => ['useSurrogates' => true],
    'output'       => 'JavaScript'
]);
echo '/', $builder->build(['☺', '☹']), "/\n";
echo '/', $builder->build(['😁', '😂']), '/';
/[\u2639\u263A]/
/\uD83D[\uDE01\uDE02]/
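For readers wondering where \uD83D and \uDE01 come from, the sketch below applies the standard UTF-16 surrogate formula to a codepoint above U+FFFF. The function toSurrogatePair() is a hypothetical helper written for this illustration; it is not part of the library and says nothing about how the library computes surrogates internally.

```php
<?php
// Hypothetical helper: split a supplementary codepoint into its UTF-16 surrogate pair.
function toSurrogatePair(int $codepoint): string
{
    // Supplementary codepoints are encoded as a 20-bit offset from U+10000.
    $offset = $codepoint - 0x10000;

    // The top 10 bits go to the high surrogate, the bottom 10 bits to the low surrogate.
    $high = 0xD800 + ($offset >> 10);
    $low  = 0xDC00 + ($offset & 0x3FF);

    return sprintf('\u%04X\u%04X', $high, $low);
}

echo toSurrogatePair(0x1F601), "\n"; // \uD83D\uDE01
echo toSurrogatePair(0x1F602), "\n"; // \uD83D\uDE02
```

Both emoji share the same high surrogate, which is why the generated regexp can factor \uD83D out of the character class.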
UTF-8 input with Unicode-aware JavaScript output
For JavaScript regular expressions that use the u flag introduced in ECMAScript 6. In that case, you can simply forgo using surrogates.
$builder = new s9e\RegexpBuilder\Builder([
    'input'  => 'Utf8',
    'output' => 'JavaScript'
]);
echo '/', $builder->build(['☺', '☹']), "/u\n";
echo '/', $builder->build(['😁', '😂']), '/u';
/[\u2639\u263A]/u
/[\u{1F601}\u{1F602}]/u
Custom delimiters
$strings = ['/', '(', ')', '#'];

$builder = new s9e\RegexpBuilder\Builder;
echo '/', $builder->build($strings), "/\n";

$builder = new s9e\RegexpBuilder\Builder(['delimiter' => '#']);
echo '#', $builder->build($strings), "#\n";

$builder = new s9e\RegexpBuilder\Builder(['delimiter' => '()']);
echo '(', $builder->build($strings), ')';
/[#()\/]/
#[\#()/]#
([#\(\)/])
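The three results differ only in which characters are escaped for the chosen delimiter. As a quick sanity check, the sketch below feeds the three patterns shown above (hardcoded, not generated) to preg_match() and confirms that each one matches all four input strings and nothing else from a small sample. The variable names are illustrative.

```php
<?php
// Patterns copied from the output above, one per delimiter style.
$patterns = ['/[#()\/]/', '#[\#()/]#', '([#\(\)/])'];

$results = [];
foreach ($patterns as $pattern) {
    foreach (['/', '(', ')', '#', 'a'] as $char) {
        // 1 if the single character matches, 0 otherwise.
        $results[$pattern][$char] = preg_match($pattern, $char);
    }
}

// Each pattern is expected to accept '/', '(', ')' and '#', and to reject 'a'.
print_r($results);
```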
Lowercase hexadecimal representation
By default, the PHP and JavaScript output uses uppercase hexadecimal symbols, e.g. \xAB. This can be changed to lowercase using the outputOptions setting.
$builder = new s9e\RegexpBuilder\Builder([
    'input'         => 'Bytes',
    'output'        => 'PHP',
    'outputOptions' => ['case' => 'lower']
]);
echo '/', $builder->build(['☺', '☹']), "/\n";

$builder = new s9e\RegexpBuilder\Builder([
    'input'         => 'Utf8',
    'output'        => 'JavaScript',
    'outputOptions' => ['case' => 'lower']
]);
echo '/', $builder->build(['☺', '☹']), '/';
/\xe2\x98[\xb9\xba]/
/[\u2639\u263a]/
Using meta-characters
Some individual characters can be used to represent arbitrary expressions in the input strings. The requirements are that:
- Only single characters (as per the input encoding) can be used. For example, ? is allowed but not ??.
- The regular expression a character is mapped to must be valid on its own. For example, .* is valid but not +.
In the following example, we emulate Bash-style wildcards by mapping ? to . and * to .*.
$builder = new s9e\RegexpBuilder\Builder([
    'meta' => ['?' => '.', '*' => '.*']
]);
echo '/', $builder->build(['foo?', 'bar*']), '/';
/bar.*|foo./
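To see how the wildcard mapping behaves in practice, the sketch below anchors the generated expression (hardcoded here from the output above) and tests it against a few sample names. The names are made up for illustration, and anchoring with ^ and $ is one possible way of requiring whole-string matches, as a shell glob would.

```php
<?php
// Regexp body copied from the output above.
$regexp = 'bar.*|foo.';

// Anchor both ends; the group keeps ^ and $ from binding to a single branch.
$pattern = '/^(?:' . $regexp . ')$/';

$results = [];
foreach (['fool', 'foo', 'bar', 'barista', 'foods'] as $name) {
    $results[$name] = (bool) preg_match($pattern, $name);
}

// 'fool', 'bar' and 'barista' match; 'foo' lacks the character required by ?,
// and 'foods' has one character too many.
var_dump($results);
```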
In the following example, we map X to \d. Note that sequences produced by meta-characters may appear in character classes if the result is valid.
$builder = new s9e\RegexpBuilder\Builder([
    'meta' => ['X' => '\\d']
]);
echo '/', $builder->build(['a', 'b', 'X']), '/';
/[\dab]/