nimayneb/yawl

YAWL - Yet another wildcard library - Match pattern for strings with asterisk and query token

2.2.0 2020-05-13 16:29 UTC

This package is auto-updated.

Last update: 2024-12-25 21:20:29 UTC


README

Build Status

This is a library with classes for any wildcard implementation that finds pattern with * (asterisk) and ? (query) token.

Problem

Without regular expression extension (named ext-pcre - before PHP 5.3) you have no wildcard support within PHP.

See https://www.php.net/manual/en/pcre.installation.php.

The known wildcard behavior (see Wildcard character (@Wikipedia)) is limited for a good implementation.

A small remedy?

We need a kind of "compiled" and "cached" pattern. The implementation of regular expression like preg_match has a huge performance. We lost these advantage as of PHP 7.0, because of just-in-time compiled pcre pattern!

Table of content

  1. Benchmark
  2. Wildcard variants
  3. Possible valid pattern
  4. Invalid pattern
  5. Escaping
  6. Repeating phrases
  7. Caching
  8. Wish list
  9. Appendix

API:

  1. Matcher (for use of single calls)
  2. Performer (for use of multiple calls)
  3. Converter (for use of regular expression)

Benchmark

When we benchmark several methods to match a fitting phrase (1000 random strings), we have a good results without "regular expressions":

Single Byte:

Multi Byte (like Unicode):

Internal PHP functions comparison

Wildcard variants

Possible valid pattern

    ??      (2 characters)
    ???*  (2-3 characters)

Invalid pattern

    ***
    ?**
    ?*?
    *?

Escaping

    \\
    \?
    \*

Please note of this escaping scenarios:

Further explanation:

Repeating phrases

The asterisk (*) has a problem finding the right position. If several identical phrases can be found in a string, the search must take place in all positions to find the pattern.

 Search: *is?ue (where is "*is")
Subject: this is an asterisk issue
 Founds:   ^  ^          ^   ^

The simplest solution is to break down the pattern recursively, but it should be not recursively.

Caching

The regular expression extension has a caching and compiling strategy to improve performance. The second time the same pattern is called, a tremendous increase in performance is achieved.

To help us to improve also performance, we use a simple key-value caching.

    protected array $cachedResults = [];

    public function match(string $subject): bool
    {
        return $this->cachedResults[$subject] ?? $this->cachedResults[$subject] = $this->computePhrases($subject, 0);
    }

This is not a preferred solution.

Wish list

  • Remove recursive call of WildcardPerformer::computePhrases (see Wikipedia - Matching wildcards)
  • Remove of StringFunctionMapper (too slow)
  • Combine Matcher and Performer (two advantages in one)
  • Caching interface
  • New pattern ??** (0 or 2 characters) or ?????** (0 or 5 characters)
  • Glob support
    • Example: [abcdef0123456789], [0-9a-f], [!a-z]
    • Use in YAWL:
      • [0-9a-f]? (one times)
      • [0-9a-f]?* (zero or one times)
      • [0-9a-f]* (zero or N times)
      • [0-9a-f]** (one or N times)
      • [0-9a-f]x (one times then follows x)

Appendix

Common Algorithms:

Non-recursive Algorithms:

Recursive Algorithms:

PDepend

Build Status

  • NOP - Number Of Packages
  • NOC - Number Of Classes
  • NOM - Number Of Methods
  • LOC – Lines of Code
  • CYCLO - Cyclomatic complexity
  • CALLS - number of distinct function- and method-calls
  • ANDC - Average Number of Derived Classes
  • AHH - Average Hierarchy Height