nikic/phlexy

Lexing experiments in PHP

v0.1 2013-03-13 16:20 UTC

README

This project is a follow-up to my post on fast lexing in PHP. It contains a few lexer implementations (both stateless and stateful) and related performance tests.

Usage

Lexers are created from a lexer definition using a factory class.

For example, to create a preg_replace-based stateless CSV lexer, you can use the following code:

<?php
require 'path/to/lib/Phlexy/bootstrap.php';

$factory = new Phlexy\LexerFactory\Stateless\UsingPregReplace(
    new Phlexy\LexerDataGenerator
);

$lexer = $factory->createLexer(array(
    '[^",\r\n]+'                     => 0, // 0, 1, 2, 3 are the tokens
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1, // they should really be constants
    ','                              => 2,
    '\r?\n'                          => 3,
));

$tokens = $lexer->lex("hallo world,foo bar,more foo,more bar,\"rare , escape\",some more,stuff\n...");

A stateful lexer is created similarly:

<?php
require 'path/to/lib/Phlexy/bootstrap.php';

$factory = new Phlexy\LexerFactory\Stateful\UsingCompiledRegex(
    new Phlexy\LexerDataGenerator
);

// The "i" is an additional modifier (all createLexer methods accept it)
$lexer = $factory->createLexer($lexerDefinition, 'i');

For an example of a stateful lexer definition, you can look at the definition for lexing PHP source code.

Performance

A performance comparison for the different lexer implementations can be done using the performance testing script:

$ /c/php-5.4.1/php examples/performanceTests.php

Timing lexing of CVS data:
Took 0.33259892463684 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.28691792488098 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.26784682273865 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.22256088256836 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)

Timing alphabet lexing of all "a":
Took 0.30809283256531 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.40949702262878 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.38628792762756 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.31351900100708 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)

Timing alphabet lexing of all "z":
Took 0.62087893486023 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.23668503761292 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.22538208961487 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.18682312965393 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)

Timing alphabet lexing of random string:
Took 0.94398212432861 seconds (Phlexy\Lexer\Stateless\Simple)
Took 0.42041087150574 seconds (Phlexy\Lexer\Stateless\WithCapturingGroups)
Took 0.40309715270996 seconds (Phlexy\Lexer\Stateless\WithoutCapturingGroups)
Took 0.37058591842651 seconds (Phlexy\Lexer\Stateless\UsingPregReplace)

Timing PHP lexing of this file:
Took 0.098251104354858 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.020735025405884 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)

Timing PHP lexing of larger TestAbstract file:
Took 0.268701076507570 seconds (Phlexy\Lexer\Stateful\Simple)
Took 0.065788984298706 seconds (Phlexy\Lexer\Stateful\UsingCompiledRegex)

Stateless\Simple and Stateful\Simple are trivial lexer implementations (which loop through the regular expressions).
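
To make the "loop through the regular expressions" idea concrete, here is a minimal sketch of a stateless lexer of that kind. It is not Phlexy's actual implementation: the function name `simpleLex` and the `[token, text]` result shape are assumptions made for illustration, and the patterns are assumed not to contain the `~` delimiter or to match the empty string.

```php
<?php
// Hypothetical helper (not part of Phlexy): tries each regex in turn
// at the current offset and takes the first one that matches.
function simpleLex(array $regexToToken, $input) {
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        foreach ($regexToToken as $regex => $token) {
            // \G anchors the match at $offset instead of searching ahead.
            $pattern = '~\G(?:' . $regex . ')~';
            if (preg_match($pattern, $input, $m, 0, $offset) && $m[0] !== '') {
                $tokens[] = [$token, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // restart with the first regex at the new offset
            }
        }
        throw new RuntimeException("Unexpected character at offset $offset");
    }

    return $tokens;
}

$tokens = simpleLex([
    '[^",\r\n]+' => 0,
    ','          => 2,
    '\r?\n'      => 3,
], "a,b\n");
```

Every token costs up to one preg_match call per regex, which is why this approach slows down as the definition grows.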

Stateless\WithoutCapturingGroups, Stateless\WithCapturingGroups and Stateful\UsingCompiledRegex use the compiled regex approach described in the blog post mentioned above.
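
The following is a rough sketch of the compiled regex idea: all patterns are joined into one alternation, each in its own capturing group, and the group that matched identifies the token. It is an illustration under assumptions, not the code of any of the classes above: `compiledLex` is a hypothetical name, and the individual patterns are assumed to contain no capturing groups of their own, no `~` delimiter, and no empty matches.

```php
<?php
// Hypothetical helper (not part of Phlexy): one preg_match call per token.
function compiledLex(array $regexToToken, $input) {
    // Builds e.g. ~\G(?:(regex1)|(regex2)|(regex3))~
    $compiled = '~\G(?:(' . implode(')|(', array_keys($regexToToken)) . '))~';
    $tokenByGroup = array_values($regexToToken);

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        if (!preg_match($compiled, $input, $m, 0, $offset) || $m[0] === '') {
            throw new RuntimeException("Unexpected character at offset $offset");
        }
        // preg_match drops trailing unmatched groups, so the last entry
        // of $m belongs to the alternative that matched.
        $tokens[] = [$tokenByGroup[count($m) - 2], $m[0]];
        $offset += strlen($m[0]);
    }

    return $tokens;
}

$tokens = compiledLex([
    '[^",\r\n]+'                     => 0,
    '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"' => 1,
    ','                              => 2,
    '\r?\n'                          => 3,
], "a,b\n");
```

The loop over the individual regexes now happens inside the regex engine rather than in PHP code.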

Stateless\UsingPregReplace is an extension of the compiled regex approach, where the looping through the regular expression is done by (mis)using preg_replace_callback.
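
As an illustration of that trick, the sketch below lets preg_replace_callback walk the input and collects tokens in the callback. It is a simplified approximation, not Phlexy's actual code: `pregReplaceLex` is a hypothetical name, and the patterns are again assumed to have no capturing groups of their own, no `~` delimiter, and no empty matches.

```php
<?php
// Hypothetical helper (not part of Phlexy): a single preg_replace_callback
// call drives the whole lexing loop.
function pregReplaceLex(array $regexToToken, $input) {
    // \G forces every match to start exactly where the previous one ended.
    $compiled = '~\G(?:(' . implode(')|(', array_keys($regexToToken)) . '))~';
    $tokenByGroup = array_values($regexToToken);

    $tokens = [];
    $consumed = 0;

    preg_replace_callback(
        $compiled,
        function ($m) use (&$tokens, &$consumed, $tokenByGroup) {
            // Trailing unmatched groups are dropped, so count($m)
            // reveals which alternative matched.
            $tokens[] = [$tokenByGroup[count($m) - 2], $m[0]];
            $consumed += strlen($m[0]);
            return ''; // the replacement itself is irrelevant
        },
        $input
    );

    // Matching stops at the first position no pattern accepts.
    if ($consumed !== strlen($input)) {
        throw new RuntimeException("Unexpected character at offset $consumed");
    }

    return $tokens;
}

$tokens = pregReplaceLex([
    '[^",\r\n]+' => 0,
    ','          => 2,
    '\r?\n'      => 3,
], "a,b\n");
```

The callback has no way to switch to a different compiled regex partway through the call, which is what a stateful lexer would need.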

As the above performance measurements show, the Simple approach is a good bit slower than using compiled regexes. For the CSV data the compiled approach is only about 1.2 times faster, but the difference increases significantly the more regular expressions there are. E.g. lexing the alphabet on a random string is more than twice as fast. For lexing PHP the compiled approach is four to five times as fast.

The preg_replace trick makes the whole thing another bit faster. Sadly, preg_replace can't be used for stateful lexers; at least I couldn't figure out a fast way to do the state transitions.