pcrov/unicode

Miscellaneous Unicode utility functions

0.1.1 2020-10-26 13:33 UTC

Last update: 2024-12-08 17:30:23 UTC


README

Functions

Namespace pcrov\Unicode.

surrogate_pair_to_code_point(int $high, int $low): int

Translates a UTF-16 surrogate pair (a high surrogate followed by a low surrogate) into the single code point it encodes. Wikipedia's UTF-16 article explains surrogate pairs fairly well.
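
The arithmetic behind the conversion is short enough to sketch. The following is an illustration only, not the package's source: the function name is made up, and throwing on out-of-range input is an assumption about how one might handle bad arguments, not a statement about what surrogate_pair_to_code_point() does.

```php
<?php
// Hypothetical helper for illustration; not the package's implementation.
function sketch_surrogate_pair_to_code_point(int $high, int $low): int
{
    // High surrogates are U+D800..U+DBFF, low surrogates are U+DC00..U+DFFF.
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        // Assumption: reject anything that is not a proper pair.
        throw new InvalidArgumentException("Not a valid surrogate pair.");
    }

    // Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", sketch_surrogate_pair_to_code_point(0xD83D, 0xDE00)); // prints "U+1F600"
```

The pair D83D DE00 is the UTF-16 encoding of U+1F600, which is why it makes a handy sanity check.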

utf8_find_invalid_byte_sequence(string $string): ?int

Returns the position of the first invalid byte sequence or null if the input is valid.

utf8_get_invalid_byte_sequence(string $string): ?string

Returns the first invalid byte sequence or null if the input is valid.

utf8_get_state_machine(): array

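Provides a state machine that lets you walk a (potentially endless) UTF-8 sequence byte by byte.

It is a nested array of the form [byte => [valid next byte => ...], ...]: each key is a byte that is valid at the current position, and its value is the array of bytes that may follow it.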

Example use:

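use function pcrov\Unicode\utf8_get_state_machine;

function utf8_generate_all_code_points(): string
{
    // Depth-first walk over every path through the state machine.
    $generator = function (array $machine, string $buffer = "") use (&$generator) {
        // Completed a UTF-8 encoded code point: "\x0" is only valid as the
        // first byte of a sequence, so finding it here means the walk is
        // back at the starting state.
        if ($buffer !== "" && isset($machine["\x0"])) {
            return $buffer;
        }

        $out = "";
        foreach ($machine as $byte => $next) {
            // Follow each valid next byte, extending the sequence so far.
            $out .= $generator($next, $buffer . $byte);
        }

        return $out;
    };

    return $generator(utf8_get_state_machine());
}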

utf8_validate(string $string): bool

Does what it says on the box: returns true if the string is valid UTF-8, false otherwise.

Data

The test/data directory holds two files containing all possible UTF-8 encoded characters, all 1,112,064 of them: one as plain text, the other as JSON. These are not included in packaged stable releases but can be generated with the example utf8_generate_all_code_points() function above, which returns the plain text string.
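
For the curious, the 1,112,064 figure is the size of the Unicode code space minus the surrogate range, which UTF-8 is not allowed to encode. A quick check (the variable names are just for illustration):

```php
<?php
$codePoints = 0x10FFFF + 1;        // U+0000..U+10FFFF: 1,114,112 code points
$surrogates = 0xDFFF - 0xD800 + 1; // U+D800..U+DFFF: 2,048 surrogates

// What remains is the set of Unicode scalar values.
$scalarValues = $codePoints - $surrogates;

echo number_format($scalarValues), "\n"; // prints "1,112,064"
```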

Excerpts from the Unicode 10.0.0 standard:

Recreated here for ease of reference. Nobody likes PDFs.

Table 3-6. UTF-8 Bit Distribution
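
The bit distribution in Table 3-6 can be read as an encoding recipe: pick the sequence length from the size of the scalar value, then spread its bits across the lead byte and the 10xxxxxx continuation bytes. A minimal sketch of that recipe follows. The function name is hypothetical and the package does not necessarily offer an encoder; this only illustrates the table.

```php
<?php
// Hypothetical helper for illustration: encodes one Unicode scalar value
// as UTF-8 following the standard bit distribution.
function sketch_code_point_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value.");
    }

    if ($cp <= 0x7F) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110yyyyy 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110zzzz 10yyyyyy 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }

    // 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(sketch_code_point_to_utf8(0x20AC)), "\n"; // prints "e282ac"
```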

Table 3-7. Well-Formed UTF-8 Byte Sequences
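
The nine rows of Table 3-7 translate almost mechanically into a byte-level pattern. The sketch below assumes PCRE without the /u modifier, so the pattern sees raw bytes. The constant and function names are made up for illustration, and this is not taken from the package: it is just one way to turn the table into something that reports the byte offset of the first ill-formed sequence.

```php
<?php
// One well-formed UTF-8 sequence, one alternative per row of Table 3-7.
// \G anchors the match at the offset passed to preg_match().
const SKETCH_WELL_FORMED_UTF8 = '/\G(?:'
    . '[\x00-\x7F]'                   // U+0000..U+007F
    . '|[\xC2-\xDF][\x80-\xBF]'       // U+0080..U+07FF
    . '|\xE0[\xA0-\xBF][\x80-\xBF]'   // U+0800..U+0FFF
    . '|[\xE1-\xEC][\x80-\xBF]{2}'    // U+1000..U+CFFF
    . '|\xED[\x80-\x9F][\x80-\xBF]'   // U+D000..U+D7FF
    . '|[\xEE\xEF][\x80-\xBF]{2}'     // U+E000..U+FFFF
    . '|\xF0[\x90-\xBF][\x80-\xBF]{2}' // U+10000..U+3FFFF
    . '|[\xF1-\xF3][\x80-\xBF]{3}'    // U+40000..U+FFFFF
    . '|\xF4[\x80-\x8F][\x80-\xBF]{2}' // U+100000..U+10FFFF
    . ')/';

// Hypothetical helper: byte offset of the first ill-formed sequence,
// or null when the whole string is well-formed.
function sketch_first_ill_formed_offset(string $bytes): ?int
{
    $length = strlen($bytes);
    $offset = 0;

    while ($offset < $length) {
        // Try to consume exactly one well-formed sequence at $offset.
        if (preg_match(SKETCH_WELL_FORMED_UTF8, $bytes, $match, 0, $offset) !== 1) {
            return $offset;
        }
        $offset += strlen($match[0]);
    }

    return null;
}

var_dump(sketch_first_ill_formed_offset("caf\xC3\xA9")); // NULL
var_dump(sketch_first_ill_formed_offset("ab\xC0\x80"));  // int(2)
```

Matching one sequence per call is slow on large inputs, but it keeps the correspondence with the table obvious: overlong forms (C0, C1, E0 80..9F, F0 80..8F), encoded surrogates (ED A0..BF) and anything past U+10FFFF simply have no row to match.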