sjorek / unicode-normalization
An enhanced facade to existing unicode-normalization implementations.
v0.3.0
2018-03-25 16:50 UTC
Requires
- php: ^7.0
- symfony/filesystem: ^3.4 || ^4.0
Requires (Dev)
- ext-iconv: *
- ext-intl: *
- ext-mbstring: *
- ext-zlib: *
- friendsofphp/php-cs-fixer: ^2.7
- mikey179/vfsstream: ^1.6
- phpunit/phpunit: ^6.5
- sensiolabs/security-checker: ^4.1
Suggests
- ext-iconv: May enable a special unicode-normalization mode for HFS+ filesystems (NFD_MAC), if the 'iconv' extension supports the 'utf-8-mac' charset.
- ext-intl: For best performance, but please consider 'sjorek/unicode-normalization-native-implementation' package instead.
- ext-mbstring: For best performance, but please consider 'sjorek/unicode-normalization-native-implementation' package instead.
- patchwork/utf8: For compatibility, if one of the above php extensions 'mbstring' or 'intl' is not available and the 'symfony/polyfill-*' packages are not suitable.
- symfony/polyfill-intl-normalizer: For compatibility, if the php extension 'intl' is not available and the 'patchwork/utf8' package is not suitable.
- symfony/polyfill-mbstring: For compatibility, if the php extension 'mbstring' is not available and the 'patchwork/utf8' package is not suitable.
README
A composer-package providing an enhanced facade to existing unicode-normalization implementations.
Installation
php composer.phar require sjorek/unicode-normalization
Usage
Unicode Normalization
<?php

/**
 * Class for normalizing unicode.
 *
 *    “Normalization: A process of removing alternate representations of equivalent
 *    sequences from textual data, to convert the data into a form that can be
 *    binary-compared for equivalence. In the Unicode Standard, normalization refers
 *    specifically to processing to ensure that canonical-equivalent (and/or
 *    compatibility-equivalent) strings have unique representations.”
 *
 *     -- quoted from unicode glossary linked below
 *
 * @see http://www.unicode.org/glossary/#normalization
 * @see http://www.php.net/manual/en/class.normalizer.php
 * @see http://www.w3.org/wiki/I18N/CanonicalNormalization
 * @see http://www.w3.org/TR/charmod-norm/
 * @see http://blog.whatwg.org/tag/unicode
 * @see http://en.wikipedia.org/wiki/Unicode_equivalence
 * @see http://stackoverflow.com/questions/7931204/what-is-normalized-utf-8-all-about
 * @see http://php.net/manual/en/class.normalizer.php
 */
class Sjorek\UnicodeNormalization\Normalizer
    implements Sjorek\UnicodeNormalization\Implementation\NormalizerInterface
{
    /**
     * Constructor.
     *
     * @param null|bool|int|string $form (optional) Set normalization form, default: NFC
     *
     * Besides the normalization form class constants defined below,
     * the following case-insensitive aliases are supported:
     * <pre>
     * - Disable unicode-normalization     : 0,  false, null, empty
     * - Ignore/skip unicode-normalization : 1,  NONE, true, binary, default, validate
     * - Normalization form D              : 2,  NFD, FORM_D, D, form-d, decompose, collation
     * - Normalization form D (mac)        : 18, NFD_MAC, FORM_D_MAC, D_MAC, form-d-mac, d-mac, mac
     * - Normalization form KD             : 3,  NFKD, FORM_KD, KD, form-kd
     * - Normalization form C              : 4,  NFC, FORM_C, C, form-c, compose, recompose, legacy, html5
     * - Normalization form KC             : 5,  NFKC, FORM_KC, KC, form-kc, matching
     * </pre>
     *
     * Hints:
     * <pre>
     * - The W3C recommends NFC for HTML5 Output.
     * - Mac OS X's HFS+ filesystem uses an NFD variant to store paths. We provide one implementation for this
     *   special variant, but plain NFD works in most cases too. Even if you use a form other than NFD or its
     *   variant, HFS+ will always store paths as decomposed NFD strings if needed.
     * </pre>
     */
    public function __construct($form = null);

    /**
     * Ignore any decomposition/composition.
     *
     * Ignoring Unicode decomposition/composition means nothing is automatically normalized.
     * Many Linux- and BSD-filesystems do not normalize paths and filenames, but treat them as binary data.
     * Apple™'s APFS filesystem treats paths and filenames as binary data.
     *
     * @var int
     */
    const NONE = 1;

    /**
     * Canonical decomposition.
     *
     *    “A normalization form that erases any canonical differences, and produces a
     *    decomposed result. For example, ä is converted to a + umlaut in this form.
     *    This form is most often used in internal processing, such as in collation.”
     *
     *     -- quoted from unicode glossary linked below
     *
     * @var int
     *
     * @see http://www.unicode.org/glossary/#normalization_form_d
     * @see https://developer.apple.com/library/content/qa/qa1173/_index.html
     * @see https://developer.apple.com/library/content/qa/qa1235/_index.html
     */
    const NFD = 2;

    /**
     * Compatibility decomposition.
     *
     *    “A normalization form that erases both canonical and compatibility differences,
     *    and produces a decomposed result: for example, the single dž character is
     *    converted to d + z + caron in this form.”
     *
     *     -- quoted from unicode glossary linked below
     *
     * @var int
     *
     * @see http://www.unicode.org/glossary/#normalization_form_kd
     */
    const NFKD = 3;

    /**
     * Canonical decomposition followed by canonical composition.
     *
     *    “A normalization form that erases any canonical differences, and generally produces
     *    a composed result. For example, a + umlaut is converted to ä in this form. This form
     *    most closely matches legacy usage.”
     *
     *     -- quoted from unicode glossary linked below
     *
     * W3C recommends NFC for HTML5 output and requires NFC for HTML5-compliant parser implementations.
     *
     * @var int
     * @var int $FORM_C
     *
     * @see http://www.unicode.org/glossary/#normalization_form_c
     */
    const NFC = 4;

    /**
     * Compatibility decomposition followed by canonical composition.
     *
     *    “A normalization form that erases both canonical and compatibility differences,
     *    and generally produces a composed result: for example, the single dž character
     *    is converted to d + ž in this form. This form is commonly used in matching.”
     *
     *     -- quoted from unicode glossary linked below
     *
     * @var int
     * @var int $FORM_KC
     *
     * @see http://www.unicode.org/glossary/#normalization_form_kc
     */
    const NFKC = 5;

    /**
     * Apple™ Canonical decomposition for HFS Plus filesystems.
     *
     *    “For example, HFS Plus (OS X Extended) uses a variant of Normal Form D in
     *    which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF
     *    are not decomposed …”
     *
     *     -- quoted from Apple™'s Technical Q&A 1173 linked below
     *
     *    “The characters with codes in the range u+2000 through u+2FFF are punctuation,
     *    symbols, dingbats, arrows, box drawing, etc. The u+24xx block, for example, has
     *    single characters for things like u+249c "⒜". The characters in this range are
     *    not fully decomposed; they are left unchanged in HFS Plus strings. This allows
     *    strings in Mac OS encodings to be converted to Unicode and back without loss of
     *    information. This is not unnatural since a user would not necessarily expect a
     *    dingbat "⒜" to be equivalent to the three character sequence "(a)" in a file name.
     *
     *    The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs,
     *    and are not decomposed in HFS Plus strings.
     *
     *    So, for the example given earlier, u+00E9 ("é") must be stored as the two Unicode
     *    characters u+0065 and u+0301 (in that order). The Unicode character u+00E9 ("é")
     *    may not appear in a Unicode string used as part of an HFS Plus B-tree key.”
     *
     *     -- quoted from Apple™'s Technical Note TN1150 linked below
     *
     * @var int
     *
     * @see NormalizerInterface::NFD
     * @see https://developer.apple.com/library/content/qa/qa1173/_index.html
     * @see https://developer.apple.com/library/content/qa/qa1235/_index.html
     * @see http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.html#CanonicalDecomposition
     * @see https://opensource.apple.com/source/libiconv/libiconv-50/libiconv/lib/utf8mac.h.auto.html
     */
    const NFD_MAC = 18; // 0x02 (NFD) | 0x10 = 0x12 (18)

    /**
     * Set the default normalization form to the given value.
     *
     * @param int|string $form
     *
     * @see \Sjorek\UnicodeNormalization\NormalizationUtility::parseForm()
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     */
    public function setForm($form);

    /**
     * Retrieve the current normalization-form constant.
     *
     * @return int
     */
    public function getForm();

    /**
     * Normalizes the input provided and returns the normalized string.
     *
     * @param string $input the input string to normalize
     * @param int    $form  (optional) One of the normalization forms
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return string the normalized string or FALSE if an error occurred
     *
     * @see http://php.net/manual/en/normalizer.normalize.php
     */
    public function normalize($input, $form = null);

    /**
     * Checks if the provided string is already in the specified normalization form.
     *
     * @param string $input the input string to check
     * @param int    $form  (optional) One of the normalization forms
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return bool TRUE if normalized, FALSE otherwise or if an error occurred
     *
     * @see http://php.net/manual/en/normalizer.isnormalized.php
     */
    public function isNormalized($input, $form = null);

    /**
     * Normalizes the $input provided to the given or default $form and returns the normalized string.
     *
     * Calls the underlying implementation even if the given $form is NONE, but normalizes only if needed.
     *
     * @param string $input the string to normalize
     * @param int    $form  (optional) normalization form to use, overriding the default
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return null|string Normalized string or null if an error occurred
     */
    public function normalizeTo($input, $form = null);

    /**
     * Normalizes the $input provided to the given or default $form and returns the normalized string.
     *
     * Does not call the underlying implementation if the given $form is NONE, and normalizes only if needed.
     *
     * @param string $input the string to normalize
     * @param int    $form  (optional) normalization form to use, overriding the default
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return null|string Normalized string or null if an error occurred
     */
    public function normalizeStringTo($input, $form = null);

    /**
     * Get the supported unicode version level as version triple ("X.Y.Z").
     *
     * @return string
     */
    public static function getUnicodeVersion();

    /**
     * Get the supported unicode normalization forms as array.
     *
     * @return int[]
     */
    public static function getNormalizationForms();
}
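The alias table in the constructor docblock amounts to a case-insensitive lookup from names to the integer form constants. The following is a minimal sketch of such a lookup, not the package's actual `NormalizationUtility::parseForm()` code. The function name `resolveFormAlias()`, the top-level `FORM_*` constants, the use of `InvalidArgumentException`, and the reading of "empty" as the empty string are all assumptions made for illustration.

```php
<?php

// Hypothetical constants mirroring the documented integer values.
const FORM_NONE    = 1;
const FORM_NFD     = 2;
const FORM_NFKD    = 3;
const FORM_NFC     = 4;
const FORM_NFKC    = 5;
const FORM_NFD_MAC = 18; // 0x02 (NFD) | 0x10

// Hypothetical helper: resolves a documented alias to its integer value.
function resolveFormAlias($form)
{
    $aliases = [
        'NONE' => FORM_NONE, 'BINARY' => FORM_NONE, 'DEFAULT' => FORM_NONE, 'VALIDATE' => FORM_NONE,
        'NFD' => FORM_NFD, 'FORM_D' => FORM_NFD, 'D' => FORM_NFD,
        'DECOMPOSE' => FORM_NFD, 'COLLATION' => FORM_NFD,
        'NFD_MAC' => FORM_NFD_MAC, 'FORM_D_MAC' => FORM_NFD_MAC,
        'D_MAC' => FORM_NFD_MAC, 'MAC' => FORM_NFD_MAC,
        'NFKD' => FORM_NFKD, 'FORM_KD' => FORM_NFKD, 'KD' => FORM_NFKD,
        'NFC' => FORM_NFC, 'FORM_C' => FORM_NFC, 'C' => FORM_NFC, 'COMPOSE' => FORM_NFC,
        'RECOMPOSE' => FORM_NFC, 'LEGACY' => FORM_NFC, 'HTML5' => FORM_NFC,
        'NFKC' => FORM_NFKC, 'FORM_KC' => FORM_NFKC, 'KC' => FORM_NFKC, 'MATCHING' => FORM_NFKC,
    ];

    // The table lists null, false and "empty" (assumed: '') as "disable".
    if ($form === null || $form === false || $form === '') {
        return 0;
    }
    if ($form === true) {
        return FORM_NONE;
    }
    // Integers and digit-only strings are accepted if they match a known value.
    if (is_int($form) || (is_string($form) && preg_match('/^\d+$/', $form))) {
        $value = (int) $form;
        if ($value === 0 || in_array($value, $aliases, true)) {
            return $value;
        }
        throw new InvalidArgumentException("Unknown normalization form: $form");
    }
    // Fold case and treat '-' like '_', so 'form-d' and 'FORM_D' are equal.
    $key = strtoupper(str_replace('-', '_', trim((string) $form)));
    if (isset($aliases[$key])) {
        return $aliases[$key];
    }
    throw new InvalidArgumentException("Unknown normalization form: $form");
}

var_dump(resolveFormAlias('form-d-mac')); // int(18)
var_dump(resolveFormAlias('html5'));      // int(4)
```

Folding hyphens into underscores before the lookup keeps the table small, since spellings such as `form-kd` and `FORM_KD` then share one entry.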
Stream filtering
<?php

/**
 * @var $stream     resource   The stream to filter.
 * @var $form       string     The form to normalize unicode to.
 * @var $read_write int        (optional) STREAM_FILTER_* constant to override the filter injection point
 * @var $params     string|int (optional) A normalization-form alias or value
 *
 * @link http://php.net/manual/en/function.stream-filter-append.php
 * @link http://php.net/manual/en/function.stream-filter-prepend.php
 */
stream_filter_append($stream, "convert.unicode-normalization.$form"[, $read_write[, $params]]);
Note: Be careful when using on streams in r+ or w+ (or similar) modes; by default PHP will assign the filter to both the reading and writing chain. This means it will attempt to convert the data twice - first when reading from the stream, and once again when writing to it.
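The effect of the two filter chains is easier to see with a filter whose output is visibly different. The sketch below uses PHP's built-in `string.rot13` filter as a stand-in for the normalization filter (an assumption for illustration only; it does not use this package), and the helper name `roundTrip()` is hypothetical. Passing `STREAM_FILTER_WRITE` or `STREAM_FILTER_READ` as the third argument restricts a filter to one chain.

```php
<?php

// Writes $text to an in-memory 'w+' stream with the built-in 'string.rot13'
// filter attached to the given chain(s), then reads the stream back.
function roundTrip($text, $chain)
{
    $stream = fopen('php://memory', 'w+');
    stream_filter_append($stream, 'string.rot13', $chain);
    fwrite($stream, $text);
    rewind($stream);
    $result = stream_get_contents($stream);
    fclose($stream);

    return $result;
}

// Filter on the write chain only: the data is converted once.
var_dump(roundTrip('Hello', STREAM_FILTER_WRITE)); // string(5) "Uryyb"

// Filter on both chains: the data is converted on write and again on read,
// which for ROT13 restores the original text.
var_dump(roundTrip('Hello', STREAM_FILTER_ALL));   // string(5) "Hello"
```

With the standard normalization forms a second pass does not change the result, because normalization is idempotent, but the work is still done twice, so restricting the filter to one chain is usually the better choice.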
Examples
Unicode Normalization
<?php

use Sjorek\UnicodeNormalization\Normalizer;

$string = 'äöü';

$normalizer = new Normalizer(Normalizer::NONE);
$nfc = new Normalizer();
$nfd = new Normalizer(Normalizer::NFD);
$nfkc = new Normalizer('matching');

var_dump(
    // yields false, as form NONE is never normalized
    $normalizer->isNormalized($string),

    // yields true, as NFC is the default for utf8 on the web
    $nfc->isNormalized($string),

    // yields false
    $nfd->isNormalized($string),

    // yields false
    $nfkc->isNormalized($string),

    // yields false
    $normalizer->isNormalized($string, Normalizer::NFKD),

    // yields true
    $normalizer->normalize($string) === $string,

    // yields true
    $nfc->normalize($string) === $string,

    // yields false
    $nfd->normalize($string) === $string,

    // yields true, as only combined characters (meaning two or more letters in one
    // character, like the single dž character) are decomposed (for faster matching)
    $nfkc->normalize($string) === $string,

    Normalizer::getUnicodeVersion(),

    Normalizer::getNormalizationForms()
);
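To see why the NFD comparison above yields false, it helps to look at the code points each form produces. This sketch calls PHP's global `Normalizer` and `IntlChar` classes from the `intl` extension directly, not this package's facade, and assumes `ext-intl` is loaded. The helper name `codePoints()` is hypothetical.

```php
<?php

// Returns the code points of a UTF-8 string as space-separated "U+XXXX" tokens.
function codePoints($utf8)
{
    $points = [];
    foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $points[] = sprintf('U+%04X', IntlChar::ord($char));
    }

    return implode(' ', $points);
}

$composed = "\u{00E4}"; // "ä" as a single precomposed code point
$digraph  = "\u{01C6}"; // the single "dž" character

// NFC keeps the precomposed letter, NFD splits it into "a" + combining diaeresis.
echo codePoints(Normalizer::normalize($composed, Normalizer::FORM_C)), "\n";  // U+00E4
echo codePoints(Normalizer::normalize($composed, Normalizer::FORM_D)), "\n";  // U+0061 U+0308

// The compatibility forms additionally split the digraph into separate letters.
echo codePoints(Normalizer::normalize($digraph, Normalizer::FORM_KC)), "\n";  // U+0064 U+017E
echo codePoints(Normalizer::normalize($digraph, Normalizer::FORM_KD)), "\n";  // U+0064 U+007A U+030C

// A precomposed string is therefore not in form D.
var_dump(Normalizer::isNormalized($composed, Normalizer::FORM_D)); // bool(false)
```

The outputs match the glossary quotes in the class documentation: "ä" becomes "a + umlaut" under NFD, and "dž" becomes "d + ž" under NFKC and "d + z + caron" under NFKD.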
Stream filtering
<?php

$in_file = fopen('utf8-file.txt', 'r');
$out_file = fopen('utf8-normalized-to-nfc-file.txt', 'w');

// It works as a read filter:
stream_filter_append($in_file, 'convert.unicode-normalization.NFC');

// Normalization form may be given as fourth parameter:
// stream_filter_append($in_file, 'convert.unicode-normalization', null, 'NFC');

// And it also works as a write filter:
// stream_filter_append($out_file, 'convert.unicode-normalization.NFC');

stream_copy_to_stream($in_file, $out_file);
Contributing
Look at the contribution guidelines