rowbot / idna
An implementation of UTS#46 Unicode IDNA Compatibility Processing.
Installs: 180 209
Dependents: 1
Suggesters: 0
Security: 0
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Requires
- php: >=7.1
- rowbot/punycode: ^1.0
- symfony/polyfill-intl-normalizer: ^1.18
Requires (Dev)
- guzzlehttp/guzzle: ^6.5 || ^7.0
- phpstan/phpstan: ^1.2
- phpstan/phpstan-deprecation-rules: ^1.0
- phpstan/phpstan-strict-rules: ^1.0
- phpunit/phpunit: ^7.0 || ^8.0 || ^9.0
- squizlabs/php_codesniffer: ^3.5.1
- symfony/cache: ^4.3 || ^5.0
README
A fully compliant implementation of UTS#46, otherwise known as Unicode IDNA Compatibility Processing. You can read more about the differences between IDNA2003, IDNA2008, and UTS#46 in Section 7. IDNA Comparison of the specification.
This library currently ships with Unicode 15.0.0 support and implements Version 15.0.0, Revision 29 of IDNA Compatibility Processing. It has the ability to use Unicode 11.0.0 to Unicode 15.0.0. While this library likely supports versions of Unicode less than 11.0.0, the format of the Unicode test files were changed beginning in 11.0.0 and as a result, versions of Unicode less than 11.0.0 have not been tested.
Requirements
- PHP 7.1+
rowbot/punycode
symfony/polyfill-intl-normalizer
Installation
composer require rowbot/idna
API
Idna::UNICODE_VERSION
The Unicode version of the data files used, as a string.
Idna::toAscii(string $domain, array $options = []): IdnaResult
Converts a domain name to its ASCII form. Anytime an error is recorded while doing an ASCII transformation, the transformation is considered to have failed and whatever domain name string is returned is considered "garbage". What you do with that result is entirely up to you.
toAscii Parameters
-
$domain - A domain name to convert to ASCII.
-
$options - An array of options for customizing the behavior of the transformation. Possible options include:
"CheckBidi"
- Checks the domain name string for errors with bi-directional characters. Defaults to true."CheckHyphens"
- Checks the domain name string for the positioning of hypens. Defaults to true."CheckJoiners"
- Checks the domain name string for errors with joining code points. Defaults to true."UseSTD3ASCIIRules"
- Disallows the use of ASCII characters other than[a-zA-Z0-9-]
. Defaults to true."Transitional_Processing"
- Whether transitional or non-transitional processing is used. When enabled, processing behaves more like IDNA2003 and when disabled behaves like IDNA2008. Defaults to false, which means that non-transitional processing is used by default."VerifyDnsLength"
- Validates the length of the domain name string and it's individual labels. Defaults to true.
Note: All options are case-sensitive.
use Rowbot\Idna\Idna; $result = Idna::toAscii('x-.xn--nxa'); // You must not use an ASCII domain that has errors. if ($result->hasErrors()) { throw new \Exception(); } echo $result->getDomain(); // x-.xn--nxa
Idna::toUnicode(string $domain, array $options = []): IdnaResult
Converts the domain name to its Unicode form. Unlike the toAscii transformation, toUnicode does not have a failure concept. This means that you can always use the returned string. However, deciding what to do with the returned domain name string when an error is recorded is entirely up to you.
-
$domain - A domain name to convert to UTF-8.
-
$options - An array of options for customizing the behavior of the transformation. Possible options include:
"CheckBidi"
- Checks the domain name string for errors with bi-directional characters. Defaults to true."CheckHyphens"
- Checks the domain name string for the positioning of hypens. Defaults to true."CheckJoiners"
- Checks the domain name string for errors with joining code points. Defaults to true."UseSTD3ASCIIRules"
- Disallows the use of ASCII characters other than[a-zA-Z0-9-]
. Defaults to true."Transitional_Processing"
- Whether transitional or non-transitional processing is used. When enabled, processing behaves more like IDNA2003 and when disabled behaves like IDNA2008. Defaults to false, which means that non-transitional processing is used by default.
Note: All options are case-sensitive.
Note:
"VerifyDnsLength"
is not a valid option here.use Rowbot\Idna\Idna; $result = Idna::toUnicode('xn---with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n.com'); echo $result->getDomain(); // 安室奈美恵-with-super-monkeys.com
IdnaResult object
Members
IdnaResult::getDomain(): string
Returns the transformed domain name string.
IdnaResult::getErrors(): int
Returns a bitmask representing all errors that were recorded while processing the input domain name string.
IdnaResult::hasError(int $error): bool
Returns whether or not a specific error was recorded.
IdnaResult::hasErrors(): bool
Returns whether or not an error was recorded while processing the input domain name string.
IdnaResult::isTransitionalDifferent(): bool
Returns true
if the input domain name contains a code point that has a status of "deviation"
.
This status indicates that the code points are handled differently in IDNA2003 than they are in
IDNA2008. At the time of writing, there are only 4 code points that have this status. They are
U+00DF, U+03C2, U+200C, and U+200D.
Error Codes
-
Idna::ERROR_EMPTY_LABEL
The domain name or one of it's labels are an empty string.
-
Idna::ERROR_LABEL_TOO_LONG
One of the domain's labels exceeds 63 bytes.
-
Idna::ERROR_DOMAIN_NAME_TOO_LONG
The length of the domain name exceeds 253 bytes.
-
Idna::ERROR_LEADING_HYPHEN
One of the domain name's labels starts with a hyphen-minus character (-).
-
Idna::ERROR_TRAILING_HYPHEN
One of the domain name's labels ends with a hyphen-minus character (-).
-
Idna::ERROR_HYPHEN_3_4
One of the domain name's labels contains a hyphen-minus character in the 3rd and 4th position.
-
Idna::ERROR_LEADING_COMBINING_MARK
One of the domain name's labels starts with a combining mark.
-
Idna::ERROR_DISALLOWED
The domain name contains characters that are disallowed.
-
Idna::ERROR_PUNYCODE
One of the domain name's labels starts with "xn--", but is not valid punycode.
-
Idna::ERROR_LABEL_HAS_DOT
One of the domain name's labels contains a full stop character (.).
-
Idna::ERROR_INVALID_ACE_LABEL
One of the domain name's labels is an invalid ACE label.
-
Idna::ERROR_BIDI
The domain name does not meet the BiDi requirements for IDNA.
-
Idna::ERROR_CONTEXTJ
One of the domain name's labels does not meet the CONTEXTJ requirements for IDNA.
The WTFs of Unicode Support in PHP
In any given version of PHP, there can be a multitude of different versions of Unicode in use. So... WTF?
-
What does this mean?
This means that if I ask the same question, each of the extensions listed below can give me a different answer. This is compounded by the fact that the versions of Unicode used in the below extensions can also be different given the same version of PHP. For example, the
intl
extension being used by my installation of PHP 7.2 could be using Unicode version 11, but theintl
extension in your web hosts installation of PHP 7.2 could be using Unicode version 6. -
How does this happen?
- The
mbstring
extension uses its own version of Unicode. - The
Onigurama
library, which is behindmbstring
's regular expression functions, uses its own version of Unicode. - The
PCRE
extension, which is the primary extension for working with regular extensions in PHP, uses its own version of Unicode. - The
intl
extension uses its own version of Unicode. - Any other extensions that add their own versions of Unicode.
- Userland libraries use their own version of Unicode (including this library).
- The
-
This library
Being able to use
mbstring
orintl
extensions would be helpful, but we cannot depend on them being installed or them having a consistent version of Unicode when they are installed. Additionally, extensions likePCRE
could be compiled without Unicode support entirely, though we do rely onPCRE
'su
modifier. For this reason we have to include our own Unicode data.
FAQs
-
I'm confused! Is this IDNA2003 or IDNA2008?
The answer to this is somewhat convoluted. TL;DR; It is neither.
Here is what the spec says:
To satisfy user expectations for mapping, and provide maximal compatibility with IDNA2003, this document specifies a mapping for use with IDNA2008. In addition, to transition more smoothly to IDNA2008, this document provides a Unicode algorithm for a standardized processing that allows conformant implementations to minimize the security and interoperability problems caused by the differences between IDNA2003 and IDNA2008. This Unicode IDNA Compatibility Processing is structured according to IDNA2003 principles, but extends those principles to Unicode 5.2 and later. It also incorporates the repertoire extensions provided by IDNA2008.
More information can be found in Section 2. Unicode IDNA Compatibility Processing and Section 7. IDNA Comparison.
-
What are the recommended options?
The default options are the recommended options, which are also the strictest.
// Default options. [ 'CheckHyphens' => true, 'CheckBidi' => true, 'CheckJoiners' => true, 'UseSTD3ASCIIRules' => true, 'Transitional_Processing' => false, 'VerifyDnsLength' => true, // Only for Idna::toAscii() ];
-
Do I have to provide all the options?
No. You only need to specifiy the options that you wish to change. Any option you specify will overwrite the default options.
use Rowbot\Idna\Idna; $result = Idna::toAscii('x-.xn--nxa', ['CheckHyphens' => true]); $result->hasErrors(); // true $result->hasError(Idna::ERROR_TRAILING_HYPHEN); // true $result = Idna::toAscii('x-.xn--nxa', ['CheckHyphens' => false]); $result->hasErrors(); // false $result->hasError(Idna::ERROR_TRAILING_HYPHEN); // false
-
What is the difference between
Transitional
andNon-transitional
processing?Transitional
processing is designed to mimic IDNA2003. It is highly recommended to useNon-transitional
processing, which tries to mimic IDNA2008. You can always check if a domain name would be different between the two processing modes by checkingIdnaResult::isTransitionalDifferent()
. -
Wouldn't it be neat if you also tested against the
idn_to_ascii()
andidn_to_utf8()
functions from theintl
extension?Yes. Yes, it would be neat if we could do an additional check for parity with the ICU implementation, however, for the reasons outlined above in The WTFs of Unicode Support in PHP, testing against these functions would be unreliable at best.
-
Why does the
intl
extension show weird characters that look like diamonds with question marks inside in invalid domains, but your implementation doesn't?$input = '憡?Ⴔ.XN--1UG73GL146A'; idn_to_utf8($input, 0, IDNA_INTL_VARIANT_UTS46, $info); echo $info['result']; // 憡��.xn--1ug73gl146a� echo ($info['errors'] & IDNA_ERROR_DISALLOWED) !== 0; // true $result = \Rowbot\Idna\Idna::toUnicode($input); echo $result->getDomain(); // 憡?Ⴔ.xn--1ug73gl146a echo $result->hasError(\Rowbot\Idna\Idna::ERROR_DISALLOWED); // true
From Section 4. Processing:
Implementations may make further modifications to the resulting Unicode string when showing it to the user. For example, it is recommended that disallowed characters be replaced by a U+FFFD to make them visible to the user. Similarly, labels that fail processing during steps 4 or 5 may be marked by the insertion of a U+FFFD or other visual device.
This implementation currently does not make these recommended modifications.
Internals
Building
Unicode data files are fetched from https://www.unicode.org/Public. Currently, Unicode version
11.0.0-15.0.0 are supported. To change the version of Unicode that the library is built with, you
must first change the value of the \Rowbot\Idna::UNICODE_VERSION
constant, like so:
class Idna { - public const UNICODE_VERSION = '13.0.0'; + public const UNICODE_VERSION = '14.0.0';
Then to generate the necessary data files, you execute the following command:
php bin/generateDataFiles.php
If no assertions or exceptions have occured, then you have successfully changed the Unicode version. You should now execute the tests to make sure everything is good to go. The tests will automatically fetch the version appropriate tests as the test files are not generated by the above command.
vendor/bin/phpunit