lhcze/bcp47-tag

BCP47Tag parser and validator

2.0.0 2025-07-13 22:15 UTC


🪐 Don’t panic. Your tag is valid.

Validate, Normalize & Canonicalize BCP 47 Language Tags (en, en-US, zh-Hant-CN, etc.)


BCP47Tag is a robust PHP library for working with BCP 47 language tags:

  • ✔️ Validates against the real IANA Language Subtag Registry
  • ✔️ ABNF-compliant (RFC 5646)
  • ✔️ Supports language, script, region, variant, and grandfathered tags
  • ✔️ Auto-normalizes casing & separators (en_us → en-US)
  • ✔️ Automatically expands collapsed ranges from the registry
  • ✔️ Resolves partial language tags (e.g., en → en-US) using custom canonical matching, with scoring
  • ✔️ Error handling via clear exception types
  • ✔️ Lightweight LanguageTag VO for validated tags
  • ✔️ Works perfectly with ext-intl: no surprises when feeding ICU
  • ✔️ Easy fallback mechanism
  • 🫧 Supports grandfathered tags so old, they still remember when Unicode 2.0 was hot
  • 🖖 Accepts i-klingon and i-enochian for your occult projects
  • 🤓 ABNF so clean, linguists shed a single tear
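
The casing and separator normalization mentioned in the list follows the convention recommended in RFC 5646, section 2.1.1. As a rough illustration only (this is not BCP47Tag's internal code, and `normalizeCasing` is a hypothetical helper name), it could be sketched like this:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: applies the RFC 5646 casing convention to a tag.
 * It only fixes casing and separators; it does not validate anything.
 */
function normalizeCasing(string $input): string
{
    // Treat "_" as a separator so ICU-style input such as "en_us" is accepted.
    $subtags = explode('-', str_replace('_', '-', trim($input)));
    $afterSingleton = false;

    foreach ($subtags as $i => $subtag) {
        $subtag = strtolower($subtag);
        $length = strlen($subtag);

        // Casing rules apply only before the first singleton (e.g. "x").
        if ($i > 0 && !$afterSingleton && ctype_alpha($subtag)) {
            if ($length === 2) {
                $subtag = strtoupper($subtag); // region, e.g. "US"
            } elseif ($length === 4) {
                $subtag = ucfirst($subtag);    // script, e.g. "Hant"
            }
        }
        if ($length === 1) {
            $afterSingleton = true;
        }
        $subtags[$i] = $subtag;
    }

    return implode('-', $subtags);
}

echo normalizeCasing('en_us'), PHP_EOL;      // en-US
echo normalizeCasing('ZH-hant-cn'), PHP_EOL; // zh-Hant-CN
echo normalizeCasing('en-ca-x-ca'), PHP_EOL; // en-CA-x-ca
```

Swapping the hyphens back to underscores (`str_replace('-', '_', ...)`) gives the ICU-style form, e.g. en_US.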

❓ Why not just use ext-intl?

Good question, and the answer is: you should keep using it! ext-intl (ICU) is brilliant at formatting if your tag is clean.

However, it does not:

  • ✅ Validate that your tag fully follows the BCP 47 ABNF rules.
  • ✅ Reject or warn about grandfathered or deprecated subtags.
  • ✅ Match your tags against the authoritative IANA Language Subtag Registry.
  • ✅ Resolve partial input (en → en-US) to a known canonical list.
  • ✅ Enforce known tags only with knownTags + requireCanonical.

If you’re in Symfony, you might also use #[Assert\Locale] for basic input validation.
That’s fine for checking user input, but it stops at structure. It won’t canonicalize, resolve, or check the IANA registry.

👉 So the best practice:

  • ✅ Use BCP47Tag to validate & normalize.
  • ✅ Hand the cleaned tag to ext-intl or whatever else you have for formatting & display.
  • ✅ Trust you’ll never feed ICU any garbage.
  • ✅ Carry an immutable LanguageTag value object around your code base instead of a string.

BCP47Tag: RFC 5646 + IANA + real normalization + fallback + resolution.
No hassle with regex, str_replace(), or guesswork.

⚡️ Installation

composer require lhcze/bcp47-tag

🚀 Basic Usage

use LHcze\BCP47\BCP47Tag;

// Just normalize & validate
$tag = new BCP47Tag('en_us');
echo $tag->getNormalized();    // "en-US"
echo $tag->getICUformat();   // "en_US"

// With canonical matching
$tag = new BCP47Tag('en', useCanonicalMatchTags: ['de-DE', 'en-US']);
echo $tag->getNormalized();    // "en-US"

// Use fallback if invalid
$tag = new BCP47Tag('notreal', 'fr-FR');
echo $tag->getNormalized(); // fr-FR

// Invalid input → exception
// (assumes BCP47InvalidLocaleException is imported from the package with a `use` statement)
try {
    new BCP47Tag('invalid!!');
} catch (BCP47InvalidLocaleException $e) {
    echo $e->getMessage();
}

// Feed to ext-intl
$tag = new BCP47Tag('en_us');
$icu = $tag->getICUformat(); // "en_US"
echo Locale::getDisplayLanguage($icu, 'en'); // "English"

// LanguageTag VO
$langTag = $tag->getLanguageTag();
echo $langTag->getLanguage();  // "en"
echo $langTag->getRegion();    // "US"
echo (string) $langTag;        // "en-US"

🔍 Features & Flow

  1. Normalize + parse
    Clean up casing and separators, then parse the tag into its components.

  2. Validate against IANA
    Invalid input or an invalid fallback throws an explicit exception:

    • BCP47InvalidLocaleException
    • BCP47InvalidFallbackLocaleException

  3. Canonical matching (optional)

    • Pass an array of useCanonicalMatchTags
    • Each is matched and scored:
      +100 language match, +10 region, +1 script
    • Highest score wins.
    • On a tie, the first tag with that score wins.

  4. LanguageTag VO
    Immutable, validated, Stringable & JsonSerializable.
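
To make the scoring in step 3 concrete, here is a minimal, self-contained sketch. It is not the library's implementation: `splitTag`, `scoreCandidate`, and `bestMatch` are hypothetical names, the parser only understands language[-script][-region], and skipping candidates whose language differs is an assumption of this sketch.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: splits a simple tag into language, script, and region. */
function splitTag(string $tag): array
{
    $parts = ['language' => null, 'script' => null, 'region' => null];
    foreach (explode('-', str_replace('_', '-', $tag)) as $i => $subtag) {
        if ($i === 0) {
            $parts['language'] = strtolower($subtag);
        } elseif ($parts['script'] === null && $parts['region'] === null
            && preg_match('/^[A-Za-z]{4}$/', $subtag) === 1) {
            $parts['script'] = ucfirst(strtolower($subtag));
        } elseif ($parts['region'] === null
            && preg_match('/^(?:[A-Za-z]{2}|[0-9]{3})$/', $subtag) === 1) {
            $parts['region'] = strtoupper($subtag);
        }
    }
    return $parts;
}

/** +100 for language, +10 for region, +1 for script, as described above. */
function scoreCandidate(array $input, array $candidate): int
{
    if ($input['language'] !== $candidate['language']) {
        return 0; // assumption: no language match means no match at all
    }
    $score = 100;
    if ($input['region'] !== null && $input['region'] === $candidate['region']) {
        $score += 10;
    }
    if ($input['script'] !== null && $input['script'] === $candidate['script']) {
        $score += 1;
    }
    return $score;
}

/** Returns the best-scoring candidate, or null when nothing matches. */
function bestMatch(string $input, array $candidates): ?string
{
    $parsed = splitTag($input);
    $best = null;
    $bestScore = 0;
    foreach ($candidates as $candidate) {
        $score = scoreCandidate($parsed, splitTag($candidate));
        if ($score > $bestScore) { // strict ">" keeps the first candidate on a tie
            $best = $candidate;
            $bestScore = $score;
        }
    }
    return $best;
}

echo bestMatch('en', ['de-DE', 'en-US', 'en-GB']), PHP_EOL; // en-US (tie, first wins)
echo bestMatch('en-GB', ['en-US', 'en-GB']), PHP_EOL;       // en-GB (100 + 10)
```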

📜 Supported Tags

BCP47Tag uses a precompiled static PHP snapshot of the latest IANA Language Subtag Registry to validate languages, scripts, regions, variants, and grandfathered tags. The registry is loaded once per process and kept hot in OPcache for maximum speed.

  • ✅ ISO language, script, region, variants
  • ✅ Grandfathered/deprecated tags (e.g., i-klingon)
  • ✅ Collapsed registry ranges are auto-expanded
  • ⚠️ Extensions & private-use subtags (planned)

🧩 Key API

| Method | Description |
| --- | --- |
| __construct(string $input, ?string $fallback, ?array $useCanonicalMatchTags) | Main entry |
| getInputLocale() | Original input string |
| getNormalized() | RFC 5646 formatted tag |
| getICUformat() | Underscore variant (xx_XX) |
| getLanguageTag() | Returns LanguageTag VO |
| __toString() / jsonSerialize() | Returns normalized string |

📜 The Official BCP 47 ABNF

The syntax tags must follow is defined by RFC 5646 in ABNF:

langtag = language
   ["-" script]
   ["-" region]
   *("-" variant)
   *("-" extension)
   ["-" privateuse]

Examples:

  • ✅ en → valid
  • ✅ en-US → valid
  • ✅ zh-Hant-CN → valid
  • ✅ i-klingon → valid (grandfathered)
  • ✅ en-US-x-private → valid (private use)
  • ❌ en-US--US → invalid

BCP47Tag respects this ABNF, so your tags match the real spec, with no hidden assumptions.
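
For readers who want to see the grammar in action, the langtag production can be approximated with a regular expression. This is a sketch of a well-formedness check only, written from the RFC 5646 ABNF rather than from the library's source. `isWellFormed` and the constants are hypothetical names, and the grandfathered list is deliberately partial. A well-formed tag is not necessarily valid: validity also requires every subtag to exist in the IANA registry.

```php
<?php
declare(strict_types=1);

// langtag production from RFC 5646, case-insensitive, extended syntax.
const LANGTAG_PATTERN = '/^
    (?<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})
    (?:-(?<script>[a-z]{4}))?
    (?:-(?<region>[a-z]{2}|[0-9]{3}))?
    (?<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-(?<privateuse>x(?:-[a-z0-9]{1,8})+))?
\z/xi';

// Tags that consist only of a private-use sequence, e.g. "x-whatever".
const PRIVATEUSE_PATTERN = '/^x(?:-[a-z0-9]{1,8})+\z/i';

// Partial list for illustration; RFC 5646 defines the complete set.
const GRANDFATHERED_SAMPLE = ['i-klingon', 'i-enochian', 'i-default'];

/** Hypothetical helper: checks syntax only, not registry membership. */
function isWellFormed(string $tag): bool
{
    if (in_array(strtolower($tag), GRANDFATHERED_SAMPLE, true)) {
        return true;
    }
    return preg_match(LANGTAG_PATTERN, $tag) === 1
        || preg_match(PRIVATEUSE_PATTERN, $tag) === 1;
}

foreach (['en', 'en-US', 'zh-Hant-CN', 'i-klingon', 'en-US-x-private', 'en-US--US'] as $tag) {
    echo str_pad($tag, 16), isWellFormed($tag) ? 'well-formed' : 'malformed', PHP_EOL;
}
```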

❓ Why is this useful?

Use cases include:

  • Validating API Accept-Language headers
  • Multi-regional CMS deployments
  • Internationalization pipelines
  • Locale-dependent services where mis-typed tags lead to silent failures
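
As an illustration of the first use case, an Accept-Language header has to be split into candidate tags before each one can be validated. The sketch below uses a hypothetical `parseAcceptLanguage` helper that is not part of BCP47Tag. It orders the tags by descending q-value and drops the wildcard and q=0 entries, so each remaining tag could then be passed to the BCP47Tag constructor.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the tags of an Accept-Language header,
 * ordered by descending q-value (header order breaks ties).
 */
function parseAcceptLanguage(string $header): array
{
    $weighted = [];
    foreach (explode(',', $header) as $position => $item) {
        $pieces = array_map('trim', explode(';', $item));
        $tag = $pieces[0];
        if ($tag === '' || $tag === '*') {
            continue; // the wildcard is not a language tag
        }

        $q = 1.0; // default weight when no q parameter is given
        foreach (array_slice($pieces, 1) as $param) {
            if (preg_match('/^q\s*=\s*([01](?:\.[0-9]{0,3})?)$/i', $param, $m) === 1) {
                $q = (float) $m[1];
            }
        }
        if ($q <= 0.0) {
            continue; // q=0 means "not acceptable"
        }
        $weighted[] = ['tag' => $tag, 'q' => $q, 'position' => $position];
    }

    usort(
        $weighted,
        static fn (array $a, array $b): int =>
            [$b['q'], $a['position']] <=> [$a['q'], $b['position']]
    );

    return array_column($weighted, 'tag');
}

print_r(parseAcceptLanguage('fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5'));
```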

⚙️ Requirements

  • PHP 8.3+
  • ext-intl

🧪 Tests

composer qa

📌 Roadmap

  • ✅ IANA Language Subtag Registry integration
  • ✅ Language, script, region, variant validation
  • ✅ Lazy singleton registry loader
  • ✅ Static PHP snapshot of the IANA registry for ultra-fast lookups
  • ✅ Canonical matching with scoring
  • ✅ Typed exceptions for flow control
  • ⚙️ Extension/subtag support (planned)
  • ⚙️ Use of additional IANA registry data (Suppress-Script, Preferred-Value, Prefix)
  • ⚙️ Auto-registry refresh script

📖 License

MIT

🧬 Now go and boldly canonicalize strange new tags the BCP 47 way! 🌍✨