iliaal / phonetic
Native phonetic matching for PHP: Double Metaphone, Beider-Morse Phonetic Matching, Daitch-Mokotoff Soundex, NYSIIS, and Match Rating Approach.
Package info
Language: C
Type: php-ext
Ext name: ext-phonetic
pkg:composer/iliaal/phonetic
Requires
- php: >=8.1
Last update: 2026-07-01 16:09:39 UTC
Native phonetic matching for PHP: Double Metaphone, Beider-Morse Phonetic Matching (BMPM), Daitch-Mokotoff Soundex, NYSIIS, and Match Rating Approach, the phonetic name-matching encoders that PHP core does not ship. It also ships comparison helpers that answer "do these two names sound alike?" directly.
PHP core ships soundex() and metaphone() but none of these encoders, which are the standard tools for fuzzy name matching, record linkage, and genealogy search across spelling and transliteration variants.
Quick Start
Install via PIE (requires PHP 8.1 or later):
```bash
pie install iliaal/phonetic
```
Then ask whether two names sound alike, with no userland matching logic required:
```php
double_metaphone_match("Catherine", "Kathryn"); // 2 (strong match)
dm_soundex_match("Moskowitz", "Moskovitz");     // true
bmpm_match("Peterson", "Petersen");             // true
```
Choosing an algorithm
| | Double Metaphone | BMPM | Daitch-Mokotoff Soundex | NYSIIS | Match Rating |
|---|---|---|---|---|---|
| Output | primary + alternate key | language-aware token set | distinct 6-digit codes | single key | compact codex |
| Two names match when | keys are equal | token sets intersect | code sets intersect | keys are equal | clear the MRA similarity threshold |
| Strongest for | English and general Latin-script names | cross-language and transliteration variants (Slavic, Germanic, Hebrew, Romance) | Eastern-European and Ashkenazi surnames, genealogy | American/English surnames | English names; ships its own similarity test |
| Spelling-variant recall | good | highest | high, within its language model | good | good |
| Ambiguity handling | up to 2 keys | many tokens | multiple codes | single key | single codex |
| Relative speed | fast (1.0x) | slowest (~60x) | middle (~2.3x) | fast (0.42x) | fastest (0.24x) |
| Data source | clean-room published algorithm | Apache Commons Codec rule data | Apache Commons Codec rule data | clean-room published algorithm | clean-room published algorithm |
Rule of thumb: reach for Double Metaphone as a fast general-purpose default, BMPM when names cross languages or scripts, and Daitch-Mokotoff for Eastern-European and Jewish genealogy where it is the field standard. NYSIIS and Match Rating Approach are lighter, single-key English/American encoders, useful as alternate index keys or a second opinion alongside Double Metaphone.
API
Double Metaphone
Primary + alternate phonetic keys (Lawrence Philips). Clean-room implementation.
```php
double_metaphone(string $string, int $max_length = 4): array

double_metaphone("Schwarzenegger"); // ['primary' => 'XRSN', 'alternate' => 'XFRT']
double_metaphone("Smith");          // ['primary' => 'SM0', 'alternate' => 'XMT']
double_metaphone("Catherine", 3);   // ['primary' => 'K0R', 'alternate' => 'KTR']
```
alternate equals primary when the algorithm produced no alternate branch. max_length caps each key (default 4; 0 or negative = unlimited).
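Because alternate repeats primary when there is no alternate branch, an index built from these results should store each distinct key only once. A minimal sketch of that step, assuming a hypothetical helper named metaphone_index_keys() (not part of the extension) that takes an array shaped like the double_metaphone() return value:

```php
<?php
// Hypothetical helper, not provided by the extension: reduce a
// double_metaphone()-shaped result to the distinct keys worth indexing.
function metaphone_index_keys(array $keys): array
{
    // 'alternate' equals 'primary' when no alternate branch was produced,
    // so array_unique() collapses that case to a single key.
    return array_values(array_unique([$keys['primary'], $keys['alternate']]));
}

// The first array is copied from the "Smith" example above; the second is an
// illustrative no-alternate shape. With the extension loaded, both would come
// from double_metaphone($name) instead of being written by hand.
$two_keys = metaphone_index_keys(['primary' => 'SM0', 'alternate' => 'XMT']); // ['SM0', 'XMT']
$one_key  = metaphone_index_keys(['primary' => 'K0R', 'alternate' => 'K0R']); // ['K0R']
```

Storing one row per distinct key keeps the index free of duplicate entries for names that have no alternate encoding.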
Beider-Morse Phonetic Matching
Language-aware token set, joined by | (alternatives) and - (words). Matches Apache Commons Codec's default BeiderMorseEncoder.
```php
bmpm(string $string, int $name_type = BMPM_GENERIC, int $accuracy = BMPM_APPROX, string $language = ""): string

bmpm("Jackson");                            // "iakson|iaksun|...|zokson"
bmpm("Garcia", BMPM_SEPHARDIC, BMPM_EXACT); // "garsia|gartSa"
```
Empty $language auto-detects; pass a language name (e.g. "russian") to force it. Constants: BMPM_GENERIC, BMPM_ASHKENAZI, BMPM_SEPHARDIC, BMPM_APPROX, BMPM_EXACT.
Daitch-Mokotoff Soundex
List of distinct 6-digit codes (the algorithm branches on ambiguous letters). Matches Apache Commons Codec's DaitchMokotoffSoundex in branching mode.
```php
dm_soundex(string $string): array

dm_soundex("Auerbach"); // ['097400', '097500']
dm_soundex("Peters");   // ['734000', '739400']
```
NYSIIS
Single phonetic key (New York State Identification and Intelligence System), tuned for American/English surnames. Reimplementation of the published algorithm; matches Apache Commons Codec's Nysiis.
```php
nysiis(string $string, int $max_length = 6): string

nysiis("Larson");       // "LARSAN"
nysiis("Larsen");       // "LARSAN" (same key)
nysiis("Macdonald", 0); // "MCDANALD" (full, untruncated)
```
The classic algorithm truncates to 6 characters; max_length = 0 (or negative) returns the full key.
Match Rating Approach
Compact codex (Western Airlines, 1977). Pair it with its own similarity test instead of comparing codexes for equality.
```php
match_rating(string $string): string

match_rating("Smith");     // "SMTH"
match_rating("Catherine"); // "CTHRN"
```
Use match_rating_compare() (below) for the actual homophone decision. It applies the algorithm's length-and-rating rules that plain codex equality skips.
Comparison helpers
Each encoder produces a different output shape, so "do these sound alike?" needs the right comparison per algorithm. These helpers encapsulate that, so you don't reimplement the set-intersection or match-strength logic in userland.
```php
// Double Metaphone: 2 = primary keys agree, 1 = an alternate crosses, 0 = no match
double_metaphone_match(string $a, string $b, int $max_length = 4): int
double_metaphone_match("Catherine", "Kathryn"); // 2
double_metaphone_match("Vagner", "Wagner");     // 1

// BMPM: true when the phoneme token sets intersect (same args as bmpm())
bmpm_match(string $a, string $b, int $name_type = BMPM_GENERIC, int $accuracy = BMPM_APPROX, string $language = ""): bool
bmpm_match("Moskowitz", "Moskovitz"); // true

// Daitch-Mokotoff: true when the code sets intersect
dm_soundex_match(string $a, string $b): bool
dm_soundex_match("Moskowitz", "Moskovitz"); // true

// NYSIIS: true when the single keys are equal
nysiis_match(string $a, string $b, int $max_length = 6): bool
nysiis_match("Smith", "Schmit"); // true (both SNAT)

// Match Rating Approach: true when the two names clear the MRA similarity threshold
match_rating_compare(string $a, string $b): bool
match_rating_compare("Catherine", "Kathryn"); // true
```
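The helpers perform the set comparison natively, so userland code is only needed when the code sets are already at hand, for example loaded from storage. As a rough illustration of what "the code sets intersect" means, here is a sketch using a hypothetical function code_sets_intersect() (an assumption for this example, not an extension function):

```php
<?php
// Hypothetical userland equivalent of the set-intersection test, for code
// sets that were encoded earlier and stored. Not part of the extension.
function code_sets_intersect(array $a, array $b): bool
{
    // array_intersect() keeps the values of $a that also occur in $b;
    // a non-empty result means the two names share at least one code.
    return array_intersect($a, $b) !== [];
}

// Code sets copied from the dm_soundex() examples earlier in this README.
$auerbach = ['097400', '097500'];
$peters   = ['734000', '739400'];

$unrelated = code_sets_intersect($auerbach, $peters);     // false: no shared code
$related   = code_sets_intersect($auerbach, ['097500']);  // true: '097500' is shared
```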
Usage
For a one-off "do these sound alike?" check, use the comparison helpers directly. Each applies the correct per-algorithm logic:
```php
double_metaphone_match("Catherine", "Kathryn"); // 2 (strong)
dm_soundex_match("Moskowitz", "Moskovitz");     // true
bmpm_match("Peterson", "Petersen");             // true
match_rating_compare("Catherine", "Kathryn");   // true
```
For indexed lookup, encode once and store the key(s) with each record, then query by encoded value instead of re-encoding at search time. Double Metaphone gives one or two keys per name; Daitch-Mokotoff and BMPM give a set, so index every code. BMPM's token string separates alternatives with | and words with -:
```php
// Build a phonetic index, then look up by shared code
$index = [];
foreach ($records as $id => $name) {
    foreach (dm_soundex($name) as $code) { // index every code in the set
        $index[$code][] = $id;
    }
}
$hits = $index[dm_soundex("Moskovitz")[0]] ?? [];

// Splitting a BMPM token string into its individual codes
$codes = preg_split('/[|-]/', bmpm("Peterson"));
```
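The lookup above reads only the first code of the query name. Since a match is defined as the code sets intersecting, a query can also take the union of hits over every one of its codes. The following is a sketch under assumptions: build_phonetic_index() and lookup_phonetic() are hypothetical names, and $encode stands for any callable that returns a list of codes (with the extension loaded, 'dm_soundex' has that shape).

```php
<?php
// Hypothetical helpers, not part of the extension. $encode is any callable
// that maps a name to a list of codes, e.g. 'dm_soundex' when it is loaded.
function build_phonetic_index(array $records, callable $encode): array
{
    $index = [];
    foreach ($records as $id => $name) {
        foreach ($encode($name) as $code) {
            $index[$code][$id] = true; // keyed by record id, so repeats collapse
        }
    }
    return $index;
}

function lookup_phonetic(array $index, string $query, callable $encode): array
{
    $hits = [];
    foreach ($encode($query) as $code) {
        $hits += $index[$code] ?? []; // array union over every query code
    }
    return array_keys($hits); // distinct record ids
}

// Intended use with the extension installed (left commented out so the
// sketch also runs without it):
// $index = build_phonetic_index($records, 'dm_soundex');
// $ids   = lookup_phonetic($index, "Moskovitz", 'dm_soundex');
```

Keying the inner arrays by record id means a record that shares several codes with the query is still returned once.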
Performance
Single-name encode, warm, -O2 non-ASan PHP 8.4 on one core, over a representative mix of 18 names (best of 5 trials). Absolute time scales with input length; the relative ordering is the stable part.
| encoder | per call | throughput | relative |
|---|---|---|---|
| match_rating() | ~0.043 µs | ~23M/sec | 0.24x |
| nysiis() | ~0.074 µs | ~13M/sec | 0.42x |
| double_metaphone() | ~0.18 µs | ~5.5M/sec | 1.0x |
| dm_soundex() | ~0.41 µs | ~2.4M/sec | ~2.3x slower |
| bmpm() | ~11 µs | ~91k/sec | ~60x slower |
Match Rating and NYSIIS are short single-key passes, so they're the cheapest. Double Metaphone is a single linear pass with a primary/alternate split. Daitch-Mokotoff branches on ambiguous letters and dedups the resulting codes; a first-byte rule index keeps it fast. BMPM is the heaviest: language detection, a main transliteration pass, and two final rule passes, expanding a Cartesian product of phoneme alternatives capped at 20 per word. When you know the language, passing an explicit $language skips auto-detection and can cut bmpm time several-fold, though the gain depends on the chosen language's ruleset. Choose BMPM for recall, not throughput.
The comparison helpers cost roughly two encodes plus a cheap compare:
| helper | per call | throughput |
|---|---|---|
| match_rating_compare() | ~0.11 µs | ~9M/sec |
| nysiis_match() | ~0.14 µs | ~7M/sec |
| double_metaphone_match() | ~0.26 µs | ~3.8M/sec |
| dm_soundex_match() | ~0.80 µs | ~1.3M/sec |
| bmpm_match() | ~22 µs | ~45k/sec |
For repeated lookups against a fixed corpus, encode once and index the keys (see Usage) rather than calling a helper per candidate pair.
Notes & limitations
- Input is UTF-8. bmpm() and dm_soundex() fold accented Latin and lowercase both Latin and Cyrillic script before rule matching, so raw Иванов encodes correctly.
- Greek-script input is a known limitation: Greek capitals are not lowercased (the algorithm's context-sensitive final-sigma cannot be expressed by a point-wise case map), so pass Greek names already lowercased or romanized.
- double_metaphone() targets ASCII/Latin; non-letter bytes are skipped, matching Apache Commons Codec.
- nysiis() and match_rating() operate on ASCII letters; match_rating() also folds the Latin-1/Latin-Extended accent set the reference handles.
- bmpm() cost grows faster than linearly with input length (roughly n^1.45 in practice, because it joins every word and runs three rule passes over the result). A single name is short, but a multi-kilobyte string can take seconds, so cap untrusted input length before you encode it.
- dm_soundex() and dm_soundex_match() reject input longer than 4096 bytes with a ValueError. Real names are far shorter; the cap bounds the per-character branch work so an untrusted multi-megabyte string can't turn the encoder into a CPU sink.
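One way to apply the advice about capping untrusted input is a small guard that runs before any encoder call. This is a sketch only: guard_name_length() is a hypothetical helper, and its 256-byte default is an assumed application limit, not a value taken from the extension.

```php
<?php
// Hypothetical guard, not part of the extension: refuse oversized input
// before it reaches an encoder. The 256-byte default is an assumed
// application-level limit chosen for illustration.
function guard_name_length(string $name, int $max_bytes = 256): string
{
    // strlen() counts bytes, not characters, which matches a byte-based cap.
    if (strlen($name) > $max_bytes) {
        throw new LengthException("name exceeds {$max_bytes} bytes");
    }
    return $name;
}

// Intended use with the extension installed (commented out so the sketch
// also runs without it):
// $tokens = bmpm(guard_name_length($untrusted_name));
```

Rejecting rather than truncating avoids cutting a multi-byte UTF-8 character in half; an application that prefers truncation would need to cut on a character boundary instead.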
🔗 Native PHP extensions
Companion native PHP extensions:
- php_excel: native Excel I/O via LibXL. 7-10× faster than PhpSpreadsheet, full XLS/XLSX with formulas, formatting, and styling.
- mdparser: native CommonMark + GFM markdown parser via md4c. 15-30× faster than pure-PHP libraries.
- php_clickhouse: native ClickHouse client speaking the wire protocol directly. Picks up where SeasClick left off.
- pdo_duckdb: PDO driver for DuckDB, analytical SQL in your PHP stack.
- fastjson: drop-in faster ext/json, backed by yyjson. 6× encode, 2.7× decode, 5× validate.
- phpser: decoder-optimized binary serializer for cache workloads. Faster than igbinary on packed numerics and DTO batches.
- fast_uuid: high-throughput UUID generation (v1/v4/v7), batched CSPRNG and SIMD hex formatter, ramsey-compatible API.
- fastchart: native chart-rendering extension. 38 chart types behind one fluent OO API, SVG-canonical with PNG/JPG/WebP and optional PDF output.
- statgrab: system statistics (CPU, memory, disk, network) via libstatgrab, no parsing /proc by hand.
License
BSD 3-Clause (see LICENSE).
The Beider-Morse and Daitch-Mokotoff rule data is vendored from Apache Commons Codec under the Apache License 2.0; its notice is included in Section 2 of the LICENSE file. Double Metaphone, NYSIIS, and Match Rating Approach are clean-room implementations of their published algorithms, with Commons Codec used only as the parity-test oracle (no third-party data).
