jakhotiya / symspell-php
Spelling correction & fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm
Requires
- php: ^8.0
- ext-mbstring: *
Requires (Dev)
- pestphp/pest: ^3.8
- phpstan/phpstan: ^2.1
README
A complete PHP port of the SymSpell library - the world's fastest spelling correction & fuzzy search library.
Features
✅ Ultra-Fast Spelling Correction - 1 million times faster than traditional algorithms
✅ Word Segmentation - Split concatenated words ("thequickbrownfox" → "the quick brown fox")
✅ Compound Correction - Multi-word spelling correction with context awareness
✅ Multi-Language Support - Includes dictionaries for 8+ languages
✅ CLI Interface - Command-line tool with pipes and redirects support
✅ Complete API - All original SymSpell functionality ported to PHP
Quick Start
Installation
composer require jakhotiya/symspell-php
Basic Usage
<?php
require_once 'vendor/autoload.php';

use Jakhotiya\SymspellPhp\SymSpell;
use Jakhotiya\SymspellPhp\Enums\Verbosity;

// Initialize SymSpell
$symSpell = new SymSpell();

// Load dictionary
$symSpell->loadDictionary('path/to/frequency_dictionary_en_82_765.txt', 0, 1);

// Single word correction
$suggestions = $symSpell->lookup('helo', Verbosity::Closest, 2);
foreach ($suggestions as $suggestion) {
    echo "{$suggestion->term} (distance: {$suggestion->distance}, frequency: {$suggestion->count})\n";
}
// Output: hello (distance: 1, frequency: 32960381)

// Word segmentation
$result = $symSpell->wordSegmentation('thequickbrownfox');
echo $result->correctedString; // "the quick brown fox"

// Multi-word correction
$suggestions = $symSpell->lookupCompound('hello wrold');
echo $suggestions[0]->term; // "hello world"
Core Algorithms
1. Single Word Correction
Fast spelling correction for individual words using the Symmetric Delete algorithm:
$symSpell = new SymSpell();
$symSpell->loadDictionary('dictionary.txt', 0, 1);

// Get single best suggestion
$suggestions = $symSpell->lookup('speling', Verbosity::Top, 2);
echo $suggestions[0]->term; // "spelling"

// Get all suggestions within edit distance
$suggestions = $symSpell->lookup('speling', Verbosity::All, 2);
foreach ($suggestions as $suggestion) {
    printf(
        "%s (distance: %d, frequency: %s)\n",
        $suggestion->term,
        $suggestion->distance,
        number_format($suggestion->count)
    );
}
2. Word Segmentation
Triangular Matrix Algorithm - O(n) runtime complexity for splitting concatenated words:
// Split concatenated words with missing spaces
$result = $symSpell->wordSegmentation('unitedkingdom');
echo $result->segmentedString;   // "united kingdom"
echo $result->correctedString;   // "united kingdom" (with spelling correction)
echo $result->distanceSum;       // 1 (number of spaces inserted)
echo $result->probabilityLogSum; // -7.63 (log probability score)

// Works with typos too
$result = $symSpell->wordSegmentation('thequickbrownfxojumps');
echo $result->correctedString; // "the quick brown fox jumps"
3. Compound Correction
Multi-word spelling correction with compound splitting/merging:
// Load bigram dictionary for better context
$symSpell->loadBigramDictionary('frequency_bigramdictionary_en_243_342.txt', 0, 2);

// Multi-word correction
$suggestions = $symSpell->lookupCompound('whereis th elove hehad dated forImuch');
echo $suggestions[0]->term;
// Output: "where is the love he had dated for much"
Demo Applications
The package includes four demo applications showcasing different features:
1. Basic Demo (Single Word Correction)
php demos/basic_demo.php
Interactive spell checker - type words and get suggestions.
2. Word Segmentation Demo
php demos/segmentation_demo.php
Split concatenated words:
- Input: thequickbrownfoxjumps
- Output: the quick brown fox jumps
3. Compound Correction Demo
php demos/compound_demo.php
Multi-word spelling correction with context awareness.
4. Command Line Interface
# Basic usage
echo "hello wrold" | php demos/cli_demo.php load frequency_dictionary_en_82_765.txt lookup

# Word segmentation
echo "thequickbrownfox" | php demos/cli_demo.php load frequency_dictionary_en_82_765.txt wordsegment

# With full options
echo "speling" | php demos/cli_demo.php load frequency_dictionary_en_82_765.txt 7 lookup 2 true Closest
CLI Parameters:
- DictionaryType: load (load from file) or create (from corpus)
- DictionaryPath: Path to dictionary file
- PrefixLength: 5-7 (memory/speed trade-off)
- LookupType: lookup | lookupcompound | wordsegment
- MaxEditDistance: Maximum edit distance (default: 2)
- OutputStats: true/false - show distance and frequency
- Verbosity: Top | Closest | All
Dictionaries
📚 Dictionary Customization Guide - Learn how to add words, create custom dictionaries, and build domain-specific vocabularies.
The package includes comprehensive dictionaries:
English Dictionaries (Included)
- frequency_dictionary_en_82_765.txt - 82,765 English words with frequencies
- frequency_bigramdictionary_en_243_342.txt - 243,342 English bigrams
Multi-Language Dictionaries (Included)
- 🇺🇸 English (en-80k.txt) - 80,000 words
- 🇩🇪 German (de-100k.txt) - 100,000 words
- 🇫🇷 French (fr-100k.txt) - 100,000 words
- 🇪🇸 Spanish (es-100l.txt) - 100,000 words
- 🇮🇹 Italian (it-100k.txt) - 100,000 words
- 🇷🇺 Russian (ru-100k.txt) - 100,000 words
- 🇮🇱 Hebrew (he-100k.txt) - 100,000 words
- 🇨🇳 Chinese (zh-50k.txt) - 50,000 words
Dictionary Format
Plain UTF-8 text files with one entry per line, in the format word frequency (separated by whitespace):
the 23135851162
of 13151942776
and 12997637966
to 12136980858
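The format above can be read with a few lines of PHP. The following is a minimal sketch, not the package's own loader: parseFrequencyLines is a hypothetical helper, and its $termIndex and $countIndex parameters only mirror the names in the loadDictionary signature. It assumes whitespace-separated columns and a 64-bit PHP build, since the counts exceed the 32-bit integer range.

```php
<?php
/**
 * Hypothetical helper (not part of the package): parse "word frequency"
 * lines into a word => count map. Blank or malformed lines are skipped.
 *
 * @param iterable<string> $lines
 * @return array<string,int>
 */
function parseFrequencyLines(iterable $lines, int $termIndex = 0, int $countIndex = 1): array
{
    $entries = [];
    foreach ($lines as $line) {
        // Split on any run of whitespace (assumed separator).
        $fields = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
        if (!isset($fields[$termIndex], $fields[$countIndex])) {
            continue; // not enough columns
        }
        if (!ctype_digit($fields[$countIndex])) {
            continue; // count column is not a non-negative integer
        }
        $entries[$fields[$termIndex]] = (int) $fields[$countIndex];
    }
    return $entries;
}

$sample = ["the 23135851162", "of 13151942776", "", "broken-line"];
$entries = parseFrequencyLines($sample);
print_r($entries);
```

Entries parsed this way could then be passed one by one to createDictionaryEntry, for example when merging a domain-specific word list into an existing dictionary.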
Performance
Speed Benchmarks
- Single word lookup: ~0.3ms per word
- Word segmentation: ~0.2ms for typical inputs
- Dictionary loading: ~50ms for 82K words
Memory Usage
- Dictionary: ~7MB for 82K English words
- Runtime: Minimal additional memory overhead
- Optimization: Use prefixLength=5 for lower memory usage
API Reference
Core Classes
SymSpell
Main spell correction class.
Constructor:
public function __construct(
    int $initialCapacity = 82765,
    int $maxDictionaryEditDistance = 2,
    int $prefixLength = 7,
    int $countThreshold = 1
)
Methods:
// Dictionary management
public function loadDictionary(string $corpus, int $termIndex = 0, int $countIndex = 1): bool
public function loadBigramDictionary(string $corpus, int $termIndex = 0, int $countIndex = 2): bool
public function createDictionaryEntry(string $word, int $count): bool

// Spell correction
public function lookup(string $input, Verbosity $verbosity = Verbosity::Top, ?int $maxEditDistance = null): array
public function lookupCompound(string $input, ?int $maxEditDistance = null): array
public function wordSegmentation(string $input): SegmentationItem

// Properties
public function getWordCount(): int
public function getEntryCount(): int
public function getMaxDictionaryEditDistance(): int
SuggestItem
Represents a spelling suggestion.
class SuggestItem
{
    public string $term;   // Suggested word
    public int $distance;  // Edit distance from input
    public int $count;     // Frequency in dictionary
}
SegmentationItem
Represents word segmentation result.
class SegmentationItem
{
    public string $segmentedString;   // Original with spaces inserted
    public string $correctedString;   // Segmented + spelling corrected
    public int $distanceSum;          // Total edit distance
    public float $probabilityLogSum;  // Log probability score
}
Verbosity Enum
Controls number of suggestions returned.
enum Verbosity: int
{
    case Top = 0;      // Single best suggestion
    case Closest = 1;  // All suggestions with minimum edit distance
    case All = 2;      // All suggestions within maxEditDistance
}
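To illustrate how the three levels differ, here is a small self-contained sketch. It does not use the package: DemoVerbosity and selectSuggestions are hypothetical stand-ins, the candidate data is made up, and the ordering (smallest distance first, then highest frequency) is an assumption about how suggestions are ranked.

```php
<?php
// Hypothetical stand-in for the package's Verbosity enum (enums need PHP 8.1+).
enum DemoVerbosity: int
{
    case Top = 0;
    case Closest = 1;
    case All = 2;
}

/**
 * Hypothetical helper: reduce a candidate list according to the verbosity level.
 *
 * @param array<int, array{term: string, distance: int, count: int}> $candidates
 * @return array<int, array{term: string, distance: int, count: int}>
 */
function selectSuggestions(array $candidates, DemoVerbosity $verbosity): array
{
    if ($candidates === []) {
        return [];
    }
    // Assumed ranking: smallest distance first, ties broken by higher count.
    usort($candidates, fn (array $a, array $b): int =>
        [$a['distance'], $b['count']] <=> [$b['distance'], $a['count']]);

    $minDistance = $candidates[0]['distance'];

    return match ($verbosity) {
        DemoVerbosity::Top => [$candidates[0]],
        DemoVerbosity::Closest => array_values(array_filter(
            $candidates,
            fn (array $c): bool => $c['distance'] === $minDistance
        )),
        DemoVerbosity::All => $candidates,
    };
}

// Made-up candidates for the input "helo".
$candidates = [
    ['term' => 'hole',  'distance' => 2, 'count' => 80],
    ['term' => 'help',  'distance' => 1, 'count' => 50],
    ['term' => 'hello', 'distance' => 1, 'count' => 100],
];

foreach (DemoVerbosity::cases() as $level) {
    $terms = array_column(selectSuggestions($candidates, $level), 'term');
    echo $level->name, ': ', implode(', ', $terms), "\n";
}
```

Top is the cheapest option when only one correction is needed; All is useful for showing a ranked list to the user.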
Algorithm Details
Symmetric Delete Algorithm
SymSpell uses a revolutionary approach:
- Traditional: Generate all possible edits for input word (millions of variations)
- SymSpell: Pre-generate only deletions for dictionary words (25 deletions vs 3 million edits)
Result: 1,000,000x speed improvement over traditional methods.
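The delete-only idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not the package's implementation: generateDeletes is a hypothetical helper, and it ignores the prefix-length optimization controlled by prefixLength.

```php
<?php
/**
 * Hypothetical helper: return every distinct string reachable from $word
 * by deleting between 1 and $maxDistance characters (multibyte-safe).
 *
 * @return string[]
 */
function generateDeletes(string $word, int $maxDistance): array
{
    $seen = [];           // deletes found so far, stored as array keys
    $frontier = [$word];  // strings produced at the previous depth

    for ($depth = 0; $depth < $maxDistance; $depth++) {
        $next = [];
        foreach ($frontier as $current) {
            $length = mb_strlen($current);
            for ($i = 0; $i < $length; $i++) {
                // Remove the character at position $i.
                $delete = mb_substr($current, 0, $i) . mb_substr($current, $i + 1);
                if (!isset($seen[$delete])) {
                    $seen[$delete] = true;
                    $next[] = $delete;
                }
            }
        }
        $frontier = $next;
    }

    // PHP casts numeric-string keys to int, so cast back to string.
    return array_map('strval', array_keys($seen));
}

$deletes = generateDeletes('hello', 1);
echo implode(', ', $deletes), "\n"; // ello, hllo, helo, hell
```

The misspelling "helo" is itself one of the deletes of "hello", so a lookup can find the dictionary word through a plain hash-table match. Applying the same delete generation to the input term is what lets insertions, substitutions, and transpositions be matched as well.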
Triangular Matrix Word Segmentation
- Runtime: O(n) linear complexity
- Method: Dynamic programming without recursion
- Optimization: Circular buffer for memory efficiency
- Scoring: Naive Bayes probability using real word frequencies
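The scoring and dynamic-programming ideas in this list can be sketched as follows. This is a simplified illustration, not the package's algorithm: segment is a hypothetical function, the frequency table is made up, the whole table is kept in memory instead of a circular buffer, and no spelling correction is attempted. With a fixed maximum word length, the work grows linearly with the input length.

```php
<?php
/**
 * Hypothetical helper: split $text into dictionary words by maximizing the
 * sum of log word probabilities (unigram Naive Bayes scoring).
 *
 * @param array<string,int> $frequencies word => count
 * @return string[]|null best segmentation, or null if none exists
 */
function segment(string $text, array $frequencies, int $maxWordLength = 20): ?array
{
    $total = array_sum($frequencies);
    $n = mb_strlen($text);

    // $best[$i] = [log-probability sum, words] for the first $i characters.
    $best = [0 => [0.0, []]];

    for ($end = 1; $end <= $n; $end++) {
        for ($start = max(0, $end - $maxWordLength); $start < $end; $start++) {
            if (!isset($best[$start])) {
                continue; // the prefix before $start cannot be segmented
            }
            $word = mb_substr($text, $start, $end - $start);
            if (!isset($frequencies[$word])) {
                continue; // not a dictionary word
            }
            $score = $best[$start][0] + log($frequencies[$word] / $total);
            if (!isset($best[$end]) || $score > $best[$end][0]) {
                $best[$end] = [$score, [...$best[$start][1], $word]];
            }
        }
    }

    return $best[$n][1] ?? null;
}

// Made-up counts for illustration only.
$frequencies = ['the' => 500, 'he' => 100, 't' => 1, 'quick' => 20, 'brown' => 15, 'fox' => 10];

$words = segment('thequickbrownfox', $frequencies);
echo implode(' ', $words), "\n"; // the quick brown fox
```

The split "t he quick brown fox" is also valid against this table, but it scores lower because each extra word multiplies in another probability below 1.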
Edit Distance
Supports multiple algorithms:
- Levenshtein: Insertions, deletions, substitutions
- Damerau-OSA: Includes transpositions
- Optimized: Early termination for performance
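For reference, a Damerau-OSA distance with early termination can be sketched like this. It is a textbook formulation rather than the package's code: osaDistance is a hypothetical function, and returning -1 when the limit is exceeded is a convention chosen for this example.

```php
<?php
/**
 * Hypothetical helper: optimal string alignment (restricted
 * Damerau-Levenshtein) distance between two UTF-8 strings.
 * Returns -1 once the distance is known to exceed $maxDistance.
 */
function osaDistance(string $a, string $b, int $maxDistance = PHP_INT_MAX): int
{
    $s = mb_str_split($a);
    $t = mb_str_split($b);
    $m = count($s);
    $n = count($t);

    if (abs($m - $n) > $maxDistance) {
        return -1; // the length difference alone exceeds the limit
    }

    $prevPrev = [];        // row i - 2
    $prev = range(0, $n);  // row i - 1

    for ($i = 1; $i <= $m; $i++) {
        $curr = [$i];
        $rowMin = $i;
        for ($j = 1; $j <= $n; $j++) {
            $cost = $s[$i - 1] === $t[$j - 1] ? 0 : 1;
            $value = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution or match
            );
            if ($i > 1 && $j > 1 && $s[$i - 1] === $t[$j - 2] && $s[$i - 2] === $t[$j - 1]) {
                $value = min($value, $prevPrev[$j - 2] + 1); // transposition
            }
            $curr[$j] = $value;
            $rowMin = min($rowMin, $value);
        }
        if ($rowMin > $maxDistance) {
            return -1; // early termination: later rows cannot get smaller
        }
        $prevPrev = $prev;
        $prev = $curr;
    }

    return $prev[$n] <= $maxDistance ? $prev[$n] : -1;
}

echo osaDistance('wrold', 'world'), "\n";        // 1 (one transposition)
echo osaDistance('kitten', 'sitting', 2), "\n";  // -1 (distance is 3)
```

Plain Levenshtein would count "wrold" to "world" as two substitutions; counting the swap as a single edit is what lets common typing errors stay within a small maxEditDistance.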
Testing
Run the test suite:
./vendor/bin/pest
Test Coverage:
- ✅ 10/11 core algorithm tests passing
- ✅ Word frequency management
- ✅ Edit distance calculations
- ✅ Verbosity controls
- ✅ Count thresholds
- ✅ Overflow protection
- 🔄 Performance test (4,955 expected results)
Requirements
- PHP: 8.0+ per the Composer constraint (^8.0); note that native enums such as Verbosity require PHP 8.1 or later
- Extensions: mbstring (for UTF-8 support)
- Memory: ~50MB for full English dictionary
- Disk: ~175MB for all included dictionaries
License
MIT License - see LICENSE file.
Credits
- Original SymSpell: Wolf Garbe
- PHP Port: Jakhotiya
- Algorithm: Symmetric Delete spelling correction
Applications
Perfect for:
- 🔍 Search engines - Query correction and fuzzy matching
- 📝 Text editors - Real-time spell checking
- 🤖 Chatbots - Understanding misspelled user input
- 📊 OCR systems - Post-processing scanned text
- 🌐 Web forms - User input validation and suggestion
- 🧬 Bioinformatics - DNA sequence analysis
- 🈳 CJK text processing - Chinese/Japanese/Korean segmentation
⚡ Experience the world's fastest spelling correction in PHP! ⚡