jakhotiya/symspell-php

Spelling correction & fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm


0.0.1 2025-08-26 11:23 UTC

This package is auto-updated.

Last update: 2025-12-26 12:14:32 UTC




A complete PHP port of the SymSpell library - the world's fastest spelling correction & fuzzy search library.

Features

Ultra-Fast Spelling Correction - the upstream SymSpell project reports speedups of up to 1 million times over approaches that generate every candidate edit at lookup time
Word Segmentation - Split concatenated words ("thequickbrownfox" → "the quick brown fox")
Compound Correction - Multi-word spelling correction with context awareness
Multi-Language Support - Includes dictionaries for 8+ languages
CLI Interface - Command-line tool with pipes and redirects support
Complete API - All original SymSpell functionality ported to PHP

Quick Start

Installation

composer require jakhotiya/symspell-php

Basic Usage

<?php
require_once 'vendor/autoload.php';

use Jakhotiya\SymspellPhp\SymSpell;
use Jakhotiya\SymspellPhp\Enums\Verbosity;

// Initialize SymSpell
$symSpell = new SymSpell();

// Load dictionary
$symSpell->loadDictionary('path/to/frequency_dictionary_en_82_765.txt', 0, 1);

// Single word correction
$suggestions = $symSpell->lookup('helo', Verbosity::Closest, 2);
foreach ($suggestions as $suggestion) {
    echo "{$suggestion->term} (distance: {$suggestion->distance}, frequency: {$suggestion->count})\n";
}
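// Output includes: hello (distance: 1, frequency: 32960381)
// Verbosity::Closest returns every suggestion tied at the smallest edit distance, so other lines may appear too.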

// Word segmentation  
$result = $symSpell->wordSegmentation('thequickbrownfox');
echo $result->correctedString; // "the quick brown fox"

// Multi-word correction
$suggestions = $symSpell->lookupCompound('hello wrold');
echo $suggestions[0]->term; // "hello world"

Core Algorithms

1. Single Word Correction

Fast spelling correction for individual words using the Symmetric Delete algorithm:

$symSpell = new SymSpell();
$symSpell->loadDictionary('dictionary.txt', 0, 1);

// Get single best suggestion
$suggestions = $symSpell->lookup('speling', Verbosity::Top, 2);
echo $suggestions[0]->term; // "spelling"

// Get all suggestions within edit distance
$suggestions = $symSpell->lookup('speling', Verbosity::All, 2);
foreach ($suggestions as $suggestion) {
    printf("%s (distance: %d, frequency: %s)\n", 
        $suggestion->term, 
        $suggestion->distance, 
        number_format($suggestion->count)
    );
}

2. Word Segmentation

Triangular Matrix Algorithm - O(n) runtime complexity for splitting concatenated words:

// Split concatenated words with missing spaces
$result = $symSpell->wordSegmentation('unitedkingdom');
echo $result->segmentedString;  // "united kingdom"
echo $result->correctedString;  // "united kingdom" (with spelling correction)
echo $result->distanceSum;      // 1 (total edit distance; here, one inserted space)
echo $result->probabilityLogSum; // -7.63 (log probability score)

// Works with typos too
$result = $symSpell->wordSegmentation('thequickbrownfxojumps');
echo $result->correctedString; // "the quick brown fox jumps"

3. Compound Correction

Multi-word spelling correction with compound splitting/merging:

// Load bigram dictionary for better context
$symSpell->loadBigramDictionary('frequency_bigramdictionary_en_243_342.txt', 0, 2);

// Multi-word correction
$suggestions = $symSpell->lookupCompound('whereis th elove hehad dated forImuch');
echo $suggestions[0]->term; 
// Output: "where is the love he had dated for much"

Demo Applications

The package includes four demo applications showcasing different features:

1. Basic Demo (Single Word Correction)

php demos/basic_demo.php

Interactive spell checker - type words and get suggestions.

2. Word Segmentation Demo

php demos/segmentation_demo.php

Split concatenated words:

  • Input: thequickbrownfoxjumps
  • Output: the quick brown fox jumps

3. Compound Correction Demo

php demos/compound_demo.php

Multi-word spelling correction with context awareness.

4. Command Line Interface

# Basic usage
echo "hello wrold" | php demos/cli_demo.php load frequency_dictionary_en_82_765.txt lookup

# Word segmentation
echo "thequickbrownfox" | php demos/cli_demo.php load frequency_dictionary_en_82_765.txt wordsegment

# With full options
echo "speling" | php demos/cli_demo.php load frequency_dictionary_en_82_765.txt 7 lookup 2 true Closest

CLI Parameters:

  • DictionaryType: load (load from file) or create (from corpus)
  • DictionaryPath: Path to dictionary file
  • PrefixLength: 5-7 (memory/speed trade-off)
  • LookupType: lookup | lookupcompound | wordsegment
  • MaxEditDistance: Maximum edit distance (default: 2)
  • OutputStats: true/false - show distance and frequency
  • Verbosity: Top | Closest | All

Dictionaries

📚 Dictionary Customization Guide - Learn how to add words, create custom dictionaries, and build domain-specific vocabularies.

The package includes comprehensive dictionaries:

English Dictionaries (Included)

  • frequency_dictionary_en_82_765.txt - 82,765 English words with frequencies
  • frequency_bigramdictionary_en_243_342.txt - 243,342 English bigrams

Multi-Language Dictionaries (Included)

  • 🇺🇸 English (en-80k.txt) - 80,000 words
  • 🇩🇪 German (de-100k.txt) - 100,000 words
  • 🇫🇷 French (fr-100k.txt) - 100,000 words
  • 🇪🇸 Spanish (es-100l.txt) - 100,000 words
  • 🇮🇹 Italian (it-100k.txt) - 100,000 words
  • 🇷🇺 Russian (ru-100k.txt) - 100,000 words
  • 🇮🇱 Hebrew (he-100k.txt) - 100,000 words
  • 🇨🇳 Chinese (zh-50k.txt) - 50,000 words

Dictionary Format

Plain UTF-8 text files with one entry per line: the word, a space, then its frequency count.

the 23135851162
of 13151942776
and 12997637966
to 12136980858
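
As a rough illustration of the format above, the following sketch parses "word frequency" lines into an array. The function name parseFrequencyLines is hypothetical and not part of the package. The $termIndex and $countIndex parameters only mirror the column arguments of loadDictionary, and skipping malformed lines is an assumption about reasonable behavior, not a description of what the library does.

```php
<?php
// Hypothetical helper (not part of the package): parse "term count" lines.
function parseFrequencyLines(string $text, int $termIndex = 0, int $countIndex = 1): array
{
    $entries = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $parts = preg_split('/\s+/', trim($line));
        // Assumption: skip blank or malformed lines instead of failing the whole load.
        if (!isset($parts[$termIndex], $parts[$countIndex]) || !ctype_digit($parts[$countIndex])) {
            continue;
        }
        $term = $parts[$termIndex];
        // Assumption: counts for a repeated term are summed.
        $entries[$term] = ($entries[$term] ?? 0) + (int) $parts[$countIndex];
    }
    return $entries;
}

$sample = "the 23135851162\nof 13151942776\nand 12997637966\nto 12136980858\n";
$dictionary = parseFrequencyLines($sample);
echo count($dictionary), " entries loaded\n";
```

The counts in the sample exceed 32 bits, so this sketch assumes a 64-bit PHP build.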

Performance

Speed Benchmarks

  • Single word lookup: ~0.3ms per word
  • Word segmentation: ~0.2ms for typical inputs
  • Dictionary loading: ~50ms for 82K words

Memory Usage

  • Dictionary: ~7MB for 82K English words
  • Runtime: Minimal additional memory overhead
  • Optimization: Use prefixLength=5 for lower memory usage
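
To see why a shorter prefix lowers memory use, the sketch below counts the distinct deletes generated from only the first prefixLength characters of a word. SymSpell derives its deletes from that prefix, so fewer prefix characters mean fewer index entries per word. The helper countPrefixDeletes is hypothetical, and ASCII input is assumed for brevity.

```php
<?php
// Hypothetical helper: count distinct deletes (1..$maxDistance removed characters)
// of the first $prefixLength bytes of $word. ASCII input assumed for brevity.
function countPrefixDeletes(string $word, int $prefixLength, int $maxDistance): int
{
    $seen = [];
    $frontier = [substr($word, 0, $prefixLength)];
    for ($depth = 0; $depth < $maxDistance; $depth++) {
        $next = [];
        foreach ($frontier as $current) {
            for ($i = 0, $len = strlen($current); $i < $len; $i++) {
                // Remove the character at position $i.
                $deleted = substr($current, 0, $i) . substr($current, $i + 1);
                if (!isset($seen[$deleted])) {
                    $seen[$deleted] = true;
                    $next[] = $deleted;
                }
            }
        }
        $frontier = $next;
    }
    return count($seen);
}

foreach ([5, 7] as $prefixLength) {
    printf("prefixLength=%d: %d deletes\n", $prefixLength, countPrefixDeletes('background', $prefixLength, 2));
}
```

For this word the count roughly doubles between the two settings. The lookup-speed cost of a shorter prefix is not measured here.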

API Reference

Core Classes

SymSpell

Main spell correction class.

Constructor:

public function __construct(
    int $initialCapacity = 82765,
    int $maxDictionaryEditDistance = 2,
    int $prefixLength = 7,
    int $countThreshold = 1
)

Methods:

// Dictionary management
public function loadDictionary(string $corpus, int $termIndex = 0, int $countIndex = 1): bool
public function loadBigramDictionary(string $corpus, int $termIndex = 0, int $countIndex = 2): bool
public function createDictionaryEntry(string $word, int $count): bool

// Spell correction
public function lookup(string $input, Verbosity $verbosity = Verbosity::Top, ?int $maxEditDistance = null): array
public function lookupCompound(string $input, ?int $maxEditDistance = null): array
public function wordSegmentation(string $input): SegmentationItem

// Properties
public function getWordCount(): int
public function getEntryCount(): int
public function getMaxDictionaryEditDistance(): int

SuggestItem

Represents a spelling suggestion.

class SuggestItem {
    public string $term;      // Suggested word
    public int $distance;     // Edit distance from input
    public int $count;        // Frequency in dictionary
}

SegmentationItem

Represents word segmentation result.

class SegmentationItem {
    public string $segmentedString;    // Original with spaces inserted
    public string $correctedString;    // Segmented + spelling corrected
    public int $distanceSum;           // Total edit distance
    public float $probabilityLogSum;   // Log probability score
}

Verbosity Enum

Controls number of suggestions returned.

enum Verbosity: int {
    case Top = 0;      // Single best suggestion
    case Closest = 1;  // All suggestions with minimum edit distance  
    case All = 2;      // All suggestions within maxEditDistance
}
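
The three verbosity levels can be pictured as filters over a list of candidates sorted by edit distance and then by frequency. The sketch below is an illustration under that assumption, not the package's implementation. SketchVerbosity is a stand-in enum so the code runs on its own (PHP 8.1+), filterByVerbosity is a hypothetical helper, and the candidate counts are made up.

```php
<?php
// Stand-in for the package's Verbosity enum so this sketch is self-contained.
enum SketchVerbosity: int
{
    case Top = 0;
    case Closest = 1;
    case All = 2;
}

// Hypothetical helper: each candidate is ['term' => string, 'distance' => int, 'count' => int].
function filterByVerbosity(array $candidates, SketchVerbosity $verbosity): array
{
    // Order by edit distance ascending, then by frequency descending.
    usort($candidates, fn(array $a, array $b): int =>
        [$a['distance'], $b['count']] <=> [$b['distance'], $a['count']]);

    if ($candidates === [] || $verbosity === SketchVerbosity::All) {
        return $candidates;
    }
    if ($verbosity === SketchVerbosity::Top) {
        return [$candidates[0]];
    }
    // Closest: keep everything tied at the smallest distance.
    $min = $candidates[0]['distance'];
    return array_values(array_filter($candidates, fn(array $c): bool => $c['distance'] === $min));
}

// Illustrative candidates for the input "speling"; the counts are invented.
$candidates = [
    ['term' => 'selling',  'distance' => 2, 'count' => 900],
    ['term' => 'spewing',  'distance' => 1, 'count' => 100],
    ['term' => 'spelling', 'distance' => 1, 'count' => 500],
];

foreach (SketchVerbosity::cases() as $level) {
    $terms = array_column(filterByVerbosity($candidates, $level), 'term');
    echo $level->name, ': ', implode(', ', $terms), "\n";
}
```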

Algorithm Details

Symmetric Delete Algorithm

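SymSpell moves most of the work from lookup time to dictionary-build time:

  • Traditional: Generate all possible edits (deletions, insertions, substitutions, transpositions) of the input word at lookup time, which can mean millions of variations
  • SymSpell: Pre-generate only deletions for dictionary words, and generate only deletions of the input at lookup. A 5-letter word has just 25 deletions within edit distance 3, versus about 3 million possible edits

Result: far fewer candidates to generate and compare, which is the basis of the 1,000,000x speedup that SymSpell claims over the traditional approach.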
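
A minimal sketch of the idea, assuming a tiny in-memory dictionary: deletes of each dictionary word are indexed up front, and a lookup only generates deletes of the input and probes the index. All function names here are hypothetical and not the package's API. A real implementation would also verify each candidate with an edit distance calculation and rank by frequency, which this sketch omits.

```php
<?php
// Hypothetical helper: all distinct strings obtained by deleting 1..$maxDistance characters.
function generateDeletes(string $word, int $maxDistance): array
{
    $seen = [];
    $frontier = [$word];
    for ($depth = 0; $depth < $maxDistance; $depth++) {
        $next = [];
        foreach ($frontier as $current) {
            // Split into UTF-8 characters without requiring mbstring.
            $chars = preg_split('//u', $current, -1, PREG_SPLIT_NO_EMPTY);
            foreach (array_keys($chars) as $i) {
                $copy = $chars;
                unset($copy[$i]);
                $deleted = implode('', $copy);
                if (!isset($seen[$deleted])) {
                    $seen[$deleted] = true;
                    $next[] = $deleted;
                }
            }
        }
        $frontier = $next;
    }
    return array_map('strval', array_keys($seen));
}

// Build time: map every term and each of its deletes back to the term.
function buildDeleteIndex(array $terms, int $maxDistance): array
{
    $index = [];
    foreach ($terms as $term) {
        foreach (array_merge([$term], generateDeletes($term, $maxDistance)) as $key) {
            $index[$key][$term] = true;
        }
    }
    return $index;
}

// Lookup time: probe the index with the input and the input's own deletes.
function candidateTerms(string $input, array $index, int $maxDistance): array
{
    $found = [];
    foreach (array_merge([$input], generateDeletes($input, $maxDistance)) as $key) {
        foreach (array_keys($index[$key] ?? []) as $term) {
            $found[$term] = true;
        }
    }
    $terms = array_map('strval', array_keys($found));
    sort($terms);
    return $terms;
}

echo count(generateDeletes('world', 3)), " deletes for a 5-letter word\n";
$index = buildDeleteIndex(['hello', 'help', 'world'], 1);
echo implode(', ', candidateTerms('helo', $index, 1)), "\n";
```

Here "helo" matches "hello" because it is itself a delete of that word, and it matches "help" because both share the delete "hel". That shared-delete case is how substitutions are covered without ever generating them.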

Triangular Matrix Word Segmentation

  • Runtime: O(n) linear complexity
  • Method: Dynamic programming without recursion
  • Optimization: Circular buffer for memory efficiency
  • Scoring: Naive Bayes probability using real word frequencies
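
The scoring idea can be sketched with a plain dynamic program over end positions. This is a simplified stand-in, not the triangular matrix or circular buffer used by the library: it matches dictionary words exactly, performs no spelling correction, and assumes ASCII input. The function segmentWords and the word counts are hypothetical. Each word contributes log10(count / total), and the segmentation with the highest sum wins.

```php
<?php
// Simplified sketch: exact-match segmentation only, ASCII input assumed.
// Returns null when the input cannot be covered by dictionary words.
function segmentWords(string $input, array $counts, int $maxWordLength = 20): ?string
{
    $total = array_sum($counts);
    $n = strlen($input);
    $score = [0 => 0.0]; // best log-probability sum for each reachable prefix length
    $prev = [];          // start offset of the last word in that best split
    for ($end = 1; $end <= $n; $end++) {
        // Bounding the word length keeps the number of probes linear in $n.
        for ($start = max(0, $end - $maxWordLength); $start < $end; $start++) {
            $word = substr($input, $start, $end - $start);
            if (!isset($score[$start], $counts[$word])) {
                continue;
            }
            $candidate = $score[$start] + log10($counts[$word] / $total);
            if (!isset($score[$end]) || $candidate > $score[$end]) {
                $score[$end] = $candidate;
                $prev[$end] = $start;
            }
        }
    }
    if (!isset($score[$n])) {
        return null;
    }
    // Walk the back-pointers from the end to recover the words.
    $words = [];
    for ($end = $n; $end > 0; $end = $prev[$end]) {
        $words[] = substr($input, $prev[$end], $end - $prev[$end]);
    }
    return implode(' ', array_reverse($words));
}

// Made-up counts for illustration.
$counts = ['the' => 500, 'he' => 300, 'quick' => 40, 'brown' => 30, 'fox' => 20, 't' => 1];
echo segmentWords('thequickbrownfox', $counts), "\n";
```

With these counts, "the" beats "t he" because one frequent word scores higher than two words whose log-probabilities are added together.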

Edit Distance

Supports multiple algorithms:

  • Levenshtein: Insertions, deletions, substitutions
  • Damerau-OSA: Includes transpositions
  • Optimized: Early termination for performance
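
The difference between the first two variants shows up on swapped letters. The sketch below is a textbook optimal string alignment (Damerau-OSA) distance, compared against PHP's built-in levenshtein(). The function osaDistance is a hypothetical name, not the package's implementation, and the early-termination optimization is left out.

```php
<?php
// Textbook optimal string alignment distance: Levenshtein plus adjacent transpositions.
function osaDistance(string $a, string $b): int
{
    // Split into UTF-8 characters without requiring mbstring.
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $m = count($s);
    $n = count($t);

    $d = [];
    for ($i = 0; $i <= $m; $i++) {
        $d[$i][0] = $i; // delete all of the first $i characters
    }
    for ($j = 0; $j <= $n; $j++) {
        $d[0][$j] = $j; // insert all of the first $j characters
    }

    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            $cost = $s[$i - 1] === $t[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,        // deletion
                $d[$i][$j - 1] + 1,        // insertion
                $d[$i - 1][$j - 1] + $cost // substitution or match
            );
            // Adjacent transposition, e.g. "ro" <-> "or".
            if ($i > 1 && $j > 1 && $s[$i - 1] === $t[$j - 2] && $s[$i - 2] === $t[$j - 1]) {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

printf("levenshtein: %d, OSA: %d\n", levenshtein('wrold', 'world'), osaDistance('wrold', 'world'));
```

Counting a swap as a single edit means a typo like "wrold" stays within a maximum edit distance of 1.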

Testing

Run the test suite:

./vendor/bin/pest

Test Coverage:

  • ✅ 10/11 core algorithm tests passing
  • ✅ Word frequency management
  • ✅ Edit distance calculations
  • ✅ Verbosity controls
  • ✅ Count thresholds
  • ✅ Overflow protection
  • 🔄 Performance test (4,955 expected results)

Requirements

  • PHP: 8.1+ (native enums require PHP 8.1; also uses strict typing)
  • Extensions: mbstring (for UTF-8 support)
  • Memory: ~50MB for full English dictionary
  • Disk: ~175MB for all included dictionaries

License

MIT License - see LICENSE file.

Credits

Applications

Perfect for:

  • 🔍 Search engines - Query correction and fuzzy matching
  • 📝 Text editors - Real-time spell checking
  • 🤖 Chatbots - Understanding misspelled user input
  • 📊 OCR systems - Post-processing scanned text
  • 🌐 Web forms - User input validation and suggestion
  • 🧬 Bioinformatics - DNA sequence analysis
  • 🈳 CJK text processing - Chinese/Japanese/Korean segmentation

⚡ Experience the world's fastest spelling correction in PHP!