edgaras / strsim
Collection of string similarity and distance algorithms in PHP including Levenshtein, Damerau-Levenshtein, Jaro-Winkler, and more
pkg:composer/edgaras/strsim
Requires
- php: >=8.3.0
Requires (Dev)
- phpunit/phpunit: ^11.5
Last update: 2025-09-25 12:13:11 UTC
README
A collection of string similarity and distance algorithms implemented in PHP with full Unicode and multibyte character support. This library provides standalone static methods for computing various similarity metrics, useful in natural language processing, fuzzy matching, spell checking, and bioinformatics.
Requirements
- PHP 8.3+
- Composer
Installation
- Use the library via Composer:
composer require edgaras/strsim
- Include the Composer autoloader:
require __DIR__ . '/vendor/autoload.php';
Features
- Full Unicode Support: All algorithms handle multibyte characters, emoji, combining marks, and complex grapheme clusters
- UTF-8 Validation: Automatic validation of input strings with clear error messages
- Error Handling: Proper exception types with descriptive messages
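- Code-Point Based: Strings are compared code point by code point, so results are deterministic for a given sequence of code points. Inputs are not normalized: a precomposed "é" and "e" followed by U+0301 count as different strings, so normalize both inputs to the same form first if canonical equivalence matters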
- Optimized Tokenization: Smart whitespace handling for text-based algorithms
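The UTF-8 validation and code-point handling listed above can be sketched roughly as follows. This is a minimal illustration built on PCRE's `u` modifier, not the library's actual implementation; `demoCodePoints()` is a hypothetical helper name, and the error message is borrowed from the Error Handling examples.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library's documented API):
 * validate a string as UTF-8 and split it into code points.
 */
function demoCodePoints(string $s): array
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8.
    if (preg_match('//u', $s) !== 1) {
        throw new InvalidArgumentException('Input strings must be valid UTF-8.');
    }
    // An empty pattern matches between every code point; drop the empty edge pieces.
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

echo count(demoCodePoints("caf\u{00E9}")), "\n"; // 4 code points (5 bytes)
echo count(demoCodePoints("e\u{0301}")), "\n";   // 2 code points: "e" + combining acute
```

Working on arrays of code points rather than raw bytes is what lets the same dynamic-programming code serve ASCII, CJK, and emoji input alike.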
Supported Algorithms
Class | Method | Description |
---|---|---|
Levenshtein | distance() | Measures the number of insertions, deletions, or substitutions. |
DamerauLevenshtein | distance() | Levenshtein with transpositions included. |
Hamming | distance() | Counts differing positions (requires equal-length strings). |
Jaro | distance() | Measures similarity based on character matches and transpositions. |
JaroWinkler | distance() | Jaro with a prefix match boost for similar string starts. |
LCS | length() | Returns the length of the longest common subsequence. |
SmithWaterman | score() | Local alignment scoring for best-matching subsequences. |
NeedlemanWunsch | score() | Global alignment scoring for entire string similarity. |
Cosine | similarity() | Measures similarity via character frequency vectors. |
Cosine | similarityFromVectors() | Computes cosine similarity for numeric vector inputs. |
Jaccard | index() | Ratio of shared to total unique characters. |
MongeElkan | similarity() | Average best-word similarity using Jaro-Winkler internally. |
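To make the first row of the table concrete, here is a minimal sketch of the classic two-row dynamic-programming Levenshtein distance over code points. It is a textbook formulation written for illustration; the function name `demoLevenshtein()` is hypothetical, and the library's own implementation may differ.

```php
<?php
declare(strict_types=1);

// Illustrative only: textbook Levenshtein distance over Unicode code points.
// Assumes both inputs are valid UTF-8.
function demoLevenshtein(string $a, string $b): int
{
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($y);

    // Row 0: turning the empty string into the first j code points of $b takes j insertions.
    $prev = range(0, $n);

    foreach ($x as $i => $cx) {
        // Column 0: turning the first i+1 code points of $a into the empty string.
        $curr = [$i + 1];
        foreach ($y as $j => $cy) {
            $cost = ($cx === $cy) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1,  // deletion
                $curr[$j] + 1,      // insertion
                $prev[$j] + $cost   // substitution (or match when cost is 0)
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

echo demoLevenshtein('kitten', 'sitting'), "\n";   // 3
echo demoLevenshtein("caf\u{00E9}", 'caffe'), "\n"; // 2
```

Keeping only two rows of the matrix uses memory proportional to the shorter dimension, which is the usual trade-off when only the distance (not the edit script) is needed.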
Usage
Basic Usage
use Edgaras\StrSim\Levenshtein;
use Edgaras\StrSim\DamerauLevenshtein;
use Edgaras\StrSim\Hamming;
use Edgaras\StrSim\Jaro;
use Edgaras\StrSim\JaroWinkler;
use Edgaras\StrSim\LCS;
use Edgaras\StrSim\SmithWaterman;
use Edgaras\StrSim\NeedlemanWunsch;
use Edgaras\StrSim\Cosine;
use Edgaras\StrSim\Jaccard;
use Edgaras\StrSim\MongeElkan;

// Detecting spelling error distance in user input
Levenshtein::distance("kitten", "sitting"); // Returns: 3

// Detecting typo distance with transposition correction
DamerauLevenshtein::distance("abcd", "acbd"); // Returns: 1

// Bit-level error detection (equal-length only)
Hamming::distance("1011101", "1001001"); // Returns: 2

// Comparing short strings with transposition support
Jaro::distance("dixon", "dicksonx"); // Returns: 0.767

// Matching names with common prefixes
JaroWinkler::distance("martha", "marhta"); // Returns: 0.961

// Finding common subsequence in DNA fragments
LCS::length("ACCGGTCGAGTGCGCGGAAGCCGGCCGAA", "GTCGTTCGGAATGCCGTTGCTCTGTAAA"); // Returns: 20

// Local alignment score for substring match
SmithWaterman::score("ACACACTA", "AGCACACA"); // Returns: 11

// Global alignment score for complete sequence match
NeedlemanWunsch::score("GATTACA", "GCATGCU"); // Returns: 0

// Comparing character frequency in short strings
Cosine::similarity("night", "nacht"); // Returns: 0.6

// Comparing embedding vectors from NLP model
Cosine::similarityFromVectors([0.1, 0.2, 0.3], [0.1, 0.3, 0.4]); // Returns: 0.996

// Comparing character overlap in short strings
Jaccard::index("abc", "bcd"); // Returns: 0.5

// Fuzzy match between two multi-word names
MongeElkan::similarity("john smith", "jon smythe"); // Returns: 0.822
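The two cosine variants can be worked through by hand with a small standalone sketch. The helper names `demoCosineFromVectors()` and `demoCosineChars()` are hypothetical, and returning 0.0 for a zero-length vector is an assumption made here, not documented library behavior.

```php
<?php
declare(strict_types=1);

// Illustrative only: cosine similarity of two equal-length numeric vectors.
function demoCosineFromVectors(array $a, array $b): float
{
    $a = array_values($a);
    $b = array_values($b);
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be the same length.');
    }
    $dot = 0.0;
    $na = 0.0;
    $nb = 0.0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $na  += $v * $v;
        $nb  += $b[$i] * $b[$i];
    }
    // Assumption: treat a zero vector as having no similarity to anything.
    if ($na == 0.0 || $nb == 0.0) {
        return 0.0;
    }
    return $dot / sqrt($na * $nb);
}

// Illustrative only: build character-frequency vectors, then reuse the function above.
function demoCosineChars(string $s, string $t): float
{
    $fs = array_count_values(preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY));
    $ft = array_count_values(preg_split('//u', $t, -1, PREG_SPLIT_NO_EMPTY));
    $vs = [];
    $vt = [];
    foreach (array_keys($fs + $ft) as $ch) {   // union of characters in either string
        $vs[] = $fs[$ch] ?? 0;
        $vt[] = $ft[$ch] ?? 0;
    }
    return demoCosineFromVectors($vs, $vt);
}

echo round(demoCosineChars('night', 'nacht'), 3), "\n";                          // 0.6
echo round(demoCosineFromVectors([0.1, 0.2, 0.3], [0.1, 0.3, 0.4]), 3), "\n";     // 0.996
```

For "night" and "nacht" the shared characters are n, h, and t, giving a dot product of 3 over norms of √5 each, hence 3/5.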
Unicode and Multibyte Examples
// All algorithms support Unicode characters
Levenshtein::distance("café", "caffe"); // Returns: 2
Levenshtein::distance("こんにちは", "こんにちわ"); // Returns: 1

// Emoji and complex characters
Levenshtein::distance("🚀🌟", "🚀⭐"); // Returns: 1
Hamming::distance("👍🏽", "👍🏾"); // Returns: 1

// Different scripts and languages
Jaro::distance("привет", "привет"); // Returns: 1.0
JaroWinkler::distance("عربي", "عربى"); // Returns: 0.9

// ZWJ sequences and combining marks
// The first string joins the four emoji with zero-width joiners (U+200D); the second does not
Levenshtein::distance("👨\u{200D}👩\u{200D}👧\u{200D}👦", "👨👩👧👦"); // Returns: 3
// Precomposed "é" (U+00E9) vs. "e" followed by a combining acute accent (U+0301)
Levenshtein::distance("é", "e\u{0301}"); // Returns: 2
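The last two results follow from counting code points rather than user-perceived characters. The snippet below is a small standalone check of that reasoning; `demoCodePointCount()` is a hypothetical helper and does not use the library.

```php
<?php
declare(strict_types=1);

// Illustrative only: count Unicode code points (assumes valid UTF-8 input).
function demoCodePointCount(string $s): int
{
    return count(preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY));
}

// Family emoji joined with zero-width joiners (U+200D) vs. four separate emoji.
$family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
$plain  = "\u{1F468}\u{1F469}\u{1F467}\u{1F466}";

echo demoCodePointCount($family), "\n"; // 7
echo demoCodePointCount($plain), "\n";  // 4

// Deleting the three joiners turns one string into the other, hence a distance of 3.
var_dump(str_replace("\u{200D}", '', $family) === $plain); // bool(true)

// Precomposed vs. decomposed "é": one code point vs. two.
echo demoCodePointCount("\u{00E9}"), "\n";  // 1
echo demoCodePointCount("e\u{0301}"), "\n"; // 2
```

If canonically equivalent strings should compare as equal, one option is to normalize both inputs to the same form before calling the library, for example with `Normalizer::normalize()` from PHP's intl extension.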
Custom Scoring
// Smith-Waterman with custom scoring
SmithWaterman::score("ACGT", "ACGT", match: 5, mismatch: -2, gap: -1); // Returns: 20

// Needleman-Wunsch with custom parameters
NeedlemanWunsch::score("ACGT", "ACGT", match: 3, mismatch: -1, gap: -2); // Returns: 12

// Jaro-Winkler with custom prefix scaling
JaroWinkler::distance("prefix_test", "prefix_demo", 0.2); // Custom scale factor
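To show how the `match`, `mismatch`, and `gap` parameters feed into a global alignment score, here is a minimal textbook Needleman-Wunsch sketch with a linear gap penalty. The function name `demoNeedlemanWunsch()` and the default values of 1, -1, and -1 are assumptions chosen for illustration; the library's actual defaults are not stated in this README.

```php
<?php
declare(strict_types=1);

// Illustrative only: Needleman-Wunsch global alignment score with a linear gap penalty.
// Assumes valid UTF-8 input; defaults (1, -1, -1) are an assumption, not the library's.
function demoNeedlemanWunsch(
    string $a,
    string $b,
    int $match = 1,
    int $mismatch = -1,
    int $gap = -1
): int {
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($y);

    // Row 0: aligning nothing against the first j characters of $b costs j gaps.
    $prev = [];
    for ($j = 0; $j <= $n; $j++) {
        $prev[$j] = $j * $gap;
    }

    foreach ($x as $i => $cx) {
        // Column 0: aligning the first i+1 characters of $a against nothing.
        $curr = [($i + 1) * $gap];
        foreach ($y as $j => $cy) {
            $curr[] = max(
                $prev[$j] + ($cx === $cy ? $match : $mismatch), // align the two characters
                $prev[$j + 1] + $gap,                           // gap in $b
                $curr[$j] + $gap                                // gap in $a
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// Four matches at 3 points each, no gaps or mismatches needed.
echo demoNeedlemanWunsch('ACGT', 'ACGT', match: 3, mismatch: -1, gap: -2), "\n"; // 12
echo demoNeedlemanWunsch('GATTACA', 'GCATGCU'), "\n";                            // 0
```

Smith-Waterman uses the same recurrence but clamps every cell at zero and returns the maximum cell instead of the bottom-right one, which is why it rewards the best local match rather than the whole-string alignment.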
Error Handling
try {
    // This will throw InvalidArgumentException for unequal lengths
    Hamming::distance("abc", "abcd");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(); // "Strings must be of equal length."
}

try {
    // This will throw InvalidArgumentException for invalid UTF-8
    Levenshtein::distance("valid", "\xFF\xFF");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(); // "Input strings must be valid UTF-8."
}

try {
    // This will throw InvalidArgumentException for mismatched vector lengths
    Cosine::similarityFromVectors([1, 2], [1, 2, 3]);
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(); // "Vectors must be the same length."
}
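The guard-then-compute pattern behind these exceptions can be sketched with a standalone Hamming distance over code points. `demoHamming()` is a hypothetical function written for illustration; the messages mirror the ones shown above, but the library's internal checks may be organized differently.

```php
<?php
declare(strict_types=1);

// Illustrative only: Hamming distance over code points with input guards.
function demoHamming(string $a, string $b): int
{
    // preg_split() with /u returns false when the subject is not valid UTF-8.
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($x === false || $y === false) {
        throw new InvalidArgumentException('Input strings must be valid UTF-8.');
    }
    // Length is measured in code points, not bytes.
    if (count($x) !== count($y)) {
        throw new InvalidArgumentException('Strings must be of equal length.');
    }

    $distance = 0;
    foreach ($x as $i => $ch) {
        if ($ch !== $y[$i]) {
            $distance++;
        }
    }
    return $distance;
}

echo demoHamming('1011101', '1001001'), "\n"; // 2

try {
    demoHamming('abc', 'abcd');
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n"; // Strings must be of equal length.
}
```

Validating before doing any work keeps failures explicit, so callers can catch one exception type instead of checking for sentinel return values.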