edgaras / strsim
Collection of string similarity and distance algorithms in PHP including Levenshtein, Damerau-Levenshtein, Jaro-Winkler, and more
v1.0.0
2025-05-25 13:30 UTC
Requires
- php: >=8.3.0
Requires (Dev)
- phpunit/phpunit: ^11.5
Last update: 2025-05-25 13:36:57 UTC
README
A collection of string similarity and distance algorithms implemented in PHP. This library provides standalone static methods for computing various similarity metrics, useful in natural language processing, fuzzy matching, spell checking, and bioinformatics.
Requirements
- PHP 8.3+
- Composer
Installation
- Install the library via Composer:
composer require edgaras/strsim
- Include the Composer autoloader:
require __DIR__ . '/vendor/autoload.php';
Supported Algorithms
| Class | Method | Description |
|---|---|---|
| Levenshtein | distance() | Minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other. |
| DamerauLevenshtein | distance() | Levenshtein distance that also counts a transposition of two adjacent characters as one edit. |
| Hamming | distance() | Counts differing positions (requires equal-length strings). |
| Jaro | distance() | Measures similarity based on character matches and transpositions. |
| JaroWinkler | distance() | Jaro with a boost for strings that share a common prefix. |
| LCS | length() | Returns the length of the longest common subsequence. |
| SmithWaterman | score() | Local alignment score for the best-matching subsequences. |
| NeedlemanWunsch | score() | Global alignment score across the entire strings. |
| Cosine | similarity() | Measures similarity via character frequency vectors. |
| Cosine | similarityFromVectors() | Computes cosine similarity for numeric vector inputs. |
| Jaccard | index() | Ratio of shared to total unique characters. |
| MongeElkan | similarity() | Average best-word similarity using Jaro-Winkler internally. |
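To make the first and third rows of the table concrete, here is a minimal standalone sketch of the two edit-count metrics. It is an illustration of the textbook definitions, not the library's implementation: the function names `sketchLevenshtein` and `sketchHamming` are hypothetical, and both compare raw bytes, so multibyte characters are assumed not to occur.

```php
<?php

/**
 * Hypothetical helper (not part of edgaras/strsim): textbook
 * dynamic-programming edit distance, keeping only two rows in memory.
 * Compares bytes, so a multibyte character counts as several units.
 */
function sketchLevenshtein(string $a, string $b): int
{
    $m = strlen($a);
    $n = strlen($b);

    // Row 0: turning an empty prefix of $a into the first $j bytes of $b
    // takes $j insertions.
    $prev = range(0, $n);

    for ($i = 1; $i <= $m; $i++) {
        // Column 0: turning the first $i bytes of $a into an empty string
        // takes $i deletions.
        $curr = [$i];
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution (or match when cost is 0)
            );
        }
        $prev = $curr;
    }

    return $prev[$n];
}

/**
 * Hypothetical helper: number of positions at which two equal-length
 * strings differ. Throwing on unequal lengths is an assumption of this
 * sketch; the library may handle that case differently.
 */
function sketchHamming(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Strings must have equal length.');
    }

    $diff = 0;
    for ($i = 0, $len = strlen($a); $i < $len; $i++) {
        if ($a[$i] !== $b[$i]) {
            $diff++;
        }
    }

    return $diff;
}

echo sketchLevenshtein("kitten", "sitting"), "\n"; // 3
echo sketchHamming("1011101", "1001001"), "\n";    // 2
```

The two-row layout keeps memory proportional to the length of the second string rather than to the product of both lengths, which matters when comparing long inputs.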
Usage
use Edgaras\StrSim\Levenshtein;
use Edgaras\StrSim\DamerauLevenshtein;
use Edgaras\StrSim\Hamming;
use Edgaras\StrSim\Jaro;
use Edgaras\StrSim\JaroWinkler;
use Edgaras\StrSim\LCS;
use Edgaras\StrSim\SmithWaterman;
use Edgaras\StrSim\NeedlemanWunsch;
use Edgaras\StrSim\Cosine;
use Edgaras\StrSim\Jaccard;
use Edgaras\StrSim\MongeElkan;
// Edit distance between a misspelled input and the intended word
Levenshtein::distance("kitten", "sitting");
// Edit distance that treats swapped adjacent characters as a single typo
DamerauLevenshtein::distance("abcd", "acbd");
// Bit-level error detection (equal-length only)
Hamming::distance("1011101", "1001001");
// Comparing short strings with transposition support
Jaro::distance("dixon", "dicksonx");
// Matching names with common prefixes
JaroWinkler::distance("martha", "marhta");
// Finding common subsequence in DNA fragments
LCS::length("ACCGGTCGAGTGCGCGGAAGCCGGCCGAA", "GTCGTTCGGAATGCCGTTGCTCTGTAAA");
// Local alignment score for substring match
SmithWaterman::score("ACACACTA", "AGCACACA");
// Global alignment score for complete sequence match
NeedlemanWunsch::score("GATTACA", "GCATGCU");
// Comparing character frequency in short strings
Cosine::similarity("night", "nacht");
// Comparing embedding vectors from NLP model
Cosine::similarityFromVectors([0.1, 0.2, 0.3], [0.1, 0.3, 0.4]);
// Comparing character overlap in short strings
Jaccard::index("abc", "bcd");
// Fuzzy match between two multi-word names
MongeElkan::similarity("john smith", "jon smythe");
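
As a rough guide to what the set-based and vector-based calls above measure, the following standalone sketch works through the Jaccard and Cosine examples by hand. It follows the descriptions in the algorithm table (unique characters for Jaccard, character frequency vectors for Cosine) and is not the library's code: `sketchJaccard` and `sketchCosine` are hypothetical names, both operate on bytes, and the values returned for empty strings are assumptions of this sketch.

```php
<?php

/**
 * Hypothetical helper (not part of edgaras/strsim): Jaccard index over
 * the sets of unique bytes, |A ∩ B| / |A ∪ B|.
 */
function sketchJaccard(string $a, string $b): float
{
    $setA = $a === '' ? [] : array_unique(str_split($a));
    $setB = $b === '' ? [] : array_unique(str_split($b));

    $union = count(array_unique(array_merge($setA, $setB)));
    if ($union === 0) {
        return 1.0; // Assumption: two empty strings are treated as identical.
    }

    $intersection = count(array_intersect($setA, $setB));

    return $intersection / $union;
}

/**
 * Hypothetical helper: cosine similarity between byte-frequency vectors.
 */
function sketchCosine(string $a, string $b): float
{
    // count_chars(..., 1) returns [byte value => count] for bytes that occur.
    $freqA = count_chars($a, 1);
    $freqB = count_chars($b, 1);

    $dot = 0;
    foreach ($freqA as $byte => $count) {
        $dot += $count * ($freqB[$byte] ?? 0);
    }

    $square = fn (int $c): int => $c * $c;
    $normA = sqrt(array_sum(array_map($square, $freqA)));
    $normB = sqrt(array_sum(array_map($square, $freqB)));

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // Assumption: similarity with an empty string is zero.
    }

    return $dot / ($normA * $normB);
}

// "abc" and "bcd" share {b, c} out of {a, b, c, d}.
printf("%.2f\n", sketchJaccard("abc", "bcd"));    // 0.50

// "night" and "nacht" share n, h, t; each vector has five entries of 1.
printf("%.2f\n", sketchCosine("night", "nacht")); // 0.60
```

The library's own return values may differ from these if it tokenises or normalises input differently, so treat the numbers as a check on the definitions rather than on the package.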