adrianoferreira / document-distance
A simple library for calculating the distance between two documents through the cosine similarity algorithm.
Requires (Dev)
- mikey179/vfsstream: ^1.6
- phpunit/phpunit: ^8
Last update: 2024-11-25 03:25:34 UTC
README
Document distance (or similarity) is measured by the content overlap between documents.
One of the most common algorithms for this problem is cosine similarity, a vector-based similarity measure, and it is the one this library implements.
The cosine distance between two documents is defined by the angle between their feature vectors, which in our case are word frequency vectors. The word frequency distribution of a document maps each word to the number of times it occurs in that document.
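To make the definition concrete, here is a minimal sketch of the angle computation. It is not this library's code: the function names (wordFrequencies, innerProduct, arcSize) are hypothetical, and the tokenization rule (lowercased runs of ASCII letters and digits) is an assumption chosen for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Map each word to its frequency count.
 * Assumption: a "word" is a run of ASCII letters or digits, compared case-insensitively.
 */
function wordFrequencies(string $text): array
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);

    return array_count_values($matches[0]);
}

/**
 * Inner product of two word frequency vectors.
 * Words missing from $b contribute zero.
 */
function innerProduct(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $word => $count) {
        $sum += $count * ($b[$word] ?? 0);
    }

    return $sum;
}

/**
 * Angle in radians between the two word frequency vectors:
 * 0 for identical word distributions, pi/2 when no word is shared.
 */
function arcSize(string $first, string $second): float
{
    $a = wordFrequencies($first);
    $b = wordFrequencies($second);

    $norms = sqrt(innerProduct($a, $a) * innerProduct($b, $b));
    if ($norms == 0.0) {
        throw new InvalidArgumentException('Both documents must contain at least one word.');
    }

    // Clamp to 1.0 so floating-point rounding cannot push the value outside acos's domain.
    $cosine = min(1.0, innerProduct($a, $b) / $norms);

    return acos($cosine);
}
```

Under these assumptions, the strings 'test 123 456' and 'test 678 000' share one word out of three each, so the cosine is 1/3 and the angle is roughly 1.23 radians. How the library converts an angle into a percentage is not covered by this sketch.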
Installation
It's recommended that you use Composer to install this library.
$ composer require adrianoferreira/document-distance:dev-master
Usage
Calculating similarity percentage between two remote files:
echo ( new \AdrianoFerreira\DD\File( 'http://test.com/test.txt', 'http://test.com/test2.txt' ) )->getPercent();
Calculating arc size between two local files:
echo ( new \AdrianoFerreira\DD\File( __DIR__ . '/test.txt', __DIR__ . '/test2.txt' ) )->getArcSize();
Calculating similarity percentage between two arbitrary strings:
echo ( new \AdrianoFerreira\DD\Text( 'test 123 456', 'test 678 000' ) )->getPercent();
Calculating arc size between two arbitrary strings:
echo ( new \AdrianoFerreira\DD\Text( 'test 123 456', 'test 678 000' ) )->getArcSize();
References
This implementation is based on an MIT course document: https://courses.csail.mit.edu/6.006/fall11/rec/rec02.pdf