adrianoferreira/document-distance

A simple library for calculating the distance between two documents through the cosine similarity algorithm.

dev-master 2020-01-24 15:47 UTC

This package is auto-updated.

Last update: 2024-04-25 02:23:30 UTC


README


Document distance (or similarity) measures how much content two documents share.

One of the most common algorithms for this problem is cosine similarity, a vector-based similarity measure, and it is what this library implements.

The cosine distance between two documents is the angle between their feature vectors, which in our case are word frequency vectors. A document's word frequency distribution maps each word to the number of times it occurs in that document.
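The definition above can be written out as a formula. This follows the usual statement of cosine distance; the symbols D_1 and D_2 (the word frequency vectors of the two documents) and theta (the angle between them) are names chosen here for illustration.

```latex
\cos\theta = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert},
\qquad
d(D_1, D_2) = \theta = \arccos\!\left( \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert} \right)
```

For example, assuming whitespace-separated tokens, the strings 'test 123 456' and 'test 678 000' share only the word 'test', so D_1 . D_2 = 1 and each vector has length sqrt(3). That gives cos(theta) = 1/3 and theta of roughly 1.231 radians. Identical documents give theta = 0, and documents with no words in common give theta = pi/2.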

Cosine Similarity
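To make the idea concrete, here is a minimal standalone sketch of cosine similarity over word frequency vectors. It is not this library's source code: the function names (wordFrequencies, dotProduct, cosineSimilarity, arcSize) and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: map each word in $text to its occurrence count.
// Assumption: words are lowercase runs of ASCII letters and digits.
function wordFrequencies(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Dot product of two sparse frequency vectors (word => count).
// Only words present in both vectors contribute.
function dotProduct(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $word => $count) {
        if (isset($b[$word])) {
            $sum += $count * $b[$word];
        }
    }
    return $sum;
}

// Cosine of the angle between the two documents' frequency vectors.
// Returns 0.0 when either document has no words (a choice made here).
function cosineSimilarity(string $x, string $y): float
{
    $a = wordFrequencies($x);
    $b = wordFrequencies($y);
    $norm = sqrt(dotProduct($a, $a) * dotProduct($b, $b));
    if ($norm == 0.0) {
        return 0.0;
    }
    // Clamp to 1.0 so floating-point rounding cannot push acos() out of range.
    return min(1.0, dotProduct($a, $b) / $norm);
}

// Angle in radians: 0 for identical word distributions, pi/2 for no overlap.
function arcSize(string $x, string $y): float
{
    return acos(cosineSimilarity($x, $y));
}
```

Storing each vector as a word-to-count array keeps it sparse, so the dot product only visits words that actually occur. How the library itself tokenizes text or converts the angle into a percentage is not covered by this sketch.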

Installation

It's recommended that you use Composer to install this library.

$ composer require adrianoferreira/document-distance:dev-master

Usage

Calculating similarity percentage between two remote files:

echo ( new \AdrianoFerreira\DD\File( 'http://test.com/test.txt', 'http://test.com/test2.txt' ) )->getPercent();

Calculating arc size between two local files:

echo ( new \AdrianoFerreira\DD\File( __DIR__ . '/test.txt', __DIR__ . '/test2.txt' ) )->getArcSize();

Calculating similarity percentage between two arbitrary strings:

echo ( new \AdrianoFerreira\DD\Text( 'test 123 456', 'test 678 000' ) )->getPercent();

Calculating arc size between arbitrary strings:

echo ( new \AdrianoFerreira\DD\Text( 'test 123 456', 'test 678 000' ) )->getArcSize();

References

This implementation is based on an MIT document: https://courses.csail.mit.edu/6.006/fall11/rec/rec02.pdf