tabuna/similar

Group similar strings by shared topic within a set of sentences.

2.2.0 2022-02-23 22:18 UTC

Last update: 2024-10-12 23:50:59 UTC


This is a simple PHP library for identifying similar strings without machine learning. Given a set of sentences, it returns groups of sentences that share a topic. For example, it can combine news headlines from different publications, as Google News does.

Installation

Run this at the command line:

$ composer require esplora/similar

Usage

Create a Similar object, passing it a closure that checks whether two strings are similar:

use Esplora\Similar\Similar;

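$similar = new Similar(function (string $a, string $b) {
    // similar_text() stores the similarity, as a percentage, in $copy
    similar_text($a, $b, $copy);

    // Treat the strings as similar when they match by more than 51%
    return 51 < $copy;
});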

Note that you are not required to use similar_text. Any other comparison works as well, for example one based on soundex.
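As one possible alternative, here is a minimal sketch of a closure built on PHP's built-in levenshtein function. The variable name $isSimilar, the lowercasing step, and the 0.5 threshold are assumptions made for illustration and are not taken from the library. Before PHP 8.0, levenshtein rejected strings longer than 255 characters, so this sketch suits short inputs such as headlines.

```php
<?php

// Hypothetical alternative to similar_text: a normalized Levenshtein distance.
// The 0.5 threshold is an arbitrary assumption; tune it for your data.
$isSimilar = function (string $a, string $b): bool {
    $longest = max(strlen($a), strlen($b));

    if ($longest === 0) {
        return true; // two empty strings are trivially identical
    }

    // levenshtein() counts the single-byte edits needed to turn $a into $b.
    $distance = levenshtein(strtolower($a), strtolower($b));

    // Similar when fewer than half of the characters have to change.
    return ($distance / $longest) < 0.5;
};

// The closure is then passed to the constructor in place of the inline one:
// $similar = new Similar($isSimilar);
```

Dividing by the length of the longer string keeps the threshold independent of input length, which a raw edit distance would not.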

Then call the findOut method, passing it a one-dimensional array of strings:

$similar->findOut([
    'Elon Musk gets mixed COVID-19 test results as SpaceX launches astronauts to the ISS',
    'Elon Musk may have Covid-19, should quarantine during SpaceX astronaut launch Sunday',

    // Unrelated headline, left out of the result
    'Can Trump win with ‘fantasy’ electors bid? State GOP says no',
]);

The result contains a single group with the two related headlines:

'Elon Musk gets mixed COVID-19 test results as SpaceX launches astronauts to the ISS',
'Elon Musk may have Covid-19, should quarantine during SpaceX astronaut launch Sunday',

Keys

The keys of the input array are preserved in the result, so you can use them for additional processing:

$similar->findOut([
  'kos' => "Trump acknowledges Biden's win in latest tweet",
  'foo' => 'Elon Musk gets mixed COVID-19 test results as SpaceX launches astronauts to the ISS',
  'baz' => 'Trump says Biden won but again refuses to concede',
  'bar' => 'Elon Musk may have Covid-19, should quarantine during SpaceX astronaut launch Sunday',
]);

The result will be two groups:

[
  'foo' => 'Elon Musk gets mixed COVID-19 test results as SpaceX launches astronauts to the ISS',
  'bar' => 'Elon Musk may have Covid-19, should quarantine during SpaceX astronaut launch Sunday',
],
[
  'baz' => 'Trump says Biden won but again refuses to concede',
  'kos' => "Trump acknowledges Biden's win in latest tweet",
],
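To make the behaviour described above more concrete (related strings end up in one group, original keys are kept, and strings without a partner are dropped), here is a rough sketch of a greedy grouping routine. It is not the library's implementation. The function name groupSimilar, the first-member comparison strategy, and the toy closure $sameFirstWord are all assumptions made for illustration.

```php
<?php

/**
 * Greedy grouping sketch (illustrative only, not the library's code).
 * Each item joins the first group whose first member it resembles.
 */
function groupSimilar(array $items, callable $isSimilar): array
{
    $groups = [];

    foreach ($items as $key => $item) {
        foreach ($groups as $index => $group) {
            // Compare against the first member of each existing group.
            if ($isSimilar((string) reset($group), (string) $item)) {
                $groups[$index][$key] = $item; // keep the original key
                continue 2;
            }
        }

        // No matching group: start a new one.
        $groups[] = [$key => $item];
    }

    // Drop items that found no partner.
    return array_values(array_filter($groups, function (array $group): bool {
        return count($group) > 1;
    }));
}

// Toy comparison for the demo: strings match when their first word is equal.
$sameFirstWord = function (string $a, string $b): bool {
    return explode(' ', $a)[0] === explode(' ', $b)[0];
};

$groups = groupSimilar([
    'x' => 'apple pie',
    'y' => 'banana bread',
    'z' => 'apple tart',
], $sameFirstWord);

// $groups === [['x' => 'apple pie', 'z' => 'apple tart']]
```

Comparing only against the first member of a group keeps the sketch short, but it makes the outcome depend on input order. The real package may resolve groups differently.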

Objects

You can also pass objects to evaluate more complex conditions. Each object must be castable to a string via its __toString() method.

$similar->findOut([
    new FixtureStingObject('Lorem ipsum dolor sit amet, consectetur adipiscing elit.'),
]);
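As a sketch of what such an object could look like, the hypothetical Headline class below carries extra data (a source name) alongside the text and exposes the text through __toString(). The class, its properties, and the $isSimilar variable are assumptions for illustration. Whether the library hands the closure the object itself or its string form is not stated here, so the sketch casts explicitly.

```php
<?php

// Hypothetical value object, not part of the library.
final class Headline
{
    /** @var string */
    public $title;

    /** @var string */
    public $source;

    public function __construct(string $title, string $source)
    {
        $this->title = $title;
        $this->source = $source;
    }

    // Only the title takes part in the string comparison.
    public function __toString(): string
    {
        return $this->title;
    }
}

// Same comparison as in the usage example: more than 51% similarity.
$isSimilar = function (string $a, string $b): bool {
    similar_text($a, $b, $percent);

    return $percent > 51;
};

$first = new Headline('Trump says Biden won', 'Publication A');
$second = new Headline('Trump says Biden won', 'Publication B');

// Identical titles are 100% similar, so this is true
// even though the sources differ.
$match = $isSimilar((string) $first, (string) $second);
```

Keeping the extra fields on the object means that, after grouping, each entry still knows where it came from.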

License

The MIT License (MIT). Please see License File for more information.