emilklindt/laravel-marker-clusterer

Display a large number of markers with server-side clustering in Laravel

v0.2.0 2021-08-31 16:45 UTC

This package is auto-updated.

Last update: 2024-04-07 23:05:44 UTC


README

Unit Test status Packagist Version (including pre-releases) Packagist PHP Version Support MIT License

The emilklindt/laravel-marker-clusterer package allows you to cluster markers in Laravel, before sending them to the client-side.

When serving large datasets it may not be feasible or practical to send all datapoints to the client side. This may be due to the large payload necessary to transfer all datapoints individually, or to avoid undue processing on the client. This project aims to facilitate these cases, with server-side marker clustering for Laravel.


Table of Contents
  1. Installation
  2. Usage
  3. Clusterers
  4. Testing
  5. Credits
  6. License

Installation

You can install the package via composer:

composer require emilklindt/laravel-marker-clusterer

The package will automatically register itself.

Publish config file

You can optionally publish the config file with:

php artisan vendor:publish --provider="EmilKlindt\MarkerClusterer\MarkerClustererServiceProvider" --tag="config"

This is the contents of the file that will be published at config/marker-clusterer.php:

Contents of the configuration file
return [

    /*
    |--------------------------------------------------------------------------
    | Default Clusterer
    |--------------------------------------------------------------------------
    |
    | The default clustering method used when using the DefaultClusterer class
    | included in this project. This allows for easily swapping of the clusterer
    | used throughout a project, through only editing the config file.
    |
    */

    'default_clusterer' => 'density-based-spatial-clusterer',

    /*
    |--------------------------------------------------------------------------
    | Default Distance Formula
    |--------------------------------------------------------------------------
    |
    | The default formula for calculating distance from one coordinate to
    | another. Possible values are constants of the DistanceFormula enum.
    |
    */

    'default_distance_formula' => \EmilKlindt\MarkerClusterer\Enums\DistanceFormula::MANHATTAN,

    /*
    |--------------------------------------------------------------------------
    | K-means Clustering
    |--------------------------------------------------------------------------
    |
    | K-means algorithm identifies k number of centroids, and then allocates
    | every data point to the nearest cluster.
    |
    */

    'k_means' => [

        /*
        |--------------------------------------------------------------------------
        | Default Maximum Iterations
        |--------------------------------------------------------------------------
        |
        | The default number of maximum iterations of clustering, for example used
        | in K-means clustering, where clustering is repeated untill either reaching
        | convergence (no further changes) or the maximum number of iterations.
        |
        */

        'default_maximum_iterations' => 10,

        /*
        |--------------------------------------------------------------------------
        | Default Maximum Convergence Distance
        |--------------------------------------------------------------------------
        |
        | The maximum distance between iterations to count a cluster as converged,
        | meaning that no further iteration is necessary. A higher value can provide
        | better performance, due to the need of doing less iterations. A lower value
        | will ensure that a cluster has actually converged.
        |
        */

        'default_convergence_maximum' => 100.0,

        /*
        |--------------------------------------------------------------------------
        | Default Maximum Samples
        |--------------------------------------------------------------------------
        |
        | The default number of maximum samples of clustering, for example used
        | in K-means clustering, where the specified number of samples are run
        | to achieve the lowest variance between the centroids.
        |
        | This differs from maximum iterations in that, iterations are executed
        | on the same set of initially random centroids. Each sample instantiates
        | a new set of centroids to iteratively optimize, untill maximum number
        | of iterations or convergence is reached.
        |
        */

        'default_maximum_samples' => 10,
    ],

    /*
    |--------------------------------------------------------------------------
    | Density Based Spatial Clusterer (DBSCAN)
    |--------------------------------------------------------------------------
    |
    | Finds core samples of high density and expands clusters from them.
    |
    */

    'dbscan' => [

        /*
        |--------------------------------------------------------------------------
        | Default Include Noise
        |--------------------------------------------------------------------------
        |
        | Whether to include markers not meeting the threshold of minSamples.
        | If true, markers not within epsilon distance of at least minSamples,
        | will be included anyways, in a solo cluster for that given point.
        */

        'default_include_noise' => true,

        /*
        |--------------------------------------------------------------------------
        | Default Use Geohash Neighboring
        |--------------------------------------------------------------------------
        |
        | When response time is critical and precision is not, it may sometimes
        | be beneficial to use geohashing for neighbor searching only. A geohash
        | is calculated for every marker when added to the clusterer. This is
        | used to limit the scope of distance calculations to only points that
        | fall within neighboring geohashes.
        |
        | Enabling this setting will remove the last step, which is calculating
        | exact distance to each marker in the neighboring geohashes, and then
        | comparing it against the epsilon value.
        |
        | The geohash precision is based on the epsilon value, so by specifycing
        | a larger epsilon value, more markers will be considered neighbors etc.
        */

        'default_use_geohash_neighboring' => false,
    ]
];

Usage

In order to manage which clustering algorithm is used throughout your applicaiton, you are encouraged to use the DefaultClusterer class. This uses the default_clusterer value in the marker-clusterer config, which can be easily changed by publishing the configuration.

$clusters = DefaultClusterer::cluster($markers, $config);

Adding your markers

The clusterer takes a collection of markers. However, these markers have to implement the Clusterable interface, which has a single method for retrieving the marker coordinate.

An example implementation in your Eloquent model may look like this:

use League\Geotools\Coordinate\Coordinate;
use EmilKlindt\MarkerClusterer\Interfaces\Clusterable;

class Car extends Model implements Clusterable
{
    public function getClusterableCoordinate(): Coordinate
    {
        return new Coordinate([
            $this->lat,
            $this->lng
        ]);
    }
}

Clustering your markers

With the interface implemented, a collection of cars may be clustered like so:

$cars = Car::get();

$config = new Config([
    'epsilon' => 10.5,
    'minSamples' => 2
]);

$clusters = DefaultClusterer::cluster($cars, $config);

See the clusters section below for the different clustering methods and their parameters.

Clusterers

Contributions or feature requests are welcomed. Feel free to create a pull request or post your feature request in the ideas discussion topic.

DBSCAN vs. K-means Credit: NSHipster/DBSCAN repository

K-means clustering

Perhaps the most well known algorithm in clustering, k-means clustering will partition n elements into k clusters.

As shown illustratively above, K-means clusters markers into the cluster with the nearest mean, meaning the least distance from marker to the center of the cluster. K-means requires a k value, which is the number of clusters desired. There are many different ways of choosing this value, depending on your dataset, however this is not yet adressed in this repository.

$config = new Config([
    'k' => 3
]);

$clusters = KMeansClusterer::cluster($cars, $config);

Configuration parameters applicable to the KMeansClusterer:

  • k, integer (required)
    The desired number of clusters, as well as the initial number of centroids (cluster points) to initialize randomly. If k is larger than the number of markers, all markers will be clustered individually.

  • iterations, integer (optional)
    The maximum number of iterations to recalculate the centroid (mean position of cluster points), and assign markers to the nearest cluster. After clusters have been randomly initialized, markers are assigned nearest cluster, after which the cluster centroid is recalculated, and the process is repeated.

  • convergenceMaximum, float (optional)
    The maximum distance between a cluster's centroids in two consecutive iterations, in order to declare convergence, which along with iterations is a stopping criterion for generating a sample of clusters. This value may need tuning when changing the distanceFormula, as each distance formula has varied levels of accuracy.

  • samples, int (optional)
    Number of times to initialize random centroids for the clusters, and perform iterations number of optimization iterations. After samples number of clustered samples has been found, the cluster sample with the lowest variance is choosen (illustration, by Arif R, Medium). A higher value is more likely to yield an optimal result, at the cost of performance.

  • distanceFormula, string (optional)
    The formula used for calculating distance between points. Distances are used to determine nearest cluster to marker. Valid values are constants of the DistanceFormula enum. Haversine distance is preferable, if precision is important to the clustering task.

All properties marked optional has default values, specified in the configuration. See the publishing config file section above to manipulate these.

Density-Based Spatial clustering

Algorithm used is Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

This clustering method overcomes a common issue with K-means clustering, namely the need to specify the number of clusters. This parameter may be dependent on the data set – more specifically, the density of the dataset. The DBSCAN algorithm takes an epsilon (maximum distance between two points to be grouped together) and minSamples (minimum amount of points to be grouped, to form a cluster).

The "Application with Noise" portion means, that the clustering method is able to handle noise. Meaning, points not immediately related to a cluster, may be either discarded or returned as single points, depending on the includeNoise configuration.

$config = new Config([
    'epsilon' => 10.5,
    'minSamples' => 5
]);

$clusters = DensityBasedSpatialClusterer::cluster($cars, $config);

Configuration parameters applicable to the DensityBasedSpatialClusterer:

  • epsilon, float (required)
    The maximum distance between two markers for one to be considered as in the neighborhood of the other. This is not to be considered as the maximum cluster size, as by doing multiple steps the cluster may become much larger than epsilon.

  • minSamples, integer (required)
    The number of markers in a neighborhood for a point to be considered as a core point to a new cluster. If a point p has over minSamples neighbors within epsilon distance, it is eligible to be core point for a new cluster. The algorithm creates a new cluster from this point, and adds all points within epsilon distance repeatedly, untill no more points are within reach.

  • includeNoise, boolean (optional)
    Whether to include markers not meeting the threshold of minSamples. If true, all markers not within epsilon distance of a cluster, will be included as individual clusters.

  • distanceFormula, string (optional)
    The formula used for calculating distance between points. Distances are used to determine nearest cluster to marker. Valid values are constants of the DistanceFormula enum. Haversine distance is preferable, if precision is important to the clustering task.

All properties marked optional has default values, specified in the configuration. See the publishing config file section above to manipulate these.

Testing

composer test

Credits

Thanks to Spatie for providing inspiration and useful resources for package development.

License

The MIT License (MIT). Please see License File for more information.