cylab / php-spark
A wrapper around arrays that mimics the MapReduce methods of Apache Spark
Requires (Dev)
- phpunit/phpunit: ^8.3
- squizlabs/php_codesniffer: ^3.4
This package is auto-updated.
Last update: 2024-12-20 22:52:55 UTC
README
php-spark is a wrapper around arrays that mimics the MapReduce API of Apache Spark.
use Cylab\Spark\Dataset;
$data = new Dataset([1, 2, 3, 4]);
$result = $data
    ->map(function ($v) {
        return 2 * $v;
    })
    ->reduce(function ($v, $agg) {
        return $agg + $v;
    });
$result == 20;
php-spark is NOT a PHP driver for Apache Spark (though I wish one existed).
Dataset
A dataset is an immutable array of data. It is the equivalent of a Spark RDD (Resilient Distributed Dataset).
use Cylab\Spark\Dataset;
$d = new Dataset([1, 2, 3, 4]);
var_dump($d->collect());
Transformations
Transformations return another dataset.
Method | Description |
---|---|
map(func) | Return a new dataset formed by passing each element of the source through a function func. |
distinct() | Return a new dataset that contains the distinct elements of the source dataset. |
reduceByKey(func) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. |
groupByKey() | When called on a dataset of (K, V) pairs, returns a dataset of (K, V[]) pairs. |
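The semantics of distinct() and groupByKey() described in the table can be illustrated on plain PHP arrays. This is only a sketch, not php-spark's implementation: the helper names `sketchDistinct` and `sketchGroupByKey` are hypothetical, and pairs are modelled as `[key, value]` arrays instead of Tuple objects.

```php
<?php
// Plain-PHP sketch of two transformations. Assumption: pairs are modelled
// as [key, value] arrays here, whereas php-spark itself uses Tuple objects.

/** Hypothetical helper: keep the first occurrence of each value. */
function sketchDistinct(array $items): array
{
    // array_unique preserves the original keys, so reindex to get a list.
    return array_values(array_unique($items));
}

/** Hypothetical helper: turn (K, V) pairs into (K, V[]) pairs. */
function sketchGroupByKey(array $pairs): array
{
    $groups = [];
    foreach ($pairs as [$key, $value]) {
        $groups[$key][] = $value;
    }

    // Convert the key => values map back into a list of pairs.
    $result = [];
    foreach ($groups as $key => $values) {
        $result[] = [$key, $values];
    }
    return $result;
}

var_dump(sketchDistinct([1, 2, 2, 3, 1]));                        // [1, 2, 3]
var_dump(sketchGroupByKey([["a", 1], ["b", 2], ["a", 3]]));       // [["a", [1, 3]], ["b", [2]]]
```

Both helpers return a new array and leave their input untouched, which mirrors the rule that a transformation yields another dataset.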
Actions
Actions return a result that is not a dataset, such as a single value or a plain array.
Method | Description |
---|---|
reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). |
collect() | Return all the elements of the dataset as an array. |
count() | Return the number of elements in the dataset. |
first() | Return the first element of the dataset (similar to take(1)). |
take(n) | Return an array with the first n elements of the dataset. |
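As a rough guide to what count(), first() and take(n) return, here are the equivalent operations on a plain PHP array. This is a sketch under the assumption that the dataset behaves like an ordered list. It does not call php-spark.

```php
<?php
// Plain-PHP equivalents of three actions, applied to an ordered list.
$items = [10, 20, 30, 40];

$count = count($items);              // number of elements
$take  = array_slice($items, 0, 2);  // first n = 2 elements, as an array
$first = array_slice($items, 0, 1)[0]; // first element, i.e. the single item of take(1)

var_dump($count, $first, $take);
```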
Map
Map applies the provided function to all elements in the dataset and returns a new dataset containing the result of the map operation.
$d2 = $d->map(function ($v) { return 2 * $v; });
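For comparison, the same operation on a plain array can be written with PHP's built-in `array_map`. This sketch does not use php-spark. It only illustrates that mapping produces a new collection and leaves the source unchanged, in line with the dataset being immutable.

```php
<?php
// Plain-PHP counterpart of the map example above.
$source = [1, 2, 3, 4];

// array_map returns a new array; $source is not modified.
$doubled = array_map(function ($v) {
    return 2 * $v;
}, $source);

var_dump($doubled); // [2, 4, 6, 8]
var_dump($source);  // [1, 2, 3, 4]
```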
Reduce
The reduce function you provide must take two parameters, in this order: the current value and the aggregated value.
use Cylab\Spark\Dataset;
$d = new Dataset([1, 2, 3, 4]);
$result = $d->reduce(function ($v, $agg) {
    return $agg + $v;
});
$result == 10;
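The parameter order matters if you are used to PHP's built-in `array_reduce`, whose callback receives the carried value first and the current item second. The sketch below folds a plain array with a callback in the (current value, aggregated value) order used in this README. The helper `sketchReduce` is hypothetical, and seeding the aggregate with the first element is an assumption, not documented php-spark behaviour.

```php
<?php
/**
 * Hypothetical helper, not part of php-spark: folds a list with a callback
 * that receives (current value, aggregated value), in that order.
 * Assumption: the first element seeds the aggregate.
 */
function sketchReduce(array $items, callable $func)
{
    if ($items === []) {
        throw new InvalidArgumentException("cannot reduce an empty list");
    }

    // $items is a by-value copy, so shifting does not affect the caller.
    $agg = array_shift($items);
    foreach ($items as $v) {
        $agg = $func($v, $agg);
    }
    return $agg;
}

$sum = sketchReduce([1, 2, 3, 4], function ($v, $agg) {
    return $agg + $v;
});
$max = sketchReduce([1, 2, 3, 4], function ($v, $agg) {
    return max($v, $agg);
});

var_dump($sum, $max); // 10 and 4
```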
Tuple
Some methods expect the dataset to contain a list of <key, value> tuples.
use Cylab\Spark\Dataset;
use Cylab\Spark\Tuple;
$strings = ["foe", "bar", "foe"];
$d = new Dataset($strings);
$d2 = $d->map(function($s) { return new Tuple($s, 1); });
ReduceByKey
For this method to work, the input dataset must be a list of <key, value> tuples. The reduce function is then applied to all elements with the same key.
$counts = $d2->reduceByKey(function ($count, $sum) {
    return $sum + $count;
});
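To make the per-key aggregation concrete, here is the same word count on plain PHP arrays. It is a sketch only: `sketchReduceByKey` is a hypothetical helper, pairs are modelled as `[key, value]` arrays instead of Tuple objects, and using the first value seen for a key as its initial aggregate is an assumption.

```php
<?php
/**
 * Hypothetical helper, not part of php-spark: merges the values of
 * [key, value] pairs that share a key. $func receives
 * (current value, aggregated value), as in the Reduce section.
 */
function sketchReduceByKey(array $pairs, callable $func): array
{
    $aggregates = [];
    foreach ($pairs as [$key, $value]) {
        // Assumption: the first value seen for a key seeds its aggregate.
        $aggregates[$key] = array_key_exists($key, $aggregates)
            ? $func($value, $aggregates[$key])
            : $value;
    }

    // Convert the key => aggregate map back into a list of pairs.
    $result = [];
    foreach ($aggregates as $key => $agg) {
        $result[] = [$key, $agg];
    }
    return $result;
}

// Same input as the Tuple example: one [word, 1] pair per string.
$pairs = array_map(function ($s) {
    return [$s, 1];
}, ["foe", "bar", "foe"]);

$wordCounts = sketchReduceByKey($pairs, function ($count, $sum) {
    return $sum + $count;
});

var_dump($wordCounts); // [["foe", 2], ["bar", 1]]
```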
First
Get the first element of a dataset.
// Tuple<"foe", 2>
var_dump($counts->first());
Future work
- flatMap
- sample
- union
- intersection
- aggregateByKey
- sortByKey
- join
- cartesian
- takeSample
- takeOrdered
- countByKey
- saveAsObjectFile
- saveAsJsonFile