cylab/php-spark

A wrapper around arrays that mimics the MapReduce methods of Apache Spark

0.0.3 2019-08-20 06:12 UTC

This package is auto-updated.

Last update: 2024-04-20 21:16:07 UTC


README


php-spark is a wrapper around arrays that mimics the MapReduce API of Apache Spark.

use Cylab\Spark\Dataset;

$data = new Dataset([1, 2, 3, 4]);
$result = $data
	// double each element: [2, 4, 6, 8]
	->map(function ($v) {
		return 2 * $v;
	})
	// sum the doubled elements
	->reduce(function ($v, $agg) {
		return $agg + $v;
	});

$result == 20;

php-spark is NOT a PHP driver for Apache Spark (although I wish one existed).

Dataset

A Dataset wraps an immutable array of data. It is php-spark's equivalent of a Spark RDD (Resilient Distributed Dataset).

use Cylab\Spark\Dataset;

$d = new Dataset([1, 2, 3, 4]);
var_dump($d->collect());
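To make "immutable" concrete, here is a minimal sketch of how such a wrapper could behave. The class name MiniDataset is hypothetical and this is not the source of Cylab\Spark\Dataset; it only assumes what the README states, namely that map returns another dataset and collect returns a plain array.

```php
<?php
// Hypothetical stand-in for illustration; NOT the Cylab\Spark\Dataset source.
final class MiniDataset
{
    /** @var array */
    private $items;

    public function __construct(array $items)
    {
        // array_values() drops the caller's keys so the wrapper holds a plain list.
        $this->items = array_values($items);
    }

    // Transformation: builds a new MiniDataset and leaves $this untouched.
    public function map(callable $func): MiniDataset
    {
        return new MiniDataset(array_map($func, $this->items));
    }

    // Action: returns the elements as a regular PHP array (copied by value).
    public function collect(): array
    {
        return $this->items;
    }
}

$d = new MiniDataset([1, 2, 3, 4]);
$doubled = $d->map(function ($v) { return 2 * $v; });

var_dump($d->collect() === [1, 2, 3, 4]);       // bool(true): the source is unchanged
var_dump($doubled->collect() === [2, 4, 6, 8]); // bool(true)
```

Because PHP arrays are copied by value, code that receives the result of collect() cannot modify the data held inside the wrapper.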

Transformations

Transformations return another dataset.

| Method | Description |
|---|---|
| map(func) | Return a new dataset formed by passing each element of the source through a function func. |
| distinct() | Return a new dataset that contains the distinct elements of the source dataset. |
| reduceByKey(func) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. |
| groupByKey() | When called on a dataset of (K, V) pairs, returns a dataset of (K, V[]) pairs. |
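The semantics of distinct() and groupByKey() can be sketched with plain PHP arrays. This is only an illustration of what the descriptions above say, not the library's implementation: pairs are modelled as [key, value] arrays rather than Tuple objects, and the element order the library returns is not stated in this README.

```php
<?php
// Plain-PHP sketch of the semantics only; [key, value] arrays stand in for
// the library's Tuple class (an assumption made for illustration).

// distinct(): keep one copy of each element (here, the first occurrence).
$values = [3, 1, 3, 2, 1];
$distinct = array_values(array_unique($values));
// $distinct is [3, 1, 2]

// groupByKey(): collect every value that shares a key into one list.
$pairs = [['a', 1], ['b', 2], ['a', 3]];
$grouped = [];
foreach ($pairs as [$key, $value]) {
    $grouped[$key][] = $value;
}
// $grouped is ['a' => [1, 3], 'b' => [2]]
```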

Actions

Actions return a result that is not a dataset, such as a single value or a plain array.

| Method | Description |
|---|---|
| reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). |
| collect() | Return all the elements of the dataset as an array. |
| count() | Return the number of elements in the dataset. |
| first() | Return the first element of the dataset (similar to take(1)). |
| take(n) | Return an array with the first n elements of the dataset. |
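According to the table, first() returns a single element while take(n) returns an array, so take(1) yields a one-element array rather than a bare value. The plain-PHP sketch below mirrors those return shapes with built-in array functions; it does not call the library.

```php
<?php
// Plain-PHP equivalents of the return shapes described above (illustration only).
$items = [10, 20, 30, 40];

$count = count($items);                 // like count(): 4
$firstTwo = array_slice($items, 0, 2);  // like take(2): [10, 20]
$takeOne = array_slice($items, 0, 1);   // like take(1): [10], still an array
$first = $takeOne[0];                   // like first(): 10, a single element
```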

Map

Map applies the provided function to every element of the dataset and returns a new dataset containing the results.

$d2 = $d->map(function ($v) { return 2 * $v; });

Reduce

The reduce function you provide must take two parameters, in this order: the current value and the aggregated value.

use Cylab\Spark\Dataset;

$d = new Dataset([1, 2, 3, 4]);
$result = $d->reduce(function ($v, $agg) {
	return $agg + $v;
});

$result == 10;
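The parameter order matters if you are used to PHP's built-in array_reduce(), whose callback receives the accumulator first and the current element second, the reverse of the callback shown above. The sketch below uses only array_reduce(). The order makes no difference for addition, but it changes the result of a non-commutative operation such as subtraction.

```php
<?php
$values = [1, 2, 3, 4];

// array_reduce() passes the accumulator FIRST and the current element second.
$sum = array_reduce($values, function ($agg, $v) {
    return $agg + $v;
}, 0);
// $sum is 10

// Correctly named parameters: ((((0 - 1) - 2) - 3) - 4)
$accFirst = array_reduce($values, function ($agg, $v) {
    return $agg - $v;
}, 0);
// $accFirst is -10

// Parameter names swapped by mistake: $v now holds the accumulator,
// so each step computes element - accumulator instead.
$swapped = array_reduce($values, function ($v, $agg) {
    return $agg - $v;
}, 0);
// $swapped is 2
```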

Tuple

Some methods expect the dataset to contain a list of <key, value> tuples.

use Cylab\Spark\Dataset;
use Cylab\Spark\Tuple;

$strings = ["foe", "bar", "foe"];
$d = new Dataset($strings);
$d2 = $d->map(function($s) { return new Tuple($s, 1); });

ReduceByKey

For this method to work, the input dataset must be a list of <key, value> tuples. The reduce function is then applied to the values of all elements that share the same key.

$counts = $d2->reduceByKey(function ($count, $sum) {
	return $sum + $count;
});
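To show what "applied to the values of elements with the same key" means in practice, here is a plain-PHP sketch of the word count above. The function reduceByKeySketch is hypothetical and is not the library's implementation. It models tuples as [key, value] arrays and calls the callback with the current value first and the aggregate second, matching the callback shown in this README.

```php
<?php
// Plain-PHP sketch of reduceByKey semantics; [key, value] arrays stand in for
// Tuple objects (an assumption made to keep the example self-contained).
function reduceByKeySketch(array $pairs, callable $func): array
{
    $result = [];
    foreach ($pairs as [$key, $value]) {
        // The first value seen for a key becomes that key's initial aggregate.
        $result[$key] = array_key_exists($key, $result)
            ? $func($value, $result[$key])
            : $value;
    }
    return $result;
}

$pairs = [['foe', 1], ['bar', 1], ['foe', 1]];
$counts = reduceByKeySketch($pairs, function ($count, $sum) {
    return $sum + $count;
});
// $counts is ['foe' => 2, 'bar' => 1]
```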

First

Get the first element of a dataset.

// Tuple<"foe", 2>
var_dump($counts->first());

Future work

  • flatMap
  • sample
  • union
  • intersection
  • aggregateByKey
  • sortByKey
  • join
  • cartesian
  • takeSample
  • takeOrdered
  • countByKey
  • saveAsObjectFile
  • saveAsJsonFile