ganglio / pds
Probabilistic Data Structures to efficiently analyze and mine big datasets
Requires
- php: >=5.4.0
Requires (Dev)
- phpunit/phpunit: 4.8.*
README
Probabilistic Data Structures to efficiently analyze and mine big datasets
This package contains a collection of data structures and tools to analyze big amounts of data in a memory efficient way.
Table Of Content
Installation
Install via Composer (make sure you have composer in your path or in your project).
Put the following in your package.json:
{ "require": { "ganglio/PDS": "*" } }
and then run composer install
or just run
composer require ganglio/PDS
Namespaces
A number of namespaces are defined in the library.
- \ganglio\PDS\Bloom
- \ganglio\PDS\Estimators
- \ganglio\PDS\Hash
- \ganglio\PDS\Storage
Interfaces
Estimator
This interface is the basis for cardinality estimators. It defines two methods:
add($key)
- adds a key to the estimatorcount()
- returns the number of keys added to the estimator
Depending on the implementation the actual class might return an exact estimation, like the Exact
class, or an approximation like the HyperLogLog
class.
Hash
This interface is the basis for the various hashing classes offered by the package. It defines one method and a constant:
hash($str)
- performs the actual hashing of the string providedUPPERBOUND
- a 32-bit mask to be used by the hashing functions0xffffffff
Storage
This interface is the basis for the storage classes. It defines four methods:
set($key, $value)
- sets a value to the key in the storage systemget($key)
- gets the value stored to the keyflush()
- flushes the storage systemsize()
- returns the number of keys stored in the storage system
Classes
BitArray (implements Storage
)
Implements a single bit array. It's used to implement the Bloob Filter
where the set
method only accepts Bool
as $value
.
HyperLogLog (implements Estimator
)
Implements the HyperLogLog cardinality estimator algorithm. The actual implementation uses HyperLogLog for big cardinalities and LinearCounting for small ones as it gives a better approximation.
Exact (implements Estimator
)
Implement an exact counter. It's primarily a toy class to show how to use the Estimator
interface.
Trivial (implements Hash
)
Implements a trivial hashing algorithm. Basically adds the ASCII code shifted right by the character position for each characted of the input string and then takes the lower 32 bits. It's a toy class to show how to use the Hash
interface.
Pearson (implements Hash
)
Implements the Pearson non-cryptographic hashing function.
FVNHash (implements Hash
)
Implements the Fowler-Noll-Vo non-cryptographic hashing function. The actual algorithm is the FNV-1 hash.
Generic (implements Hash
)
This class is basically a wrapper around the standard PHP hash function. The constructor accepts the algorithm name to use as from the PHP hash_algos() function. If an unknown algorithm is specified it raises an exception, if none is specified MD5 is selected as default.
MultiHash (implements Hash
)
This class calculates multiple hashes using different algorithms specified ad arguments of the constructor. It's primarily used in conjunction with the BitArray
class to implement the Bloom Filter
.
Bloom
This class implements a Bloom Filter, a probabilistic data structure that allows to test if an element is a member of a set with a very small memory footprint.
Examples
TODO