README

Statistics PHP package

A PHP package for descriptive statistics, normal distribution, outlier detection, and streaming analytics on numeric data.

This package provides a comprehensive set of statistical functions for PHP: descriptive statistics (mean, median, mode, standard deviation, variance, quantiles), robust measures (trimmed mean, weighted median, median absolute deviation), distribution modelling (normal distribution with PDF, CDF, and inverse CDF), outlier detection (z-score and IQR-based), z-scores, percentiles, coefficient of variation, frequency tables, correlation, regression (linear, logarithmic, power, and exponential), kernel density estimation, and O(1) memory streaming statistics.

It works with any numeric dataset — from sports telemetry and sensor data to race results, survey responses, and financial time series.

Articles and resources:

Exploring Olympic Downhill Results with PHP Statistics — a step-by-step analysis of 2026 Olympic downhill race data
Statistics with PHP — introduction to the package and its core functions
PHP Statistics on Laravel News

This package is inspired by the Python statistics module

Installation

You can install the package via composer:

composer require hi-folks/statistics

Usage

Stat class

Stat class has methods to calculate an average or typical value from a population or sample. This class provides methods for calculating mathematical statistics of numeric data. The various mathematical statistics are listed below:

Mathematical Statistic	Description
`mean()`	arithmetic mean or "average" of data
`fmean()`	floating-point arithmetic mean, with optional weighting and precision
`trimmedMean()`	trimmed (truncated) mean — mean after removing outliers from each side
`median()`	median or "middle value" of data
`weightedMedian()`	weighted median — median with weights, where each value has a different importance
`medianLow()`	low median of data
`medianHigh()`	high median of data
`medianGrouped()`	median of grouped data, using interpolation
`mode()`	single mode (most common value) of discrete or nominal data
`multimode()`	list of modes (most common values) of discrete or nominal data
`quantiles()`	cut points dividing the range of a probability distribution into continuous intervals with equal probabilities (supports `exclusive` and `inclusive` methods)
`thirdQuartile()`	3rd quartile, is the value at which 75 percent of the data is below it
`firstQuartile()`	first quartile, is the value at which 25 percent of the data is below it
`percentile()`	value at any percentile (0–100) with linear interpolation
`rank()`	rank each data point, with configurable tie handling
`percentileRank()`	percentile position of a value within a dataset
`pstdev()`	Population standard deviation
`stdev()`	Sample standard deviation
`sem()`	Standard error of the mean (SEM) — measures precision of the sample mean
`meanAbsoluteDeviation()`	mean absolute deviation (MAD) — average distance from the mean
`medianAbsoluteDeviation()`	median absolute deviation — median distance from the median, robust to outliers
`pvariance()`	variance for a population (supports pre-computed mean via `mu`)
`variance()`	variance for a sample (supports pre-computed mean via `xbar`)
`skewness()`	adjusted Fisher-Pearson sample skewness
`pskewness()`	population (biased) skewness
`kurtosis()`	excess kurtosis (sample formula, 0 for normal distribution)
`coefficientOfVariation()`	coefficient of variation (CV%), relative dispersion as percentage
`zscores()`	z-scores for each value — how many standard deviations from the mean
`outliers()`	outlier detection based on z-score threshold
`iqrOutliers()`	outlier detection based on IQR method (box plot whiskers), robust for skewed data
`geometricMean()`	geometric mean
`harmonicMean()`	harmonic mean
`correlation()`	Pearson’s, Spearman’s rank, or Kendall tau correlation coefficient for two inputs
`kendallTau()`	Kendall tau-b rank correlation coefficient for ordinal association with tie handling
`covariance()`	the sample covariance of two inputs
`linearRegression()`	return the slope and intercept of simple linear regression parameters estimated using ordinary least squares (supports `proportional: true` for regression through the origin)
`logarithmicRegression()`	logarithmic regression — fits `y = a × ln(x) + b`, ideal for diminishing returns patterns (e.g., athletic improvement, learning curves)
`powerRegression()`	power regression — fits `y = a × x^b`, useful for power law relationships
`exponentialRegression()`	exponential regression — fits `y = a × e^(b×x)`, useful for exponential growth or decay
`rSquared()`	coefficient of determination (R²) — proportion of variance explained by linear regression
`confidenceInterval()`	confidence interval for the mean using the normal (z) distribution
`zTest()`	one-sample Z-test — tests whether the sample mean differs significantly from a hypothesized population mean
`tTest()`	one-sample t-test — like z-test but appropriate for small samples where the population standard deviation is unknown
`tTestTwoSample()`	two-sample independent t-test (Welch's) — compares the means of two independent groups without assuming equal variances
`tTestPaired()`	paired t-test — tests whether the mean difference between paired observations is significantly different from zero
`chiSquaredTest()`	chi-squared goodness-of-fit test for observed vs expected category counts
`chiSquaredIndependence()`	chi-squared test of independence for contingency tables
`kde()`	kernel density estimation — returns a closure that estimates the probability density (or CDF) at any point
`kdeRandom()`	random sampling from a kernel density estimate — returns a closure that generates random floats from the KDE distribution

Stat::mean( array $data )

Return the sample arithmetic mean of the array $data. The arithmetic mean is the sum of the data divided by the number of data points. It is commonly called “the average”, although it is only one of many mathematical averages. It is a measure of the central location of the data.

use HiFolks\Statistics\Stat;
$mean = Stat::mean([1, 2, 3, 4, 4]);
// 2.8
$mean = Stat::mean([-1.0, 2.5, 3.25, 5.75]);
// 2.625

Stat::fmean( array $data, array|null $weights = null, int|null $precision = null )

Return the arithmetic mean of the array $data, as a float, with optional weights and precision control. This function behaves like mean() but ensures a floating-point result and supports weighted datasets. If $weights is provided, it computes the weighted average. The result is rounded to a given decimal $precision. The result is rounded to $precision decimal places. If $precision is null, no rounding is applied — this may lead to results with long or unexpected decimal expansions due to the nature of floating-point arithmetic in PHP. Using rounding helps ensure cleaner, more predictable output.

use HiFolks\Statistics\Stat;

// Unweighted mean (same as mean but always float)
$fmean = Stat::fmean([3.5, 4.0, 5.25]);
// 4.25

// Weighted mean
$fmean = Stat::fmean([3.5, 4.0, 5.25], [1, 2, 1]);
// 4.1875

// Custom precision
$fmean = Stat::fmean([3.5, 4.0, 5.25], null, 2);
// 4.25

$fmean = Stat::fmean([3.5, 4.0, 5.25], [1, 2, 1], 3);
// 4.188

If the input is empty, or weights are invalid (e.g., length mismatch or sum is zero), an exception is thrown. Use this function when you need floating-point accuracy or to apply custom weighting and rounding to your average.

Stat::trimmedMean( array $data, float $proportionToCut = 0.1, ?int $round = null )

Return the trimmed (truncated) mean of the data. Computes the mean after removing the lowest and highest fraction of values. This is a robust measure of central tendency, less sensitive to outliers than the regular mean.

The $proportionToCut parameter specifies the fraction to trim from each side (must be in the range [0, 0.5)). For example, 0.1 removes the bottom 10% and top 10%.

use HiFolks\Statistics\Stat;
$mean = Stat::trimmedMean([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], 0.1);
// 5.5 (outlier 100 and lowest value 1 removed)

$mean = Stat::trimmedMean([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.2);
// 5.5 (removes 2 values from each side)

$mean = Stat::trimmedMean([1, 2, 3, 4, 5], 0.0);
// 3.0 (no trimming, same as regular mean)

Stat::geometricMean( array $data )

The geometric mean indicates the central tendency or typical value of the data using the product of the values (as opposed to the arithmetic mean which uses their sum).

use HiFolks\Statistics\Stat;
$mean = Stat::geometricMean([54, 24, 36], 1);
// 36.0

Stat::harmonicMean( array $data )

The harmonic mean is the reciprocal of the arithmetic mean() of the reciprocals of the data. For example, the harmonic mean of three values a, b, and c will be equivalent to 3/(1/a + 1/b + 1/c). If one of the values is zero, the result will be zero.

use HiFolks\Statistics\Stat;
$mean = Stat::harmonicMean([40, 60], null, 1);
// 48.0

You can also calculate the harmonic weighted mean. Suppose a car travels 40 km/hr for 5 km, and when traffic clears, speeds up to 60 km/hr for the remaining 30 km of the journey. What is the average speed?

use HiFolks\Statistics\Stat;
Stat::harmonicMean([40, 60], [5, 30], 1);
// 56.0

where:

40, 60: are the elements
5, 30: are the weights for each element (the first weight is the weight of the first element, the second one is the weight of the second element)
1: is the decimal numbers you want to round

Stat::median( array $data )

Return the median (middle value) of numeric data, using the common “mean of middle two” method.

use HiFolks\Statistics\Stat;
$median = Stat::median([1, 3, 5]);
// 3
$median = Stat::median([1, 3, 5, 7]);
// 4

Stat::weightedMedian( array $data, array $weights, ?int $round = null )

Return the weighted median of the data. The weighted median is the value where the cumulative weight reaches 50% of the total weight. This is useful for survey data, financial analysis, or any dataset where observations have different importance.

All weights must be positive numbers and the weights array must have the same length as the data array.

use HiFolks\Statistics\Stat;
$median = Stat::weightedMedian([1, 2, 3], [1, 1, 1]);
// 2.0 (equal weights, same as regular median)

$median = Stat::weightedMedian([1, 2, 3], [1, 1, 10]);
// 3.0 (heavy weight on 3 pulls the median)

$median = Stat::weightedMedian([1, 2, 3, 4], [1, 1, 1, 1]);
// 2.5 (equal weights, even count — averages the two middle values)

Stat::medianLow( array $data )

Return the low median of numeric data. The low median is always a member of the data set. When the number of data points is odd, the middle value is returned. When it is even, the smaller of the two middle values is returned.

use HiFolks\Statistics\Stat;
$median = Stat::medianLow([1, 3, 5]);
// 3
$median = Stat::medianLow([1, 3, 5, 7]);
// 3

Stat::medianHigh( array $data )

Return the high median of data. The high median is always a member of the data set. When the number of data points is odd, the middle value is returned. When it is even, the larger of the two middle values is returned.

use HiFolks\Statistics\Stat;
$median = Stat::medianHigh([1, 3, 5]);
// 3
$median = Stat::medianHigh([1, 3, 5, 7]);
// 5

Stat::medianGrouped( array $data, float $interval = 1.0 )

Estimate the median for numeric data that has been grouped or binned around the midpoints of consecutive, fixed-width intervals. The $interval parameter specifies the width of each bin (default 1.0). This function uses interpolation within the median interval, assuming values are evenly distributed across each bin.

use HiFolks\Statistics\Stat;
$median = Stat::medianGrouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]);
// 3.7
$median = Stat::medianGrouped([1, 3, 3, 5, 7]);
// 3.25
$median = Stat::medianGrouped([1, 3, 3, 5, 7], 2);
// 3.5

For example, demographic data summarized into ten-year age groups:

use HiFolks\Statistics\Stat;
// 172 people aged 20-30, 484 aged 30-40, 387 aged 40-50, etc.
$data = array_merge(
    array_fill(0, 172, 25),
    array_fill(0, 484, 35),
    array_fill(0, 387, 45),
    array_fill(0, 22, 55),
    array_fill(0, 6, 65),
);
round(Stat::medianGrouped($data, 10), 1);
// 37.5

Stat::quantiles( array $data, $n=4, $round=null, $method='exclusive' )

Divide data into n continuous intervals with equal probability. Returns a list of n - 1 cut points separating the intervals. Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set n to 100 for percentiles which gives the 99 cut points that separate data into 100 equal-sized groups.

The $method parameter controls the interpolation method:

'exclusive' (default): uses m = count + 1. Suitable for sampled data that may have more extreme values beyond the sample.
'inclusive': uses m = count - 1. Suitable for population data or samples known to include the most extreme values. The minimum value is treated as the 0th percentile and the maximum as the 100th percentile.

use HiFolks\Statistics\Stat;
$quantiles = Stat::quantiles([98, 90, 70,18,92,92,55,83,45,95,88]);
// [ 55.0, 88.0, 92.0 ]
$quantiles = Stat::quantiles([105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,103, 107, 101, 81, 109, 104], 10);
// [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

// Inclusive method
$quantiles = Stat::quantiles([1, 2, 3, 4, 5], method: 'inclusive');
// [2.0, 3.0, 4.0]

Stat::firstQuartile( array $data, $round=null )

The lower quartile, or first quartile (Q1), is the value under which 25% of data points are found when they are arranged in increasing order.

use HiFolks\Statistics\Stat;
$percentile = Stat::firstQuartile([98, 90, 70,18,92,92,55,83,45,95,88]);
// 55.0

Stat::thirdQuartile( array $data, $round=null )

The upper quartile, or third quartile (Q3), is the value under which 75% of data points are found when arranged in increasing order.

use HiFolks\Statistics\Stat;
$percentile = Stat::thirdQuartile([98, 90, 70,18,92,92,55,83,45,95,88]);
// 92.0

Stat::percentile( array $data, float $p, ?int $round = null )

Return the value at the given percentile of the data, using linear interpolation between adjacent data points (exclusive method, consistent with quantiles()).

The percentile $p must be between 0 and 100. Requires at least 2 data points.

use HiFolks\Statistics\Stat;
$value = Stat::percentile([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 50);
// 55.0 (median)

$value = Stat::percentile([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 90);
// 91.0

Stat::rank( array $data, string $method = Stat::RANK_AVERAGE )

Return 1-based ranks for each data point, preserving the original array keys.

The $method parameter controls how tied values are ranked:

Stat::RANK_AVERAGE (default): tied values receive the average rank.
Stat::RANK_MIN: tied values receive the lowest rank in the tied group.
Stat::RANK_MAX: tied values receive the highest rank in the tied group.
Stat::RANK_DENSE: tied values receive the same rank and ranks do not skip numbers.
Stat::RANK_ORDINAL: tied values are ranked by their sorted order, preserving input order inside ties.

use HiFolks\Statistics\Stat;

$ranks = Stat::rank([10, 20, 20, 30]);
// [1, 2.5, 2.5, 4]

$ranks = Stat::rank([10, 20, 20, 30], Stat::RANK_DENSE);
// [1, 2, 2, 3]

Stat::percentileRank( array $data, int|float $value, string $kind = Stat::PERCENTILE_RANK_WEAK, ?int $round = null )

Return the percentile position of a value within a dataset.

The $kind parameter controls the calculation:

Stat::PERCENTILE_RANK_WEAK (default): percentage of values less than or equal to the value.
Stat::PERCENTILE_RANK_STRICT: percentage of values strictly less than the value.
Stat::PERCENTILE_RANK_MEAN: average of weak and strict percentile ranks.
Stat::PERCENTILE_RANK_RANK: average percentage rank for exact matches, falling back to mean when the value is absent.

use HiFolks\Statistics\Stat;

$rank = Stat::percentileRank([10, 20, 20, 30, 40], 20);
// 60.0

$rank = Stat::percentileRank([10, 20, 20, 30, 40], 20, Stat::PERCENTILE_RANK_STRICT);
// 20.0

Stat::pstdev( array $data )

Return the Population Standard Deviation, a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

use HiFolks\Statistics\Stat;
$stdev = Stat::pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]);
// 0.986893273527251
$stdev = Stat::pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75], 4);
// 0.9869

Stat::stdev( array $data )

Return the Sample Standard Deviation, a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

use HiFolks\Statistics\Stat;
$stdev = Stat::stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]);
// 1.0810874155219827
$stdev = Stat::stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75], 4);
// 1.0811

Stat::sem( array $data, ?int $round = null )

Return the standard error of the mean (SEM). SEM measures how precisely the sample mean estimates the population mean. It decreases as the sample size grows.

Formula: stdev / sqrt(n)

Requires at least 2 data points.

use HiFolks\Statistics\Stat;
$sem = Stat::sem([2, 4, 4, 4, 5, 5, 7, 9]);
// 0.7559...

$sem = Stat::sem([2, 4, 4, 4, 5, 5, 7, 9], 4);
// 0.7559

Stat::meanAbsoluteDeviation( array $data, ?int $round = null )

Return the mean absolute deviation (MAD) — the average of the absolute deviations from the mean.

MAD is a simple, intuitive measure of dispersion: it tells you "on average, how far values are from the mean". Unlike standard deviation, it does not square the differences, making it easier to interpret and somewhat less sensitive to outliers.

Use MAD when you want a straightforward, interpretable measure of spread, especially for reporting to non-technical audiences.

use HiFolks\Statistics\Stat;
$mad = Stat::meanAbsoluteDeviation([1, 2, 3, 4, 5]);
// 1.2

$mad = Stat::meanAbsoluteDeviation([1, 2, 3, 4, 5], 1);
// 1.2

Stat::medianAbsoluteDeviation( array $data, ?int $round = null )

Return the median absolute deviation — the median of the absolute deviations from the median.

This is one of the most robust measures of dispersion available. Because it uses the median (not the mean) as the center and takes the median (not the mean) of deviations, it is highly resistant to outliers. Even if up to half the data points are extreme, the median absolute deviation remains stable.

Use it when your data may contain outliers, when you need a robust alternative to standard deviation, or for outlier detection (values far from the median in units of MAD are likely outliers).

use HiFolks\Statistics\Stat;
$mad = Stat::medianAbsoluteDeviation([1, 2, 3, 4, 5]);
// 1.0

// Robust to outliers — the outlier 1000 does not affect the result:
$mad = Stat::medianAbsoluteDeviation([1, 2, 3, 4, 1000]);
// 1.0

Stat::variance ( array $data, ?int $round = null, int|float|null $xbar = null)

Variance is a measure of dispersion of data points from the mean. Low variance indicates that data points are generally similar and do not vary widely from the mean. High variance indicates that data values have greater variability and are more widely dispersed from the mean.

To calculate the variance from a sample:

use HiFolks\Statistics\Stat;
$variance = Stat::variance([2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]);
// 1.3720238095238095

If you have already computed the mean, you can pass it via xbar to avoid recalculation:

$data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5];
$mean = Stat::mean($data);
$variance = Stat::variance($data, xbar: $mean);

If you need to calculate the variance on the whole population and not just on a sample you need to use pvariance method. You can optionally pass the population mean via mu:

use HiFolks\Statistics\Stat;
$variance = Stat::pvariance([0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]);
// 1.25

// With pre-computed mean:
$data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25];
$mu = Stat::mean($data);
$variance = Stat::pvariance($data, mu: $mu);

Stat::skewness ( array $data, ?int $round = null )

Skewness is a measure of the asymmetry of a distribution. The adjusted Fisher-Pearson formula is used, which is the same as Excel's SKEW() and Python's scipy.stats.skew(bias=False).

A positive skewness indicates a right-skewed distribution (tail extends to the right), while a negative skewness indicates a left-skewed distribution. A symmetric distribution has a skewness of 0.

Requires at least 3 data points.

use HiFolks\Statistics\Stat;
$skewness = Stat::skewness([1, 2, 3, 4, 5]);
// 0.0 (symmetric)

$skewness = Stat::skewness([1, 1, 1, 1, 1, 10]);
// positive (right-skewed)

If you need the population (biased) skewness instead of the sample skewness, use pskewness(). This is equivalent to scipy.stats.skew(bias=True):

use HiFolks\Statistics\Stat;
$pskewness = Stat::pskewness([1, 1, 1, 1, 1, 10]);

Stat::kurtosis ( array $data, ?int $round = null )

Kurtosis measures the "tailedness" of a distribution — how much data lives in the extreme tails compared to a normal distribution. This method returns the excess kurtosis using the sample formula, which is the same as Excel's KURT() and Python's scipy.stats.kurtosis(bias=False).

A normal distribution has excess kurtosis of 0. Positive values (leptokurtic) indicate heavier tails and more outliers. Negative values (platykurtic) indicate lighter tails and fewer outliers.

Requires at least 4 data points.

use HiFolks\Statistics\Stat;
$kurtosis = Stat::kurtosis([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
// negative (platykurtic, lighter tails than normal)

$kurtosis = Stat::kurtosis([1, 2, 2, 2, 2, 2, 2, 2, 2, 50]);
// positive (leptokurtic, heavier tails due to outlier)

Stat::coefficientOfVariation( array $data, ?int $round = null, bool $population = false )

The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage. It measures relative variability and is useful for comparing dispersion across datasets with different units or scales.

By default it uses the sample standard deviation. Pass population: true to use the population standard deviation instead.

Requires at least 2 data points (sample) or 1 (population). Throws if the mean is zero.

use HiFolks\Statistics\Stat;
$cv = Stat::coefficientOfVariation([10, 20, 30, 40, 50]);
// ~52.70 (sample)

$cv = Stat::coefficientOfVariation([10, 20, 30, 40, 50], round: 2);
// 52.7

$cv = Stat::coefficientOfVariation([10, 20, 30, 40, 50], population: true);
// ~47.14 (population)

Stat::zscores( array $data, ?int $round = null )

Return the z-score for each value in the dataset. A z-score indicates how many standard deviations a value is from the mean. Z-scores are useful for standardizing data, comparing values from different distributions, and identifying outliers.

The z-scores of any dataset always sum to zero, and values beyond ±2 or ±3 are typically considered unusual or outliers.

Requires at least 2 data points and non-zero standard deviation.

use HiFolks\Statistics\Stat;
$zscores = Stat::zscores([2, 4, 4, 4, 5, 5, 7, 9]);
// array of z-scores, one per value

$zscores = Stat::zscores([2, 4, 4, 4, 5, 5, 7, 9], 2);
// z-scores rounded to 2 decimal places

Stat::outliers( array $data, float $threshold = 3.0 )

Return values from the dataset that are outliers based on z-score threshold. A value is considered an outlier if its absolute z-score exceeds the threshold.

The default threshold of 3.0 is a widely used convention — in a normal distribution, about 99.7% of values fall within 3 standard deviations of the mean, so values beyond that are rare. Use a lower threshold (e.g. 2.0) for stricter detection, or a higher one for more lenient filtering.

use HiFolks\Statistics\Stat;
$outliers = Stat::outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]);
// [100]

$outliers = Stat::outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1.0);
// values more than 1 stdev from the mean

Stat::iqrOutliers( array $data, float $factor = 1.5 )

Return values that are outliers based on the Interquartile Range (IQR) method. A value is an outlier if it falls below Q1 - factor * IQR or above Q3 + factor * IQR. This is the same method used for box plot whiskers.

Unlike z-score based detection, the IQR method is robust — it does not assume a normal distribution and is not influenced by extreme values themselves. This makes it the preferred choice for skewed data or when the dataset may already contain outliers that would distort the mean and standard deviation.

Use factor: 1.5 (default) for mild outliers, or factor: 3.0 for extreme outliers only.

Example: Ski downhill race times

In a ski downhill race, most athletes finish between 108–116 seconds. A time of 200s (e.g. a crash/DNF) or 50s (e.g. a timing error) would be flagged as outliers:

use HiFolks\Statistics\Stat;
$times = [110.2, 112.5, 108.9, 115.3, 111.7, 114.0, 109.8, 113.6, 200.0, 50.0];
$outliers = Stat::iqrOutliers($times);
// [200.0, 50.0] — the crash and the timing error are detected

$extremeOnly = Stat::iqrOutliers($times, 3.0);
// only the most extreme values

Stat::covariance ( array $x , array $y )

Covariance, static method, returns the sample covariance of two inputs $x and $y. Covariance is a measure of the joint variability of two inputs.

$covariance = Stat::covariance(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 1, 2, 3, 1, 2, 3]
);
// 0.75

$covariance = Stat::covariance(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 7, 6, 5, 4, 3, 2, 1]
);
// -7.5

Stat::correlation ( array $x , array $y, string $method = 'linear' )

Return the Pearson’s correlation coefficient for two inputs. Pearson’s correlation coefficient r takes values between -1 and +1. It measures the strength and direction of the linear relationship, where +1 means very strong, positive linear relationship, -1 very strong, negative linear relationship, and 0 no linear relationship.

Use $method = 'ranked' for Spearman’s rank correlation, which measures monotonic relationships (not just linear). Spearman’s correlation is computed by applying Pearson’s formula to the ranks of the data.

Use $method = 'kendall' for Kendall tau-b correlation, which measures ordinal association by comparing concordant and discordant pairs. Kendall tau-b is often useful for ordinal data, small samples, and datasets with ties.

$correlation = Stat::correlation(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 5, 6, 7, 8, 9]
);
// 1.0

$correlation = Stat::correlation(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 7, 6, 5, 4, 3, 2, 1]
);
// -1.0

Spearman’s rank correlation (non-linear but monotonic relationship):

$correlation = Stat::correlation(
    [1, 2, 3, 4, 5],
    [1, 4, 9, 16, 25],
    'ranked'
);
// 1.0

Kendall tau-b rank correlation with ties:

$correlation = Stat::correlation(
    [12, 2, 1, 12, 2],
    [1, 4, 7, 1, 0],
    'kendall'
);
// -0.47140452079103173

Stat::kendallTau ( array $x , array $y, ?int $round = null )

Return Kendall's tau-b rank correlation coefficient for two inputs.

$correlation = Stat::kendallTau(
    [12, 2, 1, 12, 2],
    [1, 4, 7, 1, 0],
    4
);
// -0.4714

Stat::linearRegression ( array $x , array $y , bool $proportional = false )

Return the slope and intercept of simple linear regression parameters estimated using ordinary least squares. Simple linear regression describes the relationship between an independent variable $x and a dependent variable $y in terms of a linear function.

$years = [1971, 1975, 1979, 1982, 1983];
$films_total = [1, 2, 3, 4, 5]
list($slope, $intercept) = Stat::linearRegression(
    $years,
    $films_total
);
// 0.31
// -610.18

What happens in 2022, according to the samples above?

round($slope * 2022 + $intercept);
// 17.0

When proportional is true, the regression line is forced through the origin (intercept = 0). This is useful when the relationship between $x and $y is known to be proportional:

list($slope, $intercept) = Stat::linearRegression(
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    proportional: true,
);
// $slope = 2.0
// $intercept = 0.0

Stat::logarithmicRegression( array $x, array $y )

Fit a logarithmic model y = a × ln(x) + b. Returns [a, b].

This model naturally captures diminishing returns — fast initial change that gradually flattens. It is useful for data where early gains are large but improvement slows over time, such as athletic performance trends, learning curves, or market saturation.

All x values must be positive (you cannot take the logarithm of zero or negative numbers).

Internally, this transforms x to ln(x) and applies linear regression, so it leverages the same robust ordinary least squares implementation.

use HiFolks\Statistics\Stat;

// Simulated weekly running paces (seconds/km) — diminishing improvement
$weeks = [1, 2, 3, 4, 5, 6, 7, 8];
$paces = [350, 342, 337, 333, 330, 328, 326, 325];

[$a, $b] = Stat::logarithmicRegression($weeks, $paces);
// $a = -12.33 (pace drops by 12.33 sec per unit of ln(week))
// $b = 350.2

// Predict pace at week 12:
$predicted = $a * log(12) + $b;
// ~320 seconds = 5:20/km

Compare with linear regression to see which fits better:

// R² for logarithmic model (transform x first)
$logWeeks = array_map(fn($v) => log($v), $weeks);
$r2Log = Stat::rSquared($logWeeks, $paces);
// 0.9987

// R² for linear model
$r2Linear = Stat::rSquared($weeks, $paces);
// 0.9176

// Logarithmic wins — the data has diminishing returns

Stat::powerRegression( array $x, array $y )

Fit a power model y = a × x^b. Returns [a, b].

Power regression is useful for data following power law relationships (e.g., scaling laws, allometric relationships). Both x and y values must be positive.

Internally, this linearizes as ln(y) = ln(a) + b × ln(x) and applies linear regression.

use HiFolks\Statistics\Stat;

// Data following y = 3 * x^2
$x = [1, 2, 3, 4, 5];
$y = [3, 12, 27, 48, 75];

[$a, $b] = Stat::powerRegression($x, $y);
// $a = 3.0
// $b = 2.0 (the exponent)

Stat::exponentialRegression( array $x, array $y )

Fit an exponential model y = a × e^(b×x). Returns [a, b].

Exponential regression is useful for data with exponential growth (positive b) or decay (negative b), such as population growth, compound interest, or radioactive decay. All y values must be positive.

Internally, this linearizes as ln(y) = ln(a) + b × x and applies linear regression.

use HiFolks\Statistics\Stat;

// Data following y = 2 * e^(0.5*x)
$x = [1, 2, 3, 4, 5];
$y = [3.30, 5.44, 8.96, 14.78, 24.36];

[$a, $b] = Stat::exponentialRegression($x, $y);
// $a ≈ 2.0
// $b ≈ 0.5

Stat::rSquared( array $x, array $y, bool $proportional = false, ?int $round = null )

Return the coefficient of determination (R²) — the proportion of variance in the dependent variable explained by the linear regression model. Values range from 0 (no explanatory power) to 1 (perfect fit).

Requires at least 2 data points and arrays of the same length.

use HiFolks\Statistics\Stat;
$r2 = Stat::rSquared([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]);
// 1.0 (perfect linear relationship)

$r2 = Stat::rSquared(
    [1971, 1975, 1979, 1982, 1983],
    [1, 2, 3, 4, 5],
    round: 2,
);
// 0.96

With proportional regression (through the origin):

$r2 = Stat::rSquared(
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    proportional: true,
);
// 1.0

To compute R² for non-linear models, transform the data the same way the regression method does:

// R² for logarithmic regression
$logX = array_map(fn($v) => log($v), $x);
$r2 = Stat::rSquared($logX, $y);

// R² for power regression
$logX = array_map(fn($v) => log($v), $x);
$logY = array_map(fn($v) => log($v), $y);
$r2 = Stat::rSquared($logX, $logY);

// R² for exponential regression
$logY = array_map(fn($v) => log($v), $y);
$r2 = Stat::rSquared($x, $logY);

Stat::confidenceInterval( array $data, float $confidenceLevel = 0.95, ?int $round = null )

Return the confidence interval for the mean using the normal (z) distribution.

Computes: mean ± z * (stdev / √n), where the z-critical value is derived from the inverse normal CDF.

Requires at least 2 data points. The confidence level must be between 0 and 1 exclusive.

use HiFolks\Statistics\Stat;
[$lower, $upper] = Stat::confidenceInterval([2, 4, 4, 4, 5, 5, 7, 9]);
// 95% CI: [3.52, 6.48] (approximately)

[$lower, $upper] = Stat::confidenceInterval([2, 4, 4, 4, 5, 5, 7, 9], confidenceLevel: 0.99);
// 99% CI: wider interval

[$lower, $upper] = Stat::confidenceInterval([2, 4, 4, 4, 5, 5, 7, 9], round: 2);
// [3.52, 6.48]

Stat::zTest( array $data, float $populationMean, Alternative $alternative = Alternative::TwoSided, ?int $round = null )

Perform a one-sample Z-test for the mean. Tests whether the sample mean differs significantly from a hypothesized population mean using the normal distribution.

Returns an associative array with zScore and pValue. The alternative hypothesis can be TwoSided (default), Greater, or Less.

Requires at least 2 data points.

use HiFolks\Statistics\Stat;
use HiFolks\Statistics\Enums\Alternative;

$result = Stat::zTest([2, 4, 4, 4, 5, 5, 7, 9], populationMean: 3.0);
// ['zScore' => 2.6457..., 'pValue' => 0.0081...]

$result = Stat::zTest([2, 4, 4, 4, 5, 5, 7, 9], populationMean: 3.0, alternative: Alternative::Greater);
// one-tailed test: is the sample mean greater than 3?

$result = Stat::zTest([2, 4, 4, 4, 5, 5, 7, 9], populationMean: 3.0, round: 4);
// ['zScore' => 2.6458, 'pValue' => 0.0081]

Stat::tTest( array $data, float $populationMean, Alternative $alternative = Alternative::TwoSided, ?int $round = null )

Perform a one-sample t-test for the mean. Tests whether the sample mean differs significantly from a hypothesized population mean using the Student's t-distribution. Unlike the z-test, the t-test is appropriate for small samples where the population standard deviation is unknown.

Returns an associative array with tStatistic, pValue, and degreesOfFreedom. The alternative hypothesis can be TwoSided (default), Greater, or Less.

Requires at least 2 data points.

use HiFolks\Statistics\Stat;
use HiFolks\Statistics\Enums\Alternative;

$result = Stat::tTest([2, 4, 4, 4, 5, 5, 7, 9], populationMean: 3.0);
// ['tStatistic' => 2.6457..., 'pValue' => 0.0331..., 'degreesOfFreedom' => 7]

$result = Stat::tTest([2, 4, 4, 4, 5, 5, 7, 9], populationMean: 3.0, alternative: Alternative::Greater);
// one-tailed test: is the sample mean greater than 3?

$result = Stat::tTest([2, 4, 4, 4, 5, 5, 7, 9], populationMean: 3.0, round: 4);
// ['tStatistic' => 2.6458, 'pValue' => 0.0331, 'degreesOfFreedom' => 7]

Stat::tTestTwoSample( array $data1, array $data2, Alternative $alternative = Alternative::TwoSided, ?int $round = null )

Perform a two-sample independent t-test (Welch's t-test). Compares the means of two independent groups without assuming equal variances. Uses the Welch–Satterthwaite approximation for degrees of freedom.

Returns an associative array with tStatistic, pValue, and degreesOfFreedom. The alternative hypothesis can be TwoSided (default), Greater, or Less.

Requires at least 2 data points in each sample.

use HiFolks\Statistics\Stat;
use HiFolks\Statistics\Enums\Alternative;

// Compare two groups
$group1 = [30.02, 29.99, 30.11, 29.97, 30.01, 29.99];
$group2 = [29.89, 29.93, 29.72, 29.98, 30.02, 29.98];
$result = Stat::tTestTwoSample($group1, $group2);
// ['tStatistic' => 1.6245..., 'pValue' => 0.1444..., 'degreesOfFreedom' => 6.84...]

// One-tailed test: is group1 mean greater than group2 mean?
$result = Stat::tTestTwoSample($group1, $group2, alternative: Alternative::Greater);

// Groups can have different sizes
$result = Stat::tTestTwoSample([1, 2, 3, 4, 5, 6, 7, 8], [3, 4, 5], round: 4);

Stat::tTestPaired( array $data1, array $data2, Alternative $alternative = Alternative::TwoSided, ?int $round = null )

Perform a paired t-test. Tests whether the mean difference between paired observations (e.g. before/after measurements on the same subjects) is significantly different from zero.

Returns an associative array with tStatistic, pValue, and degreesOfFreedom. Both arrays must have the same length.

Requires at least 2 paired observations.

use HiFolks\Statistics\Stat;
use HiFolks\Statistics\Enums\Alternative;

// Before and after treatment measurements
$before = [200, 190, 210, 220, 215, 205, 195, 225];
$after  = [192, 186, 198, 212, 208, 198, 188, 215];
$result = Stat::tTestPaired($before, $after);
// ['tStatistic' => 5.715..., 'pValue' => 0.0007..., 'degreesOfFreedom' => 7]

// One-tailed: did the treatment decrease the values?
$result = Stat::tTestPaired($before, $after, alternative: Alternative::Greater);

$result = Stat::tTestPaired($before, $after, round: 4);

Stat::chiSquaredTest( array $observed, ?array $expected = null, ?int $round = null )

Perform a chi-squared goodness-of-fit test. Use it when you want to check whether observed category counts differ from expected counts.

If $expected is omitted, a uniform distribution is assumed. Returns an associative array with chiSquared, pValue, and degreesOfFreedom.

use HiFolks\Statistics\Stat;

// Are these category counts different from a uniform distribution?
$result = Stat::chiSquaredTest([8, 12, 10], round: 4);
// ['chiSquared' => 0.8, 'pValue' => 0.6703, 'degreesOfFreedom' => 2]

// Compare observed counts to expected counts
$result = Stat::chiSquaredTest([18, 32, 50], [20, 30, 50], round: 4);
// ['chiSquared' => 0.3333, 'pValue' => 0.8465, 'degreesOfFreedom' => 2]

Stat::chiSquaredIndependence( array $table, ?int $round = null )

Perform a chi-squared test of independence on a contingency table. Use it when you want to check whether two categorical variables are associated.

Returns an associative array with chiSquared, pValue, degreesOfFreedom, and the expected-count table.

use HiFolks\Statistics\Stat;

$table = [
    [10, 20, 30],
    [6, 9, 17],
];

$result = Stat::chiSquaredIndependence($table, round: 4);
// [
//     'chiSquared' => 0.2716,
//     'pValue' => 0.873,
//     'degreesOfFreedom' => 2,
//     'expected' => [[10.4348, 18.913, 30.6522], [5.5652, 10.087, 16.3478]],
// ]

Stat::kde ( array $data , float $h , KdeKernel $kernel = KdeKernel::Normal , bool $cumulative = false )

Create a continuous probability density function (or cumulative distribution function) from discrete sample data using Kernel Density Estimation. Returns a Closure that can be called with any point to estimate the density (or CDF value).

Supported kernels: KdeKernel::Normal (alias KdeKernel::Gauss), KdeKernel::Logistic, KdeKernel::Sigmoid, KdeKernel::Rectangular (alias KdeKernel::Uniform), KdeKernel::Triangular, KdeKernel::Parabolic (alias KdeKernel::Epanechnikov), KdeKernel::Quartic (alias KdeKernel::Biweight), KdeKernel::Triweight, KdeKernel::Cosine.

use HiFolks\Statistics\Stat;
use HiFolks\Statistics\Enums\KdeKernel;

$data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2];
$f = Stat::kde($data, h: 1.5);
$f(2.5);
// estimated density at x = 2.5

Using a different kernel:

$f = Stat::kde($data, h: 1.5, kernel: KdeKernel::Triangular);
$f(2.5);

Cumulative distribution function:

$F = Stat::kde($data, h: 1.5, cumulative: true);
$F(2.5);
// estimated CDF at x = 2.5 (probability that a value is <= 2.5)

Stat::kdeRandom ( array $data , float $h , KdeKernel $kernel = KdeKernel::Normal , ?int $seed = null )

Generate random samples from a Kernel Density Estimate. Returns a Closure that, when called, produces a random float drawn from the KDE distribution defined by the data and bandwidth.

use HiFolks\Statistics\Stat;
use HiFolks\Statistics\Enums\KdeKernel;

$data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2];
$rand = Stat::kdeRandom($data, h: 1.5, seed: 8675309);
$samples = [];
for ($i = 0; $i < 10; $i++) {
    $samples[] = round($rand(), 1);
}
// [2.5, 3.3, -1.8, 7.3, -2.1, 4.6, 4.4, 5.9, -3.2, -1.6]

Using a different kernel:

$rand = Stat::kdeRandom($data, h: 1.5, kernel: KdeKernel::Triangular, seed: 42);
$rand();

Freq class

With Statistics package you can calculate frequency table. A frequency table lists the frequency of various outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences of values within a particular group or interval.

Freq::frequencies( array $data )

use HiFolks\Statistics\Freq;

$fruits = ['🍈', '🍈', '🍈', '🍉','🍉','🍉','🍉','🍉','🍌'];
$freqTable = Freq::frequencies($fruits);
print_r($freqTable);

You can see the frequency table as an array:

Array
(
    [🍈] => 3
    [🍉] => 5
    [🍌] => 1
)

Freq::relativeFrequencies( array $data )

You can retrieve the frequency table in relative format (percentage):

$freqTable = Freq::relativeFrequencies($fruits, 2);
print_r($freqTable);

You can see the frequency table as an array with percentage of the occurrences:

Array
(
    [🍈] => 33.33
    [🍉] => 55.56
    [🍌] => 11.11
)

Freq::frequencyTableBySize( array $data , $size)

If you want to create a frequency table based on class (ranges of values) you can use frequencyTableBySize. The first parameter is the array, and the second one is the size of classes.

Calculate the frequency table with classes. Each group size is 4

$data = [1,1,1,4,4,5,5,5,6,7,8,8,8,9,9,9,9,9,9,10,10,11,12,12,
    13,14,14,15,15,16,16,16,16,17,17,17,18,18, ];
$result = \HiFolks\Statistics\Freq::frequencyTableBySize($data, 4);
print_r($result);
/*
Array
(
    [1] => 5
    [5] => 8
    [9] => 11
    [13] => 9
    [17] => 5
)
 */

Freq::frequencyTable()

If you want to create a frequency table based on class (ranges of values) you can use frequencyTable. The first parameter is the array, and the second one is the number of classes.

Calculate the frequency table with 5 classes.

$data = [1,1,1,4,4,5,5,5,6,7,8,8,8,9,9,9,9,9,9,10,10,11,12,12,
    13,14,14,15,15,16,16,16,16,17,17,17,18,18, ];
$result = \HiFolks\Statistics\Freq::frequencyTable($data, 5);
print_r($result);
/*
Array
(
    [1] => 5
    [5] => 8
    [9] => 11
    [13] => 9
    [17] => 5
)
 */

Statistics class

The methods provided by the Freq and the Stat classes are mainly static methods. If you prefer to use an object instance for calculating statistics you can choose to use an instance of the Statistics class. So for calling the statistics methods, you can use your object instance of the Statistics class.

For example for calculating the mean, you can obtain the Statistics object via the make() static method, and then use the new object $stat like in the following example:

$stat = HiFolks\Statistics\Statistics::make(
    [3,5,4,7,5,2]
);
echo $stat->valuesToString(5) . PHP_EOL;
// 2,3,4,5,5
echo "Mean              : " . $stat->mean() . PHP_EOL;
// Mean              : 4.3333333333333
echo "Count             : " . $stat->count() . PHP_EOL;
// Count             : 6
echo "Median            : " . $stat->median() . PHP_EOL;
// Median            : 4.5
echo "First Quartile  : " . $stat->firstQuartile() . PHP_EOL;
// First Quartile  : 2.5
echo "Third Quartile : " . $stat->thirdQuartile() . PHP_EOL;
// Third Quartile : 5
echo "Mode              : " . $stat->mode() . PHP_EOL;
// Mode              : 5

Calculate Frequency Table

The Statistics packages have some methods for generating Frequency Table:

frequencies(): a frequency is the number of times a value of the data occurs;
relativeFrequencies(): a relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes;
cumulativeFrequencies(): is the accumulation of the previous relative frequencies;
cumulativeRelativeFrequencies(): is the accumulation of the previous relative ratio.

use HiFolks\Statistics\Statistics;

$s = Statistics::make(
    [98, 90, 70,18,92,92,55,83,45,95,88,76]
);
$a = $s->frequencies();
print_r($a);
/*
Array
(
    [18] => 1
    [45] => 1
    [55] => 1
    [70] => 1
    [76] => 1
    [83] => 1
    [88] => 1
    [90] => 1
    [92] => 2
    [95] => 1
    [98] => 1
)
 */

$a = $s->relativeFrequencies();
print_r($a);
/*
Array
(
    [18] => 8.3333333333333
    [45] => 8.3333333333333
    [55] => 8.3333333333333
    [70] => 8.3333333333333
    [76] => 8.3333333333333
    [83] => 8.3333333333333
    [88] => 8.3333333333333
    [90] => 8.3333333333333
    [92] => 16.666666666667
    [95] => 8.3333333333333
    [98] => 8.3333333333333
)
 */

`NormalDist` class

The NormalDist class provides an easy way to work with normal distributions in PHP. It allows you to calculate probabilities and densities for a given mean (μ\muμ) and standard deviation (σ\sigmaσ).

Key features

Define a normal distribution with mean (μ\muμ) and standard deviation (σ\sigmaσ).
Calculate the Probability Density Function (PDF) to evaluate the relative likelihood of a value.
Calculate the Cumulative Distribution Function (CDF) to determine the probability of a value or lower.
Calculate the Inverse Cumulative Distribution Function (inv_cdf) to find the value for a given probability.

Class constructor

$normalDist = new NormalDist(float $mu = 0.0, float $sigma = 1.0);

$mu: The mean (default = 0.0).
$sigma: The standard deviation (default = 1.0).
Throws an exception if $sigma is non-positive.

Methods

Properties: mean, sigma, and variance

You can access the distribution parameters via getter methods:

$normalDist = new NormalDist(100, 15);
$normalDist->getMean();             // 100.0
$normalDist->getSigma();            // 15.0
$normalDist->getMedian();           // 100.0 (equals mean for normal dist)
$normalDist->getMode();             // 100.0 (equals mean for normal dist)
$normalDist->getVariance();         // 225.0 (sigma squared)
$normalDist->getVarianceRounded(2); // 225.0

From samples:

$normalDist = NormalDist::fromSamples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5]);
$normalDist->getVarianceRounded(5); // 0.25767

Creating a normal distribution instance from sample data

The fromSamples() static method creates a normal distribution instance with mu and sigma parameters estimated from the sample data.

Example:

$samples = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5];
$normalDist = NormalDist::fromSamples($samples);
$normalDist->getMeanRounded(5); // 2.71667
$normalDist->getSigmaRounded(5); // 0.50761

Generate random samples `samples($n, $seed)`

Generates $n random samples from the normal distribution using the Box-Muller transform. An optional $seed parameter allows reproducible results.

$normalDist = new NormalDist(100, 15);

// Generate 5 random samples
$samples = $normalDist->samples(5);
// e.g. [98.3, 112.7, 89.1, 105.4, 101.2]

// Reproducible results with a seed
$samples = $normalDist->samples(1000, seed: 42);

Z-score `zscore($x)`

Computes the standard score describing $x in terms of the number of standard deviations above or below the mean: (x - mu) / sigma.

$normalDist = new NormalDist(100, 15);
echo $normalDist->zscore(130);          // 2.0 (two std devs above mean)
echo $normalDist->zscore(85);           // -1.0 (one std dev below mean)
echo $normalDist->zscoreRounded(114, 3); // 0.933

Probability Density Function `pdf($x)`

Calculates the Probability Density Function at a given value xxx:

$normalDist->pdf(float $x): float

Input: the value $x at which to evaluate the PDF.
Output: the relative likelihood of $x in the distribution.

Example:

$normalDist = new NormalDist(10.0, 2.0);
echo $normalDist->pdf(12.0); // Output: 0.12098536225957168

Cumulative Distribution Function `cdf($x)`

Calculates the Cumulative Distribution Function at a given value $x:

$normalDist->cdf(float $x): float

Input: the value $x at which to evaluate the CDF.
Output: the probability that a random variable $x is less than or equal to $x.

Example:

$normalDist = new NormalDist(10.0, 2.0);
echo $normalDist->cdf(12.0); // Output: 0.8413447460685429

Calculating both, CDF and PDF:

$normalDist = new NormalDist(10.0, 2.0);

// Calculate PDF at x = 12
$pdf = $normalDist->pdf(12.0);
echo "PDF at x = 12: $pdf\n"; // Output: 0.12098536225957168

// Calculate CDF at x = 12
$cdf = $normalDist->cdf(12.0);
echo "CDF at x = 12: $cdf\n"; // Output: 0.8413447460685429

Inverse Cumulative Distribution Function `invCdf($p)`

Computes the Inverse Cumulative Distribution Function (also known as the quantile function or percent-point function). Given a probability $p, it finds the value $x such that cdf($x) = $p.

$normalDist->invCdf(float $p): float

Input: a probability $p in the range (0, 1) exclusive.
Output: the value $x where cdf($x) = $p.
Throws an exception if $p is not in (0, 1).

Example:

$normalDist = new NormalDist(0.0, 1.0);

// Find the value at the 95th percentile of a standard normal distribution
echo $normalDist->invCdfRounded(0.95, 5); // Output: 1.64485

// The median of a standard normal distribution
echo $normalDist->invCdf(0.5); // Output: 0.0

The invCdf() method is useful for:

Confidence intervals: find critical values for a given confidence level.
Hypothesis testing: determine thresholds for statistical significance.
Percentile calculations: find the value corresponding to a specific percentile.

Round-trip example with cdf():

$normalDist = new NormalDist(100, 15);

// inv_cdf(0.5) equals the mean
echo $normalDist->invCdf(0.5); // Output: 100.0

// Round-trip: cdf(invCdf(p)) ≈ p
echo $normalDist->cdfRounded($normalDist->invCdf(0.25), 2); // Output: 0.25

Quantiles `quantiles($n)`

Divides the normal distribution into $n continuous intervals with equal probability. Returns a list of $n - 1 cut points separating the intervals. Set $n to 4 for quartiles (the default), $n to 10 for deciles, or $n to 100 for percentiles.

$normalDist = new NormalDist(0.0, 1.0);

// Quartiles (default)
$normalDist->quantiles();    // [-0.6745, 0.0, 0.6745]

// Deciles
$normalDist->quantiles(10);  // 9 cut points

// Percentiles
$normalDist->quantiles(100); // 99 cut points

Overlapping coefficient `overlap($other)`

Computes the overlapping coefficient (OVL) between two normal distributions. Measures the agreement between two normal probability distributions. Returns a value between 0.0 and 1.0 giving the overlapping area in the two underlying probability density functions.

$n1 = new NormalDist(2.4, 1.6);
$n2 = new NormalDist(3.2, 2.0);
echo $n1->overlapRounded($n2, 4); // 0.8035

// Identical distributions overlap completely
$n3 = new NormalDist(0, 1);
echo $n3->overlap($n3); // 1.0

Combining a normal distribution via `add()` method

The add() method allows you to combine a NormalDist instance with either a constant or another NormalDist object. This operation supports mathematical transformations and the combination of distributions.

The use cases are:

Shifting a distribution: add a constant to shift the mean, useful in translating data.
Combining distributions: combine independent or jointly normally distributed variables, commonly used in statistics and probability.

$birth_weights = NormalDist::fromSamples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5]);
$drug_effects = new NormalDist(0.4, 0.15);
$combined = $birth_weights->add($drug_effects);

$combined->getMeanRounded(1); // 3.1
$combined->getSigmaRounded(1); // 0.5

$birth_weights->getMeanRounded(5); // 2.71667
$birth_weights->getSigmaRounded(5); // 0.50761

Scaling a normal distribution by a costant via `multiply()` method

The multiply() method for NormalDist multiplies both the mean (mu) and standard deviation (sigma) by a constant. This method is useful for rescaling distributions, such as when changing measurement units. The standard deviation is scaled by the absolute value of the constant to ensure it remains non-negative.

The method does not modify the existing object but instead returns a new NormalDist instance with the updated values.

Use Cases:

Rescaling distributions: useful when changing units (e.g., from meters to kilometers, or Celsius to Farenhait).
Transforming data: apply proportional scaling to statistical data.

$tempFebruaryCelsius = new NormalDist(5, 2.5); # Celsius
$tempFebFahrenheit = $tempFebruaryCelsius->multiply(9 / 5)->add(32); # Fahrenheit
$tempFebFahrenheit->getMeanRounded(1); // 41.0
$tempFebFahrenheit->getSigmaRounded(1); // 4.5

Subtracting from a normal distribution via `subtract()` method

The subtract() method is the counterpart to add(). It subtracts a constant or another NormalDist instance from this distribution.

A constant (float): shifts the mean down, leaving sigma unchanged.
A NormalDist instance: subtracts the means and combines the variances.

$nd = new NormalDist(100, 15);
$shifted = $nd->subtract(32);
$shifted->getMean();  // 68.0
$shifted->getSigma(); // 15.0 (unchanged)

Dividing a normal distribution by a constant via `divide()` method

The divide() method is the counterpart to multiply(). It divides both the mean (mu) and standard deviation (sigma) by a constant.

// Convert Fahrenheit back to Celsius: (F - 32) / (9/5)
$tempFahrenheit = new NormalDist(41, 4.5);
$tempCelsius = $tempFahrenheit->subtract(32)->divide(9 / 5);
$tempCelsius->getMeanRounded(1);  // 5.0
$tempCelsius->getSigmaRounded(1); // 2.5

References for NormalDist

This class is inspired by Python’s statistics.NormalDist and aims to provide similar functionality for PHP users. (Work in Progress)

`StudentT` class

The StudentT class represents the Student’s t-distribution, which is used for hypothesis testing and confidence intervals when the population standard deviation is unknown, especially with small sample sizes. As the degrees of freedom increase, the t-distribution approaches the standard normal distribution.

Creating a StudentT instance

use HiFolks\Statistics\StudentT;

$t = new StudentT(df: 10); // 10 degrees of freedom

Probability Density Function (PDF)

$t = new StudentT(5);
$t->pdf(0);        // ≈ 0.37961 (peak of the distribution)
$t->pdf(2.0);      // density at t=2
$t->pdfRounded(0); // 0.38

Cumulative Distribution Function (CDF)

$t = new StudentT(5);
$t->cdf(0);    // 0.5 (symmetric around zero)
$t->cdf(2.0);  // ≈ 0.94874
$t->cdfRounded(2.0); // 0.949

Inverse CDF (Quantile Function)

$t = new StudentT(10);
$t->invCdf(0.975);  // ≈ 2.228 (critical value for 95% two-sided test)
$t->invCdf(0.5);    // 0.0 (median)
$t->invCdfRounded(0.975, 3); // 2.228

StreamingStat (Experimental)

Note: StreamingStat is experimental in version 1.x. It will be released as stable in version 2. If you want to provide feedback, we are happy to hear from you — please open an issue at https://github.com/Hi-Folks/statistics/issues.

StreamingStat computes descriptive statistics in a single pass with O(1) memory, ideal for large datasets or generator-based streams.

use HiFolks\Statistics\StreamingStat;

$s = new StreamingStat();
$s->add(1)->add(2)->add(3)->add(4)->add(5);

$s->count();     // 5
$s->sum();       // 15.0
$s->min();       // 1.0
$s->max();       // 5.0
$s->mean();      // 3.0
$s->variance();  // 2.5
$s->stdev();     // 1.5811...
$s->skewness();  // 0.0
$s->kurtosis();  // -1.2

Method	Description	Min n
`count()`	Number of values added	0
`sum()`	Sum of all values	1
`min()`	Minimum value	1
`max()`	Maximum value	1
`mean(?int $round = null)`	Arithmetic mean	1
`variance(?int $round = null)`	Sample variance	2
`pvariance(?int $round = null)`	Population variance	1
`stdev(?int $round = null)`	Sample standard deviation	2
`pstdev(?int $round = null)`	Population standard deviation	1
`skewness(?int $round = null)`	Sample skewness (adjusted Fisher-Pearson)	3
`pskewness(?int $round = null)`	Population skewness	3
`kurtosis(?int $round = null)`	Excess kurtosis (sample)	4

All methods throw InvalidDataInputException when insufficient data is available.

Utility classes

The package includes utility classes under HiFolks\Statistics\Utils for common array and formatting operations.

`Arr` — array helpers

use HiFolks\Statistics\Utils\Arr;

Arr::extract( array $data, array $columns )

Extract one or more columns from an array of associative arrays. Returns one array per requested column.

$runners = [
    ['name' => 'Alice', 'age' => 30, 'score' => 95],
    ['name' => 'Bob',   'age' => 25, 'score' => 87],
];

[$ages, $scores] = Arr::extract($runners, ['age', 'score']);
// $ages = [30, 25], $scores = [95, 87]

Arr::partition( array $data, string $field, string $operator, mixed $value )

Split an array of associative arrays into [$matching, $nonMatching] groups based on a condition. Supported operators: ==, !=, >, <, >=, <=.

[$men, $women] = Arr::partition($runners, 'gender', '==', 'M');
[$seniors, $others] = Arr::partition($runners, 'age', '>=', 40);

Arr::toString( array $data, bool|int $sample = false )

Join array values into a comma-separated string. Pass an integer to limit to the first N values.

Arr::stripZeroes( array $data )

Remove zero values from the array.

`Format` — time formatting

use HiFolks\Statistics\Utils\Format;

Format::secondsToTime( int|float $seconds )

Convert seconds to a human-readable time string.

Format::secondsToTime(4845);  // "1:20:45"

Format::timeToSeconds( string $time )

Parse a time string back to total seconds.

Format::timeToSeconds('1:20:45');  // 4845

Format::secondsToHms( int|float $seconds )

Convert seconds to an associative array with hours, minutes, seconds keys.

Format::secondsToHms(4845);  // ['hours' => 1, 'minutes' => 20, 'seconds' => 45]

Format::hmsToSeconds( int $hours, int $minutes, int $seconds )

Convert hours, minutes, and seconds to total seconds.

Format::hmsToSeconds(1, 20, 45);  // 4845

Testing

composer run test           Runs the test script
composer run test-coverage  Runs the test-coverage script
composer run format         Runs the format script
composer run static-code    Runs the static-code script
composer run all-check      Runs the all-check script

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security Vulnerabilities

Please review our security policy on how to report security vulnerabilities.

Credits

License

The MIT License (MIT). Please see License File for more information.

hi-folks / statistics

Maintainers

Package info

Statistics

Security

README

Statistics PHP package

Installation

Usage

Stat class

Stat::mean( array $data )

Stat::fmean( array $data, array|null $weights = null, int|null $precision = null )

Stat::trimmedMean( array $data, float $proportionToCut = 0.1, ?int $round = null )

Stat::geometricMean( array $data )

Stat::harmonicMean( array $data )

Stat::median( array $data )

Stat::weightedMedian( array $data, array $weights, ?int $round = null )

Stat::medianLow( array $data )

Stat::medianHigh( array $data )

Stat::medianGrouped( array $data, float $interval = 1.0 )

Stat::quantiles( array $data, $n=4, $round=null, $method='exclusive' )

Stat::firstQuartile( array $data, $round=null )

Stat::thirdQuartile( array $data, $round=null )

Stat::percentile( array $data, float $p, ?int $round = null )

Stat::rank( array $data, string $method = Stat::RANK_AVERAGE )

Stat::percentileRank( array $data, int|float $value, string $kind = Stat::PERCENTILE_RANK_WEAK, ?int $round = null )

Stat::pstdev( array $data )

Stat::stdev( array $data )

Stat::sem( array $data, ?int $round = null )

Stat::meanAbsoluteDeviation( array $data, ?int $round = null )

Stat::medianAbsoluteDeviation( array $data, ?int $round = null )

Stat::variance ( array $data, ?int $round = null, int|float|null $xbar = null)

Stat::skewness ( array $data, ?int $round = null )

Stat::kurtosis ( array $data, ?int $round = null )

Stat::coefficientOfVariation( array $data, ?int $round = null, bool $population = false )

Stat::zscores( array $data, ?int $round = null )

Stat::outliers( array $data, float $threshold = 3.0 )

Stat::iqrOutliers( array $data, float $factor = 1.5 )

Stat::covariance ( array $x , array $y )

Stat::correlation ( array $x , array $y, string $method = 'linear' )

Stat::kendallTau ( array $x , array $y, ?int $round = null )

Stat::linearRegression ( array $x , array $y , bool $proportional = false )

Stat::logarithmicRegression( array $x, array $y )

Stat::powerRegression( array $x, array $y )

Stat::exponentialRegression( array $x, array $y )

Stat::rSquared( array $x, array $y, bool $proportional = false, ?int $round = null )

Stat::confidenceInterval( array $data, float $confidenceLevel = 0.95, ?int $round = null )

Stat::zTest( array $data, float $populationMean, Alternative $alternative = Alternative::TwoSided, ?int $round = null )

Stat::tTest( array $data, float $populationMean, Alternative $alternative = Alternative::TwoSided, ?int $round = null )

Stat::tTestTwoSample( array $data1, array $data2, Alternative $alternative = Alternative::TwoSided, ?int $round = null )

Stat::tTestPaired( array $data1, array $data2, Alternative $alternative = Alternative::TwoSided, ?int $round = null )

Stat::chiSquaredTest( array $observed, ?array $expected = null, ?int $round = null )

Stat::chiSquaredIndependence( array $table, ?int $round = null )

Stat::kde ( array $data , float $h , KdeKernel $kernel = KdeKernel::Normal , bool $cumulative = false )

Stat::kdeRandom ( array $data , float $h , KdeKernel $kernel = KdeKernel::Normal , ?int $seed = null )

Freq class

Freq::frequencies( array $data )

Freq::relativeFrequencies( array $data )

Freq::frequencyTableBySize( array $data , $size)

Freq::frequencyTable()

Statistics class

Calculate Frequency Table

NormalDist class

Key features

Class constructor

Methods

Properties: mean, sigma, and variance

Creating a normal distribution instance from sample data

Generate random samples samples($n, $seed)

Z-score zscore($x)

Probability Density Function pdf($x)

Cumulative Distribution Function cdf($x)

Inverse Cumulative Distribution Function invCdf($p)

Quantiles quantiles($n)

Overlapping coefficient overlap($other)

Combining a normal distribution via add() method

Scaling a normal distribution by a costant via multiply() method

Subtracting from a normal distribution via subtract() method

Dividing a normal distribution by a constant via divide() method

References for NormalDist

`NormalDist` class

Generate random samples `samples($n, $seed)`

Z-score `zscore($x)`

Probability Density Function `pdf($x)`

Cumulative Distribution Function `cdf($x)`

Inverse Cumulative Distribution Function `invCdf($p)`

Quantiles `quantiles($n)`

Overlapping coefficient `overlap($other)`

Combining a normal distribution via `add()` method

Scaling a normal distribution by a costant via `multiply()` method

Subtracting from a normal distribution via `subtract()` method

Dividing a normal distribution by a constant via `divide()` method

`StudentT` class

`Arr` — array helpers

`Format` — time formatting