cjuol / statguard
Advanced statistics suite for PHP: robust (IQR, S*, MAD) vs. classic calculations, bias detection, and CSV/JSON export.
Requires
- php: >=8.1
Requires (Dev)
- markrogoyski/math-php: ^2.0
- phpstan/phpstan: ^2.1
- phpunit/phpunit: ^10.5 || ^11.0 || ^12.0
README
StatGuard is a robust statistical analysis suite for PHP focused on scientific precision and data integrity. It compares classic statistics against robust statistics to detect bias, noise, and measurement anomalies in a fully automated way.
Why StatGuard
Outliers are inevitable in telemetry, finance, sports tracking, and lab measurements. A single extreme value can pull the arithmetic mean far from the central mass, which biases decisions that depend on it. StatGuard provides robust estimators (median, MAD, trimmed and winsorized means, Huber M-estimator) that stay stable under contamination so you can trust summaries even when the data is messy.
Highlights
- ClassicStats: Full classic descriptive statistics implementation.
- StatsComparator: The analysis core that evaluates data fidelity and issues a verdict.
- ExportableTrait: First-class CSV and JSON exports for every stats class.
- Traits + Interfaces: Built-in data validation and extensible architecture.
- Independent engines: QuantileEngine and CentralTendencyEngine keep core math isolated and reusable.
- R parity: Quantiles and robust means are validated against R outputs.
Features
- 9 R-compatible quantile types (Hyndman & Fan 1-9).
- Robust means: Huber, winsorized, and trimmed.
Installation
Install via Composer:
composer require cjuol/statguard
Interactive Demo
An Outlier Playground is shipped in web/public/. It renders a dataset, injects synthetic outliers on demand, and shows classic vs. robust estimators (mean, median, Huber, trimmed, winsorized) side by side with a histogram overlay and the StatsComparator verdict.
    # Local dev (native PHP, falls back to docker run if PHP not installed):
    ./scripts/serve-demo.sh    # http://127.0.0.1:8080

    # Or with docker compose (VPS / long-running):
    docker compose up -d       # binds 127.0.0.1:8080 by default
docker compose up -d brings up only the demo service (PHP built-in server on web/public/). Override the bind with STATGUARD_DEMO_BIND=0.0.0.0 STATGUARD_DEMO_PORT=8080 docker compose up -d if you are not fronting it with a reverse proxy. The legacy Apache service is kept under the apache profile (docker compose --profile apache up web).
The UI sends POST /api.php with a JSON body such as {"data": [...], "huberK": 1.345, "trimPercent": 0.1}. The endpoint returns the full summary as JSON, so it can also be used as a standalone backend endpoint.
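As a rough illustration of calling that endpoint from PHP, the sketch below builds the JSON body described above and posts it with the built-in HTTP stream wrapper. The function names `buildDemoRequest` and `fetchDemoSummary` are hypothetical helpers, not part of StatGuard, and the response is simply decoded into an array because its exact keys are not documented here.

```php
<?php
declare(strict_types=1);

/**
 * Build stream-context options for a JSON POST to the demo endpoint.
 * Hypothetical helper: only the payload keys come from the README.
 */
function buildDemoRequest(array $data, float $huberK = 1.345, float $trimPercent = 0.1): array
{
    $payload = json_encode(
        ['data' => array_values($data), 'huberK' => $huberK, 'trimPercent' => $trimPercent],
        JSON_THROW_ON_ERROR
    );

    return [
        'http' => [
            'method'        => 'POST',
            'header'        => "Content-Type: application/json\r\n",
            'content'       => $payload,
            'timeout'       => 5,
            'ignore_errors' => true, // still read the body on 4xx/5xx responses
        ],
    ];
}

/** Post the data to <baseUrl>/api.php and decode the JSON response. */
function fetchDemoSummary(string $baseUrl, array $data): array
{
    $context = stream_context_create(buildDemoRequest($data));
    $body = file_get_contents(rtrim($baseUrl, '/') . '/api.php', false, $context);
    if ($body === false) {
        throw new RuntimeException('Demo endpoint is not reachable.');
    }

    return json_decode($body, true, 512, JSON_THROW_ON_ERROR);
}

// Usage (requires the demo to be running locally):
// $summary = fetchDemoSummary('http://127.0.0.1:8080', [10, 12, 11, 15, 10, 1000]);
```

Keeping the request builder separate from the network call makes the payload easy to inspect without a running server.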
Usage
Robust Estimators (Quick Start)
    use Cjuol\StatGuard\RobustStats;

    $stats = new RobustStats();
    $data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000];

    $huber = $stats->getHuberMean($data);
    $winsorized = $stats->getWinsorizedMean($data, 0.1);
    $iqr = $stats->getIqr($data, RobustStats::TYPE_R_DEFAULT);
Robust estimators stay stable even with extreme outliers:
| Metric | Result | Comment |
|---|---|---|
| Arithmetic Mean | 95.9091 | Pulled up by the outlier |
| Huber Mean | 6.0982 | Stays close to the central mass |
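The arithmetic mean in the table can be checked by hand without the library. The following is a minimal, dependency-free sketch (the function names are illustrative, not StatGuard's API) that computes the mean, the median, and a 10% trimmed mean for the same dataset. The trimmed mean here drops `floor(n * trim)` values from each end, which is how R's `mean(x, trim = ...)` behaves; whether StatGuard trims the same way is an assumption.

```php
<?php
declare(strict_types=1);

function arithmeticMean(array $data): float
{
    return array_sum($data) / count($data);
}

function sampleMedian(array $data): float
{
    sort($data);
    $n = count($data);
    $mid = intdiv($n, 2);

    return $n % 2 === 1
        ? (float) $data[$mid]
        : (float) (($data[$mid - 1] + $data[$mid]) / 2);
}

/** Drop floor(n * trim) observations from each tail, then average the rest. */
function trimmedMeanSketch(array $data, float $trim = 0.1): float
{
    sort($data);
    $n = count($data);
    $cut = (int) floor($n * $trim);
    $kept = array_slice($data, $cut, $n - 2 * $cut);

    return (float) (array_sum($kept) / count($kept));
}

$data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000];

printf("mean    = %.4f\n", arithmeticMean($data));
printf("median  = %.4f\n", sampleMedian($data));
printf("trimmed = %.4f\n", trimmedMeanSketch($data, 0.1));
```

The mean comes out at about 95.9091, matching the table, while the median and the 10% trimmed mean both stay at 6.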
Example: Huber Mean
    use Cjuol\StatGuard\RobustStats;

    $robust = new RobustStats();
    $data = [10, 12, 11, 15, 10, 1000];

    $huber = $robust->getHuberMean($data, 1.345, 50, 0.001);
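To give a feel for what the four arguments (data, tuning constant `k`, iteration cap, tolerance) control, here is a minimal sketch of a Huber M-estimator of location via iteratively reweighted means. It assumes the scale is held fixed at the scaled MAD (MAD × 1.4826) and the iteration starts at the median; StatGuard's internal algorithm may differ, and `huberMeanSketch` is an illustrative name.

```php
<?php
declare(strict_types=1);

function huberSketchMedian(array $values): float
{
    sort($values);
    $n = count($values);
    $mid = intdiv($n, 2);

    return $n % 2 === 1
        ? (float) $values[$mid]
        : (float) (($values[$mid - 1] + $values[$mid]) / 2);
}

function huberMeanSketch(
    array $data,
    float $k = 1.345,
    int $maxIterations = 50,
    float $tolerance = 0.001
): float {
    if ($data === []) {
        throw new InvalidArgumentException('Data must not be empty.');
    }

    // Start at the median; fix the scale at the normal-consistent MAD.
    $mu = huberSketchMedian($data);
    $scale = 1.4826 * huberSketchMedian(array_map(fn ($x) => abs($x - $mu), $data));
    if ($scale === 0.0) {
        return $mu; // degenerate case: more than half the values are identical
    }

    $cutoff = $k * $scale;
    for ($i = 0; $i < $maxIterations; $i++) {
        $weightedSum = 0.0;
        $weightTotal = 0.0;
        foreach ($data as $x) {
            $residual = abs($x - $mu);
            // Full weight inside the cutoff, down-weighted proportionally outside it.
            $w = $residual <= $cutoff ? 1.0 : $cutoff / $residual;
            $weightedSum += $w * $x;
            $weightTotal += $w;
        }
        $next = $weightedSum / $weightTotal;
        if (abs($next - $mu) < $tolerance) {
            return $next;
        }
        $mu = $next;
    }

    return $mu;
}

printf("%.4f\n", huberMeanSketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]));
```

On the quick-start dataset this sketch lands near 6.098, in line with the Huber value in the table above: the outlier at 1000 receives a weight of roughly 0.006 instead of 1.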
Example: Winsorized Mean (R-Compatible Quantile Type)
    use Cjuol\StatGuard\RobustStats;

    $robust = new RobustStats();
    $data = [10, 12, 11, 15, 10, 1000];

    // Type 7 matches R's default quantile() behavior.
    $winsorized = $robust->getWinsorizedMean($data, 0.1, 7);
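The role of the quantile type becomes clearer with a small standalone sketch. It assumes winsorizing means clamping every value to the type 7 quantiles at `trim` and `1 - trim` before averaging; the `$type` argument suggests a quantile-based rule, but the exact rule used by StatGuard is not stated here, and the function names below are illustrative.

```php
<?php
declare(strict_types=1);

/** Type 7 quantile of an already sorted list (linear interpolation). */
function quantileType7(array $sorted, float $p): float
{
    $n = count($sorted);
    $h = ($n - 1) * $p;            // 0-based fractional index
    $lo = (int) floor($h);
    $hi = min($lo + 1, $n - 1);

    return $sorted[$lo] + ($h - $lo) * ($sorted[$hi] - $sorted[$lo]);
}

function winsorizedMeanSketch(array $data, float $trim = 0.1): float
{
    if ($data === [] || $trim < 0.0 || $trim >= 0.5) {
        throw new InvalidArgumentException('Need non-empty data and 0 <= trim < 0.5.');
    }

    sort($data);
    $lower = quantileType7($data, $trim);
    $upper = quantileType7($data, 1.0 - $trim);

    // Pull values outside [lower, upper] back to the nearest limit.
    $clamped = array_map(fn ($x) => max($lower, min($upper, $x)), $data);

    return (float) (array_sum($clamped) / count($clamped));
}

printf("%.2f\n", winsorizedMeanSketch([10, 12, 11, 15, 10, 1000], 0.1));
```

With only six observations the 90% type 7 quantile is interpolated halfway between 15 and 1000 (507.5), so the outlier is capped rather than removed and the sketch returns 94.25. A different quantile type would move that upper limit, which is why the type matters for small samples.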
Comparator (Bias Detection)
    use Cjuol\StatGuard\StatsComparator;

    $comparator = new StatsComparator();
    $data = [10, 12, 11, 15, 10, 1000];

    $analysis = $comparator->analyze($data);
    echo $analysis['verdict'];
    // ALERT: Data is highly influenced by outliers. Use robust metrics.
Instant Export
    use Cjuol\StatGuard\RobustStats;

    $robust = new RobustStats();
    $data = [10, 12, 11, 15, 10, 1000]; // same sample as the examples above

    file_put_contents('report.csv', $robust->toCsv($data));
    echo $robust->toJson($data);
Summary Keys (Classic vs Robust)
Classic summary keys:
[ 'mean', 'median', 'stdDev', 'sampleVariance', 'cv', 'outliersZScore', 'count' ]
Robust summary keys:
[ 'mean', 'median', 'robustDeviation', 'robustVariance', 'robustCv', 'iqr', 'mad', 'outliers', 'confidenceIntervals', 'count' ]
Metrics Comparison
| Metric | ClassicStats | RobustStats | Outlier Impact |
|---|---|---|---|
| Center | Mean | Median | High in classic |
| Dispersion | Standard Deviation | MAD (Scaled) | Extreme in classic |
| Variability | CV% | Robust CV% | Very high in classic |
| Exportable | ✅ Yes | ✅ Yes | - |
R Quantile Types (1-9)
StatGuard matches R v4.x quantile definitions. The table below summarizes the nine Hyndman & Fan (1996) types supported by quantile().
| Type | $p_k$ | $a$ | $b$ | Notes |
|---|---|---|---|---|
| 1 | $k / n$ | 0 | 0 | Inverse of empirical CDF (discontinuous). |
| 2 | $k / n$ | 0 | 0 | Averaged at discontinuities. |
| 3 | $(k - 0.5) / n$ | -0.5 | 0 | Nearest order statistic. |
| 4 | $k / n$ | 0 | 1 | Linear interpolation of CDF. |
| 5 | $(k - 0.5) / n$ | 0.5 | 0.5 | Hazen (1914). |
| 6 | $k / (n + 1)$ | 0 | 0 | Weibull (1939). |
| 7 | $(k - 1) / (n - 1)$ | 1 | 1 | R default, mode of $F(x)$. |
| 8 | $(k - 1/3) / (n + 1/3)$ | 1/3 | 1/3 | Median-unbiased. |
| 9 | $(k - 3/8) / (n + 1/4)$ | 3/8 | 3/8 | Normal-unbiased. |
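For the continuous types 4 to 9, the parameters $a$ and $b$ enter through $p_k = (k - a) / (n - a - b + 1)$, and the quantile is a linear interpolation between neighbouring order statistics. The sketch below follows the same position formula R uses for these types, without R's small floating-point fuzz correction. It is an illustration only: `hfQuantile` is a hypothetical name, not the signature of StatGuard's QuantileEngine, and the discontinuous types 1 to 3 are left out.

```php
<?php
declare(strict_types=1);

/** Hyndman & Fan quantile for the continuous types 4..9. */
function hfQuantile(array $data, float $p, int $type = 7): float
{
    // (a, b) per type, so that p_k = (k - a) / (n - a - b + 1).
    $params = [
        4 => [0.0, 1.0],
        5 => [0.5, 0.5],
        6 => [0.0, 0.0],
        7 => [1.0, 1.0],
        8 => [1 / 3, 1 / 3],
        9 => [3 / 8, 3 / 8],
    ];
    if (!isset($params[$type])) {
        throw new InvalidArgumentException('This sketch only covers types 4 to 9.');
    }
    if ($data === [] || $p < 0.0 || $p > 1.0) {
        throw new InvalidArgumentException('Need non-empty data and 0 <= p <= 1.');
    }

    [$a, $b] = $params[$type];
    sort($data);
    $n = count($data);

    // 1-based fractional position of the requested quantile.
    $pos = $a + $p * ($n + 1 - $a - $b);
    $j = (int) floor($pos);
    $h = $pos - $j;

    // Clamp both neighbours into the valid 1..n range, then interpolate.
    $lower = $data[max(1, min($n, $j)) - 1];
    $upper = $data[max(1, min($n, $j + 1)) - 1];

    return $lower + $h * ($upper - $lower);
}

$sample = range(1, 10);
foreach ([4, 5, 6, 7, 8, 9] as $type) {
    printf("type %d: %.4f\n", $type, hfQuantile($sample, 0.75, $type));
}
```

For the integers 1 to 10 at p = 0.75, type 4 gives 7.5, type 7 gives 7.75, and type 6 gives 8.25. The spread between types shrinks as n grows, which is consistent with the small differences in the 100,000-value benchmark further down.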
Implemented Methods
ClassicStats
- getMean(array $data): float
- getMedian(array $data): float
- getDeviation(array $data): float
- getStandardDeviation(array $data): float
- getCoefficientOfVariation(array $data): float
- getSampleVariance(array $data): float
- getPopulationVariance(array $data): float
- getOutliers(array $data): array
- getSummary(array $data, bool $sort = true, int $decimals = 2): array
- toJson(array $data, int $options = JSON_PRETTY_PRINT): string
- toCsv(array $data, string $delimiter = ","): string
RobustStats
- getMean(array $data): float
- getMedian(array $data): float
- getDeviation(array $data): float
- getCoefficientOfVariation(array $data): float
- getRobustDeviation(array $data): float
- getRobustCv(array $data): float
- getRobustVariance(array $data): float
- getIqr(array $data): float
- getMad(array $data): float
- getOutliers(array $data): array
- getConfidenceIntervals(array $data): array
- getTrimmedMean(array $data, float $trimPercentage = 0.1): float
- getWinsorizedMean(array $data, float $trimPercentage = 0.1, int $type = 7): float
- getHuberMean(array $data, float $k = 1.345, int $maxIterations = 50, float $tolerance = 0.001): float
- getSummary(array $data, bool $sort = true, int $decimals = 2): array
- toJson(array $data, int $options = JSON_PRETTY_PRINT): string
- toCsv(array $data, string $delimiter = ","): string
StatsComparator
- __construct(?RobustStats $robust = null, ?ClassicStats $classic = null)
- analyze(array $data, int $decimals = 2): array
Mathematical Basis
Scaled Robust Deviation
To keep comparisons fair, MAD is scaled to be comparable to standard deviation under normal distributions:
$$\sigma_{robust} = MAD \times 1.4826$$
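The constant 1.4826 is the usual normal-consistency factor. As a short derivation, for $X \sim N(\mu, \sigma^2)$ half of the probability mass lies within one MAD of the center, so with $\Phi$ the standard normal CDF:

$$P\left(|X - \mu| \le MAD\right) = \tfrac{1}{2} \;\Rightarrow\; \frac{MAD}{\sigma} = \Phi^{-1}(0.75) \approx 0.6745$$

$$\sigma \approx \frac{MAD}{0.6745} \approx 1.4826 \times MAD$$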
Robust Coefficient of Variation ($CV_r$)
Calculated over the median to avoid a single extreme value inflating volatility:
$$CV_r = \left( \frac{\sigma_{robust}}{|\tilde{x}|} \right) \times 100$$
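The two formulas above can be traced with a few lines of plain PHP. This is a sketch of the arithmetic only, with illustrative function names; it does not reproduce StatGuard's input validation or rounding, and returning NAN for a zero median is a choice made for this example.

```php
<?php
declare(strict_types=1);

function spreadSketchMedian(array $values): float
{
    sort($values);
    $n = count($values);
    $mid = intdiv($n, 2);

    return $n % 2 === 1
        ? (float) $values[$mid]
        : (float) (($values[$mid - 1] + $values[$mid]) / 2);
}

/** Median, MAD, scaled robust deviation, and robust CV (in percent). */
function robustSpreadSketch(array $data): array
{
    $median = spreadSketchMedian($data);
    $mad = spreadSketchMedian(array_map(fn ($x) => abs($x - $median), $data));
    $sigmaRobust = 1.4826 * $mad;
    // CV_r is undefined when the median is zero; NAN signals that here.
    $cvRobust = $median !== 0.0 ? $sigmaRobust / abs($median) * 100 : NAN;

    return [
        'median'      => $median,
        'mad'         => $mad,
        'sigmaRobust' => $sigmaRobust,
        'cvRobust'    => $cvRobust,
    ];
}

$spread = robustSpreadSketch([10, 12, 11, 15, 10, 1000]);
printf(
    "median=%.2f mad=%.2f sigma_robust=%.4f cv_r=%.2f%%\n",
    $spread['median'],
    $spread['mad'],
    $spread['sigmaRobust'],
    $spread['cvRobust']
);
```

For this sample the median is 11.5 and the MAD is 1.5, giving a robust deviation of about 2.22 and a robust CV of about 19.3%, even though one of the six values is 1000.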
R Compatibility & Accuracy
Every public statistic is tested against R v4.x outputs to ensure scientific accuracy. Quantile calculations use Type 7 by default (the same default as quantile() in R), and robust central tendency methods (trimmed mean, winsorized mean, Huber M-estimator) are verified via R comparison scripts in the repository.
Docker Profiles (Optional R Validation)
StatGuard does not require R for normal usage. The default container is lightweight and focused on PHP development. For scientific auditing, you can enable the r-validation profile to run the R comparison script.
    # Default dev container (no R runtime)
    docker compose up -d

    # Run tests in the default container
    composer run test

    # Run R validation in the heavy profile
    composer run validate-r
Performance Benchmarks (StatGuard vs MathPHP vs R)
Up to 5x faster than MathPHP in median calculations.
20x faster than MathPHP in robust mean estimation.
Dataset: 100,000 random floats. Benchmarks executed in the Docker performance profile using docker compose --profile performance run --rm benchmark report. R timings use system.time() and only measure computation (file load excluded).
Use json only when you need the shield data output (it does not update the markdown tables).
Scientific Parity (vs R)
Status shows ✅ when the absolute difference between StatGuard and R is below 0.0001.
Generate or refresh the table with php tests/BenchmarkStatGuard.php report.
| Method | StatGuard ms | StatGuard value | MathPHP ms | MathPHP value | R ms | R value | Status |
|---|---|---|---|---|---|---|---|
| Median | 15.23 | 499.249 | 71.69 | 499.249 | 1.00 | 499.249 | ✅ |
| Quantile Type 1 (p=0.75) | 14.79 | 747.736 | 14.69 | 747.7385 | 1.00 | 747.736 | ✅ |
| Quantile Type 2 (p=0.75) | 14.36 | 747.741 | 15.37 | 747.7385 | 1.00 | 747.741 | ✅ |
| Quantile Type 3 (p=0.75) | 14.81 | 747.736 | 15.99 | 747.7385 | 2.00 | 747.736 | ✅ |
| Quantile Type 4 (p=0.75) | 14.75 | 747.736 | 15.02 | 747.7385 | 1.00 | 747.736 | ✅ |
| Quantile Type 5 (p=0.75) | 13.99 | 747.741 | 14.72 | 747.7385 | 1.00 | 747.741 | ✅ |
| Quantile Type 6 (p=0.75) | 13.67 | 747.7435 | 14.42 | 747.7385 | 1.00 | 747.7435 | ✅ |
| Quantile Type 7 (p=0.75) | 14.03 | 747.7385 | 15.12 | 747.7385 | 1.00 | 747.7385 | ✅ |
| Quantile Type 8 (p=0.75) | 13.75 | 747.741833 | 15.03 | 747.7385 | 2.00 | 747.7418 | ✅ |
| Quantile Type 9 (p=0.75) | 14.10 | 747.741625 | 15.15 | 747.7385 | 2.00 | 747.7416 | ✅ |
| Huber mean | 33.00 | 499.174389 | 37.83 | 499.243589 | 8.00 | 499.18 | ❌ |
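The Status column follows directly from the 0.0001 rule stated above. A minimal sketch of that check, using a hypothetical `parityStatus` helper rather than the benchmark script's own code, shows why the Huber row fails while the quantile rows pass:

```php
<?php
declare(strict_types=1);

/** Pass when the absolute difference to R is below the threshold. */
function parityStatus(float $statGuard, float $r, float $threshold = 0.0001): string
{
    return abs($statGuard - $r) < $threshold ? '✅' : '❌';
}

// Values copied from the table above.
echo 'Quantile Type 8: ', parityStatus(747.741833, 747.7418), "\n";
echo 'Huber mean:      ', parityStatus(499.174389, 499.18), "\n";
```

The Type 8 difference is about 0.00003, below the threshold, whereas the Huber values differ by roughly 0.0056.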
| Metric (100k) | StatGuard ms | MathPHP ms | R ms | Ratio (PHP/R) |
|---|---|---|---|---|
| Median | 15.8 | 76.5 | 2.00 | 7.92 |
| Quantile Type 7 (p=0.75) | 16.2 | 16.0 | 2.00 | 8.09 |
| Huber mean | 34.8 | 788.7 | 10.00 | 3.48 |
Precision check (Huber): $\Delta = 0.0056111266$ for $n = 100000$ (warning threshold $10^{-10}$). Smaller datasets showed higher deltas, which are reported by the benchmark warnings.
Consistent results with R core within 0.01% tolerance on the benchmark scale (0-1000).
Tests and Quality
Validated with PHPUnit for full coverage of calculations and data validation.
./vendor/bin/phpunit tests
License
This project is licensed under the MIT License. See LICENSE for details.
Built with ❤️ by cjuol.