brucetruth / php-idea
A production-ready machine learning library for PHP
pkg:composer/brucetruth/php-idea
Requires
- php: ^8.2
- ext-json: *
Requires (Dev)
- phpstan/phpstan: ^1.12
- phpunit/phpunit: ^11.5
This package is auto-updated.
Last update: 2026-02-16 15:02:28 UTC
README
ml-idea is a modern, production-oriented machine learning library for PHP focused on clean APIs,
strict typing, and practical classification workflows.
PHP has been looked down on and declared dead since 2000, yet the elephant keeps moving.
Features
- PHP 8.2+ with strict types
- Consistent classifier contract (`train`, `predict`, `predictBatch`)
- Production-ready baseline classifiers:
  - `KNearestNeighbors`
  - `LogisticRegression` (binary classification)
  - `GaussianNaiveBayes`
- Model persistence (`ModelSerializer`)
- Data splitting utility (`TrainTestSplit`)
- Evaluation metrics (`accuracy`, `precision`, `recall`, `f1Score`)
- Advanced evaluation metrics: `rocAuc`, `prAuc`, `logLoss`, `brierScore`, `matthewsCorrcoef`, `meanAbsolutePercentageError`
- Preprocessing transformers (`StandardScaler`, `MinMaxScaler`)
- Workflow tools (`PipelineClassifier`, `KFold` cross-validation splits)
- Extra splitters: `StratifiedKFold`, `TimeSeriesSplit`
- Cross-validation helpers: `CrossValidation::crossValScore*`, `CrossValidation::crossValPredict*`
- Probability calibration + threshold optimization: `CalibratedClassifierCV`, `ThresholdTuner`
- Regression support (`LinearRegression`, `RegressionMetrics`)
- Advanced modules: `PCA`, `MiniBatchKMeans`, `TfidfVectorizer`
- Vision module foundations: generic image feature extraction + color palette analysis + skin-tone risk heuristics
- Vision authenticity heuristic: AI-generation risk scoring from metadata and statistical image signals
- NLP foundation (Phase 1): fluent Text API, unicode tokenization with offsets, PII redaction, rule-based POS tagging
- NLP Phase 2: language detection, keyword extraction (RAKE), BM25 retrieval, hashing vectorizer, similarity utilities, and NLP RAG helpers
- NLP advanced tagging: multilingual rule-based POS, extensible language profiles, and rule-based NER
- GEO service + ML-GEO helpers: country/state/city lookup, nearest-place search, and geo feature building
- Managed dataset assets: registry, integrity checks, licenses metadata, and compiled indexes (trie/automaton/kd-tree)
- RAG foundations: embedders (`OpenAI`, `AzureOpenAI`, `Ollama`), splitters, retriever, and multiple vector stores (in-memory, JSON, SQLite)
- RAG LLM clients for QA generation: `Echo`, `OpenAI`, `Azure OpenAI`, and `Ollama` (direct or `LlmClientFactory::fromEnv()`)
- Advanced RAG workflow: document loaders, hybrid retrieval, rerankers, citations/diagnostics, vector-index persistence, tool-calling + streaming hooks
- AI agents + tool routing: `ToolCallingAgent`, `ToolRoutingAgent`, deterministic/local routing, and provider-backed routing (OpenAI/Azure/Ollama/custom)
- Unified core contracts (v1.4): `fit`/`predict`, probabilistic, online-learning, serializable model interfaces
- Hyperparameter lifecycle helpers: `getParams`, `setParams`, `cloneWithParams`, random-state aware models
- PHPUnit test suite + CI workflow
- Static analysis support with PHPStan
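To make the `train` / `predict` / `predictBatch` contract from the feature list concrete, here is a minimal standalone sketch. The `SketchClassifier` interface and the `NearestCentroid` class are hypothetical names invented for this illustration; they are not part of the library, and the real interfaces may differ in naming and typing.

```php
<?php
declare(strict_types=1);

// Hypothetical interface for illustration only; not the library's actual contract.
interface SketchClassifier
{
    public function train(array $samples, array $labels): void;

    public function predict(array $sample): string;

    public function predictBatch(array $samples): array;
}

// Nearest-centroid classifier: predicts the label whose mean feature vector is closest.
final class NearestCentroid implements SketchClassifier
{
    /** @var array<string, array<int, float>> */
    private array $centroids = [];

    public function train(array $samples, array $labels): void
    {
        $sums = [];
        $counts = [];
        foreach ($samples as $i => $sample) {
            $label = $labels[$i];
            $counts[$label] = ($counts[$label] ?? 0) + 1;
            foreach ($sample as $j => $value) {
                $sums[$label][$j] = ($sums[$label][$j] ?? 0.0) + $value;
            }
        }
        $this->centroids = [];
        foreach ($sums as $label => $sum) {
            $this->centroids[$label] = array_map(
                fn (float $s): float => $s / $counts[$label],
                $sum
            );
        }
    }

    public function predict(array $sample): string
    {
        if ($this->centroids === []) {
            throw new LogicException('Call train() before predict().');
        }
        $best = '';
        $bestDist = INF;
        foreach ($this->centroids as $label => $centroid) {
            // Squared Euclidean distance to this class centroid.
            $dist = 0.0;
            foreach ($centroid as $j => $c) {
                $dist += ($sample[$j] - $c) ** 2;
            }
            if ($dist < $bestDist) {
                $bestDist = $dist;
                $best = (string) $label;
            }
        }
        return $best;
    }

    public function predictBatch(array $samples): array
    {
        return array_map(fn (array $s): string => $this->predict($s), $samples);
    }
}

$model = new NearestCentroid();
$model->train([[1, 1], [1, 2], [4, 4], [5, 5]], ['A', 'A', 'B', 'B']);
$batch = $model->predictBatch([[0, 1], [6, 5]]);
echo implode(',', $batch), "\n"; // prints "A,B"
```

Keeping every model behind the same three methods is what lets scalers, pipelines, and cross-validation helpers treat classifiers interchangeably.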
Installation
composer require brucetruth/php-idea
Quick Start
<?php

declare(strict_types=1);

require_once 'vendor/autoload.php';

use ML\IDEA\Classifiers\KNearestNeighbors;
use ML\IDEA\Data\TrainTestSplit;
use ML\IDEA\Metrics\ClassificationMetrics;
use ML\IDEA\Preprocessing\StandardScaler;

// Toy dataset: two well-separated clusters labeled A and B.
$samples = [[1, 1], [1, 2], [2, 1], [4, 4], [5, 5], [4, 5]];
$labels = ['A', 'A', 'A', 'B', 'B', 'B'];

// Hold out roughly a third of the data; the seed makes the split reproducible.
$split = TrainTestSplit::split($samples, $labels, testSize: 0.33, seed: 42);

// Fit the scaler on the training split only, then reuse it for the test split.
$scaler = new StandardScaler();
$xTrain = $scaler->fitTransform($split['xTrain']);
$xTest = $scaler->transform($split['xTest']);

$model = new KNearestNeighbors(k: 3, weighted: true);
$model->train($xTrain, $split['yTrain']);

$predictions = $model->predictBatch($xTest);
$accuracy = ClassificationMetrics::accuracy($split['yTest'], $predictions);

echo "Accuracy: " . round($accuracy * 100, 2) . "%\n";
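The Quick Start fits the scaler on the training split and only applies it to the test split. The sketch below shows why that order matters, using a plain z-score standardizer written from scratch. The functions `fitStandardizer` and `applyStandardizer` are hypothetical helpers for this illustration, not the library's `StandardScaler`, and the use of the population standard deviation is an assumption.

```php
<?php
declare(strict_types=1);

// Illustrative z-score scaler; not the library's StandardScaler implementation.
function fitStandardizer(array $rows): array
{
    $n = count($rows);
    $dims = count($rows[0]);
    $mean = array_fill(0, $dims, 0.0);
    $var = array_fill(0, $dims, 0.0);

    // Column means.
    foreach ($rows as $row) {
        foreach ($row as $j => $v) {
            $mean[$j] += $v;
        }
    }
    $mean = array_map(fn (float $sum): float => $sum / $n, $mean);

    // Column variances (population variance, an assumption of this sketch).
    foreach ($rows as $row) {
        foreach ($row as $j => $v) {
            $var[$j] += (($v - $mean[$j]) ** 2) / $n;
        }
    }

    // Constant columns fall back to 1.0 to avoid division by zero.
    $std = array_map(fn (float $x): float => $x > 0.0 ? sqrt($x) : 1.0, $var);

    return ['mean' => $mean, 'std' => $std];
}

function applyStandardizer(array $rows, array $params): array
{
    return array_map(
        fn (array $row): array => array_map(
            fn ($v, float $m, float $s): float => ($v - $m) / $s,
            $row,
            $params['mean'],
            $params['std']
        ),
        $rows
    );
}

$train = [[1, 10], [3, 30], [5, 50]];
$test = [[3, 10]];

// Statistics come from the training rows only, so no test information leaks in.
$params = fitStandardizer($train);
$scaledTrain = applyStandardizer($train, $params);
$scaledTest = applyStandardizer($test, $params);

printf("%.4f %.4f\n", $scaledTest[0][0], $scaledTest[0][1]);
```

Refitting the scaler on the test rows would shift the features onto a different scale than the one the model was trained on, which usually inflates or distorts the measured accuracy.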
Model Persistence
use ML\IDEA\Model\ModelSerializer;

ModelSerializer::save($model, __DIR__ . '/knn.model.json');
$loadedModel = ModelSerializer::load(__DIR__ . '/knn.model.json');
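The `.json` file name suggests that models are stored as JSON. As a rough sketch of what a JSON round trip involves in PHP, the example below saves and restores a made-up model state with the standard `json_encode` / `json_decode` functions. The `$state` layout is purely hypothetical; the actual file format written by `ModelSerializer` is not described here.

```php
<?php
declare(strict_types=1);

// Hypothetical model state for illustration; not the real ModelSerializer layout.
$state = [
    'type' => 'KNearestNeighbors',
    'params' => ['k' => 3, 'weighted' => true],
    'samples' => [[1.0, 1.0], [4.0, 5.0]],
    'labels' => ['A', 'B'],
];

$path = tempnam(sys_get_temp_dir(), 'model_');

// JSON_PRESERVE_ZERO_FRACTION keeps 1.0 as "1.0" so it decodes back to a float, not an int.
$json = json_encode(
    $state,
    JSON_THROW_ON_ERROR | JSON_PRETTY_PRINT | JSON_PRESERVE_ZERO_FRACTION
);
file_put_contents($path, $json);

// Decode into associative arrays and fail loudly on malformed input.
$restored = json_decode((string) file_get_contents($path), true, 512, JSON_THROW_ON_ERROR);
unlink($path);

$roundTripOk = ($restored === $state);
echo $roundTripOk ? "round trip ok\n" : "round trip changed the data\n";
```

Numeric type drift (floats silently becoming integers) is a common pitfall when persisting feature data as JSON, which is why the sketch compares with strict `===`.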
Advanced v1.2 Examples
1) Pipeline + KFold
use ML\IDEA\Classifiers\KNearestNeighbors;
use ML\IDEA\Data\KFold;
use ML\IDEA\Pipeline\PipelineClassifier;
use ML\IDEA\Preprocessing\StandardScaler;

$samples = [[1, 1], [1, 2], [2, 1], [4, 4], [5, 5], [4, 5]];
$labels = ['A', 'A', 'A', 'B', 'B', 'B'];

$folds = KFold::split(count($samples), nSplits: 3, shuffle: true, seed: 42);

foreach ($folds as $fold) {
    $xTrain = $yTrain = $xTest = $yTest = [];

    foreach ($fold['train'] as $i) {
        $xTrain[] = $samples[$i];
        $yTrain[] = $labels[$i];
    }

    foreach ($fold['test'] as $i) {
        $xTest[] = $samples[$i];
        $yTest[] = $labels[$i];
    }

    $model = new PipelineClassifier([new StandardScaler()], new KNearestNeighbors(3, true));
    $model->train($xTrain, $yTrain);
    $pred = $model->predictBatch($xTest);
}
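The example above consumes folds shaped as `['train' => [...], 'test' => [...]]` index lists. The following sketch shows one way such index lists can be produced. `kFoldIndices` is a hypothetical helper written for this illustration; it omits shuffling and seeding, and it is not the library's `KFold` implementation.

```php
<?php
declare(strict_types=1);

/**
 * Split indices 0..n-1 into contiguous folds (no shuffling in this sketch).
 *
 * @return list<array{train: list<int>, test: list<int>}>
 */
function kFoldIndices(int $nSamples, int $nSplits): array
{
    if ($nSplits < 2 || $nSplits > $nSamples) {
        throw new InvalidArgumentException('nSplits must be between 2 and nSamples.');
    }

    $indices = range(0, $nSamples - 1);
    $base = intdiv($nSamples, $nSplits);
    $extra = $nSamples % $nSplits; // the first $extra folds get one more sample

    $folds = [];
    $start = 0;
    for ($f = 0; $f < $nSplits; $f++) {
        $size = $base + ($f < $extra ? 1 : 0);
        $test = array_slice($indices, $start, $size);
        // Everything that is not in the test slice is used for training.
        $train = array_values(array_diff($indices, $test));
        $folds[] = ['train' => $train, 'test' => $test];
        $start += $size;
    }

    return $folds;
}

$folds = kFoldIndices(7, 3);
foreach ($folds as $i => $fold) {
    echo "fold $i test: ", implode(',', $fold['test']), "\n";
}
// fold 0 test: 0,1,2
// fold 1 test: 3,4
// fold 2 test: 5,6
```

Each sample lands in exactly one test fold, so averaging the per-fold scores uses every observation for evaluation exactly once.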
2) Linear Regression
use ML\IDEA\Regression\LinearRegression;
use ML\IDEA\Metrics\RegressionMetrics;

$x = [[1.0], [2.0], [3.0], [4.0]];
$y = [2.0, 4.0, 6.0, 8.0];

$reg = new LinearRegression(learningRate: 0.05, iterations: 5000);
$reg->train($x, $y);

$pred = $reg->predictBatch($x);
echo RegressionMetrics::rootMeanSquaredError($y, $pred);
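The `learningRate` and `iterations` parameters suggest an iterative fit such as gradient descent, although that is an assumption about the internals. The sketch below fits the same toy data with plain batch gradient descent on mean squared error and reports the RMSE. `fitLine` and `rmse` are hypothetical helpers for this illustration and handle a single feature only.

```php
<?php
declare(strict_types=1);

/** Fit y ~ w * x + b by batch gradient descent on mean squared error. */
function fitLine(array $x, array $y, float $learningRate, int $iterations): array
{
    $n = count($x);
    $w = 0.0;
    $b = 0.0;

    for ($it = 0; $it < $iterations; $it++) {
        $gradW = 0.0;
        $gradB = 0.0;
        foreach ($x as $i => $xi) {
            $error = ($w * $xi + $b) - $y[$i];
            // Partial derivatives of the mean squared error.
            $gradW += 2.0 * $error * $xi / $n;
            $gradB += 2.0 * $error / $n;
        }
        $w -= $learningRate * $gradW;
        $b -= $learningRate * $gradB;
    }

    return ['w' => $w, 'b' => $b];
}

/** Root mean squared error between actual and predicted values. */
function rmse(array $actual, array $predicted): float
{
    $sum = 0.0;
    foreach ($actual as $i => $a) {
        $sum += ($a - $predicted[$i]) ** 2;
    }
    return sqrt($sum / count($actual));
}

$x = [1.0, 2.0, 3.0, 4.0];
$y = [2.0, 4.0, 6.0, 8.0];

$fit = fitLine($x, $y, learningRate: 0.05, iterations: 5000);
$pred = array_map(fn (float $xi): float => $fit['w'] * $xi + $fit['b'], $x);
$error = rmse($y, $pred);

printf("w=%.3f b=%.3f rmse=%.6f\n", $fit['w'], $fit['b'], $error);
```

On this data the fit should approach a slope of 2 and an intercept of 0. A learning rate that is too large for the feature scale makes the updates diverge, which is one reason to standardize inputs before gradient-based training.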
3) Text Embedding (TF-IDF)
use ML\IDEA\NLP\TfidfVectorizer;

$docs = ['machine learning in php', 'php library for intelligence'];

$vectorizer = new TfidfVectorizer();
$matrix = $vectorizer->fitTransform($docs);
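To show what a TF-IDF matrix contains, here is a small from-scratch sketch over the same two documents. The weighting is an assumption chosen for the illustration (term frequency divided by document length, and a smoothed inverse document frequency of `ln((1 + N) / (1 + df)) + 1`); the library's `TfidfVectorizer` may tokenize, weight, or normalize differently. The `tfidf` function is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Assumed weighting: tf = term count / document length,
 * idf = ln((1 + N) / (1 + df)) + 1. Other variants exist.
 *
 * @param list<string> $docs
 * @return array{vocab: list<string>, matrix: list<list<float>>}
 */
function tfidf(array $docs): array
{
    // ASCII-only tokenization keeps the sketch free of extension dependencies.
    $tokenized = array_map(
        fn (string $d): array => preg_split('/[^a-z0-9]+/', strtolower($d), -1, PREG_SPLIT_NO_EMPTY),
        $docs
    );

    // Document frequency: in how many documents each term appears.
    $df = [];
    foreach ($tokenized as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $vocab = array_map('strval', array_keys($df));
    sort($vocab, SORT_STRING);

    $n = count($docs);
    $matrix = [];
    foreach ($tokenized as $tokens) {
        $counts = array_count_values($tokens);
        $length = max(count($tokens), 1);
        $row = [];
        foreach ($vocab as $term) {
            $tf = ($counts[$term] ?? 0) / $length;
            $idf = log((1 + $n) / (1 + $df[$term])) + 1.0;
            $row[] = (float) ($tf * $idf);
        }
        $matrix[] = $row;
    }

    return ['vocab' => $vocab, 'matrix' => $matrix];
}

$docs = ['machine learning in php', 'php library for intelligence'];
$result = tfidf($docs);

foreach ($result['vocab'] as $j => $term) {
    printf("%-13s %.4f %.4f\n", $term, $result['matrix'][0][$j], $result['matrix'][1][$j]);
}
```

The term `php` occurs in both documents, so it receives the lowest inverse document frequency, while terms unique to one document score higher there and zero in the other.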
Development
composer install
composer test
composer analyse
Examples
See runnable use-case scripts in examples/:
- basic classification flow
- CV + advanced metrics
- probability calibration + threshold tuning
- regression pipelines
- text features + clustering
- hyperparameter search
- RAG local chain + vector-store examples
- RAG DB loader example (SQLite/PDO)
- Agent toolbox example (`examples/agents`) with local KB + weather + free API tools
- Vision examples (palette extraction and content-risk heuristic demo)
- Vision authenticity-risk example (AI-generated likelihood heuristic)
- NLP Text API + POS example (`examples/16_nlp_text_api_and_pos.php`)
- NLP BM25 + similarity example (`examples/17_nlp_bm25_and_similarity.php`)
- NLP multilingual POS + NER example (`examples/18_nlp_multilingual_ner.php`)
- NLP extensibility example (`examples/19_nlp_extensibility_custom_profiles.php`)
- NLP trainable POS/NER pipeline example (`examples/20_nlp_trainable_pos_ner.php`)
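As a taste of the threshold-tuning use case listed above, the following standalone sketch picks the decision threshold that maximizes F1 on a handful of made-up scores. The functions `f1Score` and `bestThreshold` and the sample data are hypothetical and written for this illustration; they do not show how `ThresholdTuner` is implemented or called.

```php
<?php
declare(strict_types=1);

/** F1 for binary labels, where 1 is the positive class. */
function f1Score(array $actual, array $predicted): float
{
    $tp = 0;
    $fp = 0;
    $fn = 0;
    foreach ($actual as $i => $a) {
        $p = $predicted[$i];
        if ($p === 1 && $a === 1) {
            $tp++;
        } elseif ($p === 1 && $a === 0) {
            $fp++;
        } elseif ($p === 0 && $a === 1) {
            $fn++;
        }
    }

    $precision = ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0;
    $recall = ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0;

    return ($precision + $recall) > 0
        ? 2 * $precision * $recall / ($precision + $recall)
        : 0.0;
}

/** Try each candidate threshold and keep the first one with the highest F1. */
function bestThreshold(array $actual, array $scores, array $candidates): array
{
    $best = ['threshold' => null, 'f1' => -1.0];
    foreach ($candidates as $t) {
        // Scores at or above the threshold are classified as positive.
        $pred = array_map(fn (float $s): int => $s >= $t ? 1 : 0, $scores);
        $f1 = f1Score($actual, $pred);
        if ($f1 > $best['f1']) {
            $best = ['threshold' => $t, 'f1' => $f1];
        }
    }
    return $best;
}

// Made-up labels and predicted probabilities for illustration.
$actual = [0, 0, 1, 0, 1, 1];
$scores = [0.10, 0.35, 0.40, 0.45, 0.70, 0.90];

$best = bestThreshold($actual, $scores, [0.3, 0.4, 0.5, 0.6]);
printf("threshold=%.1f f1=%.3f\n", $best['threshold'], $best['f1']); // threshold=0.4 f1=0.857
```

In practice the threshold should be chosen on a validation split rather than on the data used for the final evaluation, otherwise the reported F1 is optimistic.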
Roadmap
- More algorithms (tree-based models, multiclass linear models)
- Feature preprocessing (normalization, encoding)
- Cross-validation utilities
- Dataset loaders and richer benchmarking tools
- Context and chat history handling for the Tool Routing Agent
- Tool reliability layer for agents (timeouts, retries, fallbacks, structured errors)
- Policy and safety guardrails (tool allow/deny rules, injection checks, PII-safe logs)
- Improved routing quality (confidence scoring, clarification turn, top-k tool candidates)
- Observability + evaluation harness for routing/tool accuracy regressions
- Memory strategy beyond raw history (summaries, pruning, retrieval-based recall)
- Cost/latency controls (model tiering, caching, token budgets)
- Human-in-the-loop controls for risky actions and execution approvals
- Output quality controls (schema validation, grounding/citation checks, consistency pass)