ykachala / evals
Prompt and LLM-output regression testing for PHP. Define golden datasets, assert on model output (schema, semantic similarity, LLM-as-judge), and gate CI on quality scores.
Requires
- php: >=8.3
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.0
- phpstan/phpstan: ^2.0
- phpunit/phpunit: ^11.0
Suggests
- ykachala/meter: Track the token cost of an eval run
- ykachala/semantic-cache: Cache judge/embedding calls so eval runs in CI stay cheap
Last update: 2026-06-02 09:07:45 UTC
README
Regression testing for prompts and LLM output, built for PHP CI. Define golden datasets, score model responses with deterministic and LLM-as-judge assertions, and fail the build when quality drops below a threshold.
Why this exists (the 2026 gap)
A prompt change is a code change with no test suite. Swap a model, tweak a system prompt, or upgrade your RAG retriever, and you have no idea whether you just made answers better or quietly broke 20% of them. Teams find out in production.
Python has a rich eval stack — promptfoo, DeepEval, Ragas, Langfuse evals. PHP, even with the 2026 surge in Prism / Neuron / Laravel AI SDK adoption, has no native eval harness. You can call a model from PHP, but you can't assert it's still good in CI. Ykachala Evals is that missing test layer — it feels like PHPUnit/Pest, not a separate Python toolchain.
What it does
- Datasets: golden cases (`input`, optional `expected`, metadata) from PHP arrays, JSON/CSV, or a generator.
- Assertions / scorers, mixed freely per case:
  - Deterministic: `equals`, `contains`, `regex`, `jsonSchema`, `validJson`, `latencyUnder`, `costUnder`
  - Semantic: `similarTo` (embedding cosine ≥ threshold)
  - LLM-as-judge: `judge('Is the answer faithful to the context? Score 0–1.')`
- Runners: drive it from Pest/PHPUnit or the `vendor/bin/evals` CLI.
- CI gates: set pass thresholds (e.g. "≥ 0.9 of cases score ≥ 0.8"); non-zero exit on failure.
- Reports: JSON, JUnit XML (for CI annotations), and an HTML diff vs. the last snapshot.
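To make the "embedding cosine ≥ threshold" check concrete, here is a minimal sketch of the comparison a semantic assertion of this kind performs. The function name `cosineSimilarity` and the toy vectors are assumptions for illustration; they are not taken from the package, and real embeddings would come from an embedding model.

```php
<?php

declare(strict_types=1);

/**
 * Cosine similarity between two equal-length embedding vectors.
 * Hypothetical helper for illustration, not the package's API.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0; // a zero vector has no direction; treat it as "not similar"
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Toy embeddings (assumed values).
$expected = [0.12, 0.80, 0.35];
$actual   = [0.10, 0.82, 0.30];

$score  = cosineSimilarity($expected, $actual);
$passes = $score >= 0.82; // the case passes when similarity meets the threshold
```

Cosine similarity ignores vector length and compares direction only, which is why a single threshold can work across replies of different lengths.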
Install
composer require --dev ykachala/evals
Quick start
use Ykachala\Evals\Suite;
use Ykachala\Evals\Dataset;
use function Ykachala\Evals\{contains, jsonSchema, judge, similarTo};

$suite = Suite::make('support-bot')
    ->using(fn (string $input) => $myAgent->reply($input)) // the system under test
    ->dataset(Dataset::fromJson(__DIR__.'/golden/support.json'))
    ->assert(
        contains('refund', caseInsensitive: true),
        similarTo('We will process your refund within 5 days', threshold: 0.82),
        judge('Is the reply polite and does it resolve the request?', pass: 0.7),
    )
    ->gate(passRate: 0.9, minScore: 0.8);

$report = $suite->run();

echo $report->summary(); // 47/50 passed · mean 0.91 · 2 regressions
$report->writeJunit('build/evals.xml');

exit($report->passed() ? 0 : 1);
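The `gate(passRate: 0.9, minScore: 0.8)` call above can be read as "at least 90% of cases must score at least 0.8". The sketch below spells out that arithmetic under that reading. The function `gatePasses` is hypothetical, and treating an empty run as a failure is an assumption rather than documented behaviour.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical gate check: does a large enough share of cases reach the minimum score?
 *
 * @param list<float> $caseScores one score in [0, 1] per eval case
 */
function gatePasses(array $caseScores, float $passRate, float $minScore): bool
{
    if ($caseScores === []) {
        return false; // assumption: an empty run should not pass the gate
    }

    // Count the cases whose score reaches the per-case minimum.
    $passing = count(array_filter(
        $caseScores,
        static fn (float $score): bool => $score >= $minScore,
    ));

    return $passing / count($caseScores) >= $passRate;
}

// 10 cases, 9 of which score at least 0.8, so the pass rate is exactly 0.9.
$scores = [0.95, 0.90, 0.85, 0.80, 0.92, 0.88, 0.81, 0.99, 0.83, 0.40];

$ok       = gatePasses($scores, passRate: 0.9, minScore: 0.8);
$exitCode = $ok ? 0 : 1; // a CI runner would exit non-zero when the gate fails
```

Note that the single 0.40 case does not fail the build on its own here; the gate looks at the share of passing cases, not the worst case.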
In Pest
it('keeps the support bot faithful', function () {
    $report = Suite::make('support-bot')->/* ... */->run();

    expect($report)->toPassGate();
});
From CI
vendor/bin/evals run evals/ --gate=0.9 --junit=build/evals.xml
LLM-as-judge, made affordable
Judge and similarTo assertions cost tokens. Ykachala Evals integrates with
ykachala/semantic-cache to cache judge verdicts and embeddings, so re-running the suite
on an unchanged dataset is nearly free — and with ykachala/meter you see the exact token
cost of each eval run.
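The saving comes from not paying twice for the same verdict. As a rough illustration of that idea only, the sketch below memoizes a judge callable by an exact hash of rubric plus output. `CachedJudge` is a hypothetical class, not the ykachala/semantic-cache API, and exact-match hashing is a simpler stand-in for whatever matching that package actually performs.

```php
<?php

declare(strict_types=1);

/**
 * Wraps a judge callable and memoizes verdicts by a hash of (rubric, output).
 * Hypothetical class for illustration; exact-match only, in memory only.
 */
final class CachedJudge
{
    /** @var array<string, float> verdicts keyed by sha256(rubric, output) */
    private array $verdicts = [];

    /** Number of times the underlying judge was actually called. */
    public int $misses = 0;

    public function __construct(private readonly \Closure $judge)
    {
    }

    public function score(string $rubric, string $output): float
    {
        // The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        $key = hash('sha256', $rubric . "\0" . $output);

        if (!array_key_exists($key, $this->verdicts)) {
            $this->misses++;
            $this->verdicts[$key] = ($this->judge)($rubric, $output);
        }

        return $this->verdicts[$key];
    }
}

// Stand-in for a real (paid) model call.
$judge = new CachedJudge(
    static fn (string $rubric, string $output): float => str_contains($output, 'refund') ? 1.0 : 0.0,
);

$first  = $judge->score('Does the reply mention a refund?', 'Your refund is on its way.');
$second = $judge->score('Does the reply mention a refund?', 'Your refund is on its way.');
// The second call is served from the cache, so the judge ran only once.
```

A persistent store keyed the same way would carry the saving across CI runs, which is where an unchanged dataset becomes cheap to re-evaluate.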
Architecture
src/
├── Suite.php # fluent builder: using() / dataset() / assert() / gate() / run()
├── Dataset.php # case loading (array, JSON, CSV, generator)
├── EvalCase.php # one eval case: input, expected, metadata (`Case` is reserved)
├── Assertion/ # deterministic + semantic + judge assertions
├── Judge/ # LLM-as-judge interface + adapters
├── Report.php # scores, regressions, JSON/JUnit/HTML output
└── Gate.php # pass thresholds + CI exit semantics
bin/evals # CLI runner
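The note on `EvalCase.php` refers to `case` being a reserved keyword in PHP, so a class cannot be named `Case`. Below is a minimal sketch of what such a value object and a JSON loader might look like. The property names follow the README's "input, expected, metadata"; the `fromArray` method and the JSON layout are assumptions, and the real golden-file format may differ.

```php
<?php

declare(strict_types=1);

// `case` is a reserved keyword (switch, enums), so `class Case {}` is a parse error.
final readonly class EvalCase
{
    /** @param array<string, mixed> $metadata */
    public function __construct(
        public string $input,
        public ?string $expected = null,
        public array $metadata = [],
    ) {
    }

    /**
     * Hypothetical factory: builds a case from one decoded JSON row.
     *
     * @param array{input: string, expected?: string|null, metadata?: array<string, mixed>} $row
     */
    public static function fromArray(array $row): self
    {
        return new self(
            input: $row['input'],
            expected: $row['expected'] ?? null,
            metadata: $row['metadata'] ?? [],
        );
    }
}

// Assumed golden file contents, inlined here so the sketch is self-contained.
$json = <<<'JSON'
[
    {"input": "I want my money back", "expected": "We will process your refund within 5 days", "metadata": {"tag": "refund"}},
    {"input": "Where is my order?"}
]
JSON;

$cases = array_map(
    EvalCase::fromArray(...),
    json_decode($json, associative: true, flags: JSON_THROW_ON_ERROR),
);
```

Making the class `readonly` keeps golden cases immutable once loaded, so a scorer cannot accidentally alter the expected value it is comparing against.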
Roadmap
- EvalCase + Dataset loaders (array/JSON/CSV/generator)
- Deterministic assertions (equals/contains/regex/jsonSchema/validJson/latency/cost)
- Suite builder + Report (scores, regressions vs. snapshot)
- Semantic `similarTo` + LLM-as-judge assertions
- Gate + CLI runner + JUnit/HTML reports
- semantic-cache + meter integration
See CLAUDE.md for the full phase plan and conventions.
License
MIT