ykachala/evals

Prompt and LLM-output regression testing for PHP. Define golden datasets, assert on model output (schema, semantic similarity, LLM-as-judge), and gate CI on quality scores.

Package info

github.com/ykachala/evals

pkg:composer/ykachala/evals

dev-main 2026-06-02 08:53 UTC

This package is auto-updated.

Last update: 2026-06-02 09:07:45 UTC


README

Regression testing for prompts and LLM output, built for PHP CI. Define golden datasets, score model responses with deterministic and LLM-as-judge assertions, and fail the build when quality drops below a threshold.

Why this exists (the 2026 gap)

A prompt change is a code change with no test suite. Swap a model, tweak a system prompt, or upgrade your RAG retriever, and you have no idea whether you just made answers better or quietly broke 20% of them. Teams find out in production.

The Python and Node.js ecosystems have a rich eval stack: promptfoo, DeepEval, Ragas, Langfuse evals. PHP, even with the 2026 surge in Prism / Neuron / Laravel AI SDK adoption, has no native eval harness. You can call a model from PHP, but you can't assert in CI that it's still good. Ykachala Evals is that missing test layer, and it feels like PHPUnit/Pest, not a separate Python toolchain.

What it does

  • Datasets — golden cases (input, optional expected, metadata) from PHP arrays, JSON/CSV, or a generator.
  • Assertions / scorers, mixed freely per case:
    • Deterministic: equals, contains, regex, jsonSchema, validJson, latencyUnder, costUnder
    • Semantic: similarTo (embedding cosine ≥ threshold)
    • LLM-as-judge: judge('Is the answer faithful to the context? Score 0–1.')
  • Runners — drive it from Pest/PHPUnit or the vendor/bin/evals CLI.
  • CI gates — set pass thresholds (e.g. "≥ 0.9 of cases score ≥ 0.8"); non-zero exit on failure.
  • Reports — JSON, JUnit XML (for CI annotations), and an HTML diff vs. the last snapshot.
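
To make the "embedding cosine ≥ threshold" idea behind similarTo concrete, here is a minimal plain-PHP sketch. The function names cosineSimilarity and passesSimilarity are hypothetical and are not part of the package API; how the package obtains embeddings is not shown here, so the vectors are passed in directly.

```php
<?php

/**
 * Cosine similarity between two equal-length embedding vectors.
 * Hypothetical helper for illustration; not the package's implementation.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // assumption: a zero vector counts as "not similar"
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/** A case passes when the similarity meets the threshold (hypothetical helper). */
function passesSimilarity(array $actual, array $expected, float $threshold): bool
{
    return cosineSimilarity($actual, $expected) >= $threshold;
}
```

With a threshold of 0.82, identical vectors pass (similarity 1.0), while [1.0, 1.0] against [1.0, 0.0] fails (similarity about 0.71).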

Install

composer require --dev ykachala/evals

Quick start

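<?php

require __DIR__.'/vendor/autoload.php';   // adjust the path to your Composer autoloader

use Ykachala\Evals\Suite;
use Ykachala\Evals\Dataset;
use function Ykachala\Evals\{contains, jsonSchema, judge, similarTo};

// $myAgent stands in for your own application code (the agent or client you want to test).
$suite = Suite::make('support-bot')
    ->using(fn (string $input) => $myAgent->reply($input))   // the system under test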
    ->dataset(Dataset::fromJson(__DIR__.'/golden/support.json'))
    ->assert(
        contains('refund', caseInsensitive: true),
        similarTo('We will process your refund within 5 days', threshold: 0.82),
        judge('Is the reply polite and does it resolve the request?', pass: 0.7),
    )
    ->gate(passRate: 0.9, minScore: 0.8);

$report = $suite->run();

echo $report->summary();      // e.g. "47/50 passed · mean 0.91 · 2 regressions"
$report->writeJunit('build/evals.xml');
exit($report->passed() ? 0 : 1);
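
The arithmetic behind gate(passRate: 0.9, minScore: 0.8) can be sketched as follows, reading it as "at least 90% of cases must score 0.8 or higher". The function gatePasses is a hypothetical illustration, not the package's Gate class, and the handling of an empty run is an assumption.

```php
<?php

/**
 * Hypothetical gate check: at least $passRate of all cases must score >= $minScore.
 *
 * @param list<float> $scores one score in [0, 1] per case
 */
function gatePasses(array $scores, float $passRate, float $minScore): bool
{
    if ($scores === []) {
        return false; // assumption: an empty run should not pass the gate
    }

    // Count the cases whose score reaches the per-case minimum.
    $passing = count(array_filter($scores, fn (float $s): bool => $s >= $minScore));

    return $passing / count($scores) >= $passRate;
}
```

Under this reading, a run where 47 of 50 cases reach the minimum score has a pass rate of 0.94 and clears a 0.9 gate, while 44 of 50 (0.88) does not.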

In Pest

it('keeps the support bot faithful', function () {
    $report = Suite::make('support-bot')->/* ... */->run();
    expect($report)->toPassGate();
});

From CI

vendor/bin/evals run evals/ --gate=0.9 --junit=build/evals.xml

LLM-as-judge, made affordable

Judge and similarTo assertions cost tokens. Ykachala Evals integrates with ykachala/semantic-cache to cache judge verdicts and embeddings, so re-running the suite on an unchanged dataset costs almost nothing. With ykachala/meter, you also see the exact token cost of each eval run.
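
To illustrate why cached verdicts make unchanged re-runs cheap, here is a minimal in-memory sketch. VerdictCache is hypothetical and does not use the ykachala/semantic-cache API. It keys on an exact hash of rubric plus output, which is a simpler technique than semantic matching.

```php
<?php

/**
 * Hypothetical exact-match verdict cache, for illustration only.
 * It does not use the ykachala/semantic-cache API.
 */
final class VerdictCache
{
    /** @var array<string, float> cached scores keyed by content hash */
    private array $verdicts = [];

    /** Number of times the judge actually had to be called. */
    public int $misses = 0;

    /** @param callable(string, string): float $judge */
    public function score(string $rubric, string $output, callable $judge): float
    {
        // Key on the exact rubric and output; changing either one misses the cache.
        $key = hash('sha256', $rubric . "\0" . $output);

        if (!array_key_exists($key, $this->verdicts)) {
            $this->misses++;
            $this->verdicts[$key] = (float) $judge($rubric, $output);
        }

        return $this->verdicts[$key];
    }
}
```

Scoring the same rubric and output twice calls the judge once; only a changed output (or rubric) triggers a new, billable judge call.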

Architecture

src/
├── Suite.php          # fluent builder: using() / dataset() / assert() / gate() / run()
├── Dataset.php        # case loading (array, JSON, CSV, generator)
├── EvalCase.php       # one eval case: input, expected, metadata (`Case` is reserved)
├── Assertion/         # deterministic + semantic + judge assertions
├── Judge/             # LLM-as-judge interface + adapters
├── Report.php         # scores, regressions, JSON/JUnit/HTML output
└── Gate.php           # pass thresholds + CI exit semantics
bin/evals              # CLI runner
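
As a rough picture of the "input, expected, metadata" shape described for EvalCase.php, a case could be modelled as a small immutable value object. This is a hypothetical sketch assuming PHP 8.1+; the real class, its namespace, and the array keys accepted by fromArray may differ.

```php
<?php

/** Hypothetical shape of one eval case; the package's real class may differ. */
final class EvalCase
{
    /** @param array<string, mixed> $metadata */
    public function __construct(
        public readonly string $input,
        public readonly ?string $expected = null,
        public readonly array $metadata = [],
    ) {
    }

    /** Build a case from a decoded row (assumed keys: input, expected, metadata). */
    public static function fromArray(array $row): self
    {
        if (!isset($row['input']) || !is_string($row['input'])) {
            throw new InvalidArgumentException('Each case needs a string "input".');
        }

        return new self($row['input'], $row['expected'] ?? null, $row['metadata'] ?? []);
    }
}
```

Keeping expected optional matches the dataset description: judge-only cases need no reference answer, while equals or similarTo cases do.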

Roadmap

  • EvalCase + Dataset loaders (array/JSON/CSV/generator)
  • Deterministic assertions (equals/contains/regex/jsonSchema/validJson/latency/cost)
  • Suite builder + Report (scores, regressions vs. snapshot)
  • Semantic similarTo + LLM-as-judge assertions
  • Gate + CLI runner + JUnit/HTML reports
  • semantic-cache + meter integration

See CLAUDE.md for the full phase plan and conventions.

License

MIT