padosoft/eval-harness

Laravel evaluation framework for RAG / LLM applications: golden datasets, exact-match + cosine-embedding + LLM-as-judge metrics, JSON + Markdown reports, Artisan-driven CI gate.

Maintainers

Package info

github.com/padosoft/eval-harness

pkg:composer/padosoft/eval-harness

Statistics

Installs: 5 222

Dependents: 0

Suggesters: 1

Stars: 0

Open Issues: 1

v1.3.0 2026-06-16 20:13 UTC

README

Laravel-native evaluation framework for RAG / LLM applications. Golden datasets in YAML, fifteen built-in metrics (including retrieval-ranking and ordinal scoring), judge calibration against human labels, production online monitoring with drift alerts, standalone output assertions, Markdown + JSON reports, and an Artisan CI gate. Stop shipping silent regressions in your AI pipeline.

Latest Version on Packagist Tests Total Downloads License PHP Version Require

eval-harness report banner

Official Documentation

πŸ“š Full documentation is available at doc.eval-harness.padosoft.com.

The documentation site covers everything in depth: a five-minute quickstart, the fifteen built-in metrics with their underlying theory and formulas, guides for CI gating, judge calibration, online monitoring and adversarial testing, the batch/Horizon operations model, the architecture and decision records, and the full CLI, configuration, and report-API reference.

Table of Contents

  1. Official Documentation
  2. Why eval-harness?
  3. Design rationale
  4. Features
  5. Comparison with alternatives
  6. Installation
  7. Quick start
  8. Usage examples
  9. Web admin panel UI
  10. Contract stability and migration
  11. Configuration
  12. Architecture
  13. AI vibe-coding pack included
  14. Testing
  15. Roadmap
  16. Contributing
  17. Security
  18. License

Why eval-harness?

Imagine deploying a RAG-powered chatbot to production. Quality is great on launch day. Three months later, somebody:

  • bumps the embedding model from text-embedding-3-small to text-embedding-3-large,
  • swaps the chat model from gpt-4o to gpt-4o-mini for cost,
  • tweaks the prompt template,
  • updates laravel/ai from ^0.5 to ^0.6,
  • changes the chunker from 800-token sliding window to 1200-token semantic.

Every one of those changes is a quality regression risk, and you have no programmatic signal they shipped intact. Your test suite green-lights the deployment because PHPUnit doesn't know what a "correct answer" looks like.

padosoft/eval-harness fixes that loop:

  1. You curate a small golden dataset β€” a YAML file with 30-200 (question, expected answer) pairs that represent the queries you actually care about.
  2. You declare metrics β€” exact-match for deterministic outputs, cosine-embedding for paraphrase tolerance, LLM-as-judge for subjective grading.
  3. You wire up a callable that drives your real pipeline against the dataset.
  4. CI runs php artisan eval-harness:run rag.factuality on every PR and gates the merge on the macro-F1 score.

Now your AI pipeline has the same regression protection your business logic has had for the last fifteen years.

Design rationale

The package is opinionated. Three decisions matter most:

1. No SDK lock-in. Every external call goes through Laravel's Http:: facade β€” never an OpenAI / Anthropic / Vertex SDK. Tests substitute via Http::fake() for deterministic offline runs, and swapping providers is a config-file change, not a refactor.

2. The dataset is YAML, not a Laravel model. YAML is reviewable in pull requests, diffable across releases, and survives database wipes. The package never stores datasets in your DB β€” they live in eval/golden/*.yml next to your code.

3. Failures are captured by default. A timeout on sample 47 should not mask the macro-F1 score across 200 valid samples. Every metric exception is recorded against (sample, metric) and surfaced in the final report so the operator can investigate, not re-run the whole 30-minute suite. Strict CI lanes can opt into EVAL_HARNESS_RAISE_EXCEPTIONS=true to abort on the first MetricException provider/metric contract error.

These decisions cost some flexibility (you can't dispatch metrics across multiple processes yet β€” see Roadmap) but they keep the public surface small and the offline path fast.

Features

  • Fifteen metrics out of the box β€” exact-match, contains, regex, rouge-l, citation-groundedness, cosine-embedding, bertscore-like, llm-as-judge, refusal-quality, ordinal-distance, and the retrieval-ranking family (retrieval-hit-at-k, retrieval-recall-at-k, retrieval-mrr, retrieval-ndcg-at-k, answer-containment-at-k) β€” and a clean Metric interface for adding more.
  • Retrieval-ranking metrics β€” domain-agnostic hit@k, recall@k, MRR, nDCG@k (binary or graded gains), and top-k answer containment over a ranked id/text list your retriever emits. A single tested source of truth for RAG ranking math (metrics.retrieval.default_k config + per-sample metadata.k override).
  • Ordinal / distance scoring β€” ordinal-distance gives partial credit for ordered labels (e.g. low < medium < high < urgent): exact = 1.0, off-by-one = 0.5, further = 0.0.
  • Judge calibration β€” php artisan eval-harness:calibrate-judge validates the LLM judge against human-labelled cases, reporting verdict agreement rate, a confusion matrix, a length-bias signal, and a self-preference guard (fails when the judge model equals the model under test). Gate CI on judge trustworthiness, not vibes.
  • Online / production monitoring β€” OnlineMonitor::capture() samples a configurable fraction of live AI traffic, judges it on a queue, stores scores historically, charts pass-rate-over-time, and fires an OnlinePassRateDropped event when recent quality dips below threshold. Off by default; Horizon-ready; exposed read-only at GET /{prefix}/online/{dataset}/trend for a companion dashboard.
  • Strict-schema YAML loader β€” versioned dataset contracts and actionable validation errors for malformed samples.
  • Deterministic LLM-as-judge β€” temperature 0, seed 42, response_format=json_object. Strict-JSON parser rejects malformed responses instead of silently scoring 0.
  • Stable JSON report shape β€” every payload carries explicit schema_version and dataset_schema_version fields. Wire into your CI dashboard once, then evolve additively.
  • Cohort-ready report data β€” JSON and Markdown reports aggregate scores by metadata.tags, expose an explicit untagged bucket, and include per-metric score histograms for dashboards.
  • Citation evidence checks β€” citation-groundedness can score simple citation markers or stricter metadata.citation_evidence spans that require both citation markers and quote text.
  • Opt-in adversarial lane β€” AdversarialDatasetFactory and php artisan eval-harness:adversarial build/run safety regression seeds for prompt injection, jailbreaks, data leaks, SSRF, tool abuse, and similar red-team categories. JSON/Markdown reports add category and compliance-framework summaries, and optional manifests retain adversarial run summaries while preserving latest failure-free baselines per compatible report schema, dataset, metric names, and adversarial category/sample-count slice under tight retention; --regression-gate fails CI when macro-F1 or configured metric aggregates drop, and --promote-failures writes failed samples back to a reloadable YAML dataset seed. Scheduler/CI guidance shows how to run the lane continuously without bundling a daemon in this package.
  • Standalone output assertions β€” score saved JSON/YAML outputs with the same metrics and report contract, without invoking your agent in CI.
  • Usage summaries β€” JSON and Markdown reports aggregate structured usage details for provider token counts, cost USD, and latency.
  • Runtime guardrails β€” provider timeouts are normalized, optional retries cover Laravel HTTP connection failures plus HTTP 429/5xx, and strict mode can rethrow MetricException failures instead of capturing them.
  • Batch execution modes β€” SUT runs flow through deterministic SerialBatch by default, or queue-backed LazyParallelBatch via --batch=lazy-parallel for Laravel queue/Horizon workers.
  • Operational batch profiles β€” --batch-profile=ci|smoke|nightly applies named presets of batch defaults (concurrency, queue, timeouts, chunk size, rate limit, checkpoint cadence, result TTL). Explicit CLI flags always win; host apps can override or add profiles under eval-harness.batches.profiles.* in config/eval-harness.php.
  • Producer-side backpressure β€” --chunk-size=N, --rate-limit=N --rate-window-seconds=W, and --result-ttl-seconds bound dispatch in flight, throttle to N samples per W-second sliding window (monotonic-clock math, amortized O(1)), and size result metadata TTL for delayed collection. Pass none on any nullable numeric flag to clear an inherited profile value for a one-off run.
  • Progress checkpoints and terminal status β€” --checkpoint-every=N emits structured progress events through an optional BatchProgressReporter container binding (default NullBatchProgressReporter). Dashboards that need to distinguish a finished failed batch from a stalled one can implement the optional BatchTerminalProgressReporter sub-contract for explicit success / failure / empty terminal status with partial-wins tolerance on the failure path.
  • Live batch registry endpoints β€” GET /<configured-prefix>/batches/live and GET /<configured-prefix>/batches/{id}/progress expose active lazy-parallel batch ids and compact progress counters through cache-backed read-only API contracts. The live registry is enabled by default and can be disabled with eval-harness.batches.live_registry.enabled.
  • Adversarial manifest discovery endpoints β€” GET /<configured-prefix>/adversarial/manifests and GET /<configured-prefix>/adversarial/manifests/{name} enumerate adversarial run manifests written to a configured disk so a companion UI can browse compliance history without scraping the filesystem. Opt-in via eval-harness.adversarial.manifests.{disk,path_prefix}; the CLI --adversarial-manifest=<arbitrary-path> flag is preserved for existing operators.
  • Report diff endpoint β€” GET /<configured-prefix>/reports/{id}/diff/{otherId} computes signed deltas (macro_f1, per-metric mean/pass_rate, per-cohort status with added / removed / regressed / improved / stable, total_samples / total_failures, adversarial categories when present) so a companion UI can show regression diffs side-by-side without fetching full reports. The prefix defaults to eval-harness/api and is configurable through eval-harness.api.prefix.
  • Dataset trend endpoint β€” GET /<configured-prefix>/datasets/{name}/trend?limit=N scans stored JSON reports for one dataset, skips malformed artifacts, caps limit at 100, caps scanned JSON files through eval-harness.api.trend.max_files_scanned, and returns chronological points with metrics, cohorts, usage, and the eval-harness.report-api.v1.trend schema discriminator.
  • Provider-agnostic β€” works with OpenAI, OpenRouter, Regolo, Mistral, any OpenAI-compatible chat-completions endpoint.
  • No DB migrations required β€” datasets are YAML, results are JSON. The package adds zero rows to your schema.
  • Artisan-driven CI gate β€” php artisan eval-harness:run exits non-zero on any captured failure. Wire it into the same workflow your unit tests run in.
  • Architecture tests included β€” every release proves it doesn't leak symbols from sibling packages, ever.

Comparison with alternatives

Status legend: βœ… YES means first-class support, ⚠️ partial means supported with limits or outside the Laravel-native path, and ❌ NO means not a primary fit.

Concern eval-harness OpenAI Evals LangSmith Ragas Promptfoo DeepEval
Laravel-native package βœ… YES - PHP/Laravel package ❌ NO - Python CLI/library ❌ NO - hosted Python/TS workflow ❌ NO - Python library ❌ NO - Node/YAML CLI ❌ NO - Python library
Runs inside your app container βœ… YES - resolves Laravel services directly ⚠️ partial - custom completion functions ⚠️ partial - SDK/API integration ⚠️ partial - integrate from Python ⚠️ partial - external CLI/provider call ⚠️ partial - local Python runner
Local-first storage βœ… YES - YAML datasets + JSON/Markdown reports ⚠️ partial - local logs or Snowflake ❌ NO - LangSmith cloud workspace βœ… YES - local datasets/results βœ… YES - local YAML/results ⚠️ partial - local evals, optional Confident AI cloud
Read-only report API βœ… YES - opt-in Laravel routes for report listing/show, cohorts, histograms, artifact download, and row CSV export ⚠️ partial - custom artifact/API layer βœ… YES - hosted experiment API ⚠️ partial - custom app/API layer ⚠️ partial - local result files/viewer workflows ⚠️ partial - local results or hosted platform API
Built-in metrics βœ… YES - offline exact/contains/regex/ROUGE-L/citation plus fakeable cosine/BERTScore-like/judge/refusal ⚠️ partial - custom eval code βœ… YES - evaluators in platform/SDK βœ… YES - RAG-focused metrics βœ… YES - assertions and graders βœ… YES - built-in metrics
Embedding semantic overlap βœ… YES - cosine-embedding + bertscore-like via fakeable EmbeddingClient ⚠️ partial - custom embedding eval code ⚠️ partial - SDK evaluator path βœ… YES - RAG embedding metrics ⚠️ partial - provider-backed similarity assertions βœ… YES - semantic metrics
Deterministic no-network tests βœ… YES - Http::fake, fake LLM/embedding clients ⚠️ partial - depends on eval ⚠️ partial - cloud/API path common ⚠️ partial - many metrics need LLMs ⚠️ partial - assertions can be local, red team needs models ⚠️ partial - metric dependent
LLM-as-judge βœ… YES - schema-checked, fakeable judge client βœ… YES - model-graded evals βœ… YES - evaluators βœ… YES - LLM metrics βœ… YES - rubric/grader assertions βœ… YES - LLM metrics
Refusal quality / safety judge βœ… YES - refusal-quality with required metadata + strict JSON schema ⚠️ partial - custom model-graded eval ⚠️ partial - custom evaluator workflow ⚠️ partial - custom LLM metric βœ… YES - safety/red-team assertions βœ… YES - safety metrics
Adversarial red-team seeds βœ… YES - opt-in Laravel seed factory for 10 categories ⚠️ partial - custom eval registry ⚠️ partial - custom datasets/evaluators ⚠️ partial - RAG-focused tests βœ… YES - red-team plugins βœ… YES - safety test cases
Adversarial CLI lane βœ… YES - eval-harness:adversarial with eval:adversarial alias, saved outputs, and batch options ⚠️ partial - custom eval runner scripts ⚠️ partial - custom evaluator automation ⚠️ partial - Python code orchestration βœ… YES - red-team CLI workflow βœ… YES - safety test runner
Adversarial compliance mapping βœ… YES - JSON/Markdown category + OWASP/NIST/EU AI Act summaries ⚠️ partial - custom eval metadata ⚠️ partial - custom evaluator metadata ⚠️ partial - custom report code βœ… YES - red-team category reporting ⚠️ partial - safety metadata/reporting
Adversarial run history manifests βœ… YES - local JSON manifest retains adversarial summaries and clean baselines ⚠️ partial - custom eval logs βœ… YES - hosted experiment history ⚠️ partial - custom persistence βœ… YES - monitoring/history workflows ⚠️ partial - platform/history workflow
Adversarial regression gate βœ… YES - --regression-gate fails on macro-F1 or metric drops from local manifests ⚠️ partial - custom eval thresholds βœ… YES - hosted experiment comparisons ⚠️ partial - custom CI checks βœ… YES - threshold/regression workflows βœ… YES - test assertions/regression workflows
Scheduled/continuous monitoring βœ… YES - Laravel Scheduler/CI cron guidance with manifests, Horizon queues, gates, and failure promotion ⚠️ partial - custom scheduler around eval runs βœ… YES - hosted monitoring workflows ⚠️ partial - custom scheduler around Python metrics βœ… YES - CLI/CI monitoring workflows ⚠️ partial - local runner or hosted platform workflow
Failure promotion to datasets βœ… YES - --promote-failures exports failed adversarial samples to YAML seeds ⚠️ partial - custom eval scripts βœ… YES - trace-to-dataset workflows ⚠️ partial - custom dataset curation βœ… YES - failure-driven test cases βœ… YES - failed test cases can become datasets
Citation evidence spans βœ… YES - citation_evidence requires marker + quote match ⚠️ partial - custom eval code ⚠️ partial - custom evaluator workflow βœ… YES - RAG faithfulness/context metrics ⚠️ partial - custom assertions βœ… YES - RAG faithfulness metrics
Cost/token/latency summaries βœ… YES - built-in provider usage + JSON/Markdown summaries ⚠️ partial - custom logging βœ… YES - experiment usage analytics βœ… YES - usage/cost hooks ⚠️ partial - provider output dependent ⚠️ partial - metric/provider dependent
Runtime retry / strict exception controls βœ… YES - normalized timeouts, connection/429/5xx retries, optional raise_exceptions ⚠️ partial - custom eval code ⚠️ partial - SDK/platform behavior βœ… YES - runtime metric settings ⚠️ partial - provider/config dependent ⚠️ partial - custom evaluator handling
Provider choice βœ… YES - any OpenAI-compatible endpoint via Laravel HTTP ⚠️ partial - OpenAI API defaults, custom completion functions possible βœ… YES - multi-provider ecosystem βœ… YES - via integrations βœ… YES - multi-provider βœ… YES - multi-provider
CI gate βœ… YES - Artisan command with non-zero failure exit ⚠️ partial - script around CLI/API ⚠️ partial - API/automation hook ⚠️ partial - custom script βœ… YES - CLI gate βœ… YES - test runner/CI flow
Queue/Horizon batch execution βœ… YES - SerialBatch + LazyParallelBatch for Laravel queues/Horizon ❌ NO - not Laravel queues ❌ NO - hosted tracing/evals ❌ NO - not Laravel queues ❌ NO - external CLI concurrency ❌ NO - not Laravel queues
Named operational profiles βœ… YES - --batch-profile=ci|smoke|nightly with host-app overrides under eval-harness.batches.profiles.* ⚠️ partial - custom run scripts ⚠️ partial - hosted experiment configs ❌ NO - custom Python wrappers ⚠️ partial - YAML config presets ❌ NO - custom Python wrappers
Producer-side backpressure βœ… YES - --chunk-size, --rate-limit, --rate-window-seconds with monotonic sliding-window math ❌ NO - per-eval concurrency only ⚠️ partial - hosted rate controls ❌ NO - in-process only ⚠️ partial - external concurrency flag ❌ NO - in-process only
Progress checkpoints + terminal status βœ… YES - --checkpoint-every plus optional BatchProgressReporter / BatchTerminalProgressReporter bindings ❌ NO - log-tail only βœ… YES - hosted run progress ⚠️ partial - custom callbacks ⚠️ partial - CLI progress lines ⚠️ partial - custom hooks
Eval sets / multi-dataset runs βœ… YES - EvalSetDefinition + resumable manifests βœ… YES - oaievalset βœ… YES - dataset experiments ⚠️ partial - run multiple datasets in code βœ… YES - suites/configs βœ… YES - metric collections/test suites
Resume interrupted multi-dataset progress βœ… YES - explicit per-dataset resume manifest ❌ NO - no mid-eval resume ⚠️ partial - platform run history ⚠️ partial - custom code ⚠️ partial - rerun/filter workflows ⚠️ partial - platform/regression workflows
Cohorts / tags / facets βœ… YES - tag cohorts in JSON/Markdown ⚠️ partial - custom eval/reporting βœ… YES - dataset filtering/metadata ⚠️ partial - custom analysis βœ… YES - metadata/config-driven views ⚠️ partial - test metadata
Saved-output assertions βœ… YES - --outputs and Eval::scoreOutputs() ⚠️ partial - custom eval code ⚠️ partial - compare uploaded runs ⚠️ partial - build dataset/results manually βœ… YES - assertion-first workflow βœ… YES - test-case assertions
Auditable in PR diff βœ… YES - YAML datasets + stable JSON/Markdown artifacts ⚠️ partial - local YAML/code possible ❌ NO - cloud-first βœ… YES - code/data files βœ… YES - YAML config βœ… YES - Python test files
Vendor lock-in βœ… YES - headless, local-first, provider-agnostic ⚠️ partial - OpenAI-oriented defaults ❌ NO - LangSmith workspace βœ… YES - OSS library βœ… YES - OSS CLI ⚠️ partial - OSS plus Confident AI option
Cost to evaluate 200 offline samples βœ… YES - free for offline metrics and faked providers ⚠️ partial - depends on model calls ❌ NO - cloud/API usage ⚠️ partial - free only for non-LLM metrics ⚠️ partial - free only for local assertions ⚠️ partial - free only for local/non-LLM metrics

The Python-stack tools are excellent if your stack is Python. If your RAG pipeline lives in a Laravel monolith, eval-harness is the shortest path from "we have AI in prod" to "we have a regression test for our AI in prod".

Installation

composer require padosoft/eval-harness

The package is auto-discovered. No config/app.php edits required.

Optional config publishing:

php artisan vendor:publish --tag=eval-harness-config

This drops config/eval-harness.php into your app where you can override the embeddings + judge endpoints / models / API keys.

Compatibility matrix

eval-harness PHP Laravel laravel/ai SDK symfony/yaml
0.x (current) 8.3 / 8.4 / 8.5 12.x / 13.x ^0.6 ^7 / ^8

Quick start

1. Curate a golden dataset

eval/golden/factuality.yml:

schema_version: eval-harness.dataset.v1
name: rag.factuality.fy2026
samples:
  - id: capital-france
    input:
      question: "What is the capital of France?"
    expected_output: "Paris"
    metadata:
      tags: [geography, easy]

  - id: refund-policy
    input:
      question: "How many days do I have to return an order?"
    expected_output: "30 days from delivery."
    metadata:
      tags: [policy, support]

schema_version is optional for existing datasets. If omitted, the loader defaults to eval-harness.dataset.v1.

2. Wire up a registrar in your app

app/Console/EvalRegistrar.php:

<?php

namespace App\Console;

use Illuminate\Contracts\Container\Container;
use Padosoft\EvalHarness\EvalEngine;

class EvalRegistrar
{
    public function __construct(private readonly Container $container) {}

    public function __invoke(EvalEngine $engine): void
    {
        $engine->dataset('rag.factuality.fy2026')
            ->loadFromYaml(base_path('eval/golden/factuality.yml'))
            ->withMetrics(['exact-match', 'cosine-embedding'])
            ->register();

        $this->container->bind('eval-harness.sut', fn () =>
            fn (array $input): string => app(\App\Rag\KnowledgeAgent::class)
                ->answer($input['question']),
        );
    }
}

3. Run the eval

php artisan eval-harness:run rag.factuality.fy2026 \
  --registrar="App\Console\EvalRegistrar" \
  --json --out=factuality.json

Exit code is 0 if every metric scored cleanly, non-zero otherwise. Wire that into the same tests.yml workflow that runs your PHPUnit suite and you've got a regression gate.

4. Score saved outputs

When another job already generated model responses, keep the same dataset and score those outputs directly:

{
  "outputs": {
    "capital-france": "Paris",
    "refund-policy": "30 days from delivery."
  }
}
php artisan eval-harness:run rag.factuality.fy2026 \
  --registrar="App\Console\EvalRegistrar" \
  --outputs=eval/outputs/factuality.json \
  --json --out=factuality.json

--outputs accepts JSON or YAML, map form (outputs.sample_id) or list form (outputs[].id + outputs[].actual_output). Relative --out paths use the configured reports disk and path prefix (eval-harness/reports by default). Add --raw-path only when you want a literal filesystem path and its parent directory already exists. The registrar still registers the dataset; no eval-harness.sut binding is required for this mode.

5. Read the report

php artisan eval-harness:run rag.factuality.fy2026 \
  --registrar="App\Console\EvalRegistrar"
# Eval report β€” rag.factuality.fy2026

_Run completed in 2.41s over 30 samples (0 failures captured)._

## Summary

| total samples | total failures | duration seconds |
| --- | --- | --- |
| 30 | 0 | 2.41 |

## Per-metric aggregates

| metric | mean | p50 | p95 | pass-rate (>= 0.5) |
| --- | --- | --- | --- | --- |
| exact-match | 0.7333 | 1.0000 | 1.0000 | 0.7333 |
| cosine-embedding | 0.9012 | 0.9421 | 0.9893 | 0.9667 |

## Macro-F1 (avg pass-rate across all metrics): 0.8500

## Cohorts by metadata.tags

| cohort | samples | metric | mean | p50 | p95 | pass-rate (>= 0.5) |
| --- | --- | --- | --- | --- | --- | --- |
| geography | 12 | exact-match | 0.9500 | 1.0000 | 1.0000 | 0.9500 |
| refund-policy | 8 | exact-match | 0.6000 | 0.5000 | 1.0000 | 0.6000 |

## Score histograms

### exact-match

| score range | count |
| --- | --- |
| 0.0-0.1 | 8 |
| 0.9-1.0 inclusive | 22 |

6. Optional read-only report API

The package can register opt-in, read-only routes for a separate Laravel admin/UI package to consume stored report artifacts. Routes are disabled by default because this package does not bundle authentication; enable them only behind your host app's existing admin middleware.

// config/eval-harness.php
'api' => [
    'enabled' => true,
    'prefix' => 'admin/eval-harness/api',
    'middleware' => ['web', 'auth'],
    'trend' => [
        'max_files_scanned' => 5000,
    ],
],

With the API enabled, GET /admin/eval-harness/api/reports lists JSON and Markdown artifacts from the configured reports disk/prefix, and GET /admin/eval-harness/api/reports/{id} shows one artifact by URL-safe id. Additional read-only contracts are now available for UI consumers:

  • GET /admin/eval-harness/api/reports/{id}/cohorts for cohort summaries.
  • GET /admin/eval-harness/api/reports/{id}/histograms for score distribution buckets.
  • GET /admin/eval-harness/api/reports/{id}/rows.csv for CSV sample rows.
  • GET /admin/eval-harness/api/reports/{id}/download for direct artifact download.
  • GET /admin/eval-harness/api/reports/{id}/diff/{otherId} for signed report deltas.
  • GET /admin/eval-harness/api/adversarial/manifests and GET /admin/eval-harness/api/adversarial/manifests/{name} for adversarial run manifest discovery.
  • GET /admin/eval-harness/api/batches/live and GET /admin/eval-harness/api/batches/{id}/progress for live lazy-parallel batch monitoring.
  • GET /admin/eval-harness/api/datasets/{name}/trend?limit=N for chronological dataset trend points.
  • GET /admin/eval-harness/api/online/{dataset}/trend?limit=N for production online-monitoring pass-rate-over-time points plus the configured alert threshold (drives the dashboard alert band).
  • JSON examples and contract notes are documented in docs/REPORT_API_CONTRACT.md.

Usage examples

Programmatic dataset (no YAML)

use Padosoft\EvalHarness\Datasets\DatasetSample;
use Padosoft\EvalHarness\Facades\EvalFacade as Eval;

Eval::dataset('rag.smoke')
    ->withSamples([
        new DatasetSample(id: 's1', input: ['q' => 'hi'], expectedOutput: 'hello'),
        new DatasetSample(id: 's2', input: ['q' => 'bye'], expectedOutput: 'goodbye'),
    ])
    ->withMetrics(['exact-match'])
    ->register();

$report = Eval::scoreOutputs('rag.smoke', [
    's1' => 'hello',
    's2' => 'wrong answer',
]);

Custom metric

use Padosoft\EvalHarness\Datasets\DatasetSample;
use Padosoft\EvalHarness\Metrics\Metric;
use Padosoft\EvalHarness\Metrics\MetricScore;

class JaccardWordOverlapMetric implements Metric
{
    public function name(): string
    {
        return 'jaccard-words';
    }

    public function score(DatasetSample $sample, string $actualOutput): MetricScore
    {
        $expected = array_unique(preg_split('/\\s+/', strtolower((string) $sample->expectedOutput)));
        $actual   = array_unique(preg_split('/\\s+/', strtolower($actualOutput)));
        $union = array_unique(array_merge($expected, $actual));
        if ($union === []) {
            return new MetricScore(0.0);
        }
        $intersection = array_intersect($expected, $actual);

        return new MetricScore(count($intersection) / count($union));
    }
}

// Wire it into a dataset:
Eval::dataset('rag.recall')
    ->loadFromYaml(base_path('eval/golden/recall.yml'))
    ->withMetrics([new JaccardWordOverlapMetric(), 'exact-match'])
    ->register();

CI workflow snippet

# .github/workflows/eval-gate.yml
- name: Run RAG regression gate
  env:
    EVAL_HARNESS_JUDGE_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    php artisan eval-harness:run rag.factuality.fy2026 \
      --registrar="App\Console\EvalRegistrar" \
      --json --out=eval-report.json
- uses: actions/upload-artifact@v4
  if: always()
  with:
    name: eval-report
    path: eval-report.json

Queue-backed batch execution

--batch=lazy-parallel dispatches one queue job per sample and then assembles outputs in dataset order through the shared batch result store. It requires the SUT to be a container-resolvable concrete SampleRunner class that queue workers can resolve through the Laravel container. Constructor-injected object dependencies are supported when the worker container can resolve an equivalent fresh runner. Arbitrary callables, closures, anonymous runners, optional/defaulted constructor state, scalar/array/null runner properties, and caller-specific object configuration remain serial-only because queued jobs carry only the runner class name.

use App\Eval\MyRagRunner;

$this->app->bind('eval-harness.sut', MyRagRunner::class);
php artisan eval-harness:run rag.factuality.fy2026 \
  --registrar="App\\Console\\EvalQueueRegistrar" \
  --batch=lazy-parallel \
  --concurrency=4 \
  --queue=evals \
  --timeout=60 \
  --batch-timeout=300

Use Laravel's sync queue driver for unit tests. In production, run Horizon workers on the chosen queue and set EVAL_HARNESS_BATCH_CACHE_STORE to a cache backend shared by the command process and workers so queued sample outputs can be collected for report assembly. --concurrency is the lazy-parallel producer fan-out cap and is also the default dispatch window size; pass --chunk-size=N (must be <= --concurrency) for tighter backpressure. The producer waits after each chunk completes before dispatching the next, so when --chunk-size < --concurrency, chunk-size becomes the effective in-flight limit per producer process, not concurrency. Total worker pool demand scales with concurrent producers: if K eval commands run at the same time, peak in-flight demand is roughly chunk-size Γ— K. Size Horizon worker pool capacity for the actual peak (chunk-size Γ— concurrent producers) β€” not just one producer's chunk-size β€” or the queue will build a backlog. Worker counts themselves are configured in Horizon. --timeout is the per-sample job timeout; --batch-timeout caps the producer's wait on each dispatch window. It bounds BOTH the dispatch phase (including any producer-side --rate-limit pauses) AND the result-collection phase: when the timeout fires before all queued outputs land the command reports the missing samples; when dispatch itself consumes the budget (for example because a low rate limit throttles the producer) the command fails with an explicit "chunk dispatch consumed the full ... wait timeout" diagnostic that reports how many samples were still undispatched. Lower --chunk-size, relax --rate-limit, or raise --batch-timeout to fix it. --batch-timeout is a hard wall-clock cap only on real queue drivers (Redis, database, beanstalk β€” the documented Horizon path) where dispatch() returns immediately. On the sync queue driver dispatch() executes the job inline, so an individual slow sample can run arbitrarily longer than the chunk deadline before control returns. The package only sets the queue job's $timeout property β€” Laravel's queue workers honour it, but the sync driver does NOT enforce it because there is no worker process. On sync, neither --batch-timeout nor --timeout bounds per-sample runtime; slow runners can take arbitrarily long. Use a real queue driver in production for any wall-clock guarantee. Programmatic external dispatch() / collectOutputs() flows can set BatchOptions::lazyParallel(resultTtlSeconds: ...) to keep result metadata and sample outputs alive long enough for delayed collection. See docs/HORIZON_BATCH_QUEUES.md for Horizon supervisor, cache-store, and timeout sizing guidance.

Operational profiles and backpressure

--batch-profile=ci|smoke|nightly applies a named operational preset of batch defaults; explicit options always win, so profiles never lock operators in. CI lanes get sane lazy-parallel defaults, smoke checks stay serial, and nightly runs get throttled dispatch with checkpoints.

# CI gate: lazy-parallel, 4 concurrent samples, 30s job timeout, checkpoints every 25 samples.
php artisan eval-harness:run rag.factuality.fy2026 \
  --batch-profile=ci \
  --queue=evals \
  --json --out=evals/ci-rag.json

# Nightly long run: 16 concurrent samples, throttled at 60 dispatches/60s.
php artisan eval-harness:run rag.factuality.fy2026 \
  --batch-profile=nightly \
  --queue=evals-nightly \
  --json --out=evals/nightly-rag.json

Backpressure flags work with any lazy-parallel profile or with --batch=lazy-parallel:

  • --chunk-size=N narrows the producer dispatch window for tighter backpressure (defaults to --concurrency; must be <= --concurrency, since --concurrency is the fan-out cap).
  • --rate-limit=N --rate-window-seconds=W throttles producer dispatch to N samples per W-second rolling window.
  • --checkpoint-every=N emits structured progress events every N completed samples; bind a custom BatchProgressReporter to forward them to logs or dashboards. Dashboards that need to distinguish a finished failed batch from a stalled one should implement the optional BatchTerminalProgressReporter sub-contract instead β€” it adds a reportTerminal(...) callback with explicit success / failure / empty status. See docs/HORIZON_BATCH_QUEUES.md for an example binding.

To clear an inherited numeric profile value for a one-off run without redefining the profile, pass none (or null) on the corresponding flag. For example, with --batch-profile=nightly (which sets rate_limit=60, rate_window_seconds=60, checkpoint_every=100), --rate-limit=none --checkpoint-every=none disables both for that single invocation while keeping every other profile field. The same sentinel works for --timeout, --batch-timeout, --chunk-size, --rate-limit, --rate-window-seconds, --result-ttl-seconds, and --checkpoint-every. --queue is excluded β€” queue names are arbitrary strings, so override the profile in eval-harness.batches.profiles.* config when an inherited queue must be cleared.

Host apps can override or register additional profiles under eval-harness.batches.profiles in config/eval-harness.php. See docs/HORIZON_BATCH_QUEUES.md for the full profile reference and Horizon tuning recipes.

Eval sets and resume manifests

Group registered datasets into an eval set when one CI or release gate needs to run several datasets in order. The returned manifest is stable JSON and can be stored by the host app between attempts; completed datasets are skipped when the manifest is passed back in.

use Eval;
use Padosoft\EvalHarness\Batches\BatchOptions;
use Padosoft\EvalHarness\EvalSets\EvalSetManifest;

$manifestPath = storage_path('eval/release.rag.manifest.json');
$previousManifest = null;
if (is_file($manifestPath)) {
    $manifestPayload = json_decode((string) file_get_contents($manifestPath), true, flags: JSON_THROW_ON_ERROR);
    $previousManifest = is_array($manifestPayload)
        ? EvalSetManifest::fromJson($manifestPayload)
        : null;
}

$evalSet = Eval::evalSet('release.rag', [
    'rag.factuality.fy2026',
    'rag.refusals.fy2026',
]);

$result = Eval::runEvalSet(
    $evalSet,
    app(App\Eval\MyRagRunner::class),
    BatchOptions::serial(),
    $previousManifest ?? null,
);

file_put_contents(
    $manifestPath,
    json_encode($result->manifest->toJson(), JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR),
);

Real-world usage recipes

These are complete snippets you can paste into a Laravel app or CI job.

Recipe A β€” PR-safe RAG regression gate

Use one dataset, one artifact path, and a deterministic failure signal in CI:

cat > .github/workflows/eval-gate.yml <<'EOF'
name: AI Regression Gate

on:
  pull_request:
    paths:
      - 'app/**'
      - 'config/**'
      - 'eval/**'
      - 'resources/**'

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.3'
          tools: composer:v2
      - uses: shivammathur/setup-node@v3
        with:
          node-version: '20'
      - name: Install dependencies
        run: composer install --no-interaction --prefer-dist --no-progress
      - name: Run eval gate
        env:
          EVAL_HARNESS_JUDGE_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          php artisan eval-harness:run rag.factuality.fy2026 \
            --registrar="App\\Console\\EvalRegistrar" \
            --batch=lazy-parallel \
            --concurrency=4 \
            --queue=default \
            --timeout=60 \
            --batch-timeout=180 \
            --json \
            --out=eval-report.json
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json
EOF

Recipe B β€” nightly safety baseline refresh with manifest regression gate

Run adversarial tests on a schedule, keep a bounded local manifest, and fail when behavior regresses:

php artisan eval-harness:adversarial \
  --registrar="App\\Console\\EvalRegistrar" \
  --category=prompt-injection \
  --category=pii-leak \
  --metric=refusal-quality \
  --batch=lazy-parallel \
  --concurrency=2 \
  --queue=adversarial \
  --manifest=storage/app/adversarial-runs.json \
  --manifest-retain=10 \
  --regression-gate \
  --regression-max-drop=5 \
  --regression-metric=refusal-quality:pass_rate \
  --json \
  --out=adversarial-nightly.json

In Laravel Scheduler, dispatch that command via the schedule:run pipeline and rotate artifacts with normal deployment hygiene.

Recipe C β€” consume report API artifacts in your admin surface

Use the opt-in Laravel report API as a source of truth for charts and CSV export:

curl -sS \
  -H "Accept: application/json" \
  "$APP_URL/admin/eval-harness/api/reports" \
  | jq '.artifacts[] | select(.id|test("factuality")) | {id, type, size}'
curl -sS \
  "$APP_URL/admin/eval-harness/api/reports/<url-safe-id>/rows.csv" \
  -H "Accept: text/csv" \
  > eval-rows.csv

The examples above use stable API schema identifiers documented in docs/REPORT_API_CONTRACT.md.

Retrieval-ranking metrics (RAG hit@k / recall@k / MRR / nDCG@k)

The retrieval-ranking family scores how well your retriever ranks the right documents. The math is domain-agnostic: it consumes a ranked list of opaque ids (and, for containment, texts) plus the relevant ground truth, and returns a number in [0, 1]. Running the retriever stays in your app; the package owns the (easy-to-get-wrong) ranking math.

Your system-under-test emits actualOutput as JSON β€” an ordered list, rank 1 first:

{"retrieved": [{"id": "doc-7", "text": "..."}, {"id": "doc-3", "text": "..."}]}

A bare array (["doc-7", "doc-3"]) is also accepted. The sample's expected_output is the relevant ground truth:

samples:
  - id: q-1
    input: { question: "What is our refund window?" }
    expected_output: ["doc-3", "doc-9"]        # relevant ids (binary relevance)
    metadata: { k: 5 }                          # optional per-sample cutoff
  - id: q-2
    input: { question: "..." }
    expected_output: { "doc-3": 3, "doc-9": 1 } # graded gains (nDCG uses them)
alias what it scores
retrieval-hit-at-k 1.0 if any relevant id is in the top-k, else 0.0
retrieval-recall-at-k fraction of relevant ids found in the top-k
retrieval-mrr reciprocal rank of the first relevant id (cutoff-independent)
retrieval-ndcg-at-k nDCG@k with binary or graded gains
answer-containment-at-k 1.0 if the expected answer span appears in any top-k retrieved text

k resolves as per-sample metadata.k β†’ constructor k β†’ config eval-harness.metrics.retrieval.default_k (default 5). Aliases resolve through the container with zero extra binding:

$engine->dataset('rag.retrieval')->withMetrics([
    'retrieval-recall-at-k',
    'retrieval-ndcg-at-k',
    'answer-containment-at-k',
]);

Calibrating the judge against human labels

Before you trust an LLM judge to gate CI, prove it agrees with humans. Provide a YAML file of human-labelled cases:

schema_version: eval-harness.calibration.v1
name: judge.calibration.fy2026
cases:
  - id: c1
    input: { question: "What is the capital of France?" }
    expected: "Paris"
    actual: "The capital of France is Paris."
    human_verdict: pass
  - id: c2
    input: { question: "What is the capital of France?" }
    expected: "Paris"
    actual: "It is Berlin."
    human_verdict: fail
php artisan eval-harness:calibrate-judge eval/calibration/judge-cases.v1.yaml \
  --min-agreement=0.85 \
  --model-under-test=my-rag-model \
  --json --out=judge-calibration.json

The command judges each case, converts the raw judge score into a pass/fail verdict (verdict_pass_threshold), and reports the verdict agreement rate against the human ground truth (not raw-score correlation), a confusion matrix, a length-bias signal (Spearman rank correlation between answer length and judge score), and a self-preference guard. It exits non-zero when agreement falls below the floor or the judge model equals --model-under-test, and warns when length bias is suspected β€” so an untrustworthy judge fails the build instead of silently rubber-stamping regressions.

Online / production monitoring

Sample live AI traffic, judge it on a queue, and chart pass-rate drift. Everything is off by default. Enable it and call OnlineMonitor::capture() from your production AI path:

use Padosoft\EvalHarness\Online\OnlineMonitor;

// In your production controller / service, after generating an answer:
app(OnlineMonitor::class)->capture(
    dataset: 'rag.faq',
    sampleId: (string) $request->id,
    input: ['question' => $question],
    expected: $goldenAnswerOrReference,
    actual: $modelAnswer,
);

On a sampling hit the monitor dispatches a queued JudgeLiveSampleJob that judges the interaction with the configured metric and persists an OnlineScore row (the package ships its first migration for this). An OnlineDriftAlert re-checks the recent window and fires an OnlinePassRateDropped event when the pass rate dips below threshold β€” register a listener to route it to Slack / PagerDuty:

use Illuminate\Support\Facades\Event;
use Padosoft\EvalHarness\Online\Events\OnlinePassRateDropped;

Event::listen(function (OnlinePassRateDropped $e): void {
    // notify your on-call channel: $e->dataset, $e->passRate, $e->threshold
});
php artisan vendor:publish --tag=eval-harness-migrations
php artisan migrate

The read-only GET /{prefix}/online/{dataset}/trend endpoint feeds the pass-rate chart in the companion admin panel.

Web admin panel UI

This package stays headless, but it already has a polished companion web admin panel at padosoft/eval-harness-admin. Use that repository when you want a ready-made dashboard for browsing stored reports, comparing regressions, inspecting adversarial manifests, following live batch progress, and watching production online-monitoring pass-rate trends through this package's read-only API contracts.

eval-harness admin dashboard

The UI package spec is documented in docs/UI_PACKAGE_SPEC.md. It covers the intended read-only admin screens: Dashboard, Reports list, Report detail, Compare, Trend, Adversarial manifests, and Live batches.

Contract stability and migration

v1.0 marks the package contract baseline and introduces explicit guidance on semantic compatibility:

  • Versioned, additive report and dataset schemas (eval-harness.report.v1, eval-harness.dataset.v1).
  • API payload versioning (eval-harness.api.v1).
  • Queue manifest compatibility by report schema / dataset / metric set signatures.
  • Stable command contracts for eval-harness:run and eval-harness:adversarial with additive extension.

Read full details before a major upgrade:

For end users upgrading from pre-1.0, also check the changelog and keep composer pins strict to avoid surprise majors.

Configuration

config/eval-harness.php (after vendor:publish):

use Padosoft\EvalHarness\Support\RuntimeOptions;
use Padosoft\EvalHarness\Support\TimeoutNormalizer;

return [
    'metrics' => [

        'cosine_embedding' => [
            'endpoint' => env('EVAL_HARNESS_EMBEDDINGS_ENDPOINT', 'https://api.openai.com/v1/embeddings'),
            'api_key'  => env('EVAL_HARNESS_EMBEDDINGS_API_KEY', env('OPENAI_API_KEY', '')),
            'model'    => env('EVAL_HARNESS_EMBEDDINGS_MODEL', 'text-embedding-3-small'),
            'timeout_seconds' => TimeoutNormalizer::normalize(env('EVAL_HARNESS_EMBEDDINGS_TIMEOUT'), 30),
        ],

        'llm_as_judge' => [
            'endpoint' => env('EVAL_HARNESS_JUDGE_ENDPOINT', 'https://api.openai.com/v1/chat/completions'),
            'api_key'  => env('EVAL_HARNESS_JUDGE_API_KEY', env('OPENAI_API_KEY', '')),
            'model'    => env('EVAL_HARNESS_JUDGE_MODEL', 'gpt-4o-mini'),
            'timeout_seconds' => TimeoutNormalizer::normalize(env('EVAL_HARNESS_JUDGE_TIMEOUT'), 60),
            'prompt_template' => env('EVAL_HARNESS_JUDGE_PROMPT_TEMPLATE'),
        ],

        // Retrieval-ranking metrics (hit@k / recall@k / MRR / nDCG@k /
        // answer-containment@k). default_k is the cutoff; per-sample
        // metadata.k overrides it.
        'retrieval' => [
            'default_k' => RuntimeOptions::normalizePositiveInt(env('EVAL_HARNESS_RETRIEVAL_DEFAULT_K'), 5),
        ],

    ],

    // Judge calibration (eval-harness:calibrate-judge). Agreement is on
    // verdicts, not raw scores; require_distinct_models is the
    // self-preference guard.
    'calibration' => [
        'verdict_pass_threshold'  => RuntimeOptions::normalizeUnitInterval(env('EVAL_HARNESS_CALIBRATION_PASS_THRESHOLD'), 0.5),
        'min_agreement'           => RuntimeOptions::normalizeUnitInterval(env('EVAL_HARNESS_CALIBRATION_MIN_AGREEMENT'), 0.8),
        'length_bias_warn'        => RuntimeOptions::normalizeUnitInterval(env('EVAL_HARNESS_CALIBRATION_LENGTH_BIAS_WARN'), 0.4),
        'require_distinct_models' => RuntimeOptions::normalizeBoolean(env('EVAL_HARNESS_CALIBRATION_REQUIRE_DISTINCT_MODELS'), true),
        'model_under_test'        => env('EVAL_HARNESS_CALIBRATION_MODEL_UNDER_TEST'),
    ],

    // Online / production monitoring. Off by default; the host app calls
    // OnlineMonitor::capture() and a sampled fraction is judged on a queue.
    'online' => [
        'enabled'        => RuntimeOptions::normalizeBoolean(env('EVAL_HARNESS_ONLINE_ENABLED'), false),
        'sampling_rate'  => RuntimeOptions::normalizeUnitInterval(env('EVAL_HARNESS_ONLINE_SAMPLING_RATE'), 0.0),
        'metric'         => env('EVAL_HARNESS_ONLINE_METRIC', 'llm-as-judge'),
        'pass_threshold' => RuntimeOptions::normalizeUnitInterval(env('EVAL_HARNESS_ONLINE_PASS_THRESHOLD'), 0.7),
        'queue'          => env('EVAL_HARNESS_ONLINE_QUEUE'),
        'connection'     => env('EVAL_HARNESS_ONLINE_CONNECTION'),
        'alert' => [
            'threshold'   => RuntimeOptions::normalizeUnitInterval(env('EVAL_HARNESS_ONLINE_ALERT_THRESHOLD'), 0.8),
            'window'      => RuntimeOptions::normalizePositiveInt(env('EVAL_HARNESS_ONLINE_ALERT_WINDOW'), 50),
            'min_samples' => RuntimeOptions::normalizePositiveInt(env('EVAL_HARNESS_ONLINE_ALERT_MIN_SAMPLES'), 20),
        ],
    ],

    'runtime' => [
        'raise_exceptions' => RuntimeOptions::normalizeBoolean(env('EVAL_HARNESS_RAISE_EXCEPTIONS'), false),
        'provider_retry_attempts' => RuntimeOptions::normalizeNonNegativeInt(env('EVAL_HARNESS_PROVIDER_RETRY_ATTEMPTS'), 0),
        'provider_retry_sleep_milliseconds' => RuntimeOptions::normalizeNonNegativeInt(env('EVAL_HARNESS_PROVIDER_RETRY_SLEEP_MS'), 100),
    ],

    'reports' => [
        'disk' => env('EVAL_HARNESS_REPORTS_DISK', 'local'),
        'path_prefix' => env('EVAL_HARNESS_REPORTS_PATH', 'eval-harness/reports'),
    ],

    'batches' => [
        'lazy_parallel' => [
            'cache_store' => env('EVAL_HARNESS_BATCH_CACHE_STORE'),
            'result_ttl_seconds' => TimeoutNormalizer::normalize(env('EVAL_HARNESS_BATCH_RESULT_TTL'), 3600),
            'wait_timeout_seconds' => TimeoutNormalizer::normalize(env('EVAL_HARNESS_BATCH_WAIT_TIMEOUT'), 60),
        ],
    ],
];

Pointing at a non-OpenAI provider

# OpenRouter
EVAL_HARNESS_JUDGE_ENDPOINT=https://openrouter.ai/api/v1/chat/completions
EVAL_HARNESS_JUDGE_API_KEY=or-your-key
EVAL_HARNESS_JUDGE_MODEL=anthropic/claude-3.5-sonnet

# Regolo (Italian sovereign infra)
EVAL_HARNESS_JUDGE_ENDPOINT=https://api.regolo.ai/v1/chat/completions
EVAL_HARNESS_JUDGE_API_KEY=rgl-your-key
EVAL_HARNESS_JUDGE_MODEL=mistral-large

The embedding-backed metrics (cosine-embedding, bertscore-like) use the same OpenAI-compatible embeddings endpoint (data[].embedding). Most providers already implement that contract. Host apps can also bind Padosoft\EvalHarness\Contracts\EmbeddingClient to route embeddings through Laravel AI or deterministic fakes.

The judge-backed metrics (llm-as-judge, refusal-quality) share the same chat-completions settings. refusal-quality requires each sample to declare metadata.refusal_expected: true|false so safety/refusal behavior is explicit in the dataset contract.

Adversarial safety/regression seeds are opt-in. Build and register them programmatically when a host app wants a red-team lane:

use Padosoft\EvalHarness\Adversarial\AdversarialCategory;
use Padosoft\EvalHarness\Adversarial\AdversarialDatasetFactory;

$factory = app(AdversarialDatasetFactory::class);

$dataset = $factory->build(categories: [
    AdversarialCategory::PromptInjection,
    'pii-leak',
    'ssrf',
]);

app(\Padosoft\EvalHarness\EvalEngine::class)->registerDataset($dataset);

Or run the built-in red-team lane directly from Artisan:

php artisan eval-harness:adversarial \
  --registrar="App\Console\EvalRegistrar" \
  --category=prompt-injection \
  --category=pii-leak \
  --metric=refusal-quality \
  --manifest=storage/eval/adversarial-runs.json \
  --manifest-retain=10 \
  --regression-gate \
  --regression-max-drop=5 \
  --regression-metric=refusal-quality:mean \
  --json --out=adversarial.json

eval:adversarial is available as a short alias. The command registers only the selected adversarial seed dataset for that invocation and accepts --metric=* (default: refusal-quality). Two scoring modes are supported: pass --outputs=<path> to score precomputed responses (this path bypasses the batch contract entirely and goes straight to scoreOutputs(), so --batch, --batch-profile, --rate-limit, and the other dispatch flags are ignored when --outputs is set); otherwise the SUT is invoked through the same batch contract as eval-harness:run β€” --batch, --batch-profile, --concurrency, --queue, --timeout, --batch-timeout, --chunk-size, --rate-limit, --rate-window-seconds, --result-ttl-seconds, and --checkpoint-every. Add --manifest=<path> to update a local JSON run-history manifest and --manifest-retain=N to keep a bounded set of adversarial summaries: the newest N summaries plus any additional failure-free baselines needed for compatible report schema, dataset, metric names, and adversarial category/sample-count slices. Size --manifest-retain for the number of distinct report schema, dataset, metric, category, and sample-count slices you run, because each slice may need its own clean baseline. Add --regression-gate to compare the current run with the latest compatible failure-free existing manifest entry (same report schema, dataset, metric names, and adversarial category/sample-count slice) before the current run is recorded. --regression-max-drop=5 means five normalized percentage points. Repeat --regression-metric=metric or --regression-metric=metric:mean|p50|p95|pass_rate for additional metric aggregate checks. Configured regression metrics must exist in the current run; missing current aggregates fail closed even when no baseline exists yet. If no compatible failure-free baseline exists after that validation, the command emits an explicit missing-baseline status. Runs can advance the next compatible baseline whenever they are written to the manifest and failure-free, including plain --manifest writes without --regression-gate. Gated runs are recorded only when they are failure-free and do not fail configured gate checks; metric failures and gate failures are left out so they cannot seed broken baselines.

Add --promote-failures=eval/adversarial-failures.yml to export low-scoring samples and samples with metric exceptions into a reloadable dataset YAML seed. Use --promoted-dataset=adversarial.security.failures to control the dataset name inside that YAML; otherwise it defaults to <dataset>.failures. Promotion preserves the original sample input, expected output, and metadata, adds metadata.eval_harness.promoted_failure with the source dataset and failed metric names, and intentionally omits actual model output and raw provider error messages from the seed. If no samples fail, an existing promotion file at that path is removed so fixed CI artifact paths do not keep stale failure seeds.

The default factory covers 10 categories: prompt injection, jailbreak, tool abuse, PII leak, SSRF, SQL/shell injection, ASCII smuggling, competitor endorsement, excessive agency, and hallucination overreliance. Samples include metadata.tags, metadata.adversarial, metadata.refusal_expected, and metadata.refusal_policy so they can be scored with refusal-quality and grouped in JSON/Markdown reports. Reports expose a safe normalized adversarial subset only: category, label, severity, and compliance frameworks. Raw prompts, refusal policy text, and arbitrary sample metadata stay out of JSON sample rows. The top-level JSON adversarial block aggregates category metrics and framework counts for OWASP LLM, NIST AI RMF, and EU AI Act style security reporting; Markdown reports render the same data under Adversarial coverage. Manifest files use eval-harness.adversarial-runs.v1, store stable metric aggregates plus the safe adversarial summary, serialize command updates with a lock file, and write through a temporary file before replacing the target path.

For recurring safety checks, see docs/ADVERSARIAL_CONTINUOUS_MONITORING.md. It shows how to run the adversarial lane from Laravel Scheduler or CI cron with persistent manifests, Horizon-backed queues, regression gates, and failure promotion while keeping this package daemon-free.

Provider retries are opt-in. EVAL_HARNESS_PROVIDER_RETRY_ATTEMPTS=2 means two extra attempts after the initial request, with EVAL_HARNESS_PROVIDER_RETRY_SLEEP_MS between attempts. Retries apply only to Laravel HTTP connection failures, HTTP 429, and 5xx responses. Malformed successful responses still fail closed. By default, metric failures are captured in the report; set EVAL_HARNESS_RAISE_EXCEPTIONS=true when a strict CI lane should abort on the first MetricException provider/metric contract error.

For stricter RAG groundedness, citation-groundedness accepts metadata.citation_evidence:

metadata:
  citation_evidence:
    - citation: "[policy:refunds]"
      quote: "Refunds are available within 30 days."

Each evidence span scores only when the actual output contains both the citation marker and the quoted evidence text. Report details expose counts only, not raw citation or quote strings.

Metrics that expose provider usage can add a structured usage detail:

new MetricScore(1.0, [
    'usage' => [
        'prompt_tokens' => 120,
        'completion_tokens' => 40,
        'total_tokens' => 160,
        'cost_usd' => 0.0024,
        'latency_ms' => 850,
    ],
]);

The renderers aggregate those values into top-level JSON and Markdown usage summaries while leaving raw prompts/provider payloads out of the report contract. JSON summaries include per-field reported counts so consumers can distinguish "not reported" from a reported zero; Markdown renders unreported usage totals as n/a. Provider usage attached to a captured metric failure is still included in the aggregate summary so malformed judge/embedding responses do not hide token or latency spend.

The built-in OpenAI-compatible embedding and judge clients automatically attach safe usage details to their metric scores: token/cost fields from the provider's usage object when present, including provider-reported latency_ms when the backend returns it. They do not synthesize local wall-clock latency, keeping reports diff-friendly across repeated runs.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  EvalCommand / AdversarialCommand                                β”‚
β”‚  └─► php artisan eval-harness:run / eval-harness:adversarial      β”‚
β”‚      └─► resolve registrar, dataset, callable/SampleRunner SUT    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  EvalEngine                                                      β”‚
β”‚  - dataset registry (in-memory, single source of truth)          β”‚
β”‚  - run(dataset, sut) / runBatch(dataset, sut, BatchOptions)      β”‚
β”‚      β”œβ”€β–Ί dispatch samples through SerialBatch or LazyParallelBatchβ”‚
β”‚      β”œβ”€β–Ί invoke input callable or SampleInvocation callable/runnerβ”‚
β”‚      β”œβ”€β–Ί lazy-parallel jobs write outputs to BatchResultStore     β”‚
β”‚      β”œβ”€β–Ί for each metric: score(sample, actual)                  β”‚
β”‚      β”‚   - exception β†’ SampleFailure                             β”‚
β”‚      β”‚   - clean β†’ MetricScore                                   β”‚
β”‚      └─► assemble EvalReport                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚                                 β”‚
                  β–Ό                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Metrics                   β”‚       β”‚  Reports                   β”‚
β”‚  - ExactMatchMetric        β”‚       β”‚  - EvalReport              β”‚
β”‚  - CosineEmbeddingMetric   β”‚       β”‚  - MarkdownReportRenderer  β”‚
β”‚  - BertScoreLikeMetric     β”‚       β”‚  - JsonReportRenderer      β”‚
β”‚  - LlmAsJudgeMetric        β”‚       β”‚  - cohorts + histograms    β”‚
β”‚  - RefusalQualityMetric    β”‚       β”‚  - macroF1, p50, p95, mean β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Pluggable metric registry (R23)

MetricResolver accepts:

  1. A Metric instance (full control).

  2. An FQCN string (resolved through the container).

  3. A built-in alias: exact-match, contains, regex, rouge-l, citation-groundedness, cosine-embedding, bertscore-like, llm-as-judge, refusal-quality, ordinal-distance, retrieval-hit-at-k, retrieval-recall-at-k, retrieval-mrr, retrieval-ndcg-at-k, answer-containment-at-k.

    The retrieval aliases and answer-containment-at-k auto-wire from the container with no extra binding. ordinal-distance needs an ordered scale, so pass an instance (new OrdinalDistanceMetric([...])) or bind one under the alias.

Every resolved class is asserted to implement Metric so a typo'd FQCN fails with a clear error instead of producing a runtime "method does not exist".

Standalone-agnostic guarantee

tests/Architecture/StandaloneAgnosticTest.php walks src/ and fails the build if it finds a substring referring to:

  • AskMyDocs internal symbols (KnowledgeDocument, kb_nodes, etc.).
  • Sibling Padosoft packages (padosoft/laravel-flow, padosoft/laravel-patent-box-tracker, etc.).

The package is consumed by AskMyDocs / patent-box-tracker / others but never depends on them. Your composer require works exactly the same whether you use AskMyDocs or not.

πŸš€ AI vibe-coding pack included

Every Padosoft package ships with a .claude/ pack that primes Claude Code (or any compatible AI coding agent) on the conventions used across the Padosoft repos:

  • Skills β€” auto-loaded when context matches (e.g. opening a Pull Request triggers copilot-pr-review-loop, editing tests triggers test-actually-tests-what-it-claims).
  • Rules β€” domain-specific guardrails (Laravel naming, query optimisation, exception handling, frontend testability).
  • Agents β€” specialised sub-agents for reviewing PRs, investigating CI failures, anticipating Copilot review findings.
  • Pattern adoption β€” the COMPANY-PACK + COMPANY-INVENTORY + PATTERN-ADOPTION docs explain how the conventions evolved across PRs so an AI agent can match the prevailing style on day one.

Install Claude Code, open the repo, and your agent immediately knows the house style. No prompt engineering required.

Testing

The package ships three PHPUnit testsuites:

# Default β€” offline, fast (1-2s)
vendor/bin/phpunit --testsuite Unit

# File-system invariants (standalone-agnostic)
vendor/bin/phpunit --testsuite Architecture

# Opt-in live test against a real LLM provider
EVAL_HARNESS_LIVE_API_KEY=sk-... vendor/bin/phpunit --testsuite Live

The Unit and Architecture suites are run by CI on every PR across the PHP 8.3 / 8.4 / 8.5 x Laravel 12 / 13 matrix. The Live suite is never run by CI β€” invoke it explicitly when you want to verify wire compatibility with a new provider or model.

Live suite contract

Every test in tests/Live/ calls markTestSkipped(...) from setUp() when EVAL_HARNESS_LIVE_API_KEY is empty. A contributor running the default phpunit invocation never trips the live path accidentally and never burns API credits.

Roadmap

v1.3.0 β€” Retrieval metrics, judge calibration, online monitoring (current)

Additive, backward-compatible feature set. All v1 contracts are preserved.

  • Retrieval-ranking metric family β€” retrieval-hit-at-k, retrieval-recall-at-k, retrieval-mrr, retrieval-ndcg-at-k, and answer-containment-at-k on a shared RankedRetrieval parser, with metrics.retrieval.default_k config and per-sample metadata.k override. A single tested source of truth for RAG ranking math.
  • Ordinal / distance scoring β€” ordinal-distance gives partial credit for ordered labels (exact 1.0 / off-by-one 0.5 / further 0.0).
  • Judge calibration β€” eval-harness:calibrate-judge validates the judge against human labels: verdict agreement rate, confusion matrix, length-bias signal, and a self-preference guard, with CI gating.
  • Online / production monitoring β€” OnlineMonitor::capture(), config-driven sampling, a queued JudgeLiveSampleJob, the first package migration (eval_harness_online_scores), an OnlinePassRateDropped drift event, and a read-only online/{dataset}/trend API endpoint feeding the companion admin panel's pass-rate chart.

v1.1.0 β€” Enterprise operations and scalability add-on

Released 2026-05-05 (release notes). Builds on v1.0 with the v1.x enterprise operations and scalability add-on. All v1 contracts are preserved.

  • Operational batch profiles β€” --batch-profile=ci|smoke|nightly with explicit-CLI-wins precedence and host-app overrides under eval-harness.batches.profiles.*.
  • Producer-side backpressure β€” --chunk-size, --rate-limit, --rate-window-seconds, and --result-ttl-seconds flags backed by RateLimitWindow (monotonic clock, amortized O(1) sliding-window math) and the producer-side throttle in LazyParallelBatch. Pass none on any nullable numeric flag to clear an inherited profile value for a one-off run.
  • Progress checkpoints β€” --checkpoint-every plus an optional BatchProgressReporter container binding (default NullBatchProgressReporter).
  • Optional terminal status reporting β€” BatchTerminalProgressReporter sub-contract with explicit success / failure / empty status and partial-wins tolerance on the failure path, so dashboards can distinguish a finished failed batch from a stalled one.
  • Operator-facing Horizon supervisor sizing guidance in docs/HORIZON_BATCH_QUEUES.md (multi-producer caveat, chunk-size vs concurrency, rate-limit producer-side behavior).

No hard Horizon dependency added; the package still uses sync queues and queue fakes in tests.

v1.0 (roadmap complete)

  • Core feature set complete β€” cohort metrics, report histograms, batch execution (SerialBatch, LazyParallelBatch, --batch=serial, --batch=lazy-parallel), eval sets with resumable progress, standalone output assertions, additional built-in metrics, usage summaries, and runtime guardrails are fully implemented.
  • Adversarial lane complete β€” opt-in adversarial datasets (including multi-input samples), eval-harness:adversarial, compliance summaries, manifest retention, --regression-gate, and --promote-failures are implemented and tested.
  • API contract complete β€” read-only report API routes/resources for listing/shows, cohort and histogram views, row CSV export, and report artifact download are implemented and documented.
  • Stability and release complete β€” contract stability documents, migration path from pre-1.0, and release guardrails (Metric, EvalReport, JSON report shape, queue jobs, commands, and API resources) are now in place.
  • Roadmap status: all planned Macro Task items (0 through 8) have been completed; no roadmap placeholders remain.

Contributing

See CONTRIBUTING.md. PRs follow the Padosoft 9-step canonical flow (R36): branch + local green + open PR with Copilot reviewer + wait for CI + wait for review + fix + repeat until zero must-fix comments + green CI + merge.

Security

If you discover a security vulnerability, see SECURITY.md β€” please do not open a public issue.

License

Apache-2.0. See LICENSE.

Copyright Β© 2026 Padosoft β€” Lorenzo Padovani.