carmelosantana / coqui-harbor-external
Harbor benchmarking toolkit for Coqui — task management, eval execution, and result analysis via the Harbor CLI
Package info
github.com/carmelosantana/coqui-harbor-external
pkg:composer/carmelosantana/coqui-harbor-external
Requires
- php: ^8.4
Requires (Dev)
- carmelosantana/php-agents: ^0.7
- pestphp/pest: ^3.0
- phpstan/phpstan: ^2.0
README
Harbor benchmarking toolkit for Coqui. Manage tasks, run evaluations, and analyze benchmark results via the Harbor CLI.
Requirements
- PHP 8.4+
- Harbor CLI (`uv tool install harbor`)
- Docker (for local evaluations)
- Coqui
Installation
```
composer require carmelosantana/coqui-harbor-external
```
The toolkit is auto-discovered by Coqui — no code changes needed.
Tools Provided
Discovery & Validation
| Tool | Description |
|---|---|
| `harbor_check` | Verify Harbor CLI, Python, Docker, and uv are installed |
| `harbor_task_validate` | Validate that a task directory has the required structure |
| `harbor_dataset_list` | List registered datasets from the Harbor registry |
Task Authoring
| Tool | Description |
|---|---|
| `harbor_task_init` | Scaffold a new task directory (instruction.md, task.toml, environment/, tests/) |
| `harbor_task_list` | List all tasks in a local dataset directory |
| `harbor_task_delete` | Delete a task directory (gated — requires confirmation) |
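Based on the files named in the `harbor_task_init` description, a freshly scaffolded task directory might look like the sketch below (layout assumed from that description; exact generated contents may differ):

```
my-tasks/
└── my-first-task/
    ├── instruction.md   # task prompt given to the agent
    ├── task.toml        # task metadata and configuration
    ├── environment/     # environment setup for the trial
    └── tests/           # verifier tests that score the result
```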
Execution
| Tool | Description |
|---|---|
| `harbor_run` | Run a Harbor evaluation against a dataset or task path (gated) |
| `harbor_run_status` | Check job progress (trial completion, overall status) |
| `harbor_view` | Launch Harbor's web-based results viewer |
Analysis
| Tool | Description |
|---|---|
| `harbor_results` | Parse job results: pass/fail, reward distribution, durations |
| `harbor_trial_inspect` | Inspect a trial's trajectory, verifier logs, and reward |
| `harbor_compare` | Compare two or more jobs for regression detection |
| `harbor_failures` | Extract failed trials with root cause details |
| `harbor_cleanup` | Delete old job directories (gated) |
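To illustrate the kind of aggregation `harbor_results` performs (pass/fail counts, reward distribution, durations), here is a minimal standalone sketch. The trial record shape below is invented for the example, not Harbor's actual result schema:

```python
import statistics

# Hypothetical trial records -- Harbor's real result format may differ.
trials = [
    {"task": "fix-bug", "passed": True, "reward": 1.0, "duration_s": 212.4},
    {"task": "add-test", "passed": False, "reward": 0.0, "duration_s": 598.1},
    {"task": "refactor", "passed": True, "reward": 0.75, "duration_s": 341.0},
]

def summarize(trials):
    """Aggregate pass rate, mean reward, and mean duration across trials."""
    rewards = [t["reward"] for t in trials]
    return {
        "pass_rate": sum(t["passed"] for t in trials) / len(trials),
        "mean_reward": statistics.mean(rewards),
        "mean_duration_s": statistics.mean(t["duration_s"] for t in trials),
    }

print(summarize(trials))
```

The same per-trial records, keyed by job, would also be enough to drive a `harbor_compare`-style regression check (compare two jobs' pass rates task by task).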
Python Agent Wrapper
The package includes a Python external agent that bridges Harbor's evaluation framework with Coqui's CLI. This allows Harbor to drive Coqui as the agent under test.
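Harbor defines the actual external-agent interface, and `CoquiExternalAgent` implements it; the standalone sketch below only illustrates the bridging pattern — shelling out to a CLI binary selected via `COQUI_BIN` — using a hypothetical class and method name, not the wrapper's real code:

```python
import os
import subprocess

class CliBridgeAgent:
    """Illustrative only: drives an external CLI as the agent under test.

    The real CoquiExternalAgent subclasses Harbor's agent base class; this
    sketch just shows the subprocess bridge with hypothetical names.
    """

    def __init__(self, binary=None, timeout=None):
        # Fall back to the same env vars documented in Configuration below.
        self.binary = binary or os.environ.get("COQUI_BIN", "coqui")
        self.timeout = timeout or int(os.environ.get("COQUI_TIMEOUT", "600"))

    def run_task(self, instruction: str) -> str:
        # Hand the task instruction to the CLI and capture its output.
        result = subprocess.run(
            [self.binary, instruction],
            capture_output=True, text=True, timeout=self.timeout,
        )
        return result.stdout.strip()

# Demo with a stand-in binary so the sketch runs anywhere:
agent = CliBridgeAgent(binary="echo")
print(agent.run_task("hello"))  # → hello
```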
Setup
```
cd agent
uv pip install -e .
```
Usage
```
harbor run \
  -p ./my-tasks \
  --agent-import-path coqui_harbor_agent.agent:CoquiExternalAgent \
  -m anthropic/claude-sonnet-4-20250514
```
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| `COQUI_BIN` | `coqui` | Path to the Coqui binary |
| `COQUI_TIMEOUT` | `600` | Max seconds per task |
| `COQUI_MAX_ITERATIONS` | `100` | Agent iteration limit |
| `COQUI_MODEL` | (from Harbor `-m`) | Model override |
| `COQUI_ROLE` | `coder` | Agent role |
| `COQUI_AUTO_APPROVE` | `true` | Auto-approve tool calls |
| `COQUI_EXTRA_ARGS` | | Additional CLI arguments |
Bundled Skill
The harbor-benchmarking skill provides an operational SOP for running benchmark campaigns — including task creation, evaluation execution, failure triage, regression detection, and reporting. It is auto-discovered when the package is installed.
Bundled Loop
The benchmark loop definition automates a full benchmark cycle:
- Plan — validate tasks, define success criteria, create plan artifact
- Coder — execute benchmark runs, analyze results, create report artifact
- Reviewer — verify completeness, check for regressions, approve or request changes
Terminates when the reviewer responds with APPROVED.
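The Plan → Coder → Reviewer cycle and its APPROVED stop condition can be sketched as a simple driver loop. The role callables below are stubs for illustration; the real loop definition ships with the package:

```python
def run_benchmark_loop(plan, code, review, max_rounds=5):
    """Cycle Plan -> Coder -> Reviewer until the reviewer approves."""
    plan_artifact = plan()                      # Plan: criteria + plan artifact
    for round_no in range(1, max_rounds + 1):
        report = code(plan_artifact)            # Coder: run benchmarks, report
        verdict = review(report)                # Reviewer: approve or push back
        if verdict == "APPROVED":               # termination condition
            return report, round_no
        plan_artifact = verdict                 # feedback drives the next round
    raise RuntimeError("reviewer never approved within max_rounds")

# Stub roles: the reviewer approves on the second round.
verdicts = iter(["needs more trials", "APPROVED"])
report, rounds = run_benchmark_loop(
    plan=lambda: "plan-v1",
    code=lambda p: f"report for {p}",
    review=lambda r: next(verdicts),
)
print(rounds)  # → 2
```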
Development
```
composer install
composer test      # Run Pest tests
composer analyse   # Run PHPStan (level 8)
```
License
MIT