kevinpijning / pest-plugin-prompt
Pest plugin to evaluate prompts
Installs: 181
Dependents: 0
Suggesters: 0
Security: 0
Stars: 6
Watchers: 1
Forks: 2
Open Issues: 5
pkg:composer/kevinpijning/pest-plugin-prompt
Requires
- php: ^8.3
- pestphp/pest: ^4.0.0
- pestphp/pest-plugin: ^4.0.0
- symfony/yaml: ^7.3
Requires (Dev)
- pestphp/pest-dev-tools: ^4.0.0
This package is auto-updated.
Last update: 2025-12-31 13:46:45 UTC
README
Test your AI prompts with confidence using Pest's elegant syntax.
This plugin brings LLM prompt testing to your Pest test suite, powered by promptfoo under the hood. Write fluent, expressive tests for evaluating AI model prompts using the familiar Pest API you already love.
Table of Contents
- Why Use This Plugin?
- Prerequisites
- Installation
- Quick Start
- Documentation
- Core Functions
- Evaluation Methods
- Assertion Methods
  toContain(), toContainAll(), toContainAny(), toContainJson(), toContainHtml(), toContainSql(), toContainXml(), toEqual(), toBe(), toBeJudged(), startsWith(), toMatchRegex(), toBeJson(), toEqualJson(), toMatchJsonStructure(), toHaveJsonFragment(), toHaveJsonFragments(), toHaveJsonPath(), toHaveJsonPaths(), toHaveJsonType(), toBeHtml(), toBeSql(), toBeXml(), toBeSimilar(), toHaveLevenshtein(), toHaveRougeN(), toHaveFScore(), toHavePerplexity(), toHavePerplexityScore(), toHaveCost(), toHaveLatency(), toHaveValidFunctionCall(), toHaveValidOpenaiFunctionCall(), toHaveValidOpenaiToolsCall(), toHaveToolCallF1(), toHaveFinishReason(), toBeClassified(), toBeScoredByPi(), toBeRefused(), toPassJavascript(), toPassPython(), toPassWebhook(), toHaveTraceSpanCount(), toHaveTraceSpanDuration(), toHaveTraceErrorSpans(), and the not modifier
- Provider Configuration
- Usage Examples
- CLI Options
- Credits & License
Why Use This Plugin?
- Test prompts against multiple LLM providers - Compare OpenAI, Anthropic, and more in a single test
- Validate responses with content assertions - Check for specific text, JSON validity, HTML structure, and more
- Use LLM-based evaluation - Judge responses with natural language rubrics using AI itself
- Familiar Pest-style fluent API - Feels natural if you're already using Pest
- Automatic cleanup - Temporary files are managed for you
- Battle-tested - Built on promptfoo's proven evaluation framework
Prerequisites
Before you begin, make sure you have:
- PHP 8.3 or higher
- Pest 4.0 or higher
- Node.js and npm - Required for promptfoo execution via npx
- API keys for LLM providers - You'll need keys for the providers you want to test
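Before writing your first test, it can help to sanity-check the Node.js side. Since promptfoo is fetched from npm, something like the following should print version numbers (the promptfoo invocation via npx is an assumption based on the setup described above; exact output will vary):

node --version
npm --version
npx promptfoo@latest --version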
Setting up API Keys
Set environment variables for the providers you'll use:
export OPENAI_API_KEY="your-openai-key-here"
export ANTHROPIC_API_KEY="your-anthropic-key-here"
If you're using Laravel or a similar framework with .env file support, you can add them there instead.
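For example, the equivalent .env entries (same variable names as the exports above):

OPENAI_API_KEY="your-openai-key-here"
ANTHROPIC_API_KEY="your-anthropic-key-here"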
For more provider options and configuration, check out promptfoo's provider documentation.
Installation
Install the plugin via Composer:
composer require kevinpijning/pest-plugin-prompt --dev
The plugin automatically registers with Pest via package discovery - no additional configuration needed!
Quick Start
Here's the simplest possible example to get you started:
test('greeting prompt works correctly', function () {
    prompt('You are a helpful assistant. Greet {{name}} warmly.')
        ->usingProvider('openai:gpt-4o-mini')
        ->expect(['name' => 'Alice'])
        ->toContain('Alice');
});
What's happening here?
- We create a prompt with variable interpolation using {{name}}
- We specify OpenAI's GPT-4o-mini as our LLM provider
- We test with the variable name set to "Alice"
- We assert that the response contains "Alice"
When you run this test, the plugin will:
- Send the prompt to OpenAI with "Alice" substituted for {{name}}
- Receive the response
- Verify that "Alice" appears in the response
- Pass or fail the test accordingly
Documentation
Core Functions
prompt()
Create a new evaluation with one or more prompts. Use {{variable}} syntax for variable interpolation.
// Single prompt
prompt('You are a helpful assistant.');

// Multiple prompts (tested against each other)
prompt(
    'You are a helpful assistant.',
    'You are a professional assistant.'
);

// With variables
prompt('Greet {{name}} warmly.');
provider()
Register a global provider, similar to a Pest dataset, that can be reused across multiple tests. Providers registered with this function can be referenced by name in usingProvider().
// Register a simple provider
provider('openai-gpt4')->id('openai:gpt-4');

// Register with full configuration
provider('custom-openai')
    ->id('openai:gpt-4')
    ->label('Custom OpenAI')
    ->temperature(0.7)
    ->maxTokens(2000);

// Use in tests
prompt('Hello')
    ->usingProvider('custom-openai')
    ->expect()
    ->toContain('Hi');
Evaluation Methods
describe()
Add a description to your evaluation for better test output and debugging.
prompt('You are a helpful assistant.')
    ->describe('Tests basic assistant greeting')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContain('Hello');
usingProvider()
Specify which LLM provider(s) to use for evaluation. You can pass provider IDs, Provider instances, callables, or registered provider names.
// Single provider by ID
prompt('Hello')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContain('Hi');

// Multiple providers (compares responses)
prompt('What is 2+2?')
    ->usingProvider('openai:gpt-4o-mini', 'anthropic:claude-3')
    ->expect()
    ->toContain('4');

// Provider instance
$provider = Provider::create('openai:gpt-4')
    ->temperature(0.7);

prompt('Hello')
    ->usingProvider($provider)
    ->expect()
    ->toContain('Hi');

// Use default provider (openai:gpt-4o-mini)
prompt('Hello')
    ->expect()
    ->toContain('Hi');
alwaysExpect()
Set default assertions and variables that apply to all test cases in the evaluation. This is useful when you want to ensure certain conditions are met for every test case without repeating the assertions.
prompt('Translate {{message}} to {{language}}.')
    ->usingProvider('openai:gpt-4o-mini')
    ->alwaysExpect(['message' => 'Hello World!'])
    ->toBeJudged('the language is always a friendly variant')
    ->toBeJudged('the source and output language are always mentioned in the response')
    ->expect(['language' => 'es'])
    ->toContain('hola')
    ->toBeJudged('Contains the translation of Hello world! in spanish');
With callback:
You can pass an optional callback function to configure the default test case:
prompt('Translate {{message}} to {{language}}.')
    ->usingProvider('openai:gpt-4o-mini')
    ->alwaysExpect(
        ['message' => 'Hello World!'],
        function (TestCase $testCase) {
            $testCase
                ->toBeJudged('the language is always a friendly variant')
                ->toBeJudged('the source and output language are always mentioned in the response');
        }
    )
    ->expect(['language' => 'es'])
    ->toContain('hola');
Key points:
- alwaysExpect() returns a TestCase instance that supports all assertion methods
- Assertions added via alwaysExpect() apply to every test case in the evaluation
- Default variables can be set and will be merged with test case variables
- You can chain multiple assertions after alwaysExpect() or use a callback
- The default test case is separate from regular test cases and won't appear in the testCases() array
- If alwaysExpect() is called multiple times, subsequent calls will execute the callback on the existing default test case
Use cases:
- Ensure all responses meet quality standards (e.g., "always be professional") - see the sketch after this list
- Set common variables that apply to all tests
- Enforce safety checks across all test cases
- Apply format requirements universally (e.g., "always contain JSON")
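To make the first two points concrete, here's a minimal sketch that combines a shared variable with evaluation-wide rubrics; the prompt text, variable values, and rubric wording are invented for illustration:

// Shared variable plus rubrics applied to every test case below
prompt('You are a support agent for {{product}}. Answer: {{question}}')
    ->usingProvider('openai:gpt-4o-mini')
    ->alwaysExpect(['product' => 'Acme CRM'])
    ->toBeJudged('the response is always professional and polite')
    ->toBeJudged('the response never invents product features')
    ->expect(['question' => 'How do I reset my password?'])
    ->toContain('password')
    ->and(['question' => 'What are your support hours?'])
    ->toContain('hours');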
expect()
Create a test case with variables that will be substituted into your prompt template.
prompt('Greet {{name}} warmly.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['name' => 'Alice'])
    ->toContain('Alice');

// Multiple variables
prompt('{{greeting}}, {{name}}!')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['greeting' => 'Hello', 'name' => 'Bob'])
    ->toContain('Hello')
    ->toContain('Bob');

// Empty variables (no substitution)
prompt('You are a helpful assistant.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContain('assistant');
With callback:
You can pass an optional callback function that receives the created TestCase instance. This is useful for grouping multiple assertions or applying conditional logic.
prompt('Greet {{name}} warmly.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['name' => 'Alice'], function (TestCase $testCase) {
        $testCase
            ->toContain('Alice')
            ->toContain('Hello')
            ->toBeJudged('response is friendly and welcoming');
    });

// Using arrow function
prompt('Translate {{text}} to {{language}}.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(
        ['text' => 'Hello', 'language' => 'Spanish'],
        fn (TestCase $tc) => $tc
            ->toContain('Hola')
            ->toBeJudged('translation is accurate')
    );
and()
Chain multiple test cases for the same evaluation. Each call to and() creates a new test case with different variables.
prompt('Greet {{name}} warmly.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['name' => 'Alice'])
    ->toContain('Alice')
    ->and(['name' => 'Bob'])
    ->toContain('Bob')
    ->and(['name' => 'Charlie'])
    ->toContain('Charlie');
With callback:
You can pass an optional callback function that receives the newly created TestCase:
prompt('Greet {{name}} warmly.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['name' => 'Alice'])
    ->toContain('Alice')
    ->and(['name' => 'Bob'], function (TestCase $testCase) {
        $testCase
            ->toContain('Bob')
            ->toBeJudged('response is warm and friendly');
    })
    ->and(['name' => 'Charlie'], fn (TestCase $tc) => $tc->toContain('Charlie'));
to() and group()
Group multiple assertions together using a callback. Both to() and group() are aliases that execute a callback with the current test case, allowing you to organize assertions logically.
prompt('Explain {{topic}} in detail.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['topic' => 'quantum computing'])
    ->to(function (TestCase $testCase) {
        $testCase
            ->toContain('quantum')
            ->toContain('computing')
            ->toBeJudged('explanation is clear and accurate')
            ->toHaveLatency(2000);
    });

// Using group() (same as to())
prompt('Analyze {{data}}.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['data' => 'sales figures'])
    ->group(function (TestCase $testCase) {
        $testCase
            ->toContain('analysis')
            ->toBeJudged('analysis is thorough');
    });

// Chaining multiple groups
prompt('Review {{document}}.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['document' => 'contract'])
    ->to(fn (TestCase $tc) => $tc->toContain('terms'))
    ->group(fn (TestCase $tc) => $tc->toBeJudged('review is comprehensive'))
    ->to(fn (TestCase $tc) => $tc->toHaveLatency(1500));
Key points:
- to() and group() are functionally identical - use whichever reads better in your context
- The callback receives the current TestCase instance
- Useful for organizing related assertions together
- Can be chained multiple times
- Works with all assertion methods
Use cases:
- Group related assertions for better code organization
- Apply conditional logic based on test case variables
- Reuse assertion patterns across multiple test cases - see the sketch after this list
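A minimal sketch of that last point, reusing one assertion closure across test cases (the closure name, prompt, and rubric text are invented for illustration):

// Define a shared quality bar once...
$meetsQualityBar = function (TestCase $testCase) {
    $testCase
        ->toBeJudged('the response is clear and accurate')
        ->toHaveLatency(3000);
};

// ...and apply it to several test cases
prompt('Summarize {{article}}.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['article' => 'a short news story'])
    ->group($meetsQualityBar)
    ->and(['article' => 'a long technical report'])
    ->group($meetsQualityBar);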
Assertion Methods
toContain()
Assert that the response contains specific text. Case-insensitive by default.
prompt('What is the capital of France?')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContain('Paris');

// Case-sensitive matching
prompt('What is the capital of France?')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContain('Paris', strict: true);

// With threshold (similarity score, 0.0 to 1.0)
prompt('Explain quantum computing.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContain('quantum', threshold: 0.8);

// With custom options
prompt('What is AI?')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContain('artificial intelligence', options: ['normalize' => true]);
toContainAll()
Assert that the response contains all of the specified strings.
prompt('Describe a healthy meal.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainAll(['protein', 'vegetables', 'grains']);

// Case-sensitive
prompt('Describe a healthy meal.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainAll(['Protein', 'Vegetables'], strict: true);

// With threshold
prompt('Describe a healthy meal.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainAll(['protein', 'vegetables'], threshold: 0.9);
toContainAny()
Assert that the response contains at least one of the specified strings.
prompt('What is the weather like?')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainAny(['sunny', 'rainy', 'cloudy']);

// Case-sensitive
prompt('What is the weather like?')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainAny(['Sunny', 'Rainy'], strict: true);
toContainJson()
Assert that the response contains valid JSON.
prompt('Return user data as JSON: name, age, email')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainJson();
toContainHtml()
Assert that the response contains valid HTML.
prompt('Generate an HTML list of fruits')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainHtml();
toContainSql()
Assert that the response contains valid SQL.
prompt('Write a SQL query to select all users')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainSql();
toContainXml()
Assert that the response contains valid XML.
prompt('Generate XML for a product catalog')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toContainXml();
toEqual()
Assert that the response exactly equals the expected value. This is useful for deterministic outputs where you expect an exact match. It can also be used to check that the output matches an expected JSON value.
prompt('Calculate 335 + 85. Return only the number.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toEqual(420);
toBe()
This is a convenience alias of toEqual().
prompt('Calculate 335 + 85. Return only the number.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBe(420);
toBeJudged()
Use an LLM to evaluate the response against a natural language rubric. This is useful for subjective quality checks.
prompt('Explain quantum computing to a beginner.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeJudged('The explanation should be clear, accurate, and use simple language.');

// With threshold (minimum score 0.0 to 1.0)
prompt('Write a product description.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeJudged('The description should be persuasive and highlight key features.', threshold: 0.8);

// With custom options
prompt('Write a product description.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeJudged('Should be professional and engaging.', options: ['provider' => 'openai:gpt-4']);
startsWith()
Assert that the response starts with a specific prefix.
prompt('Generate a greeting.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->startsWith('Hello');

// Case-sensitive
prompt('Generate a greeting.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->startsWith('Hello', strict: true);

// With threshold
prompt('Generate a greeting.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->startsWith('Hello', threshold: 0.9);
toMatchRegex()
Assert that the response matches a regular expression pattern.
prompt('Generate a phone number.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toMatchRegex('/\d{3}-\d{3}-\d{4}/');

// With threshold
prompt('Generate a phone number.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toMatchRegex('/\d{3}-\d{3}-\d{4}/', threshold: 0.9);
toBeJson()
Assert that the response is valid JSON (not just contains JSON).
prompt('Return user data as JSON: name, age, email')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeJson();

// With JSON schema validation
prompt('Return user data as JSON.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeJson([
        'type' => 'object',
        'properties' => [
            'name' => ['type' => 'string'],
            'age' => ['type' => 'number'],
        ],
        'required' => ['name', 'age'],
    ]);
toEqualJson()
Assert that the JSON output exactly equals the expected value. Object key order is ignored, but array order is preserved. This is similar to Laravel's assertExactJson().
prompt('Extract the person info from: {{text}}')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['text' => 'John is 30 years old'])
    ->toEqualJson([
        'name' => 'John',
        'age' => 30,
    ]);

// Works with nested structures
prompt('Extract address info.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toEqualJson([
        'user' => [
            'name' => 'John',
            'address' => [
                'city' => 'Amsterdam',
            ],
        ],
    ]);
toMatchJsonStructure()
Assert that the JSON output contains all expected keys. This validates structure without checking values, similar to Laravel's assertJsonStructure().
// Simple key validation
prompt('Return user data as JSON.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toMatchJsonStructure(['name', 'age', 'email']);

// Nested structure validation
prompt('Return user with address.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toMatchJsonStructure([
        'name',
        'address' => ['street', 'city', 'country'],
    ]);

// Array items with wildcard (*)
prompt('Return a list of users.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toMatchJsonStructure([
        'users' => [
            '*' => ['id', 'name', 'email'],
        ],
    ]);
toHaveJsonFragment()
Assert that the JSON output contains specific key-value pairs. Similar to Laravel's assertJsonFragment().
prompt('Extract person info from: {{text}}')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['text' => 'John Doe is 30 years old'])
    ->toHaveJsonFragment(['name' => 'John Doe'])
    ->toHaveJsonFragment(['age' => 30]);

// Works with nested values
prompt('Extract user with address.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonFragment([
        'address' => ['city' => 'Amsterdam'],
    ]);
toHaveJsonFragments()
Assert that the JSON output contains all specified fragments.
prompt('Extract person info from: {{text}}')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['text' => 'Jane Smith is 25 years old and lives in Berlin'])
    ->toHaveJsonFragments([
        ['name' => 'Jane Smith'],
        ['age' => 25],
        ['city' => 'Berlin'],
    ]);
toHaveJsonPath()
Assert that a value exists at a specific JSON path. Supports dot notation, numeric array indices, and wildcards.
// Check path exists
prompt('Return user with address.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonPath('name')
    ->toHaveJsonPath('address.city');

// Check path has specific value
prompt('Extract person info.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonPath('name', 'John Doe')
    ->toHaveJsonPath('address.city', 'Amsterdam');

// Array index access
prompt('Return list of users.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonPath('users.0.name')
    ->toHaveJsonPath('users.1.name', 'Jane');

// Wildcard for all array items
prompt('Return list of users.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonPath('users.*.name')
    ->toHaveJsonPath('users.*.status', 'active');
toHaveJsonPaths()
Assert that multiple JSON paths exist, optionally with expected values.
// Check paths exist (array of strings)
prompt('Return user data.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonPaths(['name', 'email', 'address.city']);

// Check paths with values (associative array)
prompt('Extract person info from: {{text}}')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect(['text' => 'Grace Lee is 28 years old and lives in Seoul'])
    ->toHaveJsonPaths([
        'name' => 'Grace Lee',
        'age' => 28,
        'city' => 'Seoul',
    ]);

// Mix of existence and value checks with wildcards
prompt('Return users list.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonPaths([
        'users.*.name',
        'users.*.type' => 'customer',
    ]);
toHaveJsonType()
Assert that the value at a JSON path has the expected type. Supports: string, number, boolean, array, object, null.
// Basic type validation
prompt('Return user data.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonType('name', 'string')
    ->toHaveJsonType('age', 'number')
    ->toHaveJsonType('active', 'boolean');

// Nested path type validation
prompt('Return user with address.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonType('address', 'object')
    ->toHaveJsonType('address.city', 'string');

// Array and wildcard type validation
prompt('Return list of users.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveJsonType('users', 'array')
    ->toHaveJsonType('users.*.name', 'string')
    ->toHaveJsonType('users.*.age', 'number');
toBeHtml()
Assert that the response is valid HTML.
prompt('Generate an HTML list of fruits')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeHtml();
toBeSql()
Assert that the response is valid SQL (not just contains SQL).
prompt('Write a SQL query to select all users')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeSql();

// With authority list (allowed SQL operations)
prompt('Write a SQL query.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeSql(['SELECT', 'INSERT']);
toBeXml()
Assert that the response is valid XML.
prompt('Generate XML for a product catalog')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeXml();
toBeSimilar()
Assert that the response is semantically similar to the expected value using embedding similarity.
prompt('Explain artificial intelligence.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeSimilar('AI is the simulation of human intelligence by machines');

// With threshold (default is 0.75)
prompt('Explain artificial intelligence.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeSimilar('AI explanation', threshold: 0.8);

// With custom embedding provider
prompt('Explain artificial intelligence.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeSimilar('AI explanation', provider: 'huggingface:sentence-similarity:model');

// Multiple expected values
prompt('Explain artificial intelligence.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeSimilar(['AI explanation', 'Machine intelligence', 'Artificial intelligence definition']);
toHaveLevenshtein()
Assert that the Levenshtein (edit) distance between the response and expected value is below a threshold.
prompt('Spell the word "hello".')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveLevenshtein('hello', threshold: 2.0);
toHaveRougeN()
Assert that the ROUGE-N score is above a threshold.
prompt('Summarize this article.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveRougeN(1, 'Expected summary', threshold: 0.7);

// ROUGE-2
prompt('Summarize this article.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveRougeN(2, 'Expected summary', threshold: 0.6);
toHaveFScore()
Assert that the F-score is above a threshold.
prompt('Extract entities from the text.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveFScore('Expected entities', threshold: 0.8);
toHavePerplexity()
Assert that the perplexity is below a threshold.
prompt('Generate coherent text.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHavePerplexity(threshold: 10.0);
toHavePerplexityScore()
Assert that the normalized perplexity score is below a threshold.
prompt('Generate coherent text.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHavePerplexityScore(threshold: 0.5);
toHaveCost()
Assert that the inference cost is below a maximum threshold.
prompt('Generate a short response.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveCost(0.01);
toHaveLatency()
Assert that the response latency is below a maximum threshold (in milliseconds).
prompt('Generate a quick response.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveLatency(1000);
toHaveValidFunctionCall()
Assert that the response contains a valid function call matching the provided schema.
prompt('Call the weather function.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveValidFunctionCall([
        'type' => 'object',
        'properties' => [
            'name' => ['type' => 'string'],
            'arguments' => ['type' => 'object'],
        ],
    ]);
toHaveValidOpenaiFunctionCall()
Assert that the response contains a valid OpenAI function call.
prompt('Call the weather function.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveValidOpenaiFunctionCall();
toHaveValidOpenaiToolsCall()
Assert that the response contains valid OpenAI tool calls.
prompt('Use the available tools.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveValidOpenaiToolsCall();
toHaveToolCallF1()
Assert that the F1 score comparing actual vs expected tool calls is above a threshold.
prompt('Call the weather and time functions.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveToolCallF1(['weather', 'time'], threshold: 0.8);
toHaveFinishReason()
Assert that the model stopped for the expected reason. You can use either a string or the FinishReason enum.
Standard Finish Reasons:
- stop: Natural completion (reached end of response, stop sequence matched)
- length: Token limit reached (max_tokens exceeded, context length reached)
- content_filter: Content filtering triggered due to safety policies
- tool_calls: Model made function/tool calls
use KevinPijning\Prompt\Enums\FinishReason;

// Using string
prompt('Generate a response.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveFinishReason('stop');

// Using enum
prompt('Generate a response.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveFinishReason(FinishReason::Stop);

// Check for tool calls
prompt('Use available tools.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveFinishReason(FinishReason::ToolCalls);
Convenience Methods:
For each finish reason, there's a dedicated convenience method:
// Natural completion
prompt('Generate a response.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveFinishReasonStop();

// Token limit reached
prompt('Generate a very long response.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveFinishReasonLength();

// Content filter triggered
prompt('Generate harmful content.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveFinishReasonContentFilter();

// Tool calls made
prompt('Use available tools.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveFinishReasonToolCalls();
toBeClassified()
Assert that a HuggingFace classifier returns the expected class above a threshold.
// Sentiment analysis
prompt('Write a positive review.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeClassified(
        'huggingface:text-classification:distilbert-base-uncased-finetuned-sst-2-english',
        'POSITIVE',
        threshold: 0.8
    );

// Hate speech detection
prompt('Write a friendly message.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeClassified(
        'huggingface:text-classification:facebook/roberta-hate-speech-dynabench-r4-target',
        'nothate',
        threshold: 0.9
    );
toBeScoredByPi()
Use Pi Labs' preference scoring model as an alternative to LLM-as-a-judge.
prompt('Write a helpful response.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeScoredByPi('Is the response not apologetic and provides a clear, concise answer?', threshold: 0.8);
toBeRefused()
Assert that the LLM output indicates the model refused to perform the requested task.
prompt('Write harmful content.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toBeRefused();

// Ensure model does NOT refuse safe requests
prompt('What is 2+2?')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->not->toBeRefused();
toPassJavascript()
Assert that a custom JavaScript function validates the output.
prompt('Generate a response longer than 10 characters.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toPassJavascript('return output.length > 10;');
toPassPython()
Assert that a custom Python function validates the output.
prompt('Generate a response longer than 10 characters.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toPassPython('return len(output) > 10');
toPassWebhook()
Assert that a webhook returns {pass: true}.
prompt('Generate a response.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toPassWebhook('https://example.com/validate');
toHaveTraceSpanCount()
Assert that trace spans matching patterns meet min/max thresholds.
prompt('Process the request.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveTraceSpanCount(['pattern1', 'pattern2'], min: 1, max: 5);
toHaveTraceSpanDuration()
Assert that trace span durations meet percentile and max duration thresholds.
prompt('Process the request.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->toHaveTraceSpanDuration(['pattern1'], percentile: 0.95, maxDuration: 1000.0);
toHaveTraceErrorSpans()
Detect errors in traces by status codes, attributes, and messages.
prompt('Process the request.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->not->toHaveTraceErrorSpans();
not Modifier
Negate any assertion by using the not modifier.
prompt('Write a happy birthday message.')
    ->usingProvider('openai:gpt-4o-mini')
    ->expect()
    ->not->toContain('condolences');
Provider Configuration
When creating or configuring providers, you can use these methods:
id()
Set the provider identifier (e.g., 'openai:gpt-4', 'anthropic:claude-3').
Provider::create('openai:gpt-4')
    ->id('openai:gpt-4o-mini');
label()
Set a custom label for the provider (useful in test output).
Provider::create('openai:gpt-4')
    ->label('OpenAI GPT-4 Production');
temperature()
Control randomness in responses (0.0 to 1.0). Lower values make responses more deterministic.
Provider::create('openai:gpt-4')
    ->temperature(0.7);
maxTokens()
Set the maximum number of tokens to generate.
Provider::create('openai:gpt-4')
    ->maxTokens(2000);
topP()
Set nucleus sampling parameter (0.0 to 1.0).
Provider::create('openai:gpt-4')
    ->topP(0.9);
frequencyPenalty()
Penalize frequent tokens (-2.0 to 2.0).
Provider::create('openai:gpt-4')
    ->frequencyPenalty(0.5);
presencePenalty()
Penalize new tokens based on presence in text (-2.0 to 2.0).
Provider::create('openai:gpt-4')
    ->presencePenalty(0.3);
stop()
Set stop sequences where generation should stop.
Provider::create('openai:gpt-4')
    ->stop(["\n", 'Human:', 'AI:']); // double quotes so "\n" is an actual newline
config()
Set custom configuration options for the provider.
Provider::create('openai:gpt-4')
    ->config([
        'apiKey' => 'custom-key',
        'baseURL' => 'https://api.example.com',
    ]);
Usage Examples
Basic Example
test('assistant greets user correctly', function () {
    prompt('You are a helpful assistant. Greet {{name}} warmly.')
        ->usingProvider('openai:gpt-4o-mini')
        ->expect(['name' => 'Alice'])
        ->toContain('Alice');
});
Multiple Prompts
Test multiple prompt variations against the same test cases.
test('prompt variations work', function () {
    prompt(
        'You are a helpful assistant.',
        'You are a professional assistant.',
        'You are a friendly assistant.'
    )
        ->usingProvider('openai:gpt-4o-mini')
        ->expect()
        ->toContain('assistant');
});
Multiple Providers
Compare responses across different LLM providers.
test('providers give consistent answers', function () {
    prompt('What is 2+2?')
        ->usingProvider('openai:gpt-4o-mini', 'anthropic:claude-3')
        ->expect()
        ->toContain('4');
});
Multiple Test Cases
Test the same prompt with different variable values.
test('greeting works for different names', function () {
    prompt('Greet {{name}} warmly.')
        ->usingProvider('openai:gpt-4o-mini')
        ->expect(['name' => 'Alice'])
        ->toContain('Alice')
        ->and(['name' => 'Bob'])
        ->toContain('Bob')
        ->and(['name' => 'Charlie'])
        ->toContain('Charlie');
});
Default Test Cases
Use alwaysExpect() to set assertions that apply to all test cases.
test('all translations meet quality standards', function () {
    prompt('Translate {{message}} to {{language}} in the style {{style}}.')
        ->usingProvider('openai:gpt-4o-mini')
        ->alwaysExpect(['style' => 'friendly'])
        ->toBeJudged('the translation is always accurate and natural')
        ->toBeJudged('the response is always in a friendly tone')
        ->expect(['message' => 'Hello', 'language' => 'es'])
        ->toContain('hola')
        ->expect(['message' => 'Goodbye', 'language' => 'fr'])
        ->toContain('au revoir');
});
Provider Configuration
Configure providers with specific parameters.
test('creative writing with high temperature', function () {
    $creativeProvider = Provider::create('openai:gpt-4')
        ->temperature(0.9)
        ->maxTokens(500);

    prompt('Write a creative story about {{topic}}.')
        ->usingProvider($creativeProvider)
        ->expect(['topic' => 'space exploration'])
        ->toContain('space');
});
Global Provider Registration
Register providers once and reuse them across tests.
provider('openai-gpt4')
    ->id('openai:gpt-4')
    ->temperature(0.7)
    ->maxTokens(2000);

test('uses registered provider', function () {
    prompt('Hello')
        ->usingProvider('openai-gpt4')
        ->expect()
        ->toContain('Hi');
});
Advanced Assertions
Combine multiple assertion types.
test('response meets multiple criteria', function () {
    prompt('Generate a user profile as JSON with name, email, and age.')
        ->usingProvider('openai:gpt-4o-mini')
        ->expect()
        ->toContainJson()
        ->toContainAll(['name', 'email', 'age'])
        ->toBeJudged('The JSON should be well-structured and include all required fields.');
});
LLM-Based Evaluation
Use AI to evaluate response quality.
test('response quality meets standards', function () {
    prompt('Explain machine learning to a beginner.')
        ->usingProvider('openai:gpt-4o-mini')
        ->expect()
        ->toBeJudged('The explanation should be clear, accurate, use simple language, and include examples.', threshold: 0.85);
});
Structured JSON Output Testing
Test structured JSON outputs from LLMs, particularly useful with OpenAI's Responses API and structured output features.
// Register a provider with structured output schema
provider('person-extractor', static fn (Provider $provider): Provider => $provider
    ->id('openai:responses:gpt-4o-mini')
    ->config([
        'response_format' => [
            'name' => 'person_info',
            'type' => 'json_schema',
            'strict' => true,
            'schema' => [
                'type' => 'object',
                'properties' => [
                    'name' => ['type' => 'string'],
                    'age' => ['type' => 'number'],
                    'city' => ['type' => 'string'],
                ],
                'required' => ['name', 'age', 'city'],
                'additionalProperties' => false,
            ],
        ],
    ]));

test('extracts person info with full validation', function () {
    prompt('Extract the person info from this text: {{text}}')
        ->describe('Testing structured JSON output')
        ->usingProvider('person-extractor')
        ->expect(['text' => 'John Doe is 30 years old and lives in Amsterdam.'])
        // Validate structure
        ->toMatchJsonStructure(['name', 'age', 'city'])
        // Validate specific values
        ->toHaveJsonFragment(['name' => 'John Doe', 'city' => 'Amsterdam'])
        // Validate types
        ->toHaveJsonType('name', 'string')
        ->toHaveJsonType('age', 'number')
        // Validate exact match
        ->toEqualJson([
            'name' => 'John Doe',
            'age' => 30,
            'city' => 'Amsterdam',
        ]);
});

// Testing array outputs with nested structures
provider('people-extractor', static fn (Provider $provider): Provider => $provider
    ->id('openai:responses:gpt-4o-mini')
    ->config([
        'response_format' => [
            'name' => 'people_list',
            'type' => 'json_schema',
            'strict' => true,
            'schema' => [
                'type' => 'object',
                'properties' => [
                    'people' => [
                        'type' => 'array',
                        'items' => [
                            'type' => 'object',
                            'properties' => [
                                'name' => ['type' => 'string'],
                                'role' => ['type' => 'string'],
                            ],
                            'required' => ['name', 'role'],
                        ],
                    ],
                ],
                'required' => ['people'],
            ],
        ],
    ]));

test('extracts multiple people with array validation', function () {
    prompt('Extract all people from: {{text}}')
        ->usingProvider('people-extractor')
        ->expect(['text' => 'The team has Mike (developer) and Sarah (designer).'])
        // Validate array structure with wildcard
        ->toMatchJsonStructure([
            'people' => [
                '*' => ['name', 'role'],
            ],
        ])
        // Validate array item access
        ->toHaveJsonPath('people.0.name')
        ->toHaveJsonPath('people.1.name')
        // Validate all items have specific type
        ->toHaveJsonType('people', 'array')
        ->toHaveJsonType('people.*.name', 'string')
        ->toHaveJsonType('people.*.role', 'string');
});
Complex Example
A comprehensive example showing multiple features together.
// Register global providers
provider('support-gpt4')
    ->id('openai:gpt-4')
    ->temperature(0.3);

provider('support-claude')
    ->id('anthropic:claude-3')
    ->temperature(0.3);

test('customer service prompt evaluation', function () {
    // Test multiple prompts across multiple providers
    prompt(
        'You are a customer support agent. Help the customer with: {{issue}}',
        'As a support agent, assist with: {{issue}}'
    )
        ->describe('Customer service prompt evaluation')
        ->usingProvider('support-gpt4', 'support-claude')
        ->expect(['issue' => 'refund request'])
        ->toContainAll(['refund', 'help'], strict: false)
        ->toBeJudged('Response should be professional, empathetic, and helpful.', threshold: 0.8)
        ->and(['issue' => 'product question'])
        ->toContainAny(['product', 'feature', 'specification'])
        ->toBeJudged('Response should accurately answer the product question.');
});
CLI Options
--output
Save promptfoo evaluation results to a directory. Useful for debugging and analysis.
# Use default output directory (prompt-tests-output/)
vendor/bin/pest --output

# Specify custom output directory
vendor/bin/pest --output=my-results/

# Alternative syntax
vendor/bin/pest --output my-results/
The output directory will contain HTML reports and JSON data from promptfoo evaluations.
Credits & License
Created by: Kevin Pijning
Built on the shoulders of giants:
- Pest - The elegant PHP testing framework
- promptfoo - LLM evaluation framework
- Symfony Components - Process and YAML handling
License: MIT License
See the LICENSE file for full details.
Ready to start testing your prompts? Install the plugin and write your first test in under a minute. Happy testing!