@mastra/core
Version:
Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
254 lines (197 loc) • 9.26 kB
Markdown
# runEvals
The `runEvals` function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. This is essential for systematic testing, performance analysis, and validation of AI systems.
## Usage example
```typescript
import { runEvals } from '@mastra/core/evals'
import { myAgent } from './agents/my-agent'
import { myScorer1, myScorer2 } from './scorers'
const result = await runEvals({
target: myAgent,
data: [
{ input: 'What is machine learning?' },
{ input: 'Explain neural networks' },
{ input: 'How does AI work?' },
],
scorers: [myScorer1, myScorer2],
targetOptions: { maxSteps: 5 },
concurrency: 2,
onItemComplete: ({ item, targetResult, scorerResults }) => {
console.log(`Completed: ${item.input}`)
console.log(`Scores:`, scorerResults)
},
})
console.log(`Average scores:`, result.scores)
console.log(`Processed ${result.summary.totalItems} items`)
```
## Parameters
**target** (`Agent | Workflow`): The agent or workflow to evaluate.
**data** (`RunEvalsDataItem[]`): Array of test cases with input data and optional ground truth.
**scorers** (`MastraScorer[] | AgentScorerConfig | WorkflowScorerConfig`): Scorers to use. A flat array applies all scorers to the raw output. For agents, an \`AgentScorerConfig\` object separates agent-level and trajectory scorers. For workflows, a \`WorkflowScorerConfig\` object specifies scorers for the workflow, individual steps, and trajectory.
**targetOptions** (`AgentExecutionOptions | WorkflowRunOptions`): Options forwarded to the target during execution. For agents: options passed to agent.generate() (e.g. maxSteps, modelSettings, instructions). For workflows: options passed to run.start() (e.g. perStep, outputOptions, initialState).
**concurrency** (`number`): Number of test cases to run concurrently. (Default: `1`)
**onItemComplete** (`function`): Callback function called after each test case completes. Receives item, target result, and scorer results.
## Data item structure
**input** (`string | string[] | CoreMessage[] | any`): Input data for the target. For agents: messages or strings. For workflows: workflow input data.
**groundTruth** (`any`): Expected or reference output for comparison during scoring.
**expectedTrajectory** (`TrajectoryExpectation`): Expected trajectory configuration for trajectory scoring. Includes expected steps, ordering, efficiency budgets, blacklists, and tool failure tolerance. Passed to trajectory scorers as \`run.expectedTrajectory\`. Overrides the static defaults in scorer constructors.
**requestContext** (`RequestContext`): Request Context to pass to the target during execution.
**tracingContext** (`TracingContext`): Tracing context for observability and debugging.
**startOptions** (`WorkflowRunOptions`): Per-item workflow run options (e.g. initialState, perStep, outputOptions). Merged on top of targetOptions, so per-item values take precedence. Only applicable when the target is a workflow.
## Agent scorer configuration
For agents, use `AgentScorerConfig` to separate agent-level scorers from trajectory scorers:
**agent** (`MastraScorer[]`): Scorers that receive the raw agent output (MastraDBMessage\[]). Use for evaluating response quality, content, etc.
**trajectory** (`MastraScorer[]`): Scorers that receive a pre-extracted Trajectory object. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested tool calls and model generations). Otherwise, it falls back to extracting tool calls from agent messages.
## Workflow scorer configuration
For workflows, use `WorkflowScorerConfig` to specify scorers at different levels:
**workflow** (`MastraScorer[]`): Scorers to evaluate the entire workflow output.
**steps** (`Record<string, MastraScorer[]>`): Object mapping step IDs to arrays of scorers for evaluating individual step outputs.
**trajectory** (`MastraScorer[]`): Scorers that receive a pre-extracted Trajectory from the workflow execution. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested agent runs and tool calls within workflow steps). Otherwise, it falls back to extracting step results from the workflow output.
## Returns
**scores** (`Record<string, any>`): Average scores across all test cases, organized by scorer name.
**summary** (`object`): Summary information about the experiment execution.
**summary.totalItems** (`number`): Total number of test cases processed.
## Examples
### Agent Evaluation
```typescript
import { createScorer, runEvals } from '@mastra/core/evals'
const myScorer = createScorer({
id: 'my-scorer',
description: "Check if Agent's response contains ground truth",
type: 'agent',
}).generateScore(({ run }) => {
const response = run.output[0]?.content || ''
const expectedResponse = run.groundTruth
return response.includes(expectedResponse) ? 1 : 0
})
const result = await runEvals({
target: chatAgent,
data: [
{
input: 'What is AI?',
groundTruth: 'AI is a field of computer science that creates intelligent machines.',
},
{
input: 'How does machine learning work?',
groundTruth: 'Machine learning uses algorithms to learn patterns from data.',
},
],
scorers: [relevancyScorer],
concurrency: 3,
})
```
### Agent trajectory evaluation
Use `AgentScorerConfig` to evaluate both the agent response and its tool-calling trajectory:
```typescript
import { runEvals } from '@mastra/core/evals'
import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/code/trajectory'
const trajectoryScorer = createTrajectoryAccuracyScorerCode()
const result = await runEvals({
target: chatAgent,
data: [
{
input: 'What is the weather in London?',
expectedTrajectory: {
steps: [{ stepType: 'tool_call', name: 'weatherTool' }],
},
},
],
scorers: {
// agent: [responseQualityScorer], // Optional: add agent-level scorers
trajectory: [trajectoryScorer],
},
})
// result.scores.agent — average agent-level scores
// result.scores.trajectory — average trajectory scores
```
### Agent with `targetOptions`
Pass execution options like `maxSteps` or `modelSettings` to customize agent behavior during evaluation:
```typescript
const result = await runEvals({
target: chatAgent,
data: [{ input: 'Summarize this article' }, { input: 'Translate to French' }],
scorers: [relevancyScorer],
targetOptions: {
maxSteps: 5,
modelSettings: { temperature: 0 },
},
})
```
### Workflow Evaluation
```typescript
const workflowResult = await runEvals({
target: myWorkflow,
data: [
{ input: { query: 'Process this data', priority: 'high' } },
{ input: { query: 'Another task', priority: 'low' } },
],
scorers: {
workflow: [outputQualityScorer],
steps: {
'validation-step': [validationScorer],
'processing-step': [processingScorer],
},
},
onItemComplete: ({ item, targetResult, scorerResults }) => {
console.log(`Workflow completed for: ${item.inputData.query}`)
if (scorerResults.workflow) {
console.log('Workflow scores:', scorerResults.workflow)
}
if (scorerResults.steps) {
console.log('Step scores:', scorerResults.steps)
}
},
})
```
### Workflow trajectory evaluation
Add trajectory scoring to workflow evaluations to validate step execution order:
```typescript
const workflowResult = await runEvals({
target: myWorkflow,
data: [
{
input: { query: 'Process this data' },
expectedTrajectory: {
steps: [
{ stepType: 'workflow_step', name: 'validate' },
{ stepType: 'workflow_step', name: 'process' },
{ stepType: 'workflow_step', name: 'output' },
],
},
},
],
scorers: {
workflow: [outputQualityScorer],
steps: {
validate: [validationScorer],
},
trajectory: [trajectoryScorer],
},
})
// result.scores.trajectory — workflow trajectory scores
```
### Workflow with per-item `startOptions`
Use `startOptions` on individual data items to customize each workflow run. Per-item values take precedence over `targetOptions`:
```typescript
const result = await runEvals({
target: myWorkflow,
data: [
{
input: { query: 'hello' },
startOptions: { initialState: { counter: 1 } },
},
{
input: { query: 'world' },
startOptions: { initialState: { counter: 2 } },
},
],
scorers: [outputQualityScorer],
targetOptions: { perStep: true },
})
```
## Related
- [createScorer()](https://mastra.ai/reference/evals/create-scorer): Create custom scorers for experiments
- [MastraScorer](https://mastra.ai/reference/evals/mastra-scorer): Learn about scorer structure and methods
- [Trajectory Accuracy](https://mastra.ai/reference/evals/trajectory-accuracy): Built-in trajectory evaluation scorers
- [Scorer Utilities](https://mastra.ai/reference/evals/scorer-utils): Helper functions for extracting trajectory data
- [Custom Scorers](https://mastra.ai/docs/evals/custom-scorers): Guide to building evaluation logic
- [Scorers Overview](https://mastra.ai/docs/evals/overview): Understanding scorer concepts