UNPKG

@mastra/core

Version:

Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.

254 lines (197 loc) 9.26 kB
# runEvals The `runEvals` function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. This is essential for systematic testing, performance analysis, and validation of AI systems. ## Usage example ```typescript import { runEvals } from '@mastra/core/evals' import { myAgent } from './agents/my-agent' import { myScorer1, myScorer2 } from './scorers' const result = await runEvals({ target: myAgent, data: [ { input: 'What is machine learning?' }, { input: 'Explain neural networks' }, { input: 'How does AI work?' }, ], scorers: [myScorer1, myScorer2], targetOptions: { maxSteps: 5 }, concurrency: 2, onItemComplete: ({ item, targetResult, scorerResults }) => { console.log(`Completed: ${item.input}`) console.log(`Scores:`, scorerResults) }, }) console.log(`Average scores:`, result.scores) console.log(`Processed ${result.summary.totalItems} items`) ``` ## Parameters **target** (`Agent | Workflow`): The agent or workflow to evaluate. **data** (`RunEvalsDataItem[]`): Array of test cases with input data and optional ground truth. **scorers** (`MastraScorer[] | AgentScorerConfig | WorkflowScorerConfig`): Scorers to use. A flat array applies all scorers to the raw output. For agents, an \`AgentScorerConfig\` object separates agent-level and trajectory scorers. For workflows, a \`WorkflowScorerConfig\` object specifies scorers for the workflow, individual steps, and trajectory. **targetOptions** (`AgentExecutionOptions | WorkflowRunOptions`): Options forwarded to the target during execution. For agents: options passed to agent.generate() (e.g. maxSteps, modelSettings, instructions). For workflows: options passed to run.start() (e.g. perStep, outputOptions, initialState). **concurrency** (`number`): Number of test cases to run concurrently. (Default: `1`) **onItemComplete** (`function`): Callback function called after each test case completes. Receives item, target result, and scorer results. ## Data item structure **input** (`string | string[] | CoreMessage[] | any`): Input data for the target. For agents: messages or strings. For workflows: workflow input data. **groundTruth** (`any`): Expected or reference output for comparison during scoring. **expectedTrajectory** (`TrajectoryExpectation`): Expected trajectory configuration for trajectory scoring. Includes expected steps, ordering, efficiency budgets, blacklists, and tool failure tolerance. Passed to trajectory scorers as \`run.expectedTrajectory\`. Overrides the static defaults in scorer constructors. **requestContext** (`RequestContext`): Request Context to pass to the target during execution. **tracingContext** (`TracingContext`): Tracing context for observability and debugging. **startOptions** (`WorkflowRunOptions`): Per-item workflow run options (e.g. initialState, perStep, outputOptions). Merged on top of targetOptions, so per-item values take precedence. Only applicable when the target is a workflow. ## Agent scorer configuration For agents, use `AgentScorerConfig` to separate agent-level scorers from trajectory scorers: **agent** (`MastraScorer[]`): Scorers that receive the raw agent output (MastraDBMessage\[]). Use for evaluating response quality, content, etc. **trajectory** (`MastraScorer[]`): Scorers that receive a pre-extracted Trajectory object. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested tool calls and model generations). Otherwise, it falls back to extracting tool calls from agent messages. ## Workflow scorer configuration For workflows, use `WorkflowScorerConfig` to specify scorers at different levels: **workflow** (`MastraScorer[]`): Scorers to evaluate the entire workflow output. **steps** (`Record<string, MastraScorer[]>`): Object mapping step IDs to arrays of scorers for evaluating individual step outputs. **trajectory** (`MastraScorer[]`): Scorers that receive a pre-extracted Trajectory from the workflow execution. When storage is configured, the pipeline extracts a hierarchical trajectory from observability traces (including nested agent runs and tool calls within workflow steps). Otherwise, it falls back to extracting step results from the workflow output. ## Returns **scores** (`Record<string, any>`): Average scores across all test cases, organized by scorer name. **summary** (`object`): Summary information about the experiment execution. **summary.totalItems** (`number`): Total number of test cases processed. ## Examples ### Agent Evaluation ```typescript import { createScorer, runEvals } from '@mastra/core/evals' const myScorer = createScorer({ id: 'my-scorer', description: "Check if Agent's response contains ground truth", type: 'agent', }).generateScore(({ run }) => { const response = run.output[0]?.content || '' const expectedResponse = run.groundTruth return response.includes(expectedResponse) ? 1 : 0 }) const result = await runEvals({ target: chatAgent, data: [ { input: 'What is AI?', groundTruth: 'AI is a field of computer science that creates intelligent machines.', }, { input: 'How does machine learning work?', groundTruth: 'Machine learning uses algorithms to learn patterns from data.', }, ], scorers: [relevancyScorer], concurrency: 3, }) ``` ### Agent trajectory evaluation Use `AgentScorerConfig` to evaluate both the agent response and its tool-calling trajectory: ```typescript import { runEvals } from '@mastra/core/evals' import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/code/trajectory' const trajectoryScorer = createTrajectoryAccuracyScorerCode() const result = await runEvals({ target: chatAgent, data: [ { input: 'What is the weather in London?', expectedTrajectory: { steps: [{ stepType: 'tool_call', name: 'weatherTool' }], }, }, ], scorers: { // agent: [responseQualityScorer], // Optional: add agent-level scorers trajectory: [trajectoryScorer], }, }) // result.scores.agent — average agent-level scores // result.scores.trajectory — average trajectory scores ``` ### Agent with `targetOptions` Pass execution options like `maxSteps` or `modelSettings` to customize agent behavior during evaluation: ```typescript const result = await runEvals({ target: chatAgent, data: [{ input: 'Summarize this article' }, { input: 'Translate to French' }], scorers: [relevancyScorer], targetOptions: { maxSteps: 5, modelSettings: { temperature: 0 }, }, }) ``` ### Workflow Evaluation ```typescript const workflowResult = await runEvals({ target: myWorkflow, data: [ { input: { query: 'Process this data', priority: 'high' } }, { input: { query: 'Another task', priority: 'low' } }, ], scorers: { workflow: [outputQualityScorer], steps: { 'validation-step': [validationScorer], 'processing-step': [processingScorer], }, }, onItemComplete: ({ item, targetResult, scorerResults }) => { console.log(`Workflow completed for: ${item.inputData.query}`) if (scorerResults.workflow) { console.log('Workflow scores:', scorerResults.workflow) } if (scorerResults.steps) { console.log('Step scores:', scorerResults.steps) } }, }) ``` ### Workflow trajectory evaluation Add trajectory scoring to workflow evaluations to validate step execution order: ```typescript const workflowResult = await runEvals({ target: myWorkflow, data: [ { input: { query: 'Process this data' }, expectedTrajectory: { steps: [ { stepType: 'workflow_step', name: 'validate' }, { stepType: 'workflow_step', name: 'process' }, { stepType: 'workflow_step', name: 'output' }, ], }, }, ], scorers: { workflow: [outputQualityScorer], steps: { validate: [validationScorer], }, trajectory: [trajectoryScorer], }, }) // result.scores.trajectory — workflow trajectory scores ``` ### Workflow with per-item `startOptions` Use `startOptions` on individual data items to customize each workflow run. Per-item values take precedence over `targetOptions`: ```typescript const result = await runEvals({ target: myWorkflow, data: [ { input: { query: 'hello' }, startOptions: { initialState: { counter: 1 } }, }, { input: { query: 'world' }, startOptions: { initialState: { counter: 2 } }, }, ], scorers: [outputQualityScorer], targetOptions: { perStep: true }, }) ``` ## Related - [createScorer()](https://mastra.ai/reference/evals/create-scorer): Create custom scorers for experiments - [MastraScorer](https://mastra.ai/reference/evals/mastra-scorer): Learn about scorer structure and methods - [Trajectory Accuracy](https://mastra.ai/reference/evals/trajectory-accuracy): Built-in trajectory evaluation scorers - [Scorer Utilities](https://mastra.ai/reference/evals/scorer-utils): Helper functions for extracting trajectory data - [Custom Scorers](https://mastra.ai/docs/evals/custom-scorers): Guide to building evaluation logic - [Scorers Overview](https://mastra.ai/docs/evals/overview): Understanding scorer concepts