@mastra/core
Version:
Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
274 lines (195 loc) • 7.89 kB
Markdown
# Running experiments
**Added in:** `@mastra/core@1.4.0`
An experiment runs every item in a dataset through a target (an agent, a workflow, or a scorer) and then optionally scores the outputs. Use a scorer as the target when you want to evaluate an LLM judge itself. Results are persisted to storage so you can compare runs across different prompts, models, or code changes.
## Basic experiment
Call [`startExperiment()`](https://mastra.ai/reference/datasets/startExperiment) with a target and scorers:
```typescript
import { mastra } from '../index'
const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' })
const summary = await dataset.startExperiment({
name: 'gpt-5.1-baseline',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy', 'fluency'],
})
console.log(summary.status) // 'completed' | 'failed'
console.log(summary.succeededCount) // number of items that ran successfully
console.log(summary.failedCount) // number of items that failed
```
`startExperiment()` blocks until all items finish. For fire-and-forget execution, see [async experiments](#async-experiments).
## Studio
You can also run experiments in [Studio](https://mastra.ai/docs/studio/overview). After you've added a dataset item, open it and select **Run Experiment** and configure the target, scorers, and options.
After running an experiment, the **Experiments** tab shows all runs for that dataset (with status, counts, and timestamps). Select an experiment to see per-item results, scores, and execution traces.
In the **Experiments** tab, select **Compare** and choose two or more experiments to compare their scores and results side by side.
## Experiment targets
You can point an experiment at a registered agent, workflow, or scorer.
### Registered agent
Point to an agent registered on your Mastra instance:
```typescript
const summary = await dataset.startExperiment({
name: 'agent-v2-eval',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy'],
})
```
Each item's `input` is passed directly to `agent.generate()`, so it must be a `string`, `string[]`, or `CoreMessage[]`.
### Registered workflow
Point to a workflow registered on your Mastra instance:
```typescript
const summary = await dataset.startExperiment({
name: 'workflow-eval',
targetType: 'workflow',
targetId: 'translation-workflow',
scorers: ['accuracy'],
})
```
The workflow receives each item's `input` as its trigger data.
### Registered scorer
Point to a scorer to evaluate an LLM judge against ground truth:
```typescript
const summary = await dataset.startExperiment({
name: 'judge-accuracy-eval',
targetType: 'scorer',
targetId: 'accuracy',
})
```
The scorer receives each item's `input` and `groundTruth`. LLM-based judges can drift over time as underlying models change, so it's important to periodically realign them against known-good labels. A dataset gives you a stable benchmark to detect that drift.
## Scoring results
Scorers automatically run after each item's target execution. Pass scorer instances or registered scorer IDs:
**Scorer IDs**:
```typescript
// Reference scorers registered on the Mastra instance
const summary = await dataset.startExperiment({
name: 'with-registered-scorers',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy', 'fluency'],
})
```
**Scorer instances**:
```typescript
import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt'
const relevancy = createAnswerRelevancyScorer({ model: 'openai/gpt-5-mini' })
const summary = await dataset.startExperiment({
name: 'with-scorer-instances',
targetType: 'agent',
targetId: 'translation-agent',
scorers: [relevancy],
})
```
Each item's results include per-scorer scores:
```typescript
for (const item of summary.results) {
console.log(item.itemId, item.output)
for (const score of item.scores) {
console.log(` ${score.scorerName}: ${score.score} — ${score.reason}`)
}
}
```
> **Info:** Visit the [Scorers overview](https://mastra.ai/docs/evals/overview) for details on available and custom scorers.
## Async experiments
`startExperiment()` blocks until every item completes. For long-running datasets, use [`startExperimentAsync()`](https://mastra.ai/reference/datasets/startExperimentAsync) to start the experiment in the background:
```typescript
const { experimentId, status } = await dataset.startExperimentAsync({
name: 'large-dataset-run',
targetType: 'agent',
targetId: 'translation-agent',
scorers: ['accuracy'],
})
console.log(experimentId) // UUID
console.log(status) // 'pending'
```
Poll for completion using [`getExperiment()`](https://mastra.ai/reference/datasets/getExperiment):
```typescript
let experiment = await dataset.getExperiment({ experimentId })
while (experiment.status === 'pending' || experiment.status === 'running') {
await new Promise(resolve => setTimeout(resolve, 5000))
experiment = await dataset.getExperiment({ experimentId })
}
console.log(experiment.status) // 'completed' | 'failed'
```
## Configuration options
### Concurrency
Control how many items run in parallel (default: 5):
```typescript
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
maxConcurrency: 10,
})
```
### Timeouts and retries
Set a per-item timeout (in milliseconds) and retry count:
```typescript
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
itemTimeout: 30_000, // 30 seconds per item
maxRetries: 2, // retry failed items up to 2 times
})
```
Retries use exponential backoff. Abort errors are never retried.
### Aborting an experiment
Pass an `AbortSignal` to cancel a running experiment:
```typescript
const controller = new AbortController()
// Cancel after 60 seconds
setTimeout(() => controller.abort(), 60_000)
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
signal: controller.signal,
})
```
Remaining items are marked as skipped in the summary.
### Pinning a dataset version
Run against a specific snapshot of the dataset:
```typescript
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'translation-agent',
version: 3, // use items from dataset version 3
})
```
## Viewing results
### Listing experiments
```typescript
const { experiments, pagination } = await dataset.listExperiments({
page: 0,
perPage: 10,
})
for (const exp of experiments) {
console.log(`${exp.name} — ${exp.status} (${exp.succeededCount}/${exp.totalItems})`)
}
```
### Experiment details
```typescript
const experiment = await dataset.getExperiment({
experimentId: 'exp-abc-123',
})
console.log(experiment.status)
console.log(experiment.startedAt)
console.log(experiment.completedAt)
```
### Item-level results
```typescript
const { results, pagination } = await dataset.listExperimentResults({
experimentId: 'exp-abc-123',
page: 0,
perPage: 50,
})
for (const result of results) {
console.log(result.itemId, result.output, result.error)
}
```
## Understanding the summary
`startExperiment()` returns an `ExperimentSummary` with counts and per-item results:
- `completedWithErrors` is `true` when the experiment finished but some items failed.
- Items cancelled via `signal` appear in `skippedCount`.
> **Info:** Visit the [`startExperiment` reference](https://mastra.ai/reference/datasets/startExperiment) for the full parameter and return type documentation.
## Related
- [Datasets overview](https://mastra.ai/docs/evals/datasets/overview)
- [Scorers overview](https://mastra.ai/docs/evals/overview)
- [`startExperiment` reference](https://mastra.ai/reference/datasets/startExperiment)
- [`listExperimentResults` reference](https://mastra.ai/reference/datasets/listExperimentResults)