@mastra/core

# Running experiments **Added in:** `@mastra/core@1.4.0` An experiment runs every item in a dataset through a target (an agent, a workflow, or a scorer) and then optionally scores the outputs. Use a scorer as the target when you want to evaluate an LLM judge itself. Results are persisted to storage so you can compare runs across different prompts, models, or code changes. ## Basic experiment Call [`startExperiment()`](https://mastra.ai/reference/datasets/startExperiment) with a target and scorers: ```typescript import { mastra } from '../index' const dataset = await mastra.datasets.get({ id: 'translation-dataset-id' }) const summary = await dataset.startExperiment({ name: 'gpt-5.1-baseline', targetType: 'agent', targetId: 'translation-agent', scorers: ['accuracy', 'fluency'], }) console.log(summary.status) // 'completed' | 'failed' console.log(summary.succeededCount) // number of items that ran successfully console.log(summary.failedCount) // number of items that failed ``` `startExperiment()` blocks until all items finish. For fire-and-forget execution, see [async experiments](#async-experiments). ## Studio You can also run experiments in [Studio](https://mastra.ai/docs/studio/overview). After you've added a dataset item, open it and select **Run Experiment** and configure the target, scorers, and options. After running an experiment, the **Experiments** tab shows all runs for that dataset (with status, counts, and timestamps). Select an experiment to see per-item results, scores, and execution traces. In the **Experiments** tab, select **Compare** and choose two or more experiments to compare their scores and results side by side. ## Experiment targets You can point an experiment at a registered agent, workflow, or scorer. ### Registered agent Point to an agent registered on your Mastra instance: ```typescript const summary = await dataset.startExperiment({ name: 'agent-v2-eval', targetType: 'agent', targetId: 'translation-agent', scorers: ['accuracy'], }) ``` Each item's `input` is passed directly to `agent.generate()`, so it must be a `string`, `string[]`, or `CoreMessage[]`. ### Registered workflow Point to a workflow registered on your Mastra instance: ```typescript const summary = await dataset.startExperiment({ name: 'workflow-eval', targetType: 'workflow', targetId: 'translation-workflow', scorers: ['accuracy'], }) ``` The workflow receives each item's `input` as its trigger data. ### Registered scorer Point to a scorer to evaluate an LLM judge against ground truth: ```typescript const summary = await dataset.startExperiment({ name: 'judge-accuracy-eval', targetType: 'scorer', targetId: 'accuracy', }) ``` The scorer receives each item's `input` and `groundTruth`. LLM-based judges can drift over time as underlying models change, so it's important to periodically realign them against known-good labels. A dataset gives you a stable benchmark to detect that drift. ## Scoring results Scorers automatically run after each item's target execution. Pass scorer instances or registered scorer IDs: **Scorer IDs**: ```typescript // Reference scorers registered on the Mastra instance const summary = await dataset.startExperiment({ name: 'with-registered-scorers', targetType: 'agent', targetId: 'translation-agent', scorers: ['accuracy', 'fluency'], }) ``` **Scorer instances**: ```typescript import { createAnswerRelevancyScorer } from '@mastra/evals/scorers/prebuilt' const relevancy = createAnswerRelevancyScorer({ model: 'openai/gpt-5-mini' }) const summary = await dataset.startExperiment({ name: 'with-scorer-instances', targetType: 'agent', targetId: 'translation-agent', scorers: [relevancy], }) ``` Each item's results include per-scorer scores: ```typescript for (const item of summary.results) { console.log(item.itemId, item.output) for (const score of item.scores) { console.log(` ${score.scorerName}: ${score.score} — ${score.reason}`) } } ``` > **Info:** Visit the [Scorers overview](https://mastra.ai/docs/evals/overview) for details on available and custom scorers. ## Async experiments `startExperiment()` blocks until every item completes. For long-running datasets, use [`startExperimentAsync()`](https://mastra.ai/reference/datasets/startExperimentAsync) to start the experiment in the background: ```typescript const { experimentId, status } = await dataset.startExperimentAsync({ name: 'large-dataset-run', targetType: 'agent', targetId: 'translation-agent', scorers: ['accuracy'], }) console.log(experimentId) // UUID console.log(status) // 'pending' ``` Poll for completion using [`getExperiment()`](https://mastra.ai/reference/datasets/getExperiment): ```typescript let experiment = await dataset.getExperiment({ experimentId }) while (experiment.status === 'pending' || experiment.status === 'running') { await new Promise(resolve => setTimeout(resolve, 5000)) experiment = await dataset.getExperiment({ experimentId }) } console.log(experiment.status) // 'completed' | 'failed' ``` ## Configuration options ### Concurrency Control how many items run in parallel (default: 5): ```typescript const summary = await dataset.startExperiment({ targetType: 'agent', targetId: 'translation-agent', maxConcurrency: 10, }) ``` ### Timeouts and retries Set a per-item timeout (in milliseconds) and retry count: ```typescript const summary = await dataset.startExperiment({ targetType: 'agent', targetId: 'translation-agent', itemTimeout: 30_000, // 30 seconds per item maxRetries: 2, // retry failed items up to 2 times }) ``` Retries use exponential backoff. Abort errors are never retried. ### Aborting an experiment Pass an `AbortSignal` to cancel a running experiment: ```typescript const controller = new AbortController() // Cancel after 60 seconds setTimeout(() => controller.abort(), 60_000) const summary = await dataset.startExperiment({ targetType: 'agent', targetId: 'translation-agent', signal: controller.signal, }) ``` Remaining items are marked as skipped in the summary. ### Pinning a dataset version Run against a specific snapshot of the dataset: ```typescript const summary = await dataset.startExperiment({ targetType: 'agent', targetId: 'translation-agent', version: 3, // use items from dataset version 3 }) ``` ## Viewing results ### Listing experiments ```typescript const { experiments, pagination } = await dataset.listExperiments({ page: 0, perPage: 10, }) for (const exp of experiments) { console.log(`${exp.name} — ${exp.status} (${exp.succeededCount}/${exp.totalItems})`) } ``` ### Experiment details ```typescript const experiment = await dataset.getExperiment({ experimentId: 'exp-abc-123', }) console.log(experiment.status) console.log(experiment.startedAt) console.log(experiment.completedAt) ``` ### Item-level results ```typescript const { results, pagination } = await dataset.listExperimentResults({ experimentId: 'exp-abc-123', page: 0, perPage: 50, }) for (const result of results) { console.log(result.itemId, result.output, result.error) } ``` ## Understanding the summary `startExperiment()` returns an `ExperimentSummary` with counts and per-item results: - `completedWithErrors` is `true` when the experiment finished but some items failed. - Items cancelled via `signal` appear in `skippedCount`. > **Info:** Visit the [`startExperiment` reference](https://mastra.ai/reference/datasets/startExperiment) for the full parameter and return type documentation. ## Related - [Datasets overview](https://mastra.ai/docs/evals/datasets/overview) - [Scorers overview](https://mastra.ai/docs/evals/overview) - [`startExperiment` reference](https://mastra.ai/reference/datasets/startExperiment) - [`listExperimentResults` reference](https://mastra.ai/reference/datasets/listExperimentResults)