@mastra/core
Version:
Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
138 lines (87 loc) • 5.42 kB
Markdown
# dataset.startExperiment()
**Added in:** `@mastra/core@1.4.0`
Runs an experiment on the dataset and waits for completion. Executes all items against a target (agent, workflow, or scorer) with optional scoring.
## Usage example
```typescript
import { Mastra } from '@mastra/core'
const mastra = new Mastra({
/* storage config */
})
const dataset = await mastra.datasets.get({ id: 'dataset-id' })
// Run against a registered agent with a flat scorer list
const summary = await dataset.startExperiment({
targetType: 'agent',
targetId: 'my-agent',
scorers: ['accuracy', 'relevancy'],
maxConcurrency: 10,
})
// Or pass the same categorised shape accepted by runEvals
const summary2 = await dataset.startExperiment({
targetType: 'agent',
targetId: 'my-agent',
scorers: {
agent: [accuracyScorer],
trajectory: [toolOrderScorer],
},
})
// For workflow targets, score individual steps with their own scorers
const summary3 = await dataset.startExperiment({
targetType: 'workflow',
targetId: 'my-workflow',
scorers: {
workflow: [overallScorer],
steps: {
'fetch-data': [fetchScorer],
transform: [transformScorer],
},
trajectory: [executionPathScorer],
},
})
console.log(`${summary.succeededCount}/${summary.totalItems} succeeded`)
console.log(`Status: ${summary.status}`)
console.log(`${summary2.succeededCount}/${summary2.totalItems} succeeded`)
console.log(`Status: ${summary2.status}`)
```
## Parameters
**targetType** (`'agent' | 'workflow' | 'scorer'`): Type of registered target to run items against. Use with \`targetId\`.
**targetId** (`string`): ID of the registered target. Use with \`targetType\`.
**scorers** (`(MastraScorer | string)[] | AgentScorerConfig | WorkflowScorerConfig`): Scorers to evaluate each result. Accepts a flat array of \`MastraScorer\` instances or registered scorer IDs, or the same categorised config shape used by \`runEvals\` (\`AgentScorerConfig\` / \`WorkflowScorerConfig\`). Trajectory scorers (\`type: "trajectory"\`) automatically receive a pre-extracted \`Trajectory\` as their output regardless of which form is used. For workflow targets, per-step scorers can be passed via \`scorers: { steps: { stepId: \[...] } }\` and run against each step's output; their results carry the originating \`stepId\` and keep \`targetScope: "span"\` (matching \`runEvals\`).
**name** (`string`): Display name for the experiment.
**description** (`string`): Description of the experiment.
**metadata** (`Record<string, unknown>`): Arbitrary metadata for the experiment.
**version** (`number`): Pin to a specific dataset version. Defaults to the latest version.
**maxConcurrency** (`number`): Maximum concurrent item executions. Defaults to \`5\`.
**signal** (`AbortSignal`): AbortSignal for cancelling the experiment.
**itemTimeout** (`number`): Per-item execution timeout in milliseconds.
**maxRetries** (`number`): Maximum retries per item on failure. Defaults to \`0\` (no retries). Abort errors are never retried.
## Returns
**result** (`Promise<ExperimentSummary>`): Summary of the completed experiment.
**result.experimentId** (`string`): Unique ID of the experiment.
**result.status** (`'pending' | 'running' | 'completed' | 'failed'`): Final status of the experiment.
**result.totalItems** (`number`): Total number of items in the dataset.
**result.succeededCount** (`number`): Number of items that succeeded.
**result.failedCount** (`number`): Number of items that failed.
**result.skippedCount** (`number`): Number of items skipped (e.g., due to abort).
**result.completedWithErrors** (`boolean`): \`true\` if the run completed but some items failed.
**result.startedAt** (`Date`): When the experiment started.
**result.completedAt** (`Date`): When the experiment completed.
**result.results** (`ItemWithScores[]`): All item results with their scores.
**result.results.itemId** (`string`): ID of the dataset item.
**result.results.itemVersion** (`number`): Dataset version of the item when executed.
**result.results.input** (`unknown`): Input data passed to the target.
**result.results.output** (`unknown | null`): Output from the target, or \`null\` if failed.
**result.results.groundTruth** (`unknown | null`): Expected output from the dataset item.
**result.results.error** (`{ message: string; stack?: string; code?: string } | null`): Structured error if execution failed.
**result.results.startedAt** (`Date`): When item execution started.
**result.results.completedAt** (`Date`): When item execution completed.
**result.results.retryCount** (`number`): Number of retry attempts.
**result.results.scores** (`ScorerResult[]`): Results from all scorers for this item.
**result.results.scores.scorerId** (`string`): ID of the scorer.
**result.results.scores.scorerName** (`string`): Display name of the scorer.
**result.results.scores.score** (`number | null`): Computed score, or \`null\` if the scorer failed.
**result.results.scores.reason** (`string | null`): Reason/explanation for the score.
**result.results.scores.error** (`string | null`): Error message if the scorer failed.
## Related
- [dataset.startExperimentAsync()](https://mastra.ai/reference/datasets/startExperimentAsync)
- [dataset.listExperiments()](https://mastra.ai/reference/datasets/listExperiments)
- [DatasetsManager.compareExperiments()](https://mastra.ai/reference/datasets/compareExperiments)