UNPKG

@mastra/core

Version:

Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.

138 lines (87 loc) 5.42 kB
# dataset.startExperiment() **Added in:** `@mastra/core@1.4.0` Runs an experiment on the dataset and waits for completion. Executes all items against a target (agent, workflow, or scorer) with optional scoring. ## Usage example ```typescript import { Mastra } from '@mastra/core' const mastra = new Mastra({ /* storage config */ }) const dataset = await mastra.datasets.get({ id: 'dataset-id' }) // Run against a registered agent with a flat scorer list const summary = await dataset.startExperiment({ targetType: 'agent', targetId: 'my-agent', scorers: ['accuracy', 'relevancy'], maxConcurrency: 10, }) // Or pass the same categorised shape accepted by runEvals const summary2 = await dataset.startExperiment({ targetType: 'agent', targetId: 'my-agent', scorers: { agent: [accuracyScorer], trajectory: [toolOrderScorer], }, }) // For workflow targets, score individual steps with their own scorers const summary3 = await dataset.startExperiment({ targetType: 'workflow', targetId: 'my-workflow', scorers: { workflow: [overallScorer], steps: { 'fetch-data': [fetchScorer], transform: [transformScorer], }, trajectory: [executionPathScorer], }, }) console.log(`${summary.succeededCount}/${summary.totalItems} succeeded`) console.log(`Status: ${summary.status}`) console.log(`${summary2.succeededCount}/${summary2.totalItems} succeeded`) console.log(`Status: ${summary2.status}`) ``` ## Parameters **targetType** (`'agent' | 'workflow' | 'scorer'`): Type of registered target to run items against. Use with \`targetId\`. **targetId** (`string`): ID of the registered target. Use with \`targetType\`. **scorers** (`(MastraScorer | string)[] | AgentScorerConfig | WorkflowScorerConfig`): Scorers to evaluate each result. Accepts a flat array of \`MastraScorer\` instances or registered scorer IDs, or the same categorised config shape used by \`runEvals\` (\`AgentScorerConfig\` / \`WorkflowScorerConfig\`). Trajectory scorers (\`type: "trajectory"\`) automatically receive a pre-extracted \`Trajectory\` as their output regardless of which form is used. For workflow targets, per-step scorers can be passed via \`scorers: { steps: { stepId: \[...] } }\` and run against each step's output; their results carry the originating \`stepId\` and keep \`targetScope: "span"\` (matching \`runEvals\`). **name** (`string`): Display name for the experiment. **description** (`string`): Description of the experiment. **metadata** (`Record<string, unknown>`): Arbitrary metadata for the experiment. **version** (`number`): Pin to a specific dataset version. Defaults to the latest version. **maxConcurrency** (`number`): Maximum concurrent item executions. Defaults to \`5\`. **signal** (`AbortSignal`): AbortSignal for cancelling the experiment. **itemTimeout** (`number`): Per-item execution timeout in milliseconds. **maxRetries** (`number`): Maximum retries per item on failure. Defaults to \`0\` (no retries). Abort errors are never retried. ## Returns **result** (`Promise<ExperimentSummary>`): Summary of the completed experiment. **result.experimentId** (`string`): Unique ID of the experiment. **result.status** (`'pending' | 'running' | 'completed' | 'failed'`): Final status of the experiment. **result.totalItems** (`number`): Total number of items in the dataset. **result.succeededCount** (`number`): Number of items that succeeded. **result.failedCount** (`number`): Number of items that failed. **result.skippedCount** (`number`): Number of items skipped (e.g., due to abort). **result.completedWithErrors** (`boolean`): \`true\` if the run completed but some items failed. **result.startedAt** (`Date`): When the experiment started. **result.completedAt** (`Date`): When the experiment completed. **result.results** (`ItemWithScores[]`): All item results with their scores. **result.results.itemId** (`string`): ID of the dataset item. **result.results.itemVersion** (`number`): Dataset version of the item when executed. **result.results.input** (`unknown`): Input data passed to the target. **result.results.output** (`unknown | null`): Output from the target, or \`null\` if failed. **result.results.groundTruth** (`unknown | null`): Expected output from the dataset item. **result.results.error** (`{ message: string; stack?: string; code?: string } | null`): Structured error if execution failed. **result.results.startedAt** (`Date`): When item execution started. **result.results.completedAt** (`Date`): When item execution completed. **result.results.retryCount** (`number`): Number of retry attempts. **result.results.scores** (`ScorerResult[]`): Results from all scorers for this item. **result.results.scores.scorerId** (`string`): ID of the scorer. **result.results.scores.scorerName** (`string`): Display name of the scorer. **result.results.scores.score** (`number | null`): Computed score, or \`null\` if the scorer failed. **result.results.scores.reason** (`string | null`): Reason/explanation for the score. **result.results.scores.error** (`string | null`): Error message if the scorer failed. ## Related - [dataset.startExperimentAsync()](https://mastra.ai/reference/datasets/startExperimentAsync) - [dataset.listExperiments()](https://mastra.ai/reference/datasets/listExperiments) - [DatasetsManager.compareExperiments()](https://mastra.ai/reference/datasets/compareExperiments)