# Scorer utils
Mastra provides utility functions to help extract and process data from scorer run inputs and outputs. These utilities are particularly useful in the `preprocess` step of custom scorers.
## Import
```typescript
import {
  getAssistantMessageFromRunOutput,
  getReasoningFromRunOutput,
  getUserMessageFromRunInput,
  getSystemMessagesFromRunInput,
  getCombinedSystemPrompt,
  extractToolCalls,
  extractInputMessages,
  extractAgentResponseMessages,
  compareTrajectories,
  createTrajectoryTestRun,
  checkTrajectoryEfficiency,
  checkTrajectoryBlacklist,
  analyzeToolFailures,
  createTestMessage,
  createAgentTestRun,
} from '@mastra/evals/scorers/utils'
```
Trajectory extraction functions are available from `@mastra/core/evals`:
```typescript
import {
  extractTrajectory,
  extractWorkflowTrajectory,
  extractTrajectoryFromTrace,
} from '@mastra/core/evals'
```
## Message extraction
### `getAssistantMessageFromRunOutput`
Extracts the text content from the first assistant message in the run output.
```typescript
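import { createScorer } from '@mastra/core/evals'
import { getAssistantMessageFromRunOutput } from '@mastra/evals/scorers/utils'

const scorer = createScorer({
  id: 'my-scorer',
  description: 'My scorer',
  type: 'agent',
})
  .preprocess(({ run }) => {
    // Pull the text of the first assistant message out of the run output
    const response = getAssistantMessageFromRunOutput(run.output)
    return { response }
  })
  .generateScore(({ results }) => {
    // Score 1 when a non-empty response was found, otherwise 0
    return results.preprocessStepResult?.response ? 1 : 0
  })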
```
**output** (`ScorerRunOutputForAgent`): The scorer run output (an array of `MastraDBMessage`)
**Returns:** `string | undefined` - The assistant message text, or undefined if no assistant message is found.
### `getUserMessageFromRunInput`
Extracts the text content from the first user message in the run input.
```typescript
.preprocess(({ run }) => {
  const userMessage = getUserMessageFromRunInput(run.input);
  return { userMessage };
})
```
**input** (`ScorerRunInputForAgent`): The scorer run input containing input messages
**Returns:** `string | undefined` - The user message text, or undefined if no user message is found.
### `extractInputMessages`
Extracts text content from all input messages as an array.
```typescript
.preprocess(({ run }) => {
  // One text string per input message
  const inputMessages = extractInputMessages(run.input);
  return { conversationHistory: inputMessages.join("\n") };
})
```
**Returns:** `string[]` - Array of text strings from each input message.
### `extractAgentResponseMessages`
Extracts text content from all assistant response messages as an array.
```typescript
.preprocess(({ run }) => {
  const allResponses = extractAgentResponseMessages(run.output);
  return { allResponses };
})
```
**Returns:** `string[]` - Array of text strings from each assistant message.
## Reasoning extraction
### `getReasoningFromRunOutput`
Extracts reasoning text from the run output. This is particularly useful when evaluating responses from reasoning models like `deepseek-reasoner` that produce chain-of-thought reasoning.
Reasoning can be stored in two places:
1. `content.reasoning` - a string field on the message content
2. `content.parts` - as parts with `type: 'reasoning'` containing `details`
```typescript
import { createScorer } from '@mastra/core/evals'
import {
  getReasoningFromRunOutput,
  getAssistantMessageFromRunOutput,
} from '@mastra/evals/scorers/utils'

const reasoningQualityScorer = createScorer({
  id: 'reasoning-quality',
  name: 'Reasoning Quality',
  description: 'Evaluates the quality of model reasoning',
  type: 'agent',
})
  .preprocess(({ run }) => {
    const reasoning = getReasoningFromRunOutput(run.output)
    const response = getAssistantMessageFromRunOutput(run.output)
    return { reasoning, response }
  })
  .analyze(({ results }) => {
    const { reasoning } = results.preprocessStepResult || {}
    return {
      hasReasoning: !!reasoning,
      reasoningLength: reasoning?.length || 0,
      hasStepByStep: reasoning?.includes('step') || false,
    }
  })
  .generateScore(({ results }) => {
    const { hasReasoning, reasoningLength = 0 } = results.analyzeStepResult || {}
    if (!hasReasoning) return 0
    // Score based on reasoning length: 500 or more characters maps to 1
    return Math.min(reasoningLength / 500, 1)
  })
  .generateReason(({ results, score }) => {
    const { hasReasoning, reasoningLength = 0 } = results.analyzeStepResult || {}
    if (!hasReasoning) {
      return 'No reasoning was provided by the model.'
    }
    return `Model provided ${reasoningLength} characters of reasoning. Score: ${score}`
  })
```
**output** (`ScorerRunOutputForAgent`): The scorer run output (an array of `MastraDBMessage`)
**Returns:** `string | undefined` - The reasoning text, or undefined if no reasoning is present.
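To make the two storage locations above concrete, here is a minimal sketch of a lookup that checks `content.reasoning` first and falls back to reasoning parts. It is not the library's implementation: the `SketchReasoningMessage` shape, the `{ type, text }` layout of `details`, the order of precedence, and the helper name `readReasoning` are all assumptions made for illustration.

```typescript
// Hypothetical, simplified message shape. The real MastraDBMessage has more fields.
type SketchReasoningDetail = { type: 'text'; text: string }
type SketchPart =
  | { type: 'text'; text: string }
  | { type: 'reasoning'; details: SketchReasoningDetail[] }
type SketchReasoningMessage = {
  role: 'user' | 'assistant'
  content: { reasoning?: string; parts: SketchPart[] }
}

// Returns the reasoning of the first assistant message that carries any.
function readReasoning(output: SketchReasoningMessage[]): string | undefined {
  for (const message of output) {
    if (message.role !== 'assistant') continue
    // Location 1: a plain string on the message content (assumed to take precedence)
    if (message.content.reasoning) return message.content.reasoning
    // Location 2: parts tagged with type 'reasoning', each holding detail entries
    const chunks: string[] = []
    for (const part of message.content.parts) {
      if (part.type === 'reasoning') {
        for (const detail of part.details) chunks.push(detail.text)
      }
    }
    if (chunks.length > 0) return chunks.join('\n')
  }
  return undefined
}
```

In a real scorer, call `getReasoningFromRunOutput(run.output)` rather than reimplementing the lookup; the sketch only shows why a scorer should treat both locations as equivalent sources.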
## System message extraction
### `getSystemMessagesFromRunInput`
Extracts all system messages from the run input, including both standard system messages and tagged system messages (specialized prompts like memory instructions).
```typescript
.preprocess(({ run }) => {
  const systemMessages = getSystemMessagesFromRunInput(run.input);
  return {
    systemPromptCount: systemMessages.length,
    systemPrompts: systemMessages
  };
})
```
**Returns:** `string[]` - Array of system message strings.
### `getCombinedSystemPrompt`
Combines all system messages into a single prompt string, joined with double newlines.
```typescript
.preprocess(({ run }) => {
  const fullSystemPrompt = getCombinedSystemPrompt(run.input);
  return { fullSystemPrompt };
})
```
**Returns:** `string` - Combined system prompt string.
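The relationship between the two system message helpers can be sketched as follows: collect standard and tagged system messages into one list, then join that list with double newlines. This is an illustration under assumptions, not the library's code. The `SketchRunInput` shape (including the `systemMessages` and `taggedSystemMessages` field names) and both helper names are hypothetical.

```typescript
// Hypothetical, simplified input shape used only for this sketch.
type SketchSystemMessage = { content: string }
type SketchRunInput = {
  systemMessages: SketchSystemMessage[]
  // Tagged prompts, keyed by tag (for example, memory instructions)
  taggedSystemMessages: Record<string, SketchSystemMessage[]>
}

// Collects standard system messages first, then tagged ones, as plain strings.
function collectSystemMessages(input: SketchRunInput): string[] {
  const collected: string[] = []
  for (const message of input.systemMessages) collected.push(message.content)
  for (const tag of Object.keys(input.taggedSystemMessages)) {
    for (const message of input.taggedSystemMessages[tag] ?? []) {
      collected.push(message.content)
    }
  }
  return collected
}

// Joins every collected system message with a blank line between entries.
function combineSystemPrompt(input: SketchRunInput): string {
  return collectSystemMessages(input).join('\n\n')
}
```

The ordering of standard versus tagged messages in the sketch is a guess; rely on the real helpers when the order matters to your scorer.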
## Tool call extraction
### `extractToolCalls`
Extracts information about all tool calls from the run output, including tool names, call IDs, and their positions in the message array.
```typescript
const toolUsageScorer = createScorer({
  id: 'tool-usage',
  description: 'Evaluates tool usage patterns',
  type: 'agent',
})
  .preprocess(({ run }) => {
    const { tools, toolCallInfos } = extractToolCalls(run.output)
    return {
      toolsUsed: tools,
      toolCount: tools.length,
      toolDetails: toolCallInfos,
    }
  })
  .generateScore(({ results }) => {
    const toolCount = results.preprocessStepResult?.toolCount ?? 0
    // Score 1 if at least one tool was called, otherwise 0
    return toolCount > 0 ? 1 : 0
  })
```
**Returns:**
```typescript
{
  tools: string[]; // Array of tool names
  toolCallInfos: ToolCallInfo[]; // Detailed tool call information
}
```
Where `ToolCallInfo` is:
```typescript
type ToolCallInfo = {
  toolName: string // Name of the tool
  toolCallId: string // Unique call identifier
  messageIndex: number // Index in the output array
  invocationIndex: number // Index within message's tool invocations
}
```
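To show how `messageIndex` and `invocationIndex` relate to the output array, the sketch below walks a simplified message list and records both positions for every tool invocation. It assumes a hypothetical `SketchOutputMessage` shape with a top-level `toolInvocations` array, and `collectToolCalls` is an illustrative name rather than a Mastra export. Whether the real `tools` array keeps duplicate names is not stated here, so the sketch simply keeps every call.

```typescript
// Hypothetical, simplified shapes used only for this sketch.
type SketchInvocation = { toolName: string; toolCallId: string }
type SketchOutputMessage = { role: string; toolInvocations?: SketchInvocation[] }
type SketchToolCallInfo = {
  toolName: string
  toolCallId: string
  messageIndex: number
  invocationIndex: number
}

function collectToolCalls(output: SketchOutputMessage[]): {
  tools: string[]
  toolCallInfos: SketchToolCallInfo[]
} {
  const tools: string[] = []
  const toolCallInfos: SketchToolCallInfo[] = []
  output.forEach((message, messageIndex) => {
    // Messages without tool invocations contribute nothing
    const invocations = message.toolInvocations ?? []
    invocations.forEach((invocation, invocationIndex) => {
      tools.push(invocation.toolName)
      toolCallInfos.push({
        toolName: invocation.toolName,
        toolCallId: invocation.toolCallId,
        messageIndex, // position of the message in the output array
        invocationIndex, // position of the call within that message
      })
    })
  })
  return { tools, toolCallInfos }
}
```

With these two indices a scorer can reconstruct call order across messages, for example to check that a lookup tool ran before a write tool.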
## Test utilities
These utilities help create test data for scorer development.
### `createTestMessage`
Creates a `MastraDBMessage` object for testing purposes.
```typescript
import { createTestMessage } from '@mastra/evals/scorers/utils'

const userMessage = createTestMessage({
  content: 'What is the weather?',
  role: 'user',
})

const assistantMessage = createTestMessage({
  content: 'The weather is sunny.',
  role: 'assistant',
  toolInvocations: [
    {
      toolCallId: 'call-1',
      toolName: 'weatherTool',
      args: { location: 'London' },
      result: { temp: 20 },
      state: 'result',
    },
  ],
})
```
### `createAgentTestRun`
Creates a complete test run object for testing scorers.
```typescript
import { createAgentTestRun, createTestMessage } from '@mastra/evals/scorers/utils'

const testRun = createAgentTestRun({
  inputMessages: [createTestMessage({ content: 'Hello', role: 'user' })],
  output: [createTestMessage({ content: 'Hi there!', role: 'assistant' })],
})

// Run your scorer with the test data
const result = await myScorer.run({
  input: testRun.input,
  output: testRun.output,
})
```
## Trajectory utilities
### `extractTrajectory`
Extracts a `Trajectory` from agent output messages (`MastraDBMessage[]`). Converts tool invocations into `ToolCallStep` objects. The `runEvals` pipeline calls this automatically for trajectory scorers — you only need it for direct testing.
Available from `@mastra/core/evals`.
```typescript
import { extractTrajectory } from '@mastra/core/evals'
const trajectory = extractTrajectory(agentOutputMessages)
// trajectory.steps — ToolCallStep[] extracted from toolInvocations
// trajectory.rawOutput — the original MastraDBMessage[] array
```
**Returns:** `Trajectory` — Contains `steps: TrajectoryStep[]`, `totalDurationMs`, and `rawOutput`.
### `extractWorkflowTrajectory`
Extracts a `Trajectory` from workflow step results. Converts `StepResult` records into `WorkflowStepStep` objects, respecting the execution path ordering.
Available from `@mastra/core/evals`.
```typescript
import { extractWorkflowTrajectory } from '@mastra/core/evals'

const trajectory = extractWorkflowTrajectory(
  workflowResult.steps, // Record<string, StepResult>
  workflowResult.stepExecutionPath, // string[] (optional)
)
// trajectory.steps — WorkflowStepStep[] in execution order
```
**Returns:** `Trajectory` — Contains `steps: TrajectoryStep[]`, `totalDurationMs`, and `rawWorkflowResult`.
### `extractTrajectoryFromTrace`
Builds a hierarchical `Trajectory` from observability trace spans (`SpanRecord[]`). Reconstructs the parent-child span tree and maps each span to the appropriate `TrajectoryStep` discriminated union type with nested `children`.
This is the preferred extraction method when storage is available. The `runEvals` pipeline calls this automatically when the target's `Mastra` instance has a configured storage backend. It produces richer trajectories than `extractTrajectory` or `extractWorkflowTrajectory` because it captures the full execution tree, including nested agent runs, tool calls, and model generations.
Available from `@mastra/core/evals`.
```typescript
import { extractTrajectoryFromTrace } from '@mastra/core/evals'
// After fetching a trace from the observability store
const traceData = await observabilityStore.getTrace({ traceId })
const trajectory = extractTrajectoryFromTrace(traceData.spans, rootSpanId)
// trajectory.steps — hierarchical TrajectoryStep[] with children
```
**Parameters:**
- `spans` (`SpanRecord[]`): Array of span records from a trace query.
- `rootSpanId` (`string`, optional): Span ID to use as the starting point. When omitted, uses spans with no parent.
**Returns:** `Trajectory` - Contains `steps: TrajectoryStep[]` (each step with recursive `children`) and `totalDurationMs`.
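The parent-child reconstruction described above can be pictured with a small sketch that turns a flat span list into a tree. It is only an approximation under stated assumptions: the `spanId` and `parentSpanId` field names, the `SketchSpan` and `SketchNode` types, and the `buildSpanTree` helper are hypothetical, and returning the subtree rooted at `rootSpanId` is a guess at the behavior, not a documented guarantee.

```typescript
// Hypothetical, simplified span and node shapes used only for this sketch.
type SketchSpan = { spanId: string; parentSpanId?: string; spanName: string }
type SketchNode = { spanName: string; children: SketchNode[] }

function buildSpanTree(spans: SketchSpan[], rootSpanId?: string): SketchNode[] {
  // First pass: create one node per span so children can find their parent
  const nodes: Record<string, SketchNode> = {}
  for (const span of spans) {
    nodes[span.spanId] = { spanName: span.spanName, children: [] }
  }

  // Second pass: attach each node to its parent, or treat it as a root
  const roots: SketchNode[] = []
  for (const span of spans) {
    const node = nodes[span.spanId]
    if (!node) continue
    const parentNode = span.parentSpanId !== undefined ? nodes[span.parentSpanId] : undefined
    if (parentNode) parentNode.children.push(node)
    else roots.push(node) // no parent, or parent missing from this trace slice
  }

  // Assumption: an explicit root narrows the result to that span's subtree
  if (rootSpanId !== undefined) {
    const rootNode = nodes[rootSpanId]
    return rootNode ? [rootNode] : []
  }
  return roots
}
```

The sketch keeps every span. The real function also maps span types to step types and drops the noise span types listed in the next section.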
#### Span type mapping
| Span type | Trajectory step type | Key fields extracted |
| ---------------------- | ---------------------- | ------------------------------------------------------------- |
| `TOOL_CALL` | `tool_call` | `toolArgs`, `toolResult`, `success` |
| `MCP_TOOL_CALL` | `mcp_tool_call` | `toolArgs`, `toolResult`, `mcpServer`, `success` |
| `MODEL_GENERATION` | `model_generation` | `modelId`, `promptTokens`, `completionTokens`, `finishReason` |
| `AGENT_RUN` | `agent_run` | `agentId` (from entity ID) |
| `WORKFLOW_RUN` | `workflow_run` | `workflowId` (from entity ID) |
| `WORKFLOW_STEP` | `workflow_step` | `output` |
| `WORKFLOW_CONDITIONAL` | `workflow_conditional` | `conditionCount`, `selectedSteps` |
| `WORKFLOW_PARALLEL` | `workflow_parallel` | `branchCount`, `parallelSteps` |
| `WORKFLOW_LOOP` | `workflow_loop` | `loopType`, `totalIterations` |
| `WORKFLOW_SLEEP` | `workflow_sleep` | `sleepDurationMs`, `sleepType` |
| `WORKFLOW_WAIT_EVENT` | `workflow_wait_event` | `eventName`, `eventReceived` |
| `PROCESSOR_RUN` | `processor_run` | `processorId` |
Spans with types `GENERIC`, `MODEL_STEP`, `MODEL_CHUNK`, and `WORKFLOW_CONDITIONAL_EVAL` are skipped as noise.
### `compareTrajectories`
Compares an actual trajectory against an expected trajectory and returns a detailed comparison result. Used internally by `createTrajectoryAccuracyScorerCode`.
The `expected` parameter accepts either a complete `Trajectory` object or `{ steps: ExpectedStep[] }`. With `ExpectedStep[]`, you can match by name only, by name and `stepType`, or include data fields for comparison. See [Expected steps](https://mastra.ai/reference/evals/trajectory-accuracy) for details.
```typescript
import { compareTrajectories } from '@mastra/evals/scorers/utils'
// Using ExpectedStep[] (recommended for expectations)
// Data fields (e.g. toolArgs) are auto-compared when present on expected steps
const result = compareTrajectories(
  actualTrajectory,
  {
    steps: [{ name: 'search' }, { name: 'summarize', stepType: 'tool_call' }],
  },
  { allowRepeatedSteps: true },
)
// result.score — 0.0 to 1.0
// result.missingSteps — step names not found
// result.extraSteps — unexpected step names
// result.outOfOrderSteps — steps found but in wrong order
```
**Returns:** `TrajectoryComparisonResult`
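As a rough mental model of what the missing, extra, and ordering fields capture, the sketch below compares two lists of step names. It is not how `compareTrajectories` is implemented and it does not compute a score. The `compareStepNames` helper and its `inOrder` flag are hypothetical, and treating "in order" as "the expected names appear as a subsequence of the actual names" is an assumption.

```typescript
type SketchNameComparison = {
  missingSteps: string[]
  extraSteps: string[]
  inOrder: boolean
}

function compareStepNames(actual: string[], expected: string[]): SketchNameComparison {
  // Expected names that never appear in the actual run
  const missingSteps = expected.filter(stepName => actual.indexOf(stepName) === -1)
  // Actual names that were not part of the expectation
  const extraSteps = actual.filter(stepName => expected.indexOf(stepName) === -1)

  // Walk the actual steps once, advancing a cursor through the expected list.
  // If the cursor reaches the end, the expected order was respected.
  let cursor = 0
  for (const stepName of actual) {
    if (cursor < expected.length && stepName === expected[cursor]) cursor++
  }
  return { missingSteps, extraSteps, inOrder: cursor === expected.length }
}
```

For scoring, use `compareTrajectories` itself, which also handles options such as `allowRepeatedSteps` and data field comparison.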
### `createTrajectoryTestRun`
Creates a test run object for trajectory scorers. Wraps a `Trajectory` into the expected `ScorerRun` format.
```typescript
import { createTrajectoryTestRun } from '@mastra/evals/scorers/utils'

const run = createTrajectoryTestRun({
  steps: [
    { stepType: 'tool_call', name: 'search', toolArgs: { q: 'test' } },
    { stepType: 'tool_call', name: 'summarize' },
  ],
})

const result = await trajectoryScorer.run(run)
```
### `checkTrajectoryEfficiency`
Evaluates trajectory efficiency against step, token, and duration budgets. Also detects redundant calls (same tool with same arguments).
```typescript
import { checkTrajectoryEfficiency } from '@mastra/evals/scorers/utils'

const result = checkTrajectoryEfficiency(trajectory, {
  maxSteps: 5,
  maxTotalTokens: 2000,
  maxTotalDurationMs: 5000,
  noRedundantCalls: true,
})
// result.score — 1.0 if within all budgets, lower with penalties
// result.redundantCalls — duplicate tool+args combos
// result.overStepBudget — true if maxSteps exceeded
// result.overTokenBudget — true if maxTotalTokens exceeded
// result.overDurationBudget — true if maxTotalDurationMs exceeded
```
**Returns:** `TrajectoryEfficiencyResult`
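The "same tool with same arguments" rule for redundant calls can be illustrated with a short sketch. It assumes steps shaped like the `tool_call` steps shown earlier (`name` plus optional `toolArgs`); the `SketchCallStep` type and `findRedundantCalls` helper are hypothetical, and comparing arguments through `JSON.stringify` is a simplification that treats differently ordered keys as different arguments.

```typescript
// Hypothetical, simplified step shape used only for this sketch.
type SketchCallStep = { name: string; toolArgs?: Record<string, unknown> }

// Returns every call that repeats an earlier call with identical name and arguments.
function findRedundantCalls(steps: SketchCallStep[]): SketchCallStep[] {
  const seen: Record<string, boolean> = {}
  const redundant: SketchCallStep[] = []
  for (const step of steps) {
    // Key on tool name plus serialized arguments (key order sensitive)
    const key = step.name + ':' + JSON.stringify(step.toolArgs ?? {})
    if (seen[key]) redundant.push(step)
    else seen[key] = true
  }
  return redundant
}
```

Calling the same tool with different arguments, such as paging through search results, does not count as redundant under this rule.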
### `checkTrajectoryBlacklist`
Checks whether a trajectory contains forbidden tools or tool call sequences.
```typescript
import { checkTrajectoryBlacklist } from '@mastra/evals/scorers/utils'

const result = checkTrajectoryBlacklist(trajectory, {
  blacklistedTools: ['deleteAll', 'admin-override'],
  blacklistedSequences: [['escalate', 'admin-override']],
})
// result.score — 1.0 if no violations, 0.0 if any found
// result.violatedTools — blacklisted tools that were called
// result.violatedSequences — blacklisted sequences that were detected
```
**Returns:** `TrajectoryBlacklistResult`
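To clarify what a sequence violation means, the sketch below checks whether a list of tool names contains a forbidden sequence. It assumes the sequence must appear as consecutive calls; whether `checkTrajectoryBlacklist` also flags sequences with other calls in between is not stated here. `containsSequence` is a hypothetical helper, not a Mastra export.

```typescript
// Returns true when `sequence` occurs as consecutive entries in `toolNames`.
function containsSequence(toolNames: string[], sequence: string[]): boolean {
  if (sequence.length === 0) return false
  // Try every starting position where the whole sequence could still fit
  for (let start = 0; start + sequence.length <= toolNames.length; start++) {
    let matched = true
    for (let offset = 0; offset < sequence.length; offset++) {
      if (toolNames[start + offset] !== sequence[offset]) {
        matched = false
        break
      }
    }
    if (matched) return true
  }
  return false
}
```

Under this assumption, `['escalate', 'admin-override']` is a violation only when `admin-override` directly follows `escalate`.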
### `analyzeToolFailures`
Detects tool failure patterns including retries, fallbacks, and argument corrections.
```typescript
import { analyzeToolFailures } from '@mastra/evals/scorers/utils'

const result = analyzeToolFailures(trajectory, {
  maxRetriesPerTool: 2,
})
// result.score — 1.0 if no failure patterns, lower if patterns detected
// result.patterns — detected patterns (retry, fallback, arg_correction)
```
**Returns:** `ToolFailureAnalysisResult`
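The retry pattern can be approximated with the sketch below, which counts a retry whenever a failed call is immediately followed by another call to the same tool, then reports tools that exceed a limit. This is one plausible reading of "retry", not the library's detection logic. The `SketchToolStep` shape and both helper names are hypothetical; the `success` field mirrors the one extracted for `tool_call` steps.

```typescript
// Hypothetical, simplified step shape used only for this sketch.
type SketchToolStep = { name: string; success: boolean }

// Counts, per tool, how often a failed call is directly followed by the same tool.
function countRetries(steps: SketchToolStep[]): Record<string, number> {
  const retries: Record<string, number> = {}
  for (let i = 1; i < steps.length; i++) {
    const previous = steps[i - 1]
    const current = steps[i]
    if (!previous || !current) continue
    if (!previous.success && previous.name === current.name) {
      retries[current.name] = (retries[current.name] ?? 0) + 1
    }
  }
  return retries
}

// Lists tools whose retry count exceeds the allowed maximum.
function toolsOverRetryLimit(steps: SketchToolStep[], maxRetriesPerTool: number): string[] {
  const retries = countRetries(steps)
  return Object.keys(retries).filter(tool => (retries[tool] ?? 0) > maxRetriesPerTool)
}
```

Fallbacks and argument corrections need more context, such as a different tool after a failure or changed arguments on the retry, so they are left to `analyzeToolFailures`.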
## Complete example
Here's a complete example showing how to use multiple utilities together:
```typescript
import { createScorer } from '@mastra/core/evals'
import {
  getAssistantMessageFromRunOutput,
  getReasoningFromRunOutput,
  getUserMessageFromRunInput,
  getCombinedSystemPrompt,
  extractToolCalls,
} from '@mastra/evals/scorers/utils'

const comprehensiveScorer = createScorer({
  id: 'comprehensive-analysis',
  name: 'Comprehensive Analysis',
  description: 'Analyzes all aspects of an agent response',
  type: 'agent',
})
  .preprocess(({ run }) => {
    // Extract all relevant data
    const userMessage = getUserMessageFromRunInput(run.input)
    const response = getAssistantMessageFromRunOutput(run.output)
    const reasoning = getReasoningFromRunOutput(run.output)
    const systemPrompt = getCombinedSystemPrompt(run.input)
    const { tools } = extractToolCalls(run.output)
    return {
      userMessage,
      response,
      reasoning,
      systemPrompt,
      toolsUsed: tools,
      toolCount: tools.length,
    }
  })
  .generateScore(({ results }) => {
    const { response, reasoning, toolCount = 0 } = results.preprocessStepResult || {}
    // Weighted sum: response 0.4, reasoning 0.3, tool usage 0.3
    let score = 0
    if (response && response.length > 0) score += 0.4
    if (reasoning) score += 0.3
    if (toolCount > 0) score += 0.3
    return score
  })
  .generateReason(({ results, score }) => {
    const { response, reasoning, toolCount = 0 } = results.preprocessStepResult || {}
    const parts: string[] = []
    if (response) parts.push('provided a response')
    if (reasoning) parts.push('included reasoning')
    if (toolCount > 0) parts.push(`used ${toolCount} tool(s)`)
    // Avoid an empty sentence when nothing was detected
    if (parts.length === 0) {
      return `Score: ${score}. The agent produced no response, reasoning, or tool calls.`
    }
    return `Score: ${score}. The agent ${parts.join(', ')}.`
  })
```