Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.

# Scorer utils

Mastra provides utility functions to help extract and process data from scorer run inputs and outputs. These utilities are particularly useful in the `preprocess` step of custom scorers.

## Import

```typescript
import {
  getAssistantMessageFromRunOutput,
  getReasoningFromRunOutput,
  getUserMessageFromRunInput,
  getSystemMessagesFromRunInput,
  getCombinedSystemPrompt,
  extractToolCalls,
  extractInputMessages,
  extractAgentResponseMessages,
  compareTrajectories,
  createTrajectoryTestRun,
} from '@mastra/evals/scorers/utils'
```

Trajectory extraction functions are available from `@mastra/core/evals`:

```typescript
import {
  extractTrajectory,
  extractWorkflowTrajectory,
  extractTrajectoryFromTrace,
} from '@mastra/core/evals'
```

## Message extraction

### `getAssistantMessageFromRunOutput`

Extracts the text content from the first assistant message in the run output.

```typescript
const scorer = createScorer({
  id: 'my-scorer',
  description: 'My scorer',
  type: 'agent',
})
  .preprocess(({ run }) => {
    const response = getAssistantMessageFromRunOutput(run.output)
    return { response }
  })
  .generateScore(({ results }) => {
    return results.preprocessStepResult?.response ? 1 : 0
  })
```

**output** (`ScorerRunOutputForAgent`): The scorer run output (array of `MastraDBMessage`)

**Returns:** `string | undefined` - The assistant message text, or undefined if no assistant message is found.

### `getUserMessageFromRunInput`

Extracts the text content from the first user message in the run input.

```typescript
.preprocess(({ run }) => {
  const userMessage = getUserMessageFromRunInput(run.input);
  return { userMessage };
})
```

**input** (`ScorerRunInputForAgent`): The scorer run input containing input messages

**Returns:** `string | undefined` - The user message text, or undefined if no user message is found.

### `extractInputMessages`

Extracts text content from all input messages as an array.

```typescript
.preprocess(({ run }) => {
  const allUserMessages = extractInputMessages(run.input);
  return { conversationHistory: allUserMessages.join("\n") };
})
```

**Returns:** `string[]` - Array of text strings from each input message.

### `extractAgentResponseMessages`

Extracts text content from all assistant response messages as an array.

```typescript
.preprocess(({ run }) => {
  const allResponses = extractAgentResponseMessages(run.output);
  return { allResponses };
})
```

**Returns:** `string[]` - Array of text strings from each assistant message.

## Reasoning extraction

### `getReasoningFromRunOutput`

Extracts reasoning text from the run output. This is particularly useful when evaluating responses from reasoning models like `deepseek-reasoner` that produce chain-of-thought reasoning.

Reasoning can be stored in two places:

1. `content.reasoning` - a string field on the message content
2. `content.parts` - as parts with `type: 'reasoning'` containing `details`

```typescript
import {
  getReasoningFromRunOutput,
  getAssistantMessageFromRunOutput,
} from '@mastra/evals/scorers/utils'

const reasoningQualityScorer = createScorer({
  id: 'reasoning-quality',
  name: 'Reasoning Quality',
  description: 'Evaluates the quality of model reasoning',
  type: 'agent',
})
  .preprocess(({ run }) => {
    const reasoning = getReasoningFromRunOutput(run.output)
    const response = getAssistantMessageFromRunOutput(run.output)
    return { reasoning, response }
  })
  .analyze(({ results }) => {
    const { reasoning } = results.preprocessStepResult || {}
    return {
      hasReasoning: !!reasoning,
      reasoningLength: reasoning?.length || 0,
      hasStepByStep: reasoning?.includes('step') || false,
    }
  })
  .generateScore(({ results }) => {
    const { hasReasoning, reasoningLength } = results.analyzeStepResult || {}
    if (!hasReasoning) return 0
    // Score based on reasoning length (normalized to 0-1)
    return Math.min(reasoningLength / 500, 1)
  })
  .generateReason(({ results, score }) => {
    const { hasReasoning, reasoningLength } = results.analyzeStepResult || {}
    if (!hasReasoning) {
      return 'No reasoning was provided by the model.'
    }
    return `Model provided ${reasoningLength} characters of reasoning. Score: ${score}`
  })
```

**output** (`ScorerRunOutputForAgent`): The scorer run output (array of `MastraDBMessage`)

**Returns:** `string | undefined` - The reasoning text, or undefined if no reasoning is present.

## System message extraction

### `getSystemMessagesFromRunInput`

Extracts all system messages from the run input, including both standard system messages and tagged system messages (specialized prompts like memory instructions).

```typescript
.preprocess(({ run }) => {
  const systemMessages = getSystemMessagesFromRunInput(run.input);
  return {
    systemPromptCount: systemMessages.length,
    systemPrompts: systemMessages
  };
})
```

**Returns:** `string[]` - Array of system message strings.

### `getCombinedSystemPrompt`

Combines all system messages into a single prompt string, joined with double newlines.

```typescript
.preprocess(({ run }) => {
  const fullSystemPrompt = getCombinedSystemPrompt(run.input);
  return { fullSystemPrompt };
})
```

**Returns:** `string` - Combined system prompt string.

## Tool call extraction

### `extractToolCalls`

Extracts information about all tool calls from the run output, including tool names, call IDs, and their positions in the message array.

```typescript
const toolUsageScorer = createScorer({
  id: 'tool-usage',
  description: 'Evaluates tool usage patterns',
  type: 'agent',
})
  .preprocess(({ run }) => {
    const { tools, toolCallInfos } = extractToolCalls(run.output)
    return {
      toolsUsed: tools,
      toolCount: tools.length,
      toolDetails: toolCallInfos,
    }
  })
  .generateScore(({ results }) => {
    const { toolCount } = results.preprocessStepResult || {}
    // Score based on appropriate tool usage
    return toolCount > 0 ? 1 : 0
  })
```

**Returns:**

```typescript
{
  tools: string[]; // Array of tool names
  toolCallInfos: ToolCallInfo[]; // Detailed tool call information
}
```

Where `ToolCallInfo` is:

```typescript
type ToolCallInfo = {
  toolName: string // Name of the tool
  toolCallId: string // Unique call identifier
  messageIndex: number // Index in the output array
  invocationIndex: number // Index within message's tool invocations
}
```

## Test utilities

These utilities help create test data for scorer development.

### `createTestMessage`

Creates a `MastraDBMessage` object for testing purposes.

```typescript
import { createTestMessage } from '@mastra/evals/scorers/utils'

const userMessage = createTestMessage({
  content: 'What is the weather?',
  role: 'user',
})

const assistantMessage = createTestMessage({
  content: 'The weather is sunny.',
  role: 'assistant',
  toolInvocations: [
    {
      toolCallId: 'call-1',
      toolName: 'weatherTool',
      args: { location: 'London' },
      result: { temp: 20 },
      state: 'result',
    },
  ],
})
```

### `createAgentTestRun`

Creates a complete test run object for testing scorers.

```typescript
import { createAgentTestRun, createTestMessage } from '@mastra/evals/scorers/utils'

const testRun = createAgentTestRun({
  inputMessages: [createTestMessage({ content: 'Hello', role: 'user' })],
  output: [createTestMessage({ content: 'Hi there!', role: 'assistant' })],
})

// Run your scorer with the test data
const result = await myScorer.run({
  input: testRun.input,
  output: testRun.output,
})
```

## Trajectory utilities

### `extractTrajectory`

Extracts a `Trajectory` from agent output messages (`MastraDBMessage[]`). Converts tool invocations into `ToolCallStep` objects.

The `runEvals` pipeline calls this automatically for trajectory scorers, so you only need it for direct testing. Available from `@mastra/core/evals`.

```typescript
import { extractTrajectory } from '@mastra/core/evals'

const trajectory = extractTrajectory(agentOutputMessages)
// trajectory.steps: ToolCallStep[] extracted from toolInvocations
// trajectory.rawOutput: the original MastraDBMessage[] array
```

**Returns:** `Trajectory`: Contains `steps: TrajectoryStep[]`, `totalDurationMs`, and `rawOutput`.

### `extractWorkflowTrajectory`

Extracts a `Trajectory` from workflow step results. Converts `StepResult` records into `WorkflowStepStep` objects, respecting the execution path ordering. Available from `@mastra/core/evals`.

```typescript
import { extractWorkflowTrajectory } from '@mastra/core/evals'

const trajectory = extractWorkflowTrajectory(
  workflowResult.steps, // Record<string, StepResult>
  workflowResult.stepExecutionPath, // string[] (optional)
)
// trajectory.steps: WorkflowStepStep[] in execution order
```

**Returns:** `Trajectory`: Contains `steps: TrajectoryStep[]`, `totalDurationMs`, and `rawWorkflowResult`.

### `extractTrajectoryFromTrace`

Builds a hierarchical `Trajectory` from observability trace spans (`SpanRecord[]`). Reconstructs the parent-child span tree and maps each span to the appropriate `TrajectoryStep` discriminated union type with nested `children`.

This is the preferred extraction method when storage is available. The `runEvals` pipeline calls this automatically when the target's `Mastra` instance has a configured storage backend. It produces richer trajectories than `extractTrajectory` or `extractWorkflowTrajectory` because it captures the full execution tree, including nested agent runs, tool calls, and model generations. Available from `@mastra/core/evals`.

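The parent-child reconstruction described above can be illustrated without Mastra. The following is a minimal sketch, not the library's implementation: `FlatSpan`, `SpanNode`, `buildSpanTree`, and the `spanId` / `parentSpanId` field names are hypothetical stand-ins for the real `SpanRecord` shape, and whether the real function returns the root span itself or only its children is not stated in this reference.

```typescript
// Hypothetical, simplified span shape (assumption: the real SpanRecord has more fields
// and may name these differently).
interface FlatSpan {
  spanId: string
  parentSpanId: string | null
  name: string
}

// Hypothetical tree node with nested children.
interface SpanNode {
  name: string
  children: SpanNode[]
}

// Rebuild a parent-child tree from a flat list of spans.
// When rootSpanId is omitted, every span without a parent becomes a root.
// Spans whose parent is not in the list are dropped in this sketch.
function buildSpanTree(spans: FlatSpan[], rootSpanId?: string): SpanNode[] {
  // First pass: create one node per span so children can be attached in any order.
  const nodes = new Map<string, SpanNode>()
  for (const span of spans) {
    nodes.set(span.spanId, { name: span.name, children: [] })
  }

  // Second pass: attach each node to its parent, or collect it as a root.
  const roots: SpanNode[] = []
  for (const span of spans) {
    const node = nodes.get(span.spanId)!
    const isRoot =
      rootSpanId !== undefined ? span.spanId === rootSpanId : span.parentSpanId === null
    if (isRoot) {
      roots.push(node)
      continue
    }
    const parent = span.parentSpanId ? nodes.get(span.parentSpanId) : undefined
    if (parent) parent.children.push(node)
  }
  return roots
}

const spans: FlatSpan[] = [
  { spanId: 'a', parentSpanId: null, name: 'agent run' },
  { spanId: 'b', parentSpanId: 'a', name: 'search' },
  { spanId: 'c', parentSpanId: 'a', name: 'summarize' },
]

const tree = buildSpanTree(spans)
const subtree = buildSpanTree(spans, 'b')
```

Building the node map before linking means the input does not have to be sorted by start time or depth, which is a convenient property when spans arrive from a store in arbitrary order.
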
```typescript
import { extractTrajectoryFromTrace } from '@mastra/core/evals'

// After fetching a trace from the observability store
const traceData = await observabilityStore.getTrace({ traceId })
const trajectory = extractTrajectoryFromTrace(traceData.spans, rootSpanId)
// trajectory.steps: hierarchical TrajectoryStep[] with children
```

**Parameters:**

- `spans` (`SpanRecord[]`): Array of span records from a trace query.
- `rootSpanId` (`string`, optional): Span ID to use as the starting point. When omitted, uses spans with no parent.

**Returns:** `Trajectory`: Contains `steps: TrajectoryStep[]` with recursive `children` and `totalDurationMs`.

#### Span type mapping

| Span type              | Trajectory step type   | Key fields extracted                                          |
| ---------------------- | ---------------------- | ------------------------------------------------------------- |
| `TOOL_CALL`            | `tool_call`            | `toolArgs`, `toolResult`, `success`                           |
| `MCP_TOOL_CALL`        | `mcp_tool_call`        | `toolArgs`, `toolResult`, `mcpServer`, `success`              |
| `MODEL_GENERATION`     | `model_generation`     | `modelId`, `promptTokens`, `completionTokens`, `finishReason` |
| `AGENT_RUN`            | `agent_run`            | `agentId` (from entity ID)                                    |
| `WORKFLOW_RUN`         | `workflow_run`         | `workflowId` (from entity ID)                                 |
| `WORKFLOW_STEP`        | `workflow_step`        | `output`                                                      |
| `WORKFLOW_CONDITIONAL` | `workflow_conditional` | `conditionCount`, `selectedSteps`                             |
| `WORKFLOW_PARALLEL`    | `workflow_parallel`    | `branchCount`, `parallelSteps`                                |
| `WORKFLOW_LOOP`        | `workflow_loop`        | `loopType`, `totalIterations`                                 |
| `WORKFLOW_SLEEP`       | `workflow_sleep`       | `sleepDurationMs`, `sleepType`                                |
| `WORKFLOW_WAIT_EVENT`  | `workflow_wait_event`  | `eventName`, `eventReceived`                                  |
| `PROCESSOR_RUN`        | `processor_run`        | `processorId`                                                 |

Spans with types `GENERIC`, `MODEL_STEP`, `MODEL_CHUNK`, and `WORKFLOW_CONDITIONAL_EVAL` are skipped as noise.

### `compareTrajectories`

Compares an actual trajectory against an expected trajectory and returns a detailed comparison result.
Used internally by `createTrajectoryAccuracyScorerCode`.

The `expected` parameter accepts either a `Trajectory` (actual trajectory) or `{ steps: ExpectedStep[] }`. When using `ExpectedStep[]`, you can match by name only, name + stepType, or include data for comparison. See [Expected steps](https://mastra.ai/reference/evals/trajectory-accuracy) for details.

```typescript
import { compareTrajectories } from '@mastra/evals/scorers/utils'

// Using ExpectedStep[] (recommended for expectations)
// Data fields (e.g. toolArgs) are auto-compared when present on expected steps
const result = compareTrajectories(
  actualTrajectory,
  { steps: [{ name: 'search' }, { name: 'summarize', stepType: 'tool_call' }] },
  { allowRepeatedSteps: true },
)
// result.score: 0.0 to 1.0
// result.missingSteps: step names not found
// result.extraSteps: unexpected step names
// result.outOfOrderSteps: steps found but in wrong order
```

**Returns:** `TrajectoryComparisonResult`

### `createTrajectoryTestRun`

Creates a test run object for trajectory scorers. Wraps a `Trajectory` into the expected `ScorerRun` format.

```typescript
import { createTrajectoryTestRun } from '@mastra/evals/scorers/utils'

const run = createTrajectoryTestRun({
  steps: [
    { stepType: 'tool_call', name: 'search', toolArgs: { q: 'test' } },
    { stepType: 'tool_call', name: 'summarize' },
  ],
})

const result = await trajectoryScorer.run(run)
```

### `checkTrajectoryEfficiency`

Evaluates trajectory efficiency against step, token, and duration budgets. Also detects redundant calls (same tool with same arguments).

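To make "same tool with same arguments" concrete, here is a rough sketch of one way such a check could work. It is an illustration under assumptions, not Mastra's implementation: `ToolStep`, `stableStringify`, and `findRedundantCalls` are hypothetical names, and comparing arguments by a key-sorted serialization is a design choice made for this example.

```typescript
// Hypothetical, simplified step shape for illustration only.
interface ToolStep {
  name: string
  toolArgs?: Record<string, unknown>
}

// Serialize a value with object keys sorted, so that { a: 1, b: 2 } and
// { b: 2, a: 1 } produce the same string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const entries = Object.keys(obj)
      .sort()
      .map(key => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
    return `{${entries.join(',')}}`
  }
  // Primitives: JSON.stringify returns undefined for undefined, so coerce to a string.
  return String(JSON.stringify(value))
}

// Return every step that repeats an earlier (tool name, arguments) pair.
function findRedundantCalls(steps: ToolStep[]): ToolStep[] {
  const seen = new Set<string>()
  const repeated: ToolStep[] = []
  for (const step of steps) {
    const key = `${step.name}:${stableStringify(step.toolArgs ?? {})}`
    if (seen.has(key)) repeated.push(step)
    else seen.add(key)
  }
  return repeated
}

const duplicates = findRedundantCalls([
  { name: 'search', toolArgs: { q: 'test', limit: 5 } },
  { name: 'summarize' },
  { name: 'search', toolArgs: { limit: 5, q: 'test' } }, // same call, different key order
  { name: 'search', toolArgs: { q: 'other' } }, // same tool, different arguments
])
```

Sorting keys before serializing avoids treating two calls as different only because their argument objects were built in a different order.
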
```typescript
import { checkTrajectoryEfficiency } from '@mastra/evals/scorers/utils'

const result = checkTrajectoryEfficiency(trajectory, {
  maxSteps: 5,
  maxTotalTokens: 2000,
  maxTotalDurationMs: 5000,
  noRedundantCalls: true,
})
// result.score: 1.0 if within all budgets, lower with penalties
// result.redundantCalls: duplicate tool+args combos
// result.overStepBudget: true if maxSteps exceeded
// result.overTokenBudget: true if maxTotalTokens exceeded
// result.overDurationBudget: true if maxTotalDurationMs exceeded
```

**Returns:** `TrajectoryEfficiencyResult`

### `checkTrajectoryBlacklist`

Checks whether a trajectory contains forbidden tools or tool call sequences.

```typescript
import { checkTrajectoryBlacklist } from '@mastra/evals/scorers/utils'

const result = checkTrajectoryBlacklist(trajectory, {
  blacklistedTools: ['deleteAll', 'admin-override'],
  blacklistedSequences: [['escalate', 'admin-override']],
})
// result.score: 1.0 if no violations, 0.0 if any found
// result.violatedTools: blacklisted tools that were called
// result.violatedSequences: blacklisted sequences that were detected
```

**Returns:** `TrajectoryBlacklistResult`

### `analyzeToolFailures`

Detects tool failure patterns including retries, fallbacks, and argument corrections.

```typescript
import { analyzeToolFailures } from '@mastra/evals/scorers/utils'

const result = analyzeToolFailures(trajectory, {
  maxRetriesPerTool: 2,
})
// result.score: 1.0 if no failure patterns, lower if patterns detected
// result.patterns: detected patterns (retry, fallback, arg_correction)
```

**Returns:** `ToolFailureAnalysisResult`

## Complete example

Here's a complete example showing how to use multiple utilities together:

```typescript
import { createScorer } from '@mastra/core/evals'
import {
  getAssistantMessageFromRunOutput,
  getReasoningFromRunOutput,
  getUserMessageFromRunInput,
  getCombinedSystemPrompt,
  extractToolCalls,
} from '@mastra/evals/scorers/utils'

const comprehensiveScorer = createScorer({
  id: 'comprehensive-analysis',
  name: 'Comprehensive Analysis',
  description: 'Analyzes all aspects of an agent response',
  type: 'agent',
})
  .preprocess(({ run }) => {
    // Extract all relevant data
    const userMessage = getUserMessageFromRunInput(run.input)
    const response = getAssistantMessageFromRunOutput(run.output)
    const reasoning = getReasoningFromRunOutput(run.output)
    const systemPrompt = getCombinedSystemPrompt(run.input)
    // Only the tool names are needed here; toolCallInfos is not used
    const { tools } = extractToolCalls(run.output)

    return {
      userMessage,
      response,
      reasoning,
      systemPrompt,
      toolsUsed: tools,
      toolCount: tools.length,
    }
  })
  .generateScore(({ results }) => {
    const { response, reasoning, toolCount } = results.preprocessStepResult || {}

    // Weighted sum: response 0.4, reasoning 0.3, tool usage 0.3
    let score = 0
    if (response && response.length > 0) score += 0.4
    if (reasoning) score += 0.3
    if (toolCount > 0) score += 0.3

    return score
  })
  .generateReason(({ results, score }) => {
    const { response, reasoning, toolCount } = results.preprocessStepResult || {}

    const parts: string[] = []
    if (response) parts.push('provided a response')
    if (reasoning) parts.push('included reasoning')
    if (toolCount > 0) parts.push(`used ${toolCount} tool(s)`)

    return `Score: ${score}. The agent ${parts.join(', ')}.`
  })
```
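
The weighting used in `generateScore` above can also be checked in isolation, without running an agent. The sketch below copies that logic into a plain function; `PreprocessResult` and `scoreRun` are hypothetical names introduced for this illustration and are not part of the Mastra API.

```typescript
// Hypothetical shape mirroring the fields returned by the preprocess step above.
interface PreprocessResult {
  response?: string
  reasoning?: string
  toolCount?: number
}

// Same weighting as the example scorer: response 0.4, reasoning 0.3, tool usage 0.3.
function scoreRun(data: PreprocessResult = {}): number {
  let score = 0
  if (data.response && data.response.length > 0) score += 0.4
  if (data.reasoning) score += 0.3
  if ((data.toolCount ?? 0) > 0) score += 0.3
  return score
}

// No data at all: nothing contributes to the score.
const emptyScore = scoreRun()
// A response plus two tool calls, but no reasoning.
const partialScore = scoreRun({ response: 'Hi there!', toolCount: 2 })
// Every signal present.
const fullScore = scoreRun({ response: 'Hi there!', reasoning: 'Step 1...', toolCount: 1 })
```

Because the weights are floating-point values, compare results with a small tolerance rather than strict equality when asserting on them in tests.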