# createScorer
Mastra provides a unified `createScorer` factory that allows you to define custom scorers for evaluating input/output pairs. You can use either native JavaScript functions or LLM-based prompt objects for each evaluation step. Custom scorers can be added to Agents and Workflow steps.
## How to create a custom scorer
Use the `createScorer` factory to define your scorer with a name, description, and optional judge configuration. Then chain step methods to build your evaluation pipeline. You must provide at least a `generateScore` step.
**Prompt object steps** are step configurations expressed as objects with `description` + `createPrompt` (and `outputSchema` for `preprocess`/`analyze`). These steps invoke the judge LLM. **Function steps** are plain functions and never call the judge.
```typescript
import { createScorer } from '@mastra/core/evals'

const scorer = createScorer({
  id: 'my-custom-scorer',
  name: 'My Custom Scorer', // Optional, defaults to id
  description: 'Evaluates responses based on custom criteria',
  type: 'agent', // Optional: for agent evaluation with automatic typing
  judge: {
    model: myModel,
    instructions: 'You are an expert evaluator...',
  },
})
  .preprocess({
    /* step config */
  })
  .analyze({
    /* step config */
  })
  .generateScore(({ run, results }) => {
    // Return a number
  })
  .generateReason({
    /* step config */
  })
```
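To make the difference between the two step forms concrete, the sketch below writes one `generateReason` step in each form. It is standalone TypeScript: `StepContext` and `PromptObjectStep` are simplified local stand-ins defined for illustration, not types exported by `@mastra/core`, and both step bodies are hypothetical.

```typescript
// Simplified local stand-ins for the context a generateReason step receives.
// These are assumptions for illustration, not types exported by @mastra/core.
type StepContext = {
  run: { input: string; output: string }
  results: Record<string, unknown>
  score: number
}

type PromptObjectStep = {
  description: string
  createPrompt: (ctx: StepContext) => string
}

// Function step: plain code that never calls the judge.
const reasonFunctionStep = ({ run, score }: StepContext): string =>
  `Scored ${score} because the output has ${run.output.length} characters.`

// Prompt object step: builds the prompt that would be sent to the judge LLM.
const reasonPromptStep: PromptObjectStep = {
  description: 'Explain the score in one sentence',
  createPrompt: ({ run, score }) =>
    `The response "${run.output}" to "${run.input}" received a score of ${score}. Explain why in one sentence.`,
}
```

Either value has the shape that `.generateReason(...)` accepts according to the descriptions above; only the prompt object form would need a `judge` to be configured.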
## `createScorer` options
**id** (`string`): Unique identifier for the scorer. Used as the name if `name` is not provided.
**name** (`string`): Name of the scorer. Defaults to `id` if not provided.
**description** (`string`): Description of what the scorer does.
**judge** (`object`): Optional judge configuration for LLM-based steps.
**judge.model** (`LanguageModel`): The LLM model instance to use for evaluation.
**judge.instructions** (`string`): System prompt/instructions for the LLM.
**type** (`string`): Type specification for input/output. Use `'agent'` for automatic agent types. For custom types, use the generic approach instead.
**prepareRun** (`(run: ScorerRun) => ScorerRun | Promise<ScorerRun>`): Transform the scorer run data before the pipeline executes. Use this to filter messages, limit context size, or drop fields the scorer doesn't need. The [`filterRun()`](/reference/evals/filter-run) utility creates this function from declarative options. Can be async.
`createScorer` returns a scorer builder that you can chain step methods onto. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.
The judge only runs for steps defined as **prompt objects** (`preprocess`, `analyze`, `generateScore`, `generateReason` in prompt mode). If you use function steps only, the judge is never called and there is no LLM output to inspect. In that case, any score/reason must be produced by your functions.
When a prompt-object step runs, its structured LLM output is stored in the corresponding result field (`preprocessStepResult`, `analyzeStepResult`, or the value consumed by `calculateScore` in `generateScore`).
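As an illustration of the `prepareRun` option described above, here is a minimal sketch of a function that limits context size by keeping only the most recent input messages. `SimpleRun` and `ChatMessage` are simplified local stand-ins (the real `ScorerRun` type has more fields), and `keepLastInputs` is a hypothetical helper, not part of Mastra.

```typescript
// Simplified stand-ins for the run passed to prepareRun. These are
// assumptions for illustration; the real ScorerRun type has more fields.
type ChatMessage = { role: string; content: string }
type SimpleRun = { runId: string; input: ChatMessage[]; output: ChatMessage[] }

// Hypothetical helper: returns a prepareRun-style function that keeps only
// the last `limit` input messages and leaves every other field untouched.
const keepLastInputs =
  (limit: number) =>
  (run: SimpleRun): SimpleRun => ({
    ...run,
    // slice(-0) would return the whole array, so handle limit <= 0 explicitly
    input: limit > 0 ? run.input.slice(-limit) : [],
  })

const prepareRun = keepLastInputs(2)
```

The function returns a new object instead of mutating `run`, so the original run data stays intact for anything else that reads it.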
## Type safety
You can specify input/output types when creating scorers for better type inference and IntelliSense support:
### Agent Type Shortcut
For evaluating agents, use `type: 'agent'` to automatically get the correct types for agent input/output:
```typescript
import { createScorer } from '@mastra/core/evals'

// Agent scorer with automatic typing
const agentScorer = createScorer({
  id: 'agent-response-quality',
  description: 'Evaluates agent responses',
  type: 'agent', // Automatically provides ScorerRunInputForAgent/ScorerRunOutputForAgent
})
  .preprocess(({ run }) => {
    // run.input is automatically typed as ScorerRunInputForAgent
    const userMessage = run.input.inputMessages[0]?.content
    return { userMessage }
  })
  .generateScore(({ run, results }) => {
    // run.output is automatically typed as ScorerRunOutputForAgent
    const response = run.output[0]?.content
    // Guard against a missing response before reading its length
    return (response?.length ?? 0) > 10 ? 1.0 : 0.5
  })
```
### Custom Types with Generics
For custom input/output types, use the generic approach:
```typescript
import { createScorer } from '@mastra/core/evals'

type CustomInput = { query: string; context: string[] }
type CustomOutput = { answer: string; confidence: number }

const customScorer = createScorer<CustomInput, CustomOutput>({
  id: 'custom-scorer',
  description: 'Evaluates custom data',
}).generateScore(({ run }) => {
  // run.input is typed as CustomInput
  // run.output is typed as CustomOutput
  return run.output.confidence
})
```
### Built-in Agent Types
- **`ScorerRunInputForAgent`** - Contains `inputMessages`, `rememberedMessages`, `systemMessages`, and `taggedSystemMessages` for agent evaluation
- **`ScorerRunOutputForAgent`** - Array of agent response messages
Using these types provides autocomplete, compile-time validation, and better documentation for your scoring logic.
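The field names listed above can be used to write small, reusable helpers for scoring logic. The following is a sketch under simplifying assumptions: `AgentRunInput`, `AgentRunOutput`, and `AgentMessage` are local stand-ins that mirror only the documented field names (real messages carry more fields, and `content` may not always be a plain string), and both helpers are hypothetical.

```typescript
// Local stand-ins mirroring the documented field names. The message shape
// is an assumption; real messages have more fields.
type AgentMessage = { role: string; content: string }
type AgentRunInput = {
  inputMessages: AgentMessage[]
  rememberedMessages: AgentMessage[]
  systemMessages: AgentMessage[]
  taggedSystemMessages: Record<string, AgentMessage[]>
}
type AgentRunOutput = AgentMessage[]

// Text of the last user message in the input, or '' when there is none.
function lastUserText(input: AgentRunInput): string {
  const userMessages = input.inputMessages.filter(m => m.role === 'user')
  return userMessages[userMessages.length - 1]?.content ?? ''
}

// All response messages joined into one string, one message per line.
function responseText(output: AgentRunOutput): string {
  return output.map(m => m.content).join('\n')
}
```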
## Trace scoring with agent types
When you use `type: 'agent'`, the same scorer can be added directly to agents and used to score traces from agent interactions. The scorer automatically transforms trace data into the proper agent input/output format:
```typescript
import { createScorer } from '@mastra/core/evals'
import { Mastra } from '@mastra/core/mastra'

const agentTraceScorer = createScorer({
  id: 'agent-trace-length',
  description: 'Evaluates agent response length',
  type: 'agent',
}).generateScore(({ run }) => {
  // Trace data is automatically transformed to agent format
  const userMessages = run.input.inputMessages
  const agentResponse = run.output[0]?.content
  // Penalize long responses: 0 if longer than 50 characters, otherwise 1
  return (agentResponse?.length ?? 0) > 50 ? 0 : 1
})

// Register with Mastra for trace scoring
const mastra = new Mastra({
  scorers: {
    agentTraceScorer,
  },
})
```
## Step method signatures
### `preprocess`
Optional preprocessing step that can extract or transform data before analysis.
**Function Mode:** Function: `({ run, results }) => any`
**run.input** (`any`): Input records provided to the scorer. If the scorer is added to an agent, this will be an array of user messages, e.g. `[{ role: 'user', content: 'hello world' }]`. If the scorer is used in a workflow, this will be the input of the workflow.
**run.output** (`any`): Output record provided to the scorer. For agents, this is usually the agent's response. For workflows, this is the workflow's output.
**run.runId** (`string`): Unique identifier for this scoring run.
**run.requestContext** (`object`): Request Context from the agent or workflow step being evaluated (optional).
**results** (`object`): Empty object (no previous steps).
Returns: `any`\
The method can return any value. The returned value will be available to subsequent steps as `preprocessStepResult`.
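A function-mode `preprocess` step might look like the sketch below, which collects the user text and a deduplicated keyword list for later steps. It is written as a standalone function so it runs without Mastra; `PreprocessContext` is a simplified local stand-in based on the agent input shape described above, and the keyword rule (words longer than three characters) is an arbitrary choice for illustration.

```typescript
// Simplified stand-in for the preprocess step context (an assumption).
type PreprocessContext = {
  run: { input: { role: string; content: string }[]; output: string }
  results: Record<string, never>
}

// Collects the user text and a deduplicated list of lowercase keywords.
const preprocessStep = ({ run }: PreprocessContext) => {
  const userText = run.input
    .filter(m => m.role === 'user')
    .map(m => m.content)
    .join(' ')
  const keywords = userText
    .toLowerCase()
    .split(/\W+/)
    .filter(word => word.length > 3)
    .filter((word, index, all) => all.indexOf(word) === index)
  return { userText, keywords }
}
```

Whatever this function returns is what later steps would read from `results.preprocessStepResult`.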
**Prompt Object Mode:**
**description** (`string`): Description of what this preprocessing step does.
**outputSchema** (`StandardJSONSchemaV1`): Standard JSON Schema for the expected output of the preprocess step.
**createPrompt** (`function`): Function: `({ run, results }) => string`. Returns the prompt for the LLM.
**judge** (`object`): (Optional) LLM judge for this step (can override main judge). See Judge Object section.
### `analyze`
Optional analysis step that processes the input/output and any preprocessed data.
**Function Mode:** Function: `({ run, results }) => any`
**run.input** (`any`): Input records provided to the scorer. If the scorer is added to an agent, this will be an array of user messages, e.g. `[{ role: 'user', content: 'hello world' }]`. If the scorer is used in a workflow, this will be the input of the workflow.
**run.output** (`any`): Output record provided to the scorer. For agents, this is usually the agent's response. For workflows, this is the workflow's output.
**run.runId** (`string`): Unique identifier for this scoring run.
**run.requestContext** (`object`): Request Context from the agent or workflow step being evaluated (optional).
**results.preprocessStepResult** (`any`): Result from preprocess step, if defined (optional).
Returns: `any`\
The method can return any value. The returned value will be available to subsequent steps as `analyzeStepResult`.
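The sketch below shows how a function-mode `analyze` step could read `results.preprocessStepResult`, treating it as optional because the preprocess step may not be defined. `AnalyzeContext` is a simplified local stand-in, and the `keywords` field is a hypothetical value assumed to come from an earlier preprocess step.

```typescript
// Simplified stand-in for the analyze step context (an assumption).
// `keywords` is a hypothetical field produced by an earlier preprocess step.
type AnalyzeContext = {
  run: { input: unknown; output: string }
  results: { preprocessStepResult?: { keywords: string[] } }
}

// Reports which keywords appear in the output, ignoring case.
const analyzeStep = ({ run, results }: AnalyzeContext) => {
  // Fall back to an empty list when no preprocess step ran
  const keywords = results.preprocessStepResult?.keywords ?? []
  const text = run.output.toLowerCase()
  const matched = keywords.filter(keyword => text.indexOf(keyword) !== -1)
  return { matched, total: keywords.length }
}
```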
**Prompt Object Mode:**
**description** (`string`): Description of what this analysis step does.
**outputSchema** (`StandardJSONSchemaV1`): Standard JSON Schema for the expected output of the analyze step.
**createPrompt** (`function`): Function: `({ run, results }) => string`. Returns the prompt for the LLM.
**judge** (`object`): (Optional) LLM judge for this step (can override main judge). See Judge Object section.
### `generateScore`
**Required** step that computes the final numerical score.
**Function Mode:** Function: `({ run, results }) => number`
**run.input** (`any`): Input records provided to the scorer. If the scorer is added to an agent, this will be an array of user messages, e.g. `[{ role: 'user', content: 'hello world' }]`. If the scorer is used in a workflow, this will be the input of the workflow.
**run.output** (`any`): Output record provided to the scorer. For agents, this is usually the agent's response. For workflows, this is the workflow's output.
**run.runId** (`string`): Unique identifier for this scoring run.
**run.requestContext** (`object`): Request Context from the agent or workflow step being evaluated (optional).
**results.preprocessStepResult** (`any`): Result from preprocess step, if defined (optional).
**results.analyzeStepResult** (`any`): Result from analyze step, if defined (optional).
Returns: `number`\
The method must return a numerical score.
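A function-mode `generateScore` step could be as small as the sketch below, which converts an analysis result into a ratio. `ScoreContext` is a simplified local stand-in, and the `matched`/`total` fields are a hypothetical shape for `analyzeStepResult`; your own analyze step determines the real one.

```typescript
// Simplified stand-in for the generateScore step context (an assumption).
// The analyzeStepResult shape here is hypothetical.
type ScoreContext = {
  results: { analyzeStepResult?: { matched: string[]; total: number } }
}

// Ratio of matched items to total items; 0 when there is nothing to score.
const generateScoreStep = ({ results }: ScoreContext): number => {
  const analysis = results.analyzeStepResult
  if (!analysis || analysis.total === 0) return 0
  return analysis.matched.length / analysis.total
}
```

Handling the missing and zero-total cases explicitly keeps the step from returning `NaN`, which would not be a usable score.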
**Prompt Object Mode:**
**description** (`string`): Description of what this scoring step does.
**outputSchema** (`StandardJSONSchemaV1`): Standard JSON Schema for the expected output of the generateScore step.
**createPrompt** (`function`): Function: `({ run, results }) => string`. Returns the prompt for the LLM.
**judge** (`object`): (Optional) LLM judge for this step (can override main judge). See Judge Object section.
When using prompt object mode, you must also provide a `calculateScore` function to convert the LLM output to a numerical score:
**calculateScore** (`function`): Function: `({ run, results, analyzeStepResult }) => number`. Converts the LLM's structured output into a numerical score.
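As a rough sketch of what a `calculateScore` function might look like, the example below follows the signature listed above and turns a list of verdicts into the fraction of positive ones. The `VerdictOutput` shape is hypothetical (the real structure depends on the `outputSchema` you define), and the sketch assumes the judge's structured output arrives as `analyzeStepResult`, as the signature suggests. The types are local stand-ins, not `@mastra/core` exports.

```typescript
// Hypothetical structured output from the judge (the shape is an assumption).
type VerdictOutput = { verdicts: Array<'yes' | 'no' | 'unsure'> }

// Simplified stand-in for the calculateScore arguments (an assumption).
type CalculateScoreContext = {
  run: unknown
  results: Record<string, unknown>
  analyzeStepResult: VerdictOutput
}

// Fraction of 'yes' verdicts, rounded to two decimals; 0 when there are none.
const calculateScore = ({ analyzeStepResult }: CalculateScoreContext): number => {
  const { verdicts } = analyzeStepResult
  if (verdicts.length === 0) return 0
  const positive = verdicts.filter(verdict => verdict === 'yes').length
  return Math.round((positive / verdicts.length) * 100) / 100
}
```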
### `generateReason`
Optional step that provides an explanation for the score.
**Function Mode:** Function: `({ run, results, score }) => string`
**run.input** (`any`): Input records provided to the scorer. If the scorer is added to an agent, this will be an array of user messages, e.g. `[{ role: 'user', content: 'hello world' }]`. If the scorer is used in a workflow, this will be the input of the workflow.
**run.output** (`any`): Output record provided to the scorer. For agents, this is usually the agent's response. For workflows, this is the workflow's output.
**run.runId** (`string`): Unique identifier for this scoring run.
**run.requestContext** (`object`): Request Context from the agent or workflow step being evaluated (optional).
**results.preprocessStepResult** (`any`): Result from preprocess step, if defined (optional).
**results.analyzeStepResult** (`any`): Result from analyze step, if defined (optional).
**score** (`number`): Score computed by the generateScore step.
Returns: `string`\
The method must return a string explaining the score.
**Prompt Object Mode:**
**description** (`string`): Description of what this reasoning step does.
**createPrompt** (`function`): Function: `({ run, results, score }) => string`. Returns the prompt for the LLM.
**judge** (`object`): (Optional) LLM judge for this step (can override main judge). See Judge Object section.
All step functions can be async.
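For example, a function-mode `generateScore` step can await a lookup before returning its number. The sketch below is standalone TypeScript under stated assumptions: `AsyncScoreContext` is a simplified local stand-in, and `fetchReferenceAnswer` is a hypothetical helper backed by an in-memory table that stands in for a database or HTTP call.

```typescript
// Simplified stand-in for the step context (an assumption).
type AsyncScoreContext = { run: { input: string; output: string } }

// Hypothetical async lookup standing in for a database or HTTP call.
const referenceAnswers: Record<string, string> = {
  'capital of france?': 'paris',
}

async function fetchReferenceAnswer(question: string): Promise<string | undefined> {
  return referenceAnswers[question.trim().toLowerCase()]
}

// Async step: resolves to 1 when the output contains the reference answer.
const asyncGenerateScore = async ({ run }: AsyncScoreContext): Promise<number> => {
  const expected = await fetchReferenceAnswer(run.input)
  if (expected === undefined) return 0
  return run.output.toLowerCase().indexOf(expected) !== -1 ? 1 : 0
}
```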