@mastra/core
Version:
Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
146 lines (111 loc) • 6.21 kB
Markdown
# Evals with memory
Agents that use memory in `thread` scope — including observational memory — require a thread ID at run time. When an eval invokes the agent without one, you'll see:
```text
ObservationalMemory (scope: 'thread') requires a threadId, but none was found in RequestContext or MessageList.
```
This page covers the three working patterns for running Mastra evals against memory-enabled agents, what each path supports, and which one to pick. A complete runnable repro for all three approaches lives in [`examples/evals-with-memory`](https://github.com/mastra-ai/mastra/tree/main/examples/evals-with-memory).
## When to use which approach
| Goal | Approach |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
| One shared conversation across every item | [`runEvals` with global `targetOptions.memory`](#shared-thread-with-runevals) |
| One independent thread per item, simple CI loop | [`runEvals` per item](#per-item-threads-with-runevals) |
| Per-item threads driven by a stored `Dataset` | [`dataset.startExperiment` with an inline task](#dataset-experiments-with-an-inline-task) |
Pre-seeding `RequestContext` with `MastraMemory` is **not** a supported way to drive memory into an agent. Thread resolution reads `args.memory.thread` — `RequestContext.MastraMemory` is populated by `prepare-memory-step` after the agent has already resolved its thread.
## Shared thread with `runEvals`
`runEvals` accepts `targetOptions`, which is forwarded to `agent.generate()`. Passing `memory: { thread, resource }` runs every data item against the same thread — useful for testing recall across a multi-turn conversation.
```typescript
import { runEvals } from '@mastra/core/evals'
import { supportAgent } from './support-agent'
import { recallScorer } from '../scorers/recall-scorer'
const memory = await supportAgent.getMemory()
await memory!.createThread({ threadId: 'eval-thread', resourceId: 'ci-user' })
const result = await runEvals({
target: supportAgent,
scorers: [recallScorer],
targetOptions: {
memory: { thread: 'eval-thread', resource: 'ci-user' },
},
data: [
{ input: 'My order number is 12345' },
{ input: 'What is my order number?', groundTruth: '12345' },
],
})
```
`targetOptions` is **global per call**. There is no per-item override on `RunEvalsDataItem` today.
## Per-item threads with `runEvals`
When each data item needs its own thread (the common CI shape), call `runEvals` once per item with a unique `targetOptions.memory` and aggregate the scores yourself.
```typescript
import { randomUUID } from 'node:crypto'
import { runEvals } from '@mastra/core/evals'
import { supportAgent } from './support-agent'
import { recallScorer } from '../scorers/recall-scorer'
const memory = await supportAgent.getMemory()
const resourceId = 'ci-user'
const items = [
{ input: 'Cats are mammals', groundTruth: 'mammals' },
{ input: 'Dogs are mammals too', groundTruth: 'mammals' },
]
// `runEvals` returns `{ scores: Record<string, number>; summary: { totalItems } }`.
const scores: number[] = []
for (const item of items) {
const threadId = `eval-${randomUUID()}`
await memory!.createThread({ threadId, resourceId, title: item.input })
const result = await runEvals({
target: supportAgent,
scorers: [recallScorer],
targetOptions: { memory: { thread: threadId, resource: resourceId } },
data: [item],
})
scores.push(result.scores[recallScorer.id])
}
const average = scores.reduce((a, b) => a + b, 0) / scores.length
```
> **Note:** Create the thread before running the eval. Observational memory in `thread` scope reads from a record that must already exist.
## Dataset experiments with an inline task
`dataset.startExperiment({ target: agent })` does **not** forward a `memory` option to the agent — only `requestContext`. To run a stored dataset against a memory-enabled agent, use an inline `task` function and stash `{ threadId, resourceId }` in each item's `metadata`. The scorer pipeline still runs as normal.
```typescript
import { randomUUID } from 'node:crypto'
import { mastra } from '../index'
import { supportAgent } from '../agents/support-agent'
import { recallScorer } from '../scorers/recall-scorer'
const memory = await supportAgent.getMemory()
const resourceId = 'ci-user'
const items = [
{ input: 'Cats are mammals', groundTruth: 'mammals', thread: `ds-${randomUUID()}` },
{ input: 'Dogs are mammals too', groundTruth: 'mammals', thread: `ds-${randomUUID()}` },
]
for (const it of items) {
await memory!.createThread({ threadId: it.thread, resourceId, title: it.input })
}
const dataset = await mastra.datasets.create({
name: 'support-recall',
description: 'Per-item memory via inline task + item metadata',
})
await dataset.addItems({
items: items.map(it => ({
input: it.input,
groundTruth: it.groundTruth,
metadata: { threadId: it.thread, resourceId },
})),
})
const summary = await dataset.startExperiment({
scorers: [recallScorer],
task: async ({ input, metadata }) => {
const { threadId, resourceId: rid } = (metadata ?? {}) as {
threadId: string
resourceId: string
}
const result = await supportAgent.generate(input as string, {
memory: { thread: threadId, resource: rid },
})
return result.text
},
})
```
The inline `task` receives the item's `metadata`, so each row can drive its own thread without changing the agent or any scorer.
> **Note:** Visit [runEvals reference](https://mastra.ai/reference/evals/run-evals) and [Dataset reference](https://mastra.ai/reference/datasets/dataset) for full configuration.
## Related
- [Running scorers in CI](https://mastra.ai/docs/evals/running-in-ci)
- [Running experiments](https://mastra.ai/docs/evals/datasets/running-experiments)
- [Observational memory](https://mastra.ai/docs/memory/observational-memory)
- [runEvals API reference](https://mastra.ai/reference/evals/run-evals)