@mastra/core
Version:
Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.
150 lines (104 loc) • 7.28 kB
Markdown
# Response Caching
> **Experimental:** This feature is in alpha. Breaking changes may occur without a major version bump until the API is stable.
Response caching skips the LLM call and replays a previously cached response when an agent receives an identical request. Use it to reduce latency and avoid paying for repeated calls.
Caching is implemented as the [`ResponseCache`](https://mastra.ai/reference/processors/response-cache) input processor. Mastra doesn't provide an agent-level option. To enable caching, register the processor explicitly. This keeps the API surface small while Mastra collects feedback; per-call overrides flow through `RequestContext`.
## When to use response caching
Reach for it when the same request shape repeats across users or sessions, for example prompt templates, suggested-prompt buttons, agentic search re-asks, or guardrail LLMs that classify the same input over and over. Skip it when calls trigger external side effects through tools, since cache hits replay tool calls without re-executing them.
## Quickstart
Add a `ResponseCache` to the agent's `inputProcessors` and pass any `MastraServerCache` as the backend. For development, `InMemoryServerCache` works out of the box:
```typescript
import { Agent } from '/core/agent'
import { InMemoryServerCache } from '/core/cache'
import { ResponseCache } from '/core/processors'
const cache = new InMemoryServerCache()
export const searchAgent = new Agent({
name: 'Search Agent',
instructions: 'You answer questions concisely.',
model: 'openai/gpt-5',
inputProcessors: [new ResponseCache({ cache, ttl: 600 })], // 10 minutes
})
```
The first call runs the LLM normally and writes the response to the cache. Subsequent calls with an identical resolved prompt return the cached response without invoking the LLM.
## Per-call overrides via RequestContext
Per-call config flows through `RequestContext`. Use `ResponseCache.context()` to build a fresh context, or `ResponseCache.applyContext()` to merge into one you already have:
```typescript
import { ResponseCache } from '/core/processors'
import { RequestContext } from '/core/request-context'
// Fresh context with the override
await agent.stream('hello', {
requestContext: ResponseCache.context({ key: 'custom-key', bust: true }),
})
// Or merge into an existing context
const ctx = new RequestContext()
ctx.set('caller-meta', { userId: 'u-123' })
ResponseCache.applyContext(ctx, { bust: true })
await agent.stream('hello', { requestContext: ctx })
```
Three fields are overridable per call:
- `key`: String or function. Overrides the auto-derived cache key for this request only.
- `scope`: String or `null`. Overrides the tenant/user scope for this request only. `null` opts out of scoping.
- `bust`: Boolean. Skips the cache read but still writes on completion (useful for "force refresh" buttons).
`cache`, `ttl`, and `agentId` stay on the constructor. They're instance-level concerns and not safe to vary per call.
## Tenant scoping
By default, `ResponseCache` looks up `MASTRA_RESOURCE_ID_KEY` on the request context and uses it as the cache scope. This means an agent that already populates the resource id (e.g. via memory) gets per-user isolation automatically. Two users never see each other's cached responses.
Override explicitly when you need a different scope:
```typescript
new Agent({
// ...
inputProcessors: [
new ResponseCache({
cache,
scope: 'org-123', // explicit tenant scope
}),
],
})
```
Pass `scope: null` to deliberately share entries across all callers. Only use this for known-public, non-personalized content.
## Custom cache backend
`ResponseCache` accepts any `MastraServerCache`. For production, use `RedisCache` from `/redis`:
```typescript
import { Agent } from '/core/agent'
import { ResponseCache } from '/core/processors'
import { RedisCache } from '/redis'
const cache = new RedisCache({ url: process.env.REDIS_URL })
export const agent = new Agent({
name: 'Cached Agent',
instructions: '...',
model: 'openai/gpt-5',
inputProcessors: [new ResponseCache({ cache })],
})
```
For a custom backend, extend `MastraServerCache` and implement its abstract methods (the processor only calls `get` and `set`).
## How caching is implemented
`ResponseCache` hooks into `processLLMRequest` (cache lookup, short-circuits on hit) and `processLLMResponse` (cache write on completion). Both run inside the agentic loop _after_ memory has loaded and earlier input processors have transformed the prompt.
This means the cache key is derived from the resolved `LanguageModelV2Prompt` Mastra is about to send to the model. The key is created _after_ memory has loaded and earlier input processors have run, and each step in an agentic tool loop is independently cached.
## What's in the cache key
When you don't supply `key`, the processor derives one deterministically from the inputs that change the LLM's response at this step: `agentId`, `stepNumber` (so each step in a tool loop has its own cache entry), `scope`, model identity (`provider`, `modelId`, spec version), and the resolved `prompt` (post-memory + post-processors). Any change to these inputs automatically invalidates the cache.
### Customize the cache key
Pass `key` as a function on the constructor or per-call to derive your own cache key from any subset of those inputs. The function receives the same inputs the deterministic hash would have consumed and returns a string (or a `Promise<string>`):
```typescript
import { ResponseCache, buildResponseCacheKey } from '/core/processors'
await agent.stream(input, {
requestContext: ResponseCache.context({
// Cache only on the model id and the resolved prompt tail — ignore
// step number, scope, etc.
key: ({ model, prompt }) => `qa:${model.modelId}:${JSON.stringify(prompt).slice(-200)}`,
}),
})
// Or reuse the deterministic helper while overriding individual fields:
await agent.stream(input, {
requestContext: ResponseCache.context({
key: inputs => buildResponseCacheKey({ ...inputs, scope: 'global' }),
}),
})
```
If the function throws, the processor falls back to the default key derivation so the call still benefits from caching.
## How cache hits work
When the processor finds a cache hit, it short-circuits the LLM call by returning the cached chunks from `processLLMRequest`. The agentic loop synthesizes a stream from those chunks instead of calling the model. `agent.generate()` collects them into a `FullOutput`; `agent.stream()` returns a `MastraModelOutput` whose chunks come from the cached buffer, so consumers iterating `fullStream` or awaiting `text`, `usage`, and `finishReason` see the cached values.
Cache writes happen after the response completes. Failed runs (errors, tripwire activations) aren't cached, so the next call retries cleanly.
## Related
- [`ResponseCache` reference](https://mastra.ai/reference/processors/response-cache)
- [Processors](https://mastra.ai/docs/agents/processors)
- [Guardrails](https://mastra.ai/docs/agents/guardrails)
- [Agent.stream()](https://mastra.ai/reference/streaming/agents/stream)
- [Agent.generate()](https://mastra.ai/reference/agents/generate)