Running scorers in your CI pipeline provides quantifiable metrics for measuring agent quality over time. The `runEvals` function processes multiple test cases through your agent or workflow and returns aggregate scores.
You can use any testing framework that supports ES modules (ESM), such as [Vitest](https://vitest.dev/), [Jest](https://jestjs.io/), or [Mocha](https://mochajs.org/).
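If you choose Vitest, a configuration along these lines may help, since calls to a model often take longer than Vitest's 5-second default test timeout. This is a minimal sketch: the `*.eval.test.ts` naming convention and the 60-second timeout are assumptions for illustration, not Mastra requirements.

```typescript
// vitest.config.ts
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    // Hypothetical naming convention that keeps eval tests apart from unit tests.
    include: ['src/**/*.eval.test.ts'],
    // Assumed value: give each test up to 60 seconds to finish its agent calls.
    testTimeout: 60_000,
  },
})
```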
Use `runEvals` to evaluate your agent against multiple test cases. The function accepts an array of data items, each containing an `input` and optional `groundTruth` for scorer validation.
```typescript
import { describe, it, expect } from 'vitest'
import { createScorer, runEvals } from '@mastra/core/evals'
import { weatherAgent } from './weather-agent'
import { locationScorer } from '../scorers/location-scorer'

describe('Weather Agent Tests', () => {
  it('should correctly extract locations from queries', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' },
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'US' },
        },
        {
          input: 'weather in Berlin, Russia',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'RU' },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    // Assert aggregate score meets threshold
    expect(result.scores['location-accuracy']).toBe(1)
    expect(result.summary.totalItems).toBe(3)
  })
})
```
The `runEvals` function returns an object with:
- `scores`: Average scores for each scorer across all test cases
- `summary.totalItems`: Total number of test cases processed
```typescript
{
  scores: {
    'location-accuracy': 1.0, // Average score across all items
    'another-scorer': 0.85
  },
  summary: {
    totalItems: 3
  }
}
```
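Because each entry in `scores` is an average, a single failing item is enough to pull a scorer below 1. The sketch below illustrates that arithmetic only; `averageScores` is a hypothetical helper written for this example and is not how Mastra computes its aggregates internally.

```typescript
// Illustration only: per-item scores keyed by scorer name.
type ItemScores = Record<string, number>

// Average each scorer's per-item scores into one aggregate value.
function averageScores(items: ItemScores[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {}
  for (const item of items) {
    for (const [scorerName, score] of Object.entries(item)) {
      const entry = totals[scorerName] ?? { sum: 0, count: 0 }
      entry.sum += score
      entry.count += 1
      totals[scorerName] = entry
    }
  }

  const averages: Record<string, number> = {}
  for (const [scorerName, { sum, count }] of Object.entries(totals)) {
    averages[scorerName] = sum / count
  }
  return averages
}

// Three items, one of which fails the scorer.
const aggregate = averageScores([
  { 'location-accuracy': 1 },
  { 'location-accuracy': 1 },
  { 'location-accuracy': 0 },
])

// 2 of 3 items passed, so the average is about 0.67 and `toBe(1)` would fail.
console.log(aggregate['location-accuracy'])
```

An assertion such as `toBe(1)` therefore requires every test case to score 1.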
Create separate test cases for different evaluation scenarios:
```typescript
import { describe, it, expect } from 'vitest'
import { createScorer, runEvals } from '@mastra/core/evals'
import { weatherAgent } from './weather-agent'

describe('Weather Agent Tests', () => {
  // Scorer shared by every test case in this suite
  const locationScorer = createScorer({
    /* ... */
  })

  it('should handle location disambiguation', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: {
            /* ... */
          },
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: {
            /* ... */
          },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    expect(result.scores['location-accuracy']).toBe(1)
  })

  it('should handle typos and misspellings', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berln',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' },
        },
        {
          input: 'weather in Parris',
          groundTruth: { expectedLocation: 'Paris', expectedCountry: 'FR' },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    expect(result.scores['location-accuracy']).toBe(1)
  })
})
```
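When a suite tracks several scorers, it can be convenient to check all of them against minimum values in one place instead of repeating `expect` calls. The following is a minimal sketch under stated assumptions: the `EvalResult` interface mirrors only the `scores` and `summary.totalItems` fields described earlier, and `findFailingScorers` and the 0.9 minimums are hypothetical, not part of `@mastra/core`.

```typescript
// Assumed result shape, based on the `scores` and `summary` fields described earlier.
interface EvalResult {
  scores: Record<string, number>
  summary: { totalItems: number }
}

// Hypothetical helper: returns the names of scorers whose average is below its minimum.
function findFailingScorers(
  result: EvalResult,
  minimums: Record<string, number>,
): string[] {
  return Object.entries(minimums)
    .filter(([name, minimum]) => {
      const score = result.scores[name]
      // A missing score counts as a failure so a misspelled scorer name cannot pass silently.
      return score === undefined || score < minimum
    })
    .map(([name]) => name)
}

// Sample values taken from the result shape shown earlier.
const sample: EvalResult = {
  scores: { 'location-accuracy': 1.0, 'another-scorer': 0.85 },
  summary: { totalItems: 3 },
}

const failing = findFailingScorers(sample, {
  'location-accuracy': 0.9,
  'another-scorer': 0.9,
})
console.log(failing) // [ 'another-scorer' ]
```

In a test, asserting `expect(failing).toEqual([])` would then report every scorer that fell below its minimum in a single failure message.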
- Learn about [creating custom scorers](https://mastra.ai/docs/evals/custom-scorers)
- Explore [built-in scorers](https://mastra.ai/docs/evals/built-in-scorers)
- Run scorers against [memory-enabled agents](https://mastra.ai/docs/evals/evals-with-memory)
- Read the [runEvals API reference](https://mastra.ai/reference/evals/run-evals)