@mastra/core

Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.

# Running scorers in CI

Running scorers in your CI pipeline provides quantifiable metrics for measuring agent quality over time. The `runEvals` function processes multiple test cases through your agent or workflow and returns aggregate scores.

## Basic setup

You can use any testing framework that supports ES modules, such as [Vitest](https://vitest.dev/), [Jest](https://jestjs.io/), or [Mocha](https://mochajs.org/).

## Creating test cases

Use `runEvals` to evaluate your agent against multiple test cases. The function accepts an array of data items, each containing an `input` and an optional `groundTruth` for scorer validation.

```typescript
import { describe, it, expect } from 'vitest'
import { createScorer, runEvals } from '@mastra/core/evals'
import { weatherAgent } from './weather-agent'
import { locationScorer } from '../scorers/location-scorer'

describe('Weather Agent Tests', () => {
  it('should correctly extract locations from queries', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' },
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'US' },
        },
        {
          input: 'weather in Berlin, Russia',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'RU' },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    // Assert aggregate score meets threshold
    expect(result.scores['location-accuracy']).toBe(1)
    expect(result.summary.totalItems).toBe(3)
  })
})
```

## Understanding results

The `runEvals` function returns an object with:

- `scores`: Average scores for each scorer across all test cases
- `summary.totalItems`: Total number of test cases processed

```typescript
{
  scores: {
    'location-accuracy': 1.0, // Average score across all items
    'another-scorer': 0.85
  },
  summary: {
    totalItems: 3
  }
}
```

## Multiple test scenarios

Create separate test cases for different evaluation scenarios:

```typescript
describe('Weather Agent Tests', () => {
  const locationScorer = createScorer({ /* ... */ })

  it('should handle location disambiguation', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: { /* ... */ },
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: { /* ... */ },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    expect(result.scores['location-accuracy']).toBe(1)
  })

  it('should handle typos and misspellings', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berln',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' },
        },
        {
          input: 'weather in Parris',
          groundTruth: { expectedLocation: 'Paris', expectedCountry: 'FR' },
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer],
    })

    expect(result.scores['location-accuracy']).toBe(1)
  })
})
```

## Next steps

- Learn about [creating custom scorers](https://mastra.ai/docs/evals/custom-scorers)
- Explore [built-in scorers](https://mastra.ai/docs/evals/built-in-scorers)
- Run scorers against [memory-enabled agents](https://mastra.ai/docs/evals/evals-with-memory)
- Read the [runEvals API reference](https://mastra.ai/reference/evals/run-evals)
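
The examples in this guide assert a perfect score of `1`, which can be too strict for scorers that return fractional averages. As a minimal sketch, the aggregate `scores` object can instead be checked against per-scorer minimums. The `EvalResult` interface below only mirrors the two fields documented under "Understanding results", and `findFailingScorers` is a hypothetical helper written for this illustration, not part of `@mastra/core`. The sample values are copied from the result shape shown earlier.

```typescript
// Assumed shape: only the fields this guide documents for the runEvals return value.
interface EvalResult {
  scores: Record<string, number>
  summary: { totalItems: number }
}

// Hypothetical helper (not a Mastra API): returns one message per scorer
// whose average score is missing or falls below its required minimum.
function findFailingScorers(
  result: EvalResult,
  minimums: Record<string, number>,
): string[] {
  const failures: string[] = []
  for (const scorerId of Object.keys(minimums)) {
    const minimum = minimums[scorerId]
    const score = result.scores[scorerId]
    if (score === undefined) {
      failures.push(`${scorerId}: no score reported`)
    } else if (score < minimum) {
      failures.push(`${scorerId}: ${score} is below minimum ${minimum}`)
    }
  }
  return failures
}

// Sample result using the values from the "Understanding results" example.
const sampleResult: EvalResult = {
  scores: { 'location-accuracy': 1.0, 'another-scorer': 0.85 },
  summary: { totalItems: 3 },
}

// 'another-scorer' averages 0.85, so it fails a 0.9 minimum.
const failures = findFailingScorers(sampleResult, {
  'location-accuracy': 1,
  'another-scorer': 0.9,
})
console.log(failures)
```

In a test, an assertion such as `expect(failures).toEqual([])` would then fail the CI run and list every scorer that regressed, rather than stopping at the first failing `expect`.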