@mastra/core

# Context precision scorer

The `createContextPrecisionScorer()` function creates a scorer that evaluates how relevant and well-positioned retrieved context pieces are for generating expected outputs. It uses **Mean Average Precision (MAP)** to reward systems that place relevant context earlier in the sequence.

It's especially useful for these use cases:

## RAG system evaluation

Ideal for evaluating retrieved context in RAG pipelines where:

- Context ordering matters for model performance
- You need to measure retrieval quality beyond basic relevance
- Early relevant context is more valuable than later relevant context

## Context window optimization

Use when optimizing context selection for:

- Limited context windows
- Token budget constraints
- Multi-step reasoning tasks

## Parameters

**model** (`MastraModelConfig`): The language model to use for evaluating context relevance

**options** (`ContextPrecisionMetricOptions`): Configuration options for the scorer

**Note**: Either `context` or `contextExtractor` must be provided. If both are provided, `contextExtractor` takes precedence.

## `.run()` returns

**score** (`number`): Mean Average Precision score between 0 and scale (default 0-1)

**reason** (`string`): Human-readable explanation of the context precision evaluation

## Scoring details

### Mean Average Precision (MAP)

Context Precision uses **Mean Average Precision** to evaluate both relevance and positioning:

1. **Context Evaluation**: Each context piece is classified as relevant or irrelevant for generating the expected output
2. **Precision Calculation**: For each relevant context at 0-indexed position `i`, precision = `relevant_items_so_far / (i + 1)`
3. **Average Precision**: Sum all precision values and divide by total relevant items
4. **Final Score**: Multiply by scale factor and round to 2 decimals

### Scoring formula

```text
MAP = (Σ Precision@k) / R

Where:
- Precision@k = (relevant items in positions 1...k) / k
- R = total number of relevant items
- Only calculated at positions where relevant items appear
```

### Score interpretation

- **0.9-1.0**: Excellent precision - all relevant context early in sequence
- **0.7-0.8**: Good precision - most relevant context well-positioned
- **0.4-0.6**: Moderate precision - relevant context mixed with irrelevant
- **0.1-0.3**: Poor precision - little relevant context or poorly positioned
- **0.0**: No relevant context found

### Reason analysis

The reason field explains:

- Which context pieces were deemed relevant/irrelevant
- How positioning affected the MAP calculation
- Specific relevance criteria used in evaluation

### Optimization insights

Use results to:

- **Improve retrieval**: Filter out irrelevant context before ranking
- **Optimize ranking**: Ensure relevant context surfaces early
- **Tune chunk size**: Balance context detail vs. relevance precision
- **Evaluate embeddings**: Test different embedding models for better retrieval

### Example calculation

Given context: `[relevant, irrelevant, relevant, irrelevant]`

- Position 0: Relevant → Precision = 1/1 = 1.0
- Position 1: Skip (irrelevant)
- Position 2: Relevant → Precision = 2/3 ≈ 0.67
- Position 3: Skip (irrelevant)

MAP = (1.0 + 0.67) / 2 ≈ **0.83**

## Scorer configuration

### Dynamic context extraction

```typescript
import { createContextPrecisionScorer } from '@mastra/evals/scorers/prebuilt'

const scorer = createContextPrecisionScorer({
  model: 'openai/gpt-5.4',
  options: {
    contextExtractor: (input, output) => {
      // Extract context dynamically based on the query
      const query = input?.inputMessages?.[0]?.content || ''

      // Example: Retrieve from a vector database
      // (`vectorDB` is a placeholder for your own vector store client)
      const searchResults = vectorDB.search(query, { limit: 10 })
      return searchResults.map(result => result.content)
    },
    scale: 1,
  },
})
```

### Large context evaluation

```typescript
import { createContextPrecisionScorer } from '@mastra/evals/scorers/prebuilt'

const scorer = createContextPrecisionScorer({
  model: 'openai/gpt-5.4',
  options: {
    context: [
      // Simulate retrieved documents from vector database
      'Document 1: Highly relevant content...',
      'Document 2: Somewhat related content...',
      'Document 3: Tangentially related...',
      'Document 4: Not relevant...',
      'Document 5: Highly relevant content...',
      // ... up to dozens of context pieces
    ],
  },
})
```

## Example

Evaluate RAG system context retrieval precision for different queries:

```typescript
import { runEvals } from '@mastra/core/evals'
import { createContextPrecisionScorer } from '@mastra/evals/scorers/prebuilt'
import { myAgent } from './agent'

const scorer = createContextPrecisionScorer({
  model: 'openai/gpt-5.4',
  options: {
    contextExtractor: (input, output) => {
      // Extract context from agent's retrieved documents
      return output.metadata?.retrievedContext || []
    },
  },
})

const result = await runEvals({
  data: [
    {
      input: 'How does photosynthesis work in plants?',
    },
    {
      input: 'What are the mental and physical benefits of exercise?',
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
      reason: scorerResults[scorer.id].reason,
    })
  },
})

console.log(result.scores)
```

For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).

To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.

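
To help interpret the scores these evaluations return, the MAP calculation described under "Scoring details" can be sketched in a few lines. This is an illustrative sketch only, not the scorer's internal implementation: the function name `meanAveragePrecision` and the `verdicts` array (one boolean per context piece, in retrieval order) are assumptions introduced here. In the real scorer, the relevant/irrelevant verdicts come from the configured model.

```typescript
// Illustrative sketch: not the library's internal implementation.
// `verdicts[i]` is true when the context piece at 0-indexed position i
// was judged relevant. `meanAveragePrecision` is a hypothetical helper name.
function meanAveragePrecision(verdicts: boolean[], scale = 1): number {
  let relevantSoFar = 0
  let precisionSum = 0

  verdicts.forEach((isRelevant, i) => {
    if (!isRelevant) return // precision is only computed at relevant positions
    relevantSoFar += 1
    precisionSum += relevantSoFar / (i + 1)
  })

  // No relevant context found
  if (relevantSoFar === 0) return 0

  // Average over relevant items, apply the scale, round to 2 decimals
  const map = precisionSum / relevantSoFar
  return Math.round(map * scale * 100) / 100
}

const mixed = meanAveragePrecision([true, false, true, false]) // 0.83
const frontLoaded = meanAveragePrecision([true, true, false, false]) // 1
const backLoaded = meanAveragePrecision([false, false, true, true]) // 0.42

console.log({ mixed, frontLoaded, backLoaded })
```

The three calls contain the same number of relevant pieces in the last two cases, yet the score drops from 1 to 0.42 when the relevant pieces move to the end, which is the position sensitivity the scorer is designed to reward.
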
## Comparison with context relevance

Choose the right scorer for your needs:

| Use Case                 | Context Relevance    | Context Precision         |
| ------------------------ | -------------------- | ------------------------- |
| **RAG evaluation**       | When usage matters   | When ranking matters      |
| **Context quality**      | Nuanced levels       | Binary relevance          |
| **Missing detection**    | ✓ Identifies gaps    | ✗ Not evaluated           |
| **Usage tracking**       | ✓ Tracks utilization | ✗ Not considered          |
| **Position sensitivity** | ✗ Position agnostic  | ✓ Rewards early placement |

## Related

- [Answer Relevancy Scorer](https://mastra.ai/reference/evals/answer-relevancy): Evaluates if answers address the question
- [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness): Measures answer groundedness in context
- [Custom Scorers](https://mastra.ai/docs/evals/custom-scorers): Creating your own evaluation metrics