<div align="center"> <h1>evalz</h1> </div> <br /> <p align="center"><i>> Structured evaluation toolkit for LLM outputs</i></p> <br /> <div align="center"> <a aria-label="NPM version" href="https://www.npmjs.com/package/evalz"> <img alt="evalz" src="https://img.shields.io/npm/v/evalz.svg?style=flat-square&logo=npm&labelColor=000000&label=evalz"> </a> <a aria-label="Island AI" href="https://github.com/hack-dance/island-ai"> <img alt="Island AI" src="https://img.shields.io/badge/Part of Island AI-000000.svg?style=flat-square&labelColor=000000&logo=data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4KPHN2ZyBpZD0iTGF5ZXJfMiIgZGF0YS1uYW1lPSJMYXllciAyIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyMTQuNjkgMjU5LjI0Ij4KICA8ZGVmcz4KICAgIDxzdHlsZT4KICAgICAgLmNscy0xIHsKICAgICAgICBmaWxsOiAjZmZmOwogICAgICAgIHN0cm9rZS13aWR0aDogMHB4OwogICAgICB9CiAgICA8L3N0eWxlPgogIDwvZGVmcz4KICA8ZyBpZD0iTGF5ZXJfMS0yIiBkYXRhLW5hbWU9IkxheWVyIDEiPgogICAgPGc+CiAgICAgIDxnPgogICAgICAgIDxwYXRoIGNsYXNzPSJjbHMtMSIgZD0ibTEwMC42MSwxNzguNDVoMTMuOTd2LTE5LjYyaC0xMy45N3YxOS42MlptMC0xMDguOTZ2MjMuNzJoMTMuOTd2LTIzLjcyaC0xMy45N1ptLTIuNzksMTg5Ljc1aDE5LjU2bC0yLjc5LTI4LjkyaC0xMy45N2wtMi43OSwyOC45MlptMi43OS0xMzcuNjJoMTMuOTd2LTE5LjYyaC0xMy45N3YxOS42MlptMCwyOC40MWgxMy45N3YtMTkuNjJoLTEzLjk3djE5LjYyWiIvPgogICAgICAgIDxjaXJjbGUgY2xhc3M9ImNscy0xIiBjeD0iOTQuNSIgY3k9IjY5LjExIiByPSIxNC4yNCIvPgogICAgICAgIDxjaXJjbGUgY2xhc3M9ImNscy0xIiBjeD0iMTIwLjE5IiBjeT0iNjkuMTEiIHI9IjE0LjI0Ii8+CiAgICAgICAgPHBhdGggY2xhc3M9ImNscy0xIiBkPSJtMjE0LjI1LDYyLjU5Yy0uNzktLjc1LTE4Ljc1LTE3LjQ4LTQ5LjQ2LTE5LjA0bDE1Ljc1LTUuODhjLTEuNjctMi40Ni00LjAxLTQuMTgtNi4zNS02LS4yMy0uMTgtLjAzLS41OC4yMy0uNTcsMy40NS4xNyw2LjgyLDEuNzUsMTAuMTIsMi42OCwxLjA2LjMsMi4wOS43MiwzLjA4LDEuMjRsMTkuNDUtNy4yNmMuNTMtLjIuOS0uNzEuOTEtMS4yOHMtLjMyLTEuMDktLjg1LTEuMzJjLTEuMDQtLjQ0LTI1Ljk2LTEwLjc2LTU3LjM1Ljk2LTEuMTkuNDQtMi4zNy45MS0zLjU0LDEuNDFsMTMuNTEtMTMuMTNjLTIuMTgtLjY3LTQuNC0uOTUtNi42My0xLjQ0LS4zOC0uMDgtLjQxLS43NSwwLS44MSwzLjEyLS40NCw2LjU0LS45OCw5Ljg3LS45MWw5LjEzLTguODdjLjQxLS40LjUzLTEuMDEuMzItMS41My0uMjItLjUzLS44LS43OS0xLjMxLS44Ny0uOTYuMDEtMjMuNy40OS00My45NiwyMC4xOCwwLDAsMCwwLDAsMGwtMjAuMDcsMTkuNzYtMTkuNTgtMTkuNzZDNjcuMjUuNDksNDQuNTEuMDEsNDMuNTUsMGMtLjU2LjA1LTEuMDkuMzQtMS4zMS44Ny0uMjIuNTMtLjA5LDEuMTQuMzIsMS41M2w1LjY3LDUuNTFjNS4xLjIyLDEwLjE0LjcxLDE0LjQzLDQsLjQyLjMyLjIsMS4xMi0uMzkuOTMtMi41OC0uODYtNi4wMi0uODctOS4zOS0uNGwxNS41NiwxNS4xMmMtMS4xNy0uNS0yLjM2LS45Ny0zLjU0LTEuNDEtMzEuNC0xMS43Mi01Ni4zLTEuNDEtNTcuMzUtLjk2LS41Mi4yMi0uODYuNzUtLjg1LDEuMzJzLjM3LDEuMDguOTEsMS4yOGwxMS4wNiw0LjEzYzQuNDYtMS40OCw4LjctMi4zOSwxMC40Mi0yLjU1LjU3LS4wNS41Ni43My4xMi45MS0xLjg2Ljc0LTMuNjEsMi4yOS01LjI3LDMuNjFsMjUuOTQsOS42OEMxOS4xOCw0NS4xMSwxLjIyLDYxLjg0LjQzLDYyLjU5Yy0uNDEuMzktLjU1LDEtLjM0LDEuNTMuMjEuNTMuNzMuODgsMS4zLjg4aDEzLjljLjE1LS4wOS4zMS0uMTkuNDUtLjI4LDUuNzktMy41OCwxMS45NC02LjE5LDE4LjE4LTguODcuNjgtLjI5LDEuMjguNjQuNiwxLjAzLTMuNTQsMi4wMy02LjU0LDUuMS05LjQ5LDguMTNoMTQuNTljNC4yNy0zLjExLDguODItNS43LDEzLjE2LTguNy41OS0uNDEsMS4yMi40OS43NS45Ny0yLjM1LDIuMzgtNC40NCw1LjA2LTYuNTMsNy43NGgxMTYuODNjLS45OS0zLjE5LTIuMDItNi4zNS00LjEzLTkuMDQtLjMzLS40Mi4xOC0uOTYuNTktLjU5LDMuMzYsMy4wMSw3LjM3LDYuMTUsMTEuMDIsOS42M2gxNS4zNGMtMS4zOC0zLjUyLTMuMDUtNi44Mi01LjcxLTguNjctLjU0LS4zNy0uMDgtMS4xNS41MS0uODcsNC40LDIuMDgsOC4yNyw1Ljg2LDExLjY1LDkuNTRoMjAuMmMuNTcsMCwxLjA5LS4zNSwxLjMtLjg4LjIxLS41My4wOC0xLjE0LS4zNC0xLjUzWiIvPgogICAgICA8L2c+CiAgICAgIDxwYXRoIGNsYXNzPSJjbHMtMSIgZD0ibTEwMS4wNiwyMjEuMzNoMTMuOTd2LTMzLjZoLTEzLjk3djMzLjZaIi8+CiAgICA8L2c+CiAgPC9nPgo8L3N2Zz4="> </a> <a aria-label="Made by hack.dance" href="https://hack.dance"> <img alt="docs" 
src="https://img.shields.io/badge/MADE%20BY%20HACK.DANCE-000000.svg?style=flat-square&labelColor=000000"> </a> <a aria-label="Twitter" href="https://twitter.com/dimitrikennedy"> <img alt="follow" src="https://img.shields.io/twitter/follow/dimitrikennedy?style=social&labelColor=000000"> </a> </div> ## Overview `evalz` provides structured evaluation tools for assessing LLM outputs across multiple dimensions. Built with TypeScript and integrated with OpenAI and Instructor, it enables both automated evaluation and human-in-the-loop assessment workflows. ### Key Capabilities - 🎯 **Model-Graded Evaluation**: Leverage LLMs to assess response quality - 📊 **Accuracy Measurement**: Compare outputs using semantic and lexical similarity - 🔍 **Context Validation**: Evaluate responses against source materials - ⚖️ **Composite Assessment**: Combine multiple evaluation types with custom weights ## Installation Install `evalz` using your preferred package manager: ```bash npm install evalz openai zod @instructor-ai/instructor bun add evalz openai zod @instructor-ai/instructor pnpm add evalz openai zod @instructor-ai/instructor ``` ## When to Use evalz ### Model-Graded Evaluation Provides human-like judgment for subjective criteria that can't be measured through pure text comparison Use when you need qualitative assessment of responses: - Evaluating RAG system output quality - Assessing chatbot response appropriateness - Validating content generation - Measuring response coherence and fluency ```typescript const relevanceEval = createEvaluator({ client: oai, model: "gpt-4-turbo", evaluationDescription: "Rate relevance and quality from 0-1" }); ``` ### Accuracy Evaluation Gives objective measurements for cases where exact or semantic matching is important Use for comparing outputs against known correct answers: - Question-answering system validation - Translation accuracy measurement - Fact-checking systems - Test case validation ```typescript const accuracyEval = createAccuracyEvaluator({ weights: { factual: 0.6, // Levenshtein distance weight semantic: 0.4 // Embedding similarity weight } }); ``` ### Context Evaluation Measures how well outputs utilize and stay faithful to provided context Use for assessing responses against source materials: - RAG system faithfulness - Document summarization accuracy - Knowledge extraction validation - Information retrieval quality ```typescript const contextEval = createContextEvaluator({ type: "precision" // or "recall", "relevance", "entities-recall" }); ``` ### Composite Evaluation Provides balanced assessment across multiple dimensions of quality Use for comprehensive system assessment: - Production LLM monitoring - A/B testing prompts and models - Quality assurance pipelines - Multi-factor response validation ```typescript const compositeEval = createWeightedEvaluator({ evaluators: { relevance: relevanceEval(), accuracy: accuracyEval(), context: contextEval() }, weights: { relevance: 0.4, accuracy: 0.4, context: 0.2 } }); ``` ## Evaluator Types and Data Requirements ### Context Evaluator Types ```typescript type ContextEvaluatorType = "entities-recall" | "precision" | "recall" | "relevance"; ``` - **entities-recall**: Measures how well the completion captures named entities from the context - **precision**: Evaluates how accurate the completion is compared to the context - **recall**: Measures how much relevant information from the context is included - **relevance**: Assesses how well the completion relates to the context ### Data Requirements by Evaluator Type 
## Evaluator Types and Data Requirements

### Context Evaluator Types

```typescript
type ContextEvaluatorType = "entities-recall" | "precision" | "recall" | "relevance";
```

- **entities-recall**: Measures how well the completion captures named entities from the context
- **precision**: Evaluates how accurate the completion is compared to the context
- **recall**: Measures how much relevant information from the context is included
- **relevance**: Assesses how well the completion relates to the context

### Data Requirements by Evaluator Type

#### Model-Graded Evaluator

```typescript
type ModelGradedData = {
  prompt: string;
  completion: string;
  expectedCompletion?: string;  // Ignored for this evaluator type
}

const modelEval = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Rate the response"
});

await modelEval({
  data: [{
    prompt: "What is TypeScript?",
    completion: "TypeScript is a typed superset of JavaScript"
  }]
});
```

#### Accuracy Evaluator

```typescript
type AccuracyData = {
  completion: string;
  expectedCompletion: string;  // Required for accuracy comparison
}

const accuracyEval = createAccuracyEvaluator({
  weights: {
    factual: 0.5,
    semantic: 0.5
  }
});

await accuracyEval({
  data: [{
    completion: "TypeScript adds types to JavaScript",
    expectedCompletion: "TypeScript is JavaScript with type support"
  }]
});
```

#### Context Evaluator

```typescript
type ContextData = {
  prompt: string;
  completion: string;
  groundTruth: string;  // Required for context evaluation
  contexts: string[];   // Required for context evaluation
}

// Entities Recall - Checks named entities
const entitiesEval = createContextEvaluator({ type: "entities-recall" });

// Precision - Checks accuracy against context
const precisionEval = createContextEvaluator({ type: "precision" });

// Recall - Checks information coverage
const recallEval = createContextEvaluator({ type: "recall" });

// Relevance - Checks contextual relevance
const relevanceEval = createContextEvaluator({ type: "relevance" });

// Example usage
const data = {
  prompt: "What did the CEO say about Q3?",
  completion: "CEO Jane Smith reported 15% growth in Q3 2023",
  groundTruth: "The CEO announced strong Q3 performance",
  contexts: [
    "CEO Jane Smith presented Q3 results",
    "Company saw 15% revenue growth in Q3 2023"
  ]
};

await entitiesEval({ data: [data] });   // Focuses on "Jane Smith", "Q3", "2023"
await precisionEval({ data: [data] });  // Checks factual accuracy
await recallEval({ data: [data] });     // Checks information completeness
await relevanceEval({ data: [data] });  // Checks contextual relevance
```

#### Composite Evaluation

```typescript
// Can combine different evaluator types
const compositeEval = createWeightedEvaluator({
  evaluators: {
    entities: createContextEvaluator({ type: "entities-recall" }),
    accuracy: createAccuracyEvaluator({
      weights: {
        factual: 0.9,   // High weight on exact matches
        semantic: 0.1   // Low weight on similar terms
      }
    }),
    quality: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate quality"
    })
  },
  weights: {
    entities: 0.3,
    accuracy: 0.4,
    quality: 0.3
  }
});

// Must provide all required fields for each evaluator type
await compositeEval({
  data: [{
    prompt: "Summarize the earnings call",
    completion: "CEO Jane Smith announced 15% growth",
    expectedCompletion: "The CEO reported strong growth",
    groundTruth: "CEO discussed Q3 performance",
    contexts: [
      "CEO Jane Smith presented Q3 results",
      "Company saw 15% growth in Q3 2023"
    ]
  }]
});
```
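When a composite evaluator mixes evaluator types like this, each record has to carry the superset of required fields. A sketch of mapping your own application records onto that shape (the `RagTrace` type and its field names are hypothetical, not part of evalz):

```typescript
// Hypothetical shape of a record captured by your own RAG pipeline.
type RagTrace = {
  question: string;
  answer: string;
  referenceAnswer: string;
  retrievedChunks: string[];
};

// Map pipeline records onto the fields the mixed evaluators expect.
const toEvalData = (traces: RagTrace[]) =>
  traces.map(trace => ({
    prompt: trace.question,
    completion: trace.answer,
    expectedCompletion: trace.referenceAnswer,
    groundTruth: trace.referenceAnswer,
    contexts: trace.retrievedChunks
  }));

// await compositeEval({ data: toEvalData(traces) });
```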
## Cookbook

### RAG System Evaluation

Evaluate RAG responses for relevance to source documents and factual accuracy.

```typescript
const ragEvaluator = createWeightedEvaluator({
  evaluators: {
    // Check if named entities (people, places, dates) are preserved
    entities: createContextEvaluator({ type: "entities-recall" }),

    // Verify factual correctness using embedding similarity
    precision: createContextEvaluator({ type: "precision" }),

    // Check if all relevant information is included
    recall: createContextEvaluator({ type: "recall" }),

    // Assess overall contextual relevance
    relevance: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate how well the response uses the context"
    })
  },
  weights: {
    entities: 0.2,   // Lower weight as it's more supplementary
    precision: 0.3,  // Higher weight for factual correctness
    recall: 0.3,     // Higher weight for information coverage
    relevance: 0.2   // Balance of overall relevance
  }
});

const result = await ragEvaluator({
  data: [{
    prompt: "What are the key financial metrics?",
    completion: "Revenue grew 25% to $10M in Q3 2023",
    groundTruth: "Q3 2023 saw 25% revenue growth to $10M",
    contexts: [
      "In Q3 2023, company revenue increased 25% to $10M",
      "Operating margins improved to 15%"
    ]
  }]
});

/* Example output:
{
  results: [{
    score: 0.85,
    scores: [
      { score: 1.0, evaluator: "entities" },   // Perfect entity preservation
      { score: 0.92, evaluator: "precision" }, // High factual accuracy
      { score: 0.75, evaluator: "recall" },    // Missing margin information
      { score: 0.78, evaluator: "relevance" }  // Good contextual relevance
    ],
    item: {
      prompt: "What are the key financial metrics?",
      completion: "Revenue grew 25% to $10M in Q3 2023",
      groundTruth: "Q3 2023 saw 25% revenue growth to $10M",
      contexts: [...]
    }
  }],
  scoreResults: {
    value: 0.85,
    individual: {
      entities: 1.0,
      precision: 0.92,
      recall: 0.75,
      relevance: 0.78
    }
  }
}
*/
```

### Content Moderation Evaluation

Binary evaluation for content policy compliance, useful for automated content filtering.

```typescript
const moderationEvaluator = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  resultsType: "binary",  // Changes output to true/false counts
  evaluationDescription:
    "Score 1 if content follows all policies (safe, respectful, appropriate), 0 if any violation exists"
});

const moderationResult = await moderationEvaluator({
  data: [
    {
      prompt: "Describe our product benefits",
      completion: "Our product helps improve productivity",
      expectedCompletion: "Professional product description"
    },
    {
      prompt: "Respond to negative review",
      completion: "Your complaint is totally wrong...",
      expectedCompletion: "Professional response to feedback"
    }
  ]
});

/* Example output:
{
  results: [
    { score: 1, item: { ... } },  // Meets content guidelines
    { score: 0, item: { ... } }   // Violates professional tone policy
  ],
  binaryResults: {
    trueCount: 1,
    falseCount: 1
  }
}
*/
```
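One way to act on a binary run is to separate the passing and failing items. A sketch that relies only on the result fields shown in the example output above:

```typescript
// Keep completions that passed moderation; collect the rest for human review.
const passed = moderationResult.results.filter(r => r.score === 1);
const flagged = moderationResult.results.filter(r => r.score === 0);

console.log(
  `${moderationResult.binaryResults.trueCount} passed, ` +
  `${moderationResult.binaryResults.falseCount} flagged for review (${flagged.length} items)`
);
```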
### Student Answer Evaluation

Demonstrates weighted evaluation combining exact matching, semantic understanding, and qualitative assessment.

```typescript
const gradingEvaluator = createWeightedEvaluator({
  evaluators: {
    // Check for presence of required terminology
    keyTerms: createAccuracyEvaluator({
      weights: {
        factual: 0.9,   // High weight on exact matches
        semantic: 0.1   // Low weight on similar terms
      }
    }),

    // Assess conceptual understanding
    understanding: createAccuracyEvaluator({
      weights: {
        factual: 0.2,   // Low weight on exact matches
        semantic: 0.8   // High weight on meaning similarity
      }
    }),

    // Evaluate answer quality like a human grader
    quality: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate answer completeness and clarity 0-1"
    })
  },
  weights: {
    keyTerms: 0.3,       // Balance terminology requirements
    understanding: 0.4,  // Emphasize conceptual grasp
    quality: 0.3         // Consider overall presentation
  }
});

const gradingResult = await gradingEvaluator({
  data: [{
    prompt: "Explain how photosynthesis works",
    completion: "Plants convert sunlight into chemical energy through chlorophyll",
    expectedCompletion: "Photosynthesis is the process where plants use chlorophyll to convert sunlight, water, and CO2 into glucose and oxygen"
  }]
});

/* Example output:
{
  results: [{
    score: 0.78,  // Overall grade (78%)
    scores: [
      {
        score: 0.65,  // Missing key terms (water, CO2, glucose)
        evaluator: "keyTerms",
        evaluatorType: "accuracy"
      },
      {
        score: 0.90,  // Shows good conceptual understanding
        evaluator: "understanding",
        evaluatorType: "accuracy"
      },
      {
        score: 0.75,  // Clear but not comprehensive
        evaluator: "quality",
        evaluatorType: "model-graded"
      }
    ],
    item: { ... }
  }],
  scoreResults: {
    value: 0.78,
    individual: {
      keyTerms: 0.65,
      understanding: 0.90,
      quality: 0.75
    }
  }
}
*/
```
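The overall score above lines up with a weight-weighted average of the individual evaluator scores. A quick arithmetic check of that run (illustrative only, not a claim about library internals):

```typescript
// keyTerms (0.3), understanding (0.4), quality (0.3) from the example above
const combined = 0.3 * 0.65 + 0.4 * 0.9 + 0.3 * 0.75;
console.log(combined.toFixed(2)); // "0.78", matching scoreResults.value
```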
### Chatbot Quality Assessment

Monitor chatbot response quality across multiple dimensions.

```typescript
const chatbotEvaluator = createWeightedEvaluator({
  evaluators: {
    // Evaluate response appropriateness
    relevance: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate how well the response addresses the user's query"
    }),

    // Check response tone
    tone: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate the professionalism and friendliness of the response"
    }),

    // Verify against known good responses
    accuracy: createAccuracyEvaluator({
      weights: { semantic: 0.8, factual: 0.2 }
    })
  },
  weights: {
    relevance: 0.4,
    tone: 0.3,
    accuracy: 0.3
  }
});

const result = await chatbotEvaluator({
  data: [{
    prompt: "How do I reset my password?",
    completion: "You can reset your password by clicking the 'Forgot Password' link on the login page.",
    expectedCompletion: "To reset your password, use the 'Forgot Password' option at login.",
    contexts: ["Previous support interactions"]
  }]
});
```

### Content Generation Pipeline

Evaluate generated content for quality and accuracy.

```typescript
const contentEvaluator = createWeightedEvaluator({
  evaluators: {
    // Check writing quality
    quality: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate clarity, structure, and engagement"
    }),

    // Verify factual accuracy
    factCheck: createAccuracyEvaluator({
      weights: { factual: 1.0 }
    }),

    // Assess source usage
    citations: createContextEvaluator({ type: "entities-recall" })
  },
  weights: {
    quality: 0.4,
    factCheck: 0.4,
    citations: 0.2
  }
});

const result = await contentEvaluator({
  data: [{
    prompt: "Write an article about renewable energy trends",
    completion: "Solar and wind power installations increased by 30% in 2023...",
    contexts: [
      "Global renewable energy deployment grew by 30% year-over-year",
      "Solar and wind remained the fastest-growing sectors"
    ],
    groundTruth: "Renewable energy saw significant growth, led by solar and wind"
  }]
});
```

### Document Processing System

Evaluate document extraction and summarization quality.

```typescript
const documentEvaluator = createWeightedEvaluator({
  evaluators: {
    // Verify key information extraction
    extraction: createContextEvaluator({ type: "recall" }),

    // Check summary quality
    summary: createEvaluator({
      client: oai,
      model: "gpt-4-turbo",
      evaluationDescription: "Rate conciseness and completeness"
    }),

    // Validate against reference summary
    accuracy: createAccuracyEvaluator({
      weights: { semantic: 0.6, factual: 0.4 }
    })
  },
  weights: {
    extraction: 0.4,
    summary: 0.3,
    accuracy: 0.3
  }
});

const result = await documentEvaluator({
  data: [{
    prompt: "Summarize the quarterly report",
    completion: "Q3 revenue grew 25% YoY, driven by new product launches...",
    contexts: [
      "Revenue increased 25% compared to Q3 2022",
      "Growth primarily attributed to successful product launches"
    ],
    groundTruth: "Q3 saw 25% YoY revenue growth due to new products"
  }]
});
```

## API Reference

### createEvaluator

Creates a basic evaluator for assessing AI-generated content based on custom criteria.

**Parameters**

- `client`: OpenAI instance.
- `model`: OpenAI model to use (e.g., `"gpt-4o"`).
- `evaluationDescription`: Description guiding the evaluation criteria.
- `resultsType`: Type of results to return (`"score"` or `"binary"`).
- `messages`: Additional messages to include in the OpenAI API call.

**Example**

```typescript
import { createEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

const evaluator = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Rate the relevance from 0 to 1."
});

const result = await evaluator({
  data: [{
    prompt: "Discuss the importance of AI.",
    completion: "AI is important for future technology.",
    expectedCompletion: "AI is important for future technology."
  }]
});

console.log(result.scoreResults);
```
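The `messages` option above isn't demonstrated elsewhere in this README. Presumably it accepts OpenAI chat-completion messages that are added to the evaluation call; the sketch below is an assumption, so check the evalz source before relying on the exact shape:

```typescript
const strictEvaluator = createEvaluator({
  client: oai,
  model: "gpt-4-turbo",
  evaluationDescription: "Rate the relevance from 0 to 1.",
  // Assumed to follow OpenAI's chat message format; shown here for illustration only.
  messages: [
    {
      role: "system",
      content: "You are grading customer-support replies. Penalize any reply that promises refunds."
    }
  ]
});
```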
### createAccuracyEvaluator

Creates an evaluator that assesses string similarity using a hybrid approach of Levenshtein distance (factual similarity) and semantic embeddings (semantic similarity), with customizable weights.

**Parameters**

- `model` (optional): `OpenAI.Embeddings.EmbeddingCreateParams["model"]` - The OpenAI embedding model to use. Defaults to `"text-embedding-3-small"`.
- `weights` (optional): An object specifying the weights for factual and semantic similarities. Defaults to `{ factual: 0.5, semantic: 0.5 }`.

**Example**

```typescript
import { createAccuracyEvaluator } from "evalz";

const evaluator = createAccuracyEvaluator({
  model: "text-embedding-3-small",
  weights: { factual: 0.4, semantic: 0.6 }
});

const data = [
  {
    completion: "Einstein was born in Germany in 1879.",
    expectedCompletion: "Einstein was born in 1879 in Germany."
  }
];

const result = await evaluator({ data });
console.log(result.scoreResults);
```

### createWeightedEvaluator

Combines multiple evaluators with specified weights for a comprehensive assessment.

**Parameters**

- `evaluators`: An object mapping evaluator names to evaluator functions.
- `weights`: An object mapping evaluator names to their corresponding weights.

**Example**

```typescript
import { createWeightedEvaluator } from "evalz";

const weightedEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval()
  },
  weights: {
    relevance: 0.25,
    fluency: 0.25,
    completeness: 0.5
  }
});

const result = await weightedEvaluator({ data: yourResponseData });
console.log(result.scoreResults);
```

Here `relevanceEval`, `fluencyEval`, and `completenessEval` are evaluator factories like the ones defined in the composite example below.
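Every example in this README uses weights that sum to 1; evalz's behavior for non-normalized weights isn't documented here. If you assemble the weights object dynamically, a quick guard is cheap (a sketch):

```typescript
const weights = { relevance: 0.25, fluency: 0.25, completeness: 0.5 };

// Sanity check before constructing the weighted evaluator.
const total = Object.values(weights).reduce((sum, w) => sum + w, 0);
if (Math.abs(total - 1) > 1e-9) {
  throw new Error(`Evaluator weights sum to ${total}, expected 1`);
}
```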
### Create Composite Weighted Evaluation

A weighted evaluator that incorporates various evaluation types:

**Example**

```typescript
import { createEvaluator, createAccuracyEvaluator, createContextEvaluator, createWeightedEvaluator } from "evalz";
import OpenAI from "openai";

const oai = new OpenAI({
  apiKey: process.env["OPENAI_API_KEY"],
  organization: process.env["OPENAI_ORG_ID"]
});

const relevanceEval = () =>
  createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription:
      "Please rate the relevance of the response from 0 (not at all relevant) to 1 (highly relevant), considering whether the AI stayed on topic and provided a reasonable answer."
  });

const distanceEval = () =>
  createAccuracyEvaluator({
    weights: { factual: 0.5, semantic: 0.5 }
  });

const semanticEval = () =>
  createAccuracyEvaluator({
    weights: { factual: 0.0, semantic: 1.0 }
  });

const fluencyEval = () =>
  createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription:
      "Please rate the fluency of the response from 0 (not at all fluent) to 1 (highly fluent), considering grammar, clarity, and readability."
  });

const completenessEval = () =>
  createEvaluator({
    client: oai,
    model: "gpt-4-turbo",
    evaluationDescription:
      "Please rate the completeness of the response from 0 (not at all complete) to 1 (completely answered), considering whether the AI addressed all parts of the prompt."
  });

const contextEntitiesRecallEval = () => createContextEvaluator({ type: "entities-recall" });
const contextPrecisionEval = () => createContextEvaluator({ type: "precision" });
const contextRecallEval = () => createContextEvaluator({ type: "recall" });
const contextRelevanceEval = () => createContextEvaluator({ type: "relevance" });

const compositeWeightedEvaluator = createWeightedEvaluator({
  evaluators: {
    relevance: relevanceEval(),
    fluency: fluencyEval(),
    completeness: completenessEval(),
    accuracy: createAccuracyEvaluator({
      weights: { factual: 0.6, semantic: 0.4 }
    }),
    contextPrecision: contextPrecisionEval()
  },
  weights: {
    relevance: 0.2,
    fluency: 0.2,
    completeness: 0.2,
    accuracy: 0.2,
    contextPrecision: 0.2
  }
});

const data = [
  {
    prompt: "When was the first super bowl?",
    completion: "The first super bowl was held on January 15, 1967.",
    expectedCompletion: "The first superbowl was held on January 15, 1967.",
    contexts: [
      "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ],
    groundTruth: "The first superbowl was held on January 15, 1967."
  }
];

const result = await compositeWeightedEvaluator({ data });
console.log(result.scoreResults);
```

### createContextEvaluator

Creates an evaluator that assesses context-based criteria such as relevance, precision, recall, and entities recall.

**Parameters**

- `type`: `"entities-recall" | "precision" | "recall" | "relevance"` - The type of context evaluation to perform.
- `model` (optional): `OpenAI.Embeddings.EmbeddingCreateParams["model"]` - The OpenAI embedding model to use. Defaults to `"text-embedding-3-small"`.

**Example**

```typescript
import { createContextEvaluator } from "evalz";

const entitiesRecallEvaluator = createContextEvaluator({ type: "entities-recall" });
const precisionEvaluator = createContextEvaluator({ type: "precision" });
const recallEvaluator = createContextEvaluator({ type: "recall" });
const relevanceEvaluator = createContextEvaluator({ type: "relevance" });

const data = [
  {
    prompt: "When was the first super bowl?",
    completion: "The first superbowl was held on January 15, 1967.",
    groundTruth: "The first superbowl was held on January 15, 1967.",
    contexts: [
      "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967 at the Los Angeles Memorial Coliseum in Los Angeles.",
      "This first championship game is retroactively referred to as Super Bowl I."
    ]
  }
];

const result1 = await entitiesRecallEvaluator({ data });
console.log(result1.scoreResults);

const result2 = await precisionEvaluator({ data });
console.log(result2.scoreResults);

const result3 = await recallEvaluator({ data });
console.log(result3.scoreResults);

const result4 = await relevanceEvaluator({ data });
console.log(result4.scoreResults);
```

## Integration with Island AI

Part of the Island AI toolkit:

- [`schema-stream`](https://www.npmjs.com/package/schema-stream): Streaming JSON parser
- [`zod-stream`](https://www.npmjs.com/package/zod-stream): Structured streaming
- [`stream-hooks`](https://www.npmjs.com/package/stream-hooks): React streaming hooks
- [`llm-polyglot`](https://www.npmjs.com/package/llm-polyglot): Universal LLM client
- [`instructor`](https://www.npmjs.com/package/@instructor-ai/instructor): High-level extraction

## Contributing

We welcome contributions!
Check out:

- [Island AI Documentation](https://island.hack.dance)
- [GitHub Issues](https://github.com/hack-dance/island-ai/issues)
- [Twitter](https://twitter.com/dimitrikennedy)

## License

MIT © [hack.dance](https://hack.dance)