# LLM Provider Support
AgentRails supports multiple LLM providers for evaluating your agent's responses. You can use any provider regardless of which LLM your agent uses!
## Supported Providers
### OpenAI
**Models:** GPT-4 Turbo, GPT-4, GPT-3.5 Turbo
```javascript
module.exports = {
  llm: {
    provider: "openai",
    apiKey: process.env.OPENAI_API_KEY,
    model: "gpt-4-turbo-preview", // optional, default
    temperature: 0.3, // optional, default
  },
  agent: async (input) => {
    /* your agent */
  },
  tests: [
    /* your tests */
  ],
};
```
**Setup:**
```bash
npm install openai
export OPENAI_API_KEY="sk-..."
```
### Anthropic Claude
**Models:** Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Sonnet
```javascript
module.exports = {
  llm: {
    provider: "anthropic",
    apiKey: process.env.ANTHROPIC_API_KEY,
    model: "claude-3-5-sonnet-20241022", // optional, default
    temperature: 0.3, // optional
  },
  agent: async (input) => {
    /* your agent */
  },
  tests: [
    /* your tests */
  ],
};
```
**Setup:**
```bash
npm install @anthropic-ai/sdk
export ANTHROPIC_API_KEY="sk-ant-..."
```
### Google Gemini
**Models:** Gemini Pro, Gemini Pro Vision
```javascript
module.exports = {
  llm: {
    provider: "google",
    apiKey: process.env.GOOGLE_API_KEY,
    model: "gemini-pro", // optional, default
    temperature: 0.3, // optional
  },
  agent: async (input) => {
    /* your agent */
  },
  tests: [
    /* your tests */
  ],
};
```
**Setup:**
```bash
npm install @google/generative-ai
export GOOGLE_API_KEY="..."
```
### Grok (xAI)
**Models:** Grok Beta
```javascript
module.exports = {
  llm: {
    provider: "grok",
    apiKey: process.env.XAI_API_KEY,
    model: "grok-beta", // optional, default
    temperature: 0.3, // optional
    baseURL: "https://api.x.ai/v1", // optional, default
  },
  agent: async (input) => {
    /* your agent */
  },
  tests: [
    /* your tests */
  ],
};
```
**Setup:**
```bash
# Grok uses the OpenAI SDK
npm install openai
export XAI_API_KEY="..."
```
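Because the `grok` provider goes through the OpenAI SDK, it essentially amounts to pointing an OpenAI client at xAI's OpenAI-compatible endpoint. The snippet below is a rough sketch of that idea, not AgentRails' exact internals:

```javascript
// Sketch: an OpenAI SDK client aimed at xAI's OpenAI-compatible endpoint.
// This mirrors what the "grok" provider does; the library's internals may differ.
const OpenAI = require("openai");

const grokClient = new OpenAI({
  apiKey: process.env.XAI_API_KEY,
  baseURL: "https://api.x.ai/v1",
});

// Chat requests then target grok-beta instead of an OpenAI model, e.g.:
// await grokClient.chat.completions.create({ model: "grok-beta", messages: [...] });
```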
## Choosing a Provider
### When to use OpenAI
- **Best for:** General-purpose evaluation
- **Pros:** Well-documented, stable API; excellent at following JSON format; reliable and fast
- **Cons:** More expensive than alternatives
### When to use Anthropic
- **Best for:** Long context evaluations, detailed reasoning
- **Pros:** Excellent reasoning, large context window (200k tokens), good at nuanced evaluation
- **Cons:** Slightly slower; requires a separate SDK
### When to use Google Gemini
- **Best for:** Cost-effective evaluation, multimodal inputs
- **Pros:** Free tier available, fast, good for image inputs
- **Cons:** Newer; less consistent at producing parseable JSON
### When to use Grok
- **Best for:** Latest news/current events evaluation
- **Pros:** Access to real-time information, X integration
- **Cons:** Beta stage, limited availability
## Cost Comparison
Approximate costs per 1M tokens (input/output):
| Provider | Model | Input | Output |
| --------- | ----------------- | --------------------- | ------ |
| OpenAI | GPT-4 Turbo | $10 | $30 |
| OpenAI | GPT-3.5 Turbo | $0.50 | $1.50 |
| Anthropic | Claude 3.5 Sonnet | $3 | $15 |
| Google | Gemini Pro | Free tier, then $0.50 | $1.50 |
| Grok | Grok Beta | TBD | TBD |
## Best Practices
1. **Match model to task complexity:**
   - Simple pass/fail: GPT-3.5 Turbo or Gemini Pro
   - Nuanced evaluation: GPT-4 Turbo or Claude 3.5 Sonnet
2. **Use different providers for redundancy** (see the sketch after this list):
   ```javascript
   // Run the same tests with multiple evaluators
   const providers = ["openai", "anthropic", "google"];
   ```
3. **Set temperature low (0.1-0.3):**
   - Low temperature = more consistent evaluation
   - High temperature = more creative but less reliable
4. **Your agent can use a different LLM:**
   ```javascript
   // Agent uses Claude, evaluator uses GPT-4
   module.exports = {
     llm: { provider: "openai", apiKey: process.env.OPENAI_API_KEY },
     agent: async (input) => {
       // Your agent calls Claude internally
       return await yourClaudeAgent.chat(input);
     },
   };
   ```
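Expanding on point 2, one way to reuse a single agent and test suite across several evaluator providers is to build one config per provider. This is only a sketch: `myAgent`, `myTests`, and `envKeys` are placeholder names (not AgentRails APIs), and how you run each resulting config (for example, one config file per provider) depends on your setup.

```javascript
// Sketch: one evaluation config per evaluator provider.
// myAgent, myTests, and envKeys are placeholders, not part of AgentRails.
const myAgent = async (input) => `echo: ${input}`; // your real agent goes here
const myTests = [
  /* your tests */
];

const envKeys = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_API_KEY",
};

const configs = Object.keys(envKeys).map((provider) => ({
  llm: { provider, apiKey: process.env[envKeys[provider]] },
  agent: myAgent, // same agent under test
  tests: myTests, // same test suite
}));
```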
## Adding a New Provider
To add a new provider:
1. Implement the `LLMEvaluator` interface in `src/evaluator.ts`
2. Add the provider to the `LLMProvider` type in `src/types.ts`
3. Update the `createEvaluator` factory function
4. Add tests in `tests/evaluator.test.ts`
Example:
```typescript
export class CustomEvaluator implements LLMEvaluator {
  async evaluate(
    input: string | Record<string, any>,
    actualResponse: string | Record<string, any>,
    expectedBehavior?: string,
    exampleResponses?: string[]
  ): Promise<{ passed: boolean; reasoning: string }> {
    // Your implementation
  }
}
```
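For steps 2 and 3, you add your provider id to the `LLMProvider` union in `src/types.ts` and route it through the factory. The sketch below (shown without TypeScript annotations, with an assumed provider id `"custom"` and a simplified signature; the real `createEvaluator` lives in `src/evaluator.ts` and may differ) illustrates the routing step:

```javascript
// Sketch of step 3: route the new provider id through the factory.
// "custom" is a hypothetical id added to the LLMProvider union in step 2.
function createEvaluator(config) {
  switch (config.provider) {
    case "custom":
      return new CustomEvaluator(/* config.apiKey, config.model, ... */);
    // ...existing cases: "openai", "anthropic", "google", "grok"...
    default:
      throw new Error(`Unknown LLM provider: ${config.provider}`);
  }
}
```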
Pull requests welcome!