@flatfile/improv

A powerful TypeScript library for building AI agents with multi-threaded conversations, tool execution, and event handling capabilities

# Reasoning Models 2025 Update - Complete Implementation

## Overview

This document details the comprehensive implementation of reasoning model support in Improv, with deep focus on **Claude 4 Sonnet** and **Qwen 3** models as requested. All drivers now support advanced reasoning control parameters.

---

## 🎯 Claude 4 Sonnet - Deep Dive

### Latest Features (May 2025)

Claude 4 Sonnet represents a significant advancement in AI reasoning with hybrid modes and extended thinking capabilities.

#### Model Availability

```typescript
// New Claude 4 models added to AnthropicModel type
| "claude-4-sonnet-20250514"    // Claude 4 Sonnet (May 2025)
| "claude-opus-4-20250514"      // Claude Opus 4 (May 2025)
| "claude-3-7-sonnet-20250219"  // Claude 3.7 Sonnet (Feb 2025)
```

#### Extended Thinking Configuration

```typescript
const driver = new AnthropicThreadDriver({
  model: "claude-4-sonnet-20250514",
  extendedThinking: {
    budgetTokens: 16000,   // Up to 16k tokens for reasoning
    includeSummary: true,  // Get summarized thinking process
  },
  interleavedThinking: true, // Think between tool calls
});
```

#### Key Capabilities

1. **Hybrid Response Modes**
   - Near-instant responses for simple queries
   - Extended thinking for complex reasoning tasks

2. **Interleaved Thinking with Tool Use**
   - Model can think between tool calls
   - Sophisticated reasoning after receiving tool results
   - Requires beta header: `interleaved-thinking-2025-05-14`

3. **Summarized Thinking**
   - Full intelligence benefits of extended thinking
   - Prevents misuse by providing summaries instead of raw thinking
   - Up to 64K output tokens for rich code generation

4. **Reduced Shortcut Behavior**
   - 65% less likely to use shortcuts or loopholes
   - More reliable for agentic tasks

#### Usage Examples

```typescript
// Basic Extended Thinking
const claude4Basic = new Solo({
  driver: new AnthropicThreadDriver({
    model: "claude-4-sonnet-20250514",
    extendedThinking: {
      budgetTokens: 8000,
      includeSummary: true,
    }
  }),
  systemPrompt: "You are a coding expert. Think deeply about problems."
});

// Advanced with Interleaved Thinking + Tools
const claude4Advanced = new Agent({
  driver: new AnthropicThreadDriver({
    model: "claude-4-sonnet-20250514",
    extendedThinking: {
      budgetTokens: 16000,
      includeSummary: true,
    },
    interleavedThinking: true,
  }),
  tools: [codeAnalysisTool, searchTool],
  systemPrompt: "Think between tool calls to make better decisions."
});
```

#### Pricing & Performance

- **Pricing**: $3/$15 per million tokens (input/output)
- **Context**: Full context window support
- **Output**: Up to 64K tokens
- **Performance**: Leading SWE-bench (72.5%) and Terminal-bench (43.2%)

---

## 🧠 Qwen 3 Models - Deep Dive

### Latest Models (2025)

Qwen 3 introduces hybrid reasoning modes with fine-grained control over thinking behavior.

#### Model Availability

```typescript
// New Qwen 3 models added to CerebrasModel type
| "qwen-3-32b"                      // Qwen 3 32B (April 2025)
| "qwen-3-235b-a22b-thinking-2507"  // Qwen 3 235B Thinking (July 2025)
| "qwen-3-30b-a3b-thinking-2507"    // Qwen 3 30B Thinking (July 2025)
| "qwq-32b-preview"                 // QwQ 32B Preview (Nov 2024)
```

#### Reasoning Configuration

```typescript
const driver = new CerebrasThreadDriver({
  model: "qwen-3-32b",
  reasoning: {
    enableThinking: true,        // Enable <think> tags
    maxCompletionTokens: 64000,  // High limit for thinking
    softControl: true,           // Support /think and /no_think
  }
});
```

#### Key Features

1. **Hybrid Reasoning Modes**
   - **Thinking Mode**: Step-by-step reasoning with `<think>...</think>` tags
   - **Non-Thinking Mode**: Quick responses without visible reasoning

2. **Granular Control**
   - `enableThinking: false` - Aligns with Qwen2.5-Instruct behavior
   - `enableThinking: true` - Enables full reasoning mode
   - Smart defaults: Enabled for Qwen 3, disabled for pure QwQ models

3. **Soft Control Instructions**
   - `/think` - Enable thinking for specific request
   - `/no_think` - Disable thinking for specific request

4. **Model Variants**
   - **Standard Qwen 3**: Controllable thinking
   - **Thinking Models**: Always-on reasoning (235B, 30B thinking variants)
   - **QwQ Models**: Pure reasoning models (cannot disable)

#### Usage Examples

```typescript
// Controllable Thinking
const qwen3Flexible = new Solo({
  driver: new CerebrasThreadDriver({
    model: "qwen-3-32b",
    reasoning: {
      enableThinking: true,
      maxCompletionTokens: 32000,
      softControl: true,
    }
  })
});

// Use soft control in requests
await qwen3Flexible.ask("/think Analyze this complex algorithm step by step...");
await qwen3Flexible.ask("/no_think What's 2+2?");

// Always-On Reasoning Model
const qwen3Thinking = new Agent({
  driver: new CerebrasThreadDriver({
    model: "qwen-3-235b-a22b-thinking-2507",
    reasoning: {
      maxCompletionTokens: 64000, // High limit for verbose reasoning
    }
  }),
  systemPrompt: "You are an expert reasoning AI."
});

// Pure Reasoning Model (QwQ)
const qwqReasoning = new Solo({
  driver: new CerebrasThreadDriver({
    model: "qwq-32b-preview",
    reasoning: {
      maxCompletionTokens: 64000,
      // Cannot disable reasoning
    }
  })
});
```

#### Performance & Specifications

- **Context**: 128K tokens (except smaller models)
- **Training**: 36 trillion tokens in 119 languages
- **License**: Apache 2.0
- **Sampling**: temperature=0.6, top_p=0.95
- **Benchmarks**: Outperforms o1 on several reasoning benchmarks

---

## 🔧 Complete API Parameters Reference

### OpenAI Driver

```typescript
interface OpenAIConfig {
  model?: OpenAIModel; // Now includes o1-preview, o1-mini, o1
  reasoning?: {
    reasoningEffort?: "low" | "medium" | "high"; // o1 full model only
    maxCompletionTokens?: number;                // For o1 models
    trackReasoningTokens?: boolean;              // Track reasoning cost
  };
}

// Usage
const openai = new OpenAIThreadDriver({
  model: "o1-preview",
  reasoning: {
    maxCompletionTokens: 8000,
    trackReasoningTokens: true,
  }
});
```

### Anthropic Driver

```typescript
interface AnthropicConfig {
  model?: AnthropicModel; // Now includes claude-4-sonnet-20250514
  extendedThinking?: {
    budgetTokens?: number;    // Max reasoning tokens (up to 16k)
    includeSummary?: boolean; // Get thinking summary
  };
  interleavedThinking?: boolean; // Think between tool calls
}

// Usage
const anthropic = new AnthropicThreadDriver({
  model: "claude-4-sonnet-20250514",
  extendedThinking: {
    budgetTokens: 16000,
    includeSummary: true,
  },
  interleavedThinking: true,
});
```

### Gemini Driver

```typescript
interface GeminiConfig {
  thinking?: {
    thinkingBudget?: number;   // -1=dynamic, 0=off, >0=specific
    includeThoughts?: boolean; // Get thought summaries
  };
}

// Usage
const gemini = new GeminiThreadDriver({
  model: "gemini-2.5-flash",
  thinking: {
    thinkingBudget: 2048, // Specific token budget
    includeThoughts: true,
  }
});
```

### Cerebras Driver

```typescript
interface CerebrasConfig {
  model?: CerebrasModel; // Now includes all Qwen 3 variants
  reasoning?: {
    enableThinking?: boolean;     // Enable/disable thinking
    maxCompletionTokens?: number; // High limits for thinking models
    softControl?: boolean;        // Support /think /no_think
  };
}

// Usage
const cerebras = new CerebrasThreadDriver({
  model: "qwen-3-32b",
  reasoning: {
    enableThinking: true,
    maxCompletionTokens: 32000,
    softControl: true,
  }
});
```

---

## 🚀 Advanced Usage Patterns

### 1. Hybrid Reasoning Agent

```typescript
const hybridAgent = new Agent({
  driver: new AnthropicThreadDriver({
    model: "claude-4-sonnet-20250514",
    extendedThinking: {
      budgetTokens: 12000,
      includeSummary: true,
    },
    interleavedThinking: true,
  }),
  tools: [searchTool, calculatorTool],
  systemPrompt: `You can switch between fast and deep thinking modes:
  - For simple queries, respond quickly
  - For complex problems, use extended thinking
  - Think between tool calls to make better decisions`
});
```

### 2. Adaptive Reasoning with Qwen

```typescript
const adaptiveQwen = new Solo({
  driver: new CerebrasThreadDriver({
    model: "qwen-3-32b",
    reasoning: {
      enableThinking: true,
      softControl: true,
      maxCompletionTokens: 32000,
    }
  })
});

// Adaptive usage
await adaptiveQwen.ask("/think Solve this complex math problem...");  // Deep thinking
await adaptiveQwen.ask("/no_think What's the weather like?");         // Quick response
```

### 3. Multi-Model Reasoning Pipeline

```typescript
const reasoningPipeline = {
  quickCheck: new Solo({
    driver: new OpenAIThreadDriver({ model: "gpt-4o-mini" })
  }),
  deepThinking: new Solo({
    driver: new AnthropicThreadDriver({
      model: "claude-4-sonnet-20250514",
      extendedThinking: { budgetTokens: 16000 }
    })
  }),
  mathematicalReasoning: new Solo({
    driver: new CerebrasThreadDriver({
      model: "qwen-3-235b-a22b-thinking-2507",
      reasoning: { maxCompletionTokens: 64000 }
    })
  })
};

// Route requests based on complexity
async function routeReasoning(query: string, complexity: 'simple' | 'complex' | 'mathematical') {
  switch (complexity) {
    case 'simple':
      return reasoningPipeline.quickCheck.ask(query);
    case 'complex':
      return reasoningPipeline.deepThinking.ask(query);
    case 'mathematical':
      return reasoningPipeline.mathematicalReasoning.ask(query);
  }
}
```

---

## 🧪 Testing & Validation

The comprehensive test suite validates all reasoning implementations:

```bash
# Run the complete reasoning model test
bun example/test-reasoning-models-comprehensive.ts

# Test specific models
ANTHROPIC_API_KEY=your_key bun example/test-reasoning-models-comprehensive.ts
CEREBRAS_API_KEY=your_key bun example/test-reasoning-models-comprehensive.ts
```

### Test Coverage

- ✅ All reasoning extraction formats
- ✅ Claude 4 Sonnet extended thinking
- ✅ Qwen 3 hybrid reasoning modes
- ✅ OpenAI o1 reasoning tokens
- ✅ Gemini thinking budget control
- ✅ Tool usage with reasoning
- ✅ Streaming with reasoning extraction

---

## 📊 Reasoning Model Comparison

| Model | Reasoning Control | Best For | Cost | Speed |
|-------|------------------|----------|------|-------|
| **Claude 4 Sonnet** | Extended thinking budget | Complex reasoning, coding | $3/$15 | Hybrid |
| **Qwen 3-32B** | Enable/disable thinking | Flexible use cases | Low | Fast |
| **Qwen 3-235B Thinking** | Always-on reasoning | Deep analysis | Medium | Slow |
| **o1-preview** | Built-in reasoning | Scientific problems | $15/$60 | Slow |
| **o1-mini** | Built-in reasoning | Math, coding | $3/$12 | Medium |
| **Gemini 2.5 Flash** | Thinking budget | Cost-effective reasoning | Low | Fast |

---

## 🎯 Recommendations

### For Your Primary Use Cases

1. **Claude 4 Sonnet**: Best for complex coding and reasoning tasks
   - Extended thinking with tool use
   - High-quality reasoning with summarization
   - Excellent for agentic workflows

2. **Qwen 3 Models**: Most flexible reasoning control
   - Hybrid modes for different complexity levels
   - Cost-effective for varied workloads
   - Great soft control with `/think` commands

### Migration Guide

1. **Update model types** - All new models are now available
2. **Add reasoning configs** - Configure thinking behavior per use case
3. **Test reasoning extraction** - Verify ReasoningExtractor works correctly
4. **Update system prompts** - Optimize for new reasoning capabilities

This implementation provides state-of-the-art reasoning model support with fine-grained control over thinking behavior across all major providers. 🚀
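
When working through the "Test reasoning extraction" step of the migration guide, it can help to have an independent check of what a Qwen 3 response looks like once its `<think>...</think>` block is separated from the final answer. The following is a minimal sketch under stated assumptions, not Improv's `ReasoningExtractor`: the `splitThinking` function and the `SplitResult` shape are hypothetical names introduced only for illustration, and the sketch assumes the response arrives as one complete string with properly closed tags.

```typescript
// Hypothetical helper for illustration only; NOT part of @flatfile/improv.
interface SplitResult {
  reasoning: string[]; // trimmed contents of each <think>...</think> block
  answer: string;      // the response text with all think blocks removed
}

function splitThinking(raw: string): SplitResult {
  const reasoning: string[] = [];

  // Non-greedy match so that multiple think blocks are captured separately;
  // [\s\S] lets the match span newlines.
  const answer = raw
    .replace(/<think>([\s\S]*?)<\/think>/g, (_match: string, inner: string) => {
      const trimmed = inner.trim();
      if (trimmed.length > 0) reasoning.push(trimmed);
      return ""; // drop the block from the visible answer
    })
    .trim();

  return { reasoning, answer };
}

// Example with a made-up response string
const sample = "<think>2 + 2 is basic arithmetic.</think>\nThe answer is 4.";
const result = splitThinking(sample);
console.log(result.reasoning); // the single reasoning block
console.log(result.answer);    // the text left after removing the block
```

An unclosed `<think>` tag, as can occur partway through a streamed response, is left in `answer` untouched by this sketch; handling partial chunks would need buffering that is out of scope here.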