@flatfile/improv
A powerful TypeScript library for building AI agents with multi-threaded conversations, tool execution, and event handling capabilities
# Reasoning Models 2025 Update - Complete Implementation
## Overview
This document describes the implementation of reasoning model support in Improv, with a focus on **Claude 4 Sonnet** and **Qwen 3**. All drivers now accept reasoning control parameters.
## 🎯 Claude 4 Sonnet - Deep Dive
### Latest Features (May 2025)
Claude 4 Sonnet is a hybrid reasoning model: it can answer immediately or spend an extended thinking budget before responding.
#### Model Availability
```typescript
// New Claude 4 models added to AnthropicModel type
| "claude-4-sonnet-20250514" // Claude 4 Sonnet (May 2025); Anthropic's API lists this model as "claude-sonnet-4-20250514"
| "claude-opus-4-20250514" // Claude Opus 4 (May 2025)
| "claude-3-7-sonnet-20250219" // Claude 3.7 Sonnet (Feb 2025)
```
#### Extended Thinking Configuration
```typescript
const driver = new AnthropicThreadDriver({
  model: "claude-4-sonnet-20250514",
  extendedThinking: {
    budgetTokens: 16000, // Up to 16k tokens for reasoning
    includeSummary: true, // Get summarized thinking process
  },
  interleavedThinking: true, // Think between tool calls
});
```
#### Key Capabilities
1. **Hybrid Response Modes**
   - Near-instant responses for simple queries
   - Extended thinking for complex reasoning tasks
2. **Interleaved Thinking with Tool Use**
   - Model can think between tool calls
   - Sophisticated reasoning after receiving tool results
   - Requires beta header: `interleaved-thinking-2025-05-14`
3. **Summarized Thinking**
   - Full intelligence benefits of extended thinking
   - Prevents misuse by providing summaries instead of raw thinking
   - Up to 64K output tokens for rich code generation
4. **Reduced Shortcut Behavior**
   - 65% less likely than Claude Sonnet 3.7 to use shortcuts or loopholes on agentic tasks, per Anthropic
   - More reliable for agentic tasks
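To make the capabilities above more concrete, the sketch below shows how options like these could map onto an Anthropic Messages API request. The `thinking: { type: "enabled", budget_tokens }` field and the `anthropic-beta` header come from Anthropic's public API. The `ThinkingOptions` type and the `buildThinkingRequest` function are hypothetical names used only for illustration; this is not Improv's actual driver code.

```typescript
// Hypothetical options mirroring the driver config shown above.
interface ThinkingOptions {
  budgetTokens: number;
  interleavedThinking?: boolean;
}

interface ThinkingRequest {
  headers: Record<string, string>;
  body: {
    model: string;
    max_tokens: number;
    thinking: { type: "enabled"; budget_tokens: number };
  };
}

// Anthropic documents a minimum thinking budget of 1,024 tokens.
const MIN_BUDGET_TOKENS = 1024;

function buildThinkingRequest(
  model: string,
  maxTokens: number,
  options: ThinkingOptions,
): ThinkingRequest {
  if (options.budgetTokens < MIN_BUDGET_TOKENS) {
    throw new RangeError(`budgetTokens must be at least ${MIN_BUDGET_TOKENS}`);
  }
  // Without interleaved thinking, the budget has to fit inside max_tokens.
  if (!options.interleavedThinking && options.budgetTokens >= maxTokens) {
    throw new RangeError("budgetTokens must be less than max_tokens");
  }

  const headers: Record<string, string> = {};
  if (options.interleavedThinking) {
    // Beta header required for thinking between tool calls.
    headers["anthropic-beta"] = "interleaved-thinking-2025-05-14";
  }

  return {
    headers,
    body: {
      model,
      max_tokens: maxTokens,
      thinking: { type: "enabled", budget_tokens: options.budgetTokens },
    },
  };
}

// Model ID written as it appears elsewhere in this document.
const request = buildThinkingRequest("claude-4-sonnet-20250514", 32000, {
  budgetTokens: 16000,
  interleavedThinking: true,
});
console.log(JSON.stringify(request, null, 2));
```

Keeping the validation next to the mapping means a bad budget fails before any network call is made.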
#### Usage Examples
```typescript
// Import path assumed from the package name; adjust if your setup differs
import { Agent, Solo, AnthropicThreadDriver } from "@flatfile/improv";

// Basic Extended Thinking
const claude4Basic = new Solo({
  driver: new AnthropicThreadDriver({
    model: "claude-4-sonnet-20250514",
    extendedThinking: {
      budgetTokens: 8000,
      includeSummary: true,
    }
  }),
  systemPrompt: "You are a coding expert. Think deeply about problems."
});

// Advanced with Interleaved Thinking + Tools
// (codeAnalysisTool and searchTool are tools defined elsewhere)
const claude4Advanced = new Agent({
  driver: new AnthropicThreadDriver({
    model: "claude-4-sonnet-20250514",
    extendedThinking: {
      budgetTokens: 16000,
      includeSummary: true,
    },
    interleavedThinking: true,
  }),
  tools: [codeAnalysisTool, searchTool],
  systemPrompt: "Think between tool calls to make better decisions."
});
```
#### Pricing & Performance
- **Pricing**: $3/$15 per million tokens (input/output)
- **Context**: 200K-token context window
- **Output**: Up to 64K tokens
- **Performance**: 72.7% on SWE-bench Verified, as reported by Anthropic (the 72.5% SWE-bench and 43.2% Terminal-bench figures apply to Claude Opus 4)
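The pricing above turns into a per-request estimate with simple arithmetic. The sketch below assumes thinking tokens are billed at the output rate, which is how Anthropic bills extended thinking. The `Pricing` type and `estimateCostUSD` function are hypothetical helpers, not part of Improv.

```typescript
// USD per million tokens.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Rates taken from the pricing line above.
const CLAUDE_4_SONNET_PRICING: Pricing = { inputPerMillion: 3, outputPerMillion: 15 };

// Thinking tokens are assumed to be billed at the output rate.
function estimateCostUSD(
  inputTokens: number,
  outputTokens: number,
  thinkingTokens = 0,
  pricing: Pricing = CLAUDE_4_SONNET_PRICING,
): number {
  const inputCost = (inputTokens / 1e6) * pricing.inputPerMillion;
  const outputCost = ((outputTokens + thinkingTokens) / 1e6) * pricing.outputPerMillion;
  return inputCost + outputCost;
}

// 10,000 input tokens, 2,000 answer tokens, and a fully used 16,000-token thinking budget.
const estimate = estimateCostUSD(10000, 2000, 16000);
console.log(`Estimated cost: $${estimate.toFixed(4)}`);
```

In this example the thinking budget dominates: about $0.27 of the roughly $0.30 total comes from output-rate tokens, so `budgetTokens` is the main cost lever.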
## 🧠 Qwen 3 Models - Deep Dive
### Latest Models (2025)
Qwen 3 introduces hybrid reasoning modes with fine-grained control over thinking behavior.
#### Model Availability
```typescript
// New Qwen 3 models added to CerebrasModel type
| "qwen-3-32b" // Qwen 3 32B (April 2025)
| "qwen-3-235b-a22b-thinking-2507" // Qwen 3 235B Thinking (July 2025)
| "qwen-3-30b-a3b-thinking-2507" // Qwen 3 30B Thinking (July 2025)
| "qwq-32b-preview" // QwQ 32B Preview (Nov 2024)
```
#### Reasoning Configuration
```typescript
const driver = new CerebrasThreadDriver({
  model: "qwen-3-32b",
  reasoning: {
    enableThinking: true, // Enable <think> tags
    maxCompletionTokens: 64000, // High limit for thinking
    softControl: true, // Support /think and /no_think
  }
});
```
#### Key Features
1. **Hybrid Reasoning Modes**
   - **Thinking Mode**: Step-by-step reasoning with `<think>...</think>` tags
   - **Non-Thinking Mode**: Quick responses without visible reasoning
2. **Granular Control**
   - `enableThinking: false` - Aligns with Qwen2.5-Instruct behavior
   - `enableThinking: true` - Enables full reasoning mode
   - Smart defaults: enabled for Qwen 3; the flag is not applied to pure QwQ models, which always reason
3. **Soft Control Instructions**
   - `/think` - Enable thinking for a specific request
   - `/no_think` - Disable thinking for a specific request
4. **Model Variants**
   - **Standard Qwen 3**: Controllable thinking
   - **Thinking Models**: Always-on reasoning (235B, 30B thinking variants)
   - **QwQ Models**: Pure reasoning models (cannot disable)
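The variant rules above can be expressed as a small resolver. This is a minimal sketch that classifies model IDs with a naive string heuristic; `thinkingModeFor` and `resolveThinking` are hypothetical names and do not reflect how Improv's Cerebras driver decides internally.

```typescript
type ThinkingMode = "controllable" | "always-on";

// Classifies a model ID using the variant rules listed above.
// The matching is a naive string heuristic, assumed for illustration.
function thinkingModeFor(model: string): ThinkingMode {
  const id = model.toLowerCase();
  if (id.startsWith("qwq-") || id.includes("-thinking")) {
    return "always-on";
  }
  return "controllable";
}

// Resolves whether thinking is effectively enabled for a request.
function resolveThinking(model: string, enableThinking?: boolean): boolean {
  if (thinkingModeFor(model) === "always-on") {
    return true; // the flag cannot disable reasoning on these models
  }
  return enableThinking ?? true; // standard Qwen 3 defaults to thinking on
}

console.log(resolveThinking("qwen-3-32b"));                            // true
console.log(resolveThinking("qwen-3-32b", false));                     // false
console.log(resolveThinking("qwen-3-235b-a22b-thinking-2507", false)); // true
console.log(resolveThinking("qwq-32b-preview", false));                // true
```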
#### Usage Examples
```typescript
// Controllable Thinking
const qwen3Flexible = new Solo({
  driver: new CerebrasThreadDriver({
    model: "qwen-3-32b",
    reasoning: {
      enableThinking: true,
      maxCompletionTokens: 32000,
      softControl: true,
    }
  })
});

// Use soft control in requests
await qwen3Flexible.ask("/think Analyze this complex algorithm step by step...");
await qwen3Flexible.ask("/no_think What's 2+2?");

// Always-On Reasoning Model
const qwen3Thinking = new Agent({
  driver: new CerebrasThreadDriver({
    model: "qwen-3-235b-a22b-thinking-2507",
    reasoning: {
      maxCompletionTokens: 64000, // High limit for verbose reasoning
    }
  }),
  systemPrompt: "You are an expert reasoning AI."
});

// Pure Reasoning Model (QwQ)
const qwqReasoning = new Solo({
  driver: new CerebrasThreadDriver({
    model: "qwq-32b-preview",
    reasoning: {
      maxCompletionTokens: 64000, // Cannot disable reasoning
    }
  })
});
```
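When prompts come from user input, it may help to apply the soft-control instruction in one place instead of concatenating strings at each call site. The helper below is a sketch under that assumption; `withSoftControl` is a hypothetical name, and it prefixes the instruction the same way the examples above do.

```typescript
type SoftControl = "think" | "no_think";

// Prefixes a prompt with a soft-control instruction, replacing any existing one.
function withSoftControl(prompt: string, mode: SoftControl): string {
  const stripped = prompt.replace(/^\s*\/(?:no_think|think)\b\s*/, "");
  return `/${mode} ${stripped}`;
}

console.log(withSoftControl("What's 2+2?", "no_think"));
// → /no_think What's 2+2?
console.log(withSoftControl("/no_think Prove the lemma.", "think"));
// → /think Prove the lemma.
```

Stripping an existing instruction first avoids sending contradictory prefixes such as `/think /no_think ...`.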
#### Performance & Specifications
- **Context**: 128K tokens (except smaller models)
- **Training**: 36 trillion tokens in 119 languages
- **License**: Apache 2.0
- **Sampling**: temperature=0.6, top_p=0.95 (the Qwen team's recommended settings for thinking mode)
- **Benchmarks**: Reported by the Qwen team to match or exceed o1 on several reasoning benchmarks
## 🔧 Complete API Parameters Reference
### OpenAI Driver
```typescript
interface OpenAIConfig {
  model?: OpenAIModel; // Now includes o1-preview, o1-mini, o1
  reasoning?: {
    reasoningEffort?: "low" | "medium" | "high"; // o1 full model only
    maxCompletionTokens?: number; // For o1 models
    trackReasoningTokens?: boolean; // Track reasoning cost
  };
}

// Usage
const openai = new OpenAIThreadDriver({
  model: "o1-preview",
  reasoning: {
    maxCompletionTokens: 8000,
    trackReasoningTokens: true,
  }
});
```
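For context on what tracking reasoning tokens involves: OpenAI reports them in the response's `usage.completion_tokens_details.reasoning_tokens` field, and they are counted inside `completion_tokens`. The sketch below separates reasoning from visible output using that field. `summarizeReasoningUsage` is a hypothetical helper, and how Improv surfaces these numbers when `trackReasoningTokens` is set is not shown here.

```typescript
// Subset of the `usage` object returned by OpenAI chat completions.
interface CompletionUsage {
  prompt_tokens: number;
  completion_tokens: number;
  completion_tokens_details?: { reasoning_tokens?: number };
}

interface ReasoningUsage {
  reasoningTokens: number;
  visibleTokens: number;
  reasoningShare: number; // fraction of completion tokens spent on reasoning
}

function summarizeReasoningUsage(usage: CompletionUsage): ReasoningUsage {
  const reasoningTokens = usage.completion_tokens_details?.reasoning_tokens ?? 0;
  const visibleTokens = usage.completion_tokens - reasoningTokens;
  const reasoningShare =
    usage.completion_tokens > 0 ? reasoningTokens / usage.completion_tokens : 0;
  return { reasoningTokens, visibleTokens, reasoningShare };
}

// Illustrative numbers, not real API output.
const summary = summarizeReasoningUsage({
  prompt_tokens: 120,
  completion_tokens: 2000,
  completion_tokens_details: { reasoning_tokens: 1500 },
});
console.log(summary);
// → { reasoningTokens: 1500, visibleTokens: 500, reasoningShare: 0.75 }
```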
### Anthropic Driver
```typescript
interface AnthropicConfig {
  model?: AnthropicModel; // Now includes claude-4-sonnet-20250514
  extendedThinking?: {
    budgetTokens?: number; // Max reasoning tokens (up to 16k)
    includeSummary?: boolean; // Get thinking summary
  };
  interleavedThinking?: boolean; // Think between tool calls
}

// Usage
const anthropic = new AnthropicThreadDriver({
  model: "claude-4-sonnet-20250514",
  extendedThinking: {
    budgetTokens: 16000,
    includeSummary: true,
  },
  interleavedThinking: true,
});
```
### Gemini Driver
```typescript
interface GeminiConfig {
  thinking?: {
    thinkingBudget?: number; // -1=dynamic, 0=off, >0=specific
    includeThoughts?: boolean; // Get thought summaries
  };
}

// Usage
const gemini = new GeminiThreadDriver({
  model: "gemini-2.5-flash",
  thinking: {
    thinkingBudget: 2048, // Specific token budget
    includeThoughts: true,
  }
});
```
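The three `thinkingBudget` cases (-1, 0, and a positive value) are easy to mix up, so a small validator can help. The sketch below classifies a budget and builds the `thinkingConfig` fragment that the Gemini API accepts in its generation config. `classifyThinkingBudget` and `toThinkingConfig` are hypothetical names, and model-specific upper limits are deliberately not checked.

```typescript
type BudgetMode = "dynamic" | "off" | "fixed";

// Maps a thinkingBudget value onto the three modes described above.
function classifyThinkingBudget(thinkingBudget: number): BudgetMode {
  if (!Number.isInteger(thinkingBudget) || thinkingBudget < -1) {
    throw new RangeError("thinkingBudget must be -1, 0, or a positive integer");
  }
  if (thinkingBudget === -1) return "dynamic";
  if (thinkingBudget === 0) return "off";
  return "fixed";
}

// Builds the thinkingConfig fragment used in a Gemini generation config.
function toThinkingConfig(thinkingBudget: number, includeThoughts = false) {
  classifyThinkingBudget(thinkingBudget); // throws on invalid values
  return { thinkingConfig: { thinkingBudget, includeThoughts } };
}

console.log(classifyThinkingBudget(-1));   // dynamic
console.log(classifyThinkingBudget(0));    // off
console.log(classifyThinkingBudget(2048)); // fixed
console.log(JSON.stringify(toThinkingConfig(2048, true)));
```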
### Cerebras Driver
```typescript
interface CerebrasConfig {
  model?: CerebrasModel; // Now includes all Qwen 3 variants
  reasoning?: {
    enableThinking?: boolean; // Enable/disable thinking
    maxCompletionTokens?: number; // High limits for thinking models
    softControl?: boolean; // Support /think /no_think
  };
}

// Usage
const cerebras = new CerebrasThreadDriver({
  model: "qwen-3-32b",
  reasoning: {
    enableThinking: true,
    maxCompletionTokens: 32000,
    softControl: true,
  }
});
```
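With `enableThinking` on, Qwen 3 output carries its reasoning inside `<think>...</think>` tags, so callers usually want the reasoning and the final answer separated. The function below is a minimal sketch of that split for complete (non-streaming) text. `splitThinking` is a hypothetical helper and is not Improv's `ReasoningExtractor`; it does not handle unclosed tags.

```typescript
interface SplitOutput {
  reasoning: string;
  answer: string;
}

// Pulls every <think>...</think> block out of the text and returns the rest as the answer.
function splitThinking(raw: string): SplitOutput {
  const blocks: string[] = [];
  const answer = raw
    .replace(/<think>([\s\S]*?)<\/think>/g, (_match, inner: string) => {
      blocks.push(inner.trim());
      return "";
    })
    .trim();
  return { reasoning: blocks.join("\n\n"), answer };
}

const { reasoning, answer } = splitThinking(
  "<think>2 + 2 is basic addition.</think>\nThe answer is 4.",
);
console.log(reasoning); // 2 + 2 is basic addition.
console.log(answer);    // The answer is 4.
```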
## Advanced Usage Patterns
### 1. Hybrid Reasoning Agent
```typescript
const hybridAgent = new Agent({
  driver: new AnthropicThreadDriver({
    model: "claude-4-sonnet-20250514",
    extendedThinking: {
      budgetTokens: 12000,
      includeSummary: true,
    },
    interleavedThinking: true,
  }),
  tools: [searchTool, calculatorTool],
  systemPrompt: `You can switch between fast and deep thinking modes:
- For simple queries, respond quickly
- For complex problems, use extended thinking
- Think between tool calls to make better decisions`
});
```
### 2. Adaptive Reasoning with Qwen
```typescript
const adaptiveQwen = new Solo({
  driver: new CerebrasThreadDriver({
    model: "qwen-3-32b",
    reasoning: {
      enableThinking: true,
      softControl: true,
      maxCompletionTokens: 32000,
    }
  })
});

// Adaptive usage
await adaptiveQwen.ask("/think Solve this complex math problem..."); // Deep thinking
await adaptiveQwen.ask("/no_think What's the weather like?"); // Quick response
```
### 3. Multi-Model Reasoning Pipeline
```typescript
const reasoningPipeline = {
  quickCheck: new Solo({
    driver: new OpenAIThreadDriver({ model: "gpt-4o-mini" })
  }),
  deepThinking: new Solo({
    driver: new AnthropicThreadDriver({
      model: "claude-4-sonnet-20250514",
      extendedThinking: { budgetTokens: 16000 }
    })
  }),
  mathematicalReasoning: new Solo({
    driver: new CerebrasThreadDriver({
      model: "qwen-3-235b-a22b-thinking-2507",
      reasoning: { maxCompletionTokens: 64000 }
    })
  })
};

// Route requests based on complexity
async function routeReasoning(query: string, complexity: 'simple' | 'complex' | 'mathematical') {
  switch (complexity) {
    case 'simple': return reasoningPipeline.quickCheck.ask(query);
    case 'complex': return reasoningPipeline.deepThinking.ask(query);
    case 'mathematical': return reasoningPipeline.mathematicalReasoning.ask(query);
  }
}
```
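The router above expects the caller to supply a complexity label. One way to produce that label is a cheap keyword heuristic, sketched below. The keyword lists, the word-count threshold, and the `classifyComplexity` name are all assumptions chosen for illustration; a production setup would more likely use a small classifier model.

```typescript
type Complexity = "simple" | "complex" | "mathematical";

// Illustrative keyword lists; tune these for your own traffic.
const MATH_KEYWORDS = /\b(prove|theorem|integral|derivative|equation)\b/;
const COMPLEX_KEYWORDS = /\b(analyze|design|refactor|debug|compare|step by step)\b/;
const LONG_QUERY_WORDS = 40;

function classifyComplexity(query: string): Complexity {
  const text = query.toLowerCase();
  if (MATH_KEYWORDS.test(text)) return "mathematical";
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  if (COMPLEX_KEYWORDS.test(text) || wordCount > LONG_QUERY_WORDS) return "complex";
  return "simple";
}

console.log(classifyComplexity("What's the capital of France?"));                   // simple
console.log(classifyComplexity("Analyze this module and refactor it."));            // complex
console.log(classifyComplexity("Prove that the sum of two even numbers is even.")); // mathematical
```

The result can be passed straight through, for example `routeReasoning(query, classifyComplexity(query))`.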
## 🧪 Testing & Validation
The comprehensive test suite validates all reasoning implementations:
```bash
# Run the complete reasoning model test
bun example/test-reasoning-models-comprehensive.ts
# Test specific models
ANTHROPIC_API_KEY=your_key bun example/test-reasoning-models-comprehensive.ts
CEREBRAS_API_KEY=your_key bun example/test-reasoning-models-comprehensive.ts
```
### Test Coverage
- ✅ All reasoning extraction formats
- ✅ Claude 4 Sonnet extended thinking
- ✅ Qwen 3 hybrid reasoning modes
- ✅ OpenAI o1 reasoning tokens
- ✅ Gemini thinking budget control
- ✅ Tool usage with reasoning
- ✅ Streaming with reasoning extraction
## Reasoning Model Comparison
| Model | Reasoning Control | Best For | Cost (input/output, USD per 1M tokens where listed) | Speed |
|-------|------------------|----------|------|-------|
| **Claude 4 Sonnet** | Extended thinking budget | Complex reasoning, coding | $3/$15 | Hybrid |
| **Qwen 3-32B** | Enable/disable thinking | Flexible use cases | Low | Fast |
| **Qwen 3-235B Thinking** | Always-on reasoning | Deep analysis | Medium | Slow |
| **o1-preview** | Built-in reasoning | Scientific problems | $15/$60 | Slow |
| **o1-mini** | Built-in reasoning | Math, coding | $3/$12 | Medium |
| **Gemini 2.5 Flash** | Thinking budget | Cost-effective reasoning | Low | Fast |
## 🎯 Recommendations
### For Your Primary Use Cases
1. **Claude 4 Sonnet**: Best for complex coding and reasoning tasks
   - Extended thinking with tool use
   - High-quality reasoning with summarization
   - Excellent for agentic workflows
2. **Qwen 3 Models**: Most flexible reasoning control
   - Hybrid modes for different complexity levels
   - Cost-effective for varied workloads
   - Great soft control with `/think` commands
### Migration Guide
1. **Update model types** - All new models are now available
2. **Add reasoning configs** - Configure thinking behavior per use case
3. **Test reasoning extraction** - Verify ReasoningExtractor works correctly
4. **Update system prompts** - Optimize for new reasoning capabilities
This implementation gives Improv fine-grained control over thinking behavior across all supported reasoning providers.