# Findings: Code Execution with MCP Experiment
**Date**: November 6, 2025
**Experiment**: Progressive tool discovery for Chrome DevTools MCP
**Goal**: Test Anthropic's code execution pattern to reduce context consumption
---
## Executive Summary
We successfully implemented Anthropic's code execution pattern for Chrome DevTools MCP and measured **96.2% token reduction** (17,500 tokens → 667 tokens) for a typical browser automation task.
The approach works. Progressive discovery is viable. But it adds operational complexity that needs careful consideration.
---
## What We Built
### 1. Filesystem Structure Generator
- Converts 26 MCP tools into organized TypeScript API files
- Categories: input, navigation, debugging, performance, network, emulation
- Each tool gets its own file with documentation and type signatures
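A minimal sketch of what the per-tool file generation could look like, assuming each tool arrives with a name, a description, and a JSON-Schema-style `inputSchema`. The `ToolDefinition` shape, the `renderToolFile` helper, and the `callTool` function referenced in the generated text are illustrative names, not the generator's actual code.

```typescript
// Hypothetical shape of one tool as reported by the MCP server's tool listing.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    properties?: Record<string, { type?: string; description?: string }>;
    required?: string[];
  };
}

// Render the source text of one generated file, e.g. api/navigation/navigate_page.ts.
// The emitted function uses `any`, mirroring the untyped output described in this document.
function renderToolFile(category: string, tool: ToolDefinition): string {
  const props = tool.inputSchema.properties || {};
  const required = tool.inputSchema.required || [];
  const paramDocs = Object.keys(props).map((key) => {
    const optional = required.indexOf(key) === -1 ? " (optional)" : "";
    const description = props[key].description ? " " + props[key].description : "";
    return " * @param params." + key + optional + description;
  });
  return ["// api/" + category + "/" + tool.name + ".ts", "/**", " * " + tool.description]
    .concat(paramDocs, [
      " */",
      "export async function " + tool.name + "(params: any): Promise<any> {",
      // `callTool` is an assumed helper that forwards the call to the MCP server.
      "  return callTool(" + JSON.stringify(tool.name) + ", params);",
      "}",
      "",
    ])
    .join("\n");
}
```

Keeping the rendered file this small is what makes an individual tool cheap to read on demand compared with its full MCP schema.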
### 2. MCP Execution Environment
- Spawns Chrome DevTools MCP server via stdio
- Provides API proxy that maps function calls to MCP tool invocations
- Executes user code with access to the full API
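The API proxy can be sketched with nested `Proxy` objects, assuming a `callTool` transport function that sends one request to the server and resolves with its result. `createApi`, `CallTool`, and `toToolCallRequest` are hypothetical names for illustration. The request envelope follows MCP's JSON-RPC `tools/call` method.

```typescript
// Transport the proxy delegates to. Hypothetical: the executor's real function may differ.
type CallTool = (toolName: string, args: Record<string, unknown>) => Promise<unknown>;
type ToolFn = (args?: Record<string, unknown>) => Promise<unknown>;
type Api = Record<string, Record<string, ToolFn>>;

// JSON-RPC 2.0 envelope for MCP's `tools/call` method.
function toToolCallRequest(id: number, toolName: string, args: Record<string, unknown>) {
  return { jsonrpc: "2.0", id: id, method: "tools/call", params: { name: toolName, arguments: args } };
}

// Build `api.<category>.<tool>(args)` lazily; unknown names fail with a descriptive error.
function createApi(categories: Record<string, string[]>, callTool: CallTool): Api {
  return new Proxy({} as Api, {
    get(_target, category) {
      if (typeof category !== "string") return undefined; // ignore symbol lookups
      if (!Object.prototype.hasOwnProperty.call(categories, category)) {
        throw new Error("Unknown category: " + category);
      }
      const categoryName: string = category;
      const tools = categories[categoryName];
      return new Proxy({} as Record<string, ToolFn>, {
        get(_inner, tool) {
          if (typeof tool !== "string") return undefined;
          const toolName: string = tool;
          if (tools.indexOf(toolName) === -1) {
            throw new Error("Unknown tool " + categoryName + "." + toolName + "; available: " + tools.join(", "));
          }
          return (args: Record<string, unknown> = {}) => callTool(toolName, args);
        },
      });
    },
  });
}
```

With this shape, `api.navigation.navigate_page({ url })` forwards to `callTool("navigate_page", { url })`, and a misspelled tool name fails before any request is sent.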
### 3. Progressive Discovery Workflow
- Start: Read `api/index.ts` (~279 tokens) to discover categories
- Explore: Read category index (~87 tokens) to see available tools
- Learn: Read individual tool files (~127 tokens each) for documentation
- Execute: Write code using discovered APIs (~45 tokens)
---
## Measured Results
### Token Comparison
**Direct MCP Approach (measured via `/context` in Claude Code):**
- All 26 tool definitions loaded upfront: **17,500 tokens**
- Average per tool: ~673 tokens (ranging from 580-830 tokens)
- Typical task overhead: **17,500 tokens**
**Code Execution Approach:**
- Root index: 279 tokens
- Category index: 87 tokens
- 2 tool definitions: 256 tokens (concise TypeScript docs vs verbose MCP schemas)
- User code: 45 tokens
- **Total: 667 tokens**
**Savings: 16,833 tokens (96.2% reduction)**
### Scaling Analysis
As tasks require more tools, the gap narrows but savings remain significant:
| Tools Used | Direct MCP | Code Execution | Savings |
|------------|-----------|----------------|---------|
| 2 tools | 17,500 | 667 | 96.2% |
| 5 tools | 17,500 | 1,001 | 94.3% |
| 10 tools | 17,500 | 1,636 | 90.7% |
| All 26 | 17,500 | 3,668 | 79.0% |
Even a task that uses 10 of the 26 tools saves more than 90% of context, and loading all 26 through the code execution path still saves 79%.
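The percentages can be re-derived from the token counts reported in this section. The sketch below only repeats that arithmetic; the constant and function names are illustrative.

```typescript
// Baseline from the direct MCP measurement: all 26 tool definitions loaded upfront.
const DIRECT_MCP_TOKENS = 17500;

// Percentage of context saved versus the baseline, rounded to one decimal place.
function savingsPercent(codeExecutionTokens: number, directTokens: number = DIRECT_MCP_TOKENS): number {
  return Math.round((1 - codeExecutionTokens / directTokens) * 1000) / 10;
}

// Breakdown of the two-tool task.
const twoToolTask = { rootIndex: 279, categoryIndex: 87, toolDefinitions: 256, userCode: 45 };
const twoToolTotal =
  twoToolTask.rootIndex + twoToolTask.categoryIndex + twoToolTask.toolDefinitions + twoToolTask.userCode; // 667

// Code-execution token counts copied from the scaling table.
const scalingRows = [
  { toolsUsed: 2, tokens: 667 },
  { toolsUsed: 5, tokens: 1001 },
  { toolsUsed: 10, tokens: 1636 },
  { toolsUsed: 26, tokens: 3668 },
];
const scalingSavings = scalingRows.map((row) => savingsPercent(row.tokens)); // [96.2, 94.3, 90.7, 79]
```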
---
## What Worked
### ✅ Progressive Discovery
Claude can effectively navigate a filesystem structure to discover tools on-demand. The category organization (input, navigation, debugging) provides natural grouping that aligns with how developers think about browser automation.
### ✅ Token Efficiency
The numbers speak for themselves. For typical tasks (2-5 tools), you save 94-96% of context. Even a task that touches all 26 tools saves 79%.
### ✅ Code Execution Patterns
Writing code to orchestrate multiple tools is natural and more expressive than chaining individual tool calls. Local data processing, control flow, and state management work as expected.
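As an illustration of orchestration with local processing, the sketch below visits several pages and returns only the failed requests for each. `navigate_page` appears elsewhere in this document; the `api.network.list_network_requests` call and the `{ url, status }` result shape are assumptions made for the example, and the real response format may differ.

```typescript
// Assumed result shape, for illustration only.
interface RequestSummary {
  url: string;
  status: number;
}

// Local processing: keep only failed requests so the model sees a short summary.
function failedRequests(requests: RequestSummary[]): RequestSummary[] {
  return requests.filter((request) => request.status >= 400);
}

// Minimal view of the generated API that this workflow relies on (hypothetical typing).
interface WorkflowApi {
  navigation: { navigate_page(args: { url: string }): Promise<unknown> };
  network: { list_network_requests(args?: Record<string, unknown>): Promise<RequestSummary[]> };
}

// Loop over pages, call two tools per page, and return a compact per-page report.
async function auditPages(api: WorkflowApi, urls: string[]): Promise<Record<string, RequestSummary[]>> {
  const report: Record<string, RequestSummary[]> = {};
  for (const url of urls) {
    await api.navigation.navigate_page({ url: url });
    report[url] = failedRequests(await api.network.list_network_requests());
  }
  return report;
}
```

With chained tool calls, every intermediate request list would pass through the model's context. Here only the filtered report does.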
### ✅ API Design
The category-based namespacing (`api.navigation.navigate_page()`) is intuitive and prevents naming collisions. TypeScript files provide a natural documentation format.
### ✅ Actual MCP Integration
The executor successfully:
- Spawns the MCP server via stdio
- Handles JSON-RPC communication
- Maps API calls to tool invocations
- Returns results back to user code
---
## What Didn't Work (Or Needs Improvement)
### ⚠️ Parameter Mapping Complexity
Mapping positional arguments to named parameters is fragile:
```typescript
// This works (object with named params)
await api.navigation.navigate_page({ url: 'http://localhost:4321' });
// This is ambiguous (positional args)
await api.navigation.navigate_page('http://localhost:4321', { timeout: 5000 });
```
**Solution needed**: Full schema introspection from MCP to generate proper type definitions.
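Until schema introspection exists, one stopgap is to accept only the unambiguous form and reject positional calls with a clear message. This is a sketch under that assumption. `normalizeArgs` and its parameters are hypothetical names, not part of the current executor.

```typescript
type NamedArgs = Record<string, unknown>;

function isPlainObject(value: unknown): value is NamedArgs {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Accept exactly one object of named parameters (or no arguments at all).
// Anything else, such as ('http://...', { timeout: 5000 }), is rejected instead of guessed at.
function normalizeArgs(toolName: string, paramNames: string[], callArgs: unknown[]): NamedArgs {
  if (callArgs.length === 0) return {};
  const first = callArgs[0];
  if (callArgs.length === 1 && isPlainObject(first)) return first;
  throw new TypeError(
    toolName + " expects a single object of named parameters, e.g. { " + paramNames.join(", ") + " }"
  );
}
```

Failing loudly trades convenience for predictability: the ambiguous positional call above produces an error that names the expected parameters rather than a silently wrong mapping.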
### ⚠️ No Sandboxing
User code executes via `AsyncFunction` with full Node.js access. Production would need:
- Isolation at the process or container level (for example Docker); Node's built-in `vm` module is documented as not being a security mechanism
- Resource limits (memory, CPU, timeout)
- Filesystem restrictions
- Network access controls
### ⚠️ Error Handling
MCP errors need better propagation:
- Tool not found errors are unclear
- Timeout handling is basic
- No retry logic for transient failures
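A minimal sketch of retry with exponential backoff around a single tool call. Which failures count as transient is an assumption left to the caller through `isTransient`; `callWithRetry`, `backoffDelayMs`, and the default delays are illustrative choices rather than existing executor code.

```typescript
// Delay before the next attempt: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 200, maxMs: number = 5000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

interface RetryOptions {
  maxAttempts: number;
  // Caller decides which errors are worth retrying (assumption: no built-in classification).
  isTransient: (error: unknown) => boolean;
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

async function callWithRetry<T>(toolName: string, invoke: () => Promise<T>, options: RetryOptions): Promise<T> {
  const sleep = options.sleep || ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await invoke();
    } catch (error) {
      const lastAttempt = attempt + 1 >= options.maxAttempts;
      if (lastAttempt || !options.isTransient(error)) {
        // Wrap the failure so the message names the tool and the attempt count.
        const reason = error instanceof Error ? error.message : String(error);
        throw new Error("Tool " + toolName + " failed after " + (attempt + 1) + " attempt(s): " + reason);
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Including the tool name and attempt count in the final error also addresses the unclear-message problem for failures that are not retried.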
### ⚠️ Type Safety
Generated TypeScript files have `any` types. Better type definitions would:
- Catch parameter errors at "compile time" (in user code)
- Provide IDE autocomplete
- Document expected return types
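To make the gap concrete, here is a rough sketch of turning a tool's JSON-Schema-style input definition into a TypeScript interface instead of `any`. It handles only flat primitive properties and string enums; `renderParamsInterface` and the schema subset it accepts are assumptions for illustration, and real MCP schemas can be richer.

```typescript
// Simplified subset of JSON Schema, assumed for this sketch.
interface JsonSchemaProperty {
  type?: string;
  enum?: string[];
}
interface ObjectSchema {
  properties?: Record<string, JsonSchemaProperty>;
  required?: string[];
}

// Map one schema property to a TypeScript type expression.
function tsTypeFor(prop: JsonSchemaProperty): string {
  if (prop.enum && prop.enum.length > 0) {
    return prop.enum.map((value) => JSON.stringify(value)).join(" | ");
  }
  switch (prop.type) {
    case "string":
      return "string";
    case "number":
    case "integer":
      return "number";
    case "boolean":
      return "boolean";
    case "array":
      return "unknown[]";
    default:
      return "unknown"; // nested objects and unions are out of scope here
  }
}

// Render an interface whose optional markers follow the schema's `required` list.
function renderParamsInterface(interfaceName: string, schema: ObjectSchema): string {
  const props = schema.properties || {};
  const required = schema.required || [];
  const lines = Object.keys(props).map((key) => {
    const optional = required.indexOf(key) === -1 ? "?" : "";
    return "  " + key + optional + ": " + tsTypeFor(props[key]) + ";";
  });
  return ["export interface " + interfaceName + " {"].concat(lines, ["}"]).join("\n");
}
```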
### ⚠️ Debugging Experience
When code execution fails, debugging is harder than direct tool calls:
- Stack traces point to the executor, not user code
- No step-through debugging
- `console.log` is the main debugging tool
---
## Tradeoffs Analysis
### Pros
1. **Massive token savings** (94-96% for typical 2-5 tool tasks, 79% even with all 26 tools)
2. **Progressive disclosure** (only load what you need)
3. **Expressive code** (loops, conditionals, local processing)
4. **Natural organization** (categories align with mental models)
5. **Portable** (the pattern should apply to any MCP server, though only Chrome DevTools was tested here)
### Cons
1. **Operational complexity** (execution environment, sandboxing, monitoring)
2. **Debugging challenges** (harder to troubleshoot than direct calls)
3. **Type safety gaps** (needs schema introspection for proper types)
4. **Parameter mapping fragility** (positional vs named args)
5. **Initial setup cost** (filesystem generation, executor implementation)
---
## When To Use This Pattern
### Good Fit ✅
- **Complex workflows** that use multiple tools in sequence
- **Data processing tasks** that filter/transform results locally
- **Repeated automation** where token costs add up over time
- **Large MCP servers** with many tools (20+ tool definitions)
- **Context-constrained scenarios** where every token matters
### Poor Fit ❌
- **Single tool calls** (overhead not worth it)
- **Simple queries** (direct tool call is simpler)
- **Exploratory debugging** (interactive tool calls are faster)
- **Small MCP servers** (3-5 tools - just load them all)
- **Prototyping** (added complexity slows iteration)
---
## Comparison to Slash Command Isolation
In the [MCP isolation blog post](https://paddo.dev/blog/claude-code-mcp-context-isolation), we used slash commands to spawn separate Claude instances with isolated MCP configs. How do these approaches compare?
| Aspect | Slash Command Isolation | Code Execution |
|--------|------------------------|----------------|
| **Token savings** | Isolates the full MCP context (6k tokens in that post) from the main conversation | Reduces MCP overhead within a single context (667 vs 17,500 tokens in this experiment) |
| **Use case** | Self-contained tasks that report back | Multi-step workflows with local data processing |
| **Complexity** | Simple (bash script + MCP config) | Complex (execution environment + API generation) |
| **Control** | Explicit invocation via `/chrome` | Integrated into main conversation |
| **State** | Stateless (each invocation is fresh) | Stateful (execution environment persists) |
**They're complementary, not competing:**
- Use slash commands for **isolation** (Chrome debugging separate from coding)
- Use code execution for **efficiency** (reduce MCP overhead within a context)
- Could even combine them: a `/chrome-code` command that spawns an isolated environment with code execution
---
## Implementation Recommendations
### For Production Use
1. **Schema Introspection**
- Generate TypeScript types from MCP tool schemas
- Map parameters correctly (positional → named)
- Provide IDE autocomplete and type checking
2. **Sandboxing**
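- Use Docker containers or separate processes for isolation (`vm2` is a third-party package, not part of Node, and has a history of sandbox-escape vulnerabilities)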
- Enforce resource limits (memory, CPU, network)
- Restrict filesystem access to safe paths only
3. **Error Handling**
- Better error messages with context
- Retry logic for transient MCP failures
- Timeout handling at both code and tool levels
4. **Debugging Tools**
- Step-through debugging for user code
- Tool call logging and tracing
- Performance profiling for optimization
5. **Type Definitions**
- Full TypeScript types for all tools
- Return type documentation
- Parameter validation before MCP calls
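Parameter validation before MCP calls could start as small as the sketch below: it checks required keys, unknown keys, and primitive types against the tool's input schema and returns a list of problems rather than throwing. `validateArgs` and the schema subset are hypothetical; a production version would more likely rely on a full JSON Schema validator.

```typescript
// Simplified subset of a tool's input schema, assumed for this sketch.
interface ToolInputSchema {
  properties?: Record<string, { type?: string }>;
  required?: string[];
}

// Return human-readable problems; an empty array means the call can be forwarded.
function validateArgs(schema: ToolInputSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const props = schema.properties || {};
  for (const key of schema.required || []) {
    if (args[key] === undefined) problems.push("missing required parameter: " + key);
  }
  for (const key of Object.keys(args)) {
    if (!Object.prototype.hasOwnProperty.call(props, key)) {
      problems.push("unknown parameter: " + key);
      continue;
    }
    // JSON Schema "integer" maps to the JavaScript "number" type.
    const declared = props[key].type === "integer" ? "number" : props[key].type;
    const isPrimitive = declared === "string" || declared === "number" || declared === "boolean";
    if (isPrimitive && typeof args[key] !== declared) {
      problems.push("parameter " + key + " should be " + declared + ", got " + typeof args[key]);
    }
  }
  return problems;
}
```

Returning all problems at once lets the calling code surface a single, complete error message instead of failing on the first mismatch.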
### For Experimentation
Current implementation is sufficient for:
- Validating the concept
- Measuring token savings
- Testing progressive discovery patterns
- Prototyping automation workflows
---
## Next Steps
### Immediate
- [ ] Add proper TypeScript type generation from MCP schemas
- [ ] Implement basic sandboxing (timeout, memory limits)
- [ ] Create more example workflows to test edge cases
- [ ] Write blog post documenting the experiment
### Future
- [ ] Build this pattern into a reusable library
- [ ] Support multiple MCP servers in one environment
- [ ] Add state persistence (filesystem storage between runs)
- [ ] Create visual debugging tools
- [ ] Benchmark performance vs direct MCP calls
---
## Conclusion
**The experiment succeeded.** Code execution with progressive discovery reduced token usage by ~96% (17,500 → 667 tokens) for a typical two-tool browser automation task while maintaining full MCP capability.
**The theory is sound.** Anthropic's architectural pattern works as proposed. Presenting tools as code APIs enables on-demand discovery without upfront context bloat.
**The tradeoffs are real.** You exchange token efficiency for operational complexity. Sandboxing, debugging, and type safety all require additional engineering investment.
**The use case matters.** For complex workflows, repeated automation, or context-constrained scenarios, the savings justify the complexity. For simple tasks, direct MCP tool calls remain simpler.
**This validates context engineering.** Our previous posts documented real problems (context bloat, lack of control, unpredictability). This experiment shows those problems are solvable at the infrastructure level, not just through workflow discipline.
---
## Blog Post Outline
### Title Ideas
- "Code Execution with MCP: 96% Less Context, Infinitely More Complexity"
- "I Built Anthropic's Code Execution Pattern (Here's What I Learned)"
- "Progressive Tool Discovery: Solving the 6k Token MCP Problem"
### Structure
1. **Hook**: The 6k token problem we documented in the MCP isolation post
2. **The Theory**: Anthropic's code execution pattern explained
3. **The Experiment**: What we built and how it works
4. **The Results**: 96.2% token reduction (show the numbers)
5. **What Worked**: Progressive discovery, code patterns, token efficiency
6. **What Didn't**: Sandboxing, debugging, type safety gaps
7. **The Tradeoffs**: When it's worth it vs when it's not
8. **Relationship to Previous Work**: Slash commands vs code execution
9. **Honest Take**: This solves real problems but isn't a silver bullet
10. **Try It Yourself**: Link to repo, invite experimentation
### Key Points
- Lead with results (96% reduction)
- Show code examples (filesystem structure, user code)
- Compare to slash command isolation (complementary approaches)
- Be honest about complexity (not plug-and-play)
- Validate context engineering (you identified the problem, this is one solution)
- End with experiments readers can try
---
**Repository**: `/Users/paddo/Projects/mcp-code-wrapper`
**Working code**: Tested and functional
**Next**: Write the blog post