mcp-code-wrapper

# Findings: Code Execution with MCP Experiment **Date**: November 6, 2025 **Experiment**: Progressive tool discovery for Chrome DevTools MCP **Goal**: Test Anthropic's code execution pattern to reduce context consumption --- ## Executive Summary We successfully implemented Anthropic's code execution pattern for Chrome DevTools MCP and measured **96.2% token reduction** (17,500 tokens → 667 tokens) for a typical browser automation task. The approach works. Progressive discovery is viable. But it adds operational complexity that needs careful consideration. --- ## What We Built ### 1. Filesystem Structure Generator - Converts 26 MCP tools into organized TypeScript API files - Categories: input, navigation, debugging, performance, network, emulation - Each tool gets its own file with documentation and type signatures ### 2. MCP Execution Environment - Spawns Chrome DevTools MCP server via stdio - Provides API proxy that maps function calls to MCP tool invocations - Executes user code with access to the full API ### 3. Progressive Discovery Workflow - Start: Read `api/index.ts` (~279 tokens) to discover categories - Explore: Read category index (~87 tokens) to see available tools - Learn: Read individual tool files (~127 tokens each) for documentation - Execute: Write code using discovered APIs (~45 tokens) --- ## Measured Results ### Token Comparison **Direct MCP Approach (measured via `/context` in Claude Code):** - All 26 tool definitions loaded upfront: **17,500 tokens** - Average per tool: ~673 tokens (ranging from 580-830 tokens) - Typical task overhead: **17,500 tokens** **Code Execution Approach:** - Root index: 279 tokens - Category index: 87 tokens - 2 tool definitions: 256 tokens (concise TypeScript docs vs verbose MCP schemas) - User code: 45 tokens - **Total: 667 tokens** **Savings: 16,833 tokens (96.2% reduction)** ### Scaling Analysis As tasks require more tools, the gap narrows but savings remain significant: | Tools Used | Direct MCP | Code Execution | Savings | |------------|-----------|----------------|---------| | 2 tools | 17,500 | 667 | 96.2% | | 5 tools | 17,500 | 1,001 | 94.3% | | 10 tools | 17,500 | 1,636 | 90.7% | | All 26 | 17,500 | 3,668 | 79.0% | Even if you use half the available tools, you save 90%+ of context. --- ## What Worked ### ✅ Progressive Discovery Claude can effectively navigate a filesystem structure to discover tools on-demand. The category organization (input, navigation, debugging) provides natural grouping that aligns with how developers think about browser automation. ### ✅ Token Efficiency The numbers speak for themselves. For typical tasks (2-5 tools), you save 80-90% of context. Even complex tasks with many tools show significant savings. ### ✅ Code Execution Patterns Writing code to orchestrate multiple tools is natural and more expressive than chaining individual tool calls. Local data processing, control flow, and state management work as expected. ### ✅ API Design The category-based namespacing (`api.navigation.navigate_page()`) is intuitive and prevents naming collisions. TypeScript files provide natural documentation format. ### ✅ Actual MCP Integration The executor successfully: - Spawns the MCP server via stdio - Handles JSON-RPC communication - Maps API calls to tool invocations - Returns results back to user code --- ## What Didn't Work (Or Needs Improvement) ### ⚠️ Parameter Mapping Complexity Mapping positional arguments to named parameters is fragile: ```typescript // This works (object with named params) await api.navigation.navigate_page({ url: 'http://localhost:4321' }); // This is ambiguous (positional args) await api.navigation.navigate_page('http://localhost:4321', { timeout: 5000 }); ``` **Solution needed**: Full schema introspection from MCP to generate proper type definitions. ### ⚠️ No Sandboxing User code executes via `AsyncFunction` with full Node.js access. Production would need: - VM isolation (Node's `vm` module or Docker containers) - Resource limits (memory, CPU, timeout) - Filesystem restrictions - Network access controls ### ⚠️ Error Handling MCP errors need better propagation: - Tool not found errors are unclear - Timeout handling is basic - No retry logic for transient failures ### ⚠️ Type Safety Generated TypeScript files have `any` types. Better type definitions would: - Catch parameter errors at "compile time" (in user code) - Provide IDE autocomplete - Document expected return types ### ⚠️ Debugging Experience When code execution fails, debugging is harder than direct tool calls: - Stack traces point to the executor, not user code - No step-through debugging - Console.log is the main debugging tool --- ## Tradeoffs Analysis ### Pros 1. **Massive token savings** (80-90% for typical tasks) 2. **Progressive disclosure** (only load what you need) 3. **Expressive code** (loops, conditionals, local processing) 4. **Natural organization** (categories align with mental models) 5. **Future-proof** (patterns work for any MCP, not just Chrome DevTools) ### Cons 1. **Operational complexity** (execution environment, sandboxing, monitoring) 2. **Debugging challenges** (harder to troubleshoot than direct calls) 3. **Type safety gaps** (needs schema introspection for proper types) 4. **Parameter mapping fragility** (positional vs named args) 5. **Initial setup cost** (filesystem generation, executor implementation) --- ## When To Use This Pattern ### Good Fit ✅ - **Complex workflows** that use multiple tools in sequence - **Data processing tasks** that filter/transform results locally - **Repeated automation** where token costs add up over time - **Large MCP servers** with many tools (20+ tool definitions) - **Context-constrained scenarios** where every token matters ### Poor Fit ❌ - **Single tool calls** (overhead not worth it) - **Simple queries** (direct tool call is simpler) - **Exploratory debugging** (interactive tool calls are faster) - **Small MCP servers** (3-5 tools - just load them all) - **Prototyping** (added complexity slows iteration) --- ## Comparison to Slash Command Isolation In the [MCP isolation blog post](https://paddo.dev/blog/claude-code-mcp-context-isolation), we used slash commands to spawn separate Claude instances with isolated MCP configs. How do these approaches compare? | Aspect | Slash Command Isolation | Code Execution | |--------|------------------------|----------------| | **Token savings** | Isolates full MCP context (6k tokens) from main conversation | Reduces MCP overhead within a single context (667 vs 6k) | | **Use case** | Self-contained tasks that report back | Multi-step workflows with local data processing | | **Complexity** | Simple (bash script + MCP config) | Complex (execution environment + API generation) | | **Control** | Explicit invocation via `/chrome` | Integrated into main conversation | | **State** | Stateless (each invocation is fresh) | Stateful (execution environment persists) | **They're complementary, not competing:** - Use slash commands for **isolation** (Chrome debugging separate from coding) - Use code execution for **efficiency** (reduce MCP overhead within a context) - Could even combine them: `/chrome-code` that spawns isolated environment with code execution --- ## Implementation Recommendations ### For Production Use 1. **Schema Introspection** - Generate TypeScript types from MCP tool schemas - Map parameters correctly (positional → named) - Provide IDE autocomplete and type checking 2. **Sandboxing** - Use Node's `vm2` or Docker containers - Enforce resource limits (memory, CPU, network) - Restrict filesystem access to safe paths only 3. **Error Handling** - Better error messages with context - Retry logic for transient MCP failures - Timeout handling at both code and tool levels 4. **Debugging Tools** - Step-through debugging for user code - Tool call logging and tracing - Performance profiling for optimization 5. **Type Definitions** - Full TypeScript types for all tools - Return type documentation - Parameter validation before MCP calls ### For Experimentation Current implementation is sufficient for: - Validating the concept - Measuring token savings - Testing progressive discovery patterns - Prototyping automation workflows --- ## Next Steps ### Immediate - [ ] Add proper TypeScript type generation from MCP schemas - [ ] Implement basic sandboxing (timeout, memory limits) - [ ] Create more example workflows to test edge cases - [ ] Write blog post documenting the experiment ### Future - [ ] Build this pattern into a reusable library - [ ] Support multiple MCP servers in one environment - [ ] Add state persistence (filesystem storage between runs) - [ ] Create visual debugging tools - [ ] Benchmark performance vs direct MCP calls --- ## Conclusion **The experiment succeeded.** Code execution with progressive discovery reduces token usage by ~89% for typical browser automation tasks while maintaining full MCP capability. **The theory is sound.** Anthropic's architectural pattern works as proposed. Presenting tools as code APIs enables on-demand discovery without upfront context bloat. **The tradeoffs are real.** You exchange token efficiency for operational complexity. Sandboxing, debugging, and type safety all require additional engineering investment. **The use case matters.** For complex workflows, repeated automation, or context-constrained scenarios, the savings justify the complexity. For simple tasks, direct MCP tool calls remain simpler. **This validates context engineering.** Your previous posts documented real problems (context bloat, lack of control, unpredictability). This experiment shows those problems are solvable at the infrastructure level, not just through workflow discipline. --- ## Blog Post Outline ### Title Ideas - "Code Execution with MCP: 89% Less Context, Infinite More Complexity" - "I Built Anthropic's Code Execution Pattern (Here's What I Learned)" - "Progressive Tool Discovery: Solving the 6k Token MCP Problem" ### Structure 1. **Hook**: The 6k token problem we documented in the MCP isolation post 2. **The Theory**: Anthropic's code execution pattern explained 3. **The Experiment**: What we built and how it works 4. **The Results**: 88.9% token reduction (show the numbers) 5. **What Worked**: Progressive discovery, code patterns, token efficiency 6. **What Didn't**: Sandboxing, debugging, type safety gaps 7. **The Tradeoffs**: When it's worth it vs when it's not 8. **Relationship to Previous Work**: Slash commands vs code execution 9. **Honest Take**: This solves real problems but isn't a silver bullet 10. **Try It Yourself**: Link to repo, invite experimentation ### Key Points - Lead with results (89% reduction) - Show code examples (filesystem structure, user code) - Compare to slash command isolation (complementary approaches) - Be honest about complexity (not plug-and-play) - Validate context engineering (you identified the problem, this is one solution) - End with experiments readers can try --- **Repository**: `/Users/paddo/Projects/mcp-code-wrapper` **Working code**: Tested and functional **Next**: Write the blog post