@stackmemoryai/stackmemory

Version:

Project-scoped memory for AI coding tools. Durable context across sessions with MCP integration, frames, smart retrieval, Claude Code skills, and automatic hooks.

github.com/stackmemoryai/stackmemory

stackmemoryai/stackmemory

276 lines (220 loc) • 8.76 kB

Markdown

# Prompt Forge (GEPA) **Genetic Eval-driven Prompt Algorithm** Automatically evolve and optimize your `CLAUDE.md` system prompts using AI-powered evolutionary algorithms. ## Auto-Optimize on Save ```bash # Start watcher - auto-optimizes when CLAUDE.md changes node scripts/gepa/hooks/auto-optimize.js watch ./CLAUDE.md ``` Output shows before/after comparison: ``` ╔════════════════════════════════════════════════════════════╗ ║ BEFORE / AFTER COMPARISON ║ ╠════════════════════════════════════════════════════════════╣ ║ Metric Before After Change ║ ╠════════════════════════════════════════════════════════════╣ ║ Lines 125 142 +17 (+14%) ║ ║ Est. Tokens 873 920 +47 (+5%) ║ ║ MUST rules 1 3 +2 (+200%) ║ ║ NEVER rules 3 5 +2 (+67%) ║ ╚════════════════════════════════════════════════════════════╝ Section Changes: Added: + Error Handling + Performance Guidelines Summary: Token budget: +47 tokens Rule density: +4 explicit rules ``` ## How It Works ``` ┌─────────────────────────────────────────────────────────────────┐ │ GEPA Loop │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Seed │───►│ Mutate │───►│ Eval │───►│ Select │ │ │ │ Prompt │ │ (AI gen) │ │ (Claude) │ │ (best) │ │ │ └──────────┘ └──────────┘ └──────────┘ └────┬─────┘ │ │ ▲ │ │ │ │ ┌──────────┐ │ │ │ └──────────────┤ Reflect │◄────────────────────┘ │ │ │(insights)│ │ │ └──────────┘ │ └─────────────────────────────────────────────────────────────────┘ ``` ## Quick Start ```bash # 1. Initialize with your current CLAUDE.md node scripts/gepa/optimize.js init ./CLAUDE.md # 2. Run full optimization (10 generations) node scripts/gepa/optimize.js run # 3. Apply the best result cp scripts/gepa/generations/current ./CLAUDE.md ``` ## Commands | Command | Description | |---------|-------------| | `init [path]` | Initialize with a CLAUDE.md file | | `mutate` | Generate new prompt variants | | `eval [variant]` | Run evals on a specific variant | | `score` | Score all variants and select best | | `run [N]` | Full optimization loop for N generations | | `status` | Show current optimization status | | `diff [a] [b]` | Compare two variants | ## Mutation Strategies GEPA uses 6 mutation strategies, cycling through them: 1. **rephrase** - Reword for clarity without changing meaning 2. **add_examples** - Add concrete examples where abstract 3. **remove_redundancy** - DRY up repetitive instructions 4. **restructure** - Reorganize for better flow 5. **add_constraints** - Add guardrails for failure modes 6. **simplify** - Break down complex rules ## Reflection (Key Innovation) Unlike random mutations, GEPA analyzes **why** prompts fail: ```bash # Analyze session patterns node scripts/gepa/hooks/reflect.js analyze # Generate targeted improvements node scripts/gepa/hooks/reflect.js reflect ``` The reflection engine examines: - Common error patterns - Tool call success rates - User feedback (thumbs up/down) - Performance by variant Then generates **targeted** mutation suggestions. ## Hook Integration To track real usage for evals, add to `~/.claude/settings.json`: ```json { "hooks": { "postToolCall": [ { "command": "node ~/.claude/gepa/hooks/eval-tracker.js track-tool" } ], "postSession": [ { "command": "node ~/.claude/gepa/hooks/eval-tracker.js save" } ] } } ``` Or use the StackMemory daemon integration: ```bash # Add to your .env GEPA_ENABLED=true GEPA_DIR=~/.claude/gepa ``` ## Configuration Edit `config.json`: ```json { "evolution": { "populationSize": 4, // Variants per generation "generations": 10, // Max generations "selectionRate": 0.5 // Top 50% survive }, "evals": { "minSamplesPerVariant": 5, // Evals per variant "timeout": 120000 // 2 min per eval }, "scoring": { "threshold": 0.8 // Stop when 80% success } } ``` ## Writing Good Evals Evals live in `evals/*.jsonl`: ```json { "id": "eval-001", "name": "simple_function", "prompt": "Write a function that checks if a string is a palindrome", "expected": { "has_function": true, "handles_edge_cases": true }, "weight": 1.0 } ``` ### Expected Checks | Check | What It Looks For | |-------|-------------------| | `has_function` | Function definition in output | | `handles_edge_cases` | Null/empty/edge case handling | | `uses_async` | async/await usage | | `bug_fixed` | Fix-related language | | `explains_fix` | Explanation of changes | | Custom key | Looks for key as substring | ## Directory Structure ``` scripts/gepa/ ├── config.json # Settings ├── state.json # Current state ├── optimize.js # Main optimizer ├── hooks/ │ ├── eval-tracker.js # Session tracking hook │ └── reflect.js # Reflection engine ├── evals/ │ └── coding-tasks.jsonl ├── generations/ │ ├── gen-000/ │ │ └── baseline.md │ ├── gen-001/ │ │ ├── variant-a.md │ │ ├── variant-b.md │ │ └── baseline.md │ └── current -> gen-001/variant-a.md └── results/ ├── scores.jsonl └── sessions/ ``` ## Best Practices 1. **Start with good evals** - Garbage in, garbage out 2. **Run multiple generations** - Improvements compound 3. **Review diffs** - Understand what changed 4. **Keep baseline** - Always compare against original 5. **Monitor for drift** - Watch for unintended changes ## Example Output ``` $ node optimize.js run 3 ============================================================ GENERATION 1/3 ============================================================ Generating 4 variants for generation 1... Creating variant-a using strategy: rephrase Creating variant-b using strategy: add_examples Creating variant-c using strategy: remove_redundancy Creating variant-d using strategy: restructure Scoring 5 variants in generation 1... Running evals on baseline... Score: 65.0% Running evals on variant-a... Score: 72.0% Running evals on variant-b... Score: 78.0% Running evals on variant-c... Score: 70.0% Running evals on variant-d... Score: 68.0% Results: 1. variant-b: 78.0% <-- BEST 2. variant-a: 72.0% 3. variant-c: 70.0% 4. variant-d: 68.0% 5. baseline: 65.0% New best: variant-b (78.0%) ============================================================ OPTIMIZATION COMPLETE ============================================================ Best variant: variant-b Best score: 85.2% Generations: 3 To apply: cp generations/current /path/to/your/CLAUDE.md ``` ## Troubleshooting **"claude CLI not found"** Set `ANTHROPIC_API_KEY` for API fallback. **Slow evals** Reduce `minSamplesPerVariant` in config. **Poor results** Add more diverse evals covering failure modes. **Drift from original intent** Add evals that test for desired behaviors explicitly.