---
title: "AI Cost Optimization"
date: 2026-05-24
---

# AI Cost Optimization

This dotfiles framework ships an opinionated AI-cost layer designed to keep token spend predictable and low while still using the smartest model for orchestration.

Two ideas drive it:

1. **Delegate the grunt work.** Have the expensive smart model orchestrate, and a cheap fast model do the file reads, edits, and verification. The smart model sees one tool call and the final `git diff` instead of every intermediate read.
2. **Account every call.** Every invocation through `dot ai <provider>` appends a JSONL entry so you can see spend across all providers in one report, not just per-tool dashboards.

## Quick reference

```sh
dot ai delegate "rename every UserService method that starts with get to fetch"
dot ai cost              # all-time overview
dot ai cost --since 7    # last 7 days
dot ai cost --fails      # only failures + breakdown by failure type
```

Inside Claude Code: `/vibe`, `/vibe-report`, `/vibeon`, `/vibeoff`, `/vibestatus`, `/vibe-model-pick`, `/vibe-model-clear`.

## The delegator pattern

The pattern comes from [pcx-wave/vibe-skill][vibe-skill]. Mistral Vibe is the default delegate, but the same wrapper accepts any model Vibe knows about (DeepSeek V4 Flash, Gemini Flash, etc.).

[vibe-skill]: https://github.com/pcx-wave/vibe-skill

Cost comparison (May 2026 list prices in USD per million tokens, input / output; blended at 85% input / 15% output, typical of coding tasks):

| Task | Claude Sonnet 4.6 ($3 / $15) | Mistral Medium 3.5 ($1.50 / $7.50) | DeepSeek V4 Flash ($0.14 / $0.28) |
|---|---|---|---|
| 1-file tweak (800 tok) | ~$0.004 | ~$0.002 | ~$0.0001 |
| 6-read task (4,800 tok) | ~$0.023 | ~$0.012 | ~$0.0008 |
| Multi-file refactor (12,000 tok) | ~$0.058 | ~$0.029 | ~$0.002 |

Real-world stats from 254 vibe-skill runs over 10 days (May 2026):

| Item | Amount |
|---|---|
| Actually paid (Mistral Pro prorated + DeepSeek pay-as-you-go) | **$10.35** |
| Same workload pay-as-you-go via Mistral API | $46.61 |
| Same workload on Claude Sonnet 4.6 | $179.91 |
| Saved vs Claude | **$169.56 (17.4× cheaper)** |

Claude itself contributes ~500 to 1,500 tokens per delegation as orchestration overhead. Even with that overhead, the savings dominate for anything beyond a one-line edit.

## Pieces deployed by this repo

| Component | Source path | Deployed to | Role |
|---|---|---|---|
| `vibe` skill | `defaults/dot_claude/skills/vibe/` | `~/.claude/skills/vibe/` | Claude Code slash commands (`/vibe`, `/vibe-report`, etc.) |
| Delegator binary | `defaults/dot_claude/skills/vibe/tools/executable_vibe-delegate` | `~/.claude/skills/vibe/tools/vibe-delegate` | Runs the cheap-model task in a pseudo-TTY, parses streaming JSON, syntax-checks changes, logs the run |
| Reporter | `defaults/dot_claude/skills/vibe/tools/executable_delegate-report` | `~/.claude/skills/vibe/tools/delegate-report` | Reads the JSONL log, prints overview / by-model / by-project / failure tables |
| CLI shim | `scripts/dot/commands/ai.sh` | `bin/dot ai delegate` / `bin/dot ai cost` | Same delegator + reporter, callable from the terminal without Claude Code |
| Unified log hook | `_ai_log_run` in `ai.sh` | runs inside `run_ai_with_context` | Appends one JSONL line per `dot ai <provider>` invocation |
| Log file | runtime-managed | `~/.local/share/delegate-runs.jsonl` | One line per run; `dot ai cost` reads it |

## State files

| File | Owner | Purpose |
|---|---|---|
| `~/.local/share/delegate-runs.jsonl` | runtime | One JSONL entry per AI invocation (vibe + every other provider) |
| `~/.local/share/vibe-auto.flag` | `/vibeon` / `/vibeoff` | When present, Claude auto-delegates coding tasks to Vibe |
| `~/.local/share/vibe-model.flag` | `/vibe-model-pick` | Overrides the Vibe model for subsequent runs; cleared by `/vibe-model-clear` |
| `~/.vibe/config.toml` | user | Vibe's own provider / model configuration |

## Reading the report

```
DELEGATE REPORT  2026-05-17 → 2026-05-24

Runs          : 27 (ok: 25, failed: 2, timeout: 0)
Success rate  : 92%
Avg duration  : 18.4s
Tokens total  : 4,231,082
Delegate cost : $1.4711
Claude equiv  : $14.8294
Saved         : $13.3583 (90% cheaper than Claude)
```

`Claude equiv` is what the same workload would have cost on Claude Sonnet 4.6 ($3 / $15 per M tokens, blended at the same in/out ratio). The `Saved` line is the difference between `Claude equiv` and `Delegate cost`.

Failure types are broken out per model so you can see which delegate is most reliable for your repo.

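The blended-price arithmetic behind `Claude equiv` and the cost comparison table can be reproduced by hand. The following is a minimal sketch, not the reporter's actual code: `blended_cost` is a hypothetical helper, and the 0.85 default input share is taken from the 85 / 15 blend quoted in the cost comparison, not from the reporter's source.

```sh
# blended_cost IN_PRICE OUT_PRICE TOKENS [IN_SHARE]
# Hypothetical helper (not part of `dot ai`). Prices are USD per million
# tokens; IN_SHARE is the fraction of tokens that are input (default 0.85).
blended_cost() {
  LC_ALL=C awk -v in_price="$1" -v out_price="$2" \
               -v tokens="$3" -v in_share="${4:-0.85}" 'BEGIN {
    # Weighted price per million tokens, then scaled to the token count.
    per_million = in_share * in_price + (1 - in_share) * out_price
    printf "%.4f\n", per_million * tokens / 1000000
  }'
}

blended_cost 3 15 12000        # Claude Sonnet 4.6, 12,000-token refactor: prints 0.0576
blended_cost 1.50 7.50 12000   # Mistral Medium 3.5, same task: prints 0.0288
```

Rounded to three decimals, these match the ~$0.058 and ~$0.029 entries in the multi-file refactor row. Passing a different fourth argument models a workload with a different input/output mix.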
## Provider coverage

Every provider exposed via `dot ai <provider>` is logged best-effort. For providers that don't surface token counts in their CLI output, the report still tracks: timestamp, project, exit code, duration, prompt word count. Token / cost fields stay zero for those providers; the report tolerates the gap and aggregates by `model` regardless.

Providers tracked today:

| Provider | Binary | Logged | Tokens surfaced? |
|---|---|---|---|
| Claude Code | `claude` | yes | no (CLI doesn't expose) |
| Codex | `codex` | yes | no |
| Copilot CLI | `copilot` | yes | no |
| Gemini CLI | `gemini` | yes | no |
| Goose | `goose` | yes | no |
| Aider | `aider` | yes | no |
| OpenCode | `opencode` | yes | no |
| Autohand | `autohand` | yes | no |
| Mistral Vibe | `vibe` | yes | **yes** (via delegator) |
| Qwen | `qwen` | yes | no |
| ZAI | `zai` | yes | no |
| Shell-GPT | `sgpt` | yes | no |
| Ollama (local) | `ollama` | yes | n/a (no cost) |
| Kiro CLI | `kiro-cli` | yes | no |

## Future work

Not implemented yet, ordered roughly by likely impact:

1. **Provider-level budget guard.** `dot ai budget --set 50/month` would warn at 80% and refuse new requests at 100% (overridable). Needs per-provider cost estimation hooks beyond what each CLI surfaces today.
2. **Prompt response cache.** Many coding-helper queries are deterministic ("syntax for X in Y"). A local cache keyed on prompt hash + provider could short-circuit repeat queries.
3. **Per-task-class model routing.** `dot ai delegate --class refactor` would pick the cheapest model that meets the quality bar for the task class. Today the user picks the model.
4. **Rate-limit awareness.** Track API rate limits from response headers, queue requests, surface a `dot ai cost --limits` view that shows time-to-reset for every provider with an active limit.
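
The budget guard in item 1 names two thresholds: warn at 80% of the budget and refuse at 100%. A minimal sketch of that check, assuming month-to-date spend is already known, might look like the following. `budget_status` is a hypothetical helper written for illustration; `dot ai budget` does not exist yet, and the override mechanism is left out.

```sh
# budget_status SPENT LIMIT
# Hypothetical helper (not part of `dot ai`). Prints "ok", "warn" when
# SPENT is at or above 80% of LIMIT, or "refuse" at or above 100%.
budget_status() {
  LC_ALL=C awk -v spent="$1" -v limit="$2" 'BEGIN {
    # Compare scaled values instead of dividing, so a zero limit is safe.
    if (spent * 100 >= limit * 100)     print "refuse"
    else if (spent * 100 >= limit * 80) print "warn"
    else                                print "ok"
  }'
}

budget_status 12.50 50   # prints ok
budget_status 40 50      # prints warn
budget_status 51.20 50   # prints refuse
```

Keeping the check as a pure function of two numbers separates it from the harder part called out in item 1, which is estimating spend for providers that do not surface token counts.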