@sebastienrousseau/dotfiles
Version:
The Trusted Shell Platform — Universal dotfiles managed by Chezmoi. Features Bash & Zsh for macOS, Linux & WSL. Rust modern tooling & enterprise-grade security.
144 lines (115 loc) • 6.34 kB
Markdown
---
title: "AI Cost Optimization"
date: 2026-05-24
---
# AI Cost Optimization
This dotfiles framework ships an opinionated AI-cost layer designed to
keep token spend predictable and low while still using the smartest
model for orchestration. Two ideas drive it:
1. **Delegate the grunt work.** Have the expensive smart model
orchestrate, and a cheap fast model do the file reads, edits, and
verification. The smart model sees one tool call and the final
`git diff` instead of every intermediate read.
2. **Account every call.** Every invocation through `dot ai <provider>`
appends a JSONL entry so you can see spend across all providers in
one report, not just per-tool dashboards.
## Quick reference
```sh
dot ai delegate "rename every UserService method that starts with get to fetch"
dot ai cost # all-time overview
dot ai cost --since 7 # last 7 days
dot ai cost --fails # only failures + breakdown by failure type
```
Inside Claude Code: `/vibe`, `/vibe-report`, `/vibeon`, `/vibeoff`,
`/vibestatus`, `/vibe-model-pick`, `/vibe-model-clear`.
## The delegator pattern
The pattern comes from [pcx-wave/vibe-skill][vibe-skill]. Mistral Vibe
is the default delegate but the same wrapper accepts any model Vibe
knows about (DeepSeek V4 Flash, Gemini Flash, etc.).
[vibe-skill]: https://github.com/pcx-wave/vibe-skill
Cost comparison (May 2026 list prices, blended 85 in / 15 out typical
of coding tasks):
| Task | Claude Sonnet 4.6 ($3 / $15) | Mistral Medium 3.5 ($1.50 / $7.50) | DeepSeek V4 Flash ($0.14 / $0.28) |
|---|---|---|---|
| 1-file tweak (800 tok) | ~$0.004 | ~$0.002 | ~$0.0001 |
| 6-read task (4,800 tok) | ~$0.023 | ~$0.012 | ~$0.0008 |
| Multi-file refactor (12,000 tok) | ~$0.058 | ~$0.029 | ~$0.002 |
Real-world stats from 254 vibe-skill runs over 10 days (May 2026):
| | Amount |
|---|---|
| Actually paid (Mistral Pro prorated + DeepSeek pay-as-you-go) | **$10.35** |
| Same workload pay-as-you-go via Mistral API | $46.61 |
| Same workload on Claude Sonnet 4.6 | $179.91 |
| Saved vs Claude | **$169.56 (17.4× cheaper)** |
Claude itself contributes ~500-1500 tokens per delegation as
orchestration overhead. Even with that overhead the savings dominate
for anything beyond a one-line edit.
## Pieces deployed by this repo
| Component | Source path | Deployed to | Role |
|---|---|---|---|
| `vibe` skill | `defaults/dot_claude/skills/vibe/` | `~/.claude/skills/vibe/` | Claude Code slash commands (`/vibe`, `/vibe-report`, etc.) |
| Delegator binary | `defaults/dot_claude/skills/vibe/tools/executable_vibe-delegate` | `~/.claude/skills/vibe/tools/vibe-delegate` | Runs the cheap-model task in a pseudo-TTY, parses streaming JSON, syntax-checks changes, logs the run |
| Reporter | `defaults/dot_claude/skills/vibe/tools/executable_delegate-report` | `~/.claude/skills/vibe/tools/delegate-report` | Reads the JSONL log, prints overview / by-model / by-project / failure tables |
| CLI shim | `scripts/dot/commands/ai.sh` | `bin/dot ai delegate` / `bin/dot ai cost` | Same delegator + reporter, callable from the terminal without Claude Code |
| Unified log hook | `_ai_log_run` in `ai.sh` | runs inside `run_ai_with_context` | Appends one JSONL line per `dot ai <provider>` invocation |
| Log file | runtime-managed | `~/.local/share/delegate-runs.jsonl` | One line per run; `dot ai cost` reads it |
## State files
| File | Owner | Purpose |
|---|---|---|
| `~/.local/share/delegate-runs.jsonl` | runtime | One JSONL entry per AI invocation (vibe + every other provider) |
| `~/.local/share/vibe-auto.flag` | `/vibeon` / `/vibeoff` | When present, Claude auto-delegates coding tasks to Vibe |
| `~/.local/share/vibe-model.flag` | `/vibe-model-pick` | Override the Vibe model for the next runs; cleared by `/vibe-model-clear` |
| `~/.vibe/config.toml` | user | Vibe's own provider / model configuration |
## Reading the report
```
DELEGATE REPORT 2026-05-17 → 2026-05-24
Runs : 27 (ok: 25, failed: 2, timeout: 0)
Success rate : 92%
Avg duration : 18.4s
Tokens total : 4,231,082
Delegate cost : $1.4711
Claude equiv : $14.8294
Saved : $13.3583 (90% cheaper than Claude)
```
`Claude equiv` is what the same workload would have cost on Claude
Sonnet 4.6 ($3 / $15 per M tokens, blended at the same in/out ratio).
The savings line is the difference. Failure types are broken out per
model so you can see which delegate is most reliable for your repo.
## Provider coverage
Every provider exposed via `dot ai <provider>` is logged best-effort.
For providers that don't surface token counts in their CLI output, the
report still tracks: timestamp, project, exit code, duration, prompt
word count. Token / cost fields stay zero for those providers — the
report tolerates the gap and aggregates by `model` regardless.
Providers tracked today:
| Provider | Binary | Logged | Tokens surfaced? |
|---|---|---|---|
| Claude Code | `claude` | yes | no (CLI doesn't expose) |
| Codex | `codex` | yes | no |
| Copilot CLI | `copilot` | yes | no |
| Gemini CLI | `gemini` | yes | no |
| Goose | `goose` | yes | no |
| Aider | `aider` | yes | no |
| OpenCode | `opencode` | yes | no |
| Autohand | `autohand` | yes | no |
| Mistral Vibe | `vibe` | yes | **yes** (via delegator) |
| Qwen | `qwen` | yes | no |
| ZAI | `zai` | yes | no |
| Shell-GPT | `sgpt` | yes | no |
| Ollama (local) | `ollama` | yes | n/a (no cost) |
| Kiro CLI | `kiro-cli` | yes | no |
## Future work
Not implemented yet, ordered roughly by likely impact:
1. **Provider-level budget guard.** `dot ai budget --set 50/month`
would warn at 80% and refuse new requests at 100% (overridable).
Needs per-provider cost estimation hooks beyond what each CLI
surfaces today.
2. **Prompt response cache.** Many coding-helper queries are
deterministic ("syntax for X in Y"). A local cache keyed on prompt
hash + provider could short-circuit repeat queries.
3. **Per-task-class model routing.** `dot ai delegate --class refactor`
would pick the cheapest model that meets the quality bar for the
task class. Today the user picks the model.
4. **Rate-limit awareness.** Track API rate limits from response
headers, queue requests, surface a `dot ai cost --limits` view that
shows time-to-reset for every provider with an active limit.