@sebastienrousseau/dotfiles
Version:
The Trusted Shell Platform — Universal dotfiles managed by Chezmoi. Features Bash & Zsh for macOS, Linux & WSL. Rust modern tooling & enterprise-grade security.
205 lines (147 loc) • 7.43 kB
Markdown
render_with_liquid: false
# Self-Healing
`.dotfiles` detects, diagnoses, and repairs configuration drift, missing tools, broken symlinks, and environmental damage — often without user intervention.
## The Self-Healing Loop
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ dot doctor │ ──► │ detect │ ──► │ report │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ dot heal │ ──► │ repair │ ──► │ verify │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼ (on failure)
┌─────────────┐ ┌─────────────┐
│dot rollback │ ──► │ revert │
└─────────────┘ └─────────────┘
```
## `dot doctor` — Detection
`dot doctor` runs ~40 health checks grouped into categories:
| Category | Checks |
|:---|:---|
| **Paths** | `~/.local/bin`, `~/.cargo/bin`, Mise shim order, Homebrew prefix |
| **Tools** | Required: git, chezmoi. Optional: mise, nix, age, sops, pandoc |
| **Chezmoi** | Source dir exists, data file valid, no uncommitted drift |
| **Shell** | Default shell matches profile, startup time <500ms |
| **Security** | SSH keys present, Age key present, gitleaks baseline |
| **Portability** | Git user.email set, LC_ALL sane, TERM recognized |
Exit codes:
| Code | Meaning |
|:---|:---|
| 0 | All checks passed |
| 1 | Warnings (non-blocking) |
| 2 | Critical failures (tool missing, config broken) |
Flags:
- `--score` / `-s` — numeric health score (0-100)
- `--heal` / `-H` — auto-fix detected issues (equivalent to `dot doctor && dot heal`)
- `--json` / `-j` — machine-readable output for CI/monitoring
- `--verbose` / `-v` — show every check (default shows only failures)
## `dot heal` — Repair
`dot heal` addresses three common failure modes:
### 1. Missing Tools
Reinstalls tools listed in `.chezmoidata.toml` that aren't on PATH:
```sh
dot heal
# [heal] jq not found, installing via mise
# [heal] age not found, installing via homebrew
# [heal] ✓ 2 tools installed
```
Priority order: mise → homebrew (macOS) → apt/dnf (Linux) → nix → manual install script.
### 2. Chezmoi Drift
Runs `chezmoi apply --force` to reconcile local files back to the source state. Skipped files (excluded via `.chezmoiignore`) are left alone.
### 3. Broken Symlinks & Missing Files
Detects dangling symlinks (target doesn't exist) and re-applies chezmoi to recreate them. Missing critical files (e.g. `~/.zshrc`) trigger a targeted re-render.
### Flags
- `--dry-run` / `-n` — show what would be fixed, don't change anything
- `--force` / `-f` — skip confirmation prompts
- `--tool <name>` — heal only a specific tool
### Exit Codes
- 0 — nothing to heal or all fixes succeeded
- 1 — some repairs failed, manual intervention required
## `dot chaos` — Self-Test
`dot chaos` intentionally corrupts the local installation to verify `dot heal` can recover it. This is **destructive** — run only in ephemeral environments (containers, VMs, fresh installs).
Corruption scenarios:
| Scenario | What it breaks |
|:---|:---|
| `symlink` | Delete 3 random managed symlinks |
| `config` | Rewrite `~/.gitconfig` with garbage |
| `tool` | `mv` a critical binary out of PATH |
| `permission` | `chmod 000` on a dotfile |
| `all` | Run all scenarios sequentially |
Typical workflow:
```sh
docker run --rm -it ubuntu bash
# inside container:
bash -c "$(curl -fsSL https://.../install.sh)"
dot doctor # baseline
dot chaos all # break things
dot heal # fix
dot doctor # verify
```
This is part of CI: every PR runs `dot chaos` in a Docker container and validates `dot heal` restores a healthy state.
## `dot rollback` — Revert
Before every `dot apply`, chezmoi writes a snapshot to `~/.local/state/dotfiles/snapshots/YYYY-MM-DD-HHMMSS/`. `dot rollback` restores the most recent snapshot:
```sh
dot rollback # revert to the most recent snapshot
dot rollback status # list available snapshots
dot rollback restore 3 # restore snapshot #3
dot rollback clean # delete snapshots older than 30 days
```
Snapshots include:
- Every file chezmoi would have overwritten
- The previous `.chezmoidata.toml` and `chezmoi.toml`
- A pointer to the Git SHA at apply time
They do **not** include:
- Generated caches (`~/.cache/`)
- Tool binaries (Mise-managed)
- External state (databases, remote repos)
## `dot bundle` — Offline Recovery
`dot bundle` creates a self-contained archive that can restore the workstation without network access:
```
dotfiles-bundle-v0.2.501-20260416.tar.zst
├── source/ # Full git clone at current HEAD
├── tools/ # Pre-built chezmoi binary + mise
├── secrets/ # Age-encrypted snapshot of ~/.config/age/
├── manual/ # Offline manual (HTML + PDF)
├── attestation.json # Signed state
├── install-offline.sh # Bootstrap script
└── SHA256SUMS # Integrity checksums
```
Usage:
```sh
dot bundle # create bundle in ~/Downloads/
dot bundle --to /path/to/usb.img # write to external storage
dot bundle restore bundle.tar.zst # restore from bundle
```
Recovery use case: you're stranded on a new machine with no internet. Copy the bundle over (USB, phone tether, Bluetooth). Run `bash install-offline.sh`. You have a working dotfiles environment in <60 seconds, no network required.
## Observability
### Health Score
`dot score` returns a 0-100 score combining:
| Dimension | Weight |
|:---|---:|
| Tool availability | 30 |
| Chezmoi drift | 20 |
| Security gates (sigs, secrets, Age key) | 20 |
| Performance (shell startup, cache health) | 15 |
| Compliance (policy hash match) | 15 |
Scores:
- 90-100 — healthy
- 70-89 — minor issues
- 50-69 — needs attention
- <50 — run `dot heal` immediately
### Metrics
`dot metrics` shows recent observations: shell startup time, last-apply duration, heal events, chaos self-tests, CVE counts from SBOM scan. Metrics are stored locally in `~/.local/state/dotfiles/metrics.jsonl` (append-only, one line per event).
## Design Principles
1. **Idempotent** — running `dot apply` or `dot heal` twice has the same effect as once
2. **Reversible** — every mutation creates a rollback point
3. **Observable** — failures produce actionable error messages with exit codes
4. **Offline-capable** — core flows (detect, heal, rollback) work without network
5. **Minimally invasive** — repairs are scoped to the smallest unit that fixes the problem
## See Also
- [Reliability Operations](../../operations/RELIABILITY.md)
- [Tutorial: First Install](../02-tutorials/01-first-install.md)
- [Cookbook: Troubleshooting](../04-cookbook/02-troubleshooting.md)