aiwg
Version:
Deployment tool and support utility for AI context. Copies agents, skills, commands, rules, and behaviors into the paths each AI platform reads (Claude Code, Codex, Copilot, Cursor, Warp, OpenClaw, and 6 more) so one source of truth works across 10 platfo
129 lines (94 loc) • 4.16 kB
Markdown
# Pipeline Incident Report: {title}
## Incident Summary
| Field | Value |
|-------|-------|
| Incident ID | {INC-XXXX} |
| Pipeline | {pipeline-name} |
| Stage | {failing-stage} |
| Failure Type | {failure-type} (build / test / scan / deploy / infra) |
| Severity | {P1/P2/P3/P4} |
| Status | {Investigating / Identified / Monitoring / Resolved} |
| Detected | {timestamp} |
| Resolved | {timestamp} |
| Duration | {duration} |
| Affected Projects | {project-list} |
| Blocked Deployments | {blocked-deployment-list} |
## Impact Assessment
| Dimension | Detail |
|-----------|--------|
| Deployments blocked | {count} releases waiting on this pipeline |
| Teams affected | {team-list} |
| SLA/SLO impact | {yes/no — describe if yes} |
| Production affected | {yes/no — describe if yes} |
| Workaround available | {yes/no — describe if yes} |
## Root Cause Analysis
### What Failed
{One paragraph describing the failure: which step in which pipeline, what error was emitted, and what automated mitigation (if any) fired.}
### 5 Whys
| Why | Answer |
|-----|--------|
| Why did the pipeline fail? | {immediate failure — e.g., image scan returned CRITICAL CVE} |
| Why was that the result? | {proximate cause — e.g., base image updated without pinning} |
| Why did that happen? | {contributing cause — e.g., no alert on upstream base image change} |
| Why wasn't it caught earlier? | {systemic cause — e.g., no nightly base-layer audit in place} |
| Why doesn't the process prevent this? | {root systemic gap — e.g., dev-idempotent-builds rule not enforced in review} |
### Contributing Factors
- {factor 1 — e.g., no image digest pinning in Dockerfile}
- {factor 2 — e.g., retry limit set to 0, causing immediate failure with no recovery}
- {factor 3 — e.g., on-call rotation not notified until 45 minutes after detection}
## Timeline
| Time | Event |
|------|-------|
| {HH:MM UTC} | First failure alert received (pipeline status: failed) |
| {HH:MM UTC} | On-call engineer engaged |
| {HH:MM UTC} | Failure isolated to {stage} stage |
| {HH:MM UTC} | Root cause identified: {root-cause-summary} |
| {HH:MM UTC} | Mitigation applied: {mitigation-summary} |
| {HH:MM UTC} | Affected pipeline unblocked |
| {HH:MM UTC} | All blocked deployments cleared |
| {HH:MM UTC} | Incident resolved and monitoring confirmed stable |
## Resolution Steps
### Immediate Mitigation
```bash
# Steps taken to unblock the pipeline
{mitigation-command-1}
{mitigation-command-2}
```
### Verification After Mitigation
```bash
# Confirm pipeline passing
{pipeline-status-command}
# Confirm no remaining blocked deployments
{deployment-check-command}
```
**Expected output:**
```
{expected-output}
```
## Prevention
What changes prevent this class of failure from recurring:
| Change | Type | Owner | Target Date |
|--------|------|-------|-------------|
| {change-1 — e.g., pin all base images by digest in Dockerfiles} | Process | {owner} | {date} |
| {change-2 — e.g., add nightly builder audit to dev-builder-audit skill} | Automation | {owner} | {date} |
| {change-3 — e.g., alert on pipeline failure within 5 minutes} | Monitoring | {owner} | {date} |
| {change-4 — e.g., document secret hygiene requirements for build args} | Policy | {owner} | {date} |
## Postmortem Action Items
- [ ] {action-1}: Update Dockerfiles to use digest-pinned base images — @{owner} — {due-date}
- [ ] {action-2}: Add pipeline failure alerting with <5 min SLA — @{owner} — {due-date}
- [ ] {action-3}: Run `dev-builder-audit` skill weekly and file issues on outdated layers — @{owner} — {due-date}
- [ ] {action-4}: Add this failure pattern to `pipeline.md` troubleshooting table — @{owner} — {due-date}
- [ ] {action-5}: Review `dev-idempotent-builds` rule compliance across all affected projects — @{owner} — {due-date}
## References
- Failing pipeline run: {pipeline-run-url}
- Related issues: {issue-links}
- Related runbooks: {runbook-links}
- Monitoring dashboard: {dashboard-url}
- Build playbook: `templates/build-playbook.yaml`
- Deployment guide: `templates/deployment-guide.md`