# Risk Audit Workflow
**Last Updated:** !`date -u +"%Y-%m-%d %H:%M:%S UTC"`
**Extends universal audit framework with general risk assessment patterns.**
@.genie/code/agents/audit.md
---
## Risk Audit Mode
### When to Use
Use this workflow to enumerate top risks for an initiative, assess impact and likelihood with evidence, and propose concrete mitigations.
### Operating Framework
```
<task_breakdown>
1. [Discovery] Map initiative scope, constraints, dependencies, failure modes
2. [Implementation] Enumerate risks, assess impact × likelihood, design mitigations with ownership
3. [Verification] Rank risks by severity, document residual risk, deliver action plan + confidence verdict
</task_breakdown>
```
### Auto-Context Loading with @ Pattern
Use @ symbols to automatically load initiative context before risk analysis:
```
Scope: Production migration to Kubernetes
@docs/architecture/deployment-strategy.md
@infrastructure/terraform/prod-config.tf
@docs/team-runbook.md
@incidents/postmortems/2024-Q1.md
```
Benefits:
- Agents automatically read context before risk enumeration
- No need to spell out "first review the architecture, then assess risks"
- Ensures evidence-based risk analysis from the start
### Risk Assessment Framework
#### Risk Categories:
1. **Technical Risks** - Architecture, performance, scalability, data integrity
2. **Operational Risks** - Monitoring gaps, runbook incompleteness, on-call readiness
3. **Security Risks** - Authentication, authorization, data exposure, compliance
4. **People Risks** - Skill gaps, bus factor, team availability during migration
5. **External Risks** - Third-party dependencies, vendor SLAs, regulatory changes
6. **Timeline Risks** - Optimistic estimates, blockers, coordination overhead
---
## Concrete Example
**Scope:**
"Migrate production workloads from EC2 to Kubernetes. Current state: 50 microservices on EC2 Auto Scaling Groups, 99.9% uptime SLA, 20K RPS peak. Target state: EKS cluster with Istio service mesh. Timeline: 8 weeks."
**Risk Analysis:**
#### R1: Service Mesh Misconfiguration → Traffic Blackhole (Impact: CRITICAL, Likelihood: 50%)
- **Evidence:** Istio's complexity documented in 3 production incidents at Lyft (source: Envoy blog)
- **Failure Mode:** Incorrect VirtualService routing rules send 100% traffic to /dev/null
- **Mitigation:**
- Week 1-2: Shadow traffic to Istio canary (0% production), validate routing parity (VirtualService sketch below)
- Week 3: Blue-green deployment with instant DNS rollback capability
- Owner: SRE team lead
- Timeline: 2 weeks before production traffic
- **Residual Risk:** 10% likelihood - DNS propagation delay (5-10 min) during rollback
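For illustration, a minimal sketch of the shadow-traffic step, assuming a hypothetical `checkout` service whose `stable` and `canary` subsets are already defined in a DestinationRule (all names are placeholders, not taken from the actual migration):
```
# Hypothetical sketch: serve all real traffic from the stable subset while
# mirroring a copy of every request to the canary for routing-parity checks.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout                          # placeholder service name
  namespace: prod
spec:
  hosts:
    - checkout.prod.svc.cluster.local
  http:
    - route:
        - destination:
            host: checkout.prod.svc.cluster.local
            subset: stable                # all responses still come from stable
          weight: 100
      mirror:
        host: checkout.prod.svc.cluster.local
        subset: canary                    # mirrored copies; responses are discarded
      mirrorPercentage:
        value: 100.0                      # mirror everything at 0% production exposure
```
Comparing per-subset metrics and access logs for the mirrored requests is what produces the routing-parity evidence before any real traffic shifts.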
#### R2: StatefulSet Data Loss During Node Drain (Impact: CRITICAL, Likelihood: 30%)
- **Evidence:** Kubernetes drains nodes during upgrades; PVC detachment can cause corruption (GitHub issue #89465)
- **Failure Mode:** Database pod evicted mid-transaction → data corruption
- **Mitigation:**
- Implement PodDisruptionBudgets with minAvailable=1 for all StatefulSets (manifest sketch below)
- Add preStop hook with 30s graceful shutdown for database writes
- Test node drain scenarios in staging with chaos engineering (Gremlin)
- Owner: Platform team
- Timeline: Week 2-3
- **Residual Risk:** 5% likelihood - Cluster upgrade during high-traffic window (mitigate: maintenance window scheduling)
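As a sketch of the first two mitigation items, assuming a hypothetical `postgres` StatefulSet labeled `app: postgres` (image and shutdown command are placeholders, not the real workload):
```
# Hypothetical sketch: block node drains from evicting the last healthy replica
# and give the database time to finish in-flight writes before termination.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
spec:
  minAvailable: 1                         # voluntary evictions wait if this would be violated
  selector:
    matchLabels:
      app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 2
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60   # must exceed the preStop delay
      containers:
        - name: postgres
          image: postgres:16              # placeholder image
          lifecycle:
            preStop:
              exec:
                # Placeholder for a real flush/clean-shutdown command; the 30s
                # delay is the graceful window called out in the mitigation.
                command: ["sh", "-c", "sleep 30"]
```
Draining a staging node with `kubectl drain --ignore-daemonsets` against this setup is the cheapest way to confirm eviction behavior before the Gremlin chaos runs.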
#### R3: Monitoring Blindspot During Migration (Impact: HIGH, Likelihood: 75%)
- **Evidence:** Current EC2 metrics (CloudWatch) incompatible with Kubernetes metrics (Prometheus)
- **Failure Mode:** 2-week gap where production issues go undetected → delayed incident response
- **Mitigation:**
- Week 1: Deploy Prometheus + Grafana in parallel with CloudWatch
- Week 2: Replicate top 20 CloudWatch alarms in Prometheus AlertManager (example rule below)
- Week 3-4: Dual-monitor both systems before cutover
- Owner: Observability team
- Timeline: 4 weeks (frontload before migration)
- **Residual Risk:** 40% likelihood - Alert fatigue from dual systems causing missed signals (mitigate: weekly alert review)
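As an example of porting one alarm, a hypothetical Prometheus alerting rule standing in for a CloudWatch high-5xx-rate alarm (the metric name assumes standard Istio telemetry is scraped; the 1% threshold is illustrative, not taken from the real alarm set):
```
# prometheus-rules.yml: hypothetical port of a single CloudWatch alarm
groups:
  - name: migrated-cloudwatch-alarms
    rules:
      - alert: HighHTTP5xxRate
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m]))
            / sum(rate(istio_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx rate above 1% for 5 minutes (ported from CloudWatch)"
```
During the dual-monitoring window, the check is simply that this rule fires under the same conditions as its CloudWatch counterpart.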
#### R4: Team Kubernetes Skill Gap (Impact: HIGH, Likelihood: 60%)
- **Evidence:** Team survey: 40% have 0 Kubernetes experience, 30% basic only
- **Failure Mode:** Slow incident response, incorrect troubleshooting, extended MTTR
- **Mitigation:**
- Week 1-2: Mandatory Kubernetes bootcamp (2 days) for all engineers
- Week 3-6: Pair on-call shifts (experienced + learning engineer)
- External: Hire Kubernetes consultant for 8-week engagement + runbook creation
- Owner: Engineering manager
- Timeline: 6 weeks (start immediately)
- **Residual Risk:** 30% likelihood - Consultant availability delay (mitigate: contract signed Week 1)
#### R5: Third-Party Dependency on EC2 Metadata Service (Impact: MEDIUM, Likelihood: 40%)
- **Evidence:** 8 microservices use EC2 instance metadata for service discovery
- **Failure Mode:** Hard-coded metadata API calls fail in Kubernetes → startup crashes
- **Mitigation:**
- Week 1: Audit all microservices for EC2 metadata usage (grep for `169.254.169.254`)
- Week 2: Refactor to environment variables injected via ConfigMaps (ConfigMap sketch below)
- Week 3-4: Test in staging with no EC2 metadata server
- Owner: Application team
- Timeline: 4 weeks
- **Residual Risk:** 10% likelihood - Undiscovered transitive dependency in vendor libraries
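A minimal sketch of the refactor target, assuming a hypothetical service that previously read its region and environment from the metadata endpoint (all names and values are placeholders):
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: runtime-context                   # placeholder name
data:
  AWS_REGION: us-east-1                   # values formerly fetched from 169.254.169.254
  DEPLOY_ENV: production
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-service                   # placeholder service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: billing-service
  template:
    metadata:
      labels:
        app: billing-service
    spec:
      containers:
        - name: app
          image: example.com/billing-service:latest   # placeholder image
          envFrom:
            - configMapRef:
                name: runtime-context     # exposes AWS_REGION / DEPLOY_ENV as env vars
```
The Week 3-4 staging test then reduces to running these pods on nodes where 169.254.169.254 is unreachable and confirming clean startup.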
#### Risk Prioritization Matrix:
| Rank | Risk | Impact | Likelihood | Severity Score | Mitigation Start |
|------|------|--------|------------|----------------|------------------|
| 1 | R1: Service Mesh Blackhole | Critical | 50% | 10 (Critical × High) | Week 1 |
| 2 | R2: StatefulSet Data Loss | Critical | 30% | 9 (Critical × Medium) | Week 2 |
| 3 | R3: Monitoring Blindspot | High | 75% | 8 (High × Very High) | Week 1 (parallel) |
| 4 | R4: Skill Gap | High | 60% | 7 (High × High) | Week 1 (immediate) |
| 5 | R5: EC2 Metadata Dependency | Medium | 40% | 5 (Medium × Medium) | Week 1 |
**Severity Score:** 1-10 scale; impact sets the band (Critical = 9-10, High = 6-8, Medium = 3-5) and likelihood positions the risk within its band (e.g., R3: High impact, 75% likelihood → 8)
**Next Actions (Prioritized):**
1. **Week 1:** Start Kubernetes bootcamp + monitoring parallel deployment + EC2 metadata audit
2. **Week 1-2:** Istio shadow traffic testing (blocks production cutover)
3. **Week 2-3:** StatefulSet PodDisruptionBudget implementation + chaos testing
4. **Week 3:** Contract Kubernetes consultant (if not done in Week 1)
5. **Week 4:** Full staging dry-run with all mitigations active → go/no-go decision
**Genie Verdict:** Migration is HIGH RISK but manageable with frontloaded mitigations. Service mesh and monitoring gaps are critical path blockers; recommend 2-week delay if Istio shadow testing reveals routing issues. Skill gap mitigation requires immediate bootcamp + consultant engagement. Residual risk acceptable if all mitigations complete by Week 4 (confidence: high - based on postmortem precedent and team readiness assessment)
---
## Prompt Template (Risk Audit Mode)
```
Scope: <initiative with timeline and constraints>
Context: <current state, target state, dependencies>
@relevant-files
Risk Analysis:
R1: <risk> (Impact: <level>, Likelihood: <%>)
- Evidence: <source>
- Failure Mode: <what breaks>
- Mitigation: <action + owner + timeline>
- Residual Risk: <% after mitigation>
Risk Prioritization Matrix: [table]
Next Actions: [prioritized list with timeline]
Genie Verdict: <go/no-go/conditional> (confidence: <low|med|high> - reasoning)
```