aiwg
Version:
Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.
953 lines (632 loc) • 23.1 kB
Markdown
# Operational Metrics Catalog
## Purpose
Define metrics for tracking production reliability, infrastructure health, incident management, and cost efficiency.
**Scope**: SLO/SLI metrics, infrastructure, incidents, costs
**Target Audience**: Reliability Engineers, DevOps Engineers, Infrastructure Engineers, Operations Managers
**Integration**: Reference this catalog when defining SLOs, incident response, and capacity planning
## Overview
Operational metrics answer: **Is it running well?**
**Categories**:
1. **SLO/SLI Metrics** - Service level objectives and indicators (5 metrics)
2. **Infrastructure Metrics** - Resource utilization and health (4 metrics)
3. **Incident Metrics** - Response and recovery effectiveness (4 metrics)
4. **Cost Metrics** - Economic efficiency (3 metrics)
**Philosophy**: Reliability is a feature. Measure, monitor, improve.
**Critical Balance**: Reliability vs velocity. 100% uptime costs infinite resources.
## SLO/SLI Metrics
### Background
**SLI (Service Level Indicator)**: Quantifiable metric measuring service behavior
**SLO (Service Level Objective)**: Target value or range for SLI
**SLA (Service Level Agreement)**: Business contract with consequences if SLO missed
**Error Budget**: Allowed downtime before SLO breached (1 - SLO)
**Example**: SLO = 99.9% uptime → Error budget = 0.1% = 43 minutes/month
### Metric 1: Availability (Uptime)
**Definition**: Percentage of time service is operational
**Why It Matters**: Downtime = lost revenue, user frustration
**Data Source**: Monitoring system (Pingdom, Datadog, Prometheus)
**Collection Method**:
**Prometheus Query**:
```promql
# Availability over last 30 days
(1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) /
sum(rate(http_requests_total[30d])))) * 100
```
**SQL (from monitoring database)**:
```sql
SELECT
COUNT(*) FILTER (WHERE status = 'up') AS up_checks,
COUNT(*) AS total_checks,
ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'up') / COUNT(*), 3) AS availability
FROM health_checks
WHERE timestamp >= NOW() - INTERVAL '30 days'
```
**Formula**: (Uptime / Total time) × 100
**Calculation Methods**:
1. **Request-based**: (Successful requests / Total requests) × 100
2. **Time-based**: (Minutes up / Total minutes) × 100
**Common SLO Targets**:
| SLO Level | Uptime % | Downtime per Month | Downtime per Year |
|-----------|----------|-------------------|------------------|
| 90% | 90.000% | 3 days | 36.5 days |
| 99% | 99.000% | 7.2 hours | 3.65 days |
| 99.9% ("three nines") | 99.900% | 43 minutes | 8.76 hours |
| 99.95% | 99.950% | 22 minutes | 4.38 hours |
| 99.99% ("four nines") | 99.990% | 4.3 minutes | 52.6 minutes |
| 99.999% ("five nines") | 99.999% | 26 seconds | 5.26 minutes |
**Recommended Targets by Service Type**:
- Internal tools: 99% (7 hours/month)
- SaaS products: 99.9% (43 minutes/month)
- Mission-critical: 99.95-99.99% (4-22 minutes/month)
- Payment systems: 99.99%+ (< 5 minutes/month)
**Thresholds**:
- Warning: Approaching error budget (90% consumed)
- Alert: SLO breached
- Critical: Multiple SLOs breached simultaneously
**Recommended Review Cadence**:
- Monitor: Continuously (real-time dashboard)
- Review: Weekly (error budget consumption)
- Report: Monthly (SLO compliance report)
**Related Metrics**:
- Error budget remaining
- Time since last incident
- Mean Time Between Failures (MTBF)
### Metric 2: Latency (Response Time)
**Definition**: Time to process request (p50, p95, p99)
**Why It Matters**: Slow responses = poor user experience, lost conversions
**Data Source**: Application logs, APM tools (Datadog, New Relic), load balancers
**Collection Method**:
**Prometheus Query**:
```promql
# p95 latency over last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
**SQL (from application logs)**:
```sql
SELECT
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY response_time_ms) AS p50_latency,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95_latency,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99_latency
FROM request_logs
WHERE timestamp >= NOW() - INTERVAL '1 hour'
```
**Formula**: Measure request start to response complete
**Why Percentiles, Not Average**:
- Average hides outliers (one slow request skews average)
- p50 (median): Half of users see this or better
- p95: 95% of users see this or better
- p99: 99% of users see this or better (catches tail latency)
**Common SLO Targets**:
| Service Type | p50 | p95 | p99 |
|-------------|-----|-----|-----|
| API endpoints | < 100ms | < 300ms | < 1s |
| Web pages | < 500ms | < 2s | < 5s |
| Database queries | < 10ms | < 50ms | < 200ms |
| Background jobs | < 1s | < 10s | < 60s |
**Thresholds**:
- Warning: p95 > target for 5 minutes
- Alert: p99 > 2× target
- Critical: p50 > target (widespread issue)
**Recommended Review Cadence**:
- Monitor: Continuously (real-time)
- Alert: When threshold exceeded
- Review: Weekly (latency trends)
**Optimization Levers**:
- Caching (Redis, CDN)
- Database indexing
- Query optimization
- Horizontal scaling
- Code profiling (identify hotspots)
### Metric 3: Error Rate
**Definition**: Percentage of requests resulting in errors
**Why It Matters**: Errors indicate bugs, infrastructure issues, or capacity problems
**Data Source**: Application logs, APM tools
**Collection Method**:
**Prometheus Query**:
```promql
# Error rate (5xx responses) over last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
```
**SQL (from logs)**:
```sql
SELECT
COUNT(*) FILTER (WHERE status_code >= 500) AS errors,
COUNT(*) AS total_requests,
ROUND(100.0 * COUNT(*) FILTER (WHERE status_code >= 500) / COUNT(*), 3) AS error_rate
FROM request_logs
WHERE timestamp >= NOW() - INTERVAL '1 hour'
```
**Formula**: (Error requests / Total requests) × 100
**Error Categories**:
- 4xx errors: Client errors (bad requests, auth failures)
- 5xx errors: Server errors (bugs, crashes, timeouts)
**Common SLO Targets**:
| Service Type | Error Rate SLO |
|-------------|---------------|
| Public APIs | < 0.1% (99.9% success) |
| Internal services | < 0.5% |
| Background jobs | < 1% |
**Thresholds**:
- Warning: Error rate > 0.5% for 5 minutes
- Alert: Error rate > 1%
- Critical: Error rate > 5% (widespread failure)
**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Review: Daily (error trends and types)
**Related Metrics**:
- Error count by type (timeouts, crashes, validation)
- Error rate by endpoint
- Error rate by user cohort
### Metric 4: Saturation (Resource Utilization)
**Definition**: Percentage of resource capacity consumed
**Why It Matters**: High saturation → performance degradation, outages
**Data Source**: Infrastructure monitoring (Prometheus, CloudWatch, Datadog)
**Collection Method**:
**Prometheus Queries**:
```promql
# CPU saturation
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) * 100
# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
```
**Formula**: (Resource used / Resource capacity) × 100
**Targets**:
| Resource | Warning | Critical |
|----------|---------|----------|
| CPU | 70% | 85% |
| Memory | 80% | 90% |
| Disk | 75% | 85% |
| Network bandwidth | 70% | 85% |
| Database connections | 75% | 90% |
**Thresholds**:
- Warning: Sustained saturation > 70%
- Alert: Saturation > 85%
- Critical: Saturation > 90% or growing rapidly
**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Capacity Planning: Weekly (project future needs)
**Capacity Planning**:
- Track saturation trends (forecast exhaustion date)
- Scale before hitting 80% (buffer for traffic spikes)
- Plan capacity increases 3-6 months ahead
### Metric 5: Error Budget
**Definition**: Amount of allowed downtime before SLO breached
**Why It Matters**: Quantifies trade-off between reliability and velocity
**Data Source**: Calculated from SLO and actual performance
**Calculation**:
```python
def calculate_error_budget(slo_target, actual_sli, time_period_days):
"""
slo_target: e.g., 99.9 (for 99.9% uptime)
actual_sli: e.g., 99.95 (actual uptime)
time_period_days: e.g., 30
"""
error_budget_pct = 100 - slo_target
actual_error_pct = 100 - actual_sli
error_budget_remaining_pct = error_budget_pct - actual_error_pct
error_budget_consumed_pct = (actual_error_pct / error_budget_pct) * 100
minutes_in_period = time_period_days * 24 * 60
error_budget_minutes = (error_budget_pct / 100) * minutes_in_period
consumed_minutes = (actual_error_pct / 100) * minutes_in_period
remaining_minutes = error_budget_minutes - consumed_minutes
return {
'error_budget_pct': error_budget_pct,
'consumed_pct': error_budget_consumed_pct,
'remaining_minutes': remaining_minutes,
'total_budget_minutes': error_budget_minutes
}
# Example:
# SLO = 99.9%, Actual = 99.95%, Period = 30 days
# Error budget = 0.1% = 43 minutes
# Actual errors = 0.05% = 22 minutes
# Budget consumed = 50%, remaining = 21 minutes
```
**Formula**:
```
Error Budget = (1 - SLO) × Time Period
Budget Consumed = (1 - Actual SLI) × Time Period
Budget Remaining = Error Budget - Budget Consumed
```
**Targets**:
| Budget Status | Action |
|--------------|--------|
| > 50% remaining | Prioritize features (move fast) |
| 25-50% remaining | Balance features and reliability |
| < 25% remaining | Slow down, focus on reliability |
| 0% remaining (exhausted) | Freeze features, fix reliability issues |
**Thresholds**:
- Warning: 75% budget consumed
- Alert: 90% budget consumed
- Critical: Budget exhausted (SLO breached)
**Recommended Review Cadence**:
- Monitor: Daily
- Review: Weekly (error budget burn rate)
- Report: Monthly (SLO report)
**Error Budget Policy** (define in advance):
- Budget healthy: Ship features, take calculated risks
- Budget low: Increase testing, slow releases
- Budget exhausted: Feature freeze, focus on reliability
## Infrastructure Metrics
### Metric 6: CPU Utilization
**Definition**: Percentage of CPU capacity used
**Why It Matters**: High CPU → slow responses, capacity constraints
**Data Source**: Infrastructure monitoring (Prometheus, CloudWatch)
**Collection Method**:
**Prometheus**:
```promql
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) * 100
```
**CloudWatch (AWS)**:
```bash
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2025-10-15T00:00:00Z \
--end-time 2025-10-15T23:59:59Z \
--period 300 \
--statistics Average
```
**Formula**: (CPU time used / CPU time available) × 100
**Targets**:
- Normal operation: 40-60%
- Peak traffic: 70-80%
- Warning threshold: 85%
- Critical threshold: 95%
**Thresholds**:
- Warning: Avg CPU > 70% for 10 minutes
- Alert: Avg CPU > 85% for 5 minutes
- Critical: CPU > 95% or sustained > 85%
**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Review: Weekly (capacity planning)
**Remediation**:
- Scale horizontally (add instances)
- Optimize code (profiling)
- Offload to caching layer
### Metric 7: Memory Utilization
**Definition**: Percentage of RAM consumed
**Why It Matters**: High memory → swapping, OOM kills, crashes
**Data Source**: Infrastructure monitoring
**Collection Method**:
**Prometheus**:
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
**Formula**: (Memory used / Total memory) × 100
**Targets**:
- Normal operation: 50-70%
- Warning threshold: 80%
- Critical threshold: 90%
**Thresholds**:
- Warning: Memory > 80% for 10 minutes
- Alert: Memory > 90%
- Critical: Memory > 95% or OOM events
**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Review: Weekly
**Common Issues**:
- Memory leaks (usage growing over time)
- Inefficient caching
- Large object allocation
- Database connection pooling
### Metric 8: Disk Usage
**Definition**: Percentage of disk space consumed
**Why It Matters**: Full disk → application crashes, data loss
**Data Source**: Infrastructure monitoring
**Collection Method**:
**Prometheus**:
```promql
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
```
**Bash**:
```bash
df -h / | awk 'NR==2 {print $5}' | sed 's/%//'
```
**Formula**: (Disk used / Disk capacity) × 100
**Targets**:
- Normal operation: < 70%
- Warning threshold: 80%
- Critical threshold: 90%
**Thresholds**:
- Warning: Disk > 75%
- Alert: Disk > 85%
- Critical: Disk > 90% or projected full within 7 days
**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Cleanup: Weekly (log rotation, temp files)
**Remediation**:
- Log rotation (delete old logs)
- Archive old data
- Increase disk size
- Cleanup temp files
### Metric 9: Network Throughput
**Definition**: Data transferred per time period (Mbps, Gbps)
**Why It Matters**: High throughput → bandwidth saturation, slow responses
**Data Source**: Infrastructure monitoring
**Collection Method**:
**Prometheus**:
```promql
# Incoming traffic (Mbps)
rate(node_network_receive_bytes_total[5m]) * 8 / 1000000
# Outgoing traffic (Mbps)
rate(node_network_transmit_bytes_total[5m]) * 8 / 1000000
```
**Formula**: Bytes transferred / Time period (convert to bits per second)
**Targets**:
- Normal operation: < 60% of capacity
- Warning threshold: 70%
- Critical threshold: 85%
**Thresholds**:
- Warning: Throughput > 70% of capacity
- Alert: Throughput > 85%
- Critical: Packet loss or sustained > 90%
**Recommended Review Cadence**:
- Monitor: Continuously
- Review: Weekly (capacity planning)
## Incident Metrics
### Metric 10: MTTD (Mean Time to Detect)
**Definition**: Average time from incident start to detection
**Why It Matters**: Fast detection minimizes impact
**Data Source**: Monitoring alerts + incident logs
**Collection Method**:
```sql
SELECT
AVG(detected_at - occurred_at) AS avg_mttd
FROM incidents
WHERE occurred_at >= NOW() - INTERVAL '90 days'
```
**Formula**: Detection time - Incident start time
**Targets**:
- Critical incidents: < 5 minutes
- High priority: < 15 minutes
- Medium priority: < 1 hour
**Thresholds**:
- Warning: MTTD > 15 minutes
- Investigation: MTTD increasing trend
**Recommended Review Cadence**:
- Track: Per incident
- Review: Monthly
- Improve: Quarterly (add monitoring)
**Improvement Strategies**:
- Add health checks
- Increase monitoring coverage
- Tune alert thresholds (reduce false negatives)
- Synthetic monitoring (proactive checks)
### Metric 11: MTTA (Mean Time to Acknowledge)
**Definition**: Average time from alert to responder engaged
**Why It Matters**: Fast response reduces incident duration
**Data Source**: Incident management system (PagerDuty, Opsgenie)
**Collection Method**:
```sql
SELECT
AVG(acknowledged_at - detected_at) AS avg_mtta
FROM incidents
WHERE detected_at >= NOW() - INTERVAL '90 days'
```
**Formula**: Acknowledgment time - Detection time
**Targets**:
- Critical incidents: < 5 minutes
- High priority: < 15 minutes
- Medium priority: < 1 hour
**Thresholds**:
- Warning: MTTA > 15 minutes
- Investigation: MTTA > 30 minutes
**Recommended Review Cadence**:
- Track: Per incident
- Review: Monthly
**Improvement Strategies**:
- Clear on-call rotation
- Escalation policies
- Improved alert routing
- Mobile alerting
### Metric 12: MTBF (Mean Time Between Failures)
**Definition**: Average time between incidents
**Why It Matters**: Indicates system stability
**Data Source**: Incident logs
**Collection Method**:
```sql
WITH incident_intervals AS (
SELECT
occurred_at,
LAG(occurred_at) OVER (ORDER BY occurred_at) AS previous_incident
FROM incidents
WHERE severity IN ('critical', 'high')
)
SELECT
AVG(occurred_at - previous_incident) AS avg_mtbf
FROM incident_intervals
WHERE previous_incident IS NOT NULL
```
**Formula**: Total uptime / Number of incidents
**Targets**:
- Mission-critical: > 30 days
- Production systems: > 14 days
- Non-critical: > 7 days
**Thresholds**:
- Warning: MTBF < 7 days
- Investigation: MTBF declining trend
**Recommended Review Cadence**:
- Calculate: Monthly
- Review: Quarterly
### Metric 13: Incident Count by Severity
**Definition**: Number of incidents per severity level
**Why It Matters**: Tracks incident trends, identifies problem areas
**Data Source**: Incident management system
**Collection Method**:
```sql
SELECT
severity,
COUNT(*) AS incident_count
FROM incidents
WHERE occurred_at >= NOW() - INTERVAL '30 days'
GROUP BY severity
ORDER BY
CASE severity
WHEN 'critical' THEN 1
WHEN 'high' THEN 2
WHEN 'medium' THEN 3
WHEN 'low' THEN 4
END
```
**Targets**:
- Critical: 0 per month
- High: < 2 per month
- Medium: < 10 per month
- Low: Acceptable (monitor trends)
**Thresholds**:
- Warning: > 1 critical per month
- Alert: > 3 high per month
- Investigation: Increasing trend
**Recommended Review Cadence**:
- Track: Weekly
- Review: Monthly (trend analysis)
## Cost Metrics
### Metric 14: Infrastructure Cost per User
**Definition**: Monthly infrastructure spend divided by active users
**Why It Matters**: Measures economic efficiency, unit economics
**Data Source**: Cloud billing (AWS Cost Explorer, GCP Billing) + user analytics
**Collection Method**:
**AWS CLI**:
```bash
aws ce get-cost-and-usage \
--time-period Start=2025-10-01,End=2025-10-31 \
--granularity MONTHLY \
--metrics BlendedCost
```
**Calculation**:
```python
monthly_infrastructure_cost = 10000 # From cloud billing
monthly_active_users = 5000 # From analytics
cost_per_user = monthly_infrastructure_cost / monthly_active_users
# Result: $2 per user per month
```
**Formula**: Total infrastructure cost / Monthly Active Users
**Targets**:
- SaaS products: < 20% of ARPU (average revenue per user)
- Cost-sensitive: < 10% of ARPU
- High-margin: < 5% of ARPU
**Thresholds**:
- Warning: Cost per user > 20% of ARPU
- Alert: Cost per user increasing > 20% month-over-month
- Investigation: Cost growing faster than users
**Recommended Review Cadence**:
- Calculate: Monthly
- Review: Monthly (cost optimization)
- Plan: Quarterly (reserved instances, cost optimization)
**Optimization Strategies**:
- Right-size instances (avoid over-provisioning)
- Use spot instances for non-critical workloads
- Reserved instances for predictable workloads
- Auto-scaling (scale down during low traffic)
- Optimize storage (lifecycle policies, compression)
### Metric 15: Cloud Spend by Service
**Definition**: Cost breakdown by cloud service (compute, storage, database, etc.)
**Why It Matters**: Identifies cost drivers, optimization targets
**Data Source**: Cloud billing reports
**Collection Method**:
**AWS Cost Explorer API**:
```bash
aws ce get-cost-and-usage \
--time-period Start=2025-10-01,End=2025-10-31 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
```
**Targets**:
- No universal targets (depends on architecture)
- Track month-over-month changes
**Thresholds**:
- Warning: Any service cost increases > 30% MoM
- Investigation: Total cost increases > 20% MoM
**Recommended Review Cadence**:
- Review: Monthly
- Deep Dive: Quarterly
**Common Cost Drivers**:
- EC2/Compute: Right-sizing, reserved instances
- RDS/Database: Instance optimization, read replicas
- S3/Storage: Lifecycle policies, compression
- Data Transfer: CDN usage, region optimization
- Load Balancers: Consolidation, traffic optimization
### Metric 16: Cost Efficiency (Cost per Request)
**Definition**: Infrastructure cost divided by request volume
**Why It Matters**: Normalizes cost by usage, tracks scaling efficiency
**Data Source**: Cloud billing + request logs
**Calculation**:
```python
monthly_cost = 10000 # Total infrastructure cost
monthly_requests = 50000000 # Total requests handled
cost_per_million_requests = (monthly_cost / monthly_requests) * 1000000
# Result: $0.20 per million requests
```
**Formula**: Total cost / Total requests (normalize per million)
**Targets**:
- Depends on business model
- Track trend (should decrease as scale increases)
**Thresholds**:
- Warning: Cost per request increasing
- Investigation: Cost per request not decreasing with scale
**Recommended Review Cadence**:
- Calculate: Monthly
- Review: Quarterly (economies of scale)
## Summary Table
| Metric | Category | Data Source | Frequency | Target | Critical Threshold |
|--------|----------|-------------|-----------|--------|--------------------|
| Availability | SLO/SLI | Monitoring | Continuous | ≥ 99.9% | < SLO target |
| Latency (p95) | SLO/SLI | APM | Continuous | < 300ms | > 1s |
| Error Rate | SLO/SLI | Logs | Continuous | < 0.1% | > 1% |
| Saturation (CPU) | SLO/SLI | Monitoring | Continuous | < 70% | > 85% |
| Error Budget | SLO/SLI | Calculated | Daily | > 25% | 0% (exhausted) |
| CPU Utilization | Infrastructure | Monitoring | Continuous | 40-60% | > 85% |
| Memory Utilization | Infrastructure | Monitoring | Continuous | 50-70% | > 90% |
| Disk Usage | Infrastructure | Monitoring | Continuous | < 70% | > 90% |
| Network Throughput | Infrastructure | Monitoring | Continuous | < 60% capacity | > 85% capacity |
| MTTD | Incidents | Monitoring + Incidents | Per incident | < 5 min | > 15 min |
| MTTA | Incidents | Incident system | Per incident | < 5 min | > 15 min |
| MTBF | Incidents | Incident logs | Monthly | > 30 days | < 7 days |
| Incident Count | Incidents | Incident system | Weekly | 0 critical/month | > 1 critical/month |
| Cost per User | Cost | Billing + Analytics | Monthly | < 20% ARPU | > 30% ARPU |
| Cloud Spend | Cost | Billing | Monthly | Stable | > 30% MoM increase |
| Cost per Request | Cost | Billing + Logs | Monthly | Decreasing | Increasing |
## Conclusion
Operational metrics enable proactive reliability management and cost optimization.
**Key Takeaways**:
1. SLOs quantify reliability targets (not everything needs five nines)
2. Error budgets balance reliability and velocity
3. Infrastructure metrics predict capacity constraints
4. Incident metrics drive process improvements
5. Cost metrics ensure economic sustainability
**Next Steps**:
1. Define SLOs for critical user journeys (3-5 SLOs)
2. Implement monitoring and alerting
3. Establish error budget policies
4. Track incident metrics, run postmortems
5. Monitor costs monthly, optimize quarterly
**Critical Success Factor**: Reliability is a feature. Budget for it, measure it, improve it.