# Operational Metrics Catalog

## Purpose

Define metrics for tracking production reliability, infrastructure health, incident management, and cost efficiency.

**Scope**: SLO/SLI metrics, infrastructure, incidents, costs

**Target Audience**: Reliability Engineers, DevOps Engineers, Infrastructure Engineers, Operations Managers

**Integration**: Reference this catalog when defining SLOs, incident response, and capacity planning

---

## Overview

Operational metrics answer: **Is it running well?**

**Categories**:
1. **SLO/SLI Metrics** - Service level objectives and indicators (5 metrics)
2. **Infrastructure Metrics** - Resource utilization and health (4 metrics)
3. **Incident Metrics** - Response and recovery effectiveness (4 metrics)
4. **Cost Metrics** - Economic efficiency (3 metrics)

**Philosophy**: Reliability is a feature. Measure, monitor, improve.

**Critical Balance**: Reliability vs velocity. 100% uptime costs infinite resources.

---

## SLO/SLI Metrics

### Background

**SLI (Service Level Indicator)**: Quantifiable metric measuring service behavior

**SLO (Service Level Objective)**: Target value or range for SLI

**SLA (Service Level Agreement)**: Business contract with consequences if SLO missed

**Error Budget**: Allowed downtime before SLO breached (1 - SLO)

**Example**: SLO = 99.9% uptime → Error budget = 0.1% = 43 minutes/month

---

### Metric 1: Availability (Uptime)

**Definition**: Percentage of time service is operational

**Why It Matters**: Downtime = lost revenue, user frustration

**Data Source**: Monitoring system (Pingdom, Datadog, Prometheus)

**Collection Method**:

**Prometheus Query**:

```promql
# Availability over last 30 days
(1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d])))) * 100
```

**SQL (from monitoring database)**:

```sql
SELECT
  COUNT(*) FILTER (WHERE status = 'up') AS up_checks,
  COUNT(*) AS total_checks,
  ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'up') / COUNT(*), 3) AS availability
FROM
health_checks
WHERE timestamp >= NOW() - INTERVAL '30 days'
```

**Formula**: (Uptime / Total time) × 100

**Calculation Methods**:
1. **Request-based**: (Successful requests / Total requests) × 100
2. **Time-based**: (Minutes up / Total minutes) × 100

**Common SLO Targets**:

| SLO Level | Uptime % | Downtime per Month | Downtime per Year |
|-----------|----------|-------------------|------------------|
| 90% | 90.000% | 3 days | 36.5 days |
| 99% | 99.000% | 7.2 hours | 3.65 days |
| 99.9% ("three nines") | 99.900% | 43 minutes | 8.76 hours |
| 99.95% | 99.950% | 22 minutes | 4.38 hours |
| 99.99% ("four nines") | 99.990% | 4.3 minutes | 52.6 minutes |
| 99.999% ("five nines") | 99.999% | 26 seconds | 5.26 minutes |

**Recommended Targets by Service Type**:
- Internal tools: 99% (7 hours/month)
- SaaS products: 99.9% (43 minutes/month)
- Mission-critical: 99.95-99.99% (4-22 minutes/month)
- Payment systems: 99.99%+ (< 5 minutes/month)

**Thresholds**:
- Warning: Approaching error budget (90% consumed)
- Alert: SLO breached
- Critical: Multiple SLOs breached simultaneously

**Recommended Review Cadence**:
- Monitor: Continuously (real-time dashboard)
- Review: Weekly (error budget consumption)
- Report: Monthly (SLO compliance report)

**Related Metrics**:
- Error budget remaining
- Time since last incident
- Mean Time Between Failures (MTBF)

---

### Metric 2: Latency (Response Time)

**Definition**: Time to process request (p50, p95, p99)

**Why It Matters**: Slow responses = poor user experience, lost conversions

**Data Source**: Application logs, APM tools (Datadog, New Relic), load balancers

**Collection Method**:

**Prometheus Query**:

```promql
# p95 latency over last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

**SQL (from application logs)**:

```sql
SELECT
  PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY response_time_ms) AS p50_latency,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95_latency,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99_latency
FROM request_logs
WHERE timestamp >= NOW() - INTERVAL '1 hour'
```

**Formula**: Measure request start to response complete

**Why Percentiles, Not Average**:
- Average hides outliers (one slow request skews average)
- p50 (median): Half of users see this or better
- p95: 95% of users see this or better
- p99: 99% of users see this or better (catches tail latency)

**Common SLO Targets**:

| Service Type | p50 | p95 | p99 |
|-------------|-----|-----|-----|
| API endpoints | < 100ms | < 300ms | < 1s |
| Web pages | < 500ms | < 2s | < 5s |
| Database queries | < 10ms | < 50ms | < 200ms |
| Background jobs | < 1s | < 10s | < 60s |

**Thresholds**:
- Warning: p95 > target for 5 minutes
- Alert: p99 > 2× target
- Critical: p50 > target (widespread issue)

**Recommended Review Cadence**:
- Monitor: Continuously (real-time)
- Alert: When threshold exceeded
- Review: Weekly (latency trends)

**Optimization Levers**:
- Caching (Redis, CDN)
- Database indexing
- Query optimization
- Horizontal scaling
- Code profiling (identify hotspots)

---

### Metric 3: Error Rate

**Definition**: Percentage of requests resulting in errors

**Why It Matters**: Errors indicate bugs, infrastructure issues, or capacity problems

**Data Source**: Application logs, APM tools

**Collection Method**:

**Prometheus Query**:

```promql
# Error rate (5xx responses) over last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
```

**SQL (from logs)**:

```sql
SELECT
  COUNT(*) FILTER (WHERE status_code >= 500) AS errors,
  COUNT(*) AS total_requests,
  ROUND(100.0 * COUNT(*) FILTER (WHERE status_code >= 500) / COUNT(*), 3) AS error_rate
FROM request_logs
WHERE timestamp >= NOW() - INTERVAL '1 hour'
```

**Formula**: (Error requests / Total requests) × 100

**Error Categories**:
- 4xx errors: Client errors (bad requests, auth failures)
- 5xx errors: Server errors (bugs, crashes, timeouts)

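
The error rate formula and the 4xx/5xx split can be sketched in a few lines of Python. This is a minimal illustration, not part of the catalog's tooling: the function name `error_rates` and the sample status codes are hypothetical, and it assumes status codes are already available as plain integers.

```python
from collections import Counter


def error_rates(status_codes):
    """Return (client_error_pct, server_error_pct) for an iterable of HTTP status codes."""
    # Bucket codes by their hundreds digit: 4 -> 4xx (client), 5 -> 5xx (server)
    counts = Counter(code // 100 for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        # No traffic: report zero rather than dividing by zero
        return 0.0, 0.0
    return 100.0 * counts[4] / total, 100.0 * counts[5] / total


# Hypothetical sample: 1,000 requests with two client errors and two server errors
codes = [200] * 996 + [404, 404, 500, 503]
client_pct, server_pct = error_rates(codes)
print(f"4xx: {client_pct:.1f}%  5xx: {server_pct:.1f}%")
```

Reporting the two classes separately keeps client mistakes from masking server-side failures; only the 5xx figure corresponds to the queries above.
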

**Common SLO Targets**:

| Service Type | Error Rate SLO |
|-------------|---------------|
| Public APIs | < 0.1% (99.9% success) |
| Internal services | < 0.5% |
| Background jobs | < 1% |

**Thresholds**:
- Warning: Error rate > 0.5% for 5 minutes
- Alert: Error rate > 1%
- Critical: Error rate > 5% (widespread failure)

**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Review: Daily (error trends and types)

**Related Metrics**:
- Error count by type (timeouts, crashes, validation)
- Error rate by endpoint
- Error rate by user cohort

---

### Metric 4: Saturation (Resource Utilization)

**Definition**: Percentage of resource capacity consumed

**Why It Matters**: High saturation → performance degradation, outages

**Data Source**: Infrastructure monitoring (Prometheus, CloudWatch, Datadog)

**Collection Method**:

**Prometheus Queries**:

```promql
# CPU saturation: 100 minus the average idle share per instance
# (averaging the non-idle modes directly would understate utilization)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk saturation
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
```

**Formula**: (Resource used / Resource capacity) × 100

**Targets**:

| Resource | Warning | Critical |
|----------|---------|----------|
| CPU | 70% | 85% |
| Memory | 80% | 90% |
| Disk | 75% | 85% |
| Network bandwidth | 70% | 85% |
| Database connections | 75% | 90% |

**Thresholds**:
- Warning: Sustained saturation > 70%
- Alert: Saturation > 85%
- Critical: Saturation > 90% or growing rapidly

**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Capacity Planning: Weekly (project future needs)

**Capacity Planning**:
- Track saturation trends (forecast exhaustion date)
- Scale before hitting 80% (buffer for traffic spikes)
- Plan capacity increases 3-6 months ahead

---

### Metric 5: Error Budget

**Definition**: Amount of allowed downtime before SLO breached


**Why It Matters**: Quantifies trade-off between reliability and velocity

**Data Source**: Calculated from SLO and actual performance

**Calculation**:

```python
def calculate_error_budget(slo_target, actual_sli, time_period_days):
    """
    slo_target: e.g., 99.9 (for 99.9% uptime)
    actual_sli: e.g., 99.95 (actual uptime)
    time_period_days: e.g., 30
    """
    if slo_target >= 100:
        # A 100% SLO leaves no error budget to divide by
        raise ValueError("slo_target must be below 100")

    error_budget_pct = 100 - slo_target
    actual_error_pct = 100 - actual_sli
    error_budget_remaining_pct = error_budget_pct - actual_error_pct
    error_budget_consumed_pct = (actual_error_pct / error_budget_pct) * 100

    minutes_in_period = time_period_days * 24 * 60
    error_budget_minutes = (error_budget_pct / 100) * minutes_in_period
    consumed_minutes = (actual_error_pct / 100) * minutes_in_period
    remaining_minutes = error_budget_minutes - consumed_minutes

    return {
        'error_budget_pct': error_budget_pct,
        'remaining_pct': error_budget_remaining_pct,
        'consumed_pct': error_budget_consumed_pct,
        'remaining_minutes': remaining_minutes,
        'total_budget_minutes': error_budget_minutes
    }

# Example:
# SLO = 99.9%, Actual = 99.95%, Period = 30 days
# Error budget = 0.1% = 43.2 minutes
# Actual errors = 0.05% = 21.6 minutes
# Budget consumed = 50%, remaining = 21.6 minutes
```

**Formula**:

```
Error Budget = (1 - SLO) × Time Period
Budget Consumed = (1 - Actual SLI) × Time Period
Budget Remaining = Error Budget - Budget Consumed
```

**Targets**:

| Budget Status | Action |
|--------------|--------|
| > 50% remaining | Prioritize features (move fast) |
| 25-50% remaining | Balance features and reliability |
| < 25% remaining | Slow down, focus on reliability |
| 0% remaining (exhausted) | Freeze features, fix reliability issues |

**Thresholds**:
- Warning: 75% budget consumed
- Alert: 90% budget consumed
- Critical: Budget exhausted (SLO breached)

**Recommended Review Cadence**:
- Monitor: Daily
- Review: Weekly (error budget burn rate)
- Report: Monthly (SLO report)

**Error Budget Policy** (define in advance):
- Budget healthy: Ship features, take calculated risks
- Budget low: Increase testing, slow
releases
- Budget exhausted: Feature freeze, focus on reliability

---

## Infrastructure Metrics

### Metric 6: CPU Utilization

**Definition**: Percentage of CPU capacity used

**Why It Matters**: High CPU → slow responses, capacity constraints

**Data Source**: Infrastructure monitoring (Prometheus, CloudWatch)

**Collection Method**:

**Prometheus**:

```promql
# 100 minus the average idle share per instance
# (averaging the non-idle modes directly would understate utilization)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

**CloudWatch (AWS)**:

```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2025-10-15T00:00:00Z \
  --end-time 2025-10-15T23:59:59Z \
  --period 300 \
  --statistics Average
```

**Formula**: (CPU time used / CPU time available) × 100

**Targets**:
- Normal operation: 40-60%
- Peak traffic: 70-80%
- Warning threshold: 85%
- Critical threshold: 95%

**Thresholds**:
- Warning: Avg CPU > 70% for 10 minutes
- Alert: Avg CPU > 85% for 5 minutes
- Critical: CPU > 95% or sustained > 85%

**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Review: Weekly (capacity planning)

**Remediation**:
- Scale horizontally (add instances)
- Optimize code (profiling)
- Offload to caching layer

---

### Metric 7: Memory Utilization

**Definition**: Percentage of RAM consumed

**Why It Matters**: High memory → swapping, OOM kills, crashes

**Data Source**: Infrastructure monitoring

**Collection Method**:

**Prometheus**:

```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

**Formula**: (Memory used / Total memory) × 100

**Targets**:
- Normal operation: 50-70%
- Warning threshold: 80%
- Critical threshold: 90%

**Thresholds**:
- Warning: Memory > 80% for 10 minutes
- Alert: Memory > 90%
- Critical: Memory > 95% or OOM events

**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Review: Weekly

**Common Issues**:
- Memory leaks (usage growing over time)
- Inefficient caching
- Large object allocation
- Database connection pooling

---

### Metric 8: Disk Usage

**Definition**: Percentage of disk space consumed

**Why It Matters**: Full disk → application crashes, data loss

**Data Source**: Infrastructure monitoring

**Collection Method**:

**Prometheus**:

```promql
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
```

**Bash**:

```bash
df -h / | awk 'NR==2 {print $5}' | sed 's/%//'
```

**Formula**: (Disk used / Disk capacity) × 100

**Targets**:
- Normal operation: < 70%
- Warning threshold: 80%
- Critical threshold: 90%

**Thresholds**:
- Warning: Disk > 75%
- Alert: Disk > 85%
- Critical: Disk > 90% or projected full within 7 days

**Recommended Review Cadence**:
- Monitor: Continuously
- Alert: When threshold exceeded
- Cleanup: Weekly (log rotation, temp files)

**Remediation**:
- Log rotation (delete old logs)
- Archive old data
- Increase disk size
- Cleanup temp files

---

### Metric 9: Network Throughput

**Definition**: Data transferred per time period (Mbps, Gbps)

**Why It Matters**: High throughput → bandwidth saturation, slow responses

**Data Source**: Infrastructure monitoring

**Collection Method**:

**Prometheus**:

```promql
# Incoming traffic (Mbps)
rate(node_network_receive_bytes_total[5m]) * 8 / 1000000

# Outgoing traffic (Mbps)
rate(node_network_transmit_bytes_total[5m]) * 8 / 1000000
```

**Formula**: Bytes transferred / Time period (convert to bits per second)

**Targets**:
- Normal operation: < 60% of capacity
- Warning threshold: 70%
- Critical threshold: 85%

**Thresholds**:
- Warning: Throughput > 70% of capacity
- Alert: Throughput > 85%
- Critical: Packet loss or sustained > 90%

**Recommended Review Cadence**:
- Monitor: Continuously
- Review: Weekly (capacity planning)

---

## Incident Metrics

### Metric 10: MTTD (Mean Time to Detect)

**Definition**: Average time from incident start to detection

**Why It Matters**: Fast detection minimizes impact

**Data Source**:
Monitoring alerts + incident logs

**Collection Method**:

```sql
SELECT
  AVG(detected_at - occurred_at) AS avg_mttd
FROM incidents
WHERE occurred_at >= NOW() - INTERVAL '90 days'
```

**Formula**: Detection time - Incident start time

**Targets**:
- Critical incidents: < 5 minutes
- High priority: < 15 minutes
- Medium priority: < 1 hour

**Thresholds**:
- Warning: MTTD > 15 minutes
- Investigation: MTTD increasing trend

**Recommended Review Cadence**:
- Track: Per incident
- Review: Monthly
- Improve: Quarterly (add monitoring)

**Improvement Strategies**:
- Add health checks
- Increase monitoring coverage
- Tune alert thresholds (reduce false negatives)
- Synthetic monitoring (proactive checks)

---

### Metric 11: MTTA (Mean Time to Acknowledge)

**Definition**: Average time from alert to responder engaged

**Why It Matters**: Fast response reduces incident duration

**Data Source**: Incident management system (PagerDuty, Opsgenie)

**Collection Method**:

```sql
SELECT
  AVG(acknowledged_at - detected_at) AS avg_mtta
FROM incidents
WHERE detected_at >= NOW() - INTERVAL '90 days'
```

**Formula**: Acknowledgment time - Detection time

**Targets**:
- Critical incidents: < 5 minutes
- High priority: < 15 minutes
- Medium priority: < 1 hour

**Thresholds**:
- Warning: MTTA > 15 minutes
- Investigation: MTTA > 30 minutes

**Recommended Review Cadence**:
- Track: Per incident
- Review: Monthly

**Improvement Strategies**:
- Clear on-call rotation
- Escalation policies
- Improved alert routing
- Mobile alerting

---

### Metric 12: MTBF (Mean Time Between Failures)

**Definition**: Average time between incidents

**Why It Matters**: Indicates system stability

**Data Source**: Incident logs

**Collection Method**:

```sql
WITH incident_intervals AS (
  SELECT
    occurred_at,
    LAG(occurred_at) OVER (ORDER BY occurred_at) AS previous_incident
  FROM incidents
  WHERE severity IN ('critical', 'high')
)
SELECT
  AVG(occurred_at - previous_incident) AS avg_mtbf
FROM incident_intervals
WHERE previous_incident IS
NOT NULL
```

**Formula**: Total uptime / Number of incidents

**Targets**:
- Mission-critical: > 30 days
- Production systems: > 14 days
- Non-critical: > 7 days

**Thresholds**:
- Warning: MTBF < 7 days
- Investigation: MTBF declining trend

**Recommended Review Cadence**:
- Calculate: Monthly
- Review: Quarterly

---

### Metric 13: Incident Count by Severity

**Definition**: Number of incidents per severity level

**Why It Matters**: Tracks incident trends, identifies problem areas

**Data Source**: Incident management system

**Collection Method**:

```sql
SELECT
  severity,
  COUNT(*) AS incident_count
FROM incidents
WHERE occurred_at >= NOW() - INTERVAL '30 days'
GROUP BY severity
ORDER BY
  CASE severity
    WHEN 'critical' THEN 1
    WHEN 'high' THEN 2
    WHEN 'medium' THEN 3
    WHEN 'low' THEN 4
  END
```

**Targets**:
- Critical: 0 per month
- High: < 2 per month
- Medium: < 10 per month
- Low: Acceptable (monitor trends)

**Thresholds**:
- Warning: > 1 critical per month
- Alert: > 3 high per month
- Investigation: Increasing trend

**Recommended Review Cadence**:
- Track: Weekly
- Review: Monthly (trend analysis)

---

## Cost Metrics

### Metric 14: Infrastructure Cost per User

**Definition**: Monthly infrastructure spend divided by active users

**Why It Matters**: Measures economic efficiency, unit economics

**Data Source**: Cloud billing (AWS Cost Explorer, GCP Billing) + user analytics

**Collection Method**:

**AWS CLI**:

```bash
# The End date is exclusive, so End=2025-11-01 covers all of October
aws ce get-cost-and-usage \
  --time-period Start=2025-10-01,End=2025-11-01 \
  --granularity MONTHLY \
  --metrics BlendedCost
```

**Calculation**:

```python
monthly_infrastructure_cost = 10000  # From cloud billing
monthly_active_users = 5000  # From analytics

cost_per_user = monthly_infrastructure_cost / monthly_active_users
# Result: $2 per user per month
```

**Formula**: Total infrastructure cost / Monthly Active Users

**Targets**:
- SaaS products: < 20% of ARPU (average revenue per user)
- Cost-sensitive: < 10% of ARPU
- High-margin: < 5% of ARPU

**Thresholds**:
- Warning: Cost per user > 20% of ARPU
- Alert: Cost per user increasing > 20% month-over-month
- Investigation: Cost growing faster than users

**Recommended Review Cadence**:
- Calculate: Monthly
- Review: Monthly (cost optimization)
- Plan: Quarterly (reserved instances, cost optimization)

**Optimization Strategies**:
- Right-size instances (avoid over-provisioning)
- Use spot instances for non-critical workloads
- Reserved instances for predictable workloads
- Auto-scaling (scale down during low traffic)
- Optimize storage (lifecycle policies, compression)

---

### Metric 15: Cloud Spend by Service

**Definition**: Cost breakdown by cloud service (compute, storage, database, etc.)

**Why It Matters**: Identifies cost drivers, optimization targets

**Data Source**: Cloud billing reports

**Collection Method**:

**AWS Cost Explorer API**:

```bash
# The End date is exclusive, so End=2025-11-01 covers all of October
aws ce get-cost-and-usage \
  --time-period Start=2025-10-01,End=2025-11-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
```

**Targets**:
- No universal targets (depends on architecture)
- Track month-over-month changes

**Thresholds**:
- Warning: Any service cost increases > 30% MoM
- Investigation: Total cost increases > 20% MoM

**Recommended Review Cadence**:
- Review: Monthly
- Deep Dive: Quarterly

**Common Cost Drivers**:
- EC2/Compute: Right-sizing, reserved instances
- RDS/Database: Instance optimization, read replicas
- S3/Storage: Lifecycle policies, compression
- Data Transfer: CDN usage, region optimization
- Load Balancers: Consolidation, traffic optimization

---

### Metric 16: Cost Efficiency (Cost per Request)

**Definition**: Infrastructure cost divided by request volume

**Why It Matters**: Normalizes cost by usage, tracks scaling efficiency

**Data Source**: Cloud billing + request logs

**Calculation**:

```python
monthly_cost = 10000  # Total infrastructure cost
monthly_requests = 50000000  # Total requests handled

cost_per_million_requests = (monthly_cost / monthly_requests) * \
1000000  # Result: $200 per million requests ($0.0002 per request)
```

**Formula**: Total cost / Total requests (normalize per million)

**Targets**:
- Depends on business model
- Track trend (should decrease as scale increases)

**Thresholds**:
- Warning: Cost per request increasing
- Investigation: Cost per request not decreasing with scale

**Recommended Review Cadence**:
- Calculate: Monthly
- Review: Quarterly (economies of scale)

---

## Summary Table

| Metric | Category | Data Source | Frequency | Target | Critical Threshold |
|--------|----------|-------------|-----------|--------|--------------------|
| Availability | SLO/SLI | Monitoring | Continuous | ≥ 99.9% | < SLO target |
| Latency (p95) | SLO/SLI | APM | Continuous | < 300ms | > 1s |
| Error Rate | SLO/SLI | Logs | Continuous | < 0.1% | > 1% |
| Saturation (CPU) | SLO/SLI | Monitoring | Continuous | < 70% | > 85% |
| Error Budget | SLO/SLI | Calculated | Daily | > 25% | 0% (exhausted) |
| CPU Utilization | Infrastructure | Monitoring | Continuous | 40-60% | > 85% |
| Memory Utilization | Infrastructure | Monitoring | Continuous | 50-70% | > 90% |
| Disk Usage | Infrastructure | Monitoring | Continuous | < 70% | > 90% |
| Network Throughput | Infrastructure | Monitoring | Continuous | < 60% capacity | > 85% capacity |
| MTTD | Incidents | Monitoring + Incidents | Per incident | < 5 min | > 15 min |
| MTTA | Incidents | Incident system | Per incident | < 5 min | > 15 min |
| MTBF | Incidents | Incident logs | Monthly | > 30 days | < 7 days |
| Incident Count | Incidents | Incident system | Weekly | 0 critical/month | > 1 critical/month |
| Cost per User | Cost | Billing + Analytics | Monthly | < 20% ARPU | > 30% ARPU |
| Cloud Spend | Cost | Billing | Monthly | Stable | > 30% MoM increase |
| Cost per Request | Cost | Billing + Logs | Monthly | Decreasing | Increasing |

---

## Conclusion

Operational metrics enable proactive reliability management and cost optimization.

**Key Takeaways**:
1. SLOs quantify reliability targets (not everything needs five nines)
2. Error budgets balance reliability and velocity
3. Infrastructure metrics predict capacity constraints
4. Incident metrics drive process improvements
5. Cost metrics ensure economic sustainability

**Next Steps**:
1. Define SLOs for critical user journeys (3-5 SLOs)
2. Implement monitoring and alerting
3. Establish error budget policies
4. Track incident metrics, run postmortems
5. Monitor costs monthly, optimize quarterly

**Critical Success Factor**: Reliability is a feature. Budget for it, measure it, improve it.
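
An error budget policy is easier to enforce once it is written down as code. The sketch below is one possible encoding of the Metric 5 targets table; the function name `budget_policy` and its minute-based inputs are assumptions made for illustration, and the boundary handling (exactly 25% or 50% remaining) is a choice the table does not specify.

```python
def budget_policy(total_budget_minutes, consumed_minutes):
    """Map error budget consumption to the action from the Metric 5 targets table."""
    if total_budget_minutes <= 0:
        raise ValueError("total_budget_minutes must be positive")

    remaining_pct = 100.0 * (total_budget_minutes - consumed_minutes) / total_budget_minutes

    if remaining_pct <= 0:
        return "Freeze features, fix reliability issues"
    if remaining_pct < 25:
        return "Slow down, focus on reliability"
    if remaining_pct <= 50:
        return "Balance features and reliability"
    return "Prioritize features (move fast)"


# A 99.9% SLO over 30 days gives a 43.2-minute budget
for consumed in (10, 25, 40, 50):
    print(consumed, "->", budget_policy(43.2, consumed))
```

Agreeing on the thresholds before an incident means the feature-freeze decision is made by the policy rather than negotiated under pressure.
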