yoda-mcp

Intelligent Planning MCP with Optional Dependencies and Graceful Fallbacks - wise planning through the Force of lean excellence

# Planner MCP Monitoring Runbook

## Overview

This runbook provides comprehensive monitoring procedures for the Planner MCP system, including metrics collection, alerting, dashboard management, and troubleshooting guides.

## Monitoring Architecture

### Components

- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **AlertManager**: Alert routing and notification
- **ELK Stack**: Log aggregation and analysis
- **PagerDuty**: Incident management and escalation
- **Custom Dashboards**: Business metrics and system health

### Data Flow

```
Application → Prometheus → Grafana → Alerts → PagerDuty
     ↓
   Logs → ELK Stack → Kibana → Log Alerts
```

## Key Metrics

### System Health Metrics

#### Application Metrics

- **Response Time**: HTTP request latency (P50, P95, P99)
- **Throughput**: Requests per second
- **Error Rate**: Percentage of failed requests
- **Availability**: Uptime percentage
- **Concurrent Users**: Active user sessions

#### Infrastructure Metrics

- **CPU Utilization**: Per pod and cluster-wide
- **Memory Usage**: RAM consumption and limits
- **Disk I/O**: Read/write operations and latency
- **Network Traffic**: Ingress/egress bandwidth
- **Pod Health**: Running, pending, failed pods

#### MCP Server Metrics

- **Connection Status**: Connectivity to each MCP server
- **Response Time**: MCP server response latencies
- **Capability Tier**: Current operational capability level
- **Fallback Events**: Frequency of capability degradation
- **Recovery Time**: Time to restore full capabilities

### Business Metrics

#### Planning Performance

- **Planning Success Rate**: Percentage of successful planning requests
- **Quality Score**: Average plan quality certification level
- **User Satisfaction**: Feedback scores and ratings
- **Time to Completion**: Average planning cycle duration
- **Feature Usage**: Adoption of different planning features

#### Security Metrics

- **Authentication Events**: Login attempts and failures
- **Authorization Violations**: Access control failures
- **Security Scan Results**: Vulnerability counts by severity
- **Audit Log Volume**: Security event frequency
- **Threat Detection**: Security anomalies identified

## Monitoring Dashboards

### Executive Dashboard

**URL**: `https://grafana.planner-mcp.com/d/executive`

#### Key Widgets

- **System Availability**: 99.9% SLA compliance
- **Business Metrics**: Planning success rate, user growth
- **Financial Impact**: Revenue impact of outages
- **Customer Satisfaction**: NPS scores and feedback trends

Executive dashboard configuration:

```json
{
  "dashboard": {
    "title": "Executive Dashboard",
    "panels": [
      {
        "title": "System Availability",
        "type": "stat",
        "targets": [
          {
            "expr": "100 * (1 - rate(http_requests_total{status=~'5..'}[7d]) / rate(http_requests_total[7d]))",
            "legendFormat": "Availability %"
          }
        ]
      },
      {
        "title": "Planning Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(planning_requests_successful_total[24h]) / rate(planning_requests_total[24h]) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      }
    ]
  }
}
```

### Operations Dashboard

**URL**: `https://grafana.planner-mcp.com/d/operations`

#### System Health Section

```json
{
  "panels": [
    {
      "title": "Response Time Distribution",
      "type": "graph",
      "targets": [
        { "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "P50" },
        { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "P95" },
        { "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "P99" }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        { "expr": "rate(http_requests_total{status=~'4..'}[5m])", "legendFormat": "4xx Errors" },
        { "expr": "rate(http_requests_total{status=~'5..'}[5m])", "legendFormat": "5xx Errors" }
      ]
    }
  ]
}
```

#### Infrastructure Section

```json
{
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [
        { "expr": "avg(rate(container_cpu_usage_seconds_total[5m])) by (pod)", "legendFormat": "{{ pod }}" }
      ]
    },
    {
      "title": "Memory Usage",
      "type": "graph",
      "targets": [
        { "expr": "avg(container_memory_usage_bytes) by (pod) / 1024 / 1024", "legendFormat": "{{ pod }} MB" }
      ]
    }
  ]
}
```

### Engineering Dashboard

**URL**: `https://grafana.planner-mcp.com/d/engineering`

#### Performance Metrics

```json
{
  "panels": [
    {
      "title": "MCP Server Health",
      "type": "table",
      "targets": [
        { "expr": "mcp_server_up", "format": "table", "legendFormat": "{{ server }}" }
      ]
    },
    {
      "title": "Database Performance",
      "type": "graph",
      "targets": [
        { "expr": "rate(database_query_duration_seconds_sum[5m]) / rate(database_query_duration_seconds_count[5m])", "legendFormat": "Avg Query Time" }
      ]
    }
  ]
}
```

## Alerting Rules

### Critical Alerts (P0)

#### System Down

```yaml
- alert: SystemDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "System is down"
    description: "System has been down for more than 1 minute"
```

#### High Error Rate

```yaml
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }}"
```

#### Database Connection Loss

```yaml
- alert: DatabaseConnectionLoss
  expr: database_connections_active == 0
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Database connection lost"
    description: "No active database connections"
```

### Warning Alerts (P1)

#### High Response Time

```yaml
- alert: HighResponseTime
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High response time"
    description: "95th percentile response time is {{ $value }}s"
```

#### Memory Usage High

```yaml
- alert: MemoryUsageHigh
  expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage"
    description: "Memory usage is {{ $value | humanizePercentage }}"
```

#### MCP Server Degraded

```yaml
- alert: MCPServerDegraded
  expr: mcp_server_response_time > 5
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "MCP server degraded performance"
    description: "{{ $labels.server }} response time is {{ $value }}s"
```

### Business Alerts

#### Planning Success Rate Drop

```yaml
- alert: PlanningSuccessRateDrop
  expr: rate(planning_requests_successful_total[10m]) / rate(planning_requests_total[10m]) < 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Planning success rate dropped"
    description: "Success rate is {{ $value | humanizePercentage }}"
```

#### Quality Score Degradation

```yaml
- alert: QualityScoreDegradation
  expr: avg_over_time(quality_certification_score[15m]) < 80
  for: 10m
  labels:
    severity: info
  annotations:
    summary: "Quality score degraded"
    description: "Average quality score is {{ $value }}"
```

## Alert Management

### Alert Routing Configuration

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@planner-mcp.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://webhook.example.com/'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-service-key'
        description: 'Critical alert from Planner MCP'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: 'Planner MCP Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'email'
    email_configs:
      - to: 'engineering@company.com'
        subject: 'Planner MCP Alert'
        body: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```

### Alert Escalation

#### Escalation Matrix

| Severity | Initial | 15 min  | 30 min   | 1 hour |
|----------|---------|---------|----------|--------|
| Critical | On-call | Manager | Director | VP     |
| Warning  | Slack   | Email   | Manager  | -      |
| Info     | Email   | -       | -        | -      |

#### PagerDuty Integration

```bash
# Test PagerDuty integration
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "your-routing-key",
    "event_action": "trigger",
    "payload": {
      "summary": "Test alert from Planner MCP",
      "severity": "critical",
      "source": "monitoring-system"
    }
  }'
```

## Log Monitoring

### Log Sources

#### Application Logs

- **Location**: `/var/log/planner-mcp/app.log`
- **Format**: JSON structured logs
- **Retention**: 30 days
- **Shipping**: Fluent Bit → Elasticsearch

#### System Logs

- **Location**: `/var/log/system/`
- **Format**: Syslog format
- **Retention**: 90 days
- **Shipping**: Filebeat → Elasticsearch

#### Audit Logs

- **Location**: `/var/log/planner-mcp/audit.log`
- **Format**: JSON with security context
- **Retention**: 7 years
- **Shipping**: Secure log pipeline → Archive

### Log Analysis Queries

#### Error Analysis

```lucene
# High-frequency errors
level:ERROR AND @timestamp:[now-1h TO now]
| stats count by error.message
| sort count desc
```

#### Security Events

```lucene
# Failed authentication attempts
event.type:authentication AND event.outcome:failure
| stats count by source.ip, user.name
| where count > 10
```

#### Performance Analysis

```lucene
# Slow requests
response.time:>2000 AND @timestamp:[now-15m TO now]
| stats avg(response.time) by url.path
| sort response.time desc
```

### Log Alerts

#### Error Rate Spike

Watcher configuration for Elasticsearch:

```json
{
  "trigger": { "schedule": { "interval": "1m" } },
  "input": {
    "search": {
      "request": {
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "term": { "level": "ERROR" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 10 } }
  },
  "actions": {
    "send_slack": {
      "slack": {
        "message": {
          "to": "#alerts",
          "text": "High error rate detected: {{ctx.payload.hits.total}} errors in last 5 minutes"
        }
      }
    }
  }
}
```

## Monitoring Procedures

### Daily Health Check

```bash
#!/bin/bash
# daily-health-check.sh

echo "=== Daily Health Check Report ===" > /tmp/health-report.txt
echo "Date: $(date)" >> /tmp/health-report.txt
echo "" >> /tmp/health-report.txt

# System availability
AVAILABILITY=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=up' | jq -r '.data.result[0].value[1]')
echo "System Availability: ${AVAILABILITY}" >> /tmp/health-report.txt

# Response time
RESPONSE_TIME=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[24h]))' \
  | jq -r '.data.result[0].value[1]')
echo "95th Percentile Response Time: ${RESPONSE_TIME}s" >> /tmp/health-report.txt

# Error rate
ERROR_RATE=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(http_requests_total{status=~"5.."}[24h]) / rate(http_requests_total[24h])' \
  | jq -r '.data.result[0].value[1]')
echo "Error Rate: ${ERROR_RATE}" >> /tmp/health-report.txt

# MCP server health
echo "" >> /tmp/health-report.txt
echo "MCP Server Status:" >> /tmp/health-report.txt
for server in serena sequential context7 todolist; do
  STATUS=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
    --data-urlencode "query=mcp_server_up{server=\"$server\"}" \
    | jq -r '.data.result[0].value[1]')
  echo "  $server: $STATUS" >> /tmp/health-report.txt
done

# Send report
mail -s "Daily Health Check Report" engineering@company.com < /tmp/health-report.txt
```

### Weekly Performance Review

```bash
#!/bin/bash
# weekly-performance-review.sh

echo "=== Weekly Performance Review ===" > /tmp/perf-report.txt
echo "Week ending: $(date)" >> /tmp/perf-report.txt
echo "" >> /tmp/perf-report.txt

# SLA compliance
echo "SLA Compliance (99.9% target):" >> /tmp/perf-report.txt
UPTIME=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time(up[7d])' | jq -r '.data.result[0].value[1]')
echo "  Uptime: $(echo "$UPTIME * 100" | bc)%" >> /tmp/perf-report.txt

# Performance trends
echo "" >> /tmp/perf-report.txt
echo "Performance Trends:" >> /tmp/perf-report.txt
P95_CURRENT=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[7d]))' \
  | jq -r '.data.result[0].value[1]')
P95_PREVIOUS=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[7d] offset 7d))' \
  | jq -r '.data.result[0].value[1]')
echo "  P95 Response Time: ${P95_CURRENT}s (prev: ${P95_PREVIOUS}s)" >> /tmp/perf-report.txt

# Business metrics
echo "" >> /tmp/perf-report.txt
echo "Business Metrics:" >> /tmp/perf-report.txt
SUCCESS_RATE=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(planning_requests_successful_total[7d]) / rate(planning_requests_total[7d])' \
  | jq -r '.data.result[0].value[1]')
echo "  Planning Success Rate: $(echo "$SUCCESS_RATE * 100" | bc)%" >> /tmp/perf-report.txt

# Send report
mail -s "Weekly Performance Review" management@company.com < /tmp/perf-report.txt
```

### Monthly Capacity Planning

```bash
#!/bin/bash
# monthly-capacity-planning.sh

echo "=== Monthly Capacity Planning Report ===" > /tmp/capacity-report.txt
echo "Month: $(date +'%B %Y')" >> /tmp/capacity-report.txt
echo "" >> /tmp/capacity-report.txt

# Resource utilization trends
echo "Resource Utilization Trends:" >> /tmp/capacity-report.txt
CPU_USAGE=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=avg(rate(container_cpu_usage_seconds_total[30d]))' \
  | jq -r '.data.result[0].value[1]')
MEMORY_USAGE=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=avg(avg_over_time(container_memory_usage_bytes[30d])) / 1024 / 1024 / 1024' \
  | jq -r '.data.result[0].value[1]')
echo "  Average CPU Usage: ${CPU_USAGE}" >> /tmp/capacity-report.txt
echo "  Average Memory Usage: ${MEMORY_USAGE}GB" >> /tmp/capacity-report.txt

# Growth projections
echo "" >> /tmp/capacity-report.txt
echo "Growth Projections:" >> /tmp/capacity-report.txt
REQUEST_RATE=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total[30d]))' \
  | jq -r '.data.result[0].value[1]')
echo "  Current Request Rate: ${REQUEST_RATE} req/s" >> /tmp/capacity-report.txt

# Recommendations
echo "" >> /tmp/capacity-report.txt
echo "Recommendations:" >> /tmp/capacity-report.txt
if (( $(echo "$CPU_USAGE > 0.7" | bc -l) )); then
  echo "  - Consider scaling CPU resources" >> /tmp/capacity-report.txt
fi
if (( $(echo "$MEMORY_USAGE > 8" | bc -l) )); then
  echo "  - Consider scaling memory resources" >> /tmp/capacity-report.txt
fi

# Send report
mail -s "Monthly Capacity Planning Report" capacity-planning@company.com < /tmp/capacity-report.txt
```

## Troubleshooting Guide

### High Response Time

#### Symptoms

- P95 response time > 2 seconds
- Users complaining about slow performance
- Timeout errors increasing

#### Investigation Steps

1. **Check System Load**

   ```bash
   # Check current CPU usage
   kubectl top pods -n planner-mcp

   # Check memory usage
   kubectl describe nodes | grep -A 5 "Allocated resources"
   ```

2. **Database Performance**

   ```bash
   # Check slow queries
   kubectl exec -n database postgres-0 -- psql -c "SELECT query, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;"

   # Check database connections
   kubectl exec -n database postgres-0 -- psql -c "SELECT count(*) FROM pg_stat_activity;"
   ```

3. **MCP Server Performance**

   ```bash
   # Check MCP server response times
   curl -s -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=mcp_server_response_time'

   # Test individual MCP servers
   ./scripts/test-mcp-performance.sh
   ```

#### Resolution Steps

1. **Immediate Actions**

   ```bash
   # Scale up replicas
   kubectl scale deployment/planner-mcp --replicas=5 -n planner-mcp

   # Clear application cache
   kubectl exec -n planner-mcp planner-mcp-xxx -- redis-cli FLUSHALL
   ```

2. **Database Optimization**

   ```bash
   # Terminate idle-in-transaction connections
   kubectl exec -n database postgres-0 -- psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction';"

   # Update database statistics
   kubectl exec -n database postgres-0 -- psql -c "ANALYZE;"
   ```

### High Error Rate

#### Symptoms

- 5xx error rate > 1%
- Failed requests increasing
- User-facing errors

#### Investigation Steps

1. **Check Error Logs**

   ```bash
   # Recent error logs
   kubectl logs -n planner-mcp -l app=planner-mcp --tail=100 | grep ERROR

   # Error frequency analysis
   kubectl logs -n planner-mcp -l app=planner-mcp --since=1h | grep ERROR | awk '{print $3}' | sort | uniq -c
   ```

2. **Health Check Status**

   ```bash
   # Check health endpoints
   curl -f https://planner-mcp.example.com/health

   # Check individual service health
   ./scripts/health-check-all.sh
   ```

#### Resolution Steps

1. **Service Recovery**

   ```bash
   # Delete pods that are not running (phase is a field, not a label)
   kubectl delete pod -n planner-mcp -l app=planner-mcp --field-selector=status.phase!=Running

   # Rolling restart if needed
   kubectl rollout restart deployment/planner-mcp -n planner-mcp
   ```

### MCP Server Unavailable

#### Symptoms

- MCP server connectivity alerts
- Fallback mode activated
- Reduced system capabilities

#### Investigation Steps

1. **Connectivity Test**

   ```bash
   # Test MCP server connectivity
   ./scripts/test-mcp-connectivity.sh serena

   # Check DNS resolution
   nslookup serena-mcp.internal
   ```

2. **Network Analysis**

   ```bash
   # Check network policies
   kubectl get networkpolicies -n planner-mcp

   # Trace network path
   kubectl exec -n planner-mcp planner-mcp-xxx -- traceroute serena-mcp.internal
   ```

#### Resolution Steps

1. **MCP Server Recovery**

   ```bash
   # Restart MCP server connection
   ./scripts/restart-mcp-connection.sh serena

   # Force capability tier recalculation
   ./scripts/recalculate-capabilities.sh
   ```

## Performance Optimization

### Caching Strategy

#### Cache Hit Rate Monitoring

```bash
# Redis cache hit/miss counters
kubectl exec -n planner-mcp redis-0 -- redis-cli info stats | grep keyspace
```

#### Cache Optimization

```bash
# Clear expired cache entries
kubectl exec -n planner-mcp redis-0 -- redis-cli EVAL "return redis.call('del', unpack(redis.call('keys', ARGV[1])))" 0 "*expired*"

# Optimize cache configuration
kubectl patch configmap planner-config -n planner-mcp --patch '{"data":{"cache_ttl":"3600","cache_size":"1GB"}}'
```

### Database Optimization

#### Index Analysis

```bash
# Check for unused indexes (never scanned since stats reset)
kubectl exec -n database postgres-0 -- psql -c "SELECT schemaname, relname, indexrelname, idx_scan FROM pg_stat_user_indexes WHERE idx_scan = 0 ORDER BY relname;"
```

#### Query Optimization

```bash
# Identify slow queries
kubectl exec -n database postgres-0 -- psql -c "SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"
```

## Contact Information

### Monitoring Team

- **Primary**: @monitoring-team
- **Secondary**: @devops-team
- **Escalation**: @engineering-manager

### Tool Access

- **Grafana**: https://grafana.planner-mcp.com
- **Prometheus**: https://prometheus.planner-mcp.com
- **Kibana**: https://kibana.planner-mcp.com
- **PagerDuty**: https://company.pagerduty.com

---

**Last Updated**: {current_date}
**Version**: 1.0
**Owner**: Monitoring Team
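As a sanity check on the availability formula used in the Executive Dashboard panel (`100 * (1 - errors/total)`), the arithmetic can be reproduced offline. The counter values below are illustrative assumptions, not real Prometheus output:

```shell
# Illustrative request totals over a 7-day window (assumed values)
TOTAL=120000   # all requests, e.g. aggregated from http_requests_total
ERRORS=84      # 5xx responses over the same window
# Same formula as the dashboard panel: 100 * (1 - errors / total)
AVAILABILITY=$(awk -v t="$TOTAL" -v e="$ERRORS" 'BEGIN { printf "%.3f", 100 * (1 - e / t) }')
echo "Availability: ${AVAILABILITY}%"   # prints "Availability: 99.930%"
```

With 84 errors in 120,000 requests the error ratio is 0.07%, i.e. 99.93% availability, just above the 99.9% SLA target.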
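The MemoryUsageHigh rule fires when `container_memory_usage_bytes / container_spec_memory_limit_bytes` exceeds 0.8. A quick sketch of the same check, with assumed byte counts in place of live metrics:

```shell
# Assumed sample values in bytes; in production these come from the
# container_memory_usage_bytes and container_spec_memory_limit_bytes metrics
USAGE=1932735283
LIMIT=2147483648   # 2 GiB
PERCENT=$(awk -v u="$USAGE" -v l="$LIMIT" 'BEGIN { printf "%.1f", 100 * u / l }')
# Same 0.8 threshold as the MemoryUsageHigh alerting rule
if awk -v u="$USAGE" -v l="$LIMIT" 'BEGIN { exit !(u / l > 0.8) }'; then
  echo "WARNING: memory usage at ${PERCENT}% of limit"
fi
```

At 90.0% of the limit this would have been firing for well over the rule's 5-minute `for:` window before a human sees it, which is why the threshold sits at 80% rather than closer to the limit.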
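For tooling that needs the escalation matrix programmatically (e.g. a paging bot), the table maps cleanly onto a lookup. This is a sketch only; the `escalate` helper is hypothetical, but the names and timings match the matrix above:

```shell
# escalate SEVERITY MINUTES -> who is notified at that point in the escalation
escalate() {
  case "$1:$2" in
    critical:0)  echo "On-call"  ;;
    critical:15) echo "Manager"  ;;
    critical:30) echo "Director" ;;
    critical:60) echo "VP"       ;;
    warning:0)   echo "Slack"    ;;
    warning:15)  echo "Email"    ;;
    warning:30)  echo "Manager"  ;;
    info:0)      echo "Email"    ;;
    *)           echo "-"        ;;
  esac
}

escalate critical 30   # prints "Director"
```

Encoding the matrix once and reusing it keeps the runbook table and the paging automation from drifting apart.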