yoda-mcp — Intelligent Planning MCP with Optional Dependencies and Graceful Fallbacks ("wise planning through the Force of lean excellence")

# Planner MCP Monitoring Runbook
## Overview
This runbook provides comprehensive monitoring procedures for the Planner MCP system, including metrics collection, alerting, dashboard management, and troubleshooting guides.
## Monitoring Architecture
### Components
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **AlertManager**: Alert routing and notification
- **ELK Stack**: Log aggregation and analysis
- **PagerDuty**: Incident management and escalation
- **Custom Dashboards**: Business metrics and system health
### Data Flow
```
Application → Prometheus → Grafana → Alerts → PagerDuty
     │
     └→ Logs → ELK Stack → Kibana → Log Alerts
```
## Key Metrics
### System Health Metrics
#### Application Metrics
- **Response Time**: HTTP request latency (P50, P95, P99)
- **Throughput**: Requests per second
- **Error Rate**: Percentage of failed requests
- **Availability**: Uptime percentage
- **Concurrent Users**: Active user sessions
#### Infrastructure Metrics
- **CPU Utilization**: Per pod and cluster-wide
- **Memory Usage**: RAM consumption and limits
- **Disk I/O**: Read/write operations and latency
- **Network Traffic**: Ingress/egress bandwidth
- **Pod Health**: Running, pending, failed pods
#### MCP Server Metrics
- **Connection Status**: Connectivity to each MCP server
- **Response Time**: MCP server response latencies
- **Capability Tier**: Current operational capability level
- **Fallback Events**: Frequency of capability degradation
- **Recovery Time**: Time to restore full capabilities
### Business Metrics
#### Planning Performance
- **Planning Success Rate**: Percentage of successful planning requests
- **Quality Score**: Average plan quality certification level
- **User Satisfaction**: Feedback scores and ratings
- **Time to Completion**: Average planning cycle duration
- **Feature Usage**: Adoption of different planning features
#### Security Metrics
- **Authentication Events**: Login attempts and failures
- **Authorization Violations**: Access control failures
- **Security Scan Results**: Vulnerability counts by severity
- **Audit Log Volume**: Security event frequency
- **Threat Detection**: Security anomalies identified
## Monitoring Dashboards
### Executive Dashboard
**URL**: `https://grafana.planner-mcp.com/d/executive`
#### Key Widgets
- **System Availability**: 99.9% SLA compliance
- **Business Metrics**: Planning success rate, user growth
- **Financial Impact**: Revenue impact of outages
- **Customer Satisfaction**: NPS scores and feedback trends
```json
{
  "dashboard": {
    "title": "Executive Dashboard",
    "panels": [
      {
        "title": "System Availability",
        "type": "stat",
        "targets": [
          {
            "expr": "100 * (1 - sum(rate(http_requests_total{status=~'5..'}[7d])) / sum(rate(http_requests_total[7d])))",
            "legendFormat": "Availability %"
          }
        ]
      },
      {
        "title": "Planning Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(planning_requests_successful_total[24h])) / sum(rate(planning_requests_total[24h])) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      }
    ]
  }
}
```
### Operations Dashboard
**URL**: `https://grafana.planner-mcp.com/d/operations`
#### System Health Section
```json
{
  "panels": [
    {
      "title": "Response Time Distribution",
      "type": "graph",
      "targets": [
        { "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "P50" },
        { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "P95" },
        { "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))", "legendFormat": "P99" }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        { "expr": "rate(http_requests_total{status=~'4..'}[5m])", "legendFormat": "4xx Errors" },
        { "expr": "rate(http_requests_total{status=~'5..'}[5m])", "legendFormat": "5xx Errors" }
      ]
    }
  ]
}
```
#### Infrastructure Section
```json
{
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [
        { "expr": "avg(rate(container_cpu_usage_seconds_total[5m])) by (pod)", "legendFormat": "{{ pod }}" }
      ]
    },
    {
      "title": "Memory Usage",
      "type": "graph",
      "targets": [
        { "expr": "avg(container_memory_usage_bytes) by (pod) / 1024 / 1024", "legendFormat": "{{ pod }} MB" }
      ]
    }
  ]
}
```
### Engineering Dashboard
**URL**: `https://grafana.planner-mcp.com/d/engineering`
#### Performance Metrics
```json
{
  "panels": [
    {
      "title": "MCP Server Health",
      "type": "table",
      "targets": [
        { "expr": "mcp_server_up", "format": "table", "legendFormat": "{{ server }}" }
      ]
    },
    {
      "title": "Database Performance",
      "type": "graph",
      "targets": [
        { "expr": "rate(database_query_duration_seconds_sum[5m]) / rate(database_query_duration_seconds_count[5m])", "legendFormat": "Avg Query Time" }
      ]
    }
  ]
}
```
## Alerting Rules
### Critical Alerts (P0)
#### System Down
```yaml
- alert: SystemDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "System is down"
    description: "System has been down for more than 1 minute"
```
#### High Error Rate
```yaml
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }}"
```
#### Database Connection Loss
```yaml
- alert: DatabaseConnectionLoss
  expr: database_connections_active == 0
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Database connection lost"
    description: "No active database connections"
```
### Warning Alerts (P1)
#### High Response Time
```yaml
- alert: HighResponseTime
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High response time"
    description: "95th percentile response time is {{ $value }}s"
```
#### Memory Usage High
```yaml
- alert: MemoryUsageHigh
  expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory usage"
    description: "Memory usage is {{ $value | humanizePercentage }}"
```
#### MCP Server Degraded
```yaml
- alert: MCPServerDegraded
  expr: mcp_server_response_time > 5
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "MCP server degraded performance"
    description: "{{ $labels.server }} response time is {{ $value }}s"
```
### Business Alerts
#### Planning Success Rate Drop
```yaml
- alert: PlanningSuccessRateDrop
  expr: sum(rate(planning_requests_successful_total[10m])) / sum(rate(planning_requests_total[10m])) < 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Planning success rate dropped"
    description: "Success rate is {{ $value | humanizePercentage }}"
```
#### Quality Score Degradation
```yaml
- alert: QualityScoreDegradation
  expr: avg_over_time(quality_certification_score[15m]) < 80
  for: 10m
  labels:
    severity: info
  annotations:
    summary: "Quality score degraded"
    description: "Average quality score is {{ $value }}"
```
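Alert expressions like these can be unit-tested with `promtool test rules` before rollout. A minimal sketch for the `HighErrorRate` rule (the file name, series values, and expected labels are illustrative, and the label expectations depend on whether the deployed expression aggregates with `sum()`):

```yaml
# high-error-rate.test.yml — run with: promtool test rules high-error-rate.test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 2xx and 5xx counters both increasing 60/min → 50% error rate
      - series: 'http_requests_total{status="200"}'
        values: '0+60x15'
      - series: 'http_requests_total{status="500"}'
        values: '0+60x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
```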
## Alert Management
### Alert Routing Configuration
```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@planner-mcp.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://webhook.example.com/'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-service-key'
        description: 'Critical alert from Planner MCP'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: 'Planner MCP Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'email'
    email_configs:
      - to: 'engineering@company.com'
        headers:
          subject: 'Planner MCP Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
### Alert Escalation
#### Escalation Matrix
| Severity | Initial | 15 min  | 30 min   | 1 hour |
|----------|---------|---------|----------|--------|
| Critical | On-call | Manager | Director | VP     |
| Warning  | Slack   | Email   | Manager  | -      |
| Info     | Email   | -       | -        | -      |
#### PagerDuty Integration
```bash
# Test PagerDuty integration
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H 'Content-Type: application/json' \
-d '{
"routing_key": "your-routing-key",
"event_action": "trigger",
"payload": {
"summary": "Test alert from Planner MCP",
"severity": "critical",
"source": "monitoring-system"
}
}'
```
## Log Monitoring
### Log Sources
#### Application Logs
- **Location**: `/var/log/planner-mcp/app.log`
- **Format**: JSON structured logs
- **Retention**: 30 days
- **Shipping**: Fluent Bit → Elasticsearch
#### System Logs
- **Location**: `/var/log/system/`
- **Format**: Syslog format
- **Retention**: 90 days
- **Shipping**: Filebeat → Elasticsearch
#### Audit Logs
- **Location**: `/var/log/planner-mcp/audit.log`
- **Format**: JSON with security context
- **Retention**: 7 years
- **Shipping**: Secure log pipeline → Archive
### Log Analysis Queries
#### Error Analysis
```lucene
# High-frequency errors
level:ERROR AND @timestamp:[now-1h TO now]
| stats count by error.message
| sort count desc
```
#### Security Events
```lucene
# Failed authentication attempts
event.type:authentication AND event.outcome:failure
| stats count by source.ip, user.name
| where count > 10
```
#### Performance Analysis
```lucene
# Slow requests
response.time:>2000 AND @timestamp:[now-15m TO now]
| stats avg(response.time) by url.path
| sort avg(response.time) desc
```
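Because the application logs are JSON (see Log Sources above), the same kind of error analysis also works locally with `jq`. A sketch that counts `ERROR` entries by message, assuming log lines carry `level` and `message` fields:

```shell
#!/bin/sh
# Sample JSON log lines (illustrative)
LOGS='{"level":"ERROR","message":"db timeout"}
{"level":"INFO","message":"request ok"}
{"level":"ERROR","message":"db timeout"}
{"level":"ERROR","message":"mcp unreachable"}'

# Count ERROR entries grouped by message, most frequent first
printf '%s\n' "$LOGS" \
  | jq -r 'select(.level == "ERROR") | .message' \
  | sort | uniq -c | sort -rn
```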
### Log Alerts
#### Error Rate Spike
Watcher configuration for Elasticsearch:
```json
{
  "trigger": {
    "schedule": { "interval": "1m" }
  },
  "input": {
    "search": {
      "request": {
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "term": { "level": "ERROR" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": { "gt": 10 }
    }
  },
  "actions": {
    "send_slack": {
      "slack": {
        "message": {
          "to": "#alerts",
          "text": "High error rate detected: {{ctx.payload.hits.total}} errors in last 5 minutes"
        }
      }
    }
  }
}
```
## Monitoring Procedures
### Daily Health Check
```bash
#!/bin/bash
# daily-health-check.sh
PROM="http://prometheus:9090/api/v1/query"
REPORT=/tmp/health-report.txt

echo "=== Daily Health Check Report ===" > "$REPORT"
echo "Date: $(date)" >> "$REPORT"
echo "" >> "$REPORT"

# Helper: run a PromQL query (URL-encoded, so spaces and commas survive)
# and extract the first result value
query() {
  curl -sG "$PROM" --data-urlencode "query=$1" | jq -r '.data.result[0].value[1]'
}

# System availability
AVAILABILITY=$(query 'up')
echo "System Availability: ${AVAILABILITY}" >> "$REPORT"

# Response time
RESPONSE_TIME=$(query 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[24h]))')
echo "95th Percentile Response Time: ${RESPONSE_TIME}s" >> "$REPORT"

# Error rate (sum both sides so the division matches despite the status label)
ERROR_RATE=$(query 'sum(rate(http_requests_total{status=~"5.."}[24h])) / sum(rate(http_requests_total[24h]))')
echo "Error Rate: ${ERROR_RATE}" >> "$REPORT"

# MCP server health
echo "" >> "$REPORT"
echo "MCP Server Status:" >> "$REPORT"
for server in serena sequential context7 todolist; do
  STATUS=$(query "mcp_server_up{server=\"$server\"}")
  echo "  $server: $STATUS" >> "$REPORT"
done

# Send report
mail -s "Daily Health Check Report" engineering@company.com < "$REPORT"
```
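The `jq` extraction used throughout these scripts assumes the standard Prometheus query API response shape. A quick offline check of that extraction against a canned response:

```shell
#!/bin/sh
# Canned /api/v1/query response (shape per the Prometheus HTTP API)
RESPONSE='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"planner-mcp:8080"},"value":[1700000000,"1"]}]}}'

# Same extraction the health-check script uses: first result, sample value
VALUE=$(printf '%s' "$RESPONSE" | jq -r '.data.result[0].value[1]')
echo "up = $VALUE"
```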
### Weekly Performance Review
```bash
#!/bin/bash
# weekly-performance-review.sh
PROM="http://prometheus:9090/api/v1/query"
REPORT=/tmp/perf-report.txt

echo "=== Weekly Performance Review ===" > "$REPORT"
echo "Week ending: $(date)" >> "$REPORT"
echo "" >> "$REPORT"

query() {
  curl -sG "$PROM" --data-urlencode "query=$1" | jq -r '.data.result[0].value[1]'
}

# SLA compliance
echo "SLA Compliance (99.9% target):" >> "$REPORT"
UPTIME=$(query 'avg_over_time(up[7d])')
echo "  Uptime: $(echo "$UPTIME * 100" | bc)%" >> "$REPORT"

# Performance trends (previous week via offset inside the range selector)
echo "" >> "$REPORT"
echo "Performance Trends:" >> "$REPORT"
P95_CURRENT=$(query 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[7d]))')
P95_PREVIOUS=$(query 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[7d] offset 7d))')
echo "  P95 Response Time: ${P95_CURRENT}s (prev: ${P95_PREVIOUS}s)" >> "$REPORT"

# Business metrics
echo "" >> "$REPORT"
echo "Business Metrics:" >> "$REPORT"
SUCCESS_RATE=$(query 'sum(rate(planning_requests_successful_total[7d])) / sum(rate(planning_requests_total[7d]))')
echo "  Planning Success Rate: $(echo "$SUCCESS_RATE * 100" | bc)%" >> "$REPORT"

# Send report
mail -s "Weekly Performance Review" management@company.com < "$REPORT"
```
### Monthly Capacity Planning
```bash
#!/bin/bash
# monthly-capacity-planning.sh
PROM="http://prometheus:9090/api/v1/query"
REPORT=/tmp/capacity-report.txt

echo "=== Monthly Capacity Planning Report ===" > "$REPORT"
echo "Month: $(date +'%B %Y')" >> "$REPORT"
echo "" >> "$REPORT"

query() {
  curl -sG "$PROM" --data-urlencode "query=$1" | jq -r '.data.result[0].value[1]'
}

# Resource utilization trends (avg_over_time aggregates a gauge over the range)
echo "Resource Utilization Trends:" >> "$REPORT"
CPU_USAGE=$(query 'avg(rate(container_cpu_usage_seconds_total[30d]))')
MEMORY_USAGE=$(query 'avg(avg_over_time(container_memory_usage_bytes[30d])) / 1024 / 1024 / 1024')
echo "  Average CPU Usage: ${CPU_USAGE}" >> "$REPORT"
echo "  Average Memory Usage: ${MEMORY_USAGE}GB" >> "$REPORT"

# Growth projections
echo "" >> "$REPORT"
echo "Growth Projections:" >> "$REPORT"
REQUEST_RATE=$(query 'sum(rate(http_requests_total[30d]))')
echo "  Current Request Rate: ${REQUEST_RATE} req/s" >> "$REPORT"

# Recommendations
echo "" >> "$REPORT"
echo "Recommendations:" >> "$REPORT"
if (( $(echo "$CPU_USAGE > 0.7" | bc -l) )); then
  echo "  - Consider scaling CPU resources" >> "$REPORT"
fi
if (( $(echo "$MEMORY_USAGE > 8" | bc -l) )); then
  echo "  - Consider scaling memory resources" >> "$REPORT"
fi

# Send report
mail -s "Monthly Capacity Planning Report" capacity-planning@company.com < "$REPORT"
```
## Troubleshooting Guide
### High Response Time
#### Symptoms
- P95 response time > 2 seconds
- Users complaining about slow performance
- Timeout errors increasing
#### Investigation Steps
1. **Check System Load**
```bash
# Check current CPU usage
kubectl top pods -n planner-mcp
# Check memory usage
kubectl describe nodes | grep -A 5 "Allocated resources"
```
2. **Database Performance**
```bash
# Check slow queries
kubectl exec -n database postgres-0 -- psql -c "SELECT query, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10;"
# Check database connections
kubectl exec -n database postgres-0 -- psql -c "SELECT count(*) FROM pg_stat_activity;"
```
3. **MCP Server Performance**
```bash
# Check MCP server response times
curl -s 'http://prometheus:9090/api/v1/query?query=mcp_server_response_time'
# Test individual MCP servers
./scripts/test-mcp-performance.sh
```
#### Resolution Steps
1. **Immediate Actions**
```bash
# Scale up replicas
kubectl scale deployment/planner-mcp --replicas=5 -n planner-mcp
# Clear application cache
kubectl exec -n planner-mcp planner-mcp-xxx -- redis-cli FLUSHALL
```
2. **Database Optimization**
```bash
# Restart database connections
kubectl exec -n database postgres-0 -- psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction';"
# Update database statistics
kubectl exec -n database postgres-0 -- psql -c "ANALYZE;"
```
### High Error Rate
#### Symptoms
- 5xx error rate > 1%
- Failed requests increasing
- User-facing errors
#### Investigation Steps
1. **Check Error Logs**
```bash
# Recent error logs
kubectl logs -n planner-mcp -l app=planner-mcp --tail=100 | grep ERROR
# Error frequency analysis
kubectl logs -n planner-mcp -l app=planner-mcp --since=1h | grep ERROR | awk '{print $3}' | sort | uniq -c
```
2. **Health Check Status**
```bash
# Check health endpoints
curl -f https://planner-mcp.example.com/health
# Check individual service health
./scripts/health-check-all.sh
```
#### Resolution Steps
1. **Service Recovery**
```bash
# Delete failed pods (pod phase is not a label; use a field selector)
kubectl delete pod -n planner-mcp --field-selector=status.phase=Failed
# Rolling restart if needed
kubectl rollout restart deployment/planner-mcp -n planner-mcp
```
### MCP Server Unavailable
#### Symptoms
- MCP server connectivity alerts
- Fallback mode activated
- Reduced system capabilities
#### Investigation Steps
1. **Connectivity Test**
```bash
# Test MCP server connectivity
./scripts/test-mcp-connectivity.sh serena
# Check DNS resolution
nslookup serena-mcp.internal
```
2. **Network Analysis**
```bash
# Check network policies
kubectl get networkpolicies -n planner-mcp
# Trace network path
kubectl exec -n planner-mcp planner-mcp-xxx -- traceroute serena-mcp.internal
```
#### Resolution Steps
1. **MCP Server Recovery**
```bash
# Restart MCP server connection
./scripts/restart-mcp-connection.sh serena
# Force capability tier recalculation
./scripts/recalculate-capabilities.sh
```
## Performance Optimization
### Caching Strategy
#### Cache Hit Rate Monitoring
```bash
# Redis cache hit rate
kubectl exec -n planner-mcp redis-0 -- redis-cli info stats | grep keyspace
```
#### Cache Optimization
```bash
# Clear expired cache entries
kubectl exec -n planner-mcp redis-0 -- redis-cli EVAL "return redis.call('del', unpack(redis.call('keys', ARGV[1])))" 0 "*expired*"
# Optimize cache configuration
kubectl patch configmap planner-config -n planner-mcp --patch '{"data":{"cache_ttl":"3600","cache_size":"1GB"}}'
```
### Database Optimization
#### Index Analysis
```bash
# Check for unused indexes (low idx_scan counts)
kubectl exec -n database postgres-0 -- psql -c "SELECT schemaname, relname, indexrelname, idx_scan FROM pg_stat_user_indexes ORDER BY idx_scan ASC LIMIT 10;"
```
#### Query Optimization
```bash
# Identify slow queries
kubectl exec -n database postgres-0 -- psql -c "SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"
```
## Contact Information
### Monitoring Team
- **Primary**: @monitoring-team
- **Secondary**: @devops-team
- **Escalation**: @engineering-manager
### Tool Access
- **Grafana**: https://grafana.planner-mcp.com
- **Prometheus**: https://prometheus.planner-mcp.com
- **Kibana**: https://kibana.planner-mcp.com
- **PagerDuty**: https://company.pagerduty.com
---
**Last Updated**: {current_date}
**Version**: 1.0
**Owner**: Monitoring Team