UNPKG

aiwg

Version:

Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.

361 lines (285 loc) 9.08 kB
--- description: Generate troubleshooting documentation category: documentation-changelogs argument-hint: 1. **System Overview and Architecture** [--interactive] [--guidance "text"] --- # Troubleshooting Guide Generator Command Generate troubleshooting documentation ## Instructions Follow this systematic approach to create troubleshooting guides: **$ARGUMENTS** 1. **System Overview and Architecture** - Document the system architecture and components - Map out dependencies and integrations - Identify critical paths and failure points - Create system topology diagrams - Document data flow and communication patterns 2. **Common Issues Identification** - Collect historical support tickets and issues - Interview team members about frequent problems - Analyze error logs and monitoring data - Review user feedback and complaints - Identify patterns in system failures 3. **Troubleshooting Framework** - Establish systematic diagnostic procedures - Create problem isolation methodologies - Document escalation paths and procedures - Set up logging and monitoring checkpoints - Define severity levels and response times 4. **Diagnostic Tools and Commands** ```markdown ## Essential Diagnostic Commands ### System Health ```bash # Check system resources top # CPU and memory usage df -h # Disk space free -m # Memory usage netstat -tuln # Network connections # Application logs tail -f /var/log/app.log journalctl -u service-name -f # Database connectivity mysql -u user -p -e "SELECT 1" psql -h host -U user -d db -c "SELECT 1" ``` ``` 5. **Issue Categories and Solutions** **Performance Issues:** ```markdown ### Slow Response Times **Symptoms:** - API responses > 5 seconds - User interface freezing - Database timeouts **Diagnostic Steps:** 1. Check system resources (CPU, memory, disk) 2. Review application logs for errors 3. Analyze database query performance 4. Check network connectivity and latency **Common Causes:** - Database connection pool exhaustion - Inefficient database queries - Memory leaks in application - Network bandwidth limitations **Solutions:** - Restart application services - Optimize database queries - Increase connection pool size - Scale infrastructure resources ``` 6. **Error Code Documentation** ```markdown ## Error Code Reference ### HTTP Status Codes - **500 Internal Server Error** - Check application logs for stack traces - Verify database connectivity - Check environment variables - **404 Not Found** - Verify URL routing configuration - Check if resources exist - Review API endpoint documentation - **503 Service Unavailable** - Check service health status - Verify load balancer configuration - Check for maintenance mode ``` 7. **Environment-Specific Issues** - Document development environment problems - Address staging/testing environment issues - Cover production-specific troubleshooting - Include local development setup problems 8. **Database Troubleshooting** ```markdown ### Database Connection Issues **Symptoms:** - "Connection refused" errors - "Too many connections" errors - Slow query performance **Diagnostic Commands:** ```sql -- Check active connections SHOW PROCESSLIST; -- Check database size SELECT table_schema, ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS 'DB Size in MB' FROM information_schema.tables GROUP BY table_schema; -- Check slow queries SHOW VARIABLES LIKE 'slow_query_log'; ``` ``` 9. **Network and Connectivity Issues** ```markdown ### Network Troubleshooting **Basic Connectivity:** ```bash # Test basic connectivity ping example.com telnet host port curl -v https://api.example.com/health # DNS resolution nslookup example.com dig example.com # Network routing traceroute example.com ``` **SSL/TLS Issues:** ```bash # Check SSL certificate openssl s_client -connect example.com:443 curl -vI https://example.com ``` ``` 10. **Application-Specific Troubleshooting** **Memory Issues:** ```markdown ### Out of Memory Errors **Java Applications:** ```bash # Check heap usage jstat -gc [PID] jmap -dump:format=b,file=heapdump.hprof [PID] # Analyze heap dump jhat heapdump.hprof ``` **Node.js Applications:** ```bash # Monitor memory usage node --inspect app.js # Use Chrome DevTools for memory profiling ``` ``` 11. **Security and Authentication Issues** ```markdown ### Authentication Failures **Symptoms:** - 401 Unauthorized responses - Token validation errors - Session timeout issues **Diagnostic Steps:** 1. Verify credentials and tokens 2. Check token expiration 3. Validate authentication service 4. Review CORS configuration **Common Solutions:** - Refresh authentication tokens - Clear browser cookies/cache - Verify CORS headers - Check API key permissions ``` 12. **Deployment and Configuration Issues** ```markdown ### Deployment Failures **Container Issues:** ```bash # Check container status docker ps -a docker logs container-name # Check resource limits docker stats # Debug container docker exec -it container-name /bin/bash ``` **Kubernetes Issues:** ```bash # Check pod status kubectl get pods kubectl describe pod pod-name kubectl logs pod-name # Check service connectivity kubectl get svc kubectl port-forward pod-name 8080:8080 ``` ``` 13. **Monitoring and Alerting Setup** - Configure health checks and monitoring - Set up log aggregation and analysis - Implement alerting for critical issues - Create dashboards for system metrics - Document monitoring thresholds 14. **Escalation Procedures** ```markdown ## Escalation Matrix ### Severity Levels **Critical (P1):** System down, data loss - Immediate response required - Escalate to on-call engineer - Notify management within 30 minutes **High (P2):** Major functionality impaired - Response within 2 hours - Escalate to senior engineer - Provide hourly updates **Medium (P3):** Minor functionality issues - Response within 8 hours - Assign to appropriate team member - Provide daily updates ``` 15. **Recovery Procedures** - Document system recovery steps - Create data backup and restore procedures - Establish rollback procedures for deployments - Document disaster recovery processes - Test recovery procedures regularly 16. **Preventive Measures** - Implement monitoring and alerting - Set up automated health checks - Create deployment validation procedures - Establish code review processes - Document maintenance procedures 17. **Knowledge Base Integration** - Link to relevant documentation - Reference API documentation - Include links to monitoring dashboards - Connect to team communication channels - Integrate with ticketing systems 18. **Team Communication** ```markdown ## Communication Channels ### Immediate Response - Slack: #incidents channel - Phone: On-call rotation - Email: alerts@company.com ### Status Updates - Status page: status.company.com - Twitter: @company_status - Internal wiki: troubleshooting section ``` 19. **Documentation Maintenance** - Regular review and updates - Version control for troubleshooting guides - Feedback collection from users - Integration with incident post-mortems - Continuous improvement processes 20. **Self-Service Tools** - Create diagnostic scripts and tools - Build automated recovery procedures - Implement self-healing systems - Provide user-friendly diagnostic interfaces - Create chatbot integration for common issues **Advanced Troubleshooting Techniques:** **Log Analysis:** ```bash # Search for specific errors grep -i "error" /var/log/app.log | tail -50 # Analyze log patterns awk '{print $1}' access.log | sort | uniq -c | sort -nr # Monitor logs in real-time tail -f /var/log/app.log | grep -i "exception" ``` **Performance Profiling:** ```bash # System performance iostat -x 1 sar -u 1 10 vmstat 1 10 # Application profiling strace -p [PID] perf record -p [PID] ``` Remember to: - Keep troubleshooting guides up-to-date - Test all documented procedures regularly - Collect feedback from users and improve guides - Include screenshots and visual aids where helpful - Make guides searchable and well-organized