# Service Guardian

**Enterprise-grade automatic service monitoring, recovery, and alerting system for Linux servers**

[![npm version](https://img.shields.io/npm/v/@timemacro/service-guardian.svg)](https://www.npmjs.com/package/@timemacro/service-guardian)
[![License: Proprietary](https://img.shields.io/badge/License-Proprietary-red.svg)](mailto:derrick@derricksiawor.com)
[![Node.js Version](https://img.shields.io/node/v/@timemacro/service-guardian.svg)](https://nodejs.org)

## 📦 Installation

**⚠️ IMPORTANT: This is a global CLI tool. Always install with the `-g` flag:**

```bash
npm install -g @timemacro/service-guardian
```

Or with sudo if needed:

```bash
sudo npm install -g @timemacro/service-guardian
```

Service Guardian is a production-ready Node.js daemon that monitors your Linux services, automatically recovers from failures, and sends intelligent alerts. Built for system administrators and DevOps teams who need reliable service uptime without manual intervention.

## The Problem It Solves

Ever had MySQL crash at 3 AM due to an OOM killer? Or Apache go down during peak traffic?

Service Guardian ensures your critical services stay running by:

- **Detecting failures instantly** - Not just checking if process exists, but verifying services actually work
- **Smart auto-recovery** - Distinguishes between crashes and manual stops, only restarts genuine failures
- **Intelligent alerting** - Batched, actionable alerts with system context, not spam

## Key Features

### 🛡️ Core Monitoring

- **Systemd Integration** - Deep integration with systemd for accurate service state detection
- **Intelligent Failure Analysis** - Differentiates between:
  - OOM (Out of Memory) kills
  - Service crashes
  - Manual stops (won't restart these)
  - Dependency failures
- **Parallel Monitoring** - Efficiently monitors multiple services simultaneously
- **Resource-Aware** - Monitors CPU, memory, disk usage before taking actions

### 🔄 Advanced Recovery

- **Smart Auto-Restart** - With exponential backoff to prevent restart loops
- **Dependency Management** - Handles service dependencies and circular dependencies
- **Recovery Actions** - Beyond just restart:
  - Clear system cache
  - Kill memory-intensive processes
  - Reload configurations
  - Clean zombie processes
  - Repair databases
- **Maintenance Windows** - Pause monitoring during planned maintenance

### 🏥 Health Checks

- **Beyond Process Monitoring** - Tests if services actually work:
  - TCP port checks (is MySQL accepting connections?)
  - HTTP endpoint checks (is API returning 200?)
  - Custom script checks (complex business logic)
  - Command checks (simple shell commands)
- **Failure Thresholds** - Only alerts after X consecutive failures (no false alarms)
- **User-Friendly Messages** - Clear explanations of what's wrong and how to fix it

### 📧 Intelligent Alerting

- **Beautiful HTML Emails** - Professional, readable alert emails with system context
- **Alert Aggregation** - Batches multiple alerts to reduce email spam
- **Rate Limiting** - Prevents alert storms during major incidents
- **Cooldown Periods** - Won't repeatedly alert for the same issue
- **Contextual Information** - Includes failure analysis, resource usage, recent logs

### 📊 Metrics & Reporting

- **Service Metrics** - Track uptime, restart counts, failure patterns
- **Resource Metrics** - Monitor CPU, memory, disk usage over time
- **Daily Aggregation** - Historical data for trend analysis
- **Health Reports** - Summary of all monitored services

### 🔒 Security

- **Command Injection Protection** - All inputs sanitized and validated
- **Whitelisted Commands** - Only approved system commands can be executed
- **Path Traversal Prevention** - Secure file operations
- **No Hardcoded Credentials** - Everything configurable via environment variables

## Installation

### Prerequisites

- Node.js >= 16.0.0
- Linux with systemd (Debian, Ubuntu, RHEL, etc.)
- Root or sudo access (for systemctl commands)

### Install via npm

```bash
# Install globally
npm install -g @timemacro/service-guardian

# Or with sudo if needed
sudo npm install -g @timemacro/service-guardian
```

### Install from source

```bash
# Clone from your private repository
# Contact derrick@derricksiawor.com for access
cd service-guardian
npm install
npm link
```

## Quick Start

### 1. Install and Check Version

```bash
# Install globally
npm install -g @timemacro/service-guardian

# Verify installation
sg --version
sg --help      # See all available commands
```

### 2. Configure Email Alerts (Optional but Recommended)

```bash
sg config email   # Interactive email setup
```

You'll be prompted for SMTP settings:

- SMTP Host (e.g., smtp.gmail.com)
- SMTP Port (e.g., 587)
- Username
- Password
- From address
- To address

### 3. Add Services to Monitor

```bash
# Add a service (auto-restart and alerts are enabled by default)
sg add mysql

# Add multiple services
sg add nginx
sg add postgresql
sg add redis

# Add with custom settings
sg add apache2 --max-restarts 10

# List all monitored services
sg list
```

### 4. Monitor Your Services

```bash
# The daemon auto-starts when you add services
sg status            # Check daemon and all services status

# View logs
sg logs              # Recent logs
sg logs --follow     # Live logs (like tail -f)
sg logs --tail 100   # Last 100 lines

# Manual operations
sg check mysql       # Check specific service
sg restart           # Restart the daemon
sg test              # Test all services
```

## Usage

### Command Reference

Service Guardian can be invoked using either `service-guardian` or `sg` (shorthand). We recommend using `sg` for convenience.

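Because the CLI is available under two names, a wrapper script may want to resolve whichever one is actually on `PATH` before calling it. The following is a minimal sketch under that assumption; the function name `find_guardian_cli` is hypothetical and is not part of Service Guardian.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Service Guardian): print the first
# candidate command that exists on PATH. With no arguments it tries the
# two names documented in this README: sg, then service-guardian.
find_guardian_cli() {
  local candidate
  if [ "$#" -eq 0 ]; then
    set -- sg service-guardian
  fi
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # Nothing found: signal failure so callers can stop early.
  return 1
}
```

A script could then run `cli="$(find_guardian_cli)" && "$cli" status`, and skip the call entirely when neither name is installed.
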
#### Quick Information Commands

```bash
# Get started quickly
sg              # Show help and available commands
sg --help       # Show detailed help
sg --version    # Show version

# View current state
sg status       # Show daemon status and all monitored services
sg list         # List all monitored services
sg info         # Show system information and configuration
```

#### Core Commands

```bash
# Daemon Control (auto-starts if not running)
sg start                     # Start monitoring daemon (auto-starts on first command)
sg stop                      # Stop monitoring daemon
sg restart                   # Restart daemon
sg status                    # Show daemon and services status

# Service Management
sg add <service> [options]   # Add service to monitoring
sg remove <service>          # Remove service from monitoring
sg list                      # List all monitored services
sg enable <service>          # Enable monitoring for service
sg disable <service>         # Disable monitoring for service

# Monitoring & Logs
sg logs                      # View recent daemon logs
sg logs --follow             # View logs in real-time (like tail -f)
sg logs --tail 50            # View last 50 log lines
sg check <service>           # Manually check service status
sg test                      # Test monitoring all services
```

#### Advanced Features

```bash
# Health Checks
sg health add <service> [options]    # Add health check
sg health list                       # List all health checks
sg health remove <service>           # Remove health check
sg health test <service>             # Test health check

# Dependencies
sg deps add <service> <deps...>      # Add service dependencies
sg deps remove <service> <deps...>   # Remove dependencies
sg deps list [service]               # List dependencies
sg deps check                        # Check for circular dependencies

# Maintenance Windows
sg maintenance add [options]         # Schedule maintenance
sg maintenance list                  # List maintenance windows
sg maintenance remove <name>         # Remove maintenance window

# Groups & Tags
sg group create <name>               # Create service group
sg group add <group> <services...>   # Add services to group
sg group list                        # List all groups
sg tag add <service> <tags...>       # Add tags to service
sg tag list [service]                # List tags

# Metrics & Reports
sg metrics [service] [options]       # View service metrics
sg report [options]                  # Generate health report

# Configuration
sg config email                      # Configure email settings
sg config show                       # Show configuration
sg config set <key> <value>          # Set config value
sg export [file]                     # Export configuration
sg import <file>                     # Import configuration
```

### Configuration Options

Configuration is stored in `/etc/service-guardian/config.json` (or `~/.service-guardian/config.json` for non-root users).

```javascript
{
  // Monitoring
  "CHECK_INTERVAL": 30,              // Seconds between checks
  "HEALTH_CHECK_INTERVAL": 60,       // Seconds between health checks

  // Restart Settings
  "MAX_RESTARTS": 5,                 // Max restart attempts
  "RESTART_DELAY": 10,               // Initial delay (seconds)
  "RESTART_BACKOFF_MULTIPLIER": 2,   // Exponential backoff
  "MAX_RESTART_DELAY": 300,          // Max delay (seconds)

  // Alerts
  "ALERT_COOLDOWN": 600,             // Seconds between alerts
  "ALERT_BATCH_INTERVAL": 60,        // Batch window (seconds)
  "MAX_ALERTS_PER_HOUR": 10,         // Rate limiting

  // Email Settings (set via sg config email)
  "SMTP_HOST": "smtp.gmail.com",
  "SMTP_PORT": 587,
  "SMTP_USER": "your-email@gmail.com",
  "SMTP_PASS": "your-app-password",
  "EMAIL_FROM": "alerts@yourserver.com",
  "EMAIL_TO": "admin@yourcompany.com"
}
```

## How It Works

### 1. Service Monitoring Flow

```
┌─────────────────┐
│  Cron Scheduler │  Every 30 seconds
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Check Services │  Parallel checks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Analyze Status │  Is service healthy?
└────────┬────────┘
         │
    ┌────┴────┐
    │ Healthy │ Not Healthy
    └────┬────┘
         │
         ▼
┌─────────────────┐
│ Failure Analysis│  Why did it fail?
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Recovery Actions│  Try to fix
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Auto-Restart?  │  If enabled
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Send Alert?    │  If enabled & not in cooldown
└─────────────────┘
```

### 2. Failure Detection

Service Guardian performs intelligent failure analysis:

```javascript
// Illustrative fragment: the helper functions are not defined here.
// Not just "is process running?"
async function handleService(service, memory) {
  if (!service.isActive) {
    // Analyze WHY it's not running
    const analysis = await analyzeFailure(service);

    if (analysis.type === 'MANUAL_STOP') {
      // User stopped it, don't restart
      return;
    }

    if (analysis.type === 'OOM_KILL') {
      // Killed by OOM, check memory before restart
      if (memory.usage > 90) { // usage as a percentage
        // Clean up memory first
        await clearSystemCache();
      }
    }

    // Smart restart with backoff
    await attemptRestart(service);
  }
}
```

### 3. Health Checks

Beyond process monitoring, health checks verify services actually work:

```javascript
// TCP Health Check Example
const mysql_health = {
  type: 'tcp',
  host: 'localhost',
  port: 3306,
  timeout: 10,
  interval: 60
};

// Results in user-friendly messages:
// ✅ "mysql is responding on localhost:3306"
// ❌ "mysql is not accepting connections on localhost:3306.
//    The service may be down or not listening on this port.
//    Suggestion: Verify mysql is running with: systemctl status mysql"
```

### 4. Alert Aggregation

Intelligent batching reduces email spam:

```javascript
// Instead of 10 emails in 1 minute:
// "nginx failed"
// "mysql failed"
// "redis failed"
// ...

// You get 1 comprehensive email:
// "3 services need attention:
//  - nginx: Connection refused on port 80
//  - mysql: OOM killed (memory: 95%)
//  - redis: Dependency postgres is down"
```

## Real-World Examples

### Example 1: MySQL OOM Protection

```bash
# Add MySQL with OOM recovery (auto-restart and alerts enabled by default)
sg add mysql --max-restarts 5

# Add health check to verify it's accepting connections
sg health add mysql --type tcp --port 3306

# Add recovery action to clear cache when memory is high
sg recovery add mysql --type clear-cache --threshold 90
```

When MySQL gets OOM-killed:

1. Service Guardian detects the OOM kill (not just "service down")
2. Checks system memory usage
3. If memory > 90%, clears system cache first
4. Restarts MySQL with exponential backoff
5. Verifies it's accepting connections
6. Sends detailed alert with memory stats and suggestions

### Example 2: Dependent Services

```bash
# Setup WordPress stack with dependencies
sg add nginx
sg add php-fpm
sg add mysql

# Define dependencies
sg deps add nginx php-fpm
sg deps add php-fpm mysql

# If MySQL fails, Service Guardian will:
# 1. Restart MySQL first
# 2. Then restart php-fpm (depends on MySQL)
# 3. Then restart nginx (depends on php-fpm)
```

### Example 3: Maintenance Windows

```bash
# Schedule maintenance window for updates
sg maintenance add "Weekly Updates" \
  --days sunday \
  --start 02:00 \
  --duration 2 \
  --services nginx,mysql,redis

# During maintenance:
# - No auto-restarts
# - No alerts
# - Services can be safely updated
```

### Example 4: Custom Health Checks

```bash
# Create custom health check script
cat > /etc/service-guardian/health-checks/api-check.sh << 'EOF'
#!/bin/bash
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/api/health)
if [ "$RESPONSE" = "200" ]; then
  echo "API is healthy"
  exit 0
else
  echo "API returned status code: $RESPONSE"
  exit 1
fi
EOF

chmod +x /etc/service-guardian/health-checks/api-check.sh

# Add the health check
sg health add api --type script --script api-check.sh
```

## Architecture

### Security Features

1. **Input Validation** - All inputs validated with JSON schemas
2. **Command Whitelisting** - Only approved system commands
3. **Shell Escape** - Prevents command injection
4. **Path Validation** - Prevents directory traversal
5. **Secure Execution** - Isolated command execution

### Performance

- **Parallel Monitoring** - Check multiple services simultaneously
- **Efficient Resource Usage** - Minimal CPU and memory footprint
- **Optimized Queries** - Batch operations where possible
- **Caching** - Reduces repeated system calls

### Reliability

- **Crash Recovery** - Daemon automatically recovers from crashes
- **Data Persistence** - Configuration and metrics survive restarts
- **Atomic Operations** - Prevents partial updates
- **Graceful Shutdown** - Cleanly stops all operations

## Troubleshooting

### Service Guardian won't start

```bash
# Check if already running
sg status

# Check logs for errors
sg logs --tail 50

# Verify Node.js version
node --version   # Should be >= 16.0.0

# Check permissions
ls -la /etc/service-guardian/
```

### Services not being monitored

```bash
# Verify service is added
sg list

# Check if service exists
systemctl status <service-name>

# Test monitoring manually
sg check <service-name>

# Check dependencies
sg deps check
```

### Not receiving alerts

```bash
# Test email configuration
sg config email --test

# Check alert settings
sg config show | grep ALERT

# View recent alerts
sg logs | grep "Alert sent"

# Check cooldown status
sg status --verbose
```

### High memory usage

```bash
# Check metrics history
sg metrics --days 7

# Clear old metrics
sg metrics --cleanup

# Reduce check frequency
sg config set CHECK_INTERVAL 60
```

## Development

### Running Tests

```bash
npm test                # Run all tests
npm run test:watch      # Watch mode
npm run test:coverage   # Coverage report
```

### Contributing

For contributions, please contact derrick@derricksiawor.com.

## License

**PROPRIETARY SOFTWARE - Copyright (c) 2025 Derrick S. K. Siawor. All Rights Reserved.**

This software is proprietary and confidential.

### Permitted Use:

- ✅ Personal use for monitoring your own services
- ✅ Internal business use within your organization
- ✅ Evaluation and testing purposes

### Restrictions:

- ❌ No copying, modifying, or creating derivative works
- ❌ No distribution, selling, or sublicensing to third parties
- ❌ No reverse engineering or decompiling
- ❌ No commercial use without written permission
- ❌ No public sharing or repository hosting
- ❌ No use for creating competing products

### Commercial Licensing:

For commercial licenses, enterprise support, or custom implementations, contact: **derrick@derricksiawor.com**

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND. See LICENSE file for full terms.

## Author

**Derrick S. K. Siawor**

Website: [https://derricksiawor.com](https://derricksiawor.com)

## Support

- **Email Support**: derrick@derricksiawor.com (for issues, feature requests, and general support)
- **npm Package**: [npmjs.com/package/@timemacro/service-guardian](https://www.npmjs.com/package/@timemacro/service-guardian)

## Acknowledgments

Built with enterprise-grade libraries:

- Commander.js - CLI interface
- Nodemailer - Email alerts
- node-cron - Scheduling
- Winston - Logging
- Chalk - Terminal styling

---

**Stop losing sleep over crashed services. Let Service Guardian keep watch.**
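
For reference, the restart settings listed under Configuration Options (`MAX_RESTARTS`, `RESTART_DELAY`, `RESTART_BACKOFF_MULTIPLIER`, `MAX_RESTART_DELAY`) suggest a delay schedule along the following lines. This is only a sketch: it assumes the delay is multiplied after every attempt and capped at the maximum, which this README does not spell out, and the function name `restart_delays` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch (not Service Guardian's actual code): print the wait
# before each restart attempt, ASSUMING the delay starts at RESTART_DELAY,
# is multiplied by RESTART_BACKOFF_MULTIPLIER after every attempt, and is
# capped at MAX_RESTART_DELAY.
# Arguments: max_restarts, initial_delay, multiplier, max_delay (seconds).
restart_delays() {
  local max_restarts="${1:-5}" delay="${2:-10}" multiplier="${3:-2}" max_delay="${4:-300}"
  local attempt
  for (( attempt = 1; attempt <= max_restarts; attempt++ )); do
    printf 'attempt %d: wait %ds\n' "$attempt" "$delay"
    delay=$(( delay * multiplier ))
    if (( delay > max_delay )); then
      delay=$max_delay
    fi
  done
}
```

Under these assumptions, `restart_delays 5 10 2 300` prints waits of 10, 20, 40, 80, and 160 seconds, so with the documented defaults the 300-second cap only comes into play when more than five restarts are allowed.
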