# Service Guardian
**Enterprise-grade automatic service monitoring, recovery, and alerting system for Linux servers**
## 📦 Installation
**⚠️ IMPORTANT: This is a global CLI tool. Always install with the `-g` flag:**
```bash
npm install -g @timemacro/service-guardian
```
Or with sudo if needed:
```bash
sudo npm install -g @timemacro/service-guardian
```
Service Guardian is a production-ready Node.js daemon that monitors your Linux services, automatically recovers from failures, and sends intelligent alerts. Built for system administrators and DevOps teams who need reliable service uptime without manual intervention.
## The Problem It Solves
Ever had MySQL crash at 3 AM because the OOM killer terminated it? Or had Apache go down during peak traffic? Service Guardian keeps your critical services running by:
- **Detecting failures instantly** - Not just checking if process exists, but verifying services actually work
- **Smart auto-recovery** - Distinguishes between crashes and manual stops, only restarts genuine failures
- **Intelligent alerting** - Batched, actionable alerts with system context, not spam
## Key Features
### 🛡️ Core Monitoring
- **Systemd Integration** - Deep integration with systemd for accurate service state detection
- **Intelligent Failure Analysis** - Differentiates between:
  - OOM (Out of Memory) kills
  - Service crashes
  - Manual stops (won't restart these)
  - Dependency failures
- **Parallel Monitoring** - Efficiently monitors multiple services simultaneously
- **Resource-Aware** - Monitors CPU, memory, disk usage before taking actions
### 🔄 Advanced Recovery
- **Smart Auto-Restart** - With exponential backoff to prevent restart loops
- **Dependency Management** - Handles service dependencies and detects circular dependencies
- **Recovery Actions** - Beyond just restart:
  - Clear system cache
  - Kill memory-intensive processes
  - Reload configurations
  - Clean zombie processes
  - Repair databases
- **Maintenance Windows** - Pause monitoring during planned maintenance
### 🏥 Health Checks
- **Beyond Process Monitoring** - Tests if services actually work:
  - TCP port checks (is MySQL accepting connections?)
  - HTTP endpoint checks (is the API returning 200?)
  - Custom script checks (complex business logic)
  - Command checks (simple shell commands)
- **Failure Thresholds** - Only alerts after a configurable number of consecutive failures, which cuts down on false alarms
- **User-Friendly Messages** - Clear explanations of what's wrong and how to fix it
### 📧 Intelligent Alerting
- **Beautiful HTML Emails** - Professional, readable alert emails with system context
- **Alert Aggregation** - Batches multiple alerts to reduce email spam
- **Rate Limiting** - Prevents alert storms during major incidents
- **Cooldown Periods** - Won't repeatedly alert for the same issue
- **Contextual Information** - Includes failure analysis, resource usage, recent logs
### 📊 Metrics & Reporting
- **Service Metrics** - Track uptime, restart counts, failure patterns
- **Resource Metrics** - Monitor CPU, memory, disk usage over time
- **Daily Aggregation** - Historical data for trend analysis
- **Health Reports** - Summary of all monitored services
### 🔒 Security
- **Command Injection Protection** - All inputs sanitized and validated
- **Whitelisted Commands** - Only approved system commands can be executed
- **Path Traversal Prevention** - Secure file operations
- **No Hardcoded Credentials** - Everything configurable via environment variables
## Installation
### Prerequisites
- Node.js >= 16.0.0
- Linux with systemd (Debian, Ubuntu, RHEL, etc.)
- Root or sudo access (for systemctl commands)
### Install via npm
```bash
# Install globally
npm install -g @timemacro/service-guardian
# Or with sudo if needed
sudo npm install -g @timemacro/service-guardian
```
### Install from source
```bash
# Clone from your private repository
# Contact derrick@derricksiawor.com for access
cd service-guardian
npm install
npm link
```
## Quick Start
### 1. Install and Check Version
```bash
# Install globally
npm install -g @timemacro/service-guardian
# Verify installation
sg --version
sg --help # See all available commands
```
### 2. Configure Email Alerts (Optional but Recommended)
```bash
sg config email # Interactive email setup
```
You'll be prompted for SMTP settings:
- SMTP Host (e.g., smtp.gmail.com)
- SMTP Port (e.g., 587)
- Username
- Password
- From address
- To address
### 3. Add Services to Monitor
```bash
# Add a service (auto-restart and alerts are enabled by default)
sg add mysql
# Add multiple services
sg add nginx
sg add postgresql
sg add redis
# Add with custom settings
sg add apache2 --max-restarts 10
# List all monitored services
sg list
```
### 4. Monitor Your Services
```bash
# The daemon auto-starts when you add services
sg status # Check daemon and all services status
# View logs
sg logs # Recent logs
sg logs --follow # Live logs (like tail -f)
sg logs --tail 100 # Last 100 lines
# Manual operations
sg check mysql # Check specific service
sg restart # Restart the daemon
sg test # Test all services
```
## Usage
### Command Reference
Service Guardian can be invoked using either `service-guardian` or `sg` (shorthand). We recommend using `sg` for convenience.
#### Quick Information Commands
```bash
# Get started quickly
sg # Show help and available commands
sg --help # Show detailed help
sg --version # Show version
# View current state
sg status # Show daemon status and all monitored services
sg list # List all monitored services
sg info # Show system information and configuration
```
#### Core Commands
```bash
# Daemon Control (auto-starts if not running)
sg start # Start monitoring daemon (auto-starts on first command)
sg stop # Stop monitoring daemon
sg restart # Restart daemon
sg status # Show daemon and services status
# Service Management
sg add <service> [options] # Add service to monitoring
sg remove <service> # Remove service from monitoring
sg list # List all monitored services
sg enable <service> # Enable monitoring for service
sg disable <service> # Disable monitoring for service
# Monitoring & Logs
sg logs # View recent daemon logs
sg logs --follow # View logs in real-time (like tail -f)
sg logs --tail 50 # View last 50 log lines
sg check <service> # Manually check service status
sg test # Test monitoring all services
```
#### Advanced Features
```bash
# Health Checks
sg health add <service> [options] # Add health check
sg health list # List all health checks
sg health remove <service> # Remove health check
sg health test <service> # Test health check
# Dependencies
sg deps add <service> <deps...> # Add service dependencies
sg deps remove <service> <deps...> # Remove dependencies
sg deps list [service] # List dependencies
sg deps check # Check for circular dependencies
# Maintenance Windows
sg maintenance add [options] # Schedule maintenance
sg maintenance list # List maintenance windows
sg maintenance remove <name> # Remove maintenance window
# Groups & Tags
sg group create <name> # Create service group
sg group add <group> <services...> # Add services to group
sg group list # List all groups
sg tag add <service> <tags...> # Add tags to service
sg tag list [service] # List tags
# Metrics & Reports
sg metrics [service] [options] # View service metrics
sg report [options] # Generate health report
# Configuration
sg config email # Configure email settings
sg config show # Show configuration
sg config set <key> <value> # Set config value
sg export [file] # Export configuration
sg import <file> # Import configuration
```
### Configuration Options
Configuration is stored in `/etc/service-guardian/config.json` (or `~/.service-guardian/config.json` for non-root users).
```javascript
{
  // Monitoring
  "CHECK_INTERVAL": 30,               // Seconds between checks
  "HEALTH_CHECK_INTERVAL": 60,        // Seconds between health checks

  // Restart Settings
  "MAX_RESTARTS": 5,                  // Max restart attempts
  "RESTART_DELAY": 10,                // Initial delay (seconds)
  "RESTART_BACKOFF_MULTIPLIER": 2,    // Exponential backoff
  "MAX_RESTART_DELAY": 300,           // Max delay (seconds)

  // Alerts
  "ALERT_COOLDOWN": 600,              // Seconds between alerts
  "ALERT_BATCH_INTERVAL": 60,         // Batch window (seconds)
  "MAX_ALERTS_PER_HOUR": 10,          // Rate limiting

  // Email Settings (set via sg config email)
  "SMTP_HOST": "smtp.gmail.com",
  "SMTP_PORT": 587,
  "SMTP_USER": "your-email@gmail.com",
  "SMTP_PASS": "your-app-password",
  "EMAIL_FROM": "alerts@yourserver.com",
  "EMAIL_TO": "admin@yourcompany.com"
}
```
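To make the restart settings concrete, here is a minimal sketch of how a capped exponential backoff could be derived from them. The key names come from the configuration above; the formula and the `restartDelay` and `restartSchedule` helpers are assumptions for illustration, not Service Guardian's confirmed implementation.

```javascript
// Sketch only: key names match the config above, but the formula is an
// assumed, conventional capped exponential backoff.
const config = {
  RESTART_DELAY: 10,             // initial delay (seconds)
  RESTART_BACKOFF_MULTIPLIER: 2, // growth factor per failed attempt
  MAX_RESTART_DELAY: 300,        // upper bound (seconds)
  MAX_RESTARTS: 5,               // attempts before giving up
};

// attempt is 0-based: 0 is the first restart attempt.
function restartDelay(attempt, cfg = config) {
  const raw = cfg.RESTART_DELAY * Math.pow(cfg.RESTART_BACKOFF_MULTIPLIER, attempt);
  return Math.min(raw, cfg.MAX_RESTART_DELAY);
}

// Delay (in seconds) before each allowed restart attempt.
function restartSchedule(cfg = config) {
  return Array.from({ length: cfg.MAX_RESTARTS }, (_, i) => restartDelay(i, cfg));
}

console.log(restartSchedule()); // [ 10, 20, 40, 80, 160 ]
```

With the defaults shown, the cap of 300 seconds would only come into play if `MAX_RESTARTS` were raised to 6 or more.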
## How It Works
### 1. Service Monitoring Flow
```
┌─────────────────┐
│ Cron Scheduler  │ Every 30 seconds
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Check Services  │ Parallel checks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Analyze Status  │ Is service healthy?
└────────┬────────┘
         │
    ┌────┴────┐
    │ Healthy │ Not Healthy
    └────┬────┘
         │
         ▼
┌─────────────────┐
│ Failure Analysis│ Why did it fail?
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Recovery Actions│ Try to fix
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Auto-Restart?   │ If enabled
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Send Alert?     │ If enabled & not in cooldown
└─────────────────┘
```
### 2. Failure Detection
Service Guardian performs intelligent failure analysis:
```javascript
// Illustrative flow: analyzeFailure, clearSystemCache and attemptRestart
// stand in for Service Guardian's internal helpers.
async function handleService(service, memory) {
  // Not just "is process running?"
  if (service.isActive) return;

  // Analyze WHY it's not running
  const analysis = await analyzeFailure(service);

  if (analysis.type === 'MANUAL_STOP') {
    // User stopped it, don't restart
    return;
  }

  if (analysis.type === 'OOM_KILL' && memory.usage > 90) {
    // Killed by OOM and memory usage is above 90%: clean up memory first
    await clearSystemCache();
  }

  // Smart restart with backoff
  await attemptRestart(service);
}
```
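One plausible way to tell these failure types apart is to read the unit's state from `systemctl show <service> --property=ActiveState,Result`, which prints `Key=Value` lines. The sketch below is an assumption about how such output could be mapped to the types above, not the tool's confirmed logic; `parseShowOutput`, `classifyFailure` and the `CRASH` and `HEALTHY` labels are hypothetical names.

```javascript
// Parse "Key=Value" lines as printed by `systemctl show <unit> --property=...`.
function parseShowOutput(text) {
  const props = {};
  for (const line of text.split('\n')) {
    const idx = line.indexOf('=');
    if (idx > 0) props[line.slice(0, idx)] = line.slice(idx + 1).trim();
  }
  return props;
}

// Assumed mapping from systemd state to failure types.
// Caveat: a unit that exits cleanly on its own also reports
// ActiveState=inactive with Result=success, so MANUAL_STOP is a heuristic.
function classifyFailure(props) {
  if (props.ActiveState === 'active') return 'HEALTHY';
  if (props.Result === 'oom-kill') return 'OOM_KILL';
  if (props.ActiveState === 'inactive' && props.Result === 'success') return 'MANUAL_STOP';
  return 'CRASH';
}

const sample = 'ActiveState=failed\nResult=oom-kill\n';
console.log(classifyFailure(parseShowOutput(sample))); // OOM_KILL
```

Keeping the parsing and the classification as separate pure functions makes the decision logic easy to test without a running systemd.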
### 3. Health Checks
Beyond process monitoring, health checks verify services actually work:
```javascript
// TCP Health Check Example
const mysql_health = {
type: 'tcp',
host: 'localhost',
port: 3306,
timeout: 10,
interval: 60
};
// Results in user-friendly messages:
// ✅ "mysql is responding on localhost:3306"
// ❌ "mysql is not accepting connections on localhost:3306.
//     The service may be down or not listening on this port.
//     Suggestion: Verify mysql is running with: systemctl status mysql"
```
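For readers curious what a TCP check involves, the following sketch probes a port with Node's built-in `net` module. It assumes `timeout` is given in seconds, as in the example above; `checkTcp` and its result shape are hypothetical and only illustrate the technique.

```javascript
const net = require('net');

// Try to open a TCP connection. Resolves to { healthy, message }; never rejects.
function checkTcp({ host, port, timeout }) {
  return new Promise((resolve) => {
    const socket = net.createConnection({ host, port });
    let settled = false;
    const finish = (healthy, message) => {
      if (settled) return;
      settled = true;
      socket.destroy(); // release the socket whatever the outcome
      resolve({ healthy, message });
    };
    socket.setTimeout(timeout * 1000); // assumption: timeout is in seconds
    socket.once('connect', () => finish(true, `responding on ${host}:${port}`));
    socket.once('timeout', () => finish(false, `timed out after ${timeout}s on ${host}:${port}`));
    socket.once('error', (err) =>
      finish(false, `not accepting connections on ${host}:${port} (${err.code})`));
  });
}

// Usage:
// checkTcp({ host: 'localhost', port: 3306, timeout: 10 }).then(console.log);
```

A successful connect only shows that something is listening on the port, which is why HTTP and script checks exist for deeper verification.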
### 4. Alert Aggregation
Intelligent batching reduces email spam:
```javascript
// Instead of 10 emails in 1 minute:
// "nginx failed"
// "mysql failed"
// "redis failed"
// ...
// You get 1 comprehensive email:
// "3 services need attention:
// - nginx: Connection refused on port 80
// - mysql: OOM killed (memory: 95%)
// - redis: Dependency postgres is down"
```
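The batching idea can be sketched as a function that collapses everything collected during one batch window into a single digest. The alert shape (`service`, `reason`) and the `buildDigest` name are assumptions for illustration; the real alert format is not documented here.

```javascript
// Collapse a batch of alerts into one digest, keeping the latest reason per service.
function buildDigest(alerts) {
  const latest = new Map(); // service -> most recent reason
  for (const alert of alerts) latest.set(alert.service, alert.reason);
  if (latest.size === 0) return null; // nothing to send

  const noun = latest.size === 1 ? 'service needs' : 'services need';
  const lines = [...latest].map(([service, reason]) => `- ${service}: ${reason}`);
  return [`${latest.size} ${noun} attention:`, ...lines].join('\n');
}

const batch = [
  { service: 'nginx', reason: 'Connection refused on port 80' },
  { service: 'mysql', reason: 'OOM killed (memory: 95%)' },
  { service: 'nginx', reason: 'Connection refused on port 80' }, // duplicate in same window
];
console.log(buildDigest(batch));
// 2 services need attention:
// - nginx: Connection refused on port 80
// - mysql: OOM killed (memory: 95%)
```

In a real daemon this would run once per `ALERT_BATCH_INTERVAL`, with cooldown and hourly rate limits applied before the email is sent.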
## Real-World Examples
### Example 1: MySQL OOM Protection
```bash
# Add MySQL with OOM recovery (auto-restart and alerts enabled by default)
sg add mysql --max-restarts 5
# Add health check to verify it's accepting connections
sg health add mysql --type tcp --port 3306
# Add recovery action to clear cache when memory is high
sg recovery add mysql --type clear-cache --threshold 90
```
When MySQL gets OOM-killed:
1. Service Guardian detects the OOM kill (not just "service down")
2. Checks system memory usage
3. If memory > 90%, clears system cache first
4. Restarts MySQL with exponential backoff
5. Verifies it's accepting connections
6. Sends detailed alert with memory stats and suggestions
### Example 2: Dependent Services
```bash
# Setup WordPress stack with dependencies
sg add nginx
sg add php-fpm
sg add mysql
# Define dependencies
sg deps add nginx php-fpm
sg deps add php-fpm mysql
# If MySQL fails, Service Guardian will:
# 1. Restart MySQL first
# 2. Then restart php-fpm (depends on MySQL)
# 3. Then restart nginx (depends on php-fpm)
```
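The restart order above is a topological ordering of the dependency graph, and the same traversal can detect the circular dependencies that `sg deps check` reports. The sketch below assumes dependencies are stored as a plain object mapping each service to the services it depends on; `restartOrder` is a hypothetical helper, not the tool's API.

```javascript
// deps maps a service to the services it depends on.
// Returns services ordered so that every dependency comes before its dependents.
function restartOrder(deps) {
  const order = [];
  const state = new Map(); // service -> 'visiting' | 'done'

  function visit(service, path) {
    if (state.get(service) === 'done') return;
    if (state.get(service) === 'visiting') {
      throw new Error(`Circular dependency: ${[...path, service].join(' -> ')}`);
    }
    state.set(service, 'visiting');
    for (const dep of deps[service] || []) visit(dep, [...path, service]);
    state.set(service, 'done');
    order.push(service); // pushed only after all of its dependencies
  }

  for (const service of Object.keys(deps)) visit(service, []);
  return order;
}

console.log(restartOrder({ nginx: ['php-fpm'], 'php-fpm': ['mysql'] }));
// [ 'mysql', 'php-fpm', 'nginx' ]
```

Passing a graph such as `{ a: ['b'], b: ['a'] }` throws an error that names the cycle.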
### Example 3: Maintenance Windows
```bash
# Schedule maintenance window for updates
sg maintenance add "Weekly Updates" \
  --days sunday \
  --start 02:00 \
  --duration 2 \
  --services nginx,mysql,redis
# During maintenance:
# - No auto-restarts
# - No alerts
# - Services can be safely updated
```
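Deciding whether monitoring should be paused comes down to a time-window test. The sketch below assumes `--duration` is measured in hours and that windows are evaluated in the server's local time; both are assumptions, as are the `inMaintenanceWindow` name and the shape of the window object.

```javascript
const DAYS = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];

// win: { days: ['sunday'], start: '02:00', durationHours: 2 }
// Simplification: windows that cross midnight are not handled.
function inMaintenanceWindow(win, now) {
  const [hours, minutes] = win.start.split(':').map(Number);
  const startMinutes = hours * 60 + minutes;
  const nowMinutes = now.getHours() * 60 + now.getMinutes();
  const elapsed = nowMinutes - startMinutes;
  return win.days.includes(DAYS[now.getDay()]) &&
    elapsed >= 0 &&
    elapsed < win.durationHours * 60;
}

const weeklyUpdates = { days: ['sunday'], start: '02:00', durationHours: 2 };
// 5 January 2025 was a Sunday.
console.log(inMaintenanceWindow(weeklyUpdates, new Date(2025, 0, 5, 3, 0))); // true
console.log(inMaintenanceWindow(weeklyUpdates, new Date(2025, 0, 5, 4, 0))); // false
```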
### Example 4: Custom Health Checks
```bash
# Create custom health check script
cat > /etc/service-guardian/health-checks/api-check.sh << 'EOF'
#!/bin/bash
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/api/health)
if [ "$RESPONSE" = "200" ]; then
  echo "API is healthy"
  exit 0
else
  echo "API returned status code: $RESPONSE"
  exit 1
fi
EOF
chmod +x /etc/service-guardian/health-checks/api-check.sh
# Add the health check
sg health add api --type script --script api-check.sh
```
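The script above follows the usual convention that exit code 0 means healthy and anything else means unhealthy, with the printed text serving as the message. As a rough sketch of how a monitor could consume such a script, the hypothetical `runScriptCheck` below uses Node's `child_process.spawnSync`; the default 10-second timeout is an assumption.

```javascript
const { spawnSync } = require('child_process');

// Run a check command. Exit code 0 means healthy; anything else is a failure.
function runScriptCheck(command, args = [], timeoutSeconds = 10) {
  const result = spawnSync(command, args, {
    encoding: 'utf8',
    timeout: timeoutSeconds * 1000,
  });
  if (result.error) {
    // Covers a missing executable, permission errors and timeouts.
    return { healthy: false, message: result.error.message };
  }
  const output = (result.stdout || result.stderr || '').trim();
  return { healthy: result.status === 0, message: output };
}

// Usage:
// runScriptCheck('/etc/service-guardian/health-checks/api-check.sh');
```

Passing the command and its arguments separately, rather than as one shell string, avoids shell interpretation of the script path.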
## Architecture
### Security Features
1. **Input Validation** - All inputs validated with JSON schemas
2. **Command Whitelisting** - Only approved system commands
3. **Shell Escape** - Prevents command injection
4. **Path Validation** - Prevents directory traversal
5. **Secure Execution** - Isolated command execution
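As an illustration of how whitelisting and input validation typically combine, the sketch below validates both the action and the service name, then passes them to `execFile` as an argument array so no shell ever parses them. The allowed actions, the name pattern and the helper names are assumptions for illustration, not the package's actual rules.

```javascript
const { execFile } = require('child_process');

// Hypothetical whitelist of systemctl actions.
const ALLOWED_ACTIONS = new Set(['status', 'is-active', 'restart', 'show']);
// Conservative unit-name pattern: must start with a letter or digit,
// so a name can never be mistaken for a command-line option.
const UNIT_NAME = /^[A-Za-z0-9][A-Za-z0-9:_.@-]{0,255}$/;

function buildSystemctlArgs(action, service) {
  if (!ALLOWED_ACTIONS.has(action)) {
    throw new Error(`Action not allowed: ${action}`);
  }
  if (typeof service !== 'string' || !UNIT_NAME.test(service)) {
    throw new Error(`Invalid service name: ${JSON.stringify(service)}`);
  }
  return [action, service];
}

// execFile runs the binary directly with an argument array, so shell
// metacharacters in a service name are never interpreted.
function runSystemctl(action, service, callback) {
  execFile('systemctl', buildSystemctlArgs(action, service), callback);
}

console.log(buildSystemctlArgs('restart', 'mysql')); // [ 'restart', 'mysql' ]
```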
### Performance
- **Parallel Monitoring** - Check multiple services simultaneously
- **Efficient Resource Usage** - Minimal CPU and memory footprint
- **Optimized Queries** - Batch operations where possible
- **Caching** - Reduces repeated system calls
### Reliability
- **Crash Recovery** - Daemon automatically recovers from crashes
- **Data Persistence** - Configuration and metrics survive restarts
- **Atomic Operations** - Prevents partial updates
- **Graceful Shutdown** - Cleanly stops all operations
## Troubleshooting
### Service Guardian won't start
```bash
# Check if already running
sg status
# Check logs for errors
sg logs --tail 50
# Verify Node.js version
node --version # Should be >= 16.0.0
# Check permissions
ls -la /etc/service-guardian/
```
### Services not being monitored
```bash
# Verify service is added
sg list
# Check if service exists
systemctl status <service-name>
# Test monitoring manually
sg check <service-name>
# Check dependencies
sg deps check
```
### Not receiving alerts
```bash
# Test email configuration
sg config email --test
# Check alert settings
sg config show | grep ALERT
# View recent alerts
sg logs | grep "Alert sent"
# Check cooldown status
sg status --verbose
```
### High memory usage
```bash
# Check metrics history
sg metrics --days 7
# Clear old metrics
sg metrics --cleanup
# Reduce check frequency
sg config set CHECK_INTERVAL 60
```
## Development
### Running Tests
```bash
npm test # Run all tests
npm run test:watch # Watch mode
npm run test:coverage # Coverage report
```
### Contributing
For contributions, please contact derrick@derricksiawor.com
## License
**PROPRIETARY SOFTWARE - Copyright (c) 2025 Derrick S. K. Siawor. All Rights Reserved.**
This software is proprietary and confidential.
### Permitted Use:
- ✅ Personal use for monitoring your own services
- ✅ Internal business use within your organization
- ✅ Evaluation and testing purposes
### Restrictions:
- ❌ No copying, modifying, or creating derivative works
- ❌ No distribution, selling, or sublicensing to third parties
- ❌ No reverse engineering or decompiling
- ❌ No commercial use without written permission
- ❌ No public sharing or repository hosting
- ❌ No use for creating competing products
### Commercial Licensing:
For commercial licenses, enterprise support, or custom implementations, contact: **derrick@derricksiawor.com**
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND. See LICENSE file for full terms.
## Author
**Derrick S. K. Siawor**
Website: [https://derricksiawor.com](https://derricksiawor.com)
## Support
- **Email Support**: derrick@derricksiawor.com (for issues, feature requests, and general support)
- **npm Package**: [npmjs.com/package/@timemacro/service-guardian](https://www.npmjs.com/package/@timemacro/service-guardian)
## Acknowledgments
Built with enterprise-grade libraries:
- Commander.js - CLI interface
- Nodemailer - Email alerts
- node-cron - Scheduling
- Winston - Logging
- Chalk - Terminal styling
**Stop losing sleep over crashed services. Let Service Guardian keep watch.**