# Workspace Management Guide
**Part of Task P2-1.3: Supervised Workspace Cleanup (Phase 2)**
Comprehensive guide to the supervised workspace management system with automatic cleanup, crash recovery, and TTL-based retention.
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Workspace Lifecycle](#workspace-lifecycle)
3. [Cleanup Policies](#cleanup-policies)
4. [Orphan Detection and Recovery](#orphan-detection-and-recovery)
5. [Size Limit Enforcement](#size-limit-enforcement)
6. [Audit Trail and Monitoring](#audit-trail-and-monitoring)
7. [Manual Cleanup Procedures](#manual-cleanup-procedures)
8. [Troubleshooting](#troubleshooting)
9. [Performance Characteristics](#performance-characteristics)
## Architecture Overview
The workspace management system provides:
- **Isolated Workspaces**: Each agent task gets its own isolated directory
- **Automatic Cleanup**: Workspaces are automatically cleaned up on agent completion or crash
- **TTL-Based Retention**: Workspaces automatically expire after a configurable TTL (default: 24h)
- **Orphan Detection**: Crashed/killed agents trigger orphan detection and cleanup
- **Size Limits**: Per-workspace size limits (configurable, default: 1GB) prevent disk exhaustion
- **Audit Trail**: Complete history of all cleanup operations with metadata
- **Manual Cleanup**: Operator utilities for manual workspace management
### Component Structure
```
WorkspaceSupervisor (src/services/workspace-supervisor.ts)
├── Create/cleanup workspaces
├── Track lifecycle and metadata
├── Enforce TTL policies
├── Monitor size limits
└── Record audit trail
OrphanDetector (src/lib/orphan-detector.ts)
├── Detect orphaned workspaces (no active process)
├── Grace period management (10 minutes)
├── Background scanning
└── Automatic cleanup
Database Schema (src/db/migrations/007-workspace-tracking-schema.sql)
├── Workspaces table (metadata)
├── Cleanup history (audit trail)
├── Workspace metrics (size tracking)
└── Orphan tracking (crash detection)
Cleanup Utilities (scripts/cleanup-workspaces.sh)
├── List workspaces
├── Manual cleanup
├── Force cleanup (skip TTL)
└── Reports and statistics
```
## Workspace Lifecycle
### 1. Creation (Agent Starts)
When an agent task begins, a supervised workspace is created:
```typescript
const supervisor = new WorkspaceSupervisor({
  workspaceRoot: '/tmp/cfn-workspaces',
  maxWorkspaceSizeBytes: 1024 * 1024 * 1024, // 1GB
  defaultTtlHours: 24
});
await supervisor.initialize();

// Create isolated workspace
const workspace = await supervisor.createWorkspace({
  agentId: 'backend-dev-001',
  taskId: 'task-123',
  maxSizeBytes: 1024 * 1024 * 1024,
  ttlHours: 24
});
// Returns:
// {
//   id: 'uuid...',
//   agentId: 'backend-dev-001',
//   taskId: 'task-123',
//   path: '/tmp/cfn-workspaces/backend-dev-001-task-123-uuid/',
//   createdAt: Date,
//   ttlHours: 24,
//   ...
// }
```
**What Happens:**
- Isolated directory created: `/tmp/cfn-workspaces/{agentId}-{taskId}-{uuid}/`
- Workspace metadata registered in SQLite database
- TTL timer initialized (24h default)
- Size monitoring activated
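The directory naming scheme and TTL bookkeeping described above can be sketched as follows. This is a minimal illustration rather than the `WorkspaceSupervisor` implementation: `buildWorkspacePath` and `computeExpiry` are hypothetical helper names, and the UUID is passed in as a plain string so the example stays deterministic.

```typescript
// Hypothetical helpers illustrating the documented layout
// `{workspaceRoot}/{agentId}-{taskId}-{uuid}/`; not the supervisor's real API.

function buildWorkspacePath(
  workspaceRoot: string,
  agentId: string,
  taskId: string,
  uuid: string
): string {
  // Strip trailing slashes from the root so exactly one separator is used.
  const trimmedRoot = workspaceRoot.replace(/\/+$/, '');
  return `${trimmedRoot}/${agentId}-${taskId}-${uuid}/`;
}

function computeExpiry(createdAt: Date, ttlHours: number): Date {
  // The TTL is expressed in hours; convert it to milliseconds.
  return new Date(createdAt.getTime() + ttlHours * 60 * 60 * 1000);
}

const sketchPath = buildWorkspacePath('/tmp/cfn-workspaces', 'backend-dev-001', 'task-123', 'uuid');
const sketchExpiry = computeExpiry(new Date('2025-11-15T10:00:00Z'), 24);
console.log(sketchPath);                 // /tmp/cfn-workspaces/backend-dev-001-task-123-uuid/
console.log(sketchExpiry.toISOString()); // 2025-11-16T10:00:00.000Z
```

A workspace created at 10:00 UTC with the default 24h TTL becomes eligible for TTL cleanup at 10:00 UTC the next day.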
### 2. Monitoring (During Execution)
During agent execution:
```typescript
import { promises as fs } from 'fs';
import * as path from 'path';

// Agent writes output files into its workspace
await fs.writeFile(path.join(workspace.path, 'output.txt'), 'result');
// Supervisor tracks:
// - File count and total size
// - Size limit violations
// - Last access time (for orphan detection)
```
**Tracked Metrics:**
- Current size in bytes
- File count
- Last accessed timestamp
- Process ID (for crash detection)
- Size limit exceeded flag
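As a rough illustration of how the size and file-count metrics could be gathered, the sketch below walks a directory tree with Node's built-in `fs` module. `measureDirectory` and `WorkspaceMetrics` are hypothetical names introduced for this example; how the supervisor actually collects its metrics is not shown in this guide.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

interface WorkspaceMetrics {
  sizeBytes: number;
  fileCount: number;
}

// Recursively sum file sizes and count regular files under `dir`.
// Symlinks and other special entries are ignored in this sketch.
function measureDirectory(dir: string): WorkspaceMetrics {
  let sizeBytes = 0;
  let fileCount = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      const nested = measureDirectory(fullPath);
      sizeBytes += nested.sizeBytes;
      fileCount += nested.fileCount;
    } else if (entry.isFile()) {
      sizeBytes += fs.statSync(fullPath).size;
      fileCount += 1;
    }
  }
  return { sizeBytes, fileCount };
}

// Usage against a throwaway directory.
const metricsDemoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cfn-metrics-'));
fs.writeFileSync(path.join(metricsDemoDir, 'output.txt'), 'result'); // 6 bytes
fs.mkdirSync(path.join(metricsDemoDir, 'cache'));
fs.writeFileSync(path.join(metricsDemoDir, 'cache', 'tmp.bin'), 'abcd'); // 4 bytes
const metricsDemo = measureDirectory(metricsDemoDir);
console.log(metricsDemo); // { sizeBytes: 10, fileCount: 2 }
fs.rmSync(metricsDemoDir, { recursive: true, force: true });
```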
### 3. Cleanup (Completion or Crash)
Cleanup is triggered when the agent completes, or by orphan detection after a crash:
```typescript
// Normal completion
await supervisor.cleanupWorkspace(workspace.id, {
  reason: 'agent_completed',
  preserveArtifacts: ['report.md', 'output.json'],
  artifactDestination: '/path/to/artifacts'
});

// On crash (orphan detection)
const orphanDetector = new OrphanDetector({
  workspaceRoot: '/tmp/cfn-workspaces',
  gracePeriodMinutes: 10
});
const cleanupStats = await orphanDetector.cleanupOrphans();
```
**Cleanup Options:**
- `reason`: Why cleanup occurred (`agent_completed`, `agent_crashed`, `ttl_expired`, `manual`)
- `preserveArtifacts`: Glob patterns for files to preserve
- `artifactDestination`: Where to move preserved artifacts
- `metadata`: Additional context (exit code, duration, etc.)
**What Happens:**
1. Files matching preserve patterns are moved to artifact destination
2. Remaining workspace directory is removed
3. Cleanup recorded in audit trail
4. Workspace unregistered from database
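The four steps above can be sketched with Node's `fs` module. This is an illustrative outline under simplifying assumptions, not the supervisor's code: `cleanupDirectory` and `CleanupRecord` are hypothetical names, preserved files are matched by exact name in the workspace root rather than by glob, and the audit record is returned to the caller instead of being written to the database.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

interface CleanupRecord {
  reason: string;
  preserved: string[];
  cleanedAt: Date;
}

function cleanupDirectory(
  workspacePath: string,
  reason: string,
  preserveFiles: string[],
  artifactDestination: string
): CleanupRecord {
  const preserved: string[] = [];
  fs.mkdirSync(artifactDestination, { recursive: true });

  // Step 1: copy preserved files out before anything is deleted.
  for (const name of preserveFiles) {
    const source = path.join(workspacePath, name);
    if (fs.existsSync(source)) {
      fs.copyFileSync(source, path.join(artifactDestination, name));
      preserved.push(name);
    }
  }

  // Step 2: remove the whole workspace directory.
  fs.rmSync(workspacePath, { recursive: true, force: true });

  // Steps 3-4: a real system would persist this record and unregister the workspace.
  return { reason, preserved, cleanedAt: new Date() };
}

// Usage against a throwaway directory tree.
const cleanupDemoBase = fs.mkdtempSync(path.join(os.tmpdir(), 'cfn-cleanup-'));
const cleanupDemoWorkspace = path.join(cleanupDemoBase, 'workspace');
const cleanupDemoArtifacts = path.join(cleanupDemoBase, 'artifacts');
fs.mkdirSync(cleanupDemoWorkspace);
fs.writeFileSync(path.join(cleanupDemoWorkspace, 'report.md'), '# Report');
fs.writeFileSync(path.join(cleanupDemoWorkspace, 'scratch.tmp'), 'temp');
const cleanupDemoRecord = cleanupDirectory(
  cleanupDemoWorkspace,
  'agent_completed',
  ['report.md', 'output.json'], // output.json does not exist and is skipped
  cleanupDemoArtifacts
);
console.log(cleanupDemoRecord.preserved); // [ 'report.md' ]
```

Copying before deleting means a failure during preservation leaves the workspace intact, at the cost of briefly holding two copies of each artifact.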
### 4. TTL Cleanup (Background Task)
Background scheduler runs hourly:
```
Every 60 minutes (configurable):
1. Find workspaces older than TTL
2. For each stale workspace:
- Preserve artifacts if configured
- Remove workspace directory
- Record cleanup in audit trail
3. Log statistics (workspaces cleaned, space freed)
```
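Step 1 of the scheduler loop (finding workspaces older than their TTL) amounts to a timestamp comparison. The sketch below assumes each tracked workspace carries its own `createdAt` and `ttlHours`, as in the creation example; `TtlTrackedWorkspace` and `findExpiredWorkspaces` are hypothetical names.

```typescript
interface TtlTrackedWorkspace {
  id: string;
  createdAt: Date;
  ttlHours: number;
}

// Return the workspaces whose TTL has elapsed at time `now`.
function findExpiredWorkspaces(
  workspaces: TtlTrackedWorkspace[],
  now: Date
): TtlTrackedWorkspace[] {
  return workspaces.filter((w) => {
    const expiresAtMs = w.createdAt.getTime() + w.ttlHours * 60 * 60 * 1000;
    return now.getTime() >= expiresAtMs;
  });
}

const ttlScanTime = new Date('2025-11-16T10:00:00Z');
const ttlExpired = findExpiredWorkspaces(
  [
    { id: 'abc123', createdAt: new Date('2025-11-15T09:00:00Z'), ttlHours: 24 }, // 25h old
    { id: 'def456', createdAt: new Date('2025-11-15T11:30:00Z'), ttlHours: 24 }, // 22.5h old
    { id: 'ghi789', createdAt: new Date('2025-11-16T08:00:00Z'), ttlHours: 1 }   // 2h old, 1h TTL
  ],
  ttlScanTime
);
// abc123 and ghi789 are past their TTL; def456 is not.
console.log(ttlExpired.map((w) => w.id));
```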
## Cleanup Policies
### Normal Completion Cleanup
Triggered when the agent finishes, whether it succeeds or fails:
```typescript
const stats = await supervisor.cleanupWorkspace(workspace.id, {
reason: 'agent_completed',
preserveArtifacts: ['*.md', 'output.*', 'artifacts/**'],
artifactDestination: '/artifacts/task-123/'
});
// Returns:
// {
// cleanedCount: 1,
// totalSizeFreed: 52428800, // 50MB
// filesRemoved: 127
// }
```
**Preserved Artifacts Example:**
```
Workspace:
output.json → /artifacts/task-123/output.json ✓
report.md → /artifacts/task-123/report.md ✓
temp_build/ → (removed) ✗
cache/ → (removed) ✗
```
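To make the preserve/remove decision concrete, here is a small glob matcher. It is a sketch under stated assumptions: only `*` (anything except `/`) and `**` (anything, including `/`) are supported, and patterns are matched against paths relative to the workspace root. The supervisor may well use a full glob library with different semantics; `globToRegExp` and `shouldPreserve` are hypothetical names.

```typescript
// Convert a simple glob into an anchored regular expression.
function globToRegExp(pattern: string): RegExp {
  let source = '';
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern.charAt(i);
    if (ch === '*') {
      if (pattern.charAt(i + 1) === '*') {
        source += '.*'; // `**` crosses directory boundaries
        i++;            // consume the second `*`
      } else {
        source += '[^/]*'; // `*` stays within one path segment
      }
    } else {
      // Escape characters that are special in regular expressions.
      source += ch.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp(`^${source}$`);
}

// True when the relative path matches at least one preserve pattern.
function shouldPreserve(relativePath: string, patterns: string[]): boolean {
  return patterns.some((p) => globToRegExp(p).test(relativePath));
}

const preservePatterns = ['*.md', 'output.*', 'artifacts/**'];
console.log(shouldPreserve('report.md', preservePatterns));            // true
console.log(shouldPreserve('output.json', preservePatterns));          // true
console.log(shouldPreserve('artifacts/logs/run.txt', preservePatterns)); // true
console.log(shouldPreserve('temp_build/notes.md', preservePatterns));  // false
console.log(shouldPreserve('cache/data.bin', preservePatterns));       // false
```

Note that under these semantics `*.md` only matches Markdown files at the top level of the workspace; a pattern such as `**.md` or `docs/*.md` would be needed for nested files.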
### TTL-Based Cleanup
Automatic cleanup for old workspaces:
```typescript
// Runs every 60 minutes (configurable)
const stats = await supervisor.enforceRetentionPolicy({
preservePatterns: ['*.md', 'output.*']
});
// Returns statistics of cleaned workspaces
```
**TTL Policy:**
- Default: 24 hours
- Configurable per workspace
- Preserved artifacts moved before cleanup
- Audit trail recorded
### Crash Recovery Cleanup
Orphaned workspace detection and cleanup:
```typescript
const orphans = await orphanDetector.detectOrphans();
// Returns:
// [
//   {
//     id: 'workspace-uuid',
//     agentId: 'agent-001',
//     path: '/tmp/cfn-workspaces/...',
//     processId: 12345,
//     lastAccessedAt: Date,
//     sizeBytes: 52428800,
//     fileCount: 127
//   }
// ]
// Cleanup orphans after grace period
const stats = await orphanDetector.cleanupOrphans();
```
**Orphan Detection Logic:**
1. Scan workspace directories
2. Check if associated process is still active (using process ID)
3. Measure time since last access
4. If process dead AND past grace period → mark as orphan
5. Cleanup orphan workspace
**Grace Period:** 10 minutes (configurable)
- Prevents premature cleanup during agent restarts
- After 10 minutes, orphaned workspace is safe to delete
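The detection rule (process dead AND past grace period) can be sketched as below. This is an illustration only: `isProcessAlive`, `isOrphan`, and `OrphanCandidate` are hypothetical names. The liveness check relies on Node's `process.kill(pid, 0)`, which tests for the existence of a process without sending it a signal; whether `OrphanDetector` uses the same mechanism is an assumption.

```typescript
// Signal 0 performs an existence check without delivering a signal.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === 'EPERM';
  }
}

interface OrphanCandidate {
  processId: number;
  lastAccessedAt: Date;
}

// A workspace is an orphan only if its process is gone AND it has been
// idle for at least the grace period. `alive` is injectable for testing.
function isOrphan(
  candidate: OrphanCandidate,
  now: Date,
  gracePeriodMinutes: number,
  alive: (pid: number) => boolean = isProcessAlive
): boolean {
  if (alive(candidate.processId)) return false;
  const idleMs = now.getTime() - candidate.lastAccessedAt.getTime();
  return idleMs >= gracePeriodMinutes * 60 * 1000;
}

const orphanCheckNow = new Date('2025-11-16T10:15:00Z');
const deadProcess = (_pid: number): boolean => false; // stub: process has exited
const idleFor15Min = { processId: 12345, lastAccessedAt: new Date('2025-11-16T10:00:00Z') };
const idleFor5Min = { processId: 12345, lastAccessedAt: new Date('2025-11-16T10:10:00Z') };
console.log(isOrphan(idleFor15Min, orphanCheckNow, 10, deadProcess)); // true
console.log(isOrphan(idleFor5Min, orphanCheckNow, 10, deadProcess));  // false (still in grace period)
```

One caveat with PID-based checks in general: operating systems reuse process IDs, so a long-dead agent's PID may later belong to an unrelated process. Combining the PID check with the last-access timestamp, as the documented logic does, reduces but does not eliminate that risk.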
## Orphan Detection and Recovery
### How Orphan Detection Works
1. **Process Monitoring**: Each workspace tracks the process ID (PID) of the agent
2. **Activity Tracking**: Last access time recorded for each workspace
3. **Grace Period**: 10-minute grace period after process death
4. **Automatic Cleanup**: After grace period, workspace is cleaned up
### Detecting Orphans
```bash
# Manual orphan detection
node -e "
const { OrphanDetector } = require('./src/lib/orphan-detector');
const detector = new OrphanDetector({
workspaceRoot: '/tmp/cfn-workspaces',
gracePeriodMinutes: 10
});
detector.detectOrphans().then(orphans => {
console.log('Orphaned workspaces:', orphans);
});
"
```
### Grace Period Protection
**Scenario: Agent Restart During Grace Period**
```
Time: 0:00 Agent crashes, workspace marked orphaned
Time: 0:05 Process restarts, updates last_accessed_at
Grace period timer resets
Time: 0:15 Grace period expires, but process is active
→ Workspace NOT cleaned up
```
**Scenario: Agent Crash, No Restart**
```
Time: 0:00 Agent crashes, workspace marked orphaned
Time: 0:10 Grace period expires, process still dead
→ Workspace cleaned up automatically
```
### Background Scanning
Orphan detector runs background scans (default: every 30 minutes):
```typescript
const detector = new OrphanDetector({
workspaceRoot: '/tmp/cfn-workspaces',
gracePeriodMinutes: 10,
scanIntervalMinutes: 30 // Background scan interval
});
detector.start(); // Start background scanning
// ... later ...
detector.stop(); // Stop background scanning
```
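A `start()`/`stop()` pair like the one above is commonly built on `setInterval`. The following is a minimal sketch of that pattern, not the `OrphanDetector` source; `PeriodicScanner` is a hypothetical class, and the scan callback stands in for the real detection pass.

```typescript
class PeriodicScanner {
  private timer: ReturnType<typeof setInterval> | null = null;
  private readonly scan: () => void;
  private readonly intervalMs: number;

  constructor(scan: () => void, intervalMs: number) {
    this.scan = scan;
    this.intervalMs = intervalMs;
  }

  start(): void {
    if (this.timer !== null) return; // already running; avoid duplicate timers
    this.timer = setInterval(this.scan, this.intervalMs);
  }

  stop(): void {
    if (this.timer === null) return;
    clearInterval(this.timer);
    this.timer = null;
  }

  get running(): boolean {
    return this.timer !== null;
  }
}

// Usage: scan every 30 minutes until stopped.
const scannerDemo = new PeriodicScanner(() => console.log('scanning for orphans...'), 30 * 60 * 1000);
scannerDemo.start();
scannerDemo.start(); // second call is a no-op
const scannerWasRunning = scannerDemo.running;
scannerDemo.stop();
```

Making `start()` idempotent prevents overlapping timers if initialization code runs twice, and calling `stop()` during shutdown keeps the pending timer from holding the process open.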
## Size Limit Enforcement
### Per-Workspace Size Limits
Each workspace has a configurable size limit (default: 1GB):
```typescript
const workspace = await supervisor.createWorkspace({
agentId: 'agent-001',
taskId: 'task-001',
maxSizeBytes: 1024 * 1024 * 1024, // 1GB limit
ttlHours: 24
});
```
### Monitoring Size Usage
```typescript
// Get workspace info with size
const info = await supervisor.getWorkspaceInfo(workspace.id);
console.log(info);
// Example output:
// {
//   sizeBytes: 524288000,      // 500MB
//   fileCount: 1240,
//   maxSizeBytes: 1073741824,  // 1GB
//   exceedsLimit: false
// }
```
### What Happens When Limit Exceeded
1. **Violation Flagged**: `exceedsLimit` flag set to true
2. **Metric Recorded**: Size violation recorded in workspace_metrics
3. **Alert Generated**: Log warning with workspace details
4. **Operator Notified**: The operator is alerted via monitoring so that manual cleanup can follow
5. **Emergency Cleanup**: Can be manually triggered
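The flag-and-warn behavior in steps 1 to 3 can be sketched as a pure function. `checkSizeLimit` and `SizeCheckResult` are hypothetical names, and the warning text is made up for illustration; the real log format is not specified in this guide.

```typescript
interface SizeCheckResult {
  exceedsLimit: boolean;
  usagePercent: number;
  warning: string | null;
}

// Compare current usage against the per-workspace limit.
function checkSizeLimit(
  workspaceId: string,
  sizeBytes: number,
  maxSizeBytes: number
): SizeCheckResult {
  const exceedsLimit = sizeBytes > maxSizeBytes;
  const usagePercent = Math.round((sizeBytes / maxSizeBytes) * 100);
  const warning = exceedsLimit
    ? `Workspace ${workspaceId} uses ${sizeBytes} bytes, over its ${maxSizeBytes}-byte limit`
    : null;
  return { exceedsLimit, usagePercent, warning };
}

const ONE_GIB = 1024 * 1024 * 1024;
const withinLimit = checkSizeLimit('abc123', 500 * 1024 * 1024, ONE_GIB);
const overLimit = checkSizeLimit('xyz999', 1.5 * ONE_GIB, ONE_GIB);
console.log(withinLimit.exceedsLimit, withinLimit.usagePercent); // false 49
console.log(overLimit.exceedsLimit, overLimit.usagePercent);     // true 150
if (overLimit.warning !== null) console.warn(overLimit.warning);
```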
### Disk Usage Monitoring
Get overall workspace statistics:
```typescript
const stats = await supervisor.getStatistics();
console.log(stats);
// Example output:
// {
//   totalWorkspaces: 45,
//   activeWorkspaces: 42,
//   totalDiskUsage: 536870912000, // 500GB
//   staleWorkspaces: 3
// }
```
## Audit Trail and Monitoring
### Cleanup History
Complete audit trail of all cleanup operations:
```typescript
const history = await supervisor.getCleanupHistory(workspace.id);
// Returns:
// [
// {
// cleanedAt: Date,
// reason: 'agent_completed',
// sizeFreed: 52428800,
// filesRemoved: 127,
// metadata: { exitCode: 0, duration: 5000 }
// },
// {
// cleanedAt: Date,
// reason: 'ttl_expired',
// sizeFreed: 0, // Artifacts preserved
// filesRemoved: 0,
// metadata: { preserved_artifacts: 3 }
// }
// ]
```
### Metrics Tracking
Workspace size metrics tracked over time:
```sql
-- Query workspace metrics
SELECT
recorded_at,
size_bytes,
file_count,
exceeds_limit
FROM workspace_metrics
WHERE workspace_id = 'workspace-id'
ORDER BY recorded_at DESC
LIMIT 100;
```
### Cleanup Reporting
Generate cleanup reports:
```bash
# List all cleanups (last 10)
sqlite3 ./cfn-workspaces/metadata.db \
"SELECT workspace_id, reason, size_freed, cleaned_at
FROM cleanup_history
ORDER BY cleaned_at DESC
LIMIT 10;"
# Cleanup by reason
sqlite3 ./cfn-workspaces/metadata.db \
"SELECT reason, COUNT(*) as count, SUM(size_freed) as total_freed
FROM cleanup_history
GROUP BY reason;"
# Disk freed over time
sqlite3 ./cfn-workspaces/metadata.db \
"SELECT DATE(cleaned_at) as date, COUNT(*) as cleanups, SUM(size_freed) as freed
FROM cleanup_history
GROUP BY DATE(cleaned_at)
ORDER BY date DESC;"
```
## Manual Cleanup Procedures
### List All Workspaces
```bash
./scripts/cleanup-workspaces.sh --list
```
Output:
```
=== Workspace Listing ===
Active Workspaces:
ID Agent Task Size Created
─────────────────────────────────────────────────────────────
abc123 backend-dev-001 task-123 50MB 2025-11-15T10:00:00Z
def456 frontend-dev-002 task-124 100MB 2025-11-15T11:30:00Z
ghi789 tester-003 task-125 25MB 2025-11-15T14:00:00Z
```
### Show Orphaned Workspaces
```bash
./scripts/cleanup-workspaces.sh --orphans
```
Shows workspaces in grace period (orphan detected, waiting for grace period expiry).
### Manual Cleanup
```bash
./scripts/cleanup-workspaces.sh --cleanup abc123def456
```
Interactive cleanup (requires confirmation):
```
About to delete workspace:
ID: abc123def456
Path: /tmp/cfn-workspaces/backend-dev-001-task-123-abc123/
Size: 50MB
Continue? (yes/no): yes
✓ Workspace cleaned up: abc123def456 (50MB freed)
```
### Force Cleanup (Skip TTL)
```bash
./scripts/cleanup-workspaces.sh --force-cleanup abc123def456
```
Skips the TTL and grace-period checks and cleans up the workspace immediately.
### Cleanup All Orphans
```bash
./scripts/cleanup-workspaces.sh --cleanup-orphans
```
Scans for orphaned workspaces past grace period and cleans them up.
### Generate Report
```bash
./scripts/cleanup-workspaces.sh --report
```
Comprehensive workspace report:
```
=== Workspace Report ===
Overall Statistics:
Total Workspaces: 42
Total Size (bytes): 536870912000
Total Files: 12450
Most Recent: 2025-11-16T10:00:00Z
Recent Cleanups:
ID (short) Reason Size (bytes) When
─────────────────────────────────────────────────
abc123 agent_completed 52428800 2025-11-16T09:30:00Z
def456 ttl_expired 26214400 2025-11-16T08:00:00Z
ghi789 manual 10485760 2025-11-15T20:00:00Z
Largest Workspaces:
ID (short) Agent Task Size (bytes) Files
──────────────────────────────────────────────────────────────
xyz999 heavy-job-001 task-500 268435456 2840
```
## Troubleshooting
### Orphaned Files Not Being Cleaned
**Symptoms:** Workspace directories still exist after 24h
**Diagnosis:**
```bash
# Check if process is still active
ps -p <process-id>
# Check workspace metadata
cat /tmp/cfn-workspaces/agent-task-uuid/.metadata.json
# Check orphan tracking table
sqlite3 ./cfn-workspaces/metadata.db \
"SELECT * FROM orphan_tracking WHERE cleaned_at IS NULL;"
```
**Solutions:**
1. **Check process**: If process still running, that's correct (workspace not orphaned yet)
2. **Check grace period**: Wait 10+ minutes after process death
3. **Manual cleanup**: Use `--force-cleanup` to skip grace period
4. **Check permissions**: Verify write permissions to workspace root
### Workspace Size Limit Exceeded
**Symptoms:** `exceedsLimit: true` in workspace info
**Diagnosis:**
```bash
# Check workspace size
du -sh /tmp/cfn-workspaces/agent-task-uuid/
# List largest files
du -sh /tmp/cfn-workspaces/agent-task-uuid/* | sort -h | tail -10
```
**Solutions:**
1. **Clean temporary files**: Remove cache, temp directories
2. **Move artifacts**: Preserve important files, then clean up the rest
3. **Increase limit**: Adjust `maxSizeBytes` in workspace config
4. **Manual cleanup**: Force cleanup and restart task
### Database Corruption
**Symptoms:** Database errors in logs, cleanup operations fail
**Recovery:**
```bash
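# Back up the database before touching it
cp ./cfn-workspaces/metadata.db ./cfn-workspaces/metadata.db.backup
# Verify database integrity (prints "ok" when the file is healthy)
sqlite3 ./cfn-workspaces/metadata.db "PRAGMA integrity_check;"
# If corrupted: remove the damaged file (the backup copy remains), then recreate
# the schema. This starts from an empty database; existing records are not restored.
rm ./cfn-workspaces/metadata.db
sqlite3 ./cfn-workspaces/metadata.db < ./src/db/migrations/007-workspace-tracking-schema.sql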
```
### High Disk Usage
**Diagnosis:**
```bash
# Count cleanups by reason (a missing or rare reason suggests that cleanup path is not running)
sqlite3 ./cfn-workspaces/metadata.db \
"SELECT reason, COUNT(*) FROM cleanup_history GROUP BY reason;"
# Find workspaces not being cleaned
sqlite3 ./cfn-workspaces/metadata.db \
"SELECT id, agent_id, task_id, current_size_bytes
FROM workspaces
WHERE id NOT IN (SELECT DISTINCT workspace_id FROM cleanup_history)
ORDER BY current_size_bytes DESC;"
```
**Solutions:**
1. Run TTL cleanup: `supervisor.enforceRetentionPolicy()`
2. Clean orphans: `orphanDetector.cleanupOrphans()`
3. Manual cleanup for large workspaces
4. Review TTL policy (may be too long)
## Performance Characteristics
### Workspace Creation
- **Time**: < 100ms
- **Operations**: Directory creation + database insert
- **Scalability**: 1000+ concurrent workspaces
### Cleanup Operations
- **< 100MB**: < 1 second
- **< 1GB**: < 5 seconds
- **> 1GB**: ~5ms per MB
- **Database updates**: < 100ms
### Orphan Detection
- **Scan time**: ~30 seconds for 1000 workspaces
- **Background interval**: 30 minutes (configurable)
- **Grace period check**: O(n) where n = workspaces in grace period
### TTL Cleanup
- **100 workspaces**: < 5 minutes
- **1000 workspaces**: < 50 minutes
- **Background interval**: 60 minutes (configurable)
### Database
- **Schema**: ~7 tables with indexes
- **Cleanup history growth**: Proportional to workspaces × average cleanups (1-2 per workspace)
- **Estimated size**: ~1MB per 1000 workspaces × 10 cleanup events
### Disk Usage Limits
- **Maximum workspaces**: Unlimited (filesystem dependent)
- **Recommended monitoring threshold**: 80% of available disk
- **Emergency cleanup trigger**: 90% disk usage
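The two thresholds above translate into a simple classification. The sketch below is illustrative: `classifyDiskUsage` and `DiskAlertLevel` are hypothetical names, and how the used and total byte counts are obtained (for example from the operating system or a monitoring agent) is left to the caller.

```typescript
type DiskAlertLevel = 'ok' | 'warning' | 'emergency';

// Map disk usage onto the documented thresholds:
// 80% = monitoring threshold, 90% = emergency cleanup trigger.
function classifyDiskUsage(usedBytes: number, totalBytes: number): DiskAlertLevel {
  const usedFraction = usedBytes / totalBytes;
  if (usedFraction >= 0.9) return 'emergency';
  if (usedFraction >= 0.8) return 'warning';
  return 'ok';
}

const diskTotal = 1000 * 1024 * 1024 * 1024; // hypothetical 1000 GiB volume
console.log(classifyDiskUsage(0.5 * diskTotal, diskTotal));  // ok
console.log(classifyDiskUsage(0.85 * diskTotal, diskTotal)); // warning
console.log(classifyDiskUsage(0.95 * diskTotal, diskTotal)); // emergency
```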
## Best Practices
1. **Set Appropriate TTL**: Default 24h is good for most tasks
2. **Preserve Artifacts**: Always preserve important output files
3. **Monitor Orphans**: Check orphan detector logs regularly
4. **Review Size Limits**: Adjust per-workspace limits based on task requirements
5. **Regular Audits**: Use `--report` to monitor cleanup effectiveness
6. **Backup Metadata**: Periodically backup database for disaster recovery
## Related Documentation
- **Task P2-1.3**: Supervised Workspace Cleanup
- **Database Service**: `docs/DATABASE_SERVICE_GUIDE.md`
- **Backup Manager**: `docs/BACKUP_MANAGER_GUIDE.md`
- **File Operations**: `src/lib/file-operations.ts`
**Last Updated**: 2025-11-16
**Version**: 1.0.0
**Status**: Production Ready