# Workspace Management Guide

**Part of Task P2-1.3: Supervised Workspace Cleanup (Phase 2)**

Comprehensive guide to the supervised workspace management system with automatic cleanup, crash recovery, and TTL-based retention.

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Workspace Lifecycle](#workspace-lifecycle)
3. [Cleanup Policies](#cleanup-policies)
4. [Orphan Detection and Recovery](#orphan-detection-and-recovery)
5. [Size Limit Enforcement](#size-limit-enforcement)
6. [Audit Trail and Monitoring](#audit-trail-and-monitoring)
7. [Manual Cleanup Procedures](#manual-cleanup-procedures)
8. [Troubleshooting](#troubleshooting)
9. [Performance Characteristics](#performance-characteristics)

## Architecture Overview

The workspace management system provides:

- **Isolated Workspaces**: Each agent task gets its own isolated directory
- **Automatic Cleanup**: Workspaces are automatically cleaned up on agent completion or crash
- **TTL-Based Retention**: Workspaces automatically expire after a configurable TTL (default: 24h)
- **Orphan Detection**: Crashed/killed agents trigger orphan detection and cleanup
- **Size Limits**: Per-workspace size limits (configurable, default: 1GB) prevent disk exhaustion
- **Audit Trail**: Complete history of all cleanup operations with metadata
- **Manual Cleanup**: Operator utilities for manual workspace management

### Component Structure

```
WorkspaceSupervisor (src/services/workspace-supervisor.ts)
├── Create/cleanup workspaces
├── Track lifecycle and metadata
├── Enforce TTL policies
├── Monitor size limits
└── Record audit trail

OrphanDetector (src/lib/orphan-detector.ts)
├── Detect orphaned workspaces (no active process)
├── Grace period management (10 minutes)
├── Background scanning
└── Automatic cleanup

Database Schema (src/db/migrations/007-workspace-tracking-schema.sql)
├── Workspaces table (metadata)
├── Cleanup history (audit trail)
├── Workspace metrics (size tracking)
└── Orphan tracking (crash detection)

Cleanup Utilities (scripts/cleanup-workspaces.sh)
├── List workspaces
├── Manual cleanup
├── Force cleanup (skip TTL)
└── Reports and statistics
```

## Workspace Lifecycle

### 1. Creation (Agent Starts)

When an agent task begins, a supervised workspace is created:

```typescript
const supervisor = new WorkspaceSupervisor({
  workspaceRoot: '/tmp/cfn-workspaces',
  maxWorkspaceSizeBytes: 1024 * 1024 * 1024, // 1GB
  defaultTtlHours: 24
});

await supervisor.initialize();

// Create isolated workspace
const workspace = await supervisor.createWorkspace({
  agentId: 'backend-dev-001',
  taskId: 'task-123',
  maxSizeBytes: 1024 * 1024 * 1024,
  ttlHours: 24
});

// Returns:
// {
//   id: 'uuid...',
//   agentId: 'backend-dev-001',
//   taskId: 'task-123',
//   path: '/tmp/cfn-workspaces/backend-dev-001-task-123-uuid/',
//   createdAt: Date,
//   ttlHours: 24,
//   ...
// }
```

**What Happens:**

- Isolated directory created: `/tmp/cfn-workspaces/{agentId}-{taskId}-{uuid}/`
- Workspace metadata registered in SQLite database
- TTL timer initialized (24h default)
- Size monitoring activated

### 2. Monitoring (During Execution)

During agent execution:

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';

// Agent writes output files
await fs.writeFile(path.join(workspace.path, 'output.txt'), 'result');

// Supervisor tracks:
// - File count and total size
// - Size limit violations
// - Last access time (for orphan detection)
```

**Tracked Metrics:**

- Current size in bytes
- File count
- Last accessed timestamp
- Process ID (for crash detection)
- Size limit exceeded flag

### 3. Cleanup (Completion or Crash)

On completion, cleanup is triggered:

```typescript
// Normal completion
await supervisor.cleanupWorkspace(workspace.id, {
  reason: 'agent_completed',
  preserveArtifacts: ['report.md', 'output.json'],
  artifactDestination: '/path/to/artifacts'
});

// On crash (orphan detection)
const orphanDetector = new OrphanDetector({
  workspaceRoot: '/tmp/cfn-workspaces',
  gracePeriodMinutes: 10
});

const cleanupStats = await orphanDetector.cleanupOrphans();
```

**Cleanup Options:**

- `reason`: Why cleanup occurred (agent_completed, agent_crashed, ttl_expired, manual)
- `preserveArtifacts`: Glob patterns for files to preserve
- `artifactDestination`: Where to move preserved artifacts
- `metadata`: Additional context (exit code, duration, etc.)

**What Happens:**

1. Files matching preserve patterns are moved to artifact destination
2. Remaining workspace directory is removed
3. Cleanup recorded in audit trail
4. Workspace unregistered from database

### 4. TTL Cleanup (Background Task)

Background scheduler runs hourly:

```
Every 60 minutes (configurable):
1. Find workspaces older than TTL
2. For each stale workspace:
   - Preserve artifacts if configured
   - Remove workspace directory
   - Record cleanup in audit trail
3. Log statistics (workspaces cleaned, space freed)
```

## Cleanup Policies

### Normal Completion Cleanup

Triggered when an agent completes, whether it succeeds or fails:

```typescript
const stats = await supervisor.cleanupWorkspace(workspace.id, {
  reason: 'agent_completed',
  preserveArtifacts: ['*.md', 'artifacts/**'],
  artifactDestination: '/artifacts/task-123/'
});

// Returns:
// {
//   cleanedCount: 1,
//   totalSizeFreed: 52428800, // 50MB
//   filesRemoved: 127
// }
```

**Preserved Artifacts Example:**

```
Workspace:
  output.json  -> /artifacts/task-123/output.json
  report.md    -> /artifacts/task-123/report.md
  temp_build/  (removed)
  cache/       (removed)
```

### TTL-Based Cleanup

Automatic cleanup for old workspaces:

```typescript
// Runs every 60 minutes (configurable)
const stats = await supervisor.enforceRetentionPolicy({
  preservePatterns: ['*.md', 'output.*']
});

// Returns statistics of cleaned workspaces
```

**TTL Policy:**

- Default: 24 hours
- Configurable per workspace
- Preserved artifacts moved before cleanup
- Audit trail recorded

### Crash Recovery Cleanup

Orphaned workspace detection and cleanup:

```typescript
const orphans = await orphanDetector.detectOrphans();

// Returns:
// [
//   {
//     id: 'workspace-uuid',
//     agentId: 'agent-001',
//     path: '/tmp/cfn-workspaces/...',
//     processId: 12345,
//     lastAccessedAt: Date,
//     sizeBytes: 52428800,
//     fileCount: 127
//   }
// ]

// Clean up orphans after the grace period
const stats = await orphanDetector.cleanupOrphans();
```

**Orphan Detection Logic:**

1. Scan workspace directories
2. Check if associated process is still active (using process ID)
3. Measure time since last access
4. If the process is dead AND the grace period has passed, mark the workspace as an orphan
5. Clean up the orphaned workspace

**Grace Period:** 10 minutes (configurable)

- Prevents premature cleanup during agent restarts
- After 10 minutes, orphaned workspace is safe to delete

## Orphan Detection and Recovery

### How Orphan Detection Works

1. **Process Monitoring**: Each workspace tracks the process ID (PID) of the agent
2. **Activity Tracking**: Last access time recorded for each workspace
3. **Grace Period**: 10-minute grace period after process death
4. **Automatic Cleanup**: After grace period, workspace is cleaned up

### Detecting Orphans

```bash
# Manual orphan detection
node -e "
const { OrphanDetector } = require('./src/lib/orphan-detector');
const detector = new OrphanDetector({
  workspaceRoot: '/tmp/cfn-workspaces',
  gracePeriodMinutes: 10
});
detector.detectOrphans().then(orphans => {
  console.log('Orphaned workspaces:', orphans);
});
"
```

### Grace Period Protection

**Scenario: Agent Restart During Grace Period**

```
Time: 0:00  Agent crashes, workspace marked orphaned
Time: 0:05  Process restarts, updates last_accessed_at
            Grace period timer resets
Time: 0:15  Grace period expires, but process is active
            Workspace NOT cleaned up
```

**Scenario: Agent Crash, No Restart**

```
Time: 0:00  Agent crashes, workspace marked orphaned
Time: 0:10  Grace period expires, process still dead
            Workspace cleaned up automatically
```

### Background Scanning

Orphan detector runs background scans (default: every 30 minutes):

```typescript
const detector = new OrphanDetector({
  workspaceRoot: '/tmp/cfn-workspaces',
  gracePeriodMinutes: 10,
  scanIntervalMinutes: 30 // Background scan interval
});

detector.start(); // Start background scanning

// ... later ...
detector.stop(); // Stop background scanning
```

## Size Limit Enforcement

### Per-Workspace Size Limits

Each workspace has a configurable size limit (default: 1GB):

```typescript
const workspace = await supervisor.createWorkspace({
  agentId: 'agent-001',
  taskId: 'task-001',
  maxSizeBytes: 1024 * 1024 * 1024, // 1GB limit
  ttlHours: 24
});
```

### Monitoring Size Usage

```typescript
// Get workspace info with size
const info = await supervisor.getWorkspaceInfo(workspace.id);
console.log(info);

// Example output:
// {
//   sizeBytes: 524288000,     // 500MB
//   fileCount: 1240,
//   maxSizeBytes: 1073741824, // 1GB
//   exceedsLimit: false
// }
```

### What Happens When Limit Exceeded

1. **Violation Flagged**: `exceedsLimit` flag set to true
2. **Metric Recorded**: Size violation recorded in workspace_metrics
3. **Alert Generated**: Log warning with workspace details
4. **Manual Cleanup**: Operator notified via monitoring
5. **Emergency Cleanup**: Can be manually triggered

### Disk Usage Monitoring

Get overall workspace statistics:

```typescript
const stats = await supervisor.getStatistics();
console.log(stats);

// Example output:
// {
//   totalWorkspaces: 45,
//   activeWorkspaces: 42,
//   totalDiskUsage: 536870912000, // 500GB
//   staleWorkspaces: 3
// }
```

## Audit Trail and Monitoring

### Cleanup History

Complete audit trail of all cleanup operations:

```typescript
const history = await supervisor.getCleanupHistory(workspace.id);

// Returns:
// [
//   {
//     cleanedAt: Date,
//     reason: 'agent_completed',
//     sizeFreed: 52428800,
//     filesRemoved: 127,
//     metadata: { exitCode: 0, duration: 5000 }
//   },
//   {
//     cleanedAt: Date,
//     reason: 'ttl_expired',
//     sizeFreed: 0, // Artifacts preserved
//     filesRemoved: 0,
//     metadata: { preserved_artifacts: 3 }
//   }
// ]
```

### Metrics Tracking

Workspace size metrics tracked over time:

```sql
-- Query workspace metrics
SELECT
  recorded_at,
  size_bytes,
  file_count,
  exceeds_limit
FROM workspace_metrics
WHERE workspace_id = 'workspace-id'
ORDER BY recorded_at DESC
LIMIT 100;
```

### Cleanup Reporting

Generate cleanup reports:

```bash
# List all cleanups (last 10)
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT workspace_id, reason, size_freed, cleaned_at
   FROM cleanup_history
   ORDER BY cleaned_at DESC
   LIMIT 10;"

# Cleanup by reason
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT reason, COUNT(*) as count, SUM(size_freed) as total_freed
   FROM cleanup_history
   GROUP BY reason;"

# Disk freed over time
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT DATE(cleaned_at) as date, COUNT(*) as cleanups, SUM(size_freed) as freed
   FROM cleanup_history
   GROUP BY DATE(cleaned_at)
   ORDER BY date DESC;"
```

## Manual Cleanup Procedures

### List All Workspaces

```bash
./scripts/cleanup-workspaces.sh --list
```

Output:

```
=== Workspace Listing ===

Active Workspaces:
ID      Agent             Task      Size   Created
─────────────────────────────────────────────────────────────
abc123  backend-dev-001   task-123  50MB   2025-11-15T10:00:00Z
def456  frontend-dev-002  task-124  100MB  2025-11-15T11:30:00Z
ghi789  tester-003        task-125  25MB   2025-11-15T14:00:00Z
```

### Show Orphaned Workspaces

```bash
./scripts/cleanup-workspaces.sh --orphans
```

Shows workspaces in grace period (orphan detected, waiting for grace period expiry).

### Manual Cleanup

```bash
./scripts/cleanup-workspaces.sh --cleanup abc123def456
```

Interactive cleanup (requires confirmation):

```
About to delete workspace:
  ID: abc123def456
  Path: /tmp/cfn-workspaces/backend-dev-001-task-123-abc123/
  Size: 50MB

Continue? (yes/no): yes
Workspace cleaned up: abc123def456 (50MB freed)
```

### Force Cleanup (Skip TTL)

```bash
./scripts/cleanup-workspaces.sh --force-cleanup abc123def456
```

Skips grace period and immediately cleans up workspace.

### Cleanup All Orphans

```bash
./scripts/cleanup-workspaces.sh --cleanup-orphans
```

Scans for orphaned workspaces past grace period and cleans them up.
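
The rule behind this command (process dead and grace period elapsed) can be sketched as a small pure function. This is a minimal illustration, not the package's `OrphanDetector` implementation: `TrackedWorkspace`, `OrphanStatus`, and `classifyWorkspace` are hypothetical names, and treating a workspace as orphaned at exactly the grace-period boundary is an assumption based on the scenarios under Grace Period Protection.

```typescript
// Hypothetical shape: field names mirror the detectOrphans() output shown in
// this guide, but the type itself is an illustration only.
interface TrackedWorkspace {
  id: string;
  processId: number;
  lastAccessedAt: Date;
}

type OrphanStatus = 'active' | 'in_grace_period' | 'orphaned';

// Classify a workspace using the two conditions from the guide: the owning
// process is dead AND the grace period has elapsed since the last access.
// Whether the process is alive is passed in, so the function stays pure.
function classifyWorkspace(
  ws: TrackedWorkspace,
  processAlive: boolean,
  now: Date,
  gracePeriodMinutes = 10
): OrphanStatus {
  if (processAlive) return 'active';
  const idleMs = now.getTime() - ws.lastAccessedAt.getTime();
  const graceMs = gracePeriodMinutes * 60 * 1000;
  // Assumption: the boundary itself counts as expired.
  return idleMs >= graceMs ? 'orphaned' : 'in_grace_period';
}

// Usage: replay the "Agent Crash, No Restart" timeline.
const crashedAt = new Date('2025-11-16T10:00:00Z');
const ws: TrackedWorkspace = { id: 'abc123', processId: 12345, lastAccessedAt: crashedAt };

const fiveMinLater = new Date(crashedAt.getTime() + 5 * 60 * 1000);
const tenMinLater = new Date(crashedAt.getTime() + 10 * 60 * 1000);

console.log(classifyWorkspace(ws, false, fiveMinLater)); // in_grace_period
console.log(classifyWorkspace(ws, false, tenMinLater));  // orphaned
console.log(classifyWorkspace(ws, true, tenMinLater));   // active
```

Keeping the liveness check outside the function makes the grace-period arithmetic easy to test without real processes. A restart that updates the last access time simply moves `lastAccessedAt` forward, which resets the timer.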
### Generate Report

```bash
./scripts/cleanup-workspaces.sh --report
```

Comprehensive workspace report:

```
=== Workspace Report ===

Overall Statistics:
  Total Workspaces:    42
  Total Size (bytes):  536870912000
  Total Files:         12450
  Most Recent:         2025-11-16T10:00:00Z

Recent Cleanups:
ID (short)  Reason           Size (bytes)  When
─────────────────────────────────────────────────
abc123      agent_completed  52428800      2025-11-16T09:30:00Z
def456      ttl_expired      26214400      2025-11-16T08:00:00Z
ghi789      manual           10485760      2025-11-15T20:00:00Z

Largest Workspaces:
ID (short)  Agent          Task      Size (bytes)  Files
──────────────────────────────────────────────────────────────
xyz999      heavy-job-001  task-500  268435456     2840
```

## Troubleshooting

### Orphaned Files Not Being Cleaned

**Symptoms:** Workspace directories still exist after 24h

**Diagnosis:**

```bash
# Check if process is still active
ps aux | grep <process-id>

# Check workspace metadata
cat /tmp/cfn-workspaces/agent-task-uuid/.metadata.json

# Check orphan tracking table
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT * FROM orphan_tracking WHERE cleaned_at IS NULL;"
```

**Solutions:**

1. **Check process**: If the process is still running, the behavior is correct (the workspace is not orphaned yet)
2. **Check grace period**: Wait 10+ minutes after process death
3. **Manual cleanup**: Use `--force-cleanup` to skip grace period
4. **Check permissions**: Verify write permissions to workspace root

### Workspace Size Limit Exceeded

**Symptoms:** `exceedsLimit: true` in workspace info

**Diagnosis:**

```bash
# Check workspace size
du -sh /tmp/cfn-workspaces/agent-task-uuid/

# List largest files
du -sh /tmp/cfn-workspaces/agent-task-uuid/* | sort -h | tail -10
```

**Solutions:**

1. **Clean temporary files**: Remove cache, temp directories
2. **Move artifacts**: Preserve important files, then clean up
3. **Increase limit**: Adjust `maxSizeBytes` in workspace config
4. **Manual cleanup**: Force cleanup and restart task

### Database Corruption

**Symptoms:** Database errors in logs, cleanup operations fail

**Recovery:**

```bash
# Backup corrupted database
cp ./cfn-workspaces/metadata.db ./cfn-workspaces/metadata.db.backup

# Verify database integrity
sqlite3 ./cfn-workspaces/metadata.db "PRAGMA integrity_check;"

# Rebuild if corrupted: remove the damaged file (already backed up above)
# so the schema is applied to a fresh, empty database
rm ./cfn-workspaces/metadata.db
sqlite3 ./cfn-workspaces/metadata.db < ./src/db/migrations/007-workspace-tracking-schema.sql
```

### High Disk Usage

**Diagnosis:**

```bash
# Count cleanups by reason to spot cleanup paths that are not running
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT reason, COUNT(*) FROM cleanup_history GROUP BY reason;"

# Find workspaces not being cleaned
sqlite3 ./cfn-workspaces/metadata.db \
  "SELECT id, agent_id, task_id, current_size_bytes
   FROM workspaces
   WHERE id NOT IN (SELECT DISTINCT workspace_id FROM cleanup_history)
   ORDER BY current_size_bytes DESC;"
```

**Solutions:**

1. Run TTL cleanup: `supervisor.enforceRetentionPolicy()`
2. Clean orphans: `orphanDetector.cleanupOrphans()`
3. Manual cleanup for large workspaces
4. Review TTL policy (may be too long)

## Performance Characteristics

### Workspace Creation

- **Time**: < 100ms
- **Operations**: Directory creation + database insert
- **Scalability**: 1000+ concurrent workspaces

### Cleanup Operations

- **< 100MB**: < 1 second
- **< 1GB**: < 5 seconds
- **> 1GB**: ~5ms per MB
- **Database updates**: < 100ms

### Orphan Detection

- **Scan time**: ~30 seconds for 1000 workspaces
- **Background interval**: 30 minutes (configurable)
- **Grace period check**: O(n) where n = workspaces in grace period

### TTL Cleanup

- **100 workspaces**: < 5 minutes
- **1000 workspaces**: < 50 minutes
- **Background interval**: 60 minutes (configurable)

### Database

- **Schema**: ~7 tables with indexes
- **Cleanup history retention**: Depends on workspaces × average cleanups (1-2 per workspace)
- **Estimated size**: ~1MB per 1000 workspaces × 10 cleanup events

### Disk Usage Limits

- **Maximum workspaces**: Unlimited (filesystem dependent)
- **Recommended monitoring threshold**: 80% of available disk
- **Emergency cleanup trigger**: 90% disk usage

## Best Practices

1. **Set Appropriate TTL**: Default 24h is good for most tasks
2. **Preserve Artifacts**: Always preserve important output files
3. **Monitor Orphans**: Check orphan detector logs regularly
4. **Review Size Limits**: Adjust per-workspace limits based on task requirements
5. **Regular Audits**: Use `--report` to monitor cleanup effectiveness
6. **Backup Metadata**: Periodically back up the database for disaster recovery

## Related Documentation

- **Task P2-1.3**: Supervised Workspace Cleanup
- **Database Service**: `docs/DATABASE_SERVICE_GUIDE.md`
- **Backup Manager**: `docs/BACKUP_MANAGER_GUIDE.md`
- **File Operations**: `src/lib/file-operations.ts`

---

**Last Updated**: 2025-11-16
**Version**: 1.0.0
**Status**: Production Ready