# Claude-Flow Coordination System

A comprehensive, scalable, and fault-tolerant coordination system for multi-agent task execution and resource management.

## Overview

The coordination system provides:

- **Task Scheduling**: Intelligent agent selection and priority handling
- **Resource Management**: Distributed locking with deadlock detection
- **Message Passing**: Inter-agent communication with reliability
- **Work Stealing**: Dynamic load balancing between agents
- **Circuit Breakers**: Fault tolerance and cascade failure prevention
- **Conflict Resolution**: Automated conflict detection and resolution
- **Dependency Management**: Task dependency tracking and execution ordering
- **Metrics & Monitoring**: Comprehensive performance tracking

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Task Scheduler │    │Resource Manager │    │ Message Router  │
│                 │    │                 │    │                 │
│• Agent Selection│    │• Lock Manager   │    │• Queue Manager  │
│• Priority Queue │    │• Deadlock Det.  │    │• Routing Logic  │
│• Dependencies   │    │• Timeouts       │    │• Reliability    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                       ┌─────────────────┐
                       │Coordination Mgr │
                       │                 │
                       │• Event Handling │
                       │• Health Monitor │
                       │• Maintenance    │
                       └─────────────────┘
```

## Core Components

### CoordinationManager

The main coordination orchestrator that manages all subsystems:

```typescript
import { CoordinationManager } from './coordination/index.ts';

const manager = new CoordinationManager(config, eventBus, logger);
await manager.initialize();

// Assign task to agent
await manager.assignTask(task, agentId);

// Acquire resource
await manager.acquireResource('file-lock', agentId);

// Send message
await manager.sendMessage('agent1', 'agent2', { type: 'status' });
```

### Advanced Task Scheduler

Intelligent agent selection with multiple strategies:

```typescript
import { AdvancedTaskScheduler, CapabilitySchedulingStrategy } from './coordination/index.ts';

const scheduler = new AdvancedTaskScheduler(config, eventBus, logger);

// Register custom strategy
scheduler.registerStrategy(new CapabilitySchedulingStrategy());
scheduler.setDefaultStrategy('capability');

// Register agents
scheduler.registerAgent(agentProfile);

// Tasks are automatically assigned to best agents
await scheduler.assignTask(task);
```

### Resource Manager

Distributed locking with deadlock detection:

```typescript
import { ResourceManager } from './coordination/index.ts';

const resourceManager = new ResourceManager(config, eventBus, logger);

try {
  // Acquire with timeout and priority
  await resourceManager.acquire('database-lock', agentId, priority);

  // Critical section
  await performDatabaseOperation();
} finally {
  await resourceManager.release('database-lock', agentId);
}
```

### Work Stealing Coordinator

Dynamic load balancing:

```typescript
import { WorkStealingCoordinator } from './coordination/index.ts';

const workStealing = new WorkStealingCoordinator(
  {
    enabled: true,
    stealThreshold: 3, // Trigger when difference > 3 tasks
    maxStealBatch: 2, // Steal up to 2 tasks at once
    stealInterval: 5000, // Check every 5 seconds
  },
  eventBus,
  logger,
);

// Update agent workload
workStealing.updateAgentWorkload(agentId, {
  taskCount: 5,
  avgTaskDuration: 2000,
  cpuUsage: 70,
  memoryUsage: 80,
});

// Find best agent for task
const bestAgent = workStealing.findBestAgent(task, availableAgents);
```

### Dependency Graph

Task dependency management:

```typescript
import { DependencyGraph } from './coordination/index.ts';

const graph = new DependencyGraph(logger);

// Add tasks with dependencies
graph.addTask(task1); // No dependencies
graph.addTask(task2); // Depends on task1
graph.addTask(task3); // Depends on task2

// Get ready tasks
const readyTasks = graph.getReadyTasks(); // [task1]

// Mark completion and get newly ready tasks
const newlyReady = graph.markCompleted('task1'); // [task2]

// Check for cycles
const cycles = graph.detectCycles();

// Get topological order
const order = graph.topologicalSort();
```

### Circuit Breaker

Fault tolerance and cascade failure prevention:

```typescript
import { CircuitBreaker, CircuitState } from './coordination/index.ts';

const breaker = new CircuitBreaker(
  'external-api',
  {
    failureThreshold: 5, // Open after 5 failures
    successThreshold: 3, // Close after 3 successes in half-open
    timeout: 60000, // Try half-open after 60s
    halfOpenLimit: 2, // Max 2 requests in half-open
  },
  logger,
  eventBus,
);

// Execute with protection
try {
  const result = await breaker.execute(async () => {
    return await callExternalAPI();
  });
} catch (error) {
  if (breaker.getState() === CircuitState.OPEN) {
    // Circuit is open, use fallback
    return fallbackResponse();
  }
  throw error;
}
```

### Conflict Resolution

Automated conflict detection and resolution:

```typescript
import { ConflictResolver, PriorityResolutionStrategy } from './coordination/index.ts';

const resolver = new ConflictResolver(logger, eventBus);

// Register custom strategy
resolver.registerStrategy(new PriorityResolutionStrategy());

// Report conflict
const conflict = await resolver.reportResourceConflict('shared-file', [
  'agent1',
  'agent2',
  'agent3',
]);

// Resolve using priority strategy
const resolution = await resolver.resolveConflict(conflict.id, 'priority', {
  agentPriorities: new Map([
    ['agent1', 10],
    ['agent2', 5],
  ]),
});

console.log(`Winner: ${resolution.winner}`); // agent1 (higher priority)
```

### Metrics Collection

Comprehensive performance monitoring:

```typescript
import { CoordinationMetricsCollector } from './coordination/index.ts';

const metrics = new CoordinationMetricsCollector(logger, eventBus);
metrics.start();

// Get current metrics
const current = metrics.getCurrentMetrics();
console.log({
  activeTasks: current.taskMetrics.activeTasks,
  taskThroughput: current.taskMetrics.taskThroughput,
  agentUtilization: current.agentMetrics.agentUtilization,
  resourceUtilization: current.resourceMetrics.resourceUtilization,
});

// Get metric history
const history = metrics.getMetricHistory(
  'task.completed',
  new Date(Date.now() - 3600000), // Last hour
);
```

## Configuration

```typescript
interface CoordinationConfig {
  maxRetries: number; // Task retry attempts
  retryDelay: number; // Base retry delay (ms)
  deadlockDetection: boolean; // Enable deadlock detection
  resourceTimeout: number; // Resource acquisition timeout (ms)
  messageTimeout: number; // Message delivery timeout (ms)
}

const config: CoordinationConfig = {
  maxRetries: 3,
  retryDelay: 1000,
  deadlockDetection: true,
  resourceTimeout: 30000,
  messageTimeout: 10000,
};
```

## Event System

The coordination system emits various events for monitoring and integration:

```typescript
// Task events
eventBus.on(SystemEvents.TASK_ASSIGNED, ({ taskId, agentId }) => {
  console.log(`Task ${taskId} assigned to ${agentId}`);
});

// Resource events
eventBus.on(SystemEvents.RESOURCE_ACQUIRED, ({ resourceId, agentId }) => {
  console.log(`Resource ${resourceId} locked by ${agentId}`);
});

// Deadlock events
eventBus.on(SystemEvents.DEADLOCK_DETECTED, ({ agents, resources }) => {
  console.log(`Deadlock detected: agents=${agents}, resources=${resources}`);
});

// Work stealing events
eventBus.on('workstealing:request', ({ sourceAgent, targetAgent, taskCount }) => {
  console.log(`Work stealing: ${taskCount} tasks from ${sourceAgent} to ${targetAgent}`);
});

// Conflict events
eventBus.on('conflict:resolved', ({ conflict, resolution }) => {
  console.log(`Conflict resolved: ${resolution.winner} won using ${resolution.type}`);
});

// Circuit breaker events
eventBus.on('circuitbreaker:state-change', ({ name, from, to }) => {
  console.log(`Circuit breaker ${name}: ${from} -> ${to}`);
});
```

## Best Practices

### Task Design

- Keep tasks small and focused
- Minimize dependencies between tasks
- Use appropriate priority levels
- Include timeout information

### Resource Management

- Always use try/finally for resource cleanup
- Set appropriate timeouts
- Use meaningful resource IDs
- Avoid holding multiple resources simultaneously when possible

### Agent Registration

- Register agents with accurate capability information
- Update workload metrics regularly
- Handle agent failures gracefully
- Implement proper cleanup on termination

### Error Handling

- Use circuit breakers for external dependencies
- Implement proper retry logic
- Log errors with sufficient context
- Gracefully degrade functionality when possible

### Monitoring

- Monitor key metrics regularly
- Set up alerting for deadlocks and conflicts
- Track agent utilization and task throughput
- Review conflict resolution patterns

## Testing

The coordination system includes comprehensive unit tests:

```bash
# Run coordination tests
deno test tests/unit/coordination/

# Run specific test file
deno test tests/unit/coordination/coordination.test.ts

# Run with coverage
deno test --coverage=coverage tests/unit/coordination/
```

## Performance Characteristics

### Scalability

- **Agents**: Supports 100+ concurrent agents
- **Tasks**: Handles 1000+ tasks in queue
- **Resources**: Manages 500+ shared resources
- **Messages**: Processes 10,000+ messages/minute

### Latency

- **Task Assignment**: < 10ms (99th percentile)
- **Resource Acquisition**: < 50ms (99th percentile)
- **Message Delivery**: < 5ms (99th percentile)
- **Conflict Resolution**: < 100ms (99th percentile)

### Reliability

- **Deadlock Detection**: Sub-second detection
- **Circuit Breaker**: Configurable failure thresholds
- **Retry Logic**: Exponential backoff with jitter
- **Health Monitoring**: Continuous component health checks

## Troubleshooting

### Common Issues

1. **Deadlocks**
   - Enable deadlock detection
   - Reduce resource holding time
   - Use consistent resource ordering

2. **Performance Issues**
   - Enable work stealing
   - Monitor agent utilization
   - Optimize task granularity

3. **Resource Contention**
   - Increase resource timeout
   - Implement priority queuing
   - Use optimistic locking where possible

4. **Message Delays**
   - Check network connectivity
   - Monitor message queue sizes
   - Adjust timeout settings

### Debug Mode

Enable debug logging for detailed coordination information:

```typescript
const logger = new Logger({ level: 'debug' });
const eventBus = new EventBus(true); // Enable debug mode

// This will log all coordination events and state changes
```

## Future Enhancements

- **Distributed Coordination**: Support for multi-node coordination
- **Persistent State**: Task and resource state persistence
- **Advanced Scheduling**: ML-based agent selection
- **Custom Protocols**: Pluggable coordination protocols
- **Visual Monitoring**: Web-based coordination dashboard
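
## Appendix: Retry Backoff Sketch

The Reliability section lists retry logic as "exponential backoff with jitter", and the configuration exposes `maxRetries` and `retryDelay`, but this document does not show how the two combine. The following is a minimal sketch of one common way to do it, not the Claude-Flow implementation: `RetryConfig`, `backoffDelay`, `withRetry`, the `maxDelay` cap, and the 0.5 to 1 jitter factor are all assumptions made for illustration.

```typescript
// Hypothetical helper types and functions; none of these names come from
// the Claude-Flow API.
interface RetryConfig {
  maxRetries: number; // same meaning as CoordinationConfig.maxRetries
  retryDelay: number; // base delay in ms, as in CoordinationConfig.retryDelay
}

// Delay before retry number `attempt` (0-based). The base delay doubles on
// each attempt, is capped at `maxDelay`, and is then scaled by a random
// factor in [0.5, 1) so agents that fail together do not retry together.
function backoffDelay(
  attempt: number,
  baseDelay: number,
  maxDelay = 30_000,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(maxDelay, baseDelay * 2 ** attempt);
  return Math.floor(exponential * (0.5 + random() / 2));
}

// Runs `operation`, retrying up to `maxRetries` times after a failure and
// rethrowing the last error once the retries are exhausted.
async function withRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < config.maxRetries) {
        await sleep(backoffDelay(attempt, config.retryDelay));
      }
    }
  }
  throw lastError;
}
```

Under these assumptions, `maxRetries: 3` and `retryDelay: 1000` give at most four attempts, with waits drawn from roughly 500 to 1000 ms, 1000 to 2000 ms, and 2000 to 4000 ms between them. The `random` and `sleep` parameters exist only so the timing can be made deterministic in tests.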