claude-flow-tbowman01
Version:
Enterprise-grade AI agent orchestration with ruv-swarm integration (Alpha Release)
424 lines (313 loc) • 12.2 kB
Markdown
A comprehensive, scalable, and fault-tolerant coordination system for multi-agent task execution and resource management.
The coordination system provides:
- **Task Scheduling**: Intelligent agent selection and priority handling
- **Resource Management**: Distributed locking with deadlock detection
- **Message Passing**: Inter-agent communication with reliability
- **Work Stealing**: Dynamic load balancing between agents
- **Circuit Breakers**: Fault tolerance and cascade failure prevention
- **Conflict Resolution**: Automated conflict detection and resolution
- **Dependency Management**: Task dependency tracking and execution ordering
- **Metrics & Monitoring**: Comprehensive performance tracking
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Task Scheduler │ │Resource Manager │ │Message Router │
│ │ │ │ │ │
│ • Agent Selection│ │ • Lock Manager │ │ • Queue Manager │
│ • Priority Queue│ │ • Deadlock Det. │ │ • Routing Logic │
│ • Dependencies │ │ • Timeouts │ │ • Reliability │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────┐
│Coordination Mgr │
│ │
│ • Event Handling│
│ • Health Monitor│
│ • Maintenance │
└─────────────────┘
```
The main coordination orchestrator that manages all subsystems:
```typescript
import { CoordinationManager } from './coordination/index.ts';
const manager = new CoordinationManager(config, eventBus, logger);
await manager.initialize();
// Assign task to agent
await manager.assignTask(task, agentId);
// Acquire resource
await manager.acquireResource('file-lock', agentId);
// Send message
await manager.sendMessage('agent1', 'agent2', { type: 'status' });
```
Intelligent agent selection with multiple strategies:
```typescript
import { AdvancedTaskScheduler, CapabilitySchedulingStrategy } from './coordination/index.ts';
const scheduler = new AdvancedTaskScheduler(config, eventBus, logger);
// Register custom strategy
scheduler.registerStrategy(new CapabilitySchedulingStrategy());
scheduler.setDefaultStrategy('capability');
// Register agents
scheduler.registerAgent(agentProfile);
// Tasks are automatically assigned to best agents
await scheduler.assignTask(task);
```
Distributed locking with deadlock detection:
```typescript
import { ResourceManager } from './coordination/index.ts';
const resourceManager = new ResourceManager(config, eventBus, logger);
try {
// Acquire with timeout and priority
await resourceManager.acquire('database-lock', agentId, priority);
// Critical section
await performDatabaseOperation();
} finally {
await resourceManager.release('database-lock', agentId);
}
```
Dynamic load balancing:
```typescript
import { WorkStealingCoordinator } from './coordination/index.ts';
const workStealing = new WorkStealingCoordinator(
{
enabled: true,
stealThreshold: 3, // Trigger when difference > 3 tasks
maxStealBatch: 2, // Steal up to 2 tasks at once
stealInterval: 5000, // Check every 5 seconds
},
eventBus,
logger,
);
// Update agent workload
workStealing.updateAgentWorkload(agentId, {
taskCount: 5,
avgTaskDuration: 2000,
cpuUsage: 70,
memoryUsage: 80,
});
// Find best agent for task
const bestAgent = workStealing.findBestAgent(task, availableAgents);
```
Task dependency management:
```typescript
import { DependencyGraph } from './coordination/index.ts';
const graph = new DependencyGraph(logger);
// Add tasks with dependencies
graph.addTask(task1); // No dependencies
graph.addTask(task2); // Depends on task1
graph.addTask(task3); // Depends on task2
// Get ready tasks
const readyTasks = graph.getReadyTasks(); // [task1]
// Mark completion and get newly ready tasks
const newlyReady = graph.markCompleted('task1'); // [task2]
// Check for cycles
const cycles = graph.detectCycles();
// Get topological order
const order = graph.topologicalSort();
```
Fault tolerance and cascade failure prevention:
```typescript
import { CircuitBreaker, CircuitState } from './coordination/index.ts';
const breaker = new CircuitBreaker(
'external-api',
{
failureThreshold: 5, // Open after 5 failures
successThreshold: 3, // Close after 3 successes in half-open
timeout: 60000, // Try half-open after 60s
halfOpenLimit: 2, // Max 2 requests in half-open
},
logger,
eventBus,
);
// Execute with protection
try {
const result = await breaker.execute(async () => {
return await callExternalAPI();
});
} catch (error) {
if (breaker.getState() === CircuitState.OPEN) {
// Circuit is open, use fallback
return fallbackResponse();
}
throw error;
}
```
Automated conflict detection and resolution:
```typescript
import { ConflictResolver, PriorityResolutionStrategy } from './coordination/index.ts';
const resolver = new ConflictResolver(logger, eventBus);
// Register custom strategy
resolver.registerStrategy(new PriorityResolutionStrategy());
// Report conflict
const conflict = await resolver.reportResourceConflict('shared-file', [
'agent1',
'agent2',
'agent3',
]);
// Resolve using priority strategy
const resolution = await resolver.resolveConflict(conflict.id, 'priority', {
agentPriorities: new Map([
['agent1', 10],
['agent2', 5],
]),
});
console.log(`Winner: ${resolution.winner}`); // agent1 (higher priority)
```
Comprehensive performance monitoring:
```typescript
import { CoordinationMetricsCollector } from './coordination/index.ts';
const metrics = new CoordinationMetricsCollector(logger, eventBus);
metrics.start();
// Get current metrics
const current = metrics.getCurrentMetrics();
console.log({
activeTasks: current.taskMetrics.activeTasks,
taskThroughput: current.taskMetrics.taskThroughput,
agentUtilization: current.agentMetrics.agentUtilization,
resourceUtilization: current.resourceMetrics.resourceUtilization,
});
// Get metric history
const history = metrics.getMetricHistory(
'task.completed',
new Date(Date.now() - 3600000), // Last hour
);
```
```typescript
interface CoordinationConfig {
maxRetries: number; // Task retry attempts
retryDelay: number; // Base retry delay (ms)
deadlockDetection: boolean; // Enable deadlock detection
resourceTimeout: number; // Resource acquisition timeout (ms)
messageTimeout: number; // Message delivery timeout (ms)
}
const config: CoordinationConfig = {
maxRetries: 3,
retryDelay: 1000,
deadlockDetection: true,
resourceTimeout: 30000,
messageTimeout: 10000,
};
```
The coordination system emits various events for monitoring and integration:
```typescript
// Task events
eventBus.on(SystemEvents.TASK_ASSIGNED, ({ taskId, agentId }) => {
console.log(`Task ${taskId} assigned to ${agentId}`);
});
// Resource events
eventBus.on(SystemEvents.RESOURCE_ACQUIRED, ({ resourceId, agentId }) => {
console.log(`Resource ${resourceId} locked by ${agentId}`);
});
// Deadlock events
eventBus.on(SystemEvents.DEADLOCK_DETECTED, ({ agents, resources }) => {
console.log(`Deadlock detected: agents=${agents}, resources=${resources}`);
});
// Work stealing events
eventBus.on('workstealing:request', ({ sourceAgent, targetAgent, taskCount }) => {
console.log(`Work stealing: ${taskCount} tasks from ${sourceAgent} to ${targetAgent}`);
});
// Conflict events
eventBus.on('conflict:resolved', ({ conflict, resolution }) => {
console.log(`Conflict resolved: ${resolution.winner} won using ${resolution.type}`);
});
// Circuit breaker events
eventBus.on('circuitbreaker:state-change', ({ name, from, to }) => {
console.log(`Circuit breaker ${name}: ${from} -> ${to}`);
});
```
- Keep tasks small and focused
- Minimize dependencies between tasks
- Use appropriate priority levels
- Include timeout information
- Always use try/finally for resource cleanup
- Set appropriate timeouts
- Use meaningful resource IDs
- Avoid holding multiple resources simultaneously when possible
- Register agents with accurate capability information
- Update workload metrics regularly
- Handle agent failures gracefully
- Implement proper cleanup on termination
- Use circuit breakers for external dependencies
- Implement proper retry logic
- Log errors with sufficient context
- Gracefully degrade functionality when possible
- Monitor key metrics regularly
- Set up alerting for deadlocks and conflicts
- Track agent utilization and task throughput
- Review conflict resolution patterns
The coordination system includes comprehensive unit tests:
```bash
deno test tests/unit/coordination/
deno test tests/unit/coordination/coordination.test.ts
deno test --coverage=coverage tests/unit/coordination/
```
- **Agents**: Supports 100+ concurrent agents
- **Tasks**: Handles 1000+ tasks in queue
- **Resources**: Manages 500+ shared resources
- **Messages**: Processes 10,000+ messages/minute
- **Task Assignment**: < 10ms (99th percentile)
- **Resource Acquisition**: < 50ms (99th percentile)
- **Message Delivery**: < 5ms (99th percentile)
- **Conflict Resolution**: < 100ms (99th percentile)
- **Deadlock Detection**: Sub-second detection
- **Circuit Breaker**: Configurable failure thresholds
- **Retry Logic**: Exponential backoff with jitter
- **Health Monitoring**: Continuous component health checks
1. **Deadlocks**
- Enable deadlock detection
- Reduce resource holding time
- Use consistent resource ordering
2. **Performance Issues**
- Enable work stealing
- Monitor agent utilization
- Optimize task granularity
3. **Resource Contention**
- Increase resource timeout
- Implement priority queuing
- Use optimistic locking where possible
4. **Message Delays**
- Check network connectivity
- Monitor message queue sizes
- Adjust timeout settings
Enable debug logging for detailed coordination information:
```typescript
const logger = new Logger({ level: 'debug' });
const eventBus = new EventBus(true); // Enable debug mode
// This will log all coordination events and state changes
```
- **Distributed Coordination**: Support for multi-node coordination
- **Persistent State**: Task and resource state persistence
- **Advanced Scheduling**: ML-based agent selection
- **Custom Protocols**: Pluggable coordination protocols
- **Visual Monitoring**: Web-based coordination dashboard