aiwg

Version:

Deployment tool and support utility for AI context. Copies agents, skills, commands, rules, and behaviors into the paths each AI platform reads (Claude Code, Codex, Copilot, Cursor, Warp, OpenClaw, and 6 more) so one source of truth works across 10 platfo

aiwg.io

jmagly/aiwg

804 lines (639 loc) • 23 kB

Markdown

# RLM Agent Supervisor Integration ## Overview The RLM (Recursive Language Models) addon integrates with AIWG's Agent Supervisor to enable recursive sub-agent decomposition for processing arbitrarily large codebases. Instead of spawning raw tmux sessions, RLM sub-agents are managed through the supervisor's task queue, enabling consistent lifecycle management, concurrency control, and progress tracking. **Core principle**: Every RLM sub-call is a supervisor task, not a raw subprocess. This provides unified monitoring, cost tracking, and error recovery across the entire recursion tree. ## Architecture Overview ### RLM → Supervisor Mapping | RLM Pattern | Supervisor Implementation | |-------------|---------------------------| | `rlm-query` single file | `supervisor.submit(prompt, {agent: 'rlm-agent', metadata: {depth, context_file}})` | | `rlm-batch` parallel fan-out | Multiple `supervisor.submit()` calls with shared `batch_id` metadata | | Recursive sub-calls | `supervisor.submit()` with incremented `depth` in metadata | | Depth tracking | `metadata.depth` field, max 3 by default (from manifest) | | Result aggregation | Poll completed tasks by `batch_id`, aggregate outputs | ### Supervisor Role The Agent Supervisor (`tools/daemon/agent-supervisor.mjs`) provides: 1. **Task Queue Management**: Queue RLM sub-calls, process up to `maxConcurrency` in parallel 2. **Lifecycle Tracking**: Track states (queued → running → completed/failed) 3. **Output Streaming**: Real-time stdout/stderr from sub-agents via events 4. **Timeout Enforcement**: Kill sub-agents that exceed `taskTimeout` (default 2 hours) 5. **Graceful Shutdown**: Cancel queued tasks, wait for running tasks or force-kill on timeout 6. **Event Integration**: Emit `task:queued`, `task:started`, `task:completed`, `task:failed` events for hub integration ### Integration Points ``` RLM Agent ↓ ↓ submit() with depth metadata ↓ Agent Supervisor ↓ ├─→ Task Queue (priority-sorted) ├─→ Running Pool (≤ maxConcurrency) ├─→ Event Emitter (task lifecycle events) └─→ Task Store (persistent state) ↓ ↓ events ↓ Messaging Hub (Telegram/Discord/REPL) ``` ## Lifecycle Management ### Task Submission (rlm-query) When `rlm-query <file> <prompt>` is invoked: ```javascript // Internal RLM agent logic (conceptual) const task = supervisor.submit( // Prompt includes context file and sub-prompt `Context: ${readFileSync(contextFile)}\n\nTask: ${subPrompt}`, { agent: 'rlm-agent', // Target agent priority: 5 - depth, // Deeper calls = lower priority metadata: { type: 'rlm-query', depth: currentDepth + 1, context_file: contextFile, parent_task_id: currentTaskId, max_depth: 3 // From manifest default } } ); // Wait for completion await waitForTaskCompletion(task.id); const result = taskStore.getTask(task.id).result; ``` **Key behaviors**: - **Not raw tmux**: Uses supervisor's managed `spawn()` with proper process tracking - **Priority inversion**: Deeper recursion gets lower priority to prevent queue starvation - **Depth enforcement**: If `depth >= max_depth`, reject submission immediately ### Task Submission (rlm-batch) When `rlm-batch <pattern> <prompt>` is invoked: ```javascript const files = glob(pattern); const batchId = `batch-${Date.now()}`; const tasks = []; // Submit all tasks in parallel (up to maxConcurrency) for (const file of files) { const task = supervisor.submit( `Context: ${readFileSync(file)}\n\nTask: ${subPrompt}`, { agent: 'rlm-agent', priority: 5 - depth, // Same depth-based priority metadata: { type: 'rlm-batch', batch_id: batchId, depth: currentDepth + 1, context_file: file, total_in_batch: files.length } } ); tasks.push(task); } // Wait for all to complete await Promise.all(tasks.map(t => waitForTaskCompletion(t.id))); // Aggregate results const results = tasks.map(t => taskStore.getTask(t.id).result); const aggregated = aggregateByStrategy(results, aggregateStrategy); ``` **Key behaviors**: - **Parallel spawning**: All tasks submitted immediately, supervisor controls concurrency - **Batch tracking**: Shared `batch_id` in metadata for group operations - **Aggregate after completion**: Poll completed tasks, merge results according to strategy ### Sub-Agent Spawning (Recursive Calls) When a sub-agent at depth N spawns its own sub-agent (depth N+1): ```javascript // Inside depth-N sub-agent if (needsRecursion && currentDepth < maxDepth) { const childTask = supervisor.submit( // Child prompt childPrompt, { agent: 'rlm-agent', priority: 5 - (currentDepth + 1), // Lower priority metadata: { type: 'rlm-recursive', depth: currentDepth + 1, parent_task_id: myTaskId, max_depth: maxDepth } } ); // Depth N sub-agent waits for depth N+1 to complete await waitForTaskCompletion(childTask.id); const childResult = taskStore.getTask(childTask.id).result; } else if (currentDepth >= maxDepth) { throw new Error(`Max recursion depth ${maxDepth} exceeded`); } ``` **Depth limit enforcement**: ```javascript // In supervisor.submit() or pre-submit validation if (options.metadata?.depth >= MAX_DEPTH) { throw new Error( `Recursion depth limit exceeded: ${options.metadata.depth} >= ${MAX_DEPTH}` ); } ``` ### Graceful Shutdown When daemon shuts down with active RLM tasks: ```javascript // supervisor.shutdown() behavior await supervisor.shutdown(timeoutMs = 30000); // 1. Reject all queued tasks (including RLM sub-calls) while (queue.length > 0) { const task = queue.shift(); taskStore.cancelTask(task.id); emit('task:cancelled', {taskId: task.id, reason: 'shutdown'}); } // 2. Wait for running tasks (including depth-N sub-agents) to complete // 3. If timeout exceeded, SIGKILL all running processes // 4. All RLM recursion trees are terminated consistently ``` **Impact on RLM**: - Queued sub-calls are cancelled (partial tree completion) - Running sub-calls are given 30s to complete gracefully - If timeout, entire tree is killed (no orphan processes) - Task store preserves partial state for later inspection ## Concurrency Control ### maxConcurrency Limits Parallel Sub-Agents ```javascript // Supervisor config const supervisor = new AgentSupervisor({ maxConcurrency: 10, // Max 10 sub-agents running simultaneously taskTimeout: 120 * 60 * 1000 // 2 hours per sub-agent }); ``` **Behavior**: - If 10 RLM sub-agents are running, 11th waits in queue - Queue is priority-sorted (deeper calls = lower priority) - As sub-agents complete, queue drains automatically - Prevents system overload from deep/wide recursion trees ### Queue Overflow Handling When `rlm-batch` spawns 100 tasks but `maxConcurrency: 10`: ``` Iteration 0: Submit all 100 tasks → queue has 100 items Iteration 1: Spawn 10 (up to maxConcurrency) → queue: 90, running: 10 Iteration 2: As tasks complete, spawn more → queue: 85, running: 10 ... Iteration 10: All 100 processed → queue: 0, running: 0 ``` **No manual batching needed**: Supervisor handles queue automatically. ### Recommended Concurrency by Task Type | Task Type | Recommended maxConcurrency | Rationale | |-----------|----------------------------|-----------| | **Single-file rlm-query** | 3-5 | Low parallelism, sequential by nature | | **rlm-batch (10-50 files)** | 10-20 | Balance throughput and memory | | **rlm-batch (100+ files)** | 20-50 | High throughput, watch memory (each ~100MB) | | **Deep recursion (depth 3)** | 5-10 | Conservative to avoid cascade failures | | **Mixed workload** | 10 | Default balance for typical daemon use | **Memory considerations**: - Each sub-agent: ~100MB RAM - 50 parallel sub-agents: ~5GB RAM - Adjust `maxConcurrency` based on available system memory - Monitor with `supervisor.getStatus()` for runningCount ## Event Integration ### Task Lifecycle Events All RLM sub-calls emit standard supervisor events: | Event | When | Payload | |-------|------|---------| | `task:queued` | Task added to queue | `{taskId, prompt, queueSize}` | | `task:started` | Task begins execution | `{taskId, pid}` | | `task:output` | Sub-agent produces output | `{taskId, chunk, stream}` | | `task:completed` | Task succeeds | `{taskId, result, duration}` | | `task:failed` | Task fails | `{taskId, error, exitCode}` | | `task:timeout` | Task exceeds taskTimeout | `{taskId}` | | `task:cancelled` | Task cancelled (shutdown or manual) | `{taskId, signal}` | ### RLM-Specific Event Metadata Enhance events with RLM context: ```javascript // On task:started for RLM sub-agent supervisor.on('task:started', ({taskId, pid}) => { const task = taskStore.getTask(taskId); if (task.metadata?.type?.startsWith('rlm-')) { console.log(`RLM sub-agent started: depth ${task.metadata.depth}, file ${task.metadata.context_file}`); } }); // On task:completed, check if part of batch supervisor.on('task:completed', ({taskId, result}) => { const task = taskStore.getTask(taskId); if (task.metadata?.batch_id) { // Check if all batch tasks complete const batchTasks = taskStore.getTasksByBatchId(task.metadata.batch_id); const allComplete = batchTasks.every(t => t.state === 'completed'); if (allComplete) { emit('rlm:batch:completed', {batchId: task.metadata.batch_id}); } } }); ``` ### Progress Reporting Chain ``` Sub-agent (depth 2) ↓ stdout ↓ Supervisor ↓ task:output event ↓ Messaging Hub ↓ Telegram/Discord/REPL (User sees: "RLM sub-agent [depth 2/3] processing src/auth/login.ts...") ``` **Implementation**: ```javascript // In messaging hub supervisor.on('task:output', ({taskId, chunk, stream}) => { const task = taskStore.getTask(taskId); if (task.metadata?.type?.startsWith('rlm-')) { // Format for user const depthLabel = `[depth ${task.metadata.depth}/${task.metadata.max_depth}]`; messagingHub.send(`RLM ${depthLabel}: ${chunk}`); } }); ``` ## Depth Tracking ### Metadata Propagation Depth is tracked in task metadata, incremented on each spawn: ```javascript // Root RLM call (depth 0) const rootTask = supervisor.submit(prompt, { metadata: {depth: 0, max_depth: 3} }); // Depth 1 sub-call (spawned by root) const childTask = supervisor.submit(childPrompt, { metadata: {depth: 1, max_depth: 3, parent_task_id: rootTask.id} }); // Depth 2 sub-call (spawned by depth 1) const grandchildTask = supervisor.submit(grandchildPrompt, { metadata: {depth: 2, max_depth: 3, parent_task_id: childTask.id} }); // Depth 3 is MAX, cannot spawn further // Attempting depth 4 throws error ``` ### Enforcement of maxDepth **Default max depth**: 3 (from `agentic/code/addons/rlm/manifest.json`) ```json { "config": { "max_depth": 3, "max_sub_calls": 20 } } ``` **Enforcement points**: 1. **Pre-submit validation** (before `supervisor.submit()`): ```javascript if (metadata.depth >= maxDepth) { throw new Error(`Max depth ${maxDepth} exceeded`); } ``` 2. **Task creation** (in supervisor): ```javascript if (options.metadata?.depth >= MAX_DEPTH) { emit('task:failed', {taskId, error: 'Max depth exceeded'}); return null; } ``` 3. **Agent runtime** (inside RLM agent): ```javascript if (this.currentDepth >= this.maxDepth) { return 'MAX_DEPTH_REACHED'; // Don't spawn more sub-calls } ``` ### Recursion Tree Tracking Task metadata forms a tree structure: ```javascript // Task tree example { "task-root-001": { depth: 0, parent_task_id: null, children: ["task-child-001", "task-child-002"] }, "task-child-001": { depth: 1, parent_task_id: "task-root-001", children: ["task-grandchild-001"] }, "task-grandchild-001": { depth: 2, parent_task_id: "task-child-001", children: [] // Depth 3 cannot spawn children } } ``` **Visualization** (via task store query): ``` Root (depth 0): "Analyze codebase security" ├── Child 1 (depth 1): "Check src/auth/ for vulnerabilities" │ └── Grandchild 1 (depth 2): "Analyze src/auth/login.ts" ├── Child 2 (depth 1): "Check src/api/ for vulnerabilities" └── Grandchild 2 (depth 2): "Analyze src/api/users.ts" ``` ## Error Recovery ### Sub-Agent Crash Handling When a sub-agent crashes unexpectedly: ```javascript // Supervisor detects process exit with non-zero code proc.on('exit', (code, signal) => { if (code !== 0 && signal !== 'SIGTERM') { // Sub-agent crashed taskStore.failTask(task.id, `Process exited with code ${code}`); emit('task:failed', {taskId: task.id, exitCode: code}); // If part of batch, mark batch as partial failure if (task.metadata?.batch_id) { batchStore.recordFailure(task.metadata.batch_id, task.id); } } }); ``` **Impact on RLM**: - Failed sub-call does not crash parent - Parent receives `null` or error result - Partial tree completion (some branches succeed, some fail) - Final report documents which branches failed ### Partial Tree Completion When some sub-calls succeed and some fail: ```javascript // rlm-batch aggregation with partial failures const tasks = await Promise.allSettled( taskIds.map(id => waitForTaskCompletion(id)) ); const successful = tasks.filter(t => t.status === 'fulfilled'); const failed = tasks.filter(t => t.status === 'rejected'); if (failed.length > 0) { console.warn(`Batch partially failed: ${failed.length}/${tasks.length} tasks failed`); } // Aggregate only successful results const results = successful.map(t => taskStore.getTask(t.value).result); return aggregateByStrategy(results, strategy); ``` **User notification**: ``` RLM Batch: PARTIAL SUCCESS Pattern: src/**/*.ts Processed: 87/100 files Failed: 13 files (see report for details) Aggregated results based on 87 successful analyses. Failed files: - src/auth/legacy.ts (timeout) - src/utils/deprecated.ts (parse error) ... ``` ### Retry Strategy for Failed Sub-Calls **Option 1: Automatic retry** (at supervisor level): ```javascript const supervisor = new AgentSupervisor({ retryPolicy: { maxRetries: 1, // Retry failed tasks once retryDelay: 5000, // Wait 5s before retry retryableErrors: [ // Only retry specific errors 'ETIMEDOUT', 'ECONNRESET' ] } }); ``` **Option 2: Manual retry** (at RLM level): ```javascript // Inside RLM agent for (let attempt = 0; attempt < 3; attempt++) { try { const task = supervisor.submit(prompt, options); const result = await waitForTaskCompletion(task.id); return result; // Success } catch (error) { if (attempt === 2) throw error; // Max retries exceeded await sleep(1000 * (attempt + 1)); // Exponential backoff } } ``` **Recommended**: Option 2 (RLM-level retry) for more control over recursive retry logic. ## Configuration ### Supervisor Options Relevant to RLM ```javascript const supervisor = new AgentSupervisor({ // Core settings maxConcurrency: 10, // Max parallel sub-agents (default: 3) taskTimeout: 120 * 60 * 1000, // 2 hours per sub-agent (default: 2 hours) agentCommand: 'claude', // Command to spawn agents agentArgs: [], // Additional args for all agents // Task store for persistent state taskStore: new TaskStore({ path: '.aiwg/daemon/tasks.db' }), // Optional: Retry policy retryPolicy: { maxRetries: 1, retryDelay: 5000 } }); ``` ### Configuration for Different Workload Profiles #### Profile 1: Single-file RLM queries (low concurrency) ```javascript { maxConcurrency: 3, taskTimeout: 300000, // 5 minutes description: "Conservative for sequential deep recursion" } ``` #### Profile 2: Batch processing (high throughput) ```javascript { maxConcurrency: 20, taskTimeout: 600000, // 10 minutes description: "High throughput for parallel file processing" } ``` #### Profile 3: Deep recursion trees (balanced) ```javascript { maxConcurrency: 10, taskTimeout: 1200000, // 20 minutes description: "Balanced for depth-3 recursion with moderate parallelism" } ``` #### Profile 4: Mixed workload (default) ```javascript { maxConcurrency: 10, taskTimeout: 7200000, // 2 hours description: "Default balanced profile" } ``` ### RLM Addon Configuration From `agentic/code/addons/rlm/manifest.json`: ```json { "config": { "max_depth": 3, "max_sub_calls": 20, "sub_model": "sonnet", "parallel_sub_calls": true, "timeout_per_subcall": 300, "supervisor": { "default_max_concurrency": 10, "default_task_timeout": 7200000 } } } ``` ## Integration Examples ### Example 1: Simple rlm-query ```javascript // User invokes: /rlm-query src/auth/login.ts "extract function names" // RLM agent internally: const contextFile = 'src/auth/login.ts'; const subPrompt = 'extract function names'; const context = readFileSync(contextFile); const task = supervisor.submit( `Context:\n${context}\n\nTask: ${subPrompt}`, { agent: 'rlm-agent', priority: 5, // Depth 0 (high priority) metadata: { type: 'rlm-query', depth: 0, context_file: contextFile, max_depth: 3 } } ); // Wait for completion await waitForTaskCompletion(task.id); const result = taskStore.getTask(task.id).result; // Return to user console.log(`Extracted functions: ${result.output}`); ``` **Supervisor behavior**: - Task queued immediately - Spawned when `runningCount < maxConcurrency` - Output streamed via `task:output` events - Completed task marked in task store ### Example 2: rlm-batch with partial failures ```javascript // User invokes: /rlm-batch "src/**/*.ts" "count functions" const files = glob('src/**/*.ts'); // 100 files const batchId = `batch-${Date.now()}`; const tasks = []; // Submit all 100 tasks for (const file of files) { const task = supervisor.submit( `Context:\n${readFileSync(file)}\n\nTask: count functions`, { agent: 'rlm-agent', priority: 5, metadata: { type: 'rlm-batch', batch_id: batchId, depth: 0, context_file: file, total_in_batch: 100 } } ); tasks.push(task); } // Wait for all (up to maxConcurrency run in parallel) const results = await Promise.allSettled( tasks.map(t => waitForTaskCompletion(t.id)) ); // Handle partial failures const successful = results.filter(r => r.status === 'fulfilled'); const failed = results.filter(r => r.status === 'rejected'); if (failed.length > 0) { console.warn(`${failed.length} files failed`); } // Aggregate successful results const counts = successful.map(r => taskStore.getTask(r.value).result.output ); const totalFunctions = counts.reduce((sum, c) => sum + parseInt(c), 0); console.log(`Total functions: ${totalFunctions} (from ${successful.length}/${files.length} files)`); ``` **Supervisor behavior**: - All 100 tasks queued immediately - 10 run in parallel (if `maxConcurrency: 10`) - As each completes, next is spawned - Failed tasks don't block others - Aggregation happens after all complete/fail ### Example 3: Recursive sub-call (depth 2) ```javascript // Root task (depth 0): "Analyze security" const rootTask = supervisor.submit( "Analyze security across entire codebase", {metadata: {depth: 0, max_depth: 3}} ); // Root agent decides to delegate to module-level analysis (depth 1) const authModuleTask = supervisor.submit( "Analyze security in src/auth/ module", {metadata: {depth: 1, max_depth: 3, parent_task_id: rootTask.id}} ); // Depth-1 agent decides to analyze specific file (depth 2) const loginFileTask = supervisor.submit( "Analyze security in src/auth/login.ts", {metadata: {depth: 2, max_depth: 3, parent_task_id: authModuleTask.id}} ); // Depth-2 agent completes (cannot spawn depth 3 sub-calls) await waitForTaskCompletion(loginFileTask.id); // Depth-1 agent collects depth-2 results and completes await waitForTaskCompletion(authModuleTask.id); // Root agent collects all results and synthesizes await waitForTaskCompletion(rootTask.id); ``` **Supervisor behavior**: - Priority inversions: depth 0 > depth 1 > depth 2 - If maxConcurrency exceeded, deeper tasks queue - Each depth waits for children before completing - Depth 2 cannot spawn depth 3 (max_depth enforcement) ## Success Criteria RLM-Supervisor integration is successful when: - [ ] All RLM sub-calls managed via supervisor (no raw tmux) - [ ] Depth tracking accurate across entire recursion tree - [ ] maxDepth enforced (no depth 4+ sub-calls) - [ ] maxConcurrency respected (never exceeds limit) - [ ] Partial tree completion handled (failures don't crash parents) - [ ] Progress events reach messaging hub - [ ] Graceful shutdown terminates all sub-agents - [ ] Cost tracking accurate (sum all sub-call costs) - [ ] Task store preserves full recursion tree for inspection ## Troubleshooting ### Issue: Queue overflow (100+ queued tasks) **Symptom**: `supervisor.getStatus()` shows large queuedCount, slow throughput **Diagnosis**: `maxConcurrency` too low for workload **Solution**: Increase `maxConcurrency` (e.g., 10 → 20) ### Issue: Sub-agent timeouts **Symptom**: Many `task:timeout` events, sub-agents killed **Diagnosis**: `taskTimeout` too short for complex tasks **Solution**: Increase `taskTimeout` (e.g., 2 hours → 4 hours) ### Issue: Excessive memory usage **Symptom**: System memory approaching limit **Diagnosis**: Too many parallel sub-agents (each ~100MB) **Solution**: Decrease `maxConcurrency` (e.g., 50 → 20) ### Issue: Depth limit not enforced **Symptom**: Depth 4+ sub-calls observed **Diagnosis**: Missing depth validation in submit path **Solution**: Add pre-submit check: ```javascript if (metadata.depth >= maxDepth) { throw new Error(`Max depth ${maxDepth} exceeded`); } ``` ### Issue: Partial batch never completes **Symptom**: Some batch tasks stuck in "running" state forever **Diagnosis**: Sub-agent crashed without emitting exit event **Solution**: Add timeout handling per task, force-kill on timeout ## References - @$AIWG_ROOT/tools/daemon/agent-supervisor.mjs - Agent Supervisor implementation - @$AIWG_ROOT/agentic/code/addons/rlm/agents/rlm-agent.md - RLM agent definition - @$AIWG_ROOT/agentic/code/addons/rlm/commands/rlm-query.md - Single sub-call command - @$AIWG_ROOT/agentic/code/addons/rlm/commands/rlm-batch.md - Batch parallel command - @$AIWG_ROOT/agentic/code/addons/rlm/schemas/rlm-task-tree.yaml - Task tree schema - @$AIWG_ROOT/tools/daemon/task-store.mjs - Persistent task state - @.aiwg/research/findings/REF-089-recursive-language-models.md - RLM research - @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md - Subagent context minimization - Issue #323 - Supervisor integration implementation --- **Document Status**: COMPLETE **Last Updated**: 2026-02-09 **Related Epic**: Issue #321 (AIWG RLM Addon)