# RLM Agent Supervisor Integration
## Overview
The RLM (Recursive Language Models) addon integrates with AIWG's Agent Supervisor to enable recursive sub-agent decomposition for processing arbitrarily large codebases. Instead of spawning raw tmux sessions, RLM sub-agents are managed through the supervisor's task queue, enabling consistent lifecycle management, concurrency control, and progress tracking.
**Core principle**: Every RLM sub-call is a supervisor task, not a raw subprocess. This provides unified monitoring, cost tracking, and error recovery across the entire recursion tree.
## Architecture Overview
### RLM → Supervisor Mapping
| RLM Pattern | Supervisor Implementation |
|-------------|---------------------------|
| `rlm-query` single file | `supervisor.submit(prompt, {agent: 'rlm-agent', metadata: {depth, context_file}})` |
| `rlm-batch` parallel fan-out | Multiple `supervisor.submit()` calls with shared `batch_id` metadata |
| Recursive sub-calls | `supervisor.submit()` with incremented `depth` in metadata |
| Depth tracking | `metadata.depth` field, max 3 by default (from manifest) |
| Result aggregation | Poll completed tasks by `batch_id`, aggregate outputs |
### Supervisor Role
The Agent Supervisor (`tools/daemon/agent-supervisor.mjs`) provides:
1. **Task Queue Management**: Queue RLM sub-calls, process up to `maxConcurrency` in parallel
2. **Lifecycle Tracking**: Track states (queued → running → completed/failed)
3. **Output Streaming**: Real-time stdout/stderr from sub-agents via events
4. **Timeout Enforcement**: Kill sub-agents that exceed `taskTimeout` (default 2 hours)
5. **Graceful Shutdown**: Cancel queued tasks, wait for running tasks or force-kill on timeout
6. **Event Integration**: Emit `task:queued`, `task:started`, `task:completed`, `task:failed` events for hub integration
### Integration Points
```
RLM Agent
    ↓
    ↓ submit() with depth metadata
    ↓
Agent Supervisor
    ↓
    ├─→ Task Queue (priority-sorted)
    ├─→ Running Pool (≤ maxConcurrency)
    ├─→ Event Emitter (task lifecycle events)
    └─→ Task Store (persistent state)
    ↓
    ↓ events
    ↓
Messaging Hub (Telegram/Discord/REPL)
```
## Lifecycle Management
### Task Submission (rlm-query)
When `rlm-query <file> <prompt>` is invoked:
```javascript
// Internal RLM agent logic (conceptual)
const childDepth = currentDepth + 1;
const task = supervisor.submit(
  // Prompt includes context file and sub-prompt
  `Context: ${readFileSync(contextFile, 'utf8')}\n\nTask: ${subPrompt}`,
  {
    agent: 'rlm-agent',        // Target agent
    priority: 5 - childDepth,  // Deeper calls = lower priority
    metadata: {
      type: 'rlm-query',
      depth: childDepth,
      context_file: contextFile,
      parent_task_id: currentTaskId,
      max_depth: 3             // From manifest default
    }
  }
);
// Wait for completion
await waitForTaskCompletion(task.id);
const result = taskStore.getTask(task.id).result;
```
**Key behaviors**:
- **Not raw tmux**: Uses supervisor's managed `spawn()` with proper process tracking
- **Depth-based priority**: Deeper recursion gets lower priority, so wide fan-out at deep levels cannot starve shallower tasks
- **Depth enforcement**: If `depth >= max_depth`, reject submission immediately
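The snippets in this document call `waitForTaskCompletion(taskId)` without defining it. One possible shape is sketched below, assuming the supervisor exposes Node-style `on()`/`off()` methods and emits the lifecycle events named in the Supervisor Role list. `createTaskWaiter` is a hypothetical name and not part of the supervisor API.

```javascript
// Hypothetical helper: builds waitForTaskCompletion(taskId) on top of
// supervisor events. Assumes `supervisor` has Node-style on()/off().
function createTaskWaiter(supervisor) {
  return function waitForTaskCompletion(taskId) {
    return new Promise((resolve, reject) => {
      const handlers = {
        'task:completed': (evt) => settle(evt, () => resolve(evt.result)),
        'task:failed': (evt) => settle(evt, () => reject(new Error(String(evt.error)))),
        'task:timeout': (evt) => settle(evt, () => reject(new Error('Task timed out'))),
        'task:cancelled': (evt) => settle(evt, () => reject(new Error('Task cancelled'))),
      };

      function settle(evt, finish) {
        if (evt.taskId !== taskId) return; // Event belongs to another task
        // Detach every listener so a long-running supervisor does not leak them
        for (const [name, fn] of Object.entries(handlers)) supervisor.off(name, fn);
        finish();
      }

      for (const [name, fn] of Object.entries(handlers)) supervisor.on(name, fn);
    });
  };
}
```

A caller would create the waiter once (`const waitForTaskCompletion = createTaskWaiter(supervisor)`) and then use it as in the snippets here. A production version would probably also check the task store first, in case the task reached a terminal state before the listeners were attached.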
### Task Submission (rlm-batch)
When `rlm-batch <pattern> <prompt>` is invoked:
```javascript
const files = glob(pattern);
const batchId = `batch-${Date.now()}`;
const childDepth = currentDepth + 1;
const tasks = [];
// Submit every task up front; the supervisor runs at most maxConcurrency at once
for (const file of files) {
  const task = supervisor.submit(
    `Context: ${readFileSync(file, 'utf8')}\n\nTask: ${subPrompt}`,
    {
      agent: 'rlm-agent',
      priority: 5 - childDepth, // Same depth-based priority
      metadata: {
        type: 'rlm-batch',
        batch_id: batchId,
        depth: childDepth,
        context_file: file,
        total_in_batch: files.length
      }
    }
  );
  tasks.push(task);
}
// Wait for all to complete
await Promise.all(tasks.map(t => waitForTaskCompletion(t.id)));
// Aggregate results
const results = tasks.map(t => taskStore.getTask(t.id).result);
const aggregated = aggregateByStrategy(results, aggregateStrategy);
```
**Key behaviors**:
- **Parallel spawning**: All tasks submitted immediately, supervisor controls concurrency
- **Batch tracking**: Shared `batch_id` in metadata for group operations
- **Aggregate after completion**: Poll completed tasks, merge results according to strategy
### Sub-Agent Spawning (Recursive Calls)
When a sub-agent at depth N spawns its own sub-agent (depth N+1):
```javascript
// Inside depth-N sub-agent
if (needsRecursion) {
  const childDepth = currentDepth + 1;
  // The child must stay below maxDepth, matching the supervisor-side check
  if (childDepth >= maxDepth) {
    throw new Error(`Max recursion depth ${maxDepth} exceeded`);
  }
  const childTask = supervisor.submit(
    // Child prompt
    childPrompt,
    {
      agent: 'rlm-agent',
      priority: 5 - childDepth, // Lower priority
      metadata: {
        type: 'rlm-recursive',
        depth: childDepth,
        parent_task_id: myTaskId,
        max_depth: maxDepth
      }
    }
  );
  // Depth N sub-agent waits for depth N+1 to complete
  await waitForTaskCompletion(childTask.id);
  const childResult = taskStore.getTask(childTask.id).result;
}
```
**Depth limit enforcement**:
```javascript
// In supervisor.submit() or pre-submit validation
if (options.metadata?.depth >= MAX_DEPTH) {
  throw new Error(
    `Recursion depth limit exceeded: ${options.metadata.depth} >= ${MAX_DEPTH}`
  );
}
```
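The depth increment, the limit check, and the `5 - depth` priority rule recur in every submission snippet, so they could be gathered into one helper. The sketch below is illustrative only: `buildChildOptions` and `BASE_PRIORITY` are hypothetical names, and the fallback of 3 mirrors the manifest default mentioned earlier.

```javascript
// Hypothetical helper: derive submit() options for a child sub-call.
// BASE_PRIORITY = 5 mirrors the `5 - depth` rule used in the snippets above.
const BASE_PRIORITY = 5;

function buildChildOptions(parentMetadata, parentTaskId, extra = {}) {
  const maxDepth = parentMetadata.max_depth ?? 3; // 3 is the manifest default
  const depth = parentMetadata.depth + 1;

  // Same comparison as the supervisor-side check: depth must stay below max
  if (depth >= maxDepth) {
    throw new Error(`Recursion depth limit exceeded: ${depth} >= ${maxDepth}`);
  }

  return {
    agent: 'rlm-agent',
    priority: BASE_PRIORITY - depth, // Deeper calls get lower priority
    metadata: {
      ...extra,
      type: 'rlm-recursive',
      depth,
      parent_task_id: parentTaskId,
      max_depth: maxDepth,
    },
  };
}

const options = buildChildOptions({ depth: 0, max_depth: 3 }, 'task-root-001');
console.log(options.priority, options.metadata.depth); // 4 1
```

Keeping the check in one place makes it harder for the agent-side and supervisor-side comparisons to drift apart.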
### Graceful Shutdown
When daemon shuts down with active RLM tasks:
```javascript
// supervisor.shutdown() behavior
await supervisor.shutdown(30000); // timeoutMs: grace period for running tasks

// Internally, shutdown() performs roughly these steps:
// 1. Cancel all queued tasks (including RLM sub-calls)
while (queue.length > 0) {
  const task = queue.shift();
  taskStore.cancelTask(task.id);
  emit('task:cancelled', {taskId: task.id, reason: 'shutdown'});
}
// 2. Wait for running tasks (including depth-N sub-agents) to complete
// 3. If timeout exceeded, SIGKILL all running processes
// 4. All RLM recursion trees are terminated consistently
```
**Impact on RLM**:
- Queued sub-calls are cancelled (partial tree completion)
- Running sub-calls are given 30s to complete gracefully
- If timeout, entire tree is killed (no orphan processes)
- Task store preserves partial state for later inspection
## Concurrency Control
### maxConcurrency Limits Parallel Sub-Agents
```javascript
// Supervisor config
const supervisor = new AgentSupervisor({
  maxConcurrency: 10,          // Max 10 sub-agents running simultaneously
  taskTimeout: 120 * 60 * 1000 // 2 hours per sub-agent
});
```
**Behavior**:
- If 10 RLM sub-agents are running, 11th waits in queue
- Queue is priority-sorted (deeper calls = lower priority)
- As sub-agents complete, queue drains automatically
- Prevents system overload from deep/wide recursion trees
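The queueing behavior listed above can be illustrated with a toy model. This is a simplified sketch and not the real supervisor: `MiniScheduler` is a hypothetical class that only tracks task IDs, and "running" a task just means holding a slot until `complete()` is called.

```javascript
// Toy model of the queue behavior described above; not the real supervisor.
class MiniScheduler {
  constructor(maxConcurrency) {
    this.maxConcurrency = maxConcurrency;
    this.queue = [];          // Waiting tasks, highest priority first
    this.running = new Set(); // IDs of tasks currently holding a slot
    this.nextSeq = 0;         // Tie-breaker so equal priorities stay FIFO
  }

  submit(id, priority) {
    this.queue.push({ id, priority, seq: this.nextSeq++ });
    this.queue.sort((a, b) => b.priority - a.priority || a.seq - b.seq);
    this.drain();
  }

  complete(id) {
    this.running.delete(id);
    this.drain(); // A freed slot is refilled from the queue immediately
  }

  drain() {
    while (this.running.size < this.maxConcurrency && this.queue.length > 0) {
      this.running.add(this.queue.shift().id);
    }
  }
}

const scheduler = new MiniScheduler(2);
scheduler.submit('root', 5);
scheduler.submit('child-a', 4);
scheduler.submit('grandchild', 3); // Queued: both slots are taken
scheduler.submit('child-b', 4);    // Queued ahead of the depth-2 task
scheduler.complete('root');
console.log([...scheduler.running]); // [ 'child-a', 'child-b' ]
```

Although `grandchild` was submitted before `child-b`, the higher-priority (shallower) task takes the freed slot first.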
### Queue Overflow Handling
When `rlm-batch` spawns 100 tasks but `maxConcurrency: 10`:
```
Iteration 0: Submit all 100 tasks → queue has 100 items
Iteration 1: Spawn 10 (up to maxConcurrency) → queue: 90, running: 10
Iteration 2: As tasks complete, spawn more → queue: 85, running: 10
...
Iteration 10: All 100 processed → queue: 0, running: 0
```
**No manual batching needed**: Supervisor handles queue automatically.
### Recommended Concurrency by Task Type
| Task Type | Recommended maxConcurrency | Rationale |
|-----------|----------------------------|-----------|
| **Single-file rlm-query** | 3-5 | Low parallelism, sequential by nature |
| **rlm-batch (10-50 files)** | 10-20 | Balance throughput and memory |
| **rlm-batch (100+ files)** | 20-50 | High throughput, watch memory (each ~100MB) |
| **Deep recursion (depth 3)** | 5-10 | Conservative to avoid cascade failures |
| **Mixed workload** | 10 | Default balance for typical daemon use |
**Memory considerations**:
- Each sub-agent: ~100MB RAM
- 50 parallel sub-agents: ~5GB RAM
- Adjust `maxConcurrency` based on available system memory
- Monitor with `supervisor.getStatus()` for runningCount
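As a rough illustration of sizing `maxConcurrency` from memory, the sketch below divides usable memory by the per-agent estimate. The 100 MB figure comes from the list above; the 1 GB reserve, the cap of 50, and the name `suggestMaxConcurrency` are assumptions made for this example.

```javascript
// Rough sizing helper. The 100 MB per-agent figure comes from the estimate
// above; the 1 GB reserve for the OS and daemon is an assumption.
const MB = 1024 * 1024;

function suggestMaxConcurrency(
  freeBytes,
  { perAgentBytes = 100 * MB, reserveBytes = 1024 * MB, hardCap = 50 } = {}
) {
  const usable = Math.max(0, freeBytes - reserveBytes);
  const byMemory = Math.floor(usable / perAgentBytes);
  // Always allow at least one sub-agent; never exceed the configured cap
  return Math.min(hardCap, Math.max(1, byMemory));
}

console.log(suggestMaxConcurrency(3 * 1024 * MB)); // 20
```

In a Node daemon, `os.freemem()` could supply `freeBytes`, though free memory fluctuates and the result is best treated as a starting point.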
## Event Integration
### Task Lifecycle Events
All RLM sub-calls emit standard supervisor events:
| Event | When | Payload |
|-------|------|---------|
| `task:queued` | Task added to queue | `{taskId, prompt, queueSize}` |
| `task:started` | Task begins execution | `{taskId, pid}` |
| `task:output` | Sub-agent produces output | `{taskId, chunk, stream}` |
| `task:completed` | Task succeeds | `{taskId, result, duration}` |
| `task:failed` | Task fails | `{taskId, error, exitCode}` |
| `task:timeout` | Task exceeds taskTimeout | `{taskId}` |
| `task:cancelled` | Task cancelled (shutdown or manual) | `{taskId, signal}` |
### RLM-Specific Event Metadata
Enhance events with RLM context:
```javascript
// On task:started for RLM sub-agent
supervisor.on('task:started', ({taskId, pid}) => {
  const task = taskStore.getTask(taskId);
  if (task.metadata?.type?.startsWith('rlm-')) {
    console.log(`RLM sub-agent started: depth ${task.metadata.depth}, file ${task.metadata.context_file}`);
  }
});
// On task:completed, check if part of batch
supervisor.on('task:completed', ({taskId, result}) => {
  const task = taskStore.getTask(taskId);
  if (task.metadata?.batch_id) {
    // Check if all batch tasks complete
    const batchTasks = taskStore.getTasksByBatchId(task.metadata.batch_id);
    const allComplete = batchTasks.every(t => t.state === 'completed');
    if (allComplete) {
      supervisor.emit('rlm:batch:completed', {batchId: task.metadata.batch_id});
    }
  }
});
```
### Progress Reporting Chain
```
Sub-agent (depth 2)
↓ stdout
↓
Supervisor
↓ task:output event
↓
Messaging Hub
↓
Telegram/Discord/REPL
(User sees: "RLM sub-agent [depth 2/3] processing src/auth/login.ts...")
```
**Implementation**:
```javascript
// In messaging hub
supervisor.on('task:output', ({taskId, chunk, stream}) => {
  const task = taskStore.getTask(taskId);
  if (task.metadata?.type?.startsWith('rlm-')) {
    // Format for user
    const depthLabel = `[depth ${task.metadata.depth}/${task.metadata.max_depth}]`;
    messagingHub.send(`RLM ${depthLabel}: ${chunk}`);
  }
});
```
## Depth Tracking
### Metadata Propagation
Depth is tracked in task metadata, incremented on each spawn:
```javascript
// Root RLM call (depth 0)
const rootTask = supervisor.submit(prompt, {
  metadata: {depth: 0, max_depth: 3}
});
// Depth 1 sub-call (spawned by root)
const childTask = supervisor.submit(childPrompt, {
  metadata: {depth: 1, max_depth: 3, parent_task_id: rootTask.id}
});
// Depth 2 sub-call (spawned by depth 1)
const grandchildTask = supervisor.submit(grandchildPrompt, {
  metadata: {depth: 2, max_depth: 3, parent_task_id: childTask.id}
});
// With max_depth 3, depth 2 is the deepest level that can be submitted
// Attempting to submit a task at depth 3 or deeper throws an error
```
### Enforcement of maxDepth
**Default max depth**: 3 (from `agentic/code/addons/rlm/manifest.json`)
```json
{
  "config": {
    "max_depth": 3,
    "max_sub_calls": 20
  }
}
```
**Enforcement points**:
1. **Pre-submit validation** (before `supervisor.submit()`):
```javascript
if (metadata.depth >= maxDepth) {
  throw new Error(`Max depth ${maxDepth} exceeded`);
}
```
2. **Task creation** (in supervisor):
```javascript
if (options.metadata?.depth >= MAX_DEPTH) {
  emit('task:failed', {taskId, error: 'Max depth exceeded'});
  return null;
}
```
3. **Agent runtime** (inside RLM agent):
```javascript
if (this.currentDepth >= this.maxDepth) {
  return 'MAX_DEPTH_REACHED'; // Don't spawn more sub-calls
}
```
### Recursion Tree Tracking
Task metadata forms a tree structure:
```javascript
// Task tree example
const taskTree = {
  "task-root-001": {
    depth: 0,
    parent_task_id: null,
    children: ["task-child-001", "task-child-002"]
  },
  "task-child-001": {
    depth: 1,
    parent_task_id: "task-root-001",
    children: ["task-grandchild-001"]
  },
  "task-grandchild-001": {
    depth: 2,
    parent_task_id: "task-child-001",
    children: [] // Depth 2 is the deepest level when max_depth is 3
  }
};
```
**Visualization** (via task store query):
```
Root (depth 0): "Analyze codebase security"
├── Child 1 (depth 1): "Check src/auth/ for vulnerabilities"
│   └── Grandchild 1 (depth 2): "Analyze src/auth/login.ts"
└── Child 2 (depth 1): "Check src/api/ for vulnerabilities"
    └── Grandchild 2 (depth 2): "Analyze src/api/users.ts"
```
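A visualization like the one above can be produced from flat task records by following `parent_task_id` links. The sketch below assumes each record has the shape `{id, label, metadata}`; that shape and the name `renderTaskTree` are hypothetical, and the real task store may expose something different.

```javascript
// Render a recursion tree from flat task records linked by parent_task_id.
// The record shape ({id, label, metadata}) is an assumption for this sketch.
function renderTaskTree(tasks) {
  const childrenOf = new Map();
  for (const task of tasks) {
    const parent = task.metadata.parent_task_id ?? null;
    if (!childrenOf.has(parent)) childrenOf.set(parent, []);
    childrenOf.get(parent).push(task);
  }

  const lines = [];
  const walk = (task, prefix, isLast, isRoot) => {
    const branch = isRoot ? '' : isLast ? '└── ' : '├── ';
    lines.push(`${prefix}${branch}${task.label} (depth ${task.metadata.depth})`);
    const kids = childrenOf.get(task.id) ?? [];
    // Children of a non-last node keep a vertical bar so siblings stay connected
    const childPrefix = isRoot ? '' : prefix + (isLast ? '    ' : '│   ');
    kids.forEach((kid, i) => walk(kid, childPrefix, i === kids.length - 1, false));
  };

  for (const root of childrenOf.get(null) ?? []) walk(root, '', true, true);
  return lines.join('\n');
}

const tree = renderTaskTree([
  { id: 'r', label: 'Root', metadata: { depth: 0, parent_task_id: null } },
  { id: 'c1', label: 'Child 1', metadata: { depth: 1, parent_task_id: 'r' } },
  { id: 'g1', label: 'Grandchild 1', metadata: { depth: 2, parent_task_id: 'c1' } },
  { id: 'c2', label: 'Child 2', metadata: { depth: 1, parent_task_id: 'r' } },
]);
console.log(tree);
```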
## Error Recovery
### Sub-Agent Crash Handling
When a sub-agent crashes unexpectedly:
```javascript
// Supervisor detects process exit with non-zero code
proc.on('exit', (code, signal) => {
  if (code !== 0 && signal !== 'SIGTERM') {
    // Sub-agent crashed
    taskStore.failTask(task.id, `Process exited with code ${code}`);
    emit('task:failed', {taskId: task.id, exitCode: code});
    // If part of batch, mark batch as partial failure
    if (task.metadata?.batch_id) {
      batchStore.recordFailure(task.metadata.batch_id, task.id);
    }
  }
});
```
**Impact on RLM**:
- Failed sub-call does not crash parent
- Parent receives `null` or error result
- Partial tree completion (some branches succeed, some fail)
- Final report documents which branches failed
### Partial Tree Completion
When some sub-calls succeed and some fail:
```javascript
// rlm-batch aggregation with partial failures
const settled = await Promise.allSettled(
  taskIds.map(id => waitForTaskCompletion(id))
);
// allSettled preserves input order, so settled[i] belongs to taskIds[i]
const successfulIds = taskIds.filter((_, i) => settled[i].status === 'fulfilled');
const failedIds = taskIds.filter((_, i) => settled[i].status === 'rejected');
if (failedIds.length > 0) {
  console.warn(`Batch partially failed: ${failedIds.length}/${taskIds.length} tasks failed`);
}
// Aggregate only successful results
const results = successfulIds.map(id => taskStore.getTask(id).result);
return aggregateByStrategy(results, strategy);
```
**User notification**:
```
RLM Batch: PARTIAL SUCCESS
Pattern: src/**/*.ts
Processed: 87/100 files
Failed: 13 files (see report for details)
Aggregated results based on 87 successful analyses.
Failed files:
- src/auth/legacy.ts (timeout)
- src/utils/deprecated.ts (parse error)
...
```
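The notification above could be assembled directly from `Promise.allSettled()` outcomes. The following is a minimal sketch under stated assumptions: `formatBatchReport` is a hypothetical name, and `files` is assumed to be index-aligned with the settled outcomes.

```javascript
// Build a batch summary from Promise.allSettled() outcomes.
// Assumes files[i] corresponds to settled[i]; allSettled preserves input
// order, so this holds when the promises were created via files.map(...).
function formatBatchReport(pattern, files, settled) {
  const failed = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === 'rejected') {
      const reason =
        outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason);
      failed.push(`- ${files[i]} (${reason})`);
    }
  });

  const ok = files.length - failed.length;
  const status =
    failed.length === 0 ? 'SUCCESS' : ok === 0 ? 'FAILED' : 'PARTIAL SUCCESS';
  const lines = [
    `RLM Batch: ${status}`,
    `Pattern: ${pattern}`,
    `Processed: ${ok}/${files.length} files`,
  ];
  if (failed.length > 0) {
    lines.push(`Failed: ${failed.length} files`, 'Failed files:', ...failed);
  }
  return lines.join('\n');
}

const report = formatBatchReport('src/**/*.ts', ['a.ts', 'b.ts'], [
  { status: 'fulfilled', value: 'task-1' },
  { status: 'rejected', reason: new Error('timeout') },
]);
console.log(report);
```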
### Retry Strategy for Failed Sub-Calls
**Option 1: Automatic retry** (at supervisor level):
```javascript
const supervisor = new AgentSupervisor({
  retryPolicy: {
    maxRetries: 1,      // Retry failed tasks once
    retryDelay: 5000,   // Wait 5s before retry
    retryableErrors: [  // Only retry specific errors
      'ETIMEDOUT',
      'ECONNRESET'
    ]
  }
});
```
**Option 2: Manual retry** (at RLM level):
```javascript
// Inside RLM agent
for (let attempt = 0; attempt < 3; attempt++) {
  try {
    const task = supervisor.submit(prompt, options);
    const result = await waitForTaskCompletion(task.id);
    return result; // Success
  } catch (error) {
    if (attempt === 2) throw error; // Max retries exceeded
    await sleep(1000 * 2 ** attempt); // Exponential backoff: 1s, then 2s
  }
}
```
**Recommended**: Option 2 (RLM-level retry) for more control over recursive retry logic.
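An RLM-level retry could be factored into a reusable wrapper, sketched below. `retryWithBackoff`, `backoffDelay`, and the injectable `sleep` option are hypothetical names introduced for illustration. `runOnce` stands in for "submit a task and wait for it" and is expected to reject on failure.

```javascript
// Generic retry wrapper with exponential backoff. `sleep` is injectable so
// callers (and tests) can replace real delays.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function backoffDelay(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt; // attempt 0 → 1s, 1 → 2s, 2 → 4s
}

async function retryWithBackoff(
  runOnce,
  { maxAttempts = 3, baseMs = 1000, sleep = defaultSleep } = {}
) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await runOnce(attempt);
    } catch (error) {
      lastError = error;
      // No delay after the final attempt; the error is rethrown below
      if (attempt < maxAttempts - 1) await sleep(backoffDelay(attempt, baseMs));
    }
  }
  throw lastError;
}
```

Inside the RLM agent this might be called as `retryWithBackoff(() => waitForTaskCompletion(supervisor.submit(prompt, options).id))`, which submits a fresh task on every attempt.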
## Configuration
### Supervisor Options Relevant to RLM
```javascript
const supervisor = new AgentSupervisor({
  // Core settings
  maxConcurrency: 10,            // Max parallel sub-agents (default: 3)
  taskTimeout: 120 * 60 * 1000,  // 2 hours per sub-agent (default: 2 hours)
  agentCommand: 'claude',        // Command to spawn agents
  agentArgs: [],                 // Additional args for all agents
  // Task store for persistent state
  taskStore: new TaskStore({
    path: '.aiwg/daemon/tasks.db'
  }),
  // Optional: Retry policy
  retryPolicy: {
    maxRetries: 1,
    retryDelay: 5000
  }
});
```
### Configuration for Different Workload Profiles
#### Profile 1: Single-file RLM queries (low concurrency)
```javascript
{
  maxConcurrency: 3,
  taskTimeout: 300000, // 5 minutes
  description: "Conservative for sequential deep recursion"
}
```
#### Profile 2: Batch processing (high throughput)
```javascript
{
  maxConcurrency: 20,
  taskTimeout: 600000, // 10 minutes
  description: "High throughput for parallel file processing"
}
```
#### Profile 3: Deep recursion trees (balanced)
```javascript
{
  maxConcurrency: 10,
  taskTimeout: 1200000, // 20 minutes
  description: "Balanced for depth-3 recursion with moderate parallelism"
}
```
#### Profile 4: Mixed workload (default)
```javascript
{
  maxConcurrency: 10,
  taskTimeout: 7200000, // 2 hours
  description: "Default balanced profile"
}
```
### RLM Addon Configuration
From `agentic/code/addons/rlm/manifest.json`:
```json
{
  "config": {
    "max_depth": 3,
    "max_sub_calls": 20,
    "sub_model": "sonnet",
    "parallel_sub_calls": true,
    "timeout_per_subcall": 300,
    "supervisor": {
      "default_max_concurrency": 10,
      "default_task_timeout": 7200000
    }
  }
}
```
## Integration Examples
### Example 1: Simple rlm-query
```javascript
// User invokes: /rlm-query src/auth/login.ts "extract function names"
// RLM agent internally:
const contextFile = 'src/auth/login.ts';
const subPrompt = 'extract function names';
const context = readFileSync(contextFile, 'utf8');
const task = supervisor.submit(
  `Context:\n${context}\n\nTask: ${subPrompt}`,
  {
    agent: 'rlm-agent',
    priority: 5, // Depth 0 (high priority)
    metadata: {
      type: 'rlm-query',
      depth: 0,
      context_file: contextFile,
      max_depth: 3
    }
  }
);
// Wait for completion
await waitForTaskCompletion(task.id);
const result = taskStore.getTask(task.id).result;
// Return to user
console.log(`Extracted functions: ${result.output}`);
```
**Supervisor behavior**:
- Task queued immediately
- Spawned when `runningCount < maxConcurrency`
- Output streamed via `task:output` events
- Completed task marked in task store
### Example 2: rlm-batch with partial failures
```javascript
// User invokes: /rlm-batch "src/**/*.ts" "count functions"
const files = glob('src/**/*.ts'); // e.g. 100 files
const batchId = `batch-${Date.now()}`;
const tasks = [];
// Submit every task up front; the supervisor queues them
for (const file of files) {
  const task = supervisor.submit(
    `Context:\n${readFileSync(file, 'utf8')}\n\nTask: count functions`,
    {
      agent: 'rlm-agent',
      priority: 5,
      metadata: {
        type: 'rlm-batch',
        batch_id: batchId,
        depth: 0,
        context_file: file,
        total_in_batch: files.length
      }
    }
  );
  tasks.push(task);
}
// Wait for all (up to maxConcurrency run in parallel)
const results = await Promise.allSettled(
  tasks.map(t => waitForTaskCompletion(t.id))
);
// Handle partial failures (allSettled preserves order: results[i] belongs to tasks[i])
const successful = tasks.filter((_, i) => results[i].status === 'fulfilled');
const failed = tasks.filter((_, i) => results[i].status === 'rejected');
if (failed.length > 0) {
  console.warn(`${failed.length} files failed`);
}
// Aggregate successful results
const counts = successful.map(t =>
  taskStore.getTask(t.id).result.output
);
const totalFunctions = counts.reduce((sum, c) => sum + parseInt(c, 10), 0);
console.log(`Total functions: ${totalFunctions} (from ${successful.length}/${files.length} files)`);
```
**Supervisor behavior**:
- All 100 tasks queued immediately
- 10 run in parallel (if `maxConcurrency: 10`)
- As each completes, next is spawned
- Failed tasks don't block others
- Aggregation happens after all complete/fail
### Example 3: Recursive sub-call (depth 2)
```javascript
// Root task (depth 0): "Analyze security"
const rootTask = supervisor.submit(
  "Analyze security across entire codebase",
  {metadata: {depth: 0, max_depth: 3}}
);
// Root agent decides to delegate to module-level analysis (depth 1)
const authModuleTask = supervisor.submit(
  "Analyze security in src/auth/ module",
  {metadata: {depth: 1, max_depth: 3, parent_task_id: rootTask.id}}
);
// Depth-1 agent decides to analyze specific file (depth 2)
const loginFileTask = supervisor.submit(
  "Analyze security in src/auth/login.ts",
  {metadata: {depth: 2, max_depth: 3, parent_task_id: authModuleTask.id}}
);
// Depth-2 agent completes (cannot spawn depth 3 sub-calls)
await waitForTaskCompletion(loginFileTask.id);
// Depth-1 agent collects depth-2 results and completes
await waitForTaskCompletion(authModuleTask.id);
// Root agent collects all results and synthesizes
await waitForTaskCompletion(rootTask.id);
```
**Supervisor behavior**:
- Priority ordering: depth 0 > depth 1 > depth 2
- If maxConcurrency exceeded, deeper tasks queue
- Each depth waits for children before completing
- Depth 2 cannot spawn depth 3 (max_depth enforcement)
## Success Criteria
RLM-Supervisor integration is successful when:
- [ ] All RLM sub-calls managed via supervisor (no raw tmux)
- [ ] Depth tracking accurate across entire recursion tree
- [ ] maxDepth enforced (no sub-calls at depth `max_depth` or deeper; with the default of 3, nothing beyond depth 2)
- [ ] maxConcurrency respected (never exceeds limit)
- [ ] Partial tree completion handled (failures don't crash parents)
- [ ] Progress events reach messaging hub
- [ ] Graceful shutdown terminates all sub-agents
- [ ] Cost tracking accurate (sum all sub-call costs)
- [ ] Task store preserves full recursion tree for inspection
## Troubleshooting
### Issue: Queue overflow (100+ queued tasks)
**Symptom**: `supervisor.getStatus()` shows large queuedCount, slow throughput
**Diagnosis**: `maxConcurrency` too low for workload
**Solution**: Increase `maxConcurrency` (e.g., 10 → 20)
### Issue: Sub-agent timeouts
**Symptom**: Many `task:timeout` events, sub-agents killed
**Diagnosis**: `taskTimeout` too short for complex tasks
**Solution**: Increase `taskTimeout` (e.g., 2 hours → 4 hours)
### Issue: Excessive memory usage
**Symptom**: System memory approaching limit
**Diagnosis**: Too many parallel sub-agents (each ~100MB)
**Solution**: Decrease `maxConcurrency` (e.g., 50 → 20)
### Issue: Depth limit not enforced
**Symptom**: Sub-calls observed at depth `max_depth` or deeper (depth 3+ with the default of 3)
**Diagnosis**: Missing depth validation in submit path
**Solution**: Add pre-submit check:
```javascript
if (metadata.depth >= maxDepth) {
  throw new Error(`Max depth ${maxDepth} exceeded`);
}
```
### Issue: Partial batch never completes
**Symptom**: Some batch tasks stuck in "running" state forever
**Diagnosis**: Sub-agent crashed without emitting exit event
**Solution**: Add timeout handling per task, force-kill on timeout
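One way to surface such stuck tasks is a periodic watchdog that compares each running task's start time against `taskTimeout`. The sketch below is illustrative only: `findStuckTasks` is a hypothetical name, and the `startedAt` timestamp field is an assumption about the task record shape.

```javascript
// Watchdog sketch: flag tasks that have been "running" longer than taskTimeout.
// The `startedAt` timestamp (ms) is an assumed field on the task record.
function findStuckTasks(tasks, nowMs, taskTimeoutMs) {
  return tasks.filter(
    (task) => task.state === 'running' && nowMs - task.startedAt > taskTimeoutMs
  );
}

const stuck = findStuckTasks(
  [
    { id: 'task-1', state: 'running', startedAt: 0 },
    { id: 'task-2', state: 'running', startedAt: 9000 },
    { id: 'task-3', state: 'completed', startedAt: 0 },
  ],
  10000, // "now"
  5000   // taskTimeout
);
console.log(stuck.map((t) => t.id)); // [ 'task-1' ]
```

Each flagged task could then be force-killed and marked failed so that batch aggregation is not blocked indefinitely.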
## References
- @$AIWG_ROOT/tools/daemon/agent-supervisor.mjs - Agent Supervisor implementation
- @$AIWG_ROOT/agentic/code/addons/rlm/agents/rlm-agent.md - RLM agent definition
- @$AIWG_ROOT/agentic/code/addons/rlm/commands/rlm-query.md - Single sub-call command
- @$AIWG_ROOT/agentic/code/addons/rlm/commands/rlm-batch.md - Batch parallel command
- @$AIWG_ROOT/agentic/code/addons/rlm/schemas/rlm-task-tree.yaml - Task tree schema
- @$AIWG_ROOT/tools/daemon/task-store.mjs - Persistent task state
- @.aiwg/research/findings/REF-089-recursive-language-models.md - RLM research
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md - Subagent context minimization
- Issue #323 - Supervisor integration implementation
**Document Status**: COMPLETE
**Last Updated**: 2026-02-09
**Related Epic**: Issue #321 (AIWG RLM Addon)