aiwg
Version:
Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.
1,292 lines (946 loc) • 33 kB
Markdown
# Epic #26 Troubleshooting Guide
**Version**: 1.0.0
**Last Updated**: 2026-02-03
**Applies To**: Ralph External Orchestrator with Epic #26 integration
## Overview
This guide covers troubleshooting for all four layers of Epic #26:
- **PID Control Layer** (#23) - Dynamic parameter adjustment
- **Claude Intelligence Layer** (#22) - Strategic prompt generation
- **Memory Layer** (#24) - Cross-loop knowledge persistence
- **Overseer Layer** (#25) - Autonomous health monitoring
## General Troubleshooting Approach
### Step 1: Identify the Layer
Epic #26 issues typically manifest in one of four layers. Determine which layer is involved:
| Symptom | Likely Layer |
|---------|--------------|
| Loop oscillates or converges slowly | PID Control |
| Poor prompts or irrelevant validation | Claude Intelligence |
| Not learning from past loops | Memory Layer |
| Pathological behaviors not caught | Overseer |
| Loop runs but features disabled | Configuration |
### Step 2: Check Logs
**Primary Locations:**
```bash
# General loop state
.aiwg/ralph-external/{loop-id}/state.json
# Iteration history with Epic #26 data
.aiwg/ralph-external/{loop-id}/history/iteration-*.json
# PID metrics log
.aiwg/ralph-external/{loop-id}/pid-metrics.json
# Overseer health checks
.aiwg/ralph-external/overseer/health-{loop-id}.json
# Semantic memory store
.aiwg/knowledge/ralph-learnings.json
# Escalation logs
.aiwg/ralph-external/overseer/escalations/
```
**What to Look For:**
- **PID metrics**: `proportional`, `integral`, `derivative`, `controlSignal`, `gains`
- **Overseer checks**: `detections`, `interventions`, `status`, `escalations`
- **Intelligence**: `stateAssessment`, `strategyPlan`, `preValidation`, `postValidation`
- **Memory**: `learningsRetrieved`, `learningsPromoted`
### Step 3: Verify Configuration
Epic #26 features can be toggled via config:
```javascript
// In orchestrator options
config: {
enablePIDControl: true, // PID control system
enableOverseer: true, // Overseer monitoring
enableSemanticMemory: true, // Semantic memory
}
```
If disabled, features won't activate. Check state file:
```bash
jq '.config' .aiwg/ralph-external/{loop-id}/state.json
```
### Step 4: Check for Common First Issues
| Issue | Check | Fix |
|-------|-------|-----|
| Features not active | `config.*` in state file | Enable in orchestrator options |
| PID metrics missing | Iteration has `pidMetrics` field? | Verify MetricsCollector initialization |
| No overseer checks | `healthReport` in iteration? | Verify Overseer initialization |
| Memory not retrieved | `learningsRetrieved` count? | Check semantic memory file exists |
| Claude not invoked | `claudePromptGenerator` errors? | Check `claude` CLI in PATH |
---
## 1. PID Control Issues
The PID Control Layer adjusts loop parameters dynamically based on progress metrics.
### Problem: Loop Converges Too Slowly
**Symptoms:**
- High iteration count for simple tasks
- Progress increases by <5% per iteration
- `controlSignal` consistently low (< 0.3)
**Diagnosis:**
```bash
# Check PID metrics
jq '.iterations[] | {iter: .number, progress: .analysis.completionPercentage, signal: .controlSignals.controlSignal}' \
.aiwg/ralph-external/{loop-id}/state.json
```
Look for:
- Low `controlSignal` values despite low progress
- `gains.kp` (proportional gain) < 0.5
**Likely Causes:**
- Gains too conservative for task complexity
- GainScheduler using `cautious` profile
- Task complexity underestimated
**Solutions:**
1. **Manual gain adjustment** (temporary):
```javascript
// In orchestrator initialization
pidController.gainScheduler.setGains({
kp: 0.8, // Increase from default 0.6
ki: 0.3, // Increase from default 0.2
kd: 0.4, // Increase from default 0.3
});
```
2. **Switch gain profile** (better):
```javascript
// Set to aggressive profile for fast convergence
pidController.gainScheduler.selectProfile('aggressive');
// Or standard (balanced)
pidController.gainScheduler.selectProfile('standard');
```
3. **Adjust task complexity estimate**:
```javascript
// Higher complexity = more aggressive gains
taskConfig.estimatedComplexity = 8; // 1-10 scale
```
4. **Check alarm thresholds**:
```bash
# View current thresholds
jq '.alarmSummary.thresholds' .aiwg/ralph-external/{loop-id}/state.json
```
If `slowConvergence` alarm is firing but not severe, increase threshold:
```javascript
alarms.updateThresholds({
slowConvergence: { progressPerIteration: 0.03 }, // Was 0.05
});
```
### Problem: Loop Oscillates (Progress Bouncing)
**Symptoms:**
- Progress goes: 40% → 60% → 45% → 70% → 55%
- `derivative` metric swings wildly (-0.3 → +0.4 → -0.2)
- Repeated WARN interventions for "oscillation detected"
**Diagnosis:**
```bash
# Check derivative trend
jq '.iterations[] | {iter: .number, progress: .analysis.completionPercentage, derivative: .pidMetrics.derivative}' \
.aiwg/ralph-external/{loop-id}/state.json
```
Look for:
- `derivative` frequently changing sign
- Large `controlSignal` swings (e.g., -0.8 to +0.6)
**Likely Causes:**
- Proportional gain (`kp`) too high → overreacts to progress changes
- Derivative gain (`kd`) too low → can't dampen oscillations
- High noise in progress measurements
**Solutions:**
1. **Reduce proportional gain**:
```javascript
pidController.gainScheduler.adjustGains({
kp: 0.4, // Reduce from 0.6 to prevent overreaction
});
```
2. **Increase derivative gain** (damping):
```javascript
pidController.gainScheduler.adjustGains({
kd: 0.5, // Increase from 0.3 for more damping
});
```
3. **Switch to `cautious` profile**:
```javascript
pidController.gainScheduler.selectProfile('cautious');
// Lower gains reduce oscillation risk
```
4. **Increase noise threshold**:
```javascript
// In MetricsCollector options
metricsCollector.setNoiseThreshold(0.1); // Was 0.05
// Filters out small progress fluctuations
```
5. **Check for measurement noise**:
If `analysis.completionPercentage` is unreliable (fluctuates without actual progress changes), the issue may be upstream in output analysis.
### Problem: Control Diverges (Gets Worse)
**Symptoms:**
- Progress decreases over iterations
- `integral` accumulates negative values (< -1.0)
- Overseer detects "quality regression"
**Diagnosis:**
```bash
# Check for negative integral accumulation
jq '.iterations[] | {iter: .number, progress: .analysis.completionPercentage, integral: .pidMetrics.integral}' \
.aiwg/ralph-external/{loop-id}/state.json
```
**Likely Causes:**
- Integral windup (accumulated error too large)
- Task is actually getting harder (scope creep)
- Bad feedback loop (iteration is making things worse)
**Solutions:**
1. **Reset integral** (anti-windup):
```javascript
pidController.metricsCollector.resetIntegral();
// Clears accumulated error
```
2. **Reduce integral gain**:
```javascript
pidController.gainScheduler.adjustGains({
ki: 0.1, // Reduce from 0.2
});
```
3. **Enable integral decay**:
```javascript
// In MetricsCollector options
metricsCollector.setIntegralDecay(0.85); // Was 0.9
// Prevents old errors from dominating
```
4. **Check for scope creep**:
If progress decreases because new work is being added, re-scope the objective or increase `maxIterations`.
5. **Overseer intervention**:
If divergence continues for 3+ iterations, Overseer should trigger PAUSE. If not:
```javascript
overseer.detector.updateThresholds({
qualityRegression: { consecutive: 2 }, // Was 3
});
```
### Problem: Alarms Firing Incorrectly
**Symptoms:**
- False alarms for "excessive oscillation" when progress is steady
- Missing alarms for actual stuck loops
- Email/issue escalations spam
**Diagnosis:**
```bash
# View alarm history
jq '.alarmSummary.activeAlarms' .aiwg/ralph-external/{loop-id}/state.json
# Check thresholds
jq '.alarmSummary.thresholds' .aiwg/ralph-external/{loop-id}/state.json
```
**Solutions:**
1. **Adjust specific alarm thresholds**:
```javascript
// Too sensitive
alarms.updateThresholds({
excessiveOscillation: { derivativeSwing: 0.5 }, // Was 0.3
slowConvergence: { progressPerIteration: 0.03 }, // Was 0.05
stuckLoop: { iterationsWithoutProgress: 4 }, // Was 3
});
```
2. **Disable specific alarms** (not recommended):
```javascript
alarms.disableAlarm('excessiveOscillation');
// Only if genuinely not needed for this task
```
3. **Reduce escalation frequency**:
```javascript
// In EscalationHandler options
escalationHandler.setThrottling({
minInterval: 600000, // 10 minutes between escalations (was 300000)
});
```
4. **Change alarm action levels**:
```javascript
// Make less severe
alarms.setActionLevel('excessiveOscillation', 'WARN'); // Was 'REDIRECT'
```
---
## 2. Claude Intelligence Issues
The Claude Intelligence Layer generates prompts, validates iterations, and plans strategies.
### Problem: Generated Prompts Unhelpful
**Symptoms:**
- Claude sessions don't make progress
- Prompts are too vague or off-topic
- Sessions repeatedly ask for clarification
**Diagnosis:**
```bash
# View generated prompts
jq '.iterations[] | {iter: .number, prompt: .claudePrompt | split("\n")[0:5]}' \
.aiwg/ralph-external/{loop-id}/state.json
```
Check:
- Is prompt specific to current blockers?
- Does it include relevant learnings?
- Is focus area correct?
**Likely Causes:**
- StateAssessor providing poor situation assessment
- Insufficient context (missing history/learnings)
- Claude model timeout during generation
**Solutions:**
1. **Check StateAssessor output**:
```bash
jq '.iterations[].assessment' .aiwg/ralph-external/{loop-id}/state.json
```
If `blockers` or `accomplishments` are empty/generic:
```javascript
// Ensure StateAssessor has enough history
stateAssessor.setWindowSize(5); // Was 3
```
2. **Verify memory retrieval**:
```bash
jq '.iterations[].learningsRetrieved' .aiwg/ralph-external/{loop-id}/state.json
```
If `[]` (empty), memory isn't being retrieved. Check:
```javascript
// Ensure MemoryRetrieval is initialized
memoryRetrieval.setRelevanceThreshold(0.5); // Was 0.7 (too strict)
```
3. **Increase Claude timeout**:
```javascript
// In ClaudePromptGenerator options
claudePromptGenerator.setTimeout(120000); // 2 minutes (was 90s)
```
4. **Fall back to template prompts**:
If Claude invocation fails consistently:
```javascript
// Check Claude CLI availability
spawnSync('which', ['claude']).status === 0
```
If not available, orchestrator falls back to standard PromptGenerator automatically.
5. **Check StrategyPlanner output**:
```bash
jq '.iterations[].strategy' .aiwg/ralph-external/{loop-id}/state.json
```
If strategy is generic ("continue working"), StrategyPlanner needs better PID signals:
```javascript
// Ensure PID metrics are being passed
strategyPlanner.plan(stateAssessment, pidMetrics);
```
### Problem: Validation Too Strict/Loose
**Symptoms:**
- Pre-validation blocks iterations that should proceed
- Post-validation passes iterations with obvious issues
- False positives for "schema validation failed"
**Diagnosis:**
```bash
# Check validation results
jq '.iterations[] | {iter: .number, preValid: .preValidation.valid, postValid: .postValidation.valid, issues: .postValidation.issues}' \
.aiwg/ralph-external/{loop-id}/state.json
```
**Solutions:**
1. **Adjust validation thresholds**:
```javascript
// In ValidationAgent options
validationAgent.setThresholds({
minProgress: 0.05, // Allow iterations with 5%+ progress (was 0.1)
maxRegressions: 3, // Allow up to 3 minor regressions (was 1)
requiredArtifacts: [], // Don't require specific artifacts
});
```
2. **Disable specific checks** (per-check tuning):
```javascript
// Too strict for this project
validationAgent.disableCheck('schemaValidation');
validationAgent.disableCheck('testCoverageRequirement');
```
3. **Enable check-by-check configuration**:
```javascript
validationAgent.configureCheck('qualityGate', {
minQualityScore: 0.6, // Was 0.8
failOnLowQuality: false, // Warn instead
});
```
4. **Review false positive patterns**:
```bash
# Common false positives
jq '.iterations[] | select(.postValidation.valid == false) | .postValidation.issues[]' \
.aiwg/ralph-external/{loop-id}/state.json
```
If a specific issue type appears repeatedly without actual problems, disable that check.
### Problem: Strategy Planner Keeps Escalating
**Symptoms:**
- StrategyPlanner recommends "escalate_to_human" repeatedly
- Low confidence scores (< 0.5) for all iterations
- Loop doesn't make progress due to excessive escalation
**Diagnosis:**
```bash
# Check strategy confidence
jq '.iterations[] | {iter: .number, strategy: .strategy.strategy, confidence: .strategy.confidence}' \
.aiwg/ralph-external/{loop-id}/state.json
```
**Likely Causes:**
- Confidence threshold too high
- PID signals indicating instability when task is actually progressing
- Task genuinely too complex for autonomous handling
**Solutions:**
1. **Lower confidence threshold**:
```javascript
// In StrategyPlanner options
strategyPlanner.setConfidenceThreshold(0.4); // Was 0.6
// Allow lower-confidence strategies
```
2. **Check PID stability**:
```bash
jq '.iterations[] | {iter: .number, derivative: .pidMetrics.derivative}' \
.aiwg/ralph-external/{loop-id}/state.json
```
If `derivative` is stable (not oscillating), but StrategyPlanner still escalates, it's being overcautious:
```javascript
// Adjust stability criteria
strategyPlanner.setStabilityCriteria({
maxDerivativeSwing: 0.4, // Was 0.2 (too strict)
});
```
3. **When escalation is correct**:
If task genuinely requires human input (e.g., design decision, ambiguous requirements), escalation is appropriate. In this case:
- Provide input via HITL gate
- Resume loop after clarification
- Memory will learn from the resolution
4. **Disable auto-escalation** (not recommended for production):
```javascript
// Only for testing
strategyPlanner.setAutoEscalate(false);
```
---
## 3. Memory Layer Issues
The Memory Layer persists cross-loop knowledge and retrieves relevant learnings.
### Problem: Memory Not Being Used
**Symptoms:**
- `learningsRetrieved: []` in all iterations
- Loop doesn't benefit from past experience
- Same mistakes repeated across loops
**Diagnosis:**
```bash
# Check if memory file exists
ls -lh .aiwg/knowledge/ralph-learnings.json
# Check learning count
jq '.stats.totalLearnings' .aiwg/knowledge/ralph-learnings.json
# Check retrieval attempts
jq '.iterations[] | {iter: .number, retrieved: .learningsRetrieved | length}' \
.aiwg/ralph-external/{loop-id}/state.json
```
**Likely Causes:**
1. **Memory store empty** (no prior loops completed successfully)
2. **Retrieval relevance threshold too strict**
3. **Task type mismatch** (stored learnings for "test-fix", current task is "feature")
**Solutions:**
1. **Verify memory is populated**:
```bash
jq '.learnings | length' .aiwg/knowledge/ralph-learnings.json
```
If `0`, no learnings have been promoted yet. Complete at least one successful loop to populate memory.
2. **Lower relevance threshold**:
```javascript
// In MemoryRetrieval options
memoryRetrieval.setRelevanceThreshold(0.5); // Was 0.7
// Retrieves more learnings (may include less relevant ones)
```
3. **Check task type matching**:
```bash
# View available task types in memory
jq '.learnings[] | .taskType' .aiwg/knowledge/ralph-learnings.json | sort | uniq
```
If current task type isn't represented:
```javascript
// Use more generic task type
taskConfig.taskType = 'general'; // Instead of 'specific-feature-name'
```
4. **Manually verify retrieval**:
```javascript
// Test retrieval directly
const learnings = await memoryRetrieval.retrieve({
objective: 'your objective',
taskType: 'test-fix',
});
console.log('Retrieved:', learnings.length);
```
5. **Check for initialization**:
```javascript
// Ensure MemoryRetrieval is created and passed to orchestrator
const memoryRetrieval = new MemoryRetrieval({
knowledgeDir: '.aiwg/knowledge',
relevanceThreshold: 0.6,
});
```
### Problem: Bad Learnings Promoted
**Symptoms:**
- Retrieved learnings recommend anti-patterns
- Confidence scores low for stored learnings
- Success rate < 50% for specific learnings
**Diagnosis:**
```bash
# Find low-confidence learnings
jq '.learnings[] | select(.confidence < 0.5) | {id, type, confidence, successRate}' \
.aiwg/knowledge/ralph-learnings.json
# Find learnings with low success rate
jq '.learnings[] | select(.successRate < 0.5) | {id, type, successRate}' \
.aiwg/knowledge/ralph-learnings.json
```
**Likely Causes:**
- Loop completed but quality was poor
- Learning extracted from failed iteration
- Promotion criteria too lenient
**Solutions:**
1. **Manual cleanup**:
```bash
# Identify bad learning ID
jq '.learnings[] | select(.id == "learn-abc123")' .aiwg/knowledge/ralph-learnings.json
# Remove it (requires manual edit or script)
```
Edit `.aiwg/knowledge/ralph-learnings.json` and remove the entry.
2. **Stricter promotion criteria**:
```javascript
// In MemoryPromotion options
memoryPromotion.setPromotionCriteria({
minConfidence: 0.7, // Was 0.5
minSuccessRate: 0.8, // Was 0.6 (after multiple uses)
});
```
3. **Automatic deprecation**:
```javascript
// Deprecate learnings with low success after N uses
memoryPromotion.enableAutoDeprecation({
minUses: 3,
deprecateBelow: 0.4, // Success rate
});
```
4. **Prevention strategies**:
- Only promote from iterations with high post-validation scores
- Filter out learnings from loops that required human intervention
- Review learnings before promotion (HITL gate)
```javascript
// In MemoryPromotion
memoryPromotion.setRequireValidation(true);
// Triggers HITL gate before promoting to semantic memory
```
### Problem: Memory Corruption
**Symptoms:**
- Error: "Checksum mismatch - data may be corrupted"
- SemanticMemory fails to load
- Backup recovery attempted
**Diagnosis:**
```bash
# Check error logs
grep "Checksum mismatch" .aiwg/ralph-external/*/logs/*.log
# Verify backup exists
ls -lh .aiwg/knowledge/ralph-learnings.json.bak
```
**Likely Causes:**
- File corruption (disk I/O error, interrupted write)
- Concurrent writes from multiple loops
- Manual editing introduced JSON syntax error
**Solutions:**
1. **Recovery from backup**:
If backup is valid (checksum matches):
```bash
# Automatic recovery happens on next load
# Or manually restore:
cp .aiwg/knowledge/ralph-learnings.json.bak .aiwg/knowledge/ralph-learnings.json
```
2. **If backup also corrupted**:
```bash
# Rebuild from loop-specific learnings
cat .aiwg/ralph-external/*/promoted-learnings.json | jq -s 'add' > rebuilt.json
# Manually create new store
echo '{
"version": "1.0.0",
"learnings": [],
"stats": {"totalLearnings": 0, "byType": {}, "byTaskType": {}}
}' > .aiwg/knowledge/ralph-learnings.json
```
3. **Prevention**:
- Enable file locking for concurrent access
- Use atomic writes (write to temp file, then rename)
- Regular backups (automated)
```javascript
// In SemanticMemory options
semanticMemory.setBackupInterval(10); // Backup every 10 saves
semanticMemory.enableFileLocking(true); // Prevent concurrent writes
```
4. **Verify JSON syntax**:
```bash
jq empty .aiwg/knowledge/ralph-learnings.json
# If valid: no output
# If invalid: error message with line number
```
---
## 4. Oversight Issues
The Overseer Layer monitors loop health and triggers interventions.
### Problem: False Positive Interventions
**Symptoms:**
- Overseer triggers PAUSE for healthy loops
- "Stuck loop" detected when progress is steady
- Excessive escalations to human
**Diagnosis:**
```bash
# Check detection patterns
jq '.healthCheckLog[] | select(.interventions | length > 0) | {iter: .iterationNumber, detections, interventions}' \
.aiwg/ralph-external/overseer/health-{loop-id}.json
```
**Solutions:**
1. **Threshold tuning**:
```javascript
// In BehaviorDetector options
behaviorDetector.updateThresholds({
stuckLoop: {
iterationsWithoutProgress: 4, // Was 3
minProgressDelta: 0.03, // Was 0.05
},
oscillation: {
derivativeSwing: 0.4, // Was 0.3
consecutiveSwings: 3, // Was 2
},
qualityRegression: {
consecutive: 3, // Was 2
minDelta: 0.1, // Was 0.05
},
});
```
2. **Behavior detector sensitivity**:
```javascript
// Make less sensitive overall
behaviorDetector.setSensitivity('low'); // Was 'normal'
// Options: 'low', 'normal', 'high'
```
3. **Disable specific detectors** (not recommended):
```javascript
behaviorDetector.disableDetector('oscillation');
// Only if genuinely problematic for this task type
```
4. **Adjust intervention levels**:
```javascript
// Make stuck loop less severe
interventionSystem.setInterventionLevel('stuckLoop', 'WARN'); // Was 'REDIRECT'
```
### Problem: Missed Problems (False Negatives)
**Symptoms:**
- Loop stuck for 5+ iterations without intervention
- Quality regresses but no detection
- Infinite loops not caught
**Diagnosis:**
```bash
# Check if detections are firing
jq '.healthCheckLog[] | {iter: .iterationNumber, detections: .detections | length}' \
.aiwg/ralph-external/overseer/health-{loop-id}.json
```
If `detections: 0` for all iterations, detectors aren't firing.
**Solutions:**
1. **Stricter thresholds**:
```javascript
behaviorDetector.updateThresholds({
stuckLoop: {
iterationsWithoutProgress: 2, // Was 3
},
qualityRegression: {
consecutive: 1, // Was 2 (detect immediately)
},
});
```
2. **Increase sensitivity**:
```javascript
behaviorDetector.setSensitivity('high'); // Was 'normal'
```
3. **Enable additional checks**:
```javascript
// Add resource burn detection
behaviorDetector.enableDetector('resourceBurn', {
maxIterations: 10,
warnThreshold: 7,
});
```
4. **Verify overseer initialization**:
```javascript
// Ensure Overseer is created and check() is called after each iteration
const overseer = new Overseer(loopId, objective, { autoEscalate: true });
// After each iteration:
const healthCheck = await overseer.check(iterationResult);
```
5. **Check for exceptions**:
```bash
# Look for errors in overseer logs
grep -i "error" .aiwg/ralph-external/overseer/*.log
```
### Problem: Escalation Not Working
**Symptoms:**
- ABORT intervention triggered but no human notification
- Escalations logged but no Gitea issue created
- Email notifications not sent
**Diagnosis:**
```bash
# Check escalation attempts
jq '.escalations' .aiwg/ralph-external/overseer/escalations/{loop-id}.json
# Check for errors
grep "escalation" .aiwg/ralph-external/*/logs/*.log
```
**Solutions:**
1. **Channel configuration**:
```javascript
// In EscalationHandler options
escalationHandler.setChannels({
gitea: true, // Create Gitea issue
email: false, // Send email (requires config)
cli: true, // CLI notification
});
```
2. **Gitea configuration**:
```bash
# Verify Gitea token exists
ls -lh ~/.config/gitea/token
# Test Gitea API
bash <<'EOF'
TOKEN=$(cat ~/.config/gitea/token)
curl -s -H "Authorization: token $TOKEN" \
"https://git.integrolabs.net/api/v1/user" | jq .
EOF
```
If token invalid or missing, create at: https://git.integrolabs.net/user/settings/applications
3. **Notification permissions**:
```javascript
// Ensure escalation handler has repo permissions
escalationHandler.setRepo('roctinam/ai-writing-guide');
```
4. **Fallback behavior**:
```javascript
// If Gitea fails, fallback to CLI notification
escalationHandler.setFallback('cli');
```
5. **Test escalation manually**:
```javascript
const escalation = await escalationHandler.escalate({
level: 'WARN',
reason: 'Test escalation',
loopId: 'test-001',
context: { test: true },
});
console.log('Escalation result:', escalation);
```
---
## 5. Integration Issues
Multiple layers interacting can cause conflicts or performance problems.
### Problem: Layers Conflicting
**Symptoms:**
- PID recommends "continue" but Strategy Planner recommends "escalate"
- Validation passes but Overseer triggers PAUSE
- Redundant checks slow down iterations
**Diagnosis:**
```bash
# Compare recommendations
jq '.iterations[] | {iter: .number, pidAction: .controlSignals.action, strategyPlan: .strategy.strategy, overseerStatus: .healthReport.status}' \
.aiwg/ralph-external/{loop-id}/state.json
```
**Solutions:**
1. **PID vs Strategy Planner disagreement**:
This is expected behavior. Resolution:
- **PID signals**: Quantitative (metrics-based)
- **Strategy Planner**: Qualitative (context-based)
When they disagree:
- If PID says "continue" but Strategy says "escalate" → Escalate (qualitative override)
- If PID says "abort" but Strategy says "continue" → Abort (safety override)
2. **Validation vs Overseer redundancy**:
Overseer can detect behaviors Validation can't (stuck loops, oscillation). If both firing for same issue:
```javascript
// Disable redundant validation checks
validationAgent.disableCheck('progressCheck'); // Overseer handles this
```
3. **Resolution strategies**:
```javascript
// Priority order (highest to lowest):
// 1. Overseer ABORT/PAUSE
// 2. Strategy Planner escalate
// 3. Validation fail
// 4. PID control signals
// Implement conflict resolution
function resolveActions(pidAction, strategyPlan, overseerStatus, validationResult) {
if (overseerStatus === 'aborted' || overseerStatus === 'paused') {
return overseerStatus;
}
if (strategyPlan === 'escalate_to_human') {
return 'escalate';
}
if (!validationResult.valid && validationResult.severity === 'critical') {
return 'pause';
}
return pidAction; // Default to PID recommendation
}
```
### Problem: Performance Degradation
**Symptoms:**
- Iterations take >2 minutes longer than before Epic #26
- High CPU usage
- Long pauses between iterations
**Diagnosis:**
```bash
# Check iteration timing
jq '.iterations[] | {iter: .number, duration: .duration, pidTime: .pidMetrics.computeTime, overseerTime: .healthReport.computeTime}' \
.aiwg/ralph-external/{loop-id}/state.json
```
**Solutions:**
1. **Identify slow layers**:
- **PID metrics**: Should be < 10ms
- **Overseer check**: Should be < 50ms
- **Claude prompt generation**: Can be 1-2 minutes (worth it)
- **Memory retrieval**: Should be < 100ms
2. **Disable unnecessary checks**:
```javascript
// If not using PID control for this task
config.enablePIDControl = false;
// If overseer not critical for this task
config.enableOverseer = false;
```
3. **Optimize memory retrieval**:
```javascript
// Limit retrieval count
memoryRetrieval.setMaxResults(5); // Was 10
// Cache recent retrievals
memoryRetrieval.enableCaching(true);
```
4. **Reduce Claude prompt generation frequency**:
```javascript
// Use fallback prompts every other iteration
claudePromptGenerator.setFrequency(0.5); // 50% of iterations
```
5. **Caching strategies**:
```javascript
// Cache StateAssessor output for similar states
stateAssessor.enableCaching({
maxAge: 300000, // 5 minutes
similarityThreshold: 0.9,
});
```
---
## 6. Recovery Procedures
### How to Reset Epic #26 State
**When to reset:**
- PID control is completely diverged
- Memory contains too many bad learnings
- Overseer state is corrupted
**Steps:**
1. **Clear PID state**:
```bash
# Remove PID metrics file
rm .aiwg/ralph-external/{loop-id}/pid-metrics.json
# Reset in-memory state
pidController.reset();
```
2. **Clear Overseer state**:
```bash
# Remove health check log
rm .aiwg/ralph-external/overseer/health-{loop-id}.json
# Reset overseer
overseer = new Overseer(loopId, objective);
```
3. **Clear semantic memory** (CAUTION: loses all learnings):
```bash
# Backup first
cp .aiwg/knowledge/ralph-learnings.json .aiwg/knowledge/ralph-learnings.json.backup-$(date +%Y%m%d)
# Reset
rm .aiwg/knowledge/ralph-learnings.json
# Will auto-initialize on next use
```
4. **Partial reset** (keep learnings, reset metrics):
```bash
# Keep semantic memory, reset loop-specific state
rm .aiwg/ralph-external/{loop-id}/pid-metrics.json
rm .aiwg/ralph-external/overseer/health-{loop-id}.json
```
### How to Disable Epic #26 Temporarily
**Emergency bypass** (if Epic #26 causing loop to fail):
```javascript
// In orchestrator initialization
const config = {
enablePIDControl: false,
enableOverseer: false,
enableSemanticMemory: false,
};
```
This reverts to pre-Epic #26 behavior (basic Ralph External loop).
**Gradual re-enablement**:
```javascript
// Enable one layer at a time
// Iteration 1: Enable PID only
config.enablePIDControl = true;
// Iteration 2: Add Overseer
config.enableOverseer = true;
// Iteration 3: Add Memory
config.enableSemanticMemory = true;
```
### Starting Fresh After Corruption
If all state is corrupted and backups failed:
```bash
# 1. Backup corrupted state
mkdir .aiwg/ralph-external/corrupted-$(date +%Y%m%d)
mv .aiwg/ralph-external/{loop-id}/* .aiwg/ralph-external/corrupted-$(date +%Y%m%d)/
# 2. Remove corrupted memory
mv .aiwg/knowledge/ralph-learnings.json .aiwg/knowledge/corrupted-$(date +%Y%m%d).json
# 3. Initialize fresh
# Orchestrator will auto-initialize all state on next run
# 4. Re-run loop from beginning
aiwg ralph "Your objective" --completion "Your completion criteria"
```
---
## Common Error Messages
| Error Message | Meaning | Solution |
|---------------|---------|----------|
| `PIDController not initialized. Call initialize() first.` | PID controller used before initialization | Call `pidController.initialize(taskConfig)` |
| `Checksum mismatch - data may be corrupted` | Semantic memory file corrupted | Restore from `.bak` backup or rebuild |
| `Claude invocation failed: ENOENT` | `claude` CLI not in PATH | Install Claude Code or use fallback prompts |
| `No learnings retrieved for task type: X` | Memory has no learnings for this task | Complete more loops or lower relevance threshold |
| `Overseer detected ABORT-level behavior` | Critical loop pathology | Check health report, fix root cause, resume |
| `GainScheduler: task complexity unknown` | Task config missing complexity estimate | Set `taskConfig.estimatedComplexity` (1-10) |
| `InterventionSystem: no handler for action X` | Intervention type not configured | Add handler or disable that intervention type |
| `EscalationHandler: Gitea API failed (401)` | Gitea token invalid/expired | Regenerate token at https://git.integrolabs.net |
| `MemoryPromotion: confidence below threshold` | Learning not confident enough to promote | Lower `minConfidence` or improve learning quality |
| `ValidationAgent: schema validation failed` | Artifact doesn't match expected schema | Disable schema check or fix artifact format |
---
## Diagnostic Commands
```bash
# Quick health check
jq '{
config: .config,
iteration: .currentIteration,
status: .status,
pidEnabled: .config.enablePIDControl,
overseerEnabled: .config.enableOverseer,
memoryEnabled: .config.enableSemanticMemory
}' .aiwg/ralph-external/{loop-id}/state.json
# PID diagnostics
jq '.iterations[-5:] | .[] | {
iter: .number,
progress: .analysis.completionPercentage,
p: .pidMetrics.proportional,
i: .pidMetrics.integral,
d: .pidMetrics.derivative,
signal: .controlSignals.controlSignal,
action: .controlSignals.action
}' .aiwg/ralph-external/{loop-id}/state.json
# Overseer diagnostics
jq '.healthCheckLog[-5:] | .[] | {
iter: .iterationNumber,
status: .status,
detections: .detections | length,
interventions: .interventions | map(.action)
}' .aiwg/ralph-external/overseer/health-{loop-id}.json
# Memory diagnostics
jq '{
total: .stats.totalLearnings,
byType: .stats.byType,
avgConfidence: (.learnings | map(.confidence) | add / length),
avgSuccessRate: (.learnings | map(.successRate) | add / length)
}' .aiwg/knowledge/ralph-learnings.json
# Claude Intelligence diagnostics
jq '.iterations[-5:] | .[] | {
iter: .number,
strategyPlan: .strategy.strategy,
strategyConfidence: .strategy.confidence,
preValid: .preValidation.valid,
postValid: .postValidation.valid,
promptLength: .claudePrompt | length
}' .aiwg/ralph-external/{loop-id}/state.json
```
---
## References
- @tools/ralph-external/README.md - Architecture overview
- @.aiwg/working/epic-26-integration-complete.md - Integration summary
- @.claude/rules/anti-laziness.md - Oversight behavioral patterns
- @.aiwg/research/findings/REF-015-self-refine.md - PID control research
- @.aiwg/research/findings/REF-021-reflexion.md - Memory layer research
## Support
For issues not covered in this guide:
1. Check `.aiwg/ralph-external/{loop-id}/logs/` for detailed logs
2. Search issues at: https://github.com/jmagly/aiwg/issues
3. Create new issue with: loop ID, error messages, diagnostic output
**Diagnostic Package:**
```bash
# Generate comprehensive diagnostic package
tar czf ralph-diagnostics-$(date +%Y%m%d).tar.gz \
.aiwg/ralph-external/{loop-id}/ \
.aiwg/knowledge/ralph-learnings.json \
.aiwg/ralph-external/overseer/
```
Attach to issue report for faster troubleshooting.