aiwg
Version:
Deployment tool and support utility for AI context. Copies agents, skills, commands, rules, and behaviors into the paths each AI platform reads (Claude Code, Codex, Copilot, Cursor, Warp, OpenClaw, and 6 more) so one source of truth works across 10 platfo
674 lines (529 loc) • 21.7 kB
Markdown
# Reflexion Episodic Memory Guide
Comprehensive guide to Al's episodic memory system based on the Reflexion framework.
## Overview
Al implements **Reflexion's three-model architecture** for verbal reinforcement learning:
1. **Actor (Ma)** - Executes actions and generates code changes
2. **Evaluator (Me)** - Verifies results using external tools (npm test, tsc, eslint)
3. **Self-Reflection (Msr)** - Analyzes failures and generates actionable insights
After each failed iteration, Al generates a **structured reflection** stored in episodic memory. These reflections are injected into retry attempts, enabling learning without model retraining.
## Theoretical Foundation
**Research Basis**: REF-021 Reflexion - Language Agents with Verbal Reinforcement Learning (NeurIPS 2023)
**Key Results**:
- 91% pass@1 on HumanEval (surpasses GPT-4's 80% baseline)
- +24% task success through verbal reinforcement
- Learning occurs at inference time through context injection
**Core Insight**: Natural language reflections provide more actionable guidance than scalar rewards, enabling rapid learning through episodic memory.
See `@$AIWG_ROOT/docs/references/REF-021-reflexion-verbal-reinforcement.md` for complete research documentation.
## Three-Model Architecture
### 1. Actor Model (Ma)
**Role**: Generates text and actions based on current state and episodic memory.
**Policy**: `πθ(at|st)` where `θ = {Ma, mem}`
- `at` = action or text generation at time t
- `st` = current state (task + trajectory history)
- `mem` = episodic memory buffer (sliding window of reflections)
**Implementation in Al**:
```typescript
interface ActorOutput {
actions: Action[]; // Sequence of actions taken
rationale: string; // Reasoning for approach
strategy: string; // High-level strategy
files_modified: string[]; // Files touched
total_changes: ChangeStats; // Aggregate statistics
}
```
**Actor Variants**:
- **Chain-of-Thought (CoT)**: Step-by-step reasoning for single-generation tasks
- **ReAct**: Interleaved reasoning and acting for multi-step tasks (Al's default)
### 2. Evaluator Model (Me)
**Role**: Scores generated outputs to produce reward signal.
**Evaluation Strategies by Task Type**:
| Task Type | Evaluation Method | Al Implementation |
|-----------|------------------|---------------------|
| Programming | Unit tests + execution | `npm test`, `npm run test:coverage` |
| Type Safety | Compilation checks | `tsc --noEmit` |
| Code Quality | Linting | `eslint`, `markdownlint` |
| Integration | External API calls | Gitea API responses |
| Combined | Multiple tools | All of the above |
**Implementation in Al**:
```typescript
interface EvaluatorOutput {
passed: boolean; // Overall pass/fail
verification_type: VerificationType; // Type of verification
results: VerificationResult[]; // Individual tool results
errors: StructuredError[]; // Parsed error information
reward_signal: number; // Scalar reward [0.0, 1.0]
metrics: VerificationMetrics; // Quantitative metrics
}
```
**Reward Signal Calculation**:
```typescript
// Example: Combined verification
reward = (
(tests_passed / tests_total) * 0.5 +
(type_errors === 0 ? 0.3 : 0.0) +
(lint_errors === 0 ? 0.2 : 0.0)
);
```
### 3. Self-Reflection Model (Msr)
**Role**: Converts sparse rewards into detailed verbal feedback.
**Input**: `{trajectory τt, reward rt, episodic memory mem}`
**Output**: Natural language reflection containing:
1. **Credit assignment** - Identification of specific failing actions
2. **Causal reasoning** - Explanation of why actions led to failure
3. **Actionable insights** - Concrete suggestions for improvement
**Implementation in Al**:
```typescript
interface SelfReflection {
reflection_text: string; // First-person narrative reflection
credit_assignment: {
failing_action_indices: number[]; // Which actions failed
root_cause: string; // Identified root cause
failure_category: FailureCategory; // Error classification
};
causal_reasoning: string; // Why failure occurred
actionable_insights: string[]; // What to do next
lessons_learned: string[]; // General lessons
confidence: number; // Self-assessed confidence [0.0, 1.0]
related_reflections: number[]; // Previous relevant reflections
}
```
**Example Reflection** (from schema):
> "In my previous attempt, I added tests for the login function but didn't account for edge cases where the API response might be empty or undefined. The error 'Cannot read property map of undefined' occurred because I tried to call .map() on userData without first checking if it exists. In my next attempt, I will add null checks before accessing userData properties."
## Memory Architecture
### Short-Term Memory
**Current trajectory history**: `τt = [a0, o0, ..., ai, oi]`
- Represents immediate context and recent decisions
- Stored in current iteration state
### Long-Term Memory (Episodic Buffer)
**Reflection storage**: `mem = [sr0, sr1, ..., srt]`
- Maximum capacity **Ω** (omega) respects context limits
- Most recent experiences inform future decisions
- Provides "lessons learned" across trials
**Storage Location**: `.aiwg/ralph/reflections/<loop-id>/`
**Sliding Window Behavior**:
```
Ω=3 example (keep last 3 reflections):
Iteration 0 fails → reflection 001.json (in context)
Iteration 1 fails → reflection 002.json (in context)
Iteration 2 fails → reflection 003.json (in context)
Iteration 3 fails → reflection 004.json (in context)
001.json excluded from context
Iteration 4 fails → reflection 005.json (in context)
002.json excluded from context
```
**Memory Operations**:
```typescript
// Initialize
let mem: Reflection[] = [];
// After each failed trial
async function afterTrial(iteration: number, trajectory: Trajectory, reward: number) {
// 1. Generate reflection
const reflection = await generateReflection(trajectory, reward, mem);
// 2. Append to memory
mem.push(reflection);
// 3. Truncate to Ω capacity
if (mem.length > OMEGA_CAPACITY) {
mem = mem.slice(-OMEGA_CAPACITY);
}
// 4. Persist to disk
await saveReflection(reflection);
// 5. Update metadata
await updateMemoryMetadata({
omega_capacity: OMEGA_CAPACITY,
current_memory_size: mem.length,
reflections_in_context: mem.map(r => r.iteration),
total_reflections_generated: iteration + 1
});
}
```
## When Reflections Are Generated
Reflections are created **after failed verification** only:
| Outcome | Generate Reflection? | Next Action |
|---------|---------------------|-------------|
| All verifications pass | NO | Complete loop successfully |
| Some verifications fail | YES | Generate reflection → retry |
| Max iterations reached | YES | Generate final reflection → abort |
| Critical error | YES | Generate error reflection → abort |
**Verification Flow**:
```
Attempt → Execute actions → Verify results
↓
All passed?
/ \
YES NO
↓ ↓
Success Generate reflection
↓
Add to memory
↓
Inject into next attempt
↓
Retry
```
## Reflection Prompt Template
**System Prompt for Self-Reflection Model**:
```markdown
You are the Self-Reflection component in a Reflexion-based learning system.
You will be given:
1. Your previous implementation attempt (actions taken)
2. Verification results from external tools (tests, linters, type checker)
3. Your past reflections from similar failures (if any)
Your task is to analyze what went wrong and provide actionable guidance for the next attempt.
## Reflection Structure
Write a first-person narrative reflection that includes:
1. **Credit Assignment**: Which specific actions or code changes caused the failure?
2. **Causal Reasoning**: Why did these actions lead to failure? What was the underlying issue?
3. **Actionable Insights**: What concrete steps should be taken in the next attempt?
Be specific. Avoid generic advice like "be more careful" - instead, identify exact code patterns, missing checks, or logic errors.
## Example Reflection
"In my previous attempt, I tried to map over userData without checking if it exists. The error occurred because the API response was empty in the test case. I should add a null check before the map operation. In the next attempt, I will verify userData exists and return an empty array if it doesn't."
## Previous Reflections
{{#each previous_reflections}}
### Iteration {{this.iteration}}
{{this.reflection_text}}
Lessons learned:
{{#each this.lessons_learned}}
- {{this}}
{{/each}}
{{/each}}
## Current Failure
**Task**: {{task_description}}
**Actions Taken**:
{{#each actions}}
{{@index}}. {{this.description}}
File: {{this.file_path}}
Changes: +{{this.changes.additions}} -{{this.changes.deletions}}
{{/each}}
**Verification Results**:
{{#each verification_results}}
- {{this.tool}}: {{this.status}}
{{#if this.stderr}}
Error: {{this.stderr}}
{{/if}}
{{/each}}
**Errors**:
{{#each errors}}
- {{this.type}}: {{this.message}}
Location: {{this.file}}:{{this.line}}
{{/each}}
Now write your reflection following the structure above.
```
## How to Query Past Reflections
### Loading Reflections for Current Task
```typescript
import { loadReflections } from '@/ralph/memory';
// Load all reflections for a loop
const reflections = await loadReflections('ralph-task-123');
// Get reflections in current window (respects Ω)
const activeReflections = reflections.filter(r =>
r.memory_metadata.reflections_in_context.includes(r.iteration)
);
// Inject into retry prompt
const context = buildRetryContext({
task: currentTask,
previousReflections: activeReflections.map(r => r.self_reflection.reflection_text),
failedActions: currentFailure.actor_output.actions,
errors: currentFailure.evaluator_output.errors
});
```
### Cross-Task Learning Patterns
**Find similar failures across loops**:
```typescript
import { searchReflections } from '@/ralph/memory';
// Query by failure category
const similarFailures = await searchReflections({
failure_category: 'edge_case_miss',
min_confidence: 0.8,
limit: 5
});
// Extract lessons learned
const lessons = similarFailures.flatMap(r =>
r.self_reflection.lessons_learned
);
// Inject as general knowledge
const enhancedContext = {
...baseContext,
prior_knowledge: lessons
};
```
**Analyze improvement patterns**:
```typescript
import { analyzePerformance } from '@/ralph/analytics';
// Track reward progression
const loopHistory = await loadReflections('ralph-task-123');
const rewards = loopHistory.map(r => r.evaluator_output.reward_signal);
// Calculate learning rate
const learningRate = (rewards[rewards.length - 1] - rewards[0]) / rewards.length;
// Identify breakthrough moments
const improvements = loopHistory.filter(r =>
r.performance_delta?.is_improvement === true
);
```
### Example: Learning from Past API Integration Failures
```typescript
// Scenario: New API integration task
const task = "Integrate Gitea webhook API";
// Step 1: Find past API-related reflections
const apiReflections = await searchReflections({
task_keywords: ['API', 'integration', 'webhook'],
failure_category: ['edge_case_miss', 'integration_error'],
min_confidence: 0.7
});
// Step 2: Extract common lessons
const commonPatterns = extractPatterns(apiReflections, {
min_frequency: 2, // Lesson appears in ≥2 reflections
categories: ['actionable_insights', 'lessons_learned']
});
// Step 3: Inject as prior knowledge
const taskContext = {
task_description: task,
prior_api_lessons: commonPatterns,
similar_successes: apiReflections.filter(r =>
r.evaluator_output.passed === true
)
};
// Step 4: Execute with enhanced context
const result = await executeWithContext(taskContext);
```
## Memory Capacity Tuning (Ω Parameter)
**Choosing Ω based on task complexity**:
| Task Type | Recommended Ω | Rationale |
|-----------|--------------|-----------|
| Simple programming (single function) | 1 | Clear failure modes, quick fixes |
| Complex programming (multi-file) | 3 | Multiple error types, iterative refinement |
| Decision-making (multi-step) | 3 | Long trajectories, credit assignment needed |
| Reasoning (multi-hop) | 3 | Complex causal chains |
| Research/exploration | 5+ | Experimental, may exceed context limits |
**Empirical Evidence from Reflexion Paper**:
- **HumanEval (programming)**: Ω=1 optimal
- **AlfWorld (decision-making)**: Ω=3 optimal
- **HotPotQA (reasoning)**: Ω=3 optimal
**AIWG Defaults**:
```typescript
const OMEGA_DEFAULTS = {
'unit_tests': 1, // Simple test failures
'integration_tests': 3, // Complex integration issues
'type_check': 1, // Type errors are usually clear
'lint': 1, // Lint errors are specific
'combined': 3, // Multiple verification types
'manual_review': 5 // Subjective feedback needs history
};
```
**Dynamic Tuning**:
```typescript
// Adjust Ω based on failure diversity
function calculateOptimalOmega(reflections: Reflection[]): number {
const uniqueCategories = new Set(
reflections.map(r => r.self_reflection.credit_assignment.failure_category)
);
// More diverse failures → larger window
if (uniqueCategories.size >= 5) return 5;
if (uniqueCategories.size >= 3) return 3;
return 1;
}
```
## Performance Analysis
### Metrics to Track
**Individual Reflection Quality**:
- `self_reflection.confidence` - Self-assessed accuracy
- `performance_delta.is_improvement` - Did next iteration improve?
- Correlation between confidence and actual improvement
**Loop Performance**:
- Reward progression: `[r0, r1, r2, ..., rn]`
- Error count reduction over iterations
- Time to success (iterations needed)
- Failure category distribution
**Cross-Loop Learning**:
- Reuse rate of lessons learned
- Success rate on tasks similar to past failures
- Time to success improvement on repeated task types
### Example Analysis Script
```typescript
import { loadReflections, analyzeLoopPerformance } from '@/ralph/analytics';
async function analyzeRalphLearning() {
// Load all completed loops
const loops = await loadAllLoops();
// Analyze each loop
const analyses = await Promise.all(
loops.map(async loop => {
const reflections = await loadReflections(loop.id);
return {
loop_id: loop.id,
total_iterations: reflections.length,
final_success: loop.status === 'completed',
learning_curve: reflections.map(r => r.evaluator_output.reward_signal),
failure_categories: reflections.map(r =>
r.self_reflection.credit_assignment.failure_category
),
lessons_count: reflections.reduce((sum, r) =>
sum + r.self_reflection.lessons_learned.length, 0
),
avg_confidence: reflections.reduce((sum, r) =>
sum + (r.self_reflection.confidence || 0), 0
) / reflections.length
};
})
);
// Aggregate insights
const totalLearningRate = analyses.reduce((sum, a) => {
const curve = a.learning_curve;
const rate = curve.length > 1
? (curve[curve.length - 1] - curve[0]) / curve.length
: 0;
return sum + rate;
}, 0) / analyses.length;
console.log('Al Learning Analysis:');
console.log(`- Total loops: ${analyses.length}`);
console.log(`- Success rate: ${analyses.filter(a => a.final_success).length / analyses.length}`);
console.log(`- Avg learning rate: ${totalLearningRate.toFixed(3)}`);
console.log(`- Avg iterations: ${analyses.reduce((s, a) => s + a.total_iterations, 0) / analyses.length}`);
}
```
## Integration with Al External
Al's external loop implementation (`tools/ralph-external/`) uses episodic memory for recovery:
**Integration Points**:
1. **Initialization** - Load past reflections if resuming
2. **Iteration Start** - Inject active reflections into context
3. **Verification Failure** - Generate and store reflection
4. **Retry Attempt** - Include reflection in next iteration's prompt
5. **Completion** - Analyze reflection quality and learning curve
**File Mapping**:
```
tools/ralph-external/
├── core/
│ ├── memory.ts # Reflection loading/saving
│ ├── reflection.ts # Reflection generation
│ └── evaluator.ts # External verification
├── prompts/
│ ├── actor.hbs # Includes {{previous_reflections}}
│ └── reflection.hbs # Self-reflection template
└── state/
└── <loop-id>/
└── reflections/ # Symlink to .aiwg/ralph/reflections/<loop-id>/
```
**Reflection Injection Example**:
```typescript
// tools/ralph-external/core/actor.ts
import { loadActiveReflections } from './memory';
async function executeIteration(loopId: string, iteration: number) {
// Load reflections in window
const reflections = await loadActiveReflections(loopId);
// Build context with reflections
const context = {
task: currentTask,
iteration,
previous_reflections: reflections.map(r => ({
iteration: r.iteration,
reflection_text: r.self_reflection.reflection_text,
lessons_learned: r.self_reflection.lessons_learned,
actionable_insights: r.self_reflection.actionable_insights
})),
previous_errors: reflections.flatMap(r =>
r.evaluator_output.errors
)
};
// Execute with enhanced context
const result = await actor.execute(context);
return result;
}
```
## Best Practices
### Writing Quality Reflections
**DO**:
- ✅ Use first person ("I tried...", "In my next attempt...")
- ✅ Be specific about failing actions (cite line numbers, function names)
- ✅ Explain causal chain (X led to Y because Z)
- ✅ Provide concrete next steps ("Add null check at line 23")
- ✅ Reference previous reflections when applicable
- ✅ Assess confidence honestly
**DON'T**:
- ❌ Write generic advice ("Be more careful", "Test thoroughly")
- ❌ Blame external factors without analysis
- ❌ Repeat previous reflections verbatim
- ❌ Ignore verification errors in output
- ❌ Claim high confidence without evidence
### Optimizing Memory Usage
**Context Length Management**:
```typescript
// Estimate reflection size
function estimateTokens(reflection: Reflection): number {
const text = reflection.self_reflection.reflection_text;
const insights = reflection.self_reflection.actionable_insights.join(' ');
return Math.ceil((text + insights).length / 4); // Rough estimate
}
// Ensure reflections fit in context window
function pruneReflectionsToFit(
reflections: Reflection[],
maxTokens: number
): Reflection[] {
const sorted = reflections.sort((a, b) => b.iteration - a.iteration);
let totalTokens = 0;
const result = [];
for (const r of sorted) {
const tokens = estimateTokens(r);
if (totalTokens + tokens > maxTokens) break;
result.push(r);
totalTokens += tokens;
}
return result.reverse(); // Maintain chronological order
}
```
### Debugging Reflection Quality
**Low-Quality Reflection Indicators**:
- Confidence < 0.5 but is_improvement = false
- No actionable insights provided
- Reflection text < 100 characters
- No credit assignment identified
**Improvement Strategies**:
1. Enhance reflection prompt with more context
2. Include specific error details in prompt
3. Show examples of high-quality reflections
4. Require minimum reflection length
5. Validate reflection structure before saving
## Validation
**Schema Validation**:
```bash
# Validate reflection against schema
npx ajv validate \
-s agentic/code/addons/ralph/schemas/reflection-memory.json \
-d .aiwg/ralph/reflections/ralph-task-123/001.json
```
**Runtime Validation**:
```typescript
import Ajv from 'ajv';
import schema from '@/agentic/code/addons/ralph/schemas/reflection-memory.json';
const ajv = new Ajv();
const validate = ajv.compile(schema);
function validateReflection(reflection: unknown): Reflection {
if (!validate(reflection)) {
throw new Error(`Invalid reflection: ${ajv.errorsText(validate.errors)}`);
}
return reflection as Reflection;
}
```
## References
### AIWG Documentation
- **@$AIWG_ROOT/agentic/code/addons/ralph/schemas/reflection-memory.json** - JSON Schema definition
- **@.aiwg/ralph/reflections/.gitkeep** - Directory structure documentation
- **@.aiwg/ralph/reflections/example/001.json** - Example reflection
- **@$AIWG_ROOT/docs/references/REF-021-reflexion-verbal-reinforcement.md** - Research foundation
- **@$AIWG_ROOT/tools/ralph-external/README.md** - External loop implementation
### Research References
- **Reflexion (NeurIPS 2023)** - Shinn et al.
- 91% HumanEval pass@1 (surpasses GPT-4 baseline)
- Episodic memory with sliding window (Ω parameter)
- Three-model architecture: Actor, Evaluator, Self-Reflection
- arXiv: https://arxiv.org/abs/2303.11366
### Related AIWG Issues
- **#94** - Parent epic: Reflexion Integration
- **#102** - This implementation
- **#103** - Self-reflection prompt optimization
- **#104** - Cross-task learning analytics
---
**Document Version**: 1.0.0
**Last Updated**: 2026-01-25
**Status**: IMPLEMENTED
## Changelog
| Date | Change |
|------|--------|
| 2026-01-25 | Initial implementation following REF-021 specification |