aiwg
Version:
Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.
258 lines (183 loc) • 5.77 kB
Markdown
# Resilience & Recovery Patterns
Structured recovery protocols for handling failures in agentic workflows.
## Research Foundation
**REF-002**: Roig (2025) Key Finding
> "Recovery capability—not initial correctness—is the dominant predictor of agentic task success."
DeepSeek V3.1 achieves 92% success via post-training for verification/recovery, not model scale.
## The PAUSE-DIAGNOSE-ADAPT-RETRY-ESCALATE Protocol
### PAUSE
On any error or unexpected result:
```
ERROR DETECTED: [error message or unexpected behavior]
STATE PRESERVED: [what state was captured]
EXECUTION HALTED: Awaiting diagnosis
```
**Do NOT**:
- Immediately retry the same operation
- Continue with partial results
- Ignore the error and proceed
### DIAGNOSE
Analyze the error systematically:
```
DIAGNOSIS:
1. Error Type: [syntax|schema|logic|loop|resource|permission]
2. Root Cause: [specific cause identified]
3. Context: [what was being attempted]
4. History: [has this error occurred before in this session?]
```
**Error Type Decision Tree**:
```
Error
├─ Malformed output? → SYNTAX ERROR
├─ Wrong data structure? → SCHEMA MISMATCH
├─ Incorrect reasoning? → LOGIC ERROR
├─ Repeated same action? → LOOP DETECTED
├─ Resource unavailable? → RESOURCE ERROR
└─ Access denied? → PERMISSION ERROR
```
### ADAPT
Choose recovery strategy based on diagnosis:
| Error Type | Recovery Strategy |
|------------|-------------------|
| **SYNTAX** | Fix formatting, validate before retry |
| **SCHEMA** | Re-inspect target, update assumptions |
| **LOGIC** | Decompose into smaller steps |
| **LOOP** | Change approach entirely |
| **RESOURCE** | Wait and retry, or use alternative |
| **PERMISSION** | Escalate to user |
**Adaptation Template**:
```
ADAPTATION:
- Original approach: [what was tried]
- Failure mode: [how it failed]
- New approach: [what will be different]
- Validation: [how success will be verified]
```
### RETRY
Execute adapted approach with verification:
```
RETRY ATTEMPT [N of 3]:
- Adapted approach: [description]
- Pre-validation: [what was checked before retry]
- Execution: [attempting...]
- Post-validation: [did it succeed?]
```
**Retry Rules**:
- Maximum 3 adapted retry attempts
- Each retry MUST use a different approach
- Same approach twice = loop, change strategy
### ESCALATE
If 3 adapted retries fail:
```
ESCALATION REQUIRED
## Recovery Summary
- Original Task: [what was requested]
- Attempts Made:
1. [approach 1] → [failure reason]
2. [approach 2] → [failure reason]
3. [approach 3] → [failure reason]
## Blocking Issue
[Specific problem preventing progress]
## Human Action Needed
[What the user should do]
- Option A: [action and expected outcome]
- Option B: [action and expected outcome]
## Partial Progress (if any)
[What was accomplished before failure]
[Location of any saved state]
```
## Loop Detection
Detect and break out of loops:
### Detection Criteria
A loop is detected when:
- Same tool called 3+ times with similar arguments
- Same error returned 2+ times consecutively
- Output is identical to previous iteration
- No progress made in 3+ iterations
### Loop Breaking
```
LOOP DETECTED:
- Pattern: [what's repeating]
- Iterations: [count]
- Last 3 actions: [summary]
BREAKING LOOP:
- Abandoning current approach
- Analyzing why loop occurred
- Selecting fundamentally different strategy
```
## Recovery Checkpoints
Save state at key points for recovery:
```
CHECKPOINT: [name]
- Task progress: [what's been completed]
- Current state: [relevant state variables]
- Next step: [what would come next]
- Recovery path: [how to resume from here]
```
**When to checkpoint**:
- After successful phase completion
- Before risky operations
- After gathering critical data
- Before external API calls
## Error-Specific Recovery Patterns
### Pattern 1: File Not Found
```
ERROR: File not found: [path]
DIAGNOSIS: Resource error - expected file missing
ADAPTATION:
1. List directory to find actual files
2. Search for similar filenames
3. Ask user if path is correct
RECOVERY:
- Found similar: [actual file] - proceeding with this
- OR: Escalating to user for correct path
```
### Pattern 2: Schema Mismatch
```
ERROR: Expected [schema A], got [schema B]
DIAGNOSIS: Schema mismatch - assumptions incorrect
ADAPTATION:
1. Inspect actual schema/structure
2. Update processing logic
3. Re-attempt with correct schema
RECOVERY:
- Actual structure: [description]
- Adjusted approach: [how processing changed]
```
### Pattern 3: Rate Limit / Timeout
```
ERROR: Rate limit exceeded / Timeout
DIAGNOSIS: Resource error - external constraint
ADAPTATION:
1. Wait with exponential backoff
2. Reduce request scope
3. Use cached data if available
RECOVERY:
- Waited [duration], retrying
- OR: Using cached version from [timestamp]
- OR: Escalating - requires fresh data
```
### Pattern 4: Logic Loop
```
ERROR: Loop detected - same action repeated
DIAGNOSIS: Logic error - approach not progressing
ADAPTATION:
1. Step back and re-analyze problem
2. Try completely different strategy
3. Decompose into smaller steps
RECOVERY:
- Previous approach: [what looped]
- New approach: [fundamentally different strategy]
```
## Success Metrics
From Unified Production Plan:
| Metric | Target |
|--------|--------|
| Recovery success rate | ≥80% after initial failure |
| Auto-retry coverage | ≥2 adapted retries before escalating |
| Loop detection | Detect within 3 iterations |
| Escalation quality | Actionable human guidance |
## Integration with Agent Design Bible
This implements **Rule 7: Recovery-First Design**:
> "Build agents that can diagnose and recover from failures."
Every agent should include error handling that follows this protocol.