# Agent Design Bible

The definitive guide for designing reliable, production-grade AI agents in the AIWG framework.

## Research Foundation

This guide synthesizes empirical findings from:

- **[REF-001](#research-production-workflows)**: Bandara et al. (2024) "Production-Grade Agentic AI Workflows" - 9 best practices for production reliability
- **[REF-002](#research-failure-modes)**: Roig (2025) "How Do LLMs Fail In Agentic Scenarios?" - 4 failure archetypes from 900 execution traces

**Key Empirical Finding**: Recovery capability, not model scale or initial correctness, is the dominant predictor of agentic task success. DeepSeek V3.1 achieves 92% success via post-training for verification/recovery, not architectural changes.

## The 10 Golden Rules

### Rule 1: Single Responsibility

Each agent does ONE thing well.

**Rationale**: REF-001 BP-4 establishes that agents with focused responsibilities produce more predictable outputs. Roig (2025) shows multi-purpose agents exhibit higher rates of Archetype 4 failures (coherence loss under load).

**Checklist**:

- [ ] Agent purpose describable in one sentence
- [ ] No "and" in the agent's core function
- [ ] Clear input/output contract
- [ ] Obvious when to use (and when NOT to use) this agent

**Anti-pattern**:

```markdown
# BAD: Multi-purpose agent
name: Code Helper
description: Reviews code, writes tests, fixes bugs, and documents functions
```

**Pattern**:

```markdown
# GOOD: Focused agent
name: Code Reviewer
description: Performs comprehensive code reviews focusing on quality, security, and maintainability
```

### Rule 2: Minimal Tools

Assign 0-3 tools per agent. Prefer fewer.

**Rationale**: REF-001 BP-3 warns against tool sprawl. Each additional tool increases the agent's decision space exponentially. Roig (2025) Archetype 4 shows tool-heavy agents suffer more fragile execution.
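
The first two rules lend themselves to a mechanical pre-check. The sketch below is illustrative only: `lint_agent`, `parse_tools`, and `MAX_TOOLS` are hypothetical names, not part of AIWG, and the one-sentence test is a crude heuristic that cannot judge whether a description really covers a single responsibility.

```python
MAX_TOOLS = 3  # upper bound taken from Rule 2 ("Assign 0-3 tools per agent")


def parse_tools(tools_field: str) -> list[str]:
    """Split a comma-separated `tools:` value into individual tool names."""
    return [name.strip() for name in tools_field.split(",") if name.strip()]


def lint_agent(description: str, tools_field: str) -> list[str]:
    """Return warnings for Rules 1 and 2; an empty list means nothing was flagged."""
    warnings = []

    # Rule 2: count the assigned tools.
    tools = parse_tools(tools_field)
    if len(tools) > MAX_TOOLS:
        warnings.append(
            f"Rule 2: {len(tools)} tools assigned; the guide recommends at most {MAX_TOOLS}"
        )

    # Rule 1 (heuristic): the purpose should fit in exactly one sentence.
    sentences = [part for part in description.split(".") if part.strip()]
    if len(sentences) != 1:
        warnings.append("Rule 1: description should be exactly one sentence")

    return warnings
```

A check like this could run before an agent definition is committed. It only catches the countable violations, so the checklist still needs a human read.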

**Tool Assignment Guide**:

| Agent Type | Recommended Tools | Rationale |
|------------|-------------------|-----------|
| Research/Analysis | Read, Grep, Glob | Read-only exploration |
| Content Creation | Read, Write | Focused output |
| Code Modification | Read, Edit, Bash | Surgical changes |
| Orchestration | Task | Delegation only |
| Validation | Read, Grep | Verification only |

**Anti-pattern**:

```markdown
tools: Bash, Glob, Grep, Read, Write, Edit, MultiEdit, WebFetch, WebSearch, Task, NotebookEdit
```

**Pattern**:

```markdown
tools: Read, Grep, Write
```

### Rule 3: Explicit Inputs/Outputs

Define exactly what the agent receives and produces.

**Rationale**: Ambiguous contracts cause Roig Archetype 2 failures (over-helpfulness). When agents don't know what they're supposed to produce, they substitute plausible alternatives.

**Contract Template**:

```markdown
## Inputs
- **Required**: [What MUST be provided]
- **Optional**: [What MAY be provided]
- **Context**: [What ambient information is available]

## Outputs
- **Primary**: [The main deliverable]
- **Secondary**: [Supporting artifacts]
- **Format**: [Exact structure/schema]
```

**Example**:

```markdown
## Inputs
- **Required**: File path(s) to review
- **Optional**: Focus areas (security, performance, style)
- **Context**: Project coding standards from CLAUDE.md

## Outputs
- **Primary**: Prioritized list of issues with file:line references
- **Secondary**: Positive observations and overall assessment
- **Format**: Markdown with Critical/High/Medium/Low sections
```

### Rule 4: Grounding Before Action

ALWAYS verify assumptions before modifying external state.

**Rationale**: Roig (2025) Archetype 1 (Premature Action Without Grounding) is a leading cause of cascading failures. Agents that guess schemas instead of inspecting them produce incorrect outputs that compound downstream.

**Grounding Checkpoint**:

Before ANY operation touching external state (files, APIs, databases):

1. **List** available inspection tools (ls, head, schema, describe)
2. **Execute** minimum inspection to confirm assumptions
3. **Document** confirmed state in reasoning
4. **Only then** proceed with modification

**Example**:

```markdown
## Process (Code Reviewer)

1. **Scan**: Read all specified files using Read/Grep/Glob tools
   - VERIFY files exist before analyzing
   - CONFIRM file types match expectations
2. **Analyze**: Evaluate against criteria
3. **Report**: Provide findings with exact file:line references
```

**Anti-pattern**:

```markdown
# BAD: Assumes structure
"The config file has a 'database' section with 'host' and 'port' fields..."
```

**Pattern**:

```markdown
# GOOD: Verifies first
"Let me read the config file to understand its structure..."
[Reads file]
"The config has sections: database, cache, logging. The database section contains..."
```

### Rule 5: Escalate Uncertainty

NEVER silently substitute missing or ambiguous data.

**Rationale**: Roig (2025) Archetype 2 (Over-Helpfulness Under Uncertainty) shows models substitute plausible alternatives when data is missing, producing confidently wrong outputs.

**Uncertainty Protocol**:

When encountering entity mismatches or ambiguous references:

1. **STOP** - Do not proceed with assumptions
2. **LIST** - Show all potential matches with confidence indicators
3. **REPORT** - "Entity 'X' not found. Similar candidates: [list]"
4. **WAIT** - Request clarification before proceeding
5. **DOCUMENT** - Log any assumptions in trace output

**Example Escalation**:

```markdown
## Uncertainty Detected

Task requested: "Update the User service configuration"

Found multiple matches:
- `src/services/UserService.ts` (85% confidence - naming match)
- `src/services/AuthService.ts` (40% confidence - contains user logic)
- `config/services/user.yaml` (60% confidence - configuration file)

**Action Required**: Please specify which file(s) to modify, or confirm the primary match.
```

**Anti-pattern**:

```markdown
# BAD: Silent substitution
Task: "Find revenue for Acme Corp"
[CSV contains "Acme Corporation" and "Acme Inc"]
"The revenue for Acme Corp is $1.2M" (silently used "Acme Corporation")
```

### Rule 6: Scoped Context

Only process information relevant to the current task.

**Rationale**: Roig (2025) Archetype 3 (Distractor-Induced Context Pollution) shows that irrelevant but superficially similar information derails reasoning. This is the "Chekhov's gun" effect: if data is present, models assume it must be relevant.

**Context Scoping Protocol**:

1. **Identify** explicit task scope (time ranges, entity filters, operation type)
2. **Classify** context sections:
   - **RELEVANT**: Directly supports task
   - **PERIPHERAL**: May be useful for edge cases
   - **DISTRACTOR**: Similar but out of scope
3. **Process** RELEVANT first, PERIPHERAL only if needed
4. **Ignore** DISTRACTOR content entirely

**Example**:

```markdown
## Task Scope Analysis

Task: "Calculate Q4 revenue for Product A"

Context Classification:
- RELEVANT: Q4 data rows, Product A entries
- PERIPHERAL: Q4 data for Products B, C (same time period)
- DISTRACTOR: Q1-Q3 data for Product A (wrong time period)

Processing: Focus on rows where quarter='Q4' AND product='A'
```

### Rule 7: Recovery-First Design

Build agents that can diagnose and recover from failures.

**Rationale**: REF-002's key finding: recovery capability is THE dominant predictor of success. DeepSeek V3.1's 92% success rate comes from post-training for verification/recovery behaviors.

**Recovery Protocol**:

```
1. PAUSE - Stop execution, preserve state
2. DIAGNOSE - Analyze error message and execution trace
   - Syntax error? → Fix formatting
   - Schema mismatch? → Re-inspect target
   - Logic error? → Decompose into smaller steps
   - Loop detected? → Change approach entirely
3. ADAPT - Choose recovery strategy based on diagnosis
4. RETRY - With adapted approach (max 3 attempts)
5. ESCALATE - If 3 adapted retries fail, request human intervention
```

**Agent Template Addition**:

```markdown
## Error Handling

When encountering errors:
1. Capture the full error message and context
2. Analyze root cause before retrying
3. Adapt approach if same error occurs twice
4. Report blocking issues with:
   - What was attempted
   - What failed
   - What was tried to recover
   - What human input is needed
```

### Rule 8: Appropriate Model Tier

Match model capability to task complexity.

**Rationale**: REF-001 BP-6 and REF-002 both show that model scale alone doesn't predict reliability. Use the right tier for the task and don't waste capacity on simple operations.

**Model Selection Guide**:

| Tier | Model | Use For | Avoid For |
|------|-------|---------|-----------|
| **Efficiency** | haiku | Validation, formatting, simple transforms | Complex reasoning, architecture |
| **Balanced** | sonnet | Most development tasks, code review | Novel architecture, critical decisions |
| **Reasoning** | opus | Architecture, security analysis, complex trade-offs | Routine operations, high-volume tasks |

**Task-to-Tier Mapping**:

```markdown
# HAIKU (efficiency)
- Linting and formatting
- Simple file operations
- Template population
- Status checks

# SONNET (balanced)
- Code review
- Test generation
- Documentation
- Bug investigation

# OPUS (reasoning)
- Architecture design
- Security threat modeling
- Complex refactoring
- Critical decision making
```

### Rule 9: Parallel-Ready

Design agents to run concurrently when tasks are independent.

**Rationale**: REF-001 BP-9 (KISS) emphasizes simple, composable agents. Independent agents can run in parallel, dramatically improving throughput.
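
To make the throughput point concrete, here is a minimal sketch of independent agents running concurrently with Python's `asyncio`. It assumes each agent can be called as a coroutine; `run_agent` is a hypothetical stand-in for whatever actually launches an agent and does not represent an AIWG API.

```python
import asyncio


async def run_agent(name: str, document: str) -> dict:
    """Hypothetical stand-in for launching one agent; a real call would go here."""
    await asyncio.sleep(0)  # yield control, standing in for I/O-bound agent work
    return {"agent": name, "feedback": f"{name} reviewed {len(document)} characters"}


async def parallel_review(document: str, reviewers: list[str]) -> list[dict]:
    # Every reviewer receives the same read-only input and shares no state,
    # so all calls can be awaited together. gather() returns results in the
    # same order as the reviewers list.
    return await asyncio.gather(*(run_agent(name, document) for name in reviewers))


def review(document: str, reviewers: list[str]) -> list[dict]:
    """Synchronous entry point for callers that are not already inside an event loop."""
    return asyncio.run(parallel_review(document, reviewers))
```

Because each result is a self-contained dictionary, a later merge step can combine them without knowing which reviewer finished first.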

**Parallel Design Checklist**:

- [ ] Agent has no dependencies on other agents' outputs (or dependencies are explicit)
- [ ] Agent doesn't modify shared state without coordination
- [ ] Agent can be launched via Task tool alongside others
- [ ] Agent's output is self-contained and mergeable

**Orchestration Pattern**:

```markdown
## Parallel Review Pattern

For comprehensive document review, launch simultaneously:
- Security Architect → Security validation
- Test Architect → Testability review
- Technical Writer → Clarity review
- Requirements Analyst → Traceability check

All reviewers read the same input, produce independent feedback.
Synthesizer agent merges feedback afterward.
```

### Rule 10: Observable Execution

Produce traceable outputs for debugging and improvement.

**Rationale**: REF-001 emphasizes observability throughout. Without traces, failures can't be diagnosed or prevented.

**Observability Requirements**:

```markdown
## Trace Output

Every agent should log:
1. **Start**: Task received, inputs summary
2. **Plan**: Intended approach
3. **Steps**: Each significant action taken
4. **Decisions**: Why alternatives were rejected
5. **Result**: Final output summary
6. **Metrics**: Duration, tokens used, tools invoked
```

**Example Trace**:

```
[2025-12-10T10:30:00Z] CODE-REVIEWER started
  Input: src/api/*.ts (12 files)
  Focus: security, performance
[2025-12-10T10:30:01Z] PLAN: Scan → Analyze → Prioritize → Report
[2025-12-10T10:30:02Z] STEP: Reading src/api/auth.ts (342 lines)
[2025-12-10T10:30:05Z] FINDING: SQL injection at auth.ts:87
[2025-12-10T10:30:15Z] COMPLETE
  Duration: 15s
  Findings: 3 critical, 5 high, 12 medium
  Files reviewed: 12/12
```

## When NOT to Use an Agent

**REF-001 BP-2** explicitly identifies when to bypass agents for direct function calls.
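
As an illustration of a direct function call, a rule-based check such as schema validation needs no agent at all. The sketch below assumes a hypothetical database config with `host` and `port` fields; `REQUIRED_FIELDS` and `validate_database_config` are illustrative names, not part of AIWG.

```python
# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"host": str, "port": int}


def validate_database_config(config: dict) -> list[str]:
    """Deterministic, rule-based validation: same input, same answer, no reasoning."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in config:
            errors.append(f"missing field: {field}")
        elif not isinstance(config[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return errors
```

The function is cheap, repeatable, and easy to unit-test, which is why an operation like this is a poor fit for an agent.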

### Use Direct Functions For

| Operation | Why Not Agent |
|-----------|---------------|
| File I/O (read/write) | Deterministic, no reasoning needed |
| String formatting | Pure transformation |
| Data validation (schema) | Rule-based, predictable |
| HTTP requests | API call, not decision |
| Math calculations | Deterministic computation |

### Use Agents For

| Operation | Why Agent |
|-----------|-----------|
| Code review | Requires judgment |
| Architecture decisions | Trade-off analysis |
| Content generation | Creative reasoning |
| Error diagnosis | Root cause analysis |
| Multi-step workflows | Coordination needed |

**Decision Rule**: If the operation is deterministic and requires no judgment, use a direct function. If it requires reasoning, judgment, or creativity, use an agent.

## Agent Definition Template

```markdown
---
name: [Agent Name]
description: [One sentence describing single responsibility]
model: [haiku|sonnet|opus]
tools: [Minimal tool list, 0-3 preferred]
---

# [Agent Name]

You are a [role] specializing in [specific focus].

## Inputs

- **Required**: [What must be provided]
- **Optional**: [What may be provided]
- **Context**: [Ambient information available]

## Outputs

- **Primary**: [Main deliverable]
- **Format**: [Structure/schema]

## Process

1. **Ground**: [Verification step before action]
2. **Execute**: [Core task steps]
3. **Validate**: [Output verification]

## Uncertainty Handling

When encountering ambiguity:
1. Stop and document the uncertainty
2. List potential interpretations
3. Request clarification before proceeding

## Error Recovery

When encountering errors:
1. Capture full error context
2. Diagnose root cause
3. Adapt approach (don't retry blindly)
4. Escalate if 3 adapted attempts fail

## Example Usage

[Concrete example of input → output]
```

## Validation Checklist

Before deploying any agent, verify:

### Structure

- [ ] Single responsibility (Rule 1)
- [ ] 0-3 tools assigned (Rule 2)
- [ ] Explicit inputs/outputs (Rule 3)
- [ ] Appropriate model tier (Rule 8)

### Behavior

- [ ] Grounding step included (Rule 4)
- [ ] Uncertainty escalation defined (Rule 5)
- [ ] Context scoping guidance (Rule 6)
- [ ] Recovery protocol specified (Rule 7)

### Operations

- [ ] Parallel-ready design (Rule 9)
- [ ] Observable execution (Rule 10)

### Meta

- [ ] Clear when NOT to use this agent
- [ ] Example usage provided
- [ ] Error scenarios documented

## Failure Archetype Prevention

Quick reference for avoiding the four empirically-identified failure modes:

| Archetype | Prevention | Rule |
|-----------|------------|------|
| Premature Action | Grounding checkpoint | Rule 4 |
| Over-Helpfulness | Uncertainty escalation | Rule 5 |
| Distractor Pollution | Context scoping | Rule 6 |
| Fragile Execution | Recovery-first design | Rule 7 |

## Multi-Agent Patterns

### Primary → Reviewers → Synthesizer

Standard pattern for artifact generation:

```
Primary Author (opus) → Creates draft
        ↓
Parallel Reviewers (sonnet) → Independent review
  - Security review
  - Technical review
  - Standards review
        ↓
Synthesizer (sonnet) → Merges feedback into final
```

### Decompose → Execute → Validate

Pattern for complex tasks:

```
Decomposer (opus) → Breaks into ≤7 subtasks
        ↓
Executors (haiku/sonnet) → Complete subtasks in parallel
        ↓
Validator (sonnet) → Verifies completeness and consistency
```

### Scout → Decide → Act

Pattern for uncertain operations:

```
Scout (haiku) → Gathers information, identifies options
        ↓
Decider (opus) → Evaluates options, chooses approach
        ↓
Actor (sonnet) → Executes chosen approach
```

## References

- [REF-001: Production-Grade Agentic AI Workflows](#research-production-workflows)
- [REF-002: How Do LLMs Fail In Agentic Scenarios?](#research-failure-modes)
- [Production-Grade Guide](#ref-production-grade)

## Revision History

| Date | Author | Changes |
|------|--------|---------|
| 2025-12-10 | AIWG | Initial version synthesizing REF-001 and REF-002 findings |
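
As a closing illustration, the Rule 7 recovery protocol (diagnose, adapt, retry up to three times, then escalate) can be sketched as a small control loop. The code is hypothetical: the exception-to-strategy mapping in `diagnose` is an assumed simplification, `run_with_recovery` and `EscalationRequired` are illustrative names, and loop detection is left out for brevity.

```python
MAX_ADAPTED_RETRIES = 3  # from the protocol: escalate after 3 adapted retries fail


class EscalationRequired(Exception):
    """Raised when recovery is exhausted; carries the attempt history for a human."""


def diagnose(error: Exception) -> str:
    """Map an error to a recovery strategy label (assumed, simplified categories)."""
    if isinstance(error, SyntaxError):
        return "fix_formatting"
    if isinstance(error, KeyError):
        return "reinspect_target"  # treated here as a schema mismatch
    return "decompose"  # default: break the task into smaller steps


def run_with_recovery(step):
    """Call step(strategy) once, then retry with adapted strategies before escalating."""
    history = []  # records what was attempted and what failed
    strategy = "initial"
    for _ in range(1 + MAX_ADAPTED_RETRIES):
        try:
            return step(strategy)
        except Exception as error:
            history.append((strategy, repr(error)))
            strategy = diagnose(error)  # adapt instead of retrying blindly
    raise EscalationRequired(history)
```

The history attached to `EscalationRequired` corresponds to the "what was attempted, what failed, what was tried to recover" report that the error-handling template asks agents to produce.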