aiwg

Version:

Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.

aiwg.io

jmagly/aiwg

1,085 lines (855 loc) • 47.1 kB

Markdown

# REF-013: MetaGPT - Meta Programming for Multi-Agent Collaborative Framework ## Citation Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., Wang, J., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., Ran, C., Xiao, L., Wu, C., & Schmidhuber, J. (2024). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. *The Twelfth International Conference on Learning Representations (ICLR 2024)*. **arXiv**: [https://arxiv.org/abs/2308.00352v7](https://arxiv.org/abs/2308.00352v7) **GitHub**: [https://github.com/geekan/MetaGPT](https://github.com/geekan/MetaGPT) ## Executive Summary MetaGPT introduces an innovative **meta-programming framework** that incorporates human **Standardized Operating Procedures (SOPs)** into LLM-based multi-agent collaborations. Unlike previous chat-based systems that suffer from cascading hallucinations, MetaGPT requires agents to generate **structured outputs** (requirements documents, design artifacts, interface specifications) rather than unstructured dialogue. This assembly-line paradigm assigns specialized roles (Product Manager, Architect, Project Manager, Engineer, QA Engineer) that work sequentially through a software development workflow, achieving: - **85.9% Pass@1 on HumanEval** (vs. GPT-4's 67.0%) - **87.7% Pass@1 on MBPP** - **100% task completion** on complex software projects - **2x token efficiency** compared to ChatDev (124.3 vs 248.9 tokens/line) **Key Innovation**: "Unlike other works, MetaGPT requires agents to generate structured outputs, such as high-quality requirements documents, design artifacts, flowcharts, and interface specifications. The use of intermediate structured outputs significantly increases the success rate of target code generation." (p. 1-2) ## Core Architecture ### 1. Role-Based Agent Specialization MetaGPT defines **five primary roles** mirroring real software development teams: | Role | Responsibilities | Key Outputs | Tools | |------|-----------------|-------------|-------| | **Product Manager** | Business analysis, competitive research, user needs | Product Requirements Document (PRD), User Stories, Competitive Quadrant Chart | Web search | | **Architect** | System design, technical specifications | System Design Document, Interface Definitions, Sequence Diagrams, File Structure | UML tools | | **Project Manager** | Task breakdown, workflow coordination | Task List, Logic Analysis, Dependency Graph, Shared Knowledge | None | | **Engineer** | Code implementation, debugging | Source Code, Unit Tests, Documentation | Code execution | | **QA Engineer** | Test generation, quality assurance | Test Cases, Code Review, Bug Reports | Testing frameworks | **Quote**: "In MetaGPT, we specify the agent's profile, which includes their name, profile, goal, and constraints for each role. We also initialize the specific context and skills for each role. For instance, a Product Manager can use web search tools, while an Engineer can execute code." (p. 4) **Agent Profile Structure**: - **Name**: Role identifier - **Profile**: Role description and expertise domain - **Goal**: Specific objectives for this role - **Constraints**: Operating boundaries and quality standards - **Skills/Tools**: Available capabilities (web search, code execution, etc.) - **Subscription**: Which message types trigger activation ### 2. Standard Operating Procedures (SOPs) MetaGPT encodes human software development SOPs into **prompt sequences** for streamlined workflows: ``` User Requirements ↓ ┌────────────────────────────────────────────┐ │ Product Manager │ │ SOP: Analyze competition + user needs │ │ Output: PRD (structured) │ │ - Product Goals │ │ - User Stories │ │ - Competitive Analysis │ │ - Requirement Pool │ │ - Competitive Quadrant Chart │ ├────────────────────────────────────────────┤ │ Architect │ │ SOP: Translate requirements → design │ │ Output: System Design (structured) │ │ - File List │ │ - Data Structures & Classes │ │ - Interface Definitions │ │ - Sequence Flow Diagrams │ ├────────────────────────────────────────────┤ │ Project Manager │ │ SOP: Decompose design → tasks │ │ Output: Task Distribution │ │ - Task List with Dependencies │ │ - Logic Analysis │ │ - Shared Knowledge Base │ ├────────────────────────────────────────────┤ │ Engineer │ │ SOP: Implement + debug with feedback │ │ Output: Code Files │ │ - Class Implementations │ │ - Function Definitions │ │ - Unit Tests │ │ Feedback Loop: Execute → Debug → Retry │ ├────────────────────────────────────────────┤ │ QA Engineer │ │ SOP: Generate tests + enforce quality │ │ Output: Test Suite │ │ - Test Cases │ │ - Code Review │ │ - Quality Report │ └────────────────────────────────────────────┘ ↓ Complete Software Artifact ``` **Sequential Workflow**: "In our work, we follow SOP in software development, which enables all agents to work in a sequential manner." (p. 4) **Contrast with Prior Work**: - **ChatDev**: Free-form dialogue between agents → information loss - **AutoGPT**: General problem-solving without systematic decomposition - **MetaGPT**: Structured handoffs with standardized intermediate outputs ### 3. Communication Protocol #### Problem: Information Distortion in Natural Language "However, despite the versatility of natural language, a question arises: does pure natural language communication suffice for solving complex tasks? For example, in the telephone game (or Chinese whispers), after several rounds of communication, the original information may be quite distorted." (p. 5-6) #### Solution: Structured Communication Interfaces **Establish schemas and formats for each role's outputs**: **Product Manager PRD Schema**: ```python { "original_requirements": str, "product_goals": List[str], # 3-5 goals "user_stories": List[str], # User-focused scenarios "competitive_analysis": List[str], # Competitor comparisons "requirement_analysis": str, "requirement_pool": List[Tuple[str, str]], # (requirement, priority) "ui_design_draft": str, "anything_unclear": str } ``` **Architect System Design Schema**: ```python { "implementation_approach": str, # Tech stack + rationale "python_package_name": str, "file_list": List[str], # All source files "data_structures_and_interfaces": str, # Class/API specs "program_call_flow": str, # Sequence diagram (Mermaid) "anything_unclear": str } ``` **Benefits of Structured Communication**: 1. **Prevents information loss**: All necessary details captured in schema 2. **Enables validation**: Can verify completeness programmatically 3. **Reduces ambiguity**: Fixed format eliminates interpretation variance 4. **Facilitates handoffs**: Downstream agents know exactly what to expect **Quote**: "Unlike ChatDev, agents in MetaGPT communicate through documents and diagrams (structured outputs) rather than dialogue. These documents contain all necessary information, preventing irrelevant or missing content." (p. 6) #### Publish-Subscribe Mechanism **Shared Message Pool**: All agents publish structured messages to a global pool and subscribe to relevant messages based on role profiles. ``` ┌─────────────────────────────────────────────┐ │ Shared Message Pool │ │ │ │ [PRD] ─────────┐ │ │ [SystemDesign]─┼───────┐ │ │ [TaskList] ────┼───────┼────┐ │ │ [Code] ────────┼───────┼────┼───┐ │ │ [Tests] ───────┘ │ │ │ │ └─────────────────────────┼────┼───┼───────────┘ │ │ │ ┌─────────────────┘ │ │ ↓ ↓ ↓ Architect Engineer QA subscribes: subscribes: - PRD - PRD - SystemDesign - TaskList ``` **Subscription Rules**: - **Architect** subscribes to: PRD only - **Project Manager** subscribes to: SystemDesign - **Engineer** subscribes to: SystemDesign, TaskList, PRD (for reference) - **QA Engineer** subscribes to: Code, SystemDesign, PRD **Activation Logic**: "In practical implementations, an agent activates its action only after receiving all its prerequisite dependencies." (p. 6) **Benefits**: 1. **Eliminates redundant communication**: Agents retrieve information directly from pool instead of requesting from other agents 2. **Prevents information overload**: Subscription filters irrelevant messages 3. **Maintains context**: All prior artifacts available for reference 4. **Enables parallelism**: Agents can work concurrently once prerequisites met **Quote**: "Sharing all information with every agent can lead to information overload. During task execution, an agent typically prefers to receive only task-related information and avoid distractions through irrelevant details." (p. 6) ### 4. Executable Feedback Mechanism **The Problem**: "Previous work introduced non-executable code review and self-reflection. However, they still face challenges in ensuring code executability and runtime correctness. Our first MetaGPT implementations overlooked certain errors during the review process, due to LLM hallucinations." (p. 6) **MetaGPT's Iterative Programming Workflow**: ``` ┌─────────────────────────────────────────┐ │ Engineer Agent │ │ │ │ 1. Read: PRD, SystemDesign, TaskList │ │ 2. Generate: Initial code implementation│ │ 3. Execute: Run code + unit tests │ │ 4. Check: Capture errors/test results │ │ │ │ │ ├─ Success? → Next task │ │ │ │ │ └─ Failure? │ │ ↓ │ │ 5. Debug: │ │ - Review historical execution log │ │ - Compare against PRD requirements │ │ - Analyze SystemDesign constraints │ │ - Identify root cause │ │ ↓ │ │ 6. Regenerate: Improved code │ │ ↓ │ │ 7. Iterate: Repeat 3-6 (max 3 times) │ └─────────────────────────────────────────┘ ``` **Memory Structure**: ```python { "execution_history": [ {"code_version": 1, "error": "NameError: 'foo' undefined", "traceback": "..."}, {"code_version": 2, "error": "TypeError: expected int, got str", "traceback": "..."} ], "debugging_context": { "prd_requirements": "...", "system_design": "...", "prior_attempts": ["version1.py", "version2.py"] } } ``` **Impact**: - **HumanEval**: +4.2% absolute improvement (81.7% → 85.9%) - **MBPP**: +5.4% absolute improvement (82.3% → 87.7%) - **Executability**: 3.67 → 3.75 (out of 4.0) - **Human revision cost**: -63% reduction (2.25 → 0.83 corrections) **Quote**: "This enables the Engineer to continuously improve code using its own historical execution and debugging memory. To obtain additional information, the Engineer writes and executes the corresponding unit test cases, and subsequently receives the test results. If satisfactory, additional development tasks are initiated. Otherwise the Engineer debugs the code before resuming programming." (p. 6) ## Benchmark Results ### Code Generation (Single Function Tasks) **HumanEval** (164 hand-written programming problems): | Model | Parameters | Pass@1 | Year | |-------|-----------|--------|------| | AlphaCode | 1.1B | 17.1% | 2022 | | Incoder | 6.7B | - | 2022 | | CodeGeeX | 13B | 15.2% | 2023 | | CodeGeeX-Mono | 16.1B | 32.9% | 2023 | | Codex (GPT-3.5) | 175B | 47.0% | 2021 | | PaLM Coder | 540B | 36.0% | 2022 | | Codex + CodeT | 175B | 65.8% | 2022 | | **GPT-4** | - | **67.0%** | 2023 | | **MetaGPT (w/o Feedback)** | GPT-4 | **81.7%** | 2024 | | **MetaGPT** | GPT-4 | **85.9%** | 2024 | **MBPP** (427 Python programming tasks): | Model | Pass@1 | |-------|--------| | CodeGeeX | 17.6% | | CodeGeeX-Mono | 38.6% | | PaLM Coder | 47.0% | | Codex | 58.1% | | Codex + CodeT | 67.7% | | **MetaGPT (w/o Feedback)** | **82.3%** | | **MetaGPT** | **87.7%** | **State-of-the-Art Achievement**: "Notably, in code generation benchmarks, MetaGPT achieves a new state-of-the-art (SoTA) with 85.9% and 87.7% in Pass@1." (p. 2) ### Software Development (Complex Multi-File Projects) **SoftwareDev Dataset**: 70 diverse tasks including mini-games (Flappy Bird, Snake, 2048), image processing, data visualization **Executability Comparison** (7 representative tasks): | Task | AutoGPT | LangChain | AgentVerse | ChatDev | MetaGPT | |------|---------|-----------|------------|---------|---------| | Flappy Bird | 1 | 1 | 1 | 2 | **3** | | Tank Battle | 1 | 1 | 1 | 2 | **4** | | 2048 Game | 1 | 1 | 1 | 1 | **4** | | Snake Game | 1 | 1 | 1 | 3 | **4** | | Brick Breaker | 1 | 1 | 1 | 1 | **4** | | Excel Processor | 1 | 1 | 1 | 4 | **4** | | CRUD Manager | 1 | 1 | 1 | 2 | **4** | | **Average** | **1.0** | **1.0** | **1.0** | **2.1** | **3.9** | **Executability Scale**: - **1** = Complete failure / non-functional - **2** = Runnable but imperfect - **3** = Nearly perfect / largely expected workflow - **4** = Flawless / perfect match to expectations **Detailed Metrics (SoftwareDev average across 7 tasks)**: | Metric | ChatDev | MetaGPT (w/o Feedback) | MetaGPT | |--------|---------|------------------------|---------| | **Executability** (1-4) | 2.25 | 3.67 | **3.75** | | **Running Time** (s) | 762 | 503 | **541** | | **Token Usage** | 19,292 | 24,613 | 31,255 | | **Code Files** | 1.9 | 4.6 | **5.1** | | **Lines/File** | 40.8 | 42.3 | **49.3** | | **Total Lines** | 77.5 | 194.6 | **251.4** | | **Productivity** (tokens/line) | 248.9 | 126.5 | **124.3** | | **Human Revisions** | 2.5 | 2.25 | **0.83** | **Key Insight**: MetaGPT generates **3.2x more code** (251.4 vs 77.5 lines) with **2x better token efficiency** (124.3 vs 248.9 tokens/line) compared to ChatDev. **Quote**: "Remarkably, in our experimental evaluations, MetaGPT achieves a 100% task completion rate, demonstrating the robustness and efficiency (time and token costs) of our design." (p. 2) ## Ablation Studies ### 1. Role Effectiveness Testing impact of adding specialized roles incrementally: | Configuration | # Agents | # Lines | Expense | Revisions | Executability | |---------------|----------|---------|---------|-----------|---------------| | Engineer only | 1 | 83.0 | $0.915 | **10** | 1.0 | | + Product Manager | 2 | 112.0 | $1.059 | 6.5 | 2.0 | | + Architect | 3 | 143.0 | $1.204 | 4.0 | 2.5 | | + Project Manager | 3 | 205.0 | $1.251 | 3.5 | 2.0 | | **Full Team** | **4** | **191.0** | **$1.385** | **2.5** | **4.0** | **Finding**: "The addition of roles different from just the Engineer consistently improves both revisions and executability. While more roles slightly increase the expenses, the overall performance improves noticeably, demonstrating the effectiveness of the various roles." (p. 9) **ROI Analysis**: - **Engineer only** → **Full team**: +51% cost ($0.915 → $1.385) - **Benefit**: 75% fewer revisions (10 → 2.5), 4x better executability (1.0 → 4.0) - **Net result**: Higher quality software with less human intervention ### 2. Executable Feedback Impact | Metric | w/o Feedback | with Feedback | Δ Absolute | Δ Relative | |--------|--------------|---------------|-----------|-----------| | HumanEval Pass@1 | 81.7% | 85.9% | **+4.2%** | +5.1% | | MBPP Pass@1 | 82.3% | 87.7% | **+5.4%** | +6.6% | | Executability (1-4) | 3.67 | 3.75 | +0.08 | +2.2% | | Human Revisions | 2.25 | 0.83 | **-1.42** | **-63%** | | Running Time (s) | 503 | 541 | +38 | +7.6% | **Quote**: "Adding executable feedback into MetaGPT leads to a significant improvement of 4.2% and 5.4% in Pass @1 on HumanEval and MBPP, respectively. Besides, the feedback mechanism improves feasibility (3.67 to 3.75) and reduces the cost of human revisions (2.25 to 0.83)." (p. 9) **Trade-off**: Slightly longer execution time (+38s, 7.6%) for significantly better quality and reduced manual effort. ### 3. Instruction Detail Sensitivity Testing impact of high-level vs. detailed user prompts (5 tasks): **High-level prompt**: "Create a brick breaker game." **Detailed prompt**: "Creating a brick breaker game. In a brick breaker game, the player typically controls a paddle at the bottom of the screen to bounce a ball towards a wall of bricks. The goal is to break all the bricks by hitting them with the ball." | Prompt Type | Avg Words | Time (s) | Tokens | # Lines | Executability | Productivity | Revisions | |-------------|-----------|----------|--------|---------|---------------|--------------|-----------| | **High-level** | 13.2 | 552.9 | 28,384 | 178.2 | **3.8** | 163.8 | **1.2** | | **Detailed** | 42.2 | 567.8 | 29,657 | 257.0 | **4.0** | **118.0** | 1.6 | **Findings**: - Detailed prompts → **+0.2 better executability** (3.8 → 4.0) - Detailed prompts → **44% more code** (178.2 → 257.0 lines) - Detailed prompts → **28% better productivity** (163.8 → 118.0 tokens/line) - **However**: High-level prompts still achieve 3.8/4.0 executability **Quote**: "Detailed prompts lead to better software projects with lower productivity ratios because of clearer requirements and functions, while simple inputs can still generate good enough software using MetaGPT with an executability rating of 3.8, which is comparable to the detailed prompt scenario." (p. 24) **Implication**: MetaGPT's Product Manager agent effectively expands brief inputs into detailed requirements. ## Capabilities Comparison Comparison with leading frameworks (from Table 2, p. 9): | Capability | AutoGPT | LangChain | AgentVerse | ChatDev | MetaGPT | **AIWG** | |------------|---------|-----------|------------|---------|---------|----------| | **PRD generation** | ✗ | ✗ | ✗ | ✗ | ✓ | **✓** | | **Technical design** | ✗ | ✗ | ✗ | ✗ | ✓ | **✓** | | **API interface generation** | ✗ | ✗ | ✗ | ✗ | ✓ | **✓** | | **Code generation** | ✓ | ✓ | ✓ | ✓ | ✓ | **✓** | | **Pre-compilation execution** | ✗ | ✗ | ✗ | ✗ | ✓ | **✓** | | **Role-based management** | ✗ | ✗ | ✗ | ✓ | ✓ | **✓** | | **Code review** | ✗ | ✗ | ✓ | ✓ | ✓ | **✓** | | **Deployment planning** | ✗ | ✗ | ✗ | ✗ | ✗ | **✓** | | **Security review** | ✗ | ✗ | ✗ | ✗ | ✗ | **✓** | **Quote**: "Our framework encompasses a wide range of abilities to handle complex and specialized development tasks efficiently. Incorporating SOPs (e.g., role-play expertise, structured communication, streamlined workflow) can significantly improve code generation." (p. 8-9) **AIWG Extensions Beyond MetaGPT**: - **Deployment Planning**: Deployment Specialist agent with runbooks - **Security Review**: Security Architect agent with threat modeling - **Broader SDLC coverage**: Full inception → transition lifecycle ## Example Workflow: Drawing Application **User Input**: "write a python3 GUI app such that you can draw an image with it" ### Stage 1: Product Manager → PRD ```python ## Original Requirements "Write a python3 GUI app such that you can draw an image with it" ## Product Goals [ "Create a user-friendly GUI drawing application", "Ensure smooth drawing experience with various tools", "Provide color selection and file management" ] ## User Stories [ "As a user, I want to draw freehand on a canvas", "As a user, I want to select drawing tools (pencil, brush, eraser)", "As a user, I want to choose colors", "As a user, I want to save and open drawings" ] ## Competitive Analysis [ "Paint.NET: Feature-rich but complex for beginners", "Krita: Professional-grade, steep learning curve", "Our App: Simple, focused on core drawing" ] ## Requirement Pool [ ("GUI with drawing canvas", "P0"), ("Tool selection (pencil, brush, eraser)", "P0"), ("Color picker", "P0"), ("File save/open", "P0"), ("Smooth drawing performance", "P1") ] ``` ### Stage 2: Architect → System Design ```python ## Implementation Approach "Use Tkinter (standard Python GUI) for interface. Canvas widget handles drawing surface. Pillow library for image file I/O." ## File List [ "main.py", # Entry point, app initialization "canvas.py", # Drawing surface, mouse events "tools.py", # Tool selection interface "color_picker.py", # Color selection "file_manager.py" # Save/load functionality ] ## Data Structures and Interfaces """ class Canvas: +__init__(root, width, height) +on_mouse_down(event) +on_mouse_move(event) +clear() class ToolSelector: +__init__(root) +select_tool(tool_name: str) +get_current_tool() -> str class ColorPicker: +__init__(root) +select_color() +get_color() -> str class FileManager: +__init__(canvas) +save(filename: str) +open(filename: str) """ ## Program Call Flow (Sequence Diagram) """ User -> main: start() main -> Canvas: __init__() main -> ToolSelector: __init__() main -> ColorPicker: __init__() User -> Canvas: mouse_down() User -> Canvas: mouse_move() Canvas -> ToolSelector: get_current_tool() Canvas -> ColorPicker: get_color() Canvas: draw_line() User -> FileManager: save() """ ``` ### Stage 3: Project Manager → Task Distribution ```python ## Task List [ "main.py", # Priority 1: Framework setup "canvas.py", # Priority 2: Core drawing logic "tools.py", # Priority 3: Tool interface "color_picker.py", # Priority 3: Color interface "file_manager.py" # Priority 4: File I/O ] ## Logic Analysis [ ("main.py", "Initializes GUI, creates all components"), ("canvas.py", "Handles mouse events, delegates to tools"), ("tools.py", "Manages tool state, returns selected tool"), ("color_picker.py", "Provides color selection dialog"), ("file_manager.py", "Saves/loads canvas to/from files") ] ## Dependencies - canvas.py DEPENDS ON tools.py, color_picker.py - file_manager.py DEPENDS ON canvas.py - main.py DEPENDS ON ALL ``` ### Stage 4: Engineer → Code Implementation ```python # color_picker.py (generated code) import tkinter as tk from tkinter import colorchooser class ColorPicker: def __init__(self, root: tk.Tk, color: str = 'black'): self.frame = tk.Frame(root) self.color = color self.color_button = tk.Button( self.frame, text='Color', command=self.select_color ) self.color_button.pack(fill='x') def select_color(self): """Opens color chooser dialog""" color = colorchooser.askcolor()[1] if color is not None: self.color = color def pack(self, **kwargs): self.frame.pack(**kwargs) def get_color(self): return self.color ``` **Executable Feedback Loop**: 1. **First execution**: Import error (missing tkinter) 2. **Debug**: Review system design → tkinter should be standard lib 3. **Second execution**: Success, all tests pass 4. **Result**: Approved, proceed to next file ### Stage 5: QA Engineer → Test Generation ```python # test_color_picker.py (generated tests) import unittest import tkinter as tk class TestColorPicker(unittest.TestCase): def setUp(self): """Initialize test environment""" self.root = tk.Tk() self.cp = ColorPicker(self.root) def tearDown(self): """Clean up after test""" self.root.destroy() def test_initial_color(self): """Test default color is black""" self.assertEqual(self.cp.get_color(), 'black') def test_set_and_get_color(self): """Test setting and retrieving color""" new_color = '#ffffff' self.cp.color = new_color self.assertEqual(self.cp.get_color(), new_color) if __name__ == '__main__': unittest.main() ``` ### Final Output **Generated Software**: - **5 Python files** (main.py, canvas.py, tools.py, color_picker.py, file_manager.py) - **251 total lines of code** - **Executability**: 3.75/4.0 (nearly perfect) - **Human revisions needed**: 0.83 (< 1 correction) - **Cost**: $1.24 in API calls - **Time**: 541 seconds (~9 minutes) **Functional capabilities**: - ✓ Drawing canvas with mouse input - ✓ Tool selection (pencil, brush, eraser) - ✓ Color picker with full RGB selection - ✓ Save drawings to PNG files - ✓ Load previous drawings - ✓ Clear canvas functionality ## AIWG Implementation Mapping ### Direct Alignments | MetaGPT Component | AIWG Equivalent | Mapping Strength | |-------------------|-----------------|------------------| | **Product Manager → PRD** | Requirements Specialist → Use Cases | **EXACT** - Both perform business analysis and create structured requirements | | **Architect → System Design** | Technical Designer → SAD/ADRs | **EXACT** - Both create technical specifications and architecture | | **Project Manager → Task List** | SDLC Orchestrator → Phase Plans | **STRONG** - Both decompose work and manage dependencies | | **Engineer → Code** | Implementation Specialist → Source | **EXACT** - Both implement according to specifications | | **QA Engineer → Tests** | Test Engineer → Test Plans | **EXACT** - Both ensure quality through verification | | **Shared Message Pool** | `.aiwg/` artifact directory | **STRONG** - Both store intermediate outputs for reference | | **Publish-Subscribe** | Artifact traceability (@-mentions) | **MODERATE** - Both manage information flow between roles | | **Sequential Workflow** | Phase gates with handoffs | **STRONG** - Both enforce ordered progression | | **Executable Feedback** | Ralph iterative loops | **STRONG** - Both implement test-debug-retry cycles | ### Structural Differences **MetaGPT**: - **Strictly sequential workflow**: One phase completes before next begins - **One agent per role**: Single Product Manager, single Architect, etc. - **Code-generation focus**: Optimized specifically for software development - **Fully automated**: Runs without human intervention once started - **Single project scope**: Generates one software artifact per execution **AIWG**: - **Phase-gated workflow**: Can iterate within phases before gate approval - **Multiple agents per phase**: Can have specialized agents within same SDLC phase - **Broader SDLC scope**: Handles documentation, deployment, security beyond code - **Human-in-the-loop**: Designed for collaborative human-AI workflow - **Project lifecycle management**: Tracks entire project from inception through deployment ### Key Learnings for AIWG #### 1. Structured Outputs Are Critical **MetaGPT Finding**: "The use of intermediate structured outputs significantly increases the success rate of target code generation. Because it helps maintain consistency in communication, minimizing ambiguities and errors during collaboration." (p. 2) **AIWG Application**: - All agent outputs should follow predefined schemas - Templates should enforce structure, not just suggest it - Validation should check schema compliance before phase gate approval **Example**: ```yaml # Template: use-case.md (enforced structure) required_sections: - uc_id: "UC-XXX" - title: string - actors: list[string] - preconditions: list[string] - main_flow: list[step] - extensions: dict[condition, flow] - postconditions: list[string] validation: - uc_id must match pattern "UC-\d{3}" - main_flow must have at least 1 step - each step must reference an actor ``` #### 2. Reduce Hallucination Through Specialization **MetaGPT Finding**: "More graphically, in a company simulated by MetaGPT, all employees follow a strict and streamlined workflow, and all their handovers must comply with certain established standards. This reduces the risk of hallucinations caused by idle chatter between LLMs." (p. 2) **AIWG Application**: - Define narrow, focused responsibilities for each agent - Provide domain-specific expertise in agent prompts - Enforce structured handoffs between agents - Prevent free-form "discussion" between agents **Example Agent Definition**: ```markdown ## Requirements Specialist Agent ### Scope (NARROW) - Analyze user needs and business requirements ONLY - Generate use cases and user stories ONLY - DO NOT make technical design decisions - DO NOT write code or implementation plans ### Expertise - Business process modeling - User interview techniques - Requirement elicitation methods - Use case documentation standards ### Constraints - MUST output use cases following UC template - MUST validate against intake document - MUST NOT include implementation details - MUST hand off to Technical Designer for architecture ``` #### 3. Executable Feedback Loops Matter **MetaGPT Finding**: 5.4% absolute improvement on MBPP from adding runtime verification (82.3% → 87.7%) **AIWG Application**: - Implement test-debug-retry cycles for code generation - Execute validation checks before phase gate approval - Capture error messages and use for debugging context - Limit iterations (MetaGPT uses max 3) to prevent infinite loops **Ralph Loop Enhancement**: ```typescript // Enhanced Ralph with executable feedback interface RalphIteration { attempt: number; maxAttempts: 3; execute(): Result { const output = this.generateCode(); const testResults = this.runTests(output); if (testResults.passed) { return { status: 'SUCCESS', output }; } if (this.attempt >= this.maxAttempts) { return { status: 'ESCALATE', reason: 'Max retries exceeded', errors: testResults.failures }; } const debugContext = { requirements: this.loadArtifact('.aiwg/requirements/'), architecture: this.loadArtifact('.aiwg/architecture/'), priorAttempts: this.executionHistory, errors: testResults.failures }; this.attempt++; return this.retry(debugContext); } } ``` #### 4. Information Overload Is Real **MetaGPT Finding**: "Sharing all information with every agent can lead to information overload. During task execution, an agent typically prefers to receive only task-related information and avoid distractions through irrelevant details." (p. 6) **AIWG Application**: - Implement subscription filters for artifact access - Only provide context relevant to current task - Use @-mentions to explicitly reference needed artifacts - Prune context window to essential information only **Subscription Implementation**: ```typescript // Agent subscription configuration const technicalDesigner = { role: 'Technical Designer', subscribesTo: [ '.aiwg/requirements/use-cases/*.md', '.aiwg/requirements/nfr-modules/*.md', '.aiwg/intake/solution-profile.md' ], ignores: [ '.aiwg/implementation/**', '.aiwg/testing/**', '.aiwg/deployment/**', '.aiwg/working/**' // Always ignore temporary files ] }; function loadContextForAgent(agent: Agent): Context { const artifacts = glob(agent.subscribesTo) .filter(path => !matchesAny(path, agent.ignores)); return { role: agent.role, relevantArtifacts: artifacts, tokenCount: countTokens(artifacts) // Monitor context size }; } ``` #### 5. SOPs Provide Guardrails **MetaGPT Finding**: Human-validated workflows reduce search space for LLMs and provide guardrails against off-track generation. **AIWG Application**: - Encode SDLC phase procedures as explicit agent instructions - Define phase gate criteria as validation checkpoints - Create workflow templates for common scenarios - Document handoff protocols between phases **SOP Encoding Example**: ```markdown ## SOP: Inception to Elaboration Transition ### Prerequisites (Blocking Gate) - [ ] Intake form completed and validated - [ ] Vision document approved by stakeholders - [ ] Initial risk assessment completed - [ ] Feasibility confirmed ### Procedure 1. **Requirements Specialist** - Expand vision into detailed use cases - Document functional requirements - Identify NFR categories - Create traceability matrix 2. **Technical Designer** (receives use cases) - Draft Software Architecture Document (SAD) - Define system boundaries - Select technology stack - Document architecture decisions (ADRs) 3. **Security Architect** (receives SAD) - Perform threat modeling - Identify security requirements - Validate compliance needs 4. **Test Architect** (receives use cases + SAD) - Define test strategy - Identify testability requirements - Plan test environments 5. **Documentation Synthesizer** - Merge feedback from all specialists - Resolve conflicts and ambiguities - Baseline artifacts in `.aiwg/` ### Exit Criteria (Gate Approval) - [ ] SAD baselined with 3-5 ADRs approved - [ ] Master Test Plan drafted - [ ] Threat model completed - [ ] Traceability established (requirements → design) - [ ] Elaboration phase plan approved ### Handoff Format Publish to message pool: - .aiwg/architecture/software-architecture-doc.md (structured) - .aiwg/architecture/adrs/ADR-*.md (structured) - .aiwg/testing/master-test-plan.md (structured) - .aiwg/security/threat-model.md (structured) ``` ## Advanced Topics ### 1. Self-Improvement Mechanisms **Concept** (from Appendix A.1): "Through active teamwork, a software development team should learn from the experience gained by developing each project, thus becoming more compatible and successful over time." (p. 15) **MetaGPT Implementation**: 1. After each project, agents review feedback from prior work 2. Agents update their constraint prompts based on lessons learned 3. Summaries stored in long-term memory 4. Future projects inherit improved prompts **Current Limitation**: "These summary-based optimizations only modify constraints in the specialization of roles (Sec. 3.1) rather than structured communication interfaces in communication protocols (Sec. 3.2). Future advancements are yet to be explored." (p. 15) **AIWG Adaptation**: ```markdown ## Project Retrospective Agent ### Skills - analyze-project-artifacts: Review completed .aiwg/ directory - extract-lessons-learned: Identify patterns in successes/failures - update-agent-templates: Refine agent definitions based on lessons - maintain-knowledge-base: Store organizational learning ### Workflow 1. On project completion, analyze: - Which agents required most human corrections? - Which artifacts had quality issues? - Where did phase gates get blocked? - What handoffs caused delays? 2. Extract lessons: - "Requirements Specialist underspecified NFRs → Add NFR checklist" - "Technical Designer missed security concerns → Strengthen security review" 3. Update templates: - Add constraints to agent definitions - Enhance validation rules - Improve SOP procedures 4. Version control: - Track template evolution - A/B test improvements - Roll back if quality degrades ``` ### 2. Multi-Agent Economies **Concept** (from Appendix A.2): Integration with "Economy of Minds" (EOM) framework for dynamic role negotiation and credit assignment through market mechanisms. **AgentStore Platform**: "Each agent in AgentStore provides a list of services with corresponding costs. A convenient API is provided so that human users or agents in the platform can easily purchase services from other agents to accomplish their services." (p. 15-16) **AIWG Future Direction**: Could implement marketplace for specialized agents: - **Security Auditor**: $0.50 per threat model - **Performance Optimizer**: $0.75 per bottleneck analysis - **Documentation Writer**: $0.25 per API documentation page **Benefits**: - Dynamic role assignment based on task complexity - Quality incentives through reputation/pricing - Specialization emergence through market signals ### 3. Handling Deep-Seated Challenges **Use Context Efficiently** (from Appendix E.1, p. 26): **Challenge 1**: Unfolding short natural language descriptions to eliminate ambiguity - **MetaGPT Solution**: Product Manager agent expands brief inputs into detailed PRDs - **Evidence**: High-level prompts (13 words) still achieve 3.8/4.0 executability **Challenge 2**: Maintaining information validity in lengthy contexts - **MetaGPT Solution**: Publish-subscribe mechanism filters relevant information - **Evidence**: Architect only subscribes to PRD, ignoring later-stage artifacts **Reduce Hallucinations** (from Appendix E.1, p. 26): "LLMs often struggle with software generation due to vague task definitions. Focusing on granular tasks like requirement analysis and package selection offers guided thinking, which LLMs lack in broad task solving." **MetaGPT Strategies**: 1. **Granular task decomposition**: Product Manager → Architect → Engineer (narrow focus) 2. **Structured schemas**: Enforce specific output formats 3. **Intermediate verification**: Each role verifies prerequisites 4. **Executable feedback**: Runtime errors catch hallucinated code ## Comparison with Related Work ### ChatDev vs. MetaGPT | Aspect | ChatDev | MetaGPT | |--------|---------|---------| | **Communication** | Unstructured dialogue | Structured documents | | **Workflow** | Chat-based discussion | Assembly-line SOP | | **Intermediate Outputs** | Conversational | Formalized schemas | | **Executability** | 2.1 / 4.0 | **3.9 / 4.0** | | **Token Efficiency** | 248.9 tokens/line | **124.3 tokens/line** | | **Human Revisions** | 2.5 corrections | **0.83 corrections** | **Key Difference**: "Unlike ChatDev, agents in MetaGPT communicate through documents and diagrams (structured outputs) rather than dialogue. These documents contain all necessary information, preventing irrelevant or missing content." (p. 6) ### AutoGPT / LangChain / AgentVerse vs. MetaGPT **Why General Frameworks Struggle** (p. 23): "While models such as AutoGPT, Langchain, and AgentVerse display robust general problem-solving capabilities, they lack an essential element for developing complex systems: **systematically deconstructing requirements**." **MetaGPT's Advantage**: "Simplifies the process of transforming abstract requirements into detailed class and function designs through a specialized division of labor and SOPs workflow." ## Limitations and Future Work ### Current Limitations (from Appendix D.1, p. 25) **System Side**: - "At present, our system cannot fully cater to specific scenarios, such as UI and frontend, as we have yet to incorporate such agents and multimodal tools." - "Despite generating the most amount of code among comparable frameworks, it remains challenging to fulfill real-world applications' diverse and complex requirements." **User Side**: - "A key challenge for users is to interrupt the running process of each agent, or set the starting running point (checkpoint) for each agent." **AIWG Addresses These**: - ✓ Human-in-the-loop design allows interruption at any phase gate - ✓ Broader scope beyond code (deployment, security, documentation) - ✓ Checkpoint system through phase gates and artifact versioning ### Future Directions 1. **Multimodal Capabilities**: Add UI/UX design agents with visual outputs 2. **Dynamic Workflows**: Allow runtime SOP modification based on project needs 3. **Cross-Project Learning**: Aggregate lessons across multiple projects 4. **Human Collaboration**: Better integration of human feedback during execution 5. **Formal Verification**: Add proof-checking agents for critical systems ## Key Quotes **On SOPs**: > "MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors." (p. 1) **On Structured Communication**: > "The use of intermediate structured outputs significantly increases the success rate of target code generation. Because it helps maintain consistency in communication, minimizing ambiguities and errors during collaboration." (p. 2) **On Assembly Line Paradigm**: > "MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together." (p. 1) **On Reducing Hallucinations**: > "All employees follow a strict and streamlined workflow, and all their handovers must comply with certain established standards. This reduces the risk of hallucinations caused by idle chatter between LLMs, particularly in role-playing frameworks." (p. 2) **On Executable Feedback**: > "This enables the Engineer to continuously improve code using its own historical execution and debugging memory. To obtain additional information, the Engineer writes and executes the corresponding unit test cases, and subsequently receives the test results." (p. 6) **On Information Overload**: > "Sharing all information with every agent can lead to information overload. During task execution, an agent typically prefers to receive only task-related information and avoid distractions through irrelevant details." (p. 6) **On Meta-Programming**: > "MetaGPT stands out as a unique solution that allows for efficient meta-programming through a well-organized group of specialized agents. Each agent has a specific role and expertise, following some established standards." (p. 2) ## AIWG Implementation Checklist Based on MetaGPT validation, AIWG should prioritize: **Completed**: - [x] Define specialized roles with domain expertise - [x] Create structured document templates - [x] Implement artifact directory (`.aiwg/`) - [x] Build phase-gated workflow - [x] Document handoff protocols **High Priority Enhancements**: - [ ] **Enforce structured output validation** - Check schema compliance before gate approval - [ ] **Implement subscription mechanism** - Filter artifact access by agent role - [ ] **Add executable feedback loops** - Test-debug-retry for code generation - [ ] **Create SOP templates** - Encode procedures as agent instructions - [ ] **Build context pruning** - Limit information to role-relevant artifacts **Medium Priority**: - [ ] Develop self-improvement through retrospectives - [ ] Add competitive analysis to Requirements Specialist - [ ] Create sequence diagram generation for Technical Designer - [ ] Implement token efficiency metrics - [ ] Build quality dashboards (executability scoring) **Future Exploration**: - [ ] Multi-agent economy with dynamic role negotiation - [ ] Multimodal capabilities (UI/UX design) - [ ] Cross-project learning aggregation - [ ] Formal verification agents for critical systems ## Cross-References **Related AIWG References**: - @docs/references/REF-004-magis-multi-agent-software.md - Complementary multi-agent approach - @docs/references/REF-012-chatdev-multi-agent-software.md - Direct comparison baseline - @docs/references/REF-018-react-reasoning-acting.md - ReAct pattern used in MetaGPT - @docs/references/REF-021-reflexion-verbal-reinforcement.md - Self-reflection mechanisms - @docs/references/REF-010-stage-gate-systems.md - Phase gate methodology **AIWG Framework Components**: - @agentic/code/frameworks/sdlc-complete/README.md - SDLC orchestration - @agentic/code/frameworks/sdlc-complete/agents/manifest.json - Agent catalog - @agentic/code/frameworks/sdlc-complete/templates/ - Structured output templates - @docs/sdlc/workflows/ - Phase transition SOPs **Related Papers**: - Belbin, R.M. (2012). *Team Roles at Work*. - Role specialization theory - Agile Manifesto (2001). - Iterative development principles - DeMarco & Lister (2013). *Peopleware* - Human workflow patterns - Yao et al. (2022). ReAct - Reasoning and acting in LLMs - Shinn et al. (2023). Reflexion - Self-correction mechanisms ## Relevance to AIWG | Category | Relevance | Rationale | |----------|-----------|-----------| | **Workflow Orchestration** | **CRITICAL** | Validates SOP encoding and phase-gated workflow | | **Agent Design** | **CRITICAL** | Confirms role specialization and structured outputs | | **Artifact Management** | **HIGH** | Supports `.aiwg/` directory and publish-subscribe pattern | | **Quality Assurance** | **HIGH** | Demonstrates value of executable feedback loops | | **Communication Protocol** | **HIGH** | Validates structured messaging over free-form chat | | **Meta-Programming** | **MODERATE** | Provides theoretical foundation for AI-driven SDLC | **Overall Assessment**: MetaGPT provides **strong empirical validation** for AIWG's core architectural decisions, with state-of-the-art benchmark results demonstrating the effectiveness of SOP-driven multi-agent collaboration. ## Revision History | Date | Author | Changes | |------|--------|---------| | 2026-01-24 | Research Agent (#74) | Comprehensive documentation from full PDF analysis | --- **Document Status**: Complete **Last Updated**: 2026-01-24 **AIWG Relevance**: CRITICAL - Core validation of multi-agent SDLC approach **Implementation Priority**: HIGH - Direct applicability to framework design