aiwg

Version:

Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.

aiwg.io

jmagly/aiwg

816 lines (569 loc) • 32.6 kB

Markdown

# REF-022: AutoGen - Enabling Next-Gen LLM Applications via Multi-Agent Conversation ## Citation Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., & Wang, C. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. *arXiv preprint arXiv:2308.08155v2*. **arXiv**: [https://arxiv.org/abs/2308.08155](https://arxiv.org/abs/2308.08155) **GitHub**: [https://github.com/microsoft/autogen](https://github.com/microsoft/autogen) **Published**: August 15, 2023 (v1), October 3, 2023 (v2) **Institution**: Microsoft Research, Pennsylvania State University, University of Washington, Xidian University ## Executive Summary AutoGen is Microsoft Research's open-source framework for building LLM applications through multi-agent conversation. The framework introduces **conversable agents** that can combine LLMs, human inputs, and tools in flexible configurations, and **conversation programming** as a paradigm for orchestrating complex workflows. With 2.7M downloads and 37K GitHub stars (as of publication), AutoGen demonstrates production-ready multi-agent systems across mathematics, coding, question answering, decision-making, and entertainment domains. **Key Innovation**: Using natural language conversation as the primary coordination protocol between agents, each potentially backed by different capabilities (LLM, human, tool, or hybrid). **Production Impact**: Demonstrates 3-5x reduction in user interactions, 69.48% success rate on MATH benchmark (vs. 55.18% for vanilla GPT-4), and significant code reduction (430 lines to 100 lines in OptiGuide application). ## Core Concepts ### 1. Conversable Agent Abstraction **Definition**: "A conversable agent is an entity with a specific role that can pass messages to send and receive information to and from other conversable agents" (p. 3). **Key Properties**: - **Message passing**: Unified `send()`, `receive()`, `generate_reply()` interfaces - **Context maintenance**: Tracks conversation history and internal state - **Configurable capabilities**: Can leverage LLMs, humans, tools, or combinations - **Programmed behavior**: Follows patterns defined through reply functions **Agent Capabilities** (p. 3): | Capability | Description | Examples | |------------|-------------|----------| | **LLM-backed** | Role playing, reasoning, code generation, feedback adaptation | AssistantAgent with GPT-4 | | **Human-backed** | Solicits input at configurable intervals, allows skipping | UserProxyAgent in ALWAYS mode | | **Tool-backed** | Executes code, calls functions, interacts with external systems | Code executor, API clients | | **Hybrid** | Combines multiple capabilities in single agent | UserProxyAgent (human + tool) | **Built-in Agent Hierarchy** (p. 3): ``` ConversableAgent (base class) ├── AssistantAgent (LLM-backed, code generation) ├── UserProxyAgent (human + tool backed) └── GroupChatManager (dynamic speaker selection) ``` ### 2. Conversation Programming Paradigm **Definition**: "A programming paradigm centered around inter-agent conversations, involving two concepts: computation (actions agents take to compute responses) and control flow (sequence/conditions under which computations happen)" (p. 4). **Conversation-Centric Computation**: - Agents compute replies based on conversation context - Actions result in message passing for consequent conversations - Termination conditions can stop the flow **Conversation-Driven Control Flow**: - Agent decisions on which agents to message are functions of conversation history - Natural language can program control logic via prompts - Python code can specify termination conditions, human input modes, execution logic **Design Patterns** (pp. 4-5): #### Pattern 1: Auto-Reply Mechanism ``` Agent A receives message → generates_reply() → sends to Agent B Agent B auto-replies → generates_reply() → sends to Agent A ... continues until termination condition ``` **Benefits**: Decentralized, modular, no central orchestrator needed #### Pattern 2: Control Fusion (Natural + Programming Language) | Control Type | Mechanism | Example | |--------------|-----------|---------| | **Natural Language** | LLM-based prompts guide flow | "Reply 'TERMINATE' when done" | | **Programming Language** | Python code specifies logic | `max_auto_reply=5` | | **Hybrid Transition** | LLM function calls invoke code | GPT-4 function calling | **Example from Default System Message** (p. 18, Appendix C): ``` "If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumption, collect additional info you need, and think of a different approach to try." ``` This single prompt programs error handling, retry logic, and adaptive planning. ### 3. Conversation Patterns **Static Patterns** (predefined topology): - **Two-Agent Chat**: Simple back-and-forth dialogue - **Sequential Pipeline**: Agent A → Agent B → Agent C - **Hierarchical**: Manager delegates to worker agents **Dynamic Patterns** (topology changes during execution): - **Group Chat**: Manager selects next speaker based on conversation context - **Function-Driven**: LLM decides which agents to invoke via function calls - **Nested Conversations**: Agent can hold current conversation while starting new ones **GroupChatManager Workflow** (p. 9): 1. **Select Speaker**: Uses role-play prompt to choose next agent 2. **Collect Response**: Asks selected speaker to respond 3. **Broadcast**: Shares response with all group members 4. **Repeat**: Continues until termination ### 4. Human-in-the-Loop Integration **Human Input Modes** (p. 3): | Mode | Description | Use Case | |------|-------------|----------| | `ALWAYS` | Requests input every turn | Development, debugging, critical decisions | | `NEVER` | Fully autonomous | Production automation | | `TERMINATE` | Input only at termination | Final approval/review | **Configurable Skip**: Humans can choose not to provide input, allowing automation to continue **Example**: Math problem solving switched from autonomous to human-in-loop by changing single parameter: `human_input_mode='ALWAYS'` (p. 6) ## Architecture and Implementation ### Enhanced LLM Inference Layer **Features** (p. 3): - Result caching (avoid redundant API calls) - Error handling and retry logic - Message templating - Multiple model backend support ### Reply Function System **Built-in Reply Functions**: - LLM-based inference - Code execution - Function calling - Human input solicitation **Custom Reply Functions** (p. 5): - Developers can `register_reply()` to define custom behaviors - Can chain conversations (agent holds current chat while consulting others) - Enables complex orchestration patterns ### Memory and Context Management - Each agent maintains isolated conversation history - Prevents "shortcuts and hallucinations" (p. 27) - Commander pattern: Central agent manages shared context across sub-agents ## Key Applications and Benchmark Results ### A1: Mathematics Problem Solving **Benchmark**: MATH dataset (Level-5 problems, most challenging) **Results** (p. 7, Figure 4a): | System | 120 Level-5 | Full Dataset | |--------|-------------|--------------| | **AutoGen** | **69.48%** | **69.48%** | | ChatGPT + Code Interpreter | 52.5% | N/A | | ChatGPT + Wolfram Plugin | 48.33% | N/A | | GPT-4 (vanilla) | 45.0% | 55.18% | | Multi-Agent Debate | 30.0% | N/A | | LangChain ReAct | 26.67% | N/A | **Key Quote** (p. 6): "AutoGen agents can be used out of the box to achieve the most competitive performance on math problem solving tasks." **Three Scenarios**: 1. **Autonomous**: AssistantAgent + UserProxyAgent solve problems automatically 2. **Human-in-Loop**: Human provides hints when agents stuck (100% success with hints) 3. **Multi-User**: Student and expert collaborate via separate agent pairs ### A2: Retrieval-Augmented Question Answering **Benchmark**: Natural Questions dataset (6,775 queries) **Novel Feature**: Interactive retrieval - when context insufficient, assistant replies "UPDATE CONTEXT" triggering automatic re-retrieval **Results** (p. 7, Figure 4b) vs. DPR baseline: | Metric | AutoGen | AutoGen W/O Interactive | DPR | |--------|---------|------------------------|-----| | F1 Score | 25.88% | 23.40% | 15.12% | | Recall | 66.65% | 62.60% | 58.56% | **Impact**: 19.4% of questions triggered "UPDATE CONTEXT", demonstrating value of conversational retrieval ### A3: Decision Making (ALFWorld) **Benchmark**: 134 household task environments (text-based) **Innovation**: Three-agent design adds grounding agent to supply commonsense knowledge **Results** (p. 8, Figure 4c): | System | Average | Best of 3 | |--------|---------|-----------| | **ALFChat (3 agents)** | **69%** | **77%** | | ALFChat (2 agents) | 54% | 63% | | ReAct | 54% | 66% | **Key Insight**: Grounding agent providing commonsense ("You must find and take the object before you can examine it") prevented error loops, adding +15% success rate ### A4: Multi-Agent Coding (OptiGuide) **Domain**: Supply chain optimization code generation **Architecture**: Commander → Writer ↔ Safeguard (adversarial checking) **Results** (p. 8, Figure 4d): | Metric | Multi-Agent GPT-4 | Single-Agent GPT-4 | Multi-Agent GPT-3.5 | Single-Agent GPT-3.5 | |--------|-------------------|-------------------|---------------------|---------------------| | F1 (Safety) | 96% | 88% | 83% | 48% | | Recall | 98% | 78% | 72% | 32% | **Code Reduction**: 430 lines → 100 lines (4.3x improvement) **User Experience**: 3x time savings, 3-5x fewer interactions vs. ChatGPT + Code Interpreter ### A5: Dynamic Group Chat **Evaluation**: 12 manually crafted complex tasks **Comparison**: 4-agent group chat vs. 2-agent system vs. task-based speaker selection **Results** (p. 28, Table 5): | System | GPT-3.5 Success | GPT-4 Success | Avg LLM Calls | Termination Failures | |--------|----------------|---------------|---------------|---------------------| | **Group Chat (role-play)** | **9/12** | **11/12** | **5.3** | **0** | | Two-Agent | 8/12 | 9/12 | 9.9 | 9 | | Task-based selection | 7/12 | 8/12 | 4.0 | 0 | **Key Finding**: Role-play prompting for speaker selection outperforms task-based prompting in both success rate and robustness ### A6: Conversational Chess **Features**: - Natural language moves with personality - AI-AI, AI-human, human-human modes - Board agent for grounding (validates legal moves) **Ablation Study**: Without board agent, illegitimate moves disrupted games. With board agent, games maintained integrity through error detection and correction cycles. ### A7: Browser Interactions (MiniWoB++) **Benchmark**: 49 web manipulation tasks **System**: MiniWobChat (AssistantAgent + Executor) **Results** (p. 32-33): - **Success Rate**: 52.8% (vs. 56.4% for RCI, state-of-the-art) - **Within 0.1 tolerance**: Both methods outperform on equal number of tasks **Advantage**: Modular design allows easy reuse of decision-making agent across different environments ## Critical Insights and Quotes ### On Multi-Agent Conversation Benefits > "Our insight is to use multi-agent conversations to achieve it. There are at least three reasons confirming its general feasibility and utility thanks to recent advances in LLMs: First, because chat-optimized LLMs (e.g., GPT-4) show the ability to incorporate feedback, LLM agents can cooperate through conversations with each other or human(s)... Second, because a single LLM can exhibit a broad range of capabilities (especially when configured with the correct prompt and inference settings), conversations between differently configured agents can help combine these broad LLM capabilities in a modular and complementary manner. Third, LLMs have demonstrated ability to solve complex tasks when the tasks are broken into simpler subtasks." (p. 2) ### On Conversable Agent Design > "AutoGen uses a generic design of agents that can leverage LLMs, human inputs, tools, or a combination of them. The result is that developers can easily and quickly create agents with different roles (e.g., agents to write code, execute code, wire in human feedback, validate outputs, etc.) by selecting and configuring a subset of built-in capabilities." (p. 2) ### On Conversation Programming > "A fundamental insight of AutoGen is to simplify and unify complex LLM application workflows as multi-agent conversations. So AutoGen adopts a programming paradigm centered around these inter-agent conversations." (p. 2) ### On Auto-Reply Mechanism > "AutoGen also introduces and by default adopts an agent auto-reply mechanism to realize conversation-driven control: Once an agent receives a message from another agent, it automatically invokes generate_reply and sends the reply back to the sender unless a termination condition is satisfied." (p. 5) ### On Natural Language Control > "In AutoGen, one can control the conversation flow by prompting the LLM-backed agents with natural language. For instance, the default system message of the built-in AssistantAgent in AutoGen uses natural language to instruct the agent to fix errors and generate code again if the previous result indicates there are errors." (p. 5) ### On Production Benefits > "The adoption of AutoGen has resulted in improved performance (over state-of-the-art approaches), reduced development code, and decreased manual burden for existing applications." (p. 9) ### On Modularity > "Dividing tasks among separate agents promotes modularity. Furthermore, since each agent can be developed, tested, and maintained separately, this approach simplifies overall development and code management." (p. 9) ### On Safety and Human Agency > "While the autonomous mode AutoGen supports could be desirable in many scenarios, a high level of autonomy can also pose potential risks, especially in high-risk applications... maintaining effective human oversight of applications built with AutoGen agents will become important." (p. 17) ## AIWG Agent Communication Mapping ### Direct Parallels | AutoGen Pattern | AIWG Equivalent | Notes | |----------------|-----------------|-------| | **Conversable Agents** | Specialized SDLC agents | Both use role-based agent design | | **AssistantAgent** | Primary Author agent | LLM-backed, generates content | | **UserProxyAgent** | Review/Approval agents | Can solicit human input, execute validations | | **GroupChatManager** | Orchestrator agent | Manages multi-agent workflows | | **Auto-reply mechanism** | Agent handoff protocol | Automatic progression through workflow | | **Conversation history** | Artifact lineage | Both track progression and context | ### Conversation Patterns in AIWG **Two-Agent Pattern** (REF-012 ChatDev parallel): ``` AIWG: Requirement Author ↔ Reviewer AutoGen: AssistantAgent ↔ UserProxyAgent ``` **Group Chat Pattern** (REF-013 MetaGPT parallel): ``` AIWG: Multi-agent review panel → Synthesizer → Final artifact AutoGen: GroupChatManager selects speakers → aggregates → produces result ``` **Hierarchical Pattern**: ``` AIWG: Project Manager → Phase-specific agents → Deliverables AutoGen: Commander → Writer + Safeguard → Validated output ``` ### Message Passing vs. Artifact Exchange **AutoGen**: Conversation-centric - Messages are primary coordination mechanism - Conversation history provides context - Output emerges from dialogue **AIWG**: Artifact-centric - Structured documents (UC-001.md, SAD.md) are primary coordination - `.aiwg/` directory provides persistence - Output is templated document **Hybrid Opportunity**: AIWG could adopt AutoGen's conversation patterns while maintaining artifact-based deliverables: ``` 1. Orchestrator initiates conversation with requirement message 2. Author agent generates initial draft (stored as artifact) 3. Reviewer agents converse about draft (comments in artifact) 4. Synthesizer agent aggregates conversation → final artifact ``` ### Human-in-Loop Integration **AutoGen Modes → AIWG Gates**: | AutoGen Mode | AIWG Gate Equivalent | Trigger | |--------------|---------------------|---------| | `ALWAYS` | Continuous review | Every artifact update | | `NEVER` | Autonomous generation | Full automation mode | | `TERMINATE` | Gate approval | Phase transitions | | Skip option | Optional feedback | Reviewer can defer to AI | **Implementation Pattern**: ```python # AIWG could adopt requirement_agent = ConversableAgent( name="RequirementAuthor", system_message="You write use cases following AIWG templates", human_input_mode="TERMINATE" # Human approves at phase gate ) ``` ### Tool Integration Parallels **AutoGen Tool Agents** → **AIWG Validation Agents**: | AutoGen Tool | AIWG Validator | Purpose | |--------------|---------------|---------| | Code executor | Metadata validator | Ensure correctness | | Board agent (chess) | Template validator | Enforce structure | | Grounding agent (ALFWorld) | Requirements tracer | Maintain consistency | | Safeguard (OptiGuide) | Security reviewer | Prevent unsafe outputs | ### Novel Patterns for AIWG **1. Interactive Retrieval for Requirements**: ``` User: "Add security requirements" Agent: Searches existing NFR modules Agent: "I found general security patterns. UPDATE CONTEXT for domain-specific?" User: "Yes, focus on API security" Agent: Re-retrieves with narrower context → generates UC-SEC-001 ``` **2. Multi-User Collaboration** (from A1 Scenario 3): ``` Developer ↔ Developer's Agent (writes code) ↓ Consults Expert via function call ↓ Expert ↔ Expert's Agent (reviews architecture) ↓ Returns guidance ↓ Developer's Agent incorporates feedback ``` **3. Grounding for Consistency**: ``` When agent generates UC-002: - Grounding agent checks against UC-001, NFRs, SAD - Identifies conflicts ("UC-002 requires async, but SAD specifies sync") - Returns to author agent for resolution ``` ## Key Differences from Prior Work ### vs. CAMEL (REF-012 ChatDev parallel) | Aspect | AutoGen | CAMEL | |--------|---------|-------| | Tool support | Yes (native) | No | | Human-in-loop | Flexible modes | No | | Conversation pattern | Static + Dynamic | Static only | | Production readiness | 2.7M downloads | Research demo | **AutoGen advantage**: Production-tested infrastructure vs. role-playing demonstration ### vs. MetaGPT (REF-013) | Aspect | AutoGen | MetaGPT | |--------|---------|---------| | Domain scope | Generic framework | Software development | | Agent customization | Composable capabilities | SOP-driven roles | | Application focus | Infrastructure | Specialized solution | | Extensibility | High (register custom agents) | Medium (within SOP framework) | **AutoGen advantage**: General-purpose infrastructure vs. domain-specific application ### vs. Multi-Agent Debate | Aspect | AutoGen | Debate | |--------|---------|--------| | Agent capabilities | LLM + tools + human | LLM only | | Conversation pattern | Customizable | Fixed debate structure | | Infrastructure | Full framework | Prompting technique | **AutoGen advantage**: Complete system vs. single pattern **Summary Table** (p. 15): | Framework | Infrastructure | Conversation Pattern | Execution-Capable | Human Involvement | |-----------|---------------|---------------------|-------------------|-------------------| | **AutoGen** | ✓ | Flexible | ✓ | Chat/skip | | Multi-Agent Debate | ✗ | Static | ✗ | ✗ | | CAMEL | ✓ | Static | ✗ | ✗ | | BabyAGI | ✗ | Static | ✗ | ✗ | | MetaGPT | ✗ | Static | ✓ | ✗ | ## Implementation Guidelines (Appendix B) ### Best Practices (pp. 16-17) **1. Start with Built-in Agents**: - AssistantAgent + UserProxyAgent solve many problems out-of-box - Customize via system messages before extending classes **2. Simple Topology First**: - Two-agent chat handles most scenarios - Group chat for complex multi-party coordination - Avoid premature complexity **3. Reuse Built-in Reply Methods**: - LLM-based, tool-based, human-based reply functions cover most needs - Custom reply functions only when necessary **4. Human-in-Loop Development**: - Start with `human_input_mode='ALWAYS'` during development - Evaluate assistant effectiveness with human feedback - Switch to `'NEVER'` for production automation **5. Iterative Refinement**: - Test with small examples - Tune prompts based on outputs - Discover corner cases before scaling ### When NOT to Use AutoGen (p. 17) **Unidirectional Pipelines**: If no back-and-forth needed, simpler frameworks (LangChain, LlamaIndex) may suffice **Existing Tools**: Can integrate LangChain tools as agent backends rather than replacing entire system **Specific Applications**: May want to wrap specialized agents from other frameworks as ConversableAgents **Optimization**: Use `flaml.tune` for hyperparameter tuning of AutoGen configurations ## Future Research Directions (pp. 17) ### 1. Optimal Multi-Agent Workflows **Open Questions**: - For what types of tasks are multi-agent workflows most useful? - How do multiple agents help in different applications? - What is the optimal (cost-effective) workflow for a given task? **Example**: When does three-agent design (Author + Reviewer + Synthesizer) outperform two-agent (Author + Reviewer)? ### 2. Highly Capable Agents **Challenge**: Creating agents that effectively leverage LLMs, tools, and humans **Evidence**: "CAMEL... cannot effectively solve problems in most cases primarily because it lacks the capability to execute tools or code" (p. 17) **Needs**: - Guidelines for application-specific agents - OSS knowledge base of proven agents - Agents that discover and upgrade their own skills ### 3. Scale, Safety, and Human Agency **Scaling Concerns**: - Can extremely complex multi-agent workflows solve extremely complex tasks? - How to log and debug intricate agent interactions? - Risk of "incomprehensible, unintelligible chatter among agents" (Lewis et al., 2017) **Safety Mechanisms**: - Fail-safes against cascading failures - Preventing reward hacking and undesired behaviors - Maintaining human oversight in high-risk applications **Key Quote**: "While AutoGen provides convenient and seamless involvement of humans through a user proxy agent, developers and stakeholders still need to understand and determine the appropriate level and pattern of human involvement to ensure the safe and ethical use of the technology." (p. 17) ## Production Evidence ### Adoption Metrics (as of publication) - **2.7M downloads**: Demonstrates real-world usage - **37K GitHub stars**: Strong community interest - **250+ contributors**: Diverse contributor base - **Monthly releases**: Active maintenance ### Real-World Applications **Mathematics**: Personalized AI tutoring, research assistance **Coding**: Supply chain optimization (OptiGuide), code generation with latest APIs **Question Answering**: Enterprise knowledge bases with retrieval augmentation **Decision Making**: Interactive environments, web automation **Entertainment**: Conversational chess, creative applications ### Developer Experience Improvements **Code Reduction Examples**: - OptiGuide: 430 lines → 100 lines (4.3x) - Chess application: Simplified through register_reply() pattern - ALFWorld: Modular grounding agent added without refactoring **User Interaction Reduction**: - 3-5x fewer interactions in OptiGuide vs. ChatGPT + Code Interpreter - 3x time savings in user study - Automatic retry/refinement reduces manual intervention ## Technical Implementation Details ### Default System Message (Appendix C, p. 18) **Prompting Techniques Color-Coded**: - **Role Play**: "You are a helpful AI assistant" - **Control Flow**: "If the result indicates there is an error, fix the error" - **Output Confine**: "Reply 'TERMINATE' in the end when everything is done" - **Facilitate Automation**: "The user cannot provide any other feedback" - **Grounding**: "Check the execution result returned by the user" **Full Message** (excerpt): ``` You are a helpful AI assistant. Solve tasks using your coding and language skills. In the following cases, suggest python code (in a python coding block) or shell script (in a sh coding block) for the user to execute... If you want the user to save the code in a file before executing it, put # filename: <filename> inside the code block as the first line... If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes... Reply "TERMINATE" in the end when everything is done. ``` This single prompt achieves sophisticated conversation control without explicit state machines. ### Enhanced LLM Inference Features **Caching**: Avoids redundant API calls for identical requests **Error Handling**: Automatic retry with exponential backoff **Message Templating**: Reusable prompt structures **Multi-Model Support**: Works with GPT-3.5, GPT-4, or other LLMs **Configuration Example**: ```python config_list = [ {"model": "gpt-4", "api_key": "..."}, {"model": "gpt-3.5-turbo", "api_key": "..."} ] llm_config = {"config_list": config_list, "cache_seed": 42} ``` ## Limitations and Considerations ### From Ethics Statement (p. 10) **Privacy**: User data in conversations must be protected **Bias**: LLMs exhibit training data biases that propagate through agents **Accountability**: Need to trace decision-making in multi-agent systems **Trust**: Clear communication about capabilities and limitations **Unintended Consequences**: Code execution and function calls carry risk ### Known Issues **Complexity Management** (p. 10): As workflows scale, tracking becomes difficult **Prompt Following**: GPT-4 follows instructions better than GPT-3.5-turbo (p. 18) **Cost**: Multiple LLM calls in multi-agent systems increase API costs **Ambiguity**: Some natural language instructions have ambiguous interpretations ## Related Work and References ### Frameworks Mentioned **Single-Agent**: - AutoGPT: Autonomous AI with tools, no multi-agent support - ChatGPT + Code Interpreter: Single agent with execution - LangChain Agents: Single-agent paradigm (ReAct) - Transformers Agent: Natural language API, single agent **Multi-Agent**: - BabyAGI: Task management system, static pattern - CAMEL: Role-playing framework, no tools - Multi-Agent Debate: Encouraging divergent thinking - MetaGPT: Software development, static SOP **Infrastructure**: - LangChain: General LLM application framework - LlamaIndex: RAG and document processing - Guidance: Constrained generation - Semantic Kernel: Microsoft's LLM orchestration ### Key Citations **Multi-Agent Theory**: - Liang et al. (2023): Multi-agent debate for divergent thinking - Du et al. (2023): Debate improves factuality and reasoning - Li et al. (2023b): CAMEL role-playing framework **LLM Capabilities**: - Yao et al. (2022): ReAct (reasoning + acting) - Xi et al. (2023): Survey of LLM-based agents - Wang et al. (2023b): Autonomous agents survey **Applications**: - Hendrycks et al. (2021): MATH dataset - Shridhar et al. (2021): ALFWorld benchmark - Kwiatkowski et al. (2019): Natural Questions **Safety and Ethics**: - Amodei et al. (2016): Concrete problems in AI safety - Navigli et al. (2023): Biases in large language models - Amershi et al. (2019): Guidelines for human-AI interaction ## Significance for AIWG ### Critical Implications **1. Validated Multi-Agent Communication Patterns** AutoGen provides production-tested evidence that multi-agent conversation works at scale. AIWG can adopt proven patterns: - **Group chat with dynamic speaker selection** for complex review processes - **Auto-reply mechanism** for automated workflow progression - **Human-in-loop modes** for gate approvals **2. Conversation as Coordination Protocol** AutoGen demonstrates that natural language can coordinate complex workflows. AIWG could adopt: - **Conversational requirement gathering**: Agents discuss and refine requirements - **Review conversations**: Multi-agent panels debate document quality - **Handoff protocols**: Agents negotiate responsibility transfers **3. Tool Integration Patterns** AutoGen's tool-backed agents provide a model for AIWG validators: - **Template validators** as tool agents - **Metadata checkers** as grounding agents - **Traceability analyzers** as safeguard agents **4. Production Readiness** With 2.7M downloads, AutoGen proves multi-agent systems can work in production. AIWG's multi-agent patterns are therefore not just theoretical but practical. ### Integration Opportunities **Direct Adoption**: - Use AutoGen's `ConversableAgent` as base class for AIWG agents - Adopt `GroupChatManager` for multi-agent review panels - Integrate `register_reply()` pattern for custom validators **Pattern Translation**: - AutoGen's "UPDATE CONTEXT" → AIWG's "REQUEST CLARIFICATION" - AutoGen's grounding agent → AIWG's consistency checker - AutoGen's safeguard agent → AIWG's security reviewer **Hybrid Architecture**: ``` AIWG Artifact System (persistence) ↓ AutoGen Conversation Layer (orchestration) ↓ Specialized SDLC Agents (generation) ``` ### Divergence Points **Output Format**: - AutoGen: Conversation result (unstructured) - AIWG: Templated documents (structured) **Persistence**: - AutoGen: Chat history (ephemeral) - AIWG: `.aiwg/` artifacts (permanent) **Validation**: - AutoGen: Conversational agreement - AIWG: Gate criteria checks **Solution**: AIWG could use AutoGen for orchestration while maintaining artifact-based deliverables: ``` 1. Agents converse (AutoGen) 2. Conversation result fills template (AIWG) 3. Template saved to .aiwg/ (AIWG) 4. Gate validates artifact (AIWG) ``` ## Cross-References ### Within AIWG Bibliography - **@docs/references/REF-012-chatdev-communicative-agents.md** - Role-based communication in software development - **@docs/references/REF-013-metagpt-meta-programming.md** - SOP-driven multi-agent systems - **@docs/references/REF-007-mixture-of-experts.md** - Ensemble foundation for multi-agent - **@docs/references/bibliography.md** - Full bibliography entry ### AIWG Implementation Files - **@agentic/code/frameworks/sdlc-complete/agents/** - Agent definitions - **@agentic/code/frameworks/sdlc-complete/docs/orchestrator-architecture.md** - Orchestration patterns - **@docs/multi-agent-documentation-pattern.md** - Multi-agent workflow template ### External Resources - **GitHub Repository**: [microsoft/autogen](https://github.com/microsoft/autogen) - **Documentation**: [AutoGen Docs](https://microsoft.github.io/autogen/) - **Paper**: [arXiv:2308.08155](https://arxiv.org/abs/2308.08155) - **Examples**: [AutoGen Examples Directory](https://github.com/microsoft/autogen/tree/main/notebook) ## Glossary of Terms | Term | Definition | Page | |------|------------|------| | **Conversable Agent** | Entity that can send/receive messages, maintain context, and perform role-specific actions | p. 3 | | **Conversation Programming** | Paradigm using conversations for computation (actions) and control flow (sequence) | p. 4 | | **Auto-Reply Mechanism** | Automatic invocation of generate_reply() upon receiving message | p. 5 | | **Human Input Mode** | Configuration for when/how agent solicits human input (ALWAYS/NEVER/TERMINATE) | p. 3 | | **Register Reply** | Method to add custom reply functions to agents | p. 5 | | **Group Chat** | Multi-agent pattern with dynamic speaker selection | p. 9 | | **Grounding Agent** | Agent that provides domain knowledge or constraints | p. 8 | | **Tool-Backed Agent** | Agent with capability to execute code or call functions | p. 3 | | **System Message** | Initial prompt defining agent role and behavior | p. 18 | | **Conversation History** | Accumulated messages providing context for agents | p. 3 | ## Revision History | Date | Author | Changes | |------|--------|---------| | 2026-01-24 | Research Specialist | Comprehensive documentation from PDF analysis | --- **Document Status**: Complete **Research Priority**: Critical for AIWG agent communication **Implementation Readiness**: Production-proven (2.7M downloads) **Next Actions**: 1. Evaluate AutoGen integration for AIWG orchestration 2. Map AIWG agent roles to AutoGen patterns 3. Prototype conversation-based requirement gathering 4. Design hybrid artifact + conversation architecture