aiwg
Version:
Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.
1,278 lines (904 loc) • 50.5 kB
Markdown
# REF-004: MAGIS - LLM-Based Multi-Agent Framework for GitHub Issue Resolution
## Citation
Tao, W., Zhou, Y., Wang, Y., Zhang, W., Zhang, H., & Cheng, Y. (2024). *MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue ReSolution*. arXiv:2403.17927v2 [cs.SE].
**URL**: https://arxiv.org/abs/2403.17927
**Category**: cs.SE (Software Engineering)
**Publication Date**: June 27, 2024 (v2)
**Affiliations**: Fudan University, University of Macau, Sun Yat-sen University, Chongqing University, The Chinese University of Hong Kong
## Abstract Summary
MAGIS addresses the complex challenge of resolving GitHub issues at the repository level - a task requiring both incorporation of new code and maintenance of existing functionality. Through empirical analysis of why LLMs fail at GitHub issue resolution, the authors propose a multi-agent framework with four specialized agents (Manager, Repository Custodian, Developer, QA Engineer) that collaborate through planning and coding phases.
**Core Challenge Addressed**: LLMs struggle with repository-level GitHub issue resolution, achieving less than 2% success rate when applied directly (GPT-4 on SWE-bench). The challenge encompasses locating files/lines to modify, managing complexity, and generating coherent code changes across entire repositories.
**Key Results**:
- **13.94% resolved ratio** on SWE-bench benchmark
- **8x improvement** over direct GPT-4 application (1.74% → 13.94%)
- **2x improvement** over previous SOTA (Claude-2 at 4.88%)
- **97.39% applied ratio** (code changes successfully git-apply)
**Key Contributions**:
1. Empirical analysis identifying three critical factors: file locating, line locating, code change complexity
2. Novel four-agent collaborative framework inspired by GitHub Flow
3. Memory mechanism for repository evolution (reduces LLM query costs)
4. Significant benchmark improvements demonstrating production viability
## Executive Summary
### The GitHub Issue Resolution Problem
GitHub issues represent real software evolution requirements - bug fixes, feature additions, performance enhancements. For popular repositories like Django (34K issues), resolving these programmatically could dramatically accelerate development. However, this is fundamentally different from function-level code generation:
**Repository-Level Challenges**:
- **Scale**: Entire codebase as context (exceeds LLM context limits)
- **Localization**: Finding which files and lines to modify
- **Complexity**: Multiple files, functions, hunks requiring coordinated changes
- **Maintenance**: Must preserve existing functionality while adding new capabilities
- **Testing**: Must pass both existing tests and new requirement tests
### The MAGIS Solution
MAGIS transforms the monolithic task into a **collaborative workflow** with specialized agents:
```
Human → GitHub Issue
↓
PLANNING PHASE:
├── Repository Custodian → Locate candidate files (BM25 + memory + LLM filtering)
├── Manager → Define file-level tasks + build team
└── Kick-off Meeting → Developers confirm plan, resolve dependencies
CODING PHASE:
├── Developer Agents → Locate lines + generate code (per task)
└── QA Engineer → Review + iterate (max iterations or approval)
↓
Merged Repository-Level Code Change
```
**Key Innovations**:
1. **Memory Mechanism**: Reuses file summaries to reduce redundant LLM queries
2. **Decomposition**: Issue → File-level tasks → Line-level edits
3. **Multi-step Coding**: Locate lines → Extract old code → Generate new code → Review
4. **Collaborative Planning**: Kick-off meetings ensure task coherence
5. **Continuous QA**: Each developer paired with dedicated QA engineer
## Empirical Study (Section 2)
The paper conducts rigorous analysis to answer: **Why does direct LLM application fail at GitHub issue resolution?**
### RQ1: Why is Performance Limited?
#### Factor 1: Locating Files to Modify
**Finding**: Higher recall improves results initially, but including too many files degrades performance.
- Claude-2: **29.58% recall → 1.96% resolved**, but **51.06% recall → 1.22% resolved**
- Cause: Including irrelevant files or exceeding LLM context capacity
**Implication**: Need **high recall with minimal files** - strategic balance, not just more files.
**Quote** (p.2-3): "optimizing the performance of LLMs can be better achieved by striving for higher recall scores with a minimized set of files"
#### Factor 2: Locating Lines to Modify
**Metric**: Coverage ratio = intersection of generated vs reference line ranges
**Formula** (Equation 1, p.3):
```
Coverage Ratio = Σ(intersection of modified lines) / Σ(total reference lines modified)
```
**Finding**: Strong positive correlation between line coverage and resolution success.
- Claude-2: **coefficient 0.5997, P < 0.05** (statistically significant)
- GPT-4/GPT-3.5: Limited data due to low success rates
**Distribution Analysis** (Figure 1, p.3):
- All three LLMs show **highest frequency at coverage ratio ≈ 0** (most attempts miss the target)
- Claude-2 > GPT-4 > GPT-3.5 at **coverage ratio ≈ 1** (perfect localization)
- This ranking matches their overall resolution success rates
**Quote** (p.3): "locating lines is a key factor for GitHub issue resolution"
#### Factor 3: Code Change Complexity
**Indices Measured**: # files, # functions, # hunks, # added LoC, # deleted LoC, # changed LoC
**Finding**: Significant negative correlation between complexity and success (Table 1, p.4).
| LLM | # Files | # Functions | # Hunks | # Added LoC | # Deleted LoC | # Changed LoC |
|-----|---------|-------------|---------|-------------|---------------|---------------|
| GPT-3.5 | −17.57* | −17.57* | −0.06* | −0.02 | −0.03 | −0.53* |
| GPT-4 | −25.15* | −25.15* | −0.06 | −0.10 | −0.04 | −0.21 |
| Claude-2 | −1.47* | −1.47* | −0.11* | −0.09* | −0.07* | −0.44* |
(* = P-value < 0.05, statistically significant)
**Interpretation**:
- **Number of files/functions**: Strong negative impact across all models
- **Claude-2**: Better handles complexity (lower negative coefficients)
- **More complex issues** (multi-file, multi-function) → **lower resolution rates**
**Quote** (p.3): "increased complexity, particularly in terms of the number of files and functions modified, may hinder the issue resolution"
### Empirical Study Summary
**Three Critical Success Factors**:
1. **File Locating**: Precision matters more than raw recall
2. **Line Locating**: Accurate line identification strongly predicts success
3. **Complexity Management**: Simpler changes (fewer files/functions) succeed more often
**AIWG Alignment**: These findings directly inform MAGIS design and validate AIWG's own decomposition strategy (issues → tasks → subtasks).
## Methodology (Section 3)
### Four Agent Roles
MAGIS implements four specialized agents inspired by GitHub Flow (human workflow paradigm):
#### 1. Manager Agent
**Responsibilities**:
- Decompose GitHub issue into file-level tasks
- Dynamically assemble developer team (one Developer per task)
- Organize kick-off meeting
- Generate executable work plan
**Innovation vs Human Workflow**: Humans form teams first, then assign tasks. MAGIS defines tasks first, then designs Developer agents to match - **greater flexibility**.
**Algorithm 2 (Team Building, p.6)**:
```
For each candidate file fi:
ti ← LLM(fi, issue description) # Define file-level task
ri ← LLM(ti, issue) # Design Developer role
Team ← Team ∪ {Developer with role ri}
```
**Quote** (p.4): "improves team flexibility and adaptability, enabling the formation of teams that can meet various issues efficiently"
#### 2. Repository Custodian Agent
**Responsibilities**:
- Locate candidate files relevant to GitHub issue
- Filter irrelevant files to minimize LLM context costs
- Maintain repository evolution memory (key innovation)
**Challenges Addressed**:
- **Computational cost**: Querying LLM for every file in large repos on every update
- **Performance degradation**: Long context inputs reduce LLM effectiveness (p.5 citations [31, 33, 68])
**Algorithm 1 (Locating with Memory, p.5)**:
```
1. BM25 ranking → Select top-k candidates
2. For each file fi:
a. Check memory M for previous summary sh
b. If file changed since version h:
- Compute diff: Δd = diff(fh, fi)
- If len(sh) < len(fi): reuse summary + LLM(Δd) for update
- Else: generate new summary
c. LLM determines relevance to issue → filter irrelevant files
```
**Memory Mechanism Benefits**:
- **Reuse**: Previous file summaries compressed by LLM
- **Incremental**: Only analyze diffs (git diff) for changed files
- **Cost reduction**: Avoid re-querying entire file contents
**Quote** (p.5): "Considering that applying the code change often modifies a specific part of the file rather than the entire file, we propose a memory mechanism to reuse the previously queried information"
#### 3. Developer Agent
**Responsibilities**:
- Execute assigned file-level task from Manager
- Locate specific line ranges to modify
- Generate new code to replace old code
- Iterate based on QA Engineer feedback
**Advantages Over Human Developers** (p.5):
- Work continuously without fatigue
- Parallel scheduling easier (no human constraints)
- Leverage automatic code generation strengths
**Innovation**: Decompose code modification into sub-operations (locate → extract → generate → replace) to maximize LLM's code generation strengths while mitigating change generation weaknesses.
#### 4. QA Engineer Agent
**Responsibilities**:
- Review each Developer's code change
- Provide task-specific, timely feedback
- Approve or request revisions (up to max iterations)
**Problem Addressed**: Code review delays in human workflows (up to 96 hours, citation [6]) and review neglect (citation [4]).
**Innovation**: **Each Developer paired with dedicated QA Engineer** - personalized, immediate feedback loop.
**Quote** (p.5): "To address this problem, our framework pairs each Developer agent with a QA Engineer agent, designed to offer task-specific, timely feedback"
### Collaborative Process
#### Planning Phase (Section 3.2.1)
**Three Stages**: Locate Code Files → Team Building → Kick-off Meeting
**Locating Code Files** (Algorithm 1):
```
Input: Repository Ri, GitHub issue qx
Output: Candidate files C^k, Repository memory M
1. BM25(Ri, qx) → Rank files by relevance
2. Select top-k files
3. For each file:
- Retrieve/generate summary (using memory M)
- LLM filter: relevant to issue? → Keep or discard
```
**Team Building** (Algorithm 2):
```
Input: Candidate files C^k, issue qx
Output: Tasks T^k, Developer role descriptions D^k, Work plan cmain
For each file fi in C^k:
ti ← LLM(fi, qx, prompt P4) # Define task
ri ← LLM(ti, qx, prompt P5) # Design Developer role
T^k ← T^k ∪ (fi, ti)
D^k ← D^k ∪ ri
recording ← kick_off_meeting(D^k) # Agents discuss
D^k ← refine_roles(D^k, recording, P6) # Adjust based on discussion
cmain ← LLM(recording, P7) # Generate executable plan
```
**Kick-off Meeting** (Figure 7, Appendix B, p.17):
Circular speech format:
1. **Manager opens** - states issue, assigned tasks, expected collaboration
2. **Developers speak in turn** - provide opinions, identify dependencies, suggest modifications
3. **Manager summarizes** - generates work plan as executable code
**Purpose** (p.6):
- Confirm tasks are reasonable and comprehensive
- Determine sequential dependencies vs parallel execution
- Avoid conflicts between developers
**Quote** (p.6): "The meeting makes collaboration among Developers more efficient and avoids potential conflicts"
#### Coding Phase (Section 3.2.2)
**Algorithm 3 (Coding Task Execution, p.6-7)**:
```
Input: File-task pairs T^k, max iterations nmax
Output: Repository-level code changes D
For each (fi, ti) in T^k:
ai ← LLM(fi, ti, P8) # Generate QA Engineer role
For j in [0, nmax):
If j > 0:
ti ← (ti, review_comment) # Append feedback
# Multi-step coding process:
{[s'i, e'i]} ← LLM(fi, ti, P9) # Locate line ranges
old_part ← extract(fi, {[s'i, e'i]}) # Extract existing code
new_part ← LLM(fi, ti, old_part, P10) # Generate replacement
f'i ← replace(fi, old_part, new_part) # Apply change
Δdi ← diff(fi, f'i) # Compute diff
# QA Review:
review_comment ← LLM(ti, Δdi, P11)
review_decision ← LLM(review_comment, P11)
If review_decision == approve:
break # Accept code change
D ← D ∪ Δdi # Merge into repository-level change
```
**Multi-Step Breakdown**:
1. **Locate**: Identify line ranges {[start, end]} requiring modification
2. **Extract**: Split file into old_part (to replace) and retained sections
3. **Generate**: LLM creates new_part to replace old_part
4. **Review**: QA Engineer evaluates, provides feedback or approval
5. **Iterate**: Continue until approval or max iterations reached
**Quote** (p.6): "we transform the code change generation into the multi-step coding process that is designed to leverage the strengths of LLMs in code generation while mitigating their weaknesses in code change generation"
### MAGIS Workflow Summary
```
GitHub Issue
↓
Repository Custodian: BM25 → Memory filter → LLM relevance check → Candidate files
↓
Manager: Define tasks → Design Developers → Kick-off meeting → Work plan
↓
Developer (per task):
Locate lines → Extract code → Generate new code → Submit for review
↓
QA Engineer: Review → Feedback/Approval
↓
[Iterate until approval or max attempts]
↓
Merge all code changes → New repository
```
## Experimental Results (Section 4)
### Setup
**Dataset**: SWE-bench - 2,294 real GitHub issues from 12 Python repositories
- **Test set**: 25% subset (574 instances) - same subset used for GPT-4 experiments [27]
- **Repositories**: Django, scikit-learn, matplotlib, pandas, sympy, etc.
**Base Model**: GPT-4 (for fairness with SWE-bench baselines)
**Metrics**:
- **Applied Ratio**: % of instances where code change can be `git apply`'d
- **Resolved Ratio**: % where code change passes all tests (old + new requirements)
**Setting**: Oracle file locating (correct files provided) - focuses evaluation on planning and coding phases.
### RQ2: Overall Effectiveness
**Table 2 (Main Results, p.7)**:
| Method | % Applied | % Resolved |
|--------|-----------|------------|
| GPT-3.5 | 11.67 | **0.84** |
| Claude-2 | 49.36 | **4.88** |
| GPT-4 | 13.24 | **1.74** |
| SWE-Llama 7b | 51.56 | **2.12** |
| SWE-Llama 13b | 49.13 | **4.36** |
| **MAGIS** | **97.39** | **13.94** |
| MAGIS (w/o QA) | 92.71 | 10.63 |
| MAGIS (w/o hints) | 94.25 | 10.28 |
| MAGIS (w/o hints, w/o QA) | 91.99 | 8.71 |
**Key Findings**:
1. **MAGIS achieves 13.94% resolved ratio** - best performance by significant margin
2. **8x improvement over GPT-4** (1.74% → 13.94%) using same base model
3. **2.86x improvement over Claude-2** (4.88% → 13.94%) - previous SOTA
4. **97.39% applied ratio** - nearly all code changes are syntactically valid
5. **Even without QA and hints** (8.71%), still **5x better than GPT-4**
**Quote** (p.7): "our framework's effectiveness is eight-fold that of the base LLM, GPT-4. This substantial increase underscores our framework's capability to harness the potential of LLMs more effectively"
**Ablation Analysis**:
- **w/o QA**: 10.63% (−3.31%) - QA Engineer contributes significantly
- **w/o hints**: 10.28% (−3.66%) - Human clarifications help but aren't required
- **w/o both**: 8.71% (−5.23%) - Core framework still provides 5x improvement
**Implication**: Multi-agent collaboration itself (Manager, Custodian, Developer) drives majority of gains.
### RQ3: Planning Effectiveness
#### Repository Custodian Performance
**Figure 3 (Recall vs File Number, p.8)**: MAGIS consistently outperforms BM25 baseline across all file counts.
- **Higher recall with fewer files** - validates memory mechanism effectiveness
- Strategic filtering reduces irrelevant files while maintaining coverage
#### Manager Performance
**Task Description Quality** (Figure 4, p.8):
GPT-4 evaluates correlation between Manager's generated task descriptions and reference code changes (1-5 scale, Table 6, p.21).
**Distribution**:
- Majority score ≥3 (correct direction)
- Higher scores (4-5) correlate with higher resolution probability
- More "Resolved" outcomes in high-correlation buckets
**Quote** (p.8): "when the generated task description closely aligns with the reference, there is a higher possibility of resolving the issue"
### RQ4: Coding Effectiveness
#### Line Locating Accuracy
**Figure 5 (Coverage Distribution, p.9)**: MAGIS shows **strong preference for coverage ratio ≈ 1** (perfect localization).
Compared to baselines:
- Higher frequency at ratio ≈ 1
- Lower frequency at ratio ≈ 0
- Multi-step process (Algorithm 3) improves line identification
**Figure 6 (Resolved Ratio by Coverage, p.9)**:
- **Cumulative frequency increases with coverage**
- Steeper slope in high-coverage region (0.6-1.0)
- Validates empirical finding: accurate line locating → higher success
**Quote** (p.9): "the Developer agent should prioritize improving its capability of locating code lines"
#### Complexity Correlation Reduction
**Table 3 (Complexity vs Resolution, p.9)**:
| Method | # Files | # Functions | # Hunks | # Added LoC | # Deleted LoC | # Changed LoC |
|--------|---------|-------------|---------|-------------|---------------|---------------|
| GPT-4 | −25.15* | −25.15* | −0.06 | −0.10 | −0.04 | −0.21 |
| **MAGIS** | **−1.55*** | **−1.55*** | −0.12* | −0.04* | −0.06* | −0.57* |
**Finding**: MAGIS dramatically reduces negative impact of file/function complexity.
- GPT-4: −25.15 correlation with # files/functions
- MAGIS: −1.55 correlation (94% reduction in negative impact)
**Implication**: Multi-agent decomposition successfully mitigates complexity barriers.
#### QA Engineer Contribution
**Ablation Result** (Table 2): QA Engineer adds +3.31% resolved ratio (10.63% → 13.94%)
**Case Study** (Figure 11 → Figure 10, Appendix I, p.20):
- Developer initially assigns wrong parameter (`random_state` instead of `seed`)
- QA Engineer identifies error: "doesn't seem entirely correct... could lead to worse results"
- Developer revises → Final code passes all tests
**Quote** (p.9): "This overall enhancement substantiates the QA Engineer's contribution to improving outcomes"
### Comparison with Contemporary Work
**SWE-bench Lite Results** (Table 4, Appendix D, p.18):
| Method | Resolved % |
|--------|-----------|
| AutoCodeRover | 16.11% (22.33% union) |
| SWE-Agent | 18.00% |
| **MAGIS Full** | **25.33%** |
| MAGIS w/o QA | 23.33% |
| MAGIS w/o hints | 16.67% |
| MAGIS w/o both | 16.00% |
**Finding**: MAGIS achieves highest resolved ratio on canonical SWE-bench lite subset.
**Devin Comparison** (Appendix E, p.18):
On 140 overlapping instances:
- MAGIS: 21 resolved (15%)
- Devin: 18 resolved (12.86%)
- **MAGIS faster**: ~3 min/issue vs Devin >10 min for 72% of instances
**Note**: Not entirely fair comparison - Devin has internet access, browser, unknown LLM.
## Case Studies
### Case 1: Django Issue #30664 (Figure 14, p.23)
**Issue**: SQLite3 migrations fail with quoted db_table
**MAGIS Resolution**:
1. **Repository Custodian**: Located 2 candidate files
2. **Manager**: Defined 2 tasks → Recruited Django Database Specialist, Alex Rossini
3. **Kick-off Meeting**: Determined execution sequence (Database Specialist first)
4. **Developer I**: Modified code, QA approved immediately
5. **Developer II**: Three attempts, QA feedback on first two, final version approved
6. **Result**: Both changes merged → All tests pass
**Comparison with Human Solution** (Figure 15 vs Figure 16):
- Human: Modified 4 hunks across 2 files
- MAGIS: Modified only 1 file (simpler solution)
- **Both pass all tests** - MAGIS found more elegant solution
### Case 2: scikit-learn Issue #9784 (Figures 11 → 10, p.20)
**Issue**: KMeans gives different results for n_jobs=1 vs n_jobs>1
**QA Engineer Value Demonstration**:
**First Attempt** (Figure 11):
```python
# Developer's initial code (Line 371)
random_state=random_state # WRONG - not using seeds array
```
**QA Engineer Feedback**:
> "This code change modifies the implementation of K-means algorithm and doesn't seem entirely correct. Running the algorithm just one time could lead to worse results, compared to running it multiple times (n_init times) and choosing the best result"
**Final Version** (Figure 10):
```python
# Developer's corrected code (Line 377)
random_state=seed # CORRECT - uses seed from iteration
```
**Result**: All tests pass after QA-guided revision
**Quote** (Case Study Section H, p.22): "With the help of the QA Engineer, the Developer further revise the code, and the final code change is shown in Fig. 10"
### Key Insights from Cases
1. **MAGIS can find simpler solutions** than human developers (Django case)
2. **QA Engineer prevents subtle bugs** (scikit-learn case)
3. **Kick-off meetings coordinate** multi-developer tasks effectively
4. **Memory mechanism scales** to large repositories
## Statistics on Generated Code Changes (Appendix F)
### Resolved Issues (Table 5, p.21)
**Complexity Comparison** (MAGIS vs Human Reference):
| Metric | MAGIS Avg | Gold Avg | Difference |
|--------|-----------|----------|------------|
| # Files | 1.02 | 1.04 | −0.02 |
| # Functions | 1.02 | 1.04 | −0.02 |
| # Hunks | 1.45 | 1.66 | −0.21 |
| # Added LoC | 9.75 | 4.34 | +5.41 |
| # Deleted LoC | 5.27 | 5.16 | +0.11 |
**Finding**: MAGIS generates **more comments** (explains higher added LoC).
**Figure 10 Example**: Lines 365, 368, 371, 374, 383 contain natural language descriptions of code changes.
**Quote** (p.19): "the generation results provided by our framework often contained more comment information... These natural language descriptions are valuable in actual software evolution [26, 35]"
**Implication**: MAGIS prioritizes **maintainability** through documentation.
### Maximum Capabilities
**Resolved Instances**:
- Max files modified: 2
- Max hunks: 4
- Max total changes: 1,655 lines
- Max single modification: 190 lines
**Applied but Unresolved**:
- Max files: 13
- Max hunks: 28
- Max modification location: Line 7,150
- Max single modification: 9,367 lines
**Implication**: Framework can handle complex, large-scale modifications.
### Distribution Analysis
**Figure 8 (Resolved Instances, p.19)**:
- MAGIS adds more lines than reference (higher median)
- MAGIS deletes similar amount (overlapping distribution)
- Difference primarily from added comments
**Figure 9 (Unresolved Instances, p.19)**:
- MAGIS deletes more, adds less (compared to reference)
- Suggests overly conservative strategy may contribute to test failures
**Quote** (p.19): "for unresolved instances, the framework tends to delete a larger number of lines while adding fewer lines, in contrast to the distribution of human-written changes"
### Repository Variation (Figure 13, p.21)
**Resolved Ratio by Repository**:
- Highest: ~40% (some repositories)
- Lowest: ~0% (others)
- **Large variance** suggests domain-specific challenges
**Implication**: Different code styles, architectures, and complexity affect success rates.
## AIWG Implementation Mapping
MAGIS validates and extends AIWG's multi-agent architecture. Here's how MAGIS concepts map to AIWG:
### Direct Alignments
| MAGIS Concept | AIWG Equivalent | Strength |
|---------------|-----------------|----------|
| **Manager Agent** | Project Manager agent + flow orchestration | **Strong** |
| **Repository Custodian** | Code Intelligence agent + context gathering | **Moderate** |
| **Developer Agents** | Code Writer, Test Engineer, etc. (53 agents) | **Strong** |
| **QA Engineer** | Code Reviewer agent + review flows | **Strong** |
| **Kick-off Meeting** | Agent collaboration in flows | **Moderate** |
| **Multi-step Coding** | Decomposed subtasks in SDLC phases | **Strong** |
| **File-level Tasks** | Use case → implementation mapping | **Strong** |
| **Memory Mechanism** | `.aiwg/` artifact persistence | **Partial** |
### MAGIS Innovations AIWG Can Adopt
#### 1. Memory Mechanism for Repository Evolution
**MAGIS Implementation** (Algorithm 1, p.5):
```
For each file in repository:
If previously analyzed:
summary_previous ← retrieve from memory
diff ← git diff previous current
If len(summary) < len(file):
summary_updated ← summary_previous + LLM(diff)
Else:
summary ← LLM(file)
memory.store(file, version, summary)
```
**AIWG Application**:
```markdown
# Proposed: .aiwg/knowledge/repository-memory.json
{
"src/auth/login.ts": {
"version": "a4f3b2c",
"summary": "Handles user authentication with JWT tokens...",
"last_analyzed": "2026-01-24T10:30:00Z"
},
"src/auth/session.ts": {
"version": "b2e1d9a",
"summary": "Manages user session lifecycle...",
"last_analyzed": "2026-01-24T10:32:00Z",
"diff_from_previous": "Added session timeout configuration"
}
}
```
**Benefits**:
- Reduce LLM queries for unchanged files
- Faster context loading for large repositories
- Incremental understanding as code evolves
**Implementation Location**: `agentic/code/addons/code-intelligence/memory-mechanism/`
#### 2. Line-Level Localization Before Code Generation
**MAGIS Multi-Step Process** (Algorithm 3, p.6):
```
1. Locate: {[start_line, end_line]} ← LLM(file, task, P9)
2. Extract: old_code ← file[start_line:end_line]
3. Generate: new_code ← LLM(file, task, old_code, P10)
4. Replace: file' ← replace(file, old_code, new_code)
5. Review: QA Engineer evaluates change
```
**AIWG Application**:
Current AIWG pattern (implicit):
```markdown
Developer agent receives task → generates full code change
```
**Proposed enhancement**:
```markdown
# In Code Writer agent definition:
## Modification Protocol
When modifying existing code:
1. **Locate**: Identify exact line ranges requiring change
- Use grep/glob to find relevant sections
- Output: "Lines X-Y in file.ts require modification"
2. **Extract**: Read current implementation
- Use Read tool with line numbers
- Understand existing logic and dependencies
3. **Generate**: Create replacement code
- Maintain existing style and patterns
- Add inline comments explaining changes
4. **Verify**: Self-check before submission
- Does change address the requirement?
- Are existing tests still valid?
```
**Benefits**:
- Leverages LLM strength in code generation
- Mitigates weakness in code modification
- Improves accuracy (validated by MAGIS Figure 6 correlation)
**Implementation**: Update agents in `agentic/code/frameworks/sdlc-complete/agents/code-writer.md`
#### 3. Formalized Kick-off Meetings
**MAGIS Pattern** (Section 3.2.1, p.6 + Figure 7, p.17):
```
Manager opens → States issue, tasks, expected collaboration
Developer 1 speaks → Identifies dependencies, suggests sequence
Developer 2 speaks → Confirms understanding, notes potential conflicts
Developer N speaks → ...
Manager summarizes → Generates executable work plan
```
**AIWG Application**:
Current: Flow commands coordinate agents sequentially
Proposed: Add explicit planning phase
```markdown
# New skill: .claude/skills/planning-meeting.md
# Planning Meeting Skill
## Purpose
Coordinate multiple agents before execution to identify dependencies,
resolve conflicts, and optimize execution order.
## Process
1. **Convene**: Gather all agents assigned to the workflow
2. **Present**: Manager agent describes overall goal and individual tasks
3. **Discuss**: Each agent identifies:
- Prerequisites for their task
- Outputs they produce for other agents
- Potential conflicts with other tasks
4. **Sequence**: Determine execution order (sequential vs parallel)
5. **Commit**: Generate executable plan with dependencies
## Outputs
- `.aiwg/working/planning-meeting-notes.md`
- `.aiwg/working/execution-plan.json`
## Example
```json
{
"workflow": "implement-auth-feature",
"agents": [
{
"name": "database-designer",
"task": "Design user schema",
"dependencies": [],
"outputs_for": ["api-designer", "test-engineer"]
},
{
"name": "api-designer",
"task": "Define authentication endpoints",
"dependencies": ["database-designer"],
"outputs_for": ["code-writer", "test-engineer"]
}
],
"execution_sequence": [
{"parallel": false, "agents": ["database-designer"]},
{"parallel": false, "agents": ["api-designer"]},
{"parallel": true, "agents": ["code-writer", "test-engineer"]}
]
}
```
```
**Benefits**:
- Reduces conflicts between parallel agents
- Optimizes execution order
- Documents decision-making process
**Implementation**: `agentic/code/addons/collaboration/planning-meetings/`
#### 4. Dedicated QA Engineer per Developer
**MAGIS Pattern** (Section 3.1 + Algorithm 3):
```
For each Developer agent:
qa_engineer ← LLM(developer_task, P8) # Generate specialized QA role
Loop:
code_change ← Developer.execute(task)
review ← qa_engineer.review(code_change, task)
If review.decision == "approve":
break
Else:
task ← task + review.feedback
Continue (max N iterations)
```
**AIWG Current Pattern**:
- Code Reviewer agent operates on completed work
- Review happens after implementation complete
**AIWG Enhancement**:
```markdown
# Proposed: Pair each agent with specialized reviewer
## In flow commands:
```yaml
agents:
- role: code-writer
task: "Implement authentication"
paired_reviewer:
role: security-focused-code-reviewer
context: "authentication implementation"
max_iterations: 3
- role: test-engineer
task: "Write integration tests"
paired_reviewer:
role: test-coverage-reviewer
context: "authentication tests"
max_iterations: 2
```
**Benefits**:
- Immediate, task-specific feedback
- Catches errors early (before merging)
- Reduces rework in later phases
**Implementation**: Extend flow command syntax, add iteration logic to orchestrator
### MAGIS Empirical Findings Applied to AIWG
#### Finding 1: File Locating Precision Matters
**MAGIS Evidence** (p.2-3): Claude-2 performance decreased from 1.96% → 1.22% as recall increased from 29.58% → 51.06%.
**AIWG Implication**: Code Intelligence agent should prioritize **relevant files** over **all files**.
**Current AIWG**: Uses grep/glob to find potentially relevant code
**Proposed Enhancement**:
```markdown
# In Code Intelligence agent
## File Relevance Scoring
When locating files for a task:
1. **Initial candidates**: Use grep/glob for broad search
2. **Summarize**: For each file, generate 2-3 sentence summary
3. **Score relevance**: Rate 1-5 how relevant to current task
4. **Filter**: Only include files with score ≥4
5. **Minimize**: If >5 files, prioritize highest scores
This prevents context overload while maintaining high recall.
```
#### Finding 2: Line Locating Strongly Predicts Success
**MAGIS Evidence** (Figure 6, p.9): Resolved ratio increases sharply with line coverage ratio, especially in 0.6-1.0 range.
**AIWG Implication**: Agents should **explicitly identify target lines** before generating code.
**Proposed Workflow**:
```markdown
# Code Writer agent modification protocol
## Step 1: Locate Target Lines
Use grep with context to identify modification points:
```bash
grep -n "function authenticate" src/auth.ts
# Output: Line 45: export function authenticate(credentials: Credentials)
```
## Step 2: Read Context
```bash
# Read lines 40-60 for context
```
## Step 3: State Intent
"I will modify lines 48-52 in src/auth.ts to add session timeout validation"
## Step 4: Generate Replacement
[Generate new code for lines 48-52]
## Step 5: Verify
Does the change address the requirement? Are line numbers correct?
```
#### Finding 3: Complexity Decomposition Reduces Negative Impact
**MAGIS Evidence** (Table 3, p.9): GPT-4 correlation with # files: −25.15; MAGIS: −1.55 (94% reduction).
**AIWG Implication**: Multi-file changes should be **decomposed into file-level tasks**, each handled by specialized agent.
**Current AIWG**: Single Code Writer may handle multi-file changes
**Proposed Enhancement**:
```markdown
# In Project Manager agent
## Multi-File Change Decomposition
When a requirement affects multiple files:
1. **Identify files**: List all files requiring modification
2. **Define tasks**: Create one file-level task per file
- Task 1: "Update user model in src/models/user.ts"
- Task 2: "Update auth service in src/services/auth.ts"
- Task 3: "Update API routes in src/routes/auth.ts"
3. **Assign specialists**: Create/assign agent per task
4. **Coordinate**: Use planning meeting to resolve dependencies
5. **Integrate**: Merge changes after individual completion
This mirrors MAGIS's Manager → multiple Developers pattern.
```
### Integration Opportunities
#### Short-Term (Immediate AIWG Enhancements)
1. **Add memory mechanism** to Code Intelligence agent
- Location: `agentic/code/addons/code-intelligence/`
- Implementation: JSON storage in `.aiwg/knowledge/repository-memory.json`
- Benefit: Faster context loading, reduced LLM queries
2. **Formalize multi-step modification protocol** in Code Writer agent
- Update: `agentic/code/frameworks/sdlc-complete/agents/code-writer.md`
- Add steps: Locate → Extract → Generate → Verify
- Benefit: Improved accuracy (validates MAGIS empirical findings)
3. **Enhance file locating precision** in Code Intelligence
- Add relevance scoring step
- Filter to top-N most relevant files
- Benefit: Avoid context overload (MAGIS Finding 1)
#### Medium-Term (Flow Command Extensions)
4. **Implement planning meetings** for multi-agent workflows
- New skill: `planning-meeting.md`
- Generates execution plan with dependencies
- Benefit: Optimize sequential vs parallel execution
5. **Add paired reviewer pattern** to flow commands
- Syntax: `paired_reviewer:` field in agent definitions
- Iteration logic with max attempts
- Benefit: Earlier error detection (MAGIS QA Engineer pattern)
6. **Decompose multi-file changes** in Project Manager logic
- Detect multi-file requirements
- Generate file-level subtasks
- Assign specialized agents per file
- Benefit: 94% reduction in complexity negative impact (MAGIS Table 3)
#### Long-Term (Framework Evolution)
7. **Incremental repository understanding**
- Persistent memory across sessions
- Git-based change tracking
- Diff-based summary updates
- Benefit: Scale to large, evolving codebases
8. **Dynamic agent generation**
- Manager creates specialized agents on-demand (MAGIS pattern)
- Currently: Fixed catalog of 53 agents
- Future: Generate bespoke agents per unique task
- Benefit: Greater flexibility for novel requirements
## Key Quotes
### On LLM Limitations at Repository Level
> "LLMs exhibit limitations in processing excessively long context inputs and are subject to constraints regarding their input context length. This limitation is particularly evident in repository-level coding tasks, such as solving GitHub issues, where the context comprises the entire repository" (p.2)
### On Locating Files Strategically
> "optimizing the performance of LLMs can be better achieved by striving for higher recall scores with a minimized set of files, thus suggesting a strategic balance between recall optimization and the number of chosen files" (p.3)
### On Line Locating as Key Factor
> "with a coefficient, 0.5997, on Claude-2, there is a substantial and positive relation between improvements in the coverage ratio and the probability of successfully resolving issues, which demonstrates that locating lines is a key factor for GitHub issue resolution" (p.3)
### On Manager Agent Flexibility
> "This setup improves team flexibility and adaptability, enabling the formation of teams that can meet various issues efficiently" (p.4)
### On Repository Custodian Memory Mechanism
> "Considering that applying the code change often modifies a specific part of the file rather than the entire file, we propose a memory mechanism to reuse the previously queried information" (p.5)
### On QA Engineer Necessity
> "To address this problem, our framework pairs each Developer agent with a QA Engineer agent, designed to offer task-specific, timely feedback. This personalized QA approach aims to boost the review process thereby better ensuring the software quality" (p.5)
### On Multi-Step Coding Process
> "we transform the code change generation into the multi-step coding process that is designed to leverage the strengths of LLMs in code generation while mitigating their weaknesses in code change generation" (p.6)
### On Kick-off Meeting Value
> "The meeting makes collaboration among Developers more efficient and avoids potential conflicts" (p.6)
### On Performance Gains
> "our framework's effectiveness is eight-fold that of the base LLM, GPT-4. This substantial increase underscores our framework's capability to harness the potential of LLMs more effectively" (p.7)
### On Task Description Quality
> "when the generated task description closely aligns with the reference, there is a higher possibility of resolving the issue" (p.8)
### On Line Locating Priority
> "the Developer agent should prioritize improving its capability of locating code lines" (p.9)
### On Generated Code Comments
> "the generation results provided by our framework often contained more comment information... These natural language descriptions are valuable in actual software evolution" (p.19)
## Related Work Context
### Multi-Agent Systems for Code Generation
**MetaGPT** (Hong et al., 2023): Simulates programming team SOPs, achieves leading scores on HumanEval/MBPP but focuses on **code repository establishment** (0 → complete), not evolution.
**ChatDev** (Qian et al., 2023): Virtual development company, decomposes requirements into atomic tasks. Completes small projects (<5 files average) in <7 minutes but doesn't address **software evolution**.
**MAGIS Distinction**: Focuses on **existing repository modification** - different challenge requiring file/line locating, complexity management, and existing code understanding.
### Automatic Program Repair (APR)
**Bug Localization**: DreamLoc (Qi et al., 2022) - deep relevance matching for bug locating
**Repair Methods**:
- VarFix (Wong et al., 2021) - retrieval-based
- ITER (Ye & Monperrus, 2024) - iterative neural repair
- RAP-GEN (Wang et al., 2023) - retrieval-augmented with CodeT5
**LLM-based APR**:
- Xia et al. (2023): Direct LLM application outperforms existing APR
- RepairAgent (Bouzenia et al., 2024): Autonomous LLM agent with dynamic tool interaction
**MAGIS Distinction**: Addresses **all GitHub issue types** (bugs, features, enhancements), not just bug fixing. Handles multi-file changes and complex requirements beyond single-bug repairs.
### Contemporary Work (Post-MAGIS)
**AutoCodeRover** (Zhang et al., 2024): 16.11% on SWE-bench lite (22.33% union over 3 runs)
**SWE-Agent** (Yang et al., 2024): 18.00% on SWE-bench lite
**Devin** (Cognition Labs, 2024): 12.86% on overlapping 140 instances, but has internet access + browser
**MAGIS Position**: Highest resolved ratio (25.33% on SWE-bench lite), fastest execution (~3 min/issue), open methodology.
## Limitations
### Acknowledged by Authors (Appendix K)
1. **Prompt Design Bias** (p.25)
- Prompt engineering affects LLM performance
- Template design follows guidelines but can't eliminate bias
- Dataset instance biases and API limitations compound issue
2. **Dataset Scope** (p.25)
- 12 Python repositories in SWE-bench
- May not generalize to specialized domains (microservices, functional programming)
- Code style and architecture variability not fully represented
**Quote** (p.25): "applying the findings of this paper to other code repositories may require further validation"
### Additional Considerations
3. **Language Specificity**: Only Python repositories tested
- JavaScript/TypeScript, Java, Go, Rust not validated
- Dynamic vs static typing may affect results
4. **Oracle File Locating**: Experiments assume correct files provided
- Real-world: File locating accuracy impacts overall performance
- Repository Custodian effectiveness critical but less validated
5. **Base Model Dependency**: Results tied to GPT-4 capabilities
- Future models may change relative performance
- Framework architecture should transfer, but absolute numbers may shift
6. **Context Length**: Still bounded by LLM context limits
- Memory mechanism helps but doesn't eliminate constraint
- Very large files (>10K lines) may challenge approach
## Benchmark Details
### SWE-bench Overview
**Source**: Jimenez et al. (2024) - "SWE-bench: Can language models resolve real-world GitHub issues?"
**Composition**:
- 2,294 GitHub issues from 12 Python repositories
- Real software evolution requirements (not synthetic)
- Each instance includes:
- Issue description
- Repository state at issue time
- Reference code change (human solution)
- Test suite (existing + new tests for requirement)
**Repositories** (example):
- django/django (web framework)
- scikit-learn/scikit-learn (machine learning)
- matplotlib/matplotlib (visualization)
- pandas-dev/pandas (data analysis)
- sympy/sympy (symbolic mathematics)
**Challenge Types**:
- Bug fixes (~60%)
- Feature additions (~25%)
- Performance enhancements (~10%)
- Refactoring (~5%)
**Evaluation**:
1. **Applied**: Can code change be `git apply`'d without conflicts?
2. **Resolved**: Does applied change pass all tests (Told ∩ Tnew)?
### SWE-bench Lite
**Purpose**: Canonical 300-instance subset for faster evaluation (recommended by authors)
**Selection Criteria**:
- Representative difficulty distribution
- Balanced across repositories
- Validated to correlate with full dataset results
**MAGIS Results**:
- 25.33% resolved on lite (vs 13.94% on 25% subset)
- Higher performance on curated subset expected
## Technical Implementation Details
### Prompts and Configuration
Paper mentions 11 distinct prompts (P1-P11) but doesn't publish full text:
| Prompt | Purpose | Algorithm Location |
|--------|---------|-------------------|
| P1 | Summarize code diff as commit message | Algorithm 1, line 13 |
| P2 | Compress file into summary | Algorithm 1, line 17 |
| P3 | Determine file relevance to issue | Algorithm 1, line 20 |
| P4 | Define file-level task | Algorithm 2, line 5 |
| P5 | Design Developer role | Algorithm 2, line 7 |
| P6 | Refine roles after meeting | Algorithm 2, line 11 |
| P7 | Generate executable work plan | Algorithm 2, line 12 |
| P8 | Design QA Engineer role | Algorithm 3, line 5 |
| P9 | Locate line ranges | Algorithm 3, line 10 |
| P10 | Generate replacement code | Algorithm 3, line 12 |
| P11 | Review code change | Algorithm 3, line 15-16 |
**Configuration** (not detailed in paper):
- Max iterations (nmax): Likely 3-5 based on case studies
- BM25 top-k: Not specified (likely 5-10 based on Figure 3)
- Context limits: Managed via memory mechanism
### Execution Environment
**Not specified in paper**:
- Docker/Kubernetes deployment?
- Parallel vs sequential agent execution?
- State management for long-running workflows?
- Error recovery strategies?
**Implied from case studies**:
- Sequential Developer execution (based on kick-off meeting output)
- Iterative QA review loop (max N attempts)
- Git-based repository management
## Future Research Directions
### Identified by Authors
1. **Cross-Language Generalization**: Validate on JavaScript, Java, Go, Rust repositories
2. **Specialized Domain Support**: Microservices, functional programming paradigms
3. **Larger Context Handling**: Improvements as LLM context windows expand
4. **Autonomous File Locating**: Remove oracle assumption, improve Repository Custodian
### Implied by Results
5. **Unresolved Instance Analysis**: Why do 86% still fail? Common failure patterns?
6. **Repository-Specific Adaptation**: Address 0-40% variance across repositories (Figure 13)
7. **Complex Change Strategies**: Improve handling of 10+ file, 20+ hunk modifications
8. **Comment Generation Policy**: Balance between documentation and implementation
### AIWG Research Opportunities
9. **Memory Mechanism Generalization**: Apply to non-code artifacts (docs, configs, tests)
10. **Planning Meeting Optimization**: When is kick-off valuable vs overhead?
11. **QA Engineer Specialization**: Task-specific vs general reviewers - performance tradeoff?
12. **Multi-Model Consensus**: Does heterogeneous LLM ensemble improve results (BP-6 from REF-001)?
## Comparative Framework Analysis
### MAGIS vs ChatDev vs MetaGPT
| Aspect | MAGIS | ChatDev | MetaGPT |
|--------|-------|---------|---------|
| **Primary Task** | Repository evolution | Project establishment | Project establishment |
| **Input** | GitHub issue + existing repo | Requirements | Requirements |
| **Output** | Code change (patch) | Complete codebase | Complete codebase |
| **Agent Roles** | 4 types (Manager, Custodian, Developer, QA) | 7 types (CEO, CTO, Programmer, etc.) | 5 types (PM, Architect, Engineer, etc.) |
| **Team Formation** | Dynamic (per issue) | Fixed team | Fixed team |
| **Key Innovation** | Memory mechanism, line-level locating | Self-reflection, mutual communication | SOPs, structured outputs |
| **Benchmark** | SWE-bench (13.94%) | Small projects (<5 files, <7 min) | HumanEval (leading scores) |
| **Limitation** | 86% still unresolved | Doesn't handle evolution | Doesn't handle evolution |
**Complementarity**: MAGIS extends ChatDev/MetaGPT from establishment → maintenance.
### MAGIS vs Traditional APR
| Aspect | MAGIS | Traditional APR |
|--------|-------|-----------------|
| **Scope** | All issue types | Bug fixing only |
| **Approach** | Multi-agent collaboration | Fault localization + repair |
| **Context** | Repository-level | File/function-level |
| **Human Input** | Issue description | Bug report |
| **Validation** | Test suite (old + new) | Test suite (old only) |
| **Performance** | 13.94% on SWE-bench | <10% on typical benchmarks |
**Advantage**: MAGIS handles feature additions, enhancements, refactoring - not just bugs.
## References
### Primary Source
- Tao, W., Zhou, Y., Wang, Y., Zhang, W., Zhang, H., & Cheng, Y. (2024). [MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue ReSolution](https://arxiv.org/abs/2403.17927). arXiv:2403.17927v2 [cs.SE]
### Cited Benchmarks
- **SWE-bench**: Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). [SWE-bench: Can language models resolve real-world GitHub issues?](https://openreview.net/forum?id=VTF8yNQM66). ICLR 2024.
- **SWE-bench Lite**: [Canonical 300-instance subset](https://www.swebench.com/lite.html)
- **HumanEval**: Chen, M. et al. (2021). [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374). arXiv:2107.03374
- **MBPP**: Austin, J. et al. (2021). [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732). arXiv:2108.07732
### Related Multi-Agent Systems
- **MetaGPT**: Hong, S. et al. (2023). [MetaGPT: Meta Programming for Multi-Agent Collaborative Framework](https://arxiv.org/abs/2308.00352). arXiv:2308.00352
- **ChatDev**: Qian, C. et al. (2023). Communicative Agents for Software Development. arXiv preprint
- **AutoGen**: Wu, Q. et al. (2023). [AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https://arxiv.org/abs/2308.08155). arXiv:2308.08155
### Contemporary Work
- **AutoCodeRover**: Zhang, Y. et al. (2024). [AutoCodeRover: Autonomous Program Improvement](https://arxiv.org/abs/2404.05427). arXiv:2404.05427
- **SWE-Agent**: Yang, J. et al. (2024). SWE-Agent: Agent Computer Interfaces Enable Software Engineering Language Models
- **Devin**: Cognition Labs (2024). [SWE-bench Technical Report](https://www.cognition-labs.com/post/swe-bench-technical-report)
- **RepairAgent**: Bouzenia, I., Devanbu, P.T., & Pradel, M. (2024). [RepairAgent: An Autonomous, LLM-Based Agent for Program Repair](https://arxiv.org/abs/2403.17134). arXiv:2403.17134
### APR Background
- **DreamLoc**: Qi, B. et al. (2022). [DreamLoc: A Deep Relevance Matching-Based Framework for Bug Localization](https://doi.org/10.1109/TR.2021.3104728). IEEE Trans. Reliab., 71(1):235-249
- **ITER**: Ye, H. & Monperrus, M. (2024). [ITER: Iterative Neural Repair for Multi-Location Patches](https://doi.org/10.1145/3597503.3623337). ICSE 2024
### AIWG Documentation
- **SDLC Framework**: `agentic/code/frameworks/sdlc-complete/README.md`
- **Multi-Agent Pattern**: `docs/multi-agent-documentation-pattern.md`
- **Agent Catalog**: `agentic/code/frameworks/sdlc-complete/agents/`
## Appendices Summary
**Appendix A (p.16)**: Coverage ratio formula details, observation explanations
**Appendix B (p.17)**: Full kick-off meeting transcript (Figure 7) - Django issue #30664
**Appendix C (p.16)**: Applied and resolved ratio metric definitions
**Appendix D (p.18)**: SWE-bench lite comparison with AutoCodeRover, SWE-Agent
**Appendix E (p.18)**: Devin comparison - 140 overlapping instances, speed analysis
**Appendix F (p.19-21)**: Statistics on generated code changes, distribution analysis
**Appendix G (p.21)**: Task description evaluation criteria (GPT-4 scoring rubric)
**Appendix H (p.22)**: Django case study - detailed workflow walkthrough
**Appendix I (p.22)**: QA Engineer effectiveness - scikit-learn case study
**Appendix J (p.22-25)**: Extended related work - LLMs, multi-agent systems, APR
**Appendix K (p.25)**: Limitations - prompt bias, dataset scope
## Revision History
| Date | Author