# Design Decisions
## Overview
This document captures key design decisions made during the development of the GitHub PR Automation MCP server, along with the rationale behind each choice.
## 1. PR Identifier Format: Single String vs. Separate Fields
**Decision**: Use a single string format `"owner/repo#123"` as the primary interface
**Alternatives Considered**:
- Separate fields: `{owner: string, repo: string, number: number}`
- URL only: `"https://github.com/owner/repo/pull/123"`
**Rationale**:
1. **AI Efficiency**: Single field reduces cognitive load for AI agents
2. **Human-Friendly**: Matches natural GitHub format (copy/paste from URLs)
3. **Flexibility**: Can support multiple formats with a parser:
- `owner/repo#123` (concise)
- `owner/repo/pulls/123` (explicit)
- Full GitHub URLs
4. **Implementation**: Simple regex parsing handles all cases
**Implementation Note**:
```typescript
// Parser supports all formats:
parsePRIdentifier("owner/repo#123")
parsePRIdentifier("owner/repo/pulls/123")
parsePRIdentifier("https://github.com/owner/repo/pull/123")
// All return: {owner: "owner", repo: "repo", number: 123}
```
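A minimal sketch of such a parser (the regex details here are illustrative, not the exact implementation):
```typescript
// Illustrative parser for the three supported formats; validation details
// in the real implementation may differ.
interface PRIdentifier {
  owner: string;
  repo: string;
  number: number;
}

function parsePRIdentifier(input: string): PRIdentifier {
  const patterns = [
    /^([^/\s]+)\/([^/\s#]+)#(\d+)$/,                          // owner/repo#123
    /^([^/\s]+)\/([^/\s]+)\/pulls\/(\d+)$/,                   // owner/repo/pulls/123
    /^https:\/\/github\.com\/([^/]+)\/([^/]+)\/pull\/(\d+)$/, // full GitHub URL
  ];
  for (const pattern of patterns) {
    const match = input.match(pattern);
    if (match) {
      return { owner: match[1], repo: match[2], number: Number(match[3]) };
    }
  }
  throw new Error(`Invalid PR identifier: "${input}". Example: "owner/repo#123"`);
}
```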
**Trade-offs**:
- ✅ More concise API
- ✅ Easier for humans and AI
- ❌ Requires parsing (minimal cost)
- ❌ Slight potential for edge cases (mitigated by validation)
## 2. Tool Responsibility: Data Provider vs. Analyzer
**Decision**: Tools provide **raw, structured data**; LLMs perform **analysis and categorization**
**Alternatives Considered**:
- Tools do full analysis (categorization, severity, response generation)
- Hybrid approach (tools suggest, LLM decides)
**Rationale**:
### Why Raw Data Approach?
1. **Separation of Concerns**
- Tool: Fetch and structure data efficiently
- LLM: Understand context and semantics
2. **LLM Strengths**
- Better at understanding nuance ("This must be fixed" vs. "This could be better")
- Can consider PR context (e.g., "breaking change" means different things in different contexts)
- Adapts to project-specific terminology
3. **Tool Simplicity**
- No need to maintain complex categorization rules
- Easier to test (deterministic data fetching)
- Less prone to false positives/negatives
4. **Flexibility**
- Different LLMs might categorize differently
- Users can customize LLM prompts without changing tools
- Works across different project cultures
### Example: `find_unresolved_comments`
**Tool Provides**:
```typescript
{
body: "This could lead to SQL injection",
author: "senior-dev",
author_association: "MEMBER",
reactions: {"+1": 2, "eyes": 1},
file_path: "src/db/users.ts"
}
```
**LLM Determines**:
- Category: "blocking" (security concern)
- Severity: "high" (SQL injection is critical)
- Priority: Address immediately
- Response: Generates appropriate fix
**If Tool Did This**:
```typescript
{
body: "This could lead to SQL injection",
category: "blocking", // ❌ Pattern matching might miss context
severity: "high", // ❌ Might not understand project risk tolerance
// What if the comment says "This wouldn't lead to SQL injection because..."?
}
```
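For reference, the raw payload shown under **Tool Provides** might be typed roughly as follows (field names mirror the example; the exact schema is an assumption):
```typescript
// Hypothetical shape of each item returned by find_unresolved_comments;
// mirrors the example above, not a confirmed schema.
interface UnresolvedComment {
  body: string;                      // raw comment text
  author: string;                    // GitHub login
  author_association: string;        // e.g., "MEMBER", "CONTRIBUTOR"
  reactions: Record<string, number>; // e.g., {"+1": 2, "eyes": 1}
  file_path: string;                 // file the review comment targets
}
```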
**Trade-offs**:
- ✅ More accurate categorization (LLM understands context)
- ✅ Simpler tool implementation
- ✅ Easier to maintain and test
- ✅ Flexible across projects
- ❌ LLM must do more work (but this is what LLMs are good at)
## 3. Bot Filtering: Default Include vs. Exclude
**Decision**: **Include bots by default** (`include_bots: true`)
**Alternatives Considered**:
- Exclude bots by default
- Prescriptive bot filtering (hardcoded bot list)
**Rationale**:
1. **Neutral Default**
- Don't make assumptions about what users want to see
- Bot comments (e.g., from CodeRabbit) can be valuable
2. **User Control**
- LLM can filter based on content, not just username
- `exclude_authors` parameter for explicit filtering
- LLM can categorize as "nit" and deprioritize
3. **Context-Dependent Value**
- In some contexts, bot nits are useful
- In others, they're noise
- LLM can decide based on prompt
**Example Usage**:
```typescript
// Default: include everything
find_unresolved_comments({pr: "owner/repo#123"})
// Explicit exclusion if desired
find_unresolved_comments({
pr: "owner/repo#123",
exclude_authors: ["coderabbitai", "github-actions"]
})
// Or let the LLM filter via the prompt:
//   "Show me unresolved comments, but ignore trivial style nits"
// LLM sees all comments, filters based on content
```
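The `exclude_authors` path reduces to simple set membership; a sketch (the helper name is hypothetical):
```typescript
// Hypothetical helper: drop comments whose author login is explicitly excluded.
function filterByAuthor<T extends { author: string }>(
  comments: T[],
  excludeAuthors: string[] = []
): T[] {
  const excluded = new Set(excludeAuthors);
  return comments.filter((comment) => !excluded.has(comment.author));
}
```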
**Trade-offs**:
- ✅ More data for LLM to work with
- ✅ Flexible filtering
- ✅ No hidden assumptions
- ⚠️ More tokens consumed (but LLM can skim quickly)
## 4. Pagination: Mandatory vs. Optional
**Decision**: **Mandatory pagination** with sensible defaults
**Alternatives Considered**:
- Optional pagination (return all by default)
- No pagination (always return everything)
**Rationale**:
1. **Token Efficiency**
- Large PRs can have 100+ comments or 50+ test failures
- Returning everything wastes tokens and time
2. **Progressive Disclosure**
- Users typically want to see the "first 20 comments", not "all 500"
- Can always request more pages
3. **Predictable Performance**
- Bounded response sizes
- Consistent latency
**Implementation**:
```typescript
// Defaults chosen based on typical usage; page_size varies by tool:
const DEFAULT_PAGE_SIZES = {
  test_failures: 10, // few items, detailed
  comments: 20,      // more items, scannable
  commands: 5,       // step-by-step
};
// Requests default to page 1 with the tool's page_size

// Always include pagination metadata
{
  pagination: {
    page: 1,
    total_pages: 5,
    has_next: true,
    has_previous: false
  }
}
```
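The pagination logic itself is straightforward; a sketch assuming results are already in memory (helper and type names are illustrative):
```typescript
// Illustrative in-memory pagination; the real implementation may paginate
// at the API level instead.
interface Pagination {
  page: number;
  total_pages: number;
  has_next: boolean;
  has_previous: boolean;
}

function paginate<T>(
  items: T[],
  page = 1,
  pageSize = 20
): { items: T[]; pagination: Pagination } {
  const totalPages = Math.max(1, Math.ceil(items.length / pageSize));
  const current = Math.min(Math.max(1, page), totalPages); // clamp to valid range
  const start = (current - 1) * pageSize;
  return {
    items: items.slice(start, start + pageSize),
    pagination: {
      page: current,
      total_pages: totalPages,
      has_next: current < totalPages,
      has_previous: current > 1,
    },
  };
}
```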
**Trade-offs**:
- ✅ Predictable token usage
- ✅ Fast responses
- ✅ User can control detail level
- ❌ Might need multiple calls for the full picture
- ❌ Slightly more complex API
## 5. Wait vs. Immediate: Polling Strategy
**Decision**: Support both modes with **`wait=false` as default**
**Rationale**:
1. **Immediate Mode (default)**
- Fast response (< 2s)
- Non-blocking
- User can poll externally if needed
2. **Wait Mode (opt-in)**
- Convenience for fix loops
- `bail_on_first` for rapid feedback
- Timeout protection (30 min max)
**Use Cases**:
**Immediate Mode**:
```typescript
// Quick status check
const status = await get_failing_tests({
pr: "owner/repo#123",
wait: false // or omit (default)
});
if (status.status === "running") {
// Come back later
}
```
**Wait Mode with Bail**:
```typescript
// Fast feedback loop
const result = await get_failing_tests({
pr: "owner/repo#123",
wait: true,
bail_on_first: true // Return ASAP when first failure detected
});
// Fix first failure, push, repeat
```
**Wait Mode Complete**:
```typescript
// Get all failures at once
const result = await get_failing_tests({
pr: "owner/repo#123",
wait: true,
bail_on_first: false // Wait for all tests
});
// Batch fix all failures
```
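Internally, wait mode amounts to a bounded polling loop; a sketch, assuming a hypothetical `checkStatus` helper that queries CI state:
```typescript
// Illustrative wait-mode loop; the interval and checkStatus signature are
// assumptions, but the 30-minute ceiling matches the rationale above.
async function waitForChecks(
  checkStatus: () => Promise<{ done: boolean; failures: string[] }>,
  { bailOnFirst = false, timeoutMs = 30 * 60 * 1000, intervalMs = 15_000 } = {}
): Promise<string[]> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const { done, failures } = await checkStatus();
    if (done) return failures;                              // all checks finished
    if (bailOnFirst && failures.length > 0) return failures; // return ASAP
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for CI checks (30 min max)");
}
```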
**Trade-offs**:
- ✅ Flexible for different workflows
- ✅ Fast by default
- ✅ Convenience when needed
- ❌ More complex implementation (polling logic)
## 6. Error Handling: Fail Fast vs. Graceful Degradation
**Decision**: **Fail fast for user errors**, **graceful degradation for API issues**
**Categories**:
1. **User Errors** (fail fast):
- Invalid PR identifier → Clear error message with example
- Missing token → "Set GITHUB_TOKEN environment variable"
- Invalid parameters → Zod validation with specific field errors
2. **API Errors** (retry with backoff):
- Rate limiting → Wait and retry automatically
- Temporary unavailability → Exponential backoff
- Timeouts → Retry 3 times
3. **Logical Errors** (explain and suggest):
- PRs not stacked → "PR #124's base is 'main', not PR #123's head 'feature-x'"
- No CI configured → "No CI checks found. Add GitHub Actions workflow?"
- Empty results → "PR has no comments" (not an error)
**Example Error Response**:
```typescript
{
error: "PR not found: owner/repo#123",
category: "user",
suggestion: "Check that PR number is correct. Example: 'owner/repo#456'",
documentation_url: "https://github.com/..."
}
```
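For category 2, the retry behavior might look like the following sketch (the delay constants are illustrative; the three-attempt limit matches the rationale above):
```typescript
// Illustrative exponential backoff for transient API errors; a real
// implementation would also distinguish retryable from fatal errors.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1_000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;          // give up after the last attempt
      const delay = baseDelayMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```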
**Trade-offs**:
- ✅ Clear feedback for user mistakes
- ✅ Resilient to transient failures
- ✅ Actionable error messages
- ❌ More code complexity (error categorization)
## 7. Instruction Generation: Prescriptive vs. Descriptive
**Decision**: **Generate prescriptive, actionable instructions**
**Rationale**:
LLMs benefit from:
1. **Concrete next steps** rather than general observations
2. **Commands they can execute** rather than descriptions
3. **Prioritization** rather than flat lists
**Example: `get_failing_tests`**
**Instead of**:
```typescript
{
message: "Tests failed in authentication module"
}
```
**We provide**:
```typescript
{
instructions: {
summary: "3 tests failed in authentication module",
priority: [
{
test: "test_login",
reason: "Blocking 5 downstream tests",
suggested_fix: "Check JWT token validation - returns 500 instead of 401"
}
],
commands: [
"pytest tests/auth/test_login.py -v",
"pytest tests/auth/test_login.py::test_invalid_token -v"
]
}
}
```
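The instructions payload could be typed along these lines (field names follow the example; the exact schema is an assumption):
```typescript
// Hypothetical type for the instructions block above.
interface Instructions {
  summary: string;
  priority: Array<{
    test: string;
    reason: string;
    suggested_fix: string;
  }>;
  commands: string[]; // commands runnable locally to reproduce failures
}
```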
**Trade-offs**:
- ✅ Immediately actionable
- ✅ Clear priorities
- ✅ Reproducible locally
- ⚠️ Requires understanding test frameworks (but we do this anyway for parsing)
## 8. State Management: Stateless vs. Stateful
**Decision**: **Fully stateless** server
**Rationale**:
1. **Simplicity**
- No session management
- No cleanup required
- Easy to reason about
2. **Reliability**
- Server can restart without losing state
- Each request is independent
3. **Scalability**
- Easy to run multiple instances
- No state synchronization
**Implications**:
- Each tool call is independent
- No memory of previous calls
- User/LLM maintains context
- Authentication via token on each request (sketched below)
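A sketch of per-request client construction, assuming the Octokit REST client (this document does not specify the actual client code):
```typescript
import { Octokit } from "@octokit/rest";

// Stateless by construction: each tool call builds its own client from the
// environment token; nothing is cached or shared between requests.
function makeClient(): Octokit {
  const token = process.env.GITHUB_TOKEN;
  if (!token) {
    throw new Error("Set GITHUB_TOKEN environment variable");
  }
  return new Octokit({ auth: token });
}
```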
**Trade-offs**:
- ✅ Simple, reliable, scalable
- ✅ No stale state bugs
- ❌ Can't optimize across calls (but caching helps)
- ❌ No long-running operations tracking (use wait mode instead)
## 9. Response Format: Optimized for AI vs. Human
**Decision**: **Optimize for AI consumption** (but keep human-readable)
**Principles**:
1. **Structured over prose**
```typescript
// ✅ Good (structured)
{status: "failed", failure_count: 3}
// ❌ Bad (requires parsing)
{message: "Status: failed (3 failures)"}
```
2. **Explicit over implicit**
```typescript
// ✅ Good
{has_next: true, total_pages: 5}
// ❌ Bad (must calculate)
{page: 1, total_items: 87, page_size: 20}
```
3. **Actionable over descriptive**
```typescript
// ✅ Good
{action_required: true, commands: ["git rebase ..."]}
// ❌ Bad
{message: "You need to rebase"}
```
**Trade-offs**:
- ✅ LLM can parse easily
- ✅ Consistent structure
- ✅ Type-safe
- ⚠️ More verbose (but that's fine for JSON)
## Summary of Key Principles
1. **Tools provide data, LLMs provide intelligence**
2. **Default to inclusive, let LLM filter**
3. **Paginate always, with sensible defaults**
4. **Fast by default, wait when requested**
5. **Fail fast on user errors, retry on API errors**
6. **Generate actionable instructions**
7. **Stay stateless for simplicity**
8. **Optimize for AI, keep human-readable**
These decisions create a system that:
- Maximizes LLM strengths
- Minimizes token waste
- Provides clear, actionable outputs
- Remains simple and maintainable