# Best Output Selection Rules

**Enforcement Level**: MEDIUM
**Scope**: Ralph loops and iterative refinement
**Research Basis**: REF-015 Self-Refine
**Issue**: #168

## Overview

These rules enforce non-monotonic output selection - tracking the highest quality output across all iterations rather than simply accepting the final iteration result.

## Research Foundation

From REF-015 Self-Refine (Madaan et al., 2023):
- Quality can fluctuate during iterative refinement
- Final iteration is not always the best
- Peak quality often occurs at iteration 2-3, may degrade later
- Selecting from history improves overall output quality

**Example Quality Trajectory**:
```
Iteration 1: 72% quality
Iteration 2: 85% quality  ← PEAK
Iteration 3: 83% quality  (degraded)

Final output: 83% (suboptimal)
Best selection: 85% (iteration 2)
```

## Mandatory Rules

### Rule 1: Preserve All Iteration Outputs

**REQUIRED**: Every iteration's output MUST be preserved until loop completion.

```yaml
iteration_history:
  - iteration: 1
    artifacts:
      - path: ".aiwg/working/iteration-1/output.md"
        content_hash: "abc123"
    quality_score: 0.72
    timestamp: "2026-01-25T10:00:00Z"

  - iteration: 2
    artifacts:
      - path: ".aiwg/working/iteration-2/output.md"
        content_hash: "def456"
    quality_score: 0.85  # Best so far
    timestamp: "2026-01-25T10:05:00Z"

  - iteration: 3
    artifacts:
      - path: ".aiwg/working/iteration-3/output.md"
        content_hash: "ghi789"
    quality_score: 0.83  # Degraded
    timestamp: "2026-01-25T10:10:00Z"
```

### Rule 2: Track Running Best

**REQUIRED**: Maintain a reference to the best iteration throughout the loop.

```yaml
best_tracker:
  current_best:
    iteration: 2
    quality_score: 0.85
    artifacts_path: ".aiwg/working/iteration-2/"

  update_rule: |
    IF new_iteration.quality_score > current_best.quality_score:
      current_best = new_iteration
```

### Rule 3: Select Best, Not Final

**REQUIRED**: On loop completion, select the highest quality output regardless of iteration number.

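Rules 1 through 3 can be sketched together in Python. This is a minimal illustration under stated assumptions, not the Ralph implementation: the `Iteration` dataclass, the `BestTracker` class, and its method names are hypothetical. Only the strict `>` update rule and the example scores come from this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Iteration:
    """One preserved loop iteration (hypothetical structure)."""
    number: int
    quality_score: float  # normalized, 0.0 to 1.0
    artifacts_path: str


@dataclass
class BestTracker:
    """Keeps every iteration (Rule 1) and a running best (Rule 2)."""
    history: List[Iteration] = field(default_factory=list)
    current_best: Optional[Iteration] = None

    def record(self, iteration: Iteration) -> None:
        self.history.append(iteration)
        # Strict ">" means an equal score never displaces an earlier iteration.
        if (self.current_best is None
                or iteration.quality_score > self.current_best.quality_score):
            self.current_best = iteration

    def select(self) -> Iteration:
        """Rule 3: return the best iteration, not the final one."""
        if self.current_best is None:
            raise ValueError("no iterations recorded")
        return self.current_best


tracker = BestTracker()
for number, score in [(1, 0.72), (2, 0.85), (3, 0.83)]:
    tracker.record(Iteration(number, score, f".aiwg/working/iteration-{number}/"))

selected = tracker.select()
print(selected.number, selected.quality_score)  # prints: 2 0.85
```

Because the tracker is updated as each iteration is recorded, selection at loop completion is a lookup rather than a rescan of the history, and the full history remains available for the selection report.
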
```yaml
selection_algorithm:
  on_loop_completion:
    - compare: current_best vs final_iteration
    - select: higher_quality_score
    - log: selection_decision
    - apply: selected_artifacts

  selection_criteria:
    primary: quality_score
    tiebreaker: earlier_iteration  # Prefer earlier if equal
```

**FORBIDDEN**:
```yaml
# Do NOT simply use final iteration
final_output: iterations[-1].artifacts  # Wrong!
```

**REQUIRED**:
```yaml
# Select best quality regardless of recency
final_output: max(iterations, key=quality_score).artifacts
```

### Rule 4: Log Selection Decisions

**REQUIRED**: Document why a particular iteration was selected.

```markdown
## Output Selection Report

**Loop ID**: ralph-001
**Total Iterations**: 3
**Selected Iteration**: 2

### Quality Scores

| Iteration | Quality | Status |
|-----------|---------|--------|
| 1 | 72% | |
| 2 | 85% | ✓ SELECTED |
| 3 | 83% | (final) |

### Selection Rationale

Iteration 2 selected because:
- Highest quality score (85% vs 83% final)
- Quality degraded in iteration 3
- All validation checks passed

### Artifacts Applied

- .aiwg/architecture/sad.md (from iteration 2)
```

### Rule 5: Support Manual Override

**REQUIRED**: Allow human override of automatic selection.

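One way to layer an override on top of automatic selection is sketched below. The function name `resolve_selection`, its parameters, and the `(iteration, score)` tuples are hypothetical. The three modes (`use_best`, `use_final`, `use_specific`) and the required reason mirror the options and audit settings this rule defines. Treating `use_best` as the default that needs no reason is an assumption of this sketch.

```python
from typing import List, Optional, Tuple

Scored = Tuple[int, float]  # (iteration number, quality score)


def resolve_selection(
    iterations: List[Scored],
    mode: str = "use_best",
    specific: Optional[int] = None,
    reason: Optional[str] = None,
) -> Scored:
    """Pick an iteration, honoring a human override when one is given."""
    if not iterations:
        raise ValueError("no iterations to select from")
    if mode == "use_best":
        # max() returns the first maximal item, so ties go to the earlier iteration.
        return max(iterations, key=lambda it: it[1])
    # Anything other than the automatic choice is an override and needs a reason.
    if not reason:
        raise ValueError("override requires a reason for the audit log")
    if mode == "use_final":
        return iterations[-1]
    if mode == "use_specific":
        for it in iterations:
            if it[0] == specific:
                return it
        raise ValueError(f"iteration {specific} not found")
    raise ValueError(f"unknown override mode: {mode}")


history = [(1, 0.72), (2, 0.85), (3, 0.83)]
print(resolve_selection(history))  # prints: (2, 0.85)
print(resolve_selection(history, "use_final", reason="reviewer prefers latest wording"))  # prints: (3, 0.83)
```

Rejecting an override that has no reason keeps the audit trail complete without a separate validation step.
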
```yaml
manual_override:
  enabled: true
  options:
    - use_best: "Select highest quality"
    - use_final: "Use final iteration"
    - use_specific: "Select iteration N"
  audit:
    log_override: true
    require_reason: true
```

## Quality Scoring

### Scoring Dimensions

Quality score MUST incorporate multiple dimensions:

| Dimension | Weight | Description |
|-----------|--------|-------------|
| Validation | 0.30 | Passes all validation checks |
| Completeness | 0.25 | All required sections present |
| Correctness | 0.25 | Accurate information/behavior |
| Readability | 0.10 | Clear, well-structured |
| Efficiency | 0.10 | Appropriate length/complexity |

### Score Calculation

```yaml
quality_score:
  formula: |
    weighted_sum(
      validation * 0.30,
      completeness * 0.25,
      correctness * 0.25,
      readability * 0.10,
      efficiency * 0.10
    )

  normalization: 0.0 to 1.0
  threshold_for_acceptance: 0.70
```

## Integration with Ralph

### Iteration Snapshot

After each iteration, Ralph MUST:

1. **Snapshot artifacts**
   ```bash
   # Create the snapshot directory first; cp does not create it
   mkdir -p .aiwg/working/iteration-N/
   cp -r .aiwg/working/current/* .aiwg/working/iteration-N/
   ```

2. **Calculate quality score**
   ```yaml
   quality_check:
     - run_validation
     - check_completeness
     - evaluate_correctness
     - calculate_weighted_score
   ```

3. **Update best tracker**
   ```yaml
   if quality_score > best_tracker.current_best.quality_score:
     best_tracker.update(iteration_N)
   ```

### Loop Completion

On completion:

1. **Compare best vs final**
   ```yaml
   comparison:
     best_iteration: 2 (85%)
     final_iteration: 3 (83%)
     delta: -2%
     decision: use_best
   ```

2. **Apply selected output**
   ```bash
   # Create the output directory first; cp does not create it
   mkdir -p .aiwg/output/
   cp -r .aiwg/working/iteration-2/* .aiwg/output/
   ```

3. **Generate selection report**
   ```markdown
   # Output Selection Report
   ...
   ```

## Degradation Patterns

### Common Causes

| Pattern | Cause | Mitigation |
|---------|-------|------------|
| Over-refinement | Too many iterations | Early stopping |
| Scope creep | Adding unnecessary features | Strict requirements |
| Style drift | Changing approach mid-loop | Consistent prompts |
| Information loss | Summarizing too aggressively | Preserve details |

### Detection

```yaml
degradation_detection:
  triggers:
    - quality_delta < -0.05  # 5% drop
    - consecutive_decreases >= 2
    - validation_failures_increased

  actions:
    - flag_degradation
    - consider_early_stopping
    - preserve_pre_degradation_best
```

## Storage

```
.aiwg/ralph/{loop_id}/
├── iterations/
│   ├── iteration-1/
│   │   ├── artifacts/
│   │   └── metrics.json
│   ├── iteration-2/
│   │   ├── artifacts/
│   │   └── metrics.json
│   └── iteration-3/
│       ├── artifacts/
│       └── metrics.json
├── best-tracker.json
├── selection-report.md
└── final-output/
    └── (selected artifacts)
```

## Validation Checklist

Before completing a Ralph loop:

- [ ] All iteration outputs preserved
- [ ] Quality score calculated for each iteration
- [ ] Best tracker maintained throughout
- [ ] Selection based on quality, not recency
- [ ] Selection decision logged with rationale
- [ ] Override option available if needed
- [ ] Degradation patterns detected

## References

- @agentic/code/addons/ralph/schemas/iteration-analytics.yaml - Iteration tracking
- @agentic/code/frameworks/sdlc-complete/schemas/research/quality-dimensions.yaml - Quality scoring
- @.aiwg/research/findings/REF-015-self-refine.md - Research foundation
- #168 - Implementation issue

---

**Rule Status**: ACTIVE
**Last Updated**: 2026-01-25