aiwg

Version:

Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.

aiwg.io

jmagly/aiwg

253 lines (199 loc) • 8.06 kB

Markdown

# BestOutputTracker Implementation Summary ## Overview Implemented BestOutputTracker for Ralph external loop per issue #168, conforming to schema at `agentic/code/addons/ralph/schemas/iteration-analytics.yaml`. ## Implementation Details ### Files Created 1. **best-output-tracker.mjs** (766 lines) - Core implementation with all required methods - Quality score calculation with weighted dimensions - Non-monotonic best output selection - Artifact snapshot management - Selection report generation 2. **best-output-tracker.test.mjs** (587 lines) - Comprehensive test suite with 18 tests - All tests passing (100% pass rate) - Covers all major functionality and edge cases 3. **README-best-output-tracker.md** (471 lines) - Complete documentation with examples - API reference - Integration patterns - Storage structure documentation 4. **examples/best-output-example.mjs** (155 lines) - Runnable example demonstrating the tracker - Simulates quality trajectory: 70% → 86% → 82% - Shows degradation detection and best output selection ## Key Features Implemented ### 1. Quality Tracking - **Multi-dimensional scoring**: validation, completeness, correctness, readability, efficiency - **Weighted calculation**: Configurable dimension weights (default: 30%, 25%, 25%, 10%, 10%) - **Delta tracking**: Calculates improvement/degradation from previous iteration - **Comprehensive metrics**: tokens, cost, execution time, verification status ### 2. Best Output Selection - **Running best**: Maintains reference to best iteration throughout loop - **Selection modes**: - `highest_quality`: Select highest score regardless of verification - `highest_quality_verified`: Select highest among verified (default) - `most_recent_above_threshold`: Select most recent exceeding threshold - **Graceful fallback**: Falls back to unverified if no verified iterations exist ### 3. Artifact Management - **Snapshot creation**: Copies artifacts to iteration-specific directories - **Preservation**: Keeps all iterations by default (configurable) - **Cleanup**: Optional removal of non-selected snapshots ### 4. Degradation Detection - **Automatic detection**: Identifies when quality declines after peak - **Clear reporting**: Reports selected vs final iteration differences - **Recommendations**: Suggests optimal iteration count ### 5. Diminishing Returns Detection - **Configurable thresholds**: Consecutive count and delta thresholds - **Early stopping signal**: Enables loop termination when further iteration provides minimal benefit - **Pattern recognition**: Detects low-delta consecutive iterations ### 6. Reporting and Analytics - **Selection reports**: Detailed markdown reports with quality trajectory - **Summary statistics**: Total iterations, average/best/worst quality, costs - **CSV export**: Machine-readable data export - **Quality trajectory**: Visual representation with ASCII charts ## Schema Compliance Conforms to `iteration-analytics.yaml`: - ✓ Quality dimensions tracked - ✓ Selection criteria configuration - ✓ Iteration record structure - ✓ Best output selection logic - ✓ Diminishing returns detection - ✓ Storage paths and formats ## Research Foundation Based on REF-015 Self-Refine (Madaan et al., 2023): **Key Insight**: Quality can fluctuate during iterative refinement. Peak quality often occurs at iteration 2-3 before degrading. **Example from research**: ``` Iteration 1: 72% quality Iteration 2: 85% quality ← PEAK Iteration 3: 83% quality (degraded) Final output: 83% (suboptimal) Best selection: 85% (iteration 2) ``` **Impact**: Selecting best (not final) prevents returning degraded output after over-refinement. ## Test Results ``` ✔ BestOutputTracker (176.459272ms) ✔ recordIteration (35.818158ms) ✔ getBest (10.277452ms) ✔ selectOutput (53.387037ms) ✔ generateSelectionReport (12.181177ms) ✔ detectDiminishingReturns (21.948105ms) ✔ persistence (7.277794ms) ✔ getSummary (11.277256ms) ✔ exportCSV (7.481461ms) ✔ quality score calculation (13.704001ms) ℹ tests 18 ℹ pass 18 ℹ fail 0 ``` ## Usage Example ```javascript import { BestOutputTracker } from './best-output-tracker.mjs'; const tracker = new BestOutputTracker('loop-001'); // Record iterations for (let i = 1; i <= 3; i++) { tracker.recordIteration({ iteration_number: i, dimensions: { validation: 0.8, completeness: 0.8, correctness: 0.85, readability: 0.75, efficiency: 0.8, }, artifacts: ['output.md'], verification_status: 'passed', }); } // Select best (not final) const selection = tracker.selectOutput(); console.log(`Selected iteration ${selection.selected_iteration}`); // Output: Selected iteration 2 (if that was peak quality) // Generate report const report = tracker.generateSelectionReport(selection); ``` ## Integration Points ### With Ralph Loop - Call `recordIteration()` after each external iteration - Use `detectDiminishingReturns()` for early stopping - Call `selectOutput()` on loop completion - Generate selection report for audit trail ### With Output Analyzer - Convert analysis results to quality dimensions - Map verification status from success/failure - Extract artifact paths from analysis ### With State Manager - Store tracking data in `.aiwg/ralph/{loop_id}/` - Persist across session restarts - Load tracking history on recovery ## Storage Structure ``` .aiwg/ralph/{loop_id}/ ├── iterations/ │ ├── iteration-001/ │ │ └── (snapshotted artifacts) │ ├── iteration-002/ │ │ └── (snapshotted artifacts) │ └── iteration-003/ │ └── (snapshotted artifacts) ├── best-output-tracking.json └── selection-report.md ``` ## API Surface ### Constructor - `new BestOutputTracker(loopId, config)` ### Core Methods - `recordIteration(params)` - Record iteration with quality metrics - `getBest()` - Get current best iteration - `selectOutput()` - Select best output based on criteria - `generateSelectionReport(selection)` - Generate markdown report ### Analytics - `detectDiminishingReturns(consecutiveThreshold, deltaThreshold)` - `getQualityTrajectory()` - `getSummary()` - `exportCSV()` ### Management - `cleanupSnapshots(selectedIteration)` - `save()` / `load()` - Persistence ## Configuration Options ```javascript { storage_path: string, // Base directory selection: { mode: string, // Selection mode threshold: number, // Minimum quality (0-100) require_verification: boolean, }, keep_all_iterations: boolean, // Preserve all snapshots quality_weights: { // Custom dimension weights validation: number, // 0-1 completeness: number, // 0-1 correctness: number, // 0-1 readability: number, // 0-1 efficiency: number, // 0-1 }, } ``` ## Benefits 1. **Quality Preservation**: Never returns degraded output after over-refinement 2. **Transparency**: Clear reporting on why specific iteration was selected 3. **Cost Awareness**: Tracks token usage and costs across iterations 4. **Early Stopping**: Diminishing returns detection prevents wasted iterations 5. **Audit Trail**: Complete history with snapshots for review 6. **Flexibility**: Multiple selection modes for different use cases ## Next Steps 1. **Integration**: Wire into orchestrator.mjs for automatic tracking 2. **Visualization**: Add quality trajectory charts to web UI 3. **Thresholds**: Tune default quality weights based on real usage 4. **Metrics**: Add to ralph-status command for monitoring 5. **Alerts**: Notify when degradation exceeds threshold ## References - **Schema**: `@agentic/code/addons/ralph/schemas/iteration-analytics.yaml` - **Research**: `@.aiwg/research/findings/REF-015-self-refine.md` - **Rules**: `@.claude/rules/best-output-selection.md` - **Issue**: #168 ## Author Implemented: 2026-01-28 Agent: Claude Sonnet 4.5