# Multi-Level Summary Template

---
template_id: summary
version: 1.0.0
reasoning_required: true
framework: research-complete
---

## Ownership & Collaboration

- Document Owner: Research Analyst
- Contributor Roles: Technical Writer, Domain Expert
- Automation Inputs: Full paper text, literature notes
- Automation Outputs: `summary-REF-XXX.md` with three summary levels

## Phase 1: Core (ESSENTIAL)

### Paper Identification

**Reference ID:** REF-XXX  
**Title:** [Full paper title]  
**Authors:** [Author list]  
**Year:** YYYY  
**Source:** [Journal/Conference/Preprint]  

## Reasoning

> Complete this section BEFORE writing summaries. Per @.claude/rules/reasoning-sections.md

1. **Audience Identification**: Who will read each summary level?
   > [Define target audience for 1-sentence (executives), 1-paragraph (practitioners), 1-page (researchers)]

2. **Key Message Extraction**: What is THE most important insight?
   > [Identify the single insight that must appear at every level]

3. **Detail Layering Strategy**: How will we add depth across levels?
   > [Plan what details appear at each level: 1-sentence = result only, 1-paragraph = method + result, 1-page = full methodology]

4. **Terminology Calibration**: What terms need definition?
   > [Identify domain-specific terms and determine which level requires explanation]

5. **Action Orientation**: What should readers DO with this?
   > [Specify actionable takeaway appropriate for each level]

## 1-Sentence Summary (ESSENTIAL)

> Single sentence (max 30 words) capturing core contribution and impact

## Phase 2: Extended Summaries (EXPAND WHEN READY)

<details>
<summary>Click to expand paragraph and full-page summaries</summary>

### 1-Paragraph Summary

> 5-7 sentences covering: What problem? What method? What result? What impact?

### 1-Page Summary

> Comprehensive summary covering methodology, findings, implications, limitations (~400-600 words)

#### Problem & Context

Self-Refine tackles a fundamental limitation in LLM application: outputs often require human revision despite model capabilities. Traditional approaches rely on human feedback loops or reinforcement learning from human feedback (RLHF), both expensive and slow. The paper investigates whether LLMs can provide useful feedback on their own outputs and refine iteratively without external supervision.

#### Methodology

The Self-Refine algorithm implements a three-step loop:

1. **Generate**: LLM produces initial output for task
2. **Feedback**: Same LLM critiques its output using structured prompts
3. **Refine**: LLM generates improved version incorporating feedback

The process repeats until a stopping criterion is met (quality threshold, max iterations, or "no changes needed" signal). Critically, feedback is structured with specific categories relevant to each task (e.g., code correctness, execution efficiency, readability for programming tasks).

**Evaluation** spanned seven diverse tasks:

- Code optimization (PIE dataset)
- Dialogue response generation (Persona-Chat)
- Math reasoning (GSM8K)
- Sentiment reversal
- Acronym generation
- Constrained generation (CommonGen)
- Review rewriting (Yelp)

Baselines included GPT-4 direct generation, best-of-k sampling, and CoT prompting.

#### Key Findings

**Quantitative Results:**

- Average improvement ~20% across tasks (range: 5-47% depending on task)
- Code optimization: 10% → 18% pass@1 success rate
- Math reasoning: 15% relative improvement
- Sentiment reversal: 47% improvement in target sentiment achievement

**Qualitative Insights:**

1. **Feedback quality dominates**: Analysis revealed 94% of iteration failures trace to poor feedback (vague, generic, non-actionable) rather than refinement inadequacy. When feedback was specific and actionable, refinement succeeded.
2. **Non-monotonic trajectories**: Quality doesn't always improve with each iteration. Output quality at iteration 2 may exceed iteration 3, necessitating best-output-selection strategies.
3. **Task-specific feedback structure**: Generic "improve this" feedback fails. Effective feedback uses task-specific categories (correctness, efficiency, style for code; coherence, persona-consistency for dialogue).

#### Limitations

- Single-LLM focus: Doesn't explore multi-model feedback (specialist reviewers)
- Iteration depth: Most gains arrive by iteration 2-3, with diminishing returns after
- Feedback training: No systematic method for learning better feedback
- Cost: Multiple LLM calls per task (mitigated by smaller models for feedback)

#### AIWG Applications

**Implemented:**

- @.claude/rules/actionable-feedback.md - Structured feedback schema
- @.claude/rules/best-output-selection.md - Non-monotonic quality handling
- @agentic/code/addons/ralph/schemas/actionable-feedback.yaml - Feedback format

**Key Design Decisions Informed:**

1. Ralph loops prioritize feedback quality over iteration count
2. Feedback must include: location, severity, actionable suggestion, rationale
3. Track quality per iteration, select best rather than final
4. Use task-specific feedback aspects (security for code, clarity for docs)

**Future Work:**

- Multi-agent review panels for diverse feedback (@.claude/rules/ensemble-review.md)
- Feedback quality metrics and learning
- Integration with executable feedback (@.claude/rules/executable-feedback.md)

#### Citation

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Welleck, S., Majumder, B. P., Gupta, S., Yazdanbakhsh, A., & Clark, P. (2023). Self-Refine: Iterative Refinement with Self-Feedback. *Proceedings of NeurIPS 2023*.

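#### Illustrative Sketch: Feedback Fields and Best-Output Selection

Two of the design decisions listed under AIWG Applications (feedback must carry location, severity, suggestion, and rationale; quality is tracked per iteration and the best output is selected rather than the final one) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Ralph implementation or the paper's code: `Feedback`, `refine_with_best_selection`, and the toy callables are hypothetical names, and the scoring function is assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    """Hypothetical feedback item with the four fields named in the summary."""
    location: str
    severity: str
    suggestion: str
    rationale: str

    def is_actionable(self) -> bool:
        # Assumption: feedback counts as actionable only if no field is blank.
        fields = (self.location, self.severity, self.suggestion, self.rationale)
        return all(f.strip() for f in fields)


def refine_with_best_selection(
    initial: str,
    critique: Callable[[str], List[Feedback]],
    refine: Callable[[str, List[Feedback]], str],
    score: Callable[[str], float],
    max_iterations: int = 3,
) -> Tuple[str, List[float]]:
    """Run a refine loop and return the best-scoring output plus all scores."""
    history = [(score(initial), initial)]
    current = initial
    for _ in range(max_iterations):
        feedback = [f for f in critique(current) if f.is_actionable()]
        if not feedback:
            break  # nothing actionable to apply, so stop early
        current = refine(current, feedback)
        history.append((score(current), current))
    # Quality may drop between iterations, so pick the best, not the last.
    _, best_output = max(history, key=lambda pair: pair[0])
    return best_output, [s for s, _ in history]


# Toy stand-ins for the LLM calls (made-up data, for illustration only).
toy_scores = {"v0": 0.5, "v1": 0.7, "v2": 0.9, "v3": 0.8}


def toy_critique(text: str) -> List[Feedback]:
    return [Feedback("line 1", "minor", "tighten wording", "improves clarity")]


def toy_refine(text: str, feedback: List[Feedback]) -> str:
    return f"v{int(text[1:]) + 1}"


best, scores = refine_with_best_selection(
    "v0", toy_critique, toy_refine, toy_scores.__getitem__
)
print(best, scores)  # v2 [0.5, 0.7, 0.9, 0.8]
```

In the toy run the third refinement scores lower than the second, so the loop returns `v2` instead of the final `v3`, which is the behavior the non-monotonic finding calls for.
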
</details>

## Phase 3: Technical Details (ADVANCED)

<details>
<summary>Click to expand technical implementation notes</summary>

### Algorithm Pseudocode

```
function SelfRefine(task, initial_prompt, max_iterations=5):
    output = LLM.generate(initial_prompt)
    history = [output]  # keep every candidate so the best can be selected

    for i in 1 to max_iterations:
        feedback = LLM.critique(output, task_specific_rubric)

        if feedback.indicates_no_improvement_needed():
            return output

        if not feedback.is_actionable():
            log_feedback_quality_failure()
            return select_best_from_history(history)

        refined_output = LLM.refine(output, feedback)
        history.append(refined_output)

        if quality(refined_output) <= quality(output):
            # Non-monotonic trajectory: record it and keep iterating
            log_quality_degradation()

        output = refined_output

    return select_best_from_history(history)
```

### Prompt Engineering Notes

**Feedback Prompt Structure:**

```
Review the following [output_type] for [task]:

[OUTPUT]

Provide specific feedback on:
1. [Task-specific aspect 1]
2. [Task-specific aspect 2]
3. [Task-specific aspect 3]

For each issue, specify:
- Exact location
- Problem description
- Concrete suggestion for improvement
```

**Refinement Prompt Structure:**

```
Original output:
[OUTPUT]

Feedback received:
[FEEDBACK]

Generate an improved version that addresses all feedback points.
```

### Performance Metrics

| Task | Baseline | Self-Refine | Improvement |
|------|----------|-------------|-------------|
| Code optimization | 10% pass@1 | 18% pass@1 | +80% relative |
| Math (GSM8K) | 67.3% | 77.4% | +15% relative |
| Sentiment reversal | 45% | 66% | +47% relative |
| Dialogue coherence | 3.2/5 | 3.8/5 | +19% relative |

### Failure Mode Analysis

**94% of failures due to poor feedback:**

| Feedback Issue | Frequency | Example | Fix |
|----------------|-----------|---------|-----|
| Vague location | 38% | "Code has issues" | "Line 42: null check missing" |
| Non-actionable | 31% | "Improve quality" | "Add input validation: if (!x) throw" |
| Missing rationale | 15% | "Change this" | "Change to prevent SQL injection" |
| Wrong severity | 10% | Critical marked as minor | Proper severity classification |

</details>

## Related Research

**Builds on:**

- @.aiwg/research/findings/REF-016-chain-of-thought.md - Reasoning structure
- @.aiwg/research/findings/REF-018-react.md - Iterative agent loops

**Extends:**

- Extends CoT by adding feedback and refinement phases
- Extends basic prompting with iterative improvement

**Cited by:**

- @.aiwg/research/findings/REF-057-agent-laboratory.md - Uses Self-Refine patterns

**Contradicts/Challenges:**

- Challenges assumption that more iterations = better results (non-monotonic)
- Challenges focus on refinement capability vs feedback quality

## References

- @.aiwg/research/sources/[PDF-filename].pdf - Original paper
- @.claude/rules/actionable-feedback.md - AIWG implementation
- @.claude/rules/best-output-selection.md - Non-monotonic handling
- @agentic/code/addons/ralph/schemas/actionable-feedback.yaml - Schema
- @.aiwg/research/findings/REF-015-self-refine.md - Literature note

## Template Usage Notes

**When to create summaries:**

- After completing literature note
- When preparing research briefings
- For research index/catalog entries
- When communicating findings to stakeholders

**Summary writing approach:**

1. Write 1-page summary first (full understanding)
2. Distill to 1-paragraph (core message + implications)
3. Compress to 1-sentence (pure essence)
4. Verify each level is self-contained and actionable

**Quality checks:**

- 1-sentence: Can executive grasp impact in 10 seconds?
- 1-paragraph: Can engineer understand and apply in 1 minute?
- 1-page: Can researcher evaluate methodology in 5 minutes?

**Anti-patterns:**

- 1-paragraph that's just expanded 1-sentence (add substance)
- 1-page that's literature note copy-paste (synthesize, don't duplicate)
- Technical jargon in 1-sentence summary
- Missing actionable takeaway at any level

## Metadata

- **Template Type:** research-summary
- **Framework:** research-complete
- **Primary Agent:** @agentic/code/frameworks/research-complete/agents/discovery-agent.md
- **Related Templates:**
  - @agentic/code/frameworks/research-complete/templates/literature-note.md
  - @agentic/code/frameworks/research-complete/templates/extraction.yaml
- **Version:** 1.0.0
- **Last Updated:** 2026-02-03
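
## Appendix: Length Check Sketch

The length limits this template states for each level (max 30 words for the 1-sentence summary, 5-7 sentences for the 1-paragraph summary, roughly 400-600 words for the 1-page summary) could be checked mechanically before review. The sketch below is one possible approach, not part of the AIWG tooling: `check_summary_levels` and its helpers are hypothetical names, the sentence splitter is deliberately naive, and the page limit is treated as a hard range even though the template marks it as approximate.

```python
import re
from typing import List


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def sentence_count(text: str) -> int:
    """Naive count: split after '.', '!' or '?' followed by whitespace.

    Abbreviations such as 'e.g.' will be over-counted.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def check_summary_levels(one_sentence: str, one_paragraph: str, one_page: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means all limits hold."""
    problems = []
    if word_count(one_sentence) > 30:
        problems.append("1-sentence summary exceeds 30 words")
    if not 5 <= sentence_count(one_paragraph) <= 7:
        problems.append("1-paragraph summary is not 5-7 sentences")
    if not 400 <= word_count(one_page) <= 600:
        problems.append("1-page summary is outside 400-600 words")
    return problems


# Made-up sample text, for illustration only.
sample_sentence = "Self-feedback loops can improve LLM outputs without extra training."
sample_paragraph = " ".join(["This is a sentence."] * 6)
sample_page = " ".join(["word"] * 500)

print(check_summary_levels(sample_sentence, sample_paragraph, sample_page))  # []
```

A check like this only covers length; the quality checks and anti-patterns listed under Template Usage Notes still need a human or agent reviewer.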