UNPKG

aiwg

Version:

Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.

597 lines (442 loc) 19.9 kB
# GRADE Quality Assessment Template --- template_id: quality-assessment version: 1.0.0 reasoning_required: true framework: research-complete --- ## Ownership & Collaboration - Document Owner: Research Analyst - Contributor Roles: Domain Expert, Quality Auditor - Automation Inputs: Paper metadata, methodology sections, results tables - Automation Outputs: `quality-assessment-REF-XXX.md` with GRADE rating and justification ## Phase 1: Core (ESSENTIAL) ### Paper Identification **Reference ID:** REF-XXX <!-- EXAMPLE: REF-018 --> **Title:** [Full paper title] <!-- EXAMPLE: ReAct: Synergizing Reasoning and Acting in Language Models --> **Authors:** [Author list] **Year:** YYYY **Source:** [Journal/Conference/Preprint] <!-- EXAMPLE: **Authors:** Yao, S., Zhao, J., Yu, D., et al. **Year:** 2022 **Source:** ICLR 2023 (International Conference on Learning Representations) --> ### Quality Rating Summary **GRADE Level:** HIGH | MODERATE | LOW | VERY LOW <!-- EXAMPLE: HIGH --> **Baseline (Source Type):** [Initial quality based on publication venue] <!-- EXAMPLE: **Baseline (Source Type):** HIGH (Peer-reviewed top-tier conference) Per @.claude/rules/research-metadata.md, peer-reviewed conferences start at HIGH. --> **Adjustments:** [Factors that raised or lowered from baseline] <!-- EXAMPLE: **Adjustments:** +0 (No upward adjustments needed, strong methodology) -0 (No downward adjustments, no serious limitations) Final: HIGH (retained baseline) --> **One-Line Rationale:** [Brief justification] <!-- EXAMPLE: **One-Line Rationale:** Rigorous experimental design with multiple baselines, reproducible methodology, and consistent results across four diverse benchmarks. --> ## Reasoning > Complete this section BEFORE detailed assessment. Per @.claude/rules/reasoning-sections.md 1. **Baseline Determination**: What is the starting quality level? > [Assess publication venue quality per GRADE guidelines] <!-- EXAMPLE: Source: ICLR 2023 (International Conference on Learning Representations) Venue quality: Top-tier ML conference (acceptance rate ~28%, rigorous peer review) Baseline per @.claude/rules/research-metadata.md: HIGH Rationale: ICLR is top-3 ML conference alongside NeurIPS and ICML. Strong review process, high standards. --> 2. **Study Design Evaluation**: How rigorous is the methodology? > [Assess experimental design, controls, baselines, reproducibility] <!-- EXAMPLE: Design strengths: - Multiple baselines (Act-only, CoT, standard prompting) - Diverse tasks (4 benchmarks: HotpotQA, FEVER, ALFWorld, WebShop) - Clear evaluation metrics (success rate, accuracy, trajectory efficiency) - Reproducible (code and data publicly available) Design weaknesses: - Single model family tested (GPT-3/GPT-4, no LLaMA/Claude comparison) - Limited analysis of failure modes --> 3. **Evidence Strength Assessment**: How strong is the evidence for claims? > [Evaluate sample size, statistical significance, effect sizes] <!-- EXAMPLE: Evidence strengths: - Large sample sizes (500 questions for HotpotQA, FEVER) - Consistent improvements across tasks (not cherry-picked) - Effect sizes substantial (34% relative improvement, not marginal) Evidence limitations: - No statistical significance testing reported (p-values, confidence intervals) - Single-run results (no error bars or repeated trials) --> 4. **Generalizability Analysis**: How broadly do findings apply? > [Assess scope: single domain vs cross-domain, single model vs multiple] <!-- EXAMPLE: Generalizability strengths: - Four diverse task types (QA, fact verification, interactive, web) - Both reasoning-heavy and action-heavy tasks tested Generalizability limitations: - All tasks involve tool use; unclear if pattern helps non-tool tasks - Consumer-grade tasks; enterprise SDLC workflows not tested - Single model family (OpenAI GPT); transferability to other LLMs unclear --> 5. **Risk of Bias Assessment**: Are there methodological biases? > [Check for selection bias, reporting bias, funding conflicts] <!-- EXAMPLE: Bias risks: - LOW: Baselines fairly selected (standard methods in literature) - LOW: Tasks represent diverse challenge types - MODERATE: Authors affiliated with OpenAI, tested on GPT models (potential optimization bias) - LOW: Code and data public (transparency high) Overall bias risk: LOW to MODERATE --> ## Phase 2: Detailed GRADE Assessment (EXPAND WHEN READY) <details> <summary>Click to expand detailed GRADE criteria evaluation</summary> ### GRADE Framework GRADE (Grading of Recommendations Assessment, Development and Evaluation) assesses evidence quality across five domains: 1. **Study Design** (starting point) 2. **Risk of Bias** (may downgrade) 3. **Inconsistency** (may downgrade) 4. **Indirectness** (may downgrade) 5. **Imprecision** (may downgrade) 6. **Publication Bias** (may downgrade) 7. **Large Effect Size** (may upgrade) 8. **Dose-Response** (may upgrade) 9. **Confounders** (may upgrade) ### 1. Study Design (Baseline) | Design Type | Baseline GRADE | This Paper | |-------------|----------------|------------| | Systematic review | HIGH | ☐ | | RCT | HIGH | ☐ | | Cohort study | MODERATE | ☐ | | Case-control | MODERATE | ☐ | | Case series | LOW | ☐ | | Expert opinion | VERY LOW | ☐ | | **Experimental (ML)** | **HIGH** | **☑** | | Preprint/unreviewed | LOW | ☐ | **Assessment:** <!-- EXAMPLE: **Design:** Controlled experiment with multiple baselines **Venue:** Peer-reviewed top-tier conference (ICLR 2023) **Baseline:** HIGH (equivalent to RCT in clinical research) Justification: ML experimental papers with proper controls and peer review start at HIGH in GRADE framework. --> ### 2. Risk of Bias (Downgrade?) | Bias Type | Risk Level | Impact on Rating | |-----------|------------|------------------| | Selection bias | LOW / MODERATE / HIGH | 0 / -1 / -2 | | Performance bias | LOW / MODERATE / HIGH | 0 / -1 / -2 | | Detection bias | LOW / MODERATE / HIGH | 0 / -1 / -2 | | Attrition bias | LOW / MODERATE / HIGH | 0 / -1 / -2 | | Reporting bias | LOW / MODERATE / HIGH | 0 / -1 / -2 | **Assessment:** <!-- EXAMPLE: **Selection bias:** LOW - Benchmarks are standard in field (HotpotQA, FEVER publicly available) - Not cherry-picked; results reported for all attempted tasks **Performance bias:** MODERATE - Authors affiliated with OpenAI, tested on GPT models - Possible optimization toward GPT behavior - Mitigation: Would benefit from independent replication on other LLMs **Detection bias:** LOW - Evaluation metrics clearly defined and objective (success rate, accuracy) - Not subjective assessment **Attrition bias:** LOW - All test cases processed (no selective exclusions mentioned) **Reporting bias:** LOW - Failures discussed (e.g., limited gains on some tasks) - Code and data publicly available for verification **Overall Risk of Bias:** LOW to MODERATE **Downgrade:** -0 levels (minor concerns but no serious limitations) --> ### 3. Inconsistency (Downgrade?) | Consistency Check | Status | Impact | |-------------------|--------|--------| | Results consistent across tasks? | YES / NO / MIXED | 0 / -1 / -2 | | Results align with prior work? | YES / NO / MIXED | 0 / -1 | | Effect sizes consistent? | YES / NO / MIXED | 0 / -1 | **Assessment:** <!-- EXAMPLE: **Consistency across tasks:** HIGH - Improvements observed on all four benchmarks - Effect sizes range 5-34% but all positive **Consistency with prior work:** HIGH - Builds logically on CoT (cited work) - Extends rather than contradicts prior findings **Effect size consistency:** MODERATE - Varying magnitude across tasks (5% on ALFWorld, 34% on HotpotQA) - Explainable by task characteristics (reasoning vs action-heavy) **Overall Inconsistency:** LOW **Downgrade:** -0 levels (results are consistent) --> ### 4. Indirectness (Downgrade?) | Indirectness Check | Status | Impact | |--------------------|--------|--------| | Population matches our target? | YES / NO / PARTIAL | 0 / -1 / -2 | | Intervention matches our use? | YES / NO / PARTIAL | 0 / -1 / -2 | | Outcomes match our needs? | YES / NO / PARTIAL | 0 / -1 | **Assessment:** <!-- EXAMPLE: **Population (Tasks):** PARTIAL - Paper tests: QA, fact-checking, interactive games, web shopping - AIWG needs: SDLC workflows (requirements, architecture, testing) - Overlap: Tool use, reasoning, multi-step processes - Gap: No direct test of SDLC-specific tasks **Intervention (Method):** YES - Paper uses: Thought→Action→Observation loop - AIWG uses: TAO loop (same structure) - Match: High **Outcomes:** PARTIAL - Paper measures: Success rate, accuracy, efficiency - AIWG needs: Artifact quality, iteration count, HITL rate - Overlap: Task completion as success metric - Gap: AIWG-specific quality dimensions not tested **Overall Indirectness:** MODERATE **Downgrade:** -0 levels (indirect but relevant; extrapolation reasonable) Note: Would downgrade -1 if population gap was severe, but tool-use reasoning generalizes reasonably to SDLC. --> ### 5. Imprecision (Downgrade?) | Precision Check | Status | Impact | |-----------------|--------|--------| | Large sample size? | YES / NO | 0 / -1 | | Narrow confidence intervals? | YES / NO / UNKNOWN | 0 / -1 / -1 | | Statistical significance reported? | YES / NO | 0 / -1 | **Assessment:** <!-- EXAMPLE: **Sample size:** YES - 500 questions per benchmark (HotpotQA, FEVER) - 134 tasks (ALFWorld), 251 tasks (WebShop) - Adequate for reliable estimates **Confidence intervals:** UNKNOWN - No CIs reported (common in ML papers but suboptimal) - No error bars or repeated trials - Cannot assess precision statistically **Statistical significance:** NO - No p-values reported - Cannot determine if differences are statistically significant vs random **Overall Imprecision:** MODERATE **Downgrade:** -0 levels (effect sizes large enough to be meaningful despite lack of statistical tests; common practice in ML) Note: Would downgrade -1 if effect sizes were small AND no significance testing. Here effects are large (34% improvement). --> ### 6. Publication Bias (Downgrade?) | Bias Check | Status | Impact | |------------|--------|--------| | Preregistration? | YES / NO | N/A for ML | | Negative results published? | YES / NO / UNKNOWN | 0 / -1 | | File drawer problem likely? | YES / NO | 0 / -1 | **Assessment:** <!-- EXAMPLE: **Preregistration:** N/A - Not standard in ML research (more common in psychology, medicine) **Negative results:** PARTIAL - Authors report lower gains on some tasks (5% on ALFWorld vs 34% on HotpotQA) - Some failure cases discussed - No mention of abandoned experiments **File drawer:** UNLIKELY - Peer-reviewed venue with replication requirements - Code and data public (enables verification) **Overall Publication Bias:** LOW **Downgrade:** -0 levels (no strong evidence of selective reporting) --> ### 7. Large Effect Size (Upgrade?) | Effect Check | Status | Impact | |--------------|--------|--------| | Large effect (>2x baseline)? | YES / NO | +1 / 0 | | Very large effect (>5x baseline)? | YES / NO | +2 / 0 | **Assessment:** <!-- EXAMPLE: **Effect magnitude:** - HotpotQA: 49% → 66% (1.35x baseline, +34% relative) - FEVER: Not specified in review - Overall: Moderate to large effects, not >2x **Large effect:** NO (1.35x < 2x threshold) **Upgrade:** +0 levels (effects are meaningful but not exceptionally large) Note: Would upgrade +1 if effect was >2x baseline (e.g., 49% → 98%). --> ### 8. Dose-Response Gradient (Upgrade?) | Gradient Check | Status | Impact | |----------------|--------|--------| | Clear dose-response? | YES / NO / N/A | +1 / 0 | **Assessment:** <!-- EXAMPLE: **Dose-response applicability:** N/A - Not applicable to this study (not a dose-response design) - Would apply to studies testing varying "doses" of intervention (e.g., 1 vs 3 vs 5 iterations) **Upgrade:** +0 levels (criterion not applicable) --> ### 9. Confounders (Upgrade?) | Confounder Check | Status | Impact | |------------------|--------|--------| | Plausible confounders? | YES / NO | 0 / +1 | | Confounders would reduce effect? | YES / NO / N/A | 0 / +1 | **Assessment:** <!-- EXAMPLE: **Plausible confounders:** - Model size (GPT-3 vs GPT-4) could confound - Task difficulty varies - Prompt engineering quality could vary **Confounders accounted for:** PARTIAL - Same model used for baselines (controls for model size) - Multiple tasks tested (controls for task-specific effects) - Some confounding remains (e.g., prompt quality) **Would confounders reduce observed effect?** UNLIKELY - If anything, confounders might favor simpler baselines (easier to optimize) - ReAct showing gains despite potential confounds is stronger evidence **Upgrade:** +0 levels (no evidence confounders would strengthen effect) --> ### GRADE Calculation **Starting Point:** HIGH (peer-reviewed experimental study) **Downgrades:** - Risk of Bias: -0 (LOW to MODERATE, no serious issues) - Inconsistency: -0 (results consistent) - Indirectness: -0 (extrapolation reasonable) - Imprecision: -0 (large effects compensate for lack of CIs) - Publication Bias: -0 (no evidence of bias) **Upgrades:** - Large Effect: +0 (effects moderate, not >2x) - Dose-Response: +0 (N/A) - Confounders: +0 (none that would strengthen) **Final GRADE Level:** HIGH <!-- EXAMPLE: **Calculation:** HIGH (baseline) - 0 (downgrades) + 0 (upgrades) = HIGH **Interpretation:** We have HIGH confidence that the true effect is close to the estimate. Further research is unlikely to substantially change our confidence in this finding. --> </details> ## Phase 3: Applicability Analysis (ADVANCED) <details> <summary>Click to expand AIWG-specific applicability assessment</summary> ### Applicability to AIWG **Overall Applicability:** HIGH | MODERATE | LOW <!-- EXAMPLE: HIGH --> **Rationale:** <!-- EXAMPLE: HIGH applicability because: 1. Core pattern (TAO loop) directly maps to AIWG agent design 2. Tool use context aligns with AIWG tool-using agents (Read, Write, Bash, Grep) 3. Multi-step reasoning required in SDLC tasks (same as QA tasks) 4. Observation grounding reduces hallucinations (critical for AIWG citation policy) Caveats: - Not tested on SDLC-specific tasks (requirements, architecture, code review) - Longer iteration counts in AIWG (Ralph 10+ iterations vs paper 3-5) - Multi-agent coordination not addressed (AIWG orchestrator needs) --> ### Implementation Confidence | AIWG Component | Confidence | Rationale | |----------------|------------|-----------| | [Component 1] | HIGH / MODERATE / LOW | [Why this confidence level] | | [Component 2] | HIGH / MODERATE / LOW | [Why this confidence level] | <!-- EXAMPLE: | AIWG Component | Confidence | Rationale | | TAO loop structure | HIGH | Direct pattern match, well-validated across tasks | | Thought types | MODERATE | Extrapolation from paper's example traces | | Tool grounding | HIGH | Hallucination reduction demonstrated clearly | | Multi-iteration (Ralph) | MODERATE | Paper tests 3-5 iterations, Ralph runs 10+; scale effects unknown | | Multi-agent coordination | LOW | Not addressed in paper; separate research needed | --> ### Evidence Gaps for AIWG **Gap 1:** [Specific AIWG use case not covered by research] <!-- EXAMPLE: **Gap 1:** Long-running agent sessions (10+ iterations) Paper tests: 3-5 TAO iterations per task AIWG needs: Ralph loops run 10+ iterations on complex tasks Gap: Unknown if reasoning quality degrades, if observation grounding remains effective Action: Monitor Ralph loop quality metrics; conduct pilot study if degradation observed --> **Gap 2:** [Another AIWG-specific gap] <!-- EXAMPLE: **Gap 2:** SDLC-specific tasks (requirements, architecture) Paper tests: QA, fact-checking, interactive games AIWG needs: Requirements elicitation, architectural decision-making Gap: Unclear if TAO loop benefits generalize to less well-defined tasks Action: Implement with monitoring; collect quality data from AIWG dogfooding --> ### Recommendations **For implementation:** <!-- EXAMPLE: 1. IMPLEMENT: TAO loop structure in all tool-using agents (HIGH confidence) 2. IMPLEMENT: Tool observation grounding for all factual claims (HIGH confidence) 3. PILOT: Thought type taxonomy (MODERATE confidence; adapt based on AIWG needs) 4. MONITOR: Quality in 10+ iteration Ralph loops (MODERATE confidence; watch for degradation) 5. DEFER: Multi-agent TAO coordination (LOW confidence; needs separate research) --> **For monitoring:** <!-- EXAMPLE: Track these metrics to validate applicability: - Agent success rate on SDLC tasks (does TAO loop improve performance?) - Hallucination rate (does observation grounding reduce fabrications?) - Iteration efficiency (do longer Ralph loops maintain quality?) - Human escalation rate (are agents self-sufficient with TAO loop?) Revalidation trigger: If any metric degrades >20% from baseline, investigate and potentially revise approach. --> **For future research:** <!-- EXAMPLE: Research needs revealed by applicability analysis: 1. TAO loop effectiveness on SDLC tasks (pilot study) 2. Long-iteration agent sessions (10+ TAO iterations) 3. Multi-agent TAO coordination patterns 4. Thought type taxonomy for SDLC domains Priority: #1 and #2 (directly affect AIWG production use) --> </details> ## Quality Summary Table | GRADE Criterion | Assessment | Impact | Notes | |-----------------|------------|--------|-------| | **Baseline (Study Design)** | HIGH | Starting point | Peer-reviewed experimental | | Risk of Bias | LOW-MODERATE | -0 | Minor concerns (OpenAI affiliation) | | Inconsistency | LOW | -0 | Results consistent | | Indirectness | MODERATE | -0 | Extrapolation reasonable | | Imprecision | MODERATE | -0 | Large effects compensate | | Publication Bias | LOW | -0 | No evidence of selective reporting | | Large Effect | NO | +0 | 1.35x not >2x threshold | | Dose-Response | N/A | +0 | Not applicable | | Confounders | UNLIKELY | +0 | None that strengthen effect | | **Final GRADE** | **HIGH** | **HIGH** | **High confidence in findings** | ## References - @.aiwg/research/sources/[PDF-filename].pdf - Original paper - @.aiwg/research/findings/REF-XXX.md - Literature note - @.claude/rules/citation-policy.md - GRADE-based citation language - @.claude/rules/research-metadata.md - Baseline quality by source type - @agentic/code/frameworks/research-complete/schemas/grade-schema.yaml - GRADE assessment schema ## Template Usage Notes **When to perform GRADE assessment:** - When adding paper to corpus - Before citing paper in implementation decisions - When updating quality assessments (annually) - Before recommending paper as "definitive" on topic **Assessment approach:** 1. Determine baseline quality (publication venue) 2. Evaluate each GRADE criterion systematically 3. Apply downgrades/upgrades per framework 4. Calculate final GRADE level 5. Assess AIWG-specific applicability 6. Document confidence and gaps **Common pitfalls:** - Conflating quality with relevance (HIGH quality ≠ HIGH relevance) - Ignoring indirect evidence (extrapolation often reasonable) - Over-penalizing for missing statistical tests (common in ML) - Not documenting applicability gaps **GRADE levels in citation language:** - HIGH: "demonstrates", "shows", "establishes" - MODERATE: "suggests", "indicates", "supports" - LOW: "limited evidence", "preliminary findings" - VERY LOW: "anecdotal", "exploratory" Per @.claude/rules/citation-policy.md ## Metadata - **Template Type:** research-quality-assessment - **Framework:** research-complete - **Primary Agent:** @agentic/code/frameworks/research-complete/agents/quality-agent.md - **Related Templates:** - @agentic/code/frameworks/research-complete/templates/literature-note.md - @agentic/code/frameworks/research-complete/templates/extraction.yaml - **Version:** 1.0.0 - **Last Updated:** 2026-02-03