aiwg

Version:

Deployment tool and support utility for AI context. Copies agents, skills, commands, rules, and behaviors into the paths each AI platform reads (Claude Code, Codex, Copilot, Cursor, Warp, OpenClaw, and 6 more) so one source of truth works across 10 platfo

aiwg.io

jmagly/aiwg

1,266 lines (992 loc) • 73.7 kB

Markdown

# Initial Risk Assessment: AIWG Research Framework **Project**: AIWG Research Framework **Framework ID**: research-complete **Version**: 1.0.0 **Assessment Date**: 2026-01-25 **Assessment Period**: Inception Phase **Next Review Date**: 2026-02-15 (End of Elaboration Phase) **Status**: Active ## References - @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/project-intake.md - Project scope and constraints - @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/solution-profile.md - Technical architecture and risks - @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/stakeholder-analysis.md - Stakeholder concerns and expectations - @.aiwg/research/research-framework-findings.md - Research foundation and evidence base --- ## 1. Risk Assessment Overview ### 1.1 Purpose This risk assessment identifies, evaluates, and prioritizes potential threats to the successful delivery of the AIWG Research Framework. It establishes mitigation strategies, assigns ownership, and defines monitoring approaches to proactively manage risks throughout the project lifecycle. ### 1.2 Assessment Scope **Covered**: - Technical risks (API dependencies, LLM accuracy, performance, integration) - Integration risks (compatibility with existing tools, migration complexity) - Resource risks (time constraints, skill gaps, budget limitations) - Quality risks (data accuracy, source reliability, reproducibility) - Adoption risks (learning curve, workflow disruption, user resistance) - Security risks (API key management, data privacy, copyright compliance) - Compliance risks (FAIR principles, standards adherence, licensing) **Not Covered** (Out of Scope): - Organizational change management (solo developer context) - Market risks (open source project, no commercial dependencies) - Competitive risks (niche academic tool space, minimal competition) ### 1.3 Risk Framework **Probability Scale**: 1. **Very Low** (1-10%): Unlikely to occur 2. **Low** (11-30%): May occur but infrequent 3. **Medium** (31-50%): Likely to occur occasionally 4. **High** (51-75%): Likely to occur regularly 5. **Very High** (76-100%): Expected to occur **Impact Scale**: 1. **Very Low**: Minimal effect, easily absorbed 2. **Low**: Minor delays (<1 week), workarounds available 3. **Medium**: Moderate delays (1-4 weeks), scope adjustments needed 4. **High**: Major delays (1-2 months), significant scope reduction required 5. **Critical**: Project failure or abandonment **Risk Score**: Probability (1-5) × Impact (1-5) = Score (1-25) **Priority Thresholds**: - **Critical** (20-25): Immediate action required, daily monitoring - **High** (15-19): Active management, weekly monitoring - **Medium** (10-14): Regular monitoring, contingency planning - **Low** (5-9): Periodic review, accept risk - **Very Low** (1-4): Monitor only ### 1.4 Assessment Context **Project Phase**: Inception (Week 1-2 of 20-week timeline) **Key Constraints**: - Solo developer (limited capacity) - 20-week delivery timeline (5 months part-time) - Zero budget (open source, free APIs only) - External API dependencies (Semantic Scholar, Zotero) - LLM reliance for AI-assisted capabilities - FAIR/PROV/OAIS compliance requirements **Success Dependencies**: - Semantic Scholar API availability and reliability - LLM accuracy for summarization and extraction - User adoption by matric-memory and early researchers - Integration with existing tools (Zotero, Obsidian) - FAIR compliance automation --- ## 2. Risk Identification ### 2.1 Technical Risks #### T-01: LLM Hallucination in Summaries/Extractions **Description**: LLM-generated summaries may include fabricated citations, incorrect interpretations, or hallucinated data extractions, undermining research integrity. **Root Cause**: - LLMs trained on internet data may generate plausible but false information - Lack of ground truth validation during generation - Context window limitations causing incomplete analysis **Potential Impact**: - Researchers cite non-existent papers or incorrect findings - Loss of trust in framework - Scientific misconduct allegations - Time wasted verifying/correcting AI outputs **Category**: Technical - AI/LLM **Probability**: High (51-75%) **Impact**: Critical (5) **Risk Score**: 20 (CRITICAL) --- #### T-02: Semantic Scholar API Rate Limits **Description**: Free tier API rate limits (100 requests/5 minutes) may block discovery operations during intensive search sessions. **Root Cause**: - Free API tier restrictions - No budget for commercial API access - Batch operations (screening 100s of papers) exceed limits **Potential Impact**: - Discovery workflows blocked mid-session - User frustration and workflow interruption - Manual fallback required, negating automation benefits - Extended project timelines **Category**: Technical - External Dependency **Probability**: Medium (31-50%) **Impact**: Medium (3) **Risk Score**: 9 (LOW) --- #### T-03: PDF Extraction Failures **Description**: PDF text extraction may fail for scanned documents, complex layouts, or paywalled papers. **Root Cause**: - OCR required for scanned PDFs (not implemented) - Complex multi-column layouts confuse parsers - Paywalled papers inaccessible or corrupted - Varied PDF encoding standards **Potential Impact**: - Metadata extraction incomplete - Summarization impossible without text - Manual processing required - Reduced automation value **Category**: Technical - Data Processing **Probability**: Medium (31-50%) **Impact**: Medium (3) **Risk Score**: 9 (LOW) --- #### T-04: Knowledge Graph Scalability **Description**: Citation networks and concept graphs become computationally expensive at scale (1,000+ papers). **Root Cause**: - Graph algorithms (centrality, community detection) have O(n²) or worse complexity - Limited optimization in initial implementation - No dedicated graph database (using JSON files) **Potential Impact**: - Slow performance for large corpora - User impatience, tool abandonment - Need for costly refactoring or infrastructure **Category**: Technical - Performance **Probability**: Low (11-30%) **Impact**: Medium (3) **Risk Score**: 6 (LOW) --- #### T-05: FAIR Compliance Validation Overhead **Description**: Automated FAIR compliance checking may be too complex to implement fully, requiring manual validation. **Root Cause**: - F-UJI assessment tool complexity (15 metrics) - Metadata quality varies widely across sources - No standardized FAIR API for validation **Potential Impact**: - Manual FAIR checks slow acquisition workflow - Incomplete FAIR compliance undermines credibility - Users skip FAIR validation, defeating purpose **Category**: Technical - Standards Compliance **Probability**: Medium (31-50%) **Impact**: Low (2) **Risk Score**: 6 (LOW) --- ### 2.2 Integration Risks #### I-01: Zotero Integration Breaking Changes **Description**: Zotero updates may break file import/export or metadata extraction workflows. **Root Cause**: - No control over Zotero development roadmap - BibTeX/RIS format changes - Zotero command-line tool API changes **Potential Impact**: - Reference manager integration fails - Users unable to export to preferred format - Manual workarounds required - Framework value diminished **Category**: Integration - External Tool **Probability**: Low (11-30%) **Impact**: Medium (3) **Risk Score**: 6 (LOW) --- #### I-02: Obsidian Compatibility Issues **Description**: Obsidian plugin updates or markdown rendering changes may break bidirectional linking or graph view. **Root Cause**: - Obsidian proprietary features (backlinks, graph) - Markdown flavor variations - Plugin ecosystem instability **Potential Impact**: - PKM integration fails - Users unable to leverage Obsidian workflows - Alternative tools required - Development effort wasted **Category**: Integration - PKM Tools **Probability**: Low (11-30%) **Impact**: Low (2) **Risk Score**: 4 (VERY LOW) --- #### I-03: research-papers Repo Synchronization Conflicts **Description**: Multiple users/projects updating shared corpus simultaneously may cause merge conflicts or data loss. **Root Cause**: - Git merge conflicts on large JSON/binary files - No real-time collaboration (async workflow) - Inadequate conflict resolution documentation **Potential Impact**: - Data loss or corruption - Manual conflict resolution time-consuming - Users avoid shared corpus, reducing collaboration value **Category**: Integration - Shared Repository **Probability**: Medium (31-50%) **Impact**: Medium (3) **Risk Score**: 9 (LOW) --- #### I-04: Migration from Existing Systems **Description**: Users with existing research in Zotero, Notion, or file systems may struggle to migrate to framework. **Root Cause**: - No automated import for legacy data - Manual migration time-intensive - Different organizational paradigms (tags vs. MoCs) **Potential Impact**: - Adoption barrier for experienced researchers - Framework limited to greenfield projects - Competitive disadvantage vs. established tools **Category**: Integration - Migration **Probability**: High (51-75%) **Impact**: Medium (3) **Risk Score**: 15 (HIGH) --- ### 2.3 Resource Risks #### R-01: Solo Developer Capacity Constraints **Description**: Single developer (part-time, 20-30 hrs/week) may be insufficient for 20-week timeline, especially with unforeseen complexities. **Root Cause**: - No team to parallelize work - Competing priorities (AIWG core, matric projects) - Illness, burnout, or other life events **Potential Impact**: - Missed milestones and timeline slippage - Reduced scope (features cut) - Technical debt from rushed implementation - Project abandonment if sustained issues **Category**: Resource - Personnel **Probability**: High (51-75%) **Impact**: High (4) **Risk Score**: 16 (HIGH) --- #### R-02: Skill Gaps in Standards (FAIR, PROV, OAIS, GRADE) **Description**: Developer may lack deep expertise in academic standards, leading to incorrect implementations. **Root Cause**: - Standards documentation complex and jargon-heavy - Limited practical examples - No expert review available (solo context) **Potential Impact**: - Standards compliance failures - Framework rejected by academic community - Rework required after discovery - Credibility damage **Category**: Resource - Skills **Probability**: Medium (31-50%) **Impact**: High (4) **Risk Score**: 12 (MEDIUM) --- #### R-03: Zero Budget Limits Commercial Tools **Description**: Inability to pay for commercial APIs (Scite, Connected Papers premium) limits feature set. **Root Cause**: - Open source project with no funding - Personal budget constraints - No sponsorship or grants secured **Potential Impact**: - Feature limitations vs. competitors - Manual workarounds reduce automation value - Users choose commercial alternatives **Category**: Resource - Budget **Probability**: Very High (76-100%) **Impact**: Low (2) **Risk Score**: 10 (MEDIUM) --- #### R-04: Time to Learn Complex Technologies **Description**: Learning knowledge graph databases (Neo4j), workflow orchestration (Snakemake), or containerization (Docker) extends timeline. **Root Cause**: - Advanced features require new skills - Limited tutorials/documentation for niche use cases - Trial-and-error learning curve **Potential Impact**: - Phase delays (especially Phase 4: workflows) - Simplified implementations (reduced value) - Technical debt from suboptimal approaches **Category**: Resource - Skills **Probability**: Medium (31-50%) **Impact**: Medium (3) **Risk Score**: 9 (LOW) --- ### 2.4 Quality Risks #### Q-01: Low-Quality Source Data **Description**: Papers with incomplete metadata, incorrect DOIs, or poor quality content degrade framework output quality. **Root Cause**: - Semantic Scholar data quality varies - Open access repositories lack quality control - Preprints not peer-reviewed **Potential Impact**: - GRADE scores unreliable - FAIR compliance validation fails - User trust in quality assessments eroded - Manual curation required **Category**: Quality - Data **Probability**: High (51-75%) **Impact**: High (4) **Risk Score**: 16 (HIGH) --- #### Q-02: Gap Analysis False Positives/Negatives **Description**: Automated gap detection may miss real gaps (false negatives) or flag non-gaps (false positives). **Root Cause**: - Citation network analysis incomplete (paywalled papers) - LLM semantic understanding limitations - Domain-specific terminology misinterpretation **Potential Impact**: - Users miss critical research areas - Users waste time investigating false gaps - Framework credibility damaged - Manual review required, negating automation **Category**: Quality - Analysis Accuracy **Probability**: Medium (31-50%) **Impact**: High (4) **Risk Score**: 12 (MEDIUM) --- #### Q-03: Citation Context Misclassification **Description**: Classification of citations as "supported", "contradicted", or "mentioned" may be inaccurate. **Root Cause**: - LLM semantic analysis errors - Nuanced academic language (qualified support, conditional contradiction) - Context-dependent interpretations **Potential Impact**: - Misleading research synthesis - Scientific errors in downstream work - User distrust of citation networks **Category**: Quality - Analysis Accuracy **Probability**: Medium (31-50%) **Impact**: High (4) **Risk Score**: 12 (MEDIUM) --- #### Q-04: Reproducibility Failures **Description**: Workflows claimed as "reproducible" may fail when executed by independent researchers due to hidden dependencies. **Root Cause**: - Undocumented environment dependencies - External data sources changed/removed - Version drift in tools (LLM APIs, libraries) **Potential Impact**: - Reproducibility claims false, undermining framework purpose - Scientific credibility damaged - Researchers unable to validate findings **Category**: Quality - Reproducibility **Probability**: Low (11-30%) **Impact**: Critical (5) **Risk Score**: 10 (MEDIUM) --- ### 2.5 Adoption Risks #### A-01: Steep Learning Curve **Description**: Framework's complexity (PRISMA, GRADE, Zettelkasten, FAIR) overwhelms casual users. **Root Cause**: - Academic standards unfamiliar to developers - Many new concepts (MoCs, permanent notes, W3C PROV) - Extensive artifact structure **Potential Impact**: - User abandonment after initial trial - Negative word-of-mouth - Limited adoption beyond academic niche - Framework seen as "over-engineered" **Category**: Adoption - User Experience **Probability**: High (51-75%) **Impact**: High (4) **Risk Score**: 16 (HIGH) --- #### A-02: Workflow Disruption for Established Researchers **Description**: Users with entrenched workflows (Zotero + Notion, Mendeley) resist switching to new paradigm. **Root Cause**: - High switching costs (time to migrate, learn) - Existing workflows "good enough" - Lock-in to commercial tools (Mendeley, EndNote) **Potential Impact**: - Adoption limited to greenfield projects - Matric-memory professionalization blocked (entrenched habits) - Framework relegated to niche use cases **Category**: Adoption - Change Resistance **Probability**: High (51-75%) **Impact**: Medium (3) **Risk Score**: 15 (HIGH) --- #### A-03: Too Academic for Developer Audience **Description**: Developers find PRISMA protocols and GRADE scoring excessive for typical software development research needs. **Root Cause**: - Developer research needs simpler (quick API docs, blog posts) - Academic rigor seen as overkill - Framework optimized for dissertation-level work **Potential Impact**: - Developer adoption minimal - Framework seen as academic-only tool - SDLC integration limited **Category**: Adoption - Audience Fit **Probability**: Medium (31-50%) **Impact**: High (4) **Risk Score**: 12 (MEDIUM) --- #### A-04: Requires Too Much Manual Effort **Description**: Despite automation, framework still demands significant user input (screening decisions, quality assessments, note-taking). **Root Cause**: - AI cannot fully replace human judgment (especially for nuanced academic work) - Quality control requires human verification - Zettelkasten method inherently manual (synthesis) **Potential Impact**: - Users perceive framework as time-consuming, not time-saving - Adoption limited to highly motivated researchers - Automation benefits overpromised, underdelivered **Category**: Adoption - Effort **Probability**: Very High (76-100%) **Impact**: High (4) **Risk Score**: 20 (CRITICAL) --- ### 2.6 Security Risks #### S-01: API Key Exposure **Description**: Semantic Scholar API keys or LLM API keys accidentally committed to version control or exposed in logs. **Root Cause**: - Developer error (commit to git) - Logging or error messages include keys - Inadequate .gitignore patterns **Potential Impact**: - API key revocation, service disruption - Unauthorized usage charges (if commercial API) - Security breach reputation damage **Category**: Security - Credentials **Probability**: Low (11-30%) **Impact**: Medium (3) **Risk Score**: 6 (LOW) --- #### S-02: Data Privacy in Shared Corpus **Description**: research-papers repo may inadvertently include proprietary or confidential research data. **Root Cause**: - User error (commit private data) - Unclear guidelines on what's shareable - No automated privacy scanning **Potential Impact**: - Legal liability (data breach) - User distrust - Framework reputation damage - Restricted adoption in sensitive domains **Category**: Security - Privacy **Probability**: Low (11-30%) **Impact**: High (4) **Risk Score**: 8 (LOW) --- #### S-03: Malicious PDFs **Description**: PDFs with embedded malware or exploits may compromise user systems during acquisition. **Root Cause**: - No malware scanning on downloaded PDFs - Open access repositories may host compromised files - PDF parsers vulnerable to exploits **Potential Impact**: - User system compromise - Framework blamed for security incident - Adoption halted in security-conscious environments **Category**: Security - Malware **Probability**: Very Low (1-10%) **Impact**: High (4) **Risk Score**: 4 (VERY LOW) --- ### 2.7 Compliance Risks #### C-01: Copyright Violations in PDF Acquisition **Description**: Automated PDF downloads may violate publisher copyright or API terms of service. **Root Cause**: - Ambiguity in fair use for automated downloading - Paywalled content accessed via Sci-Hub/LibGen (legal gray area) - Bulk download flagged as scraping **Potential Impact**: - Legal cease-and-desist from publishers - Framework labeled as piracy tool - Academic community rejects framework - API access revoked **Category**: Compliance - Copyright **Probability**: Low (11-30%) **Impact**: Critical (5) **Risk Score**: 10 (MEDIUM) --- #### C-02: FAIR Compliance Failures **Description**: Framework claims FAIR compliance but fails automated assessments (F-UJI, FAIR Checker). **Root Cause**: - Misunderstanding of FAIR principles - Implementation shortcuts for speed - Metadata quality issues **Potential Impact**: - Academic credibility loss - Framework rejected by FAIR-compliant repositories - Users unable to publish research using framework **Category**: Compliance - Standards **Probability**: Medium (31-50%) **Impact**: High (4) **Risk Score**: 12 (MEDIUM) --- #### C-03: W3C PROV Incompatibility **Description**: Provenance tracking claims W3C PROV compatibility but doesn't conform to specification. **Root Cause**: - Complex PROV ontology (Activity, Entity, Agent) - Simplified implementation for usability - No validation against PROV-O schemas **Potential Impact**: - Provenance graphs unreadable by PROV tools - Reproducibility claims undermined - Standards body criticism **Category**: Compliance - Standards **Probability**: Low (11-30%) **Impact**: Medium (3) **Risk Score**: 6 (LOW) --- #### C-04: Licensing Conflicts **Description**: Dependencies with incompatible licenses (GPL, proprietary) conflict with MIT license. **Root Cause**: - Inadequate license auditing during dependency selection - Transitive dependencies with restrictive licenses - Commercial tool integration **Potential Impact**: - Legal liability for license violations - Forced relicensing or dependency replacement - Distribution restrictions **Category**: Compliance - Licensing **Probability**: Low (11-30%) **Impact**: Medium (3) **Risk Score**: 6 (LOW) --- ## 3. Risk Register | ID | Risk | Category | Prob | Impact | Score | Priority | Mitigation Owner | Status | |----|------|----------|------|--------|-------|----------|------------------|--------| | T-01 | LLM Hallucination in Summaries | Technical - AI | High | Critical | 20 | CRITICAL | Joseph Magly | Open | | A-04 | Requires Too Much Manual Effort | Adoption - Effort | V.High | High | 20 | CRITICAL | Joseph Magly | Open | | R-01 | Solo Developer Capacity Constraints | Resource - Personnel | High | High | 16 | HIGH | Joseph Magly | Open | | Q-01 | Low-Quality Source Data | Quality - Data | High | High | 16 | HIGH | Joseph Magly | Open | | A-01 | Steep Learning Curve | Adoption - UX | High | High | 16 | HIGH | Joseph Magly | Open | | I-04 | Migration from Existing Systems | Integration - Migration | High | Medium | 15 | HIGH | Joseph Magly | Open | | A-02 | Workflow Disruption | Adoption - Change | High | Medium | 15 | HIGH | Joseph Magly | Open | | R-02 | Skill Gaps in Standards | Resource - Skills | Medium | High | 12 | MEDIUM | Joseph Magly | Open | | Q-02 | Gap Analysis Errors | Quality - Analysis | Medium | High | 12 | MEDIUM | Joseph Magly | Open | | Q-03 | Citation Context Misclassification | Quality - Analysis | Medium | High | 12 | MEDIUM | Joseph Magly | Open | | A-03 | Too Academic for Developers | Adoption - Audience | Medium | High | 12 | MEDIUM | Joseph Magly | Open | | C-02 | FAIR Compliance Failures | Compliance - Standards | Medium | High | 12 | MEDIUM | Joseph Magly | Open | | R-03 | Zero Budget Limits Tools | Resource - Budget | V.High | Low | 10 | MEDIUM | Joseph Magly | Open | | Q-04 | Reproducibility Failures | Quality - Reproducibility | Low | Critical | 10 | MEDIUM | Joseph Magly | Open | | C-01 | Copyright Violations | Compliance - Copyright | Low | Critical | 10 | MEDIUM | Joseph Magly | Open | | T-02 | API Rate Limits | Technical - Dependency | Medium | Medium | 9 | LOW | Joseph Magly | Open | | T-03 | PDF Extraction Failures | Technical - Processing | Medium | Medium | 9 | LOW | Joseph Magly | Open | | I-03 | Repo Sync Conflicts | Integration - Shared Repo | Medium | Medium | 9 | LOW | Joseph Magly | Open | | R-04 | Time to Learn Technologies | Resource - Skills | Medium | Medium | 9 | LOW | Joseph Magly | Open | | S-02 | Data Privacy in Shared Corpus | Security - Privacy | Low | High | 8 | LOW | Joseph Magly | Open | | T-04 | Knowledge Graph Scalability | Technical - Performance | Low | Medium | 6 | LOW | Joseph Magly | Open | | T-05 | FAIR Validation Overhead | Technical - Standards | Medium | Low | 6 | LOW | Joseph Magly | Open | | I-01 | Zotero Breaking Changes | Integration - External | Low | Medium | 6 | LOW | Joseph Magly | Open | | S-01 | API Key Exposure | Security - Credentials | Low | Medium | 6 | LOW | Joseph Magly | Open | | C-03 | W3C PROV Incompatibility | Compliance - Standards | Low | Medium | 6 | LOW | Joseph Magly | Open | | C-04 | Licensing Conflicts | Compliance - Licensing | Low | Medium | 6 | LOW | Joseph Magly | Open | | I-02 | Obsidian Compatibility Issues | Integration - PKM | Low | Low | 4 | VERY LOW | Joseph Magly | Open | | S-03 | Malicious PDFs | Security - Malware | V.Low | High | 4 | VERY LOW | Joseph Magly | Open | **Summary**: - **Critical Priority**: 2 risks - **High Priority**: 5 risks - **Medium Priority**: 8 risks - **Low Priority**: 11 risks - **Very Low Priority**: 2 risks **Total Risks**: 28 --- ## 4. Risk Matrix (Probability vs. Impact) ``` IMPACT │ Very Low │ Low │ Medium │ High │ Critical │ ─────────┼──────────┼─────────┼─────────┼─────────┼──────────┤ Very High│ │ R-03 │ │ A-04 │ │ ─────────┼──────────┼─────────┼─────────┼─────────┼──────────┤ High │ │ │ A-02 │ R-01 │ T-01 │ │ │ │ I-04 │ Q-01 │ │ │ │ │ │ A-01 │ │ ─────────┼──────────┼─────────┼─────────┼─────────┼──────────┤ Medium │ │ T-05 │ T-02 │ R-02 │ │ │ │ │ T-03 │ Q-02 │ │ │ │ │ I-03 │ Q-03 │ │ │ │ │ R-04 │ A-03 │ │ │ │ │ │ C-02 │ │ ─────────┼──────────┼─────────┼─────────┼─────────┼──────────┤ Low │ │ I-02 │ T-04 │ S-02 │ Q-04 │ │ │ │ I-01 │ │ C-01 │ │ │ │ S-01 │ │ │ │ │ │ C-03 │ │ │ │ │ │ C-04 │ │ │ ─────────┼──────────┼─────────┼─────────┼─────────┼──────────┤ Very Low │ │ │ │ S-03 │ │ ─────────┴──────────┴─────────┴─────────┴─────────┴──────────┘ LEGEND: ■ Critical Priority (20-25) │ ■ High Priority (15-19) │ ■ Medium Priority (10-14) ■ Low Priority (5-9) │ ■ Very Low Priority (1-4) ``` --- ## 5. Top 5 Risks (Detailed Analysis) ### 5.1 Risk T-01: LLM Hallucination in Summaries/Extractions (CRITICAL) **Risk Score**: 20 (High Probability × Critical Impact) **Priority**: CRITICAL **Owner**: Joseph Magly **Status**: Open #### Root Cause Analysis **Primary Cause**: LLMs generate text based on statistical patterns, not factual databases, leading to plausible but fabricated content. **Contributing Factors**: - Training data contamination (internet sources with errors) - Lack of source grounding during generation - Context window limitations (entire PDFs don't fit, chunking required) - No built-in citation verification in standard LLM APIs - Pressure to generate coherent text even when uncertain **5 Whys Analysis**: 1. Why do LLMs hallucinate? → They predict tokens based on patterns, not truth. 2. Why don't they verify facts? → No access to external databases during generation. 3. Why not use RAG? → RAG helps but doesn't eliminate hallucination (retrieval can be wrong). 4. Why is impact critical? → Research integrity depends on accurate citations; one false citation undermines credibility. 5. Why can't we eliminate risk? → Current LLM technology fundamentally probabilistic, not deterministic. #### Impact Assessment **Direct Impacts**: - Users cite non-existent papers in publications → Scientific misconduct allegations - Users trust false data extractions → Incorrect research conclusions - Framework gains reputation as unreliable → Adoption collapse **Indirect Impacts**: - Time wasted verifying every AI output → Automation benefit negated - Academic community rejects AI-assisted research tools → Movement backlash - Solo developer burnout from constant bug reports → Project abandonment **Stakeholder Impact**: - **Researchers**: Career damage from citing false sources, institutional sanctions - **AIWG Project**: Credibility destroyed, framework labeled as "dangerous" - **Academic Community**: Erosion of trust in AI tools, regulatory overreach **Quantitative Estimates**: - 60% chance of hallucination in unconstrained LLM summarization (per research on GPT-3.5) - 1 hallucinated citation per 10 papers = 100 false citations in 1,000-paper corpus - 50+ hours required to manually verify and correct (5 mins/paper × 1,000) #### Mitigation Strategies **Primary Mitigation: RAG-Based Summarization (LitLLM Pattern)** - **Action**: Use Retrieval-Augmented Generation, not free-form generation - **How**: Chunk PDF, retrieve relevant chunks, ground generation in extracted text - **Benefit**: 90%+ reduction in hallucination (per LitLLM paper) - **Owner**: Joseph Magly - **Timeline**: Phase 3 (Weeks 9-11) - **Cost**: ~10 hours implementation **Secondary Mitigation: Citation Verification** - **Action**: Cross-reference all LLM-generated citations against DOI database (CrossRef API) - **How**: Extract citations from summaries, validate DOI exists, flag mismatches - **Benefit**: 100% detection of fabricated DOIs - **Owner**: Joseph Magly - **Timeline**: Phase 3 (Weeks 9-11) - **Cost**: ~5 hours implementation **Tertiary Mitigation: Human-in-the-Loop Verification** - **Action**: Require user approval for all AI summaries before incorporation - **How**: Display "AI-generated, not verified" warning, require explicit review step - **Benefit**: Users aware of risk, can spot obvious errors - **Owner**: Joseph Magly (design), Users (execution) - **Timeline**: Phase 2 (Weeks 6-8) - **Cost**: ~2 hours UI implementation **Monitoring Mitigation: Hallucination Reporting** - **Action**: Add "Report Hallucination" button to all AI summaries - **How**: User submits report, logged for analysis, trends monitored - **Benefit**: Early detection of systematic issues, user feedback loop - **Owner**: Joseph Magly - **Timeline**: Phase 3 (Weeks 9-11) - **Cost**: ~3 hours implementation #### Contingency Plans **If Mitigation Fails (Hallucination Rate >10% Despite RAG)**: 1. **Fallback 1**: Disable AI summarization, require manual summaries - Impact: Major feature loss, workflow slows dramatically - Trigger: >10 hallucination reports in 100 papers 2. **Fallback 2**: Limit AI to extraction only (structured data), not prose summaries - Impact: Reduced value, users must write own summaries - Trigger: Critical hallucination causing user harm 3. **Fallback 3**: Use commercial API with better grounding (OpenAI with web search) - Impact: Requires budget ($50-100/month), may not be feasible - Trigger: Persistent issues, budget becomes available **Emergency Response**: - Immediate: Add "HIGH RISK OF HALLUCINATION" warning to all AI summaries - Short-term: Halt Phase 3 development, focus on verification improvements - Long-term: Partner with LitLLM project for expertise, contribute to open source solutions #### Early Warning Indicators **Monitor These Metrics**: 1. **Hallucination Reports**: >5 reports/month → escalate to critical 2. **Citation Verification Failure Rate**: >5% DOI mismatches → investigate LLM prompts 3. **User Verification Time**: >10 mins/summary → AI not saving time, reconsider 4. **User Trust Survey**: <70% trust in AI summaries → credibility crisis **Review Cadence**: - Weekly: Check hallucination reports and metrics - Monthly: User survey on AI summary trust - Phase milestones: Formal hallucination audit (random sample of 50 summaries) #### Current Status **Status**: Open (Risk Identified, Mitigation Planned) **Next Actions**: 1. Research LitLLM implementation patterns (Week 2-3) 2. Prototype RAG-based summarization (Week 9-10) 3. Implement citation verification (Week 10-11) 4. User testing with hallucination monitoring (Week 12) **Dependencies**: - LitLLM library availability and documentation - CrossRef API for DOI validation - User willingness to perform verification step --- ### 5.2 Risk A-04: Requires Too Much Manual Effort (CRITICAL) **Risk Score**: 20 (Very High Probability × High Impact) **Priority**: CRITICAL **Owner**: Joseph Magly **Status**: Open #### Root Cause Analysis **Primary Cause**: High-quality research inherently requires human judgment; AI can assist but not replace critical thinking. **Contributing Factors**: - PRISMA screening requires reading abstracts (AI can pre-filter, not decide) - GRADE quality assessment nuanced (study design, bias, consistency) - Zettelkasten synthesis manual by design (emergence from linking) - FAIR compliance needs human-readable metadata - Gap analysis requires domain expertise (AI can suggest, not confirm) **5 Whys Analysis**: 1. Why requires manual effort? → Research quality demands human expertise. 2. Why can't AI do it all? → AI lacks domain knowledge, critical thinking, contextual judgment. 3. Why not train better AI? → Domain-specific fine-tuning expensive, data scarce, generalization limited. 4. Why is effort high despite automation? → Automation handles volume (find 1,000 papers), not complexity (assess 1,000 papers). 5. Why does this risk adoption? → Users expect "AI magic button", reality disappoints → abandonment. #### Impact Assessment **Direct Impacts**: - Users perceive framework as time-consuming, not time-saving - Adoption limited to highly motivated researchers (dissertation students, grant-funded) - Casual users abandon after initial trial - Framework seen as "academic bloatware" **Indirect Impacts**: - Word-of-mouth negative ("tried it, too much work") - Developer audience rejects framework (wants quick docs, not PRISMA) - matric-memory professionalization stalls (team lacks time/motivation) - Solo developer morale drops (users don't see value despite effort) **Stakeholder Impact**: - **Researchers**: Frustrated by unmet automation expectations - **Developers**: See framework as unsuitable for "real world" software work - **AIWG Project**: Framework usage minimal, development effort wasted **Quantitative Estimates**: - PRISMA screening: 60% reduction (per LLM research) = 40% still manual - Example: 500 papers screened → 200 manual reviews at 3 mins each = 10 hours - Zettelkasten notes: ~15 mins per literature note (AI summary + manual synthesis) - Example: 100 papers → 25 hours note-taking - GRADE assessment: ~10 mins per paper (AI extracts data, human judges quality) - Example: 50 papers → 8 hours quality scoring - **Total**: ~43 hours manual work for 100-paper review (vs. ~100 hours fully manual = 57% reduction, but still significant) #### Mitigation Strategies **Primary Mitigation: Tiered Workflow Complexity** - **Action**: Offer "Quick", "Standard", and "Rigorous" workflow presets - Quick: Skip PRISMA, basic quality (1-5 scale), minimal notes (AI summaries only) - Standard: PRISMA lite (AI pre-screening), GRADE essentials, literature notes - Rigorous: Full PRISMA, comprehensive GRADE, Zettelkasten synthesis - **Benefit**: Users choose effort level matching needs (dev quick start vs. dissertation rigor) - **Owner**: Joseph Magly - **Timeline**: Phase 2 (Weeks 6-8) - **Cost**: ~15 hours workflow design + templates **Secondary Mitigation: Progressive Disclosure** - **Action**: Start users with minimal workflow, unlock advanced features as needed - Onboarding: Simple discovery + acquisition only - Week 2: Introduce AI summaries - Week 3: Introduce quality scoring - Month 2: Introduce Zettelkasten, knowledge graphs - **Benefit**: Users not overwhelmed, see value before committing effort - **Owner**: Joseph Magly - **Timeline**: Phase 6 (Weeks 18-20, documentation) - **Cost**: ~8 hours tutorial design **Tertiary Mitigation: Maximum Viable Automation** - **Action**: Automate every possible step, even if imperfect - AI pre-screening: 80% recall threshold (filter out obvious non-matches) - AI quality scoring: Generate draft, user confirms/adjusts (not create from scratch) - AI gap detection: Suggest gaps, user validates (not manual brainstorming) - Auto-generate MoCs from permanent note tags (not manual organization) - **Benefit**: Users review/refine AI output vs. start from blank page (5x faster) - **Owner**: Joseph Magly - **Timeline**: Phases 2-4 (Weeks 6-17) - **Cost**: ~30 hours across all automation features **Monitoring Mitigation: Effort Tracking & Benchmarking** - **Action**: Log time spent on each workflow step, compare to manual baseline - Show users: "You've saved 15 hours vs. manual research" dashboard - Benchmark: Track median time per task, identify bottlenecks - **Benefit**: Users see tangible value, developer identifies high-effort areas for improvement - **Owner**: Joseph Magly - **Timeline**: Phase 4 (Weeks 12-14) - **Cost**: ~5 hours analytics implementation #### Contingency Plans **If Mitigation Fails (Users Still Report "Too Much Effort")**: 1. **Fallback 1**: Simplify to "Developer Mode" (discovery + summaries only, skip PRISMA/GRADE/Zettelkasten) - Impact: Framework becomes glorified bookmark manager, unique value lost - Trigger: >50% user feedback mentions excessive effort 2. **Fallback 2**: Focus on narrow use case (matric-memory professionalization only) - Impact: Broader adoption abandoned, framework becomes internal tool - Trigger: Early adopter feedback universally negative 3. **Fallback 3**: Pivot to "Research Assistant" vs. "Research Framework" - Impact: Lower user expectations (assistant suggests, user decides), reframe value prop - Trigger: Impossible to reduce effort further without compromising quality **Emergency Response**: - Immediate: Survey first 10 users for specific effort pain points - Short-term: Prioritize automation for top 3 bottlenecks - Long-term: Consider hiring/contracting UX specialist for workflow optimization #### Early Warning Indicators **Monitor These Metrics**: 1. **User Onboarding Completion Rate**: <50% complete first workflow → too complex 2. **Average Time per Workflow Step**: >2x estimated time → friction, unclear instructions 3. **Feature Usage**: <30% use Zettelkasten or GRADE → perceived as optional bloat 4. **User Retention**: <40% return after week 1 → initial effort discourages **Review Cadence**: - Weekly: Check onboarding completion and feature usage (Phase 6: user testing) - Bi-weekly: User interviews (5 users) on effort pain points - Monthly: Compare actual vs. estimated workflow times #### Current Status **Status**: Open (Risk Identified, Mitigation Planned) **Next Actions**: 1. Design tiered workflow presets (Week 2) 2. Identify maximum automation opportunities (Week 3-4) 3. Build effort tracking analytics (Phase 4, Week 12-14) 4. User testing with effort focus (Phase 6, Week 18-19) **Dependencies**: - User research on acceptable effort levels (survey/interviews) - Template library for quick-start workflows - AI automation reliability (avoid creating more work debugging AI errors) --- ### 5.3 Risk R-01: Solo Developer Capacity Constraints (HIGH) **Risk Score**: 16 (High Probability × High Impact) **Priority**: HIGH **Owner**: Joseph Magly **Status**: Open #### Root Cause Analysis **Primary Cause**: Single developer working part-time (20-30 hrs/week) with competing priorities (AIWG core, matric projects, life). **Contributing Factors**: - No team to parallelize development (sequential work only) - Unforeseen technical complexities (standards, integrations) - Burnout risk from sustained high effort - Illness, family emergencies, or other life events - Scope creep temptation (add "just one more" feature) **5 Whys Analysis**: 1. Why is capacity limited? → Solo developer, part-time availability. 2. Why not hire/recruit? → Zero budget, open source project, difficult to attract volunteers. 3. Why not reduce scope? → Already focused on MVP, further cuts risk unviable product. 4. Why is timeline aggressive? → matric-memory professionalization blocked, pressure to deliver. 5. Why is risk high? → 20 weeks ambitious for scope, any disruption causes slippage. #### Impact Assessment **Direct Impacts**: - Missed milestones (Phase 1-2 delays cascade to later phases) - Reduced scope (cut features to meet timeline, deliver less value) - Technical debt (rushed implementation, inadequate testing) - Project abandonment (sustained issues lead to burnout, giving up) **Indirect Impacts**: - matric-memory professionalization delayed indefinitely - AIWG ecosystem stalls (no research foundation for other projects) - Community trust eroded ("vaporware" reputation) - Solo developer morale collapse, affects other projects **Stakeholder Impact**: - **matric-memory**: Professionalization blocked, continues informal documentation - **Early Adopters**: Unmet expectations, negative word-of-mouth - **AIWG Project**: Framework portfolio incomplete, strategic vision unmet **Quantitative Estimates**: - 20-week timeline @ 25 hrs/week = 500 hours budgeted - 10% sick/vacation time = 450 hours available - Unforeseen issues (20% contingency) = 360 effective hours - Current scope estimate: ~400 hours (based on similar projects) - **Gap**: 40 hours over capacity → ~2 weeks slippage risk #### Mitigation Strategies **Primary Mitigation: Ruthless Scope Prioritization (MoSCoW)** - **Action**: Categorize all features as Must-Have, Should-Have, Could-Have, Won't-Have - **Must-Have** (MVP): Discovery + Acquisition agents, basic FAIR validation, simple quality scoring - **Should-Have** (v1.0): Documentation agent (LLM summaries), Citation agent, Zettelkasten templates - **Could-Have** (v1.1+): Knowledge graphs, OAIS compliance, W3C PROV, workflow orchestration - **Won't-Have** (v1): Real-time collab, advanced ML, visual editors - **Benefit**: Clear prioritization, can cut Could-Have features if timeline pressure - **Owner**: Joseph Magly - **Timeline**: Week 1 (Inception phase, now) - **Cost**: ~5 hours prioritization + stakeholder alignment **Secondary Mitigation: Timeboxed Iterations (2-Week Sprints)** - **Action**: Break 20-week timeline into 10 sprints, deliver working increment each sprint - Sprint review: Assess velocity (hours spent vs. work completed) - Adjust scope: If behind, move Should-Have → Could-Have - Celebrate wins: Maintain morale with tangible progress - **Benefit**: Early detection of slippage, adaptive scope management, sustained momentum - **Owner**: Joseph Magly - **Timeline**: Weeks 3-20 (all development phases) - **Cost**: ~2 hours/sprint planning + review = 20 hours total **Tertiary Mitigation: Community Contribution Opportunities** - **Action**: Identify "good first issue" tasks suitable for volunteers - Documentation: User guides, tutorials, examples - Templates: Additional search strategies, quality criteria, note templates - Integrations: Obsidian plugins, Zotero extensions (if volunteers emerge) - **Benefit**: Offload non-critical work, build community, sustain long-term maintenance - **Owner**: Joseph Magly (identification), Community (execution) - **Timeline**: Phase 6 (Weeks 18-20, post-MVP) - **Cost**: ~10 hours issue creation + mentorship **Monitoring Mitigation: Burn-Down Tracking** - **Action**: Track remaining work (story points or hours) vs. time remaining - Weekly: Update burn-down chart, identify trend (on track, behind, ahead) - Triggers: >10% behind → escalate, adjust scope - **Benefit**: Quantitative visibility into timeline risk, data-driven decisions - **Owner**: Joseph Magly - **Timeline**: Weeks 3-20 (all development phases) - **Cost**: ~1 hour/week tracking = 18 hours total #### Contingency Plans **If Timeline Slippage Occurs (>2 Weeks Behind)**: 1. **Fallback 1**: Extend timeline by 1 iteration (2 weeks) - Impact: matric-memory professionalization delayed, but quality maintained - Trigger: 50% into project, 2+ weeks behind 2. **Fallback 2**: Cut Should-Have features, deliver MVP only - Impact: Reduced v1.0 value, knowledge graphs and Zettelkasten deferred to v1.1 - Trigger: 75% into project, 4+ weeks behind, extension not viable 3. **Fallback 3**: Deliver in phases (MVP v0.5, full v1.0 later) - Impact: Two release cycles, adoption split, but avoids total delay - Trigger: Critical personal issue (illness, emergency), cannot sustain pace **Emergency Response**: - Immediate: Communicate slippage to stakeholders (matric-memory, early adopters) - Short-term: Reassess scope, identify cuts (prioritize matric-memory use case) - Long-term: Recruit maintainer for post-v1.0 (offload to community) #### Early Warning Indicators **Monitor These Metrics**: 1. **Sprint Velocity**: <80% of planned work completed → capacity issue 2. **Burn-Down Trend**: Behind schedule for 2 consecutive sprints → escalate 3. **Unplanned Work**: >20% time on bugs/issues vs. features → scope creep 4. **Developer Well-Being**: Self-reported stress >7/10, sleep <6 hrs → burnout risk **Review Cadence**: - Weekly: Sprint review, burn-down update - Bi-weekly: Velocity trend analysis - Monthly: Personal well-being check, scope/timeline adjustment if needed #### Current Status **Status**: Open (Risk Identified, Mitigation Planned) **Next Actions**: 1. Complete MoSCoW prioritization (Week 1, this week) 2. Set up sprint structure and burn-down tracking (Week 2) 3. Monitor velocity in first 2 sprints (Weeks 3-6), adjust if needed 4. Reassess timeline at 50% mark (Week 10) **Dependencies**: - Stakeholder buy-in on prioritization (matric-memory needs aligned with Must-Have scope) - Discipline to avoid scope creep ("just one more feature" temptation) - Personal health and life stability (unpredictable) --- ### 5.4 Risk Q-01: Low-Quality Source Data (HIGH) **Risk Score**: 16 (High Probability × High Impact) **Priority**: HIGH **Owner**: Joseph Magly **Status**: Open #### Root Cause Analysis **Primary Cause**: Research paper quality varies widely; open access repositories and preprint servers lack rigorous quality control. **Contributing Factors**: - Semantic Scholar aggregates from diverse sources (high-quality journals + low-quality preprints) - Preprints (arXiv, bioRxiv) not peer-reviewed, may contain errors or retracted later - Metadata quality varies (missing authors, incorrect DOIs, incomplete abstracts) - Predatory publishers in open access space (pay-to-publish, no peer review) - Retracted papers still indexed, not flagged consistently **5 Whys Analysis**: 1. Why is source data low quality? → Open access prioritizes accessibility over curation. 2. Why accept low-quality sources? → Framework designed for comprehensive discovery (can't pre-filter too aggressively). 3. Why not use only high-quality journals? → Paywalls exclude most researchers, defeats FAIR principles. 4. Why is impact high? → GRADE scores unreliable → Users trust low-quality sources → Scientific errors. 5. Why can't we validate all sources? → Manual validation doesn't scale (100s of papers), automated metrics imperfect. #### Impact Assessment **Direct Impacts**: - GRADE quality scores unreliable (garbage in, garbage out) - Users cite low-quality or retracted papers in publications - FAIR compliance validation fails (incomplete metadata) - Framework credibility damaged ("helps you cite junk science") **Indirect Impacts**: - Academic community rejects framework (seen as enabling poor scholarship) - Users must manually curate sources (negates automation value) - Knowledge graphs polluted with low-quality nodes - Reproducibility compromised (sources change, retract, or disappear) **Stakeholder Impact**: - **Researchers**: Career damage from citing retracted papers, institutional sanctions - **Academic Community**: Erosion of research quality standards - **AIWG Project**: Framework labeled as "dangerous" or "irresponsible" **Quantitative Estimates**: - Preprint error rate: ~5-10% contain significant errors or are retracted (per research on bioRxiv) - Metadata completeness: ~20% of open access papers missing critical metadata (authors, DOIs, abstracts) - Predatory publishers: ~10,000 active predatory journals (per Beall's List estimates) - **Example**: 1,000-paper corpus → 50-100 low-quality sources, 200 with incomplete metadata #### Mitigation Strategies **Primary Mitigation: Multi-Dimensional Quality Scoring (GRADE-Inspired)** - **Action**: Assess quality across multiple dimensions, not single metric - **Publication Venue**: Journal impact factor, predatory publisher check (Beall's List) - **Peer Review Status**: Preprint vs. peer-reviewed, retraction check (Retraction Watch) - **Citation Count**: Highly cited = community validation (with recency adjustment) - **Metadata Completeness**: FAIR compliance score (F-UJI pattern) - **Study Design**: RCT > Observational > Case Study (for empirical research) - **Benefit**: Nuanced quality assessment, users see red flags, informed decisions - **Owner**: Joseph Magly - **Timeline**: Phase 3 (Weeks 9-11) - **Cost**: ~20 hours quality scoring implementation **Secondary Mitigation: Automated Retraction Checking** - **Action**: Cross-reference all DOIs against Retraction Watch database - Flagged: "RETRACTED" badge on source, warning in summaries - Periodic re-check: Monthly scan for new retractions - **Benefit**: 100% detection of known retractions, prevent citing retracted work - **Owner**: Joseph Magly - **Timeline**: Phase 2 (Weeks 6-8) - **Cost**: ~5 hours integration with Retraction Watch API **Tertiary Mitigation: User Education on Quality Assessment** - **Action**: Provide guidelines and training on evaluating source quality - Tutorial: "How to Read a GRADE Score", "Red Flags in Research Papers" - Templates: Quality criteria checklists, bias assessment guides - Examples: Side-by-side comparison of high vs. low quality papers - **Benefit**: Users develop critical evaluation skills, don't blindly trust framework scores - **Owner**: Joseph Magly - **Timeline**: Phase 6 (Weeks 18-20, documentation) - **Cost**: ~8 hours tutorial creation **Monitoring Mitigation: Quality Score Distribution Tracking** - **Action**: Track distribution of quality scores across corpus - Alert: >30% low-quality sources → investigate search strategies (too broad?) - Benchmark: Compare to expert-curated corpora (validation) - **Benefit**: Early detection of quality drift, data-driven search refinement - **Owner**: Joseph Magly - **Timeline**: Phase 3 (Weeks 9-11) - **Cost**: ~3 hours analytics implementation #### Contingency Plans **If Mitigation Fails (Quality Scores Unreliable or Too Many Low-Quality Sources)**: 1. **Fallback 1**: Manual quality review required for all sources (human-in-the-loop) - Impact: Major workflow slowdown, automation value diminished - Trigger: >20% false positive/negative quality scores (user feedback) 2.