aiwg
Version:
Deployment tool and support utility for AI context. Copies agents, skills, commands, rules, and behaviors into the paths each AI platform reads (Claude Code, Codex, Copilot, Cursor, Warp, OpenClaw, and 6 more) so one source of truth works across 10 platfo
1,231 lines (915 loc) • 76.9 kB
Markdown
# Test Strategy: AIWG Research Framework
**Project**: AIWG Research Framework
**Framework ID**: research-complete
**Version**: 1.0.0
**Document Date**: 2026-01-25
**Status**: Draft
**Owner**: Test Architect
**Contributors**: Test Engineer, Quality Assurance Specialist, Security Auditor
## References
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/vision-document.md - Vision and success metrics
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/inception/initial-risk-assessment.md - Risk profile and mitigation priorities
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-001-discover-research-papers.md - Use case #1 with acceptance criteria
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-002-acquire-research-source.md - Use case #2
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-003-document-research-paper.md - Use case #3
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-004-integrate-citations.md - Use case #4
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-005-track-provenance.md - Use case #5
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-006-assess-source-quality.md - Use case #6
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-007-archive-research-artifacts.md - Use case #7
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-008-execute-research-workflow.md - Use case #8
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-009-perform-gap-analysis.md - Use case #9
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/use-cases/UC-RF-010-export-research-artifacts.md - Use case #10
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/agents/discovery-agent-spec.md - Discovery Agent specification
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/agents/acquisition-agent-spec.md - Acquisition Agent specification
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/agents/documentation-agent-spec.md - Documentation Agent specification
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/elaboration/agents/citation-agent-spec.md - Citation Agent specification
---
## Executive Summary
This Test Strategy defines the quality assurance approach for the AIWG Research Framework, a comprehensive system enabling research-backed software development through automated discovery, acquisition, documentation, and quality assessment of academic literature. The strategy addresses high-risk areas including LLM hallucination (T-01), data quality issues (Q-01), and API dependencies (T-02), while ensuring 100% coverage of all use cases and compliance with FAIR, PROV, and OAIS standards.
**Key Commitments**:
- **Minimum Coverage**: 90% code coverage (80% line, 75% branch)
- **Use Case Coverage**: 100% (all 10 use cases validated)
- **NFR Coverage**: 100% (all 45 NFRs measurable and tested)
- **Blocking Quality Gates**: Tests MUST pass before PR merge and release
- **Risk-Based Testing**: Priority testing for Critical (T-01, A-04) and High (R-01, Q-01, A-01) risks
**Quality Philosophy**: Testing is a blocking gate, not an afterthought. Coverage targets are minimum thresholds, not aspirational goals.
---
## 1. Test Strategy Overview
### 1.1 Objectives
**Primary Objective**: Ensure AIWG Research Framework delivers reliable, accurate, and reproducible research workflows meeting academic quality standards (PRISMA, GRADE, FAIR, PROV, OAIS).
**Specific Objectives**:
1. **Functional Correctness**
- Validate all 10 use cases with acceptance criteria
- Verify agent capabilities (8 agents: Discovery, Acquisition, Documentation, Citation, Quality, Provenance, Archive, Workflow)
- Test API integrations (Semantic Scholar, Zotero, CrossRef, Retraction Watch)
2. **Risk Mitigation**
- **Critical Risks**: LLM hallucination <5% (T-01), manual effort acceptable (A-04)
- **High Risks**: Quality scoring >90% accuracy (Q-01), onboarding <5 hours (A-01)
- **API Reliability**: Rate limit compliance 100% (T-02), graceful failure handling
3. **Performance & Scalability**
- Search completion <10 seconds (NFR-RF-D-01)
- Gap analysis <30 seconds for 100 papers (NFR-RF-D-02)
- Document generation <60 seconds per paper (NFR-RF-Doc-01)
- Support 1,000+ paper corpora without degradation
4. **Compliance Validation**
- FAIR compliance 100% (automated F-UJI validation)
- PRISMA protocol completeness 100%
- W3C PROV compatibility (automated schema validation)
- OAIS archival package conformance
5. **Security & Privacy**
- API key protection (no hardcoding, env vars only)
- Input sanitization (prevent injection attacks)
- Copyright compliance (respect publisher terms)
- Data privacy in shared corpora
### 1.2 Scope
**In Scope**:
| Component | Coverage | Testing Focus |
|-----------|----------|---------------|
| **Discovery Agent** | 100% use cases | Semantic search, gap analysis, citation chaining |
| **Acquisition Agent** | 100% use cases | PDF download, metadata extraction, FAIR validation |
| **Documentation Agent** | 100% use cases | LLM summarization, hallucination detection, citation verification |
| **Citation Agent** | 100% use cases | BibTeX generation, in-text citation, context classification |
| **Quality Agent** | 100% use cases | GRADE scoring, retraction checking, quality metrics |
| **Provenance Agent** | 100% use cases | W3C PROV graph generation, lineage tracking |
| **Archive Agent** | 100% use cases | OAIS SIP/AIP/DIP packaging, long-term preservation |
| **Workflow Agent** | 100% use cases | PRISMA protocol execution, reproducibility |
| **API Integrations** | All endpoints | Semantic Scholar, Zotero, CrossRef, Retraction Watch |
| **CLI Commands** | All commands | `aiwg research search`, `select`, `acquire`, `document`, etc. |
| **NFRs** | 45 NFRs | Performance, security, usability, compliance |
| **Data Formats** | All schemas | JSON, BibTeX, RIS, PROV-O, OAIS METS |
**Out of Scope** (Deferred to Post-v1.0):
- Multi-user collaboration (single-user MVP)
- Real-time synchronization (async workflow)
- Advanced ML models (knowledge graph embeddings, semantic clustering)
- Non-English language support (English-only v1.0)
- Visual editors (CLI-only v1.0)
### 1.3 Approach
**Testing Philosophy**: Shift-Left Quality Assurance
1. **Test-Driven Development (TDD)**: Write tests before implementation
2. **Continuous Testing**: Tests run on every commit (CI/CD integration)
3. **Risk-Based Prioritization**: Critical risks tested first and most thoroughly
4. **Automation First**: 95%+ test automation target (minimize manual testing)
5. **Progressive Assurance**: Test at unit → integration → system → acceptance levels
**Test Pyramid Strategy**:
```
/\
/ \ E2E Tests (10%)
/ \ - 10 use case scenarios
/------\ - Critical workflows
/ \
/ Integration \ (30%)
/ Tests \
/------------------\ - Agent interactions
/ \ - API integrations
/ Unit Tests (60%) \ - Component logic
/------------------------\ - Utility functions
```
**Coverage Targets by Level**:
| Test Level | Volume | Coverage Target | Automation |
|------------|--------|-----------------|------------|
| Unit Tests | 60% of test effort | 90% code coverage (80% line, 75% branch) | 100% automated |
| Integration Tests | 30% of test effort | 100% API endpoints, 100% agent interactions | 100% automated |
| System Tests (E2E) | 10% of test effort | 100% use cases, 100% critical paths | 100% automated |
| Acceptance Tests | Manual validation | 100% NFRs, user satisfaction | 50% automated |
**Testing Cadence**:
- **Pre-commit**: Unit tests (fast suite <30s), linting, type checking
- **PR Merge**: Full test suite (unit + integration + E2E), coverage check
- **Nightly**: Extended tests (performance, security scans, mutation testing)
- **Release**: Acceptance tests, manual exploratory testing, UAT sign-off
---
## 2. Test Levels
### 2.1 Unit Testing
**Objective**: Validate individual components (agents, services, utilities) in isolation.
**Scope**:
| Component | Unit Test Focus | Test Count (Estimate) |
|-----------|----------------|----------------------|
| Discovery Agent | Query construction, result ranking, gap detection algorithm | 25 tests |
| Acquisition Agent | PDF extraction, metadata parsing, FAIR validation | 20 tests |
| Documentation Agent | RAG-based summarization, citation verification, hallucination detection | 30 tests (critical risk T-01) |
| Citation Agent | BibTeX generation, in-text citation formatting, context classification | 15 tests |
| Quality Agent | GRADE scoring logic, retraction checking, quality aggregation | 20 tests (critical risk Q-01) |
| Provenance Agent | W3C PROV graph construction, lineage tracking, schema validation | 15 tests |
| Archive Agent | OAIS package creation, METS XML generation, checksum validation | 15 tests |
| Workflow Agent | PRISMA protocol execution, screening workflow, reproducibility | 20 tests |
| Utility Functions | File I/O, JSON parsing, date formatting, string sanitization | 30 tests |
| **Total** | | **190 unit tests** |
**Coverage Requirements**:
- **Minimum**: 80% line coverage, 75% branch coverage (enforced by CI)
- **Target**: 90% line coverage, 85% branch coverage
- **Critical Components**: 100% coverage for hallucination detection, citation verification, quality scoring
**Tools**:
- **Test Framework**: Jest (Node.js/TypeScript) or Vitest (faster alternative)
- **Mocking**: Jest mocks for API calls, file system operations
- **Coverage**: nyc (Istanbul) or Vitest coverage
- **Assertions**: expect() API, custom matchers for FAIR/PROV validation
**Example: Documentation Agent Hallucination Detection**
```typescript
describe('DocumentationAgent - Hallucination Detection', () => {
it('should detect fabricated citations in LLM summary', async () => {
const mockSummary = 'Smith et al. (2025) found that...' // 2025 paper doesn't exist
const paperMetadata = { year: 2023, authors: ['Doe, J.'], doi: '10.1234/real' }
const agent = new DocumentationAgent()
const result = await agent.detectHallucination(mockSummary, paperMetadata)
expect(result.hallucinated).toBe(true)
expect(result.issues).toContain('Citation year mismatch: 2025 vs 2023')
expect(result.issues).toContain('Author mismatch: Smith not in paper')
})
it('should verify DOI existence via CrossRef API', async () => {
const fakeDOI = '10.9999/fake'
const agent = new DocumentationAgent()
const result = await agent.verifyDOI(fakeDOI)
expect(result.valid).toBe(false)
expect(result.source).toBe('CrossRef API')
})
it('should pass clean summary without hallucinations', async () => {
const mockSummary = 'Doe et al. (2023) demonstrated...'
const paperMetadata = { year: 2023, authors: ['Doe, J.'], doi: '10.1234/real' }
const agent = new DocumentationAgent()
const result = await agent.detectHallucination(mockSummary, paperMetadata)
expect(result.hallucinated).toBe(false)
expect(result.issues).toHaveLength(0)
})
})
```
**Blocking Conditions**:
- **PR Merge**: Unit tests MUST pass, coverage CANNOT decrease below 80% line
- **Release**: Unit test suite MUST pass 100%, no regressions
### 2.2 Integration Testing
**Objective**: Validate interactions between agents, API integrations, and data flows.
**Scope**:
| Integration | Test Focus | Test Count (Estimate) |
|-------------|-----------|----------------------|
| Discovery → Acquisition | Search results → acquisition queue → PDF download | 5 tests |
| Acquisition → Documentation | PDF extraction → LLM summarization pipeline | 5 tests |
| Documentation → Citation | Summary generation → BibTeX creation workflow | 5 tests |
| Quality → Provenance | GRADE scoring → W3C PROV tracking | 5 tests |
| Workflow → All Agents | PRISMA protocol execution end-to-end | 5 tests |
| Semantic Scholar API | Search, pagination, rate limiting, error handling | 10 tests |
| Zotero Integration | Export to BibTeX, RIS, import from Zotero library | 5 tests |
| CrossRef API | DOI validation, metadata enrichment | 5 tests |
| Retraction Watch API | Retraction checking, flagging retracted papers | 5 tests |
| File System | `.aiwg/research/` structure, artifact persistence | 5 tests |
| **Total** | | **55 integration tests** |
**Coverage Requirements**:
- **API Endpoints**: 100% coverage (all endpoints tested)
- **Agent Handoffs**: 100% coverage (all agent-to-agent transitions validated)
- **Data Persistence**: 100% coverage (all JSON/BibTeX/RIS file formats validated)
**Tools**:
- **API Mocking**: nock (HTTP mocking), MSW (Mock Service Worker)
- **Contract Testing**: Pact (verify API contracts don't break)
- **Test Containers**: Docker containers for Zotero, Neo4j (if needed)
- **Fixtures**: Sample PDFs, JSON responses, BibTeX files
**Example: Semantic Scholar API Rate Limit Handling**
```typescript
describe('Semantic Scholar API - Rate Limit', () => {
it('should retry after 60s when rate limited', async () => {
const scope = nock('https://api.semanticscholar.org')
.get('/graph/v1/paper/search')
.query({ query: 'machine learning', limit: 100 })
.reply(429, { error: 'Rate limit exceeded' }) // First attempt: rate limited
.get('/graph/v1/paper/search')
.query({ query: 'machine learning', limit: 100 })
.delay(60000) // Simulate 60s wait
.reply(200, { data: [/* mock papers */] }) // Second attempt: success
const agent = new DiscoveryAgent()
const result = await agent.search('machine learning')
expect(result.papers).toHaveLength(100)
expect(scope.isDone()).toBe(true) // Both requests made
})
it('should fail gracefully after 3 retries', async () => {
const scope = nock('https://api.semanticscholar.org')
.get('/graph/v1/paper/search')
.times(3) // 3 retry attempts
.reply(429, { error: 'Rate limit exceeded' })
const agent = new DiscoveryAgent()
await expect(agent.search('machine learning')).rejects.toThrow('Rate limit exceeded after 3 retries')
})
})
```
**Blocking Conditions**:
- **PR Merge**: Integration tests MUST pass, no API contract regressions
- **Release**: 100% API endpoint coverage, all agent handoffs validated
### 2.3 System Testing (End-to-End)
**Objective**: Validate complete workflows from user command to final artifact generation.
**Scope**:
| Use Case | E2E Test Scenario | Critical Path |
|----------|------------------|---------------|
| UC-RF-001 | Discovery: Search → Gap Analysis → Acquisition Queue | Yes (60s workflow) |
| UC-RF-002 | Acquisition: Queue → PDF Download → FAIR Validation | Yes (90s workflow) |
| UC-RF-003 | Documentation: PDF → LLM Summary → Hallucination Check | Yes (Critical Risk T-01) |
| UC-RF-004 | Citation: Summary → BibTeX → In-Text Citation | Yes (30s workflow) |
| UC-RF-005 | Provenance: Actions → W3C PROV Graph → Validation | Yes (PROV compliance) |
| UC-RF-006 | Quality: Paper → GRADE Score → Retraction Check | Yes (Critical Risk Q-01) |
| UC-RF-007 | Archive: Artifacts → OAIS AIP → Checksum Validation | No (OAIS compliance) |
| UC-RF-008 | Workflow: PRISMA Protocol → Execution → Report | Yes (Reproducibility) |
| UC-RF-009 | Gap Analysis: Papers → Clustering → Gap Report | Yes (Automation value) |
| UC-RF-010 | Export: Artifacts → BibTeX/RIS/Zotero → Validation | No (Integration) |
**Test Count**: 10 E2E tests (1 per use case) + 5 critical path variations = **15 E2E tests**
**Coverage Requirements**:
- **Use Cases**: 100% coverage (all 10 use cases)
- **Critical Paths**: 100% coverage (6 critical workflows)
- **NFR Validation**: Embedded in E2E tests (performance, usability)
**Tools**:
- **E2E Framework**: Playwright (browser-based) or custom CLI test harness
- **Fixtures**: Real PDFs (open access), sample corpora, reference datasets
- **Environment**: Dockerized test environment with API mocks
- **Reporting**: Allure or similar for visual test reports
**Example: UC-RF-001 E2E Test**
```typescript
describe('UC-RF-001: Discover Research Papers E2E', () => {
it('should complete discovery workflow in <2 minutes', async () => {
const startTime = Date.now()
// Step 1: Run discovery command
const result = await cli.run(['research', 'search', 'reinforcement learning policy gradients'])
// Validate search completed in <10s (NFR-RF-D-01)
expect(Date.now() - startTime).toBeLessThan(10000)
// Validate 100 results returned
const searchResults = await fs.readJSON('.aiwg/research/discovery/search-results-*.json')
expect(searchResults.papers).toHaveLength(100)
// Validate gap analysis completed in <30s (NFR-RF-D-02)
const gapReport = await fs.readFile('.aiwg/research/analysis/gap-report-*.md', 'utf8')
expect(gapReport).toContain('## Under-Researched Topics')
expect(Date.now() - startTime).toBeLessThan(40000) // 10s search + 30s gap analysis
// Step 2: Select papers for acquisition
await cli.run(['research', 'select', '--top', '10'])
// Validate acquisition queue created
const queue = await fs.readJSON('.aiwg/research/discovery/acquisition-queue.json')
expect(queue.papers).toHaveLength(10)
// Validate total workflow <2 minutes (vision goal: 60%+ time savings)
expect(Date.now() - startTime).toBeLessThan(120000)
})
})
```
**Blocking Conditions**:
- **Release**: All 10 use case E2E tests MUST pass
- **Critical Paths**: All 6 critical workflows MUST complete within performance targets
- **No Regressions**: E2E tests from previous releases MUST continue passing
### 2.4 Acceptance Testing
**Objective**: Validate framework meets user needs and business objectives.
**Scope**:
| Acceptance Criteria | Test Method | Owner |
|---------------------|-------------|-------|
| **Use Case Acceptance** | Validate all acceptance criteria (AC-001 to AC-010 per use case) | Test Engineer |
| **NFR Acceptance** | Measure all 45 NFRs (performance, security, usability, compliance) | QA Specialist |
| **User Satisfaction** | Survey early adopters (matric-memory team, 5 external researchers) | Product Owner |
| **Reproducibility** | External researcher replicates PRISMA workflow | Academic Researcher |
| **FAIR Compliance** | F-UJI automated assessment (score >80%) | Compliance Specialist |
| **PROV Compliance** | W3C PROV-O schema validation (100% valid) | Standards Specialist |
| **OAIS Compliance** | OAIS SIP/AIP/DIP validation (METS conformance) | Archive Specialist |
**Test Count**: 10 use cases × 10 AC = 100 acceptance tests + 45 NFR tests = **145 acceptance tests**
**Coverage Requirements**:
- **Use Case AC**: 100% acceptance criteria validated
- **NFR Targets**: 100% NFRs measured and validated
- **User Satisfaction**: >80% users rate framework 4/5 or higher
**Tools**:
- **Manual Testing**: Checklist-based validation for use case AC
- **Automated NFR Testing**: Custom scripts for performance, security, compliance metrics
- **User Testing**: UserTesting.com or in-person sessions with matric-memory team
- **Compliance Tools**: F-UJI (FAIR), PROV-O validator, OAIS METS validator
**Example: NFR-RF-D-01 Acceptance Test**
```typescript
describe('NFR-RF-D-01: Search Completion Time <10s', () => {
it('should complete search in <10s for 95th percentile', async () => {
const durations: number[] = []
// Run search 100 times to measure 95th percentile
for (let i = 0; i < 100; i++) {
const startTime = Date.now()
await cli.run(['research', 'search', 'machine learning'])
durations.push(Date.now() - startTime)
}
durations.sort((a, b) => a - b)
const p95 = durations[Math.floor(durations.length * 0.95)]
expect(p95).toBeLessThan(10000) // 95th percentile <10s
})
})
```
**Blocking Conditions**:
- **Release**: All 145 acceptance tests MUST pass
- **NFR Targets**: All 45 NFRs MUST meet or exceed targets
- **User Satisfaction**: >80% user approval rating required for v1.0 release
---
## 3. Test Types
### 3.1 Functional Testing
**Objective**: Verify all features work as specified in use cases.
**Coverage**:
- **Use Case Scenarios**: All 10 use cases with main success scenario + alternate flows
- **Agent Capabilities**: All 8 agents with specified capabilities (e.g., Discovery Agent: semantic search, gap analysis, citation chaining)
- **CLI Commands**: All 12 CLI commands (e.g., `search`, `select`, `acquire`, `document`, `cite`, `grade`, `archive`, `workflow`, `export`)
- **Data Formats**: All output formats validated (JSON, BibTeX, RIS, PROV-O, METS)
**Test Techniques**:
- **Equivalence Partitioning**: Valid/invalid inputs (e.g., search query 3-200 chars, <3 chars invalid)
- **Boundary Value Analysis**: Edge cases (e.g., 1 result, 500 results, 0 results)
- **Decision Tables**: Complex logic (e.g., GRADE scoring with 5 criteria)
- **State Transition**: Workflow states (e.g., search → select → acquire → document)
**Priority**: P0 (Critical) - MUST PASS for release
### 3.2 Performance Testing
**Objective**: Validate framework meets performance NFRs.
**NFRs Validated**:
| NFR ID | Requirement | Target | Test Method |
|--------|-------------|--------|-------------|
| NFR-RF-D-01 | Search completion time | <10s (95th %ile) | Load test: 100 searches, measure latency |
| NFR-RF-D-02 | Gap analysis generation | <30s for 100 papers | Benchmark: 100-paper corpus, measure time |
| NFR-RF-Doc-01 | Document generation | <60s per paper | Stress test: 10 concurrent summaries |
| NFR-RF-Cite-01 | BibTeX generation | <5s for 100 papers | Batch test: 100 papers → BibTeX |
| NFR-RF-Q-01 | GRADE scoring | <15s per paper | Performance test: 50 papers, measure avg |
| NFR-RF-Prov-01 | PROV graph generation | <20s for 100 actions | Graph generation benchmark |
| NFR-RF-Arch-01 | OAIS package creation | <120s for 100 papers | Archive 100-paper corpus, measure time |
| NFR-RF-WF-01 | PRISMA workflow execution | <5 min for 500 papers | End-to-end workflow benchmark |
**Performance Baselines**:
- **Establish Baseline**: Week 11 (end of Documentation phase)
- **Regression Detection**: Any >10% performance degradation triggers investigation
- **Scalability Testing**: Test with 100, 500, 1,000 paper corpora (verify no degradation)
**Tools**:
- **Load Testing**: k6 (CLI load testing), Apache Bench
- **Profiling**: Node.js profiler, Chrome DevTools
- **Monitoring**: Grafana + Prometheus (optional, post-v1.0)
**Blocking Conditions**:
- **Release**: All 8 performance NFRs MUST meet targets
- **No Regressions**: Performance CANNOT degrade >10% from baseline
### 3.3 Security Testing
**Objective**: Validate security NFRs and mitigate security risks.
**NFRs Validated**:
| NFR ID | Requirement | Target | Test Method |
|--------|-------------|--------|-------------|
| NFR-RF-Sec-01 | API key protection | 100% (no hardcoding) | Static analysis: grep for API keys in code |
| NFR-RF-Sec-02 | Input sanitization | Prevent injection | Fuzz testing: malicious inputs |
| NFR-RF-Sec-03 | Copyright compliance | Respect publisher TOS | Manual audit: acquisition logic review |
| NFR-RF-Sec-04 | Data privacy | No PII in shared corpus | Privacy scan: detect sensitive data |
| NFR-RF-Sec-05 | PDF malware scanning | Optional virus scan | VirusTotal API integration (optional) |
**Security Risks Addressed**:
| Risk ID | Risk | Test Coverage |
|---------|------|---------------|
| S-01 | API key exposure | Static analysis, secret scanning (GitHub Actions) |
| S-02 | Data privacy in shared corpus | Privacy scanning, user guidelines |
| S-03 | Malicious PDFs | Optional VirusTotal integration, user warnings |
**Security Testing Techniques**:
1. **Static Analysis**: ESLint security rules, npm audit, Snyk vulnerability scanning
2. **Input Fuzzing**: Generate malicious inputs (SQL injection, XSS, command injection attempts)
3. **Secret Scanning**: GitHub secret scanning, truffleHog for leaked credentials
4. **Dependency Scanning**: Dependabot, npm audit for vulnerable dependencies
5. **Manual Review**: Code review for security best practices
**Tools**:
- **SAST**: ESLint plugin-security, SonarQube
- **Dependency Scanning**: npm audit, Snyk, Dependabot
- **Secret Scanning**: truffleHog, GitHub secret scanning
- **Fuzzing**: afl-fuzz, custom input generators
**Blocking Conditions**:
- **PR Merge**: No high/critical vulnerabilities (npm audit)
- **Release**: All 5 security NFRs validated, no secrets in code
### 3.4 Compliance Testing
**Objective**: Validate adherence to academic standards (FAIR, PROV, OAIS, PRISMA, GRADE).
**Standards Validated**:
| Standard | Compliance Requirement | Test Method |
|----------|------------------------|-------------|
| **FAIR Principles** | Findable, Accessible, Interoperable, Reusable | F-UJI automated assessment (score >80%) |
| **W3C PROV** | Provenance graphs conform to PROV-O ontology | PROV-O schema validator (100% valid) |
| **OAIS** | Archival packages conform to OAIS reference model | METS XML validation, BagIt compliance |
| **PRISMA** | Systematic review protocols complete | PRISMA checklist 100% |
| **GRADE** | Quality assessment structured per GRADE guidelines | GRADE criteria coverage 100% |
**NFRs Validated**:
| NFR ID | Requirement | Target | Test Method |
|--------|-------------|--------|-------------|
| NFR-RF-FAIR-01 | FAIR compliance rate | 100% | F-UJI assessment on sample corpus |
| NFR-RF-PROV-01 | PROV graph validity | 100% | PROV-O validator |
| NFR-RF-OAIS-01 | OAIS package conformance | 100% | METS validator, BagIt validation |
| NFR-RF-Comp-01 | PRISMA checklist completion | 100% | Manual checklist review |
| NFR-RF-Comp-02 | GRADE criteria coverage | 100% | Automated criteria check |
**Compliance Testing Process**:
1. **FAIR Compliance**:
- Run F-UJI assessment on sample 10-paper corpus
- Validate metadata completeness (DOI, authors, year, abstract)
- Check persistent identifiers (DOI, ORCID)
- Verify machine-readable formats (JSON-LD, PROV-O RDF)
2. **PROV Compliance**:
- Generate W3C PROV graph for sample workflow
- Validate against PROV-O ontology schema
- Check Activity, Entity, Agent triples
- Verify wasGeneratedBy, wasAttributedTo, used relationships
3. **OAIS Compliance**:
- Create archival package (SIP → AIP → DIP)
- Validate METS XML structure
- Check BagIt manifest (checksums valid)
- Verify preservation metadata
4. **PRISMA Compliance**:
- Generate systematic review protocol
- Validate PRISMA checklist (27 items)
- Check search strategy documentation
- Verify screening workflow completeness
5. **GRADE Compliance**:
- Generate GRADE quality assessment
- Validate all 5 criteria (risk of bias, inconsistency, indirectness, imprecision, publication bias)
- Check quality of evidence rating (high, moderate, low, very low)
**Tools**:
- **F-UJI**: https://www.f-uji.net/ (FAIR assessment tool)
- **PROV-O Validator**: https://www.w3.org/TR/prov-o/ (RDF validation)
- **METS Validator**: Library of Congress METS schema validator
- **BagIt**: Python bagit library for archival package validation
- **PRISMA Checklist**: http://www.prisma-statement.org/PRISMAStatement/Checklist.aspx
**Blocking Conditions**:
- **Release**: All 5 compliance NFRs MUST meet 100% targets
- **F-UJI Score**: >80% on automated FAIR assessment
- **PROV Validation**: 100% valid PROV-O graphs
---
## 4. Test Coverage Requirements
### 4.1 Code Coverage Targets
**Mandatory Thresholds** (CI-Enforced):
| Metric | Minimum (Blocking) | Target (Goal) | Enforcement |
|--------|-------------------|---------------|-------------|
| **Line Coverage** | 80% | 90% | PR merge blocked if <80% |
| **Branch Coverage** | 75% | 85% | PR merge blocked if <75% |
| **Function Coverage** | 85% | 95% | Warning if <85% |
| **Statement Coverage** | 80% | 90% | PR merge blocked if <80% |
**Critical Component Coverage** (100% Required):
| Component | Justification | Enforcement |
|-----------|--------------|-------------|
| Hallucination Detection | Critical Risk T-01 (LLM hallucination) | 100% line + branch |
| Citation Verification | Prevent citing fabricated papers | 100% line + branch |
| Quality Scoring | Critical Risk Q-01 (low-quality sources) | 100% line + branch |
| FAIR Validation | Compliance requirement | 100% line + branch |
| PROV Graph Generation | Standards compliance | 100% line + branch |
| API Rate Limiting | Critical Risk T-02 (API dependency) | 100% line + branch |
| Input Sanitization | Security Risk S-01 | 100% line + branch |
**Coverage Progression**:
| Phase | Line Coverage Target | Branch Coverage Target |
|-------|---------------------|------------------------|
| Construction (Week 9) | 60% | 55% |
| Documentation (Week 11) | 75% | 70% |
| Integration (Week 14) | 85% | 80% |
| Release (Week 20) | 90% | 85% |
**Coverage Ratcheting**:
- **Rule**: Coverage CANNOT decrease from previous PR
- **Enforcement**: CI fails if coverage drops >1%
- **Exception**: Technical debt approved by Test Architect (rare)
### 4.2 Use Case Coverage
**Coverage Matrix**: 100% Use Case Validation
| Use Case | Acceptance Criteria | Test Cases | Automation | Status |
|----------|-------------------|------------|------------|--------|
| UC-RF-001: Discover Research Papers | 10 AC | 15 tests | 100% | Planned |
| UC-RF-002: Acquire Research Source | 8 AC | 12 tests | 100% | Planned |
| UC-RF-003: Document Research Paper | 12 AC | 18 tests | 100% | Planned |
| UC-RF-004: Integrate Citations | 6 AC | 10 tests | 100% | Planned |
| UC-RF-005: Track Provenance | 5 AC | 8 tests | 100% | Planned |
| UC-RF-006: Assess Source Quality | 10 AC | 15 tests | 100% | Planned |
| UC-RF-007: Archive Research Artifacts | 7 AC | 10 tests | 100% | Planned |
| UC-RF-008: Execute Research Workflow | 8 AC | 12 tests | 100% | Planned |
| UC-RF-009: Perform Gap Analysis | 6 AC | 10 tests | 100% | Planned |
| UC-RF-010: Export Research Artifacts | 5 AC | 8 tests | 100% | Planned |
| **Total** | **77 AC** | **118 tests** | **100%** | |
**Traceability**: Every acceptance criteria MUST map to at least 1 test case.
### 4.3 NFR Coverage
**NFR Coverage Matrix**: 100% NFR Validation
| NFR Category | NFR Count | Measurable Targets | Test Automation | Status |
|--------------|-----------|-------------------|-----------------|--------|
| **Performance** | 10 NFRs | <10s search, <30s gap analysis, <60s doc generation, etc. | 100% | Planned |
| **Security** | 8 NFRs | API key protection, input sanitization, privacy, etc. | 90% | Planned |
| **Usability** | 7 NFRs | Onboarding <5hrs, query suggestions >80%, etc. | 50% (manual surveys) | Planned |
| **Compliance** | 10 NFRs | FAIR 100%, PROV 100%, OAIS 100%, PRISMA 100%, etc. | 100% | Planned |
| **Reliability** | 5 NFRs | API retry logic, rate limit compliance, error handling | 100% | Planned |
| **Scalability** | 5 NFRs | Support 1,000+ papers, no degradation at scale | 100% | Planned |
| **Total** | **45 NFRs** | **100% measurable** | **95% automation** | |
**NFR Testing Schedule**:
| Phase | NFRs Validated | Test Type |
|-------|----------------|-----------|
| Construction (Week 9-11) | Performance baselines, security static analysis | Automated |
| Integration (Week 12-14) | Compliance (FAIR, PROV), reliability (API retry) | Automated |
| Workflows (Week 15-17) | Scalability (1,000-paper corpus), PRISMA compliance | Automated |
| Validation (Week 18-20) | Usability (user surveys), full NFR regression | Manual + Automated |
---
## 5. Test Data Strategy
### 5.1 Test Data Requirements
**Data Categories**:
| Category | Purpose | Volume | Source |
|----------|---------|--------|--------|
| **Open Access Papers** | Real-world validation | 100 papers | Semantic Scholar, arXiv, PubMed Central |
| **Mock API Responses** | Offline testing, CI/CD | 500+ fixtures | Recorded Semantic Scholar responses |
| **Sample PDFs** | PDF extraction, hallucination detection | 50 PDFs | Open access repositories (CC BY license) |
| **BibTeX/RIS Fixtures** | Citation format validation | 100 entries | Zotero library, CrossRef API |
| **FAIR Metadata** | FAIR compliance testing | 20 datasets | Zenodo, Figshare (FAIR-compliant sources) |
| **PROV Graphs** | W3C PROV validation | 10 graphs | W3C PROV examples, custom workflows |
| **OAIS Packages** | Archival package testing | 5 packages | Library of Congress OAIS examples |
| **GRADE Assessments** | Quality scoring validation | 30 papers | Cochrane reviews, expert-validated |
### 5.2 Test Data Sources
**Primary Sources**:
1. **Semantic Scholar API**:
- Query: "machine learning", "reinforcement learning", "natural language processing"
- Filter: Open access, year >2020, citations >10
- Record responses for offline testing
2. **arXiv**:
- Categories: cs.AI, cs.LG, cs.CL
- Download sample PDFs (open access, no copyright issues)
3. **PubMed Central**:
- Medical/healthcare research papers
- Open access subset (CC BY license)
4. **Expert-Curated Datasets**:
- Cochrane reviews (GRADE quality assessments)
- Zenodo/Figshare (FAIR-compliant datasets)
**Data Licensing**:
- **Requirement**: All test data MUST be open access or CC BY licensed
- **Prohibition**: No paywalled content, respect publisher terms
- **Attribution**: Credit original sources in test documentation
### 5.3 Test Data Generation
**Automated Generation**:
1. **Mock API Responses**:
- Record real Semantic Scholar API responses
- Anonymize if needed (replace author names, paper IDs)
- Store in `test/fixtures/api-responses/`
2. **Synthetic Data**:
- Generate BibTeX entries with faker.js
- Create PROV graphs programmatically
- Build OAIS packages from templates
3. **Edge Cases**:
- Empty results (0 papers)
- Large results (500 papers)
- Malformed data (invalid DOI, missing metadata)
- Retracted papers (flagged in Retraction Watch)
**Data Versioning**:
- **Storage**: `test/fixtures/` directory
- **Version Control**: Git LFS for large PDFs (>1MB)
- **Updates**: Refresh fixtures quarterly to match API changes
### 5.4 Test Data Management
**Data Isolation**:
- **Principle**: Tests MUST NOT depend on external APIs (except explicit integration tests)
- **Approach**: Mock API responses for unit/E2E tests, real APIs for integration tests only
**Data Cleanup**:
- **Temporary Files**: Delete `.aiwg/research/` artifacts after each test
- **Test Isolation**: Each test suite runs in isolated directory (`test-run-{uuid}/`)
- **CI Cleanup**: Wipe test artifacts after CI run
**Data Privacy**:
- **No PII**: Test data MUST NOT contain personally identifiable information
- **No Secrets**: API keys, tokens in environment variables only (never in fixtures)
- **Public Datasets**: All test data shareable (no proprietary or confidential data)
**Fixtures Organization**:
```
test/fixtures/
├── api-responses/
│ ├── semantic-scholar/
│ │ ├── search-machine-learning.json
│ │ ├── search-empty-results.json
│ │ └── rate-limit-error.json
│ ├── crossref/
│ │ ├── doi-valid.json
│ │ └── doi-invalid.json
│ └── retraction-watch/
│ └── retracted-paper.json
├── pdfs/
│ ├── sample-paper-01.pdf
│ ├── sample-paper-02.pdf
│ └── sample-scanned.pdf (OCR test case)
├── bibtex/
│ ├── valid-100-entries.bib
│ └── malformed-entries.bib
├── fair-metadata/
│ ├── zenodo-dataset.json
│ └── figshare-dataset.json
├── prov-graphs/
│ ├── simple-workflow.ttl (Turtle format)
│ └── complex-lineage.jsonld
└── oais-packages/
├── sample-sip/ (Submission Information Package)
├── sample-aip/ (Archival Information Package)
└── sample-dip/ (Dissemination Information Package)
```
---
## 6. Test Automation
### 6.1 Automation Strategy
**Automation Targets**:
| Test Level | Automation Target | Current | Rationale |
|------------|------------------|---------|-----------|
| Unit Tests | 100% | 0% (Phase 3+) | Fast feedback, high ROI |
| Integration Tests | 100% | 0% (Phase 3+) | API contracts, agent handoffs |
| E2E Tests | 100% | 0% (Phase 4+) | Critical workflows, regression prevention |
| Acceptance Tests | 50% | 0% (Phase 6) | Some manual (user surveys, expert validation) |
| **Overall** | **95%** | **0%** (pre-Construction) | Minimize manual testing burden |
**Automation Phases**:
| Phase | Automation Focus | Deliverable |
|-------|-----------------|-------------|
| Construction (Week 9-11) | Unit tests, basic integration tests | 60% automation |
| Integration (Week 12-14) | Full integration tests, E2E scaffolding | 80% automation |
| Workflows (Week 15-17) | E2E critical paths, performance tests | 90% automation |
| Validation (Week 18-20) | Acceptance tests, compliance tests | 95% automation |
### 6.2 Automation Tools
**Test Frameworks**:
| Tool | Purpose | Justification |
|------|---------|--------------|
| **Jest** or **Vitest** | Unit + Integration testing | Node.js/TypeScript standard, fast, mocking built-in |
| **Playwright** | E2E testing (if UI added) | Browser automation, cross-platform |
| **Supertest** | API testing | Express.js integration, HTTP assertions |
| **k6** | Load testing | CLI-based, scriptable, cloud integration |
**CI/CD Integration**:
| Tool | Purpose | Configuration |
|------|---------|--------------|
| **GitHub Actions** | CI/CD pipeline | `.github/workflows/test.yml` |
| **Pre-commit Hooks** | Local fast feedback | Husky + lint-staged |
| **Codecov** | Coverage reporting | Integrated with GitHub Actions |
| **Dependabot** | Dependency updates | Automated PRs for npm packages |
**Mocking & Stubbing**:
| Tool | Purpose | Use Case |
|------|---------|----------|
| **nock** | HTTP mocking | Mock Semantic Scholar API responses |
| **sinon** | Function stubs/spies | Stub file system operations, time functions |
| **MSW** | Service worker mocking | Mock browser-based API calls (if needed) |
**Test Data Tools**:
| Tool | Purpose | Use Case |
|------|---------|----------|
| **faker.js** | Synthetic data generation | Generate BibTeX entries, author names |
| **Git LFS** | Large file versioning | Store sample PDFs in version control |
| **Docker** | Test environment isolation | Run Zotero, Neo4j in containers |
### 6.3 CI/CD Pipeline
**GitHub Actions Workflow** (`.github/workflows/test.yml`):
```yaml
name: Test Suite
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Lint
run: npm run lint
- name: Type check
run: npm run typecheck
- name: Unit tests
run: npm run test:unit -- --coverage
- name: Integration tests
run: npm run test:integration
env:
SEMANTIC_SCHOLAR_API_KEY: ${{ secrets.SEMANTIC_SCHOLAR_API_KEY }}
- name: E2E tests
run: npm run test:e2e
- name: Upload coverage
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ./coverage/lcov.info
- name: Check coverage thresholds
run: npm run coverage:check
# Fails if coverage <80% line, <75% branch
security:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Run npm audit
run: npm audit --audit-level=moderate
- name: Secret scanning
uses: trufflesecurity/trufflehog@main
with:
path: ./
base: ${{ github.event.repository.default_branch }}
head: HEAD
performance:
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install dependencies
run: npm ci
- name: Performance tests
run: npm run test:performance
- name: Check performance regression
run: npm run performance:compare
# Fails if >10% degradation from baseline
```
**Pre-commit Hooks** (`.husky/pre-commit`):
```bash
#!/bin/sh
. "$(dirname "$0")/_/husky.sh"
# Run fast checks only (unit tests, linting)
npm run lint
npm run typecheck
npm run test:unit -- --bail --findRelatedTests
```
**Branch Protection Rules**:
- **Require**: All CI checks pass (test, security, coverage)
- **Require**: >1 approval for PRs to main
- **Prevent**: Force push to main
- **Enforce**: Coverage CANNOT decrease
### 6.4 Test Execution Schedule
**Pre-Commit** (Local, <30s):
- Linting (ESLint)
- Type checking (TypeScript)
- Unit tests (related files only)
**PR Merge** (CI, <5 min):
- Full lint + type check
- All unit tests (190 tests)
- All integration tests (55 tests)
- Coverage check (80% line, 75% branch)
- Security scan (npm audit, secret scan)
**Nightly** (CI, <30 min):
- Full test suite (unit + integration + E2E)
- Performance tests (baseline comparison)
- Mutation testing (Stryker)
- Dependency updates (Dependabot PRs)
**Release** (Manual + CI, <2 hours):
- Full regression suite
- All 145 acceptance tests
- Compliance validation (FAIR, PROV, OAIS)
- User acceptance testing (matric-memory team)
- Performance baseline update
---
## 7. Quality Gates
### 7.1 Pre-Commit Quality Gate
**Enforcement**: Local developer machine (Husky pre-commit hooks)
**Criteria**:
| Check | Threshold | Blocking | Rationale |
|-------|-----------|----------|-----------|
| Linting | 0 errors | Yes | Code style consistency |
| Type Errors | 0 errors | Yes | Type safety (TypeScript) |
| Unit Tests (Related) | 100% pass | Yes | Fast feedback on changes |
| Formatting | Prettier compliant | Yes | Consistent formatting |
**Bypass**: NOT allowed (except emergency hotfixes with Test Architect approval)
### 7.2 PR Merge Quality Gate
**Enforcement**: GitHub branch protection + CI (GitHub Actions)
**Criteria**:
| Check | Threshold | Blocking | Rationale |
|-------|-----------|----------|-----------|
| **All Tests Pass** | 100% | Yes | No regressions |
| **Code Coverage** | ≥80% line, ≥75% branch | Yes | Minimum coverage threshold |
| **Coverage Delta** | No decrease >1% | Yes | Prevent coverage erosion |
| **Security Vulnerabilities** | 0 high/critical | Yes | No known exploits |
| **Secret Scanning** | 0 secrets detected | Yes | No leaked credentials |
| **Peer Review** | ≥1 approval | Yes | Code quality assurance |
| **Linting** | 0 errors | Yes | Style compliance |
| **Type Errors** | 0 errors | Yes | Type safety |
| **Build Success** | Successful build | Yes | No compilation errors |
**Bypass**: NOT allowed (no exceptions)
**Escalation**: If gate blocks valid PR, Test Architect reviews and approves override (rare, documented)
### 7.3 Release Quality Gate
**Enforcement**: Manual checklist + automated validation (Release Manager + CI)
**Criteria**:
| Check | Threshold | Blocking | Rationale |
|-------|-----------|----------|-----------|
| **All Tests Pass** | 100% (unit + integration + E2E) | Yes | No known defects |
| **Code Coverage** | ≥90% line, ≥85% branch | Yes | Target coverage for release |
| **Use Case Validation** | 100% (all 10 use cases) | Yes | Complete functionality |
| **NFR Validation** | 100% (all 45 NFRs met) | Yes | Quality attributes |
| **Acceptance Tests** | 100% pass | Yes | User requirements met |
| **Performance** | All targets met, no regressions | Yes | User experience |
| **Security Scan** | 0 high/critical vulnerabilities | Yes | Production-ready |
| **Compliance** | FAIR >80%, PROV 100%, OAIS 100% | Yes | Standards conformance |
| **User Acceptance** | >80% satisfaction (4/5 rating) | Yes | User approval |
| **Documentation** | Complete (CLI ref, tutorials, API docs) | Yes | Usability |
| **Migration Guide** | Published (if breaking changes) | Yes | User support |
| **Changelog** | Complete with highlights | Yes | Transparency |
| **Known Issues** | Documented, no critical unresolved | Yes | Risk awareness |
**Release Checklist**:
- [ ] All automated tests pass (CI green)
- [ ] Code coverage ≥90% line
- [ ] 10 use cases validated (manual or automated)
- [ ] 45 NFRs validated (automated + manual surveys)
- [ ] Performance baselines met (no >10% regression)
- [ ] Security scan clean (npm audit, Snyk)
- [ ] FAIR compliance >80% (F-UJI assessment)
- [ ] PROV graphs 100% valid (PROV-O validator)
- [ ] OAIS packages conformant (METS validator)
- [ ] User acceptance testing complete (matric-memory sign-off)
- [ ] Documentation complete (CLI ref, tutorials)
- [ ] Changelog updated
- [ ] Release notes drafted
- [ ] Known issues documented
- [ ] Backup/rollback plan ready
**Approvals Required**:
- Test Architect (quality assurance)
- Product Owner (user acceptance)
- Release Manager (process compliance)
### 7.4 Phase Transition Quality Gates
**Inception → Elaboration**:
- [ ] Test strategy approved
- [ ] Coverage targets defined
- [ ] Automation feasibility assessed
- [ ] Test data sources identified
**Elaboration → Construction**:
- [ ] Master test plan approved
- [ ] Test environments provisioned (Docker, CI/CD)
- [ ] CI/CD pipeline includes test execution
- [ ] Baseline coverage established (0% for greenfield, acceptable)
- [ ] Test data fixtures created (API mocks, sample PDFs)
**Construction → Transition**:
- [ ] All coverage targets met (90% line, 85% branch)
- [ ] No critical/high defects open
- [ ] Performance baseline validated (all 8 NFRs)
- [ ] Security scan passed (no high/critical vulnerabilities)
- [ ] Regression suite passing (all unit + integration + E2E)
**Transition → Production**:
- [ ] UAT complete and signed off (matric-memory team)
- [ ] All test levels passing (unit, integration, E2E, acceptance)
- [ ] No regressions from baseline
- [ ] Operational runbook tested (deployment, rollback)
- [ ] Monitoring and alerting configured (optional for v1.0)
---
## 8. Risk-Based Testing
### 8.1 Critical Priority Risks (Score 20-25)
**T-01: LLM Hallucination in Summaries/Extractions** (Score: 20)
**Testing Strategy**:
| Test Type | Coverage | Test Count | Automation |
|-----------|----------|------------|------------|
| Unit Tests | Hallucination detection algorithm | 10 tests | 100% |
| Integration Tests | RAG-based summarization pipeline | 5 tests | 100% |
| E2E Tests | UC-RF-003 (Document Research Paper) | 3 tests | 100% |
| Acceptance Tests | Hallucination rate <5% | 1 validation | Manual (expert review) |
**Test Scenarios**:
1. **Fabricated Citations**: LLM generates non-existent paper citation
- **Test**: Detect author/year/DOI mismatch with source paper
- **Expected**: Hallucination flagged, warning displayed
2. **Citation Verification**: Cross-reference all citations with CrossRef DOI database
- **Test**: Verify DOI existence via API
- **Expected**: 100% DOI validation, fabricated DOIs rejected
3. **Human-in-the-Loop**: Require user approval for AI summaries
- **Test**: Display "AI-generated, not verified" warning
- **Expected**: User explicitly confirms before summary persisted
4. **Hallucination Reporting**: User reports suspected hallucination
- **Test**: User clicks "Report Hallucination" button
- **Expected**: Report logged, trends monitored
**Success Criteria**:
- Hallucination rate <5% in user testing (expert validation of 50 summaries)
- 100% DOI verification (no fabricated DOIs pass validation)
- User trust survey: >80% trust in AI summaries
**Monitoring**:
- Weekly hallucination report count (>5 reports/month → escalate)
- Monthly hallucination audit (random sample of 50 summaries)
---
**A-04: Requires Too Much Manual Effort** (Score: 20)
**Testing Strategy**:
| Test Type | Coverage | Test Count | Automation |
|-----------|----------|------------|------------|
| Performance Tests | Workflow completion time | 8 tests | 100% |
| Usability Tests | User onboarding, effort tracking | 5 tests | 50% (manual surveys) |
| E2E Tests | All 10 use cases (time measurement) | 10 tests | 100% |
| Acceptance Tests | User satisfaction, time savings | 2 validations | Manual (surveys) |
**Test Scenarios**:
1. **Tiered Workflow Complexity**: Quick vs. Standard vs. Rigorous workflows
- **Test**: Measure time for each workflow (Quick <30 min, Standard <2 hrs, Rigorous <8 hrs)
- **Expected**: >60% users choose Quick/Standard (not overwhelmed by rigor)
2. **Effort Tracking**: Log time spent on each workflow step
- **Test**: Track user time per task (search, screen, document, grade)
- **Expected**: Framework shows "You've saved 15 hours vs. manual research"
3. **Onboarding Completion**: Users complete onboarding tutorial
- **Test**: Track completion rate (Stage 1, Stage 2, ..., Stage 5)
- **Expected**: >70% complete Stage 2 (basic usage), >50% complete Stage 4 (advanced)
4. **User Satisfaction**: Survey users on effort perception
- **Test**: Ask "Was framework time-saving or time-consuming?" (1-5 scale)
- **Expected**: >60% users rate 4/5 (time-saving)
**Success Criteria**:
- Total workflow time <2 hours for 100-paper review (vs. 60 hours manual = 97% reduction)
- User satisfaction: >60% report time savings
- Onboarding completion: >70% complete basic workflow
**Monitoring**:
- Weekly onboarding completion rate (Phase 6: user testing)
- Bi-weekly user interviews on e