# Doc-Intelligence Addon Evaluation Plan

## Overview

This document defines the evaluation criteria, test scenarios, and quality gates for the doc-intelligence addon skills.

## Research Compliance Validation

Each skill must demonstrate compliance with:

- **REF-001**: Production-Grade Agentic Workflows
- **REF-002**: LLM Failure Modes in Agentic Scenarios

### Archetype Mitigation Checklist

| Skill | Archetype 1 | Archetype 2 | Archetype 3 | Archetype 4 |
|-------|-------------|-------------|-------------|-------------|
| doc-scraper | | | | |
| pdf-extractor | | | | |
| llms-txt-support | | | | |
| source-unifier | | | | |
| doc-splitter | | | | |

## Evaluation Scenarios

### 1. doc-scraper Evaluation

**Test Case DS-001: Basic Documentation Scraping**

```
Input: https://docs.example.com/
Expected: Structured JSON output with pages, summary
Grounding: Verify robots.txt checked
Recovery: Handle 429 rate limit gracefully
```

**Test Case DS-002: JavaScript-Heavy Site**

```
Input: SPA documentation site
Expected: Playwright fallback successful
Grounding: Browser option selected correctly
Recovery: Fallback from httpx to Playwright
```

**Test Case DS-003: Rate Limit Handling**

```
Input: Fast scraping attempt
Expected: Automatic backoff applied
Recovery: Exponential backoff, max 3 retries
```

### 2. pdf-extractor Evaluation

**Test Case PE-001: Standard PDF Extraction**

```
Input: Text-based PDF document
Expected: Markdown output with structure preserved
Grounding: PDF file existence verified
Recovery: Handle corrupted PDF gracefully
```

**Test Case PE-002: Image-Heavy PDF**

```
Input: PDF with diagrams and screenshots
Expected: OCR applied, images extracted
Grounding: OCR availability checked
Recovery: Fallback to text-only if OCR fails
```

**Test Case PE-003: Large PDF Processing**

```
Input: 500+ page PDF
Expected: Chunked processing with checkpoints
Grounding: Memory limits respected
Recovery: Resume from checkpoint on failure
```

### 3. llms-txt-support Evaluation

**Test Case LT-001: llms.txt Detection**

```
Input: Site URL with /llms.txt
Expected: Fast-path extraction used
Grounding: Check existence before scraping
Recovery: Fallback to doc-scraper if not found
```

**Test Case LT-002: llms-full.txt Handling**

```
Input: Site with full version
Expected: Comprehensive content extracted
Grounding: Prefer full version if available
Recovery: Use standard version as fallback
```

### 4. source-unifier Evaluation

**Test Case SU-001: Multi-Source Merge**

```
Input: docs/ + GitHub README + PDF
Expected: Unified output with deduplication
Grounding: All sources validated before merge
Recovery: Continue with partial sources on failure
```

**Test Case SU-002: Conflict Detection**

```
Input: Sources with contradictory information
Expected: Conflicts flagged for user review
Escalation: User decision required
Recovery: Mark conflicts, don't auto-resolve
```

### 5. doc-splitter Evaluation

**Test Case SP-001: Large Documentation Split**

```
Input: 15,000 page documentation set
Expected: Sub-skills created with router
Grounding: Size analysis before splitting
Recovery: Preserve partial splits on failure
```

**Test Case SP-002: Semantic Boundary Respect**

```
Input: Documentation with logical sections
Expected: Splits at semantic boundaries
Grounding: Section analysis performed
Recovery: Conservative splits if analysis fails
```

## Quality Gates

### Gate 1: Structure Validation

- [ ] SKILL.md follows template
- [ ] Required sections present (Purpose, Grounding, Escalation, Context, Recovery)
- [ ] Checkpoint support documented
- [ ] Workflow steps defined

### Gate 2: Research Compliance

- [ ] BP-4 Single Responsibility demonstrated
- [ ] BP-9 KISS principle applied
- [ ] All 4 archetypes addressed
- [ ] Uncertainty escalation clear

### Gate 3: Functional Testing

- [ ] Happy path works
- [ ] Error handling tested
- [ ] Recovery protocol verified
- [ ] Checkpoint creation confirmed

### Gate 4: Integration Testing

- [ ] Works with doc-analyst orchestrator
- [ ] Checkpoint handoff successful
- [ ] Cross-skill data flow validated
- [ ] Rollback capability confirmed

## Metrics

### Quality Score Calculation

```
Structure (25 points)
- SKILL.md present: 5
- Required sections: 10
- Workflow steps: 5
- Configuration options: 5

Content (35 points)
- Grounding checkpoint: 10
- Uncertainty escalation: 10
- Context scope table: 5
- Recovery protocol: 10

Examples (20 points)
- Bash examples: 10
- Configuration examples: 5
- Output examples: 5

Documentation (20 points)
- References: 5
- Troubleshooting: 5
- Checkpoint structure: 5
- Integration points: 5

Total: 100 points
PASS: ≥80 | WARN: 60-79 | FAIL: <60
```

## Test Execution

### Manual Testing Checklist

```bash
# 1. Environment Setup
skill-seekers version  # Verify tool available

# 2. Basic Functionality
skill-seekers scrape https://test-docs.example.com/ --output test_output/

# 3. PDF Extraction
skill-seekers extract test.pdf --output test_output/

# 4. Source Unification
skill-seekers unify test_output/*_data/ --output unified_output/

# 5. Large Doc Splitting
skill-seekers split large_docs/ --output split_output/
```

### Automated Testing (CI/CD)

```yaml
# .github/workflows/skill-evaluation.yml
name: Skill Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Structure Validation
        run: |
          for skill in doc-scraper pdf-extractor llms-txt-support source-unifier doc-splitter; do
            test -f "agentic/code/addons/doc-intelligence/skills/$skill/SKILL.md"
          done

      - name: Content Validation
        run: |
          for skill in doc-scraper pdf-extractor llms-txt-support source-unifier doc-splitter; do
            grep -q "Grounding Checkpoint" "agentic/code/addons/doc-intelligence/skills/$skill/SKILL.md"
            grep -q "Recovery Protocol" "agentic/code/addons/doc-intelligence/skills/$skill/SKILL.md"
          done
```

## Revision History

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-01-15 | Initial evaluation plan |
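
The rubric under Quality Score Calculation can be tallied mechanically. The following is a minimal sketch, not part of the addon: the point values and the PASS/WARN/FAIL thresholds are copied from the rubric, while the names `RUBRIC`, `score_skill`, and `classify` are hypothetical. Clamping each criterion to its rubric maximum, and treating missing criteria as zero, are assumptions the plan does not state.

```python
import copy
from typing import Dict

# Maximum points per criterion, copied from the Quality Score Calculation rubric.
RUBRIC: Dict[str, Dict[str, int]] = {
    "Structure": {
        "SKILL.md present": 5,
        "Required sections": 10,
        "Workflow steps": 5,
        "Configuration options": 5,
    },
    "Content": {
        "Grounding checkpoint": 10,
        "Uncertainty escalation": 10,
        "Context scope table": 5,
        "Recovery protocol": 10,
    },
    "Examples": {
        "Bash examples": 10,
        "Configuration examples": 5,
        "Output examples": 5,
    },
    "Documentation": {
        "References": 5,
        "Troubleshooting": 5,
        "Checkpoint structure": 5,
        "Integration points": 5,
    },
}


def score_skill(awarded: Dict[str, Dict[str, int]]) -> int:
    """Sum awarded points over every rubric criterion.

    Assumption: each value is clamped to the range [0, rubric maximum],
    and criteria missing from `awarded` count as 0.
    """
    total = 0
    for category, criteria in RUBRIC.items():
        for name, max_points in criteria.items():
            points = awarded.get(category, {}).get(name, 0)
            total += max(0, min(points, max_points))
    return total


def classify(total: int) -> str:
    """Map a total score to the rubric's PASS / WARN / FAIL bands."""
    if total >= 80:
        return "PASS"
    if total >= 60:
        return "WARN"
    return "FAIL"


if __name__ == "__main__":
    # Hypothetical skill that lacks Bash examples and a Troubleshooting section.
    awarded = copy.deepcopy(RUBRIC)
    awarded["Examples"]["Bash examples"] = 0
    awarded["Documentation"]["Troubleshooting"] = 0
    total = score_skill(awarded)
    print(total, classify(total))  # prints "85 PASS"
```

Keeping the rubric as data rather than hard-coded arithmetic means a change to the point allocation only touches the `RUBRIC` table, and the 100-point total can be checked by summing it.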