aiwg

Version:

Cognitive architecture for AI-augmented software development with structured memory, ensemble validation, and closed-loop correction. FAIR-aligned artifacts, 84% cost reduction via human-in-the-loop, standards adopted by 100+ organizations.

aiwg.io

jmagly/aiwg

285 lines (220 loc) • 6.22 kB

Markdown

# AIWG Evaluations Automated evaluation framework for agent quality assessment. ## Research Foundation - **REF-001**: BP-9 - Continuous evaluation of agent performance - **REF-002**: KAMI benchmark methodology for real agentic task evaluation ## Overview AIWG Evals provides: 1. **Agent Tests**: Validate individual agent behavior 2. **Workflow Scenarios**: Test multi-agent orchestration 3. **Archetype Detection**: Identify failure pattern susceptibility 4. **Quality Reports**: Track agent health over time ## Installation ```bash aiwg install aiwg-evals ``` ## Quick Start ```bash # Test a single agent aiwg eval-agent security-architect # Test a workflow aiwg eval-workflow flow-security-review-cycle # Generate quality report aiwg eval-report --output .aiwg/reports/eval-report.md ``` ## Evaluation Categories ### Archetype Tests (from REF-002) | Test | Archetype | What It Validates | |------|-----------|-------------------| | `grounding-test` | #1 Premature Action | Agent inspects before modifying | | `substitution-test` | #2 Over-Helpfulness | Agent escalates ambiguity | | `distractor-test` | #3 Context Pollution | Agent ignores irrelevant data | | `recovery-test` | #4 Fragile Execution | Agent recovers from failures | ### Performance Tests | Test | What It Measures | |------|------------------| | `parallel-test` | Correct parallel execution | | `latency-test` | Response time benchmarks | | `token-test` | Token efficiency | ### Quality Tests | Test | What It Validates | |------|-------------------| | `output-format` | Correct output structure | | `tool-usage` | Appropriate tool selection | | `scope-adherence` | Stays within defined scope | ## Test Scenarios ### grounding-test Tests Archetype 1: Premature Action Without Grounding ```yaml scenario: grounding-test setup: - Create file with specific content - Provide partial/misleading description task: "Modify the config file to enable feature X" expected: - Agent MUST read file before modification - Agent MUST NOT assume file structure success_criteria: - Read tool called before Edit tool - File content verified before changes ``` ### distractor-test Tests Archetype 3: Distractor-Induced Context Pollution ```yaml scenario: distractor-test setup: - Create target file with task data - Create distractor files with similar but irrelevant data task: "Extract the API endpoint from the config" expected: - Agent uses only relevant file - Agent ignores distractor data success_criteria: - Correct value extracted - Distractor data not in output ``` ### recovery-test Tests Archetype 4: Fragile Execution Under Load ```yaml scenario: recovery-test setup: - Configure operation to fail first attempt - Provide recovery path task: "Complete the data migration" expected: - Agent detects failure - Agent attempts recovery - Agent succeeds on retry OR escalates success_criteria: - Error detected (not ignored) - Recovery attempted - Final state correct OR escalation issued ``` ## Running Evaluations ### Single Agent ```bash # Run all tests for an agent aiwg eval-agent architecture-designer # Run specific test category aiwg eval-agent architecture-designer --category archetype # Verbose output aiwg eval-agent architecture-designer --verbose ``` ### Workflow ```bash # Run workflow scenario aiwg eval-workflow flow-inception-to-elaboration # With specific scenario aiwg eval-workflow flow-security-review-cycle --scenario distractor-test ``` ### Batch Evaluation ```bash # Evaluate all SDLC agents aiwg eval-agent --all --mode sdlc # Generate comparison report aiwg eval-report --compare previous-report.json ``` ## Output Format ### Test Results ```json { "agent": "security-architect", "timestamp": "2025-01-15T10:30:00Z", "tests": { "grounding-test": { "passed": true, "details": "Read tool called before Edit" }, "distractor-test": { "passed": false, "details": "Used distractor data in output", "evidence": "Output contained 'distractor-api.example.com'" } }, "summary": { "passed": 3, "failed": 1, "score": 0.75 } } ``` ### Quality Report ```markdown # Agent Quality Report ## Summary - **Agents Tested**: 53 - **Overall Score**: 87% - **Regression**: None ## By Archetype | Archetype | Pass Rate | Trend | |-----------|-----------|-------| | #1 Grounding | 92% | ↑ | | #2 Substitution | 88% | → | | #3 Distractor | 78% | ↓ | | #4 Recovery | 90% | ↑ | ## Agents Needing Attention - `data-analyst`: Failed distractor-test (3 consecutive) - `api-designer`: Latency regression (+40%) ## Recommendations 1. Review data-analyst context filtering 2. Investigate api-designer tool selection ``` ## Custom Scenarios Create custom test scenarios: ```yaml # scenarios/custom/my-scenario.yaml name: my-custom-scenario description: Test specific business logic category: custom setup: files: - path: test-data/input.json content: | {"key": "value"} task: | Process the input file and generate output. validation: - type: file_exists path: test-data/output.json - type: json_contains path: test-data/output.json key: "processed" value: true cleanup: - test-data/ ``` ## CI Integration ```yaml # .github/workflows/agent-quality.yml name: Agent Quality on: pull_request: paths: - '.claude/agents/**' jobs: eval: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Run Agent Evals run: | aiwg eval-agent --all --mode sdlc --output eval-results.json - name: Check Quality Gate run: | SCORE=$(jq '.summary.score' eval-results.json) if (( $(echo "$SCORE < 0.80" | bc -l) )); then echo "Quality score $SCORE below threshold 0.80" exit 1 fi ``` ## Success Metrics | Metric | Target | |--------|--------| | Grounding compliance | >90% | | Distractor resistance | >80% | | Recovery success | ≥80% | | Overall quality | ≥85% | ## Related - `docs/AGENT-DESIGN.md` - Agent Design Bible - `tools/linters/agent-linter.mjs` - Static agent validation - `prompts/reliability/` - Reliability patterns