# quality-check MCP Server

**Purpose:** Quality validation, scoring, and automatic rerun strategy determination for GAFF workflows

## Overview

The `quality-check` MCP server validates execution results from the router, calculates quality scores, determines if results meet acceptance criteria, and decides whether reruns are needed.

This is the **quality assurance layer** that ensures GAFF workflows produce acceptable outputs.

## Features

- **Result Validation:** Comprehensive validation against quality criteria
- **Quality Scoring:** 0-1 scale scoring with configurable thresholds
- **Completeness Checks:** Verify all required outputs are present
- **Accuracy Verification:** Validate correctness of results
- **Rerun Strategy:** Intelligent decisions on partial/full/adaptive reruns
- **Failure Analysis:** Identify patterns and root causes

---

## Tools

### 1. `validate_execution_result`

**Purpose:** Validate the final result of an intent graph execution.

**Input:**
```typescript
{
  execution_result: object,      // Result from router.execute_graph
  quality_criteria: {
    completeness_required: boolean,
    accuracy_threshold: number,  // 0-1 scale
    required_fields: string[],
    custom_validators?: object[]
  },
  intent_graph: object,          // Original intent graph
  original_request: object       // Original user request
}
```

**Output:**
```typescript
{
  is_valid: boolean,
  quality_score: number,         // 0-1 scale
  is_acceptable: boolean,        // Based on threshold (default 0.85)
  issues: Array<{
    type: "missing_field" | "accuracy" | "format" | "custom",
    field: string,
    message: string,
    severity: "error" | "warning"
  }>,
  completeness_score: number,
  accuracy_score: number,
  rerun_required: boolean,
  rerun_nodes: string[],
  recommendations: string[]
}
```

---

### 2. `score_quality`

**Purpose:** Calculate quality score for execution results.
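
As a rough illustration of what this tool computes, the sketch below combines three component scores with the default weights documented in this README (0.4 / 0.4 / 0.2) and maps the result onto the bands from the Grading Scale section. The names `overallScore`, `gradeFor`, and `clamp01` are hypothetical helpers chosen for this example; they are not the server's actual implementation or API.

```typescript
// Hypothetical helper types for illustration; not part of the server's API.
type Grade = "excellent" | "good" | "acceptable" | "poor" | "failed";

interface ComponentScores {
  completeness: number; // 0-1 scale
  accuracy: number;     // 0-1 scale
  performance: number;  // 0-1 scale
}

interface Weights {
  completeness_weight: number;
  accuracy_weight: number;
  performance_weight: number;
}

// Default weights as documented in this README.
const DEFAULT_WEIGHTS: Weights = {
  completeness_weight: 0.4,
  accuracy_weight: 0.4,
  performance_weight: 0.2,
};

// Keep a score inside the documented 0-1 range.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Weighted sum of the three component scores.
function overallScore(s: ComponentScores, w: Weights = DEFAULT_WEIGHTS): number {
  return clamp01(
    s.completeness * w.completeness_weight +
      s.accuracy * w.accuracy_weight +
      s.performance * w.performance_weight
  );
}

// Map a score onto the grading bands listed in this README.
function gradeFor(score: number): Grade {
  if (score >= 0.95) return "excellent";
  if (score >= 0.85) return "good";
  if (score >= 0.75) return "acceptable";
  if (score >= 0.6) return "poor";
  return "failed";
}

// Example: complete output, decent accuracy, execution twice over budget.
const score = overallScore({ completeness: 1.0, accuracy: 0.85, performance: 0.5 });
console.log(score.toFixed(2), gradeFor(score)); // prints "0.84 acceptable"
```

With the default threshold of 0.85, this example result would land just below the acceptance line, which is the kind of case where a rerun decision becomes relevant.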

**Input:**
```typescript
{
  execution_result: object,
  scoring_criteria: {
    completeness_weight: number,  // Default 0.4
    accuracy_weight: number,      // Default 0.4
    performance_weight: number,   // Default 0.2
    custom_metrics?: object[]
  }
}
```

**Output:**
```typescript
{
  overall_score: number,          // 0-1 scale
  component_scores: {
    completeness: number,
    accuracy: number,
    performance: number,
    custom: number[]
  },
  grade: "excellent" | "good" | "acceptable" | "poor" | "failed",
  passing: boolean
}
```

---

### 3. `check_completeness`

**Purpose:** Verify all required outputs are present and properly formatted.

**Input:**
```typescript
{
  execution_result: object,
  required_outputs: {
    required_fields: string[],
    required_types: object,       // field -> expected type
    required_formats: object      // field -> format pattern
  }
}
```

**Output:**
```typescript
{
  is_complete: boolean,
  completeness_score: number,
  missing_fields: string[],
  type_mismatches: Array<{
    field: string,
    expected: string,
    actual: string
  }>,
  format_violations: Array<{
    field: string,
    expected_format: string,
    actual_value: string
  }>
}
```

---

### 4. `check_accuracy`

**Purpose:** Validate accuracy and correctness of results.

**Input:**
```typescript
{
  execution_result: object,
  accuracy_criteria: {
    validation_rules: object[],
    business_rules: object[],
    expected_ranges: object,
    cross_field_validations: object[]
  },
  reference_data?: object         // Optional reference for comparison
}
```

**Output:**
```typescript
{
  is_accurate: boolean,
  accuracy_score: number,
  rule_violations: Array<{
    rule: string,
    field: string,
    message: string,
    severity: "error" | "warning"
  }>,
  confidence: number
}
```

---

### 5. `determine_rerun_strategy`

**Purpose:** Intelligently decide the best rerun strategy based on failure analysis.
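
One way to picture this decision is as a small function over a summary of what failed. The sketch below follows the branch order of the Rerun Strategy Decision Tree section of this README. The `FailureSummary` and `RerunDecision` shapes and the `chooseStrategy` function are hypothetical names introduced for illustration; the server's real logic and field names may differ, and this sketch does not expand a failed node into its dependencies.

```typescript
// Hypothetical types for illustration; not the server's actual schema.
type Strategy = "none" | "partial" | "full" | "adaptive";

interface FailureSummary {
  qualityScore: number;         // 0-1 scale
  failedNodes: string[];        // IDs of nodes that failed
  failuresIndependent: boolean; // true if failed nodes do not depend on each other
  systemicIssue: boolean;       // e.g. data source unavailable
}

interface RerunDecision {
  strategy: Strategy;
  rerunNodes: string[]; // only populated for partial reruns in this sketch
}

function chooseStrategy(f: FailureSummary, threshold = 0.85): RerunDecision {
  // Result already meets the threshold: nothing to rerun.
  if (f.qualityScore >= threshold) {
    return { strategy: "none", rerunNodes: [] };
  }
  // A single failed node: rerun just that node.
  if (f.failedNodes.length === 1) {
    return { strategy: "partial", rerunNodes: [...f.failedNodes] };
  }
  // Several failed nodes that do not depend on each other: rerun all of them.
  if (f.failedNodes.length > 1 && f.failuresIndependent) {
    return { strategy: "partial", rerunNodes: [...f.failedNodes] };
  }
  // Systemic problem: rerun the whole workflow.
  if (f.systemicIssue) {
    return { strategy: "full", rerunNodes: [] };
  }
  // Anything else counts as a complex pattern and is deferred to adaptive analysis.
  return { strategy: "adaptive", rerunNodes: [] };
}

const decision = chooseStrategy({
  qualityScore: 0.7,
  failedNodes: ["fetch_data"],
  failuresIndependent: true,
  systemicIssue: false,
});
console.log(decision.strategy, decision.rerunNodes);
```

Checking the single-node case before the systemic case mirrors the tree as written; an implementation could reasonably test for systemic issues first so that a lone failure caused by an unavailable data source triggers a full rerun instead.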

**Input:**
```typescript
{
  execution_result: object,
  validation_result: object,      // From validate_execution_result
  intent_graph: object,
  failure_history?: object[]      // Previous failures in this execution
}
```

**Output:**
```typescript
{
  rerun_required: boolean,
  strategy: "none" | "partial" | "full" | "adaptive",
  rerun_nodes: string[],          // Nodes to re-execute
  estimated_success_probability: number,
  reasoning: string,
  max_attempts_recommendation: number,
  alternative_approaches: string[]
}
```

**Strategies:**
- **none:** Result is acceptable, no rerun needed
- **partial:** Rerun only failed nodes and dependencies
- **full:** Rerun entire workflow (major issues detected)
- **adaptive:** Intelligently decide based on failure type and history

---

### 6. `analyze_failure_patterns`

**Purpose:** Identify patterns in failures to help improve workflows.

**Input:**
```typescript
{
  execution_history: object[],    // History of executions
  intent_graph: object,
  time_range?: {
    start: string,
    end: string
  }
}
```

**Output:**
```typescript
{
  patterns: Array<{
    pattern_type: "node_failure" | "timeout" | "quality_issue" | "dependency",
    frequency: number,
    affected_nodes: string[],
    root_cause_hypothesis: string,
    recommendation: string
  }>,
  most_common_failures: object[],
  success_rate: number,
  average_quality_score: number,
  improvement_suggestions: string[]
}
```

---

## Installation

```bash
cd gaff/mcp/quality-check
npm install
npm run build
```

## Usage

### Standalone

```bash
npm start
```

### In GAFF

The router calls quality-check after executing an intent graph:

```typescript
// Router execution flow
const executionResult = await router.execute_graph(intentGraph);

// Quality check
const qualityResult = await qualityCheck.validate_execution_result({
  execution_result: executionResult,
  quality_criteria: gaffConfig.quality_assurance,
  intent_graph: intentGraph,
  original_request: userRequest
});

// Rerun if needed
if (qualityResult.rerun_required) {
  const rerunStrategy = await qualityCheck.determine_rerun_strategy({
    execution_result: executionResult,
    validation_result: qualityResult,
    intent_graph: intentGraph
  });

  if (rerunStrategy.strategy === 'partial') {
    await router.rerun_nodes({
      graph: intentGraph,
      nodes_to_rerun: rerunStrategy.rerun_nodes
    });
  } else if (rerunStrategy.strategy === 'full') {
    await router.execute_graph(intentGraph);
  }
}
```

---

## Configuration

Set in `gaff.json`:

```json
{
  "quality_assurance": {
    "enabled": true,
    "auto_rerun_on_failure": true,
    "quality_threshold": 0.85,
    "max_rerun_attempts": 2,
    "validation_rules": [
      "completeness_check",
      "accuracy_verification",
      "consistency_check",
      "format_validation"
    ],
    "scoring_weights": {
      "completeness": 0.4,
      "accuracy": 0.4,
      "performance": 0.2
    },
    "rerun_strategies": {
      "partial": "Rerun only failed nodes",
      "full": "Rerun entire workflow",
      "adaptive": "Intelligently decide based on failure type"
    },
    "default_strategy": "adaptive"
  }
}
```

---

## Quality Scoring Algorithm

### Overall Quality Score Calculation

```
overall_score = (completeness_score * 0.4) +
                (accuracy_score * 0.4) +
                (performance_score * 0.2)
```

### Completeness Score
- All required fields present: 1.0
- Missing fields: -0.2 per missing field
- Type mismatches: -0.1 per mismatch

### Accuracy Score
- All validation rules pass: 1.0
- Rule violations: -0.15 per error, -0.05 per warning
- Business rule violations: -0.25 per violation

### Performance Score
- Execution time within budget: 1.0
- Over budget: score = budget / actual_time
- Token usage efficiency factor

### Grading Scale
- **0.95-1.0:** Excellent ⭐⭐⭐⭐⭐
- **0.85-0.94:** Good ⭐⭐⭐⭐
- **0.75-0.84:** Acceptable ⭐⭐⭐
- **0.60-0.74:** Poor ⭐⭐
- **<0.60:** Failed

---

## Rerun Strategy Decision Tree

```
Is quality_score >= threshold (0.85)?
├─ YES → No rerun needed
└─ NO → Analyze failures
    ├─ Single node failed, others OK?
    │   └─ YES → Strategy: PARTIAL (rerun failed node + dependencies)
    ├─ Multiple independent nodes failed?
    │   └─ YES → Strategy: PARTIAL (rerun all failed nodes)
    ├─ Systemic issue (e.g., data source unavailable)?
    │   └─ YES → Strategy: FULL (rerun entire workflow)
    └─ Complex failure pattern?
        └─ YES → Strategy: ADAPTIVE (analyze and decide)
```

---

## Development

### Build
```bash
npm run build
```

### Watch mode
```bash
npm run watch
```

### Test
```bash
npm test
```

---

## Integration with Router

The router must:

1. **Execute the intent graph**
2. **Call quality-check** with results
3. **Check if rerun is needed**
4. **Execute rerun strategy** if quality is below threshold
5. **Repeat until:**
   - Quality threshold is met, OR
   - Max rerun attempts reached

**Pseudo-code:**
```typescript
// `router`, `qualityCheck`, `config`, `graph`, and `userRequest` come from the caller.
const maxReruns = config.quality_assurance.max_rerun_attempts;
let reruns = 0;
let qualityResult = null;

// Initial execution
let result = await router.execute_graph(graph);

while (true) {
  // Validate quality of the latest result
  qualityResult = await qualityCheck.validate_execution_result({
    execution_result: result,
    quality_criteria: config.quality_assurance,
    intent_graph: graph,
    original_request: userRequest
  });

  // Stop on success, or when the rerun budget is exhausted
  if (qualityResult.is_acceptable || reruns >= maxReruns) {
    break;
  }

  // Determine rerun strategy
  const strategy = await qualityCheck.determine_rerun_strategy({
    execution_result: result,
    validation_result: qualityResult,
    intent_graph: graph
  });

  // Execute rerun
  if (strategy.strategy === 'partial') {
    // Assumption: rerun_nodes resolves to the updated execution result
    result = await router.rerun_nodes({
      graph,
      nodes_to_rerun: strategy.rerun_nodes
    });
  } else if (strategy.strategy === 'full') {
    result = await router.execute_graph(graph);
  } else {
    break; // No viable rerun strategy
  }

  reruns++;
}

// attempts counts the initial execution plus any reruns
return { result, qualityResult, attempts: reruns + 1 };
```

---

## License

MIT License - Copyright 2025 Sean Poyner

---

**Part of the [GAFF Framework](https://github.com/seanpoyner/gaff)**