aiwg

Version:

Deployment tool and support utility for AI context. Copies agents, skills, commands, rules, and behaviors into the paths each AI platform reads (Claude Code, Codex, Copilot, Cursor, Warp, OpenClaw, and 6 more) so one source of truth works across 10 platfo

aiwg.io

jmagly/aiwg

164 lines (118 loc) • 5.35 kB

Markdown

# Eval Loop: Isolation Protocol The eval loop is `nlp-prod`'s production quality gate. The most important property is **strict isolation between the generator and the evaluator**. --- ## Why Isolation Matters If the evaluator has access to the generator's reasoning, system prompt, or chain-of-thought, it is not an independent evaluator — it is a rubber stamp. Contaminated evaluation produces inflated scores, misses real failures, and gives false confidence. The eval loop is modelled on the principle that a good reviewer is one who knows nothing about how the answer was produced — only whether the answer is correct. --- ## Isolation Invariants These invariants are enforced and checked before every eval run: | Invariant | Enforcement | |-----------|------------| | Evaluator prompt is a separate file | `prompts/evaluator.prompt.md` never merged with generator | | Evaluator receives only `{{input}}` and `{{output}}` | No `{{steps}}`, `{{chain_of_thought}}`, `{{context}}` | | Evaluator does not know generator model | Model selection is outside evaluator scope | | Evaluator uses a different (cheaper) model | `eval_model: haiku` separate from generator model | | Evaluator has no Write/Bash tools | `eval-reviewer` agent: Read-only tools only | --- ## Isolation Violations (Anti-Patterns) ### Violation 1: Merged files ```markdown # WRONG — generator.prompt.md ## System You are an extractor... ## Evaluator Score the above output... ``` The evaluator section must be its own file. No exceptions. ### Violation 2: Generator context in evaluator template ```markdown # WRONG — evaluator.prompt.md Generator's system prompt: {{system_prompt}} Generator used steps: {{intermediate_steps}} Score the output: {{output}} ``` Only `{{input}}` and `{{output}}`. The evaluator scores what the user would see. ### Violation 3: Chain-of-thought leakage ```markdown # WRONG — passing full generator response to evaluator output = generator_response.content[0].text # includes COT evaluate(input, output) # evaluator sees COT ``` If the generator produces chain-of-thought before the final answer, extract ONLY the final answer before passing to evaluator. --- ## Correct Eval Loop Implementation ```python # Correct isolation def generate(client, input_text): system, user = load_prompt("generator.prompt.md", {"input": input_text}) raw = client.messages.create(model=GENERATOR_MODEL, ...) output = extract_final_answer(raw.content[0].text) # strip COT if present return output def evaluate(client, input_text, output): # ONLY input and output — no generator context system, user = load_prompt("evaluator.prompt.md", { "input": input_text, "output": output, # final answer only }) raw = client.messages.create(model=EVALUATOR_MODEL, ...) # separate model return json.loads(raw.content[0].text) ``` --- ## Refinement Loop When the evaluator returns `pass: false`, the feedback is fed back to the generator for refinement: ``` Attempt 1: generate(input) → output → eval → fail (score 0.4) Attempt 2: generate(input + "Previous feedback: " + feedback) → output → eval → fail (score 0.7) Attempt 3: generate(input + "Previous feedback: " + feedback) → output → eval → pass (score 0.91) ``` Key: the feedback sent to the generator is the evaluator's `feedback` string — not a copy of the evaluator prompt. The generator does not learn the evaluator's rubric. --- ## Scoring Schema Every evaluation returns: ```json { "score": 0.0, "pass": false, "feedback": "specific, actionable description", "rubric_scores": { "criterion": 0.0 }, "failure_category": "format|content|hallucination|missing_field|other", "suggested_fix": "one-sentence prompt revision recommendation" } ``` - `score` is the weighted average of rubric scores - `pass` is `score >= pass_threshold` - `feedback` is addressed to the generator (actionable for refinement) - `suggested_fix` is addressed to the prompt engineer (for prompt revision) --- ## Eval vs Ralph Loop | Dimension | Ralph Loop | nlp-prod Eval Loop | |-----------|-----------|-------------------| | Purpose | Iterative development | Production quality gate | | Isolation | Shared session | Strict — evaluator has no generator context | | Output | Working implementation | Pass/fail + structured feedback | | Persistence | `.aiwg/ralph/` | `eval/results.jsonl` (append-only) | | Cost tracking | Session tokens | Per-call cost in results schema | | Termination | Completion criteria | pass_threshold OR max_attempts | | Human loop | Issue thread comments | Optional via `--interactive` | Use Ralph for development iteration. Use the eval loop as a production quality gate. --- ## Contamination Detection The `eval-reviewer` agent and eval loop runner both check for contamination before scoring: ```python CONTAMINATION_SIGNALS = [ "{{steps}}", "{{chain_of_thought}}", "{{intermediate}}", "generator_context", "system_prompt", ] def check_contamination(evaluator_prompt: str) -> bool: return any(signal in evaluator_prompt for signal in CONTAMINATION_SIGNALS) ``` If contamination is detected: - Run stops - Error message explains the violation - `contamination_warning: true` set in results if eval ran before detection