aiwg
Version:
Deployment tool and support utility for AI context. Copies agents, skills, commands, rules, and behaviors into the paths each AI platform reads (Claude Code, Codex, Copilot, Cursor, Warp, OpenClaw, and 6 more) so one source of truth works across 10 platfo
164 lines (118 loc) • 5.35 kB
Markdown
# Eval Loop: Isolation Protocol
The eval loop is `nlp-prod`'s production quality gate. The most important property is **strict isolation between the generator and the evaluator**.
## Why Isolation Matters
If the evaluator has access to the generator's reasoning, system prompt, or chain-of-thought, it is not an independent evaluator — it is a rubber stamp. Contaminated evaluation produces inflated scores, misses real failures, and gives false confidence.
The eval loop is modelled on the principle that a good reviewer is one who knows nothing about how the answer was produced — only whether the answer is correct.
## Isolation Invariants
These invariants are enforced and checked before every eval run:
| Invariant | Enforcement |
|-----------|------------|
| Evaluator prompt is a separate file | `prompts/evaluator.prompt.md` never merged with generator |
| Evaluator receives only `{{input}}` and `{{output}}` | No `{{steps}}`, `{{chain_of_thought}}`, `{{context}}` |
| Evaluator does not know generator model | Model selection is outside evaluator scope |
| Evaluator uses a different (cheaper) model | `eval_model: haiku` separate from generator model |
| Evaluator has no Write/Bash tools | `eval-reviewer` agent: Read-only tools only |
## Isolation Violations (Anti-Patterns)
### Violation 1: Merged files
```markdown
# WRONG — generator.prompt.md
## System
You are an extractor...
## Evaluator
Score the above output...
```
The evaluator section must be its own file. No exceptions.
### Violation 2: Generator context in evaluator template
```markdown
# WRONG — evaluator.prompt.md
Generator's system prompt: {{system_prompt}}
Generator used steps: {{intermediate_steps}}
Score the output: {{output}}
```
Only `{{input}}` and `{{output}}`. The evaluator scores what the user would see.
### Violation 3: Chain-of-thought leakage
```markdown
# WRONG — passing full generator response to evaluator
output = generator_response.content[0].text # includes COT
evaluate(input, output) # evaluator sees COT
```
If the generator produces chain-of-thought before the final answer, extract ONLY the final answer before passing to evaluator.
## Correct Eval Loop Implementation
```python
# Correct isolation
def generate(client, input_text):
system, user = load_prompt("generator.prompt.md", {"input": input_text})
raw = client.messages.create(model=GENERATOR_MODEL, ...)
output = extract_final_answer(raw.content[0].text) # strip COT if present
return output
def evaluate(client, input_text, output):
# ONLY input and output — no generator context
system, user = load_prompt("evaluator.prompt.md", {
"input": input_text,
"output": output, # final answer only
})
raw = client.messages.create(model=EVALUATOR_MODEL, ...) # separate model
return json.loads(raw.content[0].text)
```
## Refinement Loop
When the evaluator returns `pass: false`, the feedback is fed back to the generator for refinement:
```
Attempt 1: generate(input) → output → eval → fail (score 0.4)
Attempt 2: generate(input + "Previous feedback: " + feedback) → output → eval → fail (score 0.7)
Attempt 3: generate(input + "Previous feedback: " + feedback) → output → eval → pass (score 0.91)
```
Key: the feedback sent to the generator is the evaluator's `feedback` string — not a copy of the evaluator prompt. The generator does not learn the evaluator's rubric.
## Scoring Schema
Every evaluation returns:
```json
{
"score": 0.0,
"pass": false,
"feedback": "specific, actionable description",
"rubric_scores": {
"criterion": 0.0
},
"failure_category": "format|content|hallucination|missing_field|other",
"suggested_fix": "one-sentence prompt revision recommendation"
}
```
- `score` is the weighted average of rubric scores
- `pass` is `score >= pass_threshold`
- `feedback` is addressed to the generator (actionable for refinement)
- `suggested_fix` is addressed to the prompt engineer (for prompt revision)
## Eval vs Ralph Loop
| Dimension | Ralph Loop | nlp-prod Eval Loop |
|-----------|-----------|-------------------|
| Purpose | Iterative development | Production quality gate |
| Isolation | Shared session | Strict — evaluator has no generator context |
| Output | Working implementation | Pass/fail + structured feedback |
| Persistence | `.aiwg/ralph/` | `eval/results.jsonl` (append-only) |
| Cost tracking | Session tokens | Per-call cost in results schema |
| Termination | Completion criteria | pass_threshold OR max_attempts |
| Human loop | Issue thread comments | Optional via `--interactive` |
Use Ralph for development iteration. Use the eval loop as a production quality gate.
## Contamination Detection
The `eval-reviewer` agent and eval loop runner both check for contamination before scoring:
```python
CONTAMINATION_SIGNALS = [
"{{steps}}",
"{{chain_of_thought}}",
"{{intermediate}}",
"generator_context",
"system_prompt",
]
def check_contamination(evaluator_prompt: str) -> bool:
return any(signal in evaluator_prompt for signal in CONTAMINATION_SIGNALS)
```
If contamination is detected:
- Run stops
- Error message explains the violation
- `contamination_warning: true` set in results if eval ran before detection