@blundergoat/goat-flow
Version:
AI coding agent harness and local dashboard for Claude Code, OpenAI Codex, Google Antigravity, and GitHub Copilot - setup audits, guardrails, structured skills, deny hooks, and persistent learning loops.
117 lines (84 loc) • 8.39 kB
Markdown
goat-flow-reference-version: "1.12.1"
# goat-review Reference Examples
Extended Excuse/Reality table, finding-format examples, and constraint rationale for `/goat-review`.
## Worked Example - Refuted Template Suspicion
Use this shape when Pass 1 raises a plausible template or output-format suspicion and Pass 2 disproves it. This example uses real `goat-review` files and anchors.
**Review surface:** `SKILL.md`, `references/automated-review.md`, `references/refuter-spec.md`
**Pass 1 suspicion (diff-only):**
- `SKILL.md` (search: `Review Integrity`) may omit the automated-review and refuter integrity lines even though the references require them.
**Pass 2 actions:**
1. Open `SKILL.md` and re-read `Review Integrity`.
2. Search for `Automated-reviewer overlap`.
3. Search for `Refuter pass`.
4. Open `references/automated-review.md` (search: `Automated-reviewer overlap`) and `references/refuter-spec.md` (search: `Review Integrity Extension`) to compare the reference contract with the main output template.
**Expected outcome:**
- Mark the suspicion `REFUTED` when `SKILL.md` contains both output-template lines.
- Do not surface a final finding.
- Write a refutation ledger entry:
- Original suspicion: `SKILL.md` may omit automated-review and refuter integrity lines.
- Refuting evidence: `SKILL.md` (search: `Automated-reviewer overlap`); `SKILL.md` (search: `Refuter pass`).
- Rationale: the main template now exposes both conditional integrity extensions, so the references are reachable during normal review output.
**Zero-finding final note:** "Checked Review Integrity against both optional references; no issue surfaced because the output template includes the required conditional lines."
## Worked Example - Full Output Block (end-to-end)
The refuted-suspicion example above walks one Pass 1 -> Pass 2 transition. This example shows the **entire Output Format filled in**, so a cold reader sees how every section is populated with real values rather than the empty template in `SKILL.md` (search: `## Output Format`). Following the convention in `## Finding Format Examples` below, it uses real `goat-review` anchors: treat it as the block to emit when a reviewed diff *weakens* the cited rules. The illustrative review surface is a small PR that (a) replaces the Blast Radius Rule's external call-site grep with a `// TODO` comment and (b) deletes the refuter authentication pre-flight. Proof class is `STATIC` for both findings because the review surface is documentation: the reviewer verifies by re-reading the live file, not by executing it.
**Review surface:** PR mode (`gh pr view 412`); base `origin/main`, head `a1b2c3d`; 2 files, 86 changed lines; under the 20-file / 3000-line chunking threshold, so no chunking. Copilot reviewed the PR. Pass 3 auto-triggered by the `[MUST:needs-decision]` finding (trigger 3); the refuter received the full findings list and upheld both.
```markdown
## TL;DR
Reviewed PR #412 (2 files, 86 lines): the diff removes the contract-change call-site search and the refuter auth gate. One MUST and one SHOULD surfaced; both confirmed in Pass 2 and upheld cross-model. Do not merge as-is.
## Review Integrity
- Scope snapshot: source=PR, base=origin/main, head=a1b2c3d, uncommitted=no, chunking=no
- Files opened in Pass 2: 2/2 (diff-only: none)
- Evidence: 2 OBSERVED / 0 INFERRED
- Refutations logged: 1
- Size: 2 files, 86 lines (chunked: no)
- Automated-reviewer overlap: 1 overlap with copilot-pull-request-reviewer, 1 net-new
- Refuter pass: yes; confirmed=2, refuted=0, unresolved=0, leads-verified=0, model=codex
- Degradation flags: spec-drift-skipped
- Conclusion: confident
## Findings
### MUST
- [MUST:needs-decision] [CONFIRMED-CROSS-MODEL] [overlap:copilot-pull-request-reviewer] **Blast Radius Rule no longer forces a call-site search on contract changes** `SKILL.md` (search: `Blast Radius Rule`) - the diff replaces the `rg`/`grep` consumer search with a `// TODO`, so a signature, return-type, or event-shape change can now ship without a single consumer verified; downstream callers break at runtime with no review signal. Needs-decision because relaxing the rule is a policy call the author may have intended. | Footgun: none | Evidence: OBSERVED | Proof: STATIC
### SHOULD
- [SHOULD:patch] [CONFIRMED-CROSS-MODEL] [new] **Refuter pre-flight auth check deleted** `references/refuter-spec.md` (search: `Pre-flight Check`) - removing the `codex login status` / `claude auth status` gate lets Pass 3 spawn an unauthenticated refuter that fails silently and is recorded as a clean `confirmed=0` instead of `cross-model-refuter-failed`; reviews then read as cross-verified when they were not. | Footgun: none | Evidence: OBSERVED | Proof: STATIC
## Systemic Patterns
<!-- 2 findings with distinct root causes and distinct fixes - no systemic parent emitted -->
## Pre-existing Nearby
- None.
## Pre-existing Issues
- None in scope for this diff.
## Breaking Changes
- None to the emitted skill contract; both findings remove safeguards rather than change a public interface.
## Top 5 Risks (cross-tier)
1. [MUST:needs-decision] **Contract changes ship without call-site verification** `SKILL.md` (search: `Blast Radius Rule`) - highest-harm regression: silent runtime breakage in unverified consumers.
2. [SHOULD:patch] **Unauthenticated refuter passes as success** `references/refuter-spec.md` (search: `Pre-flight Check`) - false cross-model confidence on every Pass 3.
## Ship Verdict
Decision: **NO**
Reasoning: The MUST finding (Blast Radius Rule removal) remains unaddressed by the diff, which forces NO per the Ship Verdict rule (unresolved MUST -> NO); the cross-model refuter upheld both findings. The SHOULD compounds the risk by masking refuter failures. Review Integrity is `confident`, so the verdict is not downgraded further.
Confidence: HIGH
## What's Good
- The diff keeps the two-pass discipline and the Refutation Ledger path intact.
## What I Didn't Examine
- The CI workflow that invokes the refuter (out of diff scope); flagged for the author.
```
## Finding Format Examples
Use concrete harm and proof class. These examples use real anchors from this skill surface; apply them when a reviewed diff removes, bypasses, or contradicts the cited rule.
**Systemic pattern:**
```markdown
## Systemic Patterns
- [SHOULD:patch] **Group repeated output-contract drift under one parent** - affected anchors: `SKILL.md` (search: `MUST group 3+ related findings as systemic patterns`), `SKILL.md` (search: `## Systemic Patterns`); repeated failure: three related findings share one output-contract root cause; harm: reviewers scatter one root cause across separate bullets, making the required fix easy to under-scope. | Evidence: OBSERVED | Proof: STATIC
```
**PR automated-review overlap:**
```markdown
- [SHOULD:patch] [overlap:copilot-pull-request-reviewer] **Report PR metadata ingestion failure explicitly** `references/automated-review.md` (search: `automated-review-uningested`) - If `gh pr view` returns `reviews,comments` but parsing fails, the review must degrade explicitly instead of reporting no bot findings; otherwise duplicated findings look net-new. | Footgun: none | Evidence: OBSERVED | Proof: STATIC
```
## Excuse/Reality Table (Full)
| Excuse | Reality |
|--------|---------|
| "Trusted author wrote it, Pass 2 will just refute everything - skip it" | In-group trust has historically produced the worst misses in auth/signing/rate-limit code. Open the files. |
| "CI is green, so boundary and signing edges are already covered" | CI tests what was thought of. Review looks for what wasn't. Green CI raises, not answers, the Pass-2 question. |
| "Tight window + demo tomorrow - MAY-only cosmetic pass is proportionate" | An incomplete review merged into a demo window is worse than a `coverage-degraded` conclusion returned on time. |
| "Findings would be zero anyway, so Review Integrity is paperwork" | Review Integrity IS the zero-findings signal. `files-not-opened` tells the reader you stopped early. |
| "The symbol is unique enough that grep is overkill" | Unique symbols still need external verification because the bug is in the consumer, not the emitter. |
| "Refuted suspicions are noise - logging them wastes tokens" | The ledger is the integrity surface. Without it, REFUTED is indistinguishable from "didn't bother to check." |