import type { TestInfo } from '@playwright/test';
import { z } from 'zod/v4';
import type { GptClient } from '../../../clients/GptClient';
import type { FlowMetadata } from '../../../models/FlowMetadata';
import type { DonobuExtendedPage } from '../../page/DonobuExtendedPage';
/**
* # Test Failure Triage System
*
* Transforms Playwright test failures in Donobu-powered test suites into structured,
* actionable **treatment plans**. A treatment plan tells both humans and automation *why*
* the test failed, *how confident* the system is in that diagnosis, and *what to do next*
* — whether that is retrying the automation, deleting a stale page.ai cache, filing a
* product bug, or updating selectors in the test code.
*
* ---
*
* ## Architecture Overview
*
* The system operates in two phases that run in sequence:
*
* ### Phase 1 — Evidence Collection (`gatherTestFailureEvidence`)
*
* Called automatically by the Donobu test extension (`testExtension.ts`) in the
* Playwright `afterEach` hook whenever a test fails. This phase:
*
* 1. Extracts error messages, stack traces, and assertion details from `TestInfo`.
* 2. Loads the Donobu flow metadata (objective, run mode, state) and recent tool call
* history from the persistence layer.
* 3. Fetches **historical runs** of the same flow (by name) from the flows manager to
* detect flakiness, regression patterns, and prior self-heal success.
* 4. Captures the **failure screenshot** (last tool call screenshot from the current
* run) and the **baseline screenshot** (last tool call screenshot from the most
* recent successful historical run) for visual comparison.
* 5. Reads the source of the failing test case for contextual grounding.
* 6. Runs the **heuristic classifier** (`deriveHeuristicAssessment`) which uses
* rule-based pattern matching over errors, tool calls, stale-cache indicators,
* and historical signals to produce a preliminary diagnosis — including a failure
* reason, confidence score, and retry recommendation.
* 7. Persists the complete evidence bundle (JSON + screenshots) to disk as a
* `FailureEvidenceRecord`.
*
* ### Phase 2 — Treatment Plan Generation (`generateTreatmentPlanFromEvidence`)
*
* Called by the Donobu CLI (`donobu-cli.ts`) after evidence files are collected. This
* phase reads the persisted evidence and **requires a GPT client** — there is no
* heuristic-only fallback path. It:
*
* 1. Sends the full evidence bundle — including screenshots as vision input — to a
* GPT model with a detailed system prompt, requesting a structured `TreatmentPlan`
* response.
* 2. **Reconciles** the GPT plan with heuristic signals (`reconcileTreatmentPlan`) to
* enforce invariants the LLM might miss (e.g., forcing `shouldRetryAutomation` when
* historical data shows prior self-heal success, or overriding retry-step priority
* for stale-cache scenarios).
* 3. Returns the final `TreatmentPlan` for the CLI to act on — potentially triggering
* automatic retries, cache deletion, or surfacing remediation steps to the engineer.
*
* ---
*
* ## Data Signals
*
* The triage system draws from several complementary data sources, each targeting
* different failure modes:
*
* | Signal                  | Source                           | What it reveals                                       |
* |-------------------------|----------------------------------|-------------------------------------------------------|
* | Error messages & stacks | `TestInfo.errors`                | Direct cause (assertion, timeout, selector)           |
* | Tool call history       | `FlowsPersistence`               | What actions the AI took and their outcomes           |
* | Tool call parameters    | `ToolCall.parameters`            | Exact selectors, URLs, and inputs attempted           |
* | Flow metadata           | `DonobuExtendedPage._dnb`        | Run mode, objective, allowed tools, timing            |
* | Stale cache indicators  | Derived from above               | Whether page.ai cache staleness is the root cause     |
* | Historical flow runs    | `DonobuFlowsManager.getFlows`    | Flakiness, regression patterns, prior self-heal       |
* | Failure screenshot      | Last tool call screenshot        | Visual state of the page when the failure occurred    |
* | Baseline screenshot     | Last successful run's screenshot | Visual reference for what the page *should* look like |
* | Test source snippet     | TypeScript AST parsing           | The test's expectations and structure                 |
*
* ---
*
* ## Failure Classification
*
* Every treatment plan assigns one of the following failure reasons:
*
* - `SELECTOR_REGRESSION` — UI locators have gone stale.
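* - `STALE_CACHE_OR_INSTRUCTIONS` — The page.ai deterministic cache is outdated. Note that
* this value is not a member of `FailureReasonSchema` as declared in this file; the
* stale-cache signals that are declared here are `StaleCacheIndicators` and
* `automationDirectives.clearPageAiCache`.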
* - `TIMING_OR_SYNCHRONISATION` — Race conditions, slow loads, or flaky waits.
* - `NETWORK_OR_DEPENDENCY` — External service failures or connectivity issues.
* - `APPLICATION_DEFECT` — A real bug in the product under test.
* - `ASSERTION_DRIFT` — Test expectations no longer match valid application behavior.
* - `AUTOMATION_SCRIPT_ISSUE` — The test script itself is incorrect.
* - `AUTHENTICATION_FAILURE` — Session/auth problems prevented the test from running.
* - `ENVIRONMENT_CONFIGURATION` — Infrastructure or environment misconfiguration.
* - `TEST_DATA_UNAVAILABLE` — Required test data is missing or invalid.
* - `UNKNOWN` — Insufficient signal to determine the cause.
*
* ---
*
* ## Getting the Most Out of This System
*
* ### 1. Name your flows consistently
*
* Historical analysis works by matching flows by name. If every test uses a unique,
* stable flow name, the system can compare the current failure against all prior runs
* of the same flow and detect flakiness, regressions, and self-heal patterns:
*
* ```ts
* test('checkout flow adds item and completes purchase', async ({ page }) => {
*   const ai = await page.ai('Checkout — add item and purchase', { ... });
*   // ...
* });
* ```
*
* ### 2. Let evidence persist to disk
*
* The default behavior writes evidence JSON and screenshots to the run directory. This
* enables the CLI's Phase 2 to enrich the diagnosis with GPT and visual comparison.
* Ensure `DONOBU_TRIAGE_DISABLED` is not set, and that `runDirectory` is writable:
*
* ```ts
* // Evidence is gathered automatically on failure — no extra code needed.
* // To customize the output directory:
* await gatherTestFailureEvidence(testInfo, page, {
*   runDirectory: '/path/to/custom/output',
* });
* ```
*
* ### 3. Ensure a GPT client is available
*
* A GPT client is **required** for treatment plan generation. The LLM performs semantic
* reasoning: it reads the test source, interprets tool call parameters, compares
* screenshots visually, and produces human-readable remediation steps. The CLI
* instantiates a GPT client automatically from configured credentials.
*
* ### 4. Use deterministic (cached) mode for stable flows
*
* When flows run in `DETERMINISTIC` mode with a page.ai cache, the triage system
* activates its stale-cache detection pipeline — a composite scoring system that
* weighs whether the cached instructions have gone stale versus whether the failure
* is a legitimate test issue. This is the system's strongest diagnostic capability.
*
* ### 5. Inspect the evidence files
*
* Each failure produces a `failure-evidence-<id>.json` file (plus optional PNG
* screenshots) in the run directory. These files are self-contained and can be
* re-processed, shared for debugging, or fed back into `generateTreatmentPlanFromEvidence`
* independently.
*
* ---
*
* ## Key Exports
*
* - `gatherTestFailureEvidence` — Phase 1 entry point. Runs in the Playwright `afterEach`
* hook; the Donobu test extension calls it automatically when a test fails.
* - `generateTreatmentPlanFromEvidence` — Phase 2 entry point. Requires a `GptClient` and a
* `FailureEvidenceRecord`.
* - `TreatmentPlan` — The Zod schema defining the treatment plan structure.
* - `FailureReasonSchema` — The Zod enum of all possible failure classifications.
*/
declare const FailureReasonSchema: z.ZodEnum<{
    UNKNOWN: "UNKNOWN";
    AUTOMATION_SCRIPT_ISSUE: "AUTOMATION_SCRIPT_ISSUE";
    SELECTOR_REGRESSION: "SELECTOR_REGRESSION";
    TIMING_OR_SYNCHRONISATION: "TIMING_OR_SYNCHRONISATION";
    ASSERTION_DRIFT: "ASSERTION_DRIFT";
    APPLICATION_DEFECT: "APPLICATION_DEFECT";
    AUTHENTICATION_FAILURE: "AUTHENTICATION_FAILURE";
    ENVIRONMENT_CONFIGURATION: "ENVIRONMENT_CONFIGURATION";
    TEST_DATA_UNAVAILABLE: "TEST_DATA_UNAVAILABLE";
    NETWORK_OR_DEPENDENCY: "NETWORK_OR_DEPENDENCY";
}>;
type FailureReason = z.infer<typeof FailureReasonSchema>;
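/*
Example (illustrative sketch, not part of the published API): when a failure
reason arrives as an untrusted string, for instance from a hand-edited evidence
file, it can be checked against the enum before use. With the real export,
`FailureReasonSchema.safeParse(value)` performs this validation. The version
below uses a local copy of the ten values so that it runs without Zod;
`isFailureReason` and `toFailureReason` are hypothetical helper names.

```ts
// Local copy of the ten values declared in FailureReasonSchema above.
const FAILURE_REASONS = [
  'UNKNOWN',
  'AUTOMATION_SCRIPT_ISSUE',
  'SELECTOR_REGRESSION',
  'TIMING_OR_SYNCHRONISATION',
  'ASSERTION_DRIFT',
  'APPLICATION_DEFECT',
  'AUTHENTICATION_FAILURE',
  'ENVIRONMENT_CONFIGURATION',
  'TEST_DATA_UNAVAILABLE',
  'NETWORK_OR_DEPENDENCY',
] as const;

type LocalFailureReason = (typeof FAILURE_REASONS)[number];

// Type guard: true only for strings that are members of the enum.
function isFailureReason(value: unknown): value is LocalFailureReason {
  return (
    typeof value === 'string' &&
    (FAILURE_REASONS as readonly string[]).indexOf(value) !== -1
  );
}

// Hypothetical helper: fall back to UNKNOWN for anything outside the enum.
function toFailureReason(value: unknown): LocalFailureReason {
  return isFailureReason(value) ? value : 'UNKNOWN';
}
```

Falling back to `UNKNOWN` keeps downstream code total over the enum instead of
forcing every caller to handle a parse error.
*/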
declare const RemediationCategorySchema: z.ZodEnum<{
    UNKNOWN: "UNKNOWN";
    RETRY_AUTOMATION: "RETRY_AUTOMATION";
    UPDATE_TEST_LOGIC: "UPDATE_TEST_LOGIC";
    UPDATE_SELECTORS: "UPDATE_SELECTORS";
    ADJUST_TIMING: "ADJUST_TIMING";
    REFINE_ASSERTIONS: "REFINE_ASSERTIONS";
    FIX_APPLICATION: "FIX_APPLICATION";
    VALIDATE_AUTHENTICATION: "VALIDATE_AUTHENTICATION";
    CHECK_ENVIRONMENT: "CHECK_ENVIRONMENT";
    REFRESH_TEST_DATA: "REFRESH_TEST_DATA";
    STABILIZE_DEPENDENCIES: "STABILIZE_DEPENDENCIES";
    ESCALATE_MANUAL_REVIEW: "ESCALATE_MANUAL_REVIEW";
}>;
type RemediationCategory = z.infer<typeof RemediationCategorySchema>;
declare const RemediationStepSchema: z.ZodObject<{
    category: z.ZodEnum<{
        UNKNOWN: "UNKNOWN";
        RETRY_AUTOMATION: "RETRY_AUTOMATION";
        UPDATE_TEST_LOGIC: "UPDATE_TEST_LOGIC";
        UPDATE_SELECTORS: "UPDATE_SELECTORS";
        ADJUST_TIMING: "ADJUST_TIMING";
        REFINE_ASSERTIONS: "REFINE_ASSERTIONS";
        FIX_APPLICATION: "FIX_APPLICATION";
        VALIDATE_AUTHENTICATION: "VALIDATE_AUTHENTICATION";
        CHECK_ENVIRONMENT: "CHECK_ENVIRONMENT";
        REFRESH_TEST_DATA: "REFRESH_TEST_DATA";
        STABILIZE_DEPENDENCIES: "STABILIZE_DEPENDENCIES";
        ESCALATE_MANUAL_REVIEW: "ESCALATE_MANUAL_REVIEW";
    }>;
    summary: z.ZodString;
    details: z.ZodString;
}, z.core.$strip>;
type RemediationStep = z.infer<typeof RemediationStepSchema>;
declare const AdditionalDataRequestSchema: z.ZodObject<{
    description: z.ZodString;
    suggestedSources: z.ZodDefault<z.ZodArray<z.ZodString>>;
}, z.core.$strip>;
type AdditionalDataRequest = z.infer<typeof AdditionalDataRequestSchema>;
declare const AutomationDirectivesSchema: z.ZodObject<{
    clearPageAiCache: z.ZodOptional<z.ZodOptional<z.ZodBoolean>>;
    targetTestFile: z.ZodOptional<z.ZodOptional<z.ZodString>>;
    targetProject: z.ZodOptional<z.ZodOptional<z.ZodString>>;
    additionalPlaywrightArgs: z.ZodOptional<z.ZodOptional<z.ZodArray<z.ZodString>>>;
}, z.core.$strip>;
type AutomationDirectives = z.infer<typeof AutomationDirectivesSchema>;
declare const TreatmentPlan: z.ZodObject<{
    failureSummary: z.ZodString;
    failureReason: z.ZodEnum<{
        UNKNOWN: "UNKNOWN";
        AUTOMATION_SCRIPT_ISSUE: "AUTOMATION_SCRIPT_ISSUE";
        SELECTOR_REGRESSION: "SELECTOR_REGRESSION";
        TIMING_OR_SYNCHRONISATION: "TIMING_OR_SYNCHRONISATION";
        ASSERTION_DRIFT: "ASSERTION_DRIFT";
        APPLICATION_DEFECT: "APPLICATION_DEFECT";
        AUTHENTICATION_FAILURE: "AUTHENTICATION_FAILURE";
        ENVIRONMENT_CONFIGURATION: "ENVIRONMENT_CONFIGURATION";
        TEST_DATA_UNAVAILABLE: "TEST_DATA_UNAVAILABLE";
        NETWORK_OR_DEPENDENCY: "NETWORK_OR_DEPENDENCY";
    }>;
    confidence: z.ZodNumber;
    observedIndicators: z.ZodDefault<z.ZodArray<z.ZodString>>;
    remediationSteps: z.ZodDefault<z.ZodArray<z.ZodObject<{
        category: z.ZodEnum<{
            UNKNOWN: "UNKNOWN";
            RETRY_AUTOMATION: "RETRY_AUTOMATION";
            UPDATE_TEST_LOGIC: "UPDATE_TEST_LOGIC";
            UPDATE_SELECTORS: "UPDATE_SELECTORS";
            ADJUST_TIMING: "ADJUST_TIMING";
            REFINE_ASSERTIONS: "REFINE_ASSERTIONS";
            FIX_APPLICATION: "FIX_APPLICATION";
            VALIDATE_AUTHENTICATION: "VALIDATE_AUTHENTICATION";
            CHECK_ENVIRONMENT: "CHECK_ENVIRONMENT";
            REFRESH_TEST_DATA: "REFRESH_TEST_DATA";
            STABILIZE_DEPENDENCIES: "STABILIZE_DEPENDENCIES";
            ESCALATE_MANUAL_REVIEW: "ESCALATE_MANUAL_REVIEW";
        }>;
        summary: z.ZodString;
        details: z.ZodString;
    }, z.core.$strip>>>;
    additionalDataRequests: z.ZodDefault<z.ZodArray<z.ZodObject<{
        description: z.ZodString;
        suggestedSources: z.ZodDefault<z.ZodArray<z.ZodString>>;
    }, z.core.$strip>>>;
    shouldRetryAutomation: z.ZodBoolean;
    requiresCodeChange: z.ZodBoolean;
    requiresProductFix: z.ZodBoolean;
    notes: z.ZodOptional<z.ZodString>;
    automationDirectives: z.ZodOptional<z.ZodObject<{
        clearPageAiCache: z.ZodOptional<z.ZodOptional<z.ZodBoolean>>;
        targetTestFile: z.ZodOptional<z.ZodOptional<z.ZodString>>;
        targetProject: z.ZodOptional<z.ZodOptional<z.ZodString>>;
        additionalPlaywrightArgs: z.ZodOptional<z.ZodOptional<z.ZodArray<z.ZodString>>>;
    }, z.core.$strip>>;
}, z.core.$strip>;
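/*
Example (illustrative sketch): the overview says the CLI acts on the final plan
by retrying, deleting the page.ai cache, or surfacing remediation steps. One
way a consumer might turn the plan's boolean flags into a single next action is
shown below. `PlanFlags`, `NextAction`, and `chooseNextAction` are hypothetical
names, and the precedence order is an assumption, not the CLI's actual logic.

```ts
// Minimal local shape mirroring a few TreatmentPlan fields declared above.
type PlanFlags = {
  shouldRetryAutomation: boolean;
  requiresCodeChange: boolean;
  requiresProductFix: boolean;
  automationDirectives?: { clearPageAiCache?: boolean };
};

// Hypothetical action names; the real CLI may act differently.
type NextAction =
  | 'CLEAR_CACHE_AND_RETRY'
  | 'RETRY'
  | 'FILE_PRODUCT_BUG'
  | 'UPDATE_TEST_CODE'
  | 'MANUAL_REVIEW';

function chooseNextAction(plan: PlanFlags): NextAction {
  // Automated recovery is tried first because it needs no human time.
  if (plan.shouldRetryAutomation) {
    const clearCache =
      plan.automationDirectives !== undefined &&
      plan.automationDirectives.clearPageAiCache === true;
    return clearCache ? 'CLEAR_CACHE_AND_RETRY' : 'RETRY';
  }
  // A product defect outranks a test change: fixing the test would hide it.
  if (plan.requiresProductFix) return 'FILE_PRODUCT_BUG';
  if (plan.requiresCodeChange) return 'UPDATE_TEST_CODE';
  return 'MANUAL_REVIEW';
}
```
*/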
type SanitizedFlowMetadata = {
    id: string;
    name: string | null;
    runMode: FlowMetadata['runMode'];
    state: FlowMetadata['state'];
    targetWebsite: string;
    overallObjective: string | null;
    allowedTools: string[];
    envVars: string[] | null;
    startedAt: number | null;
    completedAt: number | null;
    maxToolCalls: number | null;
    gptConfigName: string | null;
    defaultMessageDuration: number | null;
    resultSummary: string | null;
};
type SummarizedToolCall = {
    id: string;
    toolName: string;
    success: boolean;
    outcomeSummary: string;
    durationMs: number;
    page: string;
    startedAtIso: string;
    completedAtIso: string;
    parameters: string | null;
};
type ErrorSummary = {
    message?: string;
    stack?: string;
    value?: string;
    actual?: string;
    expected?: string;
    name?: string;
    location?: string;
    snippet?: string;
};
type FailureContext = {
    testCase: {
        title: string;
        file?: string;
        projectName: string;
        status: TestInfo['status'];
        expectedStatus: TestInfo['expectedStatus'];
        retry: number;
        repeatEachIndex: number;
        workerIndex: number;
        timeout: number;
        duration: number;
        annotations: TestInfo['annotations'];
        autoHealEnabled?: boolean;
    };
    failure: {
        errors: ErrorSummary[];
        attachments: {
            name: string;
            contentType: string;
            path?: string | null;
        }[];
    };
    donobuFlow: {
        metadata: SanitizedFlowMetadata | null;
        recentToolCalls: SummarizedToolCall[];
    };
    testSnippet: string | null;
    heuristics: HeuristicAssessment;
    flowHistory: FlowHistorySummary | null;
};
type FailureEvidenceRecord = {
    schemaVersion: 1 | 2;
    evidenceId: string;
    runId: string | null;
    runDirectory: string;
    collectedAtIso: string;
    failureContext: FailureContext;
    failureScreenshotPath: string | null;
    baselineScreenshotPath: string | null;
};
type GatherTestFailureEvidenceOptions = {
    runDirectory?: string;
    persistToDisk?: boolean;
    force?: boolean;
};
type GatherTestFailureEvidenceResult = {
    evidence: FailureEvidenceRecord;
    filePath: string | null;
};
type HeuristicAssessment = {
    failureReason: FailureReason;
    evidence: string[];
    confidence: number;
    failureSummary: string;
    shouldRetryAutomation: boolean;
    requiresCodeChange: boolean;
    requiresProductFix: boolean;
    remediationSteps: RemediationStep[];
    additionalDataRequests: AdditionalDataRequest[];
    notes?: string;
    occurredDuringPageAi: boolean;
    staleCacheIndicators: StaleCacheIndicators;
    historicalSignals: HistoricalSignals | null;
};
type StaleCacheIndicators = {
    usedDeterministicMode: boolean;
    selectorFailedDuringPageAi: boolean;
    failedAfterPageAiCompleted: boolean;
    isRetryAttempt: boolean;
    quickFailurePattern: boolean;
    toolCallsShowSelectorIssues: boolean;
    assertionsFailedAfterSuccessfulPageAi: boolean;
};
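/*
Example (illustrative sketch): the overview describes stale-cache detection as
a composite score over these indicators, active only for flows that ran in
`DETERMINISTIC` mode. The library's actual weights and thresholds are not
visible in this declaration file, so the integer weights below are assumptions
chosen only to show the shape of such a score; `staleCacheScore` is a
hypothetical name.

```ts
type Indicators = {
  usedDeterministicMode: boolean;
  selectorFailedDuringPageAi: boolean;
  failedAfterPageAiCompleted: boolean;
  isRetryAttempt: boolean;
  quickFailurePattern: boolean;
  toolCallsShowSelectorIssues: boolean;
  assertionsFailedAfterSuccessfulPageAi: boolean;
};

// Assumed weights. Positive values point toward a stale cache; negative
// values point toward a genuine test or product issue.
const WEIGHTS: Record<keyof Indicators, number> = {
  usedDeterministicMode: 2,
  selectorFailedDuringPageAi: 3,
  failedAfterPageAiCompleted: -1,
  isRetryAttempt: 1,
  quickFailurePattern: 2,
  toolCallsShowSelectorIssues: 2,
  assertionsFailedAfterSuccessfulPageAi: -3,
};

function staleCacheScore(indicators: Indicators): number {
  // Without a deterministic cache in play, nothing can be stale.
  if (!indicators.usedDeterministicMode) return 0;
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as Array<keyof Indicators>) {
    if (indicators[key]) total += WEIGHTS[key];
  }
  // Clamp at zero so counter-evidence cannot produce a negative score.
  return Math.max(0, total);
}
```

Integer weights keep the arithmetic exact, which makes any threshold applied to
the score easy to reason about and to test.
*/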
type HistoricalFlowRun = {
    id: string;
    state: string;
    runMode: FlowMetadata['runMode'];
    startedAt: number | null;
    completedAt: number | null;
    durationMs: number | null;
};
type FlowHistorySummary = {
    flowName: string;
    totalRuns: number;
    successCount: number;
    failureCount: number;
    otherCount: number;
    passRate: number;
    recentRuns: HistoricalFlowRun[];
    currentStreak: {
        state: 'SUCCESS' | 'FAILED' | 'MIXED';
        length: number;
    };
    lastSuccessfulRunId: string | null;
    queryWindowDays: number;
    queriedAt: string;
};
type HistoricalSignals = {
    flakinessScore: number;
    regressionLikelihood: number;
    recentPassRate: number;
    priorSelfHealSuccess: boolean;
    cacheWasRecentlyValid: boolean;
};
declare const TRIAGE_PERSISTENCE_FILE_IDS: {
    readonly evidence: "triage-evidence.json";
    readonly failureScreenshot: "triage-failure-screenshot.png";
    readonly baselineScreenshot: "triage-baseline-screenshot.png";
};
/**
* Compresses a set of historical flow runs into an aggregate summary compact
* enough for both heuristic reasoning and inclusion in GPT prompts.
*/
declare function summarizeFlowHistory(flowName: string, flows: FlowMetadata[]): FlowHistorySummary;
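/*
Example (illustrative sketch): a rough picture of the aggregation that
`summarizeFlowHistory` is described as performing, computed over a simplified
run shape. Several details are assumptions because the implementation is not
visible here: terminal states are spelled 'SUCCESS' and 'FAILED', `passRate` is
successes divided by all runs, and the streak is 'MIXED' when the newest run
has neither outcome. `RunLike`, `SummaryLike`, and `summarizeRuns` are
hypothetical names.

```ts
type RunLike = { id: string; state: string; startedAt: number | null };

type SummaryLike = {
  totalRuns: number;
  successCount: number;
  failureCount: number;
  otherCount: number;
  passRate: number;
  currentStreak: { state: 'SUCCESS' | 'FAILED' | 'MIXED'; length: number };
  lastSuccessfulRunId: string | null;
};

function summarizeRuns(runs: RunLike[]): SummaryLike {
  // Newest first. Runs without a start time are treated as oldest.
  const sorted = runs
    .slice()
    .sort((a, b) => (b.startedAt ?? 0) - (a.startedAt ?? 0));

  let successCount = 0;
  let failureCount = 0;
  let lastSuccessfulRunId: string | null = null;
  for (const run of sorted) {
    if (run.state === 'SUCCESS') {
      successCount++;
      // sorted is newest first, so the first success seen is the latest one.
      if (lastSuccessfulRunId === null) lastSuccessfulRunId = run.id;
    } else if (run.state === 'FAILED') {
      failureCount++;
    }
  }

  // Streak: how many of the newest runs share the newest run's outcome.
  const head = sorted.length > 0 ? sorted[0].state : null;
  let streakState: 'SUCCESS' | 'FAILED' | 'MIXED' = 'MIXED';
  let streakLength = 0;
  if (head === 'SUCCESS' || head === 'FAILED') {
    streakState = head === 'SUCCESS' ? 'SUCCESS' : 'FAILED';
    while (
      streakLength < sorted.length &&
      sorted[streakLength].state === head
    ) {
      streakLength++;
    }
  }

  const totalRuns = sorted.length;
  return {
    totalRuns,
    successCount,
    failureCount,
    otherCount: totalRuns - successCount - failureCount,
    passRate: totalRuns === 0 ? 0 : successCount / totalRuns,
    currentStreak: { state: streakState, length: streakLength },
    lastSuccessfulRunId,
  };
}
```
*/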
/**
* Derives actionable signals from historical flow run data to feed into the
* heuristic classifier: flakiness, regression likelihood, prior self-heal
* success, and whether the page.ai cache was recently validated.
*/
declare function deriveHistoricalSignals(history: FlowHistorySummary): HistoricalSignals;
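/*
Example (illustrative sketch): two of the signals named above, flakiness and
regression, can be told apart from the ordering of past outcomes alone. The
formulas below are assumptions for illustration and are not taken from the
library: flakiness is the fraction of adjacent run pairs whose outcome flipped,
and a regression is "newest runs all fail, every older run passed". Note that
the real `regressionLikelihood` is a number, while this sketch returns a
boolean. `flakinessScore` and `looksLikeRegression` are hypothetical names.

```ts
type Outcome = 'SUCCESS' | 'FAILED';

// Fraction of adjacent pairs whose outcome differs.
// 0 means perfectly stable, 1 means strictly alternating.
function flakinessScore(outcomes: Outcome[]): number {
  if (outcomes.length < 2) return 0;
  let flips = 0;
  for (let i = 1; i < outcomes.length; i++) {
    if (outcomes[i] !== outcomes[i - 1]) flips++;
  }
  return flips / (outcomes.length - 1);
}

// `outcomes` is ordered newest first.
function looksLikeRegression(outcomes: Outcome[]): boolean {
  const firstPass = outcomes.indexOf('SUCCESS');
  // -1: the flow never passed. 0: the newest run passed. Neither is a regression.
  if (firstPass <= 0) return false;
  // Every run from the first pass back to the oldest must also be a pass.
  return outcomes.slice(firstPass).every((o) => o === 'SUCCESS');
}
```

A flaky flow and a regressed flow can share the same pass rate; separating them
matters because the first suggests a retry while the second suggests a real
change in the application or the test.
*/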
/**
* Builds the heuristic triage assessment by combining rule-based inference,
* contextual flags, and derived remediation guidance ahead of GPT enrichment.
*/
declare function deriveHeuristicAssessment(testInfo: TestInfo, errorSummaries: ErrorSummary[], toolCalls: SummarizedToolCall[], flowMetadata: SanitizedFlowMetadata | null, flowHistory?: FlowHistorySummary | null): HeuristicAssessment;
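/*
Example (illustrative sketch): the overview describes the heuristic classifier
as rule-based pattern matching over errors. A first-match-wins rule table is
one common way to structure that. The patterns, their order, and the confidence
values below are all assumptions made for illustration; the library's actual
rules are not visible in this declaration file. `RULES` and `classifyErrors`
are hypothetical names.

```ts
type Reason =
  | 'NETWORK_OR_DEPENDENCY'
  | 'AUTHENTICATION_FAILURE'
  | 'SELECTOR_REGRESSION'
  | 'TIMING_OR_SYNCHRONISATION'
  | 'ASSERTION_DRIFT'
  | 'UNKNOWN';

type Rule = { pattern: RegExp; reason: Reason; confidence: number };

// Ordered from most to least specific; the first matching rule wins.
const RULES: Rule[] = [
  { pattern: /net::ERR_|ECONNREFUSED|ENOTFOUND/i, reason: 'NETWORK_OR_DEPENDENCY', confidence: 0.7 },
  { pattern: /\b401\b|\b403\b|unauthori[sz]ed/i, reason: 'AUTHENTICATION_FAILURE', confidence: 0.6 },
  { pattern: /strict mode violation|waiting for locator/i, reason: 'SELECTOR_REGRESSION', confidence: 0.6 },
  { pattern: /timeout|timed out/i, reason: 'TIMING_OR_SYNCHRONISATION', confidence: 0.5 },
  { pattern: /expect\(|toEqual|toHaveText/i, reason: 'ASSERTION_DRIFT', confidence: 0.4 },
];

function classifyErrors(messages: string[]): { reason: Reason; confidence: number } {
  for (const rule of RULES) {
    // The patterns carry no `g` flag, so `test` is stateless across calls.
    if (messages.some((m) => rule.pattern.test(m))) {
      return { reason: rule.reason, confidence: rule.confidence };
    }
  }
  // No rule matched: report low confidence rather than guess.
  return { reason: 'UNKNOWN', confidence: 0.1 };
}
```

Ordering matters here: a locator wait that times out matches both the selector
and the timeout rule, and this sketch deliberately lets the selector rule win.
*/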
/**
* Aligns the GPT-authored treatment plan with heuristic safeguards, especially
* for page.ai regressions where we prefer automated retries over manual toil.
*/
declare function reconcileTreatmentPlan(plan: z.infer<typeof TreatmentPlan>, heuristics: HeuristicAssessment): z.infer<typeof TreatmentPlan>;
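/*
Example (illustrative sketch): the overview names two invariants that
reconciliation enforces on the GPT-authored plan: forcing
`shouldRetryAutomation` when history shows a prior self-heal success, and
giving the retry step priority in stale-cache scenarios. The sketch below
applies those two rules to a simplified plan. `RetryPlan`, `Signals`,
`staleCacheSuspected`, and `reconcileSketch` are hypothetical; how the real
function derives "stale cache" from `HeuristicAssessment` is not shown in this
file.

```ts
type Step = { category: string; summary: string; details: string };
type RetryPlan = { shouldRetryAutomation: boolean; remediationSteps: Step[] };

// Hypothetical, flattened view of the heuristic signals used here.
type Signals = { priorSelfHealSuccess: boolean; staleCacheSuspected: boolean };

function reconcileSketch(plan: RetryPlan, signals: Signals): RetryPlan {
  let steps = plan.remediationSteps;
  if (signals.staleCacheSuspected) {
    // Stable partition: retry steps first, all others in their original order.
    const retry = steps.filter((s) => s.category === 'RETRY_AUTOMATION');
    const rest = steps.filter((s) => s.category !== 'RETRY_AUTOMATION');
    steps = retry.concat(rest);
  }
  // Return a new object so the GPT-authored plan is left untouched.
  return {
    shouldRetryAutomation:
      plan.shouldRetryAutomation || signals.priorSelfHealSuccess,
    remediationSteps: steps,
  };
}
```

Keeping these rules in code rather than in the prompt means they hold even when
the model's answer omits or contradicts them.
*/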
declare function gatherTestFailureEvidence(testInfo: TestInfo, page: DonobuExtendedPage, options?: GatherTestFailureEvidenceOptions): Promise<GatherTestFailureEvidenceResult | null>;
declare function generateTreatmentPlanFromEvidence(gptClient: GptClient, evidence: FailureEvidenceRecord): Promise<z.infer<typeof TreatmentPlan>>;
export { type AdditionalDataRequest, AdditionalDataRequestSchema, type AutomationDirectives, deriveHeuristicAssessment, deriveHistoricalSignals, type ErrorSummary, type FailureEvidenceRecord, type FailureReason, FailureReasonSchema, type FlowHistorySummary, gatherTestFailureEvidence, type GatherTestFailureEvidenceOptions, type GatherTestFailureEvidenceResult, generateTreatmentPlanFromEvidence, type HeuristicAssessment, type HistoricalFlowRun, type HistoricalSignals, reconcileTreatmentPlan, type RemediationCategory, type RemediationStep, RemediationStepSchema, type SanitizedFlowMetadata, type SummarizedToolCall, summarizeFlowHistory, TreatmentPlan, TRIAGE_PERSISTENCE_FILE_IDS, };
//# sourceMappingURL=triageTestFailure.d.ts.map