---
name: analyze-traces
description: >
  Analyze OpenTelemetry traces and structured logs from a running autotel service
  to debug errors, investigate latency, follow requests across services, and
  surface cardinality / attribute hygiene problems. Works with traces from any
  OTLP backend (Honeycomb, Grafana Tempo, Datadog, Jaeger, Sentry, Axiom,
  HyperDX, …) plus the local `.autotel/spans/` dump and `InMemorySpanExporter`
  in tests.
license: MIT
---

# Analyze traces

This skill teaches an AI assistant how to read and reason about OpenTelemetry traces produced by autotel — whether they're sitting in a backend, exported to a local JSON dump, or captured in a test.

## When to use

- Debugging a failing endpoint after deploy
- Investigating latency regressions (p50 / p95 / p99 spike)
- Following a single request across browser → server → queue → worker
- Auditing attribute hygiene (cardinality, PII leak risk, noise)
- Spot-checking that a new instrumentation actually produces the spans you expected

## Input formats

| Source                         | How to access                                                      |
| ------------------------------ | ------------------------------------------------------------------ |
| Local debug dump               | `.autotel/spans/*.ndjson` — one span per line, OTLP JSON shape     |
| `InMemorySpanExporter` (tests) | `exporter.getFinishedSpans()`                                      |
| Backend (interactive)          | Jaeger / Tempo / Honeycomb UI; Datadog Trace Search; etc.          |
| Backend (programmatic)         | Honeycomb Query API, Tempo `/api/search`, Datadog Logs / Trace API |

## The shape of an autotel span

```json
{
  "name": "POST /api/checkout",
  "context": { "traceId": "…", "spanId": "…" },
  "parentSpanId": "…",
  "kind": "SERVER",
  "startTimeUnixNano": "…",
  "endTimeUnixNano": "…",
  "status": { "code": "OK" },
  "attributes": {
    "service.name": "checkout",
    "http.request.method": "POST",
    "url.full": "https://api.example.com/api/checkout",
    "http.response.status_code": 200,
    "user.id": "usr_123",
    "user.plan": "enterprise",
    "cart.items": 3,
    "cart.total": 14999,
    "_correlationId": "01J…"
  },
  "events": [
    {
      "name": "log.emit.manual",
      "attributes": { "level": "info", "stage": "validated" }
    }
  ],
  "links": [],
  "resource": { "service.name": "checkout", "deploy.id": "v2025.05.04-1" }
}
```

Key conventions to recognise:

- `service.name` distinguishes services in a multi-service trace.
- `_correlationId` (autotel-specific) is stable within a logical unit of work even across forked child spans (`_parentCorrelationId` ties them).
- `gen_ai.*` attributes follow OpenTelemetry gen-ai semantic conventions for LLM calls.
- `exception.*` attributes (auto-set by `createStructuredError`) carry `type`, `message`, `stacktrace`.

## Common investigations

### "Why is endpoint X failing?"

1. Find the trace: filter `service.name = "<svc>" AND http.route = "<route>" AND status = error` for the last hour.
2. Open the slowest / latest matching trace.
3. Inspect the root span's `exception.message` and `exception.stacktrace`.
4. Walk down the child spans — the deepest span with `status.code = ERROR` is usually the culprit.
5. If using `createStructuredError`, look for `code`, `why`, `internal.*` attributes. They usually answer the "why" without you reading code.

### "Why is endpoint X slow?"

1. Find a slow trace: `service.name = "<svc>" AND http.route = "<route>" AND duration > p99(duration)`.
2. View the waterfall — pinpoint the longest child span by self-time (not wall time).
3. Common offenders:
   - **Sequential awaits that should be parallel** — sibling spans run end-to-end instead of overlapping.
   - **N+1 queries** — many short same-named spans (`SELECT * FROM …`) under one parent.
   - **Cold starts** — `faas.coldstart=true` in Workers or Lambda.
   - **Tool retries** — gen-ai spans with `gen_ai.response.finish_reasons` containing `error` followed by another call.

### "Follow this user across services"

Use `_correlationId` (or `user.id` if you have it):

```
service.name in (web, api, worker)
  AND _correlationId = "01J…"
ORDER BY startTime
```

Each service contributes spans with the same `traceId` (W3C trace context propagation handles this automatically with autotel's global fetch instrumentation).

### "Did the new instrumentation actually fire?"

In a test, dump the in-memory exporter:

```typescript
import { InMemorySpanExporter } from 'autotel/exporters';

const exporter = new InMemorySpanExporter();
// … run the code under test
const spans = exporter.getFinishedSpans();
console.log(spans.map((s) => ({ name: s.name, attrs: s.attributes })));
```

Or live, point the SDK at a local file dump:

```typescript
init({
  service: 'my-app',
  debug: 'pretty',
  spanDumpPath: '.autotel/spans'
});
```

…then `tail -f .autotel/spans/*.ndjson | jq` while exercising the feature.

## Cardinality / hygiene audits

| Check                         | Query / heuristic                                                                                               |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------- |
| **Span name cardinality**     | Top-K distinct `name` per service. Anything > a few hundred is a red flag — likely an unnormalised URL.         |
| **Per-attribute cardinality** | `unique(attribute_value)` per `attribute_key`. UUIDs / emails / `Date.now()` ids in attributes blow up storage. |
| **Missing `service.name`**    | Spans where the resource attribute is empty or `"app"` — fix at the SDK init.                                   |
| **PII smell**                 | Look for raw `@`, leading digit-runs of length 16, or `eyJ` prefixes — your redactor is off.                    |
| **Health-check noise**        | Spans with `http.route in (/healthz, /ready)`. Drop with `FilteringSpanProcessor`.                              |

## Reading gen-ai traces

LLM calls produce a parent span (kind `CLIENT`) with children for each tool call:

| Attribute                                                              | Meaning                                  |
| ---------------------------------------------------------------------- | ---------------------------------------- |
| `gen_ai.provider.name` (legacy: `gen_ai.system`)                       | Provider (`openai`, `anthropic`, …)      |
| `gen_ai.request.model`                                                 | Model id                                 |
| `gen_ai.usage.input_tokens` / `output_tokens`                          | Token count                              |
| `gen_ai.usage.cache_read.input_tokens` / `cache_creation.input_tokens` | Cache hits                               |
| `gen_ai.response.finish_reasons`                                       | `stop`, `tool_calls`, `length`, `error`  |
| `gen_ai.tool.name`                                                     | Tool invoked (on tool-call child spans)  |
| `gen_ai.usage.cost.usd`                                                | Estimated cost (if pricing map provided) |

These canonical `gen_ai.*` attributes are emitted by the `autotel-genai` package (third-party instrumentations may still emit the deprecated `gen_ai.system`).

Common findings:

- High `gen_ai.usage.input_tokens` with low `gen_ai.usage.cache_read.input_tokens` → enable prompt caching.
- Many sequential tool-call spans → consider parallel tool calls if the model supports it.
- `gen_ai.response.finish_reasons` contains `length` → bump `max_tokens`.

## When the trace is missing

If you expected a span and there isn't one:

1. **Sampling.** Did head sampling drop it? Check `sampling.rates` and `recordedSpans` in any subscriber.
2. **Workers without `waitUntil`.** Did the request return before the exporter flushed? Move to `defineWorkerFetch` / `wrapModule`.
3. **`instrumentation.disabled = true`** — check env-conditional config.
4. **Exporter rejected.** Check service logs for `OTLP exporter` 4xx / 5xx — bad token, wrong dataset.
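
Before working through the checklist above, it can help to confirm which span names are actually absent from the local dump. The sketch below is a minimal illustration, assuming the NDJSON dump holds one span per line with the `name` and `context` fields shown in the example span. `DumpedSpan`, `parseNdjson`, and `missingSpanNames` are hypothetical names for this example, not part of autotel, and the sample dump is made-up data.

```typescript
// Assumed minimal span shape, based on the example span in this document.
interface DumpedSpan {
  name: string;
  context: { traceId: string; spanId: string };
  parentSpanId?: string;
}

// Parse NDJSON text (one JSON object per line), skipping blank lines.
function parseNdjson(text: string): DumpedSpan[] {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as DumpedSpan);
}

// Hypothetical helper: return the expected span names that never appear.
function missingSpanNames(spans: DumpedSpan[], expected: string[]): string[] {
  const seen = new Set(spans.map((s) => s.name));
  return expected.filter((spanName) => !seen.has(spanName));
}

// Made-up dump content standing in for a file under .autotel/spans/.
const sampleDump = [
  '{"name":"POST /api/checkout","context":{"traceId":"t1","spanId":"a"}}',
  '{"name":"db.query","context":{"traceId":"t1","spanId":"b"},"parentSpanId":"a"}',
].join('\n');

const dumpedSpans = parseNdjson(sampleDump);
const missing = missingSpanNames(dumpedSpans, ['POST /api/checkout', 'payment.charge']);
console.log(missing);
```

If the rest of a trace is present and only one span name is missing, head sampling is a less likely explanation, since head sampling typically keeps or drops a trace as a whole; disabled or absent instrumentation is then worth checking first.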
## Output format

When summarising an investigation, lead with the **decision-changing fact**, then the supporting evidence:

```
Failure cause: payment.declined (Stripe code: insufficient_funds)
- Trace: 9d3a…b21
- 38 / 412 checkout requests in the last hour failed with status=402.
- All in eu-west-1, all on plan=free.
- exception.cause.stripeChargeId starts with ch_3M…
- Suggest: surface the structured `fix` field to the client; current 402 body returns generic message.
```

Keep raw span dumps out of the summary; link to the trace ID instead.
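
As a rough illustration of how the evidence lines in such a summary might be derived from a list of spans, the sketch below counts failed root spans and picks the deepest `ERROR` span as the likely cause, following the heuristic from "Why is endpoint X failing?". It assumes spans shaped like the example span, with `exception.message` set on the failing span. `SummarySpan`, `depthOf`, `deepestErrorSpan`, and `summarise` are hypothetical names for this example, and the sample spans are made-up data.

```typescript
// Assumed span shape, based on the example span in this document.
interface SummarySpan {
  name: string;
  context: { traceId: string; spanId: string };
  parentSpanId?: string;
  status: { code: string };
  attributes: Record<string, string | number | boolean>;
}

// Number of ancestors of `span` that are present in `byId`.
function depthOf(span: SummarySpan, byId: Map<string, SummarySpan>): number {
  let depth = 0;
  let current: SummarySpan | undefined = span;
  // Bound the walk by the map size so cyclic (malformed) data cannot loop forever.
  while (current && current.parentSpanId && depth < byId.size) {
    current = byId.get(current.parentSpanId);
    if (current) depth += 1;
  }
  return depth;
}

// The deepest span with status ERROR is treated as the likely culprit.
function deepestErrorSpan(spans: SummarySpan[]): SummarySpan | undefined {
  const byId = new Map<string, SummarySpan>();
  for (const s of spans) byId.set(s.context.spanId, s);
  let best: SummarySpan | undefined;
  let bestDepth = -1;
  for (const s of spans) {
    if (s.status.code !== 'ERROR') continue;
    const d = depthOf(s, byId);
    if (d > bestDepth) {
      best = s;
      bestDepth = d;
    }
  }
  return best;
}

// Lead with the cause, then the trace ID and the failure ratio.
function summarise(spans: SummarySpan[]): string {
  const roots = spans.filter((s) => !s.parentSpanId);
  const failedRoots = roots.filter((s) => s.status.code === 'ERROR');
  const firstFailed = failedRoots[0];
  if (firstFailed === undefined) return `No failures in ${roots.length} root spans.`;
  const traceId = firstFailed.context.traceId;
  const culprit = deepestErrorSpan(spans.filter((s) => s.context.traceId === traceId));
  const cause = culprit
    ? String(culprit.attributes['exception.message'] ?? culprit.name)
    : 'unknown';
  return [
    `Failure cause: ${cause}`,
    `- Trace: ${traceId}`,
    `- ${failedRoots.length} / ${roots.length} root spans failed.`,
  ].join('\n');
}

// Made-up sample: one failing trace (t1) and one healthy trace (t2).
const sampleSpans: SummarySpan[] = [
  { name: 'POST /api/checkout', context: { traceId: 't1', spanId: 'a' }, status: { code: 'ERROR' }, attributes: {} },
  { name: 'payment.charge', context: { traceId: 't1', spanId: 'b' }, parentSpanId: 'a', status: { code: 'ERROR' }, attributes: { 'exception.message': 'payment.declined' } },
  { name: 'POST /api/checkout', context: { traceId: 't2', spanId: 'c' }, status: { code: 'OK' }, attributes: {} },
];

const report = summarise(sampleSpans);
console.log(report);
```

The sketch only reports a single trace and a raw ratio; grouping by region or plan, as in the example summary, would need the corresponding attributes to be present on the spans.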