2026-05-23 — engineering / rag / llm / incident-management / postgres

When to skip the LLM: reusing root cause from similar resolved incidents

RAG over your own resolved incidents to ground (and sometimes entirely replace) the LLM root-cause call, the strict gate that decides when reuse is safe, and the shadow-validation that keeps it honest.

By Culprit · 13 min read

The prior post in this series ended at the correlation layer: pgvector clusters a thousand events per minute into ten incidents per minute. That is the easy part. The expensive part is what comes next — calling a language model to explain what caused each incident.

The obvious approach is to call the model every time. New incident, fresh analysis, structured output, done. For a service that opens one novel incident per day, this is fine. For a service that has been emitting the same ConnectionPoolExhausted pattern every Monday morning for six months, you have paid for the same answer about twenty-six times over. The model does not know you already know the answer. It reasons from scratch, returns the same root cause, and the bill accrues.

The bet this post is about: ground each root-cause analysis in your own resolved incident history, and when the match is close enough and well-enough validated, skip the model entirely and reuse the prior answer verbatim. Fresh analysis is always available in one click. But paying for the same reasoning again and again is waste you can eliminate without touching correctness.

The data the model sees

Before the RCA layer, every event payload has already been tokenized at the edge. Field values that look like PII — hostnames, email addresses, API key prefixes, IP octets — are replaced with stable, deterministic tokens: <TOKEN_a1b2c3d4e5f6> instead of db-prod-west-2 (the placeholder is <TOKEN_ + 12 hex chars of a keyed hash). The tokenization is described in the PII post in this series.

The RCA pipeline only ever sees tokenized text. The prompt injected into the model, the embedding index over prior incidents, the stored root causes — all of it is <TOKEN_…> placeholders. This matters for the RAG layer specifically: tokenization is deterministic — the same value always maps to the same placeholder — so the co-occurrence structure that makes vector similarity meaningful is preserved. <TOKEN_a1b2c3d4e5f6> and <TOKEN_0f9e8d7c6b5a> appear together in the same sentence in the prior incident and in the new one; the embedding captures that co-occurrence without seeing what either token resolves to.
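A minimal sketch of what deterministic tokenization of this shape could look like. The post only specifies the placeholder format (<TOKEN_ plus 12 hex chars of a keyed hash), so the choice of HMAC-SHA256, the tokenize helper name, and the key handling are all assumptions made for illustration, using Node's built-in crypto module.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical helper: maps a sensitive value to a stable placeholder.
// Assumption: the keyed hash is HMAC-SHA256; the post only says "keyed hash".
function tokenize(value: string, key: string): string {
  const digest = createHmac("sha256", key).update(value).digest("hex");
  // Keep the first 12 hex characters, matching the <TOKEN_xxxxxxxxxxxx> shape.
  return `<TOKEN_${digest.slice(0, 12)}>`;
}

// The same input under the same key always yields the same placeholder,
// which is what keeps co-occurrence structure intact for the embedding.
const exampleKey = "example-key-not-a-real-secret";
const first = tokenize("db-prod-west-2", exampleKey);
const second = tokenize("db-prod-west-2", exampleKey);
console.log(first === second); // true
```

Because the mapping depends on the key, two deployments with different keys would produce unrelated placeholders for the same hostname.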

The RCA call itself

The model is Claude Haiku 4.5 by default, called via Anthropic's Batch API with prompt caching on the system prompt and output schema. The Batch API gives a substantial per-token discount and is appropriate here because RCA is not latency-critical: a two-to-five-minute delay between an incident opening and its explanation becoming visible is acceptable. The real-time API is reserved for one case: an explicit user "Deep Analysis" click, where immediacy is the point.

A larger model (Claude Sonnet 4.6) fires only on CRITICAL-severity incidents or on that explicit deep-analysis click. Everything else goes to Haiku. The gate is intentional: keeping the default model the cheaper one is part of what lets flat per-service pricing stay flat as usage grows.
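The routing rule above is small enough to write as a pure function. This is a sketch under assumptions: the severity levels other than CRITICAL, the trigger names, and the "haiku"/"sonnet" labels are placeholders rather than real API model identifiers.

```typescript
// Assumption: only CRITICAL is named in the post; the other levels are illustrative.
type Severity = "LOW" | "MEDIUM" | "HIGH" | "CRITICAL";
type Trigger = "automatic" | "deep_analysis_click";

interface RcaRoute {
  model: "haiku" | "sonnet";        // labels, not API model IDs
  transport: "batch" | "realtime";
}

function routeRcaCall(severity: Severity, trigger: Trigger): RcaRoute {
  const deepAnalysis = trigger === "deep_analysis_click";
  return {
    // The larger model fires on CRITICAL severity or an explicit deep-analysis click.
    model: severity === "CRITICAL" || deepAnalysis ? "sonnet" : "haiku",
    // Only the explicit click pays for the real-time API.
    transport: deepAnalysis ? "realtime" : "batch",
  };
}
```

Keeping this as one function with no I/O means the cost policy can be unit-tested separately from the queue consumer that acts on it.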

The result of an RCA call is a structured object: a root-cause hypothesis, a confidence score between 0 and 1, a list of recommended next steps, and a flag indicating the call was "fresh" (LLM-generated, not a reuse). That last field matters for the chain policy we will get to shortly.
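One way to pin that shape down is a type plus a runtime guard applied to whatever the model returns. The field names root_cause and next_steps are assumptions; confidence and the rca_source value 'fresh' follow the names used elsewhere in this post, and 'shortcut' as the other value is assumed.

```typescript
// Sketch of the structured RCA result. Field names are partly assumed.
interface RcaResult {
  root_cause: string;
  confidence: number;                 // expected to lie in [0, 1]
  next_steps: string[];
  rca_source: "fresh" | "shortcut";   // "fresh" = LLM-generated, not a reuse
}

// Runtime guard: model output is untrusted JSON until it passes this check.
function isValidRcaResult(value: unknown): value is RcaResult {
  if (typeof value !== "object" || value === null) return false;
  const { root_cause, confidence, next_steps, rca_source } =
    value as Record<string, unknown>;
  return (
    typeof root_cause === "string" &&
    typeof confidence === "number" &&
    confidence >= 0 &&
    confidence <= 1 &&
    Array.isArray(next_steps) &&
    next_steps.every((step) => typeof step === "string") &&
    (rca_source === "fresh" || rca_source === "shortcut")
  );
}
```

Rejecting an out-of-range confidence at this boundary matters later, because the shortcut gate compares that number against a fixed threshold.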

RAG over priors: grounding the analysis

For each new incident, before the model call, run a vector similarity search over the resolved incidents for the same tenant:

SELECT
  id,
  title,
  rca_summary,
  -- confidence lives inside the rca_summary JSONB, not a dedicated column
  (rca_summary->>'confidence')::float8 AS prior_confidence,
  last_seen,
  1 - (embedding <=> $1) AS cosine_similarity
FROM incidents
WHERE tenant_id   = $2
  AND status      = 'resolved'
  -- skip any prior the tenant marked thumb='down'. NOT EXISTS is NULL-safe:
  -- priors with no vote (and thumbs-up priors) are kept; only thumbs-down drops.
  AND NOT EXISTS (
    SELECT 1 FROM incident_feedback f
    WHERE f.incident_id = incidents.id AND f.thumb = 'down'
  )
  AND 1 - (embedding <=> $1) >= 0.70
ORDER BY embedding <=> $1
LIMIT 3;

The three retrieved incidents (K=3) go into the RCA prompt as a "priors" block — a structured list of prior root causes for similar events that the model can reason against. The model is not forced to agree with the priors; it is given context it otherwise lacks. Without this, the model's prior is whatever was in its training data. With it, the model's prior is your own operational history.
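As an illustration, the rows returned by that query could be rendered into a priors block along these lines. The text layout, the buildPriorsBlock name, and the root_cause key inside rca_summary are assumptions; the real prompt format is not shown in this post.

```typescript
// Row shape mirrors the SELECT list above; the rca_summary contents are assumed.
interface PriorRow {
  id: number;
  title: string;
  rca_summary: { root_cause: string; confidence?: number };
  last_seen: string;          // ISO timestamp
  cosine_similarity: number;
}

// Hypothetical formatter: turns up to K=3 priors into a plain-text prompt section.
function buildPriorsBlock(rows: PriorRow[]): string {
  if (rows.length === 0) return "PRIORS: none";
  const entries = rows.map(
    (row, index) =>
      `${index + 1}. [incident ${row.id}] ${row.title}\n` +
      `   root cause: ${row.rca_summary.root_cause}\n` +
      `   similarity: ${row.cosine_similarity.toFixed(2)}, last seen: ${row.last_seen}`
  );
  return [
    "PRIORS (similar resolved incidents; context, not ground truth):",
    ...entries,
  ].join("\n");
}
```

The header line spells out that priors are context, which matches the intent that the model is free to disagree with them.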

Two details in that query are not obvious:

The thumbs-down filter (NOT EXISTS (… thumb = 'down')). Every incident in the RCA UI has a thumbs-up/thumbs-down control. A thumbs-down means "this root cause analysis was wrong or unhelpful." Using NOT EXISTS rather than an equality test on the column is deliberate: it is NULL-safe, so priors with no vote yet are kept (only an explicit thumbs-down drops a prior). A thumbs-downed prior is excluded from retrieval entirely: it will never appear in a priors block, and it will never be a shortcut candidate. One bad analysis does not compound into a stream of bad analyses for similar incidents. A thumbs-up is not required for RAG inclusion; neutral (no vote) is acceptable for the priors block. The shortcut gate is stricter, as we will see.

The 0.70 threshold. This is the floor for "useful context." A prior at 0.70 similarity shares meaningful structural overlap with the new incident but is not a strong enough match to trust as a reuse. It informs; it does not decide.

The shortcut: skipping the model entirely

The RAG priors block is context. The shortcut is a decision: if the single most-similar resolved prior clears a strict gate, do not call the model at all. Return the prior's root cause verbatim under a banner that names the source, and offer a one-click escape to run a fresh analysis.

Two of the rules are enforced before the gate, in the SQL that fetches candidates: the find_shortcut_candidate RPC only returns priors whose own RCA is "fresh" (LLM-generated, never another shortcut — so reuse never chains) and less than 30 days old. The gate is then a pure function over the candidates it gets back; every remaining check must pass, or we fall back to a fresh LLM call.

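// Shapes inferred from how the gate uses them. The exact field types (for
// example, whether `id` is a number or a UUID string) are assumptions.
interface ShortcutCandidate {
  id: string | number;
  similarity: number;               // cosine similarity to the new incident
  prior_confidence: number | null;  // confidence stored with the prior's RCA
  has_thumbs_up: boolean;
  has_thumbs_down: boolean;
}

type ShortcutDecision =
  | { shortcut: true; target: ShortcutCandidate }
  | { shortcut: false; reason: string };

const SIMILARITY = 0.95;  // far stricter than the 0.70 priors-block floor
const CONFIDENCE = 0.90;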

// `candidates` come from the find_shortcut_candidate RPC — already filtered to
// fresh priors inside the 30-day window and ordered by similarity descending.
function decideShortcut(candidates: ShortcutCandidate[]): ShortcutDecision {
  if (candidates.length === 0) return { shortcut: false, reason: 'no_candidates' };
  const top1 = candidates[0];

  // 1. Strong similarity.
  if (top1.similarity < SIMILARITY) {
    return { shortcut: false, reason: 'similarity_below_threshold' };
  }

  // 2. Human- or model-approved.
  const humanApproved = top1.has_thumbs_up;
  const modelApproved =
    top1.prior_confidence != null &&
    top1.prior_confidence >= CONFIDENCE &&
    !top1.has_thumbs_down;
  if (!humanApproved && !modelApproved) {
    return { shortcut: false, reason: 'no_human_or_model_approval' };
  }

  // 3. Runner-up must not disagree: if the 2nd-closest prior also clears the
  //    bar AND is a *different* incident, two strong priors conflict — bail.
  const top2 = candidates[1];
  if (top2 && top2.similarity >= SIMILARITY && top2.id !== top1.id) {
    return { shortcut: false, reason: 'top2_conflict' };
  }

  return { shortcut: true, target: top1 };
}

The logic behind each check in the gate function:

Similarity — 0.95, not 0.70. The priors block uses 0.70 because "useful context" is a low bar. The shortcut uses 0.95 because "same enough to skip the model" is a high bar. The difference is the difference between "probably related" and "almost certainly the same thing." At 0.95 cosine similarity in 1536-dimensional space, you are looking at events that share not just the same general domain but the same specific structural fingerprint.

Approval. A thumbs-up from a human is the strongest signal. A model confidence score of ≥ 0.90 with no thumbs-down is a weaker but acceptable signal for low-traffic services where users do not consistently vote. A prior with a thumbs-down never reaches the gate at all — the candidate query already excludes it — so the gate's job is the positive side: the reused answer must be one the system has some positive evidence for, not just the closest.

Runner-up disagreement. The most easily overlooked check, and the most important at the edges. If the top prior clears 0.95 AND the second-most-similar prior also clears 0.95 but is a different incident, two strong signals are in conflict. Reusing either is a gamble; the right move is to run a fresh analysis and let the model arbitrate. It fires rarely — two distinct priors both clearing 0.95 against the same new incident is unusual — but when it does, the fallback matters.
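To make the two similarity thresholds concrete, here is a small sketch that computes cosine similarity directly and sorts a score into tiers. In production the score comes from pgvector (1 minus the cosine distance returned by <=>); the tier names and the tierFor helper are illustrative only.

```typescript
const PRIORS_FLOOR = 0.70;    // "useful context" bar from the priors query
const SHORTCUT_FLOOR = 0.95;  // "same enough to skip the model" bar

// Plain cosine similarity, equivalent to 1 - (a <=> b) in pgvector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

type MatchTier = "shortcut_eligible" | "context_only" | "ignored";

// Passing the similarity bar is necessary for a shortcut, not sufficient:
// the approval and runner-up checks still have to pass.
function tierFor(similarity: number): MatchTier {
  if (similarity >= SHORTCUT_FLOOR) return "shortcut_eligible";
  if (similarity >= PRIORS_FLOOR) return "context_only";
  return "ignored";
}
```

For example, two vectors at a 45-degree angle score about 0.707: enough to appear in the priors block, nowhere near enough to be reused.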

Two more rules are enforced earlier, in the candidate query (find_shortcut_candidate) rather than in the gate function:

No shortcut chains. Every shortcut must point directly at a fresh (LLM-generated) root cause. A shortcut pointing at another shortcut would create a telephone-game chain — prior A → shortcut B → shortcut C — and a wrong link would compound silently. The query only returns priors with rca_source = 'fresh', which collapses the graph flat: every reuse is one hop from real evidence.

30-day recency. A root cause accurate in January may not be accurate in April. The cause of a connection-pool exhaustion six weeks ago might have been a misconfigured pool size; today it might be a query regression. The 30-day window is a judgment call, not derived from first principles, and it is a config value rather than a hardcoded literal.
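The real filtering lives in SQL inside find_shortcut_candidate, which is not reproduced in this post. As an in-memory illustration of the same two rules, the sketch below keeps only fresh priors inside the recency window and orders them by similarity. Measuring age from a resolved_at timestamp is an assumption, as are the type and function names.

```typescript
interface PriorCandidate {
  id: number;
  rca_source: "fresh" | "shortcut";
  resolved_at: Date;     // assumption: recency is measured from resolution time
  similarity: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// In-memory stand-in for the candidate query: fresh-only, recent-only,
// most similar first. maxAgeDays is a parameter because the window is config.
function preFilterCandidates(
  rows: PriorCandidate[],
  now: Date,
  maxAgeDays = 30
): PriorCandidate[] {
  const cutoff = now.getTime() - maxAgeDays * DAY_MS;
  return rows
    .filter((row) => row.rca_source === "fresh")      // no shortcut chains
    .filter((row) => row.resolved_at.getTime() > cutoff) // less than maxAgeDays old
    .sort((a, b) => b.similarity - a.similarity);
}
```

Because shortcut-sourced priors are dropped before the gate ever runs, the gate itself never needs to know that chains are possible.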

The banner: naming the source

When an incident's RCA came from the shortcut, the detail page shows a banner across the top of the RCA card:

PROBABLY THE SAME ROOT CAUSE AS INCIDENT #1234 — resolved 2026-04-15 [Run fresh analysis →]

The prior incident number is a link to that incident's detail page so you can compare the two directly. "Run fresh analysis" triggers a real-time (non-batched) model call and replaces the banner with the fresh result.

The similarity score is not shown in the UI. It is an internal calibration metric, not a user-facing confidence signal. "96% similar" is a number users would reasonably interpret as "96% confident the root cause is correct," which is a different claim. Showing the source incident and offering an escape hatch is more honest: here is where this answer came from, here is how to override it.

The framing is the inverse of the failure mode common in LLM-augmented tools — a confident, authoritative answer with no provenance and no escape. This banner has provenance (named source incident) and an escape (one click). The shortcut only fires when the gate passes; the gate is documented; the override is immediate.

Shadow validation: trusting, but checking

The shortcut runs in a validation phase first. During validation, every time the shortcut fires, the system also runs the real LLM analysis in the background.

The user is served the instant shortcut result. The shadow result is stored, not shown. A background job compares the two: does the shadow LLM agree with the reused root cause? If it disagrees — different root cause category, substantially lower confidence, contradictory recommended steps — the divergence is logged for review.
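A sketch of what that comparison job might check. It covers the two criteria that reduce to simple comparisons (root cause category and a confidence drop); detecting contradictory recommended steps needs semantic comparison and is left out. The category field, the 0.2 drop threshold, and the function name are assumptions.

```typescript
interface RcaOutcome {
  category: string;     // assumed: a coarse root-cause label
  confidence: number;   // 0..1
}

// Assumption: what counts as "substantially lower" is not specified in the post.
const MAX_CONFIDENCE_DROP = 0.2;

// Returns the reasons the shadow result diverges from the reused one.
// An empty array means the shadow analysis agrees.
function findDivergence(reused: RcaOutcome, shadow: RcaOutcome): string[] {
  const reasons: string[] = [];
  if (reused.category !== shadow.category) {
    reasons.push("different_category");
  }
  if (reused.confidence - shadow.confidence > MAX_CONFIDENCE_DROP) {
    reasons.push("confidence_drop");
  }
  return reasons;
}
```

Returning a list of reasons rather than a boolean keeps the review log useful: a category mismatch and a confidence drop point at different problems with the gate.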

This deliberately gives up the cost savings during validation. The shortcut fires, which saves one model call from the user's perspective; the shadow call happens anyway. The realized token cost during validation is the same as if the shortcut did not exist. What the validation phase buys is calibration data: does the gate, as written, correctly identify the cases where reuse is safe?

After validation, the shadow sample drops to 5% — one in twenty shortcut fires triggers a shadow call rather than every one. That is when the cost reduction is actually realized. For services with high incident recurrence (the same failure mode resolving and recurring), the mechanism is designed to eliminate roughly 30–40% of model calls.
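The sampling decision itself is a one-liner. In this sketch the random source is injected so the behavior can be tested deterministically; the function name and parameters are illustrative.

```typescript
// Decides whether a shortcut fire should also trigger a shadow LLM call.
// During validation every fire is shadowed; afterwards only a sample is.
function shouldRunShadow(
  inValidation: boolean,
  rand: () => number = Math.random,
  sampleRate = 0.05
): boolean {
  if (inValidation) return true;
  return rand() < sampleRate;
}
```

With the default rate, roughly one shortcut fire in twenty pays for a model call; the other nineteen are where the savings come from.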

The shadow infrastructure also provides a check against model drift. If the model is updated and its outputs for a given pattern shift, the 5% sample will surface the divergence before it becomes a silent correctness regression.

The cost ceiling

Three RCA calls per incident, ever. The shortcut (if it fires) or the fresh batch call (if the gate does not pass) is call one. A user-clicked "Run fresh analysis" is call two. A rare later re-analysis — for example, if significantly different events attach hours after initial resolution — is call three. After three calls, new events attach to the incident but do not re-trigger analysis.

The intention is that a significantly different event pattern should open a new incident rather than re-analyze an old one. The correlation layer (the pgvector post) makes this automatic: if new events are dissimilar enough to the existing incident's representative embedding, they open a new incident and get their own fresh RCA.

The ceiling is not an arbitrary cap. Flat per-service pricing means RCA calls per incident are deliberately bounded. Three is enough to cover the initial analysis, one correction cycle, and one late-breaking re-analysis. The 5-minute per-call cooldown is the short-term guard; the three-call ceiling is the long-term backstop.
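The two guards compose into a single check before any RCA call is enqueued. This sketch assumes the cooldown applies uniformly to every call on an incident and that the incident row tracks a call count and a last-call timestamp; those field names are hypothetical.

```typescript
const MAX_RCA_CALLS = 3;              // long-term backstop
const COOLDOWN_MS = 5 * 60 * 1000;    // short-term guard: 5 minutes

interface RcaBudget {
  callCount: number;          // RCA calls already spent on this incident
  lastCallAt: Date | null;    // null if no call has been made yet
}

type BudgetDecision =
  | { allowed: true }
  | { allowed: false; reason: "ceiling_reached" | "cooldown_active" };

function canRunRca(budget: RcaBudget, now: Date): BudgetDecision {
  if (budget.callCount >= MAX_RCA_CALLS) {
    return { allowed: false, reason: "ceiling_reached" };
  }
  if (
    budget.lastCallAt !== null &&
    now.getTime() - budget.lastCallAt.getTime() < COOLDOWN_MS
  ) {
    return { allowed: false, reason: "cooldown_active" };
  }
  return { allowed: true };
}
```

The ceiling is checked first so that an exhausted incident reports the permanent reason rather than a cooldown that would misleadingly suggest retrying later.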

What this looks like across the pipeline

The full sequence for a new incident, after the correlation layer hands off:

1. Retrieve top-3 similar resolved priors (cosine ≥ 0.70, not thumb-downed) from incidents via pgvector.
2. Run the shortcut gate against top-1 and runner-up. If the gate passes, serve the prior's root cause under the banner, skip to step 5.
3. If the gate fails, build the RCA prompt with the priors block injected, enqueue a Batch API call.
4. On batch result receipt (typically 2–5 minutes), write the structured RCA to the incident row, update confidence score.
5. If in Phase B, enqueue a shadow Batch API call regardless of whether the shortcut fired.
6. Write shadow result for comparison; flag divergences.

The priors retrieval (step 1) and the gate check (step 2) are cheap: one indexed vector query per incident. The expensive step — the model call — only runs when the gate says it should.

The pipeline runs inside a Cloudflare Queue consumer, not on the ingest request path. The ingest handler validates the HMAC signature, writes the encrypted payload to the vault, and enqueues. All the analysis work happens asynchronously. The user sees events appear in the incident timeline immediately; the RCA card appears when the analysis completes, with no polling required (Supabase Realtime pushes the update).

What this does not do

The shortcut reuses a prior root cause verbatim. It does not interpolate between multiple priors, synthesize a new hypothesis, or learn incrementally. It either reuses one prior answer as-is, or it reuses nothing and calls the model. The simplicity is intentional: the RAG priors block (step 1 above) is where multiple priors inform a fresh analysis; the shortcut is where a single prior with a very high bar replaces the fresh analysis. Keeping the two mechanisms separate makes the gate auditable and the behavior predictable.

The thumbs feedback loop is also limited in its current form. A thumbs-down removes a prior from retrieval. A thumbs-up strengthens its shortcut eligibility. But the feedback does not yet flow back into re-ranking the K=3 priors block, so approved priors are not weighted more heavily than neutral ones. That is a candidate for the next iteration once Phase C data gives us a clearer picture of where retrieval quality matters most.

The RCA layer described here builds on the alert correlation pipeline covered in the pgvector post — both layers are live inside Culprit. The full security and privacy architecture (tokenization, RLS, audit trail) is on /security. A working demo of the end-to-end pipeline — ingest, sanitization, clustering, and root-cause analysis — is on /demo.