2026-05-08 — on-call / alert-fatigue / observability / correlation
From 1,000 alerts to 10 incidents
Turning a thousand noisy webhooks into ten real incidents, without throwing away the signal that lives in the noise. Alert correlation, the four hard parts.
By Culprit · 18 min read
You got paged 47 times this week. Eight of them were the same thing.
You know that because you went back through the pager log on Sunday afternoon and counted. Three pages were the connection-pool exhaustion that your last deploy half-fixed. Two were the same flapping health check on the EU node that nobody has time to investigate. One was the genuinely-new bug that the other 46 buried. The remaining 41 were variations of the same five problems, restated by five different monitoring sources, each insisting on the dignity of its own pager event.
The ratio you're looking at — pages-to-distinct-problems — is somewhere around 10 to 1, give or take. Across the industry it's usually worse. The on-call engineer's mental model treats the week as a half-dozen problems, not 47 events, because that's how many things actually need acting on. The pager treats it as 47 events because that's how many HTTP POSTs hit it. The gap between those two views is the gap between "manageable on-call" and "the thing destroying your team's retention."
This piece is about closing that gap.
01 — Why the obvious fixes don't survive contact with reality
There are three obvious fixes. Every team tries them in this order, and every team comes out the other side with the same problem in a slightly different shape.
Fix one: tune the thresholds. The pitch is clean — half your noise is alerts firing too eagerly. Walk the runbook, raise the warning thresholds, set tighter time windows, mark the flappy ones as "info-level." Two months later you have either (a) the same volume of alerts, because your config slowly relaxed back as new services were added, or (b) a different and more dangerous problem: the real outage that fired one warning-level alert at 70% of your new paging threshold and never made it to a page. Tuning thresholds is a maintenance burden that produces blind spots in exchange for short-term relief, and the blind spots are exactly the kind your post-mortem will be written about.
Fix two: disable the noisiest ones. "Just turn off the connection-pool alert; it's always firing and it's never an emergency." This works for a week. Then the connection pool actually exhausts during a customer migration, and the bug that would have shown up in the alert stream as "this is now happening 200 times an hour instead of 20" is invisible because the alert is muted. Noisy alerts are noisy because they fire in a regime where the underlying system is flaky but functional. The signal you want — "the regime just changed" — lives in the same channel as the noise.
Fix three: have a human triage every alert before it pages. A dedicated NOC role, or a triage rotation that gates pages. This works structurally but it doesn't work socially: the person doing the triage is, almost by construction, also on the on-call rotation. You've moved the noise from "wakes the on-call up" to "ruins the triager's afternoon." The triager burns out, leaves, and the role gets dropped. Six months later the noise is back, plus you have one fewer engineer.
The pattern in all three is the same: each fix tries to reduce the volume of alerts in the channel, when the actual problem is that the channel doesn't have any structure on top of the events. The alert pipeline assumes one event = one thing-to-do. That assumption is wrong every time the underlying systems are big enough to have failure modes that span multiple sources, multiple time windows, and multiple identities.
What you want isn't fewer alerts. It's a layer between the alerts and the pager that says "these 47 events describe ten distinct problems; here are the ten."
02 — What "correlation" actually means
The word "correlation" in observability gets used loosely. Three things people mean when they say it:
- Deduplication. "Same alert from same source within N minutes" — collapse to one. This is table stakes; most pager-of-record tools do it natively. It cuts maybe 30% of the volume.
- Topological grouping. "These three alerts are all for services downstream of the database that just went down" — cluster them as one incident under a root-cause hint. Requires a service map you have to maintain.
- Semantic clustering. "These alerts have textually similar messages, originate near each other in time, and reference overlapping resources — they're probably the same problem." Requires no service map; works on the alert payloads themselves.
This piece is about the third. The first two are useful but bounded — you can do them with a few SQL rules. Semantic clustering is the one that scales past the limits of human-curated config, because it works on whatever the alert says, not on whatever you remembered to type into a YAML file last quarter.
The shape of the system that does this:
- An incident is a persistent object that absorbs related events over its lifetime. New events join it; existing events stay in it; the incident has a state machine (open → investigating → resolved → closed) that's independent of whether more events keep arriving.
- An event is the atomic unit you receive from monitors, error trackers, log aggregators, etc. Each event either joins an existing open incident (because it's "related" to events already in it) or starts a new incident.
- The paging policy lives at the incident level, not the event level. Incident gets created → maybe page. Incident gets a 47th event → don't page again. Incident escalates in severity → re-page.
The 10x reduction in pages doesn't come from dropping events. It comes from giving events a place to land that isn't your phone.
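To make the shape concrete, here is a minimal sketch of the two objects as TypeScript types. Only the incident state machine and the event-to-incident relationship come from the text above; every field name is an illustrative assumption, not Culprit's actual schema.

// Hypothetical types; field names are assumptions for illustration.
type IncidentState = "open" | "investigating" | "resolved" | "closed";
type Severity = "info" | "warning" | "critical";

interface Incident {
  id: string;
  serviceId: string;
  state: IncidentState;
  severity: Severity;
  representativeEmbedding: number[]; // what new events are compared against (§03.2)
  createdAt: Date;
  lastEventAt: Date;                 // drives auto-resolve of quiet incidents (§03.3)
}

interface AlertEvent {
  id: string;
  incidentId: string | null; // null until routed; then exactly one incident
  serviceId: string;
  source: string;            // monitor, error tracker, log aggregator, ...
  message: string;           // raw payload text
  embedding: number[];       // embedding of the normalized message (§03.1)
  createdAt: Date;
}

// The paging policy hangs off Incident state transitions, never off
// individual AlertEvent inserts.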
03 — The four hard parts
The architecture above is easy to describe and tedious to build. Each part has a "how do we actually" attached to it that is most of the engineering work.
03.1 — Defining "related" without throwing away signal
The defining-related layer is what makes the system either work or quietly hide your real outages.
The naive shape: "two events are related if their service field matches and they arrived within 5 minutes." This dedupes the obvious double-alerts and falls apart everywhere else. It misses the cross-service cluster (DB outage paging three downstream services). It over-clusters the unrelated (two different bugs in the same service collapse into one incident, and the second one never gets seen).
The slightly-less-naive shape: vector-embed the alert message text and cosine-compare against recently-seen events on the same service. If the closest match exceeds a threshold and is already part of an open incident, attach there; otherwise create a new incident. The math at its simplest:
-- Returns the incident_id of the most-similar recent event in this service,
-- if and only if its similarity exceeds the threshold.
select incident_id
from sanitized_events
where service_id = $1
and incident_id is not null
and embedding is not null
and created_at > now() - interval '60 minutes'
and 1 - (embedding <=> $2) > 0.85 -- pgvector cosine similarity
order by embedding <=> $2
limit 1;
The <=> operator is pgvector's cosine distance — 0 means identical, 2 means opposite. The shape above is the simpler event-to-event variant; what Culprit ships in production is the per-incident variant from §03.2 — same SQL skeleton, but matching against incidents.representative_embedding rather than sanitized_events.embedding, at threshold 0.85 on a 60-minute window.
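For concreteness, here is a sketch of how that per-incident lookup might be wired from application code, using node-postgres. The incidents table, its state and updated_at columns, and the function name are assumptions for illustration; only representative_embedding, the 0.85 threshold, and the 60-minute window come from the text.

import { Pool } from "pg";

const pool = new Pool();

// Returns the id of the open incident (same service, touched in the last hour)
// whose representative embedding is most similar to the new event's embedding,
// if that similarity clears the threshold; otherwise null.
export async function findMatchingIncident(
  serviceId: string,
  eventEmbedding: number[],
  threshold = 0.85,
): Promise<string | null> {
  const vector = `[${eventEmbedding.join(",")}]`; // pgvector literal format
  const { rows } = await pool.query(
    `select id
       from incidents
      where service_id = $1
        and state = 'open'
        and updated_at > now() - interval '60 minutes'
        and 1 - (representative_embedding <=> $2::vector) > $3::float8
      order by representative_embedding <=> $2::vector
      limit 1`,
    [serviceId, vector, threshold],
  );
  return rows.length > 0 ? rows[0].id : null;
}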
The threshold is the entire system. Set it too low (0.5) and unrelated alerts pile into the same incident — you'll lose distinct outages inside one super-incident and your team will trust the system less than the original noise. Set it too high (0.95) and almost nothing clusters — you've shipped a vector index that does the same job as service+time matching but with extra latency. There is no universally correct value. Pick one, instrument the false-merge and false-split rates against your traffic, iterate.
Embeddings need normalization first. Raw alert text has timestamps, numeric IDs, and stack-trace line numbers that move event-to-event. Embed those verbatim and your "same alert" pair gets cosine 0.6 because half the vector is encoding "different timestamp." Strip timestamps, replace ID-shaped tokens with <ID>, truncate stack traces to the top three frames before embedding. The embedding then captures the semantic shape of the alert, not the noise around it. (This normalization step sits in front of the embedder; it's mechanical, but the difference in cluster quality is large.)
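A minimal sketch of that normalization pass, in TypeScript. The specific regexes are illustrative defaults, not Culprit's production rules; the point is that they run before the embedder ever sees the text.

// Strip the parts of an alert payload that vary event-to-event without changing
// its meaning, so "same alert" pairs embed to nearly the same vector.
export function normalizeAlertText(raw: string): string {
  // Keep only the top three stack-trace frames.
  let frames = 0;
  const withoutDeepFrames = raw
    .split("\n")
    .filter((line) => !/^\s+at\s/.test(line) || ++frames <= 3)
    .join("\n");

  return withoutDeepFrames
    // ISO-8601 timestamps, e.g. 2026-05-08T14:03:22.123Z
    .replace(/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:?\d{2})?/g, "<TS>")
    // UUIDs
    .replace(/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, "<ID>")
    // Long hex digests (commit SHAs, trace ids)
    .replace(/\b[0-9a-f]{16,}\b/gi, "<ID>")
    // IPv4 addresses
    .replace(/\b\d{1,3}(\.\d{1,3}){3}\b/g, "<IP>")
    // Unix epochs and other long digit runs (request ids, line numbers)
    .replace(/\b\d{4,}\b/g, "<ID>")
    .trim();
}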
03.2 — The cluster-anchor problem
Once you move from event-to-event matching to maintaining a single representative embedding per incident, you have to decide what that representative embedding actually is. If event A creates an incident with embedding E_A, and event B (similar to A) joins it with embedding E_B, what's the incident's embedding for the next comparison? Three options:
- Anchor to the first event. Simple, but the incident drifts away from the first event's text as the situation evolves. A connection-pool incident that started with "warning: 80% utilization" no longer matches the later "critical: pool exhausted, 504 spike" event because the text changed too much.
- Mean of all event embeddings. Adapts as the incident evolves, but a single garbage event drags the centroid toward irrelevant ground; subsequent legitimate events stop matching.
- Anchor + decay. The incident's representative embedding is the first event's embedding, but it's recomputed (as a weighted mean) every N events or every M minutes, with the more recent events weighted higher. This is the shape Culprit ships today: weighted mean of the last 20 events with linear time decay (newest weighted highest), recompute fires every 10 attaches OR every 60 minutes, whichever first.
The decay interval is the secondary tuning knob. Recompute too often → behavior is unstable as new events arrive. Recompute too rarely → the cluster anchor goes stale during long-running incidents and stops matching new related events. A reasonable default: recompute every 10 events or every hour, whichever comes first. Same as the threshold: pick a number, watch the false-merge / false-split rates, iterate.
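A sketch of that recompute in TypeScript. The constants mirror the defaults just described (last 20 events, linear decay, newest weighted highest); the type and function names are illustrative.

interface EmbeddedEvent {
  embedding: number[]; // all vectors share one dimension
  createdAt: Date;
}

// Weighted mean of the incident's most recent events, newest weighted highest.
// Called every 10 attaches or every 60 minutes, whichever comes first.
export function recomputeRepresentative(events: EmbeddedEvent[]): number[] {
  const recent = [...events]
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime())
    .slice(0, 20);

  const dim = recent[0].embedding.length;
  const sum = new Array<number>(dim).fill(0);
  let totalWeight = 0;

  // Linear decay: with n recent events, the newest gets weight n, the oldest weight 1.
  recent.forEach((event, i) => {
    const weight = recent.length - i;
    totalWeight += weight;
    for (let d = 0; d < dim; d++) sum[d] += weight * event.embedding[d];
  });

  return sum.map((v) => v / totalWeight);
}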
(Implementation note: Culprit's pipeline normalizes event text before embedding — strips ISO-8601 timestamps, unix epochs, UUIDs, hex digests, IPv4 octets, and stack-trace lines past the top three frames. Two semantically-identical events that differ only in noise produce semantically-identical vectors. The embedding step is the easy part; the normalization that comes before it is what the cluster quality actually rides on.)
03.3 — The notification budget
Correlation reduces page volume only if you wire the paging policy to the incident, not the event. The naive integration ("page on incident creation") gets you most of the way there. The careful integration earns the last 30%.
The careful version has three rules:
- Quiet for 2 minutes before fanning out. When an incident is created (or significantly grows), wait 2 minutes before firing the downstream notification. If 12 more events arrive in that window, they all attach to the incident; the page that fires at the end describes "12 events from 4 services indicating an X" rather than "1 event indicating a Y" with 11 follow-up events arriving in a stream over the next minute. The on-call gets called once with full context, not 13 times. The same quiet-window logic also gates LLM-driven RCA — running the analysis once on the settled cluster, not 13 times on a still-evolving stream — and that's where the cost discipline of the LLM call lives.
- Re-page only on severity escalation. Once an incident has fanned out, additional events join silently unless the severity increases (warning → critical) or the blast radius increases (single-host → multi-host). The pager isn't a tail of the incident; it's a notification of state change.
- Auto-resolve quiet incidents. When an incident has been quiet for some interval (30 minutes is a reasonable starting point) and no human has acknowledged it, mark it auto-resolved with a note. The on-call shouldn't have to come back the next morning and close out a long tail of self-healed warnings.
Rules 1 and 2 are where the 10:1 → 100:1 reduction lives. Rule 3 is where you get back the cleanup time the system would otherwise be costing the on-call.
(Status: all three rules ship today. The 2-minute quiet window gates the customer-facing fan-out for non-CRITICAL severities; CRITICAL alerts bypass the wait so a real outage doesn't sit for two minutes before the page fires (the LLM RCA still runs in the background to enrich the incident the on-call is already paged on). Severity-rank escalation (warning → critical) re-fires dispatch with a 5-minute cooldown; blast-radius escalation is on the roadmap. Auto-resolve runs every 5 minutes via a Cloudflare Cron Trigger worker and marks open incidents that have been quiet for more than 30 minutes as auto-resolved, with two carve-outs: CRITICAL incidents are never auto-resolved, and incidents carrying a high-confidence RCA that no human has acknowledged are left open for the engineer to see.)
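Pulling the first two rules together, here is a sketch of the decision an incident-level dispatcher has to make. The constants (2-minute quiet window, CRITICAL bypass, 5-minute re-page cooldown) are the ones described above; the function shape and field names are illustrative, not the actual dispatch code.

type Severity = "info" | "warning" | "critical";

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };
const QUIET_WINDOW_MS = 2 * 60 * 1000;    // settle time before the first page
const REPAGE_COOLDOWN_MS = 5 * 60 * 1000; // minimum gap between escalation pages

interface IncidentPagingState {
  severity: Severity;        // current (possibly escalated) severity
  createdAt: number;         // epoch ms
  lastPagedAt?: number;      // epoch ms of the most recent page, if any
  lastPagedSeverity?: Severity;
}

// Decide whether the pager should fire for this incident right now.
export function shouldPage(incident: IncidentPagingState, now: number): boolean {
  if (incident.lastPagedAt === undefined) {
    // First page: CRITICAL bypasses the quiet window; everything else waits two
    // minutes so the page describes the settled cluster, not the first event.
    if (incident.severity === "critical") return true;
    return now - incident.createdAt >= QUIET_WINDOW_MS;
  }

  // Already paged: re-page only on severity escalation, rate-limited.
  const previous = incident.lastPagedSeverity ?? incident.severity;
  const escalated = SEVERITY_RANK[incident.severity] > SEVERITY_RANK[previous];
  return escalated && now - incident.lastPagedAt >= REPAGE_COOLDOWN_MS;
}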
03.4 — The escape valve
Correlation will be wrong sometimes. The team has to be able to override it.
Two operations are load-bearing:
- Split. "These two events were merged into one incident, but they're actually different problems." UI-side: select an event in the incident, click "split into new incident," it gets a fresh incident object and the old one stays with the remaining events. Database-side: re-anchor the original incident's representative embedding from its remaining events.
- Merge. "These two incidents are the same problem; the system didn't catch it." Pick incident A, target incident B, all of A's events move to B, A is closed with a
merged_intoreference, B's representative embedding is recomputed.
Without split/merge, engineers stop trusting the clustering after the first wrong call. With it, they get to fix the clustering's mistakes in seconds, which is what they need to develop the trust that the rest of the system depends on. Build these in week one of the rollout, not week six.
One subtle point: split and merge should re-write the audit trail forward-only. Don't delete or rewrite the original incident records — append a new incident with a derived_from pointer. This is what lets your post-mortem reconstruct what the system did think at the time, separately from what it should have thought in retrospect.
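A sketch of what a forward-only merge can look like against the tables used earlier, with node-postgres. The merged_into pointer, the representative-embedding recompute, and the incident_merged audit row come from this piece; the table layout, the audit_events table, and the state value are assumptions.

import { Pool } from "pg";

const pool = new Pool();

// Move every event from the source incident onto the target, close the source
// with a merged_into pointer, and record the operation as a new audit row
// instead of rewriting history.
export async function mergeIncidents(sourceId: string, targetId: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("begin");

    await client.query(
      "update sanitized_events set incident_id = $2 where incident_id = $1",
      [sourceId, targetId],
    );

    await client.query(
      "update incidents set state = 'closed', merged_into = $2 where id = $1",
      [sourceId, targetId],
    );

    // Recompute the target's representative embedding from its events here
    // (the anchor + decay sketch from §03.2), in the same transaction.

    await client.query(
      `insert into audit_events (incident_id, kind, payload)
       values ($1, 'incident_merged', $2)`,
      [targetId, JSON.stringify({ merged_from: sourceId })],
    );

    await client.query("commit");
  } catch (err) {
    await client.query("rollback");
    throw err;
  } finally {
    client.release();
  }
}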
(Status: split / merge ship today. Every incident detail page has a "Split…" button (modal with per-event checkboxes; submit creates a new incident with a derived_from pointer back to the parent) and a "Merge into…" button (modal lists same-service open incidents; submit absorbs the source's events into the target and marks the source merged with a merged_into pointer). Both operations recompute the affected incidents' representative embeddings on the same transaction and write incident_split / incident_merged audit-event rows. The role gate is owner+admin in strict mode, member in flat mode — same posture as service-management.)
04 — What you give up
This is honest-tradeoffs time. Correlation is not free.
Latency. Pre-correlation, an alert reaches the pager in milliseconds. Post-correlation, you've added the embedding step (~50-200ms depending on model + cache), the vector search (~10-30ms with a warm pgvector index), and the 2-minute quiet window before paging on a new incident. The 2-minute window is the visible cost — if you have an alert pipeline whose users expect "page within 30 seconds of the spike," you have to either skip the quiet window (and accept some over-paging) or change their expectation. The change is usually worth making.
Cost. Vector embeddings cost something — at OpenAI's text-embedding-3-small rates (roughly $0.02 per million tokens), embedding a million short alert payloads costs on the order of a few dollars. At a typical SaaS scale (hundreds of thousands of events per month per tenant) the embedding cost rounds to dollars per month per tenant, dominated by the storage of the vectors themselves rather than the API call. The expensive thing in this architecture is the LLM-driven root-cause analysis on top of the clustered incidents, not the clustering itself.
Operational complexity. You now have a vector index to maintain (pgvector or a managed equivalent), a similarity threshold and decay interval to tune, a notion of "incident lifecycle" with state transitions to monitor, and a split/merge UI that has to actually work. None of these are individually crushing; collectively they're a meaningful chunk of engineering work. Plan for two weeks of focused build for the first version and another month of tuning.
A different class of bug. The new failure mode is "two unrelated outages got merged into one incident, and the second one was discovered three hours later when someone manually looked at the on-call queue." This is a real failure, and it can be more harmful than the original noise problem because it hides the second incident's start time. Mitigations: alert on incidents that absorb events from N+ distinct services within M minutes (likely a false merge); page a second time on incidents whose event rate sustains above some threshold (something is actively wrong, even if we already paged once). The clustering reduces noise; it doesn't remove the need for sanity checks on the clustering.
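A sketch of the first mitigation as a periodic check. The thresholds (four distinct services in fifteen minutes) are placeholders to tune against your own topology; the query shape reuses the sanitized_events table from earlier.

import { Pool } from "pg";

const pool = new Pool();

// Flag incidents that absorbed events from many distinct services in a short
// window: a likely false merge that deserves its own alert.
export async function findSuspectedFalseMerges(): Promise<string[]> {
  const { rows } = await pool.query(
    `select incident_id
       from sanitized_events
      where incident_id is not null
        and created_at > now() - interval '15 minutes'
      group by incident_id
     having count(distinct service_id) >= 4`,
  );
  return rows.map((row) => row.incident_id);
}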
05 — What you get back
The thing you get back isn't "ten alerts instead of a thousand." That framing under-sells it. The thing you get back is your team's relationship with their pager.
The on-call rotation stops being the thing that everyone dreads and starts being the thing that everyone takes seriously, because every page now corresponds to a thing that's actually worth waking up for. The 41 noise pages from the opening stop happening — the underlying events aggregate into incidents and get worked during the next business day, if at all. The six real distinct problems get individual incidents, individual pages, individual triage. The on-call engineer goes from "I got 47 pages this week" to "I got 8 pages this week, and I remember each one."
The downstream effects are large and not fully reversible:
- Retention improves. On-call burnout is a top-three reason for SRE / DevOps engineers leaving early-stage companies. A team whose on-call rotation is bearable is a team that grows instead of churning.
- Incident response speeds up. When the on-call has fewer pages to triage, each one gets more attention. The mean time to first response drops because the engineer isn't context-switching between simultaneous false alarms.
- Post-mortem culture improves. With 8 weekly incidents instead of 47, post-mortems become possible to actually run. The team starts learning from outages instead of just surviving them.
- You can hire engineers who would have refused the on-call rotation otherwise. Senior SREs increasingly screen for "what does on-call look like here" in interviews. A 10:1 alert-to-incident ratio is the kind of answer that closes candidates.
None of this is hypothetical. It's what you get when the layer between the alert pipeline and the pager actually does its job.
06 — Where to start, if you want to start
The order matters. Don't start with the vector index; start with the measurement.
- Instrument the ratio. Before any change, count: pages last week / distinct problems your team would consider "really worth waking up for." If you can't get that number quickly, you're flying blind. Most teams discover the ratio is somewhere between 5:1 and 50:1; that establishes both the baseline and the size of the prize.
- Add a 5-minute dedup window keyed on (service, error_message_hash). Cheap; collapses the obvious double-alerts. You'll see 20-40% page reduction in the first week with no machine learning involved. This is also the existence proof that the team trusts to justify the larger build. (A sketch of the dedup key follows this list.)
- Add per-tenant vector clustering with the anchor + decay rule from §03.2. This is the piece that takes a week of engineering and a month of tuning. Start with the threshold liberal (0.6-0.7, more aggressive merging), measure the false-merge rate, ratchet up. For reference: we landed at 0.85 after months of tuning, but the right threshold is service-shape-dependent and you'll need to find your own; treat the literal number in the closing paragraph as a destination, not a starting point.
- Build the split / merge UI before you ship the clustering to production. Engineers' trust in the system is gated entirely on whether they can correct its mistakes in real time. Without the override path, the system will be ignored within two weeks; with it, the small wrong-merge incidents become a learning signal instead of a credibility hit.
- Move the paging policy to the incident level. Notification budget rules from §03.3: 2-minute quiet window before first page, no re-page unless severity escalates, auto-resolve quiet incidents. This is what gets you the 5x → 10x reduction over what dedup alone delivers.
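The dedup key from step 2 is small enough to sketch in full. Everything here is an assumption for illustration (the hashing, the fixed wall-clock bucket), but it shows how little machinery the first 20-40% of the reduction needs.

import { createHash } from "node:crypto";

// Two events that land in the same 5-minute bucket with the same service and
// the same (ideally normalized) error message collapse into one page.
export function dedupKey(serviceId: string, errorMessage: string, at: Date): string {
  const bucket = Math.floor(at.getTime() / (5 * 60 * 1000));
  const messageHash = createHash("sha256").update(errorMessage).digest("hex").slice(0, 16);
  return `${serviceId}:${messageHash}:${bucket}`;
}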
Six to eight weeks of focused work gets you to the end of step 5. The thing that gates the timeline is not the engineering — it's the calibration loop. You're going to ship at threshold 0.6 and discover that two unrelated outages keep getting merged. You're going to bump to 0.75 and discover that the same database-saturation incident is now firing as three separate incidents because the alert text varies more than you expected. You'll iterate. Plan for it.
If you'd rather not build and tune the pipeline yourself, Culprit ships the architecture above today: representative-incident clustering at 0.85 cosine threshold on a 60-minute window with the anchor + decay recompute rule from §03.2; embedding normalization in front of the embedder; the 2-minute quiet window for non-CRITICAL severities + immediate fan-out on CRITICAL; severity-escalation re-paging with a 5-minute cooldown; auto-resolve sweep every 5 minutes; split + merge primitives on every incident page with full audit lineage. The piece is here because the architecture above should be the default for anyone building in this space, not a thing each team has to discover by getting bitten.