2026-05-13 · engineering / pgvector / postgres / alert-correlation

pgvector for alert correlation: how a 1536-dim embedding ate our incident clustering

How to ship vector-based alert correlation in a few hundred lines of SQL plus one embedding API call per event — the query, the tuning knobs, the failure mode, and the line where pgvector stops being enough.

By Culprit · 12 min read

The default way to correlate alerts is to write rules. "If two events arrive on the same service within 60 seconds and one of them mentions connection, attach them to the same incident." It works on the alerts you wrote the rule for, and not on the next variant of the same root cause that arrives with a slightly different message.

There is a simpler approach. Embed each alert's text into a dense vector, store the vector on the event row, and when a new event comes in, find the recent open incidents whose representative vector is close enough and attach. Postgres has pgvector. The whole thing fits in a couple hundred lines of SQL and one embedding API call per event.

This post is the architecture, the SQL that does the work, the four knobs you have to tune, and the failure mode that will eventually bite you. It also says where the vector layer stops, because pgvector gets you correlation, not root cause.

The shape of "alert as a vector"

An alert payload that arrives over a webhook is a JSON blob with maybe a dozen fields. The interesting ones for correlation are the ones a human would read aloud to describe the alert: a title, a message, a service name, maybe a check name. The rest is metadata that helps you filter but does not help you cluster.

The pipeline is four steps:

  1. Pull the textual fields out of the payload in a deterministic order. We pick title, summary, message, and check_name if present, joined with a fixed separator; a sketch follows the list. Whatever you pick, do it the same way every time: any change in how the text is assembled shifts the resulting vector, and with it every similarity score you have tuned a threshold against.
  2. Normalize. Lowercase, collapse whitespace, strip the per-event noise (incident IDs, timestamps, run IDs). After tokenization, the <TOKEN_…> placeholders are already in the text, so the embedding sees <TOKEN_a1b2c3> instead of db-prod-west-2. That matters less than you think — the model reads "host placeholder followed by connection refused" as the same shape regardless of which host.
  3. Embed. One API call per event. We use text-embedding-3-small from OpenAI: 1536 dimensions, $0.02 per million tokens, fast enough to sit in the hot path of a queue consumer.
  4. Write the vector to the event row. A vector(1536) column on sanitized_events, indexed for cosine distance.
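
A minimal sketch of steps 1 and 2, expressed as SQL for concreteness. The payload jsonb column name and the whitespace regex are assumptions, and in practice this logic lives wherever your ingest consumer runs:

-- Build the canonical text for one event: fixed field order, fixed separator,
-- lowercased, whitespace collapsed. concat_ws skips fields that are NULL.
SELECT lower(regexp_replace(
         concat_ws(' | ',
           payload->>'title',
           payload->>'summary',
           payload->>'message',
           payload->>'check_name'),
         '\s+', ' ', 'g')) AS canonical_text
FROM sanitized_events
WHERE id = $1;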

The schema is the boring part:

CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE sanitized_events
  ADD COLUMN embedding vector(1536);

CREATE INDEX sanitized_events_embedding_hnsw
  ON sanitized_events
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

vector(1536) matches the embedding model output. If you change models, you change the column type, which means a migration; do not hardcode the dimension anywhere outside one config constant. hnsw is one of two index types pgvector ships. We will come back to which to use.

The query that does the work

Correlation is one query. Take the new event's vector, look at the recent open incidents on the same service, find the closest representative-vector match. If it is close enough, attach. If not, open a new incident.

WITH candidate AS (
  SELECT
    i.id              AS incident_id,
    i.embedding       AS rep_embedding,
    i.embedding <=> $1 AS distance
  FROM incidents i
  WHERE i.tenant_id  = $2
    AND i.service_id = $3
    AND i.status     = 'open'
    AND i.last_seen_at > now() - interval '15 minutes'
  ORDER BY i.embedding <=> $1
  LIMIT 50
)
SELECT incident_id, distance, 1 - distance AS cosine_similarity
FROM candidate
WHERE distance < 0.15      -- 1 - 0.85 = 0.15 -> cosine threshold 0.85
ORDER BY distance
LIMIT 1;

The pieces:

  • <=> is pgvector's cosine-distance operator. Distance is 1 - cosine_similarity, so a smaller number is a closer match.
  • The WHERE clause is the filter set: same tenant, same service, status open, recent. Do not skip the time-window filter. Without it, pgvector returns the closest match across all open incidents your tenant has ever opened on this service, and a brand-new alert can attach to a six-month-old incident nobody has touched.
  • LIMIT 50 is the candidate-set size. HNSW returns approximate nearest neighbors; the closer a match needs to be, the larger your candidate set has to be to find it.
  • The threshold check is in the outer query, not in the WHERE of the CTE. The index lookup cannot be predicate-pushed against an inequality on the distance — it returns the candidate set first, then you filter. The CTE makes the EXPLAIN plan readable.

If the query returns a row, attach the new event and update the incident's representative embedding. If it returns no row, open a new incident with the event's vector as the initial representative.
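
Sketched with only the columns the rest of this post already touches; the parameter positions are illustrative:

-- Match found: attach the event to the incident (the representative-embedding
-- update is covered further down).
UPDATE sanitized_events SET incident_id = $1 WHERE id = $2;

-- No match: open a new incident seeded with the event's own vector.
INSERT INTO incidents (tenant_id, service_id, status, embedding, last_seen_at, event_count)
VALUES ($3, $4, 'open', $5, now(), 1)
RETURNING id;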

The four knobs

The pieces above are the architecture. The work is in the tuning. You have four knobs and they interact.

Knob 1: the similarity threshold

0.85 is the most common starting point in the literature. Higher means stricter — fewer false-positive merges, more false-negative splits where the same root cause produces two incidents because the wording diverged. Lower is the inverse.

Concrete shape, rounded to make the tradeoff readable:

  • At 0.92 you almost never wrongly cluster two unrelated alerts. You also miss real correlations whenever a service emits the same root cause twice with even modest wording changes — a runbook update, a slightly different timeout value, a new field in the payload.
  • At 0.85 you catch most real correlations. You also occasionally cluster a connection-pool storm with an unrelated TLS-handshake failure that happens to share enough surface tokens (region, hostname placeholder, the word timeout).
  • At 0.80 you cluster aggressively and you will get incidents that should have been two.

We run 0.85 as the default and let on-call engineers correct mistakes through split/merge primitives. The threshold is not the only defense against mis-clustering; the override is. A vendor whose only answer to "what if it gets it wrong" is "tune the threshold" is selling you the wrong shape of system.

The threshold should be a config value, not a hardcoded literal in the SQL. You will change it more than you expect.

Knob 2: the time window

The last_seen_at > now() - interval '15 minutes' filter is the single biggest difference between "correlation that scales" and "correlation that gradually attaches everything to the oldest open incident."

15 minutes is what we run:

  • Most correlated events for a single root cause arrive within a few minutes of each other.
  • 15 minutes is enough headroom that a slow-burn incident — a memory leak emitting alerts every two or three minutes — still attaches new events to the right incident.
  • 60 minutes starts to attach unrelated alerts, because services recur on the same shape of problem within an hour and you do not want a midday connection-pool blip attached to the morning's incident that is technically still open but that everyone has mentally closed.

If your traffic shape is different — your alerts are slower or burstier — pick a different number. The number is not load-bearing; the existence of some time window is.
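
If the window does become something you tune per deployment, make it a bind parameter rather than an inline literal. A sketch of the candidate filter with the window in minutes passed as $4 (the parameter position is illustrative):

SELECT i.id, i.embedding <=> $1 AS distance
FROM incidents i
WHERE i.tenant_id  = $2
  AND i.service_id = $3
  AND i.status     = 'open'
  AND i.last_seen_at > now() - make_interval(mins => $4)
ORDER BY i.embedding <=> $1
LIMIT 50;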

Knob 3: candidate-set size

LIMIT 50 is the candidate set the HNSW index returns. The larger this number, the more likely the index returns the actual nearest neighbor at the cost of more work per query.

HNSW's recall (how often it returns the true nearest neighbor and not just a close one) is governed by the ef_search parameter, set per-query via SET LOCAL hnsw.ef_search = N. Default is 40. Push it to 100 for correlation queries:

SET LOCAL hnsw.ef_search = 100;
WITH candidate AS ( ... )
SELECT ...;

LIMIT 50 and ef_search = 100 are independent: ef_search is the size of the search frontier inside the index, LIMIT is how many frontier results bubble up. Larger ef_search is more expensive but more likely to surface the actual nearest neighbor.

For a service with five open incidents, both numbers are irrelevant. For a service with five hundred open incidents during a noisy outage, the difference between ef_search = 40 and ef_search = 100 is the difference between "occasionally misses the right cluster" and "consistently finds it."

Knob 4: HNSW vs IVFFlat

pgvector ships two index types. They have different tradeoffs.

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph. High insert cost, low query cost, excellent recall, tunable per-query (hnsw.ef_search) without rebuilding. The default for new pgvector deployments.

IVFFlat (inverted file with flat storage) clusters vectors into lists partitions at build time and at query time searches only the probes partitions nearest the query vector. Low insert cost, query cost scales with probes, recall depends on how well the partitions fit the data, and the index has to be rebuilt periodically as the data distribution shifts. Useful when you have hundreds of millions of rows and the HNSW build cost is prohibitive.

For alert correlation, HNSW is the simpler default if you are on pgvector 0.5.0+. If you are on an older version (Supabase shipped 0.4 for a long time), IVFFlat with lists = 100 is a reasonable starting point at the tens-of-thousands-of-rows scale; rebuild it with REINDEX if the data distribution shifts. The active set queried by the correlator (open incidents in a 15-minute window) is small and bounded either way. Pick the one your runtime supports without contortions and move on; this knob is rarely the bottleneck.
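
For reference, the IVFFlat version of the earlier index and its query-time knob; lists = 100 matches the starting point above, probes = 10 is an illustrative value:

CREATE INDEX sanitized_events_embedding_ivfflat
  ON sanitized_events
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

-- Per-query recall knob, the IVFFlat analogue of hnsw.ef_search.
SET LOCAL ivfflat.probes = 10;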

The representative-embedding update rule

When a new event attaches to an incident, the incident's representative embedding has to update or the next event will not match against the most recent shape of the cluster. The naive approach — store every event's vector and average them all on every query — works for small incidents and dies on large ones.

The correct approach is a single representative vector on the incident row, updated incrementally. A rule that works well is an exponentially weighted moving average:

UPDATE incidents
SET embedding = (
      -- pgvector has no vector-times-scalar operator, so the 0.9/0.1 blend
      -- goes through a real[] round-trip; $2 is the new event's embedding
      SELECT array_agg((0.9 * r + 0.1 * n)::real ORDER BY ord)::vector
      FROM unnest(embedding::real[], $2::real[]) WITH ORDINALITY AS t(r, n, ord)
    ),
    last_seen_at = now(),
    event_count = event_count + 1
WHERE id = $1;

The 0.9/0.1 weighting biases toward the historical shape (so a single weird event does not pull the representative off) while letting the cluster drift gradually. Anything more aggressive than 0.7/0.3 drifts fast enough that a 15-minute incident no longer matches its own original event.

One subtle correctness issue: a weighted sum of two unit vectors is not itself a unit vector. Cosine distance does not care about magnitude, but the blend does: once the representative's norm shrinks below 1, a full-length new event vector contributes more direction than the nominal 0.1 weight suggests, and the skew compounds with every update. Wrap the blended result in l2_normalize(...) (a pgvector built-in as of 0.7.0) and never think about it again.
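
A cheap sanity check that representatives really are staying unit-length; vector_norm is pgvector's L2-norm function, and the 0.01 tolerance is arbitrary:

-- Any open incident whose representative has drifted off unit length shows up here.
SELECT id, vector_norm(embedding) AS rep_norm
FROM incidents
WHERE status = 'open'
  AND abs(vector_norm(embedding) - 1.0) > 0.01;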

The failure mode: drift over time

A service that has been emitting the same shape of ConnectionPoolExhausted alert for six months suddenly starts emitting it with a slightly different message — a runbook author added a "Suggested fix:" trailer, the upstream library changed its log format. The new alerts embed differently. The old representative vectors no longer match. The correlator starts opening new incidents for what is semantically the same root cause.

The first symptom is "incident count for this service ticked up and I cannot tell why." The second is on-call engineers reporting that they keep merging incidents that should never have been split.

The detection is straightforward. For each open incident, compare the representative vector to a recent rolling-window average of the events actually attached to it. If the cosine similarity drops below some threshold (0.92 is a reasonable starting point), the cluster has drifted:

WITH recent_events AS (
  SELECT incident_id, AVG(embedding) AS recent_avg
  FROM sanitized_events
  WHERE incident_id = $1
    AND created_at > now() - interval '24 hours'
  GROUP BY incident_id
)
SELECT
  i.id,
  1 - (i.embedding <=> r.recent_avg) AS rep_vs_recent_similarity
FROM incidents i
JOIN recent_events r ON r.incident_id = i.id
WHERE 1 - (i.embedding <=> r.recent_avg) < 0.92;

The recovery is a representative refresh. Recompute the representative as the L2-normalized average of the recent events:

UPDATE incidents
SET embedding = l2_normalize((
  SELECT AVG(embedding)
  FROM sanitized_events
  WHERE incident_id = incidents.id
    AND created_at > now() - interval '24 hours'
))
WHERE id = $1;

Run it as a cron job. We run ours every few hours; the cadence matches "how fast can a service's alert vocabulary change without us noticing." The job is cheap — one query per active incident — and it bounds drift without throwing out the correlation history.
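
If you would rather keep the job inside the database, pg_cron can drive it. The schedule, the set-wise form, and reusing the 0.92 drift threshold are assumptions layered on the per-incident query above:

-- Every four hours, refresh the representative of any open incident that has
-- drifted away from its own recent events.
SELECT cron.schedule('refresh-incident-reps', '0 */4 * * *', $$
  UPDATE incidents i
  SET embedding = l2_normalize(r.recent_avg)
  FROM (
    SELECT incident_id, AVG(embedding) AS recent_avg
    FROM sanitized_events
    WHERE created_at > now() - interval '24 hours'
    GROUP BY incident_id
  ) r
  WHERE r.incident_id = i.id
    AND i.status = 'open'
    AND 1 - (i.embedding <=> r.recent_avg) < 0.92
$$);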

The thing not to do is recompute the representative on every event attach by averaging the full event set. That is O(N) per attach and unmanageable at scale. The exponentially weighted update is O(1) per attach; the periodic full refresh is O(N) but amortized across hours.

Where pgvector stops being enough

The vector layer tells you "these seventeen events probably belong to the same incident." It does not tell you what caused the incident. That needs a different tool.

The handoff is clean. After clustering, an incident has a known set of events. Hand the events plus context (service name, time window, severity, prior similar incidents) to an LLM with a structured RCA prompt. The LLM produces a hypothesis, a confidence score, and a list of next steps. The vector layer cannot do that — it only knows distance in embedding space. The LLM cannot do the clustering at scale — you cannot afford to call a model on every event, and the model has no way to know which events belong together until you have already clustered them.
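
The "prior similar incidents" part of that context is itself one more vector query, this time over incidents that are no longer open and without the recency filter:

-- Closest historical incidents on the same service, as context for the RCA prompt.
SELECT id, 1 - (embedding <=> $1) AS similarity, last_seen_at
FROM incidents
WHERE tenant_id  = $2
  AND service_id = $3
  AND status    <> 'open'
ORDER BY embedding <=> $1
LIMIT 5;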

The two layers compose. pgvector gets you from a thousand events per minute to ten incidents per minute. The LLM gets you from ten incidents per minute to ten explained incidents per minute. Embeddings are good at similarity; LLMs are good at explanation; pick the right tool for each.


The architecture above is what runs underneath Culprit's incident clustering — pgvector at the correlation layer, Anthropic at the explanation layer, with split/merge primitives so on-call engineers can correct the system when it gets a cluster wrong. The full security architecture is on /security; a working demo of the pipeline (sanitization, clustering, RCA) is on /demo.
