2026-05-13 — security / engineering / cloudflare-workers / regex
How we built a no-ReDoS customer regex tokenizer in Cloudflare Workers
A pure-JS RE2 port, a 60s per-isolate cache with stampede guard, a Web Worker test gate, and the three production bugs we caught in the first 24 hours.
By Culprit · 21 min read
The problem is small to state and not small to solve. Customers want to extend our PII tokenizer with their own regex — internal account IDs, custom record formats, vendor-specific trace identifiers, anything our 25-category default detector won't recognise by name. The obvious implementation is a tenant_pii_patterns table, a RegExp per row, and a loop. The non-obvious problem is that stock JavaScript RegExp permits catastrophic backtracking — a single accidentally-pathological pattern can pin a CPU for seconds, and on Cloudflare Workers, where isolates are shared and CPU limits are hard, "seconds" means "your worker is killed and every other tenant on that isolate notices." This is the post about how we got from "obvious implementation that opens the production pipeline to ReDoS" to "linear-time regex engine running in three places that all agree about what a valid pattern is."
The constraint that kills the obvious solution
ReDoS — regular-expression denial of service — is a class of bug where a regex with nested quantifiers degenerates into exponential time on adversarial input. The textbook example is (a+)+$ against the string aaaaaaaaaaaaaaaaaaaaX. Stock JavaScript RegExp is backtracking-based, and that pattern walks every partition of the a run looking for a way to satisfy the trailing $. There isn't one, and walking takes time exponential in the run length.
You don't need a malicious customer for this to bite. A well-meaning person trying to write a pattern for "double-quoted JSON string values" can produce "(.+)+" and not realise it's an attack on themselves. The first time it ships against an input that doesn't match, their pattern locks up the engine.
Cloudflare Workers is the worst place for this to happen. Isolates are shared across requests; CPU is metered with a hard ceiling (10 ms on the free plan, 30 s on the paid plan); and there is no native re2 binding to fall back on. A single bad pattern in our pipeline would stall every event from every customer routed to that isolate until the runtime killed the request, at which point we lose the in-flight event and the customer gets a retry storm.
Here is the smallest possible reproduction of the problem in a stock-RegExp implementation:
// DO NOT SHIP THIS.
function runCustomerPatterns(input: string, patterns: { regex: string }[]) {
const matches: string[] = [];
for (const p of patterns) {
const re = new RegExp(p.regex, 'g'); // backtracking engine
for (const m of input.matchAll(re)) {
matches.push(m[0]);
}
}
return matches;
}
// One row in tenant_pii_patterns, harmless-looking:
// { regex: '(a+)+$' }
// One real-world input it never matches:
// 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
// Result: 30+ seconds of CPU on a 31-character string.
The standard mitigations don't apply. You can't put a per-pattern timeout around matchAll — the JavaScript event loop won't preempt a regex that's still executing, so by the time your timer fires, the worker has already burned through its CPU budget. You can't sandbox the regex in a separate process — Workers doesn't have processes. You can't statically reject "dangerous-looking" patterns — the academic literature on safe-regex detection is full of false negatives, and any whitelist narrow enough to be safe is also too narrow to be useful.
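To make the first of those concrete, here is a sketch of the timeout approach that looks plausible and cannot work; the function name and the 100 ms budget are ours, purely for illustration:
// A sketch of the timeout mitigation that cannot work. matchAll runs
// synchronously on the current event-loop turn, so the timer below is only
// evaluated after the regex has finished, or after the runtime has already
// killed the Worker for exceeding its CPU budget.
async function matchWithTimeout(input: string, pattern: string): Promise<string[]> {
  const work = Promise.resolve().then(() =>
    [...input.matchAll(new RegExp(pattern, 'g'))].map((m) => m[0])
  );
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('regex timed out')), 100)
  );
  // The backtracking search never yields to the event loop, so this race is
  // decided only once the synchronous match has already returned (or burned
  // through the CPU limit). The timer cannot preempt anything.
  return Promise.race([work, timeout]);
}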
The only real solution is to use a regex engine that doesn't backtrack.
Why re2-wasm didn't work in Workers
Google's RE2 is the canonical answer. It's an NFA-based engine with a provable linear-time-in-input × pattern-length guarantee. There is no input you can construct, against any pattern RE2 accepts, that takes more than O(n × m) time. RE2 explicitly rejects features that would break the guarantee — backreferences and lookaround — which is a feature, not a bug, when your goal is "no ReDoS, ever."
RE2 is C++. The straightforward way to use it from JavaScript is re2-wasm, which wraps the C++ source in a WebAssembly module and exposes a JS API surface. We tried it. It does not work in Cloudflare Workers, for reasons that took half a day to track down and are worth documenting because someone else is going to make the same attempt.
re2-wasm is built with Emscripten. The Emscripten loader has two paths for locating the .wasm blob it needs to instantiate — Node mode and browser mode — and Workers fits neither. In Node mode, the loader executes:
// Excerpted from the Emscripten-generated re2.js loader
if (ENVIRONMENT_IS_NODE) {
var fs = require('fs');
var nodePath = require('path');
// ... resolves re2.wasm relative to __dirname via fs.readFileSync
}
There is no require in workerd, and there is no fs. The bundle build succeeds because esbuild can statically resolve the import; the runtime fails on the first regex compile because require('fs') throws.
In browser mode, the same loader does:
if (ENVIRONMENT_IS_WEB) {
// Resolves re2.wasm via fetch() relative to import.meta.url
var wasmBinaryFile = new URL('re2.wasm', import.meta.url).toString();
// ... awaits fetch(wasmBinaryFile)
}
Workers has fetch, but the URL it produces resolves against the request's host, not the bundled asset. There's no equivalent of a webpack asset/resource pipeline in the Workers build that would put the .wasm blob at a fetchable URL. The bundle builds; the first regex compile makes a fetch request to a path on the customer's domain that returns a 404.
Neither failure mode is fixable by environment shimming. You can't polyfill require('fs') for a binary blob the loader expects to read off disk, and you can't make import.meta.url resolve to a Workers asset that the runtime doesn't expose to user code. The package is also effectively abandoned — the last release was two years ago and the open issues sit unanswered. Patching it ourselves was an option, but at that point we were several days into a problem that was supposed to be a library swap.
Re2js: same algorithm, native in Workers
The fix is re2js, a pure-JavaScript port of Google RE2. Same NFA construction, same linear-time guarantee, same intentionally restricted feature set (no backreferences, no lookaround), no native dependency. Runs natively in Workers, in a browser Web Worker, and in Node, with one import line and zero platform-specific glue.
The microbenchmark cost relative to compiled C++ RE2 is real. Pure-JS NFA simulation is 5-10× slower than the WASM equivalent on long inputs. At our scale this does not matter. Our event payloads are typically under 8 KiB, and a tenant rarely has more than a dozen custom patterns. Twelve patterns × 8 KiB at re2js's measured throughput is well under one millisecond per event, dominated in the pipeline by HMAC tokenization, encryption, and the embedding API call.
The bundle cost matters more. re2js adds about 40 KiB gzipped to the pipeline Worker bundle (measured at deploy time: 250 KiB total, up from ~210 KiB pre-feature). The Workers free and paid plans cap compressed bundle size at 1 MiB, so we have headroom but not infinite headroom. If we ever ship a feature that needs another 750 KiB of dependencies, we'll need to split the customer-pattern runner into a separate Worker invoked via Service Bindings. That is a future problem with a known solution; today it's a single-Worker bundle.
The API surface looks like this:
import { RE2JS } from 're2js';
// Compile once at cache load time:
const compiled = RE2JS.compile('ACCT-\\d{8}', RE2JS.CASE_INSENSITIVE);
// Match many times against many inputs:
const matcher = compiled.matcher('order ACCT-12345678 received');
while (matcher.find()) {
console.log(matcher.group(), matcher.start());
}
// → 'ACCT-12345678' 6
The compile → matcher → find shape is a deliberate Java-style API, inherited from RE2/J, Google's Java port of RE2 that re2js follows. It maps cleanly to a "compile when the cache loads, matcher per input" usage pattern, which is how we use it everywhere — the pipeline runner, the browser test gate, and the server-side compile-on-write validator all share this exact idiom. Same engine, same flag bitmask, same guarantee in all three places.
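Both the pipeline cache and the browser test Worker translate the stored flags into that bitmask with a small helper that isn't reproduced elsewhere in this post. A minimal sketch, assuming the stored flags are a letter string like 'i' or 'im' and assuming the MULTILINE / DOTALL constant names from re2js's RE2/J-style API (only CASE_INSENSITIVE appears verbatim above):
import { RE2JS } from 're2js';
// Sketch of the flagsStringToBitmask / flagsToBitmask helpers referenced in
// the excerpts below. The flag-letter vocabulary and the MULTILINE/DOTALL
// constants are assumptions, not confirmed by this post.
function flagsStringToBitmask(flags: string): number {
  let mask = 0;
  if (flags.includes('i')) mask |= RE2JS.CASE_INSENSITIVE; // case-insensitive match
  if (flags.includes('m')) mask |= RE2JS.MULTILINE;        // ^ and $ match at line breaks
  if (flags.includes('s')) mask |= RE2JS.DOTALL;           // . matches newline
  return mask;
}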
The cache: 60s TTL + stampede guard
re2js compile is cheap but not free. Our pipeline Worker may handle hundreds of events per second from a single tenant, and each event needs the tenant's full pattern list. Recompiling every pattern every event would be visible in the request budget. We cache compiled patterns per tenant in an in-isolate Map.
// workers/pipeline/src/pii-pattern-cache.ts
interface CacheEntry {
compiled: CompiledPattern[];
version: number;
fetchedAt: number;
refreshing?: Promise<CacheEntry>;
}
const TTL_MS = 60_000;
const cache = new Map<string, CacheEntry>();
export async function loadCustomerPatterns(
tenantId: string,
env: Env
): Promise<CompiledPattern[]> {
const now = Date.now();
const entry = cache.get(tenantId);
if (entry && now - entry.fetchedAt < TTL_MS) {
return entry.compiled;
}
if (entry) {
if (!entry.refreshing) {
entry.refreshing = refreshEntry(tenantId, env)
.then((fresh) => { cache.set(tenantId, fresh); return fresh; })
.catch((err) => {
console.error('pii-pattern-cache refresh failed', { tenantId, err });
if (cache.get(tenantId) === entry) entry.refreshing = undefined;
return entry;
});
}
return entry.compiled;
}
const fresh = await refreshEntry(tenantId, env);
cache.set(tenantId, fresh);
return fresh.compiled;
}
Three things in this code are load-bearing.
Per-tenant keying. A slow refresh for tenant A doesn't block any event from tenant B. The cache is a Map, not a single-entry slot, and the lookup is keyed on tenantId. There is no shared lock anywhere in this hot path.
60-second TTL. Long enough that a busy tenant amortises the compile cost across thousands of events; short enough that a customer who saves a new pattern in the dashboard sees it apply within a minute. Each cache entry also carries the tenants.pii_patterns_version integer that mutation RPCs increment, so a refresh after an update is guaranteed to read the new version, not a stale replica row.
Stampede guard. The non-obvious branch is the middle one. When an entry is past its TTL but a previous version exists, we fire off a refresh in the background and return the stale entry to the caller. If a hundred concurrent events arrive during that refresh window, they all see the in-flight refreshing promise and reuse it instead of each starting their own fetch. The stale entry isn't perfectly fresh, but it's well-formed and matches what the engine accepted last time. A 60-second cache that thunders the database every TTL boundary is worse than a cache that serves a slightly-stale entry for an extra few hundred milliseconds.
The refresh itself is two parallel HTTP calls — one for the pattern list, one for the version counter — followed by a serial compile loop:
async function refreshEntry(tenantId: string, env: Env): Promise<CacheEntry> {
const [patternsRes, versionRes] = await Promise.all([
fetch(`${env.SUPABASE_URL}/rest/v1/rpc/list_pii_patterns`, {
method: 'POST',
headers: { /* service-role key */ },
// p_tenant_id is required for service-role callers — see "bug 2" below
body: JSON.stringify({ p_tenant_id: tenantId }),
}),
fetch(`${env.SUPABASE_URL}/rest/v1/tenants?id=eq.${tenantId}&select=pii_patterns_version`, {
headers: { /* service-role key */ },
}),
]);
// ...
const compiled: CompiledPattern[] = [];
for (const p of patterns.filter((row) => row.is_active)) {
try {
compiled.push({
id: p.id, name: p.name, flags: p.flags,
pattern: RE2JS.compile(p.pattern, flagsStringToBitmask(p.flags)),
});
} catch (err) {
console.error('pii-pattern-cache compile failed', {
tenantId, patternId: p.id, err: (err as Error).message,
});
}
}
return { compiled, version, fetchedAt: Date.now() };
}
The try/catch around RE2JS.compile is defensive; the database has its own server-side compile-on-write check that should make compile failures impossible at this point. We do it anyway, because "should be impossible" and "is impossible" are not the same thing, and a compile failure here that aborts the whole loop would silently disable every other pattern for the tenant.
The runner: per-pattern try/catch + ctx.waitUntil
Compilation is half the safety story. The other half is execution. Even with a well-formed pattern that compiled successfully, you can imagine pathological inputs that surface obscure runner bugs. We isolate per-pattern execution with a try/catch and record any failure out-of-band so a circuit breaker can trip on repeat offenders.
// workers/pipeline/src/pii-pattern-runner.ts
export function runCustomerPatterns(
input: string,
patterns: CompiledPattern[],
ctx: ExecutionContext,
env: Env,
tenantId: string
): CustomerPatternMatch[] {
const matches: CustomerPatternMatch[] = [];
for (const p of patterns) {
try {
const matcher = p.pattern.matcher(input);
while (matcher.find()) {
const value = matcher.group();
if (value === null) continue;
matches.push({
type: 'custom',
patternId: p.id,
value,
index: matcher.start(),
});
}
} catch (err) {
ctx.waitUntil(recordPatternFailure(p.id, env, err as Error, tenantId));
}
}
return matches;
}
The use of ctx.waitUntil here is non-negotiable, and we learned this the hard way on a different feature. In a Cloudflare Worker route handler, a fire-and-forget void fetch(...).catch(...) is cancelled the instant the response is returned to the client. The runtime has no obligation to keep an in-flight request alive past the request that triggered it. ctx.waitUntil is the documented way to say "this Promise is part of the request even though I'm not awaiting it." Without it, the failure recording HTTP call gets dropped about 95% of the time on busy isolates, and the circuit breaker never trips because it never sees the failures.
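recordPatternFailure itself isn't shown above; it is a thin wrapper around the RPC described next. A sketch under the same assumptions as the refresh code (the service-role headers stay elided, as in the earlier excerpt):
// Sketch of the failure recorder handed to ctx.waitUntil above; assumed
// shape, not the verbatim implementation.
async function recordPatternFailure(
  patternId: string,
  env: Env,
  err: Error,
  tenantId: string
): Promise<void> {
  console.error('pii-pattern-runner match failed', { tenantId, patternId, err: err.message });
  // Feed the sliding-window circuit breaker implemented in the database.
  await fetch(`${env.SUPABASE_URL}/rest/v1/rpc/record_pii_pattern_failure`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' /* service-role key, as above */ },
    body: JSON.stringify({ p_pattern_id: patternId }),
  });
}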
The circuit breaker itself lives in the database. The record_pii_pattern_failure RPC implements a sliding-window counter:
-- supabase/migrations/0044_pii_patterns.sql §12 (excerpted)
CREATE OR REPLACE FUNCTION public.record_pii_pattern_failure(p_pattern_id uuid)
RETURNS void
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = public, extensions, pg_temp
AS $$
DECLARE
v_window_started timestamptz;
v_count integer;
v_new_count integer;
BEGIN
-- ... service-role gate ...
SELECT failure_window_started_at, failure_count
INTO v_window_started, v_count
FROM public.tenant_pii_patterns
WHERE id = p_pattern_id AND is_active = true
FOR UPDATE;
IF v_window_started IS NULL OR (now() - v_window_started) > interval '5 minutes' THEN
v_new_count := 1;
UPDATE public.tenant_pii_patterns
SET failure_count = 1, failure_window_started_at = now()
WHERE id = p_pattern_id;
ELSE
v_new_count := v_count + 1;
UPDATE public.tenant_pii_patterns
SET failure_count = v_new_count
WHERE id = p_pattern_id;
END IF;
IF v_new_count >= 3 THEN
UPDATE public.tenant_pii_patterns
SET is_active = false, disabled_reason = 'circuit_breaker'
WHERE id = p_pattern_id;
-- ... bump tenants.pii_patterns_version and write audit row ...
END IF;
END;
$$;
Three failures in a five-minute window auto-disables the pattern. The pattern row stays in the table with disabled_reason = 'circuit_breaker' and is_active = false, which means the next cache refresh will silently drop it — no more events get matched against it, no more failures get recorded, the rest of the tenant's patterns continue to run. The customer can re-enable it from the dashboard, which resets the failure counter and starts a fresh five-minute window. This avoids the failure mode where one bad pattern keeps tripping the breaker repeatedly during edits.
Test before save: a Web Worker as the validation surface
The customer-facing surface is a modal in /settings/pii-patterns. It enforces a strict state machine: you can't save a pattern unless you have proven that it (a) compiles in re2js and (b) matches at least one occurrence in a sample event payload you paste yourself. The pristine state shows "Run a test before saving — Save unlocks once the pattern matches at least one occurrence in the sample." The save button stays disabled until the test produces a matched result.
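For orientation, the modal's test state is a small discriminated union and the Save gate is derived from it. A sketch: the 'testing', 'compile_error', 'run_error', and 'matches' kinds appear in the excerpts below, while the 'pristine' variant and the canSave derivation are our illustrative naming.
// Sketch of the modal's test state machine. Only the kind names used in the
// excerpts below are taken from the real code; the rest is illustrative.
type TestState =
  | { kind: 'pristine' }                        // no test run yet; Save stays disabled
  | { kind: 'testing' }                         // a test request is in flight
  | { kind: 'compile_error'; message: string }  // re2js rejected the pattern
  | { kind: 'run_error'; message: string }      // the matcher threw mid-run
  | { kind: 'matches'; matches: Array<{ value: string; index: number }> };
// Save unlocks only once the sample produced at least one match.
const canSave = (state: TestState): boolean =>
  state.kind === 'matches' && state.matches.length > 0;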
The test runs in a browser Web Worker. There are two reasons for this, and only one of them is the reason you'd guess.
The reason you'd guess: the same engine has to run on both sides. If the modal accepts a pattern but the pipeline rejects it, we've wasted the customer's time and broken their trust. By using re2js in the browser exactly as we use it in the Worker, the modal's "compiles" verdict and the pipeline's "compiles" verdict are byte-identical:
// apps/web/lib/pii-patterns/regex-worker.ts
import { RE2JS } from 're2js';
self.onmessage = (e: MessageEvent<TestRequest>) => {
const { token, pattern, flags, sample } = e.data;
let compiled: ReturnType<typeof RE2JS.compile>;
try {
compiled = RE2JS.compile(pattern, flagsToBitmask(flags));
} catch (err) {
self.postMessage({
token,
payload: { kind: 'compile_error', message: (err as Error).message },
});
return;
}
try {
const matches: Array<{ value: string; index: number }> = [];
const matcher = compiled.matcher(sample);
while (matcher.find()) {
const value = matcher.group();
if (value === null) continue;
matches.push({ value, index: matcher.start() });
if (matches.length >= 100) break; // bound the response payload
}
self.postMessage({ token, payload: { kind: 'matches', matches } });
} catch (err) {
self.postMessage({
token,
payload: { kind: 'run_error', message: (err as Error).message },
});
}
};
The reason you wouldn't guess: even though re2js can't actually hang, we want the modal to be able to time out. The Worker boundary lets us put a 2000 ms watchdog around any single test, and if the watchdog fires we terminate the Worker and respawn it on the next click. re2js is provably linear-time so this should never trip on a real pattern, but "should never" is not the same as "cannot," and a corrupted bundle or a missing module file would otherwise leave the modal indefinitely in a testing state with no escape.
The watchdog itself is fiddly to get right. The first version of the modal terminated and respawned the Worker on every Test click, which paid the re2js bundle's cold-start cost over and over. The smoke test caught it: the second click of any session false-tripped a 100 ms watchdog because the freshly-respawned Worker module was still parsing. We persisted the Worker across clicks, used a monotonic request token to disambiguate stale responses from rapid double-clicks, and raised the watchdog to 2 seconds — enough headroom for the cold-start of a freshly-respawned module under prod opennextjs, while still catching a genuinely wedged Worker.
// apps/web/components/settings/PiiPatternModal.tsx (excerpted)
const myToken = ++requestTokenRef.current;
setTestState({ kind: 'testing' });
if (workerRef.current === null) {
workerRef.current = new Worker(
new URL('../../lib/pii-patterns/regex-worker.ts', import.meta.url),
{ type: 'module' },
);
workerRef.current.onmessage = (e: MessageEvent) => {
const envelope = e.data;
if (!envelope || envelope.token !== requestTokenRef.current) return; // stale
// ... handle payload ...
};
}
const timeout = setTimeout(() => {
if (myToken !== requestTokenRef.current) return;
workerRef.current?.terminate();
workerRef.current = null;
setTestState({
kind: 'compile_error',
message: `Pattern timed out (>${WORKER_TIMEOUT_MS}ms) — re2 should not hit this; investigate`,
});
}, WORKER_TIMEOUT_MS);
workerRef.current.postMessage({ token: myToken, pattern: regexSrc, flags, sample });
Edit mode adds one more invariant: any mutation of the pattern or flags resets the test state to pristine. Without it, a customer could pass the test, change the regex, and then save behind the gate. The reset is centralised in a useEffect on (regex, flags) so no future code path can mutate the inputs without also wiping the test.
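A sketch of that reset hook, reusing the TestState union sketched earlier; the hook name and exact wiring are illustrative, not the verbatim component:
import { useEffect, useRef, useState } from 'react';
// Sketch only: any edit to the pattern source or flags wipes the previous
// verdict and re-locks Save, and bumping the token orphans any in-flight
// Worker response so it can't resurrect a stale 'matches' state.
function usePatternTestGate(regexSrc: string, flags: string) {
  const requestTokenRef = useRef(0);
  const [testState, setTestState] = useState<TestState>({ kind: 'pristine' });
  useEffect(() => {
    requestTokenRef.current++;
    setTestState({ kind: 'pristine' });
  }, [regexSrc, flags]);
  return { testState, setTestState, requestTokenRef };
}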
What we caught in production (and why being public about it matters)
Three bugs are worth talking about. One was caught before deploy and two were caught in the first hours after deploy by smoke tests we ran ourselves. None reached a customer. The reason to write them down is that "we shipped, smoked, found bugs, fixed them in hours" is more credible than "we shipped a perfect system." Engineers respect detail about how the work actually went.
Bug 1, pre-deploy: the IPv6-compressed regex matched bare colons. This was in our built-in detector, not the customer-pattern runner, but it would have been catastrophic and it's worth describing because it exposed the testing gap that made it possible. The original ipv6_compressed regex used {0,7} quantifiers on both sides of the literal ::, which meant a single bare : between non-hex characters would satisfy the pattern. The reproduction was a single JSON-shaped string: detectPii('{"alert":"DB failure"}') produced an ipv6_compressed match at index 8 (the bare : between the key and its value). Every JSON-shaped alert payload — which is most of them — would have been mangled. The fix was to require at least one hex group on each side of the :: via alternation, plus negative lookbehind/lookahead anchors that refuse to match unless surrounded by non-hex non-colon characters. We added five regression cases (bare colon between letters, bare colon between numbers, bare colon at start, bare colon at end, valid :: still matches). The lesson: a detector regex test suite must include adversarial cases where the input looks like the target class but isn't. Our original test suite had only positive cases, which is how a regex this broken passed CI.
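A sketch of those five cases; the detectPii import path and the shape of its results (an array of matches carrying a type field) are assumptions for illustration, not the real detector API:
import assert from 'node:assert/strict';
import { detectPii } from './detect-pii'; // hypothetical path to the shared detector module
// Assumed result shape: detectPii returns an array of { type, value, index }.
const flagsIpv6 = (input: string): boolean =>
  detectPii(input).some((m) => m.type === 'ipv6_compressed');
assert.equal(flagsIpv6('alert:value'), false); // bare colon between letters
assert.equal(flagsIpv6('12:34'), false);       // bare colon between numbers
assert.equal(flagsIpv6(':value'), false);      // bare colon at start
assert.equal(flagsIpv6('value:'), false);      // bare colon at end
assert.equal(flagsIpv6('fe80::1'), true);      // a valid compressed address still matches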
Bug 2, post-deploy, caught by smoke: list_pii_patterns had no service-role path. The first version of the RPC granted EXECUTE only to the authenticated role, and the function body always derived the tenant from auth.uid() → tenant_members. The pipeline Worker calls this RPC with the service-role key — a service-role JWT has no sub claim, so auth.uid() is NULL, and the function raised unauthenticated (errcode 42501) on every call. Net effect: the first customer to save a pattern would have lost every event afterward, because the RPC threw instead of returning an empty list, so the pipeline could neither retrieve the patterns to apply nor treat the tenant as having none and skip the customer-pattern pass. We caught this by manually saving a test pattern and watching the Worker logs for the next ingest event. The fix was a separate migration adding p_tenant_id uuid DEFAULT NULL to the RPC signature and branching on the JWT role: service-role callers must pass p_tenant_id explicitly (which defends against accidental cross-tenant reads), while authenticated users ignore the parameter and derive the tenant from membership.
Bug 3, post-deploy, caught by integration tests we wrote because of bug 2: the capability helper read auth.uid() directly. The create/update/delete RPCs all support a p_actor_user_id DEFAULT NULL parameter that lets a service-role caller attribute audit rows to a specific human (used for internal jobs, nothing today but documented as supported). The RPC bodies anti-spoof this with COALESCE(auth.uid(), p_actor_user_id) after gating on the JWT role. But the capability helper, called inside each RPC body to check whether the resolved user could manage patterns, read auth.uid() directly instead of taking the resolved user as a parameter. So a service-role caller passing p_actor_user_id would correctly resolve v_user_id for the audit row, but the capability check then ran with auth.uid() = NULL, returned false, and raised forbidden: manage_pii_patterns capability required. The documented affordance was unreachable. We caught it because, after bug 2, we wrote integration tests against a real Supabase instance to exercise the service-role path end-to-end. The test that should have passed (a service-role caller with valid p_actor_user_id creating a pattern) failed instead. The fix mirrored a precedent we'd already set elsewhere in the codebase: the helper takes an explicit p_user_id argument, the anti-spoof responsibility lives in one place per RPC body, and the helper itself becomes dumb and composable.
The pattern across all three is the same. Each bug was caught because we treated "deployed" as the start of verification, not the end. The smoke tests for bugs 2 and 3 were five-minute manual exercises; the integration tests that found bug 3 were written in the hour after we fixed bug 2. Total elapsed time from first bug discovery to all three fixed and re-deployed: under four hours.
What we'd do differently next time
The honest critique of our process is that the integration tests for the service-role path of the database RPCs should have existed before the deploy, not after. The unit tests on the SQL functions were comprehensive — we tested the capability helper, the validation guards, the audit-row writes, the circuit-breaker math. What we didn't test was the cross-cutting concern: "does the pipeline Worker, calling with the service-role JWT, actually receive a non-empty pattern list when patterns exist." That's an integration test in the strict sense — multiple components, real auth context, real Postgres — and it's the test that would have caught bug 2 before the customer-facing deploy.
The reason it didn't exist is the usual reason: integration tests are slow, they require a real Supabase instance, and the project has comparable RPC patterns that worked fine without them. The wrong lesson is "always write integration tests for everything." The right lesson is narrower. Any RPC that needs a service-role path and an authenticated-user path — i.e., that is going to be called from the pipeline Worker as well as from a route handler — must have an integration test for both call shapes before the migration ships. The cost of writing the test is small relative to the cost of a missed customer event, and the scaffolding of the test is generic enough that it can be reused across every future RPC that fits the shape.
We're adding that as a guardrail. Not as a process gate ("you must write the test"), but as a typed test helper that makes writing the test cheaper than not writing it. The future failure mode this protects against — a half-shipped RPC that works for one role and silently 401s for the other — is the same shape as bug 2, and it will happen again on a different RPC if we don't lower the friction of the prevention.
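A sketch of the shape that helper could take; the function name, environment variable names, and the Supabase client wiring are illustrative assumptions, not the real harness:
import { createClient } from '@supabase/supabase-js';
// Sketch of a typed both-roles helper: call the same RPC once as the service
// role (the pipeline Worker's call shape) and once as an authenticated user
// (the route-handler call shape), and fail loudly if either path errors.
type BothRolesCheck = {
  rpc: string;
  serviceArgs: Record<string, unknown>;
  userArgs: Record<string, unknown>;
};
export async function expectBothRoleShapes({ rpc, serviceArgs, userArgs }: BothRolesCheck): Promise<void> {
  const serviceClient = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);
  const serviceRes = await serviceClient.rpc(rpc, serviceArgs);
  if (serviceRes.error) throw new Error(`${rpc} failed for the service role: ${serviceRes.error.message}`);
  const userClient = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!, {
    global: { headers: { Authorization: `Bearer ${process.env.TEST_USER_JWT!}` } },
  });
  const userRes = await userClient.rpc(rpc, userArgs);
  if (userRes.error) throw new Error(`${rpc} failed for the authenticated user: ${userRes.error.message}`);
}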
Where this lives in production
If this kind of detail is interesting and you're tired of writing your own PII redaction layer for every observability tool you touch, the engine described here is the customer-pattern surface in Culprit. The 25-category built-in detector and the customer pattern runner both ship in the same pipeline; you can see the detector in action on /try (the same shared module, running in your browser on text you paste) and read the security architecture summary on /security. The full customer-pattern UI is in /settings/pii-patterns once you have an account.
If you'd rather skip straight to the proof, /benchmarks/redos runs the same RegExp vs re2js comparison live in your browser against ten documented catastrophic-backtrack patterns — same Worker harness, same 5 s timeout, real measurements from your hardware.