2026-05-08 — llm / anthropic / prompt-caching / cost-engineering

Anthropic prompt caching cut our RCA cost by 90%

What actually goes in the cached segment, the two-segment trick that lets per-tenant context cache too, and what caching changes on Haiku 4.5.

By Culprit

LLM costs in production scale faster than the demo bill suggests they will.

The shape of the problem: you ship a feature that calls Claude on every meaningful event. The first month the bill is rounding error and nobody looks at it. The second month a customer's traffic ramps and the line item is suddenly the thing finance asks about. The third month someone sends a polite message about whether this is "a real cost trend or a one-time spike" — and the architecture decision you made eight weeks ago, when the bill was rounding error, is the one now under review.

You can reduce this. Not by being clever about how you call the model, but by being clever about what's constant across your calls. Anthropic's prompt caching, in our case, takes the cacheable portion of the per-RCA input from full rate to one-tenth of full rate at a 90%+ cache-hit rate. That's not a hypothetical; it's what we measure in production, and the math is simple enough to walk through here so you can run the numbers on your own pipeline.

The pricing structure

Anthropic publishes four price points per model. For Claude Haiku 4.5, the model we run as the default for incident root-cause analysis, those points are (verified from the Anthropic API docs):

| Token category | Haiku 4.5 |
|---|---|
| Base input | $1.00 per million tokens |
| Cache write (5-minute TTL) | $1.25 per million tokens |
| Cache read | $0.10 per million tokens |
| Output | $5.00 per million tokens |

Two things to read from that table:

Cache read is 10x cheaper than base input. Same tokens in the request body, ten percent of the cost — if you can get them into the cache.
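Cache write is 25% more expensive than base input. The first time you send a cached segment, you pay a small premium so the next request can pay the discount. The math only pays off if each cache write is followed by reads: n calls cost 1.25 + 0.10 × (n − 1) units with caching versus 1.00 × n without, so you need more than ~1.3 calls per cache write on average, each arriving before the 5-minute TTL lapses. In practice, a single reuse is enough to come out ahead.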

That second point is the one most teams miss. If your call pattern is "one-shot, cold cache every time," prompt caching makes you slightly worse off. The win comes from repeatable structure across calls.
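The break-even point can be checked with a few lines of arithmetic. The sketch below uses the Haiku 4.5 prices from the table; the function names are illustrative, and it assumes the first call writes the cache and every later call arrives before the TTL lapses.

```typescript
// Haiku 4.5 prices from the table above, in USD per million tokens.
const BASE_INPUT = 1.0;
const CACHE_WRITE = 1.25;
const CACHE_READ = 0.1;

// Cost of sending the same segment `calls` times with no caching.
function uncachedSegmentCost(tokens: number, calls: number): number {
  return (tokens / 1e6) * BASE_INPUT * calls;
}

// Cost with caching, assuming one cache write followed by (calls - 1) reads.
function cachedSegmentCost(tokens: number, calls: number): number {
  if (calls <= 0) return 0;
  return (tokens / 1e6) * (CACHE_WRITE + CACHE_READ * (calls - 1));
}

// Solve CACHE_WRITE + CACHE_READ * (n - 1) < BASE_INPUT * n for n.
const breakEvenCalls = (CACHE_WRITE - CACHE_READ) / (BASE_INPUT - CACHE_READ);
```

With these prices `breakEvenCalls` comes out just under 1.28: a segment sent once costs more with caching, and a segment sent twice within the TTL already costs less.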

What's actually cacheable in an RCA call

A typical RCA call has five sources of tokens:

1. System prompt. Defines the role ("you are an SRE analyzing an incident"), the JSON schema for the response, and any guardrails. Identical across every call across every tenant. Maybe 800-1500 tokens depending on how rigorous your schema is.
2. Retrieval context ("here are 3 prior incidents from this same service that resolved similarly"). Static for a few minutes within a Batch run on one tenant + service. Maybe 400-800 tokens depending on how aggressive the retrieval is.
3. Per-incident events ("event 1 at 14:32:01: ConnectionPoolExhausted...; event 2 at 14:32:04: ..."). Unique to the incident under analysis. Cannot be cached across incidents. Typically 1500-3000 tokens.
4. Per-incident metadata (incident ID, service ID, severity). Tiny but unique.
5. Output tokens. The model's response. Cost is fixed at the output rate; caching doesn't apply.

Sources 1 and 2 are cacheable. Sources 3 and 4 are not. Source 5 is irrelevant.

In our distribution, sources 1 + 2 are roughly 70-80% of the input tokens for a typical RCA call. Cache them at $0.10 per million tokens; pay the full rate on the remaining 20-30%; total input cost drops by about 60-70% from the naive baseline. The "90%" headline number rounds up because we measure cache hits, not total cost, and within the cached portion the savings really are 90%.
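Both numbers in that paragraph can be reproduced from token counts. The sketch below is one way to do it, not the code we run. The three usage field names match what the Messages API reports per response. The `UsageLike` type is a simplified local stand-in, and the function names are illustrative.

```typescript
// Simplified stand-in for the per-response usage object.
interface UsageLike {
  input_tokens: number;                 // uncached input tokens
  cache_creation_input_tokens: number;  // tokens written to the cache
  cache_read_input_tokens: number;      // tokens served from the cache
}

// Share of cacheable tokens that were served from the cache.
function cacheHitRate(usages: UsageLike[]): number {
  let read = 0;
  let written = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens;
    written += u.cache_creation_input_tokens;
  }
  const total = read + written;
  return total === 0 ? 0 : read / total;
}

// Fractional drop in input cost versus sending everything at base rate.
// Prices are relative to base input: read = 0.10, write = 1.25.
function inputCostReduction(cachedFraction: number, hitRate = 1): number {
  const cachedPart = cachedFraction * (hitRate * 0.1 + (1 - hitRate) * 1.25);
  const uncachedPart = 1 - cachedFraction;
  return 1 - (cachedPart + uncachedPart);
}
```

At a 100% hit rate, `inputCostReduction(0.7)` is 0.63 and `inputCostReduction(0.8)` is 0.72, which is where the "about 60-70%" range comes from. Lowering `hitRate` shows how quickly cache misses eat into it.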

The two-segment trick

Anthropic's API takes a cache_control marker per segment in your system array. Each marker is a breakpoint: the cache stores the prefix of the prompt up to and including that segment. With two marked segments, the API keeps two cache entries, one for the system prompt alone and one for the system prompt plus the retrieval context:

// Conceptual shape — see rca-prompt.ts for the exact code we run.
const system = [
  {
    type: 'text',
    text: SYSTEM_PROMPT,                    // ~1200 tokens, identical everywhere
    cache_control: { type: 'ephemeral' },
  },
  {
    type: 'text',
    text: priorIncidentsContext,            // ~600 tokens, per-tenant per-service
    cache_control: { type: 'ephemeral' },
  },
];

Why two segments instead of one? Because the cache lifetime for those two pieces is different.

The system prompt almost never changes — every RCA call across every tenant hits the cache. Cache read essentially every time after the first call.

The retrieval context (prior similar incidents for this service) changes whenever a new incident on that service resolves and shifts the top-K. Within a single Batch run on one tenant + service, repeats hit the cache. Across tenants, never.

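If you stuff both into a single segment, the system prompt is only ever cached as part of a tenant-specific blob. Every time the retrieval context for tenant A changes, tenant A re-pays the cache-write rate on the whole segment, system prompt included, and tenant B's warm cache does nothing for tenant A because the combined segments hash differently. Two segments → independent cache lifetimes → the system-prompt prefix stays warm for everyone, and tenant A's churn only invalidates tenant A's second segment.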

The order matters. Anthropic caches up to each marker, so the more-static segment must come first. If you put per-tenant retrieval first and the static system prompt second, the static prompt's cache key now includes the per-tenant content above it; you've just made the most cacheable segment uncacheable across tenants.
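One way to keep the ordering from drifting is to build the request in a single helper that fixes the segment order and pushes everything per-incident into the user turn. The following is a minimal sketch. `buildRcaRequest`, its parameters, and the `max_tokens` value are illustrative assumptions, and the types are declared locally so the snippet runs without the SDK.

```typescript
// Local types mirroring the request shape used in this post.
type CachedTextBlock = {
  type: 'text';
  text: string;
  cache_control: { type: 'ephemeral' };
};

interface RcaRequestBody {
  model: string;
  max_tokens: number;
  system: CachedTextBlock[];
  messages: { role: 'user'; content: string }[];
}

// Hypothetical helper: static segment first, per-tenant segment second,
// per-incident content only in the user message.
function buildRcaRequest(
  model: string,
  systemPrompt: string,
  priorIncidentsContext: string,
  incidentEvents: string,
): RcaRequestBody {
  return {
    model,
    max_tokens: 1024, // illustrative value
    system: [
      { type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } },
      { type: 'text', text: priorIncidentsContext, cache_control: { type: 'ephemeral' } },
    ],
    messages: [{ role: 'user', content: incidentEvents }],
  };
}
```

Because callers never assemble the `system` array themselves, nobody can accidentally put tenant-specific text ahead of the shared prompt.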

What kills the cache

In rough order of frequency:

The 5-minute ephemeral TTL. A cached segment expires 5 minutes after it was last used; each cache hit refreshes the timer. If your call pattern is bursty (RCA calls cluster around incidents, then go quiet for an hour), a long quiet period lets every cached segment expire and you pay the cache-write rate (slightly above base rate) on the next batch. Spread your calls if you can; if you can't, accept that the first few calls after a quiet period pay the write premium.

Whitespace drift. If you concatenate the system prompt with \n\n in one place and \n in another, you have two distinct cache keys. The cache hashes the literal token sequence, not the semantic meaning. Pick one separator and lint for it.

Trailing dynamic content. A common bug: someone adds a timestamp to the "system prompt" — Today's date is 2026-05-08T14:32:01Z — for "context". The timestamp changes every call. Now nothing cached after the timestamp survives. Keep dynamic content out of cached segments entirely; pass it as a user-message turn instead.

Schema version churn. If you're iterating on your JSON output schema (a normal early-product activity), every schema edit invalidates every cached system prompt. The cost of "tuning the schema" is partly paid in cache misses. Plan for one or two big schema-stabilization sweeps rather than continuous tweaks.
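Whitespace drift and stray timestamps are both cheap to catch before they reach production. The sketch below is one possible guard, not a definitive one: `fingerprint` is a plain FNV-1a hash to log next to each deploy so an unexpected prefix change is visible, and `lintCachedSegment` is a hypothetical check whose rules are assumptions you would tune to your own separator convention.

```typescript
// FNV-1a 32-bit hash of the exact text; any byte-level change alters it.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

const ISO_TIMESTAMP = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/;

// Returns a list of problems that would make a cached segment unstable.
function lintCachedSegment(text: string): string[] {
  const problems: string[] = [];
  if (ISO_TIMESTAMP.test(text)) problems.push('contains a timestamp');
  if (/[ \t]+\n/.test(text)) problems.push('trailing whitespace before a newline');
  if (/\n{3,}/.test(text)) problems.push('more than one blank line in a row');
  return problems;
}
```

Running the lint in CI over the system prompt template, and alerting when the fingerprint changes outside a planned schema sweep, covers three of the four failure modes above. The TTL one still needs traffic shaping.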

What caching changes

Take a representative RCA call: ~4000 input tokens, ~500 output tokens, with about 75% of the input stable enough to cache. The cost splits into three parts, and caching only touches one of them:

Cached input bills at the cache-read rate — a tenth of the base input rate. With most of the input cached, most of the input cost simply disappears.
Uncached input — the incident-specific remainder — bills at the base input rate.
Output is unaffected by caching, and once most of the input is cached it dominates the per-call cost.

The cache write is amortized across every read in its lifetime, so once a segment is being reused its per-call contribution rounds to nothing.

Net effect: on a call of this shape, caching removes about two-thirds of the input cost, and reserving the work for the Batch API (where the 2–5 minute latency is fine) takes another 50% off both input and output. The two compound: at the table prices, the call costs about $0.0065 uncached on the real-time API and about $0.0019 cached on the Batch API, a reduction of roughly 70%.
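The per-call arithmetic for the representative call is short enough to write down. This sketch uses the Haiku 4.5 table prices and makes two simplifying assumptions: the cache write is fully amortized (so it is ignored), and the 50% batch discount applies uniformly to cached input, uncached input, and output. `perCallCost` is an illustrative name.

```typescript
// USD per million tokens, from the pricing table.
const PRICES = { baseInput: 1.0, cacheRead: 0.1, output: 5.0 };

// Steady-state cost of one call in USD.
function perCallCost(
  inputTokens: number,
  outputTokens: number,
  cachedFraction: number,
  useBatch: boolean,
): number {
  const cachedTokens = inputTokens * cachedFraction;
  const uncachedTokens = inputTokens - cachedTokens;
  const usd =
    (cachedTokens * PRICES.cacheRead +
      uncachedTokens * PRICES.baseInput +
      outputTokens * PRICES.output) / 1e6;
  return useBatch ? usd * 0.5 : usd; // assumed uniform batch discount
}

const uncachedRealtime = perCallCost(4000, 500, 0, false);
const cachedRealtime = perCallCost(4000, 500, 0.75, false);
const cachedBatch = perCallCost(4000, 500, 0.75, true);
```

For the 4000-in, 500-out call this gives about $0.0065 uncached, $0.0038 with caching alone, and $0.0019 with caching plus Batch. Output is the $0.0025 floor that caching cannot touch, which is why the total falls less than the input line does.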

This is the discipline that lets our pricing stay flat and predictable as a customer's volume grows. We cover the model in our pricing overview.

Where this generalizes

If you're calling Claude on a per-event or per-incident schedule, the structure above applies to whatever shape your calls take. The questions to answer:

What in your prompt is identical across every call? That's segment 1. If the answer is "nothing," your prompt isn't designed for caching yet — find the constants. There almost always are some.
What is per-tenant or per-context but reused within a short window? That's segment 2. Common cases: retrieval context, customer-specific style guidelines, account metadata.
What's truly per-call? Goes in the user message turn, never in the cached system block.
Is your call rate above the break-even threshold? If you call the same cached prompt fewer than ~1.3 times per cache write on average (that is, if most writes are never read again within the 5-minute window), you'll lose money on caching. For a noisy production system this is rarely the bottleneck, but for a low-volume tool it can be.

The pattern doesn't apply only to Claude. OpenAI's prompt caching follows similar economics with different numbers; Gemini's context caching has a different TTL but the same "what's static, what's dynamic" decomposition. The work of setting up your prompts so the static parts cluster at the front pays off across every model that supports caching, which is increasingly all of them.

A single test

If you're considering whether prompt caching applies to your pipeline, the cheapest first measurement is also the most informative one: count how many leading tokens of your typical request are byte-for-byte identical to the start of the previous request. Not "semantically the same": literally identical, from the first token onward, because the cache matches prefixes. If the answer is more than 50%, you're leaving money on the table; ship cache_control on the static prefix and watch the input-cost line item drop on the next billing day.

If the answer is less than 20%, your prompts are designed for context, not for repetition, and caching probably won't help much without a structural rewrite. Either way, knowing the number is a one-hour exercise that beats arguing about whether caching is worth the complexity.
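That measurement can be a few lines run over logged requests. Since the cache matches prefixes, what counts is the leading run of identical content. The sketch below compares characters as a rough stand-in for tokens, which is an approximation, and `sharedPrefixFraction` is an illustrative name.

```typescript
// Fraction of `current` that is an exact leading match with `previous`.
// Characters are used as a proxy for tokens.
function sharedPrefixFraction(previous: string, current: string): number {
  if (current.length === 0) return 0;
  const limit = Math.min(previous.length, current.length);
  let i = 0;
  while (i < limit && previous[i] === current[i]) i++;
  return i / current.length;
}

// Average over consecutive pairs of logged prompts.
function averageSharedPrefix(prompts: string[]): number {
  if (prompts.length < 2) return 0;
  let sum = 0;
  for (let k = 1; k < prompts.length; k++) {
    sum += sharedPrefixFraction(prompts[k - 1], prompts[k]);
  }
  return sum / (prompts.length - 1);
}
```

Feed it the serialized system and user content in the order they are sent. A result above 0.5 puts you in the first bucket, and below 0.2 in the second.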

The architecture above is what makes Culprit's flat-rate pricing economically defensible — RCA calls cluster around incidents, the system prompt and retrieval context dominate the input tokens, and the cache hit rate sits comfortably above 90%. Same primitives, different vertical: if you're shipping LLM features into production at any scale where the bill is starting to matter, this is the lowest-effort high-yield refactor you have available.