§ Guides
By AI Blog Editor
Apr 20, 2026 · 1 min read
Prompt caching: the cheap win most people skip
The largest single cost lever the Claude API gives you, and it's still the one most teams haven't turned on. How the cache works, how to structure for it, and what it actually saves.
Prompt caching is the largest single cost lever the Claude API gives you, and it's still the one most teams haven't turned on. The rule of thumb — "cache reads cost roughly a tenth of base input" — is doing a lot of heavy lifting in production economics, and it's earned in a single hour of wiring.
The idea is simple. Your prompts have a stable prefix (system message, tool definitions, reference docs) and an unstable tail (the user's turn). If you mark the prefix as cacheable, Claude stores it for five minutes, refreshed each time it's hit. During that window, subsequent turns that reuse the prefix pay the cache-read rate instead of the full input rate.
Structure your prompt stable-first
Caching matches on an exact, byte-for-byte prefix. That means the order of your messages matters: stable content must come before dynamic content, in the same order every turn. If you swap your system prompt and your tool definitions between turns, nothing matches, and you pay cache writes forever.
In practice: system prompt, then tool definitions, then long-lived reference content, then recent conversation, then the current user turn. Put a cache_control breakpoint at the end of the stable block.
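Concretely, the breakpoint is one extra field on the last stable block. A minimal sketch of the request payload — the model name, prompt text, and tool list are placeholders; `cache_control` is the field that matters:

```python
SYSTEM_PROMPT = "You are a support bot for Acme."          # stable every turn
REFERENCE_DOCS = "<several thousand tokens of policy text>"  # stable
TOOLS = [{"name": "lookup_order", "description": "Fetch an order by id.",
          "input_schema": {"type": "object", "properties": {}}}]

def build_request(user_turn: str) -> dict:
    # Stable blocks first; the cache_control marker on the last stable
    # block asks the API to cache everything up to and including it.
    return {
        "model": "claude-sonnet-4-5",   # illustrative model name
        "max_tokens": 1024,
        "tools": TOOLS,
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},
            {
                "type": "text",
                "text": REFERENCE_DOCS,
                "cache_control": {"type": "ephemeral"},  # the breakpoint
            },
        ],
        # Dynamic content stays after the breakpoint.
        "messages": [{"role": "user", "content": user_turn}],
    }
```

Only the user turn changes between calls, so everything above the breakpoint serializes to the same bytes every time.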
Mind the five-minute TTL
The cache has a TTL. Miss the window and the next call is a cache-write (roughly 1.25× base input) rather than a cache-read. That matters a lot for cadence: if your workload fires every couple of minutes you'll keep the cache warm; if it fires every seven minutes you'll pay a cache-miss every time.
Two defenses. One: for cron-like workloads, either bunch calls close together or accept the cache miss; don't flip between "a thousand calls in five minutes" and "one every six minutes" without noticing. Two: for latency-critical paths, send a warm-up request before the real one if the cache might be cold.
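The ping-vs-miss tradeoff is a one-liner of arithmetic. A sketch under the pricing assumptions above (the function and constants are hypothetical; a miss pays the 1.25× write premium on the prefix, each keep-warm ping pays a 0.1× cache-read, and the tiny output cost of a ping is ignored):

```python
import math

PREFIX_TOKENS = 8_000     # assumed size of the cached stable prefix
IN_RATE = 3 / 1_000_000   # illustrative base input rate, $/token
TTL_MIN = 5               # cache lifetime in minutes

def keep_warm_worth_it(gap_minutes: float) -> bool:
    """Is pinging the cache alive cheaper than eating one cache miss?

    A miss pays the write premium on the whole prefix:
    prefix * (1.25 - 0.10) * base rate. Keeping warm pays one
    cache-read of the prefix per TTL window of silence.
    """
    if gap_minutes <= TTL_MIN:
        return False  # traffic alone keeps the cache warm
    pings_needed = math.ceil(gap_minutes / TTL_MIN) - 1
    ping_cost = pings_needed * PREFIX_TOKENS * 0.10 * IN_RATE
    miss_cost = PREFIX_TOKENS * (1.25 - 0.10) * IN_RATE
    return ping_cost < miss_cost
```

At these numbers the crossover sits around an hour of silence: for a seven-minute gap one ping beats the miss, but past roughly sixty minutes it's cheaper to let the cache go cold.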
What to cache, what not to
Cache: system prompt, tool schemas, long reference documents, policy text, few-shot examples that don't change per-user. Anything stable that's over ~1,000 tokens benefits.
Don't cache: the user's current message, anything that changes per-request, and tiny prefixes where the overhead isn't worth it. A 200-token system prompt is below the minimum cacheable length anyway (roughly 1,024 tokens on most models), so the breakpoint would be silently ignored.
What it actually costs
Run the numbers for a typical support-bot workload: 8k static tokens, 400 user tokens, 600 output tokens, two thousand calls a month. An 80% hit rate takes the bill from the base rate down into nice-to-have territory. At a million calls a month, 80% hit vs. no cache is a five-figure monthly swing.
For that workload, priced at $3/M input, $15/M output, cache-write 1.25×, cache-read 0.1× (illustrative rates):

Monthly spend, no caching: $68
With cache at 80% hit: $36
You keep: $32, a 47% cut
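The arithmetic behind those figures fits in a few lines. A sketch using the stated rates (function names and defaults are mine; the per-call model is: cached tokens pay a hit-weighted blend of 0.1× and 1.25×, everything else pays base rates):

```python
def cost_no_cache(calls, static_tok, user_tok, out_tok,
                  in_rate=3e-6, out_rate=15e-6):
    """Monthly spend with no caching: every token at base rate."""
    per_call = (static_tok + user_tok) * in_rate + out_tok * out_rate
    return calls * per_call

def cost_with_cache(calls, static_tok, user_tok, out_tok, hit_rate,
                    in_rate=3e-6, out_rate=15e-6):
    """Monthly spend with the static prefix cached.

    Hits pay 0.10x base input on the prefix; misses pay the
    1.25x write premium. The user turn and output are unaffected.
    """
    blended = hit_rate * 0.10 + (1 - hit_rate) * 1.25
    per_call = (static_tok * blended + user_tok) * in_rate + out_tok * out_rate
    return calls * per_call
```

Plugging in the workload above (2,000 calls, 8,000 static, 400 user, 600 output, 80% hit) reproduces the $68-vs-$36 split; scale `calls` to a million and the gap is about $16k a month.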
Three failure modes to watch
Accidental prefix drift. A new reviewer adds a timestamp to the system prompt; the cache never hits again. Guard your stable prefix like it's a hot path, because it is.
Breaking cache on tool addition. Adding a new tool at the top of the tool list shifts every subsequent byte. Append new tools to the end; only break cache on purpose.
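A toy demonstration of why append is safe and prepend is not — `json.dumps` here stands in for whatever byte serialization the server actually applies, which is an assumption, but the prefix logic is the same:

```python
import json

old_tools = [{"name": "lookup_order"}, {"name": "refund"}]
new_tool = {"name": "escalate"}

appended = old_tools + [new_tool]   # new tool at the end
prepended = [new_tool] + old_tools  # new tool at the top

# The old serialization minus its closing bracket is the shared prefix.
old_prefix = json.dumps(old_tools)[:-1]

assert json.dumps(appended).startswith(old_prefix)      # prefix intact: cache can still hit
assert not json.dumps(prepended).startswith(old_prefix)  # every byte shifted: guaranteed miss
```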
Cross-tenant bleed in billing dashboards. When you report cost per tenant, remember that the same cached prefix might be serving many tenants. Allocate the cache-write cost fairly and the cache-read cost per tenant; don't charge tenant A for a cache warmed by tenant B.
The one-line heuristic
Stable content goes in front, dynamic content goes at the back, and a cache breakpoint goes between them. Every other sentence in this article is a consequence.
* * *
Thanks for reading. If a line here was useful — or plainly wrong — the comments are below and the newsletter has your back.