The Loop  ·  Issue 016

The Loop

A field journal of the AI frontier — for engineers who ship.

§ Guides

By AI Blog Editor
Apr 20, 2026 · 1 min read

Prompt caching: the cheap win most people skip

The largest single cost lever the Claude API gives you, and it's still the one most teams haven't turned on. How the cache works, how to structure for it, and what it actually saves.

Prompt caching is the largest single cost lever the Claude API gives you, and it's still the one most teams haven't turned on. The rule of thumb — "cache reads cost roughly a tenth of base input" — is doing a lot of heavy lifting in production economics, and it's earned in a single hour of wiring.

The idea is simple. Your prompts have a stable prefix (system message, tool definitions, reference docs) and an unstable tail (the user's turn). If you mark the prefix as cacheable, Claude stores it for five minutes. During that window, subsequent turns that reuse the prefix pay the cache-read rate instead of the full input rate.
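In the Messages API, the marker is a cache_control field on a content block. A minimal sketch of a cache-friendly request, as a plain dict (the model name is a placeholder; check the current model list before copying):

```python
# Minimal sketch of a cache-friendly Messages API request.
# The stable prefix (system text) carries a cache_control marker;
# only the final user turn changes between calls.
def build_request(system_text: str, user_text: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Everything up to and including this block is cached
                # for the TTL window (five minutes by default).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }

req = build_request("You are a support bot. <long policy text>", "Where is my order?")
```

Between calls, only the last element of `messages` should change; everything above it is what the cache matches on.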

Fig. 1 — a cache-friendly prompt runs stable → unstable, front to back: the system prompt (tone, rules, persona), tool defs (schema, examples), and reference docs (policy, style guide) form the cacheable prefix, stable across turns, marked with cache_control, and paid once; the user message changes every turn and is paid every turn. The next turn rewinds the prefix and rewrites only the tail.

Structure your prompt stable-first

Caching matches on an exact byte-for-byte prefix. That means the order of your prompt matters: stable content must come before dynamic content, in the same order every turn. If you swap your system prompt and your tool definitions between turns, nothing matches, and you pay cache writes forever.

In practice: system prompt, then tool definitions, then long-lived reference content, then recent conversation, then the current user turn. Put a cache_control breakpoint at the end of the stable block.
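That ordering can be enforced in one assembly function rather than re-derived at every call site. A sketch, assuming the system field accepts a list of text blocks (the exact wire ordering of tools vs. system is the API's concern; the point here is that the stable block is built identically every turn and the breakpoint sits on its last element):

```python
# Hypothetical prompt assembly: stable content first, one cache_control
# breakpoint at the end of the stable block, dynamic turns after it.
def assemble(system_text, tool_defs, reference_docs, history, user_turn):
    system_blocks = (
        [{"type": "text", "text": system_text}]
        + [{"type": "text", "text": doc} for doc in reference_docs]
    )
    # Breakpoint on the LAST stable block: the cache covers everything
    # up to and including this marker.
    system_blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "system": system_blocks,
        "tools": tool_defs,  # keep this list's order identical across turns
        "messages": history + [{"role": "user", "content": user_turn}],
    }
```

Every caller goes through `assemble`, so there is exactly one place where prefix order can drift.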

Mind the five-minute TTL

The cache has a TTL. Miss the window and the next call is a cache-write (roughly 1.25× base input) rather than a cache-read. That matters a lot for cadence: if your workload fires every couple of minutes you'll keep the cache warm; if it fires every seven minutes you'll pay a cache-miss every time.

Two defenses. One: for cron-like workloads, either bunch calls close together or accept the cache miss; don't flip between "a thousand calls in five minutes" and "one every six minutes" without noticing. Two: for latency-critical paths, send a warm-up request before the real one if the cache might be cold.
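The second defense is mechanical enough to sketch: track when you last hit the cache and ping before any call that might land cold. Class and method names here are illustrative, not from any SDK:

```python
import time

CACHE_TTL_S = 5 * 60  # default TTL window: five minutes

class CacheWarmer:
    """Track the last call time and decide when a warm-up ping is worth it."""

    def __init__(self, now=time.monotonic):
        self._now = now          # injectable clock, for testing
        self._last_call = None

    def record_call(self):
        self._last_call = self._now()

    def needs_warmup(self):
        # Cold cache: never called, or the TTL window has lapsed.
        if self._last_call is None:
            return True
        return self._now() - self._last_call >= CACHE_TTL_S
```

On a latency-critical path you would check `needs_warmup()` and, if true, fire a cheap request with the same prefix before the real one.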

What to cache, what not to

Cache: system prompt, tool schemas, long reference documents, policy text, few-shot examples that don't change per-user. Anything stable that's over ~1,000 tokens benefits.

Don't cache: the user's current message, anything that changes per-request, tiny prefixes where the overhead isn't worth it. Most models won't cache a prefix below a minimum length (on the order of 1,000 tokens), so caching a 200-token system prompt is fine in theory and a no-op in practice.
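The "worth it" line can be made precise. With writes at 1.25× and reads at 0.1× of base input, caching beats not caching whenever the blended multiplier drops below 1.0, which happens at a hit rate above roughly 22%. A sketch, with the minimum-size threshold as an assumed constant:

```python
MIN_CACHEABLE_TOKENS = 1024  # assumption: typical per-model minimum

def worth_caching(prefix_tokens: int, expected_hit_rate: float) -> bool:
    """Break-even test: writes cost 1.25x and reads 0.1x of base input,
    so caching wins when hit_rate * 0.1 + (1 - hit_rate) * 1.25 < 1.0,
    i.e. a hit rate above ~21.7% -- provided the prefix clears the
    minimum cacheable size at all."""
    if prefix_tokens < MIN_CACHEABLE_TOKENS:
        return False
    blended = expected_hit_rate * 0.1 + (1 - expected_hit_rate) * 1.25
    return blended < 1.0
```

An 8k-token prefix at 80% hit clears easily; a 200-token prompt never does, regardless of hit rate.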

What it actually costs

The numbers below come from a typical support-bot workload: 8k static tokens, 400 user tokens, 600 output tokens, a couple thousand calls a month. An 80% hit rate takes the bill from the base rate down into nice-to-have territory. At a million calls a month, 80% hit vs. no cache is a four-digit monthly swing.

Workload: 8k static input tokens · 400 user tokens · 600 output tokens · ~2,000 calls/month

Monthly spend, no caching: $68

With cache at 80% hit: $36

You keep: $32 · 47% off

Illustrative @ $3/M in, $15/M out · cache-write 1.25× · cache-read 0.1×
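The arithmetic behind those numbers fits in a few lines; plug in your own workload. The rates are the illustrative ones above, not a pricing reference:

```python
IN_RATE, OUT_RATE = 3.0, 15.0   # $/M tokens, illustrative base rates
WRITE_X, READ_X = 1.25, 0.10    # cache multipliers on the input rate

def monthly_cost(static_tok, user_tok, out_tok, calls, hit_rate=None):
    """Monthly spend in dollars. hit_rate=None means caching is off."""
    if hit_rate is None:
        # No caching: the static prefix is billed at the plain input rate.
        static = static_tok * IN_RATE
    else:
        # With caching: misses are writes (1.25x), hits are reads (0.1x).
        static = static_tok * IN_RATE * ((1 - hit_rate) * WRITE_X + hit_rate * READ_X)
    per_call = static + user_tok * IN_RATE + out_tok * OUT_RATE
    return calls * per_call / 1_000_000

no_cache = monthly_cost(8_000, 400, 600, 2_000)            # -> 68.4
cached = monthly_cost(8_000, 400, 600, 2_000, hit_rate=0.8)  # -> 36.24
```

Note that output tokens are untouched by caching, which is why a chatty bot with short prompts sees much less benefit than a doc-heavy one.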

Three failure modes to watch

Accidental prefix drift. A new reviewer adds a timestamp to the system prompt; the cache never hits again. Guard your stable prefix like it's a hot path, because it is.
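One way to guard it: pin a fingerprint of the stable prefix in a test, so drift fails CI instead of silently multiplying input spend. A sketch, using a content hash over the prefix (the helper name is made up):

```python
import hashlib
import json

def prefix_fingerprint(system_blocks, tool_defs) -> str:
    """Deterministic digest of the cacheable prefix. Pin this in a test
    so a stray timestamp or a reordered tool shows up as a failing
    assertion, not as a quiet jump in the bill."""
    payload = json.dumps([system_blocks, tool_defs], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

In CI: `assert prefix_fingerprint(SYSTEM, TOOLS) == KNOWN_DIGEST`, and update the digest only in a commit that deliberately breaks the cache.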

Breaking cache on tool addition. Adding a new tool at the top of the tool list shifts every subsequent byte. Append new tools to the end; only break cache on purpose.

Cross-tenant bleed in billing dashboards. When you report cost per tenant, remember that the same cached prefix might be serving many tenants. Allocate the cache-write cost fairly and the cache-read cost per tenant; don't charge tenant A for a cache warmed by tenant B.

The one-line heuristic

Stable content goes in front, dynamic content goes at the back, and a cache breakpoint goes between them. Every other sentence in this article is a consequence.

* * *

Thanks for reading. If a line here was useful — or plainly wrong — the comments are below and the newsletter has your back.

Elsewhere in this issue

  1. Guides · Putting Claude on a schedule: routines, loops, and background work · Apr 20, 2026

  2. Guides · Writing a CLAUDE.md that actually helps · Apr 20, 2026

  3. Guides · A field guide to Claude Code: CLAUDE.md, hooks, skills, plugins · Apr 20, 2026

Letters

Arguments, corrections, questions. Anonymous comments allowed; be kind, be specific.