The Loop  ·  Issue 025

The Loop

A field journal of the AI frontier — for engineers who ship.

§ Guides

By AI Blog Editor
Apr 20, 2026 · 1 min read

Prompt caching: the cheap win most people skip

The largest single cost lever the Claude API gives you, and it's still the one most teams haven't turned on. How the cache works, how to structure for it, and what it actually saves.

Prompt caching is the largest single cost lever the Claude API gives you, and it's still the one most teams haven't turned on. The rule of thumb — "cache reads cost roughly a tenth of base input" — is doing a lot of heavy lifting in production economics, and it's earned in a single hour of wiring.

The idea is simple. Your prompts have a stable prefix (system message, tool definitions, reference docs) and an unstable tail (the user's turn). If you mark the prefix as cacheable, Claude stores it for five minutes. During that window, subsequent turns that reuse the prefix pay the cache-read rate instead of the full input rate.

PROMPTSYSTEM PROMPTtone, rules, personaTOOL DEFSschema, examplesREFERENCE DOCSpolicy, style guideUSER MESSAGEchanges every turnCACHEABLE PREFIX · PAID ONCEstable across turns · marked with cache_controlDYNAMIC · PAID EVERY TURNnext turn rewinds the prefix, rewrites only the tail
Fig. 1 — a cache-friendly prompt is stable → unstable, front to back

Structure your prompt stable-first

Caching matches on byte-for-byte prefix. That means the order of your message matters: stable content must come before dynamic content, consistently. If you swap your system prompt and your tool definitions between turns, nothing matches, and you pay cache writes forever.

In practice: system prompt, then tool definitions, then long-lived reference content, then recent conversation, then the current user turn. Put a cache_control breakpoint at the end of the stable block.

Mind the five-minute TTL

The cache has a TTL. Miss the window and the next call is a cache-write (roughly 1.25× base input) rather than a cache-read. That matters a lot for cadence: if your workload fires every couple of minutes you'll keep the cache warm; if it fires every seven minutes you'll pay a cache-miss every time.

Two defenses. One: for cron-like workloads, either bunch calls close together or accept the cache miss; don't flip between "a thousand calls in five minutes" and "one every six minutes" without noticing. Two: for latency-critical paths, send a warm-up request before the real one if the cache might be cold.

What to cache, what not to

Cache: system prompt, tool schemas, long reference documents, policy text, few-shot examples that don't change per-user. Anything stable that's over ~1,000 tokens benefits.

Don't cache: the user's current message, anything that changes per-request, tiny prefixes where the overhead isn't worth it. Caching a 200-token system prompt is fine in theory and not worth the book-keeping in practice.

What it actually costs

Drag the hit-rate slider and fiddle with the numbers below. For a typical support-bot workload — 8k static tokens, 400 user tokens, 600 output tokens, a couple thousand calls a month — 80% hit takes the bill from the base rate down into the nice-to-have territory. At a million calls a month, 80% hit vs. no cache is a four-digit monthly swing.

 Workload

cold (0%)50%hot (100%)

Monthly spend, no caching

$68

With cache at 80% hit

$36

You keep

$32 · 47% off

illustrative @ $3/M in, $15/M out · cache-write 1.25× · cache-read 0.1×

Interactive · drag the hit rate, type your own numbers

Three failure modes to watch

Accidental prefix drift. A new reviewer adds a timestamp to the system prompt; the cache never hits again. Guard your stable prefix like it's a hot path, because it is.

Breaking cache on tool addition. Adding a new tool at the top of the tool list shifts every subsequent byte. Append new tools to the end; only break cache on purpose.

Cross-tenant bleed in billing dashboards. When you report cost per tenant, remember that the same cached prefix might be serving many tenants. Allocate the cache-write cost fairly and the cache-read cost per tenant; don't charge tenant A for a cache warmed by tenant B.

The one-line heuristic

Stable content goes in front, dynamic content goes at the back, and a cache breakpoint goes between them. Every other sentence in this article is a consequence.

* * *

Thanks for reading. If a line here was useful — or plainly wrong — the comments are below and the newsletter has your back.

Elsewhere in this issue

3 more
  1. 01

    News

    The first partner cut — days before Amazon's researchers flagged a Fable 5 vulnerability, the White House had already told Anthropic to revoke access for SK Telecom, its earliest Korean shareholder and a Project Glasswing partner, over concerns about the company's alleged ties to China. Five days later, Anthropic opened a Seoul office and signed every major Korean conglomerate that isn't SK.

    Jun 19, 2026

  2. 02

    The Patch

    The Patch — June 19, 2026

    Jun 19, 2026

  3. 03

    News

    The kill switch did the diplomacy — five days after Washington took Anthropic Fable 5 and Mythos 5 offline, Dario Amodei and Demis Hassabis sat down at the G7 in Évian-les-Bains and asked the allies to sign up for an explicitly US-led AI coalition. Canada said yes; France brought a list.

    Jun 18, 2026

Letters

Arguments, corrections, questions. Anonymous comments allowed; be kind, be specific.