§ News
By AI Blog Editor
Apr 24, 2026 · 11 min read
Three bugs in the harness — Anthropic's Claude Code postmortem, and the system prompt that cost 3%
For two months, Claude Code users complained it had gotten dumber. Anthropic's April 23 postmortem blames three separate harness bugs — including a thinking-cache clear that fired every turn, and a one-line verbosity instruction that dropped eval scores three points.

On April 23, 2026, Anthropic published a postmortem admitting that for roughly two months, the complaints on Reddit and the forums were right: Claude Code had gotten worse, and the reason was three distinct bugs in the harness wrapped around the model, not the model itself.
The writeup lands the day after OpenAI shipped GPT-5.5 and two days after DeepSeek V4 — a week when the entire conversation in AI was about new capability. Anthropic chose that week to publish the other kind of engineering blog post: the one that starts by conceding the users were not imagining things.
The three bugs, in the order they shipped
March 4, 2026 — reasoning effort quietly dropped. The default reasoning effort for Sonnet 4.6 and Opus 4.6 was moved from high to medium to address UI freezes and long latencies when the model burned its thinking budget. It stayed at medium until April 7, when feedback forced a revert. Opus 4.7 now defaults to xhigh; everything else is back to high. For a month, users quietly got a cheaper model without being told.
March 26, 2026 — the thinking-cache bug. This is the one worth reading twice. A change intended to clear Claude's older thinking history once at the start of a long-idle session instead ran on every turn for the rest of the session. Per Anthropic's own writeup, the root cause was the implementation of the clear_thinking_20251015 API header with keep: 1. The symptoms were exactly what users reported: forgetfulness, repetition, odd tool choices, and usage limits draining faster than they should. Fixed April 10 in v2.1.101. It took Anthropic over a week to diagnose because orthogonal changes masked the bug in their own CLI testing.
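Anthropic's writeup doesn't include the offending code, but the class of bug is familiar: a guard meant to detect "resumed after a long idle" that, because the activity timestamp is never refreshed, stays true for every subsequent turn. A toy Python sketch of that failure mode — all names hypothetical, not Anthropic's actual implementation:

```python
IDLE_THRESHOLD_S = 30 * 60  # hypothetical "long-idle" cutoff: 30 minutes


class Session:
    """Toy harness session illustrating the once-vs-every-turn bug class."""

    def __init__(self, created_at):
        self.thinking = ["old thinking"]  # stand-in for cached thinking blocks
        self.last_active = created_at     # BUG: never updated after creation
        self.clear_count = 0              # instrumentation for this sketch

    def run_turn(self, now):
        # Intended: clear older thinking once, when resuming a long-idle
        # session. Because last_active is frozen at creation time, the
        # condition stays true forever once the session has idled past
        # the threshold — so the clear fires on every turn.
        if now - self.last_active > IDLE_THRESHOLD_S:
            self.thinking.clear()
            self.clear_count += 1
        # The fix is one line: refresh the timestamp so the idle check
        # measures the gap since the previous turn, not since creation.
        # self.last_active = now
```

Run three turns on a session that idled once, and the clear fires three times instead of one — exactly the "forgetfulness and repetition" shape users reported.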
Simon Willison, who runs roughly eleven long-running Claude Code sessions at any given time, put it plainly: "I think I probably spend more time in 'stale' sessions than in recently-started ones." The bug was targeted precisely at the workflow his readers — and mine — actually use.
April 16, 2026 — the one-line system prompt change. Anthropic added a single instruction to the Claude Code system prompt: "Length limits: keep text between tool calls to ≤25 words." On its face, a reasonable nudge — stop narrating, just work. The measured effect: a 3% drop in evaluation scores across both Opus 4.6 and Opus 4.7. Reverted on April 20 in v2.1.116. Usage limits were reset for all subscribers on April 23 as an apology.
Three percent is the margin between one model tier and the next. It is the gap an entire frontier lab's research org spends a quarter trying to close. Anthropic closed it in the wrong direction by asking the model to be quieter.
The irony nobody at Anthropic can write down
The verbosity limit that cost three eval points is still present, in spirit, in the instructions governing a great many agents built on top of the Claude API — including, one hears, some of Anthropic's own internal tooling. "Don't narrate, just call tools" is good advice for humans reviewing transcripts. It is apparently bad advice for a model whose chain-of-thought-adjacent reasoning leaks into the prose it writes between tool calls.
It is also one of the rare bugs that you can fix with a Git revert and an academic paper. The question — does forcing brevity on a reasoning model measurably degrade it, and by how much on which tasks — is the kind of thing that should end up in somebody's NeurIPS submission. Three percent is a result.
Why harness bugs are the hardest kind
Willison's framing is correct and worth generalising: "Bugs in system harnesses can be deeply intricate to identify, especially when the system incorporates non-deterministic components like LLMs themselves." The harness is the code that wraps the model — system prompt, tool definitions, context management, caching, retry logic. It is the seam between the deterministic software world and the probabilistic one, and the place where every piece of evidence a debugger would normally rely on — repro steps, stack traces, stable error messages — becomes soft.
The thinking-cache bug is the textbook example. If you resume a session and the model acts confused, you cannot tell by looking at it whether the model is having a bad day, whether the prompt is too long, whether the context has been silently truncated, or whether the cache-clearing header fired when it shouldn't have. All four hypotheses produce the same symptom: Claude is being a bit dim today.
Anthropic's own test harness missed the bug because orthogonal changes — bundled into the same release — compensated for it. The CLI looked fine in their testing and broken in the field, which is the failure mode every platform team eventually hits, and the reason staging environments never, ever catch the bugs that matter.
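One shape a harness-level regression test could take is a property check on the idle-clear logic in isolation, deliberately decoupled from whatever else ships in a release, so a compensating change elsewhere can't mask it. A toy sketch, with all names hypothetical since the postmortem publishes no code:

```python
def count_clears(turn_gaps_s, idle_threshold_s=1800):
    """Simulate a session's turns and count how often the idle-clear fires.

    turn_gaps_s: seconds elapsed before each turn.
    Correct behaviour: the clear fires once per long-idle gap, so a session
    resumed once after a long break clears exactly once, regardless of how
    many turns follow.
    """
    last_active = 0.0
    now = 0.0
    clears = 0
    for gap in turn_gaps_s:
        now += gap
        if now - last_active > idle_threshold_s:
            clears += 1
        last_active = now  # fixed behaviour: idle is measured per gap
    return clears


# The invariant the test suite should pin down:
# one long-idle resume => exactly one clear.
assert count_clears([7200, 1, 1, 1]) == 1
assert count_clears([1, 1, 1]) == 0
```

A test at this level would have failed on the buggy build no matter what else was bundled into the release, which is the whole argument for testing harness behaviour as properties rather than end-to-end transcripts.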

The part where Opus 4.7 finds the bug Opus 4.6 couldn't
Tucked into the postmortem is a detail that reads like a product roadmap item: during a code review of the thinking-cache change, Opus 4.7 correctly flagged the bug. Opus 4.6 — the model actually running in the harness at the time — did not.
Read that sentence charitably and it is an argument for using a newer model as a code reviewer on the serving infrastructure of older ones. Read it less charitably and it is Anthropic conceding that the bug which degraded Opus 4.6 in the field was within Opus 4.7's capability to catch at PR time — if anyone had asked it.
Either way, the lesson generalises. If your agent's harness is complex enough that your own engineers cannot debug it from logs, a second model reviewing harness diffs is not a gimmick. It is a process control.
What this means
- Harness quality is model quality, from the user's seat. For two months, the number on the marketing page was Opus 4.6. The thing running in the CLI was Opus 4.6 plus three silent regressions. Users noticed. Benchmarks did not, because benchmarks don't run in the harness.
- "Just make it quieter" is a research question, not a config tweak. The 25-word limit is the most quietly expensive one-liner in the postmortem. Any team shipping instructions to a reasoning model that constrain its output between tool calls should be eval'ing that change before and after. If Anthropic can drop 3% on its own models, your RAG prompt can too.
- Reset the usage counter when you ship the apology. Anthropic reset usage limits for all subscribers on April 23. That is the right call and it is also a reminder that when a bug wastes your users' rate limit, refunding the limit is part of the fix. The companies that don't will learn this the expensive way.
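The "eval it before and after" point can be made concrete. Anthropic's eval numbers aren't public beyond the 3% figure, so the scores below are synthetic; this is a minimal sketch of the paired before/after comparison any team could run on its own prompt change, using a bootstrap confidence interval on the per-task score deltas:

```python
import random


def paired_bootstrap_delta(before, after, iters=2000, seed=0):
    """Bootstrap a 95% confidence interval on the mean score change when
    the same eval tasks are scored under the old and new system prompt."""
    assert len(before) == len(after)
    rng = random.Random(seed)
    deltas = [a - b for b, a in zip(before, after)]
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(iters)
    )
    return means[int(0.025 * iters)], means[int(0.975 * iters)]


# Synthetic per-task scores: the "after" prompt loses ~3 points on average,
# with per-task noise — roughly the shape of the regression in the postmortem.
before = [70 + (i % 10) for i in range(200)]
after = [b - 3 + ((i % 4) - 1.5) for i, b in enumerate(before)]
lo, hi = paired_bootstrap_delta(before, after)
# A confidence interval entirely below zero means the prompt change is a
# measured regression, not eval noise — ship the revert, not the prompt.
```

Pairing by task matters: comparing aggregate scores across two separate eval runs would bury a 3-point effect under run-to-run variance that the per-task deltas cancel out.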
What to watch over the next month: whether Anthropic ships the promised better regression test suite for the harness itself, and whether any of the GPT-5.5, Gemini 3.x, or DeepSeek V4 coding agents publish comparable postmortems when — not if — the same class of bug hits them. Harness bugs are not an Anthropic problem. They are an agent-product problem. Anthropic just happens to have written the postmortem first.
* * *
Thanks for reading. If a line here was useful — or plainly wrong — the comments are below and the newsletter has your back.
Elsewhere in this issue
1. News · A trillion-dollar Anthropic — the number that lives only on Forge (Apr 27, 2026)
2. News · Decoupled DiLoCo — Google teaches frontier training to survive a bad fibre and a dead chip (Apr 26, 2026)
3. News · GPT-5.5 "Spud" — twice the price, split benchmarks, and a polite request to start your prompts over (Apr 25, 2026)
Letters
Arguments, corrections, questions. Anonymous comments allowed; be kind, be specific.