The Loop  ·  Issue 017

A field journal of the AI frontier — for engineers who ship.

§ News

By AI Blog Editor
Apr 26, 2026 · 15 min read

Decoupled DiLoCo — Google teaches frontier training to survive a bad fibre and a dead chip

Google DeepMind trained a 12-billion-parameter model across four U.S. regions over a 2–5 Gbps link, more than twenty times faster than conventional sync. The "we need one 5-gigawatt campus" story now looks less like physics and more like habit.

Server racks at the NOIRLab headquarters server room, Tucson, Arizona. Photo — NOIRLab/AURA/NSF, CC BY 4.0.

On April 22, 2026, Google DeepMind quietly published a paper that argues the entire premise of the current AI-datacenter race is optional. The headline result: a 12-billion-parameter language model trained across four separate U.S. regions over a wide-area network of just 2–5 Gbps, at quality on par with a single-cluster baseline, and more than twenty times faster than the obvious "just sync over the WAN" approach.

The system is called Decoupled DiLoCo, and the boring engineering bits matter more than the marketing.

The problem it actually solves

Frontier pre-training in 2026 looks like this: tens of thousands of accelerators in one campus, lashed together with custom networking, all marching in lockstep. Every step of training waits for every chip to finish before any chip moves on. Add a flaky optical transceiver and the whole run stalls. Spread the chips across two cities and the cross-site bandwidth — what the DeepMind paper measures at 198 Gbps for vanilla data-parallel training across eight datacenters — exceeds what anyone realistically wires between buildings.

This is why the industry is in a single-campus arms race. Anthropic's recently announced 5-gigawatt expansion with Amazon leans on the assumption that you need contiguous capacity. xAI's Memphis build, the Stargate filings, every "100,000 GPUs in one room" press release — all priced against the same constraint.

Decoupled DiLoCo's claim is that the constraint is an artifact of how we choose to synchronise, not of the maths. The paper, authored by Arthur Douillard, Keith Rush, Jeff Dean and fifteen co-authors, partitions training across "independent learners that execute local inner optimization steps" before reporting back to a central synchroniser only occasionally. The synchroniser uses what the abstract calls "a minimum quorum, an adaptive grace window, and dynamic token-weighted merging" — a polite way of saying it stops waiting for stragglers and weights the survivors.

That is a sentence that, if it holds up at scale, kills a lot of expensive concrete.
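In rough pseudocode, the quorum-and-grace-window idea the abstract describes might look like the sketch below. Everything here — the function name, the thresholds, the simulated clock — is an illustrative assumption, not DeepMind's implementation:

```python
def close_round(arrivals, total_learners, min_quorum=0.6, grace=30.0):
    """Decide which learner updates make it into one sync round.

    `arrivals` is a time-sorted list of (arrival_time_s, learner_id) pairs.
    The round closes `grace` seconds after a quorum of learners has reported;
    anything later is a straggler and rejoins a future round. (A production
    system would also have a hard timeout for the no-quorum case.)
    """
    need = max(1, int(min_quorum * total_learners))
    included, deadline = [], None
    for t, learner in arrivals:
        if deadline is not None and t > deadline:
            break                        # grace window expired: stop waiting
        included.append(learner)
        if deadline is None and len(included) >= need:
            deadline = t + grace         # quorum reached: open grace window
    return included
```

The key behaviour is that the merge never blocks on the slowest worker: with four learners, a 50% quorum, and a 30-second grace window, a learner reporting 40 seconds after the first two simply misses the round.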

The numbers, dual-sourced

Both the DeepMind blog post and MarkTechPost's independent write-up agree on the headline figures, and SDxCentral's coverage corroborates the WAN test.

  • Bandwidth: 0.84 Gbps inter-datacenter requirement across eight sites, versus 198 Gbps for standard data-parallel training. That is not an incremental improvement; that is a two-orders-of-magnitude reframing of what counts as enough wire.
  • Resilience: in a chaos-engineering simulation across 1.2 million chips with high failure rates injected, Decoupled DiLoCo held 88% goodput — the fraction of compute time spent on useful training — versus 27% for vanilla data-parallel.
  • Quality: on Gemma 4 evaluations, the distributed run hit 64.1% average ML benchmark accuracy versus 64.4% for the single-cluster baseline. A 0.3-point gap is the kind of difference that disappears if you change the random seed.
  • Real WAN test: the 12-billion-parameter model trained across four U.S. regions on a 2–5 Gbps link finished, in DeepMind's framing, "more than 20 times faster" than conventional synchronous training over the same network.
  • Hardware mixing: the same run combined TPU v5p and v6e generations without measurable degradation.
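The first two bullets can be sanity-checked with arithmetic on the published figures:

```python
# Ratios derived from the dual-sourced headline numbers above.
dp_bandwidth_gbps = 198.0        # vanilla data-parallel across eight sites
diloco_bandwidth_gbps = 0.84     # Decoupled DiLoCo requirement

reduction = dp_bandwidth_gbps / diloco_bandwidth_gbps
print(f"bandwidth reduction: {reduction:.0f}x")      # prints "bandwidth reduction: 236x"

goodput_diloco, goodput_dp = 0.88, 0.27              # chaos-test goodput
print(f"goodput advantage: {goodput_diloco / goodput_dp:.1f}x")  # prints "goodput advantage: 3.3x"
```

A 236x cut in required cross-site bandwidth and a 3.3x goodput advantage under injected failures are the two numbers doing the strategic work in everything below.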

The 12-billion-parameter test is the one to pay attention to. Twelve billion is not Gemini-flagship territory, but it is past the size where the result is a toy. And 2–5 Gbps is the kind of capacity any large enterprise already buys from a carrier, not custom dark fibre.

Why this is different from the previous DiLoCo

DiLoCo (the original, from late 2023) showed you could train across loosely connected clusters if you were willing to trade some efficiency for vastly less communication. It was a research curiosity that the open-source crowd promptly cloned — Prime Intellect's OpenDiLoCo used it to coordinate training across volunteers — and most frontier labs filed it under "interesting, not for us."

The new paper changes the disposition by attacking the two reasons production teams politely declined the original:

  1. Failures stalled everything. A single dead worker stalled the synchronous round and you lost a whole epoch. Decoupled DiLoCo treats the synchroniser as a quorum problem with a grace window — the round closes when enough learners report, not when all of them do — and reintegrates stragglers when they reappear. The 88-vs-27 goodput number is what that change looks like in a chart.
  2. Quality regressed. Earlier asynchronous schemes left a percentage point on the table. The token-weighted merging strategy in the new paper closes that gap to within noise on a Gemma-4-class run.
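Token-weighted merging, reduced to its essence, is a weighted average of learner updates. The toy sketch below operates on scalar lists for clarity; the real system merges full parameter tensors through an outer optimizer, and the function here is an assumption, not the paper's code:

```python
def merge_updates(deltas, tokens):
    """Merge per-learner parameter deltas, weighting each by tokens seen.

    A learner that processed more data this round contributes
    proportionally more to the merged update, so a straggler that only
    got through a sliver of its shard cannot drag the average around.
    """
    total = sum(tokens)
    return [
        sum(w * d[i] for w, d in zip(tokens, deltas)) / total
        for i in range(len(deltas[0]))
    ]

# A learner that saw 3x the tokens pulls the merge 3x as hard:
merged = merge_updates([[1.0, 0.0], [3.0, 2.0]], tokens=[1, 3])
print(merged)  # prints "[2.5, 1.5]"
```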

Neither change is conceptually exotic. Both are the kind of engineering that takes a year of iteration and a Pathways-class runtime to land in production. Which is the point: this is not a research toy. DeepMind is publishing it because it has been deployed.

The strategic read

If the technique generalises to 100B+ parameter runs — which the paper does not yet claim — three things shift.

First, the campus-size narrative weakens. The argument for spending $100 billion on a single 5-gigawatt site is that frontier training cannot be split. If it can, you can stand up a national-grid-tolerant collection of two-gigawatt sites where the power is, rather than one mega-campus where the power barely is. The economics of permitting, water, and transmission look very different in the second world.

Second, "stranded compute" becomes usable. The same architecture that survives a chaos-engineering failure rate is, in the limit, the one that lets you co-train across an under-utilised TPU pod in Iowa and an idle one in Oklahoma. Hyperscalers have a lot of those. So, increasingly, do nation-state buyers.

Third, the moat thesis around bespoke networking gets thinner. Nvidia's NVLink and the optical-switch fabric that AWS is building into Trainium-3 sites both bet that frontier training requires hero-tier interconnect. If you can train Gemma-4-class models on what amounts to a corporate ISP link, the case for spending an extra billion dollars per cluster on ultra-low-latency fabric becomes a case you have to argue rather than assume.

Note all three of those are conditional on the technique scaling. DeepMind has shown 12B works. It has not shown that a 1-trillion-parameter Gemini run is going to be sliced across four regions next quarter. The paper is careful about that distinction; the press coverage is less so.

What's missing from the announcement

The blog post is unusually quiet on three things, and the silences are themselves informative.

  • No flagship deployment claim. The post does not say "we trained Gemini 3.x on this." The closest it comes is the 12B test. Either the technique is being staged into production carefully, or it does not yet hold at flagship scale. Pick the interpretation you find more flattering.
  • No comparison to OpenDiLoCo. The open-source community has been refining the original DiLoCo idea for two years. DeepMind's new paper does not engage with that work directly, which is the kind of omission you notice if you have been reading the open-distributed-training literature.
  • No hardware-cost number. "We saved bandwidth" is impressive; "we saved $X per training run" is the version that ends up in board decks. The paper, perhaps wisely, does not try to put a dollar figure on it.

The cynical reading is that DeepMind is publishing the architecture without publishing the parts of it that actually matter at trillion-parameter scale, because those parts are competitive. The charitable reading is that this is genuinely the state of the technique today and the trillion-parameter version will land in a follow-up paper. Both can be true.

What to watch

  1. Whether the next Gemini model card mentions multi-region training. If Gemini 3.2 or 4.0 is trained across regions, this paper was the announcement. If it isn't, the technique is real but not yet flagship-grade.
  2. Whether AWS, Microsoft, or Meta publishes a counterpart. Anthropic has its own quiet bet on cross-region resilience inside the Trainium roadmap, and Microsoft has been hinting at "geo-distributed pre-training" in its Azure AI infrastructure talks. The first competitor to publish numbers in the same units as DeepMind's wins this round.
  3. Whether anyone in the open-source crowd reproduces the 88% goodput claim. Prime Intellect, EleutherAI, or a Hugging Face collaboration could run the chaos-engineering test on a smaller scale within months. If the result holds, the technique is general. If it doesn't, DeepMind has some internal infrastructure that nobody else can replicate, and the 88% becomes a Google-specific number.

The headline takeaway is more uncomfortable than DeepMind's blog post lets on: a meaningful chunk of the AI-infrastructure capex announced in 2026 is priced against an assumption — that frontier training requires a single contiguous campus — that DeepMind has just published a paper attacking. Whether the attack lands at trillion-parameter scale will determine whether the next round of hundred-billion-dollar datacenter commitments age well, or look in retrospect like buying a mainframe in 1987.

* * *

Thanks for reading. If a line here was useful — or plainly wrong — the comments below are the place to say so.

Elsewhere in this issue

  1. News · A trillion-dollar Anthropic — the number that lives only on Forge · Apr 27, 2026
  2. News · GPT-5.5 "Spud" — twice the price, split benchmarks, and a polite request to start your prompts over · Apr 25, 2026
  3. News · Three bugs in the harness — Anthropic's Claude Code postmortem, and the system prompt that cost 3% · Apr 24, 2026

Letters

Arguments, corrections, questions. Anonymous comments allowed; be kind, be specific.