The Loop  ·  Issue 017


A field journal of the AI frontier — for engineers who ship.

  Lab bench

Experiment №011
filed Apr 21, 2026


Filed under

  • #context-window
  • #rag
  • #game

Context Tetris

Messages drop every few seconds into a fixed 1000-token context window. Keep what matters, evict the noise.

  Primer

Skip if you already know the theory; the interactive is right below.

Real production prompts have a fixed-size context window and a stream of incoming content — system instructions, RAG chunks, user turns, tool call results, pleasantries. Not all of it fits. Deciding what stays is a constant, boring, important choice.

This game models that choice. 1000 tokens of context, messages arrive every couple of seconds, you pick keep or drop. Once the window is full, you can evict older messages to make room. Score = total importance of what's still in the window at round's end.

▶  Try it


Messages arrive every 1.8s. For each one, choose whether to keep it in your context window or drop it. Context is capped at 1000 tokens, so you'll need to evict old messages to make room for important new ones. 25 messages per round; score = total importance of what remains.


  Notes from the bench

What to watch for, why it matters, and the one thing that usually surprises people.

What to notice

The pool is biased: 55% of arrivals are low-importance noise (pleasantries, emojis, redundant rephrasings), 30% are medium, and only 15% are the high-importance messages that actually move the score. A naive "keep everything" strategy fills the window with pleasantries and leaves no room for the system prompt when it finally arrives. A naive "drop everything" strategy keeps the window empty.

The optimal play is more like a senior engineer triaging a Slack channel: most things are fine to miss, a few things matter enormously, and you have to eject old low-value content proactively before the new high-value content arrives.

Message kinds

  • system — format rules, persona, tool-use instructions. Big, important, front-of-window.
  • user — the actual question, clarifications. Small but the single most important category.
  • fact — RAG chunks, policy excerpts, API schemas. Mixed sizes and mixed relevance.
  • noise — pleasantries, emoji reactions, repeated phrasings. Cheap to drop, bloats the context if kept.
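The four kinds above, combined with the biased arrival mix described in the notes (roughly 55% low-importance, 30% medium, 15% high), can be sketched as a simple sampler. All names, token ranges, and importance scores here are illustrative guesses, not the game's actual internals:

```python
import random
from dataclasses import dataclass

@dataclass
class Message:
    kind: str        # "system" | "user" | "fact" | "noise"
    tokens: int      # size cost against the 1000-token budget
    importance: int  # score contribution if still in the window at round's end

def draw_message(rng: random.Random) -> Message:
    """Sample one arrival from a pool biased toward noise:
    ~55% low-importance, ~30% medium, ~15% high."""
    roll = rng.random()
    if roll < 0.55:   # low-importance noise: cheap, near-worthless
        return Message("noise", rng.randint(5, 20), rng.randint(0, 1))
    if roll < 0.85:   # medium: RAG chunks of mixed size and relevance
        return Message("fact", rng.randint(40, 120), rng.randint(3, 5))
    # high: the system prompt and user turns that move the score
    kind = rng.choice(["system", "user"])
    tokens = rng.randint(80, 200) if kind == "system" else rng.randint(10, 40)
    return Message(kind, tokens, rng.randint(8, 10))

rng = random.Random(0)
round_msgs = [draw_message(rng) for _ in range(25)]  # one round's arrivals
```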

This is RAG with a stopwatch

Every production LLM pipeline does a version of this, implicitly. A conversation-history truncator picks which turns to keep. A RAG retriever ranks chunks by relevance. A prompt compressor summarizes earlier turns to free space. The details vary — sliding windows, token-budget caps, semantic reranking, summarization cascades — but the shape is the same: a fixed budget, an endless stream, and a choice function you hope is smarter than "keep the most recent".
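One choice function that beats "keep the most recent" is greedy importance-per-token: whenever something arrives, re-rank everything by importance density and keep the densest messages that fit the budget. A minimal sketch (the policy and names are my own, not taken from any particular library):

```python
from typing import NamedTuple

class Msg(NamedTuple):
    tokens: int
    importance: float

BUDGET = 1000  # fixed context cap, as in the game

def admit(window: list[Msg], new: Msg) -> list[Msg]:
    """Greedy density policy: rank kept messages plus the newcomer by
    importance per token, then keep the best-ranked ones that fit.
    Low-density content gets evicted to make room for dense arrivals."""
    candidates = sorted(window + [new],
                        key=lambda m: m.importance / m.tokens,
                        reverse=True)
    kept, used = [], 0
    for m in candidates:
        if used + m.tokens <= BUDGET:
            kept.append(m)
            used += m.tokens
    return kept
```

Streaming every arrival through `admit` is a greedy approximation of the underlying knapsack problem: cheap enough to run per message, and it never lets a pile of pleasantries block a small, high-importance user turn.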

The interesting failure mode in this game is also the interesting failure mode in real life: noise accumulates into the context and crowds out the things that actually matter. "My model is ignoring my instructions" is often just "my instructions are no longer in the window".
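That failure mode is easy to reproduce with the most common default, a recency-only truncator. A toy sketch (real pipelines usually pin the system message precisely to avoid this; the numbers below are invented):

```python
def keep_most_recent(history: list[tuple[str, int]],
                     budget: int) -> list[tuple[str, int]]:
    """Naive truncation: walk backwards from the newest turn,
    keeping whatever fits the token budget. Nothing is pinned."""
    kept, used = [], 0
    for role, tokens in reversed(history):
        if used + tokens > budget:
            break
        kept.append((role, tokens))
        used += tokens
    return list(reversed(kept))

# A 300-token system prompt followed by ten 90-token chit-chat turns:
history = [("system", 300)] + [("noise", 90)] * 10
window = keep_most_recent(history, budget=800)
# Only the eight most recent noise turns fit; the system prompt is gone.
```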

In a line

Real-time triage game. Biased message pool (mostly noise), hard context cap, eviction-by-click. Score = importance-weighted retention. Models the problem every production RAG pipeline solves.

Other experiments

  1. Exp 001

    How a sentence becomes tokens

  2. Exp 002

    Temperature and top-p, visibly

  3. Exp 003

    What does this prompt actually cost?

  4. Exp 004

    Tokens per second

  5. Exp 005

    How far should the model think?

  6. Exp 006

    Neural language vs a Markov chain

  7. Exp 007

    What each token looks at

  8. Exp 008

    Words in space

  9. Exp 009

    The injection arena

  10. Exp 010

    AI or human?

  11. Exp 012

    Magnet flip