Context Tetris
Messages drop every few seconds into a fixed 1000-token context window. Keep what matters, evict the noise.
❂ Primer
Skip if you already know the theory; the interactive is right below.
Real production prompts have a fixed-size context window and a stream of incoming content — system instructions, RAG chunks, user turns, tool call results, pleasantries. Not all of it fits. Deciding what stays is a constant, boring, important choice.
This game models that choice. 1000 tokens of context, messages arrive every couple of seconds, you pick keep or drop. Once the window is full, you can evict older messages to make room. Score = total importance of what's still in the window at round's end.
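The keep-or-evict decision can be sketched as a greedy importance-per-token policy. This is one plausible strategy, not the game's actual implementation; `Msg` and its fields are hypothetical stand-ins for whatever the game tracks internally:

```python
from dataclasses import dataclass

CAP = 1000  # context window size in tokens, as in the game

@dataclass
class Msg:
    kind: str        # "system" | "user" | "fact" | "noise"
    tokens: int
    importance: int  # hypothetical per-message score

def try_keep(window: list[Msg], msg: Msg) -> bool:
    """Greedy policy: evict the lowest importance-per-token messages
    first, but only accept the trade if it raises total importance."""
    used = sum(m.tokens for m in window)
    victims = sorted(window, key=lambda m: m.importance / m.tokens)
    evicted, freed, lost = [], 0, 0
    while used - freed + msg.tokens > CAP and victims:
        v = victims.pop(0)
        evicted.append(v)
        freed += v.tokens
        lost += v.importance
    if used - freed + msg.tokens > CAP or lost >= msg.importance:
        return False  # doesn't fit, or eviction costs more than we gain
    for v in evicted:
        window.remove(v)
    window.append(msg)
    return True
```

Note the tie-breaker: a new message is dropped unless it strictly out-scores everything it would displace, which is exactly the instinct the game tries to train.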
▶ Try it
Messages arrive every 1.8s. For each one, choose whether to keep it in your context window or drop it. Context is capped at 1000 tokens — you'll need to evict old messages to make room for important new ones. 25 messages per round; score = total importance of what remains.
Context window — click any to evict
▤ Leaderboard · top 25
⁂ Notes from the bench
What to watch for, why it matters, and the one thing that usually surprises people.
What to notice
The pool is biased: 55% of arrivals are low-importance noise (pleasantries, emojis, redundant rephrasings), 30% are medium, and only 15% are the high-importance messages that actually move the score. A naive "keep everything" strategy fills the window with pleasantries and leaves no room for the system prompt when it finally arrives. A naive "drop everything" strategy keeps the window empty and scores zero.
The optimal play is more like a senior engineer triaging a Slack channel: most things are fine to miss, a few things matter enormously, and you have to eject old low-value content proactively before the new high-value content arrives.
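That bias is easy to simulate. A minimal sketch of the arrival pool under the stated 55/30/15 split — the per-tier importance ranges are assumptions for illustration, not the game's actual values:

```python
import random

# Hypothetical arrival pool matching the stated bias:
# 55% noise, 30% medium, 15% high-importance.
POOL = [
    ("noise",  0.55, (1, 2)),    # importance range per tier (assumed)
    ("medium", 0.30, (3, 5)),
    ("high",   0.15, (8, 10)),
]

def draw(rng: random.Random) -> tuple[str, int]:
    """Sample one arriving message: (tier, importance)."""
    tier = rng.choices([p[0] for p in POOL],
                       weights=[p[1] for p in POOL])[0]
    lo, hi = next(p[2] for p in POOL if p[0] == tier)
    return tier, rng.randint(lo, hi)
```

Drawing a few thousand samples makes the triage problem obvious: most arrivals are cheap to ignore, and the rare high tier carries a disproportionate share of the available score.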
Message kinds
- system — format rules, persona, tool-use instructions. Big, important, front-of-window.
- user — the actual question, clarifications. Small but the single most important category.
- fact — RAG chunks, policy excerpts, API schemas. Mixed sizes and mixed relevance.
- noise — pleasantries, emoji reactions, repeated phrasings. Cheap to drop, bloats the context if kept.
This is RAG with a stopwatch
Every production LLM pipeline does a version of this, implicitly. A conversation-history truncator picks which turns to keep. A RAG retriever ranks chunks by relevance. A prompt compressor summarizes earlier turns to free space. The details vary — sliding windows, token-budget caps, semantic reranking, summarization cascades — but the shape is the same: a fixed budget, an endless stream, and a choice function you hope is smarter than "keep the most recent".
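The simplest of those, a token-budget history truncator, fits in a few lines: pin the system prompt, then keep the newest turns that fit. A sketch under assumed names — `count_tokens` stands in for whatever tokenizer length function the pipeline uses:

```python
def truncate_history(turns, budget, count_tokens):
    """Keep the system prompt plus the most recent turns that fit
    within `budget` tokens. `turns` is a list of (role, text) pairs;
    `count_tokens` is any tokenizer's length function (illustrative)."""
    system = [t for t in turns if t[0] == "system"]
    rest = [t for t in turns if t[0] != "system"]
    used = sum(count_tokens(text) for _, text in system)
    kept = []
    for role, text in reversed(rest):   # walk newest-first
        cost = count_tokens(text)
        if used + cost > budget:
            break                       # everything older gets dropped
        kept.append((role, text))
        used += cost
    return system + kept[::-1]          # restore chronological order
```

This is the "keep the most recent" baseline the paragraph above warns about: it never drops the system prompt, but it will happily evict a crucial early clarification to make room for a late pleasantry.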
The interesting failure mode in this game is also the interesting failure mode in real life: noise accumulates into the context and crowds out the things that actually matter. "My model is ignoring my instructions" is often just "my instructions are no longer in the window".
In a line
Real-time triage game. Biased message pool (mostly noise), hard context cap, eviction-by-click. Score = importance-weighted retention. Models the problem every production RAG pipeline solves.
Other experiments
- Exp 001
How a sentence becomes tokens
- Exp 002
Temperature and top-p, visibly
- Exp 003
What does this prompt actually cost?
- Exp 004
Tokens per second
- Exp 005
How far should the model think?
- Exp 006
Neural language vs a Markov chain
- Exp 007
What each token looks at
- Exp 008
Words in space
- Exp 009
The injection arena
- Exp 010
AI or human?
- Exp 012
Magnet flip