The Loop  ·  Issue 017


A field journal of the AI frontier — for engineers who ship.

  Lab bench

Experiment №011
filed Apr 21, 2026


Filed under

  • #context-window
  • #rag
  • #game

Context Tetris

Messages drop every few seconds into a fixed 1000-token context window. Keep what matters, evict the noise.

  Primer

Skip if you already know the theory; the interactive is right below.

Real production prompts have a fixed-size context window and a stream of incoming content — system instructions, RAG chunks, user turns, tool call results, pleasantries. Not all of it fits. Deciding what stays is a constant, boring, important choice.

This game models that choice. 1000 tokens of context, messages arrive every couple of seconds, you pick keep or drop. Once the window is full, you can evict older messages to make room. Score = total importance of what's still in the window at round's end.

▶  Try it


Messages arrive every 1.8s. For each one, choose whether to keep it in your context window or drop it. Context is capped at 1000 tokens, so you'll need to evict old messages to make room for important new ones. 25 messages per round; score = total importance of what remains.


  Notes from the bench

What to watch for, why it matters, and the one thing that usually surprises people.

What to notice

The pool is biased: 55% of arrivals are low-importance noise (pleasantries, emojis, redundant rephrasings), 30% are medium, and only 15% are the high-importance messages that actually move the score. A naive "keep everything" strategy fills the window with pleasantries and leaves no room for the system prompt when it finally arrives. A naive "drop everything" strategy keeps the window empty.

The optimal play is more like a senior engineer triaging a Slack channel: most things are fine to miss, a few things matter enormously, and you have to eject old low-value content proactively before the new high-value content arrives.

Message kinds

  • system — format rules, persona, tool-use instructions. Big, important, front-of-window.
  • user — the actual question, clarifications. Small but the single most important category.
  • fact — RAG chunks, policy excerpts, API schemas. Mixed sizes and mixed relevance.
  • noise — pleasantries, emoji reactions, repeated phrasings. Cheap to drop, bloats the context if kept.
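The four kinds above, combined with the biased arrival mix described in the notes (roughly 55% low-importance, 30% medium, 15% high), can be sketched as a simple sampler. All names, token ranges, and importance scores here are illustrative guesses, not the game's actual internals:

```python
import random
from dataclasses import dataclass

@dataclass
class Message:
    kind: str        # "system" | "user" | "fact" | "noise"
    tokens: int      # size cost against the 1000-token budget
    importance: int  # score contribution if still in the window at round's end

def draw_message(rng: random.Random) -> Message:
    """Sample one arrival from a pool biased toward noise:
    ~55% low-importance, ~30% medium, ~15% high."""
    roll = rng.random()
    if roll < 0.55:   # low-importance noise: cheap, near-worthless
        return Message("noise", rng.randint(5, 20), rng.randint(0, 1))
    if roll < 0.85:   # medium: RAG chunks of mixed size and relevance
        return Message("fact", rng.randint(40, 120), rng.randint(3, 5))
    # high: the system prompt and user turns that move the score
    kind = rng.choice(["system", "user"])
    tokens = rng.randint(80, 200) if kind == "system" else rng.randint(10, 40)
    return Message(kind, tokens, rng.randint(8, 10))

rng = random.Random(0)
round_msgs = [draw_message(rng) for _ in range(25)]  # one round's arrivals
```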

This is RAG with a stopwatch

Every production LLM pipeline does a version of this, implicitly. A conversation-history truncator picks which turns to keep. A RAG retriever ranks chunks by relevance. A prompt compressor summarizes earlier turns to free space. The details vary — sliding windows, token-budget caps, semantic reranking, summarization cascades — but the shape is the same: a fixed budget, an endless stream, and a choice function you hope is smarter than "keep the most recent".
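One choice function that beats "keep the most recent" is greedy importance-per-token: whenever something arrives, re-rank everything by importance density and keep the densest messages that fit the budget. A minimal sketch (the policy and names are my own, not taken from any particular library):

```python
from typing import NamedTuple

class Msg(NamedTuple):
    tokens: int
    importance: float

BUDGET = 1000  # fixed context cap, as in the game

def admit(window: list[Msg], new: Msg) -> list[Msg]:
    """Greedy density policy: rank kept messages plus the newcomer by
    importance per token, then keep the best-ranked ones that fit.
    Low-density content gets evicted to make room for dense arrivals."""
    candidates = sorted(window + [new],
                        key=lambda m: m.importance / m.tokens,
                        reverse=True)
    kept, used = [], 0
    for m in candidates:
        if used + m.tokens <= BUDGET:
            kept.append(m)
            used += m.tokens
    return kept
```

Streaming every arrival through `admit` is a greedy approximation of the underlying knapsack problem: cheap enough to run per message, and it never lets a pile of pleasantries block a small, high-importance user turn.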

The interesting failure mode in this game is also the interesting failure mode in real life: noise accumulates into the context and crowds out the things that actually matter. "My model is ignoring my instructions" is often just "my instructions are no longer in the window".
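That failure mode is easy to reproduce with the most common default, a recency-only truncator. A toy sketch (real pipelines usually pin the system message precisely to avoid this; the numbers below are invented):

```python
def keep_most_recent(history: list[tuple[str, int]],
                     budget: int) -> list[tuple[str, int]]:
    """Naive truncation: walk backwards from the newest turn,
    keeping whatever fits the token budget. Nothing is pinned."""
    kept, used = [], 0
    for role, tokens in reversed(history):
        if used + tokens > budget:
            break
        kept.append((role, tokens))
        used += tokens
    return list(reversed(kept))

# A 300-token system prompt followed by ten 90-token chit-chat turns:
history = [("system", 300)] + [("noise", 90)] * 10
window = keep_most_recent(history, budget=800)
# Only the eight most recent noise turns fit; the system prompt is gone.
```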

In a line

Real-time triage game. Biased message pool (mostly noise), hard context cap, eviction-by-click. Score = importance-weighted retention. Models the problem every production RAG pipeline solves.

Other experiments

  1. Exp 001

    How a sentence becomes tokens

  2. Exp 002

    Temperature and top-p, visibly

  3. Exp 003

    What does this prompt actually cost?

  4. Exp 004

    Tokens per second

  5. Exp 005

    How far should the model think?

  6. Exp 006

    Neural language vs a Markov chain

  7. Exp 007

    What each token looks at

  8. Exp 008

    Words in space

  9. Exp 009

    The injection arena

  10. Exp 010

    AI or human?

  11. Exp 012

    Magnet flip