Experiment №006
filed Apr 21, 2026
explainer
Filed under
- #language-models
- #markov
- #n-gram
- #history
A neural language model vs a Markov chain
Train a word-level n-gram model in your browser on the fly. See how badly it loses to a real model on the same prompt.
❂ Primer
Skip if you already know the theory; the interactive is right below.
Before neural language models, we had statistical ones. An n-gram Markov chain counts how often each (n-1)-word prefix is followed by each possible next word, then samples from that distribution to generate new text. No gradients, no GPU, no parameters to optimize: just a big lookup table built from a single pass over a corpus.
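To make that concrete, here is a minimal sketch of a word-level n-gram generator in TypeScript. The names and structure are illustrative, not the demo's actual code: it builds a prefix-to-next-word count table, then repeatedly samples a successor.

```ts
// Hypothetical sketch of a word-level n-gram Markov chain.
// table: (n-1)-word prefix -> { next word -> count }
type CountTable = Map<string, Map<string, number>>;

function buildTable(text: string, n: number): CountTable {
  const words = text.trim().split(/\s+/);
  const table: CountTable = new Map();
  for (let i = 0; i + n <= words.length; i++) {
    const prefix = words.slice(i, i + n - 1).join(" ");
    const next = words[i + n - 1];
    const row = table.get(prefix) ?? new Map<string, number>();
    row.set(next, (row.get(next) ?? 0) + 1);
    table.set(prefix, row);
  }
  return table;
}

// Sample the next word in proportion to its observed count after `prefix`.
function sampleNext(table: CountTable, prefix: string): string | undefined {
  const row = table.get(prefix);
  if (!row) return undefined;
  const total = [...row.values()].reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  let last: string | undefined;
  for (const [word, count] of row) {
    last = word;
    r -= count;
    if (r <= 0) return word;
  }
  return last; // guard against floating-point drift
}

// Generate up to `length` words; the seed should supply at least n-1 words.
function generate(table: CountTable, seed: string[], n: number, length: number): string {
  const out = [...seed];
  while (out.length < length) {
    const prefix = out.slice(out.length - (n - 1)).join(" ");
    const next = sampleNext(table, prefix);
    if (!next) break; // dead end: this prefix never appeared in the corpus
    out.push(next);
  }
  return out.join(" ");
}
```

The whole "model" is `buildTable`'s output; generation is nothing but repeated table lookups and weighted coin flips.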
The left panel below is a bona fide Markov chain trained in your browser, live, on the selected corpus. The right panel shows what a frontier model produces on the same prompt. Same starting words, wildly different outputs.
▶ Try it
⁂ Notes from the bench
What to watch for, why it matters, and the one thing that usually surprises people.
What to watch
The n-gram slider
At n=2 (bigrams), the chain's output is fluent-sounding but semantically random — local coherence, no plot. Crank it to n=5 and the chain starts reproducing verbatim chunks of the corpus, because every 4-word prefix has only one successor in a short text. This is the fundamental n-gram tradeoff: longer memory means less creativity, approaching pure plagiarism in the limit.
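You can measure this tradeoff directly. A small helper (hypothetical, reusing `buildTable` from the sketch above) counts what fraction of prefixes have exactly one possible successor; on a short corpus that fraction climbs toward 1 as n grows, which is exactly when generation collapses into verbatim copying.

```ts
// Fraction of (n-1)-word prefixes with exactly one observed successor.
// Near 1.0 means the chain has almost no choices left and just replays the corpus.
function uniqueSuccessorFraction(text: string, n: number): number {
  const table = buildTable(text, n);
  if (table.size === 0) return 0;
  let unique = 0;
  for (const row of table.values()) {
    if (row.size === 1) unique++;
  }
  return unique / table.size;
}
```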
The real model
Notice how the right-panel output paraphrases the corpus's ideas without copying its sentences. It understands what the text was about and restates it in its own words. The Markov chain never does that, no matter how much text you feed it. Meaning isn't in the n-gram statistics.
The difference is meaning
For decades, "language model" meant a probability distribution over next words. Technically, neural language models are still that — the softmax at the output layer is the same math. The difference is what's inside. A transformer has hundreds of billions of parameters encoding representations of meaning, syntax, and world knowledge that get mixed together at every layer. The Markov model has one parameter per observed transition and nothing else.
This is also why prompt engineering works at all. You're nudging a rich internal representation toward a region of its output space. You can't do that with a Markov chain — there's no representation to nudge.
In a line
Client-side Markov chain trained live on a short corpus (Dante, Sherlock Holmes, RFC 2616), side-by-side with a precomputed frontier-LLM completion. n-gram slider controls memory depth.
Other experiments
- Exp 001
How a sentence becomes tokens
- Exp 002
Temperature and top-p, visibly
- Exp 003
What does this prompt actually cost?
- Exp 004
Tokens per second
- Exp 005
How far should the model think?
- Exp 007
What each token looks at
- Exp 008
Words in space
- Exp 009
The injection arena
- Exp 010
AI or human?
- Exp 011
Context Tetris
- Exp 012
Magnet flip