Experiment №006
filed Apr 21, 2026
explainer
Filed under
- #language-models
- #markov
- #n-gram
- #history
A neural language model vs a Markov chain
Train a word-level n-gram model in your browser on the fly. See how badly it loses to a real model on the same prompt.
❂ Primer
Skip if you already know the theory; the interactive is right below.
Before neural language models, we had statistical ones. An n-gram Markov chain counts how often each (n-1)-word prefix is followed by each possible next word, then samples from that distribution to generate new text. No gradients, no GPU, no parameters to optimize: just a big lookup table built from a single pass over a corpus.
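To make that concrete, here is a minimal sketch of a word-level n-gram generator in TypeScript. The names and structure are illustrative, not the demo's actual code: it builds a prefix-to-next-word count table, then repeatedly samples a successor.

```ts
// Hypothetical sketch of a word-level n-gram Markov chain.
// table: (n-1)-word prefix -> { next word -> count }
type CountTable = Map<string, Map<string, number>>;

function buildTable(text: string, n: number): CountTable {
  const words = text.trim().split(/\s+/);
  const table: CountTable = new Map();
  for (let i = 0; i + n <= words.length; i++) {
    const prefix = words.slice(i, i + n - 1).join(" ");
    const next = words[i + n - 1];
    const row = table.get(prefix) ?? new Map<string, number>();
    row.set(next, (row.get(next) ?? 0) + 1);
    table.set(prefix, row);
  }
  return table;
}

// Sample the next word in proportion to its observed count after `prefix`.
function sampleNext(table: CountTable, prefix: string): string | undefined {
  const row = table.get(prefix);
  if (!row) return undefined;
  const total = [...row.values()].reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  let last: string | undefined;
  for (const [word, count] of row) {
    last = word;
    r -= count;
    if (r <= 0) return word;
  }
  return last; // guard against floating-point drift
}

// Generate up to `length` words; the seed should supply at least n-1 words.
function generate(table: CountTable, seed: string[], n: number, length: number): string {
  const out = [...seed];
  while (out.length < length) {
    const prefix = out.slice(out.length - (n - 1)).join(" ");
    const next = sampleNext(table, prefix);
    if (!next) break; // dead end: this prefix never appeared in the corpus
    out.push(next);
  }
  return out.join(" ");
}
```

The whole "model" is `buildTable`'s output; generation is nothing but repeated table lookups and weighted coin flips.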
The left panel below is a bona fide Markov chain trained in your browser, live, on the selected corpus. The right panel shows what a frontier model produces on the same prompt. Same starting words, wildly different outputs.
▶ Try it
⁂ Notes from the bench
What to watch for, why it matters, and the one thing that usually surprises people.
What to watch
The n-gram slider
At n=2 (bigrams), the chain's output is fluent-sounding but semantically random — local coherence, no plot. Crank it to n=5 and the chain starts reproducing verbatim chunks of the corpus, because every 4-word prefix has only one successor in a short text. This is the fundamental n-gram tradeoff: longer memory means less creativity, approaching pure plagiarism in the limit.
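You can measure this tradeoff directly. A small helper (hypothetical, reusing `buildTable` from the sketch above) counts what fraction of prefixes have exactly one possible successor; on a short corpus that fraction climbs toward 1 as n grows, which is exactly when generation collapses into verbatim copying.

```ts
// Fraction of (n-1)-word prefixes with exactly one observed successor.
// Near 1.0 means the chain has almost no choices left and just replays the corpus.
function uniqueSuccessorFraction(text: string, n: number): number {
  const table = buildTable(text, n);
  if (table.size === 0) return 0;
  let unique = 0;
  for (const row of table.values()) {
    if (row.size === 1) unique++;
  }
  return unique / table.size;
}
```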
The real model
Notice how the right-panel output paraphrases the corpus's ideas without copying its sentences. It understands what the text was about and restates it in its own words. The Markov chain never does that, no matter how much text you feed it. Meaning isn't in the n-gram statistics.
The difference is meaning
For decades, "language model" meant a probability distribution over next words. Technically, neural language models are still that — the softmax at the output layer is the same math. The difference is what's inside. A transformer has hundreds of billions of parameters encoding representations of meaning, syntax, and world knowledge that get mixed together at every layer. The Markov model has one parameter per observed transition and nothing else.
This is also why prompt engineering works at all. You're nudging a rich internal representation toward a region of its output space. You can't do that with a Markov chain — there's no representation to nudge.
In a line
Client-side Markov chain trained live on a short corpus (Dante, Sherlock Holmes, RFC 2616), side-by-side with a precomputed frontier-LLM completion. n-gram slider controls memory depth.
Other experiments
- Exp 001
How a sentence becomes tokens
- Exp 002
Temperature and top-p, visibly
- Exp 003
What does this prompt actually cost?
- Exp 004
Tokens per second
- Exp 005
How far should the model think?
- Exp 007
What each token looks at
- Exp 008
Words in space
- Exp 009
The injection arena
- Exp 010
AI or human?
- Exp 011
Context Tetris
- Exp 012
Magnet flip