The Loop  ·  Issue 017

The Loop

A field journal of the AI frontier — for engineers who ship.

  Lab bench

Experiment №001
filed Apr 21, 2026

explainer

Filed under

  • #tokenization
  • #bpe
  • #gpt-4
  • #gpt-4o

How a sentence becomes tokens

Type something. Watch a real GPT tokenizer chop it into the units the model actually sees.

  Primer

Skip if you already know the theory; the interactive is right below.

Language models don't read characters and they don't read words. They read tokens — chunks of one-to-several characters, pulled from a fixed vocabulary of roughly 100,000 to 200,000 entries. Every prompt you send and every reply you get back is, under the hood, a list of integer IDs that index into that vocabulary.

The split isn't arbitrary. It comes from an algorithm called Byte-Pair Encoding (BPE), which greedily merges the most common character pairs in the training corpus into new tokens. Common English words like "the" and " model" (with the leading space) end up as a single token each. Rarer words get chopped into pieces. Emoji and non-Latin scripts often take several tokens per visible character.
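If you would rather poke at this from code than from the widget, the gist fits in a few lines. This is a minimal sketch with the gpt-tokenizer package (the same library the widget below runs on), assuming its cl100k_base entry point; the exact IDs and split points belong to the encoding, not to this snippet.

```ts
// Minimal sketch with gpt-tokenizer (cl100k_base entry point assumed).
import { encode, decode } from 'gpt-tokenizer/encoding/cl100k_base';

// A common English word, leading space included, is a single vocabulary entry;
// a rarer word comes back as several IDs, and an emoji often needs several too.
console.log(encode(' model'));
console.log(encode(' anthropomorphic'));
console.log(encode('🦊'));

// What the model actually receives is the list of integers.
const ids = encode('The quick brown fox jumps over the lazy dog.');
console.log(ids);         // ten IDs for this sentence under cl100k_base
console.log(decode(ids)); // and back to the original string
```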

The widget below runs the real BPE. No API call, no approximation — the same tables OpenAI ships with GPT-4 and GPT-4o, executed in your browser on every keystroke. What you see is exactly what those models would see.

▶  Try it

Encoder: cl100k_base (GPT-3.5 / GPT-4) · tokens 10 · chars 44 · chars/tok 4.40

Output · one pill per token (· = space; newlines and tabs get their own markers)

The | ·quick | ·brown | ·fox | ·jumps | ·over | ·the | ·lazy | ·dog | .

Raw token IDs: [791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679, 13]


Method

Tokens are computed in your browser using the gpt-tokenizer implementation of OpenAI's cl100k_base BPE. Nothing is sent anywhere — open the network tab and confirm.
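For the curious, the per-keystroke pipeline is small enough to sketch. This is a plausible reconstruction, not the widget's actual source; the tokenize helper is hypothetical, and the cl100k_base entry point is assumed.

```ts
// Rough reconstruction of what the widget does on every keystroke.
// Nothing leaves the browser; encode/decode run on local BPE tables.
import { encode, decode } from 'gpt-tokenizer/encoding/cl100k_base';

function tokenize(text: string) {
  const ids = encode(text);
  // One pill per token: decode each ID on its own and mark spaces with "·".
  const pills = ids.map((id) => decode([id]).replace(/ /g, '·'));
  return {
    ids,
    pills,
    tokens: ids.length,
    chars: text.length,
    charsPerToken: ids.length ? text.length / ids.length : 0,
  };
}

console.log(tokenize('The quick brown fox jumps over the lazy dog.'));
// tokens: 10, chars: 44, charsPerToken: 4.4 for this sentence
```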

  Notes from the bench

What to watch for, why it matters, and the one thing that usually surprises people.

Things worth poking at

A few experiments to run on the widget above. The point of each one is to make the tokenizer do something surprising enough that you remember it next time you're writing a prompt.

1. The leading space matters

Type "model" and note the token ID. Then type " model" with a leading space. You'll get a different token. BPE treats "start of a word" as part of the word itself — this is why prompt templates that concatenate strings without thinking about whitespace can quietly double the token count of your variable names.
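The same experiment in code, under the same gpt-tokenizer assumptions as earlier:

```ts
import { encode } from 'gpt-tokenizer/encoding/cl100k_base';

console.log(encode('model'));    // the word at the very start of a string
console.log(encode(' model'));   // a different ID: the space is part of the token
console.log(encode('my model')); // the second word picks up the " model" token
```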

2. Language matters a lot

Paste the English preset, note the chars-per-token ratio (usually 4–5). Now paste the Italian preset — it drops, but not catastrophically. Try Japanese or Arabic text if you have any handy: the ratio can collapse to around 1.5. Non-English speakers literally pay more per sentence.
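You can reproduce the ratio check outside the widget; the sample sentences here are mine, so the exact numbers will differ from the presets:

```ts
import { encode } from 'gpt-tokenizer/encoding/cl100k_base';

const samples: Record<string, string> = {
  English:  'The model reads tokens, not characters or words.',
  Italian:  'Il modello legge i token, non i caratteri o le parole.',
  Japanese: 'モデルは文字や単語ではなくトークンを読みます。',
};

for (const [lang, text] of Object.entries(samples)) {
  const ratio = text.length / encode(text).length;
  console.log(`${lang}: ${ratio.toFixed(2)} chars/token`);
}
// Expect roughly 4-5 for the English line and a much lower ratio for the Japanese one.
```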

3. cl100k vs o200k

Switch encoders with the toggle. cl100k_base is the GPT-3.5 / GPT-4 tokenizer from 2022. o200k_base is what ships with GPT-4o and the reasoning models — twice the vocabulary, and trained on a more multilingual corpus. The same prompt will usually produce fewer tokens under o200k, especially for code and non-English text. That's the silent reason newer models feel cheaper even when the per-token price hasn't moved.
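To see the gap directly, encode one string with both vocabularies. This assumes the per-encoding entry points that gpt-tokenizer exposes:

```ts
import { encode as cl100k } from 'gpt-tokenizer/encoding/cl100k_base';
import { encode as o200k } from 'gpt-tokenizer/encoding/o200k_base';

// Non-English on purpose: that's where the larger vocabulary tends to help most.
const prompt = 'Qual è il costo di questo prompt, una volta trasformato in token?';

console.log('cl100k_base:', cl100k(prompt).length, 'tokens');
console.log('o200k_base: ', o200k(prompt).length, 'tokens');
// The o200k count is usually the smaller of the two.
```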

4. Repetition doesn't compress

Try the "Repetition" preset. You might expect the three copies of the same long word to share tokens cleverly — they don't. BPE is a stateless vocabulary lookup; there's no run-length encoding, no deduplication, nothing. Every occurrence pays in full.
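A two-line check makes the point; the word is arbitrary:

```ts
import { encode } from 'gpt-tokenizer/encoding/cl100k_base';

const word = ' antidisestablishmentarianism';
console.log(encode(word).length);           // however many pieces one copy needs
console.log(encode(word.repeat(3)).length); // three times as many: no sharing between copies
```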

Three places this bites you

In descending order of how often you'll actually notice:

  • Pricing. You're billed per token, not per character. The chars/token ratio is the multiplier between what you typed and what you paid. Code and non-English text cost more per visible character than English prose.
  • Context windows. A model advertised as "200K context" is 200K tokens, which is maybe 150K words of English prose, maybe 60K words of Japanese, maybe 800 lines of minified JavaScript. Plan against tokens, not characters; a quick pre-flight count, like the sketch just after this list, is all it takes.
  • Prompt design. The model sees tokens — which is why JSON with weird whitespace, stringly-typed enums with unusual casing, and API names cobbled together from rare suffixes can all subtly confuse it. If a token boundary falls in a weird place, attention has to work harder to bridge it.
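The pre-flight check is a few lines of code; the budget and reserve values below are placeholders, not any particular model's real limits, and the helper name is mine:

```ts
import { encode } from 'gpt-tokenizer/encoding/o200k_base';

// Hypothetical budget check: the limits are illustrative, not a real model's specs.
function fitsInContext(prompt: string, contextWindow = 200_000, reserveForReply = 4_000) {
  const promptTokens = encode(prompt).length;
  return { promptTokens, fits: promptTokens + reserveForReply <= contextWindow };
}

console.log(fitsInContext('...the prompt you were about to send...'));
```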

None of this is a reason to tokenizer-hack every prompt. It's just useful to have a mental image of what the model is actually reading, so that when something strange happens, "the tokenizer chopped it weirdly" is one of the hypotheses you can check — in about ten seconds, on this page.

In a line

A live cl100k_base / o200k_base tokenizer. Pulls the same BPE tables OpenAI uses, runs entirely in your browser, and shows the exact token IDs your prompt would cost you.

Other experiments

  1. Exp 002 · Temperature and top-p, visibly
  2. Exp 003 · What does this prompt actually cost?
  3. Exp 004 · Tokens per second
  4. Exp 005 · How far should the model think?
  5. Exp 006 · Neural language vs a Markov chain
  6. Exp 007 · What each token looks at
  7. Exp 008 · Words in space
  8. Exp 009 · The injection arena
  9. Exp 010 · AI or human?
  10. Exp 011 · Context Tetris
  11. Exp 012 · Magnet flip