How a sentence becomes tokens
Type something. Watch a real GPT tokenizer chop it into the units the model actually sees.
❂ Primer
Skip if you already know the theory; the interactive is right below.
Language models don't read characters, and they don't read words. They read tokens: chunks of one to several characters drawn from a fixed vocabulary of roughly 100,000 to 200,000 entries. Every prompt you send and every reply you get back is, under the hood, a list of integer IDs that index into that vocabulary.
The split isn't arbitrary. It comes from an algorithm called Byte-Pair Encoding (BPE), which greedily merges the most frequent adjacent character pairs in the training corpus into new tokens. Common English words like "the" and " model" (with its leading space) end up as a single token each. Rarer words get chopped into pieces. Emoji and non-Latin scripts often take several tokens per visible character.
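The merge loop is small enough to sketch. The toy trainer below runs on a three-word corpus of character tuples; it is not OpenAI's actual byte-level implementation (which works on UTF-8 bytes and ships precomputed tables), but the greedy pair-merging is the same idea:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair.

    `corpus` maps each word (a tuple of symbols) to its frequency.
    Returns the learned merges and the final segmentation.
    """
    vocab = dict(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for word, count in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + count
        vocab = new_vocab
    return merges, vocab

# " the" (with its leading space) is frequent, so it fuses into one unit fast.
corpus = {tuple("the"): 5, tuple(" the"): 8, tuple("them"): 2}
merges, vocab = bpe_merges(corpus, 4)
# merges: [('t', 'h'), ('th', 'e'), (' ', 'the'), ('the', 'm')]
```

Note that the space merges into " the" just like any other character, which is the behaviour experiment 1 below pokes at.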
The widget below runs the real BPE. No API call, no approximation — the same tables OpenAI ships with GPT-4 and GPT-4o, executed in your browser on every keystroke. What you see is exactly what those models would see.
▶ Try it
Output · one pill per token
· = space  ⏎ = newline  → = tab
Show raw token IDs
[791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679, 13]
Field note
Type something above to see a note here.
Method
Tokens are computed in your browser using the gpt-tokenizer implementation of OpenAI's cl100k_base BPE. Nothing is sent anywhere — open the network tab and confirm.
⁂ Notes from the bench
What to watch for, why it matters, and the one thing that usually surprises people.
Things worth poking at
A few experiments to run on the widget above. The point of each one is to make the tokenizer do something surprising enough that you remember it next time you're writing a prompt.
1. The leading space matters
Type model and note the token ID. Then type model with a leading space. You'll get a different token. BPE treats "start of a word" as part of the word itself — this is why prompt templates that concatenate strings without thinking about whitespace can quietly double the token count of your variable names.
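You can see why with a hand-rolled lookup. The mini-vocabulary below is hypothetical (the real cl100k table has ~100,000 entries and different IDs), but it shows the point: the space lives inside the token, not between tokens.

```python
# Hypothetical mini-vocabulary; real IDs differ, the principle doesn't.
VOCAB = {" model": 1, "model": 2, " the": 3, "the": 4,
         " ": 5, "m": 6, "o": 7, "d": 8, "e": 9, "l": 10, "t": 11, "h": 12}

def encode(text):
    """Greedy longest-match lookup, standing in for the real BPE merge order."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
    return ids

encode("model")       # [2]  one token
encode(" model")      # [1]  a *different* single token: space included
encode("the model")   # [4, 1]
```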
2. Language matters a lot
Paste the English preset, note the chars-per-token ratio (usually 4–5). Now paste the Italian preset — it drops, but not catastrophically. Try Japanese or Arabic text if you have any handy: the ratio can collapse to around 1.5. Non-English speakers literally pay more per sentence.
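The ratio itself is just characters divided by tokens. Using the ten-token sample output shown earlier:

```python
def chars_per_token(text, token_ids):
    """Visible characters per token: the multiplier between typing and paying."""
    return len(text) / len(token_ids)

sentence = "The quick brown fox jumps over the lazy dog."
ids = [791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679, 13]
chars_per_token(sentence, ids)   # 4.4, typical for English prose
```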
3. cl100k vs o200k
Switch encoders with the toggle. cl100k_base is the GPT-3.5 / GPT-4 tokenizer from 2022. o200k_base is what ships with GPT-4o and the reasoning models — twice the vocabulary, and trained on a more multilingual corpus. The same prompt will usually produce fewer tokens under o200k, especially for code and non-English text. That's the silent reason newer models feel cheaper even when the per-token price hasn't moved.
4. Repetition doesn't compress
Try the "Repetition" preset. You might expect the three copies of the same long word to share tokens cleverly — they don't. BPE is a stateless vocabulary lookup; there's no run-length encoding, no deduplication, nothing. Every occurrence pays in full.
Three places this bites you
In descending order of how often you'll actually notice:
- Pricing. You're billed per token, not per character. The chars/token ratio is the multiplier between what you typed and what you paid. Code and non-English text cost more per visible character than English prose.
- Context windows. A model advertised as "200K context" is 200K tokens, which is maybe 150K words of English prose, maybe 60K words of Japanese, maybe 800 lines of minified JavaScript. Plan against tokens, not characters.
- Prompt design. The model sees tokens — which is why JSON with weird whitespace, stringly-typed enums with unusual casing, and API names cobbled together from rare suffixes can all subtly confuse it. If a token boundary falls in a weird place, attention has to work harder to bridge it.
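The arithmetic behind all three is the same back-of-envelope estimate. The ratios and the price below are illustrative assumptions, not quotes from any provider:

```python
# Rough empirical chars-per-token ratios (see experiment 2 above).
CHARS_PER_TOKEN = {"english": 4.5, "code": 3.0, "japanese": 1.5}

def estimate_tokens(n_chars, kind="english"):
    return round(n_chars / CHARS_PER_TOKEN[kind])

def estimate_cost_usd(n_tokens, usd_per_million_tokens):
    return n_tokens * usd_per_million_tokens / 1_000_000

# The same 90,000-character document:
estimate_tokens(90_000, "english")    # 20000 tokens
estimate_tokens(90_000, "japanese")   # 60000 tokens: 3x the bill
estimate_cost_usd(60_000, 2.50)       # 0.15 (at a hypothetical $2.50/M tokens)
```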
None of this is a reason to tokenizer-hack every prompt. It's just useful to have a mental image of what the model is actually reading, so that when something strange happens, "the tokenizer chopped it weirdly" is one of the hypotheses you can check — in about ten seconds, on this page.
In a line
A live cl100k_base / o200k_base tokenizer. Pulls the same BPE tables OpenAI uses, runs entirely in your browser, and shows the exact token IDs your prompt would cost you.
Other experiments
- Exp 002
Temperature and top-p, visibly
- Exp 003
What does this prompt actually cost?
- Exp 004
Tokens per second
- Exp 005
How far should the model think?
- Exp 006
Neural language vs a Markov chain
- Exp 007
What each token looks at
- Exp 008
Words in space
- Exp 009
The injection arena
- Exp 010
AI or human?
- Exp 011
Context Tetris
- Exp 012
Magnet flip