Tokenisation

Tokenisation is the process of converting raw text into a sequence of integer IDs (tokens) that a neural network can process. It is a prerequisite step before both training and inference.

Why tokenisation is necessary

Neural networks require a finite vocabulary of discrete symbols. Raw text is a sequence of Unicode characters, typically encoded as UTF-8 bytes, which are ultimately just zeros and ones. Feeding the model raw bits would mean a vocabulary of only two symbols (0 and 1) but an extremely long sequence, which is impractical because sequence length is a scarce resource.

The design problem: trade off vocabulary size against sequence length. Larger vocabulary → shorter sequences; smaller vocabulary → longer sequences.
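The trade-off can be made concrete by counting symbols for the same text under different vocabularies. A minimal sketch, assuming an arbitrary example sentence; the whitespace split is only a rough stand-in for a very large vocabulary, not a real tokeniser.

```python
# Illustrative sentence (an assumption, any text would do).
text = "Tokenisation trades vocabulary size for sequence length."

as_bytes = text.encode("utf-8")                    # vocabulary of 256 byte values
as_bits = "".join(f"{b:08b}" for b in as_bytes)    # vocabulary of 2 symbols
as_words = text.split()                            # crude proxy for a huge vocabulary

print("bits: ", len(as_bits), "symbols")
print("bytes:", len(as_bytes), "symbols")
print("words:", len(as_words), "symbols")
```

The bit sequence is always eight times longer than the byte sequence, and the word-level sequence is far shorter than both, at the cost of a vocabulary that would have to cover every possible word.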

Byte-pair encoding (BPE)

Most state-of-the-art language models build their vocabulary with byte-pair encoding (BPE). The algorithm:

Start with 256 symbols (one per possible byte value).
Find the most frequent consecutive pair of symbols in the corpus.
Mint a new symbol with the next available ID and replace every occurrence of that pair.
Repeat until the vocabulary reaches the desired size.

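GPT-4's tokeniser has a vocabulary of 100,277 tokens. Each BPE iteration grows the vocabulary by exactly one symbol and shortens the tokenised corpus, because every occurrence of the merged pair collapses from two symbols to one.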
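The BPE steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any production tokeniser: the function names (count_pairs, merge, train_bpe) and the tiny corpus are hypothetical, and real implementations add details such as pre-splitting text and special tokens.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of symbols occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary holds `vocab_size` symbols."""
    ids = list(text.encode("utf-8"))   # start from the 256 byte values
    merges = {}                        # (left, right) -> new symbol ID
    next_id = 256
    while next_id < vocab_size:
        pairs = count_pairs(ids)
        if not pairs:                  # nothing left to merge
            break
        pair, _ = pairs.most_common(1)[0]
        ids = merge(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return ids, merges


# Hypothetical toy corpus; three merges take the vocabulary from 256 to 259.
toy_corpus = "aaabdaaabac"
ids, merges = train_bpe(toy_corpus, vocab_size=259)
print(len(toy_corpus.encode("utf-8")), "symbols before,", len(ids), "after")
print(merges)
```

Each pass through the loop adds one entry to merges and never lengthens the sequence, which is the trade described above: a slightly larger vocabulary in exchange for a shorter sequence.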

Consequences for model behaviour

Tokenisation creates a gap between how humans perceive text and how the model perceives it:

“Ubiquitous” is one word to a human; 3 tokens to the model.
Case matters: "hello world" and "Hello World" tokenise differently.
Spaces matter: "hello world" (one space) vs "hello  world" (two spaces) produce different token sequences.
Spelling tasks are hard. The model never sees individual letters, only tokens. “How many R’s in strawberry?” was famously answered wrongly by early models because the model sees chunks such as str and awberry rather than individual characters, and cannot count letters reliably without extra reasoning steps or tools.
Arithmetic is hard. Numbers tokenise inconsistently (e.g., depending on the tokeniser, 9.11 and 9.9 may each be a single token or be split into several pieces). The model can’t reliably compare them by reasoning on tokens; a Python interpreter is more reliable.
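The case and spacing effects are visible even before any merges happen, in the raw bytes that byte-level BPE starts from. A small sketch using only the standard library; it prints UTF-8 byte values, not the tokens of any real model, and byte_ids is a hypothetical helper name.

```python
def byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values that byte-level BPE starts from."""
    return list(text.encode("utf-8"))


lower = byte_ids("hello world")
title = byte_ids("Hello World")
two_spaces = byte_ids("hello  world")

print(lower)
print(title)        # differs wherever a letter changed case
print(two_spaces)   # one extra byte (32) for the second space
```

Because the starting sequences already differ, the merges applied on top of them generally differ too, which is why text that looks nearly identical to a human can become quite different token sequences.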

Practical implication

For tasks that require character-level reasoning (spelling, counting letters, formatting strings character-by-character), tell the model to use code rather than reason directly. The Python interpreter operates on actual characters, not tokens.
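As an illustration, the two problem cases mentioned earlier become trivial once they are handed to code. A minimal sketch; the variable names are illustrative, and Decimal is used here only to keep the comparison exact.

```python
from decimal import Decimal

# Counting letters: Python iterates over actual characters, not tokens.
word = "strawberry"
r_count = word.lower().count("r")
print(f"'r' appears {r_count} times in {word!r}")

# Comparing numbers: parse the strings into numeric values first.
a = Decimal("9.11")
b = Decimal("9.9")
print(f"{a} < {b}: {a < b}")
```

In a prompt, this amounts to asking the model to write and run a snippet like this instead of answering from its token-level view of the text.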
