Tokenisation
Tokenisation is the process of converting raw text into a sequence of integer IDs (tokens) that a neural network can process. It is a prerequisite step before both training and inference.
Why tokenisation is necessary
Neural networks require a finite vocabulary of discrete symbols. Raw text is a sequence of Unicode characters encoded as UTF-8 bytes (just zeros and ones). Two symbols (0 and 1) with an extremely long sequence is impractical — sequence length is a scarce resource.
The design problem: trade off vocabulary size against sequence length. Larger vocabulary → shorter sequences; smaller vocabulary → longer sequences.
Byte-pair encoding (BPE)
State-of-the-art models use byte-pair encoding (BPE):
- Start with 256 symbols (one per possible byte value).
- Find the most frequent consecutive pair of symbols in the corpus.
- Mint a new symbol with the next available ID and replace every occurrence of that pair.
- Repeat until vocabulary reaches the desired size.
GPT-4 uses 100,277 tokens. Each iteration of BPE reduces sequence length and grows the vocabulary by one.
Consequences for model behaviour
Tokenisation creates a gap between how humans perceive text and how the model perceives it:
- “Ubiquitous” is one word to a human; 3 tokens to the model.
- Case matters:
hello worldandHello Worldtokenise differently. - Spaces matter:
hello world(one space) vshello world(two spaces) produce different token sequences. - Spelling tasks are hard. The model never sees individual letters — only tokens. “How many R’s in strawberry?” was famously wrong in early models because the model sees
str,awberry(not individual characters) and cannot count without additional reasoning tools. - Arithmetic is hard. Numbers tokenise inconsistently (e.g.,
9.11and9.9may be single tokens). The model can’t reliably compare them by reasoning on tokens; a Python interpreter is more reliable.
Practical implication
For tasks that require character-level reasoning (spelling, counting letters, formatting strings character-by-character), tell the model to use code rather than reason directly. The Python interpreter operates on actual characters, not tokens.