Tokenisation

Tokenisation is the process of converting raw text into a sequence of integer IDs (tokens) that a neural network can process. It is a prerequisite step before both training and inference.


Why tokenisation is necessary

Neural networks operate on a finite vocabulary of discrete symbols. Raw text is a sequence of Unicode characters, typically stored as UTF-8 bytes, which are ultimately just bits. Feeding bits directly would give a vocabulary of only two symbols (0 and 1) but extremely long sequences, which is impractical because sequence length is a scarce resource.

The design problem: trade off vocabulary size against sequence length. Larger vocabulary → shorter sequences; smaller vocabulary → longer sequences.
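
As a rough illustration of this trade-off, the sketch below measures how long the same text becomes at the character, byte, and bit level. The helper name sequence_lengths is hypothetical and used only for this example.

```python
def sequence_lengths(text: str) -> dict[str, int]:
    """Return the length of `text` under three different 'vocabularies'."""
    data = text.encode("utf-8")  # vocabulary of 256 possible byte values
    return {
        "characters": len(text),  # Unicode code points
        "bytes": len(data),       # byte-level sequence length
        "bits": len(data) * 8,    # vocabulary of just 2 symbols
    }


print(sequence_lengths("hello world"))
# {'characters': 11, 'bytes': 11, 'bits': 88}
```

The smaller the vocabulary, the longer the sequence: the two-symbol representation is eight times longer than the byte-level one.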


Byte-pair encoding (BPE)

Many state-of-the-art language models build their vocabulary with byte-pair encoding (BPE). At the byte level, the algorithm is:

  1. Start with 256 symbols (one per possible byte value).
  2. Find the most frequent consecutive pair of symbols in the corpus.
  3. Mint a new symbol with the next available ID and replace every occurrence of that pair with it.
  4. Repeat until the vocabulary reaches the desired size.

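Each iteration grows the vocabulary by exactly one symbol and shortens the corpus by one position for every occurrence of the merged pair. GPT-4's tokeniser, for example, has a vocabulary of 100,277 tokens.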
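
The four steps above can be sketched in a few lines of Python. This is a minimal, unoptimised illustration, not the implementation used by any production tokeniser. The function names (most_frequent_pair, merge, train_bpe, decode) are hypothetical, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Step 2: return the most frequent consecutive pair, or None."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Step 3: replace every non-overlapping occurrence of `pair`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Steps 1-4: start from raw bytes and merge until vocab_size."""
    ids = list(text.encode("utf-8"))      # step 1: 256 base symbols
    merges = {}                           # (left, right) -> new symbol ID
    for new_id in range(256, vocab_size):
        pair = most_frequent_pair(ids)
        if pair is None:                  # fewer than two symbols remain
            break
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand token IDs back into text using the learned merges."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # in merge order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


text = "aaabdaaabac"
ids, merges = train_bpe(text, vocab_size=259)  # 256 bytes + 3 merges
print(len(text.encode("utf-8")), "->", len(ids))
print(decode(ids, merges) == text)  # True
```

Each pass through the loop adds one entry to merges and shortens ids, which is the vocabulary-versus-length trade-off in miniature.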


Consequences for model behaviour

Tokenisation creates a gap between how humans perceive text and how the model perceives it:

  • “Ubiquitous” is one word to a human; 3 tokens to the model.
  • Case matters: hello world and Hello World tokenise differently.
  • Spaces matter: “hello world” (one space) and “hello  world” (two spaces) produce different token sequences.
  • Spelling tasks are hard. The model never sees individual letters, only tokens. “How many R’s in strawberry?” was famously answered incorrectly by early models because the model sees chunks such as str and awberry rather than individual characters, and cannot count letters without additional reasoning or tools.
  • Arithmetic is hard. Numbers tokenise inconsistently (e.g., 9.11 and 9.9 may each be a single token or be split differently), so the model can’t reliably compare them by reasoning over tokens; a Python interpreter is more reliable.

Practical implication

For tasks that require character-level reasoning (spelling, counting letters, formatting strings character-by-character), instruct the model to write and run code rather than reason over tokens directly. The Python interpreter operates on actual characters, not tokens.
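
As an illustration, the kind of code a model might be asked to run for such tasks could look like the sketch below. The helper names count_letter and larger_number are hypothetical. The decimal module from the standard library is used here so that the numbers are compared exactly as written.

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Count occurrences of `letter` in `word`, ignoring case."""
    return word.lower().count(letter.lower())


def larger_number(a: str, b: str) -> str:
    """Return whichever decimal string represents the larger value."""
    return a if Decimal(a) > Decimal(b) else b


print(count_letter("strawberry", "R"))  # 3
print(larger_number("9.11", "9.9"))     # 9.9
```

Both answers come from operations on real characters and numeric values, so they do not depend on how the text happens to be tokenised.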


See also