Transformers

The Transformer is the neural network architecture underlying all modern large language models. Introduced in “Attention Is All You Need” (Vaswani et al., 2017).

What it does

A Transformer is a fixed function from input token sequences to a probability distribution over the next token. It applies a sequence of mathematical operations — matrix multiplications, softmax functions, layer normalisations, attention mechanisms — to transform the input token embeddings into output probabilities.

There is no memory, no state. Each call is stateless: given the same token sequence, the output distribution is the same (before sampling). The context window is the only mechanism through which prior conversation is visible to the model.

Internal structure

Tokens are first converted to vector representations (embeddings), then passed through alternating layers:

Attention blocks — allow tokens to attend to (look up and aggregate) information from other positions in the sequence.
MLP blocks — apply learned transformations to each position independently.
Layer normalisations — stabilise training.

The output of the final layer is mapped to a probability distribution over the vocabulary (softmax).

A typical production example might have ~85,000 parameters in a small demonstration; production models have billions to trillions.

What we understand and don’t

We understand the Transformer completely at the level of mathematical operations. We do not understand what the billions of parameters are doing when they collaborate to produce a response.

The reversal curse illustrates the opacity: GPT-4 correctly answers “Who is Tom Cruise’s mother?” (Mary Lee Pfeiffer) but fails “Who is Mary Lee Pfeiffer’s son?” Knowledge appears to be one-directional — stored from a particular angle.

The field of mechanistic interpretability attempts to reverse-engineer what individual circuits within the network compute. It can do this to some degree; a complete picture is far off.

Design principles

Transformers were designed with three properties in mind:

Expressiveness — can represent a wide class of functions over sequences.
Optimisability — gradients flow cleanly during training.
Parallelisability — attention and MLP operations can be computed simultaneously across the sequence, exploiting GPU hardware.

The parallelism over positions is what makes training on GPUs efficient. Each token in a batch contributes a training signal simultaneously.

Transformers

What it does

Internal structure

What we understand and don’t

Design principles

See also