Large Language Models

A large language model (LLM) is a neural network trained to predict the next token in a sequence, trained on vast quantities of text. At inference time it generates text by sampling one token at a time.

What an LLM actually is

At the most literal level, an LLM is two files: a parameters file (the trained weights of the neural network) and a run file (the code to execute those weights). Llama 2 70B, for instance, is a 140 GB parameters file and roughly 500 lines of C. It runs offline on a laptop. The complexity lies in obtaining the parameters, not in running them.

The parameters encode a lossy compression of the training corpus — roughly 100× compression. They store statistical patterns, not discrete facts. Recall is probabilistic and vague, like remembering something you read months ago rather than looking it up.

How LLMs are built

Three sequential stages:

1. Pretraining

See Pretraining. A large internet corpus (tens of terabytes) is tokenised and used to train the neural network to predict the next token. The output is a base model — an internet document simulator. Training costs millions of dollars and takes months. The base model has no assistant behaviour; it continues whatever text is in front of it.

2. Supervised fine-tuning (SFT)

See Fine-Tuning. The dataset switches from internet text to human-labelled conversations — hundreds of thousands of Q&A examples written by contractors following detailed labelling instructions. The same training algorithm runs for a short, cheap continuation. The result: an assistant model that follows the Q&A format and reflects the persona specified in the instructions.

3. Reinforcement learning

See Reinforcement Learning from Human Feedback. In verifiable domains (maths, code), the model generates many candidate solutions, scores them against correct answers, and trains on the successful ones. In unverifiable domains, human preferences substitute as a reward signal. RL discovers reasoning strategies the model’s own computation budget makes efficient — including chain-of-thought patterns that emerge without being programmed.

Cognitive profile

LLMs are not databases. Key properties:

Probabilistic recall. Things mentioned often in training are recalled more reliably; rare things vaguely or not at all.
No persistent memory. Each inference call is stateless; the Context Window is the only working memory.
Jagged Intelligence. Brilliant across most domains; arbitrary gaps where capability has not been developed or RL environments have not been built.
Hallucination. Models confabulate confidently when asked about things outside their reliable knowledge. Mitigation: refusal training; Tool Use.
Tokens to think. Each token receives a fixed compute budget. Complex reasoning must be distributed across many tokens — hence chain-of-thought prompting.

The LLM ecosystem (as of 2025)

Proprietary models (API-only): GPT series (OpenAI), Claude series (Anthropic), Gemini series (Google), Grok (xAI). Open-weights models: Llama series (Meta), DeepSeek (DeepSeek AI), Mistral (Mistral AI). Leaderboards: Chatbot Arena (lmarena.ai) for blind human comparisons; SEAL leaderboard (Scale AI) for benchmark evaluations.