Pretraining

Pretraining is the first and most expensive stage of building a large language model. A large internet corpus is tokenised and used to train a neural network to predict the next token in each sequence. The output is a base model.

The data pipeline

Starting point: Common Crawl, an organisation that has crawled the public internet since 2007 (2.7 billion web pages as of 2024). Raw crawl data goes through:

URL filtering: block lists for spam, malware, adult content.
Text extraction: stripping HTML, CSS, and navigation so only the page's readable text remains.
Language filtering: e.g., FineWeb keeps pages that a language classifier scores as English with a score of at least 0.65. This is a design choice: filter a language out and the model will be weak in it.
Deduplication: dropping exact and near-duplicate pages.
PII removal: addresses, social security numbers, etc.
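
The stages above can be sketched as a toy pipeline. This is a minimal illustration, not FineWeb's actual code: the block list, the SSN pattern, and especially english_score (a crude ASCII-letter ratio standing in for a real language classifier) are assumptions made for the example, and deduplication here is exact-match only.

```python
import hashlib
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical block list; real pipelines use large curated lists.
BLOCKED_DOMAINS = {"spam.example", "malware.example"}

# Deliberately narrow pattern for US social security numbers (illustrative only).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <nav> content."""

    SKIPPED_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def english_score(text):
    """Crude stand-in for a language classifier: share of letters that are ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isascii() for ch in letters) / len(letters)


def clean_crawl(pages, min_english=0.65):
    """Apply the five stages in order to (url, html) pairs."""
    seen_hashes = set()
    kept = []
    for url, html in pages:
        host = urlparse(url).hostname or ""
        if host in BLOCKED_DOMAINS:                      # URL filtering
            continue
        text = extract_text(html)                        # text extraction
        if english_score(text) < min_english:            # language filtering
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                        # exact deduplication
            continue
        seen_hashes.add(digest)
        kept.append(SSN_PATTERN.sub("[REDACTED]", text))  # PII removal
    return kept


pages = [
    ("https://spam.example/a", "<p>Buy now</p>"),
    ("https://news.example/1",
     "<html><style>p{}</style><nav>Home</nav><p>My SSN is 123-45-6789.</p></html>"),
    ("https://mirror.example/1", "<p>My SSN is 123-45-6789.</p>"),
    ("https://ru.example/1", "<p>Привет, мир</p>"),
]
print(clean_crawl(pages))  # ['My SSN is [REDACTED].']
```

Only one of the four pages survives: one is blocked by URL, one is a duplicate after extraction, and one fails the language threshold.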

Result: FineWeb (Hugging Face’s public example) is ~44 TB of text, small enough to fit on a handful of consumer hard drives. Every major lab has an internal equivalent.

Training objective

The model learns to predict the next token. Feed a window of tokens as context; the network outputs a probability distribution over all ~100,000 possible next tokens. Compare to the actual next token; adjust parameters to increase the probability of the correct token. Repeat across billions of windows.
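
One update of this kind can be sketched in a few lines. This is a simplified illustration under stated assumptions: the four-word vocabulary is invented, and the logits are adjusted directly as if they were the parameters, whereas in a real network the logits are computed from the context window and the gradient flows back into the weights.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Loss for one prediction: negative log-probability of the correct token."""
    return -math.log(probs[target])


def gradient_step(logits, target, learning_rate=1.0):
    """One gradient-descent step on the logits.

    For softmax followed by cross-entropy, the gradient with respect to
    the logits is (probs - one_hot(target)).
    """
    probs = softmax(logits)
    return [
        x - learning_rate * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary; real ones have ~100,000 entries
logits = [0.0, 0.0, 0.0, 0.0]         # untrained: every token equally likely
target = vocab.index("sat")           # the token that actually came next

prob_before = softmax(logits)[target]
loss_before = cross_entropy(softmax(logits), target)

logits = gradient_step(logits, target)

prob_after = softmax(logits)[target]
loss_after = cross_entropy(softmax(logits), target)

print(f"p(correct) {prob_before:.3f} -> {prob_after:.3f}")
print(f"loss       {loss_before:.3f} -> {loss_after:.3f}")
```

The probability of the correct token rises and the loss falls after the step, which is the whole training loop in miniature.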

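The key metric is loss: a single number that is high when the model assigns low probability to the correct next tokens. Lower is better. Loss generally falls over the course of training, and the quality of generated text improves with it.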

Cost and scale

Llama 2 70B (Meta, 2023): ~10 TB text, ~6,000 GPUs, ~12 days, ~$2M. State-of-the-art models (GPT-4, Claude 3) are estimated to exceed these numbers by 10× or more. Training a frontier model is now a $50–100M+ investment.
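
As a back-of-envelope check, the Llama 2 70B figures can be turned into GPU-hours. The inputs are the rounded numbers quoted above, so the implied hourly rate is only a consequence of those estimates, not a quoted price. The 10× multiplier is the lower bound from the text.

```python
# Rounded figures from the notes above (all approximate).
gpus = 6_000
days = 12
cost_usd = 2_000_000

gpu_hours = gpus * days * 24             # total GPU-hours for the run
usd_per_gpu_hour = cost_usd / gpu_hours  # implied rate, given the rounded inputs

frontier_multiplier = 10                 # "10x or more" per the text
frontier_gpu_hours = gpu_hours * frontier_multiplier

print(f"{gpu_hours:,} GPU-hours")                      # 1,728,000 GPU-hours
print(f"${usd_per_gpu_hour:.2f} per GPU-hour")         # $1.16 per GPU-hour
print(f"{frontier_gpu_hours:,} GPU-hours at 10x")      # 17,280,000 GPU-hours at 10x
```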

GPT-2 (OpenAI, 2019): an estimated ~$40,000 to train at the time. Reproducible today for ~$600.

Output: the base model

The base model is an internet document simulator. Given a prompt, it continues the text in whatever style is statistically plausible — academic papers, code listings, Amazon product reviews. It hallucinates freely. It has no assistant behaviour.
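
The "document simulator" behaviour can be illustrated with a toy stand-in. The sketch below is an assumption-laden miniature, not how an LLM is built: a table of bigram counts replaces the neural network, it conditions on only one previous token instead of a long window, and the corpus is invented. What it shares with a base model is the generation loop: sample the next token from the learned distribution, append it, repeat.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_text(counts, prompt, max_new_tokens, rng):
    """Extend a non-empty prompt by sampling each next token in proportion to its count."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:  # token never seen with a successor: stop
            break
        candidates, weights = zip(*followers.items())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return out


corpus = "the cat sat on the mat . the cat ate the fish .".split()
model = train_bigram(corpus)

# "cat" follows "the" twice, "mat" and "fish" once each, so "cat" is the
# most likely continuation: frequent patterns are reproduced more reliably.
print(dict(model["the"]))

rng = random.Random(0)  # seeded so repeated runs give the same sample
print(" ".join(continue_text(model, ["the"], 8, rng)))
```

The continuation is statistically plausible rather than true: the toy model will happily produce sentences that never appeared in the corpus, which is a small-scale analogue of hallucination.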

Key properties:

Knowledge is stored probabilistically, not as discrete facts.
Things mentioned often in training are recalled more reliably.
The model has no knowledge of events after its training cutoff.
High-quality sources (Wikipedia) are oversampled and can be recited nearly verbatim; rare sources are recalled vaguely.

Relationship to post-training