Pretraining
Pretraining is the first and most expensive stage of building a large language model. A large internet corpus is tokenised and used to train a neural network to predict the next token in each sequence. The output is a base model.
The data pipeline
Starting point: Common Crawl, an organisation that has crawled the public internet since 2007 (2.7 billion web pages as of 2024). Raw crawl data goes through:
- URL filtering — block lists for spam, malware, adult content.
- Text extraction — stripping HTML, CSS, and navigation.
- Language filtering — e.g., FineWeb keeps only pages that a language classifier scores as English with a confidence of at least 0.65. This is a design choice: filter a language out and the model will be weak in it.
- Deduplication.
- PII removal — addresses, social security numbers, etc.
Result: FineWeb (Hugging Face’s public example) is ~44 TB of text, small enough to fit on a few commodity hard drives. Every major lab has an internal equivalent.
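The pipeline stages above can be sketched in a few lines of standard-library Python. This is a toy illustration, not FineWeb's actual code: the block list, the `english_score` heuristic (share of ASCII letters, standing in for a real language classifier), the exact-match deduplication, and the single SSN regex are all simplifying assumptions chosen to keep the example self-contained.

```python
import hashlib
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example", "malware.example"}  # hypothetical block list
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")          # US SSN pattern only


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <nav> content."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def url_allowed(url):
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def extract_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    return " ".join(" ".join(parser.parts).split())


def english_score(text):
    # Crude stand-in for a trained classifier: fraction of letters that are ASCII.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)


def clean_corpus(pages, min_english=0.65):
    """pages: iterable of (url, html) pairs. Returns a list of cleaned texts."""
    seen = set()
    cleaned = []
    for url, html in pages:
        if not url_allowed(url):                      # 1. URL filtering
            continue
        text = extract_text(html)                     # 2. text extraction
        if not text or english_score(text) < min_english:  # 3. language filter
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:                            # 4. exact deduplication
            continue
        seen.add(digest)
        cleaned.append(SSN_RE.sub("[REDACTED]", text))  # 5. PII removal
    return cleaned
```

Production pipelines typically replace each step with something much heavier (maintained block lists, trained language classifiers, fuzzy deduplication), but the order of operations is the same as in the list above.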
Training objective
The model learns to predict the next token. Feed a window of tokens as context; the network outputs a probability distribution over all ~100,000 possible next tokens. Compare to the actual next token; adjust parameters to increase the probability of the correct token. Repeat across billions of windows.
The key metric is loss: the cross-entropy between the predicted distribution and the actual next token, so lower is better. Loss falls as training proceeds, and the quality of generated text improves with it.
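A minimal sketch of the objective for a single prediction, assuming the usual cross-entropy loss over a softmax. The four-token vocabulary and the function names are hypothetical. For simplicity the update is applied directly to the output scores (logits), whereas a real model backpropagates the same gradient into its network parameters.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target):
    # Cross-entropy: negative log-probability assigned to the correct token.
    return -math.log(softmax(logits)[target])


def gradient_step(logits, target, lr=0.5):
    # The gradient of the loss w.r.t. the logits is (probs - one_hot(target)).
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0, 0.0]   # toy vocabulary of 4 tokens, uniform prediction
target = 2                      # index of the token that actually came next
before = next_token_loss(logits, target)   # ln(4), about 1.386
after = next_token_loss(gradient_step(logits, target), target)
print(before > after)           # the step raised the correct token's probability
```

Averaging this loss over billions of context windows gives the training curve that is tracked during pretraining.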
Cost and scale
Llama 2 70B (Meta, 2023): ~10 TB text, ~6,000 GPUs, ~12 days, ~$2M. State-of-the-art models (GPT-4, Claude 3) exceed these figures by 10× or more. Training a frontier model is now a $50–100M+ investment.
GPT-2 (OpenAI, 2019): ~$40,000 in 2019. Reproducible today for ~$600.
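As a rough sanity check, the Llama 2 70B figures above can be turned into GPU-hours and an implied hourly rate. The helper names are hypothetical, and the rate is only what the rounded numbers in these notes imply, not a quoted price.

```python
def gpu_hours(num_gpus, days):
    # Total accelerator time, assuming every GPU runs for the whole period.
    return num_gpus * days * 24


def implied_hourly_rate(total_cost_usd, num_gpus, days):
    # Cost per GPU-hour implied by a total budget.
    return total_cost_usd / gpu_hours(num_gpus, days)


hours = gpu_hours(6_000, 12)
print(hours)  # 1728000
print(round(implied_hourly_rate(2_000_000, 6_000, 12), 2))  # 1.16
```

Scaling either the GPU count or the duration by 10× at a similar rate moves the total into the tens of millions of dollars, in line with the frontier-model estimate above.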
Output: the base model
The base model is an internet document simulator. Given a prompt, it continues the text in whatever style is statistically plausible — academic papers, code listings, Amazon product reviews. It hallucinates freely. It has no assistant behaviour.
Key properties:
- Knowledge is stored probabilistically, not as discrete facts.
- Things mentioned often in training are recalled more reliably.
- The model has no knowledge of events after its training cutoff.
- High-quality sources (Wikipedia) are oversampled and can be recited nearly verbatim; rare sources are recalled vaguely.
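These properties can be seen in miniature with a count-based bigram model, used here as a stand-in for the neural network. It is a deliberately tiny sketch: the corpus, the function names, and the one-token context are assumptions for illustration, and a real base model conditions on thousands of tokens rather than one.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    # Count how often each token follows each other token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    # "Knowledge" is just a probability distribution over continuations.
    options = counts.get(prev, Counter())
    total = sum(options.values())
    return {tok: n / total for tok, n in options.items()}


def continue_text(counts, prompt, max_new_tokens, rng):
    # Autoregressive sampling: repeatedly sample the next token and append it.
    out = list(prompt)
    for _ in range(max_new_tokens):
        options = counts.get(out[-1])
        if not options:
            break  # nothing in training data ever followed this token
        tokens, weights = zip(*options.items())
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out


# Toy corpus: one statement appears nine times, a conflicting one appears once.
corpus = ("the sky is blue . " * 9 + "the sky is green . ").split()
model = train_bigram(corpus)

print(next_token_probs(model, "is"))  # {'blue': 0.9, 'green': 0.1}
print(continue_text(model, ["the", "sky"], 2, random.Random(0)))
```

The frequent statement dominates the distribution, but the rare one is still sampled occasionally, which is the toy analogue of a base model recalling common facts reliably and rare ones vaguely.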
Relationship to post-training
Pretraining is knowledge acquisition. All subsequent stages (Fine-Tuning, Reinforcement Learning from Human Feedback) operate on top of the pretrained base. Post-training is comparatively cheap (on the order of hours to days, versus weeks to months for pretraining) but reshapes the model’s output format and alignment without fundamentally changing its knowledge.