Concept

Pretraining

Pretraining is the first and most expensive stage of building a large language model. A large internet corpus is tokenised and used to train a neural network to predict the next token in each sequence. The output is a base model.


The data pipeline

Starting point: Common Crawl, a non-profit organisation that has crawled the public web since 2007 (about 2.7 billion web pages indexed as of 2024). Raw crawl data goes through:

  1. URL filtering — block lists for spam, malware, adult content.
  2. Text extraction — stripping HTML, CSS, and navigation.
  3. Language filtering — e.g., FineWeb keeps only pages that a language classifier scores as at least 65% likely to be English. This is a design choice: filter a language out and the model will be weak in it.
  4. Deduplication.
  5. PII removal — addresses, social security numbers, etc.
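
The five stages above can be sketched as a chain of filters. This is a toy illustration using only the Python standard library, not FineWeb's actual code: the block list, the word list, the `english_score` heuristic, its 0.2 threshold, and every function name are hypothetical. Real pipelines use trained language classifiers and typically fuzzy as well as exact deduplication.

```python
import hashlib
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical block list and word list, for illustration only.
BLOCKED_HOSTS = {"spam.example", "malware.example"}
COMMON_ENGLISH = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US social security number shape


class TextExtractor(HTMLParser):
    """Collects visible text, skipping script, style, and nav content."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def url_allowed(url):
    """Step 1: drop pages whose host is on the block list."""
    return (urlparse(url).hostname or "") not in BLOCKED_HOSTS


def extract_text(html):
    """Step 2: strip markup and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def english_score(text):
    """Step 3 (toy stand-in for a classifier): share of common English words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in COMMON_ENGLISH for w in words) / len(words)


def clean_corpus(pages, min_english=0.2):
    """Run (url, html) pairs through all five stages; return cleaned texts."""
    seen = set()
    kept = []
    for url, html in pages:
        if not url_allowed(url):
            continue
        text = extract_text(html)
        if english_score(text) < min_english:
            continue
        # Step 4: exact deduplication on a hash of the normalised text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Step 5: mask anything shaped like a social security number.
        kept.append(SSN_PATTERN.sub("[SSN]", text))
    return kept


pages = [
    ("https://spam.example/x", "<p>the best of the deals</p>"),
    ("https://news.example/a",
     "<html><style>p{color:red}</style><nav>Home</nav>"
     "<p>The cat is in the house and it is happy.</p></html>"),
    ("https://mirror.example/a", "<p>The cat is in the house and it is happy.</p>"),
    ("https://blog.example/b",
     "<p>My number is 123-45-6789 and that is the truth of it.</p>"),
    ("https://de.example/c", "<p>Der Hund läuft schnell durch den Garten.</p>"),
]
cleaned = clean_corpus(pages)
for text in cleaned:
    print(text)
```

Of the five sample pages, the spam host, the mirrored duplicate, and the German page are dropped, and the number in the blog post is masked.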

Result: FineWeb (Hugging Face’s public example) is ~44 TB — small enough for a single hard drive. Every major lab has an internal equivalent.


Training objective

The model learns to predict the next token. Feed a window of tokens as context; the network outputs a probability distribution over all ~100,000 possible next tokens. Compare to the actual next token; adjust parameters to increase the probability of the correct token. Repeat across billions of windows.
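
A single update of this kind can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: the four-word vocabulary stands in for the ~100,000 tokens, the logits are treated as if they were the trainable parameters themselves (in a real network the parameters produce the logits), and `train_step` and the learning rate are hypothetical names and values.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(logits, target, lr=0.5):
    """One gradient-descent step on cross-entropy loss for a single prediction.

    Returns (loss_before_update, updated_logits).
    """
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # Gradient of cross-entropy with respect to the logits: probs - one_hot(target).
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - lr * g for x, g in zip(logits, grad)]
    return loss, new_logits


VOCAB = ["the", "cat", "sat", "mat"]  # toy stand-in for the real vocabulary
logits = [0.0, 0.0, 0.0, 0.0]         # scores for some context window
target = VOCAB.index("sat")           # the token that actually came next

p_before = softmax(logits)[target]
loss, logits = train_step(logits, target)
p_after = softmax(logits)[target]

print(f"loss={loss:.3f}  p(correct) before={p_before:.3f}  after={p_after:.3f}")
```

After the step, the probability of the correct token is higher and the probabilities of the other tokens are lower; training repeats this over billions of windows.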

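The key metric is the loss: a single number that measures how poorly the model predicts the actual next tokens, so lower is better. Loss typically falls over the course of training, and the quality of generated text improves with it.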
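
In the usual formulation (an assumption here, since the text does not name the loss function), the loss is the average negative log-probability that the model assigns to each actual next token. The symbols are introduced for illustration: \theta is the model's parameters, x_1, \dots, x_{N+1} is a token sequence, and N is the number of predicted positions.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\!\left(x_{t+1} \mid x_1, \dots, x_t\right)
```

A model that always assigned probability 1 to the correct token would have a loss of 0; one that guessed uniformly over 100,000 tokens would have a loss of \log 100{,}000 \approx 11.5.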


Cost and scale

Llama 2 70B (Meta, 2023): ~10 TB text, ~6,000 GPUs, ~12 days, ~$2M. State-of-the-art models (GPT-4, Claude 3) are estimated to exceed these figures by 10× or more. Training a frontier model is now a $50–100M+ investment.

GPT-2 (OpenAI, 2019): an estimated ~$40,000 to train at the time. Reproducible today for ~$600.
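
A quick back-of-the-envelope check puts these figures in perspective. The inputs are the approximate numbers quoted above; the derived per-GPU-hour rate is only what those numbers imply, not a published price.

```python
# Approximate figures quoted in the text.
LLAMA2_GPUS = 6_000
LLAMA2_DAYS = 12
LLAMA2_COST_USD = 2_000_000

gpu_hours = LLAMA2_GPUS * LLAMA2_DAYS * 24
implied_rate = LLAMA2_COST_USD / gpu_hours  # dollars per GPU-hour

GPT2_COST_2019_USD = 40_000
GPT2_COST_TODAY_USD = 600
cost_drop = GPT2_COST_2019_USD / GPT2_COST_TODAY_USD

print(f"Llama 2 70B: {gpu_hours:,} GPU-hours, ~${implied_rate:.2f} per GPU-hour")
print(f"GPT-2 reproduction: ~{cost_drop:.0f}x cheaper than in 2019")
```

The Llama 2 figures work out to about 1.7 million GPU-hours at a little over a dollar each, and the GPT-2 figures to a cost drop of roughly 67×.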


Output: the base model

The base model is an internet document simulator. Given a prompt, it continues the text in whatever style is statistically plausible — academic papers, code listings, Amazon product reviews. It hallucinates freely. It has no assistant behaviour.

Key properties:

  • Knowledge is stored probabilistically, not as discrete facts.
  • Things mentioned often in training are recalled more reliably.
  • The model has no knowledge of events after its training cutoff.
  • High-quality sources (Wikipedia) are oversampled and can be recited nearly verbatim; rare sources are recalled vaguely.
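
The first two properties can be illustrated with a toy model that counts which word follows which. This is a deliberately crude sketch and not how a neural network stores knowledge: the corpus, the function names, and the one-word context are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for doc in corpus:
        tokens = doc.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Convert the follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


corpus = (
    ["the capital of france is paris"] * 9      # mentioned often
    + ["the capital of france is lyon"]         # a rare, wrong claim
    + ["the capital of tuvalu is funafuti"]     # a rarely mentioned fact
)
counts = train_bigram(corpus)
probs = next_token_probs(counts, "is")

for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {p:.2f}")
```

The model holds no record that says "the capital of France is Paris", only a distribution in which "paris" is the most likely continuation because it appeared most often. The rarely mentioned fact and the wrong claim both survive as low-probability continuations.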

Relationship to post-training

Pretraining is knowledge acquisition. All subsequent stages — Fine-Tuning, Reinforcement Learning from Human Feedback — operate on top of the pretrained base. Post-training is comparatively cheap (hours vs months) but reshapes the model’s output format and alignment without fundamentally changing its knowledge.


See also