Scaling Laws
Scaling laws describe how a large language model's loss on the next-token prediction task changes as a function of two variables: N (number of parameters) and D (amount of training data, typically counted in tokens). The relationship is smooth and predictable, and it has shown no sign of saturating at the scales reached to date.
The core finding
Given only N and D, model loss (lower loss means better next-token prediction) can be predicted with high confidence. Train a bigger model on more data and the loss reliably falls. Algorithmic improvements are a welcome bonus but not required for progress: so far, scaling alone has been enough to keep improving.
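To make "loss as a function of N and D" concrete, here is a minimal sketch that assumes one commonly used parametric form, L(N, D) = E + A / N^alpha + B / D^beta. Neither the form nor the constants come from this document: the function name `predicted_loss` is hypothetical, and the default constants are placeholders chosen to resemble published fits, so the printed numbers should be read as illustrative only.

```python
# Illustrative sketch only. The functional form and every constant below are
# assumptions, not values stated in this document.

def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.69, a: float = 406.4, b: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Assumed parametric form: L(N, D) = E + A / N**alpha + B / D**beta.

    e      -- irreducible loss floor (placeholder value)
    a, b   -- scale constants for the parameter and data terms (placeholders)
    alpha, beta -- power-law exponents (placeholders)
    """
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("N and D must be positive")
    return e + a / n_params**alpha + b / n_tokens**beta


if __name__ == "__main__":
    # Grow N and D together and watch the predicted loss fall smoothly.
    for n, d in [(1e8, 2e9), (1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
        print(f"N={n:.0e}  D={d:.0e}  predicted loss={predicted_loss(n, d):.3f}")
```

Under these assumptions, each term shrinks as a power law, so increasing either N or D lowers the predicted loss, but it never drops below the floor `e`. In practice the constants would be fitted to losses measured on smaller training runs and then used to extrapolate.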
Why this matters
Scaling laws convert AI progress from unpredictable research into a reliable engineering project: spend more on compute, get a better model. This predictability drove the GPU gold rush: NVIDIA’s dominance, billion-dollar data-centre buildouts, the race for more parameters and more data.
In practice: going from GPT-3.5 to GPT-4, nearly all benchmarks improved, and the gain is attributed mainly to scale rather than to any single algorithmic breakthrough.
Limits of the scaling law insight
- Scaling laws describe improvement in next-token prediction loss. Lower loss correlates with better performance on real tasks, but the correlation is imperfect and varies by task.
- They say nothing about what tasks the model will be good or bad at — that is a question of data composition, not scale alone.
- They do not predict when specific abilities (e.g., multi-step reasoning) will emerge; such abilities can appear abruptly at certain scales.
Relationship to the GPU economy
“Elon Musk getting 100,000 GPUs in a single data centre? They’re all doing this — predicting the next token, faster.” — Karpathy
H100 GPUs are optimised for the matrix multiplications that dominate Transformer training. Stacking them in clusters enables the parallel computation over billions of training windows that, according to scaling laws, produces better models.