Concept

Scaling Laws

concept scaling training compute multi-source

Scaling laws describe how the loss of a large language model on the next-token prediction task changes as a function of two variables: N (the number of parameters) and D (the amount of training data, in tokens). The relationship is smooth and predictable, and it has shown no sign of saturating at the scales reached to date.


The core finding

Given only N and D, a model's loss (lower means better next-token prediction) can be predicted with high confidence. Train a bigger model on more data and the loss reliably falls. Algorithmic improvements are a welcome bonus but not required for progress: within the range observed so far, scaling alone has been enough to improve the model.
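The note does not give a functional form for the relationship, so the following is only a sketch of what "predicting loss from N and D" can look like. It assumes a commonly used parametric shape, an irreducible term plus one power-law term each for N and D. The constants are illustrative, approximately the fit reported by Hoffmann et al. (2022), and are not taken from this note; the function name predicted_loss is hypothetical.

```python
# Assumed parametric form: loss = E + A / N**alpha + B / D**beta
# Constants are illustrative (roughly the Hoffmann et al. 2022 fit),
# not values stated anywhere in this note.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Return the predicted next-token loss for N parameters and D tokens."""
    model_term = A / n_params ** ALPHA   # shrinks as the model grows
    data_term = B / n_tokens ** BETA     # shrinks as the dataset grows
    return E + model_term + data_term


small = predicted_loss(1e9, 20e9)     # 1B parameters, 20B tokens
large = predicted_loss(10e9, 200e9)   # 10B parameters, 200B tokens
print(f"1B params, 20B tokens:   {small:.3f}")
print(f"10B params, 200B tokens: {large:.3f}")
```

Under this assumed form, increasing either N or D always lowers the predicted loss, but it can never drop below the irreducible term E.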


Why this matters

Scaling laws convert AI progress from unpredictable research into a reliable engineering project: spend more on compute, get a better model. This predictability drove the GPU gold rush, visible in NVIDIA’s dominance, billion-dollar data-centre buildouts, and the race for more parameters and more data.

In practice, going from GPT-3.5 to GPT-4 improved nearly all benchmarks, not because of a single algorithmic breakthrough but largely because of scale.
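One way this predictability can be used is to fit a curve to a few small training runs and extrapolate to a larger one before paying for it. The sketch below assumes a pure power law, loss = a * N^(-b), and fits it by least squares in log-log space. The data points are synthetic, generated from made-up constants, and fit_power_law is a hypothetical helper; real fits usually also include an irreducible-loss offset.

```python
import math

# Synthetic (parameter count, loss) pairs generated from
# loss = 10 * N**(-0.05). Made-up numbers, not measurements.
TRUE_A, TRUE_B = 10.0, 0.05
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [TRUE_A * n ** (-TRUE_B) for n in sizes]


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - b * log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    var = sum((x - mean_x) ** 2 for x in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


a, b = fit_power_law(sizes, losses)

# Extrapolate two orders of magnitude beyond the largest fitted run.
predicted = a * 1e12 ** (-b)
print(f"fitted a={a:.3f}, b={b:.4f}, predicted loss at N=1e12: {predicted:.3f}")
```

A power law is a straight line on log-log axes, which is why an ordinary linear fit on the logarithms is enough here.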


Limits of the scaling law insight

  • Scaling laws describe improvement in the next-token prediction loss. This correlates with performance on real tasks, but the correlation is imperfect and varies by task.
  • They say nothing about which tasks the model will be good or bad at; that is a question of data composition, not scale alone.
  • They do not predict when specific abilities (e.g., multi-step reasoning) will emerge; such abilities can appear abruptly at certain scales.

Relationship to the GPU economy

“Elon Musk getting 100,000 GPUs in a single data centre? They’re all doing this — predicting the next token, faster.” — Karpathy

H100 GPUs are optimised for the matrix multiplications that dominate Transformer training. Stacking them in clusters enables the parallel computation over billions of training windows that scaling laws promise will produce better models.
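To make the link between N, D and cluster size concrete, here is a back-of-the-envelope sketch. It assumes the widely used approximation that training costs about 6 * N * D floating-point operations. The per-GPU peak of 1e15 FLOP/s and the 40% utilisation are assumed round figures for illustration, not numbers from this note or measured values, and training_days is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  peak_flops_per_gpu: float = 1e15,   # assumed round figure
                  utilization: float = 0.4) -> float:  # assumed efficiency
    """Estimate wall-clock training time in days under C ~ 6 * N * D."""
    total_flops = 6 * n_params * n_tokens
    cluster_flops_per_second = n_gpus * peak_flops_per_gpu * utilization
    return total_flops / cluster_flops_per_second / SECONDS_PER_DAY


# Hypothetical run: 70B parameters, 1.4T tokens, 1,000 GPUs.
days = training_days(70e9, 1.4e12, 1_000)
print(f"1,000 GPUs:  {days:.1f} days")
print(f"10,000 GPUs: {training_days(70e9, 1.4e12, 10_000):.1f} days")
```

The estimate ignores communication overhead, failures and restarts, so it is best read as a lower bound on wall-clock time for a given utilisation.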


See also