Bitter Lesson
The bitter lesson (Rich Sutton, 2019) is the empirical observation that general methods, those that leverage computation and data at scale, eventually outperform methods that encode human domain knowledge into a system. The “bitterness” is that the community keeps relearning this lesson in each AI subfield.
In this wiki, the concept is used primarily in Boris Cherny’s product-oriented application: always bet on the most general model rather than constraining or specialising it.
Sutton’s original claim
Sutton observed, across decades of AI research (chess, Go, speech recognition, image recognition), that whenever researchers encoded domain knowledge into systems (handcrafted features, expert heuristics, specialised architectures), general-purpose methods driven by scale eventually surpassed them — often decisively. The lesson was “bitter” because the specialised approaches often worked better in the short term, leading researchers to invest in them, only to be overtaken.
Boris Cherny’s application to AI product building
In Boris Cherny on Claude Code, Cherny extends the lesson to AI product development through four corollaries:
Corollary 1: Don’t fine-tune when you can use the frontier model. Fine-tuning creates a specialised model. The next frontier model will typically make those gains obsolete.
Corollary 2: Don’t over-scaffold. Layering strict step-by-step workflows around a model — “do step 1, then step 2, then step 3” — is a form of specialisation. It may improve benchmark performance by 10–20%, but the next model release erases those gains because the general model learns to handle the cases the scaffolding was papering over.
Corollary 3: Give the model tools, not instructions. Instead of giving the model context upfront, give it a tool to retrieve context when it needs it. The model decides when and how to use it.
Corollary 4: Build for the general model 6 months from now. Accept bad product-market fit in the short term while waiting for the model’s general capability to arrive. Claude Code’s product design was built for a model that didn’t yet exist; the design paid off when Opus 4 arrived.
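Corollary 3 can be illustrated with a small sketch. This is a toy under stated assumptions, not how Claude Code or any real model API works: the `DOCS` store, the `Tool` wrapper, the `read_doc` tool, and `toy_model` (a keyword matcher standing in for the model's own decision) are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical in-memory document store standing in for real project context.
DOCS = {
    "auth.md": "Tokens expire after 15 minutes.",
    "billing.md": "Invoices are issued monthly.",
    "deploy.md": "Deploys run from the main branch.",
}


def build_upfront_prompt(question: str) -> str:
    """Upfront style: paste every document into the prompt, relevant or not."""
    context = "\n".join(f"[{name}] {text}" for name, text in DOCS.items())
    return f"{context}\n\nQuestion: {question}"


@dataclass
class Tool:
    """A named callable that records every call made through it."""
    name: str
    description: str
    run: Callable[[str], str]
    calls: list = field(default_factory=list)

    def __call__(self, arg: str) -> str:
        self.calls.append(arg)
        return self.run(arg)


def read_doc(name: str) -> str:
    """Return one document by name, or a short error string."""
    return DOCS.get(name, f"no such document: {name}")


def build_tool_prompt(question: str, tools: list) -> str:
    """Tool style: the prompt only advertises the tools, not the documents."""
    listing = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    return f"Available tools:\n{listing}\n\nQuestion: {question}"


def toy_model(question: str, tool: Tool) -> str:
    """Stand-in for a model (assumption): fetch a document only when the
    question mentions its topic. A real model would decide this itself."""
    for name in DOCS:
        topic = name.split(".")[0]
        if topic in question.lower():
            return tool(name)
    return "No document needed."
```

With `tool = Tool("read_doc", "Return the text of one named document.", read_doc)`, the upfront prompt carries all three documents on every request, whereas `toy_model("How long do auth tokens last?", tool)` fetches only `auth.md`. The design point is that the retrieval decision moves from the prompt author to the model.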
Relationship to latent demand
Latent Demand form 2 (model latent demand) is the product expression of the bitter lesson: build in a way that is “on distribution”, that is, in a way that lets the model do what it naturally wants to do rather than constraining it. Both arguments converge on the same advice: expose general capability, minimise constraint.
Fei-Fei Li: bitter lesson is harder in robotics
Dr. Fei-Fei Li on World Models provides an important qualification from robotics research. The bitter lesson presupposes that the training data matches the form of the desired output, an alignment that language models have but robots lack:
- LLMs: train on tokens → output tokens. Perfect alignment. Scale the data, scale the performance.
- Robots: must output actions in 3D physical environments. Web video data contains no 3D-grounded actions. Teleoperation and synthetic data partially bridge the gap.
Li’s conclusion: the bitter lesson is directionally right for robotics but cannot be applied simply by scaling web data. New data strategies (synthetic environments generated by world models, teleoperation at scale) are needed before the lesson can fully apply.
Additionally, robots are physical hardware systems. Even if the “brain” problem is solved by scale, hardware maturation follows separate timelines. Self-driving cars (2D surfaces, no manipulation) required 20 years from research prototype to commercial deployment. Robots are harder.
See World Models for how world models may resolve the data alignment problem by generating diverse 3D action training environments.
Where mainstream views differ
The bitter lesson is generally accepted as an empirical pattern, but its limits are debated:
- Short-term costs are real. The general model is often worse than a specialised one in the near term. Builders who accept poor product-market fit (PMF) for six months on the promise of future model improvement are making a bet, not following a law.
- Not all tasks benefit equally. For highly constrained, well-specified tasks (e.g., a narrow classification pipeline), scaffolding may remain efficient even after model improvements.
- Inference cost matters at scale. Where a fine-tuned small model would suffice, serving a large general model may be economically irrational even if the two produce equivalent results.
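The inference-cost point comes down to simple arithmetic, sketched below. Every number used with these functions is a made-up placeholder, not a real price. The `fixed_monthly_cost_small` parameter is also an assumption introduced here: it stands for whatever recurring cost (retraining, hosting, maintenance) the small fine-tuned model carries that the general model does not.

```python
def monthly_cost(requests_per_day: float,
                 tokens_per_request: float,
                 price_per_million_tokens: float,
                 days: int = 30) -> float:
    """Per-token inference spend over one billing period."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_million_tokens


def break_even_requests_per_day(fixed_monthly_cost_small: float,
                                tokens_per_request: float,
                                price_large: float,
                                price_small: float,
                                days: int = 30) -> float:
    """Daily volume above which the small model becomes cheaper overall.

    Returns infinity when the small model has no per-token price advantage,
    because its fixed cost can then never be recovered.
    """
    saving_per_request = (
        tokens_per_request * (price_large - price_small) / 1_000_000
    )
    if saving_per_request <= 0:
        return float("inf")
    return fixed_monthly_cost_small / (saving_per_request * days)


if __name__ == "__main__":
    # Placeholder figures only: 2,000 tokens per request, 10.0 vs 0.5
    # currency units per million tokens, 5,000 per month of fixed upkeep.
    volume = break_even_requests_per_day(5_000, 2_000, 10.0, 0.5)
    print(f"break-even at about {volume:,.0f} requests per day")
```

Below the break-even volume, the general model is the cheaper choice as well as the simpler one. Above it, the economic argument for a specialised model strengthens, which is the tension this bullet describes.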