Jagged Intelligence
Jagged intelligence (also described as the “Swiss-cheese” model of LLM capability) is Andrej Karpathy’s term for the uneven capability profile of large language models: brilliant across most domains, with arbitrary, often surprising gaps.
Examples
- 9.11 vs 9.9. Early models confidently said 9.11 > 9.9. Analysis: internal activations associated with Bible verse numbering (where 9.11 does come after 9.9) were apparently co-activated, leading to the wrong comparison.
- “How many R’s in strawberry?” Models insisted on 2 for a long time. Root cause: tokenisation. The model sees tokens such as str and awberry, not individual letters, and cannot count character by character without additional reasoning.
- Car-wash example (Karpathy, 2025): “I want to go to a car wash 50 metres away. Should I drive or walk?” State-of-the-art models say to walk, which looks sensible on the surface but misses that the goal is a clean car, so the car has to make the trip. A model that can refactor 100,000-line codebases gets tripped up by this.
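The first two examples are trivial for ordinary code, which is what makes the model failures striking. The sketch below is illustrative only: the helper names are made up for this note, and the token split is simply the one quoted above, not the output of any particular tokeniser.

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of one letter in a word."""
    return word.lower().count(letter.lower())


def larger_decimal(a: str, b: str) -> str:
    """Return whichever string is larger when read as a decimal number."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_section(a: str, b: str) -> str:
    """Return whichever string is later when read as chapter.verse numbering."""
    def key(s: str) -> tuple:
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


# Token split as quoted in the text (an assumption, not real tokeniser output).
tokens = ["str", "awberry"]

print(count_letter("strawberry", "r"))    # character-level view of the word
print(larger_decimal("9.11", "9.9"))      # numeric reading
print(larger_as_section("9.11", "9.9"))   # verse-style reading
```

The two comparison functions give different answers for the same pair of strings, which mirrors the ambiguity the model appears to fall into: both readings exist in the training data, and nothing in the prompt forces the numeric one.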
Why jaggedness exists
Two structural causes:
1. Verifiability
The RL training loop requires a reward signal, which requires a verifier. Labs built RL environments around maths and code because those domains are economically valuable and their answers can be checked automatically. Domains without easy verification, such as aesthetic judgment or common-sense reasoning about goals, received less RL investment and show more gaps.
2. Data and attention
Even within verifiable domains, capability reflects what the labs put in the training mix. Chess reportedly improved dramatically between GPT-3.5 and GPT-4 because someone at OpenAI added a large chess corpus to pretraining. The decision to focus creates a capability spike. Absence of focus leaves a gap.
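To make the verifiability point (cause 1) concrete, here is a minimal sketch of what a verifier-backed reward could look like for a maths task. The function name `math_reward` and the exact-match scheme are assumptions for illustration, not a description of any lab's actual training setup.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the answer equals the reference as an exact rational, else 0.0."""
    try:
        matches = Fraction(model_answer.strip()) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        # Unparseable answers (e.g. prose) simply earn no reward.
        return 0.0
    return 1.0 if matches else 0.0


print(math_reward("0.5", "1/2"))   # equivalent forms of the same number
print(math_reward("two", "2"))     # not parseable as a number
```

A few lines suffice here because the domain has a ground truth that code can check. There is no comparably short function for "is this design tasteful?" or "did the answer respect the user's real goal?", which is roughly why those areas see less RL signal.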
Practical consequences
- Do not treat LLMs as reliable on any task without testing.
- Find where the capability holes are for your specific application — there is no manual; you have to explore empirically.
- If your application lives outside the circuits that were activated by RL training, you will need fine-tuning or a domain-specific setup; the base model out-of-the-box won’t get you there.
- Always verify outputs — especially for arithmetic, character-level operations, and multi-step reasoning.
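Exploring empirically can start with a small probe set whose answers are checked by deterministic code. The following is a rough sketch under stated assumptions: `ask_model` stands for whatever function sends a prompt to your model and returns its reply (hypothetical, not a real library call), and `stub_model` is a stand-in that reproduces the failures described earlier so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Probe:
    prompt: str
    check: Callable[[str], bool]  # deterministic verifier for the reply


def run_probes(ask_model: Callable[[str], str], probes: List[Probe]) -> List[str]:
    """Return the prompts whose replies fail their deterministic check."""
    failures = []
    for probe in probes:
        reply = ask_model(probe.prompt)
        if not probe.check(reply):
            failures.append(probe.prompt)
    return failures


PROBES = [
    Probe(
        "How many times does the letter r appear in 'strawberry'? Reply with a number only.",
        lambda reply: reply.strip() == str("strawberry".count("r")),
    ),
    Probe(
        "Which is larger, 9.11 or 9.9? Reply with the number only.",
        lambda reply: reply.strip() == "9.9",
    ),
]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in model that repeats the known mistakes."""
    return "2" if "strawberry" in prompt else "9.11"


for failed_prompt in run_probes(stub_model, PROBES):
    print("FAILED:", failed_prompt)
```

In practice you would replace `stub_model` with a call to your own model and grow `PROBES` from cases drawn from your application, keeping each check as mechanical as possible so failures are unambiguous.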