Person

Chris Olah

Researcher at Anthropic, credited with founding the field of mechanistic interpretability: the study of neural networks at the level of individual neurons, tracing which neurons and combinations of neurons correspond to which concepts in a model.

Not a source author in this wiki. Boris Cherny mentions him as the definitive expert on mechanistic interpretability and recommends him as a future guest on Lenny’s Podcast.


Contributions

  • Founded mechanistic interpretability as a systematic discipline.
  • Established that model neurons behave analogously to biological neurons in important respects.
  • Developed the concept of superposition: in large models, a single neuron can encode multiple concepts at once, with the intended meaning resolved by which other neurons activate alongside it.
  • Work informs Anthropic’s ability to monitor alignment at training time, for example by detecting neuron activations associated with deception.
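
The superposition idea above can be illustrated with a toy sketch. This is not Olah's or Anthropic's method, only a minimal illustration under stated assumptions: three made-up concepts ("cat", "car", "cloud") share two neurons, each concept is assigned a direction in the two-neuron activation space, and the active concept is read out from the joint activation pattern rather than from either neuron alone. All names and the 120-degree layout are hypothetical choices made for the example.

```python
import math

# Hypothetical toy setup: three concepts must share only two neurons.
CONCEPTS = ["cat", "car", "cloud"]

# Assumption: each concept gets a unit direction in 2-neuron activation space,
# spaced 120 degrees apart so the directions overlap as little as possible.
DIRECTIONS = {
    name: (math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
    for i, name in enumerate(CONCEPTS)
}


def encode(concept, strength=1.0):
    """Return the two neuron activations produced when one concept is present."""
    dx, dy = DIRECTIONS[concept]
    return (strength * dx, strength * dy)


def decode(activations):
    """Pick the concept whose direction best matches the joint activation pattern."""
    ax, ay = activations
    scores = {name: ax * dx + ay * dy for name, (dx, dy) in DIRECTIONS.items()}
    return max(scores, key=scores.get)


def concepts_using_neuron(index, threshold=0.1):
    """List the concepts for which the given neuron carries a non-negligible weight."""
    return [name for name, d in DIRECTIONS.items() if abs(d[index]) > threshold]


if __name__ == "__main__":
    for concept in CONCEPTS:
        print(concept, "->", encode(concept), "->", decode(encode(concept)))
    print("neuron 0 participates in:", concepts_using_neuron(0))
    print("neuron 1 participates in:", concepts_using_neuron(1))
```

In this sketch neuron 0 carries weight for all three concepts, so its activation alone is ambiguous; only the combination of both neurons identifies the concept. Real models involve far more neurons, nonlinearities, and learned rather than hand-placed directions.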

See also