Chris Olah

Researcher at Anthropic. Credited with founding the field of mechanistic interpretability: the study of neural networks at the level of individual neurons, tracing which neurons and combinations of neurons correspond to which concepts in a model.

Not a source author in this wiki. Mentioned by Boris Cherny as the definitive expert on the topic and recommended as a future guest on Lenny’s Podcast.

Contributions

Founded mechanistic interpretability as a systematic discipline.
Established that model neurons behave analogously to biological neurons in important respects.
Developed the concept of superposition: in large models, a single neuron can take part in encoding several concepts at once, so a concept is identified by the pattern of co-activation across neurons rather than by any one neuron.
Work informs Anthropic’s ability to monitor alignment at training time, for example by detecting neuron activations associated with deception.
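
The superposition idea in the list above can be illustrated with a toy sketch. This is not Olah's or Anthropic's method, only a hand-built example under stated assumptions: three hypothetical concepts are stored in two neurons as directions 120 degrees apart, and the names `DIRECTIONS`, `encode`, and `decode` are invented for illustration. Each neuron responds to more than one concept, yet the joint activation pattern still identifies which concept is active.

```python
import math

# Hypothetical toy setup: three concepts share only two neurons.
# Each concept is a unit direction in the 2-D space of neuron activations.
ANGLES_DEG = [0.0, 120.0, 240.0]
DIRECTIONS = [
    (math.cos(math.radians(a)), math.sin(math.radians(a))) for a in ANGLES_DEG
]


def encode(concept):
    """Return the two neuron activations when exactly one concept is active."""
    return DIRECTIONS[concept]


def decode(activations):
    """Pick the concept whose direction best matches the joint activation pattern."""
    scores = [d[0] * activations[0] + d[1] * activations[1] for d in DIRECTIONS]
    return scores.index(max(scores))


# Every concept is recovered from the pattern across both neurons.
recovered = [decode(encode(c)) for c in range(3)]

# Neuron 0 on its own is ambiguous: it is nonzero for all three concepts,
# and identical for concepts 1 and 2.
neuron0_responses = [round(encode(c)[0], 2) for c in range(3)]
```

In this sketch neuron 0 gives the same reading for concepts 1 and 2, so it cannot tell them apart alone; only the combination with neuron 1 resolves the meaning. Real models involve far more neurons and concepts, and the concept directions are learned rather than chosen by hand.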
