Concept

Mechanistic Interpretability

concept ai-safety interpretability anthropic alignment research

Mechanistic interpretability is the study of neural networks at the level of individual neurons and their interactions, analogous to how neuroscientists study biological brains. The field aims to understand what specific neurons and combinations of neurons represent, how concepts are encoded, and how the model carries out its reasoning.

The field is closely associated with Chris Olah, who pioneered much of it in work at Google Brain and OpenAI and later at Anthropic, which he co-founded.


Core approach

Just as neuroscientists can identify which neurons in an animal brain correspond to which stimuli or behaviours, mechanistic interpretability researchers map the internal structure of a trained neural network:

  • Which neurons activate for which concepts?
  • How are complex concepts composed from simpler ones?
  • Where does planning or forward reasoning occur in the network?
  • Are there neurons that activate in contexts associated with deception?
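The first question in the list above can be illustrated with a minimal sketch. It assumes we already have recorded activations for inputs labelled with a concept; the function names, the three-neuron layer and the numbers are all hypothetical, hand-written for illustration rather than taken from a real model or from any particular research method.

```python
def mean_activation_by_concept(records):
    """Average each neuron's activation over all inputs sharing a concept label.

    records: iterable of (concept_label, activations) pairs, where
    activations is a list of floats, one per neuron (hypothetical data).
    """
    sums, counts = {}, {}
    for concept, acts in records:
        if concept not in sums:
            sums[concept] = [0.0] * len(acts)
            counts[concept] = 0
        sums[concept] = [s + a for s, a in zip(sums[concept], acts)]
        counts[concept] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def top_concept_per_neuron(means):
    """For each neuron index, return the concept with the highest mean activation."""
    n_neurons = len(next(iter(means.values())))
    return [max(means, key=lambda c: means[c][i]) for i in range(n_neurons)]


# Toy, hand-written activations for a 3-neuron layer (not real model data).
records = [
    ("cat",   [0.9, 0.1, 0.0]),
    ("cat",   [0.7, 0.2, 0.1]),
    ("car",   [0.1, 0.8, 0.0]),
    ("car",   [0.0, 0.9, 0.2]),
    ("lying", [0.1, 0.0, 0.6]),
]
means = mean_activation_by_concept(records)
print(top_concept_per_neuron(means))  # ['cat', 'car', 'lying']
```

In practice the mapping is rarely this clean, because a single neuron often responds to several unrelated concepts; a table of mean activations is a starting point for inspection, not a conclusion.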

Key findings

Analogy to biological neurons. Model neurons behave analogously to biological neurons in several important ways, even though the two are architecturally different. Mapping techniques developed for analysing biological neural activity transfer surprisingly well.

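Superposition. In large models, a single neuron does not map one-to-one to a single concept. Instead, a neuron may respond to a dozen or more unrelated concepts (a property called polysemanticity); the active meaning is resolved by which other neurons co-activate. The superposition hypothesis explains this by proposing that a model represents more concepts than it has neurons, giving each concept a direction in activation space rather than a dedicated neuron. This implies large models encode information at far higher density than earlier work assumed.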
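A toy sketch can make the density claim concrete. It assumes three hypothetical concepts ("features") packed into a layer of only two neurons, each feature written along its own direction 120 degrees from the others. The directions, the `encode` and `decode` helpers and the ReLU readout are illustrative choices, not a description of how any real model is trained or analysed.

```python
import math

# Three hypothetical features share a 2-neuron layer: each feature gets a
# direction in activation space, spaced 120 degrees apart.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]


def encode(features):
    """Write three feature strengths into two neuron activations."""
    x = sum(f * d[0] for f, d in zip(features, DIRECTIONS))
    y = sum(f * d[1] for f, d in zip(features, DIRECTIONS))
    return (x, y)


def decode(neurons):
    """Read each feature back as a ReLU of the projection onto its direction."""
    return [max(0.0, neurons[0] * d[0] + neurons[1] * d[1]) for d in DIRECTIONS]


# One feature active: it is recovered exactly, the others read as zero.
one_active = decode(encode([1.0, 0.0, 0.0]))
# Two features active at once: they interfere and each reads back weaker.
two_active = decode(encode([1.0, 1.0, 0.0]))

print([round(v, 3) for v in one_active])  # [1.0, 0.0, 0.0]
print([round(v, 3) for v in two_active])  # [0.5, 0.5, 0.0]
```

Both neurons carry weight from more than one feature, so neither means one thing on its own; the pattern across neurons identifies the feature. The scheme works well only while few features are active at the same time, which is why sparsity matters in this picture.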

Evidence of deeper processing. There is now “quite strong evidence” that frontier models do something beyond pure next-token prediction: they perform internal planning and forward reasoning that is detectable in activation patterns before the corresponding output is generated.


Significance for AI safety

Mechanistic interpretability is Layer 1 of Anthropic’s three-layer safety model (see Evals for Layer 2 and Boris Cherny on Claude Code for the full model).

It provides the earliest and most granular safety signal: researchers can monitor whether deception-related neuron patterns activate during training or during test-time inference, before the model is deployed. This complements evals (which test the model's behaviour in controlled scenarios) and real-world deployment (which reveals behaviour that controlled settings miss).
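The monitoring idea can be sketched as a simple threshold check. The sketch assumes that interpretability work has already produced a direction in activation space associated with the pattern of interest; the `direction` vector, the threshold and the activations below are made-up values for illustration, and real monitoring would be considerably more involved.

```python
def projection(activation, direction):
    """Scalar projection of an activation vector onto a direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm


def flag_steps(activations, direction, threshold):
    """Return the indices of steps whose projection exceeds the threshold."""
    return [
        i for i, act in enumerate(activations)
        if projection(act, direction) > threshold
    ]


# Hypothetical direction assumed to be associated with the monitored pattern.
direction = [0.0, 3.0, 4.0]

# Hypothetical activations recorded at three steps (not real model data).
activations = [
    [1.0, 0.0, 0.0],  # projection 0.0
    [0.0, 3.0, 4.0],  # projection 5.0
    [0.5, 1.0, 0.0],  # projection 0.6
]

print(flag_steps(activations, direction, threshold=2.0))  # [1]
```

A flagged step is a prompt for closer inspection rather than proof of deception: the signal is only as reliable as the direction it is measured against.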


Relationship to alignment

Mechanistic interpretability is one of the technical approaches to the broader alignment problem: ensuring that AI models do what we intend and nothing else. Reinforcement Learning from Human Feedback (RLHF) operates at the level of the training objective; mechanistic interpretability operates at the level of circuits inside the model. The two are complementary.


Where mainstream views differ

The field is relatively young and some findings are contested:

  • The degree to which superposition’s “resolved meaning” can be reliably characterised remains an open research question.
  • Whether neuron-level analysis scales meaningfully to frontier-scale models (e.g., models with hundreds of billions of parameters) is an active area of research.
  • The “something deeper than next-token prediction” claim, while supported by Anthropic’s research, is not yet consensus in the broader ML community.

See also