AI Safety Levels
AI Safety Levels (ASLs) are Anthropic’s framework for categorising the risk posed by model capabilities, defined in its Responsible Scaling Policy (RSP). The framework specifies which deployment and safety commitments are required at each level.
The levels
| Level | Capability threshold | Risk characterisation |
|---|---|---|
| ASL-1 | Below GPT-2 | Negligible risk |
| ASL-2 | Basic capability; early GPT era | Limited risk; standard safeguards sufficient |
| ASL-3 | Meaningful uplift for creating weapons of mass destruction | Moderate risk from misuse; current Claude models (2025) |
| ASL-4 | Potential for significant loss of human life from misuse | High risk; requires substantially stronger safety measures |
| ASL-5 | Potential extinction-level outcomes if misaligned or misused | Catastrophic risk; may require halting development until safety proven |
ASL-3 — current status
As of 2025, Anthropic classifies its most capable Claude models at ASL-3. The evidence cited: in controlled expert evaluations, Claude gave a bad actor seeking to create a bioweapon measurable uplift over the baseline of using Google Search alone. Anthropic has testified to Congress about this capability.
The ASL-3 designation triggers specific safety commitments: enhanced red-teaming, additional deployment restrictions, and published policy.
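The idea of a level triggering commitments can be sketched as a small lookup plus a gate check. This is only an illustration, not Anthropic's actual process: the names `ASL`, `REQUIRED_COMMITMENTS`, `missing_commitments`, and `may_deploy` and the string labels are all hypothetical. Only the three ASL-3 commitments named above are filled in, and the choice to fail closed for levels with no recorded commitments is an assumption of the sketch.

```python
from collections.abc import Iterable
from enum import IntEnum


class ASL(IntEnum):
    """AI Safety Levels from the table above."""
    ASL_1 = 1
    ASL_2 = 2
    ASL_3 = 3
    ASL_4 = 4
    ASL_5 = 5


# Hypothetical labels for the three ASL-3 commitments named in this note.
# Other levels are deliberately absent: their commitments are not listed here.
REQUIRED_COMMITMENTS: dict[ASL, frozenset[str]] = {
    ASL.ASL_3: frozenset({
        "enhanced_red_teaming",
        "deployment_restrictions",
        "published_policy",
    }),
}


def missing_commitments(level: ASL, in_place: Iterable[str]) -> frozenset[str]:
    """Return the commitments required at `level` that are not yet in place."""
    return REQUIRED_COMMITMENTS[level] - frozenset(in_place)


def may_deploy(level: ASL, in_place: Iterable[str]) -> bool:
    """Fail closed: a level with no recorded commitments is never deployable."""
    if level not in REQUIRED_COMMITMENTS:
        return False
    return not missing_commitments(level, in_place)


if __name__ == "__main__":
    done = {"enhanced_red_teaming", "published_policy"}
    print(sorted(missing_commitments(ASL.ASL_3, done)))
    print(may_deploy(ASL.ASL_3, done))
```

The point of the sketch is the shape of the rule: deployment is conditioned on the commitments attached to the model's level, not on a general statement of intent.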
Purpose of the framework
- Operationalises safety claims. Instead of “we care about safety,” ASL provides concrete commitments tied to measurable capability thresholds.
- Enables responsible scaling. Anthropic can continue developing models while publishing the conditions under which it would need to pause.
- Builds trust with policymakers. Publishing both the risks and the commitments gives legislators a concrete basis for evaluation.
The “God in a box” problem
Benjamin Mann frames ASL-4 and ASL-5 in terms of early AI safety theory, where the concern was keeping a superintelligent system contained and aligned. The irony with language models is that people are actively pulling the “God out of the box” by giving models full internet access, credentials, and broad agency. The ASL framework is Anthropic’s attempt to gate capability deployment on safety maturity.
Where mainstream views differ
Some researchers argue that capability-based risk levels are too crude: a model could in principle be very capable but well aligned, or less capable but poorly aligned. Anthropic’s view is that capability is a necessary (if not sufficient) indicator of risk, since a badly aligned but less capable model is less dangerous than a badly aligned, highly capable one.
See also
- Constitutional AI
- Mechanistic Interpretability
- Evals
- Benjamin Mann on Anthropic and AGI
- Boris Cherny on Claude Code — Boris’s three-layer safety model (alignment → evals → in-the-wild)