Concept

AI Safety Levels

concept ai-safety anthropic agi alignment

AI Safety Levels (ASL) are Anthropic’s framework for categorising the risk posed by model capabilities, defined in the company’s Responsible Scaling Policy (RSP). The framework specifies which deployment and safety commitments are required at each level.


The levels

Level | Capability threshold | Risk characterisation
ASL-1 | Below GPT-2 | Negligible risk
ASL-2 | Basic capability; early GPT era | Limited risk; standard safeguards sufficient
ASL-3 | Meaningful uplift for creating weapons of mass destruction | Moderate risk from misuse; current Claude models (2025)
ASL-4 | Potential for significant loss of human life from misuse | High risk; requires substantially stronger safety measures
ASL-5 | Potential extinction-level outcomes if misaligned or misused | Catastrophic risk; may require halting development until safety proven

ASL-3 — current status

As of 2025, Anthropic classifies Claude models at ASL-3. The concrete evidence is that, in controlled expert evaluations, Claude gives a bad actor seeking to create a bioweapon measurable uplift beyond the baseline of what Google Search alone provides. Anthropic has testified to Congress about this capability.

The ASL-3 designation triggers specific safety commitments: enhanced red-teaming, additional deployment restrictions, and a published policy setting these out.


Purpose of the framework

  1. Operationalises safety claims. Instead of “we care about safety,” ASL provides concrete commitments tied to measurable capability thresholds.
  2. Enables responsible scaling. Anthropic can continue developing models while publishing the conditions under which they would need to pause.
  3. Builds trust with policymakers. Publishing both the risks and the commitments gives legislators a concrete basis for evaluation.
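
The idea of tying commitments to capability thresholds can be sketched as a simple gate. This is a minimal illustration only, not Anthropic's actual process: the names ASL, may_deploy, and decide are hypothetical, and the rule assumed here (deploy only when the safeguards in place meet or exceed the level the model was assessed at, otherwise pause) is a simplification of the pause condition described in point 2.

```python
from enum import IntEnum


class ASL(IntEnum):
    """AI Safety Levels, ordered so comparisons reflect increasing risk.

    Hypothetical encoding for illustration; not an official schema.
    """
    ASL_1 = 1
    ASL_2 = 2
    ASL_3 = 3
    ASL_4 = 4
    ASL_5 = 5


def may_deploy(assessed_level: ASL, safeguards_in_place: ASL) -> bool:
    """Assumed rule: safeguards must meet or exceed the assessed level."""
    return safeguards_in_place >= assessed_level


def decide(assessed_level: ASL, safeguards_in_place: ASL) -> str:
    """Return 'deploy' when the gate passes, otherwise 'pause'."""
    if may_deploy(assessed_level, safeguards_in_place):
        return "deploy"
    return "pause"


if __name__ == "__main__":
    # A model assessed at ASL-3 with only ASL-2 safeguards does not pass the gate.
    print(decide(ASL.ASL_3, ASL.ASL_2))
    # The same model with ASL-3 safeguards in place passes.
    print(decide(ASL.ASL_3, ASL.ASL_3))
```

The point of the sketch is that the decision depends on two separately measured quantities, assessed capability and safeguard maturity, rather than on a general statement of intent.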

The “God in a box” problem

Ben frames ASL-4 and ASL-5 in terms of historical AI safety theory, where the early concern was keeping a superintelligent system contained and aligned. The irony with language models is that people are actively pulling the “God out of the box” by giving models full internet access, credentials, and broad agency. The ASL framework is Anthropic’s attempt to gate capability deployment on safety maturity.


Where mainstream views differ

Some researchers argue that capability-based risk levels are too crude: a model could in principle be very capable but well aligned, or less capable but poorly aligned. Anthropic’s view is that capability is a necessary (if not sufficient) indicator of risk; a badly aligned model with limited capability is less dangerous than a badly aligned model with high capability.


See also