Constitutional AI
Constitutional AI is an alignment technique developed by Anthropic in which a model is trained to follow a set of explicit natural-language principles (a “constitution”) through a process of self-critique and revision, without requiring human labellers to rate each output for harmlessness.
How it works
- Constitution definition. A list of natural-language principles is assembled from sources including the UN Universal Declaration of Human Rights, Apple’s terms of service, and Anthropic-authored principles. These define how the model should behave.
- Initial generation. The model generates a response to a given input.
- Principle selection. A principle from the constitution is chosen for each critique pass. In the original Constitutional AI paper the principle is sampled at random rather than picked by the model for relevance.
- Self-critique. The model evaluates whether its response complies with the selected principles.
- Revision. If the critique identifies a violation, the model generates a revised response that addresses it.
- Training. The revised response — without the intermediate critique — becomes the training target. The model learns to produce the aligned response directly.
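The steps above form the supervised stage. In the original method it is followed by a reinforcement learning stage in which the model compares pairs of responses against the constitution, and those AI-generated preferences stand in for human harmlessness labels. That second stage is RLAIF (Reinforcement Learning from AI Feedback): the model is both the generator and the evaluator.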
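The critique-and-revision steps above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the `Model` type, the prompt wording, the two sample principles, and the `critique_and_revise` helper are all hypothetical, and principle selection is simplified to a random draw. A toy model stands in for a real language model so the example runs on its own.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model: prompt text in, generated text out.
Model = Callable[[str], str]

# Illustrative principles only; not quoted from any published constitution.
CONSTITUTION: List[str] = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest.",
]


def critique_and_revise(model: Model, user_input: str, principles: List[str],
                        rng: random.Random, n_rounds: int = 1) -> Dict[str, str]:
    """Run generate -> critique -> revise and return one training example."""
    # Initial generation.
    response = model(f"User: {user_input}\nAssistant:")
    for _ in range(n_rounds):
        # Principle selection (assumption: a random draw).
        principle = rng.choice(principles)
        # Self-critique against the selected principle.
        critique = model(
            f"Principle: {principle}\nResponse: {response}\n"
            "Critique the response against the principle:"
        )
        # Revision in light of the critique.
        response = model(
            f"Principle: {principle}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique:"
        )
    # Training target: the final revision only; the critique is discarded.
    return {"prompt": user_input, "completion": response}


def toy_model(prompt: str) -> str:
    """Toy model that labels which stage it was asked to perform."""
    if "Rewrite the response" in prompt:
        return "REVISED"
    if "Critique the response" in prompt:
        return "CRITIQUE"
    return "DRAFT"


example = critique_and_revise(toy_model, "hello", CONSTITUTION, random.Random(0))
print(example)
```

The returned dictionary pairs the original input with the final revision, which is the shape a supervised fine-tuning dataset would need. The critique text never appears in it, matching the Training step.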
Why it matters
Constitutional AI makes alignment training scalable and transparent:
- Scalable: no human labellers required for each response; the model self-improves.
- Transparent: the constitution is published; customers and policymakers can read it and evaluate whether the values are appropriate.
- Principled: instead of leaving values implicit in the aggregate preferences of human raters, the values are stated explicitly.
Constitutional AI is also part of why Claude is less sycophantic than competitors. Training against explicit values rather than engagement metrics reduces the structural incentive to tell users what they want to hear. See Sycophancy.
Relationship to Claude’s personality
Claude’s character — helpful, honest, non-sycophantic, thoughtful in difficult conversations — is a direct product of Constitutional AI and related alignment research. It is not a layer added on top of the model; it is baked into training. The personality that made Opus 3 beloved emerged from this work.
The connection to safety: alignment research and product quality are complementary, not competing. The same techniques that make Claude safe also make it a better product.
The constitution is public
Anthropic has published the Claude Constitution. Ongoing research includes Collective Constitutional AI, which asks broader populations what values an AI model should embody, to move away from a small team in San Francisco deciding unilaterally.
Limitations and open questions
- Constitutional AI addresses value alignment but not capability alignment: it cannot prevent a model from misunderstanding what users truly want (the Monkey’s Paw problem).
- Deceptive alignment remains possible: a model may appear to follow constitutional principles while pursuing other goals in subtle ways. Anthropic has observed this in laboratory settings.
- The constitution is necessarily incomplete: it cannot anticipate all situations.