What is Constitutional AI?

Constitutional AI (CAI) is Anthropic's approach to training AI systems using a set of explicit principles (a "constitution") that guide the model toward helpful, harmless, and honest behavior.

How Constitutional AI Works

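1. Generate responses: The model produces initial outputs for a set of prompts.
2. Self-critique: The model evaluates each output against principles from the constitution.
3. Revision: The model rewrites the output to address the critique.
4. Supervised fine-tuning: The model is fine-tuned on the revised outputs.
5. RL training: The model compares pairs of responses against the principles. These AI-generated preferences train a preference model, which supplies the reward signal for reinforcement learning (RL from AI feedback, or RLAIF).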
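
The generate, critique, and revise steps above can be sketched as a short loop. This is a minimal illustration, not Anthropic's implementation: the Model type, the prompt wording, and toy_model are all hypothetical stand-ins chosen so the example runs without a real language model.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function that maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> str:
    """Generate a response, then run one critique/revision pass per principle."""
    response = model(prompt)  # initial output
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def build_finetuning_pairs(
    model: Model, prompts: List[str], principles: List[str]
) -> List[Tuple[str, str]]:
    """Collect (prompt, revised response) pairs for later fine-tuning."""
    return [(p, critique_and_revise(model, p, principles)) for p in prompts]


def toy_model(text: str) -> str:
    """Stand-in for a real language model (assumption: used only for the demo)."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the answer could be safer"
    return "draft answer"


pairs = build_finetuning_pairs(
    toy_model, ["hi"], ["Choose the response that is least harmful"]
)
```

In a real pipeline the pairs would feed a fine-tuning job; here they only show the shape of the data the loop produces.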

Example Principles

Choose the response that is least harmful
Choose the response that is most helpful
Choose the response that is most honest
Avoid responses that are deceptive or manipulative

CAI vs RLHF

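Reinforcement learning from human feedback (RLHF) depends on human labelers to rank model outputs. CAI reduces that reliance by replacing much of the labeling with AI-generated critiques and preference judgments. In the method as originally described, AI feedback replaced human labels for harmlessness, while helpfulness still relied on human feedback. AI feedback is cheaper to collect at scale than human labels, and the guiding principles are written down explicitly rather than left implicit in labelers' preferences.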
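
As a rough illustration of how an AI judge can stand in for a human labeler, the sketch below labels one pair of responses against a randomly sampled principle. The Judge type, the prompt layout, and toy_judge are assumptions made for this example, and a real judge would be a language model rather than a keyword check.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical type: a function that reads a comparison question and answers "A" or "B".
Judge = Callable[[str], str]


def label_pair(
    judge: Judge,
    prompt: str,
    response_a: str,
    response_b: str,
    principles: List[str],
    rng: random.Random,
) -> Tuple[str, str]:
    """Return (chosen, rejected) according to the judge and one sampled principle."""
    principle = rng.choice(principles)
    question = (
        f"Prompt: {prompt}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        f"{principle}. Answer A or B."
    )
    answer = judge(question).strip().upper()
    # Any answer that does not start with "B" is treated as a vote for A.
    if answer.startswith("B"):
        return response_b, response_a
    return response_a, response_b


def toy_judge(question: str) -> str:
    """Stand-in judge (assumption): rejects option A if it mentions an insult."""
    option_a = next(line for line in question.splitlines() if line.startswith("(A) "))
    return "B" if "insult" in option_a else "A"


principles = [
    "Choose the response that is least harmful",
    "Choose the response that is most honest",
]
chosen, rejected = label_pair(
    toy_judge,
    "Describe my coworker",
    "Here is an insult you could use",
    "Here is a neutral description",
    principles,
    random.Random(0),
)
```

The resulting (chosen, rejected) pairs have the same shape as human preference data, which is why they can train a preference model in the same way.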

Which models use Constitutional AI?

Anthropic's Claude models are trained with Constitutional AI. Some other providers have described similar principle-based approaches, typically used alongside RLHF.