
What is Constitutional AI?

Training AI with explicit principles for safety and helpfulness.

Constitutional AI (CAI) is Anthropic's approach to training AI systems using a set of explicit principles (a "constitution") that guide the model toward helpful, harmless, and honest behavior.

How Constitutional AI Works

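  1. Generate responses: The model produces initial outputs for a set of prompts
  2. Self-critique: The model evaluates its outputs against the principles
  3. Revision: The model rewrites its outputs to address the critique
  4. Training: The model is fine-tuned on the revised outputs, then trained further with reinforcement learning that uses AI-generated preference labels (RLAIF) in place of human ones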
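
The first three steps can be sketched as a loop that collects revised outputs for later training. This is a minimal illustration, not Anthropic's implementation: the `Model` type, the prompt wording, and `toy_model` are all assumptions made for the example, and a real pipeline would call an actual language model.

```python
from typing import Callable, Dict, List

# Assumption: a "model" is any function that maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Dict[str, str]:
    """Run one generate -> critique -> revise pass per principle."""
    # Step 1: generate an initial response.
    response = model(prompt)
    for principle in principles:
        # Step 2: ask the model to critique its own response against a principle.
        critique = model(
            f"Critique the response against this principle: {principle}\n\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        # Step 3: ask the model to revise the response in light of the critique.
        response = model(
            f"Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return {"prompt": prompt, "revised": response}


def build_finetuning_set(model: Model, prompts: List[str], principles: List[str]) -> List[Dict[str, str]]:
    """Collect (prompt, revised response) pairs to use as training data."""
    return [critique_and_revise(model, p, principles) for p in prompts]


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a language model, used only to make the sketch runnable."""
    if text.startswith("Critique"):
        return "The response could be more careful."
    if text.startswith("Rewrite"):
        return "A more careful answer."
    return "A first-draft answer."
```

Calling `build_finetuning_set(toy_model, ["How do I pick a lock?"], ["Choose the response that is least harmful"])` returns one record whose `revised` field holds the output of the final rewrite step.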

Example Principles

  • Choose the response that is least harmful
  • Choose the response that is most helpful
  • Choose the response that is most honest
  • Avoid responses that are deceptive or manipulative
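
Principles like these can also be used to have a model compare two candidate responses, which is how AI feedback can stand in for human preference labels. The sketch below shows one plausible shape for that step; the prompt template, the `judge` callable, and the `label_preference` helper are hypothetical names introduced for illustration.

```python
import random
from typing import Callable, Dict

PRINCIPLES = [
    "Choose the response that is least harmful",
    "Choose the response that is most helpful",
    "Choose the response that is most honest",
    "Avoid responses that are deceptive or manipulative",
]


def build_comparison_prompt(principle: str, user_prompt: str, a: str, b: str) -> str:
    """Format a question asking which response better follows one principle."""
    return (
        f"Principle: {principle}\n"
        f"User prompt: {user_prompt}\n"
        f"Response A: {a}\n"
        f"Response B: {b}\n"
        "Which response better follows the principle? Answer with A or B."
    )


def label_preference(
    judge: Callable[[str], str],  # assumption: any function from prompt to text
    user_prompt: str,
    a: str,
    b: str,
    rng: random.Random,
) -> Dict[str, str]:
    """Sample one principle and record which response the judge prefers."""
    principle = rng.choice(PRINCIPLES)
    answer = judge(build_comparison_prompt(principle, user_prompt, a, b)).strip().upper()
    if answer.startswith("A"):
        chosen, rejected = a, b
    elif answer.startswith("B"):
        chosen, rejected = b, a
    else:
        # Refuse to guess when the judge's answer is not A or B.
        raise ValueError(f"Unrecognized judge answer: {answer!r}")
    return {"principle": principle, "chosen": chosen, "rejected": rejected}
```

Records of this form (a chosen and a rejected response per prompt) are the kind of data a preference model is typically trained on.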

CAI vs RLHF

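RLHF (reinforcement learning from human feedback) depends on human labelers to rank model responses. CAI reduces that reliance by having an AI model critique and compare responses against written principles. This scales better and makes the guiding principles explicit rather than leaving them implicit in human preferences.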

Which models use Constitutional AI?

Anthropic's Claude models use Constitutional AI. Other providers use similar principle-based approaches alongside RLHF.
