What is RLHF?
Reinforcement Learning from Human Feedback explained.
RLHF is a training technique that uses human preference data to fine-tune AI models. It's the alignment step that made ChatGPT markedly more helpful, harmless, and honest than the base language model it was built on.
How RLHF Works
1. Supervised fine-tuning (SFT): train the model on human-written example responses
2. Reward model training: humans rank candidate outputs, and a separate reward model learns to predict those preferences
3. RL optimization: the model is fine-tuned (typically with PPO) to maximize reward model scores while staying close to the SFT model
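The math behind steps 2 and 3 can be sketched in a few lines. This is a simplified illustration, not production training code: the function names, the Bradley-Terry-style pairwise loss, and the KL coefficient default are illustrative assumptions.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Step 2 (reward model training): pairwise preference loss.
    The reward model is trained so the human-preferred output scores
    higher: loss = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def rlhf_objective(reward: float, logp_policy: float, logp_ref: float,
                   kl_coef: float = 0.1) -> float:
    """Step 3 (RL optimization): the policy maximizes the reward model's
    score minus a KL penalty that keeps it close to the SFT reference
    model (this penalty is what limits reward hacking)."""
    return reward - kl_coef * (logp_policy - logp_ref)

# When the reward model can't separate the pair, the loss is log(2);
# it shrinks as the margin for the preferred output grows.
print(preference_loss(0.0, 0.0))   # log(2) ~= 0.693
print(preference_loss(2.0, 0.0))   # smaller: chosen already ranked higher
print(rlhf_objective(1.0, logp_policy=-1.0, logp_ref=-1.5))
```

The KL term is the key design choice: without it, the policy drifts far from the reference model and exploits reward model weaknesses, the "reward hacking" failure mode listed below.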
RLHF Limitations
- Reward hacking: the policy games the reward model instead of genuinely improving
- Human labelers' biases transfer to the model
- Expensive and time-consuming
- Doesn't prevent all harmful outputs
Beyond RLHF
Production safety requires additional layers: output monitoring, guardrails, and continuous evaluation. RLHF is training-time safety; observability is runtime safety.
Do all LLMs use RLHF?
Most commercial LLMs use RLHF or related techniques (DPO, Constitutional AI). Base models that skip this alignment step are typically less safe and less helpful.
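DPO, mentioned above, sidesteps the separate reward model and RL loop by turning preference pairs directly into a supervised loss on the policy. A minimal sketch, assuming per-sequence log-probabilities are already computed (the function name and the `beta` default are illustrative):

```python
import math

def dpo_loss(logp_pol_chosen: float, logp_pol_rejected: float,
             logp_ref_chosen: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss. The implicit reward for a
    response is beta * (policy log-prob - reference log-prob); the loss
    pushes the implicit reward of the human-chosen response above that
    of the rejected one: -log sigmoid(margin)."""
    margin = beta * ((logp_pol_chosen - logp_ref_chosen)
                     - (logp_pol_rejected - logp_ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy equals the reference, the margin is 0 and loss is log(2);
# favoring the chosen response over the rejected one lowers the loss.
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because no reward model or PPO loop is needed, DPO is cheaper to run, which is one reason it has become a common alternative to classic RLHF.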
Monitor RLHF'd models in production
Start Free