What is RLHF?
Reinforcement Learning from Human Feedback explained.
RLHF is a training technique that uses human preference data to fine-tune AI models. It's the alignment step that made ChatGPT markedly more helpful, harmless, and honest than the base language model it was built on.
How RLHF Works
1. Supervised fine-tuning (SFT): train the model on human-written example responses
2. Reward model training: humans rank candidate outputs, and a separate reward model learns to predict those preferences
3. RL optimization: the model is fine-tuned (typically with PPO) to maximize reward model scores while staying close to the SFT model
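The math behind steps 2 and 3 can be sketched in a few lines. This is a simplified illustration, not production training code: the function names, the Bradley-Terry-style pairwise loss, and the KL coefficient default are illustrative assumptions.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Step 2 (reward model training): pairwise preference loss.
    The reward model is trained so the human-preferred output scores
    higher: loss = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def rlhf_objective(reward: float, logp_policy: float, logp_ref: float,
                   kl_coef: float = 0.1) -> float:
    """Step 3 (RL optimization): the policy maximizes the reward model's
    score minus a KL penalty that keeps it close to the SFT reference
    model (this penalty is what limits reward hacking)."""
    return reward - kl_coef * (logp_policy - logp_ref)

# When the reward model can't separate the pair, the loss is log(2);
# it shrinks as the margin for the preferred output grows.
print(preference_loss(0.0, 0.0))   # log(2) ~= 0.693
print(preference_loss(2.0, 0.0))   # smaller: chosen already ranked higher
print(rlhf_objective(1.0, logp_policy=-1.0, logp_ref=-1.5))
```

The KL term is the key design choice: without it, the policy drifts far from the reference model and exploits reward model weaknesses, the "reward hacking" failure mode listed below.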
RLHF Limitations
- Reward hacking: the policy games the reward model instead of genuinely improving
- Human labelers' biases transfer to the model
- Expensive and time-consuming
- Doesn't prevent all harmful outputs
Beyond RLHF
Production safety requires additional layers: output monitoring, guardrails, and continuous evaluation. RLHF is training-time safety; observability is runtime safety.
Do all LLMs use RLHF?
Most commercial LLMs use RLHF or related techniques (DPO, Constitutional AI). Base models that skip this alignment step are typically less safe and less helpful.
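DPO, mentioned above, sidesteps the separate reward model and RL loop by turning preference pairs directly into a supervised loss on the policy. A minimal sketch, assuming per-sequence log-probabilities are already computed (the function name and the `beta` default are illustrative):

```python
import math

def dpo_loss(logp_pol_chosen: float, logp_pol_rejected: float,
             logp_ref_chosen: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss. The implicit reward for a
    response is beta * (policy log-prob - reference log-prob); the loss
    pushes the implicit reward of the human-chosen response above that
    of the rejected one: -log sigmoid(margin)."""
    margin = beta * ((logp_pol_chosen - logp_ref_chosen)
                     - (logp_pol_rejected - logp_ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy equals the reference, the margin is 0 and loss is log(2);
# favoring the chosen response over the rejected one lowers the loss.
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because no reward model or PPO loop is needed, DPO is cheaper to run, which is one reason it has become a common alternative to classic RLHF.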
Monitor RLHF'd models in production
Start Free