Reinforcement Learning from Human Feedback (RLHF)
Training a language model to prefer outputs humans rate higher.
RLHF is the fine-tuning stage, usually applied after supervised instruction tuning, that turns a capable but unaligned pretrained model into a polite, helpful assistant. Humans rank pairs of model outputs; a reward model learns to predict those preferences; reinforcement learning (commonly PPO) then fine-tunes the base model to maximize the learned reward.
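To make the middle step concrete, here is a minimal sketch of the reward-model objective. It assumes PyTorch, a toy feed-forward scorer, and random vectors standing in for encoded (prompt, response) pairs; a real reward model scores full transformer outputs, and the names here are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a fixed-size encoding of a (prompt, response) pair to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the human-preferred response should score higher,
    # so we maximize log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

# Toy usage: random vectors stand in for encoded (prompt, response) pairs.
rm = RewardModel()
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
loss = preference_loss(rm, chosen, rejected)
loss.backward()
```

The pairwise loss only needs relative judgments, which is exactly what human rankings provide; the RL step then optimizes the policy against this learned scorer, typically with a KL penalty that keeps it close to the original model.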
RLHF is a large part of why ChatGPT landed as a breakthrough product even though the underlying GPT-3.5 base model was already available before launch: the alignment step, not raw capability, is what made it usable. Constitutional AI (used by Anthropic) is a related technique that uses a written set of principles instead of (or alongside) human raters to generate the preference feedback.
Direct Preference Optimization (DPO) and other methods simplify the original RLHF pipeline by optimizing the policy directly on preference pairs, skipping the separate reward model and RL loop, but the core idea, aligning outputs with human preference signals, is the foundation of every modern frontier chat model.
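As a rough illustration of why DPO is simpler, here is a hedged sketch of its loss on per-response log-probabilities. The tensors are hypothetical summed log-probs from the policy and a frozen reference model, not outputs of any specific library, and beta = 0.1 is just a commonly used default.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # DPO optimizes the policy directly on preference pairs: the implicit
    # reward is beta * (log pi - log pi_ref), so the loss pushes the chosen
    # response's log-ratio above the rejected one's.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up summed log-probabilities for four preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -6.0, -4.5, -7.0]),
                torch.tensor([-6.5, -6.2, -5.0, -8.0]),
                torch.tensor([-5.5, -6.1, -4.8, -7.2]),
                torch.tensor([-6.0, -6.0, -5.1, -7.8]))
print(loss.item())
```

Because the implicit reward is the log-ratio between policy and reference, DPO pursues the same preference-alignment goal without training a separate reward model or running an RL loop.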