Reinforcement Learning from Human Feedback (RLHF)
Training a language model to prefer outputs humans rate higher.
RLHF is the fine-tuning stage, usually applied after supervised instruction tuning, that turns a capable but unaligned pretrained model into a polite, helpful assistant. Humans rank pairs of model outputs; a reward model learns to predict those preferences; reinforcement learning (commonly PPO) then fine-tunes the base model to maximize the learned reward.
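To make the middle step concrete, here is a minimal sketch of the reward-model objective. It assumes PyTorch, a toy feed-forward scorer, and random vectors standing in for encoded (prompt, response) pairs; a real reward model scores full transformer outputs, and the names here are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a fixed-size encoding of a (prompt, response) pair to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the human-preferred response should score higher,
    # so we maximize log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

# Toy usage: random vectors stand in for encoded (prompt, response) pairs.
rm = RewardModel()
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
loss = preference_loss(rm, chosen, rejected)
loss.backward()
```

The pairwise loss only needs relative judgments, which is exactly what human rankings provide; the RL step then optimizes the policy against this learned scorer, typically with a KL penalty that keeps it close to the original model.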
RLHF is a large part of why ChatGPT landed as a breakthrough product even though the underlying GPT-3.5 base model was already available before launch: the alignment step, not raw capability, is what made it usable. Constitutional AI (used by Anthropic) is a related technique that uses a written set of principles instead of (or alongside) human raters to generate the preference feedback.
Direct Preference Optimization (DPO) and other methods simplify the original RLHF pipeline by optimizing the policy directly on preference pairs, skipping the separate reward model and RL loop, but the core idea, aligning outputs with human preference signals, is the foundation of every modern frontier chat model.
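As a rough illustration of why DPO is simpler, here is a hedged sketch of its loss on per-response log-probabilities. The tensors are hypothetical summed log-probs from the policy and a frozen reference model, not outputs of any specific library, and beta = 0.1 is just a commonly used default.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # DPO optimizes the policy directly on preference pairs: the implicit
    # reward is beta * (log pi - log pi_ref), so the loss pushes the chosen
    # response's log-ratio above the rejected one's.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up summed log-probabilities for four preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -6.0, -4.5, -7.0]),
                torch.tensor([-6.5, -6.2, -5.0, -8.0]),
                torch.tensor([-5.5, -6.1, -4.8, -7.2]),
                torch.tensor([-6.0, -6.0, -5.1, -7.8]))
print(loss.item())
```

Because the implicit reward is the log-ratio between policy and reference, DPO pursues the same preference-alignment goal without training a separate reward model or running an RL loop.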