Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a post-training method that uses human preferences to improve a model's behavior. It is commonly applied when desired qualities such as helpfulness, relevance, or safety are difficult to specify with a deterministic reward function.

A typical RLHF pipeline includes:

  1. collecting human comparisons or ratings of model outputs;
  2. training a reward model to predict those preferences; and
  3. optimizing the language model against that reward while limiting divergence from a reference policy.
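Steps 2 and 3 can be sketched numerically. The example below is a minimal illustration, not a definitive implementation: it assumes the reward model is trained with a Bradley-Terry pairwise loss (a common choice that this entry does not itself specify) and that divergence from the reference policy is limited by subtracting a per-sample log-probability ratio scaled by a hypothetical coefficient `beta`. The function names are illustrative.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one human comparison.

    Equals -log(sigmoid(reward_chosen - reward_rejected)), computed in a
    numerically stable softplus form.
    """
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # assumed penalty strength, chosen for illustration
) -> float:
    """Reward-model score minus a penalty for drifting from the reference policy.

    (logp_policy - logp_reference) is a single-sample estimate of the KL
    divergence between the trained policy and the reference policy.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # The loss is small when the reward model ranks the preferred output higher.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
    # The shaped reward drops as the policy assigns more probability than the reference.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5))
```

A larger `beta` keeps the optimized model closer to the reference policy at the cost of a lower achievable reward; the right value is an empirical choice.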

RLHF is used in Generative AI to make model behavior better match intended user and developer preferences. The result depends on the quality and diversity of the preference data and on the instructions given to human annotators. Preference data can encode inconsistent standards or demographic and cultural biases.

RLHF is distinct from Reinforcement Learning with Verifiable Rewards (RLVR), which computes rewards with an automatic verifier for tasks such as mathematics or code. Policy-optimization algorithms used in language-model training include Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO).
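To make the contrast concrete, the sketch below shows a verifier-style reward and the group-relative normalization that gives GRPO its name. It is a simplified illustration under stated assumptions: the exact-match check stands in for a real verifier (such as a math answer checker or a unit-test runner), the function names are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption rather than a fixed part of the algorithm.

```python
from statistics import mean, pstdev


def exact_match_reward(model_answer: str, reference_answer: str) -> float:
    """Toy verifier: 1.0 if the answer matches the reference exactly, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize the rewards of several responses sampled for the same prompt.

    Each response is scored relative to the group mean and scaled by the
    group standard deviation, so no separate value model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumed: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    answers = ["42", "41", " 42 ", "forty-two"]
    rewards = [exact_match_reward(a, "42") for a in answers]
    print(rewards)
    print(group_relative_advantages(rewards))
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.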

Optimizing a learned reward does not guarantee the intended behavior. A policy can exploit reward-model weaknesses, become excessively cautious, or sacrifice response diversity. Independent evaluation remains necessary after training.

Promptmetheus © 2023-present