Reinforcement Learning with Verifiable Rewards (RLVR)

Reinforcement Learning with Verifiable Rewards (RLVR) is a post-training method that improves a language model using rewards computed by an automatic verifier. The verifier checks whether a generated answer or action satisfies an objective criterion, allowing training to proceed without a human preference label for every sample.

RLVR is well suited to domains where outcomes can be checked programmatically, including:

mathematics with known or symbolically equivalent answers;
code evaluated by tests or execution;
formal proofs checked by a proof assistant;
structured outputs validated against schemas and business rules; and
tasks with measurable simulator or game outcomes.

The training loop samples one or more responses from the current policy, assigns rewards with the verifier, and updates the model to increase the probability of higher-reward behavior. Algorithms such as Group Relative Policy Optimization (GRPO) can perform these updates without a separate learned value model.

RLVR differs from Reinforcement Learning from Human Feedback (RLHF). RLHF commonly relies on human preferences or a reward model trained from them. RLVR uses an externally checkable outcome, which can be cheaper, less subjective, and harder to miscalibrate.

The verifier defines what the model is rewarded for and is therefore part of the specification. Weak verifiers can create reward hacking: a response satisfies the checker while violating the intended task. Passing a unit test, for example, does not prove maintainability, security, or broader correctness.

RLVR has become important in training reasoning models, where verifiable final answers can reinforce useful strategies such as checking work and changing approach. The DeepSeek-R1 paper reports the emergence of reasoning behaviors through large-scale reinforcement learning on verifiable tasks.

The LLM Knowledge Base is a collection of bite-sized explanations for commonly used terms and abbreviations related to Large Language Models and Generative AI.

It's an educational resource that helps you stay up-to-date with the latest developments in AI research and its applications.

Reinforcement Learning from Human Feedback (RLHF)

Reasoning Model

Quantization

Proximal Policy Optimization (PPO)

Proprietary Model

Reranking

Retrieval-Augmented Generation (RAG)

Role prompting

Small Language Model (SLM)

Structured Output