Group Relative Policy Optimization (GRPO) is a reinforcement-learning algorithm for training language models from rewards. For each prompt, it samples a group of candidate responses and computes the relative advantage of each response from rewards within that group.
In simplified form, GRPO:
- generates several outputs from the current policy for the same prompt;
- scores each output with one or more reward functions;
- normalizes rewards relative to the group's mean and standard deviation; and
- updates the policy to increase the probability of above-average outputs while limiting divergence from a reference policy.
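The normalization step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the choice of the population standard deviation are all made up for the example, and real implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's rewards against the group's own statistics.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    `eps` (an assumed stabilizer) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive positive advantages and the rest receive negative ones. The policy update then raises the probability of the former and lowers that of the latter.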
GRPO is related to Proximal Policy Optimization (PPO) but removes the need to train a separate critic or value model, because the group's average reward serves as the baseline that a critic would otherwise estimate. This can reduce memory and computational requirements during large-model post-training.
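For readers who want the PPO connection spelled out, the objective is commonly written along the following lines. This paraphrases the outcome-reward formulation in the DeepSeekMath paper and the notation here is simplified. G is the group size, o_i is the i-th sampled output for prompt q, r_i is its reward, epsilon is the clipping range, and beta weights the penalty toward the reference policy.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}

J(\theta) = \mathbb{E}\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \Big(
    \min\big( \rho_{i,t}\, \hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \big)
    - \beta\, D_{\mathrm{KL}}\big[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big]
  \Big)
\right]
```

The clipped ratio term has the same shape as PPO's surrogate objective. The difference is that the advantage is computed from group reward statistics instead of a learned value function.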
The quality and diversity of each response group matter. If all samples are nearly identical, relative rewards provide little learning signal. If rewards are sparse, many groups may contain no successful response. Sampling temperature, group size, task difficulty, and curriculum design therefore influence training efficiency.
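A short calculation shows how group size interacts with sparse rewards. The sketch below assumes binary rewards and independent samples that each succeed with probability `p`. Both are simplifications, and the function name `mixed_group_probability` is hypothetical. A group yields a learning signal only when it contains at least one success and at least one failure.

```python
def mixed_group_probability(p, group_size):
    """Probability that a group holds both a success and a failure.

    Assumes each of the `group_size` samples succeeds independently with
    probability `p`. All-success and all-failure groups have zero reward
    variance, so every advantage in them is zero.
    """
    all_success = p ** group_size
    all_failure = (1.0 - p) ** group_size
    return 1.0 - all_success - all_failure


# Chance of an informative group on a hard task (1% per-sample success rate).
signal_by_group_size = {g: mixed_group_probability(0.01, g) for g in (4, 16, 64)}
```

Under these assumptions, larger groups raise the chance that a hard prompt produces any signal at all, at the cost of more generation per prompt. An easier curriculum raises `p` directly.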
GRPO is frequently used with Reinforcement Learning with Verifiable Rewards (RLVR). Mathematical answers, code tests, formatting checks, and other automatic verifiers provide rewards, while GRPO converts differences within a sampled group into policy updates.
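As an illustration of a verifiable reward, the sketch below scores a math response by exact match on its final answer. The `Answer:` line format and the function name `exact_match_reward` are assumptions made for the example. Real verifiers typically normalize answers more carefully, for instance by treating equivalent fractions as equal.

```python
import re


def exact_match_reward(response, reference):
    """Return 1.0 if the response's final answer matches the reference, else 0.0.

    Assumes (hypothetically) that responses state their result on a line
    of the form "Answer: <value>".
    """
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return 0.0  # no parseable answer counts as a failure
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


# One group of sampled responses for the same prompt, scored automatically.
responses = ["The total is 42.\nAnswer: 42", "Answer: 41", "I am not sure."]
rewards = [exact_match_reward(r, "42") for r in responses]
```

The resulting list of per-response rewards is what GRPO then normalizes within the group.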
As with any reward optimization method, GRPO can amplify flaws in the reward function. Multiple reward components may also conflict, and group normalization can make training behavior dependent on the sampled alternatives rather than an absolute quality scale.
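The dependence on sampled alternatives is easy to see numerically. In the sketch below, which uses made-up reward values and an assumed `normalize` helper with the population standard deviation, the same response with reward 0.6 is pushed up in a weak group and pushed down in a strong one.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-8):
    """Group-normalize rewards; `eps` is an assumed guard against zero variance."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# The first response has reward 0.6 in both groups (illustrative values).
weak_group = [0.6, 0.2, 0.1, 0.3]
strong_group = [0.6, 0.9, 0.8, 1.0]

adv_in_weak_group = normalize(weak_group)[0]      # above the group mean
adv_in_strong_group = normalize(strong_group)[0]  # below the group mean
```

The sign of the update for an identical output therefore flips depending on what else was sampled, which is why the advantage should not be read as an absolute quality score.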
GRPO was introduced in the DeepSeekMath paper (2024) and later used at larger scale for reasoning-model training, notably for DeepSeek-R1.