Advancing LLM Fine-Tuning with Group Relative Policy Optimization (GRPO)

Reinforcement Learning (RL) has become a powerful technique for fine-tuning large models, especially Large Language Models (LLMs), to improve their performance on complex tasks. One of the latest innovations in this area is Group Relative Policy Optimization (GRPO), a new RL algorithm introduced by the DeepSeek team. GRPO was designed to tackle the challenges of training LLMs for advanced reasoning, such as solving complex math problems, by making the RL process both more effective and more efficient.

What to Expect in This Blog Post

In this post, we’ll dive into GRPO’s foundations and practical usage. We will:

  • Explore GRPO vs. Traditional RL Techniques: Compare GRPO with methods like PPO and examine the theoretical principles that make GRPO work.
  • Discuss Real-World Applications: Use DeepSeek’s successes as a case study to illustrate the impact of GRPO.
  • Introduce Implementation Guides: Provide an overview of upcoming code snippets using popular RL libraries—Hugging Face’s TRL and ByteDance’s veRL—to help you try out GRPO in your own projects.

Whether you’re an AI researcher or an ML engineer, this overview will help you grasp GRPO’s potential and understand its context within the broader RL landscape.

What is GRPO?

In essence, GRPO is a variant of the popular Proximal Policy Optimization (PPO) algorithm tailored for scenarios where only final outcomes are rewarded (as is often the case with LLM outputs) and where training resources are at a premium. The DeepSeek researchers created GRPO while developing their DeepSeekMath and DeepSeek-R1 models, which achieved remarkable results on mathematical reasoning benchmarks. For instance, DeepSeekMath 7B (a 7-billion-parameter model fine-tuned with GRPO) scored 51.7% on the MATH competition benchmark, approaching the performance of models like GPT-4. This was a huge leap for open models in that domain, underscoring GRPO’s significance in pushing the limits of LLM reasoning.

Why GRPO? The Motivation

Traditional RL methods like PPO have been the go-to for fine-tuning LLMs with human feedback or rule-based rewards. However, applying PPO directly to LLMs exposes some inefficiencies and challenges:

  • Actor-Critic Overhead: PPO is an actor-critic method, meaning it requires training a separate value function (critic) alongside the policy (the model generating answers). In LLM fine-tuning, the value model often needs to be as large as the language model itself (to predict rewards for each token), which roughly doubles the memory and compute requirements.
  • Token-Level Reward Challenges: Language tasks typically provide a single reward at the end of a generated sequence (e.g., whether a final answer is correct), making it hard to accurately train a token-level value function.

GRPO was proposed to overcome these issues. By redesigning how the advantage (the signal of how good an action was compared to a baseline) is calculated, GRPO eliminates the need for a separate value network, dramatically reducing resource usage. At the same time, it leverages relative comparisons between multiple outputs to stabilize training, which is particularly well-suited for LLM reward models that often learn from comparisons between different answers.

How GRPO Differs from Traditional Reinforcement Learning Techniques

To appreciate GRPO’s innovation, it’s important to understand how it diverges from more traditional RL methods. The baseline we’ll compare against is Proximal Policy Optimization (PPO), since GRPO is explicitly described as a variant of PPO. PPO itself is a refinement of policy gradient methods that became popular for its stability and reliability in training neural policies. Let’s first briefly recall how PPO works for language model fine-tuning, then highlight the key differences introduced by GRPO.

Recap of PPO for LLMs:

Here’s a quick rundown of how PPO works in the context of LLM fine-tuning:

  • Prompt and Response: A prompt is provided to the model, which then generates a response (or action sequence).
  • Reward Calculation: A reward is computed for the response (using a reward model or heuristic).
  • Advantage Estimation: The advantage is calculated—typically via Generalized Advantage Estimation (GAE)—which measures how much better the action was compared to the value function’s prediction.
  • Policy Update: The policy is updated to increase the probability of good actions and decrease that of bad ones, while a clipped objective ensures the new policy doesn’t stray too far from the old one: \( L_\text{PPO} = \mathbb{E}\big[\min\big(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\big] \)

Here, \( r_t(\theta)\) is the ratio of new to old policy probability, \(\hat{A}_t\) is the estimated advantage, and \(\epsilon\) is a clipping threshold.
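To make the clipped objective concrete, here is a minimal PyTorch sketch of how it could be computed from sampled log-probabilities. The function and tensor names, and the value of \(\epsilon\), are illustrative assumptions rather than any particular library’s API.

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Minimal sketch of the PPO clipped surrogate objective.

    new_logprobs / old_logprobs: log-probabilities of the sampled tokens under
    the current and old policies; advantages: estimated advantages (e.g. via GAE).
    All tensors share the same shape; names are illustrative.
    """
    # Probability ratio r_t(theta) = pi_new / pi_old, computed in log space.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the minimum of the two terms; return the negated mean
    # so it can be minimized with a standard optimizer.
    return -torch.min(unclipped, clipped).mean()
```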

Additional considerations for LLM fine-tuning include:

  • Token-Level Reward Assignment: Since only the final output typically receives a reward, the reward may need to be artificially propagated to earlier tokens, which can introduce errors.
  • KL-Divergence Regularization: A reference model is often used to measure divergence, and a KL penalty is added to ensure that the new policy does not deviate too far from the pre-trained behavior (a minimal sketch of this reward shaping follows this list).
  • Resource Demands: Because the critic (value model) must be as large as the policy, this effectively doubles the model’s size and the training overhead.
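The first two points often come together in PPO-based RLHF pipelines: a per-token KL penalty against a frozen reference model is folded into the reward, and the sequence-level reward is assigned to the final token. Below is a hedged sketch of that shaping step; the function name, tensor names, and the value of beta are assumptions for illustration.

```python
import torch

def kl_shaped_rewards(seq_reward, policy_logprobs, ref_logprobs, beta=0.05):
    """Sketch of KL-regularized reward shaping for PPO-style RLHF.

    seq_reward: scalar reward for the whole response (e.g. from a reward model).
    policy_logprobs / ref_logprobs: per-token log-probs of the generated tokens
    under the current policy and a frozen reference model (shape: [seq_len]).
    Names and the value of beta are illustrative.
    """
    # Per-token estimate of KL(policy || reference) for the sampled tokens.
    kl = policy_logprobs - ref_logprobs
    rewards = -beta * kl                    # penalize drift at every token
    rewards[-1] = rewards[-1] + seq_reward  # sequence reward lands on the final token
    return rewards
```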

Enter GRPO: The Group-Based Advantage

Group Relative Policy Optimization modifies the above paradigm primarily by removing the critic entirely and altering how advantages are computed. Instead of learning a value function to serve as a baseline for advantage, GRPO uses a group of policy outputs to establish a baseline. Concretely, for each query (prompt) encountered during training, GRPO generates multiple responses (a group of outputs) using the old policy (the current model before updating on this query). It then computes the reward for each of these outputs, and uses some aggregate of those rewards (most simply, the average) as a baseline or “reference” reward. The advantage for each output in the group is calculated as:

\( \hat{A}_i = R_i - \bar{R} \)

where \(R_i\) is the reward for the \(i\)-th output in the group, and \(\bar{R}\) is the average reward of all outputs in the group for the same prompt. In other words, each answer’s quality is judged relative to its peers generated from the same state (the same prompt) rather than relative to a separate value model’s prediction.
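In code, the group baseline boils down to a few lines. Here is a minimal sketch, assuming `rewards` holds the scores of the G responses sampled for a single prompt; the plain mean subtraction matches the equation above, and the optional standard-deviation normalization reflects the scaling used in DeepSeek’s papers.

```python
import torch

def group_relative_advantages(rewards, normalize_std=False, eps=1e-8):
    """Compute GRPO-style advantages for one prompt.

    rewards: 1-D tensor of rewards for the G responses sampled from the same
    prompt. With normalize_std=True, advantages are also divided by the group's
    reward standard deviation, as in DeepSeek's formulation.
    """
    baseline = rewards.mean()        # the group's average reward
    advantages = rewards - baseline  # A_i = R_i - R_bar
    if normalize_std:
        advantages = advantages / (rewards.std() + eps)
    return advantages
```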

This approach yields a few key differences from standard PPO:

  • No Value Network (Critic): The most striking difference is that GRPO does not train a value function. It foregoes the critic model entirely, instead using the group’s reward statistics as a baseline. This cuts down memory and compute drastically – effectively, you no longer need to maintain a second large network. DeepSeek’s researchers highlighted that this significantly reduces training resources compared to PPO.
  • Group-Based Advantage: In PPO, the advantage \(\hat{A}_t\) estimates how much better an action was than the value function’s prediction. In GRPO, advantage is explicitly measured against other sampled outcomes for the same prompt. If an output is better than the average of its group, it has a positive advantage; if worse, negative. This aligns very naturally with how many RLHF reward models are trained – often, a reward model learns to compare two or more outputs for the same prompt and say which is better. GRPO leverages that by only comparing model outputs with each other, not against an absolute scale.
  • On-Policy but with Sampling Overhead: GRPO is still an on-policy algorithm (like PPO, it uses the current policy’s behavior to update itself). But because it needs a group of outputs per query to compute the baseline, you might generate, say, G different responses for each prompt instead of one. This increases the sampling (generation) cost per update by a factor of G. However, these are not random rollouts in an environment but multiple outputs evaluated for the same state, which can often be produced in parallel (e.g., generating several continuations for the same prompt in one batch).
  • No explicit value loss or value bootstrap: By removing the critic, GRPO also removes the need to compute a value loss (the loss that trains the critic in PPO by regression to the observed returns). In PPO, balancing the policy loss and value loss is often a hassle (ensuring neither dominates training). GRPO simplifies this – the optimization focuses purely on improving policy outcomes relative to the group baseline.
  • The KL Penalty Still Remains: Importantly, GRPO doesn’t throw caution to the wind. It typically retains a KL-divergence regularization similar to PPO’s. The objective in GRPO formulations includes a term to penalize the policy if it deviates too far from a reference policy (usually the pre-trained model). In practice, DeepSeek applied a KL penalty term with coefficient \(\beta\), much as in PPO. This means GRPO does not let the model exploit the reward function unchecked; it keeps the policy within reasonable bounds of the original model’s behavior, preventing the kind of degeneration or reward hacking that can occur in unconstrained RL. A sketch combining these pieces follows this list.
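Putting the pieces from this list together, here is a compact sketch of a GRPO-style loss for a single prompt. It is a simplification (per-response summed log-probs and a naive KL estimate) rather than DeepSeek’s exact token-level formulation, and all names and coefficient values are illustrative assumptions.

```python
import torch

def grpo_loss(new_logprobs, old_logprobs, ref_logprobs, group_rewards,
              eps=0.2, beta=0.04):
    """Simplified GRPO-style loss for the G responses sampled from one prompt.

    new_logprobs / old_logprobs / ref_logprobs: per-response (summed over tokens)
    log-probs under the current, old, and frozen reference policies, shape (G,).
    group_rewards: rewards for the G responses. Names and coefficients are
    illustrative, not DeepSeek's exact hyperparameters.
    """
    # Group-relative advantage: each response is judged against its peers.
    adv = group_rewards - group_rewards.mean()

    # PPO-style clipped surrogate, reusing the group-based advantage.
    ratio = torch.exp(new_logprobs - old_logprobs)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv)

    # Simple KL estimate that keeps the policy near the reference model.
    kl = new_logprobs - ref_logprobs

    # Maximize (surrogate - beta * KL); return the negated mean as a loss.
    return -(surrogate - beta * kl).mean()
```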

To summarize, GRPO differs from traditional PPO-style RL in that it replaces the learned baseline (value function) with a relative, sample-based baseline. This clever tweak addresses a practical problem: in language model fine-tuning, learning a precise value function is hard and expensive, so why not use the model’s own performance as the baseline? By doing so, GRPO makes RL training of LLMs more efficient (no critic to train) and more aligned with how rewards are usually obtained (via comparisons). However, it comes at the cost of needing more sampling per iteration and relies on having a reliable reward model to evaluate those samples. In the coming blog posts, we’ll dig a bit deeper into the theory behind GRPO’s core principle and why it works.
