I Tested 4 RL Algorithms on 4 Environments and PPO Isn't Always King

From-scratch implementations of REINFORCE, PPO, GRPO, and CPGD reveal surprising winners and losers from CartPole to MountainCar. With a fixed budget, simpler often beats sophisticated.


Everyone says PPO is the gold standard of reinforcement learning. REINFORCE is "too simple." GRPO is the hot new thing from DeepSeek. I wanted to see if any of that holds up when you actually run them.

So I built all four from scratch, trained them on progressively harder environments, and let the results speak.


The Algorithms (Quick Version)

REINFORCE is the original policy gradient from 1992. Play a full episode, see how it went, nudge the policy in the right direction. Simple, but noisy.

PPO adds two things: a critic network that learns to predict how good each state is, and a clipping mechanism that stops the policy from changing too much in one update. This is what trained ChatGPT's RLHF stage.

GRPO skips the critic entirely. Instead of one action at a time, sample a group of actions for the same input, normalize their rewards within the group, and use that as your signal. DeepSeek used this to train their reasoning models.

CPGD is the simplest fix to REINFORCE's biggest problem. Instead of clipping the probability ratio (PPO) or the advantage (GRPO), it clips the gradient norm directly before each weight update.
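All four methods start from the same raw material: the discounted return of each timestep in an episode. A minimal sketch of that shared computation (the function name and default gamma are illustrative, not taken from the repo):

```python
# Reward-to-go: the Monte-Carlo return each policy gradient method builds on.
# gamma is a hypothetical discount factor; rewards is one episode's reward list.
def discounted_returns(rewards, gamma=0.99):
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g          # accumulate future reward, discounted
        returns.append(g)
    return list(reversed(returns)) # back into chronological order
```

Where the four algorithms differ is what they subtract from these returns (a baseline, a critic, or a group mean) and what they clip.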


The Environments (From Easy to Brutal)

CartPole: Balancing a Pole on a Cart

You push a cart left or right to keep a pole balanced upright. 4-dimensional state, 2 actions, dense reward. Every timestep the pole stays up you get +1. This is the "hello world" of RL.

|  ← pole
   _____|_____
   |  cart   |
───────────────

Solved threshold: 195 average reward.
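The 195 threshold is conventionally an average over a trailing window of episodes. A sketch of such a check, assuming the usual 100-episode window (the window size is my assumption, not from the article's code):

```python
# "Solved" check: average reward over the last `window` episodes must clear
# the threshold. The 100-episode window is an assumed convention.
def is_solved(episode_rewards, threshold=195.0, window=100):
    recent = episode_rewards[-window:]
    return len(recent) == window and sum(recent) / window >= threshold
```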

Acrobot: The Robot Gymnast

A two-link robotic arm hangs down. You control torque on the middle joint only. Goal: swing the tip of the lower link above a height threshold. You cannot control the top joint directly, so you have to pump energy into the system by timing your torques, like a gymnast building swing momentum.

Solved threshold: average -100 (fewer steps to goal is better).

LunarLander: Land the Spacecraft

A lander falls toward the lunar surface. You control three thrusters to land between two flags without crashing. Firing the engine costs reward, crashing costs a lot, and landing cleanly earns a bonus.

🚀
     /  \
────🚩────🚩────

The tricky part: the reward is multi-objective. Hover too long and you burn fuel. Move too fast and you crash. You have to find the efficient middle path across 8 state dimensions. Solved threshold: 200.

MountainCar: The Trap

A car sits at the bottom of a valley. Reach the flag on the right hill. The engine is too weak to drive straight up.

🚩
       /
      /
  🚗
      \______/

The car must first drive left to build momentum, then swing right and use that momentum to crest the hill. Reward: -1 every single step, nothing else. The agent has to accidentally reach the top before it learns anything useful. Solved threshold: -110.

This one is specifically designed to break naive policy gradient methods.


Round 1: CartPole (Simple)

Results: REINFORCE 439 | PPO 228 | CPGD 449

All three algorithms solved it. CPGD won. PPO came in last.

The PPO result looks wrong at first. PPO is the most sophisticated algorithm here, so why did it underperform?

Training budget. With a rollout buffer of 512 steps and CartPole episodes averaging around 200 steps each, PPO only gets a gradient update every few episodes. REINFORCE and CPGD update after every single episode. Over 400 episodes, PPO made roughly 3x fewer weight updates.

CartPole is short-horizon and has dense reward, so frequent simple updates beat infrequent sophisticated ones. PPO is overkill here and doesn't have enough time to warm up its critic.

CPGD's gradient clipping gives just enough stability to beat REINFORCE without the overhead of maintaining a value network.

# CPGD core update: clip the gradient, not the ratio
optimizer.zero_grad()
loss = -(log_probs * advantages).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
optimizer.step()

Round 2: Acrobot (Medium)

Results: REINFORCE -88 | PPO -86 | CPGD -230

REINFORCE and PPO tied. CPGD collapsed completely.

The CPGD collapse is visible in the learning curves: it starts learning around episode 100, reaches -150 by episode 200, then crashes to -500 around episode 350 before partially recovering.

What happened? As CPGD's policy improved, episode lengths changed dramatically. Early on, Acrobot episodes run long (the arm never reaches the goal and the episode hits the step limit). As the policy improves, episodes get short (the arm reaches the goal quickly). The exponential moving-average baseline lags behind these shifting episode returns, miscalculates the advantage, and a gradient update fires in completely the wrong direction. There is no critic to correct it.
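The failure mode is easy to see in miniature. A sketch, assuming the baseline is a plain exponential moving average of episode returns (the class and beta are illustrative): when returns shift quickly, the stale baseline distorts the advantage.

```python
# Illustrative EMA baseline. The advantage is computed against the *previous*
# baseline value, so a sudden shift in episode returns produces an outsized
# (and possibly wrong-signed) advantage before the average catches up.
class EMABaseline:
    def __init__(self, beta=0.5):
        self.beta = beta
        self.value = 0.0

    def advantage(self, episode_return):
        adv = episode_return - self.value  # measured against the stale baseline
        self.value = self.beta * self.value + (1 - self.beta) * episode_return
        return adv
```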

PPO and REINFORCE don't have this problem because their baselines are either more stable (PPO's critic) or absent entirely (REINFORCE just uses raw returns).

The variance plot says it all: REINFORCE flatlines near zero after episode 100 (consistent, learned a stable policy). PPO has moderate variance throughout. CPGD has massive spikes right at the collapse point.

Training Variance at Episode 350:
  REINFORCE: ~5    (learned and stable)
  PPO:       ~40   (still actively improving)
  CPGD:      ~120  (catastrophic collapse)

Round 3: LunarLander (Hard)

Results: REINFORCE -57 | PPO -74 | CPGD -2

CPGD won again. PPO was last again. None of them solved it (threshold: 200).

At this point the pattern is hard to deny: within a 600-episode budget, PPO consistently underperforms the simpler algorithms.

The reason is always the same. PPO's critic needs time to learn accurate state value estimates. Until the critic is calibrated, the advantage estimates are worse than just using raw returns. With limited episodes, you're essentially paying the cost of running a critic without getting its benefit yet.

CPGD's advantage here: it updates every single episode. 600 gradient updates vs. PPO's ~120. Frequency of learning beats quality of learning at this budget.
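The back-of-envelope arithmetic behind those counts (the ~100-step average episode length is my assumption, chosen to match the ~120 figure; the 512-step buffer comes from the CartPole section):

```python
episodes  = 600
avg_steps = 100   # assumed average LunarLander episode length
buffer    = 512   # PPO rollout buffer size from the CartPole section

cpgd_updates = episodes                        # one update per episode
ppo_updates  = episodes * avg_steps // buffer  # one update per full buffer
```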

The learning curves show REINFORCE and CPGD both trending upward steadily from episode 200 onward. PPO's curve is noisier and lower throughout.

No one is near the 200 threshold though. LunarLander at 600 episodes is genuinely unfinished business. A properly trained PPO at 2000+ episodes would likely dominate here.


Round 4: MountainCar (The Wall)

Results: REINFORCE -200 | PPO -200 | CPGD -200

Flat line. Every episode. All 800 of them. For all three algorithms.

The learning curves are perfectly horizontal. Variance is zero. Nobody ever reached the top once.

This is not a bug. This is the exploration problem in its purest form.

What the agent saw every episode:
  Episode 1:   -200
  Episode 2:   -200
  ...
  Episode 800: -200

Gradient signal = difference between outcomes
               = -200 minus -200
               = 0

Nothing to learn from.
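The arithmetic above can be checked in three lines. With every episode returning exactly -200, any baseline computed from those returns is also -200, so every advantage, and therefore every policy gradient, is exactly zero:

```python
# 800 identical episode returns -> baseline equals the returns -> zero signal.
returns = [-200.0] * 800
baseline = sum(returns) / len(returns)
advantages = [r - baseline for r in returns]  # all exactly 0.0
```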

MountainCar requires a fundamentally different approach. Curiosity-driven exploration, hindsight experience replay, or curriculum learning. Policy gradient optimization is the wrong tool entirely, regardless of how sophisticated the variant is.

The entropy bonus (0.05, much higher than in the other environments) was supposed to force random exploration. But even with nearly random behavior, the specific sequence of left-right oscillations needed to build sufficient momentum essentially never occurs within 200 steps by chance.
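For reference, the entropy bonus works by subtracting the policy's entropy from the loss, so the optimizer is rewarded for staying exploratory. A sketch (the 0.05 coefficient is from the text; the function names are illustrative):

```python
import math

# Entropy of a discrete action distribution: maximal when uniform,
# zero when deterministic.
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Subtracting entropy lowers the loss for more-random policies.
def loss_with_bonus(pg_loss, action_probs, ent_coef=0.05):
    return pg_loss - ent_coef * entropy(action_probs)
```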

This is an important distinction: MountainCar is not harder because the optimal policy is more complex. It's harder because the reward structure makes the learning problem itself qualitatively different.


GRPO on Multi-Armed Bandit

GRPO doesn't fit neatly into CartPole or LunarLander comparisons because it's designed for a different kind of problem: evaluating multiple responses to the same prompt. The natural fit is a bandit problem.

10 slot machines, each with a hidden mean reward. GRPO samples a group of 8 arms simultaneously, normalizes their rewards within the group, and uses those normalized advantages to update the policy.

import numpy as np

# Rewards must be an array so the normalization below broadcasts
rewards = np.array([bandit.pull(arm) for arm in sampled_arms])  # [r1, ..., r8]
mean = rewards.mean()
std  = rewards.std()
advantages = (rewards - mean) / (std + 1e-8)

Result: GRPO identified arm 3 (mean 3.05) as best. True best was arm 6 (mean 3.16). Gap of 0.11.

With 8 arms sampled from 10, arms 3 and 6 appear in the same group about 62% of updates. But single-pull noise (standard deviation ~1.0) makes a 0.11 difference statistically invisible without thousands of direct comparisons. GRPO essentially tied on this one.
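The 62% figure is a quick combinatorics check: for two specific arms to both land in a group of 8 drawn without replacement from 10, the first must be among the 8 chosen and the second among the remaining 7 of 9:

```python
# P(both arm 3 and arm 6 appear in a random 8-of-10 group, no replacement)
p_both = (8 / 10) * (7 / 9)   # = 56/90, about 0.62
```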

Increase group size to 10 (all arms every update) and it finds the true best arm reliably. The group size is a hyperparameter that matters enormously.

This is exactly why GRPO works well for LLMs: when you generate 8 different completions to the same prompt and score them all, the best completion is usually clearly better than the worst by a large margin, making the group normalization clean and informative.


The Complexity Heatmap

After running everything, normalizing scores to [0, 1] where 1 = solved, the picture looks like this:

                REINFORCE   PPO    CPGD
CartPole          1.00      1.00   1.00   ← all solved
Acrobot           1.00      1.00   0.67   ← CPGD falls behind
LunarLander       0.63      0.61   0.71   ← all struggling, none solved
MountainCar       0.00      0.00   0.00   ← universal failure

The three lines on the performance vs. complexity plot are nearly parallel. They all degrade at roughly the same rate. There's no clear crossover where PPO suddenly pulls ahead.

This challenges the hypothesis. The original prediction was: PPO's advantage grows with complexity. What actually happened: within a fixed training budget of 600 episodes, algorithm architecture matters less than number of gradient updates.


What This Actually Means

For simple environments with dense reward and short episodes: CPGD or even REINFORCE are fine. PPO is overkill and needs more budget to warm up its critic.

For medium complexity environments: Algorithm choice starts to matter, but the differences are smaller than expected at 600 episodes.

For hard environments with sparse reward: None of these algorithms solve the problem without fundamentally better exploration. The bottleneck is not PPO vs. REINFORCE, it's whether the agent ever sees a reward signal to learn from.

For LLM fine-tuning specifically: GRPO makes sense not because it's better at optimization but because it eliminates the need to train a separate value model, which is expensive and unstable at LLM scale. The "group" structure maps naturally to "generate multiple completions, score them all."

The real lesson is that the training budget creates a regime boundary. Below ~1000 episodes, simpler algorithms with more frequent updates often beat complex algorithms that need time to amortize their overhead. Above that boundary, PPO's stability and variance reduction compound into a genuine advantage. Most benchmark papers report results in the "above the boundary" regime. Most real applications start in the "below the boundary" regime.


Code Snapshot: All Four Algorithms Side by Side

# REINFORCE: raw discounted returns, no baseline, no clipping
loss = -(log_probs * discounted_returns).mean()

# PPO: critic baseline plus clipped probability ratio
ratio     = torch.exp(new_log_probs - old_log_probs)
advantage = returns - critic(state)
loss      = -torch.min(ratio * advantage,
                       torch.clamp(ratio, 0.8, 1.2) * advantage).mean()

# GRPO: group-normalized rewards replace the critic
group_rewards  = [env.step(a) for a in sampled_actions]
advantage      = (group_rewards - mean(group_rewards)) / std(group_rewards)
loss           = -torch.min(ratio * advantage,
                            torch.clamp(ratio, 0.8, 1.2) * advantage).mean()

# CPGD: simple baseline, clip the gradient norm instead of the ratio
loss = -(log_probs * (returns - baseline)).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
optimizer.step()

Four algorithms. One core idea: take actions that led to good outcomes more often. Four different answers to: what counts as "good" and how do you update stably?


Full implementation with all visualizations on GitHub: RL Algorithm Comparison GitHub repo. Everything ran on a Google Colab T4 GPU; total training time across all four environments was roughly 15 minutes.