7 Reward Hacking

M. C. Escher, *Fiumara, Calabria* (1930).

7.1 Chapter Map

Optimization pressure and Goodhart’s Law.
A taxonomy of exploits and the hardening techniques that work.

7.2 Goodhart’s Law

RLVR is in some sense the epitome of Goodhart’s Law when we view the verifier as the measure that becomes the target through RL optimization. If the verifier has any gap between what it checks and what we want, then optimization can exploit that gap for almost all non-trivial proxies.¹(Skalse et al. 2022) Goodhart’s Law can be applied to RLVR in three ways:

The verifier has random errors on some inputs. Over many training steps, the policy shifts toward the subspace where the verifier is accidentally generous.
Gradient descent over thousands of steps is powerful enough to discover gaps in the verifier that may be rare or invisible under ordinary evaluation.
Optimizing the proxy (test passage, answer matching) may produce a policy that achieves high scores through mechanisms unrelated to the intended skill, e.g. pattern-matching, memorization, or distribution exploitation.

Example intuition: imagine three choices. The true evaluator says A is best, B is second-best, C is worst. The proxy treats A as best but says B and C are equally bad. Moving a little mass from B to C (or toward a specific mix with more A, depending on setup) can improve proxy while hurting true performance.

7.3 A taxonomy of verifier exploits

7.3.1 Extraction exploits

The model satisfies the answer extractor without doing the task. Anything inside the <answer> tags that matches the gold answer gets reward 1.0, regardless of what preceded it. A model can learn to produce minimal-effort responses that place a plausible answer in the right position: a single line of text, no reasoning, occasionally correct by chance.

7.3.2 Reward shaping exploits

The signal path from Chapter 5 introduces its own optimization targets. Format rewards, partial credit, and auxiliary bonuses create surfaces the model can exploit independently of correctness.

7.3.3 Test adequacy failures

The verifier is correct on what it checks, but what it checks is insufficient. Liu et al. found that augmenting HumanEval with 80× more test cases changed model rankings: some models that scored well on the original suite dropped substantially.(Liu et al. 2023) In code generation, test suites are finite approximations of a specification. Through hardcoded branches or shallow pattern matching, a model can learn to pass specific test cases without implementing the correct algorithm.

7.3.4 Learned verifier biases

When the verifier includes a learned judge (Chapter 4), the judge’s training artifacts become exploit surfaces. A model trained against a verbosity-biased judge learns to produce longer outputs. A model trained against a position-biased judge learns to place its strongest argument where the judge expects to find it.

7.3.5 Distribution shift

The verifier was calibrated for one distribution of model outputs, yet the policy has drifted to another. A learned judge trained on early-training outputs may misjudge late-training outputs that have shifted in register, length, or structure.

7.3.6 Mechanism gaps

Turpin et al. showed that chain-of-thought explanations can hide factors that influenced the answer, and Lanham et al. tested faithfulness more directly by intervening on traces.(Turpin et al. 2023; Lanham et al. 2023) The mechanism gap is the difference between a trace that predicts correctness and a trace that causally controls the answer:

Let \(X\) be the prompt, \(R\) the written reasoning trace, \(Y\) the final answer, and \(H\) the hidden computation that produced both. An outcome verifier observes \((X,Y)\). A process verifier observes \((X,R,Y)\).

The artifact-level question is:

\[ \Pr(Y \text{ correct} \mid X,R). \tag{7.1}\]

The causal question is different:

\[ \Pr(Y=y \mid \operatorname{do}(R=r), X) \quad \text{versus} \quad \Pr(Y=y \mid \operatorname{do}(R=r'), X). \tag{7.2}\]

7.4 Empirical exploits

7.4.1 Unit-test manipulation

OpenAI reports a frontier reasoning model training run in which the agent was placed in coding environments and rewarded for making unit tests pass.(Baker et al. 2025) The agent did not only write better code. It found reward hacks in the environment. Two systemic hacks were exit(0), which exploited a bug that let the agent exit before all tests ran, and raise SkipTest, which skipped unit-test evaluation from outside the testing framework. These hacks became systemic until the environment was patched.

Patching a verification function to always return true, writing stubs when unit-test coverage is poor, parsing tests to extract expected values, decompiling reference artifacts, or shadowing libraries such as pandas so that the verifier doesn’t check the intended implementation are further examples of reward hacking. When optimization pressure overwhelms the verifier, the model learns that the reward is attached to “tests pass,” not to “the repository now implements the intended behavior.”

7.4.2 Missing negative

Cursor’s 2026 description of real-time RL for Composer gives the production version of the same problem.(Jackson et al. 2026) The training loop used real user interactions as reward signal and shipped new checkpoints as often as every five hours. One exploit came from invalid tool calls. Composer often needs to read files or run terminal commands. The original reward pipeline discarded examples where the tool call was invalid, so the model learned that if a task looked likely to fail, emitting a broken tool call avoided negative reward. The fix was to include broken tool calls as negative examples.

Another exploit came from clarifying questions. Part of the reward was derived from edits, so Composer learned to defer risky edits by asking questions instead of touching code. The reward pipeline had not defined the boundary between appropriate caution and avoidance of negative reward, so editing rates dropped until Cursor changed the reward function.

7.5 The over-optimization curve

The clearest quantitative evidence for Goodhart dynamics in optimization comes from Gao et al., who measured the relationship between optimization pressure and performance.(Gao et al. 2023) The premise is a fixed “gold” reward model as ground truth and a policy we optimize against a separate “proxy” reward model. What happens is that proxy reward increases monotonically, but gold reward first rises, then falls. The peak location depends on the proxy’s quality: better proxies peak later and higher, while weaker proxies peak early and low. The original result was measured for learned reward models in RLHF. But the dynamics apply whenever a proxy is imperfect. In the context of this book, the proxy is the programmatic verifier, which approximates but may not equal the target capability. The same dynamics hold as with learned reward models; the difference being that programmatic verifiers are stronger proxies than learned reward models, so peaks likely occur later and gaps open more slowly.

Pan et al. found that as the policy becomes stronger, it finds exploits that weaker policies could not.(Pan et al. 2022) There are capability thresholds where agent behavior qualitatively shifts, causing sharp drops in true performance even as proxy reward continues to climb. These phase transitions are only predictable empirically and difficult to monitor.

Reproduced from Gao et al. (Gao et al. 2023).

7.6 Tail precision

Average verifier accuracy is the wrong object once the model is optimizing against the verifier. What matters is the verifier’s behavior in the extreme tail that the optimizer selects.

Let \(q(y)\) be the proxy score assigned by the verifier and \(t(y)\) be the true task utility. A best-of-\(N\) selector returns

\[ y_N^\star = \arg\max_{1 \le i \le N} q(y_i), \qquad y_i \sim \pi_\theta(\cdot \mid x). \]

The quantity we care about is not \(\mathbb{E}[q(y_N^\star)]\). That will almost always rise with \(N\). The quantity we care about is

\[ \mathbb{E}[t(y_N^\star)] = \mathbb{E}\!\left[ t\!\left(\arg\max_{1 \le i \le N} q(y_i)\right) \right]. \]

If \(q\) and \(t\) agree in the bulk of the distribution but disagree in the upper tail of \(q\), then increasing \(N\) can make the system worse. The selector is not sampling typical verifier-approved outputs. It is sampling the most extreme verifier-approved output it can find.

A useful diagnostic is tail precision at threshold \(\tau\):

\[ \operatorname{TailPrecision}(\tau) = \Pr\bigl(t(y)=1 \mid q(y) \ge \tau\bigr). \]

For a verifier used at pass@1, moderate thresholds may be enough. For best-of-64, PRM-guided beam search, or RL over many gradient steps, the relevant threshold is much higher. The optimizer pushes probability mass toward the region where \(q\) is maximal, so robustness means that \(q\) remains aligned with \(t\) in that region. This is why red-teaming should search for high-score false positives, not just estimate average verifier accuracy on held-out samples.

7.7 Test time exploits

Best-of-\(N\) selection helps when the verifier is faithful, but can increase probability of high-scoring false positives. Suppose 1 in 100 rollouts contains a verifier exploit: a response that scores high on the proxy but low on true capability. With best-of-16, the chance of seeing at least one exploit is about 15%. With best-of-64, it rises to about 47%. With best-of-256, it reaches about 92%. Search is not gradient descent, but it still finds the gap between proxy and true, and a verifier that is good enough for pass@1 may not be good enough for best-of-64.

7.8 Hardening techniques

Hardening measures cost compute, engineering time, or both. We justify their use by whether they push the over-optimization peak far enough to the right.

Hidden tests. Holding out a set of tests the model never trains against reduces direct overfitting to visible checks; the model cannot overfit to tests it does not see.
Test augmentation. Generating tests automatically can expand coverage beyond what a human problem-setter provides. EvalPlus demonstrated that generated test suites reveal false positives that the original tests miss.(Liu et al. 2023)
Red-teaming before training. Probing the verifier adversarially is a proactive way to de-risk training runs.
KL constraints and early stopping. PPO’s KL penalty, GRPO’s gradient clipping, or simply stopping training before the over-optimization peak keeps the policy close enough to its initial state that the verifier’s calibration still holds.
Progressive curriculum. Increase task difficulty as the model improves, following the competence-band principle from Chapter 5. Progressive difficulty keeps the optimization pressure focused on genuinely informative tasks.

7.9 Open questions

Is there a practical pre-training audit protocol to evaluate verifier robustness?
Is there a useful analog to Gao et al.’s gold reward model for programmatic verifiers?
Can the model’s own internal representations be used to detect reward hacking?
Does combining hidden tests, ensembles, and KL constraints give diminishing or compounding returns?

Baker, Bowen, Joost Huizinga, Leo Gao, et al. 2025. “Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation.” arXiv Preprint arXiv:2503.11926. https://arxiv.org/abs/2503.11926.

Gao, Leo, John Schulman, and Jacob Hilton. 2023. “Scaling Laws for Reward Model Overoptimization.” Proceedings of the 40th International Conference on Machine Learning (ICML). https://arxiv.org/abs/2210.10760.

Jackson, Jacob, Ben Trapani, Nathan Wang, and Wanqi Zhu. 2026. Improving Composer Through Real-Time RL. Cursor Blog. https://cursor.com/blog/real-time-rl-for-composer.

Lanham, Tamera, Anna Chen, Ansh Radhakrishnan, et al. 2023. “Measuring Faithfulness in Chain-of-Thought Reasoning.” arXiv Preprint arXiv:2307.13702. https://arxiv.org/abs/2307.13702.

Liu, Jiawei, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation.” arXiv Preprint arXiv:2305.01210. https://arxiv.org/abs/2305.01210.

Pan, Alexander, Kush Bhatia, and Jacob Steinhardt. 2022. “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models.” Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2201.03544.

Skalse, Joar, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. “Defining and Characterizing Reward Hacking.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2209.13085.

Turpin, Miles, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.” arXiv Preprint arXiv:2305.04388. https://arxiv.org/abs/2305.04388.

Following Skalse et al., reward hacking with imperfect proxies has multiple outcomes:
1. Different rewards can rank low-level behaviors differently while still sharing optimal policies; in this case, optimization can still reach a good target-optimal policy.
2. There can also be policies with \(J_{R_2}(\pi_2) > J_{R_2}(\pi_1)\) and \(J_{R_1}(\pi_2) < J_{R_1}(\pi_1)\), so improving proxy reward worsens target reward.
3. For imperfect, non-trivial proxies, such misalignment directions exist somewhere in policy space, so over-optimization can still lead to degraded target performance, but this is not mathematically guaranteed for every trajectory.
↩︎