6 Search and Test Time Verification

M. C. Escher, *Calanques de Piana* (1928).

6.1 Chapter Map

A verifier can improve a system without parameter updates.
Disambiguating test vs train time compute.

6.2 Test Time

Chapters 2 through 5 treated the verifier as a source of training signal. The model generates rollouts, the verifier scores them, and the optimizer updates parameters. Test time verifier use is split up into the two groups, we’ll start with selection:

6.3 Selection

Technique	Decision rule	Verifier use	Best when	Main limitation
Best-of-\(N\) with an ORM (Cobbe et al. 2021)	Score each completed candidate and return the top one	Post-hoc scoring over full outputs	Cheap parallel reranking is enough	Can only filter samples the policy already produced
Best-of-\(N\) with a PRM (Lightman et al. 2023)	Rank candidates by process quality rather than just final outcome	Step-level or intermediate scoring folded into a final rank	Harder problems where reasoning quality matters	Higher scoring cost per candidate
Self-consistency (Wang et al. 2022)	Sample multiple paths and vote or cluster by agreement	No external verifier; agreement acts as the signal	Answers can be canonicalized and consensus is informative	Correlated errors can still dominate

When a PRM is used to rank complete solutions, the stepwise outputs must be reduced to one solution-level score:

\[ \text{Score}(x, y) = \operatorname{Agg}\bigl(\text{PRM}(s_1 \mid x), \ldots, \text{PRM}(s_K \mid x, s_{<K})\bigr). \]

Lightman et al. showed why this reduction matters at test time: on the MATH benchmark, PRM-based reranking in a best-of-\(N\) setting outperformed ORM-based reranking, with the gap widening as the number of candidates increased.(Lightman et al. 2023)

The remaining design choice is how to collapse step scores into a trajectory score. Math-Shepherd uses the minimum step score when reranking full solutions, reflecting the intuition that one invalid step can sink an otherwise plausible derivation.(Wang et al. 2024)

The basic arithmetic behind best-of-\(N\) is powerfully simple. Let \(p\) be the probability that a single sample is correct; therefore, a single sample is wrong with probability \(1 - p\). If we assume sample independence, the probability that all \(N\) samples are wrong is the product of those failure probabilities: \((1 - p)^N\). We can write the complement, which is at least one correct sample, as: \[ 1 - (1 - p)^N. \tag{6.1}\] pass@\(N\) sampling helps because repeated draws compound failure probability downward. Even a weak policy with \(p = 0.1\) has \[ 1 - 0.9^{10} \approx 0.65 \] chance of producing at least one correct solution among \(N = 10\) samples, and \[ 1 - 0.9^{20} \approx 0.88 \] among \(N = 20\). Real model samples are not truly independent, so the formula is an idealization rather than an exact law, but it captures the core reason best-of-\(N\) can buy large gains from modest per-sample competence.

6.3.1 pass@\(k\)

Chen et al. defined pass@\(k\): the probability that at least one of \(k\) samples passes all tests.(Chen et al. 2021) This metric quantifies how much the reported result depends on the evaluation protocol rather than the model. For example, the original Codex paper reported 28.8% pass@1 on HumanEval but 70.2% pass@100 from sampling alone.(Chen et al. 2021)

Figure 6.1: Exact AIME24 pass@k values for DeepScaleR-1.5B-Preview before and after micro-budget RLVR.(Khan et al. 2026)

6.3.2 Selection under verifier noise

Equation 6.1 assumes that the checker is the target property; nevertheless, that may not always be the case, even in RLVR, e.g. a code patch that passes unit tests but silently removes input validation on an endpoint the tests never hit. We will model this discrepancy in this section. This is a binary-verifier simplification; the more general setting is score-based selection, where the main failure mode is misranking over verifier scores rather than acceptance precision alone.

Let \(C \in \{0,1\}\) denote true correctness and \(V \in \{0,1\}\) denote whether the verifier accepts a sample. If a single rollout has true success probability \(p = \Pr(C=1)\), then the following are true:

Verifier true-positive rate \(\beta = \Pr(V=1 \mid C=1)\);
Verifier false-positive rate \(\alpha = \Pr(V=1 \mid C=0)\);
Then the probability of at least one accepted candidate among \(N\) samples is:

Equation 6.2 gives the chance that the verifier accepts at least one candidate after sampling \(N\) times.

\[ \Pr(\exists i : V_i = 1) = 1 - \bigl(1 - \beta p - \alpha(1-p)\bigr)^N. \tag{6.2}\]

The accepted pool, that is, the verifier tail, contains both true positives and false positives. Equation 6.3 gives the precision of that tail, i.e. among accepted samples, what fraction are genuinely correct?

\[ \Pr(C=1 \mid V=1) = \frac{\beta p}{\beta p + \alpha(1-p)}. \tag{6.3}\]

The numerator is “correct and accepted.” The denominator is “accepted for any reason,” including mistakes that slipped through.

If the unconditional probability, \(p\), that a sampled rollout is actually correct before any verifier check is small, even a low false-positive rate can dominate the accepted set because most samples are incorrect. Therefore, a small leak in the checker can still pollute the accepted tail. For a hard problem with \(p=0.05\) {5% base success}, \(\beta=0.9\) {90% true-positive rate}, and \(\alpha=0.01\) {1% false-positive rate}, the verifier-accepted tail is only

\[ \frac{0.9 \cdot 0.05}{0.9 \cdot 0.05 + 0.01 \cdot 0.95} \approx 0.83 \]

correct. If \(\alpha\) rises to \(0.05\), tail precision drops to

\[ \frac{0.9 \cdot 0.05}{0.9 \cdot 0.05 + 0.05 \cdot 0.95} \approx 0.49. \]

Best-of-\(N\) therefore depends on the verifier’s precision in the selected tail, not merely on its average accuracy. This bridges us to Chapter 7, where we discuss how more search increases both the chance of finding a correct sample and the surface area for finding a false positive that the verifier cannot reject.

6.3.3 Compute-optimal selection

One question which naturally arises from verification is the exploration/exploitation argument, with exploration corresponding to more generations and exploitation corresponding to more time spent on verification. Snell et al. asked: given a fixed compute budget, how should you split it between generating more candidates and spending more on verification?(Snell et al. 2024) Their conclusion is that the optimal allocation depends on problem difficulty. For hard problems where per-sample success is rare, PRM-guided selection can be 4x more efficient than naive best-of-\(N\), and a smaller model with more search can match or exceed the performance of a 14x larger model at matched compute.

6.4 Search: verifier as controller

Technique	Control loop	Verifier use	Best when	Main limitation
PRM-guided beam search	Expand or prune partial branches online	Score intermediate states during generation	The verifier is fast enough to sit on the inner loop	Latency dominates if scoring is expensive
Draft-and-check loops	Generate, test, backtrack, retry	Gate progress with tests, compilation, or checkpoints	Code and agentic tasks allow cheap external checks	Retries can be slow and brittle
Tool-gated continuation	Call a tool, inspect the result, revise the next step	Treat tool outputs as live verification signals	Tool use grounds the next action	Behavior depends on tool quality and availability
MCTS with exact verification (Trinh et al. 2025)	Search a tree of actions under verifier feedback	Check each step exactly with a formal system	A deployable step-level verifier exists	Mostly limited to formal domains

The difference here from selection is that search changes the output distribution, while selection only filters. A model that uses a verifier to prune branches, backtrack, and redirect can explore parts of the solution space that no single forward pass would reach. Search is more powerful, but also more expensive and more sensitive to verifier latency and accuracy.

For this chapter, only deployable test time verification counts. Test suites, proof kernels, live environments, and some learned judges can actually be run by the system at serving time.(Chen et al. 2021; Liu et al. 2023; Xin, Guo, et al. 2024; Xin, Ren, et al. 2024; Trinh et al. 2025) Benchmark-only answer-key grading in math is useful for measuring proposal quality, but it is not a deployable verifier and should not be confused with real test time capability.(Kydlicek 2025; Shao et al. 2024; DeepSeek-AI et al. 2025)

6.4.1 Search as controlled verification

Selection can be written as

\[ y^\star = \arg\max_{y_i \sim \pi_\theta(\cdot \mid x)} v(x,y_i), \qquad i = 1,\ldots,N, \]

where the verifier only acts after the model has produced complete candidates. Nothing in this formulation changes the path mid-stream; it only ranks finished products. Verifier outputs in search alter the future trajectory. A simple abstraction is a history-dependent controller:

\[ h_t = (x, a_1, o_1, \ldots, a_{t-1}, o_{t-1}), \qquad a_t \sim \pi_{\mathrm{search}}(\cdot \mid h_t), \]

Here \(h_t\) is the running record of what the system has tried and what the environment or verifier has said back so far. Observations \(o_t\) can include compiler errors, unit-test failures, proof-state feedback, retrieved documents, or learned verifier scores.

Once the verifier is embedded in the loop, the objective is no longer “sample \(N\) and choose the best.” It is closer to “steer a sequence of actions toward high final utility while paying separately for generation and checking”:

\[ \max_{\pi_{\mathrm{search}}} \mathbb{E}\!\left[ U(s_T) - \lambda_g \sum_{t=1}^{T} c_{\mathrm{gen}}(a_t) - \lambda_v \sum_{t=1}^{T} c_{\mathrm{verify}}(o_t) \right], \]

where \(U(s_T)\) is final utility, and the two costs represent generation and verification. This is why verifier latency matters. A verifier that is excellent but slow can be a good post-hoc ranker and a bad inner-loop controller, while a cheap noisy verifier can be useful for pruning but dangerous due to error compounding across steps.

6.5 Amortization

If best-of-\(N\) plus a test suite is so powerful, why bother with RL at all?

Latency. Search requires generating and scoring multiple candidates. An RL-trained model that has internalized the verified patterns produces a similar output in a single forward pass.

Cost. At deployment, compute is money. A model that has amortized search gains into its weights serves cheaper per query than one that requires \(N\) candidates and \(N\) verifier calls.

Amortized transfer. Search helps only when the verifier is available. A code model that learned robust patterns from RL on test-suite-verified tasks will generalize, at least partially, to coding tasks where no test suite exists; pure search cannot do this.

Exploration. Search over the current policy’s sampling distribution can only find solutions the policy can already almost produce. RL reinforces strategies that lead to verified success and suppresses strategies that do not, shifting the model’s probability mass toward better solutions.

6.6 Reporting results

A fair model report makes the policy, the verifier, and the test time compute budget legible as separate sources of performance:

Matched test time compute. The fairest comparison is the RL-trained model at pass@1 against the base model with search at the same total FLOP budget, amortizing the RL model’s training cost over all queries it will serve.

Explicit verifier access. State whether the reported result uses a verifier the deployed system could actually run at test time.

Separation of gains. Report pass@1 (no search), pass@\(N\) (selection), and search-guided results separately. This lets the reader see how much improvement comes from the policy, how much from selection, and how much from active search. A model with high pass@1 and modest pass@\(N\) has internalized most of the capability. A model with low pass@1 and high pass@\(N\) is leaning on search.

6.7 Open questions

How should test time compute budgets scale with problem difficulty, model size, and verifier cost?
Can learned verifiers be made fast enough for online search?
How should the field standardize reporting to separate training gains from search gains?
When does test time search amplify reward hacking rather than competence?

6.8 What comes next

Chapter 7 asks what happens when the verifier becomes the attack surface and optimization finds ways to satisfy the checker without solving the task.

Chen, Mark, Jerry Tworek, Heewoo Jun, et al. 2021. “Evaluating Large Language Models Trained on Code.” arXiv Preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374.

Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, et al. 2021. “Training Verifiers to Solve Math Word Problems.” arXiv Preprint arXiv:2110.14168. https://arxiv.org/abs/2110.14168.

DeepSeek-AI, Daya Guo, Dejian Yang, et al. 2025. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv Preprint arXiv:2501.12948. https://arxiv.org/abs/2501.12948.

Khan, Zohaib, Omer Tafveez, and Zoha Hayat Bhatti. 2026. “Plasticity Vs. Rigidity: The Impact of Low-Rank Adapters on Reasoning on a Micro-Budget.” arXiv Preprint arXiv:2601.06677. https://arxiv.org/abs/2601.06677.

Kydlicek, Hynek. 2025. Math-Verify: Math Verification Library. V. 0.6.1. Released. https://github.com/huggingface/Math-Verify.

Lightman, Hunter, Vineet Kosaraju, Yura Burda, et al. 2023. “Let’s Verify Step by Step.” arXiv Preprint arXiv:2305.20050. https://arxiv.org/abs/2305.20050.

Liu, Jiawei, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation.” arXiv Preprint arXiv:2305.01210. https://arxiv.org/abs/2305.01210.

Shao, Zhihong, Peiyi Wang, Qihao Zhu, et al. 2024. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv Preprint arXiv:2402.03300. https://arxiv.org/abs/2402.03300.

Snell, Charlie, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. “Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.” arXiv Preprint arXiv:2408.03314. https://arxiv.org/abs/2408.03314.

Trinh, Trieu H., Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. 2025. “Solving Olympiad Geometry Without Human Demonstrations.” arXiv Preprint arXiv:2510.01346. https://arxiv.org/abs/2510.01346.

Wang, Peiyi, Lei Li, Zhihong Shao, et al. 2024. “Math-Shepherd: Verify and Reinforce LLMs Step-by-Step Without Human Annotations.” arXiv Preprint arXiv:2312.08935. https://arxiv.org/abs/2312.08935.

Wang, Xuezhi, Jason Wei, Dale Schuurmans, et al. 2022. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” arXiv Preprint arXiv:2203.11171. https://arxiv.org/abs/2203.11171.

Xin, Huajian, Daya Guo, Zhihong Shao, et al. 2024. “DeepSeek-Prover: Advancing Theorem Proving in LLMs Through Large-Scale Synthetic Data.” arXiv Preprint arXiv:2405.14333. https://arxiv.org/abs/2405.14333.

Xin, Huajian, Z. Z. Ren, Junxiao Song, et al. 2024. “DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search.” arXiv Preprint arXiv:2408.08152. https://arxiv.org/abs/2408.08152.