6 Search and Test Time Verification

6.1 Chapter Map
- A verifier can improve a system without parameter updates.
- Disambiguating test vs train time compute.
6.2 Test Time
Chapters 2 through 5 treated the verifier as a source of training signal. The model generates rollouts, the verifier scores them, and the optimizer updates parameters. Test time verifier use is split up into the two groups, we’ll start with selection:
6.3 Selection
| Technique | Decision rule | Verifier use | Best when | Main limitation |
|---|---|---|---|---|
| Best-of-\(N\) with an ORM (Cobbe et al. 2021) | Score each completed candidate and return the top one | Post-hoc scoring over full outputs | Cheap parallel reranking is enough | Can only filter samples the policy already produced |
| Best-of-\(N\) with a PRM (Lightman et al. 2023) | Rank candidates by process quality rather than just final outcome | Step-level or intermediate scoring folded into a final rank | Harder problems where reasoning quality matters | Higher scoring cost per candidate |
| Self-consistency (Wang et al. 2022) | Sample multiple paths and vote or cluster by agreement | No external verifier; agreement acts as the signal | Answers can be canonicalized and consensus is informative | Correlated errors can still dominate |
When a PRM is used to rank complete solutions, the stepwise outputs must be reduced to one solution-level score:
\[ \text{Score}(x, y) = \operatorname{Agg}\bigl(\text{PRM}(s_1 \mid x), \ldots, \text{PRM}(s_K \mid x, s_{<K})\bigr). \]
Lightman et al. showed why this reduction matters at test time: on the MATH benchmark, PRM-based reranking in a best-of-\(N\) setting outperformed ORM-based reranking, with the gap widening as the number of candidates increased.(Lightman et al. 2023)
The remaining design choice is how to collapse step scores into a trajectory score. Math-Shepherd uses the minimum step score when reranking full solutions, reflecting the intuition that one invalid step can sink an otherwise plausible derivation.(Wang et al. 2024)
The basic arithmetic behind best-of-\(N\) is powerfully simple. Let \(p\) be the probability that a single sample is correct; therefore, a single sample is wrong with probability \(1 - p\). If we assume sample independence, the probability that all \(N\) samples are wrong is the product of those failure probabilities: \((1 - p)^N\). We can write the complement, which is at least one correct sample, as: \[ 1 - (1 - p)^N. \tag{6.1}\] pass@\(N\) sampling helps because repeated draws compound failure probability downward. Even a weak policy with \(p = 0.1\) has \[ 1 - 0.9^{10} \approx 0.65 \] chance of producing at least one correct solution among \(N = 10\) samples, and \[ 1 - 0.9^{20} \approx 0.88 \] among \(N = 20\). Real model samples are not truly independent, so the formula is an idealization rather than an exact law, but it captures the core reason best-of-\(N\) can buy large gains from modest per-sample competence.
6.3.1 pass@\(k\)
Chen et al. defined pass@\(k\): the probability that at least one of \(k\) samples passes all tests.(Chen et al. 2021) This metric quantifies how much the reported result depends on the evaluation protocol rather than the model. For example, the original Codex paper reported 28.8% pass@1 on HumanEval but 70.2% pass@100 from sampling alone.(Chen et al. 2021)
6.3.2 Selection under verifier noise
Equation 6.1 assumes that the checker is the target property; nevertheless, that may not always be the case, even in RLVR, e.g. a code patch that passes unit tests but silently removes input validation on an endpoint the tests never hit. We will model this discrepancy in this section. This is a binary-verifier simplification; the more general setting is score-based selection, where the main failure mode is misranking over verifier scores rather than acceptance precision alone.
Let \(C \in \{0,1\}\) denote true correctness and \(V \in \{0,1\}\) denote whether the verifier accepts a sample. If a single rollout has true success probability \(p = \Pr(C=1)\), then the following are true:
- Verifier true-positive rate \(\beta = \Pr(V=1 \mid C=1)\);
- Verifier false-positive rate \(\alpha = \Pr(V=1 \mid C=0)\);
- Then the probability of at least one accepted candidate among \(N\) samples is:
Equation 6.2 gives the chance that the verifier accepts at least one candidate after sampling \(N\) times.
\[ \Pr(\exists i : V_i = 1) = 1 - \bigl(1 - \beta p - \alpha(1-p)\bigr)^N. \tag{6.2}\]
The accepted pool, that is, the verifier tail, contains both true positives and false positives. Equation 6.3 gives the precision of that tail, i.e. among accepted samples, what fraction are genuinely correct?
\[ \Pr(C=1 \mid V=1) = \frac{\beta p}{\beta p + \alpha(1-p)}. \tag{6.3}\]
The numerator is “correct and accepted.” The denominator is “accepted for any reason,” including mistakes that slipped through.
If the unconditional probability, \(p\), that a sampled rollout is actually correct before any verifier check is small, even a low false-positive rate can dominate the accepted set because most samples are incorrect. Therefore, a small leak in the checker can still pollute the accepted tail. For a hard problem with \(p=0.05\) {5% base success}, \(\beta=0.9\) {90% true-positive rate}, and \(\alpha=0.01\) {1% false-positive rate}, the verifier-accepted tail is only
\[ \frac{0.9 \cdot 0.05}{0.9 \cdot 0.05 + 0.01 \cdot 0.95} \approx 0.83 \]
correct. If \(\alpha\) rises to \(0.05\), tail precision drops to
\[ \frac{0.9 \cdot 0.05}{0.9 \cdot 0.05 + 0.05 \cdot 0.95} \approx 0.49. \]
Best-of-\(N\) therefore depends on the verifier’s precision in the selected tail, not merely on its average accuracy. This bridges us to Chapter 7, where we discuss how more search increases both the chance of finding a correct sample and the surface area for finding a false positive that the verifier cannot reject.
6.3.3 Compute-optimal selection
One question which naturally arises from verification is the exploration/exploitation argument, with exploration corresponding to more generations and exploitation corresponding to more time spent on verification. Snell et al. asked: given a fixed compute budget, how should you split it between generating more candidates and spending more on verification?(Snell et al. 2024) Their conclusion is that the optimal allocation depends on problem difficulty. For hard problems where per-sample success is rare, PRM-guided selection can be 4x more efficient than naive best-of-\(N\), and a smaller model with more search can match or exceed the performance of a 14x larger model at matched compute.
6.4 Search: verifier as controller
| Technique | Control loop | Verifier use | Best when | Main limitation |
|---|---|---|---|---|
| PRM-guided beam search | Expand or prune partial branches online | Score intermediate states during generation | The verifier is fast enough to sit on the inner loop | Latency dominates if scoring is expensive |
| Draft-and-check loops | Generate, test, backtrack, retry | Gate progress with tests, compilation, or checkpoints | Code and agentic tasks allow cheap external checks | Retries can be slow and brittle |
| Tool-gated continuation | Call a tool, inspect the result, revise the next step | Treat tool outputs as live verification signals | Tool use grounds the next action | Behavior depends on tool quality and availability |
| MCTS with exact verification (Trinh et al. 2025) | Search a tree of actions under verifier feedback | Check each step exactly with a formal system | A deployable step-level verifier exists | Mostly limited to formal domains |
The difference here from selection is that search changes the output distribution, while selection only filters. A model that uses a verifier to prune branches, backtrack, and redirect can explore parts of the solution space that no single forward pass would reach. Search is more powerful, but also more expensive and more sensitive to verifier latency and accuracy.
For this chapter, only deployable test time verification counts. Test suites, proof kernels, live environments, and some learned judges can actually be run by the system at serving time.(Chen et al. 2021; Liu et al. 2023; Xin, Guo, et al. 2024; Xin, Ren, et al. 2024; Trinh et al. 2025) Benchmark-only answer-key grading in math is useful for measuring proposal quality, but it is not a deployable verifier and should not be confused with real test time capability.(Kydlicek 2025; Shao et al. 2024; DeepSeek-AI et al. 2025)
6.4.1 Search as controlled verification
Selection can be written as
\[ y^\star = \arg\max_{y_i \sim \pi_\theta(\cdot \mid x)} v(x,y_i), \qquad i = 1,\ldots,N, \]
where the verifier only acts after the model has produced complete candidates. Nothing in this formulation changes the path mid-stream; it only ranks finished products. Verifier outputs in search alter the future trajectory. A simple abstraction is a history-dependent controller:
\[ h_t = (x, a_1, o_1, \ldots, a_{t-1}, o_{t-1}), \qquad a_t \sim \pi_{\mathrm{search}}(\cdot \mid h_t), \]
Here \(h_t\) is the running record of what the system has tried and what the environment or verifier has said back so far. Observations \(o_t\) can include compiler errors, unit-test failures, proof-state feedback, retrieved documents, or learned verifier scores.
Once the verifier is embedded in the loop, the objective is no longer “sample \(N\) and choose the best.” It is closer to “steer a sequence of actions toward high final utility while paying separately for generation and checking”:
\[ \max_{\pi_{\mathrm{search}}} \mathbb{E}\!\left[ U(s_T) - \lambda_g \sum_{t=1}^{T} c_{\mathrm{gen}}(a_t) - \lambda_v \sum_{t=1}^{T} c_{\mathrm{verify}}(o_t) \right], \]
where \(U(s_T)\) is final utility, and the two costs represent generation and verification. This is why verifier latency matters. A verifier that is excellent but slow can be a good post-hoc ranker and a bad inner-loop controller, while a cheap noisy verifier can be useful for pruning but dangerous due to error compounding across steps.
6.5 Amortization
If best-of-\(N\) plus a test suite is so powerful, why bother with RL at all?
Latency. Search requires generating and scoring multiple candidates. An RL-trained model that has internalized the verified patterns produces a similar output in a single forward pass.
Cost. At deployment, compute is money. A model that has amortized search gains into its weights serves cheaper per query than one that requires \(N\) candidates and \(N\) verifier calls.
Amortized transfer. Search helps only when the verifier is available. A code model that learned robust patterns from RL on test-suite-verified tasks will generalize, at least partially, to coding tasks where no test suite exists; pure search cannot do this.
Exploration. Search over the current policy’s sampling distribution can only find solutions the policy can already almost produce. RL reinforces strategies that lead to verified success and suppresses strategies that do not, shifting the model’s probability mass toward better solutions.
6.6 Reporting results
A fair model report makes the policy, the verifier, and the test time compute budget legible as separate sources of performance:
Matched test time compute. The fairest comparison is the RL-trained model at pass@1 against the base model with search at the same total FLOP budget, amortizing the RL model’s training cost over all queries it will serve.
Explicit verifier access. State whether the reported result uses a verifier the deployed system could actually run at test time.
Separation of gains. Report pass@1 (no search), pass@\(N\) (selection), and search-guided results separately. This lets the reader see how much improvement comes from the policy, how much from selection, and how much from active search. A model with high pass@1 and modest pass@\(N\) has internalized most of the capability. A model with low pass@1 and high pass@\(N\) is leaning on search.
6.7 Open questions
- How should test time compute budgets scale with problem difficulty, model size, and verifier cost?
- Can learned verifiers be made fast enough for online search?
- How should the field standardize reporting to separate training gains from search gains?
- When does test time search amplify reward hacking rather than competence?
6.8 What comes next
Chapter 7 asks what happens when the verifier becomes the attack surface and optimization finds ways to satisfy the checker without solving the task.