7 Search and Test-Time Verification
7.1 Chapter Map
- Explain what verifiers enable at inference time, not just during training.
- Distinguish the immediate gains that search and reranking provide from the gains that are amortized into the policy itself.
7.2 Main Argument
Modern RLVR is inseparable from inference-time compute. Self-consistency, reranking, draft-and-check loops, tool-augmented checking, and structured search all let a verifier improve outputs at inference time, before the policy has internalized the behavior.
This chapter should make the training-versus-search boundary explicit. Readers should leave able to separate gains from better policies, better search, and better verifiers.
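As a point of reference for the boundary described above, self-consistency can be sketched in a few lines. This is a minimal illustration, assuming each sample has already been reduced to a final-answer string; the function name `self_consistency` is hypothetical. The policy is untouched here: any gain over a single sample comes from aggregation at inference time.

```python
from collections import Counter
from typing import Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among sampled candidates.

    Assumes each element is the final answer extracted from one sample.
    Ties are broken in favor of the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns a list holding one (answer, count) pair.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical samples from the same policy on one problem.
samples = ["42", "41", "42", "7", "42"]
majority = self_consistency(samples)
```

Comparing `majority` with `samples[0]` on a benchmark gives one rough estimate of how much of a reported gain comes from search rather than from the policy.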
7.3 Canonical Examples
- Best-of-N math solution selection with an exact answer checker.
- Multi-candidate code generation with unit-test reranking.
- Tool-using agents that query external systems and verify intermediate tool outputs before continuing.
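The first two examples above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `exact_match`, `best_of_n`, `pass_rate`, and `rerank_by_tests` are hypothetical names, the exact-answer checker assumes a known reference answer, and code candidates are modeled as Python callables rather than generated source text.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[tuple, Any]


def exact_match(candidate: str, reference: str) -> bool:
    """Hypothetical exact-answer checker: compare after trimming whitespace."""
    return candidate.strip() == reference.strip()


def best_of_n(candidates: Sequence[str], reference: str) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if none passes."""
    for cand in candidates:
        if exact_match(cand, reference):
            return cand
    return None


def pass_rate(fn: Callable[..., Any], tests: Sequence[TestCase]) -> float:
    """Fraction of unit tests a candidate passes; an exception counts as a failure."""
    if not tests:
        raise ValueError("need at least one test case")
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails this test
    return passed / len(tests)


def rerank_by_tests(
    candidates: Sequence[Callable[..., Any]], tests: Sequence[TestCase]
) -> List[Callable[..., Any]]:
    """Sort candidates by descending pass rate; ties keep their original order."""
    return sorted(candidates, key=lambda fn: pass_rate(fn, tests), reverse=True)


# Hypothetical candidates for "add two integers".
tests = [((1, 2), 3), ((0, 0), 0)]
ranked = rerank_by_tests([lambda a, b: a - b, lambda a, b: a + b], tests)
chosen_answer = best_of_n(["41", " 42 ", "42"], reference="42")
```

In both cases the selection step is only as good as the checker: a reference answer or a test suite that admits wrong solutions will rank them first.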
7.4 Failure Modes
- Reporting test-time-search gains as if they were pure policy improvements.
- Ignoring latency and cost when verifier-heavy search is deployed.
- Building search around a weak verifier and amplifying the wrong behavior.
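The last failure mode can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions that are not in the text above: candidates are independent, each is correct with probability `p`, the verifier accepts a correct candidate with probability `tpr` and an incorrect one with probability `fpr`, and selection returns the first accepted candidate (or a random rejected one if nothing is accepted).

```python
def selected_accuracy(p: float, tpr: float, fpr: float, n: int) -> float:
    """Probability that best-of-n selection returns a correct candidate.

    Illustrative model (an assumption, see the lead-in): i.i.d. candidates,
    verifier errors independent of everything except candidate correctness.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    accept = p * tpr + (1 - p) * fpr  # P(one candidate is accepted)
    none_accepted = (1 - accept) ** n  # P(the verifier rejects all n)
    # P(correct | accepted) and P(correct | rejected), guarding division by zero.
    correct_given_accept = p * tpr / accept if accept > 0 else 0.0
    correct_given_reject = p * (1 - tpr) / (1 - accept) if accept < 1 else 0.0
    return (
        (1 - none_accepted) * correct_given_accept
        + none_accepted * correct_given_reject
    )


# Compare a perfect verifier with one that prefers wrong answers.
for n in (1, 4, 16, 64):
    perfect = selected_accuracy(p=0.3, tpr=1.0, fpr=0.0, n=n)
    weak = selected_accuracy(p=0.3, tpr=0.5, fpr=0.8, n=n)
    print(f"n={n:>2}  perfect={perfect:.3f}  weak={weak:.3f}")
```

Under this model, a verifier whose false-positive rate exceeds its true-positive rate pushes accuracy below the single-sample rate as `n` grows, while generation and verification cost still scale roughly linearly with `n`.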
7.5 What the Verifier Sees
The verifier sees multiple candidate outputs, revisions, search nodes, or tool traces rather than a single final sample.
7.6 What the Verifier Misses
It still misses search paths that were never explored and any capability that the sampled candidate set does not express.
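One way to see this limit is to measure coverage (is a correct answer anywhere in the candidate set?) separately from selected accuracy (did the verifier pick it?). The sketch below uses hypothetical names and toy data, and it assumes correctness means string equality with a known reference; the scoring function is a deliberately poor stand-in for a verifier.

```python
from typing import Callable, List, Sequence, Tuple


def coverage_and_selected_accuracy(
    problems: Sequence[Tuple[List[str], str]],
    score: Callable[[str], float],
) -> Tuple[float, float]:
    """Return (coverage, selected accuracy) over (candidates, reference) pairs.

    coverage: fraction of problems with at least one correct candidate.
    selected accuracy: fraction where the top-scored candidate is correct.
    """
    if not problems:
        raise ValueError("need at least one problem")
    covered = 0
    selected_correct = 0
    for candidates, reference in problems:
        if reference in candidates:
            covered += 1
        # max() returns the first candidate among those with the top score.
        if max(candidates, key=score) == reference:
            selected_correct += 1
    total = len(problems)
    return covered / total, selected_correct / total


# Toy data (hypothetical): the second problem has no correct candidate at all.
toy_problems = [
    (["4", "5", "4"], "4"),
    (["7", "8"], "9"),
    (["12", "10"], "10"),
]
# Stand-in verifier that just prefers longer strings (an assumption).
coverage, selected = coverage_and_selected_accuracy(toy_problems, score=len)
```

Selected accuracy can never exceed coverage: no verifier, however strong, can choose an answer the search did not produce.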
7.7 Research Notes
- Which verifier regimes benefit most from search before training?
- How should evaluations separate amortized capability from search-assisted capability?
- When does search amplify reward hacking rather than competence?
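For the second question, one possible starting point is to report pass@1 next to pass@k on the same samples: pass@1 approximates what the policy does unaided, while pass@k bounds what search could reach with a perfect verifier. The sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation; treating the gap between the two as "search-assisted capability" is an assumption of this sketch, not a settled convention.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 10 samples, 3 correct.
unaided = pass_at_k(n=10, c=3, k=1)
with_search = pass_at_k(n=10, c=3, k=5)
search_headroom = with_search - unaided
```

Averaging `unaided` and `with_search` over a benchmark, before and after training, gives a simple table in which policy gains and search gains appear in separate columns.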