10 Open Problems

10.1 Chapter Map
- Survey open problems in RLVR: verifier fidelity beyond math and code, and adaptive RLVR loops that co-evolve the policy, verifier, task distribution, and harness.
10.2 Verifier fidelity beyond math and code
In long-context QA, answer-evidence checks can miss unsupported synthesis. In multimodal search, final-answer and tool-format rewards can miss visual grounding. In agentic software tasks, tests can miss maintainability, security, minimality, and user intent. In instruction following, many constraints require semantic judgment rather than exact checking (Peng et al. 2025; Brown 2025; Tan et al. 2025). RLVR has even been applied in medicine, where a verifier may check a final label, citation, or structured field while missing whether the model used the right evidence, respected uncertainty, or made a decision a clinician would trust (Zhang et al. 2025).
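The fidelity gap is easy to demonstrate concretely. The sketch below uses a hypothetical exact-match final-answer verifier (all names are illustrative, not from any cited work): two responses receive the same reward even though only one is grounded in the evidence, which is exactly the failure mode such checks cannot see.

```python
def final_answer_verifier(response: str, gold: str) -> float:
    """Reward 1.0 iff the final line of the response matches the gold answer exactly."""
    final_line = response.strip().splitlines()[-1]
    return 1.0 if final_line.strip() == gold.strip() else 0.0

# Two responses that earn identical reward despite very different quality:
grounded = "The report (p. 4) states revenue fell 12%.\nAnswer: 12%"
ungrounded = "Revenue probably fell, let's say 12%.\nAnswer: 12%"

print(final_answer_verifier(grounded, "Answer: 12%"))    # 1.0
print(final_answer_verifier(ungrounded, "Answer: 12%"))  # 1.0 -- the verifier cannot tell them apart
```

A policy optimized against such a reward is free to drift toward the ungrounded style, since nothing in the training signal distinguishes the two.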
10.3 Adaptive RLVR
We can think of adaptive RLVR first in the sense of prompt re-weighting: estimating the difficulty of problems before using them in training, so as to maintain a specific competence band over the problems the model tackles. Furthermore, the verifier itself may be updated throughout training to prevent the policy from exploiting gaps in its checks. A fully adaptive RLVR loop can be written as:
\[ (\pi_t, V_t, \mathcal D_t, \mathcal H_t) \longrightarrow (\pi_{t+1}, V_{t+1}, \mathcal D_{t+1}, \mathcal H_{t+1}), \tag{10.1}\]
where \(\pi_t\) is the policy, \(V_t\) the verifier stack, \(\mathcal D_t\) the task distribution, and \(\mathcal H_t\) the harness.
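The prompt re-weighting half of this loop can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `estimate_pass_rate`, the band thresholds, and the toy policy are all assumptions chosen for clarity. Each prompt's difficulty is estimated from k verified rollouts, and only prompts whose pass rate falls inside the competence band are kept for training.

```python
import itertools

def estimate_pass_rate(policy, prompt, verifier, k=8):
    """Estimate prompt difficulty as the fraction of k sampled rollouts the verifier accepts."""
    return sum(verifier(policy(prompt)) for _ in range(k)) / k

def select_competence_band(policy, pool, verifier, low=0.2, high=0.8, k=8):
    """Keep prompts that are neither already solved (rate >= high)
    nor hopelessly hard (rate <= low) for the current policy."""
    return [p for p in pool
            if low < estimate_pass_rate(policy, p, verifier, k) < high]

# Toy demo: the policy always solves "easy" and solves "mid" on every
# other attempt (a 50% pass rate), so only "mid" stays in the band.
flaky = itertools.cycle([1.0, 0.0])

def toy_policy(prompt):
    return "ok" if (prompt == "easy" or next(flaky) == 1.0) else "fail"

verifier = lambda response: 1.0 if response == "ok" else 0.0
print(select_competence_band(toy_policy, ["easy", "mid"], verifier))  # ['mid']
```

In a real loop the pass-rate estimates would be refreshed as \(\pi_t\) improves, so the retained set \(\mathcal D_t\) shifts toward harder prompts over time; updating \(V_t\) and \(\mathcal H_t\) in the same loop is the harder, open part of the problem.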