10  Open Problems

M. C. Escher, Alfedena, Abruzzi (1929).

10.1 Chapter Map

  • Discuss open problems in RLVR.

10.2 Verifier fidelity beyond math and code

Verifier fidelity is hardest to guarantee outside math and code. In long-context QA, answer–evidence checks can miss unsupported synthesis. In multimodal search, final-answer and tool-format rewards can miss visual grounding. In agentic software tasks, unit tests can miss maintainability, security, minimality, and user intent. In instruction following, many constraints require semantic judgment rather than exact string checking (Peng et al. 2025; Brown 2025; Tan et al. 2025). RLVR has even been applied in medicine, where a verifier may check a final label, citation, or structured field while missing whether the model used the right evidence, respected uncertainty, or made a decision a clinician would trust (Zhang et al. 2025).
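As a toy illustration of this fidelity gap (all function names here are hypothetical, not from any of the cited systems), compare an answer-only reward with one that also checks that quoted evidence actually appears in the source document:

```python
def exact_match_reward(response: dict, gold_answer: str) -> float:
    """Reward 1.0 if the final answer string matches the gold label."""
    match = response["answer"].strip().lower() == gold_answer.strip().lower()
    return 1.0 if match else 0.0

def grounded_reward(response: dict, gold_answer: str, document: str) -> float:
    """Additionally require every quoted evidence span to appear verbatim
    in the source document, penalizing unsupported synthesis."""
    if exact_match_reward(response, gold_answer) == 0.0:
        return 0.0
    evidence_ok = all(span in document for span in response.get("evidence", []))
    return 1.0 if evidence_ok else 0.0

doc = "The 2019 audit found revenue of $3.2M and flagged two compliance issues."
resp = {"answer": "$3.2M", "evidence": ["revenue of $9.9M"]}  # fabricated quote
print(exact_match_reward(resp, "$3.2M"))    # 1.0: answer-only check is fooled
print(grounded_reward(resp, "$3.2M", doc))  # 0.0: the grounding check catches it
```

Even the stricter check is a crude proxy: verbatim substring matching misses paraphrased evidence and cannot judge whether the evidence actually supports the answer, which is the semantic-judgment problem noted above.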

10.3 Adaptive RLVR

We can think of adaptive RLVR first in the sense of prompt re-weighting: estimating the difficulty of problems before (and during) training so that the model is kept within a specific competence band over the problems it tackles. Furthermore, the verifier itself may be updated throughout training to prevent the policy from exploiting gaps in the reward. A fully adaptive RLVR loop can be written as:

\[ (\pi_t, V_t, \mathcal D_t, \mathcal H_t) \longrightarrow (\pi_{t+1}, V_{t+1}, \mathcal D_{t+1}, \mathcal H_{t+1}), \tag{10.1}\]

where \(\pi_t\) is the policy, \(V_t\) the verifier stack, \(\mathcal D_t\) the task distribution, and \(\mathcal H_t\) the harness.
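A minimal sketch of one step of this loop, under heavy simplifying assumptions (a "prompt" is a scalar difficulty, the "policy" is a scalar skill level, the verifier and harness are held fixed, and the RL update is a placeholder), illustrating the competence-band re-weighting of \(\mathcal D_t\):

```python
import random

random.seed(0)

# Hypothetical stand-in for rollout + verification: a noisy attempt at
# skill level `skill` passes the verifier if it clears the difficulty.
def rollout_passes(skill: float, difficulty: float) -> bool:
    return skill + random.gauss(0, 0.1) >= difficulty

def adaptive_rlvr_step(skill, prompt_pool, band=(0.2, 0.8), k=16, lr=0.05):
    """One (pi_t, D_t) -> (pi_{t+1}, D_{t+1}) update from Eq. 10.1,
    with the verifier V_t and harness H_t held fixed for simplicity."""
    kept, total_pass = [], 0.0
    for d in prompt_pool:
        # Estimate the pass rate of this prompt under the current policy.
        pass_rate = sum(rollout_passes(skill, d) for _ in range(k)) / k
        if band[0] <= pass_rate <= band[1]:  # prompt re-weighting: keep in-band
            kept.append(d)
            total_pass += pass_rate
    # Placeholder "RL update": skill improves with reward on in-band prompts.
    if kept:
        skill += lr * (total_pass / len(kept))
    return skill, kept

skill, pool = 0.3, [i / 20 for i in range(21)]  # difficulties 0.0 .. 1.0
for t in range(5):
    skill, in_band = adaptive_rlvr_step(skill, pool)
print(round(skill, 2))  # skill increases; the band of kept prompts shifts with it
```

In a real system each of the four components would have its own update rule: the policy via an RL step on verified rollouts, the verifier via auditing rollouts that expose reward gaps, the task distribution via pass-rate filtering as above, and the harness via tooling changes.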

Brown, William. 2025. Verifiers: Environments for LLM Reinforcement Learning. Software. https://github.com/PrimeIntellect-ai/verifiers.
Peng, Hao, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, and Juanzi Li. 2025. “VerIF: Verification Engineering for Reinforcement Learning in Instruction Following.” arXiv Preprint arXiv:2506.09942. https://arxiv.org/abs/2506.09942.
Tan, Sijun, Michael Luo, Colin Cai, et al. 2025. rLLM: A Framework for Post-Training Language Agents. Notion Blog. https://github.com/rllm-org/rllm.
Zhang, Sheng, Qianchu Liu, Guanghui Qin, Tristan Naumann, and Hoifung Poon. 2025. “Med-RLVR: Emerging Medical Reasoning from a 3B Base Model via Reinforcement Learning.” arXiv Preprint arXiv:2502.19655. https://arxiv.org/abs/2502.19655.