11  Long-Context, Multimodal, and Agentic RLVR

11.1 Chapter Map

  • Examine how RLVR stretches into long-context, multimodal, and agentic settings.
  • Show why frontier tasks remain partially verifiable but become noisier, broader, and easier to misread.

11.2 Domain Overview

This chapter shows how RLVR stretches beyond the cleanest domains without pretending the frontier is neat. Once evidence selection, perception, and environment interaction matter, the verifier often becomes a stack of partial checks rather than a single decisive rule.

The aim is to mark where verifier-first thinking still holds and where it begins to fray.

11.3 Verifier Regime

  • Long-context verification: Checking grounded use of evidence over extended context windows.
  • Multimodal verification: Checking outputs that depend on visual, auditory, or other non-textual evidence.
  • Agentic verification: Checking behavior that unfolds through tool use, environment interaction, and temporal traces.

These settings move the field away from single clean checkers and toward partial, layered, and instrumented verification. The central design question becomes how much of the real task the verifier stack actually captures.
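The shift from a single checker to a layered stack can be made concrete. Below is a minimal sketch of combining partial checks into one scalar reward; the check names, weights, and rollout fields are illustrative assumptions, not a fixed interface from the chapter.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    """One partial verifier over a rollout record (names are hypothetical)."""
    name: str
    fn: Callable[[dict], bool]
    weight: float

def stacked_reward(rollout: dict, checks: list[Check]) -> float:
    """Weighted fraction of partial checks that pass, in [0, 1]."""
    total = sum(c.weight for c in checks)
    passed = sum(c.weight for c in checks if c.fn(rollout))
    return passed / total if total else 0.0

# Hypothetical checks for a tool-using rollout: parseability and
# execution validity are cheap partial signals; answer match is weighted higher.
checks = [
    Check("parses", lambda r: r.get("final_answer") is not None, 1.0),
    Check("tools_valid", lambda r: all(t["ok"] for t in r.get("tool_calls", [])), 1.0),
    Check("answer_match", lambda r: r.get("final_answer") == r.get("reference"), 2.0),
]

rollout = {"final_answer": "42", "reference": "42",
           "tool_calls": [{"ok": True}, {"ok": True}]}
print(stacked_reward(rollout, checks))  # → 1.0
```

The design question from the text reappears here directly: the weights and the check list encode a claim about how much of the real task the stack captures, and that claim, not the arithmetic, is the hard part.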

11.4 Canonical Cases

  • Citation-grounded long-context question answering.
  • Vision-language tasks with verifiable perceptual subgoals.
  • Tool-using agents whose traces can be checked for execution validity and grounded progress.
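For the first canonical case, a sketch of what "citation-grounded" can mean operationally: reward only when the answer is correct and every cited span is verbatim present in the supplied context. The 0.5/0.5 split and the field names are assumptions for illustration.

```python
def grounded(citations: list[str], context: str) -> bool:
    """All cited spans must appear verbatim in the context; no citations fails."""
    return bool(citations) and all(c in context for c in citations)

def qa_reward(answer: str, reference: str,
              citations: list[str], context: str) -> float:
    """Score answer correctness and grounded evidence use as separate halves."""
    correct = answer.strip() == reference.strip()
    return 0.5 * correct + 0.5 * grounded(citations, context)

context = "The treaty was signed in 1648 in Westphalia."
print(qa_reward("1648", "1648", ["signed in 1648"], context))  # → 1.0
```

Verbatim substring matching is deliberately strict: it cannot tell whether the span actually supports the answer, which is exactly the gap between partial verification and full evaluation noted below.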

11.5 Comparative Lessons

  • Long-context tasks force a distinction between answer correctness and grounded evidence use.
  • Multimodal tasks make perceptual ambiguity a first-class verifier problem.
  • Agentic settings make temporal traces and environment instrumentation part of the verifier interface.
  • Partial verification is often still useful, but it should not be oversold as full evaluation.
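The agentic lesson, that temporal traces become part of the verifier interface, can be sketched as a trace-level check: execution validity (every tool call succeeded) and grounded progress (every claim cites an observation that actually occurred). The trace schema is an illustrative assumption.

```python
def verify_trace(trace: list[dict]) -> dict:
    """Check execution validity and claim grounding over a temporal trace."""
    exec_valid = all(step.get("status") == "ok"
                     for step in trace if step["kind"] == "tool")
    observations = {s["content"] for s in trace if s["kind"] == "observation"}
    claims = [s for s in trace if s["kind"] == "claim"]
    claims_grounded = all(c.get("evidence") in observations for c in claims)
    return {"exec_valid": exec_valid, "grounded": claims_grounded}

trace = [
    {"kind": "tool", "name": "search", "status": "ok"},
    {"kind": "observation", "content": "price: 12 USD"},
    {"kind": "claim", "content": "The price is 12 USD",
     "evidence": "price: 12 USD"},
]
print(verify_trace(trace))  # → {'exec_valid': True, 'grounded': True}
```

Note what this stack does not see: whether the trace made efficient progress, or whether the claim is the right one to make. That residue is the sense in which partial verification should not be oversold as full evaluation.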

11.6 Research Notes

  • What is the right unit of verification for agentic trajectories?
  • How can long-context tasks separate answer correctness from grounded evidence use?
  • Which multimodal checks are robust enough to train against rather than only evaluate with?