Appendix A — Minimal RL and Post-Training Background

A.1 Purpose

This appendix should provide the smallest amount of RL and post-training context needed to read the main text without turning the book into a general RL manual.

A.2 Include

  • Trajectories, returns, and policy improvement at a high level.
  • KL regularization and why post-training stacks use it.
  • The difference between rejection, search, and policy optimization.
  • Enough terminology to decode papers without shifting the book’s focus away from verifiers.

A.3 Exclude

  • Full derivations that belong in a standard RL textbook.
  • Optimizer-by-optimizer historical detail unless a verifier-facing concept depends on it.

A.4 Objective, States, and Rewards

At its core, reinforcement learning optimizes the expected discounted return of trajectories in a Markov decision process:

\[ \mathcal M=(\mathcal S,\mathcal A,P,R,\gamma). \]

For a policy \(\pi\), the objective is:

\[ J(\pi)=\mathbb E_{\tau\sim \pi}\!\left[\sum_{t=0}^{T-1}\gamma^{t}r(s_t,a_t)\right]. \]

where \(\tau=(s_0,a_0,s_1,a_1,\dots)\).

For single-turn LLM inference/training with one prompt-to-completion interaction, this collapses to a contextual bandit view: one initial state \(s_0\) (the prompt), one action \(a\) (the sampled completion), and then termination:

\[ s_0=x,\quad a=y,\quad s_1=\text{terminal},\quad r=r_\phi(x,y). \]

So the objective becomes:

\[ J(\pi_\theta)=\mathbb E_{x\sim p_{\text{data}}}\left[\mathbb E_{y\sim \pi_\theta(\cdot\mid x)}\,r_\phi(x,y)\right]. \]

For multi-turn settings we retain the full trajectory form with state as dialogue/context history:

\[ s_t=(x,y_{<t}),\qquad a_t=y_t,\qquad s_{t+1}= (x,y_{\le t}). \]

The trajectory distribution factorizes as:

\[ \pi_\theta(y_{1:T}\mid x)=\prod_{t=1}^{T}\pi_\theta(y_t\mid x,y_{<t}), \]

and the return is:

\[ J(\pi_\theta)=\mathbb E_{x\sim p_{\text{data}}}\!\left[\mathbb E_{y_{1:T}\sim\pi_\theta(\cdot\mid x)} \left[\sum_{t=1}^{T}\gamma^{t-1}r_t(s_t,y_t)\right]\right]. \]

In many verifier-driven setups, (r_t) for (t<T) and a scalar terminal reward (r_T=R_(x,y_{1:T})) carries the verification signal from the environment.

In this book, we use (x) for prompts and (y) for generated outputs (or turn-level outputs), and we write the verifier or environment as a score function (R_) or (r_) that maps prompts and completions, or full trajectories, to reward.