Appendix B — Benchmarks, Evals, and Contamination

B.1 Purpose

Collect evaluation-hygiene issues, such as benchmark caveats and contamination risks, in one place so they do not interrupt the flow of the main chapters.

This appendix should become the single reference point for benchmark caveats, so that the main text can stay focused on verifier concepts.