Appendix B — Benchmarks, Evals, and Contamination

B.1 Purpose

Collect the evaluation-hygiene issues that would otherwise interrupt the flow of the main chapters.

B.2 Include

  • Leakage and train-test overlap.
  • Answer-extraction mismatches between the benchmark's scorer and the verifier.
  • Hidden tests and benchmark saturation.
  • Reporting pitfalls such as pass@k misuse or search-assisted evaluation drift.
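
As an illustration of the pass@k item, here is a minimal sketch of the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n samples per problem of which c are correct. It assumes that one form of misuse the appendix will cover is reporting a best-of-k success rate measured from exactly k samples, or labeling it as pass@1. The function name and argument names are illustrative and do not come from the main text.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples drawn for the problem (hypothetical argument name)
    c: number of those samples that passed the verifier
    k: the budget being reported, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no
    # correct sample; comb(n - c, k) is 0 when fewer than k samples failed.
    prob_all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - prob_all_fail


# Usage: 3 correct out of 10 samples.
print(pass_at_k(10, 3, 1))  # pass@1 estimate
print(pass_at_k(10, 3, 5))  # pass@5 estimate from the same samples
```

A benchmark-level score would average this per-problem value over all problems. Reporting n alongside k makes it clear how many samples the estimate rests on.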

B.3 Working Note

This appendix should become the single reference location for benchmark caveats, so that the main text can stay focused on verifier concepts.