Appendix B — Benchmarks, Evals, and Contamination
B.1 Purpose
Collect the evaluation-hygiene issues that would otherwise interrupt the flow of the main chapters.
B.2 Include
- Leakage and train-test overlap.
- Extraction mismatches between benchmark and verifier.
- Hidden tests and benchmark saturation.
- Reporting pitfalls such as pass@k misuse or search-assisted evaluation drift.
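One concrete case of the pass@k pitfall listed above: a minimal sketch, assuming the standard combinatorial estimator 1 - C(n-c, k) / C(n, k) computed from n samples per problem, of which c pass. The function names `pass_at_k` and `naive_pass_at_k` are illustrative and not taken from the main text. The plug-in form 1 - (1 - c/n)^k is included only for comparison; it is biased and tends to underestimate pass@k when n is small.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Illustrative helper (hypothetical name): the probability that at least
    one of k samples drawn without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets made up only of failures;
    # it is 0 when fewer than k samples failed, which gives pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def naive_pass_at_k(n: int, c: int, k: int) -> float:
    """Plug-in estimate 1 - (1 - c/n)^k, shown only for comparison."""
    return 1.0 - (1.0 - c / n) ** k


if __name__ == "__main__":
    n, c = 10, 3
    for k in (1, 5, 10):
        print(k, round(pass_at_k(n, c, k), 4), round(naive_pass_at_k(n, c, k), 4))
```

Two numbers are comparable only if they share k, n, and the sampling settings. A pass@10 score set beside another system's pass@1 score is the kind of misuse this entry is meant to cover.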
B.3 Working Note
This appendix is intended as the single reference location for benchmark caveats, so that the main text can stay focused on verifier concepts.