Machine Learning Evaluation

Syllabus for Spring 2025

Instructors: Deb Raji and Ben Recht

“What does it mean for machine learning to work?” By examining the mathematical foundations, historical implementations, and practical futures of machine learning evaluation, we will explore various conceptual and mathematical models of prediction and the corresponding implications for evaluation theory and practice. With these models in mind, we will discuss the strengths and limitations of common metrics (e.g., scoring rules, calibration, precision-recall) and methods (e.g., cross-validation, the holdout method, and competitions) with respect to different notions of validity. We will additionally study current evaluation practice in the context of deployment and interaction, including interfaces with regulatory compliance and legal requirements. Throughout, we will look to historical practices and other domains to inspire new approaches to evaluation beyond the standard paradigms of benchmarking.

Many of the readings from this course are drawn from Patterns, Predictions, and Actions (PPA) by Moritz Hardt and Ben Recht. Available online and at your favorite bookseller.


Week 1: Evaluation and Prediction

Jan 20, 22

Topics: Prediction Problems. Predictions, decisions, and actions. The Statistical Trap. Elements of probability, generalization, and repeatability. Historical view of clinical versus actuarial prediction.

Reading:

Additional Reading:


Week 2: The Holdout Method

Jan 27, 29

Topics: i.i.d. models, stability, generalization, internal validity, machine learning benchmark design.
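
To make the holdout method concrete, here is a minimal sketch in Python on synthetic data; the sample size, 80/20 split, and nearest-centroid classifier are placeholders chosen only for illustration, not part of the course materials.

    # Holdout evaluation: fit on one random split, report accuracy on the other.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic binary classification data: two Gaussian blobs in 5 dimensions.
    n, d = 1000, 5
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + y[:, None]  # class 1 is shifted by +1

    # Split once into a training set and a holdout (test) set.
    perm = rng.permutation(n)
    split = int(0.8 * n)
    train, test = perm[:split], perm[split:]

    # "Fit" a trivial nearest-centroid classifier on the training split only.
    mu0 = X[train][y[train] == 0].mean(axis=0)
    mu1 = X[train][y[train] == 1].mean(axis=0)

    def predict(Z):
        d0 = np.linalg.norm(Z - mu0, axis=1)
        d1 = np.linalg.norm(Z - mu1, axis=1)
        return (d1 < d0).astype(int)

    # The holdout estimate: accuracy on data never touched during fitting.
    print("train accuracy:", (predict(X[train]) == y[train]).mean())
    print("holdout accuracy:", (predict(X[test]) == y[test]).mean())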

Reading:

Additional Reading:

  • [blog] on generalization.


Week 3: History of Machine Learning Benchmarks

Feb 2, 4

Topics: History of benchmarking, competition, and the associated data sets used to evaluate machine learning progress.

Reading:

Additional Reading:


Week 4: External Validity, Reproducibility, and Robustness

Feb 11, 13

Topics: Distribution shift and data set shift. How such shifts manifest in practice.
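
As a toy illustration of what a data set shift can do to a fixed model: the "replication" test set in the sketch below is just the original synthetic distribution with a mean offset, a stand-in for the real shifts discussed in the readings.

    # Evaluate one fixed classifier on the original test distribution and on a
    # shifted replication of it; the accuracy gap is the quantity of interest.
    import numpy as np

    rng = np.random.default_rng(1)

    def sample(n, shift=0.0):
        y = rng.integers(0, 2, size=n)
        X = rng.normal(size=(n, 2)) + y[:, None] + shift
        return X, y

    X_tr, y_tr = sample(2000)              # training data
    X_te, y_te = sample(500)               # original test set
    X_new, y_new = sample(500, shift=0.5)  # shifted "replication" test set

    # Nearest-centroid classifier fit on the training data only.
    mu = np.stack([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    predict = lambda X: np.argmin(((X[:, None, :] - mu) ** 2).sum(axis=-1), axis=1)

    print("original test accuracy:", (predict(X_te) == y_te).mean())
    print("shifted test accuracy: ", (predict(X_new) == y_new).mean())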

Reading:

  • Recht, Benjamin, Roelofs, Rebecca, Schmidt, Ludwig, and Shankar, Vaishaal. "Do ImageNet classifiers generalize to ImageNet?" International Conference on Machine Learning. 2019. Full report: arXiv:1902.10811

  • On overfitting: [blog 1]

  • On “distribution shift”: [blog 2]

  • On test sets: [blog 3]

Additional Reading:


Week 5: Adaptivity and Overfitting

Feb 18, 20

Topics: Adaptive data analysis. Why training on the test set should be bad. How you can climb a leaderboard without looking at the data. The ladder algorithm and natural algorithms.

Reading:

Additional Reading:

  • Blum, Avrim, and Moritz Hardt. "The ladder: A reliable leaderboard for machine learning competitions." International Conference on Machine Learning. PMLR, 2015. Tech Report: arXiv:1502.04585

  • Mania, Horia, et al. "Model Similarity Mitigates Test Set Overuse." NeurIPS 2019. Tech Report: arXiv:1905.12580
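
For orientation, here is an illustrative sketch of the Ladder mechanism from the Blum and Hardt paper above: the public leaderboard score changes only when a submission beats the previous best by more than a step size, which limits how much information repeated submissions can extract from the holdout labels. The variable names and fixed step size below are placeholders, not the paper's notation.

    import numpy as np

    def ladder_leaderboard(submissions, y_holdout, eta=0.01):
        """submissions: a list of 0/1 prediction vectors on the holdout set."""
        best = np.inf        # best released holdout loss so far
        released = []
        for preds in submissions:
            loss = np.mean(preds != y_holdout)   # empirical 0-1 loss
            if loss < best - eta:                # only a clear improvement counts
                best = round(loss / eta) * eta   # release a rounded score
            released.append(best)
        return released

    # Example: three submissions of varying quality against random labels.
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)
    subs = [np.where(rng.random(1000) < q, y, 1 - y) for q in (0.6, 0.62, 0.8)]
    print(ladder_leaderboard(subs, y))   # released scores after each submission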


Week 6: Construct Validity

Feb 25, 27

Topics: Reference classes, nomological nets, construct validity, target definitions, context and scope of prediction.

Reading:

Additional Reading:


Week 7: Quantifying & Reporting Uncertainty

Mar 4, 6

Topics: Reporting and quantifying the uncertainty of machine learning models; making use of reported uncertainty.
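
In the spirit of the Miller (2024) reading listed under Additional Reading below, here is a minimal sketch of attaching an error bar to an eval score: a normal-approximation 95% confidence interval for accuracy over independent test questions. The counts are made-up placeholders.

    import numpy as np

    def accuracy_ci(correct, z=1.96):
        """correct: array of 0/1 per-question outcomes on the test set."""
        correct = np.asarray(correct, dtype=float)
        n = correct.size
        acc = correct.mean()
        se = np.sqrt(acc * (1 - acc) / n)   # standard error of the mean
        return acc, (acc - z * se, acc + z * se)

    # Example: a model answers 870 of 1000 eval questions correctly.
    outcomes = np.r_[np.ones(870), np.zeros(130)]
    acc, (lo, hi) = accuracy_ci(outcomes)
    print(f"accuracy = {acc:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")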

Reading:

Additional Reading:

  • Agarwal, Rishabh, et al. (2021). Deep Reinforcement Learning at the Edge of the Statistical Precipice. arXiv:2108.13264

  • Miller, Evan (2024). Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. arXiv:2411.00640

  • Chen, Irene, Johansson, Fredrik D., and Sontag, David (2018) Why Is My Classifier Discriminatory? arXiv:1805.12002


Week 8: Forecasts as Benchmarks

Mar 11, 13

Topics: Proper scoring rules. Information retrieval metrics: ROC AUC, precision, recall, and F1 scores. Reliability, reproducibility, and measurement.
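
A small sketch contrasting a proper scoring rule (the Brier score, computed on probabilities) with the thresholded information-retrieval metrics named above; the synthetic probabilities and the 0.5 threshold are placeholders for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=200)                         # true labels
    p = np.clip(0.5 * y + rng.normal(0.25, 0.2, 200), 0, 1)  # predicted probabilities

    # Proper scoring rule: Brier score, the mean squared error of the probabilities.
    brier = np.mean((p - y) ** 2)

    # Threshold to hard decisions, then compute precision, recall, and F1.
    yhat = (p >= 0.5).astype(int)
    tp = np.sum((yhat == 1) & (y == 1))
    fp = np.sum((yhat == 1) & (y == 0))
    fn = np.sum((yhat == 0) & (y == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    print(f"Brier = {brier:.3f}, precision = {precision:.3f}, "
          f"recall = {recall:.3f}, F1 = {f1:.3f}")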

Reading:

Additional Reading:


Week 9: Calibration

Mar 18, 20

Topics: What calibration is, what it can do, what it can't do. Bayesian and frequentist interpretations. Validating calibration. Calibrating without knowledge.
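
As a concrete preview, here is a minimal sketch of checking calibration by binning: within each bin of predicted probabilities, compare the average prediction to the empirical frequency of the positive label (a text-only reliability diagram plus an expected calibration error). The ten equal-width bins and the deliberately miscalibrated synthetic predictions are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.random(5000)                         # predicted probabilities
    y = (rng.random(5000) < p ** 2).astype(int)  # miscalibrated: true rate is p**2

    bins = np.linspace(0, 1, 11)                 # ten equal-width bins
    idx = np.digitize(p, bins[1:-1])
    ece = 0.0
    for b in range(10):
        mask = idx == b
        if mask.sum() == 0:
            continue
        conf, freq = p[mask].mean(), y[mask].mean()
        ece += mask.mean() * abs(conf - freq)    # weight by the bin's mass
        print(f"bin {b}: mean prediction {conf:.2f}, empirical frequency {freq:.2f}")
    print(f"expected calibration error (ECE) = {ece:.3f}")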

Reading:

Additional Reading:


Week 10: Games as Benchmarks

Apr 1, 3

Topics: Actual games, like chess, checkers, and poker. Other closed-world "arenas" as benchmarks, like ARC.

Reading:

Additional Reading:


Week 11: Contemporary Benchmarking?

Apr 8, 10

Topics: Polyphasic evaluation of LLMs; symbolic testing, behavioral testing, and verification; interactivity, adaptive testing, and dynamic benchmarks.
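
In the spirit of the CheckList reading listed under Additional Reading below, here is a toy behavioral (invariance) test: apply a label-preserving perturbation and check whether the prediction changes. The model function is a deliberately silly placeholder for whatever system is under evaluation.

    import random

    def model(text: str) -> str:
        # Placeholder "sentiment model": a keyword rule, purely for illustration.
        return "positive" if "good" in text.lower() else "negative"

    def add_typo(text: str, seed: int = 0) -> str:
        # Label-preserving perturbation: swap two adjacent characters.
        random.seed(seed)
        i = random.randrange(len(text) - 1)
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]

    cases = ["The food was good.", "Service was good but slow.", "Not a good idea."]
    failures = [t for t in cases if model(t) != model(add_typo(t))]
    print(f"invariance failures: {len(failures)} of {len(cases)} test cases")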

Reading:

  • Wei, Jason, et al. "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research (2022). https://openreview.net/forum?id=yzkSU5zdwD

  • Schaeffer, Rylan, et al. "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS 2023. arXiv:2304.15004

  • What if all of these papers on LLM evaluation are just making Sam Altman richer? [blog]

Additional Reading:

  • Srivastava, Aarohi, et al. "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models." arXiv:2206.04615 (2022).

  • Kiela, Douwe, et al. "Dynabench: Rethinking benchmarking in NLP." arXiv:2104.14337 (2021).

  • Ribeiro, Marco Tulio, et al. "Beyond accuracy: Behavioral testing of NLP models with CheckList." arXiv:2005.04118 (2020).

  • Hennigen, Lucas Torroba, et al. "Towards Verifiable Text Generation with Symbolic References." arXiv:2311.09188 (2023)

  • Mirzadeh, Iman, et al. "GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models." arXiv:2410.05229 (2024).

  • Seshia, Sanjit A., Dorsa Sadigh, and S. Shankar Sastry. "Toward verified artificial intelligence." Communications of the ACM 65.7 (2022): 46-55. arXiv:1606.08514

  • Chen, Mark, et al. "Evaluating large language models trained on code." arXiv:2107.03374 (2021).


Week 12: Evaluating Interventions

Apr 15, 17

Topics: Evaluating systems with RCTs and A/B tests. Temporal validity, compliance, and adjustment. Adaptive experiments.
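
A back-of-the-envelope sketch of analyzing a simple A/B test: a two-proportion z-test comparing conversion rates in the treatment and control arms. The counts are invented placeholders, and the normal approximation is the only statistical machinery used.

    import math

    def two_proportion_ztest(x_a, n_a, x_b, n_b):
        """x: number of successes, n: number of units in each arm."""
        p_a, p_b = x_a / n_a, x_b / n_b
        p_pool = (x_a + x_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        # Two-sided p-value from the normal approximation.
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        return p_b - p_a, z, p_value

    # Example: 480/5000 conversions in control (A), 540/5000 in treatment (B).
    lift, z, p = two_proportion_ztest(x_a=480, n_a=5000, x_b=540, n_b=5000)
    print(f"estimated lift = {lift:.4f}, z = {z:.2f}, p = {p:.3f}")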

Reading:

Additional Reading:


Week 13: Regulation and Deployment

Apr 22, 24

Topics: Program evaluation, policy evaluation, legal compatibility, deployment.

Reading:

Additional Reading: