Machine Learning Evaluation
Syllabus for Spring 2025
Instructors: Deb Raji and Ben Recht
“What does it mean for machine learning to work?” By examining the mathematical foundations, historical implementation, and practical futures of machine learning evaluation, we will explore various conceptual and mathematical models of prediction and their implications for evaluation theory and practice. With these models in mind, we will discuss the strengths and limitations of common metrics (e.g., scoring rules, calibration, precision-recall) and methods (e.g., cross-validation, the holdout method, and competitions) with respect to different notions of validity. We will additionally study current evaluation practice in the context of deployment and interaction, including interfaces with regulatory compliance and legal requirements. Throughout, we will look to historical practices and other domains to inspire new approaches to evaluation beyond the standard paradigms of benchmarking.
Many of the readings from this course are drawn from Patterns, Predictions, and Actions (PPA) by Moritz Hardt and Ben Recht. Available online and at your favorite bookseller.
Week 1: Evaluation and Prediction
Jan 20, 22
Topics: Prediction Problems. Predictions, decisions, and actions. The Statistical Trap. Elements of probability, generalization, and repeatability. Historical view of clinical versus actuarial prediction.
Reading
Freedman, David. “Some Issues in the Foundation of Statistics.” Foundations of Science 1 (1995): 19–39.
Dawes, Robyn M., David Faust, and Paul E. Meehl. "Clinical versus actuarial judgment." Science 243.4899 (1989): 1668-1674.
Additional Reading
Hardt and Recht, PPA, Chapter 2.
Blogs on Clinical Versus Statistical Prediction [Part 1][Part 2][Part 3]
Week 2: The Holdout Method
Jan 27, 29
Topics: i.i.d. models, stability, generalization, internal validity, and machine learning benchmark design.
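An optional illustrative sketch of this week's central idea: the holdout estimate of a fixed classifier's error concentrates around the true error at roughly a 1/sqrt(n) rate. This is a minimal NumPy simulation; the error rate, holdout size, and trial count are made-up numbers, not quantities from the readings.

```python
# Optional sketch: how tightly does a holdout estimate track the true error?
# Pure NumPy simulation; all numbers below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_error = 0.12      # hypothetical population error of a fixed classifier
n_test = 2000          # size of the holdout set
n_trials = 10_000      # number of simulated holdout sets

# Each holdout estimate is the mean of n_test Bernoulli(true_error) mistakes.
estimates = rng.binomial(n_test, true_error, size=n_trials) / n_test

print(f"mean holdout estimate: {estimates.mean():.4f} (true error {true_error})")
print(f"std of the estimate:   {estimates.std():.4f}")
print(f"1/sqrt(n) scale:       {1 / np.sqrt(n_test):.4f}")
```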
Reading:
M. Stone. "Cross-Validatory Choice and Assessment of Statistical Predictions." Journal of the Royal Statistical Society, Series B (Methodological), Vol. 36, No. 2 (1974), pp. 111-147.
PPA, Chapter 8, section "The scientific basis of machine learning benchmarks."
[blog post] linking bit prediction to the holdout method
Additional Reading
[blog] on generalization.
Week 3: History of Machine Learning Benchmarks
Feb 2, 4
Topics: History of benchmarking, competition, and the associated data sets used to evaluate machine learning progress.
Reading
Donoho, D. (2024). Data Science at the Singularity. Harvard Data Science Review, 6(1). doi.org/10.1162/99608f92.b91339ef
Recht, B. (2024). The Mechanics of Frictionless Reproducibility. Harvard Data Science Review, 6(1). doi.org/10.1162/99608f92.f0f013d4
Patterns, Predictions, and Actions, Chapter 8, up to the heading "Longevity of benchmarks."
Blog post on The Netflix Prize.
Additional Reading
Bowman, Samuel R., and George E. Dahl. "What will it take to fix benchmarking in natural language understanding?" arXiv:2104.02145 (2021).
Raji, Inioluwa Deborah, et al. "AI and the everything in the whole wide world benchmark." arXiv:2111.15366 (2021).
Liao, Thomas, et al. "Are we learning yet? A meta review of evaluation failures across machine learning." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021.
Herbert Simon - "Why should machines learn?" From the first ICML.
Week 4: External Validity, Reproducibility, and Robustness
Feb 11, 13
Topics: Distribution shift and dataset shift. How such shifts manifest in practice.
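An optional sketch of distribution shift on synthetic data, assuming scikit-learn is available: a classifier is fit on one distribution and then scored on an i.i.d. test set and on a shifted one. The Gaussian data and the size of the shift are invented for illustration.

```python
# Optional sketch: accuracy on an i.i.d. test set versus a shifted test set.
# Synthetic Gaussian data; the covariate shift below is an illustrative choice.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    # Two Gaussian classes with means 0 and 2; `shift` moves both at test time.
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 2.0 + shift, scale=1.0, size=(n, 2))
    return X, y

X_train, y_train = sample(5000)
X_iid, y_iid = sample(5000)                 # test set from the same distribution
X_shift, y_shift = sample(5000, shift=1.0)  # test set with shifted features

clf = LogisticRegression().fit(X_train, y_train)
print("i.i.d. test accuracy:  ", round(clf.score(X_iid, y_iid), 3))
print("shifted test accuracy: ", round(clf.score(X_shift, y_shift), 3))
```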
Reading:
Recht, Benjamin, Roelofs, Rebecca, Schmidt, Ludwig, and Shankar, Vaishaal. "Do ImageNet classifiers generalize to ImageNet?" International Conference on Machine Learning. 2019. Full report: arXiv:1902.10811
On overfitting: [blog 1]
On “distribution shift”: [blog 2]
On test sets: [blog 3]
Additional Reading:
J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning, The MIT Press, 2009.
Moreno-Torres, Jose G., et al. "A unifying view on dataset shift in classification." Pattern Recognition 45.1 (2012): 521-530.
Egami, Naoki, and Erin Hartman. "Elements of external validity: Framework, design, and analysis." American Political Science Review 117.3 (2023): 1070-1088.
Finlayson, Samuel G., et al. "The clinician and dataset shift in artificial intelligence." The New England Journal of Medicine 385.3 (2021): 283.
Wong, Andrew, et al. "External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients." JAMA Internal Medicine 181.8 (2021): 1065-1070.
Week 5: Adaptivity and Overfitting
Feb 18, 20
Topics: Adaptive data analysis. Why training on the test set should be bad. How you can climb a leaderboard without looking at the data. The ladder algorithm and natural algorithms.
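An optional NumPy sketch of the leaderboard-climbing phenomenon: submit many random guesses against a hidden binary test set, keep the best returned score, and the apparent accuracy rises well above chance even though no submission ever looks at the data. The test-set size and submission count are illustrative.

```python
# Optional sketch: climbing a leaderboard with random guesses alone.
# The "attacker" only sees returned scores, never the hidden labels.
import numpy as np

rng = np.random.default_rng(0)
n_test = 1000          # hidden test-set size (illustrative)
n_submissions = 500    # number of random submissions (illustrative)

labels = rng.integers(0, 2, size=n_test)                    # hidden ground truth
guesses = rng.integers(0, 2, size=(n_submissions, n_test))  # random submissions
scores = (guesses == labels).mean(axis=1)                   # leaderboard scores

print(f"best apparent accuracy: {scores.max():.3f}")
print(f"rough theory, 0.5 + sqrt(ln(k)/(2n)): "
      f"{0.5 + np.sqrt(np.log(n_submissions) / (2 * n_test)):.3f}")
```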
Reading
"The problem of adaptivity," Chapter 8, PPA
Additional Reading
Blum, Avrim, and Moritz Hardt. "The ladder: A reliable leaderboard for machine learning competitions." International Conference on Machine Learning. PMLR, 2015. Tech Report: arXiv:1502.04585
Mania, H et al. "Model Similarity Mitigates Test Set Overuse." NeurIPS 2019. Tech Report: arXiv:1905.12580
Week 6: Construct Validity
Feb 25, 27
Topics: Reference classes, nomological nets, construct validity, target definitions, and the context and scope of prediction.
Reading
Cronbach, L. J., and Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. doi.org/10.1037/h0040957
Additional Reading
Altman, D. G., and P. Royston. “What Do We Mean by Validating a Prognostic Model?” Statistics in Medicine 19, no. 4 (2000): 453–73.
Mullainathan, Sendhil, and Ziad Obermeyer. "On the inequity of predicting A while hoping for B." AEA Papers and Proceedings. Vol. 111. (2021) 37-42.
Week 7: Quantifying & Reporting Uncertainty
Mar 4, 6
Topics: Reporting and quantifying the uncertainty of machine learning models; making use of reported uncertainty.
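An optional sketch of the most basic uncertainty report for an evaluation score: a normal-approximation confidence interval for the accuracy of a fixed model on i.i.d. questions. The counts below are invented for illustration; the Hoekstra et al. reading is a caution about how easily such intervals are misread.

```python
# Optional sketch: a normal-approximation 95% interval for eval accuracy.
# Counts are made up; this assumes i.i.d. questions and a fixed model.
import math

n_questions = 500    # evaluation size (illustrative)
n_correct = 412      # observed correct answers (illustrative)

p_hat = n_correct / n_questions
se = math.sqrt(p_hat * (1 - p_hat) / n_questions)   # standard error
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"accuracy = {p_hat:.3f}, approximate 95% CI [{lo:.3f}, {hi:.3f}]")
```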
Reading:
Hoekstra, R., Morey, R.D., Rouder, J.N. et al. Robust misinterpretation of confidence intervals. Psychon Bull Rev 21, 1157–1164 (2014). https://doi.org/10.3758/s13423-013-0572-3
Tversky, A., & Kahneman, D. (1974). Judgment under Uncertainty: Heuristics and Biases. Science, 185(4157), 1124–1131. http://www.jstor.org/stable/1738360
Additional Reading:
Rishabh Agarwal et al. (2021) Deep Reinforcement Learning at the Edge of the Statistical Precipice. NeurIPS 2021. arXiv:2108.13264
Miller, Evan (2024). Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. arXiv:2411.00640
Chen, Irene, Johansson, Fredrik D., and Sontag, David (2018) Why Is My Classifier Discriminatory? arXiv:1805.12002
Week 8: Forecasts as Benchmarks
Mar 11, 13
Topics: Proper scoring rules. Information retrieval metrics: ROC AUC, precision, recall, and F1 scores. Reliability, reproducibility, and measurement.
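An optional sketch of what "strictly proper" means: with the true event probability held fixed, the expected Brier score and expected log loss are both minimized by reporting that true probability. The true probability and the grid of candidate reports are illustrative.

```python
# Optional sketch: expected Brier score and log loss as a function of the
# reported probability, for a fixed true event probability. Both proper
# scoring rules are minimized at the truthful report.
import numpy as np

true_p = 0.7                           # illustrative true event probability
reports = np.linspace(0.01, 0.99, 99)  # candidate reported probabilities

brier = true_p * (1 - reports) ** 2 + (1 - true_p) * reports ** 2
log_loss = -(true_p * np.log(reports) + (1 - true_p) * np.log(1 - reports))

print("report minimizing expected Brier score:", reports[brier.argmin()].round(2))
print("report minimizing expected log loss:   ", reports[log_loss.argmin()].round(2))
```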
Reading
Gneiting, Tilmann, and Adrian E. Raftery. "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association 102, no. 477 (March 2007): 359–78.
Scoring Rules: [blog 1]
Scoring Rules and Utility Maximization [blog 2]
Scoring Rules force predictions to be probabilistic [blog 3]
Additional Reading
Brier, Glenn W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review 78 (1): 1–3.
Savage, Leonard J. (1971) “Elicitation of Personal Probabilities and Expectations.” Journal of the American Statistical Association, vol. 66, no. 336, pp. 783–801. https://doi.org/10.2307/2284229.
Lindley, Dennis V. “Scoring Rules and the Inevitability of Probability.” International Statistical Review / Revue Internationale de Statistique 50, no. 1 (1982): 1–11. https://doi.org/10.2307/1402448.
J. B. Predd, R. Seiringer, E. H. Lieb, D. N. Osherson, H. V. Poor and S. R. Kulkarni, "Probabilistic Coherence and Proper Scoring Rules," in IEEE Transactions on Information Theory, vol. 55, no. 10, pp. 4786-4792, Oct. 2009, doi: 10.1109/TIT.2009.2027573.
Week 9: Calibration
Mar 18, 20
Topics: What calibration is, what it can do, what it can't do. Bayesian and frequentist interpretations. Validating calibration. Calibrating without knowledge.
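An optional sketch of a binned calibration check (the arithmetic behind a reliability diagram), using synthetic forecasts that are deliberately miscalibrated. The bin count and the size of the miscalibration are illustrative choices, not from the readings.

```python
# Optional sketch: binned calibration check for a deliberately overconfident
# forecaster. Within each forecast bin, compare the mean forecast to the
# empirical frequency of the event.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
p_true = rng.uniform(0, 1, size=n)        # true event probabilities
outcomes = rng.binomial(1, p_true)        # realized binary outcomes
forecasts = np.clip(p_true + 0.1, 0, 1)   # forecasts biased upward by 0.1

edges = np.linspace(0, 1, 11)             # ten equal-width bins
bin_index = np.digitize(forecasts, edges[1:-1])
for b in range(10):
    mask = bin_index == b
    if mask.any():
        print(f"bin [{edges[b]:.1f}, {edges[b+1]:.1f}): "
              f"mean forecast {forecasts[mask].mean():.2f}, "
              f"empirical frequency {outcomes[mask].mean():.2f}")
```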
Reading:
Dawid, A. Philip (1982) "The well-calibrated Bayesian." Journal of the American Statistical Association 77.379: 605-610.
Foster, Dean P. and Hart, Sergiu (2021) "Forecast-Hedging and Calibration." Journal of Political Economy. 129(12). https://doi.org/10.1086/716559 also: arXiv:2210.07169 (Only read section 1)
[blog]
Additional Reading:
Seidenfeld, Teddy. “Calibration, Coherence, and Scoring Rules.” Philosophy of Science 52, no. 2 (1985): 274–94.
Little, Roderick J. “Calibrated Bayes: A Bayes/Frequentist Roadmap.” The American Statistician 60, no. 3 (2006): 213–23.
Foster, Dean P. “A Proof of Calibration via Blackwell’s Approachability Theorem.” Games and Economic Behavior 29, no. 1–2 (1999): 73–78.
Week 10: Games as Benchmarks
Apr 1, 3
Topics: Actual games, like chess, checkers, and poker. Other closed-world "arenas" as benchmarks, like ARC.
Reading:
Arthur L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers," IBM Journal of Research and Development, vol. 3, no. 3, pp. 210-229, July 1959, doi:10.1147/rd.33.0210.
On the role of demos in machine learning evaluation [blog!]
Additional Reading:
Strickland, Eliza. "How IBM overpromised and underdelivered on AI health care." IEEE Spectrum 56.4 (2019): 24-31.
McCarthy, John. "AI as Sport." Science (1997): 1518-1519.
McCarthy, John. "Chess as the Drosophila of AI." Computers, chess, and cognition. New York, NY: Springer New York, 1990. 227-237.
Ensmenger, Nathan. "Is chess the drosophila of artificial intelligence? A social history of an algorithm." Social studies of science 42.1 (2012): 5-30.
Henderson, P., et al. "Deep reinforcement learning that matters." In AAAI 2018. Tech Report: arXiv:1709.06560 (2017).
Mania, Horia, Aurelia Guy, and Benjamin Recht. "Simple random search of static linear policies is competitive for reinforcement learning." Advances in neural information processing systems 31 (2018). Tech Report: arXiv:1803.07055
Week 11: Contemporary Benchmarking?
Apr 8, 10
Topics: Polyphasic evaluation, LLMs; symbolic testing, behavioral testing, verification; interactivity, adaptive testing, dynamic benchmarks.
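An optional sketch of a CheckList-style behavioral test. The toy_model function below is a stand-in for whatever system is under evaluation, and the templates and negation check are invented for illustration; the point is that the test probes a specific behavior rather than reporting a single aggregate accuracy.

```python
# Optional sketch: a template-based behavioral (negation) test.
# `toy_model` is a placeholder for the system being evaluated.
def toy_model(text: str) -> str:
    # Crude keyword "sentiment classifier" standing in for a real model.
    return "negative" if "not" in text or "bad" in text else "positive"

templates = ["The service was {adj}.", "The food was really {adj}."]
adjective_pairs = [("good", "not good"), ("great", "not great")]

failures, total = 0, 0
for template in templates:
    for pos_adj, neg_adj in adjective_pairs:
        total += 1
        pos_pred = toy_model(template.format(adj=pos_adj))
        neg_pred = toy_model(template.format(adj=neg_adj))
        # Expected behavior: negating the adjective flips the predicted label.
        if not (pos_pred == "positive" and neg_pred == "negative"):
            failures += 1

print(f"negation-test failures: {failures} / {total}")
```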
Reading:
Wei, Jason, et al. "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research. https://openreview.net/forum?id=yzkSU5zdwD
Schaeffer, Rylan, et al. "Are Emergent Abilities of Large Language Models a Mirage?" In NeurIPS. arXiv:2304.15004 (2023)
What if all of these papers on LLM evaluation are just making Sam Altman richer? [blog]
Additional Reading:
Srivastava, Aarohi, et al. "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models." arXiv:2206.04615 (2022)
Kiela, Douwe, et al. "Dynabench: Rethinking benchmarking in NLP." arXiv:2104.14337 (2021).
Ribeiro, Marco Tulio, et al. "Beyond accuracy: Behavioral testing of NLP models with CheckList." arXiv:2005.04118 (2020).
Hennigen, Lucas Torroba, et al. "Towards Verifiable Text Generation with Symbolic References." arXiv:2311.09188 (2023)
Mirzadeh, Iman, et al. "Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models." arXiv:2410.05229 (2024).
Seshia, Sanjit A., Dorsa Sadigh, and S. Shankar Sastry. "Toward verified artificial intelligence." Communications of the ACM 65.7 (2022): 46-55. arXiv:1606.08514
Chen, Mark, et al. "Evaluating large language models trained on code." arXiv:2107.03374 (2021).
Week 12: Evaluating Interventions
Apr 15, 17
Topics: Evaluating systems with RCTs and A/B tests. Temporal validity, compliance, and adjustment. Adaptive experiments.
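An optional sketch of the basic A/B-test calculation in the Neyman tradition: a difference-in-means estimate of the average treatment effect with a conservative variance estimate. The synthetic outcomes and the true effect size are invented for illustration.

```python
# Optional sketch: difference-in-means estimate for a two-arm A/B test with
# Neyman's conservative variance estimate. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
treatment = rng.integers(0, 2, size=n)                 # random assignment
outcome = rng.normal(loc=0.2 * treatment, scale=1.0)   # true effect 0.2 (made up)

y1, y0 = outcome[treatment == 1], outcome[treatment == 0]
effect = y1.mean() - y0.mean()
var_hat = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
se = np.sqrt(var_hat)

print(f"estimated effect: {effect:.3f} "
      f"(approx. 95% CI [{effect - 1.96 * se:.3f}, {effect + 1.96 * se:.3f}])")
```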
Reading:
Splawa-Neyman, Jerzy. "On the Application of Probability Theory to Agricultural Experiments. Essay on Principles." Roczniki Nauk Rolniczych Tom X (1923) 1-51 (Annals of Agricultural Sciences)
Freedman, David. "Randomization Does Not Justify Logistic Regression." Statistical Science. 23(2): 237-249 (May 2008). DOI: 10.1214/08-STS262 (Only read the first three pages for a modern translation of Neyman's paper).
Additional Reading:
Gerber, Alan S., and Donald P. Green. Field Experiments: Design, Analysis, and Interpretation. Norton (2012). First two chapters.
Kohavi et al. Trustworthy Online Controlled Experiments. Cambridge University Press (2020).
Week 13: Regulation and Deployment
Apr 22, 24
Topics: Program Evaluation, Policy Evaluation, Legal Compatibility, deployment
Reading:
Rossi, Peter. "The iron law of evaluation and other metallic rules." Research in Social Problems and Public Policy 4.1 (1987): 3-20.
Rossi 40 years later [blog]
Additional Reading:
Wu, Eric, et al. "How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals." Nature Medicine 27.4 (2021): 582-584.
Han, Ryan, et al. "Randomised controlled trials evaluating artificial intelligence in clinical practice: a scoping review." The Lancet Digital Health 6.5 (2024): e367-e373.
Alkhatib, Ali, and Michael Bernstein. "Street-level algorithms: A theory at the gaps between policy and decisions." CHI 2019.
Recht, Benjamin. "A Bureaucratic Theory of Statistics." arXiv:2501.03457. 2025.
Rossi, Peter H., Mark W. Lipsey, and Howard E. Freeman. Evaluation: A systematic approach. Sage Publications, 2003.