5 Comments
Nico Formanek

If the hard currency of evaluation is, well, hard currency, then I see an interesting feedback loop arising.

Uday Singh Saini

I got your book today :)

Bob Williamson

This is a nice example of the "self-justification" of the benchmark "style of reasoning." (The phrases are due to Ian Hacking, "Statistical Language, Statistical Truth and Statistical Reason: The Self-Authentication of a Style of Scientific Reasoning," pages 130–157 in Social Dimensions of Science, University of Notre Dame Press, 1992.)

Paul Feyerabend has a nice observation that I think is apposite:

"Some of the methods of modern empiricism which are introduced in the spirit of anti-dogmatism and progress are bound to lead to the establishment of a dogmatic metaphysics and to the construction of defence mechanisms which make this metaphysics safe from refutation by experimental inquiry.”

(from page 4 of Paul Feyerabend, “How to be a good empiricist: a plea for tolerance in matters epistemological,” pages 3–39 in B. Baumrin (ed.), Philosophy of Science: The Delaware Seminar, volume 2, Interscience Press, 1963.)

Until you step outside of the naive-empiricist style of reasoning, of _course_ you will not see any problem with it. It is self-perpetuating and self-justifying by design.

Meanwhile, there are good and careful critiques of benchmark culture: Ben Hutchinson, Negar Rostamzadeh, Christina Greer, Katherine Heller, and Vinodkumar Prabhakaran, "Evaluation gaps in machine learning practice," pages 1859–1876 in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.

Anyway, I will remain distrustful of anyone who tries to persuade me "this here is the best method to use; trust me ... I used my own method to test how good my method is, and it says it is great!" notwithstanding the amount of money sloshing around in their neighborhood :-)

Anthony Clairmont

Thank you for this article. I want to reply from the perspective of the evaluation of social programs (which I know is not your main focus here, but which you may find interesting).

"This definition seems reasonable enough, but in a world obsessed with quantification, this sets into motion an inevitable bureaucratic collapse. If you want to make your evaluation legible and fair to all stakeholders, you must make it quantitative. If you want to handle a diversity of contexts, you must evaluate on multiple instantiations and report the average behavior. Quantification has to become statistical. And once you declare your expectations and metrics, everything becomes optimization. Evaluation inevitably becomes statistical prediction."

The actual practice of program evaluation doesn't fall victim to this bureaucratic collapse because we are typically evaluating just one program at a time, comparing it to benchmarks that are often arbitrary. Maybe this is the other end of the spectrum? Idiosyncratic evaluation versus large-n optimization?

Jae Yeon Kim

This is fascinating. FWIW, your definition of evaluation is quite similar to how Aaron Wildavsky and Angela Browne defined it in the third edition of Implementation: as the other side of implementation, the gap between intended and actual policy outcomes. Wildavsky was the founding dean of the Goldman School and an intellectual giant of his generation. https://www.degruyterbrill.com/document/doi/10.1525/9780520353497-013/html?srsltid=AfmBOopTPY57DtVyE558s0z-flRyiToGMoC_gX3LJUskgtYWkV4iexBL