Not in this field. Do I detect a hint of alarm?
Related to external validity, most benchmarks also suffer from poor ecological validity, i.e., whether your testing environment bears any relation to the environment you ultimately care about. Which I'm now remembering you've already written about! :) https://ai.nejm.org/doi/abs/10.1056/AIe2401235
I wonder about these new benchmarks that test the ability to solve real tasks (like https://www.swebench.com/), but it feels like contamination is probably an issue.
I have been thinking about this for a while (Ben will confirm, given our weekly chats about something related), and am coming around to the view that evaluation is not just an atomic act, but a particular kind of ongoing interaction between the people doing the evaluation, the system being evaluated, and the environment where everything resides. In that context, I liked a recent paper by Fintan Mallory, "Large Language Models are Stochastic Measuring Devices":
https://philarchive.org/rec/MALLLM
I wish I could have taken this course.