4 Comments
Ormond

Not in this field. Do I detect a hint of alarm?

Raj Movva

Related to external validity, most benchmarks also suffer from poor ecological validity, i.e., whether your testing environment is at all related to the environment you ultimately care about. Which I'm now remembering you've already written about! :) https://ai.nejm.org/doi/abs/10.1056/AIe2401235

I wonder about these new benchmarks that test the ability to solve real tasks (like https://www.swebench.com/), but it feels like contamination is probably an issue.

Maxim Raginsky

I have been thinking about this for a while (Ben will confirm, given our weekly chats about something related), and am coming around to the view that evaluation is not just an atomic act, but a particular type of ongoing interaction between the persons doing the evaluation, the system being evaluated, and the environment where everything resides. In that context, I liked a recent paper by Fintan Mallory, "Large Language models are stochastic measuring devices":

https://philarchive.org/rec/MALLLM

Kshitij Parikh

I wish I could have taken this course.
