Not in this field. Do I detect a hint of alarm?
Related to external validity, most benchmarks also suffer from poor ecological validity, i.e., whether your testing environment bears any relation to the environment you ultimately care about. Which I'm now remembering you've already written about! :) https://ai.nejm.org/doi/abs/10.1056/AIe2401235
I wonder about these new benchmarks that test the ability to solve real tasks (like https://www.swebench.com/), but it feels like contamination is probably an issue.
I have been thinking about this for a while (Ben will confirm, given our weekly chats about something related), and am coming around to the view that evaluation is not just an atomic act, but a particular kind of ongoing interaction between the people doing the evaluation, the system being evaluated, and the environment where everything resides. In that context, I liked a recent paper by Fintan Mallory, "Large Language Models are Stochastic Measuring Devices":
https://philarchive.org/rec/MALLLM
I wish I could have taken this course.