Discussion about this post

Ormond

Not in this field. Do I detect a hint of alarm?

Raj Movva

Related to external validity, most benchmarks also suffer from poor ecological validity, i.e., whether your testing environment is at all related to the environment you ultimately care about. Which I'm now remembering you've already written about! :) https://ai.nejm.org/doi/abs/10.1056/AIe2401235

I wonder about these new benchmarks that test the ability to solve real tasks (like https://www.swebench.com/), but it feels like contamination is probably an issue.
