I wanted to take a moment to surface some of the great feedback I’ve received on the current blog series. Though I’ve avoided writing too many equations, I’ve gotten into some academic weeds, and folks are sending me fascinating connections to other work I haven’t discussed yet. Let me pause the regularly scheduled programming and share a few of these.
And thank you all for engaging!
Carl Boettiger shared a fantastic paper showing how better predictions in fishery models can lead to worse ecological outcomes. Using a model of predator-prey dynamics between fish and birds, he develops two approximate models, one much more accurate than the other. But the less accurate model ends up being better for conservation than the accurate one. In fact, the bird population the model is designed to protect decreases under the accurate model. It’s yet another example of Doyle’s paradox: within a set of possible models, the ones that appear to fit the data better can perform worse when deployed in practice. This again shows that in feedback systems, uncertainty is far more complex than just “prediction error.”
Jeff Linderoth and I discussed what happens in stochastic programming with recourse when the agent’s actions impact the environment. In stochastic programming, you choose an action for today and a plan for tomorrow, where tomorrow’s plan is allowed to take into account new information that comes to light after the first action. Stochastic programming (and robust optimization as well) can solve very complicated two-stage models, but they always assume that the data you get tomorrow is independent of your action today. This is why I’ve been putting it in the bottom left corner of my decision-making-under-uncertainty scatter plot.1
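For readers who haven’t seen the setup, here is a minimal sketch of the standard two-stage formulation, written in generic textbook notation rather than anything Jeff specifically endorsed (the matrices and the recourse function Q below are just placeholders):

```latex
% Two-stage stochastic linear program with recourse -- a textbook sketch.
% Stage 1: commit to x today. Stage 2: after the random data \xi is revealed,
% choose the recourse action y.
\begin{aligned}
\min_{x \ge 0} \quad & c^\top x + \mathbb{E}_{\xi}\bigl[ Q(x,\xi) \bigr]
  \quad \text{s.t. } A x = b,\\
\text{where} \quad Q(x,\xi) \;=\; \min_{y \ge 0} \quad & q(\xi)^\top y
  \quad \text{s.t. } W y = h(\xi) - T(\xi)\, x .
\end{aligned}
% The crucial fine print: the distribution of \xi is fixed in advance and does
% not depend on the first-stage decision x.
```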
Jeff tells me there is a line of work in stochastic optimization called “optimization under data-dependent uncertainty” that attempts to change this. In this paper, Nohadani and Sharma extend the linear programming formulations popular in stochastic programming and robust optimization to the case where the new data depends on the first action. Even in the simplest cases, the resulting problems are NP-hard. This hardness result isn’t a total show-stopper; it just means that universal solvers won’t be possible here without further simplification of the models. This seems to be a recurring theme.
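To give a flavor of what “data-dependent” means, compare the two constraint forms below. This is my own toy robust-LP notation, not the model from the paper: in the ordinary case you protect against a fixed uncertainty set, while in the decision-dependent case the set you must protect against moves with your choice.

```latex
% Ordinary robust constraint: guard against every realization in a fixed set U.
a^\top x \le b \quad \forall\, a \in \mathcal{U}

% Decision-dependent version: the uncertainty set itself depends on the decision x.
a^\top x \le b \quad \forall\, a \in \mathcal{U}(x)
```

It’s the second form, where the thing you’re hedging against shifts with what you do, that turns out to be hard even in simple cases.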
Jessica Hullman and David Stutz both wrote very thoughtful blog posts in response to my series on prediction intervals. Both are worth a read, and I think you’ll be surprised by how much we agree. David’s final paragraph here is very uncontroversial:
“Overall, I have come to the following conclusion: In many cases there is no reason not to perform conformal calibration wherever we are calibrating a model or need uncertainty estimates. While we do not get perfectly conditional guarantees, we get a marginal one "for free". This is because these methods do not need more data, extra compute or stronger assumptions than standard, empirical calibration.”
I want to gently push back on the notion that prediction intervals are “free.” As I’ve pointed out, they are very data-hungry: a single prediction interval calibrated to 1% error requires about 10,000 data points. Would a person make a better decision knowing that their prediction uncertainty was so tightly calibrated?
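To make the data-hunger concrete, here is a minimal split-conformal sketch. This is the generic recipe, not any particular paper’s implementation: with a miscoverage target of alpha, the calibration quantile sits at rank ⌈(n+1)(1−alpha)⌉, so at alpha = 0.01 the interval isn’t even finite until you have at least 99 calibration points, and pinning the realized coverage down near 99% takes far more.

```python
import numpy as np

def split_conformal_halfwidth(cal_scores, alpha):
    """Split-conformal half-width from calibration residuals (generic sketch).

    cal_scores: absolute residuals |y_i - yhat_i| on a held-out calibration set.
    alpha: target miscoverage (0.01 for 99% intervals).
    Returns q so that [yhat - q, yhat + q] covers a fresh exchangeable point
    with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the conformal quantile
    if k > n:
        # Too little calibration data: the only valid interval is infinite.
        return np.inf
    return np.sort(cal_scores)[k - 1]

# Toy illustration of how the half-width depends on calibration set size.
rng = np.random.default_rng(0)
for n in [50, 99, 1000, 10_000]:
    residuals = np.abs(rng.standard_normal(n))  # stand-in residuals
    print(n, split_conformal_halfwidth(residuals, alpha=0.01))
```

Which brings me back to the question I just asked.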
I don’t think so. I come back to risk scores in medicine, where most returned predictions are already intervals. Clinical risk scores tend to provide predictions like “low risk (0-20%),” “medium risk (40-60%),” and “high risk (above 75%).” Would US healthcare be better if those ranges were calibrated? Jessica agrees that excessive precision is probably unnecessary here but wonders whether calibrated intervals degrade more gracefully under distribution shift, given recent papers that try to make them “more conditional.” (ed., the scare quotes here are mine, not Jessica’s).
This is again going to get me in trouble, but I’ve tried to read all of these papers on conditional coverage and just come away confused. The nice part about conformal prediction is that it makes only one assumption: the joint distribution of the past and the future is exchangeable. But exact conditional coverage is intractable, so significant additional assumptions are needed for any approximate conditional notion (see, I told you this would be a recurring theme). Dozens of people have sent me this paper from the Stanford group, but no one, including the authors, can explain to me what it’s trying to do. Isn’t this just picking groups in advance and calibrating them individually? Is that what we want?
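For what it’s worth, here is my caricature of that “picking groups in advance” reading, not the actual algorithm from the paper: split the calibration data by a pre-specified group label and run the same quantile computation within each group.

```python
import numpy as np

def groupwise_conformal_quantiles(cal_scores, cal_groups, alpha):
    """One split-conformal quantile per pre-specified group (a caricature).

    cal_scores, cal_groups: NumPy arrays of residuals and group labels.
    """
    quantiles = {}
    for g in np.unique(cal_groups):
        scores = np.sort(cal_scores[cal_groups == g])
        n = len(scores)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        # Groups with too few calibration points get an infinite interval.
        quantiles[g] = scores[k - 1] if k <= n else np.inf
    return quantiles
```

The catch is that every group now needs its own pile of calibration data, so the sample-size arithmetic from above gets multiplied by the number of groups.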
The technical details around conditional coverage are largely beside the point. The main point of disagreement between Jessica, David, and me is about the potential utility of prediction intervals. Both argue that there is value in getting people to think beyond point estimates and to consider explicit uncertainty in their decision making. I think more evidence is needed to back this assertion. Jessica also wonders about this:
“Of course, it’s possible that in many settings we would be better using some inherently interpretable model for which we no longer need a distribution-free approach. And ultimately we might be better off if we can better understand the decision problem the human decision-maker faces and apply decision theory to try to find better strategies rather than leaving it up to the human how to combine their knowledge with what they get from a model prediction. I think we still barely understand how this occurs even in high stakes settings that people often talk about.”
Here, we agree. Prediction is only a small piece of the complex puzzle of decision making under uncertainty. Jessica has inspired me to try to tackle expert human decision making. At some point this spring, I’ll do a blog series on some of my favorite writings on expert decision making. The evidence shows that experts couldn’t be further from what we think of as “mathematically optimal,” and I want to engage with why our math always fails to capture the most amazing aspects of people.
1. Do I start referring to that as the DMUUS plot? That seems like a bad idea.