Nostalgia: we older-generation EE folks learned this topic from Van Trees' textbook. His Wikipedia page says that "Van Trees was initially on loan to the government by MIT; he then ended up staying for a number of projects." https://en.wikipedia.org/wiki/Harry_L._Van_Trees . Shame on MIT for loaning out a professor for money; what year was that? I think that explains why he published so rarely after his famous volumes on the topic. I have friends who have built their entire professional careers on Chapter 2 of Vol. 1 (Vol. 1: https://www.amazon.com/Detection-Estimation-Modulation-Theory-Part/dp/0471095176/ )
Really like this topic because it makes a lot of engineering students uncomfortable.
"How you balance the tradeoff can’t be determined by math." That’s the most interesting part of building models, because it comes down to how they’ll actually be used. Take cancer detection: it’s rare that an AI tool is the sole arbiter of a diagnosis. Clinicians use it to guide follow-up steps.
What they really care about is, "In deployment, when the model says someone has cancer, do they actually have cancer?" That’s the positive predictive value. If PPV is low, they get alert fatigue fast and stop paying attention. And PPV isn’t just about the model, it also depends on prevalence. For rare conditions, even a model with good TPR and FPR can have such a low PPV that nine out of ten alerts are false.
Yes. And just adding for folks in ML land, PPV is the same thing as the "precision" of a classifier.
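To make the prevalence point concrete, here is a minimal sketch of the arithmetic in Python. The TPR, FPR, and prevalence values are made-up numbers for illustration, not figures from any real screening model.

```python
# Illustrative (assumed) numbers for a rare condition.
tpr = 0.95          # sensitivity: P(alert | cancer)
fpr = 0.05          # false positive rate: P(alert | no cancer)
prevalence = 0.005  # P(cancer) in the screened population

# Bayes' rule: PPV = P(cancer | alert).
ppv = (tpr * prevalence) / (tpr * prevalence + fpr * (1 - prevalence))
print(f"PPV (precision) = {ppv:.3f}")  # ~0.087
```

Even with 95% sensitivity and only a 5% false positive rate, a prevalence of half a percent pulls the PPV below 10%, i.e. roughly nine out of ten alerts are false.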
What happens if you try to use decision theory to find the cost?
We can't be wrong if we apply enough recursion!
This also applies to the legal system. "What are the costs of false positive verdicts for a capital crime?" It depends on how you weigh justice. Authoritarians will offer the argument that executing a few innocents is worth the reduction in the murder rate. More liberal systems value the life of the individual and prefer "Better some of the guilty go free rather than one innocent die."
There is no purely mathematical approach (unless you have the data for the equivalent of the trolley problem), and we know that we don't all respond to that problem the same way. Interestingly, the US has one of the highest incarceration rates on the planet, with the high attendant costs.
A big problem in medical evaluation of testing is that doctors tend to think only in terms of health outcomes, and disregard preferences and the value of information - the first things an economist would think about.
So if prostate screening leads some patients to undertake (what doctors judge to be) premature surgery, it's rated as bad. And if there is no treatment implication, doctors see it as a pure negative. But information about your healthy life expectancy is valuable in all sorts of ways, and judgements about costs and benefits depend on personal preferences.
Would it be possible to discuss the issues with ROC curves further? Also, could we learn from existing metrics to identify patterns that would make future proposed metrics less problematic and easier to revise?
From this post, it's clear that Neyman-Pearson tests are relevant in machine learning, perhaps as part of your "regulatory" view of the purpose of statistics? The topic also comes up in this post and the exchange in its comments: https://errorstatistics.com/2024/10/22/response-to-ben-rechts-post-what-is-statistics-purpose-on-my-neyman-seminar/
Hi Deborah, I find the Neyman-Pearson lemma to be relevant and useful in illustrating the types of population error rates that are possible for prediction. But NP tests are seldom used directly in machine learning contexts.
That said, NP tests are certainly useful in signal detection theory, as commenters Cagatay and Visar both point out.
The only work I’m aware of that connects NP tests and statistical learning is this paper by Clay Scott and Rob Nowak: https://web.eecs.umich.edu/~cscott/pubs/npIT.pdf
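Since the signal-detection connection has come up a couple of times here, a minimal sketch of a Neyman-Pearson test in that setting may help; the two-Gaussian setup and the 5% false-positive target are purely illustrative assumptions. The lemma says the most powerful test at a fixed FPR thresholds the likelihood ratio, which for a simple mean shift reduces to thresholding the observation itself; sweeping the threshold traces out the ROC curve.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy problem: H0 is noise only, x ~ N(0, 1); H1 is signal present, x ~ N(1, 1).
# The likelihood ratio p1(x)/p0(x) is monotone increasing in x here,
# so the Neyman-Pearson test reduces to "declare H1 if x > t".
alpha = 0.05                   # target false positive rate (size of the test)
t = norm.ppf(1 - alpha)        # choose t so that P(x > t | H0) = alpha
power = 1 - norm.cdf(t - 1.0)  # TPR under H1, whose mean is shifted to 1
print(f"threshold = {t:.3f}, FPR = {alpha:.3f}, TPR = {power:.3f}")

# Sweeping the threshold gives (FPR, TPR) pairs, i.e. points on the ROC curve.
for t in np.linspace(-3, 4, 8):
    print(f"t = {t:5.2f}  FPR = {1 - norm.cdf(t):.3f}  TPR = {1 - norm.cdf(t - 1.0):.3f}")
```

The threshold is where the non-mathematical part enters: the lemma tells you the optimal test for a given FPR, but not which FPR, or which point on the ROC curve, you should actually want.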