In my early days on Substack, Cosma Shalizi pointed me to a conundrum about decision making. Way back in 2004, Cosma wrote a post about what seemed to be a completely contradictory finding, and he describes the problem succinctly through an example:
“Say you're interested in diagnosing heart diseases from electrocardiograms. Normally we have clinicians, i.e., expert doctors, look at a chart and say whether the patient has (to be definite) a heart condition requiring treatment within one year. Alternately, we could ask the experts what features they look at, when making their prognosis, and then fit a statistical model to that data, trying to predict the outcome or classification based on those features, which we can still have human experts evaluate. This is the actuarial approach, since it's just based on averages --- "of patients with features x, y and z, q percent have a serious heart condition".
“The rather surprising, and completely consistent, result of these studies is that there are no known cases where clinicians reliably out-perform actuarial methods, even when the statistical models are just linear classification rules, i.e., about as simple a model as you can come up with.”
Longtime argmin readers will recognize this as Meehl’s problem of “clinical versus statistical decisions.” That statistical rules outperform clinical experts is a remarkably robust finding that has puzzled social scientists for seventy years. Every single study shows cold statistical tabulation is better at prediction than people. But why?
A wildly influential school of economic thought unfortunately decided the answer was that people are fundamentally flawed. Daniel Kahneman credits Meehl’s book as the inspiration for the entire Heuristics and Biases program. The H&B crew thought experts should be more like computers and actuaries, better able to process probabilistic information and not be fooled by Linda Problems. And those smartypants Bayesian experts concluded they could just Nudge the rest of us to do what they paternalistically decided was in our best interests.
Despite its persistent validation, there still seemed to be something off with the Meehlian superiority of the algorithm. Part of the reason why Meehl’s result is so puzzling is that in practice, everyone hates the implementation and outcomes associated with statistical rules. Such rules are always clunky, inflexible, and static. If you want a thorough analysis of the nightmare of the actuarial approach in medicine, this post by Justin Morgenstern goes into excruciating detail about why medical decision aids actually make medical decision-making worse. Beyond medicine, you get folks who write popular books calling statistical predictive rules literal snake oil and declaring them a danger to society.
Moreover, as Cosma wrote:
“[T]here is another body of experimental work, admittedly more recent, on "simple heuristics that make us smart", which seems to show that people are often very good judges, under natural conditions. That is to say, we're very good at solving the problems we tend to actually encounter, presented in the way we encounter them. The heuristics we use to solve those problems may not be generally applicable, but they are adapted to our environments, and, in those environments, are fast, simple and effective.”
There seems to be something broken. Computers make better predictions than humans. Building systems where all decisions are computerized leads to dangerous nightmares. Human experts are more adaptable and effective at implementing actual decisions. How can all of these things be true? To say this contradiction consumed two years of my thinking isn’t quite accurate. But it’s close enough.
It turns out the resolution of the contradiction is not complicated. Statistical rules outperform people because the Meehlian advocates, the disciples of Kahneman and Tversky, and the reprehensible nudge economists don’t bother to tell you that they rig the game against the humans.
Meehl’s argument is a trick. He builds rigorous theoretical scaffolding to define a decision problem, but that scaffolding quietly restricts attention to problems where the actuarial tables will always come out ahead. He first insists the decision problem be explicitly machine-legible: it must have a small number of precisely defined actions or outcomes, and the actuarial method must be able to process the same data as the clinician. This narrows the set of problems to those that are computable. We box people into working in the world of machines.
But what truly rigs the game is how the decisions are evaluated: Decisions about individuals are evaluated on average. Meehl is very precise about such evaluation in his book. The only way, he claims, to decide if a decision maker is good is to examine their track record. He concedes doctors might be better at predicting the outcomes in rare cases. But rare cases are rare and don’t show up in the averages.
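To make that concrete, here is a toy simulation of my own, not anything from Meehl or from the paper: the risk buckets, the base rates, and the 2% “rare subtype” are all invented for illustration. A plain frequency table wins the average-accuracy contest even though the simulated clinician is perfect on exactly the cases the table cannot see.

```python
# A toy sketch of the rare-case point, entirely my own construction (the buckets,
# base rates, and the 2% "rare subtype" are made up, not from Meehl or the paper).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Machine-legible chart features: three coarse risk buckets. A hidden rare
# subtype (2% of cases) always has the bad outcome, but the buckets can't see it.
bucket = rng.integers(0, 3, size=n)
rare = rng.random(n) < 0.02
p_bad = np.where(rare, 1.0, np.array([0.2, 0.4, 0.7])[bucket])
y = (rng.random(n) < p_bad).astype(int)  # 1 = bad outcome within a year

# Actuarial rule: tabulate each bucket's bad-outcome rate and predict the majority.
bucket_rate = np.array([y[bucket == b].mean() for b in range(3)])
actuarial = (bucket_rate[bucket] > 0.5).astype(int)

# Simulated clinician: catches every rare-subtype case, but exercises idiosyncratic
# judgment that flips the table's call on 10% of ordinary cases, which costs
# accuracy on average because the table is right more often than not there.
override = rng.random(n) < 0.10
clinician = np.where(rare, 1, np.where(override, 1 - actuarial, actuarial))

for name, pred in [("actuarial", actuarial), ("clinician", clinician)]:
    print(f"{name}: average accuracy = {(pred == y).mean():.3f}, "
          f"accuracy on rare cases = {(pred == y)[rare].mean():.3f}")
```

In this made-up cohort, the clinician gets every rare case right while the table gets roughly a third of them, yet the averaged metric declares the table the better decision maker: the cost of discretion on the common 98% swamps the benefit on the rare 2%.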
If you are trying to maximize your accuracy on average, you are solving a statistics problem. We shouldn’t be surprised that statistics is rarely worse than humans.
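In symbols, and in my notation rather than Meehl’s, the textbook decision-theoretic framing makes the same point: once decisions are scored by an average, the optimal decision rule is itself a statistical summary of past outcomes.

```latex
% Evaluation by average loss over a representative class of cases:
% X = the machine-legible chart features, Y = the outcome, \ell = the scoring rule.
\[
  R(f) \;=\; \mathbb{E}\big[\,\ell\big(f(X),\,Y\big)\,\big]
\]
% The rule that minimizes this average conditions on the recorded features
% and averages over past outcomes:
\[
  f^{\star}(x) \;=\; \operatorname*{arg\,min}_{a}\; \mathbb{E}\big[\,\ell(a,\,Y)\mid X = x\,\big]
\]
```

For zero-one loss, the optimal rule just predicts the majority outcome among past cases with the same recorded features. Any actuarial table that estimates these conditional rates is an approximation of that rule, so the evaluation criterion hands the actuary the win by construction.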
I find it fascinating that Meehl never questions his evaluation schema. Or at least, I’ve never seen him engage with the issue in any of his writing. Meehl is hyperfocused on inductive probability. Since clinical decisions are about the future, they are inherently uncertain. Since the outcome of a particular case is uncertain, it must be reasoned about probabilistically. He concludes that our best probabilistic guess is clearly a synthesis of past experience into statistics.
However, this focus on transmuting rates of the past into probabilities of the future misses a bigger issue: the outcome is uncertain, and the administrators demand accountability. Bureaucrats evaluate systems not on one case but on the average. Decisions about individuals are evaluated by averaging outcomes across a representative class of individuals. This trick fixes the game: if all that matters is statistical outcomes, then you’d better make decisions using statistical methods.
Meehl sides with the bureaucrats, stating, “Lacking quantification of inductive probability, we have no choice but to examine the clinician’s success-rate.” For the statistically minded, this follows logically: Rates in the past are how we make predictions about the future, so clearly rates in the future will be how we evaluate predictions. This second part is not clear at all!
How we evaluate decisions determines which methods are best. That we should be trying to maximize the mean value of some clunky, quantized performance indicator is not normatively determined. We don’t have to evaluate individual decisions by crude artificial averages. But if we do, the actuary will indeed, as Meehl dourly insists, have the final word.
There’s an inherent tension in maximizing averages. Public health policy wonks insist there is such a thing as the health of a population, one that can be computed as the average of the health of its individuals. Blindly considering only the top-level averages leads to isolating, totalitarian thinking: authoritarian schemes designed to remove discretion from the people on the ground who have the expertise to know why their cases don’t fit the committee-constructed metrics handed down by fiat.
Since the clinical-statistical problem remains perpetually confusing and contentious, I thought it deserved a DOI: I posted a short paper on arXiv today. I think of it as “A Bureaucratic Theory of Statistics, Part 2.” The manuscript is a compilation of three posts from argmin, a talk I’ve been giving for about a year, and a simple argument about averages.
I hope you find it useful! I think Meehl’s problem and its resolution should be taught in every machine learning course. I’d love your feedback on this draft and how we might move forward with evaluation and assessment of individualized decisions.