25 Comments
Indy Neogy

Going to read the paper when I get a chance (hope it will be soon); in the meantime, thank you. In years of reading H&B research I have sometimes felt I was losing it. The phrase I often uttered was "well, if you ask the question that way, of course that's how people will answer it," yet I've only rarely encountered anyone digging into that.

Kevin Baker

Ben, I don't know if you've ever read any of Gary Klein's work or any work from Naturalistic Decision Making, but this complements their critique of laboratory decision studies and of behavioral economics as a field. Klein studied how experts actually make decisions—firefighters, nurses, etc. He found they could sense when something was catastrophically wrong through cues they couldn't articulate.¹ A firefighter knew a floor was about to collapse but couldn't say why. Apply an Evidence-Based Checklist Approach® to these cases and you'd force them to code the scene into predefined categories (temperature readings, smoke color, ventilation patterns) and then calculate as the floor crashed under them.

It's not just the statistical framework that performs this flattening: the laboratory itself is a technology of isolation, designed to strip away context until only the quantifiable remains. This is, of course, desired,² but the process of turning the decision-making environment into a lab can have the effect of destroying the substrate upon which normal human expertise thrives.

---

¹ In some ways this is a classic Polanyi/Collins-style tacit knowledge problem, but SV has done so much damage to the meaning of "tacit knowledge" that I hesitate to mention it, lest someone think I mean "things I can learn by watching a youtube video, but that we haven't put in the company notion yet."

² And indeed the impossibility of defining a stable context in the field is part of what makes many social RCTs so farcical.

Kevin Baker

Only after writing this did I look at the bibliography of the preprint you shared, lol.

Ben Recht

All good! Sources of Power is my favorite book [1], and more people should read it.

And very well put about the bizarre context stripping required to formulate Meehlian problems. Meehl himself goes back and forth on this. The guy who wrote some of the most cogent and influential critiques of null hypothesis testing is also the biggest advocate of statistical predictive optimization. It's jarring reading him: he'll avidly defend psychoanalysis in one context and castigate clinical prediction in the next.

But I think this quote gets at the core of his issue, and also ties into another topic close to your heart, namely that so much "clinical practice" is just about mechanically moving individuals through a bureaucracy:

"95% of the ordinary decisions made by working practitioners, whether psychiatrist, psychologist, or social worker, are not comparable in richness and subtlety to that of a good psychoanalytic hour. The special function of the skilled clinical brain that I was at such pains to emphasize against Sarbin and Lundberg rarely operates in the ordinary workaday predictions of a parole board or in forecasting whether somebody will do well in law school, or respond to Elavil, or continue in group therapy."

"In order to use theoretical concepts fruitfully in making predictions for concrete cases, one requires a well-corroborated theory, which has high verisimilitude and includes almost all of the relevant variables, and an accurate technology of measurement, including access to the initial and boundary conditions of the system to be predicted and negligible influence of what Paul Horst called “contingency factors.” None of these conditions is met in our routine clinical forecasting situation."

- Paul Meehl, "Causes and Effects of My Disturbing Little Book," Journal of Personality Assessment 50, no. 3 (1986): 370–75. https://doi.org/10.1207/s15327752jpa5003_6.

Also, omg, "things I can learn by watching a youtube video, but that we haven't put in the company notion yet." savage.

[1] It's also Malcolm Gladwell's favorite book, which perhaps should give me pause... bah, fuck it.

J.D. Haltigan

Excellent stuff: A Bureaucratic Theory of Statistics, part 2. I am going to integrate this as supplemental reading for my Intro to Biobehavioral Stats classes. It fits perfectly with the chapters on probability and the normal curve, with real-life examples of human decision-making that students can grasp.

Alex Balinsky

Thank you very much for the insightful post. Your explanation clarifies why conformal prediction methods hold significant promise—particularly for mission-critical or high-stakes decisions—yet, in practice, they tend to perform reliably only on average.
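To make the "only on average" point concrete: the conformal guarantee is marginal coverage. Here is a minimal sketch, with toy data and a deliberately crude model invented purely for illustration (none of it is from the post), showing roughly 90% coverage overall while coverage in different regions of the input drifts above or below that.

```python
# A minimal sketch of split conformal prediction (toy data and model invented
# for illustration). The coverage guarantee is marginal: about 90% of fresh
# points fall inside their interval on average, even though coverage in
# particular regions of x can be noticeably higher or lower.
import numpy as np

rng = np.random.default_rng(0)

# Heteroskedastic data: noise grows with x, so "on average" hides a lot.
def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.1 + x, n)
    return x, y

x_train, y_train = sample(1000)
x_cal, y_cal = sample(1000)
x_test, y_test = sample(5000)

# Deliberately crude "model": predict the training mean everywhere.
def predict(x_new):
    return np.full_like(x_new, y_train.mean())

# Calibrate: take the conformal quantile of absolute residuals.
alpha = 0.1
scores = np.abs(y_cal - predict(x_cal))
n_cal = len(scores)
q_level = np.ceil((1 - alpha) * (n_cal + 1)) / n_cal
q_hat = np.quantile(scores, q_level)

# Intervals are [prediction - q_hat, prediction + q_hat].
covered = np.abs(y_test - predict(x_test)) <= q_hat
print("marginal coverage:", covered.mean())                      # close to 0.90
print("coverage where x < 0.5:", covered[x_test < 0.5].mean())   # typically above 0.90
print("coverage where x > 0.5:", covered[x_test > 0.5].mean())   # typically below 0.90
```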

Zaki

This is a 10/10 read. Thank you!

Carl Boettiger

Lovely example! I think one could read the argument you make here as essentially saying "Meehl assumed the wrong utility function"? I.e., the costs of errors are likely quite asymmetric: say, missing a fatal case vs. sending a healthy patient for some extra tests. A clinician (or an algorithm) might minimize the expected number of deaths, say, rather than the number of "correctly diagnosed" cases, no?
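A back-of-the-envelope version of that asymmetry, with prevalence and cost numbers made up purely for illustration, shows how the accuracy-maximizing action and the expected-harm-minimizing action can come apart:

```python
# Hypothetical numbers, purely for illustration: a rare condition and a missed
# case that is far more costly than an unnecessary referral. The decision that
# maximizes "correct diagnoses" differs from the one that minimizes expected harm.
p_disease = 0.05          # assumed prevalence
cost_missed_case = 100.0  # harm of sending a sick patient home
cost_extra_tests = 1.0    # harm of referring a healthy patient for more tests

# Expected loss of each action for a single patient, before any test result:
loss_send_home = p_disease * cost_missed_case        # 0.05 * 100 = 5.00
loss_refer = (1.0 - p_disease) * cost_extra_tests    # 0.95 * 1   = 0.95

print("expected harm if sent home:", loss_send_home)
print("expected harm if referred: ", loss_refer)

# Under 0/1 "correct diagnosis" scoring, sending everyone home is 95% accurate
# and looks best; under the asymmetric costs, referral is the better call.
```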

It seems to me this is the way economists often handle the same apparent contradiction. They are very reluctant to assume that the humans are wrong -- always wanting to assume the collective actions of humans in a market economy are the ultimate optimizing algorithm, the invisible hand -- and so usually explain this as the utility function in the model having overlooked something.

We often do the same in evolutionary ecology too. When real-world observations of organisms differ from those predicted by the algorithm, the argument is typically that the utility function of the algorithm must be wrong: we assume evolution is maximizing something, which essentially by definition is the utility function.

Ben Recht

It's a bit more subtle: most real decision problems can't be posed as utility maximization. And inverse optimization may be helpful for understanding, but it is almost always a non-verifiable just-so story. These optimization tools are helpful thought experiments, but they never capture the complexity of reality.

So while the econometricians want us to look at averages of variables they made up and declared relevant, no human problem is tidy enough to be converted into a linear program.

Carl Boettiger

Very well put! Yes, it is very much a just-so story of trying to find a maximization (linear or otherwise) that happens to correspond with the observed decisions. Just because we can often construct a utility post hoc does not mean we have demonstrated it is what the decision maker is doing (or trying to do).

It does reframe the problem a bit, though. Meehl's apparent contradiction isn't really this issue, since Meehl's utility function doesn't hold up to much scrutiny and we could presumably at least construct a post-hoc utility function that sounds better (e.g., clinicians weigh asymmetric risks) and better matches observed decisions. And yet the issue remains, as you say. (For instance, we can construct conflicting just-so stories that both fit the decisions equally well.)

I sometimes struggle to explain to colleagues why the problem is fundamentally non-verifiable, and appreciate being able to lean on your examples!

J.D. Haltigan

Would you say this about correctional decision-making for violent SOs (sexual offenders), though?

Ben Recht

I don't think I understand your question. Could you say more?

J.D. Haltigan

I read your piece as making the case that "most real decision problems can't be posed as utility maximization." You draw on medicine, of course, to make the case that there are outliers that decisions based on averages elide.

What I am asking, however, is whether you can find a situation or a decision where Meehl's algorithm (i.e., the use of averages, past behavior, etc.) is the superior choice. In the case of using actuarial data to decide whether to release violent sexual offenders from prison, place them on parole, etc., where psychopaths can "game" even expert clinicians, wouldn't you think that actuarial prediction (that is, the nomothetic approach) is the superior choice?

J.D. Haltigan

Just finished the preprint in full, and realized you cite the early parole board studies etc. to describe the algorithmic/actuarial method. That said, it wasn't clear to me whether, as I pose above, you come down on either side for certain types of high-stakes prediction problems such as this. In your conclusion, you say, "For all these reasons, the adoption of mechanical rules and statistical prediction in high-stakes scenarios should be done with care." Would this be essentially what you say in the particular case I describe above?

Jeff Phillips

So then how should we evaluate or quantify the effect of various methods for decision making? Or are you arguing that it is often an ill-defined task, and we should not?

Ben Recht

I have a frustrating answer: evaluation is always ill-defined, and it's on the stakeholders to be clear about their requirements. I stick with an academic, rigid definition of evaluation: "measuring the difference between articulated expectations of a system and its actual performance." To evaluate, you need to (a) articulate expectations, (b) specify how you are going to measure performance, and (c) describe how you are going to compare the performance to the expectations. I don't think any of these are easy, and I don't think there's one system that will tell us how to do (a), (b), and (c) for every decision problem we care about.

Mark Sammons

Thank you for making and sharing this analysis. Thank you too for the pointer to the article about CDRs -- very illuminating.

Jishnu Das

Looking forward to reading this. I really appreciate the insight that it's the averaging that drives the superiority. Will work through the paper!

Ben Recht

I'm very much looking forward to your feedback. Please let me know if anything is unclear or if there's anything I haven't sufficiently addressed.

Nico Formanek

Just wanted to point to this article by Malik, which I found very helpful: https://arxiv.org/abs/2002.05193. He discusses at length the "meta-decisions" involved in what you term actuarial rules. Among them is the decision to assume a central tendency, e.g., that you care only about predictions of the mean.

David Rothman

Ben, #Principal_Agent_Theory_Vibes. Agents (clinicians or decision-makers) get their decisions judged by the principals (suits) using averaged statistical performance metrics. This creates an asymmetric loss function in which statistical models tend to dominate when judged by average metrics. That said, clinicians can bring critical value in those complex cases that get lost in the averaging, where nuanced judgment and adaptability rule. Marcus Welby was right: "Always get a second opinion."

Kshitij Parikh

The research paper is a great read. Reality is fat-tailed, and rare cases can't be simulated or measured, but they are always there, and poor performance on them can be deadly. Humans are good with that; today's ML models, not so much. It's a difference of dynamics vs. statistics as an approach to understanding reality. The point about statistics being brittle to the model and always needing updating was good. So was the point that psychological causation is personal: it's better to evaluate a person directly when you have access to them, since humans detect a lot of tacit information.

FourierBot

Averaging on everything would be horrible: boys would like to evenly share cheesecake with cousins, but definitely not their popularity with girls! We have l1 and l2 regularizations, which relate to the median and the average (mean) respectively, to cope with different situations. If we rely on only one reference system, it may be vulnerable to perturbations in practice. I believe the reason humans will not be replaced by robots is that humans may make good choices instead of correct choices, even at a great price.
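The median/mean connection gestured at here is the standard fact that an l1 (absolute-error) criterion is minimized by the median while an l2 (squared-error) criterion is minimized by the mean. A small numerical check, with an invented outlier-laden example, illustrates how differently the two behave:

```python
# Checking the mean/median point numerically: the constant c minimizing the
# sum of absolute deviations (an l1 criterion) is the median, while the one
# minimizing the sum of squared deviations (an l2 criterion) is the mean.
# One outlier drags the mean far away but barely moves the median.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # one extreme value

c_grid = np.linspace(0.0, 100.0, 100001)     # candidate constants, step 0.001
l1_loss = np.abs(x[:, None] - c_grid[None, :]).sum(axis=0)
l2_loss = ((x[:, None] - c_grid[None, :]) ** 2).sum(axis=0)

print("l1 minimizer:", c_grid[l1_loss.argmin()], " median:", np.median(x))  # 3.0
print("l2 minimizer:", c_grid[l2_loss.argmin()], " mean:  ", x.mean())      # 22.0
```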

Alex Tolley

"Human experts are more adaptable and effective at implementing actual decisions. "

What is the evidence that humans, e.g., doctors, make better decisions about individuals, such as deciding treatments, possible longevity, etc.? Any statistical curve will have data that lies far from the predicted equation. One can argue that this is because unrecorded information was excluded, information a doctor might have, but is there any evidence that this is true? Shouldn't we just acknowledge the uncertainty?

Arudra Burra

This is fascinating stuff, and I think goes quite deep in the direction of some puzzles in the epistemological literature. I'll have to think more and read the paper, but I wonder if there's a more general tension of which the statistical-clinical distinction is a special case.

I have in mind a tension between third-personal and first-personal views on knowledge. One lovely example arises in the literature on epistemic disagreement, inaugurated more-or-less by a paper by Thomas Kelly 20-odd years ago. I think carefully about some issue and come to a conclusion to the effect that X is the case. A colleague I regard as equally well-qualified to deliberate upon this issue (an "epistemic peer" in the jargon) comes to the opposite conclusion, ~X. How should *this* fact feed into my overall deliberations with regard to X?

There are two possibilities. One is that I should suspend my belief in X. After all, from a third-personal point of view, I have no *special* reason to think that I have got it right and my colleague has got it wrong; a third party looking at both of us would have no special reason to favour one over the other. (This is easiest to see in cases involving perception, e.g., which of two horses wins a very tight race.)

On the other hand, the reasons I initially had to think that X is the case haven't gone away, or been defeated by the new knowledge I get about my colleague's beliefs! So the fact that they reach the opposite conclusion doesn't add anything new to my deliberations after all, and I should be able to ignore them as a result.

Disagreement isn't crucial to the story; the more general puzzle is how to integrate higher-order evidence with first-order evidence, and one might see the Meehl considerations as a special case of this.

I've not looked at this literature in ages, but here's an entry-point: https://plato.stanford.edu/entries/higher-order-evidence/ which may be of interest.
