25 Comments
Indy Neogy

Going to read the paper when I get a chance (hope it will be soon); in the meantime, thank you. In years of reading H&B research I have sometimes felt I was losing it. The phrase I often uttered was "well, if you ask the question that way, of course that's how people will answer it," yet I've only rarely encountered anyone digging into that.

Kevin Baker

Ben, I don't know if you've ever read any of Gary Klein's work or any work from Naturalistic Decision Making, but this complements their critique of laboratory decision studies and of behavioral economics as a field. Klein studied how experts actually make decisions—firefighters, nurses, etc. He found they could sense when something was catastrophically wrong through cues they couldn't articulate.¹ A firefighter knew a floor was about to collapse but couldn't say why. Apply an Evidence-Based Checklist Approach® to these cases and you'd force them to code the scene into predefined categories (temperature readings, smoke color, ventilation patterns) and then calculate as the floor crashed under them.

It's not just the statistical framework that performs this flattening: the laboratory itself is a technology of isolation, designed to strip away context until only the quantifiable remains. This is, of course, desired,² but the process of turning the decision-making environment into a lab can have the effect of destroying the substrate upon which normal human expertise thrives.

---

¹ In some ways this is a classic Polanyi/Collins-style tacit knowledge problem, but SV has done so much damage to the meaning of "tacit knowledge" that I hesitate to mention it, lest someone think I mean "things I can learn by watching a youtube video, but that we haven't put in the company notion yet."

² And indeed the impossibility of defining a stable context in the field is part of what makes many social RCTs so farcical.

Kevin Baker

Only after writing this did I look at the bibliography of the preprint you shared, lol.

Ben Recht

All good! Sources of Power is my favorite book [1], and more people should read it.

And very well put about the bizarre context stripping required to formulate Meehlian problems. Meehl himself goes back and forth on this. The guy who wrote some of the most cogent and influential critiques of null hypothesis testing is also the biggest advocate of statistical predictive optimization. It's jarring reading him: he'll avidly defend psychoanalysis in one context and castigate clinical prediction in the next.

But I think this quote gets at the core of his issue, and also ties into another topic close to your heart, namely that so much "clinical practice" is just about mechanically moving individuals through a bureaucracy:

"95% of the ordinary decisions made by working practitioners, whether psychiatrist, psychologist, or social worker, are not comparable in richness and subtlety to that of a good psychoanalytic hour. The special function of the skilled clinical brain that I was at such pains to emphasize against Sarbin and Lundberg rarely operates in the ordinary workaday predictions of a parole board or in forecasting whether somebody will do well in law school, or respond to Elavil, or continue in group therapy."

"In order to use theoretical concepts fruitfully in making predictions for concrete cases, one requires a well-corroborated theory, which has high verisimilitude and includes almost all of the relevant variables, and an accurate technology of measurement, including access to the initial and boundary conditions of the system to be predicted and negligible influence of what Paul Horst called “contingency factors.” None of these conditions is met in our routine clinical forecasting situation."

- Paul Meehl, "Causes and Effects of My Disturbing Little Book," Journal of Personality Assessment 50, no. 3 (1986): 370–75. https://doi.org/10.1207/s15327752jpa5003_6.

Also, omg, "things I can learn by watching a youtube video, but that we haven't put in the company notion yet." savage.

[1] It's also Malcolm Gladwell's favorite book, which perhaps should give me pause... bah, fuck it.

J.D. Haltigan

Excellent stuff: A Bureaucratic Theory of Statistics, part 2. I am going to integrate this as supplemental reading for my Intro to Biobehavioral Stats classes. It fits perfectly with the chapters on probability and the normal curve, with real-life examples of human decision-making that students can grasp.

Alex Balinsky

Thank you very much for the insightful post. Your explanation clarifies why conformal prediction methods hold significant promise—particularly for mission-critical or high-stakes decisions—yet, in practice, they tend to perform reliably only on average.
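To make the "only on average" point concrete: the conformal guarantee is marginal coverage. Here is a minimal sketch, with toy data and a deliberately crude model invented purely for illustration (none of it is from the post), showing roughly 90% coverage overall while coverage in different regions of the input drifts above or below that.

```python
# A minimal sketch of split conformal prediction (toy data and model invented
# for illustration). The coverage guarantee is marginal: about 90% of fresh
# points fall inside their interval on average, even though coverage in
# particular regions of x can be noticeably higher or lower.
import numpy as np

rng = np.random.default_rng(0)

# Heteroskedastic data: noise grows with x, so "on average" hides a lot.
def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.1 + x, n)
    return x, y

x_train, y_train = sample(1000)
x_cal, y_cal = sample(1000)
x_test, y_test = sample(5000)

# Deliberately crude "model": predict the training mean everywhere.
def predict(x_new):
    return np.full_like(x_new, y_train.mean())

# Calibrate: take the conformal quantile of absolute residuals.
alpha = 0.1
scores = np.abs(y_cal - predict(x_cal))
n_cal = len(scores)
q_level = np.ceil((1 - alpha) * (n_cal + 1)) / n_cal
q_hat = np.quantile(scores, q_level)

# Intervals are [prediction - q_hat, prediction + q_hat].
covered = np.abs(y_test - predict(x_test)) <= q_hat
print("marginal coverage:", covered.mean())                      # close to 0.90
print("coverage where x < 0.5:", covered[x_test < 0.5].mean())   # typically above 0.90
print("coverage where x > 0.5:", covered[x_test > 0.5].mean())   # typically below 0.90
```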

Zaki

This is a 10/10 read. Thank you!

Carl Boettiger

Lovely example! I think one could read the argument you make here as essentially saying "Meehl assumed the wrong utility function"? I.e., the costs of errors are likely quite asymmetric: say, missing a fatal case vs. sending a healthy patient for some extra tests. A clinician (or an algorithm) might minimize the expected number of deaths, say, rather than the number of "correctly diagnosed" cases, no?
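A back-of-the-envelope version of that asymmetry, with prevalence and cost numbers made up purely for illustration, shows how the accuracy-maximizing action and the expected-harm-minimizing action can come apart:

```python
# Hypothetical numbers, purely for illustration: a rare condition and a missed
# case that is far more costly than an unnecessary referral. The decision that
# maximizes "correct diagnoses" differs from the one that minimizes expected harm.
p_disease = 0.05          # assumed prevalence
cost_missed_case = 100.0  # harm of sending a sick patient home
cost_extra_tests = 1.0    # harm of referring a healthy patient for more tests

# Expected loss of each action for a single patient, before any test result:
loss_send_home = p_disease * cost_missed_case        # 0.05 * 100 = 5.00
loss_refer = (1.0 - p_disease) * cost_extra_tests    # 0.95 * 1   = 0.95

print("expected harm if sent home:", loss_send_home)
print("expected harm if referred: ", loss_refer)

# Under 0/1 "correct diagnosis" scoring, sending everyone home is 95% accurate
# and looks best; under the asymmetric costs, referral is the better call.
```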

It seems to me this is the way economists often handle the same apparent contradiction. They are very reluctant to assume that the humans are wrong -- always wanting to assume the collective actions of humans in a market economy are the ultimate optimizing algorithm, the invisible hand -- and so usually explain this as the utility function in the model having overlooked something.

We often do the same in evolutionary ecology too. When real-world observations of organisms differ from those predicted by the algorithm, the argument is typically that the utility function of the algorithm must be wrong: we assume evolution is maximizing something, which essentially by definition is the utility function.

Ben Recht

It's a bit more subtle: most real decision problems can't be posed as utility maximization. And inverse optimization may be helpful for understanding, but it is almost always a non-verifiable just-so story. These optimization tools are helpful thought experiments, but they never capture the complexity of reality.

So while the econometricians want us to look at averages of variables they made up and declared relevant, no human problem is tidy enough to be converted into a linear program.

Carl Boettiger

Very well put! Yes, it is very much a just-so story of trying to find a maximization (linear or otherwise) that happens to correspond with the observed decisions. Just because we can often construct a utility post hoc does not mean we have demonstrated it is what the decision maker is doing (or trying to do).

It does reframe the problem a bit, though. Meehl's apparent contradiction isn't really this issue, since Meehl's utility function doesn't hold up to much scrutiny and we could presumably at least construct a post-hoc utility function that sounds better (e.g., clinicians weigh asymmetric risks) and better matches observed decisions. And yet the issue remains, as you say. (For instance, we can construct conflicting just-so stories that both fit the decisions equally well.)

I sometimes struggle to explain to colleagues why the problem is fundamentally non-verifiable, and appreciate being able to lean on your examples!

J.D. Haltigan

Would you say this about correctional decision-making for violent SOs (sexual offenders), though?

Ben Recht

I don't think I understand your question. Could you say more?

J.D. Haltigan

I read your piece as making the case that "most real decision problems can't be posed as utility maximization." You draw on medicine, of course, to make the case that there are outliers that decisions based on averages elide.

What I am asking, however, is whether you can find a situation or a decision where Meehl's algorithm (i.e., the use of averages, past behavior, etc.) is the superior choice. In the case of using actuarial data to decide whether to release violent sexual offenders from prison, place them on parole, etc., where psychopaths can "game" even expert clinicians, wouldn't you think that actuarial prediction (that is, the nomothetic approach) is the superior choice?

J.D. Haltigan

Just finished the preprint in full, and realized you cite the early parole board studies etc. to describe the algorithmic/actuarial method. That said, it wasn't clear to me whether, as I pose above, you come down on either side for certain types of high-stakes prediction problems such as this. In your conclusion, you say, "For all these reasons, the adoption of mechanical rules and statistical prediction in high-stakes scenarios should be done with care." Would this be essentially what you say in the particular case I describe above?

Jeff Phillips

So then how should we evaluate or quantify the effect of various methods for decision making? Or are you arguing that it is often an ill-defined task, and we should not?

Ben Recht

I have a frustrating answer: evaluation is always ill-defined, and it's on the stakeholders to be clear about their requirements. I stick with an academic, rigid definition of evaluation: "measuring the difference between articulated expectations of a system and its actual performance." To evaluate, you need to (a) articulate expectations, (b) specify how you are going to measure performance, and (c) describe how you are going to compare the performance to the expectations. I don't think any of these are easy, and I don't think there's one system that will tell us how to do (a), (b), and (c) for every decision problem we care about.

Mark Sammons

Thank you for making and sharing this analysis. Thank you too for the pointer to the article about CDRs -- very illuminating.

Jishnu Das

Looking forward to reading this. I really appreciate the insight that it's the averaging that drives the superiority. Will work through the paper!

Ben Recht

I'm very much looking forward to your feedback. Please let me know if anything is unclear or if there's anything I haven't sufficiently addressed.

Nico Formanek

Just wanted to point to this article by Malik, which I found very helpful: https://arxiv.org/abs/2002.05193. He discusses at length the "meta-decisions" involved in what you term actuarial rules. Among them is the decision to assume a central tendency, e.g., that you care only about predictions of the mean.

David Rothman

Ben, #Principal_Agent_Theory_Vibes. Agents (clinicians or decision-makers) get their decisions judged by the principals (suits) using averaged statistical performance metrics. This creates an asymmetric loss function in which statistical models tend to dominate when judged by average metrics. That said, clinicians can bring critical value in those complex cases that get lost in the averaging, where nuanced judgment and adaptability rule. Marcus Welby was right: "Always get a second opinion."

Kshitij Parikh

The research paper is a great read. Reality is fat-tailed, and rare cases can't be simulated or measured, but they are always there, and poor performance on them can be deadly. Humans are good with that; today's ML models, not so much. It's a difference of dynamics vs. statistics as an approach to understanding reality. The point about statistics being brittle to the model and always needing updating was good. So was the point that psychological causation is personal: it's better to evaluate a person directly when you have access to them, since humans detect a lot of tacit information.

FourierBot

Averaging on everything would be horrible: boys would like to evenly share cheesecake with cousins, but definitely not their popularity with girls! We have l1 and l2 regularizations, which relate to the median and the average (mean) respectively, to cope with different situations. If we rely on only one reference system, it may be vulnerable to perturbations in practice. I believe the reason humans will not be replaced by robots is that humans may make good choices instead of correct choices, even at a great price.
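The median/mean connection gestured at here is the standard fact that an l1 (absolute-error) criterion is minimized by the median while an l2 (squared-error) criterion is minimized by the mean. A small numerical check, with an invented outlier-laden example, illustrates how differently the two behave:

```python
# Checking the mean/median point numerically: the constant c minimizing the
# sum of absolute deviations (an l1 criterion) is the median, while the one
# minimizing the sum of squared deviations (an l2 criterion) is the mean.
# One outlier drags the mean far away but barely moves the median.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # one extreme value

c_grid = np.linspace(0.0, 100.0, 100001)     # candidate constants, step 0.001
l1_loss = np.abs(x[:, None] - c_grid[None, :]).sum(axis=0)
l2_loss = ((x[:, None] - c_grid[None, :]) ** 2).sum(axis=0)

print("l1 minimizer:", c_grid[l1_loss.argmin()], " median:", np.median(x))  # 3.0
print("l2 minimizer:", c_grid[l2_loss.argmin()], " mean:  ", x.mean())      # 22.0
```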

Alex Tolley

"Human experts are more adaptable and effective at implementing actual decisions. "

What is the evidence that humans, e.g., doctors, make better decisions about individuals, such as deciding treatments, possible longevity, etc.? Any statistical curve will have data that lies far from the predicted equation. One can argue that this is because unrecorded information was excluded, information a doctor might have, but is there any evidence that this is true? Shouldn't we just acknowledge the uncertainty?

Arudra Burra

This is fascinating stuff, and I think goes quite deep in the direction of some puzzles in the epistemological literature. I'll have to think more and read the paper, but I wonder if there's a more general tension of which the statistical-clinical distinction is a special case.

I have in mind a tension between third-personal and first-personal views on knowledge. One lovely example arises in the literature on epistemic disagreement, inaugurated more-or-less by a paper by Thomas Kelly 20-odd years ago. I think carefully about some issue and come to a conclusion to the effect that X is the case. A colleague I regard as equally well-qualified to deliberate upon this issue (an "epistemic peer" in the jargon) comes to the opposite conclusion, ~X. How should *this* fact feed into my overall deliberations with regard to X?

There are two possibilities. One is that I should suspend my belief in X. After all, from a third-personal point of view, I have no *special* reason to think that I have got it right and my colleague has got it wrong; a third party looking at both of us would have no special reason to favour one over the other. (This is easiest to see in cases involving perception, e.g., which of two horses wins a very tight race.)

On the other hand, the reasons I initially had to think that X is the case haven't gone away, or been defeated by the new knowledge I get about my colleague's beliefs! So the fact that they reach the opposite conclusion doesn't add anything new to my deliberations after all, and I should be able to ignore them as a result.

Disagreement isn't crucial to the story; the more general puzzle is how to integrate higher-order evidence with first-order evidence, and one might see the Meehl considerations as a special case of this.

I've not looked at this literature in ages, but here's an entry-point: https://plato.stanford.edu/entries/higher-order-evidence/ which may be of interest.
