I’m here for your same five points week in and week out. Pump it straight into my eyes!
P-values as bureaucratic rule / regulatory device is a statement worthy of elevation to revelation.
Hah! Thank you. And mission accepted.
Doesn’t it feel kind of morally dubious to rely on p-values as a bureaucratic rule in the first place? I mean, I suppose it is a really good way to use numbers as a means of politicizing science and creating your own narrative.
Morally dubious? Almost certainly. But the main thing about bureaucratic rules is that expediency is more often than not priority number one, especially the expediency of rules borrowed because they appear to solve a problem, so the org doesn’t have to think about it anymore.
So it’s not about creating your own narrative or politicization as much as a process of resolving issues so the org can move on to the next issue.
"Trying to pretend you can do science without the stories isn’t going to advance anything" is maybe my favorite sentence so far this week.
Wasn't p-value hacking a known component of this whole dust-up? Soooo.....
I feel like we're in need of a suitably grim and sly 'iron law' as a larger intellectual framework for these kinds of discussions. Maybe: 'doing sophisticated statistics is only necessary to find small effects, and small effects have a habit of not existing'.
Yeah, I think the impact of p-hacking was wildly overstated with little evidence to substantiate the panic.
And I 100% agree with your iron law. Or we could go with my favorite (apocryphal) Rutherford adage: "If your experiment needs statistics, you ought to have done a better experiment."
Hi, thanks for sharing these thoughts. I've copied my replies from Bluesky here:
I agree that trying to convert aggregate p-values into a replication rate won't be reliable. Nonetheless, a paper's p-values seem to track something awfully related to replicability. Per Figure 6 [1], fragile p-values neatly identify numerous topics and methods known to produce non-replicable findings.
Small take on your COVID vaccine example: a p-value of .01 based on a correlation seems intuitive? Flip a fair coin 10 times and you'll get 9 heads about 1% of the time (10/1024 ≈ 0.98%). Yet Pfizer and society should act strongly on that result given their priors on efficacy and given the importance of the topic.
The replication crisis is certainly not over, and the paper always refers to it as ongoing. However, I wonder what the online layperson's view of psychology's replicability is. The crisis entered public consciousness, but I doubt the public is as aware of the progress toward increasing replicability.
Continuing on this response to "The replication crisis in psychology is over": it is also worth considering psychology's place relative to other fields. Looking at analogous p-value data from neuro or med journals, only psychology seems to be making a meaningful push to increase the strength of results.
Although we certainly shouldn't conclude that the replication crisis is over, it seems fair to say that there has been productive progress.
> But it should be obvious that the problem in psychology was not that 6% of the papers had p-values in a bad range.
Sorry for giving that impression; it's not at all what I had in mind. No doubt, much more than 6% of the literature remains questionable even today.
Let's define a study as questionable if it wouldn't reach 80% power even with 2.5 times the sample size. Eyeballing from playing with G*Power, this means a questionable study is one with <45% power at its actual sample size. A study with 45% power will produce a fragile (.01 < p < .05) p-value about 50% of the time it reaches significance. If all studies have either 80% power or 45% power, a 32% rate of fragile p-values implies that roughly 25% of studies are questionable (a rough sketch of this arithmetic is below). This math isn't meant to argue that 25% of studies were questionable but to show why a 32% fragile percentage can suggest a rate of questionable studies presumably >6%.
[1] Figure 6: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:nwddt5ujtulmgkw4xvjzssrh/bafkreifmdoszf4kgkjn2nuo4dbhiuuvsgsmgtzyt5mexvahkherwb4bpme@jpeg
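If it helps, here's a rough sketch of where those numbers come from. It assumes a simple two-sided z-test and counts fragile p-values as a share of significant results; the mixture weight q is the share of significant findings coming from 45%-power studies. The exact figures from G*Power will differ a little.
```python
# Back-of-the-envelope sketch (not the exact G*Power calculation): for a
# two-sided z-test, what share of *significant* results land in .01 < p < .05?
from scipy.stats import norm

def fragile_share(power, alpha=0.05, strong=0.01):
    """P(strong < p < alpha | p < alpha) for a two-sided z-test at the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)      # 1.96 for alpha = .05
    delta = z_alpha + norm.ppf(power)      # noncentrality that yields this power
    z_strong = norm.ppf(1 - strong / 2)    # 2.576 for p = .01
    p_sig = norm.cdf(delta - z_alpha)      # ~= power (ignoring the tiny opposite tail)
    p_strong = norm.cdf(delta - z_strong)  # P(p < .01)
    return (p_sig - p_strong) / p_sig

print(round(fragile_share(0.80), 2))  # ~0.26: a well-powered study is fragile ~26% of the time
print(round(fragile_share(0.45), 2))  # ~0.49: a 45%-power study is fragile ~50% of the time

# If a fraction q of significant findings come from 45%-power studies, the
# observed fragile rate is a mixture of the two; q = 0.25 lands near 32%.
q = 0.25
print(round(q * fragile_share(0.45) + (1 - q) * fragile_share(0.80), 2))  # ~0.32
```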
Thanks for replying here, Paul. Substack comment sections are where it's at.
Your points are all fair, and I don't doubt your evidence. But as is the case with all observational studies, there are just too many other ways to interpret the data. The Milton Friedman point is the hard one for me to get over. Science is a social system, and people are adaptive, so measurable causal effects are fleeting at best. Knowing that there is supposed to be a 0.05 threshold dramatically changes behavior. And if a subcommunity all decides to publish only when p < 0.01, that too is a performative change that makes causality impossible to untangle.
The COVID example is interesting to me because the study had to be large precisely because outcomes are so rare. This is one of the things that gets lost in discussions of power calculations. In health care, you only need massive studies for preventative interventions, where the bad outcomes are so rare that it is impossible to draw conclusions without casting a wide net. Said another way, there is a very big difference between a case where the prevalence in treatment and control is 0.321 versus 0.328 and one where it's 0.001 versus 0.008 (a quick calculation below makes this concrete).
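Here's the back-of-the-envelope version, plugging those prevalences into a hypothetical trial with 20,000 people per arm (a round number picked just for illustration, not any particular study):
```python
# Same absolute difference in prevalence, very different stories: the relative
# risk and the expected number of observed events diverge wildly.
scenarios = {
    "common outcome": (0.328, 0.321),  # (control prevalence, treatment prevalence)
    "rare outcome":   (0.008, 0.001),
}

n_per_arm = 20_000  # hypothetical trial size, just for illustration

for name, (p_control, p_treat) in scenarios.items():
    abs_diff = p_control - p_treat          # 0.007 in both scenarios
    rel_risk = p_treat / p_control          # ~0.98 vs ~0.12
    expected = (n_per_arm * p_control, n_per_arm * p_treat)
    print(f"{name}: abs diff = {abs_diff:.3f}, relative risk = {rel_risk:.2f}, "
          f"expected events per arm = {expected[0]:.0f} vs {expected[1]:.0f}")
```
The common-outcome arms would each see thousands of events, so the only question is whether the tiny relative effect is real; the rare-outcome arms would see only about 160 and 20 expected events even at that scale, which is exactly why you need to cast such a wide net.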
And with regards to the replication crisis, I personally think the concerns are misplaced. What we mean by replication is very vague and nebulous, and science is inherently very messy. I wrote about this a bit last week: https://www.argmin.net/p/the-good-the-bad-and-the-science
Seems like here is as good a place as any to remember this great quip by Gabriel Lippmann: "Everybody believes in the exponential law of errors: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation."
"In 99.999% of the cases where the term is used, p-value is a goofy way of summarizing the correlation between an intervention and its outcome."
Doesn't this presume a Fisherian framework? Or at least, it excludes a Neyman-Pearson framework where hypothesis testing is conceived of as a decision procedure.