I’m here for your same five points week in and week out. Pump it straight into my eyes!
P-values as bureaucratic rule / regulatory device is a statement worthy of elevation to revelation.
Hah! Thank you. And mission accepted.
Doesn’t it feel kind of morally dubious to rely on p-values as a bureaucratic rule in the first place? I mean, I suppose it is a really good way to use numbers as a means of politicizing science and creating your own narrative.
Morally dubious? Almost certainly. But the main thing about bureaucratic rules is that expediency is more often than not priority number one, especially the expediency of rules borrowed because they appear to solve a problem, so the org doesn’t have to think about it anymore.
So it’s not about creating your own narrative or politicization as much as a process of resolving issues so the org can move on to the next issue.
"Trying to pretend you can do science without the stories isn’t going to advance anything" is maybe my favorite sentence so far this week.
Wasn't p-value hacking a known component of this whole dust-up? Soooo.....
I feel like we're in need of a suitably grim and sly 'iron law' as a larger intellectual framework for these kinds of discussions. Maybe: 'doing sophisticated statistics is only necessary to find small effects, and small effects have a habit of not existing'.
Yeah, I think the impact of p-hacking was wildly overstated with little evidence to substantiate the panic.
And I 100% agree with your iron law. Or we could go with my favorite (apocryphal) Rutherford adage: "If your experiment needs statistics, you ought to have done a better experiment."
Hi, thanks for sharing these thoughts. I've copied my replies from Bluesky here:
I agree that trying to convert aggregate p-values into a replication rate won't be reliable. Nonetheless, a paper's p-values seem to track something awfully related to replicability. Per Figure 6 [1], fragile p-values neatly identify numerous topics and methods known to produce non-replicable findings.
Small take on your COVID vaccine example: a p-value of .01 based on a correlation seems intuitive? Flip a fair coin 10 times and you'll get 9 heads about 1% of the time (10/1024 ≈ 0.98%). Yet Pfizer and society should act strongly on that result given their priors on efficacy and given the importance of the topic.
The replication crisis is certainly not over, and the paper always refers to it as ongoing. However, I wonder what the online layperson's view of psychology's replicability is. The crisis entered public consciousness, but I doubt the public is as aware of the progress toward increasing replicability.
Continuing on this response to "The replication crisis in psychology is over": it is also worth considering psychology's place relative to other fields. Looking at analogous p-value data from neuro or med journals, only psychology seems to be making a meaningful push to increase the strength of results.
Although we certainly shouldn't conclude that the replication crisis is over, it seems fair to say that there has been productive progress.
> But it should be obvious that the problem in psychology was not that 6% of the papers had p-values in a bad range.
Sorry for giving that impression; it's not at all what I had in mind. No doubt, much more than 6% of the literature remains questionable even today.
Let's define a study as questionable if it wouldn't reach 80% power even with 2.5 times the sample size. Eyeballing from playing with G*Power, this means a questionable study is one with <45% power at its actual sample size. A study with 45% power will produce a fragile (.01 < p < .05) p-value about 50% of the time it reaches significance. If all studies have either 80% power or 45% power, a 32% rate of fragile p-values implies that roughly 25% of studies are questionable (a rough sketch of this arithmetic is below). This math isn't meant to argue that 25% of studies were questionable but to show why a 32% fragile percentage can suggest a rate of questionable studies presumably >6%.
[1] Figure 6: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:nwddt5ujtulmgkw4xvjzssrh/bafkreifmdoszf4kgkjn2nuo4dbhiuuvsgsmgtzyt5mexvahkherwb4bpme@jpeg
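If it helps, here's a rough sketch of where those numbers come from. It assumes a simple two-sided z-test and counts fragile p-values as a share of significant results; the mixture weight q is the share of significant findings coming from 45%-power studies. The exact figures from G*Power will differ a little.
```python
# Back-of-the-envelope sketch (not the exact G*Power calculation): for a
# two-sided z-test, what share of *significant* results land in .01 < p < .05?
from scipy.stats import norm

def fragile_share(power, alpha=0.05, strong=0.01):
    """P(strong < p < alpha | p < alpha) for a two-sided z-test at the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)      # 1.96 for alpha = .05
    delta = z_alpha + norm.ppf(power)      # noncentrality that yields this power
    z_strong = norm.ppf(1 - strong / 2)    # 2.576 for p = .01
    p_sig = norm.cdf(delta - z_alpha)      # ~= power (ignoring the tiny opposite tail)
    p_strong = norm.cdf(delta - z_strong)  # P(p < .01)
    return (p_sig - p_strong) / p_sig

print(round(fragile_share(0.80), 2))  # ~0.26: a well-powered study is fragile ~26% of the time
print(round(fragile_share(0.45), 2))  # ~0.49: a 45%-power study is fragile ~50% of the time

# If a fraction q of significant findings come from 45%-power studies, the
# observed fragile rate is a mixture of the two; q = 0.25 lands near 32%.
q = 0.25
print(round(q * fragile_share(0.45) + (1 - q) * fragile_share(0.80), 2))  # ~0.32
```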
Thanks for replying here, Paul. Substack comment sections are where it's at.
Your points are all fair, and I don't doubt your evidence. But as is the case with all observational studies, there are just too many other ways to interpret the data. The Milton Friedman point is the hard one for me to get over. Science is a social system, and people are adaptive, so measurable causal effects are fleeting at best. Knowing that there is supposed to be a 0.05 threshold dramatically changes behavior. And if a subcommunity all decides to publish only when p < 0.01, that too is a performative change that makes causality impossible to untangle.
The COVID example is interesting to me because the study had to be large precisely because outcomes are so rare. This is one of the things that gets lost in discussions of power calculations. In health care, you only need massive studies for preventative interventions, where the bad outcomes are so rare that it is impossible to draw conclusions without casting a wide net. Said another way, there is a very big difference between a case where the prevalence in treatment and control is 0.321 versus 0.328 and one where it's 0.001 versus 0.008 (a quick calculation below makes this concrete).
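Here's the back-of-the-envelope version, plugging those prevalences into a hypothetical trial with 20,000 people per arm (a round number picked just for illustration, not any particular study):
```python
# Same absolute difference in prevalence, very different stories: the relative
# risk and the expected number of observed events diverge wildly.
scenarios = {
    "common outcome": (0.328, 0.321),  # (control prevalence, treatment prevalence)
    "rare outcome":   (0.008, 0.001),
}

n_per_arm = 20_000  # hypothetical trial size, just for illustration

for name, (p_control, p_treat) in scenarios.items():
    abs_diff = p_control - p_treat          # 0.007 in both scenarios
    rel_risk = p_treat / p_control          # ~0.98 vs ~0.12
    expected = (n_per_arm * p_control, n_per_arm * p_treat)
    print(f"{name}: abs diff = {abs_diff:.3f}, relative risk = {rel_risk:.2f}, "
          f"expected events per arm = {expected[0]:.0f} vs {expected[1]:.0f}")
```
The common-outcome arms would each see thousands of events, so the only question is whether the tiny relative effect is real; the rare-outcome arms would see only about 160 and 20 expected events even at that scale, which is exactly why you need to cast such a wide net.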
And with regards to the replication crisis, I personally think the concerns are misplaced. What we mean by replication is very vague and nebulous, and science is inherently very messy. I wrote about this a bit last week: https://www.argmin.net/p/the-good-the-bad-and-the-science
Seems like here is as good a place as any to remember this great quip by Gabriel Lippmann: "Everybody believes in the exponential law of errors: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation."
"In 99.999% of the cases where the term is used, p-value is a goofy way of summarizing the correlation between an intervention and its outcome."
Doesn't this presume a Fisherian framework? Or at least, it excludes a Neyman-Pearson framework where hypothesis testing is conceived of as a decision procedure.