Discussion about this post

Deborah Mayo:

Ben:

Thank you for continuing the discussion in such an interesting way. Here are some replies to what you say in the first portion (I'll come back to the rest later):

Recht: “Now, it is 100% clear by a scan of any existing literature that statistical tests in science do not [provide an objective tool to distinguish genuine from spurious effects] Statistical tests do not and have not revealed truths about objective reality.”

Objective tools for distinguishing genuine from spurious effects are not the same as tools for revealing “truths about objective reality”, a phrase whose meaning is unclear.

Recht: “Statistical tests most certainly do not constrain two scientists from completely disagreeing about how to interpret the same data.”

Who says that distinguishing genuine from spurious effects precludes “two scientists from completely disagreeing about how to interpret the same data”? I don’t understand why Recht thinks that tools that control the probability of erroneous interpretations of data would preclude disagreement. Scientists must give reasons for their disagreement that respect the evidence. Failed replications sometimes result in the initial researchers blaming the replication, leading, in turn, to an examination of the allegation. A new replication may be carried out to avoid the criticism. That, too, is progress in a conjecture-and-refutation exercise.

Recht: “If Stark and Benjamini are right, this in turn means that consumers of statistical tests, despite hundreds of years of statistician browbeating (my phrasing), have all been using them incorrectly.”

Stark and Benjamini are right, and it in no way follows “that consumers of statistical tests, despite hundreds of years of statistician browbeating …have all been using them incorrectly”. The tests are tools for falsification. Inability to falsify (statistically) a null hypothesis is a way to block erroneously inferring evidence of a genuine effect. Such negative results are at the heart of the so-called “replication crisis”. When nominally significant results are due to multiple testing, data-dredging, outcome switching and the like, it is unsurprising that the effects disappear when independent groups seek to replicate them with more stringent protocols. The replication crisis, where there is one, is evidence of how tests are used to avoid being fooled by randomness.
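
To make that concrete, here is a small simulation sketch (the sample size, number of outcomes, and seed are hypothetical, not drawn from any study discussed here): dredging over many truly null outcomes routinely produces a nominally significant result, while a single pre-specified replication of one outcome typically does not.

    # Illustrative sketch only: every effect below is truly null.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, n_outcomes = 50, 20  # hypothetical sample size and number of dredged outcomes

    # "Original study": test 20 outcomes with zero true effect, report the smallest p-value.
    pvals = [stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
             for _ in range(n_outcomes)]
    print(f"best of {n_outcomes} dredged p-values: {min(pvals):.3f}")

    # Chance of at least one nominal p < 0.05 among 20 independent null tests:
    print(f"family-wise false-positive rate: {1 - 0.95 ** n_outcomes:.2f}")  # about 0.64

    # "Replication": one pre-specified outcome, same null truth.
    rep = stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
    print(f"replication p-value: {rep:.3f}")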

I think it is very important to understand what the 100 years of controversy is all about; I'll come back to this.

Recht: “I made a lot of physicists angry yesterday arguing” against the use of p-values in the Higgs discovery.

The Higgs discovery is an excellent case study for examining the important role of statistical tests in science, as well as illuminating controversies (ever since Lindley accused physicists of “bad science”). In my 10-year review of the Higgs episode, I discuss the value of negative statistical results. https://errorstatistics.com/2022/07/04/10-years-after-the-july-4-statistical-discovery-of-the-the-higgs-the-value-of-negative-results/

From the 10-year review: It turned out that the promising bump or “resonance” (a great HEP term) disappeared as more data became available, drowning out the significant indications seen in April. Its reality was falsified. …While disappointing to physicists, this negative role of significance tests is crucial for denying BSM [Beyond Standard Model] anomalies are real, and setting upper bounds for these discrepancies with the SM Higgs.
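
For readers less used to the HEP convention, the “5 sigma” discovery standard invoked in the July 2012 Higgs announcement corresponds to a very small one-sided p-value under the background-only (null) hypothesis; a quick sketch of the conversion:

    # Convert HEP sigma thresholds to one-sided p-values under the background-only null.
    from scipy.stats import norm

    for sigma in (3, 5):  # 3 sigma: "evidence"; 5 sigma: "discovery" by HEP convention
        print(f"{sigma} sigma -> one-sided p = {norm.sf(sigma):.2e}")
    # 3 sigma -> 1.35e-03, 5 sigma -> 2.87e-07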

I'll return to the second portion of Recht's post in another comment.

Tom Dietterich:

Good post. In both the A/B testing and new drug testing cases, don't we need to go beyond significance to consider cost/benefit? For website A/B testing, up-side "risk" is dominant; for drugs, down-side risk comes directly from the Hippocratic Oath. The p-value test would be the first coarse filter before computing up-side, down-side, and expected costs and benefits.
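
One way to picture that two-stage idea (a rough sketch: the ab_decision helper, its numbers, and the costs are all made up for illustration, not anyone's actual procedure): run the significance test as the coarse filter, and only then weigh the estimated gain against the cost of acting.

    # Hedged sketch: significance test as a coarse filter, then a crude cost/benefit check.
    from scipy import stats

    def ab_decision(conv_a, n_a, conv_b, n_b, value_per_conversion, switch_cost, alpha=0.05):
        # Coarse filter: two-proportion z-test of B's conversion rate against A's.
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
        p_value = 2 * stats.norm.sf(abs((p_b - p_a) / se))
        if p_value >= alpha:
            return f"keep A: p = {p_value:.3f}, difference not distinguished from noise"

        # Crude benefit estimate: extra conversions times their value vs. a one-time switch cost.
        expected_gain = (p_b - p_a) * value_per_conversion * n_b
        if expected_gain <= switch_cost:
            return f"keep A: significant (p = {p_value:.3f}) but estimated gain {expected_gain:.0f} <= cost {switch_cost}"
        return f"ship B: p = {p_value:.3f}, estimated gain {expected_gain:.0f} > cost {switch_cost}"

    print(ab_decision(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000,
                      value_per_conversion=5.0, switch_cost=1_000))

With these made-up numbers the difference is nominally significant (p ≈ 0.011), yet the estimated gain does not cover the switching cost, which is exactly the sense in which significance is only the first, coarse filter.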
