Ben:
Thank you for continuing the discussion in such an interesting way. Here are some replies to what you say in the first portion (I'll come back to the rest later):
Recht: “Now, it is 100% clear by a scan of any existing literature that statistical tests in science do not [provide an objective tool to distinguish genuine from spurious effects]. Statistical tests do not and have not revealed truths about objective reality.”
Objective tools for distinguishing genuine from spurious effects are not the same as tools for revealing “truths about objective reality”, whose meaning is unclear.
Recht: “Statistical tests most certainly do not constrain two scientists from completely disagreeing about how to interpret the same data.”
Who says that distinguishing genuine from spurious effects precludes “two scientists from completely disagreeing about how to interpret the same data”? I don’t understand why Recht thinks that tools that control the probability of erroneous interpretations of data would preclude disagreement. Scientists must give reasons for their disagreement that respect the evidence. Failed replications sometimes result in the initial researchers blaming the replication, which leads, in turn, to an examination of that allegation. A new replication may be carried out to avoid the criticism. That too is progress in a conjecture and refutation exercise.
Recht: “If Stark and Benyamini are right, this in turn means that consumers of statistical tests, despite hundreds of years of statistician browbeating (my phrasing), have all been using them incorrectly.”
Stark and Benjamini are right, and it in no way follows “that consumers of statistical tests, despite hundreds of years of statistician browbeating …have all been using them incorrectly”. The tests are tools for falsification. Inability to falsify (statistically) a null hypothesis is a way to block erroneously inferring evidence of a genuine effect. Such negative results are at the heart of the so-called “replication crisis”. When nominally significant results are due to multiple testing, data-dredging, outcome switching and the like, it is unsurprising that the effects disappear when independent groups seek to replicate them with more stringent protocols. The replication crisis, where there is one, is evidence of how tests are used to avoid being fooled by randomness.
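To make the “fooled by randomness” point concrete, here is a minimal simulation sketch (my own illustration, not from the original exchange): it runs tests on many purely null effects, collects the nominally significant ones, and then checks how many survive an independent replication. The number of hypotheses and the sample sizes are arbitrary assumptions.

```python
# Minimal sketch: multiple testing on pure noise produces nominally significant
# "effects", and almost none of them survive an independent replication.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_hypotheses, n_per_group, alpha = 100, 30, 0.05

def run_study(rng):
    """Two-sample t-test for each of many truly null effects (both groups ~ N(0,1))."""
    pvals = [stats.ttest_ind(rng.normal(size=n_per_group),
                             rng.normal(size=n_per_group)).pvalue
             for _ in range(n_hypotheses)]
    return np.array(pvals)

original = run_study(rng)
hits = np.flatnonzero(original < alpha)        # "discoveries" found by dredging
replication = run_study(rng)                   # fresh data for the same hypotheses
survivors = int(np.sum(replication[hits] < alpha))

print(f"nominally significant in the original study: {len(hits)} of {n_hypotheses}")
print(f"still significant in the replication: {survivors}")
```

With everything null, roughly an alpha fraction of the hundred hypotheses clear the threshold the first time, and only an alpha fraction of those are expected to do so again.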
I think it is very important to understand what the 100 years of controversy is all about--I'll come back to this.
Recht: “I made a lot of physicists angry yesterday arguing” against the use of p-values in the Higgs discovery.
The Higgs discovery is an excellent case study for examining the important role of statistical tests in science, as well as illuminating controversies (ever since Lindley accused physicists of “bad science”). In my 10 year review of the Higgs episode, I discuss the value of negative statistical results. https://errorstatistics.com/2022/07/04/10-years-after-the-july-4-statistical-discovery-of-the-the-higgs-the-value-of-negative-results/
From the 10 year review: It turned out that the promising bump or “resonance” (a great HEP term) disappeared as more data became available, drowning out the significant indications seen in April. Its reality was falsified. …While disappointing to physicists, this negative role of significance tests is crucial for denying BSM [Beyond Standard Model] anomalies are real, and setting upper bounds for these discrepancies with the SM Higgs.
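As a toy illustration of that upper-bounding role (my own sketch, not something taken from the review), consider a simple Poisson counting experiment: when the observed count is consistent with the expected background, the analysis reports a one-sided upper limit on the signal rate rather than a discovery. The background level and observed count below are assumed purely for illustration.

```python
# Toy counting experiment: no excess is established, so report an upper bound
# on the signal rate instead of claiming a discovery.
from scipy.stats import poisson
from scipy.optimize import brentq

b = 3.0      # expected background count (illustrative assumption)
n_obs = 4    # observed count, consistent with background
cl = 0.95    # confidence level for the upper limit

def tail(s):
    # P(N <= n_obs | mean = b + s) - (1 - cl); its root is the upper limit on s
    return poisson.cdf(n_obs, b + s) - (1 - cl)

s_up = brentq(tail, 0.0, 50.0)
print(f"observed {n_obs} events over an expected background of {b}")
print(f"{int(cl * 100)}% CL upper limit on the signal: s < {s_up:.2f} events")
```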
I'll return to the second portion of Recht's post in another comment.
As usual, we agree on a lot and disagree on some of the finer points. For instance, I don't think that the replication crisis is rooted in misapplication of statistical tools. I think there is a deeper question here about what is real and what replication even means. But let's save that debate for another day!
I'm off to read your 2022 Higgs post.
I think your debate is fascinating.
Here are some random thoughts:
Popper’s metaphysical stance is based on falsification: while theories cannot be proven true, they can be subjected to severe tests aimed at falsifying them. Popper’s realism holds that objective truth exists, even though we can never be certain we’ve found it. This realism extends to the theoretical concepts needed for these theories, including latent constructs. Mayo builds on Popper by emphasizing probabilistic error control.
From a pragmatist perspective, especially Quine’s, constructs are real only if they are useful to our best science. This view is particularly relevant to latent constructs. If operationalized constructs like "affect regulation" fail to yield useful, predictive outcomes, their "reality" is unclear. Recht, whom I take to be a naturalistic Quinean, shares this view: he sees constructs like "affect regulation" or "ability to develop talent" as unclear and possibly unnecessary for scientific inquiry (beyond folk talk) if they don’t contribute to our best science. I agree it’s complicated, and I may be overstating the case, but I suspect he believes we should stay closer to what the experiment actually did, rather than invoking the big constructs in the theory we’re trying to severely test.
Recht favors a minimalist ontology: experiments, like growth mindset interventions, and tangible outcomes, like success in e-games, are real, while latent constructs like "affect regulation" or "ability to develop talent" are less real. In his view, science isn’t primarily about falsifying theories in the Popperian sense but about taking action: making predictions, observing outcomes, and adjusting based on practical results. Recht thinks the need for severe testing mostly arises when real-world applications are weak (and we go down the h-index route).
Recht believes that if we remove many of these shaky theoretical latent constructs, there’s less need for Popper’s framework. Testing a hypothesis is about doing something in the world (what he calls “action” in one of his books), not about “rigorous testing.” In his framework, you make a prediction and if it works, you continue (you don't stop just because you ran power tests and published the result); if it doesn’t work, you try something else. Ultimately, I think he believes that if the effect size isn’t clear and large (with "large" being context-dependent), it doesn’t really matter because we aren’t controlling anything meaningful in the world. So, it’s unclear that we need that test. If severe testing is done, it often serves peer review and your h-index. I genuinely feel bad saying this (it sounds slightly anti-science), but I think there’s some plausibility to this view, and many people outside academia would agree. Of course, one can respond, "But we always need theory! If you are atheoretical, which experiment will you run?" And I agree, we don’t run experiments at random. But I sense a John Dewey bent in Recht: knowledge is a tool for solving problems, and we overstate the need for severe testing. And the choice of experiments we run is based more on a Habermasian conversation than on scientific theory.
This reflects a tension between Popperian realism/reductionism/atomism (Mayo) and nominalism/pragmatism/holism (Recht).
It’s funny, but it even reminds me of Thesis 11 on Feuerbach by Marx: “Philosophers have hitherto only interpreted the world in various ways (severe testing); the point is to change it (action control).” I would disagree with Marx and say we need both, and, like Quine, I think science and philosophy are ultimately sort of the same connected enterprise.
The severe testing account, like error statistics in general, is agnostic to one's metaphysics, to whether one is a realist or the like. Neyman focused entirely on performance and empirically testable properties. Pointing to metaphysical problems or issues, as I see it, just obscures and buries the seminal issues that must be resolved by all who would use statistical method.
Ben:
I continue my response here. (Let's see if this works):
https://errorstatistics.com/2024/10/22/response-to-ben-rechts-post-what-is-statistics-purpose-on-my-neyman-seminar/comment-page-1/#comment-267231
Good post. In both the A/B testing and new drug testing cases, don't we need to go beyond significance to consider cost/benefit? For web site A/B testing, up-side "risk" is dominant; for drugs, down-side risk comes directly from the Hippocratic Oath. The p-value test would be the first coarse filter before computing up-side, down-side, and expected costs and benefits.
Yes, if we accept that these tests serve a regulatory purpose, then we should absolutely think about how to design them with desired ends in mind. We could do this by consensus, or we could decide to adopt a rational choice optimization framework. Any approach to participatory decision making would fit under my rubric.
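A hypothetical sketch of that two-stage scheme for the A/B case (the function, traffic figures, and costs are all illustrative assumptions, not anyone's actual pipeline): a one-sided two-proportion z-test serves as the coarse filter, and only a surviving variant gets a rough expected cost/benefit calculation.

```python
# Sketch of the two-stage scheme: significance filter first, cost/benefit second.
from scipy import stats

def decide(conv_a, n_a, conv_b, n_b, alpha=0.05,
           value_per_conversion=1.0, switch_cost=500.0, future_visitors=100_000):
    """Coarse filter: one-sided two-proportion z-test. Then expected net benefit."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = stats.norm.sf(z)                 # is B genuinely better than A?
    if p_value > alpha:
        return f"no genuine effect established (p = {p_value:.3f}); keep A"
    gain = (p_b - p_a) * future_visitors * value_per_conversion - switch_cost
    return f"p = {p_value:.3f}; expected net gain of switching ~ {gain:.0f}"

print(decide(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000))
```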
Harry Collins has shown (in Gravity’s Shadow, I believe) that the 5-sigma rule evolved in physics experiments over time to become less and less lenient. I need to find the exact quote, but I believe he traces it back to when it was 4-sigma and even 3-sigma. And in the context of gravitational waves, the choice of the cutoff appeared instrumental in avoiding “early” detection, that is, claiming GW detection by resonant bars before the big interferometers got funded.
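For readers who want the arithmetic behind those cutoffs, the sigma levels translate into one-sided tail areas of the standard normal; a quick sketch:

```python
# One-sided p-values corresponding to the n-sigma cutoffs mentioned above.
from scipy.stats import norm

for n_sigma in (3, 4, 5):
    print(f"{n_sigma}-sigma -> p ~ {norm.sf(n_sigma):.2e}")   # P(Z > n_sigma)
```

Moving from 3-sigma (p around 1.3e-3) to 5-sigma (p around 2.9e-7) tightens the nominal false-alarm probability by a factor of several thousand.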
Ben - A/B testing is a particularly nice example. Running an A/A test before the A/B test helps you set proper severe tests; the practice is used to debug the random allocation procedure. I commented on this in https://errorstatistics.com/2024/08/31/georgi-georgiev-guest-post-the-frequentist-vs-bayesian-split-in-online-experimentation-before-and-after-the-abandon-statistical-significance-call/
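To spell out the A/A idea, here is a minimal simulation sketch (my own illustration, assuming a simple conversion-rate experiment): both arms receive the identical treatment, so the test should reject at roughly its nominal rate; a markedly higher rejection rate flags a broken allocation or analysis pipeline.

```python
# A/A calibration check: with identical arms, rejections should occur at ~alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_trials, n_per_arm, base_rate = 0.05, 2000, 5000, 0.05

rejections = 0
for _ in range(n_trials):
    a = rng.binomial(1, base_rate, n_per_arm)   # arm A: a fixed conversion rate
    b = rng.binomial(1, base_rate, n_per_arm)   # arm "B": the very same rate
    _, p = stats.ttest_ind(a, b)
    rejections += p < alpha

print(f"A/A rejection rate: {rejections / n_trials:.3f} (nominal alpha = {alpha})")
```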