Like it’s 2004, I’m going to spend the next couple of posts responding to other people’s responses to my blog posts. The resurgence in blogging and the gradual abandonment of Twitter threading are a net good for academic discourse.
Today, I want to reply to Deborah Mayo, who wrote a detailed and gracious response to my blog about statistical tests. At the bottom, Mayo also links her slide deck from her Neyman seminar. I highly recommend taking a look, as it is quite thought-provoking and gives a tight summary of the first part of her book Statistical Inference as Severe Testing.
In her response, Mayo asks me several follow-up questions that point in many different directions. To keep my response-to-her-response concise, I’ll focus on a particular line of questions that she raises at the end of her post. It is here that I think we most agree.
The never-ending discourse about p-values too easily gets bogged down asking what statistical tests mean rather than what statistical tests do. My last post and this response both want to examine what they do. Mayo quotes a few famous statisticians’ opinions on the purpose of statistical hypothesis tests and p-values.
Yoav Benjamini: “In some sense it offers a first line of defense against being fooled by randomness, separating signal from noise, because the models it requires are simpler than any other statistical tool needs.”
Philip Stark: “Absent any objective assessment of the agreement between the data and competing theories, publication decisions may be even more subject to cronyism, “taste,” confirmation bias, etc.”
These quotes both address Benjamini and Stark’s beliefs about what statistical tests ought to do. They are both functional statements, and they share a flavor. For Benjamini and Stark, there is a material reality of signal that differs from the material reality of noise. The proper application of a statistical test gives an objective tool to distinguish one from the other despite our natural tendency to see what we want to see.
Now, it is 100% clear by a scan of any existing literature that statistical tests in science do not do this. Statistical tests do not and have not revealed truths about objective reality. I made a lot of physicists angry yesterday arguing this point. C’est la vie. Statistical tests most certainly do not constrain two scientists from completely disagreeing about how to interpret the same data.
If Stark and Benjamini are right, this in turn means that consumers of statistical tests, despite hundreds of years of statistician browbeating (my phrasing), have all been using them incorrectly. Mayo seems sympathetic to this conclusion. Indeed, Mayo includes this quote from Stark in her post:
Philip Stark: “Throwing away P-values because many practitioners don’t know how to use them is like banning scalpels because most people don’t know how to perform surgery. Those who would perform surgery should be trained in the proper use of scalpels, and those who would use statistics should be trained in the proper use of P-values.”
But what is the proper use of a p-value? If, after 100 years, we can’t point to anyone being able to use them properly, perhaps they don’t have this pristine and perfect proper use.
However, communities can come together and define proper use of p-values for their applications. As Benjamini puts it, p-values require minimal formal set-up to define. In almost all applications, they are pretty easy to compute. If you need to set some easily computable quality control standard based on statistical sampling, why not use Fisherian NHSTs or Neymanian confidence intervals? Null hypothesis tests are just rules. They are rules set by particular communities to advance particular agendas.
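To make the “easy to compute” point concrete, here is a minimal sketch of a p-value obtained from a permutation test, which needs nothing beyond the assumption that the two groups are exchangeable under the null. The data, variable names, and function name are illustrative inventions, not anything prescribed by Benjamini or Mayo.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Null hypothesis: the group labels are exchangeable, i.e., there is
    no systematic difference between the two samples.
    """
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                      # relabel the data at random
        diff = abs(pooled[: len(a)].mean() - pooled[len(a):].mean())
        exceed += diff >= observed
    # Add-one smoothing keeps the estimate away from exactly zero.
    return (exceed + 1) / (n_permutations + 1)

# Illustrative quality-control check: did today's batch drift from the
# baseline more than random relabeling would explain?
baseline = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0])
batch = np.array([10.4, 10.6, 10.3, 10.5, 10.2, 10.7])
print(permutation_p_value(baseline, batch))
```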
Look at A/B testing in the tech industry. Does anyone believe that A/B tests severely test the veracity of claims about software widgets? Not anyone I’ve met. But A/B tests are undeniably useful. They are a low bar set internally by companies to rate-limit the number of features being pushed into a production stack. They are an imperfect convention, but a reasonable tool for quality control. This isn’t too far off from FDA drug testing. Are drug trials perfect at discovering which drugs are useful? No, but they provide a convenient framework to set regulatory standards against pushing toxic pharmaceuticals onto the market.
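To illustrate the “low bar” framing, here is a sketch of what such an internal gate might look like: a two-proportion z-test whose only job is to decide whether a variant ships. The conversion numbers, the 5% threshold, and the function name are hypothetical, not any particular company’s procedure.

```python
from math import sqrt
from scipy.stats import norm

def ab_ship_gate(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test used as a shipping gate, not a truth detector.

    Returns True only if variant B's conversion rate beats control A at
    the agreed significance level alpha.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))        # two-sided p-value
    return p_b > p_a and p_value < alpha

# Illustrative numbers: 50,000 users per arm, a small observed lift.
print(ab_ship_gate(conv_a=5000, n_a=50_000, conv_b=5200, n_b=50_000))
```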
Neither of these example applications of statistical testing is about preventing someone from being fooled by randomness. Statistical tests are used in these applications because they give transparent, reasonably defined standards. Statistical power calculations standardize sample sizes to achieve agreed-upon thresholds of acceptance: parties agree in advance that 100, 1,000, or 10,000 samples suffice to proceed (a back-of-the-envelope version of this calculation is sketched below). They provide a reasonably efficient means of settling disputes between stakeholders.
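As a rough illustration of how such sample-size conventions get set, here is the textbook normal-approximation calculation for a two-sample comparison of means; the effect size and the 5%/80% thresholds are stand-ins for whatever numbers the parties negotiate.

```python
from math import ceil
from scipy.stats import norm

def samples_per_arm(effect_size, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sample comparison of means,
    using the standard normal approximation.

    effect_size is the standardized difference (Cohen's d) that the
    parties have agreed is worth detecting.
    """
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Illustrative convention: detect a "small" effect at 5% significance
# with 80% power.
print(samples_per_arm(effect_size=0.2))   # roughly 400 per arm
```

I think this interpretation of tests and their purpose is much closer to what Mayo herself says at the end of her post: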
Deborah Mayo: “The key function of statistical tests is to constrain the human tendency to selectively favor views they believe in.”
I would put it only slightly differently. Statistical tests constrain outcomes in participatory systems. Engineers want to push features to get promoted; data science teams insist on A/B tests to ensure these features don’t harm key metrics. Drug companies want to make a ton of money; clinical trials ensure drugs aren’t harmful and have a chance of being beneficial. Academics want to publish as many papers as possible to get their h-index to the moon; journals insist on some NHSTs to placate editors. The purpose of statistical tests is regulation.
Ben:
Thank you for continuing the discussion in such an interesting way. Here are some replies to what you say in the first portion (I'll come back to the rest later):
Recht: “Now, it is 100% clear by a scan of any existing literature that statistical tests in science do not [provide an objective tool to distinguish genuine from spurious effects]. Statistical tests do not and have not revealed truths about objective reality.”
Objective tools for distinguishing genuine from spurious effects is not the same as tools for revealing “truths about objective reality”—whose meaning is unclear.
Recht: “Statistical tests most certainly do not constrain two scientists from completely disagreeing about how to interpret the same data.”
Who says that distinguishing genuine from spurious effects precludes “two scientists from completely disagreeing about how to interpret the same data”? I don’t understand why Recht thinks that tools that control the probability of erroneous interpretations of data would preclude disagreement. Scientists must give reasons for their disagreement that respect the evidence. Failed replications sometimes result in the initial researchers blaming the replication, leading, in turn, to an examination of the allegation. A new replication may be carried out to avoid the criticism. That too is progress in a conjecture-and-refutation exercise.
Recht: “If Stark and Benjamini are right, this in turn means that consumers of statistical tests, despite hundreds of years of statistician browbeating (my phrasing), have all been using them incorrectly.”
Stark and Benjamini are right, and it in no way follows “that consumers of statistical tests, despite hundreds of years of statistician browbeating … have all been using them incorrectly”. The tests are tools for falsification. Inability to falsify (statistically) a null hypothesis is a way to block erroneously inferring evidence of a genuine effect. Such negative results are at the heart of the so-called “replication crisis”. When nominally significant results are due to multiple testing, data dredging, outcome switching, and the like, it is unsurprising that the effects disappear when independent groups seek to replicate them with more stringent protocols. The replication crisis, where there is one, is evidence of how tests are used to avoid being fooled by randomness.
I think it is very important to understand what the 100 years of controversy is all about--I'll come back to this.
Recht: “I made a lot of physicists angry yesterday arguing” against the use of p-values in the Higgs discovery.
The Higgs discovery is an excellent case study for examining the important role of statistical tests in science, as well as illuminating controversies (ever since Lindley accused physicists of “bad science”). In my 10-year review of the Higgs episode, I discuss the value of negative statistical results. https://errorstatistics.com/2022/07/04/10-years-after-the-july-4-statistical-discovery-of-the-the-higgs-the-value-of-negative-results/
From the 10-year review: It turned out that the promising bump or “resonance” (a great HEP term) disappeared as more data became available, drowning out the significant indications seen in April. Its reality was falsified. … While disappointing to physicists, this negative role of significance tests is crucial for denying BSM [Beyond Standard Model] anomalies are real, and setting upper bounds for these discrepancies with the SM Higgs.
I'll return to the second portion of Recht's post in another comment.
Good post. In both the A/B testing and new drug testing cases, don’t we need to go beyond significance to consider cost/benefit? For website A/B testing, upside “risk” is dominant; for drugs, downside risk comes directly from the Hippocratic Oath. The p-value test would be the first coarse filter before computing upside, downside, and expected costs and benefits.