"Why does she put up with assholes like Ronald Fisher?" I laughed out loud; thank you for brightening my day.
It seems to be an almost ineradicable article of faith that you can compensate for lack of causal understanding by introducing huge amounts of intentional randomness and doing a bunch of combinatorics.
Alas, it's the only way for the theorist to quantify out of ignorance.
I've always been puzzled by the fact that even for coin flipping, the p-value of any string of 200 flips is always zero. It's only when we flatten things to look at the number of heads that we get to do "inference."
You need an evidentiary ordering to get a p-value!
https://stats.stackexchange.com/a/561866/443411
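A minimal numerical sketch of both points, in Python (the specific head counts, like 115, are just illustrative choices, not anything from the post): every exact sequence of 200 flips has probability 2^-200, and only after coarse-graining to the number of heads, which supplies an ordering of outcomes by extremeness, does a p-value even make sense.

```python
from math import comb

n = 200

# Probability of any one *specific* sequence of 200 fair-coin flips:
# numerically indistinguishable from zero, no matter what the sequence is.
p_exact_sequence = 0.5 ** n
print(p_exact_sequence)            # ~6.2e-61

# To get a p-value we need an ordering of outcomes by extremeness.
# Coarse-grain to the number of heads and use the two-sided binomial tail
# P(|H - 100| >= |k - 100|) under the fair-coin null.
def two_sided_p(k, n=200):
    dev = abs(k - n / 2)
    return sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= dev) / 2 ** n

print(two_sided_p(115))            # ~0.04: "significant" at the 5% level
print(two_sided_p(100))            # 1.0: the least surprising head count
```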
Well, it’s interesting to point out that if we go to physics, this coarse graining of statistics is what gives thermodynamics its modeling power, and textbooks openly discuss such ideas, though maybe not as “inference.”
This is a deep and important topic. Some people feel it is “smart” to ultimately deny cause and effect, replacing it with statistical correlation only. This extends to a universe that is entirely driven by randomness and infinite combinations over immeasurable time.
The issue with cause and effect is that it seems metaphysical unless you come up with an operational definition, that is, a sequence of actual measurements we can make to ascertain that A causes B rather than being merely correlated. We cannot rely on our own intuitive judgment for this (see this paper for a discussion of some issues https://www.psych.ucla.edu/wp-content/uploads/Revisiting-Hume.Ichien-Cheng.2021.pdf), and we cannot formalize in advance the sequence of operations needed for finding a mechanism, as that would amount to solving science. Pearl (in Chapter 2 of Causality) gives us some operational definitions of A->B based on (absence of) statistical dependence, boiling down to the IC and IC* algorithms. This is still far from ideal, as anyone who has tried to use constraint-based causal discovery on actual empirical data can attest.
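To make the constraint-based idea concrete, here is a toy sketch of my own (not Pearl's code, and the linear-residual conditional independence test is a simplification of what IC/PC-style methods actually do): simulate a collider X -> Z <- Y and recover its orientation purely from (conditional) independence tests, with no appeal to mechanism.

```python
import numpy as np
from scipy import stats

# Toy example: X -> Z <- Y (a collider), with Gaussian noise.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + 0.5 * rng.normal(size=n)

def indep(a, b, alpha=0.01):
    # marginal independence test via Pearson correlation
    return stats.pearsonr(a, b)[1] > alpha

def indep_given(a, b, c, alpha=0.01):
    # crude conditional independence test: residualize a and b on c linearly,
    # then test the residual correlation
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return stats.pearsonr(ra, rb)[1] > alpha

print(indep(x, y))                  # True (with this seed): no X - Y edge
print(indep(x, z), indep(y, z))     # False, False: edges X - Z and Y - Z
print(indep_given(x, y, z))         # False: conditioning on Z couples X and Y,
                                    # the signature that orients X -> Z <- Y
```

On real empirical data, those independence tests are exactly where things break down, which is the point above.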
Quite a thought-provoking piece for me. You said that “What’s so frustrating is how the combinatorial mindset refuses to grapple with actual cause and effect,” but what would you propose as the alternative? For biomedical science, people sometimes learn enough about the mechanism before going to clinical trials, but the “cause and effect” learned from lab experiments still may not be the full picture when we administer these treatments in the real world?
This is one of those interesting cases where the alternative is pretty clear. It's the house rules of journals like the NEJM that force everything into dichotomous boxes of case study or trial. We don't need more observational studies; we need more case reports.
With regard to randomized trials, they are fine as a regulatory design for drugs (see https://arxiv.org/abs/2501.03457), but science, medicine, and pharmaceutical regulation are not synonymous.
Ben, you write "Might we have learned more, and the trial subjects been better off with a careful study of 10 children than an aggregate look at 640?" It seems to me that with this post (and the previous one) you are making an implicit argument for some type of sequential analysis. This seems reasonable to me, and our friends who work on E-values would (I assume) be happy to have another convert.
At the same time, using 640 study participants---and being willing to cut the study off and move control patients to treatment if the treatment is effective---helps avoid some issues that would come up with a purely sequential scenario, because if we were to do everything sequentially, we'd have to wait around for the first batch of individuals to finish. So there's (presumably) some type of tradeoff between the time it takes to understand the effects (e.g., of eating peanuts as a baby) and the number of participants.
I'd like to hear what you (or others) would advocate to replace RCTs here.
- RCTs are fine and dandy as regulatory approval mechanisms. No need to replace them there. https://arxiv.org/abs/2501.03457
- The Peanut RCT as a reaction to moral panic is a fascinating case study of how not all knowledge is synthesized from randomized trials.
- Sequential testing has the exact same problem as nonsequential testing. People only believe what they are ready to believe.
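For readers unfamiliar with the e-values mentioned a few comments up, here is a minimal sketch of the idea (my own toy example with made-up success rates, not a proposal for an actual trial design): accumulate a likelihood-ratio "bet" against the null one participant at a time and stop once it exceeds 1/alpha.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05
p_null, p_alt, p_true = 0.5, 0.7, 0.7   # p_true is of course unknown in practice

e_process = 1.0
for t in range(1, 2001):
    success = rng.random() < p_true      # observe one participant at a time
    # multiply by the likelihood ratio of a fixed alternative vs. the null;
    # by Ville's inequality, this running product exceeds 1/alpha with
    # probability at most alpha when the null is actually true
    e_process *= (p_alt if success else 1 - p_alt) / (p_null if success else 1 - p_null)
    if e_process >= 1 / alpha:
        print(f"stopped after {t} participants, e-value = {e_process:.1f}")
        break
else:
    print("never crossed the threshold")
```

Whether anyone is persuaded by the number it spits out is, as the reply above says, a separate matter.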
When I have high school stats students, I know that particular formulas and the like are likely to wash away, so I try to gently impart two organizing principles to carry into the world:
1) Statistics is about having a *principled* relationship with the unknown: the future, the hidden, the noisy, the inaccessible. Having principles is important. It is better than not having principles.
2) It is in no way a substitute for knowing what is actually going on.
Several thoughts!
"I have no doubt she’d have labeled all 16 cups correctly," I have some doubt of that! Partly because I don't really see what the mechanism of determining the history of well-mixed milk tea would be, partly because getting a yes-or-no question right is NOT a 5-sigma effect. I side with FIsher in that I'd find 16 correct answers more convincing than 8. The problem here is not with the tea lady experiment, it's that most real-life medical experiments are not tea-lady experiments. If you gave an Alzheimer's patient a pill and they recovered their memory, that's not getting a yes-or-no question right, that's an outcome which absent intervention is so rare that as far as I know it has never actually happened.
"That ibuprofen relieves headaches?" This, on the other hand, feels much more tealadylike! Headaches go away on their own after some time, whether you take an Advil or not. I would say the only reason I have any confidence that ibuprofen relieves headaches is because there are clinical trials demonstrating this -- at least, I presume there are! And I also presume there's research giving us at least some insight into the mechanism of action! I presume these things because I am a trusting soul! But I would not say my own personal experience is that ibuprofen has an effect so noticeable, reliable, large, or swift that it "obviously works."
How much more convinced are you by 16 rather than 8? I'd see her do 4 and then ask how she can tell. Every British person I've asked thinks this question is ridiculous and knows that they can tell the difference. There is no sigma. For an American example, I can tell if you overcooked my burger and don't need an RCT.
I'll concede that headache-ibuprofen is a bad example because it requires subjective scoring. Fever reduction is better, and the results there are >5-sigma.
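For concreteness, here is the back-of-the-envelope guessing arithmetic behind the 8-versus-16-cups exchange above, under the simplifying assumption of pure guessing (Fisher's design tells the taster exactly half the cups are milk-first; the independent yes/no reading comes from the comment above):

```python
from math import comb
from scipy.stats import norm

# Chance of labeling every cup correctly by pure guessing, under two readings:
# Fisher's design (pick which half of the cups were milk-first) versus
# treating each cup as an independent yes/no call.
for n_cups in (8, 16):
    p_fisher = 1 / comb(n_cups, n_cups // 2)   # 1/70 and 1/12870
    p_yes_no = 0.5 ** n_cups                   # 2^-8 and 2^-16
    print(n_cups, p_fisher, p_yes_no)

# Converting those tail probabilities to one-sided "sigmas":
for p in (1 / 70, 1 / 12870, 2 ** -16):
    print(p, round(norm.isf(p), 2))   # ~2.19, ~3.78, ~4.17; 5 sigma is ~2.9e-7
```

So even 16 correct cups falls well short of 5 sigma on any reading, consistent with the point made a few comments up.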
This reminds me of how Fisher went to his grave convinced that you could not establish a causal relationship between smoking and lung cancer without RCTs. Good thing we didn’t take his side on that one!
I'm waiting for a discussion of Bem's experiments in precognition... much more interesting than milk and tea.
Yeah, I definitely need to write about the origins of the 20th-century "replication crisis" again. On revisiting it, I've soured on most of that work.