I like to mentally find-and-replace phrases like "we cannot reject the null hypothesis" with "man, we can't even reject the null hypothesis", which emphasizes both that rejecting the null is a pretty low bar if you've got a decent sample size and that NHST results are mostly just shorthands.
Unfortunately I have no such trick for making phrases like "the difference was highly significant" palatable, which is of course the real problem.
Right, but there's something crazy in crud land: rejecting a null hypothesis isn't even a bar. It's just a coin flip.
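A quick way to see the coin-flip point, as a hypothetical simulation rather than anything from the post: assume "crud" means every pairing carries some small nonzero correlation with a random sign. Then with a large enough sample the test rejects essentially every time, and the direction of the "effect" is 50/50.

```python
# Hypothetical crud simulation: every variable pair has a tiny true
# correlation (here |rho| = 0.05, sign chosen at random). With n large,
# the null is rejected almost always, but the sign is a coin flip.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials = 10_000, 500
rejections, positive_signs = 0, 0

for _ in range(trials):
    rho = rng.choice([-0.05, 0.05])  # assumed crud-level correlation
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    r, p = stats.pearsonr(x, y)
    rejections += p < 0.05
    positive_signs += r > 0

print(f"rejected the null in {rejections / trials:.0%} of trials")  # ~100%
print(f"positive effect in {positive_signs / trials:.0%} of trials")  # ~50%
```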
I read this post a while ago, but it came back to mind when I was trying to push ChatGPT's o1 model to derive a relationship between the z-score in null hypothesis testing and a correlation measure. It succeeded, pointing to a tradition in statistical behavioral research of relating the point-biserial correlation to the t statistic. You asked for pointers to where the formula you derived appears in print, but I struggled to find it. Rosenthal and Rosnow 2008 (https://scholarshare.temple.edu/handle/20.500.12613/79) show in their equation 2.2 a formula relating the t statistic and the point-biserial correlation that seems to be well known in their community. This formula should approximate yours for large sample sizes.
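For reference, my addition here, the usual textbook form of that relation (with $\nu = n - 2$ degrees of freedom in the two-group case) is

$$ r_{pb} = \frac{t}{\sqrt{t^2 + \nu}} $$

For large $n$, where $t \approx z$ and $t^2 \ll \nu$, this reduces to roughly $r_{pb} \approx z/\sqrt{n}$, which is presumably the large-sample connection to the z-score formula in the post.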
Are you sure your proposed definition of crud is rigorous? Randomly pairing treatments, theories, and outcomes seems problematic, because some proportion of theories generated at random will be true, especially if they are vague enough.
"As Gerd Gigerenzer likes to remind us, null hypothesis significance testing (NHST) is just a mindless ritual."
All societies (including scientific communities) have mindless rituals, and they are very hard to do away with, especially when the alternative is that each person should think for themselves. The Bayesian solution for statistical reasoning, which begins with personal prior beliefs and updates them in the light of the evidence, is far more satisfactory than NHST, but doesn't command the agreement of someone with different priors.
Amazing article (as always), Professor! What does this imply about how to design an ideal study to prove causation? From my understanding, RCTs are regarded as the gold standard because they can effectively remove confounding variables from consideration, but aside from making their sample sizes larger, how can we ensure that their results are accurate and not a result of the "crud distribution"?
In clinical research, most of the interesting conversation (e.g. with the FDA) is around the design of the experiment:
- whether the endpoints are clinically meaningful
- the minimum effect size meaningful to patients
- making sure the study isn't overpowered or underpowered (a quick power sketch follows this list)
- providing evidence that the relationship to be tested has some theoretical/empirical backing
Way more interesting than the statistical testing...
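To make the over/underpowering point concrete, here's a minimal sketch of my own, not from the comment above, using statsmodels to size a two-arm trial around a minimum clinically meaningful effect. The effect sizes, alpha, and power targets are illustrative assumptions, not regulatory guidance.

```python
# Minimal power sketch for a two-arm study. All numbers below are
# hypothetical: d = 0.3 stands in for the smallest standardized
# difference assumed to matter to patients.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per arm to detect the minimum meaningful effect.
n_per_arm = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"~{n_per_arm:.0f} patients per arm")  # roughly 175 per arm

# Overpowering in action: a huge trial "detects" a trivial d = 0.05
# that no one would consider clinically meaningful.
n_huge = analysis.solve_power(effect_size=0.05, alpha=0.05, power=0.8)
print(f"~{n_huge:.0f} per arm to detect a trivial d = 0.05")
```

The design conversation is exactly about pinning down that minimum effect size before the statistics ever run; the significance test at the end is the easy part.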