Saying the same thing in different ways
p-values, confidence intervals, and scientific rhetoric
A lot of the commentary I’ve read on confidence intervals asserts they are “better” than dreaded p-values. I agree that reporting only a p-value and not an effect size is worse than reporting a confidence interval. But once everything is normally distributed, if you tell me an estimated effect size and a p-value, I can tell you the 95% confidence interval. What makes them different?
p-values have a confusing general definition, but people almost always mean something very specific. More often than not, a p-value is the probability of observing a measurement at least as extreme as what I measured conditioned on the true effect being zero. Let’s just run with this definition today as it covers at least 95% of the p-values out there.
Remember our carefully calibrated measurement device from yesterday? I’m going to bring it out again for a demo. My measurement is a realization of a Gaussian random variable with mean M and standard deviation S. In my latest lab experiment, I measure an object and see the value X. Can I infer that the true M for this object is not equal to zero?
I can try to argue counterfactually. Suppose M was equal to zero. What is the probability of seeing something of size |X| or larger? I divide |X| by S and call this the z-score. I can look this up in a mathematical table to find the probability I’m looking for.
This is the p-value. The smaller the p-value, the more sure I should be that M is not zero, I guess. Oddly, this table stops at z=3.49. But if you’re a sicko like me, you can compute the p-value directly with the formula p = 2Φ(-z), where Φ is the cumulative distribution function of the standard normal.
For normal random variables, the p-value is just a strictly decreasing function of the z-score. Events outside the “2-sigma” window (2 standard deviations from the mean) will happen once in every 20 measurements. Events outside the 3-sigma window, roughly once every 370. Outside 4-sigma, roughly once every 16,000.
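If you don’t feel like squinting at the table, here’s a minimal sketch in Python (my code, not anything from the post) that computes this two-sided p-value with scipy and checks the tail frequencies above:

```python
# A quick sketch (not from the post): the two-sided p-value for a z-score is p = 2 * Phi(-z).
from scipy.stats import norm

def two_sided_p(z):
    """Probability of a measurement at least as extreme as z sigma, two-sided."""
    return 2 * norm.cdf(-abs(z))

for z in [2, 3, 4]:
    p = two_sided_p(z)
    print(f"{z}-sigma: p = {p:.2e}, roughly 1 in {round(1 / p)}")
# Prints roughly 1 in 22, 1 in 370, and 1 in 16,000.
```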
Normal random variables are magical. We now see a simple formula to compute a p-value if you know the 95% confidence interval. And I can compute the confidence interval from a p-value and effect-size estimate by first computing the z-score from the p-value and then backing out the standard deviation of the mean. Everything, unsurprisingly, comes down to mean and variance.
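To make the equivalence concrete, here’s a rough sketch of that round trip, assuming everything is normal; the function names are mine, purely for illustration:

```python
# Convert between (effect estimate, two-sided p-value) and a 95% confidence
# interval under a normal model. Illustrative only; names are not standard.
from scipy.stats import norm

def ci_from_p(estimate, p, level=0.95):
    """Back out the standard error from the p-value, then build the interval."""
    z = norm.isf(p / 2)                    # z-score implied by the p-value
    se = abs(estimate) / z                 # standard error of the estimate
    half_width = norm.isf((1 - level) / 2) * se
    return estimate - half_width, estimate + half_width

def p_from_ci(lower, upper, level=0.95):
    """Recover the two-sided p-value from a confidence interval."""
    estimate = (lower + upper) / 2
    se = (upper - lower) / (2 * norm.isf((1 - level) / 2))
    return 2 * norm.cdf(-abs(estimate) / se)

lo, hi = ci_from_p(estimate=1.2, p=0.03)
print(lo, hi, p_from_ci(lo, hi))           # round trip recovers p = 0.03
```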
Of course, not every random variable follows a normal distribution. But CIs and p-values can be really weird when your variables are not normal! Let me use a confusing example from Teddy Seidenfeld’s 1979 book Philosophical Problems of Statistical Inference. Suppose you know that your data generating process is uniformly distributed on an interval [0,B]. You want to estimate a 95% confidence interval for the mean M=B/2 from a single observation X. The two-sided confidence interval is then [X/2, 10X]. (This interval contains B/2 exactly when B/20 ≤ X ≤ B, which happens with probability 0.95.) That’s shockingly imprecise. And repeated experiments will have highly variable confidence intervals. Suppose M is 50. If you measure 80, the confidence interval is [40,800]. If you measure 20, the confidence interval is [10,200]. If you measure 1, the confidence interval is [1/2,10]. All three of these events are equally likely under this model. Two of these intervals don’t even overlap!
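If you don’t believe the coverage claim, here’s a quick simulation sketch (my code; I take B = 100 so that M = 50, as in the example):

```python
# A quick simulation (mine): check that the interval [X/2, 10X] covers the mean
# M = B/2 about 95% of the time when X is uniform on [0, B].
import numpy as np

rng = np.random.default_rng(0)
B = 100.0                          # so the true mean is M = 50
X = rng.uniform(0, B, size=100_000)
lower, upper = X / 2, 10 * X
coverage = np.mean((lower <= B / 2) & (B / 2 <= upper))
print(coverage)                    # about 0.95
print(np.median(upper - lower))    # the typical interval is enormous
```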
Maybe this suggests that the confidence interval is fundamentally a Gaussian notion. The 95% should be a dead giveaway. It is the coverage of 2 standard deviations of the normal. And though the world is not normal, most applied statistics work explicitly or implicitly assumes things are normal. Most software computes p-values using normal approximations. The world isn’t normal, but we spend a lot of time in science pretending it is.
Final thought for today: in what way are confidence intervals better than p-values? I am somewhat sympathetic to the idea that the confidence interval conveys more about measurement uncertainty than the p-value. Certainly, simple binary notions of statistical significance aren’t helpful. But I think there’s a better explanation. Communicating equivalent quantities with different framings conveys different opinions about what’s important. Half empty and half full convey different notions. In medicine, it’s popular to quote “number needed to treat” instead of “absolute risk reduction” even though these numbers are inverses of each other (I’ll say more about these two concepts in a future blog). There is a framing element involved in scientific reporting, and scientific reporting cannot be divorced from its marketing. This marketing unfortunately always trumps whatever we want “scientific rigor” to mean.
It seems useful that confidence intervals can tell us *something* about how our estimates may be imprecise with asymmetric distributions. Take something Poisson, or Gamma, or what have you. There are lots of long-tailed distributions out there, and reporting some kind of confidence interval points at the direction of the imprecision (and where you can say less versus more) in a way that a p-value and an effect size alone don’t, no?
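As a rough illustration of that point (my example, not from the post): the exact interval for a Poisson mean is visibly lopsided around the observed count, which is information a single number can’t carry.

```python
# Illustrative helper (not from the post): an exact (Garwood-style) 95% interval
# for a Poisson mean, given a count k. The asymmetry around k shows in which
# direction the estimate is most uncertain.
from scipy.stats import chi2

def poisson_ci(k, level=0.95):
    """Exact two-sided confidence interval for a Poisson mean from a count k."""
    alpha = 1 - level
    lower = 0.5 * chi2.ppf(alpha / 2, 2 * k) if k > 0 else 0.0
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (k + 1))
    return lower, upper

print(poisson_ci(4))   # roughly (1.1, 10.2): far more room above 4 than below
```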