I have no confidence in Confidence Intervals
Once again, I try to make sense of why people like CIs.
This week I’m going to blog about statistical intervals. I’m already regretting it. But let me try out a new way to explain them to myself.
What if we get it over with and just start by assuming all random variables are normal? Everyone loves the normal distribution and invents all sorts of post hoc reasons to use it: the central limit theorem, the principle of maximum entropy, eugenics. But the real reason we love the normal distribution is that it’s by far the easiest to calculate with. Do its simple calculations illuminate the philosophy of statistics?
Let’s think about measurement. Suppose we want to measure some property that has a numerical value. We spend six months calibrating a measurement apparatus and decide our device has random errors. When we measure an object, the measurement process produces a random number with a normal distribution with mean M and standard deviation S. For example, the device could be an infrared thermometer, M a person’s temperature, and S the standard deviation of a reading’s error.
M changes from object to object, but assume that S is the same no matter what object we measure. In other words, we have convinced ourselves that our measurement error looks like a normally distributed random variable, and each measurement we take is not correlated with previous measurements. We invested six months of metrology in making sure this was the case.
Now we go into the lab and try to measure a new object. We believe the measurement will work the same way it always has. We’ll record a measurement X. The object has a hidden property M (the mean of the normal distribution), but measurement noise gives us only an imprecise reading of it. You measure a temperature of 98.7F. Is it really 98.6F? Can you be sure the person doesn’t have a fever? What can we infer about M from X?
Let’s try some statistical inference. When we measure an object, we know X will take a value between M-2S and M+2S 95% of the time. This is because 95% of the probability mass of the normal distribution is concentrated within two standard deviations of the mean. But we can turn this expression around. We also know M lies between X-2S and X+2S 95% of the time. Here X is whatever value we measure in a particular experiment. Hence, we can infer something about the location of M from the measurement of X. In statistics language, the range from X-2S to X+2S is a 95% confidence interval for M.
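If you like simulation better than algebra, here’s a minimal sketch of that claim (the particular M, S, and number of trials are made up for illustration): measure the same object over and over and count how often the interval from X-2S to X+2S traps M.

```python
import numpy as np

# Hypothetical true temperature and measurement noise, chosen purely for illustration.
M, S = 98.6, 0.3
rng = np.random.default_rng(0)

# One measurement per repeated experiment.
X = rng.normal(M, S, size=100_000)

# Does the interval [X - 2S, X + 2S] contain the true value M?
covered = (X - 2 * S <= M) & (M <= X + 2 * S)
print(covered.mean())  # roughly 0.95 (two standard deviations is actually ~95.4%)
```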
But what exactly are we inferring from a confidence interval? “I have done extensive calibration and testing of my measurement device. Thus, I infer that if I had repeated the measurement of this quantity in the same conditions hundreds of times, the true value of the quantity would lie between the measured value plus or minus two standard deviations 95% of the time.” What a rambling and weird assertion.
What is not valid to infer from this interval? We cannot infer that M is within two standard deviations of the measured value X with 95% probability. If X is larger than 2S, this doesn’t guarantee that M is greater than zero with 95% probability. If I want to make claims like these, I need some, um, prior on M. Woo boy, not today.
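To make that first non-inference concrete, here’s a tiny sketch (with a hypothetical M = 0 and S = 1): the procedure covers M 95% of the time unconditionally, but among the experiments where X happens to land above 2S, the realized intervals never contain M. So “M is within 2S of X with 95% probability” can’t be read as a statement about the data in hand.

```python
import numpy as np

# Hypothetical object with true value exactly zero and unit measurement noise.
M, S = 0.0, 1.0
rng = np.random.default_rng(1)
X = rng.normal(M, S, size=1_000_000)

covered = (X - 2 * S <= M) & (M <= X + 2 * S)
lucky = X > 2 * S  # the rare experiments where the measurement lands high

print(covered.mean())         # ~0.95: the unconditional coverage guarantee holds
print(lucky.mean())           # ~0.023: X > 2S does happen
print(covered[lucky].mean())  # 0.0: conditional on X > 2S, none of these intervals contain M
```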
I also can’t infer that if I ran the measurement a bunch of times, M would be within two standard deviations of the particular value of X I just measured. I can’t infer M is going to lie in this particular confidence interval. Again, I can only infer “if I repeated the experiment a bunch of times, 95% of the time, the true mean would lie inside the confidence interval computed from the data in each experiment.” I’m confused, everyone! Why do we try to force everyone to talk like this?
The caveated verbosity around confidence intervals is so weird. And it feels especially weird when we compare it to the standard conventions of physical measurement. If I knew my infrared thermometer was calibrated to 0.01 degrees, I would be sure that a person with a measured temperature of 98.7F wasn’t running a fever. If I used a scale and knew that the errors were normally distributed with a standard deviation of 1g, I would feel fine measuring a person’s body weight to the nearest kilogram. An error of 500g is effectively impossible if the errors are normal with a 1g standard deviation. On the other hand, I wouldn’t use this scale for measuring my morning coffee, because I’m a nerd and like to measure my beans to the gram. In measurement, random errors are a useful model for gauging precision. If we want higher precision, we need to build a better measurement device.
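The “effectively impossible” in the scale example is just a normal tail bound. A back-of-the-envelope check (only the 1g standard deviation comes from the paragraph above; the rest is a sketch):

```python
import math

S = 1.0                          # standard deviation of the scale's error, in grams
z = 500 / S                      # a 500g error is 500 standard deviations out
p = math.erfc(z / math.sqrt(2))  # probability of a normal error exceeding 500g in magnitude
print(p)                         # underflows to 0.0 in double precision: effectively never
```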
This blog is now longer than I wanted it to be, but so is every other treatise on statistical inference. But it’s a first step. Statistics should guide designing and interpreting measurement. Thinking about measurement with normal errors might be more illuminating than thinking about causation and metaphysics. I’ll pull this thread as I walk away.
Did you know that estimating measurement error was indeed the reason Gauss originally introduced the "Gaussian" distribution in the context of astronomy? He postulated three properties of measurement error and derived it mathematically: https://www.maa.org/sites/default/files/pdf/upload_library/22/Allendoerfer/stahl96.pdf
If you haven’t already, you should read Deborah Mayo’s ‘Error and the Growth of Experimental Knowledge’. I am completely convinced by her argument that ‘error statistics’, not frequency statements, is what Neyman and Pearson actually wanted. The point of a confidence interval is to capture the “I have done extensive calibration and testing of my measurement device” part, not the other part.