Yesterday, I reviewed the concept of Monte Carlo Algorithms, randomized algorithms that are incorrect with some probability. A Monte Carlo Algorithm aims to compute some attribute of a probability distribution. It applies deterministic computation to samples from the distribution, returning the correct attribute a high percentage of the time.
Statistical intervals are all outputs of Monte Carlo Algorithms. Let’s start with the most famous statistical interval, the confidence interval. You know that your distribution D is governed by a parameter A. Pseudocode for a general Monte Carlo Algorithm is
Sample S from D
Do deterministic computation on S
Return an interval C(S)
The guarantee is that, with probability p, this algorithm gives you a C(S) that contains A. For whatever reason, p is always equal to 0.95. That’s why we call them 95% confidence intervals.
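To make that pattern concrete before the examples, here’s a minimal Python sketch of the pseudocode above. The names sample_from_D and compute_interval are hypothetical placeholders, not anything from a real library; you supply the sampler and the deterministic interval rule for your particular problem.

def monte_carlo_interval(sample_from_D, compute_interval):
    S = sample_from_D()          # all of the randomness lives in this one call
    return compute_interval(S)   # everything after the sample is deterministic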
Let’s work out some examples. Suppose you have a coin that lands on heads with probability q and tails with probability 1-q. Let’s describe a standard confidence interval for q. You flip the coin n times. This means your distribution D is the binomial distribution. You observe k heads in your experiment. The input to the deterministic algorithm is k. That is, S=k.
The algorithm then might work like this:
from math import sqrt

q_guess = k/n                                     # empirical fraction of heads
width = sqrt(q_guess*(1-q_guess)/n)               # standard error of the estimate
C = [q_guess - 1.96*width, q_guess + 1.96*width]  # ±1.96 covers 95% of a standard normal
This computation of C is a few math formulas and is deterministic. If I ran this procedure of flipping coins and computing intervals over and over, then 95% of the time the true value of q would lie inside the computed interval C(S).
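If you don’t believe that coverage claim, it’s easy to check by simulation. Here’s a sketch; the true q of 0.3, the 100 flips per experiment, and the 100,000 repetitions are just values I picked for illustration.

import random
from math import sqrt

def interval(k, n, z=1.96):
    q_guess = k / n
    width = sqrt(q_guess * (1 - q_guess) / n)
    return q_guess - z * width, q_guess + z * width

q_true, n, trials = 0.3, 100, 100_000   # illustrative values
hits = 0
for _ in range(trials):
    k = sum(random.random() < q_true for _ in range(n))  # simulate n coin flips
    lo, hi = interval(k, n)
    hits += (lo <= q_true <= hi)
print(hits / trials)   # roughly 0.95

The coverage isn’t exactly 95% because the interval is built from a normal approximation, but it comes out close.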
Algorithmically-minded people are likely fine with this presentation. But confidence intervals are seldom presented this way in social science or medicine or sports analytics. And this leads to some persistent misinterpretation of what the intervals mean.
Let me be even more concrete so I can explain why the concept is so pernicious. Suppose I flipped the coin 32 times and saw eight heads. I run my computer algorithm and get a confidence interval [0.1,0.4]. What does this interval tell me about the true value of the probability of seeing heads?
All I know is that the probability that my coin flipping was aberrational and randomness ruined my interval is 5%.
But in an excellent study, Rink Hoekstra and colleagues show that there are widespread misinterpretations of what the confidence interval tells you. All of the following, for example, are false:
1. The probability that the coin is biased is at least 95%.
2. The probability that the true q equals 0.5 is smaller than 5%.
3. The “null hypothesis” that the true q equals 0.5 is likely to be incorrect.
4. There is a 95% probability that the true q lies between 0.1 and 0.4.
5. We can be 95% confident that the true q lies between 0.1 and 0.4.
6. If we were to repeat the experiment over and over, then 95% of the time the true q falls between 0.1 and 0.4.
This last one is the annoying one. We are guaranteed that if we repeat the experiment over and over again, then 95% of the time the parameter q will fall in the interval C(S) that we compute in that experiment. There is no guarantee that the interval will be [0.1,0.4] in any of these experiments. We can’t say that the C(S) computed in any particular experiment contains the true parameter. Maddening.
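To see the distinction, here’s a quick simulation sketch. I’m assuming a true q of 0.25 purely for illustration: each repetition of the 32-flip experiment produces a different interval, about 95% of them contain the true q, but nothing singles out [0.1,0.4] or guarantees anything about any one of them.

import random
from math import sqrt

random.seed(0)
q_true, n = 0.25, 32   # assumed true q, 32 flips as in the example
for trial in range(5):
    k = sum(random.random() < q_true for _ in range(n))
    q_guess = k / n
    width = sqrt(q_guess * (1 - q_guess) / n)
    print(round(q_guess - 1.96 * width, 2), round(q_guess + 1.96 * width, 2))
# Five runs print five different intervals; run it longer and roughly 95% of
# them will contain q_true, but no individual interval carries its own guarantee.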
The only thing we can guarantee about confidence intervals is that if the probabilistic sampling model is correct, then it’s not likely that random chance messed up the estimation procedure. No, I take it back: We can’t even say this! Because of the bizarre stickiness of academic tradition, we’ve decided that all confidence intervals should be 95% confidence intervals. That means the CIs are wrong one out of every 20 times. 1 in 20 is not a rare event.
This idea always gets pooh-poohed, but I strongly believe we should replace 95% confidence intervals with “5-9 confidence intervals.” The notion of “five nines” comes from reliability engineering in computer science and networking. It’s a cutesy way of saying the probability of a service being available is 99.999%. Think about a 99.999% confidence interval. In this case, the chance that your algorithm returned an incorrect answer is one in one hundred thousand. I’d personally feel more confident about such a confidence interval.
And it’s not even that much of a stretch to make such intervals. Here’s some code to compute 5-9 intervals for coin flips:
from math import sqrt

q_guess = k/n                                     # empirical fraction of heads
width = sqrt(q_guess*(1-q_guess)/n)               # standard error of the estimate
C = [q_guess - 4.42*width, q_guess + 4.42*width]  # ±4.42 covers 99.999% of a standard normal
All I did was substitute “4.42” where I used to have “1.96.” The intervals get about two and a quarter times wider. In our coin example, 8 heads from 32 flips yields a 5-9 confidence interval of [0,0.6]. It tells you that you had better flip more coins to be sure the coin is biased.
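Where does 4.42 come from? It’s just the two-sided normal quantile for 99.999% coverage, which the Python standard library can compute. Here’s a quick check of both the multiplier and the 8-out-of-32 interval; the clipping at zero is my addition, since the raw lower endpoint comes out slightly negative and a probability can’t be negative.

from math import sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(1 - 1e-5 / 2)   # two-sided 99.999% quantile, about 4.417
k, n = 8, 32
q_guess = k / n
width = sqrt(q_guess * (1 - q_guess) / n)
lo, hi = max(0.0, q_guess - z * width), q_guess + z * width
print(round(z, 2), round(lo, 2), round(hi, 2))   # 4.42 0.0 0.59, i.e., roughly [0,0.6]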
For an example where the wider interval isn’t that big a deal, let me briefly return to sports analytics. In the 2023 football season, there were about 1200 extra point attempts and these were successful 95% of the time. If we believed these events were random, independent, and identically distributed (you know I don’t believe this), then the 95% confidence interval is [93%,97%]. The 5-9 confidence interval is [92%,98%]. It’s wider, but not that much wider. We’d still be confident that an extra point has a reasonable chance of being converted.1
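For what it’s worth, the same three-line recipe reproduces those football numbers from the rounded figures above (about 1200 attempts at a 95% success rate), assuming the endpoints get rounded outward to whole percentage points.

from math import sqrt, floor, ceil

q_guess, n = 0.95, 1200                      # rounded figures from the text above
width = sqrt(q_guess * (1 - q_guess) / n)
for z in (1.96, 4.42):
    lo, hi = q_guess - z * width, q_guess + z * width
    print(floor(100 * lo), ceil(100 * hi))   # rounded outward, in percent
# prints 93 97 for the 95% interval and 92 98 for the 5-9 interval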
I’ve seen a lot of arguments against requiring wider intervals, and they all seem to argue that people would find ways to hack no matter what the threshold is. But I don’t buy this argument at all. Once the chance that the interval computation was incorrect is 1 in 100,000, we don’t have to argue about statistics in our papers. We can move on to the other myriad sources of bias that could have led the experimenters astray.
Tomorrow, I want to dwell on more frustrations with confidence intervals. If there’s a chance the interval is wrong, how can we check? Most of the algorithms I discussed yesterday had a mechanism for you to go back and verify that the returned answer was correct. If not, you could run the Monte Carlo Algorithm again or try a different algorithm. But can you ever verify a confidence interval?
I know some of you sickos want the calculation for 2-point conversions. In this case there were 70 successes out of 127 attempts in 2023. The 95% confidence interval is [46%,64%]. The 5-9 confidence interval is [35%,75%]. A much riskier bet than the PAT.
I know Hoekstra says it’s wrong, but #5 is fairly standard shorthand for “The interval [0.1, 0.4] was produced using a procedure such that, if we were to repeat the procedure over and over, then 95% of the time the confidence intervals produced would contain the true value of q.” It’s in most introductory stats books, presumably as a convenient way to point toward the correct interpretation.
But Tampa would have been kicking the extra point in an indoor stadium; surely the probability of a successful kick is higher indoors? 😜🤣