From coin tosses to option pricing
Building prediction intervals using what we know about confidence intervals
Having developed conditions for probabilistically linking the future and the past, let’s build some prediction intervals. A prediction interval is a set where we believe a future event lies with high probability. For starters, we have to be clear on the sorts of things we can build intervals for.
Most of my examples yesterday were “either-or” events that either happened or didn’t. Free throws in basketball, adverse events in clinical trials, or Republicans winning the Senate either happen or they don’t. The only possible intervals for such events are {will happen}, {won’t happen}, and {will happen, won’t happen}. More often than not, the only interval you can return with high confidence is that last one, and that’s useless.
Prediction intervals are only potentially helpful for random events that can take lots of values. Some examples that quickly come to my mind include a person’s body temperature, the price of a stock, the amount of energy demand on a power plant, or the number of years a patient lives after an experimental treatment. These all take a range of values, and a prediction interval can perhaps narrow down their potentialities.
Ironically, we make prediction intervals for such ranged randomness by turning multivalued processes into either-or processes. Let’s say we’re trying to predict the value of some attribute V. I have a bunch of events from the past that I think are predictive of the future. I can ask: what is the probability that the next V will be at most 10? I look at my past data and count how frequently past values were less than or equal to 10. “Less than or equal to 10” is an either-or event. I can now build a confidence interval for this probability.
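In code, that counting step is trivial. Here is a minimal Python sketch; the numbers are made up for illustration, and the names past_values, n, and k are just my labels:

past_values = [8.3, 12.1, 9.7, 10.0, 15.2, 7.4]   # hypothetical past observations of V
n = len(past_values)                               # total number of observations
k = sum(1 for v in past_values if v <= 10)         # how often the either-or event "V <= 10" occurred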
I described confidence intervals for probabilities of either-or events in an earlier blog, but let me repeat the recipe here. Suppose that I have observed n values of V, which were at most 10 exactly k times. Then I can build a confidence interval for the probability that the next V is at most 10 as:
from math import sqrt

q_guess = k / n                                        # fraction of past values at most 10
width = sqrt(q_guess * (1 - q_guess) / n)              # normal-approximation standard error
C = [q_guess - 4.42 * width, q_guess + 4.42 * width]   # 4.42 is the z-score for 99.999% two-sided coverage
Let’s pause again to reflect on what this algorithm is promising. The verbiage is always weird for me: 99.999% of the time, running this computation on the samples we collected produces an interval that contains the true probability of some future occurrence. There are two probabilistic statements in that guarantee! The first is about the past: it tells us the likelihood that our sample was non-representative. The second is about probabilities in the future. I apologize, but we have to think about multiple probabilities now. I’m trying my best with the hand I’ve been dealt.
Let us press on. Except for rare pathological samples, the probability that V will be at most 10 is at least the lower end of the interval I computed. What’s special about 10? I could compute confidence intervals for any level. I could examine the levels 1, 2, 3.14159, etc. For every potential value of the process, I can estimate the probability that V will be at most that value. But can all of these confidence intervals be simultaneously valid?
The answer is, quite surprisingly, yes. The random variable V has a cumulative distribution function
F(t) := the probability that V is less than or equal to t
For every t, F tells you the probability V will be at most t. I can also build an empirical distribution function from the data I’ve gathered so far:

Fe(t) := the fraction of past observations where V was at most t

Fe(t) is simply the number of times I observed V to be at most t, divided by the total number of times I observed V. For every t, Fe(t) provides an estimate of F(t). A 1956 theorem by Dvoretzky, Kiefer, and Wolfowitz says that I can simultaneously estimate confidence intervals for every value of F(t). With probability 1-δ, the samples I use to create the empirical distribution function give me an approximation to the true cumulative distribution function satisfying

|Fe(t) - F(t)| <= sqrt(log(2/delta)/(2*n))   for every t.
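To make that concrete, here is a minimal Python sketch of the empirical distribution function and the DKW band around it; the helper name, the value of delta, and the reuse of past_values from the earlier sketch are my own choices:

from math import log, sqrt

def empirical_cdf(samples, t):
    # fraction of observed values that are at most t
    return sum(1 for v in samples if v <= t) / len(samples)

delta = 1e-5                             # my choice of failure probability
n = len(past_values)
band = sqrt(log(2 / delta) / (2 * n))    # DKW half-width, valid for every t at once
# except with probability delta, F(t) lies within band of empirical_cdf(past_values, t)
# for every t simultaneously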
The DKW Theorem is remarkable. The theorem holds no matter what the distribution of V is. Statisticians like to say that the procedure is distribution-free. I hate this term because we’re not really distribution-free. We have to assume we’re observing a stream of iid events, and that’s a huge assumption about the distribution. But this theorem applies equally to power demand, stock prices, and body temperature, provided the observed process is iid.
The DKW theorem now gives me a procedure to generate a prediction interval for V.
choose a coverage level alpha and a failure probability delta
fudge_factor = sqrt(log(1/delta)/(2*n_samples))
find the smallest t such that Fe(t) >= 1 - alpha + fudge_factor
return C = [-infinity, t]
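For concreteness, here is a minimal Python sketch of this recipe; the synthetic data, alpha, and delta are my own illustrative choices, not part of the procedure. Only the lower half of the DKW band is needed here, which is why log(1/delta) rather than log(2/delta) suffices in the fudge factor:

import random
from math import log, sqrt

def prediction_interval_upper(samples, alpha, delta):
    # smallest observed t whose empirical frequency clears 1 - alpha + fudge_factor
    n = len(samples)
    fudge = sqrt(log(1 / delta) / (2 * n))
    target = 1 - alpha + fudge
    for i, t in enumerate(sorted(samples), start=1):
        if i / n >= target:          # i/n <= Fe(t), so Fe(t) also clears the target
            return t
    return float("inf")              # too few samples to certify the requested coverage

# example usage on synthetic iid data
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
t = prediction_interval_upper(samples, alpha=0.05, delta=1e-5)
C = (float("-inf"), t)               # with high probability, the next draw lands in C at least 95% of the time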
What can I say about this procedure? Again, I need two probabilities: “With probability at least 1-delta, this procedure returns an interval such that the probability the next V is in that interval is at least 1-alpha.” Though that statement is clunky, this procedure is valid for any sequence of iid random variables. Prediction intervals are quite easy to build! On the other hand, though it seemed like we were doing something radical, prediction intervals turned out to be nothing more than confidence intervals on probabilities. Still, perhaps we can do something useful with them. Tomorrow, I’ll describe how advocates propose to use them in practical situations.