More Like One Out of A Million...

We should be unsatisfied with unverifiable measurement.

Jan 26, 2024

Programming Note: I’m traveling next week and expect I won’t be able to blog. But this series on intervals will resume the week after that.

One of the weirder deficiencies of the confidence interval is you always maintain a chance that you’re wrong. 95% of the time (or even 99.999% of the time), you find an interval containing the parameter you want. But how can you know when the algorithm fails?

On Wednesday, I worked out the example of stochastic gradient. SGD is a Monte Carlo algorithm. It had a failure rate, but you could check if the returned solution was correct. If it was incorrect, you could rerun the algorithm. If the algorithm fails with probability 5%, two runs fail with probability 0.25%. The odds of not finding a solution after ten runs is ten trillion to 1.

If I see a 95% confidence interval in someone’s research paper, what can I do to verify its correctness? Nothing! At best, I can try to replicate their experiment myself. But even then, I’ll get my own 95% confidence interval. It will be different from their confidence interval. And unless I make some other assumptions about reality, we’ll just be stuck with two intervals.

I stand by what I wrote yesterday and reiterate it’s worth advocating for wider intervals. 5-9 intervals are much less likely to be wrong and are only 2.25 times wider than the conventional intervals ubiquitous today. But even the 5-9 intervals might be wrong. And we can never know. I’m telling you, there’s a chance.

To beat a dead horse, take one of the simplest examples from probability, Polya’s Urn. The mysterious urn contains three balls, colored either blue or red. Your goal is to estimate how many red balls are in the urn. But you can only take out one ball at a time and then must put the ball back. Magically, the probability of drawing any of the three balls is the same for every attempt. They don’t get warm or gunked up with the oils on your hands.

If you ever pull a red ball, you are immediately certain there is at least one red ball in the urn. In this case, 0 can’t be in your confidence interval. If you ever pull a blue ball, you are immediately certain that not all balls are red. Consequently, 3 can’t be in your confidence interval. I want to overemphasize here that these measurements completely rule out two of the options. A random measurement can completely rule out certain possibilities.

But you can never drive the probability of the other two options to 1. No matter how many times you draw, there will be some probability of “1 red ball” and some probability of “2 red balls.” You can get a tiny confidence interval but can never verify correctness.

Eventually, maybe after 100 or so draws, the probability of 1 or 2 is so small that any sane person would agree it’s zero. At some point, we’ll have less faith in our ability to independently randomly sample from the urn than in the probability we’ve made an error in estimating the composition of the contents. But it seems to me that the only way to verify the answer with certainty is to break the rules and empty the urn on the table.

This situation stinks. We are stuck with a flimsy algorithm that is demonstrably hard to interpret, impossible to verify, and always potentially wrong.

When you tell someone you have a confidence interval, most people want a procedure where we can guarantee that the measured quantity is between the upper and lower bounds. It’s fine if we have no idea what value it attains inside the interval between, but we should be able to guarantee that it is not outside of this.

We have plenty of measurement devices that work this way. I can guarantee you that I am between 5 and 6 feet tall. The probability that my height is outside of this interval is zero. This is too cute an example, but plenty of measurement devices have similar clarity. Can we change our expectations about statistical measurements used in data science? What is the analogy of dumping Polya’s Urn out on the table? I don’t have a satisfying answer, but maybe we can discuss some options in the comments.

When I get back from my trip, I want to turn away from the confidence interval and examine the other intervals people like. They’re randomized algorithms too, but their probabilistic guarantees are different. Stay tuned to see if those are more satisfying.

Discussion about this post

Miguel

Jan 27, 2024Edited

Let me have a try at "dumping Polya’s Urn out on the table".

Let $r_i$ and $b_i$ be Boolean variables, for i=1,2,3, and consider the following Boolean constraints:

$\bigwedge_{i=1}^{3} \neg r_i \lor \neg_b_i$

$\land$

$\bigwedge_{i=1}^{3} b_i \lor r_i$

The assignments to the Boolean variables (functions mapping each of $r_i$ and $b_i$ to either 0 or 1) that satisfy the above model possible configurations of the balls in the urn (e.g. which specific ball has what color).

Now, we can lay out each of the possible 2^6 assignments in a table. We can do this (and handle much bigger systems) with state-of-the-art SAT solvers and model counting algorithms. We can count for instance, how many assignments satisfy a given property, e.g. r_1=1 for "ball 1 is true" in time linear in the representation of this table. Over the past 20 years or so, great advances have been made that this query can be done on a "compressed" table (e.g. d-DNNF or SDD) in time linear in the size of this compressed representation (which is worst-case exponential on the treewidth of the formula primal graph, where vertices are variables and we have an edge whenever two variables appear on the same constraint). That is as good as oracles for counting stuff efficiently can go in the real world (i.e. run by a Turing Machine-like device).

Now, big problem. How do weight those assignments we have painstakingly computed and stored in a compact manner? Since we know nothing, the "standard" approach is to assume that all assignments that satisfy the constraints are equally likely. I can hear folks sucking their teeth already :)

For the scenario where we pick a ball and turns to be red, the information acquired can be encoded as the following formula

$r_1 \lor r_2 \lor r_3$

What kind of algorithm can exhaustively enumerate all the possible consequences of this new piece of information? One simple thing we can do is the following:

1/ Count the assignments for the original formula (linear on the size of our compressed table)

2/ Compute all the assignments for our original set of constraints conjoined with the new piece of information given by the formula above, store them in our compressed "table" representation.

3/ Count the assignments for the formula accounting for the new information.

4/ If the number of assignments for the new formula are less than those for the original formula, we conclude that "something has been learnt". Actually determining exactly *what* has been learnt is hard!

The reasoning for observing a blue ball is analogous for the above. It is obvious we reach a fixed point after we have obtained the two different observations. Which as you write in the post, it is not very interesting.

The above can be reformulated into the theory of linear arithmetic, with similarly frustrating results (we obtain trivial lower bounds on the numbers of blue and red balls).

Having set a rigorous framework for counting (because I have never been convinced by this thing of going from English to random variables and probability distributions right away) I think the strategy for "defeating" Polya urn could be like this:

1. Guess a satisfying assignment for our constraints (this can be actually hard as in NP-hard for other systems than Polya's urn!).

2. Extract N balls recording their color. Compute the probability of that sequence being consistent with the guessed assignment (all assignments assumed to be equiprobable). If probability becomes 0, go to 1.

3. Store result and go to 1. If no new assignment available, continue.

4. Sort assignments according to probability, and pick the best one.

Claim: as N -> \infty (ha!) the higher is the probability the top-most assignment is the true one :-)

There is an obvious problem with the above is in Step 2. I am not sure how to actually do that. We would want something like https://www.cs.ucdavis.edu/~green/papers/pods07.pdf to tag assignments to our constraints with "likelihood votes" and determine (via a suitable query) when what has been observed is impossible under the assumed assignment (the specific color of each ball).

Expand full comment

No posts

arg min

More Like One Out of A Million...

We should be unsatisfied with unverifiable measurement.

Discussion about this post