While the 1920s brought an explosion of diverse thinking about probability, the 1930s herded the flock into order. This order helped organize the logistics, planning, and strategy of the Allied war effort, which brought forth our information age in its aftermath. While I don’t want to dwell on the 30s, as I have a few other itchy topics I plan to scratch on this blog in the coming weeks, I want to close this latest series with a few thoughts on the central driver of the probabilistic clean-up in the 1930s: Kolmogorov’s axioms.
Kolmogorov says that if you want probability, you only need three things. You need a set of potential events whose probabilities you could compute (a sigma-algebra). Each of these events is itself a set. You need a reference class (the sample space), of which every potential event is a subset. And you need a probability function that takes any event as input and returns a probability as output.
Kolmogorov proposes a few rules for how these three things fit together. The complement of any event must also be an event. That is, if you can assign a probability to an event, you must also be able to assign a probability to that event not happening. Similarly, if you gather a collection of events, you must be able to assign a probability to at least one of them happening (the union of the events) and to all of them happening (the intersection of the events).
For the probability function, the probability of any event must be greater than or equal to zero. The probability of the entire sample space must be equal to one. This just means you are certain that at least one of the outcomes you are considering will happen. Both of these seem like reasonable rules.
The final ingredient is the only one that’s a bit weird. If you have a collection of mutually exclusive events, the probability that any one of these events happens must be equal to the sum of their individual probabilities. It takes a bit of thinking to see what “mutually exclusive” means here. In two independent coin tosses, the events {coin 1 is a head} and {coin 2 is a head} are not mutually exclusive, since both can happen at once. Events here have to account for both coins. The events {both coins are heads} and {both coins are tails} are mutually exclusive.
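To make these rules concrete, here is a minimal Python sketch of the two-coin example, a toy illustration of my own rather than anything of Kolmogorov’s: the sample space is the four outcomes, events are subsets of it, and the probability function is literally a look-up table with nonnegative values summing to one.

```python
from itertools import product

# Sample space: the four outcomes of two fair, independent coin tosses.
sample_space = set(product("HT", repeat=2))  # {('H','H'), ('H','T'), ('T','H'), ('T','T')}

# The probability function is just a look-up table with nonnegative values summing to one.
prob_of_outcome = {outcome: 0.25 for outcome in sample_space}

def prob(event):
    """Probability of an event, i.e., a subset of the sample space."""
    return sum(prob_of_outcome[outcome] for outcome in event)

# Events are sets of outcomes.
coin1_heads = {o for o in sample_space if o[0] == "H"}
coin2_heads = {o for o in sample_space if o[1] == "H"}
both_heads = {("H", "H")}
both_tails = {("T", "T")}

# Nonnegativity and total probability one.
assert all(prob(e) >= 0 for e in (coin1_heads, coin2_heads, both_heads, both_tails))
assert prob(sample_space) == 1.0

# Not mutually exclusive: the events overlap, so their probabilities do not simply add.
assert coin1_heads & coin2_heads == both_heads
assert prob(coin1_heads | coin2_heads) == 0.75  # not prob(coin1_heads) + prob(coin2_heads) == 1.0

# Mutually exclusive: disjoint events, so the probability of the union is the sum.
assert both_heads & both_tails == set()
assert prob(both_heads | both_tails) == prob(both_heads) + prob(both_tails)
```

The disjoint pair adds exactly; the overlapping pair does not. That is all the additivity rule asks for.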
Once you add this rule about “subadditivity,” you are done. In retrospect, these things seem obvious. People had been dabbling with all sorts of probability distributions for hundreds of years. How did they not realize that they all fit under this tidy roof? Part of this is that all probabilistic thinking requires an appeal to the infinite, and math with infinities is only sensible in retrospect. We all got confused by infinity plus one as elementary school students. And anyone in freshman calc is initially baffled by the concept of a limit.
That all of our notions of probability can be mapped onto this high-level abstraction is remarkable. Kolmogorov gives us a unified mathematical modeling language that can describe the uncertainty of Brownian motion, quantum mechanics, games of chance, and agricultural experiments. And for all of these, we just need a look-up table with nonnegative values that sum to one.
But this abstraction is also a bit worrying. Just because we have found a language to describe many dissimilar phenomena does not mean all of these phenomena are fundamentally the same. Let me give a mathematical example to illustrate how abstractions are useful but give us only a partial understanding of the phenomena they model.
One of the most fundamental and useful concepts in modern mathematics is the vector. Colloquially, a vector describes a direction. Air traffic control assigns vectors to airplanes to route them through the sky. Mathematically, a vector is an object that can be arbitrarily scaled (made larger or smaller) and added to any other vector in the same space. But these simple rules mean that lots of things are vectors. Any real number is a vector. The velocity of an object moving in space is a vector. The weights of a tuned neural network are a vector. Everything we do in “AI” is just mindless pushing around of vectors (i.e., linear algebra). Quadratic functions are vectors. Infinite sequences of real numbers whose sums of squares are finite are vectors.
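Here is a quick sketch of that flexibility, using nothing beyond scaling and adding (I use numpy for the array example; everything else is plain Python):

```python
import numpy as np

# A real number is a vector: scale it and add it to another real number.
x, y = 3.0, -1.5
print(2.0 * x + y)  # 4.5

# The weights of a neural network form a vector in R^n: scaling two weight
# vectors by 1/2 and adding them gives another vector of the same shape.
w1 = np.array([0.1, -0.2, 0.7])
w2 = np.array([0.4, 0.0, -0.3])
print(0.5 * w1 + 0.5 * w2)

# Quadratic functions are vectors too: scale and add them pointwise.
def scale(a, f):
    return lambda t: a * f(t)

def add(f, g):
    return lambda t: f(t) + g(t)

f = lambda t: t ** 2           # a quadratic
g = lambda t: 2 * t ** 2 + 1   # another quadratic
h = add(scale(3.0, f), g)      # 3*f + g is again a quadratic
print(h(2.0))                  # 3*4 + (2*4 + 1) = 21.0
```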
Obviously, I could go on. Vectors are the most used abstraction in applied mathematics. But there is a difference between finite-dimensional vectors, infinite sequences, and functions. We can find relations between these concepts and understand principles that apply to all three. Yet we also keep the concepts separate to understand the specifics that arise as we shift between the different contexts.
For whatever reason, we all want probability to be one concept. While “vector” means nothing to someone who hasn’t taken college math, everyone has a feel for “probability” and “chance.” There is a grounding in the word probability that makes us think that all probabilities are the same. But probability is far more often a homonym than a synonym.
Even though they all follow the same basic axioms, we have dozens of different kinds of probability. A few examples (the first three of which are made concrete in the short sketch after this list) are:
Any finite list of nonnegative numbers that sums to one
Any infinite list of nonnegative numbers whose sum converges to one
Any nonnegative function on the unit interval whose integral is equal to one
The certainty of a logical proposition being true
The doubt that something in the past happened
The likelihood that future events will occur
A person’s internal beliefs about the world
The relative frequencies of occurrences that currently exist in the world
The relative frequencies of hypothetical infinite populations
The preponderance of evidence in a civil case
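And here is the promised sketch of the first three items: a finite list, a long truncation of an infinite geometric list, and the density f(t) = 2t on the unit interval, each nonnegative and summing (or integrating) to one. The particular numbers are placeholders I made up for illustration.

```python
# 1. A finite list of nonnegative numbers that sums to one.
finite = [0.2, 0.5, 0.3]
assert all(p >= 0 for p in finite)
assert abs(sum(finite) - 1.0) < 1e-12

# 2. An infinite list whose sum converges to one: 1/2, 1/4, 1/8, ...
#    We can only check a long truncation of the geometric series here.
truncated = [0.5 ** k for k in range(1, 60)]
assert all(p >= 0 for p in truncated)
assert abs(sum(truncated) - 1.0) < 1e-12

# 3. A nonnegative function on the unit interval whose integral is one,
#    e.g. the density f(t) = 2t, checked with a crude trapezoid rule.
n = 100_000
dt = 1.0 / n
ts = [k * dt for k in range(n + 1)]
vals = [2.0 * t for t in ts]
assert min(vals) >= 0
integral = sum((vals[k] + vals[k + 1]) / 2.0 for k in range(n)) * dt
assert abs(integral - 1.0) < 1e-9
```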
These are all sort of the same, but they’re also quite different in important ways. In the next post, I’ll describe the two places where I think these varied conceptions all seem to align.
The analogy with the generality of vector spaces is actually quite good. It underscores the fact that, just as with vectors it is the common axiomatic framework, realized in a variety of concrete models of vector spaces, that does the work, so too the relational structure of Kolmogorov’s axioms plays the main role in applications, rather than any sort of unified notion of uncertainty or chance. There are many models of Kolmogorov’s axioms, and each comes with its own semantics. More or less what I wrote here: https://realizable.substack.com/p/probabilities-coherence-correspondence.
The last Kolmogorov axiom is usually called countable additivity, not subadditivity. (What's usually called subadditivity would give you an "outer measure," not a measure.) Actually, the restriction to *countable* collections has always struck me as a little bizarre and (a priori) hard to justify. So far as I know, that restriction is only made to get a useful collection of limit theorems for random variables and the expected value: limits of random variables are still random variables, etc. You wouldn't have those if you only had finite additivity.
In the function theory and geometric measure theory that the Kolmogorov axioms are a special case of, countable additivity is at least legitimized by all the natural examples (Lebesgue measure, Haar measure, Hausdorff measure, etc., or measures with density relative to these), which tend to be the actual objects of study. But I've never seen a similarly convincing case in the context of real-life probability (whatever "real-life probability" might be).