Everyone was confused by randomness in the 1920s, and no one was more confused than Ronald Fisher. Fisher wrote a series of papers establishing much of modern statistics. But his internal philosophy about what probability means shifts with every paper, revealing a deep confusion about epistemology and inference. For someone infamous for his staunch dogmatism, Fisher was philosophically all over the map in the 1920s. He contradicts himself in each subsequent paper (though, of course, he never admits it).

My most and least favorite Fisher paper is his 1922 magnum opus, “On the Mathematical Foundations of Theoretical Statistics.” Statisticians herald it as one of the most important papers in the field. It’s my least favorite because it defines the method of maximum likelihood, of which I’ve never been a fan and which has been a mathematical mess for a century. For statistics, this paper has done more harm than good. It’s my favorite because I love the free-wheeling way Fisher writes. It’s clear he’s making things up as he goes, trying to conjure rigor in a field that cannot be rigorous.

Fisher argues the role of statistics is data summarization. This had been its primary use: a way of tabulating bulk facts about the properties of the state so that those who ruled could make informed decisions. Fisher sought to make this tabulation of counts into rigorous mathematics. Let’s find out what Fisher thought by closely reading the opening paragraphs of Section 2.

“...the object of statistical methods is the reduction of data. A quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent the whole, or which, in other words, shall contain as much as possible, ideally the whole, of the relevant information contained in the original data.”

So far, so good. Now, how should you summarize data? Here’s where things get wild:

“This object is accomplished by constructing a hypothetical infinite population, of which the actual data are regarded as constituting a random sample. The law of distribution of this hypothetical population is specified by relatively few parameters, which are sufficient to describe it exhaustively in respect of all qualities under discussion. Any information given by the sample, which is of use in estimating the values of these parameters, is relevant information.”

All data must be assumed to be random. Not only are they random, but they are randomly sampled (whatever that may mean) from a “population.” This population is hypothetical (i.e., it does not exist) and is a relatively simple mathematical object. Sampling from the population is the same as sampling from a certain simple probability distribution with only a few parameters. The important differences between populations can be summarized by a few numbers.
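To make this concrete, here is a minimal sketch (my own illustration, not anything in Fisher’s paper) of the setup he describes: pretend a pile of numbers is an iid sample from a two-parameter Gaussian “infinite population,” and reduce the whole pile to two numbers.

```python
import random
import statistics

# Fisher's fiction, simulated: pretend 1,000 observations are an iid
# sample from a hypothetical Gaussian population with two parameters.
random.seed(0)
mu_true, sigma_true = 10.0, 2.0
data = [random.gauss(mu_true, sigma_true) for _ in range(1000)]

# The "reduction of data": 1,000 numbers become 2. Everything else
# in the sample is, on this view, irrelevant information.
summary = (statistics.mean(data), statistics.stdev(data))
print(summary)
```

The point of the sketch is the compression, not the particular distribution: whatever the data actually are, the analyst *postulates* a simple parametric population and keeps only the parameter estimates.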

This set of assumptions about data is patently absurd and never true. However, for Fisher, it doesn’t need to be true. The purpose of this hypothetical population is data summarization. It need only encapsulate the important features of the data before the analyst. Fisher, with his frustrating grandiloquence, is just saying “All models are wrong, but some are useful.”

“Since the number of independent facts supplied in the data is usually far greater than the number of facts sought, much of the information supplied by any actual sample is irrelevant.”

Indeed, most of the *information,* whatever that is, is irrelevant to the facts we seek. Of course, what’s relevant and irrelevant is in the eye of the beholder. Does that make Fisher a Bayesian subjective probabilist?

“It is the object of the statistical processes employed in the reduction of data to exclude this irrelevant information, and to isolate the whole of the relevant information contained in the data.”

The goal of a statistical algorithm is to remove all irrelevant information and find only the relevant information. What is relevant is clarified by creating a hypothetical, simple random model of the world and assuming all data is generated by it. The randomness flattens all uncertainty into stochastic variation around a small number of statistics. The statistician must model the world as a few simple facts corrupted by aberrations due solely to chance.

Now I offhandedly quipped that Fisher might be considered a Bayesian for his subjectivity in this section. But he’s clearly not being a frequentist in this paper. How would the modern statistician characterize his proposed procedure?

1. I have a bunch of observations in front of me.
2. I hypothesize a model for this data.
3. I use some math to estimate the parameters of this model.
4. These parameters serve as my summary of the data.
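Under an assumed Gaussian model, those steps read as a short program. A minimal sketch with made-up observations, using the standard closed-form Gaussian maximum likelihood estimates:

```python
import math

# 1. Observations in front of me (made-up numbers for illustration).
observations = [4.9, 5.3, 4.7, 5.1, 5.6, 4.4, 5.0, 5.2]

# 2. Hypothesize a model: iid draws from a Gaussian N(mu, sigma^2).
# 3. Estimate the parameters by maximum likelihood. For the Gaussian,
#    the MLE is closed form: the sample mean and the (divide-by-n,
#    biased) sample standard deviation.
n = len(observations)
mu_hat = sum(observations) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in observations) / n)

# 4. These two parameters are my summary of the eight data points.
print(mu_hat, sigma_hat)
```

Nothing in the procedure tests whether the Gaussian hypothesis is true; the model is chosen, not verified, which is what makes the exercise exploratory.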

This sounds like exploratory data analysis to me! We make some untestable assumptions about the world in order to tell a story about data. Fisher of 1922 is much closer to John Tukey than the Fisher of 1935 who wrote *The Design of Experiments.*

Fisher further expands upon his probabilist beliefs in the next paragraph.

“It should be noted that there is no falsehood in interpreting any set of independent measurements as a random sample from an infinite population; for any such set of numbers are a random sample from the totality of numbers produced by the same matrix of causal conditions: the hypothetical population which we are studying is an aspect of the totality of the effects of these conditions, of whatever nature they may be. The postulate of randomness thus resolves itself into the question, ‘Of what population is this a random sample?’ which must frequently be asked by every practical statistician.”

That first sentence is a doozy. So many clauses! What does he mean by independent here? Regardless, he’s laying his cards on the table and telling us that all data are a random sampling of something. This means that all of our experience is nothing more than the manifestation of random fluctuations of the universe. You might defend this position, but realize that you are making some very strong philosophical assertions. Natural randomness is a *postulate* for Fisher. All observations are random. Some, I suppose, are useful.

In the remaining fifty-odd pages, Fisher proceeds to write a bunch of formulae to derive the method of maximum likelihood. Let me include one more paragraph that still haunts statistics.

“Readers of the ensuing pages are invited to form their own opinion as to the possibility of the method of the maximum likelihood leading in any case to an insufficient statistic. For my own part I should gladly have withheld publication until a rigorously complete proof could have been formulated; but the number and variety of the new results which the method discloses press for publication, and at the same time I am not insensible of the advantage which accrues to Applied Mathematics from the co-operation of the Pure Mathematician, and this co-operation is not infrequently called forth by the very imperfections of writers on Applied Mathematics.”

Hilarious. Is maximum likelihood rigorous today? No! 100 years later, we still use the technique with little justification. It’s mostly harmless as it’s often just computing means or solving least-squares problems. And it’s often as good as anything else because data summarization is exploratory.
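A quick numerical sketch of that claim, on made-up data: for observations modeled as a constant plus Gaussian noise, grid-searching the likelihood lands on the sample mean, which is exactly the least-squares answer.

```python
import math

# Made-up observations, modeled as mu + Gaussian noise.
data = [2.0, 3.5, 1.5, 4.0, 3.0]

def neg_log_likelihood(mu, sigma=1.0):
    # Negative log of the Gaussian likelihood (up to an additive
    # constant that doesn't depend on mu).
    return sum(0.5 * ((x - mu) / sigma) ** 2 for x in data)

# Maximize the likelihood by brute-force grid search over mu...
grid = [i / 1000 for i in range(0, 5001)]
mu_mle = min(grid, key=neg_log_likelihood)

# ...and compare with the least-squares / sample-mean answer.
mu_mean = sum(data) / len(data)
print(mu_mle, mu_mean)
```

Minimizing the Gaussian negative log-likelihood in `mu` is literally minimizing a sum of squares, so the two answers coincide; the same identity is why MLE for linear models with Gaussian noise is ordinary least squares.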

Mathematicians would certainly add some forms of rigor later. For example, Doob would show the method could be considered *empirical risk minimization*. While this gives a rigorous justification for the method in special contexts, it does not rigorously justify the assumptions. Doob’s guarantees hold only if the data are actually generated from one of Fisher’s hypothetical probability distributions. But this is almost never true. The assumptions of statistics are metaphysical and can never be made rigorous. You can never *prove* that all observations are generated by having god randomly generate an iid sample from a probability distribution governed by a few parameters. The mathematical foundations of statistics have their issues. The philosophical foundations are untenable.
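The identification behind that rereading is a one-line rewrite: maximizing the likelihood of an iid sample is the same as minimizing an average loss, with the negative log-likelihood playing the role of the loss.

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \prod_{i=1}^{n} p_{\theta}(x_i)
             \;=\; \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n}
                   \bigl(-\log p_{\theta}(x_i)\bigr),
\qquad \ell(\theta; x) = -\log p_{\theta}(x).
```

The rewrite itself is just algebra; it is the *guarantees* layered on top that require the data to really be an iid sample from some $p_{\theta^{*}}$, which is exactly the metaphysical postulate at issue.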
