I’m traveling this week, so blogging will be slower than usual until October 4.
A few readers have bristled, insisting that “good old-fashioned machine learning” and its conventional wisdom are somehow different from deep learning. I don’t think that claim could be farther from the truth.
Let’s review the list from yesterday. Before 2015, I believed the following truths about machine learning:
1. Good prediction balances bias and variance.
2. You should not perfectly fit your training data, as some in-sample errors can reduce out-of-sample error.
3. High-capacity models don’t generalize.
4. Optimizing to high precision harms generalization.
5. Nonconvex optimization is hard in machine learning.
Certainly, I must still believe these for linear models, right? We can throw out #5, because most linear models are fit using convex methods like least-squares or SVMs. What about the others?
Well, I don’t think any of these are universal truths either. To be clear, we can invent models where any of these are true, and maybe we can find pathological data sets where one is true, but, in general, linear models are no different from neural networks.
There is no bias-variance tradeoff
Let’s first get this bias-variance business out of the way. I will be blunt: the bias-variance tradeoff is a useless tautology and not a tradeoff at all. You take some prediction function F that you fit to some data. Let’s say that the best function you could have created if you had infinite data but the same code was G. And let’s say the best function you could have created if you had infinite data and infinite computation was H. Then
error[F] = (error[F]-error[G]) + (error[G]-error[H]) + error[H]
error[F]-error[G] is called the variance. error[G]-error[H] is called the bias. QED.
That’s it, friends. That is the “bias-variance tradeoff.” There is nothing profound about it. Do I have to isolate these terms when I do machine learning? Of course not. I will never know what that “H” function is, and it doesn’t matter.
I could make the bias-variance tradeoff look deep and mathematical by using expected values and properties of squares (like what you’ll see in statistics books or on Wikipedia). But let us not over-mathematicize things. The bias-variance tradeoff is neither complicated nor fancy nor particularly useful. It’s not even a tradeoff. It’s just a sum. A tradeoff would imply some fundamental conservation law where bias was necessarily high when variance was low. But bias can be zero, variance non-zero, and prediction error small. There is nothing that prevents this! And which would you prefer? A model with bias 0.1 and variance 0.1 or a model with bias 0.01 and variance 0.15? The community has chosen the second answer time and time again.
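If you want to see those three terms as actual numbers, here is a minimal sketch of my own (numpy’s polyfit on synthetic data; none of these choices come from the lecture). F is a polynomial fit on a small sample, G is the same degree fit on a huge sample standing in for “infinite data, same code,” and H is the Bayes predictor, which we only know because we generated the data ourselves. The point is just that “variance,” “bias,” and the irreducible error are nothing more than differences of test errors.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = 0.1

def sample(n):
    """Draw n points from y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + noise * rng.normal(size=n)

def bayes(x):
    # H: the best predictor given infinite data and infinite computation.
    return np.sin(2 * np.pi * x)

x_test, y_test = sample(100_000)

def test_error(predict):
    return np.mean((predict(x_test) - y_test) ** 2)

for degree in (3, 10):
    x_small, y_small = sample(30)    # the data you actually have -> F
    x_big, y_big = sample(200_000)   # a stand-in for "infinite data, same code" -> G
    F = np.poly1d(np.polyfit(x_small, y_small, degree))
    G = np.poly1d(np.polyfit(x_big, y_big, degree))
    eF, eG, eH = test_error(F), test_error(G), test_error(bayes)
    # error[F] = (error[F] - error[G]) + (error[G] - error[H]) + error[H]
    print(f"degree {degree}: variance={eF - eG:.4f}, bias={eG - eH:.4f}, "
          f"irreducible={eH:.4f}, total={eF:.4f}")
```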
High-capacity linear models can interpolate too
OK, so the bias-variance business always gets me going, but what about the other rules? First, I need only point to decades of research on boosting, which has shown that very large models that interpolate training data generalize well. I mentioned Peter Bartlett’s NeurIPS tutorial in 1998, where he presented graphs from his “Boosting the Margin” paper highlighting this phenomenon.
And the original margin bounds we discussed in the second week of class show that we’ve been violating rules 2-4 since the beginning of machine learning. Perceptrons are linear rules that interpolate the data. These rules generalize as long as there is margin, and the function classes can be of arbitrarily high capacity. Moreover, optimizing them to arbitrary precision using least-squares classification or SVMs can often provide even better empirical performance.
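If you want to poke at this yourself, here is a hedged little experiment of my own (scikit-learn’s digits data and a kernel SVM; nothing here is from the course materials). A kernel SVM is a linear classifier in an enormous implicit feature space, and pushing the regularization parameter C way up, so the classifier fits the training set exactly, does not wreck its held-out accuracy.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A kernel SVM is a linear classifier over a huge (implicit) feature expansion.
# Large C means almost no slack: the fit is pushed to separate the training set.
clf = SVC(kernel="rbf", C=1e4, gamma="scale").fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))  # typically 1.0: it interpolates
print("test accuracy:", clf.score(X_test, y_test))     # and still predicts well held out
```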
In machine learning, when all we care about is prediction error, there aren’t any hard and fast rules that make predictors work. The only things that seem consistent are
cross-validation and the hold-out method preserve internal validity
minimizing loss functions on training data tends to provide low test error
with more data, you’ll likely get more validity and lower test error.
This list is becoming a mantra on this blog.
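As a concrete reminder of what the first two items look like in practice, here is a minimal sketch; the tooling (scikit-learn) and the data set (breast cancer) are my choices for illustration, not anything canonical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Minimize a loss on the training data; judge modeling choices with cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("cross-validated accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

# The hold-out set gets touched exactly once, after all decisions are final.
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_holdout, y_holdout))
```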
Is all data-driven prediction machine learning?
Let me highlight where things start to get murkier. What happens when your predictions stop being cut and dried? People often ask what to do when there is no margin. Clearly, we can’t interpolate in these cases.
I always push back by asking what it would mean for there to be no margin. It would mean you can’t devise any feature representation that perfectly distinguishes the two classes in your data set. One common reason: many data sets contain repeated examples with different labels, and no feature representation can separate two identical inputs.
Such repetitions happen, for example, in the famous UCI Adult data set, extracted from the US Census Bureau’s 1994 CPS survey. Each example is coarse demographic information about a person, and the classification goal is to predict whether or not that person makes over 50,000 dollars a year. Unsurprisingly, two people with the same demographic information might have different incomes. Over 7,000 of the 32,000 standard training examples in Adult are duplicates, many with contradictory labels.
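If you want to verify this yourself, here is a rough sketch with pandas. It assumes you have the standard adult.data training file from the UCI repository on disk, uses the column names from the dataset’s documentation, and treats fnlwgt (a survey sampling weight) as a non-feature; the exact counts you get depend on that last choice.

```python
import pandas as pd

cols = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
        "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
        "hours-per-week", "native-country", "income"]
df = pd.read_csv("adult.data", names=cols, skipinitialspace=True)

# Treat everything except the label and the sampling weight as the feature vector.
features = [c for c in cols if c not in ("income", "fnlwgt")]

repeats = df[df.duplicated(subset=features, keep=False)]
print("rows whose features repeat elsewhere:", len(repeats))

# Among the repeated feature vectors, how many show up with both labels?
labels_per_group = repeats.groupby(features)["income"].nunique()
print("repeated feature vectors with contradictory labels:", int((labels_per_group > 1).sum()))
```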
You can surely imagine many other scenarios where similar duplication would happen. In healthcare, for instance, what we measure about a patient almost never determines their prognosis perfectly. Medicine would be much easier if it did! Instead of perfect classification, we settle for coarse estimates of “risk.” We say people with certain demographic features (age, gender, family history, behavioral factors) are associated with certain frequencies of outcomes.
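To make that concrete with the same toy data (again, my illustration, not anything from the course): the crudest possible “risk score” is just the observed frequency of the outcome within each coarse group.

```python
import pandas as pd

# Reload adult.data by column position to keep this self-contained:
# column 3 is education, column 9 is sex, column 14 is the income label.
df = pd.read_csv("adult.data", header=None, skipinitialspace=True)
df = df.rename(columns={3: "education", 9: "sex", 14: "income"})
df["high_income"] = df["income"].str.contains(">50K")

# The "risk score" for a group is the observed rate of the outcome in that group.
rates = df.groupby(["education", "sex"])["high_income"].mean()
print(rates.sort_values(ascending=False).head(10))
```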
When we build such risk scores, are we still doing machine learning? This is a far cry from large language models, but it’s still prediction from data. If we call all prediction from data “machine learning” then maybe this is why our theory is stuck providing only mundane generalities.
What distinguishes language models from risk scores is their goal. And as we move forward, I want to highlight the different reasons data is collected in the first place. We clearly want to use that data to do something. But in machine learning, we sometimes just get caught up in competing to see who can predict best. Why does it matter which decision tree algorithm gets the lowest error on UCI Adult? What are you going to do with that income predictor? The course will soon turn to understanding the intended decisions behind data collection and their consequences.
Before we get to this, I want to understand one more thing about machine learning sociology. A defining practice in our field is to completely forget why we collected data sets and turn them into fossilized benchmarks divorced from their initial goals. What do we learn from mining these benchmarks repeatedly for decades? This will be the topic of the next lecture when we return from our midterm break next Thursday.
I don't think contradictory labels are a problem for the interpolation framing. I always viewed interpolation as meaning perfectly fitting the training data (maybe it needs a different word). If you have three identical inputs and one label disagrees, just output 2/3. In fact, our models can only interpolate in probability space when labels are stochastic.
If bias/variance are irrelevant concepts, should we also consider the approximation/estimation decomposition as irrelevant, and anything that builds upon it?