This is a live blog of Lecture 8 of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A Table of Contents is here.
All of modern machine learning stands on the foundations of the perceptron. Three main branches of theoretical development emerged from this work and carry our understanding today. First, optimization. People interpreted the perceptron learning dynamics as a form of error minimization. This led to nice insights into how to algorithmically minimize average costs. I’ll talk more about optimization next week. Second, generalization. Machine learning researchers devoted lifetimes to understanding what would happen on new data. The vast majority of what we think of as “learning theory” is about the connection between in-sample and out-of-sample performance. Third, representation. This third branch is the most theoretically neglected but might be the most interesting.
Machine learning methods need to represent decision functions in machine-legible ways. Part of what makes machine learning such a weirdly atheoretical field is that we only apply it when we don’t have clean models of data statistics. If we had such models, like the ones we use for the motion of planets, we’d use them. Without predictive laws, we lean on intuitions about how best to digitally represent measurements of reality. Some of these intuitions come from signal processing, leveraging ideas from sampling theory or spectral analysis. But is there a “theory” behind how we represent signals in machine learning? If you look at conferences on learning theory, the relatively small list of results in this space would tell you the answer is no. And yet, the practice of the last decade has converged on a single idea, and perhaps there are theorems to be proven about this convergence.
In machine learning, we take for granted that there is a measurement process that renders signals of interest as bit strings on the computer. All images start as pixels. All language starts as Unicode characters. All people start as rows in an actuarial table. The representation goal in machine learning is to transform this digitization into a vector. Then we map the vector into a prediction.
For example, people need to be digitized to be classified. We represent them as the output of some sort of intake form or questionnaire. Then we assume this digitization is a vector. We embed binary answers as the real numbers 0 and 1. We encode categorical answers as a vector whose length is the number of categories, whose entry is 1 for the category that contains that person, and 0 elsewhere. We call this a “one-hot embedding” in machine learning. In social sciences, it’s called a “fixed effect.” At this point, we can fit a linear model on top and see if we can make reasonable predictions about the people in our cohort. We call these long lists of ones and zeros “tabular data,” since they usually come from a table in some database system or spreadsheet.
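To make this concrete, here is a minimal numpy sketch of turning a made-up intake form into such vectors and fitting a linear model on top. The field names, categories, and labels are all hypothetical, chosen only to illustrate the encoding.

```python
import numpy as np

# Hypothetical intake form: one binary answer and one categorical answer.
people = [
    {"employed": True,  "region": "north"},
    {"employed": False, "region": "south"},
    {"employed": True,  "region": "west"},
]
regions = ["north", "south", "west"]  # the full list of categories

def featurize(person):
    # Binary answer becomes 0 or 1; categorical answer becomes a one-hot block.
    one_hot = [1.0 if person["region"] == r else 0.0 for r in regions]
    return np.array([float(person["employed"])] + one_hot)

X = np.stack([featurize(p) for p in people])  # the "tabular data" matrix
y = np.array([1.0, 0.0, 1.0])                 # made-up labels

# Fit a linear model on top by least squares and look at its predictions.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X @ w)
```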
As another example, best practices now quantize text at the token level and treat language as sequences of such tokens. Text passages are thus digitized as a sequence of one-hot encoded vectors. Or, if you will, it’s a matrix, where every column has a single one and is otherwise equal to zero. Transformers take this object and do some nonlinear junk to make the text into a sequence of dense, real-valued vectors where we can apply linear prediction.
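As a toy illustration (not any particular tokenizer), here is what that one-hot matrix looks like for a tiny made-up vocabulary, and how a stand-in embedding matrix maps it to a sequence of dense columns. Real systems learn sub-word tokenizers and the mapping to dense vectors from data.

```python
import numpy as np

# Toy vocabulary; real tokenizers learn sub-word pieces from data.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

def one_hot_matrix(text):
    # Each token becomes a column with a single 1; the passage becomes a matrix.
    ids = [vocab.get(t, vocab["<unk>"]) for t in text.lower().split()]
    M = np.zeros((len(vocab), len(ids)))
    M[ids, np.arange(len(ids))] = 1.0
    return M

M = one_hot_matrix("the cat sat")
print(M.shape)  # (4, 3): one column per token, a single 1 in each column

# A stand-in for the learned map to dense vectors (a transformer does far more,
# but the input and output are the same kind of object).
E = np.random.randn(8, len(vocab))
print((E @ M).shape)  # (8, 3): a sequence of dense, real-valued vectors
```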
Perhaps the biggest revelation of the deep learning era is that we can treat people, text, audio, and images in a uniform manner. All signals can be embedded into a sequence of vectors in Euclidean spaces. An image starts off as a matrix, now with each column equal to a digitization of a patch in the image. You can apply the same sort of transformer ideas to compress the columns into a representation where pattern recognition works. And if you want, you can do optimization tricks to embed images into the same vector space as text. This joint optimization works well for solving pattern recognition problems between text and images, or between people and fMRI scans.
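For images, the analogous move is cutting the picture into patches. A rough sketch, with arbitrary sizes, of how a single grayscale image becomes the same kind of matrix-of-vectors object as the text example above:

```python
import numpy as np

# Arbitrary sizes: a 32x32 grayscale "image" cut into non-overlapping 8x8 patches.
image = np.random.rand(32, 32)
patch = 8

columns = []
for i in range(0, image.shape[0], patch):
    for j in range(0, image.shape[1], patch):
        columns.append(image[i:i + patch, j:j + patch].ravel())

X = np.stack(columns, axis=1)
print(X.shape)  # (64, 16): each column is one flattened patch, just like the text case
```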
If anything, the signal processing functionality here is much simpler than what we were doing before. We build these embeddings by breaking signals into a collection of vectors that have some relationship with each other, and then nonlinearly transform these vectors until they are in a space where linear pattern recognition is possible. The endpoint of all representation is a matrix of vectors that we treat the same way, no matter what the originating phenomenon was. You can get there using transformers. You can get there by hand-tuning features. But everyone has a target for their final representation.
While we seem to have stumbled upon a unified practice of matricization, I still feel like our theory here is pretty empty. You don’t need to know Fourier analysis or wavelets to know that an image is basically a bunch of patches. But what can we say here? I’m surely missing out on new developments. Like you, I can’t keep up with the fifty thousand machine learning papers every year. Someone must have a nice, informative theory of why all experience can be represented as sequences in the same vector space, right? I look forward to links in the comments.
The only theory about data representation that I’m comfortable with, I’ll discuss in the next lecture: how to efficiently generate nonlinear maps on vector spaces. Beyond that, I’m not sure we yet understand what the “right way” is to map reality into boxes of numbers. We have a paradigm that works. But is there an easier way? Is this the most efficient way? It’s hard to ask when people are willing to throw infinite salaries at you just to test the limits of the paradigm. Perhaps understanding can only happen after the bubble bursts.
I've been studying approximation algorithms lately and was surprised to find that many of the tricks used to relax discrete symbols to enable continuous optimization are very similar to the representations used in ML, for example, unit vectors or simplices. Perhaps the theory of data representation is simply a study of convex relaxations of integer programs.
word2vec: The word-embedding part (the pre-training stage of LLMs) is still very surprising to me. (As soon as I digest this, I hope to move on to the main novelty that everyone is talking about: transformers!)
I asked ChatGPT whether it does anything special for agglutinative languages such as Turkish. (In Turkish, book is "kitap"; bookstore is "kitapci"; bookshelf is "kitaplik"; and bookshelf store is "kitap-lik-ci". The two- or three-letter suffixes at the very end of the root word "kitap" change the word altogether.) ChatGPT told me that they don't do anything special for any language and use sub-word-level tokenization for all languages! How convenient.
This language-independent approach leads to perfect translations between almost all languages. I am still in awe of this. How are they doing it?
This is my chat with him on the topic: https://chatgpt.com/share/68d55f4b-9a04-8002-bdb3-6d6a0b02a6b6