I've been studying approximation algorithms lately and was surprised to find that many of the tricks used to relax discrete symbols to enable continuous optimization are very similar to the representations used in ML: for example, using unit vectors or simplices. Perhaps the theory of data representation is simply a study of convex relaxations of integer programs.
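As a concrete example of this kind of relaxation (a minimal sketch, assuming scipy and numpy, with made-up toy numbers): picking one of k discrete options is an integer program over one-hot vectors, and relaxing the one-hot constraint to the probability simplex turns it into a plain linear program.

```python
# Minimal sketch (scipy assumed, toy numbers) of relaxing a one-hot choice
# to the probability simplex: x >= 0, sum(x) = 1.
import numpy as np
from scipy.optimize import linprog

scores = np.array([3.0, 5.0, 2.0])           # utility of each discrete choice

# maximize scores @ x over the simplex  ==  minimize -scores @ x
res = linprog(
    c=-scores,
    A_eq=np.ones((1, len(scores))),          # sum_i x_i = 1
    b_eq=[1.0],
    bounds=[(0.0, None)] * len(scores),      # x_i >= 0
)
# For a linear objective, an optimum sits at a vertex of the simplex,
# i.e., a one-hot vector, so the relaxation is tight here.
print(res.x)
```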
You said, "But everyone has a target for their final representation." I wonder what you would say about the unsupervised learning community. Also, I thought techniques for representation learning like Fourier and wavelet transforms existed long before modern machine learning. Hence, I wonder if prediction tasks are all we need to learn good representations, or whether they are just convenient because we can then apply the decision and optimization methods taught in your class.
I have a bit of a dim view of unsupervised learning. Not only are the optimization problems hard there, but you can never tell if you got the right answer. So I find it very hard to evaluate methods and claims in that space.
Similarly, what does it mean for a representation to be "good"? I need to have an answer to this before I can evaluate whether prediction is all we need.
"Someone must have a nice, informative theory of why all experience can be represented as sequences in the same vector space, right?"
At least we pretend it is a vector, or, more pedantically, an inner product space.
To do ML on a computer, you have to make the number box, of course. The magic seems to be mapping that into the right number box, and yeah neural networks seem to work best in most cases.
On the other hand, biological learning is doing a much messier thing with representation. People argue about whether firing rates or spike times or membrane potentials (w/o spikes) matter in various situations. Brains are a mixture of digital and analog computing that somehow works as well as it does. However, our understanding of them generally comes down to mapping neural activity to number boxes.
There surely are many problems where Euclidean vector representations work poorly. I don't know if there are any known impossibility results in the literature, but I would be willing to bet that a problem such as determining the primality of an integer would not be efficiently tackled by a Euclidean vector embedding. Text, surveys, images, and audio all have in common that they are forms of data that humans can understand. Perhaps that has something to do with why a machine learning architecture inspired by connectionism (https://en.wikipedia.org/wiki/Connectionism) should work well on all of these forms of data.
A comment on: "Perhaps understanding can only happen after the bubble bursts."
I agree that the bubble is a very bad environment for getting understanding. However, I worry the bubble won't burst anytime soon (despite the clear plateau the paradigm has reached, which is acknowledged at least privately by some of the people involved in its development).
From knowing people on the inside and psychoanalyzing the tech billionaires, as well as watching the general development of the technology, my prediction is that the bubble is not bursting anytime soon. And AI is already concerningly integrated into society -- ChatGPT is quickly rising to be one of the most accessed websites, and I see so many students having ChatGPT on during presentations.
I think a plateau (the current moment) is the best opportunity we have for trying to understand and influence the trajectory of its development. But we'll see.
This is a really interesting question, and one I've seen addressed by a few papers proposing explanations. One interesting idea, explored in the visual language reasoning space, is the Platonic Representation Hypothesis (https://arxiv.org/pdf/2405.07987): the vector embeddings that result from representation learning are derived from an underlying reality, i.e., all data is a projection of a lower-dimensional underlying reality (the allegory of the cave).
While it's not an explanation for why they are all represented specifically in a sequential vector space, it provides some fairly interesting insight.
"As another example, best practices now quantize text at the token level and treat language as sequences of such tokens. Text passages are thus digitized as a sequence of one-hot encoded vectors. Or, if you will, it’s a matrix, where every column has a single one and is otherwise equal to zero. Transformers take this object and do some nonlinear junk to make the text into a sequence of dense, real-valued vectors where we can apply linear prediction."
I think this was true ~10 years ago but not anymore? Text is tokenized and then mapped to a token embedding (not one-hot) via a learned embedding matrix.
I think Ben meant here that the embedding matrix is part of the transformer (which, fair) and the output is the contextualized vector.
That's right. You can also think of it like this (a small numerical sketch follows the list):
- every token can be one-hot encoded. Let n_unique_tokens be the number of unique tokens.
- a sequence of tokens is then a sparse matrix X of size n_unique_tokens x seq_length
- you can put every token embedding in a matrix V of size d x n_unique_tokens.
- then the output of the embedding layer is V*X
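As a minimal numerical sketch of this equivalence (assuming numpy and toy sizes; the names n_unique_tokens, seq_length, and d follow the list above):

```python
# Sketch: multiplying an embedding matrix V by a one-hot matrix X is the
# same as indexing V by the token ids, i.e., a standard embedding lookup.
import numpy as np

n_unique_tokens, seq_length, d = 5, 3, 4
token_ids = np.array([2, 0, 4])                      # a toy token sequence

# One-hot matrix X of shape (n_unique_tokens, seq_length); column j is the
# one-hot encoding of token_ids[j].
X = np.zeros((n_unique_tokens, seq_length))
X[token_ids, np.arange(seq_length)] = 1.0

# Embedding matrix V of shape (d, n_unique_tokens)
V = np.random.randn(d, n_unique_tokens)

assert np.allclose(V @ X, V[:, token_ids])           # V*X == embedding lookup
```

In practice, of course, the one-hot matrix is never materialized; the lookup V[:, token_ids] is what gets computed.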
Ok yeah, that would be equivalent. But your original passage described the process as if it were actually done that way, when in practice tokens are not literally one-hot encoded. Anyway, it doesn't really matter.
"We call this a “one-hot embedding” in machine learning. In social sciences, it’s called a “fixed effect.”"
Also, I'm not quite sure what you're getting at here. A fixed effect is part of a statistical model, not a way of representing information.
I'm not sure I follow. Fixed effects in a regression equation are represented by one-hot-encoding categorical information. Writing "dummy" variables for fixed effects is the same as one-hot encoding.
Sure, but you seemed to imply that one-hot encoding categorical information is the same thing as a fixed effect. I just wanted to point out the distinction between representation and model structure. It seems odd to literally call the encoding a fixed effect, rather than a way of representing information in order to fit a particular model structure. Apologies for the pedantry.
"Fixed effects in a regression equation are represented by one-hot-encoding categorical information."
No, not really. Fixed effects are simply effects that are fixed. The term says nothing about the variable upon which they act (it could be a one-hot representation, a measured value, etc.). The name stems from the distinction between fixed and random effects, where (at least in the early days of random effects and multilevel models) random effects often were tied to one-hot encodings of classes and the like.
But even that is a holdover from when the modeling being done was categorical, and random effects likewise do not require anything of their data representations (with, again, the distinctions in the early days often coming down to "random slope" vs. "random intercept").
I take an idiosyncratic, instrumentalist approach to statistical modeling. I don't care what the textbooks say fixed or random effects are; I care what the R or Stata packages do. And more often than not, a fixed effect just means adding a binary dummy variable for a category and running some garbage can regression on top.
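As a minimal sketch of what that usually amounts to in practice (assuming numpy, with made-up toy data and hypothetical variable names): one-hot/dummy-encode the categorical column and include those columns as regressors in an ordinary least-squares fit.

```python
# Sketch: "adding a fixed effect" as dummy-encoding a category and
# including the dummies in a least-squares regression.
import numpy as np

rng = np.random.default_rng(0)
n = 100
group = rng.integers(0, 3, size=n)           # categorical variable with 3 levels
x = rng.normal(size=n)                        # a continuous covariate
y = 2.0 * x + np.array([0.0, 1.0, -1.0])[group] + rng.normal(scale=0.1, size=n)

# One-hot encode the category (no separate intercept, so the design stays full rank)
D = np.eye(3)[group]                          # n x 3 dummy matrix
A = np.column_stack([x, D])                   # design: covariate + group dummies

coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("slope:", coef[0], "group effects:", coef[1:])
```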