I've been studying approximation algorithms lately and was surprised to find that many of the tricks used to relax discrete symbols to enable continuous optimization are very similar to the representations used in ML: for example, using unit vectors or simplices. Perhaps the theory of data representation is simply a study of convex relaxations of integer programs.
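As a concrete example of this kind of relaxation (a minimal sketch, assuming scipy and numpy, with made-up toy numbers): picking one of k discrete options is an integer program over one-hot vectors, and relaxing the one-hot constraint to the probability simplex turns it into a plain linear program.

```python
# Minimal sketch (scipy assumed, toy numbers) of relaxing a one-hot choice
# to the probability simplex: x >= 0, sum(x) = 1.
import numpy as np
from scipy.optimize import linprog

scores = np.array([3.0, 5.0, 2.0])           # utility of each discrete choice

# maximize scores @ x over the simplex  ==  minimize -scores @ x
res = linprog(
    c=-scores,
    A_eq=np.ones((1, len(scores))),          # sum_i x_i = 1
    b_eq=[1.0],
    bounds=[(0.0, None)] * len(scores),      # x_i >= 0
)
# For a linear objective, an optimum sits at a vertex of the simplex,
# i.e., a one-hot vector, so the relaxation is tight here.
print(res.x)
```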
You said, "But everyone has a target for their final representation." I wonder what you would say about the unsupervised learning community. Also, I thought techniques for representation learning like Fourier and wavelet transforms existed long before modern machine learning. Hence, I wonder if prediction tasks are all we need to learn good representations, or whether they are just convenient because we can then apply the decision and optimization methods taught in your class.
I have a bit of a dim view of unsupervised learning. Not only are the optimization problems hard there, but you can never tell if you got the right answer. So I find it very hard to evaluate methods and claims in that space.
Similarly, what does it mean for a representation to be "good"? I need to have an answer to this before I can evaluate whether prediction is all we need.
"Someone must have a nice, informative theory of why all experience can be represented as sequences in the same vector space, right?"
At least we pretend it is a vector, or, more pedantically, an inner product space.
To do ML on a computer, you have to make the number box, of course. The magic seems to be mapping that into the right number box, and yeah neural networks seem to work best in most cases.
On the other hand, biological learning is doing a much messier thing with representation. People argue about whether firing rates or spike times or membrane potentials (w/o spikes) matter in various situations. Brains are a mixture of digital and analog computing that somehow works as well as it does. However, our understanding of them generally comes down to mapping neural activity to number boxes.
There surely are many problems where Euclidean vector representations work poorly. I don't know if there are any known impossibility results in the literature, but I would be willing to bet that a problem such as determining the primality of an integer would not be efficiently tackled by a Euclidean vector embedding. Text, surveys, images, and audio all have in common that they are forms of data that humans can understand. Perhaps that has something to do with why a machine learning architecture inspired by connectionism (https://en.wikipedia.org/wiki/Connectionism) should work well on all of these forms of data.
A comment on: "Perhaps understanding can only happen after the bubble bursts."
I agree that the bubble is a very bad environment for getting understanding. However, I worry the bubble won't burst anytime soon (despite the clear plateau the paradigm has reached, which is acknowledged at least privately by some of the people involved in its development).
From knowing people on the inside and psychoanalyzing the tech billionaires, as well as watching the general development of the technology, my prediction is that the bubble is not bursting anytime soon. And AI is already concerningly integrated into society -- ChatGPT is quickly rising to be one of the most accessed websites, and I see so many students having ChatGPT on during presentations.
I think a plateau (the current moment) is the best opportunity we have for trying to understand and influence the trajectory of its development. But we'll see.
This is a really interesting question, and one I've seen addressed by a few papers proposing explanations. One interesting idea, explored in the visual language reasoning space, is the Platonic Representation Hypothesis (https://arxiv.org/pdf/2405.07987): the vector embeddings that result from representation learning are derived from an underlying reality, i.e., all data is a projection of a lower-dimensional underlying reality (the allegory of the cave).
While it's not an explanation for why they are all represented specifically in a sequential vector space, it provides some fairly interesting insight.
"As another example, best practices now quantize text at the token level and treat language as sequences of such tokens. Text passages are thus digitized as a sequence of one-hot encoded vectors. Or, if you will, it’s a matrix, where every column has a single one and is otherwise equal to zero. Transformers take this object and do some nonlinear junk to make the text into a sequence of dense, real-valued vectors where we can apply linear prediction."
I think this was true ~10 years ago but not anymore? Text is tokenized and then mapped to a token embedding (not one-hot) via a learned embedding matrix.
I think Ben meant here that the embedding matrix is part of the transformer (which, fair) and the output is the contextualized vector.
That's right. You can also think of it like this (a small numerical sketch follows the list):
- every token can be one-hot encoded. Let n_unique_tokens be the number of unique tokens.
- a sequence of tokens is then a sparse matrix X of size n_unique_tokens x seq_length
- you can put every token embedding in a matrix V of size d x n_unique_tokens.
- then the output of the embedding layer is V*X
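As a minimal numerical sketch of this equivalence (assuming numpy and toy sizes; the names n_unique_tokens, seq_length, and d follow the list above):

```python
# Sketch: multiplying an embedding matrix V by a one-hot matrix X is the
# same as indexing V by the token ids, i.e., a standard embedding lookup.
import numpy as np

n_unique_tokens, seq_length, d = 5, 3, 4
token_ids = np.array([2, 0, 4])                      # a toy token sequence

# One-hot matrix X of shape (n_unique_tokens, seq_length); column j is the
# one-hot encoding of token_ids[j].
X = np.zeros((n_unique_tokens, seq_length))
X[token_ids, np.arange(seq_length)] = 1.0

# Embedding matrix V of shape (d, n_unique_tokens)
V = np.random.randn(d, n_unique_tokens)

assert np.allclose(V @ X, V[:, token_ids])           # V*X == embedding lookup
```

In practice, of course, the one-hot matrix is never materialized; the lookup V[:, token_ids] is what gets computed.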
Ok yeah, that would be equivalent. But your original passage described the process as if it were actually done that way, when in practice tokens are not literally one-hot encoded. Anyway, it doesn't really matter.
"We call this a “one-hot embedding” in machine learning. In social sciences, it’s called a “fixed effect.”"
Also, I'm not quite sure what you're getting at here. A fixed effect is part of a statistical model, not a way of representing information.
I'm not sure I follow. Fixed effects in a regression equation are represented by one-hot-encoding categorical information. Writing "dummy" variables for fixed effects is the same as one-hot encoding.
Sure, but you seemed to imply that one-hot encoding categorical information is the same thing as a fixed effect. I just wanted to point out the distinction between representation and model structure. It seems odd to literally call the encoding a fixed effect, rather than a way of representing information in order to fit a particular model structure. Apologies for the pedantry.
"Fixed effects in a regression equation are represented by one-hot-encoding categorical information."
No, not really. Fixed effects are simply effects that are fixed. The term says nothing about the variable upon which they act (it could be a one-hot representation, a measured value, etc.). The name stems from the distinction between fixed and random effects, where (at least in the early days of random effects and multilevel models) random effects often were tied to one-hot encodings of classes and the like.
But even that is a holdover from when the modeling being done was categorical, and random effects likewise do not require anything of their data representations (with, again, the distinctions in the early days often coming down to "random slope" vs. "random intercept").
I take an idiosyncratic, instrumentalist approach to statistical modeling. I don't care what the textbooks say fixed or random effects are; I care what the R or Stata packages do. And more often than not, a fixed effect just means adding a binary dummy variable for a category and running some garbage can regression on top.
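As a minimal sketch of what that usually amounts to in practice (assuming numpy, with made-up toy data and hypothetical variable names): one-hot/dummy-encode the categorical column and include those columns as regressors in an ordinary least-squares fit.

```python
# Sketch: "adding a fixed effect" as dummy-encoding a category and
# including the dummies in a least-squares regression.
import numpy as np

rng = np.random.default_rng(0)
n = 100
group = rng.integers(0, 3, size=n)           # categorical variable with 3 levels
x = rng.normal(size=n)                        # a continuous covariate
y = 2.0 * x + np.array([0.0, 1.0, -1.0])[group] + rng.normal(scale=0.1, size=n)

# One-hot encode the category (no separate intercept, so the design stays full rank)
D = np.eye(3)[group]                          # n x 3 dummy matrix
A = np.column_stack([x, D])                   # design: covariate + group dummies

coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("slope:", coef[0], "group effects:", coef[1:])
```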