16 Comments

Wonderful topic. I think I reached a point where I would rarely recommend ML for anything with high stakes or high risks. This contrasts so much with my enthusiasm for ML when I started my PhD 15 years ago 😳.

I can’t believe you waited until AFTER I left Berkeley to teach this.

tbf, your class this semester seems pretty fun...

Great topic, guys. Looking forward to reading more about this.

I'm definitely interested in this topic as it applies to science. "Everyone" (okay, many people) tout AI as revolutionizing science. Hell, there were two Nobel Prizes awarded this year in AI/ML! But I'm not convinced that applying ML or AI to scientific questions is that straightforward. It's definitely not for traditional computational problems.

I don't disagree, but wonder what you mean when you say it's not for traditional computational problems. What are the examples you have in mind?

I was thinking of the FFT. It would be nonsense to replace an FFT with a neural net to compute the DFT of a vector. The precision would be awful, the compute time terrible compared to the FFT (that's *after* training), and you'd need to train the network with a bunch of data in the first place.
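
Here's a rough sketch of the kind of comparison I mean (purely illustrative; the architecture, training budget, and sizes are arbitrary choices of mine, and it assumes NumPy and PyTorch are available):

```python
# Rough, illustrative sketch: train a small MLP to approximate the DFT of
# length-64 real vectors and compare its precision to np.fft.fft.
import numpy as np
import torch
import torch.nn as nn

n = 64
rng = np.random.default_rng(0)

def dft_targets(x):
    """Exact DFT via the FFT, with real and imaginary parts stacked."""
    X = np.fft.fft(x, axis=-1)
    return np.concatenate([X.real, X.imag], axis=-1)

# "Training data": random vectors and their exact transforms.
x_train = rng.standard_normal((10_000, n)).astype(np.float32)
y_train = dft_targets(x_train).astype(np.float32)

model = nn.Sequential(nn.Linear(n, 512), nn.ReLU(), nn.Linear(512, 2 * n))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
xb, yb = torch.from_numpy(x_train), torch.from_numpy(y_train)

for _ in range(1_000):                      # a modest (and not cheap) training budget
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(xb), yb)
    loss.backward()
    opt.step()

# Compare precision on fresh inputs.
x_test = rng.standard_normal((100, n)).astype(np.float32)
with torch.no_grad():
    y_pred = model(torch.from_numpy(x_test)).numpy()
err = np.abs(y_pred - dft_targets(x_test)).max()
print("max abs error of the net vs. the exact DFT:", err)
# Expect this to come out many orders of magnitude worse than the FFT's
# near-machine-precision accuracy, after far more compute than np.fft.fft needs.
```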

I do think that there might very well be situations (or computational models) in which all that overhead is worthwhile or sensible. For instance: you have a lot of training data readily available, and you're willing to spend the computational time to train a network in exchange for fast evaluation on new data (faster than running a standard scientific computing solver, for instance). And you're not that concerned about accuracy.

That computational model I described above is really different from standard scientific computing, in which you don't have a bunch of training data, you do care tremendously about accuracy, and you might be willing to trade off running time for some accuracy. But, mainly, you don't have lots of instances of solutions to a PDE; you want to compute the solution to that PDE.

I am curious. I am teaching a Generative AI course, and evaluation is the biggest challenge for such systems. Don't ask me whether I fully know what I'm doing either, but I'm happy to share if you're curious. Looking forward to your posts.

Is this a grad or undergrad class? Would you mind sharing your syllabus?

Grad. I'll email you the syllabus. I started with PCA and PPCA and voila, latent spaces already appear. :)
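
If it helps anyone following along, here's a minimal sketch of what I mean (my own toy example, not course material): PCA is just an SVD of the centered data, and the projected coordinates are already a latent space.

```python
# Toy NumPy-only sketch: the projected coordinates Z form a latent representation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 50))  # rank-10 data in 50-d

mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
k = 10
Z = (X - mu) @ Vt[:k].T          # latent codes, shape (200, k)
X_hat = Z @ Vt[:k] + mu          # decode back from the latent space
print("max reconstruction error:", np.abs(X_hat - X).max())  # tiny here, since rank(X) = 10
```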

Will you all be releasing lecture videos? Or is there a way to audit the course for those of us in the Bay Area?

There's no video, but I'll be sporadically blogging here and hope to release something more detailed at the end of the semester. Perhaps I'll do a reprise of the class in a more open setting if it goes well.

You’d do a great service to the world by recording the lecture :)

Thank you. I'm trying to structure the class as a seminar, and the format doesn't lend itself to video.

Though maybe it would be a new pedagogical innovation if I streamed the seminar on Twitch...

I've recently started thinking of "machine learning" systems (especially LLMs) as "impostorhood evasion machines". That is, the machines are impostors by construction, and the additive (i.e., summed over examples) loss functions are designed to make them evade detection through statistical means. For the last 50 years, this has been hard enough that people have forgotten there is anything more to "AI" than the part about evading detection by statistical means. Or "certain" statistical means, at any rate.

However, now that there are solutions to this problem that are approaching maturity (i.e., diminishing returns and practical utility), people are immediately finding other ways of detecting impostorhood. Twitter/X is full of examples, especially if you follow people with an interest in that. Most of these have to do with spotting inconsistencies, and an inability to resolve them when confronted. Now the question is whether these new means of detecting impostorhood are statistical at all and, if they are, how they are different.

On the one hand, I'm inclined to think that certain tests are sufficient by themselves to expose impostorhood, meaning they are not "statistical", but really that just means they don't depend on an average. A "statistic" is just an aggregation of a sample that exposes a property of interest, and the presence of a single dispositive test fits that description. On the other hand, as humans we often face devastating rhetorical attacks exposing some degree of impostorhood, and eventually recover our standing. I think that's because for humans, impostorhood is generally rectifiable, whereas for a machine system it may not be. Maybe it's because when a human is an "impostor", the thing they're an impostor of is a narrowly scoped role or skillset, not sentience itself. There's also the fact that even when humans fail a consistency check, we have the other thing - self-awareness - that implies a path to rehabilitation, and LLMs regularly fail not only on consistency but also on self-awareness when called out.

How to use this concept in evaluating systems? For one thing, if the average case is all you care about, then by all means, optimize for it. If you care about the worst case, then statistical learning theory doesn't have much to offer except hardness results. There could be a third way, which is something like regret analysis, where the thing you try to bound is, "given one occurrence of an asymptotically worst case outcome, what is the probability of a recurrence?"
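
If it helps, here is one way I'd write the three options down (my own sketch of the idea, not standard notation):

```latex
% My own sketch, not standard notation: \ell is a bounded loss, f the system.
\begin{align*}
  \widehat{R}_{\mathrm{avg}}(f) &= \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
    && \text{average case: what additive losses optimize} \\
  R_{\mathrm{wc}}(f) &= \sup_{(x,y)} \ell\big(f(x), y\big)
    && \text{worst case: mostly hardness results} \\
  \rho_c(f) &= \Pr\!\big[\,\ell_{t+1} \ge c \;\big|\; \ell_t \ge c\,\big]
    && \text{``third way'': chance a level-}c\text{ failure recurs}
\end{align*}
```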

Discussing this further offline makes me realize a couple more things. One is that, while the effort is to quantify, in interesting application domains there are often qualitatively different outcomes that need to be either avoided or handled. Traditional statistical learning theory has punted this onto asymptotic worst-case analysis, finding only that that is difficult, or impossible, in PAC theory. If, aside from quantifying machine-system fitness, we can qualify it, we can then approach each of the qualitatively different categories in turn. In other words, we don't care so much about the *risk* of an outcome per se as we do about *whether it's in the support set at all*. Note that outcomes of a different asymptotic order of badness often have entirely different constructions from modal cases.

For LLMs, the qualitative categories are in the product space of, "Is it accurate?", "Is it internally self consistent?", and "Is it able to reconcile situations where its sub-perfections are called out to it?"

Most engineering practices can be described as the search for a configuration of a system such that all of its performance criteria are contained within some specified tolerances, i.e., a feasible set. The search for such a configuration may be quantitative, but the task itself is inherently qualitative: the system is either within tolerance or it is not. Other ancillary tasks may also be quantitative, such as quantifying the sensitivity to external perturbations, the severity or risk of departures, or the cost of mitigation if the system does depart from tolerance.
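
A throwaway sketch of what I mean by "inherently qualitative" (the criteria and tolerances here are made up by me):

```python
# Toy sketch with made-up criteria and tolerances: each criterion is measured
# quantitatively, but feasibility itself is a yes/no question.
TOLERANCES = {                      # criterion -> (lower bound, upper bound)
    "accuracy":           (0.95, 1.00),
    "p99_latency_ms":     (0.0, 250.0),
    "hallucination_rate": (0.0, 0.01),
}

def within_tolerance(measurements: dict) -> bool:
    """True iff every measured criterion lies inside its tolerance band."""
    return all(lo <= measurements[name] <= hi
               for name, (lo, hi) in TOLERANCES.items())

print(within_tolerance({"accuracy": 0.97,
                        "p99_latency_ms": 180.0,
                        "hallucination_rate": 0.02}))   # False: one criterion is out
```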

The other thing is that I'm not aware of any machine learning system that can inherently distinguish between originality and regurgitation. There simply is no structural distinction, and researchers are left having to invent post-hoc measures. Current LLM benchmarks are absolutely plagued by test set contamination, to the point where it's commonly understood that it doesn't matter if the exact questions don't occur in the training set, as long as the general *kind* of questions do. Human benchmarks are wholly inappropriate because the impostor machines have long since surpassed the human ability to regurgitate, and at that level, regurgitative methods can successfully evade detection via statistical averages in ways that human impostors cannot.
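
For concreteness, the kind of crude post-hoc measure I mean looks something like this (my own toy sketch, a verbatim n-gram overlap check that the "same kind of question" problem easily defeats):

```python
# Toy post-hoc contamination check (my own sketch): flag benchmark items whose
# word n-grams also appear verbatim in the training corpus. It says nothing
# about items that merely resemble the *kind* of questions seen in training.
def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(benchmark_item: str, training_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that occur verbatim somewhere in training."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    train_grams = set().union(*(ngrams(doc, n) for doc in training_docs))
    return len(item_grams & train_grams) / len(item_grams)
```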

Given that a system inherently cannot say whether it's leveraging what it's heard vs. genuinely synthesizing a novel solution, or, given a solution, cannot articulate (other than as an impostor) which parts of the solution were borrowed vs. novel/synthetic, it may not be realistic to evaluate it in the way we would evaluate a human worker. After all, it was *designed* from the start to evade detection.
