5 Comments
Ani (Oct 9, edited)

Hi, very interesting blog.

1. There is this term "memorization" that seems to be thrown around a lot. What is the difference between memorization and generalization?

2. Is there any theory of what happens when the i.i.d. assumption is broken? For example, something that mathematically quantifies the "brokenness" of the i.i.d. assumption and then provides a guarantee on the sample complexity?

3. How do over-parameterization and over-fitting relate to generalization theory?

4. Are there any instances (even "hypothetical") where "more data" is not good? Perhaps if you've built your model under the i.i.d. assumption and this assumption does not hold in practice, so the model just "collapses"?

Edit: It seems like the law of large numbers just says that the sample mean gets really close to the population mean for a fixed function f. However, it does not say anything about the model that gets trained as the sample size grows. So even without breaking this law, with samples drawn i.i.d. from the population distribution, what stops me from saying that training on increasing sample sizes gives me worse performance on the population? (I try to write out the gap in my reasoning below, after question 5.)

5. Also, how does feature representation affect the generalization bounds?
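Concerning the edit above, here is roughly the gap in my reasoning, written out (my own loose notation). The law of large numbers is a statement about a fixed predictor f:

$$\frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) \;\longrightarrow\; \mathbb{E}_{(x,y)\sim P}\big[\ell\big(f(x), y\big)\big] \quad \text{as } n \to \infty,$$

but the trained model $\hat f_n$ is itself chosen using the same sample, so this pointwise statement does not by itself control $\mathbb{E}_{(x,y)\sim P}\big[\ell\big(\hat f_n(x), y\big)\big]$. That sample dependence seems to be exactly where a generalization argument has to say something extra.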

Oğuz Kaan Yüksel

I'll take the liberty of replying to some of the questions you posed here.

1. As far as I'm aware, there is no single definition that is widely seen as the standard. One way I think about this is the contrast between in-sample prediction error and out-of-sample prediction error. The former measures performance on seen inputs (in the context of autoregressive models, it would be the seen "context/prompt"), whereas the latter measures performance across all inputs. An interesting observation for non-i.i.d. models is that the former decays very fast with the number of samples, while the latter depends on the properties of the process (https://proceedings.mlr.press/v258/yuksel25a.html).
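Roughly, in notation I'm improvising here (not taken from the paper): with a loss $\ell$ and seen pairs $(x_1, y_1), \dots, (x_n, y_n)$,

$$\mathrm{err}_{\text{in}}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big), \qquad \mathrm{err}_{\text{out}}(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell\big(f(x), y\big)\big],$$

so the in-sample error averages only over inputs that actually appeared in the data, while the out-of-sample error integrates over the whole input distribution.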

2. Ben Recht has some great work with Max Simchowitz on linear system identification (https://arxiv.org/abs/1802.08334) that is very related. Here, the data comes from a linear dynamical system, which results in non-i.i.d. observations, and yet the rates are very reminiscent of the i.i.d. rates (see the small simulation sketch at the end of this point). This is probably due to the linearity of the model and the fact that the noise in each step of the process is i.i.d.

More generally, for non-linear systems, one quantity related to this question is _mixing_. It describes the memory of the process and how fast the process yields effectively independent observations. Learning-with-mixing arguments then use the mixing coefficients to repeat the i.i.d. arguments and just incur a deflation that depends on the process (roughly, n dependent samples behave like n divided by the mixing time). Lately, I have been working on assumptions that are more relaxed than mixing, since mixing is usually very strong for practical applications. One can still get reasonable rates, but then the key question is not about clever concentration arguments but about what kind of assumptions the data satisfies. And, as far as I'm aware, there hasn't been a lot of empirical work in this direction.
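To make the linear system identification setting concrete, here is a minimal NumPy sketch (the dimensions, noise scale, and the particular stable A are my own illustrative choices, not taken from the paper): the trajectory is strongly dependent, yet one-step least squares recovers A much like an ordinary i.i.d. regression would.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 2000

# A stable linear dynamical system (illustrative choice): x_{t+1} = A x_t + w_t, i.i.d. noise w_t.
A_true = 0.9 * np.eye(d) + 0.05 * rng.standard_normal((d, d))

X = np.zeros((T + 1, d))
for t in range(T):
    X[t + 1] = A_true @ X[t] + 0.1 * rng.standard_normal(d)

# Consecutive states are non-i.i.d. (each depends on the previous one), but ordinary
# least squares on the one-step regression X[t] -> X[t+1] still estimates A well.
X_past, X_next = X[:-1], X[1:]
B, *_ = np.linalg.lstsq(X_past, X_next, rcond=None)  # solves X_past @ B ≈ X_next, so B ≈ A_true.T
A_hat = B.T

print("estimation error:", np.linalg.norm(A_hat - A_true, ord="fro"))
```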

3. The "classical" generalization theory assumes a small training error (compared to the best within the candidates) and bounds the generalization error with training error plus a supremum of some random variables. The number of random variables usually scales with the dimensionality of the model and the bounds become very loose for over-parameterized settings. There has been a lot of interest in getting bounds that work for over-parameterized settings and I'll point to a very recent example: https://x.com/erfunmirzaei/status/1975929115871175113.

Badri

Ben, I believe you may have gone a little too far in this post 😊

Theory wasn’t ever wrong - it was too pessimistic to be useful with the current mathematical tools we have. It is only misleading if we somehow interpret upper bounds as lower bounds and hallucinate principles from those estimates. Also, is it not remarkable that we can bound the population error at all and show the correct scaling with the number of data points?

Maybe it is a question of degree… for a fixed model and a fixed number of samples, some tweak seems to be needed because the training error minimizer is not the test error minimizer. Too bad theory is not able to provide the best guidance here - as Moritz shared in a talk, just about any intuitively stability-inducing technique seems to help in a regime where data and compute are bounded.

Perhaps we need the right formulation of this problem. One attempt: let’s say you cannot change the model architecture, size, or dataset - what is the best recipe for minimizing holdout error? We can introduce any number of hyperparameters, but we need a simple, efficient procedure to minimize test error. Further challenge: can we make this recipe independent of the optimization procedure? (For instance, I may want to use a Newton or quasi-Newton method to optimize, and I could stop at a duality gap recommended by the recipe.) A toy sketch of what I mean is below.
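As a toy illustration of the kind of recipe I mean (the knobs and names below are placeholders I made up, not a claim about which hyperparameters actually matter):

```python
# Hypothetical recipe sketch: model family, dataset, and optimizer are fixed and treated
# as a black box; the recipe only searches a small set of knobs and keeps whatever
# minimizes holdout error.
def recipe(train_and_eval, weight_decays=(1e-4, 1e-3, 1e-2), duality_gaps=(1e-2, 1e-4)):
    best_err, best_config = float("inf"), None
    for wd in weight_decays:
        for gap in duality_gaps:
            # train_and_eval runs whatever optimizer the user likes (SGD, Newton,
            # quasi-Newton, ...) until the requested duality gap, then returns holdout error.
            err = train_and_eval(weight_decay=wd, stop_at_duality_gap=gap)
            if err < best_err:
                best_err, best_config = err, {"weight_decay": wd, "duality_gap": gap}
    return best_err, best_config
```

The point of the optimizer-independence challenge is that only the stopping criterion (the duality gap) is exposed to the recipe, not the update rule itself.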

I don’t want to be fiddling with inscrutable hyperparameters and would love maybe a new theory to guide us!

Sam

What are your thoughts on modern theory like Arora and Goyal's "A Theory for Emergence of Complex Skills in Language Models" (https://arxiv.org/abs/2307.15936)? Should we see more work like this? Less?

Generalization research in AI seems to be in the "extraordinary science" phase, to use the words of Kuhn. Theoretical progress from here will require recourse to philosophy and first principles, as well as cross-disciplinary interactions with psychology and cognitive science.

Jack Zhang

Do you have any thoughts on recursive self-improvement? Proponents of RSI in LLMs (everyone these days) seem to have implicitly assumed that LLMs could have already solved generalization altogether, without building on a concrete theory of generalization.
