Discussion about this post

Ani · Oct 9 (edited)

Hi, very interesting blog.

1. There is this term "memorization" that seems to be thrown around a lot. What is the difference between memorization and generalization?

2. Is there any theory of what happens when the i.i.d. assumption is broken? For example, something that mathematically quantifies the "brokenness" of the i.i.d. assumption and then provides a guarantee on the sample complexity?

3. How do over-parameterization and over-fitting relate to generalization theory?

4. Are there any instances (even "hypothetical") where "more data" is not good? Perhaps if you've built your model under the i.i.d. assumption and the assumption does not hold in practice, so the model just "collapses"?

Edit: It seems like the law of large numbers just says that the sample mean gets really close to the population mean for a fixed function f. However, it says nothing about a model that is trained as the sample grows. So even without breaking this law, with samples drawn i.i.d. from the population distribution…what stops me from saying that training on increasing sample sizes gives me worse performance on the population? (See the inequality sketched after question 5 below.)

5. Also, how does the feature representation affect generalization bounds?
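
Re: the edit under question 4, here is the concentration statement I mean, sketched so I can point at something concrete. Assuming a loss bounded in [0, 1] and a function f that is fixed before the sample is drawn, Hoeffding's inequality gives

\Pr\left[\,\left|\frac{1}{n}\sum_{i=1}^{n} \ell(f, x_i) - \mathbb{E}_x\big[\ell(f, x)\big]\right| > \epsilon\,\right] \le 2\exp\!\left(-2 n \epsilon^2\right).

The catch is that a trained model \hat{f}_n is chosen using the sample, so this bound does not apply to it directly; that gap is what uniform-convergence and stability arguments try to close.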

Badri

Ben, I believe you may have gone a little too far in this post 😊

Theory wasn’t ever wrong - it was too pessimistic to be useful with the current mathematical tools we have. It is only misleading if we somehow interpreted upper bounds as lower bounds and hallucinated principles from those estimates. Also, is it not remarkable that we can bound the population error at all and show the correct scaling with the number of data points?
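
As a concrete instance of that scaling (a textbook-style sketch, assuming a loss bounded in [0, 1] and a finite hypothesis class \mathcal{H}, not anything specific to this post), Hoeffding plus a union bound gives

\Pr\left[\,\sup_{h \in \mathcal{H}} \big| R(h) - \widehat{R}_n(h) \big| > \epsilon\,\right] \le 2|\mathcal{H}|\exp\!\left(-2 n \epsilon^2\right),

so with probability at least 1 - \delta, every h \in \mathcal{H} satisfies R(h) \le \widehat{R}_n(h) + \sqrt{\tfrac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}. Pessimistic constants aside, that 1/\sqrt{n} scaling is the part worth keeping.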

Maybe it is a question of degree… for a fixed model and a fixed number of samples, some tweak seems to be needed because the training-error minimizer is not the test-error minimizer. Too bad theory is not able to provide the best guidance here - like Moritz shared in a talk, just about any intuitively stability-inducing technique seems to help in a regime where data and compute are bounded.

Perhaps we need the right formulation of this problem. One attempt: say you cannot change the model architecture, its size, or the dataset - what is the best recipe for minimizing holdout error? We can introduce any number of hyperparameters, but we need a simple, efficient procedure to minimize test error. Further challenge: can we make this recipe independent of the optimization procedure? (For instance, I may want to optimize with a Newton or quasi-Newton method and stop at a duality gap recommended by the recipe.)
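
To pin down what I mean (my own notation, not from the post): write A_\lambda for the training procedure with hyperparameters \lambda \in \Lambda, D_{\text{train}} and D_{\text{holdout}} for the fixed split, and R_{\text{holdout}} for error on the holdout set. The recipe should approximately solve

\min_{\lambda \in \Lambda}\; R_{\text{holdout}}\big( A_\lambda(D_{\text{train}}) \big)

with a simple, compute-bounded procedure, and the further challenge is that the prescription for \lambda (including the stopping rule, e.g. a duality-gap tolerance \tau) should not depend on which optimizer implements A_\lambda.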

I don’t want to be fiddling with inscrutable hyperparameters and would love a new theory to guide this!
