8 Comments

JP

One comment about fitting random Fourier feature (RFF) models. You can fit them with linear scaling in the number of random features if you use an iterative method, as you noted. You mentioned stochastic gradient descent, but the optimization problem for fitting these models is often ill-conditioned for obvious reasons, and I haven't found stochastic gradient descent to work very well, at least not without extensive tweaking of the learning rate and learning rate schedule, which is time-consuming and kind of annoying. My favorite method for fitting RFF models for regression (and classification, if it makes sense to use LDA) is conjugate gradients with randomized Nyström preconditioning; it works like magic and converges very quickly, and you can implement the randomized Nyström preconditioner using a subsampled randomized Hadamard transform-based procedure that's quite fast. That makes them very practical and fairly easy to use if you do need a large number of random features!
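
For anyone who wants to try this, here is a minimal sketch of the idea, not JP's exact pipeline: solve the RFF ridge normal equations with conjugate gradients, preconditioned by a randomized Nyström approximation of Z^T Z. A Gaussian test matrix stands in for the subsampled randomized Hadamard transform, and the problem sizes, sketch rank, and regularizer are illustrative choices.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, svd
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, d_in, D, lam, rank = 2000, 10, 3000, 1e-3, 200

# Toy data and random Fourier features approximating an RBF kernel.
X = rng.standard_normal((n, d_in))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.standard_normal(n)
W = rng.standard_normal((d_in, D))            # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, D)          # random phases
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)      # n-by-D feature matrix

# Matrix-vector product with A = Z^T Z + lam * I, never forming A explicitly.
def matvec(v):
    return Z.T @ (Z @ v) + lam * v
A = LinearOperator((D, D), matvec=matvec)

# Randomized Nystrom approximation of Z^T Z from a rank-`rank` sketch.
Q, _ = np.linalg.qr(rng.standard_normal((D, rank)))   # Gaussian test matrix
Y = Z.T @ (Z @ Q)                                     # (Z^T Z) Q via thin products
nu = np.sqrt(D) * np.finfo(float).eps * np.linalg.norm(Y)
Y = Y + nu * Q                                        # small shift for stability
C = cholesky(Q.T @ Y)                                 # upper triangular factor
B = solve_triangular(C, Y.T, trans='T').T             # B = Y C^{-1}
U, s, _ = svd(B, full_matrices=False)
eigs = np.maximum(s**2 - nu, 0.0)                     # approx. eigenvalues of Z^T Z

# Nystrom preconditioner: damp the top of the spectrum, leave the rest alone.
def precond(v):
    Utv = U.T @ v
    return U @ (((eigs[-1] + lam) / (eigs + lam)) * Utv) + (v - U @ Utv)
M = LinearOperator((D, D), matvec=precond)

w, info = cg(A, Z.T @ y, M=M, maxiter=200)
print("CG exit code:", info,
      "| residual norm:", np.linalg.norm(matvec(w) - Z.T @ y))
```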

Ben Recht

Great points and good tips for folks who want to try out RFFs.

Something interesting to me is that if we buy into the double descent mindset and pick somewhat more random features than data points (I'm fine with this), then the random feature matrix has O(N^2) entries. This means that RFFs and the unapproximated kernel regression problem are going to have similar computational complexities.
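
A back-of-the-envelope version of that comparison (the sizes below are purely illustrative): once D is a constant multiple of N, the storage and the dense solve costs of the two approaches differ only by constant factors.

```python
# With D = 2N random features, the feature matrix already has as many
# entries as the N-by-N kernel matrix, and the dense solves are O(N^3)
# in both cases.
N = 10_000
D = 2 * N
kernel_entries = N * N              # K is N x N
feature_entries = N * D             # Z is N x D
kernel_solve_flops = N ** 3         # solve (K + lam I) alpha = y
feature_solve_flops = N * D ** 2    # form and solve the D x D normal equations
print(feature_entries / kernel_entries)          # 2.0
print(feature_solve_flops / kernel_solve_flops)  # 4.0
```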

Ita

Thanks!

One comment: solutions of kernel ridge regression lie in the span of the training vectors, so even if you use a kernel that corresponds to an infinite-dimensional feature space, your solution effectively belongs to a finite-dimensional space whose dimension is bounded by the size of the training set, N.
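
To make that concrete, here is a minimal kernel ridge regression sketch (toy data; the regularizer and bandwidth are illustrative): the RBF kernel's feature space is infinite-dimensional, but the fitted function is pinned down by the N coefficients alpha, one per training point.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 200, 1e-2
X = rng.uniform(-3.0, 3.0, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

def rbf(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

K = rbf(X, X)                                      # N x N kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(N), y)    # the N coefficients

X_test = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
f_test = rbf(X_test, X) @ alpha                    # f(x) = sum_i alpha_i k(x, x_i)
print(f_test)
```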

Ben Recht

Yes! And the same thing is true when you use gradient descent to solve a linear regression with more features than data points.
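
A small sketch of that analogy (sizes and step size are illustrative): gradient descent started from zero on an overparameterized least squares problem only ever adds combinations of the rows of X, so it converges to the minimum-norm interpolating solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                                   # more features than data points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / squared spectral norm of X
for _ in range(2000):
    w -= step * X.T @ (X @ w - y)                # update stays in the row space of X

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # pseudoinverse (min-norm) solution
print(np.linalg.norm(w - w_min_norm))            # ~ 0
```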

rif a saurous

Yikes. They'll never learn.

David Rothman

Quickly (it’s a busy day), as a practitioner, here (again) are the most important things I’ve learned (and while it may or may not make any difference, we never looked to ‘predict’ prices of anything; we looked to model relationships between assets or asset classes):

Broadly, AI-based trading models (high variance) can excel at uncovering complex, non-linear patterns in noisy equity markets, leveraging big data to drive short-term alpha, but they are black boxes (although this appears to be slowly changing) and subject to overfitting. Econometric models (high bias) offer interpretability and stability (possibly/mainly for LT strategies) but will likely miss nuanced market dynamics. The choice depends on trading goals: AI for tactical, data-rich environments (think satellite pics of parking lots, crop health, or even the apocryphal (??) Niederhoffer cigarette butt story 😊); econometrics for theory-driven, transparent applications. These days, many successful strategies blend both for robustness and adaptability.

The proof of the pudding, regardless of trading model (AI, econometric, Magic 8-ball, whatever), is the ability to consistently generate risk-adjusted returns in live trading. That requires robust out-of-sample back-testing and forward-looking optimization of parameters (aka adaptability).

Deciding on what to do when there is or is not a regime shift staring you in the face is the bitch of all bitches. AI may rely on various clustering algorithms on vol to detect regime change, while time series guys may use cruder techniques such as CUSUM or Chow tests or even Bayesian methods – it doesn’t really matter, as hard is hard and the arbiter will be some combo of P&L, drawdown, tail risk, slippage, and the possible resulting fat thumbs of your boss telling you to get out of positions.

While the debate over model complexity is fascinating, in practice, the bias-variance tradeoff remains central—just more nuanced than the textbook version. Tools like regularization, cross-validation, and ensembling help manage this balance.

As I wrote here https://www.argmin.net/p/probability-is-only-a-game/comment/127390048 “Then again, applying a demanding degree of specificity to the problem of model formulation, estimation, and ultimately optimized decisions for live markets is a fool’s errand (unless you’re Jim Simons 😊). In practice, we fell back on the Herb Simon concept of “satisficing” rather than optimizing.”

Joe Jordan

FWIW, there is reason to believe that kernel ridge regression is a possible strategy for identifying hedges. Inverted Boltzmann is the name that physicists have given to KRR when applied to learning pair potentials between atoms in biophysical systems like proteins and lipids. KRR is great in that context because the number of interaction pairs is much smaller than the number of data points available for fitting. Mutatis mutandis, you could use KRR to find pairs of securities with particular correlations; you would just need an oracle to tell you which of the nearly infinite number of pairs of securities to compare =). You would also suffer from the lack of ergodicity in securities markets, though, so one is always better off with the tried and true method known as "insider trading."

John Quiggin

"First, calling ordinary least squares “AI” is a bit of a stretch."

The use of the term seems pretty arbitrary. For example, stepwise linear least squares regression is not very impressive as a technique, but it meets the textbook definitions of "machine learning," which in turn is the central feature of AI as it's currently defined.
