8 Comments

JP

One comment about fitting random Fourier feature (RFF) models. You can fit them with linear scaling in the number of random features if you use an iterative method, as you noted. You mentioned stochastic gradient descent, but the optimization problem for fitting these models is often ill-conditioned for obvious reasons, and I haven't found stochastic gradient descent to work very well, at least not without extensive tweaking of the learning rate and learning rate schedule, which is time-consuming and kind of annoying. My favorite method for fitting RFF models for regression (and classification, if it makes sense to use LDA) is conjugate gradients with randomized Nyström preconditioning; it works like magic and converges very quickly, and you can implement the randomized Nyström preconditioner using a subsampled randomized Hadamard transform-based procedure that's quite fast. That makes them very practical and fairly easy to use if you do need a large number of random features!
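
For anyone who wants to try this, here is a minimal sketch of the idea, not JP's exact pipeline: solve the RFF ridge normal equations with conjugate gradients, preconditioned by a randomized Nyström approximation of Z^T Z. A Gaussian test matrix stands in for the subsampled randomized Hadamard transform, and the problem sizes, sketch rank, and regularizer are illustrative choices.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, svd
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, d_in, D, lam, rank = 2000, 10, 3000, 1e-3, 200

# Toy data and random Fourier features approximating an RBF kernel.
X = rng.standard_normal((n, d_in))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.standard_normal(n)
W = rng.standard_normal((d_in, D))            # random frequencies
b = rng.uniform(0.0, 2.0 * np.pi, D)          # random phases
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)      # n-by-D feature matrix

# Matrix-vector product with A = Z^T Z + lam * I, never forming A explicitly.
def matvec(v):
    return Z.T @ (Z @ v) + lam * v
A = LinearOperator((D, D), matvec=matvec)

# Randomized Nystrom approximation of Z^T Z from a rank-`rank` sketch.
Q, _ = np.linalg.qr(rng.standard_normal((D, rank)))   # Gaussian test matrix
Y = Z.T @ (Z @ Q)                                     # (Z^T Z) Q via thin products
nu = np.sqrt(D) * np.finfo(float).eps * np.linalg.norm(Y)
Y = Y + nu * Q                                        # small shift for stability
C = cholesky(Q.T @ Y)                                 # upper triangular factor
B = solve_triangular(C, Y.T, trans='T').T             # B = Y C^{-1}
U, s, _ = svd(B, full_matrices=False)
eigs = np.maximum(s**2 - nu, 0.0)                     # approx. eigenvalues of Z^T Z

# Nystrom preconditioner: damp the top of the spectrum, leave the rest alone.
def precond(v):
    Utv = U.T @ v
    return U @ (((eigs[-1] + lam) / (eigs + lam)) * Utv) + (v - U @ Utv)
M = LinearOperator((D, D), matvec=precond)

w, info = cg(A, Z.T @ y, M=M, maxiter=200)
print("CG exit code:", info,
      "| residual norm:", np.linalg.norm(matvec(w) - Z.T @ y))
```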

Ben Recht

Great points and good tips for folks who want to try out RFFs.

Something interesting to me is that if we buy into the double descent mindset and pick somewhat more random features than data points (I'm fine with this), then the random feature matrix has O(N^2) entries. This means that RFFs and the unapproximated kernel regression problem are going to have similar computational complexities.
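
A back-of-the-envelope version of that comparison (the sizes below are purely illustrative): once D is a constant multiple of N, the storage and the dense solve costs of the two approaches differ only by constant factors.

```python
# With D = 2N random features, the feature matrix already has as many
# entries as the N-by-N kernel matrix, and the dense solves are O(N^3)
# in both cases.
N = 10_000
D = 2 * N
kernel_entries = N * N              # K is N x N
feature_entries = N * D             # Z is N x D
kernel_solve_flops = N ** 3         # solve (K + lam I) alpha = y
feature_solve_flops = N * D ** 2    # form and solve the D x D normal equations
print(feature_entries / kernel_entries)          # 2.0
print(feature_solve_flops / kernel_solve_flops)  # 4.0
```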

Ita

Thanks!

One comment: solutions of kernel ridge regression lie in the span of the training vectors, so even if you use a kernel that corresponds to an infinite-dimensional feature space, your solution effectively belongs to a finite-dimensional space whose dimension is bounded by the size of the training set, N.
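
To make that concrete, here is a minimal kernel ridge regression sketch (toy data; the regularizer and bandwidth are illustrative): the RBF kernel's feature space is infinite-dimensional, but the fitted function is pinned down by the N coefficients alpha, one per training point.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 200, 1e-2
X = rng.uniform(-3.0, 3.0, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

def rbf(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

K = rbf(X, X)                                      # N x N kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(N), y)    # the N coefficients

X_test = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
f_test = rbf(X_test, X) @ alpha                    # f(x) = sum_i alpha_i k(x, x_i)
print(f_test)
```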

Ben Recht

Yes! And the same thing is true when you use gradient descent to solve a linear regression with more features than data points.
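
A small sketch of that analogy (sizes and step size are illustrative): gradient descent started from zero on an overparameterized least squares problem only ever adds combinations of the rows of X, so it converges to the minimum-norm interpolating solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                                   # more features than data points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / squared spectral norm of X
for _ in range(2000):
    w -= step * X.T @ (X @ w - y)                # update stays in the row space of X

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # pseudoinverse (min-norm) solution
print(np.linalg.norm(w - w_min_norm))            # ~ 0
```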

rif a saurous

Yikes. They'll never learn.

David Rothman

Quickly (it’s a busy day), as a practitioner, here (again) are the most important things I’ve learned (and while it may or may not make any difference, we never looked to ‘predict’ prices of anything; we looked to model relationships between assets or asset classes):

Broadly, AI-based trading models (high variance) can excel at uncovering complex, non-linear patterns in noisy equity markets, leveraging big data to drive short-term alpha, but they are black boxes (although this appears to be slowly changing) and subject to overfitting. Econometric models (high bias) offer interpretability and stability (possibly/mainly for LT strategies) but will likely miss nuanced market dynamics. The choice depends on trading goals: AI for tactical, data-rich environments (think satellite pics of parking lots, crop health, or even the apocryphal (??) Niederhoffer cigarette butt story 😊); econometrics for theory-driven, transparent applications. These days, many successful strategies blend both for robustness and adaptability.

The proof of the pudding, regardless of trading model (AI, econometric, Magic 8-ball, whatever), is the ability to consistently generate risk-adjusted returns in live trading. That requires robust out-of-sample back-testing and forward-looking optimization of parameters (aka adaptability).

Deciding on what to do when there is or is not a regime shift staring you in the face is the bitch of all bitches. AI may rely on various clustering algorithms on vol to detect regime change, while time series guys may use cruder techniques such as CUSUM or Chow tests or even Bayesian methods – it doesn’t really matter, as hard is hard and the arbiter will be some combo of P&L, drawdown, tail risk, slippage, and the possible resulting fat thumbs of your boss telling you to get out of positions.

While the debate over model complexity is fascinating, in practice, the bias-variance tradeoff remains central—just more nuanced than the textbook version. Tools like regularization, cross-validation, and ensembling help manage this balance.

As I wrote here https://www.argmin.net/p/probability-is-only-a-game/comment/127390048 “Then again, applying a demanding degree of specificity to the problem of model formulation, estimation, and ultimately optimized decisions for live markets is a fool’s errand (unless you’re Jim Simons 😊). In practice, we fell back on the Herb Simon concept of “satisficing” rather than optimizing.”

Joe Jordan

FWIW, there is reason to believe that kernel ridge regression is a possible strategy for identifying hedges. Inverted Boltzmann is the name that physicists have given to KRR when applied to learning pair potentials between atoms in biophysical systems like proteins and lipids. KRR is great in that context because the number of interaction pairs is much smaller than the number of data points available for fitting. Mutatis mutandis, you could use KRR to find pairs of securities with particular correlations; you would just need an oracle to tell you which of the nearly infinite number of pairs of securities to compare =). You would also suffer from the lack of ergodicity in securities markets, though, so one is always better off with the tried and true method known as "insider trading."

John Quiggin

"First, calling ordinary least squares “AI” is a bit of a stretch."

The use of the term seems pretty arbitrary. For example, stepwise linear least squares regression is not very impressive as a technique, but it meets the textbook definitions of "machine learning," which in turn is the central feature of AI as it's currently defined.
