Defining “generalization” in machine learning systems is tricky. Colloquially, we think of generalization as the ability to change and be flexible, but we define it mathematically in terms of sameness and constancy. The most common definition asks that a machine learning system’s predictions have the same error rate on the data we train it on and the data we test it on.
This reliance on sameness plagues the fields of “domain adaptation” and “distribution shift” in machine learning. We don’t know what it means for a “distribution” to “shift.” We only know what it means for things to stay the same. As Tolstoy warned us, distributions, like families, can be different in a myriad of ways. That means we need to figure out which signal should stay constant when data arbitrarily changes. Prediction error rate might not be the right signal.
Homeostasis might help us understand some of the shortcomings of our theories of distribution shift, as it provides a simple way to define generalization or adaptation. A homeostatic system adapts if it looks the same after a change. By definition, homeostasis is about sameness. We want the value of certain signals to be the same in future environments as in the present.
In the last post, I talked about the body maintaining constant blood calcium concentration. Under higher demand for calcium, the body works to keep the calcium concentration constant. As an engineered example, your home heating system works to keep your living room at a constant temperature despite changes in the outside weather. A system achieves perfect adaptation if a regulated signal maintains constancy when encountering a new environment.
Let me add a little bit of math to show how perfect adaptation might occur. I was trying to hash out the appropriate level of math in the comments with Murat Arcak. As I’m figuring everything out in real time, I still need to lean on some formalism to describe adaptation. However, the mathematics of adaptation and homeostasis looks quantitative but aims to capture qualitative phenomena. I feel like I should be able to describe qualitative phenomena without mathematical notation, but that will have to be an ongoing project. For now, I can cleanly explain adaptation in homeostatic systems with linear algebra. But as I’ve written before, I’m not convinced linear algebra is actually easier than calculus. Bear with me, please.
I’m not sure if this is the best way to make things more accessible, but I’m going to write all of the mathematical formulas out in code blocks rather than using LaTeX. The zeitgeist seems to find code less intimidating than calculus. Your mileage may vary.
Alright, enough metapedagogical digression.
Let’s assume that the homeostatic system is governed by a huge set of variables that change over time. We will collect them in a vector x[t], indexed by time, called the state of the homeostatic system. I will denote the signal we want to regulate as y[t]. This is a single number at every time step t. Let’s further assume that I can summarize the nature of the disturbance at any time step as a number d[t].
The simplest models relating these time series are linear, time-invariant systems (LTI systems). We assume that the next state is a linear function of the last state and the disturbance. This means there is a constant matrix A and vector b such that
x[t+1] = A*x[t] + b*d[t]
We also assume that the dynamics of the homeostatic signal are well approximated by a linear combination of the state variables. This means there is a vector c such that
y[t] = c*x[t]
The system description here gives us computer code to simulate changes in the regulated signal.
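Here is one way that simulation code might look. This is a minimal sketch; the function name and the NumPy conventions are my choices, and nothing here is specific to any particular homeostatic system:

```python
import numpy as np

def simulate_lti(A, b, c, d, x0):
    """Simulate x[t+1] = A*x[t] + b*d[t] with readout y[t] = c*x[t].

    A: state matrix, b: disturbance vector, c: readout vector,
    d: sequence of disturbances, x0: initial state.
    Returns the list of regulated signals y[0], y[1], ...
    """
    x = np.array(x0, dtype=float)
    ys = []
    for dt in d:
        ys.append(float(c @ x))  # read out the regulated signal
        x = A @ x + b * dt       # advance the state
    return ys
```

For example, a one-state system with A = [[0.5]] and a constant unit disturbance produces the sequence 0, 1, 1.5, ... as the state accumulates toward its steady value.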
I can write the blood calcium example from the last post in this form. The state vector consists of the deviation of the calcium concentration from its setpoint, the amount of calcium liberated from the bone, the amount of calcium liberated from the intestine, the amount of parathyroid hormone (PTH), and the amount of activated vitamin D. The regulated signal is the deviation of the calcium concentration from its setpoint. The model is then:
y[t+1] = y[t] + Ca_bone[t] + Ca_intestine[t] - d[t]
Ca_bone[t] = K1*PTH[t]
Ca_intestine[t] = K2*Vitamin_D[t]
Vitamin_D[t+1] = Vitamin_D[t] + K3*PTH[t]
PTH[t] = -K4*y[t]
If you’re into this sort of thing, you can simplify the code to two lines:
y[t+1] = (1-K1*K4)*y[t] + K2*Vitamin_D[t] - d[t]
Vitamin_D[t+1] = Vitamin_D[t] - K3*K4*y[t]
With this simplification, the associated matrix A and vectors b and c are
A = [[1-K1*K4, K2], [-K3*K4, 1]]
b = [[-1],[0]]
c = [1, 0]
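To see perfect adaptation in action, we can simulate this model. The gain values below are made up for illustration, chosen only so that the feedback loop is stable; the post doesn’t pin down the physiological constants:

```python
import numpy as np

# Made-up gains, chosen only so the feedback loop is stable.
K1, K2, K3, K4 = 0.5, 0.5, 0.5, 0.5

A = np.array([[1 - K1*K4, K2 ],
              [-K3*K4,    1.0]])
b = np.array([-1.0, 0.0])
c = np.array([1.0, 0.0])

x = np.zeros(2)      # state: [calcium deviation, vitamin D]
for t in range(200):
    d = 1.0          # a constant increase in calcium demand
    x = A @ x + b * d

# The calcium deviation returns to (nearly) zero even though the
# disturbance never goes away; vitamin D settles at a new level.
print(c @ x)
```

Under a persistent disturbance, the regulated signal settles back to zero while vitamin D settles at a new, elevated level that exactly compensates for the demand. That is perfect adaptation.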
I am belaboring the write-up of this model because something very interesting is happening with the activated vitamin D. Setting K5 = K3*K4, we can write that line explicitly as
Vitamin_D[t+1] = Vitamin_D[t] - K5*y[t]
These dynamics make it look like the vitamin D signal runs gradient descent on the homeostatic variable. Indeed, that is exactly what is happening. A dynamical system
s[t+1] = s[t] + e[t]
is called an integrator. If you start with s[0] = 0, then s[t] is the sum of the first t values of the input, e[0] through e[t-1]. Perhaps it should be called a summer instead, but integrator sounds cooler. The name comes from control engineers’ preference for differential equations over difference equations. They would write s[t] as the integral rather than the sum of e[t]. I try to avoid writing differential equations on this blog.
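Here is a tiny sketch of the summer in code. The decaying input sequence is my choice; the point is that the running sum settles precisely because the increments vanish:

```python
# An integrator is just a running sum: s[t+1] = s[t] + e[t].
e = [1.0 / 2**t for t in range(50)]   # an input that decays to zero
s = 0.0
history = []
for et in e:
    s += et
    history.append(s)

# s settles to a constant because the increments go to zero;
# here the geometric series sums to 2.
print(history[-1])
```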
It doesn’t matter what we call it! What matters is that an integrator can only converge to a steady state if its input goes to zero. We can only have s[infinity+1] equal to s[infinity] if e[infinity] equals 0. Integral control thus suffices for keeping signals constant: if a system has an integrator somewhere inside it, it will always settle at the correct steady state, provided that a steady state exists. The proof is literally the argument I just made: an integrated error can converge to a constant only if the error converges to zero. In the next post, I’ll show that, at least for linear systems, integral control is also necessary for homeostasis.
As with every new argmin series, I love this one on homeostasis!
One rather tangential comment about this point:
"However, the mathematics of adaptation and homeostasis looks quantitative but aims to capture qualitative phenomena. I feel like I should be able to describe qualitative phenomena without mathematical notation "
I think this is a very important point, and I have come to think about it quite a bit in the context of the behavioral/psychological sciences. Why do we need or want quantitative models of learning and behavior? Things like Rescorla-Wagner, Q-learning, etc. Or all sorts of Bayesian accounts of perceptual inference. Or drift-diffusion models, and so on and so on.
Often, at the end of the day, the intuitive, qualitative explanation for how the model works could have been (and/or actually had been) easily described by psychologists decades before the mathematical models. On the other hand, the quantitative details (again, most of the time) only serve as rough approximations/modeling choices. So why do we bother with all the math if what we get in the end is a qualitative explanation that doesn't need the math at all?
My answer (I should say, this was one of the topics I had endless discussions about with my PhD supervisor) is that mathematical/quantitative model building offers a mechanistic "concretization". This helps in identifying the components and implications of the qualitative solution. We cannot use vague words that are too open to interpretation if the model has to be translated into code. The 'mechanistic' aspect is also important because, with some luck, it might help us look for a mechanism/implementation from a neuroscience perspective (a similar argument holds from an 'engineering' perspective, I guess, as in your example: if you actually want to program/build systems that can do this form of regulation, you will have a very hard time without the math, _even though_ the principle itself stays qualitative).
OK, that's probably long enough for a comment that is a digression from the main point! Looking forward to the next posts.
This "gradient descent as integrator" view reminds me of a cute derivation I came across recently. Solving Ax=b using least-squares gradient descent can be written as a harmonic sum which converges to A\b. So different variations of GD are different integrators that converge to A\b at different rates: https://www.dropbox.com/scl/fi/u1c7dy6uh858x6tsc93hc/Screenshot-2024-12-22-at-11.16.24-AM.png?rlkey=kkt5w9eh4042gi3wf1r1tg4az&e=1&dl=0