arg min

Engineering Architecture: A Syllabus?

Ben Recht — Thu, 23 Apr 2026 14:20:54 GMT

After spending a week trying to figure out what to say in one lecture, I realized I could teach an entire class on architecture. In fact, in hindsight, that’s what this class should have been! Oh well. I knew this from the outset, but I couldn’t figure out how to stitch a syllabus together last December. Inevitably, I had to work my way through a full semester of cleaning out the skeletons in my learning-for-control closet before I could figure out what I wanted to dive more into. To be clear, this is a success story and a model of how teaching classes ought to go.

Given that I’m already committed to a schedule for next year,[1] a class on architecture will have to wait a bit. But nothing stops me from putting together a syllabus now, right? While preparing this week’s lecture, I assembled an unfortunately long reading list for this hypothetical class. Synthesizing this material promises a very interesting story. Let me explain how I’m thinking about its contents.

Though architectural theory is figured out on the fly, adapting existing systems to manage newfound complexity, there are repeated patterns that we can extract from our contemporary human-cyber-physical infrastructure. The architecture class would attempt to synthesize the design principles needed for enabling diversity and error handling. The paper I referenced yesterday by Matni, Ames, and Doyle takes a stab at this sort of view, but I want to look beyond robotics. I’d want to cover as diverse a set of applications as possible while still maintaining some degree of cohesion.

I’d probably start with computing systems. You’ll get different answers about what is needed to build good architectures in your operating systems, programming languages, and networking classes. And maybe you should. I’ll keep repeating myself: I’m not convinced that there’s a “universal theory” of architecture. However, that doesn’t mean we can’t move up a layer of abstraction and draw the common threads together. What are the shared patterns in hardware, software, and network design? I’m particularly interested in studying the 75-year development of software from manually mapping bits on registers to the complex high-level languages of today. There are a lot of interesting theories on modularity, abstraction boundaries, and protocol design, and those should be thrown into the mix.

I’d also extend downward into the physical layer, adding a “cyberphysical” systems view that connects to larger systems like the power grid or transportation network (good references on these two topics are currently missing from the bibliography). I’d spend time on the history of architectures in robotics and control, where we have settled on a platform, arguably through the discipline-wide adoption of the Robot Operating System. The principles were there in the Apollo project: a separation between low-level control; sensing, navigation, and mid-level feedback; and high-level planning. There have been other proposed architectures, like Brooks’ subsumption architecture, that didn’t gain much traction beyond the Roomba. There is something inescapable about the standard three-level architecture, and I want to unpack more about this diversity-enabled sweet spot.

I would like to examine some architectural theories in systems biology, especially those of Gerhart and Kirschner. We’d have to at least read some parts of The Plausibility of Life, mostly because it’s really good. I also think that we do learn a lot by reflecting our technology onto biological systems. I’m sure we’ll find interesting examples and insights by seeing how others have done this.

I also want to look at how we engineer architectures for organizing people. The complexity of the corporation and the computer grew symbiotically, and there are clear influences of human organizational behavior on information technology. There’s clearly a co-evolution of computing architectures with human architectures. Herb Simon, who has had as much influence on management science as computer science, would be a key figure here.

And since I can’t pass up an opportunity to dig into more Cold War technological history, we’d look at some classic theorizing about complex systems. This class would not be an aughties-era complex systems class. But I’d like to find the point in time before the network science people split off from the cyberneticists. So we’d go back and read Wiener and Weaver, Ashby and Simon, and look at what they got right and what they missed.

The bibliography needs both growth and pruning. But I have plenty of time to get it in order. Help me flesh it out. What topics and references would you add? What other books on architecture, organization, protocols, and design should I throw on there? I’d love to see your suggestions in the comments.


[1] I’ll teach “Forecasting: WTF?” in the Fall and probability in the Spring.

Freedom From Choice

Ben Recht — Wed, 22 Apr 2026 14:10:02 GMT

This is a live blog of Lecture 10 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

Though the “theory” of computer science is most associated with algorithms and complexity, by far the most impactful theories all stem from architecture. Computer architecture, software architecture, network architecture. Architectural theory in computer science is seldom packaged in clean theorems, but there are implicit and explicit design principles that recur across dozens of abstraction layers.

Computing hardware, software, and network design all share key architectural concepts, but our courses don’t often cleanly connect the architectural dots across the application domains. In computer science, all of these different theories of architecture focus on designing hierarchical systems to support diversity and robustness. They all use similar building blocks, namely abstraction boundaries, layered hierarchies, and protocols for cross-layer communication. These protocols are all constraints that deconstrain.

In class, I walked through a few examples, though I had to entirely gloss over all the details. The result was my most Santa Fe Institute slide deck ever, an endless scroll of ugly graphs of networks. I started with the internet, which has the clearest declarative design of all the architectures. Here’s a glimpse from Berkeley’s CS 168:

The internet enables diverse applications to run on diverse networks. It does so by enforcing seven layers of protocols. All of these protocols flow through the “narrow waist” of the Internet Protocol (IP), the jewel “constraint that deconstrains.” Since every application has to flow through a single protocol, you can have incredibly diverse physical networking below and incredibly diverse applications on top. The protocols fan out above and below IP to support the diverse goals. Notably, the transport layer supports TCP, which lets applications know if their packets arrived, and UDP, which doesn’t. The internet is designed for robustness by having a strict protocol list, but pushing all of the processing and thinking about those protocols to the edge.
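As a toy illustration of the narrow waist, here is a minimal Python sketch. Every name in it (`Packet`, `ip_send`, `ethernet`, `wifi`, and the two "applications") is hypothetical and invented for this example; it is not real networking code, just a picture of how one shared contract lets both sides vary freely.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Packet:
    """The one format every layer agrees on (a stand-in for an IP datagram)."""
    src: str
    dst: str
    payload: bytes


# Below the waist: any physical network works if it can carry a Packet.
Link = Callable[[Packet], Packet]


def ethernet(p: Packet) -> Packet:
    return p  # pretend this crossed a wire


def wifi(p: Packet) -> Packet:
    return p  # pretend this crossed the air


def ip_send(link: Link, src: str, dst: str, payload: bytes) -> Packet:
    """The waist: applications only call this, links only see Packets."""
    return link(Packet(src, dst, payload))


# Above the waist: any application works if it can produce bytes.
def web_request(path: str) -> bytes:
    return f"GET {path}".encode()


def chat_message(text: str) -> bytes:
    return text.encode()


links: Dict[str, Link] = {"ethernet": ethernet, "wifi": wifi}
apps: Dict[str, bytes] = {
    "web": web_request("/index.html"),
    "chat": chat_message("hello"),
}

# Every application runs over every link, and neither side knows the other.
delivered = {
    (app, name): ip_send(link, "10.0.0.1", "10.0.0.2", data).payload
    for app, data in apps.items()
    for name, link in links.items()
}
```

Adding a third link or a third application means writing one new function; nothing on the other side of `ip_send` has to change.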

I also briefly discussed software, operating systems, and hardware architectures in computer science. These systems are physically more localized and have different design constraints. Their main goal is to enable local physical scale so that computers can support fast, general-purpose software. As computers became faster, more complex, and more reliable, their design became more layered and hierarchical. Here’s an image of the timeline Alberto Sangiovanni-Vincentelli shared with me:

Rather than trying to design a computer chip from transistors, design cycles accelerate by letting engineers work at higher and higher levels of abstraction. Layered design now comes in to simplify choices. Alberto and Edward Lee like to echo DEVO: “Freedom from choice is what you want!” By establishing clean abstraction layers, engineers can innovate at each layer without worrying about what happens above and below.

Now, this is the point in the lecture where I split with John Doyle. John likes to use layered architectures to understand biology. Yes, you can look at biology and see architecture. Indeed, the “constraints that deconstrain” terminology was coined by systems biologists. Marc Kirschner and John Gerhart use the notions of constraints and deconstraints to describe how common platforms in biology facilitate agile evolution into diverse phenotypes and species. Because the platform is conserved, evolution can make rapid changes that wouldn’t be predicted by simple, uniformly random variation.

However, I always find that people project technology onto biology to organize and understand biological function. In the 1600s, the body was a bunch of clocks. In the 1800s, it was an engine. Now we think of it as a computer. I’m not saying these projections of technology onto biology aren’t useful, but I don’t think that we necessarily learn more about technology from seeking common patterns in biology. Indeed, I’d rather look at recurring patterns in artificial structures to identify commonalities and general principles.

So instead of looking to biology, let’s look to management. Because man, every computer architecture diagram looks like an industrial org chart. This is not accidental. They serve similar functions. Computing and the mega-organization grew symbiotically in the post-war period, and building complex computing infrastructure required complex organizations of people. Some individuals certainly made brilliant, important advances at isolated nodes of these networks. However, the genius of layered architectures is that they admit a diversity of narrow innovations at every layer that locally grow the architectural ruleset without disrupting what everyone else is doing. In organizations, we have specific reporting and evaluation protocols, rules for bonuses and promotions, and schemes for supporting diverse business goals. The organizational architecture serves functions similar to those of computer architecture.

In “Toward a Theory of Control Architecture,” which I’ll discuss more in the next post, Nik Matni, Aaron Ames, and John Doyle set the stage with the Apollo project’s architecture, which bears a striking resemblance to today’s standard robotic architectures.

You have low-level controllers at one layer, a synthesis of sensors and trajectory optimization in the middle, and a high-level planner at the top. Part of this is because the abstraction makes it easier to reason locally about mitigating the complexity of launching people to a cold, barren, airless moon. However, such complexity also required massive teams of people. Here’s a small part of the organization of the Apollo Spacecraft Project Office.

A theory of architecture can’t neglect a theory of human organization. Both artificial structures work together to create the complex infrastructure underneath our contemporary condition.


Walk the Marble Malls

Ben Recht — Mon, 20 Apr 2026 14:37:24 GMT

This is a live blog of Lecture 10 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

John Doyle, he of robust control infamy, has been a close friend and mentor of mine for over two decades. And for the entirety of those two decades, he’s been yelling about the need for a unified theory of engineering architecture. If you know John, you’ve likely heard the same rants and seen his psychedelic PowerPoint slides with bowties, hourglasses, and bonobos. He’ll show you a picture of the internet stack and a picture of the organization of bacterial metabolism, and expect you to see that they are the same thing.

Now, I spent a good chunk of the last week trying to figure out how to pack John’s grand unified theory into a single lecture. Just like with simulation, I failed. However, I came out much more convinced that John is onto something than I went in. Let me do my best to explain why without sounding like a complex-systems crazy person.

In this class, and in control research more generally, we keep getting trapped by optimization. It’s always easiest to frame problems in terms of optimization problems, and then worry about the particulars of the solution method. However, optimal control is far too limited and brittle for any practical application. It is more often than not a simulation steering problem: we build a simulator at some abstraction level, and then shape a policy by designing an appropriate set of costs and constraints. The framework forces us to operate at a particular abstraction layer, which means we end up at a specific point on the action-impact trade-off curve. We can’t design for hidden states, and we’re sensitive to modeling errors. It forces us to model unknowns as average-case or worst-case disturbances. Optimization is specific and rigid, whereas control systems need to be diverse and flexible. What’s the right way to think about diversity and flexibility?

Fortunately, we have a lot of existence proofs to learn from when trying to answer that question. The world is run on complex, engineered feedback systems with astounding robustness, diversity, and flexibility. We transmit thousands of trillions of bits to each other every second on the internet. We maintain electricity for billions of people. We can have any product we can imagine delivered to our doorstep. We can get from our house to almost any point on earth in a couple of days. We carry around supercomputers in our pockets so we can watch vertical videos whenever we’re bored. We live in a world of engineering miracles that are more robust than any LQR system. So how in the world do they work?

John Doyle is not alone in arguing that the answer is architecture, a set of organizing principles for engineering design. Architecture is the rules and protocols for assembling components to enable diversity in system execution. You want systems that can accommodate a diversity of objectives: balancing speed, accuracy, and impact. You’d like to be able to solve a diversity of tasks. You’d like to accommodate a diversity of end users. Diversity is a particular kind of robustness. It’s robustness to intent. And you have to design for it.

Looking at the last hundred years of engineering, you definitely see repeated patterns in architectural design. First, the need for hierarchy to handle complexity isn’t surprising. Herb Simon was already on this in the sixties. But graph structure and emergulence are not enough.

Feedback is essential. As we’ve seen throughout the class, you can take two systems with dramatically different behaviors and put together a system with “best of both worlds” through feedback. A powerful amplifier and a precise attenuator combine in feedback to make a powerful, precise amplifier. A language machine that can recapitulate all software in feedback, using a simple agent harness with iterative exception handling, lets you build and manage complex software packages. Architecture lets you scale this feedback design principle.
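The amplifier claim can be checked with one line of algebra: an amplifier with gain A wrapped in negative feedback through an attenuator β has closed-loop gain A / (1 + Aβ), which is close to 1/β whenever Aβ is large. Here is a small sketch with made-up numbers (the particular values of A and β are assumptions chosen for illustration).

```python
def closed_loop_gain(A: float, beta: float) -> float:
    """Gain of an amplifier A in negative feedback through an attenuator beta."""
    return A / (1.0 + A * beta)


beta = 0.01  # precise attenuator, so the target gain is 1 / beta = 100

# The raw amplifier gain is powerful but sloppy: it varies by a factor of four.
gains = {A: closed_loop_gain(A, beta) for A in (5e4, 1e5, 2e5)}

# Relative spread of the closed-loop gain across those amplifiers.
spread = (max(gains.values()) - min(gains.values())) / (1.0 / beta)

for A, g in gains.items():
    print(f"open-loop gain {A:.0e} -> closed-loop gain {g:.3f}")
print(f"relative spread: {spread:.4%}")
```

The open-loop gain changes by a factor of four while the closed-loop gain moves by a fraction of a percent: the precision comes from the attenuator, the power from the amplifier.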

The key to scalable architectures is protocols. Computer science — the engineering discipline of scaling logical systems — is obsessed with architectural protocols. To build a complex system like the internet or the modern computer, you build a hierarchy of abstraction boundaries. You design interfaces to talk across these boundaries with clean, well-specified protocols. The protocols let each system operate with a particular set of agreements about what the other will do. When you stack these protocols together, you get ridiculously impressive diversity. From this design strategy, you can build out the internet, the integrated circuit, the cell, and the control system. The internet serves arbitrary applications on arbitrary physical layers, all funneled through a set of contracts with IP in the middle. We can design a diversity of complex computer chips from standard cells and data flow models. In each of these cases, each layer only speaks to its neighbors in a structured hierarchy.

Finally, there is a central concept that seems to drive architectural design: constraints that deconstrain. They are restrictions on what we can do at one point of the hierarchy that end up enabling diversity at another. Constraints that deconstrain were proposed by Marc W. Kirschner and John Gerhart as a facilitator of evolution. Systems engineers like Alberto Sangiovanni-Vincentelli and Edward Lee emphasize how they provide a Devo-esque “freedom from choice,” removing the paradox of choice and enabling more efficient design cycles. I’ll connect this to similar architectural theories of computer scientists and electrical engineers.

Engineering architecture is far too much to cram into a single lecture, but I’ll give a brief introduction to the ideas this week on the blog. I’ve come to the conclusion that there’s a whole semester’s class to be taught here. How better to end a class than by describing what the next class looks like?


Structured Uncertainties

Ben Recht — Fri, 17 Apr 2026 16:59:40 GMT

This is a live blog of Lecture 9 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

The problem with a fast-paced course is that I keep hitting topics I want to dig into but am forced to move on. Simulation is fascinating, and I need to spend more time with its history and nuance. I guess that just means I’m going to add it to the syllabus of my next graduate course.[1]

But I wanted to post about one fun thing I learned this week that, while not directly related to simulation, does seem to have some broader lessons. I came to better appreciate John Doyle’s structured singular value, often called by its Greek name “𝜇” or “mu.” The lessons it teaches about interconnection and uncertainty, though perhaps not always computable, are quite general and important.

Let’s go back to the recurring simple feedback loop.

Here, C is the controller we’re designing, P is the plant we’re trying to steer. I’ve added a new block, Δ, to represent some uncertain system in our feedback loop. When Δ equals zero, we have our nominal system model. The structured singular value asks the following question: if I have a design that works without uncertainty, how much margin for error do I have? How large can Δ be before the system goes unstable?

A standard problem in linear control models Δ as a simple scalar. In that case, because of how the equations work out, the uncertainty question is about the gain margin of the linear system. How much can you amplify or attenuate the plant before the closed-loop system goes unstable? Asked another way, how well do you need to know the amplification factor to guarantee stable operation? Or, let’s say you have a bunch of plants that all have reasonably similar dynamics but different gains. Is your controller good enough for all of them? The uncertainty could model the mass of a flying vehicle or the insulin sensitivity of a person with diabetes. Steady-state control of both of these systems relies on some robustness to uncertainty.

One of the classic results about the linear quadratic regulator is that its stability is maintained for any Δ between -½ and infinity. That’s a good gain margin! Many other control design techniques from classical control using Nyquist plots can also guarantee large gain margins for single-input, single-output systems.
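For a scalar plant the gain margin can be verified by hand, and the following sketch does so numerically. It assumes the continuous-time model dx/dt = a x + b u with cost ∫ (q x² + r u²) dt, and it models the uncertainty as a multiplicative factor (1 + Δ) on the loop gain; the specific plants tried are arbitrary choices.

```python
import math


def scalar_lqr_gain(a: float, b: float, q: float, r: float) -> float:
    """LQR gain k (control law u = -k x) for dx/dt = a x + b u."""
    # Positive root of the scalar Riccati equation 2 a p - (b p)^2 / r + q = 0.
    p = r * (a + math.sqrt(a * a + q * b * b / r)) / (b * b)
    return b * p / r


def closed_loop_pole(a: float, b: float, k: float, delta: float) -> float:
    """Closed-loop pole when the loop gain is scaled by (1 + delta)."""
    return a - (1.0 + delta) * b * k


deltas = (-0.49, 0.0, 1.0, 1e3)
results = {}
for a, q in [(1.0, 1.0), (2.0, 0.01), (-1.0, 1.0)]:
    k = scalar_lqr_gain(a, b=1.0, q=q, r=1.0)
    poles = [closed_loop_pole(a, 1.0, k, d) for d in deltas]
    results[(a, q)] = all(p < 0 for p in poles)  # stable iff the pole is negative
    print(f"a={a}, q={q}: stable for all tested deltas -> {results[(a, q)]}")
```

In the scalar case the reason is visible in the algebra: k = (a + s)/b with s = sqrt(a² + q b²/r) > |a|, so the pole a − (1 + Δ)(a + s) stays negative for every Δ ≥ −½.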

However, the problem becomes a lot trickier when you need to control a plant with many inputs and outputs. Most control systems are networks of interconnected feedback loops, not just simple single-input, single-output systems. In my favorite control system, the espresso machine, you might have a PID controller for water temperature and another for water pressure. You could calibrate these by tuning each PID parameter, one at a time. But obviously, these two loops interact with each other. They also interact with your grind and your tamping.

An industrial process, a chemical plant, or a robot has a networked control system of far greater complexity. You might be able to write out performance guarantees for each loop in the system, and that might look fine on its face. But if these loops are coupled, your margin calculations might be misleading.

To see why, we can look at a static example, like we did with the feedback amplifier. Imagine we’re trying to get a plant to track a constant reference signal. The controller compares the reference signal with the plant’s output and applies a new input if the difference is large. This signal sets a different set point for each loop. We can compute the steady state of our system by looking at a matrix equation. Indeed, slightly abusing notation, we can think of the steady-state maps C, P, and Δ as matrices (these are the DC responses of each system). The map from the error signal input of the controller, e, to the output of the uncertainty, y, is a system of equations:

y = (I + Δ) P C e

Using the fact that the error is the difference between the reference and the output, e = r - y, we can compute the steady-state output as a function of the reference signal. It’s not the prettiest formula, but you can write it out in closed form and stare at it:

y = (I + (I + Δ) P C)^{-1} (I + Δ) P C r

If the open-loop map from the controller input to plant output — the matrix (I+Δ) PC — is sufficiently large, the output of the plant will be approximately equal to the reference input. However, there’s a catch. We need to know that matrix never has an eigenvalue of -1 for any instantiation of the uncertainty. If it does, then the inverse in the above matrix expression isn’t defined, and the expression blows up in unpleasant ways. We’d say the closed-loop system was unstable.
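Here is a small numerical sketch of that steady-state calculation. It solves y = (I + Δ) P C (r − y) for y, which follows from the open-loop map (I + Δ) P C and e = r − y. The plant matrix, reference, and controller gains are made-up values for illustration.

```python
import numpy as np


def steady_state_output(P, C, Delta, r):
    """Solve y = (I + Delta) P C (r - y) for the steady-state output y."""
    n = len(r)
    L = (np.eye(n) + Delta) @ P @ C          # open-loop DC map
    # (I + L) y = L r; this fails exactly when L has an eigenvalue of -1.
    return np.linalg.solve(np.eye(n) + L, L @ r)


P = np.array([[1.0, 0.2],
              [0.1, 1.0]])                   # hypothetical plant DC gain
r = np.array([1.0, -2.0])                    # constant reference
Delta = np.zeros((2, 2))                     # nominal model, no uncertainty

errors = []
for gain in (1.0, 10.0, 1000.0):
    y = steady_state_output(P, gain * np.eye(2), Delta, r)
    errors.append(float(np.linalg.norm(y - r)))
    print(f"controller gain {gain:7.1f}: tracking error {errors[-1]:.4f}")
```

As the loop gain grows, the output converges to the reference, which is the "sufficiently large" claim in the text.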

Hence, we can capture a notion of multivariate robustness by finding the smallest perturbation that makes that matrix singular. The tricky part is that you get different answers based on what sorts of uncertainties you believe are plausible.

Consider the classic gain margin question. For simplicity, define the matrix

T = P C (I + P C)^{-1}

When Δ is a scalar, you are just looking for the smallest number such that

det(I + Δ T) = 0

This number is precisely equal to the inverse of the magnitude of the largest eigenvalue of T. By contrast, if you think that you can have uncertainty that couples channels of your system together, the size of the uncertainty you can handle is much smaller. Indeed, you can check that if you allow for the uncertainty to be an arbitrary matrix, the norm of the uncertainty has to be smaller than the inverse of the magnitude of the largest singular value of T.

The largest singular value is always at least as large as the magnitude of the largest eigenvalue. Sometimes, it can be much larger. For instance if

Then the eigenvalues of T are ½, and the maximum singular value of T is approximately 250. If the uncertainties were just a multiple of the identity, the system would appear very robust to perturbations, handling disturbances with gains up to a magnitude of 2. However, if general matrix uncertainty were allowed, you could only handle disturbances with gains of magnitude at most 0.004.
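These numbers are easy to reproduce. The matrix below is an assumption: it is one simple upper-triangular matrix whose eigenvalues are both ½ and whose largest singular value is about 250, and it may not be the matrix from the lecture. The sketch also constructs an explicit full-matrix Δ of norm about 0.004 that makes I + ΔT singular.

```python
import numpy as np

# Assumed example: consistent with the numbers in the text, but hypothetical.
T = np.array([[0.5, 250.0],
              [0.0, 0.5]])

spectral_radius = float(max(abs(np.linalg.eigvals(T))))
U, s, Vt = np.linalg.svd(T)
sigma_max = float(s[0])

scalar_margin = 1.0 / spectral_radius   # margin if Delta is a multiple of I
full_margin = 1.0 / sigma_max           # margin if Delta is any matrix

# Smallest destabilizing full-matrix perturbation: a rank-one matrix built
# from the top singular vectors, so that (I + Delta T) v1 = 0.
Delta = -np.outer(Vt[0], U[:, 0]) / sigma_max

print("spectral radius:", spectral_radius)
print("largest singular value:", sigma_max)
print("norm of Delta:", np.linalg.norm(Delta, 2))
print("det(I + Delta T):", np.linalg.det(np.eye(2) + Delta @ T))
```

The destabilizing Δ is almost entirely an off-diagonal entry: it couples the second channel back into the first, which is exactly the cross-channel uncertainty a scalar gain margin never tests.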

The structured singular value lets you figure out what this magnitude is for whatever plausible model of uncertainty you can construct. Maybe only a subset of the loops is coupled. Maybe Δ has block structure. Each structure gives you a different number in between the spectral radius and the norm of your system’s complementary sensitivity, T. The structured singular value generalizes beyond this simple matrix example to general linear systems. It lets you compute bounds even when the uncertain blocks are themselves structured dynamical systems.
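To see one intermediate structure, suppose Δ is diagonal, so each channel has its own independent uncertainty but there is no cross-coupling. A standard upper bound on the structured singular value in this case is the smallest value of the norm of D T D^{-1} over positive diagonal scalings D, because a diagonal Δ commutes with D. The sketch below evaluates that bound on a coarse grid. The matrix T is a hypothetical example (eigenvalues ½, largest singular value about 250), and this is only a bound computed by brute force, not a general-purpose μ solver.

```python
import numpy as np

# Hypothetical example matrix; not taken from the lecture.
T = np.array([[0.5, 250.0],
              [0.0, 0.5]])


def scaled_norm(d: float) -> float:
    """Largest singular value of D T D^{-1} with D = diag(d, 1)."""
    D = np.diag([d, 1.0])
    return float(np.linalg.norm(D @ T @ np.linalg.inv(D), 2))


# det(I + Delta T) = det(I + Delta D T D^{-1}) for diagonal Delta, so every
# scaling d certifies that perturbations smaller than 1 / scaled_norm(d) are safe.
scalings = np.logspace(0, -6, 7)          # d = 1, 0.1, ..., 1e-6
bounds = [scaled_norm(d) for d in scalings]
mu_upper = min(bounds)

spectral_radius = float(max(abs(np.linalg.eigvals(T))))
print("unscaled norm (full-matrix uncertainty):", bounds[0])
print("best scaled bound (diagonal uncertainty):", mu_upper)
print("spectral radius (scalar uncertainty):", spectral_radius)
```

For this particular T, the diagonal-structure bound collapses almost all the way down to the spectral radius. Independent per-channel errors look as benign as a scalar gain error, and only uncertainty that couples the channels exposes the small margin.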

If you design mission-critical linear feedback systems, you should learn all of the details. For everyone else who is stuck with nonlinear systems, there are still lessons to take away. Nonlinearity seldom makes our lives easier! If a robustness problem presents itself when we look at simple linear instances, we shouldn’t just hope that it’s not there on hard nonlinear ones. This is one of the reasons that in our post-math age of YOLO scaling, it’s useful to learn a little bit of math to be a little bit chastened. Though I suppose if you do it that way, you’ll never make a dime.


[1] It is indeed already on there, my friends.

Reticulating Splines

Ben Recht — Wed, 15 Apr 2026 14:12:27 GMT

This is a live blog of Lecture 9 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

As I mentioned Monday, one of the big paradigms in modern robotics and control is the “sim2real” pipeline. People invest in complex computer simulators to test their robotic policies. The simulators have detailed dynamic and kinematic models of the robot and how it moves in contact with varied terrain and obstacles. The hope is that by burning through infinite GPU credits to troubleshoot every possibility in simulation, they can deploy code to their actual robot and need no troubleshooting once it’s unleashed in the real world.

While the young folks like to make this paradigm sound like a novel research program, all of optimal control rests on the sim2real pipeline. Think about the core problem of optimal control: the linear quadratic regulator. This problem looks for a control sequence that minimizes a quadratic cost subject to the world evolving according to a linear dynamical system. Control theorists banged their heads against this problem for decades, and we are now taught the beautiful dynamic programming derivations that reduce this problem to solving a compact equation. However, we can also solve it using gradient descent. The gradient computation amounts to simulating the system with the current control policy, computing the sensitivity of the cost trajectory to each control decision, and then adding this information up to compute the gradient.

The lovely thing about gradient descent is that it gives you a solution technique for general optimal control problems with nonquadratic costs or nonlinear dynamics. You evaluate your policy under the current control, run a dynamical system backward in time to compute how sensitive the trajectory was to your control decisions, and then add up the contributions of each time point to get the full gradient. Arthur Bryson invented this method to compute gradients of general optimal control problems in 1962. Today, we call his algorithm backpropagation. This simulation-based gradient method provides incremental improvement of policies for any differentiable dynamical model and any differentiable cost function.
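The forward-simulate, backward-sweep recipe described above can be sketched in a few lines of NumPy for the linear quadratic case. This is a minimal illustration, not a reconstruction of Bryson's original formulation: the double-integrator model, the cost weights, the horizon, and the step size are all assumptions, and the decision variable is an open-loop control sequence rather than a feedback policy.

```python
import numpy as np


def rollout(A, B, x0, U):
    """Simulate x[t+1] = A x[t] + B u[t]; returns the states x[0..N]."""
    X = [x0]
    for u in U:
        X.append(A @ X[-1] + B @ u)
    return np.array(X)


def cost(A, B, Q, R, x0, U):
    """Sum of x'Qx over all states plus u'Ru over all controls."""
    X = rollout(A, B, x0, U)
    return float(sum(x @ Q @ x for x in X) + sum(u @ R @ u for u in U))


def gradient(A, B, Q, R, x0, U):
    """One forward simulation, then one backward sweep of sensitivities."""
    X = rollout(A, B, x0, U)
    lam = 2 * Q @ X[-1]                     # d(cost)/d(final state)
    G = np.zeros_like(U)
    for t in reversed(range(len(U))):
        G[t] = 2 * R @ U[t] + B.T @ lam     # d(cost)/d u[t]
        lam = 2 * Q @ X[t] + A.T @ lam      # d(cost)/d x[t]
    return G


A = np.array([[1.0, 0.1], [0.0, 1.0]])      # hypothetical double integrator
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), 0.1 * np.eye(1)
x0 = np.array([1.0, 0.0])
U = np.zeros((20, 1))

costs = [cost(A, B, Q, R, x0, U)]
for _ in range(200):
    U = U - 0.05 * gradient(A, B, Q, R, x0, U)   # plain gradient descent
    costs.append(cost(A, B, Q, R, x0, U))

print("initial cost:", costs[0], "final cost:", costs[-1])
```

The backward sweep costs about as much as one extra simulation, regardless of how many control decisions there are. Swapping in nonlinear dynamics only changes `A` and `B` into the Jacobians of the simulator along the trajectory.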

Now, if your simulation isn’t differentiable, maybe you’ll use a different sort of policy optimization method to solve your optimal control problem. However, reinforcement learning for robotics is still optimal control. RL for robotics minimizes a designed cost function subject to dynamics. The modern departure is that no one bothers to write down the equations of motion anymore. They just assume the simulator will compute them.

This belief pushes a lot of work onto the simulator. GPU cycles are sadly neither free nor abundant. It would be nice to minimize the simulation time and cost required to find a good control policy. It would be particularly nice because many people would like to have a simulator on board the actual robot to compute policies with methods like model predictive control. This raises the question of how accurate your simulation needs to be.

Unfortunately, no one knows. We all think that if you can act quickly enough with enough control authority, then a really simple model should work. But it’s impossible to quantify “enough” in that sentence. You have to try things out because dynamical processes are always surprising.

While it feels like increasing the fidelity of a simulator to the minute details of physical law always improves performance, this is not remotely the case. In class on Monday, Spencer Schutz presented a paper on autonomous driving showing a simple, inaccurate kinematic model with a low sampling rate performed just as well as a more accurate dynamic model. Anyone who’s spent time with dynamic models knows that very high-dimensional complex systems often look simple when you have limited controllability and observability. This is the basis of thermodynamics, where infinitely many bodies colliding collectively produce fairly boring dissipative behavior. Many complex-looking circuits have the input-output behavior of resistors.

On the other side of the coin, safe execution demands identification of subtle aspects of input-output relationships. You can have two dynamical systems with nearly identical behavior perform completely differently once in a closed loop circuit. You can also have systems with completely different behavior look the same in closed loop. I worked through a few examples of this phenomenon in a blog post a couple of years ago. Your model needs to be perfect in exactly the right places. But it’s usually impossible to know those places in advance.

To make matters worse, you can’t really identify the parameters of a robot in open loop. An expensive robot is always going to be running with its low-level controllers on, both for its safety and yours. The actual parameters of closed-loop systems can’t be identified.[1] So you’re stuck with guesses in your simulator, and you have to hope that your plausible parameters are good enough for your sim2real application.

The most popular solution to this identification problem is domain randomization. Since you can only find a range of parameters that describe reality, you build a control policy that works for randomly sampled parameters. By constantly sampling different parameters in each run, you build a policy that performs well on average across all possible parameters.

Finding controllers that work for the average model isn’t new. Indeed, this is just a variant of optimal control called dual control, which has seen bursts of interest since the 1960s. Dual control is literally the problem of minimizing an expected control performance over a distribution of parameters. Like dual control, domain randomization needs a good prior model for how the environment “samples” parameters. But you can also just YOLO and hope that as long as you include all the edge cases, you’ll never crash. That’s the machine learning mindset, after all.

But what does it mean to sample the coefficient of friction of a surface? What’s the right distribution of coefficients of friction? This is again a tricky question.

One approach to modeling the distribution of parameters is to add an element of adversarial behavior to the system. We can adapt the simulations to find hard parameter settings and train more on those. We can have the simulator learn to trip up the robot. Rather than minimizing expected cost, we are working to minimize a worst-case cost, where the supremum is over a set of parameters or disturbances. The dual control people were really into this sort of minimax robustness in the 60s. But practice in aerospace applications ultimately pushed the community to robust control.
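The two objectives discussed above, average cost over randomly sampled parameters and worst-case cost over a set of parameters, can be compared on a toy problem. Everything in this sketch is an assumption made for illustration: the scalar plant x[t+1] = a x[t] + b u[t], the uniform prior on the unknown gain b, the cost weights, and the brute-force grid search over a single feedback gain k. The maximum over samples is only a stand-in for a true supremum over the parameter set.

```python
import numpy as np


def rollout_cost(k: float, b: float, a: float = 1.1, horizon: int = 50) -> float:
    """Simulated cost of u = -k x on the scalar plant x[t+1] = a x[t] + b u[t]."""
    x, total = 1.0, 0.0
    for _ in range(horizon):
        u = -k * x
        total += x * x + 0.1 * u * u
        x = a * x + b * u
    return total


rng = np.random.default_rng(0)
b_samples = rng.uniform(0.5, 1.5, size=200)   # assumed prior on the unknown gain
gains = np.linspace(0.0, 2.0, 81)             # candidate feedback gains

# One simulation per (gain, parameter) pair.
costs = np.array([[rollout_cost(k, b) for b in b_samples] for k in gains])

avg_cost = costs.mean(axis=1)                 # average over sampled parameters
worst_cost = costs.max(axis=1)                # worst sampled parameter

k_avg = float(gains[np.argmin(avg_cost)])
k_worst = float(gains[np.argmin(worst_cost)])

print("gain minimizing average cost:   ", k_avg)
print("gain minimizing worst-case cost:", k_worst)
```

Both answers depend entirely on the interval chosen for b. Widen it, or shift its mass, and the selected gains move, which is the sense in which the choice of distribution is doing real work.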

But people hate robust control because it gives them conservative policies. Computer scientists love to hack and ship. Look how productive they’ve been! You only need to write a few tests and make sure your simulator passes those. No bugs detected, LGTM! What could go wrong, right?

Is that last paragraph about coding agents? It might be.

But regardless, robust control pointed out that unmodeled uncertainties are everywhere, and they can be out there to bite you if you’re not careful. Throughout the field’s history, robust control advocates have been haranguing people about the limits of simulators. They note a couple of significant problems: first, training on a simulator often means fitting to quirks of the simulator that don’t appear in the real world. This is a major danger, even in linear systems. Second, many apparent parametric robustness properties of optimal controllers break down under scrutiny.

In class, I introduced the structured singular value to motivate this issue. The structured singular value showed that when you had a system with many inputs and outputs, and you only considered independent perturbations, you’d convince yourself that a system was stable when it was not remotely stable. Guaranteeing stable behavior required understanding the dependencies between different errors. But how you test stability in simulation is not clear.
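Here is a small sketch of how one-at-a-time testing can mislead. It is in the spirit of a classic robust control textbook example, not necessarily the one used in class. The closed-loop matrix, the coupling constant, and the function names `closed_loop` and `is_stable` are assumptions chosen for illustration. The two feedback channels have gains k1 and k2, with nominal values k1 = k2 = 1.

```python
# 2x2 closed-loop example: stable under one-at-a-time gain changes,
# unstable under a small simultaneous change. Numbers are illustrative assumptions.
A_COUPLING = 10.0

def closed_loop(k1, k2, a=A_COUPLING):
    # Closed-loop matrix when the two feedback channels have gains k1 and k2.
    return [[-k1, a * (1 - k1)],
            [-a * (1 - k2), -k2]]

def is_stable(M):
    # A real 2x2 matrix has both eigenvalues in the open left half-plane
    # exactly when its trace is negative and its determinant is positive.
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return trace < 0 and det > 0

# Perturb one channel at a time: every gain tested here keeps the loop stable.
one_at_a_time = all(
    is_stable(closed_loop(k, 1.0)) and is_stable(closed_loop(1.0, k))
    for k in [0.1, 0.5, 0.9, 1.1, 2.0, 10.0, 100.0]
)

# Perturb both channels together by 10% in opposite directions.
simultaneous = is_stable(closed_loop(1.1, 0.9))
```

In this sketch, each channel alone tolerates gain errors of a factor of 100, yet a 10% error in both channels at once makes the determinant negative and the loop unstable. A test suite that only varies one parameter at a time would never see it.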

We are thus left considering a strategy beyond sim2real: sim2real2sim2real. Or sim2real2sim2real2sim2real. You deploy the system and find out what didn’t work in reality. And then you go back to your simulator, add a few thousand lines of code to account for the mistake, and try again. The software state of mind is that we can always patch mistakes. You can have an all-hands, blameless post-mortem and say it won’t happen again. This drives the old control theorists mad, but it’s been working great so far, so why change course?

In case you haven’t encountered this before, suppose you are trying to model a closed-loop system x[t+1] = A x[t] + B u[t], u[t] = K x[t]. Then for an arbitrary matrix E_B,

A + B K = (A - E_B K) + (B + E_B) K

Hence, you can only identify a subspace of possible dynamical systems describing your data.
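A quick numerical check of this identity, as a sketch with scalars standing in for the matrices. The values of a, b, k, and e, and the helper `simulate`, are hypothetical.

```python
# Closed-loop data cannot separate (A, B) from (A - E*K, B + E): scalar illustration.
# All numbers are hypothetical.
def simulate(a, b, k, x0=1.0, steps=20):
    xs = [x0]
    for _ in range(steps):
        u = k * xs[-1]                 # feedback law u[t] = K x[t]
        xs.append(a * xs[-1] + b * u)  # x[t+1] = A x[t] + B u[t]
    return xs

a, b, k = 0.9, 0.5, -0.4
e = 0.3                                # arbitrary perturbation E_B
traj_true = simulate(a, b, k)
traj_alt = simulate(a - e * k, b + e, k)   # a different (A, B) pair

# Largest difference between the two state trajectories.
max_gap = max(abs(p - q) for p, q in zip(traj_true, traj_alt))
```

The two models have visibly different open-loop parameters (0.9 versus 1.02, and 0.5 versus 0.8), but under the same feedback gain their trajectories agree up to floating-point rounding.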

Purposeful Predictions

Ben Recht — Mon, 13 Apr 2026 14:49:19 GMT

Every engineer and scientist knows there is a fundamental difference between a “simulation” and a “prediction,” but what is the root of that distinction? At the highest level, we contrast simulation against black-box modeling. Simulations are typically thought of as “transparent boxes” where we can describe the intent of each part of the model that produces a forecast.

A roboticist might think of a simulation as a computer system designed to integrate the differential equations that define basic laws of physics. For example, you predict the path the airplane takes based on physical models of lift and drag and how the plane moves under different control settings. Simple simulations based on reduced equations might suffice for some tasks. For others, we might have to rely on computational fluid dynamics to truly capture the behavior we’re after.

The transparent box becomes murky when systems are too complex to predict precisely. Many designers accept adding randomness to their simulations, provided they can characterize the statistical models as plausible. The dynamics of coin flipping are too hard to capture precisely, but we’re usually fine with a random number generator that produces heads and tails equally often on average. Noise in measurement devices often reliably has statistics that match those of Gaussian or Poisson random numbers, and such stochastic processes are reasonable stand-ins for the sorts of signals we’ll encounter in the wild. Maybe you can simulate elections based on random numbers derived from current polling results.

Where do we draw the line between sampling and simulation? I maintain that LLMs are simulations of language. We train next-token predictors in language models so that their generation matches the statistical properties of the data. Indeed, maximum likelihood selects probability distributions that make past sequences likely in the future. I’ve received a lot of pushback on this because the samples generated by the transformer are too black-box to count as simulations. This reaction suggests to me that some people want simulations to arise from models with articulable causal explanations.

The academic literature on simulation is also intentionally vague about the difference between modeling, sampling, and simulation. But this quote from the 1975 textbook Systems Simulation: The Art and Science, by industrial engineer Robert Shannon, highlights a crucial feature of simulation:

“Simulation is the process of designing a model of a real system and conducting experiments with this model for the purpose either of understanding the behavior of the system or of evaluating various strategies (within the limits imposed by a criterion or set of criteria) for the operation of the system.”

For Shannon, simulation is purpose-driven. You replace a real system with a model, and then evaluate counterfactuals in the modeled world. A forecast that is not evaluating a counterfactual configuration or strategy is not a simulation. Simulation is anything where we can evaluate counterfactual futures and gain insights from them.

Simulations can help engineers describe the behavior of complex systems and build theories and hypotheses for why that behavior occurs. Engineers can also use them to predict future behavior of the system if they were to intervene with some new policy or if an external force acted to change some parameters.

Under this broad tent, optimal control is simulation. Since everyone learns LQR first, we get such clean formulas out that we don’t think of this as a simulator. We think of this as an analytical technique. But if you instead solve LQR by gradient descent, you’ll find that you need to simulate to compute a gradient. This is the “forward pass” in backpropagation, a method for computing gradients that was initially invented to solve optimal control problems.

Indeed, the process of solving LQR by gradient descent looks like this: You pick a cost function that seems to match your design specification. You try a particular control policy out. You get a signal back based on its performance under your cost function. You use this signal to modify your control policy to a policy with lower cost and try again. Once you have repeated this enough times so that you don’t think you can further improve, you deploy the control system trained in simulation.
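The loop described above can be sketched for a scalar system. This is a minimal illustration under made-up numbers, not a recommended way to solve LQR. It uses a finite-difference gradient rather than backpropagation, so each gradient costs two simulated rollouts. The constants and the names `rollout_cost` and `gradient` are assumptions.

```python
# LQR by gradient descent on a simulated rollout: scalar sketch with made-up numbers.
A, B, Q, R = 1.2, 1.0, 1.0, 1.0    # x[t+1] = A x[t] + B u[t], cost is the sum of Q x^2 + R u^2
HORIZON = 50

def rollout_cost(k, x0=1.0):
    # "Forward pass": simulate the policy u[t] = k x[t] and add up the cost.
    x, total = x0, 0.0
    for _ in range(HORIZON):
        u = k * x
        total += Q * x * x + R * u * u
        x = A * x + B * u
    return total

def gradient(k, eps=1e-5):
    # Central finite difference; each evaluation needs two simulations.
    return (rollout_cost(k + eps) - rollout_cost(k - eps)) / (2 * eps)

k = -1.0                            # initial stabilizing policy
initial_cost = rollout_cost(k)
for _ in range(300):
    k -= 0.05 * gradient(k)         # try, score, adjust, repeat
final_cost = rollout_cost(k)

# Reference: iterate the scalar Riccati recursion to get the infinite-horizon LQR gain.
p = Q
for _ in range(500):
    p = Q + A * A * p * R / (R + B * B * p)
k_lqr = -A * B * p / (R + B * B * p)
```

With a horizon this long relative to the closed-loop decay rate, the gain found by simulation and gradient steps should land very close to the gain from the Riccati recursion. The clean formula and the simulate-and-adjust loop are two routes to the same controller.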

AI people have coined a cutesy name for this iterative control design process: “sim2real.” On the one hand, sim2real looks like it’s doing something far more sophisticated than optimal control. The simulators they use are highly complex, their control policies are neural networks, and their cost functions are a clever pastiche of best past practice. However, robotic sim2real is a short conceptual hop from Kalman’s papers on basic linearization of chemical plants in the 50s.1 And just as today’s roboticist wishes Nvidia GPUs were cheaper, Kalman and Koepcke lament how they would be better served by more compute.

The question then becomes, how good does your simulation need to be for control? In their description of sim2real, Zakka et al. discuss the demand for the highest-fidelity simulations possible. But what does it even mean for a simulation to be high fidelity? How can you validate the assertion of high fidelity? Components with dramatically different behaviors look the same once they are interconnected in feedback loops. How can we identify what modeling is necessary? Once they are connected in feedback, identifying actual parameters becomes impossible. What is the right way to deal with uncertainty in the simulators? Is “domain adaptation,” the hot trend of the last decade where we simulate a lot of different plausible environments, the right way to make progress? Which transparent boxes can be replaced with black boxes? These are some of the questions I’ll dig into during today’s lecture. In the next post, I’ll report on partial answers.

If you are a roboticist, you should read that paper to see how Kalman should be credited with inventing Iterative LQR.

Calibrated Games

Ben Recht — Fri, 10 Apr 2026 14:45:51 GMT

This is a live blog of Lecture 8 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

One of the main uses of simulation and forecasting in designed feedback systems is for deciding how to act. If I can map what will happen next, I can choose actions that steer me toward good outcomes. This mindset seems perfectly sensible, and it’s the backbone of statistical decision theory, tree search in game play, optimal control, and model predictive control. Moreover, people who are good at prediction get clout. You can even win money in markets. It seems like forecasting is a skill and talent, and one that requires deep knowledge of how the world works. And yet, in class on Monday, I discussed how you can make excellent forecasts by simple, strategic accounting.

To understand why, let’s examine how we know if forecasts are good. It’s sort of obvious, but we can only evaluate our predictions of the future once the future has become the past. I can’t tell how good your forecast is until the forecast event occurs. No matter how much we think about setting up ungamable metrics, forecasters can only be evaluated retrospectively. And this retrospective nature means we can cast forecasting in the game theoretic framework from last week. Let me write out the rules in the format I’ve been using.

We have a two-player game with repeated interactions. In every round t,

Information x_t is revealed to both players.
Player One makes the forecast p_t.
Player Two reveals the actual outcome y_t.
A score s_t is assigned based on the triple (x_t,p_t,y_t).

Player One is the “forecaster.” Their goal is to accumulate as high a score as possible, summed across all rounds. Player Two wants the sum of all of the s_t to be as low as possible.

Now, we need to come up with score functions that can’t be “gamed,” and people have thought of many. For example, you might require the forecaster to have a low Brier score.

Now, a low Brier score is impossible in the adversarial context. Think about this weird game of bit prediction, where Player One guesses a number 0 or 1 and Player Two responds with the correct answer equal to 0 or 1. Player Two goes second and can always pick the opposite of what Player One says. That seems unfair.
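A toy version of this game, assuming the Brier score is the mean squared difference between forecast and outcome. The function names and the particular sequences of guesses are hypothetical.

```python
# Brier score under a contrarian Player Two (a toy illustration).
def brier_score(forecasts, outcomes):
    # Mean squared difference between forecast and outcome; lower is better.
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

def contrarian(p):
    # Player Two moves second and picks the bit farther from the forecast.
    return 0 if p >= 0.5 else 1

hard_guesses = [1, 0, 0, 1, 1, 0]    # Player One commits to a bit each round
hedged_guesses = [0.5] * 6           # Player One always says 50%

hard_score = brier_score(hard_guesses, [contrarian(p) for p in hard_guesses])
hedged_score = brier_score(hedged_guesses, [contrarian(p) for p in hedged_guesses])
```

Committing to a bit earns the worst possible score of 1 in every round. Hedging at 50% caps the damage at 0.25, and against this adversary no forecast can do better than that, since one of the two outcomes is always at least 0.5 away.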

Last week, I brought up the possibility of judging Player One with regret. Regret would measure the difference between Player One’s score and the score of a player who knows all of Player Two’s moves in advance but can only play a single prediction. In math, this quantity is written as

$$
\mathrm{Regret}_T = \max_{p} \sum_{t=1}^{T} s(x_t, p, y_t) - \sum_{t=1}^{T} s(x_t, p_t, y_t)
$$

But good predictions alone may not be what you care about. Certainly, you’d want the frequencies to match. If you are predicting a sequence of probabilities, the average of those probabilities should match the average of the actual outcomes. If you consistently predict a player makes 90% of their free throws, we should see 90% of free throws made.

Similarly, other expected values should match. If you are changing your probabilities over time, the variance of the outcomes should still match the variance of your probabilities.

Maybe you’d prefer the predictions to be good across stratifications of the data. For example, if you are predicting free throws, maybe you’d want your forecasts to be accurate for all players individually. There are lots of subtests and subsets I can inspect, and I’d like to check that you are making good predictions on all of them.

Perhaps you’d like a certain degree of calibration from the forecast. In all of Player One’s forecasts where they say 20%, Player Two should say 1 only 20% of the time. In the forecasts where they say 60%, Player Two should say 1 60% of the time. If in all of the times Player One says there’s a 90% chance of a 1, only 10% of the times Player Two plays a 1, we’d think Player One is a pretty bad forecaster. Trying to achieve calibration across all possible probabilistic predictions seems a lot harder than just getting a single frequency correct in a Brier score game.
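Checking calibration after the fact is just bookkeeping. A minimal sketch with made-up forecasts and outcomes, where `calibration_table` is a hypothetical helper:

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    # Group rounds by the forecast value and report the observed frequency of 1s.
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        bins[p].append(y)
    return {p: sum(ys) / len(ys) for p, ys in bins.items()}

# Five rounds forecast at 20%, ten rounds forecast at 90% (illustrative data).
forecasts = [0.2] * 5 + [0.9] * 10
outcomes = [1, 0, 0, 0, 0] + [1] + [0] * 9
table = calibration_table(forecasts, outcomes)
```

In this made-up data, the 20% forecasts are calibrated (one outcome in five was a 1), while the 90% forecasts are badly off (only one outcome in ten was a 1).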

Mathematically, however, all of these problems are basically the same. They list a set of “test functions”, and Player One wants the following to be small for every single test function:

$$
E_f = \left| \frac{1}{T} \sum_{t=1}^{T} f(x_t, p_t)\,(y_t - p_t) \right|
$$

What are the test functions? If all we care about is getting the frequency that y_t equals one correct, then the test function is the constant function. For calibration, the test function is equal to one when a forecaster predicts probability x% and 0 otherwise. You’ll have one test function for each calibration bin. For calibration across strata, there will be a function for each stratum. Even Brier scores amount to calibration. You can get a low Brier Score by calibrating the functions

for all values of q.
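A short derivation suggests why Brier scores reduce to calibration. It assumes the Brier score is the squared error $(p_t - y_t)^2$ and that the benchmark is a constant forecast $q$. Under those assumptions, test functions of the form $f_q(p) = p - q$ would suffice:

$$
(p_t - y_t)^2 - (q - y_t)^2 = 2\,(p_t - q)(p_t - y_t) - (p_t - q)^2 \le 2\,(p_t - q)(p_t - y_t).
$$

Summing over rounds, the gap between the forecaster’s Brier score and that of any constant forecast $q$ is at most twice $\sum_t (p_t - q)(p_t - y_t)$. That sum is a calibration-style error for the test function $f_q$. It splits into the error for $f(p) = p$ minus $q$ times the error for the constant function $f(p) = 1$, so controlling those two errors controls the gap for every $q$ at once.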

The amazing thing is that making calibration errors of the form E_f small is incredibly mechanical. Juanky Perdomo and I spell out the general details in Section 3 of the tutorial “In Defense of Defensive Forecasting.” More or less, you just have to choose a prediction that makes the future look uncorrelated with the past. And you can always find such a prediction with simple search. Though there are specific details you have to deal with for each case, essentially the same procedure applies to very general sets of calibration functions.
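Here is a minimal sketch of the idea, not the tutorial’s algorithm verbatim. It assumes binary outcomes and only two test functions, f(p) = 1 and f(p) = p. The forecaster tracks the running sums a = Σ (y_s - p_s) and b = Σ p_s (y_s - p_s), then picks a p where S(p) = a + b p cannot be positively correlated with the next error y - p. The names `defensive_forecast`, `play`, and `contrarian` are hypothetical.

```python
import math

def defensive_forecast(a, b, iters=60):
    # Pick p in [0, 1] with S(p) * (y - p) <= 0 for both y in {0, 1},
    # where S(p) = a + b * p is the running correlation with past errors.
    S = lambda p: a + b * p
    if S(1.0) >= 0:          # predicting 1 forces (y - p) <= 0
        return 1.0
    if S(0.0) <= 0:          # predicting 0 forces (y - p) >= 0
        return 0.0
    lo, hi = 0.0, 1.0        # S(lo) > 0 > S(hi): bisect for a root of S
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if S(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def play(T, adversary):
    a = b = 0.0
    for _ in range(T):
        p = defensive_forecast(a, b)
        y = adversary(p)     # Player Two moves after seeing the forecast
        a += (y - p)         # error against the test function f(p) = 1
        b += p * (y - p)     # error against the test function f(p) = p
    return a, b

def contrarian(p):
    # The adversary that ruins accuracy: always pick the far bit.
    return 1 if p < 0.5 else 0

T = 2000
a, b = play(T, contrarian)
```

A short induction suggests that, with this choice of p, a^2 + b^2 grows by at most 2 per round. So both accumulated errors stay within sqrt(2T) even against the contrarian, and the averaged errors shrink like 1/sqrt(T). The forecaster needed no knowledge of how outcomes are generated, only of the test functions.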

We found that we could reduce every metric used to evaluate forecasting skill to some form of generalized calibration. There are whole bodies of work on proper scoring rules, conformal prediction, omniprediction, and outcome indistinguishability that reduce to generalized calibration. In the forecasting game, this generalized calibration can be done without specific domain expertise. As long as the evaluation metrics are prescribed in advance, a Defensive Forecaster will do well in fantasy sports, weather prediction, and election forecasting. It doesn’t need to know anything about the topic other than the judgment scheme.

Though Juanky and I wrote up our defense of defensive forecasting almost a year ago, this week was the first time I tried to present it in class. I got a lot of puzzled looks, as if I was playing clever card tricks. That’s the correct reaction! We are naturally impressed by people who are good at forecasting. We’re obsessed with predicting the future. Predictions from soothsayers are reassuring even if they’re consistently wrong.

And yet, forecasting is often just playing clever tricks for fun and profit. Though Dean Foster and Rakesh Vohra famously showed thirty years ago that percentile calibration amounted to bookkeeping, it turns out that all forms of generalized calibration can be achieved through bookkeeping. Next time someone tries to impress you with their prediction market prowess, remember that cooking the books isn’t the same as clairvoyance.

Unreal Is Here

Ben Recht — Tue, 07 Apr 2026 14:09:50 GMT

Though I’ve been prefacing my lecture blog posts with italicized disclaimers, I want to single this lecture blog out as being targeted a bit more broadly. Because, in a weird confluence, the topic of this week’s lecture coincides with the topic of an op-ed by Leif Weatherby and me that appears this morning in the New York Times: forecasting and simulation.

We can’t avoid prediction and simulation in a class about feedback systems. Our theories suggest that better predictions and forecasts lead to better plans of action. Additionally, we try to make sense of complex, interconnected systems by simulating their behavior, and simulations often reveal surprising “emergent” behavior of the whole, which wasn’t evident from the modeled behavior of the parts. We also tend to think that the subcomponents of complex, interconnected systems make sense of their surroundings by predicting what other components around them will do.

I was a bit slippery in that paragraph about what the difference is between simulation and prediction. That’s because I’m still not sure how to draw a boundary between the two concepts. The most common axis is opacity: everyone thinks there is a fundamental difference between a model that is “easy to describe” from first principles and one that is purely data-driven. We call the latter “black box” to mark our disdain. The “transparent box” systems might derive from physical laws, and we write down a set of equations that dictate how each step relates to the next. The black box systems might be derived by curve fitting, where we pick a function of convenience, untethered from causal explanation, to describe how inputs have historically mapped to outputs.

I’ll talk more about the opacity slider in later posts this week, but today, I want to ask about the purpose of simulation. That axis is more interesting to me. Simulations can be used in many different ways. You might use a simulation to better understand a system itself. Simulations of mechanical systems can give you a feel for their performance limits. You can use simulations to figure out why something went wrong, deriving causal explanations from plausible mechanisms. And, of course, you can use simulations to predict the future. You can use these simulation forecasts to make a plan of action. Or, in our Draft-Kings-addled culture, you might use them to gamble.

Leif and I talked about this murky simulation landscape in the world of public opinion polling. Specifically, we wrote about the absurdity of silicon sampling. For those unfamiliar with the term, silicon sampling is when you design a social science survey experiment and give the questions to LLMs rather than people. As absurd as this sounds, people are really pushing to do this. There’s a billion-dollar startup called Aaru that is based entirely on this silly idea. And one of their fake polls slipped its way into Axios last week, without Mike Allen realizing that the “poll” he was reporting on was a computer simulation (embarrassed, Axios later edited the story to reflect the phoniness).

But why do silicon samples have so much cachet with pollsters and social scientists? As Leif and I argue in our piece, it’s because polls already rely heavily on simulation methods. Because of remarkably high nonresponse bias, pollsters lean heavily on statistical modeling to tweak their numbers to align with reality. Polls that use multilevel regression and poststratification are already inputting a lot of simulated reality to “correct” their summarization of the data they collected. The number isn’t “percentage of yesses in my sample,” it’s “what I think the percentage of yesses is in the population given my sample and my beliefs about the population.”

Since polling already relies heavily on simulation, tossing out the expensive part of the process—you know, asking actual people questions—feels like a logical conclusion. The Nate Silverization of political coverage turned polling into prediction. In the media, the goal of polls stopped being about understanding what people think and became more about predicting the outcome of elections. If all you need to do is predict, you don’t really need pristine distillations of understanding. You can take your empirical facts and use them solely to predict outcomes. And if the goal is just prediction, you don’t need to bother asking people at all. In fact, you want more reliable data than the fickle behavior of people nagged by pollsters at the end of some modern transmission line. If your goal is only prediction, you’re probably better off not talking to people at all.

But is the purpose of polling prediction? It depends on who you ask, but I’d like to think that the answer is no. At pure face value, the topline numbers of an opinion poll are a summary of a survey. They reduce a list of ones and zeros into two numbers: a mean and a variance.

Now, using a bit more social-scientific reasoning, we might interpret this summarization as a measurement of what a group of people believes. With a rigid methodology, we can consider polling to be quantified opinion. It’s a bit odd to think that you can “objectively” measure opinion in the first place, but this has been a supposition of social science research for a long time.

Unfortunately, statistics has incredibly slippery semantics that lead people to conflate summarization with measurement and measurement with prediction. Is the percentage of “people who answered yes” a summarization of the data? Is it a measured quantity about the opinion of a broader population? Is it a prediction of how people will vote in November? Yes?

I’m interested in this conflation for both political and academic reasons. Leif and I think the polling industry is harmful to the public sphere. But setting those politics aside, I think that being upfront about the purpose of simulations and forecasts helps demystify their outputs. Indeed, this week I’ll describe how purpose dictates forecasts. Prediction of the future is difficult. But if you tell me how my predictions will be evaluated, prediction of the future is trivial. I’ll explain more about why in the next post.

Arbitrary Geometry

Ben Recht — Fri, 03 Apr 2026 14:23:15 GMT

This is the third live blog of Lecture 7 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

I closed yesterday’s blog with a cliffhanger, promising to give a few examples of where I think adversarial regret is a useful concept. On its own, I’m not sure that it is a useful concept. More on that next week! But today, I’ll show how you can use adversarial regret to bootstrap interesting arguments linking machine learning, game theory, and stochastic optimization.

Once again, I list the rules of the game. We have a two-player game with repeated interactions. In every round t,

Information x_t is revealed to both players.
Player One takes action u_t
Player Two takes action d_t
A score r_t is assigned based on the triple (x_t,u_t,d_t).

Player One is the “decision maker,” and their action has to be computable from a few lines of code. Their goal is to accumulate as high a score as possible, summed across all rounds. Player Two wants the sum of all of the r_t to be as low as possible.

The adversarial regret compares the score of Player One’s strategy to that of a player who sees the entire sequence of disturbances but must play the same action at every time step. It’s a weird setup where we are comparing a player who can change their strategy arbitrarily to an omniscient player forced to play the same move every time. While these two notions don’t seem worth comparing at first blush, there are a few cases in learning theory and game theory where the comparison is mathematically powerful. It turns out that computing these regret bounds is often quite simple, and they follow from elementary derivations. While these bounds themselves might not be useful, they then imply results you actually care about. Let me give my three favorite examples.

Online learning and PAC Learning

Online learning is the case argmin readers will have already encountered if they followed my machine learning course blogging. I like to teach online learning because adversarial regret bounds imply the standard model of probabilistic machine learning. Adversarial regret highlights how most of the “generalization bounds” we derive in machine learning are artifacts of geometry rather than mystical manifestations of mechanical epiphany.

In the online learning model, the goal is to predict the disturbance from the information. The actions are predictions. At each round, your score is high if your prediction is correct and low if the prediction is incorrect.

Let’s change the notation to match the standard verbiage of machine learning. The prediction is a function f, and the “disturbance” is a “label,” denoted y. Instead of a high score, we want low loss. In the online learning setting, you get to change your prediction function at every time step and compare your losses to a single model that tries to fit the labels after you see them. In equations, this is

$$
\mathrm{Regret}_T = \sum_{t=1}^{T} \mathrm{loss}(f_t(x_t), y_t) - \min_{f} \sum_{t=1}^{T} \mathrm{loss}(f(x_t), y_t)
$$

You can now do math and show that this expression is bounded by a sublinear function of the number of rounds. This post works out the details. Now, the resulting deterministic bound is not necessarily interesting in and of itself, but the magic happens when you declare the xs and ys to be generated by a stochastic process. If, for example, you assume the information-label pairs are independent, identically distributed samples from some data-generating process, then after making a few assumptions about convexity, the regret bound becomes a generalization bound:

$$
\mathbb{E}\big[\mathrm{loss}(F_T(x), y)\big] - \min_{f} \mathbb{E}\big[\mathrm{loss}(f(x), y)\big] \le \frac{\mathbb{E}[\mathrm{Regret}_T]}{T}
$$

Here, F_T denotes the random function that your model returns after seeing T examples. This bound compares the predictive accuracy of your algorithm on a new sample to that of the best function computable given the data-generating distribution. If you have sublinear regret, then this quantity tends to zero as T goes to infinity. This is called a generalization bound, or, if you use probability instead of expected values, a PAC Learning bound.

The technique of deriving a deterministic regret bound and transforming it into a probabilistic generalization bound by taking expected values is called “online-to-batch conversion.” It is one of the favorite tricks of learning theorists.
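To see a sublinear regret bound in action, here is a stripped-down sketch. It assumes there is no side information, the prediction is a single number w, the loss is (w - y)^2, and the labels form a fixed alternating sequence rather than random draws. The step size 1/(2t) is a standard choice for this strongly convex loss. The function names are hypothetical.

```python
import math

def squared_loss(w, y):
    return (w - y) ** 2

def online_gradient_descent(ys, w0=0.5):
    # Step size 1 / (2 t) on the squared loss; w is the prediction for the next round.
    w, total = w0, 0.0
    for t, y in enumerate(ys, start=1):
        total += squared_loss(w, y)            # pay the loss before updating
        w -= (1.0 / (2 * t)) * 2 * (w - y)     # gradient of (w - y)^2 is 2 (w - y)
    return total

T = 1000
ys = [t % 2 for t in range(1, T + 1)]          # a fixed, alternating label sequence

learner_loss = online_gradient_descent(ys)
best_fixed = sum(ys) / T                       # best single prediction in hindsight
hindsight_loss = sum(squared_loss(best_fixed, y) for y in ys)
regret = learner_loss - hindsight_loss
```

With this step size the update reduces to predicting the running average of past labels. On this sequence the regret stays within 1 + log T, so the per-round regret vanishes as T grows, with no probabilistic assumption anywhere in the calculation.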

Stochastic Optimization

Similar techniques can be applied more generally to stochastic optimization. A clever analysis of the stochastic gradient method takes a similar approach: you can prove that gradient descent has low regret even if Player Two is handing you a different convex function at every time step. If you take expectations of the resulting regret bound and apply Jensen’s inequality, you derive a bound on the sample average approximation method for stochastic programming. Though substantially more general, the proof is almost identical to the one in online learning.
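The expectation-plus-Jensen step can be written out in a few lines. This is a sketch under stated assumptions: the round-$t$ function is $f_t(w) = f(w; \xi_t)$ with $\xi_t$ drawn independently from a fixed distribution, $F(w) = \mathbb{E}[f(w; \xi)]$ is convex with minimizer $w_\star$, the iterate $w_t$ depends only on $\xi_1, \dots, \xi_{t-1}$, and $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$ is the averaged iterate.

$$
\mathbb{E}\big[F(\bar{w}_T)\big] - F(w_\star)
\le \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[F(w_t) - F(w_\star)\big]
= \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} \big(f_t(w_t) - f_t(w_\star)\big)\Big]
\le \frac{\mathbb{E}[\mathrm{Regret}_T]}{T}.
$$

The first inequality is Jensen’s inequality applied to the convex function $F$. The equality holds because $w_t$ is independent of $\xi_t$, so the conditional expectation of $f_t(w_t)$ given the past is $F(w_t)$. The last inequality holds because $w_\star$ is one particular fixed comparator, while regret compares against the best fixed point in hindsight.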

Repeated Games

Still closely related to but slightly more challenging than online convex optimization are repeated zero-sum games. In this setup, each round of the game is itself a zero-sum game. The players battle each other for multiple rounds, and Player One’s goal is to refine their strategy so they eventually achieve an infinite Elo score. Here, a classic result proves that when both players use algorithms with low adversarial regret, their time-averaged strategies converge to a Nash equilibrium. You assume that both Player One and Player Two are using algorithms that yield low regret against an arbitrary adversary. The baseline is a player forced to use the same strategy every round. If Player One and Two’s strategic improvements have sublinear regret, their average strategies eventually converge to an equilibrium. This result is the backbone of modern poker bots, which use algorithms like counterfactual regret minimization. Whether or not you think solving poker is a major contribution to humanity and human knowledge is up to you.
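A small self-play sketch illustrates the result. It uses matching pennies and the multiplicative weights (Hedge) update, which is one standard low-regret algorithm, not the counterfactual regret minimization used in poker. The starting strategies, horizon, and step size are assumptions. The classical guarantee concerns the time-averaged strategies, so those are what the sketch tracks.

```python
import math

PAYOFF = [[1.0, -1.0], [-1.0, 1.0]]   # matching pennies: row player wins on a match
T = 5000
ETA = 0.5 * math.sqrt(8 * math.log(2) / T)   # Hedge step size for payoffs in [-1, 1]

def normalize(w):
    s = sum(w)
    return [v / s for v in w]

row = [0.9, 0.1]                       # deliberately lopsided starting strategies
col = [0.2, 0.8]
row_avg, col_avg = [0.0, 0.0], [0.0, 0.0]

for _ in range(T):
    # Accumulate the time averages of the strategies actually played.
    for i in range(2):
        row_avg[i] += row[i] / T
        col_avg[i] += col[i] / T
    # Expected payoff of each pure action against the opponent's current mix.
    row_gain = [sum(PAYOFF[i][j] * col[j] for j in range(2)) for i in range(2)]
    col_gain = [-sum(PAYOFF[i][j] * row[i] for i in range(2)) for j in range(2)]
    # Multiplicative weights (Hedge): upweight actions that would have done well.
    row = normalize([row[i] * math.exp(ETA * row_gain[i]) for i in range(2)])
    col = normalize([col[j] * math.exp(ETA * col_gain[j]) for j in range(2)])
```

The unique equilibrium of matching pennies is for both players to mix 50/50. Although both players start far from that, the averaged strategies `row_avg` and `col_avg` end up close to it. The round-by-round strategies themselves may keep cycling around the equilibrium.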

Should Have Known Better

Ben Recht — Thu, 02 Apr 2026 14:42:40 GMT

This is the second live blog of Lecture 7 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

When talking about sequential decision making and optimal control, I can’t avoid discussing the mathematical concept of regret. Regret is the preferred theoretical metric for evaluating bandit algorithms, and bandit algorithms are a core method for online decision-making.

Invariably, every time I try to explain “regret,” I get the question, “Wait, why do we care about that?” So to have an answer that I can point to in the future, I’m going to write a few blog posts.

We can use the setup from the last blog. We have a two-player game with repeated interactions. In every round t,

Information x_t is revealed to both players.
Player One takes action u_t
Player Two takes action d_t
A score r_t is assigned based on the triple (x_t,u_t,d_t).

You could think of designing the optimal policy as an optimization problem. Given a description of what Player Two can do, you can design a strategy for Player One. If Player Two is totally random, you could perhaps maximize the expected score. If Player Two is deterministic, you could maximize the score assuming Player Two plays the best possible strategy against you. Regret provides a flexible framework for going beyond both of these formulations.

To define regret, we imagine a counterfactual world in which Player One knows something about Player Two’s strategy in advance. The regret of some strategy is the difference between the score of a policy built on knowing this secret of Player Two and the score of the strategy that has to learn as it goes. It is called regret because it estimates how much the score could be improved with the benefit of hindsight.

Sometimes this regret is really large. Consider the following example. In each round, Player Two thinks of a color, Red or Blue, and Player One has to guess which color Player Two is thinking of. Player One gets a 1 if they guess correctly and a 0 otherwise. Player Two agrees to choose their sequence in advance, but to reveal only one color to Player One in each round. In the counterfactual world, Player One would know the entire sequence and would receive a perfect score. In reality, Player One can’t do better than guessing, so they would be hard-pressed to get more than half of the colors correct.

This is where regret gets confusing. We ask, what if Player One in the counterfactual world has the benefit of hindsight but is constrained in their strategies? In this color guessing game, what if Player One is forced to choose one color in their counterfactual world? They see the entire sequence but can only pick Red or Blue. In this case, if Player Two chooses an equal number of Reds and Blues, the omniscient yet restricted Player One can only get half of the answers correct. A real-world strategy of random guessing will fare just as well as this counterfactual strategy with the benefit of hindsight.

No matter how many times I explain it, I find this setup confusing. Let me write it again: The regret model requires two things: a secret of Player Two and a restricted strategy set of Player One. In the real world, Player One has a flexible strategy set, but is missing information. In the counterfactual world, Player One has a restricted strategy set, but extra knowledge. Regret bounds the difference in scores achieved in these two worlds.
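The two counterfactual worlds can be put side by side in a few lines. This sketch assumes a balanced secret sequence of 100 colors and uses “always guess Red” as a deterministic stand-in for a real-world strategy that has no access to the secret. The names are hypothetical.

```python
def score(guesses, colors):
    # One point for every round where the guess matches the secret color.
    return sum(g == c for g, c in zip(guesses, colors))

colors = ["Red", "Blue"] * 50                      # Player Two's secret sequence, 100 rounds

omniscient = score(colors, colors)                 # knows the sequence, unrestricted
best_fixed = max(score([c] * len(colors), colors)  # knows the sequence, one color only
                 for c in ("Red", "Blue"))
always_red = score(["Red"] * len(colors), colors)  # a real-world strategy with no secret

regret_vs_fixed = best_fixed - always_red
regret_vs_omniscient = omniscient - always_red
```

Against the unrestricted omniscient player, the regret is 50 points. Against the omniscient player restricted to a single color, it is zero. Same real-world strategy, same sequence; only the comparison changed.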

You might ask why this particular example of color guessing is interesting. I’m not sure it is, but it’s the one we’ll use next week when discussing forecasting. When someone tells you that they have calibrated predictions, they are doing this sort of sleight of hand and comparing against something that you probably don’t actually care about.

But let’s spend some time discussing examples where regret is reasonable. I’ll start with the canonical example: the stochastic multiarmed bandit. If Player Two is random and stationary, then the best strategy in hindsight makes a lot more sense. In our game of colors, this is the multiarmed bandit problem, the most annoyingly named subject in decision making. In the classic version of this problem, you have two slot machines and want to find the one with the highest payout. Each round, you are allowed to choose one of the machines to play. We model the payout from each machine as an independent draw from a fixed probability distribution. These distributions have different means, and your goal is to devise a strategy that results in the highest expected payout.

What would the best policy do? No matter how clever you are, you can’t beat the strategy of only using the machine with the higher mean payout. If you knew the expected payouts in advance and your goal is to maximize the expected payout, you would use only the machine with the highest expected payout. Thus, we can think of the secret held by Player Two to be the mean payouts of each machine.

If you didn’t know this secret, what would you do? You’d probably spend some time with each machine, look at which one is giving you higher returns, and then pick that one forever. This seems like a reasonable strategy.1 But note that this strategy necessarily has nonzero regret, because you necessarily have to try both machines to figure out which one is best.

Any strategy you devise for the real world has a particular expected regret, which is the difference between the expected payout of playing the best machine and the expected value of your strategy. In the case of our multiarmed bandit, the worst regret is accrued by always playing the suboptimal machine. So the regret would grow linearly with the number of pulls. Bandit algorithms seek strategies for which regret grows sublinearly.
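A sketch of the “try both, then commit” strategy next to the worst possible strategy. It assumes two Bernoulli machines with made-up payout probabilities and measures expected regret as the gap in means times the number of pulls of the worse machine. The names `explore_then_commit` and `expected_regret`, the horizon, and the exploration budget are all hypothetical.

```python
import random

MEANS = [0.5, 0.7]            # hypothetical Bernoulli payout probabilities
GAP = max(MEANS) - min(MEANS)
T = 2000

def pull(arm, rng):
    return 1 if rng.random() < MEANS[arm] else 0

def explore_then_commit(explore_per_arm, rng):
    # Try each machine a fixed number of times, then play the empirical winner forever.
    totals, pulls = [0, 0], [0, 0]
    for arm in (0, 1):
        for _ in range(explore_per_arm):
            totals[arm] += pull(arm, rng)
            pulls[arm] += 1
    best = 0 if totals[0] > totals[1] else 1   # equal exploration, so totals compare means
    pulls[best] += T - 2 * explore_per_arm     # commit the remaining rounds
    return pulls

def expected_regret(pulls):
    # Each pull of the worse machine costs GAP in expectation.
    worse = MEANS.index(min(MEANS))
    return GAP * pulls[worse]

rng = random.Random(0)
regret_etc = expected_regret(explore_then_commit(100, rng))
regret_always_worse = expected_regret([T, 0])
```

Always playing the worse machine pays the gap on every pull, so its regret scales with T. Explore-then-commit always pays for its exploration pulls of the worse machine, and pays more only if the exploration phase picks the wrong winner.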

Outside the casino, variants of the stochastic multiarmed bandit are reasonable models for adaptive experimentation. Suppose you want to select whichever of treatments A and B maximizes the average benefit to some cohort of subjects. If you can randomly sample individuals from the cohort, there will be regret associated with the number of subjects assigned to the suboptimal treatment in a randomized experiment, and there will be regret associated with the chance your experiment selects the suboptimal treatment. You would like to minimize application of the wrong treatment, but also be pretty sure you are finding the right one. You can compare this to the policy that assigned everyone to the optimal treatment in advance.
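The two sources of regret can be written as a back-of-the-envelope decomposition. The symbols are illustrative assumptions rather than notation from the lecture: N is the cohort size, n is the number of subjects in a trial randomized evenly between A and B, Δ is the gap in average benefit between the two treatments, and p_err is the probability that the trial selects the worse treatment.

```latex
\mathbb{E}[\text{regret}]
  \;=\; \underbrace{\frac{n}{2}\,\Delta}_{\text{worse treatment during the trial}}
  \;+\; \underbrace{p_{\mathrm{err}}\,(N-n)\,\Delta}_{\text{worse treatment after the trial}}
```

A larger trial shrinks p_err but grows the first term, which is exactly the tension between minimizing application of the wrong treatment and being sure you found the right one.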

Tomorrow I’ll talk through three other examples where regret feels like the right concept to me. In hindsight, it’s not always worth the headaches and confusion associated with regret minimization, but there are enough positive examples to make it a concept worth understanding.

This is more or less the optimal strategy.

You Play to Win the Game

Ben Recht — Tue, 31 Mar 2026 14:38:10 GMT

This is a live blog of Lecture 7 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

The Monday after Spring Break is always a weird class with people trickling in from their various excursions. So it’s an ideal time for a weird lecture. I decided it was time for some game theory.

The goal of this seminar was to focus on the power of feedback, to understand how to think about complex interconnected systems, and to understand how feedback design allows systems to “generalize” and “behave effectively in unknown future settings”.

In classical control, you can argue that feedback is for stabilization, for maintaining fixed points, for rejecting disturbances, or for recovering from failure. We covered some of these ideas in the first part of the course.

However, there’s another view of feedback, one that’s ubiquitous in machine learning and artificial intelligence. It’s the one that’s most prevalent in the quantitative social sciences. And, increasingly, based on my interactions with Berkeley graduate students, in robotics. That is the idea of feedback as a way to augment optimization.

In optimal control, feedback is used for the most narrow-minded reason: it lowers cost. Feedback policies, because they search over a larger space of policies, have lower cost than open-loop policies. That’s it. Feedback provides more information to the decision-maker, and a decision-maker who uses information will achieve a lower cost than one who doesn’t.

The optimal control model of feedback is game theoretic with rules of engagement staged as follows:

In every round t,

Information x_t is revealed to both players.
Player One takes action u_t.
Player Two takes action d_t.
A score r_t is assigned based on the triple (x_t,u_t,d_t).

Player One’s action can be computed based on the rules of the game and all of the moves they’ve seen thus far. This is why if they optimally use the revealed information, they will have no worse cost than if they throw the information out.

Player Two is “the adversary.” Their power dictates how hard the game is for the decision maker. In some formulations, Player Two chooses oblivious random actions. You can make Player One’s life harder by making Player Two an omniscient god that knows Player One’s strategy in advance and can compute undecidable functions to topple them.
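Here is a minimal sketch of the round structure above with an oblivious random Player Two. The specific score function, the distributions, and the two policies are hypothetical choices made for illustration. It compares a feedback policy that uses the revealed x_t against an open-loop policy that throws it away.

```python
import random

def play(policy, rounds=1000, seed=0):
    """Run the round-based game and return Player One's total score."""
    rng = random.Random(seed)
    history = []  # everything Player One has seen so far
    total = 0.0
    for t in range(rounds):
        x = rng.uniform(-1.0, 1.0)  # information x_t revealed to both players
        u = policy(x, history)      # Player One acts on x_t and the past moves
        d = rng.gauss(0.0, 0.1)     # Player Two: oblivious random action d_t
        r = -(x + u + d) ** 2       # score r_t from the triple (x_t, u_t, d_t)
        history.append((x, u, d, r))
        total += r
    return total

def feedback_policy(x, history):
    """Use the revealed information to cancel x_t."""
    return -x

def open_loop_policy(x, history):
    """Ignore the information; 0 is the best constant action in this toy game."""
    return 0.0

# Same seed, so both policies face identical x_t and d_t sequences.
feedback_score = play(feedback_policy)
open_loop_score = play(open_loop_policy)
print(f"feedback score:  {feedback_score:.2f}")
print(f"open-loop score: {open_loop_score:.2f}")
```

In this toy game the feedback policy only pays for Player Two's noise, while the open-loop policy also pays for every x_t it ignored.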

If there is a single round and Player Two is random, this game is called decision theory. We’ve collectively decided that the best strategies against random adversaries are those that maximize the expected value of the score. Don’t ask me why. If there are two rounds, it’s stochastic programming. If there are an infinite number of rounds, but there is no relationship between the rounds, it’s a bandit problem. If there are infinitely many rounds and the information follows a Markov chain, this is stochastic optimal control or reinforcement learning. In this case, when the costs are quadratic, the Markovian dynamics linear, and the adversary normally distributed, this is the linear quadratic regulator problem.
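As one concrete instance of the last case, here is a minimal sketch of a scalar linear quadratic regulator. Everything specific is an assumption for illustration: the dynamics x_{t+1} = a x_t + b u_t + w_t with Gaussian w_t, the per-round cost q x_t^2 + r u_t^2 (so the score is the negative cost), and the numbers plugged in. The code iterates the scalar Riccati recursion to a fixed point and reads off the gain k for the feedback policy u_t = -k x_t.

```python
def scalar_lqr_gain(a, b, q, r, iterations=500):
    """Iterate the scalar discrete-time Riccati recursion and return (k, p)."""
    p = q
    for _ in range(iterations):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)  # feedback gain for u_t = -k * x_t
    return k, p

# Hypothetical system: unstable without control because |a| > 1.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
k, p = scalar_lqr_gain(a, b, q, r)

closed_loop = a - b * k  # state multiplier once feedback is applied
residual = p - (q + a * a * p - (a * b * p) ** 2 / (r + b * b * p))

print(f"gain k = {k:.4f}, closed-loop multiplier = {closed_loop:.4f}")
print(f"Riccati fixed-point residual = {residual:.2e}")
```

For this scalar noise model the gain does not depend on the noise variance, which only shifts the achievable cost.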

When Player Two is adversarial, Player One seeks a strategy to maximize their score against the best imaginable opponent. If there is a single round and Player Two is adversarial, this is called game theory or robust optimization. If there are an infinite number of rounds, but there is no relationship between the rounds, it’s a non-stochastic bandit problem. If there are infinitely many rounds and the information follows a Markov chain, this is robust optimal control. The linear version of this robust control problem is called the H_∞ optimal control problem. Phew!

Now, every single one of these problems requires a slightly different algorithmic solution. That’s what keeps us in business. For every gradation, the solution details can fill a textbook. But they are all variations of the same game-theoretic framework. Having been formalized in the late 1940s and honed in the military-industrial boom of the 1950s and 1960s, this game-theoretic model of control and decision-making has been standard since the 1970s.

I’m not saying any of this is wrong, per se. I am saying that it is a bit limited as a framework. Part of the motivation for this course was to make better sense of this “graph” I made in a blog series a couple of years ago.

I observed that decision-making frameworks were distinguished by two variables: the impact each action had on a system external to the decision-maker (the x-axis) and the frequency with which decisions could be made (the y-axis).1 Game-theoretic decision making requires first figuring out where on this graph you want to operate. If you have a problem that calls for a specific level of impact and have the authority to act at a specific speed, you can find a particular solution using a proper game-theoretic formulation. How powerful you make Player Two will affect the complexity of your decision system and its conservatism. Since you have no idea what the future holds, your conception of Player Two is a subjective decision, but at least it’s one you can precisely describe. In this sense, the optimization framework is nice because you can declaratively compute decision policies based on systems modeling.

But if you have problems that span multiple regions of this space, or ones that lie below that red curve, the optimization framework gets stuck. If you have problems where the costs are ambiguous or variable, it’s hard to argue in favor of a policy based on models of cumulative reward. If you care about multiple levels of interaction impacts and speed, optimization stops being helpful.

The problem is that if you want to move into high-impact regimes where your authority is less than you’d desire, no single system gets you there. At some point, you have to think a bit more broadly about what systems push hard against this red curve. We’re forced above the red curve because of different limits, some fundamental, some conceptual. Physical law, computational efficiency, and even the ability to model keep us on one side of the curve.

I’m not sure this class helped me understand how to move beyond this curve, but it helped me understand a bit better why we’re stuck with it. The action-impact curve shows that single optimization problems can’t govern complex systems on their own. How do existing complex systems, be they natural or artificial, get around it? I’ll reserve the last two lectures of the class for this sort of abstract navel gazing.

To read more about that plot and what I intend the axes to mean, read this post from a couple of years ago.

The Poetics of Bureaucracy

Ben Recht — Thu, 26 Mar 2026 14:15:02 GMT

No conference taking a broad view of contemporary culture can escape the bureaucracy sickos (laudatory). Bureaucracy, with the complex social relations it codifies and entails, is one of the most salient aspects of our culture. Bureaucracies box in massively complex bodies of information through standardization, measurement, and policies. Computers are amazing. They are also the physical embodiment of mass bureaucracy. And no computing technology is more bureaucratic than the large language model.

Several talks at the Cultural AI conference threaded together the complexities of language models and bureaucracy. Henry Farrell kicked things off with a characteristically fantastic talk, describing his evolving view of AI as cultural and social technology. He introduced the notion of “coarse graining,” a new angle he’s working on with Cosma Shalizi.

In physics, coarse graining means “averaging out” a lot of complexity to leave you with bulk behavior that describes useful things. Arguably, it’s how you go from quantum field theory to atomic theory to the ideal gas law.1 There are levels of approximations, and details are lost in the transitions between layers. However, this loss of detail is often worth it because stacking abstractions lets us think simply inside clean layers. Moreover, surfacing coarse graining helps us understand what to look for when one level of description doesn’t suffice to describe observed phenomena.

For Farrell, bureaucracies, democracies, and markets are cultural coarse grainings. Bureaucracy establishes relations between parts such that management at one particular location in an organizational web can make decisions without having to understand the fine details at all other locations. It creates a distribution of decision making, simultaneously bound and freed by rules. We can see LLMs as coarse grainings that allow us to access mediated linguistic relationships between end users and the cultural material on which they were trained.

Good bureaucracy should provide constraints that deconstrain.2 However, so often bureaucracy, in its taming of complexity, obscures sources of power in cultural relationships and the human agency behind decision making. Lily Chumley and Abbie Jacobs both spoke to different angles of this concealment.

Through the lens of linguistic anthropology, Chumley described how language models obscure contractual relationships underlying enterprise software. The primary interaction with language models is through the chat box. When we squeeze our demands into prompts and skill files that use the institutional language of management, we are mimicking the casual nature of Erving Goffman’s “open state of talk” with a computer. The interaction feels personal rather than transactional. However, your interactions with all of the work software are contingent on inscrutable vendor contracts with complex webs of accountabilities, restrictions, and obligations. The employee is left with only a chat interface that has been RLHFed into a servile caricature of a 1950s secretary. This erases the heavily surveilled, legally bound, hyper monetized relationships between corporate behemoths.

Chumley illustrated this through the SAAS web on the academic campus. Though we feel like we’re working with LLMs like they are other co-workers:

Every interaction with an LLM or web interface portal or training is mediated by a complex contract with giant corporations, be they Elsevier (who own Interfolio), Salesforce, SLATE Technolutions, Google, Microsoft, NVIDIA, OpenAI, or Anthropic. It is a move of power away from people to a fabric of capital. Gideon Lewis-Kraus commented that these power shifts from engineering to capital have been symptomatic of post-Cold War America and have had dire consequences, as in the example of Boeing.

Chumley extended her contractual analysis to the bureaucratic war machine that Kevin Baker has been so eloquently writing about. Big Tech owns AI, so this poses complex risks to the financial order as these companies are too big to fail. And yet, Big Tech is really small compared to the state. The relationships between the tech companies and the government established through military contracting are geopolitical. This means that even if we had a functioning Congress,3 the regulation of military AI would be ensnared in transnational agreements. Not only is the use of AI in warfare a smokescreen to avoid talking about the people who control decisions of violence, but it further entangles geopolitics in a big contractual mess.

From the perspective of measurement theory, Abbie Jacobs discussed how the language of governance, when coarse-grained into AI, creates new meaning. Jacobs argued that operationalizing language in the context of governance always requires conceptualizing how to measure those concepts. And this measurement and quantization are often not talked about by those doing the coding. We see this sort of talk about computing systems all the time. Words like “high-quality,” “relevant,” “toxic,” “harmful,” “age-appropriate,” “safe,” “responsible,” “fair,” “intelligence” are turned into rigid measurements by communities of coders, researchers, and policymakers. This operationalization through bureaucratic technology creates a new kind of coarse graining in which words gain meaning through their institutionalization. Arguments at this operationalized level themselves become exclusionary. Jacobs leans on measurement theory from the quantitative social sciences, arguing that “Measurement is the (usually hidden, implicit, diffuse) process through which these concepts are instantiated and made real.”

Measurement itself is governance. I associate this assertion with Theodore Porter, though he’d probably credit Horkheimer and Adorno’s Dialectic of Enlightenment. Jacobs argues that we have to bring such measurement to the surface of social technology before we go about asking our coding agents to coarse-grain it. If we can uncover the measurement process itself, then these hidden webs of governance perhaps become more legible to all of us caught in the middle. By fighting about operationalization, you are implicitly fighting about values. You are fighting about how the state sees you.

This will be my last dispatch on the Cultural AI conference for now. I don’t think I fully did justice to the speakers’ arguments or to the discussion at the conference, but the talks will be available on YouTube soon.

I’ll close with a few thoughts about “conferences” more generally. We use the same word to describe an academic gathering of ten people as fifty thousand, but those meetings couldn’t be more different. The one thing I wish we were better at was marking the proceedings of these small workshops in some non-ephemeral state. There is value in simply getting people in a room and then seeing influential intellectual artifacts manifest in later work. Some conversations are better when everyone knows there will be no permanent record. Not every conversation needs to become an Overleaf. Still, capturing something about the moment has value, too. I guess Max and I are blogging a bit, and that’s not nothing. There will be YouTube videos, as I have mentioned. But I’ve been thinking a lot about what it would mean to organize, archive, and coarse grain these small moments of intellectual discourse. To be continued.

Real heads know that jumping between these abstraction levels is far less cut and dried than the physicists want us to believe.

Feel free to share examples of good bureaucracy in the comments.

LOL.

Information Transit Got the Wrong Man

Ben Recht — Tue, 24 Mar 2026 14:36:41 GMT

In case you hadn’t heard, people are using LLMs to create their peer reviews. I know, you’re as shocked as I am. To their credit, the program committee of the International Conference on Machine Learning (ICML) has been doing things to address the problem. Their attempts reveal the systematic problems here that are unfixable without a dramatic teardown.

Let’s recap the situation, though even typing it out makes me feel like a character in a Terry Gilliam movie.

In November 2025, the PC sent a series of surveys to past ICML reviewers to gauge their sentiment about LLMs in reviewing. They assumed that this list must also include a bunch of authors because they mandated a reciprocal reviewing policy for the 2025 conference, under which all submissions must have had at least one author who agreed to serve as a reviewer. In their final survey, they proposed the following policies for LLM reviewing at future conferences, and asked for preferences:

Policy A (Conservative): Use of LLMs for reviewing is strictly prohibited.
Policy B (Permissive): Reviewers may input the submission text into privacy-compliant LLMs. However, the assessment of the paper and the writing of the review must not be delegated to LLMs.

A random sample of 500 reviewers received the survey. 74 (15%) answered it. And the survey says!

This plot uses default matplotlib colors, so it must be science. I’m not sure what this measures, but whatever, surveys are the democratic way to make policy, amirite? Since the results weren’t unequivocal, they decided to make both policy options available in 2026. That’s democracy.

In the multistage reviewer enrollment process for the 2026 conference, they gave reviewers the option of adhering to either Policy A or Policy B or saying they didn’t care. The people who didn’t care were then assigned to one of the two policies in one of the many absurdly long emails they send you when you participate in these conferences.1

Now here’s where it gets weird. The program committee decided to run a sting operation, watermarking the PDFs to trap people who uploaded them directly into LLMs. You can read about their ornate sting operation here. The details are boring. What’s interesting is their response. If they found that someone assigned to Policy A used LLMs in reviewing, then they rejected all papers of which that person was an author. They had 800 reviews that they flagged as violating Policy A. They checked every one of these by hand to avoid false positives. Every. Single. One. Because that’s a good use of human manual labor. They ultimately rejected 500 papers associated with these naughty reviewers.

Everyone is patting themselves on the back about this, and tut-tutting those horrible people who dared violate the sacred random assignment of policy, and thanking the committee for their cleverness and transparency. But come on, this whole thing is absurd. The PC argues:

“We hope that by taking strong action against violations of agreed-upon policy we will remind the community that as our field changes rapidly the thing we must protect most actively is our trust in each other. If we cannot adapt our systems in a setting based in trust, we will find that they soon become outdated and meaningless.”

I’m sorry, but it’s already meaningless! ICML received over 33000 submissions. A random subset of 20-25% of these will be approved as “papers acceptable to go on one’s CV.” The process will churn forward. Everyone who attends the conference knows this process is impossibly bad, but the only proposed solutions make the paper-generation process more onerous for humans. This naturally leads people to offload work to LLMs. Next year, people will use watermark detection before they put the PDF into ChatGPT. The wheels of progress will continue rolling.

It’s unfortunate that the natural bureaucratic editorial mode is to assume everyone is cheating and to go on witch hunts to claim progress. The board wrote:

“Conferences must adapt, creating rules and policies to handle the new normal, and taking disciplinary action against those who break the rules and violate the trust that we all place in the review process.”

Psychology took this sort of approach in the 2010s. Though we all got to revel in the high-profile fraud schadenfreude, the field did not come out better for it.

Rejecting 500 of 33000 papers because of possibly inappropriate LLM use, when your LLM use policy is based on the most bizarre, arbitrary decision-making built upon a semblance of objective quantitative social science, is farce. At this point, the AI reviewing process can be nothing but farce. As Kevin Baker succinctly put it in his authoritatively inflectional essay on AI for science:

Systems can persist in dysfunction indefinitely, and absurdity is not self-correcting.

One nice thing about LLMs is that they show us which parts of our systems of intellect are mechanical traditions. LLMs are a good way to stress-test our systems for organizing experience and expertise. We’ll need to be more creative about what we want to do moving forward.

Moving forward requires us to talk more about the point of peer review. Yes, the AI conferences are the most absurd manifestation of this problem, but don’t think that your community is insulated from rampant LLM reviewing. At the Cultural AI conference, Mel Andrews showed us dozens of headlines across academia advocating for LLM review. Arguing that LLM review was better than human review. There are economists launching startups to do this as a service.

Andrews argued that the arguments in favor of LLM reviewing consistently conflated institutional and epistemic concerns. The institutional concerns are well known to us. Reviewing is an enormous burden of unpaid labor that further enriches rent-seeking publication houses, and reviewership is unfairly distributed across academia. The epistemic concerns worry that peer review doesn’t properly weed out invalid papers. At least in the sciences, peer review is supposedly meta-epistemic, judging the validity of papers that aim to get at scientific knowledge, understanding, and explanation. Many studies have found the current state of peer review unfit for this task.

Advocates for LLM peer review argue it solves both problems. Andrews took a hard line, claiming that it can’t solve the epistemic problems. Andrews’ boldest claim is that the relationship of the text generated by LLMs to semantic content and truth is always accidental or incidental. Hence, the mechanical aspects of peer review can only increase confabulation and error. Following tradition means not having to think, but peer review’s epistemic function demands thinking.

I don’t fully endorse Mel’s argument, but it’s a position worth airing and engaging with. By focusing solely on process and rules, tweaks to peer review make it more mechanical. Mechanization only makes LLMs better suited for the job. If epistemic cultivation of expertise and experience demands something beyond tradition, then more complex systems of checks and balances only stifle it.

Program committees in computer science used to be small groups of people who met in person to discuss every paper that would be presented at a conference. They are now ministries of truth that haruspicate the statistics of poorly designed surveys and build ornate policies for the masses. Program committees have become bureaucrats of the state, and they are forced to see like it. The bureaucratization of academic work product threatens its very epistemic nature. Perhaps the fix has to arise from spontaneous order.

Here’s an email I received this morning asking me to be an area chair for one of the other big conferences, NeurIPS. My AI detector flagged this email as “highly likely AI generated.”

Cosma Shalizi Is Aware of All Internet Traditions

Ben Recht — Fri, 20 Mar 2026 14:07:28 GMT

I’ve been wanting to write a summary of the Cultural AI conference I attended at NYU last week, but I’ve been struggling to succinctly capture my thoughts. That’s indicative of the depth and complexity of how AI meets culture, and the different perspectives and disciplines might not lend themselves to a tidy summary.1 As I often do when trying to wrap my head around complex things, I will stop worrying and just blog through it.

The talk that serves as my hub in the complex network of cultural AI is Cosma Shalizi’s “Aware of All Internet Traditions: Large Language Models as Information Retrieval and Synthesis.” That language models simultaneously retrieve information and synthesize new content isn’t controversial. Nor is the fact that this synthesis is formulaic. The current synthesis is next-token prediction trained on all written information, whose output is warped by some selective post-training. By design, language models mechanistically reproduce the recurring regularities in their training data. That training data consists of all the text files on the internet and what is easily available in printed books. Hence, the regularities are the tropes, stereotypes, templates, conventions, and genres of language and code.

The formulaic generation of discourse looks like discourse in ways we could never have imagined. But with hindsight, we shouldn’t be surprised. Human culture is very formulaic! There are long-standing formulas for oral tradition, for generating small talk, or for generating scientific papers. As Cosma put it, in the single sentence that summarizes the entire Cultural AI conference:2

Following a tradition means not having to think for oneself.

Not having to think is often a good thing! Tradition lets us externalize certain processes so we can focus on other tasks. Formalities strengthen cultural connections. Traditions in communication help us understand each other better and come to consensus faster.

Indeed, our vast externalized cultural intelligence is the jewel of human tradition. Cosma cites Jacques Barzun’s conception of the House of Intellect: intellect is the communal form of society’s intelligence. “[I]t is intelligence stored up and made into habits of discipline, signs and symbols of meaning, chains of reasoning and spurs to emotion — a shorthand and a wireless by which the mind can skip connectives, recognize ability, and communicate truth.” According to Barzun, intellect lets society share and externalize knowledge. It belongs to society, not any individual. It connects individual intelligences. It lives after any single intelligence dies.

GenAI is the mechanization of this intellect. It is the mechanization of all of our traditions.

With James Evans, Henry Farrell, and Alison Gopnik, Cosma has been preaching the gospel that AI is a cultural technology for several years. He’s gone through several iterations of what that means and what it implies, but mechanized tradition is the characterization that resonates most with me. Mechanized tradition of Barzun’s artificial intellect is a far better description of GenAI technology than “artificial intelligence.” This frame helps us get away from the silly C-suite sci-fi navel-gazing about the personalities inside the data centers. Claude is not a person. It is a mechanized intellect. A Lore Laundering Machine. The frame of mechanized tradition helps me build a social metascience of our LLM condition.

Let me give you a fun example.

In the same session as Cosma, Wouter Haverals gave a rhizomatic inspection of the tradition of literary style. What is style anyway? We love to ask LLMs to write in new styles. It’s funny to have it generate poetry. One of my most common queries is how to rewrite emails to sound less angry.

But humans are also great at mimicking style. It can be a fun, creative game to do the sort of rewriting we now task AI with. And our audience can all tell when something hits or misses the mark when we ape a particular tradition.

Wouter introduced Raymond Queneau’s Exercices de Style, a book consisting of 99 rewritings of the same story in different styles. The main story is simple enough. Here’s Barbara Wright’s 1958 translation:

In the S bus, in the rush hour. A chap of about 26, felt hat with a cord instead of a ribbon, neck too long, as if someone’s been having a tug-of-war with it. People getting off. The chap in question gets annoyed with one of the men standing next to him. He accuses him of jostling him every time anyone goes past. A snivelling tone which is meant to be aggressive. When he sees a vacant seat he throws himself on to it.
Two hours later, I meet him in the Cour de Rome, in front of the gare Saint-Lazare. He’s with a friend who’s saying: “You ought to get an extra button put on your overcoat.” He shows him where (at the lapels) and why.

Queneau rewrites this story in the past and the present. In reported speech. In the passive voice. In haiku.

Summer S long neck
plait hat toes abuse retreat
station button friend

Obviously, you can feed an LLM Queneau’s original story and prompt it to write in each of the prescribed styles. Can an LLM capture the style? How could you know that the LLM did a good job?

The only way to answer such questions is to lean on the tradition of vulgar positivism. In a delightfully recursive metanarrative,3 Wouter and his co-author Meredith Martin ran a survey experiment. On the platform Prolific, they asked “real people” a series of questions about Wright’s translations of Queneau’s original stories and AI-generated versions. They ran several variants. In one, mimicking the style of Kevin Roose’s mechanistically obnoxious New York Times quiz last week, they didn’t tell the participants how the two stories were generated and simply asked which better captured the style. In the second variant, they asked which captured the style better when the participants knew whether the story was Queneau’s or AI’s. In the third experiment, they asked for preferences with the true labels switched.

What happened next was entirely predictable. Without attribution of authorship, the “people” on Prolific slightly preferred the AI version, choosing the AI 55% of the time. There were 186 participants and 930 pairwise judgments, so statistical tradition would spew out a margin of error somewhere between 3 and 7 percentage points, depending on the pedantry of Reviewer 2. Make of that what you will. On the other hand, with the correct labels, “people” only chose the AI 48% of the time. Most hilariously, when the labels were swapped, “people” chose what they thought was human 62% of the time.
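For the curious, here is a rough sketch of where numbers like that come from. It assumes the textbook normal-approximation (Wald) interval at 95% confidence, and the only thing that changes between the two runs is whether you count the 930 judgments or the 186 participants as the independent units. Neither choice is taken from the paper; they are just the two extremes a reviewer might argue for.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Returns (low, high, half_width). z=1.96 corresponds to roughly 95%.
    """
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width, half_width

p_hat = 0.55  # share of unlabeled comparisons won by the AI version

# Treat every pairwise judgment as independent (optimistic) ...
low_j, high_j, half_judgments = wald_interval(p_hat, 930)
# ... or treat each participant as one unit (conservative).
low_p, high_p, half_participants = wald_interval(p_hat, 186)

print(f"n=930: {low_j:.3f} to {high_j:.3f} (half-width {half_judgments:.3f})")
print(f"n=186: {low_p:.3f} to {high_p:.3f} (half-width {half_participants:.3f})")
```

Under these assumptions the half-widths land near 3 and 7 percentage points, and the conservative interval comfortably includes 50%.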

To situate these numbers within our broader house of intellectual tradition, Haverals and Martin adopted a recently instituted social-scientific tradition: silicon sampling. They ran a survey experiment where the participants were LLMs. When prompted with the same survey, LLMs chose AI-writing 50% of the time without labels. But with the correct labels, the machines judged Queneau superior 70% of the time. And with the swapped labels, AI chose what was presented as Queneau 64% of the time. As the title of Wouter and Meredith’s paper says, “Everyone prefers human writers, even AI.”

There’s nothing surprising in these survey results, and that shouldn’t be surprising. Survey experiments are a woefully limited way to understand the social condition. They are completely mechanical. Of course, this sort of impoverished social science can be done by mechanical literary analysis. Silicon-sampled survey experiments enable us to mechanically generate stories from illusory correlations. These stories are interpreted traditionally as either informative or absurd, depending on the academic tradition in which you were raised. The recursion continues indefinitely. There are so many patterns and regularities in human behavior, and by simulating common text strings, we get text conforming to these regularities. To rephrase Nelson Goodman, regularities are where you find them, and in human tradition, you find them everywhere.

That said, Maxim Raginsky gave a fun synthesis talk on assemblage, feedback, and cybernetics at the end of the conference. I hope he writes up his expletive-laden thoughts on The Art of the Realizable.

I wrote Cosma asking whether that quote was a Shalizi-ism or if I was misattributing it. He replied, “It’s not a conscious quotation on my part, but wouldn’t it be better if it was?”

This blogpost is all recursive metanarrative.

Small World Models

Ben Recht — Mon, 16 Mar 2026 14:19:53 GMT

This is a live blog of Lecture 6 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.

The backbone of control engineering is the assumption of a reasonably reliable, reasonably simple dynamical system with inputs and outputs. We have to believe that the behavior of the thing we want to steer is consistent enough so that whatever we design in the lab will work on the road.

Now, what exactly do I need to know about this reliably consistent dynamical system to get it to do what I want it to do? You want a model of the system that is rich enough to describe everything you might ever see, but small enough so that you can computationally derive control policies and, perhaps, performance guarantees. Simpler system descriptions yield simpler state-estimation and control-design algorithms, both online and offline. What’s the right balance between modeling precision and simplicity? This is the question of system identification.

System identification is the natural place where machine learning and statistics meet control engineering. You need to either estimate parameters of models you believe are true or build predictions of how the system will respond to a string of inputs. What sort of statistical infrastructure you need depends on your control engineering task.

Sometimes you really believe that for all intents and purposes, the system behavior is captured by simple differential equations. An object falling in space without friction will obey Newton’s laws. To identify this system, you just need to measure the object’s mass. Easy peasy.

For more complicated mechanical systems, like quadrotors or simple, slow-moving wheeled vehicles, you can still get away with relatively simple modeling, where the geometry of the problem gives you a nice differential equation with a few parameters determined by the build of your drone. In these cases, where you don’t need particularly high performance, you need only break out the ruler and the scale to guesstimate all of the parameters.

Sometimes you can’t determine the parameters from simple measurements, as environmental conditions dictate their values. For example, coefficients of friction might depend on temperature and the particulars of the flooring. Now you can find the parameters by repeatedly testing your system and running a nonlinear regression to minimize the input-output error.
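
Here is a minimal sketch of that kind of fit. Everything in it is an assumption for illustration: a toy velocity model with quadratic drag, a made-up drag coefficient `c_true`, synthetic noisy measurements, and a brute-force grid search standing in for a proper nonlinear least-squares solver.

```python
import numpy as np


def simulate(c, u, dt=0.05, v0=0.0):
    """Roll out the toy model v[t+1] = v[t] + dt * (u[t] - c * v[t] * |v[t]|)."""
    v = np.empty(len(u) + 1)
    v[0] = v0
    for t, u_t in enumerate(u):
        v[t + 1] = v[t] + dt * (u_t - c * v[t] * abs(v[t]))
    return v


rng = np.random.default_rng(0)

# "Repeated testing": drive the system with a known input and record noisy outputs.
u = rng.uniform(0.0, 2.0, size=200)
c_true = 0.7  # hypothetical drag coefficient, unknown to the fitter
v_meas = simulate(c_true, u) + 0.01 * rng.standard_normal(len(u) + 1)

# Nonlinear regression: choose the c whose simulated output best matches the data.
candidates = np.linspace(0.0, 2.0, 401)
errors = [np.sum((simulate(c, u) - v_meas) ** 2) for c in candidates]
c_hat = candidates[int(np.argmin(errors))]
print(f"estimated drag coefficient: {c_hat:.3f}")
```

The error is nonlinear in `c` because the whole trajectory is re-simulated for each candidate. With more than one unknown parameter, you would swap the grid for a gradient-based solver.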

As your problem gets even more complicated, maybe you don’t want to bother building a sophisticated simulator and would be perfectly fine with a “black-box” prediction of outputs from inputs. We’ve developed a zoo of methods to do this sort of prediction. The simplest are the “ARMAX” models, which predict outputs as a linear combination of the past inputs and outputs. You can fit these using least squares. If you want to be fancy, you can even compute nice “state-space” models from these linear ARMAX models, using a family of methods that are called subspace identification. This will yield smaller models and simplify your control synthesis problem.
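
As a rough sketch of the least-squares step, here is the simplest special case: a first-order ARX model, with no moving-average noise term. The coefficients `a_true` and `b_true` and the noise level are made up to generate synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a hypothetical system y[t] = a*y[t-1] + b*u[t-1] + noise.
T = 500
a_true, b_true = 0.8, 0.5
u = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = a_true * y[t - 1] + b_true * u[t - 1] + 0.05 * rng.standard_normal()

# Each row of the regressor matrix holds the past output and past input.
Phi = np.column_stack([y[:-1], u[:-1]])
target = y[1:]

# Ordinary least squares: minimize ||Phi @ theta - target||^2.
theta, *_ = np.linalg.lstsq(Phi, target, rcond=None)
a_hat, b_hat = theta
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```

Higher-order models just add more lagged columns to `Phi`. A full ARMAX fit with correlated noise needs more than one pass of ordinary least squares.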

On the other hand, you can go in a completely different direction and make your time-series predictor nonlinear. You can use a neural network to predict the next output from your history. If you want to get extra fancy, throw a transformer at the problem. I’m sure this will work great and build the best simulator without knowing anything about the problem at all.

So what’s the right level of modeling granularity for your problem? I don’t have a clear answer. In optimal control, the better your estimate, the better your performance. But maybe you care about the minimal amount of information you need to control something. How much is it?

You might think none. We’ve seen in class already that two systems that look completely different in open loop look the same in closed loop. Feedback can correct modeling errors. The simplest example is

Input u[0]=1, and the “x” variable goes to infinity, but the “z” variable goes to zero. However, under the negative state feedback rule “u[t]=-x[t]”, the systems are identical

which both quickly converge to zero.
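
To see the flavor of this numerically, here is a small sketch with a hypothetical pair of scalar systems. These are my own illustrative choices (`a = 1.1` and `a = 0.9`), not necessarily the equations from lecture. In open loop, one blows up and the other decays. Under the feedback `u[t] = -x[t]`, both collapse to zero almost immediately.

```python
def rollout(a, feedback, steps=30):
    """Simulate x[t+1] = a * x[t] + u[t] from x[0] = 0 with a unit impulse at t = 0."""
    x = 0.0
    trajectory = []
    for t in range(steps):
        impulse = 1.0 if t == 0 else 0.0
        # Closed loop adds negative state feedback on top of the impulse.
        u = impulse - x if feedback else impulse
        x = a * x + u
        trajectory.append(x)
    return trajectory


open_unstable = rollout(1.1, feedback=False)
open_stable = rollout(0.9, feedback=False)
closed_unstable = rollout(1.1, feedback=True)
closed_stable = rollout(0.9, feedback=True)

print("open loop, final values:  ", open_unstable[-1], open_stable[-1])
print("closed loop, final values:", closed_unstable[-1], closed_stable[-1])
```

In this toy pair, the closed-loop dynamics are `x[t+1] = 0.1 * x[t]` and `x[t+1] = -0.1 * x[t]`. They are not literally identical, but you would be hard-pressed to tell them apart after a few steps.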

Negative feedback is powerful and can drive solid performance in the face of huge model uncertainties. If you simply care about robust tracking or homeostatic behavior, perhaps you can get away with the most minimal system identification. Unfortunately, it’s not quite that easy. You can have two systems that look the same in open loop but have completely different closed-loop behavior. Karl Astrom has a relatively simple example that I described in an earlier post. There, one system has a filter between the controller and the state that slightly attenuates the frequencies needed to stabilize the system.

Now the question is whether Astrom’s pathological counterexample, where two systems look similar in open-loop simulation but are catastrophically different under feedback, is indicative of widespread problems. Probably not. I’m not convinced that you have to learn sophisticated robust control for most small-scale robotics demos. (Sorry, John, though complex aerospace systems are certainly another story.) I think the takeaway from Astrom’s examples is that your model should represent the sorts of disturbances and signals you expect to see out in the world. And it should be cognizant of the fact that you are going to use the model in closed loop, so you have to understand whether there are delays and noise between the actuation signal and the actuation action.

Of course, this makes sense to any graduate student who has worked on a real robot. Every robotics grad student I’ve spoken to has told me that investing the time in system identification makes the robotic performance infinitely better. Sometimes we have to sit with our dynamical systems for a long time before we know what we need to control them. Understanding what it means for our models to be good enough is the tricky part.

Benchmarking Culture

Ben Recht — Tue, 10 Mar 2026 13:58:32 GMT

What’s been clear so far about this conference on Cultural AI is that the organizers were interested in a broadly construed definition of AI and Culture. That works for me, as my talk ended up being about two ways of construing the culture of benchmarking. Here’s a summarized version of what I said.

I’ve been a machine learning researcher for nearly 25 years now (yikes), and I opened with a slide describing machine learning that I originally made in 2003.

It still works. Machine learning is prediction from examples, and that’s it. You have some blob of stuff that you call X’s. You have some blob of stuff you call Y’s, and you build computer programs to predict the Y’s from the X’s. The key thing that makes machine learning algorithms different than other kinds of predictions is that you deliberately try to bake in as few assumptions as possible, other than the fact that you have examples.

I find the online discourse castigating those who say “LLMs are just next token predictors” beyond annoying. They are just next token predictors. And that’s fascinating.1

The fascinating part comes in convincing yourself that your function works on new examples. How do you do that? Anybody who has read David Hume knows you can’t do it with formal proof. We convince ourselves through a particular system of evaluation. And then we built an entire engineering discipline on top of this.

Now, what is evaluation? In our 2025 course, Deb Raji and I adapted Peter Rossi’s definition, which he developed for social scientific program evaluation.

Evaluation is measuring the difference between articulated expectations of a system and its actual performance.

This definition seems reasonable enough, but in a world obsessed with quantification, this sets into motion an inevitable bureaucratic collapse. If you want to make your evaluation legible and fair to all stakeholders, you must make it quantitative. If you want to handle a diversity of contexts, you must evaluate on multiple instantiations and report the average behavior. Quantification has to become statistical. And once you declare your expectations and metrics, everything becomes optimization. Evaluation inevitably becomes statistical prediction.

This bureaucratic loop swallows up not just social scientific program evaluation but engineering evaluation more broadly. If you are calculating mean-square errors, you’re shoehorning your evaluation into statistical prediction. Everyone loves to lean on the artifice of clean statistical facts. Once you have set this stage, machine learning is practically optimal by definition.

Machine learning as a discipline has no foundation beyond evaluation. This is a descriptive, not normative statement. The most successful machine learning papers work like this: I say that Y is predictable from X by Method M, and you should be impressed. I then make billions of dollars in a startup. Maybe I have to tell a story about how Method M relates to the brain or mean-field approximations in statistical physics. Fantastic stories don’t seem to hurt.

Now, here’s an invalid AI paper, which a lot of critics like to write: “Y is not predictable from X.” It is impossible to refute this claim. You can’t even refute it for simple methods, because what’s gonna happen is some high school kid is gonna go and change the rules slightly and prove you wrong. Then he will dunk on you on Twitter, gleefully writing “skill issue.”

The logical reconstruction makes the logical positivists roll over in their graves. The field is fueled by pure induction. We progress research programs by demonstration alone. And the way we convince others that our demos are cool is by sharing data and code.

Core to machine learning is the culture of datasets. I’m not sure if some poor soul is still trying to update this Wikipedia page, but the field thrives on shared data with common tasks. The datasets give you an easy path to impress your colleagues. You can argue about the novelty of your method M, which achieves high accuracy on a dataset that others agree is challenging.

People have turned datasets into literal competitions, starting with the Netflix Prize, moving to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and ending with a company that hosts hundreds of competitions. Not to get all monocausal on you, but the ImageNet Challenge is why we use neural networks today instead of other methods. More fascinating is that we declared protein folding solved because AlphaFold did really well on a machine learning competition. Competitive testing on benchmarks can produce Nobel Prizes.

You might be put off by how everything in machine learning becomes an optimization competition. But let’s applaud machine learning for its brutal, ingenuous honesty about how researchers are driven by ruthless competition. If you want to read more on how this works in practice, read Dave Donoho’s construction of frictionless reproducibility, my concurrence and analysis of the mechanism and costs, the history Moritz Hardt and I lay out in Patterns, Predictions, and Actions, or how it fits into the bigger story of The Irrational Decision.2

Perhaps the weirdest transition of the last decade was a move from dataset evaluation to “generalist” evaluation. Tracking the GPT series of papers is instructive. GPT2 declared language models to be general-purpose predictors, but OpenAI made their case with rather standard evaluations and metrics. GPT3 moved on to harder to pin down evaluations in “natural language inference,” but there were still tables with numbers and scores. With GPT4, we didn’t even get a paper. We got a press release formatted like a paper, which is fascinating in and of itself.3 That press release bragged about the LLM’s answers on standardized tests.

Of course, this got people excited, resulting in breathless press coverage and declarations of the end of education and white-collar work. Hypecycles are part of culture, too. Part of the goal of predicting Y from X is impressing people, and the results were very impressive. Based on the reaction, you can’t say that GPT4 didn’t surpass people’s expectations.

Now, if you live by the evaluation, you die by the evaluation. Notably, Facebook nuked their AI division after flopping on GPT4-style evaluation. Not only did its userbase think the model sucked, but they were caught cheating on their evaluations, too. In a flailing attempt to recover, Facebook went out and spent $14 billion to buy some random AI talent willing to report to King Z, and they’ve thrown orders of magnitude more at their subsequent AI investments. And what did this buy them? Literally Vibes.

We’re in this fascinating world now where research artifacts are consumer products, and the evaluation is necessarily cultural. Nonprofits funded by the same dirty money that funds AI companies might argue that they can measure the power of coding agents with objective statistical evaluations of yore. But agents are evaluated by coders’ experiences and managers’ fever dreams. I wrote this a year ago, and it remains true today: Generative AI lives in the weird liminal space between productivity software and science fiction revolution. The future of generative AI will be evaluated with our wallets. No leaderboard will help us do that.

It should go without saying that what we interface with via most corporate APIs is much more than LLMs now. That’s a topic for another day.

Out today! w00t.

Culture!

Minimal Updates

Ben Recht — Mon, 09 Mar 2026 13:59:34 GMT

This week, I’m attending a workshop at NYU organized by Tyler Shoemaker and Leif Weatherby called “Cultural AI: An Emerging Field.” What is Cultural AI? I’m not sure yet, but this workshop is going to find out. At this point in my career, the invite list is far more important than the planned topic, and this list is top notch. My brief talk will be about the culture of benchmarking in machine learning and how weird it is now that we’re trying to benchmark culture. I’ll let you know what the group synthesizes later this week.

In other exciting news, The Irrational Decision comes out tomorrow! Get it wherever you buy books. You can even get it in one of those atavistic bindings of printed pages wrapped in a pretty dust cover. I hear they look good in the background of your podcast.

I should say something longer about the book, but I want to see how release day moves me. So until tomorrow, here’s what I wrote about it last fall. Technology Review wrote a nice review that companioned it with two other great books, The Means of Prediction: How AI Really Works (and Who Benefits) by Max Kasy and Prophecy: Prediction, Power, and the Fight for the Future, from Ancient Oracles to AI by Carissa Véliz. I need to read both of these. In the interdisciplinary journal Critical AI, Jathan Sadowski further contextualized the book within the broader sphere of science studies.

Finally, for my friends and readers in the Bay Area, the folks at Bloomberg Beta and Reboot are sponsoring a launch event for The Irrational Decision on March 17 in San Francisco. If you’re in the area and interested, you can find more information and register here. Space is limited, but it would be great to meet you in person.

For All The Cows

Ben Recht — Fri, 06 Mar 2026 15:01:34 GMT

This is the final part of a blogged essay “Steampunk Data Science.” A table of contents is here.

To be a practicing scientist, you need to maintain a state of vigilant cognitive dissonance. If you get too deep into the history and philosophy of your discipline, it becomes very hard to keep writing papers. A little perspective reveals the absurdity and irrationality of the system. And once you see it, you can’t unsee it. Participating requires a lot more effort.

Of course, I speak from experience. Here’s my super brief history. My research from 2015-2018 convinced me that machine learning was lying to itself about its foundations. I went to other disciplines for answers. I first looked to statistics and disappointingly found its epistemology even more confused than machine learning’s. David Blei sent me David Freedman’s Statistical Models and Shoe Leather, and I became a Freedman completionist. The last chapter of Freedman’s posthumously collected works, Statistical Models and Causal Inference, is a sequel to this famous Shoe Leather essay. It details several examples of “scientific shoe leather,” and devotes a page to the history of vitamins. I was a bit surprised that he wrote a lot about Eijkman, but didn’t mention Wisconsin. I wondered what else was missing in this summary. And here we are, six years later.

Though I wrote the first draft of this essay four years ago, I have never been able to figure out what to do with it. Initially, I thought it would be part of The Irrational Decision, but that project took me in a very different direction. The Irrational Decision is about science, engineering, and decision-making after the computer. This story was about how we did things before the standardizing forces of the computer and statistics. It didn’t cleanly fit, and I ended up jettisoning this chapter.

As a standalone piece, Henry Farrell warned me that it lived in an uncanny valley. The writing wasn’t academic enough to get published in a journal or pop enough to appear in a magazine. But you know where that sort of stuff can go? Directly to you, my argmin readers. Maybe this blog is the uncanny valley between NeurIPS and The New Yorker. I should make that the masthead.

Whatever the case, this piece not finding an external home is fine. Writing through a project is an exercise in itself. I finished this essay before I started substacking, and the recurring themes on here are built upon my research and writing about vitamins. This unpublished project laid the foundation of a research program. We’ll see where it ends up.

Now, for all of you scientists reading this, I’m not suggesting you follow me down this path. They’re uncommon, but scientific breakthroughs still happen, and looking for them can be thrilling. Keep looking.

The most dramatic plot in this essay is F. Gowland Hopkins’ crossover curves, in which he showed that some factor in milk was needed to sustain the growth of rats.

You might look at that plot and say, “Gee, it would have been nice to be a scientist back in 1900, as all of that low-hanging fruit is gone.” You might think that we never find such clear success stories in our modern, complex world. This couldn’t be further from the truth. I mean, this plot is from 2021:

This graphs the effect of starting and stopping semaglutide (Ozempic). The evidence of a revolutionary breakthrough in the management of weight loss tracks the average growth with a visualization scheme identical to F. Gowland Hopkins’ plots from a century earlier.

Yes, interventions like GLP-1 agonists come along rarely. But they do come along!

And for all of our results that are not robust and large? We should accept that, though they are likely illusory, the many small results help us build a scientific record. They form a scattered pile of puzzle pieces that we can all try to assemble. It’s only from the incremental pieces that aren’t precisely reproducible or clean that we can see the big picture. Don’t worry about the distribution of p-values. Do think hard about how to produce reproducible data and coding artifacts.

I’ll repeat what I said yesterday. The most important thing we can learn from the discovery of vitamins is that discovery starts with a mess. It’s in learning from the mess that we find the undeniable effects that completely transform our understanding.

Oh, one more thing. You might be wondering what happened to those cows in the Single-Grain Experiment. The cows were fed their monotonous diets for years, and the research team closely monitored the cows’ growth and health. They weighed each animal monthly and took a photograph once every six months. The cows all grew at similar rates, but there were noticeable differences in appearance, offspring, and milk production.

The corn-fed animals looked smooth of coat, fuller through the barrel; and as expressed by experienced feeders and judges of domestic animals, they were in a better state of nutrition. On the other extreme stood the wheat-fed group with rough coats, gaunt and thin in appearance, small of girth and barrel, and to the practiced eye, in rather a lower state of nutrition.

While those fed on corn gave birth to healthy calves, the wheat-fed cows’ offspring all died within a day. The milk of the cows had different fat contents depending on the diet. The corn-fed had somewhat less fat than the oat-fed, but the wheat-fed had almost no fat content whatsoever. Wheat alone was a perplexingly poor diet for cows.

Elmer McCollum was no dummy. He and Marguerite Davis began publishing papers about their rat colony well before the Single-Grain Experiment had finished. Hart eventually terminated the Single-Grain Experiment in 1911, and the team published their results in the Proceedings of the National Academy of Sciences, several years after McCollum and Davis announced their discovery of Vitamin A and confirmed the existence of Vitamin B.1

The Wisconsin researchers never figured out what was wrong with the wheat. Following McCollum and Davis’ discovery of Vitamin A, they tried adding butter to the rations, but this didn’t seem to improve the cows’ health. Their most likely hypothesis was that there was something toxic in their wheat ration. In his autobiography, McCollum later reflected that the harvested wheat itself was just of poor quality. Due to the way they grew, processed, and stored the wheat on the Madison campus, the cows ended up being fed only wheat grain and straw.

Had the cows eaten their full quota of leaf, as the corn- and oat-fed animals did, they would not have been in such poor nutritive condition. Through four years we had been inexcusably uncritical of some important details.

The experiment, though wildly influential, yielded nothing but inconclusive results.

I’d like to thank Mihaela Curmei, Jessica Dai, Shamik Dasgupta, Henry Farrell, Sara Fridovich-Keil, Paula Gradu, Chris Harshaw, Lauren Kroiz, Kevin Munger, Deb Raji, Philippe Rigollet, Scott Shenker, and Chris Wiggins for many helpful comments and suggestions. Special thanks to the students in the 2024 Spring seminar “The Philosophy and History of Automated Decision Making,” who participated in a lively discussion about an earlier draft of this essay.

Even back then, PNAS was the journal for articles “Previously rejected from Nature And Science.”

Learning From the Mess

Ben Recht — Thu, 05 Mar 2026 14:59:39 GMT

This is Part 6 (of 7!) of a blogged essay “Steampunk Data Science.” A table of contents is here.

Having followed the tumultuous thirty-year journey from Eijkman’s chickens to Davis’ rats, let’s return to where we started: the question of reproducible, rigorous research. If there was an era of gold standard reproducible research, the early 20th century wasn’t it. I described multiple examples of important work that not only wasn’t reproducible but was flat-out wrong.

McCollum’s first paper on nutrition couldn’t have been more wrong. Led astray by the work of Pavlov, McCollum was convinced that “the psychic influence of palatability is one of the most important factors in nutrition.” McCollum considered the possibility of vitamins, which he described as “certain organic complexes in the food given, which the body was not able to supply through its synthetic power from the materials at hand,” to be completely ruled out by his experiments. He would completely change his mind in the course of only a couple of years.

Was this paper published by McCollum bad for the scientific discovery of vitamins? The reality is the opposite. The fact that McCollum was dead wrong inspired further investigations, and the breakthroughs occurred in figuring out why he was wrong. Mendel’s team at Yale was inspired by McCollum’s synthetic diets and intrigued by his findings on palatability. In their replication attempts, they not only disproved McCollum’s hypothesis but also strengthened the case for the existence of essential amino acids. This work by Mendel’s team subsequently inspired McCollum and Davis’ investigations into the extracts of milk and egg yolks, resulting in their discovery of Vitamin A.

At the opposite end of the spectrum was German physician Wilhelm Stepp, who had conducted experiments removing the ether-soluble contents of bread and feeding them to mice. He claimed that after removing the ether-soluble contents, the mice quickly perished. When the ether-soluble materials were added as a supplement, the mice thrived. This sounds a lot like McCollum and Davis’ experimental setup, but Stepp’s results were deemed “far from conclusive” by Mendel and Osborne. His data was fishy. George Wolf and Kenneth Carpenter reanalyzed Stepp’s experiments from our contemporary understanding and found Stepp’s mice died far too quickly for the cause to be Vitamin A deficiency. What exactly Stepp had done remained unclear, and his work was not reproducible. But he had the right answer! He was clearly on the right track to finding Vitamin A.

We can and do learn from a lack of reproduction. Failure to reproduce tells us something about why our earlier assumptions were wrong, and digging into reproduction failures leads us to new discoveries. Nothing in the evidence points to malicious fraud or scientific misbehavior by those involved in the search for vitamins. It’s not clear how many of these errors would have been corrected by better statistical or scientific methodology. We should not expect science to be perfect and should be open to learning from mess.

And what about rigorous tools and research practices? Might these have accelerated our understanding of nutrition? Here, the evidence again points to no. The discovery of vitamins required a remarkably diverse set of investigative tools. As is always the case, well-controlled experiments designed to deliberately refute hypotheses were only one of many methods used to generate evidence. Natural experiments, such as Eijkman’s work with chickens and Vorderman’s prison observations, provided the initial clues that brown rice contained essential vitamins. The case studies initiated by Mendel, Osborne, and Ferry were experiments on single animals. They applied varied interventions over the course of the rat’s life, probing varied inputs into its diet, scouring for clues as they compared to a baseline. The individual case series of McCollum and Davis provided the definitive evidence that a simple organic compound could start and stop growth. Each of these methods provided a piece of the puzzle, but the researchers were learning how to do nutrition research as they went.

And though it was clear to Christiaan Eijkman that white and brown rice were different from each other, it took thirty years for that difference to be given a name. Sure, we can point back to single experiments that are as clear as day in hindsight, but the vitamins weren’t “discovered” until they were named. It took Funk’s bold survey article to name the problem (deficiency diseases) and the cure (vitamines). Only after Funk did everyone converge on the answer. The clean articulation of the problem and solution, of the cause and the effect, marked the actual discovery.

Perhaps the only pattern I can extract from the scientific processes here is that everyone involved was driven by a definitive purpose. The discovery of vitamins arose out of a deliberate, interventionist mindset. The researchers in Wisconsin wanted to identify the best diet for raising cattle. The researchers in the Dutch colonies sought a cure for beriberi. Nutrition research wasn’t aimed at breaking down the world for understanding, but rather at identifying interventions. They were trying to figure out cause and effect so that they could do something. The entire purpose was intervening, whether to save farmers money or cure terrible diseases. In finding what worked for their problems, they also discovered new chemistry and biology.

Would vitamins have been discovered sooner if the nutrition scientists had a more rigorous set of scientific tools? Could we imagine a counterfactual acceleration had they had access to computers loaded with spreadsheets and statistical software? We could also ask this question differently without assuming the present was wiser than the past. Was this discovery made possible by a rigorous practice that we can learn from?

The vitamin saga suggests the answer to all of these questions is probably no! If anything, the confines of scientific rigor of the day, like Koch’s rigid postulates for determining disease etiology, would have left us stuck with germ theory. I worry that contemporary quests for standardization and formalization of research practice lose sight of the value of creative experimentation and investigation. We can’t let reproducibility checklists stifle creative exploration.

What I also love about this story is how there’s no single hero. I understand the motivational power of the simple great-man science stories, like John Snow and his Broad Street Pump, or Alexander Fleming and his Petri dishes. But historians of science have been scolding us for decades that most of these scientific beatifications are far oversimplified. The muddled mess of vitamin discovery is more the rule than the exception. Though there were multiple Nobel Prizes, it’s really hard to extract a single hero.

On the other hand, it’s easy to find lots of petty fights. Babcock chided Atwater about his protein theories. McCollum was humiliated by Mendel’s work, which proved his palatability hypothesis erroneous. In his autobiography, McCollum admits embarrassment and revenge were among his motivations for studying milk extracts. Harry Steenbock felt like he should have been on McCollum and Davis’ Vitamin A paper and held a grudge for years. He even wrote a letter to Science magazine in 1918, accusing McCollum of academic misconduct when McCollum moved his lab from Wisconsin to Johns Hopkins.

The process of searching for vitamins was a mess. But we learned from the mess. And when we did, we found undeniable effects that completely transformed our understanding of food and our ability to treat disease.

The Vital Amines

Ben Recht — Wed, 04 Mar 2026 14:57:35 GMT

This is Part 5 of 7 of a blogged essay “Steampunk Data Science.” A table of contents is here.

At the same time that Stephen Babcock was moving from New York to Wisconsin, Christiaan Eijkman departed the Netherlands for a new position in the Dutch colony of Batavia. Eijkman would be credited with discovering Vitamin B, but his path to discovery was markedly different than the one taken in Wisconsin.

Eijkman was sent to Batavia, which we now know as Jakarta, to investigate the disease beriberi. Plaguing much of Asia, beriberi caused nervous system disorders, including loss of the ability to walk and loss of reflexes. As was the style at the time, Eijkman was convinced beriberi was an infectious disease, and he set out upon a series of experiments to isolate the responsible germ.

Eijkman’s first attempts involved transfusing human blood into animals in the hope of inducing beriberi. He first tried to infect monkeys, to no avail. He next experimented with rabbits but also failed to induce beriberi. His third animal model was chickens. Even in the 1800s, a non-mammalian test subject was outlandish, but chickens could succumb to a beriberi-like illness called polyneuritis, which caused a similar pattern of neural dysfunction. Moreover, Eijkman had grown frustrated with failures.

To his surprise and delight, Eijkman found that chickens often developed polyneuritis after receiving his beriberi transfusions. However, there was one perplexing glitch with his experiment: He observed an equal incidence of polyneuritis in the treatment and control groups. How was this possible? Eijkman tried and failed to cultivate the infecting substance from the afflicted chickens. None of the germ theory he had learned studying with Robert Koch could explain the outbreak of polyneuritis in his experimental fowl. Miraculously, in one of those mythologized moments of scientific luck, Eijkman happened upon an impossible answer by accident.

To save precious research funds, the lab assistant assigned to care for the chickens had been feeding them the leftover rice from the military mess hall adjacent to Eijkman’s facility. But the food supply was cut off after a new barracks chef, in Eijkman’s words, “refused to allow military rice to be taken for civilian chickens.” The lab keeper had to resort to purchasing bargain-basement brown rice for the chickens. Once the rice was switched, the chickens were cured of polyneuritis.

What was different about the barracks rice and the new civilian rice? The military rice was white. White rice is polished brown rice. Eijkman was convinced that something about the polishing turned the rice into a beriberi vector.

Unfortunately, he had no idea what that something was. He had a difficult time shaking the germ theory instilled in him by Koch. Perhaps polishing the rice allowed some bug to flourish in the white rice. Perhaps the brown rice contained something that killed an unknown pathogen. Determining what precisely was special about the brown rice would take decades.

Multiple investigations confirmed Eijkman’s observations over the subsequent years. For example, in 1897, Adolphe Vorderman observed that prisons that fed prisoners white rice had a considerably higher incidence of beriberi than those that fed brown rice. You might say that Eijkman’s observation in chickens had been “replicated” in humans in this observational study. But researchers at the time were as wary of confounding in non-interventional explanations as they are today.

It wasn’t until 1901 that an experiment seemed to definitively confirm that there was something valuable in the rice bran itself. Eijkman’s successor in Batavia, Gerrit Grijns, demonstrated that chickens would not develop polyneuritis on a diet of meat and rice bran. Grijns confidently (and correctly) declared there was something essential to the diet in the rice bran. It took another decade for scientific consensus to accept Grijns was right.

The case for vitamins would finally be closed by Casimir Funk, who wrote an audacious meta-analysis in 1912. In his report, he boldly claimed that beriberi, scurvy, and pellagra were all caused by the lack of some food substance in a person’s diet. Part of Funk’s evidence was epidemiological. He noted these afflictions only occurred in countries where people ate unvarying, monotonous diets, like those based on polished rice. However, not all monotonous diets were necessarily perilous. Russians who lived on a diet of cabbage, potatoes, and bacon seemed to avoid this cluster of illnesses. Some diets were missing yet-to-be-identified nutrients, and people couldn’t live without them. He called the missing components “vital amines” or “vitamines” for short.

Funk, a biochemist at the Lister Institute of Preventive Medicine in London, had been investigating the properties of rice polishings, his interest piqued by the findings from Indonesia. He devised multiple chemical mechanisms to extract the curative component from the discarded rice shavings. He found that it would take tremendous quantities of rice to yield tiny amounts of this substance, but only a small amount was required to cure pigeons of polyneuritis.

These experiments and the evidence from Asia convinced Funk. Germ theory, toxic theory, and hormonal theory were each insufficient to explain the evidence. Instead, Funk invented a new classification: deficiency diseases. He proposed that small changes in diet could reliably cure these diseases.

Wisconsin investigators McCollum and Davis soon confirmed Funk’s vitamin theory. Their fat-soluble compound could not sustain rats fed only polished rice. However, adding the water-soluble compound found in the rice polishings allowed the rats to thrive. They concluded:

“There are necessary for normal nutrition during growth two classes of unknown accessory substances, one soluble in fats and accompanying these in the process of isolation of fats from certain foodstuffs, and the other soluble in water, but apparently not in fats.”

Thus, McCollum and Davis showed that multiple compounds were necessary to sustain the life of their rats, and these compounds were not proteins, fats, or carbohydrates. Having grown more adept with rat experiments, Davis now included a much more extensive collection of rodent growth curves, demonstrating the existence of at least two vitamins.

The fat-soluble vitamin was A. The water-soluble vitamin was B.[1]

Funk also called out scurvy as a deficiency disease. Though the British Navy knew that scurvy could be treated with lemons, they did not understand its etiology at all. Axel Holst and Theodor Frølich had induced scurvy in guinea pigs by a deficient diet and then cured the condition using lemon juice or cabbage. Funk concluded that the vitamin associated with scurvy was distinct from that of beriberi, as it was much more sensitive to boiling. Though most were convinced that this evidence showed scurvy was a deficiency disease, Vitamin C would not be chemically isolated until 1928.

Funk went even further out on a limb. He asserted that pellagra, a disease endemic to northern Italy, was also a deficiency disease. Pellagra was associated with starch-heavy diets based on corn. The associated vitamin would not be discovered until 1937, a quarter century later. Pellagra is caused by a deficiency of niacin, also known as Vitamin B3.

In passing, in the penultimate paragraph of his manuscript, Funk hypothesized that rickets, characterized by weakened bones and deformed legs, was also likely a deficiency disease. This hypothesis would also prove correct. Rickets is cured by Vitamin D, a vitamin discovered in 1922 by McCollum. A year later, Harry Steenbock, another collaborator on the Single-Grain experiment at Wisconsin, found that Vitamin D could be synthesized by irradiating food with UV light and that this irradiated food cured rickets in mice.[2]

Though many now point to the serendipitous diet change in Eijkman’s chickens as the “discovery” of vitamins, Funk’s assembly of 30 years of evidence and naming of a cause had more immediate impact. Funk’s theorizing flew in the face of germ theory and asserted something new. He revolutionized how we think about illness as much as germ theory itself had. The evidence had been piling up. People knew that treatments for these diseases involved dietary changes. The rodent studies assuredly pointed to particular necessary factors in food, but Funk was the first to put it all together and state it outright.

Once Funk formulated the concepts of vitamins and deficiency diseases, the rest of the scientific community jumped on board. They had a whole new way to think about what causes and cures illness. With sufficient dietary diversity, humankind was empowered to cure deficiency diseases plaguing large parts of the world.

Continue on to Part 6.

This history of the discovery of Vitamin B in the Dutch colonies synthesizes Eijkman’s Nobel Prize address and the accounts of Carpenter, Combs and McClung, and Vandenbroucke. Combs and McClung’s text on vitamins also discusses Funk’s legacy in vitamins and his connection to McCollum and Davis. Griminger provides additional details about Funk’s biography.

[1] Specifically, the vitamin whose absence causes beriberi is B1. Though it wasn’t apparent at the time, nutritionists soon realized there were many water-soluble vitamins. The B numbering now runs up to B12, and all of the B vitamins are water-soluble.

[2] Steenbock would subsequently patent the method of irradiating milk to fortify it with Vitamin D. This patent would generate untold sums of money for the University of Wisconsin. It’s one of the earliest and most successful examples of universities funding themselves through licensing their intellectual property.