This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” Here’s the full table of contents of my blogging through the class.
Software scaffolds every part of our scientific process, whether for data acquisition, cleaning, analysis, or prediction. So, in updating Meehlian metatheory for 2024, we must adjoin a new class of theories to logical derivation chains: that our software is correct. I will call this Software Validity, CS. As we all know in computer science (CS), the CS assumption is always the most suspect.
I want to take a few blogs to think through what it might look like to add CS to a metatheory of science. I could probably write a longer paper about this, but my thoughts are still pretty unformed here. In the classical spirit of blogging, let me draft a few half-baked thousand-word notes to get some seedlings of ideas out there. Mark this as something I want to revisit in the future.
Let me start closest to home: on the role of machine learning models in contemporary science.
We learn in school that science and experiments are about inference and understanding of the laws of the universe. But Meehl’s reconstruction places prediction as the central goal of science. The understanding and inference parts happen when the predictions are wrong and scientists have to patch their theories.
Recall Meehl’s setup one more time:
We have a logical conjunction that implies “O1 predicts O2.” And “in the absence of our theory, it would be surprising if you predicted O2 accurately from O1.” For Meehl, scientific validity is only about prediction. As he says in Lecture 2, those predictions need to be remarkably detailed and in close accord with the facts.
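In symbols (my shorthand; Meehl's exact notation varies across lectures), the core theory T, the auxiliary theories A_T, the instrumental auxiliaries A_I, the ceteris paribus clause C_P, and the experimental conditions C_N jointly yield the conditional prediction:

\[
T \wedge A_T \wedge C_P \wedge A_I \wedge C_N \;\vdash\; (O_1 \supset O_2)
\]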
Meehl’s reconstruction of science notably doesn’t preclude arbitrarily complex models. Anyone who takes a science class learns that models with fewer parameters are better, but there’s never a justification for why. “Occam’s razor” or whatever. Even Meehl doesn’t pin down why people should or do prefer simpler models, which is one of the bigger holes in his presentation.
But I wonder if, when pressed, scientists really care about simplicity. People want models that can easily make lots of predictions. They want the predictions to be remarkably detailed and in close accord with the facts. When you had to do all of those calculations by hand, this required the models to be pretty simple. But if you have an NVidia Z28, you can quickly compute predictions from absurdly complex models that you couldn't even write down by hand.
I ask you, my reader: Would you prefer a simple theory that made vague predictions or a complex computerized theory that made remarkably detailed predictions? Based on the trends I see in science and engineering, revealed preferences strongly suggest the latter. Supercomputer simulations, digital twins, and massive machine learning systems have demonstrated that we can make remarkably detailed predictions that are in close accordance with the facts, even when we have too many parameters.
This quest for prediction makes us use software to extreme degrees. I offhandedly mentioned last time that “we’re happy if a billion-parameter model gets one prediction correct.” But it’s more true than false. We don’t care if we can get a giant curve fit with a bunch of non-fundamental parameters as long as it makes good predictions on something interesting. I call this sort of prediction “nonparametric.” And one of the hottest areas right now, “AI for Science,” is embracing nonparametric prediction as the key to accelerating discovery.
How do we fit nonparametric models into Meehlian derivation chains? First, a caveat: I'm playing fast and loose with the term nonparametric because we all know it when we see it. Any model whose parameters we don't think are fundamental or reusable is nonparametric. In physics, Newton's law of gravitation, the ideal gas law, and Planck's law are all parametric models because their "physical constants" can be reused in other theories. Indeed, the ideal gas law and Planck's law both depend on the Boltzmann constant.
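To make the contrast concrete, here are those three parametric laws written so that the shared, reusable constants (G, h, c, and the Boltzmann constant k_B) are in plain view:

\[
F = \frac{G\, m_1 m_2}{r^2}, \qquad
PV = N k_B T, \qquad
B_\nu(\nu, T) = \frac{2 h \nu^3}{c^2}\,\frac{1}{e^{h\nu/(k_B T)} - 1}
\]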
On the other hand, statistical models are almost nonparametric by definition. Even linear regression correction is nonparametric. Whenever someone is "controlling for covariates" or adding "fixed effects" or "random effects" to their regression models, they never care what the fitted parameters are. They add these effects to argue about causation or to force a standard error to be small enough to receive asterisks in a table. Machine learning models, of course, are also inherently nonparametric. AlphaFold is a nonparametric science model. Not only do we not care what its parameter values are, we're happy to change them if that lets us explain more protein structure data.
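Here's a minimal sketch of the kind of regression correction I mean, with made-up data and hypothetical variable names; the point is that nothing downstream ever inspects the covariate coefficients.

```python
# A minimal sketch of regression "correction": we control for covariates,
# report one effect estimate, and never look at the nuisance coefficients.
# All variable names and data here are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
treatment = rng.integers(0, 2, size=n)    # the variable we actually care about
covariates = rng.normal(size=(n, 5))      # the "fixed effects" we control for
outcome = 2.0 * treatment + covariates @ rng.normal(size=5) + rng.normal(size=n)

X = np.column_stack([treatment.astype(float), covariates])
model = LinearRegression().fit(X, outcome)

print("treatment effect estimate:", model.coef_[0])
# model.coef_[1:] holds the covariate coefficients. Nobody reports them,
# and nobody thinks they are fundamental constants of anything.
```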
Predictive models nestle their way into multiple parts of the derivation chain. First, we adjoin to our auxiliary theory the fact that some observation O2 is predictable from some observation O1. That means the auxiliary theory is "there exists some function f and some parameter values v such that O2 = f(O1;v)." We also adjoin the auxiliaries "this relation was true for the data we observed before our experiment" and "we reliably captured representative measurements of (O1,O2) pairs in the past." Once we have these auxiliary theories, we can use software to do curve fitting, finding suitable values for the parameters v. We then predict new outcomes with the model. If there is reasonable accordance between the model's predictions and the new observations, we celebrate. Otherwise, we adjoin the new observations to our data pool and fit again.
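In code, the loop looks something like the sketch below. Everything in it is a stand-in: the data source, the model class, and the tolerance for "reasonable accordance" are all choices that quietly hide inside the auxiliaries.

```python
# A schematic sketch of the fit-predict-adjoin loop described above.
# The data-generating function, model class, and tolerance are all stand-ins.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

def run_experiment(n):
    """Hypothetical stand-in for collecting new (O1, O2) pairs."""
    O1 = rng.uniform(-3, 3, size=(n, 1))
    O2 = np.sin(O1).ravel() + 0.1 * rng.normal(size=n)
    return O1, O2

# Auxiliary theory: there exist f and parameters v with O2 = f(O1; v),
# and past (O1, O2) pairs are representative of future ones.
O1_past, O2_past = run_experiment(500)
model = GradientBoostingRegressor().fit(O1_past, O2_past)

tolerance = 0.2  # what counts as "reasonable accordance" is itself a judgment call
for round_id in range(5):
    O1_new, O2_new = run_experiment(50)
    error = mean_absolute_error(O2_new, model.predict(O1_new))
    if error < tolerance:
        print(f"round {round_id}: accordance (MAE={error:.3f}), celebrate")
    else:
        # Adjoin the new observations to the data pool and fit again.
        O1_past = np.vstack([O1_past, O1_new])
        O2_past = np.concatenate([O2_past, O2_new])
        model = GradientBoostingRegressor().fit(O1_past, O2_past)
        print(f"round {round_id}: miss (MAE={error:.3f}), adjoin and refit")
```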
It's interesting here (foreshadowing the next post) how the software infects multiple parts of the chain. The core theory is that O2 is predictable from O1. Our auxiliary theory is that the data is fittable with a particular nonparametric model. Our auxiliary instruments are the software pipeline we use to fit the model. A ceteris paribus condition is that past data is representative of future data. An experimental condition might be the version number of scikit-learn. A core assumption throughout is CS, that our software is bug-free.
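To make that list concrete, here is a toy inventory, just a dictionary, of where software shows up in the chain, including the version numbers we rarely bother to record:

```python
# A toy inventory of where software enters the derivation chain, with the
# "experimental conditions" we rarely write down. Entries are illustrative.
import sys
import numpy as np
import sklearn

derivation_chain = {
    "core_theory": "O2 is predictable from O1",
    "auxiliary_theory": "the data is fittable by this nonparametric model class",
    "auxiliary_instruments": "the fitting pipeline: sklearn, numpy, and our glue code",
    "ceteris_paribus": "past (O1, O2) pairs are representative of future ones",
    "experimental_conditions": {
        "python": sys.version.split()[0],
        "numpy": np.__version__,
        "scikit-learn": sklearn.__version__,
    },
    "CS": "all of the software above is bug-free",  # always the most suspect link
}
print(derivation_chain)
```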
Some criticize black-box predictive models as punting on scientific understanding, but we learn things when the predictions fail. Perhaps we find the process has a time-varying element we didn't account for, and we need to refit the model every couple of weeks. Perhaps we discover a condition where the model doesn't work, but we find we can patch the predictions by adjoining data from that condition. Any time we make our data corpuses bigger and our prediction software more complex, we're engaging in Lakatosian Defense. And it's hard to deny that this iterative process of statistical prediction works to some extent. If we make the datasets bigger or retrain more frequently, we end up with more detailed predictions that conform more closely with the facts. Would we argue this is no longer "science" if we are amending our model with every new falsifier? No. As we've seen, this is central to the scientific method. If it predicts more facts, we're going to stick with our program. This seems to be exactly what people want from "AI for Science."
It's definitely interesting to consider why or when pure prediction (in the ML sense) is useful for science. One example that I think of often is the negative one: prediction of life outcomes (https://www.pnas.org/doi/10.1073/pnas.1915006117). What does it mean when our fanciest ML models and most extensive data to date *can't* make more accurate predictions than embarrassingly simple models? There are definitely ethical/validity implications of this observation (https://predictive-optimization.cs.princeton.edu/), but maybe there are also scientific ones? It's also interesting to consider the role of mass collaboration/prediction competitions in justifying such a negative result.
"Some criticize black-box predictive models as punting on scientific understanding, but we learn things when the predictions fail." Yes, but. When we are using large ML models and then trying to learn from their failures, we're studying something truly different. The underlying cause of such failures are by nature not human-understandable; and the improvements we make with more data or larger models are often nothing more than overtraining of the model to drive toward the single truth we are seeking... and not global truths which others will demand of the same model.
Take the example of an ML-based vision processor trying to detect human movers in a given space.
In our example, we may observe that a given model doesn't detect bicyclists very well. More training data with more labeled bicyclists may certainly improve the model on this axis, and we would rejoice... if we're in the bicyclist-detection business. But what has that updated training done to the other predictions... of pedestrian walkers and resting/sitting/standing persons, for example?
In simpler software (or big software performing clearly defined functions) we can run regression tests to ensure we've not stepped backwards. In modern meta-models, I'm not sure we are very good at regression testing. I am sure we're overfitting a lot of models in the name of improving them.