This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” Here’s the full table of contents of my blogging through the class.
Every facet of science, from physics to psychology, interfaces with computers. Our Meehlian derivation chains have software interlaced at every layer. This means that every aspect of scientific validity rests on software validity. I touched on this yesterday in the context of pure prediction, but let’s do a bigger-picture sweep today, going through the different clauses and thinking about how software comes in.
Auxiliary Instruments
Software is clearly an auxiliary instrument (AI). We always assume our computer is working, our copy of R or Python is stable, and our packages have no bugs. Unless you are a computer scientist, most of the code you use in an experimental analysis is an auxiliary instrument. The exception would be software written by other practitioners in your field. For example, if you’re a chemist using a Density Functional Theory (DFT) package to compute some molecular interactions, the DFT code depends on the theory of your scientific field. In cases like this, software moves from AI to AT.
Auxiliary Theories
At the auxiliary theory level (AT), we always defer some part of the scientific prediction or inference to code. If we build a complex model of climate or the universe, this rests on the simulation code of our theory being correct.
More subtly, if you have any model of errors in your derivation chain, you have infected your auxiliary theory with software. Error analyses are code-infected because we trust computers to do statistics. Computers are simply better than people at arithmetic and reading z-score tables.
So if you are “controlling for confounders” like age or gender in an experiment using a logistic regression model, that’s obviously a statistical error model to be solved by a computer. But even if you’re just assuming your errors are normally distributed, that’s a statistical model, and you’re going to use code to compute the standard error.
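To make that concrete, here’s a minimal sketch in Python with statsmodels, on synthetic data with hypothetical variable names (treatment, age, gender, outcome). The point is only that the “controlling for confounders” step is a statistical error model the computer fits for you, standard errors and z-tests included.

```python
# A minimal sketch (synthetic data, hypothetical variable names): "controlling for"
# age and gender is a statistical error model that a computer solves for you.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "age": rng.normal(45, 12, n),
    "gender": rng.choice(["f", "m"], n),
})
# Hypothetical outcome with a small treatment effect baked in.
linpred = -1.0 + 0.4 * df["treatment"] + 0.02 * (df["age"] - 45)
df["outcome"] = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

# The "control for confounders" step: a logistic regression the computer fits,
# along with the standard errors and p-values you would never compute by hand.
fit = smf.logit("outcome ~ treatment + age + C(gender)", data=df).fit()
print(fit.params["treatment"], fit.bse["treatment"], fit.pvalues["treatment"])
```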
Ceteris Paribus
Software infects the ceteris paribus conditions because software is how we enforce null hypotheses. You encode ceteris paribus in code by generating randomization to remove confounding. The shape of the null hypothesis is also a ceteris paribus assumption, arguing that the variation in outcomes in the experimental context we study is due to a particular kind of random variation. Calling a standard error robust or clustered, or using random or fixed effects, is an assertion of ceteris paribus. But anyone who has run such an analysis knows that if you toggle these sorts of models, you can turn a null result into a significant result.
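Here’s a rough sketch of that toggle, using statsmodels on synthetic clustered data (all names and numbers are invented). The same coefficient gets three different p-values depending on which ceteris paribus story you assert about the errors; whether any of them crosses 0.05 depends on the draw.

```python
# A sketch of the "which ceteris paribus do you believe?" toggle,
# using statsmodels OLS on synthetic clustered data (hypothetical names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
groups = np.repeat(np.arange(40), 25)          # 40 clusters of 25 observations
x = rng.normal(size=groups.size)
cluster_noise = rng.normal(size=40)[groups]    # shared within-cluster shock
y = 0.1 * x + cluster_noise + rng.normal(size=groups.size)
df = pd.DataFrame({"y": y, "x": x, "g": groups})

ols = smf.ols("y ~ x", data=df)
for label, kwargs in [
    ("classical", {}),
    ("robust (HC1)", {"cov_type": "HC1"}),
    ("clustered", {"cov_type": "cluster", "cov_kwds": {"groups": df["g"]}}),
]:
    # Same data, same coefficient, different assertion about the error structure.
    p = ols.fit(**kwargs).pvalues["x"]
    print(f"{label:14s} p-value for x: {p:.3f}")
```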
Even the version of your code is a CP assumption. You may get a different p-value or parameter estimate if you update your software or change your software package. The experimental validity only holds in one Docker container.
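If you buy that, one modest response, sketched below under the assumption of a Python stack (the function name and package list are hypothetical), is to write the environment down next to every result, so the CP clause is at least stated rather than implicit.

```python
# A minimal sketch of treating the software version as an explicit CP clause:
# record the interpreter and package versions your p-value silently depends on.
import json
import platform
from importlib.metadata import version, PackageNotFoundError

def environment_fingerprint(packages=("numpy", "pandas", "statsmodels")):
    """Return the versions this analysis was actually run under."""
    fingerprint = {"python": platform.python_version()}
    for pkg in packages:
        try:
            fingerprint[pkg] = version(pkg)
        except PackageNotFoundError:
            fingerprint[pkg] = "not installed"
    return fingerprint

# Save this next to the analysis output; a replication in a different
# environment is, strictly speaking, testing a different CP clause.
print(json.dumps(environment_fingerprint(), indent=2))
```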
Experimental Conditions
The experimental conditions involve software in multiple ways because software handles data entry, extraction, cleaning, manipulation, and loading. Bugs in any of these go in CN. If you mess up a formula in your analysis, like leaving entire rows out of a spreadsheet average when arguing for austerity politics, that’s an error in CN.
There are now too many examples of scientific results being wrong because of coding errors somewhere in the experimental pipeline. The Reinhart-Rogoff Excel error was an egregious example with a simple fix. But I’ve looked at enough complex software stacks for experiments to know this isn’t an isolated example. Send me your code, I’ll find an issue in there somewhere.
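To see how quietly such an error lives in code, here’s a toy sketch with made-up numbers (emphatically not the Reinhart-Rogoff data): a slice that drops rows raises no warning and no exception, it just returns a different answer.

```python
# A sketch of how a Reinhart-Rogoff-style row-selection error stays silent.
# The numbers below are invented for illustration.
import pandas as pd

growth = pd.Series(
    [1.2, 2.5, -0.3, 3.1, 0.8, 1.9, 2.2, -1.1, 0.4, 2.8,
     1.5, 0.9, 2.0, 3.3, -0.5, 1.1, 2.6, 0.7, 1.8, 2.4],
    name="gdp_growth",
)

full_mean = growth.mean()               # what the analysis is supposed to report
partial_mean = growth.iloc[:15].mean()  # an off-by-five slice: no error, no warning

print(f"all rows:  {full_mean:.2f}")
print(f"first 15:  {partial_mean:.2f}")
# Both lines run cleanly; only a replication against the raw data catches it.
```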
Software removes abstraction boundaries from the Lakatosian defense. We can now introduce potential errors anywhere in our analysis stack. Are these errors due to errors in CN or due to some other clause? How can you unpack them to fix your theory?
The situation is intractable because experiments aren’t bound by formal logical rules of correctness. In an experimental pipeline, it might not be possible to know if something is a “bug” or not. People tweak their software all the time. Sometimes, your analysis looks wrong, and this is legitimately because there is a bug in your code. But sometimes your analysis looks wrong, and, without malice, you introduce a bug in your code to make the outcomes look the way you expected.
Software forces you to “hack” without p-hacking. I’m not saying that people are misusing statistics and data dredging. I’m saying that writing a working pipeline is really hard, and requires iterating through many different tests and sanity checks to make sure everything is working properly. If you change a line of code because your plots looked wrong, and now they look right, did you fix a bug or introduce an error? How can you tell?
You might argue that you can unit test all of the little subcomponents and that as long as you follow these beautiful rules of analysis handed down from the replication-crisis zealots, you can never make a mistake. I don’t buy this for a minute.
Maybe you’d argue that you should preregister the entire software stack at the beginning, run your experiment, and then if the plots look wrong, abandon everything. Scientific rigor now dictates that you start over with an empty git repo and redo the entire experiment. That would be absurd.
The introduction of software into every aspect of science hence gives replication crisis arguments extra heat. We’ve accelerated science with computers, but we’ve also accelerated scientific doubt. All experiments depend on code, all code is wrong, and any time you look at a pipeline, you find mistakes. This lets us write more papers, metascientific analyses, and editorials. But software expedience might not be accelerating much of anything other than infighting.
I think that people are dealing with these problems in practice:
- Conda envs / Docker containers with the exact recipe used to create the software environment
- Standard repos / codebases for baselines and evaluations, to make sure implementation bugs don't contaminate results
- Public code repos for verification and replication
Are these perfect? No! But I think that with all of these tools, verification and replication of papers (setting aside expensive pretraining runs) is far cheaper and easier in AI/ML than it is in a field like biology. If there's a bug in your code, someone who attempts to replicate your results in a new codebase can detect it by comparing against your public GitHub repo and conda env. In medicine, it's often harder to isolate the AI, CP, and CN clauses.
A larger reason for the replication crisis is that the academic community still chooses not to publish replications, along with the lack of truly held-out test sets and the fact that most industry labs fail to commit to the practices above. But my current sense is that the industry labs truly attempting to research bleeding-edge models spend a lot on replication and see massive returns on those efforts.
Part of the problem in dealing with this is a Platonic belief that philosophy/science can get us to The Truth. We are in the cave and can't get out. But we can improve the lighting, keep careful records of the shadows, experiment with different shading patterns, and so on. Once you really accept fallibilism, replicability becomes a predictable problem rather than a crisis.