*This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” I’m taking a brief interlude from my run-down of Lecture 7. Here’s the full table of contents of my blogging through the class.*

Can we fix the crud problem with more math? In many ways, that’s what the “credibility revolution” in economics set out to do: build a more sophisticated statistical tool kit that accurately teases out cause and effect when properly deployed. As Guido Imbens and Don Rubin put it in the introduction to their 2015 text *Causal Inference for Statistics, Social, and Biomedical Sciences*,

“In many applications of statistics, a large proportion of the questions of interest are fundamentally questions of causality rather than simply questions of description or association.”

Imbens and Rubin map a path for answering questions about epistemology using statistics:

“All causal questions are tied to specific interventions or treatments.”

“Causal questions are viewed as comparisons of potential outcomes.”

Comparisons of potential outcomes can be computed by careful estimation of average treatment effects.

Hence, all questions of interest in human-facing sciences are reduced to estimating effects in randomized experiments—whether or not a randomized experiment actually occurred. This means that the “gold standard” of causation remains null hypothesis testing. And that means that the entire discipline is based on correlation (a.k.a. description and association) and complex mathematical stories.

You don’t have to take my word for it. If you look at what the causal inference methods *do*, you will see that everything rests on null hypothesis testing. I mean, most of the estimates are built upon ordinary least-squares, and all least-squares estimates are combinations of correlations.
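
To make the “combinations of correlations” point concrete in the simplest case, here is a minimal sketch with one regressor and simulated data. The helper names (`cov`, `corr`, `ols_slope`) and all the numbers are illustrative assumptions. It checks that the least-squares slope is the correlation rescaled by a ratio of standard deviations.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divides by n - 1)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def corr(a, b):
    """Pearson correlation."""
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    return cov(x, y) / cov(x, x)

# Simulated data; the slope 0.3 and noise level are arbitrary choices.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.3 * xi + rng.gauss(0, 1) for xi in x]

slope = ols_slope(x, y)
# The same number, written as correlation times sd(y) / sd(x).
rescaled_corr = corr(x, y) * math.sqrt(cov(y, y) / cov(x, x))
print(slope, rescaled_corr)
```

With several regressors the algebra gets messier, but the coefficients are still functions of the pairwise correlations and the standard deviations.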

Let me give a simple example of an often-used estimator: the Local Average Treatment Effect (LATE). LATE uses “Instrumental Variables” to tease out causal relationships. You care about whether *X* causes *Y*, but you worry there are lots of confounding factors in your observational data set. To remove the confounding factors, perhaps you could find some magic variable *Z* that is correlated with *X* but uncorrelated with all of the confounders. Maybe you also get lucky and can argue that any effect of *Z* on *Y* has to pass through *X* (to be clear, you spin a story).

Economists have a bunch of crazy ideas for what should count as instrumental variables. Caveat emptor. My favorite example of an instrumental variable (one of the only ones I believe in) comes from randomized clinical trials. In a medical trial, you can’t force a patient to take the treatment. Hence, the randomized treatment is actually the *offering* of the treatment the trial aims to study. In this case, *Z* is whether or not a patient is offered treatment, *X* is whether the patient takes the treatment, and *Y* is the outcome the trialists care about.

But let me not dwell on instrumental variable examples. I wrote more about it here and here. I actually really like Angrist, Imbens, and Rubin’s original paper on LATE. For today, I want to show why this is still just correlation analysis. The standard instrumental variable estimator that estimates the influence of *X* on *Y* is
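
Assuming the textbook single-instrument form is the one meant here (with one instrument it is also what two-stage least squares reduces to), the estimator can be written as

$$
\hat{\beta}_{\mathrm{IV}} = \frac{\widehat{\mathrm{Cov}}(Z, Y)}{\widehat{\mathrm{Cov}}(Z, X)} = \frac{\hat{\rho}_{ZY}}{\hat{\rho}_{ZX}} \cdot \frac{\hat{\sigma}_Y}{\hat{\sigma}_X}
$$

where the ρ̂ terms are sample correlations and the σ̂ terms are sample standard deviations (notation introduced here for illustration). The standard deviation of *Z* cancels between numerator and denominator, which is why the ratio of covariances is a ratio of correlations up to a fixed rescaling.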

It’s a ratio of correlations. The standard way to “test for significance” of this effect is to do a significance test on the numerator. If it passes, you add two stars next to the correlation in the table. In an instrumental variable analysis, we changed the story but still just computed a correlation and declared significance if the number of data points was large enough.
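
The clinical-trial setup above can be simulated in a few lines. This is a toy sketch, not anyone's actual analysis: the effect size, the confounder, the uptake rule, and the function names are all made up for illustration, and the estimator is assumed to be the textbook single-instrument form Cov(*Z*, *Y*) / Cov(*Z*, *X*). It shows the estimate being computed from nothing but two sample covariances, next to the confounded regression of *Y* on *X*.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divides by n - 1)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def iv_estimate(z, x, y):
    """Single-instrument IV estimate: Cov(Z, Y) / Cov(Z, X)."""
    return cov(z, y) / cov(z, x)

def wald_estimate(z, x, y):
    """Same quantity for a binary instrument, as a ratio of mean differences."""
    def diff(v):
        offered = [vi for zi, vi in zip(z, v) if zi == 1]
        not_offered = [vi for zi, vi in zip(z, v) if zi == 0]
        return mean(offered) - mean(not_offered)
    return diff(y) / diff(x)

rng = random.Random(1)
true_effect = 2.0  # hypothetical effect of actually taking the treatment
z, x, y = [], [], []
for _ in range(20000):
    u = rng.gauss(0, 1)     # unobserved confounder
    zi = rng.randint(0, 1)  # randomized offer of treatment
    # Only offered patients can take the treatment, and uptake depends on u.
    xi = 1 if zi == 1 and u + rng.gauss(0, 1) > -0.5 else 0
    yi = true_effect * xi + 3.0 * u + rng.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(yi)

naive = cov(x, y) / cov(x, x)  # regress Y on X directly: confounded by u
iv = iv_estimate(z, x, y)
wald = wald_estimate(z, x, y)
print(f"naive slope: {naive:.2f}, IV: {iv:.2f}, Wald: {wald:.2f}")
```

The IV number lands near the simulated effect only because the simulation was written so that *Z* is independent of the confounder and touches *Y* only through *X*. In real data those two conditions are the story, and the arithmetic cannot check them.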

Even though other estimators aren’t as easy to write down, every causal inference method has this flavor. Everything is a combination of correlation and storytelling. “Causal inference,” as it’s built up in statistics and economics departments, is just an algebraically sophisticated language for data visualization.

Some of my best friends work on causal inference, and I respect what they’re after. They’d argue that these storytellings are better than just *randomly* picking two variables out of a hat. But I don’t see how causal inference methods can do anything to mitigate the effects of crud.

If there’s a latent crud distribution, causal storytelling connecting *X* and *Y* is no different than Meehl’s yarns about why certain Methodists prefer certain shop classes. Clever people can construct stories about anything. If they gain access to Stata or R or Python, they can produce hundreds of pages of sciency robustness checks that back their story. If we don’t understand the crud distribution, there’s no math we can do to know whether the measured correlation between *X* and *Y* is real. If you buy Meehl’s framework (which I do), you can’t corroborate theories solely with the precision measurement of correlations. You need *prediction*.
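
A toy simulation shows how a crud distribution interacts with significance testing. Everything here is an assumption made for illustration: a single shared latent factor, a loading of 0.3 on each of twelve variables, and a normal approximation to the critical value. None of it comes from Meehl. No variable causes any other, yet with enough data every pair looks “significant.”

```python
import math
import random
from itertools import combinations

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def fraction_significant(n, p=12, loading=0.3, seed=0):
    """Share of variable pairs whose correlation passes a two-sided 5% test."""
    rng = random.Random(seed)
    noise_sd = math.sqrt(1 - loading ** 2)
    columns = [[] for _ in range(p)]
    for _ in range(n):
        latent = rng.gauss(0, 1)  # one shared "crud" factor (an assumption)
        for col in columns:
            col.append(loading * latent + noise_sd * rng.gauss(0, 1))
    pairs = list(combinations(range(p), 2))
    hits = 0
    for i, j in pairs:
        r = corr(columns[i], columns[j])
        t = r * math.sqrt((n - 2) / (1 - r ** 2))
        hits += abs(t) > 1.96  # normal approximation to the t critical value
    return hits / len(pairs)

small = fraction_significant(n=100)
large = fraction_significant(n=5000)
print(f"n=100: {small:.0%} of pairs significant; n=5000: {large:.0%}")
```

By construction every pairwise correlation in this toy world is about 0.09. The test result says more about the sample size than about any relationship between a particular *X* and *Y*.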

Theories in the human-facing sciences need to make stronger predictions. At a bare minimum, the treatment effect estimates from one study should align across replication attempts. We seem to have issues even crossing this very low bar with our current framework. Adding more math to make the treatment estimate more precise doesn’t help us generalize beyond the data on our laptops.

Theories need to tell us more than whether the correlation between variables is positive or negative. We need to subject them to risky tests. Theories need to make varied, precise predictions. Only then does a precise measurement of these predicted empirical values matter. Reducing all question answering to Fisherian statistics will not solve these problems. But that’s where we seem to be stuck.

It might be worth going back to where the whole idea of significance testing started, with plant breeding experiments. While things can go wrong, the method is well suited to the problem. Try out two (putatively) different varieties under identical conditions, and see if one does better. If so, is the improvement too big to be due to chance variation? If it is, you can reject the null hypothesis and recommend adoption of the better variety, at least under the conditions of the test.

Social science problems are much harder, and significance testing doesn't work as described. But it's a social convention and we haven't found a better alternative. We should either admit this and stop pretending that "significance" means what is claimed, or forget about it completely and become subjective Bayesians.

Well done. My cynical take on "causal inference" is that it is no more causal than linear regression estimators *if we believe the structural linear model*, yet it gets wrapped up in exaggerated language. No causal inferential tool will provide causal estimates if the proposed DAG is completely wrong!

The field would be better off if it rebranded under something like "structural inference" or "explicit structural models".