*This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

In Lecture 8, Meehl puts forward a few suggestions on how to make the scientific literature more interpretable. He is not proposing ways to make research “better.” We can never get to a place where we’ll all agree. If we didn’t disagree, we wouldn’t be doing science in the first place. But the obfuscators Meehl listed in Lectures 6 and 7 made it so that the social scientific literature provided almost no information about what’s true and false. That means we can’t even have sensible arguments. Meehl wants the literature to better inform arguments about theories and beliefs about the effectiveness of treatments. Meehl’s suggestions in Lecture 8 are to make Lakatosian warfare *possible*.

I’ve honed Meehl’s suggestions down to three main themes:

1. Improving reproducibility

2. Moving beyond hypothesis testing

3. Publishing less

Long-time readers will note that I have already blogged favorably about those three suggestions. Is that because I’m hearing what I want to hear when I listen to Meehl? Or is that because he was screaming into the void in the 80s, and we should have listened to him then? Could it be both? Let’s all buckle up for the next three posts to find out! I want to explore which of Meehl’s suggestions were heeded and what we might do to further implement them today.

Meehl was banging the drum for better reproducibility earlier than most. He suggests that investigators should be required, or at least strongly encouraged, to replicate their own results. They could publish the results of a pilot study alongside the main study. A study would be more compelling if it provided two independent measurements of the same effect on two dissimilar datasets. Requiring two measurements would necessarily set a higher bar for publication, but it would also ensure considerably stronger evidence for the tested theory.

While asking for more experiments is a high bar, asking for more information is not. And that’s Meehl’s second suggestion. Meehl wants journals to require authors to provide more information. It’s amazing to hear the sorts of practices that seemed to be allowed in the 1980s. Meehl claims that people could report “significant at some level” and never tell you the mean difference between the groups. That seems preposterous. It’s not much better to report that mean difference with only an asterisk denoting significance. This was also commonplace in social science. The reader would never see the standard errors or the p-values. I’d be curious to hear from folks in psychology how common these practices remain.

It’s cut and dried that this shouldn’t be allowed. If you’re going to run a hypothesis test, why not report everything? Say what the test is. State the standard error. State the p-value. Give a confidence interval. Meehl is right that you can compute any of these numbers from the others, but you should at least report one of them to high enough precision to do so. And why force a reader to open R or Python? Just list them.
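Meehl’s point that these quantities are interconvertible is easy to demonstrate. A minimal sketch in Python, with a made-up mean difference and standard error for illustration:

```python
import math

def normal_cdf(x):
    # CDF of the standard normal, built from the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Hypothetical reported values: a mean difference and its standard error
mean_diff = 1.3
std_err = 0.5

z = mean_diff / std_err                      # test statistic
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))   # two-sided p-value
ci_low = mean_diff - 1.96 * std_err          # 95% confidence interval
ci_high = mean_diff + 1.96 * std_err

print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Given the point estimate and its standard error, everything else is a one-liner, which is exactly why there is no excuse for omitting any of it.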

Beyond this, Meehl says that papers should give a sense of the shape of the distributions of the two groups. *Pictures* are probably even more informative than the raw statistics. We know that most natural phenomena are not really Gaussian and linear. Investigators should plot the histograms so that people can understand group overlap and skew. Visual statistics are far more compelling *and* informative than test statistics.
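As a sketch of why the shapes matter, here is a small example with invented skewed (lognormal) data that computes the two groups’ histograms on shared bins and summarizes their overlap; a plotting library would draw these same histograms:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented skewed data: a lognormal control group and a shifted treatment group
control = rng.lognormal(mean=0.0, sigma=0.5, size=500)
treated = rng.lognormal(mean=0.2, sigma=0.5, size=500)

# Shared bins make the two histograms directly comparable
bins = np.histogram_bin_edges(np.concatenate([control, treated]), bins=30)
h_control, _ = np.histogram(control, bins=bins, density=True)
h_treated, _ = np.histogram(treated, bins=bins, density=True)

# Overlap coefficient: 1.0 means the histograms coincide, 0.0 means disjoint
widths = np.diff(bins)
overlap = np.sum(np.minimum(h_control, h_treated) * widths)
print(f"group overlap: {overlap:.2f}")
```

A single mean difference with an asterisk tells you none of this: the skew, the heavy right tail, and the large overlap between groups are all visible only when you look at the distributions themselves.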

Meehl additionally argues that investigators should be required to measure and report nuisance variables that they don’t think should be causally affected by treatment. If the measured effect size is on the same order as the apparent effects on these nuisance variables, perhaps the study failed to corroborate the investigator’s theory. Here, I’d argue we’re in a better state now than in the 1980s. Most epidemiology papers I’ve looked at have extensive tables of variables comparing different groups. They tend to list the associated p-values with the group differences. Economics papers now have hundreds of pages of sensitivity and robustness checks to validate their causal claims. Papers come with all sorts of pretty plots. This is all a step in the right direction.

But it’s not enough for replication. Meehl is asking for as much information as possible in a paper. Why not take this to its logical limit? In 2024, there is no excuse for papers not to come as git repositories. Every paper should include a repository of readable, runnable, commented code and as much data as possible. Ideally, this repository should trace all steps from data extraction to statistical analysis. The data should be in its most primitive, unaltered state. This way, the interested reader can view the data from whatever angle they want. The authors of the paper can make their argument about what we should see, but everyone else should be able to run their own analyses.

I’d argue that we should make the papers themselves shorter! I don’t want to flip through people’s robustness analyses in an endless pdf file. I’m not sure why anyone puts up with these appendices. I mean, at this point, don’t we all think it’s odd that robustness analyses always come back in the author’s favor? There’s a reasonable alternative to such exhaustive sensitivity analyses. Just give out your code and data so the skeptic can see what’s under the hood. And if investigators were really committed to their robustness checks, they could include them in a folder in their repository in a nice interactive notebook. I’m all for it.

There are no good arguments against this sort of reproducibility. Certainly, “proprietary data” is an absurd argument. If your data is proprietary, I don’t believe your results. You are trying to sell me something, so no paper for you.

A trickier argument is made in medical research: data can’t be released because “privacy.” This argument derives from a mindless, shallow reading of the Belmont Report. I fully endorse that respect for people and beneficence dictate that investigators respect people’s desire for privacy in studies. But how real are the privacy concerns behind revealing counts in randomized trials? Why can you request the data from drug trials from the FDA but not device trials? Why are other clinical trials or random EHR data mining exercises impossible to access? Does it actually benefit patients in the study that we can’t check investigators’ work? Does privacy outweigh the potential for hiding fraud? We should discuss these questions seriously and in depth.

Now, I’m actually optimistic here. One of the few good things to come out of the international covid response was a broader embrace of preprint servers by the human-facing sciences. If medicine can embrace preprints, they can embrace code sharing and open data too. The future of scientific publication must bend towards open repositories. We’re on the right track there, but let’s continue to pressure our colleagues to keep moving in the right direction.

Meehl starts off the lecture with modest advice that is so uncontroversial that it’s astounding it’s still often not taken. Though aimed at observational studies, these suggestions should also apply to every randomized trial or other interventional experiment. First, every investigation should begin with an estimate of the effect size needed to strongly corroborate the proposed theory. A mere directional prediction is far too weak. Second, studies should be powered at the 90% level to detect this effect. Third, that power calculation should be explicitly written down.
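Here is what that power calculation might look like as code, under standard two-sample z-test assumptions (the effect size below is an illustrative default, not a number from the lecture):

```python
import math

def normal_quantile(p, lo=-10.0, hi=10.0):
    # Inverse of the standard normal CDF by bisection (standard library only)
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def sample_size_per_group(effect_size, alpha=0.05, power=0.90):
    """Per-group n for a two-sided, two-sample z-test to detect a
    standardized effect (Cohen's d) at the given power."""
    z_alpha = normal_quantile(1.0 - alpha / 2.0)
    z_power = normal_quantile(power)
    return math.ceil(2.0 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# To detect d = 0.5 at 90% power, you need roughly 85 subjects per group
print(sample_size_per_group(0.5))
```

Writing the calculation down explicitly, as Meehl asks, costs a few lines; a reviewer can then check the assumed effect size instead of guessing it.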

This is all perfectly reasonable, and I’m sure almost every methods class teaches something along these lines. And yet I found a bunch of violations of these principles in a cursory glance at my Zotero this morning. Though power concerns used to bug me, I’ve become more relaxed about this over time. These particular suggestions are just lipstick on the hypothesis-testing pig. Patched-up hypothesis testing is still just hypothesis testing. Hypothesis testing is the problem! That’s probably why Meehl doesn’t dwell too deeply on it. And that’s why I’ve relegated the discussion to this footnote.

*This post digs into Lecture 7 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

Today, let’s breeze through Meehl’s final four obfuscators of observational null hypothesis testing. This can be breezy in part because Meehl has already spoken at length about two of them (here and here): *Selective Bias in Submitting Reports* and *Selective Editorial Bias*.

So we’re down to two: *Pilot Studies* and *Detached Validation Claims*.

Before you invest a ton of money in a major data collection effort, it makes perfect sense to run a baby study to see if there’s any hope of the result panning out. Such *pilot studies* are where you might test whether your code or device works, get some qualitative feedback on the design, and get a sense of how large the effect of your treatment is. Meehl argues that pilot studies are valuable and likely necessary exercises to nail down the technical foundations of a good experiment. I don’t know anyone who disagrees with this. However, if the outcome relies on null hypothesis testing, pilot studies have a pernicious, paradoxical downside.

If the pilot doesn’t pan out, it could be because the pilot is underpowered! Pilot studies are necessarily small. They might be so small that they have a high false negative rate. In advance, you don’t know how large the effect should be. So unless the intervention works without fail, the pilot might show no significant effect. Since it’s just a pilot, researchers are more willing to file-drawer the finding and move on to their next clever experimental idea.

On the other hand, given the crud factor, false positives will be abundant in pilot studies. And researchers who take power calculations seriously will size their full studies to reliably replicate their pilot findings. That is, a pilot study can lead researchers to set the main study’s sample size large enough to reject the null hypothesis on a crud effect alone.

This leads to a bit of a nightmare. Good theories are getting screened out at the pilot stage due to insufficient power. Bad theories are getting accepted at the main stage due to crud. If this were the case, random theories with no verisimilitude would be consistently corroborated in published results.
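This screening dynamic is easy to simulate. A sketch under invented assumptions: good theories carry a true effect of d = 0.5, crud alone produces d = 0.1, pilots run 20 subjects per group, and main studies run 2,000:

```python
import numpy as np

rng = np.random.default_rng(1)

def rejection_rate(effect, n_per_group, n_studies=2_000):
    # Fraction of simulated studies where a two-sample z-test rejects at |z| > 2
    a = rng.normal(effect, 1.0, size=(n_studies, n_per_group))
    b = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
    z = (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(2.0 / n_per_group)
    return np.mean(np.abs(z) > 2.0)

# A good theory (true d = 0.5) screened by a small pilot: rejected well
# under half the time, so it often lands in the file drawer
pilot_pass_good = rejection_rate(0.5, n_per_group=20)

# A bad theory riding on crud (d = 0.1) with a large main study: "confirmed"
main_pass_crud = rejection_rate(0.1, n_per_group=2_000)

print(pilot_pass_good, main_pass_crud)
```

With these illustrative numbers, the underpowered pilot screens out the true effect most of the time, while the big main study reliably blesses the crud.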

Meehl’s final obfuscator is that we forget that measurements are often very imperfect representations of the treatments and outcomes we aim to study. For example, in psychology, many measurements come from psychometric tests. These tests have several well-documented issues. First, the correlation between the test score and the trait you care about is often low. Test builders might find a Pearson r as low as 0.4 but still deem the test useful enough for some aspects of clinical practice. To make matters worse, test-to-test reliability is also imperfect, with the Pearson r between two versions of the same test, or even two *administrations* of the same test, being as low as 0.8. This means that test scores are often a weak proxy for the trait you are trying to measure.

This weak correlation is problematic, but it’s even worse when researchers *forget* it is low. Meehl notes that in psychology, researchers will write in the methods section that “this test was validated in reference 11.” But then they’ll report hypothesis test results on the test scores, completely ignoring that the test has far from perfect validity and reliability. With the numbers from the example above, the effect measured with the imprecise test might be only a fraction of the true effect on the trait, and that fraction could be as low as 0.1. Few significance tests pan out if you need to divide the z-score by 10.
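To make the arithmetic concrete: attenuation multiplies. The composition below is my own illustrative guess using the validity and reliability figures mentioned above, not a calculation from the lecture:

```python
# Attenuation multiplies: each imperfect link between the trait you care
# about and the number you analyze shrinks the observed effect.
validity = 0.4      # correlation between test score and trait (from the text)
reliability = 0.8   # test-retest correlation (from the text)

# Hypothetical composition: imperfect validity on both measured variables,
# plus imperfect reliability of the instrument
fraction = validity * validity * reliability
print(fraction)   # about 0.13: close to the factor-of-10 loss described above
```

Stack two or three imperfect links like this and a tenfold shrinkage of the observed effect stops looking extreme.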

More broadly here, the issue is understanding what the measurement is telling you about the outcome you care about. Across the board in the human-facing sciences, we are faced with imperfect outcome measurements. A personal favorite of mine is “progression-free survival” (PFS) in cancer studies, as no one seems to know what that outcome means with regard to the health and well-being of a patient, but you can get drugs approved if you can improve PFS.

Though Meehl was reluctant to argue this, the four final obfuscators also plague randomized trials and other interventional experiment designs. In fact, seven of Meehl’s ten obfuscators are major issues in general experiments. I could make the case that many randomized experiments are plagued by problematic ceteris paribus assertions, experimenter error, insufficient power, incorrect conclusions from pilot studies, selective bias in submitting reports, selective editorial bias, and detached validation claims. I could argue that the first two of Meehl’s obfuscators—loose derivation chains and problematic auxiliary theories—lead to poor experimental design choices and poor statistical analyses in interventional studies. So that’s 9 out of 10? Could I even make the case that crud might oddly impact randomized trials? Was Lykken right? Yikes. I’ll definitely come back to this.

But first, let me stick with observational studies. Given his long list of obfuscators, Meehl leaves us asking what we should do. One could argue “stop null hypothesis testing,” but no one wants to go as far as “shutter all quantitative research in social science.” In Lecture 8, Meehl proposes some fixes. It’s interesting to see which have been adopted, which remain untried, and which have had positive impact. In the next few blogs, I’ll not only talk through Meehl’s suggestions, but will propose a few of my own.


*This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” I’m taking a brief interlude from my run-down of Lecture 7. Here’s the full table of contents of my blogging through the class.*

Can we fix the crud problem with more math? In many ways, that’s what the “credibility revolution” in economics set out to do: build a more sophisticated statistical toolkit that accurately teases out cause and effect when properly deployed. As Guido Imbens and Don Rubin put it in the introduction to their 2015 text *Causal Inference for Statistics, Social, and Biomedical Sciences*,

“In many applications of statistics, a large proportion of the questions of interest are fundamentally questions of causality rather than simply questions of description or association.”

Imbens and Rubin map a path for answering questions about epistemology using statistics:

1. “All causal questions are tied to specific interventions or treatments.”

2. “Causal questions are viewed as comparisons of potential outcomes.”

3. Comparisons of potential outcomes can be computed by careful estimation of average treatment effects.

Hence, all questions of interest in human-facing sciences are reduced to estimating effects in randomized experiments—whether or not a randomized experiment actually occurred. This means that the “gold standard” of causation remains null hypothesis testing. And that means that the entire discipline is based on correlation (a.k.a. description and association) and complex mathematical stories.

You don’t have to take my word for it. If you look at what the causal inference methods *do*, you will see that everything rests on null hypothesis testing. I mean, most of the estimates are built upon ordinary least-squares, and all least-squares estimates are combinations of correlations.

Let me give a simple example of an often-used estimator: the Local Average Treatment Effect (LATE). LATE uses “Instrumental Variables” to tease out causal relationships. You care about whether *X* causes *Y*, but you worry there are lots of confounding factors in your observational data set. To remove the confounding factors, perhaps you could find some magic variable *Z* that is correlated with *X* but uncorrelated with all of the confounders. Maybe you also get lucky and can argue that any effect of *Z* on *Y* has to pass through *X* (to be clear, you spin a story).

Economists have a bunch of crazy ideas for what should count as instrumental variables. Caveat emptor. My favorite example of an instrumental variable, one of the only ones I believe in, comes from randomized clinical trials. In a medical trial, you can’t force a patient to take the treatment. Hence, what is actually randomized is the *offer* of the treatment the trial aims to study. In this case, *Z* is whether or not a patient is offered treatment, *X* is whether the patient takes the treatment, and *Y* is the outcome the trialists care about.

But let me not dwell on instrumental variable examples. I wrote more about them here and here. I actually really like Angrist, Imbens, and Rubin’s original paper on LATE. For today, I want to show why this is still just correlation analysis. The standard instrumental variable estimator of the influence of *X* on *Y* is

β̂ = Ĉov(Z, Y) / Ĉov(Z, X) = (r(Z, Y) / r(Z, X)) · (σ̂(Y) / σ̂(X))

It’s a ratio of correlations. The standard way to “test for significance” of this effect is to do a significance test on the numerator. If it passes, you add two stars next to the correlation in the table. In an instrumental variable analysis, we changed the story but still just computed a correlation and declared significance if the number of data points was large enough.
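A simulation makes the point concrete. The sketch below invents an encouragement design with a known effect and a confounder, then computes the Wald estimator as a ratio of sample covariances (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical encouragement design: Z is a random offer, U an unobserved
# confounder, X the actual uptake, Y the outcome
z = rng.integers(0, 2, size=n).astype(float)           # randomized offer
u = rng.normal(size=n)                                 # confounder
x = (z + u + rng.normal(size=n) > 0.5).astype(float)   # taking the treatment
true_effect = 2.0
y = true_effect * x + 3.0 * u + rng.normal(size=n)     # confounded outcome

# Naive regression of Y on X is badly biased upward by U...
naive = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# ...while the IV (Wald) estimator, a ratio of covariances, recovers the effect
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(f"naive: {naive:.2f}, IV: {iv:.2f}")
```

The machinery works as advertised here, but notice that every quantity computed is a covariance; the causal content lives entirely in the story about *Z*.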

Even though other estimators aren’t as easy to write down, every causal inference method has this flavor. Everything is a combination of correlation and storytelling. “Causal inference,” as it’s built up in statistics and economics departments, is just an algebraically sophisticated language for data visualization.

Some of my best friends work on causal inference, and I respect what they’re after. They’d argue that this storytelling is better than just *randomly* picking two variables out of a hat. But I don’t see how causal inference methods can do anything to mitigate the effects of crud.

If there’s a latent crud distribution, causal storytelling connecting *X* and *Y* is no different than Meehl’s yarns about why certain methodists prefer certain shop classes. Clever people can construct stories about anything. If they gain access to Stata or R or Python, they can produce hundreds of pages of sciency robustness checks that back their story. If we don’t understand the crud distribution, there’s no math we can do to know whether the measured correlation between *X* and *Y* is real. If you buy Meehl’s framework (which I do), you can’t corroborate theories solely with the precision measurement of correlations. You need *prediction*.

Theories in the human-facing sciences need to make stronger predictions. At a bare minimum, the treatment effect estimates from one study should align across replication attempts. We seem to have issues even crossing this very low bar with our current framework. Adding more math to make the treatment estimate more precise doesn’t help us generalize beyond the data on our laptops.

Theories need to tell us more than whether the correlation between variables is positive or negative. We need to subject them to risky tests. Theories need to make varied, precise predictions. Only then does a precise measurement of these predicted empirical values matter. Reducing all question answering to Fisherian statistics will not solve these problems. But that’s where we seem to be stuck.

]]>*This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” I’m taking a brief interlude from my run-down of Lecture 7. Here’s the full table of contents of my blogging through the class.*

Meehl’s course has already emphasized that significance testing is a very weak form of theory corroboration. Testing if some correlation is non-zero is very different from the earlier examples in the course. Saying “it will rain in April” is much less compelling than predicting next year’s precise daily rainfall in a specific city. It’s frankly less compelling than predicting a numerical value of the pressure of a gas from its volume and temperature. I’m a bit reluctant to plead for a “better” form of significance testing. Part of the issue with the human-facing sciences is the obsession with reducing all cause and effect, all experimental evidence, to Fisher’s exact test. Randomized controlled experiments are *a particular* experiment design, not the only experiment design. Someday, we’ll all break free from this bizarre, simplistic epistemology.

But that won’t be today. Let me ask something incremental rather than revolutionary for a moment. What would null hypothesis significance testing look like if we took crud seriously? We know the standard null hypothesis (i.e., that the means of two groups are equal) is never true. What seems to be true is that if we draw two random variables out of a pot, they will be surprisingly correlated. If that’s true, what should we test?

Here’s a crudded-up null hypothesis:

H0: Someone sampled your two variables *X* and *Y* from the crud distribution.

We could ask: what is the probability of seeing the recorded correlation if H0 is true? What would the test look like? We’d need to compute the distribution of potentially observed Pearson r values. Since we’re working with finite data, that distribution would be the convolution of the sampling distribution of the correlation coefficient r (perhaps under a normal approximation) with the crud distribution. While you probably couldn’t compute this convolution in closed form, you could get a reasonable numerical approximation. The “p-value” is now how far your data’s correlation sits in the tail of this computed empirical crud distribution. If it’s more than two standard deviations from the mean crud, maybe you’re onto something.
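Here is a Monte Carlo sketch of what such a test could look like. The crud distribution below (correlations of magnitude 0.15 ± 0.05, signs symmetric) is invented for illustration, the sampling noise of r uses the usual normal approximation, and `cruddy_p_value` is a hypothetical name:

```python
import numpy as np

rng = np.random.default_rng(3)

def cruddy_p_value(observed_r, n, crud_mean=0.15, crud_sd=0.05, draws=200_000):
    """Tail probability of the observed correlation under H0: the pair
    (X, Y) was drawn from the crud distribution. Sampling noise of r is
    approximated as normal with standard deviation (1 - r^2) / sqrt(n)."""
    crud_r = rng.normal(crud_mean, crud_sd, size=draws)
    crud_r *= rng.choice([-1.0, 1.0], size=draws)   # crud has no preferred sign
    noise_sd = (1.0 - crud_r**2) / np.sqrt(n)
    sample_r = crud_r + rng.normal(0.0, noise_sd, size=draws)
    return np.mean(np.abs(sample_r) >= abs(observed_r))

# r = 0.12 with n = 10,000 is wildly "significant" classically
# (z = r * sqrt(n) = 12) but entirely unremarkable against crud
print(cruddy_p_value(0.12, n=10_000))
# An r of 0.40 would actually stand out from this crud distribution
print(cruddy_p_value(0.40, n=10_000))
```

The contrast with the classical test is the whole point: a correlation that is twelve standard errors from zero can still be dead center in the crud.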

Note that this sort of testing can’t cheat by growing n. In standard null hypothesis significance testing, a small correlation will be significant if n is large enough. But big n does not mean you’ll refute the cruddy null hypothesis. In fact, all that happens with growing n here is that the “empirical” crud distribution converges to the “population” crud distribution. That is, the convolution doesn’t change the distribution much. When n is moderate, you will be more or less testing if your correlation is more than two standard deviations away from the mean of the crud distribution.

Again, I don’t think this cruddy null testing solves *everything*, but it is definitely better than what we do now. We should know what is a reasonably low bar for an effect size. We should power our studies to refute that low bar. This doesn’t seem like an unreasonable request, does it?

What stops this from happening is that we don’t seem too enthusiastic about measuring these crud distributions carefully. What would that look like? Since the crud distribution is a distribution of correlation coefficients, we’d need to find a somewhat reasonable set of pairings of treatments and control variables specific to a field. We’d need reasonable datasets from which we could sample these pairings and compute the crud distribution. To me, this sounds like what Meehl and Lykken did in the 1960s: finding surveys with candidly answered questionnaires and tabulating correlations. In 2024, we have so many different tabulated spreadsheets we can download. I’m curious to see what crud we’d find.

For people who are familiar with his writing, I don’t think my suggestions here are different from Jacob Cohen’s. In the 1960s, Cohen tried to formalize reasonable standardized effect size measures and use them to guide experiment design and analysis in psychology. One of Cohen’s more popular measures, Cohen’s **d**, is more or less equal to twice the correlation coefficient:

**d** = 2r / √(1 − r²) ≈ 2r when r is small

Cohen asked that people compute **d** and then evaluate the effect on a relative scale (small effects are **d** < 0.2, large effects are **d** > 0.8). One problem with Cohen is that he assumed the scale for **d** was universal. But it certainly varies from field to field. It varies within fields as well, depending on the questions you’re asking. As I noted yesterday, in epidemiology we will always have Cohen’s **d** less than 0.2 for diseases like cancer. So to merge Meehl with Cohen, we’d need to look at the *right* distribution of effect sizes of random interactions and use this to set a relative scale for the believability of stories about correlations.

After my dives into the history of machine learning, I’m not at all surprised that I’m rediscovering sensible advice from the 1960s. In fact, I wrote a book about why we keep reinventing ideas from the Cold War that will be out next year. (More on that later). My point today is that some ideas from the 1960s shouldn’t go out of style. Everyone pays lip service to Cohen, but then he gets ignored in practice. Cohen laments this disregard in the preface to the 1988 edition of his book. Perhaps this means that incremental changes aren’t the answer, and the system of mindless significance testing exists to maintain a powerful status quo. If that’s the case, maybe we need a revolution after all.

Wait! Didn’t we *have* a revolution? You know, a “credibility revolution?” Did that fix anything? Let me take on that question in the next post.

*This post digs into Lecture 7 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

Meehl begins Lecture 7 by clarifying his rant about statistics from Lecture 6: “I love statisticians, and I like statistics.” It’s certainly true that Meehl should not be mistaken for someone who is against statistical methodology. Lectures 6 through 10 are almost entirely about probability and statistics, after all. And after his five-minute quasi-apology to the “subgroup of statisticians who have a certain arrogance toward the social or medical sciences,” he spends the next 45 minutes of Lecture 7 diving into numerical examples of how the crud factor might manifest itself even when theories are false.

In the spirit of these technical calculations, let me take this post to work through a few mathy-ish loose ends on crud. There will be more equations than have been the norm in these blog posts, but that’s because we’re pushing into arguments with statisticians. I’m setting the stage here for the subsequent posts where I want to try to rethink statistical practices with crud in mind.

Meehl works an extended example where the treatment variable is a thresholded normal. One example he gives: comparing groups that score high on a test with groups that score low. Perhaps you’d look at the mean of some attribute in people above the mean on an introversion scale and compare that to the mean in people below it. If introversion scores are normally distributed, then the treatment variable is a thresholded normal distribution.

The correlation coefficients between thresholded normal random variables are close to those of the unthresholded variables. There are lots of fun integrals you can compute. Let θ denote the Heaviside function: θ(t) equals 1 if t is greater than 0 and equals 0 otherwise. If *X* and *Y* are standard normals with correlation ρ, then:

Corr(θ(X), Y) = √(2/π) · ρ ≈ 0.8ρ

If you threshold one variable, the resulting correlation equals about 0.8 of the initial correlation. Meehl alludes to this formula in his whiteboard calculations in Lecture 7. We can go a step further and threshold both *X* and *Y*:

Corr(θ(X), θ(Y)) = (2/π) · arcsin(ρ)

If *X* and *Y* are correlated, their thresholded counterparts will be similarly correlated. Thresholding normal distributions does not eliminate the worry about crud.
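Both identities, Corr(θ(X), Y) = √(2/π)·ρ and Corr(θ(X), θ(Y)) = (2/π)·arcsin(ρ) for standard normals with correlation ρ, are easy to check by simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho = 1_000_000, 0.3

# Standard bivariate normal sample with correlation rho
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Heaviside-thresholded versions of each variable
tx = (x > 0).astype(float)
ty = (y > 0).astype(float)

r_one = np.corrcoef(tx, y)[0, 1]    # threshold one variable
r_both = np.corrcoef(tx, ty)[0, 1]  # threshold both variables

print(r_one, np.sqrt(2 / np.pi) * rho)        # both about 0.24
print(r_both, (2 / np.pi) * np.arcsin(rho))   # both about 0.19
```

With a million samples, the simulated correlations land on the closed-form values to a couple of decimal places.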

I don’t exactly know how to best estimate the modern crud factor, but I think it’s worth giving some scale. In Monday’s post, I called out this JAMA Internal Medicine article that claimed people who ate organic diets had lower cancer rates. We all know these nutrition papers are absurd and easy to pick on. And yet they still consistently get credulously written up in the New York Times. This paper doesn’t seem to be any more egregious than any other in the field. The whole field is very bad! But it does help give a sense of scale.

In this paper, the authors come up with some score of how much organic food people eat. They find the top quartile of scorers has low cancer rates. This is clearly a dressed-up correlation with wealth and socioeconomic status. Bear with me anyway.

In their main finding, they have 50,914 respondents with low organic scores and 16,962 with high organic scores. Of these, 1,071 in the low-organic group reported cancer, while only 269 in the high-organic group did. That’s a 25% relative risk reduction. While it’s not proper to treat this as an RCT, the z-score here is more than 4 and the p-value is less than 0.0001. So I could imagine (as the paper does) some sort of “causal correction” mumbo jumbo that “corrects for confounders” or whatever and still gets you a p-value less than 0.05. Eat organic, everyone!

OK, so what’s the correlation coefficient? We have a formula for it: take the z-score and divide it by √n, which is about 261 here. The result is about 0.02.
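Here is the arithmetic spelled out, using the counts quoted above and the two-proportion z-test with a pooled variance:

```python
import math

# Counts from the organic food study discussed above
n_low, cancer_low = 50_914, 1_071
n_high, cancer_high = 16_962, 269

p_low, p_high = cancer_low / n_low, cancer_high / n_high
print(f"relative risk reduction: {1 - p_high / p_low:.0%}")   # about 25%

# Two-proportion z-test with a pooled variance estimate
n = n_low + n_high
p_pool = (cancer_low + cancer_high) / n
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_low + 1 / n_high))
z = (p_low - p_high) / se
print(f"z = {z:.1f}")                 # more than 4

# The implied correlation coefficient: r = z / sqrt(n)
print(f"r = {z / math.sqrt(n):.3f}")  # about 0.02
```

A headline-grabbing 25% risk reduction and a p-value below 0.0001 correspond to a correlation of roughly 0.02.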

I don’t yet know what to make of this. The fact that cancer is already rare means the correlation coefficient can only be so high. For binary random variables with prevalence p, when the treatment and control groups are the same size, the largest the correlation coefficient can be is the square root of the odds of the prevalence:

r_max = √(p / (1 − p))

This would be the correlation between *X* and *Y* even when you have 100% risk reduction. It would be worth thinking more about what Meehl’s crud has to do with epidemiology where we have huge n and low prevalence, and hence all variables with small correlation. What is the crud factor in epidemiology? Somebody should study that!
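That bound is easy to verify by construction: build a balanced dataset with 100% risk reduction and check that the correlation comes out to √(p/(1 − p)). A sketch with a 2% prevalence:

```python
import numpy as np

# Balanced binary treatment X, binary outcome Y with overall prevalence p,
# and a 100% risk reduction: everyone with X = 1 avoids the outcome
p = 0.02
n = 1_000_000
x = np.repeat([0.0, 1.0], n // 2)
y = np.zeros(n)
y[: int(2 * p * (n // 2))] = 1.0   # all cases land in the X = 0 group

r = np.corrcoef(x, y)[0, 1]
print(abs(r), np.sqrt(p / (1 - p)))   # both about 0.143
```

Even a perfect treatment for a 2% prevalence outcome tops out at a correlation around 0.14, which is why tiny correlations are baked into epidemiology.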

Dean Eckles noted on Twitter that for non-binary outcomes, the common estimator for the variance in the z-test is a combination of the variance in the group when *X* = 0 and the group when *X* = 1:

SE² = σ̂₀²/n₀ + σ̂₁²/n₁

I could quibble that this variance estimator isn’t better than the one used in the proportions z-test, but it’s a quibble. As I’ve said before and will say again, these formulas are just rituals, and you can’t really justify anything with “rigor.” And it’s fine because we can still calculate stuff. If I use this variance estimator, the formula for z becomes

z = √n · r / √(1 − r²)

Carlos Cinelli tells me that Cohen uses this formula in his writings about power and effect sizes. While it is no longer a simple product, nothing in the crud story changes here. A significance test is still computing a simple function of the Pearson r, multiplying that number by the square root of n, and declaring significance when that product is larger than 2. That is the same as declaring significance when

r > 2 / √(n + 4)

That 4 in the denominator isn’t doing much work. Also, when r is less than ½, this z-score is less than 1.15 times larger than when you use the other variance estimator. We can’t escape the fact that significance tests are measurements of correlation. Maybe we should embrace that fact and see what happens.
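The 1.15 figure is just the factor 1/√(1 − r²) separating the two formulas. A quick numerical check:

```python
import numpy as np

n = 100
r = np.linspace(0.01, 0.49, 500)

z_simple = r * np.sqrt(n)                      # proportions-test variance estimate
z_pooled = r * np.sqrt(n) / np.sqrt(1 - r**2)  # within-group variance estimate

ratio = z_pooled / z_simple   # equals 1 / sqrt(1 - r^2), independent of n
print(ratio.max())            # just under 1.1547 for r < 1/2
```

For any realistically cruddy correlation, the two variance conventions give nearly identical z-scores, so nothing about the argument hinges on which one you pick.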


*This post digs into Lecture 6 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

To understand why the crud factor is a major problem, it’s worth unpacking how hypothesis tests, though the foundation of causal inference, are always about correlation.

Let’s focus on the most standard hypothesis tests that attempt to estimate whether some outcome is larger on average in one group versus another. Meehl used the example of comparing the difference in color naming aptitude between boys and girls. I worked through an example last week comparing the effectiveness of a vaccine to a placebo. Almost all hypothesis tests work by computing the difference, *D*, between the means of the outcomes in each group. Since the groups are always samples of a population, the computed difference on the sample is only an estimate of the difference in the broader population. We thus also compute an estimate of the *standard error*, *SE*, which is the standard deviation of this mean difference. This gives us a sense of the precision of our difference estimate.

To test statistical significance, we compute the probability, assuming the population’s mean difference is literally zero, that sampling noise of the size captured by the standard error would have produced a mean difference at least as large as the one we computed. Ugh, what a silly and confusing convention! But bear with me. In a few paragraphs, you will know what this is really doing and why it’s not only annoying but also pernicious.

The probability of seeing such a sampled difference under the null hypothesis is the probability of seeing a value of size *D*/*SE* or larger when taking a sample from a normal distribution with mean zero and variance 1. The quantity *D*/*SE* is all we need to compute the p-value and test statistical significance. The ratio *D*/*SE* hence gets a special name: the z-score. The z-score is always the estimated mean difference divided by the estimated standard error. If it is greater than 2, then the finding is statistically significant. That’s because 95% of the normal distribution is within two standard deviations of the mean.

Now, that’s the motivation for the z-score. I hope it’s clear that there is nothing particularly rigorous about this methodology. I reminded us yesterday that the null hypothesis (that the mean difference equals zero) is never literally true. But it’s also never literally true that the mean difference is normally distributed. I also can never figure out why we should care whether a probability is small under the null hypothesis. And then what? No one has ever explained to me what you do with this information. As Gerd Gigerenzer likes to remind us, null hypothesis significance testing (NHST) is just a mindless ritual.

Let me give you a different interpretation of NHST that I think might be more intuitive but also raises more concern about the whole enterprise. Let me introduce two new variables: *Y* is the outcome we want to compare between groups. *X* is a binary variable equal to 0 if the individual is in group A and equal to 1 if the individual is in group B. Let’s say you have a sample of n pairs of (X,Y). The mean of group A is then
(mean of group A) = Σ_i (1 − X_i) Y_i / Σ_i (1 − X_i)
This identity is true because when *X_i* is 0, (1 − *X_i*) is 1, and when *X_i* is 1, (1 − *X_i*) is 0. The sums thus pick out exactly the individuals in group A.

With my notation, if you do a little algebra, it turns out that the z-score takes a simple form:
z = r√n
In words: the z-score is the Pearson correlation between *X* and *Y* times the square root of n.1 So when you ritually NHST, all you are doing is seeing whether the Pearson r of the treatment and outcome variables is greater than 2 over root n. That’s it.
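If you want to check the identity numerically, here's a minimal sketch. It assumes the standard error is estimated with the total-sample standard deviation of *Y*, the convention under which the identity is exact:

```python
import random
from math import sqrt

random.seed(0)
n = 500
X = [random.randint(0, 1) for _ in range(n)]   # group labels
Y = [0.4 * x + random.gauss(0, 1) for x in X]  # outcomes

# The usual z-score: mean difference over its standard error,
# with the total-sample standard deviation of Y.
n1 = sum(X)
n0 = n - n1
mean1 = sum(y for x, y in zip(X, Y) if x == 1) / n1
mean0 = sum(y for x, y in zip(X, Y) if x == 0) / n0
my = sum(Y) / n
sy = sqrt(sum((y - my) ** 2 for y in Y) / n)
z = (mean1 - mean0) / (sy * sqrt(1 / n1 + 1 / n0))

# The Pearson correlation between X and Y, times sqrt(n).
mx = n1 / n
sx = sqrt(mx * (1 - mx))
r = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / (n * sx * sy)

print(abs(z - r * sqrt(n)) < 1e-9)  # True: the two quantities coincide
```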

I want to emphasize that my formula for the z-score is true even in randomized controlled trials! RCTs, it turns out, are also only about correlation. But because we explicitly randomize the *X* variable, we know that correlation is *measuring the effect* of causation. We’re not proving that *X* causes *Y* in an RCT. We’re measuring the size of the influence of *X* on *Y* knowing there is already some causal connection between the two. Correlation doesn’t imply causation, but it’s the only thing we can measure with statistics.

OK, but now it should be crystal clear why the NHST framework is an annoyingly obfuscatory convention and also why the crud factor is a huge problem. If you have two weakly correlated variables *X* and *Y*, you’ll find a statistically significant result with a large enough sample. Two things can bring us to statistical significance: Either the treatment and the outcome are highly correlated OR n is large. When n is small, you are “underpowered” insofar as you’ll fail to reject the null hypothesis even though *X* and *Y* are strongly correlated. But when n is moderately sized, you will reject the null hypothesis for weakly correlated variables.

Let’s come back to the crud factor and crud distribution. I’m going to be a bit more precise about crud today. Given a population of variables, the *crud distribution* is the distribution of the Pearson correlation coefficients between all pairs in the population when the pairs are chosen uniformly at random. Following the suggestion of Matt Hoffman and Jordan Ellenberg, the *crud factor* is the average *magnitude* of the Pearson r. In the example yesterday from Webster and Starbuck, the crud distribution was approximately Gaussian with mean 0.1 and standard deviation 0.2. The crud factor of this distribution is 0.2.

Now let’s imagine Meehl’s silly thought experiment from the end of Lecture 6: I have a pot of binary treatments, I have a pot of outcomes, I have a pot of theories. I pick out one thing from each pot at random: a theory *T*, a treatment *X*, and an outcome *Y*. I then assert that *T* implies that *X* causes *Y*, even though there is no connection between the logical content of the theory and the relationship between treatment and outcome. Then I gather a sample and test for significance.

What happens in the presence of a crud distribution? Let’s use the rough numbers from Webster and Starbuck as parameters for the crud distribution. If I gather a cohort of 800 participants, then the probability of rejecting the null at 0.05 is over 75%. For two variables I chose at random, the probability of finding a statistically significant result is 3 in 4. You might say, well, maybe we could fix this by setting the p-value threshold lower. Say to 0.001? At this p-value threshold, the probability of finding a statistically significant result between randomly chosen variables is over 60%. Even with a 5-9s p-value threshold (rejecting if p<10^{-5}), the chance of rejecting the null for a nonsensical comparison is over 50%.
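Here's a sketch of where these numbers come from. It assumes the crud distribution is Gaussian with mean 0.1 and standard deviation 0.2 (the rough Webster and Starbuck numbers) and that a significance test rejects when |r|·√n exceeds the z threshold:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def reject_prob(n, z_crit, mu=0.1, sigma=0.2):
    """Probability that a randomly drawn pair of variables tests
    'significant', i.e. P(|r| * sqrt(n) > z_crit) for r ~ N(mu, sigma)."""
    r_crit = z_crit / sqrt(n)  # the significance boundary in correlation units
    return (1.0 - norm_cdf((r_crit - mu) / sigma)) + norm_cdf((-r_crit - mu) / sigma)

# Two-sided thresholds: z = 1.96 for p < 0.05, z = 3.29 for p < 0.001.
print(round(reject_prob(800, 1.96), 2))  # 0.76: over 75%, as claimed
print(round(reject_prob(800, 3.29), 2))  # 0.61: still over 60%
```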

We find ourselves in the midst of yet another statistical paradox. If I sufficiently “power” a study that looks at relationships between randomly chosen attributes, and if the crud factor is not negligible, then I can corroborate theories that have no connection with facts. And the corroboration only increases if I make n larger. Oh crud!

1. I’ve never seen the z-score written this way before. If you’ve seen this formula in print, could you please send me a reference? And if you haven’t seen this identity before, are you as outraged as I am about how we teach hypothesis testing?

*This post digs into Lecture 6 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

The most glaring issue with null hypothesis testing in observational studies, one that critics like Meehl have been arguing since at least the 1960s, is that the null hypothesis is almost always literally false.

“Even things that are not of any theoretical interest to us are not likely to be totally independent of one another. It is almost impossible to come by a Pearson r of zero point zero zero. In psychology, you really would have to work at it.”

Note again, Meehl is talking about studies where you compare factors that people bring with them.1 Meehl gives the example of probing the statistical significance of the difference between boys’ and girls’ ability to name colors. He goes through a long description of why it’s preposterous to assume that a collection of boys would have *exactly* the same expected ability as a collection of girls.

In the social and biological sciences, the null hypothesis is never true because everything is correlated with everything else. The question is just how much. Meehl worries that everything is *highly* correlated with everything else.

Meehl calls the ambient correlation between variables *the crud factor*. In the class and in his writing, he’s never totally precise about what the crud factor is, but it seems pretty clear from context that he is talking about the average value of the Pearson r, also known as the Pearson correlation coefficient. The Pearson r is a normalized measure of the covariance between two random variables:
r = Cov(X, Y) / (σ_X σ_Y)
Pearson’s r conveniently takes values between -1 and 1. The crud factor is the mean of these correlation coefficients over all possible pairings of variables. How large is it? And how large does it need to be in order to imperil the paradigm of hypothesis testing? It’s going to take me over a week to go through why this particular quantity is a good one, why it’s probably quite large, and why it means that “large N” studies are almost always false.

Let me start with the evidence that the crud factor is large. Meehl describes some surprising results he found with David Lykken. They looked at a survey of 57,000 high school seniors administered by the University of Minnesota Student Counseling Bureau's Statewide Testing Program in 1966. The survey asked a bunch of questions about their families, their preferred vocations, their experience in school, their hobbies, and so on. Meehl lists some of the questions:

What magazines do you take in the home?

What are your plans to go on to college, if any?

What are you going to major in?

How often do you go out on dates?

Do you like picnics?

Which shop courses did you prefer? Sheet metal, electricity, printing, etc.

What religious views do you adhere to?

Meehl and Lykken computed the correlations between all pairs of 44 of the questions (44 × 43 / 2 = 946 pairs). Of these 946 correlations, 94% were significant at the 0.1 level. Most were significant at the 0.0001 level.

If everything tests significant, you can justify almost any psychological just-so story with a significance test. For example, for the last question, there was a breakdown into the varied denominations of Lutherans, asking whether a student was from the ELC, LCA, Missouri, or Wisconsin synods. What shop class a boy preferred in high school was correlated with which Lutheran synod a boy belonged to. Meehl launches into some satirical theorizing:

“The Missouri synod came over here in 1848, mostly fleeing from the unsuccessful revolutions of 48 in Germany. They were, to some considerable extent, skilled workers. High-level proletariat. Tinsmiths and factory workers with a strong socialist leaning. So maybe they had genes and environment leaning more toward things like sheet metal and electricity. Whereas the LCA or the ALC were mostly, like my ancestors, Norwegian yokels that came over here not from a revolution but because they heard there was good soil and they didn't like the established Lutheran Church of Norway or Sweden. So they came over here, and they were farmers and foresters and lumberjacks. The view was in the early days of Minnesota that a dumb Swede was good for nothing except to be a farmer or to chop trees down. Maybe they're better at woodwork.”

Bam. Publish it! Meehl then asks whether being in the Missouri synod would ever lead to you liking printing. He first tells the class that this would be absurd, but then has a sudden eureka moment and exclaims, “Wait, wait, I can do that one!”

“The Missouri Synod was the most scholarly of the bunch. All of the clergy that I knew when I was in that outfit had four years of Hebrew, four years of German, four years of Latin, and of course Greek… [They had a] very strong emphasis upon scholarship and upon intellect, whereas some of the Scandinavian Lutherans were a little bit like Saint Bernard, you know, that you shouldn't have too much [G-factor] cooking here.”

Meehl’s assertion, which we all know is true, is that a clever theoretician can spin a story for any large correlation plopped before them. It would probably be impossible for them to explain the entire universe of correlations, but the scientific literature doesn’t require that. Organic diet is correlated with lower cancer rates? Nice! Send it to JAMA!

If everything is correlated with everything, the question remains: how much? Meehl thinks it’s enough to worry about, but he’s not sure how to estimate it more generally. In Lecture 7, he describes a few more studies where people found shockingly large crud factors. He ponders:

“I don't know how big the average correlation is between any pair of variables picked at random out of a pot. Somebody should study that.”

Well, somebody did! I haven’t yet done an exhaustive survey, but I did do some preliminary googling for the crud factor and quickly found several studies. Ferguson and Heene correlated 14 variables they considered to have “little significance” with adolescent aggression and found “significant” correlation coefficients that were near 0.1 in many cases. Frequency of sunscreen use was highly correlated with adolescent delinquency.

In another study, blogged about here, Webster and Starbuck did an analysis of 15,000 correlations published in the Administrative Science Quarterly, the Academy of Management Journal, and the Journal of Applied Psychology. They found a distribution of correlations that looks like this:
[Figure: histogram of the 15,000 published correlation coefficients from Webster and Starbuck.]
The mean is about 0.1, showing that the crud factor is definitely nonzero. But what I found most fascinating is that the standard deviation is huge. It’s somewhere in the range of 0.2 to 0.3.

Webster and Starbuck’s plot made me realize that we need to go a step beyond Meehl. An ambient crud factor is bad, but having a highly variable *crud distribution* might be worse. Null hypothesis testing would be problematic even if the crud factor (i.e., the average correlation) equaled zero. Aligned with Meehl’s thought experiment about variables picked out of a pot, let me define the crud distribution as the distribution of the correlation coefficients of randomly selected pairs of variables. What if the mean of the crud distribution were zero, but the standard deviation were 0.2? That wouldn’t be good either! It would mean that the probability of picking two variables at random and finding a correlation coefficient of magnitude 0.25 or more is over 20%. As I’ll show in the next post, a crud distribution with nontrivial variance is just as worrying as a large crud factor.
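That 20% figure is a one-line calculation. A sketch, assuming the crud distribution is Gaussian with mean zero and standard deviation 0.2:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# With r ~ N(0, 0.2), the chance that a randomly chosen pair of
# variables has a correlation of magnitude 0.25 or more:
sigma = 0.2
p = 2.0 * (1.0 - norm_cdf(0.25 / sigma))
print(round(p, 3))  # 0.211: over 20%
```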

1. I can certainly construct randomized experiments in which the null hypothesis is probably true. If I take a cohort of people and assign them at random to two identical treatments, the null hypothesis will be true. Maybe I take two bottles of Advil from the same pharmacy and give the treatment group pills from the first bottle and the control group pills from the second. In RCTs, the null hypothesis might be true. But randomized experiments have their own issues, which I’ll come back to later.

*This post digs into Lecture 6 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

As Meehl has noted, his ten obfuscators point in opposing directions, which is why they make the literature uninterpretable. For Meehl, insufficient power takes *good* theories and makes them look bad. The point here is simple: when studies are too small, null results don’t necessarily provide evidence that the theory is false. If you legitimately have 80% power and a theory of high verisimilitude, there’s a 1 in 5 chance you’ll run your experiment and fail to reject the null hypothesis. 1 in 5 seems like a pretty poor bet for a study requiring multiple person-years to execute. Meehl’s point is that if you don’t pay attention to power, you’re shooting yourself in the foot when trying to corroborate your theory.

What would it take to get higher power? The answer is always more data. Let me use yesterday’s example of a vaccine study to illustrate how these power calculations change as we ask for more precise studies. To repeat the setup, we have a vaccine that we think prevents a disease. To compute the power, we guess the *prevalence *(i.e., what percentage of people will catch the disease without the vaccine). We assert a tolerable level of *risk reduction* (i.e., the percent fewer people who will get the disease if they take the vaccine). If we want to have a power of 80% and a size of 5%, the number of people we need to enroll in the trial is
n ≈ 32 / (prevalence × risk reduction²)
If the prevalence is one in a hundred and the tolerable risk reduction 50%, then we’ll need to enroll about 13,000 people in the study. That’s already pretty big. But what if we demanded a power of 99.999% and a size of 0.001%? Then we’d need
n ≈ 300 / (prevalence × risk reduction²)
On the one hand, this is only about a factor of nine times larger. An order of magnitude increase in samples yields a massive increase in statistical precision. On the other hand, for the same levels of prevalence and risk reduction, we’d need about 120,000 participants. Factors of 9 matter a lot when it comes to fundraising for studies.
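For the curious, here's the back-of-the-envelope arithmetic behind both counts. This is my reconstruction, not a formula from the post: total enrollment ≈ 4(z_size + z_power)² / (prevalence × risk reduction²), which produces numerators of about 32 and 300 for the two scenarios:

```python
def trial_size(prevalence, risk_reduction, z_size, z_power):
    """Back-of-the-envelope total enrollment for a two-arm trial,
    using the rare-disease approximation (outcome variance ~ prevalence
    in both arms). The numerator 4 * (z_size + z_power)**2 is the
    '32' or '300' in the text."""
    return 4 * (z_size + z_power) ** 2 / (prevalence * risk_reduction ** 2)

# 5% two-sided size (z = 1.96) and 80% power (z = 0.84):
print(round(trial_size(0.01, 0.5, 1.96, 0.84)))   # 12544, i.e. about 13,000

# 0.001% size (z ~ 4.42) and 99.999% power (z ~ 4.27):
print(round(trial_size(0.01, 0.5, 4.42, 4.27)))   # about 120,000, a factor of ~9 more
```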

But people need to do experiments! So they make up some optimistic projections so that the number of participants works out to be exactly the number they think they can feasibly enroll given their staffing and budget constraints. It’s easier to be optimistic about your result when the numerator is 32 rather than 300.

Since we’re all budget-constrained, we end up with a world of underpowered studies. But this is where Meehl’s argument gets confusing. If a study has insufficient power, it will likely yield a null finding even when the experimenter’s theory is true. As we’ve discussed before (and I’ll return to it when unpacking Lecture 7), we don’t tend to publish null results. I’ve argued that not publishing null findings is acceptable practice. That’s not the issue here. My issue is that if low power just causes more null findings, we would see fewer publications and no knocks on good theories in the scientific literature.

If most studies were underpowered, we’d see more null results and fewer publications, right? We all know that’s definitely not what’s happening. We see an endless doom-scroll of non-null results. Is low power really then a problem? I’m not sure! I’m probably missing something here.

Now, in the 2000s, John Ioannidis leveled a different flavor of critique against the ubiquity of low power. He proposed a seductive “screening model” for evaluating studies using Bayes’ rule. You plug in the power and size into a Bayesian update, put the prior on theories being true to be low, and voila, most published studies are false. With respect to Dr. Ioannidis, I don’t buy this argument at all. Taking power and size as literal, exhaustive likelihoods is not valid. I may have more to say about Ioannidis’ “positive predictive value” interpretation in future posts. For now, if you want to read more, Mayo and Morey have a detailed critique of this viewpoint.

But for today, I’ll just leave it at this: I agree with Meehl that small studies make it hard to corroborate good theories. We should try to make studies as large as feasibly possible. Fortunately, we’re in the age of big data and can grab huge data sets that never have power issues. This should solve the problem, right? Unfortunately, it solves one problem while creating another. As we’ll see in the next post, too much power can be a bad thing in observational studies. Because everything, everywhere is correlated.


I’ve been struggling all week to figure out how to blog about Meehl’s *statistical* obfuscators. In some sense, I’m guessing they are very obvious to you, dear reader. On one hand, studies should be large to make sure we aren’t rejecting good theories. On the other hand, in the social sciences, everything is correlated with everything, so if you make a study big enough, you’ll always reject the null hypothesis. These two contradict each other. You are damned if you do, and damned if you don’t. Meehl knows this. As he says at the beginning of Lecture 6, his ten obfuscators work in opposing directions to make the literature uninterpretable. Yet, this tension needs to be pulled apart because everyone is still arguing about the role of “power” in applied statistical work.

If you read enough critiques of papers, you might come away with the impression that an underpowered study is one where someone disagrees with the main conclusions. What does it mean to be underpowered? Why is that bad? It seems worth reviewing the *practice* of null hypothesis testing to try to set the stage for why power and crud combine to make the entire practice nonsensical.

In any experimental study, you want to ensure you have enough measurement power to detect what you are looking for. If you believe a star should be at some location in the sky, you need a telescope that can precisely point where you should look and a lens powerful enough to resolve the star.

In most hypothesis tests, we are trying to measure if the average value of some quantity is larger in one population than another. Is the average adult taller than the average child? Is the average number of deaths from a disease greater among patients who receive treatment A or treatment B? Does my website make more money on pages where there is a blue banner or a green banner? To measure this difference, we’re going to get *N* people from group one and *N* people from group two. If the difference between the means of the two groups is big enough, we’ll conclude there’s something going on there.

*Power* is the statistical analog of measurement precision. In a statistical study, precision is dictated by the number of samples. We first assume that every value we record is a random number. This assumption is useful because the more random numbers we average together, the more precise our estimate of their mean becomes.

A power calculation estimates how many samples we need to observe to assure ourselves that the means are different. There are two main probabilities we use to calculate this number of observations. We imagine two different hypothetical scenarios. First, assume the intervention you care about does nothing. The *size* is the probability that we run our experiment and get a result that erroneously concludes the means are different. (It is the probability of rejecting the null hypothesis assuming the null hypothesis is true). Now assume the intervention we care about actually works. The *power* of a test is the probability that we get a result that correctly surmises that the intervention does something. (It is the probability of rejecting the null hypothesis assuming the intervention has a particular effect). The power will increase with the number of samples in the experiment, and a *power calculation* finds the number that gives us the desired power and size.

How does this play out in practice? Let *D* denote the measured difference between the means of the two groups. From our data, we can also estimate the standard deviation of the estimated difference. Call that *S*. Almost all null hypothesis tests work by rejecting the null hypothesis when *D*>2*S*.

Why we use the number 2 doesn’t matter. It supposedly corresponds to a probability of 5%, but these probabilities aren’t real. The threshold of 2 is just a very sticky convention that we all mindlessly accept. The important part is that *S* gets smaller with more samples. If our mean difference is truly not zero, we’ll find that *D*>2*S* if our sample is large enough.

But how large is large enough? This is where the power comes in. We first come up with a number *T*, which is the minimal tolerable mean difference. If the true difference is smaller than *T*, then we don’t practically care about it. This is our “risky prediction,” if you will. We want to make sure that we have enough data so that we’ll see *D* > 2*S* in our sample if the true mean difference is *T* and the true standard deviation of the estimate is *SE*. So how small should *SE* be?

Statisticians returned to their chambers, donned their hoods, ran through their incantations, and decided that we need *T* > 2.8 *SE*. This will “guarantee” that the size is 5% and the power is 80%. To reiterate: to do a power calculation, we declare in advance a tolerable difference and compute the standard deviation of our estimate. Our standard deviation calculation will be some function of the number of samples. We choose the number of samples so that the tolerable difference is 2.8 times larger than the computed standard deviation. Amen.
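Where does 2.8 come from? If the true difference is *T* = 2.8 *SE* and we reject when *D* > 2 *SE*, with *D* roughly normal around *T* with standard deviation *SE*, the power works out to about 80%. A quick sketch of that arithmetic:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# D ~ N(T, SE^2) with T = 2.8 * SE; we reject when D > 2 * SE.
# In standard units, that's P(Z > 2 - 2.8) = norm_cdf(0.8).
power = norm_cdf(2.8 - 2.0)
print(round(power, 3))  # 0.788: the advertised "80% power"
```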

Usually, these calculations are given in simple formulas for a person to read off, so they don’t have to think about what power calculations mean. Or even better, you can just run them in conveniently accessible software. You can just do your own personal incantation in your office, using R or STATA or some web calculator or whatever. Ask ChatGPT.

Let’s do one here! Though we’re supposed to be focusing on observational studies, let me use a hypothetical randomized trial as an illustration. Suppose we have a vaccine that we think prevents a disease. Our null hypothesis is that the vaccine does nothing. How large does the trial need to be to reject the null hypothesis when the treatment is effective? We can get a back-of-the-envelope calculation if we know some things in advance. First, we guess the *prevalence*, i.e., what percentage of people will catch the disease without the vaccine. We assert a tolerable level of *risk reduction* for the vaccine. A 100% risk reduction means it always works. A 0% risk reduction means it does nothing. A 50% risk reduction means half the prevalence in those who receive the vaccine. And a 20% risk reduction means four-fifths of the prevalence in the vaccinated group. If we want to have a power of 80% and a size of 5%, the number of people we need to enroll in the trial is
n ≈ 32 / (prevalence × risk reduction²)
We get a nice and tidy formula that we can compute. If the prevalence is one in a hundred and the tolerable risk reduction 50%, then we’ll need to enroll about 13,000 people in the study. Yes, thirteen thousand.

Now what is this calculation guaranteeing? We’ve made a lot of assumptions about things we don’t know in advance (here about prevalence). But if we believe all of the modeling, we’re going to run a 13,000 person study and have a 20% chance of having a perfectly good vaccine but being unable to reject the null hypothesis.

Wait, 20% seems high to me! Hmm.

In sum, we have a bunch of confusing calculations that are hard to explain, a bunch of assumptions about experimental conditions that are hard to verify, and a one in five chance of having wasted our time after running a huge study. And this calculation was for a randomized trial with a concrete, manipulable intervention. As I’ll describe in the next post, the problem only becomes worse when we move to observational studies.


There is an inherent tension between rules and play in the game of science. Because scientists approach other scientists' results with skepticism, they weed out folklore from facts in a way that other modes of inquiry don’t. So we invent rules to make it easier to attack each other’s work. Show me the p-values. Show me the preregistration. Show me the identification strategy. Show me.

But where do the ideas come from? How do you decide which papers to write? If you want to discover new facts, being rigid and skeptical is a terrible starting point. You need to be creative. You need to play. Here’s Meehl:

“I don't know of a single really brilliant mathematician or physicist who has written a book saying ‘The way I made my discoveries was to shackle my leg to the chair like Hemingway writing for four hours before he could get drunk every day.’ They don't write about it that way. People like Poincare talk in terms of dreams, free associations, and going fishing. The typical attitude in physics is it should be a little wacky. There's a famous quote from Niels Bohr about somebody's theory of the nucleus: ‘The only thing I don't like about this is it's not crazy enough.’

“Physicists are much less spastic than we are. They can relax because they know they’ve got the best science there is. Blow up the damn world! That's how good theirs is. So they're very relaxed. They're not puffed and stuffy and pompous about their vocabulary. They talk about charm and color and strawberry flavors in the nucleus. We never would do that in psychology. If we had to talk about strawberry flavor, we’d look up the Latin word for strawberry.”

The scientific mind needs to have two conflicting personas. One for coming up with theories. One for testing them.

“When I characterize science as being tough-minded, that means in *corroborating* or *refuting* the theory. That’s stage two. It's not in the stage of creating it. It's not in the stage of designing the experiments to test it. There you should be freewheeling. Let your hair down. Allow your creative spark to function.”

Herein lies the key tension. When reading everyone else’s work you need to be skeptical. In your own work, you need to be freewheeling. Skepticism requires rules. Discovery requires play. Research is a game of inquiry. We’re not going to get anywhere if we don’t appreciate this tension. If we suck the fun out of it. If we make it too hard to play.

My affection for research play gets me into endless trouble. I’ve been chastised by many senior scientists for not taking research seriously enough or for having too many jokes in papers. “Science should be a serious business!” They tell me. But the tension between rules and play is part of what drives research forward. So middle-aged me has embraced that I get myself into trouble no matter what I do. I’m here to advocate for play, and I have constructive suggestions.

The first concerns graduate education. In K-12 science, we have all sorts of ways to make the topic captivating and fun. Egg drop experiments or growing crystals or whatever. In graduate school, we only teach rigid methods and host reading groups tearing apart papers. As Meehl notes, the blame rests on the shoulders of faculty.

“It's our fault. The faculty does this to you. We get your head fixed so that to not have a random sample is the worst conceivable thing a person could do.

“The mental set is to play it safe so that the peer group won't think you're dumb or the faculty won't think you're stupid. And the way you can play it safe is to make the usual nitpicking statistical criticisms. I mean, it's always safe to [ask] “Was this a totally random sample of the Western Hemisphere's schizophrenics?” It's bound to work, because it never is. So you can get by with that and get reinforced.”

We need to find ways to teach graduate students to occupy two personas—“Offense” and “Defense” if you will. We seldom do seminars on creativity and play. We should do more of them! We should encourage grad courses where projects can be speculative and weird and not necessarily targeted toward future education. We should run creative writing seminars. We should run reading groups on offbeat topics. Encouraging play has to be part of the process.

My second suggestion is to abandon pre-publication peer review. I’ve written about this before, and I’ll write about it again, but reviewing is a practice whose time has come and gone.1 It’s an unpaid waste of everyone’s time and gives undue authority to rules that hinder progress.

We could treat reviewing as a sanity check rather than a full-on adversarial attack. Is the paper written well? Is it plagiarism or fraud? Did they provide sufficient means for others to reproduce the work? If so, publish.

Since we’re not getting rid of peer review tomorrow, let me at least add some constructive criticism for how to approach reviewing. It’s pretty weird that we have all of these classes on methods and none of them cover how to be a good referee. I support Meehl’s proposal:

“I was a good [referee] in the sense that I tried to enter into the author's frame of reference. I asked the clinician in me, “What is this person trying to think about?” [I didn’t ask] what I'm thinking about! I mean I can write my own damn article. What is this author trying to say? My task as a referee is not whether he converts me, but whether he explains what he's up to so that I can see, as the Supreme Court says, “a rational mind being converted.” Even if I don't get converted!”

Make a steelman argument for every paper you read. Don’t think your primary role is gatekeeping. Remember that reviewing isn’t about you and your predilections. Remember that reviewing can’t determine whether a paper will be revolutionary. Figure out if the paper makes a cogent, clear argument. Can you make a case that someone might be convinced by whatever is written?

What deserves to be published, after all? What should the rules be? Should there be rules?

1

If you want a good argument in one place, I agree with almost all of what Adam Mastroianni says here.

“Some of my best friends are statisticians!” proclaims Paul Meehl midway through Lecture 6. Me too, Paul! Me too. And yet!

In the most epic rant of the quarter, Meehl lays out a complex argument about institutionalized statistics. On the one hand, he thinks statistics is necessary for science. On the other hand, he’s unconvinced that we need statisticians. Meehl complains that statistical culture harms inquiry.

“I am convinced—not only from my own sad experiences but other people's where I can be somewhat objective—that psychology editors should be very careful in sending articles that contain anything statistical to PhDs in statistics. Because they are a virulent, anal, rigid, dogmatic bunch of bastards, and they are in the habit of treating social scientists as if they were nincompoops.”

Meehl thinks this particular arrogant disposition arises from the statistician’s role as consultant. A scientific paper needs a stats section to be published. So statisticians set up clinics where grad students and faculty benevolently offer to help the statistically uninitiated with their designs and analyses. Meehl worries that a field that sees itself as a consultancy reinforces bad attitudes about the role of the statistician, and bad attitudes toward social scientists more generally.

“There they are, Omniscient Jones in their office with their Gamma functions. And here comes poor little doctor Glotts, second-year resident in psychiatry, who hardly recognizes the standard deviation that he meets on the street. He's seen a few of these patients, and he brings in his little mess of crummy data and asks, “What do I do with this?”

“[Statisticians] get in a habit of talking to social scientists who are mathematically ignorant and pontificating to them. Consequently, if a psychologist or sociologist has an idea once in a while, he or she is likely to have trouble getting it published.”

Though Meehl is frustrated with statisticians and Statistics as a field, he is a proponent of statistical thinking. We’ll see this more over the remainder of the lectures. Meehl argues we use statistics because skepticism is essential in scientific thinking, and statistics can help formalize such skepticism.

“Science is a better enterprise [than] folklore and various other forms of human cognition because it has a set of procedures. One of its procedural things is ‘show me [because] I'm hard to convince.’”

Meehl is making a normative claim here that I don’t necessarily want to cosign. But let me try to unpack what *he* thinks is important. For Meehl, a core part of scientific thinking is collecting data and “finding out whether other people can see what you see.” Hence other scientists should approach findings with skepticism. Through an iterative debate in a field, perhaps some progress can be made. Meehl argues this is why we have “p<0.05.” The p-value was invented by Fisher—and motivated by statisticians before him—as a formal mechanism of skepticism.

But herein lies the challenge of statistics. Goodhart’s Law tells us that formalized rules of skepticism cease to be useful tools for the skeptic. Anyone who has looked at the current state of applied statistics knows we are in a mess. Even the simplest application of p-values confuses people and invites high-handed scolding from statistical clerics. We get endless meta-papers analyzing p-value distributions and tut-tutting about how they have the wrong shape.

At the other extreme, we have a finite suite of statistical tools accepted as canon that encourage the laziest, least motivated mathematical modeling. The idea that every analysis has to be shoehorned into a particular statistical formalization is ridiculous. We have a set of accepted analyses that make odd linearity, independence, and odds assumptions. Go read a few David Freedman articles, and you will quickly realize that none of these canonical statistical models are ever justified or validated. No one actually believes a time series model has the form of, say, the canonical two-way fixed-effects regression

*Y*_it = *α*_i + *γ*_t + *β·D*_it + *ε*_it

But we run least squares on equations like this, make tables with asterisks, and declare causality because people have accepted Differences-in-Differences into the statistical canon (IYKYK). Such models are not just wrong, they are *not even wrong*. They are an incredibly fancy way of presenting correlational data to tell imaginary folk tales about causality. We accept this bizarrely limited modeling suite because it makes it easier to write and reject papers. But it doesn’t help us be more skeptical. If anything, it makes it easier to obfuscate results behind ornate statistical tapestry in a hundred-page appendix of robustness checks.
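To make the recipe concrete, here is a minimal sketch of the kind of two-way fixed-effects differences-in-differences regression in question. Everything here is made up (synthetic panel, an assumed treatment effect of 2.0); the point is how mechanical the whole procedure is, not that the arithmetic fails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic panel: 20 units over 10 periods; units 0-9 are "treated" from t = 5 on.
n_units, n_periods = 20, 10
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
treated = ((unit < n_units // 2) & (time >= 5)).astype(float)

# Outcome = unit effect + period effect + an assumed treatment effect of 2.0 + noise.
alpha = rng.normal(size=n_units)
gamma = rng.normal(size=n_periods)
y = alpha[unit] + gamma[time] + 2.0 * treated + rng.normal(size=unit.size)

# Two-way fixed effects: treatment dummy, intercept, and unit/period dummies
# (dropping one of each to keep the design matrix full rank).
X = np.column_stack([
    treated,
    np.ones(unit.size),
    np.eye(n_units)[unit][:, 1:],
    np.eye(n_periods)[time][:, 1:],
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"estimated treatment effect: {beta[0]:.2f}")  # recovers roughly 2.0
```

Run the recipe, read off the first coefficient, attach asterisks. Nothing in the code checks whether the additive, linear, homogeneous-effect story is remotely true of the world.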

And this is my issue. A statistical canon just becomes a ritualistic set of rules required for paper formatting. Not a tool for overcoming skepticism. I think of statistical models the same way I think of plots. They have become a form of data presentation, not analysis. Forcing everyone to use a fixed set of statistical modeling rules is the same as forcing everyone to use the same visualization package. It’s like how in the New England Journal of Medicine, every paper has a Table 1 and a Table 2, and you know exactly what’s going to be in them. Applied statistics has become nothing more than an arcane rigid set of rules for how you are allowed to arrange your data in a paper. They have become rules of style, not instruments of discovery. But we imbue them with some sort of epistemological magic.

The problem is, no one is convinced by “p<0.05” anymore! We all now know that statistical sorcerers can make random fluctuations look statistically significant. Have we forced everyone to tie themselves to methodological rigidity that convinces no one of scientific validity?
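The sorcery requires no skill, only repetition. A minimal sketch with pure noise, using the familiar 1.96 cutoff as a stand-in for a proper t-test (an assumption, but a close one at this sample size):

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 "studies", each comparing two groups of 30 drawn from the SAME distribution.
n, significant = 30, 0
for _ in range(100):
    a, b = rng.normal(size=n), rng.normal(size=n)
    # Two-sample z statistic; |z| > 1.96 is the usual p < 0.05 cutoff.
    z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    if abs(z) > 1.96:
        significant += 1

print(f"{significant} of 100 pure-noise comparisons come out 'significant'")
```

Roughly five of a hundred null comparisons clear the bar by construction. Run enough specifications on one dataset and you can report the ones that did.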

People are obsessed with ways of making science more rigorous. One of my dear friends has an NSF IGERT on training people to do better science. But what exactly does that mean? Too often, rigor devolves into ritual. Preregistration and p-values. Robustness checks and regressions. Are these achieving Meehl’s goal of convincing the skeptic? Remember how we got here in the first place! Meehl’s complaint was about loose derivation chains and bad modeling. Does tying our hands with rigid systems of rigor make these issues better or worse?

One thing is for certain: statistical rigidity inhibits creativity. In Lecture 6, Meehl is trying to feel out the balance between creativity and rigor in scientific investigation. It’s a real challenge. Let’s talk about that next and see if there’s a middle ground.


In Lectures 6 and 7, Meehl dives into what he’s known best for: a critique of observational data analysis based on null hypothesis testing. These lectures draw from his paper “Why Summaries of Research on Psychological Theories are Often Uninterpretable,” where he lists ten obfuscators that make it hard to assess published results. Why is it that no matter how many studies we do, we can come away from a meta-analysis with no idea whether a particular theory is true or false? Why do we end up with equivocal results stating “we conclude with low to moderate confidence that some factor may or may not have any relevance to the outcome in question”? Why does so much of science just amount to wasting everyone’s time?

Meehl makes it very clear that he is only speaking about observational studies, not randomized experiments. He remarks that his colleague David Lykken thinks the criticisms do indeed also apply to randomized experiments. For whatever it’s worth, Meehl thinks Lykken might be right, and I do too. Meehl wasn’t ready to drop the hammer just yet. However, Meehl’s critiques do apply to all “observational studies,” even the ones that use fancy statistics to pretend like they did a randomized experiment (this sort of fancy stats is what economists and their friends arrogantly call “causal inference”).

I’ve been meaning to write about the poverty of observational causal inference since I started this substack, and thank Meehl for finally presenting me the opportunity. Let’s spend a couple of weeks on why you shouldn’t believe any observational studies. We can then move on to see which of the critiques also apply to RCTs. And then next year, we can work on closing down the economics department.

Meehl’s obfuscators break down into four clean groups. The first four obfuscators are about derivation chains, the next three are about statistical correlations, the next two about research bias, and the final one about construct validity. Today, we can tackle the first four as they are a nice segue from the lecture on Lakatosian Retreat. And these obfuscators highlight a piece that was missing from the program. When your derivation chain from theory to outcome is not logically tight, then no amount of evidence can corroborate or refute your theory.

Recall, one last time, Meehl’s logical formula for scientific prediction:

(*T* ∧ *A_T* ∧ *C_P* ∧ *A_I* ∧ *C_N*) ⊨ (*O*1 ⊃ *O*2)

In this model, we logically deduce a prediction that “If we see *O*1, then we see *O*2” from our theory *T*, our auxiliary theories *A_T*, our instrumental auxiliaries *A_I*, the ceteris paribus clause *C_P*, and the realized experimental particulars *C_N*.

But this all rests on the deduction chain being valid. One further attack on a scientific result is to go after the particulars of the derivation chain. Meehl’s first four obfuscators apply all too widely:

The deduction chains are not explicit.

There are problematic auxiliary theories (unstated or badly formalized *A_T*).

The ceteris paribus clause is almost surely false (easily deniable *C_P*).

The particulars are imperfectly realized (murky *C_N*).

That any of these four are show stoppers for corroborating a theory should be clear. If any of these are true, then the deduction chain does not logically imply that O2 follows from O1. A poorly described, poorly justified derivation chain combined with correlational evidence mined from some public data set doesn’t corroborate anything. When your theory is a murky mess, the negation of it is also a murky mess, and we can just end up being confused.

The first two obfuscators are saying that theories fail “robustness checks.” We end up in this silly game where authors write down a loose mathematical model for why *O*2 follows from *O*1, but don’t have a convincing reason for why that should be the relationship. They might say *O*2 corresponds to a measured quantity *y*. They quantify *O*1 in a covariate *x*. They write down equations like

*y* = *b·x* + *e*

where *b* is some parameter and *e* is an error signal. Then they assert *e* has some statistical properties, like being normally distributed. This model makes a bunch of huge leaps. Why is it linear? Why is that noise random? Why aren’t there other variables in the equation? How many specifications could be close enough to this theory while still being plausible?

Unfortunately, most published observational studies have this problem. There are deductive leaps from the theory to the model, the model is never valid, and there are dozens of plausible models that are just as good as the one written down in the appendix. This means the probability of O2 given O1 in the absence of the theory can change wildly depending on this specification. As a viral Twitter thread showed yesterday, most observational results in prestigious academic journals don’t pass modest robustness checks. And yet we keep publishing them.
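How badly can the specification swing the answer? Here is a minimal sketch on simulated data (all numbers invented): x has no effect on y at all, but a hidden z drives both, and the choice of “controls” decides what the table says.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# A hidden confounder z drives both x and y; x has NO direct effect on y.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = 2.0 * z + rng.normal(scale=0.5, size=n)

def ols_coefs(X, y):
    """Least-squares coefficients for y ~ X (with an intercept prepended)."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_naive = ols_coefs(x.reshape(-1, 1), y)[1]            # z omitted
b_adjusted = ols_coefs(np.column_stack([x, z]), y)[1]  # z controlled for
print(f"slope on x, z omitted:  {b_naive:.2f}")    # ≈ 1.6, looks like a big effect
print(f"slope on x, z included: {b_adjusted:.2f}")  # ≈ 0, the effect evaporates
```

Both regressions are “valid” by the usual formatting rules. Only one specification choice separates a headline result from a null, and nothing in the data tells you which omitted variables are still lurking.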

Oh well, moving on! In our contemporary language, we can sum up Meehl’s third bullet as there are always hidden confounders. How can there not be? I find it hilarious when people explicitly state in their papers that they are assuming there are no hidden confounders. I mean, I appreciate the candor, I guess. But I don’t believe them! As Meehl puts it:

“It's really hard to conceive of a thing we do in soft psychology that involves correlational stuff in which you could say with any confidence there isn't any other systematic trait of humans (or any other demographic thing about them like their social class origin, their race, their age, their sex, their religion or their political affiliation) that's going to be a correlate of one of the factors that we're plugging into our design.”

How can you disagree with that? And how can you prove without a shadow of a doubt that the things you didn’t randomize and control didn’t cause the observational relationship you are seeing? I’ll say more about this when I discuss the wonderfully named “crud factor.”

Finally, there’s the imperfect realization of the particulars. Here you see Meehl decrying the replication crisis in 1989, decades ahead of the crowd. We’ve touched on this before when discussing the context of discovery. Experimenter bias is always a worry and manifests in surprising, unintentional ways. There are always parts of the experiment that don’t explicitly appear in the text. We can partially fix this with reproducibility standards. Sharing data and code pipelines helps a lot. But if you really want to be sure that the experiment is valid as written, you have to reproduce it. As we’ve seen, most published studies are hard to reproduce experimentally.

Meehl argues that these first four obfuscators too often cast doubt on *good* theories. But how does he propose that scientists approach their research to prevent good theories from being abandoned? By being more rigid and logical? By doing better statistics? In the next posts, I’ll dig into two digressions Meehl takes in Lecture 6 in an attempt to infer his answers to these questions.


*This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” Here’s the full table of contents of my blogging through the class.*

Every facet of science, from physics to psychology, interfaces with computers. Our Meehlian derivation chains have software interlaced at every layer. This means that every aspect of scientific validity rests on software validity. I touched on this yesterday in the context of pure prediction, but let’s do a bigger-picture sweep today, going through the different clauses and thinking about how software comes in.

Software is clearly an auxiliary instrument (*A_I*). We always assume our computer is working, our copy of R or Python is stable, and our packages have no bugs. Unless you are a computer scientist, most of the code you use in an experimental analysis is an auxiliary instrument. The exception would be software written by other practitioners in your field. For example, if you’re a chemist using a Density Functional Theory package to compute some molecular interactions, the DFT code depends on the theory of your scientific field. In cases like this, software moves from auxiliary instrument to part of the auxiliary theory itself.

At the auxiliary theory level (*A _{T}*), we always defer some part of the scientific prediction or inference to code. If we build a complex model of climate or the universe, this rests on the simulation code of our theory being correct.

More subtly, if you have any model of errors in your derivation chain, you have infected your auxiliary theory with software. Error analyses are code-infected because we trust computers to do statistics. Computers are simply better than people at arithmetic and reading z-score tables.

So if you are “controlling for confounders” like age or gender in an experiment using a logistic regression model, that’s obviously a statistical error model to be solved by a computer. But even if you’re just assuming your errors are normally distributed, that’s a statistical model, and you’re going to use code to compute the standard error.

Software infects the ceteris paribus conditions because this is how we enforce null hypotheses. You encode ceteris paribus in software by generating randomization to remove confounding. The shape of the null hypothesis is also a ceteris paribus assumption, arguing the variation in outcomes in the experimental context we study is due to a particular kind of random variation. Calling a standard error robust or clustered or using random or fixed effects is an assertion of ceteris paribus. But anyone who has run such an analysis knows that if you toggle these sorts of models, you can turn a null result into a significant result.

Even the version of your code is a *C _{P}* assumption. You may get a different p-value or parameter estimate if you update your software or change your software package. The experimental validity only holds in one Docker container.

The experimental conditions involve software in multiple ways because software handles data entry, extraction, cleaning, manipulation, and loading. Bugs in any of these go in *C_N*. If you mess up a formula in your analysis, like not selecting an entire row in a spreadsheet when arguing for austerity politics, that’s an error in *C_N*.

There are now too many examples of scientific results being wrong because of coding errors somewhere in the experimental pipeline. The Reinhart-Rogoff Excel error was an egregious example with a simple fix. But I’ve looked at enough complex software stacks for experiments to know this isn’t an isolated example. Send me your code, I’ll find an issue in there somewhere.

Software removes abstraction boundaries from the Lakatosian defense. We can now introduce potential errors anywhere in our analysis stack. Are these errors due to errors in *C _{S}* or due to some other clause? How can you unpack them to fix your theory?

The situation is intractable because experiments aren’t bound by formal logical rules of correctness. In an experimental pipeline, it might not be possible to know if something is a “bug” or not. People tweak their software all the time. Sometimes, your analysis looks wrong, and this is legitimately because there is a bug in your code. But sometimes your analysis looks wrong, and, without malice, you introduce a bug in your code to make the outcomes look the way you expected.

Software forces you to “hack” without p-hacking. I’m not saying that people are misusing statistics and data dredging. I’m saying that writing a working pipeline is really hard, and requires iterating through many different tests and sanity checks to make sure everything is working properly. If you change a line of code because your plots looked wrong, and now they look right, did you fix a bug or introduce an error? How can you tell?

You might argue that you can unit test all of the little subcomponents and that as long as you follow these beautiful rules of analysis handed down from the replication-crisis zealots, you can never make a mistake. I don’t buy this for a minute.

Maybe you’d argue that you should preregister the entire software stack at the beginning, run your experiment, and then if the plots look wrong, abandon everything. Scientific rigor now dictates that you start over with an empty git repo and redo the entire experiment. That would be absurd.

The introduction of software to every aspect of science hence leaves us with extra heat for replication crisis arguments. We’ve accelerated science with computers, but we’ve also accelerated scientific doubt. All experiments depend on code, all code is wrong, and any time you look at a pipeline, you find mistakes. This lets us write more papers, metascientific analyses, and editorials. But software expedience might not help us accelerate much of anything other than infighting.

*This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” Here’s the full table of contents of my blogging through the class.*

Software scaffolds every part of our scientific process, whether for data acquisition, cleaning, analysis, or prediction. So, in updating Meehlian metatheory for 2024, we must adjoin a new class of theories to logical derivation chains: that our software is correct. I will call this Software Validity, *C_S*. As we all know in computer science (CS), the *C_S* clause is never exactly true: all code is wrong.

I want to take a few blogs to think through what it might look like to add *C*_{S} to a metatheory of science. I could probably write a longer paper about this, but my thoughts are still pretty unformed here. In the classical spirit of blogging, let me draft a few half-baked thousand-word notes to get some seedlings of ideas out there. Mark this as something I want to revisit in the future.

Let me start closest to home: on the role of machine learning models in contemporary science.

We learn in school that science and experiments are about inference and understanding of the laws of the universe. But Meehl’s reconstruction places *prediction* as the central goal of science. The understanding and inference parts happen when the predictions are wrong and scientists have to patch their theories.

Recall Meehl’s setup one more time:

(*T* ∧ *A_T* ∧ *C_P* ∧ *A_I* ∧ *C_N*) ⊨ (*O*1 ⊃ *O*2)

We have a logical conjunction that implies “O1 predicts O2.” And “in the absence of our theory, it would be surprising if you predicted O2 accurately from O1.” For Meehl, scientific validity is *only* about prediction. As he says in Lecture 2, those predictions need to be *remarkably detailed* and *in close accord with the facts*.

Meehl’s reconstruction of science notably doesn’t preclude arbitrarily complex models. Anyone who takes a science class learns that models with fewer parameters are better, but there’s never a justification for *why*. “Occam’s razor” or whatever. Even Meehl doesn’t pin down why people should or do prefer simpler models, which is one of the bigger holes in his presentation.

But I wonder if, when pressed, scientists really care about simplicity. People want models that can easily make lots of predictions. They want the predictions to be remarkably detailed and in close accord with the facts. When you had to do all of those calculations by hand, this required the models to be pretty simple. But if you have an NVidia Z28, you can quickly compute predictions from absurdly complex models that you could never even write down by hand.

I ask you, my reader: Would you prefer a simple theory that made vague predictions or a complex computerized theory that made remarkably detailed predictions? Based on the trends I see in science and engineering, revealed preferences strongly suggest the latter. Supercomputer simulations, digital twins, and massive machine learning systems have demonstrated that we can make remarkably detailed predictions that are in close accordance with the facts, even when we have too many parameters.

This quest for prediction makes us use software to extreme degrees. I offhandedly mentioned last time that “we’re happy if a billion-parameter model gets one prediction correct.” But it’s more true than false. We don’t care if we can get a giant curve fit with a bunch of non-fundamental parameters as long as it makes good predictions on something interesting. I call this sort of prediction “nonparametric.” And one of the hottest areas right now, “AI for Science,” is embracing nonparametric prediction as the key to accelerating discovery.

How do we fit nonparametric models into Meehlian derivation chains? First, I’m fast and loose with the term nonparametric because we all know it when we see it. Any model with parameters we don’t think are fundamental or reusable is nonparametric. In physics, Newton’s law of gravitation, the ideal gas law, and Planck’s law are all parametric models as the “physical constants” can be used in other theories. Indeed, the ideal gas law and Planck’s law both depend on the Boltzmann constant.

On the other hand, statistical models are almost nonparametric by definition. Even linear regression correction is nonparametric. Whenever someone is “controlling for covariates” or adding “fixed effects” or “random effects” to their regression models, they never care what the fit parameters are. They add these effects to argue about causation or to force a standard error to be small enough to receive asterisks in a table. Machine learning models, of course, are also inherently nonparametric. AlphaFold is a nonparametric science model. Not only do we not care what the values of the parameters are there, but we’re happy to change the values if we can explain more protein structure data.

Predictive models nestle their way into multiple parts of the derivation chain. First, we adjoin to our auxiliary theory the fact that some observation *O*2 is predictable from some observation *O*1. That means the auxiliary theory is “there exists some function *f* and some parameter values *v* such that *O*2 = *f*(*O*1;*v*).” We also adjoin the auxiliaries “this relation was true for the data we observed before our experiment” and “we reliably captured representative measurements of (*O*1,*O*2) pairs in the past.” Once we have these auxiliary theories, we can use software to do curve fitting, finding suitable values for the parameters *v*. We then predict new outcomes with the model. If there is reasonable accordance between our predictions and the new observations, we celebrate. Otherwise, we adjoin the new observations to our data pool and fit again.
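That fit-predict-refit loop can be sketched in a few lines. Here a low-degree polynomial stands in for the billion-parameter model, and a made-up `observe` function stands in for nature (both are my inventions for illustration, not anything from Meehl):

```python
import numpy as np

rng = np.random.default_rng(4)

def observe(o1):
    """Stand-in for nature: O2 is some unknown function of O1, plus noise."""
    return np.sin(o1) + 0.1 * rng.normal(size=o1.shape)

# Fit f(O1; v) to past data -- here a degree-7 polynomial whose coefficient
# values v we do not care about at all; only the predictions matter.
o1 = rng.uniform(-3, 3, size=200)
o2 = observe(o1)
v = np.polyfit(o1, o2, deg=7)

for step in range(3):
    # Predict new outcomes and compare them against what we actually observe...
    new_o1 = rng.uniform(-3, 3, size=50)
    new_o2 = observe(new_o1)
    mse = np.mean((np.polyval(v, new_o1) - new_o2) ** 2)
    print(f"step {step}: mean squared prediction error = {mse:.4f}")
    # ...then adjoin the new observations to the pool and refit:
    # Lakatosian defense by retraining.
    o1, o2 = np.concatenate([o1, new_o1]), np.concatenate([o2, new_o2])
    v = np.polyfit(o1, o2, deg=7)
```

Nothing in the loop ever asks whether the parameters *v* mean anything; the model earns its keep purely by predicting, and every failed prediction becomes more training data.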

It’s interesting here—foreshadowing the next post—how the software infects multiple parts of the chain. The core theory is that *O*2 is predictable from *O*1. Our auxiliary theory is that the data is fittable with a particular nonparametric model. Our auxiliary instruments are the software pipeline we use to fit the model. A ceteris paribus condition is that past data is representative of future data. An experimental condition might be the version number of scikit-learn. A core assumption throughout is *C_S*, that our software is bug-free.

Some criticize black-box predictive models as punting on scientific understanding, but we learn things when the predictions fail. Perhaps we find the process has a time-varying element we didn’t account for, and we need to refit the model every couple of weeks. Perhaps we discover a condition where the model doesn’t work, but we find we can patch the predictions by adjoining data from that condition. Any time we make our data corpuses bigger and our prediction software more complex, we’re engaging in Lakatosian Defense. And it’s hard to deny that this iterative process of statistical prediction works to some extent. If we make the datasets bigger or retrain more frequently, we end up with more detailed predictions that conform more closely with facts. Would we argue this is no longer “science” if we are amending our model with every new falsifier? No. As we’ve seen, this is central to the scientific method. If it predicts more facts, we’re going to stick with our program. This seems to be exactly what people want from “AI for Science.”

*This post digs into Lecture 5 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

It’s impossible for me to fathom, but the molecular theory of matter wasn’t widely accepted until the 1910s. Very famous, very smart scientists, including Ernst Mach and Henri Poincare, thought atoms were merely a convenient fiction for predicting experimental outcomes. Statistical calculations, like those deployed to derive the ideal gas laws from kinetic theory, were akin to approximating integrals with discrete sums. Since bulk media obeyed differential equations, many thought it was more likely that they were continuous substances. That nature was an assemblage of a nearly infinite collection of invisible, discrete billiard balls seemed rightfully outlandish.

The controversy was effectively put to rest by Jean Baptiste Perrin in his 1913 book *Les Atomes*. Perrin spent hundreds of pages detailing the experimental evidence for atoms and molecules. The core of his argument was that if you assume molecules exist, you could *count them* in surprisingly diverse ways.

Again, it’s difficult to take my 21st century brain and imagine the mindset of a nineteenth century chemist, but scientists had somehow accepted the unit of a mole before they accepted the theory of atoms. The person who coined the term mole, Nobel Prize Winner Wilhelm Ostwald, was another famous atomic skeptic. How could you believe in moles but not molecules? I suppose it’s not that crazy. Different substances had different “molecular weights,” meaning that the same “amount of stuff” could weigh different amounts. You could observe nitrogen and hydrogen balloons at the same temperature, pressure, and volume; one would rise while the other would not. Oil floats on water. Gold is heavier than lead.

Perrin argued a mole always had the same number of molecules. This number, *N_A*, is called Avogadro’s number.

Perrin presented a derivation of *N_A* from Brownian motion, assuming a macroscopic particle was bombarded by microscopic particles in a fluid. This bombardment caused the macroscopic particle to randomly move around. The predictions of Brownian motion would rely on some number of molecules in any given unit volume of the fluid. Balancing the predictions of Brownian motion against the effect of gravity gave the number of water molecules in a small volume. Extrapolating out provided an estimate of *N_A*.

Perrin described a derivation of Einstein that predicted the color of the sky by analyzing the statistical mechanical scattering of light by air particles. This derivation relied on an estimate of the number of particles in a given volume of air. Perrin derived an estimate from the theory of black body radiation, using its calculation of Boltzmann’s constant and applying the fact that the “*R*” in the ideal gas law was equal to Avogadro’s number times Boltzmann’s constant.

Perrin also used properties of alpha decay to compute an estimate. Radium emits alpha particles, which combine with electrons to produce helium. Perrin knew the rate of alpha particle decay, which yielded a prediction of the number of helium atoms. He then compared this number to the amount of helium produced to get yet another estimate of *N_A*.

Electrochemistry determined the charge required to deposit one mole of silver onto an electrode through electrolysis. This was called a Faraday and had value *F*. Assuming it costs one electron to deposit one atom, then the total number of atoms deposited should be *F*/*e* where *e* is the charge of the electron. Millikan had recently computed an estimate for *e*, and hence this gave another path to estimating Avogadro’s number.
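Two of these convergences are easy to replay with modern constant values (these are today’s rounded CODATA-style numbers, not Perrin’s 1913 measurements):

```python
# Two Perrin-style routes to Avogadro's number, using modern constant values.
R = 8.314        # gas constant, J/(mol·K)
k_B = 1.381e-23  # Boltzmann constant, J/K
F = 96485.0      # Faraday constant, C/mol
e = 1.602e-19    # elementary charge, C

N_A_from_gas = R / k_B          # ideal gas route: R = N_A * k_B
N_A_from_electrolysis = F / e   # electrolysis route: F = N_A * e

print(f"from R / k_B: {N_A_from_gas:.3e}")
print(f"from F / e:   {N_A_from_electrolysis:.3e}")
```

Both routes land on about 6.02x10^{23}, which is exactly the kind of convergence from independent derivation chains that Perrin tabulated.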

The conclusion of *Les Atomes* contains a remarkable table. Perrin lists his estimates of *N _{A}* from 13 different derivations.

No matter which he chose, the number was always around 6x10^{23}. The fact that all of these different calculations gave the same answer was indeed a damned strange coincidence. After Perrin’s work, almost everyone (except Mach2) conceded that matter was composed of atoms and molecules. This was in 1913! Look where we are today.

Let’s take a minute to explore how this example fits into Meehl’s Lakatosian Defense framework and why Meehl spent so much time on Avogadro in Lecture 5. Here we have a single theory: “A mole is a constant number of molecules.” Adjoining this theory to a variety of different derivation chains gives the same value for *N_A*. The results are too close to each other for that to have happened by pure chance, and hence the theory is corroborated.

Well, except it’s not *really* that clean. First, it’s hard to precisely say what is the probability here. What does it mean that the probability that Perrin’s calculations all gave the same *N _{A}* was exceptionally small if atoms weren’t real? What is p(atoms)? The atomic skeptics were clearly very surprised by Perrin’s results. Wesley Salmon quotes Poincare as exclaiming, “How can you argue, if all of the counts come out the same?” He announced, “Atoms are no longer a useful fiction; things seem to us in favor of saying that we see them since we know how to count them.”

Still, I don’t think we can make this “probability” notion too formal. Meehl chirps against hypothesis testing here. It’s certainly not the case that physicists ran some F-test or something on Perrin’s table to convince themselves of the result. It certainly wasn’t the case that scientists at the time ground out some Bayesian confirmation calculation, as Bayesian statistics wouldn’t be formulated for another twenty years. Salmon tries to construct a post hoc “common cause” formalization of atomic confirmation, but I am not at all convinced by his handwavy argument.

Meehl argues it’s the specificity of the predictions that convinced everyone. I suppose we could reverse engineer a Meehlian Spielraum argument here. Suppose Perrin had instead assumed that Avogadro’s number was 6x10^{23}. Then he’d get estimates of the charge of the electron, Boltzmann’s constant, the decay rate of radium, and a dozen other physical constants, all within 25% of their known values. Perrin’s atomic theory would be predicting a remarkably narrow range of the Spielraum over dozens of experiments. Again, this isn’t what Perrin did, but I can imagine that if he had presented his results this way, people would have been just as convinced.
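A back-of-the-envelope version of that reverse-engineered argument (my own sketch, using modern macroscopic constants; Perrin’s inputs were rougher):

```python
# Fix Avogadro's number and read off other physical constants.
N_A = 6.0e23   # assumed: molecules per mole
F = 96485.0    # C/mol, Faraday constant (macroscopic electrochemistry)
R = 8.314      # J/(mol K), universal gas constant (macroscopic thermodynamics)

e = F / N_A    # implied charge of the electron
k_B = R / N_A  # implied Boltzmann constant

print(f"e   ~ {e:.2e} C")    # ~1.6e-19 C, close to Millikan's measurement
print(f"k_B ~ {k_B:.2e} J/K")  # ~1.4e-23 J/K
```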

And obviously, the fact that all of Perrin’s estimates rested on a single number made the whole theory too good to ignore. A friend joked this weekend that these days, we’re happy if a billion-parameter model gets one prediction correct. But we’re certainly much happier if one parameter yields a billion predictions.

Indeed, the atomic theory would continue to predict the outcomes of endless experiments. It would prove critical in understanding X-rays and estimating their wavelengths. And once the atomic theory became entrenched, it would be guarded by Lakatosian Defense. When Bäcklin and Bearden made measurements that found too high a value for X-ray wavelengths, this suggested an error in the estimates of Avogadro's number. Prins would write to *Nature* to correct the record: “the usual diffraction formula needs a correction when applied to X-rays under the usual experimental conditions.” Atomic theory was now far too useful and had to be defended at all costs.

*If you want to get more into the weeds of atoms, check out Wesley Salmon’s *Scientific Explanation and the Causal Structure of the World*. The blog here pieces together Meehl’s lecture, Salmon’s argument, and my own reading of Perrin. Maybe I got too into the weeds here.*

1

By modern standards, Avogadro’s constant *defines* a mole. The definition has been completely reversed: a mole was first defined by a volume of gas, but is now *defined* to be *N _{A}* molecules.

2

As Meehl says, Mach died in 1916, but maybe would have changed his mind had he lived a bit longer.

*This post digs into Lecture 5 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

Since I spread the argument out over the last post, let me tidily summarize Meehl’s metatheory of Lakatosian Defense. An experimental outcome is deduced from a collection of scientific assertions in a derivation chain. The derivation chain uses clauses from the core theory (*T _{C}*), assorted auxiliary theories from the discipline (*A _{T}*), instrumental auxiliaries from outside it (*A _{I}*), a ceteris paribus clause (*C _{P}*), and the particulars of the experimental conditions (*C _{N}*). Together, these logically imply a prediction of the form “If *O*1, then *O*2,” where the probability of *O*2 given *O*1 is some small value *p _{E}*.

If we observe *O*1 and *O*2 in our experiment, the low probability statement corroborates our theory. The smaller *p _{E}*, the more the theory is corroborated. If we observe *O*1 but not *O*2, the conjunction on the left-hand side of the derivation chain is falsified.

Today, I want to cast the examples I’ve listed so far in this series as stages of Lakatosian defense to illustrate why this framework elucidates the role of falsifiers in scientific development. In the next post, I’ll discuss corroboration.

Meehl’s description of the controversies in latent learning shows how arguments about experimental conditions alone can keep scientists busy for years. He described squabbles about the right way to handle rats before setting them loose in a maze. It mattered how you carried a rat from the cage to the maze. If you held the rat by the tail, it would be less likely to cooperate than if you had let it gently sit on your forearm while petting it. If you gave the rat a little food in advance, it might perform better than if it went in hungry. Every tiny adjustment mattered. Whichever side of the latent learning debate you were on, you could criticize the other side by arguing about *C _{N}*.

Wars about experimental conditions, as Meehl says, are usually focused on replication. It is replication in the narrowest sense here:

“You are not denying the facts. You’re denying that something *is* a fact.”

You attack *C _{N}* by asking if someone can recreate the experimental conditions and reproduce the reported outcome.

Ceteris paribus clauses are so general that fields can spend decades fighting about them. Here, we ask if something *outside of the derivation chain* is responsible for the experimental outcome. Ceteris paribus clauses assert that we assume the derivation chain is true “everything else being equal.” But of course, such a bold statement is never literally true. We can’t really control everything in an experiment. The question is always whether we have controlled things enough.

Most of the arguments about “DAGs” in causal inference are attacks on *C _{P}*. If an unspecified confounding variable causes both the treatment and the outcome, the observed association tells you nothing about the causal effect under study.

But the ceteris paribus clause is even more general than this. *C _{P}* is where we store all of our idealizations. You assert that certain things you know are true are nonetheless irrelevant to the outcome under the prescribed experimental conditions. Since you know these idealizations might be false, you can use them to explain away a potential falsifier.

Meehl emphasizes that in his characterization, instrumental auxiliaries are only those outside of the discipline. But where you draw disciplinary boundaries can be tricky. In the example of Eddington fudging his analysis of light deflection, are telescopes inside or outside the theory of astrophysics? I might argue that the telescopes and photographic plates are governed by terrestrial physical concerns, not grand theories of cosmology.

What about Dayton Miller’s observations of aether drift? While many people questioned the functionality of the Michelson-Morley interferometer, Miller’s apparatus was harder to attack because Miller was such a careful experimentalist. In the end, the results were explained away as thermal artifacts. Was this a violation of ceteris paribus or a problem with the instrument? I suppose we could say it’s both.

I bring this up because I want to talk about software and statistics, which messily infect all of the clauses in the Lakatosian Defense. I’ll say more about this in a future post.

The final stand of a Lakatosian Defense attacks the theoretical auxiliaries of a core theory. As I mentioned, *adding* auxiliary theories is a common part of Lakatosian defense. We let ourselves explain facts by adding conditional characterizations of when certain approximations are valid. But *removing* auxiliary theories—declaring them false—is much more rare. In fact, I’m hard-pressed to find good examples, though I’m probably just not thinking hard enough as I write. For what it’s worth, Meehl doesn’t give any clean examples of experiments messing with theoretical auxiliaries directly. If you have any fun examples of attacking auxiliary theories, tell me in the comments!

The reason it’s hard to come by attacks on auxiliaries is that removing them messes up all of your past results. Removing an auxiliary will not only invalidate a bunch of past derivations, but it will also cause your theory to disagree with past experiments. You’ll have to explain that away, too. Meehl argues that scientists will trade some contradictions with the past for a bunch of new damned strange coincidences in the future, but he doesn’t give any examples. I’ll try to pin this all down in Lecture 6 when discussing Lakatos’ notion of the “Hard Core” and “Protective Belt” of a theory. But before we get there, I want to get into the second part of Lakatosian Defense, describing what happens when experiments corroborate your theory.


*This post digs into Lecture 5 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

In Lecture 5, we get Meehl’s elaboration of “Lakatosian Defense.” This is the meat of the course and the core of what resonated with me. Meehl’s Lakatosian Defense is a tight metatheory that closely characterizes what I think most people mean by “the scientific method.” I plan to spend the rest of the week mulling over the many implications. I’ll discuss how Lakatosian Defense encapsulates everything weird we’ve seen so far in the course. I might even add some examples of my own. Unpacking the implications will require some careful meditation.

What are we actually testing when we test a theory? Let’s start with the naive caricature. We design experiments by creating a chain of deduction from the theory to a predicted experimental outcome. In the language of logic, we have a theory, *T*, and deduce an experimental outcome, *O*.

If the prediction pans out, we say the theory is corroborated. But if it doesn’t pan out, the theory is falsified. However, as we’ve been describing, this is far too simple a view to accord with the history of scientific practice. Meehl shows, however, that it’s not too far off. We just have to adjoin a bit of complexity to fully flesh out the scientific method.

First, Meehl presents a slight elaboration of what counts as a prediction. An experimental outcome is a material conditional “If I see *O1*, then I will see *O2*.” He writes this using the logical notation as “*O1*⊃*O2*.” This seems like a fairly reasonable starting point.

Let’s now turn to the theory “T.” An actual deduction chain from theory to this experimental prediction is a complicated complex of statements that Meehl breaks into five subsets:

- **T**_{C} — The core theory to be tested. I’ve added an extra subscript C here that Meehl doesn’t use to denote “core.”
- **A**_{T} — Auxiliary theories involved in the derivation chain. These are all of the sorts of statements you use on top of your core theory. *Anything* you use to map the constructs of the core theory to the observables would count as an auxiliary theory. In particular, this includes the idealizations you know are wrong, whether idealizations of first principles or boundary conditions.
- **A**_{I} — Instrumental auxiliaries, defined to be any auxiliary that’s not directly in the scientific field. If you’re running a mouse experiment, the mechanics of the lever the mouse uses to get food is an instrumental auxiliary. In medicine, you’d probably say that imaging devices or lab tests are instrumental auxiliaries. An unavoidable auxiliary in every experimental setup is software, whether for data storage, cleaning, analysis, running experimental protocols, or really anything else. Woo boy, is software a problem. Flag that for later.
- **C**_{P} — The *ceteris paribus* clause. Ceteris paribus is just the “all else being equal” clause we apply logically to an experiment that allows some notion of transportability of the results. What exactly we’re holding equal is not always clear. This statement asserts experiments are sufficiently controlled so that no unspecified outside factors influence the experimental observations.
- **C**_{N} — The particulars about the experimental conditions. This is a bit trickier to pin down but contains all of the ways a particular lab runs an experiment that are not cleanly bucketed into auxiliary theories. For example, Meehl described how rat handling could influence the outcome of latent learning experiments. It’s these sorts of conditions that maybe aren’t as cleanly specified that get lumped into *C _{N}*.

Putting everything together gives us this logical representation of a scientific derivation chain:

The conjunction of the core theory, the auxiliary theories, the instrumental auxiliaries, the ceteris paribus clause, and the experimental conditions logically implies “If *O1*, then *O2*.” A scientific prediction is thus a logical *deduction* from the set of theories on the left-hand side to the material implication on the right-hand side.
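In symbols (my transcription of the statement described in words above; Meehl’s own blackboard notation may differ slightly), the derivation chain is:

```latex
\left( T_C \wedge A_T \wedge A_I \wedge C_P \wedge C_N \right) \rightarrow \left( O_1 \supset O_2 \right)
```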

It might seem like we haven’t done much in this development, but we can now neatly explain why falsification is so tricky. When you see a lot of *O1*, but not much *O2*, the right-hand side is false. We can now apply modus tollens and declare the left-hand side false. But what did we falsify? We didn’t falsify our theory *T _{C}*. We falsified a messy conjunction. The falsification of the conjunction just means that at least one of its conjuncts is false.
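To make the underdetermination concrete, here is a toy enumeration (my own illustration, not Meehl’s): with five conjuncts, a single falsified prediction is logically compatible with dozens of distinct failure patterns.

```python
from itertools import product

# The derivation chain is (T_C and A_T and A_I and C_P and C_N) -> (O1 -> O2).
# Observing O1 without O2 falsifies the conjunction, but any truth assignment
# with at least one false conjunct is consistent with that observation.
labels = ["T_C", "A_T", "A_I", "C_P", "C_N"]
failure_patterns = [
    [name for name, ok in zip(labels, assignment) if not ok]
    for assignment in product([True, False], repeat=len(labels))
    if not all(assignment)  # the conjunction is false
]
print(len(failure_patterns))  # 31 ways the chain can fail; only one blames T_C alone
```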

We first go after experimental conditions. This is where we demand replication. What if my rival scientist did something screwy in the lab? If the result replicates, perhaps we can attack ceteris paribus, finding some other unspecified cause that explains the outcome. We could attack the instruments, claiming there was a bug in the software, the wrong application of statistics, or a heat artifact in the measurements. We could attack our idealizations, creating a longer derivation chain that explains away the potential falsifier. We could attack other auxiliary theories. There are so many things we can defend and rationalize before we ever decide our theory is wrong.

But if we can continually churn out experimental results, why would we abandon our theory? This is the one missing piece in the scientific method as presented thus far: It’s not enough to imply obvious experiments. We need our theory to generate novel and surprising facts. To account for this, Meehl adds one last piece to the implication: we must derive experimental outcomes such that, given our background information, the probability of *O2* conditioned on *O1* is small. That is, our theoretical derivations *must result in Damned Strange Coincidences*. In this case, when we observe *O2* given *O1*, the theory is corroborated.
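The riskiness requirement, again in my own notation, amounts to:

```latex
\Pr\left( O_2 \mid O_1,\ \text{background knowledge} \right) \le p_E, \quad p_E \text{ small}
```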

This completes the picture. Scientists derive clever experiments. They test their predictions. When the predictions pan out and are Damned Strange Coincidences, they give Ted talks. When they don’t pan out, they attack their rivals.

Lakatosian Defense gives us a rational justification of scientific irrationality. You can already see how scientists could continue an infinite regression (also known as a career) of theory-building and experimenting while never abandoning their core superstitions. There is no end to fighting about experimental conditions, looking for hidden causes, or creating monstrous theories with unbounded free parameters. But it’s “rationally justified” as long as it’s producing new facts. Science is a paperclip maximizer.


*This post digs into Lecture 4 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

I’ll use the term boundary conditions to describe the various particulars of the world that go into some predictive calculation. These are our assumptions about entities, like their mass, speed, or even existence, that we plug into our theory when we make predictions. For the ideal gas in a chamber from the previous post, we needed to measure temperature, volume, and the amount of gas to predict the pressure on a piston. But why only these quantities? Why did we not need to consider the particularities of the piston we used to measure the pressure? Did we need to consider the composition of the walls of the chamber?

Boundary conditions are always idealizations because of omission. Such omissions are necessary because we have a finite amount of computation speed and memory. Moreover, we begin any prediction with the limited evidence collected and tabulated by our predecessors. Now, when outcomes don’t agree with our predictions, hence falsifying the theory, could we blame the failed prediction on omitted evidence? That is, can we blame it on the boundary conditions?

A favorite story of philosophers of science (in particular of Imre Lakatos) is the discovery of Neptune. French astronomer Alexis Bouvard published observations of the motion of Uranus that seemed to violate Newton’s Laws. This meant that either Newton’s Law changed the further you moved from the sun or there was an unaccounted-for massive body in the solar system. Positing the latter in 1845, Urbain Le Verrier predicted the location of the missing planet and sent his predictions to the Berlin observatory. Lo and behold, there was Neptune, within a degree of where Le Verrier said it would be.

This is again a remarkable corroboration of the theory, made by predicting that an idealization was incorrect. Note that there were two main choices here: adjusting the theory to account for distances (like what van der Waals did) or keeping the theory as is and predicting the facts were wrong. But in either case, Newton’s Laws were never abandoned because physicists were so enamored with the elegance of their theory.

This Neptune example may have had too much influence on astrophysics. Now, when the laws of gravity seem screwy in our telescopes, we just imagine there’s something else out there we haven’t seen. Since the 1960s, we’ve observed countless aberrations in our measurements that imply either our telescopes are broken, general relativity is wrong, or there is a vast amount of matter out there that we can’t see. Since we can’t see it, we might as well call it dark matter. The current consensus is 95% of the energy content of the universe is made up of “dark stuff” (see also CERN). I dunno, that seems like way too many unobserved Neptunes to me. But I guess adding parameters to keep the old model on the books is easier than starting over from zero.

All of the nice examples come from physics, but there seems to be a generalizable story emerging here. Every scientific field starts with some simple core ideas (what Lakatos calls “the hard core”). Meehl argues that this core will be present in most derivation chains associated with a given theory. For the initial breakthroughs in the field, that hard core alone will neatly and quickly predict the outcome of several experiments.

But then we’ll start to pile up experiments that don’t agree with the theory. Rather than abandoning the hard core, we add some other theories to patch things. As any good machine learning scientist knows, if you add enough parameters, you can explain almost anything. In the kinetic theory of gas model, we added substance-specific parameters to handle low volumes and temperatures. We give every gas a couple of degrees of freedom and are still able to manipulate pneumatic systems as long as we account for the extra parameters. In the Neptune example, we predicted an unseen planet. We can predict a lot of unseen stuff in the universe and get a cosmology of dark matter.

We add complexity to the theory to protect the core. Longer derivation chains. More post hoc parameters. Longer computer simulation code. More complex statistical analysis. Expansive mathematical proofs. Everything gets longer and more complex. But we never give up on some fundamental principles in the hard core. Physicists never give up on conservation of energy. They don’t give up on general relativity. Even though they keep collecting countless observations that falsify their theory, we patch things up by adding a few degrees of freedom here, there, everywhere. These extra explanations are what Lakatos calls “the protective belt” of a theory.

But the computational crud gets you in the end. At some point, these calculations stop being, dare I say, useful. Astrophysics doesn’t get called out because it doesn’t matter if there’s dark energy or modified Newtonian dynamics or aliens playing games with our telescopes. None of this will help us build better computers or launch more satellites. Without that turn to practice, sciences can go and chase whatever weird facts they want to chase until governments stop funding their supercomputers or supercolliders.

But when you turn to practice you can see all sorts of predictions becoming intractably complex. My favorite example, Nancy Cartwright's Perturbation of Galileo1, is again from physics. Surely, if we drop a bowling ball off the Leaning Tower of Pisa, we can predict where it will land and how long it will take to get there. You’ll take Newton’s laws and a simple air resistance model, and you’ll get a good prediction. But what can you say about a euro bill dropped off the same tower? How accurately can you predict where that bill will land or how long it will take?

Think about what you’d need to compute this prediction using Newton’s Laws. It’s impossible. The dynamics of this system aren’t technically chaotic, but you’d still need unfathomable precision about the initial conditions of the bill and molecules in the air. I never know what we're supposed to take away when someone argues for a simulation that needs more FLOPs than atoms in the universe.

I highlight this to flag something that Meehl doesn’t discuss. Verisimilitude, that is, approximation of truth, assumes that you can compute all possible derivation chains of a theory. But part of verisimilitude is how much computation is required to approximate truth. Part of what makes theories useful is not just the ability to make accurate risky predictions in theory but to make accurate risky predictions quickly in practice. This prediction efficiency strongly influences whether we use quantum mechanics, classical mechanics, statistical mechanics, fluid dynamics, or just logistic regression. What if expedience is actually essential to truth?

Meehl ends Lecture 4 and begins Lecture 5 with nomological nets. These were proposed by Cronbach and Meehl in their development of construct validity. At a high level, there are three parts of a nomological network: (1) There are six scientific concepts (substances, structures, states, events, dispositions, and fields). (2) A theory is built by defining concepts and linking them together (through statements about composition, dynamics, or history). (3) These links form a graph, and the graph is scientific only if the leaf nodes are observational. I haven’t figured out how to work nomological nets into this series just yet, but will try to expand upon them in more detail when they are referenced in later lectures.
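A minimal sketch of how one might represent a nomological net in code (my own toy encoding, not Cronbach and Meehl’s formalism): concepts as nodes, lawful links as edges, with a check that the net bottoms out in observables.

```python
# A toy nomological net as a directed graph. Node names and links are
# invented for illustration. The net counts as "scientific" only if every
# leaf (a node with no outgoing links) is observational.
net = {
    "anxiety (disposition)": ["heart rate (event)", "questionnaire score (event)"],
    "heart rate (event)": [],
    "questionnaire score (event)": [],
}
observational = {"heart rate (event)", "questionnaire score (event)"}

leaves = [node for node, links in net.items() if not links]
print(all(leaf in observational for leaf in leaves))  # True for this toy net
```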

1

From Chapter 1.1 of The Dappled World.

*This post digs into Lecture 4 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

I cringe every time someone drops George Box’s overused aphorism “All models are wrong, but some are useful.” This quip gets thrown around to defend the worst sorts of scientific and engineering practices. But the thing is, when I read Box in context, I agree with everything he writes. Today, let me write a defense of Box, and describe what he actually meant. It turns out that Box was making the same point about modeling, feedback, and refinement of theories that Meehl picks up in Lecture 4.

Every scientific theory is indeed technically false. A theory is a complex of statements, and if any one of them is false, their conjunction is necessarily false. You’ll always be able to find one statement in any theory that’s not precisely true. If you only use a 3-significant-digit point estimate of some mass in a formal derivation chain, your conclusion is technically false according to the logician. But such logical pedantry is annoying. Some of these calculations let us do stuff, like build things or run experiments. They might be logically wrong, but they are more than good enough for our purposes.

Meehl says that scientific theories that are good enough have some relative accordance with truth. Different theories have different degrees of truth. The approximation of truth is what Popper called *verisimilitude.* I’ll use this term even though I’m not particularly concerned with what “truth” is. As Meehl points out, there’s a difference between Instrumentalism and Realism. If you’re an Instrumentalist, you validate theories based on their utility for prediction and control. If you’re a Realist, it’s the theories themselves that you care about.

Since I’m an engineer, this blog series takes a decidedly instrumentalist interpretation of Meehl. For the engineer, verisimilitude is your estimate of how widely—dare I say—*useful* a theory is for practical ends. We want to do things with scientific theories. We want to make predictions, cause things to happen, or build things.

What do we do when the predictions don’t quite line up with what the theory says? The theory is now not just literally false, but practically false too. Do we throw everything away and start from scratch? That’s impractical for the instrumentalist and aesthetically displeasing for the realist. It seems expedient to look for ways to patch the theory up rather than throwing the whole thing in the trash can.

To patch a theory, we have to go to the statements and look at which ones to change. The nice thing about scientific theories is we *know* in advance that some of the statements are literally false. We know because we deliberately added them to the theory with full knowledge they were false. Deliberate false assumptions in a theory are called *idealizations*.

Meehl gives two examples of idealizations that illuminate their value. Scientific theories include literally false statements that simplify derivation chains or yield simple rules. I’ll call these *idealizations of first principles*. Theories also include literally false statements about the particulars of entities in the derivation chains because we only have collected a finite amount of information about the universe. I’m going to call these *idealizations of boundary conditions*. Let me now use Meehl’s examples to show how both can be adjusted to patch up false theories to make them more truthy. And in doing so, we’ll see how the dynamic feedback between theory building and experiment doesn’t admit a clean, logical set of rules. I’ll describe idealizations of first principles today and idealizations of boundary conditions in the next post.

Meehl and Box have the same favorite example of a model that is wrong yet useful: the ideal gas law we all learn in high school. The development of this law nicely illustrates idealizations and the iterative feedback loop of theory building.

If you have a gas in a chamber and you compress it with a piston, the pressure the gas exerts on the piston is related to the volume of the gas, the temperature of the gas, and the amount of the gas. You get the famous formula:

That is, PV = nRT, where P is pressure, V is volume, T is temperature, n is the number of moles of gas, and R is a universal constant. This law was initially derived from experiments where two variables were manipulated and the others held constant. But in the 1850s, physicists figured out how to derive this law from basic kinetic interactions of trillions and trillions of individual gas particles. This derivation connected microscopic classical mechanics to macroscopic thermodynamics: tiny particles bouncing against each other would manifest themselves in the properties we measure as heat or pressure.
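A minimal numeric sketch of the law (SI units; the 24.8-liter chamber is just a convenient round number, not from Meehl or Box):

```python
R = 8.314  # J/(mol K), universal gas constant

def ideal_gas_pressure(n_moles: float, temp_k: float, volume_m3: float) -> float:
    """Pressure in pascals predicted by PV = nRT."""
    return n_moles * R * temp_k / volume_m3

# One mole at room temperature in a 24.8-liter chamber: about one atmosphere.
p = ideal_gas_pressure(1.0, 298.0, 0.0248)
print(f"{p:.0f} Pa")  # roughly 1.0e5 Pa
```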

The kinetic theory relied on several idealizations. Two central ones were

Particles are “point masses” so small that you can neglect their size.

There are no attractive forces between the particles.

Both of these were known to be literally false, even in the primitive molecular theories of the day. The particles, though very very tiny, were unlikely to have literally zero volume. And they clearly had mass, so they must attract each other gravitationally. These assumptions helped simplify the calculations. As you wrote out the math, you’d see a term that would depend on the radius of the particle and consider it too small to influence the downstream calculations. With these simplified equations, you grind out some calculations and, lo and behold, find PV = nRT.

The ideal gas law fits the data well over a wide range of Ps, Vs, and Ts. But it breaks down at small volumes and low temperatures. Now, if physicists had just applied the hammer of Popperian modus tollens, they’d have to throw the whole theory away at this point. The new experiments had falsified the theory. But that seems silly. If a theory is good over vast ranges of Ps, Vs, and Ts, maybe we can figure out a way to patch it in the regions where it’s not so good. We don’t consider the theory dead. We don’t even consider the theory mostly dead. We just consider it injured and in need of a band-aid.

Adjusting the kinetic theory calculations to account for the incorrect idealizations gave the needed fix. Van der Waals showed that by allowing the molecules to occupy a non-negligible volume and removing some pressure due to pairwise molecular attraction, you could get a correction that would match the data:

The corrected equation reads (P + an^{2}/V^{2})(V − nb) = nRT. Here, Van der Waals added two new constants, “a” and “b.” When the correction terms involving a and b are negligible, this formula is more or less PV=nRT again. It is only at small volumes and low temperatures that a and b play a role.

The constants a and b were not universal in the way “R” was. They were properties of the associated gas and had to be fit on a case-by-case basis. But Van der Waals’ formula, with the additional free parameters, gave the right predictions in a wider region than its uncorrected version.
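To see the correction at work, here is a sketch comparing the two formulas (the a and b values are roughly those for carbon dioxide, chosen purely for illustration):

```python
R = 8.314    # J/(mol K), universal gas constant
a = 0.364    # Pa m^6 / mol^2, attraction correction (approx. CO2)
b = 4.27e-5  # m^3 / mol, excluded-volume correction (approx. CO2)

def p_ideal(n, T, V):
    return n * R * T / V

def p_vdw(n, T, V):
    # Van der Waals, solved for pressure:
    # (P + a n^2 / V^2)(V - n b) = n R T
    return n * R * T / (V - n * b) - a * n**2 / V**2

n, T = 1.0, 300.0
for V in (1e-1, 2e-4):  # a roomy chamber vs. a highly compressed one
    gap = abs(p_vdw(n, T, V) - p_ideal(n, T, V)) / p_ideal(n, T, V)
    print(f"V = {V:g} m^3: relative gap {gap:.1%}")
# At large V the two laws agree to a fraction of a percent;
# near the excluded volume they diverge wildly.
```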

Adjusting the known false statements in the idealizations of first principles to be more truth-like gave better predictions over a wider range of outcomes. This wider applicability came only with the expense of uglier formulas and more difficult derivation chains. Experiments that had technically falsified the ideal gas law ended up corroborating the kinetic theory of gas. By changing what they knew was false, physicists ended up with a better prediction, and that’s a damned strange coincidence.


*This post digs into Lecture 3 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.*

You’ve made it to part five of Lecture 3, so you can probably already guess how indirect costs and neoliberal university rent-seeking shape research and its associated literature. But perhaps we can share a catharsis in one final reckoning with the context of discovery.

As Meehl puts it, “You don't know, when you look at what has surfaced, to what extent the experiments in a given domain came to be performed because of the financial and other pressures upon academicians.” Explicitly or implicitly, academic scientists have to raise funds to do their research, and such funds are scarce. Consequently, faculty will work on only what they think can be funded.

As any young investigator knows, federal funding has become competitive. Proposals at the National Science Foundation or the National Institutes of Health are fiercely reviewed by other scientists and infrequently awarded. This review process means that a scientist might choose to play the odds, jumping on every academic trend and throwing their hat in every call for proposals they come across. It also means that scientists must spend their time marketing their ideas to convince their peers their work deserves one of the few rare, prestigious grants.

Meehl argues that funding scarcity also shapes the sorts of projects that people propose, forcing faculty to favor expedience over curiosity. Scientists are compelled to choose the cheapest path to a result. This means that the particular method appearing in a paper is frequently the *cheapest,* not the most *scientifically appropriate*.

Well-funded projects come with their own special problems. If a sponsored project grows too large in cost, it becomes too big to fail. If you run some field study with hundreds of staff and hundreds of thousands of participants, the massive expenditure compels you to find some evidence that your intervention did what you said it would do. Does this mean that investigators write up the results of big projects in a way to save face? Does this mean there are incentives to continue to look for evidence of results that aren’t quite there? You’ll have to be the judge when you read such papers.

And what about projects that go against a party line? Are these less likely to be funded? Meehl recounts witnessing overly zealous scrutiny applied to edgier proposals that were out of favor, and he observes that scientists "feel they've got to research what the bureaucrat in Bethesda wants researched." I know many still believe this to be true. Even such circumstantial evidence adds doubt to one's assessment of the scientific literature.

Meehl doesn’t discuss this, but we obviously run into similar problems with non-governmental funding sources. Gifts from philanthropists are targeted at pet causes. Gifts from industry are dependent on industrial interests.

You might think that we should look outside the academy for less harried investigations, but industrial research, which has been growing steadily in computer science for the last decade, has its own biases. There is unquestionably a filter on the questions asked by researchers who work in industry. Industrial papers have to pass internal corporate review before being published. There have been notable blow-ups of people getting fired from industrial labs for not toeing party lines.

Now, patronage has always been part of science, but there is something particularly pernicious about our contemporary model built around constant, vicious competition. As I mentioned in passing, the constant competition with peers for scarce funds means scientists are constantly marketing, and this mindless scientific marketing may be the most damaging aspect of all of this.

Every proposal, paper, and presentation becomes a marketing promotion. The reader has to work through a startup pitch before getting to the main findings. If a clinician or practitioner knows that every publication is a sales document, they have to read every result with heightened criticism and suspicion.

David Graeber points to this marketing, which has "come to engulf every aspect of university life," as a primary source of stifled innovation. In his essay "Of Flying Cars and the Declining Rate of Profit" from his 2015 collection *The Utopia of Rules,* he asks why progress in science seems to have slowed since 1970. In academia, he calls out marketing as a central pernicious force:

“There was a time when academia was society's refuge for the eccentric, brilliant, and impractical. No longer. It is now the domain of professional self-marketers. As for the eccentric, brilliant, and impractical: it would seem society now has no place for them at all.”

Graeber concludes that when scientists spend their time marketing, competing with their peers, and choosing expedience over curiosity, we end up in a world of scientifically overproduced incrementalism.

“That pretty much answers the question of why we don’t have teleportation devices or antigravity shoes. Common sense dictates that if you want to maximize scientific creativity, you find some bright people, give them the resources they need to pursue whatever idea comes into their heads, and then leave them alone for a while. Most will probably turn up nothing, but one or two may well discover something completely unexpected. If you want to minimize the possibility of unexpected breakthroughs, tell those same people they will receive no resources at all unless they spend the bulk of their time competing against each other to convince you they already know what they are going to discover.

“That’s pretty much the system we have now.”

Welp. On that cheery note, we’d better get back to the tidy abstractions of philosophy next post…
