arg min

Theories of Metatheories

Ben Recht — Fri, 26 Jul 2024 14:56:19 GMT

This is the final post about Paul Meehl’s course “Philosophical Psychology.” Here’s the full table of contents of my blogging through the class.

My friends, we find ourselves at the end of July and the end of Meehl Blogging. There is one more lecture in the series, but the video is corrupted, so it’s hard for me to figure out what to say. No matter, I think psychoanalysis makes for a fitting end, and I’m calling it there.

I’ve spent the last few days collecting my thoughts and pondering a coda. I don’t have a poetic summary just yet. I thought this blog series would take a few weeks, but it ended up consuming three months. I guess that’s a semester class! Maybe I can get course credit? Kidding aside, one of the great joys in life is taking a new class, learning new things, and completely reshaping how I think about a subject. It’s always good to remind myself this remains possible no matter how old I get. But I’m going to have to take an incomplete on my term paper as I can’t do this course justice just yet.

Let me give a quick high-level take on my current estimation of my most valuable takeaways.

Lakatosian Defense. Meehl’s synthesis of 20th-century philosophy of science, what he calls metatheory, is better aligned with my views than anyone else I’ve encountered. This “equation” that I’ve posted a dozen times has completely reshaped how I think about scientific evidence and argumentation.

Meehl pulls so many pieces together in this simple formulation: Popper, Duhem-Quine, Reichenbach, Salmon, Kuhn, Feyerabend, Lakatos. From this vantage point, you can tie the logical aspects to the social aspect. Deduction to induction. You can reconcile post-Latourian science studies with logical positivism. It’s a beautiful, elegant framing that pulled together so much of what I’ve been thinking about for the past half-decade.

The poverty of null hypothesis testing. Before I engaged with these lectures, Meehl was already one of my prime go-tos for critiquing the tyranny of frequentist hypothesis testing. But hearing him talk about it helped me understand the broader context of his interest and the nuances of his critique. It’s wild how we torture ourselves with this broken inference framework. The Fisher, Neyman, and Pearson program from the 1930s is just broken. It isn’t fixable. But social conventions stick us with this, guaranteeing we will always be confused. These lectures highlight the absurdity of applying a Lakatosian strategy to understand social relations. It’s the wrong toolset and has led to nothing but widespread confusion and invasive bureaucratic kudzu. Let me not get stuck here again today… perhaps this will be the topic of my term paper.

Prediction is quantitative, inference is qualitative. The last three lectures on probability, prediction, and psychoanalytic inference might have been the most valuable part of the class for me. Finally, for the first time in my career, I feel sure-footed about probability. I’m serious. Once you realize that Carnap’s Probability 1 isn’t numbers, you can be more relaxed about the entire probabilistic endeavor. Statistical prediction is a powerful concept but has limited scope, no matter what the AGI dorks tell you. Probabilistic inference can be qualitative and still helpful for guiding the practice of inference. Meehl makes a compelling argument for a dappled world of probability. Quantifying the unknown is an art form and an ethos. It can never be rigorous. And that’s ok!

Why we’re stuck. Meehl’s 1989 complaints and critiques about science and research are depressingly similar to those that science reformers voice on Twitter today. You could make the case the problems were the same in 1970. Why did we get stuck? This is my favorite question about the history of science, and the answer seems to be non-scientific factors that scientists pretend they are isolated from: the Cold War, the American Empire, hyperfinancialization, computerization. We’ve lived in a time of unprecedented technological advancement and stagnation. Though unintentionally, because he didn’t want the class filmed in the first place, Meehl’s lectures helped me gain more appreciation for what has flourished, what has withered, and what we might want to do next.

More to come. In the meantime, I’d love to hear what you all took away. Let me know which parts you found most interesting, and which parts you most disagreed with.

And with that, I’m going to take a few weeks “off” to finish a few other writing projects. I may pop back in here if there’s something wrong on the internet, but I don’t have regular blogging planned until late August.

And that’s because late August is when the semester begins! This fall, I’m excited to blog about my class on Convex Optimization in the age of LLMs. This should make for an interesting blog project where I’ll try to digest a very mathematical and technical topic with as few equations as possible. It should be a fun mathematical communication experiment. I hope you’ll enjoy reading along.


Inference and The Psychoanalytic Interview

Ben Recht — Tue, 23 Jul 2024 14:59:57 GMT

This post digs into Lecture 11 of Paul Meehl’s course “Philosophical Psychology.” The video for Lecture 11 is here. Here’s the full table of contents of my blogging through the class.

The International Conference on Machine Learning is this week. So let me set today’s stage with an AI problem. Suppose I have a sequence of symbols that I model as generated by some semi-parametric process. My goal is to infer the latent state of that process. I can choose a set of actions to probe the system and see how the sequence changes. What is the optimal algorithm for inferring the latent state?

I’m sure there’s some poster in Vienna[1] right now that has this model in the abstract. They probably use LLMs or something. Whatever the case, this model is also Paul Meehl’s formulation of the psychoanalytic interview as a problem in inference.

The sequence in psychoanalysis is the words uttered by the patient. The latents are the repressed memories or emotions impinging on the observed utterances. The patient is talking about what they intend to talk about, but the therapist seeks to infer the latent sequence that is impacting unintentional parts of the monologue. There are pauses, blanks, and rate of speech that are all possibly caused by the latent stream. The therapist can occasionally interject to see if they can steer the conversation toward confirming their suspicions. Can the therapist infer the latents? The abstract model of psychoanalysis is a complex game of Bayesian inference with a clinical chatbot.

Meehl gives several examples of how such Bayesian-ish inferences work in psychoanalysis through his own case studies, and I can’t do them justice in a blog. It’s worth watching the lecture to fully immerse yourself in the complexity and nuance.

Sometimes, the examples are obvious. A woman drops her wedding ring in the toilet. She calls her husband by the wrong name. She has a headache after date night. While you could come up with lots of behavioral explanations for each of these events (acute clumsiness, temporary disassociation, food poisoning), a considerably simpler model is that one latent issue (insecurity about her marriage) explains the disconnected events.

Sometimes, especially when it comes to dream interpretation, the inferences are far more complex and entangled. Meehl describes linking a patient’s dream about a dying Asian man to the cover of Time magazine featuring Burmese prime minister U Nu, which he linked to the patient being astounded and incensed by the sorts of deductions Meehl would make in therapy (“you knew!”), which he linked to the patient’s general inferiority complexes for not finishing college.

The examples of Lecture 11 illustrate the craft of psychoanalysis. It needs a lot of training and skill to perform well. How can we evaluate if it works? Does it actually work to ameliorate mental health problems? Here is the main philosophy of science problem. Meehl was disappointed to say that, almost 100 years after Freud, no one knew the answer. Another four decades later, and I don’t think we have much more clarification.

Mental health and therapy are certainly less stigmatized than they were in the 1980s. Such conditions and treatments are considered “real” and “scientific” by a large medical community. But psychodynamics remains part of the “less scientific” corner of mental health therapies, losing favor to more “testable” interventions like cognitive behavioral therapy and an army of pharmaceuticals. But do those therapies actually work better? Why do we think so?

You can’t deny that psychoanalysis is empirical. The protocols are based in observation and intervention. The case studies provide plenty of clear evidence. You could videotape sessions if you needed absolute concrete “evidence.” But the evidence is never quantitative. Even though therapists make inferences, they can’t convert psychoanalytic sessions into numbers. Probability 1, as we’ve discussed now, is barely quantitative. It might not be quantitative at all.

You can’t dismiss psychoanalysis on the basis of its inherent subjectivity. All Bayesian inference is subjective! Subjectivity is the foundation of the entire Bayesian statistics program. Is a sophisticated mathematical model of a patient’s utterances (say, a multi-plate latent Dirichlet model) less subjective than proposing they are impinged by an Oedipus complex? Of course not. Even the most radical millenarian rationalists commit to the inherent subjectivity of inference.

Perhaps you can say that adherents of psychoanalysis cherry pick the positive examples. Psychodynamic therapy can take years to work, and it’s hard to tabulate its success rates. Perhaps its effects are heterogeneous, where it only works for some people and not others. The entire enterprise could be resting on 100 years of motivated reasoning. But there are enough revelatory examples in talk therapy, examples where people’s anxieties vanish in a single session, to think that something is going on.

There’s such a sharp contrast between Lectures 10 and 11. In Lecture 10 we discussed problems with clear-cut outcomes and simple interventions. For these, we could just tabulate statistics and predict what to do. Everything reduced to simple probability calculations. Find the smallest reference classes with stable frequencies and use these frequencies as degrees of confidence about the future. We could tell a clean quantitative story and even outperform clinical judgment.

Psychoanalysis is the entire other end of the spectrum. It requires a highly trained analyst, multiple interactions over potentially long time horizons, open-ended decision-making with unclear and flexible rules, high subjectivity, and little quantification. You can do RCTs of psychodynamic therapy. Unsurprisingly, there aren’t that many of them, and the assembled evidence is inconclusive.

In our age of randomized trial-o-mania, treatments gain popularity solely because they can more easily survive randomized trialing. But, because I cannot say it enough times, randomized trials provide an impoverished view of causality. Simply measured interventions are rarer than we’d like them to be. Even if we know the effect of a single drug on some disease, we rarely have an evidence base for two drugs. What about actual clinical practice? There are so many complex protocols that aren’t easily RCTed. Coaching, tutoring, artistry, craftsmanship. We don’t need RCTs to know these skills work. RCTs and other quantitative experiments have their time and place, but it’s worth noting that we can find things that work without such rigid statistical dogma.

The frustrating thing for many scientists is there will always be a fine line between math and metaphor. But mathematical metaphors untethered from quantification are undervalued. Being able to manipulate metaphors with mathematical poetry is a valuable means of understanding method. Mathematical metaphors can tell us what matters. They can tell us what doesn’t matter. They can simplify how we train clinicians. They can provide good heuristics for what we try next. They can help us understand what works.


[1] Oh, the Freudian irony here…

Clinical versus Statistical Prediction (III)

Ben Recht — Wed, 17 Jul 2024 14:41:19 GMT

This post digs into Lecture 10 of Paul Meehl’s course “Philosophical Psychology.” Technically speaking, this lecture starts at minute 74 of Lecture 9. The video for Lecture 10 is here. Here’s the full table of contents of my blogging through the class.

The earliest study Meehl finds demonstrating the superiority of statistical judgment asked whether “scientific methods” could be applied to parole. In the 1920s, sociologist Ernest Burgess worked with the Illinois Parole Board to determine the factors that contributed to recidivism and whether it was possible to predict whether a parolee would commit further crimes after release.

Burgess assembled 21 predictive factors, including age, the type of offense, whether a person was a repeat offender, and whether the person had held a job before. He then constructed a sophisticated AI tool for predicting parole: he scored each factor either 0 or 1 and then added them all up. Of the 68 men with at least 16 positive factors, only one ever committed a crime again. Of the 25 men with fewer than five positive factors, 19 recidivated.
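Burgess’s rule is simple enough to write down in a few lines. Here is a minimal sketch, assuming each factor has already been coded as 0 or 1. The factor names and the example record are hypothetical stand-ins, not Burgess’s actual 21 factors.

```python
from typing import Mapping


def burgess_score(factors: Mapping[str, int]) -> int:
    """Unit-weighted score: count the factors coded as favorable (1)."""
    for name, value in factors.items():
        if value not in (0, 1):
            raise ValueError(f"factor {name!r} must be coded 0 or 1, got {value!r}")
    return sum(factors.values())


def predict_success(factors: Mapping[str, int], threshold: int) -> bool:
    """Predict successful parole when the score reaches the threshold."""
    return burgess_score(factors) >= threshold


# Hypothetical record with four made-up factors.
example = {
    "first_offense": 1,
    "held_job_before": 1,
    "older_than_25": 0,
    "no_prior_parole_violation": 1,
}
score = burgess_score(example)
```

There are no weights to fit here. The only tunable choice is the threshold, which is part of why a rule like this is so easy to audit.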

In a 1928 report, Burgess compared his predictions against two prison psychiatrists. He gathered a dataset of 1000 men who appeared before the Illinois Parole Board in the 1920s. The psychiatrists assigned each prisoner as likely to violate parole, unlikely to violate parole, or uncertain. Of the ones deemed unlikely to violate, the first psychiatrist predicted 85 percent correctly, the second 80 percent. Of the ones deemed likely to violate parole, the first psychiatrist predicted 30 percent correctly, the second 51 percent. Burgess’ method, looking for at least ten positive factors, not only made a prediction for all parolees but correctly predicted 86 percent of the ones unlikely to violate and 51 percent of those likely to violate. It outperformed the first psychiatrist at predicting violations and the second at predicting successful parole.

Algorithmic recidivism prediction remains a contentious topic. It is one of the most popular examples discussed by the machine learning fairness community. The common refrain is to argue that these risk assessments are examples of “an opaque decision-making system that influences the fundamental rights of residents of the US.” But Burgess was attempting to make the case for a more liberal parole system. He thought his algorithm could be less political, more fair, and more accurate.

Meehl highlights a dozen other studies in his book and continued to track these throughout his career. No matter how much he looked, he kept finding the same thing as Burgess: statistical rules were seldom worse and often much better than clinical predictions. In a reflection on his book, Meehl wrote in 1986, “There is no controversy in social science that shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one.”

He may have been right. I wasn’t sure how best to convey the evidence, but let me discuss two meta-analyses from the 21st century. Meehl was no fan of meta-analysis (and neither am I), but sometimes it’s worth gathering all the papers and looking at the trends.

The two biggest, broadest meta-analyses were done by Grove et al. (2000) and Ægisdóttir et al. (2006). Grove et al.’s analysis included 136 predictions. In 46% of the predictions, mechanical methods were at least roughly 5 percentage points better than clinical judgments.[1] That is, the difference between the accuracy of the statistical and clinical predictions was at least 0.05. In 48%, the two predictions were within about 5 points of each other. Clinical predictions were substantially better than mechanical predictions in less than 6% of the studies. This plot from Grove et al. demonstrates further that there was a skew in the distribution.

Here, a positive score denotes an advantage for mechanical prediction and a negative score an advantage for clinical. When they were better, mechanical predictions were more frequently far better.

Ægisdóttir et al. focused on statistical methods but found the same results as Grove et al. In their compilation of 48 predictions, 52% favored statistical methods, 38% reported comparable performance, and 10% favored clinical judgment.

What should we make of these findings? First and foremost, we should accept that statistical judgment can be considerably better than expert judgment. This should inform how we proceed in decision-making about people. However, I cannot emphasize enough that just because statistical prediction is never worse and often better than clinical judgment, that doesn’t mean you can’t still screw up statistical prediction. Careful statistical prediction remains a delicate skill.

You can have too few features.
You can have too many features.
You can have completely uninformative features.
You can have missing data.
You can have non-stationarity and frequency shifts.

These are just a few of the major headaches you have to deal with. If we’re going to rely on statistical prediction, then we need expertise in statistical hygiene.

With regard to that hygiene, I also want to emphasize again and again and again that you shouldn’t just break out some fancy new machine learning method and assume it’s going to be the best method. Many have noted that sophisticated machine learning methods are often outperformed by least squares. But least squares is still statistical prediction! One of Meehl’s examples is Sarbin’s study (1943), which showed a two-variable linear regression based on high school ranking and college entrance exam score was more predictive of a student’s success at the University of Minnesota than the assessment of the university’s clinical counselors. Just because simple ML methods perform better than complex ones does not mean that simple ML methods are inferior to clinical judgment.

Recidivism prediction provides another great example. An infamous ProPublica report highlighted a bizarre, opaque psychometric system, COMPAS, sold by Northpointe to the state of Wisconsin for predicting recidivism. Later analysis showed that COMPAS was no better than simple rules like Sarbin’s. Fancy opaque rules should always be compared to the simplest baselines. For many of these messy social questions, you’ll never beat the simple rules because the prediction problems are so hard anyway.

Statistical rules are more accurate, faster, and cheaper than experts. They can even be more fair and safe. And yet, statistical prediction is not a panacea. Meehl didn’t think so either! Statistical rules need to be targeted at interventions with simple outcomes. They are challenging to keep updated. Tech companies retrain their prediction systems every day. Medical risk assessments might stay static for decades. And mechanical rules have human costs. They can lead to an erosion of expertise as practitioners spend too much time deferring to their apps. They can lead to decision fatigue, forcing too many things into a computerized system. And they can lead to complacency, as following mechanical rules is drudgery. For all these reasons, the adoption of mechanical rules and statistical prediction in high-stakes scenarios must be done with care.

I understand why the power of statistical prediction will remain discomfiting for professionals. Statistical prediction is atheoretical. There’s no good reason why counts of the past lead to reasonable predictions of the future. That’s the problem of induction, my friends. I’m not going to go full neorationalist on you and argue that statistical prediction always works. That would be ridiculous and I don’t believe it (I have written endless blogs on why). We should interpret Meehl as providing us with a setting where statistical is probably going to be better than clinical: answering clear, multiple-choice questions about simple actions from machine-readable data. This characterization is useful!

Still, it feels like a doctor can assess more than what is fed into the computer. That a counselor can see subtle cues that are valuable for prediction. That there are edge cases statistics can’t catch. Isn’t this true? Why is clinical judgment worse on average?

The key to the entire clinical-statistical puzzle is those last two words. “On average.” The trick that Meehl plays–and that all bureaucrats play–is in the quantification of “better.” By better we of course mean on average. Once you decide that things will be evaluated by averages, the game is up. If you believe that prediction is possible, and you tell me that I’m going to be evaluated by hit rates, then I’m going to find a method that maximizes hit rate over some class of possible algorithms. In machine learning, we call this empirical risk minimization. You should find a rule that predicts the past well and use this to make predictions about the future. Since you will be evaluated based on averages, this is effectively the optimal thing to do.
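To make the empirical risk minimization step concrete, here is a toy sketch. It assumes a hypothetical one-dimensional score and a class of simple threshold rules, and the “past cases” are made up for illustration. The point is only that once hit rate is the yardstick, picking the rule is a search problem.

```python
from typing import Sequence, Tuple


def hit_rate(threshold: float, scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Fraction of past cases where 'predict 1 iff score >= threshold' was right."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, outcomes))
    return hits / len(scores)


def erm_threshold(scores: Sequence[float], outcomes: Sequence[int]) -> Tuple[float, float]:
    """Pick the threshold with the highest hit rate on past data.

    Empirical risk here is 1 - hit rate. Ties go to the lowest threshold,
    because candidates are scanned in ascending order.
    """
    # Every observed score is a candidate, plus one that predicts 0 for everyone.
    candidates = sorted(set(scores)) + [max(scores) + 1]
    best = max(candidates, key=lambda t: hit_rate(t, scores, outcomes))
    return best, hit_rate(best, scores, outcomes)


# Made-up past cases: a score and whether the outcome occurred (1) or not (0).
past_scores = [2, 4, 5, 7, 9, 11, 12, 15]
past_outcomes = [0, 0, 1, 0, 1, 1, 1, 1]
threshold, rate = erm_threshold(past_scores, past_outcomes)
```

Nothing in this procedure knows anything about the individual case. It only knows which rule would have scored best on the table, which is exactly the sense in which the evaluation criterion decides the winner.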

Meehl summarizes the situation in the last paragraph of his 1954 book. If we subscribe to the bureaucratic utilitarian mindset, the algorithm always wins:

“If a clinician says, ‘This one is different’ or ‘It’s not like the ones in your table,’ ‘This time I’m surer,’ the obvious question is, ‘Why should we care whether you think this one is different or whether you are surer?’ Again, there is only one rational reply to such a question. We have now to study the success frequency of the clinician’s guesses when he asserts that he feels this way. If we have already done so and found him still behind the hit frequency of the table, we would be well advised to ignore him. Always, we might as well face it, the shadow of the statistician hovers in the background; always the actuary will have the final word.”


[1] As is the case with all meta-analyses, the way they pool their comparisons is frustrating. In order to evaluate the bulk benefit of one method versus another, you have to take a diverse set of results and homogenize them. Both of these studies do this by trying to scale the difference between clinical and statistical judgment to standardized units using Cohen’s d. Specifically, if one method has accuracy a1 and the other method accuracy a2, then

$$ d = 2\arcsin\sqrt{a_1} - 2\arcsin\sqrt{a_2} $$

A method was favored if |d| was greater than 0.1. Now, if a2 is 60% and d is 0.1, then a1 is 65%. If a2=70% and d=0.1, then a1=74%. This is why I say “roughly 0.05” above. Is this the right metric? Gah, I don’t think so. But if you have a better idea, please tell me in the comments!
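The numbers in this footnote can be checked with a few lines of code. This is a sketch under an assumption: it takes the standardized difference to be the arcsine-transformed difference of proportions (often called Cohen’s h), which is the choice that reproduces the 65% and 74% figures quoted here. The function names are hypothetical.

```python
import math


def arcsine_effect_size(a1: float, a2: float) -> float:
    """Standardized difference between two accuracies (assumed arcsine transform)."""
    return 2 * math.asin(math.sqrt(a1)) - 2 * math.asin(math.sqrt(a2))


def accuracy_for_effect(a2: float, d: float) -> float:
    """Invert the transform: the accuracy a1 that sits d units above a2."""
    angle = math.asin(math.sqrt(a2)) + d / 2
    if not 0 <= angle <= math.pi / 2:
        raise ValueError("effect size pushes the accuracy outside [0, 1]")
    return math.sin(angle) ** 2


a1_from_60 = accuracy_for_effect(0.60, 0.1)  # a little under 0.65
a1_from_70 = accuracy_for_effect(0.70, 0.1)  # a little over 0.74
```

The same d corresponds to a smaller gap in raw accuracy as the baseline moves away from 50%, which is why the threshold is only “roughly” 5 points.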

Clinical versus Statistical Prediction (II)

Ben Recht — Tue, 16 Jul 2024 14:28:11 GMT

One of the more common misreadings of Meehl is that he thought you could somehow do away with clinicians altogether. This was not his position, and as we’ll see more in Lecture 11, Meehl did not believe that all decisions could be made statistically. His aim was determining the scope of statistical judgment and when it might be useful. There was a significant set of decisions where he deemed statistics superior. By being precise about this subset, he thought that he could both improve care and simplify the life of the clinician, allowing them room to automate part of their job. Today, let’s home in on the sorts of predictions Meehl thought were best decided by statistical methods.

Actions

Meehl first clarifies that the goal should be about predicting the outcome of interventions. He is not interested in diagnostic tests. He is not asking about the construct validity of testing for diseases. (He has written other papers about that topic!) Here, he wants to understand how to predict the consequences of actions.

All of the example questions he asks are attempting to predict how an action will affect a particular person. If granted admission, will a person succeed in law school? If released from prison, will a person recidivate? If a depressed person isn’t hospitalized, will they commit suicide? If a person receives shock therapy, will their depression be relieved?

These sorts of questions are about the impact of single actions. They also have yes or no answers. Meehl focuses on questions with a small list of possible outcomes. For open-ended questions, Meehl thought clinical expertise was indispensable. It was only for problems with simple multiple-choice answers where he thought statistical decision-making could play a role.

Data

To make the decision, Meehl assumes the clinician has the same data as the statistical rule. He belabors distinguishing between the kind of data and the mode of combining the data. As long as the statistical formula and the clinician are presented with the same information, the data can be anything: interviews, life history data, a mental test, other biometrics.

Obviously, such data has to be transformed into a machine-readable format somehow. Here’s another place the clinician may be indispensable. A clinician may be required to observe a patient’s behavior or facial expressions and write down appropriate diagnostics. Today, this could perhaps also be done with statistical machine learning. In his 1989 lectures, Meehl notes that character recognition is still barely functional. He doesn’t rule out the possibility of more sophisticated pattern recognition methods being used if computers improve. (Spoiler alert: they did.)

Regardless, he just wants the computer and the clinician to be using the same data. The controversy is about the mode of combination, not the data types.

Mechanical and actuarial rules

Meehl defines two forms of algorithmic decision rules. First, there are mechanical rules, which we now call algorithms. Mechanical rule and algorithm are synonymous. A mechanical rule is a well-defined, step-by-step process for translating data into a decision that can be implemented on a computer.

Actuarial rules are a special kind of mechanical rule. They are algorithms that make decisions based on rates of past occurrences. These are the statistical prediction methods. A decade ago we called these prediction methods machine learning. Today we call them AI.

Actually, now that I think about it, we’re in the goofy phase of the hype cycle where all mechanical rules are now annoyingly called AI. So I’m going to use Meehl’s terms of mechanical and actuarial to keep things clear. Let me still emphasize that Meehl’s clinical-statistical question asks when AI is better than people at making decisions. There’s a large academic community that still argues the answer is never. As we’ll see, Meehl does not agree.

Clinical judgment

Meehl’s definition of a clinical judgment is a bit more vague. He says it’s anything “informal” made by a human specialist. It’s whatever process occurs in a person’s head. Clinical rules are those made by clinicians based on intuitive assessments of data. These are decisions that clinicians can’t cleanly explain and hence aren’t formalizable as algorithms.

The clinical-statistical question

With all of this setup, we can now pose Meehl’s central question:

Given a decision problem with a small set of possible outcomes and an appropriate, fixed collection of data, do actuarial rules or clinical judgment provide more accurate judgments about the future?

For this narrow but broadly applicable question, Meehl came down solidly on one side: Statistical prediction would never be worse than clinical prediction.

If you had asked me a year ago, I’d have vehemently disagreed. But I’ve come around. Meehl provides compelling empirical evidence in his 1954 book. And 70 years of studies have backed him up. You’d be hard-pressed to find a result in social science that is as robust as statistical decisions outperforming clinical judgment. After grappling with the evidence and the counterarguments, I now totally agree with Meehl. Tomorrow, let me try to convince you, too. I will present the empirical evidence, Meehl’s philosophical arguments, and what I consider to be a simple but deceptively subtle explanation. It’s through the subtlety that we might find some resolution.


Clinical versus Statistical Prediction (I)

Ben Recht — Mon, 15 Jul 2024 14:29:49 GMT

Throughout his undergraduate and graduate studies in Minnesota, Meehl found himself at the center of a personal and professional conflict for the soul of psychology. As a high schooler, Meehl had been drawn to psychology by the psychodynamic school (pioneered by Freud), which considered the myriad connections between a patient’s past experiences–even their dreams–and their current mental state. At Minnesota, he was educated by a rigid behaviorist crew, strongly anti-Freudian, focused on understanding the impact of external factors on mental states, and adamantly scientific and statistical.

This struggle in psychology was part of a broader struggle in social science between the idiographic and the nomothetic. The idiographic focuses on the particulars, on the individual, trying to make sense of the singular and unverifiable. The nomothetic focuses on the general, trying to determine laws and principles that explain categories with clear measurements. The idiographic treats every case as unique. The nomothetic treats every case as a statistic. The last two lectures of his course are about Meehl’s career-long project of demarcating the purviews of the idiographic and the nomothetic.

One of Paul Meehl’s most famous works, Clinical versus Statistical Prediction, grew out of a lecture series from 1947 probing this boundary. Though he wouldn’t call it by name, Meehl makes the first argument for machine learning in the clinic. After struggling to find a publisher, his book finally appeared in 1954, two years before the famous Dartmouth AI conference. It was four years before Rosenblatt’s Perceptron. Even as computers were just coming online, there was already ample evidence that statistical pattern recognition could, and perhaps should, play a role in critical decision-making.

Meehl’s book focuses on how to predict behavior. He gives some examples of what he had in mind in Lecture 10:

Given an application with LSAT score, undergraduate grades, and letters of recommendation, who should be admitted into law school?
Given a record of behavior, should a jailed person be released on parole?
Should you hospitalize a patient who is clinically depressed to prevent suicide?
Should a person who doesn’t respond to antidepressant prescriptions be given shock therapy?

These questions demand consequential decisions about people’s lives. They are all concerned with how a human being reacts under particular circumstances. And in all cases, the outcomes are uncertain. We just don’t know what will happen as a result of many very consequential decisions.

To answer these questions, we have to predict what will happen as a result of our actions. If you’ve been following along with the blog, you should be comfortable saying that all such predictions are probabilistic. Even though these questions are about individual people, their answers have epistemic uncertainty. Answers to these questions are logical statements that can be measured with Probability 1.

“I believe this candidate will do well in our law program.”
“I believe this person will not commit crimes if released.”
“I believe this person will harm themselves unless they are committed.”
“I believe this person will find some relief from shock therapy.”

These are all beliefs, and Meehl wanted to know how best to quantify them. What would be the best way to decide in the face of the inherent uncertainty?

In Clinical versus Statistical Prediction, Meehl aims to compare the idiographic and nomothetic approaches to such decision-making. I will clarify the technical distinction momentarily, but let me first set the high-level distinction between these approaches. The nomothetic approach is statistical, transmuting an assessment of past rates into future uncertainty. We could look at all similar cases in the past and count the number of times a treatment had worked. We could use the success percentage as a proxy for our belief that the treatment will work on the patient before us. Then, we could use optimal statistical decision rules to weigh the costs and benefits and select an action. This is statistical prediction. We convert past performance into future confidence.
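As a minimal sketch of the statistical route, the snippet below counts successes among past similar cases, treats that rate as the belief for the new patient, and then picks the action with the lower expected cost. The case records, the cost numbers, and the function names are all made up for illustration; none of them come from Meehl's book.

```python
from fractions import Fraction

# Hypothetical outcomes for past patients judged similar to the one in front
# of us: True means the treatment worked. These records are invented.
past_outcomes = [True, True, False, True, False, True, True, False, True, True]


def success_rate(outcomes):
    """Past performance: the fraction of similar cases where treatment worked."""
    return Fraction(sum(outcomes), len(outcomes))


def choose_action(p_success, cost_treat_fail, cost_no_treat):
    """Pick the action with the smaller expected cost.

    Treating costs `cost_treat_fail` only when the treatment fails; not
    treating costs `cost_no_treat` regardless. Both costs are assumptions.
    """
    expected_cost_treat = (1 - p_success) * cost_treat_fail
    return "treat" if expected_cost_treat < cost_no_treat else "do not treat"


p = success_rate(past_outcomes)       # 7 successes out of 10 cases
decision = choose_action(p, cost_treat_fail=10, cost_no_treat=5)
print(p, decision)
```

Everything contentious lives in the inputs: which past cases count as "similar" and what the costs are. The arithmetic after that is mechanical.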

Clinical prediction, on the other hand, starts from the idea that all patients are unique. In the 1940s, the validity of inference from class membership was not at all conventional wisdom. In his book, Meehl quotes Gordon Allport making the case for the idiographic:

“A fatal non-sequitur occurs in the reasoning that if 80% of the delinquents who come from broken homes are recidivists, then this delinquent from a broken home has an 80% chance of becoming a recidivist. The truth of the matter is that this delinquent has either 100% certainty of becoming a repeater or 100% certainty of going straight. If all the causes in his case were known, we could predict for him perfectly (barring environmental accidents). His chances are determined by the pattern of his life and not by the frequencies found in the population at large. Indeed, psychological causation is always personal and never actuarial.”

I still hear this argument today. It is made by doctors and patients. It is made by advocates against algorithmic decision systems. In fact, I’ve made this argument multiple times myself. I’m personally very sympathetic to Allport. Meehl himself concurs with the general sentiment. Cases are indeed unique.

Is it always impossible to make inferences from class membership? That seems too strong. Moreover, you’ll note that Allport uses the word “chances.” Chance is, by its very nature, a probabilistic concept. The question remains whether that chance can be usefully estimated through actuarial methods. When is generalizing about the future just a question of careful counting?

This was a radical question in the 1940s, but it seems quaint today. We are living in the glory days of statistical pattern recognition. The tech industry and half of academia have decided that general intelligence is nothing but making decisions by counting things in appropriate reference classes. Everything we do is a sum of our past experience. All decisions are actuarial. It’s just a matter of finding the formula.

As we’ll see, Meehl wouldn’t go this far, but he’d come closer to agreeing than disagreeing. In 1947–before computers, before machine learning, before AI–he tried to understand the effectiveness and limits of actuarial tables in human decision-making. This week, I will walk through Meehl’s argument. I’ll make precise the sets of questions, decisions, and evaluation methods he considers. I’ll provide his evidence. And I’ll close with some reflections on the valuable lessons we can still learn from reading Meehl’s 1954 book.


Holy Wars... The Probability Two

Ben Recht — Fri, 12 Jul 2024 14:51:11 GMT

This post digs into Lecture 9 of Paul Meehl’s course “Philosophical Psychology.” Technically speaking, this lecture starts at minute 82 of Lecture 8. The video for Lecture 9 is here. Here’s the full table of contents of my blogging through the class.

It’s a bit weird that Probability 1 and Probability 2 have the same name. Probability 1 is a metalinguistic construct about relationships between evidence and conclusions. Probability 2 is an object-linguistic concept about frequencies of events. Probability 2 we can calculate using appropriate combinatorics. Despite a century of effort, we still don’t have a clean algorithm to grind out real number assignments for Probability 1. We don’t know how to compute numerical probabilities of theories from facts. Is it possible that Probability 1 and Probability 2 are just different concepts with the same name?

Probably not. There are striking connections that tell us they must be related. A second natural question is then whether one is a special case of the other. Some religious people think that Probability 1 is just a confused version of Probability 2. These people are called Frequentists. More common are the faithful who think that Probability 2 is the collection of simple, computable cases of Probability 1. These people are called Bayesians. Who’s right? Let’s look at the evidence.

We have the compelling fact that both notions lead to the definition of a fair bet. Many (most?) subjective Bayesians define probability by fair betting. Relative frequency also gives you the correct betting odds for games of chance. Though we can’t always compute Probability 1, when we can, it gives us the same answers as Probability 2. Since Probability 1 applies to more than frequency, perhaps this suggests that Probability 1 is the fundamental concept and Probability 2 is a special case.

Can we make a case that Probability 2 is the fundamental concept? Meehl puts forward a compelling defense of Frequentism, or at least how Probability 1 has to correlate with frequency concepts. When it comes to prediction, frequency takes over.

Suppose you have a Bayesian monk who believes he can predict the future. Let’s call him Jake Gold. Gold makes probabilistic predictions about everything. He writes down all sorts of predictions about everyday life, about sports, and about politics. Everything in his experience gets assigned a likelihood between 0 and 1.

Since the future eventually becomes the present, we can retrospectively evaluate Gold’s predictions. Take his past track record and make a histogram. There will be a collection of future events where Gold declared the probability between 0.7 and 0.8. We can count the frequency they were correct. We can make a similar bin for his predictions between 0 and 0.1, 0.1 and 0.2, and so on. Now, what should the accuracy be in each of these bins? If I take all of Gold’s predictions scored at 70%, those should happen around 70% of the time, no? It seems reasonable that the event rates in each of these bins should be in the ballpark of their midpoint. If the correlation between the inductive, logical prediction algorithm and the frequency of being right were negligible, then we’d all think the forecast was fishy. Maybe frequency, not belief, might be the basic notion behind probability after all.

This notion I’ve described here, a strong correlation between the frequency of predicted events and predicted probabilities, is called calibration. It is a necessary property of a good Probability 1 prediction system. And that necessity is why Bayesians can’t ditch frequency.
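Here is a rough sketch of that bookkeeping: bin the forecasts by stated probability and compare each bin's hit rate to the bin itself. The track record is invented purely to show the mechanics, and the ten equal-width bins are just the ones used in the description above.

```python
from collections import defaultdict

# Hypothetical track record: (stated probability, did the event happen?).
# These pairs are invented purely to illustrate the bookkeeping.
track_record = [
    (0.75, True), (0.72, True), (0.78, False), (0.71, True),
    (0.05, False), (0.08, False), (0.02, False), (0.09, True),
    (0.55, True), (0.52, False),
]


def calibration_table(record, n_bins=10):
    """Group forecasts into equal-width bins and report the hit rate per bin."""
    bins = defaultdict(list)
    for prob, happened in record:
        # A forecast of exactly 1.0 goes into the top bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append(happened)
    table = {}
    for index in sorted(bins):
        outcomes = bins[index]
        lo, hi = index / n_bins, (index + 1) / n_bins
        table[(lo, hi)] = sum(outcomes) / len(outcomes)
    return table


for (lo, hi), rate in calibration_table(track_record).items():
    print(f"forecasts in [{lo:.1f}, {hi:.1f}): events happened {rate:.0%} of the time")
```

A calibrated forecaster would show a hit rate inside or near each bin's range. With only a handful of predictions per bin, as here, the rates are far too noisy to conclude much.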

Now, calibration isn’t sufficient for a good Probability 1 system. You can have perfectly calibrated forecasts that aren’t particularly useful. Let me give an example due to Rakesh Vohra. Suppose you have a sequence of events that just alternates between A and B.

A,B,A,B,A,B,A,B,A,B,A,B,....

The prediction goal is to provide the probability of A. An example of a calibrated forecast is

1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,...

100% of the time, when the forecast predicts a 100% certainty of A, A happens. A never happens when the forecast predicts 0% probability. However, another perfectly calibrated forecast is

0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,...

Since A and B occur with equal frequency, the forecast is right 50% of the time. Since it only predicts 50%, it is perfectly calibrated.1
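To check both claims numerically, the sketch below scores the two forecasts on the alternating sequence. The calibration check groups outcomes by forecast value, which is the binning idea in its simplest form. The Brier score (mean squared error between forecast and outcome) is not something the lecture uses; I'm bringing it in as one standard way to show that two calibrated forecasts need not be equally informative.

```python
from collections import defaultdict

# The alternating sequence A, B, A, B, ... encoded as 1 when A occurs.
outcomes = [1, 0] * 6

sharp_forecast = [1.0, 0.0] * 6   # predicts the alternation exactly
vague_forecast = [0.5] * 12       # always says 50%


def is_calibrated(forecast, outcomes):
    """True if, for each distinct forecast value, the frequency of A matches it."""
    groups = defaultdict(list)
    for p, y in zip(forecast, outcomes):
        groups[p].append(y)
    return all(sum(ys) / len(ys) == p for p, ys in groups.items())


def brier_score(forecast, outcomes):
    """Mean squared difference between forecast and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(forecast, outcomes)) / len(outcomes)


for name, forecast in [("sharp", sharp_forecast), ("vague", vague_forecast)]:
    print(name, is_calibrated(forecast, outcomes), brier_score(forecast, outcomes))
```

Both forecasts pass the calibration check, but the always-50% forecast pays a penalty of 0.25 on every round while the alternating forecast pays nothing.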

While calibration is necessary for a Probability 1 prediction system, it is only part of the story. Having a calibrated system doesn’t tell us that your probabilities are informative or useful (even if Jake Gold spends a lot of time bragging about how impressive his system’s calibration is). This is fine. Predicting fair betting odds is insufficient for a Probability 2 system too. The need for calibration merely confirms frequency is fundamental to probability. With respect to the future, Bayesian predictions need to align with long-run frequencies. Something about frequencies in the past clues us into certainty about the future. And this leads us to Meehl’s penultimate lecture: no matter how confusing we find probability, we can’t deny the astonishing power of statistical prediction.


I bring up Vohra because he and Dean Foster proved something even more surprising about calibration. Though his “always predict 50%” example was contrived, it turns out it’s generalizable. Building a calibrated prediction system is trivial, and you can do it without knowing anything about the events you are predicting. You can make predictions solely to enforce calibration. Foster gives a simple proof of this fact in a later paper. If you have some calibrated bins where the probability forecasts match the correct historical frequencies, predict from one of those bins. If not, you sample a random uncalibrated bin proportional to what will most likely result in calibration. Foster’s proof uses a powerful technique called Blackwell Approachability. More or less, he shows that if you define “calibration” to be your loss function, then an online gradient-like algorithm can find a calibrated forecast with no knowledge about what it should be predicting.

The Algorithmic Subjectivist Myth

Ben Recht — Wed, 10 Jul 2024 14:34:03 GMT

While we can construct a calculus for Probability 1 with laws for manipulating logical formulas into mathematical inequalities about likelihoods, how do we actually assign probabilities to metalinguistic statements? Let me prime you with a few questions:

What is the probability that smoking causes cancer?
What is the probability that the theory of evolution is true?
What is the probability Bruno Hauptmann kidnapped the Lindbergh baby?
What is the probability of the Big Bang?
What is the probability of Freud’s Theory of Dreams?

These are all questions about theories based on accumulated evidence. None of these questions are about predictions about the future. Given Kolmogorov’s probability axioms, deductive logic, and a supercomputer, is there an algorithm that can take our current evidence and give us a number for each of these statements?

A popular way to extract probabilities from individuals is to goad them into betting. Meehl describes how his colleague James Boen skillfully extracted probabilities from students.1 Boen was a professor of Biometry and a dedicated Bayesian. He would start with something vague, like asking how much you’d be willing to bet on the existence of aliens (I want to believe). Then, he would propose bets of different sizes on other less controversial topics. How much would you bet that Nixon would flunk a lie detector test on Watergate? How does it compare to your willingness to wager on the existence of aliens? Like a skilled psychotherapist, he’d eventually break the will of the interrogated, getting them to say, “No, I wouldn’t take those odds.” This confession meant that the probability their brain had computed was lower.

Boen’s Betting Inquisition is an algorithm for measuring someone’s internal probabilities. In fact, measurement was precisely what Frank Ramsey had in mind when he conceived of subjective probability a century ago: “The old-established way of measuring a person's belief is to propose a bet, and see what are the lowest odds which he will accept.”
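One way to picture the inquisition as a procedure is a bisection over bet prices. In the sketch below, a "price" of p stands for a bet that costs p and pays 1 if the statement is true, so accepting it signals a degree of belief of at least p. The `accepts_bet` callback and the respondent with a private belief of 0.3 are hypothetical stand-ins; this is my caricature, not a description of what Boen actually did.

```python
def elicit_probability(accepts_bet, rounds=20):
    """Bisect on the implied probability of a bet until the respondent flips.

    `accepts_bet(p)` is a hypothetical stand-in for the person being
    questioned: it returns True if they would still take a bet priced as
    though the statement had probability `p`.
    """
    low, high = 0.0, 1.0
    for _ in range(rounds):
        mid = (low + high) / 2
        if accepts_bet(mid):
            low = mid      # still willing: their belief is at least this high
        else:
            high = mid     # "No, I wouldn't take those odds."
    return (low + high) / 2


# A made-up respondent whose private degree of belief is 0.3.
def respondent(price):
    return price <= 0.3


estimate = elicit_probability(respondent)
print(round(estimate, 3))
```

The procedure recovers a number, but only because the made-up respondent already had one. It measures a belief; it says nothing about where the belief came from.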

However, measuring a person’s beliefs doesn’t tell us how they came to those beliefs. Just because you can twist someone’s arm into betting doesn’t mean that the person arrived at their comparative belief system by some well-specified algorithm. Meehl claims there is no algorithm to compute these probabilities. There is no algorithm to convert evidence into belief.

I can hear the Bayesians coming for me already. But let me make Meehl’s case. In his Appraising and Amending Theories paper, Meehl has this elucidating diagram illustrating the disconnect between statistics and theories.

The map from substantive theory T to testable statistical hypothesis H goes through a derivation chain involving auxiliary theories, instruments, ceteris paribus assertions, and experimental conditions. The map from hypothesis to observation is through the statistical model manufactured by the derivation chain.

Statistical theory provides a variety of means to infer the veracity of H from O. Usually this goes through Bayes’ Rule.
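In the notation of the surrounding paragraphs, with H the statistical hypothesis and O the observation, the standard statement of Bayes’ Rule is:

$$
\Pr[H \mid O] = \frac{\Pr[O \mid H]\,\Pr[H]}{\Pr[O]}
$$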

When we perform statistical inference, we attempt to calculate the left-hand side. It is the probability the statistical hypothesis is true given the observed experimental data. The right-hand side has terms we can hopefully compute. Pr[O|H] is the probability of the observation given the statistical hypothesis and is what we derived from our theory. Pr[H] is our pre-existing belief in H (our prior). We can compute Pr[O] if we know the probability of the observation when the statistical hypothesis is not true: Pr[O | not H]. More often than not, this “not H” is what people call their null hypothesis.
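Here is a small numerical sketch of that calculation, expanding Pr[O] with the law of total probability. The three input probabilities are arbitrary placeholders, not estimates of anything.

```python
def posterior(prior_h, likelihood_o_given_h, likelihood_o_given_not_h):
    """Return Pr[H | O] via Bayes' Rule.

    Pr[O] is expanded as Pr[O|H] Pr[H] + Pr[O|not H] Pr[not H].
    """
    evidence = (likelihood_o_given_h * prior_h
                + likelihood_o_given_not_h * (1 - prior_h))
    return likelihood_o_given_h * prior_h / evidence


# Placeholder inputs chosen only to make the arithmetic easy to follow.
p_h_given_o = posterior(prior_h=0.5,
                        likelihood_o_given_h=0.8,
                        likelihood_o_given_not_h=0.2)
print(p_h_given_o)
```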

This seems all well and good. But let’s say we now infer that H has high probability given all of our evidence. Then what? We cared about T! How do we compute the probability of T from O?

Let me give a simple example using a preposterous correlation from a prior lecture. I have a complex theory that asserts that sunscreen use causes juvenile delinquency. I use a theoretical derivation chain that deduces a model where the odds of delinquency increase exponentially with regular sunscreen use.

With my model in hand, I want to test it. I gather some data on children, put together a big CSV file, and run logistic regression with my favorite statistical software package. The software tells me that the probability of seeing the data, given that delinquency rates are equal in both groups, is less than 5%. It also reports that the confidence interval only contains exponential functions that increase with sunscreen use. That’s all that logistic regression does, by the way. All of these steps, from the data to the confidence interval and p-value, are algorithmic.
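To make “algorithmic” concrete, here is a sketch of that computation for the simplest possible version of the problem: a yes/no sunscreen variable and a yes/no delinquency outcome. In that special case the logistic regression slope is just the log odds ratio of the 2x2 table, with the usual Wald standard error, so no statistics package is needed. The counts are invented, and a real analysis with a CSV and more covariates would need iterative fitting rather than this closed form.

```python
import math

# Invented 2x2 table: rows are sunscreen use, columns are delinquency.
users_delinquent, users_not = 30, 70
nonusers_delinquent, nonusers_not = 15, 85

# With one binary predictor, the fitted logistic slope equals the log odds ratio.
odds_ratio = (users_delinquent * nonusers_not) / (users_not * nonusers_delinquent)
slope = math.log(odds_ratio)

# Wald standard error of the log odds ratio.
std_err = math.sqrt(1 / users_delinquent + 1 / users_not
                    + 1 / nonusers_delinquent + 1 / nonusers_not)

# Two-sided p-value for "slope = 0" under the normal approximation.
z = slope / std_err
p_value = math.erfc(abs(z) / math.sqrt(2))

# 95% confidence interval for the slope.
ci_low, ci_high = slope - 1.96 * std_err, slope + 1.96 * std_err

print(f"slope={slope:.3f}  p={p_value:.4f}  95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

With these made-up counts the p-value lands under 5% and the interval excludes zero. Nothing in the computation knows or cares why sunscreen and delinquency might be related.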

I could proceed to do other calculations to squeeze out a posterior distribution on the parameters of the exponential function. This posterior tells me the probability of the parameter of my logistic model given the data… assuming the data was generated from a logistic model. Again, my friends, question-begging. What does this tell me about the derivation chain? About the different mechanisms I proposed that lead from sunscreen to delinquency? Can I conclude that sunscreen use increases the risk of delinquency?

Well, no. Because now we need to do a Lakatosian defense. There is a Pr[H | T] that we need to sort through. How does the statistical model derive from the theory, auxiliaries, ceteris paribus clause, and experimental conditions? I imagine you could develop a very clean, logical chain of statements that precisely deduces H from T. If you did this, you could apply Bayes’ Rule again to get some functional form for Pr[T | H]. This functional form would depend on a bunch of other probabilities that you’d need to suss out from the chain. What is the probability of the hypothesis if we keep everything fixed but negate one auxiliary? What about if we negate only the ceteris paribus clause? What does that even mean? What would the probability be under the negation of the experimental conditions? There’s a combinatorial explosion of probabilities you’d need to write down. And then, you’d need a prior probability on every step of that chain too. What’s the prior probability that the ceteris paribus clause is true? 0%? Bayesian philosophers have been trying to work out the details of such inferences for decades, but no one has gotten anywhere satisfactory. Even when there is only a single auxiliary theory!
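A quick count shows how fast this blows up. The sketch below enumerates every true/false assignment to the pieces of a derivation chain; each assignment is a scenario that would need its own conditional probability for H. The component names are placeholders, and treating each piece as a single true/false proposition is a simplifying assumption of mine.

```python
from itertools import product

# Placeholder components of a derivation chain, each treated as one
# true/false proposition (a simplifying assumption).
components = ["theory", "auxiliary_1", "auxiliary_2",
              "ceteris_paribus", "experimental_conditions"]

# Every way the components could jointly hold or fail.
scenarios = list(product([True, False], repeat=len(components)))
print(len(scenarios))  # 2 ** 5

# The count doubles with each added auxiliary theory.
for n_aux in (1, 5, 10, 20):
    n_components = n_aux + 3  # theory, ceteris paribus, conditions
    print(n_aux, 2 ** n_components)
```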

Regardless, science has advanced despite our inability to algorithmically quantify the probabilities of theories. We’ve had a quasi-functional legal system despite asking jurors to estimate if defendants are guilty with a probability greater than 50%. Our heuristic probability and inference systems are fallible, but they work pretty well, all things considered.

Everything I wrote about today was about the difficulty of inferring the probability of things that had already happened. What about estimating things that haven’t happened yet? That doesn’t sound easier to me! But people love to bet, I suppose. I’ll grant the gamblers and superforecasters this: predicting the future at least gives us consistent opportunities to see how often we’re wrong. In the last post on Lecture 9, I’ll discuss how this ability to guess and check provides our connections between Probability 1 and Probability 2.


In the Lecture, Meehl refers to Boen simply as “Boen from Biometry.” I couldn’t figure out who this was from context and came up short doing web searches. Thanks to reader Zach Meisel for identifying Boen.

Degrees of Disbelief

Ben Recht — Mon, 08 Jul 2024 14:59:21 GMT

After weeks of extracting hot takes from Lecture 8, I now turn to the very uncontroversial topic of Lecture 9: Probability. Woo boy. Please don’t tell The Bayesians.

It’s a bit odd that Meehl waited until the 2nd to last lecture of the quarter to dive into probability, as every lecture has touched on this foundational concept. But it’s also brilliant how far he got playing fast and loose with probability concepts. That’s what makes probability so compelling and so dangerous. Like causality, we all have a casual understanding of probability and uncertainty that undergirds our everyday lives. Chance, likelihood, and probability are all useful but slippery words. But then trying to make it rigorous becomes a mathematical and philosophical mess. We’ve spent hundreds of years in a quixotic quest to make everyday concepts about opinion into rigorous mathematics.

Having spent a career using it and endless hours blogging about it, I find myself less comfortable with probability than ever. The mathematics is daunting, and the connections between the mathematics and reality feel so tenuous. But at least I have found clarity in understanding that there is a problem! Meehl succinctly presents the concept and problem of probability in a single lecture, and let me try to do him justice in a few blog posts.

Everyone learns about probability by thinking about games of chance. Our initial notion is that probability quantifies a degree of confidence about something happening in the future. The probability that I will roll snake eyes in craps is 1/36. The probability I will roll seven is 1/6. These numbers quantify how much I’d be willing to bet on an outcome in a game where every round is more or less the same.
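Those two numbers come from nothing more than counting the 36 equally likely outcomes of two fair dice, which a few lines can confirm. The helper name `probability` is just for this sketch.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two six-sided dice.
rolls = list(product(range(1, 7), repeat=2))


def probability(event):
    """Fraction of the 36 outcomes for which `event(roll)` is true."""
    favorable = sum(1 for roll in rolls if event(roll))
    return Fraction(favorable, len(rolls))


p_snake_eyes = probability(lambda roll: roll == (1, 1))
p_seven = probability(lambda roll: sum(roll) == 7)
print(p_snake_eyes, p_seven)  # 1/36 1/6
```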

But we use the word probability to describe a lot of other things. We can ask “What is the probability Donald Trump will win the 2024 election?” People now get paid a lot of money to put numbers on this. Nate Silver puts the number at 71%. Where does that number come from exactly? I unfortunately don’t get paid enough to tell you.

We also use probability in courts of law. In criminal trials, we ask jurors to decide if defendants are guilty “beyond a reasonable doubt.” In civil trials, we only ask if they believe it is “more likely than not” that the defendant violated the law. These are also statements about degrees of belief. About probabilities. Given the evidence, quantify how much you believe some statement of fact.

Throughout Meehl’s course, a running theme is that theories, though always wrong, have some accordance with truth. We could ask “Given everything we know, what is the probability this theory is true?” Is that a probability? Can this colloquial question be quantified too?

Can we actually answer all of these different questions with concrete numbers? Meehl broaches this problem using a dichotomy he attributes to Carnap. For Carnap, there are two types of probability, conveniently named like errors in statistics, Probability 1 and Probability 2.

Before defining the two probabilities, we have to go back to the first lecture to remember the distinction between object language and metalanguage. Object language speaks about entities you can observe or measure: blood, protons, libido. Metalanguage speaks about statements, properties about statements, and logic: truth, confirmation, validity.

Probability 1, often called logical probability, maps metalinguistic statements into numbers. It measures our certainty about some given statement. Probability 1 refers to the relationship between a hypothesis and its evidence. It quantifies how much credence we put into a particular hypothesis given everything we’ve seen so far. It is a relationship between propositions and beliefs.

Probability 2, the one we all learn in grade school, measures the relative frequency of some property in a set of objects. Probability 2 just amounts to counting. The probability a coin flip is heads is just the fraction of the time the coin comes up as heads.

But wait, you might ask, if I haven’t flipped this coin I just got from the mint, what’s its probability of coming up with heads? Is this a question of object language or metalanguage? Herein, we find ourselves in a pickle.

The confusing part is that the laws of fractions can be made to work for a calculus of belief. Even with one-off statements with no frequencies or probabilities, I can imagine a way to put numbers on degrees of certainty.

Here, let me confuse you some more. Let’s say I have a big set of stuff, like a bunch of cards laid face down on the table, and I want to understand some properties about the proportions of subsets. Here are a few things that are true

Any subset has proportion at least 0.
If I take the entire set, the proportion is equal to 1.
If I take two nonoverlapping subsets, the proportion of their union is equal to the sum of their proportions.

Using the cards example, in a standard deck of 52 cards, the proportion of diamonds is 1/4, the proportion of aces is 1/13, the proportion of face cards is 3/13. With a little more mathematical abstraction, these are Kolmogorov’s axioms of probability. From these, we get every property we want of probability.
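As a sanity check, the sketch below builds a standard 52-card deck, recomputes those three proportions by counting, and asserts the three properties for a few subsets. The rank and suit labels and the `proportion` helper are just my encoding of the deck.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))


def proportion(subset):
    """Share of the deck taken up by `subset`."""
    return Fraction(len(subset), len(deck))


diamonds = {card for card in deck if card[1] == "diamonds"}
aces = {card for card in deck if card[0] == "A"}
face_cards = {card for card in deck if card[0] in {"J", "Q", "K"}}

print(proportion(diamonds), proportion(aces), proportion(face_cards))  # 1/4 1/13 3/13

# Non-negativity, total proportion 1, and additivity for disjoint subsets.
assert all(proportion(s) >= 0 for s in (diamonds, aces, face_cards, set()))
assert proportion(deck) == 1
assert aces.isdisjoint(face_cards)
assert proportion(aces | face_cards) == proportion(aces) + proportion(face_cards)
```

Additivity only applies to nonoverlapping subsets. Diamonds and aces share the ace of diamonds, so the proportion of their union is less than 1/4 + 1/13.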

Surprisingly, we can do this for logic too! Let’s find us a function that maps metalinguistic propositions into numbers. I could say

Any syntactically valid statement has probability at least zero.
Any statement that is certain has probability 1.
If two statements describe mutually exclusive outcomes, then the probability that one or the other occurs is equal to the sum of the individual probabilities.

Lo and behold I have Kolmogorov’s axioms again, and now I can do all sorts of probability calculus on degrees of belief or verisimilitude. No frequencies in sight.

Which one is more epistemically fundamental for understanding the unknown? Fractions or degrees of belief? You get to choose a side in the war.

Carnap thought he could end the war. He spent the last decades of his life attempting to find formal rules so that you could, in inductive logic, grind out precise probabilities by looking at the propositions in a particular formalized language. He didn’t succeed. Meehl suggests that if Carnap, one of the smartest people he’d ever met, couldn’t do it, this was evidence that it was impossible. Tomorrow, let me at least describe the weird roadblocks we run into when we try to give primacy to one camp versus the other.


Acting On the Unknowable

Ben Recht — Fri, 05 Jul 2024 15:49:45 GMT

This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.

Lecture 8 closes with the obvious but profound assertion that some questions can’t be answered, at least not given the current stage of development. Meehl notes that a question can be perfectly well posed, theoretically sound, and still unanswerable. We might not have the theoretical structure of auxiliaries necessary to design a good experiment. We might not have developed the instruments needed for appropriate observation or control. Some theory testing requires decades of development of intricate conceptual scaffolding.

For example, Einstein posited the existence of gravitational waves in 1916, but it took about a century to develop the appropriate instrumentation and analysis methods to see them at LIGO. Miescher discovered DNA in 1869, but it took, among many other things, the development of crystallography and X-ray diffraction, the theoretical understanding of nucleotide bases, and the experimental measurement of base ratios to understand its structure.

Meehl argues that the social sciences too often ignore this unanswerability issue. He attributes the oversight in psychology to a weird combination of remnant logical positivism and reliance on null hypothesis significance testing. Roughly, he says that if you believe that concepts are defined only if we can give procedures to measure them, and that theories are meaningful only if they are testable, and that all tests are NHSTs, you end up dropping theories that can’t be validated by our current statistical tool kit.

But Meehl leaves out something even more deleterious: research in social sciences is driven by action bias. Methodological critiques are always met by the canned response, “But we have to do something!” The world won’t end if we have to wait another century for a particle collider more powerful than the LHC, but many will suffer if we don’t understand the causes of the opioid crisis, the impact of abortion bans, or how to nonpharmaceutically slow the spread of infectious diseases. We have to do something!

That something is, unfortunately, often just doing more of the same. We fool ourselves by thinking that any meaningful question must be answerable now. We just have to define our terms operationally so they are verifiable. We can verify things through the right identification strategy. We know there’s a problem of individual differences and sampling errors, but our estimators and stat packages will handle these. So any meaningful question must be answerable at this time, right? Well, sadly, no.

But if our social-scientific tools, reducing everything to Fisher’s Exact Test, aren’t up for the job, what do we do? Even if we have perfectly valid questions that we can’t answer with cold, statistical empiricism, we still have to make decisions. We still have to do something. So what do we do?

Shreeharsh Kelkar sent me a clarifying paper by Daniel Sarewitz, “How science makes environmental controversies worse,” that proposes a path past technocracy. Though Sarewitz’s title focuses on climate science, he gives examples in agriculture and political science that buttress a broader argument.

His point follows from a Meehlian foundation. Science is always uncertain, and given the persuasions of any particular scientist, scientific theories can be attacked from a variety of different arguments about corroboration. Attack the other scientists’ methods. Attack the ceteris paribus clauses. Attack the instruments. Et cetera. Research programs in science advance by ignoring the mountains of uncertainty and focusing on generating new facts regardless. But uncertainty will always remain as long as a large enough group of other scientists has the will to keep the arguments going.

Nothing provides such will more than politics. This means that making a scientific question political, asking if science says we should do something, only increases uncertainty. We turn a problem of is into a problem of ought. As long as there are camps on either side, we sink into a morass of bothsiderism. The scientific method almost ensures you can never get a scientific answer to a political question.

Sarewitz’s conclusion is counterintuitive, but it points to the disutility of uncertainty quantification in policymaking. Scientists can see whatever they want to see if they try hard enough. Presentations from two competing camps only result in more questions. More heat. Little clarification.

But we have to do something! So what do we do? Sarewitz concludes that decisions under uncertainty are thus necessarily more about ethics and values than about optimization and uncertainty quantification. We decide based on what ought to be true. On what we’d like to make real. On the sort of society we want to be. Statistical measurement can’t tell us any of these things. And it never will be able to.

As a way forward, Sarewitz oddly enough ends up agreeing with Karl Popper, calling for some kind of piecemeal social engineering with incremental agile interventions and scientific monitoring of policy implementation. Decision-making about big issues might be best served by the small. Quoting Rayner and Malone, Sarewitz argues “Sustainability is about being nimble, not being right.” I may be too primed to receive these messages from Popper, Sarewitz, Malone, and Rayner, as I reached similar conclusions in this blog series on the tradeoffs between action and impact.

Try different things in different places. Be willing and able to nimbly change course. As best you can, devise measurements to make sense of policy impacts in a deeply interconnected mess. This micro-policymaking still requires help from a research community to guide how to intervene, what to measure, and what is measurable. It also requires society to accept some things are unmeasurable and some questions are unanswerable. And when we find ourselves with scientifically unanswerable questions, we need to be open about making decisions based on our values, not based on our science.


You're gonna run when you find out who I am.

Ben Recht — Wed, 03 Jul 2024 14:34:37 GMT

This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.

One of the central suggestions Meehl advances to improve the interpretability of the scientific literature is less publication. Or, perhaps a bit more accurately, he calls for less pressure to publish. Less pressure would be better for researchers because it would reduce stress and improve their mental health. It would be better for society because so much published research is bad. It would be better for the scientific literature as it would open the opportunity to think more deeply about theory testing.

Meehl worries

“Today, there's something about the academic culture that has become almost sick as regards to the [number] of publications. One of the main forms it takes now is 37 authors in a single paper. It turns out that all Joe Blow ever had to do with it was to say yep. It’s really become absurd. The pressure when you're hiring a fresh-baked PhD is to look at how many publications that person had. He hadn’t even finished his thesis yet. When I see somebody with 20 pre PhD publications in fact I tend to get a little suspicious. I mean how good can they be, after all?”

Does that sound familiar? Hearing a recapitulation of my own argument in a video recorded 35 years ago was unnerving.

Meehl had some proposals for how to fix the problem, and they all failed. Meehl had the quaint idea that we should count citations rather than publications. But that was before the internet made it easy to track your stats. Maxing h-index instead of paper counts was, if anything, worse. People figured out how to run citation rings and game the metrics. Meehl also called for deans’ offices to implement more holistic evaluation of tenure cases. But no matter how many memos central campus sends stating that your teaching and service matter as much as your research, people still just count papers.

People have known that publish-or-perish is a harmful, dangerous problem since at least the 1980s. Why has it only gotten worse? You might say Meehl’s proposals weren’t implemented correctly. That if we were more thoughtful and had a more agile academy, we could make everyone happier. That we could technocratically engineer a more perfect literature.

But let me advance an alternative hypothesis seldom discussed at academic committee meetings or in surveys by serious metascientists.

What if the problem is us?

Late in the lecture, after remarking that he thinks psychology would be no worse off if 4 out of 5 people stopped publishing, Meehl muses about why people don’t stop publishing. The most successful people feel the most pressure to publish even more. Even Nobel Prize winners think they still have to prove their merit and keep churning out results.

At this point, Meehl goes off script and puts on his psychoanalyst hat. He remarks that all academics seem to suffer from a “What Have You Done For Me Lately” syndrome.[1] Meehl relates a story from Nevitt Sanford, who ran pioneering psychological studies at Berkeley’s Institute of Personality Assessment and Research.

Some of Sanford’s most famous work aimed to understand the psychology of the successful. Sanford recruited many Bay Area elites as subjects. He assessed Berkeley scientists, expert surgeons, famous musicians, artists, and poets psychologically. The people in his studies were all high-achieving, high-visibility volunteers from the community.

Sanford had found that these successful people almost all shared the same pathology. Except for one or two psychopaths, they all had imposter syndrome. Meehl related one extreme case:

“Of course I got to be head of Neurosurgery at the age of 39 which is unusual but I really wonder sometimes how did I fool all these people? You know I'm really not that good. What the hell am I doing here? I’m going to stick a knife in some guy’s brain?”

Meehl observed the same phenomenon in his own psychological service. Most of his recent patients had been academics, and he saw precisely what Sanford had seen.

“My experience with treating college professors is that the unsuccessful, the moderately successful, and the super successful all talk the same way in psychotherapy. They all think they're not doing quite as well as they could or should given their IQ. Or alternatively, ‘How did I ever get to have such status since I'm not as bright as my brother whose IQ is 180?’ Or, ‘How did I fool so many people? Will they catch on to me that I'm not really as smart as I appear to be?’ It’s practically universal.”

Does this sound familiar to you?

I’m left with a far too plausible explanation for overpublication. We fund an exponential expansion of the academy after World War II. This creates an expansion of paper writing. Professors need to do more to feel worthwhile. We make mechanical reproduction frictionless. We can instantly generate pdfs to send to our friends. And here we are. What if all we can do is drown in papers? What if it’s an unavoidable psychosociological state? I suppose I now have a theory that I can test. Be right back, I have to go write a paper.

If this psychosociological explanation is verisimilitudinous, there’s a solution to overpublication. And it’s terrifying. A few years ago, I had grown worried about the problem and tried to pull back. The pandemic lockdowns provided a convenient excuse to tap out entirely. You can do some analytics on my Google Scholar profile if you want, but I’m down to writing a few papers a year. I like those papers a lot! But it’s been a conscious effort. Fewer papers, fewer graduate students. Less.

And yet, I would be lying if I said it didn’t feel like I have been actively committing career suicide for half a decade. Shouldn’t I be writing more? What if I stop getting invited to meetings? What if I stop getting grants? You might have noticed that I blog a lot. Is that some sort of coping mechanism? Hold on, let me call a therapist.

More seriously, I still sincerely believe that less is better. It might be natural to publish more, but that doesn’t mean I can’t actively work to publish less. I think my blogs, en masse, have been no less valuable than the ensemble of 90% of my papers. I’ve been enjoying spending more time on my courses. I love writing books. Maybe this model is OK for a tenured Professor of EECS at Berkeley? I suppose we’ll find out at my next merit review.

Meehl suggests we all just relax. Accept that the relentless what-have-you-done-for-me-lately syndrome and imposter syndrome are normal states of mind. Do some rational emotive behavior therapy. Or practice Buddhism. Or pause and consider, “What does it matter when the sun burns out?” The constant pressure to succeed isn’t “healthy” by any conventional psychological definition. And not engaging with it won’t make it go away.

[1] He calls this particular trait something that I can’t decipher. He says either “N adj” or “N edge.” See his discussion at 1:05:00 in the video. If anyone knows what he is referring to here, please let me know.

A year of milliblogging

Ben Recht — Mon, 01 Jul 2024 14:38:47 GMT

On July 1, 2023, Elon Musk decided to rate limit Twitter, rendering the website unusable. No one knows why he did it, but Occam’s Razor suggests it was because he didn’t want to pay his cloud computing bills. Whatever the case, that website was a nightmare. I had been looking for an excuse to get back to blogging and away from Tweeting, and Musk turning off his multi-billion dollar website for a few weeks was as good a forcing function as any.

I decided to start blogging here on substack. Initially, my thought was to use this space as a Twitter replacement. To do some sort of Twitteresque microblogging. I had been a sporadic blogger, but I used to slave over posts, often for weeks. This led to some pretty good posts, but I wanted something a bit more like Twitter without Twitter. A place where I could write spontaneous thoughts and post “questions, rants, and shitposts.” Thinking about it now, I was pining for a 2003 blog.

Being a process-driven guy, I made up some arbitrary rules for myself. My idea was to post Twitter threads with better grammar on Substack. If a tweet was 140 characters and a Twitter thread 10 tweets long, writing a 300-word note every day seemed pretty doable. I set aside my morning coffee time to write and forced myself to post after an hour. I wrote,

“I’ll still give this milliblogging a go for a week or so. Let’s see how long this lasts!”

Here we are a year later. I promised myself I’d write a reflection if I made it a year, so that’s what we’re getting today. I promise you that I’ll limit the number of these navel-gazing writing posts to one or two a year.

I have been pretty good about using every morning to try to write. I spend at least an hour with Google Docs open, and—most days—avoid other doomscrolling distractions. I have found the morning blogging similar to practicing an instrument. The consistency itself makes the process easier. This is so cliche, but it’s true. I bet you can even find convincing science-y studies proving that it’s true. Blogging has made writing other projects flow better, even the ultra-tedious ones professors are regularly saddled with.

The blogging has certainly been time-consuming, but I’ve tried to rate-limit it like a Twitter engineer. If I don’t feel like a post will be ready at the end of the hour, I push it to the next day. On all days, I keep my blogging to under two hours. Though I had been shooting for about 300 words, I’ve found my natural cadence, where I can say what I want to say, ends up being about 1000 words. Today’s post is 908 words.

My arbitrary rules of practice explain the volume you get from me. I’m able to get about a thousand words I like about three or four times a week. I could, I suppose, wait until Monday and send whatever I had written the week before. But that would be against the spirit of my initial microblogging project.

In any event, this has been an exceptionally beneficial experience for me. I’ve been able to think more deeply about my teaching, explore weird research topics I might have neglected, and rant about topics I didn’t want to cast in stuffy academic arxiv papers.

So thanks to everyone who puts up with the newsletter volume and follows along. I imagine getting 3-5 emails from me a week must be annoying. Maybe you like email more than I do. I like everything about Substack except for its “newsletter-centrism.” Maybe Substack can recreate Google Reader? I suppose in many ways it has. Regardless, know that I appreciate every one of you who put up with the email volume.

And extra thanks for your feedback in emails, comments, and Twitter replies. I’ve learned so much from readers in so many surprising ways. The Substack discourse has been far more productive for me than on any microblogging site, whether it be Twitter or Bluesky or Mastodon. So far I only have two people who have sworn to never speak to me again because of my blogging. That seems better than the rate of enemies you gain Tweeting.

Let’s see if I make it to July 1, 2025. I’ve got a plan to get me to January. First, I need to finish these Meehl lectures! I had thought this would only be a few weeks of blogging, but there’s so much depth and complexity and so much valuable perspective on the state of science, engineering, and decision-making. It’s definitely worth seeing through to the end! I think I’ll need until the end of July to work through those lectures. After that, I may take a short break as I have a manuscript deadline on September 1.

In the fall, I’m teaching a course I love for the first time in over a decade: Convex Optimization. I want to try live blogging it, following my process from last fall. I’m looking forward to the challenge of writing about Mathematical Optimization with almost no equations. The material is decidedly more technical than machine learning but has so many interesting, practical applications. What is the role of Convex Optimization in the era of LLMs? Tune in to find out!

And after that, who knows? All I know is I hope you’ll stick around to find out.

I don't care what the studies say.

Ben Recht — Fri, 28 Jun 2024 14:56:21 GMT

This post, um, tangentially digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” But it ties the material into current events. You can watch the video of the Lecture here. Here’s the full table of contents of my blogging through the class.

Yesterday on Twitter Dot Com, Tyler Harper kicked a hornet’s nest:

“The debates about what the research does or doesn’t show about the dangers of smartphone use are a distraction. WE CAN ALL TELL THE PHONES ARE BAD! I don’t mean to do a science denial but I just don’t care what the studies say. Anyone with working eyes knows there is a problem.”

Aha! The studies. Well, what studies are those exactly? It’s Twitter so, of course, none of the people yelling at Tyler provided any links. Now, I don’t want to do a literature survey on phones and depression either,[1] because I, and all of you who have been following along with this blog series, know the literature is an uninterpretable mess.

The conventional wisdom—espoused here by Osita Nwanevu—is that it is always better to look at studies than not. But what if that's not true? What if the studies are just wrong? If social psychologists use tools proven insufficient to measure anything, why should we care what they say?

The fallacy expressed by Nwanevu is that making decisions based on vibes is bad and science is less value-laden or vibes-based. This couldn't be farther from the truth! We all know that science is a social effort. Biases persistently manifest themselves in the scientific literature. Recent work by Winsberg and Harvard describes how sneakily value judgments make their way into models. Add some null hypothesis significance testing and some crud, and the science will show you whatever you want to see. Then add to the mix that scientists love to be quoted “in the press,” and we end up incentivizing research as meme generation.

In Lecture 8, I have been arguing for open data. Part of this is in hopes that we can spend more time understanding the depths of unreliability of published literature. Open-data requirements don’t prevent the publication of incorrect findings. Do you guys remember science during the pandemic? Oh boy, I have been so reluctant to bring that up again. But for a while, when trapped at home with nothing to do, I was downloading all sorts of papers and all of them were wrong. If you give me any observational study about COVID19, I’m now able to find a major flaw in 60 seconds. This one got some attention. But that was just one of dozens of deeply flawed papers I looked at. It didn’t matter if the mainline result was based on incorrect data downloaded from a public source.

After an economist recently insulted his expertise, Rex Douglass decided to be a glutton for punishment and really dig into a COVID study. In this beautifully detailed report, he shows how said economist’s paper is based on complete mishandling of big data sets, averaging things that shouldn’t be averaged, and making arbitrary decisions to make their desired conclusion look better.

As Rex has noted, this process of debunking isn’t sustainable. Rex, with no gain for himself, spent hours digging through the paper’s code, wrote a long blog post, and yet the original paper remains published. This is the norm. Most papers don’t get retracted. All of the authors still get gold stars on their CVs. The press will write up studies as if they are true, and politicians will cite them in their policy briefs. If you follow the rules and rituals, you can do science solely for political talking point generation. That’s deeply dangerous.

I don’t want to try to argue that debunking should be made more sustainable. I have a more radical proposal. What if we start from the assumption that everything in quantitative social science is wrong? What if we just ignored these papers and used our eyes? I’m not saying that’s perfect. I’m not saying that this will avoid moral panics. I’m just saying that The Science is making the situation worse.

Nicholas Wilkey challenged me about what to do instead. “There are endless critiques, but precious little positive solutions. I say this as someone who works in policy, rather being an academic.” This position is common, even among academics. I’ve been arguing (collegially) with my friend Avi Feller about this for years now. I have some thoughts about this based, oddly enough, on the writing of Karl Popper. I’ll share in a future post.

But for today, let me quote David Graeber (making his second appearance in this blog series), who, as usual, sums up my position better than I can. In Bullshit Jobs, he writes:

“Another reason I hesitate to make policy suggestions is that I am suspicious of the very idea of policy. Policy implies the existence of an elite group—government officials, typically—that gets to decide on something (‘a policy’) that they then arrange to be imposed on everybody else.”

The problem is that scientific elites don’t know better. Our capabilities to know and understand human behavior using the scientific method have been shown to be so deeply fallible that I don’t even know where to begin. And yet policymakers cite these papers in their briefs, experts get quoted in court documents, and we race to the bottom to find “evidence” that justifies whatever position we want. This leads to, as Tyler says, gaslighting. Anyone with working eyes sees one thing, but you can find any random scientist to cook a study to claim “the evidence” points the other way. Again, we shouldn’t be pointing to an uninterpretable mess as evidence for anything. If those techniques are failing us, why should we outsource decisions to The Science? Especially when answers are staring us in the face.

[1] I poked around, and most of the commentary was hating on Jonathan Haidt. I get it. Haidt is a preachy tut-tutter. But the counter-evidence is terrible, as expected. In this Nature editorial, for example, the author cites not only a bunch of random meta-analyses but also a bizarro study using fMRI to spot brain changes from screen time. When you are leaning on fMRI to make your case, you have lost the argument.

Growing Evidence

Ben Recht — Thu, 27 Jun 2024 14:25:03 GMT

This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.

The second theme of Lecture 8 is moving beyond significance testing altogether. Hear hear. I worry that science reformers spend too much time asking for rigid, robotic “repeatability” of the same experiment rather than creatively thinking through the different potential predictions and consequences of the same theory. Repeatability spends too much time trying to nail down the experimental conditions clause in the derivation chain. We need ways of probing higher up the chain, among the auxiliaries and theoretical constructs. This can only happen by generating and testing diverse, risky predictions.

Indeed, Meehl thinks we are overly optimistic about what we prove when we do a significance test. As we’ve discussed, significance tests only evaluate whether a treatment is more correlated with the outcome than pure randomness. If I tell you “under the assumption that the treatment was a completely random assignment and didn’t do anything, the probability of seeing your result or greater is 5%,” what do you conclude exactly? What did you corroborate?
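To make that conditional statement concrete, here is a minimal sketch of a randomization test in Python. Everything in it is an assumption for illustration: the function name `permutation_p_value` is mine, and the toy numbers are made up and come from no study. It computes exactly the quantity quoted above: if the treatment labels were assigned at random and did nothing, how often would a difference at least as large as the observed one show up?

```python
import random
import statistics


def permutation_p_value(treated, control, n_perm=10_000, seed=0):
    """One-sided Monte Carlo randomization test for a difference in means.

    Hypothetical helper for illustration, not a library function.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    k = len(treated)
    hits = 0
    for _ in range(n_perm):
        # Under the null, the labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:k]) - statistics.fmean(pooled[k:])
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up outcomes, purely illustrative.
toy_treated = [5.1, 6.0, 5.8, 6.3, 5.5, 6.1]
toy_control = [5.0, 5.2, 4.9, 5.6, 5.3, 5.1]
print(permutation_p_value(toy_treated, toy_control))
```

Note what the number does and doesn't say. It is a statement about shuffled labels. It says nothing about which theory produced the difference.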

The most common reply I get when I rant about the bizarre primacy of Fisher’s exact test is “People have to do something, no? You are arguing for doing nothing.” This is a strawman neither Meehl nor I subscribe to. There are so many possible ways to reach a Damned Strange Coincidence, but having to stick a p-value on them squeezes our brain into thinking about the most primitive sorts of tests, ones that don’t tell us much of anything. Of course there are other things you can do! Meehl puts it this way:

“We are overly optimistic about what you prove when you do a significance test, but when pressed to do something more than that, psychologists and the soft areas at least tend to be pessimists about whether you could do any better than that.”

Meehl’s ideal case, which perhaps might be asking for more than we can get, is the sort of point predictions you might get from a physical theory. A theory that can predict diverse outcomes to high precision.

While they are important and worth learning, the physics examples are wanting. First, getting the same precision as classical mechanics is likely out of the question in social science. To his credit, Meehl had put forward various suggestions of imprecise predictions that would be informative. For example, in the abstract of his famous “Theoretical Risks and Tabular Asterisks” paper, he writes

“Multiple paths to estimating numerical point values (“consistency tests”) are better [than significance tests], even if approximate with rough tolerances; and lacking this, ranges, orderings, second-order differences, curve peaks and valleys, and function forms should be used.”

Second, while fundamental physics had an unprecedented run of success from 1800 to 1970, it has been in a rut since 1970.[1] We should find inspiration from other fields to provide useful suggestions for human-facing sciences. Let me now give one of my favorite examples from the human-facing sciences.

In 1906, biochemist F. Gowland Hopkins was investigating the essential chemical composition of sustainable diets. As was the style at the time, he ran controlled rat studies comparing the value of different diets. In one such study, Hopkins fed a control group only bread and the treatment group bread and a tiny amount of milk. He charted their weight for 18 days and then swapped their diets. The following plot won him a Nobel Prize.

In this plot, each dot is the average weight of the rats in one group on a particular day (the x-axis is days, the y-axis is grams). The white dots mark the bread-only group, the black dots the bread and milk group. It’s beyond evident there is some component of the milk needed for growth. The fact that you see a trend reversal when you switch the groups’ diets is particularly compelling. The group withheld milk always fails to thrive. There is something necessary for growth in the milk. Today, we know it as Vitamin A.

I guess we’d call Hopkins’ experiment a “crossover design.” But how would you compute a p-value? What’s the right way to test whether two growth curves are “statistically significantly different?” I don’t think statisticians have a good answer for us! There have been plenty of proposals but no consensus answer. And why would you compute a p-value? It’s so clear that something is happening in this experiment. I suppose we could just compare the percentage growth between the treatment and control groups. I did this. Even though there were only eight rats in each group, the p-value of the t-test was 0.

You might now say, “Well, we never find interventions with p=0 in our modern complex world.” But this couldn’t be further from the truth. Here’s a plot of growth curves I found in the New England Journal of Medicine in 2021.

This plot, of course, is from the clinical trial report on semaglutide. This curve tracks the average growth! It’s identical to Hopkins’ error-bar-free visualization. I love it. Since this study had to conform to the rigid standards of the FDA and the NEJM, the paper reports discrete primary outcomes. They looked at the percentage change in body weight and whether weight loss was greater than 5%. And what’s the p-value for these outcomes? It’s “less than 0.001.” How much less? You can compute the z-scores of the endpoints by inverting the confidence interval. z equals 26 for the percentage change and 25 for the at least 5% weight loss. Anything over 7 means p=0. z=25 means p is very, very zero.
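For anyone who wants to try the confidence-interval inversion themselves, here is a minimal sketch in Python. It assumes the reported interval is a symmetric normal-approximation interval, which is my assumption and not something the trial report guarantees. The function names are mine, and the example numbers are invented placeholders, not the trial's figures.

```python
import math
from statistics import NormalDist


def z_from_ci(estimate, lower, upper, level=0.95):
    """Back out a z-score from a point estimate and a symmetric normal CI."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    standard_error = (upper - lower) / (2 * crit)
    return estimate / standard_error


def two_sided_p(z):
    """Two-sided normal tail probability for a z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


# Invented placeholder numbers: a difference of -12.0 with 95% CI (-13.0, -11.0).
z = z_from_ci(-12.0, -13.0, -11.0)
print(z, two_sided_p(z))

# For scale: the tail probability at z = 7 is already on the order of 1e-12.
print(two_sided_p(7.0))
```

A narrow interval far from zero is all it takes. The z-score is just the estimate measured in standard errors.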

What do you think would have happened had they run a cross-over design like Hopkins? We know the answer as another 2021 trial reported in JAMA showed that switching from semaglutide to placebo resulted in weight gain.

Though these are different studies, I think we can stitch the Damned Strange Coincidence together with our feeble inferential capacities.

Obviously, interventions like GLP-1 agonists come along rarely. But they do come along! And what’s weird about cases like this is that they make you ask for examples of pivotal studies where p was 0.04. What about a really pivotal study with p=0.01? If you have a favorite example, put it in the comments. And while we’re at it, what are examples in economics where z is more than 10?

We should think about how we find and learn from controlled experiments with p=0. What can we learn from these discoveries? What are the different methodologies we’ve used to demonstrate such stark intervention effects (Damned Strange Coincidences)? What characterizes results where the statistics play no role in the corroboration? Just like historians of science think there are lessons to learn from the Newtons and the Einsteins, we have lessons to learn from the discovery of Vitamin A and semaglutide.

[1] It’s a bit of a subcurrent in this blog series, but one of the ideas I’m most obsessed with is understanding a metatheory of science grounded in the successes since 1970. It’s quite possible we need something wholly new for the information age.

Reproducing The Blue Screen of Death

Ben Recht — Tue, 25 Jun 2024 14:35:43 GMT

This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.

I wrote yesterday that we should be pedantic about the distinction between reproducibility and replicability. A result is reproduced when someone applies the same analysis to the same data and finds the same result. A result is replicated when someone reimplements the same study and finds the same result. Reproduction should be mandatory, but replication is far more subtle.

Now, if you really want to get pedantic, reproduction turns out to be more subtle than I made it out to be. Never forget, computer scientists can always be more pedantic than you can. Commenter Chris made a brilliant point that I will try to articulate here for all of you who aren’t computer scientists. And thanks to computer scientists Eric Jonas, Chris Re, Stephen Tu, and Shivaram Venkataraman for helping me flesh out this articulation!

Let’s say that a paper comes with a code repository that uses version 9 of Stata. I open my current copy of Stata 17 and find their scripts don’t run. I dust off the installation compact disk with Stata 9, find my external CD drive, and try to install the old Stata version. But now I find MacOS Sonoma doesn’t support the old binaries. Shoot. Now I get into the crawl space under my house and pull out my iMac G5… Anyway, you see where this absurdity goes. If no one can run the code without finding a computer and software from 2005, is this result reproducible?

Suppose I take their code and port it to Python. I’m able to recreate all of the graphs in the original paper and most of the tables, too. But porting code is a reimplementation, and hence, my analysis isn’t technically the same. Is that a reproduction or a replication?

For complex code pipelines, reproduction can be an issue on much shorter timescales. Chris told me about how Huggingface's model reproduction pipeline maintains Python scripts that build and change models. These scripts are not allowed to be updated. If there are bugs, or if a version of some package is updated, you might lose the model forever.

And when you get really down into the weeds of things, you discover subtlety in what we could even ask for. Computers, which we treat as these cold, precise, infinitely reproducible calculating machines, are a mess of ambiguity. Most people never think about it, but “real number” algebra on computers, the computation we use to simulate reality or calculate statistical integrals, is based on digital processes. These digital approximations produce compounding errors.

We use a convention called “floating point arithmetic” to approximate real numbers with ones and zeros. If not carefully implemented, floating point arithmetic is decidedly hard to reproduce. Because silicon fixes the number of bits you can use to represent a number, you throw away some of your computation at every step. For intuition, take this caricatured example: suppose we can only represent numbers by one digit and a power of 10. Then even though 30 times 80 really equals 2400, the computer can only use the first digit and stores the outcome as 2000. 900 times 70 is stored as 60,000. And so on. Floating point math does a similar sort of truncation, though with many more digits.[1] These truncations break some of the basic laws of math that we take for granted. For example, the identity

(a + b) + c = a + (b + c)

is not exactly true because of floating point truncation (i.e., floating point arithmetic is not associative). Little issues like this mean that a compiler or interpreter might read the same line of code, implement it in two slightly different ways, and yield a slightly different number. Perfect, bit-for-bit reproduction becomes impossible.
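Here is a minimal Python sketch of both points: the one-digit caricature and the failure of associativity in ordinary double-precision arithmetic. The helper names `truncate_to_one_digit` and `toy_mul` are mine, invented for this illustration. The toy format is the caricature from the paragraph above, not how real hardware stores numbers.

```python
def truncate_to_one_digit(n: int) -> int:
    """Keep only the leading digit of a non-negative integer (toy number format)."""
    if n == 0:
        return 0
    power = 10 ** (len(str(n)) - 1)
    return (n // power) * power


def toy_mul(x: int, y: int) -> int:
    """Multiply, then throw away everything but the first digit."""
    return truncate_to_one_digit(x * y)


print(toy_mul(30, 80))   # 2000, not 2400
print(toy_mul(900, 70))  # 60000, not 63000

# Grouping changes the toy answer. The true product 90 * 70 * 20 is 126000.
print(toy_mul(toy_mul(90, 70), 20))  # 100000
print(toy_mul(90, toy_mul(70, 20)))  # 90000

# The same thing happens with real doubles, just many digits further down.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
```

Neither grouping is wrong. The two are simply different roundings of the same real number, and a compiler is free to pick either.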

Computer scientists have noted that if two people compile the same code, there isn’t an easy way to check whether the artifacts are bit-for-bit the same function. They are, in fact, most likely not identical. Does it matter? The answer is usually no. But sometimes yes! What counts as reproduction then? Where do we draw the line? Yesterday I wanted to argue for pedantry, but now am back arguing for pragmatism.

A simple eye test suffices here. Reproduction is just about communication. It is you showing me what you did. It is a way that I can check your work. Replication, however, is about robustness. It is about how your findings and predictions change due to small changes in the derivation chain. Let’s apply this dichotomy to software. Reproduction demands that you have something at the time you did the work to show that it works. If your code doesn’t explain why you dropped ten thousand of the forty thousand units in your final analysis, then your result is not reproducible. On the other hand, if I change the random seed in your code and am unable to reproduce your figures, your work isn’t replicable. Your general scientific theory is probably very fragile. IYKYK.
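The random seed check is easy to sketch. In the Python below, `run_analysis` is a hypothetical stand-in for a paper's pipeline. It simulates noisy data and returns an estimated effect, and none of it comes from a real study. The first check is reproduction: same code, same seed, same number. The second is a crude robustness probe: change only the seed and see how far the answer moves.

```python
import random
import statistics


def run_analysis(seed: int, n: int = 200, true_effect: float = 0.5) -> float:
    """Hypothetical pipeline: simulate two noisy groups, return the mean difference."""
    rng = random.Random(seed)
    treated = [true_effect + rng.gauss(0, 1) for _ in range(n)]
    control = [rng.gauss(0, 1) for _ in range(n)]
    return statistics.fmean(treated) - statistics.fmean(control)


# Reproduction: rerunning with the same seed gives the identical number.
assert run_analysis(seed=0) == run_analysis(seed=0)

# Robustness: perturb only the seed and look at the spread of estimates.
estimates = [run_analysis(seed=s) for s in range(20)]
print(min(estimates), max(estimates), statistics.stdev(estimates))
```

If the spread across seeds is as large as the effect you are reporting, the figure in the paper is one draw from a lottery.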

In computer science research, we’re in a far better state of reproducibility than ever. Sure, old machine learning models get deprecated, but the models that persist are the ones that are robust to code perturbations. You want a model to be reproducible to get your audience’s attention. You want it to be robust (and replicable) to keep their attention.

While the line between reproduction and replication is sort of grey, it’s just not that grey in the vast majority of cases. You should have code that’s runnable by the reviewers and colleagues. They should be able to reproduce all of the numbers in your paper, which means that the floating point issue shouldn’t propagate to the level of differences in the numbers and plots in the published report. If your code doesn’t run a week from now, then that’s on you. It’s perfectly fine and fair for us to dismiss papers that are fragile. This is pedantic enough for me.

[1] Except in neural nets, where this single-digit floating point is what we do these days.

Replication Versus Reproduction

Ben Recht — Mon, 24 Jun 2024 15:07:32 GMT

This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.

Leif Weatherby reminded me that there is a distinction between Reproduction and Replication that’s worth repeating in the context of Meehl’s lectures. I realize now that I conflated these in my last post, and that’s a grave error. Replicability and reproducibility are fundamentally different demands on experimentation and are important for fundamentally different reasons. I’m going to take this post to delineate the meaning of two words with the same first three letters.

Let me yield to leading metascience experts for the definitions of two rep-concepts. Nosek et al.’s Annual Reviews article “Replicability, Robustness, and Reproducibility in Psychological Science” provides solid definitions. Reproducibility means that “if someone applies the same analysis to the same data, the same result should occur.” Replication means that if someone does “the same study again,” “the same outcome recurs.”

By these definitions, reproducibility is such a small ask! If someone releases a repository of code and data, all of the analyses in the main paper are easily reproduced. I’ve become a major stickler for reproducibility. There are no excuses for scientific results to be unreproducible. I’ve seen instances where people release the plotting command to generate plots from (x,y) pairs but never tell you how they computed the xs and ys. That is simply not enough. There should be a complete, well-commented pipeline that takes the rawest possible data and produces a rendering of the paper. Every scientist produces their papers with computers: after data collection, all of the analysis, visualization, and writing is computerized. Hence, it is by nature completely reproducible. Reproduction just demands scientists log the steps they take from data to journal submission.
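As a rough sketch of what logging the steps could look like, here is a small Python helper that writes a manifest next to an analysis: the interpreter version, the platform, and a SHA-256 hash of every input and output file. The names `sha256_of` and `write_manifest` and the file names are hypothetical. This is one possible convention, not an established standard.

```python
import hashlib
import json
import platform
import sys
import tempfile
from pathlib import Path


def sha256_of(path) -> str:
    """Hash a file's bytes so a reader can confirm they have the same data."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_manifest(inputs, outputs, manifest_path):
    """Record the environment plus hashes of every input and output file."""
    manifest = {
        "python": sys.version,
        "platform": platform.platform(),
        "inputs": {str(p): sha256_of(p) for p in inputs},
        "outputs": {str(p): sha256_of(p) for p in outputs},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


# Tiny demonstration in a throwaway directory with a made-up raw data file.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_text("x,y\n1,2\n")
print(write_manifest([raw], [], workdir / "manifest.json")["inputs"])
```

A manifest like this doesn't make the analysis correct. It only makes it possible for someone else to confirm they are running the same thing on the same data.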

Replication, on the other hand, is far harder to pin down. Let me give the full quote of Nosek et al.’s definition,

“Replication seems straightforward—do the same study again and see if the same outcome recurs—but it is not easy to determine what counts as the same study or same outcome.”

In their definition of replication, they tell you that it’s hard to define. I love it. The very next sentence states, “There is no such thing as an exact replication.” And here lies the rub. I can tell you what an exact reproduction is. It’s running the same code on the same input twice. Demonstrating reproduction just requires sharing data and code. But even metascientific experts can’t tell you what exact replication is. You can’t repeat the same experiment with the same researchers with the same experimental subjects at the same chronological time. Something has to change.
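Here is a minimal sketch of what an exact reproduction check could look like: run the same analysis on the same raw input twice and compare digests of the output. The `analyze` function and the toy data are hypothetical stand-ins, not anything taken from Nosek et al. or from Meehl.

```python
import hashlib
import json
import statistics


def analyze(raw_rows):
    """Hypothetical analysis step: group means from raw (group, value) rows."""
    groups = {}
    for group, value in raw_rows:
        groups.setdefault(group, []).append(value)
    # Sort keys so the output does not depend on row order within the input dict.
    return {g: statistics.mean(v) for g, v in sorted(groups.items())}


def digest(result):
    """Hash a canonical JSON rendering of the result."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Toy raw data, invented for illustration.
raw = [("control", 1.0), ("treated", 2.0), ("control", 2.0), ("treated", 4.0)]

first_run = digest(analyze(raw))
second_run = digest(analyze(raw))
reproduced = first_run == second_run
print("exact reproduction:", reproduced)
```

Nothing comparable exists for replication, which is the point: there is no digest you could compare across two different runs of an experiment.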

And here lies the importance of the theory that Meehl hammers over and over again in his lectures. Jump back to when we were discussing derivation chains. We had a theory, auxiliaries, instruments, ceteris paribus assertions, and experimental conditions. The last three are always different in two different experiments. The question is a matter of how different they can be so that the predictions come out the “same.”

If a theory predicts experimental uncertainty, then we wouldn’t expect the “same” outcome even if we had a time machine that allowed us to repeat the same experiment. If my nomological derivation chain makes a probabilistic prediction, even in the fake reality where I can frequentist-style repeat my experiment ad infinitum, I’ll see different outcomes in different realizations. If you have a shaky ceteris paribus clause or very complicated experimental conditions, you add even more uncertainty to ever seeing the same result. We can only vaguely justify repeatability. While reproductions can and should be exact, replications cannot be.

The distinction between reproduction and replication is so key and important that we should be pedantic sticklers for the jargon. I don’t want to call anyone out, but do a search for “reproducibility crisis,” and you’ll find a lot of people talking about replication. Reporters and many scientists use the words interchangeably. They are not interchangeable. Replication and reproduction are asking for such incomparable things that we need two terms.

There is no “crisis” in reproducibility. We have a reproduction problem because people don’t always produce the most usable pipelines (I got in a bunch of trouble pointing this out in January). But a reproduction problem is so easy to fix. We just have to hold ourselves to a higher standard when communicating data. We could solve the “reproducibility problem” by the end of business today.

The replication crisis is a whole other matter. In my predictably contrarian opinion, the replication crisis is overblown. Yes, studies in many disciplines fail to replicate. But studies have always been hard to replicate. Some of the most important insights follow when thinking about why a replication failed. A failure to replicate can be more informative than replication itself. Failed replication leads to fights about theoretical derivation chains and refinement or augmentations of theory. Lakatosian Warfare progresses because there are no perfect replications. It’s through the contradictions and conflicts that research programs move forward. Failure to replicate is core to scientific advancement.

Don’t get me wrong, there’s a problem if a field spends decades producing results that are fragile to the slightest change in experimental conditions. In such fields, I worry that contemporary metascience spends too much time arguing that better methods will fix the problems. It’s pretty clear from the outside that if a field spins itself in circles failing to replicate experiments, “science” won’t solve this community’s problems. Call me crazy, but some parts of the world can’t be mathematicized or sciencified. I’ll expand on this argument in a future post.

Bringing this back to the Lectures, Meehl makes similar claims. Replication for Meehl is a means to deal with the variability of experimental conditions and the ceteris paribus clause. If a theory yields good predictions for a variety of experimental conditions and under loose ceteris paribus restrictions, then it’s a Salmonian “Damned Strange Coincidence.” These theories get corroborated.

By contrast, reproduction is a means to better understand scientific data. If we’re only given p-values and lumped statistics, how can we answer skeptical questions about the data analysis? Reproduction also makes it possible for other scientists to detect analysis bugs. As I’ve hammered, contemporary scientific derivation chains are deeply dependent on software validity. Making it possible for others to find coding bugs is thus critical and straightforward.

In his lecture, Meehl explicitly argues for replication but only implicitly argues for reproduction. To Meehl’s credit, it was much harder to produce a git repository in 1989, as git wouldn’t be invented until 2005. Enforcing reproducibility was somewhat challenging in 1989. It is trivial now. Failures in reproduction should today be inexcusable. If journals and conferences required registration of quality code and data, reproduction failures should never happen. The main findings of the paper should follow from a processing chain from the data to the plots and text. This should be clear and trivial to mechanically reproduce by peer reviewers. Reproducibility is such a minor request, and fields only harm themselves by resisting it.


Towards Interpretable Literature

Ben Recht — Fri, 21 Jun 2024 15:03:17 GMT

This post digs into Lecture 8 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.

In Lecture 8, Meehl puts forward a few suggestions on how to make the scientific literature more interpretable. He is not proposing ways to make research “better.” We can never get to the space where we’ll all agree. If we didn’t disagree, we wouldn’t be doing science in the first place. But the obfuscators Meehl listed in Lectures 6 and 7 made it so that the social scientific literature provided almost no information about what’s true and false. That means we can’t even have sensible arguments. Meehl wants the literature to better inform arguments about theories and beliefs about the effectiveness of treatments. Meehl’s suggestions in Lecture 8 are to make Lakatosian warfare possible.

I’ve honed Meehl’s suggestions down to three main themes.

Improving reproducibility
Moving beyond hypothesis testing
Publishing less

Long-time readers will note that I have already blogged favorably about those three suggestions. Is that because I’m hearing what I want to hear when I listen to Meehl? Or is that because he was screaming into the void in the 80s, and we should have listened to him then? Could it be both? Let’s all buckle up for the next three posts to find out! I want to explore which of Meehl’s suggestions were heeded and what we might do to further implement them today.

Improving reproducibility

Meehl was banging the drum for better reproducibility earlier than most. He suggests that investigators should even be required or strongly encouraged to replicate their own results. They could publish the results of a pilot study alongside the main study. A study would be more compelling if it provided two independent measurements of the same effect on two dissimilar datasets. Requiring two measurements would necessarily set a higher bar for publication, but it would also ensure considerably stronger evidence for the tested theory.

While asking for more experiments is a high bar, asking for more information is not. And that’s Meehl’s second suggestion. Meehl wants journals to require authors to provide more information. It’s amazing to hear the sorts of practices that seemed to be allowed in the 1980s. Meehl claims that people could report “significant at some level” and never tell you the mean difference between the groups. That seems preposterous. It’s not much better to report that mean difference with only an asterisk denoting significance. This was also commonplace in social science. The reader would never see the standard errors or the p-values. I’d be curious to hear from folks in psychology how common these practices remain.

It’s so cut and dried that this shouldn’t be allowed. If you’re going to run a hypothesis test, why not report everything? Say what the test is. State the standard error. State the p-value. Give a confidence interval. Meehl is right that you can compute any of these numbers from any other, but you should at least report one to high enough precision to do so. And why force a reader to open R or Python? Just list them.
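To make the “compute any of these numbers from any other” point concrete, here is a minimal sketch. It assumes a large-sample normal approximation (a z-test), which is my simplification and not something Meehl specifies; the function names and the example numbers are hypothetical.

```python
from statistics import NormalDist


def summarize(mean_diff, std_err, level=0.95):
    """Derive z, a two-sided p-value, and a confidence interval
    from a mean difference and its standard error (normal approximation)."""
    std_normal = NormalDist()
    z = mean_diff / std_err
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    half_width = std_normal.inv_cdf(0.5 + level / 2.0) * std_err
    return {
        "z": z,
        "p_value": p_value,
        "ci": (mean_diff - half_width, mean_diff + half_width),
    }


def std_err_from_ci(lower, upper, level=0.95):
    """Going the other way: recover the standard error from a reported interval."""
    return (upper - lower) / (2.0 * NormalDist().inv_cdf(0.5 + level / 2.0))


# Invented numbers for illustration.
report = summarize(mean_diff=1.2, std_err=0.5)
print(report)
print("standard error recovered from the interval:", std_err_from_ci(*report["ci"]))
```

The round trip only works if the reported number carries enough digits, which is exactly why a lone asterisk is useless to the reader.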

Beyond this, Meehl says that papers should give a sense of the shape of the distributions of the two groups. Pictures are probably even more informative than the raw statistics. We know that most natural phenomena are not really Gaussian and linear. Investigators should plot the histograms so that people can understand group overlap and skew. Visual statistics are far more compelling and informative than test statistics.

Meehl additionally argues that investigators should be required to measure and report nuisance variables that they don’t think should be causally affected by treatment. If the measurement of the effect size is on the same order as the nuisance variables, perhaps this means that the study failed to corroborate the investigator’s theory. Here, I’d argue we’re in a better state now than in the 1980s. Most epidemiology papers I’ve looked at have extensive tables of variables comparing different groups. They tend to list the associated p-values with the group differences. Economics papers now have hundreds of pages of sensitivity and robustness checks to validate their causal claims. Papers come with all sorts of pretty plots. This is all a step in the right direction.

But it’s not enough for replication. Meehl is asking for as much information as possible in a paper. Why not take this to its logical limit? In 2024, there is no excuse for papers not to come as git repositories. Every paper should include a repository of readable, runnable, commented code and as much data as possible. Ideally, this repository should trace all steps from data extraction to statistical analysis. The data should be in its most primitive, unaltered state. This way, the interested reader can view the data from whatever angle they want. The authors of the paper can make their argument about what we should see, but everyone else should be able to run their own analyses.

I’d argue that we should make the papers themselves shorter! I don’t want to flip through people’s robustness analyses in an endless pdf file. I’m not sure why anyone puts up with these appendices. I mean, at this point, don’t we all think it’s odd that robustness analyses always come back in the author’s favor? There’s a reasonable alternative to such exhaustive sensitivity analyses. Just give out your code and data so the skeptic can see what’s under the hood. And if investigators were really committed to their robustness checks, they could include them in a folder in their repository in a nice interactive notebook. I’m all for it.

There are no good arguments against this sort of reproducibility. Certainly, “proprietary data” is an absurd argument. If your data is proprietary, I don’t believe your results. You are trying to sell me something, so no paper for you.

A more tricky argument is made in medical research: data can’t be released because “privacy.” This argument derives from a mindless, shallow reading of the Belmont Report. I fully endorse that respect for people and beneficence dictate that investigators respect people’s desire for privacy in studies. But how real are the privacy concerns behind revealing counts in randomized trials? Why can you request the data from drug trials from the FDA but not device trials? Why are other clinical trials or random EHR data mining exercises impossible to access? Does it actually benefit patients in the study that we can’t check investigators’ work? Does privacy outweigh the potential for hiding fraud? We should discuss these questions seriously and in depth.

Now, I’m actually optimistic here. One of the few good things to come out of the international covid response was a broader embrace of preprint servers by the human-facing sciences. If medicine can embrace preprints, it can embrace code sharing and open data too. The future of scientific publication must bend towards open repositories. We’re on the right track there, but let’s continue to pressure our colleagues to keep moving in the right direction.


Loose ends

Meehl starts off the lecture with modest advice that is so uncontroversial that it’s astounding it’s still often not taken. Though aimed at observational studies, these suggestions should also apply to every randomized trial or other interventional experiment. First, every investigation should begin with an estimate of the effect size needed to strongly corroborate the proposed theory. A mere directional prediction is far too weak. Second, studies should be powered at the 90% level to detect this effect. Third, that power calculation should be explicitly written down.

This is all perfectly reasonable, and I’m sure almost every methods class teaches something along these lines. And yet I found a bunch of violations of these principles in a cursory glance at my Zotero this morning. Though power concerns used to bug me, I’ve become more relaxed about this over time. These particular suggestions are just lipstick on the hypothesis-testing pig. Patched-up hypothesis testing is still just hypothesis testing. Hypothesis testing is the problem! That’s probably why Meehl doesn’t dwell too deeply on it. And that’s why I’ve relegated the discussion to this footnote.
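For what it’s worth, the power calculation Meehl wants written down takes only a few lines. This is a sketch under assumptions of my own choosing: a two-sided, two-sample z-test with equal group sizes and a standardized effect size. The function name is hypothetical, and Meehl’s 90% figure enters as the default `power`.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, power=0.90, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample z-test
    to detect a standardized mean difference `effect_size_d`."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Hypothetical target effect sizes.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

Everything hinges on the first input, the effect size you commit to in advance, which is why a mere directional prediction gives you nothing to plug in.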

Imperfect Assessments

Ben Recht — Tue, 18 Jun 2024 14:16:43 GMT

This post digs into Lecture 7 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.

Today, let’s breeze through Meehl’s final four obfuscators of observational null hypothesis testing. This can be breezy in part because Meehl has already spoken at length about two of them (here and here): Selective Bias in Submitting Reports, and Selective Editorial Bias. Let me not spend more time on them today.

So we’re down to two, Pilot Studies and Detached Validation Claims.

Pilot Studies

Before you invest a ton of money in a major data collection effort, it makes perfect sense to run a baby study to see if there’s any hope of the result panning out. Such pilot studies are where you might test whether your code or device works, get some qualitative feedback on the design, and get a sense of how large the effect of your treatment is. Meehl argues that pilot studies are valuable and likely necessary exercises to nail down the technical foundations of a good experiment. I don’t know anyone who disagrees with this. However, if the outcome relies on null hypothesis testing, pilot studies have a pernicious paradoxical downside.

If the pilot doesn’t pan out, it could be because the pilot is underpowered! Pilot studies are necessarily small. They might be so small that they have a high false negative rate. In advance, you don’t know how large the effect should be. So unless the intervention works without fail, the pilot might yield no effect. Since it’s just a pilot, researchers are more willing to file-drawer the finding and move on to their next clever experimental idea.

On the other hand, given the crud factor, false positives will be abundant in pilot studies. Taking power functions seriously, researchers will make their full studies big enough to surely replicate their pilot findings. That is, pilot studies might influence researchers to set their main study size large enough to reject the null hypothesis because of a crud effect.

This leads to a bit of a nightmare. Good theories are getting screened out at the pilot stage due to insufficient power. Bad theories are getting accepted at the main stage due to crud. If this were the case, random theories with no verisimilitude would be consistently corroborated in published results.

Detached Validation Claims

Meehl’s final obfuscator is that we forget that measurements are often very imperfect representations of the treatments and outcomes we aim to study. For example, in psychology, many measurements come from psychometric tests. These tests have multiple, universally accepted issues. First, the correlation between the test score and the trait you care about is often low. Test builders might find a Pearson r as low as 0.4, but still deem the test useful enough for some aspects of clinical practice. To make matters worse, test-to-test reliability is often low, with the Pearson r between two versions of the same test–or even two administrations of the same test–being as low as 0.8. This means that the test scores are often a weak proxy for the trait you are testing.

This weak correlation is problematic, but it’s even worse when researchers forget it is low. Meehl notes that in psychology, researchers will write in the methods sections that “this test was validated in reference 11.” But then they’ll just report hypothesis test results on the test scores, completely dropping that the test has far from perfect validity and reliability. With the numbers of the example I gave above, the true trait might have a fraction of the measured effect size using the imprecise test. That fraction could be as low as 0.1. Few significance tests pan out if you need to divide the z-score by 10.

More broadly here, the issue is understanding what the measurement is telling you about the outcome you care about. Across the board in the human-facing sciences, we are faced with imperfect outcome measurements. A personal favorite of mine is “progression-free survival” (PFS) in cancer studies, as no one seems to know what that outcome means with regard to the health and well-being of a patient, but you can get drugs approved if you can improve PFS.

Though Meehl was reticent to argue against them, the four final obfuscators also plague randomized trials and other interventional experiment designs. In fact, seven of Meehl’s ten obfuscators are major issues in general experiments. I could make the case that many randomized experiments are plagued by problematic ceteris paribus assertions, experimenter error, insufficient power, incorrect conclusions from pilot studies, selective bias in submitting reports, selective editorial bias, and detached validation claims. I could argue that the first two of Meehl’s obfuscators—loose derivation chains and problematic auxiliary theories—lead to poor experimental design choices and poor statistical analyses in interventional studies. So that’s 9 out of 10? Could I even make the case that crud might oddly impact randomized trials? Was Lykken right? Yikes. I’ll definitely come back to this.

But first, let me stick with observational studies. Given his long list of obfuscators, Meehl leaves us asking what we should do. One could argue “stop null hypothesis testing,” but no one wants to go as far as “shutter all quantitative research in social science.” In Lecture 8, Meehl proposes some fixes. It’s interesting to see which have been adopted, which remain untried, and which have had positive impact. In the next few blogs, I’ll not only talk through Meehl’s suggestions, but will propose a few of my own.


A Credible Junta

Ben Recht — Fri, 14 Jun 2024 14:31:29 GMT

This post contains some of my own thoughts reacting to Paul Meehl’s course “Philosophical Psychology.” I’m taking a brief interlude from my run-down of Lecture 7. Here’s the full table of contents of my blogging through the class.

Can we fix the crud problem with more math? In many ways, that’s what the “credibility revolution” in economics set out to do: build a more sophisticated statistical tool kit that accurately teases out cause and effect when properly deployed. As Guido Imbens and Don Rubin put it in the introduction to their 2015 text Causal Inference for Statistics, Social, and Biomedical Sciences,

“In many applications of statistics, a large proportion of the questions of interest are fundamentally questions of causality rather than simply questions of description or association.”

Imbens and Rubin map a path for answering questions about epistemology using statistics:

“All causal questions are tied to specific interventions or treatments.”
“Causal questions are viewed as comparisons of potential outcomes.”
Comparisons of potential outcomes can be computed by careful estimation of average treatment effects.

Hence, all questions of interest in human-facing sciences are reduced to estimating effects in randomized experiments—whether or not a randomized experiment actually occurred. This means that the “gold standard” of causation remains null hypothesis testing. And that means that the entire discipline is based on correlation (a.k.a. description and association) and complex mathematical stories.

You don’t have to take my word for it. If you look at what the causal inference methods do, you will see that everything rests on null hypothesis testing. I mean, most of the estimates are built upon ordinary least-squares, and all least-squares estimates are combinations of correlations.

Let me give a simple example of an often-used estimator: the Local Average Treatment Effect (LATE). LATE uses “Instrumental Variables” to tease out causal relationships. You care about whether X causes Y, but you worry there are lots of confounding factors in your observational data set. To remove the confounding factors, perhaps you could find some magic variable Z that is correlated with X but uncorrelated with all of the confounders. Maybe you also get lucky and can argue that any effect of Z on Y has to pass through X (to be clear, you spin a story).

Economists have a bunch of crazy ideas for what should count as instrumental variables. Caveat emptor. My favorite example of an instrumental variable–one of the only ones I believe in–comes from randomized clinical trials. In a medical trial, you can’t force a patient to take the treatment. Hence, the randomized treatment is actually the offering of a treatment a trial aims to study. In this case, Z is whether or not a patient is offered treatment, X is whether the patient takes the treatment, and Y is the outcome the trialists care about.

But let me not dwell on instrumental variable examples. I wrote more about it here and here. I actually really like Angrist, Imbens, and Rubin’s original paper on LATE. For today, I want to show why this is still just correlation analysis. The standard instrumental variable estimator that estimates the influence of X on Y is

$$\hat{\beta}_{\mathrm{IV}} = \frac{\widehat{\operatorname{Cov}}(Z, Y)}{\widehat{\operatorname{Cov}}(Z, X)}$$

It’s a ratio of correlations. The standard way to “test for significance” of this effect is to do a significance test on the numerator. If it passes, you add two stars next to the correlation in the table. In an instrumental variable analysis, we changed the story but still just computed a correlation and declared significance if the number of data points was large enough.
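Here is a small simulation sketch of that ratio, assuming the textbook single-instrument (Wald) form Cov(Z, Y) / Cov(Z, X) and the clinical trial setup described above. The data-generating process, the confounder `u`, and all of the numbers are invented for illustration.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def wald_iv_estimate(z, x, y):
    """Textbook single-instrument estimator: Cov(Z, Y) / Cov(Z, X)."""
    return covariance(z, y) / covariance(z, x)


def simulate(n=50_000, true_effect=2.0, seed=0):
    """Toy trial: Z = offered treatment, X = took treatment, Y = outcome.
    A hidden confounder u drives both uptake and outcome."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        offered = rng.random() < 0.5
        u = rng.gauss(0.0, 1.0)  # unobserved confounder
        # Only offered patients can take the treatment; high-u patients comply more.
        took = offered and (u + rng.gauss(0.0, 1.0) > 0.0)
        outcome = true_effect * took + 1.5 * u + rng.gauss(0.0, 1.0)
        z.append(float(offered))
        x.append(float(took))
        y.append(outcome)
    return z, x, y


z, x, y = simulate()
naive_slope = covariance(x, y) / covariance(x, x)  # plain regression of Y on X
iv_slope = wald_iv_estimate(z, x, y)
print(f"naive: {naive_slope:.2f}, IV: {iv_slope:.2f}, truth: 2.00")
```

The estimator recovers the effect here only because the simulation hard-codes the story: Z is randomized and touches Y only through X. The arithmetic itself is still two covariances and a division.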

Even though other estimators aren’t as easy to write down, every causal inference method has this flavor. Everything is a combination of correlation and storytelling. “Causal inference,” as it’s built up in statistics and economics departments, is just an algebraically sophisticated language for data visualization.

Some of my best friends work on causal inference, and I respect what they’re after. They’d argue that these storytellings are better than just randomly picking two variables out of a hat. But I don’t see how causal inference methods can do anything to mitigate the effects of crud.

If there’s a latent crud distribution, causal storytelling connecting X and Y is no different than Meehl’s yarns about why certain Methodists prefer certain shop classes. Clever people can construct stories about anything. If they gain access to Stata or R or Python, they can produce hundreds of pages of sciency robustness checks that back their story. If we don’t understand the crud distribution, there’s no math we can do to know whether the measured correlation between X and Y is real. If you buy Meehl’s framework (which I do), you can’t corroborate theories solely with the precise measurement of correlations. You need prediction.

Theories in the human-facing sciences need to make stronger predictions. At a bare minimum, the treatment effect estimates from one study should align across replication attempts. We seem to have issues even crossing this very low bar with our current framework. Adding more math to make the treatment estimate more precise doesn’t help us generalize beyond the data on our laptops.

Theories need to tell us more than whether the correlation between variables is positive or negative. We need to subject them to risky tests. Theories need to make varied, precise predictions. Only then does a precise measurement of these predicted empirical values matter. Reducing all question answering to Fisherian statistics will not solve these problems. But that’s where we seem to be stuck.


Crud Hypothesis Testing

Ben Recht — Thu, 13 Jun 2024 14:02:02 GMT

Meehl’s course has already emphasized that significance testing is a very weak form of theory corroboration. Testing if some correlation is non-zero is very different from the earlier examples in the course. Saying “it will rain in April” is much less compelling than predicting next year’s precise daily rainfall in a specific city. It’s frankly less compelling than predicting a numerical value of the pressure of a gas from its volume and temperature. I’m a bit reluctant to plead for a “better” form of significance testing. Part of the issue with the human-facing sciences is the obsession with reducing all cause and effect, all experimental evidence, to Fisher’s exact test. Randomized controlled experiments are a particular experiment design, not the only experiment design. Someday, we’ll all break free from this bizarre, simplistic epistemology.

But that won’t be today. Let me ask something incremental rather than revolutionary for a moment. What would null hypothesis significance testing look like if we took crud seriously? We know the standard null hypothesis (i.e., that the means of two groups are equal) is never true. What seems to be true is that if we draw two random variables out of a pot, they will be surprisingly correlated. If that’s true, what should we test?

Here’s a crudded-up null hypothesis:

H0: Someone sampled your two variables X and Y from the crud distribution.

We could ask what is the probability of seeing a recorded correlation if H0 is true. What would the test look like? We’d need to compute a distribution of the potentially observed Pearson r values. Since we’re working with finite data, that distribution would be the convolution of the distribution of a sample correlation coefficient r (perhaps making a normal assumption) with the crud distribution. While you probably couldn’t compute this convolution in closed form, you could get a reasonable numerical approximation. The “p-value” now is synonymous with how far your data’s correlation is into the tail of this computed empirical crud distribution. If it’s more than two standard deviations from the mean crud, maybe you’re onto something.

Note that this sort of testing can’t be gamed by growing n. In standard null hypothesis significance testing, a small correlation will be significant if n is large enough. But big n does not mean you’ll refute the cruddy null hypothesis. In fact, all that happens with growing n here is that the “empirical” crud distribution converges to the “population” crud distribution. That is, the convolution doesn’t change the distribution much. When n is moderate, you will be more or less testing if your correlation is more than two standard deviations away from the mean of the crud distribution.
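Here is a rough numerical sketch of the cruddy test. Every modeling choice in it is an assumption of mine: the crud distribution is taken to be a normal with mean zero and standard deviation 0.1 (a made-up number, not a measured one), sampling noise uses the Fisher z approximation, and the “convolution” is done by Monte Carlo.

```python
import math
import random


def crud_p_value(r_observed, n, crud_sd=0.1, draws=20_000, seed=0):
    """Monte Carlo tail probability of |r| >= |r_observed| under a crud null.

    ASSUMPTION: crud correlations are Normal(0, crud_sd), clipped to (-0.99, 0.99).
    Sampling noise: atanh(r) ~ Normal(atanh(rho), 1 / (n - 3)) (Fisher z approximation).
    """
    rng = random.Random(seed)
    noise_sd = 1.0 / math.sqrt(n - 3)
    hits = 0
    for _ in range(draws):
        # Draw a "true" correlation from the assumed crud distribution.
        rho = max(-0.99, min(0.99, rng.gauss(0.0, crud_sd)))
        # Add finite-sample noise to get a sample correlation.
        r = math.tanh(rng.gauss(math.atanh(rho), noise_sd))
        if abs(r) >= abs(r_observed):
            hits += 1
    return hits / draws


# A small correlation stays unremarkable however large n gets ...
for n in (100, 10_000, 1_000_000):
    print(n, crud_p_value(0.05, n))
# ... while a correlation far out in the crud tail stands out even at modest n.
print(crud_p_value(0.5, 100))
```

With a measured crud distribution in hand, you would swap the `rng.gauss(0.0, crud_sd)` draw for a resample from the tabulated correlations.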

Again, I don’t think this cruddy null testing solves everything, but it is definitely better than what we do now. We should know what is a reasonably low bar for an effect size. We should power our studies to refute that low bar. This doesn’t seem like an unreasonable request, does it?

What stops this from happening is that we don’t seem too enthusiastic to measure these crud distributions carefully. What would that look like? Since the crud distribution is a distribution of correlation coefficients, we’d need to find a somewhat reasonable set of pairings of treatments and control variables specific to a field. We’d need reasonable datasets from which we could sample these pairings and compute the crud distribution. To me, this sounds like what Meehl and Lykken did in the 1960s: finding surveys with candidly answered questionnaires and tabulating correlations. In 2024, we have so many different tabulated spreadsheets we can download. I’m curious to see what crud we’d find.
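As a sketch of what the tabulation might look like in code: take a table of columns and compute every pairwise Pearson r. The “survey” below is synthetic, generated so that every item loads weakly on one shared latent factor. That is purely my stand-in for real questionnaire data, and the loading of 0.3 is invented.

```python
import itertools
import math
import random


def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_crud(columns):
    """All pairwise correlations among named columns: a crude crud tabulation."""
    return {
        (name_a, name_b): pearson_r(columns[name_a], columns[name_b])
        for name_a, name_b in itertools.combinations(sorted(columns), 2)
    }


# Stand-in "survey": six items, each weakly tied to one shared latent factor.
rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(5_000)]
survey = {
    f"item_{k}": [0.3 * t + rng.gauss(0.0, 1.0) for t in latent]
    for k in range(6)
}

crud = empirical_crud(survey)
abs_rs = sorted(abs(r) for r in crud.values())
print("pairs:", len(abs_rs), "median |r|:", abs_rs[len(abs_rs) // 2])
```

The hard part is not the arithmetic. It is deciding which pairings of variables count as a fair sample of a field’s crud.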

For people who are familiar with his writing, I don’t think my suggestions here are different than Jacob Cohen’s. In the 1960s, Cohen tried to formalize reasonable standardized effect size measures and use these to guide experiment design and analysis in psychology. One of Cohen’s more popular measures, Cohen’s d, is more or less equal to twice the correlation coefficient r when r is small (using the standard conversion for two equal-sized groups):

$$d = \frac{2r}{\sqrt{1 - r^2}} \approx 2r$$

Cohen asked that people compute d, and then evaluate the effect on a relative scale (small effects are d<0.2, large effects are d>0.8). One problem with Cohen is that he assumed the scale for d was universal. But it certainly varies from field to field. It varies within fields as well, depending on the questions you’re asking. As I noted yesterday, in epidemiology we will always have Cohen’s d less than 0.2 for diseases like cancer. So to merge Meehl with Cohen, we’d need to look at the right distribution of effect sizes of random interactions and use this to set a relative scale for the believability of stories about correlations.

After my dives into the history of machine learning, I’m not at all surprised that I’m rediscovering sensible advice from the 1960s. In fact, I wrote a book about why we keep reinventing ideas from the Cold War that will be out next year. (More on that later). My point today is that some ideas from the 1960s shouldn’t go out of style. Everyone pays lip service to Cohen, but then he gets ignored in practice. Cohen laments this disregard in the preface to the 1988 edition of his book. Perhaps this means that incremental changes aren’t the answer, and the system of mindless significance testing exists to maintain a powerful status quo. If that’s the case, maybe we need a revolution after all.

Wait! Didn’t we have a revolution? You know, a “credibility revolution?” Did that fix anything? Let me take on that question in the next post.


The Technical Depths of Crud

Ben Recht — Wed, 12 Jun 2024 14:48:39 GMT

This post digs into Lecture 7 of Paul Meehl’s course “Philosophical Psychology.” You can watch the video here. Here’s the full table of contents of my blogging through the class.

Meehl begins Lecture 7 by clarifying his rant about statistics from Lecture 6: “I love statisticians, and I like statistics.” It’s certainly true that Meehl should not be confused as someone who is against statistical methodology. Lectures 6 through 10 are almost entirely about probability and statistics, after all. And after his five-minute quasi-apology to the “subgroup of statisticians who have a certain arrogance toward the social or medical sciences,” he spends the next 45 minutes of Lecture 7 diving into numerical examples of how the crud factor might manifest itself even when theories are false.

In the spirit of these technical calculations, let me take this post to work through a few mathy-ish loose ends on crud. There will be more equations than have been the norm in these blog posts, but that’s because we’re pushing into arguments with statisticians. I’m setting the stage here for the subsequent posts where I want to try to rethink statistical practices with crud in mind.

Thresholded Variables

Meehl works an extended example where the treatment variable is a thresholded normal. A potential example he gives would be groups that score high on a test versus groups that score low on a test. Perhaps you’d look at the mean of some attribute in people above the mean on an introversion scale and compare that to the mean of people low on the scale. If the introversion scale is a normal distribution, then the treatment variable is a thresholded normal distribution.

The correlation coefficients between thresholded normal random variables are close to those of the unthresholded variables. There are lots of fun integrals you can compute. Let θ denote the Heaviside function: θ(t) equals 1 if t is greater than 0 and equals 0 otherwise. If X and Y are normally distributed, then:
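A reconstruction of the identity, assuming X and Y are jointly normal and X has mean zero so that the threshold sits at its mean (this is the standard closed form, and it matches the 0.8 factor quoted in this post):

```latex
\operatorname{corr}\bigl(\theta(X),\, Y\bigr)
  \;=\; \sqrt{\frac{2}{\pi}}\;\operatorname{corr}(X, Y)
  \;\approx\; 0.8\,\operatorname{corr}(X, Y)
```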

If you threshold one variable, the resulting correlation equals 0.8 of the initial correlation. Meehl alludes to this formula in his whiteboard calculations in Lecture 7. We can go a step further and threshold both X and Y:
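A reconstruction of the doubly thresholded case, again assuming X and Y are jointly normal with mean zero (the standard arcsine formula for sign correlations):

```latex
\operatorname{corr}\bigl(\theta(X),\, \theta(Y)\bigr)
  \;=\; \frac{2}{\pi}\,\arcsin\bigl(\operatorname{corr}(X, Y)\bigr)
```

For small correlations the arcsine is close to the identity, so the thresholded correlation is roughly 0.64 times the original one.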

If X and Y are correlated, their thresholded counterparts will be similarly correlated. Thresholding normal distributions does not eliminate the worry about crud.
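If you'd rather simulate than integrate, here is a minimal sketch that checks the thresholding claims numerically. It assumes X and Y are standard normal with correlation rho, and it compares against the standard closed forms sqrt(2/π)·rho (one variable thresholded) and (2/π)·arcsin(rho) (both thresholded). The function names, the sample size, and the choice rho = 0.3 are all illustrative.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def heaviside(t):
    """theta(t): 1 if t > 0, else 0."""
    return 1.0 if t > 0 else 0.0


def simulate(rho, n=100_000, seed=0):
    """Draw n correlated standard normal pairs and return three correlations:
    raw, with X thresholded, and with both X and Y thresholded."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        xs.append(x)
        # Y = rho * X + sqrt(1 - rho^2) * noise has unit variance and corr(X, Y) = rho.
        ys.append(rho * x + math.sqrt(1.0 - rho ** 2) * noise)
    tx = [heaviside(x) for x in xs]
    ty = [heaviside(y) for y in ys]
    return pearson(xs, ys), pearson(tx, ys), pearson(tx, ty)


rho = 0.3
r_raw, r_one, r_both = simulate(rho)
predicted_one = math.sqrt(2.0 / math.pi) * rho
predicted_both = (2.0 / math.pi) * math.asin(rho)
print(f"raw: {r_raw:.3f}")
print(f"one thresholded: {r_one:.3f} (closed form {predicted_one:.3f})")
print(f"both thresholded: {r_both:.3f} (closed form {predicted_both:.3f})")
```

The simulated values should land within sampling noise of the closed forms, which is the point: thresholding shrinks the correlation by a constant-ish factor but does not make it go away.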

Epidemiological Crud

I don’t exactly know how to best estimate the modern crud factor, but I think it’s worth giving some scale. In Monday’s post, I called out this JAMA Internal Medicine article that claimed people who ate organic diets had lower cancer rates. We all know these nutrition papers are absurd and easy to pick on. And yet they still consistently get credulously written up in the New York Times. This paper doesn’t seem to be any more egregious than any other in the field. The whole field is very bad! But it does help give a sense of scale.

In this paper, the authors come up with some score of how much organic food people eat. They find that the top quartile of scorers has lower cancer rates. This is clearly a dressed-up correlation with wealth and socioeconomic status. Bear with me anyway.

In their main finding, they have 50,914 respondents with a low organic score and 16,962 with a high organic score. Of these survey respondents, 1,071 of the low-organic group reported cancer while only 269 of the high-organic group reported cancer. That’s a 25% relative risk reduction. While it’s not proper to treat this as an RCT, the z-score here is more than 4 and the p-value is less than 0.0001. So I could imagine (as the paper does) some sort of “causal correction” mumbo jumbo that “corrects for confounders” or whatever and still gets you a p-value less than 0.05. Eat organic, everyone!

OK, so what’s the correlation coefficient? We have a formula for it. Take the z-score and divide it by the square root of the total sample size, which is about 261 here. It’s about 0.02.
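Here is a minimal sketch of that arithmetic using the counts quoted above. It assumes the test in question is the pooled two-proportion z-test, and it also computes the correlation directly from the 2x2 table as a cross-check. The function and variable names are illustrative.

```python
import math


def two_proportion_z(events_a, n_a, events_b, n_b):
    """Pooled two-proportion z-statistic for the rate in group A minus the rate in group B."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / se


def phi_coefficient(a, b, c, d):
    """Pearson correlation of two binary variables from the 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Counts quoted in the post.
low_n, low_cancer = 50_914, 1_071
high_n, high_cancer = 16_962, 269
n = low_n + high_n

z = two_proportion_z(low_cancer, low_n, high_cancer, high_n)
r = z / math.sqrt(n)

# Same correlation straight from the table:
# rows are low/high organic, columns are cancer/no cancer.
phi = phi_coefficient(low_cancer, low_n - low_cancer,
                      high_cancer, high_n - high_cancer)

relative_risk_reduction = 1.0 - (high_cancer / high_n) / (low_cancer / low_n)

print(f"sqrt(n) = {math.sqrt(n):.1f}")
print(f"z = {z:.2f}, r = z / sqrt(n) = {r:.4f}, phi = {phi:.4f}")
print(f"relative risk reduction = {relative_risk_reduction:.1%}")
```

The z-score comes out a bit above 4 and the correlation around 0.016, which rounds to the 0.02 in the text. The two routes to the correlation agree up to floating-point error.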

I don’t yet know what to make of this. The fact that cancer is already rare means the correlation coefficient can only be so high. For binary random variables when the treatment and control groups are of the same size, the largest the correlation coefficient can be is the square root of the odds of the prevalence:
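Written out, with p denoting the overall prevalence of the outcome (the symbol p is introduced here for illustration) and assuming equal-sized groups with p at most ½:

```latex
|r| \;\le\; \sqrt{\frac{p}{1 - p}}
```

For a prevalence of 2%, for example, this caps the correlation at about 0.14.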

This would be the correlation between X and Y even when you have 100% risk reduction. It would be worth thinking more about what Meehl’s crud has to do with epidemiology, where we have huge n and low prevalence, and hence small correlations between all variables. What is the crud factor in epidemiology? Somebody should study that!

Varied Variance Estimators

Dean Eckles noted on Twitter that for non-binary outcomes, the common estimator for the variance in the z-test is a combination of the variance in the group when X=0 and the group when X=1:

I could quibble that this variance estimator isn’t better than the one used in the proportions z-test, but it’s a quibble. As I’ve said before and will say again, these formulas are just rituals and you can’t really justify anything with “rigor.” And it’s fine because we can still calculate stuff. If I use this variance estimator, the formula for z becomes
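A reconstruction, assuming the estimator is the pooled within-group variance (an assumption, chosen because it reproduces both the 4 in the denominator and the 1.15 factor mentioned in this section):

```latex
z \;=\; \frac{\sqrt{n}\; r}{\sqrt{1 - r^{2}}}
```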

Carlos Cinelli tells me that Cohen uses this formula in his writings about power and effect sizes. While it is no longer a simple product, nothing in the crud story changes here. A significance test is still computing a simple function of the Pearson r, multiplying that number by the square root of n, and declaring significance when that product is larger than 2. That is the same as declaring significance when
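A reconstruction of the condition, assuming the z-score takes the form \sqrt{n}\, r / \sqrt{1 - r^2} and solving z > 2 for r:

```latex
\frac{n\, r^{2}}{1 - r^{2}} > 4
\quad\Longleftrightarrow\quad
|r| \;>\; \frac{2}{\sqrt{n + 4}}
```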

That 4 in the denominator isn’t doing much work. Also, when r is less than ½, this z-score is at most about 1.15 times as large as the one you get from the other variance estimator. We can’t escape the fact that significance tests are measurements of correlation. Maybe we should embrace that fact and see what happens.
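A quick numerical sanity check of these claims, as a sketch under stated assumptions: the "combined" estimator is taken to be the pooled within-group variance (divided by n), the data are synthetic, and all names are illustrative. With that estimator the z-score should equal sqrt(n)·r/sqrt(1 − r²) up to floating-point error, and the z > 2 rule should coincide with |r| > 2/sqrt(n + 4).

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pooled_z(xs, ys):
    """z-statistic for the difference in group means, using the pooled
    within-group variance (sum of squared deviations divided by n)."""
    g0 = [y for x, y in zip(xs, ys) if x == 0]
    g1 = [y for x, y in zip(xs, ys) if x == 1]
    n0, n1 = len(g0), len(g1)
    m0, m1 = sum(g0) / n0, sum(g1) / n1
    ss_within = sum((y - m0) ** 2 for y in g0) + sum((y - m1) ** 2 for y in g1)
    var_within = ss_within / (n0 + n1)
    return (m1 - m0) / math.sqrt(var_within * (1.0 / n0 + 1.0 / n1))


# Synthetic data: binary treatment X, continuous outcome Y with a small shift.
rng = random.Random(0)
xs = [rng.randint(0, 1) for _ in range(2000)]
ys = [0.2 * x + rng.gauss(0.0, 1.0) for x in xs]
n = len(xs)

r = pearson(xs, ys)
z_pooled = pooled_z(xs, ys)
z_from_r = math.sqrt(n) * r / math.sqrt(1.0 - r ** 2)
z_product = math.sqrt(n) * r  # the simple product form of the z-score

print(f"r = {r:.4f}")
print(f"pooled z = {z_pooled:.4f}, sqrt(n) r / sqrt(1 - r^2) = {z_from_r:.4f}")
print(f"ratio to simple product = {z_pooled / z_product:.4f}")
print(f"significant: {abs(z_pooled) > 2}, |r| > 2/sqrt(n+4): {abs(r) > 2 / math.sqrt(n + 4)}")
```

The ratio between the two z-scores is 1/sqrt(1 − r²), which stays below about 1.155 as long as |r| is under ½.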
