Fleming discovered the antibacterial properties of penicillin in a petri dish in 1928. But how did we figure out how to treat humans with this antibiotic? Famously, the first randomized controlled clinical trial was of the antibiotic streptomycin in 1948. But we were able to determine that penicillin worked long before that.
Physicians at Oxford report on their experiences in the 1941 Lancet article “Further Observations on Penicillin.” Their “Therapeutic Trial of Penicillin” consists of ten case studies. Yes, that’s right, ten patients. But the case studies are outlined in gripping detail.
In the first case, a 43-year-old police officer had a severe staph infection. The infection had spread from the corner of his mouth to his eyes and down his right arm, infecting the bone. On February 12, the doctors started him on penicillin. After the first day, they already noted improvement in his condition. But by February 14, his condition seemed to have stopped improving, and they upped his dosage. In 1941, penicillin was still rare and hard to manufacture, so the doctors had to be careful to use as little as they could get away with. In order to raise this patient’s dosage, they had to recover penicillin from his bedpan. But these extreme measures proved fruitful, and, by February 16, the patient had markedly improved. The swelling had gone down at all of the infected sites, and some looked almost back to normal. The patient felt better and had a full appetite and normal temperature. But tragically, the hospital ran out of penicillin on February 17. Still, the patient’s condition held steady: he clearly still had an infection, but it didn’t seem to be getting worse. Sadly, the infection did come back, and his condition began worsening about ten days later. On March 15, the patient died from his infection.
The doctors concluded that antibiotic treatment could not be stopped early: it was only effective if they gave a large enough dose for long enough. In the second case, a 48-year-old laborer presented with a worrying bacterial infection in his shoulder. They kept this patient on penicillin for eight days, continuing a full day after the infection seemed resolved. The patient recovered more or less without incident.
The final four cases in the paper discuss the topical use of penicillin to treat eye infections. All four were more minor infections, and all four patients rapidly recovered with relatively small doses of the antibiotic.
From these ten patients, six with severe bacterial infections and four with eye infections, the doctors drew many conclusions. First, penicillin seemed nontoxic to humans, even at adult doses of up to 1g a day. Second, even though some of the patients in the trial had died, penicillin had halted the growth of the bacterial infection in every case: every patient’s temperature fell, and every infection improved locally. Third, in several cases, penicillin cured the infection outright. Finally, in the six serious case studies, all of the patients but the laborer had received other drugs, and some had even had surgery to attempt to purge their infections. None of these treatments had proven successful, but penicillin changed the course of the illness in every case.
All of these observations turned out to generalize well beyond this study.
I’m not going to conclude this post with a call to abandon randomized trials. The criticism I’ll lob at modern trials is that the case data is completely hidden. All trial information is reduced to statistics displayed in two tables. We are told that patients respond differently, but only in “exploratory analyses” displayed in a third table or forest plot. We never get the details on what those different responses look like. More worryingly, when people do get access to the case data of a trial, they find errors at alarming rates, calling the validity of the statistics into question (this story in Nature is deeply troubling).
Despite the known issues with trials, case studies are maligned as a poor form of evidence in the grand scheme of evidence-based medicine (a small step above the useless opinion of experts in the evidence pyramid). But the case studies of penicillin are gripping. They gave profound insights into how antibiotics worked in the hospital, and those details informed how treatments should run their course. Much like the extreme example of language acquisition I gave earlier, it was implausible that these bacterial infections would reverse course on their own. Hence this case evidence was impossible to ignore.
Moreover, the course of the disease tracked the administration of penicillin. The police officer’s condition stopped improving. The doctors increased his dose, and his condition improved. This temporal observation provided insights into dosing regimens.
I want to emphasize the value of case studies as a piece of a larger evidentiary puzzle. Why are case studies considered poor evidence? Perhaps the doctors are cherry-picking which cases to report. Perhaps there are too many confounding variables. But rather than concluding that we must resort to large population trials, I want to ask if there are ways to reduce these problems with case studies. I’ll try to answer this tomorrow.
Could strengthen case studies too by pre-registering which patients you'll write up, to avoid the problem of (perhaps unconsciously) cherry-picking particularly vivid cases that support your preferred hypothesis.
I think case studies are considered poor evidence because most researchers chase tiny effects, and case studies are indeed poor evidence for tiny effects. With n = 10, the effects will go in all directions and nothing can be concluded. With n = 2250, maybe you will get a publishable p-value. Even if the effect is only 1%, you can multiply it by the number of students in the US and argue that the effect is very substantively significant.
https://twitter.com/jayvanbavel/status/1681719800450490368
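A rough sketch of the power argument above, assuming a two-sample t-test, a tiny effect of Cohen's d = 0.1, and a 0.05 significance level (all three numbers are my illustrative assumptions, not the commenter's):

```python
# Power of a two-sample t-test to detect a tiny effect (Cohen's d = 0.1)
# at a case-study sample size versus a large-trial sample size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (10, 2250):
    power = analysis.power(effect_size=0.1, nobs1=n, ratio=1.0, alpha=0.05)
    print(f"n = {n:4d} per group -> power = {power:.2f}")

# n =   10 per group -> power = 0.06  (essentially blind to the effect)
# n = 2250 per group -> power = 0.92  (the tiny effect is reliably detected)
```

Under these assumptions, the case-study-sized sample almost never detects the effect, while the large trial certifies it most of the time.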
In the case of vaccines, it's also perilous to rely on a small-n case study. What if no one gets COVID during the study? But there's also a concern about effect size. I suppose effect size of penicillin >> effect size of vaccines >> effect size of social science experiments. I'm no expert, but there seems to be inherent (and large) randomness in the effect of vaccines on outcomes. n = 10 is problematic.
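To make the "what if no one gets COVID" worry concrete, here is a back-of-the-envelope calculation; the 2% attack rate over the study window is an assumption I picked for illustration:

```python
# Probability that a study sees zero infections even if the vaccine does
# nothing, under an assumed 2% background attack rate (illustrative).
p_infect = 0.02

for n in (10, 100, 1000):
    p_zero = (1 - p_infect) ** n  # chance that nobody in the study is infected
    print(f"n = {n:4d}: P(zero cases) = {p_zero:.3f}")

# n =   10: P(zero cases) = 0.817
# n =  100: P(zero cases) = 0.133
# n = 1000: P(zero cases) = 0.000
```

With only ten participants, you would most likely observe zero cases whether or not the vaccine works, so the case series carries almost no information about its effect.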
But I think there’s a further, more philosophical reason why they are considered suspect (at least in social science). There are two kinds of case studies. On the one hand, some are done by empirical scholars within the “framework” or “world-view” of statistical, positive science. These will be considered less suspect. People will say: if you have a large effect and it’s possible to do a case study, do a case study! This is famously what is argued here: https://en.wikipedia.org/wiki/Designing_Social_Inquiry
But on the other hand, some case studies are “hermeneutic” or “interpretative,” in the tradition of this essay: https://static1.squarespace.com/static/55c3972ee4b0632d3480491b/t/56eb3ad537013b8180b9159c/1458256600062/Taylor_InterpretationandtheSciencesofMan.pdf These are case studies meant to “make sense” of a phenomenon, and many people who do science will simply roll their eyes at them. I think that stuff is often fascinating, but it is indeed quite different from science. And in practice, in social science, it’s not always clear whether a given case study is of kind 1 or kind 2.