I’m always in for a treat when someone retweets Eric Topol into one of my social media timelines. Topol is one of the most prolific sources of medical misinformation on the internet, uncritically parroting any press release that confirms his priors. Whenever one of his posts goes viral, it’s almost always because he’s misrepresenting a study. This week, he promoted a big study about machine learning evaluation, so I had to dig in.
Here’s Topol’s tweet:
“The largest medical A.I. randomized controlled trial yet performed, enrolling >100,000 women undergoing mammography screening. The use of AI led to 29% higher detection of cancer, no increase of false positives, and reduced workload compared with radiologists w/o AI”
Oooh! AI! Are we finally fulfilling Geoff Hinton’s prophecy and putting radiologists out of business? Is chatGPT actually better than your doctor?
No.
As is almost always the case, Topol’s topline is willfully misleading. Let’s peel this onion.
I clicked the link in the tweet, and it took me to this study in The Lancet. What is AI in this trial? It’s a commercial software package called Transpara by Screenpoint Medical that promises to “Safely reduce workload by up to 44% with no change in recall rate.” Hmm, that’s already a fairly modest claim. The software is closed-source and claims to use deep learning. I don’t care about what’s inside the machine. I care about what it does. Transpara takes a mammogram and outputs a risk score on a scale from 0 to 10. For mammograms deemed high risk, a segmentation of the image highlights regions likely to be malignant cancer. This is boring, old-school pattern recognition, not chatGPT. In the medical literature, such image segmentation and risk scoring is called computer-aided detection (CAD). CAD software has existed since the 1990s. CAD is a far more honest label than AI, but it’s harder to raise VC money with such terminology.
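If you want a mental model of what software like this does, here’s a minimal sketch in Python. Transpara is closed-source, so to be clear: this is my cartoon of a generic CAD triage step, and every function name, class, and threshold below is made up for illustration.

```python
# Hypothetical sketch of a CAD-style read, NOT Transpara's actual code.
# `detect_lesions` stands in for some pretrained lesion detector that
# returns (bounding_box, malignancy_probability) pairs.

from dataclasses import dataclass

@dataclass
class CadResult:
    risk_score: int        # coarse ordinal score on a 0-10 scale
    flagged_regions: list  # bounding boxes of suspicious regions

def cad_read(mammogram, detect_lesions, flag_threshold=0.5):
    """Score a mammogram and highlight regions the detector finds suspicious."""
    regions = detect_lesions(mammogram)  # [(bbox, probability), ...]
    max_prob = max((p for _, p in regions), default=0.0)
    # Collapse the top lesion probability onto a 0-10 ordinal score.
    risk_score = min(10, round(max_prob * 10))
    flagged = [bbox for bbox, p in regions if p >= flag_threshold]
    return CadResult(risk_score, flagged)
```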
Let’s press on. What does the use of AI (that is, CAD) mean? The trial was performed in Sweden, where the standard of care for mammography screening has two radiologists look at every mammogram. The control group of this trial received precisely that standard of care, with each mammogram getting a double reading. In the intervention group, mammograms were read by a single physician using CAD software. For what it’s worth, a single doctor with CAD is already the standard of care in the United States. Essentially, this trial was trying to determine whether the European or the American standard of care is better. It does not tell us if this particular CAD software is better than what is currently used in the US. To determine that, someone would have to run another clinical trial, I guess.
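In code, the comparison between the two arms looks roughly like this. Again, a cartoon of the two workflows as described above, not the trial’s actual protocol; real screening programs also have consensus and arbitration steps that I’m glossing over, and the function names are mine.

```python
# Simplified models of the two reading workflows being compared.
# `reader_a`, `reader_b`, and `reader` are radiologists modeled as callables
# that return True if they would recall the patient for further workup.

def double_reading(mammogram, reader_a, reader_b):
    """Control arm (European standard of care): two independent radiologists."""
    return reader_a(mammogram) or reader_b(mammogram)

def single_reading_with_cad(mammogram, reader, cad_read):
    """Intervention arm (and roughly the US standard of care): one radiologist
    reads the exam with the CAD score and flagged regions as decision support."""
    cad = cad_read(mammogram)
    return reader(mammogram, cad_hint=cad)
```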
Regardless, this study is not evaluating a landmark breakthrough in artificial intelligence. It is A/B testing the effectiveness of two sets of stringent medical rules and regulations. Such an evaluation is a lot less sexy, I suppose.
Making matters even worse, the primary outcome examined had nothing to do with patient health. This sort of thing drives me crazy when reading machine learning for healthcare papers. In healthcare, we’re supposed to care about patient outcomes. Instead, this study looked at detecting, not treating, cancer. Furthermore, let’s be clear about how a tumor was defined. The paper states:
“Ground truth was based on pathology reports on surgical specimens or core-needle biopsies.”
That is, the study only looks at detection rates compared to biopsy ground truth. Such ground truth has less to do with patient outcomes than you might think. This study can’t tell us whether such cancers would have been caught by patient symptoms and led to similar outcomes. It can’t tell us if CAD-assisted reading saves more lives than double reading. It can’t tell us if software reduces the need for invasive treatment. It can’t tell us if it improves quality of life. Given that the absolute detection rate increased by only 1 woman out of 1000, it’s likely impossible for a trial of this size to evaluate such important downstream outcomes. The study is only saying that 1 out of every 1000 women screened receives a slightly earlier cancer diagnosis. Unfortunately, whether such early detection contributes to better outcomes can’t be ascertained from this study.
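For the arithmetic, here’s the back-of-the-envelope version. The baseline rate below is a ballpark number I picked for illustration, not the trial’s exact figure; the point is just that a 29% relative improvement on a rate of a few per thousand is an absolute difference of about one per thousand.

```python
# Back-of-the-envelope: relative vs. absolute detection rates.
# The baseline is an illustrative guess, not the trial's reported number.

baseline_per_1000 = 4.0    # assumed ballpark cancer detection rate, per 1000 screens
relative_increase = 0.29   # the headline relative improvement

intervention_per_1000 = baseline_per_1000 * (1 + relative_increase)
absolute_diff_per_1000 = intervention_per_1000 - baseline_per_1000
screens_per_extra_detection = 1000 / absolute_diff_per_1000

print(f"intervention rate: {intervention_per_1000:.1f} per 1000")      # ~5.2
print(f"absolute difference: {absolute_diff_per_1000:.1f} per 1000")   # ~1.2
print(f"screens per additional detection: ~{screens_per_extra_detection:.0f}")  # ~860
```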
And what can we say about the validity of the topline number claiming an improvement of 1 per 1000? Unfortunately, the statistical analysis is problematic. Randomized trials are deemed “significant” if they yield a small p-value. There are tons of assumptions that must hold for these p-values to be indicative of anything. In particular, you need to assume the patient outcomes are all independent of each other. For an example of good randomization, consider a hypothetical vaccine study. If you give 50K people vaccines and 50K saline, and they are geographically separated and blinded, then perhaps you can assert that their outcomes are independent.
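For concreteness, here’s how a headline p-value like this is usually computed when you treat every screen as an independent coin flip. The counts below are placeholders I invented, not the trial’s numbers; the only point is that the entire calculation leans on the independence assumption.

```python
# Two-proportion z-test under the independence assumption.
# Detection counts and arm sizes are invented for illustration.

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

detections = np.array([260, 200])      # detected cancers: intervention, control
screens = np.array([50_000, 50_000])   # women screened per arm

stat, pval = proportions_ztest(detections, screens)
print(f"z = {stat:.2f}, p = {pval:.4f}")  # roughly z ≈ 2.8, p ≈ 0.005 for these counts
# This p-value only means something if all 100K outcomes are independent,
# which is exactly what the reader-allocation scheme below undermines.
```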
However, in this CAD study, the patient outcomes were correlated by the trial protocol. The patients appear to have been properly randomized, but the doctors were not. Only 16 doctors screened the 100K patients, and these doctors participated in both screening arms. From the paper:
“The radiologists rolled a dice before each screen-reading session to randomly allocate themselves to one of the two study groups: numbers 1–3 allocated them to the control group and 4–6 to the intervention group.”
Doctors were randomized at each session to read either treatment or control patients. This means that over multiple sessions, doctors got a feel for whether they preferred working alone or with a colleague. Since these preferences accumulate over time, the diagnoses were not independent. Hence, the “p-value” and confidence intervals are meaningless (for my stats readers, SUTVA is strongly violated here).
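If you want a feel for how much correlated outcomes can distort a nominal p-value, here’s a quick sketch using the design effect from cluster-randomized designs: if the real unit of randomization is the reading session (a batch of screens sharing one reader’s behavior that day) rather than the individual screen, the naive standard error is too small by a factor of sqrt(1 + (m − 1)·ICC). This doesn’t capture the interference half of the SUTVA complaint, and the cluster size and correlation values below are guesses, not estimates from the trial.

```python
# How clustering by reading session inflates the real Type I error of a test
# that pretends all screens are independent. Cluster size and ICC values are
# illustrative guesses, not quantities estimated from the trial.

from scipy.stats import norm

m = 125  # assumed screens per reading session (~100K screens over ~800 sessions)

for icc in [0.0, 0.001, 0.005, 0.01]:
    design_effect = 1 + (m - 1) * icc
    se_understatement = design_effect ** 0.5
    # A test that rejects at |z| > 1.96 while ignoring clustering actually
    # rejects a true null this often:
    real_alpha = 2 * norm.sf(1.96 / se_understatement)
    print(f"ICC={icc:.3f}  SE understated by {se_understatement:.2f}x  "
          f"real alpha ≈ {real_alpha:.3f}")
```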
Single RCTs are seldom definitive, and this RCT is no exception. Evaluation of machine learning is hard, and evaluation of medical practice might be harder! Sadly, the evidence thus remains murky as to whether computer-aided detection is better than double human reading. Prior studies are unclear about whether a single unaided radiologist is much worse than two. Medical guidelines and bureaucracy are often more cultural than “evidence based,” and we just have to accept that.1
But let me get back to our friend Eric Topol. This study had almost nothing to do with the popular conception of “AI!” Given all the absurd excitement, fear, and panic surrounding the AI buzzword, is it helpful for a popular opinion leader to tweet that this study confirms its benefits? Or would we all be healthier if Topol shut down his Twitter account?
I’m not even going to get into the decades-long controversy around the effectiveness of cancer screening in the first place. I’ll write more about that later, I promise.