I’m attending the Machine Learning for Healthcare conference this week, and yesterday I participated in an engrossing workshop on decision making. The workshop and conference have me buzzing about multiple topics I’ve ranted about on this blog, and I imagine they’ll consume my Substack bandwidth for the next week.
There is plenty of excitement about bringing flashy things like neural nets into the clinic and hospital. But a person today will encounter more “AI” at the hospital than in any other context. Medicine is filled with guidelines and rules that aid decision making from top to bottom. These decision rules guide whether a patient gets a test, gets admitted, gets a surgery, gets a treatment. The contemporary American hospital lets us examine the dystopian world where our lives are governed by the whims of AI.
A patient comes into the ER presenting with chest pain. Are they at high risk of a cardiovascular event? We can estimate their risk using the HEART score: What is the patient’s history (H)? Does the electrocardiogram look abnormal (E)? How old is the patient (A)? Does the patient have risk factors (R)? What is the patient’s troponin level (T)? A physician scores each of these five components on a scale of 0-2, giving a total score of 0-10. Each HEART score is associated with a risk of a major cardiovascular event: 0-3 corresponds to a risk of 1-2%, 4-6 to a risk of about 15%, and 7-10 to a risk of over 50%. A BuzzFeed quiz governs the fate of a patient.
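To underscore how simple this “AI” is, here is the whole model as a few lines of Python. This is just a sketch of the scoring logic described above (the component scores are whatever the physician assigns); it is not a clinical tool.

```python
def heart_score(history, ecg, age, risk_factors, troponin):
    """Sum the five HEART components, each scored 0-2 by a physician."""
    components = [history, ecg, age, risk_factors, troponin]
    assert all(0 <= c <= 2 for c in components), "each component is scored 0-2"
    return sum(components)

def risk_band(score):
    """Map a total HEART score (0-10) to the rough risk ranges quoted above."""
    if score <= 3:
        return "low: ~1-2% risk of a major cardiovascular event"
    elif score <= 6:
        return "moderate: ~15% risk"
    else:
        return "high: >50% risk"

print(risk_band(heart_score(1, 0, 2, 1, 0)))  # moderate: ~15% risk
```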
The HEART score is only supposed to aid a doctor in risk assessment, but these assessments affect care. And there are hundreds of these guidelines in the hospital, some in calculator apps, some in EHR systems, some ensconced in complex bureaucracy. These risk scores are AI tools. The AI tools that use deep learning and language models look a lot more complicated, but in the end, they are still just some box that takes in patient information and spits out a risk score or a clinical decision. If we want to know how algorithmic decision systems impact people, there’s no better place to look than our local healthcare system.
Now, doctors all tell me that scores like this are just “guidelines” and are not binding for care decisions. But this is what an “algorithmic decision guide” looks like. Their presence alone is clearly part of the standard of care. Is that presence good, bad, or neutral? I see no difference between a risk calculator like HEART and some commercial chatbot that provides suggestions to an attending. Whether we have a simple risk score or a wildly more sophisticated algorithmic decision system, it gets coupled to some action. That action could be further tests, a new diagnosis, a potential surgery, or a medication. Do the suggestions affect care or not? Does their presence change patient outcomes? Would it be better for doctors to follow their recommendations more strictly? How would we evaluate the impact of these decision rules on outcomes?
For evaluation, we could appeal to how we evaluate treatments: there are case reports, observational studies, and RCTs. But evaluating decision systems is murkier. Decision systems tend to be evaluated as predictions (like in machine learning). You can first look retrospectively to see if the risk scores accurately estimate risk. You can look prospectively to see if the risk scores predict the bad outcomes you had in mind. You can do these prospective studies at multiple hospitals to make sure you didn’t overfit to the particulars of your clinic. But in all of these evaluations, you are not evaluating whether having the score helped a patient’s outcome. There are few examples of evaluating risk scores the way we evaluate treatments.
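To make this predictive mode of evaluation concrete, here is what a minimal retrospective calibration check might look like. The patient records below are simulated, purely for illustration: group patients by score band and compare the observed event rate to the published risk.

```python
import numpy as np

# Simulated retrospective cohort (not real patient data): HEART scores
# and whether a major cardiac event occurred in the follow-up window.
rng = np.random.default_rng(0)
scores = rng.integers(0, 11, size=5000)
true_risk = np.where(scores <= 3, 0.02, np.where(scores <= 6, 0.15, 0.55))
events = rng.random(5000) < true_risk

# Calibration check: does the observed event rate in each score band
# match the published risk for that band?
for band, mask in [("0-3", scores <= 3),
                   ("4-6", (scores >= 4) & (scores <= 6)),
                   ("7-10", scores >= 7)]:
    print(f"HEART {band}: observed event rate {events[mask].mean():.1%}")
```

Note that even a perfectly calibrated score passes this check without telling us anything about whether computing the score changed anyone’s care.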
And part of this is because the effects of any individual decision are challenging to isolate within the larger arc of a patient’s care. Take my favorite example: the cancer screen. Imagine a hypothetical AI screening device that spits out a cancer risk score. Here, the pragmatic randomized controlled trial is our standard of evaluation. The “treatment” here is simply the offering of the algorithmic system to a patient. Does a doctor suggest a patient get screened or not? The outcome is cancer deaths over some time window. My read of the evidence is that no screen has demonstrated any substantial impact on cancer deaths. Certainly, screening does not improve overall mortality. And the risk of false diagnoses and unnecessary treatments is high. But if cancer screens show such modest benefit when toggled in isolation, what about the thousands of other simple decisions that go into a patient’s care?
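The analysis of such a pragmatic trial is almost embarrassingly simple compared to the logistics of running one: compare death rates between the arm offered the screen and the arm that wasn’t. A sketch with entirely made-up tallies (all numbers here are hypothetical):

```python
import numpy as np

# Hypothetical pragmatic-trial tallies: one arm is offered the AI screen,
# the other gets usual care; the outcome is cancer deaths in the window.
deaths = np.array([380, 400])    # [offered screen, usual care]
n = np.array([50_000, 50_000])   # patients per arm

p = deaths / n
diff = p[0] - p[1]
se = np.sqrt(np.sum(p * (1 - p) / n))        # standard error of the difference
lo, hi = diff - 1.96 * se, diff + 1.96 * se  # 95% confidence interval
print(f"risk difference {diff:.3%}, 95% CI [{lo:.3%}, {hi:.3%}]")
# A tiny absolute difference with an interval straddling zero is in line
# with the modest screening effects described above.
```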
Even evaluating the simplest guidelines is a challenge. There was some evidence that aspirin moderately reduced the risk of heart attacks and strokes, but increased the risk of internal bleeding. The US Preventive Services Task Force (USPSTF) had concluded that the benefits outweighed the risks in elderly Americans and recommended aspirin as a general prophylactic against heart disease in people over 60. That’s an algorithmic decision rule: it is a threshold function on a single feature, but that is still a neural net, my friends. In 2022, they revised their guidance, now concluding that the risk of bleeding outweighs the preventive benefits.
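In case the “that is still a neural net” quip sounds glib: the old rule really can be written as a single unit with one input, one weight, a bias, and a step activation. My rendering, not anyone’s official decision aid:

```python
import numpy as np

# The pre-2022 aspirin guidance as a one-neuron "network":
# input = age, weight = 1, bias = -60, step activation.
def recommend_aspirin(age):
    return np.heaviside(1.0 * age - 60.0, 0.0)  # 1.0 = recommend, 0.0 = don't

print(recommend_aspirin(65))  # 1.0 under the old guidance (revised in 2022)
```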
The USPSTF puts out countless recommendations every year. Medical specialty societies write reports with their own suggested standards of care. MDCalc alone hosts over 500 clinical decision tools. How effective are these? How do they interact with each other?
This blog asks a lot of questions. And I don’t know the answer to any of them! This is why I find the ML for Healthcare community so compelling. MDs want to figure out how to untangle this complex web of algorithmic recommendations. And for us PhDs who think about decision systems, there is an overabundance of important and challenging problems to address.
Relevant to the theme: https://first10em.com/clinical-decision-rules/
Of course I lack the medical knowledge to say whether or not it's accurate, fair, etc.
I totally agree that these simple scorecards capture a lot of the richness of data-driven decision-making in the wild. (I'm partial to nomograms myself, which are visual calculators: https://en.wikipedia.org/wiki/Nomogram) There's so much to learn from how these seemingly pragmatic issues are actually first-order important in real-world decision-making.
Just as one example, recent work from Liu-Shahn-Robins-Rotnitzky specifically models the statistical restrictions imposed by assuming that screening has no direct effect on treatment: https://projects.iq.harvard.edu/files/applied.stats.workshop-gov3009/files/efficient_estimation_of_optimal_regimes_under_a_no_direct_effect_assumption.pdf
As a different on-the-ground example, the Sepsis Watch project and its deployment constitute a huge implementation effort involving organizational change and a (pre-post) clinical trial. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7391165/
It's still not clear how to appropriately evaluate these kinds of operational interventions -- I'd love to learn more about regulatory discussions and other examples on these fronts -- but I think there's interest converging from multiple points of view (stat/ml, biostats, HCI/AI) given the importance of these real-world systems.