Love your blog, Ben. I semi-grasp your point but then, as a cardiologist, I view the double-blind RCT as our best current defence against the cognitive biases that trap us into doing harmful things to patients. The list of treatments that we thought helpful, but were debunked by an RCT, is long and strewn with fatalities. Simple examples include oxygen to reduce the size of a heart attack, and hormone replacement therapy to prevent heart attacks. Yes, applying RCT findings to an individual patient is fraught. But maybe not as fraught as using an anecdotal experience in one patient as the template for treating the next patient - "I gave the patient pill x, a week later their runny nose stopped, therefore x treats runny noses". Each element of the modern RCT protects against a number of cognitive biases - randomisation prevents self-selection bias, an appropriate control group mitigates the Hawthorne effect and blinding both the patient and researcher will combat expectation and confirmation biases. Now I'm left thinking - did all these RCTs debunk quackery or am I on shakier ground - eek
Your example of the WHI trial on HRT is instructive. Yes, the trial shows that broadly speaking, a particular form of HRT (specifically, PremPro) results in an increase in the rate of heart attacks in post-menopausal women, but conventional wisdom now thinks this risk is only increased for older women, and it's still not clear whether this particular formulation is more dangerous than other prescribed therapies. Context matters a lot. And the needs of a particular patient matter a lot. Does that make sense?
Unfortunately, every RCT has this sort of nuance. RCTs are a very blunt instrument. That said, they are certainly invaluable for drug approval. They are powerful procedures that remove biases and set expectations between the government and pharmaceutical companies. But I think their debunking power is a bit overinflated among the evidence-based medicine crowd.
Yes, blunt indeed is the RCT. I shouldn't have chosen the contentious WHI trial as an example! Interesting to note that there have been no further trials; the change in conventional wisdom is due to the re-interpretation of the original WHI trial. This led some to speculate that white-hat bias (whimsical name) may be at play.
Another very human problem with the RCT is its weaponisation by commercial interests - regulators do not require blinding of device trials (as opposed to drug trials). Companies potentiate the lack of blinding by using a combined endpoint containing a subjective metric such as unplanned hospitalisation. This allows the control patient & investigator to unleash their expectation and confirmation bias to discern symptoms that require hospitalisation so that they can receive the investigational device. It's all a muddy mess. But if I ditch the RCT in toto, what could my advice to patients be based upon ...
Extremely interesting article as usual. Though I would not jump from opposing quantification through averages on RCTs to opposing quantification tout court. But maybe I misunderstood the closing of the article. At any rate I understand that some work in the direction you point at has been done by people in hermeneutics, specifically data/digital hermeneutics. Though they sound a bit too hasty in concluding that a silver bullet for intersubjectivity simply cannot be found.
Yes, I don't oppose quantification. I just want us to think about it deliberately instead of blindly treating it as objective. And I'd like more nuance with how we understand the unquantifiable.
Here's a thought experiment I use to think about the individual/collective spectrum. Imagine you're hosting a party, and you're going to provide food. You could declare that you will serve hotdogs, and you need a headcount to inform how many hotdogs to purchase. In this case, each individual is reduced (simplified, dehumanized) to a hotdog consumer for the sake of the count. Toward acknowledging varying tastes in the population of guests, you could offer your guests a *choice* between hotdog and hamburger, which they indicate in an RSVP. Greater acknowledgment of varying preferences might include choices of hotdog, hamburger, or veggie burger. Now go to the extreme. You could personally interview every potential guest to obtain their individual meal preference, and then you could go buy and prepare all the food needed to make each guest's favorite meal.
This thought experiment contains many aspects of policy-making for collectives: reduction of individuals to measurable/categorical characteristics, power to define terms and set constraints for resource distribution, cost of meeting community needs. A mediating factor against dehumanization is acknowledging choice, and yet each level of choice acknowledgment incurs cost. The presence of choice (ie, a variable specification that is neither necessary nor sufficient) amongst individuals of a population is what leads to the problem of induction. Statistics (historically, numbers for *the state*) is used to craft compelling arguments that appear rational for a truly irrational decision.
See also Hacking's "Kinds of People: Moving Targets" (https://sites.tufts.edu/models/files/2020/02/hacking-draft.pdf)
I'm not sure I understand the challenge. Obviously, non-human facing science deals with singular cases. It does so by developing models that are predictive and explanatory. Then, if you want to estimate a singular case, you plug in the singular parameters of that case. Depending on the availability and accuracy of the parameters, and the accuracy and predictive power of the model, you get an answer to the singular case. We definitely have, for instance, successful one-off space exploration missions that went through without a single experiment, and certainly without data collected from a population of similar missions.
So, what you seem to be lamenting is the lack of reliable models of humans, or a lack of ways to measure a sufficient collection of relevant parameters. I think that there is another issue. The models that human facing science has are almost always complex equilibria. Equilibria are feedback mechanisms, not acyclic causality. In medicine there is essentially a single equilibrium (homeostasis) that you are trying to maintain. In macro-economics, even very simple models have multiple dynamic equilibria, which means in particular that the same intervention may have different, even opposing, causal consequences in different equilibrium states. So there's some singularity not just per person, but also per population and time etc. This might be why medicine gets better outcomes than macro-economics.
I wonder about psychology. Is it that the models are poor? Is it that the measurement of parameters is poor? Or is it that there are multiple desirable inconsistent equilibria? Perhaps all of the above.
Looking forward to it! The language for the tension you name has been thin since EBM colonized medical epistemology in the 90s. We are yet to acknowledge the level of harm it did (and I use EBM all the time and consider it worth protecting).
I think that the language you're looking for partly already exists; it just got thrown under the evidence-based bus. Aristotle distinguished three kinds of wisdom that map neatly onto clinical practice. Episteme is theoretical knowledge of universals: in medicine, pathophysiology and mechanism, why and how things happen. Techne is productive craft knowledge: the know-how of examining, operating, diagnosing, what your hands and eyes have learned. And phronesis is practical wisdom: how to act well in this particular case under unavoidable uncertainty, weighing competing goods and taking the patient's interest into account.
Pre-EBM clinical medicine lived across all three. EBM's contribution was real — it did kill off a lot of quackery. But the move was to collapse the hierarchy: only statistical empirical evidence counts as proper knowledge, everything else is bias to be corrected against the algorithm. Reasoning about mechanisms got demoted to "hypothesis-generating." Procedural craft got partly invisibilized — you can't really randomize the surgeon, so surgical skill became something the literature can't really see. Phronesis got reframed as cognitive bias and reduced to "using anecdotal evidence"— so the expert clinician's pattern recognition became something to debias against the guideline.
Blind faith in RCTs and "the hierarchy of evidence" attributes to them powers they never had. There have been tons of trials of different antibiotic regimens for bacterial vaginosis since the 90s, and yet the recurrence rate stayed pretty much the same. It wasn't until a couple of years ago that someone ran a trial comparing treatment of the woman alone against treatment of the woman and her partner. They had to stop early because treating the couple was just so much better. I'd argue the concept of "bacterial vaginosis-associated bacteria" had a lot to do with this blindness. I heard talks at conferences arguing that men don't have a vagina, therefore these bacteria are irrelevant to them, therefore the partner doesn't matter.
Believing that an RCT will produce TRUTH regardless of the assumptions, concepts, and terms fed into its design is just funny. RCTs produce results that have to be thought through carefully: whether the design was reasonable at all, what the question even was, where the question came from. So I approach an RCT the way I approach a patient: examine its history, make sense of what it's saying, stay critical of its language. We can't really say that we have read an obesity RCT unless we acknowledge the shady history of body mass index (BMI). Or a vitamin D in pregnancy RCT unless we remember the Williams syndrome scare from the 60s and the whole earlier story behind mistaking an ancient and complex steroid signal system for a bone health nutrient. So yeah, there is no way around proper history taking.
Have you considered the methods developed around casuistry, the method of cases, nearest-neighbor, (local) pair matching, and so on? If you drop the ideas of abstract random variables, continuity, and infinity and switch to arrangements of whole objects the problem shifts. I find those ideas are seductive but they pull me into an abstraction trap. Blaise Pascal and others did a number on this whole mode of reasoning.
The problem with casuistry is that people will think that you fell "into the cracks of crankhood". It happened to Toulmin.
Casuistry lurks about under different names. I got some of it in high school and a full semester of it in college (from Jesuits, no less) without anyone saying its name. In statistics, nearest-neighbor and caliper pair matching are just two instances of the technique, as are local regressions and splines.
I wasn't aware that Toulmin got pushed into the crank category. I liked what I read of his book.
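For anyone who hasn't seen the matching techniques mentioned above, here is a rough sketch of greedy one-to-one nearest-neighbor matching with a caliper on a single covariate. Everything in it is an assumption made for illustration: the function name `caliper_match`, the case ids, the ages, and the 3-year caliper are all made up, and real matching procedures usually work on several covariates or a propensity score rather than one number.

```python
def caliper_match(treated, controls, caliper):
    """Greedy 1:1 nearest-neighbor matching on one covariate.

    treated, controls: dicts mapping a case id to a covariate value.
    Returns (treated_id, control_id) pairs whose covariate values
    differ by at most `caliper`; each control is used at most once.
    """
    available = dict(controls)  # copy, so the caller's dict is untouched
    pairs = []
    for t_id, t_val in treated.items():
        if not available:
            break
        # Nearest remaining control for this treated case.
        c_id = min(available, key=lambda c: abs(available[c] - t_val))
        # Accept the pair only if it falls inside the caliper.
        if abs(available[c_id] - t_val) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Hypothetical cases; the covariate is age in years.
treated = {"t1": 52, "t2": 67, "t3": 80}
controls = {"c1": 50, "c2": 66, "c3": 30, "c4": 70}

pairs = caliper_match(treated, controls, caliper=3)
print(pairs)  # [('t1', 'c1'), ('t2', 'c2')]
```

Note that `t3` stays unmatched: its closest remaining control is 10 years away, outside the caliper. That is the case-based flavor of the method, since a case with no sufficiently similar neighbor is set aside rather than averaged in.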
How does this relate to the reference class problem? I can imagine that you have in mind situations where there can be no reference class at all. Is the following relevant? Consider modeling the causal processes underlying a lung cancer and how it will respond to treatment. Cancers can be unique -- not just a unique combination of known factors but a novel biological mechanism at work. Hence, there can be no basis for statistical prediction. I suppose one might argue that if we had complete knowledge of the system, including the genetics of the cancer and the state of the patient's immune system, we might be able to make a prediction. But such a prediction would be based on mechanism, not statistics. Given our lack of knowledge concerning mechanism, we substitute a reference class, collect statistics, and hope that everyone in the reference class instantiates the same phenomenon. But we can't verify this. (I suppose we can rule out some potential reference classes, because they are known to be causally incompatible with the patient in question. But typically to build a sufficiently large reference class, we include cases that are known to be incompatible and hope that this incompatibility can be treated as a small perturbation, zero mean noise, etc.)
Baseball is a really interesting example. Personally, I became so indoctrinated by the Moneyball/Baseball Prospectus statistical lens on baseball that it's very difficult for me to enjoy a game anymore because I feel like the players are just random number generators with a mean and standard error. But that's not true, they're humans, and it seems impossible that human factors play no role in whether they succeed or fail.
> But that's not true, they're humans, and it seems impossible that human factors play no role in whether they succeed or fail.
I thought the argument here is that professional athletes are so far up the skill distribution, and have such powerful ancillary support systems (coaches, analysts, etc.) serving as optimization engines on the frontier of current knowledge and practice, that any human-factor stuff rounds out as noise even in a lot of individual games, much less over multiple games. And/or it is mitigated by the coach pulling somebody who's doing badly because of human factors (bad sleep, a breakup, etc.) and subbing in the next best player.
This is a pretty interesting theme.
Some thoughts about the "what does an RCT say about an individual" question: It seems you would want to have some prior expectation about the outcome for that person, and then update that expectation based on any evidence you have. If you have mechanistic models that fit and predict reality well with which you can do deductive reasoning, then that will be much better than needing to rely on fuzzier inductive evidence. But many (maybe most) important things we deal with are too complicated to develop good mechanistic models. So instead we have to use the RCT, but we can't just say the individual is 0.5 as likely to have the effect with the intervention as without, and that's for (at least) two reasons. (1) We need to ask about statistical significance, applying the binomial distribution and saying, "how likely is it that we would get results at least this extreme if the intervention did nothing at all?"; that is the p-value. (2) We need to ask if the individual is represented within the population that the RCT is performed on. I think this is a failure that is now starting to get talked about quite a bit, that marginal people are, of course, not as well represented, especially in smaller studies that just draw from students from the local university where the research is done. University students are not representative of the whole population!
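To make reason (1) a bit more concrete, here is a minimal sketch of a one-sided binomial p-value for a single trial arm. All the numbers are hypothetical, and the setup makes a simplifying assumption that a real analysis would not: the control event rate is treated as a known constant instead of being estimated from its own arm. By convention the p-value is computed under the hypothesis that the intervention does nothing, so it asks how surprising the treated arm looks if its true event rate equals the control rate.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k events in n independent patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Probability of at most k events in n independent patients."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


# Hypothetical trial arm (assumed numbers, not from any real study).
n_treated = 100       # patients who received the intervention
events_treated = 10   # of whom this many had the bad outcome
control_rate = 0.20   # event rate without the intervention, assumed known

# Observed risk ratio: treated risk divided by control risk.
risk_ratio = (events_treated / n_treated) / control_rate

# One-sided p-value: chance of seeing this few events or fewer
# if the intervention had no effect at all.
p_value = binom_cdf(events_treated, n_treated, control_rate)

print(f"risk ratio: {risk_ratio:.2f}")
print(f"one-sided p-value under no effect: {p_value:.4f}")
```

Even with a small p-value, the result describes the arm as a whole. It does not say that any one patient's risk was halved, which is where reason (2) and the question of representation come in.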
The language around all of this, and how that language is used by scientists, by general populations, and between them is also a really interesting thing to be looking at. I'm not exactly sure I get what you're pointing at with "language for individual experiences", but interested to find out.
I'm kinda wondering if "mimetic power" was maybe meant to be "memetic power"? Both make sense in this context.
I'm interested in better describing the "shape" of data clusters in high dimensional spaces. I wonder if that relates to this theme at all.