arg min blog
Musings on systems, information, learning, and optimization.
http://benjamin-recht.github.io/
There’s more to data than distributions.<p><em>This is the first guest post by Deb. More to come!</em></p>
<p>In “<a href="https://www.nejm.org/doi/full/10.1056/NEJMc2104626">The clinician and dataset shift in artificial intelligence</a>,” published in the New England Journal of Medicine, a group of physician-scientists describe how a popular sepsis-prediction system developed by the company Epic had to be deactivated. “Changes in patients’ demographic characteristics associated with the coronavirus disease 2019 pandemic” supposedly caused spurious alerts, rendering the system of little value to clinicians. For the authors, this is a clear illustration of distribution shift: a change between training and test data that, in this case, made it difficult to distinguish between fevers and bacterial sepsis. They go into detail about what this means. Distribution shift is a fundamental challenge in machine learning. Whenever we deploy machine learning in the real world without considering how that environment can change, whether through changes in technology (e.g., software vendors), population and setting (e.g., new demographics), or behavior (e.g., new reimbursement incentives), we fail to account for the ways the data can shift between training and test environments. If such shifts are not considered, the model will inevitably fail.</p>
<p>And why not? If the underlying test data diverges from the data used to develop the model, we should expect disappointing results. But the distribution shift problem is so common that ML researchers and practitioners have started seeing it everywhere they look. In many cases, they will inappropriately characterize any failure of a deployed ML model as a distribution shift. This both muddles our understanding of what exactly distribution shift means and limits our vocabulary for the range of failures that can show up in deployment. In this blog post, I’ll use Epic’s sepsis detector to illustrate some of the current confusion about distribution shift, and why the notion of “external validity,” a description of generalization problems used widely in other fields, is perhaps more relevant.</p>
<p>The terminology of distribution shift is both too specific and not specific enough. A “change in distribution” could describe anything from a variation in data source to a re-sampling. <a href="https://rtg.cis.upenn.edu/cis700-2019/papers/dataset-shift/dataset-shift-terminology.pdf">These changes could involve changes to the input features (i.e., covariate shift), changes to the labels (i.e., prior probability shift), or both (i.e., concept drift).</a></p>
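<p>A minimal sketch of the first of these, covariate shift, may help fix ideas. In the sketch below, the quadratic labeling function and the input ranges are invented for illustration: a model fit under one input distribution can be badly wrong under another, even though the relationship between inputs and labels never changes.</p>

```python
# A hypothetical covariate shift: the labeling rule P(y|x) is fixed,
# but the input distribution P(x) moves after deployment.

def label(x):          # the "true" relationship, unchanged at test time
    return x ** 2

def fit_line(xs, ys):  # ordinary least-squares fit of y = a + b*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b

train_x = [i / 100 for i in range(101)]      # training inputs: [0, 1]
shift_x = [2 + i / 100 for i in range(101)]  # deployed inputs: [2, 3]

a, b = fit_line(train_x, [label(x) for x in train_x])

def mse(xs):
    return sum((label(x) - (a + b * x)) ** 2 for x in xs) / len(xs)

print(mse(train_x))  # small: the line fits the training range well
print(mse(shift_x))  # orders of magnitude larger under the shift
```

<p>Prior probability shift and concept drift would instead move the label frequencies or the labeling function itself.</p>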
<p>The notion of “data distributions” itself assumes data comes from an imagined data-generating function. In a world of infinitely abundant independent data points sampled from a bespoke probability distribution (the “independent and identically distributed,” or i.i.d., assumption), describing data in terms of how it’s distributed makes a lot of sense. But as Breiman describes in <a href="http://www2.math.uu.se/~thulin/mm/breiman.pdf">“Statistical Modeling: The Two Cultures”</a>, that assumption rarely holds for real-world data. Very rarely does one actually know the data-generating function, or even a reasonable proxy: real-world data is disorganized, inconsistent, and unpredictable. As a result, the term “distribution” is too vague to provide the specificity needed to direct actions and interventions. When we talk about a hypothetical distribution shift, we talk about data changes without being specific about which changes happen and why. We also constrain our discourse by looking only at changes in the data, when in fact many other changes occur between development and deployment (such as changes in how people interact with the model, changes in how its results are interpreted, and so on). Specifying the type of distribution shift is one solution, but more importantly, we need to understand specific distribution shifts as part of a broader phenomenon of external validity that we, as a field, need to begin to articulate.</p>
<p>The most significant consequence of this myopic focus on distributions is how it constrains ML evaluations. The benchmarking paradigm that dominates ML is a by-product of the preoccupation with detecting shifts in data: the evaluation of models on static test sets is tied to the assumption that failures are due to shifts in data distribution and not much else. A myopic view of distribution shift confuses the discourse on how to evaluate models for deployment. <a href="https://www.nature.com/articles/s41591-021-01312-x">Several</a> <a href="https://www.bmj.com/content/374/bmj.n1872">studies</a> on <a href="https://www.nature.com/articles/s41746-020-00324-0">regulatory approvals</a> of ML-based tools in healthcare already demonstrate how this over-emphasis has led ML practitioners and even regulators in the healthcare space to inappropriately prioritize <em>retrospective studies</em> (i.e., evaluations on static collections of past data) over <em>prospective studies</em> (i.e., examinations of the system within its context of use). Things like multi-site assessment, median evaluation sample size, demographic subgroup performance, and “side-by-side comparison of clinicians’ performances with and without AI” are also exceedingly rare in the evaluation of ML-based healthcare tools, as they don’t fit our current narrow perception of what can go wrong when you throw an ML model into the real world. Of course distribution shift matters, but the way we focus on it to the exclusion of everything else is regrettable. For better regulation and evaluation methodology for machine learning deployments, we need to expand our thinking and align ourselves with other fields attempting to understand performance gaps between the theory and practice of interventions.</p>
<p>This broader notion of validity characterizes the accuracy of the claims being made in a specific context. The related notion of reliability has to do with reproducibility and the consistency of results (think of measurement precision), but validity is concerned with truthfulness: how close claims get to describing the real relationship between inputs and outputs. Various notions of validity are discussed in the measurement theory, evaluation science, program evaluation, and experiment design literatures, but they share common core concepts. For example, internal validity is about assessing a consistent causal relationship between the inputs and outputs within the experiment, and construct validity concerns how well experimental variables represent the real-world phenomena being observed. When discussing generalization issues, we are most interested in external validity, which asks whether the causal relationship between inputs and outputs observed in an experiment holds outside the experimental setting.</p>
<p>To understand how external validity differs from the current discourse on distribution shift, let’s go back to the sepsis monitoring example. <a href="https://jamanetwork.com/journals/jamainternalmedicine/article-abstract/2781307">“External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients,”</a> published in JAMA, describes a retrospective study on the use of the sepsis tool between December 2018 and October 2019 (notably well before the pandemic began). The authors examined 27,697 patients undergoing 38,455 hospitalizations and found that the Epic Sepsis Model predicted the onset of sepsis with an area under the curve of 0.63, “which is substantially worse than the performance reported by its developer.” Furthermore, the tool “did not identify 1,709 patients with sepsis (67%) despite generating alerts… for 6,971 of all 38,455 hospitalized patients (18%), thus creating a large burden of alert fatigue.” These researchers rightfully describe these issues as “external validity” issues, and go into detail examining a range of problems far beyond the data-related “shifts” described in the “Clinician and Dataset Shift” op-ed. They don’t pretend this has nothing to do with changes in the data; of course it does. Epic’s system was evaluated on data from three US health systems from 2013 to 2015, and that’s certainly a different dataset than the University of Michigan’s 2018–2019 patient records. But the researchers also comment on changes to the interactions doctors had with the model and how those modified outcomes, as well as other external validity factors that had very little to do with data at all, much less “data distribution shift.” Even when discussing substantive data changes, they are specific in characterizing what changed and in breaking down the differences that occurred upon deployment at their hospital.</p>
<p>As this study shows, machine learning needs clear guidelines for evaluating external validity. To begin scaffolding such frameworks, we can learn from the social sciences. For example, Erin Hartman, a UC Berkeley colleague in political science, and her co-author Naoki Egami <a href="https://erinhartman.com/publication/elements/">propose a taxonomy that provides an interesting start to this discussion</a>. Their interest is in assessing external validity in scenarios where a population is given a policy treatment (e.g., sending out voting reminders, updating the tax code, giving out free vaccines) and the impact of this treatment is measured both within the experiment and once implemented in the real world. If we consider the treatment to be an ML model’s integration into a broader system, we can begin to articulate what external validity could mean in the algorithmic context. In my next blog post, I’ll work through Hartman and Egami’s framework, along with specific proposals from other fields, to begin taxonomizing external validity issues and see which of the problems they describe are relevant to machine learning.</p>
Thu, 31 Mar 2022 00:00:00 +0000
http://benjamin-recht.github.io/2022/03/31/external-evaluations/
http://benjamin-recht.github.io/2022/03/31/external-evaluations/
Machine Learning has a validity problem.<p>One of the central tenets of machine learning warns that the more times you run experiments with the same test set, the more you overfit to that test set. This conventional wisdom is mostly wrong and prevents machine learning from reconciling its inductive nihilism with the rest of the empirical sciences.</p>
<p>Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar led a passionate quest to test the overfitting hypothesis, devoting countless hours to reproducing machine learning benchmarks. In particular, they painstakingly recreated a test set of the famous <a href="https://www.image-net.org/">ImageNet benchmark</a>, which is itself responsible for bringing about the latest AI feeding frenzy. Out of the many surprises in my research career, what <a href="https://arxiv.org/abs/1902.10811">they found surprised me the most.</a></p>
<p class="center"><img src="/assets/RSS_Scatter.png" alt="The scatterplot of nightmares" width="90%" /></p>
<p>In this graph, the x-axis is the accuracy on the original ImageNet benchmark, which has been used millions of times by individual researchers at Google alone. The y-axis is the accuracy evaluated on the “ImageNet v2” set, which was made by closely replicating the data creation method of the original benchmark. Each blue dot represents a single machine learning model trained on the original ImageNet data. The red line is a linear fit to these models, and the dashed line is what we would see if the accuracy were the same on both test sets. What do we see? The models that perform best on the original test set also perform best on the new test set. That is, there is no evidence of overfitting.</p>
<p>What is clear, however, is a noticeable drop in performance on the new test set. Despite the authors’ best efforts to reproduce the ImageNet protocol, there is evidence of a <em>distribution shift</em>. Distribution shift is a far-reaching term describing any situation where the data on which a machine learning algorithm is deployed differs from the data on which it was trained. The Mechanical Turk workers who labeled the new images were different from those originally employed. The API used for the labeling was slightly different. The mechanism for aggregating differences of opinion between labelers was slightly different. These small differences add up to around a 10% drop in accuracy, equivalent to five years of progress on the benchmark.</p>
<p>Folks in my research group have reproduced this phenomenon several times. In <a href="https://papers.nips.cc/paper/9117-a-meta-analysis-of-overfitting-in-machine-learning">Kaggle competitions</a>, where the held-out set and validation set were <em>identically</em> distributed, we saw no overfitting <em>and</em> no distribution shift. We found sensitivity to distribution shifts in CIFAR10, in <a href="https://arxiv.org/abs/1906.02168">video</a>, and in <a href="https://arxiv.org/abs/2004.14444">question answering</a> benchmarks. And Chhavi Yadav and Leon Bottou showed that we have not yet overfit to the <a href="https://arxiv.org/abs/1905.10498">venerable MNIST data set</a>, but distribution shift remains a challenge.</p>
<p>This marked sensitivity to distribution shift is a huge issue. If small ambiguities in reproductions lead to large losses in predictive performance, what happens when we take ML systems designed on static benchmarks and deploy them in important applications? A decade of AI fever has delivered piles of evidence that distribution shift is machine learning’s Achilles’ heel. Algorithms run inside the big tech companies need to be constantly retrained with their huge computing resources. <a href="https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002683">Data-driven algorithms for radiology often fail if one changes the X-ray machine</a>. <a href="https://jamanetwork.com/journals/jamainternalmedicine/article-abstract/2781307">AI algorithms for sepsis fail if you change hospitals</a>. And self-driving car systems are readily confused in new environments (no citation needed; keep your Tesla away from me).</p>
<p>The only way forward is for machine learning to engage more broadly with other scientists who have been tackling similar issues for centuries. My first proposal is simple: let’s change our terminology to align with the rest of the sciences. The study of distribution shift in machine learning has always been insular and, while machine learning is particularly sensitive, all empirical science must deal with the jump from experiment to reality.</p>
<p>With this in mind, <a href="https://twitter.com/rajiinio">Deb Raji</a> and I have been digging through the scientific literature for a while now hoping to find some answers. In most other parts of science, “robustness to distribution shift” is called external validity. External validity quantifies how well a finding generalizes beyond a specific experiment. For example, a significant result on a particular cohort may not generalize to a broader population.</p>
<p>Predictive algorithms and experimental science both rely on repeatability. “The sun has always risen in the east.” “The apple always falls straight to the ground.” We expect that given the same contexts, the natural world more or less repeats itself. There is unfortunately a big leap from the sun rising in the morning, to an experimental finding in machine learning or biomedicine being reproducible. Why?</p>
<p>The experimental contexts under which predictions and inferences are designed are often far too narrow. The results of a study performed on young male college students in Maine may not help us understand properties of a retirement community in Arizona. These populations are different! However, it may give us insights into other cohorts of male college students: a study at Bates may generalize to Colby or Bowdoin.</p>
<p>Contexts can change in a myriad of ways. Some examples include the following:</p>
<ol>
<li>The context can just be too narrow in the experiment. Do studies on adults generalize to children? Do studies of medications tested only on men generalize to women?</li>
<li>The measured quantity may itself change. It is often easier to measure, detect, and control for exogenous disturbances in a lab setting than in the real world.</li>
<li>Populations can change over time. For example, medical recommendations from the 1980s may no longer apply to the current population. Recent developments have led to <a href="https://www.npr.org/2021/10/13/1045746669/task-force-says-most-people-should-not-take-daily-aspirin-to-prevent-a-heart-att">not recommending aspirin to prevent heart attacks</a>. Machine Learners like to call this <em>covariate shift</em>.</li>
<li>Even more nefariously, the population can change in response to the intervention. A classic example of this is <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart’s Law</a> which states “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”</li>
</ol>
<p>How can we grapple with these external validity challenges? Verifying external validity is daunting, and the set of potential solutions remains quite limited. As I mentioned, Deb and I have been chatting about this for a year, and we’ve dragged the rest of the group into our investigations. So I’m going to share the blog with Deb for a few posts, and we’ll both expand on what we’ve been reading and thinking about. In the next few posts, we’ll explore some of the intricacies of when external validity can fail and will also try to spell out some of the research directions that might help bridge the gaps between experiments and reality.</p>
Tue, 15 Mar 2022 00:00:00 +0000
http://benjamin-recht.github.io/2022/03/15/external-validity/
http://benjamin-recht.github.io/2022/03/15/external-validity/
Let us never speak of these values again.<p>A recent <a href="https://twitter.com/emollick/status/1493428796539772937">Twitter quiz</a> asked, “What is a powerful concept from your field that, if more people understood it, their lives would be better?” Unambiguously, the answer from my field is statistical significance. Significance testing is a confusing, obscure statistical practice. It is hard to explain and usually impossible to justify, but papers can only be published if they are “statistically significant.” Here, I’ll explain in as plain terms as I can what statistical significance means in almost every published scientific study. I’ll do this without ever defining a p-value, as p-values have nothing to do with the way significance testing is used. Instead, significance testing amounts to hand-wavy arguments about precision and variability. Laying it out this way shows why the authority granted to significance testing is so suspect and unearned.</p>
<p>Let’s suppose we’re trying to evaluate the benefit of some intervention. We test the intervention on a bunch of individuals and compute the average benefit which we call the effect size. The effect size could be the number of years of life a person gains with some cancer treatment or the amount of money gained with an investment strategy.</p>
<p>Since the effect size is the <em>average</em> over all the individuals in a study, some individuals will have received less benefit than average and some more. Combining this varied benefit with the reality that experiments are always noisy and complicated, it may very well be that even if we measure a positive benefit on average, the intervention may actually be mostly harmful to the general population.</p>
<p>Think about the hypothetical situation where, for half the population, the benefit is equal to 1, and for the other half it’s negative 1. In this case, the average benefit is zero. If we collected a random set of people and computed the average effect size, we’d see a positive effect size about half of the time. In other words, if your experiment was “flip a fair coin 100 times,” you’d see more heads than tails in just under half of your runs.</p>
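<p>The coin-flip arithmetic can be checked exactly with a few lines of Python (whether a tie counts as a “positive” run matters at small sample sizes):</p>

```python
from math import comb

def prob_more_heads(n):
    """Exact chance of strictly more heads than tails in n fair flips."""
    return sum(comb(n, k) for k in range(n // 2 + 1, n + 1)) / 2 ** n

print(prob_more_heads(10))   # ≈ 0.377: ties eat a chunk of the probability
print(prob_more_heads(100))  # ≈ 0.46: approaches one half as n grows
```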
<p>The experimenter’s goal is to distinguish whether the benefit is large, small, or whether the intervention is actually harmful. What the paradigm of statistical significance aims to do is determine whether an intervention is “mostly good.” It goes like this: using statistics, you estimate the standard error (SE) of the measured effect size. Roughly speaking, the standard error measures how spread out the effect size is over a population. When the standard error is small, everyone in the population experiences nearly the same effect from a treatment. When the standard error is large, even if the average effect size is positive, some may experience negative effects and some positive.</p>
<p>If the measured effect size is greater than twice the standard error, you declare that your intervention is statistically significant at the “p<0.05” level. If the effect size is bigger than 2.6 times the standard error, you declare statistical significance at the “p<0.01” level. If the effect size is bigger than three times the standard error, you declare statistical significance at the “p<0.003” level! Wow! There’s a Nobel Prize in your future. But if the effect size is only 1.9 times the standard error, you unfortunately have not found a statistically significant result and have to throw all your work in the trash. Unless, of course, you are an economist. In that case, you are allowed the special “p<0.1” level, which occurs when the effect size is 1.7 times the standard error.</p>
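<p>These cutoffs are just tail probabilities of a Gaussian approximation. A short sketch, assuming the usual two-sided normal tail, recovers each one:</p>

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided Gaussian tail: chance of landing at least z standard
    errors from zero when the true effect is zero."""
    return erfc(z / sqrt(2))

# the cutoffs described above, recovered from the normal tail
for z in (1.7, 1.9, 2.0, 2.6, 3.0):
    print(f"{z}: p = {two_sided_p(z):.4f}")
# 2.0 SEs gives p ≈ 0.046 (< 0.05); 1.9 gives p ≈ 0.057 (not significant);
# 2.6 gives p ≈ 0.009; 3.0 gives p ≈ 0.003; 1.7 gives p ≈ 0.089 (< 0.1)
```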
<p>The following figure, adapted from a tweet by <a href="https://twitter.com/TAH_Sci/status/1490701257769734145">Thomas House</a>, nicely illustrates the situation:</p>
<p class="center"><img src="/assets/significant.png" alt="Only one of these droids is significant" width="100%" /></p>
<p>Here, there are four hypothetical distributions of benefit. All but (a) are “statistically significant.” The interventions we are always striving for are ones with clear benefit, like (d) in the bottom right. Interventions like this do exist: In the Pfizer trial, the effect size was 12 times larger than the standard error. Vaccines work! But most interventions are not vaccines (or parachutes for that matter).</p>
<p>The other panels show why statistical significance alone is so problematic. Figures (a) and (b) are nearly indistinguishable plots, but one is significant and the other is not. Moreover, statistical significance also misses the forest for the trees. You can have a minuscule effect size and still have a significant effect. Do we always prefer (c) to (a)? Is a meager but mostly positive benefit necessarily better than a treatment of potentially large benefit to some but harm to others? Wouldn’t it be in our interest to understand this spread of outcomes so we could isolate the group of individuals who benefit from the treatment?</p>
<p>The fact that a mere factor of 1.5 separates “not publishable” from “undoubtedly real” is deeply concerning. And every statistician knows how estimates of standard error can be <em>very</em> sensitive. Simple approximations can make the standard error appear 1.4 times smaller, which is enough to transform an insignificant result into a significant one. This is what is commonly known as “data dredging” or “p-hacking”: hunting for the set of assumptions under which your experiment has a small enough standard error to be statistically significant.</p>
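<p>One standard way a standard error estimate can silently shrink by roughly that factor of 1.4 is by treating clustered observations as independent. Kish’s design effect gives the inflation factor; the cluster size and correlation below are invented for illustration:</p>

```python
from math import sqrt

def se_inflation(cluster_size, rho):
    """Kish design effect: factor by which a naive i.i.d. standard error
    understates the truth when observations share intra-cluster
    correlation rho."""
    return sqrt(1 + (cluster_size - 1) * rho)

# hypothetical numbers: clusters of 5 with modest within-cluster correlation
print(se_inflation(5, 0.25))  # = sqrt(2) ≈ 1.41
```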
<p>The precise definitions of standard error and p-value don’t illuminate the situation. P-values lead to pedantry and quibbling about tiny effects, and their actual definition, complicated and hard to explain even to other statisticians, just confuses people without fixing science. Most practicing scientists would be better off not knowing what a p-value is.</p>
<p>And a lot of the other fixes also don’t help. For example, the <a href="https://clincalc.com/Stats/FragilityIndex.aspx">fragility index</a> is often used in medicine to describe how many “non-events” would have to become events for the significance to vanish. But this just conflates sample size and p-values, and it doesn’t get away from the core problem: statistical significance testing is a massive waste of time.</p>
<p>I am by no means the first person to complain about the absurdity of significance testing. Some examples from the last 50 years include <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.200.7648&rep=rep1&type=pdf">Meehl</a> (1978), <a href="https://www.jstor.org/stable/1803924">Leamer</a> (1983), <a href="https://www.jstor.org/stable/270939">Freedman</a> (1991), <a href="https://www.bmj.com/content/308/6924/283">Altman</a> (1994), <a href="http://www.principlesofeconometrics.com/poe5/writing/kennedy.pdf">Kennedy</a> (2002), <a href="https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124">Ioannidis</a> (2005), <a href="https://www.princeton.edu/~deaton/downloads/Instruments_of_Development.pdf">Deaton</a> (2009), <a href="https://www.press.umich.edu/186351/cult_of_statistical_significance">Ziliak and McCloskey</a> (2008), and <a href="https://stat.columbia.edu/~gelman/research/published/asa_pvalues.pdf">Gelman</a> (2016). The significance testing framework is barely 100 years old. And people have been rightfully attacking it for nearly as long. Why do we remain so stuck?</p>
<p>The aforementioned list of grumpy people have a deeper criticism of contemporary science beyond significance testing, and this critique is one I hope to take up in future blogs. When can we trust scientific consensus and what structures and methods are necessary to build valid scientific theories? Statistical validity is only a small part of the bigger picture that establishes the trustworthiness of a study. The study must be <em>construct valid</em>, measuring the right quantities. It must be <em>internally valid</em>, avoiding bias and confounding. And it must be <em>externally valid</em>, generalizable to other contexts. The next few blogs will try to unpack some thoughts on validity, and why validity and design remain the most pressing challenges in contemporary scientific inquiry.</p>
Wed, 23 Feb 2022 00:00:00 +0000
http://benjamin-recht.github.io/2022/02/23/standard-errors/
http://benjamin-recht.github.io/2022/02/23/standard-errors/
What were the effects of the Bangladesh mask intervention?<p>There’s been a bit of a social-media back-and-forth between us and Jason Abaluck about the design and statistical significance of the <a href="https://www.poverty-action.org/sites/default/files/publications/Mask_Second_Stage_Paper_20211108.pdf.pdf">Bangladesh Mask RCT</a>. To focus and hone the discussion on some crucial details, we just posted a <a href="http://arxiv.org/abs/2112.01296">short note</a> re-analyzing the data from this trial using standard non-parametric paired statistical tests on treatment-control village pairs. In this blog, we summarize those results, highlighting potentially significant biases in the study. Importantly, we found that the behavior of unblinded staff when enrolling study participants was one of the most highly significant differences between treatment and control groups. The significant impacts on staff and participant behavior urge caution in interpreting small differences in the study outcomes, which depended on survey responses.</p>
<p>Let’s first review the full study protocol, as it’s a cluster RCT and hence a bit more complicated than the typical trials with which many are familiar. First, 600 villages in Bangladesh were paired based on COVID case data, population density, and population size. Within each pair, one village was assigned to treatment and the other to control at random. Next, observers were sent to the treatment and control villages to enroll households in the study. Importantly, observers <em>were not blinded</em> to the treatment assignment in each village: the observers were also tasked with giving out the masks in the intervention villages, so they knew whether a village was assigned to treatment or control. Households approached by the observers either (a) consented to participate, (b) declined to participate, or (c) were marked “unreachable” by the unblinded observers. The study team then implemented a mask promotion intervention in the treatment villages. In both groups of villages, participants were asked to report COVID-like symptoms, and those who reported symptoms were asked to volunteer blood draws for serology. The primary endpoint was evaluated based on the number of these blood draws that tested positive for COVID antibodies.</p>
<p>As we have noted before, despite the 300:300 randomization of the 600 villages, there was a notable imbalance in the size of the consenting populations between the control and treatment groups. The control group contained 156,938 individuals while the treatment group contained 170,497 individuals. The total numbers of symptomatic seropositives in the treatment and control villages were 1,086 and 1,106, respectively. This difference is too small to be significant had participants been individually randomized. In the study, an effect is asserted for the <em>relative rate</em> of symptomatic seropositives, i.e., normalized by population denominators. We note that the reported 10% decrease in the fraction of individuals who become symptomatic seropositives is driven not primarily by a decrease in the numerator of this fraction, but by the increase in the denominator.</p>
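<p>The arithmetic is easy to reproduce from the counts quoted above:</p>

```python
# counts from the study: symptomatic seropositives / consenting individuals
treat_pos, treat_n = 1086, 170497
ctrl_pos, ctrl_n = 1106, 156938

relative_rate = (treat_pos / treat_n) / (ctrl_pos / ctrl_n)
print(relative_rate)         # ≈ 0.90: the reported ~10% relative reduction

# with equal denominators, the numerators alone would give only ~2%
print(treat_pos / ctrl_pos)  # ≈ 0.98
```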
<p>What is the right way to test whether attributes differ between the treatment and control groups? If we have an attribute that we observe in every village, we can assess the hypothesis “a control village is equally likely to have a larger value of the attribute than its paired treatment village as it is to have a smaller value.” Since the villages were paired, a standard non-parametric test for such questions is the <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a>. This test is nice because we need not worry about sophisticated models or variance correction techniques. Instead, we can simply assess whether bulk statistics are plausibly different in the treatment and control groups. Let’s use this test to gain insights about the mask study.</p>
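<p>For a small number of pairs, the signed-rank test can even be computed exactly by brute-force enumeration, which makes its logic transparent. Here is a minimal stdlib-only sketch (assuming no ties or zero differences; the pair differences below are invented, and a real analysis would use a library implementation):</p>

```python
from itertools import product

def signed_rank_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value, computed by
    enumerating all sign patterns (no ties/zeros; small n only)."""
    n = len(diffs)
    # rank the absolute differences from smallest to largest
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    w_obs = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4  # expected positive-rank sum under the null
    # count sign patterns whose rank sum is at least as extreme as observed
    extreme = sum(
        abs(sum(r for s, r in zip(signs, ranks) if s) - mean)
        >= abs(w_obs - mean)
        for signs in product((0, 1), repeat=n)
    )
    return extreme / 2 ** n

# every treatment village beats its pair: strong evidence of a difference
print(signed_rank_p([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # ≈ 0.002
# wins and losses alternate: no evidence either way
print(signed_rank_p([1, -2, 3, -4, 5, -6, 7, -8, 9, -10]))
```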
<p>First, let’s look at the count in each village of individuals who reported COVID-19 symptoms and tested positive for COVID antibodies. The null hypothesis is that a control village is equally likely to have more symptomatic seropositives than its treatment pair as fewer. According to the Wilcoxon signed-rank test, the p-value is 0.97 and we cannot reject the null hypothesis. What about the <em>rate</em> of symptomatic seropositivity? Was the percentage of infections in the control group higher than in the treatment group? Again, we cannot reject the null, as the p-value of the Wilcoxon test is 0.25. In our writeup, we found similar results when restricting attention to the subgroups assigned to cloth masks and surgical masks, and also to the population of individuals over 60. Based on these tests, we cannot assert that the intervention affected the primary endpoint of symptomatic seroprevalence. However, there are behavioral effects that are significantly different between treatment and control.</p>
<p class="center"><img src="/assets/village_attribute_boxplots.png" alt="Boxplots of different village features" width="100%" /></p>
<p>From our analysis, it is quite clear that more mask wearing was observed in the treatment group than in the control group (p<$10^{-47}$). Moreover, there was a large difference in observed social distancing (p<$10^{-15}$). It is notable—even if one did not expect any effect from masks at all—that the increase in observed physical distancing did not translate into clearer differences in symptomatic seropositivity between the treatment and control groups.</p>
<p>Second, the populations of the treatment and control groups themselves are very different. One of the main difficulties in running a mask trial is the issue of blinding. You certainly cannot run a blinded intervention, as people know whether or not they are wearing masks. But in such unblinded studies, it is critical that the populations sampled for surveying be identical before any unblinding occurs.</p>
<p>Here is where things get a bit subtle. The villages were assigned to treatment at random, but the households were not. The surveyors who handed out masks knew in advance the treatment assignment of their villages. This knowledge alone induced a highly significant difference. <em>The fraction of households approached in each village is significantly different in treatment and control</em> (p<$10^{-11}$). This selection bias induced a large imbalance in the size of the treatment and control groups, and may have affected the overall seropositivity counts. Interestingly, we found that the rate of consent for symptom surveys and the rate of consent for blood draws were indistinguishable between treatment and control. The main significant difference was due to the behavior of the study staff.</p>
<p class="center"><img src="/assets/survey_boxplots.png" alt="Boxplots of survey interactions" width="100%" /></p>
<p>That the unblinded staff behaved differently in the different types of villages is not surprising; similar experimenter behavior has blemished randomized trials since their inception. We can’t know the exact cause of the difference in households reached, but perhaps the staff put just a little more effort into soliciting responses in the treatment group because they were excited about testing their intervention. Whatever the case, this behavioral difference created a large population difference between the groups: whereas 3,394 of 68,514 households were unreachable in the treatment group, 4,970 of 65,536 households were unreachable in control. This is a difference of over four thousand people, far exceeding the 20-case difference in symptomatic seropositivity.</p>
<p>Our analysis suggests that the mask intervention was highly effective at modifying behaviors (distancing, mask-wearing, symptom reporting), but that any effect on actual symptomatic seropositivity was much more subtle. In particular, whatever effect the intervention had on the rate of symptomatic seropositivity in the villages was certainly not large relative to other factors contributing to variance in this parameter across villages. We suggest that the very large causal effects on consent rates, and thus on population denominators, urge caution in interpreting the small differences we see in symptomatic seropositivity between treatment and control, which are not statistically significant according to standard nonparametric paired tests.</p>
<p><em>Code to reproduce the figures in this post and those in our technical <a href="https://people.eecs.berkeley.edu/~brecht/papers/CPR_mask_note.pdf">note</a> can be found <a href="https://github.com/mchikina/maskRCTnote">here</a>.</em></p>
Wed, 01 Dec 2021 00:00:00 +0000
http://benjamin-recht.github.io/2021/12/01/unblinding/
http://benjamin-recht.github.io/2021/12/01/unblinding/The cult of statistical significance and the Bangladesh Mask RCT.<p>In the last post, <a href="https://www.argmin.net/2021/11/23/mask-rct-revisited/">I argued that the effect size in the Bangladesh Mask RCT was too small to inform policy making</a>. I deliberately avoided diving into statistical significance as arguments about p-values quickly devolve into scientific gish gallop. Statistical validity is the most overrated form of experimental validity, and it crowds out more important questions of effect size, bias, design, and applicability.</p>
<p>But shoot, sometimes Byzantine academic arguments are fun. And, though they are always wrong, sometimes they are useful. In this blog I want to discuss how we analyze statistical validity in cluster randomized controlled trials. The analysis is quite subtle and sensitive, and should give us pause about these experimental designs. The sample sizes needed for validating effects in cluster randomized trials can be absurdly high, and running a clean trial with millions of participants is prohibitively difficult and almost never worth doing.</p>
<p>To review: in the <a href="https://www.poverty-action.org/sites/default/files/publications/Mask_Second_Stage_Paper_20211108.pdf.pdf">Bangladesh Mask RCT</a>, there were $n_C=$163,861 individuals from $k_C=$300 villages in the control group. There were $n_T=$178,322 individuals from $k_T=$300 villages in the intervention group. The main endpoint of the study was whether the intervention reduced the number of individuals who reported covid-like symptoms and tested seropositive at some point during the trial. There were $i_C=$1,106 symptomatic individuals confirmed seropositive in the control group and $i_T=$1,086 such individuals in the treatment group.</p>
<p>What can we say about the statistical significance of this 20 case difference? Most would guess this difference is not significant. Indeed, in a balanced design with $n_T=n_C$ and 180,000 individuals in each arm, this study would not be statistically significant. Let’s imagine we ran an experiment where we could treat the outcomes of each individual as independent, identically distributed random variables. What is the p-value associated with the null hypothesis that the prevalence of infections in the control group is less than or equal to the prevalence in the treatment group? A simple statistical test of this hypothesis is the z-test for proportions. For the z-test, the p-value when the groups are balanced is 0.3.</p>
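<p>For concreteness, here is a sketch of that z-test in plain Python, using pooled variance and a one-sided alternative. The exact figures in this post come from the linked notebook, and minor conventions (pooled vs. unpooled variance, say) shift the p-values slightly, so treat these numbers as approximate.</p>

```python
import math

def ztest_one_sided(i_t, n_t, i_c, n_c):
    """Pooled two-proportion z-test of H0: control prevalence <= treatment."""
    p_pool = (i_t + i_c) / (n_t + n_c)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = (i_c / n_c - i_t / n_t) / se
    return z, 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail p-value

# Hypothetical balanced design: 180,000 individuals per arm.
z_bal, p_bal = ztest_one_sided(1086, 180_000, 1106, 180_000)
# The arm sizes actually observed in the study.
z_act, p_act = ztest_one_sided(1086, 178_322, 1106, 163_861)
print(round(p_bal, 2), round(p_act, 3))
```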
<p>The authors claim a balanced design, but, though the number of treatment and control villages are indeed equal, the number of <em>individuals</em> in the treatment group is 1.1x bigger than the control group. <a href="https://www.argmin.net/2021/11/23/mask-rct-revisited/">As I mentioned in the previous post</a>, this discrepancy can likely be explained by the large differential in response rates between the groups: 1.05x fewer households were approached for surveys in control and the control group responded at 1.07x lower rate than treatment. The most significant difference between the treatment and control group may very well be the consent rate of the household survey. For the medical statistics experts, <a href="https://en.wikipedia.org/wiki/Intention-to-treat_analysis">the intention to treat principle</a> says that the individuals who are unreachable or who refuse to be surveyed must be counted in the study. Omitting them invalidates the study.</p>
<p>But the issues of significance remain even if we forgive this large imbalance in the study. If we re-run the z-test with the $n_C$ and $n_T$ in the study data, the p-value is now 0.009, which would be quite significant at the standard p < 0.05 threshold. However, the individual outcomes are <em>not</em> independent. The trial was cluster-randomized, so everyone in the same village received the same intervention. This means that the outcomes inside a village are correlated, and they are likely more correlated inside a village than outside.</p>
<p>To capture the correlation among intra-cluster participants, statisticians use the notion of the <a href="https://www.povertyactionlab.org/resource/power-calculations"><em>intra-cluster correlation coefficient</em></a> $\rho$. $\rho$ is a scalar between 0 and 1 that measures the relative variance within clusters and between clusters. When $\rho=1$, all of the responses in each cluster are identical. When $\rho=0$, the clustering has no effect, and we can treat the assignment as purely randomized. Once we know $\rho$, we can compute an <em>effective sample size</em>: if the villages were completely correlated, the effective number of samples in the study would be 600. If all individuals were independent, the number of samples would be over 340,000. The number of effective samples is equal to the total number of samples divided by the <em>design effect</em>:</p>
\[{\small
DE = 1+\left(\frac{n_T+n_C}{k_T+k_C}-1\right)\rho \,.
}\]
<p>What is the design effect of the Bangladesh RCT? Measuring the intra-cluster correlation $\rho$ is nontrivial: the true value of $\rho$ depends on both potential outcomes in an experiment and needs to be estimated using some side experiment or previous trials at baseline. $\rho$ is often inferred from secondary covariates of earlier experiments on a similar population. We don’t have a pre-specified estimate, but we can cheat a bit here and estimate $\rho$ from the provided data in the control villages. A standard ANOVA calculation says that the observed symptomatic seropositivity in the control villages has an intra-village correlation of $\rho=$0.007. This value isn’t particularly unreasonable. Some practitioners suggest that because of behavioral contagion alone, $\rho$ should be <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1466680">between 0.01 and 0.02 for human studies.</a> <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0013998#pone.0013998-Carrat2">This cluster RCT on mask use to prevent influenza in households</a> uses $\rho=$0.24. As a sanity check, the intra-village correlation of reported symptoms is 0.03. So let’s stick with $\rho=$0.007 and see where it takes us.</p>
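<p>The one-way ANOVA estimator of $\rho$ takes only a few lines. Here is a simplified sketch that assumes equal-sized clusters and skips the usual unequal-cluster-size correction, run on synthetic villages rather than the study data:</p>

```python
import random
from statistics import mean

def icc_anova(clusters):
    """One-way ANOVA intra-cluster correlation: (MSB - MSW) / (MSB + (m-1) MSW)."""
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    means = [mean(c) for c in clusters]
    grand = sum(len(c) * m for c, m in zip(clusters, means)) / n
    msb = sum(len(c) * (m - grand) ** 2 for c, m in zip(clusters, means)) / (k - 1)
    msw = sum((x - m) ** 2 for c, m in zip(clusters, means) for x in c) / (n - k)
    m0 = n / k  # average cluster size
    return (msb - msw) / (msb + (m0 - 1) * msw)

rng = random.Random(1)
# Independent binary outcomes: the estimate should land near zero.
flat = [[int(rng.random() < 0.007) for _ in range(500)] for _ in range(300)]
# Cluster-level heterogeneity: alternating low- and high-prevalence villages.
lumpy = [[int(rng.random() < (0.002 if v % 2 else 0.012)) for _ in range(500)]
         for v in range(300)]
print(icc_anova(flat), icc_anova(lumpy))
```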
<p>For the Bangladesh RCT, assuming $\rho=$0.007, the design effect is about 5. This reduces the effective sample size from over 340,000 to just under 70,000. What happens with our z-test? We simply take the z-score and divide by the square root of the design effect, yielding a p-value of 0.14. The result is not statistically significant once we take into account the intra-cluster correlation. In order to “achieve” statistical significance, $\rho$ would need to be less than 0.001 and the design effect would have to be less than 2.2.</p>
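<p>Putting numbers to this, here is a sketch of the design-effect adjustment using the counts quoted above and the $\rho=$0.007 estimate:</p>

```python
import math

n_t, n_c, i_t, i_c = 178_322, 163_861, 1086, 1106
k_t = k_c = 300
rho = 0.007

# Design effect: 1 + (average cluster size - 1) * rho.
de = 1 + ((n_t + n_c) / (k_t + k_c) - 1) * rho
print(round(de, 2))  # roughly 5

# Recompute the naive pooled z-score, then deflate it by sqrt(DE).
p_pool = (i_t + i_c) / (n_t + n_c)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
z = (i_c / n_c - i_t / n_t) / se
p_adj = 0.5 * math.erfc(z / math.sqrt(de) / math.sqrt(2))
print(round(p_adj, 2))  # ~0.14: no longer significant
```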
<p>We can also apply similar design-effect adjustments to the relative risk. Recall that the relative risk is the ratio of the rate of infection in the treatment group to the rate of infection in the control group</p>
\[{\small
RR = \frac{i_T/n_T}{i_C/n_C}\,.
}\]
<p>A small $RR$ corresponds to a large reduction in risk. For the mask study, the estimated risk reduction is $RR=$0.9. If the assignments of every individual to treatment and control were random, we could compute error bars on the log of the risk ratio. The log risk ratio is</p>
\[{\small
\ell RR = \log \frac{i_T/n_T}{i_C/n_C}\,,
}\]
<p>and <a href="https://en.wikipedia.org/wiki/Relative_risk#Inference">a standard estimate of the standard error $SE$ of $\ell RR$</a> is</p>
\[{\small
SE = \sqrt{ \frac{1}{i_T} + \frac{1}{i_C} - \frac{1}{n_T}- \frac{1}{n_C}}\,.
}\]
<p>In the mask study, $SE=$0.043. Using a Gaussian approximation, our confidence interval would then be</p>
\[{\scriptsize
[\exp(\ell RR - 1.96 SE), \exp(\ell RR + 1.96 SE)] = [0.83, 0.98]\,.
}\]
<p>The way we interpret the confidence interval (and I’ll likely screw this up) is that if the Gaussian approximation were true, and all of the individual assignments to treatment and control were independent, and we repeated the experiment many times, the true risk ratio would fall inside the confidence interval 95% of the time. This calculation suggests that the confidence interval (barely) excludes a risk ratio of 1. However, this calculation does not take into account the cluster effects. Assuming again that $\rho=$0.007, when we adjust our confidence intervals for cluster effects, we get the larger interval</p>
\[{\scriptsize
\left[\exp(\ell RR - 1.96 SE \sqrt{DE}), \exp(\ell RR + 1.96 SE \sqrt{DE} )\right] = [0.75, 1.09]\,.
}\]
<p>Again, a standard cluster RCT analysis would not be able to reject a null effect for the complex masking intervention. In terms of my most-loathed statistic, efficacy, the confidence interval ranges from -9% to 25% after adjustment.</p>
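<p>The interval arithmetic above is easy to reproduce. Here is a sketch using the study counts and the same $\rho=$0.007:</p>

```python
import math

i_t, n_t, i_c, n_c = 1086, 178_322, 1106, 163_861
rho = 0.007
de = 1 + ((n_t + n_c) / 600 - 1) * rho  # 600 villages in total

l_rr = math.log((i_t / n_t) / (i_c / n_c))
se = math.sqrt(1 / i_t + 1 / i_c - 1 / n_t - 1 / n_c)

for label, width in (("naive", se), ("cluster-adjusted", se * math.sqrt(de))):
    lo = math.exp(l_rr - 1.96 * width)
    hi = math.exp(l_rr + 1.96 * width)
    print(f"{label}: [{lo:.2f}, {hi:.2f}]")
# naive: [0.83, 0.98]
# cluster-adjusted: [0.75, 1.09]
```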
<p>Note that even the strong claims made in the paper about subgroups are not significant once intra-cluster correlation is accounted for. A commonly quoted result is that surgical masks dramatically reduced infections for individuals over 60 years old. In this case, $n_C =$14,826, $n_T$=16,088, $i_C=$157 and $i_T=$124. The estimated effectiveness is 27%. However, with a design effect of 5, the p-value for the z-test here is 0.13 and the confidence intervals for the efficacy are -23% to 57%. So again, one can’t rest on statistical significance to argue this effect is real.</p>
<p>As a last statistical grumble, all of these corrections don’t even account for the multiple hypothesis testing in the manuscript where nearly 200 hypotheses were evaluated. <strong>After a <a href="https://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a> and accounting for design effect, none of the p-values would be less than 0.5.</strong></p>
<p>How large would the trial have to be in order to have statistical significance? We can focus on the z-test, and ask how many samples would be needed to reject the null hypothesis 95% of the time when the relative risk is 0.9, the prevalence in the control group is 0.076, and the intra-cluster correlation is 0.007. The answer is <strong><em>1.1 million people</em></strong>, over 3 times larger than the actual study size.</p>
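<p>How such a power calculation works can be sketched with the standard two-proportion sample-size formula inflated by the design effect. The inputs below are assumptions for illustration: I plug in the observed control-arm rate (1106/163,861 ≈ 0.0068) and a village size of roughly 570. The exact 1.1-million figure comes from the linked notebook and depends on conventions (one- vs. two-sided, power target), so treat this as an order-of-magnitude check: clustering inflates the requirement by roughly the design effect, landing in the millions.</p>

```python
from statistics import NormalDist

def cluster_trial_size(p_c, rr, rho, cluster_size, alpha=0.05, power=0.95):
    """Total subjects for a one-sided two-proportion test, inflated by the
    design effect for a fixed cluster size. A rough sketch only."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)
    p_t = rr * p_c
    n_arm = z ** 2 * (p_c * (1 - p_c) + p_t * (1 - p_t)) / (p_c - p_t) ** 2
    return 2 * n_arm * (1 + (cluster_size - 1) * rho)

# Assumed inputs: observed control-arm rate and an average village of ~570.
needed = cluster_trial_size(0.0068, 0.9, 0.007, 570)
no_clustering = cluster_trial_size(0.0068, 0.9, 0.0, 570)
print(f"{needed:,.0f} with clustering vs {no_clustering:,.0f} without")
```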
<p>When a power calculation reveals that a trial needs more than a million subjects, researchers should pause and ask whether they are asking the right question. It is likely impossible to conduct a precise experiment that rules out all confounding at such a scale. The number of people needed to run such a trial is huge, and maintaining data quality would be both prohibitively difficult and expensive. Any trial has potential harms to its subjects, and the larger the sample size, the more likely harm is to occur. Ensuring beneficence and informed consent at this scale is likely impossible. And if one really expects the clinical significance to be this small, why invest all of these resources into running an RCT instead of looking for more powerful interventions?</p>
<p><em>For those interested in seeing how I computed all of the numbers in this post and the <a href="https://www.argmin.net/2021/11/23/mask-rct-revisited/">last post</a>, <a href="https://nbviewer.jupyter.org/url/argmin.net/code/revisiting-bd-mask-rct.ipynb">here is a Jupyter notebook.</a></em></p>
Mon, 29 Nov 2021 00:00:00 +0000
http://benjamin-recht.github.io/2021/11/29/cluster-power/
http://benjamin-recht.github.io/2021/11/29/cluster-power/Revisiting the Bangladesh Mask RCT.<p>In an earlier post, <a href="https://www.argmin.net/2021/09/13/effect-size/">I raised a few issues</a> with a <a href="https://www.poverty-action.org/sites/default/files/publications/Mask_Second_Stage_Paper_20211108.pdf.pdf">large-scale RCT run in Bangladesh aimed at estimating the effectiveness of masks on reducing the spread of the coronavirus</a>. In particular, I was a bit dismayed that the authors did not post the raw number of seropositive cases in their study, preventing me from computing standard statistical analyses of their results. I also objected to the number of statistical regressions run to pull signals out of a very complex intervention.</p>
<p>Recently, the authors were kind enough to release their <a href="https://gitlab.com/emily-crawford/bd-mask-rct">code and data</a>. I send nothing but kudos to them in this regard. Releasing code and data can help disambiguate questions that are not always answerable from papers alone. In fact, I was immediately able to answer my question by querying their data. In this post, I will walk through a simple analysis to estimate the efficacy of their proposed intervention.</p>
<p>In the Bangladesh Mask RCT, there were $n_C=$163,861 individuals from 300 villages in the control group. There were $n_T=$178,322 individuals from 300 villages in the intervention group. The main end point of the study was whether their intervention reduced the number of individuals who both reported covid-like symptoms and tested seropositive at some point during the trial. The number of such individuals appears nowhere in their paper, and one has to compute this from the data they kindly provided: There were $i_C=$1,106 symptomatic individuals confirmed seropositive in the control group and $i_T=$1,086 such individuals in the treatment group. The difference between the two groups was small: only <em>20 cases</em> out of over 340,000 individuals over a span of 8 weeks.</p>
<p>I have a hard time going from these numbers to the assured conclusion that “masks work” that was <a href="https://www.theatlantic.com/ideas/archive/2021/09/masks-were-working-all-along/619989/">promulgated</a> <a href="https://www.nature.com/articles/d41586-021-02457-y">by</a> <a href="https://www.nbcnews.com/science/science-news/largest-study-masks-yet-details-importance-fighting-covid-19-rcna1858">the</a> <a href="https://www.washingtonpost.com/world/2021/09/01/masks-study-covid-bangladesh/">media</a> or <a href="https://www.nytimes.com/2021/09/26/opinion/do-masks-work-for-covid-prevention.html">the authors</a> after this preprint appeared. This study was not blinded, as it’s impossible to blind a study on masks. The intervention was highly complex and included a mask promotion campaign and education about other mitigation measures including social distancing. Moreover, individuals were only added to the study if they consented to allow the researchers to visit and survey their household. There was a large differential between the control and treatment groups here, with 95% consenting in the treatment group but only 92% consenting in control. <em>This differential alone could wash away the difference in observed cases.</em> Finally, symptomatic seropositivity is a crude measure of covid, as the individuals could have been infected before the trial began.</p>
<p>Given the numerous caveats and confounders, the study still only found a tiny effect size. My takeaway is that a complex intervention including an educational program, free masks, encouraged mask wearing, and surveillance in a poor country with low population immunity and no vaccination showed at best modest reduction in infection. I think this summary is fair to the study authors. And this is valuable information to have! It reaffirms my priors that non-pharmaceutical interventions are challenging to implement and have only modest benefits in the presence of a highly contagious respiratory infection. But your mileage may vary.</p>
<p>As I mentioned, of course, this was not the message that the majority of the media took away from this study. Instead we were told that this trial finally confirmed that masks worked. I think one of the key confusing points was <a href="http://www.argmin.net/2021/08/13/relative-risk/">using “efficacy” instead of relative risk</a> as a measure of intervention power.</p>
<p>One of the dark tricks of biostatistics is moving away from absolute case counts to measures of risk such as relative risk reduction, efficacy, or the odds ratio. All of these measures are relative, and they tend to exaggerate effects. The relative risk is the ratio of the rate of infection in the treatment group to the rate of infection in the control group</p>
\[{\small
RR = \frac{i_T/n_T}{i_C/n_C}\,.
}\]
<p>A small $RR$ corresponds to a large reduction in risk. For the mask study, $RR=$0.9. That’s not a lot of risk reduction: in this study, community masking reduced an individual’s risk of infection by a factor of only 1.1x. As a convenient comparator, the $RR$ in the mRNA vaccine trials was 0.05. In that case, the vaccines reduced the risk of symptomatic infection by a factor of 20x.</p>
<p>The academic vaccine community unfortunately uses “efficacy” or “effectiveness” to describe relative risk reduction. <a href="xxx">Efficacy is a confusing, commonly misinterpreted metric</a>. Efficacy in a trial is one minus the relative risk:</p>
\[{\small
EFF = 1-RR\,,
}\]
<p>reported as a percentage. So if the $RR=$0.9, then $EFF=$10%.</p>
<p>The important thing to realize about efficacy is that the range from 0% to 20% is barely better than nothing. Even a 20% efficacy corresponds to a reduction of risk by a factor of only 1.25x. 1.25x is not literally nothing, but it’s also not enough to halt a highly contagious respiratory infection. For what it’s worth, a vaccine with 20% efficacy would not be approved. Another major flaw of efficacy as a metric is that it is highly nonlinear. The difference between 10% and 20% efficacy is very small, whereas the difference between 85% and 95% is huge, corresponding to a 7-fold and a 20-fold risk reduction, respectively. Yet these percentages are bandied about as if they were linear effects, and this adds confusion to the public dialogue.</p>
<p class="center"><img src="/assets/eff_v_rr.png" alt="The relationship between effectiveness and risk reduction is highly nonlinear" width="65%" /></p>
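<p>The nonlinearity in the figure is just the map $\mathrm{fold} = 1/(1-\mathrm{EFF})$, as a quick check confirms:</p>

```python
# Fold reduction in risk implied by a given efficacy: 1 / (1 - EFF).
for eff in (0.10, 0.20, 0.85, 0.95):
    print(f"{eff:.0%} efficacy -> {1 / (1 - eff):.2f}x risk reduction")
# 1.11x, 1.25x, 6.67x, 20.00x
```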
<p>To further dive into the absurdity of efficacy, let’s examine the claim that “cloth masks” worked less well than “surgical masks.” This is too strong an observation to be gleaned from the data. The preprint provides two stratified calculations to estimate the efficacy of types of masks. In the first case, the authors analyzed villages randomized to only be given surgical masks and their matched control villages. In this case there were 190 pairs of villages consisting of $n_C=$103,247 individuals in the control group and $n_T=$113,082 individuals in the treatment group. They observed $i_C=$774 symptomatic and seropositive individuals in the control group and $i_T=$756 symptomatic and seropositive individuals in the treatment group. <em>This is a difference of 18 individuals.</em> The corresponding efficacy is 11%, still woefully low.</p>
<p>We can do a similar analysis for the villages only given cloth masks. There were 96 pairs of villages consisting of $n_C=$53,691 individuals in the control group and $n_T=$57,415 individuals in the treatment group. They observed $i_C=$332 symptomatic and seropositive individuals in the control group and $i_T=$330 symptomatic and seropositive individuals in the treatment group. <em>This is a difference of only 2 individuals.</em> Certainly, no one would put much faith in an intervention where we see a difference of 2 cases in a study with over one hundred thousand people. However, to further demonstrate the absurdity of the notion of efficacy, the observed efficacy for cloth masks in this study is 7%. I think in many people’s minds, the difference between 7% and 11% is small. And 7% should be considered “no effect” as should 11%. <del>As a final absurd comparison, the study data shows cloth masks are more efficacious than purple surgical masks where the estimated efficacy is 0% ($n_C=$27,918, $n_T=$29,541, $i_C=$177, $i_T=$187)!</del> (<em>Ed note: turns out the purple masks were cloth. So the cloth purple masks did nothing, but the red masks “work.” Indeed, red masks were more effective than surgical masks!)</em> Certainly, comparing a bunch of such small effects is not telling us much.</p>
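<p>These subgroup efficacies follow directly from the counts quoted above:</p>

```python
def efficacy(i_t, n_t, i_c, n_c):
    """EFF = 1 - RR, where RR is the ratio of attack rates."""
    return 1 - (i_t / n_t) / (i_c / n_c)

# Counts from the released study data, as quoted in this post.
overall = efficacy(1086, 178_322, 1106, 163_861)
surgical = efficacy(756, 113_082, 774, 103_247)
cloth = efficacy(330, 57_415, 332, 53_691)
print(f"overall: {overall:.0%}, surgical: {surgical:.0%}, cloth: {cloth:.0%}")
# overall: 10%, surgical: 11%, cloth: 7%
```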
<p>Anyone who spends too much time around statisticians will note that I never once tried to compute a p-value for any of these results. As I’ve belabored, obsession with statistical significance distracts us from discussing effect sizes. We should be able to just look at the effect size and conclude the study did not find a significant impact of masks on coronavirus spread. We don’t need a p-value to tell us 10% efficacy is not helpful in this context. But it’s also important to note that you can’t just run a standard binomial test on this data because it is cluster-randomized and the subjects are anything but independent. <a href="http://www.argmin.net/2021/11/29/cluster-power/">In the next blog</a>, just for the sake of academic navel gazing, I’ll discuss the lack of statistical significance of this study and show why cluster randomized trials are inherently more challenging to interpret than standard RCTs.</p>
Tue, 23 Nov 2021 00:00:00 +0000
http://benjamin-recht.github.io/2021/11/23/mask-rct-revisited/
http://benjamin-recht.github.io/2021/11/23/mask-rct-revisited/The Perceptron as a prototype for machine learning theory.<p>Just as many of the algorithms and community practices of machine learning were invented <a href="http://www.argmin.net/2021/10/20/highleyman/">in the late 1950s and early 1960s</a>, the foundations of machine learning theory were also established during this time. Many of the analyses of this period were strikingly simple, had surprisingly precise constants, and provided prescient guidelines for contemporary machine learning practice. Here, I’ll summarize the study of the Perceptron, highlighting both its algorithmic and statistical analyses, and using it as a prototype to illustrate further how prediction deviates from the umbrella of classical statistics.</p>
<p>Let’s begin with a classification problem where each individual from some population has a feature vector $x$ and an associated binary label $y$ that we take as valued $\pm 1$ for notational convenience. The goal of the Perceptron is to find a linear separator such that $\langle w, x \rangle>0$ when $y=1$ and $\langle w, x \rangle<0$ when $y=-1$. We can write this compactly by saying that we want to find a $w$ for which $y \langle w, x \rangle >0$ for as many individuals in the population as possible.</p>
<p>Rosenblatt’s Perceptron provides a simple algorithm for finding such a $w$. The Perceptron takes an example and checks whether it makes the correct classification. If yes, it does nothing and proceeds to the next example. If not, the decision boundary is nudged in the direction of classifying that example correctly next time.</p>
<p><strong>Perceptron</strong></p>
<ul>
<li>Start from the initial solution $w_0=0$</li>
<li>At each step $t=0,1,2,…$:
<ul>
<li>Select an individual from the population and look up their attributes: $(x_t,y_t)$.</li>
<li>Case 1: If $y_t\langle w_t, x_t\rangle \leq 0$, put
\(w_{t+1} = w_t + y_t x_t\)</li>
<li>Case 2: Otherwise put $w_{t+1} = w_t$.</li>
</ul>
</li>
</ul>
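<p>The update rule above is a few lines of code. Here is a self-contained sketch on synthetic separable data (a toy example I made up: labels given by the sign of the first coordinate, with a margin of 0.2), checking Novikoff’s mistake bound along the way:</p>

```python
import random

def perceptron(data, max_epochs=100):
    """Run the Perceptron until a full pass makes no mistakes (or give up).

    data: list of (x, y) pairs with x a list of floats and y in {-1, +1}.
    Returns the final weights and the total number of updates (mistakes).
    """
    w = [0.0] * len(data[0][0])
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return w, mistakes

rng = random.Random(0)
data = []
while len(data) < 200:
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if abs(x[0]) >= 0.2:  # enforce a margin gamma = 0.2 around x_1 = 0
        data.append((x, 1 if x[0] > 0 else -1))

w, mistakes = perceptron(data)
# Novikoff: mistakes <= R^2 / gamma^2 <= 2 / 0.2^2 = 50 for this data.
print(mistakes, mistakes <= 50)
```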
<p>If the examples were selected at random, machine learners would recognize this algorithm as an instance of stochastic gradient descent, still the most ubiquitous way to train classifiers whether they be deep or shallow. Stochastic gradient descent minimizes sums of functions</p>
\[f(w) = \frac{1}{N} \sum_{i=1}^N \mathit{loss}( f(x_i; w) , y_i)\]
<p>with the update</p>
\[w_{t+1} = w_t - \alpha_t \nabla_w \mathit{loss}( f(x_t; w_t) , y_t)\,.\]
<p>When the examples are sampled randomly, the Perceptron is stochastic gradient descent with $\alpha_t=1$, $f(x;w) = \langle w,x \rangle$, and loss function $\mathit{loss}(\hat{y},y) = \max(-\hat{y} y, 0)$.</p>
<p>Stochastic gradient methods were invented a few years before the Perceptron. And the relations between these methods were noted by the mid-60s. Vapnik discusses some of this history in Chapter 1.11 of <a href="https://link.springer.com/book/10.1007/978-1-4757-3264-1"><em>The Nature of Statistical Learning Theory</em></a>.</p>
<p>While we might be tempted to use a standard stochastic gradient analysis to understand the optimization properties of the Perceptron, it turns out that a more rarified proof technique applies that uses no randomization whatsoever. Moreover, the argument will not only bound errors in optimization but also in generalization. Optimization is concerned with errors on a training data set. Generalization is concerned with errors on data we haven’t seen. The analysis from the 1960s links these two by first understanding the dynamics of the algorithm.</p>
<p><a href="https://cs.uwaterloo.ca/~y328yu/classics/novikoff.pdf">A celebrated result by Al Novikoff in 1962</a> showed that under reasonable conditions the algorithm makes a bounded number of updates no matter how large the sample size. Novikoff’s result is typically referred to as a <em>mistake bound</em> as it bounds the number of total misclassifications made when running the Perceptron on some data set. The key assumption in Novikoff’s argument is that the positive and negative examples are cleanly separated by a linear function. People often dismiss the Perceptron because of this <em>separability</em> assumption. But for any finite data set, we can always add features and end up with a linearly separable problem. And if we add enough features, we’ll usually be separable no matter how many points we have.</p>
<p>This has been the trend in modern machine learning: don’t fear big models and don’t fear getting zero errors on your training set. This is no different than what was being proposed in the Perceptron. In fact, <a href="https://cs.uwaterloo.ca/~y328yu/classics/kernel.pdf">Aizerman, Braverman, and Rozonoer</a> recognized the power of such overparameterization, and extended Novikoff’s argument to “potential functions” that we now recognize as functions belonging to an infinite dimensional Reproducing Kernel Hilbert Space.</p>
<p>To state Novikoff’s result, we make the following assumptions: First, we assume as input a set of examples $S$. We assume every data point has norm at most $R(S)$ and that there exists a hyperplane that correctly classifies all of the data points and is of distance at least $\gamma(S)$ from every data point. This second assumption is called a <em>margin condition</em> that quantifies how separated the given data is. With these assumptions, Novikoff proved the Perceptron algorithm makes at most</p>
\[{\small
\frac{R(S)^2}{\gamma(S)^{2}}
}\]
<p>mistakes when run on $S$. No matter what the ordering of the data points in $S$, the algorithm makes a bounded number of errors.</p>
<p>The algorithmic analysis of Novikoff has many implications. First, if the data is separable, we can conclude that the Perceptron will terminate if it is run over the data set several times. This is because we can think of $k$ epochs of the Perceptron as running on the union of $k$ distinct copies of $S$, and the Perceptron eventually stops updating when run on this enlarged data set. Hence, the mistake bound tells us something particular about optimization: the Perceptron converges to a solution with zero training errors and hence a global minimizer of the empirical risk.</p>
<p>Second, we can think of the Perceptron algorithm as an <em>online learning algorithm</em>. We need not assume anything distributional about the sequence $S$. We can instead think about how long it takes for the Perceptron to converge to a solution that would have been as good as the optimal classifier. We can quantify this convergence by measuring the <em>regret</em>, equal to</p>
\[\mathcal{R}_T = \sum_{t=1}^T \mathrm{error}(w_t, (x_t,y_t)) - \sum_{t=1}^T \mathrm{error}(w_\star, (x_t,y_t))\,,\]
<p>where $w_\star$ denotes the optimal hyperplane. That is, the regret counts how much more frequently the classifier at step $t$ misclassifies the next example in the sequence than the optimal classifier does. Novikoff’s argument shows that, if a sequence is perfectly classifiable, then the accrued regret is a constant that does not scale with $T$.</p>
<p>A third, less well known application of Novikoff’s bound is as a building block for a <em>generalization bound</em>. A generalization bound estimates the probability of making an error on a new example given that the new example is sampled from the same population as the data seen thus far. To state the generalization bound for the Perceptron, I <em>now</em> need to return to statistics. Generalization theory concerns statistical validity, and hence we need to define some notion of sampling from the population. I will use the same sampling model I have been using in this blog series. Rather than assuming a statistical model of the population, I will assume we have some population of data from which we can uniformly sample. Our training data will consist of $n$ points sampled uniformly from this population: $S=\{(x_1,y_1),\ldots,(x_n,y_n)\}$.</p>
<p>We know that the Perceptron will find a good linear predictor for the training data if it exists. What we now show is that this predictor also works on new data sampled uniformly from the same population.</p>
<p>To analyze what happens on new data, I will employ an elegant argument I learned from Sasha Rakhlin. This argument appears in a book on Learning Theory by Vapnik and Chervonenkis from 1974, which, to my knowledge, is only available in Russian. Sasha also believes this argument is considerably older as <a href="http://www.mit.edu/~rakhlin/papers/chervonenkis_chapter.pdf">Aizermann and company were making similar “online to batch” constructions in the 1960s</a>. The proof here leverages the assumption that the data are sampled in such a way that they are identically distributed, so we can swap the roles of training and test examples in the analysis. It foreshadows studies of stability and generalization that would be revisited decades later.</p>
<p><strong>Theorem</strong> <em>Let $w(S)$ be the output of the Perceptron on a dataset $S$ after running until the hyperplane makes no more mistakes on $S$. Let $S_n$ denote a training set of $n$ samples drawn uniformly at random from some population. And let $(x,y)$ be an additional independent uniform sample from the same population. Then, the probability of making a mistake on $(x,y)$ is bounded as</em></p>
\[\Pr[y \langle w(S_n), x \rangle \leq 0] \leq \frac{1}{n+1} {\mathbb{E}}_{S_{n+1}}\left[ \frac{R(S_{n+1})^2}{\gamma(S_{n+1})^2} \right]\,.\]
<p>To prove the theorem, define the “leave-one-out set” to be the set where we drop $(x_k,y_k)$:</p>
\[{\scriptsize
S^{-k}=\{(x_1,y_1),\dots,(x_{k-1},y_{k-1}),(x_{k+1},y_{k+1}),...,(x_{n+1},y_{n+1})\}\,.
}\]
<p>With this notation, since all of the data are sampled identically and independently, we can rewrite the probability of a mistake on the final data point as the expectation of the leave-one-out error</p>
\[{\small
\Pr[y \langle w(S_n), x \rangle \leq 0]
= \frac1{n+1}\sum_{k=1}^{n+1} \mathbb{E}[\mathbb{1}\{y_k \langle w(S^{-k}), x_k \rangle \leq 0\}]\,.
}\]
<p>Novikoff’s mistake bound asserts the Perceptron makes at most</p>
\[{\small
m=\frac{R(S_{n+1})^2}{\gamma(S_{n+1})^2}
}\]
<p>mistakes when run on the entire sequence $S_{n+1}$. Let $I=\{i_1,\dots,i_m\}$ denote the indices on which the algorithm makes a mistake in any of its cycles over the data. If $k$ is not in $I$, the output of the algorithm remains the same after we remove the $k$-th sample from the sequence. It follows that indices $k \notin I$ satisfy $y_k \langle w(S^{-k}), x_k \rangle > 0$ and therefore do not contribute to the right hand side of the summation. The other terms can at most contribute $1$ to the summation.
Hence,</p>
\[\Pr[y \langle w(S_n), x \rangle \leq 0] \le \frac{\mathbb{E}[m]}{n+1}\,,\]
<p>which is what we wanted to prove.</p>
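<p>The combinatorial heart of this proof is easy to check numerically. The sketch below (my own construction on invented separable data) runs the Perceptron to convergence, counts its total number of updates $m$, and confirms that the number of leave-one-out errors never exceeds $m$:</p>

```python
import random

def perceptron_with_count(data, max_epochs=1000):
    """Run the Perceptron until a clean pass; return (w, total updates)."""
    w, updates = [0.0] * len(data[0][0]), 0
    for _ in range(max_epochs):
        clean = True
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates, clean = updates + 1, False
        if clean:
            break
    return w, updates

random.seed(0)
w_star, data = [1.0, -0.5], []
while len(data) < 30:  # rejection-sample points with margin at least 0.1
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    s = sum(a * b for a, b in zip(w_star, x))
    if abs(s) > 0.1:
        data.append((x, 1 if s > 0 else -1))

_, m = perceptron_with_count(data)           # mistakes on the full sequence
loo_mistakes = 0
for k in range(len(data)):                   # drop (x_k, y_k), retrain, test on it
    w, _ = perceptron_with_count(data[:k] + data[k + 1:])
    x, y = data[k]
    if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
        loo_mistakes += 1
# Leave-one-out errors can only occur at mistake indices,
# so their count is at most the mistake bound m.
assert loo_mistakes <= m
```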
<p>What’s most stunning to me about this argument is that there are no numerical constants or logarithms. The generalization error is perfectly quantified by a simple formula of $R$, $\gamma$, and $n$. There are a variety of other arguments that achieve the $\tilde{O}(R/(n\gamma))$ scaling, but they are far more complex and carry large constants and logarithmic terms. For example, one can show that the set of hyperplanes in Euclidean space with norm bounded by $\gamma^{-1}$ has <a href="https://www.wiley.com/en-us/Statistical+Learning+Theory-p-9780471030034">VC dimension $R/\gamma$</a>. Similarly, a <a href="https://www.jmlr.org/papers/volume3/bartlett02a/bartlett02a.pdf">Rademacher complexity argument will achieve a similar scaling</a>. These arguments apply to far more algorithms than the Perceptron, but it’s frustrating how this simple algorithm from 1956 gets such a tight bound with such a short argument whereas analyzing more “powerful” algorithms often takes pages of derivations.</p>
<p>It’s remarkable that these bounds on optimization, regret, and generalization worked out in the 1960s all turned out to be optimal for classification theory. This strikes me as particularly odd because when I was in graduate school I was taught that the Perceptron was a failed enterprise. But as fads in AI have come and gone, the role of the Perceptron has remained central for 65 years. We’ve made more progress in machine learning theory since then, but it’s not always at the front of our minds just how long ago we had established our modern learning theory framework.</p>
Thu, 04 Nov 2021 00:00:00 +0000
http://benjamin-recht.github.io/2021/11/04/perceptron/
http://benjamin-recht.github.io/2021/11/04/perceptron/The Saga of Highleyman's Data.<p>The first machine learning benchmark dates back to the late 1950s. Few used it and even fewer still remembered it by the time benchmarks became widely used in machine learning in the late 1980s.</p>
<p>In 1959 at Bell Labs, Bill Highleyman and Louis Kamentsky designed a <a href="https://dl.acm.org/doi/10.1145/1457838.1457894">scanner to evaluate character recognition techniques</a>. Their goal was “to facilitate a systematic study of character-recognition techniques and an evaluation of methods prior to actual machine development.” It was not clear at the time which part of the computations should be done in special purpose hardware and which parts should be done with more general computers. Highleyman later <a href="https://patents.google.com/patent/US2978675A/en">patented an OCR scheme</a> that we recognize today as a convolutional neural network with convolutions optically computed as part of the scanning.</p>
<p>Highleyman and Kamentsky used their scanner to create a dataset of 1800 alphanumeric characters. They gathered the 26 letters of the alphabet and 10 digits from 50 different writers. Each character in their corpus was scanned in binary at a resolution of 12 x 12 and stored on punch cards that were compatible with the <a href="https://en.wikipedia.org/wiki/IBM_704">IBM 704</a>, the GPGPU of the era.</p>
<p class="center"><img src="/assets/highleyman-data.png" alt="A look at Highleyman’s digits" width="95%" /></p>
<p>With the data in hand, Highleyman and Kamentsky began studying various proposed techniques for recognition. In particular, they analyzed a method of Woody Bledsoe and published an analysis claiming to be <a href="https://ieeexplore.ieee.org/document/5219829">unable to reproduce Bledsoe’s results</a>. Bledsoe found their numbers to be considerably lower than he had expected, and asked Highleyman to send him the data. Highleyman obliged, mailing the package of punch cards across the country to Sandia Labs.</p>
<p>Upon receiving the data, Bledsoe conducted a new experiment. In what may be the first implementation of a train-test split, he divided the characters up, using 40 writers for training and 10 for testing. By tuning the hyperparameters, <a href="https://ieeexplore.ieee.org/document/5219162">Bledsoe was able to achieve approximately 60% error</a>. Bledsoe also suggested that the high error rates were to be expected as Highleyman’s data was too small. Prophetically, he declared that 1000 alphabets might be needed for good performance.</p>
<p>By this point, Highleyman had also shared his data with Chao Kong “C.K.” Chow at the Burroughs Corporation (a precursor to Unisys). A pioneer of <a href="https://ieeexplore.ieee.org/document/5222035">using decision theory for pattern recognition</a>, Chow built a pattern recognition system for characters. Using the same train-test split as Bledsoe, <a href="https://ieeexplore.ieee.org/document/5219431">Chow obtained an error rate of 41.7%</a> using a convolutional neural network.</p>
<p class="center"><img src="/assets/chownet.png" alt="Chow’s architecture" width="75%" /></p>
<p>Highleyman made at least six additional copies of the data he had sent to Bledsoe and Chow, and many researchers remained interested. He thus decided to <a href="https://ieeexplore.ieee.org/document/4037813">publicly offer to send a copy to anyone</a> willing to pay for the duplication and shipping fees. An interested party would simply have to mail him a request. Of course, the dataset was sent by US Postal Service. Electronic transfer didn’t exist at the time, resulting in sluggish data transfer rates on the order of a few bits per minute.</p>
<p>Highleyman not only created the first machine learning benchmark. He also authored the first formal study of <a href="https://ieeexplore.ieee.org/document/6768949">train-test splits</a> and proposed <a href="https://ieeexplore.ieee.org/document/4066882">empirical risk minimization for pattern classification</a> as part of his 1961 dissertation.
By 1963, however, Highleyman had left his research position at Bell Labs and abandoned pattern recognition research.</p>
<p>We don’t know how many people requested Highleyman’s data. The total number of copies may have been less than twenty. Based on citation surveys, we determined there were at least another six copies made after Highleyman’s public offer for duplication, sent to <a href="https://ieeexplore.ieee.org/abstract/document/1671536">CMU</a>, <a href="https://ieeexplore.ieee.org/document/1671257">Honeywell</a>, <a href="https://ieeexplore.ieee.org/document/5008873">SUNY Stony Brook</a>, <a href="https://spiral.imperial.ac.uk/bitstream/10044/1/16132/2/Ullmann-JR-1968-PhD-Thesis.pdf">Imperial College</a>, <a href="https://www.sciencedirect.com/science/article/abs/pii/0031320371900045">UW Madison</a>, and Stanford Research Institute (SRI).</p>
<p>The SRI team of John Munson, Richard Duda, and Peter Hart performed some of the most <a href="https://ieeexplore.ieee.org/document/1687355">extensive experiments with Highleyman’s data</a>. A 1-nearest-neighbors baseline achieved an error rate of 47.5%. With a more sophisticated approach, they were able to do significantly better. They used a multi-class, piecewise linear model, trained using Kesler’s multi-class version of the perceptron algorithm (what we’d now call “one-versus-all classification”). Their feature vectors were 84 simple pooled edge detectors in different regions of the image at different orientations. With these features, they were able to get a test error of 31.7%, 10 percentage points better than Chow. When restricted only to digits, this method recorded 12% error. The authors concluded that they needed more data, and that the error rates were “still far too high to be practical.” They wrote that “larger and higher-quality datasets are needed for work aimed at achieving useful results” and suggested that such datasets “may contain hundreds, or even thousands, of samples in each class.”</p>
<p>Munson, Duda, and Hart also performed informal experiments with humans to gauge the readability of Highleyman’s characters. On the full set of alphanumeric characters, they found an average error rate of 15.7%, about 2x better than their pattern recognition machine. But this rate was still quite high and suggested the data needed to be of higher quality. They (again prophetically) concluded that “an array size of at least 20X20 is needed, with an optimum size of perhaps 30X30.”</p>
<p>Decades passed until such a dataset appeared. Thirty years later, with 125 times as much training data, 28x28 resolution, and with grayscale scans, a neural net achieved 0.7% test error on the <a href="http://yann.lecun.com/exdb/mnist/">MNIST digit recognition task</a>. In fact, a model similar to Munson’s architecture, consisting of kernel ridge regression trained on pooled edge detectors, also achieves 0.6% error. Intuition from the 1960s proved right. The resolution was higher and the number of examples per digit was now in the thousands, just as Bledsoe, Munson, Duda, and Hart predicted would be sufficient. Reasoning heuristically that the test error should be inversely proportional to the square root of the number of training examples, we would expect an 11x improvement over Munson’s approaches. The actual recorded improvement from 12% to 0.7% was closer to 17x, not far from what the back of the envelope calculation predicts.</p>
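<p>The back-of-the-envelope arithmetic is easy to reproduce (a sketch; the heuristic itself is only a rough scaling argument):</p>

```python
import math

# Heuristic: test error scales like 1 / sqrt(number of training examples),
# so 125x more training data predicts roughly a sqrt(125) ~ 11x improvement.
predicted_improvement = math.sqrt(125)
actual_improvement = 12.0 / 0.7  # Munson et al.'s 12% error down to MNIST's 0.7%
assert 11 < predicted_improvement < 12
assert 17 < actual_improvement < 17.2
```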
<p>Unlike Highleyman’s data, MNIST featured only digits, no letters. Only recently, in 2017, researchers from Western Sydney University <a href="https://arxiv.org/abs/1702.05373">extracted alphanumeric characters from the NIST-19 repository</a>. The resulting <em>EMNIST_Balanced</em> dataset has 2400 examples in each of the 47 classes, with a class for all upper case letters, all digits, and some of the non-ambiguous lower case letters. Currently, the best performing <a href="https://www.mdpi.com/2076-3417/9/15/3169">model achieves a test error rate of 9.4%</a>. While the dataset is still fairly new, this is only a 3x improvement over the methods of Munson, Duda, and Hart. Applying the same naive scaling argument as above, the increase in dataset size would predict a 7x improvement if such an improvement were achievable. Considering that the SRI team observed a human-error rate of 11% on Highleyman’s data, it is quite possible that an accuracy of 90% is close to the best that we can expect for recognizing handwritten characters without context.</p>
<p>The story of Highleyman’s data foreshadows many of the later waves of machine learning research. A desire for better evaluation inspired the creation of novel data. Dissemination of the experimental results on this data led to sharing in order for researchers to be content that the evaluation was fair. Once the dataset was distributed, others requested the data to prove their methods were superior. And then the dataset itself became enshrined as a benchmark for competitive testing. Such comparative testing led to innovations in methods, theory, and data collection and curation itself. We have seen this pattern time and time again in machine learning, from <a href="https://archive.ics.uci.edu/ml/index.php">the UCI repository</a>, to <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a>, to <a href="https://www.image-net.org/">ImageNet</a>, to <a href="https://predictioncenter.org/">CASP</a>. The nearly forgotten history of Highleyman’s data marks the beginning of this pattern recognition research paradigm.</p>
<p><em>We are, as always, deeply indebted to Chris Wiggins for sending us Munson et al.’s paper after watching a talk by BR on the history of ML benchmarking. We also thank Ludwig Schmidt for pointing us to EMNIST.</em></p>
<h2 id="addendum-on-our-protagonist-bill-highleyman">Addendum on our protagonist Bill Highleyman.</h2>
<p>After posting this blog, we found <a href="https://availabilitydigest.com/public_articles/1208/thesis.pdf">some lovely recollections by Bill Highleyman about his thesis</a>. It is remarkable how Bill invented so many powerful machine learning primitives—finding linear functions that minimize empirical risk, gradient descent to minimize the risk, train-test splits, convolutional neural networks—all as part of his PhD dissertation project. That said,
Bill considered the project to be a failure. He (and Bell Labs) realized the computing of 1959 was not up to the task of character recognition.</p>
<p>After he finished his thesis, Bill abandoned pattern recognition and moved on to work on other cool and practical computer engineering projects that interested him, never once looking back. By the mid sixties Bill had immersed himself in data communication and transmission, and patented novel approaches to electrolytic printing and financial transaction hardware. He eventually ended up specializing in high-reliability computing. Though he developed many of the machine learning techniques we use today, he was content to leave the field and work to advance general computing to catch up with his early ideas.</p>
<p>It’s odd but not surprising that while every machine learning class mentions Rosenblatt, Minsky, and Papert, almost everyone we’ve spoken with so far has never heard of Bill Highleyman.</p>
<p>We worry Bill is no longer reachable as he seems to have no online presence after 2019 and would be 88 years old today. If anyone out there has met Bill, we’d love to hear more about him. Please drop us a note.</p>
<p>And if anyone has any idea of where we can get a copy of his 1800 characters from 1959, please let us know about that too…</p>
Wed, 20 Oct 2021 00:00:00 +0000
http://benjamin-recht.github.io/2021/10/20/highleyman/
http://benjamin-recht.github.io/2021/10/20/highleyman/Machine learning is not nonparametric statistics.<p>Many times in my career, I’ve been told by respected statisticians that machine learning is nothing more than nonparametric statistics. The longer I work in this field, the more I think this view is both misleading and unhelpful. Not only can I never get a consistent definition of what “nonparametric” means, but the jump from statistics to machine learning is considerably larger than most expect. Statistics is an important tool for understanding machine learning and randomness is valuable for machine learning algorithm design, but there is considerably more to machine learning than what we learn in elementary statistics.</p>
<p>Machine learning at its core is the art and science of <em>prediction</em>. By prediction, I mean the general problem of leveraging regularity of natural processes to guess the outcome of yet unseen events. As before, we can formalize the prediction problem by assuming a population of $N$ individuals with a variety of attributes. Suppose each individual has associated variables $X$ and $Y$. The goal of prediction is to guess the value of $Y$ from $X$ in a way that minimizes some error metric.</p>
<p>A classic prediction problem aims to find a function that makes the fewest number of incorrect predictions across the population. Think of this function like a computer program that takes $X$ as input and outputs a prediction of $Y$. For a fixed prediction function, we can sum up all of the errors made on the population. If we divide by the size of the population, this is the mean error rate of the function.</p>
<p>A particularly important prediction problem is classification. In classification, the attribute $Y$ takes only two values: the input $X$ could be some demographic details about a person, and $Y$ would be whether or not that person was taller than 6 feet. The input $X$ could be an image, and $Y$ could be whether or not the image contains a cat. Or the input could be a set of laboratory results about a patient, and $Y$ could be whether or not the patient is afflicted by a disease. Classification is the simplest and most common prediction problem, one that forms the basis of most contemporary machine learning systems.</p>
<p>For classification problems, it is relatively straightforward to compute the best error rate achievable. First, for every possible value of the attribute $X$, collect the subgroup of individuals of the population with that value. Then, the best assignment for the prediction function is the one that correctly labels the majority of this subgroup. For example, in our height example, we could take all women aged 30, born in the United States, and residing in California. Then the optimal label for this group would be decided based on whether there are more people in the group who are taller than 6 feet or not. (Answer: no).</p>
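<p>As a sketch (the attribute encoding and toy numbers are invented for illustration), the minimum-error rule is a per-subgroup majority vote:</p>

```python
from collections import Counter, defaultdict

def optimal_classifier(population):
    """population: list of (x, y) pairs over the entire population.
    For each value of x, predict the majority label of that subgroup;
    no other rule makes fewer errors on the population."""
    groups = defaultdict(Counter)
    for x, y in population:
        groups[x][y] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in groups.items()}

# Toy population: x = (sex, age), y = taller than 6 feet?
population = ([(("F", 30), False)] * 95 + [(("F", 30), True)] * 5
              + [(("M", 30), True)] * 60 + [(("M", 30), False)] * 40)
rule = optimal_classifier(population)
assert rule[("F", 30)] is False and rule[("M", 30)] is True
```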
<p>This minimum error rule is intuitive and simple, but computing the rule exactly requires examining the entire population. What can we do if we work from a subsample? Just as was the case in experiment design, we’d like to be able to design good prediction functions from a small sample of the population so we don’t have to inspect all individuals. For a <em>fixed</em> function, we could use the same law-of-large-numbers approximations to estimate the best decision. That is, if we decide in advance upon a prediction function, we could estimate the percentage of mistakes on the population by gathering a random sample and computing the proportion of mistakes on this subset. Then we could apply a standard confidence interval analysis to extrapolate to the population.</p>
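<p>Concretely, estimating a <em>fixed</em> predictor’s population error rate is just mean estimation (a sketch with made-up numbers, using the usual normal-approximation interval):</p>

```python
import math
import random

random.seed(0)
# 1 marks an individual the fixed predictor misclassifies; the true
# population error rate here is 30% by construction.
population = [1] * 300 + [0] * 700
sample = random.sample(population, 400)   # random subsample of the population
p_hat = sum(sample) / len(sample)         # estimated error rate
# ~95% confidence half-width via the normal approximation
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / len(sample))
assert abs(p_hat - 0.3) < 0.1
```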
<p>However, what if we’d like to find a good predictor on the population using only a set of examples sampled from the population? We immediately run into an issue: to find the best prediction function, we needed to observe all possible values of $X$. What if we’d like to make predictions about an individual with a set of attributes that was not observed in our sample?</p>
<p>How can we build accurate population-level predictors from small subsamples? In order to solve this problem, we must make some assumptions about the relationship between predictions at related, but different values of $X$. We can restrict our attention to a set of functions that respect regularity properties that we think the predictions should have. Then, with a subsample from the population, we find the function that minimizes the error on the sample and obeys the prescribed regularity properties.</p>
<p>This optimization procedure is called <em>”empirical risk minimization”</em> and is the core predictive algorithm of machine learning. Indeed, for all of the talk about neuromorphic deep networks with fancy widgets, most of what machine learning does is try to find computer programs that make good predictions on the data we have collected and that respect some sort of rudimentary knowledge that we have about the broader population.</p>
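<p>A toy instance of empirical risk minimization, restricted to a deliberately simple function class of my choosing (one-dimensional threshold rules), shows the recipe: enumerate the class, count sample errors, keep the minimizer:</p>

```python
# ERM over threshold rules h_{t,s}(x) = s if x > t else -s.
# Restricting to this class encodes the "regularity" assumption that the
# label flips at most once as x increases. (All names here are invented.)
def erm_threshold(sample):
    best = None
    for t in sorted(x for x, _ in sample):          # candidate thresholds
        for s in (1, -1):                           # candidate orientations
            errs = sum(1 for x, y in sample if (s if x > t else -s) != y)
            if best is None or errs < best[0]:
                best = (errs, t, s)
    return best  # (training errors, threshold, sign)

sample = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1)]
errs, t, s = erm_threshold(sample)
assert errs == 0  # this sample is realizable by a threshold rule
```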
<p>The flexibility in defining what “knowledge” or “regularity” means complicates the solution of such empirical risk minimization problems. What does the right set of functions look like? There are three immediate concerns:</p>
<ol>
<li>
<p>What is the right <em>representation</em>? The set needs to contain enough functions to well approximate the true population prediction function. There are a variety of ways to express complex functions, and each expression has its own benefits and drawbacks.</p>
</li>
<li>
<p>The set of functions needs to be simple to search over, so we don’t have to evaluate every function in our set as this would be too time consuming. Efficient search for high quality solutions is called <em>optimization</em>.</p>
</li>
<li>
<p>How will the predictor <em>generalize</em> to the broader population? The functions cannot be too complex or else they will fail to capture the regularity and smoothness of the prediction problem (estimating functions of too high complexity is colloquially called “overfitting”).</p>
</li>
</ol>
<p>Balancing representation, optimization, and generalization gets complicated quickly, and this is why we have a gigantic academic and industrial field devoted to the problem.</p>
<p>I’m repeating myself at this point, but I again want to pound my fist on the table and reiterate that nothing in our development here requires that the relationship between the variables $X$ and $Y$ be probabilistic. Statistical models are often the starting point of discussion in machine learning, but such models are just a convenient way to describe populations and their proportions. Prediction can be analyzed in terms of a deterministic population, and, just as we discussed in the case of randomized experiments, randomness can be introduced as a means of sampling the population to determine trends. Even generalization, which is usually studied as a statistical phenomenon, can be analyzed in terms of the randomness of the sampling procedure with no probabilistic modeling of the population.</p>
<p>On the other hand, some sort of knowledge about the population is necessary. The more we know about how prediction varies based on changes in the covariates, the better a predictor we can build. Engineering such prior knowledge into appropriate function classes and optimization algorithms form the art and science of contemporary machine learning.</p>
<p>This discussion highlights that while we <em>can</em> view prediction through the lens of statistical sampling, pigeonholing it as simply “nonparametric statistics” does not do the subject justice. While the <a href="https://www.argmin.net/2021/09/28/rct/">jump from mean estimation to causal RCTs is small</a>, the jump from mean estimation to prediction is much less immediate. And in machine learning practice, the intuitions from statistics often don’t apply. For example, conventional wisdom from statistics tells us that evaluating multiple models on the same data set amounts to multiple hypothesis testing, and will lead to overfitting on the test set. However, <a href="https://arxiv.org/abs/1902.10811">there</a> <a href="https://papers.nips.cc/paper/9117-a-meta-analysis-of-overfitting-in-machine-learning">is</a> <a href="https://arxiv.org/abs/1906.02168">more</a> <a href="https://arxiv.org/abs/2004.14444">and</a> <a href="https://proceedings.mlr.press/v119/shankar20c.html">more</a> <a href="https://arxiv.org/abs/1905.10498">evidence</a> that using a train-test split does not lead to overfitting. Instead, the phenomenon we see is that dataset benchmarks can remain useful for decades. Another common refrain from statistics is that model complexity must be explicitly constrained in order to extrapolate to new data, but this also <a href="https://cacm.acm.org/magazines/2021/3/250713-understanding-deep-learning-still-requires-rethinking-generalization/fulltext">does not seem to apply at all to machine learning practice</a>.</p>
<p>Prediction predates probability and statistics by centuries. As Moritz and I chronicle in the introduction to <a href="http://mlstory.org">Patterns, Predictions, and Actions</a>, astronomers were using pattern matching to predict celestial motions, and the astronomer Edmond Halley realized that similar techniques could be used to predict life expectancy when pricing annuities. Moreover, even though modern machine learning embraced contemporary developments in statistics by Neyman, Pearson, and Wald, the tools quickly grew more sophisticated and separate from core statistical practice. In the next post, I’ll discuss an early example of this divergence between machine learning and statistics, describing some of the theoretical understanding of the Perceptron in the 1960s and how its analysis was decidedly different from the theory advanced by statisticians.</p>
Wed, 13 Oct 2021 00:00:00 +0000
http://benjamin-recht.github.io/2021/10/13/prediction/
http://benjamin-recht.github.io/2021/10/13/prediction/Experiments as randomized algorithms<p>While every statistics course leads with how correlation does not imply causation, the methodological jump from observation to causal inference is small. Using the same algorithmic summarization and statistical analysis tools that we use to estimate averages, we can construct a reliable algorithm for evaluating the causal effect of interventions—actions that change the fate of individuals in a population. The crucial addition needed to evaluate cause and effect is the ability to intervene itself.</p>
<p>Let’s say that we have devised some intervention that can be applied to any individual in a population. We’d like to evaluate the impact of the intervention on the broader population by testing it on a small subset of individuals. As an example following on from the <a href="https://www.argmin.net/2021/09/28/summarization/">last post</a>, the reader can think of height as the property we’d like to affect, and milk consumption as the treatment. We cannot apply the treatment to every individual or else we’d never be able to disentangle whether the treatment caused the outcome or not. The solution is to restrict our attention to a subset of the population, and leverage randomized assignment to eliminate confounding effects.</p>
<p>The simplest mathematical formulation of experiment design is often referred to in terms of <em>potential outcomes</em> and was originally conceived by <a href="https://www.jstor.org/stable/2245382">Jerzy Neyman</a> (a character who will likely appear in every blog post in this series). Suppose that if we apply a treatment to an individual, the quantity of interest takes value $A$. If we don’t apply the treatment, the quantity of interest is equal to $B$. Then we can define a quantity $Y$ which is equal to $A$ if the treatment is applied and equal to $B$ if the treatment is not applied. $Y$ is a deterministic quantity like height. However, there is an odd conditional effect: the value of $Y$ changes depending on whether we applied the treatment or not.</p>
<p>As a simple example, let $A$ denote the height of a person at age 18 if they drank milk growing up and $B$ denote their height if they did not drink milk. Now, obviously, one person can only take one of these paths! But we can imagine the two alternate realities where the same child either drank a cup of milk a day or drank a cup of water instead. The goal of an experimenter is to determine what would happen to a general individual had they taken either of the two paths in the road. The two potential outcomes here are the outcome if the treatment is applied and the outcome if the treatment is not applied.</p>
<p>While the potential outcomes formulation is tautological, it lets us apply the same ideas and statistics we used for computing the mean to the problem of estimating more complex treatment effects. For any individual, the treatment effect is a relation between the quantities $A$ and $B$, commonly just the difference $A-B$. If the difference is positive, we see that applying the treatment increases the outcome variable for this individual. If a child drank a lot of milk, perhaps they are taller as an adult than if they only drank water. But, as we’ve discussed, our main hitch is that we can never simultaneously observe $A$ and $B$: once we choose whether to apply the treatment or not, we can only measure the corresponding treated or untreated condition.</p>
<p>This is where statistics can enter. Statistical algorithms can be applied to estimate <em>average</em> treatment effects across the general population. We can examine trends in small groups of individuals and extrapolate the insights to the broader population.</p>
<p>For such extrapolation, there are a variety of conventions for defining population level treatment effects. For example, we can define the <em>average treatment effect</em> to be the difference between the mean of $A$ and the mean of $B$. For those more comfortable seeing this written out as a formula, we can write</p>
\[\small{
\text{Average Treatment Effect} = \text{mean}(A)-\text{mean}(B)}\]
<p>In our milk example, this would be the difference in the mean of the population height if everyone drank milk versus no one drank milk.</p>
<p>Other population level quantities of interest arise when $A$ and $B$ represent binary outcomes. This could be, say, whether a person is over six feet tall as an adult. Or, for a more salient contemporary example, this could be whether a patient catches a disease or not in a vaccine study. In this case, $A$ is whether the patient catches the disease after receiving a vaccine and $B$ is whether the patient catches the disease after receiving a placebo.</p>
<p>The odds that an individual catches the disease is the number of people who catch the disease divided by the number who do not. The odds ratio for a treatment is the odds when every person receives the vaccine divided by the odds when no one receives the vaccine. We can write this out as a formula in terms of our quantities $A$ and $B$: when $A$ and $B$ can only take values 0 or 1, $\text{mean}(A)$ is the number of individuals for which $A=1$ divided by the total number of individuals. Hence, we can write the odds ratio as</p>
\[\small{
\text{Odds Ratio} = \frac{\text{mean}(A)}{1-\text{mean}(A)} \cdot \frac{ 1-\text{mean}(B)}{\text{mean}(B)}}\]
<p>This measures the decrease (or increase!) of the odds of a bad event happening when the treatment is applied. When the odds ratio is less than 1, the odds of a bad event are lower if the treatment is applied. When the odds ratio is greater than 1, the odds of a bad event are higher if the treatment is applied.</p>
<p>Similarly, the risk that an individual catches the disease is the ratio of the number of people who catch the disease to the total population size. Risk and odds are similar quantities, but some disciplines prefer one to the other by convention. The risk ratio is the fraction of bad events when a treatment is applied divided by the fraction of bad events when not applied. Again, in a formula,</p>
\[\small{
\text{Risk Ratio} = \frac{\text{mean}(A)}{\text{mean}(B)}}\]
<p>The risk ratio measures the increase or decrease of relative risk of a bad event when the treatment is applied. In the recent context of vaccines, this ratio is popularly reported differently. The effectiveness of a treatment is one minus the risk ratio.</p>
<p><a href="https://www.argmin.net/2021/09/13/effect-size/">This is precisely the number used when people say a vaccine is 95% effective.</a> It is equivalent to saying that the proportion of those treated who fell ill was one twentieth of the proportion of those not treated who fell ill. Importantly, it does not mean that one has a 5% chance of contracting the disease.</p>
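The risk ratio and effectiveness calculations can be sketched the same way. The counts below (1 in 100 treated versus 20 in 100 untreated falling ill) are chosen to reproduce the "95% effective" arithmetic, not taken from any real trial.

```python
# Sketch: risk ratio and effectiveness (one minus the risk ratio),
# using the same 0/1 outcome convention as before. Invented counts
# chosen so the effectiveness comes out to 95%.

def risk_ratio(a, b):
    return (sum(a) / len(a)) / (sum(b) / len(b))

def effectiveness(a, b):
    return 1 - risk_ratio(a, b)

treated = [1] + [0] * 99        # 1 of 100 treated fell ill
untreated = [1] * 20 + [0] * 80  # 20 of 100 untreated fell ill

print(risk_ratio(treated, untreated))     # ≈ 0.05
print(effectiveness(treated, untreated))  # ≈ 0.95, i.e. "95% effective"
```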
<p>Randomized experiments give us cut-and-dried techniques to construct high-accuracy estimators of population-level effects. Specifically, we can frame the estimation of the various measures of treatment effects as a particular statistical sampling strategy.</p>
<p>Think of the potential outcomes framework as doubling the size of the population. Each individual has one outcome under treatment and another under control. Hence, if we randomly select a sample and then randomly assign a treatment to each individual of the sample, we can compute the mean values of all individuals assigned to treatment and all individuals assigned to control. As long as $A$ and $B$ are bounded, these sample means are reasonable estimates of all of the treatment effects provided the number of samples is large enough.</p>
<p>The two stage process of building a sample and then randomizing assignment is equivalent to computing a random sample of the potential outcomes population. Randomized assignment allows us to probe population level effects without observing both outcomes of each individual. It’s an algorithmic strategy to extract information: Experiments are algorithms.</p>
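The two-stage process can be sketched as a small simulation. Here each individual's pair of potential outcomes is generated once and then treated as fixed; the only randomness the experiment uses is in the assignment. The outcome distributions are arbitrary choices for illustration.

```python
# Sketch: estimating the average treatment effect via random assignment.
# Each individual has two fixed potential outcomes; randomization decides
# which one we observe. Outcome values are synthetic.
import random

random.seed(0)
n = 10_000

# Fixed (after generation) potential outcomes for each individual.
outcome_treated = [random.gauss(1.0, 0.5) for _ in range(n)]
outcome_control = [random.gauss(0.0, 0.5) for _ in range(n)]
true_ate = sum(outcome_treated) / n - sum(outcome_control) / n

# Randomized assignment: each person reveals only one potential outcome.
treat_group, control_group = [], []
for i in range(n):
    if random.random() < 0.5:
        treat_group.append(outcome_treated[i])
    else:
        control_group.append(outcome_control[i])

estimate = (sum(treat_group) / len(treat_group)
            - sum(control_group) / len(control_group))
print(true_ate, estimate)  # the estimate lands close to the true ATE
```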
<p>Just like in the <a href="https://www.argmin.net/2021/09/28/summarization/">last post</a>, this sampling method does not assume anything about the randomness of $A$ and $B$. This randomized experiment design assumes that we can select samples at random from the population and assign treatments at random. But the individual treatment effects can be either deterministic or random. We do not need a probabilistic view of the universe in order to take advantage of the power of randomized experiments and prediction. Statistics still serve as a way to reason about the proportion of beneficial and adverse effects of interventions.</p>
<p>Moreover, elementary statistics allows us to quantify how confident we should be in the point estimate generated by our experiment with very little knowledge about the processes behind $A$ and $B$. If the outcomes are binary, we can compute <a href="https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval">exact confidence intervals</a> for our outcomes, regardless of how they are distributed. For example, returning to my favorite example of the <a href="https://www.nejm.org/doi/full/10.1056/nejmoa2034577">Pfizer vaccine trial</a>, the confidence intervals used were the <a href="https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval">Clopper-Pearson intervals</a>, which are directly derived from the binomial distribution. The effect size was so large that simple statistics revealed an impressively large effect.</p>
<p>There are certainly limitations to the randomized experiment paradigm. Reliably estimating treatment effects requires the variability of the outcomes to be low, average effects may hide important variation across the population, and temporal dynamics and feedback effects can impact causal conclusions. In future posts, I’ll dive into these critiques in more detail.</p>
<p>Despite the potential limitations, it’s remarkable how causal effects can be measured with some rudimentary sampling and statistics. The same ideas used to estimate a mean can immediately be applied to estimate average effects of interventions. In both cases, we needed only modest knowledge of the effects under study to design algorithmic measures and to establish confidence intervals on their outcomes. In the next post, we’ll explore whether a similar program can be applied to the art of prediction and machine learning. (Spoiler alert: it’s complicated!)</p>
<p>Finally, I have to discuss an elephant I’ve left in the room. Determining cause and effect becomes impossibly challenging once we <em>can’t</em> intervene. For example, suppose we are trying to understand the effectiveness of a vaccine outside a well controlled clinical trial. In the wild, we have no control over who takes the vaccine, but instead can sample from a general population where a vaccine is available and count the number of people who got sick. Determining cause and effect from such <em>observational data</em> requires more modeling, knowledge, and statistical machinery. And no matter how sophisticated the analysis, arguments about hidden confounding variables and other counterfactuals are inevitable. When naysayers yell that correlation doesn’t imply causation, they are almost always targeting an observational study rather than a randomized controlled trial. For a comprehensive introduction to the deep complexity of this topic, let me shamelessly plug the causality chapter in <a href="http://mlstory.org"><em>Patterns, Predictions, and Actions</em></a>, which features both my favorite introduction to observational causal inference penned by Moritz and a version of this blog post on experiments.</p>
Tue, 28 Sep 2021 00:00:00 +0000
http://benjamin-recht.github.io/2021/09/28/rct/