Individual experiences and collective evidence
Jessica Dai on theory for the world as it could be
I’m on vacation this week and am handing over the blog controls to Jessica Dai. Jessica is a PhD student in my group and also a cofounder of Reboot and an editor of Kernel. This blog post covers work she’s been leading for the past year on how we ought to make statistics out of individuals.
I am a critic at heart, but an impatient one, which gives me a complicated relationship to the world of “responsible AI.” On the one hand, structural critiques are almost always true — AI tends to concentrate power, market incentives accelerate harm, structural forces will necessarily compromise any proposed “solution.” On the other hand, I am, as I mentioned, impatient. That these forces are so entrenched is precisely why we ought to try to take action in spite of them, rather than remain, safely, in the critic’s seat.1
Over the last few years, I’ve accumulated a long list of grievances about this area of research: the glut of methods and metrics that remained disconnected from real-world deployment; the way that such methods relied on the good faith of model developers, while almost never leaving room for the agency of the everyday people who actually experience the model.
I really wanted to understand what should be done instead. The result, a very long time in the making, was a theory(ish) paper, written with my dear friends/neighbors/labmates Paula and Deb. In this paper, we describe and analyze a system that addresses at least some of these critiques; in doing so, we are explicitly prescriptive about a world we hope to see.
The paper is titled, somewhat floridly but hopefully also descriptively, From Individual Experience to Collective Evidence: A Reporting-Based Framework for Identifying Systemic Harms. Suppose there existed a mechanism where individuals could report a problematic experience with a specific system (e.g., perceived mistreatment by a predictive algorithm that affected loan allocations). Then, as reports arrive over time, we show how it’s possible to quickly identify whether there was a particular subset of the population that was disproportionately experiencing harm (e.g., actually experienced elevated rates of loan denials even conditioned on financial health), and do so with some sense of statistical rigor.2
The core concept is individual reporting as a means to build collective knowledge: intuitively, if one person had one bad experience, that by itself doesn’t necessarily mean that there was something wrong with the entire system. But if lots of people start reporting the same problem — and, if those people appear to be similar in important ways — then that seems like it might constitute a true, underlying problem with the system.
This collective knowledge, by virtue of being tied to a specific system, is also therefore actionable: it describes real, known patterns of behavior that, e.g., a model developer might then be able to fix, or that, e.g., a third-party or government body (RIP) might use as evidence to hold the model owner accountable. In this way, individual reporting also becomes a new lens for evaluation, and thus a pathway for post-deployment monitoring that is not doomed to be solely descriptive of past harm but can rather also be a basis for action.
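For concreteness, here is a minimal sketch of what this kind of aggregation could look like. The setup is a heavily stylized one I'm using just for this post (predefined subgroups, a known reference rate P0 for how often reports "should" arrive, and a binary did-they-report outcome per individual), and it is not the paper's actual algorithm, though it shares the same basic ingredients: a sequential test plus a Bonferroni-style correction across subgroups.

```python
import math
from collections import defaultdict

# Stylized setup (illustrative assumptions, not the paper's model): individuals
# from a few predefined subgroups interact with the system, and each either
# files a harm report or doesn't. A subgroup gets flagged as soon as its
# reporting rate is significantly above a known reference rate P0.

ALPHA = 0.05                  # overall false-alarm budget
GROUPS = ["A", "B", "C"]      # hypothetical subgroups
P0 = 0.02                     # hypothetical reference reporting rate

counts = defaultdict(lambda: [0, 0])  # group -> [individuals seen, reports filed]

def elevated(n, reported, alpha_per_group):
    """Anytime-valid check that a group's reporting rate exceeds P0.

    Runs a one-sided Hoeffding test at level alpha_per_group / (n * (n + 1))
    after the n-th observation; those levels sum to alpha_per_group over all n,
    so checking after every arrival doesn't inflate the false-alarm rate.
    """
    alpha_n = alpha_per_group / (n * (n + 1))
    eps = math.sqrt(math.log(1.0 / alpha_n) / (2.0 * n))
    return reported / n > P0 + eps

def observe(group, filed_report):
    """Ingest one individual's outcome; return True if the group should be flagged."""
    c = counts[group]
    c[0] += 1
    c[1] += int(filed_report)
    # Bonferroni correction: split the false-alarm budget across monitored groups.
    return elevated(c[0], c[1], ALPHA / len(GROUPS))

# Toy stream of (subgroup, filed a report?) pairs. With only a handful of
# observations nothing will be flagged; the bound needs real volume.
for group, filed in [("A", True), ("B", False), ("A", True), ("C", False), ("A", True)]:
    if observe(group, filed):
        print(f"Flag subgroup {group}: reporting rate looks elevated above {P0:.0%}")
```

The anytime-valid flavor of the bound is the point: reports trickle in, and we want to be able to check after every single one, rather than waiting for a fixed sample size, without inflating the chance of a false alarm.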
What we did in this particular paper was fairly specific to the problem of fairness, in that the “problem” we were trying to identify was intrinsically about identifying subgroups that experienced disproportionate rates of harm. But I believe very strongly that individual reporting mechanisms, as described in the previous paragraph, are valuable in a much broader sense. (I make this general case in greater detail in a new preprint — Aggregated Individual Reporting for Post-Deployment Evaluation — which elaborates on the arguments in this post.3)
Of course, there’s the general case for thinking about what happens post-deployment: it is quite literally impossible for even the most well-intentioned model developers to fully predict, measure, and mitigate problems before seeing how people are actually engaging with the system. This is especially true for more complex systems like LLMs, and the sentiment around needing careful analysis of actual usage is in fact starting to percolate to practical systems. On the general-purpose end, some of Anthropic’s recent posts gesture at the importance of understanding actual usage; on the task-specific end, UCSF is piloting more targeted monitoring for AI systems in clinical applications.
What’s unique, I think, about the individual reporting approach to post-deployment monitoring is that it relies on two crucial beliefs: first, that those interacting with the system have unique and valuable perspectives on its failure modes, and second, that they ought to be able to express those perspectives in a nontrivial way. While these statements feel natural to me personally, mainstream practice in AI evaluation is rarely consistent with them. The word I’ve been dancing around, perhaps, is values: if you’ll forgive my elementary STS facility, one might say that the values or politics expressed by typical approaches to evaluation do not include respect for or valorization of individual agency.
My personal feelings about these values are why I care about individual reporting. But why should you, or anyone with decision-making power to actually implement a system like this, care? Normative considerations aside, I suspect that there could be really interesting substantive information that might come out of reports. I say suspect, because, of course, this is an empirical question and, for the most part, individual reporting mechanisms for AI do not exist. On the other hand, in other domains where individual reporting has been established as standard practice — notably, vaccines and pharmaceuticals — these individual reports have surfaced interesting, important, and consequential information about their effects at a population level.
Today, individual experiences with AI systems are captured largely informally — Reddit discussions and Twitter memes that sometimes make their way into reported journalistic stories. To me, the ChatGPT personality flip-flop from late April is instructive. (In case you missed it, OpenAI rolled out an update to ChatGPT that made it seem much friendlier and “personable” — then reverted that change, no doubt in large part because of a wave of semi-viral tweets showing dangerously sycophantic responses.)
It goes without saying that x dot com is not in fact a structured repository for problems with chatgpt dot com, but I do think what happened here is an illustration of what individual reports have to offer as an evaluation mechanism. OpenAI wasn’t able to identify ahead of time that this “personality” update was problematic, in part because it is hard to anticipate the richness of usage patterns and therefore failure modes. People found the problems egregious enough that they were motivated to tweet. Enough people — importantly, LLM-twitter-famous people — tweeted about the same problem that OpenAI noticed.
This ChatGPT personality problem was serious, widespread, and was therefore caught. But what other, subtler, non-Twitter-viral patterns are happening? What more could we understand about the fractal, “jagged” edges of AI system deployments if we had better ways to listen to the people who interact with them?
Something that we talk about a lot in Ben’s group is that if you were to believe the titles and abstracts of the ML theory x society work from the last few years, you might think that all manner of societal problems—from data sharing to social media polarization to democratic decisionmaking—had been solved, and optimally at that. (And yet.)
In that sense, we’re not completely guiltless. This paper lays out a pretty ambitious vision. At the same time, an uncharitable but not inaccurate characterization of what we’ve actually done in the paper would be that it’s not much. The HCI-oriented reader might be rightfully skeptical of all of our assumptions. The capital-s Statisticians in the audience — I am assuming there are a few among typical argmin readers, if I haven’t lost you yet — might be unimpressed by the technical results. (What, so this was just sequential mean testing with a Bonferroni correction all along?4)
I say this not for cosmetic humility but because I want to be honest about what we’ve done, and what’s left to be done, which is most of it. Our paper was fairness-oriented, sure, and thus much narrower than the high-level vision for individual reporting mechanisms in general. Still, this paper, as a concrete instantiation of an individual reporting system, helped me to crystallize what the big challenges might look like in any other instantiation: How would the system be scoped? What counts as “harm”? What information is communicated in a report? How should the dynamics of reporting behaviors be understood and addressed?
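To make the “what information is communicated in a report?” question concrete, here is one hypothetical shape a report record could take. Every field name below is my illustrative assumption rather than a schema from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class IncidentReport:
    """One individual's report about a specific deployed system (illustrative only)."""
    system_id: str                       # which deployed system the report concerns
    timestamp: float                     # when the experience occurred
    description: str                     # free-text account of the problematic experience
    harm_category: Optional[str] = None  # reporter-selected label, e.g. "loan denied"
    covariates: dict = field(default_factory=dict)   # optional self-reported attributes
    consent_to_aggregate: bool = False   # may covariates be used in subgroup analysis?
```

Richer covariate fields make subgroup analysis sharper, but they also raise exactly the privacy and disclosure questions that any real deployment of such a system would have to confront.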
Ultimately, I think of this initial paper as a fundamentally optimistic artifact. The paper is speculative, in a hopefully constructive way; it’s as concrete an answer to “what’s the alternative” as we could come up with. To repeat a sentence from above: for the most part, individual reporting mechanisms for AI do not currently exist! I can’t, of course, promise with complete certainty that they would actually always generate substantively useful insights.5 But, at the very least, I think it’s worth trying to find out.
1. Aspirationally, my orientation towards these questions is influenced by non-reformist reforms (Gorz) and utopian demands (Weeks).
2. Never mind what Ben typically says about statistical rigor... In all seriousness, though, regardless of concerns about whether statistical conclusions are real in a metaphysical or philosophical sense, it is undeniable that statistical evidence, whatever that means, holds some unique power (see, for example, Ben's “bureaucratic statistics” take). To that end, the technical meat of the paper is about how, and to what extent, our fuzzy intuitions about individual agency and collective knowledge can be made legible to societal standards of what is deemed to be acceptable as (statistical) evidence.
3. Sorry, yet another position paper — we can meta-discourse about position papers and their dis/utility to the research ecosystem at another time.
4. Yes, though not for lack of trying. I am happy to discuss what the technically interesting meat would have been for anyone who is curious. I would be even happier if someone had ideas for how to fix it.
5. For the true Recht-heads: Metallic Laws — the Brass one, at least — probably suggest that most things people propose are mostly useless.
What about privacy? To identify harms affecting specific subgroups from individual reports, you have to collect those covariates (financial status, gender, sex, race, ethnicity, religion, veteran status, etc etc). Broadly speaking, this has two problems:
- you may be subject to privacy rules that forbid you from looking at these characteristics directly
- your end users may be very wary of providing so much information about themselves
I think these two points go some way towards answering the question "why don't we do this in industry today?"