Can you please everyone?
The question of fairness reveals profound failings of aggregate decision-making.
Optimal decision-making starts looking a lot less optimal once we start deciding things about people. It’s obvious once you see it: a decision rule that is optimal for a population as a whole is not necessarily optimal for the groups within it. This leads us to ask how to engineer decisions that are fair.
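If you want to see this for yourself, here’s a toy simulation (everything in it is made up: the groups, the score distributions, the thresholds). One group is large with well-separated scores; the other is small with shifted scores. The single threshold that maximizes accuracy for the whole population is noticeably worse for the small group than the threshold it would choose for itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_group(n, base_rate, mu_pos, mu_neg):
    """Synthetic labels and a one-dimensional score for one group."""
    y = rng.binomial(1, base_rate, size=n)
    score = np.where(y == 1, rng.normal(mu_pos, 1.0, n), rng.normal(mu_neg, 1.0, n))
    return score, y

def best_threshold(score, y, grid):
    """Threshold maximizing accuracy of the rule 'accept if score >= t'."""
    accs = [np.mean((score >= t) == y) for t in grid]
    return grid[int(np.argmax(accs))]

# Group A: 90% of the population. Group B: 10%, with shifted scores.
sA, yA = sample_group(9000, base_rate=0.5, mu_pos=1.0, mu_neg=-1.0)
sB, yB = sample_group(1000, base_rate=0.3, mu_pos=0.3, mu_neg=-1.7)

grid = np.linspace(-3, 3, 601)
t_pop = best_threshold(np.concatenate([sA, sB]), np.concatenate([yA, yB]), grid)
t_B = best_threshold(sB, yB, grid)

print(f"population-optimal threshold: {t_pop:.2f}, group B's own optimum: {t_B:.2f}")
print(f"group B accuracy at population threshold: {np.mean((sB >= t_pop) == yB):.3f}")
print(f"group B accuracy at its own threshold:    {np.mean((sB >= t_B) == yB):.3f}")
```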
“Fair” is a vague word with many connotations. But there are laws that require equal treatment of, and equal impact on, certain protected classes. In the US, we have laws requiring equal treatment with respect to dozens of attributes, including race, sex, religion, age, and disability. And beyond the law, most of us who work on engineering systems want to build technology that isn’t inherently discriminatory.
Sometimes, companies deploy technology that is in clear violation of antidiscrimination laws. For instance, Facebook violated the Fair Housing Act by allowing advertisers to target housing ads based on protected demographic attributes. Facebook settled a lawsuit with HUD about this violation. Since most companies aren’t as audacious as Facebook, discrimination is typically more subtle and challenging to audit.
The first and by far most important lesson of fair decision-making is that ignoring protected attributes doesn’t guarantee fair treatment. You can’t just pretend the attribute isn’t there: the features you use to classify are almost certainly correlated with it, so the decision rule can reconstruct it whether or not you include the column.
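Here’s one illustration of why dropping the column doesn’t help (again, a synthetic toy: the feature names and numbers are mine, not anyone’s real data). The historical labels penalize one group, a `zip_code` feature tracks group membership, and a logistic regression trained without the protected attribute happily reproduces the disparity through the proxy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000

# 'group' is the protected attribute we will withhold from the model.
group = rng.binomial(1, 0.5, n)
# 'zip_code' is an ordinary-looking feature that happens to track group.
zip_code = group + rng.normal(0, 0.4, n)
# 'skill' is distributed identically in both groups.
skill = rng.normal(0, 1, n)

# Historical labels carry a penalty against group 1.
y_hist = (skill - 1.0 * group + rng.normal(0, 0.5, n) > 0).astype(int)

# "Fairness through unawareness": the group column is never seen by the model.
X = np.column_stack([skill, zip_code])
clf = LogisticRegression().fit(X, y_hist)
accept = clf.predict(X)

for g in (0, 1):
    print(f"group {g}: acceptance rate = {accept[group == g].mean():.2f}")
print(f"learned weight on zip_code: {clf.coef_[0][1]:.2f}")  # strongly negative
```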
Once you explicitly account for the protected attributes, the next important lesson is that it’s impossible to write down rules for decision systems that everyone will agree are fair. Dozens of “fairness metrics” have been proposed, and all of them are deeply flawed.
Do we want equalized acceptance rates? That is, should the rate at which each group is assigned the action be equal? This doesn’t rule out too many false positives, which, as in the case of medical interventions, can lead to unnecessary, harmful procedures. What if we equalize true positive and false positive rates across groups? This forces the decision rule to be suboptimal for everyone. Do we want acceptance based on a common score? This assumes we can build good tests that don’t themselves produce disparate impact.
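For concreteness, here’s roughly how you’d tabulate the quantities behind each of those questions for a binary decision (the function and dictionary keys are my own naming, nothing standard):

```python
import numpy as np

def per_group_report(y_true, y_pred, group):
    """Acceptance rate, TPR, FPR, and precision for each group."""
    report = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        report[g] = {
            "acceptance_rate": yp.mean(),      # equalized acceptance rates
            "tpr": yp[yt == 1].mean(),         # equalized true positive rates
            "fpr": yp[yt == 0].mean(),         # equalized false positive rates
            "precision": yt[yp == 1].mean(),   # score-based / calibration view
        }
    return report
```

Each fairness proposal amounts to demanding that one of these rows match across groups. The trouble is that you can’t match them all.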
What’s worse is that the various proposed fairness criteria compete with each other. Some propose that the decision be entirely independent of the protected attribute. Others think the decision should be independent of the attribute conditioned on some notion of true worthiness. Still others propose that the true outcome be independent of the attribute conditioned on the test’s prediction. It’s impossible to satisfy all three at once. You can have a well-calibrated test that still yields disparate impact. For example, the infamous COMPAS recidivism score is roughly equally calibrated across groups, but ends up with a much higher false positive rate for Black defendants.
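The arithmetic behind that tension is short and standard (this is essentially Chouldechova’s observation, stated loosely, not anything original here). For a group with base rate $p$, write $\mathrm{PPV}$ for the score’s positive predictive value and $\mathrm{TPR}$ for its true positive rate. Rearranging the confusion matrix gives

$$\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\mathrm{TPR}.$$

So if two groups have different base rates and you equalize $\mathrm{PPV}$ (calibration) and $\mathrm{TPR}$ across them, their false positive rates cannot be equal. Something has to give.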
You get yourself stuck. There are all sorts of ways to define what it would mean for a decision to be fair, and all of these definitions logically conflict with each other. Why is this happening?
The issue here is the statistical framing itself. Maximizing average utility says nothing about the utility of subgroups. And piling on constraints to balance out the discrepancies leads to infeasible problems. You can’t fix this framework.
This quasi-legal obsession with groups also misses a critical point. Once we subdivide enough, we get down to the individual. Maximizing any contrived notion of aggregate utility will never tell you anything about the welfare of individuals. In fact, it almost surely implies some individuals will be treated unfairly. What do we morally owe to each individual in a fair society? We can’t answer this with “more data.”
This is one of those things that gets me in trouble with my graduate students and will likely get me in trouble with you, my readers. I worry my CS friends who think about fairness and related issues too frequently fall into the technocracy trap. People who study policy mean well, but they convince themselves that because they spend so much time on it, they know better than everyone else. They then want to leverage their elite status to impose their ideas on everyone else. Technocracy applies a veneer of science used to justify the moral positions of those in power.
What if fairness can’t be solved by proving theorems and writing code? Do these problems really require technocratic solutions? I don’t know who, other than the technocrats themselves, thinks overly technocratic governance has been a particularly uplifting solution. Has it even been good on average? What if instead the answer is less technocracy? At some point, we have to decide what outcomes we want and fight to make them real. We have to refine our rules to achieve our moral outcomes. No ROC curve can do this.
Newsflash: algorithmic systems don't automatically "align" with our morality. And it's not clear we can ever get statistics to align with morality. The salient artifacts of a decade of algorithmic fairness papers are theorems proving algorithmic fairness is impossible. The statistical frame dooms us to discriminate. The way out isn't going to be through more math.
"Technocracy applies a veneer of science used to justify the moral positions of those in power."
This may very well go into my file cabinet of pithy wisdom to pull out when needed. I can put it in the same drawer as Sowell, "There are no solutions, only tradeoffs."
Good rant. However, I want to push back on your point about COMPAS. A perfectly calibrated risk prediction can still be extremely lacking. I find it hard to understand why risk scores far from 0 or 1 are useful when it comes to decision-making. When an outcome is highly variable given observed attributes, shouldn't that tell us that we need to collect more data, or focus our technocratic energy elsewhere? For COMPAS specifically, similarly accurate predictions can be achieved by logistic regression on two features (age and number of prior convictions): https://www.science.org/doi/full/10.1126/sciadv.aao5580