Yesterday, Semafor posted a silly article about a silly company using chatbots to simulate public opinion and predict elections.
I mean, good for them, I guess.
A dozen people sent me this article. My friend wrote me “It does feel like someone pitched their editor: ‘What if we make this one Berkeley guy really really angry.’” I found it too on the nose to be that annoyed, but it did seed a question in my head that I haven’t been able to shake. I asked on Twitter: “We all believe pollsters aren’t actually doing this. But how could someone actually tell the difference?”
I was serious about this question. What exactly is an opinion poll?
Pollsters want to estimate the percentage of people in a population who would answer yes to a yes-or-no question. In a perfect world, everyone in the population could potentially be asked this question and would always answer truthfully. If this were the case, you could pick a random sample of about 800 people, ask them the question, and compute the percentage that answers yes. By the unquestionable laws of frequentist statistics, we'd be guaranteed that the true answer would lie within 3.5% of the sampled answer for 95% of the potential random samplings.1
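To make the arithmetic concrete, here's a minimal sketch of that back-of-the-envelope calculation, assuming the worst case of a 50/50 split and the usual normal approximation:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a simple random sample of size n.

    Uses the normal approximation to the binomial; p = 0.5 is the
    worst case, which is why it gets quoted as a blanket figure.
    """
    return z * math.sqrt(p * (1 - p) / n)

print(f"n=800:  +/- {margin_of_error(800):.1%}")   # +/- 3.5%
print(f"n=1068: +/- {margin_of_error(1068):.1%}")  # +/- 3.0%
```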
But of course, no poll works this way. The process instead goes like this: the polling company has some means of getting people to answer questions. Maybe they can call up landlines. Maybe they can gather a panel of participants to click on a web form. Maybe they can harass people on the street. By hook or by crook, they gather a sample of people and hope they respond. Some people answer some of their questions truthfully. Some people tell them what they think they want to hear. Some people lie. Some people tell them to leave them alone.
With this pristinely collected data, the pollsters have to come up with a percentage to send to the press. They do not send you the raw percentage! Instead, they build a statistical model to impute the unanswered questions and adjust for sampling and nonresponse biases. Whatever this model says, that's what they report. But that model has tons of choices and knobs. If you give different pollsters the same data, they give you wildly different answers. Nate Cohn ran this experiment in 2016. He gave 5 "good" pollsters the same data and got a 5-point spread in the numbers they returned. The systematic bias of "house effects" is as large as the "margin of error."
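To get a feel for where those knobs live, here's a toy sketch of one standard adjustment, post-stratification weighting. Every name and number below is invented for illustration; real pollsters weight on many more variables, and the choice of which ones is exactly where the house effects creep in.

```python
# Toy post-stratification: reweight respondents so the sample's demographic
# mix matches assumed population shares. All numbers invented for illustration.
respondents = [
    ("young", 1), ("young", 1), ("young", 0),       # (group, answered_yes)
    ("old", 0), ("old", 0), ("old", 0), ("old", 1),
]

population_share = {"young": 0.5, "old": 0.5}  # the modeler's choice!

sample_share = {
    g: sum(1 for grp, _ in respondents if grp == g) / len(respondents)
    for g in population_share
}
weights = {g: population_share[g] / sample_share[g] for g in population_share}

raw = sum(yes for _, yes in respondents) / len(respondents)
weighted = (sum(weights[g] * yes for g, yes in respondents)
            / sum(weights[g] for g, _ in respondents))

print(f"raw:      {raw:.1%}")       # 42.9%
print(f"weighted: {weighted:.1%}")  # 45.8% -- same data, different headline
```

Nudge the assumed population shares a couple of points and the headline number moves right along with them.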
Of course, the pollsters still tell you the margin of error is 3%! This is at best misleading and at worst a lie. A 3% MOE happens only if you sample uniformly at random from the population (and poll a thousand people or so). If you do whatever weird data collection and post-processing procedures the pollsters actually do, that frequentist guarantee goes out the window.
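A quick simulation makes the point. Suppose, purely for illustration, that "yes" voters are slightly more likely to pick up the phone; the nominal ±3.5% interval then misses the truth far more often than the advertised 5% of the time:

```python
import random

random.seed(0)
p_true = 0.50                 # true share of "yes" in the population
respond = {1: 0.6, 0: 0.5}    # assumed: "yes" voters answer a bit more often
n, trials, misses = 800, 2000, 0

for _ in range(trials):
    yes = 0
    for _ in range(n):
        # keep drawing people until one agrees to respond
        while True:
            answer = 1 if random.random() < p_true else 0
            if random.random() < respond[answer]:
                break
        yes += answer
    if abs(yes / n - p_true) > 0.035:  # outside the nominal 95% MOE
        misses += 1

print(f"missed the truth in {misses / trials:.0%} of polls (advertised: 5%)")
```

With even this mild response skew, the poll misses the true 50% in roughly seven out of ten runs, not one in twenty.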
Polling is quantitative social science with less openness. And quantitative social science is a giant mess. I know not to trust any result in quantitative social science. I know you can't fix quantitative social science with meta-analysis (here's looking at you, poll averagers). Given what we have learned from the "replication crisis" and the infinite set of forking paths in model adjustment, why should we believe any of the numbers that come out of these polls? I'm going to go a step further. Why should we believe that pollsters actually talked to people? How could you or I know for sure that pollsters ever did a survey?
The answer is trust. We're supposed to trust certain pollsters because certain media empires tell us they should be trusted. The claim is that this trust comes from "track record," but what a poll in July tells us about a result in November is dubious at best. And a pollster's track record is impossible to reliably validate or corroborate.
No, the trust here is established through incestuous politico-media relationships. But what is the news value of a poll relative to one of those obnoxious undecided-voter panels? What is its value over an anonymous source? Just because they give you numbers, perhaps to three decimal places, doesn't mean polls, pollsters, or poll analysts deserve our attention.
Every time I write out the definition of a confidence interval, God kills a kitten.
Why not save the kitten and write that "for any p outside the interval, the probability of observing this sample, conditional on p being the true probability, is less than 5 percent"? But God would probably kill you instead.
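For the record, one standard way to make that comment precise is test inversion: the 95% interval collects every candidate p the observed data would not reject at the 5% level. A sketch in symbols:

```latex
% Keep every p under which the observed data x is not "surprising"
% at the 5% level (test inversion):
C(x) = \left\{\, p \in [0,1] \;:\; \Pr_p\bigl(\text{data at least as extreme as } x\bigr) \ge 0.05 \,\right\}
% This is what buys the coverage guarantee:
% \Pr_p\bigl(p \in C(X)\bigr) \ge 0.95 \quad \text{for every } p.
```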
An interesting article which, in my view, contains explanations that it would be useful for many to see.
Politicians, when questioned about polls, often answer that they don't pay much attention to them because "the only poll that matters is the one on election day." Well, I don't believe they ignore polls prior to election day, because many actually refer to them whilst campaigning. Nor is it correct that the poll on election day is the only one that matters, for the reality is that those prior to it have already had an effect, whether or not they are reliable indicators of voters' opinions or intentions. So those pre-election-day polls do matter. Indeed, when you take into account the influence of the wealthy and other influencers, such as media and celebrities, and the fact that they can skew poll results, there is no doubt (at least in my mind) that polls are a concern and can certainly influence election results.
Given also that the election is, in effect, a poll, yet another problem exists: the process used in the election itself. Issues such as whether voting is mandatory, whether all candidates are treated equally, and whether all voters are treated equally also need to be considered. The reality is that, generally speaking, they are not.
So, particularly in a modern world where every Tom, Dick, Mabel, & Sue can add their 10c worth to social discourse and debate, and where major media companies are owned by a handful of, inevitably, overly wealthy individuals with corresponding views on government, what is for sure is that the 'average' person (whatever that is) really has little influence on who is elected or how the nation is governed, managed, conditioned, legislated, or anything else.
In other words, put simply, the whole electoral process is so fraught with opportunities for error and skewing that seeing it as an indicator of democracy is, at best, extremely debatable. This is particularly true where basically only one of two parties has any real chance of holding government. In almost every case, the likely result is that at least 40%, and often far more, of voters will have voted against whoever wins.
Representative democracy? I think not.