Today I want to take a journey through some of the statistical confusion I discussed yesterday. I’ll make it more confusing perhaps by using a metaphor that almost no one under thirty will understand. (This might be a problem with trying to milliblog every day… some mornings, good metaphors escape me.)
Here’s the scenario. The last hipster record store in the world is downsizing and only has a few records on vinyl. Why do they carry the records they carry?
I can go into the record store and make a list of all of their records. I can write down all of the attributes. How many songs does each record have? What is the color of the cover art? What is the genre of this record? What pronouns does the lead singer use?
Now I can define a “conditional probability” of being in the record store. I will collect the same information about all of the records currently being distributed in the US and compare it to the information for the records in the store. Let’s say I want the conditional probability that a record is in the shop given that the cover is mostly green. Then I just take the number of records in the store with green covers and divide it by the number of records currently available from distributors with green covers. This is the probability of a record appearing in the store conditioned on its cover being green.
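To make the fraction concrete, here is a minimal Python sketch. The catalog, its titles, and its colors are all invented for illustration, and the helper name `prob_in_store_given_color` is hypothetical.

```python
# Hypothetical catalog: every record currently available from distributors.
# Each entry is (title, cover_color, in_store); all values are made up.
catalog = [
    ("Record A", "green", True),
    ("Record B", "green", False),
    ("Record C", "green", False),
    ("Record D", "green", False),
    ("Record E", "red", True),
    ("Record F", "red", False),
    ("Record G", "blue", False),
    ("Record H", "blue", False),
]


def prob_in_store_given_color(records, color):
    """Fraction of records with the given cover color that are in the store."""
    matching = [r for r in records if r[1] == color]
    if not matching:
        raise ValueError(f"no records with cover color {color!r}")
    in_store = sum(1 for r in matching if r[2])
    return in_store / len(matching)


p_green = prob_in_store_given_color(catalog, "green")
print(p_green)  # one of the four green records is in the store
```

Nothing more than counting happens here: the numerator counts green records on the shelf, the denominator counts green records anywhere.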
So far conditional probability is just a fraction. It’s a percentage of all things in existence with some property that are in the store. But I want to understand *why* these records are here. There must be a reason. I now build models. I turn to regression.
I lump all of the attributes of my records into a vector I call X. I get X by taking all of the columns in my spreadsheet except the one that tells me if the record is in the store. My regression model posits some math:

$$
P(\text{record is in the store} \mid X) = \frac{1}{1 + e^{-\beta^\top X}}
$$
β is a vector of coefficients that’s going to explain something important about the relationship between the attributes of my record and why the manager decided to put it on sale here. Why does it have this form? Well, mostly convenience. But we call this logistic regression. I could make some argument about odds ratios or some crap, but the blog is longer than I wanted it to be today. Whatever, people tell me logistic regression is the right thing to do.
Hey, and I have software that can take my spreadsheet and give me those sweet β coefficients! I can use SAS or Stata or R or Python and get out the values of these coefficients and some p-values too! The software might tell me that green covers are statistically significant with p<0.01. I can add three stars to that column of the spreadsheet.
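For a sense of what that software is doing under the hood, here is a rough sketch of a maximum-likelihood fit using Newton's method, written in plain Python. It assumes a toy spreadsheet with a single yes/no attribute (green cover) and made-up counts. The function name `fit_logistic_one_feature` is hypothetical, and real packages handle many columns, standard errors, and p-values that this sketch ignores.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_one_feature(rows, iterations=25):
    """Newton's method for P(y=1 | g) = sigmoid(b0 + b1 * g).

    rows is a list of (g, y) pairs with g and y each 0 or 1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0    # gradient of the log-likelihood
        a = b = c = 0.0  # entries of the negated Hessian [[a, b], [b, c]]
        for g, y in rows:
            p = sigmoid(b0 + b1 * g)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * g
            a += w
            b += w * g
            c += w * g * g
        # Solve the 2x2 Newton system by hand.
        det = a * c - b * b
        b0 += (c * g0 - b * g1) / det
        b1 += (a * g1 - b * g0) / det
    return b0, b1


# Made-up spreadsheet: (green_cover, in_store) for 200 records.
rows = (
    [(1, 1)] * 30 + [(1, 0)] * 70    # green covers: 30 of 100 in the store
    + [(0, 1)] * 10 + [(0, 0)] * 90  # other covers: 10 of 100 in the store
)

intercept, green_coef = fit_logistic_one_feature(rows)
print(intercept, green_coef)
```

With a single binary attribute, the fitted coefficient is just the log of the odds ratio between the two groups of fractions, so the regression has not learned anything the counts did not already contain.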
Now I’m tempted to take it even a step further. I have three stars. If I add a little storytelling about exogeneity, I can declare that green covers cause records to appear in the last dying record store. Causality has emerged from squinting at fractions.
But what the hell does it mean for logistic regression to be a model of music store curation? Logistic regression asserts that records appear in the store because of a random process. The store owner gets a list of records from his distributors. He opens up his Excel spreadsheet with his coefficients β. He takes each record on the list and computes a score based on β and the record’s attributes.
To determine if he stocks the record, the store owner generates a sample from the logistic distribution without looking at the score. Samples from a logistic distribution may take any value, but half the time they are greater than zero, and half the time they are less than zero. About 90% of the time, these samples will be between negative three and three. The random sample for each record must be collected without any reference to any other record.
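These claims about the samples are easy to check numerically. The sketch below assumes the standard logistic distribution (location zero, scale one), which the post does not state explicitly, and draws samples by inverting the CDF. It prints the empirical fractions next to the exact value computed from the CDF.

```python
import math
import random


def sample_logistic(rng):
    """Draw from the standard logistic distribution by inverting its CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return math.log(u / (1.0 - u))


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000
samples = [sample_logistic(rng) for _ in range(n)]

frac_positive = sum(s > 0 for s in samples) / n
frac_within_3 = sum(-3 < s < 3 for s in samples) / n
exact_within_3 = logistic_cdf(3) - logistic_cdf(-3)

print(f"fraction above zero:     {frac_positive:.3f}")
print(f"fraction within (-3, 3): {frac_within_3:.3f} (exact: {exact_within_3:.3f})")
```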
The sample for the record is added to its score. If the sum of the sample and the score is greater than zero, the manager buys the record. Otherwise, he doesn’t. The decision to carry the record must be made without reference to any other records. The larger the record’s score, the higher the probability that this process will get that record on the shelf, but a high score does not guarantee a purchase. In some rare cases, a large negative sample will result in a record with a very high score not getting sold in the store. There’s limited shelf space!
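Here is the store owner's procedure as a short simulation. The coefficients and the record's attributes are invented, the names `stocks_record` and `sample_logistic` are hypothetical, and the noise is assumed to be standard logistic. Running the same decision many times for one record shows that the purchase frequency matches the logistic function of the score, which is all the model actually claims.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_logistic(rng):
    """Draw from the standard logistic distribution by inverting its CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))


def stocks_record(score, rng):
    """The owner's rule: add independent logistic noise, buy if the sum is positive."""
    return score + sample_logistic(rng) > 0.0


# Hypothetical coefficients and one hypothetical record's attributes.
beta = {"intercept": -2.0, "green_cover": 1.5, "num_songs": 0.1}
record = {"intercept": 1.0, "green_cover": 1.0, "num_songs": 10.0}
score = sum(beta[k] * record[k] for k in beta)

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000
purchases = sum(stocks_record(score, rng) for _ in range(trials))
empirical = purchases / trials

print(f"score: {score:.2f}")
print(f"simulated purchase frequency: {empirical:.3f}")
print(f"sigmoid(score):               {sigmoid(score):.3f}")
```

Note that each call to `stocks_record` looks at one score and one fresh random draw, and nothing else. There is no shelf, no budget, and no other record anywhere in the function.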
Though this mechanical process with random number generators is absurd and does not at all resemble anything plausible, this is what it means to model phenomena with logistic regression. Does anything have a model that works like this? This process is so far away from the reality of how any real decision is made. How does an analysis of the fitted logistic model inform us about the mental state of anyone making decisions?
Now take my metaphor and apply it to your favorite observational regression study on bias.
Isn't this just the first half of a recommendation algorithm? Use beta to offer the record store owner deals from the wholesaler, and boom, now there *is* a causal link between the color of the cover and the records in stock (assuming that the owner isn't too hipster to use coupons).
> If I add a little storytelling about exogeneity, I can declare that green covers cause records to appear in the last dying record store
A story about cover green-ness being as good as randomly assigned conditional on other covariates would have to be different from sleeve size being as good as randomly assigned conditional on other covariates, no? At best, one of these stories might ring true; usually, none will.
Not saying this is what most observational research does, unfortunately, but that's a sociological problem and not a statistical one. The statistical argument for how, under conditional ignorability and overlap, you can estimate the conditional probabilities of records being kept with and without some attribute, is valid, and you can estimate causal effects from these quantities.