Preamble
In my next few blog posts, I want to talk about some tricks for controlling statistical confounding in the context of multivariate linear regression, which is about the simplest kind of model that can be used to relate more than two variables. Although I've taken a full load of statistics classes, including a whole course on multivariate linear regression alone, I never learned how to choose the right variables to include for a desired analysis until I came across it in Richard McElreath's book 'Statistical Rethinking'.
In short, it's likely to be something that most machine learning and data science practitioners wouldn't ordinarily pick up in a class on regression, and it's useful and kind of fun.
Controlling confounding requires drawing hypothetical diagrams of how your variables might relate causally to each other, doing some checks to determine whether the data conflict with the hypotheses, and then using the diagrams to derive sets of variables to exclude and include. It's a nice interplay between high-level thinking about causality and mechanical variable selection.
This week's post is an introduction where I'll set the stage a bit.
---
Here's the entire 'statistical confounding' series:
- Part 1: Statistical confounding: why it matters: on the many ways that confounding affects statistical analyses
- Part 2: Simpson's Paradox: extreme statistical confounding: understanding how statistical confounding can cause you to draw exactly the wrong conclusion
- Part 3: Linear regression is trickier than you think (this post): a discussion of multivariate linear regression models
- Part 4: A gentle introduction to causal diagrams: a causal analysis of fake data relating COVID-19 incidence to wearing protective goggles
- Part 5: How to eliminate confounding in multivariate regression: how to do a causal analysis to eliminate confounding in your regression analyses
- Part 6: A simple example of omitted variable bias: an example of statistical confounding that can't be fixed, using only 4 variables.
Multivariate linear regression
Multivariate linear regressions are the first type of frequentist model you encounter as a statistician. They are used to relate an outcome variable y to a set of covariates x_1, ..., x_k.

The model specification below is for a Bayesian linear regression model with k covariates (with weakly informative priors):

y_i ~ Normal(mu_i, sigma)
mu_i = alpha + beta_1 * x_1i + ... + beta_k * x_ki
alpha ~ Normal(0, 10)
beta_j ~ Normal(0, 10)
sigma ~ Exponential(1)

The fact that the scale parameter sigma of the modeled outcome is held fixed across observations is one of the simplifying assumptions that keeps this model tractable.
This model is about as simple a statistical model as you can have for modeling data sets with a lot of variables. But when I was studying multivariate regression, the covariates used for modeling were often chosen without much explanation. Sometimes we would use all the variables available, and sometimes we would only use a subset of them. It wasn't until later that I learned how to choose which variables to include in a multivariate regression model. The choice depends on what you're trying to study, and on the causal relationships among all the variables.
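To ground the discussion, here's a minimal sketch of fitting a multivariate regression by ordinary least squares. This is my own illustration, not from the original post; all variable names and coefficient values are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulate two covariates and an outcome with known coefficients.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # close to the true values [1.0, 2.0, -3.0]
```

With this much data and both covariates included, the fit recovers the simulated coefficients; the interesting questions in this series are about what happens when you include the *wrong* subset of covariates.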
And, of course, you don't know the causal relationships among the variables -- often, this is what you're trying to figure out by doing linear regression -- so you need to consider several possible diagrams of causal relationships.
The ultimate goal is to get statistical models that clearly answer your questions, and don't 'lie'. Actually, statistical models never lie, but they can mislead. Statistical confounding occurs when the apparent relationship between the value of a covariate and the outcome, as reflected in the model's estimates, differs from the true causal relationship between them.
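To make this concrete, here's a small simulation (my own, not from the post) in which a common cause z drives both a covariate x and the outcome y. Regressing y on x alone badly overstates x's effect; also including z recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# z is a common cause (a confounder) of both x and y.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = 0.5 * x + 2.0 * z + rng.normal(scale=0.5, size=n)  # true effect of x is 0.5

def ols(outcome, *cols):
    """Least-squares slopes with an intercept column."""
    X = np.column_stack([np.ones(len(outcome)), *cols])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta

b_confounded = ols(y, x)     # y ~ x: the slope absorbs z's effect
b_adjusted = ols(y, x, z)    # y ~ x + z: the slope is near the true 0.5
print(b_confounded[1], b_adjusted[1])
```

Nothing about the confounded fit looks broken from the inside: the model faithfully reports the association in the data. It's only misleading if you read its coefficient as the causal effect of x.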
It takes some know-how to eliminate confounding. Sometimes you have to be sure to include a variable in a multivariate regression in order to get an unconfounded model; sometimes including a variable will *cause* confounding.
Sometimes, nothing you can do will prevent confounding, because of an unobserved variable. But here is what you can do:
1. You can hypothesize one or more causal diagrams that relate the variables under study. You can consider some that include variables you may not have measured, in order to anticipate problems.
2. You might be able to discard some of these hypotheses, if the implied conditional independence relationships between the variables aren't supported by the data.
3. You can learn how to choose what variables to include and exclude, on the basis of the remaining hypothetical causal diagrams, to get multivariate regressions that aren't confounded.
4. You can also determine when confounding can't be prevented, because you would need to include a variable that isn't available.
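As a preview of the kind of trap step 3 guards against, here's a sketch (again my own simulation, not from the post) of how including a variable can *cause* confounding. Here x and y are truly unrelated, but both cause c; conditioning on c manufactures a spurious negative association, a pattern usually called collider bias:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# x and y are independent; c is a common *effect* of both (a collider).
x = rng.normal(size=n)
y = rng.normal(size=n)
c = x + y + rng.normal(scale=0.5, size=n)

def ols(outcome, *cols):
    """Least-squares slopes with an intercept column."""
    X = np.column_stack([np.ones(len(outcome)), *cols])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta

slope_plain = ols(y, x)[1]        # near 0: there is no real relationship
slope_collider = ols(y, x, c)[1]  # strongly negative: bias from including c
print(slope_plain, slope_collider)
```

So "control for everything you measured" is not a safe default: whether a variable belongs in the regression depends on where it sits in the causal diagram, which is exactly what the later posts in this series work through.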
In succeeding posts, I'll show you how to go about doing this yourself.