Statistical confounding: why it matters


This article is a brief introduction to statistical confounding. My hope is that, having read it, you'll be more on the lookout for it, and interested in learning a bit more about it. 

Statistical confounding leads to errors in data-based decision-making, and those errors have important consequences for public policy. This is always true, but it seems especially true in 2020-2021. Consider these questions:

1. Is lockdown the best policy to reduce COVID death rates, or would universal masking work just as well?

2. Would outlawing assault rifles lower death rates due to violence in the US? 

3. Would outlawing hate speech reduce the incidence of crime against minorities in the US?  

4. What effect would shutting down Trump's access to social media have on his more extreme supporters?

If you're making data-based decisions (or deciding whether to support them), it's important to be aware that confounding happens. For people practicing statistics, including scientists and analysts, it's important to understand how to prevent confounding from influencing your inferences, if possible. 

Identifying and preventing confounding is a topic that I haven't seen covered in most places -- not even in my multivariate linear regression classes. It's explained beautifully in Chapter 5 of Richard McElreath's book "Statistical Rethinking", which I highly recommend if you're up for a major investment of time and thought. 

This topic is the first in a cluster about statistical inference from my slipbox (what's a slipbox?).

Important note: I use the 'masking' example below as a case where confounding might hypothetically occur. I am not suggesting for a moment that masks don't fight COVID transmission. I am a huge fan of masking! Even though I wear glasses and am constantly fogged up.


Here's the entire 'statistical confounding' series:

- Part 1: Statistical confounding: why it matters (this post): on the many ways that confounding affects statistical analyses 

- Part 2: Simpson's Paradox: extreme statistical confounding: understanding how statistical confounding can cause you to draw exactly the wrong conclusion

- Part 3: Linear regression is trickier than you think: a discussion of multivariate linear regression models

- Part 4: A gentle introduction to causal diagrams: a causal analysis of fake data relating COVID-19 incidence to wearing protective goggles

- Part 5: How to eliminate confounding in multivariate regression: how to do a causal analysis to eliminate confounding in your regression analyses   

- Part 6: A simple example of omitted variable bias: an example of statistical confounding that can't be fixed, using only 4 variables.

Statistical Confounding

Suppose you've got an outcome you desire: for example, you want COVID cases per capita in your state to go down. Give COVID cases per capita a name: call it $Y$. 

You've also got another variable, $X$, that you believe has an effect on $Y$. Perhaps $X$ is the fraction of people who wear masks whenever they go out in public. $X=0$ means no one is wearing masks; $X=1$ means everyone is.

You believe, based on numbers collected in a lot of other locations, that the higher the value of $X$ is, the lower the value of $Y$ is. After a political fight, you might be able to require everyone to mask up by passing a strict public mask ordinance: in this case, you would be forcing $X$ to have the value 1.

In order to determine whether to do this, you set up an experiment, a clinical trial for masks. You start with a representative group of people, and randomly assign half of them to wear masks whenever they go out, and the other half to not wear masks. The 'at-random' piece is important here, as it is in clinical trials. Setting $X$ forcibly to a specific value, chosen at random, can be thought of as applying an operator to $X$: call it the "do-operator". 

The do-operator is routinely applied in experimental science. For example, in a vaccine clinical trial, people aren't allowed to choose whether they get the placebo or the vaccine: one of these possibilities is chosen at random. Because the assignment is random, it can't be related to anything else about the participants, and this lets you assess the true causal effect of $X$ on $Y$.

If your experiment shows that mask-wearing is effective at lowering the per capita COVID case rate, you can then support a mask-wearing ordinance, with confidence that the ordinance will have the desired effect 'in the wild'. 

Statistical confounding occurs when the apparent relationship $p(Y|X)$ between the value of $X$ and the value of $Y$, observed in the wild rather than under experiment, differs from the true causal effect of $X$ on $Y$, $p(Y|do(X))$.

To put this in the context of masking, suppose we've observed in the wild that people who wear their masks when they go outside the house have lower COVID case rates per capita than people who don't. If we enforce a mask ordinance on the basis of this observation, it's possible that we might find that the law has no effect on the COVID case rate.  

This might happen because of the presence of other variables which affect the outcome variable, called confounder variables. In the case of the masking question, it may be that an important confounder is whether a person is concerned about catching COVID. If a person is concerned, it may be that in addition to wearing masks when they go out, they are also avoiding close contact with people outside their household. And perhaps that is the true cause of the reduction in the COVID rate among people who wear masks.

If it is the case that avoiding in-person meetings is the real cause of the lowered case rates, rather than wearing masks, then enforcing a masking law will not have the desired effect of reducing the case rate. And you definitely want to avoid passing an ineffective ordinance, for obvious reasons.
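To make the masking story concrete, here is a minimal simulation sketch in Python. Everything in it is invented for illustration: the function name `simulate`, the probabilities, and above all the assumption that masks have no effect at all in this toy world (which, as noted above, is not a claim about real masks). It compares case rates for mask wearers and non-wearers 'in the wild', where concern about COVID drives both masking and avoiding contact, with the same comparison when masks are assigned by coin flip, as the do-operator would.

```python
import random

def simulate(n, randomize, seed=0):
    """Toy world: return (case rate among maskers, case rate among non-maskers).

    Hypothetical numbers throughout. Masks have NO causal effect here;
    only avoiding contact lowers a person's chance of becoming a case.
    """
    rng = random.Random(seed)
    cases = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        # The confounder: is this person concerned about catching COVID?
        concerned = rng.random() < 0.5
        if randomize:
            # do(X): mask-wearing is set by a coin flip, ignoring concern
            mask = rng.random() < 0.5
        else:
            # In the wild: concerned people are far more likely to mask
            mask = rng.random() < (0.9 if concerned else 0.1)
        # Concerned people are also far more likely to avoid contact
        avoids_contact = rng.random() < (0.9 if concerned else 0.1)
        # The outcome depends only on contact, never on the mask
        infected = rng.random() < (0.02 if avoids_contact else 0.10)
        counts[mask] += 1
        cases[mask] += infected
    return cases[True] / counts[True], cases[False] / counts[False]

wild_masked, wild_unmasked = simulate(200_000, randomize=False)
trial_masked, trial_unmasked = simulate(200_000, randomize=True)
print(f"in the wild: masked {wild_masked:.3f}, unmasked {wild_unmasked:.3f}")
print(f"randomized:  masked {trial_masked:.3f}, unmasked {trial_unmasked:.3f}")
```

With these made-up probabilities, the expected case rates in the wild work out to roughly 3.4% for mask wearers and 8.6% for non-wearers, even though masks do nothing in this toy world. Under random assignment both groups should sit near 6%, so the apparent benefit disappears. That gap between the two comparisons is the confounding.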

Next week, I'll talk about Simpson's Paradox, an extreme example of statistical confounding.


  1. This sounds like what people mean when they say that correlation doesn't necessarily mean causation

