### A gentle introduction to causal diagrams Cute burrowing owl. (creative commons)

In a recent blog post, Statistical confounding: why it matters, I touched a bit on the topic of causal diagrams, and defined statistical confounding as occurring when the association between two variables is influenced by a third variable, leading (potentially) to incorrect conclusions about the causal relationship between them.

In this blog post, I'll work through an example of a simple (and totally hypothetical! but nevertheless kind of plausible!) causal diagram, and show how it can be used to select variables for multiple regressions so that causality can be inferred. Please note that this is a completely made-up example using data that I generated. I've just made up a story around COVID-19 because it is an important statistical problem to which most people can intensely relate!

Imagine that we've collected data for a huge observational study, in an attempt to determine whether wearing protective glasses affects the likelihood of catching COVID-19. Our dataset consists of many thousands of rows of data; each row represents an individual. The dataset has two columns: $G$, a boolean variable indicating whether the person wears protective glasses, and $C$, a boolean variable indicating whether the person has tested positive for covid. The causal relationship we want to test can be diagrammed like this:

where an arrow from $G$ to $C$ indicates that glasses-wearing has a causal effect on catching COVID-19.

I've generated some data representing the results of such an observational study. Because I faked the data, I know for certain that there is no direct causal impact of $G$ on $C$ in it; $C$ is generated from an expression that doesn't include $G$.  The generated data includes $G$ and $C$ and several other boolean variables, including $W$ (is the person concerned about COVID-19?) and $S$ (is the person avoiding social contact?).

The diagram below shows the true causal relationships among these 4 random variables. There is no directed arrow between $G$ and $C$, indicating that wearing protective glasses neither prevents nor causes a person to catch COVID-19. There are arrows from $W$ to both $G$ and $S$, indicating that concern about COVID-19 drives people to both wear protective glasses, and to socially distance. There is an arrow from $S$ to $C$, indicating that avoiding social contact actually does prevent catching COVID-19.

But let's suppose we don't know anything about the data. In order to investigate the question of whether glasses-wearing helps prevent COVID-19, we do a Bayesian logistic regression using the following model:

\begin{aligned} C_i &\sim\text{Bernoulli}(p) \\ \text{logit}(p) &= \alpha_0 + \alpha_{[G_i]}\\ \alpha_0 & \sim N(\mu=0, \sigma=1.5) \\ \alpha_{[G_i]} & \sim N(\mu=0, \sigma=3) \end{aligned}

where $G_i$ is either 0 or 1. The line $\text{logit}(p)=\alpha_0+\alpha_{[G_i]}$ indicates that we are modeling the probability of catching COVID-19 as a function only of $G$ (wearing protective glasses); we aren't including any other covariates.

We fit this model, and find that the differences between the values of the model's posterior parameters $\alpha_{[G==0]}$ and $\alpha_{[G==1]}$ are large, indicating that protective glasses make a difference. The graph below shows the histogram of the differences between the fitted values of $\alpha_{[G==0]}$ and $\alpha_{[G==1]}$; more than 95% of the histogram's mass lies between the red lines, to the right of 0.

In fact, the mean fitted (posterior) probability of catching COVID-19 is 0.02 for the group that does not wear protective glasses, and 0.005 for the group that does wear protective glasses. Can we conclude that wearing protective glasses reduces the risk of catching covid by a factor of about 4?

Well, no. But this set of causal relationships can nevertheless produce values of $G$ and $C$ that make it look like wearing protective glasses is highly effective for preventing COVID-19. This is because we've messed up by using a model that depends only on $G$ and $C$.

In the diagram below, the causal relationship we want to assess is represented by the gray arrow between $G$ and $C$ (these are in red, indicating that they were included in the regression model). But there is a second path in the graph from $G$ to $C$ that can generate an association between $G$ and $C$, the one from $G$ to $W$ to $S$ to $C$. Unlike the one we want to test, it is a 'back-door' path from $G$ to $C$, meaning that it starts with a causal arrow that points *into* $G$ rather than away from $G$.

Here's what's going on: the factor $W$ is driving both glasses-wearing $G$ (an ineffective intervention) and social distancing $S$ (the effective intervention). This creates an association between $G$ and $S$: if a person is wearing protective glasses, they are highly likely to also be social distancing, and vice-versa. Therefore, a person wearing protective glasses is probably also social distancing, and therefore is less likely to catch COVID-19. And so, if all you're using in your model are the variables $G$ and $C$, it looks like wearing protective glasses is effective against COVID-19.

But if you then pass a law that everyone has to wear protective glasses, it will have no effect on the COVID-19 rate, and you'll have spent a lot of political capital getting an ineffective resolution passed, and people won't listen to your advice anymore. This is a bad outcome.

How can we fix this statistical problem?

If the above causal diagram is the true one (a big if!), then we can fix it. We need to have collected not only the values of $G$ and $C$, but also those of $S$. What we are going to do is to 'block the back-door path' from $G$ to $C$ by conditioning on $S$, which (in the regression context) means we are going to include $S$ as a variable in the regression model. We write the new model as:

\begin{aligned} C_i &\sim\text{Bernoulli}(p) \\ \text{logit}(p) &= \alpha_0+\alpha_{[G_i]} +\alpha_{[S_i]}\\ \alpha_0 &\sim N(0, 1.5) \\ \alpha_{[G_i]} & \sim N(0, 3) \\ \alpha_{[S_i]} & \sim N(0, 3) \end{aligned}

where now we have added new terms to the model that depend on whether the person is social distancing. The new causal diagram model looks like the one below; in which we are conditioning on $S$.

We fit this model, but we find that the differences between the values of the model's posterior parameters $\alpha_{[G==0]}$ and $\alpha_{[G==1]}$ are still large. The graph below shows the histogram of the differences between the fitted values of $\alpha_{[G==0]}$ and $\alpha_{[G==1]}$ for this second model; once again, more than 95% of the histogram's mass lies to the right of 0. If we were convinced that our previous causal diagram was correct, then we would again conclude erroneously that protective glasses help prevent COVID-19.

The problem this time (and this is the last problem, I promise) is that we've omitted an important variable from the causal diagram: $V$, whether the person is vaccinated. The (real!) true causal diagram that generated the data, including $V$, is shown below.

Adding $V$ adds some new and interesting connections to the causal diagram. There is an arrow from $W$ to $V$, because if a person is concerned about COVID-19, they're more likely to get the vaccine. There is an arrow from $V$ to $S$, because if a person is vaccinated, they're likely to be less careful about social distancing. And clearly, whether a person is vaccinated directly impacts their risk of catching COVID-19.

Because of $V$, our new, true causal diagram still has an unblocked back-door path in it from $G$ to $C$: the one from $G$ through $W$, to $V$ and then to $C$. Also because of $V$, the back-door path from $G$ to $W$ through $S$ to $C$ that we thought was blocked is actually  unblocked. These unblocked back-door paths from $G$ to $C$ are still producing confounding that makes it look as though wearing protective glasses helps with COVID-19.

How can we fix the problem with $V$? Well, analyzing the causal diagram shows that including $V$ in the model with $G$ and $S$ would block all of the back-door paths from $G$ to $C$. But what if we don't have $V$ in the data we collected, because we never thought to collect it?

In some situations, we might be unable to fix confounding. Unobserved variables like $V$ are often present in statistical studies, and you may not even suspect they are there, but they can still cause confounding. The best we can do in statistical analyses of causality is to try to collect all the variables that might influence the problem, and think about possible causal diagrams for the variables.

In this example, even if we didn't collect vaccination information, we can still fix the problem by conditioning on $W$ instead of $S$, as shown in the diagram below. Since all of the backdoor paths from $G$ to $C$ lead through the variable $W$, conditioning on $W$ blocks them all at once. So, in order to get an unconfounded model, the only information we need to add to the model is whether the person is Concerned About COVID-19.

Our final and good model would look like this:

\begin{aligned} C_i &\sim\text{Bernoulli}(p) \\ \text{logit}(p) &= \alpha_0+\alpha_{[G_i]} +\alpha_{[W_i]}\\ \alpha_0 &\sim N(0, 1.5) \\ \alpha_{[G_i]} & \sim N(0, 3) \\ \alpha_{[W_i]} & \sim N(0, 3) \end{aligned}

After fitting this model, we find that the histogram of the differences between $\alpha_{[G==0]}$ and $\alpha_{[G==1]}$ for this final model straddles 0 as shown below, indicating that the variable $G$ is not significant for modeling the rate at which people catch COVID-19.

In my next post, I'll explain how you can analyze causal diagrams yourself, find back-door paths, and block them by conditioning on specific variables (and not conditioning on others!), in order to prevent statistical confounding in statistical analyses.

---

Here's the entire 'statistical confounding' series:

- Part 1: Statistical confounding: why it matters: on the many ways that confounding affects statistical analyses

- Part 2: Simpson's Paradox: extreme statistical confounding: understanding how statistical confounding can cause you to draw exactly the wrong conclusion

- Part 3: Linear regression is trickier than you think: a discussion of multivariate linear regression models

- Part 4: A gentle introduction to causal diagrams (this post): a causal analysis of fake data relating COVID-19 incidence to wearing protective goggles

- Part 5: How to eliminate confounding in multivariate regression: how to do a causal analysis to eliminate confounding in your regression analyses

-Part 6: A simple example of omitted variable bias: an example of statistical confounding that can't be fixed, using only 4 variables. 