### Preamble

In my next few blog posts, I want to talk about some tricks for controlling statistical confounding in the context of multivariate linear regression, which is about the simplest kind of model that can be used to relate more than two variables. Although I've taken a full load of statistics classes, including an entire course on multivariate linear regression alone, I never learned how to choose the right variables to include for a desired analysis until I came across the topic in Richard McElreath's book *Statistical Rethinking*.

In short, it's likely to be something that most machine learning and data science practitioners wouldn't ordinarily pick up in a class on regression, and it's useful and kind of fun.

Controlling confounding requires drawing hypothetical diagrams of how your variables might relate causally to each other, doing some checks to determine whether the data conflict with the hypotheses, and then using the diagrams to derive sets of variables to exclude and include. It's a nice interplay between high level thinking about causality, and mechanical variable selection.

This week's post is an introduction where I'll set the stage a bit.

### Multivariate linear regression

Multivariate linear regression is the first type of frequentist model you encounter as a statistician. It is used to relate an outcome variable $Y$ in a data set to any number of covariates $X_i$ that accompany it. For example, the height that a tree grows this year, $H$, might be associated with several continuous covariates, such as the number of hours of sunlight it receives per day, $S$; the amount of water it receives per day, $W$; and the iron content of the soil around it, $I$. These variables may in turn be associated with each other; for example, if the tree is not artificially watered, then $S$ and $W$ may be negatively correlated, since the sun doesn't usually shine when it's raining.
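As a quick illustration of correlated covariates (using entirely made-up numbers, not real tree data), we can simulate negatively correlated sunlight and water values with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution of daily sunlight hours (S) and
# daily water (W): negatively correlated, since sun and rain rarely coincide.
mean = [8.0, 2.0]        # assumed mean sunlight hours and liters of water
cov = [[4.0, -1.2],      # Var(S),    Cov(S, W)
       [-1.2, 1.0]]      # Cov(W, S), Var(W)

S, W = rng.multivariate_normal(mean, cov, size=1000).T

# Sample correlation is near the population value -1.2 / sqrt(4 * 1) = -0.6.
print(np.corrcoef(S, W)[0, 1])
```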

The model specification below is for a Bayesian linear regression model with $n$ covariates, and no higher-order terms. The distribution of $Y$ is normal, with a mean that linearly depends on the covariates $X_i$, and a variance parameter. All the parameters have priors, which the model specifies. Models like this are usually fitted using methods that sample the posterior distribution of the parameters given the observed data. The results of Bayesian model fitting are usually very similar to frequentist model fitting results when there is sufficient data for analysis.

$$
\begin{aligned}
Y &\sim N(\mu, \sigma^2) \\
\mu &= \alpha + \beta_1 X_1 + \dots + \beta_n X_n \\
\alpha &\sim N(1, 0.5) \\
\beta_j &\sim N(0, 0.2) \quad \text{for } j = 1, \dots, n \\
\sigma &\sim \text{Exponential}(1)
\end{aligned}
$$
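To make the specification concrete, here is a small numpy sketch (my own illustration, assuming $n = 2$ covariates and 200 observations) that draws one set of parameters from the priors above and then simulates an outcome vector from the likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_cov = 200, 2    # assumed sizes, purely for illustration

# Draw one set of parameters from the priors in the specification.
alpha = rng.normal(1.0, 0.5)
beta = rng.normal(0.0, 0.2, size=n_cov)   # beta_1, ..., beta_n
sigma = rng.exponential(1.0)

# Simulate covariates (standardized, as the tight slope priors suggest).
X = rng.normal(0.0, 1.0, size=(n_obs, n_cov))

# Likelihood: Y ~ Normal(mu, sigma^2), with mu linear in the covariates.
mu = alpha + X @ beta
Y = rng.normal(mu, sigma)
print(Y.shape)  # one simulated outcome per observation
```

An actual Bayesian fit would run this generative story in reverse, sampling the posterior of $\alpha$, $\beta_j$, and $\sigma$ given observed data.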

The fact that the scale of the modeled parameter $\mu$ is the same as that of $Y$, and the absence of higher-order terms (such as $X_1X_3$), make it easy to interpret the meaning of each slope parameter: $\beta_j$ is the expected change in the value of the outcome variable when the covariate $X_j$ changes by one unit. The assumption that this expected change is always the same, independent of the values of the $n$ covariates, is built right into this model.
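The one-unit interpretation is easy to verify numerically. Below is a hypothetical example (fitted by ordinary least squares with numpy, rather than the Bayesian machinery above) showing that shifting one covariate by one unit changes the model's prediction by exactly the fitted slope:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated data with known slopes 0.8 and -0.5 (assumed for illustration).
X = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, size=n)

# Least-squares fit: design matrix with an intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha_hat, b1_hat, b2_hat = coef

# Predictions at X_1 = 0 and X_1 = 1 (X_2 held at 0) differ by b1_hat.
x0 = np.array([1.0, 0.0, 0.0])
x1 = np.array([1.0, 1.0, 0.0])
print(x1 @ coef - x0 @ coef)   # equals b1_hat, close to the true 0.8
```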

This is about as simple a statistical model as you can have for data sets with many variables. But when I was studying multivariate regression, the covariates used for modeling were often chosen without much explanation. Sometimes we would use all the variables available, and sometimes only a subset of them. It wasn't until later that I learned how to choose which variables to include in a multivariate regression model. The choice depends on what you're trying to study, and on the causal relationships among all the variables.

And, of course, you don't know the causal relationships among the variables -- often, this is what you're trying to figure out by doing linear regression -- so you need to consider several possible diagrams of causal relationships.

The ultimate goal is to get statistical models that clearly answer your questions, and don't 'lie'. Actually, statistical models never lie, but they can mislead. Statistical confounding occurs when the apparent relationship between the value of a covariate $X$ and the outcome variable $Y$, as measured by a model, differs from the true causal effect of $X$ on $Y$. The effects of confounding can be so extreme that they result in Simpson's Paradox reversals, where the apparent association between variables is the opposite of the causal association.
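Simpson's Paradox reversals are easy to produce in simulation. Here is a toy example (made-up numbers) in which an unmodeled group variable shifts both $X$ and $Y$ upward: within each group the causal slope of $Y$ on $X$ is $-1$, yet the pooled regression estimates a positive slope:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500  # observations per group

# The group shifts both X and Y upward, but within each group
# the true causal slope of Y on X is -1.
slopes, xs, ys = [], [], []
for shift in (0.0, 5.0):
    x = rng.normal(shift, 1.0, size=n)
    y = 2.0 * shift - 1.0 * x + rng.normal(0, 1.0, size=n)
    slopes.append(np.polyfit(x, y, 1)[0])   # within-group slope, near -1
    xs.append(x)
    ys.append(y)

x_all, y_all = np.concatenate(xs), np.concatenate(ys)
overall = np.polyfit(x_all, y_all, 1)[0]    # pooled slope: positive!
print(slopes, overall)
```

Ignoring the group variable here produces an association whose sign is the opposite of the causal effect.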

It takes some know-how to eliminate confounding. Sometimes you have to be sure to include a variable in a multivariate regression in order to get an unconfounded model; sometimes including a variable will *cause* confounding.

Sometimes, nothing you can do will prevent confounding, because of an unobserved variable. But here is what you can do:

1. You can hypothesize one or more causal diagrams that relate the variables under study. You can consider some that include variables you may not have measured, in order to anticipate problems.

2. You might be able to discard some of these hypotheses, if the conditional independence relationships they imply between the variables aren't supported by the data.

3. You can learn how to choose what variables to include and exclude, on the basis of the remaining hypothetical causal diagrams, to get multivariate regressions that aren't confounded.

4. You can also determine when confounding can't be prevented, because you would need to include a variable that isn't available.
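To preview step 2, here is a toy check on a hypothetical causal chain $X \to Z \to Y$ (simulated with numpy; the diagram and coefficients are my own example). The chain implies that $X$ and $Y$ are independent conditional on $Z$: the two are clearly correlated marginally, but the correlation of their residuals after regressing out $Z$ is near zero:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Hypothetical chain X -> Z -> Y, which implies X is independent of Y given Z.
x = rng.normal(size=n)
z = 0.9 * x + rng.normal(0, 0.5, size=n)
y = 0.9 * z + rng.normal(0, 0.5, size=n)

def residuals(target, given):
    """Residuals of target after regressing out given (plus an intercept)."""
    A = np.column_stack([np.ones(len(given)), given])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ coef

marginal = np.corrcoef(x, y)[0, 1]                              # clearly nonzero
partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]   # near zero
print(marginal, partial)
```

If the partial correlation had come out far from zero, that would be evidence against this particular diagram.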

In succeeding posts, I'll show you how to go about doing this yourself.