# Remaining Topics

This course is meant to be a thin and quick introduction to the intersection of causal inference and machine learning, so it is neither intended nor desirable to cover all topics in causal inference exhaustively. In this final chapter, I discuss a few topics that I did not find necessary to include in the main course but that could be useful to students.

## Other Techniques in Causal Inference

Several observational causal inference techniques are widely used in practice beyond those covered in the main chapters. Difference-in-difference and regression discontinuity design are among the most heavily used, but they apply to relatively specialized cases, which is why this course has omitted them so far. In this section, we briefly cover these two approaches for the sake of completeness. We then wrap up with a high-level intuition behind double machine learning, a more recently proposed and popularized technique.

### Difference-in-Difference

The average treatment effect (ATE) from Causal Quantities of Interest measures the difference between the outcomes of two groups, treated and not treated. More precisely, it measures the difference between the outcome of the treated group and the expected outcome over all possible actions. One way to interpret this is to view ATE as checking what would have happened to a treated individual had the individual not been treated, on average. First, we can compute what happens to an individual once they are treated, on average, as

[] \begin{align} y^1_{\mathrm{diff}} = \mathbb{E}_x \mathbb{E}_a \mathbb{E}_{y_{\mathrm{pre}},y_{\mathrm{post}}} \left[ \mathds{1}(a = 1) (y_{\mathrm{post}} - y_{\mathrm{pre}}) \right], \end{align} []

where $y_{\mathrm{pre}}$ and $y_{\mathrm{post}}$ are the outcomes before and after the treatment ($a=1$). We can similarly compute what happens to the individual had they not been treated, on average, as well by

[] \begin{align} y^0_{\mathrm{diff}} = \mathbb{E}_x \mathbb{E}_a \mathbb{E}_{y_{\mathrm{pre}},y_{\mathrm{post}}} \left[ \mathds{1}(a = 0) (y_{\mathrm{post}} - y_{\mathrm{pre}}) \right]. \end{align} []

We now check the difference between these two quantities:

[] \begin{align} \label{eq:diff-in-diff} y^1_{\mathrm{diff}} - y^0_{\mathrm{diff}} = \mathbb{E}_x \mathbb{E}_a \left[ \right. & \mathbb{E}_{y_{\mathrm{post}}} \left[ \mathds{1}(a = 1) y_{\mathrm{post}} - \mathds{1}(a = 0) y_{\mathrm{post}} \right] \\ - & \left. \mathbb{E}_{y_{\mathrm{pre}}} \left[ \mathds{1}(a = 1) y_{\mathrm{pre}} - \mathds{1}(a = 0) y_{\mathrm{pre}} \right] \right]. \end{align} []

If we used an RCT from Randomized Controlled Trials to assign the action uniformly and independently of the covariate $x$, the second term, that is the difference in the pre-treatment outcome, should disappear, since the treatment had not yet been given to the treatment group. This leaves only the first term, which is precisely how we would compute the outcome from an RCT.

In an observational study, that is passive causal inference, we often do not have control over how the participants were split into treatment and placebo groups. This often leads to a discrepancy in the base outcome between the treated and placebo groups. In that case, the second term above does not vanish but instead removes this baseline effect.

Consider measuring the effect of a vitamin supplement on the height of school-attending girls of age 10. Let us assume that this particular vitamin supplement is provided to school children by default in the Netherlands from age 10 but not in North Korea. We may be tempted to simply measure the average heights of school-attending girls of age 10 in these two countries and conclude whether this supplement helps school children grow taller. This however would not be a reasonable way to draw the conclusion, since the average heights of girls of age 9, right before the vitamin supplement begins to be provided in the Netherlands, differ quite significantly between the two countries (146.55cm vs. 140.58cm). We would rather look at how much taller these children grew between the ages of 9 and 10.

Because we consider the difference of the differences in Eq.\eqref{eq:diff-in-diff}, we call this estimator difference-in-difference. This approach is widely used and is one of the most successful cases of passive causal inference, dating back to the 19th century[1]. In the context of what we have learned in this course, let us write a structural causal model that admits this difference-in-difference estimator:

[] \begin{align} &x \leftarrow \epsilon_x \\ &a \leftarrow \mathds{1}(x + \epsilon_a \gt 0) \\ &y \leftarrow \mathds{1}(x \gt 0) y_0 + \alpha a + \epsilon_y. \end{align} []

With zero-mean and symmetric $\epsilon_x$ and $\epsilon_a$, those with positive $x$ are more likely to be assigned to $a=1$. Due to the first term in $y$, the outcome has a constant bias $y_0$ when $x$ is positive. In other words, those who are likely to be given the treatment have $y_0$ added to the outcome regardless of the treatment ($a=1$) itself, since $+y_0$ does not depend on $a$. The difference-in-difference estimator removes the effect of $y_0$ when estimating $\alpha$, which is the direct causal effect of $a$ on $y$. This tells us when the difference-in-difference estimator works and how we can extend it further. For instance, it is not necessary to assume linearity between $a$ and $y$. I leave this to you as an exercise.
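The following is a minimal simulation sketch of this structural causal model, not a definitive implementation. It rests on several assumptions that are not in the text above: the pre-treatment outcome is taken to be the same model with $a$ set to zero, the noise variables are standard normal, the values `alpha=1.5` and `y0=3.0` are arbitrary, and the two groups are compared by their group means (that is, normalized by group size).

```python
import numpy as np

def simulate(n=200_000, alpha=1.5, y0=3.0, seed=0):
    """Sample from the SCM above, plus an assumed pre-treatment outcome."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                           # x <- eps_x
    a = (x + rng.normal(size=n) > 0).astype(float)   # a <- 1(x + eps_a > 0)
    baseline = (x > 0) * y0                          # bias that does not depend on a
    y_pre = baseline + rng.normal(size=n)            # assumption: nobody treated yet
    y_post = baseline + alpha * a + rng.normal(size=n)
    return a, y_pre, y_post

def naive_difference(a, y_post):
    """Compare post-treatment outcomes only; contaminated by the baseline y0."""
    return y_post[a == 1].mean() - y_post[a == 0].mean()

def diff_in_diff(a, y_pre, y_post):
    """Difference between the groups of the pre-to-post change in outcome."""
    change = y_post - y_pre
    return change[a == 1].mean() - change[a == 0].mean()

a, y_pre, y_post = simulate()
naive_estimate = naive_difference(a, y_post)
did_estimate = diff_in_diff(a, y_pre, y_post)
print(f"naive: {naive_estimate:.3f}, difference-in-difference: {did_estimate:.3f}")
```

In this toy setup the naive comparison absorbs the baseline gap between the groups, while the difference-in-difference estimate should land close to the `alpha` used to generate the data.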

### Regression Discontinuity

Another popular technique for passive causal inference is called regression discontinuity[2] (and references therein). Regression discontinuity assumes that there exists a simple rule, based on the covariate $x$, that determines whether an individual is assigned to the treated or the placebo group. This rule can be written down as

[] \begin{align} \label{eq:rd-rule} a = \begin{cases} 1, &\text{ if } x_d \geq c_0 \\ 0, &\text{ otherwise}. \end{cases} \end{align} []

If the $d$-th covariate reaches or crosses the threshold $c_0$, the individual is assigned to $a=1$. We further assume that the outcome given a particular action is a smooth function of the covariate. That is, the outcome of a particular action, $f(\hat{a}, x)$, changes smoothly, especially around the threshold $c_0$. In other words, had it not been for the assignment rule above, $\lim_{x_d \to c_0^-} f(\hat{a}, x) = \lim_{x_d \to c_0^+} f(\hat{a}, x)$. There is no discontinuity of $f(\hat{a}, x)$ at $x_d=c_0$, and we can fit a smooth predictor that extrapolates well to approximate $f(\hat{a}, x)$ (or $\mathbb{E}_{x_{d' \neq d}} f(\hat{a}, x_{d'}\cup x_d=c_0)$).

If we assume that the threshold $c_0$ was chosen arbitrarily, that is independently of the values of $x_{\neq d}$, it follows that the distributions over $x_{\neq d}$ before and after $c_0$ remain the same, at least locally.[Notes 1] This means that the assignment of an action $a$ and the covariates other than $x_d$ are independent locally, i.e., for $|x_d - c_0| \leq \epsilon$, where $\epsilon$ defines the radius of the local neighbourhood centered on $c_0$. Thanks to this independence, which is the key difference between the conditional and interventional distributions, as we have seen repeatedly earlier, we can now compute the average treatment effect locally (hence it is often called a local average treatment effect) as

[] \begin{align} \mathrm{LATE} =& \mathbb{E}_{x} \left[ \mathds{1}(|x_d - c_0| \leq \epsilon) \left( f(1, x) - f(0, x) \right) \right] \\ =& \mathbb{E}_{x: |x_d - c_0| \leq \epsilon} \left[ f(1,x) \right] - \mathbb{E}_{x: |x_d - c_0| \leq \epsilon} \left[ f(0,x) \right]. \end{align} []

Of course, our assumption here is that we do not observe $x_{\neq d}$. Even worse, we never observe $f(1,x)$ when $x_d \lt c_0$ and $f(0,x)$ when $x_d \gt c_0$. Instead, we can fit a non-parametric regression model $\hat{f}(\hat{a}, x_d)$ to approximate $\mathbb{E}_{x_{\neq d} | x_d} f(\hat{a},x)$ and expect (or hope?) that it would extrapolate either before or after the threshold $c_0$. Then, LATE becomes

[] \begin{align} \mathrm{LATE} &= \int_{c_0-\epsilon}^{c_0+\epsilon} \hat{f}(1,x_d) - \hat{f}(0,x_d) \mathrm{d}x_d \\ &=_{\epsilon \to 0} \hat{f}(1, c_0) - \hat{f}(0, c_0), \end{align} []

thanks to the smoothness assumption of $f$. The final line above tells us pretty plainly why this approach is called regression discontinuity design. We literally fit two regression models on the treated and placebo groups and look at their discrepancy at the decision threshold. The amount of the discrepancy implies the change in the outcome due to the change in the action, of course under the strong set of assumptions we have discussed so far.
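As a rough illustration of fitting two regression models and reading off their gap at the threshold, here is a small sketch on synthetic data. Everything specific in it is an assumption made for illustration: the data-generating process, the threshold `c0=0.5`, the true jump `effect=2.0`, the bandwidth, and the use of simple local linear fits in place of a general non-parametric regressor.

```python
import numpy as np

def simulate(n=20_000, c0=0.5, effect=2.0, seed=0):
    """Synthetic data: smooth outcome in x_d, plus a jump of `effect` at c0."""
    rng = np.random.default_rng(seed)
    x_d = rng.uniform(-1.0, 2.0, size=n)
    a = (x_d >= c0).astype(float)                 # assignment rule from the text
    y = 1.0 + 0.8 * x_d + effect * a + rng.normal(scale=0.5, size=n)
    return x_d, a, y

def rd_estimate(x_d, a, y, c0, bandwidth):
    """Fit one line per group near c0 and compare their predictions at c0."""
    near = np.abs(x_d - c0) <= bandwidth
    treated = near & (a == 1)
    control = near & (a == 0)
    fit_1 = np.polyfit(x_d[treated], y[treated], deg=1)   # approximates f(1, x_d)
    fit_0 = np.polyfit(x_d[control], y[control], deg=1)   # approximates f(0, x_d)
    return float(np.polyval(fit_1, c0) - np.polyval(fit_0, c0))

x_d, a, y = simulate()
late_estimate = rd_estimate(x_d, a, y, c0=0.5, bandwidth=0.5)
print(f"estimated discontinuity at c0: {late_estimate:.3f}")
```

The bandwidth plays the role of $\epsilon$: a smaller value relies less on extrapolation but leaves fewer points for each fit.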

### Double Machine Learning

Recent advances in machine learning have opened the door to training large-scale non-parametric methods on high-dimensional data. This allows us to expand some of the more conventional approaches. One such example is double machine learning[3]. We briefly describe one particular instantiation of double machine learning here. Recall the instrumental variable approach from Instrumental Variables: When Confounders were not Collected. The basic idea was to notice that the action $a$ was determined using two independent sources of information, the confounder $x$ and the external noise $\epsilon_a$:

[] \begin{align} a \leftarrow f_a(x, \epsilon_a), \end{align} []

with $x \indep \epsilon_a$. We then introduced an instrument $z$ that is a subset of $\epsilon_a$, such that $z$ is predictive of $a$ but continues to be independent of $x$. From $z$, using regression, we capture a part of the variation in $a$ that is independent of $x$, in order to sever the edge from the confounder $x$ to the outcome $y$. Then, we use this instrument-predicted action $a'$ to predict the outcome $y$. We can instead think of fitting a regression model $g_a$ from $x$ to $a$ and using the residual $a_{\bot} = a - g_a(x)$ as the component of $a$ that is independent of $x$, because the residual is not predictable from $x$. The same procedure can now be applied to the outcome, which is written down as

[] \begin{align} y \leftarrow f_y(a_{\bot}, x, \epsilon_y). \end{align} []

Because $x$ and $a_{\bot}$ are independent, we can estimate the portion of $y$ that is predictable from $x$ by building a predictor $g_y$ of $y$ given $x$. The residual $y_{\bot} = y - g_y(x)$ is then what cannot be predicted by $x$, neither directly nor via $a$. We are in fact relying on the fact that such a non-parametric predictor would capture both causal and spurious correlations indiscriminately. $a_{\bot}$ is the part of $a$ that is independent of the confounder $x$, and $y_{\bot}$ is the part of $y$ that is independent of the confounder $x$. The relationship between $a_{\bot}$ and $y_{\bot}$ must then be the direct causal effect of the action on the outcome. In other words, we have removed the effect of $x$ on $a$ to close the backdoor path, resulting in $a_{\bot}$. We have removed the effect of $x$ on $y$ to reduce non-causal noise, resulting in $y_{\bot}$. What remains is the direct effect of $a$ on the outcome $y$. We therefore fit another regression from $a_{\bot}$ to $y_{\bot}$, in order to capture this remaining correlation, which is equivalent to the direct causal effect of $a$ on $y$.
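The residual-on-residual recipe above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the full method: the predictors $g_a$ and $g_y$ are plain polynomial least-squares fits (the helper `fit_predict` is hypothetical), the data-generating process and `alpha=2.0` are made up, the effect of $a$ on $y$ is assumed linear, and the sample splitting typically used in double machine learning is omitted.

```python
import numpy as np

def fit_predict(x, target, degree=5):
    """Hypothetical stand-in for g(x): polynomial least squares, fitted values."""
    features = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return features @ coef

def simulate(n=50_000, alpha=2.0, seed=0):
    """Toy confounded data: x drives both the action a and the outcome y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    a = np.sin(x) + 0.5 * x + rng.normal(size=n)
    y = alpha * a + 2.0 * x + x**2 + rng.normal(size=n)
    return x, a, y

x, a, y = simulate()
a_perp = a - fit_predict(x, a)      # part of a not predictable from x
y_perp = y - fit_predict(x, y)      # part of y not predictable from x
alpha_hat = float((a_perp @ y_perp) / (a_perp @ a_perp))   # regress y_perp on a_perp
naive_hat = float(np.cov(a, y)[0, 1] / np.var(a, ddof=1))  # regress y on a directly
print(f"naive slope: {naive_hat:.3f}, residual-on-residual slope: {alpha_hat:.3f}")
```

In this toy example the naive slope is inflated by the backdoor path through $x$, while the slope between the two residuals should be close to the `alpha` used in the simulation.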

## Behaviour Cloning from Multiple Expert Policies Requires a World Model

A Markov decision process (MDP) is often described as a tuple of the following items:

• $\mathcal{S}$: a set of all possible states
• $\mathcal{A}$: a set of all possible actions
• $\tau: \mathcal{S} \times \mathcal{A} \times \mathcal{E} \to \mathcal{S}$: a transition dynamics. $s' = \tau(s, a, \epsilon)$.
• $\rho: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$: a reward function. $r = \rho(s, a, s')$.

The transition dynamics $\tau$ is a deterministic function but takes noise $\epsilon \in \mathcal{E}$ as input, which overall makes it stochastic. We use $p_\tau(s' | s, a)$ to denote the conditional distribution over the next state given the current state and action, obtained by marginalizing out the noise $\epsilon$. The reward function $\rho$ depends on the current state, the action taken and the next state. It is however often the case that the reward function only depends on the next (resulting) state. A major goal is then to find a policy $p_\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{\gt 0}$ that maximizes

[] \begin{align} \label{eq:return} J(\pi) =& \sum_{s_0} p_0(s_0) \sum_{a_0} p_\pi(s_0, a_0) \sum_{s_1} p_{\tau}(s_1 | s_0, a_0) \left(\gamma^0 \rho(s_0, a_0, s_1) \right. \nonumber \\ & \qquad \quad + \left. \sum_{a_1} p_\pi(s_1, a_1) \sum_{s_2} p_{\tau}(s_2 | s_1, a_1) \left( \gamma^1 \rho(s_1, a_1, s_2) + \cdots \right) \right) \\ =& \mathbb{E}_{s_0 \sim p_0(s_0)} \mathbb{E}_{a_0, s_1 \sim p_\pi(a_0|s_0) p_{\tau}(s_1|s_0,a_0)} \nonumber \\ & \qquad \qquad \quad \mathbb{E}_{a_1, s_2 \sim p_\pi(a_1|s_1) p_{\tau}(s_2|s_1,a_1)} \cdots \left[ \sum_{t=0}^\infty \gamma^t \rho(s_t, a_t, s_{t+1}) \right] \\ =& \mathbb{E}_{p_0, p_\pi, p_\tau} \left[ \sum_{t=0}^\infty \gamma^t \rho(s_t, a_t, s_{t+1}) \right] , \end{align} []

where $p_0(s_0)$ is the distribution over the initial state and $\gamma \in (0, 1]$ is a discounting factor. The discounting factor can be viewed from two angles. First, we can view it conceptually as a way to express how much we care about future rewards. With a large $\gamma$, our policy can sacrifice rewards at earlier time steps in return for higher rewards in the future. The other way to think of the discounting factor is purely computational. With $\gamma \lt 1$, we can prevent the total return $J(\pi)$ from diverging to infinity, even when the length of each episode is not bounded. As we learned earlier when we saw the equivalence between the probabilistic graphical model and the structural causal model in Probabilistic Graphical Models--Structural Causal Models, we can guess the form of $\pi$ as a deterministic function:

[] \begin{align} a \leftarrow \pi(s, \epsilon_\pi). \end{align} []

Together with the transition dynamics $\tau$ and the reward function $\rho$, we notice that the Markov decision process can be thought of as defining a structural causal model for each time step $t$ as follows:

[] \begin{align} &s\text{ is given.} \\ &a \leftarrow \pi(s, \epsilon_\pi) \\ &s' \leftarrow \tau(s, a, \epsilon_{s'}) \\ &r \leftarrow \rho(s', \epsilon_{r}), \end{align} []

where we make a simplifying assumption that the reward only depends on the landing state. Graphically, this corresponds to the edges $s \to a$, $s \to s'$, $a \to s'$ and $s' \to r$.
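To make the per-step structural causal model concrete, the sketch below rolls it out over time and forms a Monte Carlo estimate of the discounted return. The particular `policy`, `transition` and `reward` functions, the integer state space, and the finite horizon used to truncate the infinite sum are all hypothetical choices made only for illustration.

```python
import numpy as np

def policy(s, eps_pi):
    """Hypothetical a <- pi(s, eps_pi): step toward 0, flipped when eps_pi < 0.1."""
    preferred = -1 if s > 0 else 1
    return preferred if eps_pi >= 0.1 else -preferred

def transition(s, a, eps_s):
    """Hypothetical s' <- tau(s, a, eps_s): deterministic given the noise."""
    return s + a + eps_s

def reward(s_next, eps_r):
    """Hypothetical r <- rho(s', eps_r): landing closer to 0 is better."""
    return -abs(s_next) + eps_r

def discounted_sum(rewards, gamma):
    """sum_t gamma^t r_t for one (truncated) episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def rollout(s0, horizon, rng):
    """Apply the three assignments of the per-step SCM repeatedly."""
    s, rewards = s0, []
    for _ in range(horizon):
        a = policy(s, rng.uniform())
        s = transition(s, a, int(rng.integers(-1, 2)))
        rewards.append(reward(s, 0.1 * rng.normal()))
    return rewards

rng = np.random.default_rng(0)
returns = [discounted_sum(rollout(5, 50, rng), gamma=0.9) for _ in range(200)]
estimated_return = float(np.mean(returns))
print(f"Monte Carlo estimate of the discounted return: {estimated_return:.3f}")
```

All randomness enters through the noise arguments, so each function stays deterministic, mirroring how $\tau$ and $\pi$ are written above.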

Behaviour cloning. With this in mind, let us consider the problem of so-called ‘behaviour cloning’. In behaviour cloning, we assume the existence of an expert policy $\pi^*$ that results in a high return $J(\pi^*)$ from Eq.\eqref{eq:return} and that we have access to a large amount of data collected from the expert policy. We often do not observe the associated reward directly, so this dataset consists of tuples of the current state $s$, the action $a$ taken by the expert policy and the next state $s'$:

[] \begin{align} D = \left\{ (s_n, a_n, s'_n) \right\}_{n=1}^N, \end{align} []

where $a_n \sim p_{\pi^*}(a | s_n)$ and $s'_n \sim p_{\tau}(s' | s_n, a_n)$. Behaviour cloning refers to training a policy $\pi$ that imitates the expert policy $\pi^*$ using this dataset. We often train a new policy $\pi$ by maximizing

[] \begin{align} \label{eq:bc-loss} J_{\mathrm{bc}}(\pi) = \sum_{n=1}^N \log p_{\pi}(a_n | s_n). \end{align} []

In other words, we ensure that the learned policy $\pi$ puts a high probability on the action that was taken by the expert policy $\pi^*$.
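For a small discrete problem, this objective can be sketched with a tabular policy, for which the maximizer of the log-likelihood is simply the normalized action counts per state. The expert table, the state and action counts, and the tiny `smoothing` constant (added only to avoid taking the log of zero) are assumptions made for this illustration.

```python
import numpy as np

def fit_tabular_policy(states, actions, n_states, n_actions, smoothing=1e-3):
    """Normalized (lightly smoothed) action counts per state."""
    counts = np.full((n_states, n_actions), smoothing)
    np.add.at(counts, (states, actions), 1.0)
    return counts / counts.sum(axis=1, keepdims=True)

def bc_objective(policy, states, actions):
    """Sum over the dataset of log p_pi(a_n | s_n)."""
    return float(np.log(policy[states, actions]).sum())

# Toy dataset from a hypothetical expert with 3 states and 2 actions.
rng = np.random.default_rng(0)
expert = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
states = rng.integers(0, 3, size=20_000)
# Action 0 is drawn with probability expert[s, 0], otherwise action 1.
actions = (rng.uniform(size=states.size) > expert[states, 0]).astype(int)

learned = fit_tabular_policy(states, actions, n_states=3, n_actions=2)
uniform = np.full((3, 2), 0.5)
print("learned policy:\n", learned.round(3))
print("objective (learned):", bc_objective(learned, states, actions))
print("objective (uniform):", bc_objective(uniform, states, actions))
```

With a neural network policy the same objective would be maximized by gradient ascent instead of counting, but the quantity being optimized is unchanged.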

Behaviour cloning with multiple experts. Often, however, the data were collected not by a single expert policy but by a set of expert-like policies. Furthermore, we often do not know which of these expert-like policies produced each tuple $(s_n, a_n, s'_n)$. This requires us to treat the policy used to collect these tuples as an unobserved random variable $\tilde{\pi}$, resulting in a graphical model in which $\tilde{\pi}$ points to every action, e.g., $\tilde{\pi} \to a_{t-1}$ and $\tilde{\pi} \to a_t$.[Notes 2]

The inclusion of an unobserved $\tilde{\pi}$ makes the original behaviour cloning objective in Eq.\eqref{eq:bc-loss} less than ideal. In the original graph, because we sampled both $s$ and $a$ without conditioning on $s'$, there was only one open path between $s$ and $a$, that is, $s\to a$. We could thereby simply train a policy to capture the correlation between $s$ and $a$, and the learned policy would capture $p(a | \mathrm{do}(s))$.

With the unobserved variable $\tilde{\pi}$, this does not hold anymore. Consider $(s_t,a_t)$. There are two open paths between these two variables. The first one is the original direct path, $s_t \to a_t$. There is however now a second path, $s_t \leftarrow a_{t-1} \leftarrow \tilde{\pi} \rightarrow a_t$. If we naïvely train a policy $\pi$ on this dataset, this policy would learn to capture the correlation between the current state and the associated action arising from both of these paths. This is not desirable as the second path is not causal, as we discussed earlier in Confounders, Colliders and Mediators. In other words, $\pi(a | s)$ would not correspond to $p(a | \mathrm{do}(s))$.

In order to block this backdoor path, we can use the idea of inverse probability weighting (IPW; Inverse Probability Weighting). If we assume we have access to the transition model $\tau$, we can use it to sever the two direct connections into $s_t$, namely $s_{t-1} \to s_t$ and $a_{t-1} \to s_t$, by

[] \begin{align} \mathbb{E}_{a_t \sim p_{\pi}(a_t | \mathrm{do}(s_t))}[a_t] = \mathbb{E}_{s_t} \left[ \frac{p_{\pi}(a_t | s_t)}{ p_{\tau}(s_t | s_{t-1}, a_{t-1}) } a_t \right]. \end{align} []

Learned transition: a world model. Of course, we often do not have access to $\tau$ directly and must infer the transition dynamics from data. Fortunately, unlike the policy $s \to a$, the transition $(s, a) \to s'$ is not confounded by $\tilde{\pi}$. We can therefore learn an approximate transition model, which is sometimes referred to as a world model[4] (and references therein), from data. This can be done by

[] \begin{align} \hat{\tau} = \arg\max_{\tau} \sum_{n=1}^N \log p_{\tau} (s'_n | s_n, a_n). \end{align} []
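In a small discrete setting, this maximum-likelihood problem has a simple count-based solution, sketched below. The tabular parameterization, the randomly drawn ground-truth dynamics `true_tau`, the uniformly sampled states and actions, and the `smoothing` constant are all assumptions for illustration; with high-dimensional states one would instead fit a parametric model by gradient ascent on the same log-likelihood.

```python
import numpy as np

def fit_tabular_world_model(s, a, s_next, n_states, n_actions, smoothing=1e-3):
    """Normalized (lightly smoothed) next-state counts for every (s, a) pair."""
    counts = np.full((n_states, n_actions, n_states), smoothing)
    np.add.at(counts, (s, a, s_next), 1.0)
    return counts / counts.sum(axis=2, keepdims=True)

rng = np.random.default_rng(0)
n_states, n_actions, n = 3, 2, 30_000

# Hypothetical ground-truth dynamics p_tau(s' | s, a), one distribution per (s, a).
true_tau = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Toy transitions (s_n, a_n, s'_n); s' is drawn by inverting the CDF row-wise.
s = rng.integers(0, n_states, size=n)
a = rng.integers(0, n_actions, size=n)
cdf = true_tau[s, a].cumsum(axis=1)
s_next = (rng.uniform(size=n)[:, None] > cdf).sum(axis=1)
s_next = np.minimum(s_next, n_states - 1)   # guard against round-off in the CDF

tau_hat = fit_tabular_world_model(s, a, s_next, n_states, n_actions)

# Probability that the learned model assigns to each observed transition.
p_observed = tau_hat[s, a, s_next]
print("max abs error of the learned model:", np.abs(tau_hat - true_tau).max())
```

The array `p_observed` holds $p_{\hat{\tau}}(s'_n | s_n, a_n)$ for every tuple in the dataset, which is the quantity needed whenever the learned model stands in for the true transition probabilities.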

Deconfounded behaviour cloning. Once training is done, we can use $\hat{\tau}$ in place of the true transition dynamics $\tau$ to train a de-confounded policy by

[] \begin{align} \hat{\pi} = \arg\max_{\pi} \sum_{n=1}^N \log \frac{p_{\pi}(a'_n | s'_n)} {p_{\hat{\tau}}(s'_n | s_n, a_n)}, \end{align} []

where $a'_n$ is the next action in the dataset. That is, the dataset now consists of $(s_n, a_n, s'_n, a'_n)$ rather than $(s_n, a_n, s'_n)$. This effectively makes us lose a few examples from the original dataset that correspond to the final steps of episodes, although this is a small price to pay to avoid the confounding by multiple expert policies.

Causal reinforcement learning. This is an example of how causality can help us identify a potential issue a priori and design a better learning algorithm without relying on trial and error. In the context of reinforcement learning, which is a sub-field of machine learning focused on learning a policy, such as by behaviour cloning, this line of work is often referred to as, and studied under the name of, causal reinforcement learning[5].

## Summary

In this final chapter, I have touched upon a few topics that were left out from the main chapters perhaps for no particular strong reason. These topics included

• Difference-in-Difference
• Regression discontinuity
• Double machine learning
• A taste of causal reinforcement learning

There are many interesting topics that were not discussed in this lecture note, due both to the lack of time and to the limits of my own knowledge and expertise. I find the following three areas to be particularly interesting and recommend that you follow up on them.

• Counterfactual analysis: Can we build an algorithm that can imagine taking an alternative action and guess the resulting outcome instead of the actual outcome?
• (Scalable) causal discovery: How can we infer useful causal relationships among many variables?
• Beyond invariance (The Principle of Invariance): Invariance is a strong assumption. Can we relax this assumption to identify a more flexible notion of causal prediction?

## General references

Cho, Kyunghyun (2024). "A Brief Introduction to Causal Inference in Machine Learning". arXiv:2405.08793 [cs.LG].

## Notes

1. This provides a good ground for testing the validity of regression discontinuity. If the distributions of $x$ before and after $c_0$ differ significantly from each other, regression discontinuity cannot be used.
2. I am only drawing two time steps for simplicity, without loss of generality.

## References

1. "On the Mode of Communication of Cholera." (1856). Edinburgh Medical Journal 1.
2. "Regression discontinuity designs: A guide to practice" (2008). Journal of econometrics 142. Elsevier.
3. "Effective training of a neural network character classifier for word recognition" (1996). Advances in neural information processing systems 9.
4. LeCun, Yann (2022). "A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27". Open Review 62.
5. Elias Bareinboim (2020), ICML Tutorial on Causal Reinforcement Learning