Conditioning

Beliefs depend on the available information. This idea is formalized in probability theory by conditioning. Conditional probabilities, conditional expectations and conditional distributions are treated on two levels: discrete probabilities and probability density functions. Conditioning leads to a non-random result if the condition is completely specified; otherwise, if the condition is left random, the result of conditioning is also random.

Conditional Expectation

The conditional expectation of a random variable is another random variable equal to the average of the former over each possible "condition". Conditional expectation is also known as conditional expected value or conditional mean.

The concept of conditional expectation can be nicely illustrated through the following example. Suppose we have daily rainfall data (mm of rain each day) collected by a weather station on every day of the ten-year period from Jan 1, 1990 to Dec 31, 1999. The conditional expectation of daily rainfall knowing the month of the year is the average of daily rainfall over all days of the ten-year period that fall in the given month. This conditional expectation may be viewed either as a function of the day (for example, its value for Mar 3, 1992 is the sum of the daily rainfalls on all days falling in March during the ten years, divided by the number of such days, namely 310) or as a function of just the month (for example, the value for March equals the value in the previous example).
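
In code, this grouping and averaging is straightforward. Here is a minimal Python sketch; the rainfall records shown are placeholders rather than real data, and in practice the list would contain one entry per day of the ten-year period:

    from collections import defaultdict
    from datetime import date

    # Hypothetical placeholder records: (day, rainfall in mm).
    records = [
        (date(1992, 3, 3), 4.2),
        (date(1992, 3, 4), 0.0),
        (date(1995, 7, 1), 12.5),
    ]

    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, rain in records:
        totals[day.month] += rain
        counts[day.month] += 1

    # E[daily rainfall | month]: average over all days of the period falling in that month.
    cond_exp = {month: totals[month] / counts[month] for month in totals}
    print(cond_exp[3])  # the value assigned to every March day, e.g. Mar 3, 1992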

It is important to note the following.

  • The conditional expectation of daily rainfall knowing that we are in a month of March of the given ten years is not a monthly rainfall figure; that is, it is not the average of the ten monthly March rainfall totals. That number would be 31 times higher.
  • The average daily rainfall in March 1992 is not equal to the conditional expectation of daily rainfall knowing that we are in a month of March of the given ten years, because by restricting ourselves to 1992 we have imposed more conditions than just being in March. This shows that reasoning such as "we are in March 1992, so I know we are in March, so the average daily rainfall is the March average daily rainfall" is incorrect. Stated differently, although we use the expression "conditional expectation knowing that we are in March", this really means "conditional expectation knowing nothing other than that we are in March".

Definition

We will give a (semi) rigorous treatment of conditional expectation. Suppose that [math]X[/math] and [math]Y[/math] are random variables; we wish to make sense of the conditional expectation of [math]X[/math] given [math]Y[/math], normally denoted by

[[math]] \operatorname{E}[X \mid Y]. [[/math]]

As we have seen above with the rainfall example, we wish to make sense of the expectation of [math]X[/math] given the information obtained by observing [math]Y[/math]. One can also think of this as an estimation problem: estimate the expectation of [math]X[/math] given the data point [math]Y[/math].

When [math]Y[/math] is Discrete

Suppose that [math]Y[/math] is discrete, taking values [math]y_k[/math], each with positive probability:

[[math]] A_k = \{Y = y_k\},\quad \operatorname{P}(A_k) = p_k \gt 0,\quad \sum_{k}p_k = 1. [[/math]]

If we let

[[math]] 1_{A_k}(\omega) = \begin{cases} 1 & \text{if}\,\,\, Y(\omega) = y_k \\ 0 & \text{otherwise} \end{cases} [[/math]]

then a reasonable definition for the conditional expectation is

[[math]] \operatorname{E}[X \mid Y] = \sum_k \frac{\operatorname{E}[X \cdot 1_{A_k}]}{p_k} \, 1_{A_k}. [[/math]]
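
As a quick illustration (a toy example, not part of the formal development), let [math]X[/math] be the value of a fair die and let [math]Y = 1[/math] if the roll is even and [math]Y = 0[/math] otherwise, so that [math]p_0 = p_1 = 1/2[/math]. The definition gives

[[math]] \operatorname{E}[X \mid Y] = \frac{\operatorname{E}[X \cdot 1_{\{Y=1\}}]}{1/2}\, 1_{\{Y=1\}} + \frac{\operatorname{E}[X \cdot 1_{\{Y=0\}}]}{1/2}\, 1_{\{Y=0\}} = 4 \cdot 1_{\{Y=1\}} + 3 \cdot 1_{\{Y=0\}}, [[/math]]

that is, the conditional expectation equals 4 on even rolls and 3 on odd rolls.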

To see why this definition makes sense, suppose that [math]Y[/math] is observed to equal [math]y_k[/math], then

[[math]] \operatorname{E}[X \mid Y] = \operatorname{E}[X \mid y_k] = \frac{\operatorname{E}[X \cdot 1_{A_k}]}{p_k} [[/math]]

and

[[math]] \begin{align} \frac{\operatorname{E}[X \cdot 1_{A_k}]}{p_k} &\approx \sum_{j=-I \cdot 2^{n}}^{I \cdot 2^{n}-1} (j+1)2^{-n} \frac{\operatorname{P}[\{j2^{-n} \lt X \leq (j+1)2^{-n}\} \cap A_k ] }{\operatorname{P}(A_k)} \\ & = \sum_{j=-I \cdot 2^{n}}^{I \cdot 2^{n}-1} (j+1)2^{-n} \operatorname{P}[j2^{-n} \lt X \leq (j+1)2^{-n} \mid A_k ] \end{align} [[/math]]

with [math]I[/math] and [math]n[/math] sufficiently large (we are simply splitting the interval [math][-I,I][/math] into pieces of length [math]2^{-n}[/math]). The crucial property of conditional expectations is the following:

[[math]] \begin{equation} \label{cond-exp-discrete-prop1} \operatorname{E}\left[\operatorname{E}[X \mid Y] \cdot 1_{A_j} \right] = \sum_{k}\frac{\operatorname{E}[X \cdot 1_{A_k}]}{p_k}\,\operatorname{E}\left[1_{A_k} \cdot 1_{A_j}\right] = \operatorname{E}\left[X \cdot 1_{A_j} \right]. \end{equation} [[/math]]

By the linearity of expectation and \ref{cond-exp-discrete-prop1}, we see that

[[math]] \begin{equation} \label{cond-exp-discrete-prop2} \operatorname{E}[\operatorname{E}[X \mid Y] \cdot 1_A ] = \operatorname{E}[X \cdot 1_A] \end{equation} [[/math]]

for any event of the form

[[math]] A = A_{n_1} \cup \ldots \cup A_{n_j}. [[/math]]

In fact, we will see below that property \ref{cond-exp-discrete-prop2} is the defining property of conditional expectation.
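
The following minimal Python sketch checks property \ref{cond-exp-discrete-prop2} numerically; the toy joint distribution of [math](X,Y)[/math] is an arbitrary assumption made purely for illustration:

    # Toy joint distribution of (X, Y): outcomes and probabilities are arbitrary.
    outcomes = {(1, 0): 0.2, (2, 0): 0.3, (3, 1): 0.1, (4, 1): 0.4}

    def expect(f):
        # E[f(X, Y)] under the toy distribution
        return sum(f(x, y) * p for (x, y), p in outcomes.items())

    y_values = {y for (_, y) in outcomes}
    p_k = {k: expect(lambda x, y, k=k: 1.0 if y == k else 0.0) for k in y_values}

    def cond_exp(y):
        # E[X | Y] evaluated on the event {Y = y}, following the definition above
        return expect(lambda u, v, k=y: u if v == k else 0.0) / p_k[y]

    A = {0}  # the event A = {Y = 0}, a union of partition sets A_k
    lhs = expect(lambda x, y: cond_exp(y) * (1.0 if y in A else 0.0))  # E[ E[X|Y] 1_A ]
    rhs = expect(lambda x, y: x * (1.0 if y in A else 0.0))            # E[ X 1_A ]
    print(abs(lhs - rhs) < 1e-12)  # True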

When [math]Y[/math] is Continuous

In the continuous case, things get a little tricky since the event [math]Y = y[/math] has zero probability. To explain further, suppose we had incomplete information and the only thing we knew about the observation [math]Y[/math] was that it lay in the interval [math][a,b][/math]; how would we define the conditional expectation of [math]X[/math] given that [math]Y[/math] belongs to this interval? Proceeding as in the discrete case, we set

[[math]] \operatorname{E}[X \mid A] = \frac{\operatorname{E}[X \cdot 1_{A} ]}{\operatorname{P}(A)}\,,\quad A = \{a\leq Y \leq b \} [[/math]]

provided that [math]\operatorname{P}(A)\gt0[/math]. One natural approach would be to set

[[math]] \operatorname{E}[X \mid y] = \lim_{\epsilon \rightarrow 0} \operatorname{E}[X \mid A_{\epsilon}]\,,\quad A_{\epsilon} = \{y -\epsilon \leq Y \leq y + \epsilon\} [[/math]]

provided that the limit exists. If [math]X[/math] and [math]Y[/math] have a continuous joint density [math]f_{X,Y}(x,y)[/math] then we have

[[math]] \begin{align*} \operatorname{E}[X \mid y_0] &= \lim_{\epsilon \rightarrow 0} \frac{\int_{y_0-\epsilon}^{y_0 + \epsilon}\int_{-\infty}^{\infty} x\, f_{X,Y}(x,y) \, dx \,dy}{\int_{y_0-\epsilon}^{y_0 + \epsilon}\int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx \,dy} \\ &= \frac{\lim_{\epsilon \rightarrow 0} \epsilon^{-1} \int_{y_0-\epsilon}^{y_0 + \epsilon}\int_{-\infty}^{\infty} x\, f_{X,Y}(x,y) \, dx \,dy}{\lim_{\epsilon \rightarrow 0} \epsilon^{-1} \int_{y_0-\epsilon}^{y_0 + \epsilon}\int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx \,dy} \\ &= \frac{\int_{-\infty}^{\infty} x\, f_{X,Y}(x,y_0) \, dx }{\int_{-\infty}^{\infty} \, f_{X,Y}(x,y_0) \, dx} \\ &= \frac{\int_{-\infty}^{\infty} x\, f_{X,Y}(x,y_0) \, dx }{f_{Y}(y_0)} \end{align*} [[/math]]

provided that [math]f_Y(y_0)\gt 0[/math]. Therefore a reasonable definition for the conditional expectation appears to be

[[math]] \operatorname{E}[X \mid Y] = \frac{\int_{-\infty}^{\infty} x\, f_{X,Y}(x,Y) \, dx }{f_{Y}(Y)}. [[/math]]
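
For instance (with a joint density chosen purely for illustration), take [math]f_{X,Y}(x,y) = x + y[/math] on the unit square [math][0,1]^2[/math] and zero elsewhere, so that [math]f_Y(y) = \int_0^1 (x+y)\,dx = \tfrac{1}{2} + y[/math]. Then

[[math]] \operatorname{E}[X \mid Y] = \frac{\int_0^1 x\,(x + Y)\,dx}{\tfrac{1}{2} + Y} = \frac{\tfrac{1}{3} + \tfrac{Y}{2}}{\tfrac{1}{2} + Y} = \frac{2 + 3Y}{3 + 6Y}, [[/math]]

which is itself a random variable, being a function of [math]Y[/math].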

With this definition for the conditional expectation, we can show that property \ref{cond-exp-discrete-prop2} also holds in the continuous case for a certain family of events. More precisely, we have

[[math]] \begin{equation} \label{cond-exp-cont-prop} \operatorname{E}\left[ \operatorname{E}[X \mid Y] \cdot 1_{A}\right] = \operatorname{E}\left[X\cdot 1_{A}\right] \end{equation} [[/math]]

for any event [math]A = \{a \leq Y \leq b \}[/math].

General Formulation

Now suppose that [math]Y[/math] is a general random variable. Properties \ref{cond-exp-discrete-prop2} and \ref{cond-exp-cont-prop} motivate the formal and general definition of the conditional expectation of [math]X[/math] given [math]Y[/math]. More precisely, [math]\operatorname{E}[X \mid Y][/math] is defined as the unique random variable depending on [math]Y[/math] satisfying

[[math]] \operatorname{E}\left[ \operatorname{E}[X \mid Y] \cdot 1_{A}\right] = \operatorname{E}\left[X\cdot 1_{A}\right] [[/math]]

for any event [math]A = \{a \leq Y \leq b \}[/math]. It can be shown that such a random variable always exists and is unique. The uniqueness is in the almost sure sense: the event that two such random variables differ has probability zero. In particular, it follows that the previous definitions for the conditional expectation are consistent with the formal one presented here.

Properties

We list some basic properties of conditional expectation (without derivation):

  • [math]\operatorname{E}[aX + b \mid Y] = a\operatorname{E}[X \mid Y] + b [/math]
  • [math]\operatorname{E}[\operatorname{E}[X \mid Y]] = \operatorname{E}[X][/math]
  • [math]\operatorname{E}[g(Y)\mid Y] = g(Y)[/math] for any continuous function [math]g[/math]
  • [math]\operatorname{E}[X \mid Y] = \operatorname{E}[X] [/math] when [math]X[/math] is independent of [math]Y[/math]
  • [math] \operatorname{E}[X_1 \mid Y] \leq \operatorname{E}[X_2 \mid Y] [/math] almost surely when [math] X_1 \leq X_2[/math] almost surely.
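
The second property (the tower property) is easy to check by simulation. Below is a minimal Monte Carlo sketch in Python; the model [math]X = Y + Z[/math] with [math]Y[/math] and [math]Z[/math] independent standard normals, for which [math]\operatorname{E}[X \mid Y] = Y[/math], is an assumption chosen purely for illustration:

    import random

    random.seed(0)
    n = 200_000
    ys = [random.gauss(0, 1) for _ in range(n)]
    xs = [y + random.gauss(0, 1) for y in ys]   # X = Y + Z with Z independent of Y

    mean_x = sum(xs) / n      # Monte Carlo estimate of E[X]
    mean_cond = sum(ys) / n   # estimate of E[ E[X|Y] ], since E[X|Y] = Y in this model
    print(round(mean_x, 3), round(mean_cond, 3))  # both close to 0, as the tower property predicts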

Conditional Variance

We can consider a very natural random variable

[[math]] \begin{equation} \label{cond-variance} \operatorname{Var}[X \mid Y] = \operatorname{E}\left[(X - \operatorname{E}[X \mid Y])^2 \mid Y \right] \end{equation} [[/math]]

called the conditional variance of [math]X[/math] given [math]Y[/math].

Properties

Here are a few basic properties of the conditional variance (without derivation):

  • [math] \operatorname{Var}( cX \mid Y ) = c^2 \operatorname{Var}( X \mid Y ) [/math]
  • [math] \operatorname{Var}(X \mid Y) = \operatorname{E}[X^2 \mid Y] - \operatorname{E}[X \mid Y]^2 [/math]
  • [math]\operatorname{Var}[X] = \operatorname{E}\left[\operatorname{Var}[X \mid Y]\right] + \operatorname{Var}\left[\operatorname{E}[X \mid Y]\right] [/math]
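
As a quick check of the last identity, take [math]X[/math] to be the value of a fair die and [math]Y[/math] the indicator that the roll is even (the toy example used earlier). Given either parity, [math]X[/math] is uniform on three values and [math]\operatorname{Var}[X \mid Y] = \tfrac{8}{3}[/math], while [math]\operatorname{E}[X \mid Y][/math] takes the values 3 and 4 with probability [math]\tfrac{1}{2}[/math] each, so that

[[math]] \operatorname{E}\left[\operatorname{Var}[X \mid Y]\right] + \operatorname{Var}\left[\operatorname{E}[X \mid Y]\right] = \frac{8}{3} + \frac{1}{4} = \frac{35}{12} = \operatorname{Var}[X]. [[/math]]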

Minimizing Mean Square Error

Suppose that [math]X[/math] has a finite second moment ([math]\operatorname{E}[X^2] \lt \infty [/math]) and consider the following minimization problem:

[[math]] \begin{equation} \label{least-squares} \min_{Z} \operatorname{E}[(X-Z)^2] \quad \text{over random variables } Z \text{ depending on } Y. \end{equation} [[/math]]

The solution to \ref{least-squares} can be thought of as the best estimate of [math]X[/math] based on the information provided by [math]Y[/math]. It turns out that the solution to \ref{least-squares} is unique (almost surely) and equals the conditional expectation of [math]X[/math] given [math]Y[/math]:

[[math]] \operatorname{E}\left[(X-\operatorname{E}[X \mid Y])^2\right] \leq \operatorname{E}\left[(X-Z)^2\right] [[/math]]

for any random variable [math]Z[/math] depending on [math]Y[/math].

To be mathematically precise, [math]Z[/math] needs to be measurable with respect to the sigma-algebra generated by [math]Y[/math]. For our purposes, you can think of [math]Z[/math] as a suitable (say, continuous) function of [math]Y[/math].
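
A short Monte Carlo sketch makes the point concrete. With the same illustrative model [math]X = Y + Z[/math] ([math]Y[/math] and [math]Z[/math] independent standard normals, so that [math]\operatorname{E}[X \mid Y] = Y[/math]), any other function of [math]Y[/math] incurs a larger mean square error:

    import random

    random.seed(1)
    n = 200_000
    ys = [random.gauss(0, 1) for _ in range(n)]
    xs = [y + random.gauss(0, 1) for y in ys]   # X = Y + Z, so E[X | Y] = Y

    def mse(estimator):
        # empirical mean square error E[(X - estimator(Y))^2]
        return sum((x - estimator(y)) ** 2 for x, y in zip(xs, ys)) / n

    print(round(mse(lambda y: y), 3))        # about 1.0: the conditional expectation E[X|Y] = Y
    print(round(mse(lambda y: 0.5 * y), 3))  # larger, about 1.25
    print(round(mse(lambda y: 0.0), 3))      # larger still, about 2.0: the constant estimator E[X]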

Conditional Probability Distribution

Given two jointly distributed random variables [math]X[/math] and [math]Y[/math], the conditional probability distribution of [math]X[/math] given [math]Y[/math] is the probability distribution of [math]X[/math] when [math]Y[/math] is known to be a particular value; in some cases the conditional probabilities may be expressed as functions containing the unspecified value [math]y[/math] of [math]Y[/math] as a parameter. When both [math]X[/math] and [math]Y[/math] are categorical variables, a conditional probability table is typically used to represent the conditional probability. The conditional distribution contrasts with the marginal distribution of a random variable, which is its distribution without reference to the value of the other variable.

If the conditional distribution of [math]X[/math] given [math]Y[/math] is a continuous distribution, then its probability density function is known as the conditional density function. The properties of a conditional distribution, such as the moments, are often referred to by corresponding names such as the conditional mean and conditional variance.

More generally, one can refer to the conditional distribution of a subset of a set of more than two variables; this conditional distribution is contingent on the values of all the remaining variables, and if more than one variable is included in the subset then this conditional distribution is the conditional joint distribution of the included variables.

Relation to conditional expectation

We define the conditional distribution of [math]X[/math] given [math]Y[/math] using conditional expectation:

[[math]] F_{X}(x \mid Y) = \operatorname{P}(X \leq x \mid Y) = \operatorname{E}[1_{A} \mid Y]\,,\quad A = \{X \leq x\}. [[/math]]

When Y is Discrete

When [math]Y[/math] is a discrete random variable, the conditional distribution of [math]X[/math] given [math]Y = y[/math] is given by (see When [math]Y[/math] is Discrete):

[[math]] \begin{align} F_{X}(x \mid y) &= \begin{cases} \frac{\operatorname{E}[1_{X\leq x} \cdot 1_{Y=y}]}{\operatorname{P}(Y=y)} & \operatorname{P}(Y=y) \gt 0 \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} \frac{\operatorname{P}(X\leq x \, \cap \, Y = y)}{\operatorname{P}(Y=y)} & \operatorname{P}(Y=y) \gt 0 \\ 0 & \text{otherwise} \end{cases} \end{align} [[/math]]
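
For instance, with [math]X[/math] the value of a fair die and [math]Y[/math] the indicator that the roll is even (the toy example used earlier),

[[math]] F_{X}(3 \mid 1) = \frac{\operatorname{P}(X\leq 3 \, \cap \, Y = 1)}{\operatorname{P}(Y=1)} = \frac{\operatorname{P}(X = 2)}{1/2} = \frac{1}{3}. [[/math]]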

Continuous distributions

Suppose that [math]X[/math] and [math]Y[/math] have a continuous joint density [math]f_{X,Y}(x,y)[/math] then (see When [math]Y[/math] is Continuous)

[[math]] \begin{equation} \label{cond-prob-continuous} F(x_0 \mid Y) = \frac{\int_{-\infty}^{x_0} \, f_{X,Y}(x,Y) \, dx }{f_{Y}(Y)}. \end{equation} [[/math]]

We also see from \ref{cond-prob-continuous} that the distribution [math]F(x |Y)[/math] will have a density function, called the conditional density of [math]X[/math] given [math]Y[/math], given by

[[math]] f_{X}(x \mid Y) = \frac{f_{X,Y}(x,Y)}{f_Y(Y)}. [[/math]]
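
For the illustrative density [math]f_{X,Y}(x,y) = x + y[/math] on the unit square considered earlier, this reads

[[math]] f_{X}(x \mid Y) = \frac{x + Y}{\tfrac{1}{2} + Y}\,,\quad 0 \leq x \leq 1, [[/math]]

which integrates to one in [math]x[/math] for every value of [math]Y[/math].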

Bayes' Formulas

The Joint Distribution

We can express the joint distribution function in terms of conditional and marginal distributions:

[[math]] F_{X,Y}(x,y) = \int_{-\infty}^{x} F_{Y}(y \mid z ) \, dF_{X}(z) = \int_{-\infty}^{y} F_{X}(x \mid z ) \, dF_{Y}(z). [[/math]]

In particular, letting [math]y \rightarrow \infty[/math] in the second expression, we obtain the marginal distribution of [math]X[/math] as an average of conditional distributions

[[math]] F_{X}(x) = \int_{-\infty}^{\infty} F_{X}(x \mid z ) \, dF_{Y}(z) [[/math]]

and, from the first expression, a formula for conditional probabilities of events:

[[math]] \begin{equation} \label{bayes-cond-prob} \operatorname{P}(X \leq x \mid Y \leq y) = \frac{\int_{-\infty}^x F_Y(y \mid z ) \, dF_{X}(z)}{\int_{-\infty}^{\infty} F_Y(y \mid z ) \, dF_{X}(z)}. \end{equation} [[/math]]

In particular, if [math]X[/math] is discrete, taking on the values [math]x_1,\ldots,x_n[/math], then \ref{bayes-cond-prob} translates into

[[math]] \frac{\sum_{x_i \leq x} \operatorname{P}(Y \leq y \mid X = x_i)\operatorname{P}(X = x_i)}{\sum_{i}\operatorname{P}(Y \leq y \mid X = x_i)\operatorname{P}(X = x_i)} [[/math]]

which is consistent with Bayes' formula as presented in conditional probability.

The Marginal and Conditional Densities

If [math]X[/math] and [math]Y[/math] have a joint density function, then we can express one marginal density as an average of the conditional densities:

[[math]] f_{X}(x) = \int_{-\infty}^{\infty}f_X(x \mid y ) f_{Y}(y) \, dy. [[/math]]
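
With the illustrative density [math]f_{X,Y}(x,y) = x + y[/math] on the unit square, this averaging recovers the marginal directly:

[[math]] f_{X}(x) = \int_{0}^{1} \frac{x+y}{\tfrac{1}{2}+y}\left(\tfrac{1}{2}+y\right) dy = \int_{0}^{1} (x+y)\, dy = x + \tfrac{1}{2}\,,\quad 0 \leq x \leq 1. [[/math]]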

Furthermore, one conditional density can be expressed in terms of the other as follows:

[[math]] f_{Y}(y \mid X ) = \frac{f_X(X \mid y) f_{Y}(y)}{f_{X}(X)} = \frac{f_X(X \mid y) f_{Y}(y)}{\int_{-\infty}^{\infty}f_{X}(X \mid t) f_{Y}(t) \, dt}. [[/math]]
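
As a numerical illustration of the last formula, the following Python sketch evaluates [math]f_Y(y \mid X)[/math] on a grid by normalizing the product [math]f_X(X \mid y) f_Y(y)[/math]; the prior, the conditional density and the observed value are all assumptions made for the example:

    import math

    def normal_pdf(t, mean, var):
        return math.exp(-(t - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    # Assumed model: f_Y(y) standard normal, f_X(x | y) normal with mean y and variance 1.
    x_obs = 1.2
    step = 0.01
    grid = [-6 + step * i for i in range(1201)]  # grid of y values from -6 to 6

    unnorm = [normal_pdf(x_obs, y, 1.0) * normal_pdf(y, 0.0, 1.0) for y in grid]
    norm_const = sum(u * step for u in unnorm)    # approximates the denominator f_X(x_obs)
    posterior = [u / norm_const for u in unnorm]  # f_Y(y | X = x_obs) on the grid

    post_mean = sum(y * p * step for y, p in zip(grid, posterior))
    print(round(post_mean, 3))  # close to x_obs / 2 = 0.6, the known closed form for this model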
