Change of Variables

Previously we studied transformations of a single random variable and how to compute the distribution of the transformed variable. Here we are primarily concerned with situations where a function of several random variables, sometimes called a change of variables, has a (joint) cumulative distribution function or (joint) density function that can be computed simply in terms of the corresponding cumulative distribution function or joint density function of the random variables it depends on.

The Setup

We let [math]X_1,\ldots,X_n[/math] denote a sequence of random variables and let

[[math]] \mathbf{Y} = [Y_1,\ldots,Y_m] = T(X_1,\ldots,X_n) = T(\mathbf{X}) [[/math]]

denote the change of variables.

Monotone One-to-One Transformations

A transformation [math]T[/math] is one-to-one when [math]T(\mathbf{x}) = T( \mathbf{y}) [/math] implies that [math] \mathbf{x} = \mathbf{y}[/math]: [math]T [/math] maps distinct points to distinct points. If [math]T[/math] is one-to-one then [math]T[/math] has a left-inverse [math]T^{-1}[/math]:

[[math]] T^{-1}(T(\mathbf{x})) = \mathbf{x}. [[/math]]

Strictly Increasing

If [math]T[/math] is strictly increasing, i.e.,

[[math]] \mathbf{x} \leq \mathbf{y} \implies T(\mathbf{x}) \leq T(\mathbf{y})\,, \quad \text{where}\,\,\mathbf{x} \leq \mathbf{y} \Leftrightarrow x_i \leq y_i \,\,\text{for all}\,\,i, [[/math]]

then

[[math]] \begin{align} \operatorname{P}[T(\mathbf{X}) \leq \mathbf{y} ] = \operatorname{P}[\mathbf{X} \leq T^{-1}(\mathbf{y})] = F_{\textbf{X}}[T^{-1}(\mathbf{y})] \end{align} [[/math]]

and we obtain the simple relation

[[math]] \begin{equation} \label{transform-rel-up} \mathbf{X} \mapsto T(\mathbf{X})=\textbf{Y} \implies F_{\textbf{Y}} = F_{\textbf{X}} \circ T^{-1}. \end{equation} [[/math]]
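For a one-dimensional illustration of this relation, here is a small simulation sketch (the distribution and the map are our own choices, not from the text): take [math]X[/math] uniform on [0,1] and [math]T(x) = x^2[/math], which is strictly increasing there, so [math]F_{Y}(y) = F_X(\sqrt{y}) = \sqrt{y}[/math].

```python
import math
import random

# X ~ Uniform(0, 1), T(x) = x**2 (strictly increasing on [0, 1]),
# so T^{-1}(y) = sqrt(y) and F_Y(y) = F_X(sqrt(y)) = sqrt(y).
random.seed(0)
n = 200_000
samples = [random.random() ** 2 for _ in range(n)]  # samples of Y = T(X)

def empirical_cdf(y):
    return sum(s <= y for s in samples) / n

# The empirical CDF of Y should match F_X(T^{-1}(y)) = sqrt(y).
for y in (0.1, 0.25, 0.5, 0.9):
    assert abs(empirical_cdf(y) - math.sqrt(y)) < 0.01
```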

Example

Suppose that [math]\mathbf{X} = [X_1,X_2] [/math] and let

[[math]] \mathbf{Y} = [Y_1,Y_2]=T(x_1,x_2) = [e^{2x_1 + x_2},e^{x_1 + 2x_2}] = \exp[A(\mathbf{X})] [[/math]]

with [math]A[/math] the linear transformation

[[math]] A(x_1,x_2) = [2x_1 + x_2, x_1 + 2x_2]. [[/math]]

The inverse of the transformation, [math]T^{-1}[/math], equals

[[math]] A^{-1} \circ \log \, , \quad A^{-1}(\mathbf{y}) = [(2y_1 - y_2)/3,(2y_2 - y_1)/3]. [[/math]]

By \ref{transform-rel-up}, the cumulative distribution function of [math]\textbf{Y}[/math] equals [math]F_{\textbf{Y}} = F_{\textbf{X}} \circ A^{-1} \circ \log.[/math]
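As a sanity check on this example (a sketch we add, not part of the original text), we can verify numerically that [math]A^{-1} \circ \log[/math] inverts [math]T[/math], with [math]A[/math] read off from the definition of [math]T[/math] above:

```python
import math

# T = exp ∘ A with A(x1, x2) = [2*x1 + x2, x1 + 2*x2], and
# T^{-1} = A^{-1} ∘ log with A^{-1}(y1, y2) = [(2*y1 - y2)/3, (2*y2 - y1)/3].
def T(x1, x2):
    return (math.exp(2 * x1 + x2), math.exp(x1 + 2 * x2))

def T_inv(y1, y2):
    u, v = math.log(y1), math.log(y2)
    return ((2 * u - v) / 3, (2 * v - u) / 3)

# Round trip: T^{-1}(T(x)) = x at a few arbitrary points.
for x in [(0.3, -1.2), (2.0, 0.5), (-0.7, 0.1)]:
    back = T_inv(*T(*x))
    assert all(abs(a - b) < 1e-9 for a, b in zip(x, back))
```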

Strictly Decreasing

If the transformation [math]T[/math] is strictly decreasing and the distribution function of [math]\textbf{X}[/math], [math]F_{\textbf{X}}[/math], is continuous, then

[[math]] \operatorname{P}[T(\mathbf{X})\leq y ] = \operatorname{P}[\mathbf{X} \geq T^{-1}(y)] = 1 - F_{\textbf{X}}[T^{-1}(y)] [[/math]]

and the distribution function of [math]\textbf{Y}[/math] equals [math]1 - F_{\textbf{X}} \circ T^{-1} [/math].

Smooth One-to-One Transformations

We assume the following:

  1. [math]X_1,\ldots,X_n[/math] have a joint density.
  2. The transformation [math]T[/math] is one-to-one.
  3. The transformation [math]T[/math] is differentiable almost surely: the set of points where [math]T[/math] is not differentiable has probability 0.

To illustrate condition 3, suppose that [math]X_1 [/math] and [math]X_2[/math] are uniformly distributed on [0,1] and [math]T(X_1,X_2) = [X_2,X_1][/math]. Then [math]T[/math] is defined on the square region [0,1] x [0,1] and the set of points where it is differentiable has probability 1:

[[math]] \operatorname{P}(\{0 \lt X_1 \lt 1\} \cap \{0 \lt X_2 \lt 1\}) = 1. [[/math]]

If the conditions above are met, we can actually compute the joint density of [math]\textbf{Y}[/math]:

[[math]] f_{\textbf{Y}}(\mathbf{y}) = \frac{ f_{\textbf{X}}(T^{-1}(\mathbf{y}))}{|\operatorname{det}[J_T(T^{-1}(\mathbf{y}))]|}\,, \mathbf{y} = [y_1,\ldots,y_n]. [[/math]]

Here [math]J_T[/math] denotes the Jacobian matrix of [math]T[/math] and [math]\operatorname{det}[J_T][/math] its determinant. We can express the density of [math]\textbf{Y}[/math] more succinctly (recall that [math]\circ[/math] denotes function composition):

[[math]] \begin{equation} \label{transform-density} f_{\textbf{Y}} = \left(f_{\textbf{X}} \cdot |\operatorname{det}[J_T]|^{-1}\right) \circ T^{-1} \end{equation} [[/math]]

Example

Let [math]\textbf{U} = [U_1,U_2][/math] be uniformly distributed on the square [0,1]x[0,1] and let

[[math]] \textbf{Y} = [\exp(U_1 + U_2),\exp(U_1 - U_2)] = \exp[A(\mathbf{U})] [[/math]]

with [math]A[/math] the linear transformation [math]A(x_1,x_2) = [x_1 + x_2,x_1 - x_2][/math]. The inverse of the transformation, [math]T^{-1}[/math], equals

[[math]] A^{-1} \circ \log\, , \quad A^{-1}(\mathbf{y}) = \frac{1}{2}[y_1 + y_2,\,y_1-y_2]. [[/math]]

We compute the determinant of the Jacobian of the transformation

[[math]] \begin{align} J_T &= \left(\begin{matrix} \frac{\partial}{\partial x_1} e^{x_1 + x_2} & \frac{\partial}{\partial x_2} e^{x_1 + x_2} \\ \frac{\partial}{\partial x_1}e^{x_1 - x_2} & \frac{\partial}{\partial x_2} e^{x_1 - x_2}\\ \end{matrix}\right) \\ & = \left(\begin{matrix} e^{x_1 + x_2} & e^{x_1 + x_2} \\ e^{x_1 - x_2} & - e^{x_1 - x_2} \\ \end{matrix}\right) \end{align} [[/math]]

and its corresponding determinant equals [math] -2e^{2x_1}[/math]. Recalling \ref{transform-density}, the density function of [math]\textbf{Y}[/math] equals

[[math]] \begin{cases}\frac{1}{2}\exp[-\log(y_1) -\log(y_2)] = \frac{1}{2y_1y_2} & \mathbf{y} \in D \\ 0 & \text{otherwise} \end{cases} [[/math]]

with [math]D[/math] equal to the region [math]T([0,1] \times [0,1])[/math] (the image of the unit square under the transformation). Expressing [math]x_1[/math] and [math]x_2[/math] in terms of [math]y_1[/math] and [math]y_2[/math], the region [math]D[/math] can be described as follows:

[[math]] \begin{align*} 0 \lt \frac{\log(y_1) + \log(y_2)}{2} \lt 1 &\Leftrightarrow 1 \lt y_1y_2 \lt e^2 \\ 0 \lt \frac{\log(y_1) - \log(y_2)}{2} \lt 1 &\Leftrightarrow 1 \lt \frac{y_1}{y_2} \lt e^2 \\ \end{align*} [[/math]]
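The determinant computed above can be double-checked with finite differences. The following sketch (our own, not from the text) approximates the Jacobian of [math]T[/math] numerically and compares with [math]-2e^{2x_1}[/math]:

```python
import math

# T(x1, x2) = [exp(x1 + x2), exp(x1 - x2)], whose Jacobian determinant
# should equal -2 * exp(2 * x1).
def T(x1, x2):
    return (math.exp(x1 + x2), math.exp(x1 - x2))

def det_jacobian(x1, x2, h=1e-6):
    # Central differences for the four partial derivatives.
    d11 = (T(x1 + h, x2)[0] - T(x1 - h, x2)[0]) / (2 * h)
    d12 = (T(x1, x2 + h)[0] - T(x1, x2 - h)[0]) / (2 * h)
    d21 = (T(x1 + h, x2)[1] - T(x1 - h, x2)[1]) / (2 * h)
    d22 = (T(x1, x2 + h)[1] - T(x1, x2 - h)[1]) / (2 * h)
    return d11 * d22 - d12 * d21

for x1, x2 in [(0.2, 0.7), (0.5, 0.1)]:
    assert abs(det_jacobian(x1, x2) - (-2 * math.exp(2 * x1))) < 1e-4
```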

Other Methods

We present other ways of calculating the cumulative distribution function of transformed random variables or the probability of events associated with said variables.

Direct Evaluation

The most direct way to compute probabilities such as [math]\operatorname{P}[\textbf{Y} \in A] [/math] is to express them in terms of events associated with the initial random variable [math]\textbf{X}[/math]:

[[math]] \operatorname{P}[\textbf{Y} \in A ] = \operatorname{P}[\textbf{X} \in B]\,, \quad B = \{\mathbf{x} \,:\, T(\mathbf{x}) \in A\}. [[/math]]

Example 1

If [math]Y = X_1 + X_2= T(\mathbf{X})[/math], then [math]\operatorname{P}[Y \leq y] = \operatorname{P}[\textbf{X} \in B ] [/math] with [math]B[/math] the set of points

[[math]] B = \{[x_1,x_2] \, : \, x_1 + x_2 \leq y \}. [[/math]]

Example 2

If [math]\textbf{Y} = [X_1 + X_2,X_1 - X_2] = T(\mathbf{X}) [/math], then [math]\operatorname{P}[Y_1 \leq y_1,Y_2 \leq y_2] = \operatorname{P}[\textbf{X} \in B ] [/math] with [math]B[/math] the set of points

[[math]] B = \{[x_1,x_2] \, : \, x_1 + x_2 \leq y_1\,, x_1 - x_2 \leq y_2\}. [[/math]]
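Direct evaluation is easy to carry out by simulation. In the following sketch (the uniform distributions are our assumption, made for concreteness), if [math]X_1[/math] and [math]X_2[/math] are independent Uniform(0,1), the set [math]B[/math] of Example 1 has probability [math]y^2/2[/math] for [math]0 \leq y \leq 1[/math], the area of a triangle inside the unit square:

```python
import random

# P[X1 + X2 <= y] for independent Uniform(0,1) variables equals the area of
# B = {x1 + x2 <= y} inside the unit square, namely y**2 / 2 for 0 <= y <= 1.
random.seed(1)
n = 200_000
y = 0.8
hits = sum(random.random() + random.random() <= y for _ in range(n))
assert abs(hits / n - y ** 2 / 2) < 0.01  # y**2 / 2 = 0.32
```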

Conditioning

We limit ourselves to the case of two variables: [math]\mathbf{X} = [X_1,X_2] [/math]. The conditioning method for computing probabilities associated with [math]\textbf{Y}[/math] is to condition on [math]X_1[/math] or [math]X_2[/math]:

[[math]] \begin{equation} \label{cond-method-eqn} \operatorname{P}(\textbf{Y} \in A) = \operatorname{E}[\operatorname{P}(\textbf{Y} \in A \mid X_i)]\,, \quad i=1,2. \end{equation} [[/math]]

When [math]\mathbf{X}[/math] has a joint density function, \ref{cond-method-eqn} also equals

[[math]] \int_{-\infty}^{\infty}\operatorname{P}(\textbf{Y} \in A \mid X_i = x) \, f_{X_i}(x) \, dx \,, i=1,2 [[/math]]

with [math]f_{X_i}[/math] the marginal density of [math]X_i[/math].


Example 1

Suppose [math]X_1[/math] and [math]X_2[/math] are mutually independent. Suppose further that [math]X_1[/math] has density function

[[math]] f_{X_1}(x) = \begin{cases} \frac{\alpha}{x^{\alpha + 1}} & x \geq 1 \\ 0 & x \lt 1 \end{cases} [[/math]]

and [math]X_2 [/math] is uniformly distributed on [0,1]. We consider the random variable [math]Y = X_1 X_2 [/math]. Conditioning on [math]X_2[/math], we have

[[math]] f_Y(y\mid X_2 = \theta) = \begin{cases} \frac{\alpha \theta^{\alpha}}{y^{\alpha + 1}} & y \geq \theta \\ 0 & y \lt \theta \end{cases} [[/math]]

and thus the density function of [math]Y[/math] equals

[[math]] \alpha y^{-(\alpha + 1)} \, \int_0^{\min(y,1)} \theta^{\alpha} \, d\theta = \begin{cases} \frac{\alpha}{\alpha +1 } & 0 \lt y \lt 1 \\ \frac{\alpha}{\alpha + 1} \cdot y^{-(\alpha + 1)} & y \geq 1 \end{cases} [[/math]]
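This density can be checked by simulation (a sketch we add; the choice [math]\alpha = 2[/math] is arbitrary). Integrating the density gives the cumulative distribution function [math]F_Y(y) = \alpha y/(\alpha+1)[/math] for [math]0 \lt y \lt 1[/math] and [math]1 - y^{-\alpha}/(\alpha+1)[/math] for [math]y \geq 1[/math], which we compare against empirical frequencies:

```python
import random

random.seed(2)
alpha, n = 2.0, 200_000
# X1 = U**(-1/alpha) has the Pareto density alpha / x**(alpha + 1), x >= 1;
# X2 is Uniform(0, 1); Y = X1 * X2.
ys = [random.random() ** (-1 / alpha) * random.random() for _ in range(n)]

def F(y):
    # CDF obtained by integrating the density of Y.
    return alpha * y / (alpha + 1) if y < 1 else 1 - y ** (-alpha) / (alpha + 1)

for y in (0.5, 1.5, 3.0):
    emp = sum(s <= y for s in ys) / n
    assert abs(emp - F(y)) < 0.01
```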

Example 2

Suppose [math]\textbf{X}[/math] is uniformly distributed on the square [0,1]x[0,1] and let

[[math]] Y = X_1/X_2. [[/math]]

Conditional on [math]X_2[/math], [math]Y[/math] is uniformly distributed on the interval [math][0, 1/X_2][/math]. Furthermore, [math]X_2[/math] is uniformly distributed on [0,1]. By \ref{cond-method-eqn}, we have

[[math]] \begin{align*} \operatorname{P}(Y \leq y) &= \int_0^1 x\,\min(y,1/x)\, dx \\ &= \begin{cases} \int_0^{1} xy \,dx & y \lt 1 \\ \int_0^{1/y} xy \,dx + \int_{1/y}^1 \, dx & y \geq 1 \end{cases} \\ &= \begin{cases} y/2 & y \lt 1 \\ 1 - 1/(2y) & y \geq 1. \end{cases} \end{align*} [[/math]]
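This closed form is easy to verify by simulation (a sketch we add; the seed and sample size are arbitrary):

```python
import random

# Y = X1 / X2 with X1, X2 independent Uniform(0, 1):
# P(Y <= y) = y/2 for y < 1 and 1 - 1/(2*y) for y >= 1.
random.seed(3)
n = 200_000
ys = [random.random() / random.random() for _ in range(n)]

for y, target in [(0.5, 0.25), (2.0, 0.75), (4.0, 0.875)]:
    emp = sum(s <= y for s in ys) / n
    assert abs(emp - target) < 0.01
```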

Sums of Independent Random Variables

The probability distribution of the sum of two or more independent random variables is the convolution of their individual distributions. The term is motivated by the fact that the probability mass function or probability density function of a sum of random variables is the convolution of their corresponding probability mass functions or probability density functions respectively.

Convolutions

Continuous

The convolution of [math]f[/math] and [math]g[/math] is written [math]f*g[/math], using an asterisk or star. It is defined as the integral of the product of the two functions after one is reversed and shifted. As such, it is a particular kind of integral transform:

[[math]] \begin{align} (f * g )(t) &\stackrel{\mathrm{def}}{=}\ \int_{-\infty}^\infty f(\tau)\, g(t - \tau)\, d\tau \\ &= \int_{-\infty}^\infty f(t-\tau)\, g(\tau)\, d\tau \\ &= (g * f)(t). \end{align} [[/math]]

While the symbol [math]t[/math] is used above, it need not represent the time domain. But in that context, the convolution formula can be described as a weighted average of the function [math]f(\tau)[/math] at the moment [math]t[/math] where the weighting is given by [math]g(-\tau)[/math] simply shifted by amount [math]t[/math]. As [math]t[/math] changes, the weighting function emphasizes different parts of the input function.

For functions [math]f[/math], [math]g[/math] supported on only [math][0, \infty)[/math] (i.e., zero for negative arguments), the integration limits can be truncated, resulting in

[[math]] (f * g )(t) = \int_{0}^{t} f(\tau)\, g(t - \tau)\, d\tau \ \ \ \mathrm{for} \ \ f, g : [0, \infty) \to \mathbb{R} [[/math]]
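The truncated formula can be evaluated numerically. The following sketch (our own) convolves two Exponential(1) densities and recovers the Gamma(2,1) density [math]t e^{-t}[/math]:

```python
import math

# f = g = the Exponential(1) density; (f*g)(t) = t * exp(-t), the Gamma(2, 1)
# density, since the integrand exp(-tau) * exp(-(t - tau)) is constant in tau.
def f(x):
    return math.exp(-x)

def convolve_at(t, steps=10_000):
    # Trapezoidal rule for the integral over [0, t].
    h = t / steps
    vals = [f(tau) * f(t - tau) for tau in (i * h for i in range(steps + 1))]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

for t in (0.5, 1.0, 2.0):
    assert abs(convolve_at(t) - t * math.exp(-t)) < 1e-6
```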

Discrete

For functions [math]f[/math], [math]g[/math] defined on the set [math]\mathbb{Z}[/math] of integers, the discrete convolution of [math]f[/math] and [math]g[/math] is given by:[1]

[[math]] \begin{align} (f * g)[n]\, & \stackrel{\mathrm{def}}{=}\ \sum_{m=-\infty}^\infty f[m]\, g[n - m] \\ &= \sum_{m=-\infty}^\infty f[n-m]\, g[m] \\ &= (g * f)[n]. \end{align} [[/math]]

When [math]g[/math] has finite support in the set [math]\{-M,-M+1,\dots,M-1,M\}[/math] (representing, for instance, a finite impulse response), a finite summation may be used:[2]

[[math]](f*g)[n]=\sum_{m=-M}^M f[n-m]g[m].[[/math]]
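A direct implementation of the discrete convolution (our own helper, with pmfs stored as lists indexed from 0): convolving two Bernoulli([math]p[/math]) mass functions yields the Binomial(2, [math]p[/math]) mass function, anticipating the tables below.

```python
# Discrete convolution of two finitely supported pmfs given as lists.
def convolve(f, g):
    out = [0.0] * (len(f) + len(g) - 1)
    for m, fm in enumerate(f):
        for k, gk in enumerate(g):
            out[m + k] += fm * gk
    return out

p = 0.3
bern = [1 - p, p]                 # Bernoulli(p) pmf on {0, 1}
binom2 = convolve(bern, bern)     # Binomial(2, p) pmf on {0, 1, 2}
assert all(abs(a - b) < 1e-12 for a, b in zip(binom2, [0.49, 0.42, 0.09]))
```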

Continuous Distributions

If [math]X_1,\ldots,X_n[/math] denotes a sequence of mutually independent continuous random variables each having a probability density function, then the sum [math] Y = \sum_{i=1}^n X_i [/math] has probability density function [math] f_{X_1} * \cdots * f_{X_n}[/math]. Here is a table giving the distribution of sums of well-known continuous distributions:

[math]X_i[/math] [math]Y[/math]

[math]\textrm{Normal}(\mu_i, \sigma^2_i)[/math] [math]\textrm{Normal}(\mu_1 + \ldots + \mu_n, \sigma_1^2 + \ldots + \sigma^2_n)[/math]
[math]\textrm{Cauchy}(a_i,\gamma_i)[/math] [math]\textrm{Cauchy}(a_1+\ldots+a_n,\gamma_1 +\ldots+\gamma_n)[/math]
[math]\textrm{Exponential}(\theta)[/math] [math]\textrm{Gamma}(n, \theta)[/math]
[math]\textrm{Chi-Square}(r_i)[/math] [math]\textrm{Chi-Square}(r_1 +\ldots+ r_n) [/math]

To demonstrate the convolution technique, we will give a derivation of the result described in the third row in the table above (sum of exponentials). The proof is by induction on [math]n[/math]. For [math]n = 1 [/math], there is nothing to show. If

[[math]] Y_n = X_1 + \ldots + X_n [[/math]]

then we need to show that [math]Y_{n+1}[/math] is [math]\textrm{Gamma}(n+1, \theta)[/math] given that [math]Y_n[/math] is [math]\textrm{Gamma}(n,\theta)[/math]. It suffices to show the result for [math]\theta = 1 [/math] since the general case then follows by scaling. Using the convolution technique, the density of [math]Y_{n+1}[/math] equals:

[[math]] \begin{align} f_{Y_n}*f_{X_{n+1}}(y) & \propto \int_0^y x^{n-1} e^{-x} e^{x-y} dx \\ & \propto e^{-y} \cdot \int_0^y x^{n-1} dx \\ & \propto e^{-y} \, y^{n}. \\ \end{align} [[/math]]

This is the density function of a [math]\textrm{Gamma}(n+1,1)[/math] and thus, by induction, we have completed the derivation.

The symbol [math]\propto[/math] means "proportional to": the expression on its left equals the expression on its right up to a multiplicative constant.
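The result can also be checked by simulation (our own sketch; [math]n = 4[/math] and [math]\theta = 1[/math] are arbitrary choices). For integer [math]n[/math], the Gamma([math]n[/math], 1) distribution function has the closed form [math]1 - e^{-y}\sum_{k \lt n} y^k/k![/math]:

```python
import math
import random

# Sum of n iid Exponential(1) variables should be Gamma(n, 1); we sample
# exponentials via inverse transform: -log(U) ~ Exponential(1).
random.seed(4)
n_terms, n_samples = 4, 200_000
sums = [sum(-math.log(random.random()) for _ in range(n_terms))
        for _ in range(n_samples)]

def gamma_cdf(y, n):
    # Erlang (integer-shape Gamma) CDF with rate 1.
    return 1 - math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(n))

for y in (2.0, 4.0, 6.0):
    emp = sum(s <= y for s in sums) / n_samples
    assert abs(emp - gamma_cdf(y, n_terms)) < 0.01
```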

Discrete Distributions

Similarly, if [math]X_1, \ldots, X_n [/math] denotes a sequence of mutually independent integer-valued random variables, then the sum

[[math]] Y = \sum_{i=1}^n X_i [[/math]]

has probability mass function [math]p_{Y} = p_{X_1} * \cdots * p_{X_n}[/math]. Here is a table giving the distribution of sums of well-known discrete distributions:

[math]X_i[/math] [math]Y[/math]

[math]\textrm{Bernoulli}(p)[/math] [math]\textrm{Binomial}(n,p)[/math]
[math]\textrm{Binomial}(n_i,p)[/math] [math]\textrm{Binomial}(n_1 + \ldots + n_n,p) [/math]
[math]\textrm{NegativeBinomial}(n_i,p)[/math] [math]\textrm{NegativeBinomial}(n_1 + \ldots + n_n, p)[/math]
[math]\textrm{Geometric}(p)[/math] [math]\textrm{NegativeBinomial}(n,p) [/math]
[math]\textrm{Poisson}(\lambda_i)[/math] [math]\textrm{Poisson}(\lambda_1 + \ldots + \lambda_n)[/math]

To demonstrate the convolution technique, we will give a derivation of the result described in the second row in the table above (sum of binomials). The proof is by induction on [math]n[/math]. For [math] n = 1 [/math] there is nothing to show. If [math]Y_n = \sum_{i=1}^n X_i[/math], [math]s_n = \sum_{i=1}^n n_i[/math], and we set [math]q= 1-p[/math] and [math]\operatorname{P}[Y_{n+1} = y] = p(y) [/math], then

[[math]] \begin{align} p(y) &= \sum_{m=0}^{y} \binom{s_n}{m} p^m q^{s_n-m} \binom{n_{n+1}}{y-m} p^{y-m}q^{n_{n+1}-(y-m)} \\ \label{conv-bin-1} &= p^{y} q^{s_{n+1} -y} \sum_{m=0}^y \binom{s_{n}}{m} \binom{n_{n+1}}{y-m} \\ \label{conv-bin-2} &= p^{y} q^{s_{n+1} -y} \binom{s_{n+1}}{y}. \end{align} [[/math]]

To go from \ref{conv-bin-1} to \ref{conv-bin-2}, we have used the combinatorial identity (Vandermonde's identity)

[[math]] \sum_{m=0}^y \binom{N_1}{m} \binom{N_2}{y-m} = \binom{N_1 + N_2}{y}. [[/math]]

\ref{conv-bin-2} indicates that [math]Y_{n+1}[/math] is a binomial with parameters [math]s_{n+1}[/math] and [math]p[/math]. By induction, we have completed the derivation.
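The conclusion can be confirmed numerically by convolving binomial probability mass functions (a sketch we add; the parameters are arbitrary):

```python
import math

# Binomial(n, p) pmf on {0, ..., n}.
def binom_pmf(n, p):
    return [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

# Discrete convolution of two finitely supported pmfs given as lists.
def convolve(f, g):
    out = [0.0] * (len(f) + len(g) - 1)
    for m, fm in enumerate(f):
        for k, gk in enumerate(g):
            out[m + k] += fm * gk
    return out

n1, n2, p = 5, 7, 0.4
lhs = convolve(binom_pmf(n1, p), binom_pmf(n2, p))
rhs = binom_pmf(n1 + n2, p)  # Binomial(n1 + n2, p), as the derivation predicts
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```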

Notes

  1. Damelin & Miller 2011, p. 232
  2. Press, William H.; Flannery, Brian P.; Teukolsky, Saul A.; Vetterling, William T. (1989). Numerical Recipes in Pascal. Cambridge University Press. p. 450. ISBN 0-521-37516-9.
