# Covariance

Covariance is a measure of how much two random variables change together. If the greater values of one variable mainly correspond with the greater values of the other, and the same holds for the lesser values (i.e., the variables tend to show similar behavior), the covariance is positive. For example, as a balloon is blown up it gets larger in all dimensions, so those dimensions covary positively. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other (i.e., the variables tend to show opposite behavior), the covariance is negative. For example, if a sealed balloon is squashed in one dimension, it expands in the other two, so those dimensions covary negatively. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance, however, is not easy to interpret; the normalized version of the covariance, the correlation coefficient, shows by its magnitude the strength of the linear relation.

A distinction must be made between (1) the covariance of two random variables, which is a population parameter that can be seen as a property of the joint probability distribution, and (2) the sample covariance, which serves as an estimated value of the parameter.

## Definition

The covariance between two jointly distributed real-valued random variables $X$ and $Y$ with finite second moments is defined as

[$] \operatorname{cov}(X,Y) = \operatorname{E}{\big[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])\big]}, [$]

where $\operatorname{E}[X]$ is the expected value of $X$, also known as the mean of $X$. By using the linearity property of expectations, this can be simplified to

[] \begin{align*} \operatorname{cov}(X,Y) &= \operatorname{E}\left[\left(X - \operatorname{E}\left[X\right]\right) \left(Y - \operatorname{E}\left[Y\right]\right)\right] \\ &= \operatorname{E}\left[X Y - X \operatorname{E}\left[Y\right] - \operatorname{E}\left[X\right] Y + \operatorname{E}\left[X\right] \operatorname{E}\left[Y\right]\right] \\ &= \operatorname{E}\left[X Y\right] - \operatorname{E}\left[X\right] \operatorname{E}\left[Y\right] - \operatorname{E}\left[X\right] \operatorname{E}\left[Y\right] + \operatorname{E}\left[X\right] \operatorname{E}\left[Y\right] \\ &= \operatorname{E}\left[X Y\right] - \operatorname{E}\left[X\right] \operatorname{E}\left[Y\right]. \end{align*} []

However, when $\operatorname{E}[XY] \approx \operatorname{E}[X]\operatorname{E}[Y]$, this last equation is prone to catastrophic cancellation when computed with floating-point arithmetic, and thus should be avoided in computer programs when the data have not been centered beforehand. Numerically stable algorithms should be preferred in this case.
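A minimal Python sketch of this failure mode, with made-up numbers and hypothetical helper names (`cov_naive`, `cov_two_pass`):

```python
# Comparing the shortcut formula E[XY] - E[X]E[Y] with the centered
# two-pass formula E[(X - E[X])(Y - E[Y])] on data with a large offset.

def cov_naive(xs, ys):
    """E[XY] - E[X]E[Y]: prone to catastrophic cancellation."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / n - (sum(xs) / n) * (sum(ys) / n)

def cov_two_pass(xs, ys):
    """E[(X - E[X])(Y - E[Y])]: center the data first, then average."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# The true covariance of {0, 1, 2} paired with itself is 2/3, regardless of offset.
xs = [1e9 + v for v in (0.0, 1.0, 2.0)]
ys = [1e9 + v for v in (0.0, 1.0, 2.0)]
print(cov_two_pass(xs, ys))  # 0.6666..., as expected
print(cov_naive(xs, ys))     # 0.0: the true value is lost to cancellation
```

With the $10^9$ offset, $\operatorname{E}[XY]$ and $\operatorname{E}[X]\operatorname{E}[Y]$ are both near $10^{18}$, where the spacing between adjacent double-precision numbers is about 128, so their tiny difference is rounded away entirely.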

Random variables whose covariance is zero are called uncorrelated. Similarly, random vectors whose covariance matrix is zero in every entry outside the main diagonal are called uncorrelated.

The units of measurement of the covariance $\operatorname{cov}(X,Y)$ are those of $X$ times those of $Y$. By contrast, correlation coefficients, which depend on the covariance, are a dimensionless measure of linear dependence. (In fact, correlation coefficients can simply be understood as a normalized version of covariance.)

### Discrete variables

If each variable has a finite set of equal-probability values, $x_i$ and $y_i$ respectively for $i=1,\dots , n,$ then the covariance can be equivalently written in terms of the means $\operatorname{E}(X)$ and $\operatorname{E}(Y)$ as

[$]\operatorname{cov}(X,Y)=\frac{1}{n}\sum_{i=1}^n (x_i-\operatorname{E}(X))(y_i-\operatorname{E}(Y)).[$]

It can also be equivalently expressed, without directly referring to the means, as

[] \begin{align} \operatorname{cov}(X,Y) &= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{2}(x_i - x_j)\cdot(y_i - y_j) \\ &= \frac{1}{n^2}\sum_i \sum_{j \gt i} (x_i-x_j)\cdot(y_i - y_j). \end{align} []
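The two discrete formulas are easy to check against each other; the data below are arbitrary made-up values:

```python
# Mean-based vs pairwise covariance for equiprobable values.
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 2.0, 7.0]
n = len(xs)

mx, my = sum(xs) / n, sum(ys) / n
cov_mean = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Pairwise form: no means needed, only differences x_i - x_j and y_i - y_j.
cov_pairs = sum((xs[i] - xs[j]) * (ys[i] - ys[j])
                for i in range(n) for j in range(i + 1, n)) / n**2

assert abs(cov_mean - cov_pairs) < 1e-12   # both give 4.25 here
```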

## Properties

• Variance is a special case of the covariance when the two variables are identical:

[$]\operatorname{cov}(X, X) =\operatorname{Var}(X)\equiv\sigma^2(X).[$]

• If $X, Y, W$, and $V$ are real-valued random variables and $a, b, c, d$ are constants, then the following facts are a consequence of the definition of covariance:

[] \begin{align*} \sigma(X, a) &= 0 \\ \sigma(X, X) &= \sigma^2(X) \\ \sigma(X, Y) &= \sigma(Y, X) \\ \sigma(aX, bY) &= ab\, \sigma(X, Y) \\ \sigma(X+a, Y+b) &= \sigma(X, Y) \\ \end{align*} []

and

[$] \sigma(aX+bY, cW+dV) = ac\,\sigma(X,W)+ad\,\sigma(X,V)+bc\,\sigma(Y,W)+bd\,\sigma(Y,V). [$]

• For a sequence $X_1, \ldots, X_n$ of random variables, and constants $a_1,\ldots,a_n$, we have

[$]\sigma^2\left(\sum_{i=1}^n a_iX_i \right) = \sum_{i=1}^n a_i^2\sigma^2(X_i) + 2\sum_{i,j\,:\,i \lt j} a_ia_j\sigma(X_i,X_j) = \sum_{i,j} {a_ia_j\sigma(X_i,X_j)}. [$]
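A numerical spot check of this identity; the sample data and coefficients are made up, and `cov` below is the population-style $1/n$ covariance of equiprobable samples:

```python
def cov(xs, ys):
    """Population (1/n) covariance of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

data = [[1.0, 2.0, 4.0, 3.0],   # X_1
        [2.0, 1.0, 5.0, 2.0],   # X_2
        [0.0, 3.0, 1.0, 2.0]]   # X_3
a = [0.5, -1.0, 2.0]            # constants a_1, a_2, a_3

# Left side: variance of the combination a_1 X_1 + a_2 X_2 + a_3 X_3.
combined = [sum(ai * row[k] for ai, row in zip(a, data)) for k in range(4)]
lhs = cov(combined, combined)

# Right side: the double sum over a_i a_j cov(X_i, X_j).
rhs = sum(a[i] * a[j] * cov(data[i], data[j])
          for i in range(3) for j in range(3))

assert abs(lhs - rhs) < 1e-9
```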

### Uncorrelatedness and independence

If $X$ and $Y$ are independent, then their covariance is zero. This follows because under independence,

[$]\operatorname{E}[XY]=\operatorname{E}[X] \operatorname{E}[Y]. [$]

The converse, however, is not generally true. For example, let $X$ be uniformly distributed in $[-1, 1]$ and let $Y = X^2$. Clearly, $X$ and $Y$ are dependent, but $\sigma(X,Y) = \operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y] = \operatorname{E}[X^3] - \operatorname{E}[X]\operatorname{E}[X^2] = 0$, since $\operatorname{E}[X] = 0$ and $\operatorname{E}[X^3] = 0$ by symmetry.

In this case, the relationship between $Y$ and $X$ is non-linear, while correlation and covariance are measures of linear dependence between two variables. This example shows that if two variables are uncorrelated, that does not in general imply that they are independent. However, if two variables are jointly normally distributed (but not if they are merely individually normally distributed), uncorrelatedness does imply independence.
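A discrete stand-in for this example ($X$ equiprobable on a symmetric grid rather than continuously uniform) confirms the zero covariance numerically:

```python
# X takes 21 equally likely values on a symmetric grid in [-1, 1]; Y = X^2.
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x for x in xs]
n = len(xs)

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Y is a deterministic function of X, yet the covariance vanishes:
# cov(X, Y) = E[X^3] - E[X] E[X^2] = 0 by symmetry.
assert abs(cov) < 1e-12
```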

### Relationship to inner products

Many of the properties of covariance can be extracted elegantly by observing that it satisfies similar properties to those of an inner product:

| Property | Description |
| --- | --- |
| Bilinearity | For constants $a$ and $b$, $\sigma(aX + bY, Z) = a\,\sigma(X,Z) + b\,\sigma(Y,Z).$ |
| Symmetry | $\sigma(X,Y) = \sigma(Y,X).$ |
| Positive semi-definiteness | $\sigma^2(X) = \sigma(X,X) \geq 0$, and $\sigma(X,X) = 0$ implies that $X$ is a constant random variable (almost surely). |

In fact, these properties imply that the covariance defines an inner product over the quotient vector space obtained by taking the subspace of random variables with finite second moment and identifying any two that differ by a constant. (This identification turns the positive semi-definiteness above into positive definiteness.) That quotient vector space is isomorphic to the subspace of random variables with finite second moment and mean zero; on that subspace, the covariance is exactly the $L^2$ inner product of real-valued functions on the sample space.

As a result, for random variables with finite variance, the inequality

[$]|\sigma(X,Y)| \le \sqrt{\sigma^2(X) \sigma^2(Y)} [$]

holds via the Cauchy–Schwarz inequality.

The covariance is sometimes called a measure of "linear dependence" between the two random variables. That does not mean the same thing as in the context of linear algebra (see linear dependence). When the covariance is normalized, one obtains the Pearson correlation coefficient, which gives the goodness of the fit for the best possible linear function describing the relation between the variables. In this sense covariance is a linear gauge of dependence.

## Pearson's product-moment coefficient

The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or "Pearson's correlation coefficient", commonly called simply "the correlation coefficient". It is obtained by dividing the covariance of the two variables by the product of their standard deviations. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton.

The population correlation coefficient $\rho_{X,Y}$ between two random variables $X$ and $Y$ with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as:

[$]\rho_{X,Y}=\mathrm{corr}(X,Y)={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={\operatorname{E}[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}.[$]

The formula for $\rho$ can be expressed in terms of uncentered moments. Since

• $\mu_X=\operatorname{E}[X]$
• $\mu_Y=\operatorname{E}[Y]$
• $\sigma_X^2=\operatorname{E}[(X-\operatorname{E}[X])^2]=\operatorname{E}[X^2]-\operatorname{E}[X]^2$
• $\sigma_Y^2=\operatorname{E}[(Y-\operatorname{E}[Y])^2]=\operatorname{E}[Y^2]-\operatorname{E}[Y]^2$
• $\operatorname{E}[(X-\mu_X)(Y-\mu_Y)]=\operatorname{E}[XY]-\operatorname{E}[X]\operatorname{E}[Y],\,$

the formula for $\rho$ can also be written as

[$]\rho_{X,Y}=\frac{\operatorname{E}[XY]-\operatorname{E}[X]\operatorname{E}[Y]}{\sqrt{\operatorname{E}[X^2]-\operatorname{E}[X]^2}~\sqrt{\operatorname{E}[Y^2]- \operatorname{E}[Y]^2}}.[$]

The Pearson correlation is defined only if both of the standard deviations are finite and nonzero. It is a corollary of the Cauchy–Schwarz inequality that the correlation cannot exceed 1 in absolute value. The correlation coefficient is symmetric: $\operatorname{corr}(X,Y)=\operatorname{corr}(Y,X)$.

If we have a series of $n$ measurements of $X$ and $Y$ written as $x_i$ and $y_i$ for $i = 1, \ldots, n$, then the sample correlation coefficient can be used to estimate the population Pearson correlation $\rho_{XY}$ between $X$ and $Y$. The sample correlation coefficient is written:

[$] \rho_{xy}=\frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{ns_x s_y} =\frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum\limits_{i=1}^n (x_i-\bar{x})^2 \sum\limits_{i=1}^n (y_i-\bar{y})^2}}, [$]

where $\overline{x}$ and $\overline{y}$ are the sample means of $X$ and $Y$, and $s_x$ and $s_y$ are the sample standard deviations of $X$ and $Y$.

This can also be written as:

[$] \rho_{xy}=\frac{\sum x_iy_i-n \bar{x} \bar{y}}{n s_x s_y}=\frac{n\sum x_iy_i-\sum x_i\sum y_i}{\sqrt{n\sum x_i^2-(\sum x_i)^2}~\sqrt{n\sum y_i^2-(\sum y_i)^2}}. [$]
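The centered and the expanded ("computational") forms can be compared directly; the data below are arbitrary:

```python
import math

xs = [1.0, 2.0, 3.0, 5.0, 8.0]
ys = [0.5, 2.5, 2.0, 4.0, 7.5]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Centered form: products of deviations from the sample means.
num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                sum((y - my) ** 2 for y in ys))
r_centered = num / den

# Expanded form: raw sums only, no explicit means.
sxy = sum(x * y for x, y in zip(xs, ys))
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)
r_raw = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

assert abs(r_centered - r_raw) < 1e-12
assert -1.0 <= r_centered <= 1.0        # Cauchy-Schwarz bound
```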

### Mathematical properties

The Pearson correlation is +1 in the case of a perfect direct (increasing) linear relationship (correlation), −1 in the case of a perfect decreasing (inverse) linear relationship (anticorrelation), and some value between −1 and 1 in all other cases, indicating the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.

If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true because the correlation coefficient detects only linear dependencies between two variables. For example, suppose the random variable $X$ is symmetrically distributed about zero, and $Y = X^2$. Then $Y$ is completely determined by $X$, so that $X$ and $Y$ are perfectly dependent, but their correlation is zero; they are uncorrelated. However, in the special case when $X$ and $Y$ are jointly normal, uncorrelatedness is equivalent to independence.

The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation).

A key mathematical property of the Pearson correlation coefficient is that it is invariant to separate changes in location and scale in the two variables. That is, we may transform $X$ to $a+bX$ and transform $Y$ to $c+dY$, where $a, b, c$ and $d$ are constants with $b, d > 0$, without changing the correlation coefficient; if $b$ or $d$ is negative, the correlation changes sign but not magnitude. (This fact holds for both the population and sample Pearson correlation coefficients.) Note that more general linear transformations do change the correlation: see a later section for an application of this.
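This invariance is easy to verify numerically; the `corr` helper and the data are illustrative:

```python
import math

def corr(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 4.0, 6.0]
r = corr(xs, ys)

# Positive scales b, d: the correlation is unchanged.
assert abs(corr([10 + 2 * x for x in xs], [-5 + 0.5 * y for y in ys]) - r) < 1e-12
# A negative scale flips only the sign.
assert abs(corr([10 - 2 * x for x in xs], ys) + r) < 1e-12
```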

### Interpretation

The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between $X$ and $Y$ perfectly, with all data points lying on a line for which $Y$ increases as $X$ increases. A value of −1 implies that all data points lie on a line for which $Y$ decreases as $X$ increases. A value of 0 implies that there is no linear correlation between the variables.

More generally, note that $(X_i - \overline{x})(Y_i - \overline{y})$ is positive if and only if $X_i$ and $Y_i$ lie on the same side of their respective means. Thus the correlation coefficient is positive if $X_i$ and $Y_i$ tend to be simultaneously greater than, or simultaneously less than, their respective means, and negative if they tend to lie on opposite sides of their respective means. Moreover, the stronger either tendency is, the larger the absolute value of the correlation coefficient.

## Common misconceptions

### Correlation and causality

The conventional dictum that "correlation does not imply causation" means that correlation cannot be used to infer a causal relationship between the variables. This dictum should not be taken to mean that correlations cannot indicate the potential existence of causal relations. However, the causes underlying the correlation, if any, may be indirect and unknown, and high correlations also overlap with identity relations (tautologies), where no causal process exists. Consequently, establishing a correlation between two variables is not a sufficient condition to establish a causal relationship (in either direction).

A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health in people is less so. Does improved mood lead to improved health, or does good health lead to good mood, or both? Or does some other factor underlie both? In other words, a correlation can be taken as evidence for a possible causal relationship, but cannot indicate what the causal relationship, if any, might be.

### Correlation and linearity

The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value generally does not completely characterize their relationship. In particular, if the conditional mean of $Y$ given $X$, denoted $\operatorname{E}[Y|X]$, is not linear in $X$, the correlation coefficient will not fully determine the form of $\operatorname{E}[Y|X]$.

## The bivariate normal distribution

In probability theory and statistics, the bivariate normal distribution is a generalization of the one-dimensional (univariate) normal distribution to two dimensions. One definition is that a random vector is said to be bivariate normally distributed if every linear combination of its two components has a univariate normal distribution. The bivariate normal distribution is often used to describe, at least approximately, any set of two (possibly) correlated real-valued random variables each of which clusters around a mean value.

## Definitions

### Notation and parameterization

The bivariate normal distribution of a two-dimensional random vector $\mathbf{X} = (X_1,X_2)^{\mathrm T}$ can be written in the following notation:

[$] \mathbf{X}\ \sim\ \mathcal{N}(\boldsymbol\mu,\, \boldsymbol\Sigma), [$]

or, to make it explicitly known that $\mathbf{X}$ is two-dimensional,

[$] \mathbf{X}\ \sim\ \mathcal{N}_2(\boldsymbol\mu,\, \boldsymbol\Sigma), [$]

with two-dimensional mean vector

[$] \boldsymbol\mu = \operatorname{E}[\mathbf{X}] = ( \operatorname{E}[X_1], \operatorname{E}[X_2]) ^ {\mathrm T}, [$]

and $2 \times 2$ covariance matrix

[$] \Sigma_{i,j} = \operatorname{E} [(X_i - \mu_i)( X_j - \mu_j)] = \operatorname{Cov}[X_i, X_j] [$]

such that $1 \le i,j \le 2.$ The inverse of the covariance matrix is called the precision matrix, denoted by $\boldsymbol{Q}=\boldsymbol\Sigma^{-1}$.
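For the $2 \times 2$ case the precision matrix has a closed form; the entries below are made up, with the only requirement that the determinant be nonzero:

```python
# Sigma = [[s11, s12], [s12, s22]]; its inverse swaps the diagonal entries,
# negates the off-diagonal entry, and divides by the determinant.
s11, s12, s22 = 2.0, 0.6, 1.0
det = s11 * s22 - s12 * s12
Q = [[s22 / det, -s12 / det],
     [-s12 / det, s11 / det]]

# Sigma @ Q should be the 2x2 identity.
I = [[s11 * Q[0][0] + s12 * Q[1][0], s11 * Q[0][1] + s12 * Q[1][1]],
     [s12 * Q[0][0] + s22 * Q[1][0], s12 * Q[0][1] + s22 * Q[1][1]]]
assert abs(I[0][0] - 1.0) < 1e-12 and abs(I[1][1] - 1.0) < 1e-12
assert abs(I[0][1]) < 1e-12 and abs(I[1][0]) < 1e-12
```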

### Standard normal random vector

A real random vector $\mathbf{X} = (X_1,X_2)^{\mathrm T}$ is called a standard normal random vector if all of its components $X_k$ are independent and each is a zero-mean unit-variance normally distributed random variable, i.e. if $X_k \sim\ \mathcal{N}(0,1)$ for all $k$.

### Centered normal random vector

A real random vector $\mathbf{X} = (X_1,X_2)^{\mathrm T}$ is called a centered normal random vector if there exists a deterministic $2 \times \ell$ matrix $\boldsymbol{A}$ such that $\boldsymbol{A} \mathbf{Z}$ has the same distribution as $\mathbf{X}$, where $\mathbf{Z}$ is a standard normal random vector with $\ell$ components.

### Normal random vector

A real random vector $\mathbf{X} = (X_1,X_2)^{\mathrm T}$ is called a normal random vector if there exists a random $\ell$-vector $\mathbf{Z}$, which is a standard normal random vector, a vector $\boldsymbol\mu$, and a $2 \times \ell$ matrix $\boldsymbol{A}$, such that $\mathbf{X}=\boldsymbol{A} \mathbf{Z} + \boldsymbol\mu$.

### Density function

The bivariate normal distribution is said to be "non-degenerate" when the symmetric covariance matrix $\boldsymbol\Sigma$ is positive definite. The distribution has density

[$] f(x,y) = \frac{1}{2 \pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp \left( -\frac{1}{2(1-\rho^2)}\left[ \left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 \right] \right) [$]

where $\rho$ is the correlation between $X$ and $Y$ and where $\sigma_X\gt0$ and $\sigma_Y\gt0$.
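The density formula transcribes directly into code; the parameters below are made up. At the mean the bracketed quadratic vanishes, so the peak height is just the normalizing prefactor, and a crude Riemann sum recovers a total mass close to 1:

```python
import math

def bvn_pdf(x, y, mx, my, sx, sy, rho):
    """Non-degenerate bivariate normal density (requires |rho| < 1)."""
    zx = (x - mx) / sx
    zy = (y - my) / sy
    q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
    return math.exp(-q / 2) / (2 * math.pi * sx * sy * math.sqrt(1 - rho * rho))

mx, my, sx, sy, rho = 1.0, -2.0, 1.5, 0.5, 0.3

# At the mean the exponent vanishes, so the peak equals the prefactor.
peak = bvn_pdf(mx, my, mx, my, sx, sy, rho)
assert abs(peak - 1 / (2 * math.pi * sx * sy * math.sqrt(1 - rho**2))) < 1e-15

# A crude Riemann sum over +/- 6 standard deviations integrates to about 1.
hx, hy = sx / 10, sy / 10
total = sum(bvn_pdf(mx - 6 * sx + i * hx, my - 6 * sy + j * hy,
                    mx, my, sx, sy, rho)
            for i in range(120) for j in range(120)) * hx * hy
assert 0.98 < total < 1.02
```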

### Joint normality

#### Normally distributed and independent

If $X$ and $Y$ are normally distributed and independent, they are "jointly normally distributed", i.e., the pair $(X,Y)$ must have a bivariate normal distribution. However, a pair of jointly normally distributed variables need not be independent; they are independent if and only if they are uncorrelated ($\rho = 0$).

#### Two normally distributed random variables need not be jointly bivariate normal

The fact that two random variables $X$ and $Y$ both have a normal distribution does not imply that the pair $(X,Y)$ has a joint normal distribution. A simple example is one in which $X$ has a normal distribution with expected value 0 and variance 1, and $Y=X$ if $|X| \gt c$ and $Y=-X$ if $|X| \lt c$, where $c \gt 0$. There are similar counterexamples for more than two random variables. In general, sums of such variables follow mixture distributions rather than normal ones.
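A quick simulation makes the failure visible (the value of $c$ is arbitrary): since $X + Y$ equals $2X$ when $|X| > c$ and exactly $0$ otherwise, the sum has an atom at zero and therefore cannot be normal, even though each marginal is.

```python
import random

random.seed(1)
c = 1.0
n = 100_000
zeros = 0
for _ in range(n):
    x = random.gauss(0, 1)
    y = x if abs(x) > c else -x     # Y has the same N(0, 1) marginal as X
    if x + y == 0.0:
        zeros += 1

# P(X + Y = 0) = P(|X| < c), which is about 0.68 for c = 1; a normal
# distribution assigns probability 0 to any single point.
assert 0.6 < zeros / n < 0.76
```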

#### Correlations and independence

In general, random variables may be uncorrelated but statistically dependent. But if a random vector has a multivariate normal distribution then any two or more of its components that are uncorrelated are independent. This implies that any two or more of its components that are pairwise independent are independent. But, as pointed out just above, it is not true that two random variables that are (separately, marginally) normally distributed and uncorrelated are independent.

### Conditional distribution

The conditional distribution of $X_1$ given $X_2 = a$ is

[$]X_1\mid X_2=a \ \sim\ \mathcal{N}\left(\mu_1+\frac{\sigma_1}{\sigma_2}\rho( a - \mu_2),\, (1-\rho^2)\sigma_1^2\right), [$]

where $\rho$ is the correlation coefficient between $X_1$ and $X_2$.

#### Bivariate conditional expectation

By taking the expectation of the conditional distribution $X_1\mid X_2$ above, the conditional expectation of $X_1$ given $X_2$ equals:

[$]\operatorname{E}(X_1 \mid X_2=x_2) = \mu_1 + \rho \frac{\sigma_1}{\sigma_2}(x_2 - \mu_2).[$]
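The conditional-mean formula can be checked by simulation (all parameters are made up): build $(X_1, X_2)$ from two independent standard normals so that the pair is bivariate normal with correlation $\rho$, then compare the empirical mean of $X_1$ near $X_2 = a$ with the formula.

```python
import math
import random

random.seed(0)
mu1, mu2, s1, s2, rho = 2.0, -1.0, 3.0, 1.5, 0.8
n = 200_000

pairs = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x2 = mu2 + s2 * z2
    # This construction gives corr(X1, X2) = rho with the right marginals.
    x1 = mu1 + s1 * (rho * z2 + math.sqrt(1 - rho**2) * z1)
    pairs.append((x1, x2))

a = mu2 + s2                       # condition one marginal SD above the mean
near = [x1 for x1, x2 in pairs if abs(x2 - a) < 0.1]
empirical = sum(near) / len(near)
predicted = mu1 + rho * (s1 / s2) * (a - mu2)
assert abs(empirical - predicted) < 0.25   # loose Monte Carlo tolerance
```

The bin of width 0.2 around $a$ is a crude stand-in for exact conditioning; with 200,000 draws it contains several thousand samples, which is ample for the loose tolerance used here.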