# Variance and Moments

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers is spread out from its mean. The variance has a central role in statistics: it is used in descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling, among many other areas. This makes it a central quantity in numerous fields such as physics, biology, chemistry, economics, and finance. The variance is the square of the standard deviation and the second central moment of a distribution, and it is often represented by $\sigma^2$ or $\operatorname{Var}(X)$.

## Definition

The variance of a random variable $X$ is the expected value of the squared deviation from the mean of $X$, $\mu = \operatorname{E}[X]$:

[$] \operatorname{Var}(X) = \operatorname{E}\left[(X - \mu)^2 \right]. [$]

This definition encompasses random variables that are generated by processes that are discrete, continuous, neither, or mixed. The variance is also equivalent to the second cumulant of a probability distribution that generates $X$. The variance is typically designated as $\operatorname{Var}(X)$, $\sigma^2_X$, or simply $\sigma^2$ (pronounced "sigma squared"). The expression for the variance can be expanded:

[]\begin{align*} \operatorname{Var}(X) &= \operatorname{E}\left[(X - \operatorname{E}[X])^2\right] \\ &= \operatorname{E}\left[X^2 - 2X\operatorname{E}[X] + (\operatorname{E}[X])^2\right] \\ &= \operatorname{E}\left[X^2\right] - 2\operatorname{E}[X]\operatorname{E}[X] + (\operatorname{E}[X])^2 \\ &= \operatorname{E}\left[X^2 \right] - (\operatorname{E}[X])^2 \end{align*} []

A mnemonic for the above expression is "mean of square minus square of mean".
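
The equivalence of the two forms can be checked numerically. The following is a minimal sketch, assuming a small set of equally likely values; the function names and the data are illustrative only and do not come from the text above.

```python
from fractions import Fraction

def variance_by_definition(values):
    """E[(X - mu)^2] for a list of equally likely values."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def variance_by_shortcut(values):
    """Mean of square minus square of mean, for equally likely values."""
    n = len(values)
    mean_of_square = sum(x * x for x in values) / n
    square_of_mean = (sum(values) / n) ** 2
    return mean_of_square - square_of_mean

# Illustrative data; exact fractions avoid floating-point rounding.
data = [Fraction(v) for v in (2, 4, 4, 4, 5, 5, 7, 9)]
print(variance_by_definition(data), variance_by_shortcut(data))
```

Exact fractions are used because, in floating-point arithmetic, the "mean of square minus square of mean" form can lose precision when the mean is large relative to the spread.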

### Continuous random variable

If the random variable $X$ represents samples generated by a continuous distribution with probability density function $f(x)$, then the population variance is given by

[$]\operatorname{Var}(X) =\sigma^2 =\int (x-\mu)^2 \, f(x) \, dx\, =\int x^2 \, f(x) \, dx\, - \mu^2[$]

where $\mu$ is the expected value,

[$]\mu = \int x \, f(x) \, dx\, [$]

and where the integrals are definite integrals taken for $x$ ranging over the range of $X$.
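
As a rough numerical illustration of these integrals, the sketch below approximates them with a simple midpoint rule. The density $f(x) = 2x$ on $[0, 1]$ is an assumption chosen only for this example (its exact mean is $2/3$ and its exact variance is $1/18$), and `midpoint_integral` is a hypothetical helper, not a library function.

```python
def midpoint_integral(g, a, b, steps=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def density(x):
    """Example density f(x) = 2x on [0, 1] (assumed for illustration)."""
    return 2.0 * x

# mu = integral of x f(x) dx
mu = midpoint_integral(lambda x: x * density(x), 0.0, 1.0)
# Var(X) = integral of (x - mu)^2 f(x) dx
var_central = midpoint_integral(lambda x: (x - mu) ** 2 * density(x), 0.0, 1.0)
# Var(X) = integral of x^2 f(x) dx - mu^2
var_shortcut = midpoint_integral(lambda x: x * x * density(x), 0.0, 1.0) - mu ** 2

print(mu, var_central, var_shortcut)
```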

If a continuous distribution does not have an expected value, as is the case for the Cauchy distribution, it does not have a variance either. Many other distributions for which the expected value does exist also do not have a finite variance because the integral in the variance definition diverges. An example is a Pareto distribution whose index $k$ satisfies $1 \lt k \leq 2$.

### Discrete random variable

If the generator of random variable $X$ is discrete with probability mass function $x_1 \mapsto p_1, x_2 \mapsto p_2, \ldots, x_n \mapsto p_n$ then

[$]\operatorname{Var}(X) = \sum_{i=1}^n p_i\cdot(x_i - \mu)^2,[$]

or equivalently

[$]\operatorname{Var}(X) = \sum_{i=1}^n p_i x_i ^2- \mu^2,[$]

where $\mu$ is the expected value, i.e.

[$]\mu = \sum_{i=1}^n p_i\cdot x_i. [$]

(When such a discrete weighted variance is specified by weights whose sum is not 1, then one divides by the sum of the weights.)
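
The discrete formulas, including the division by the sum of the weights, can be sketched as follows. The fair six-sided die and the function name `weighted_variance` are assumptions made for illustration.

```python
from fractions import Fraction

def weighted_variance(values, weights):
    """Variance of a discrete distribution; weights need not sum to 1."""
    total = sum(weights)
    mu = sum(w * x for x, w in zip(values, weights)) / total
    return sum(w * (x - mu) ** 2 for x, w in zip(values, weights)) / total

faces = [1, 2, 3, 4, 5, 6]

# Probabilities that already sum to 1 (a fair die).
die_variance = weighted_variance(faces, [Fraction(1, 6)] * 6)

# Unnormalized weights give the same result after dividing by their sum.
unnormalized = weighted_variance(faces, [Fraction(1)] * 6)

print(die_variance, unnormalized)
```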

The variance of a set of $n$ equally likely values can be written as

[$] \operatorname{Var}(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2, [$]

where $\mu$ is the expected value, i.e.,

[$]\mu = \frac{1}{n}\sum_{i=1}^n x_i. [$]

The variance of a set of $n$ equally likely values can be equivalently expressed, without directly referring to the mean, in terms of squared deviations of all points from each other:[1]

[$] \operatorname{Var}(X) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{2}(x_i - x_j)^2 = \frac{1}{n^2}\sum_i \sum_{j \gt i} (x_i-x_j)^2. [$]
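
A quick check of the pairwise form against the mean-based form is sketched below; the three data points and the function names are chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

def variance_from_mean(values):
    """(1/n) * sum of squared deviations from the mean."""
    n = len(values)
    mu = Fraction(sum(values), n)
    return sum((x - mu) ** 2 for x in values) / n

def variance_from_pairs(values):
    """(1/n^2) * sum over pairs i < j of (x_i - x_j)^2; no mean needed."""
    n = len(values)
    pair_sum = sum((a - b) ** 2 for a, b in combinations(values, 2))
    return Fraction(pair_sum, n * n)

points = [1, 3, 8]  # illustrative integer data
print(variance_from_mean(points), variance_from_pairs(points))
```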

## Properties

### Basic properties

| Property | Formula |
| --- | --- |
| Non-negative | $\operatorname{Var}(X)\ge 0$ |
| Constant | $\operatorname{Var}(c) = 0$ |
| Translation | $\operatorname{Var}(X + c) = \operatorname{Var}(X)$ |
| Scaling | $\operatorname{Var}(cX) = c^2 \operatorname{Var}(X)$ |
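
The translation and scaling properties listed above follow directly from the definition and the linearity of expectation. As a short worked step, write $\mu = \operatorname{E}[X]$, so that $\operatorname{E}[cX + d] = c\mu + d$ for constants $c$ and $d$ (the symbol $d$ is introduced here only for illustration):

[$] \operatorname{Var}(cX + d) = \operatorname{E}\left[(cX + d - c\mu - d)^2\right] = c^2\operatorname{E}\left[(X-\mu)^2\right] = c^2\operatorname{Var}(X). [$]

Setting $c = 1$ gives the translation property, and setting $d = 0$ gives the scaling property.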

### Formulae for the variance

A formula often used for deriving the variance of a theoretical distribution is as follows:

[$] \operatorname{Var}(X) =\operatorname{E}(X^2) - (\operatorname{E}(X))^2. [$]

This will be useful when it is possible to derive formulae for the expected value and for the expected value of the square.

### Calculation from the CDF

The population variance for a non-negative random variable $X$ can be expressed in terms of the cumulative distribution function $F$ using

[$] \operatorname{Var}(X) = 2\int_0^\infty u\,(1-F(u))\,du - \left(\int_0^\infty (1-F(u))\,du\right)^2. [$]

This expression can be used to calculate the variance in situations where the CDF, but not the density, can be conveniently expressed.
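
As an illustration, consider an exponential distribution, chosen here only as an example, with a rate parameter $\lambda \gt 0$ that is not part of the text above. Its CDF is $F(u) = 1 - e^{-\lambda u}$ for $u \geq 0$, so $1 - F(u) = e^{-\lambda u}$ and

[$] 2\int_0^\infty u\,e^{-\lambda u}\,du - \left(\int_0^\infty e^{-\lambda u}\,du\right)^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}, [$]

which agrees with the known variance of the exponential distribution.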

### Characteristic property

The second moment of a random variable attains its minimum value when taken around the first moment (i.e., the mean) of the random variable, i.e. $\mathrm{argmin}_m\,\mathrm{E}((X - m)^2) = \mathrm{E}(X)\,$. Conversely, if a continuous function $\varphi$ satisfies $\mathrm{argmin}_m\,\mathrm{E}(\varphi(X - m)) = \mathrm{E}(X)\,$ for all random variables $X$, then it is necessarily of the form $\varphi(x) = a x^2 + b$, where $a \gt 0$. This also holds in the multidimensional case.[2]

## Standard Deviation

The standard deviation of a random variable, statistical population, data set, or probability distribution is the square root of its variance. It is algebraically simpler, though in practice less robust, than the average absolute deviation.[3][4]

A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data. There are also other measures of deviation from the norm, including mean absolute deviation, which provide different mathematical properties from standard deviation.[5]

### Identities and mathematical properties

| Property | Formula |
| --- | --- |
| Constant | $\sigma(c) = 0$ |
| Translation | $\sigma(X + c) = \sigma(X)$ |
| Scaling | $\sigma(cX) = \lvert c \rvert\, \sigma(X)$ |

### Chebyshev's Inequality

Chebyshev's inequality (also spelled Tchebysheff's inequality) guarantees that in any probability distribution with a defined mean and variance, "nearly all" values are close to the mean. The precise statement is that no more than $1/k^2$ of the distribution's values can be more than $k$ standard deviations away from the mean (or equivalently, at least $1 - 1/k^2$ of the distribution's values are within $k$ standard deviations of the mean). In statistics, the rule is often called Chebyshev's theorem. The inequality has great utility because it can be applied to completely arbitrary distributions (unknown except for mean and variance). For example, it can be used to prove the weak law of large numbers.

In practical usage, in contrast to the 68–95–99.7 rule, which applies to normal distributions, under Chebyshev's inequality a minimum of just 75% of values must lie within two standard deviations of the mean and 89% within three standard deviations.[6][7]

#### Probabilistic statement

Let $X$ (integrable) be a random variable with finite expected value $\mu$ and finite non-zero variance $\sigma^2$. Then for any real number $k\gt0$,

[$] \operatorname{P}(|X-\mu|\geq k\sigma) \leq \frac{1}{k^2}. [$]

Only the case $k \gt 1$ is useful. When $k \leq 1$ the right-hand side satisfies

[$] \frac{1}{k^2} \geq 1 [$]

and the inequality is trivial as all probabilities are ≤ 1. As an example, using $k = \sqrt{2}$ shows that the probability that values lie outside the interval $(\mu - \sqrt{2}\sigma, \mu + \sqrt{2}\sigma)$ does not exceed $\frac{1}{2}$.

Because it can be applied to completely arbitrary distributions (unknown except for mean and variance), the inequality generally gives a poor bound compared to what might be deduced if more aspects are known about the distribution involved.
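
How loose the bound can be is easy to see empirically. The sketch below is only an illustration under stated assumptions: it draws a seeded sample from an exponential distribution (an arbitrary choice), treats the sample as the whole population, and compares the observed tail fraction with $1/k^2$. The helper `tail_fraction` is hypothetical.

```python
import random
from statistics import fmean, pstdev

def tail_fraction(values, k):
    """Fraction of values at least k population standard deviations from the mean."""
    mu = fmean(values)
    sigma = pstdev(values)
    outside = sum(1 for x in values if abs(x - mu) >= k * sigma)
    return outside / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.expovariate(1.0) for _ in range(5_000)]

for k in (1.5, 2.0, 3.0):
    # Observed tail fraction next to the Chebyshev bound 1/k^2.
    print(k, tail_fraction(sample, k), 1 / k ** 2)
```

Because the inequality holds for any distribution with finite variance, including the empirical distribution of a sample, the observed fraction should never exceed the bound; for a specific distribution it is typically far below it.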

## Moments

In mathematics, a moment is a specific quantitative measure, used in both mechanics and statistics, of the shape of a set of points. If the points represent probability density, then the zeroth moment is the total probability (i.e. one), the first moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis. The mathematical concept is closely related to the concept of moment in physics.

For a bounded distribution of mass or probability, the collection of all the moments (of all orders, from 0 to $\infty$) uniquely determines the distribution.

### Significance of the moments

The $n$-th moment of a real-valued continuous function $f(x)$ of a real variable about a value $c$ is

[$]\mu_n=\int_{-\infty}^\infty (x - c)^n\,f(x)\,dx.[$]

The moment of a function, without further explanation, usually refers to the above expression with $c = 0$.

For the second and higher moments, the central moments (moments about the mean, with $c$ being the mean) are usually used rather than the moments about zero, because they provide clearer information about the distribution's shape.

The $n$-th moment about zero of a probability density function $f(x)$ is the expected value of $X^n$ and is called a raw moment or crude moment.[8] The moments about its mean $\mu$ are called central moments; these describe the shape of the function, independently of translation.

If $f$ is a probability density function, then the value of the integral above is called the $n$-th moment of the probability distribution. More generally, if $F$ is a cumulative probability distribution function of any probability distribution, which may not have a density function, then the $n$-th moment of the probability distribution is given by the Riemann–Stieltjes integral

[$]\mu'_n = \operatorname{E} \left [ X^n \right ] =\int_{-\infty}^\infty x^n\,dF(x)\,[$]

where $X$ is a random variable that has this cumulative distribution $F$, and $\operatorname{E}$ is the expectation operator or mean.

When

[$]\operatorname{E}\left [\left |X^n \right | \right ] = \int_{-\infty}^\infty |x^n|\,dF(x) = \infty,[$]

then the moment is said not to exist. If the $n$-th moment about any point exists, so does the ($n$ − 1)-th moment (and thus, all lower-order moments) about every point.

The zeroth moment of any probability density function is 1, since the area under any probability density function must be equal to one.

### Central Moments

Central moments are used in preference to ordinary moments, computed in terms of deviations from the mean instead of from zero, because the higher-order central moments relate only to the spread and shape of the distribution, rather than also to its location.

The $n$th moment about the mean (or $n$th central moment) of a real-valued random variable $X$ is the quantity

[$] \mu_n = \operatorname{E}\left[(X - \operatorname{E}[X])^n\right] [$]

where $\operatorname{E}$ is the expectation operator. For a continuous univariate probability distribution with probability density function $f(x)$, the $n$th moment about the mean $\mu$ is

[$] \mu_n = \operatorname{E} \left[ ( X - \operatorname{E}[X] )^n \right] = \int_{-\infty}^{+\infty} (x - \mu)^n f(x)\,\mathrm{d} x. [$]

For random variables that have no mean, such as the Cauchy distribution, central moments are not defined.

The first few central moments have intuitive interpretations:

• The "zeroth" central moment $\mu_0$ is 1.
• The first central moment $\mu_1$ is 0 (not to be confused with the first (raw) moment itself, the expected value or mean).
• The second central moment $\mu_2$ is called the variance, and is usually denoted $\sigma^2$, where $\sigma$ represents the standard deviation.
• The third and fourth central moments are used to define the standardized moments which are used to define skewness and kurtosis, respectively.

#### Properties

The $n$th central moment is translation-invariant, i.e. for any random variable $X$ and any constant $c$, we have

[$]\mu_n(X+c)=\mu_n(X).\,[$]

For all $n$, the $n$th central moment is homogeneous of degree $n$:

[$]\mu_n(cX)=c^n\mu_n(X).\,[$]

Only for $n \in \{1, 2, 3\}$ do we have an additivity property for random variables $X$ and $Y$ that are independent:

[$]\mu_n(X+Y)=\mu_n(X)+\mu_n(Y).\,[$]
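
The additivity property, and its failure at $n = 4$, can be checked exactly on small discrete distributions. This is a sketch under assumptions: the two probability mass functions are arbitrary examples, and `central_moment` and `sum_of_independent` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

def central_moment(pmf, n):
    """n-th central moment of a discrete distribution given as {value: probability}."""
    mu = sum(p * x for x, p in pmf.items())
    return sum(p * (x - mu) ** n for x, p in pmf.items())

def sum_of_independent(pmf_x, pmf_y):
    """Distribution of X + Y for independent X and Y (discrete convolution)."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out

# Two illustrative, independent discrete random variables.
X = {0: Fraction(3, 4), 1: Fraction(1, 4)}
Y = {0: Fraction(1, 2), 2: Fraction(1, 2)}
S = sum_of_independent(X, Y)

for n in (1, 2, 3, 4):
    lhs = central_moment(S, n)
    rhs = central_moment(X, n) + central_moment(Y, n)
    print(n, lhs, rhs, lhs == rhs)
```

For $n = 4$ the two sides differ by the cross term $6\,\mu_2(X)\,\mu_2(Y)$, which is nonzero whenever both variances are nonzero.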

#### Relation to moments about the origin

Sometimes it is convenient to convert moments about the origin to moments about the mean. The general equation for converting the $n$th-order moment about the origin to the moment about the mean is

[$] \mu_n = \mathrm{E}\left[\left(X - \mathrm{E}\left[X\right]\right)^n\right] = \sum_{j=0}^n {n \choose j} (-1) ^{n-j} \mu'_j \mu^{n-j}, [$]

where μ is the mean of the distribution, and the moment about the origin is given by

[$] \mu'_j = \int_{-\infty}^{+\infty} x^j f(x)\,dx = \mathrm{E}\left[X^j\right] [$]

For the cases $n = 2, 3, 4$, which are of most interest because of the relations to variance, skewness, and kurtosis, respectively, this formula becomes (noting that $\mu = \mu'_1$ and $\mu'_0=1$):

[$]\mu_2 = \mu'_2 - \mu^2\,[$]

which is commonly referred to as $\mathrm{Var}\left(X\right) = \mathrm{E}\left[X^2\right] - \left(\mathrm{E}\left[X\right]\right)^2$,

[$]\mu_3 = \mu'_3 - 3 \mu \mu'_2 +2 \mu^3\,[$]
[$]\mu_4 = \mu'_4 - 4 \mu \mu'_3 + 6 \mu^2 \mu'_2 - 3 \mu^4.\,[$]

... and so on,[10] following Pascal's triangle, i.e.

[$]\mu_5 = \mu'_5 - 5 \mu \mu'_4 + 10 \mu^2 \mu'_3 - 10 \mu^3 \mu'_2 + 4 \mu^5,\,[$]

because $5\mu^4\mu'_1 - \mu^5 \mu'_0 = 5\mu^4\mu - \mu^5 = 5 \mu^5 - \mu^5 = 4 \mu^5$.
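
The general conversion formula translates directly into code. The sketch below assumes the raw moments are supplied as a list `raw` with `raw[j]` equal to $\mu'_j$; the fair six-sided die used as input and the function name `central_from_raw` are illustrative choices.

```python
from fractions import Fraction
from math import comb

def central_from_raw(raw, n):
    """n-th central moment from raw moments, where raw[j] = mu'_j and raw[0] == 1."""
    mu = raw[1]
    return sum(
        comb(n, j) * (-1) ** (n - j) * raw[j] * mu ** (n - j)
        for j in range(n + 1)
    )

# Raw moments mu'_0 .. mu'_5 of a fair six-sided die (example input).
faces = range(1, 7)
raw = [Fraction(sum(x ** j for x in faces), 6) for j in range(6)]

for n in range(2, 6):
    print(n, central_from_raw(raw, n))
```

Because the die is symmetric about its mean, the odd central moments computed this way come out as zero.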

#### Symmetric distributions

In a symmetric distribution (one that is unaffected by being reflected about its mean), all odd central moments equal zero, because in the formula for the $n$th moment, each term involving a value of $X$ less than the mean by a certain amount exactly cancels out the term involving a value of $X$ greater than the mean by the same amount.

### Higher moments

High-order moments are moments beyond 4th-order moments. As with variance, skewness, and kurtosis, these are higher-order statistics, involving non-linear combinations of the data, and can be used for description or estimation of further shape parameters. The higher the moment, the harder it is to estimate, in the sense that larger samples are required in order to obtain estimates of similar quality. This is due to the excess degrees of freedom consumed by the higher orders. Further, they can be subtle to interpret, often being most easily understood in terms of lower order moments – compare the higher derivatives of jerk and jounce in physics. For example, just as the 4th-order moment (kurtosis) can be interpreted as "relative importance of tails versus shoulders in causing dispersion" (for a given dispersion, high kurtosis corresponds to heavy tails, while low kurtosis corresponds to heavy shoulders), the 5th-order moment can be interpreted as measuring "relative importance of tails versus center (mode, shoulders) in causing skew" (for a given skew, high 5th moment corresponds to heavy tail and little movement of mode, while low 5th moment corresponds to more change in shoulders).

### Sample moments

For all $k$, the $k$-th raw moment of a population can be estimated using the $k$-th raw sample moment

[$]\frac{1}{n}\sum_{i = 1}^{n} X^k_i[$]

applied to a sample $X_1,\ldots,X_n$ drawn from the population.

It can be shown that the expected value of the raw sample moment is equal to the $k$-th raw moment of the population, if that moment exists, for any sample size $n$. It is thus an unbiased estimator. This contrasts with the situation for central moments, whose computation uses up a degree of freedom by using the sample mean. So for example an unbiased estimate of the population variance (the second central moment) is given by

[$]\frac{1}{n-1}\sum_{i = 1}^{n} (X_i-\bar X)^2[$]

in which the previous denominator $n$ has been replaced by the degrees of freedom $n-1$, and in which $\bar X$ refers to the sample mean. This estimate of the population moment is greater than the unadjusted observed sample moment by a factor of $\frac{n}{n-1},$ and it is referred to as the "adjusted sample variance" or sometimes simply the "sample variance".
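
Python's standard `statistics` module exposes both versions: `pvariance` divides by $n$ and `variance` divides by $n-1$. The short sketch below, on made-up data, shows the $\frac{n}{n-1}$ relationship between them.

```python
from statistics import pvariance, variance

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data
n = len(sample)

unadjusted = pvariance(sample)  # divides by n
adjusted = variance(sample)     # divides by n - 1

# The adjusted estimate is larger by the factor n / (n - 1).
print(unadjusted, adjusted, unadjusted * n / (n - 1))
```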

## Notes

1. Yuli Zhang, Huaiyu Wu, Lei Cheng (June 2012). Some new deformation formulas about variance and covariance. Proceedings of 4th International Conference on Modelling, Identification and Control (ICMIC2012). pp. 987–992.
2. "Why the variance?" (1998). Statistics & Probability Letters 38 (4): 329–333. doi:10.1016/S0167-7152(98)00041-8.
3. Gauss, Carl Friedrich (1816). "Bestimmung der Genauigkeit der Beobachtungen". Zeitschrift für Astronomie und verwandte Wissenschaften 1: 187–197.
4. Walker, Helen (1931). Studies in the History of the Statistical Method. Baltimore, MD: Williams & Wilkins Co. pp. 24–25.
5. Gorard, Stephen. Revisiting a 90-year-old debate: the advantages of the mean deviation. Department of Educational Studies, University of York
6. Kvanli, Alan H.; Pavur, Robert J.; Keeling, Kellie B. (2006). Concise Managerial Statistics. Cengage Learning. pp. 81–82. ISBN 9780324223880.
7. Chernick, Michael R. (2011). The Essentials of Biostatistics for Physicians, Nurses, and Clinicians. John Wiley & Sons. pp. 49–50. ISBN 9780470641859.
8. http://mathworld.wolfram.com/RawMoment.html Raw Moments at Math-world
9. Grimmett, Geoffrey; Stirzaker, David (2009). Probability and Random Processes. Oxford, England: Oxford University Press. ISBN 978 0 19 857222 0.
10. http://mathworld.wolfram.com/CentralMoment.html

## References

• Wikipedia contributors. "Variance". Wikipedia. Wikipedia. Retrieved 28 January 2022.
• Wikipedia contributors. "Standard deviation". Wikipedia. Wikipedia. Retrieved 28 January 2022.
• Wikipedia contributors. "Moment (mathematics)". Wikipedia. Wikipedia. Retrieved 28 January 2022.