Model Accuracy Assessment

Mean Squared Error

In statistics, the mean squared error (MSE)[1] or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss.[2] The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.[3] In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk (the average loss on an observed data set), as an estimate of the true MSE (the true risk: the average loss on the actual population distribution).

The MSE is a measure of the quality of an estimator. Because it is derived from the square of Euclidean distance, it is always non-negative, and it decreases toward zero as the error approaches zero.

The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator (how widely spread the estimates are from one data sample to another) and its bias (how far off the average estimated value is from the true value). For an unbiased estimator, the MSE is the variance of the estimator. Like the variance, MSE has the same units of measurement as the square of the quantity being estimated. In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error.

Definition and basic properties

The MSE either assesses the quality of a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable), or of an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled). The definition of an MSE differs according to whether one is describing a predictor or an estimator.

Predictor

If a vector of $n$ predictions is generated from a sample of $n$ data points on all variables, and $Y$ is the vector of observed values of the variable being predicted, with $\hat{Y}$ being the predicted values (e.g. as from a least-squares fit), then the within-sample MSE of the predictor is computed as

[$]\operatorname{MSE}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-\hat{Y_i}\right)^2.[$]

In other words, the MSE is the mean $\left(\frac{1}{n} \sum_{i=1}^n \right)$ of the squares of the errors $\left(Y_i-\hat{Y_i}\right)^2$. This is an easily computable quantity for a particular sample (and hence is sample-dependent).

In matrix notation,

[$]\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(e_i)^2=\frac{1}{n}\mathbf e^\mathsf T \mathbf e[$]

where $e_i$ is $(Y_i-\hat{Y_i})$ and $\mathbf e$ is the $n \times 1$ column vector.
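The within-sample formula translates directly into a few lines of code. The following is a minimal sketch in plain Python; the function name `mean_squared_error` and the sample values are illustrative assumptions, not part of any particular library.

```python
def mean_squared_error(observed, predicted):
    """Within-sample MSE: the mean of the squared residuals Y_i - Y_hat_i."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one data point is required")
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    # (1/n) * e^T e, written as a plain sum of squares
    return sum(e * e for e in residuals) / len(residuals)


# Hypothetical illustrative values, not taken from any data set.
y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
print(mean_squared_error(y, y_hat))  # residuals 0.5, -0.5, 0, -1 give 1.5 / 4 = 0.375
```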

The MSE can also be computed on $q$ data points that were not used in estimating the model, either because they were held back for this purpose or because they were newly obtained. In statistical learning, this quantity is often called the test MSE,[4] and it is computed as

[$]\operatorname{MSE} = \frac{1}{q} \sum_{i=n+1}^{n+q} \left(Y_i-\hat{Y_i}\right)^2.[$]
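The difference between the within-sample MSE and the test MSE can be illustrated with a small simulation. The sketch below assumes a made-up data-generating process ($y = 1 + 2x$ plus Gaussian noise) and a simple least-squares line; the helper names `fit_line` and `mse` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mse(ys, preds):
    """Mean of the squared differences between observations and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed data-generating process: y = 1 + 2x + Gaussian noise with sd 0.5
xs = [rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

n, q = 40, 20  # the first n points fit the model, the last q are held back
intercept, slope = fit_line(xs[:n], ys[:n])


def predict(x):
    return intercept + slope * x


train_mse = mse(ys[:n], [predict(x) for x in xs[:n]])
test_mse = mse(ys[n:], [predict(x) for x in xs[n:]])
print(f"within-sample MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

Both quantities estimate the noise variance here (0.25 under the assumed process), but only the test MSE is computed on points the fit has not seen.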

Estimator

The MSE of an estimator $\hat{\theta}$ with respect to an unknown parameter $\theta$ is defined as[1]

[$]\operatorname{MSE}(\hat{\theta})=\operatorname{E}_{\theta}\left[(\hat{\theta}-\theta)^2\right].[$]

This definition depends on the unknown parameter, but the MSE is a priori a property of an estimator. The MSE could be a function of unknown parameters, in which case any estimator of the MSE based on estimates of these parameters would be a function of the data (and thus a random variable). If the estimator $\hat{\theta}$ is derived as a sample statistic and is used to estimate some population parameter, then the expectation is with respect to the sampling distribution of the sample statistic.

The MSE can be written as the sum of the variance of the estimator and the squared bias of the estimator, providing a useful way to calculate the MSE and implying that in the case of unbiased estimators, the MSE and variance are equivalent.[5]

[$]\operatorname{MSE}(\hat{\theta})=\operatorname{Var}_{\theta}(\hat{\theta})+ \operatorname{Bias}(\hat{\theta},\theta)^2.[$]
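The decomposition can be checked by adding and subtracting $\operatorname{E}_{\theta}[\hat{\theta}]$ inside the square. A short sketch, assuming the usual definition $\operatorname{Bias}(\hat{\theta},\theta)=\operatorname{E}_{\theta}[\hat{\theta}]-\theta$ and taking every expectation over the sampling distribution of $\hat{\theta}$:

[$]\begin{align*} \operatorname{MSE}(\hat{\theta}) &= \operatorname{E}_{\theta}\left[\left(\hat{\theta}-\operatorname{E}_{\theta}[\hat{\theta}]+\operatorname{E}_{\theta}[\hat{\theta}]-\theta\right)^2\right] \\ &= \operatorname{E}_{\theta}\left[\left(\hat{\theta}-\operatorname{E}_{\theta}[\hat{\theta}]\right)^2\right] + 2\left(\operatorname{E}_{\theta}[\hat{\theta}]-\theta\right)\operatorname{E}_{\theta}\left[\hat{\theta}-\operatorname{E}_{\theta}[\hat{\theta}]\right] + \left(\operatorname{E}_{\theta}[\hat{\theta}]-\theta\right)^2 \\ &= \operatorname{Var}_{\theta}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta},\theta)^2 \end{align*}[$]

The cross term vanishes because $\operatorname{E}_{\theta}\left[\hat{\theta}-\operatorname{E}_{\theta}[\hat{\theta}]\right]=0$, while $\operatorname{E}_{\theta}[\hat{\theta}]-\theta$ is a constant and can be taken outside the expectation.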

In practical modeling, however, the MSE can be described as the sum of the model variance, the squared model bias, and the irreducible uncertainty (see Bias–variance tradeoff). Because the MSE combines information about an estimator's variance and its bias, the MSEs of competing estimators can be used to compare their efficiency. This is called the MSE criterion.

Examples

Mean

Suppose we have a random sample of size $n$ from a population, $X_1,\dots,X_n$. Suppose the sample units were chosen with replacement. That is, the $n$ units are selected one at a time, and previously selected units are still eligible for selection for all $n$ draws. The usual estimator for the population mean $\mu$ is the sample average

[$]\overline{X}=\frac{1}{n}\sum_{i=1}^n X_i [$]

which has an expected value equal to the true mean $\mu$ (so it is unbiased) and a mean squared error of

[$]\operatorname{MSE}\left(\overline{X}\right)=\operatorname{E}\left[\left(\overline{X}-\mu\right)^2\right]=\left(\frac{\sigma}{\sqrt{n}}\right)^2= \frac{\sigma^2}{n}[$]

where $\sigma^2$ is the population variance.

For a Gaussian distribution, this is the best unbiased estimator (i.e., one with the lowest MSE among all unbiased estimators), but not, say, for a uniform distribution.
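The result $\operatorname{MSE}(\overline{X})=\sigma^2/n$ can be checked numerically. The following Monte Carlo sketch assumes a Gaussian population with made-up parameters; the function name `simulated_mse_of_mean` is hypothetical.

```python
import random
import statistics


def simulated_mse_of_mean(n, mu, sigma, trials, seed=0):
    """Monte Carlo estimate of E[(X_bar - mu)^2] for Gaussian samples of size n."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        squared_errors.append((statistics.fmean(sample) - mu) ** 2)
    return statistics.fmean(squared_errors)


n, mu, sigma = 25, 10.0, 2.0  # hypothetical population parameters
estimate = simulated_mse_of_mean(n, mu, sigma, trials=20000)
theory = sigma ** 2 / n       # sigma^2 / n
print(f"simulated: {estimate:.4f}, theoretical: {theory:.4f}")
```

The simulated value should approach the theoretical one as the number of trials grows; the agreement is statistical, not exact.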

Variance

The usual estimator for the variance is the corrected sample variance:

[$]S^2_{n-1} = \frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline{X} \right)^2 =\frac{1}{n-1}\left(\sum_{i=1}^n X_i^2-n\overline{X}^2\right).[$]

This is unbiased (its expected value is $\sigma^2$), hence also called the unbiased sample variance, and its MSE is[6]

[$]\operatorname{MSE}(S^2_{n-1})= \frac{1}{n} \left(\mu_4-\frac{n-3}{n-1}\sigma^4\right) =\frac{1}{n} \left(\gamma_2+\frac{2n}{n-1}\right)\sigma^4,[$]

where $\mu_4$ is the fourth central moment of the distribution or population, and $\gamma_2=\mu_4/\sigma^4-3$ is the excess kurtosis.

However, one can use other estimators for $\sigma^2$ which are proportional to $S^2_{n-1}$, and an appropriate choice can always give a lower mean squared error. If we define

[$]S^2_a = \frac{n-1}{a}S^2_{n-1}= \frac{1}{a}\sum_{i=1}^n\left(X_i-\overline{X}\,\right)^2[$]

then we calculate:

[$]\begin{align*} \operatorname{MSE}(S^2_a) &=\operatorname{E}\left[\left(\frac{n-1}{a} S^2_{n-1}-\sigma^2\right)^2 \right] \\ &= \operatorname{E}\left[ \frac{(n-1)^2}{a^2} S^4_{n-1} -2 \left ( \frac{n-1}{a} S^2_{n-1} \right ) \sigma^2 + \sigma^4 \right ] \\ &= \frac{(n-1)^2}{a^2} \operatorname{E}\left[ S^4_{n-1} \right ] - 2 \left ( \frac{n-1}{a}\right ) \operatorname{E}\left[ S^2_{n-1} \right ] \sigma^2 + \sigma^4 \\ &= \frac{(n-1)^2}{a^2} \operatorname{E}\left[ S^4_{n-1} \right ] - 2 \left ( \frac{n-1}{a}\right ) \sigma^4 + \sigma^4 && \operatorname{E}\left[ S^2_{n-1} \right ] = \sigma^2 \\ &= \frac{(n-1)^2}{a^2} \left ( \frac{\gamma_2}{n} + \frac{n+1}{n-1} \right ) \sigma^4- 2 \left ( \frac{n-1}{a}\right ) \sigma^4+\sigma^4 && \operatorname{E}\left[ S^4_{n-1} \right ] = \operatorname{MSE}(S^2_{n-1}) + \sigma^4 \\ &=\frac{n-1}{n a^2} \left ((n-1)\gamma_2+n^2+n \right ) \sigma^4- 2 \left ( \frac{n-1}{a}\right ) \sigma^4+\sigma^4 \end{align*}[$]

This is minimized when

[$]a=\frac{(n-1)\gamma_2+n^2+n}{n} = n+1+\frac{n-1}{n}\gamma_2.[$]

For a Gaussian distribution, where $\gamma_2=0$, this means that the MSE is minimized when dividing the sum by $a=n+1$. The minimum excess kurtosis is $\gamma_2=-2$,[a] which is achieved by a Bernoulli distribution with p = 1/2 (a coin flip), and the MSE is minimized for $a=n-1+\tfrac{2}{n}.$ Hence regardless of the kurtosis, we get a "better" estimate (in the sense of having a lower MSE) by scaling down the unbiased estimator a little bit; this is a simple example of a shrinkage estimator: one "shrinks" the estimator towards zero (scales down the unbiased estimator).
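The closed form for $\operatorname{MSE}(S^2_a)$ and the minimizing divisor can be evaluated directly. The sketch below expresses the MSE in units of $\sigma^4$; the function names `mse_scaled_variance` and `optimal_divisor` are hypothetical.

```python
def mse_scaled_variance(n, a, gamma2=0.0):
    """MSE of S^2_a in units of sigma^4, using the final line of the derivation."""
    c = (n - 1) * gamma2 + n * n + n
    return (n - 1) * c / (n * a * a) - 2.0 * (n - 1) / a + 1.0


def optimal_divisor(n, gamma2=0.0):
    """Divisor a that minimizes the MSE: n + 1 + (n - 1) / n * gamma2."""
    return n + 1 + (n - 1) / n * gamma2


n = 10
for a in (n - 1, n, n + 1):
    print(a, round(mse_scaled_variance(n, a), 4))

print(optimal_divisor(n))               # Gaussian case, gamma2 = 0
print(optimal_divisor(n, gamma2=-2.0))  # minimum excess kurtosis
```

For $n=10$ and $\gamma_2=0$, the three printed MSE values correspond to $2/(n-1)$, $(2n-1)/n^2$ and $2/(n+1)$, so the divisor $n+1$ gives the smallest of the three.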

Further, while the corrected sample variance is the best unbiased estimator (minimum mean squared error among unbiased estimators) of variance for Gaussian distributions, if the distribution is not Gaussian, then even among unbiased estimators, the best unbiased estimator of the variance may not be $S^2_{n-1}.$

Gaussian distribution

The following table gives several estimators of the true parameters of the population, $\mu$ and $\sigma^2$, for the Gaussian case.[7]

| True value | Estimator | Mean squared error |
|---|---|---|
| $\theta=\mu$ | $\hat{\theta}$ = the unbiased estimator of the population mean, $\overline{X}=\frac{1}{n}\sum_{i=1}^n(X_i)$ | $\operatorname{MSE}(\overline{X})=\operatorname{E}((\overline{X}-\mu)^2)=\left(\frac{\sigma}{\sqrt{n}}\right)^2$ |
| $\theta=\sigma^2$ | $\hat{\theta}$ = the unbiased estimator of the population variance, $S^2_{n-1} = \frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline{X}\,\right)^2$ | $\operatorname{MSE}(S^2_{n-1})=\operatorname{E}((S^2_{n-1}-\sigma^2)^2)=\frac{2}{n - 1}\sigma^4$ |
| $\theta=\sigma^2$ | $\hat{\theta}$ = the biased estimator of the population variance, $S^2_{n} = \frac{1}{n}\sum_{i=1}^n\left(X_i-\overline{X}\,\right)^2$ | $\operatorname{MSE}(S^2_{n})=\operatorname{E}((S^2_{n}-\sigma^2)^2)=\frac{2n - 1}{n^2}\sigma^4$ |
| $\theta=\sigma^2$ | $\hat{\theta}$ = the biased estimator of the population variance, $S^2_{n+1} = \frac{1}{n+1}\sum_{i=1}^n\left(X_i-\overline{X}\,\right)^2$ | $\operatorname{MSE}(S^2_{n+1})=\operatorname{E}((S^2_{n+1}-\sigma^2)^2)=\frac{2}{n + 1}\sigma^4$ |
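The three variance rows can be checked by simulation. The Monte Carlo sketch below assumes standard Gaussian samples of size 10; the function name `simulate_variance_mse` and the parameter choices are illustrative assumptions.

```python
import random
import statistics


def simulate_variance_mse(n, divisors, sigma=1.0, trials=20000, seed=1):
    """Monte Carlo MSE of (1/a) * sum (X_i - X_bar)^2 for each divisor a."""
    rng = random.Random(seed)
    sq_err = {a: 0.0 for a in divisors}
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        ss = sum((x - x_bar) ** 2 for x in sample)
        # The same sample is scored under every divisor.
        for a in divisors:
            sq_err[a] += (ss / a - sigma ** 2) ** 2
    return {a: total / trials for a, total in sq_err.items()}


n = 10
results = simulate_variance_mse(n, divisors=(n - 1, n, n + 1))
# Closed forms for the Gaussian case, in units of sigma^4
theory = {n - 1: 2 / (n - 1), n: (2 * n - 1) / n ** 2, n + 1: 2 / (n + 1)}
for a in results:
    print(f"a={a}: simulated {results[a]:.4f}, closed form {theory[a]:.4f}")
```

The simulated values only approximate the closed forms, but with enough trials the ordering should be visible: dividing by $n+1$ gives the lowest MSE even though it is the most biased of the three.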

Interpretation

An MSE of zero, meaning that the estimator $\hat{\theta}$ predicts observations of the parameter $\theta$ with perfect accuracy, is ideal (but typically not possible).

Values of MSE may be used for comparative purposes. Two or more statistical models may be compared using their MSEs—as a measure of how well they explain a given set of observations: An unbiased estimator (estimated from a statistical model) with the smallest variance among all unbiased estimators is the best unbiased estimator or MVUE (Minimum-Variance Unbiased Estimator).

Both analysis of variance and linear regression techniques estimate the MSE as part of the analysis and use the estimated MSE to determine the statistical significance of the factors or predictors under study. The goal of experimental design is to construct experiments in such a way that when the observations are analyzed, the MSE is close to zero relative to the magnitude of at least one of the estimated treatment effects.

In one-way analysis of variance, the MSE is calculated by dividing the sum of squared errors by the error degrees of freedom. The F-value is the ratio of the mean squared treatment to the MSE.
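The one-way ANOVA calculation can be sketched as follows. The function name `one_way_anova` and the three small groups are made up for illustration; the error degrees of freedom are taken as $N-k$ and the treatment degrees of freedom as $k-1$, the standard choices for a one-way layout with $k$ groups and $N$ observations.

```python
def one_way_anova(groups):
    """Return (MSE, F) for a one-way layout given a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Sum of squared errors: variation of observations around their own group mean
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    # Treatment sum of squares: variation of group means around the grand mean
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

    mse = sse / (n_total - k)  # error degrees of freedom: N - k
    mst = sst / (k - 1)        # treatment degrees of freedom: k - 1
    return mse, mst / mse


# Hypothetical data: three treatments with three observations each
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
mse_value, f_value = one_way_anova(groups)
print(mse_value, f_value)
```

For these groups the sum of squared errors is 6 on 6 degrees of freedom, so the MSE is 1; the treatment sum of squares is 26 on 2 degrees of freedom, so the F-value is 13 up to floating-point rounding.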

Loss function

Squared error loss is one of the most widely used loss functions in statistics, though its widespread use stems more from mathematical convenience than considerations of actual loss in applications. Carl Friedrich Gauss, who introduced the use of mean squared error, was aware of its arbitrariness and was in agreement with objections to it on these grounds.[3] The mathematical benefits of mean squared error are particularly evident when analyzing the performance of linear regression, as it allows one to partition the variation in a dataset into variation explained by the model and variation explained by randomness.

Criticism

The use of mean squared error without question has been criticized by the decision theorist James Berger. Mean squared error is the negative of the expected value of one specific utility function, the quadratic utility function, which may not be the appropriate utility function to use under a given set of circumstances. There are, however, some scenarios where mean squared error can serve as a good approximation to a loss function occurring naturally in an application.[8]

Like variance, mean squared error has the disadvantage of heavily weighting outliers.[9] This is a result of the squaring of each term, which effectively weights large errors more heavily than small ones. This property, undesirable in many applications, has led researchers to use alternatives such as the mean absolute error, or those based on the median.
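The effect of a single outlier on the two criteria can be seen in a small example. The error values below are made up for illustration, and `mse` and `mae` are hypothetical helper names.

```python
def mse(errors):
    """Mean squared error of a list of residuals."""
    return sum(e * e for e in errors) / len(errors)


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


clean = [1.0, -1.0, 1.0, -1.0, 1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 10.0]  # one residual replaced by an outlier

print(mse(clean), mae(clean))                # both equal 1.0
print(mse(with_outlier), mae(with_outlier))  # the MSE grows far more than the MAE
```

With the outlier, the MSE rises from 1 to about 20.8 while the MAE rises from 1 to 2.8: the single large residual accounts for 100 of the 104 units in the sum of squares, but only 10 of the 14 units in the sum of absolute values.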

Bias–variance tradeoff

In statistics and machine learning, the bias–variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameters. The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:[10][11]

• The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
• The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms, the bias, variance, and a quantity called the irreducible error, resulting from noise in the problem itself.

Motivation

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data, but also generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities (i.e. underfit) in the data.

It is a common fallacy[12][13] to assume that complex models must have high variance. High-variance models are 'complex' in some sense, but the reverse need not be true. In addition, one has to be careful about how complexity is defined: in particular, the number of parameters used to describe the model is a poor measure of complexity. This is illustrated by an example adapted from [14]: the model $f_{a,b}(x)=a\sin(bx)$ has only two parameters ($a,b$) but can interpolate any number of points by oscillating with a high enough frequency, resulting in both high bias and high variance.

An analogy can be made to the relationship between accuracy and precision. Accuracy is a description of bias and can intuitively be improved by selecting from only local information. Consequently, a sample will appear accurate (i.e. have low bias) under the aforementioned selection conditions, but may result in underfitting. In other words, test data may not agree as closely with training data, which would indicate imprecision and therefore inflated variance. A graphical example would be a straight line fit to data exhibiting quadratic behavior overall. Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting). This means that test data would also not agree as closely with the training data, but in this case the reason is due to inaccuracy or high bias. To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior. Note that error in each case is measured the same way, but the reason ascribed to the error is different depending on the balance between bias and variance. To mitigate how much information is used from neighboring observations, a model can be smoothed via explicit regularization, such as shrinkage.

Bias–variance decomposition of mean squared error

Suppose that we have a training set consisting of a set of points $x_1, \dots, x_n$ and real values $y_i$ associated with each point $x_i$. We assume that there is a function with noise $y = f(x) + \varepsilon$, where the noise, $\varepsilon$, has zero mean and variance $\sigma^2$.

We want to find a function $\hat{f}(x;D)$ that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm based on a training dataset (sample) $D=\{(x_1,y_1), \dots, (x_n, y_n)\}$. We make "as well as possible" precise by measuring the mean squared error between $y$ and $\hat{f}(x;D)$: we want $(y - \hat{f}(x;D))^2$ to be minimal, both for $x_1, \dots, x_n$ and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_i$ contain noise $\varepsilon$; this means we must be prepared to accept an irreducible error in any function we come up with.

Finding an $\hat{f}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function $\hat{f}$ we select, we can decompose its expected error on an unseen sample $x$ as follows:[15]:34[16]:223

[$] \operatorname{E}_{D, \varepsilon} \Big[\big(y - \hat{f}(x;D)\big)^2\Big] = \Big(\operatorname{Bias}_D\big[\hat{f}(x;D)\big] \Big) ^2 + \operatorname{Var}_D\big[\hat{f}(x;D)\big] + \sigma^2 [$]

where

[$] \operatorname{Bias}_D\big[\hat{f}(x;D)\big] = \operatorname{E}_D\big[\hat{f}(x;D)\big] - f(x) [$]

and

[$] \operatorname{Var}_D\big[\hat{f}(x;D)\big] = \operatorname{E}_D[\big(\operatorname{E}_D[\hat{f}(x;D)] - \hat{f}(x;D)\big)^2]. [$]

The expectation ranges over different choices of the training set $D=\{(x_1,y_1), \dots, (x_n, y_n)\}$, all sampled from the same joint distribution $P(x,y)$, which can, for example, be done via bootstrapping. The three terms represent:

• the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. E.g., when approximating a non-linear function $f(x)$ using a learning method for linear models, there will be error in the estimates $\hat{f}(x)$ due to this assumption;
• the variance of the learning method, or, intuitively, how much the learning method $\hat{f}(x)$ will move around its mean;
• the irreducible error $\sigma^2$.

Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples.[15]:34
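The decomposition can be illustrated numerically. The sketch below makes several assumptions that are not part of the text: the true function is $f(x)=x^2$, the learner is a least-squares straight line (a deliberately high-bias choice), inputs are uniform on $[-1,1]$, and the noise is Gaussian. The names `fit_line`, `true_f` and `decompose` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def true_f(x):
    return x * x  # assumed true function, unknown to the learner


def decompose(x0, n=30, sigma=0.5, trials=5000, seed=0):
    """Estimate expected squared error, bias^2 and variance at the point x0."""
    rng = random.Random(seed)
    preds, sq_errs = [], []
    for _ in range(trials):
        # Draw a fresh training set D and fit the model on it
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        intercept, slope = fit_line(xs, ys)
        pred = intercept + slope * x0
        y_new = true_f(x0) + rng.gauss(0.0, sigma)  # unseen observation at x0
        preds.append(pred)
        sq_errs.append((y_new - pred) ** 2)
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return statistics.fmean(sq_errs), bias_sq, variance, sigma ** 2


total, bias_sq, variance, noise = decompose(x0=0.8)
print(f"expected error {total:.3f} vs bias^2 + variance + noise "
      f"{bias_sq + variance + noise:.3f}")
```

The two printed numbers agree only up to Monte Carlo error. In this setup the squared bias dominates the variance, because a straight line cannot follow the curvature of $x^2$.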

The more complex the model $\hat{f}(x)$ is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model "move" more to capture the data points, and hence its variance will be larger.

Approaches

Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance.

One way of resolving the trade-off is to use mixture models and ensemble learning.[20][21] For example, boosting combines many "weak" (high bias) models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.

Model validation methods such as cross-validation can be used to tune models so as to optimize the trade-off.
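As an illustration of tuning by cross-validation, the sketch below selects the number of neighbors in a one-dimensional nearest-neighbor regression by 5-fold cross-validated MSE. The toy data ($y=\sin x$ plus Gaussian noise), the candidate values and the function names `knn_predict` and `cv_mse` are assumptions made for this example.

```python
import math
import random


def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)


def cv_mse(data, n_neighbors, folds=5):
    """Cross-validated MSE: each fold is predicted from the remaining folds."""
    total = 0.0
    for i in range(folds):
        held_out = data[i::folds]  # points whose index is i modulo folds
        train = [pair for j, pair in enumerate(data) if j % folds != i]
        total += sum((y - knn_predict(train, x, n_neighbors)) ** 2
                     for x, y in held_out)
    return total / len(data)


rng = random.Random(0)
# Assumed toy problem: y = sin(x) + Gaussian noise with sd 0.3
xs = [rng.uniform(0.0, 6.0) for _ in range(400)]
data = [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]

scores = {m: cv_mse(data, m) for m in (1, 10, 250)}
best = min(scores, key=scores.get)
print(scores, "selected:", best)
```

A single neighbor follows the noise (high variance), while 250 neighbors average over most of the input range (high bias); the intermediate setting is expected to score best on this toy problem.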

Applications

In regression

The bias–variance decomposition forms the conceptual basis for regression regularization methods such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance.
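The trade-off can be seen in the simplest possible case, a single slope with no intercept, where the ridge estimate is $\sum x_i y_i / (\sum x_i^2 + \lambda)$ and $\lambda=0$ recovers OLS. The simulation below is a sketch under assumed values ($\beta=0.5$, $\sigma=2$, a fixed design with $\sum x_i^2=20$); the penalty $\lambda=16$ equals $\sigma^2/\beta^2$, which would not be known in practice, and other penalties need not improve on OLS. The function name `simulate` is hypothetical.

```python
import random
import statistics


def simulate(lam, beta=0.5, sigma=2.0, trials=20000, seed=0):
    """Return (bias^2, variance, MSE) of the slope estimate sum(xy) / (sum(x^2) + lam)."""
    rng = random.Random(seed)
    xs = [-2.0, -1.0, 0.0, 1.0, 2.0] * 2  # fixed design, sum(x^2) = 20
    sxx = sum(x * x for x in xs)
    estimates = []
    for _ in range(trials):
        ys = [beta * x + rng.gauss(0.0, sigma) for x in xs]
        sxy = sum(x * y for x, y in zip(xs, ys))
        estimates.append(sxy / (sxx + lam))  # lam = 0 gives the OLS slope
    bias_sq = (statistics.fmean(estimates) - beta) ** 2
    variance = statistics.pvariance(estimates)
    return bias_sq, variance, bias_sq + variance


ols = simulate(lam=0.0)
ridge = simulate(lam=16.0)
print("OLS   bias^2, variance, MSE:", [round(v, 4) for v in ols])
print("ridge bias^2, variance, MSE:", [round(v, 4) for v in ridge])
```

Under these assumptions the OLS variance is $\sigma^2/\sum x_i^2 = 0.2$ with essentially no bias, while the ridge estimate accepts a visible bias in exchange for a much smaller variance and a lower overall MSE.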

In classification

The bias–variance decomposition was originally formulated for least-squares regression. For the case of classification under the 0-1 loss (misclassification rate), it is possible to find a similar decomposition.[22][23] Alternatively, if the classification problem can be phrased as probabilistic classification, then the expected squared error of the predicted probabilities with respect to the true probabilities can be decomposed as before.[24]

It has been argued that as training data increases, the variance of learned models will tend to decrease, and hence that as training data quantity increases, error is minimized by methods that learn models with lesser bias, and that conversely, for smaller training data quantities it is ever more important to minimize variance.[25]

References

1. "Mean Squared Error (MSE)". www.probabilitycourse.com. Retrieved 2020-09-12.
2. Bickel, Peter J.; Doksum, Kjell A. (2015). Mathematical Statistics: Basic Ideas and Selected Topics. I (Second ed.). p. 20. If we use quadratic loss, our risk function is called the mean squared error (MSE) ...
3. Lehmann, E. L.; Casella, George (1998). Theory of Point Estimation (2nd ed.). New York: Springer. ISBN 978-0-387-98502-2. MR 1639875.
4. James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2021). An Introduction to Statistical Learning: with Applications in R. Springer. ISBN 978-1071614174.
5. Wackerly, Dennis; Mendenhall, William; Scheaffer, Richard L. (2008). Mathematical Statistics with Applications (7 ed.). Belmont, CA, USA: Thomson Higher Education. ISBN 978-0-495-38508-0.
6. Mood, A.; Graybill, F.; Boes, D. (1974). Introduction to the Theory of Statistics (3rd ed.). McGraw-Hill. p. 229.
7. DeGroot, Morris H. (1980). Probability and Statistics (2nd ed.). Addison-Wesley.
8. Berger, James O. (1985). "2.4.2 Certain Standard Loss Functions". Statistical Decision Theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag. p. 60. ISBN 978-0-387-96098-2. MR 0804611.
9. "Oriented principal component analysis for large margin classifiers" (2001). Neural Networks 14 (10): 1447–1461. doi:10.1016/S0893-6080(01)00106-X. PMID 11771723.
10. "Bias Plus Variance Decomposition for Zero-One Loss Functions" (1996). ICML 96.
11. "Statistical learning theory: Models, concepts, and results" (2011). Handbook of the History of Logic 10.
12. Neal, Brady (2019). "On the Bias-Variance Tradeoff: Textbooks Need an Update". arXiv:1912.08286 [cs.LG].
13. Neal, Brady; Mittal, Sarthak; Baratin, Aristide; Tantia, Vinayak; Scicluna, Matthew; Lacoste-Julien, Simon; Mitliagkas, Ioannis (2018). "A Modern Take on the Bias-Variance Tradeoff in Neural Networks". arXiv:1810.08591 [cs.LG].
14. Vapnik, Vladimir (2000). The nature of statistical learning theory. New York: Springer-Verlag. ISBN 978-1-4757-3264-1.
15. James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2013). An Introduction to Statistical Learning. Springer.
16. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning. Archived from the original on 2015-01-26. Retrieved 2014-08-20.
17. Belsley, David (1991). Conditioning diagnostics : collinearity and weak data in regression. New York (NY): Wiley. ISBN 978-0471528890.
18. "Neural networks and the bias/variance dilemma" (1992). Neural Computation 4: 1–58. doi:10.1162/neco.1992.4.1.1.
19. "Instance-based classifiers applied to medical databases: diagnosis and knowledge extraction" (May 2011). Artificial Intelligence in Medicine 52 (3): 123–139. doi:10.1016/j.artmed.2011.04.002. PMID 21621400.
20. Ting, Jo-Anne; Vijaykumar, Sethu; Schaal, Stefan (2011). "Locally Weighted Regression for Control". In Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of Machine Learning (PDF). Springer. p. 615. Bibcode:2010eoml.book.....S.
21. Fortmann-Roe, Scott (2012). "Understanding the Bias–Variance Tradeoff".
22. Domingos, Pedro (2000). A unified bias-variance decomposition (PDF). ICML.
23. Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2008). Introduction to Information Retrieval. Cambridge University Press. pp. 308–314.
24. Brain, Damian; Webb, Geoffrey (2002). The Need for Low Bias Algorithms in Classification Learning From Large Data Sets (PDF). Proceedings of the Sixth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2002).

Notes

a. This can be proved by Jensen's inequality as follows. The fourth central moment is an upper bound for the square of the variance, so the least value of their ratio is one; therefore, the least value of the excess kurtosis is −2, achieved, for instance, by a Bernoulli distribution with $p=1/2$.