# Moments

The first two moments of the ridge regression estimator are derived. Next, the performance of the ridge regression estimator is studied in terms of the mean squared error, which combines these first two moments.

### Expectation

The left panel of Figure shows ridge estimates of the regression parameters converging to zero as the penalty parameter tends to infinity. This behaviour of the ridge estimator does not depend on the specifics of the data set. To see this, study the expectation of the ridge estimator:

[$] \begin{eqnarray*} \mathbb{E} \big[ \hat{\bbeta}(\lambda) \big] & = & \mathbb{E} \big[ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y} \big] \, \, \, \, = \, \, \, (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \, \mathbf{X}^{\top} \mathbb{E} ( \mathbf{Y} ) \\ & = & (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} (\mathbf{X}^{\top} \mathbf{X}) \, \bbeta \, \, \, = \, \, \, \bbeta - \lambda (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \bbeta. \end{eqnarray*} [$]

Clearly, $\mathbb{E} \big[ \hat{\bbeta}(\lambda) \big] \not= \bbeta$ for any $\lambda \gt 0$. Hence, the ridge estimator is biased.
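The two expressions for the expectation can be checked against each other numerically. Below a small sketch (in Python with NumPy; the design matrix and parameter values are arbitrary illustrations):

```python
import numpy as np

# Numerical sanity check of the two expressions for the expectation of the
# ridge estimator; sigma^2 plays no role in the mean, so it is omitted.
np.random.seed(1)
n, p, lam = 10, 3, 2.5
X = np.random.randn(n, p)
beta = np.array([1.0, -2.0, 0.5])

XtX = X.T @ X
A = np.linalg.inv(XtX + lam * np.eye(p))

mean1 = A @ XtX @ beta            # (X'X + lambda I)^{-1} X'X beta
mean2 = beta - lam * A @ beta     # beta - lambda (X'X + lambda I)^{-1} beta

print(np.allclose(mean1, mean2))  # the two expressions coincide
print(np.allclose(mean1, beta))   # but differ from beta: the estimator is biased
```

The second check fails (prints `False`) for any $\lambda \gt 0$, in line with the bias noted above.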

Example (Orthonormal design matrix)

Consider an orthonormal design matrix $\mathbf{X}$, i.e.: $\mathbf{X}^{\top} \mathbf{X} = \mathbf{I}_{pp} = (\mathbf{X}^{\top} \mathbf{X})^{-1}$. An example of an orthonormal design matrix would be:

[$] \begin{eqnarray*} \mathbf{X} & = & \frac{1}{2} \left( \begin{array}{rr} -1 & -1 \\ -1 & 1 \\ 1 & -1 \\ 1 & 1 \end{array} \right). \end{eqnarray*} [$]

This design matrix is orthonormal as $\mathbf{X}^{\top} \mathbf{X} = \mathbf{I}_{22}$, which is easily verified by working out the matrix multiplication. In case of an orthonormal design matrix the relation between the OLS and ridge estimator is:

[$] \begin{eqnarray*} \hat{\bbeta}(\lambda) & = & (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y} \, \, \, = \, \, \, (\mathbf{I}_{pp} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y} \\ & = & (1 + \lambda)^{-1} \mathbf{I}_{pp} \mathbf{X}^{\top} \mathbf{Y} \qquad \, \, = \, \, \, (1 + \lambda)^{-1} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y} \\ & = & (1 + \lambda)^{-1} \hat{\bbeta}. \end{eqnarray*} [$]

Hence, the ridge regression estimator scales the OLS estimator by the factor $(1 + \lambda)^{-1}$. When taking the expectation on both sides, it is evident that the ridge regression estimator is biased: $\mathbb{E}[ \hat{\bbeta}(\lambda) ] = \mathbb{E}[ (1 + \lambda)^{-1} \hat{\bbeta} ] = (1 + \lambda)^{-1} \mathbb{E}( \hat{\bbeta} ) = (1 + \lambda)^{-1} \bbeta \not= \bbeta$. From this it is also clear that the estimator, and thus its expectation, vanishes as $\lambda \rightarrow \infty$.
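This scaling relation is easily verified numerically for the design matrix of the example (a Python sketch; the response is drawn arbitrarily, as the relation holds for any $\mathbf{Y}$):

```python
import numpy as np

# The 4 x 2 orthonormal design from the example; check that the ridge
# estimator equals the OLS estimator scaled by 1/(1 + lambda).
X = 0.5 * np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
print(np.allclose(X.T @ X, np.eye(2)))     # orthonormality: X'X = I

np.random.seed(2)
Y = np.random.randn(4, 1)                  # arbitrary response vector
lam = 3.0

b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)
print(np.allclose(b_ridge, b_ols / (1 + lam)))   # scaling relation
```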

The bias of the ridge regression estimator may be decomposed into two parts (as pointed out by [1]): one attributable to the penalization and another to the high-dimensionality of the study design. To arrive at this decomposition, define the projection matrix (i.e. a matrix $\mathbf{P}$ such that $\mathbf{P} = \mathbf{P}^2$) that projects the parameter space $\mathbb{R}^p$ onto the subspace $\mathcal{R}(\mathbf{X}) \subset \mathbb{R}^p$ spanned by the rows of the design matrix $\mathbf{X}$. This matrix, denoted $\mathbf{P}_x$, is given by $\mathbf{P}_x = \mathbf{X}^{\top} (\mathbf{X} \mathbf{X}^{\top})^+ \mathbf{X}$, where $(\mathbf{X} \mathbf{X}^{\top})^+$ is the Moore-Penrose inverse of $\mathbf{X} \mathbf{X}^{\top}$. The ridge regression estimator lives in the subspace defined by the projection $\mathbf{P}_x$ of $\mathbb{R}^p$ onto $\mathcal{R}(\mathbf{X})$. To verify this, consider the singular value decomposition $\mathbf{X} = \mathbf{U}_x \mathbf{D}_x \mathbf{V}_x^{\top}$ (with matrices defined as before) and note that:

[$] \begin{eqnarray*} \mathbf{P}_x & = & \mathbf{X}^{\top} (\mathbf{X} \mathbf{X}^{\top})^+ \mathbf{X} \qquad \qquad \, \, \, = \, \, \, \mathbf{V}_x \mathbf{D}_x^{\top} \mathbf{U}_x^{\top} ( \mathbf{U}_x \mathbf{D}_x \mathbf{V}_x^{\top} \mathbf{V}_x \mathbf{D}_x^{\top} \mathbf{U}_x^{\top} )^{+} \mathbf{U}_x \mathbf{D}_x \mathbf{V}_x^{\top} \\ & = & \mathbf{V}_x \mathbf{D}_x^{\top} ( \mathbf{D}_x \mathbf{D}_x^{\top} )^{+} \mathbf{D}_x \mathbf{V}_x^{\top} \, \, \, = \, \, \, \mathbf{V}_x \mathbf{I}_{pn} \mathbf{I}_{np} \mathbf{V}_x^{\top}. \end{eqnarray*} [$]

Then:

[$] \begin{eqnarray*} \mathbf{P}_x \hat{\bbeta}(\lambda) & = & \mathbf{V}_x \mathbf{I}_{pn} \mathbf{I}_{np} \mathbf{V}_x^{\top} \mathbf{V}_x (\mathbf{D}_x^{\top} \mathbf{D}_x + \lambda \mathbf{I}_{pp})^{-1} \mathbf{D}_x^{\top} \mathbf{U}_x^{\top} \mathbf{Y} \\ & = & \mathbf{V}_x (\mathbf{D}_x^{\top} \mathbf{D}_x + \lambda \mathbf{I}_{pp})^{-1} \mathbf{I}_{pn} \mathbf{I}_{np} \mathbf{D}_x^{\top} \mathbf{U}_x^{\top} \mathbf{Y} \\ & = & \hat{\bbeta}(\lambda) . \end{eqnarray*} [$]

The ridge regression estimator is thus unaffected by the projection, as $\mathbf{P}_x \hat{\bbeta}(\lambda) = \hat{\bbeta}(\lambda)$, and it must therefore already be an element of the projected subspace $\mathcal{R}(\mathbf{X})$. The bias can now be decomposed as:

[$] \begin{eqnarray*} \mathbb{E}[ \hat{\bbeta}(\lambda) - \bbeta] & = & \mathbb{E}[ \hat{\bbeta}(\lambda) - \mathbf{P}_x \bbeta + \mathbf{P}_x \bbeta - \bbeta] \, \, \, = \, \, \, \mathbf{P}_x \mathbb{E}[ \hat{\bbeta}(\lambda) - \bbeta] + (\mathbf{P}_x - \mathbf{I}_{pp}) \bbeta. \end{eqnarray*} [$]

The first summand on the right-hand side of the preceding display represents the bias of the ridge regression estimator towards the projection of the true parameter value, whereas the second summand is the bias introduced by the high-dimensionality of the study design. The second summand of the bias vanishes if either i) $\mathbf{X}$ is of full column rank (i.e. the study design is low-dimensional and $\mathbf{P}_x = \mathbf{I}_{pp}$) or ii) the true regression parameter $\bbeta$ is an element of the projected subspace (i.e. $\bbeta = \mathbf{P}_x \bbeta \in \mathcal{R}(\mathbf{X})$).
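The projection identity $\mathbf{P}_x \hat{\bbeta}(\lambda) = \hat{\bbeta}(\lambda)$, and the fact that the true $\bbeta$ generally is affected by the projection, can be illustrated numerically in a high-dimensional setting (a Python sketch with an arbitrary random design; `pinv` computes the Moore-Penrose inverse):

```python
import numpy as np

# High-dimensional illustration (n = 2 < p = 4): the ridge estimator lies in
# the row space of X, while the true beta generally does not.
np.random.seed(3)
n, p, lam = 2, 4, 1.0
X = np.random.randn(n, p)
beta = np.random.randn(p, 1)
Y = X @ beta + np.random.randn(n, 1)

Px = X.T @ np.linalg.pinv(X @ X.T) @ X    # projection onto R(X)
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print(np.allclose(Px @ Px, Px))           # idempotent: P = P^2
print(np.allclose(Px @ b_ridge, b_ridge)) # estimator unaffected by projection
print(np.allclose(Px @ beta, beta))       # beta is affected: second bias term
```

The last check prints `False`: the second summand of the bias does not vanish for this $\bbeta \notin \mathcal{R}(\mathbf{X})$.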

### Variance

The second moment of the ridge regression estimator is straightforwardly obtained by exploiting its linear relation, $\hat{\bbeta}(\lambda) = \mathbf{W}_{\lambda} \hat{\bbeta}$ with $\mathbf{W}_{\lambda} = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{X}$, to the maximum likelihood regression estimator. Then,

[$] \begin{eqnarray*} \mbox{Var}[ \hat{\bbeta}(\lambda) ] & = & \mbox{Var} ( \mathbf{W}_{\lambda} \hat{\bbeta} ) \qquad \qquad \, \, \, \, \, \, = \, \, \, \mathbf{W}_{\lambda} \mbox{Var}(\hat{\bbeta} ) \mathbf{W}_{\lambda}^{\top} \\ & = & \sigma^2 \mathbf{W}_{\lambda} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{W}_{\lambda}^{\top} \, \, \, = \, \, \, \sigma^2 ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} \mathbf{X}^{\top} \mathbf{X} [ ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} ]^{\top}, \end{eqnarray*} [$]

in which we have used $\mbox{Var}(\mathbf{A} \mathbf{Y}) = \mathbf{A} \mbox{Var}( \mathbf{Y}) \mathbf{A}^{\top}$ for a non-random matrix $\mathbf{A}$, the fact that $\mathbf{W}_{\lambda}$ is non-random, and $\mbox{Var}[\hat{\bbeta} ] = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}$.

Like the expectation, the variance of the ridge regression estimator vanishes as $\lambda$ tends to infinity:

[$] \begin{eqnarray*} \lim_{\lambda \rightarrow \infty} \mbox{Var} \big[ \hat{\bbeta}(\lambda) \big] & = & \lim_{\lambda \rightarrow \infty} \sigma^2 \mathbf{W}_{\lambda} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{W}_{\lambda}^{\top} \, \, \, = \, \, \, \mathbf{0}_{pp}. \end{eqnarray*} [$]

Hence, the variance of the ridge regression coefficient estimates decreases towards zero as the penalty parameter becomes large. This is illustrated in the right panel of Figure for the data of Example.

With an explicit expression of the variance of the ridge regression estimator at hand, we can compare it to that of the OLS estimator:

[$] \begin{eqnarray*} \mbox{Var}[ \hat{\bbeta} ] - \mbox{Var}[ \hat{\bbeta}(\lambda) ] & = & \sigma^2 [(\mathbf{X}^{\top} \mathbf{X})^{-1} - \mathbf{W}_{\lambda} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{W}_{\lambda}^{\top} ] \\ & = & \sigma^2 \mathbf{W}_{\lambda} \{ [\mathbf{I} + \lambda (\mathbf{X}^{\top} \mathbf{X})^{-1} ] (\mathbf{X}^{\top} \mathbf{X})^{-1} [\mathbf{I} + \lambda (\mathbf{X}^{\top} \mathbf{X})^{-1} ]^{\top} - (\mathbf{X}^{\top} \mathbf{X})^{-1} \} \mathbf{W}_{\lambda}^{\top} \\ & = & \sigma^2 \mathbf{W}_{\lambda} [ 2 \, \lambda \, (\mathbf{X}^{\top} \mathbf{X})^{-2} + \lambda^2 (\mathbf{X}^{\top} \mathbf{X})^{-3} ] \mathbf{W}_{\lambda}^{\top} \\ & = & \sigma^2 ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} [ 2 \, \lambda \, \mathbf{I}_{pp} + \lambda^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} ] [ ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} ]^{\top}. \end{eqnarray*} [$]

The difference is non-negative definite as each component in the matrix product is non-negative definite. Hence, the variance of the ML estimator exceeds, in the positive definite ordering sense, that of the ridge regression estimator:

[$] \begin{eqnarray} \label{form.VarInequalityMLandRidge} \mbox{Var} ( \hat{\bbeta} ) & \succeq & \mbox{Var}[ \hat{\bbeta}(\lambda) ], \end{eqnarray} [$]

with the inequality being strict if $\lambda \gt 0$. In other words, the variance of the ML estimator is larger than that of the ridge estimator (in the sense that their difference is non-negative definite). The variance inequality (\ref{form.VarInequalityMLandRidge}) can be interpreted in terms of the stochastic behaviour of the estimate. This is illustrated by the next example.

Level sets of the distribution of the ML (left panel) and ridge (right panel) regression estimators.

Example (Variance comparison)

Consider the design matrix:

[$] \begin{eqnarray*} \mathbf{X} & = & \left( \begin{array}{rr} -1 & 2 \\ 0 & 1 \\ 2 & -1 \\ 1 & 0 \end{array} \right). \end{eqnarray*} [$]

The variances of the ML and ridge (with $\lambda=1$) estimates of the regression coefficients then are:

[$] \begin{eqnarray*} \mbox{Var}(\hat{\bbeta}) & = & \sigma^2 \left( \begin{array}{rr} 0.3 & 0.2 \\ 0.2 & 0.3 \end{array} \right) \qquad \mbox{and} \qquad \mbox{Var}[\hat{\bbeta}(\lambda)] \, \, \, = \, \, \, \sigma^2 \left( \begin{array}{rr} 0.1524 & 0.0698 \\ 0.0698 & 0.1524 \end{array} \right). \end{eqnarray*} [$]

These variances can be used to construct level sets of the distribution of the estimates. The level sets that contain 50%, 75% and 95% of the distribution of the maximum likelihood and ridge regression estimates are plotted in Figure. In line with inequality (\ref{form.VarInequalityMLandRidge}) the level sets of the ridge regression estimate are smaller than those of the maximum likelihood one. The former thus varies less.
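The variances of the example, and the non-negative definiteness of their difference, can be reproduced numerically (a Python sketch; $\sigma^2$ is factored out):

```python
import numpy as np

# The 4 x 2 design matrix from the example, with lambda = 1.
X = np.array([[-1.0, 2.0], [0.0, 1.0], [2.0, -1.0], [1.0, 0.0]])
lam = 1.0

XtX = X.T @ X
var_ols = np.linalg.inv(XtX)                    # Var(bhat) / sigma^2
W = np.linalg.inv(XtX + lam * np.eye(2)) @ XtX  # W_lambda
var_ridge = W @ var_ols @ W.T                   # Var(bhat(lambda)) / sigma^2

print(np.round(var_ols, 4))    # [[0.3, 0.2], [0.2, 0.3]]
print(np.round(var_ridge, 4))  # [[0.1524, 0.0698], [0.0698, 0.1524]]
# difference of the variances is positive definite (lambda > 0)
print(np.all(np.linalg.eigvalsh(var_ols - var_ridge) > 0))
```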

Example (Orthonormal design matrix, continued)

Assume the design matrix $\mathbf{X}$ is orthonormal. Then, $\mbox{Var} ( \hat{\bbeta} ) = \sigma^2 \mathbf{I}_{pp}$ and

[$] \begin{eqnarray*} \mbox{Var}[ \hat{\bbeta}(\lambda) ] & = & \sigma^2 \mathbf{W}_{\lambda} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{W}_{\lambda}^{\top} \, \, \, = \, \, \, \sigma^2 (\mathbf{I}_{pp} + \lambda \mathbf{I}_{pp} )^{-1} \mathbf{I}_{pp} [ (\mathbf{I}_{pp} + \lambda \mathbf{I}_{pp} )^{-1} ]^{\top}\, \, \, = \, \, \, \sigma^2 (1 + \lambda )^{-2} \mathbf{I}_{pp}. \end{eqnarray*} [$]

As the penalty parameter $\lambda$ is non-negative, the former exceeds the latter. In particular, the expression after the rightmost equality sign vanishes as $\lambda \rightarrow \infty$.

The variance of the ridge regression estimator may be decomposed in the same way as its bias (cf. the end of Section Expectation). There is, however, no contribution of the high-dimensionality of the study design, as the associated projection is non-random and, consequently, exhibits no variation. Hence, the variance only relates to the variation in the projected subspace $\mathcal{R}(\mathbf{X})$, as is obvious from:

[$] \begin{eqnarray*} \mbox{Var}[ \hat{\bbeta}(\lambda) ] & = & \mbox{Var}[ \mathbf{P}_x \hat{\bbeta}(\lambda) ] \, \, \, = \, \, \, \mathbf{P}_x \mbox{Var}[ \hat{\bbeta}(\lambda) ] \mathbf{P}_x^{\top} \, \, \, = \, \, \, \mbox{Var}[ \hat{\bbeta}(\lambda) ]. \end{eqnarray*} [$]

Perhaps this is seen more clearly when writing the variance of the ridge regression estimator in terms of the matrices that constitute the singular value decomposition of $\mathbf{X}$:

[$] \begin{eqnarray*} \mbox{Var}[ \hat{\bbeta}(\lambda) ] & = & \mathbf{V}_x (\mathbf{D}_x^{\top} \mathbf{D}_x + \lambda \mathbf{I}_{pp})^{-1} \mathbf{D}_x^{\top} \mathbf{D}_x (\mathbf{D}_x^{\top} \mathbf{D}_x + \lambda \mathbf{I}_{pp})^{-1} \mathbf{V}_x^{\top}. \end{eqnarray*} [$]

High-dimensionally, $(\mathbf{D}_x^{\top} \mathbf{D}_x)_{jj} = 0$ for $j=n+1, \ldots, p$. And if $(\mathbf{D}_x^{\top} \mathbf{D}_x)_{jj} = 0$, so is $[(\mathbf{D}_x^{\top} \mathbf{D}_x + \lambda \mathbf{I}_{pp})^{-1} \mathbf{D}_x^{\top} \mathbf{D}_x (\mathbf{D}_x^{\top} \mathbf{D}_x + \lambda \mathbf{I}_{pp})^{-1}]_{jj} = 0$. Hence, the variance is determined by the first $n$ columns of $\mathbf{V}_x$. When $n \lt p$, the variance is then to be interpreted as the spread of the ridge regression estimator (with the same choice of $\lambda$) when the study is repeated with exactly the same design matrix, such that the resulting estimator is confined to the same subspace $\mathcal{R}(\mathbf{X})$. The following R-script illustrates this by an arbitrary data example (plot not shown):

# set parameters
X      <- matrix(rnorm(2), nrow=1)
betas  <- matrix(c(2, -1), ncol=1)
lambda <- 1

# generate multiple ridge regression estimators with a fixed design matrix
bHats <- numeric()
for (k in 1:1000){
    Y     <- matrix(X %*% betas + rnorm(1), ncol=1)
    bHats <- rbind(bHats, t(solve(t(X) %*% X + lambda * diag(2)) %*% t(X) %*% Y))
}

# plot the ridge regression estimators
plot(bHats, xlab=expression(paste(hat(beta)[1], "(", lambda, ")", sep="")),
            ylab=expression(paste(hat(beta)[2], "(", lambda, ")", sep="")), pch=20)



The full distribution of the ridge regression estimator is now known. The estimator $\hat{\bbeta}(\lambda) = (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y}$ is a linear estimator: linear in $\mathbf{Y}$. As $\mathbf{Y}$ is normally distributed, so is $\hat{\bbeta}(\lambda)$. Moreover, the normal distribution is fully characterized by its first two moments, which are available. Hence:

[$] \begin{eqnarray*} \hat{\bbeta}(\lambda) & \sim & \mathcal{N} \{ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{X} \, \bbeta, ~\sigma^2 ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} \mathbf{X}^{\top} \mathbf{X} [ ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} ]^{\top} \}. \end{eqnarray*} [$]

Given $\lambda$ and $\mathbf{X}$, the random behaviour of the estimator is thus known. In particular, when $n \lt p$, the variance is positive semi-definite but not positive definite, and this $p$-variate normal distribution is degenerate, i.e. there is no probability mass outside $\mathcal{R}(\mathbf{X})$, the subspace of $\mathbb{R}^p$ spanned by the rows of $\mathbf{X}$.
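A Monte Carlo sketch (in Python; arbitrary toy design with $\sigma^2 = 1$) confirming that the empirical mean and covariance of repeatedly computed ridge estimators match this normal distribution:

```python
import numpy as np

# Draw Y repeatedly from the linear model and verify that the empirical mean
# and covariance of the ridge estimator match the stated normal distribution.
rng = np.random.default_rng(0)
n, p, lam, sigma = 5, 2, 1.0, 1.0
X = rng.standard_normal((n, p))
beta = np.array([1.0, -1.0])

XtX = X.T @ X
A = np.linalg.inv(XtX + lam * np.eye(p))
mean_theory = A @ XtX @ beta                  # (X'X + lambda I)^{-1} X'X beta
var_theory = sigma**2 * A @ XtX @ A.T         # stated variance

reps = 200000
Y = (X @ beta)[:, None] + sigma * rng.standard_normal((n, reps))
bhats = (A @ X.T @ Y).T                       # reps x p ridge estimates

print(np.allclose(bhats.mean(axis=0), mean_theory, atol=0.02))
print(np.allclose(np.cov(bhats.T), var_theory, atol=0.02))
```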

### Mean squared error

Previously, we motivated the ridge estimator as an ad hoc solution to collinearity. An alternative motivation comes from studying the Mean Squared Error (MSE) of the ridge regression estimator: for a suitable choice of $\lambda$ the ridge regression estimator may outperform the ML regression estimator in terms of the MSE. Before we prove this, we first derive the MSE of the ridge estimator and quote some auxiliary results. Note that, as the ridge regression estimator is compared to its ML counterpart, throughout this subsection $n \gt p$ is assumed to warrant the uniqueness of the latter.

Recall that (in general) for any estimator of a parameter $\theta$:

[$] \begin{eqnarray*} \mbox{MSE}( \hat{\theta} ) & = & \mathbb{E} [ ( \hat{\theta} - \theta)^2 ] \, \, \, = \, \, \, \mbox{Var}( \hat{ \theta} ) + [\mbox{Bias} ( \hat{\theta} )]^2. \end{eqnarray*} [$]

Hence, the MSE is a measure of the quality of the estimator. The MSE of the ridge regression estimator is:

[$] \begin{eqnarray} \mbox{MSE}[\hat{\bbeta}(\lambda)] & = & \mathbb{E} [ (\mathbf{W}_{\lambda} \, \hat{\bbeta} - \bbeta)^{\top} \, (\mathbf{W}_{\lambda} \, \hat{\bbeta} - \bbeta) ] \nonumber \\ & = & \mathbb{E} ( \hat{\bbeta}^{\top} \mathbf{W}_{\lambda}^{\top} \,\mathbf{W}_{\lambda} \, \hat{\bbeta} ) - \mathbb{E} ( \bbeta^{\top} \, \mathbf{W}_{\lambda} \, \hat{\bbeta}) - \mathbb{E} ( \hat{\bbeta}^{\top} \mathbf{W}_{\lambda}^{\top} \, \bbeta) + \mathbb{E} ( \bbeta^{\top} \bbeta) \nonumber \\ & = & \mathbb{E} ( \hat{\bbeta}^{\top} \mathbf{W}_{\lambda}^{\top} \,\mathbf{W}_{\lambda} \, \hat{\bbeta} ) - \mathbb{E} ( \bbeta^{\top} \, \mathbf{W}_{\lambda}^{\top} \mathbf{W}_{\lambda} \, \hat{\bbeta}) - \mathbb{E} ( \hat{\bbeta}^{\top} \mathbf{W}_{\lambda}^{\top} \, \mathbf{W}_{\lambda} \bbeta) + \mathbb{E} ( \bbeta^{\top} \mathbf{W}_{\lambda}^{\top} \,\mathbf{W}_{\lambda} \, \bbeta ) \nonumber \\ & & - \mathbb{E} ( \bbeta^{\top} \mathbf{W}_{\lambda}^{\top} \,\mathbf{W}_{\lambda} \, \bbeta ) + \mathbb{E} ( \bbeta^{\top} \, \mathbf{W}_{\lambda}^{\top} \mathbf{W}_{\lambda} \, \hat{\bbeta}) + \mathbb{E} ( \hat{\bbeta}^{\top} \mathbf{W}_{\lambda}^{\top} \, \mathbf{W}_{\lambda} \bbeta) \nonumber \\ & & - \mathbb{E} ( \bbeta^{\top} \, \mathbf{W}_{\lambda} \, \hat{\bbeta}) - \mathbb{E} ( \hat{\bbeta}^{\top} \mathbf{W}_{\lambda}^{\top} \, \bbeta) + \mathbb{E} ( \bbeta^{\top} \bbeta) \nonumber \\ & = & \mathbb{E} [ ( \hat{\bbeta} - \bbeta )^{\top} \mathbf{W}_{\lambda}^{\top} \, \mathbf{W}_{\lambda} \, (\hat{\bbeta} - \bbeta) ] \nonumber \\ & & - \bbeta^{\top} \mathbf{W}_{\lambda}^{\top} \,\mathbf{W}_{\lambda} \, \bbeta + \bbeta^{\top} \, \mathbf{W}_{\lambda}^{\top} \mathbf{W}_{\lambda} \, \bbeta + \bbeta^{\top} \mathbf{W}_{\lambda}^{\top} \, \mathbf{W}_{\lambda} \bbeta \nonumber \\ & & - \bbeta^{\top} \, \mathbf{W}_{\lambda} \, \bbeta - \bbeta^{\top} \mathbf{W}_{\lambda}^{\top} \, \bbeta + \bbeta^{\top} \bbeta \nonumber \\ & = & \mathbb{E} [ ( \hat{\bbeta} - \bbeta )^{\top} 
\mathbf{W}_{\lambda}^{\top} \, \mathbf{W}_{\lambda} \, (\hat{\bbeta} - \bbeta) ] + \bbeta^{\top} (\mathbf{W}_{\lambda} - \mathbf{I}_{pp})^{\top} (\mathbf{W}_{\lambda} - \mathbf{I}_{pp}) \bbeta \nonumber \\ & = & \sigma^2 \, \mbox{tr} [ \mathbf{W}_{\lambda} \, (\mathbf{X}^{\top} \mathbf{X})^{-1} \, \mathbf{W}_{\lambda}^{\top} ] + \bbeta^{\top} (\mathbf{W}_{\lambda} - \mathbf{I}_{pp})^{\top} (\mathbf{W}_{\lambda} - \mathbf{I}_{pp}) \bbeta. \label{form.ridgeMSE} \end{eqnarray} [$]

In the last step we have used $\hat{\bbeta} \sim \mathcal{N}[ \bbeta, \sigma^2 \, (\mathbf{X}^{\top} \mathbf{X})^{-1} ]$ and the expectation of a quadratic form of a multivariate random variable $\vvarepsilon \sim \mathcal{N}(\mmu_{\varepsilon}, \SSigma_{\varepsilon})$, which for a nonrandom symmetric positive definite matrix $\LLambda$ is (cf. [2]):

[$] \begin{eqnarray*} \mathbb{E} ( \vvarepsilon^{\top} \LLambda \, \vvarepsilon) & = & \mbox{tr} ( \LLambda \SSigma_{\varepsilon}) + \mmu_{\varepsilon}^{\top} \LLambda \mmu_{\varepsilon}, \end{eqnarray*} [$]

of course replacing $\vvarepsilon$ by $\hat{\bbeta}$ in this expectation. The first summand in the final expression for $\mbox{MSE}[\hat{\bbeta}(\lambda)]$ is the sum of the variances of the ridge estimator, while the second summand can be thought of as the “squared bias” of the ridge estimator. In particular, $\lim_{\lambda \rightarrow \infty} \mbox{MSE}[\hat{\bbeta}(\lambda)] = \bbeta^{\top} \bbeta$, which is the squared bias of an estimator that equals zero (as does the ridge regression estimator in the limit).
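The derived MSE expression can be checked against its two extremes: at $\lambda = 0$ it reduces to the ML value $\sigma^2 \, \mbox{tr}[(\mathbf{X}^{\top}\mathbf{X})^{-1}]$, and as $\lambda \rightarrow \infty$ it tends to $\bbeta^{\top}\bbeta$. A Python sketch with arbitrary toy values:

```python
import numpy as np

# Closed-form MSE of the ridge estimator and its two limits.
np.random.seed(4)
n, p, sigma = 20, 3, 1.0
X = np.random.randn(n, p)
beta = np.array([2.0, -1.0, 0.5])
XtX = X.T @ X

def mse_ridge(lam):
    W = np.linalg.inv(XtX + lam * np.eye(p)) @ XtX
    var_part = sigma**2 * np.trace(W @ np.linalg.inv(XtX) @ W.T)
    bias_part = beta @ (W - np.eye(p)).T @ (W - np.eye(p)) @ beta
    return var_part + bias_part

# lambda = 0: the ML value; very large lambda: the squared bias beta'beta
print(np.isclose(mse_ridge(0.0), sigma**2 * np.trace(np.linalg.inv(XtX))))
print(np.isclose(mse_ridge(1e8), beta @ beta, rtol=1e-3))
```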

Example (Orthonormal design matrix, continued)

Assume the design matrix $\mathbf{X}$ is orthonormal. Then, $\mbox{MSE}[ \hat{\bbeta} ] = p \, \sigma^2$ and

[$] \begin{eqnarray*} \mbox{MSE}[ \hat{\bbeta}(\lambda) ] & = & p \, \sigma^2 (1+ \lambda)^{-2} + \lambda^2 (1+ \lambda)^{-2} \bbeta^{\top} \bbeta. \end{eqnarray*} [$]

The latter achieves its minimum at: $\lambda = p \sigma^2 / \bbeta^{\top} \bbeta$.
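A quick grid search (a Python sketch, with arbitrary toy values for $p$, $\sigma^2$ and $\bbeta^{\top}\bbeta$) confirms the location of this minimum:

```python
import numpy as np

# MSE of the ridge estimator under an orthonormal design:
# p sigma^2 / (1+lambda)^2 + lambda^2 beta'beta / (1+lambda)^2.
p, sigma2 = 2, 1.0
btb = 0.5                                  # beta'beta (toy value)

lams = np.linspace(0.01, 20, 200000)
mse = (p * sigma2 + lams**2 * btb) / (1 + lams)**2

lam_opt = p * sigma2 / btb                 # claimed minimizer: 4.0
print(abs(lams[np.argmin(mse)] - lam_opt) < 1e-2)   # grid minimum agrees
```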

The following theorem and proposition are required for the proof of the main result.

Theorem 1 of [3]

Let $\hat{\ttheta}_1$ and $\hat{\ttheta}_2$ be (different) estimators of $\ttheta$ with second order moments:

[$] \begin{eqnarray*} \mathbf{M}_k & = & \mathbb{E} [ (\hat{\ttheta}_k - \ttheta) (\hat{\ttheta}_k - \ttheta)^{\top} ] \qquad \mbox{for } k=1,2, \end{eqnarray*} [$]

and

[$] \begin{eqnarray*} \mbox{MSE}(\hat{\ttheta}_k) & = & \mathbb{E} [ (\hat{\ttheta}_k - \ttheta)^{\top} \mathbf{A} (\hat{\ttheta}_k - \ttheta) ] \qquad \mbox{for } k=1,2, \end{eqnarray*} [$]

where $\mathbf{A} \succeq 0$. Then, $\mathbf{M}_1 - \mathbf{M}_2 \succeq 0$ if and only if $\mbox{MSE}(\hat{\ttheta}_1) - \mbox{MSE}(\hat{\ttheta}_2) \geq 0$ for all $\mathbf{A} \succeq 0$.

Proposition [4]

Let $\mathbf{A}$ be a $p \times p$-dimensional, positive definite matrix, $\mathbf{b}$ a nonzero $p$-dimensional vector, and $c \in \mathbb{R}_+$. Then, $c \mathbf{A} - \mathbf{b} \mathbf{b}^{\top} \succ 0$ if and only if $\mathbf{b}^{\top} \mathbf{A}^{-1} \mathbf{b} \lt c$.

We are now ready to prove the main result, formalized as Theorem, that for some $\lambda$ the ridge regression estimator yields a lower MSE than the ML regression estimator. Question provides a simpler but more limited proof of this result.

Theorem 2 of [3]

There exists $\lambda \gt 0$ such that $\mbox{MSE}[\hat{\bbeta}(\lambda)] \lt \mbox{MSE}[\hat{\bbeta}(0)] = \mbox{MSE}(\hat{\bbeta})$.

The second order moment matrix of the ridge estimator is:

[$] \begin{eqnarray*} \mathbf{M} (\lambda) & := & \mathbb{E} [ (\hat{\bbeta}(\lambda) - \bbeta) (\hat{\bbeta} (\lambda) - \bbeta)^{\top} ] \\ & = & \mathbb{E} \{ \hat{\bbeta}(\lambda) [\hat{\bbeta}(\lambda)]^{\top} \} - \mathbb{E} [ \hat{\bbeta}(\lambda) ] \{ \mathbb{E} [ \hat{\bbeta}(\lambda) ] \}^{\top} + \mathbb{E} [\hat{\bbeta} (\lambda) - \bbeta] \{ \mathbb{E} [\hat{\bbeta} (\lambda) - \bbeta] \}^{\top} \\ & = & \mbox{Var}[ \hat{\bbeta}(\lambda) ] + \mathbb{E} [\hat{\bbeta} (\lambda) - \bbeta] \{ \mathbb{E} [\hat{\bbeta} (\lambda) - \bbeta] \}^{\top}. \end{eqnarray*} [$]

Then:

[$] \begin{eqnarray*} \mathbf{M} ( 0 ) - \mathbf{M}(\lambda) & = & \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} - \sigma^2 \mathbf{W}_{\lambda} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{W}_{\lambda}^{\top} - (\mathbf{W}_{\lambda} - \mathbf{I}_{pp}) \bbeta \bbeta^{\top} (\mathbf{W}_{\lambda} -\mathbf{I}_{pp})^{\top} \\ & = & \sigma^2 \mathbf{W}_{\lambda} (\mathbf{X}^{\top} \mathbf{X})^{-1} (\mathbf{X}^{\top} \mathbf{X}+ \lambda \mathbf{I}_{pp}) (\mathbf{X}^{\top} \mathbf{X})^{-1} (\mathbf{X}^{\top} \mathbf{X}+ \lambda \mathbf{I}_{pp}) (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{W}_{\lambda}^{\top} \\ & & - \sigma^2 \mathbf{W}_{\lambda} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{W}_{\lambda}^{\top} - (\mathbf{W}_{\lambda} - \mathbf{I}_{pp}) \bbeta \bbeta^{\top} (\mathbf{W}_{\lambda} -\mathbf{I}_{pp})^{\top} \\ & = & \sigma^2 \mathbf{W}_{\lambda} [ 2 \, \lambda \, (\mathbf{X}^{\top} \mathbf{X})^{-2} + \lambda^2 (\mathbf{X}^{\top} \mathbf{X})^{-3} ] \mathbf{W}_{\lambda}^{\top} \\ & & - \lambda^2 (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \bbeta \bbeta^{\top} [ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} ]^{\top} \\ & = & \sigma^2 ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} [ 2 \, \lambda \, \mathbf{I}_{pp} + \lambda^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} ] [ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} ]^{\top} \\ & & - \lambda^2 ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} \bbeta \bbeta^{\top} [ ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} ]^{\top} \\ & = & \lambda ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} [ 2 \, \sigma^2 \, \mathbf{I}_{pp} + \lambda \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} - \lambda \bbeta \bbeta^{\top} ] [ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} )^{-1} ]^{\top}. \end{eqnarray*} [$]

This is positive definite if and only if $2 \, \sigma^2 \, \mathbf{I}_{pp} + \lambda \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} - \lambda \bbeta \bbeta^{\top} \succ 0$. Hereto it suffices to show that $2 \, \sigma^2 \, \mathbf{I}_{pp} - \lambda \bbeta \bbeta^{\top} \succ 0$. By Proposition this holds for $\lambda$ such that $\lambda \lt 2 \sigma^2 (\bbeta^{\top} \bbeta)^{-1}$. For these $\lambda$, we thus have $\mathbf{M} ( 0 ) - \mathbf{M}(\lambda) \succ 0$. Application of Theorem now concludes the proof. ■
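The conclusion of the proof can be illustrated numerically: for any $\lambda$ below the bound $2 \sigma^2 (\bbeta^{\top} \bbeta)^{-1}$, the difference $\mathbf{M}(0) - \mathbf{M}(\lambda)$ is positive definite (a Python sketch with an arbitrary toy design and $\sigma^2 = 1$):

```python
import numpy as np

# Second order moment matrix M(lambda) = Var + (bias)(bias)' of the ridge
# estimator, and the positive definiteness of M(0) - M(lambda).
np.random.seed(5)
n, p, sigma2 = 15, 3, 1.0
X = np.random.randn(n, p)
beta = np.array([1.0, 0.5, -0.5])
XtX = X.T @ X

def M(lam):
    A = np.linalg.inv(XtX + lam * np.eye(p))
    W = A @ XtX
    var = sigma2 * W @ np.linalg.inv(XtX) @ W.T
    bias = (W - np.eye(p)) @ beta
    return var + np.outer(bias, bias)

lam_max = 2 * sigma2 / (beta @ beta)     # sufficient bound from the proof
lam = 0.5 * lam_max                      # any lambda in (0, lam_max)
print(np.all(np.linalg.eigvalsh(M(0.0) - M(lam)) > 0))   # positive definite
```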

This result of [3] is generalized by [4] to the class of design matrices $\mathbf{X}$ with $\mbox{rank}(\mathbf{X}) \lt p$.

Theorem can be used to illustrate that the ridge regression estimator strikes a balance between the bias and variance. This is illustrated in the left panel of Figure. For small $\lambda$, the variance of the ridge estimator dominates the MSE. This may be understood when realizing that in this domain of $\lambda$ the ridge estimator is close to the unbiased ML regression estimator. For large $\lambda$, the variance vanishes and the bias dominates the MSE. For small enough values of $\lambda$, the decrease in variance of the ridge regression estimator exceeds the increase in its bias. As the MSE is the sum of these two, the MSE first decreases as $\lambda$ moves away from zero. In particular, as $\lambda = 0$ corresponds to the ML regression estimator, the ridge regression estimator yields a lower MSE for these values of $\lambda$. In the right panel of Figure $\mbox{MSE}[ \hat{\bbeta}(\lambda)] \lt \mbox{MSE}[ \hat{\bbeta}(0)]$ for $\lambda \lt 7$ (roughly) and the ridge estimator outperforms the ML estimator.

Left panel: mean squared error, and its ‘bias’ and ‘variance’ parts, of the ridge regression estimator (for artificial data). Right panel: mean squared error of the ridge and ML estimator of the regression coefficient vector (for the same artificial data).

Beyond providing another motivation for the ridge regression estimator, the practical use of Theorem is limited. The optimal choice of $\lambda$ depends on the quantities $\bbeta$ and $\sigma^2$, which are unknown in practice. The penalty parameter is then chosen in a data-driven fashion (see e.g. Section Cross-validation and various other places).

Although Theorem may be of limited practical use, it does give insight into when the ridge regression estimator may be preferred over its ML counterpart. Ideally, the range of penalty parameters for which the ridge regression estimator outperforms -- in the MSE sense -- the ML regression estimator is as large as possible. The factors that influence the size of this range may be deduced from the optimal penalty $\lambda_{\mbox{{\tiny opt}}} = \sigma^2 (\bbeta^{\top} \bbeta / p)^{-1}$ found under the assumption of an orthonormal $\mathbf{X}$ (see Example), but also from the bound on the penalty parameter, $\lambda_{\mbox{{\tiny max}}} = 2 \sigma^2 (\bbeta^{\top} \bbeta )^{-1}$, such that $\mbox{MSE}[\hat{\bbeta}(\lambda)] \lt \mbox{MSE}[\hat{\bbeta}(0)]$ for all $\lambda \in (0, \lambda_{\mbox{{\tiny max}}})$, derived in the proof of Theorem. Firstly, an increase of the error variance $\sigma^2$ yields a larger $\lambda_{\mbox{{\tiny opt}}}$ and $\lambda_{\mbox{{\tiny max}}}$. Put differently, noisier data benefit the ridge regression estimator. Secondly, $\lambda_{\mbox{{\tiny opt}}}$ and $\lambda_{\mbox{{\tiny max}}}$ also become larger as their denominators decrease. The denominator $\bbeta^{\top} \bbeta / p$ may be viewed as an estimator of the ‘signal’ variance ‘$\sigma^2_{\beta}$’. A quick conclusion would be that ridge regression profits from less signal. But more can be learned from the denominator. Contrast the two regression parameters $\bbeta_{\mbox{{\tiny unif}}} = \mathbf{1}_p$ and $\bbeta_{\mbox{{\tiny sparse}}}$, which comprises only zeros except for the first element, which equals $p$, i.e. $\bbeta_{\mbox{{\tiny sparse}}} = (p, 0, \ldots, 0)^{\top}$. Then, $\bbeta_{\mbox{{\tiny unif}}}$ and $\bbeta_{\mbox{{\tiny sparse}}}$ carry comparable signal in the sense that $\sum_{j=1}^p \beta_j = p$ for both. The denominator of $\lambda_{\mbox{{\tiny opt}}}$ corresponding to these parameters equals $1$ and $p$, respectively.
This suggests that ridge regression will perform better in the former case, where the regression parameter is not dominated by a few elements but rather all contribute comparably to the explanation of the variation in the response. Of course, more factors contribute. For instance, collinearity among the columns of $\mathbf{X}$, which gave rise to ridge regression in the first place.

Theorem can also be used to conclude on the biasedness of the ridge regression estimator. The Gauss-Markov theorem [5] states (under some assumptions) that the ML regression estimator is the best linear unbiased estimator (BLUE), i.e. it has the smallest variance -- and thus the smallest MSE -- among all linear unbiased estimators. As the ridge regression estimator is a linear estimator that outperforms (in terms of MSE) the ML estimator, it must be biased (for it would otherwise refute the Gauss-Markov theorem).

## General References

van Wieringen, Wessel N. (2021). "Lecture notes on ridge regression". arXiv:1509.09169 [stat.ME].

## References

1. Shao, J. and Deng, X. (2012). Estimation in high-dimensional linear models with deterministic design matrices. The Annals of Statistics, 40(2), 812--831.
2. Mathai, A. M. and Provost, S. B. (1992). Quadratic Forms in Random Variables: Theory and Applications. Dekker.
3. Theobald, C. M. (1974). Generalizations of mean square error applied to ridge regression. Journal of the Royal Statistical Society, Series B (Methodological), 36(1), 103--106.
4. Farebrother, R. W. (1976). Further results on the mean square error of ridge regression. Journal of the Royal Statistical Society, Series B (Methodological), 248--250.
5. Rao, C. R. (1973). Linear Statistical Inference and its Applications. John Wiley & Sons.