# Constrained Estimation, degrees of freedom, and computationally efficient evaluation

## Constrained estimation

The ad-hoc fix of [1] to super-collinearity of the design matrix (and, consequently the singularity of the matrix $\mathbf{X}^{\top} \mathbf{X}$) has been motivated post-hoc. The ridge regression estimator minimizes the ridge loss function, which is defined as:

[$] \begin{eqnarray} \mathcal{L}_{\mbox{{\tiny ridge}}}(\bbeta; \lambda) & = & \| \mathbf{Y} - \mathbf{X} \bbeta \|^2_2 + \lambda \| \bbeta \|^2_2 \, \, \, = \, \, \, \sum\nolimits_{i=1}^n (Y_i - \mathbf{X}_{i\ast} \bbeta)^2 + \lambda \sum\nolimits_{j=1}^p \beta_j^2. \label{form.ridgeLossFunction} \end{eqnarray} [$]

This loss function is the traditional sum-of-squares augmented with a penalty. The particular form of the penalty, $\lambda \| \bbeta \|^2_2$ is referred to as the ridge penalty and $\lambda$ as the penalty parameter. For $\lambda=0$, minimization of the ridge loss function yields the ML estimator (if it exists). For any $\lambda \gt 0$, the ridge penalty contributes to the loss function, affecting its minimum and its location. The minimum of the sum-of-squares is well-known. The minimum of the ridge penalty is attained at $\bbeta = \mathbf{0}_{p}$ whenever $\lambda \gt 0$. The $\bbeta$ that minimizes $\mathcal{L}_{\mbox{{\tiny ridge}}}(\bbeta; \lambda)$ then balances the sum-of-squares and the penalty. The effect of the penalty in this balancing act is to shrink the regression coefficients towards zero, its minimum. In particular, the larger $\lambda$, the larger the contribution of the penalty to the loss function, the stronger the tendency to shrink non-zero regression coefficients to zero (and decrease the contribution of the penalty to the loss function). This motivates the name ‘penalty’ as non-zero elements of $\bbeta$ increase (or penalize) the loss function.

To verify that the ridge regression estimator indeed minimizes the ridge loss function, proceed as usual. Take the derivative with respect to $\bbeta$:

[$] \begin{eqnarray*} \frac{\partial}{\partial \bbeta} \mathcal{L}_{\mbox{{\tiny ridge}}}(\bbeta; \lambda) & = & -2 \mathbf{X}^{\top} (\mathbf{Y} - \mathbf{X} \bbeta) + 2 \lambda \mathbf{I}_{pp} \bbeta \, \, \, = \, \, \, -2 \mathbf{X}^{\top} \mathbf{Y} + 2 ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp}) \bbeta. \end{eqnarray*} [$]

Equate the derivative to zero and solve for $\bbeta$. This yields the ridge regression estimator.

The ridge estimator is thus a stationary point of the ridge loss function. A stationary point corresponds to a minimum if the Hessian matrix with second order partial derivatives is positive definite. The Hessian of the ridge loss function is

[$] \begin{eqnarray*} \frac{\partial^2}{\partial \bbeta \, \partial \bbeta^{\top}} \mathcal{L}_{\mbox{{\tiny ridge}}}(\bbeta; \lambda) & = & 2 ( \mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp}). \end{eqnarray*} [$]

This Hessian is the sum of the (semi-)positive definite matrix $\mathbf{X}^{\top} \mathbf{X}$ and the positive definite matrix $\lambda \, \mathbf{I}_{pp}$. Lemma 14.2.4 of [2] then states that the sum of these matrices is itself a positive definite matrix. Hence, the Hessian is positive definite and the ridge loss function has a stationary point at the ridge estimator, which is a minimum.

The ridge regression estimator minimizes the ridge loss function. It rests to verify that it is a global minimum. To this end we introduce the concept of a convex function. As a prerequisite, a set $\mathcal{S} \subset \mathbb{R}^p$ is called convex if for all $\bbeta_1, \bbeta_2 \in \mathcal{S}$ their weighted average $\bbeta_{\theta} = (1 - \theta) \bbeta_1 + \theta \bbeta_2$ for all $\theta \in [0, 1]$ is itself an element of $\mathcal{S}$, thus $\bbeta_{\theta} \in \mathcal{S}$. If for all $\theta \in (0, 1)$, the weighted average $\bbeta_{\theta}$ is inside $\mathcal{S}$ and not on its boundary, the set is called strictly convex. Examples of (strictly) convex and nonconvex sets are depicted in Figure. A function $f(\cdot)$ is (strictly) convex if the set $\{ y \, : \, y \geq f(\bbeta) \mbox{ for all } \bbeta \in \mathcal{S} \mbox{ for any convex } \mathcal{S} \}$, called the epigraph of $f(\cdot)$, is (strictly) convex. Examples of (strictly) convex and nonconvex functions are depicted in Figure. The ridge loss function is the sum of two parabola's: one is at least convex and the other strictly convex in $\bbeta$. The sum of a convex and strictly convex function is itself strictly convex (see Lemma 9.4.2 of [3]). The ridge loss function is thus strictly convex. Theorem 9.4.1 of [3] then warrants, by the strict convexity of the ridge loss function, that the ridge regression estimator is a global minimum.

From the ridge loss function the limiting behavior of the variance of the ridge regression estimator can be understood. The ridge penalty with its minimum $\bbeta = \mathbf{0}_{p}$ does not involve data and, consequently, the variance of its minimum equals zero. With the ridge regression estimator being a compromise between the maximum likelihood estimator and the minimum of the penalty, so is its variance a compromise of their variances. As $\lambda$ tends to infinity, the ridge regression estimator and its variance converge to the minimizer of the loss function and the variance of the minimizer, respectively. Hence, in the limit (large $\lambda$) the variance of the ridge regression estimator vanishes. Understandably, as the penalty now fully dominates the loss function and, consequently, it does no longer involve data (i.e. randomness).

Figure: Top panels show examples of convex (left) and nonconvex (right) sets. Middle panels show examples of convex (left) and nonconvex (right) functions. The left bottom panel illustrates the ridge estimation as a constrained estimation problem. The ellipses represent the contours of the ML loss function, with the blue dot at the center the ML estimate. The circle is the ridge parameter constraint. The red dot is the ridge estimate. It is at the intersection of the ridge constraint and the smallest contour with a non-empty intersection with the constraint. The right bottom panel shows the data corresponding to Example. The grey line represents the ‘true’ relationship, while the black line the fitted one.

Above it has been shown that the ridge estimator can be defined as:

[$] \begin{eqnarray} \label{form.ridgeEstViaPenEst} \hat{\bbeta}(\lambda) & = & \arg \min_{\bbeta} \| \mathbf{Y} - \mathbf{X} \, \bbeta \|^2_2 + \lambda \| \bbeta \|^2_2. \end{eqnarray} [$]

This minimization problem can be reformulated into the following constrained optimization problem:

[$] \begin{eqnarray} \label{form.constrEstProblemRidge} \hat{\bbeta}(\lambda) & = & \arg \min_{\| \bbeta \|_2^2 \leq c} \| \mathbf{Y} - \mathbf{X} \, \bbeta \|^2_2, \end{eqnarray} [$]

for some suitable $c \gt 0$. The constrained optimization problem (\ref{form.constrEstProblemRidge}) can be solved by means of the Karush-Kuhn-Tucker (KKT) multiplier method, which minimizes a function subject to inequality constraints. The KKT multiplier method states that, under some regularity conditions (all met here), there exists a constant $\nu \geq 0$, called the multiplier, such that the solution $\hat{\bbeta}(\nu)$ of the constrained minimization problem (\ref{form.constrEstProblemRidge) satisfies the so-called KKT conditions. The first KKT condition (referred to as the stationarity condition) demands that the gradient (with respect to $\bbeta$) of the Lagrangian associated with the minimization problem equals zero at the solution $\hat{\bbeta}(\nu)$. The Lagrangian for problem (\ref{form.constrEstProblemRidge}) is:

[$] \begin{eqnarray*} \| \mathbf{Y} - \mathbf{X} \, \bbeta \|^2_2 + \nu ( \| \bbeta \|^2_2 - c). \end{eqnarray*} [$]

The second KKT condition (the complementarity condition) requires that $\nu (\| \hat{\bbeta}(\nu) \|_2^2 - c) = 0$. If $\nu = \lambda$ and $c = \| \hat{\bbeta}(\lambda) \|_2^2$, the ridge estimator $\bbeta (\lambda)$ satisfies both KKT conditions. Hence, both problems have the same solution when $c = \| \hat{\bbeta}(\lambda) \|_2^2$.

The constrained estimation interpretation of the ridge regression estimator is illustrated in the left bottom panel of Figure. It shows the level sets of the sum-of-squares criterion and centered around zero the circular ridge parameter constraint, parametrized by $\{ \bbeta \, : \, \| \bbeta \|_2^2 = \beta_1^2 + \beta_2^2 \leq c \}$ for some $c \gt 0$. The ridge regression estimator is then the point where the sum-of-squares' smallest level set hits the constraint. Exactly at that point the sum-of-squares is minimized over those $\bbeta$'s that live on or inside the constraint. In the high-dimensional setting the ellipsoidal level sets are degenerated. For instance, the 2-dimensional case of the left bottom panel of Figure the ellipsoids would then be lines, but the geometric interpretation is unaltered.

The ridge regression estimator is always to be found on the boundary of the ridge parameter constraint and is never an interior point. To see this assume, for simplicity, that the $\mathbf{X}^{\top} \mathbf{X}$ matrix is of full rank. The radius of the ridge parameter constraint can then be bounded as follows:

[$] \begin{eqnarray*} \| \hat{\bbeta} ( \lambda) \|_2^2 & = & \mathbf{Y}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-2} \mathbf{X}^{\top} \mathbf{Y} \, \, \,\leq \, \, \, \mathbf{Y}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-2} \mathbf{X}^{\top} \mathbf{Y} \, \, \, = \, \, \, \| \hat{\bbeta}^{\mbox{{\tiny ml}}} \|_2^2. \end{eqnarray*} [$]

The inequality in the display above follows from

1. $\mathbf{X}^{\top} \mathbf{X} \succ 0$ and $\lambda \mathbf{I}_{pp} \succ 0$,
2. $\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp} \succ \mathbf{X}^{\top} \mathbf{X}$ (due to Lemma 14.2.4 of [2]),
3. and $(\mathbf{X}^{\top} \mathbf{X})^{-2} \succ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-2}$ (inferring Corollary 7.7.4 of [4]).

The ridge regression estimator is thus always on the boundary or in a circular constraint centered around the origin with a radius that is smaller than the size of the maximum likelihood estimator. Moreover, the constrained estimation formulation of the ridge regression estimator then implies that the latter must be on the boundary of the constraint.

The relevance of viewing the ridge regression estimator as the solution to a constrained estimation problem becomes obvious when considering a typical threat to high-dimensional data analysis: overfitting. Overfitting refers to the phenomenon of modelling the noise rather than the signal. In case the true model is parsimonious (few covariates driving the response) and data on many covariates are available, it is likely that a linear combination of all covariates yields a higher likelihood than a combination of the few that are actually related to the response. As only the few covariates related to the response contain the signal, the model involving all covariates then cannot but explain more than the signal alone: it also models the error. Hence, it overfits the data. In high-dimensional settings overfitting is a real threat. The number of explanatory variables exceeds the number of observations. It is thus possible to form a linear combination of the covariates that perfectly explains the response, including the noise.

Large estimates of regression coefficients are often an indication of overfitting. Augmentation of the estimation procedure with a constraint on the regression coefficients is a simple remedy to large parameter estimates. As a consequence it decreases the probability of overfitting. Overfitting is illustrated in the next example.

Example (Overfitting)

Consider an artificial data set comprising of ten observations on a response $Y_i$ and nine covariates $X_{i,j}$. All covariate data are sampled from the standard normal distribution: $X_{i,j} \sim \mathcal{N}(0, 1)$. The response is generated by $Y_i = X_{i,1} + \varepsilon_i$ with $\varepsilon_{i} \sim \mathcal{N}(0, \tfrac{1}{4})$. Hence, only the first covariate contributes to the response.

The regression model $Y_i = \sum_{j=1}^9 X_{i,j} \beta_j+ \varepsilon_i$ is fitted to the artificial data using R. This yields the regression parameter estimates:

[$] \begin{eqnarray*} \hat{\bbeta}^{\top} & = & (0.048, -2.386, -5.528, 6.243, -4.819, 0.760, -3.345, -4.748, 2.136). \end{eqnarray*} [$]

As $\bbeta^{\top} = (1, 0, \ldots, 0)$, many regression coefficient are clearly over-estimated.

The fitted values $\widehat{Y}_i = \mathbf{X}_i \hat{\bbeta}$ are plotted against the values of the first covariates in the right bottom panel of Figure. As a reference the line $x=y$ is added, which represents the ‘true’ model. The fitted model follows the ‘true’ relationship. But it also captures the deviations from this line that represent the errors.

## Degrees of freedom

The degrees of freedom consumed by ridge regression is calculated. The degrees of freedom may be used in combination with an information criterion to decide on the value of the penalty parameter. Recall from ordinary regression that: $\widehat{\mathbf{Y}} = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y} = \mathbf{H} \mathbf{Y}$ where $\mathbf{H}$ is the hat matrix. The degrees of freedom used in the regression is then equal to $\mbox{tr}(\mathbf{H})$, the trace of $\mathbf{H}$. In particular, if $\mathbf{X}$ is of full rank, i.e. $\mbox{rank}(\mathbf{X}) = p$, then $\mbox{tr}(\mathbf{H}) = p$.

By analogy, the ridge-version of the hat matrix is defined as $\mathbf{H}(\lambda) := \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top}$. Continuing this analogy, the degrees of freedom of ridge regression is given by the trace of the ridge hat matrix $\mathbf{H}(\lambda)$:

[$] \begin{eqnarray*} \mbox{tr}[ \mathbf{H}(\lambda)] & = & \mbox{tr}[ \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} ] \, \, \, = \, \, \, \sum\nolimits_{j=1}^p (\mathbf{D}_x^{\top} \mathbf{D}_x)_{jj} [(\mathbf{D}_x^{\top} \mathbf{D}_x)_{jj} + \lambda]^{-1}. \end{eqnarray*} [$]

The degrees of freedom consumed by ridge regression is monotone decreasing in $\lambda$. In particular, $\lim_{\lambda \rightarrow \infty} \mbox{tr}[ \mathbf{H}(\lambda)] = 0$. That is, in the limit no information from $\mathbf{X}$ is used. Indeed, $\bbeta$ is forced to equal $\mathbf{0}_{p}$ which is not derived from data.

## Computationally efficient evaluation

In the high-dimensional setting the number of covariates $p$ is large compared to the number of samples $n$. In a microarray experiment $p = 40000$ and $n= 100$ is not uncommon. To perform ridge regression in this context, the following expression needs to be evaluated numerically: $(\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y}$. For $p=40000$ this requires the inversion of a $40000 \times 40000$ dimensional matrix. This is not feasible on most desktop computers.

There, however, is a workaround to the computational burden. Revisit the singular value decomposition of $\mathbf{X} = \mathbf{U}_x \mathbf{D}_x \mathbf{V}_x^{\top}$. Drop from $\mathbf{D}_x$ and $\mathbf{V}_x$ the columns that correspond to zero singular values. The resulting $\mathbf{D}_x$ and $\mathbf{V}_x$ are $n \times n$ and $p \times n$-dimensional, respectively. Note that dropping these columns has no effect on the matrix factorization of $\mathbf{X}$, i.e. still $\mathbf{X} = \mathbf{U}_x \mathbf{D}_x \mathbf{V}_x^{\top}$ but with the last two matrices in this decomposition defined differently from the traditional singular value decomposition. Now write $\mathbf{R}_x = \mathbf{U}_x \mathbf{D}_x$. As both $\mathbf{U}_x$ and $\mathbf{D}_x$ are $n \times n$-dimensional matrices, so is $\mathbf{R}_x$. Consequently, $\mathbf{X}$ is now decomposed as $\mathbf{X} = \mathbf{R}_x \mathbf{V}_x^{\top}$. The ridge regression estimator can be rewritten in terms of $\mathbf{R}_x$ and $\mathbf{V}_x$:

[$] \begin{eqnarray*} \hat{\bbeta}(\lambda) & = & (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{Y} \\ & = & (\mathbf{V}_x \mathbf{R}_x^{\top} \mathbf{R}_x \mathbf{V}_x^{\top} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{V}_x \mathbf{R}_x^{\top} \mathbf{Y} \\ & = & (\mathbf{V}_x \mathbf{R}_x^{\top} \mathbf{R}_x \mathbf{V}_x^{\top} + \lambda \mathbf{V}_x \mathbf{V}_x^{\top})^{-1} \mathbf{V}_x \mathbf{R}_x^{\top} \mathbf{Y} \\ & = & \mathbf{V}_x (\mathbf{R}_x^{\top} \mathbf{R}_x + \lambda \mathbf{I}_{nn})^{-1} \mathbf{V}_x^{\top} \mathbf{V}_x \mathbf{R}_x^{\top} \mathbf{Y} \\ & = & \mathbf{V}_x (\mathbf{R}_x^{\top} \mathbf{R}_x + \lambda \mathbf{I}_{nn})^{-1} \mathbf{R}_x^{\top} \mathbf{Y}. \end{eqnarray*} [$]

Hence, the reformulated ridge regressin estimator involves the inversion of an $n \times n$-dimensional matrix. With $n= 100$ this is feasible on most standard computers.

[5] point out that, with the SVD-trick above, the number of computation operations reduces from $\mathcal{O}(p^3)$ to $\mathcal{O}(p n^2)$. In addition, they point out that this computational short-cut can be used in combination with other loss functions, for instance that of standard generalized linear models (see Chapter Ridge logistic regression ). This computation can is illustrated in Figure, which shows the substantial gain in computation time of the evaluation of the ridge regression estimator using the efficient over the naive implementation against the dimension $p$. Details of this figure are provided in Question.

The inversion of the $p \times p$-dimensional matrix can be avoided in an other way. Hereto one needs the Woodbury identity. Let $\mathbf{A}$, $\mathbf{U}$ and $\mathbf{V}$ be $p \times p$-, $p \times n$- and $n \times p$-dimensional matrices, respectively. The (simplified form of the) Woodbury identity then is:

[$] \begin{eqnarray*} (\mathbf{A} + \mathbf{U} \mathbf{V})^{-1} & = & \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{I}_{nn} + \mathbf{V} \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V} \mathbf{A}^{-1}. \end{eqnarray*} [$]

Application of the Woodbury identity to the matrix inverse in the ridge estimator of the regression parameter gives:

[$] \begin{eqnarray*} (\lambda \mathbf{I}_{pp} + \mathbf{X}^{\top} \mathbf{X})^{-1} & = & \lambda^{-1} \mathbf{I}_{pp} - \lambda^{-2} \mathbf{X}^{\top} (\mathbf{I}_{nn} + \lambda^{-1} \mathbf{X} \mathbf{X}^{\top})^{-1} \mathbf{X}. \end{eqnarray*} [$]

This gives:

[$] \begin{eqnarray} \label{form.effRidgeEstimator} (\lambda \mathbf{I}_{pp} + \mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y} & = & \lambda^{-1} \mathbf{X}^{\top} \mathbf{Y} - \lambda^{-2} \mathbf{X}^{\top} (\mathbf{I}_{nn} + \lambda^{-1} \mathbf{X} \mathbf{X}^{\top})^{-1} \mathbf{X} \mathbf{X}^{\top} \mathbf{Y} \end{eqnarray} [$]

The inversion of the $p \times p$-dimensional matrix $\lambda \mathbf{I}_{pp} + \mathbf{X}^{\top} \mathbf{X}$ is thus replaced by that of the $n \times n$-dimensional matrix $\mathbf{I}_{nn} + \lambda^{-1} \mathbf{X} \mathbf{X}^{\top}$. In addition, this expression of the ridge regression estimator avoids the singular value decomposition of $\mathbf{X}$, which may in some cases introduce additional numerical errors (e.g. at the level of machine precision).

## General References

van Wieringen, Wessel N. (2021). "Lecture notes on ridge regression". arXiv:1509.09169 [stat.ME].

## References

1. Hoerl, A. E. and Kennard, R. W. (1970).Ridge regression: biased estimation for nonorthogonal problems.Technometrics, 12(1), 55--67
2. Harville, D. A. (2008).Matrix Algebra From a Statistician's Perspective.Springer, New York
3. Fletcher, R. (2008).Practical Methods of Optimization, 2nd Edition.John Wiley, New York
4. Horn, R. A. and Johnson, C. R. (2012).Matrix analysis.Cambridge University Press
5. Hastie, T. and Tibshirani, R. (2004).Efficient quadratic regularization for expression arrays.Biostatistics, 5(3), 329--340