# Model Selection and Regularization

## Ridge Regression

Tikhonov regularization, named for Andrey Tikhonov, is a method of regularization of ill-posed problems. Also known as ridge regression, it is particularly useful for mitigating the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters. In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias (see bias–variance tradeoff).

In the simplest case, the problem of a near-singular moment matrix $(\mathbf{X}^\mathsf{T}\mathbf{X})$ is alleviated by adding positive elements to the diagonals, thereby decreasing its condition number. Analogous to the ordinary least squares estimator, the simple ridge estimator is then given by

[$]\hat{\beta}_{R} = (\mathbf{X}^{\mathsf{T}} \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^{\mathsf{T}} \mathbf{y}[$]

where $\mathbf{y}$ is the regressand, $\mathbf{X}$ is the design matrix, $\mathbf{I}$ is the identity matrix, and the ridge parameter $\lambda \geq 0$ serves as the constant shifting the diagonals of the moment matrix. It can be shown that this estimator is the solution to the least squares problem subject to the constraint $\beta^\mathsf{T}\beta = c$, which can be expressed as a Lagrangian:

[$]\min_{\beta} \, (\mathbf{y} - \mathbf{X} \beta)^\mathsf{T}(\mathbf{y} - \mathbf{X} \beta) + \lambda (\beta^\mathsf{T}\beta - c)[$]

which shows that $\lambda$ is nothing but the Lagrange multiplier of the constraint. Typically, $\lambda$ is chosen according to a heuristic criterion, so that the constraint will not be satisfied exactly. Specifically in the case of $\lambda = 0$, in which the constraint is non-binding, the ridge estimator reduces to ordinary least squares.
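The closed-form ridge estimator above is straightforward to compute directly. A minimal NumPy sketch on synthetic data (the function name and data are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    """Closed-form ridge estimator (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)      # lam = 0 reduces to ordinary least squares
beta_ridge = ridge(X, y, 10.0)   # lam > 0 shrinks the coefficients toward zero

# The ridge solution has smaller norm than the OLS solution.
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically stable choice here.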

## History

Tikhonov regularization was invented independently in many different contexts. It became widely known through its application to integral equations in the work of Andrey Tikhonov and David L. Phillips. Some authors use the term Tikhonov–Phillips regularization. The finite-dimensional case was expounded by Arthur E. Hoerl, who took a statistical approach, and by Manus Foster, who interpreted this method as a Wiener–Kolmogorov (kriging) filter. Following Hoerl, it is known in the statistical literature as ridge regression, named after the shape along the diagonal of the identity matrix.

## Tikhonov regularization

Suppose that for a known matrix $A$ and vector $\mathbf{b}$, we wish to find a vector $\mathbf{x}$ such that

[$]A\mathbf{x} = \mathbf{b}.[$]

The standard approach is ordinary least squares linear regression. However, if no $\mathbf{x}$ satisfies the equation or more than one $\mathbf{x}$ does—that is, the solution is not unique—the problem is said to be ill-posed. In such cases, ordinary least squares estimation leads to an overdetermined, or more often an underdetermined, system of equations. Most real-world phenomena have the effect of low-pass filters in the forward direction where $A$ maps $\mathbf{x}$ to $\mathbf{b}$. Therefore, in solving the inverse problem, the inverse mapping operates as a high-pass filter that has the undesirable tendency of amplifying noise (eigenvalues / singular values are largest in the reverse mapping where they were smallest in the forward mapping). In addition, ordinary least squares implicitly nullifies every element of the reconstructed version of $\mathbf{x}$ that is in the null space of $A$, rather than allowing for a model to be used as a prior for $\mathbf{x}$. Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as

[$]\|A\mathbf{x} - \mathbf{b}\|_2^2,[$]

where $\|\cdot\|_2$ is the Euclidean norm.

In order to give preference to a particular solution with desirable properties, a regularization term can be included in this minimization:

[$]\|A\mathbf{x} - \mathbf{b}\|_2^2 + \|\Gamma \mathbf{x}\|_2^2[$]

for some suitably chosen Tikhonov matrix $\Gamma$. In many cases, this matrix is chosen as a scalar multiple of the identity matrix ($\Gamma = \alpha I$), giving preference to solutions with smaller norms; this is known as L2 regularization. In other cases, high-pass operators (e.g., a difference operator or a weighted Fourier operator) may be used to enforce smoothness if the underlying vector is believed to be mostly continuous. This regularization improves the conditioning of the problem, thus enabling a direct numerical solution. An explicit solution, denoted by $\hat{x}$, is given by

[$]\hat{x} = (A^\top A + \Gamma^\top \Gamma)^{-1} A^\top \mathbf{b}.[$]

The effect of regularization may be varied by the scale of matrix $\Gamma$. For $\Gamma = 0$ this reduces to the unregularized least-squares solution, provided that $(A^{\mathsf{T}}A)^{-1}$ exists.
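A minimal numerical sketch of this solution, assuming a synthetic Gaussian-blur forward operator $A$ (a low-pass map, as described above) and a first-difference matrix for $\Gamma$ to enforce smoothness; both choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
t = np.linspace(0, 1, n)
# Ill-conditioned forward operator A: a Gaussian smoothing (low-pass) matrix.
A = np.exp(-(t[:, None] - t[None, :])**2 / 0.01)
x_true = np.sin(2 * np.pi * t)
b = A @ x_true + 0.01 * rng.normal(size=n)

# Gamma as a first-difference operator penalizes rough solutions.
Gamma = np.diff(np.eye(n), axis=0)

def tikhonov(A, b, Gamma):
    """x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b."""
    return np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

x_hat = tikhonov(A, b, 0.5 * Gamma)
# The unregularized solution amplifies the noise catastrophically.
x_naive = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.linalg.norm(x_hat - x_true) < np.linalg.norm(x_naive - x_true)
```

The assertion illustrates the point made above: because $A$ acts as a low-pass filter, its (pseudo)inverse amplifies noise, while the regularized solve stays well-conditioned.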

$L_2$ regularization is used in many contexts aside from linear regression, such as classification with logistic regression or support vector machines, and matrix factorization.

## Lasso

In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. It was originally introduced in geophysics, and later by Robert Tibshirani, who coined the term.

Lasso was originally formulated for linear regression models. This simple case reveals a substantial amount about the estimator. These include its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates do not need to be unique if covariates are collinear.

Though originally defined for linear regression, lasso regularization is easily extended to other statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators. Lasso's ability to perform subset selection relies on the form of the constraint and has a variety of interpretations including in terms of geometry, Bayesian statistics and convex analysis.

The LASSO is closely related to basis pursuit denoising.

### History

Lasso was introduced in order to improve the prediction accuracy and interpretability of regression models. It selects a reduced set of the known covariates for use in a model.

Lasso was developed independently in geophysics literature in 1986, based on prior work that used the $\ell^1$ penalty for both fitting and penalization of the coefficients. Statistician Robert Tibshirani independently rediscovered and popularized it in 1996, based on Breiman's nonnegative garrote.

Prior to lasso, the most widely used method for choosing covariates was stepwise selection. That approach only improves prediction accuracy in certain cases, such as when only a few covariates have a strong relationship with the outcome. However, in other cases, it can increase prediction error.

At the time, ridge regression was the most popular technique for improving prediction accuracy. Ridge regression improves prediction error by shrinking the sum of the squares of the regression coefficients to be less than a fixed value in order to reduce overfitting, but it does not perform covariate selection and therefore does not help to make the model more interpretable.

Lasso achieves both of these goals by forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, which forces certain coefficients to zero, excluding them from impacting prediction. This idea is similar to ridge regression, which also shrinks the size of the coefficients; however, ridge regression tends to set far fewer coefficients to zero.

### Basic form

#### Least squares

Consider a sample consisting of $N$ cases, each of which consists of $p$ covariates and a single outcome. Let $y_i$ be the outcome and $x_i := (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ be the covariate vector for the $i$th case. Then the objective of lasso is to solve

[$] \min_{ \beta_0, \beta } \left\{ \sum_{i=1}^N (y_i - \beta_0 - x_i^T \beta)^2 \right\} \text{ subject to } \sum_{j=1}^p |\beta_j| \leq t. [$]

Here $\beta_0$ is the constant coefficient, $\beta:=(\beta_1,\beta_2,\ldots, \beta_p)$ is the coefficient vector, and $t$ is a prespecified free parameter that determines the degree of regularization.

Letting $X$ be the covariate matrix, so that $X_{ij} = (x_i)_j$ and $x_i^T$ is the $i$ th row of $X$, the expression can be written more compactly as

[$] \min_{ \beta_0, \beta } \left\{ \left\| y - \beta_0 - X \beta \right\|_2^2 \right\} \text{ subject to } \| \beta \|_1 \leq t, [$]

where $\| u \|_p = \left( \sum_{i=1}^N | u_i |^p \right)^{1/p}$ is the standard $\ell^p$ norm.

Denoting the mean of the covariate vectors $x_i$ by $\bar{x}$ and the mean of the response variables $y_i$ by $\bar{y}$, the resulting estimate for $\beta_0$ is $\hat{\beta}_0 = \bar{y} - \bar{x}^T \beta$, so that

[$] y_i - \hat{\beta}_0 - x_i^T \beta = y_i - ( \bar{y} - \bar{x}^T \beta ) - x_i^T \beta = ( y_i - \bar{y} ) - ( x_i - \bar{x} )^T \beta, [$]

and therefore it is standard to work with variables that have been made zero-mean. Additionally, the covariates are typically standardized $\textstyle \left( \sum_{i=1}^N x_{ij}^2 = 1 \right)$ so that the solution does not depend on the measurement scale.
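The centering and standardization step can be sketched as follows (NumPy, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
# Covariates with very different locations and scales.
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(100, 3))
y = rng.normal(size=100)

# Center the response and covariates, then scale each column to unit norm
# (sum of squares = 1) so the lasso solution is scale-invariant.
Xc = X - X.mean(axis=0)
Xs = Xc / np.linalg.norm(Xc, axis=0)
yc = y - y.mean()

assert np.allclose((Xs**2).sum(axis=0), 1.0)
assert np.allclose(Xs.mean(axis=0), 0.0, atol=1e-12)
```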

It can be helpful to rewrite

[$] \min_{ \beta \in \mathbb{R}^p } \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 \right\} \text{ subject to } \| \beta \|_1 \leq t [$]

in the so-called Lagrangian form

[$] \min_{ \beta \in \mathbb{R}^p } \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_1 \right\} [$]

where the exact relationship between $t$ and $\lambda$ is data dependent.
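The Lagrangian form can be minimized by proximal gradient descent (ISTA), which alternates a gradient step on the squared-error term with soft thresholding. This is one standard solver among several, and the sketch below uses synthetic data and illustrative names:

```python
import numpy as np

def soft_threshold(z, a):
    """Elementwise soft thresholding: shrink toward zero, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Proximal-gradient sketch for (1/N)||y - X b||_2^2 + lam ||b||_1."""
    N, p = X.shape
    # Step size from the Lipschitz constant of the smooth term's gradient.
    L = 2.0 / N * np.linalg.eigvalsh(X.T @ X).max()
    t = 1.0 / L
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2.0 / N * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - t * grad, t * lam)
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -3.0, 1.5]   # only the first three covariates matter
y = X @ beta_true + 0.1 * rng.normal(size=200)

beta_hat = lasso_ista(X, y, lam=0.5)
# A sufficiently large lam zeroes out the irrelevant coefficients exactly.
assert np.all(np.abs(beta_hat[3:]) < 1e-8)
```

The soft-threshold step is what produces exact zeros, which is the variable-selection behavior discussed in the next subsection.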

#### Orthonormal covariates

Some basic properties of the lasso estimator can now be considered.

Assuming first that the covariates are orthonormal so that $( x_i \mid x_j ) = \delta_{ij}$, where $( \cdot \mid \cdot )$ is the inner product and $\delta_{ij}$ is the Kronecker delta, or, equivalently, $X^T X = I$, then using subgradient methods it can be shown that

[$] \hat{\beta}_j = S_{N \lambda}\left( \hat{\beta}^\text{OLS}_j \right) = \hat{\beta}^\text{OLS}_j \max \left( 0, 1 - \frac{ N \lambda }{ |\hat{\beta}^\text{OLS}_j| } \right), \quad \text{where } \hat{\beta}^\text{OLS} = (X^T X)^{-1} X^T y. [$]

$S_\alpha$ is referred to as the soft thresholding operator, since it translates values towards zero (making them exactly zero if they are small enough) instead of setting smaller values to zero and leaving larger ones untouched as the hard thresholding operator, often denoted $H_\alpha$, would.
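The two operators can be contrasted directly; a small NumPy sketch (function names are illustrative):

```python
import numpy as np

def soft_threshold(z, a):
    """S_a: translate values toward zero; exactly zero inside [-a, a]."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def hard_threshold(z, a):
    """H_a: zero out small values, leave large ones untouched."""
    return z * (np.abs(z) >= a)

z = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
# Soft thresholding shrinks survivors (-3 -> -2, 4 -> 3);
# hard thresholding keeps them unchanged.
assert np.allclose(soft_threshold(z, 1.0), [-2.0, 0.0, 0.0, 0.0, 3.0])
assert np.allclose(hard_threshold(z, 1.0), [-3.0, 0.0, 0.0, 1.0, 4.0])
```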

In ridge regression the objective is to minimize

[$] \min_{ \beta \in \mathbb{R}^p } \left\{ \frac{1}{N} \| y - X \beta \|_2^2 + \lambda \| \beta \|_2^2 \right\} [$]

yielding

[$] \hat{\beta} = \left( X^T X + N\lambda I \right)^{-1} X^T y. [$]

It can also be compared to regression with best subset selection, in which the goal is to minimize

[$] \min_{ \beta \in \mathbb{R}^p } \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_0 \right\} [$]

where $\| \cdot \|_0$ is the "$\ell^0$ norm", which is defined as $\| z \|_0 = m$ if exactly $m$ components of $z$ are nonzero. In this case, it can be shown that

[$] \hat{\beta}_j = H_{ \sqrt{ N \lambda } } \left( \hat{\beta}^\text{OLS}_j \right) = \hat{\beta}^\text{OLS}_j \mathrm{I} \left( \left| \hat{\beta}^\text{OLS}_j \right| \geq \sqrt{ N \lambda } \right) [$]

where $H_\alpha$ is the so-called hard thresholding function and $\mathrm{I}$ is an indicator function (it is 1 if its argument is true and 0 otherwise).

Therefore, the lasso estimates share features of both ridge and best subset selection regression: like ridge regression, lasso shrinks the magnitude of all the coefficients, and, as in best subset selection, it sets some of them to zero. Additionally, while ridge regression scales all of the coefficients by a constant factor, lasso instead translates the coefficients towards zero by a constant value and sets them to zero if they reach it.

#### Correlated covariates

In one special case two covariates, say $j$ and $k$, are identical for each observation, so that $x_{(j),i} = x_{(k),i}$ for all $i$. Then the values of $\beta_j$ and $\beta_k$ that minimize the lasso objective function are not uniquely determined. In fact, if $\hat{\beta}$ is a solution in which $\hat{\beta}_j \hat{\beta}_k \geq 0$, then for any $s \in [0,1]$, replacing $\hat{\beta}_j$ by $s ( \hat{\beta}_j + \hat{\beta}_k )$ and $\hat{\beta}_k$ by $(1 - s ) ( \hat{\beta}_j + \hat{\beta}_k )$, while keeping all the other $\hat{\beta}_i$ fixed, gives a new solution, so the lasso objective function has a continuum of valid minimizers. Several variants of the lasso, including elastic net regularization, have been designed to address this shortcoming.
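This non-uniqueness is easy to verify numerically: with duplicated columns, redistributing the coefficient mass between them leaves the lasso objective unchanged (synthetic data; illustrative sketch):

```python
import numpy as np

def lasso_objective(X, y, beta, lam):
    """(1/N)||y - X beta||_2^2 + lam ||beta||_1."""
    return np.sum((y - X @ beta)**2) / len(y) + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(4)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x, rng.normal(size=(50, 1))])  # columns 0 and 1 identical
y = rng.normal(size=50)

beta = np.array([0.6, 0.2, -0.3])                # a candidate solution
total = beta[0] + beta[1]
for s in [0.0, 0.25, 0.5, 1.0]:
    # Redistribute the shared mass between the duplicated covariates.
    beta_s = np.array([s * total, (1 - s) * total, beta[2]])
    assert np.isclose(lasso_objective(X, y, beta_s, 0.1),
                      lasso_objective(X, y, beta, 0.1))
```

The fit term depends only on $\hat{\beta}_j + \hat{\beta}_k$, and since both coefficients keep the same sign, the $\ell^1$ penalty is unchanged as well.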

### Interpretations

#### Geometric interpretation

Lasso can set coefficients to zero, while the superficially similar ridge regression cannot. This is due to the difference in the shape of their constraint boundaries. Both lasso and ridge regression can be interpreted as minimizing the same objective function

[$] \min_{ \beta_0, \beta } \left\{ \frac{1}{N} \left\| y - \beta_0 - X \beta \right\|_2^2 \right\} [$]

but with respect to different constraints: $\| \beta \|_1 \leq t$ for lasso and $\| \beta \|_2^2 \leq t$ for ridge. The figure shows that the constraint region defined by the $\ell^1$ norm is a square rotated so that its corners lie on the axes (in general a cross-polytope), while the region defined by the $\ell^2$ norm is a circle (in general an n-sphere), which is rotationally invariant and therefore has no corners. As seen in the figure, a convex object that lies tangent to the boundary, such as the line shown, is likely to encounter a corner (or a higher-dimensional equivalent) of the cross-polytope, at which some components of $\beta$ are identically zero, while in the case of an n-sphere, the points on the boundary for which some of the components of $\beta$ are zero are not distinguished from the others and the convex object is no more likely to contact a point at which some components of $\beta$ are zero than one for which none of them are.

#### Making λ easier to interpret with an accuracy-simplicity tradeoff

The lasso can be rescaled so that it becomes easy to anticipate and influence the degree of shrinkage associated with a given value of $\lambda$. It is assumed that $X$ is standardized with z-scores and that $y$ is centered (zero mean). Let $\beta_0$ represent the hypothesized regression coefficients and let $b_{OLS}$ refer to the data-optimized ordinary least squares solutions. We can then define the Lagrangian as a tradeoff between the in-sample accuracy of the data-optimized solutions and the simplicity of sticking to the hypothesized values. This results in

[$] \min_{ \beta \in \mathbb{R}^p } \left\{ \frac{(y-X\beta)'(y-X\beta)}{(y-X\beta_0)'(y-X\beta_0)}+ 2\lambda \sum_{i=1}^p \frac{|\beta_i-\beta_{0,i}|}{q_{i}} \right\} [$]

where $q_i$ is specified below. The first fraction represents relative accuracy, the second fraction relative simplicity, and $\lambda$ balances between the two.

(Figure: solution paths for the $\ell_1$ norm and $\ell_2$ norm when $b_{OLS}=2$ and $\beta_{0}=0$.)

Given a single regressor, relative simplicity can be defined by specifying $q_i$ as $|b_{OLS}-\beta_{0}|$, which is the maximum amount of deviation from $\beta_0$ when $\lambda=0$. Assuming that $\beta_{0}=0$, the solution path can be defined in terms of $R^2$:

[$] b_{\ell_1} = \begin{cases} (1-\lambda/R^{2})b_{OLS} & \mbox{if } \lambda \leq R^{2}, \\ 0 & \mbox{if } \lambda > R^{2}. \end{cases} [$]

If $\lambda=0$, the ordinary least squares solution (OLS) is used. The hypothesized value of $\beta_0=0$ is selected if $\lambda$ is bigger than $R^2$. Furthermore, if $R^2=1$, then $\lambda$ represents the proportional influence of $\beta_0=0$. In other words, $\lambda\times100\%$ measures in percentage terms the minimal amount of influence of the hypothesized value relative to the data-optimized OLS solution.

If an $\ell_2$-norm is used to penalize deviations from zero given a single regressor, the solution path is given by

[$] b_{\ell_2} = \left( 1+\frac{\lambda}{R^{2}(1-\lambda)} \right)^{-1} b_{OLS}. [$]

Like $b_{\ell_1}$, $b_{\ell_2}$ moves in the direction of the point $(\lambda = R^2, b=0)$ when $\lambda$ is close to zero; but unlike $b_{\ell_1}$, the influence of $R^2$ diminishes in $b_{\ell_2}$ if $\lambda$ increases (see figure).

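Both single-regressor solution paths can be written down directly; a small sketch under the assumption $\beta_0 = 0$ (function names are illustrative):

```python
def b_l1(lam, R2, b_ols):
    """Rescaled-lasso path: linear shrinkage until lam reaches R^2, then 0."""
    return (1 - lam / R2) * b_ols if lam <= R2 else 0.0

def b_l2(lam, R2, b_ols):
    """Rescaled-ridge path: shrinks toward zero but never reaches it."""
    return b_ols / (1 + lam / (R2 * (1 - lam)))

R2, b_ols = 0.5, 2.0
assert b_l1(0.0, R2, b_ols) == b_ols    # lam = 0 recovers the OLS solution
assert b_l1(0.6, R2, b_ols) == 0.0      # lam > R^2 selects beta_0 = 0
assert abs(b_l2(0.3, R2, b_ols)) > 0.0  # the ell_2 path never hits zero exactly
```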
Given multiple regressors, the moment that a parameter is activated (i.e. allowed to deviate from $\beta_0$) is also determined by a regressor's contribution to $R^2$ accuracy. First,

[$]R^2=1-\frac{(y-Xb)'(y-Xb)}{(y-X\beta_0)'(y-X\beta_0)}.[$]

An $R^2$ of 75% means that in-sample accuracy improves by 75% if the unrestricted OLS solutions are used instead of the hypothesized $\beta_0$ values. The individual contribution of deviating from each hypothesis can be computed with the $p \times p$ matrix

[$]R^{\otimes}=(X'\tilde y_0)(X'\tilde y_0)' (X'X)^{-1}(\tilde y_0'\tilde y_0)^{-1},[$]

where $\tilde y_0=y-X\beta_0$. If $b=b_{OLS}$ when $R^2$ is computed, then the diagonal elements of $R^{\otimes}$ sum to $R^2$. The diagonal $R^{\otimes}$ values may be smaller than 0 or, less often, larger than 1. If regressors are uncorrelated, then the $i^{th}$ diagonal element of $R^{\otimes}$ simply corresponds to the $r^2$ value between $x_i$ and $y$.
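The stated property—that with $b = b_{OLS}$ the diagonal of $R^{\otimes}$ sums to $R^2$—can be checked numerically; a sketch with $\beta_0 = 0$ and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 200, 4
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(size=N)
beta0 = np.zeros(p)                       # hypothesized coefficients

y0 = y - X @ beta0                        # residual under the hypothesis
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
R2 = 1 - ((y - X @ b_ols) @ (y - X @ b_ols)) / (y0 @ y0)

# R_otimes = (X' y0)(X' y0)' (X'X)^{-1} (y0' y0)^{-1}
v = X.T @ y0
R_otimes = np.outer(v, v) @ np.linalg.inv(X.T @ X) / (y0 @ y0)

assert np.isclose(np.trace(R_otimes), R2)
```

The identity follows because the trace reduces to $\tilde y_0' X (X'X)^{-1} X' \tilde y_0 / (\tilde y_0' \tilde y_0)$, the squared norm of the projection of $\tilde y_0$ onto the column space of $X$, relative to $\|\tilde y_0\|^2$.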

A rescaled version of the adaptive lasso can be obtained by setting $q_{\mbox{adaptive lasso},i}=|b_{OLS,i}-\beta_{0,i}|$. If regressors are uncorrelated, the moment that the $i^{th}$ parameter is activated is given by the $i^{th}$ diagonal element of $R^{\otimes}$. Assuming for convenience that $\beta_0$ is a vector of zeros,

[$] b_{i} = \begin{cases} (1-\lambda/R_{ii}^{\otimes})b_{OLS,i} & \mbox{if } \lambda \leq R_{ii}^{\otimes}, \\ 0 & \mbox{if } \lambda > R_{ii}^{\otimes}. \end{cases} [$]

That is, if regressors are uncorrelated, $\lambda$ again specifies the minimal influence of $\beta_0$. Even when regressors are correlated, the first time that a regression parameter is activated occurs when $\lambda$ is equal to the highest diagonal element of $R^{\otimes}$.

These results can be compared to a rescaled version of the lasso by defining $q_{\mbox{lasso},i}=\frac{1}{p} \sum_{l} |b_{OLS,l}-\beta_{0,l}|$, which is the average absolute deviation of $b_{OLS}$ from $\beta_0$. Assuming that regressors are uncorrelated, then the moment of activation of the $i^{th}$ regressor is given by

[$] \tilde \lambda_{\text{lasso},i} = \frac{1}{p}\sqrt{R^{\otimes}_{ii}} \sum_{l=1}^p\sqrt{R^{\otimes}_{ll}}.[$]

For $p=1$, the moment of activation is again given by $\tilde \lambda_{\text{lasso},i}=R^2$. If $\beta_0$ is a vector of zeros and a subset of $p_B$ relevant parameters are equally responsible for a perfect fit of $R^2=1$, then this subset is activated at a $\lambda$ value of $\frac{1}{p}$. The moment of activation of a relevant regressor then equals $\frac{1}{p}\frac{1}{\sqrt{p_B}}p_B\frac{1}{\sqrt{p_B}}=\frac{1}{p}$. In other words, the inclusion of irrelevant regressors delays the moment that relevant regressors are activated by this rescaled lasso. The adaptive lasso and the lasso are special cases of a '1ASTc' estimator. The latter only groups parameters together if the absolute correlation among regressors is larger than a user-specified value.

## Choice of regularization parameter

Choosing the regularization parameter ($\lambda$) is a fundamental part of lasso. A good value is essential to the performance of lasso, since it controls the strength of shrinkage and variable selection, which, in moderation, can improve both prediction accuracy and interpretability. However, if the regularization becomes too strong, important variables may be omitted and coefficients may be shrunk excessively, which can harm both predictive capacity and inference. Cross-validation is often used to find the regularization parameter.
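A minimal sketch of validation over a $\lambda$ grid, using the closed-form single-regressor lasso under the $(1/N)$-scaled Lagrangian objective (a simple holdout split rather than full K-fold; all names and data are illustrative):

```python
import numpy as np

def lasso_1d(x, y, lam):
    """Closed-form lasso for a single regressor under
    (1/N)||y - x b||^2 + lam |b|: a soft-thresholded OLS coefficient."""
    N = len(y)
    b_ols = (x @ y) / (x @ x)
    thresh = N * lam / (2 * (x @ x))
    return np.sign(b_ols) * max(0.0, abs(b_ols) - thresh)

rng = np.random.default_rng(6)
N = 200
x = rng.normal(size=N)
y = 1.0 * x + rng.normal(size=N)        # strong signal, unit noise

# Holdout validation: fit on the first 150 cases, score on the last 50.
train, val = slice(0, 150), slice(150, N)
grid = np.logspace(-3, 1, 30)
errors = [np.mean((y[val] - x[val] * lasso_1d(x[train], y[train], lam))**2)
          for lam in grid]
best_lam = grid[int(np.argmin(errors))]

# With a strong signal, full shrinkage to zero should not win.
assert best_lam < 10.0
```

In practice one would use K-fold cross-validation and a multivariate solver, but the structure—fit per $\lambda$, score on held-out data, pick the minimizer—is the same.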

Information criteria such as the Bayesian information criterion (BIC) and the Akaike information criterion (AIC) might be preferable to cross-validation, because they are faster to compute and their performance is less volatile in small samples. An information criterion selects the estimator's regularization parameter by maximizing a model's in-sample accuracy while penalizing its effective number of parameters/degrees of freedom. Zou et al. proposed to measure the effective degrees of freedom by counting the number of parameters that deviate from zero. The degrees of freedom approach was considered flawed by Kaufman and Rosset and Janson et al., because a model's degrees of freedom might increase even when it is penalized harder by the regularization parameter. As an alternative, the relative simplicity measure defined above can be used to count the effective number of parameters. For the lasso, this measure is given by

[$]\hat{\mathcal{P}} = \sum_{i=1}^p \frac{|\beta_i-\beta_{0,i}|}{\frac{1}{p} \sum_{l} |b_{OLS,l}-\beta_{0,l}|},[$]

which monotonically increases from zero to $p$ as the regularization parameter decreases from $\infty$ to zero.
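This measure is simple to compute; a sketch with $p = 3$ and $\beta_0 = 0$ (illustrative names):

```python
import numpy as np

def effective_params(beta, b_ols, beta0):
    """Relative-simplicity count of effective parameters for the lasso."""
    q = np.mean(np.abs(b_ols - beta0))          # average absolute deviation
    return np.sum(np.abs(beta - beta0)) / q

b_ols = np.array([2.0, -1.0, 0.5])
beta0 = np.zeros(3)

# Full shrinkage to the hypothesized values counts zero effective parameters;
# no shrinkage (beta = b_OLS) counts all p of them.
assert effective_params(beta0, b_ols, beta0) == 0.0
assert np.isclose(effective_params(b_ols, b_ols, beta0), 3.0)
```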