# Mixed model

Here the mixed model introduced by [1], which generalizes the linear regression model, is studied and estimated in unpenalized (!) fashion. Nonetheless, it will turn out to have an interesting connection to ridge regression. This connection may be exploited to arrive at an informed choice of the ridge penalty parameter.

The linear regression model, $\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon$, assumes the effect of each covariate to be fixed. In certain situations it may be desirable to relax this assumption. For instance, a study may be replicated, and conditions need not be exactly constant across replications, among others due to batch effects. These may be accounted for and are then incorporated in the linear regression model. But it is not the effects of the particular batches included in the study that are of interest: had the study been carried out at a later date, other batches would have been involved. The included batches are thus a random draw from the population of all batches. With each batch possibly having a different effect, these effects may also be viewed as random draws from some hypothetical ‘effect’-distribution. From this point of view the effects estimated by the linear regression model are realizations from the ‘effect’-distribution. But interest is not in the particular but in the general. Hence, a model that enables a generalization to the distribution of batch effects is better suited here.

Like the linear regression model the mixed model, also called mixed effects model or random effects model, explains the variation in the response by a linear combination of the covariates. The key difference lies in the fact that the latter model distinguishes two sets of covariates, one with fixed effects and the other with random effects. In matrix notation mirroring that of the linear regression model, the mixed model can be written as:

[$] \begin{eqnarray*} \mathbf{Y} & = & \mathbf{X} \bbeta + \mathbf{Z} \ggamma + \vvarepsilon, \end{eqnarray*} [$]

where $\mathbf{Y}$ is the response vector of length $n$, $\mathbf{X}$ the $(n \times p)$-dimensional design matrix with associated $p$-dimensional vector $\bbeta$ of fixed effects, $\mathbf{Z}$ the $(n \times q)$-dimensional design matrix with associated $q$-dimensional vector $\ggamma$ of random effects, and distributional assumptions $\vvarepsilon \sim \mathcal{N}( \mathbf{0}_{n}, \sigma_{\varepsilon}^2 \mathbf{I}_{nn})$, $\ggamma \sim \mathcal{N}(\mathbf{0}_{q}, \mathbf{R}_{\ttheta})$, and $\vvarepsilon$ and $\ggamma$ independent. In this $\mathbf{R}_{\ttheta}$ is symmetric, positive definite, and parametrized by a low-dimensional parameter $\ttheta$.

The distribution of $\mathbf{Y}$ is fully defined by the mixed model and its accompanying assumptions. As $\mathbf{Y}$ is a linear combination of normally distributed random variables, it is itself normally distributed. Its mean is:

[$] \begin{eqnarray*} \mathbb{E}(\mathbf{Y}) & = & \mathbb{E}(\mathbf{X} \bbeta + \mathbf{Z} \ggamma + \vvarepsilon) \, \, \, = \, \, \, \mathbb{E}(\mathbf{X} \bbeta) + \mathbb{E}(\mathbf{Z} \ggamma) + \mathbb{E}(\vvarepsilon) \, \, \, = \, \, \, \mathbf{X} \bbeta + \mathbf{Z} \mathbb{E}(\ggamma) \, \, \, = \, \, \, \mathbf{X} \bbeta, \end{eqnarray*} [$]

while its variance is:

[$] \begin{eqnarray*} \mbox{Var}(\mathbf{Y}) & = & \mbox{Var}(\mathbf{X} \bbeta) + \mbox{Var}( \mathbf{Z} \ggamma) + \mbox{Var}(\vvarepsilon) \, \, \, = \, \, \, \mathbf{Z} \mbox{Var}(\ggamma) \mathbf{Z}^{\top} + \sigma_\varepsilon^2 \mathbf{I}_{nn} \, \, \, = \, \, \, \mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} + \sigma_\varepsilon^2 \mathbf{I}_{nn} \end{eqnarray*} [$]

in which the independence between $\vvarepsilon$ and $\ggamma$ and the standard algebra rules for the $\mbox{Var}(\cdot)$ and $\mbox{Cov}(\cdot)$ operators have been used. Put together, this yields: $\mathbf{Y} \sim \mathcal{N}(\mathbf{X} \bbeta, \mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} + \sigma^2_{\varepsilon} \mathbf{I}_{nn})$. Hence, the random effects term $\mathbf{Z} \ggamma$ of the mixed model does not contribute to the explanation of the mean of $\mathbf{Y}$, but aids in the decomposition of its variance around the mean. From this formulation of the model it is obvious that the random parts of two distinct observations of the response are -- in general -- not independent: their covariance is given by the corresponding element of $\mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top}$. Put differently, due to the independence assumption on the error, two observations can only be (marginally) dependent through the random effect, which is attenuated by the associated design matrix $\mathbf{Z}$. To illustrate this, temporarily set $\mathbf{R}_{\theta} = \sigma_{\gamma}^2 \mathbf{I}_{qq}$. Then, $\mbox{Var}(\mathbf{Y}) = \sigma_{\gamma}^2 \mathbf{Z} \mathbf{Z}^{\top} + \sigma_{\varepsilon}^2 \mathbf{I}_{nn}$. From this it is obvious that two variates of $\mathbf{Y}$ are now independent if and only if the corresponding rows of $\mathbf{Z}$ are orthogonal. Moreover, two pairs of variates have the same covariance if they have the same covariate information in $\mathbf{Z}$. Two distinct observations of the same individual have the same covariance as one of these observations with that of another individual with identical covariate information as the left-out observation on the former individual. In particular, their ‘between-covariance’ equals their individual ‘within-covariance’.
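This covariance structure is easily made concrete numerically. The sketch below uses made-up dimensions and variances and a block-structured $\mathbf{Z}$ with $\mathbf{R}_{\theta} = \sigma_{\gamma}^2 \mathbf{I}_{qq}$; all choices are purely illustrative:

```python
import numpy as np

n, q = 6, 2
sigma2_eps, sigma2_gamma = 1.0, 4.0

# Design matrix of the random effects: the first three observations load on
# random effect 1, the last three on random effect 2.
Z = np.zeros((n, q))
Z[:3, 0] = 1.0
Z[3:, 1] = 1.0

# Marginal covariance of Y: Var(Y) = sigma_gamma^2 Z Z^T + sigma_eps^2 I.
V = sigma2_gamma * Z @ Z.T + sigma2_eps * np.eye(n)

# Observations sharing a random effect are correlated ...
print(V[0, 1])   # 4.0, i.e. sigma_gamma^2
# ... while observations with orthogonal rows of Z are independent.
print(V[0, 4])   # 0.0
```

Observations 0 and 1 share the first random effect and so covary; observations 0 and 4 have orthogonal rows of `Z` and are independent, exactly as stated above.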

The mixed model and the linear regression model are clearly closely related: they share a common mean, a normally distributed error, and both explain the response by a linear combination of the explanatory variables. Moreover, when $\ggamma$ is known, the mixed model reduces to a linear regression model. This is seen from the conditional distribution of $\mathbf{Y}$: $\mathbf{Y} \, | \, \ggamma \sim \mathcal{N}(\mathbf{X} \bbeta + \mathbf{Z} \ggamma, \sigma^2_{\varepsilon} \mathbf{I}_{nn})$. Conditioning on the random effect $\ggamma$ thus pulls in the term $\mathbf{Z} \ggamma$ to the systematic, non-random explanatory part of the model. In principle, the conditioned mixed model could now be rewritten as a linear regression model by forming a new design matrix and parameter from $\mathbf{X}$ and $\mathbf{Z}$ and $\bbeta$ and $\ggamma$, respectively.

Example (Mixed model for a longitudinal study)

A longitudinal study looks into the growth rate of cells. At the beginning of the study cells are placed in $n$ petri dishes, with the same growth medium but at different concentrations. The initial number of cells in each petri dish is counted, as is done at several subsequent time points. The change in cell count is believed to be -- at the log-scale -- a linear function of the concentration of the growth medium. The linear regression model may suffice. However, variation is omnipresent in biology. That is, apart from variation in the initial cell count, each cell -- even if of common descent -- will react (slightly?) differently to the stimulus of the growth medium. This intrinsic cell-to-cell variation in growth response may be accommodated in the linear mixed model by the introduction of a random cell effect, both in off-set and slope. The (log) cell count of petri dish $i$ at time point $t$, denoted $Y_{it}$, is thus described by:

[$] \begin{eqnarray*} Y_{it} & = & \beta_0 + X_i \beta_1 + \mathbf{Z}_i \ggamma_i + \varepsilon_{it}, \end{eqnarray*} [$]

with intercept $\beta_0$, growth medium concentration $X_i$ in petri dish $i$ with fixed growth medium effect $\beta_1$, $\mathbf{Z}_i = (1, X_i)$, $\ggamma_i$ the $2$-dimensional random effect of petri dish $i$, bivariate normally distributed with zero mean and diagonal covariance matrix, and finally $\varepsilon_{it}\sim \mathcal{N}(0,\sigma_{\varepsilon}^2)$ the error in the cell count of petri dish $i$ at time $t$. In matrix notation the matrix $\mathbf{Z}$ would comprise $2n$ columns: two columns for each petri dish, $\mathbf{e}_i$ and $X_i \mathbf{e}_i$ (with $\mathbf{e}_i$ the indicator vector with ones at the entries corresponding to the observations on petri dish $i$ and zeros elsewhere), corresponding to the random intercept and slope effect, respectively.

The fact that the number of columns of $\mathbf{Z}$, i.e. the number of explanatory random effects, equals $2n$ does not pose identifiability problems as per column only a single parameter is estimated. Finally, to illustrate the difference between the linear regression and the linear mixed model, their fits on artificial data are plotted (Figure, top left panel). Where the linear regression fit shows the ‘grand mean relationship’ between cell count and growth medium, the linear mixed model fit depicts the petri dish specific fits.
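The construction of $\mathbf{Z}$ in such a design can be sketched as follows. The numbers are hypothetical: four dishes, three time points, and made-up concentrations `X`:

```python
import numpy as np

n_dishes, times = 4, 3
X = np.array([0.5, 1.0, 1.5, 2.0])    # growth medium concentration per dish

# One row per observation (dish i, time t); dish i loads on columns 2i, 2i+1:
#   column 2i   -> random intercept of dish i (entry 1),
#   column 2i+1 -> random slope of dish i (entry X_i).
Z = np.zeros((n_dishes * times, 2 * n_dishes))
for i in range(n_dishes):
    rows = slice(i * times, (i + 1) * times)
    Z[rows, 2 * i] = 1.0
    Z[rows, 2 * i + 1] = X[i]

print(Z.shape)   # (12, 8): one row per observation, 2n columns
```

Each pair of columns is non-zero only on the rows of its own dish, which is why the $2n$ columns cause no identifiability problems.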

The mixed model was motivated by its ability to generalize to instances not included in the study. From the example above another advantage can be deduced: the cells' effects are modelled by a few variance parameters (rather than a separate effect per cell). More degrees of freedom are thus left to estimate the noise level. In particular, a test for the presence of a cell effect will have more power.

The parameters of the mixed model are estimated either by means of likelihood maximization or a related procedure known as restricted maximum likelihood. Both are presented, with the exposé loosely based on [2]. First the maximum likelihood procedure is introduced, which requires the derivation of the likelihood. Hereto the assumption on the random effects is usually transformed. Let $\tilde{\mathbf{R}}_{\theta} = \sigma_{\varepsilon}^{-2} \mathbf{R}_{\theta}$, which is the covariance of the random effects relative to the error variance, and $\tilde{\mathbf{R}}_{\theta} = \mathbf{L}_{\theta} \mathbf{L}_{\theta}^{\top}$ its Cholesky decomposition. Next define the change-of-variables $\ggamma = \mathbf{L}_{\theta} \tilde{\ggamma}$. This transforms the model to: $\mathbf{Y} = \mathbf{X} \bbeta + \mathbf{Z} \mathbf{L}_{\ttheta} \tilde{\ggamma} + \vvarepsilon$, but now with the assumption $\tilde{\ggamma} \sim \mathcal{N}(\mathbf{0}_{q}, \sigma_{\varepsilon}^2 \mathbf{I}_{qq})$. Under this assumption the conditional likelihood, conditional on the random effects, is:

[$] \begin{eqnarray*} L(\mathbf{Y} \, | \, \tilde{\ggamma} = \mathbf{g}) & = & (2 \pi \sigma_{\varepsilon}^2)^{-n/2} \exp ( -\tfrac{1}{2} \sigma_{\varepsilon}^{-2} \| \mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \mathbf{L}_{\theta} \mathbf{g} \|_2^2 ). \end{eqnarray*} [$]

From this the unconditional likelihood is obtained through:

[$] \begin{eqnarray*} L(\mathbf{Y}) & = & \int_{\mathbb{R}^q} L(\mathbf{Y} \, | \, \tilde{\ggamma} = \mathbf{g}) \, f_{\tilde{\ggamma}}(\mathbf{g}) \, d \mathbf{g} \\ & = & \int_{\mathbb{R}^q} (2 \pi \sigma_{\varepsilon}^2)^{-(n+q)/2} \exp [ -\tfrac{1}{2} \sigma_{\varepsilon}^{-2} (\| \mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \mathbf{L}_{\theta} \mathbf{g} \|_2^2 + \| \mathbf{g} \|_2^2 ) ] \, d \mathbf{g}. \end{eqnarray*} [$]

To evaluate the integral, the exponent needs rewriting. Hereto first note that:

[$] \begin{eqnarray*} \| \mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \mathbf{L}_{\theta} \mathbf{g} \|_2^2 + \| \mathbf{g} \|_2^2 & = & (\mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \mathbf{L}_{\theta} \mathbf{g})^{\top} (\mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \mathbf{L}_{\theta} \mathbf{g}) + \mathbf{g}^{\top} \mathbf{g}. \end{eqnarray*} [$]

Now expand the right-hand side as follows:

[$] \begin{eqnarray*} & & \hspace{-1cm} (\mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \mathbf{L}_{\theta} \mathbf{g})^{\top} (\mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \mathbf{L}_{\theta} \mathbf{g}) + \mathbf{g}^{\top} \mathbf{g} \\ & = & \mathbf{Y}^{\top} \mathbf{Y} + \mathbf{g}^{\top} (\mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq}) \mathbf{g} + \bbeta^{\top} \mathbf{X}^{\top} \mathbf{X} \bbeta - \mathbf{Y}^{\top} \mathbf{X} \bbeta - \bbeta^{\top} \mathbf{X}^{\top} \mathbf{Y} \\ & & - (\mathbf{Y}^{\top} \mathbf{Z} \mathbf{L}_{\theta} - \bbeta^{\top} \mathbf{X}^{\top} \mathbf{Z} \mathbf{L}_{\theta}) \mathbf{g} - \mathbf{g}^{\top} (\mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Y} - \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{X} \bbeta) \\ & = & (\mathbf{g} - \mmu_{\tilde{\ggamma} \, | \mathbf{Y}})^{\top} (\mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq}) (\mathbf{g} - \mmu_{\tilde{\ggamma} \, | \, \mathbf{Y}}) \\ & & + ~ (\mathbf{Y} - \mathbf{X} \bbeta)^{\top} [\mathbf{I}_{nn} - \mathbf{Z} \mathbf{L}_{\theta} (\mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq})^{-1} \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} ] (\mathbf{Y} - \mathbf{X} \bbeta) \\ & = & (\mathbf{g} - \mmu_{\tilde{\ggamma} \, | \mathbf{Y}})^{\top} (\mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq}) (\mathbf{g} - \mmu_{\tilde{\ggamma} \, | \, \mathbf{Y}}) \\ & & + ~ (\mathbf{Y} - \mathbf{X} \bbeta)^{\top} (\mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top})^{-1} (\mathbf{Y} - \mathbf{X} \bbeta), \end{eqnarray*} [$]

where $\mmu_{\tilde{\ggamma} \, | \mathbf{Y}} = (\mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq})^{-1} \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} (\mathbf{Y} - \mathbf{X} \bbeta)$ and the Woodbury identity has been used in the last step. As the notation suggests, $\mmu_{\tilde{\ggamma} \, | \mathbf{Y}}$ is the conditional expectation of the random effect given the data: $\mathbb{E}(\tilde{\ggamma} \, | \, \mathbf{Y})$. This may be verified from the conditional distribution of $\tilde{\ggamma} \, | \, \mathbf{Y}$ when exploiting the equality derived in the preceding display. Substitute the latter in the integral of the likelihood and use the change-of-variables $\mathbf{h} = (\mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq})^{1/2} ( \mathbf{g} - \mmu_{\tilde{\ggamma} \, | \, \mathbf{Y}} )$ with Jacobian $| (\mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq})^{1/2} |$:

[$] \begin{eqnarray} \nonumber L(\mathbf{Y}) & = & \int_{\mathbb{R}^q} (2 \pi \sigma_{\varepsilon}^2)^{-(n+q)/2} | \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq} |^{-1/2} \exp ( -\tfrac{1}{2} \sigma_{\varepsilon}^{-2} \mathbf{h}^{\top} \mathbf{h}) \\ \nonumber & & \qquad \qquad \exp [ -\tfrac{1}{2} \sigma_{\varepsilon}^{-2} (\mathbf{Y} - \mathbf{X} \bbeta)^{\top} (\mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top})^{-1} (\mathbf{Y} - \mathbf{X} \bbeta) ] \, d \mathbf{h} \\ \nonumber & = & (2 \pi \sigma_{\varepsilon}^2)^{-n/2} | \mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top} |^{-1/2} \\ \label{form.mixedModel_fullLikelihood} & & \qquad \qquad \exp [ -\tfrac{1}{2} \sigma_{\varepsilon}^{-2} (\mathbf{Y} - \mathbf{X} \bbeta)^{\top} (\mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top})^{-1} (\mathbf{Y} - \mathbf{X} \bbeta) ], \end{eqnarray} [$]

where in the last step Sylvester's determinant identity has been used.

The maximum likelihood estimators of the mixed model parameters $\bbeta$, $\sigma_{\varepsilon}^2$ and $\ttheta$ (and thereby $\tilde{\mathbf{R}}_{\theta}$) are found through the maximization of the logarithm of the likelihood (\ref{form.mixedModel_fullLikelihood}). Hereto find the roots of the partial derivatives of this log-likelihood with respect to the mixed model parameters. For $\bbeta$ and $\sigma_{\varepsilon}^2$ this yields:

[$] \begin{eqnarray*} \hat{\bbeta} & = & [\mathbf{X}^{\top} (\mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top})^{-1} \mathbf{X}]^{-1} \mathbf{X}^{\top} (\mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top} )^{-1} \mathbf{Y}, \\ \hat{\sigma}_{\varepsilon}^2 & = & \tfrac{1}{n} (\mathbf{Y} - \mathbf{X} \bbeta)^{\top} (\mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top})^{-1} (\mathbf{Y} - \mathbf{X} \bbeta). \end{eqnarray*} [$]

The former estimate can be substituted into the latter to remove its dependency on $\bbeta$. However, both estimators still depend on $\ttheta$. An estimator of $\ttheta$ may be found by substitution of $\hat{\bbeta}$ and $\hat{\sigma}_{\varepsilon}^2$ into the log-likelihood, followed by its maximization. For general parametrizations of $\tilde{\mathbf{R}}_{\theta}$ by $\ttheta$ there are no explicit solutions. Then, resort to standard nonlinear solvers such as the Newton-Raphson algorithm and the like. With a maximum likelihood estimate of $\ttheta$ at hand, those of the other two mixed model parameters are readily obtained from the formulas above. As $\ttheta$ is unknown at the onset, it needs to be initialized, followed by sequential updating of the parameter estimates until convergence.
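For the special case $\tilde{\mathbf{R}}_{\theta} = \theta \, \mathbf{I}_{qq}$ with a scalar $\theta$, this scheme can be sketched in code. The data are simulated, and the crude grid search merely stands in for a proper Newton-type solver:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 50, 2, 5
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
Y = X @ np.array([1.0, -2.0]) + Z @ rng.normal(scale=1.5, size=q) + rng.normal(size=n)

def profile_neg_loglik(theta):
    # Relative covariance of Y when R_tilde = theta * I: Q = I + theta * Z Z^T.
    Q = np.eye(n) + theta * Z @ Z.T
    Qinv = np.linalg.inv(Q)
    # ML estimate of beta given theta (a generalized least squares fit).
    beta = np.linalg.solve(X.T @ Qinv @ X, X.T @ Qinv @ Y)
    r = Y - X @ beta
    sigma2 = (r @ Qinv @ r) / n          # ML estimate of sigma_eps^2 given theta
    # Profiled -2 log-likelihood, up to an additive constant.
    return n * np.log(sigma2) + np.linalg.slogdet(Q)[1]

# Crude grid search over theta; in practice a Newton-type solver would be used.
thetas = np.linspace(0.01, 10.0, 200)
theta_hat = thetas[np.argmin([profile_neg_loglik(t) for t in thetas])]
```

Substituting the expressions for $\hat{\bbeta}$ and $\hat{\sigma}_{\varepsilon}^2$ into the log-likelihood leaves, up to constants, $n \log \hat{\sigma}_{\varepsilon}^2 + \log | \mathbf{I}_{nn} + \theta \mathbf{Z} \mathbf{Z}^{\top} |$ to be minimized over $\theta$, which is what `profile_neg_loglik` evaluates.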

Restricted maximum likelihood (REML) considers the fixed effect parameter $\bbeta$ as a ‘nuisance’ parameter and concentrates on the estimation of the variance components. The nuisance parameter is integrated out of the likelihood, $\int_{\mathbb{R}^p} L(\mathbf{Y}) d\bbeta$, which is referred to as the restricted likelihood. Those values of $\ttheta$ (and thereby $\tilde{\mathbf{R}}_{\theta}$) and $\sigma_{\varepsilon}^2$ that maximize the restricted likelihood are the REML estimators. The restricted likelihood, by an argument similar to that used in the derivation of the likelihood, simplifies to:

[$] \begin{eqnarray*} \int_{\mathbb{R}^p} L(\mathbf{Y}) d\bbeta & = & (2 \pi \sigma_{\varepsilon}^2)^{-n/2} | \tilde{\mathbf{Q}}_{\theta} |^{-1/2} \exp \{ -\tfrac{1}{2} \sigma_{\varepsilon}^{-2} \mathbf{Y}^{\top} [ \tilde{\mathbf{Q}}_{\theta}^{-1} - \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X} (\mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} ] \mathbf{Y} \} \\ & & \qquad \int_{\mathbb{R}^p} \exp \{ -\tfrac{1}{2} \sigma_{\varepsilon}^{-2} [\bbeta - (\mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{Y}]^{\top} \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X} \\ & & \qquad \qquad \qquad \qquad \qquad \qquad \qquad [\bbeta - (\mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{Y} ] \} d \bbeta \\ & = & (2 \pi \sigma_{\varepsilon}^2)^{-(n-p)/2} | \tilde{\mathbf{Q}}_{\theta} |^{-1/2} | \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X} |^{-1/2} \\ & & \qquad \qquad \qquad \exp \{ -\tfrac{1}{2} \sigma_{\varepsilon}^{-2} \mathbf{Y}^{\top} [ \tilde{\mathbf{Q}}_{\theta}^{-1} - \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X} (\mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} ] \mathbf{Y} \}, \end{eqnarray*} [$]

where $\tilde{\mathbf{Q}}_{\theta} = \mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top}$ is the covariance of $\mathbf{Y}$ relative to the error variance. The REML estimators are now found by equating the partial derivatives of this restricted log-likelihood to zero and solving for $\sigma_{\varepsilon}^2$ and $\ttheta$. The former, given the latter, is:

[$] \begin{eqnarray*} \hat{\sigma}_{\varepsilon}^2 & = & \tfrac{1}{n-p} \mathbf{Y}^{\top} [ \tilde{\mathbf{Q}}_{\theta}^{-1} - \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X} (\mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} ] \mathbf{Y} \\ & = & \tfrac{1}{n-p} \mathbf{Y}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1/2} [ \mathbf{I}_{nn} - \tilde{\mathbf{Q}}_{\theta}^{-1/2} \mathbf{X} (\mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1/2} \tilde{\mathbf{Q}}_{\theta}^{-1/2} \mathbf{X})^{-1} \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1/2} ] \tilde{\mathbf{Q}}_{\theta}^{-1/2} \mathbf{Y}, \end{eqnarray*} [$]

where the rewritten form reveals a projection matrix and, consequently, a residual sum of squares. Like the maximum likelihood estimator of $\ttheta$, its REML counterpart is generally not available analytically and is to be found numerically. Iterating between the estimation of both parameters until convergence yields the REML estimators. Obviously, REML estimation of the mixed model parameters does not produce an estimate of the fixed effect parameter $\bbeta$ (as it has been integrated out). Should a point estimate nonetheless be desired, then in practice the ML estimate of $\bbeta$ is used, evaluated at the REML estimates of the other parameters.
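That the two expressions for $\hat{\sigma}_{\varepsilon}^2$ coincide is readily verified numerically. In the sketch below all quantities are arbitrary simulated stand-ins, with `Q` denoting $\tilde{\mathbf{Q}}_{\theta}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 20, 3, 4
X, Z, Y = rng.normal(size=(n, p)), rng.normal(size=(n, q)), rng.normal(size=n)
A = rng.normal(size=(q, q))
R = A @ A.T + np.eye(q)                  # some positive definite R_tilde
Q = np.eye(n) + Z @ R @ Z.T              # Q_tilde = I + Z R_tilde Z^T
Qinv = np.linalg.inv(Q)

# First form: quadratic form in Q^{-1} minus its X-projection part.
s1 = Y @ (Qinv - Qinv @ X @ np.linalg.inv(X.T @ Qinv @ X) @ X.T @ Qinv) @ Y / (n - p)

# Second form: project the decorrelated data Q^{-1/2} Y off Q^{-1/2} X.
w, U = np.linalg.eigh(Qinv)              # symmetric square root of Q^{-1}
Qih = U @ np.diag(np.sqrt(w)) @ U.T
Xt, Yt = Qih @ X, Qih @ Y
P = np.eye(n) - Xt @ np.linalg.inv(Xt.T @ Xt) @ Xt.T    # projection matrix
s2 = Yt @ P @ Yt / (n - p)

print(np.isclose(s1, s2))   # True
```

The second form makes the residual-sum-of-squares interpretation explicit: `P` projects the decorrelated response onto the orthogonal complement of the decorrelated design matrix.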

An alternative way to proceed (and insightful for the present purpose) follows the original approach of Henderson, who aimed to construct a linear predictor for $\mathbf{Y}$.

Definition

A predictand is the function of the parameters that is to be predicted. A predictor is a function of the data that predicts the predictand. When this latter function is linear in the observations it is said to be a linear predictor.

In case of the mixed model the predictand is $\mathbf{X}_{{\mbox{{\tiny new}}}} \bbeta + \mathbf{Z}_{{\mbox{{\tiny new}}}} \ggamma$ for $(n_{{\mbox{{\tiny new}}}} \times p)$- and $(n_{{\mbox{{\tiny new}}}} \times q)$-dimensional design matrices $\mathbf{X}_{{\mbox{{\tiny new}}}}$ and $\mathbf{Z}_{{\mbox{{\tiny new}}}}$, respectively. Similarly, the predictor is some function of the data $\mathbf{Y}$. When it can be expressed as $\mathbf{A} \mathbf{Y}$ for some matrix $\mathbf{A}$ it is a linear predictor.

The construction of the aforementioned linear predictor requires estimates of $\bbeta$ and $\ggamma$. To obtain these estimates first derive the joint density of $(\ggamma, \mathbf{Y})$:

[$] \begin{eqnarray*} \left( \begin{array}{c} \ggamma \\ \mathbf{Y} \end{array} \right) & \sim & \mathcal{N} \left( \left( \begin{array}{l} \mathbf{0}_q \\ \mathbf{X} \bbeta \end{array} \right), \left( \begin{array}{lr} \mathbf{R}_{\theta} & \mathbf{R}_{\theta} \mathbf{Z}^{\top} \\ \mathbf{Z} \mathbf{R}_{\theta} & \sigma_{\varepsilon}^2 \mathbf{I}_{nn} + \mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} \end{array} \right) \right) \end{eqnarray*} [$]

From this the likelihood is obtained and, after some manipulations, minus twice the log-likelihood can be shown to be proportional to:

[$] \begin{eqnarray} \label{mixedModel.penalizedLossFunction} \sigma_{\varepsilon}^{-2} \|\mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \ggamma \|_2^2 + \ggamma^{\top} \mathbf{R}_{\theta}^{-1} \ggamma, \end{eqnarray} [$]

in which -- following Henderson -- $\mathbf{R}_{\theta}$ and $\sigma_{\varepsilon}^2$ are assumed known (for instance by virtue of maximum likelihood or REML estimation). The estimators of $\bbeta$ and $\ggamma$ are now the minimizers of loss criterion (\ref{mixedModel.penalizedLossFunction}). Effectively, the random effect parameter $\ggamma$ is temporarily assumed to be ‘fixed’. That is, it is temporarily treated as fixed in the derivations below that lead to the construction of the linear predictor. However, $\ggamma$ is a random variable and one therefore speaks of a linear predictor rather than a linear estimator.

To find the estimators of $\bbeta$ and $\ggamma$, defined as the minimizers of this loss function, equate the partial derivatives of (\ref{mixedModel.penalizedLossFunction}) with respect to $\bbeta$ and $\ggamma$ to zero. This yields the estimating equations (also referred to as Henderson's mixed model equations):

[$] \begin{eqnarray*} \mathbf{X}^{\top} \mathbf{Y} - \mathbf{X}^{\top} \mathbf{X} \bbeta - \mathbf{X}^{\top} \mathbf{Z} \ggamma & = & \mathbf{0}_{p}, \\ \sigma_{\varepsilon}^{-2} \mathbf{Z}^{\top} \mathbf{Y} - \sigma_{\varepsilon}^{-2} \mathbf{Z}^{\top} \mathbf{Z} \ggamma - \sigma_{\varepsilon}^{-2} \mathbf{Z}^{\top} \mathbf{X} \bbeta - \mathbf{R}_{\theta}^{-1} \ggamma & = & \mathbf{0}_{q}. \end{eqnarray*} [$]

Solve each estimating equation for the parameters individually and find:

[$] \begin{eqnarray} \label{form.mixedModel_penEstOfBeta} \hat{\bbeta} & = & (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} ( \mathbf{Y} - \mathbf{Z} \ggamma), \\ \label{form.mixedModel_penEstOfGamma} \hat{\ggamma} & = & (\mathbf{Z}^{\top} \mathbf{Z} + \sigma_{\varepsilon}^{2} \mathbf{R}_{\theta}^{-1})^{-1} \mathbf{Z}^{\top} ( \mathbf{Y} - \mathbf{X} \bbeta). \end{eqnarray} [$]

Note, using the Cholesky decomposition of $\tilde{\mathbf{R}}_{\theta}$ and applying the Woodbury identity twice (in both directions), that:

[$] \begin{eqnarray*} \hat{\ggamma} & = & (\mathbf{Z}^{\top} \mathbf{Z} + \tilde{\mathbf{R}}_{\theta}^{-1})^{-1} \mathbf{Z}^{\top} (\mathbf{Y} - \mathbf{X} \bbeta) \\ & = & [\tilde{\mathbf{R}}_{\theta} - \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top} (\mathbf{I}_{nn} + \mathbf{Z} \tilde{\mathbf{R}}_{\theta} \mathbf{Z}^{\top} )^{-1} \mathbf{Z} \tilde{\mathbf{R}}_{\theta}] \mathbf{Z}^{\top} (\mathbf{Y} - \mathbf{X} \bbeta) \\ & = & \mathbf{L}_{\theta} [ \mathbf{I}_{qq} - \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} (\mathbf{I}_{nn} + \mathbf{Z} \mathbf{L}_{\theta} \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} )^{-1} \mathbf{Z} \mathbf{L}_{\theta} ] \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} (\mathbf{Y} - \mathbf{X} \bbeta) \\ & = & \mathbf{L}_{\theta} ( \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{L}_{\theta} + \mathbf{I}_{qq})^{-1} \mathbf{L}_{\theta}^{\top} \mathbf{Z}^{\top} (\mathbf{Y} - \mathbf{X} \bbeta) \\ & = & \mathbf{L}_{\theta} \, \mmu_{\tilde{\ggamma} \, | \mathbf{Y}}. \end{eqnarray*} [$]

It thus coincides with the conditional expectation of $\ggamma$ found in the derivation of the maximum likelihood estimator of the mixed model. This expression could also have been found by conditioning in the multivariate normal above, which would have given $\mathbb{E}(\ggamma \, | \, \mathbf{Y})$.
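The equality of the first and last expressions in this chain may be verified numerically. In the sketch below the vector `resid` plays the role of $\mathbf{Y} - \mathbf{X}\bbeta$ and all quantities are simulated:

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 15, 3
Z, resid = rng.normal(size=(n, q)), rng.normal(size=n)   # resid plays (Y - X beta)
A = rng.normal(size=(q, q))
R = A @ A.T + np.eye(q)            # R_tilde, positive definite
L = np.linalg.cholesky(R)          # Cholesky factor: R_tilde = L L^T

# gamma_hat via the penalized normal equations.
g1 = np.linalg.solve(Z.T @ Z + np.linalg.inv(R), Z.T @ resid)

# gamma_hat as L times the conditional mean mu_{gamma_tilde | Y}.
mu = np.linalg.solve(L.T @ Z.T @ Z @ L + np.eye(q), L.T @ Z.T @ resid)
g2 = L @ mu

print(np.allclose(g1, g2))   # True
```

The agreement of `g1` and `g2` is exactly the double application of the Woodbury identity in the display above.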

The estimators of both $\bbeta$ and $\ggamma$ can be expressed fully and explicitly in terms of $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$ and $\mathbf{R}_{\theta}$. To obtain that of $\bbeta$, substitute the estimator of $\ggamma$ of equation (\ref{form.mixedModel_penEstOfGamma}) into that of $\bbeta$ given by equation (\ref{form.mixedModel_penEstOfBeta}):

[$] \begin{eqnarray*} \hat{\bbeta} & = & (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} [ \mathbf{Y} - \mathbf{Z} (\mathbf{Z}^{\top} \mathbf{Z} + \sigma_{\varepsilon}^{2} \mathbf{R}_{\theta}^{-1})^{-1} \mathbf{Z}^{\top} ( \mathbf{Y} - \mathbf{X} \hat{\bbeta})] \\ & = & (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \{ \mathbf{Y} - [ \mathbf{I}_{nn} - (\sigma_{\varepsilon}^{-2} \mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} + \mathbf{I}_{nn})^{-1}] ( \mathbf{Y} - \mathbf{X} \hat{\bbeta}) \}, \end{eqnarray*} [$]

in which the Woodbury identity has been used. Now group terms and solve for $\hat{\bbeta}$:

[$] \begin{eqnarray} \label{form.mixedModel_penEstOfBeta_explicit} \hat{\bbeta} & = & [\mathbf{X}^{\top} (\mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} + \sigma_{\varepsilon}^2 \mathbf{I}_{nn})^{-1} \mathbf{X}]^{-1} \mathbf{X}^{\top} (\mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} + \sigma_{\varepsilon}^2 \mathbf{I}_{nn})^{-1} \mathbf{Y}. \end{eqnarray} [$]

This coincides with the maximum likelihood estimator of $\bbeta$ presented above (for known $\mathbf{R}_{\theta}$ and $\sigma_{\varepsilon}^2$). Moreover, in the preceding display one recognizes a generalized least squares (GLS) estimator. The GLS regression estimator is BLUE (Best Linear Unbiased Estimator) when $\mathbf{R}_{\theta}$ and $\sigma_{\varepsilon}^2$ are known. To find an explicit expression for $\hat{\ggamma}$, define $\mathbf{Q}_{\theta} = \mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} + \sigma_{\varepsilon}^2 \mathbf{I}_{nn}$ ($= \sigma_{\varepsilon}^2 \tilde{\mathbf{Q}}_{\theta}$) and substitute the explicit expression (\ref{form.mixedModel_penEstOfBeta_explicit}) for the estimator of $\bbeta$ in the estimator of $\ggamma$, shown in display (\ref{form.mixedModel_penEstOfGamma}) above. This gives:

[$] \begin{eqnarray*} \hat{\ggamma} & = & (\mathbf{Z}^{\top} \mathbf{Z} + \sigma_{\varepsilon}^2 \mathbf{R}_{\theta}^{-1})^{-1} \mathbf{Z}^{\top} [ \mathbf{I}_{nn} - \mathbf{X} (\mathbf{X}^{\top} \mathbf{Q}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Q}_{\theta}^{-1}] \mathbf{Y}, \end{eqnarray*} [$]

an explicit expression for the estimator of $\ggamma$.
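Solving Henderson's mixed model equations as one joint linear system and comparing with the explicit GLS expression gives a quick numerical check. All quantities below are simulated, with $\sigma_{\varepsilon}^2$ and $\mathbf{R}_{\theta}$ treated as known:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 25, 2, 4
X, Z, Y = rng.normal(size=(n, p)), rng.normal(size=(n, q)), rng.normal(size=n)
sigma2 = 0.5
A = rng.normal(size=(q, q))
R = A @ A.T + np.eye(q)              # R_theta, positive definite

# Henderson's mixed model equations as one (p+q) x (p+q) linear system.
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + sigma2 * np.linalg.inv(R)]])
rhs = np.concatenate([X.T @ Y, Z.T @ Y])
sol = np.linalg.solve(lhs, rhs)
beta_h, gamma_h = sol[:p], sol[p:]

# Explicit (GLS) expression for the estimator of beta.
Qinv = np.linalg.inv(Z @ R @ Z.T + sigma2 * np.eye(n))
beta_e = np.linalg.solve(X.T @ Qinv @ X, X.T @ Qinv @ Y)

print(np.allclose(beta_h, beta_e))   # True
```

In practice the joint system is how mixed model software proceeds, as it avoids forming and inverting the $n \times n$ matrix $\mathbf{Q}_{\theta}$.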

The linear predictor constructed from these estimators can be shown (cf. the theorem below) to be optimal, in the BLUP sense.

Definition

A Best Linear Unbiased Predictor (BLUP)

• is linear in the observations,
• is unbiased, and
• has minimum variance of its prediction error, i.e. the difference between the predictor and the predictand, among all unbiased linear predictors.

Theorem

The predictor $\mathbf{X} \hat{\bbeta} + \mathbf{Z} \hat{\ggamma}$ is the BLUP of $\tilde{\mathbf{Y}} = \mathbf{X} \bbeta + \mathbf{Z} \ggamma$.

Proof

The predictor $\mathbf{X} \hat{\bbeta} + \mathbf{Z} \hat{\ggamma}$ can be written as:

[$] \begin{eqnarray*} \mathbf{X} \hat{\bbeta} + \mathbf{Z} \hat{\ggamma} & = & [ \mathbf{I}_{nn} - \sigma_{\varepsilon}^2 \mathbf{Q}^{-1}_{\theta} + \sigma_{\varepsilon}^2 \mathbf{Q}_{\theta}^{-1} \mathbf{X} (\mathbf{X}^{\top} \mathbf{Q}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Q}^{-1}_{\theta} ] \mathbf{Y} \, \, \, := \, \, \, \mathbf{B} \mathbf{Y}. \end{eqnarray*} [$]

Clearly, this is a linear function in $\mathbf{Y}$.
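That $\mathbf{B} \mathbf{Y}$ indeed reproduces $\mathbf{X} \hat{\bbeta} + \mathbf{Z} \hat{\ggamma}$ may be checked numerically. The sketch below uses simulated quantities, with $\sigma_{\varepsilon}^2$ and $\mathbf{R}_{\theta}$ treated as known and $\mathbf{Q}_{\theta} = \mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} + \sigma_{\varepsilon}^2 \mathbf{I}_{nn}$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 20, 2, 3
X, Z, Y = rng.normal(size=(n, p)), rng.normal(size=(n, q)), rng.normal(size=n)
sigma2 = 0.8
A = rng.normal(size=(q, q))
R = A @ A.T + np.eye(q)                 # R_theta

Q = Z @ R @ Z.T + sigma2 * np.eye(n)    # Q_theta = Var(Y)
Qinv = np.linalg.inv(Q)

# The estimators: beta via GLS, gamma given beta.
beta = np.linalg.solve(X.T @ Qinv @ X, X.T @ Qinv @ Y)
gamma = np.linalg.solve(Z.T @ Z + sigma2 * np.linalg.inv(R), Z.T @ (Y - X @ beta))

# The matrix B of the linear predictor B Y.
B = (np.eye(n) - sigma2 * Qinv
     + sigma2 * Qinv @ X @ np.linalg.inv(X.T @ Qinv @ X) @ X.T @ Qinv)

print(np.allclose(B @ Y, X @ beta + Z @ gamma))   # True
```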

The expectation of the linear predictor is

[$] \begin{eqnarray*} \mathbb{E} (\mathbf{X} \hat{\bbeta} + \mathbf{Z} \hat{\ggamma} ) & = & \mathbb{E} [\mathbf{X} \hat{\bbeta} + \mathbf{Z} ( \mathbf{Z}^{\top} \mathbf{Z} + \sigma_{\varepsilon}^{2} \mathbf{R}_{\theta}^{-1})^{-1} \mathbf{Z}^{\top} (\mathbf{Y} - \mathbf{X} \hat{\bbeta}) ] \\ & = & \mathbf{X} \mathbb{E} (\hat{\bbeta}) + (\mathbf{I}_{nn} - \sigma_{\varepsilon}^2 \mathbf{Q}_{\theta}^{-1}) [\mathbb{E} (\mathbf{Y}) - \mathbb{E} (\mathbf{X} \hat{\bbeta})] \, \, \, = \, \, \, \mathbf{X} \bbeta. \end{eqnarray*} [$]

This is also the expectation of the predictand $\mathbf{X} \bbeta + \mathbf{Z} \ggamma$. Hence, the predictor is unbiased.

To show that the predictor $\mathbf{B} \mathbf{Y}$ has minimum prediction error variance within the class of unbiased linear predictors, assume the existence of another unbiased linear predictor $\mathbf{A} \mathbf{Y}$ of $\mathbf{X} \bbeta + \mathbf{Z} \ggamma$. The prediction error variance of the latter predictor is:

[$] \begin{eqnarray*} \mbox{Var}(\mathbf{X} \bbeta + \mathbf{Z} \ggamma - \mathbf{A} \mathbf{Y}) & = & \mbox{Var}(\mathbf{X} \bbeta + \mathbf{Z} \ggamma - \mathbf{B} \mathbf{Y} - \mathbf{A} \mathbf{Y} + \mathbf{B} \mathbf{Y}) \\ & = & \mbox{Var}[ (\mathbf{A} - \mathbf{B}) \mathbf{Y}] + \mbox{Var}(\mathbf{X} \bbeta + \mathbf{Z} \ggamma - \mathbf{B} \mathbf{Y}) \\ & & - 2 ~ \mbox{Cov}[ \mathbf{X} \bbeta + \mathbf{Z} \ggamma - \mathbf{B} \mathbf{Y}, (\mathbf{A} - \mathbf{B}) \mathbf{Y}]. \end{eqnarray*} [$]

The last term vanishes as:

[$] \begin{eqnarray*} & & \hspace{-1.5cm} \mbox{Cov}[ \mathbf{X} \bbeta + \mathbf{Z} \ggamma - \mathbf{B} \mathbf{Y}, (\mathbf{A} - \mathbf{B}) \mathbf{Y}] \\ & = & [\mathbf{Z} \mbox{Cov}( \ggamma, \mathbf{Y}) - \mathbf{B} \mbox{Var}(\mathbf{Y}) ] (\mathbf{A} - \mathbf{B})^{\top} \\ & = & \{ \mathbf{Z} \mathbf{R}_{\theta} \mathbf{Z}^{\top} - [ \mathbf{I}_{nn} - \sigma_{\varepsilon}^2 \mathbf{Q}^{-1}_{\theta} + \sigma_{\varepsilon}^2 \mathbf{Q}_{\theta}^{-1} \mathbf{X} (\mathbf{X}^{\top} \mathbf{Q}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Q}^{-1}_{\theta} ] \mathbf{Q}_{\theta} \} (\mathbf{A} - \mathbf{B})^{\top} \\ & = & - \sigma_{\varepsilon}^2 \mathbf{Q}_{\theta}^{-1} \mathbf{X} (\mathbf{X}^{\top} \mathbf{Q}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} (\mathbf{A} - \mathbf{B})^{\top} \\ & = & - \sigma_{\varepsilon}^2 \mathbf{Q}_{\theta}^{-1} \mathbf{X} (\mathbf{X}^{\top} \mathbf{Q}_{\theta}^{-1} \mathbf{X})^{-1} [ (\mathbf{A} - \mathbf{B}) \mathbf{X}]^{\top} \, \, \, = \, \, \, \mathbf{0}_{nn}, \end{eqnarray*} [$]

where the last step uses $\mathbf{A} \mathbf{X} = \mathbf{B} \mathbf{X}$, which follows from the fact that

[$] \begin{eqnarray*} \mathbf{A} \mathbf{X} \bbeta & = & \mathbb{E}(\mathbf{A} \mathbf{Y}) \, \, \, = \, \, \, \mathbb{E}(\mathbf{B} \mathbf{Y}) \, \, \, = \, \, \, \mathbf{B} \mathbf{X} \bbeta, \end{eqnarray*} [$]

for all $\bbeta \in \mathbb{R}^p$. Hence,

[$] \begin{eqnarray*} \mbox{Var}(\mathbf{X} \bbeta + \mathbf{Z} \ggamma - \mathbf{A} \mathbf{Y}) & = & \mbox{Var}[ (\mathbf{A} - \mathbf{B}) \mathbf{Y}] + \mbox{Var}(\mathbf{X} \bbeta + \mathbf{Z} \ggamma - \mathbf{B} \mathbf{Y}), \end{eqnarray*} [$]

from which the minimum variance follows as the first summand on the right-hand side is nonnegative and zero if and only if $\mathbf{A} = \mathbf{B}$. ■

The link with ridge regression, implicit in the exposé on the mixed model, is now explicated. Recall that ridge regression fits the linear regression model $\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon$ by means of a penalized maximum likelihood procedure, which defines -- for given penalty parameter $\lambda$ -- the estimator as:

[$] \begin{eqnarray*} \hat{\bbeta} (\lambda) & = & \arg \min_{\bbeta \in \mathbb{R}^p} \| \mathbf{Y} - \mathbf{X} \bbeta \|_2^2 + \lambda \bbeta^{\top} \bbeta. \end{eqnarray*} [$]

Contrast this to a mixed model void of covariates with fixed effects and comprising only covariates with random effects: $\mathbf{Y} = \mathbf{Z} \ggamma + \vvarepsilon$ with distributional assumptions $\ggamma \sim \mathcal{N}( \mathbf{0}_q, \sigma_{\gamma}^2 \mathbf{I}_{qq})$ and $\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma_{\varepsilon}^2 \mathbf{I}_{nn})$. This model, when temporarily considering $\ggamma$ as fixed, is fitted by the minimization of a penalized sum-of-squares loss function. The corresponding estimator of $\ggamma$ is then defined, with the current mixed model assumptions in place, as:

[$] \begin{eqnarray*} \hat{\ggamma} & = & \arg \min_{\ggamma \in \mathbb{R}^q} \|\mathbf{Y} - \mathbf{Z} \ggamma \|_2^2 + \sigma_{\gamma}^{-2} \ggamma^{\top} \ggamma. \end{eqnarray*} [$]

The estimators are -- up to a reparametrization of the penalty parameter -- defined identically. This should not come as a surprise after the discussion of Bayesian regression (cf. Chapter Bayesian regression), and the alert reader will already have recognized a generalized ridge loss function in the display above. The fact that the fixed effects part of the mixed model has been discarded is irrelevant for the analogy, as fixed effects would correspond to unpenalized covariates in the ridge regression problem.
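
As a numerical sanity check (with simulated data and an arbitrary penalty value, both assumptions of this illustration), the closed-form ridge solution $(\mathbf{Z}^{\top}\mathbf{Z} + \lambda \mathbf{I})^{-1} \mathbf{Z}^{\top}\mathbf{Y}$ can be compared against a general-purpose optimizer applied to the penalized least-squares criterion:

```r
# Sketch: verify that the closed-form ridge solution minimizes the penalized
# least-squares criterion; data and lambda are simulated/assumed for illustration.
set.seed(5)
n <- 25; q <- 8; lambda <- 1.5
Z <- matrix(rnorm(n * q), n, q)
Y <- rnorm(n)

gammaHat <- solve(crossprod(Z) + diag(lambda, q), crossprod(Z, Y))
loss     <- function(g) sum((Y - Z %*% g)^2) + lambda * sum(g^2)
gOpt     <- optim(rep(0, q), loss, method = "BFGS")$par
max(abs(gOpt - as.numeric(gammaHat)))    # agreement up to optimizer tolerance
```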

The link with ridge regression is also evident from the linear predictor of the random effect. Recall: $\hat{\ggamma} = (\mathbf{Z}^{\top} \mathbf{Z} + \mathbf{R}_{\theta}^{-1})^{-1} \mathbf{Z}^{\top} (\mathbf{Y} - \mathbf{X} \bbeta)$. Were $\mathbf{R}_{\theta}^{-1}$ ignored, the predictor would reduce to a least squares estimator. But with a symmetric and positive definite matrix $\mathbf{R}_{\theta}^{-1}$, the predictor is of the shrinkage type, as is the ridge regression estimator. This shrinkage estimator also reveals, through the term $(\mathbf{Z}^{\top} \mathbf{Z} + \mathbf{R}_{\theta}^{-1})^{-1}$, that a $q$ larger than $n$ does not cause identifiability problems, as long as $\mathbf{R}_{\theta}$ is parametrized low-dimensionally enough.
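
The identifiability remark can be illustrated numerically (simulated data and assumed dimensions): with $q > n$ the matrix $\mathbf{Z}^{\top}\mathbf{Z}$ is singular, but adding a positive definite term renders the system solvable:

```r
# Sketch: with q > n, Z'Z is rank deficient, yet Z'Z + lambda*I is invertible
# for any lambda > 0, so the shrinkage-type predictor remains well-defined.
set.seed(1)
n <- 20; q <- 30
Z <- matrix(rnorm(n * q), n, q)
Y <- rnorm(n)
lambda <- 0.25                           # assumed value of the variance ratio

ZtZ <- crossprod(Z)                      # q x q, rank at most n
qr(ZtZ)$rank                             # 20: Z'Z itself is not invertible
gammaHat <- solve(ZtZ + diag(lambda, q), crossprod(Z, Y))
length(gammaHat)                         # 30: predictor exists despite q > n
```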

The following mixed model result provides an alternative approach to the choice of the penalty parameter in ridge regression. It assumes a mixed model comprising the random effects part only. Put differently, it assumes the linear regression model $\mathbf{Y} = \mathbf{X} \bbeta + \vvarepsilon$ with $\bbeta \sim \mathcal{N}(\mathbf{0}_p, \sigma_{\beta}^2 \mathbf{I}_{pp})$ and $\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma_{\varepsilon}^2 \mathbf{I}_{nn})$.

(Theorem 2, Golub et al)

The expected generalized cross-validation error $\mathbb{E}_{\bbeta} \{ \mathbb{E}_{\varepsilon} [ GCV(\lambda)] \}$ is minimized for $\lambda = \sigma_{\varepsilon}^2 / \sigma^2_{\beta}$.

The proof first finds an analytic expression for the expected $GCV(\lambda)$ and then its minimizer. The expectation can be re-expressed as follows:

[$] \begin{eqnarray*} \mathbb{E}_{\bbeta} \{ \mathbb{E}_{\varepsilon} [ GCV(\lambda)] \} & = & \mathbb{E}_{\bbeta} \big[ \mathbb{E}_{\varepsilon} \big( \tfrac{1}{n} \{\mbox{tr}[\mathbf{I}_{nn} - \mathbf{H}(\lambda)] / n \}^{-2} \| [\mathbf{I}_{nn} - \mathbf{H}(\lambda) ] \mathbf{Y} \big\|_2^2 \big) \big] \\ & = & n \{\mbox{tr}[\mathbf{I}_{nn} - \mathbf{H}(\lambda)] \}^{-2} \mathbb{E}_{\bbeta} \big( \mathbb{E}_{\varepsilon} \{ \mathbf{Y}^{\top} [\mathbf{I}_{nn} - \mathbf{H}(\lambda) ]^2 \mathbf{Y} \} \big) \\ & = & n \{\mbox{tr}[\mathbf{I}_{nn} - \mathbf{H}(\lambda)] \}^{-2} \mathbb{E}_{\bbeta} \big[ \mathbb{E}_{\varepsilon} \big( \mbox{tr} \{ ( \mathbf{X} \bbeta + \vvarepsilon)^{\top} [\mathbf{I}_{nn} - \mathbf{H}(\lambda) ]^2 ( \mathbf{X} \bbeta + \vvarepsilon) \} \big) \big] \\ & = & n \{\mbox{tr}[\mathbf{I}_{nn} - \mathbf{H}(\lambda)] \}^{-2} \big( \mbox{tr} \{[\mathbf{I}_{nn} - \mathbf{H}(\lambda) ]^2 \mathbf{X} \mathbf{X}^{\top} \mathbb{E}_{\bbeta} [ \mathbb{E}_{\varepsilon} ( \bbeta \bbeta^{\top} ) ] \} \\ & & \qquad \qquad \qquad \qquad \quad + \, \mbox{tr}\{ [\mathbf{I}_{nn} - \mathbf{H}(\lambda) ]^2 \mathbb{E}_{\bbeta} [ \mathbb{E}_{\varepsilon} (\vvarepsilon \vvarepsilon^{\top} ) ] \} \big) \\ & = & n \big( \sigma_{\beta}^2 \mbox{tr}\{ [\mathbf{I}_{nn} - \mathbf{H}(\lambda)]^{2} \mathbf{X} \mathbf{X}^{\top} \} + \sigma_{\varepsilon}^2 \mbox{tr}\{ [\mathbf{I}_{nn} - \mathbf{H}(\lambda)]^{2}\} \big) \{\mbox{tr}[\mathbf{I}_{nn} - \mathbf{H}(\lambda)] \}^{-2}. \end{eqnarray*} [$]

To get a handle on this expression, use $(\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} \mathbf{X}^{\top} \mathbf{X} = \mathbf{I}_{pp} - \lambda (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} = \mathbf{X}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1}$, the cyclic property of the trace, and define $A(\lambda) = \sum\nolimits_{j=1}^p (d_{x,j}^2 + \lambda)^{-1}$, $B(\lambda) = \sum\nolimits_{j=1}^p (d_{x,j}^2 + \lambda)^{-2}$, and $C(\lambda) = \sum\nolimits_{j=1}^p (d_{x,j}^2 + \lambda)^{-3}$. The traces in the expectation of $GCV(\lambda)$ can now be written as:

[$] \begin{eqnarray*} \mbox{tr}[\mathbf{I}_{nn} - \mathbf{H}(\lambda)] & = & \lambda ~ \mbox{tr}[ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} ] \, \, \, \, \, = \, \, \, \lambda A(\lambda), \\ \mbox{tr} \{ [\mathbf{I}_{nn} - \mathbf{H}(\lambda)]^2 \} & = & \lambda^2 \mbox{tr}[ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-2} ] \, \, \, \, = \, \, \, \lambda^2 B(\lambda), \\ \mbox{tr} \{ [\mathbf{I}_{nn} - \mathbf{H}(\lambda)]^2 \mathbf{X} \mathbf{X}^{\top} \} & = & \lambda^{2} \mbox{tr}[ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-1} ] - \lambda^{3} \mbox{tr}[ (\mathbf{X}^{\top} \mathbf{X} + \lambda \mathbf{I}_{pp})^{-2} ] \\ & = & \lambda^2 A(\lambda) - \lambda^3 B(\lambda). \end{eqnarray*} [$]
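
As an aside, the matrix identity invoked in the derivation of these traces can be verified numerically (simulated $\mathbf{X}$, assumed dimensions):

```r
# Sketch: check (X'X + lambda*I)^{-1} X'X = I - lambda (X'X + lambda*I)^{-1}
#                                         = X'X (X'X + lambda*I)^{-1}.
set.seed(2)
n <- 15; p <- 5; lambda <- 1.3
X  <- matrix(rnorm(n * p), n, p)
S  <- crossprod(X)                       # X'X
M1 <- solve(S + diag(lambda, p)) %*% S
M2 <- diag(p) - lambda * solve(S + diag(lambda, p))
M3 <- S %*% solve(S + diag(lambda, p))
max(abs(M1 - M2), abs(M1 - M3))          # zero up to numerical error
```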

The expectation of $GCV(\lambda)$ can then be reformulated as:

[$] \begin{eqnarray*} \mathbb{E}_{\bbeta} \{ \mathbb{E}_{\varepsilon} [ GCV(\lambda)] \} & = & n \{ \sigma_{\beta}^2 [A(\lambda) - \lambda B(\lambda)] + \sigma_{\varepsilon}^2 B(\lambda) \} [A(\lambda)]^{-2}. \end{eqnarray*} [$]

Equate the derivative of this expectation with respect to $\lambda$ to zero. This derivative can be seen to be proportional to:

[$] \begin{eqnarray*} 2 (\lambda \sigma^2_{\beta} - \sigma_{\varepsilon}^2) A(\lambda) C(\lambda) - 2 (\lambda \sigma^2_{\beta} - \sigma_{\varepsilon}^2) [B(\lambda)]^2 & = & 0. \end{eqnarray*} [$]

Indeed, $\lambda = \sigma_{\varepsilon}^2 / \sigma^{2}_{\beta}$ is the root of this equation. ■
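
The location of the minimum can also be checked numerically. The sketch below evaluates the final expression for the expected GCV (up to the factor $n$) for assumed squared singular values and variance components, and minimizes it over $\lambda$:

```r
# Sketch: the expected GCV criterion {sigma_b^2 [A - lambda*B] + sigma_e^2 B} / A^2
# is minimized at lambda = sigma_e^2 / sigma_b^2; all inputs are assumed values.
d2      <- c(0.5, 1, 2, 4, 8)            # assumed squared singular values of X
sigma2b <- 2; sigma2e <- 1               # assumed variance components
egcv    <- function(lambda) {
  A <- sum(1 / (d2 + lambda)); B <- sum(1 / (d2 + lambda)^2)
  (sigma2b * (A - lambda * B) + sigma2e * B) / A^2
}
opt <- optimize(egcv, interval = c(1e-4, 100))
opt$minimum                              # close to sigma2e / sigma2b = 0.5
```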

The theorem can be extended to include unpenalized covariates. This leaves the result unaltered: the optimal (in the expected GCV sense) ridge penalty is the same signal-to-noise ratio.

We have encountered the result of this theorem before. Revisit the example that derived the mean squared error (MSE) of the ridge regression estimator when $\mathbf{X}$ is orthonormal. There it was pointed out that this MSE is minimized for $\lambda = p \sigma_{\varepsilon}^2 / \bbeta^{\top} \bbeta$. As $\bbeta^{\top} \bbeta / p$ is an estimator of $\sigma_{\beta}^2$, this implies the same optimal choice of the penalty parameter.

To point out the relevance of the theorem for the choice of the ridge penalty parameter, still assume the regression parameter to be random. The theorem then states that the optimal penalty parameter (in the expected GCV sense) equals the ratio of the error variance to the variance of the regression parameter. Both variances can be estimated by means of the mixed model machinery (provided, for instance, by the lme4 package in R). These estimates may be plugged into the ratio to arrive at a choice of the ridge penalty parameter (see Section Illustration: P-splines for an illustration of this usage).
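
A minimal sketch of this strategy (on simulated data, and using plain maximum likelihood via optim rather than the REML fit that e.g. lme4 would provide) estimates the two variances from the marginal model $\mathbf{Y} \sim \mathcal{N}(\mathbf{0}_n, \sigma_{\beta}^2 \mathbf{X} \mathbf{X}^{\top} + \sigma_{\varepsilon}^2 \mathbf{I}_{nn})$ and takes their ratio as the penalty parameter:

```r
# Sketch (simulated data; ML instead of REML): estimate the variance components
# of the marginal model and set the ridge penalty to their ratio.
set.seed(3)
n <- 200; p <- 50
X    <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p, sd = 2)                 # true sigma_b^2 = 4
Y    <- X %*% beta + rnorm(n)            # true sigma_e^2 = 1

negLogLik <- function(par) {             # par = log-variances
  V <- exp(par[1]) * tcrossprod(X) + diag(exp(par[2]), n)
  as.numeric(0.5 * (determinant(V)$modulus + crossprod(Y, solve(V, Y))))
}
fit       <- optim(c(0, 0), negLogLik)
lambdaHat <- exp(fit$par[2] - fit$par[1])  # estimate of sigma_e^2 / sigma_b^2
lambdaHat                                  # compare with the true ratio 1/4
```

In practice one would rely on a dedicated mixed-model routine rather than this bare-bones optimizer.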

## REML consistency, high-dimensionally

Here a result on the asymptotic quality of the REML estimators of the random effect and error variance parameters is presented and discussed. It is the ratio of these parameters that forms the optimal choice (in the expected GCV sense) of the penalty parameter of the ridge regression estimator. As in practice the parameters are replaced by estimates to arrive at a choice for the penalty parameter, the quality of these estimators propagates to the chosen penalty parameter.

Consider the standard linear mixed model $\mathbf{Y} = \mathbf{X} \bbeta + \mathbf{Z} \ggamma + \vvarepsilon$, now with equivariant and uncorrelated random effects: $\ggamma \sim \mathcal{N}( \mathbf{0}_q, \sigma_{\gamma}^2 \mathbf{I}_{qq})$. Write $\theta = \sigma_{\gamma}^2 / \sigma_{\varepsilon}^2$. The REML estimators of $\theta$ and $\sigma_{\varepsilon}^2$ are to be found from the estimating equations:

[$] \begin{eqnarray*} \mbox{tr}( \mathbf{P}_{\theta} \mathbf{Z} \mathbf{Z}^{\top}) & = & \sigma_{\varepsilon}^{-2} \mbox{tr}( \mathbf{Y}^{\top} \mathbf{P}_{\theta} \mathbf{Z} \mathbf{Z}^{\top} \mathbf{P}_{\theta} \mathbf{Y}), \\ \sigma_{\varepsilon}^2 & = & (n-p)^{-1} \mathbf{Y}^{\top} \mathbf{P}_{\theta} \mathbf{Y}, \end{eqnarray*} [$]

where $\mathbf{P}_{\theta} = \tilde{\mathbf{Q}}_{\theta}^{-1} - \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X} (\mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \tilde{\mathbf{Q}}_{\theta}^{-1}$ and $\tilde{\mathbf{Q}}_{\theta} = \mathbf{I}_{nn} + \theta \mathbf{Z} \mathbf{Z}^{\top}$. To arrive at the REML estimators, choose initial values for the parameters. Take one of the estimating equations, substitute the initial value of one of the parameters, and solve for the other. The root found is then substituted into the other estimating equation, which is subsequently solved for the remaining parameter. Iterate between these two steps until convergence. The discussion of the practical evaluation of a root for $\theta$ from these estimating equations in a high-dimensional context is postponed to the next section.

The employed linear mixed model assumes that each of the $q$ covariates included as a column in $\mathbf{Z}$ contributes to the variation of the response. However, it may be that only a fraction of these covariates exerts any influence on the response. That is, the random effect parameter $\ggamma$ is sparse, which could be operationalized as $\ggamma$ having $q_0$ zero elements while the remaining $q_{c} = q - q_0$ elements are non-zero. Only for the latter $q_{c}$ elements of $\ggamma$ does the normality assumption make sense; it is invalid for the $q_0$ zeros in $\ggamma$. The posed mixed model is then misspecified.

The next theorem states that the REML estimators of $\theta = \sigma_{\gamma}^2 / \sigma_{\varepsilon}^2$ and $\sigma_{\varepsilon}^2$ are consistent (possibly after adjustment, see the theorem), even under the above mentioned misspecification.

(Theorem 3.1, [3])

Let $\mathbf{Z}$ be standardized column-wise and with its unstandardized entries i.i.d. from a sub-Gaussian distribution. Furthermore, assume that $n, q, q_{c} \rightarrow \infty$ such that

[$] \begin{eqnarray*} \frac{n}{q} \rightarrow \tau \qquad \mbox{ and } \qquad \frac{ q_{c}}{q} \rightarrow \omega, \end{eqnarray*} [$]

where $\tau, \omega \in (0, 1]$. Finally, suppose that $\sigma_{\varepsilon}^2$ and $\sigma_{\gamma}^2$ are positive. Then:

• The ‘adjusted’ REML estimator of the variance ratio $\sigma_{\gamma}^2 / \sigma_{\varepsilon}^2$ is consistent:
[$] \begin{eqnarray*} \frac{q}{ q_{c}} \widehat{(\sigma_{\gamma}^2 / \sigma_{\varepsilon}^2)} \stackrel{P}{\longrightarrow} \sigma_{\gamma}^2 / \sigma_{\varepsilon}^2. \end{eqnarray*} [$]
• The REML estimator of the error variance is consistent: $\hat{\sigma}_{\varepsilon}^2 \stackrel{P}{\longrightarrow} \sigma_{\varepsilon}^2$.

Confer [3]. ■

Before the interpretation and implications of the theorem are discussed, its conditions for the consistency result are reviewed:

• The standardization and distribution assumption on the design matrix of the random effects has no direct practical interpretation. These conditions warrant the applicability of certain results from random matrix theory upon which the proof of the theorem hinges.
• The positive variance assumption $\sigma_{\varepsilon}^2, \sigma_{\gamma}^2 \gt 0$, in particular that of the random effect parameter, effectively states that some -- possibly misspecified -- form of the mixed model applies.
• Practically most relevant are the conditions on the sample size, random effect dimension, and sparsity. The $\tau$ and $\omega$ in the theorem are the limiting ratios of, respectively, the sample size $n$ and the number of non-zero random effects $q_{c}$ to the total number of random effects $q$. The number of random effects may thus exceed the sample size, as long as the latter grows (in the limit) at some fixed rate with the former. Independently, the model may be misspecified: the sparsity condition only requires that (in the limit) a positive fraction of the random effects is non-zero.

Now turn to the interpretation and relevance of the theorem:

• The theorem complements the classical low-dimensional consistency results on the REML estimator.
• The theorem shows that not all (i.e. consistency) is lost when the model is misspecified.
• The practical relevance of part i) of the theorem is limited, as the number of non-zero random effects $q_{c}$, or $\omega$ for that matter, is usually unknown. Consequently, the REML estimator of the variance ratio $\sigma_{\gamma}^2 / \sigma_{\varepsilon}^2$ cannot be adjusted correctly to achieve asymptotic unbiasedness and -- thereby -- consistency.
• Part ii) in its own right may not seem very useful. But it is surprising that high-dimensionally (i.e. when the dimension of the random effect parameter exceeds the sample size) the standard (that is, derived for low-dimensional data) REML estimator of $\sigma_{\varepsilon}^2$ is consistent. Beyond this surprise, a good estimator of $\sigma_{\varepsilon}^2$ indicates how much of the variation in the response cannot be attributed to the covariates represented by the columns of $\mathbf{Z}$. A good indication of the noise level in the data finds use in many places. In particular, it is helpful in deciding on the order of the penalty parameter.
• The theorem suggests choosing the ridge penalty parameter equal to the ratio of the error variance to that of the random effects. Confronted with data, the reciprocal of the REML estimator of $\theta = \sigma_{\gamma}^2 / \sigma_{\varepsilon}^2$ may be used as the value of the penalty parameter. Without the adjustment for the fraction of non-zero random effects, this value is off. But in the worst case it over-estimates the optimal (in the expected GCV sense) ridge penalty parameter. Consequently, too much penalization is applied and the ridge regression estimate of the regression parameter is conservative: its elements are shrunken too much towards zero.

## Illustration: P-splines

An organism's internal circadian clock enables it to synchronize its activities to the earth's day-and-night cycle. The circadian clock maintains, due to environmental feedback, oscillations of approximately 24 hours. Molecularly, these oscillations reveal themselves in the fluctuation of the transcription levels of genes. The molecular core of the circadian clock is made up of $\pm 10$ genes. Their behaviour (in terms of their expression patterns) is described by a dynamical system with feedback mechanisms. Linked to this core are genes that tap into the clock's rhythm and use it to regulate molecular processes. As such many genes are expected to exhibit circadian rhythms. This is investigated in a mouse experiment in which the expression levels of several transcripts have been measured during two days with a resolution of one hour, resulting in a time series of 48 time points, publicly available from the R-package MetaCycle. Circadian rhythms may be identified simply by eye-balling the data. But to facilitate this identification the data are smoothed to emphasize the pattern present in these data.

Top left and right panels: B-spline basis functions of degree 1 and 2, respectively. Bottom left and right panels: P-spline fits to transcript levels from the circadian clock experiment in mice.

Smoothing refers to a nonparametric -- in the sense that parameters have no tangible interpretation -- description of a curve. For instance, one may wish to learn some general functional relationship between two variables, $X$ and $Y$, from data. Statistically, the model $Y= f(X) + \varepsilon$, for an unknown and general function $f(\cdot)$, is to be fitted to paired observations $\{ (y_i, x_i) \}_{i=1}^n$. Here we use P-splines: penalized B-splines, with the ‘B' for ‘basis' [4].

A B-spline is formed through a linear combination of (pieces of) polynomial basis functions of degree $r$. For their construction, specify the interval $[x_{\mbox{{\tiny start}}}, x_{\mbox{{\tiny end}}}]$ on which the function is to be learned/approximated. Let $\{ t_j \}_{j=0}^{m+2r}$ be a grid of equidistantly placed points, called knots, overlapping the interval, given by $t_j = x_{\mbox{{\tiny start}}} + (j- r) h$ for all $j=0, \ldots, m+2r$ with $h = \tfrac{1}{m}(x_{\mbox{{\tiny end}}} - x_{\mbox{{\tiny start}}})$. The B-spline basis functions are then defined as:

[$] \begin{eqnarray*} B_{j}(x; r) & = & (-1)^{r+1} (h^r r!)^{-1} ~ \Delta^{r+1} [(x - t_j)^r \mathbb{1}_{\{x \geq t_j \}} ] \end{eqnarray*} [$]

where $\Delta^r[f_j(\cdot)]$ denotes the $r$-th difference operator applied to $f_j(\cdot)$. For $r=1$: $\Delta[f_j(\cdot)] = f_j(\cdot) - f_{j-1}(\cdot)$, while for $r=2$: $\Delta^2[f_j(\cdot)] = \Delta \{ \Delta [f_j(\cdot)] \} = \Delta[f_j(\cdot) - f_{j-1}(\cdot)] = f_j(\cdot) - 2f_{j-1}(\cdot) + f_{j-2}(\cdot)$, et cetera. The top left and right panels of Figure show the $1^{\mbox{{\tiny st}}}$ and $2^{\mbox{{\tiny nd}}}$ degree B-spline basis functions.

A P-spline is a curve of the form $\sum_{j=0}^{m+2r} \alpha_j B_j(x; r)$ fitted to the data by means of penalized least squares minimization. The least squares term is $\| \mathbf{Y} - \mathbf{B} \aalpha \|_2^2$, where $\mathbf{B}$ is an $n \times (m + 2r)$-dimensional matrix with $j$-th column equal to $(B_j(x_1; r), B_j(x_2; r), \ldots, B_j(x_{n}; r))^{\top}$. The employed penalty is of the ridge type: the sum of the squared differences between contiguous $\alpha_j$. Let $\mathbf{D}$ be the first order differencing matrix. The penalty can then be written as $\| \mathbf{D} \aalpha \|_2^2 = \sum_{j=2}^{m+2r} (\alpha_j - \alpha_{j-1})^2$. A second order difference matrix would amount to $\| \mathbf{D} \aalpha \|_2^2 = \sum_{j=3}^{m+2r} (\alpha_j - 2 \alpha_{j-1} + \alpha_{j-2})^2$.

[5] points out how P-splines may be interpreted as a mixed model. Hereto choose $\tilde{\mathbf{X}}$ such that its columns span the null space of $\mathbf{D}^{\top} \mathbf{D}$, which comprises a single column representing the intercept when $\mathbf{D}$ is a first order differencing matrix, and set $\tilde{\mathbf{Z}} = \mathbf{D}^{\top} (\mathbf{D} \mathbf{D}^{\top})^{-1}$. Then, for any $\aalpha$:

[$] \begin{eqnarray*} \mathbf{B} \aalpha & = & \mathbf{B} (\tilde{\mathbf{X}} \bbeta + \tilde{\mathbf{Z}} \ggamma) \, \, \, := \, \, \, \mathbf{X} \bbeta + \mathbf{Z} \ggamma. \end{eqnarray*} [$]

This parametrization simplifies the employed penalty to:

[$] \begin{eqnarray*} \| \mathbf{D} \aalpha \|_2^2 & = & \| \mathbf{D} (\tilde{\mathbf{X}} \bbeta + \tilde{\mathbf{Z}} \ggamma) \|_2^2 \, \, \, = \, \, \, \| \mathbf{D} \mathbf{D}^{\top} (\mathbf{D} \mathbf{D}^{\top})^{-1} \ggamma \|_2^2 \, \, \, = \, \, \, \| \ggamma \|_2^2, \end{eqnarray*} [$]

where $\mathbf{D} \tilde{\mathbf{X}} \bbeta$ has vanished by the construction of $\tilde{\mathbf{X}}$. Hence, the penalty only affects the random effect parameter, leaving the fixed effect parameter unshrunken. The resulting loss function, $\| \mathbf{Y} - \mathbf{X} \bbeta - \mathbf{Z} \ggamma \|_2^2 + \lambda \| \ggamma \|_2^2$, coincides for suitably chosen $\lambda$ with that of the mixed model (as will become apparent later). The bottom panels of Figure show the flexibility of this approach.
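
The vanishing of the fixed-effect part from the penalty and the simplification to $\| \ggamma \|_2^2$ can be verified numerically on a small example (assumed dimensions, first order differencing):

```r
# Sketch: with D a first-order difference matrix, Xtilde the intercept column
# (spanning the null space of D'D) and Ztilde = D'(DD')^{-1}, the penalty
# ||D alpha||^2 reduces to ||gamma||^2; dimensions are assumed for illustration.
set.seed(4)
k <- 6                                   # number of spline coefficients
D <- diff(diag(k))                       # (k-1) x k first-order differencing
Xtilde <- matrix(1, k, 1)                # constant column: D %*% Xtilde = 0
Ztilde <- t(D) %*% solve(D %*% t(D))

gam   <- rnorm(k - 1); beta <- 3
alpha <- Xtilde %*% beta + Ztilde %*% gam
c(sum((D %*% alpha)^2), sum(gam^2))      # the two penalties coincide
```

That $\mathbf{D}$ annihilates $\tilde{\mathbf{X}}$ is what leaves the fixed effect unpenalized.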

The following R-script fits a P-spline to a gene's transcript levels from the circadian clock study in mice. It uses a B-spline basis of degree $r=3$ (cubic) on $m=50$ segments, built from truncated polynomial functions, which is generated first alongside several auxiliary matrices. This basis forms, after post-multiplication with a projection matrix onto the space spanned by the columns of the difference matrix $\mathbf{D}$, the design matrix for the random coefficient of the mixed model $\mathbf{Y} = \mathbf{X} \bbeta + \mathbf{Z} \ggamma + \vvarepsilon$ with $\ggamma \sim \mathcal{N}(\mathbf{0}_q, \sigma_{\gamma}^2 \mathbf{I}_{qq})$ and $\vvarepsilon \sim \mathcal{N}(\mathbf{0}_n, \sigma_{\varepsilon}^2 \mathbf{I}_{nn})$. The variance parameters of this model are then estimated by means of restricted maximum likelihood (REML). The final P-spline fit is obtained from the linear predictor using, in line with the theorem above, $\lambda = \sigma_{\varepsilon}^2 / \sigma_{\gamma}^2$, in which the REML estimates of these variance parameters are substituted. The resulting P-spline fits of two transcripts are shown in the bottom panels of Figure.

# load libraries
library(gridExtra)
library(MetaCycle)
library(MASS)

#------------------------------------------------------------------------------
# intermezzo: declaration of functions used analysis
#------------------------------------------------------------------------------

tpower <- function(x, knots, p) {
  # evaluate truncated p-th power functions at positions x, given knots
  return((x - knots)^p * (x > knots))
}

bbase <- function(x, m, r) {
  # generate a B-spline basis evaluated at x,
  # with m segments and spline degree r
  h     <- (max(x) - min(x)) / m
  knots <- min(x) + (c(0:(m + 2*r)) - r) * h
  P     <- outer(x, knots, tpower, r)
  D     <- diff(diag(m + 2*r + 1), diff = r + 1) / (gamma(r + 1) * h^r)
  return((-1)^(r + 1) * P %*% t(D))
}

thetaEstEqREML <- function(theta, Z, Y, X, sigma2e) {
  # REML estimation: estimating equation of theta
  QthetaInv <- solve(diag(length(Y)) + theta * Z %*% t(Z))
  Ptheta    <- QthetaInv - QthetaInv %*% X %*%
               solve(t(X) %*% QthetaInv %*% X) %*% t(X) %*% QthetaInv
  return(sum(diag(Ptheta %*% Z %*% t(Z))) -
         as.numeric(t(Y) %*% Ptheta %*% Z %*% t(Z) %*% Ptheta %*% Y) / sigma2e)
}

#------------------------------------------------------------------------------

# load and extract data
data(cycMouseLiverRNA)
id <- 14
Y  <- as.numeric(cycMouseLiverRNA[id, -1])
X  <- 1:length(Y)

# set P-spline parameters
m <- 50
r <- 3
B <- bbase(X, m = m, r = r)

# prepare some matrices
D <- diff(diag(m + r), diff = 2)
Z <- B %*% t(D) %*% solve(D %*% t(D))
X <- B %*% Null(t(D) %*% D)

# initiate
theta <- 1
for (k in 1:100) {
  # alternate between theta and error variance estimation
  thetaPrev <- theta
  QthetaInv <- solve(diag(length(Y)) + theta * Z %*% t(Z))
  Ptheta    <- QthetaInv - QthetaInv %*% X %*%
               solve(t(X) %*% QthetaInv %*% X) %*% t(X) %*% QthetaInv
  sigma2e   <- as.numeric(t(Y) %*% Ptheta %*% Y) / (length(Y) - ncol(X))
  theta     <- uniroot(thetaEstEqREML, c(0, 100000),
                       Z = Z, Y = Y, X = X, sigma2e = sigma2e)$root
  if (abs(theta - thetaPrev) < 10^(-5)) { break }
}

# P-spline fit
bgHat <- solve(t(cbind(X, Z)) %*% cbind(X, Z) +
               diag(c(rep(0, ncol(X)), rep(1/theta, ncol(Z))))) %*%
         t(cbind(X, Z)) %*% Y

# plot fit
plot(Y, pch = 20, xlab = "time", ylab = "RNA concentration",
     main = paste(strsplit(cycMouseLiverRNA[id, 1], "_")[[1]][1],
                  "; # segments: ", m, sep = ""))
lines(cbind(X, Z) %*% bgHat, col = "blue", lwd = 2)


The fitted splines displayed in Figure nicely match the data. From the circadian clock perspective, it is especially the fit in the bottom right panel that displays the archetypical sinusoidal behaviour associated by the layman with the sought-for rhythm. Close inspection of the fits reveals some minor discontinuities in the derivative of the spline fit. These are indicative of a little overfitting, due to too large an estimate of $\sigma_{\gamma}^2$. This appears to be due to numerical instability of the solution of the estimating equations of the REML estimators of the mixed model's variance parameters when $m$ is large compared to the sample size $n$.

## General References

van Wieringen, Wessel N. (2021). "Lecture notes on ridge regression". arXiv:1509.09169 [stat.ME].

## References

1. Henderson, C. (1953). Estimation of variance and covariance components. Biometrics, 9(2), 226--252.
2. Bates, D. and DebRoy, S. (2004). Linear mixed models and penalized least squares. Journal of Multivariate Analysis, 91(1), 1--17.
3. Jiang, J., Li, C., Paul, D., Yang, C., and Zhao, H. (2016). On high-dimensional misspecified mixed model analysis in genome-wide association study. The Annals of Statistics, 44(5), 2127--2160.
4. Eilers, P. and Marx, B. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89--102.
5. Eilers, P. (1999). Discussion on: The analysis of designed experiments and longitudinal data by using smoothing splines. Journal of the Royal Statistical Society: Series C (Applied Statistics), 48(3), 307--308.