# Model Validation and Selection

Chapter Empirical Risk Minimization discussed empirical risk minimization (ERM) as a principled approach to learning a good hypothesis out of a hypothesis space or model. ERM-based methods learn a hypothesis $\hat{h} \in \hypospace$ that incurs minimum average loss on a set of labeled data points that serve as the training set. We refer to the average loss incurred by a hypothesis on the training set as the training error. The minimum average loss, achieved by a hypothesis that solves the ERM, is referred to as the training error of the overall ML method. This overall ML method is defined by the choice of hypothesis space (or model) and loss function (see Chapter The Landscape of ML ).

ERM is sensible only if the training error of a hypothesis is a reliable approximation for its loss incurred on data points outside the training set. Whether this is the case depends both on the statistical properties of the data points generated by an ML application and on the hypothesis space used by the ML method.

ML methods often use hypothesis spaces with a large effective dimension (see Section The Model ). As an example, consider linear regression (see Section Linear Regression ) with data points having a large number $\featuredim$ of features (this setting is referred to as the high-dimensional regime). The effective dimension of the linear hypothesis space, which is used by linear regression, is equal to the number $\featuredim$ of features. Modern technology allows us to collect a huge number of features about individual data points, which implies, in turn, that the effective dimension of equ_lin_hypospace is large. Another example of a high-dimensional hypothesis space arises in deep learning methods, whose hypothesis space consists of all maps represented by an artificial neural network (ANN) with billions of tunable parameters.

A high-dimensional hypothesis space is very likely to contain a hypothesis that perfectly fits any given training set. Such a hypothesis achieves a very small training error but might incur a large loss when predicting the labels of a data point that is not included in the training set. Thus, the (minimum) training error achieved by a hypothesis learnt by ERM can be misleading. We say that an ML method, such as linear regression using too many features, overfits the training set when it learns a hypothesis (e.g., via ERM) that has a small training error but incurs a much larger loss outside the training set.

Section Overfitting shows that linear regression will overfit a training set as soon as the number of features of a data point reaches the size of the training set. Section Validation demonstrates how to validate a learnt hypothesis by computing its average loss on data points which are not contained in the training set. We refer to the set of data points used to validate the learnt hypothesis as a validation set. If an ML method overfits the training set, it learns a hypothesis whose training error is much smaller than its validation error. We can detect if an ML method overfits by comparing its training error with its validation error (see Figure fig_bars_val_sel).

We can use the validation error not only to detect if an ML method overfits. The validation error can also be used as a quality measure for the hypothesis space or model used by the ML method. This is analogous to the concept of a loss function that allows us to evaluate the quality of a hypothesis $h\!\in\!\hypospace$. Section Model Selection shows how to select between ML methods using different models by comparing their validation errors.

Section A Probabilistic Analysis of Generalization uses a simple probabilistic model for the data to study the relation between the training error of a learnt hypothesis and its expected loss (see risk). This probabilistic analysis reveals the interplay between the data, the hypothesis space and the resulting training error and validation error of an ML method.

Section The Bootstrap discusses the bootstrap as a simulation based alternative to the probabilistic analysis of Section A Probabilistic Analysis of Generalization . While Section A Probabilistic Analysis of Generalization assumes a specific probability distribution of the data points, the bootstrap does not require the specification of a probability distribution underlying the data.

As indicated in Figure fig_bars_val_sel, for some ML applications, we might have a baseline (or benchmark) for the achievable performance of ML methods. Such a baseline might be obtained from existing ML methods, human performance levels or from a probabilistic model (see Section A Probabilistic Analysis of Generalization ). Section Diagnosing ML details how the comparison between training error, validation error and (if available) a baseline informs possible improvements of an ML method. These improvements might be obtained by collecting more data points, using more features of data points or by changing the hypothesis space (or model).

Having a baseline for the expected loss, such as the Bayes risk, allows us to tell if an ML method already provides satisfactory results. If the training error and the validation error of an ML method are close to the baseline, there might be little point in trying to further improve the ML method.

## Overfitting

We now take a closer look at the occurrence of overfitting in linear regression methods. As discussed in Section Linear Regression , linear regression methods learn a linear hypothesis $h(\featurevec) = \weights^{T} \featurevec$ which is parametrized by the parameter vector $\weights \in \mathbb{R}^{\featurelen}$. The learnt hypothesis is then used to predict the numeric label $\truelabel \in \mathbb{R}$ of a data point based on its feature vector $\featurevec \in \mathbb{R}^{\featurelen}$. Linear regression aims at finding a parameter vector $\widehat{\weights}$ with minimum average squared error loss incurred on a training set

[$] \dataset = \big\{ \big(\featurevec^{(1)},\truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize)},\truelabel^{(\samplesize)}\big) \big\}. [$]

The training set $\dataset$ consists of $\samplesize$ data points $\big(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}\big)$, for $\sampleidx=1,\ldots,\samplesize$, with known label values $\truelabel^{(\sampleidx)}$. We stack the feature vectors $\featurevec^{(\sampleidx)}$ and labels $\truelabel^{(\sampleidx)}$, respectively, of the data points in the training set into the feature matrix $\featuremtx=(\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)})^{T}$ and label vector $\labelvec=(\truelabel^{(1)},\ldots,\truelabel^{(\samplesize)})^{T}$.

The ERM of linear regression is solved by any parameter vector $\widehat{\weights}$ that solves equ_zero_gradient_lin_reg. The (minimum) training error of the hypothesis $h^{(\widehat{\weights})}$ is obtained as

[] \begin{align} \emperror(h^{(\widehat{\weights})} \mid \dataset) & \stackrel{\eqref{eq_def_ERM_weight}}{=} \min_{\weights \in \mathbb{R}^{\featuredim}} \emperror(h^{(\weights)} | \dataset) \nonumber \\ & \stackrel{\eqref{equ_emp_risk_lin_proje}}{=} \sqeuclnorm{ (\mathbf{I}- \mathbf{P}) \labelvec }. \end{align} []

Here, we used the orthogonal projection matrix $\mathbf{P}$ on the linear span

[$] $$\nonumber {\rm span}\{ \featuremtx \} = \big\{ \featuremtx \va : \va \in \mathbb{R}^{\featuredim} \big\} \subseteq \mathbb{R}^{\samplesize} ,$$ [$]

of the feature matrix $\featuremtx = (\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)})^{T} \in \mathbb{R}^{ \samplesize \times \featuredim}$.
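This identity can be checked numerically. The following minimal NumPy sketch (the use of `np.linalg.pinv` to form the projection matrix is one possible construction, not prescribed by the text) verifies that the squared norm of the projection residual agrees with the squared residual of the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3                      # sample size n, number of features m
X = rng.standard_normal((n, m))  # feature matrix, rows are feature vectors
y = rng.standard_normal(n)       # label vector

# orthogonal projection matrix onto span{X} = {X a : a in R^m}
P = X @ np.linalg.pinv(X)

# squared norm of the projection residual ||(I - P) y||^2 ...
proj_error = np.sum(((np.eye(n) - P) @ y) ** 2)

# ... agrees with the squared residual of the least-squares ERM solution
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
lsq_error = np.sum((y - X @ w_hat) ** 2)
```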

In many ML applications we have access to a huge number of individual features to characterize a data point. As a case in point, consider a data point which is a snapshot obtained from a modern smartphone camera. These cameras have a resolution of several megapixels. Here, we can use millions of pixel colour intensities as features. For such applications, it is common to have more features for data points than the size of the training set,

[$] $$\label{equ_condition_overfitting} \featuredim \geq \samplesize.$$ [$]

Whenever \eqref{equ_condition_overfitting} holds, the feature vectors $\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)} \in \mathbb{R}^{\featuredim}$ of the data points in $\dataset$ are typically linearly independent. As a case in point, if the feature vectors $\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)} \in \mathbb{R}^{\featuredim}$ are realizations of independent and identically distributed (iid) random variables (RVs) with a continuous probability distribution, these vectors are linearly independent with probability one [1].

If the feature vectors $\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)} \in \mathbb{R}^{\featuredim}$ are linearly independent, the span of the feature matrix $\featuremtx = (\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)})^{T}$ coincides with $\mathbb{R}^{\samplesize}$ which implies, in turn, $\mathbf{P} = \mathbf{I}$. Inserting $\mathbf{P} = \mathbf{I}$ into equ_emp_risk_lin_proje yields

[$] $$\label{eq_zero_trianing_error} \emperror(h^{(\widehat{\weights})} \mid \dataset) = 0.$$ [$]

Whenever the number $\samplesize= | \dataset|$ of training data points does not exceed the number $\featuredim$ of features that characterize data points, there is (with probability one) a linear predictor $h^{(\widehat{\weights})}$ achieving zero training error(!).
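This effect is easy to reproduce numerically. The following sketch (assuming squared error loss; the specific random data is purely for illustration) fits a linear hypothesis to $\samplesize = 5$ data points with $\featuredim = 10$ random features and completely random labels:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 10   # sample size n = 5 is smaller than the number of features m = 10

# feature vectors drawn iid from a continuous distribution are linearly
# independent (with probability one), so their span covers all of R^n
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)   # labels drawn independently of the features

# a parameter vector solving the ERM (minimum-norm least-squares solution)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

train_error = np.mean((y - X @ w_hat) ** 2)   # zero up to numerical precision
```

Even though the labels carry no information about the features, the training error is numerically zero.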

While the hypothesis $h^{(\widehat{\weights})}$ achieves zero training error, it will typically incur a non-zero average prediction error $\truelabel - h^{(\widehat{\weights})}(\featurevec)$ on data points $(\featurevec,\truelabel)$ outside the training set (see Figure fig_polyn_training). Section A Probabilistic Analysis of Generalization will make this statement more precise by using a probabilistic model for the data points within and outside the training set.

Note that \eqref{eq_zero_trianing_error} also applies if the features $\featurevec$ and labels $\truelabel$ of data points are completely unrelated. Consider an ML problem with data points whose labels $\truelabel$ and features $\featurevec$ are realizations of RVs that are statistically independent. Thus, in a very strong sense, the features $\featurevec$ contain no information about the label of a data point. Nevertheless, as soon as the number of features reaches the size of the training set, such that \eqref{equ_condition_overfitting} holds, linear regression methods will learn a hypothesis with zero training error.

We can easily extend the above discussion about the occurrence of overfitting in linear regression to other methods that combine linear regression with a feature map. Polynomial regression, using data points with a single feature $\rawfeature$, combines linear regression with the feature map $\rawfeature \mapsto \featuremapvec(\rawfeature) \defeq \big(\rawfeature^{0},\ldots,\rawfeature^{\featurelen-1}\big)^{T}$ as discussed in Section Polynomial Regression .

It can be shown that whenever \eqref{equ_condition_overfitting} holds and the features $\rawfeature^{(1)},\ldots,\rawfeature^{(\samplesize)}$ of the training set are all different, the feature vectors $\featurevec^{(1)}\defeq \featuremapvec \big(\rawfeature^{(1)}\big),\ldots, \featurevec^{(\samplesize)}\defeq \featuremapvec \big(\rawfeature^{(\samplesize)}\big)$ are linearly independent. This implies, in turn, that polynomial regression is guaranteed to find a hypothesis with zero training error whenever $\samplesize \leq \featurelen$ and the data points in the training set have different feature values.
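The linear independence claim rests on the fact that the feature map produces a Vandermonde matrix, which is invertible for distinct feature values. A minimal NumPy sketch (the feature and label values below are made up for illustration):

```python
import numpy as np

# n data points with distinct feature values z and arbitrary labels
z = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.5, -0.3, 0.7, 2.0])
n = len(z)

# feature map z -> (z^0, ..., z^{n-1}): the rows form a Vandermonde
# matrix, which is invertible whenever the z values are distinct
Phi = np.vander(z, N=n, increasing=True)

# polynomial of degree n-1 interpolating the training set exactly
w_hat = np.linalg.solve(Phi, y)
train_error = np.mean((y - Phi @ w_hat) ** 2)
```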

## Validation

Consider an ML method that uses ERM to learn a hypothesis $\hat{h} \in \hypospace$ out of the hypothesis space $\hypospace$. The discussion in Section Overfitting revealed that the training error of a learnt hypothesis $\hat{h}$ can be a poor indicator for the performance of $\hat{h}$ on data points outside the training set. The hypothesis $\hat{h}$ tends to “look better” on the training set over which it has been tuned within ERM. The basic idea of validating the predictor $\hat{h}$ is simple:

• first we learn a hypothesis $\hat{h}$ using ERM on a training set and
• then we compute the average loss of $\hat{h}$ on data points that do not belong to the training set.

Thus, validation means to compute the average loss of a hypothesis using data points that have not been used in ERM to learn that hypothesis.

Assume we have access to a dataset of $\samplesize$ data points,

[$] \dataset = \big\{ \big(\featurevec^{(1)},\truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize)},\truelabel^{(\samplesize)}\big) \big\}. [$]

Each data point is characterized by a feature vector $\featurevec^{(\sampleidx)}$ and a label $\truelabel^{(\sampleidx)}$. Algorithm alg:validated_ERM outlines how to learn and validate a hypothesis $h\in \hypospace$ by splitting the dataset $\dataset$ into a training set and a validation set. The random shuffling in step alg_shuffle_step of Algorithm alg:validated_ERM ensures the i.i.d. assumption for the shuffled data. Section The Size of the Validation Set shows next how the i.i.d. assumption ensures that the validation error \eqref{equ_def_training_val_val} approximates the expected loss of the hypothesis $\hat{h}$. The hypothesis $\hat{h}$ is learnt via ERM on the training set during step equ_step_train_val_ERM of Algorithm alg:validated_ERM.

Validated ERM

Input: model $\hypospace$, loss function $\lossfun$, dataset $\dataset=\big\{ \big(\featurevec^{(1)}, \truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize)}, \truelabel^{(\samplesize)}\big) \big\}$; split ratio $\splitratio$

• randomly shuffle the data points in $\dataset$
• create the training set $\trainset$ using the first $\samplesize_{t}\!=\! \lceil\splitratio \samplesize\rceil$ data points,
[$] \trainset = \big\{ \big(\featurevec^{(1)}, \truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize_{t})}, \truelabel^{(\samplesize_{t})}\big) \big\}.[$]
• create the validation set $\valset$ by the $\samplesize_v = \samplesize - \samplesize_t$ remaining data points,
[$] \valset = \big\{ \big(\featurevec^{(\samplesize_{t}+1)}, \truelabel^{(\samplesize_{t}+1)}\big),\ldots,\big(\featurevec^{(\samplesize)}, \truelabel^{(\samplesize)}\big) \big\}.[$]
• learn hypothesis $\hat{h}$ via ERM on the training set,
[$] $$\label{equ_def_hat_h_fitting} \hat{h} \defeq \argmin_{h\in \hypospace} \emperror\big(h| \trainset \big)$$ [$]
• compute the training error
[$] $$\label{equ_def_training_error_val} \trainerror \defeq \emperror\big(\hat{h}| \trainset \big) = (1/\samplesize_{t}) \sum_{\sampleidx=1}^{\samplesize_{t}} \loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{\hat{h}}.$$ [$]
• compute the validation error
[$] $$\label{equ_def_training_val_val} \valerror \defeq \emperror\big(\hat{h}| \valset \big)= (1/\samplesize_{v}) \sum_{\sampleidx=\samplesize_{t}+1}^{\samplesize} \loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{\hat{h}}.$$ [$]

Output: learnt hypothesis $\hat{h}$, training error $\trainerror$, validation error $\valerror$
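The steps of Algorithm alg:validated_ERM can be sketched in a few lines of Python. Linear least-squares ERM with the squared error loss is used purely for concreteness, and the function name `validated_erm` is our own:

```python
import numpy as np

def validated_erm(X, y, split_ratio=0.8, seed=0):
    """Single train/validation split with linear least-squares ERM."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])        # randomly shuffle the data points
    X, y = X[perm], y[perm]
    n_t = int(np.ceil(split_ratio * len(y)))  # size of the training set
    X_t, y_t = X[:n_t], y[:n_t]               # training set
    X_v, y_v = X[n_t:], y[n_t:]               # validation set
    w_hat, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)  # ERM on training set
    E_t = np.mean((y_t - X_t @ w_hat) ** 2)   # training error
    E_v = np.mean((y_v - X_v @ w_hat) ** 2)   # validation error
    return w_hat, E_t, E_v
```

On synthetic data with a linear label relation plus small noise, both returned errors are close to the noise variance.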

### The Size of the Validation Set

The choice of the split ratio $\splitratio \approx \samplesize_{t}/ \samplesize$ in Algorithm alg:validated_ERM is often based on trial and error. We try out different choices for the split ratio and pick the one with the smallest validation error. It is difficult to make a precise statement on how to choose the split ratio which applies broadly [2]. This difficulty stems from the fact that the optimal choice for $\splitratio$ depends on the precise statistical properties of the data points.

One approach to determine the required size of the validation set is to use a probabilistic model for the data points. The i.i.d. assumption is maybe the most widely used probabilistic model within ML. Here, we interpret data points as the realizations of iid RVs. These iid RVs have a common (joint) probability distribution $p(\featurevec,\truelabel)$ over possible features $\featurevec$ and labels $\truelabel$ of a data point. Under the i.i.d. assumption, the validation error $\valerror$ \eqref{equ_def_training_val_val} also becomes a realization of a RV. The expectation (or mean) $\expect \{ \valerror \}$ of this RV is precisely the risk $\expect\{ \loss{(\featurevec,\truelabel)} {\hat{h}} \}$ of $\hat{h}$ (see risk).

Within the above i.i.d. assumption, the validation error $\valerror$ becomes a realization of a RV that fluctuates around its mean $\expect \{ \valerror \}$. We can quantify this fluctuation using the variance

[$] \sigma_{\valerror}^{2} \defeq \expect \big\{ \big( \valerror -\expect \{ \valerror \}\big)^{2} \big\}. [$]

Note that the validation error is the average of the realizations $\loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{\hat{h}}$ of iid RVs. The probability distribution of the RV $\loss{(\featurevec,\truelabel)}{\hat{h}}$ is determined by the probability distribution $p(\featurevec,\truelabel)$, the choice of loss function and the hypothesis $\hat{h}$. In general, we do not know $p(\featurevec,\truelabel)$ and, in turn, also do not know the probability distribution of $\loss{(\featurevec,\truelabel)}{\hat{h}}$.

If we know an upper bound $U$ on the variance of the (random) loss $\loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{\hat{h}}$, we can bound the variance of $\valerror$ as

[$] \sigma_{\valerror}^{2} \leq U/\samplesize_{v}. [$]
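This bound follows from the fact that $\valerror$ is the average of $\samplesize_{v}$ iid loss values, so its variance scales inversely with $\samplesize_{v}$:

[] \begin{align} \sigma_{\valerror}^{2} & = (1/\samplesize_{v}^{2}) \sum_{\sampleidx=\samplesize_{t}+1}^{\samplesize} {\rm Var} \big\{ \loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{\hat{h}} \big\} \nonumber \\ & \leq (1/\samplesize_{v}^{2}) \samplesize_{v} U = U/\samplesize_{v}. \nonumber \end{align} []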

We can then, in turn, ensure that the variance $\sigma_{\valerror}^{2}$ of the validation error $\valerror$ does not exceed a given threshold $\eta$, say $\eta = (1/100) \trainerror^2$, by using a validation set of size

[$] $$\label{equ_lower_bound_variance} \samplesize_{v} \geq U/ \eta.$$ [$]

The lower bound \eqref{equ_lower_bound_variance} is only useful if we can determine an upper bound $U$ on the variance of the RV $\loss{(\featurevec,\truelabel)}{\hat{h}}$ where $\big(\featurevec,\truelabel\big)$ is a RV with probability distribution $p(\featurevec,\truelabel)$. An upper bound on the variance of $\loss{(\featurevec,\truelabel)}{\hat{h}}$ can be derived using probability theory if we know an accurate probabilistic model $p(\featurevec,\truelabel)$ for the data points. Such a probabilistic model might be provided by application-specific scientific fields such as biology or psychology. Another option is to estimate the variance of $\loss{(\featurevec,\truelabel)}{\hat{h}}$ using the sample variance of the actual loss values $\loss{(\featurevec^{(1)},\truelabel^{(1)})}{\hat{h}},\ldots, \loss{(\featurevec^{(\samplesize)},\truelabel^{(\samplesize)})}{\hat{h}}$ obtained for the dataset $\dataset$.
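A small numeric sketch of this last option, using hypothetical loss values (the numbers below are made up for illustration):

```python
import numpy as np

# hypothetical loss values incurred by a learnt hypothesis on the dataset
losses = np.array([0.8, 1.2, 0.5, 2.0, 0.9, 1.1, 0.7, 1.6])

U = losses.var(ddof=1)   # sample variance as an estimate of the bound U
eta = 0.01               # desired upper bound on the variance of E_v

# required validation set size according to the lower bound n_v >= U / eta
n_v = int(np.ceil(U / eta))
```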

### k-Fold Cross Validation

Algorithm alg:validated_ERM uses the most basic form of splitting a given dataset $\dataset$ into a training set and a validation set. Many variations and extensions of this basic splitting approach have been proposed and studied (see [3] and Section The Bootstrap ). One very popular extension of the single split into training set and validation set is known as $k$-fold cross-validation ( $k$-fold CV) [4](Sec. 7.10). We summarize $k$-fold CV in Algorithm alg:kfoldCV_ERM below.

Figure fig_k_fold_CV illustrates the key principle behind $k$-fold CV. First, we divide the entire dataset evenly into $\nrfolds$ subsets which are referred to as “folds”. The learning (via ERM) and validation of a hypothesis out of a given hypothesis space $\hypospace$ is then repeated $\nrfolds$ times.

During each repetition, we use one fold as the validation set and the remaining $\nrfolds-1$ folds as a training set. We then average the values of the training error and validation error obtained for each repetition (fold).

The average (over all $\nrfolds$ folds) validation error delivered by $k$-fold CV tends to better estimate the expected loss or risk compared to the validation error obtained from a single split in Algorithm alg:validated_ERM. Consider a dataset that consists of a relatively small number of data points. If we use a single split of this small dataset into a training set and validation set, we might be very unlucky and choose data points for the validation set which are outliers and not representative of the statistical properties of most data points. The effect of such an unlucky split is typically averaged out when using $k$-fold CV.

k-fold CV ERM

Input: model $\hypospace$, loss function $\lossfun$, dataset $\dataset=\big\{ \big(\featurevec^{(1)}, \truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize)}, \truelabel^{(\samplesize)}\big) \big\}$; number $\nrfolds$ of folds

• randomly shuffle the data points in $\dataset$
• divide the shuffled dataset $\dataset$ into $\nrfolds$ folds $\dataset_{1},\ldots,\dataset_{\nrfolds}$ of size $\foldsize=\lceil\samplesize/\nrfolds\rceil$,
[$] $$\dataset_{1}\!=\!\big\{ \big(\featurevec^{(1)}, \truelabel^{(1)}\big),\ldots, \big(\featurevec^{(\foldsize)}, \truelabel^{(\foldsize)}\big)\big\} ,\ldots,\dataset_{\nrfolds}\!=\!\big\{ \big(\featurevec^{((\nrfolds\!-\!1)\foldsize+1)}, \truelabel^{((\nrfolds\!-\!1)\foldsize+1)}\big),\ldots, \big(\featurevec^{(\samplesize)}, \truelabel^{(\samplesize)}\big)\big\}$$ [$]
• For fold index $\foldidx=1,\ldots,\nrfolds$ do
• use the $\foldidx$-th fold as the validation set $\valset=\dataset_{\foldidx}$
• use the rest as the training set $\trainset=\dataset \setminus \dataset_{\foldidx}$
• learn hypothesis $\hat{h}^{(\foldidx)}$ via ERM on the training set,
[$] $$\label{equ_def_hat_h_fitting_cv} \hat{h}^{(\foldidx)} \defeq \argmin_{h\in \hypospace} \emperror\big(h| \trainset \big)$$ [$]
• compute the training error
[$] $$\label{equ_def_training_error_val_cv} \trainerror^{(\foldidx)} \defeq \emperror\big(\hat{h}^{(\foldidx)}| \trainset \big) = (1/\big|\trainset\big|) \sum_{\sampleidx \in \trainset} \loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{\hat{h}^{(\foldidx)}}.$$ [$]
• compute validation error
[$] $$\label{equ_def_training_val_val_cv} \valerror^{(\foldidx)} \defeq \emperror\big(\hat{h}^{(\foldidx)}| \valset \big)= (1/\big|\valset\big|) \sum_{\sampleidx \in \valset} \loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{\hat{h}^{(\foldidx)}}.$$ [$]
• end for
• compute average training and validation errors
[$] \trainerror \defeq (1/\nrfolds) \sum_{\foldidx=1}^{\nrfolds} \trainerror^{(\foldidx)}\mbox{, and }\valerror \defeq (1/\nrfolds) \sum_{\foldidx=1}^{\nrfolds} \valerror^{(\foldidx)} [$]
• pick a learnt hypothesis $\hat{h} \defeq \hat{h}^{(\foldidx)}$ for some $\foldidx \in \{1,\ldots,\nrfolds\}$

Output: learnt hypothesis $\hat{h}$; average training error $\trainerror$; average validation error $\valerror$
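Algorithm alg:kfoldCV_ERM can be sketched as follows. As before, linear least-squares ERM with squared error loss is an assumption made for concreteness, and the function name `kfold_cv_erm` is our own:

```python
import numpy as np

def kfold_cv_erm(X, y, k=5, seed=0):
    """k-fold CV with linear least-squares ERM (squared error loss)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])   # randomly shuffle the data points
    folds = np.array_split(perm, k)      # divide into k folds
    train_errors, val_errors, hypotheses = [], [], []
    for b in range(k):
        val_idx = folds[b]               # b-th fold serves as validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != b])
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        hypotheses.append(w)
        train_errors.append(np.mean((y[train_idx] - X[train_idx] @ w) ** 2))
        val_errors.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    # average the training and validation errors over all k folds and
    # pick one of the learnt hypotheses (here: the first fold's)
    return hypotheses[0], np.mean(train_errors), np.mean(val_errors)
```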

### Imbalanced Data

The simple validation approach discussed above requires the validation set to be a good representative of the overall statistical properties of the data. This might not be the case in applications with discrete-valued labels where some of the label values are very rare. We might then be interested in having a good estimate of the conditional risks $\expect \{ \loss{(\featurevec,\truelabel)}{h} | \truelabel=\truelabel'\}$ where $\truelabel'$ is one of the rare label values. This is a stronger requirement than having a good estimate of the risk $\expect \{ \loss{(\featurevec,\truelabel)}{h} \}$.

Consider data points characterized by a feature vector $\featurevec$ and binary label $\truelabel \in \{-1,1\}$. Assume we aim at learning a hypothesis $h(\featurevec) = \weights^{T} \featurevec$ to classify data points as $\hat{\truelabel}=1$ if $h(\featurevec) \geq 0$ while $\hat{\truelabel}=-1$ otherwise. The learning is based on a dataset $\dataset$ which contains only a single (!) data point with $\truelabel=-1$. If we then split the dataset into a training set and a validation set, the validation set will, with high probability, not include any data point with label value $\truelabel=-1$. This cannot happen with $\nrfolds$-fold CV since the single data point must appear in one of the validation folds. However, even the applicability of $\nrfolds$-fold CV for such an imbalanced dataset is limited since we evaluate the performance of a hypothesis $h(\featurevec)$ using only a single data point with $\truelabel=-1$. The resulting validation error will be dominated by the loss of $h(\featurevec)$ incurred on data points from the majority class (those with true label value $\truelabel=1$).

To learn and validate a hypothesis with imbalanced data, it might be useful to generate synthetic data points to enlarge the minority class. This can be done using data augmentation techniques which we discuss in Section Data Augmentation . Another option is to choose a loss function that takes the different frequencies of label values into account. Let us illustrate this approach with an example.

Consider an imbalanced dataset of size $\samplesize=100$, which contains $90$ data points with label $\truelabel=1$ but only $10$ data points with label $\truelabel=-1$. We might want to put more weight on wrong predictions obtained for data points from the minority class (with true label value $\truelabel=-1$). This can be done by using a much larger value for the loss $\loss{(\featurevec,\truelabel=-1)}{h(\featurevec)=1}$ than for the loss $\loss{(\featurevec,\truelabel=1)}{h(\featurevec)=-1}$ incurred by incorrectly predicting the label of a data point from the majority class (with true label value $\truelabel=1$).
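A minimal sketch of such a weighted loss (the label values and the trivial “always predict 1” hypothesis are made up for illustration; the weight 9 matches the 9:1 class imbalance):

```python
import numpy as np

# hypothetical labels and predictions: 9 data points from the majority
# class (y = 1), one from the minority class (y = -1), and a hypothesis
# that simply predicts 1 for every data point
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, -1])
y_pred = np.ones(10, dtype=int)

# plain average 0/1 loss: looks deceptively good
plain_loss = np.mean(y_true != y_pred)

# weighted 0/1 loss: an error on the minority class (y = -1) costs
# 9 times more, matching the 9:1 imbalance of the label values
weights = np.where(y_true == -1, 9.0, 1.0)
weighted_loss = np.sum(weights * (y_true != y_pred)) / np.sum(weights)
```

The plain loss is 0.1, while the weighted loss reveals that the hypothesis is useless for the minority class.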

## Model Selection

Chapter The Landscape of ML illustrated how many well-known ML methods are obtained by different combinations of a hypothesis space or model, loss function and data representation. While for many ML applications there is often a natural choice for the loss function and data representation, the right choice for the model is typically less obvious. We now discuss how to use the validation methods of Section Validation to choose between different candidate models.

Consider data points characterized by a single numeric feature $\feature\in \mathbb{R}$ and numeric label $\truelabel\in \mathbb{R}$. If we suspect that the relation between feature $\feature$ and label $\truelabel$ is non-linear, we might use polynomial regression which is discussed in Section Polynomial Regression . Polynomial regression uses the hypothesis space $\hypospace_{\rm poly}^{(\featuredim)}$ with some maximum degree $\featuredim$. Different choices for the maximum degree $\featuredim$ yield different hypothesis spaces: $\hypospace^{(1)} = \mathcal{H}_{\rm poly}^{(0)},\hypospace^{(2)} = \mathcal{H}_{\rm poly}^{(1)},\ldots,\hypospace^{(\nrmodels)} = \mathcal{H}_{\rm poly}^{(\nrmodels-1)}$.

Another ML method that learns non-linear hypothesis maps is Gaussian basis regression (see Section Gaussian Basis Regression ). Here, different choices for the variance $\sigma$ and shifts $\mu$ of the Gaussian basis function result in different hypothesis spaces. For example, $\hypospace^{(1)} = \mathcal{H}^{(2)}_{\rm Gauss}$ with $\sigma=1$ and $\mu_{1}=1$ and $\mu_{2}=2$, $\hypospace^{(2)} = \mathcal{H}^{(2)}_{\rm Gauss}$ with $\sigma = 1/10$, $\mu_{1}=10$, $\mu_{2}= 20$.

Algorithm alg:model_selection summarizes a simple method to choose between different candidate models $\hypospace^{(1)},\hypospace^{(2)},\ldots,\hypospace^{(\nrmodels)}$. The idea is to first learn and validate a hypothesis $\hat{h}^{(\modelidx)}$ separately for each model $\hypospace^{(\modelidx)}$ using Algorithm alg:kfoldCV_ERM. For each model $\hypospace^{(\modelidx)}$, we learn the hypothesis $\hat{h}^{(\modelidx)}$ via ERM \eqref{equ_def_hat_h_fitting} and then compute its validation error $\valerror^{(\modelidx)}$ \eqref{equ_def_training_val_val}. We then choose the hypothesis $\hat{h}^{(\hat{\modelidx})}$ from the model $\hypospace^{(\hat{\modelidx})}$ which resulted in the smallest validation error $\valerror^{(\hat{\modelidx})} = \min_{\modelidx=1,\ldots,\nrmodels} \valerror^{(\modelidx)}$.

The workflow of Algorithm alg:model_selection is similar to the workflow of ERM. Remember that the idea of ERM is to learn a hypothesis out of a set of different candidates (the hypothesis space). The quality of a particular hypothesis $h$ is measured using the (average) loss incurred on some training set. We use the same principle for model selection but on a higher level. Instead of learning a hypothesis within a hypothesis space, we choose (or learn) a hypothesis space within a set of candidate hypothesis spaces. The quality of a given hypothesis space is measured by the validation error \eqref{equ_def_training_val_val}. To determine the validation error of a hypothesis space, we first learn the hypothesis $\hat{h} \in \hypospace$ via ERM \eqref{equ_def_hat_h_fitting} on the training set. Then, we obtain the validation error as the average loss of $\hat{h}$ on the validation set.

The final hypothesis $\hat{h}$ delivered by the model selection Algorithm alg:model_selection not only depends on the training set used in ERM (see \eqref{equ_def_hat_h_fitting_cv}). This hypothesis $\hat{h}$ has also been chosen based on its validation error which is the average loss on the validation set in \eqref{equ_def_training_val_val_cv}. Indeed, we compared this validation error with the validation errors of other models to pick the model $\hypospace^{(\hat{\modelidx})}$ (see step step_pick_optimal_model) which contains $\hat{h}$. Since we used the validation error \eqref{equ_def_training_val_val_cv} of $\hat{h}$ to select it, we cannot use this validation error as a reliable indicator for the general performance of $\hat{h}$.

To estimate the general performance of the final hypothesis $\hat{h}$ delivered by Algorithm alg:model_selection we must try it out on a test set. The test set, which is constructed in step equ_construct_test_set_algmodsel of Algorithm alg:model_selection, consists of data points that are neither contained in the training set \eqref{equ_def_hat_h_fitting_cv} nor the validation set \eqref{equ_def_training_val_val_cv} used for training and validating the candidate models $\hypospace^{(1)},\ldots,\hypospace^{(\nrmodels)}$. The average loss of the final hypothesis on the test set is referred to as the test error. The test error is computed in the step step_compute_test_error_mod_selection of Algorithm alg:model_selection.

Model Selection

Input: list of candidate models $\hypospace^{(1)},\ldots,\hypospace^{(\nrmodels)}$, loss function $\lossfun$, dataset $\dataset=\big\{ \big(\featurevec^{(1)}, \truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize)}, \truelabel^{(\samplesize)}\big) \big\}$; number $\nrfolds$ of folds, test set fraction $\rho$

• randomly shuffle the data points in $\dataset$
• determine size $\samplesize' \defeq \lceil \rho \samplesize \rceil$ of test set
• construct a test set
[$] \testset= \big\{\big(\featurevec^{(1)}, \truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize')}, \truelabel^{(\samplesize')}\big) \big\}[$]
• construct a training set and a validation set,
[$]\dataset^{(\rm trainval)} = \big\{\big(\featurevec^{(\samplesize'+1)}, \truelabel^{(\samplesize'+1)}\big),\ldots,\big(\featurevec^{(\samplesize)}, \truelabel^{(\samplesize)}\big) \big\}[$]
• For $\modelidx=1,\ldots,\nrmodels$ do
• run Algorithm alg:kfoldCV_ERM using $\hypospace=\hypospace^{(\modelidx)}$, dataset $\dataset=\dataset^{(\rm trainval)}$, loss function $\lossfun$ and $\nrfolds$ folds
• Algorithm alg:kfoldCV_ERM delivers hypothesis $\hat{h}$ and validation error $\valerror$
• store learnt hypothesis $\hat{h}^{(\modelidx)}\defeq\hat{h}$ and validation error $\valerror^{(\modelidx)}\defeq\valerror$
• end for
• pick model $\hypospace^{(\hat{\modelidx})}$ with minimum validation error $\valerror^{(\hat{\modelidx})}\!=\!\min_{\modelidx=1,\ldots,\nrmodels}\valerror^{(\modelidx)}$ \label{step_pick_optimal_model}
• define optimal hypothesis $\hat{h}= \hat{h}^{(\hat{\modelidx})}$
• compute test error
[$] $$\label{equ_def_training_error_val_test} \testerror \defeq \emperror\big(\hat{h}| \testset \big) = (1/\big|\testset\big|) \sum_{\sampleidx \in \testset} \loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{\hat{h}}.$$ [$]
Output: hypothesis $\hat{h}$; training error $\trainerror^{(\hat{\modelidx})}$; validation error $\valerror^{(\hat{\modelidx})}$, test error $\testerror$.
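The procedure above can be sketched in a few lines of numpy. This is a minimal illustration, not the book's reference implementation: the function names, the squared error loss, and the linear candidate models (using only the first $r$ features) are assumptions of this sketch.

```python
import numpy as np

def kfold_val_error(fit, X, y, k=5):
    """k-fold CV (stand-in for Algorithm alg:kfoldCV_ERM): returns the
    hypothesis refitted on the whole data and the average validation error."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = fit(X[train], y[train])
        errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return fit(X, y), float(np.mean(errs))

def model_selection(fits, X, y, rho=0.2, k=5, seed=0):
    """fits: one ERM routine per candidate model H^(1), ..., H^(M)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))           # randomly shuffle the dataset
    m_test = int(np.ceil(rho * len(y)))      # size of the test set
    test, trainval = perm[:m_test], perm[m_test:]
    results = [kfold_val_error(f, X[trainval], y[trainval], k) for f in fits]
    best = min(range(len(fits)), key=lambda l: results[l][1])
    w_hat, val_err = results[best]
    test_err = float(np.mean((y[test] - X[test] @ w_hat) ** 2))  # test error
    return best, w_hat, val_err, test_err

# candidate models: linear maps that use only the first r features
def make_fit(r, n):
    def fit(X, y):
        w = np.zeros(n)
        w[:r] = np.linalg.lstsq(X[:, :r], y, rcond=None)[0]
        return w
    return fit

rng = np.random.default_rng(1)
n, m = 6, 200
X = rng.normal(size=(m, n))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)
best, w_hat, val_err, test_err = model_selection(
    [make_fit(r, n) for r in range(1, n + 1)], X, y)
```

The selected index `best` identifies the candidate model with minimum validation error; only the single final hypothesis is then evaluated on the held-out test set.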

Sometimes it is beneficial to use different loss functions for the training and the validation of a hypothesis. As an example, consider logistic regression and the support vector machine (SVM) which have been discussed in Sections Logistic Regression and Support Vector Machines , respectively. Both methods use the same model, which is the space of linear hypothesis maps $h(\featurevec) = \weights^{T} \featurevec$. The main difference between these two methods lies in their choice of loss function. Logistic regression minimizes the (average) logistic loss on the training set to learn the hypothesis $h^{(1)}(\featurevec)= \big( \weights^{(1)} \big)^{T} \featurevec$

with a parameter vector $\weights^{(1)}$. The SVM instead minimizes the (average) hinge loss on the training set to learn the hypothesis $h^{(2)}(\featurevec) = \big( \weights^{(2)} \big)^{T} \featurevec$ with a parameter vector $\weights^{(2)}$. It is inconvenient to compare the usefulness of the two hypotheses $h^{(1)}(\featurevec)$ and $h^{(2)}(\featurevec)$ when their validation errors are computed with different loss functions. The comparison is more convenient if we instead compute the validation errors of $h^{(1)}(\featurevec)$ and $h^{(2)}(\featurevec)$ using the same average 0/1 loss.
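A small sketch of this idea: no matter which loss was used to learn the parameter vectors, both classifiers are validated with the same average 0/1 loss. The function name, the toy validation set, and the $\pm 1$ label convention are assumptions of this example.

```python
import numpy as np

def zero_one_val_error(w, X_val, y_val):
    """Average 0/1 loss of the linear classifier h(x) = w^T x on a
    validation set with labels in {-1, +1}."""
    y_pred = np.where(X_val @ w >= 0, 1.0, -1.0)
    return float(np.mean(y_pred != y_val))

# w1 might come from logistic regression, w2 from the SVM; comparing them
# via the same 0/1 loss makes the comparison fair
X_val = np.array([[1.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y_val = np.array([1.0, -1.0, 1.0])
w1 = np.array([2.0, 0.1])
w2 = np.array([0.1, 1.0])
err1 = zero_one_val_error(w1, X_val, y_val)   # -> 0.0 on this toy set
err2 = zero_one_val_error(w2, X_val, y_val)   # -> 2/3 on this toy set
```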

Algorithm alg:model_selection requires as one of its inputs a given list of candidate models. The longer this list, the more computation is required from Algorithm alg:model_selection. Sometimes it is possible to prune the list of candidate models by removing models that are very unlikely to have minimum validation error.

Consider polynomial regression which uses as the model the space $\hypospace_{\rm poly}^{(\polydegree)}$ of polynomials with maximum degree $\polydegree$ (see equ_def_poly_hyposapce). For $\polydegree=1$,

$\hypospace_{\rm poly}^{(\polydegree)}$ is the space of polynomials with maximum degree one (which are linear maps), $h(\feature)=\weight_{2}\feature+\weight_{1}$. For $\polydegree=2$, $\hypospace_{\rm poly}^{(\polydegree)}$ is the space of polynomials with maximum degree two, $h(\feature)=\weight_{3}\feature^2 +\weight_{2}\feature+\weight_{1}.$ The polynomial degree $\polydegree$ parametrizes a nested set of models,

[$] \hypospace_{\rm poly}^{(1)} \subset \hypospace_{\rm poly}^{(2)} \subset \ldots. [$]

For each degree $\polydegree$, we learn a hypothesis $h^{(\polydegree)} \in \hypospace_{\rm poly}^{(\polydegree)}$ with minimum average loss (training error) $\trainerror^{(\polydegree)}$ on a training set (see \eqref{equ_def_training_error_val}). To validate the learnt hypothesis $h^{(\polydegree)}$, we compute its average loss (validation error) $\valerror^{(\polydegree)}$ on a validation set (see \eqref{equ_def_training_val_val}).

Figure fig_trainvalvsdegree depicts the typical dependency of the training and validation errors on the polynomial degree $\polydegree$. The training error $\trainerror^{(\polydegree)}$ decreases monotonically with increasing polynomial degree $\polydegree$. To illustrate this monotonic decrease, we consider the two specific choices $\polydegree=3$ and $\polydegree=5$ with corresponding models $\hypospace^{(3)}_{\rm poly}$ and $\hypospace^{(5)}_{\rm poly}$. Note that $\hypospace^{(3)}_{\rm poly} \subset \hypospace^{(5)}_{\rm poly}$ since any polynomial with degree not exceeding $3$ is also a polynomial with degree not exceeding $5$. Therefore, the training error \eqref{equ_def_training_error_val} obtained when minimizing over the larger model $\hypospace^{(5)}_{\rm poly}$ can only decrease but never increase compared to \eqref{equ_def_training_error_val} using the smaller model $\hypospace^{(3)}_{\rm poly}$. Figure fig_trainvalvsdegree indicates that the validation error $\valerror^{(\polydegree)}$ (see \eqref{equ_def_training_val_val}) behaves very differently compared to the training error $\trainerror^{(\polydegree)}$. Starting with degree $\polydegree=0$, the validation error first decreases with increasing degree $\polydegree$. As soon as the degree $\polydegree$ is increased beyond a critical value, the validation error starts to increase with increasing $\polydegree$. For very large values of $\polydegree$, the training error becomes almost negligible while the validation error becomes very large. In this regime, polynomial regression overfits the training set.

Figure fig_polyregdegree9 illustrates the overfitting of polynomial regression when using a maximum degree that is too large. In particular, Figure fig_polyregdegree9 depicts a learnt hypothesis which is a degree $9$ polynomial that fits the training set very well, resulting in a very small training error. To achieve this low training error, the resulting polynomial has an unreasonably high rate of change for feature values $\feature \approx 0$. This results in large prediction errors for data points with feature values $\feature \approx 0$.
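The monotone decrease of the training error with the degree $\polydegree$ is easy to reproduce numerically. The following sketch uses a noisy sinusoid as the data-generating process; this choice, the sample sizes, and the noise level are arbitrary assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
y = np.sin(3 * x) + 0.2 * rng.normal(size=40)   # assumed toy data source
x_tr, y_tr = x[:20], y[:20]                     # training set
x_val, y_val = x[20:], y[20:]                   # validation set

train_err, val_err = {}, {}
for r in [1, 3, 5, 9]:
    # ERM over H_poly^(r): least-squares fit of a degree-r polynomial
    coeffs = np.polyfit(x_tr, y_tr, deg=r)
    train_err[r] = float(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
    val_err[r] = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
```

Since the models are nested, the training error can only decrease as `r` grows, while the validation error typically bottoms out at an intermediate degree and then rises again.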

## A Probabilistic Analysis of Generalization

More Data Beats Clever Algorithms?; More Data Beats Clever Feature Selection?

A key challenge in ML is to ensure that a hypothesis that predicts well the labels on a training set (which has been used to learn that hypothesis) will also predict well the labels of data points outside the training set. We say that a ML method generalizes well if it learns a hypothesis $\hat{h}$ that performs on data points outside the training set not significantly worse than on the training set. In other words, the loss incurred by $\hat{h}$ for data points outside the training set is not much larger than the average loss of $\hat{h}$ incurred on the training set.

We now study the generalization of linear regression methods (see Section Linear Regression ) using an i.i.d. assumption. In particular, we interpret data points as iid realizations of RVs that have the same distribution as a random data point $\datapoint=(\featurevec,\truelabel)$. The feature vector $\featurevec$ is then a realization of a standard Gaussian RV with zero mean and covariance being the identity matrix, i.e., $\featurevec \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

The label $\truelabel$ of a random data point is related to its features $\featurevec$ via a linear Gaussian model

[$] $$\label{equ_linear_obs_model} \truelabel = \overline{\weights}^{T} \featurevec + \varepsilon \mbox{, with noise } \varepsilon \sim \mathcal{N}(0,\sigma^{2}).$$ [$]

We assume that the noise variance $\sigma^{2}$ is fixed and known. This is a simplifying assumption and in practice we would need to estimate the noise variance from data [5]. Note that, within our probabilistic model, the error component $\varepsilon$ in \eqref{equ_linear_obs_model} is intrinsic to the data and cannot be overcome by any ML method. We highlight that the probabilistic model for the observed data points is just a modelling assumption. This assumption allows us to study some fundamental behaviour of ML methods. There are principled methods (“statistical tests”) that allow us to determine if a given dataset can be accurately modelled using \eqref{equ_linear_obs_model} [6].

We predict the label $\truelabel$ from the features $\featurevec$ using a linear hypothesis $h(\featurevec)$ that depends only on the first $\modelidx$ features $\feature_{1},\ldots,\feature_{\modelidx}$. Thus, we use the hypothesis space

[$] $$\label{equ_generalization_hypospace_r} \hypospace^{(\modelidx)} = \{ h^{(\weights)}(\featurevec)= (\weights^{T},\mathbf{0}^{T}) \featurevec \mbox{ with } \weights \in \mathbb{R}^{\modelidx} \}.$$ [$]

Note that each element $h^{(\weights)} \in \hypospace^{(\modelidx)}$ corresponds to a particular choice of the parameter vector $\weights \in \mathbb{R}^{\modelidx}$.

The model parameter $\modelidx \in \{0,\ldots,\featuredim\}$ coincides with the effective dimension of the hypothesis space $\hypospace^{(\modelidx)}$. For $\modelidx\lt \featuredim$, the hypothesis space $\hypospace^{(\modelidx)}$ is a proper (strict) subset of the space of linear hypothesis maps used within linear regression (see Section Linear Regression ). Moreover, the parameter $\modelidx$ indexes a nested sequence of models,

[$] $$\hypospace^{(0)} \subseteq \hypospace^{(1)} \subseteq \ldots \subseteq \hypospace^{(\featuredim)}. \nonumber$$ [$]

The quality of a particular predictor $h^{(\weights)} \in \hypospace^{(\modelidx)}$ is measured via the average squared error $\emperror (h^{(\weights)} \mid \trainset)$ incurred on the labeled training set

[$] $$\label{equ_def_train_set_prob_analysis_generatlization} \trainset= \{ \big(\featurevec^{(1)}, \truelabel^{(1)}\big), \ldots, \big(\featurevec^{(\samplesize_{t})}, \truelabel^{(\samplesize_{t})}\big) \}.$$ [$]

We interpret data points in the training set $\trainset$ as well as any other data point outside the training set as realizations of iid RVs with a common probability distribution. This common probability distribution is a multivariate normal (Gaussian) distribution,

[$] $$\label{equ_toy_model_iid} \featurevec, \featurevec^{(\sampleidx)} \mbox{iid with } \featurevec, \featurevec^{(\sampleidx)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$ [$]

The labels $\truelabel^{(\sampleidx)},\truelabel$ are related to the features of data points via (see \eqref{equ_linear_obs_model})

[$] $$\label{equ_labels_training_data} \truelabel^{(\sampleidx)} = \overline{\weights}^{T} \featurevec^{(\sampleidx)} + \varepsilon^{(\sampleidx)}\mbox{, and } \truelabel = \overline{\weights}^{T} \featurevec + \varepsilon.$$ [$]

Here, the noise terms $\varepsilon, \varepsilon^{(\sampleidx)} \sim \mathcal{N}(0,\sigma^{2})$ are realizations of iid Gaussian RVs with zero mean and variance $\sigma^{2}$.

Chapter Empirical Risk Minimization showed that the training error $\emperror (h^{(\weights)} \mid \trainset)$ is minimized by the predictor $h^{(\widehat{\weights})}(\featurevec) = \widehat{\weights}^{T} \mathbf{I}_{\modelidx \times \featuredim} \featurevec$, that uses the parameter vector

[$] $$\label{equ_optimal_weight_closed_form} \widehat{\weights} = \big(\big(\featuremtx^{(\modelidx)}\big)^{T} \featuremtx^{(\modelidx)} \big)^{-1} \big(\featuremtx^{(\modelidx)}\big)^{T} \labelvec.$$ [$]

Here we used the (restricted) feature matrix $\mX^{(\modelidx)}$ and the label vector $\labelvec$ defined as, respectively,

[] \begin{align} \featuremtx^{(\modelidx)}& \!=\!(\featurevec^{(1)},\ldots,\featurevec^{(\samplesize_{\rm t})})^{T} \mathbf{I}_{\featuredim \times \modelidx}\!\in\!\mathbb{R}^{\samplesize_{\rm t} \times \modelidx} \mbox{, and } \nonumber \\ \label{equ_def_feature_matrix_r} \labelvec& \!=\!\big(\truelabel^{(1)},\ldots,\truelabel^{(\samplesize_{\rm t})}\big)^{T}\!\in\!\mathbb{R}^{\samplesize_{\rm t}}. \end{align} []

It will be convenient to tolerate a slight abuse of notation and denote both the length-$\modelidx$ vector \eqref{equ_optimal_weight_closed_form} and the zero-padded parameter vector $\big(\widehat{\weights}^{T},\mathbf{0}^{T}\big)^{T} \in \mathbb{R}^{\featuredim}$ by $\widehat{\weights}$. This allows us to write

[$] $$h^{(\widehat{\weights})}(\featurevec) = \widehat{\weights}^{T} \featurevec.$$ [$]

We highlight that the formula \eqref{equ_optimal_weight_closed_form} for the optimal weight vector $\widehat{\weights}$ is only valid if the matrix $\big(\featuremtx^{(\modelidx)}\big)^{T} \featuremtx^{(\modelidx)}$ is invertible. Within our toy model (see \eqref{equ_toy_model_iid}), this is true with probability one whenever $\samplesize_{\rm t} \geq \modelidx$. Indeed, for $\samplesize_{\rm t} \geq \modelidx$ the truncated feature vectors $\mathbf{I}_{\modelidx \times \featuredim} \featurevec^{(1)}, \ldots, \mathbf{I}_{\modelidx \times \featuredim} \featurevec^{(\samplesize_{t})}$, which are iid realizations of a Gaussian RV, are linearly independent with probability one [7][8].
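The closed-form solution \eqref{equ_optimal_weight_closed_form} can be evaluated directly from a synthetic dataset generated according to the toy model. The dimensions, noise level, and the randomly drawn true parameter vector below are arbitrary assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m_t, sigma = 6, 3, 40, 0.5
w_bar = rng.normal(size=n)                     # true parameter vector

X = rng.normal(size=(m_t, n))                  # feature vectors ~ N(0, I)
y = X @ w_bar + sigma * rng.normal(size=m_t)   # linear Gaussian model

X_r = X[:, :r]                                 # restricted feature matrix X^(r)
# closed-form ERM solution; invertibility holds with probability one for m_t >= r
w_hat = np.linalg.inv(X_r.T @ X_r) @ X_r.T @ y
w_hat_padded = np.concatenate([w_hat, np.zeros(n - r)])   # zero-padded notation
```

In practice one would use a least-squares solver instead of the explicit inverse; the explicit form is shown here only to mirror the formula in the text.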

In what follows, we consider the case $\samplesize_{\rm t} \gt \modelidx$ such that the formula \eqref{equ_optimal_weight_closed_form} is valid (with probability one). The more challenging high-dimensional regime $\samplesize_{\rm t} \leq \modelidx$ will be studied in Chapter Regularization .

The optimal parameter vector $\widehat{\weights}$ (see \eqref{equ_optimal_weight_closed_form}) depends on the training set $\trainset$ via the feature matrix $\featuremtx^{(\modelidx)}$ and label vector $\labelvec$ (see \eqref{equ_def_feature_matrix_r}). Therefore, since we model the data points in the training set as realizations of RVs, the parameter vector $\widehat{\weights}$ \eqref{equ_optimal_weight_closed_form} is the realization of a RV. For each specific realization of the training set $\trainset$, we obtain a specific realization of the optimal parameter vector $\widehat{\weights}$.

The probabilistic model \eqref{equ_linear_obs_model} relates the features $\featurevec$ of a data point to its label $\truelabel$ via some (unknown) true parameter vector $\overline{\weights}$. Intuitively, the best linear hypothesis would be $h(\featurevec) =\widehat{\weights}^{T} \featurevec$ with parameter vector $\widehat{\weights} = \overline{\weights}$. However, in general this will not be achievable since we have to compute $\widehat{\weights}$ based on the features $\featurevec^{(\sampleidx)}$ and noisy labels $\truelabel^{(\sampleidx)}$ of the data points in the training set $\trainset$.

The parameter vector $\widehat{\weights}$ delivered by ERM typically results in a non-zero estimation error

[$] $$\label{equ_def_est_error} \Delta \weights \defeq \widehat{\weights} - \overline{\weights}.$$ [$]

The estimation error \eqref{equ_def_est_error} is the realization of a RV since the learnt parameter vector $\widehat{\weights}$ (see \eqref{equ_optimal_weight_closed_form}) is itself a realization of a RV.

The Bias and Variance Decomposition. The prediction accuracy of $h^{(\widehat{\weights})}$, using the learnt parameter vector \eqref{equ_optimal_weight_closed_form}, depends crucially on the mean squared estimation error (MSEE)

[$] $$\label{equ_def_est_err} \mseesterr \defeq \expect \{ \sqeuclnorm{\Delta \weights } \} \stackrel{\eqref{equ_def_est_error}}{=} \expect \big\{ \sqeuclnorm{\widehat{\weights} - \overline{\weights}} \big \}.$$ [$]

We will next decompose the MSEE $\mseesterr$ into two components, which are referred to as a variance term and a bias term. The variance term quantifies the random fluctuations of the parameter vector obtained from ERM on the training set \eqref{equ_def_train_set_prob_analysis_generatlization}. The bias term characterizes the systematic (or expected) deviation between the true parameter vector $\overline{\weights}$ (see \eqref{equ_linear_obs_model}) and the (expectation of the) learnt parameter vector $\widehat{\weights}$.

[] \begin{align} \mseesterr & \stackrel{\eqref{equ_def_est_err}}{=} \expect \big\{ \sqeuclnorm{\widehat{\weights} - \overline{\weights}} \big \} \nonumber \\[2mm] &= \expect \bigg\{ \sqeuclnorm{\big( \widehat{\weights} - \expect \big\{ \widehat{\weights} \big\}\big) - \big( \overline{\weights} - \expect \big\{ \widehat{\weights} \big\} \big)} \bigg \}. \nonumber \end{align} []

We can develop the last expression further by expanding the squared Euclidean norm,

[] \begin{align} \mseesterr &= \expect \big\{ \sqeuclnorm{\widehat{\weights} - \expect \{ \widehat{\weights} \} } \big\} - 2 \expect \big \{ \big( \widehat{\weights} - \expect \big\{ \widehat{\weights} \big\} \big)^{T} \big( \overline{\weights} - \expect \big\{ \widehat{\weights} \big\} \big) \big\} + \expect \big\{ \sqeuclnorm{\overline{\weights} - \expect \big\{ \widehat{\weights} \big\} } \big\}\nonumber \\[4mm] &= \expect \big\{ \sqeuclnorm{\widehat{\weights} - \expect \{ \widehat{\weights} \} } \big\} - 2 \big( \underbrace{\expect \big \{ \widehat{\weights} \big\} - \expect \big\{ \widehat{\weights} \big\}}_{=\mathbf{0}} \big)^{T} \big( \overline{\weights} - \expect \big\{ \widehat{\weights} \big\} \big) + \expect \big\{ \sqeuclnorm{\overline{\weights} - \expect \big\{ \widehat{\weights} \big\} } \big\}\nonumber \\[4mm] &= \label{equ_bias_var_decomp}\underbrace{ \expect \big\{ \sqeuclnorm{ \widehat{\weights} - \expect \{ \widehat{\weights} \} } \big\} }_{\mbox{variance } \varianceterm} + \underbrace{ \expect \big\{ \sqeuclnorm{ \overline{\weights} - \expect \{ \widehat{\weights} \} } \big\} }_{\mbox{bias } \biasterm^2}. \end{align} []

The first component in \eqref{equ_bias_var_decomp} represents the (expected) variance of the learnt parameter vector $\widehat{\weights}$ \eqref{equ_optimal_weight_closed_form}. Note that, within our probabilistic model, the training set \eqref{equ_def_train_set_prob_analysis_generatlization} is the realization of a RV since it is constituted by data points that are iid realizations of RVs (see \eqref{equ_toy_model_iid} and \eqref{equ_linear_obs_model}).

The second component in \eqref{equ_bias_var_decomp} is referred to as a bias term. The parameter vector $\widehat{\weights}$ is computed from a randomly fluctuating training set via \eqref{equ_optimal_weight_closed_form} and is therefore itself fluctuating around its expectation $\expect \big\{ \widehat{\weights} \big\}$. The bias term is the Euclidean distance between this expectation $\expect \big\{ \widehat{\weights} \big\}$ and the true parameter vector $\overline{\weights}$ relating features and label of a data point via \eqref{equ_linear_obs_model}.

The bias term $\biasterm^{2}$ and the variance $\varianceterm$ in \eqref{equ_bias_var_decomp} both depend on the model complexity parameter $\modelidx$ but in a fundamentally different manner. The bias term $\biasterm^{2}$ typically decreases with increasing $\modelidx$ while the variance $\varianceterm$ increases with increasing $\modelidx$. In particular, the bias term is given as

[$] $$\label{equ_def_bias_term} \biasterm^{2} = \sqeuclnorm{\overline{\weights} - \expect \{ \widehat{\weights} \} } = \sum_{\featureidx=\modelidx+1}^{\featuredim} \overline{\weight}_{\featureidx}^2.$$ [$]

The bias term \eqref{equ_def_bias_term} is zero if and only if

[$] $$\label{equ_def_zero_bias_weght_elements} \overline{\weight}_{\featureidx}=0 \mbox{ for each index } \featureidx=\modelidx+1,\ldots,\featuredim.$$ [$]

The necessary and sufficient condition \eqref{equ_def_zero_bias_weght_elements} for zero bias is equivalent to $h^{(\overline{\weights})} \in \hypospace^{(\modelidx)}$. Note that the condition \eqref{equ_def_zero_bias_weght_elements} depends on both, the model parameter $\modelidx$ and the true parameter vector $\overline{\weights}$. While the model parameter $\modelidx$ is under our control, the true parameter vector $\overline{\weights}$ is not; it is determined by the underlying data generation process. The only way to ensure \eqref{equ_def_zero_bias_weght_elements} for every possible parameter vector $\overline{\weights}$ in \eqref{equ_linear_obs_model} is to use $\modelidx=\featuredim$, i.e., to use all available features $\feature_{1},\ldots,\feature_{\featuredim}$ of a data point.

When using the model $\hypospace^{(\modelidx)}$ with $\modelidx \lt \featuredim$, we cannot guarantee a zero bias term since we have no control over the true underlying parameter vector $\overline{\weights}$ in \eqref{equ_linear_obs_model}. In general, the bias term decreases with an increasing model size $\modelidx$ (see Figure fig_bias_variance). We highlight that the bias term does not depend on the variance $\sigma^{2}$ of the noise $\varepsilon$ in our toy model \eqref{equ_linear_obs_model}.
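As a tiny numeric illustration of \eqref{equ_def_bias_term}: the bias term is just the squared Euclidean norm of the "tail" of the true parameter vector that the restricted model cannot represent. The parameter vector below is a made-up example.

```python
import numpy as np

w_bar = np.array([1.0, -0.5, 0.3, 0.1])   # hypothetical true parameter vector
r = 2                                     # model uses only the first 2 features
# B^2 = sum of squared omitted entries: 0.3**2 + 0.1**2 = 0.1
bias_sq = float(np.sum(w_bar[r:] ** 2))
```

Choosing `r = 4` would make `bias_sq` exactly zero, matching the condition \eqref{equ_def_zero_bias_weght_elements}.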

Let us now consider the variance term in \eqref{equ_bias_var_decomp}. Using the statistical independence of the features and labels of data points (see \eqref{equ_linear_obs_model}, \eqref{equ_toy_model_iid} and \eqref{equ_labels_training_data}), one can show that [a]

[$] $$\label{equ_variance_term_toy_model} \varianceterm = \expect \big\{ \sqeuclnorm{\widehat{\weights} - \expect\{ \widehat{\weights} \} } \big\} = \big( \biasterm^2+ \sigma^{2}\big) \mbox{tr} \left\{ \expect \left\{\big( \big(\featuremtx^{(\modelidx)}\big)^{T} \featuremtx^{(\modelidx)} \big)^{-1} \right\} \right\}.$$ [$]

By \eqref{equ_toy_model_iid}, the matrix $\left(\big(\featuremtx^{(\modelidx)}\big)^{T} \featuremtx^{(\modelidx)} \right)^{-1}$ is a realization of a (matrix-valued) RV with an inverse Wishart distribution [10]. For $\samplesize_{\rm t} \gt \modelidx+1$, its expectation is given as

[$] $$\label{equ_expr_expect_inv-wishart} \expect\{ \big(\big(\featuremtx^{(\modelidx)} \big)^{T} \featuremtx^{(\modelidx)} \big)^{-1} \} = 1/(\samplesize_{\rm t}-\modelidx-1) \mathbf{I}.$$ [$]

By inserting \eqref{equ_expr_expect_inv-wishart} and $\mbox{tr} \{ \mathbf{I} \} = \modelidx$ into \eqref{equ_variance_term_toy_model},

[$] $$\label{equ_formulae_variance_toy_model} \varianceterm = \expect \left\{ \sqeuclnorm{\widehat{\weights} - \expect\left\{ \widehat{\weights} \right\}} \right\} = \big( \biasterm^{2} +\sigma^{2}\big) \modelidx/(\samplesize_{\rm t}-\modelidx-1).$$ [$]

The variance \eqref{equ_formulae_variance_toy_model} typically increases with increasing model complexity $\modelidx$ (see Figure fig_bias_variance). In contrast, the bias term \eqref{equ_def_bias_term} decreases with increasing $\modelidx$.
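The formula \eqref{equ_formulae_variance_toy_model} can be checked by Monte Carlo simulation over many independently drawn training sets. The sample sizes, the number of repetitions, and the tolerance are ad hoc choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m_t, sigma = 8, 4, 50, 0.5
w_bar = rng.normal(size=n)                        # fixed true parameter vector
bias_sq = float(np.sum(w_bar[r:] ** 2))           # B^2 from the bias formula

reps = 3000
w_hats = np.empty((reps, r))
for k in range(reps):
    X = rng.normal(size=(m_t, n))                 # fresh random training set
    y = X @ w_bar + sigma * rng.normal(size=m_t)
    w_hats[k] = np.linalg.lstsq(X[:, :r], y, rcond=None)[0]

var_mc = float(np.sum(np.var(w_hats, axis=0)))    # E{ ||w_hat - E{w_hat}||^2 }
var_th = (bias_sq + sigma**2) * r / (m_t - r - 1) # theoretical variance term
```

Up to Monte Carlo error, `var_mc` should agree with `var_th`.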

The opposite dependence of variance and bias on the model complexity results in a bias-variance trade-off. Choosing a model (hypothesis space) with small bias will typically result in large variance and vice versa. In general, the choice of model must balance between a small variance and a small bias.

Generalization.

Consider a linear regression method that learns the linear hypothesis $h(\featurevec) = \widehat{\weights}^{T} \featurevec$ using the parameter vector \eqref{equ_optimal_weight_closed_form}. The parameter vector $\widehat{\weights}$ \eqref{equ_optimal_weight_closed_form} results in a linear hypothesis with minimum training error, i.e., minimum average loss on the training set. However, the ultimate goal of ML is to find a hypothesis that predicts well the label of any data point. In particular, we want the hypothesis $h(\featurevec) = \widehat{\weights}^{T} \featurevec$ to generalize well to data points outside the training set.

We quantify the generalization capability of $h(\featurevec) = \widehat{\weights}^{T} \featurevec$ by its expected prediction loss

[$] $$\label{equ_def_expected_pred_loss} \error_{\rm pred} = \expect \big\{ \big( \truelabel - \underbrace{\widehat{\weights}^{T} \featurevec}_{ = \hat{\truelabel}} \big)^2 \big\}.$$ [$]

Note that $\error_{\rm pred}$ is a measure for the performance of a ML method and not of a specific hypothesis. Indeed, the learnt parameter vector $\widehat{\weights}$ is not fixed but depends on the data points in the training set. These data points are modelled as realizations of iid RVs and, in turn, the learnt parameter vector $\widehat{\weights}$ becomes a realization of a RV. Thus, in some sense, the expected prediction loss \eqref{equ_def_expected_pred_loss} characterizes the overall ML method that reads in a training set and delivers (learns) a linear hypothesis with parameter vector $\widehat{\weights}$ \eqref{equ_optimal_weight_closed_form}. In contrast, the risk introduced in Chapter Empirical Risk Minimization characterizes the performance of a specific (fixed) hypothesis $h$ without taking into account the learning process that delivered $h$ based on data.

Let us now relate the expected prediction loss \eqref{equ_def_expected_pred_loss} of the linear hypothesis $h(\featurevec) = \widehat{\weights}^{T} \featurevec$ to the bias and variance of \eqref{equ_optimal_weight_closed_form},

[] \begin{align} \error_{\rm pred} & \stackrel{\eqref{equ_linear_obs_model}}{=} \expect \{ \Delta \weights^{T} \featurevec \featurevec^{T} \Delta \weights \} + \sigma^{2} \nonumber \\ & \stackrel{(a)}{=} \expect \{ \expect \{ \Delta \weights^{T} \featurevec \featurevec^{T} \Delta \weights \mid \trainset \} \} + \sigma^{2} \nonumber \\ & \stackrel{(b)}{=} \expect \{ \Delta \weights^{T} \Delta \weights \} + \sigma^{2} \nonumber \\ & \stackrel{\eqref{equ_def_est_error},\eqref{equ_def_est_err}}{=} \mseesterr + \sigma^{2} \nonumber \\ & \label{equ_decomp_E_pred_toy_model}\stackrel{\eqref{equ_bias_var_decomp}}{=} \biasterm^{2} + \varianceterm + \sigma^{2}. \end{align} []

Here, step (a) uses the law of iterated expectation (see, e.g., [7]). Step (b) uses that the feature vector $\featurevec$ of a “new” data point is a realization of a RV which is statistically independent of the data points in the training set $\trainset$. We also used our assumption that $\featurevec$ is the realization of a RV with zero mean and covariance matrix $\expect \{ \featurevec \featurevec^{T}\}=\mathbf{I}$ (see \eqref{equ_toy_model_iid}).

According to \eqref{equ_decomp_E_pred_toy_model}, the average (expected) prediction error $\error_{\rm pred}$ is the sum of three components: (i) the bias $\biasterm^{2}$, (ii) the variance $\varianceterm$ and (iii) the noise variance $\sigma^{2}$. Figure fig_bias_variance illustrates the typical dependency of the bias and variance on the model \eqref{equ_generalization_hypospace_r}, which is parametrized by the model complexity $\modelidx$. Note that the model complexity parameter $\modelidx$ in \eqref{equ_generalization_hypospace_r} coincides with the effective model dimension $\effdim{\hypospace^{(\modelidx)}}$ (see Section The Model ).

The bias and variance, whose sum is the estimation error $\error_{\rm est}$, can be influenced by varying the model complexity $\modelidx$ which is a design parameter. The noise variance $\sigma^{2}$ is the intrinsic accuracy limit of our toy model \eqref{equ_linear_obs_model} and is not under the control of the ML engineer. It is impossible for any ML method (no matter how computationally expensive) to achieve, on average, a prediction error smaller than the noise variance $\sigma^{2}$. Carefully note that this statement only applies if the data points arising in a ML application can be (reasonably well) modelled as realizations of iid RVs.
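The decomposition \eqref{equ_decomp_E_pred_toy_model} can likewise be verified by simulation: draw many training sets, learn $\widehat{\weights}$ on each, and measure the squared prediction error on fresh data points. All sizes below are arbitrary assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m_t, sigma = 8, 4, 50, 0.5
w_bar = rng.normal(size=n)
bias_sq = float(np.sum(w_bar[r:] ** 2))
var_th = (bias_sq + sigma**2) * r / (m_t - r - 1)
e_pred_th = bias_sq + var_th + sigma**2           # B^2 + V + sigma^2

reps, losses = 1000, []
for k in range(reps):
    X = rng.normal(size=(m_t, n))                 # fresh training set
    y = X @ w_bar + sigma * rng.normal(size=m_t)
    w_hat = np.linalg.lstsq(X[:, :r], y, rcond=None)[0]
    # squared error on fresh data points outside the training set
    X_new = rng.normal(size=(20, n))
    y_new = X_new @ w_bar + sigma * rng.normal(size=20)
    losses.append(np.mean((y_new - X_new[:, :r] @ w_hat) ** 2))

e_pred_mc = float(np.mean(losses))
```

The empirical average `e_pred_mc` should match the theoretical value `e_pred_th` up to Monte Carlo error; note that no method can push it below the noise floor `sigma**2`.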

We highlight that our statistical analysis, resulting in the formulas for the bias \eqref{equ_def_bias_term}, the variance \eqref{equ_formulae_variance_toy_model} and the average prediction error \eqref{equ_decomp_E_pred_toy_model}, applies only if the observed data points can be well modelled using the probabilistic model specified by \eqref{equ_linear_obs_model}, \eqref{equ_toy_model_iid} and \eqref{equ_labels_training_data}. The validity of this probabilistic model can be verified by principled statistical model validation techniques [11][12]. Section The Bootstrap discusses a fundamentally different approach to analyzing the statistical properties of a ML method. Instead of a probabilistic model, this approach uses random sampling techniques to synthesize iid copies of a given (small) dataset. We can approximate the expectation of some relevant quantity, such as the loss $\loss{\big(\featurevec,\truelabel \big) }{h}$, using an average over the synthetic data [4].

The qualitative behaviour of the estimation error in Figure fig_bias_variance depends on the definition of model complexity. Our concept of effective dimension (see Section The Model ) coincides with most other notions of model complexity for the linear hypothesis space \eqref{equ_generalization_hypospace_r}. However, for more complicated models such as deep nets it is often not obvious how the effective dimension is related to more tangible quantities such as the total number of tunable weights or the number of artificial neurons. Indeed, the effective dimension might also depend on the specific learning algorithm, such as stochastic gradient descent (SGD). Therefore, for deep nets, if we plot the estimation error against the number of tunable weights we might observe a behaviour fundamentally different from the shape in Figure fig_bias_variance. One example of such un-intuitive behaviour is known as the “double descent phenomenon” [13].

## The Bootstrap

basic idea of bootstrap: use histogram of dataset as the underlying probability distribution; generate new data points by random sampling (with replacement) from that distribution.

Consider learning a hypothesis $\hat{h} \in \hypospace$ by minimizing the average loss incurred on a dataset $\dataset=\{ \big(\featurevec^{(1)},\truelabel^{(1)}\big),\ldots,\big(\featurevec^{(\samplesize)},\truelabel^{(\samplesize)}\big)\}$. The data points $\big(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}\big)$ are modelled as realizations of iid RVs. Let us denote the (common) probability distribution of these RVs by $p(\featurevec,\truelabel)$.

If we interpret the data points $\big(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}\big)$ as realizations of RVs, also the learnt hypothesis $\hat{h}$ is a realization of a RV. Indeed, the hypothesis $\hat{h}$ is obtained by solving an optimization problem that involves realizations of RVs. The bootstrap is a method for estimating (parameters of) the probability distribution $p(\hat{h})$ [4].

Section A Probabilistic Analysis of Generalization used a probabilistic model for data points to derive (the parameters of) the probability distribution $p(\hat{h})$. Note that the analysis in Section A Probabilistic Analysis of Generalization only applies to the specific probabilistic model \eqref{equ_toy_model_iid}, \eqref{equ_labels_training_data}. In contrast, the bootstrap can be used for data points drawn from an arbitrary probability distribution.

The core idea behind the bootstrap is to use the histogram $\hat{p}(\datapoint)$ of the data points in $\dataset$ to generate $\nrbootstraps$ new datasets $\dataset^{(1)},\ldots,\dataset^{(\nrbootstraps)}$. Each dataset is constructed such that it has the same size as the original dataset $\dataset$. For each dataset $\dataset^{(\bootstrapidx)}$, we solve a separate ERM to obtain the hypothesis $\hat{h}^{(\bootstrapidx)}$. The hypothesis $\hat{h}^{(\bootstrapidx)}$ is a realization of a RV whose distribution is determined by the histogram $\hat{p}(\datapoint)$ as well as the hypothesis space and the loss function used in the ERM.
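A minimal bootstrap sketch for linear regression: sampling with replacement from the dataset amounts to sampling from its histogram, and refitting on each bootstrap dataset yields realizations of the learnt parameter vector. The dataset, model, and number of bootstrap rounds are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 60
X = rng.normal(size=(m, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=m)

B = 200                                       # number of bootstrap datasets
w_boot = np.empty((B, 3))
for b in range(B):
    idx = rng.integers(0, m, size=m)          # sample m points with replacement
    # ERM (least squares) on the bootstrap dataset D^(b)
    w_boot[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

# the spread of the B solutions estimates the fluctuation of the learnt
# hypothesis, i.e., (parameters of) the distribution p(h_hat)
w_std = w_boot.std(axis=0)
```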

## Diagnosing ML

diagnose ML methods by comparing training error with validation error and (if available) some baseline; baseline can be obtained via the Bayes risk when using a probabilistic model (such as the i.i.d. assumption) or human performance or the performance of existing ML methods ("experts" in regret framework)

In what follows, we tacitly assume that data points can (to a good approximation) be interpreted as realizations of iid RVs (see Section Probabilistic Models for Data ). This “i.i.d. assumption” underlies ERM as the guiding principle for learning a hypothesis with small risk. This assumption also motivates using the average loss \eqref{equ_def_training_val_val} on a validation set as an estimate for the risk. More fundamentally, we need the i.i.d. assumption to define the concept of risk as a measure for how well a hypothesis predicts the labels of arbitrary data points.

Consider a ML method which uses Algorithm alg:validated_ERM (or Algorithm alg:kfoldCV_ERM) to learn and validate the hypothesis $\hat{h} \in \hypospace$. Besides the learnt hypothesis $\hat{h}$, these algorithms also deliver the training error $\trainerror$ and the validation error $\valerror$. As we will see shortly, we can diagnose ML methods to some extent just by comparing training with validation errors. This diagnosis is further enabled if we know a baseline $\benchmarkerror$.

One important source of a baseline $\benchmarkerror$ are probabilistic models for the data points (see Section A Probabilistic Analysis of Generalization ). Given a probabilistic model, which specifies the probability distribution $p(\featurevec,\truelabel)$ of the features and label of data points, we can compute the minimum achievable risk. Indeed, the minimum achievable risk is precisely the expected loss of the Bayes estimator $\hat{h}(\featurevec)$ of the label $\truelabel$, given the features $\featurevec$ of a data point. The Bayes estimator $\hat{h}(\featurevec)$ is fully determined by the probability distribution $p(\featurevec,\truelabel)$ of the features and label of a (random) data point [14](Chapter 4).

A further potential source for a baseline $\benchmarkerror$ is an existing, but for some reason unsuitable, ML method. This existing ML method might be computationally too expensive to be used for the ML application at hand. However, we might still use its statistical properties as a benchmark. We might also use the performance of human experts as a baseline. If we want to develop a ML method that detects certain types of skin cancer from images of the skin, a benchmark might be the current classification accuracy achieved by experienced dermatologists [15].

We can diagnose a ML method by comparing the training error $\trainerror$ with the validation error $\valerror$ and (if available) the benchmark $\benchmarkerror$.

• $\trainerror \approx \valerror \approx \benchmarkerror$: The training error is on the same level as the validation error and the benchmark error. There is not much to improve here since the validation error is already on the desired error level. Moreover, the training error is not much smaller than the validation error which indicates that there is no overfitting. It seems we have obtained a ML method that achieves the benchmark error level.
• $\valerror \gg \trainerror$: The validation error is significantly larger than the training error. It seems that the ERM results in a hypothesis $\hat{h}$ that overfits the training set. The loss incurred by $\hat{h}$ on data points outside the training set, such as those in the validation set, is significantly worse. This is an indicator for overfitting which can be addressed either by reducing the effective dimension of the hypothesis space or by increasing the size of the training set. To reduce the effective dimension of the hypothesis space we have different options depending on the used model. We might use a smaller number of features in a linear model, a smaller maximum depth of decision trees (Section Decision Trees ) or fewer layers in an ANN (Section Deep Learning ). One very elegant means for reducing the effective dimension of a hypothesis space is to limit the number of gradient descent (GD) steps used in gradient-based methods. This optimization-based shrinking of a hypothesis space is referred to as early stopping. More generally, we can reduce the effective dimension of a hypothesis space via regularization techniques (see Chapter Regularization ).
• $\trainerror \approx \valerror \gg \benchmarkerror$: The training error is on the same level as the validation error and both are significantly larger than the baseline. Since the training error is not much smaller than the validation error, the learnt hypothesis seems to not overfit the training set. However, the training error achieved by the learnt hypothesis is significantly larger than the benchmark error level. There can be several reasons for this to happen. First, it might be that the hypothesis space used by the ML method is too small, i.e., it does not include a hypothesis that provides a good approximation for the relation between features and label of a data point. The remedy for this situation is to use a larger hypothesis space, e.g., by including more features in a linear model, using higher polynomial degrees in polynomial regression, using deeper decision trees or using larger (deeper) ANNs. Another reason for the training error being too large is that the optimization algorithm used to solve ERM is not working properly. When using gradient-based methods (see Section GD for Linear Regression ) to solve ERM, one reason for $\trainerror \gg \benchmarkerror$ could be that the learning rate $\lrate$ in the GD step is chosen too small or too large (see Figure fig_small_large_lrate-(b)). This can be solved by trying out several different values for the learning rate and using the one resulting in the smallest training error. Another option is to derive optimal values for the learning rate based on a probabilistic model for how the data points are generated. One example for such a probabilistic model is the i.i.d. assumption that has been used in Section A Probabilistic Analysis of Generalization to analyze linear regression methods.
• $\trainerror \gg \valerror$: The training error is significantly larger than the validation error (see Exercise). The idea of ERM is to approximate the risk of a hypothesis by its average loss on a training set $\dataset = \{ (\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$. The mathematical underpinning for this approximation is the law of large numbers which characterizes the average of (realizations of) iid RVs. The quality and usefulness of this approximation depends on the validity of two conditions. First, the data points used for computing the average loss should be such that they would typically be obtained as realizations of iid RVs with a common probability distribution. Second, the number of data points used for computing the average loss must be sufficiently large. Whenever the data points behave differently than realizations of iid RVs, or if the size of the training set or validation set is too small, the interpretation (comparison) of training and validation errors becomes more difficult. As an extreme case, it might then be that the validation set consists of data points for which every hypothesis incurs small average loss. Here, we might try to increase the size of the validation set by collecting more labeled data points or by using data augmentation (see Section Data Augmentation ). If the sizes of training set and validation set are large but we still obtain $\trainerror \gg \valerror$, one should verify if the data points in these sets conform to the i.i.d. assumption. There are principled statistical methods that allow to test if an i.i.d. assumption is satisfied (see [9] and references therein).
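The four diagnostic cases above can be collected into a small decision rule. The function name `diagnose`, the tolerance `gap_tol` for calling two error levels "approximately equal", and the returned messages are illustrative assumptions; in practice the thresholds must be chosen per application:

```python
def diagnose(train_err, val_err, benchmark_err, gap_tol=0.1):
    """Map (training error, validation error, benchmark) to a rough diagnosis.

    gap_tol is a hypothetical relative tolerance for deciding when two
    error levels count as "approximately equal".
    """
    close = lambda a, b: abs(a - b) <= gap_tol * max(abs(a), abs(b), 1e-12)
    if val_err > train_err and not close(val_err, train_err):
        # E_v >> E_t: overfitting
        return "overfitting: shrink the model or enlarge the training set"
    if train_err > val_err and not close(train_err, val_err):
        # E_t >> E_v: data may violate the i.i.d. assumption
        return "check i.i.d. assumption and sizes of training/validation sets"
    if close(train_err, benchmark_err):
        # E_t ~ E_v ~ benchmark: nothing much to improve
        return "on benchmark level: nothing much to improve"
    # E_t ~ E_v >> benchmark: model too small or optimizer not working
    return "underfitting or optimizer issue: enlarge model or tune learning rate"

print(diagnose(0.10, 0.11, 0.10))  # → on benchmark level: nothing much to improve
```

The ordering of the checks mirrors the bullet list: the two gap-based cases are tested first, and only when training and validation errors agree is their level compared against the benchmark.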

## Notes

1. This derivation is not very difficult but rather lengthy. For more details about the derivation of \eqref{equ_variance_term_toy_model} we refer to the literature [7][9].

## General References

Jung, Alexander (2022). Machine Learning: The Basics. Singapore: Springer. doi:10.1007/978-981-16-8193-6.

Jung, Alexander (2022). "Machine Learning: The Basics". arXiv:1805.05052.

## References

1. R. Muirhead. Aspects of Multivariate Statistical Theory. John Wiley \& Sons Inc., 1982.
2. J. Larsen and C. Goutte. On optimal data split for generalization estimation and model selection. In IEEE Workshop on Neural Networks for Signal Processing, 1999.
3. B. Efron and R. Tibshirani. Improvements on cross-validation: The 632+ bootstrap method. Journal of the American Statistical Association, 92(438):548--560, 1997.
4. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
5. I. Cohen and B. Berdugo. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Sig. Proc. Lett., 9(1):12--15, Jan. 2002.
6. P. Huber. Approximate models. In C. Huber-Carol, N. Balakrishnan, M. Nikulin, and M. Mesbah, editors, Goodness-of-Fit Tests and Model Validity. Statistics for Industry and Technology. Birkhäuser, Boston, MA, 2002.
7. D. Bertsekas and J. Tsitsiklis. Introduction to Probability. Athena Scientific, 2nd edition, 2008.
8. R. G. Gallager. Stochastic Processes: Theory for Applications. Cambridge University Press, 2013.
9. H. Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer, New York, 2005.
10. K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, 1979.
11. K. Young. Bayesian diagnostics for checking assumptions of normality. Journal of Statistical Computation and Simulation, 47(3--4):167--180, 1993.
12. O. Vasicek. A test for normality based on sample entropy. Journal of the Royal Statistical Society, Series B (Methodological), 38(1):54--59, 1976.
13. M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849--15854, 2019.
14. E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, 2nd edition, 1998.
15. A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 2017.