# Regularization

Many ML methods use the principle of empirical risk minimization (ERM) (see Chapter Empirical Risk Minimization) to learn a hypothesis out of a hypothesis space by minimizing the average loss (training error) on a set of labeled data points (which constitute a training set). Using ERM as a guiding principle for ML methods makes sense only if the training error of a hypothesis is a good indicator for the loss it incurs outside the training set.

Figure fig_regular illustrates a typical scenario for a modern ML method which uses a large hypothesis space. This large hypothesis space includes highly non-linear maps which can perfectly fit any dataset of modest size. However, for such highly non-linear maps, a small training error does not guarantee accurate predictions for the labels of data points outside the training set.

Chapter Model Validation and Selection discussed validation techniques to verify if a hypothesis with small training error will also accurately predict the labels of data points outside the training set. These validation techniques, including Algorithm alg:validated_ERM and Algorithm alg:kfoldCV_ERM, probe the hypothesis $\hat{h} \in \hypospace$ delivered by ERM on a validation set. The validation set consists of data points which have not been used in the training set of ERM. The validation error, which is the average loss of the hypothesis on the data points in the validation set, serves as an estimate for the average error or risk of the hypothesis $\hat{h}$.

This chapter discusses regularization as an alternative to validation techniques. In contrast to validation, regularization techniques do not require a separate validation set which is not used for the ERM. This makes regularization attractive for applications where obtaining a separate validation set is difficult or costly (where labeled data is scarce).

Instead of probing a hypothesis $\hat{h}$ on a validation set, regularization techniques estimate (or approximate) the loss increase when applying $\hat{h}$ to data points outside the training set. The loss increase is estimated by adding a regularization term to the training error in ERM.

Section Structural Risk Minimization discusses the resulting regularized ERM, which we will refer to as structural risk minimization (SRM). It turns out that the SRM is equivalent to ERM using a smaller (pruned) hypothesis space. The amount of pruning depends on the weight of the regularization term relative to the training error. For an increasing weight of the regularization term, we obtain a stronger pruning resulting in a smaller effective hypothesis space.

Section Robustness constructs regularization terms by requiring the resulting ML method to be robust against (small) random perturbations of the data points in a training set. Here, we replace each data point of a training set by the realization of a random variable (RV) that fluctuates around this data point. This construction allows us to interpret regularization as an (implicit) form of data augmentation.

Section Data Augmentation discusses data augmentation methods as a simulation-based implementation of regularization. Data augmentation adds a certain number of perturbed copies to each data point in the training set. One way to construct perturbed copies of a data point is to add the realization of a RV to its features.

Section Statistical and Computational Aspects of Regularization analyzes the effect of regularization for linear regression using a simple probabilistic model for data points. This analysis parallels our previous study of the validation error of linear regression in Section A Probabilistic Analysis of Generalization. Similar to Section A Probabilistic Analysis of Generalization, we reveal a trade-off between the bias and variance of the hypothesis learnt by regularized linear regression. This trade-off was traced out by a discrete model parameter (the effective dimension) in Section A Probabilistic Analysis of Generalization. In contrast, regularization offers a continuous trade-off between bias and variance via a continuous regularization parameter.

Semi-supervised learning (SSL) uses (large amounts of) unlabeled data points to support the learning of a hypothesis from (a small number of) labeled data points [1]. Section Semi-Supervised Learning discusses SSL methods that use the statistical properties of unlabeled data points to construct useful regularization terms. These regularization terms are then used in SRM with a (typically small) set of labeled data points.

Section Transfer Learning shows how regularization can be used for transfer learning. Like multitask learning, transfer learning also exploits relations between different learning tasks. In contrast to multitask learning, which jointly solves the individual learning tasks, transfer learning solves the learning tasks sequentially. The most basic form of transfer learning is to fine-tune a pre-trained model. A pre-trained model can be obtained via ERM in a (“source”) learning task for which we have a large amount of labeled training data. The fine-tuning is then obtained via ERM in the (“target”) learning task of interest for which we might have only a small amount of labeled training data.

## Structural Risk Minimization

Section The Model defined the effective dimension $\effdim{\hypospace}$ of a hypothesis space $\hypospace$ as the maximum number of data points that can be perfectly fit by some hypothesis $h \in \hypospace$. As soon as the effective dimension of the hypothesis space in equ_def_ERM_funs exceeds the number $\samplesize$ of training data points, we can find a hypothesis that perfectly fits the training data. However, a hypothesis that perfectly fits the training data might deliver poor predictions for data points outside the training set (see Figure fig_regular).

Modern ML methods typically use a hypothesis space with large effective dimension [3][4]. Two well-known examples of such methods are linear regression (see Section Linear Regression) using a large number of features and deep learning with artificial neural networks (ANNs) using a large number (billions) of artificial neurons (see Section Deep Learning). The effective dimension of these methods can easily be on the order of billions ($10^{9}$) if not larger [5]. To avoid overfitting during the naive use of ERM we would require a training set containing at least as many data points as the effective dimension of the hypothesis space. However, in practice we often do not have access to a training set consisting of billions of labeled data points. The challenge is typically in the labelling process which often requires human labour.

It seems natural to combat overfitting of a ML method by pruning its hypothesis space $\hypospace$. We prune $\hypospace$ by removing some of the hypotheses in $\hypospace$ to obtain the smaller hypothesis space $\hypospace' \subset \hypospace$. We then replace ERM with the restricted (or pruned) ERM

[$] $$\label{equ_ERM_fun_pruned} \hat{h} = \argmin_{h \in \hypospace'} \emperror(h|\dataset) \mbox{ with pruned hypothesis space } \hypospace' \!\subset\!\hypospace.$$ [$]

The effective dimension of the pruned hypothesis space $\hypospace'$ is typically much smaller than the effective dimension of the original (large) hypothesis space $\hypospace$, $\effdim{\hypospace'} \ll \effdim{\hypospace}$. For a given size $\samplesize$ of the training set, the risk of overfitting in \eqref{equ_ERM_fun_pruned} is much smaller than the risk of overfitting in equ_def_ERM_funs .

Let us illustrate the idea of pruning for linear regression using the hypothesis space equ_lin_hypospace constituted by linear maps $h(\featurevec) = \weights^{T} \featurevec$. The effective dimension of equ_lin_hypospace is equal to the number of features, $\effdim{\hypospace} = \featurelen$. The hypothesis space $\hypospace$ might be too large if we use a large number $\featurelen$ of features, leading to overfitting. We prune equ_lin_hypospace by retaining only linear hypotheses $h(\featurevec) = \big(\weights'\big)^T \featurevec$ with parameter vectors $\weights'$ satisfying $\weight'_{3} = \weight'_{4} = \ldots = \weight'_{\featurelen}=0$. Thus, the hypothesis space $\hypospace'$ is constituted by all linear maps that only depend on the first two features $\feature_{1},\feature_{2}$ of a data point. The effective dimension of $\hypospace'$ is $\effdim{\hypospace'}=2$ instead of $\effdim{\hypospace}=\featurelen$.
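The pruning of the linear hypothesis space described above can be sketched numerically. The following is a minimal illustration, assuming NumPy and synthetic Gaussian data; the sizes `m`, `n`, the vector `w_true` and the helper `training_error` are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 10                        # m data points, n features (illustrative sizes)
X = rng.normal(size=(m, n))          # synthetic feature matrix, one row per data point
w_true = np.zeros(n)
w_true[:2] = [1.0, -2.0]             # assumption: labels depend only on the first two features
y = X @ w_true + 0.1 * rng.normal(size=m)

def training_error(w):
    """Average squared error loss of the linear map h(x) = w^T x on the training set."""
    return float(np.mean((y - X @ w) ** 2))

# ERM over the full linear hypothesis space: all n weights are free.
w_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# ERM over the pruned hypothesis space: weights 3,...,n are forced to zero,
# which amounts to a least-squares fit on the first two features only.
w_pruned = np.zeros(n)
w_pruned[:2] = np.linalg.lstsq(X[:, :2], y, rcond=None)[0]

print("training error, full space  :", training_error(w_full))
print("training error, pruned space:", training_error(w_pruned))
```

The pruned ERM can never achieve a smaller training error than ERM over the full space, since it searches over a subset of the same hypotheses. The hope behind pruning is that the resulting hypothesis generalizes better.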

Pruning the hypothesis space is a special case of a more general strategy which we refer to as SRM [6]. The idea behind SRM is to modify the training error in ERM to favour hypotheses which are more smooth or regular in a specific sense. By enforcing a smooth hypothesis, a ML method becomes less sensitive, or more robust, to small perturbations of data points in the training set. Section Robustness discusses the intimate relation between the robustness (against perturbations of the data points in the training set) of a ML method and its ability to generalize to data points outside the training set.

We measure the smoothness of a hypothesis using a regularizer $\regularizer(h) \in \mathbb{R}_{+}$. Roughly speaking, the value $\regularizer(h)$ measures the irregularity or variation of a predictor map $h$. The (design) choice for the regularizer depends on the precise definition of what is meant by regularity or variation of a hypothesis. Section Data Augmentation discusses how a particular choice for the regularizer $\regularizer(h)$ arises naturally from a probabilistic model for data points.

We obtain SRM by adding the scaled regularizer $\regparam \regularizer(h)$ to the training error in ERM,

[] \begin{align} \hat{h} & = \argmin_{h \in \hypospace} \big[ \emperror(h|\dataset) + \regparam \regularizer(h) \big] \nonumber \\ & \label{equ_ERM_fun_regularized}\stackrel{\eqref{eq_def_emp_error_101}}{=} \argmin_{h \in \hypospace} \big[(1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{h}+ \regparam \regularizer(h)\big]. \end{align} []

We can interpret the penalty term $\regparam \regularizer(h)$ in \eqref{equ_ERM_fun_regularized} as an estimate (or approximation) for the increase, relative to the training error on $\dataset$, of the average loss of a hypothesis $\hat{h}$ when it is applied to data points outside $\dataset$. Another interpretation of the term $\regparam \regularizer(h)$ will be discussed in Section Data Augmentation .

The regularization parameter $\regparam$ allows us to trade between a small training error $\emperror(h|\dataset)$ and a small regularization term $\regularizer(h)$, which enforces smoothness or regularity of $h$. If we choose a large value for $\regparam$, irregular hypotheses $h$, with large $\regularizer(h)$, are heavily “punished” in \eqref{equ_ERM_fun_regularized}. Thus, increasing the value of $\regparam$ results in the solution (minimizer) of \eqref{equ_ERM_fun_regularized} having smaller $\regularizer(h)$. On the other hand, choosing a small value for $\regparam$ in \eqref{equ_ERM_fun_regularized} puts more emphasis on obtaining a hypothesis $h$ incurring a small training error. For the extreme case $\regparam =0$, the SRM \eqref{equ_ERM_fun_regularized} reduces to ERM.

The pruning approach \eqref{equ_ERM_fun_pruned} is intimately related to the SRM \eqref{equ_ERM_fun_regularized}. They are, in a certain sense, dual to each other. First, note that \eqref{equ_ERM_fun_regularized} reduces to the pruning approach \eqref{equ_ERM_fun_pruned} when using the regularizer $\regularizer(h) = 0$ for all $h \in \hypospace'$ , and $\regularizer(h) = \infty$ otherwise, in \eqref{equ_ERM_fun_regularized}. In the other direction, for many important choices for the regularizer $\regularizer(h)$, there is a restriction $\hypospace^{(\regparam)} \subset \hypospace$ such that the solutions of \eqref{equ_ERM_fun_pruned} and \eqref{equ_ERM_fun_regularized} coincide (see Figure fig_soft_pruning_regularization). The relation between the optimization problems \eqref{equ_ERM_fun_pruned} and \eqref{equ_ERM_fun_regularized} can be made precise using the theory of convex duality (see [7](Ch. 5) and [8]).

For a hypothesis space $\hypospace$ whose elements $h \in \hypospace$ are parametrized by a parameter vector $\weights \in \mathbb{R}^{\featuredim}$, we can rewrite SRM \eqref{equ_ERM_fun_regularized} as

[] \begin{align} \widehat{\weights}^{(\regparam)} & = \argmin_{\weights \in \mathbb{R}^{\featurelen}} \big[ \emperror(h^{(\weights)}|\dataset)+ \regparam \regularizer(\weights)\big] \nonumber \\ & = \label{equ_rerm_weight}\argmin_{\weights \in \mathbb{R}^{\featurelen}} \big[(1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \loss{(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)})}{h^{(\weights)}} + \regparam \regularizer(\weights) \big]. \end{align} []

For the particular choice of squared error loss, linear hypothesis space and regularizer $\regularizer(\weights)=\| \weights \|_{2}^{2}$, SRM \eqref{equ_rerm_weight} specializes to

[] \begin{align} \label{equ_rerm_ridge_regression} \widehat{\weights}^{(\regparam)} & = \argmin_{\weights \in \mathbb{R}^{\featurelen}} \big[(1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big( \truelabel^{(\sampleidx)} - \weights^{T} \featurevec^{(\sampleidx)}\big)^{2} + \regparam \| \weights \|_{2}^{2}\big]. \end{align} []

The special case \eqref{equ_rerm_ridge_regression} of SRM \eqref{equ_rerm_weight} is known as ridge regression [9]. Ridge regression \eqref{equ_rerm_ridge_regression} is equivalent to (see [8](Ch. 5))

[$] $$\label{equ_restr_ERM} \widehat{\weights}^{(\regparam)} = \argmin_{h^{(\weights)} \in \hypospace^{(\regparam)}} (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big(\truelabel^{(\sampleidx)} - h^{(\weights)}(\featurevec^{(\sampleidx)}) \big)^2$$ [$]

with the restricted hypothesis space

[] \begin{align} \label{equ_hyposapce_lambda} \hypospace^{(\regparam)} & \defeq \{ h^{(\weights)}: \mathbb{R}^{\featuredim} \rightarrow \mathbb{R}: h^{(\vw)}(\featurevec) = \weights^{T} \featurevec \mbox{, with some } \weights \in \mathbb{R}^{\featuredim}, \| \weights \|_{2}^{2} \leq C(\regparam) \} \subset \hypospace^{(\featuredim)}. \end{align} []

For any given value $\regparam$ of the regularization parameter in \eqref{equ_rerm_ridge_regression}, there is a number $C(\regparam)$ such that solutions of \eqref{equ_rerm_ridge_regression} coincide with the solutions of \eqref{equ_restr_ERM}. Thus, ridge regression \eqref{equ_rerm_ridge_regression} is equivalent to linear regression with a pruned version $\hypospace^{(\regparam)}$ of the linear hypothesis space. The size of the pruned hypothesis space $\hypospace^{(\regparam)}$ \eqref{equ_hyposapce_lambda} varies continuously with $\regparam$.
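The shrinking of the pruned hypothesis space with a growing regularization parameter can be observed numerically. The sketch below assumes NumPy and synthetic Gaussian data; the helper name `ridge_via_lstsq` and the grid `lams` are hypothetical. It solves ridge regression \eqref{equ_rerm_ridge_regression} by rewriting the objective as an ordinary least-squares problem with stacked rows, and it reports the squared norm of each solution, which is one admissible value for the radius $C(\regparam)$ in \eqref{equ_hyposapce_lambda}.

```python
import numpy as np

def ridge_via_lstsq(X, y, lam):
    """Minimize (1/m)*||y - X w||^2 + lam*||w||^2 via an equivalent least-squares problem."""
    m, n = X.shape
    # Stacking sqrt(lam)*I below X/sqrt(m) turns the penalty into extra squared residuals.
    X_stacked = np.vstack([X / np.sqrt(m), np.sqrt(lam) * np.eye(n)])
    y_stacked = np.concatenate([y / np.sqrt(m), np.zeros(n)])
    w, *_ = np.linalg.lstsq(X_stacked, y_stacked, rcond=None)
    return w

rng = np.random.default_rng(1)
m, n = 30, 5                                   # illustrative sizes
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.5 * rng.normal(size=m)

lams = [0.01, 0.1, 1.0, 10.0]                  # increasing regularization parameters
sq_norms = [float(np.sum(ridge_via_lstsq(X, y, lam) ** 2)) for lam in lams]
for lam, c in zip(lams, sq_norms):
    print(f"lambda = {lam:5.2f}   squared norm of ridge solution = {c:.4f}")
```

The printed squared norms decrease as the regularization parameter grows, which corresponds to a smaller ball of admissible parameter vectors in \eqref{equ_hyposapce_lambda}.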

Another popular special case of SRM \eqref{equ_rerm_weight} is obtained for the regularizer $\regularizer(\weights)=\| \weights \|_{1}$ and is known as the Lasso [10],

[] \begin{align} \label{equ_rerm_Lasso} \widehat{\weights}^{(\regparam)} & = \argmin_{\weights \in \mathbb{R}^{\featurelen}} \big[(1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big( \truelabel^{(\sampleidx)} - \weights^{T} \featurevec^{(\sampleidx)}\big)^{2} + \regparam \| \weights \|_{1}\big]. \end{align} []

Ridge regression \eqref{equ_rerm_ridge_regression} and the Lasso \eqref{equ_rerm_Lasso} have fundamentally different computational and statistical properties. Ridge regression \eqref{equ_rerm_ridge_regression} uses a smooth and convex objective function that can be minimized using efficient gradient descent (GD) methods. The objective function of the Lasso \eqref{equ_rerm_Lasso} is also convex but non-smooth and therefore requires more advanced optimization methods. In return for this increased computational complexity, the Lasso \eqref{equ_rerm_Lasso} often delivers a hypothesis with a smaller expected loss than ridge regression, in particular when only a few features are relevant for predicting the label [4][10].
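One widely used way to handle the non-smooth term in the Lasso \eqref{equ_rerm_Lasso} is proximal gradient descent, also known as iterative soft-thresholding. The text above does not prescribe a particular solver, so the sketch below is only one possible choice. It assumes NumPy and synthetic data; the function names `soft_threshold`, `lasso_ista` and `lasso_objective` as well as the iteration count are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Entry-wise soft-thresholding: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(X, y, w, lam):
    """Lasso objective (1/m)*||y - X w||^2 + lam*||w||_1."""
    return float(np.mean((y - X @ w) ** 2) + lam * np.sum(np.abs(w)))

def lasso_ista(X, y, lam, num_iters=500):
    """Proximal gradient descent for the Lasso, started at w = 0."""
    m, n = X.shape
    # Lipschitz constant of the gradient of the squared-error term.
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2 / m
    step = 1.0 / lipschitz
    w = np.zeros(n)
    for _ in range(num_iters):
        grad = 2.0 * X.T @ (X @ w - y) / m                 # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)    # proximal step for the l1 term
    return w

rng = np.random.default_rng(2)
m, n = 40, 10                                  # illustrative sizes
X = rng.normal(size=(m, n))
w_sparse = np.zeros(n)
w_sparse[:2] = [2.0, -1.0]                     # assumption: only two relevant features
y = X @ w_sparse + 0.1 * rng.normal(size=m)

w_lasso = lasso_ista(X, y, lam=0.1)
print("learnt weights:", np.round(w_lasso, 3))
```

The soft-thresholding step sets small weights exactly to zero, which is why the Lasso tends to deliver sparse parameter vectors.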

## Robustness

Section Structural Risk Minimization motivates regularization as a soft variant of model selection. Indeed, adding the regularization term in SRM \eqref{equ_ERM_fun_regularized} is equivalent to ERM \eqref{equ_ERM_fun_pruned} using a pruned (reduced) hypothesis space. We now discuss an alternative view on regularization as a means to make ML methods robust.

The ML methods discussed in Chapter Empirical Risk Minimization rest on the idealizing assumption that we have access to the true label values and feature values of labeled data points (that form a training set). These methods learn a hypothesis $h \in \hypospace$ with minimum average loss (training error) incurred for data points in the training set. In practice, the acquisition of label and feature values might be prone to errors. These errors might stem from the measurement device itself (hardware failures or thermal noise in electronic devices) or might be due to human mistakes such as labelling errors.

Let us assume for the sake of exposition that the label values $\truelabel^{(\sampleidx)}$ in the training set are accurate but that the features $\featurevec^{(\sampleidx)}$ are a perturbed version of the true features of the $\sampleidx$th data point. Thus, instead of having observed the data point $\big( \featurevec^{(\sampleidx)}, \truelabel^{(\sampleidx)} \big)$ we could have equally well observed the data point $\big( \featurevec^{(\sampleidx)}+\bm{\varepsilon}, \truelabel^{(\sampleidx)} \big)$ in the training set. Here, we have modelled the perturbations in the features using a RV $\bm{\varepsilon}$. The probability distribution of the perturbation $\bm{\varepsilon}$ is a design parameter that controls robustness properties of the overall ML method. We will study a particular choice for this distribution in Section Data Augmentation .

A robust ML method should learn a hypothesis that incurs a small loss not only for a specific data point $\big( \featurevec^{(\sampleidx)}, y^{(\sampleidx)} \big)$ but also for perturbed data points $\big( \featurevec^{(\sampleidx)}+\bm{\varepsilon}, y^{(\sampleidx)} \big)$. Therefore, it seems natural to replace the loss $\loss{\big( \featurevec^{(\sampleidx)}, y^{(\sampleidx)} \big)}{h}$, incurred on the $\sampleidx$th data point in the training set, with the expectation

[$] $$\label{equ_def_expe_perturb_robust} \expect \big\{ \loss{ \big( \featurevec^{(\sampleidx)}+\bm{\varepsilon}, y^{(\sampleidx)} \big)}{h}\big\}.$$ [$]

The expectation \eqref{equ_def_expe_perturb_robust} is computed using the probability distribution of the perturbation $\bm{\varepsilon}$. We will show in Section Data Augmentation that minimizing the average of the expectation \eqref{equ_def_expe_perturb_robust}, for $\sampleidx=1,\ldots,\samplesize$, is equivalent to the SRM \eqref{equ_ERM_fun_regularized}.

Using the expected loss \eqref{equ_def_expe_perturb_robust} is not the only possible approach to make a ML method robust. Another approach is known as bootstrap aggregation (bagging). The idea of bagging is to use the bootstrap method (see Section The Bootstrap and [9](Ch. 8)) to construct a finite number of perturbed copies $\dataset^{(1)},\ldots,\dataset^{(\augparam)}$ of the original training set $\dataset$.

We then learn (e.g., using ERM) a separate hypothesis $h^{(\augidx)}$ for each perturbed copy $\dataset^{(\augidx)}$, $\augidx = 1,\ldots, \augparam$. This results in a whole ensemble of different hypotheses $h^{(\augidx)}$ which might even belong to different hypothesis spaces. For example, the hypothesis $h^{(1)}$ could be a linear map (see Section Linear Regression) and the hypothesis $h^{(2)}$ could be obtained from an ANN (see Section Deep Learning).

The final hypothesis delivered by bagging is obtained by combining or aggregating (e.g., using the average) the predictions $h^{(\augidx)}\big(\featurevec\big)$ delivered by each hypothesis $h^{(\augidx)}$, for $\augidx=1,\ldots,\augparam$ in the ensemble. The ML method referred to as random forest uses bagging to learn an ensemble of decision trees (see Chapter Decision Trees ). The individual predictions obtained from the different decision trees forming a random forest are then combined (e.g., using an average for numeric labels or a majority vote for finite-valued labels), to obtain a final prediction [9].
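The bagging procedure described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy, synthetic data, linear hypotheses learnt by least squares for every ensemble member, and averaging as the aggregation rule; the names `fit_linear`, `ensemble` and `bagged_predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, B = 40, 3, 25                            # data points, features, ensemble size
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, 0.5, -1.0]) + 0.3 * rng.normal(size=m)

def fit_linear(X_train, y_train):
    """ERM for linear regression with squared error loss (least squares)."""
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return w

# Learn one hypothesis per bootstrap copy of the training set.
ensemble = []
for _ in range(B):
    idx = rng.integers(0, m, size=m)           # draw m indices with replacement
    ensemble.append(fit_linear(X[idx], y[idx]))

def bagged_predict(x):
    """Aggregate the individual predictions by averaging."""
    return float(np.mean([w @ x for w in ensemble]))

x_new = np.ones(n)                             # hypothetical new feature vector
print("bagged prediction:", bagged_predict(x_new))
```

For finite-valued labels, the averaging in `bagged_predict` would be replaced by a majority vote over the individual predictions.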

## Data Augmentation

ML methods using ERM are prone to overfitting as soon as the effective dimension of the hypothesis space $\hypospace$ exceeds the number $\samplesize$ of data points in the training set. Section Model Selection and Section Structural Risk Minimization approached this problem by either modifying the model or adding a regularization term to the loss function. Both approaches prune the hypothesis space $\hypospace$ underlying a ML method to reduce the effective dimension $\effdim{\hypospace}$. Model selection does this reduction in a discrete fashion while regularization implements a soft “shrinking” of the hypothesis space.

Instead of trying to reduce the effective dimension we could also try to increase the number $\samplesize$ of data points in the training set used for ERM.

We now discuss how to synthetically generate new labeled data points by exploiting statistical symmetries of data.

The data arising in many ML applications exhibit intrinsic symmetries and invariances at least in some approximation. The rotated image of a cat still shows a cat. The temperature measurement taken at a given location will be similar to another measurement taken $10$ milliseconds later. Data augmentation exploits such symmetries and invariances to augment the raw data with additional synthetic data.

Let us illustrate data augmentation using an application that involves data points characterized by features $\featurevec \in \mathbb{R}^{\featuredim}$ and numeric labels $y \in \mathbb{R}$. We assume that the data generating process is such that data points with close feature values have the same label. Equivalently, this assumption requires the resulting ML method to be robust against small perturbations of the feature values (see Section Robustness). This suggests augmenting a data point $\big(\featurevec,\truelabel\big)$ by several synthetic data points

[$] $$\label{equ_def_copies_aug} \big(\featurevec+{\bm \varepsilon}^{(1)},\truelabel\big),\ldots,\big(\featurevec+{\bm \varepsilon}^{(\augparam)},\truelabel\big),$$ [$]

with ${\bm \varepsilon}^{(1)},\ldots,{\bm \varepsilon}^{(\augparam)}$ being realizations of independent and identically distributed (iid) random vectors with the same probability distribution $p({\bm \varepsilon})$.

Given a (raw) dataset $\dataset = \big\{ \big(\featurevec^{(1)},\truelabel^{(1)}\big),\ldots, \big(\featurevec^{(\samplesize)},\truelabel^{(\samplesize)}\big) \}$ we denote the associated augmented dataset by

[] \begin{align} \label{equ_def_augmented_dataset} \dataset' = \big\{ &\big(\featurevec^{(1,1)},y^{(1)}\big), \ldots, \big(\featurevec^{(1,\augparam)},\truelabel^{(1)}\big), \nonumber \\ &\big(\featurevec^{(2,1)},\truelabel^{(2)}\big), \ldots, \big(\featurevec^{(2,\augparam)},\truelabel^{(2)}\big), \nonumber \\ & \ldots \nonumber \\ &\big(\featurevec^{(\samplesize,1)},y^{(\samplesize)}\big), \ldots, \big(\featurevec^{(\samplesize,\augparam)},\truelabel^{(\samplesize)}\big) \}. \end{align} []

The size of the augmented dataset $\dataset'$ is $\samplesize' = \augparam \times \samplesize$. For a sufficiently large augmentation parameter $\augparam$, the augmented sample size $\samplesize'$ is larger than the effective dimension $\featurelen$ of the hypothesis space $\hypospace$. We then learn a hypothesis via ERM on the augmented dataset,

[] \begin{align} \hat{h} & = \argmin_{h \in \hypospace} \emperror(h|\dataset') \nonumber \\ & \stackrel{\eqref{equ_def_augmented_dataset}}{=} \argmin_{h \in \hypospace} (1/\samplesize') \sum_{\sampleidx=1}^{\samplesize} \sum_{\augidx=1}^{\augparam} \loss{(\featurevec^{(\sampleidx,\augidx)},y^{(\sampleidx)})}{h} \nonumber \\ & \label{equ_def_ERM_funs_aug}\stackrel{\eqref{equ_def_copies_aug}}{=} \argmin_{h \in \hypospace} (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} (1/\augparam)\sum_{\augidx=1}^{\augparam} \loss{(\featurevec^{(\sampleidx)}+{\bm \varepsilon}^{(\augidx)},y^{(\sampleidx)})}{h}. \end{align} []

We can interpret data-augmented ERM \eqref{equ_def_ERM_funs_aug} as a data-driven form of regularization (see Section Structural Risk Minimization). The regularization is implemented by replacing, for each data point $\big(\featurevec^{(\sampleidx)},y^{(\sampleidx)}\big) \in \dataset$, the loss $\loss{(\featurevec^{(\sampleidx)},y^{(\sampleidx)})}{h}$ with the average loss $(1/\augparam)\sum_{\augidx=1}^{\augparam} \loss{(\featurevec^{(\sampleidx)}+{\bm \varepsilon}^{(\augidx)},y^{(\sampleidx)})}{h}$ over the augmented data points that accompany $\big(\featurevec^{(\sampleidx)},y^{(\sampleidx)}\big) \in \dataset$.
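The construction of the augmented dataset \eqref{equ_def_augmented_dataset} and the subsequent ERM \eqref{equ_def_ERM_funs_aug} can be sketched as follows. This is a minimal illustration, assuming NumPy, synthetic data, zero-mean Gaussian perturbations drawn independently for every data point, a linear hypothesis space and squared error loss; the sizes and the perturbation scale `sigma` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, B, sigma = 5, 20, 50, 0.5                # fewer data points (m) than features (n)
X = rng.normal(size=(m, n))                    # raw feature vectors, one per row
y = rng.normal(size=m)                         # raw labels

# B perturbed copies of every feature vector; the labels are copied unchanged.
eps = sigma * rng.normal(size=(m, B, n))
X_aug = (X[:, None, :] + eps).reshape(m * B, n)   # rows i*B, ..., i*B+B-1 belong to data point i
y_aug = np.repeat(y, B)                           # label of data point i repeated B times

# ERM with squared error loss on the augmented dataset (least squares).
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print("raw dataset size      :", X.shape[0])
print("augmented dataset size:", X_aug.shape[0])
```

With these sizes the raw feature matrix has fewer rows than columns, so plain linear regression could fit the raw training set perfectly, while the augmented feature matrix has more rows than columns.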

Note that in order to implement \eqref{equ_def_ERM_funs_aug} we need to first generate $\augparam$ realizations ${\bm \varepsilon}^{(\augidx)} \in \mathbb{R}^{\featurelen}$ of iid random vectors with common probability distribution $p({\bm \varepsilon})$. This might be computationally costly for large $\augparam$ and $\featurelen$. However, when using a large augmentation parameter $\augparam$, we might use the approximation

[] \begin{align} \label{equ_approx_augm_loss_expect} (1/\augparam)\sum_{\augidx=1}^{\augparam} \loss{(\featurevec^{(\sampleidx)}+{\bm \varepsilon}^{(\augidx)},y^{(\sampleidx)})}{h} \approx \expect \big\{ \loss{(\featurevec^{(\sampleidx)}+{\bm \varepsilon},y^{(\sampleidx)})}{h} \big\}. \end{align} []

This approximation is made precise by a key result of probability theory, known as the law of large numbers. We obtain an instance of ERM by inserting \eqref{equ_approx_augm_loss_expect} into \eqref{equ_def_ERM_funs_aug},

[] \begin{align} \label{equ_def_ERM_funs_aug_approx} \hat{h} = \argmin_{h \in \hypospace} (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \expect\big\{ \loss{(\featurevec^{(\sampleidx)}+{\bm \varepsilon},\truelabel^{(\sampleidx)})}{h} \big\}. \end{align} []

The usefulness of \eqref{equ_def_ERM_funs_aug_approx} as an approximation to the augmented ERM \eqref{equ_def_ERM_funs_aug} depends on the difficulty of computing the expectation $\expect\big\{ \loss{(\featurevec^{(\sampleidx)}+{\bm \varepsilon},\truelabel^{(\sampleidx)})}{h} \big\}$. The complexity of computing this expectation depends on the choice of loss function and the choice for the probability distribution $p({\bm \varepsilon})$.

Let us study \eqref{equ_def_ERM_funs_aug_approx} for the special case of linear regression with squared error loss and linear hypothesis space,

[] \begin{align} \label{equ_def_ERM_funs_aug_approx_linreg} \hat{h} = \argmin_{h^{(\vw)} \in \hypospace^{(\featuredim)}} (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \expect\big\{ \big( y^{(\sampleidx)} - \vw^{T} \big(\featurevec^{(\sampleidx)}+{\bm \varepsilon} \big) \big)^{2} \big\}. \end{align} []

We use perturbations ${\bm \varepsilon}$ drawn from a multivariate normal distribution with zero mean and covariance matrix $\sigma^{2} \mathbf{I}$,

[$] $$\label{equ_augm_mvn_standard} {\bm \varepsilon} \sim \mathcal{N}(\mathbf{0},\sigma^{2} \mathbf{I}).$$ [$]

We develop \eqref{equ_def_ERM_funs_aug_approx_linreg} further by using

[$] $$\label{equ_uncorr_augmentation_implicit} \expect\{\big( \truelabel^{(\sampleidx)} - \weights^{T} \featurevec^{(\sampleidx)} \big) {\bm \varepsilon} \} = \mathbf{0}.$$ [$]

The identity \eqref{equ_uncorr_augmentation_implicit} holds because the data points $\big(\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}\big)$ are fixed and known (deterministic) while ${\bm \varepsilon}$ is a zero-mean random vector. Expanding the square in \eqref{equ_def_ERM_funs_aug_approx_linreg} and using \eqref{equ_uncorr_augmentation_implicit} to eliminate the cross term yields

[] \begin{align} \expect\big\{ \big( y^{(\sampleidx)} - \weights^{T} \big(\featurevec^{(\sampleidx)}+{\bm \varepsilon} \big) \big)^{2} \big\} & = \big( \truelabel^{(\sampleidx)} - \weights^{T}\featurevec^{(\sampleidx)} \big) ^{2} \!+\! \weights^{T} \expect \big\{ {\bm \varepsilon} {\bm \varepsilon}^{T} \big\} \weights \nonumber \\[5mm] &= \label{equ_implicit_aug_regu_11}\big( \truelabel^{(\sampleidx)} - \weights^{T}\featurevec^{(\sampleidx)} \big) ^{2} + \sigma^{2} \sqeuclnorm{ \weights }, \end{align} []

where the last step used $\expect \big\{ {\bm \varepsilon} {\bm \varepsilon}^{T} \big\} \stackrel{\eqref{equ_augm_mvn_standard}}{=} \sigma^{2} \mathbf{I}$. Inserting \eqref{equ_implicit_aug_regu_11} into \eqref{equ_def_ERM_funs_aug_approx_linreg} yields

[] \begin{align} \label{equ_def_ERM_funs_aug_approx_ridge} \hat{h} = \argmin_{h^{(\weights)} \in \hypospace^{(\featuredim)}} \bigg[ (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big( \truelabel^{(\sampleidx)} - \weights^{T}\featurevec^{(\sampleidx)} \big) ^{2} + \sigma^{2} \sqeuclnorm{ \weights } \bigg]. \end{align} []

We have obtained \eqref{equ_def_ERM_funs_aug_approx_ridge} as an approximation of the augmented ERM \eqref{equ_def_ERM_funs_aug} for the special case of squared error loss and the linear hypothesis space. This approximation uses the law of large numbers \eqref{equ_approx_augm_loss_expect} and becomes more accurate for increasing augmentation parameter $\augparam$.

Note that \eqref{equ_def_ERM_funs_aug_approx_ridge} is nothing but ridge regression \eqref{equ_rerm_ridge_regression} using the regularization parameter $\regparam = \sigma^{2}$. Thus, we can interpret ridge regression as implicit data augmentation \eqref{equ_def_augmented_dataset} by applying random perturbations \eqref{equ_def_copies_aug} to the feature vectors in the original training set $\dataset$.

The regularizer $\regularizer(\weights) = \sqeuclnorm{ \weights }$ in \eqref{equ_def_ERM_funs_aug_approx_ridge} arose naturally from the specific choice for the probability distribution \eqref{equ_augm_mvn_standard} of the random perturbations ${\bm \varepsilon}^{(\augidx)}$ in \eqref{equ_def_copies_aug} and from using the squared error loss. Other choices for this probability distribution or the loss function result in different regularizers.

Augmenting data points with random perturbations distributed according to \eqref{equ_augm_mvn_standard} treats the features of a data point independently. For application domains that generate data points with highly correlated features it might be useful to augment data points using random perturbations ${\bm \varepsilon}$ (see \eqref{equ_def_copies_aug}) distributed as

[$] $$\label{equ_augm_mvn_cov_matris} {\bm \varepsilon} \sim \mathcal{N}(\mathbf{0},\mC).$$ [$]

The covariance matrix $\mC$ of the perturbation ${\bm \varepsilon}$ can be chosen using domain expertise or estimated (see Section Semi-Supervised Learning). Inserting the distribution \eqref{equ_augm_mvn_cov_matris} into \eqref{equ_def_ERM_funs_aug_approx_linreg} yields

[] \begin{align} \label{equ_def_ERM_funs_aug_approx_ridge_cov} \hat{h} = \argmin_{h^{(\weights)} \in \hypospace^{(\featuredim)}} \bigg[ (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big( \truelabel^{(\sampleidx)} - \weights^{T}\featurevec^{(\sampleidx)} \big) ^{2} + \weights^{T} \mC \weights \bigg] . \end{align} []

Note that \eqref{equ_def_ERM_funs_aug_approx_ridge_cov} reduces to ordinary ridge regression \eqref{equ_def_ERM_funs_aug_approx_ridge} for the choice $\mC = \sigma^{2} \mathbf{I}$.
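The penalty term $\weights^{T} \mC \weights$ in \eqref{equ_def_ERM_funs_aug_approx_ridge_cov} can be checked numerically for a single data point. The sketch below assumes NumPy and compares a Monte Carlo estimate of the expected squared error loss under perturbations \eqref{equ_augm_mvn_cov_matris} with the closed-form expression. The covariance matrix, the data point, the parameter vector and the number of samples are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
A = rng.normal(size=(n, n))
C = A @ A.T + 0.1 * np.eye(n)          # a valid (positive definite) covariance matrix
w = rng.normal(size=n)                 # some fixed parameter vector
x = rng.normal(size=n)                 # feature vector of one data point
y = 1.0                                # its label

# Monte Carlo estimate of E{ (y - w^T (x + eps))^2 } with eps ~ N(0, C).
eps = rng.multivariate_normal(np.zeros(n), C, size=200_000)
mc_estimate = float(np.mean((y - (x + eps) @ w) ** 2))

# Closed form: squared error on the unperturbed data point plus the penalty w^T C w.
closed_form = float((y - w @ x) ** 2 + w @ C @ w)

print("Monte Carlo estimate:", mc_estimate)
print("closed form         :", closed_form)
```

The two numbers agree up to the Monte Carlo sampling error, which shrinks as the number of sampled perturbations grows.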

## Statistical and Computational Aspects of Regularization

The goal of this section is to develop a better understanding of the effect of the regularization term in SRM \eqref{equ_rerm_weight}. We will analyze the solutions of ridge regression \eqref{equ_rerm_ridge_regression}, which is the special case of SRM using the linear hypothesis space and squared error loss. Using the feature matrix $\featuremtx\!=\!\big(\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)}\big)^{T}$ and label vector $\labelvec\!=\!(\truelabel^{(1)},\ldots,\truelabel^{(\samplesize)})^{T}$, we can rewrite \eqref{equ_rerm_ridge_regression} more compactly as

[$] $$\label{equ_def_rlr_w_opt} \widehat{\weights}^{(\regparam)} = \argmin_{\weights \in \mathbb{R}^{\featuredim}} \big[ (1/\samplesize) \sqeuclnorm{\labelvec - \featuremtx \vw} + \regparam \sqeuclnorm{ \vw }\big].$$ [$]

The solution of \eqref{equ_def_rlr_w_opt} is given by

[$] $$\label{equ_close_form_reglinreg} \widehat{\weights}^{(\regparam)} = (1/\samplesize) \big((1/\samplesize) \featuremtx^{T} \featuremtx + \regparam \mathbf{I} \big)^{-1} \featuremtx^{T} \labelvec.$$ [$]

For $\regparam\!=\!0$, \eqref{equ_close_form_reglinreg} reduces to the formula for the optimal weights in linear regression (see \eqref{equ_rerm_ridge_regression} and equ_def_cost_MSE). Note that for $\regparam\gt0$, the formula \eqref{equ_close_form_reglinreg} is always valid, even when $\featuremtx^{T} \featuremtx$ is singular (not invertible). For $\regparam\gt 0$ the optimization problem \eqref{equ_def_rlr_w_opt} (and \eqref{equ_rerm_ridge_regression}) has the unique solution \eqref{equ_close_form_reglinreg}.
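As a minimal sketch of the closed-form solution \eqref{equ_close_form_reglinreg}, the following Python snippet (NumPy assumed; the function name `ridge_closed_form` and the synthetic data are illustrative choices) evaluates the formula for a training set with fewer data points than features, so that $\featuremtx^{T} \featuremtx$ is singular. It then checks that the gradient of the objective in \eqref{equ_def_rlr_w_opt} vanishes at the computed weights.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Evaluate (1/m) ((1/m) X^T X + lam I)^{-1} X^T y without forming the inverse."""
    m, n = X.shape
    Q = X.T @ X / m + lam * np.eye(n)
    return np.linalg.solve(Q, X.T @ y / m)

rng = np.random.default_rng(1)
m, n = 5, 8                      # fewer data points than features
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# X^T X is an n-by-n matrix of rank at most m < n, hence singular.
print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X))

lam = 0.1
w_hat = ridge_closed_form(X, y, lam)

# The gradient of the ridge objective vanishes at the solution.
grad = -(2 / m) * X.T @ (y - X @ w_hat) + 2 * lam * w_hat
print("gradient norm at w_hat:", np.linalg.norm(grad))
```

Solving the linear system with `np.linalg.solve` instead of computing the matrix inverse explicitly is a common numerical choice; the result is the same up to rounding.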

To study the statistical properties of the predictor $h^{(\widehat{\weights}^{(\regparam)})}(\featurevec) = \big(\widehat{\weights}^{(\regparam)}\big)^{T} \featurevec$ (see \eqref{equ_close_form_reglinreg}) we use the probabilistic toy model equ_linear_obs_model, equ_toy_model_iid and equ_labels_training_data that we used already in Section A Probabilistic Analysis of Generalization . We interpret the training data $\dataset^{(\rm train)} = \{ (\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$ as realizations of iid RVs whose distribution is defined by equ_linear_obs_model, equ_toy_model_iid and equ_labels_training_data.

We can then define the average prediction error of ridge regression as

[$] $$\error_{\rm pred}^{(\regparam)} \defeq \expect \bigg\{ \bigg( \truelabel - h^{(\widehat{\weights}^{(\regparam)})}(\featurevec) \bigg)^{2} \bigg\}.$$ [$]

As shown in Section A Probabilistic Analysis of Generalization , the error $\error_{\rm pred}^{(\regparam)}$ is the sum of three components: the bias, the variance and the noise variance $\sigma^{2}$ (see equ_decomp_E_pred_toy_model). The bias of $\widehat{\weights}^{(\regparam)}$ is

[$] $$\label{equ_bias_reg_lin_reg} \biasterm^{2} = \expect \bigg\{ \sqeuclnorm{ (\mathbf{I} - ( \featuremtx^{T} \featuremtx + \samplesize \regparam \mathbf{I})^{-1} \featuremtx^{T} \featuremtx ) \overline{\weights} } \bigg\}.$$ [$]

For sufficiently large size $\samplesize$ of the training set, we can use the approximation

[$] $$\label{equ_approx_Gram_large_N} \mX^{T} \mX \approx \samplesize \mathbf{I}$$ [$]

such that \eqref{equ_bias_reg_lin_reg} can be approximated as

[] \begin{align} \biasterm^{2} & \approx \sqeuclnorm{(\mathbf{I}\!-\!(\mathbf{I}\!+\!\regparam \mathbf{I})^{-1} ) \overline{\weights} }\nonumber \\ & = \label{equ_bias_reg_lin_reg_approx}\sum_{\featureidx=1}^{\featuredim} \bigg( \frac{\regparam}{1+\regparam} \bigg)^{2} \overline{\weight}_{\featureidx}^2 . \end{align} []

Let us compare the (approximate) bias term \eqref{equ_bias_reg_lin_reg_approx} of ridge regression with the bias term of ordinary linear regression (which is the extreme case of ridge regression with $\regparam=0$). The bias term \eqref{equ_bias_reg_lin_reg_approx} increases with increasing regularization parameter $\regparam$ in ridge regression \eqref{equ_rerm_ridge_regression}. Sometimes the increase in bias is outweighed by the reduction in variance. The variance typically decreases with increasing $\regparam$ as shown next.

The variance of ridge regression \eqref{equ_rerm_ridge_regression} satisfies

[] \begin{align} \varianceterm & =( \sigma^{2}/\samplesize^{2}) \times \nonumber \\ & \label{equ_variance_reglinreg}{\rm tr} \big\{ \expect \{ ( (1/\samplesize) \featuremtx^{T} \featuremtx\!+\!\regparam \mathbf{I} )^{-1} \featuremtx^{T} \featuremtx ( (1/\samplesize) \featuremtx^{T} \featuremtx\!+\!\regparam \mathbf{I} )^{-1} \} \big\}. \end{align} []

Inserting the approximation \eqref{equ_approx_Gram_large_N} into \eqref{equ_variance_reglinreg} yields

[$] $$\varianceterm \approx ( \sigma^{2}/\samplesize^{2}) {\rm tr} \big\{ \expect \{ ( \mathbf{I}\!+\!\regparam \mathbf{I} )^{-1} \featuremtx^{T} \featuremtx ( \mathbf{I} \!+\!\regparam \mathbf{I} )^{-1} \} \big\} = \label{equ_variance_reglinreg_approx} \sigma^{2} (1/\samplesize) (\featuredim/(1\!+\!\regparam)^2).$$ [$]

According to \eqref{equ_variance_reglinreg_approx}, the variance of $\widehat{\weights}^{(\regparam)}$ decreases with increasing regularization parameter $\regparam$ of ridge regression \eqref{equ_rerm_ridge_regression}. This is the opposite of the behaviour observed for the bias \eqref{equ_bias_reg_lin_reg_approx}, which increases with increasing $\regparam$. Comparing the variance approximation \eqref{equ_variance_reglinreg_approx} with the variance of linear regression suggests interpreting the ratio $\featuredim/(1\!+\!\regparam)^2$ as an effective number of features used by ridge regression. Increasing the regularization parameter $\regparam$ decreases the effective number of features.

Figure fig_bias_variance_lambda illustrates the trade-off between the bias $\biasterm^{2}$ \eqref{equ_bias_reg_lin_reg_approx} of ridge regression, which increases for increasing $\regparam$, and the variance $\varianceterm$ \eqref{equ_variance_reglinreg_approx} which decreases with increasing $\regparam$. Note that we have seen another example for a bias-variance trade-off in Section A Probabilistic Analysis of Generalization . This trade-off was traced out by a discrete (model complexity) parameter $\modelidx \in \{1,2,\ldots\}$ (see equ_generalization_hypospace_r). In stark contrast to discrete model selection, the bias-variance trade-off for ridge regression is traced out by the continuous regularization parameter $\regparam \in \mathbb{R}_{+}$.
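The trade-off can also be traced out numerically. The sketch below (Python with NumPy; the true weight vector `w_bar`, the noise variance `sigma2` and the sample size `m` are made-up values) evaluates the large-sample approximations for a grid of regularization parameters. The bias is computed from the norm expression $\sqeuclnorm{(\mathbf{I} - (\mathbf{I} + \regparam \mathbf{I})^{-1}) \overline{\weights}}$ and the variance from \eqref{equ_variance_reglinreg_approx}.

```python
import numpy as np

# Illustrative (made-up) values for the quantities of the toy model.
w_bar = np.array([1.0, -0.5, 2.0])   # true weight vector
sigma2 = 4.0                         # noise variance sigma^2
m = 20                               # size of the training set
n = w_bar.size                       # number of features

lams = np.linspace(0.0, 2.0, 201)    # grid of regularization parameters

# Approximate bias: || (I - (I + lam I)^{-1}) w_bar ||^2.
shrink = 1.0 - 1.0 / (1.0 + lams)
bias_sq = shrink ** 2 * np.sum(w_bar ** 2)

# Approximate variance: sigma^2 (1/m) n / (1 + lam)^2.
variance = sigma2 * n / (m * (1.0 + lams) ** 2)

total = bias_sq + variance
best = np.argmin(total)
print(f"lambda minimizing bias^2 + variance on the grid: {lams[best]:.2f}")
```

On this grid the bias term grows and the variance term shrinks as $\regparam$ increases, so their sum is smallest at an intermediate value of $\regparam$.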

The main statistical effect of the regularization term in ridge regression is to balance the bias with the variance to minimize the average prediction error of the learnt hypothesis. There is also a computational effect of adding a regularization term. Roughly speaking, the regularization term serves as a pre-conditioning of the optimization problem and, in turn, reduces the computational complexity of solving ridge regression \eqref{equ_def_rlr_w_opt}.

The objective function in \eqref{equ_def_rlr_w_opt} is a smooth (infinitely often differentiable) convex function. We can therefore use GD to solve \eqref{equ_def_rlr_w_opt} efficiently (see Chapter Gradient-Based Learning ). Algorithm alg:gd_reglinreg summarizes the application of GD to \eqref{equ_def_rlr_w_opt}. The computational complexity of Algorithm alg:gd_reglinreg depends crucially on the number of GD iterations required to reach a sufficiently small neighbourhood of the solutions to \eqref{equ_def_rlr_w_opt}. Adding the regularization term $\regparam \| \weights \|^{2}_{2}$ to the objective function of linear regression speeds up GD. To verify this claim, we first rewrite \eqref{equ_def_rlr_w_opt} as the quadratic problem

[] \begin{align} \min_{\weights \in \mathbb{R}^{\featuredim}} & \underbrace{(1/2) \weights^{T} \mQ \weights - \vq^{T} \weights}_{= f(\weights)} \nonumber \\ & \label{equ_quadr_form_reglinreg}\mbox{ with } \mathbf{Q}= (1/\samplesize) \mX^{T} \mX + \regparam \mathbf{I}, \vq =(1/\samplesize) \mX^{T} \labelvec. \end{align} []

This is similar to the quadratic optimization problem underlying linear regression but with a different matrix $\mQ$. The computational complexity (number of iterations) required by GD (see equ_def_GD_step) to solve \eqref{equ_quadr_form_reglinreg} up to a prescribed accuracy depends crucially on the condition number $\kappa(\mQ) \geq 1$ of the positive semi-definite (psd) matrix $\mQ$ [11]. The smaller the condition number $\kappa(\mQ)$, the fewer iterations are required by GD. We refer to a matrix with a small condition number as being “well-conditioned”.

The condition number of the matrix $\mQ$ in \eqref{equ_quadr_form_reglinreg} is given by

[$] $$\label{equ_def_cond_number_ridgereg} \kappa(\mQ) = \frac{\eigval{\rm max}((1/\samplesize)\mX^{T} \mX) + \regparam} {\eigval{\rm min}((1/\samplesize)\mX^{T} \mX)+ \regparam}.$$ [$]

According to \eqref{equ_def_cond_number_ridgereg}, the condition number $\kappa(\mQ)$ tends to one for increasing regularization parameter $\regparam$,

[$] $$\label{equ_cond_numer_goes_1_rlr} \lim_{\regparam \rightarrow \infty} \frac{\eigval{\rm max}((1/\samplesize)\mX^{T} \mX) + \regparam} {\eigval{\rm min}((1/\samplesize)\mX^{T} \mX)+ \regparam} =1.$$ [$]

Thus, the number of required GD iterations in Algorithm alg:gd_reglinreg decreases with increasing regularization parameter $\regparam$.
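The following Python sketch (NumPy assumed; the synthetic feature matrix with deliberately different feature scales is an illustrative choice) evaluates the condition number \eqref{equ_def_cond_number_ridgereg} for several values of $\regparam$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 5
# Illustrative features with very different scales, so that (1/m) X^T X is badly conditioned.
X = rng.normal(size=(m, n)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

eigvals = np.linalg.eigvalsh(X.T @ X / m)   # eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

def condition_number(lam):
    """Condition number of Q = (1/m) X^T X + lam I."""
    return (lam_max + lam) / (lam_min + lam)

lams = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
kappas = [condition_number(lam) for lam in lams]
for lam, kappa in zip(lams, kappas):
    print(f"lambda = {lam:7.1f}   kappa(Q) = {kappa:10.2f}")
```

The printed condition numbers decrease monotonically towards one as $\regparam$ grows.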

Regularized Linear Regression via GD

Input: dataset $\dataset=\{ (\featurevec^{(\sampleidx)}, \truelabel^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$; regularization parameter $\regparam \geq 0$; GD learning rate $\lrate \gt0$.

Initialize: set $\weights^{(0)}\!\defeq\!\mathbf{0}$; set iteration counter $\itercntr\!\defeq\!0$
• repeat
• $\itercntr \defeq \itercntr +1$ (increase iteration counter)
• $\weights^{(\itercntr)} \defeq (1- 2 \lrate \regparam) \weights^{(\itercntr\!-\!1)} + \lrate (2/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big(\truelabel^{(\sampleidx)} - \big(\weights^{(\itercntr\!-\!1)}\big)^{T} \featurevec^{(\sampleidx)}\big) \featurevec^{(\sampleidx)}$ (do a GD step on the objective in \eqref{equ_def_rlr_w_opt})
• until stopping criterion met
Output: $\weights^{(\itercntr)}$ (which approximates $\widehat{\weights}^{(\regparam)}$ in \eqref{equ_def_rlr_w_opt})
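A minimal Python sketch of GD for ridge regression is given below (NumPy assumed). The function name `ridge_gd`, the fixed iteration budget used as stopping criterion and the synthetic data are illustrative assumptions. Each step subtracts the learning rate times the gradient $-(2/\samplesize) \featuremtx^{T} (\labelvec - \featuremtx \weights) + 2 \regparam \weights$ of the objective in \eqref{equ_def_rlr_w_opt}.

```python
import numpy as np

def ridge_gd(X, y, lam, lrate, num_iters=1000):
    """Gradient descent for (1/m) ||y - X w||^2 + lam ||w||^2, started at w = 0."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(num_iters):   # fixed iteration budget as stopping criterion
        grad = -(2 / m) * X.T @ (y - X @ w) + 2 * lam * w
        w = w - lrate * grad
    return w

rng = np.random.default_rng(3)
m, n = 100, 4
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=m)

lam = 0.5
w_gd = ridge_gd(X, y, lam, lrate=0.1)

# Closed-form solution for comparison.
w_closed = np.linalg.solve(X.T @ X / m + lam * np.eye(n), X.T @ y / m)
print("max deviation from closed form:", np.max(np.abs(w_gd - w_closed)))
```

The learning rate `0.1` is small enough for this particular synthetic dataset; for other data it would have to be chosen in view of the largest eigenvalue of $\mQ$.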

## Semi-Supervised Learning

Consider the task of predicting the numeric label $y$ of a data point $\vz=\big(\featurevec,y\big)$ based on its feature vector $\featurevec\!=\!\big(x_{1},\ldots,x_{\featurelen}\big)^{T} \in \mathbb{R}^{\featurelen}$. At our disposal are two datasets $\dataset^{(u)}$ and $\dataset^{(l)}$. For each data point in $\dataset^{(u)}$ we only know the feature vector. We therefore refer to $\dataset^{(u)}$ as “unlabeled data”. For each data point in $\dataset^{(l)}$ we know both the feature vector $\featurevec$ and the label $y$. We therefore refer to $\dataset^{(l)}$ as “labeled data”.

SSL methods exploit the information provided by unlabeled data $\dataset^{(u)}$ to support the learning of a hypothesis based on minimizing its empirical risk on the labeled (training) data $\dataset^{(l)}$. The success of SSL methods depends on the statistical properties of the data generated within a given application domain. Loosely speaking, the information provided by the probability distribution of the features must be relevant for the ultimate task of predicting the label $y$ from the features $\featurevec$ [1].

Let us design an SSL method, summarized in Algorithm alg:ssl_linreg below, using the data augmentation perspective from Section Data Augmentation . The idea is to augment the (small) labeled dataset $\dataset^{(l)}$ by adding random perturbations to the feature vectors of the data points in $\dataset^{(l)}$. This is reasonable for applications where feature vectors are subject to inherent measurement or modelling errors. Given a data point with feature vector $\featurevec$ we could have equally well observed a feature vector $\featurevec + {\bm \varepsilon}$ with some small random perturbation ${\bm \varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$. To estimate the covariance matrix $\mC$, we use the sample covariance matrix of the feature vectors in the (large) unlabeled dataset $\dataset^{(u)}$. We then learn a hypothesis using the augmented (regularized) ERM \eqref{equ_def_ERM_funs_aug_approx_ridge_cov}.

A Semi-Supervised Learning Algorithm

Input: labeled dataset $\dataset^{(l)}=\{ (\featurevec^{(\sampleidx)}, y^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$; unlabeled dataset $\dataset^{(u)}=\{ \widetilde{\featurevec}^{(\sampleidx)} \}_{\sampleidx=1}^{\samplesize'}$

• compute $\mC$ via sample covariance on $\dataset^{(u)}$,
[$] $$\mC \defeq (1/\samplesize') \sum_{\sampleidx=1}^{\samplesize'} \big(\widetilde{\featurevec}^{(\sampleidx)}\!-\!\widehat{\featurevec} \big) \big(\widetilde{\featurevec}^{(\sampleidx)}\!-\!\widehat{\featurevec} \big)^{T} \mbox{ with } \widehat{\featurevec} \defeq (1/\samplesize') \sum_{\sampleidx=1}^{\samplesize'} \widetilde{\featurevec}^{(\sampleidx)}.$$ [$]
• compute (e.g. using GD)
[] \begin{align} \label{equ_def_ERM_funs_aug_approx_ridge_cov_weights} \widehat{\vw} \defeq \argmin_{\vw \in \mathbb{R}^{\featuredim}} \bigg[ (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big( y^{(\sampleidx)} - \weights^{T}\featurevec^{(\sampleidx)} \big) ^{2} + \vw^{T} \mC \vw \bigg] . \end{align} []
Output: hypothesis $\hat{h}(\featurevec) = \big( \widehat{\vw} \big)^{T} \featurevec$
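The two steps of the algorithm above can be sketched in Python as follows (NumPy assumed). The function name `ssl_linreg` and the synthetic datasets are illustrative assumptions. Instead of GD, the sketch solves the linear system obtained by setting the gradient of the objective in \eqref{equ_def_ERM_funs_aug_approx_ridge_cov_weights} to zero, which presumes that $(1/\samplesize) \mX^{T} \mX + \mC$ is invertible.

```python
import numpy as np

def ssl_linreg(X_l, y_l, X_u):
    """Sketch of the algorithm above: covariance from unlabeled data, then regularized ERM."""
    m = X_l.shape[0]
    # Sample covariance matrix of the unlabeled feature vectors (normalized by 1/m').
    C = np.cov(X_u, rowvar=False, bias=True)
    # Setting the gradient of (1/m)||y - X w||^2 + w^T C w to zero gives a linear system.
    return np.linalg.solve(X_l.T @ X_l / m + C, X_l.T @ y_l / m), C

rng = np.random.default_rng(4)
n = 3
mixing = rng.normal(size=(n, n))                # induces correlated features
X_u = rng.normal(size=(1000, n)) @ mixing       # large unlabeled dataset
X_l = rng.normal(size=(10, n)) @ mixing         # small labeled dataset
y_l = X_l @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=10)

w_hat, C = ssl_linreg(X_l, y_l, X_u)
print("learnt weights:", w_hat)
```

Note that the unlabeled data enters only through the matrix `C`, while the labels of the small dataset enter only through the training error.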

## Multitask Learning

Consider a specific learning task of finding a hypothesis $h$ with minimum (expected) loss $\loss{(\featurevec,\truelabel)}{h}$. Note that the loss incurred by $h$ for a specific data point depends on the definition for the label of a data point. We can obtain different learning tasks for the same data points by using different choices or definitions for the label of a data point. Multitask learning exploits the similarities between different learning tasks to jointly solve them. Let us next discuss a simple example of a multitask learning problem.

Consider a data point $\vz$ representing a hand-drawing that is collected via the online game [1]. The features of a data point are the pixel intensities of the bitmap which is used to store the hand-drawing. As the label we could use whether the hand-drawing shows an apple or not. This results in the learning task $\task^{(1)}$. Another choice for the label could be whether the hand-drawing shows a fruit at all. This results in another learning task $\task^{(2)}$ which is similar to, but different from, the task $\task^{(1)}$.

The idea of multitask learning is that a reasonable hypothesis $h$ for a learning task should also do well for related learning tasks. Thus, we can use the loss incurred on similar learning tasks as a regularization term for learning a hypothesis for the learning task at hand. Algorithm alg:mtl is a straightforward implementation of this idea for a given dataset that gives rise to $\nrtasks$ related learning tasks $\task^{(1)},\ldots,\task^{(\nrtasks)}$. For each individual learning task $\task^{(\taskidx')}$ it uses the loss on the remaining learning tasks $\task^{(\taskidx)}$, with $\taskidx \neq \taskidx'$, as regularization term in \eqref{equ_def_ERM_mt}.

Input: dataset $\dataset = \{\datapoint^{(1)},\ldots,\datapoint^{(\samplesize)} \}$; $\nrtasks$ learning tasks with loss functions $\lossfun^{(1)},\ldots,\lossfun^{(\nrtasks)}$, hypothesis space $\hypospace$

• learn a hypothesis $\hat{h}$ via
[] \begin{align} \label{equ_def_ERM_mt} \hat{h} \defeq \argmin_{h \in \hypospace} \sum_{\taskidx=1}^{\nrtasks} \sum_{\sampleidx=1}^{\samplesize} \lossfun^{(\taskidx)}\big(\datapoint^{(\sampleidx)},h\big). \end{align} []

Output: hypothesis $\hat{h}$
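To make \eqref{equ_def_ERM_mt} concrete, the following Python sketch (NumPy assumed) considers the special case of a linear hypothesis space and the squared error loss for every task. Both choices, as well as the synthetic data with two tasks, are assumptions made for illustration. The tasks share the feature vectors and differ only in their labels.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, T = 40, 3, 2
X = rng.normal(size=(m, n))                       # shared feature vectors
# Column t holds the labels of task t; the two tasks use similar (made-up) weights.
W_true = np.array([[1.0, 1.2], [-1.0, -0.8], [0.5, 0.5]])
Y = X @ W_true + 0.1 * rng.normal(size=(m, T))

# Summing the squared error over tasks and data points is a least-squares problem
# on the stacked data: T copies of X against the concatenated label vectors.
X_stacked = np.vstack([X] * T)
y_stacked = Y.T.reshape(-1)
w_hat, *_ = np.linalg.lstsq(X_stacked, y_stacked, rcond=None)
print("single hypothesis for all tasks:", w_hat)
```

In this special case the learnt weights coincide with those of a linear regression on the task-averaged labels, which illustrates how a single hypothesis compromises between the tasks.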

The applicability of Algorithm alg:mtl is somewhat limited as it aims at finding a single hypothesis that does well for all $\nrtasks$ learning tasks simultaneously. For certain application domains it might be more reasonable to not learn a single hypothesis for all learning tasks but to learn a separate hypothesis $h^{(\taskidx)}$ for each learning task $\taskidx=1,\ldots,\nrtasks$. However, these separate hypotheses might still share some structural similarities.[a]

We can enforce different notions of similarity between the hypotheses $h^{(\taskidx)}$ by adding a regularization term to the loss functions of the tasks.

Algorithm alg:mtl_reg generalizes Algorithm alg:mtl by learning a separate hypothesis for each task $\taskidx$ while requiring these hypotheses to be structurally similar. The structural (dis-)similarity between the hypotheses is measured by a regularization term $\regularizer$ in \eqref{equ_def_ERM_mt_reg}.

Input: dataset $\dataset = \{\datapoint^{(1)},\ldots,\datapoint^{(\samplesize)} \}$ with $\nrtasks$ associated learning tasks with loss functions $\lossfun^{(1)},\ldots,\lossfun^{(\nrtasks)}$, hypothesis space $\hypospace$

• learn hypotheses $\hat{h}^{(1)},\ldots,\hat{h}^{(\nrtasks)}$ via
[] \begin{align} \label{equ_def_ERM_mt_reg} \hat{h}^{(1)},\ldots,\hat{h}^{(\nrtasks)} \defeq \argmin_{h^{(1)},\ldots,h^{(\nrtasks)} \in \hypospace} \sum_{\taskidx=1}^{\nrtasks} \sum_{\sampleidx=1}^{\samplesize} \lossfun^{(\taskidx)}\big(\vz^{(\sampleidx)},h^{(\taskidx)}\big) + \regparam \regularizer \big(h^{(1)},\ldots,h^{(\nrtasks)}\big). \end{align} []
Output: hypotheses $\hat{h}^{(1)},\ldots,\hat{h}^{(\nrtasks)}$
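One possible instance of \eqref{equ_def_ERM_mt_reg} is sketched below in Python (NumPy assumed). The sketch rests on several assumptions that are not prescribed by the algorithm: linear hypotheses with a weight vector per task, the squared error loss, and the regularizer $\regularizer = \sum_{\taskidx} \sqeuclnorm{\weights^{(\taskidx)} - \overline{\weights}}$ with $\overline{\weights}$ the average of the task weight vectors. The function name `multitask_gd` and the synthetic data are likewise illustrative.

```python
import numpy as np

def multitask_gd(X, Y, lam, num_iters=2000):
    """GD for sum_t ||Y[:, t] - X W[:, t]||^2 + lam * sum_t ||W[:, t] - mean_t W||^2."""
    n, T = X.shape[1], Y.shape[1]
    W = np.zeros((n, T))                       # column t: weights of task t
    # Step size 1/L, with L an upper bound on the curvature of the objective.
    lrate = 1.0 / (2 * np.linalg.eigvalsh(X.T @ X)[-1] + 2 * lam)
    for _ in range(num_iters):
        W_mean = W.mean(axis=1, keepdims=True)
        grad = -2 * X.T @ (Y - X @ W) + 2 * lam * (W - W_mean)
        W = W - lrate * grad
    return W

rng = np.random.default_rng(6)
m, n, T = 40, 3, 2
X = rng.normal(size=(m, n))
W_true = np.array([[1.0, 1.2], [-1.0, -0.8], [0.5, 0.5]])   # made-up task weights
Y = X @ W_true + 0.1 * rng.normal(size=(m, T))

gaps = []
for lam in [0.0, 10.0, 1000.0]:
    W = multitask_gd(X, Y, lam)
    gaps.append(np.linalg.norm(W[:, 0] - W[:, 1]))
    print(f"lam = {lam:7.1f}   distance between task weights = {gaps[-1]:.4f}")
```

For $\regparam = 0$ the tasks decouple into separate linear regressions, while a large $\regparam$ pulls the task weight vectors towards a common vector.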

## Transfer Learning

Regularization is also instrumental for transfer learning to capitalize on synergies between different related learning tasks [13][14]. Transfer learning is enabled by constructing regularization terms for a learning task by using the result of a previous learning task. While multitask learning methods solve many related learning tasks simultaneously, transfer learning methods operate in a sequential fashion.

Let us illustrate the idea of transfer learning using two learning tasks which differ significantly in their intrinsic difficulty. Informally, we consider a learning task to be easy if we can easily gather large amounts of labeled (training) data for that task. Consider the learning task $\task^{(1)}$ of predicting whether an image shows a cat or not. For this learning task we can easily gather a large training set $\dataset^{(1)}$ using image collections of animals. Another (related) learning task $\task^{(2)}$ is to predict whether an image shows a cat of a particular breed, with a particular body height and with a specific age. The learning task $\task^{(2)}$ is more difficult than $\task^{(1)}$ since we have only a very limited amount of cat images for which we know the particular breed, body height and precise age of the depicted cat.

## Notes

1. One important example of such a structural similarity in the case of linear predictors $h^{(\taskidx)}(\featurevec) =\big(\weights^{(\taskidx)} \big)^{T} \featurevec$ is that the parameter vectors $\weights^{(1)},\ldots,\weights^{(\nrtasks)}$ have a small joint support. Requiring the parameter vectors to have a small joint support is equivalent to requiring the stacked vector $\widetilde{\weights}=\big(\weights^{(1)},\ldots,\weights^{(\nrtasks)} \big)$ to be block (group) sparse [12].

## General References

Jung, Alexander (2022). Machine Learning: The Basics. Singapore: Springer. doi:10.1007/978-981-16-8193-6.

Jung, Alexander (2022). "Machine Learning: The Basics". arXiv:1805.05052.

## References

1. O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. The MIT Press, Cambridge, Massachusetts, 2006.
2. R. Caruana. Multitask learning. Machine Learning, 28(1):41--75, 1997.
3. M. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, Cambridge, 2019.
4. P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, New York, 2011.
5. S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning -- from Theory to Algorithms. Cambridge University Press, 2014.
6. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1999.
7. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, Cambridge, UK, 2004.
8. D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 2nd edition, June 1999.
9. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
10. T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity. The Lasso and its Generalizations. CRC Press, 2015.
11. A. Jung. A fixed-point of view on gradient methods for big data. Frontiers in Applied Mathematics and Statistics, 3, 2017.
12. Y. C. Eldar, P. Kuppinger, and H. Bölcskei. Block-sparse signals: Uncertainty relations and efficient recovery. IEEE Trans. Signal Processing, 58(6):3042--3054, June 2010.
13. S. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345--1359, 2010.
14. J. Howard and S. Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328--339, Melbourne, Australia, July 2018. Association for Computational Linguistics.