The Landscape of ML

As discussed in Chapter Three Components of ML, ML methods combine three main components:

• a set of data points that are characterized by features and labels,
• a model or hypothesis space $\hypospace$ that consists of different hypotheses $h \in \hypospace$, and
• a loss function to measure the quality of a particular hypothesis $h$.

Each of these three components involves design choices for the representation of data, their features and labels, the model and loss function. This chapter details the high-level design choices used by some of the most popular ML methods.

To obtain a practical ML method we also need to combine the above components. The basic principle of any ML method is to search the model for a hypothesis that incurs minimum loss on the data points. Chapter Empirical Risk Minimization discusses a principled way to turn this informal statement into actual ML algorithms that can be implemented on a computer.

Linear Regression

Consider data points characterized by feature vectors $\featurevec \in \mathbb{R}^{\featuredim}$ and numeric label $\truelabel \in \mathbb{R}$. Linear regression methods learn a hypothesis out of the linear hypothesis space

[] \begin{align} \label{equ_lin_hypospace} \hypospace^{(\featuredim)} & \defeq \{ h^{(\weights)}: \mathbb{R}^{\featuredim}\!\rightarrow\!\mathbb{R}: h^{(\weights)}(\featurevec)\!=\!\weights^{T} \featurevec \mbox{ with some parameter vector } \weights \in \mathbb{R}^{\featuredim} \}. \end{align} []

Figure fig_three_maps_example depicts the graphs of some maps from $\hypospace^{(2)}$ for data points with feature vectors of the form $\featurevec = (1,\feature)^{T}$. The quality of a particular predictor $h^{(\weights)}$ is measured by the squared error loss. Using labeled data $\dataset =\{ (\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$, linear regression learns a linear predictor $\hat{h}$ which minimizes the average squared error loss, or “mean squared error”,

[] \begin{align} \label{equ_opt_pred_linreg} \hat{h} & = \argmin_{h \in \hypospace^{(\featuredim)} }\emperror(h|\dataset) \stackrel{\eqref{eq_def_emp_error_101}}{=} \argmin_{h \in \hypospace^{(\featuredim)} } (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} (\truelabel^{(\sampleidx)} - h(\featurevec^{(\sampleidx)}))^{2}. \end{align} []

Since the hypothesis space $\hypospace^{(\featuredim)}$ is parametrized by the parameter vector $\weights$ (see \eqref{equ_lin_hypospace}), we can rewrite \eqref{equ_opt_pred_linreg} as an optimization of the parameter vector $\weights$,

[] \begin{align} \widehat{\weights} & = \argmin_{\weights \in \mathbb{R}^{\featuredim}} (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} (\truelabel^{(\sampleidx)} - h^{(\weights)}(\featurevec^{(\sampleidx)}))^{2} \nonumber \\ & \label{equ_opt_weight_vector_linreg_weight}\stackrel{h^{(\weights)}(\featurevec) = \weights^{T} \featurevec}{=} \argmin_{\weights \in \mathbb{R}^{\featuredim}} (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} (\truelabel^{(\sampleidx)} - \weights^{T} \featurevec^{(\sampleidx)})^{2}. \end{align} []

The optimization problems \eqref{equ_opt_pred_linreg} and \eqref{equ_opt_weight_vector_linreg_weight} are equivalent in the following sense: Any optimal parameter vector $\widehat{\weights}$ which solves \eqref{equ_opt_weight_vector_linreg_weight}, can be used to construct an optimal predictor $\hat{h}$, which solves \eqref{equ_opt_pred_linreg}, via $\hat{h}(\featurevec) = h^{(\widehat{\weights})}(\featurevec) = \big(\widehat{\weights}\big)^{T} \featurevec$.
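As a minimal numerical sketch (the toy data, variable names, and the use of NumPy's least-squares routine are illustrative assumptions, not part of the text), the parameter-vector formulation \eqref{equ_opt_weight_vector_linreg_weight} can be solved directly:

```python
import numpy as np

# Hypothetical toy data: m = 5 data points with n = 2 features each.
# The first feature is a constant 1, so w[0] acts as an intercept.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])               # feature vectors x^(i) as rows
y = np.array([0.1, 1.9, 4.1, 6.0, 8.1])  # numeric labels y^(i)

# Solve min_w (1/m) sum_i (y_i - w^T x_i)^2 via least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def h_hat(x):
    """Learned linear predictor h(x) = w_hat^T x."""
    return w_hat @ x

mse = np.mean((y - X @ w_hat) ** 2)  # average squared error on the data
```

The solution `w_hat` then yields the optimal predictor $\hat{h}(\featurevec) = \big(\widehat{\weights}\big)^{T}\featurevec$ as described above.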

Polynomial Regression

Consider an ML problem involving data points which are characterized by a single numeric feature $\feature \in \mathbb{R}$ (the feature space is $\featurespace= \mathbb{R}$) and a numeric label $\truelabel \in \mathbb{R}$ (the label space is $\labelspace= \mathbb{R}$). We observe a bunch of labeled data points which are depicted in Figure fig_scatterplot_poly.

Figure fig_scatterplot_poly suggests that the relation $\feature \mapsto \truelabel$ between feature $\feature$ and label $\truelabel$ is highly non-linear. For such non-linear relations between features and labels it is useful to consider a hypothesis space which is constituted by polynomial maps

[] \begin{align} \label{equ_def_poly_hyposapce} \hypospace^{(\featuredim)}_{\rm poly}& = \{ h^{(\weights)}: \mathbb{R} \rightarrow \mathbb{R}: h^{(\weights)}(\feature) = \sum_{\featureidx=1}^{\featuredim} \weight_{\featureidx} \feature^{\featureidx-1}, \nonumber \\ & \mbox{with some } \weights\!=\!(\weight_{1},\ldots,\weight_{\featuredim})^{T}\!\in\!\mathbb{R}^{\featuredim} \}. \end{align} []

We can approximate any continuous relation $\truelabel\!=\!h(\feature)$ (on a bounded interval) with any desired level of accuracy using a polynomial $\sum_{\featureidx=1}^{\featuredim} \weight_{\featureidx} \feature^{\featureidx-1}$ of sufficiently large degree $\featuredim-1$.

As for linear regression (see Section Linear Regression ), we measure the quality of a predictor by the squared error loss. Based on labeled data points $\dataset =\{ (\feature^{(\sampleidx)},\truelabel^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$, each having a scalar feature $\feature^{(\sampleidx)}$ and label $\truelabel^{(\sampleidx)}$, polynomial regression minimizes the average squared error loss:

[$]$$\label{opt_hypo_poly} \min_{h \in \hypospace_{\rm poly}^{(\featuredim)} } (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} (\truelabel^{(\sampleidx)} - h^{(\weights)}(\feature^{(\sampleidx)}))^{2}.$$[$]

It is customary to refer to the average squared error loss also as the mean squared error.

We can interpret polynomial regression as a combination of a feature map (transformation) (see Section Features ) and linear regression (see Section Linear Regression ). Indeed, any polynomial predictor $h^{(\weights)} \in \hypospace_{\rm poly}^{(\featuredim)}$ is obtained as a concatenation of the feature map

[$]$$\label{equ_poly_feature_map} \featuremap(\feature) = (1,\feature,\ldots,\feature^{\featuredim-1})^{T} \in \mathbb{R}^{\featuredim}$$[$]

with some linear map $\tilde{h}^{(\weights)}: \mathbb{R}^{\featuredim} \rightarrow \mathbb{R}: \featurevec \mapsto \weights^{T} \featurevec$, i.e.,

[$]$$\label{equ_concact_phi_g_poly} h^{(\weights)}(\feature) = \tilde{h}^{(\weights)}(\featuremap(\feature)).$$[$]

Thus, we can implement polynomial regression by first applying the feature map $\featuremap$ (see \eqref{equ_poly_feature_map}) to the scalar features $\feature^{(\sampleidx)}$, resulting in the transformed feature vectors

[$]$$\label{equ_def_poly_feature_vectors} \featurevec^{(\sampleidx)} = \featuremap \big(\feature^{(\sampleidx)}\big) = \big(1,\feature^{(\sampleidx)},\ldots,\big( \feature^{(\sampleidx)} \big)^{\featuredim-1} \big)^{T} \in \mathbb{R}^{\featuredim},$$[$]

and then applying linear regression (see Section Linear Regression ) to these new feature vectors.

By inserting \eqref{equ_concact_phi_g_poly} into \eqref{opt_hypo_poly}, we obtain a linear regression problem \eqref{equ_opt_weight_vector_linreg_weight} with feature vectors \eqref{equ_def_poly_feature_vectors}. Thus, while a predictor $h^{(\vw)} \in \hypospace_{\rm poly}^{(\featuredim)}$ is a non-linear function $h^{(\vw)}(\feature)$ of the original feature $\feature$, it is a linear function $\tilde{h}^{(\weights)}(\featurevec) = \weights^{T}\featurevec$ (see \eqref{equ_concact_phi_g_poly}), of the transformed features $\featurevec$ \eqref{equ_def_poly_feature_vectors}.
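To make the concatenation \eqref{equ_concact_phi_g_poly} concrete, here is a hypothetical sketch (toy data and NumPy assumed) that first applies the polynomial feature map and then plain linear regression:

```python
import numpy as np

def poly_features(x, n):
    """Feature map phi(x) = (1, x, ..., x^(n-1)), applied to each entry of x."""
    return np.vander(x, N=n, increasing=True)  # shape (m, n)

# Hypothetical data with a non-linear relation y = x^2.
x = np.linspace(-1.0, 1.0, 20)
y = x ** 2

# Step 1: transform the scalar features into feature vectors.
Phi = poly_features(x, n=3)
# Step 2: plain linear regression on the transformed features.
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Since $y = x^{2}$ lies exactly in the span of the transformed features, the learned weights recover the polynomial coefficients.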

Least Absolute Deviation Regression

Learning a linear predictor by minimizing the average squared error loss incurred on training data is not robust against the presence of outliers. This sensitivity to outliers is rooted in the properties of the squared error loss $(\truelabel - h(\featurevec))^{2}$. Minimizing the average squared error forces the resulting predictor $\hat{h}$ to not be too far away from any data point. However, it might be useful to tolerate a large prediction error $\truelabel - h ( \featurevec)$ for an unusual or exceptional data point that can be considered an outlier.

Replacing the squared error loss with a different loss function can make the learning robust against outliers. One important example of such a “robustifying” loss function is the Huber loss [2]

[$]$$\label{equ_def_Huber_loss} \loss{\big(\featurevec,\truelabel \big)}{h} = \begin{cases} (1/2) (\truelabel-h(\featurevec))^{2} & \mbox{ for } |\truelabel-h(\featurevec)| \leq \varepsilon \\ \varepsilon(|\truelabel-h(\featurevec)| - \varepsilon/2) & \mbox{ else. }\end{cases}$$[$]

Figure fig_Huber_loss depicts the Huber loss as a function of the prediction error $\truelabel - h(\featurevec)$.

The Huber loss definition \eqref{equ_def_Huber_loss} contains a tuning parameter $\varepsilon$. The value of this tuning parameter defines when a data point is considered an outlier. Figure fig_huber_loss_scatter illustrates the role of this parameter as the width of a band around a hypothesis map. The prediction errors of this hypothesis map for data points within this band are measured using the squared error loss. For data points outside this band (outliers), we instead use the absolute value of the prediction error as the resulting loss.

The Huber loss is robust to outliers since the corresponding (large) prediction errors $\truelabel - \hat{\truelabel}$ are not squared. Outliers have a smaller effect on the average Huber loss (over the entire dataset) compared to the average squared error loss. The improved robustness against outliers of the Huber loss comes at the expense of increased computational complexity. The squared error loss can be minimized using efficient gradient-based methods (see Chapter Gradient-Based Learning ). In contrast, for $\varepsilon=0$, the Huber loss is non-differentiable and requires more advanced optimization methods.

The Huber loss \eqref{equ_def_Huber_loss} contains two important special cases. The first special case occurs when $\varepsilon$ is chosen to be very large, such that the condition $|\truelabel-\hat{\truelabel}| \leq \varepsilon$ is satisfied for most data points. In this case, the Huber loss resembles the squared error loss (up to a scaling factor $1/2$). The second special case is obtained for small $\varepsilon$. Here, the Huber loss approaches the scaled absolute difference loss $\varepsilon |\truelabel - \hat{\truelabel}|$.
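The case distinction in \eqref{equ_def_Huber_loss} can be sketched in a few lines (a hypothetical NumPy illustration; the function name is an assumption):

```python
import numpy as np

def huber_loss(err, eps):
    """Huber loss of the prediction error err = y - h(x), with threshold eps.

    Quadratic for |err| <= eps, linear for |err| > eps; the two branches
    agree at |err| = eps (both equal eps^2 / 2), so the loss is continuous.
    """
    err = np.abs(err)
    return np.where(err <= eps,
                    0.5 * err ** 2,            # squared-error branch
                    eps * (err - 0.5 * eps))   # linear (outlier) branch
```

Small errors fall on the squared-error branch, while large (outlier) errors grow only linearly.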

The Lasso

We will see in Chapter Model Validation and Selection that linear regression (see Section Linear Regression ) typically requires a training set larger than the number of features used to characterize a data point. However, many important application domains generate data points with a number $\featuredim$ of features much higher than the number $\samplesize$ of available labeled data points in the training set.

In the high-dimensional regime, where $\samplesize \ll \featuredim$, basic linear regression methods will not be able to learn useful weights $\weights$ for a linear hypothesis. Section A Probabilistic Analysis of Generalization shows that for $\samplesize \ll \featuredim$, linear regression will typically learn a hypothesis that perfectly predicts the labels of data points in the training set but delivers poor predictions for data points outside the training set. This phenomenon is referred to as overfitting and poses a main challenge for ML applications in the high-dimensional regime.

Chapter Regularization discusses basic regularization techniques that prevent ML methods from overfitting. We can regularize linear regression by augmenting the squared error loss of a hypothesis $h^{(\weights)}(\featurevec) = \weights^{T} \featurevec$ with an additional penalty term. This penalty term depends solely on the weights $\weights$ and serves as an estimate for the increase of the average loss on data points outside the training set. Different ML methods are obtained from different choices for this penalty term. The least absolute shrinkage and selection operator (Lasso) is obtained from linear regression by replacing the squared error loss with the regularized loss

[$]$$\label{equ_def_reg_loss_lin_reg} \loss{(\featurevec,\truelabel)}{h^{(\weights)}} = (\truelabel-\weights^{T}\featurevec)^{2} + \regparam \| \weights \|_{1}.$$[$]

Here, the penalty term is given by the scaled $\ell_{1}$-norm $\regparam \| \weights \|_{1}$. The value of $\regparam$ can be chosen based on some probabilistic model that interprets each data point as the realization of a random variable (RV). The label of this data point (itself the realization of a RV) is related to its features via

[$]\truelabel = \overline{\weights}^{T} \featurevec + \varepsilon.[$]

Here, $\overline{\weights}$ denotes some true underlying parameter vector and $\varepsilon$ is the realization of a RV that is independent of the features $\featurevec$. We need the “noise” term $\varepsilon$ since the labels of data points collected in some ML application are typically not exactly obtained by a linear combination $\overline{\weights}^{T} \featurevec$ of their features.

The tuning of $\regparam$ in \eqref{equ_def_reg_loss_lin_reg} can be guided by the statistical properties (such as the variance) of the noise $\varepsilon$, the number of non-zero entries in $\overline{\weights}$ and a lower bound on their non-zero values [3][4]. Another option for choosing the value of $\regparam$ is to try out different candidate values and pick the one resulting in the smallest validation error (see Section Validation ).
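As a small illustration (hypothetical NumPy sketch; the function name and data are assumptions), the regularized loss \eqref{equ_def_reg_loss_lin_reg}, averaged over a dataset, can be evaluated as:

```python
import numpy as np

def lasso_objective(X, y, w, alpha):
    """Average Lasso loss: (1/m) sum_i (y_i - w^T x_i)^2 + alpha * ||w||_1.

    X holds the feature vectors as rows, y the labels, alpha plays the
    role of the regularization parameter lambda.
    """
    residuals = y - X @ w
    return np.mean(residuals ** 2) + alpha * np.sum(np.abs(w))
```

Candidate values of `alpha` can then be compared by evaluating this objective (or, better, a validation error) for the corresponding learned weights.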

Gaussian Basis Regression

Section Polynomial Regression showed how to extend linear regression by first transforming the feature $\feature$ using a vector-valued feature map $\featuremap: \mathbb{R} \rightarrow \mathbb{R}^{\featuredim}$. The outputs of this feature map are the transformed features $\featuremap(\feature)$ which are fed, in turn, to a linear map $h\big(\featuremap(\feature)\big) = \weights^{T} \featuremap(\feature)$. Polynomial regression in Section Polynomial Regression has been obtained for the specific feature map \eqref{equ_poly_feature_map} whose entries are powers $\feature^{\featureidx-1}$ of the original scalar feature $\feature$. However, it is possible to use other functions, different from polynomials, to construct the feature map $\featuremap$. We can extend linear regression using an arbitrary feature map

[$]$$\featuremap(\feature) = (\basisfunc_{1}(\feature),\ldots,\basisfunc_{\featuredim}(\feature))^{T}$$[$]

with the scalar maps $\basisfunc_{\featureidx}: \mathbb{R} \rightarrow \mathbb{R}$ which are referred to as “basis functions”. The choice of basis functions depends heavily on the particular application and the underlying relation between features and labels of the observed data points. The basis functions underlying polynomial regression are $\basisfunc_{\featureidx}(\feature)= \feature^{\featureidx-1}$.

Another popular choice for the basis functions are “Gaussians”

[$]$$\label{equ_basis_Gaussian} \phi_{\sigma,\mu}(\feature) = \exp(-(1/(2\sigma^{2})) (\feature\!-\!\mu)^{2}).$$[$]

The family \eqref{equ_basis_Gaussian} of maps is parameterized by the variance $\sigma^{2}$ and the mean (shift) $\mu$. Gaussian basis linear regression combines the feature map

[$]$$\featuremap(\feature) = \big (\phi_{\sigma_{1},\mu_{1}}(\feature) ,\ldots,\phi_{\sigma_{\featuredim},\mu_{\featuredim}}(\feature) \big)^{T}$$[$]

with linear regression (see Figure fig_lin_bas_expansion). The resulting hypothesis space is

[] \begin{align} \hypospace^{(\featuredim)}_{\rm Gauss} & = \{ h^{(\weights)}: \mathbb{R} \rightarrow \mathbb{R}: h^{(\weights)}(\feature)\!=\!\sum_{\featureidx=1}^{\featuredim} \weight_{\featureidx}\phi_{\sigma_{\featureidx},\mu_{\featureidx}}(\feature) \nonumber \\ & \label{equ_def_Gauss_hypospace}\mbox{ with weights } \weights=(\weight_{1},\ldots,\weight_{\featuredim})^{T} \in \mathbb{R}^{\featuredim}\}. \end{align} []

Different choices for the variances $\sigma_{\featureidx}^{2}$ and shifts $\mu_{\featureidx}$ of the Gaussian functions in \eqref{equ_basis_Gaussian} result in different hypothesis spaces $\hypospace_{\rm Gauss}$. Chapter Model Selection will discuss model selection techniques for finding useful values for these parameters.

The hypotheses of \eqref{equ_def_Gauss_hypospace} are parametrized by a parameter vector $\weights \in \mathbb{R}^{\featuredim}$. Each hypothesis in $\hypospace_{\rm Gauss}$ corresponds to a particular choice for the parameter vector $\weights$. Thus, instead of searching over $\hypospace_{\rm Gauss}$ to find a good hypothesis, we can search over the Euclidean space $\mathbb{R}^{\featuredim}$. Highly developed methods for searching over the space $\mathbb{R}^{\featuredim}$, for a wide range of values for $\featuredim$, are provided by numerical linear algebra [5].
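A minimal sketch of Gaussian basis regression (toy data, shifts, and variances are illustrative assumptions) combines the Gaussian feature map with ordinary least squares:

```python
import numpy as np

def gauss_features(x, mus, sigmas):
    """Feature map whose entries are Gaussian basis functions phi_{sigma,mu}(x)."""
    x = np.asarray(x)[:, None]                          # shape (m, 1)
    return np.exp(-((x - mus) ** 2) / (2.0 * sigmas ** 2))  # shape (m, n)

# Hypothetical choice of n = 3 shifts and variances; finding good values
# is a model-selection problem.
mus = np.array([-1.0, 0.0, 1.0])
sigmas = np.array([0.5, 0.5, 0.5])

x = np.linspace(-1.5, 1.5, 30)
y = np.exp(-2.0 * x ** 2)        # a smooth non-linear target relation

Phi = gauss_features(x, mus, sigmas)            # transformed features
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # plain linear regression
pred = Phi @ w_hat
```

Here the target happens to coincide with the basis function centered at $\mu=0$, so the fit is essentially exact; in general the quality of the fit depends on the chosen shifts and variances.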

Logistic Regression

Logistic regression is an ML method for classifying data points into one of two categories. Thus, logistic regression is a binary classification method that can be applied to data points characterized by feature vectors $\featurevec \in \mathbb{R}^{\featuredim}$ (feature space $\featurespace=\mathbb{R}^{\featuredim}$) and binary labels $\truelabel$. These binary labels take on values from a label space that contains two different label values. Each of these two label values represents one of the two categories to which the data points can belong.

It is convenient to use the label space $\labelspace = \mathbb{R}$ and encode the two label values as $\truelabel=1$ and $\truelabel=-1$. However, it is important to note that logistic regression can be used with an arbitrary label space which contains two different elements. Another popular choice for the label space is $\labelspace=\{0,1\}$. Logistic regression learns a hypothesis out of the hypothesis space $\hypospace^{(\featuredim)}$ (see \eqref{equ_lin_hypospace}). Note that logistic regression uses the same hypothesis space as linear regression (see Section Linear Regression ).

At first sight, it seems wasteful to use a linear hypothesis $h(\featurevec) = \weights^T \featurevec$, with some parameter vector $\weights \in \mathbb{R}^{\featuredim}$, to predict a binary label $\truelabel$. Indeed, while the prediction $h(\featurevec)$ can take any real number, the label $\truelabel \in \{-1,1\}$ takes on only one of the two real numbers $1$ and $-1$.

It turns out that even for binary labels it is quite useful to use a hypothesis map $h$ which can take on arbitrary real numbers. We can always obtain a predicted label $\hat{\truelabel} \in \{-1,1\}$ by comparing the hypothesis value $h(\featurevec)$ with a threshold. A data point with features $\featurevec$ is classified as $\hat{\truelabel}=1$ if $h(\featurevec)\geq 0$ and $\hat{\truelabel}=-1$ if $h(\featurevec)\lt 0$. Thus, we use the sign of the predictor $h$ to determine the final prediction for the label. The absolute value $|h(\featurevec)|$ is then used to quantify the reliability of (or confidence in) the classification $\hat{\truelabel}$.

Consider two data points with feature vectors $\featurevec^{(1)}, \featurevec^{(2)}$ and a linear classifier map $h$ yielding the function values $h(\featurevec^{(1)}) = 1/10$ and $h(\featurevec^{(2)}) = 100$. Whereas the predictions for both data points result in the same label predictions, i.e., $\hat{\truelabel}^{(1)}\!=\!\hat{\truelabel}^{(2)}\!=\!1$, the classification of the data point with feature vector $\featurevec^{(2)}$ seems to be much more reliable.

Logistic regression uses the logistic loss to assess the quality of a particular hypothesis $h^{(\weights)} \in \hypospace^{(\featuredim)}$. In particular, given some labeled training set $\dataset =\{ (\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$, logistic regression tries to minimize the empirical risk (average logistic loss)

[] \begin{align} \emperror ( \weights | \dataset ) & = (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \log( 1+ \exp ( - \truelabel^{(\sampleidx)} h^{(\weights)}(\featurevec^{(\sampleidx)}))) \nonumber \\ & \label{equ_def_emp_risk_logreg}\stackrel{h^{(\weights)}(\featurevec)=\weights^{T} \featurevec}{=} (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \log( 1+ \exp ( - \truelabel^{(\sampleidx)} \weights^{T} \featurevec^{(\sampleidx)})). \end{align} []

Once we have found the optimal parameter vector $\widehat{\weights}$, which minimizes \eqref{equ_def_emp_risk_logreg}, we can classify any data point solely based on its features $\featurevec$. Indeed, we just need to evaluate the hypothesis $h^{(\widehat{\weights})}$ for the features $\featurevec$ to obtain the predicted label

[$]$$\label{equ_class_logreg} \hat{\truelabel} = \begin{cases} 1 & \mbox{ if } h^{(\widehat{\weights})}(\featurevec) \geq 0 \\ -1 & \mbox{ otherwise.} \end{cases}$$[$]
Since $h^{(\widehat{\weights})}(\featurevec) = \big(\widehat{\weights}\big)^{T} \featurevec$ (see \eqref{equ_lin_hypospace}), the classifier \eqref{equ_class_logreg} amounts to testing whether $\big(\widehat{\weights}\big)^{T} \featurevec \geq 0$ or not.
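The thresholding rule \eqref{equ_class_logreg}, together with $|h(\featurevec)|$ as a confidence measure, can be sketched as follows (hypothetical helper function, NumPy assumed):

```python
import numpy as np

def classify(w_hat, x):
    """Predict a binary label from the sign of h(x) = w_hat^T x.

    Returns the predicted label in {-1, 1} and |h(x)| as a measure of
    the confidence in that classification.
    """
    h = w_hat @ x
    y_hat = 1 if h >= 0 else -1
    confidence = abs(h)
    return y_hat, confidence
```

A data point with a large value of `confidence` (such as $h(\featurevec)=100$) is classified more reliably than one with $h(\featurevec)$ close to the decision threshold.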

The classifier \eqref{equ_class_logreg} partitions the feature space $\featurespace\!=\!\mathbb{R}^{\featuredim}$ into two half-spaces $\decreg{1}\!=\!\big\{ \featurevec: \big(\widehat{\weights}\big)^{T} \featurevec\!\geq\!0 \big\}$ and $\decreg{-1}\!=\!\big\{ \featurevec: \big(\widehat{\weights}\big)^{T} \featurevec\!\lt\!0 \big\}$ which are separated by the hyperplane $\big(\widehat{\weights}\big)^{T} \featurevec =0$ (see Figure fig_lin_dec_boundary). Any data point with features $\featurevec \in \decreg{1}$ ($\featurevec \in \decreg{-1}$) is classified as $\hat{\truelabel}\!=\!1$ ($\hat{\truelabel}\!=\!-1$).

Logistic regression can be interpreted as a statistical estimation method for a particular probabilistic model for the data points. This probabilistic model interprets the label $\truelabel \in \{-1,1\}$ of a data point as a RV with the probability distribution

[] \begin{align} \prob{\truelabel=1;\weights}& = 1/(1 + \exp(- \weights^{T} \featurevec)) \nonumber \\ & \label{equ_prob_model_logistic_model}\stackrel{h^{(\weights)}(\featurevec)\!=\!\weights^{T} \featurevec}{=} 1/(1 + \exp(- h^{(\weights)}(\featurevec))). \end{align} []
As the notation indicates, the probability \eqref{equ_prob_model_logistic_model} is parametrized by the parameter vector $\weights$ of the linear hypothesis $h^{(\weights)}(\featurevec)\!=\!\weights^{T} \featurevec$. Given the probabilistic model \eqref{equ_prob_model_logistic_model}, we can interpret the classification \eqref{equ_class_logreg} as choosing $\hat{\truelabel}$ to maximize the probability $\prob{\truelabel=\hat{\truelabel};\weights}$.

Since $\prob{\truelabel=1} + \prob{\truelabel=-1}=1$,

[] \begin{align} \prob{\truelabel=-1} & = 1 - \prob{\truelabel=1} \nonumber \\ & \stackrel{\eqref{equ_prob_model_logistic_model}}{=} 1 - 1/(1+ \exp(-\weights^{T} \featurevec)) \nonumber \\ & = \label{equ_prob_model_logistic_model_minus_1} 1/(1+ \exp(\weights^{T} \featurevec)). \end{align} []

In practice we do not know the parameter vector in \eqref{equ_prob_model_logistic_model}. Rather, we have to estimate the parameter vector $\weights$ in \eqref{equ_prob_model_logistic_model} from observed data points. A principled approach to estimate the parameter vector is to maximize the probability (or likelihood) of actually obtaining the dataset $\dataset=\{ (\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$ as realizations of independent and identically distributed (iid) data points whose labels are distributed according to \eqref{equ_prob_model_logistic_model}. This yields the maximum likelihood estimator

[] \begin{align} \widehat{\weights} & = \argmax_{\weights \in \mathbb{R}^{\featuredim}} \prob{ \{\truelabel^{(\sampleidx)}\}_{\sampleidx=1}^{\samplesize}} \nonumber \\ & \stackrel{\truelabel^{(\sampleidx)} \mbox{ iid}}{=} \argmax_{\weights \in \mathbb{R}^{\featuredim}} \prod_{\sampleidx=1}^{\samplesize} \prob { \truelabel^{(\sampleidx)}}\nonumber \\ & \label{equ_deriv_prog_logreg}\stackrel{\eqref{equ_prob_model_logistic_model},\eqref{equ_prob_model_logistic_model_minus_1}}{=} \argmax_{\weights \in \mathbb{R}^{\featuredim}} \prod_{\sampleidx=1}^{\samplesize} 1/(1 + \exp(- \truelabel^{(\sampleidx)} \weights^{T} \featurevec^{(\sampleidx)}) ). \end{align} []
Note that the last expression \eqref{equ_deriv_prog_logreg} is only valid if we encode the binary labels using the values $1$ and $-1$. Using different label values results in a different expression.

Maximizing a positive function $f(\weights)\gt0$ is equivalent to maximizing its logarithm $\log f(\weights)$,

[$]\argmax\limits_{\weights \in \mathbb{R}^{\featuredim}} f(\weights)\!=\!\argmax\limits_{\weights \in \mathbb{R}^{\featuredim}}\log f(\weights).[$]
Therefore, \eqref{equ_deriv_prog_logreg} can be further developed as
[] \begin{align} \widehat{\weights} & \stackrel{\eqref{equ_deriv_prog_logreg}}{=} \argmax_{\weights \in \mathbb{R}^{\featuredim}} \sum_{\sampleidx=1}^{\samplesize} - \log\big(1\!+\!\exp(- \truelabel^{(\sampleidx)} \weights^{T} \featurevec^{(\sampleidx)}) \big) \nonumber \\ & = \label{equ_deriv_prog_logreg_2}\argmin_{\weights \in \mathbb{R}^{\featuredim}} (1/\samplesize)\sum_{\sampleidx=1}^{\samplesize} \log\big(1\!+\!\exp(- \truelabel^{(\sampleidx)}\weights^{T} \featurevec^{(\sampleidx)}) \big). \end{align} []
Comparing \eqref{equ_deriv_prog_logreg_2} with \eqref{equ_def_emp_risk_logreg} reveals that logistic regression is nothing but maximum likelihood estimation of the parameter vector $\weights$ in the probabilistic model \eqref{equ_prob_model_logistic_model}.
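As an illustrative sketch (the toy data, step size, and iteration count are assumptions), the average logistic loss \eqref{equ_def_emp_risk_logreg} can be minimized by plain gradient descent:

```python
import numpy as np

def avg_logistic_loss(w, X, y):
    """Empirical risk (1/m) sum_i log(1 + exp(-y_i * w^T x_i))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def logreg_gd(X, y, lr=0.1, steps=500):
    """Minimize the average logistic loss by gradient descent."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        z = -y * (X @ w)
        # gradient of the average logistic loss:
        # -(1/m) sum_i y_i * x_i * sigmoid(-y_i w^T x_i)
        grad = -(X.T @ (y / (1.0 + np.exp(-z)))) / m
        w -= lr * grad
    return w

# Hypothetical linearly separable toy data with a single feature.
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = logreg_gd(X, y)
```

The learned `w_hat` yields the maximum likelihood estimate (up to the finite number of iterations) and classifies via the sign of $\widehat{\weights}^{T}\featurevec$.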

Support Vector Machines

Support vector machines (SVMs) are a family of ML methods for learning a hypothesis to predict a binary label $\truelabel$ of a data point based on its features $\featurevec$. Without loss of generality we consider binary labels taking values in the label space $\labelspace = \{-1,1\}$. An SVM uses the linear hypothesis space \eqref{equ_lin_hypospace} which consists of linear maps $h(\featurevec) = \weights^{T} \featurevec$ with some parameter vector $\weights \in \mathbb{R}^{\featuredim}$. Thus, the SVM uses the same hypothesis space as linear regression and logistic regression, which we have discussed in Section Linear Regression and Section Logistic Regression , respectively. What sets the SVM apart from these other methods is the choice of loss function.

Different instances of an SVM are obtained by using different constructions for the features of a data point. Kernel SVMs use the concept of a kernel map to construct (typically high-dimensional) features (see Section Kernel Methods and [6]). In what follows, we assume that the feature construction has been solved and we have access to a feature vector $\featurevec \in \mathbb{R}^{\featuredim}$ for each data point.

Figure fig_svm depicts a dataset $\dataset$ of labeled data points, each characterized by a feature vector $\featurevec^{(\sampleidx)} \in \mathbb{R}^{2}$ (used as coordinates of a marker) and a binary label $\truelabel^{(\sampleidx)} \in \{-1,1\}$ (indicated by different marker shapes). We can partition dataset $\dataset$ into two classes

[$]$$\label{equ_two_classes_svm} \mathcal{C}^{(\truelabel=1)}\!=\!\{ \featurevec^{(\sampleidx)} : \truelabel^{(\sampleidx)}\!=\!1\}\mbox{, and }\mathcal{C}^{(\truelabel=-1)}\!=\!\{ \featurevec^{(\sampleidx)} : \truelabel^{(\sampleidx)}\!=\!-1 \}.$$[$]
The SVM tries to learn a linear map $h^{(\weights)}(\featurevec) = \weights^{T} \featurevec$ that perfectly separates the two classes in the sense of
[] \begin{align} \label{equ_two_classes_svm_perf_sep} \underbrace{h \big( \featurevec^{(\sampleidx)} \big)}_{\weights^{T} \featurevec^{(\sampleidx)} } \gt 0 \mbox{ for } \featurevec^{(\sampleidx)} \in \mathcal{C}^{(\truelabel=1)} \mbox{ and } \underbrace{h \big( \featurevec^{(\sampleidx)} \big)}_{\weights^{T} \featurevec^{(\sampleidx)} } \lt 0 \mbox{ for } \featurevec^{(\sampleidx)} \in \mathcal{C}^{(\truelabel=-1)}. \end{align} []
We refer to a dataset whose data points have binary labels as linearly separable if we can find at least one linear map that separates the two classes in the sense of \eqref{equ_two_classes_svm_perf_sep}. The dataset in Figure fig_svm is linearly separable.

As can be verified easily, any linear map $h^{(\weights)}(\featurevec) = \weights^{T} \featurevec$ achieving zero average hinge loss (the hinge loss is $\max \{ 0 , 1 - \truelabel h(\featurevec) \}$) on the dataset $\dataset$ perfectly separates this dataset in the sense of \eqref{equ_two_classes_svm_perf_sep}. It seems reasonable to learn a linear map by minimizing the average hinge loss. However, one drawback of this approach is that there might be (infinitely) many different linear maps that achieve zero average hinge loss and, in turn, perfectly separate the data points in Figure fig_svm. Indeed, consider a linear map $h^{(\weights)}$ that achieves zero average hinge loss for the dataset $\dataset$ in Figure fig_svm (and therefore perfectly separates it). Then, any other linear map $h^{(\weights')}$ with weights $\weights' = \regparam \weights$, for an arbitrary number $\regparam \gt 1$, also achieves zero average hinge loss (and perfectly separates the dataset).

Neither the separability requirement \eqref{equ_two_classes_svm_perf_sep} nor the hinge loss is sufficient as a sole training criterion. Indeed, many (if not most) datasets are not linearly separable. Even for a linearly separable dataset (such as the one in Figure fig_svm), there are infinitely many linear maps with zero average hinge loss. Which one of these infinitely many different maps should we use? To settle these issues, the SVM uses a “regularized” hinge loss,

[] \begin{align} \loss{(\featurevec,\truelabel)}{h^{(\weights)}} & \defeq \max \{ 0 , 1 - \truelabel\cdot h^{(\weights)}(\featurevec) \} + \regparam \sqeuclnorm{ \weights }\nonumber \\ & \stackrel{h^{(\weights)}(\featurevec) = \weights^{T} \featurevec}{=} \max \{ 0 , 1 - \truelabel\cdot \weights^{T} \featurevec \} + \regparam \sqeuclnorm{ \weights }. \label{equ_loss_svm} \end{align} []
The loss \eqref{equ_loss_svm} augments the hinge loss by the term $\regparam \sqeuclnorm{\weights }$. This term is the scaled (by $\regparam \gt0$) squared Euclidean norm of the weights $\weights$ of the linear hypothesis $h^{(\weights)}$ used to classify data points. It can be shown that adding the term $\regparam \sqeuclnorm{ \weights }$ to the hinge loss has a regularization effect.

The loss \eqref{equ_loss_svm} favours linear maps $h^{(\weights)}$ that are robust against (small) perturbations of the data points. The tuning parameter $\regparam$ in \eqref{equ_loss_svm} controls the strength of this regularization effect and might therefore also be referred to as a regularization parameter. We will discuss regularization on a more general level in Chapter Regularization .

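The regularized hinge loss \eqref{equ_loss_svm}, averaged over a dataset, can be sketched as follows (a hypothetical NumPy illustration; the function name is an assumption). Note how the penalty term distinguishes between weight vectors that differ only by scaling:

```python
import numpy as np

def svm_objective(X, y, w, lam):
    """Average regularized hinge loss over a dataset.

    Per data point: max(0, 1 - y_i * w^T x_i) + lam * ||w||^2;
    the penalty is the same for every data point, so it appears once
    in the average.
    """
    margins = y * (X @ w)                      # y_i * w^T x_i
    hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss per data point
    return np.mean(hinge) + lam * np.dot(w, w)
```

For a separating weight vector, scaling the weights up leaves the hinge term at zero but increases the penalty, so the regularized objective singles out one particular scaling (the maximum-margin one).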
Let us now develop a useful geometric interpretation of the linear hypothesis obtained by minimizing the loss function \eqref{equ_loss_svm}. According to [6](Chapter 2), a classifier $h^{(\weights_{\rm SVM})}$ that minimizes the average loss \eqref{equ_loss_svm} maximizes the distance (margin) $\xi$ between its decision boundary and each of the two classes $\cluster^{(\truelabel=1)}$ and $\cluster^{(\truelabel=-1)}$ (see \eqref{equ_two_classes_svm}). The decision boundary is given by the set of feature vectors $\featurevec$ satisfying $\weights_{\rm SVM}^{T} \featurevec=0$.

Making the margin as large as possible is reasonable as it ensures that the resulting classifications are robust against small perturbations of the features (see Section Robustness ). As depicted in Figure fig_svm, the margin between the decision boundary and the classes $\mathcal{C}^{(\truelabel=1)}$ and $\mathcal{C}^{(\truelabel=-1)}$ is typically determined by a few data points (such as $\featurevec^{(6)}$ in Figure fig_svm) which are closest to the decision boundary. These data points have minimum distance to the decision boundary and are referred to as support vectors.

We highlight that both the SVM and logistic regression use the same hypothesis space of linear maps. Both methods learn a linear classifier $h^{(\weights)} \in \hypospace^{(\featuredim)}$ (see \eqref{equ_lin_hypospace}) whose decision boundary is a hyperplane in the feature space $\featurespace = \mathbb{R}^{\featuredim}$ (see Figure fig_lin_dec_boundary). The difference between the SVM and logistic regression lies in their choice of the loss function used to evaluate the quality of a hypothesis $h^{(\weights)} \in \hypospace^{(\featuredim)}$.

The hinge loss is a (in some sense optimal) convex approximation to the 0/1 loss. Thus, we expect the classifier obtained by the SVM to yield a smaller classification error probability $\prob{ \predictedlabel \neq \truelabel }$ (with $\predictedlabel =1$ if $h(\featurevec)\geq0$ and $\predictedlabel=-1$ otherwise) compared to logistic regression which uses the logistic loss. The SVM is also statistically appealing as it learns a robust hypothesis. Indeed, the hypothesis map with maximum margin is maximally robust against perturbations of the feature vectors of data points. Section Robustness discusses the importance of robustness in ML methods in more detail.

The statistical superiority (in terms of robustness) of the SVM comes at the cost of increased computational complexity. The hinge loss is non-differentiable, which prevents the use of simple gradient-based methods (see Chapter Gradient-Based Learning ) and requires more advanced optimization methods. In contrast, the logistic loss is convex and differentiable. Logistic regression therefore allows for the use of gradient-based methods to minimize the average logistic loss incurred on a training set (see Chapter Gradient-Based Learning ).

Bayes Classifier

Consider data points characterized by features $\featurevec \in \featurespace$ and a binary label $\truelabel \in \labelspace$. We could use any two different label values, but let us assume that the two possible label values are $\truelabel=-1$ or $\truelabel=1$. We would like to find (or learn) a classifier $h: \featurespace \rightarrow \labelspace$ such that the predicted (or estimated) label $\hat{\truelabel} = h(\featurevec)$ agrees with the true label $\truelabel \in \labelspace$ as much as possible. Thus, it is reasonable to assess the quality of a classifier $h$ using the 0/1 loss. We could then learn a classifier using empirical risk minimization (ERM) with the 0/1 loss. However, the resulting optimization problem is typically intractable since the 0/1 loss is non-convex and non-differentiable.

Instead of solving the (intractable) ERM for the 0/1 loss, we can take a different route to construct a classifier. This construction is based on a simple probabilistic model for the data. Using this model, we can interpret the average $0/1$ loss incurred by a hypothesis on a training set as an approximation to the probability $\errprob = \prob{ \truelabel \neq h(\featurevec) }$. Any classifier $\hat{h}$ that minimizes the error probability $\errprob$, which is the expected $0/1$ loss, is referred to as a Bayes estimator. Section ERM for Bayes Classifiers will discuss ML methods using the Bayes estimator in more detail.

Let us derive the Bayes estimator for the special case of a binary classification problem. Here, data points are characterized by features $\featurevec$ and a label $\truelabel \in \{-1,1\}$. Elementary probability theory allows us to derive the Bayes estimator, which is the hypothesis minimizing the expected $0/1$ loss, as

[$]$$\label{equ_def_Bayes_est_binary_class} \hat{h}(\featurevec) = \begin{cases} 1 & \mbox{ if } \prob{\truelabel = 1| \featurevec } \gt \prob{\truelabel = -1| \featurevec } \\ - 1 & \mbox{ otherwise.} \end{cases}$$[$]
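The decision rule \eqref{equ_def_Bayes_est_binary_class} is straightforward to implement once the posterior probability $\prob{\truelabel = 1|\featurevec}$ is available. A minimal Python sketch (the function name and the example posterior values are made up):

```python
def bayes_classifier(p_pos):
    """Bayes estimator for binary labels: predict 1 if the posterior
    probability P(y=1|x) exceeds P(y=-1|x) = 1 - P(y=1|x)."""
    return 1 if p_pos > 1.0 - p_pos else -1

print(bayes_classifier(0.7))   # posterior favours label 1  -> 1
print(bayes_classifier(0.3))   # posterior favours label -1 -> -1
```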

Note that the Bayes estimator \eqref{equ_def_Bayes_est_binary_class} depends on the probability distribution $\prob{\featurevec,\truelabel}$ underlying the data points.\footnote{Remember that we interpret data points as realizations of iid RVs with common probability distribution $\prob{\featurevec,\truelabel}$.} We obtain different Bayes estimators for different probabilistic models. One widely used probabilistic model results in a Bayes estimator that belongs to the linear hypothesis space \eqref{equ_lin_hypospace}. Note that this hypothesis space also underlies logistic regression (see Section Logistic Regression ) and the SVM (see Section Support Vector Machines ). Thus, logistic regression, the SVM and the Bayes estimator are all examples of a linear classifier (see Figure fig_lin_dec_boundary).

A linear classifier partitions the feature space $\featurespace$ into two half-spaces. One half-space consists of all feature vectors $\featurevec$ which result in the predicted label $\predictedlabel=1$, and the other half-space consists of all feature vectors $\featurevec$ which result in the predicted label $\predictedlabel=-1$. The family of ML methods that learn a linear classifier differ in their choices for the loss functions used to assess the quality of these half-spaces.

Kernel Methods

Consider an ML (classification or regression) problem with an underlying feature space $\featurespace$. In order to predict the label $y \in \labelspace$ of a data point based on its features $\featurevec \in \featurespace$, we apply a predictor $h$ selected out of some hypothesis space $\hypospace$. Let us assume that the available computational infrastructure only allows us to use a linear hypothesis space $\hypospace^{(\featuredim)}$ (see \eqref{equ_lin_hypospace}).

For some applications, using a linear hypothesis $h(\featurevec)=\weights^{T}\featurevec$ is not suitable since the relation between features $\featurevec$ and label $\truelabel$ might be highly non-linear. One approach to extend the capabilities of linear hypotheses is to transform the raw features of a data point before applying a linear hypothesis $h$.

The family of kernel methods is based on transforming the features $\featurevec$ to new features $\hat{\featurevec} \in \featurespace'$ which belong to a (typically very) high-dimensional space $\featurespace'$ [6]. It is not uncommon that, while the original feature space is a low-dimensional Euclidean space (e.g., $\featurespace = \mathbb{R}^{2}$), the transformed feature space $\featurespace'$ is an infinite-dimensional function space.

The rationale behind transforming the original features into a new (higher-dimensional) feature space $\featurespace'$ is to reshape the intrinsic geometry of the feature vectors $\featurevec^{(\sampleidx)} \in \featurespace$ such that the transformed feature vectors $\hat{\featurevec}^{(\sampleidx)}$ have a “simpler” geometry (see Figure fig_kernelmethods).

Kernel methods are obtained by formulating ML problems (such as linear regression or logistic regression) using the transformed features $\hat{\featurevec}= \phi(\featurevec)$. A key challenge within kernel methods is the choice of the feature map $\phi: \featurespace \rightarrow \featurespace'$ which maps the original feature vector $\featurevec$ to a new feature vector $\hat{\featurevec}= \phi(\featurevec)$.
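As a minimal illustration of a feature map, consider a (hypothetical) degree-2 polynomial map $\phi: \mathbb{R}^{2} \rightarrow \mathbb{R}^{5}$. Applying a linear hypothesis to the transformed features $\phi(\featurevec)$ yields a quadratic hypothesis in the original features; the particular map and weights below are made up for illustration:

```python
import numpy as np

def phi(x):
    """A (hypothetical) degree-2 polynomial feature map R^2 -> R^5.
    A linear hypothesis on the transformed features corresponds to a
    quadratic (non-linear) hypothesis on the original features."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

# Linear hypothesis applied to the transformed features:
w = np.array([0.0, 0.0, 1.0, 1.0, 0.0])   # represents h(x) = x1^2 + x2^2
x = np.array([1.0, 2.0])
print(float(w @ phi(x)))   # -> 5.0
```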

Decision Trees

A decision tree is a flowchart-like description of a map $h: \featurespace \rightarrow \labelspace$ which maps the features $\featurevec \in \featurespace$ of a data point to a predicted label $h(\featurevec) \in \labelspace$ [7]. While we can use decision trees for an arbitrary feature space $\featurespace$ and label space $\labelspace$, we will discuss them for the particular feature space $\featurespace = \mathbb{R}^{2}$ and label space $\labelspace=\mathbb{R}$.

Figure fig_decision_tree depicts an example of a decision tree. A decision tree consists of nodes which are connected by directed edges. We can think of a decision tree as a step-by-step instruction, or a “recipe”, for how to compute the function value $h(\featurevec)$ given the features $\featurevec \in \featurespace$ of a data point. This computation starts at the “root” node and ends at one of the “leaf” nodes of the decision tree.

A leaf node $\hat{\truelabel}$, which does not have any outgoing edges, represents a decision region $\decreg{\hat{\truelabel}} \subseteq \featurespace$ in the feature space. The hypothesis $h$ associated with a decision tree is constant over the regions $\decreg{\hat{\truelabel}}$, such that $h(\featurevec) = \hat{\truelabel}$ for all $\featurevec \in \decreg{\hat{\truelabel}}$ and some label value $\hat{\truelabel}\in \mathbb{R}$.

The nodes in a decision tree are of two different types,

• decision (or test) nodes, which represent particular “tests” on the feature vector $\featurevec$ (e.g., “is the norm of $\featurevec$ larger than $10$?”).
• leaf nodes, which correspond to subsets of the feature space.

The particular decision tree depicted in Figure fig_decision_tree consists of two decision nodes (including the root node) and three leaf nodes.

Given limited computational resources, we can only use decision trees with a limited depth. The depth of a decision tree is the maximum number of hops it takes to reach a leaf node starting from the root and following the arrows. The decision tree depicted in Figure fig_decision_tree has depth $2$. We obtain an entire hypothesis space by collecting all hypothesis maps that are obtained from the decision tree in Figure fig_decision_tree for some vectors $\vu$ and $\vv$ and some positive radius $\weight \gt 0$. The resulting hypothesis space is parametrized by the vectors $\vu, \vv$ and the number $\weight$.
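A hypothesis from such a family of depth-$2$ trees can be sketched as nested if/else statements. The specific tests (membership in a Euclidean ball), the vectors, the radius and the leaf values below are made-up placeholders, since the exact tests of Figure fig_decision_tree are not reproduced here:

```python
import numpy as np

def tree_predict(x, u, v, r, leaf_vals):
    """Depth-2 decision tree sketch: the root node tests whether x lies
    within radius r of u; a second decision node tests the distance to v.
    leaf_vals holds the (hypothetical) constant label of each leaf."""
    if np.linalg.norm(x - u) <= r:          # root (decision node)
        if np.linalg.norm(x - v) <= r:      # second decision node
            return leaf_vals[0]
        return leaf_vals[1]
    return leaf_vals[2]

u, v, r = np.array([0.0, 0.0]), np.array([1.0, 0.0]), 1.0
print(tree_predict(np.array([0.5, 0.0]), u, v, r, [10, 20, 30]))  # both tests pass -> 10
print(tree_predict(np.array([5.0, 5.0]), u, v, r, [10, 20, 30]))  # root test fails -> 30
```

Varying `u`, `v` and `r` sweeps out the parametrized hypothesis space described above.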

To assess the quality of a particular decision tree we can use various loss functions. Examples of loss functions used to measure the quality of a decision tree are the squared error loss for numeric labels (regression) or the impurity of the individual decision regions for categorical labels (classification).

Decision tree methods use as their hypothesis space the set of all hypotheses that are represented by a family of decision trees. Figure fig_hypospace_DT_depth_2 depicts a collection of decision trees which are characterized by having depth at most two. More generally, we can construct a collection of decision trees using a fixed set of “elementary tests” on the input feature vector, such as $\| \featurevec \| \gt 3$ or $\feature_{3} \lt 1$. These tests might also involve a continuous (real-valued) parameter, such as $\{ \feature_{2} \gt \weight \}_ {\weight \in [0,10]}$. We then build a hypothesis space by considering all decision trees not exceeding a maximum depth and whose decision nodes carry out one of the elementary tests.

A decision tree represents a map $h: \featurespace \rightarrow \labelspace$ which is piecewise-constant over regions of the feature space $\featurespace$. These non-overlapping decision regions partition the feature space into subsets of features that are all mapped to the same predicted label. Each leaf node of a decision tree corresponds to one particular decision region. Using large decision trees that contain many different test nodes, we can learn a hypothesis with highly complicated decision regions. These decision regions can be chosen such that they perfectly align with almost any given labeled dataset (see Figure fig_decisiontree_overfits).

Using a sufficiently large (deep) decision tree, we can obtain a hypothesis map that closely approximates any given non-linear map (under mild technical conditions such as Lipschitz continuity). This is quite different from ML methods using the linear hypothesis space \eqref{equ_lin_hypospace}, such as linear regression, logistic regression or the SVM. These methods learn linear hypothesis maps with a rather simple geometry. Indeed, a linear map is constant along hyperplanes. Moreover, the decision regions obtained from linear classifiers are always entire half-spaces (see Figure fig_lin_dec_boundary).

Deep Learning

Another example of a hypothesis space uses a signal-flow representation of a hypothesis map $h: \mathbb{R}^{\featuredim} \rightarrow \mathbb{R}$. This signal-flow representation is referred to as an artificial neural network (ANN). As the name indicates, an ANN is a network of interconnected elementary computational units. These computational units might be referred to as artificial neurons or just neurons.

Figure fig_activate_neuron depicts the simplest possible ANN, which consists of a single neuron. The neuron computes a weighted sum $z$ of the inputs and then applies an activation function $\actfun$ to produce the output $\actfun(z)$. The $\featureidx$-th input of the neuron is assigned a parameter or weight $\weight_{\featureidx}$. For a given choice of weights, the ANN in Figure fig_activate_neuron represents a hypothesis map $h^{(\weights)}(\featurevec) = \actfun(z) = \actfun \big( \sum_{\featureidx} \weight_{\featureidx} \feature_{\featureidx} \big)$.

The ANN in Figure fig_activate_neuron defines a hypothesis space that is constituted by all maps $h^{(\weights)}$ obtained for different choices for the weights $\weights$ in Figure fig_activate_neuron. Note that the single-neuron ANN in Figure fig_activate_neuron reduces to a linear map when we use the activation function $\actfun(z) = z$. However, even when we use a non-linear activation function in Figure fig_activate_neuron, the resulting hypothesis space is essentially the same as the space of linear maps \eqref{equ_lin_hypospace}. In particular, if we threshold the output of the ANN in Figure fig_activate_neuron to obtain a binary label, we will always obtain a linear classifier like logistic regression and SVM (see Section Logistic Regression and Section Support Vector Machines ).

Deep learning methods use ANNs consisting of many (thousands to millions of) interconnected neurons [8].

In principle, the interconnections between neurons can be arbitrary. One widely used approach, however, is to organize neurons into layers and place connections mainly between neurons in consecutive layers [8]. Figure fig_ANN depicts an example of an ANN consisting of one hidden layer that represents a (parametrized) hypothesis $h^{(\weights)}: \mathbb{R}^{\featuredim} \rightarrow \mathbb{R}$.

The first layer of the ANN in Figure fig_ANN is referred to as the input layer. The input layer reads in the feature vector $\featurevec \in \mathbb{R}^{\featuredim}$ of a data point. The features $\feature_{\featureidx}$ are then multiplied with the weights $\weight_{\featureidx,\featureidx'}$ associated with the link between the $\featureidx$th input node (“neuron”) and the $\featureidx'$th node in the middle (hidden) layer. The output of the $\featureidx'$-th node in the hidden layer is given by $s_{\featureidx'}\!=\!\actfun( \sum_{\featureidx=1}^{\featuredim} \weight_{\featureidx,\featureidx'} \feature_{\featureidx} )$ with some (typically non-linear) activation function $\actfun: \mathbb{R} \rightarrow \mathbb{R}$. In general, the argument of the activation function is the weighted combination $\sum_{\featureidx} \weight_{\featureidx,\featureidx'} s_{\featureidx}$ of the outputs $s_{\featureidx}$ of the nodes in the previous layer. For the ANN depicted in Figure fig_ANN, the output of neuron $s_{1}$ is $\actfun(z)$ with $z=\weight_{1,1}\feature_{1} + \weight_{2,1}\feature_{2}$.
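The layer-wise computation just described can be sketched as a forward pass in Python. The layer sizes, the weight values and the choice of a ReLU activation below are made up for illustration:

```python
import numpy as np

def relu(z):
    """ReLU activation function f(z) = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def forward(x, W_hidden, w_out):
    """Forward pass of a single-hidden-layer ANN: each hidden neuron j
    computes s_j = f(sum_i W[j, i] * x_i); the output neuron combines
    the hidden outputs s_j linearly."""
    s = relu(W_hidden @ x)     # hidden-layer outputs
    return float(w_out @ s)    # output value h(x)

W_hidden = np.array([[1.0, -1.0],
                     [0.5, 0.5]])   # hidden-layer weights (made up)
w_out = np.array([1.0, 2.0])        # output-layer weights (made up)
x = np.array([2.0, 1.0])
print(forward(x, W_hidden, w_out))  # relu([1, 1.5]) = [1, 1.5]; 1 + 3 = 4.0
```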

The hypothesis map represented by an ANN is parametrized by the weights of the connections between neurons. Moreover, the resulting hypothesis map depends also on the choice of the activation functions of the individual neurons. These activation functions are a design choice that can be adapted to the statistical properties of the data. However, a few particular choices for the activation function have proven useful in many important application domains. Two popular choices for the activation function used within ANNs are the sigmoid function $\actfun(z) = \frac{1}{1+\exp(-z)}$ and the rectified linear unit (ReLU) $\actfun(z) = \max\{0,z\}$. An ANN using many, say $10$, hidden layers is often referred to as a “deep net”. ML methods using hypothesis spaces obtained from deep ANNs (deep nets) are known as deep learning methods [8].

It can be shown that an ANN with a single (but arbitrarily large) hidden layer can approximate any given map $h: \featurespace \rightarrow \labelspace=\mathbb{R}$ to any desired accuracy [9]. Deep learning methods often use an ANN with a relatively large number (more than hundreds) of hidden layers. We refer to an ANN with a relatively large number of hidden layers as a deep net.

There is empirical and theoretical evidence that using many hidden layers, instead of few but wide layers, is computationally and statistically favourable [10][11] and [8](Ch. 6.4.1.). The hypothesis map $h^{(\weights)}$ represented by an ANN can be evaluated (to obtain the predicted label at the output) efficiently using message passing over the ANN. This message passing can be implemented using parallel and distributed computers. Moreover, the graphical representation of a parametrized hypothesis in the form of an ANN allows us to efficiently compute the gradient of the loss function via a (highly scalable) message passing procedure known as back-propagation [8]. Being able to quickly compute gradients is instrumental for the efficiency of gradient-based methods for learning a good choice of the ANN weights (see Chapter Gradient-Based Learning ).

Maximum Likelihood

For many applications it is useful to model the observed data points $\datapoint^{(\sampleidx)}$, with $\sampleidx=1,\ldots,\samplesize$, as iid realizations of a RV $\datapoint$ with probability distribution $\prob{\datapoint; \weights}$. This probability distribution is parametrized by a parameter vector $\weights \in \mathbb{R}^{\featuredim}$. A principled approach to estimating the parameter vector $\weights$ based on a set of iid realizations $\datapoint^{(1)},\ldots,\datapoint^{(\samplesize)}\sim \prob{\vz; \weights}$ is maximum likelihood estimation [12].

Maximum likelihood estimation can be interpreted as an ML problem with a hypothesis space parametrized by the parameter vector $\weights$. Each element $h^{(\weights)}$ of the hypothesis space $\hypospace$ corresponds to one particular choice for the parameter vector $\weights$. Maximum likelihood methods use the loss function

[$]$$\label{equ_loss_ML} \loss{\datapoint}{h^{(\weights)}} \defeq - \log \prob{\datapoint; \weights}.$$[$]

A widely used choice for the probability distribution $p \big( \datapoint; \weights \big)$ is a multivariate normal (Gaussian) distribution with mean ${\bm \mu}$ and covariance matrix ${\bf \Sigma}$, both of which constitute the parameter vector $\weights = ({\bm \mu}, {\bf \Sigma})$ (we have to reshape the matrix ${\bf \Sigma}$ suitably into a vector form). Given the iid realizations $\datapoint^{(1)},\ldots,\datapoint^{(\samplesize)} \sim p \big( \datapoint; \weights \big)$, the maximum likelihood estimates $\hat{\bm \mu}$, $\widehat{\bf \Sigma}$ of the mean vector and the covariance matrix are obtained via

[$]$$\label{equ_max_likeli} \hat{\bm \mu}, \widehat{\bf \Sigma} = \argmin_{{\bm \mu} \in \mathbb{R}^{\featurelen},{\bf \Sigma} \in \mathbb{S}^{\featurelen}_{+}} (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} - \log p\big(\datapoint^{(\sampleidx)}; ({\bm \mu},{\bf \Sigma})\big).$$[$]

The optimization in \eqref{equ_max_likeli} is over all possible choices for the mean vector ${\bm \mu} \in \mathbb{R}^{\featurelen}$ and the covariance matrix ${\bf \Sigma} \in \mathbb{S}^{\featurelen}_{+}$. Here, $\mathbb{S}^{\featurelen}_{+}$ denotes the set of all positive semi-definite (psd) Hermitian $\featurelen \times \featurelen$ matrices.

The maximum likelihood problem \eqref{equ_max_likeli} can be interpreted as an instance of ERM using the particular loss function \eqref{equ_loss_ML}. The resulting estimates are given explicitly as

[$] $$\label{equ_ML_mean_cov_Gauss} \hat{\bm \mu} = (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \datapoint^{(\sampleidx)} \mbox{, and } \widehat{\bf \Sigma} = (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} (\datapoint^{(\sampleidx)} - \hat{\bm \mu})(\datapoint^{(\sampleidx)} - \hat{\bm \mu})^{T}.$$ [$]

Note that the expressions \eqref{equ_ML_mean_cov_Gauss} are only valid when the probability distribution of the data points is modelled as a multivariate normal distribution.
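The closed-form estimates \eqref{equ_ML_mean_cov_Gauss} can be computed directly with numpy. A minimal sketch (the function name and the toy data are made up; note the factor $1/\samplesize$, rather than the factor $1/(\samplesize-1)$ used by the unbiased sample covariance):

```python
import numpy as np

def ml_gaussian(Z):
    """Maximum likelihood estimates of mean and covariance for data
    modelled as iid realizations of a multivariate normal RV.
    Z has shape (m, d): m realizations of a d-dimensional RV."""
    m = Z.shape[0]
    mu_hat = Z.mean(axis=0)                    # sample mean
    centered = Z - mu_hat
    sigma_hat = (centered.T @ centered) / m    # ML covariance (factor 1/m)
    return mu_hat, sigma_hat

Z = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mu_hat, sigma_hat = ml_gaussian(Z)
print(mu_hat)   # -> [3. 4.]
```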

Nearest Neighbour Methods

Nearest neighbour (NN) methods are a family of ML methods that are characterized by a specific construction of the hypothesis space. NN methods can be applied to regression problems involving numeric labels (e.g., using label space $\labelspace=\mathbb{R}$ ) as well as for classification problems involving categorical labels (e.g., with label space $\labelspace = \{-1,1\}$).

While nearest neighbour (NN) methods can be combined with arbitrary label spaces, they require the feature space to have a specific structure. NN methods require the feature space to be a metric space [1] that provides a measure for the distance between different feature vectors. We need a metric, or distance measure, to determine the nearest neighbours of a data point. A prominent example of a metric feature space is the Euclidean space $\mathbb{R}^{\featuredim}$, whose metric is given by the Euclidean distance $\| \featurevec\!-\!\featurevec' \|$ between two vectors $\featurevec, \featurevec' \in \mathbb{R}^{\featuredim}$.

Consider a training set $\dataset = \{ (\featurevec^{(\sampleidx)},\truelabel^{(\sampleidx)}) \}_{\sampleidx=1}^{\samplesize}$ that consists of labeled data points. Thus, for each data point we know the features and the label value. Given such a training set, NN methods construct a hypothesis space that consists of piece-wise constant maps $h: \featurespace \rightarrow \labelspace$.

For any hypothesis $h$ in that space, the function value $h(\featurevec)$ obtained for a data point with features $\featurevec$ depends only on the (labels of the) $k$ nearest data points (smallest distance to $\featurevec$) in the training set $\dataset$. The number $k$ of NNs used to determine the function value $h(\featurevec)$ is a design (hyper-) parameter of a NN method. NN methods are also referred to as $k$-NN methods to make their dependence on the parameter $k$ explicit.

Let us illustrate NN methods by considering a binary classification problem using an odd number for $k$ (e.g., $k=3$ or $k=5$).

The goal is to learn a hypothesis that predicts the binary label $\truelabel \in \{-1,1\}$ of a data point based on its feature vector $\featurevec \in \mathbb{R}^{\featuredim}$. This learning task can make use of a training set $\dataset$ containing $\samplesize \gt k$ data points with known labels. Given a data point with features $\featurevec$, denote by $\neighbourhood{k}$ a set of $k$ data points in $\dataset$ whose feature vectors have the smallest distance to $\featurevec$. The number of data points in $\neighbourhood{k}$ with label $1$ is denoted $\samplesize^{(k)}_{1}$, and the number with label $-1$ is denoted $\samplesize^{(k)}_{-1}$. The $k$-NN method “learns” a hypothesis $\hat{h}$ given by

[$] $$\label{equ_def_k_nn} \hat{h}(\featurevec) = \begin{cases} 1 & \mbox{ if } \samplesize^{(k)}_{1} \gt \samplesize^{(k)}_{-1} \\ - 1 & \mbox{ otherwise.} \end{cases}$$[$]
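The $k$-NN rule \eqref{equ_def_k_nn} can be sketched in a few lines of Python. The function name and the toy training set are made up for illustration:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """k-NN classifier for labels in {-1, +1}: find the k training points
    closest to x (Euclidean distance) and take a majority vote over
    their labels. With odd k and labels in {-1, +1} ties cannot occur."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn_idx = np.argsort(dists)[:k]     # indices of the k nearest neighbours
    votes = y_train[nn_idx].sum()      # equals m_1^(k) - m_{-1}^(k)
    return 1 if votes > 0 else -1

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.2, 0.1]])
y_train = np.array([1, 1, -1, -1, 1])
print(knn_predict(np.array([0.05, 0.05]), X_train, y_train, k=3))  # -> 1
```

Note that the training set itself has to be available at prediction time, which the discussion below elaborates on.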

It is important to note that, in contrast to the ML methods in Section Linear Regression - Section Deep Learning , the hypothesis space of $k$-NN methods depends on a labeled dataset (training set) $\dataset$. As a consequence, $k$-NN methods need to access (and store) the training set whenever they compute a prediction (evaluate $h(\featurevec)$).

To compute the prediction $h(\featurevec)$ for a data point with features $\featurevec$, $k$-NN needs to determine its NNs in the training set. When using a large training set this implies a large storage requirement for $k$-NN methods. Moreover, $k$-NN methods might be prone to revealing sensitive information through their predictions (see Exercise).

For a fixed $k$, NN methods do not require any parameter tuning. Such parameter tuning (or learning) is required by linear regression, logistic regression and deep learning methods. In contrast, the hypothesis “learnt” by NN methods is characterized point-wise, for each possible value of the features $\featurevec$, by the NNs in the training set. Compared to the ML methods in Section Linear Regression - Section Deep Learning , NN methods do not require solving (challenging) optimization problems for model parameters. Besides their low computational requirements (put aside the memory requirements), NN methods are also conceptually appealing as natural approximations of Bayes estimators (see [13] and Exercise).

Deep Reinforcement Learning

Deep reinforcement learning (DRL) refers to a subset of ML problems and methods that revolve around the control of dynamic systems, such as autonomous cars or cleaning robots [14][15][16]. A DRL problem involves data points that represent the states of a dynamic system at different time instants $\timeidx=0,1,\ldots$. The data point representing the state at some time instant $\timeidx$ is characterized by the feature vector $\featurevec^{(\timeidx)}$. The entries of this feature vector are the individual features of the state at time $\timeidx$. These features might be obtained via sensors, on-board cameras or other ML methods (that predict the location of the dynamic system). The label $\truelabel^{(\timeidx)}$ of a data point might represent the optimal steering angle at time $\timeidx$.

DRL methods learn a hypothesis $h$ that delivers optimal predictions $\hat{\truelabel}^{(\timeidx)} \defeq h \big( \featurevec^{(\timeidx)} \big)$ for the optimal steering angle $\truelabel^{(\timeidx)}$. As their name indicates, DRL methods use hypothesis spaces obtained from a deep net (see Section Deep Learning ). The quality of the prediction $\hat{\truelabel}^{(\timeidx)}$ obtained from a hypothesis is measured by the loss $\loss{\big(\featurevec^{(\timeidx)},\truelabel^{(\timeidx)} \big)}{h} \defeq - \reward^{(\timeidx)}$ with a reward signal $\reward^{(\timeidx)}$. This reward signal might be obtained from a distance (collision avoidance) sensor or low-level characteristics of an on-board camera snapshot.

The (negative) reward signal $- \reward^{(\timeidx)}$ typically depends on the feature vector $\featurevec^{(\timeidx)}$ and the discrepancy between optimal steering direction $\truelabel^{(\timeidx)}$ (which is unknown) and its prediction $\hat{\truelabel}^{(\timeidx)} \defeq h \big( \featurevec^{(\timeidx)} \big)$. However, what sets DRL methods apart from other ML methods such as linear regression (see Section Linear Regression ) or logistic regression (see Section Logistic Regression ) is that they can evaluate the loss function only point-wise $\loss{\big(\featurevec^{(\timeidx)},\truelabel^{(\timeidx)} \big)}{h}$ for the specific hypothesis $h$ that has been used to compute the prediction $\hat{\truelabel}^{(\timeidx)} \defeq h \big( \featurevec^{(\timeidx)} \big)$ at time instant $\timeidx$. This is fundamentally different from linear regression that uses the squared error loss which can be evaluated for every possible hypothesis $h \in \hypospace$.

LinUCB

ML methods are instrumental for various recommender systems [17]. A basic form of a recommender system amounts to choosing, at some time instant $\timeidx$, the most suitable item (product, song, movie) among a finite set of alternatives $\actionidx = 1,\ldots,\nractions$. Each alternative is characterized by a feature vector $\featurevec^{(\timeidx,\actionidx)}$ that varies between different time instants.

The data points arising in recommender systems might represent time instants $\timeidx$ at which recommendations are computed. The data point at time $\timeidx$ is characterized by a feature vector

[$]$$\label{equ_stacked_feature_LinUCB} \featurevec^{(\timeidx)} = \big( \big(\featurevec^{(\timeidx,1)}\big)^{T},\ldots,\big( \featurevec^{(\timeidx,\nractions)} \big)^{T} \big)^{T}.$$[$]

The feature vector $\featurevec^{(\timeidx)}$ is obtained by stacking the feature vectors of all alternatives at time $\timeidx$ into a single long feature vector. The label of the data point $\timeidx$ is a vector of rewards $\labelvec^{(\timeidx)} \defeq \big ( \reward_{1}^{(\timeidx)}, \ldots,\reward_{\nractions}^{(\timeidx)}\big)^{T} \in \mathbb{R}^{\nractions}$. The entry $\reward_{\actionidx}^{(\timeidx)}$ represents the reward obtained by choosing (recommending) alternative $\actionidx$ (with features $\featurevec^{(\timeidx,\actionidx)}$) at time $\timeidx$. We might interpret the reward $\reward_{\actionidx}^{(\timeidx)}$ as an indicator of whether the customer actually buys the product corresponding to the recommended alternative $\actionidx$.

The ML method LinUCB (the name seems to be inspired by the terms “linear” and “upper confidence bound” (UCB)) aims at learning a hypothesis $h$ that allows us to predict the rewards $\labelvec^{(\timeidx)}$ based on the feature vector $\featurevec^{(\timeidx)}$ \eqref{equ_stacked_feature_LinUCB}. As its hypothesis space $\hypospace$, LinUCB uses the space of linear maps from the stacked feature vectors $\mathbb{R}^{\featurelen \nractions}$ to the space of reward vectors $\mathbb{R}^{\nractions}$. This hypothesis space can be parametrized by matrices $\mW \in \mathbb{R}^{\nractions \times \featurelen \nractions}$. Thus, LinUCB learns a hypothesis that computes predicted rewards via

[$]$$\widehat{\labelvec}^{(\timeidx)} \defeq \mW \featurevec^{(\timeidx)}.$$[$]
The entries of $\widehat{\labelvec}^{(\timeidx)} = \big( \hat{\reward}_{1}^{(\timeidx)},\ldots,\hat{\reward}_{\nractions}^{(\timeidx)}\big)^{T}$ are predictions of the individual rewards $\reward_{\actionidx}^{(\timeidx)}$. It seems natural to recommend at time $\timeidx$ the alternative $\actionidx$ whose predicted reward is maximum. However, it turns out that this approach is sub-optimal as it prevents the recommender system from learning the optimal predictor map $\mW$.

Loosely speaking, LinUCB tries out (explores) each alternative $\actionidx \in \{1,\ldots,\nractions\}$ sufficiently often to obtain a sufficient amount of training data for learning a good weight matrix $\mW$. At time $\timeidx$, LinUCB chooses the alternative $\actionidx^{(\timeidx)}$ that maximizes the quantity

[$]$$\label{equ_linucb_bound} \hat{\reward}_{\actionidx}^{(\timeidx)} + R(\timeidx,\actionidx) \mbox{ , } \actionidx = 1,\ldots,\nractions.$$[$]
We can think of the component $R(\timeidx,\actionidx)$ as a form of confidence interval. It is constructed such that \eqref{equ_linucb_bound} upper bounds the actual reward $\reward_{\actionidx}^{(\timeidx)}$ with a prescribed level of confidence (or probability). The confidence term $R(\timeidx,\actionidx)$ depends on the feature vectors $\featurevec^{(\timeidx',\actionidx)}$ of the alternative $\actionidx$ at previous time instants $\timeidx' \lt \timeidx$. Thus, at each time instant $\timeidx$, LinUCB chooses the alternative $\actionidx$ that results in the largest upper confidence bound (UCB) \eqref{equ_linucb_bound} on the reward (hence the “UCB” in LinUCB). We refer to the relevant literature on sequential learning (and decision making) for more details on LinUCB [17].
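One selection step following \eqref{equ_linucb_bound} can be sketched in a few lines of Python. The weight matrix, the stacked features and, in particular, the confidence terms $R(\timeidx,\actionidx)$ below are made-up placeholders; the actual construction of the confidence term follows the LinUCB literature [17]:

```python
import numpy as np

def linucb_choose(W, x_stacked, R):
    """One LinUCB selection step (sketch): predict the rewards of all
    alternatives via the linear map W, add a confidence term R[a] per
    alternative, and pick the alternative with the largest upper
    confidence bound. The construction of R is omitted here."""
    r_hat = W @ x_stacked        # predicted rewards, one per alternative
    ucb = r_hat + R              # upper confidence bounds on the rewards
    return int(np.argmax(ucb))

# Two alternatives with 2-dimensional features (made-up numbers):
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
x_stacked = np.array([0.5, 0.0, 0.4, 0.0])   # stacked feature vectors
R = np.array([0.0, 0.2])                      # placeholder confidence terms
print(linucb_choose(W, x_stacked, R))         # 0.5 vs 0.4 + 0.2 = 0.6 -> alternative 1
```

A larger confidence term makes a rarely explored alternative more likely to be chosen, which is exactly the exploration behaviour described above.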

Notes

1. The precise formulation of this statement is known as the “Stone-Weierstrass Theorem” [1](Thm. 7.26).

General References

Jung, Alexander (2022). Machine Learning: The Basics. Singapore: Springer. doi:10.1007/978-981-16-8193-6.

Jung, Alexander (2022). "Machine Learning: The Basics". arXiv:1805.05052.

References

1. W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, New York, 3rd edition, 1976.
2. P. J. Huber. Robust Statistics. Wiley, New York, 1981.
3. M. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, Cambridge, 2019.
4. P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, New York, 2011.
5. G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996.
6. C. Lampert. Kernel methods in computer vision. Foundations and Trends in Computer Graphics and Vision, 2009.
7. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
8. I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
9. G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303--314, 1989.
10. R. Eldan and O. Shamir. The power of depth for feedforward neural networks. CoRR, abs/1512.03965, 2015.
11. T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503--519, 2017.
12. E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, 2nd edition, 1998.
13. T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21--27, 1967.
14. S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res., 17, 2016.
15. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.
16. A. Ng. Shaping and Policy Search in Reinforcement Learning. PhD thesis, University of California, Berkeley, 2003.
17. L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. International World Wide Web Conference, pages 661--670, Raleigh, North Carolina, USA, April 2010.