Jun 12'23

Consider a linear regression problem with data points $(\feature,\truelabel)$ characterized by a scalar feature $\feature$ and a numeric label $\truelabel$.

Assume data points are realizations of independent and identically distributed (iid) random variables (RVs) whose common probability distribution is multivariate normal with zero mean and covariance matrix $\mathbf{C} = \begin{pmatrix} \sigma^2_{\feature} & \sigma_{\feature,\truelabel} \\ \sigma_{\feature,\truelabel} & \sigma^{2}_{\truelabel} \end{pmatrix}$.

The entries of this covariance matrix are the variance $\sigma^2_{\feature}$ of the (zero-mean) feature, the variance $\sigma^2_{\truelabel}$ of the (zero-mean) label, and the covariance $\sigma_{\feature,\truelabel}$ between the feature and the label of a random data point.

How many data points do we need to include in a validation set such that, with probability at least $0.8$, the validation error of a given hypothesis $h$ deviates by no more than $20$ percent from its expected loss?
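One way to approach this numerically is via Chebyshev's inequality, $\mathbb{P}\{|\bar{L} - \mu| \geq \varepsilon\} \leq \operatorname{Var}(L)/(m \varepsilon^{2})$, applied to the average squared error loss over $m$ validation points. The sketch below uses assumed covariance entries and an arbitrary fixed hypothesis $h(\feature) = 0.3\,\feature$ (neither is specified in the exercise) and estimates the mean and variance of the loss by Monte Carlo:

```python
import numpy as np

# Sketch: estimate how many validation points m are needed so that, with
# probability >= 0.8, the average loss deviates by at most 20 percent from
# its expectation. Chebyshev: P(|L_bar - mu| >= eps) <= var / (m * eps^2).
rng = np.random.default_rng(0)

# Assumed toy values for the covariance entries (not given in the exercise).
sigma_x2, sigma_y2, sigma_xy = 1.0, 1.0, 0.5
C = np.array([[sigma_x2, sigma_xy], [sigma_xy, sigma_y2]])

# Fixed hypothesis h(x) = 0.3 * x (arbitrary choice for illustration).
w = 0.3

# Monte Carlo estimate of the mean and variance of the squared error loss.
z = rng.multivariate_normal([0.0, 0.0], C, size=1_000_000)
loss = (z[:, 1] - w * z[:, 0]) ** 2
mu, var = loss.mean(), loss.var()

eps = 0.2 * mu    # allowed deviation: 20 percent of the expected loss
delta = 0.2       # allowed failure probability: 1 - 0.8
m = int(np.ceil(var / (delta * eps ** 2)))
print(f"Chebyshev bound: m >= {m} validation points")
```

Since the loss here is the square of a zero-mean Gaussian, $\operatorname{Var}(L) = 2 \mu^{2}$, so the bound simplifies to $m \geq 2/(0.2 \cdot 0.2^{2}) = 250$ regardless of the assumed covariance values; a sharper answer could use the exact chi-square distribution of the loss instead of Chebyshev.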

Jun 12'23

Linear regression learns a linear hypothesis map $\hat{h}$ having minimal average squared error on a training set. The learnt hypothesis $\hat{h}$ is then validated on a validation set which is different from the training set.

Can you construct a training set and validation set such that the validation error of $\hat{h}$ is strictly smaller than the training error of $\hat{h}$?
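One concrete construction, sketched in numpy (the specific data values are my own choice): pick a training set that no line fits perfectly, then place the validation points exactly on the learnt line, so the validation error is zero while the training error is strictly positive.

```python
import numpy as np

# Training set: three points that no line fits perfectly,
# so the training error of the least-squares fit is > 0.
X_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.0, 1.0, 0.0])

# Fit h(x) = w*x + b by least squares.
A = np.stack([X_train, np.ones_like(X_train)], axis=1)
(w, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

train_err = np.mean((y_train - (w * X_train + b)) ** 2)

# Validation set: points placed exactly on the learnt line,
# so the validation error is 0 < training error.
X_val = np.array([3.0, 4.0])
y_val = w * X_val + b
val_err = np.mean((y_val - (w * X_val + b)) ** 2)

print(f"training error: {train_err:.4f}, validation error: {val_err:.4f}")
```

For this training set the least-squares line is $h(x) = 1/3$, giving a training error of $2/9$, while the validation error is exactly $0$.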

Jun 12'23

The usefulness of the validation error as an indicator for the performance of a hypothesis depends on the size of the validation set.

Experiment with different ML methods and datasets to find out the minimum required size for the validation set.
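A minimal version of such an experiment, assuming a simple synthetic linear model of my own choosing: learn one hypothesis from a fixed training set, then measure how much the validation error fluctuates across many freshly drawn validation sets of different sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 2*x + noise; a hypothesis is learnt once.
x_train = rng.normal(size=50)
y_train = 2 * x_train + rng.normal(scale=0.5, size=50)
w = (x_train @ y_train) / (x_train @ x_train)  # least squares, no intercept

# For each candidate validation-set size m, estimate the spread of the
# validation error over 200 freshly drawn validation sets.
spreads = {}
for m in [10, 100, 1000]:
    errs = []
    for _ in range(200):
        x_val = rng.normal(size=m)
        y_val = 2 * x_val + rng.normal(scale=0.5, size=m)
        errs.append(np.mean((y_val - w * x_val) ** 2))
    spreads[m] = np.std(errs)
    print(f"m={m:5d}: mean={np.mean(errs):.3f}, std={spreads[m]:.3f}")
```

The spread shrinks roughly like $1/\sqrt{m}$; "large enough" then means the spread is small relative to the differences in validation error between the hypotheses being compared.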

Jun 12'23

Consider data points that are characterized by $\featuredim=1000$ numeric features $\feature_{1},\ldots,\feature_{\featuredim} \in \mathbb{R}$ and a numeric label $\truelabel \in \mathbb{R}$. We want to learn a linear hypothesis map $h(\featurevec) = \weights^{T} \featurevec$ for predicting the label of a data point based on its features.

Could it be beneficial to constrain the learnt hypothesis by requiring it to only depend on the first $5$ features of a data point?
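It can be, when the label actually depends only on a few features and the training set is small. A sketch under that assumption (the ground-truth weights, sample sizes, and noise level below are my own choices): with fewer training points than features, unconstrained least squares interpolates the training data and generalizes poorly, while the constrained fit does not.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 1000  # fewer training points than features

# Assumed ground truth: the label depends only on the first 5 features.
w_true = np.zeros(d)
w_true[:5] = rng.normal(size=5)

X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=0.1, size=n)
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true + rng.normal(scale=0.1, size=1000)

# Unconstrained least squares (min-norm solution since d > n).
w_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Constrained: only the first 5 features may be used.
w_5 = np.zeros(d)
w_5[:5] = np.linalg.lstsq(X[:, :5], y, rcond=None)[0]

err_full = np.mean((y_test - X_test @ w_full) ** 2)
err_5 = np.mean((y_test - X_test @ w_5) ** 2)
print(f"test error, all 1000 features: {err_full:.3f}")
print(f"test error, first 5 features:  {err_5:.3f}")
```

If the label in fact depended on features outside the first five, the constraint would instead introduce a bias; the benefit hinges on the constraint matching the true dependence.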

Consider data points that are characterized by a single numeric feature $\feature$ and a numeric label $\truelabel$. We model the feature and label of a data point as the realization of a Gaussian random vector $\datapoint \sim \mathcal{N}(0,\mathbf{C})$ with zero mean and covariance matrix $\mathbf{C}$.
The optimal hypothesis $\hat{h}(\feature)$ to predict the label $\truelabel$ given the feature $\feature$ is the conditional expectation of the (unobserved) label $\truelabel$ given the (observed) feature $\feature$.
How is the expected squared error loss of this optimal hypothesis (which is the Bayes estimator) related to the covariance matrix $\mathbf{C}$ of the Gaussian random vector $\datapoint$?
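For jointly Gaussian $(\feature,\truelabel)$ the conditional expectation is linear, $\hat{h}(\feature) = (\sigma_{\feature,\truelabel}/\sigma^{2}_{\feature})\,\feature$, and its expected squared error is $\sigma^{2}_{\truelabel} - \sigma^{2}_{\feature,\truelabel}/\sigma^{2}_{\feature} = \det(\mathbf{C})/\sigma^{2}_{\feature}$. A Monte Carlo check of this relation, using assumed example values for the entries of $\mathbf{C}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed example values for the covariance entries.
sigma_x2, sigma_y2, sigma_xy = 2.0, 1.0, 0.8
C = np.array([[sigma_x2, sigma_xy], [sigma_xy, sigma_y2]])

# For jointly Gaussian (x, y), the conditional mean is linear:
# E[y | x] = (sigma_xy / sigma_x^2) * x, and the resulting minimum
# expected squared error is sigma_y^2 - sigma_xy^2 / sigma_x^2
# = det(C) / sigma_x^2.
z = rng.multivariate_normal([0.0, 0.0], C, size=1_000_000)
x, y = z[:, 0], z[:, 1]
y_hat = (sigma_xy / sigma_x2) * x

empirical = np.mean((y - y_hat) ** 2)
theoretical = sigma_y2 - sigma_xy ** 2 / sigma_x2
print(f"empirical: {empirical:.4f}, theoretical: {theoretical:.4f}")
```

The two printed values should agree up to Monte Carlo error; note the expected loss equals $\det(\mathbf{C})/\sigma^{2}_{\feature}$, so it vanishes exactly when $\mathbf{C}$ is singular, i.e., when the label is a deterministic linear function of the feature.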