Jun 11'23

What conditions on a training set ensure that there is a unique optimal linear hypothesis map for linear regression?

Jun 11'23

Linear regression uses the squared error loss to measure the quality of a linear hypothesis map. We learn the weights $\weights$ of a linear map via ERM using a training set $\dataset$ that consists of $\samplesize=100$ data points. Each data point is characterized by $\featurelen=5$ features and a numeric label.

Is there a unique choice for the weights $\weights$ that results in a linear predictor with minimum average squared error loss on the training set $\dataset$?
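As a numerical sketch of the underlying criterion (the Gaussian data below are an assumption for illustration, not part of the exercise): the average squared error has a unique minimizer exactly when the feature matrix has full column rank, and the unique minimizer then solves the normal equations.

```python
import numpy as np

# Synthetic data: m data points with n features each (assumed values).
rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))   # feature matrix, one row per data point
y = rng.standard_normal(m)        # numeric labels

# The ERM problem has a unique minimizer iff X has full column rank,
# i.e. iff X^T X is invertible.
has_unique_minimizer = (np.linalg.matrix_rank(X) == n)

if has_unique_minimizer:
    # the unique minimizer solves the normal equations X^T X w = X^T y
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```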

Jun 11'23

Consider a training set of $\samplesize$ data points, each characterized by a single numeric feature $\feature$ and a numeric label $\truelabel$. We learn a hypothesis map of the form $h(\feature) = \feature + b$ with some bias $b \in \mathbb{R}$.

Can you write down a formula for the optimal $b$ that minimizes the average squared error on the training data $\big(\feature^{(1)},\truelabel^{(1)} \big),\ldots,\big(\feature^{(\samplesize)},\truelabel^{(\samplesize)}\big)$?
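One way to check a candidate formula numerically (the data below are synthetic, and the closed form is a claim to verify, not quoted from the exercise): the average squared error $f(b)$ is quadratic in $b$, and setting its derivative to zero yields the mean of the residuals.

```python
import numpy as np

# Candidate formula to verify: f(b) = (1/m) sum_i (y_i - x_i - b)^2 is
# quadratic in b, so f'(b) = 0 gives b* = (1/m) sum_i (y_i - x_i).
rng = np.random.default_rng(1)
m = 20
x = rng.standard_normal(m)
y = 2.0 + x + 0.1 * rng.standard_normal(m)   # assumed synthetic labels

b_star = np.mean(y - x)                      # candidate optimal bias

def avg_sq_err(b):
    return np.mean((y - (x + b)) ** 2)
```

Evaluating `avg_sq_err` near `b_star` confirms that perturbing the bias in either direction can only increase the training error.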

Jun 11'23

Consider polynomial regression for data points with a single numeric feature $\feature \in \mathbb{R}$ and numeric label $\truelabel$. Here, polynomial regression is equivalent to linear regression using the transformed feature vectors $\featurevec = \big(\feature^{0},\feature^{1},\ldots,\feature^{\featuredim-1}\big)^{T}$.

Given a dataset $\dataset= \big(\feature^{(1)},\truelabel^{(1)}\big),\ldots,\big(\feature^{(\samplesize)},\truelabel^{(\samplesize)}\big)$, we construct the feature matrix $\featuremtx =\big(\featurevec^{(1)},\ldots,\featurevec^{(\samplesize)}\big) \in \mathbb{R}^{\samplesize \times \samplesize}$ with its $\sampleidx$th column given by the feature vector $\featurevec^{(\sampleidx)}$. Note that $\featuremtx$ is square, since the transformed feature vectors have length $\featuredim = \samplesize$.

Verify that this feature matrix is a Vandermonde matrix [1].

How is the determinant of the feature matrix related to the features and labels of data points in the dataset $\dataset$?

1. W. Gautschi and G. Inglese. Lower bounds for the condition number of Vandermonde matrices. Numer. Math., 52:241--250, 1988.
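A small numerical sketch of the Vandermonde structure (the example features are assumptions, and the classical determinant formula $\prod_{\sampleidx < \sampleidx'} \big(\feature^{(\sampleidx')} - \feature^{(\sampleidx)}\big)$ is offered as a hint to verify, not quoted from the exercise):

```python
import numpy as np
from itertools import combinations

# Assumed example features; column j of the feature matrix holds the
# powers x_j^0, x_j^1, ..., x_j^{m-1} of the j-th feature value.
x = np.array([0.5, 1.0, 2.0, 3.0])
m = len(x)
F = np.vander(x, N=m, increasing=True).T   # square Vandermonde matrix

# The determinant depends only on the features (not the labels): it is
# the product of all pairwise feature differences x_j - x_i for i < j.
det_F = np.linalg.det(F)
det_formula = np.prod([x[j] - x[i] for i, j in combinations(range(m), 2)])
```

In particular, the determinant is nonzero exactly when all feature values are distinct.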
Jun 12'23

Consider a training set that consists of data points $\big(\feature^{(\sampleidx)},\truelabel^{(\sampleidx)} \big)$, for $\sampleidx = 1,\ldots,\samplesize=100$, that are obtained as realizations of iid RVs. The common probability distribution of these RVs is defined by a random data point $(\feature,\truelabel)$. The feature $\feature$ of this random data point is a standard Gaussian RV with zero mean and unit variance. The label of a data point is modelled as $\truelabel = \feature + e$ with Gaussian noise $e \sim \mathcal{N}(0,1)$. The feature $\feature$ and noise $e$ are statistically independent.

We evaluate the specific hypothesis $h(\feature)=0$ (which outputs $0$ no matter what the feature value $\feature$ is) by the training error $\trainerror = (1/\samplesize) \sum_{\sampleidx=1}^{\samplesize} \big( \truelabel^{(\sampleidx)} - h \big( \feature^{(\sampleidx)} \big) \big)^2$. Note that $\trainerror$ is the average squared error loss incurred by the hypothesis $h$ on the data points $\big(\feature^{(\sampleidx)},\truelabel^{(\sampleidx)} \big)$, for $\sampleidx = 1,\ldots,\samplesize=100$.

What is the probability that the training error $\trainerror$ is at least $20\%$ larger than the expected (squared error) loss $\expect \big\{ \big( \truelabel - h(\feature) \big)^{2} \big \}$?

What are the mean (expected value) and the variance of the training error $\trainerror$?
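A Monte Carlo sketch under the stated model (the number of simulation runs is an assumption): since $\feature + e \sim \mathcal{N}(0,2)$, each loss term $(\truelabel - h(\feature))^2 = (\feature + e)^2$ has mean $2$ and variance $8$, so the training error over $\samplesize = 100$ points has mean $2$ and variance $8/100$.

```python
import numpy as np

# Simulate many independent training sets of m = 100 points and record
# the training error of h(x) = 0 on each (n_runs is an assumed value).
rng = np.random.default_rng(2)
m, n_runs = 100, 20_000

x = rng.standard_normal((n_runs, m))   # features x ~ N(0, 1)
e = rng.standard_normal((n_runs, m))   # noise e ~ N(0, 1), independent of x
train_errors = np.mean((x + e) ** 2, axis=1)   # one training error per run

emp_mean = train_errors.mean()   # should be close to 2
emp_var = train_errors.var()     # should be close to 8/100 = 0.08
```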

Jun 12'23

Let us consider a fictional (ideal) optimization method that can be represented as a filter $\mathcal{F}$. This filter $\mathcal{F}$ reads in a real-valued objective function $f(\cdot)$, defined for all parameter vectors $\weights\in\mathbb{R}^{\featuredim}$. The output of the filter $\mathcal{F}$ is another real-valued function $\hat{f}(\weights)$ that is defined point-wise as

[$]\hat{f}(\weights) = \begin{cases} 1 & \mbox{, if } \weights \mbox{ is a local minimum of } f(\cdot) \\ 0 & \mbox{, otherwise.} \end{cases}[$]

Verify that the filter $\mathcal{F}$ is shift (translation) invariant, i.e., that $\mathcal{F}$ commutes with the translation $f'(\weights) \defeq f(\weights + \weights^{(o)})$ for an arbitrary but fixed (reference) vector $\weights^{(o)} \in \mathbb{R}^{\featuredim}$.
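A discrete sketch of the claimed invariance (the grid, the test function, and the shift are assumptions for illustration): mark a grid point as a local minimum if its value does not exceed that of its two neighbours; then filtering the translated function gives the translated filter output.

```python
import numpy as np

def local_min_indicator(vals):
    # 1 at interior grid points that are local minima, 0 elsewhere
    out = np.zeros_like(vals)
    interior = (vals[1:-1] <= vals[:-2]) & (vals[1:-1] <= vals[2:])
    out[1:-1] = interior.astype(vals.dtype)
    return out

f = lambda w: (w ** 2 - 1) ** 2      # local minima at w = -1 and w = +1
w0 = 0.5                             # translation = 50 grid steps of 0.01
grid = np.linspace(-3.0, 3.0, 601)

hat_f_translated = local_min_indicator(f(grid + w0))  # filter after translating f
hat_f = local_min_indicator(f(grid))                  # filter output for f itself
# invariance: hat_f_translated[i] == hat_f[i + 50] at interior grid points
```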

Jun 12'23

Consider a linear regression method that uses ERM to learn the weights $\widehat{\weights}$ of a linear hypothesis map $h(\featurevec) =\weights^{T} \featurevec$. The weights are learnt by minimizing the average squared error loss incurred by $h$ on a training set that is constituted by the data points $\big( \featurevec^{(\sampleidx)}, \truelabel^{(\sampleidx)} \big)$ for $\sampleidx=1,\ldots, 100$. Sometimes it is useful to assign sample-weights $\sampleweight{\sampleidx}$ to the data points and learn $\widehat{\weights}$ by minimizing a weighted average squared error. These sample-weights reflect varying levels of importance or relevance of different data points.

For simplicity, we use the sample-weights $\sampleweight{\sampleidx} = 2 \alpha$ for $\sampleidx=1,\ldots,50$ and $\sampleweight{\sampleidx} = 2(1 - \alpha)$ for $\sampleidx=51,\ldots,100$, with some fixed $\alpha \in [0,1]$.

Can you find a closed-form expression (similar to equ_close_form_lin_reg) for the weights $\widehat{\weights}^{(\alpha)}$ that minimize the weighted average squared error

[$]f(\weights) \defeq (1/50)\sum_{\sampleidx=1}^{50} \alpha \big( \truelabel^{(\sampleidx)} - \weights^{T} \featurevec^{(\sampleidx)} \big)^{2} + (1/50)\sum_{\sampleidx=51}^{100} (1-\alpha) \big( \truelabel^{(\sampleidx)} - \weights^{T} \featurevec^{(\sampleidx)} \big)^{2}[$]

for different $\alpha$?
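A numerical sketch of a candidate closed form (synthetic data; the formula is the thing to derive, not quoted from the text): scaling an objective by a constant does not change its minimizer, so minimizing $f(\weights)$ amounts to weighted least squares with weights $\alpha$ on the first 50 points and $1-\alpha$ on the last 50, and the minimizer solves the weighted normal equations.

```python
import numpy as np

# Assumed synthetic data: feature vectors stacked as rows of X.
rng = np.random.default_rng(3)
m, n, alpha = 100, 5, 0.3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Diagonal matrix of sample weights: alpha for the first 50 data points,
# 1 - alpha for the last 50 (the common factor 2/50 drops out).
d = np.concatenate([np.full(50, alpha), np.full(50, 1.0 - alpha)])
D = np.diag(d)

# Candidate closed form: solve the weighted normal equations
# X^T D X w = X^T D y.
w_alpha = np.linalg.solve(X.T @ D @ X, X.T @ D @ y)
```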

Consider data points characterized by a single numeric feature $\feature$ and a numeric label $\truelabel$. We learn a hypothesis map of the form $h(\feature) = \feature + b$ with some bias $b \in \mathbb{R}$.
Can you write down a formula for the optimal $b$ that minimizes the average absolute error on the training data $\big(\feature^{(1)},\truelabel^{(1)} \big),\ldots,\big(\feature^{(\samplesize)},\truelabel^{(\samplesize)}\big)$?
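A sketch for checking a candidate answer numerically (synthetic data; the median formula is a claim to verify, not stated in the exercise): the average absolute error is a piecewise-linear convex function of $b$, minimized by any median of the residuals $\truelabel^{(\sampleidx)} - \feature^{(\sampleidx)}$.

```python
import numpy as np

# Assumed synthetic data; m is odd so that the median is unique.
rng = np.random.default_rng(4)
m = 21
x = rng.standard_normal(m)
y = 1.5 + x + rng.standard_normal(m)

# Candidate optimal bias: a median of the residuals y_i - x_i.
b_star = np.median(y - x)

def avg_abs_err(b):
    return np.mean(np.abs(y - (x + b)))
```

Comparing `avg_abs_err(b_star)` against a grid of alternative bias values confirms that no other choice of $b$ achieves a smaller average absolute error on this data.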