Jun 12'23

Consider the space $\mathcal{P}$ of sequences $f = (f[0],f[1],\ldots)$ with the following properties:

• for each sequence $f$ there is an index $\itercntr^{(f)}$ such that $f$ is monotone decreasing for indices $\itercntr' \leq \itercntr^{(f)}$ and monotone increasing for indices $\itercntr' \geq \itercntr^{(f)}$;
• any change point $\itercntr$ of $f$, i.e., an index where $f[\itercntr] \neq f[\itercntr\!+\!1]$, can only occur at integer multiples of $100$, e.g., $\itercntr\!=\!100$ or $\itercntr\!=\!300$.

Given a function $f \in \mathcal{P}$ and a starting point $\itercntr_{0}$, our goal is to find the minimum value $\min_{\itercntr} f[\itercntr] = f[\itercntr^{(f)}]$ as quickly as possible.

Can you construct an iterative algorithm that accesses a given function $f$ only by querying the values $f[\itercntr], f[\itercntr\!-\!1]$ and $f[\itercntr\!+\!1]$ for any given index $\itercntr$?
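One possible sketch (in Python, modelling the query oracle as a callable `f` that returns the sequence value at a requested index; this interface is an assumption): since change points occur only at integer multiples of $100$, the sequence is constant on each block of $100$ consecutive indices, so one query per block suffices and a block-wise walk in the downhill direction reaches the minimum. A doubling/bisection variant over blocks would reduce the query count to logarithmic in the distance to the minimizer.

```python
def find_min(f, t0, block=100):
    """Minimise a sequence that is monotone decreasing, then monotone
    increasing, whose change points lie only at integer multiples of `block`.

    `f` stands in for the query oracle: f(t) returns the sequence value
    at index t (this callable interface is an assumption of the sketch).
    """
    # f is constant on each index block {block*k + 1, ..., block*(k + 1)},
    # so a single probe recovers the value of the whole k-th block.
    def v(k):
        return f(block * k + 1)

    k = max(0, t0 // block)
    while True:
        if v(k + 1) < v(k):              # still decreasing to the right
            k += 1
        elif k > 0 and v(k - 1) < v(k):  # still decreasing to the left
            k -= 1
        else:                            # no neighbouring block is smaller
            return v(k)                  # minimum value of the sequence
```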


Let us learn a linear hypothesis $h(\featurevec) = \weights^{T} \featurevec$ using data points that arrive sequentially at discrete time instants $\timeidx=0,1,\ldots$. At time $\timeidx$, we gather a new data point $\big(\featurevec^{(\timeidx)},\truelabel^{(\timeidx)} \big)$.

The data points can be modelled as realizations of independent and identically distributed (iid) copies of a random variable (RV) $\big(\featurevec,\truelabel\big)$. The probability distribution of the features $\featurevec$ is a standard multivariate normal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$.

The label of a data point (more precisely, of a data point obtained as a realization of the RV $\big(\featurevec,\truelabel\big)$) is related to its features via $\truelabel = \overline{\weights}^{T} \featurevec + \varepsilon$ with some fixed but unknown true parameter vector $\overline{\weights}$. The additive noise $\varepsilon \sim \mathcal{N}(0,1)$ follows a standard normal distribution. We use SGD to learn the parameter vector $\weights$ of a linear hypothesis,

$$\label{equ_SGD_update_online_exercise} \weights^{(\timeidx\!+\!1)} = \weights^{(\timeidx)} - \lrate_{\timeidx} \big( \big(\weights^{(\timeidx)}\big)^{T} \featurevec^{(\timeidx)} - \truelabel^{(\timeidx)} \big) \featurevec^{(\timeidx)}$$

with learning rate schedule $\lrate_{\timeidx} = \beta/\timeidx^{\gamma}$ for $\timeidx \geq 1$. Note that we compute a new SGD iteration \eqref{equ_SGD_update_online_exercise} for each new time instant $\timeidx$.

What conditions on the hyper-parameters $\beta, \gamma$ ensure that $\lim\limits_{\timeidx \rightarrow \infty} \weights^{(\timeidx)} = \overline{\weights}$ in distribution?
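As a purely numerical sanity check, the update can be simulated; a sketch in Python/NumPy, where the true parameter vector and the particular schedule $\beta=0.1$, $\gamma=0.8$ are arbitrary illustrative choices, not part of the exercise:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])   # plays the role of the unknown true vector
w = np.zeros(3)                       # initial parameter vector
beta, gamma = 0.1, 0.8                # one illustrative schedule (assumption)

for t in range(1, 50_001):
    x = rng.standard_normal(3)               # features ~ N(0, I)
    y = w_true @ x + rng.standard_normal()   # label with N(0, 1) noise
    lr = beta / t**gamma                     # learning-rate schedule
    w = w - lr * ((w @ x) - y) * x           # the SGD update above

print(np.linalg.norm(w - w_true))     # small: the iterates approach w_true
```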


The “ImageNet” database contains more than $10^{6}$ images [1]. These images are labeled according to their content (e.g., does the image show a dog?) and stored as a file of size at least $4$ kilobytes.

We want to learn a classifier that predicts whether an image shows a dog. To learn this classifier we run gradient descent (GD) for logistic regression on a small computer that has $32$ kilobytes of memory and is connected to the internet with a bandwidth of $1$ Mbit/s. Each single GD update therefore essentially requires downloading all images in ImageNet.

How long would such a single GD update take?
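A back-of-the-envelope sketch, taking the stated lower bounds at face value and using $1$ kilobyte $= 8000$ bits:

```python
n_images = 1e6            # ImageNet holds more than 10^6 images
bits_per_image = 4e3 * 8  # at least 4 kilobytes per image file
bandwidth = 1e6           # download bandwidth of 1 Mbit/s

seconds = n_images * bits_per_image / bandwidth
print(seconds, seconds / 3600)   # 32000.0 seconds, roughly 9 hours
```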

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

Consider data points that are images of size $1024 \times 1024$ pixels. Each image is characterized by its RGB pixel color intensities (value range $0,\ldots,255$, resulting in $24$ bits for each pixel), which we stack into a feature vector $\featurevec \in \mathbb{R}^{\featurelen}$.

We assign each image the label $\truelabel=1$ if it shows an apple and $\truelabel=-1$ if it does not show an apple. We use logistic regression to learn a linear hypothesis $h(\featurevec) = \weights^{T} \featurevec$ to classify an image according to $\hat{\truelabel} =1$ if $h(\featurevec) \geq 0$ and $\hat{\truelabel} = -1$ otherwise. The training set consists of $\samplesize=10^{10}$ labeled images which are stored in the cloud.

We implement the ML method on our own laptop, which is connected to the internet with a bandwidth of at most $100$ Mbps. Unfortunately, we can store at most five images on our laptop at any time. How long does it take, at least, to complete one single GD step?
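Again a back-of-the-envelope sketch, using the raw feature size of $24$ bits per pixel; the images are assumed uncompressed, so this is only a rough lower bound:

```python
n_images = 1e10                    # size of the training set
bits_per_image = 1024 * 1024 * 24  # raw 24-bit RGB pixels per image
bandwidth = 100e6                  # at most 100 Mbit/s

seconds = n_images * bits_per_image / bandwidth
print(seconds / (3600 * 24 * 365))   # on the order of 80 years per GD step
```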


Consider the dataset with feature vectors $\featurevec^{(1)}=(100,0)^{T} \in \mathbb{R}^{2}$ and $\featurevec^{(2)} = (0,1/10)^{T} \in \mathbb{R}^{2}$ which we stack into the matrix $\featuremtx = (\featurevec^{(1)},\featurevec^{(2)})^{T}$.

What is the condition number of $\featuremtx^{T} \featuremtx$?

What is the condition number of $\big(\widehat{\featuremtx}\big)^{T} \widehat{\featuremtx}$, where the matrix $\widehat{\featuremtx}=(\widehat{\featurevec}^{(1)},\widehat{\featurevec}^{(2)})^{T}$ is constructed from the normalized feature vectors $\widehat{\featurevec}^{(\sampleidx)}$ delivered by Algorithm \ref{alg_reshaping}?
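The two condition numbers can be checked numerically; a sketch in Python/NumPy. Algorithm \ref{alg_reshaping} is not reproduced here, so the sketch assumes it rescales each feature vector to unit Euclidean norm:

```python
import numpy as np

# Rows of X are the two feature vectors x^(1) and x^(2).
X = np.array([[100.0, 0.0],
              [0.0, 0.1]])
print(np.linalg.cond(X.T @ X))   # 1e6: eigenvalues of X^T X are 1e4 and 1e-2

# Assumed normalization: scale each feature vector to unit Euclidean norm.
Xh = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.linalg.cond(Xh.T @ Xh)) # 1: the normalized rows are orthonormal
```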

Consider a differentiable objective function $f(\weights)$ whose argument is a parameter vector $\weights \in \mathbb{R}^{\featuredim}$. We make no assumption about smoothness or convexity: the function $f(\weights)$ might be non-convex and might not be $\beta$-smooth. However, the gradient is uniformly bounded, $\| \nabla f(\weights) \| \leq 100$ for every $\weights$.
Starting from some initial vector $\weights^{(0)}$ we construct a sequence of parameter vectors using GD steps,
$$\label{equ_def_gd_exericse} \weights^{(\itercntr+1)} = \weights^{(\itercntr)} - \lrate_{\itercntr} \nabla f \big( \weights^{(\itercntr)} \big).$$
The learning rate $\lrate_{\itercntr }$ in \eqref{equ_def_gd_exericse} is allowed to vary between different iterations.
Can you provide sufficient conditions on the evolution of the learning rate $\lrate_{\itercntr}$, as the iterations proceed, that ensure convergence of the sequence $\weights^{(0)}, \weights^{(1)}, \ldots$?
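One possible line of attack, offered as a hint rather than a full solution: the gradient bound controls the length of each GD step,

$$\| \weights^{(\itercntr+1)} - \weights^{(\itercntr)} \| = \lrate_{\itercntr} \big\| \nabla f \big( \weights^{(\itercntr)} \big) \big\| \leq 100\, \lrate_{\itercntr}.$$

Hence, if the learning rates are summable, $\sum_{\itercntr=0}^{\infty} \lrate_{\itercntr} < \infty$, the iterates form a Cauchy sequence in $\mathbb{R}^{\featuredim}$ and therefore converge (although not necessarily to a minimizer of $f$).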