# Introduction

Content for this page was copied verbatim from Herberg, Evelyn (2023). "Lecture Notes: Neural Network Architectures". arXiv:2304.05133 [cs.LG].

Machine Learning (ML) denotes the field of study in which algorithms infer from given data how to perform a specific task, without being explicitly programmed for the task (Arthur Samuel, 1959). Here, we consider a popular subset of ML algorithms: Neural Networks. The inspiration for a Neural Network (NN) originates from the human brain, where biological neurons (nerve cells) respond to the activation of other neurons they are connected to. At a very simple level, neurons in the brain take electrical inputs that are then channeled to outputs. The sensitivity of this relation also depends on the strength of the connection, i.e. a neuron may be more responsive to one neuron than to another.

For a single neuron/node with input $u \in \mathbb{R}^{n}$, a mathematical model, named the perceptron [1], can be described as

[$] $$\label{eq:perceptron} y = \sigma \left( \sum_{i=1}^{n} W_i u_i + b \right) = \sigma(W^{\top}u + b),$$ [$]

where $y$ is the activation of the neuron/node, $W_i$ are the weights and $b$ is the bias.

The function $\sigma:\mathbb{R} \rightarrow \mathbb{R}$ is called the activation function. Originally, in [1], it was proposed to choose the Heaviside function as the activation function to model whether a neuron fires or not, i.e.

[$] \begin{equation*} \sigma(y) = \begin{cases} 1 &\mbox{if } y\geq 0, \\ 0 &\mbox{if } y \lt0. \end{cases} \end{equation*} [$]

However, over time several other activation functions have been suggested and are being used. Typically, they are monotone increasing to remain in the spirit of the original idea, but continuous.

Popular activation functions are, cf. [2](p.90)

[$] \begin{align*} \sigma(y) &= \frac{1}{1+ \exp(-y)} &&\mbox{sigmoid (logistic)} ,\\ \sigma(y) &= \tanh(y) = \frac{\exp(y)-\exp(-y)}{\exp(y)+\exp(-y)} &&\mbox{hyperbolic tangent} ,\\ \sigma(y) &= \max\{y,0\} &&\mbox{rectified linear unit (ReLU)} ,\\ \sigma(y) &= \max\{\alpha y,y\} &&\mbox{leaky ReLU} . \end{align*} [$]
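The activation functions listed here can be written down directly. The following is a minimal sketch in plain Python; the function names and the default slope `alpha = 0.01` are assumptions made for illustration (the text leaves $\alpha$ unspecified, and the formula $\max\{\alpha y, y\}$ only behaves as a leaky ReLU for $0 < \alpha < 1$).

```python
import math


def heaviside(y: float) -> float:
    """Heaviside step: 1 for y >= 0, otherwise 0."""
    return 1.0 if y >= 0 else 0.0


def sigmoid(y: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-y)), evaluated without overflow."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    z = math.exp(y)  # for y < 0 use the equivalent form exp(y) / (1 + exp(y))
    return z / (1.0 + z)


def tanh(y: float) -> float:
    """Hyperbolic tangent."""
    return math.tanh(y)


def relu(y: float) -> float:
    """Rectified linear unit max{y, 0}."""
    return max(y, 0.0)


def leaky_relu(y: float, alpha: float = 0.01) -> float:
    """Leaky ReLU max{alpha * y, y}; alpha = 0.01 is an assumed default."""
    return max(alpha * y, y)
```

Near zero, `tanh(y)` stays close to `y`, whereas `sigmoid(0.0)` equals `0.5`, which illustrates the remark that $\tanh$ resembles the identity for small inputs.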

The nonlinearity of activation functions is an integral part of the Neural Network's success. Since concatenations of linear functions result again in a linear function, see e.g. [2](p.90), the complexity that can be achieved by using linear activation functions is limited.

While the sigmoid function approximates the Heaviside function continuously, and is differentiable, it contains an exponential operation, which is computationally expensive. Similar problems arise with the hyperbolic tangent function. However, the fact that $\tanh$ is closer to the identity function often helps speed up convergence, since it resembles a linear model, as long as the values are close to zero. Another challenge that needs to be overcome is vanishing derivatives, which is visibly present for Heaviside, sigmoid and hyperbolic tangent. In contrast, ReLU is not bounded on positive values, while also being comparatively cheap to compute, because linear computations tend to be very well optimized in modern computing. Altogether, these advantages have resulted in ReLU (and variants thereof) becoming the most widely used activation function currently. As a remedy for the vanishing gradient on negative values, leaky ReLU was introduced. When taking derivatives of ReLU one needs to account for the non-differentiability at 0, but in numerical practice this is easily overcome.

With the help of Neural Networks we want to solve a task, cf. [3](Section 5.1). Let the performance of the algorithm for the given task be measured by the loss function $\mathscr{L}$, which needs to be adequately modeled. By $\mathcal{F}$ we denote the Neural Network. The variables that will be learned are the weights $W$ and biases $b$ of the Neural Network. Hence, we can formulate the following optimization problem, cf. [4][5][6]

[$] $$\label{eq:LP} \tag{P} \min_{W,b} \mathscr{L}(y,u,W,b) \qquad \mbox{s.t.}\qquad y = \mathcal{F}(u,W,b).$$ [$]

One possible choice for $\mathcal{F}$ has already been given in \eqref{eq:perceptron}, the perceptron. In the subsequent sections we introduce and analyze various other Neural Network architectures. They all have in common that they contain weights and biases, so that the above problem formulation remains sensible.

Before we move on to different network architectures, we discuss the modeling of the loss function. Learning tasks can be divided into two subgroups: Supervised and Unsupervised learning.

## Supervised Learning

In supervised learning we have given data $u$ with known supervision $S(u)$ (also called labels), so that the task is to match the output $y$ of the Neural Network to the supervision. These problems are further categorized depending on the known supervision, e.g. for $S(u) \in \mathbb{N}$ it is called classification and for $S(u) \in \mathbb{R}$ regression. Furthermore, the supervision $S(u)$ can also take more complex forms like a black and white picture of $256 \times 256$ pixels represented by $[0,1]^{256 \times 256}$, a higher dimensional quantity, a sentence, etc. These cases are called structured output learning.

Let us consider one very simple example, cf. [3](Section 5.1.4).

Example

We have a given set of inputs $u^{(i)} \in \mathbb{R}^d$ with known supervisions $S(u^{(i)}) \in \mathbb{R}$ for $i=1,\ldots,N$. In this example we only consider weights $W \in \mathbb{R}^{d}$ and no bias. Additionally, let $\sigma = \textrm{id}$. The perceptron network simplifies to

[$] \begin{equation*} y^{(i)} = W^{\top} u^{(i)}, \end{equation*} [$]

and the learning task is to find $W$, such that $y^{(i)} \approx S(u^{(i)})$. This can be modeled by the mean squared error (MSE) function

[$] \begin{equation*} \mathscr{L}(\{y^{(i)}\}_i,\{u^{(i)}\}_i,W) := \frac{1}{2N} \sum_{i=1}^N \| y^{(i)} - S(u^{(i)}) \|^2. \end{equation*} [$]

By convention we will use $\| \cdot \| = \| \cdot \|_2$ throughout the lecture. The chosen loss function is quadratic, convex and non-negative. We define

[$] \begin{equation*} U := \begin{pmatrix} (u^{(1)})^{\top} \\ \vdots \\ (u^{(N)})^{\top} \end{pmatrix} \in \mathbb{R}^{N\times d}, \qquad S:= \begin{pmatrix} S(u^{(1)}) \\ \vdots \\ S(u^{(N)}) \end{pmatrix} \in \mathbb{R}^{N}, \end{equation*} [$]

so that we can write $\mathscr{L}(W) = \frac{1}{2} \| U W - S\|_2^2$. Minimizing this function will deliver the same optimal weight $W$ as minimizing the MSE function defined above. We can now derive the gradient

[$] \begin{equation*} \nabla_W \mathscr{L}(W) = U^{\top} U W - U^{\top} S \end{equation*} [$]

and immediately find the stationary point $W = (U^{\top} U)^{-1} U^{\top} S$, provided that $U^{\top} U$ is invertible.
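The closed-form solution of this example can be tried out on a small data set. The sketch below uses only the Python standard library and solves the normal equations $U^{\top} U W = U^{\top} S$ with a hand-written Gaussian elimination. The helper names `solve` and `least_squares` and the toy data are assumptions made for illustration, and the sketch assumes $U^{\top} U$ is invertible.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is (numerically) singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def least_squares(U, S):
    """Return W with U^T U W = U^T S (stationary point of the MSE loss)."""
    d = len(U[0])
    UtU = [[sum(row[i] * row[j] for row in U) for j in range(d)] for i in range(d)]
    UtS = [sum(row[i] * s for row, s in zip(U, S)) for i in range(d)]
    return solve(UtU, UtS)


# Made-up, noise-free toy data generated by S(u) = 2*u_1 - 1*u_2.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
S = [2.0, -1.0, 1.0, 3.0]
W = least_squares(U, S)
```

For this data the computed `W` recovers the generating weights $(2, -1)$ up to rounding. In practice one would typically use a linear algebra library instead of a hand-written solver.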

## Unsupervised Learning

In unsupervised learning, only the input data $u$ is given and we have no knowledge of supervisions or labels. The algorithm is supposed to learn e.g. a structure or relation in the data. Some examples are k-clustering and principal component analysis (PCA). Modeling the loss function specifies the task and has a direct influence on the learning process. For illustration of this concept, we introduce the k-means algorithm, see e.g. [2](Chapter 10), which is used for clustering.

Example

We have a set of given data points

[$] \begin{equation*} \left\{ u^{(i)}\right\}_{i=1}^N \subset \mathbb{R}^{ d}, \end{equation*} [$]

and a desired number of clusters $k \in \mathbb{N}$ with $k \leq N$ and typically $k \ll N$. Every data point is supposed to be assigned to a cluster.

Iteratively every data point is assigned to the cluster with the nearest centroid, and we redefine cluster centroids as the mean of the vectors in the cluster. The procedure is specified in Algorithm k-means clustering and illustrated for an example in Figure, which can be found e.g. in [2](Chapter 10). The loss function (also called distortion function in this setup) can be defined as

[$]\mathscr{L}(c,\mu):=\sum_{i=1}^{N}\|u^{(i)}-\mu_{c^{(i)}}\|^2,[$]

which is also a model of the quantity that we try to minimize in Algorithm k-means clustering. The distortion function is non-convex in $(c,\mu)$, so the algorithm may converge to a local minimum. To reduce this risk, we run the algorithm many times with different initial centroids, compare the resulting clusterings using the loss function, and choose the one with the minimal value attained in the loss function.

k-means clustering

Require: Initial cluster centroids $\mu_1,\ldots,\mu_k$

while not converged do
for $i=1:N$ do
$c^{(i)} := \arg\min_{j} \|u^{(i)}-\mu_j\|^2$
end for
for $j=1:k$ do
$\mu_j \gets \frac{\sum_{i=1}^N 1_{\{c^{(i)} = j\}}u^{(i)}}{\sum_{i=1}^N 1_{\{c^{(i)}=j\}}}$
end for
end while
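The iteration above, together with the restart strategy, can be sketched in plain Python. The function names, the stopping rule (no assignment changes), the handling of empty clusters (keep the old centroid), and the toy data are all assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Alternate nearest-centroid assignment and centroid (mean) update."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # converged: no label changed
            break
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    return assignment, centroids


def distortion(points, assignment, centroids):
    """Loss: sum of squared distances to the assigned centroids."""
    return sum(sq_dist(p, centroids[c]) for p, c in zip(points, assignment))


# Made-up toy data with two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
rng = random.Random(0)
# Several restarts from random initial centroids; keep the lowest distortion.
best = min(
    (kmeans(points, rng.sample(points, 2)) for _ in range(5)),
    key=lambda result: distortion(points, *result),
)
```

Here `best` is the pair `(assignment, centroids)` with the smallest distortion among the five runs.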

We will see various other loss functions $\mathscr{L}$ throughout the remainder of this lecture, all of them specifically tailored to the task at hand.

In the case of Linear Regression, we have a closed form derivative, so we are able to find the solution by direct calculus, while for k-means clustering the optimization was done by a tailored iteration. For general problems we will need a suitable optimization algorithm. We move on to introduce a few options.

## Optimization Algorithms

Here, for simplicity we define $\theta$, which collects all variables, i.e. weights $W$ and bias $b$ and write the loss function as

[$]\mathscr{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \mathscr{L}^{(i)}(\theta),[$]

which we want to minimize. Here, $\mathscr{L}^{(i)}$ indicates the loss function evaluated for data point $i$, for example with an MSE loss $\mathscr{L}^{(i)}(\theta) = \frac{1}{2} \| y^{(i)} - S(u^{(i)}) \|^2$.

First, let us recall the standard gradient descent algorithm, see e.g. [7](Section 9.3), which is also known as steepest descent or batch gradient descent.

Require: Initial point $\theta^0$, step size $\tau\gt0$, counter $k=0$.

while Stopping criterion not fulfilled do
$\theta^{k+1} = \theta^k - \tau \cdot \nabla \mathscr{L}(\theta^k)$,
$k \gets k+1$.

end while
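A minimal Python sketch of this iteration is given below. The stopping rule combines a maximum number of iterations with a small change in $\theta$; the function name, the default values, and the quadratic test function are assumptions made for illustration.

```python
def gradient_descent(grad, theta0, tau=0.1, tol=1e-8, max_iter=10_000):
    """Iterate theta <- theta - tau * grad(theta) with a fixed step size tau."""
    theta = list(theta0)
    for _ in range(max_iter):
        g = grad(theta)
        new_theta = [t - tau * gi for t, gi in zip(theta, g)]
        change = sum((a - b) ** 2 for a, b in zip(new_theta, theta)) ** 0.5
        theta = new_theta
        if change < tol:  # stop once the iterates barely move
            break
    return theta


def grad(theta):
    """Gradient of the made-up loss 0.5*(t0 - 1)^2 + 2*(t1 + 2)^2."""
    return [theta[0] - 1.0, 4.0 * (theta[1] + 2.0)]


theta_star = gradient_descent(grad, [0.0, 0.0], tau=0.1)
```

For this test function the minimizer is $(1, -2)$. Note that the second component has a four times larger curvature, so it tolerates only a smaller step size than the first, which is the situation described in the text that follows.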

Possible stopping criteria are e.g. setting a maximum number of iterations $k$, reaching a certain exactness $\|\mathscr{L}(\theta)\| \lt \epsilon$ with a small number $\epsilon\gt0$, or a decay in change $\| \theta^{k+1}-\theta^k\| \lt \epsilon$. Determining a suitable step size is integral to the success of the gradient descent method, especially since this algorithm uses the same step size $\tau$ for all components of $\theta$, which can be a large vector in applications. It may happen that in some components the computed descent direction is only providing descent in a small neighborhood, therefore requiring a small step size $\tau$. It is also possible to employ a line search algorithm. However, this is not common in Machine Learning currently. Instead, typically a small step size is chosen, so that it will (hopefully) be not too large for any component of $\theta$, and then it may be adaptively increased. Furthermore, let us remark that the step size is often called learning rate in a Machine Learning context.

Additionally, a grand challenge in Machine Learning tasks is that we have huge data sets, and the gradient descent algorithm has to iterate over all data points in every iteration, since $\mathscr{L}(\theta)$ contains all data points, which causes a tremendous computational cost. This motivates the use of the stochastic gradient descent algorithm, cf. [2](Algorithm 1), which only takes one data point into account per iteration.

Require: Initial point $\theta^0$, step size $\tau\gt0$, counter $k=0$, maximum number of iterations $K$.

while $k \leq K$ do
Sample $j \in \{1,\ldots,N\}$ uniformly.
$\theta^{k+1} = \theta^k - \tau \cdot \nabla \mathscr{L}^{(j)}(\theta^k)$,
$k \gets k+1$.
end while

Since the stochastic gradient descent (SGD) method only calculates the gradient for one data point per iteration, it produces an irregular convergence behavior. Indeed, it does not necessarily converge at all, but for a large number of iterations $K$ it often produces a good approximation. Moreover, exact convergence in training the Neural Network is often neither necessary nor desired, since we want a solution that generalizes well to unseen data, rather than one that fits the given data points perfectly. The latter may lead to overfitting, cf. Section Overfitting and Underfitting. Therefore, SGD is a computationally cheap, reasonable alternative to gradient descent. As a compromise, which generates a less irregular convergence behavior, there also exists mini batch gradient descent, cf. [2](Algorithm 2), where every iteration takes into account a subset (mini batch) of the data points.

Require: Initial point $\theta^0$, step size $\tau\gt0$, counter $k=0$, maximum number of iterations $K$, batch size $b\in \mathbb{N}$.

while $k \leq K$ do
Sample $b$ examples $j_1,\ldots,j_b$ uniformly from $\{1,\ldots,N\}$
$\theta^{k+1} = \theta^k - \tau \cdot \frac{1}{b} \sum_{i=1}^{b} \nabla \mathscr{L}^{(j_i)}(\theta^k)$,
$k \gets k+1$.

end while
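Both stochastic variants can be covered by one short Python sketch, since a batch size of one reduces mini batch gradient descent to stochastic gradient descent. The function name, the sampling with replacement, the fixed seed, and the toy regression data are assumptions made for illustration.

```python
import random


def minibatch_sgd(grad_i, theta0, n_data, tau=0.01, batch_size=1,
                  n_iter=1000, seed=0):
    """Mini batch gradient descent; batch_size=1 gives plain SGD.

    grad_i(theta, j) must return the gradient of the loss of data point j.
    """
    rng = random.Random(seed)
    theta = list(theta0)
    for _ in range(n_iter):
        # Assumption: indices are drawn uniformly with replacement.
        batch = [rng.randrange(n_data) for _ in range(batch_size)]
        grads = [grad_i(theta, j) for j in batch]
        mean_grad = [sum(g[c] for g in grads) / batch_size
                     for c in range(len(theta))]
        theta = [t - tau * g for t, g in zip(theta, mean_grad)]
    return theta


# Made-up, noise-free data for a scalar weight: S(u) = 3 * u.
us = [1.0, 2.0, 3.0, 4.0]
ss = [3.0 * u for u in us]


def grad_i(theta, j):
    """Gradient of 0.5 * (theta * u_j - s_j)^2 with respect to theta."""
    return [(theta[0] * us[j] - ss[j]) * us[j]]


w_sgd = minibatch_sgd(grad_i, [0.0], n_data=len(us), batch_size=1)
w_mb = minibatch_sgd(grad_i, [0.0], n_data=len(us), batch_size=2)
```

Because the toy data contain no noise, every per-sample loss is minimized by the same weight, so both runs approach $3$. With noisy data the iterates would keep fluctuating around the minimizer instead.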

Finally, we introduce a sophisticated algorithm for stochastic optimization called Adam, [8], see Algorithm Adam. It is also a gradient-based method, and as an extension of the previous methods it employs adaptive estimates of so-called moments.

All operations on vectors are element-wise. $(g^k)^2$ indicates the element-wise square $g^k \odot g^k$, and $(\beta_1)^{k+1}, (\beta_2)^{k+1}$ denote the $(k+1)$-th power of $\beta_1$ and $\beta_2$, respectively.

Require: Initial point $\theta^0$, step size $\tau\gt0$, counter $k=0$, exponential decay rates for the moment estimates $\beta_1,\beta_2 \in [0,1)$, $\epsilon \gt 0$, stochastic approximation $\widetilde{\mathscr{L}}(\theta)$ of the loss function.

$m_1^0 \gets 0$ (Initialize first moment vector)
$m_2^0 \gets 0$ (Initialize second moment vector)
while $\theta^k$ not converged do
$g^{k+1} = \nabla_\theta \widetilde{\mathscr{L}}(\theta^{k})$
$m_1^{k+1} = \beta_1 \cdot m_1^k + (1-\beta_1) \cdot g^{k+1}$
$m_2^{k+1} = \beta_2 \cdot m_2^k + (1-\beta_2) \cdot (g^{k+1})^2$
$\hat{m}_1^{k+1} = \frac{m_1^{k+1}}{1-(\beta_1)^{k+1}}$ (bias-corrected first moment)
$\hat{m}_2^{k+1} = \frac{m_2^{k+1}}{1-(\beta_2)^{k+1}}$ (bias-corrected second moment)
$\theta^{k+1} = \theta^k - \tau \cdot \frac{\hat{m}_1^{k+1}}{\sqrt{\hat{m}_2^{k+1}} + \epsilon}$
$k \gets k+1$

end while
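A plain Python sketch of Adam is given below. It follows the update of Kingma and Ba [8]: the raw moment estimates are carried from one iteration to the next, the bias-corrected versions are used only in the parameter update, and the correction uses the number of steps taken so far as exponent. The function name, the fixed iteration count as stopping rule, and the test gradient are assumptions made for illustration.

```python
import math


def adam(grad, theta0, tau=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
         n_iter=1000):
    """Adam with element-wise moment estimates and bias correction."""
    theta = list(theta0)
    m1 = [0.0] * len(theta)  # first moment estimate
    m2 = [0.0] * len(theta)  # second moment estimate
    for step in range(1, n_iter + 1):
        g = grad(theta)
        m1 = [beta1 * m + (1 - beta1) * gi for m, gi in zip(m1, g)]
        m2 = [beta2 * v + (1 - beta2) * gi * gi for v, gi in zip(m2, g)]
        # Bias correction; the raw moments m1, m2 are kept unchanged.
        m1_hat = [m / (1 - beta1 ** step) for m in m1]
        m2_hat = [v / (1 - beta2 ** step) for v in m2]
        theta = [t - tau * mh / (math.sqrt(vh) + eps)
                 for t, mh, vh in zip(theta, m1_hat, m2_hat)]
    return theta


def grad(theta):
    """Gradient of the made-up loss 0.5 * theta^2 (exact, not stochastic)."""
    return [theta[0]]


after_one_step = adam(grad, [5.0], n_iter=1)
after_hundred_steps = adam(grad, [5.0], n_iter=100)
```

In the first step the bias-corrected ratio equals $g / (|g| + \epsilon)$, so the parameter moves by almost exactly $\tau$ regardless of the gradient's magnitude. This is why the step size $\tau$ roughly bounds the per-step change in each component.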

Good default settings in Adam for the tested machine learning problems are $\tau = 0.001$, $\beta_1 = 0.9, \beta_2 = 0.999$ and $\epsilon = 10^{-8}$, cf. [8]. Typically, the stochasticity of $\widetilde{\mathscr{L}}(\theta)$ will come from using mini batches of the data set, as in Algorithm Mini Batch Gradient Descent.

In any case we need to be cautious when interpreting results, since independent of the chosen algorithm, we are dealing with a non-convex loss function, so that we can only expect convergence to stationary points.

In the following section we discuss how fitting the given data points and generalizing well to unseen data can be contradictory goals.

## Overfitting and Underfitting

As an example we discuss supervised learning with polynomials of degree $r$, cf. [9](Section 1.3.3).

Example

Define

[$]p(u,W):=\sum_{j=0}^r W_j u^j= W^\top u, [$]

with $u=(u^0,...,u^r)^\top \in \mathbb{R}^{r+1}$ the powers of the data point $u$, and $W:=(W_0,...,W_r)^\top \in \mathbb{R}^{r+1}$. The polynomial $p$ is linear in $W$, but not in $u$. As in Linear Regression (Example), we do not consider bias $b$ here. Our goal is to compute weights $W$, given data points $u^{(i)}$ with supervisions $S(u^{(i)})$, so that $p$ makes good predictions on data it hasn't seen before. We again employ the MSE loss function

[$]\mathscr{L}(W)=\frac{1}{2N}\sum_{i=1}^N \| p(u^{(i)},W)-S(u^{(i)}) \|^2[$]

As before, we write the loss in matrix-vector notation

[$]\mathscr{L}(W)=\frac{1}{2N}\|U W - S\|^2[$]

where

[$]U:=\begin{pmatrix} (u^{(1)})^0&(u^{(1)})^1& \ldots &(u^{(1)})^r\\ \vdots& \vdots & & \vdots\\ (u^{(N)})^0&(u^{(N)})^1& \ldots &(u^{(N)})^r \end{pmatrix},\ S:=\begin{pmatrix} S(u^{(1)})\\ \vdots\\ S(u^{(N)}) \end{pmatrix}.[$]

The minimizer $W$ can be directly calculated, cf. Example.
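To make the matrix-vector form concrete, the following Python sketch builds the matrix $U$ of powers for scalar data points and evaluates the MSE loss for given weights. The function names and the toy polynomial are assumptions made for illustration.

```python
def design_matrix(points, r):
    """Row i holds the powers (u_i^0, ..., u_i^r) of the scalar point u_i."""
    return [[u ** j for j in range(r + 1)] for u in points]


def mse_loss(U, S, W):
    """Evaluate (1 / (2N)) * ||U W - S||^2 for lists U (rows), S and W."""
    residuals = [sum(w * x for w, x in zip(W, row)) - s
                 for row, s in zip(U, S)]
    return sum(res * res for res in residuals) / (2 * len(S))


# Made-up example: p(u) = 1 + 2u + 3u^2 sampled without noise.
points = [0.0, 1.0, 2.0]
W_true = [1.0, 2.0, 3.0]
U = design_matrix(points, r=2)
S = [1.0, 6.0, 17.0]  # p(0), p(1), p(2)

loss_at_truth = mse_loss(U, S, W_true)       # zero: the data are matched
loss_at_zero = mse_loss(U, S, [0.0, 0.0, 0.0])
```

With this $U$, the minimizer can be obtained from the normal equations exactly as in the linear regression example, since $p$ is linear in $W$.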

Plots of polynomials of various degrees r (red graph) fitted to the noisy data points (green dots) based on the ground truth (green graph). The model should extend well to the test set data (blue dots). We observe underfitting in the top row for $r=0$ (left) and $r=1$ (right). In the bottom left with $r=3$ reasonable results are achieved, while $r=9$ in the bottom right leads to overfitting. Image modified from: [9](Fig. 1).

To measure the performance of the polynomial curve fitting we compute the error on data points that were not used to determine the best polynomial fit, because we aim for a model that will generalize well. To this end, finding a suitable degree for the polynomial that we are fitting over the data points is crucial. If the degree is too low, we will encounter underfitting, see Figure top row. This means that the complexity of the polynomial is too low and the model does not even fit the data points. A remedy is to increase the degree of the polynomial, see Figure bottom left. However, increasing the degree too much may lead to overfitting, see Figure bottom right. The data points are fit perfectly, but the curve will not generalize well.

We can characterize overfitting and underfitting by using some statistics, cf. [2](Section 8.1). A point estimator $g:\mathcal{U}^N \rightarrow \Theta$ (where $\mathcal{U}$ denotes the data space, and $\Theta$ denotes the parameter space) is a function which makes an estimation of the underlying parameters of the model. For example, the estimate for $\theta=W$ from Example: $\hat{\theta}=(U^\top U)^{-1}U^\top S$ (which we will denote with a hat in this subsection to emphasize that it is an estimation) is an example of a point estimator. We assume that the data from $\mathcal{U}^N$ is i.i.d., so that $\hat{\theta}$ is a random variable. We can define the variance and the bias

[$]\textrm{Var}(\hat{\theta}):=\mathbb{E}(\hat{\theta}^2)-\mathbb{E}(\hat{\theta})^2,\ \mbox{Bias}(\hat{\theta}):=\mathbb{E}(\hat{\theta})-\theta,[$]

with $\mathbb{E}$ denoting the expected value. A good estimator has both low variance and low bias. We can characterize overfitting with low bias and high variance, and underfitting with high bias and low variance. The bias-variance trade-off is illustrated in Figure. Hence, we can make a decision based on the mean squared error of the estimates

[$]\mbox{MSE}(\hat{\theta}):=\mathbb{E}[(\hat{\theta}-{\theta})^2] =\mbox{Var}(\hat{\theta})+\mbox{Bias}(\hat{\theta})^2.[$]
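For a scalar estimator, the decomposition can be verified by adding and subtracting $\mathbb{E}(\hat{\theta})$ inside the square. The following short derivation is a sketch under that scalar assumption; it uses that $\theta$ and $\mathbb{E}(\hat{\theta})$ are deterministic, so the mixed term vanishes.

[$] \begin{align*} \mathbb{E}[(\hat{\theta}-\theta)^2] &= \mathbb{E}\left[\left( (\hat{\theta}-\mathbb{E}(\hat{\theta})) + (\mathbb{E}(\hat{\theta})-\theta) \right)^2\right] \\ &= \mathbb{E}[(\hat{\theta}-\mathbb{E}(\hat{\theta}))^2] + 2\,(\mathbb{E}(\hat{\theta})-\theta)\,\underbrace{\mathbb{E}[\hat{\theta}-\mathbb{E}(\hat{\theta})]}_{=0} + (\mathbb{E}(\hat{\theta})-\theta)^2 \\ &= \mbox{Var}(\hat{\theta}) + \mbox{Bias}(\hat{\theta})^2, \end{align*} [$]

where the last step uses $\mathbb{E}[(\hat{\theta}-\mathbb{E}(\hat{\theta}))^2] = \mathbb{E}(\hat{\theta}^2)-\mathbb{E}(\hat{\theta})^2$.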

In general, it can be hard to guess a suitable degree for the polynomial beforehand. We could compute a fitting curve for different choices of $r$ and then compare the error on previously unseen data points of the validation data set, cf. Section Hyperparameters and Data Set Splitting, to determine which one generalizes best. This will require solving the problem multiple times, which is unfavorable, especially for large data sets. Also, the polynomial degree can only be set discretely. Another, continuous way is to introduce a penalization term in the loss function

[$]\mathscr{L}_\lambda(\theta) := \mathscr{L}(\theta) + \lambda \|\theta \|^2 .[$]

This technique is also called weight decay, cf. [3](Section 5.2.2). We can also use other norms, e.g. $\| \cdot \|_1$. Here, we can choose a large degree $r$ and for $\lambda$ big enough, we will still avoid overfitting, because many components of $\theta$ will be (close to) zero. Nonetheless, we need to be cautious with the choice of $\lambda$. If it is too big, we will face again the problem of underfitting.
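The effect of the penalization parameter can be seen in the simplest possible setting. The sketch below assumes a single scalar weight $W$ without bias and the loss $\frac{1}{2N}\sum_i (W u^{(i)} - S(u^{(i)}))^2 + \lambda W^2$; setting its derivative to zero gives $W = \sum_i u^{(i)} S(u^{(i)}) / (\sum_i (u^{(i)})^2 + 2N\lambda)$. The function name and the toy data are assumptions made for illustration.

```python
def ridge_weight(u, s, lam):
    """Minimizer of (1/(2N)) * sum (W*u_i - s_i)^2 + lam * W^2 (scalar W)."""
    n = len(u)
    numerator = sum(ui * si for ui, si in zip(u, s))
    denominator = sum(ui * ui for ui in u) + 2 * n * lam
    return numerator / denominator


# Made-up, noise-free data generated by S(u) = 2 * u.
u = [1.0, 2.0, 3.0]
s = [2.0, 4.0, 6.0]

w_no_penalty = ridge_weight(u, s, lam=0.0)     # recovers the slope 2
w_penalized = ridge_weight(u, s, lam=1.0)      # shrunk towards zero
w_heavy_penalty = ridge_weight(u, s, lam=100.0)
```

As $\lambda$ grows, the weight shrinks towards zero, which is the mechanism behind both the protection against overfitting and the risk of underfitting for large $\lambda$.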

We see that choosing values for the degree $r$ and the penalization parameter $\lambda$ poses challenges, and will discuss this further in the next section.

## Hyperparameters and Data Set Splitting

We call all quantities that need to be chosen before solving the optimization problem hyperparameters, cf. [3](Section 5.3). Let us point out that hyperparameters are not learnt by the optimization algorithm itself, but nevertheless have an impact on the algorithm's performance. Examples of hyperparameters include the polynomial degree $r$, the scalar $\lambda$, all parameters in the optimization algorithms (Section Optimization Algorithms) like the step size $\tau$, and also the architecture of the Neural Network, and many more.

The impact of having a good set of hyperparameters can be tremendous, however finding such a set is not trivial. First of all, we split our given data into three sets: training data, validation data and test data (a 4:1:1 ratio is common). We have seen training and test data before. The data points that we are using as input to solve the optimization problem are called training data, and the unseen data points, which we use to evaluate whether the model generalizes well, are called test data. Since we don't want to mix different causes of error, we also introduce the validation data set. This will be used to compare different choices of hyperparameter configurations, i.e. we train the model on the training data for different hyperparameters, compute the error on the validation data set, choose the hyperparameter setup with the lowest error and finally evaluate the model on the test set. The reasoning behind this is that if we used the test data set to determine the hyperparameter values, the test error may not be meaningful, because the hyperparameters have been optimized for this specific test set. Since we are using the validation set, we will have the test set with previously unseen data available to determine the generalization error without giving our network an advantage.
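A split in the 4:1:1 ratio mentioned above can be sketched in a few lines of Python. The function name, the shuffling before the split, and the fixed seed are assumptions made for illustration.

```python
import random


def split_data(data, ratios=(4, 1, 1), seed=0):
    """Shuffle the data and split it into training, validation and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # assumption: a random split is wanted
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_data(range(60))
```

For 60 data points this yields 40 training, 10 validation and 10 test points, and every point lands in exactly one of the three sets.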

Still, imagine you need to choose 5 hyperparameters and have 4 possible values that you want to try for each hyperparameter. This amounts to $4^5 = 1024$ combinations you have to run on the training data and evaluate on the validation set. In real applications the number of hyperparameters and possible values can be much larger, so that it is nearly infeasible to try every combination, but rather common to change one hyperparameter at a time. Luckily, some hyperparameters also have known good default values, like the hyperparameters for Adam Optimizer, Algorithm Adam. Apart from that it is tedious, manual work to try out, monitor and choose suitable hyperparameters.
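The exhaustive search over all combinations can be sketched with `itertools.product`. The function `grid_search`, the hyperparameter names `h1` to `h5`, and the stand-in `toy_validation_error` are hypothetical; in a real setting the latter would train the model on the training data and return the error on the validation set.

```python
from itertools import product


def grid_search(grid, validation_error):
    """Try every combination in the grid and keep the one with lowest error."""
    names = list(grid)
    best_config, best_error, n_tried = None, float("inf"), 0
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        error = validation_error(config)
        n_tried += 1
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error, n_tried


# Hypothetical grid: 5 hyperparameters with 4 candidate values each.
grid = {f"h{i}": [1, 2, 3, 4] for i in range(1, 6)}


def toy_validation_error(config):
    """Stand-in for 'train on training data, evaluate on validation data'."""
    return sum((value - 2) ** 2 for value in config.values())


best_config, best_error, n_tried = grid_search(grid, toy_validation_error)
```

The loop evaluates all $4^5 = 1024$ configurations, which shows how quickly the cost grows when each evaluation involves a full training run.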

Finally, we discuss the limitations of shallow Neural Networks, i.e. networks with only one layer.

## Modeling logical functions

Let us consider a shallow Neural Network with input layer $y^{[0]} \in \mathbb{N}^2$ and output layer $y^{[1]} \in \mathbb{N}$.

We model true by the value 1 and false by the value 0, which results in the following truth table for the logical "OR" function.

| input $y^{[0]}_1$ | input $y^{[0]}_2$ | $y^{[0]}_1$ OR $y^{[0]}_2$ (output $y^{[1]}_1$) |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 1 |

With Heaviside activation function, we have

[$] y_1^{[1]} = \begin{cases} 1, &\mbox{if } W_1 y^{[0]}_1 + W_2 y^{[0]}_2 + b \geq 0,\\ 0, &\mbox{else }. \end{cases} [$]

The goal is now to choose $W_1,W_2,b$ so that we match the output from the truth table for given input. Obviously, $W_1 = W_2 = 1$ and $b=-1$ is a possible choice that fulfills the task. Similarly, one can find values for $W_1,W_2$ and $b$ to model the logical "AND" function.
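The stated choice for "OR" can be checked against all four inputs with a few lines of Python. The weights $W_1 = W_2 = 1$, $b = -1$ are taken from the text; the choice $b = -2$ for "AND" is one possible solution assumed here for illustration, and the function name is likewise an assumption.

```python
def perceptron(y0, w, b):
    """Single neuron with Heaviside activation: 1 if w . y0 + b >= 0, else 0."""
    z = sum(wi * yi for wi, yi in zip(w, y0)) + b
    return 1 if z >= 0 else 0


INPUTS = [(0, 0), (1, 0), (0, 1), (1, 1)]

# Weights from the text for the logical OR.
or_outputs = [perceptron(y, (1, 1), -1) for y in INPUTS]
# One possible choice for the logical AND (assumption, not from the text).
and_outputs = [perceptron(y, (1, 1), -2) for y in INPUTS]
```

The list `or_outputs` reproduces the output column of the truth table, and `and_outputs` is `1` only for the input `(1, 1)`.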

Next, let us consider the logical "XOR" function with the following truth table.

| input $y^{[0]}_1$ | input $y^{[0]}_2$ | $y^{[0]}_1$ XOR $y^{[0]}_2$ (output $y^{[1]}_1$) |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 0 |

In fact, the logical "XOR" function cannot be represented by the given shallow Neural Network, since the data is not linearly separable, see e.g. [3](Section 6.1). This motivates the introduction of additional layers in between the input and output layer, i.e. we choose a more complex function $\mathcal{F}$ in the learning problem \eqref{eq:LP}.
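The claim can also be checked by hand for the Heaviside perceptron used here. Matching the four rows of the "XOR" truth table would require

[$] \begin{align*} b &\lt 0 &&\mbox{(input } (0,0) \mbox{, output 0)},\\ W_1 + b &\geq 0 &&\mbox{(input } (1,0) \mbox{, output 1)},\\ W_2 + b &\geq 0 &&\mbox{(input } (0,1) \mbox{, output 1)},\\ W_1 + W_2 + b &\lt 0 &&\mbox{(input } (1,1) \mbox{, output 0)}. \end{align*} [$]

Adding the second and third inequality gives $W_1 + W_2 + 2b \geq 0$, hence $W_1 + W_2 + b \geq -b \gt 0$ by the first inequality. This contradicts the fourth inequality, so no choice of $W_1, W_2, b$ exists.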

## General references

Herberg, Evelyn (2023). "Lecture Notes: Neural Network Architectures". arXiv:2304.05133 [cs.LG].

## References

1. F. Rosenblatt. (1958). "The perceptron: a probabilistic model for information storage and organization in the brain". Psychological review 65. American Psychological Association.
2. Ng, A. (2022), CS229 Lecture Notes
3. I. Goodfellow, Y. Bengio, and A. Courville. (2016). Deep learning. MIT Press.
4. H. Antil, T. S. Brown, R. Löhner, F. Togashi, and D. Verma. (2022). "Deep Neural Nets with Fixed Bias Configuration". arXiv preprint arXiv:2107.01308.
5. H. Antil and H. Díaz and E. Herberg (2022). "An Optimal Time Variable Learning Framework for Deep Neural Networks". arXiv preprint arXiv:2204.08528.
6. H. Antil, R. Khatri, R. Löhner, and D. Verma. (2020). "Fractional deep neural network via constrained optimization". Machine Learning: Science and Technology 2. IOP Publishing.
7. S. Boyd and L. Vandenberghe. (2004). Convex optimization. Cambridge university press.
8. Kingma, D. P. and Ba, J. (2014). "Adam: A method for stochastic optimization". arXiv preprint arXiv:1412.6980.
9. Geiger, A. (2021), Deep Learning Lecture Notes