Convolutional Neural Network

Content for this page was copied verbatim from Herberg, Evelyn (2023). "Lecture Notes: Neural Network Architectures". arXiv:2304.05133 [cs.LG].

In this section, based on [1],[2](Section 9), we consider Neural Networks with a different architecture: convolutional neural networks (CNNs or ConvNets). They were first introduced by Kunihiko Fukushima in 1980 under the name "neocognitron" [3]. Famous examples of convolutional neural networks today are "LeNet" [4], see Figure and "AlexNet" [5].

As a motivation, consider a classification task where the input is an image of size $n_{0,1} \times n_{0,2}$ pixels. We want to train a Neural Network so that it can decide, e.g. which digit is written in the image (MNIST data set). We have seen in Figure that the image with $n_{0,1}=n_{0,2}=28$ has been reshaped (vectorized, flattened) into a vector in $\mathbb{R}^{n_{0,1}\cdot n_{0,2}} = \mathbb{R}^{784}$, so that we can use it as an input for a regular FNN. However, this approach has several disadvantages:

• Vectorization causes the input image to loose all of its spatial structure, which could have been helpful during training.
• Let e.g. $n_{0,1}=n_{0,2}=1000$, then $n_0 = 10^6$ and the weight matrix $W^{[0]} \in \mathbb{R}^{n_1 \times 10^6}$ contains an enormous number of optimization variables. This can make training very slow or even infeasible.

On the contrary, convolutional neural networks are designed to exploit the relationships between neighboring pixels. In fact, the input of a CNN is typically a matrix or even a three-dimensional tensor, which is then passed through the layers while maintaining this structure. CNNs take small patches, e.g. squares or cubes, from the input images and learn features from them. Consequently, they can subsequently recognize these features in other images, even when they appear in other parts of the image.

In Figure we see the architecture of "LeNet-5". The inputs are images, where we have 1 channel, because we consider grayscale images. At first we have two sequences of convolution layer (yellow), Section Convolutional Layer , detector layer (violet), Section Detector Layer , and pooling layer (orange), Section Pooling Layer . These layers retain the multidimensional structure of the input. Since this network is built for a classification tasks, the output should be a vector of 10. Consequently, the multi-dimensional output of a hidden layer is flattened, i.e. vectorized, and the remaining layers are fully connected layers (bright yellow) as we have seen in FNNs.

In other, larger architectures, like AlexNet, cf. Figure, to avoid overfitting with large fully connected layers, a technique called dropout is applied. The key idea is to randomly drop units with a given probability and their connections from the neural network during training, for more details we refer to [6].

• We will view convolution, detector and pooling layers as separate layers. However, it is also possible to define a convolutional layer to consist of a convolution, detector and pooling stage, cf. Figure. This can be a source of confusion when referring to convolutional layers, which we should be aware of.
• Throughout the remainder of this section we omit layer indices $\ell$ to simplify notation, and we indicate the data with capital letter $Y$ to clarify that they are matrices or tensors.

Before we move on to a detailed introduction of the different layer types in CNNs, let us recall the mathematical concept of a convolution.

Convolution

As explained in [2](Section 9.1), in general, convolution describes how one function influences the shape of another function. But it can also be used to apply a weight function to another function, which is how convolution is used in convolutional neural networks.

Definition

Let $f, g: \mathbb{R}^n \to \mathbb{R}$ be two functions. If both $f$ and $g$ are integrable with respect to Lebesgue measure, we can define the convolution as:

[] \begin{align*} c(t) = (f \ast g) (t) = \int f(x) g(t-x) \, dx, \end{align*} []

for some $t \in \mathbb{R}^n$. Here, $f$ is called the input and $g$ is called the kernel. The new function $c: \mathbb{R}^n \rightarrow \mathbb{R}$ is called the feature map.

However, for convolutional neural networks we need the discrete version.

Definition

Let $f, g: \mathbb{Z}^n \to \mathbb{R}$ be two discrete functions. The discrete convolution is then defined as:

[] \begin{align*} c(t) = (f \ast g) (t) = \sum_{x \in \mathbb{Z}^n} f(x) g(t - x), \end{align*} []

for some $t \in \mathbb{Z}^n$.

A special case of the discrete convolution is setting $f$ and $g$ to $n$-dimensional vectors and using the indices as arguments. We illustrate this approach in the following example.

Example

Let $X$ and $Y$ be two random variable each describing the outcome of rolling a dice. The probability mass functions are defined as:

[] \begin{align*} f_X(t) = f_Y(t) = \begin{cases} \frac{1}{6}, &\text{if } t \in \{ 1,2,3,4,5,6\}, \\ 0, &\text{if } t \in \mathbb{Z}\setminus\{ 1,2,3,4,5,6\}. \end{cases} \end{align*} []

We aim at calculating the probability that the sum of both dice rolls equals nine. To this end, we take the vectors of all possible outcomes and arrange them into two rows. Here, we flip the second vector and slide it to the right, such that the numbers which add to nine align.

Now, we replace the outcomes with their respective probabilities, multiply the adjacent components and add up the results.

This gives

[$] f_{X+Y}(9) = \frac{1}{36} + \frac{1}{36} +\frac{1}{36} +\frac{1}{36} = \frac{1}{9}, [$]

i.e. the probability that the sum of the dice equals nine is $\frac{1}{9}.$ In fact, all the steps we have just done are equivalent to calculating a discrete convolution:

[] \begin{align*} f_{X+Y}(9) = \sum_{x = 1}^6 f_X(x)f_Y(9- x) = (f_X \ast f_Y) (9) \end{align*} []

Convolutional Layer

For the convolutional layers in CNNs we define convolutions for matrices, cf. e.g. [2]((9.4)). This can be extended to tensors straight forward.

Definition

Let $Y \in \mathbb{R}^{n_1 \times n_2}$ and $K \in \mathbb{R}^{m_1 \times m_2}$ be given matrices, such that $m_1 \leq n_1$ and $m_2 \leq n_2$. The convolution of $Y$ and $K$ is denoted by $Y \ast K$ with entries

[$] \left[ Y \ast K \right]_{i,j} := \sum_{k=1}^{m_1} \sum_{l=1}^{m_2} K_{k,l} Y_{i+m_1-k, j+m_2-l}, [$]

for $1 \leq i \leq n_1-m_1 + 1$ and $1 \leq j \leq n_2 - m_2 +1$. Here, $Y$ is called the input and $K$ is called the kernel.

In Machine Learning often the closely related concept of (cross) correlation, cf. e.g. [2]((9.6)), is used, and incorrectly referred to as convolution, where

[$] \left[ Y \circledast K \right]_{i,j} := \sum_{k=1}^{m_1} \sum_{l=1}^{m_2} K_{k,l} Y_{i \color{red}{-1+k}, j \color{red}{-1+l}}, [$]

for $1 \leq i \leq n_1-m_1 + 1$ and $1 \leq j \leq n_2 - m_2 +1$. The (cross) correlation has the same effect as convolution, if you flip both, rows and columns of the kernel $K$, see the changed indices indicated in red. Since we learn the kernel anyway, it is irrelevant whether the kernel is flipped, thus either concept can be used.

We illustrate the matrix computations with an example.

Example

For this example we have the data matrix

[$] \begin{equation*} Y = \begin{pmatrix} 1 & 5 & -2 & 0 & 2 \\ 3 & 8 & 7 & 1 & 0 \\ -1 & 0 & 1 & 2 & 3 \\ 4 & 2 & 1 & -1 & 2 \end{pmatrix}, \end{equation*} [$]

and the kernel

[$] \begin{equation*} K = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}. \end{equation*} [$]

The computation of $[Y \ast K]_{1,1}$ can be illustrated as follows

[$] \begin{equation*} [Y \ast K]_{1,1} = \begin{pmatrix} +1\cdot 9 & +5 \cdot 8 & -2\cdot 7 & \color{gray}{0} & \color{gray}{2} \\ +3 \cdot 6 & +8\cdot 5 & +7\cdot 4 & \color{gray}{1} & \color{gray}{0} \\ -1 \cdot 3 & +0\cdot 2 & +1\cdot 1 & \color{gray}{2} & \color{gray}{3} \\ \color{gray}{4} & \color{gray}{2} & \color{gray}{1} & \color{gray}{-1} & \color{gray}{2} \end{pmatrix} = 9 +40 -14 + 18 +40 +28 - 3 + 0 +1 = 119. \end{equation*} [$]

The gray values of $Y$ are not used in the computation. Here, we see that $K$ is flipped when used in the convolution. This also clarifies, how the (cross) correlation can be more intuitive, where

[$] \begin{equation*} [Y \circledast K]_{1,1} = \begin{pmatrix} +1\cdot 1 & +5 \cdot 2 & -2\cdot 3 & \color{gray}{0} & \color{gray}{2} \\ +3 \cdot 4 & +8\cdot 5 & +7\cdot 6 & \color{gray}{1} & \color{gray}{0} \\ -1 \cdot 7& +0\cdot 8 & +1\cdot 9 & \color{gray}{2} & \color{gray}{3} \\ \color{gray}{4} & \color{gray}{2} & \color{gray}{1} & \color{gray}{-1} & \color{gray}{2} \end{pmatrix} = 1 + 10 - 6 + 12 +40 + 42 -7 + 0 +9 = 101. \end{equation*} [$]

In a similar way we can proceed to calculate the remaining values by shifting the kernel over the matrix

[$] \begin{equation*} [Y \circledast K]_{1,2} = \begin{pmatrix} \color{gray}{1} & +5 \cdot 1 & -2\cdot 2 & +0\cdot 3 & \color{gray}{2} \\ \color{gray}{3} & +8\cdot 4 & +7\cdot 5 & +1 \cdot 6 & \color{gray}{0} \\ \color{gray}{-1} & +0\cdot 7 & +1\cdot 8 & +2 \cdot 9 & \color{gray}{3} \\ \color{gray}{4} & \color{gray}{2} & \color{gray}{1} & \color{gray}{-1} & \color{gray}{2} \end{pmatrix} = 5 -4 + 0 +32 +35 +6 + 0 + 8 +18= 100 . \end{equation*} [$]

Altogether, we get

[$] \begin{equation*} Y \ast K = \begin{pmatrix} 119 & 120 & 53 \\ 155 & 155 & 102 \end{pmatrix} \qquad \text{and} \qquad Y \circledast K = \begin{pmatrix} 101 & 100 & 87 \\ 95 & 55 & 58 \end{pmatrix}. \end{equation*} [$]

Especially, $Y \ast K \neq Y \circledast K$.

The kernel size, which is typically square, e.g. $m \times m$, is a hyperparameter of the CNN. Furthermore, the convolutional layer has additional hyperparameters that need to be chosen. We have seen that a convolution with a $m \times m$ kernel reduces the dimension from $n_1\times n_2$ to $n_1 - m + 1 \times n_2 - m + 1$. To retain the image dimension we can use (zero) padding, cf. [1], applied to the input $Y$ with $p \in \mathbb{N}_0$. Choosing $p=0$ yields $Y$ again, whereas $p=1$ results in

[$] \begin{equation*} \hat Y = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 5 &-2 & 0 & 2 & 0\\ 0 & 3 & 8 & 7 & 1 & 0 & 0\\ 0 &-1 & 0 & 1 & 2 & 3 & 0 \\ 0 & 4 & 2 & 1 &-1 & 2 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \end{equation*} [$]

for $Y$ from Example. Consequently, the padded matrix $\hat Y$ is of dimension $(n_1 +2p) \times (n_2+2p)$. To retain the image dimension, we need to choose $p$ so that

[] \begin{align*} (n_1 + 2p) - m + 1 &= n_1, \\ (n_2 + 2p) - m + 1 &= n_2 , \end{align*} []

i.e. $p = \frac{m-1}{2}$, which is possible for any odd $m$.

Furthermore, we can choose the stride $s\in\mathbb{N}$, which indicates how far to move the kernel. For example, in Figure the stride is chosen as $s = 1$, while the stride in Figure is $s=2$.

Let us remark that a stride $s\gt1$ reduces the output dimension of the convolution to

[$]\left(\frac{n_1-m}{s}+1\right) \times \left(\frac{n_2-m}{s}+1\right). [$]

Altogether, we can describe the convolutional layer. It consists of $M \in \mathbb{N}$ filters with identical hyperparameters: kernel size $m \times m$, padding $p$ and stride $s$, but each of them has its own learnable kernel $K$. Consequently, the filters in this layer have $M\cdot m^2$ variables in total. Applying all $M$ filters to an input matrix $Y \in \mathbb{R}^{n_1 \times n_2}$ leads to an output of size

[$] \left( \frac{n_1 + 2p - m}{s} +1 \right) \times \left( \frac{n_2 + 2p - m}{s} +1 \right) \times M, [$]

where the results for all $M$ filters are stacked, cf. [1]. Typically, the depth $M$ is chosen as a power of 2, and growing for deeper layers, while height and width are shrinking, cf. Figure.

Obviously, the output is a tensor with three dimensions, hence the subsequent layers need to process 3-tensor-valued data. In fact, for colored images already the original input of the network is a tensor. The (cross) correlation operation (and also the convolution operation) can be generalized to this case in the following way.

Assume we have an input tensor of size $n_1 \times n_2 \times n_3,$ then we choose a three dimensional kernel of size

[$] m \times m \times n_3 ,[$]

i.e. the depth coincides. No striding or padding is applied in the third dimension. Hence, the output is of dimension

[$] \left( \frac{n_1 + 2p - m}{s} +1 \right) \times \left( \frac{n_2 + 2p - m}{s} +1 \right) \times 1, [$]

which can be understood as a matrix by discarding the redundant third dimension, cf. Figure. Doing this for $M$ filters, again leads to the output being a 3-tensor.

• Convolutional layers have the advantage that they have less variables than fully connected layers applied to the flattened image. For example consider a grayscale image of size $28 \times 28$ as input in LeNet, cf. Figure. The first convolution has 6 kernels with $5\times 5$ entries. Due to padding with $p=2$ the output is $28\times 28 \times 6$, so the image size is retained. Additionally, before applying a detector layer, we add a bias per channel, so 6 bias variables in this case. In total, we have to learn 156 variables. Now, imagine this image is flattened to a $784\times 1$ vector and fed into a FNN with fully connected layer, where we also want to retain the size, i.e. the first hidden layer has $784$ nodes. This results in a much larger number of variables:

[$] \underbrace{784\cdot784}_{weight}+\underbrace{784}_{bias} = '''615440'''.[$]

• Directly related, a disadvantage of convolutional layers is that every output only sees a subset of all input neurons, cf. e.g. [2](Section 9.2). We denote this set of seen inputs by effective receptive field of the neuron. In an FNN with fully connected layers the effective receptive field of a neuron is the entire input. However, the receptive field of a neuron increases with depth of the network, as illustrated in Figure.

Detector Layer

In standard CNN architecture after a convolutional layer, a detector layer is applied. This simply means performing an activation function. To this end, we extend the activation function $\sigma:\mathbb{R} \rightarrow \mathbb{R}$ to matrix and tensor valued inputs by applying it component-wise, as we did for vectors before in Section Feedforward Neural Network , e.g. for a 3-tensor $Y = \{Y_{i,j,k}\}_{i,j,k}$ with $i=1,\ldots,n_1, j=1,\ldots,n_2, k=1,\ldots,n_3$ we get

[$] \left(\sigma(Y)\right)_{i,j,k} = \sigma(Y_{i,j,k}). [$]

Pooling Layer

After the detector layer, typically a pooling layer (also called downsampling layer) is applied, cf. e.g. [2](Section 9.3). This layer type is responsible for reducing the first two dimensions (height and width) and usually does not interfere with the third dimension (depth) of the data $Y$, but rather is applied for all channels independently. Consequently, the depth of the output coincides with the depth of the input and we omit the depth in our discussion.

As in convolutional layers, pooling layers have a filter size $m \times m$, stride $s$ and padding $p$. However, almost always $p=0$ is chosen. The most popular values for for the filter size and stride are $m=s=2$. Again, with an input of size $n_1 \times n_2$ the output dimension is

[$] \left( \frac{n_1 + 2p -m}{s} +1 \right) \times \left( \frac{n_1 + 2p -m}{s} +1 \right) \quad \stackrel{m=s=2, p=0}{=} \quad \frac{n_1}{2} \times \frac{n_2}{2}. [$]

One common choice is Max Pooling (or Maximum Pooling), where the largest value is selected, cf. e.g. [1]. Below we see an example of max pooling with a $2 \times 2$ kernel, stride $s=2$ and no padding.

[$] \begin{equation*} \left( \begin{array}{c c | c c} 1 & 3 & 0 & -7 \\ -2 & \color{myorange}{4} & \color{mygreen}{1} & -1 \\ \hline 0 & 1 & \color{HeiRot}{8} & -3 \\ \color{myblue}{2} & 0 & 4 & 5 \end{array} \right) \stackrel{\max}{\longrightarrow} \left( \begin{array}{c | c} \color{myorange}{4} & \color{mygreen}{1} \\ \hline \color{myblue}{2} & \color{HeiRot}{8} \end{array} \right) \end{equation*} [$]

Another common choice is Average Pooling, where we take the mean of all values. Below we see an example of average pooling (abbreviated: "avg") with a $2 \times 2$ kernel,

[$] K = \begin{pmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{pmatrix},[$]

stride $s=2$ and no padding.

[$] \begin{equation*} \hspace{1.2cm} \left( \begin{array}{c c | c c} \color{myorange}{1} & \color{myorange}{3} & \color{mygreen}{0} & \color{mygreen}{-7} \\ \color{myorange}{-2} & \color{myorange}{4} & \color{mygreen}{1} & \color{mygreen}{-1} \\ \hline \color{myblue}{0} & \color{myblue}{1} & \color{HeiRot}{8} & \color{HeiRot}{-3} \\ \color{myblue}{2} & \color{myblue}{0} & \color{HeiRot}{4} & \color{HeiRot}{5} \end{array} \right) \stackrel{\text{avg}}{\longrightarrow} \left( \begin{array}{c | c} \color{myorange}{1.50} & \color{mygreen}{-1.75} \\ \hline \color{myblue}{0.75} & \color{HeiRot}{3.50} \end{array} \right) \end{equation*} [$]

The effect of average pooling applied to an image is easily visible: It blurs the image. In the new image every pixel is an average of a pixel and its neighboring $m^2-1$ pixels, see Figure. Depending on the choice of stride $s$ and padding $p$, the blurred image may also have less pixels.

Original image (left) and blurred image produced by average pooling (right) with a $5 \times 5$ kernel, stride $s=1$ and zero padding with $p=2$. Image Source: Laurin Ernst.

• Pooling layers do not contain variables to learn.
• We have seen that when using CNNs, we make the following assumptions:
• Pixels far away from each other do not need to interact with each other.
• Small translations are not relevant.

Local Response Normalization

Similar to batch normalization, Local Response Normalization (LRN) [5](Section 3.3) stabilizes training with unbounded activation functions like ReLU. This strategy was first introduced within the "AlexNet" architecture [5], cf. Figure, because contrary to previous CNNs like "LeNet-5", which used sigmoid activation, "AlexNet" employs ReLU activation.

The inter-channel LRN, as introduced in [5](Section 3.3), see also Figure a), is given by

[$] \begin{equation*} \hat{Y}_{i,j,k} = \frac{Y_{i,j,k}}{ \left( \kappa + \gamma \sum\limits_{m = \max(1,k-\frac{n}{2})}^{\min(M,k+\frac{n}{2})} (Y_{i,j,m})^2 \right)^\beta }. \end{equation*} [$]

Here, $Y_{i,j,k}$ and $\hat{Y}_{i,j,k}$ denote the activity of the neuron before and after normalization, respectively. The indices $i,j,k$ indicate the height, width and depth of $Y$. We have $i=1,\ldots,n_1$, $j=1,\ldots,n_2$ and $k=1,\ldots,M$, where $M$ is the number of filters in the previous convolutional layer. The values $\kappa, \gamma, \beta, n \in \mathbb{R}$ are hyperparameters, where $\kappa$ is used to avoid singularities, and $\gamma$ and $\beta$ are called normalization and contrasting constants, respectively. Furthermore, $n$ dictates how many surrounding neurons are taken into consideration, see also Figure. In [5] $\kappa = 2, \gamma = 10^{-4}, \beta = 0.75$ were chosen.

In the case of intra-channel LRN, the neighborhood is extended within the same channel. This leads to the following formula

[$] \begin{equation*} \hat{Y}_{i,j,k} = \frac{Y_{i,j,k}}{ \left( \kappa + \gamma \sum\limits_{p = \max(1,i-\frac{n}{2})}^{\min(n_1,i+\frac{n}{2})} \sum\limits_{q = \max(1,j-\frac{n}{2})}^{\min(n_2,j+\frac{n}{2})} (Y_{p,q,k})^2 \right)^\beta }. \end{equation*} [$]

The LRN layer is non-trainable, since it only contains hyperparameters and no variables.

General references

Herberg, Evelyn (2023). "Lecture Notes: Neural Network Architectures". arXiv:2304.05133 [cs.LG].

References

1. I. Goodfellow, Y. Bengio, and A. Courville. (2016). Deep learning. MIT Press.CS1 maint: multiple names: authors list (link)
2. Fukushima, K.. "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position". Biological Cybernetics 36.
3. Y. LeCun and B. Boser and J. S. Denker and D. Henderson and R. E. Howard and W. Hubbard and L. D. Jackel (1998). "Gradient-based learning applied to document recognition". Proceedings of the IEEE 86.
4. A. Krizhevsky and I. Sutskever and G. E. Hinton (2017). "ImageNet Classification with Deep Convolutional Neural Networks". Communications of the ACM 60.
5. Srivastava, N. and Hinton, G. and Krizhevsky, A. and Sutskever, I. and Salakhutdinov, R. (2014). "Dropout: a simple way to prevent neural networks from overfitting". The journal of machine learning research 15. JMLR. org.