\section{Application of NN to higher complexity Problems}

As neural networks are applied to problems of higher complexity, which
often results in a higher dimensionality of the input, the number of
parameters in the network rises drastically.
For example a network with ...
A way to combat the

\subsection{Convolution}

Convolution is a mathematical operation in which the product of two
functions is integrated after one of them has been reflected and shifted,
\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) ds.
\]
This operation can be described as a filter function $g$ being applied
to $f$: each value $f(t)$ is replaced by an average of the values of $f$
around position $t$, weighted by $g$.

The convolution operation allows for a wide range of data
manipulation, a simple example being the smoothing of real-time data.
Consider a sensor measuring the location of an object (e.g. via GPS).
We expect the output of the sensor to be noisy as a result of a number
of factors that impact its accuracy. In order to get a better estimate
of the actual location we want to smooth the data to reduce the noise.
Using convolution for this task, we can control the significance we
give to each data point. We might, for instance, want to give a larger
weight to more recent measurements than to older ones. As these
measurements are taken on a discrete timescale, we first need to
introduce discrete convolution. Let $f$, $g: \mathbb{Z} \to \mathbb{R}$, then
\[
  (f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]
Applying this to the data with a suitably chosen filter $g$, we are
able to improve the accuracy, as can be seen in
Figure~\ref{fig:sin_conv}.
\input{Plots/sin_conv.tex}
This form of discrete convolution can also be applied to functions
with inputs of higher dimensionality.
Let $f$, $g: \mathbb{Z}^d \to \mathbb{R}$, then
\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
This will prove to be a useful framework for image manipulation, but in
order to apply convolution to images we first need to discuss the
representation of image data. Most often, images are represented by
each pixel being a mixture of base colors; these base colors define
the color space in which the image is encoded. Commonly used color
spaces are RGB (red, green, blue) and CMYK (cyan, magenta, yellow,
black). An example of an image split into its red, green and blue
channels is given in Figure~\ref{fig:rgb}. Using this encoding of the
image we can define a corresponding discrete function describing the
image by mapping the coordinates $(x,y)$ of a pixel and the channel
(color) $c$ to the respective value $v$
\begin{align}
  \begin{split}
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
  \label{def:I}
\end{align}

\begin{figure}
  \begin{adjustbox}{width=\textwidth}
    \begin{tikzpicture}
      \begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
        \node[canvas is xy plane at z=0, transform shape] at (0,0)
        {\includegraphics[width=5cm]{Plots/Data/klammern_r.jpg}};
        \node[canvas is xy plane at z=2, transform shape] at (0,-0.2)
        {\includegraphics[width=5cm]{Plots/Data/klammern_g.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (0,-0.4)
        {\includegraphics[width=5cm]{Plots/Data/klammern_b.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (-8,-0.2)
        {\includegraphics[width=5.3cm]{Plots/Data/klammern_rgb.jpg}};
      \end{scope}
    \end{tikzpicture}
  \end{adjustbox}
  \caption{On the right the red, green and blue channels of the picture
    are displayed. In order to better visualize the color channels, the
    black and white picture of each channel has been colored in the
    respective color.
    Combining the layers results in the image on the left.}
  \label{fig:rgb}
\end{figure}

With this representation of an image as a function, we can apply
filters to the image using convolution for multidimensional functions
as described above. In order to simplify the notation we will from now
on write the function $I$ given in (\ref{def:I}) as well as the
filter function $g$ as tensors, resulting in the modified notation of
convolution
\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]

Simple examples of image manipulation using convolution are smoothing
operations and rudimentary edge detection in grayscale images, i.e.
images that only have one channel. A popular filter for smoothing
images is the Gauss filter, which for a given $\sigma \in \mathbb{R}_+$
and odd size $s \in \mathbb{N}$ is defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2
      \sigma^2}}, ~ x,y \in \left\{-\tfrac{s-1}{2},\dots,\tfrac{s-1}{2}\right\}.
\]

For edge detection purposes the Sobel operator is widespread. Here two
filters are applied to the image $I$ and then combined. Edges in the
$x$ direction are detected by convolution with
\[
  G =\left[
    \begin{matrix}
      -1 & 0 & 1 \\
      -2 & 0 & 2 \\
      -1 & 0 & 1
    \end{matrix}\right],
\]
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by
\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied componentwise. Examples
of convolution with both kernels are given in Figure~\ref{fig:img_conv}.
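
As a small worked example of the Sobel operator, assume (for this
illustration only) that the entries $G_{i,j}$ are indexed by
$i,j \in \{-1,0,1\}$, with $i$ the column and $j$ the row offset from
the center of the matrix, and that the grayscale image is an idealized
vertical edge with $I_{x',y'} = 0$ for $x' \leq x$ and $I_{x',y'} = 1$
for $x' > x$. As the middle column of $G$ vanishes, we get at the
position $(x,y)$
\begin{align*}
  (I * G)_{x,y} &= \sum_{j=-1}^{1} \left( G_{-1,j} I_{x+1,y-j} + G_{1,j}
                  I_{x-1,y-j} \right) = -(1 + 2 + 1) \cdot 1 + (1 + 2 + 1)
                  \cdot 0 = -4, \\
  (I * G^T)_{x,y} &= \sum_{i=-1}^{1} \left( G^T_{i,-1} I_{x-i,y+1} +
                    G^T_{i,1} I_{x-i,y-1} \right) = 0,
\end{align*}
where the second expression vanishes because the image does not depend
on $y'$. The combined output is thus $O_{x,y} = \sqrt{(-4)^2 + 0^2} = 4$.
In a region of constant brightness both convolutions are zero, as the
entries of each kernel sum to zero, so only the edge produces a
response.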
\todo{padding}

\begin{figure}[h]
  \centering
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/klammern.jpg}
    \caption{Original Picture}
    \label{subf:OrigPicGS}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv9.png}
    \caption{Gaussian Blur $\sigma^2 = 1$}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv10.png}
    \caption{Gaussian Blur $\sigma^2 = 4$}
  \end{subfigure}\\
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv4.png}
    \caption{Sobel Operator $x$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv5.png}
    \caption{Sobel Operator $y$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
    \caption{Sobel Operator combined}
  \end{subfigure}
  % \begin{subfigure}{0.24\textwidth}
  %   \centering
  %   \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
  %   \caption{test}
  % \end{subfigure}
  \caption{Convolution of the original grayscale image (a) with
    different kernels. In (b) and (c) Gaussian kernels of size 11 with
    the stated $\sigma^2$ are used. In (d)--(f) the Sobel operator
    kernels defined above are used.}
  \label{fig:img_conv}
\end{figure}
\clearpage
\newpage
\subsection{Convolutional NN}
\todo{introduction to CNN}
% Conventional neural network as described in chapter .. are made up of
% fully connected layers, meaning each node in a layer is influenced by
% all nodes of the previous layer. If one wants to extract information
% out of high dimensional input such as images this results in a very
% large amount of variables in the model. This limits the

% In conventional neural networks as described in chapter ...
% all layers
% are fully connected, meaning each output node in a layer is influenced
% by all inputs. For $i$ inputs and $o$ output nodes this results in $i
% + 1$ variables at each node (weights and bias) and a total $o(i + 1)$
% variables. For large inputs like image data the amount of variables
% that have to be trained in order to fit the model can get excessive
% and hinder the ability to train the model due to memory and
% computational restrictions. By using convolution we can extract
% meaningful information such as edges in an image with a kernel of a
% small size $k$ in the tens or hundreds independent of the size of the
% original image. Thus for a large image $k \cdot i$ can be several
% orders of magnitude smaller than $o\cdot i$ .

As seen in the previous section, convolution lends itself to the
manipulation of images and other large data, which motivates its usage
in neural networks. This is achieved by implementing convolutional
layers, in which several filters are applied to the input and the
values of the filters are trainable parameters of the model. Each node
in such a layer corresponds to a pixel of the output of the
convolution with one of those filters, to which a bias and an
activation function are applied. The usage of multiple filters results
in multiple outputs of the same size as the input, which are often
called channels. Depending on the size of the filters, this can result
in the output having one more dimension than the input. However, for a
convolutional layer following another convolutional layer, the size of
the filter in the channel direction is often chosen to coincide with
the number of channels of the output of the previous layer, without
using padding in this direction, in order to prevent gaining additional
dimensions\todo{strange} in the output. This can also be used to
flatten certain less interesting dimensions of the input, for example
the color channels.
Thus filters used in convolutional networks usually have the same
number of dimensions as the input or one more.

The size of the filters and the way they are applied can be tuned
while building the model, but should be the same for all filters in one
layer in order for the output to be of consistent size in all channels.
It is common to reduce the size of the output by not applying the
filters at each ``pixel'' but rather specifying a ``stride'' $s$ at
which the filter $g$ is moved over the input $I$ in the $x$ and $y$
directions,
\[
  O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{sx-i,sy-j,c-l} g_{i,j,l}.
\]

As seen, convolution lends itself to image manipulation. In this
chapter we will explore how we can incorporate convolution in neural
networks, and how that might be beneficial.

Convolutional neural networks as described by ... are made up of
convolutional layers, pooling layers, and fully connected ones. The
fully connected layers are layers in which each input node is
connected to each output node, which is the structure introduced in
chapter ...

In a convolutional layer, instead of combining all input nodes for
each output node, the input nodes are interpreted as a tensor to which
a kernel is applied via convolution, resulting in the output. Most
often multiple kernels are used, resulting in multiple output tensors.
These kernels are the variables which can be altered in order to fit
the model to the data. Using multiple kernels it is possible to
extract different features from the image (e.g. edges, as with the
Sobel operator). As this increases the dimensionality even further,
which is undesirable since it increases the number of variables in
later layers of the model, a convolutional layer is often followed by a
pooling one. In a pooling layer the input is reduced in size by
extracting a single value from a neighborhood \todo{moving...}... .
The resulting output size depends on the offset of the neighborhoods
used. A popular choice is max-pooling, where the largest value in a
neighborhood is used.
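
To illustrate max-pooling with a small example, assume (for this
illustration only) non-overlapping neighborhoods of size $2 \times 2$,
i.e. an offset of two in both directions. A $4 \times 4$ input is then
reduced to a $2 \times 2$ output,
\[
  \left[
    \begin{matrix}
      1 & 3 & 2 & 0 \\
      4 & 2 & 1 & 1 \\
      0 & 1 & 5 & 2 \\
      2 & 2 & 3 & 7
    \end{matrix}\right]
  \mapsto
  \left[
    \begin{matrix}
      4 & 2 \\
      2 & 7
    \end{matrix}\right],
\]
where each entry of the output is the largest value of the
corresponding block, e.g. $\max\left\{1,3,4,2\right\} = 4$ for the
upper left block. With this choice of neighborhoods the number of
values passed on to the next layer is reduced by a factor of four.
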
This construct allows for the extraction of features from the input
while using far fewer variables.

... \todo{example with a small image, ideally the one from above}

\subsubsection{Parallels to the Visual Cortex in Mammals}

The choice of convolution for image classification tasks is not
arbitrary. ... eye ...

\subsection{Limitations of the Gradient Descent Algorithm}

\begin{itemize}
\item Hyperparameter guesswork
\item Problems navigating valleys $\to$ momentum
\item Different scale of gradients for variables in different layers
  $\to$ ADADELTA
\end{itemize}

\subsection{Stochastic Training Algorithms}

For many applications in which neural networks are used, such as image
classification or segmentation, large training data sets become
necessary to capture the nuances of the data. However, as training sets
get larger the memory requirement during training grows with them.
In order to update the weights with the gradient descent algorithm,
derivatives of the network with respect to each variable need to be
calculated for all data points in order to get the full gradient of the
error of the network.
Thus the amount of memory and computing power available limits the
size of the training data that can be used efficiently in fitting the
network. A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
descent algorithms. Here the premise is that instead of using the whole
dataset, a (different) subset of the data is chosen to compute the
gradient in each iteration (Algorithm~\ref{alg:sgd}).
The training period until each data point has been considered in
updating the parameters is commonly called an ``epoch''.
Using subsets reduces the amount of memory and computing power required
for each iteration. This makes it possible to use very large training
sets to fit the model.
Additionally, the noise introduced on the gradient can improve
the accuracy of the fit, as stochastic gradient descent algorithms are
less likely to get stuck in local extrema.
Another important benefit of using subsets is that, depending on their
size, the gradient can be calculated far more quickly, which allows for
more parameter updates in the same time. If the approximated gradient
is close enough to the ``real'' one, this can drastically cut down the
time required to train the model to a certain degree of accuracy, or
improve the accuracy achievable in a given amount of training time.

\begin{algorithm}
  \SetAlgoLined
  \KwInput{Function $f$, Weights $w$, Learning Rate $\gamma$, Batch Size $B$, Loss Function $L$,
    Training Data $D$, Epochs $E$.}
  \For{$i \in \left\{1,\dots,E\right\}$}{
    Initialize: $S \leftarrow D$\;
    \While{$\vert S \vert \geq B$}{
      Draw $\tilde{D}$ from $S$ with $\vert\tilde{D}\vert = B$\;
      Update $S$: $S \leftarrow S \setminus \tilde{D}$\;
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        \tilde{D})}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
    \If{$S \neq \emptyset$}{
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        S)}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
  }
  \caption{Stochastic gradient descent.}
  \label{alg:sgd}
\end{algorithm}

In order to illustrate this behavior we modeled a convolutional neural
network to ... handwritten digits. The data set used for this is the
MNIST database of handwritten digits (\textcite{MNIST},
Figure~\ref{fig:MNIST}).
\input{Plots/mnist.tex}
The network used consists of two convolution and max-pooling layers
followed by one fully connected hidden layer and the output layer.
Both convolutional layers utilize square filters of size five, which are
applied with a stride of one.
The first layer consists of 32 filters and the second of 64. Both
pooling layers pool a $2\times 2$ area. The fully connected layer
consists of 256 nodes and the output layer of 10, one for each digit.
All layers except the output layer use ReLU as activation function,
with the output layer using softmax (\ref{def:softmax}).
As loss function categorical crossentropy is used (\ref{def:...}).
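
To give a sense of the size of this network, the number of trainable
parameters can be counted layer by layer. The following count is only
a sketch under assumptions that are not stated above: the input is a
single-channel $28 \times 28$ image, each filter spans all channels of
its input and has one bias, and zero padding keeps the spatial size
unchanged in the convolutional layers, so that the two pooling layers
reduce it to $14 \times 14$ and then $7 \times 7$.
\begin{align*}
  \text{first convolutional layer:} \quad & 32 \cdot (5 \cdot 5 \cdot 1 + 1) = 832, \\
  \text{second convolutional layer:} \quad & 64 \cdot (5 \cdot 5 \cdot 32 + 1) = 51\,264, \\
  \text{fully connected layer:} \quad & 256 \cdot (7 \cdot 7 \cdot 64 + 1) = 803\,072, \\
  \text{output layer:} \quad & 10 \cdot (256 + 1) = 2\,570.
\end{align*}
Under these assumptions the network has $857\,738$ trainable parameters
in total, most of them in the fully connected layer. The counts for the
convolutional layers do not depend on the padding. Without padding the
spatial size would shrink to $4 \times 4$ before the fully connected
layer, which would then have $256 \cdot (4 \cdot 4 \cdot 64 + 1) =
262\,400$ parameters instead.
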
The architecture of the convolutional neural network is summarized in
Figure~\ref{fig:mnist_architecture}.

\begin{figure}
  \missingfigure{network architecture}
  \caption{architecture}
  \label{fig:mnist_architecture}
\end{figure}

The results of the network being trained with gradient descent and
stochastic gradient descent for 20 epochs are given in
Figure~\ref{fig:sgd_vs_gd} and Table~\ref{table:sgd_vs_gd}.
\input{Plots/SGD_vs_GD.tex}
Here it can be seen that the network trained with stochastic gradient
descent is more accurate after the first epoch than the ones trained
with gradient descent after 20 epochs.
This is due to the former using a batch size of 32 and thus having
made 1875 updates to the weights after the first epoch, in comparison
to a single update. While each of these updates uses an approximate
gradient calculated on a subset, the network performs far better than
the one using true gradients when training for the same amount of time.
\todo{compare training time}

\clearpage
\subsection{Modified Stochastic Gradient Descent}

There is an inherent problem in the sensitivity of the gradient descent
algorithm to the learning rate $\gamma$.
The difficulty of choosing the learning rate can be seen in
Figure~\ref{fig:sgd_vs_gd}. For small rates the progress in each
iteration is small, but as the rate is enlarged the algorithm can
become unstable and diverge. Even for learning rates small enough to
ensure that the parameters do not diverge to infinity, steep valleys
can hinder the progress of the algorithm, as with too large learning
rates gradient descent ``bounces between'' the walls of the valley
rather than follow ...
% \[
%   w - \gamma \nabla_w ...
% \]
thus the weights grow to infinity.
\todo{explain unstable learning rate better}

To combat this problem it has been proposed\todo{source} to alter the
learning rate over the course of training, which is often called
learning rate scheduling.
The most popular implementations of learning rate scheduling are time
based decay
\[
  \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
\]
where $d$ is the decay parameter and $n$ is the number of epochs,
step based decay, where the learning rate is fixed for a span of $r$
epochs and then decreased according to the parameter $d$,
\[
  \gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor},
\]
and exponential decay, where the learning rate is decreased after each
epoch,
\[
  \gamma_n = \gamma_0 e^{-n d}.
\]
These methods are able to increase the accuracy of a model by a large
margin, as seen in the training of ResNet by \textcite{resnet}
\todo{maybe include a figure}. However, stochastic gradient descent
with learning rate decay is still highly sensitive to the choice of the
hyperparameters $\gamma$ and $d$.
In order to mitigate this problem a number of algorithms have been
developed to regularize the learning rate with as little
hyperparameter guesswork as possible.
One of these algorithms is the ADADELTA algorithm developed by
\textcite{ADADELTA}.

\clearpage
\begin{itemize}
\item ADAM
\item momentum
\item ADADELTA \textcite{ADADELTA}
\end{itemize}

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
  \For{$t \in \left\{1,\dots,T\right\}$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} +
    (1-\rho)g_t^2$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
        x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
    Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{ADADELTA, \textcite{ADADELTA}}
  \label{alg:gd}
\end{algorithm}

% \subsubsubsection{Stochastic Gradient Descent}

\subsection{Combating Overfitting}

% As in many machine learning applications if the model is overfit in
% the data it can drastically
% reduce the generalization of the model. In
% many machine learning approaches noise introduced in the learning
% algorithm in order to reduce overfitting. This results in a higher
% bias of the model but the trade off of lower variance of the model is
% beneficial in many cases. For example the regression tree model
% ... benefits greatly from restricting the training algorithm on
% randomly selected features in every iteration and then averaging many
% such trained trees inserted of just using a single one. \todo{not
% yet sure whether I want to use this} For neural networks similar
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit{dropout} which has been first introduced in
% \cite{Dropout}

Similarly to shallow networks, overfitting can still impact the quality
of convolutional neural networks. A popular way to combat this problem
is to introduce noise into the training of the model. This is a
successful strategy for other models as well: an ensemble of decision
trees grown on bootstrapped training samples benefits greatly from
randomizing the features available in each training iteration (Hastie,
bachelor's thesis??).
The way noise is introduced into the model is by deactivating certain
nodes (setting the output of the node to 0) in the fully connected
layers of the convolutional neural network. The nodes are chosen at
random and change in every iteration. This practice is called dropout
and was introduced by \textcite{Dropout}.

\todo{comparison of different dropout rates on MNIST or similar, subset
  as training set?}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: