\section{Application of NN to higher complexity Problems}
This section is based on \textcite[Chapter~9]{Goodfellow}.
As neural networks are applied to problems of higher complexity, which often results in higher dimensionality of the input, the number of parameters in the network rises drastically. For very large inputs such as high-resolution image data, the fully connected nature of the neural network means that the number of parameters can ... exceed the amount that is feasible for training and storage. A way to combat this is to use layers which are only sparsely connected and share parameters between nodes.\todo{transition to conv?}
\subsection{Convolution}
Convolution is a mathematical operation in which the product of two functions is integrated after one of them has been reversed and shifted,
\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) ds.
\]
This operation can be described as a filter function $g$ being applied to $f$: each value $f(t)$ is replaced by an average of values of $f$, weighted by the filter function $g$ positioned at $t$.
The convolution operation allows for plentiful manipulation of data, a simple example being the smoothing of real-time data. Consider a sensor measuring the location of an object (e.g. via GPS). We expect the output of the sensor to be noisy, as a number of factors impact the accuracy of the measurements. In order to get a better estimate of the actual location we want to smooth the data to reduce the noise. Using convolution for this task, we can control the significance we want to give each data point. We might want to give a larger weight to more recent measurements than to older ones. If we assume these measurements are taken on a discrete timescale, we need to define convolution for discrete functions. \\Let $f$, $g: \mathbb{Z} \to \mathbb{R}$, then
\[
  (f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]
Applying this to the data with a suitably chosen filter $g$, we are able to improve the accuracy, as can be seen in Figure~\ref{fig:sin_conv}.
\input{Figures/sin_conv.tex}
This form of discrete convolution can also be applied to functions with inputs of higher dimensionality. Let $f$, $g: \mathbb{Z}^d \to \mathbb{R}$, then
\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1, \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
This will prove to be a useful framework for image manipulation, but in order to apply convolution to images we first need to discuss the representation of image data. Most often images are represented by each pixel being a mixture of base colors. These base colors define the color space in which the image is encoded. Commonly used color spaces are RGB (red, green, blue) and CMYK (cyan, magenta, yellow, black). An example of an image decomposed into its red, green and blue channels is given in Figure~\ref{fig:rgb}. Using this encoding of the image we can define a corresponding discrete function describing the image by mapping the coordinates $(x,y)$ of a pixel and the channel (color) $c$ to the respective value $v$,
\begin{align}
  \begin{split}
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
  \label{def:I}
\end{align}
\begin{figure}
  \begin{adjustbox}{width=\textwidth}
    \begin{tikzpicture}
      \begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
        \node[canvas is xy plane at z=0, transform shape] at (0,0) {\includegraphics[width=5cm]{Figures/Data/klammern_r.jpg}};
        \node[canvas is xy plane at z=2, transform shape] at (0,-0.2) {\includegraphics[width=5cm]{Figures/Data/klammern_g.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (0,-0.4) {\includegraphics[width=5cm]{Figures/Data/klammern_b.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (-8,-0.2) {\includegraphics[width=5.3cm]{Figures/Data/klammern_rgb.jpg}};
      \end{scope}
    \end{tikzpicture}
  \end{adjustbox}
  \caption[Channel separation of color image]{On the right the red, green and blue channels of the picture are displayed. In order to better visualize the color channels, the black-and-white picture of each channel has been colored in the respective color. Combining the layers results in the image on the left.}
  \label{fig:rgb}
\end{figure}
With this representation of an image as a function, we can apply filters to the image using convolution for multidimensional functions as described above. In order to simplify the notation we will from now on write the function $I$ given in (\ref{def:I}) as well as the filter function $g$ as tensors, resulting in the modified notation of convolution
\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]
As images are finite in size, the convolution is not well defined for pixels too close to the border. Thus the output will be of reduced size, with the new size in each dimension $d$ being \textit{(size of input in dimension $d$) - (size of kernel in dimension $d$) +1}. In order to ensure the output is of the same size as the input, the image can be padded in each dimension with zero entries, which ensures the convolution is well defined for all pixels of the image.
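As a brief illustration of this size formula, consider a grayscale image of $28 \times 28$ pixels and a kernel of size $5 \times 5$ (both sizes are assumptions chosen only for this example). Without padding the output has size
\[
  (28 - 5 + 1) \times (28 - 5 + 1) = 24 \times 24.
\]
Padding the image with $2$ zero entries on each side of both dimensions enlarges it to $32 \times 32$ pixels, and the output regains the size of the original image,
\[
  (32 - 5 + 1) \times (32 - 5 + 1) = 28 \times 28.
\]
More generally, for an odd kernel size $k$ a padding of $(k-1)/2$ entries on each side preserves the size in that dimension.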
Simple examples of image manipulation using convolution are smoothing operations and rudimentary edge detection in grayscale images, that is, images with only one channel. A popular filter for smoothing images is the Gauss filter, which for a given $\sigma \in \mathbb{R}_+$ and size $s \in \mathbb{N}$ is defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2 \sigma^2}}, ~ x,y \in \left\{1,\dots,s\right\}.
\]
For edge detection purposes the Sobel operator is widespread. Here two filters are applied to the image $I$ and then combined. Edges in the $x$ direction are detected by convolution with
\[
  G =\left[
    \begin{matrix}
      -1 & 0 & 1 \\
      -2 & 0 & 2 \\
      -1 & 0 & 1
    \end{matrix}\right],
\]
and edges in the $y$ direction by convolution with $G^T$. The final output is given by
\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied component-wise. Examples of the convolution of an image with both kernels are given in Figure~\ref{fig:img_conv}.
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/klammern.jpg}
    \caption{Original Picture}
    \label{subf:OrigPicGS}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv9.png}
    \caption{\hspace{-2pt}Gaussian Blur $\sigma^2 = 1$}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv10.png}
    \caption{Gaussian Blur $\sigma^2 = 4$}
  \end{subfigure}\\
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv4.png}
    \caption{Sobel Operator $x$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv5.png}
    \caption{Sobel Operator $y$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv6.png}
    \caption{Sobel Operator
    combined}
  \end{subfigure}
  % \begin{subfigure}{0.24\textwidth}
  % \centering
  % \includegraphics[width=\textwidth]{Figures/Data/image_conv6.png}
  % \caption{test}
  % \end{subfigure}
  \caption[Convolution applied on image]{Convolution of the original grayscale image (a) with different kernels. In (b) and (c) Gaussian kernels of size 11 and the stated $\sigma^2$ are used. In (d)--(f) the Sobel operator kernels defined above are used.}
  \label{fig:img_conv}
\end{figure}
\clearpage
\newpage
\subsection{Convolutional NN}
\todo{introduction to CNN, amount of parameters}
% Conventional neural network as described in chapter .. are made up of
% fully connected layers, meaning each node in a layer is influenced by
% all nodes of the previous layer. If one wants to extract information
% out of high dimensional input such as images this results in a very
% large amount of variables in the model. This limits the
% In conventional neural networks as described in chapter ... all layers
% are fully connected, meaning each output node in a layer is influenced
% by all inputs. For $i$ inputs and $o$ output nodes this results in $i
% + 1$ variables at each node (weights and bias) and a total $o(i + 1)$
% variables. For large inputs like image data the amount of variables
% that have to be trained in order to fit the model can get excessive
% and hinder the ability to train the model due to memory and
% computational restrictions. By using convolution we can extract
% meaningful information such as edges in an image with a kernel of a
% small size $k$ in the tens or hundreds independent of the size of the
% original image. Thus for a large image $k \cdot i$ can be several
% orders of magnitude smaller than $o\cdot i$ .
As seen in the previous section, convolution lends itself to the manipulation of images or other large data, which motivates its usage in neural networks. This is achieved by implementing convolutional layers in which several trainable filters are applied to the input.
Each node in such a layer corresponds to a pixel of the output of the convolution with one of those filters, to which a bias and an activation function are applied. Depending on the sizes, this can drastically reduce the number of variables in a layer compared to fully connected ones. As the variables of the filters are shared among all nodes, a convolutional layer with input of size $s_i$, output size $s_o$ and $n$ filters of size $f$ will contain $n f + s_o$ parameters, whereas a fully connected layer has $(s_i + 1) s_o$ trainable weights. The usage of multiple filters results in multiple outputs of the same size as the input (or slightly smaller if no padding is used). These are often called channels. For convolutional layers that are preceded by convolutional layers, the size of the filter in the channel dimension is often chosen to coincide with the number of channels of the output of the previous layer, and the input is not padded in this direction. This results in the channels ``being squashed'' and prevents gaining additional dimensions\todo{explain full-depth filters better} in the output. This can also be used to flatten certain less interesting channels of the input, such as color channels.
% Thus filters used in convolutional networks are usually have the same
% amount of dimensions as the input or one more.
A way to additionally reduce the size of the output is not to apply the convolution at every pixel, but rather to specify a certain ``stride'' $s$ at which the filter $g$ is moved over the input $I$,
\[
  O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{s x-i,s y-j,c-l} g_{i,j,l}.
\]
The size and stride for all filters in a layer should be the same in order to get a uniform tensor as output.
% The size of the filters and the way they are applied can be tuned
% while building the model should be the same for all filters in one
% layer in order for the output being of consistent size in all channels.
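
To give a rough sense of the magnitudes involved, consider as an illustrative assumption an input of $28 \times 28$ pixels, thus $s_i = 784$, and a single filter ($n = 1$) of size $f = 5 \cdot 5 = 25$ applied without padding, which yields an output of $24 \times 24$ pixels, thus $s_o = 576$. With the parameter counts stated above, the convolutional layer contains
\[
  n f + s_o = 25 + 576 = 601
\]
parameters, whereas a fully connected layer mapping the same input to an output of the same size would require
\[
  (s_i + 1) s_o = 785 \cdot 576 = 452\,160
\]
trainable weights.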
% It is common to reduce the d< by not applying the
% filters on each ``pixel'' but rather specify a ``stride'' $s$ at which
% the filter $g$ is moved over the input $I$
% \[
% O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
% \]
% As seen convolution lends itself for image manipulation. In this
% chapter we will explore how we can incorporate convolution in neural
% networks, and how that might be beneficial.
% Convolutional Neural Networks as described by ... are made up of
% convolutional layers, pooling layers, and fully connected ones. The
% fully connected layers are layers in which each input node is
% connected to each output node which is the structure introduced in
% chapter ...
% In a convolutional layer instead of combining all input nodes for each
% output node, the input nodes are interpreted as a tensor on which a
% kernel is applied via convolution, resulting in the output. Most often
% multiple kernels are used, resulting in multiple output tensors. These
% kernels are the variables, which can be altered in order to fit the
% model to the data. Using multiple kernels it is possible to extract
% different features from the image (e.g. edges -> sobel).
In order to further reduce the size towards the final layer, convolutional layers are often followed by a pooling layer. In a pooling layer the input is reduced in size by extracting a single value from a neighborhood of pixels, often by taking the maximum value in the neighborhood (max-pooling). The resulting output size depends on the offset between the neighborhoods used; this offset is commonly called ``stride''\todo{stride used twice}. The combination of convolution and pooling layers allows for the extraction of features from the input in the form of feature maps while using relatively few parameters that need to be trained.
An example of this is given in Figure~\ref{fig:feature_map}, which shows intermediary outputs of a small convolutional neural network consisting of two convolutional and pooling layers each, with one filter, followed by two fully connected layers.
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist0bw.pdf}
    \caption{input}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/conv2d_6.pdf}
    \caption{convolution}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_6.pdf}
    \caption{max-pool}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/conv2d_7.pdf}
    \caption{convolution}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_7.pdf}
    \caption{max-pool}
  \end{subfigure}
  \caption[Feature map]{Intermediary outputs of a convolutional neural network, starting with the input and ending with the corresponding feature map.}
  \label{fig:feature_map}
\end{figure}
\subsubsection{Parallels to the Visual Cortex in Mammals}
The choice of convolution for image classification tasks is not arbitrary. ... eye... bla bla
% \subsection{Limitations of the Gradient Descent Algorithm}
% -Hyperparameter guesswork
% -Problems navigating valleys -> momentum
% -Different scale of gradients for vars in different layers -> ADAdelta
\subsection{Stochastic Training Algorithms}
For many applications in which neural networks are used, such as image classification or segmentation, large training data sets become essential for capturing the nuances of the data. However, as training sets get larger the memory requirement during training grows with them. In order to update the weights with the gradient descent algorithm, derivatives of the network with respect to each variable need to be computed for all data points.
Thus the amount of memory and computing power available limits the size of the training data that can be efficiently used in fitting the network. A class of algorithms that augment the gradient descent algorithm in order to lessen this problem are stochastic gradient descent algorithms. Here the full dataset is split into smaller disjoint subsets. Then in each iteration a (different) subset of the data is chosen to compute the gradient (Algorithm~\ref{alg:sgd}). The training period until each data point has been considered at least once in updating the parameters is commonly called an ``epoch''. Using subsets reduces the amount of memory required for storing the necessary values for each update, thus making it possible to use very large training sets to fit the model. Additionally, the noise introduced into the gradient can improve the accuracy of the fit, as stochastic gradient descent algorithms are less likely to get stuck in local extrema. Another important benefit of using subsets is that, depending on their size, the gradient can be calculated far more quickly, which allows for more parameter updates in the same time. If the approximated gradient is close enough to the ``real'' one, this can drastically cut down the time required for training the model to a certain degree, or improve the accuracy achievable in a given amount of training time.
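To make the number of parameter updates concrete: if the training data $D$ is split into disjoint batches of size $B$ and any remaining points form one final, smaller batch, each epoch consists of
\[
  \left\lceil \frac{\abs{D}}{B} \right\rceil
\]
updates. As an illustration with an assumed training set of $\abs{D} = 60\,000$ data points and $B = 32$ this gives
\[
  \left\lceil \frac{60\,000}{32} \right\rceil = 1875
\]
updates per epoch, whereas the choice $B = \abs{D}$ recovers ordinary gradient descent with a single update per epoch.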
\begin{algorithm}
  \SetAlgoLined
  \KwInput{Function $f$, Weights $w$, Learning Rate $\gamma$, Batch Size $B$, Loss Function $L$, Training Data $D$, Epochs $E$.}
  \For{$i \in \left\{1,\dots,E\right\}$}{
    Initialize: $S \leftarrow D$\;
    \While{$\abs{S} \geq B$}{
      Draw $\tilde{D}$ from $S$ with $\vert\tilde{D}\vert = B$\;
      Update $S$: $S \leftarrow S \setminus \tilde{D}$\;
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w, \tilde{D})}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
    \If{$S \neq \emptyset$}{
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w, S)}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
  }
  \caption{Stochastic gradient descent.}
  \label{alg:sgd}
\end{algorithm}
In order to illustrate this behavior we modeled a convolutional neural network to classify handwritten digits. The data set used for this is the MNIST database of handwritten digits (\textcite{MNIST}, Figure~\ref{fig:MNIST}).
\input{Figures/mnist.tex}
The network used consists of two convolution and max-pooling layers followed by one fully connected hidden layer and the output layer. Both convolutional layers utilize square filters of size five which are applied with a stride of one. The first layer consists of 32 filters and the second of 64. Both pooling layers pool a $2\times 2$ area. The fully connected layer consists of 256 nodes and the output layer of 10, one for each digit. All layers use ReLU as activation function, except the output layer, which uses softmax (\ref{def:softmax}). As loss function, categorical cross-entropy is used (\ref{eq:cross_entropy}). The architecture of the convolutional neural network is summarized in Figure~\ref{fig:mnist_architecture}.
\begin{figure}
  \includegraphics[width=\textwidth]{Figures/Data/convnet_fig.pdf}
  \caption{Convolutional neural network architecture used to model the MNIST handwritten digits dataset.
    This figure was created using the draw\textunderscore convnet Python script by \textcite{draw_convnet}.}
  \label{fig:mnist_architecture}
\end{figure}
The results of the network being trained with gradient descent and stochastic gradient descent for 20 epochs are given in Figure~\ref{fig:sgd_vs_gd} and Table~\ref{table:sgd_vs_gd}. Here it can be seen that the network trained with stochastic gradient descent is more accurate after the first epoch than the ones trained with gradient descent after 20 epochs. This is due to the former using a batch size of 32 and thus having made 1875 updates to the weights after the first epoch, in comparison to one update. While each of these updates only uses an approximate gradient calculated on the subset, the network performs far better than the network using true gradients when training for the same amount of time. \todo{compare training time}
\input{Figures/SGD_vs_GD.tex}
\clearpage
\subsection{\titlecap{modified stochastic gradient descent}}
This section is based on \textcite{ruder}. An inherent problem of the stochastic gradient descent algorithm is its sensitivity to the learning rate $\gamma$. This results in the problem of having to find an appropriate learning rate for each problem, which is largely guesswork. The impact of choosing a bad learning rate can be seen in Figure~\ref{fig:sgd_vs_gd}.
% There is a inherent problem in the sensitivity of the gradient descent
% algorithm regarding the learning rate $\gamma$.
% The difficulty of choosing the learning rate can be seen
% in Figure~\ref{sgd_vs_gd}.
For small rates the progress in each iteration is small, but as the rate is enlarged the algorithm can become unstable and the parameters diverge to infinity.
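A minimal one-dimensional sketch illustrates both effects. Assume, purely for illustration, the quadratic function $f(w) = \frac{a}{2} w^2$ with curvature $a > 0$. Gradient descent with learning rate $\gamma$ then yields
\[
  w_{n+1} = w_n - \gamma a w_n = (1 - \gamma a) w_n
  \quad \Longrightarrow \quad
  w_n = (1 - \gamma a)^n w_0.
\]
The iterates converge to the minimum at $0$ if and only if $\abs{1 - \gamma a} < 1$, that is for $0 < \gamma < \frac{2}{a}$. For $\gamma > \frac{2}{a}$ the factor $1 - \gamma a$ is smaller than $-1$, so the iterates alternate in sign and grow without bound, while for very small $\gamma$ the factor is close to $1$ and progress per iteration is slow.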
Even for learning rates small enough to ensure the parameters do not diverge to infinity, steep valleys in the function to be minimized can hinder the progress of the algorithm, as for learning rates that are not small enough gradient descent ``bounces between'' the walls of the valley rather than following a downward trend in the valley.
% \[
% w - \gamma \nabla_w ...
% \]
%thus the weights grow to infinity.
\todo{explain unstable learning rate better}
To combat this problem \todo{source} propose to alter the learning rate over the course of training, often called learning rate scheduling, in order to decrease the learning rate as training progresses. The most popular implementations of this are time-based decay
\[
  \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
\]
where $d$ is the decay parameter and $n$ is the number of epochs, step-based decay, where the learning rate is fixed for a span of $r$ epochs and then decreased according to the parameter $d$,
\[
  \gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor},
\]
and exponential decay, where the learning rate is decreased after each epoch,
\[
  \gamma_n = \gamma_0 e^{-n d}.
\]
These methods are able to increase the accuracy of a model by large margins, as seen in the training of ResNet by \textcite{resnet}. \todo{maybe include a figure}
However, stochastic gradient descent with learning rate decay is still highly sensitive to the choice of the hyperparameters $\gamma_0$ and $d$. In order to mitigate this problem a number of algorithms have been developed to regularize the learning rate with as little hyperparameter guesswork as possible. We will examine and compare a ... algorithms that use an adaptive learning rate. They all scale the gradient for the update depending on past gradients for each weight individually. The algorithms build upon each other, with the adaptive gradient algorithm (\textsc{AdaGrad}, \textcite{ADAGRAD}) laying the groundwork.
Here for each parameter update the learning rate is given by a constant $\gamma$ divided by the square root of the sum of the squares of the past partial derivatives with respect to this parameter. This results in a monotonically decaying learning rate with faster decay for parameters with large updates, whereas parameters with small updates experience smaller decay. The \textsc{AdaGrad} algorithm is given in Algorithm~\ref{alg:ADAGRAD}. Note that while this algorithm is still based upon the idea of gradient descent, it no longer takes steps in the direction of the gradient while updating. Due to the individual learning rates for each parameter, only the sign of the update for each individual parameter remains the same.
\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Global learning rate $\gamma$}
  \KwInput{Constant $\varepsilon$}
  \KwInput{Initial parameter vector $x_1 \in \mathbb{R}^p$}
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Compute Update: $\Delta x_{t,i} \leftarrow -\frac{\gamma}{\norm{g_{1:t,i}}_2 + \varepsilon} g_{t,i}, \forall i = 1, \dots,p$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{\textsc{AdaGrad}}
  \label{alg:ADAGRAD}
\end{algorithm}
Building on \textsc{AdaGrad}, \textcite{ADADELTA} developed the \textsc{AdaDelta} algorithm in order to improve upon the two main drawbacks of \textsc{AdaGrad}, namely the continual decay of the learning rate and the need for a manually selected global learning rate $\gamma$. As \textsc{AdaGrad} uses division by the accumulated squared gradients, the learning rate will eventually become vanishingly small.
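The speed of this decay can be made explicit in a simple case. Assume, purely for illustration, that the partial derivative with respect to a parameter $i$ has constant magnitude $\abs{g_{s,i}} = c > 0$ in every iteration $s$. Then $\norm{g_{1:t,i}}_2 = \sqrt{t c^2} = c \sqrt{t}$, and the size of the update in Algorithm~\ref{alg:ADAGRAD} becomes
\[
  \abs{\Delta x_{t,i}} = \frac{\gamma c}{c \sqrt{t} + \varepsilon} \approx \frac{\gamma}{\sqrt{t}}.
\]
The step size is thus independent of the scale $c$ of the gradient, but shrinks to roughly $\gamma / 10$ after $100$ iterations and to roughly $\gamma / 100$ after $10\,000$ iterations, even though the gradient itself has not become smaller.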
In order to ensure that learning continues to make progress even after a significant number of iterations, instead of summing the squared gradients an exponentially decaying average of the past squared gradients is used for regularizing the learning rate, resulting in
\begin{align*}
  E[g^2]_t & = \rho E[g^2]_{t-1} + (1-\rho) g_t^2, \\
  \Delta x_t & = -\frac{\gamma}{\sqrt{E[g^2]_t + \varepsilon}} g_t,
\end{align*}
for a decay rate $\rho$. Additionally, the fixed global learning rate $\gamma$ is substituted by an exponentially decaying average of the past parameter updates. The usage of the past parameter updates is motivated by ensuring that the hypothetical units of the parameter vector match those of the parameter update $\Delta x_t$. When only using the gradient with a scalar learning rate as in SGD, the resulting unit of the parameter update is
\[
  \text{units of } \Delta x \propto \text{units of } g \propto \frac{\partial f}{\partial x} \propto \frac{1}{\text{units of } x},
\]
assuming the cost function $f$ is unitless. \textsc{AdaGrad} does not have correct units either, since the update is given by a ratio of gradient quantities, resulting in a unitless parameter update. If, however, Hessian information or an approximation thereof is used to scale the gradients, the unit of the updates will be correct:
\[
  \text{units of } \Delta x \propto H^{-1} g \propto \frac{\frac{\partial f}{\partial x}}{\frac{\partial ^2 f}{\partial x^2}} \propto \text{units of } x.
\]
Since using the second derivative results in correct units, Newton's method (assuming a diagonal Hessian) is rearranged to determine the quantities involved in the inverse of the second derivative:
\[
  \Delta x = \frac{\frac{\partial f}{\partial x}}{\frac{\partial ^2 f}{\partial x^2}} \iff \frac{1}{\frac{\partial^2 f}{\partial x^2}} = \frac{\Delta x}{\frac{\partial f}{\partial x}}.
\]
As the root mean square of the past gradients is already used in the denominator of the learning rate, an exponentially decaying root mean square of the past updates is used to obtain a $\Delta x$ quantity for the numerator, resulting in the correct unit of the update. The full algorithm is given by Algorithm~\ref{alg:adadelta}.
\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} + (1-\rho)g_t^2$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
    Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{\textsc{AdaDelta}, \textcite{ADADELTA}}
  \label{alg:adadelta}
\end{algorithm}
While the stochastic gradient algorithm is less susceptible to getting stuck in local extrema than gradient descent, the problem still persists, especially for saddle points with steep .... \textcite{DBLP:journals/corr/Dauphinpgcgb14}
An approach to the problem of ``getting stuck'' in saddle points or local minima/maxima is the addition of momentum to SGD. Instead of using the actual gradient for the parameter update, an average over the past gradients is used. In order to avoid the need to store the past values, usually an exponentially decaying average is used, resulting in Algorithm~\ref{alg:sgd_m}. This is comparable to following the path of a marble with mass rolling down the slope of the error function. The decay rate for the average is comparable to the inertia of the marble. This results in the algorithm being able to escape some local extrema due to the momentum built up while approaching them.
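The effect of the averaging can be seen by unrolling the recursion. For an exponentially decaying average $m_t = \rho m_{t-1} + (1-\rho) g_t$ with decay rate $\rho \in (0,1)$ and $m_0 = 0$ we obtain
\[
  m_t = (1-\rho) \sum_{k=1}^{t} \rho^{t-k} g_k,
\]
so that a gradient lying $j$ iterations in the past enters with weight $(1-\rho)\rho^{j}$. Two idealized cases, assumed here only for illustration, show the resulting behavior. For a constant gradient $g_k = g$ the average builds up towards the gradient itself,
\[
  m_t = (1 - \rho^t) g \to g \quad (t \to \infty),
\]
whereas for a gradient of alternating sign, $g_k = (-1)^k c$, as encountered when bouncing between the walls of a valley,
\[
  m_t = (-1)^t c \, \frac{(1-\rho)\left(1 - (-\rho)^t\right)}{1+\rho},
  \qquad
  \abs{m_t} \to c \, \frac{1-\rho}{1+\rho} \quad (t \to \infty).
\]
For $\rho = 0.9$ the oscillating component is therefore damped to about $0.053\, c$, while a consistent descent direction is retained in full.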
% \begin{itemize}
% \item ADAM
% \item momentum
% \item ADADETLA \textcite{ADADELTA}
% \end{itemize}
\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Learning Rate $\gamma$, Decay Rate $\rho$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $m_0 = 0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $m_t \leftarrow \rho m_{t-1} + (1-\rho) g_t$\;
    Compute Update: $\Delta x_t \leftarrow -\gamma m_t$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{SGD with momentum}
  \label{alg:sgd_m}
\end{algorithm}
In an effort to combine the properties of the momentum method and the automatically adapted learning rate of \textsc{AdaDelta}, \textcite{ADAM} developed the \textsc{Adam} algorithm, given in Algorithm~\ref{alg:adam}. Here the exponentially decaying root mean square of the gradients is still used for regularizing the learning rate and is combined with the momentum method. Both terms are normalized such that the ... are the first and second moment of the gradient. However, the term used in \textsc{AdaDelta} to ensure correct units is dropped in favor of a scalar global learning rate. This results in .. hyperparameters, however the algorithm seems to be exceptionally stable with the recommended parameters of ... and is a very reliable algorithm for training neural networks.
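The role of the normalization can be illustrated by the very first iteration of Algorithm~\ref{alg:adam}. With $m_0 = 0$ and $v_0 = 0$ the accumulated moments are $m_1 = (1-\beta_1) g_1$ and $v_1 = (1-\beta_2) g_1^2$, which are biased towards zero. Dividing by $1 - \beta_1^1$ and $1 - \beta_2^1$ removes this bias,
\[
  \hat{m}_1 = \frac{(1-\beta_1) g_1}{1-\beta_1} = g_1,
  \qquad
  \hat{v}_1 = \frac{(1-\beta_2) g_1^2}{1-\beta_2} = g_1^2,
\]
so that the first update is
\[
  \Delta x_1 = -\frac{\alpha}{\sqrt{g_1^2 + \varepsilon}}\, g_1 \approx -\alpha \operatorname{sign}(g_1)
\]
whenever $g_1^2$ is large compared to $\varepsilon$. The initial step thus has a size of approximately $\alpha$ in each parameter, independent of the scale of the gradient. Without the correction the same step would instead be scaled by the factor $(1-\beta_1)/\sqrt{1-\beta_2}$, which depends on the choice of the decay parameters.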
\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Stepsize $\alpha$, Constant $\varepsilon$}
  \KwInput{Decay Parameters $\beta_1$, $\beta_2$}
  Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate first Moment of the Gradient and correct for bias: $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t;$\hspace{\linewidth} $\hat{m}_t \leftarrow \frac{m_t}{1-\beta_1^t}$\;
    Accumulate second Moment of the Gradient and correct for bias: $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)g_t^2;$\hspace{\linewidth} $\hat{v}_t \leftarrow \frac{v_t}{1-\beta_2^t}$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\alpha}{\sqrt{\hat{v}_t + \varepsilon}} \hat{m}_t$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{\textsc{Adam}, \textcite{ADAM}}
  \label{alg:adam}
\end{algorithm}
In order to get an understanding of the performance of the training algorithms discussed above, the neural network given in ... has been trained on the ... and the results are given in Figure~\ref{fig:comp_alg}. Here it can be seen that the \textsc{Adam} algorithm performs far better than the other algorithms, with \textsc{AdaGrad} and \textsc{AdaDelta} following... bla bla
\input{Figures/sdg_comparison.tex}
% \subsubsubsection{Stochastic Gradient Descent}
\clearpage
\subsection{Combating Overfitting}
% As in many machine learning applications if the model is overfit in
% the data it can drastically reduce the generalization of the model. In
% many machine learning approaches noise introduced in the learning
% algorithm in order to reduce overfitting. This results in a higher
% bias of the model but the trade off of lower variance of the model is
% beneficial in many cases. For example the regression tree model
% ... benefits greatly from restricting the training algorithm on
% randomly selected features in every iteration and then averaging many
% such trained trees inserted of just using a single one.
% \todo{not yet sure whether I want to use this} For neural networks similar
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit{dropout} which has been first introduced in
% \cite{Dropout}
This section is based on .... Similarly to shallow networks, overfitting can still impact the quality of convolutional neural networks. Popular ways to combat this problem for a .. of models are averaging over multiple models trained on subsets (bootstrap) or introducing noise directly during the training (for example random forests, where a conglomerate of decision trees benefits greatly from randomizing the features available for use in each training iteration). We explore implementations of these approaches for neural networks, namely dropout for simulating a conglomerate of networks and the introduction of noise during training by slightly altering the input pictures.
% A popular way to combat this problem is
% by introducing noise into the training of the model.
% This can be done in a variety
% This is a
% successful strategy for ofter models as well, the a conglomerate of
% descision trees grown on bootstrapped trainig samples benefit greatly
% of randomizing the features available to use in each training
% iteration (Hastie, bachelor thesis??).
% There are two approaches to introduce noise to the model during
% learning, either by manipulating the model it self or by manipulating
% the input data.
\subsubsection{Dropout}
If a neural network has enough hidden nodes, there will be sets of weights that accurately fit the training set (proof for a small scenario given in ...). This especially occurs when the relation between the input and output is highly complex, which requires a large network to model, and the training set is limited in size (cf. CNNs with few images). However, each of these sets of weights will result in different predictions for a test set, and all of them will perform worse on the test data than on the training data.
A way to improve the predictions and reduce overfitting would be to
train a large number of networks and average their results
(cf. random forests); however, this is often computationally
infeasible in training as well as testing.
% Similarly to decision trees and random forests training multiple
% models on the same task and averaging the predictions can improve the
% results and combat overfitting. However training a very large
% number of neural networks is computationally expensive in training
% as well as testing.
In order to make this approach feasible \textcite{Dropout1} propose
random dropout. Instead of training different models, for each data
point in a batch randomly chosen nodes in the network are disabled
(their output is fixed to zero) and the updates for the weights in the
remaining smaller network are computed. The updates computed for each
data point in the batch are then accumulated and applied to the full
network. This can be compared to simultaneously training many small
networks which share the weights of their active neurons. For testing,
the ``mean network'' is used, in which all nodes are active but their
outputs are scaled to compensate for the larger number of active
nodes.
\todo{comparable to averaging dropout networks, example showing
  improvement in a small case}
% Here for each training iteration from a before specified (sub)set of nodes
% randomly chosen ones are deactivated (their output is fixed to 0).
% During training
% Instead of using different models and averaging them randomly
% deactivated nodes are used to simulate different networks which all
% share the same weights for present nodes.
% A simple but effective way to introduce noise to the model is by
% deactivating randomly chosen nodes in a layer
% The way noise is introduced into
% the model is by deactivating certain nodes (setting the output of the
% node to 0) in the fully connected layers of the convolutional neural
% networks.
% The nodes are chosen at random and change in every
% iteration, this practice is called Dropout and was introduced by
% \textcite{Dropout}.

\subsubsection{\titlecap{manipulation of input data}}
Another way to combat overfitting is to keep the network from
memorizing the dataset by randomly manipulating the inputs in each
training iteration. This is commonly used in image-based tasks, as
there are often ways to manipulate the input while still being sure
the labels remain the same. For example, in an image classification
task such as handwritten digits the associated label should remain
correct when the image is rotated or stretched by a small amount.
When using this approach one has to be sure that the labels indeed
remain the same, or else the network will not learn the desired ...
In the case of handwritten digits, for example, too high a rotation
angle will ... a nine or six.
The most common transformations are rotation, zoom, shear, brightness
and mirroring. Examples of these are given in Figure~\ref{fig:datagen}.

\begin{figure}[h]
  \centering
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist0.pdf}
    \caption{original\\image}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist_gen_zoom.pdf}
    \caption{random\\zoom}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shear.pdf}
    \caption{random\\shear}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist_gen_rotation.pdf}
    \caption{random\\rotation}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shift.pdf}
    \caption{random\\positional shift}
  \end{subfigure}
  \caption[Image data generation]{Example for the manipulations used in
    ... As all images are of the same intensity, brightness manipulation
    does not seem ... Additionally mirroring is not used for ...
    reasons.}
  \label{fig:datagen}
\end{figure}

In order to compare the benefits obtained from implementing these
measures we have trained the network given in ... on the same problem
and implemented different combinations of data generation and dropout.
The results are given in Figure~\ref{fig:gen_dropout}. For each
scenario the model was trained five times and the performance measures
were averaged.
It can be seen that implementing the measures does indeed increase the
performance of the model.
Implementing data generation on its own seems to have a larger impact
than dropout, and applying both increases the accuracy even further.

The better performance most likely stems from reduced overfitting. The
reduction in overfitting can be seen in
Figure~\ref{fig:gen_dropout}~(\subref{fig:gen_dropout_b}), as the
training accuracy decreases while the test accuracy increases. However,
utilizing data generation as well as dropout with a probability of 0.4
seems to be too aggressive an approach, as the training accuracy drops
below the test accuracy\todo{short justification}.

\input{Figures/gen_dropout.tex}

\todo{Compare different dropout rates on MNIST or similar, subset as
  training set?}

\clearpage
\subsubsection{\titlecap{effectiveness for small training sets}}

For some applications (e.g. medical problems with a small number of
patients) the available data can be highly limited.
In these problems the networks are highly ... for overfitting the
data. In order to get an understanding of the accuracies achievable
and the impact of the measures to prevent overfitting discussed above,
we train the network on datasets of varying sizes with different
measures implemented.
For training we use the MNIST handwriting dataset as well as the
fashion MNIST dataset. The fashion MNIST dataset is a benchmark set
built by \textcite{fashionMNIST} in order to provide a harder set, as
state of the art models are able to achieve accuracies of 99.88\%
(\textcite{10.1145/3206098.3206111}) on the handwriting set.
The dataset contains 70,000 preprocessed images of clothing from
Zalando; an overview is given in Figure~\ref{fig:fashionMNIST}.

\input{Figures/fashion_mnist.tex}

\afterpage{
  \noindent
  \begin{minipage}{\textwidth}
    \small
    \begin{tabu} to \textwidth {@{}l*4{X[c]}@{}}
      \Tstrut \Bstrut & \textsc{Adam} & D. 0.2 & Gen. & Gen.+D. 0.2 \\
      \hline
      & \multicolumn{4}{c}{\titlecap{test accuracy for 1 sample}}\Bstrut \\
      \cline{2-5}
      max \Tstrut & 0.5633 & 0.5312 & \textbf{0.6704} & 0.6604 \\
      min & 0.3230 & 0.4224 & 0.4878 & \textbf{0.5175} \\
      mean & 0.4570 & 0.4714 & 0.5862 & \textbf{0.6014} \\
      var \Bstrut & 0.0040 & \textbf{0.0012} & 0.0036 & 0.0023 \\
      \hline
      & \multicolumn{4}{c}{\titlecap{test accuracy for 10 samples}}\Bstrut \\
      \cline{2-5}
      max \Tstrut & 0.8585 & 0.9423 & 0.9310 & \textbf{0.9441} \\
      min & 0.8148 & \textbf{0.9081} & 0.9018 & 0.9061 \\
      mean & 0.8377 & \textbf{0.9270} & 0.9185 & 0.9232 \\
      var \Bstrut & 2.7e-04 & 1.3e-04 & 6e-05 & 1.5e-04 \\
      \hline
      & \multicolumn{4}{c}{\titlecap{test accuracy for 100 samples}}\Bstrut \\
      \cline{2-5}
      max \Tstrut & 0.9637 & 0.9796 & 0.9810 & \textbf{0.9811} \\
      min & 0.9506 & 0.9719 & 0.9702 & \textbf{0.9727} \\
      mean & 0.9582 & 0.9770 & 0.9769 & \textbf{0.9783} \\
      var \Bstrut & 2e-05 & 1e-05 & 1e-05 & 1e-05 \\
      \hline
    \end{tabu}
    \normalsize
    \captionof{table}{Test accuracy of the model trained 10 times on
      random MNIST handwriting training sets containing 1, 10 and 100
      data points per class after 125 epochs. The mean achieved
      accuracy for the full set employing both overfitting measures is
      ...}
    \label{table:digitsOF}
    \small
    \centering
    \begin{tabu} to \textwidth {@{}l*4{X[c]}@{}}
      \Tstrut \Bstrut & \textsc{Adam} & D. 0.2 & Gen. & Gen.+D. 0.2 \\
      \hline
      & \multicolumn{4}{c}{\titlecap{test accuracy for 1 sample}}\Bstrut \\
      \cline{2-5}
      max \Tstrut & 0.4885 & \textbf{0.5613} & 0.5488 & 0.5475 \\
      min & 0.3710 & \textbf{0.3858} & 0.3736 & 0.3816 \\
      mean & 0.4166 & 0.4838 & 0.4769 & \textbf{0.4957} \\
      var \Bstrut & \textbf{0.002} & 0.00294 & 0.00338 & 0.0030 \\
      \hline
      & \multicolumn{4}{c}{\titlecap{test accuracy for 10 samples}}\Bstrut \\
      \cline{2-5}
      max \Tstrut & 0.7370 & 0.7340 & 0.7236 & \textbf{0.7502} \\
      min & 0.6818 & 0.6673 & 0.6709 & \textbf{0.6799} \\
      mean & 0.7130 & \textbf{0.7156} & 0.7031 & 0.7136 \\
      var \Bstrut & 3.2e-04 & 3.4e-04 & 3.2e-04 & 4.5e-04 \\
      \hline
      & \multicolumn{4}{c}{\titlecap{test accuracy for 100 samples}}\Bstrut \\
      \cline{2-5}
      max \Tstrut & 0.8454 & 0.8385 & 0.8456 & \textbf{0.8459} \\
      min & 0.8227 & 0.8200 & \textbf{0.8305} & 0.8274 \\
      mean & 0.8331 & 0.8289 & 0.8391 & \textbf{0.8409} \\
      var \Bstrut & 4e-05 & 4e-05 & 2e-05 & 3e-05 \\
      \hline
    \end{tabu}
    \normalsize
    \captionof{table}{Test accuracy of the model trained 10 times on
      random fashion MNIST training sets containing 1, 10 and 100 data
      points per class. The mean achieved accuracy for the full
      dataset is: ....}
    \label{table:fashionOF}
  \end{minipage}
  \clearpage % if needed/desired
}

The random datasets chosen for training are made up of a certain
number of data points for each class, which are chosen at random. The
sizes chosen for the comparisons are the full dataset and 100, 10 and
1 data points per class.

For the task of classifying the fashion data a slightly altered model
is used. The convolutional layers with filters of size 5 are replaced
by two consecutive convolutional layers with filters of size 3.
This is done in order to have more ... in order to better ... the data
in the model. A diagram of the architecture is given in
Figure~\ref{fig:fashion_MNIST}.
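
The effect of replacing one layer with filters of size 5 by two
consecutive layers with filters of size 3 can be illustrated by a short
calculation. As a sketch only, assume stride one, no dilation and the
same number $C$ of input and output channels in every layer; the symbol
$C$ and these assumptions are introduced here for illustration and are
not taken from the model definition. Each additional layer with filters
of size 3 enlarges the region of the input seen by one output value by
$3 - 1 = 2$ pixels per spatial dimension, so two stacked layers cover
\[
  3 + (3 - 1) = 5
\]
pixels per dimension, which is the same $5 \times 5$ region as the
single layer they replace. Ignoring biases, the number of weights
changes from
\[
  5 \cdot 5 \cdot C^2 = 25\,C^2
  \quad \text{to} \quad
  2 \cdot 3 \cdot 3 \cdot C^2 = 18\,C^2 ,
\]
and the activation function is applied twice instead of once over the
same region.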
\afterpage{
  \noindent
  \begin{figure}[h]
    \includegraphics[width=\textwidth]{Figures/Data/cnn_fashion_fig.pdf}
    \caption{Convolutional neural network architecture used to model the
      fashion MNIST dataset. This figure was created using the
      draw\textunderscore convnet Python script by \textcite{draw_convnet}.}
    \label{fig:fashion_MNIST}
  \end{figure}
}

For both scenarios the models are trained 10 times on randomly sampled
training sets. For each scenario the models are trained without
overfitting measures as well as with combinations of dropout and data
generation implemented. The Python implementation of the models and
the parameters used for the data generation are given in
Listing~\ref{lst:handwriting} for the handwriting model and
Listing~\ref{lst:fashion} for the fashion model.
The models are trained for 125 epochs in order to have enough random
augmentations of the input images present during training for the
networks to fully profit from the additional training data generated.
The test accuracies of the models after training for 125 epochs are
given in Table~\ref{table:digitsOF} for the handwritten digits and in
Table~\ref{table:fashionOF} for the fashion datasets. Additionally the
average test accuracies over the course of learning are given in
Figure~\ref{fig:plotOF_digits} for the handwriting application and
Figure~\ref{fig:plotOF_fashion} for the fashion application.
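
To put the sizes of the sampled training sets into perspective, a
short calculation may help. It assumes the standard split of both
datasets into 60\,000 training and 10\,000 test images with 10 classes
each; the split is an assumption made here and is not stated in the
text above. With $n$ data points per class a sampled training set
contains $10\,n$ images, which corresponds to the following share of
the full training set:
\begin{align*}
  n = 1:   \quad & \frac{10}{60\,000}   \approx 0.017\,\%, \\
  n = 10:  \quad & \frac{100}{60\,000}  \approx 0.17\,\%, \\
  n = 100: \quad & \frac{1000}{60\,000} \approx 1.7\,\%.
\end{align*}
Under this assumption even the largest sampled set uses less than two
percent of the available training images.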
\begin{figure}[h]
  \centering
  \small
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
          /pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
        height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {epoch},ylabel = {Test Accuracy}, cycle list/Dark2,
        every axis plot/.append style={line width =1.25pt}]
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_1.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_dropout_02_1.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_1.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_dropout_02_1.mean};
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{1 sample per class}
    \vspace{0.25cm}
  \end{subfigure}
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
          /pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
        height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {epoch},ylabel = {Test Accuracy}, cycle list/Dark2,
        every axis plot/.append style={line width =1.25pt}]
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_dropout_00_10.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_dropout_02_10.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_dropout_00_10.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_dropout_02_10.mean};
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D.
0.2}} \end{axis} \end{tikzpicture} \caption{10 samples per class} \end{subfigure} \begin{subfigure}[h]{\textwidth} \begin{tikzpicture} \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed, /pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth, height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east}, xlabel = {epoch}, ylabel = {Test Accuracy}, cycle list/Dark2, every axis plot/.append style={line width =1.25pt}, ymin = {0.92}] \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none] {Figures/Data/adam_dropout_00_100.mean}; \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none] {Figures/Data/adam_dropout_02_100.mean}; \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none] {Figures/Data/adam_datagen_dropout_00_100.mean}; \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none] {Figures/Data/adam_datagen_dropout_02_100.mean}; \addlegendentry{\footnotesize{Default.}} \addlegendentry{\footnotesize{D. 0.2}} \addlegendentry{\footnotesize{G.}} \addlegendentry{\footnotesize{G + D. 
0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{100 samples per class}
    \vspace{.25cm}
  \end{subfigure}
  \caption{Mean test accuracies of the models fitting the sampled MNIST
    handwriting datasets over the 125 epochs of training.}
  \label{fig:plotOF_digits}
\end{figure}

\begin{figure}[h]
  \centering
  \small
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
          /pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
        height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {epoch},ylabel = {Test Accuracy}, cycle list/Dark2,
        every axis plot/.append style={line width =1.25pt}]
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_0_1.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_2_1.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_0_1.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_2_1.mean};
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{1 sample per class}
    \vspace{0.25cm}
  \end{subfigure}
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
          /pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
        height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {epoch},ylabel = {Test Accuracy}, cycle list/Dark2,
        every axis plot/.append style={line width =1.25pt}, ymin = {0.62}]
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_0_10.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_2_10.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_0_10.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_2_10.mean};
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D.
0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{10 samples per class}
  \end{subfigure}
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
          /pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
        height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {epoch}, ylabel = {Test Accuracy}, cycle list/Dark2,
        every axis plot/.append style={line width =1.25pt}, ymin = {0.762}]
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_0_100.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_2_100.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_0_100.mean};
        \addplot table [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_2_100.mean};
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{100 samples per class}
    \vspace{.25cm}
  \end{subfigure}
  \caption{Mean test accuracies of the models fitting the sampled
    fashion MNIST datasets over the 125 epochs of training.}
  \label{fig:plotOF_fashion}
\end{figure}

It can be seen in ... and ... that the usage of .. overfitting
measures greatly improves the accuracy for small datasets. However,
for the smallest size of one data point per class, generating more
data ... outperforms dropout, with only a ... improvement being seen
by the implementation of dropout whereas data generation improves the
accuracy by ... . On the other hand, the implementation of dropout
seems to reduce the variance in the model accuracy, as the variance in
accuracy for the dropout model is less than .. while the variance of
the datagen .. model is nearly the same.
The model with data generation ... a reduction in variance with the
addition of dropout.
For the slightly larger training sets of ten samples per class the
difference between the two measures seems smaller. Here the
improvement in accuracy seen with dropout is slightly larger than the
one with data generation. However, for the larger training set the
variance in test accuracies is lower for the model with data
generation than for the one with dropout.
The results for the training sets with 100 samples per class resemble
the ones for the sets with 10 per class.
Overall the models ... both measures to combat overfitting seem to
perform considerably better than the ones without. The usage of these
measures has great potential for improving models used for
applications with limited training data.
Additional tables and figures visualizing the effects on the
logarithmic cross-entropy rather than loss are given in the
appendix\todo{figures for appendix}

\clearpage
\section{Conclusion}
\begin{itemize}
  \item Generate more data, GAN etc. \textcite{gan}
  \item Transfer learning: use a network trained on a different task and
    repurpose it / train it with the training data
    \textcite{transfer_learning}
  \item Random erasing: fashion MNIST 96.35\% accuracy
    \textcite{random_erasing}
  \item However, the \textsc{Adam} algorithm can have problems with high
    variance of the adaptive learning rate early in training.
    \textcite{rADAM} try to address these issues with the Rectified Adam
    algorithm.
\end{itemize}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: