\section{Application of Neural Networks to Higher Complexity Problems}
\label{sec:cnn}
This section is based on \textcite[Chapter~9]{Goodfellow}.

As neural networks are applied to problems of higher complexity, which often
entails a higher dimensionality of the input, the number of
parameters in the network rises drastically.
For very large inputs such as high-resolution image data, the
fully connected nature of the network means that the number of parameters
can exceed what is feasible for training and storage.

The number of parameters for a given network size can be reduced by
using layers which are only sparsely 
connected and share parameters between nodes. An effective way to
implement this is by using convolution with filters that are shared
among the nodes of a layer.

\subsection{Convolution}

Convolution is a mathematical operation in which the product of two
functions is integrated after one of them has been reversed and shifted,
\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) \,\mathrm{d}s.
\]
This operation can be described as a filter function $g$ being applied
to $f$: each value $f(t)$ is replaced by an average of the values of $f$
around position $t$, weighted by $g$.
Convolution allows for a wide range of data manipulations, a
simple example being the smoothing of real-time data.

Consider a sensor measuring the location of an object (e.g. via
GPS). We expect the output of the sensor to be noisy as a result of
some factors impacting the accuracy of the measurements. In order to
get a better estimate of the actual location, we want to smooth
the data to reduce the noise.

Using convolution for this task, we
can control the significance we want to give each data point. We
might want to give a larger weight to more recent measurements than
to older ones. If we assume these measurements are taken on a discrete
timescale, we need to define convolution for discrete functions. Let $f$,
$g: \mathbb{Z} \to \mathbb{R}$; then
\[
(f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]
Applying this to the data with a suitably chosen filter $g$, we
are able to reduce the noise and thus improve the accuracy of the
estimate, as can be seen in
Figure~\ref{fig:sin_conv}.
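
As a small numerical illustration (the weights and measurements here are
hypothetical and are not the ones underlying Figure~\ref{fig:sin_conv}),
consider a filter that is nonzero only for the three most recent time
steps, with $g(0) = 0.5$, $g(1) = 0.3$, and $g(2) = 0.2$. For measurements
$f(1) = 1.0$, $f(2) = 1.4$, and $f(3) = 0.9$, the smoothed value at $t = 3$ is
\[
  (f * g)(3) = g(0) f(3) + g(1) f(2) + g(2) f(1) = 0.45 + 0.42 + 0.20 = 1.07.
\]
Since the weights sum to one, the result is a weighted average in which
the most recent measurement contributes the most.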
\clearpage
\input{Figures/sin_conv.tex}
This form of discrete convolution can also be applied to functions
with inputs of higher dimensionality. Let $f$, $g: \mathbb{Z}^d \to
\mathbb{R}$; then
\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
This will prove to be a useful framework for image manipulation, but
to apply convolution to images we first need to discuss the
representation of image data.

Most often, images are represented
by each pixel being a mixture of base colors. These base colors define
the color space in which the image is encoded. Commonly used
color spaces are RGB (red,
green, blue) and CMYK (cyan, magenta, yellow, black). An example of an
image decomposed into its red, green, and blue channels is given in
Figure~\ref{fig:rgb}.

Using this encoding of the image, we can define a corresponding
discrete function describing the image by mapping the coordinates
$(x,y)$ of a pixel and the channel (color) $c$ to the respective value
$v$:
\begin{align}
  \begin{split}    
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
              \label{def:I}
\end{align}

\begin{figure}
  \begin{adjustbox}{width=\textwidth}
    \begin{tikzpicture}  
      \begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
        \node[canvas is xy plane at z=0, transform shape] at (0,0)
        {\includegraphics[width=5cm]{Figures/Data/klammern_r.jpg}};
        \node[canvas is xy plane at z=2, transform shape] at (0,-0.2)
        {\includegraphics[width=5cm]{Figures/Data/klammern_g.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (0,-0.4)
        {\includegraphics[width=5cm]{Figures/Data/klammern_b.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (-8,-0.2)
        {\includegraphics[width=5.3cm]{Figures/Data/klammern_rgb.jpg}};
      \end{scope}
    \end{tikzpicture}
  \end{adjustbox}
  \caption[Channel Separation of Color Image]{On the right the red, green, and blue channels of the picture
    are displayed. In order to better visualize the color channels the
    black and white picture of each channel has been colored in the
    respective color. Combining the layers results in the image on the
    left.}
  \label{fig:rgb}
\end{figure}

With this representation of an image as a function, we can apply
filters to the image using convolution for multidimensional functions
as described~above. To simplify the notation, we will from now on write
the function $I$ given in (\ref{def:I}) as well as the filter function $g$
as tensors, resulting in the modified notation of
convolution

\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]

As images are finite in size, the convolution is not well defined for
pixels too close to the border.
Thus the output will be of reduced size. With $s_i$ being the size of
the input and $s_k$ the size of the kernel in
dimension $d$, the size of the output in dimension $d$ is $s_i - s_k
+ 1$.
In order to obtain outputs of the same size as the input, the
image can be padded in each dimension with zero entries, which ensures that the
convolution is well defined for all pixels of the image.
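
For example, assuming a $28 \times 28$ image and a $5 \times 5$ kernel
(sizes chosen purely for illustration), subtracting the kernel size from
the input size and adding one gives an output of size
\[
  28 - 5 + 1 = 24
\]
in each dimension. Padding the image with two zero entries on each side
yields a $32 \times 32$ input and thus an output of size $32 - 5 + 1 = 28$,
which matches the original image.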

Simple examples of image manipulation using
convolution are smoothing operations and the
rudimentary detection of edges in gray-scale images, that is, images with
only one channel. A filter often used to smooth or blur images
is the Gaussian filter, which for a given $\sigma \in \mathbb{R}_+$ and
odd size $s \in \mathbb{N}$ is
defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2
      \sigma^2}}, ~ x,y \in \left\{-\tfrac{s-1}{2},\dots,\tfrac{s-1}{2}\right\},
\]
so that the kernel is centered on the pixel it is applied to.
\pagebreak[4]

\noindent An effective filter for edge detection is the Sobel operator. Here, two
filters are applied to the
image $I$ and the outputs are then combined. Edges in the $x$ direction are detected
by convolution with
\[
  G =\left[
  \begin{matrix}
    -1 & 0 & 1 \\
    -2 & 0 & 2 \\
    -1 & 0 & 1
  \end{matrix}\right],
\]
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by
\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied component-wise. Examples
for convolution of an image with both kernels are given
in Figure~\ref{fig:img_conv}.
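
To illustrate how the operator responds to an edge, consider the
following $3 \times 3$ patch (a hypothetical example) containing a change
of intensity in the $x$ direction,
\[
  P =\left[
  \begin{matrix}
    0 & 0 & 1 \\
    0 & 0 & 1 \\
    0 & 0 & 1
  \end{matrix}\right].
\]
Summing the entry-wise products of $P$ with $G$ gives $1 + 2 + 1 = 4$,
whereas for $G^T$ the contributions $-1$ and $1$ cancel to $0$. Because
convolution reverses the kernel, the corresponding entry of $I * G$ is
$-4$ rather than $4$. The sign is irrelevant for the combined output,
which at this pixel is $\sqrt{(-4)^2 + 0^2} = 4$.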
\begin{figure}[H]
  \centering
  \begin{subfigure}{0.27\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/klammern.jpg}
    \caption{\small Original Picture\\~}
    \label{subf:OrigPicGS}
  \end{subfigure}
  \hspace{0.02\textwidth}
  \begin{subfigure}{0.27\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv9.png}
    \caption{\small Gaussian Blur $\sigma^2 = 1$}
  \end{subfigure}
  \hspace{0.02\textwidth}
  \begin{subfigure}{0.27\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv10.png}
    \caption{\small Gaussian Blur $\sigma^2 = 4$}
  \end{subfigure}\\
  \begin{subfigure}{0.27\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv4.png}
    \caption{\small Sobel Operator $x$-direction}
  \end{subfigure}
  \hspace{0.02\textwidth}
  \begin{subfigure}{0.27\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv5.png}
    \caption{\small Sobel Operator $y$-direction}
  \end{subfigure}
  \hspace{0.02\textwidth}
  \begin{subfigure}{0.27\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Figures/Data/image_conv6.png}
    \caption{\small Sobel Operator combined}
  \end{subfigure}
%   \begin{subfigure}{0.24\textwidth}
%     \centering
%     \includegraphics[width=\textwidth]{Figures/Data/image_conv6.png}
%     \caption{test}
%   \end{subfigure}
  \vspace{-0.1cm}
  \caption[Convolution Applied on Image]{Convolution of the original gray-scale image (a) with different
    kernels. In (b) and (c) Gaussian kernels of size 11 and the stated
    $\sigma^2$ are used. In (d) to (f) the Sobel operator
  kernels defined above are used.}
  \label{fig:img_conv}
\end{figure}
\vspace{-0.2cm}
\clearpage
\subsection{Convolutional Neural Networks}
% Conventional neural network as described in chapter .. are made up of
% fully connected layers, meaning each node in a layer is influenced by
% all nodes of the previous layer. If one wants to extract information
% out of high dimensional input such as images this results in a very
% large amount of variables in the model. This limits the 

% In conventional neural networks as described in chapter ... all layers
% are fully connected, meaning each output node in a layer is influenced
% by all inputs. For $i$ inputs and $o$ output nodes this results in $i
% + 1$ variables at each node (weights and bias) and a total $o(i + 1)$
% variables. For large inputs like image data the amount of variables
% that have to be trained in order to fit the model can get excessive
% and hinder the ability to train the model due to memory and
% computational restrictions. By using convolution we can extract
% meaningful information such as edges in an image with a kernel of a
% small size $k$ in the tens or hundreds independent of the size of the
% original image. Thus for a large image $k \cdot i$ can be several
% orders of magnitude smaller than $o\cdot i$ .

As seen in the previous section, convolution lends itself to the
manipulation of images or other large data, which motivates its use in
neural networks.
This is achieved by implementing convolutional layers in which several
trainable filters are applied to the input.

Each node in such a layer corresponds to a pixel of the output of the
convolution with one of those filters, to which a bias and an activation
function are applied.
Depending on the sizes, this can drastically reduce the number of
parameters compared to fully connected layers.
As the weights of the filters are shared among all nodes, a
convolutional layer with input of size $s_i$, output size $s_o$, and
$n$ filters of size $f$ contains $n f + s_o$ parameters ($n f$ filter
weights and one bias per output node), whereas a
fully connected layer has $(s_i + 1) s_o$ trainable parameters.
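
To give a sense of scale, assume (purely for illustration) a gray-scale
input of $28 \times 28$ pixels, so that $s_i = 784$, and a single
$5 \times 5$ filter applied with padding, so that $n = 1$, $f = 25$, and
$s_o = 784$. Then
\begin{align*}
  n f + s_o     & = 25 + 784 = 809, \\
  (s_i + 1) s_o & = 785 \cdot 784 = 615\,440,
\end{align*}
so under these assumptions the convolutional layer requires fewer
parameters by a factor of more than 700.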

The usage of multiple filters results in multiple outputs of the same
size as the input (or slightly smaller if no padding is used). These
are often called (convolution) channels.

Filters in layers that are preceded by convolutional layers are
often chosen such that the convolution channels of the input are
flattened into a single layer. This prevents gaining additional
dimensions with each convolutional layer.
To accomplish this, no padding is used in the direction of the
convolution channels and the size of the filter in this direction is
chosen to match the number of these channels.
% Thus filters used in convolutional networks are usually have the same
% amount of dimensions as the input or one more.

An additional way to reduce the size using convolution is not to apply the
convolution at every pixel, but rather to specify a certain ``stride''
$s$ for each direction by which the filter $g$ is moved over the input $I$,
\[
  O_{x,\dots,c} = \sum_{i,\dots,l \in \mathbb{Z}} \left(I_{(x \cdot
    s_x)-i,\dots,(c \cdot s_c)-l}\right) \left(g_{i,\dots,l}\right).
\] 

The sizes and stride should be the same for all filters in a layer in
order to get a uniform tensor as output.
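
The effect of the stride on the output size can be sketched with a
standard counting argument, where $s_i$ and $s_k$ are used here to denote
the input and kernel size in one dimension. Assuming no padding, a filter
moved with stride $s$ fits at
\[
  \left\lfloor \frac{s_i - s_k}{s} \right\rfloor + 1
\]
positions. For instance, with the illustrative values $s_i = 28$,
$s_k = 5$, and $s = 2$ this gives $\lfloor 23 / 2 \rfloor + 1 = 12$
outputs per dimension, compared to $24$ for a stride of one.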
% The size of the filters and the way they are applied can be tuned
% while building the model should be the same for all filters in one
% layer in order for the output being of consistent size in all channels.
% It is common to reduce the d< by not applying the
% filters on each ``pixel'' but rather specify a ``stride'' $s$ at which
% the filter $g$ is moved over the input $I$

% \[
%   O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
% \] 

% As seen convolution lends itself for image manipulation. In this
% chapter we will explore how we can incorporate convolution in neural
% networks, and how that might be beneficial.

% Convolutional Neural Networks as described by ... are made up of
% convolutional layers, pooling layers, and fully connected ones. The
% fully connected layers are layers in which each input node is
% connected to each output node which is the structure introduced in
% chapter ...

% In a convolutional layer instead of combining all input nodes for each
% output node, the input nodes are interpreted as a tensor on which a
% kernel is applied via convolution, resulting in the output. Most often
% multiple kernels are used, resulting in multiple output tensors. These
% kernels are the variables, which can be altered in order to fit the
% model to the data. Using multiple kernels it is possible to extract
% different features from the image (e.g. edges -> sobel).

As a means to further reduce the size towards the final layer, convolutional
layers are often followed by a pooling layer. 
In a pooling layer, the input is
reduced in size by extracting a single value from a
neighborhood of pixels, often by taking the maximum value in the
neighborhood (max-pooling). The resulting output size is dependent on
the offset (stride) of the neighborhoods used.
The combination of convolution and pooling layers allows for
extraction of features from the input in the form of feature maps while
using relatively few parameters that need to be trained.
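
As a small illustration with made-up values, max-pooling over
$2 \times 2$ neighborhoods with a stride of two reduces a $4 \times 4$
input to a $2 \times 2$ output,
\[
  \left[
  \begin{matrix}
    1 & 3 & 2 & 0 \\
    4 & 2 & 1 & 1 \\
    0 & 1 & 5 & 6 \\
    2 & 2 & 7 & 3
  \end{matrix}\right]
  \mapsto
  \left[
  \begin{matrix}
    4 & 2 \\
    2 & 7
  \end{matrix}\right],
\]
where each output entry is the maximum of the corresponding
non-overlapping $2 \times 2$ block of the input.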

An example of this is given in Figure~\ref{fig:feature_map}, where
intermediary outputs of a small convolutional neural network are shown.
The network consists of two pairs of convolutional and pooling layers,
each with one filter, followed by two fully connected layers.


\begin{figure}[h]
  \renewcommand{\thesubfigure}{\alph{subfigure}1}
  \centering
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist0bw.pdf}
    %\caption{input}
    \caption{input}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/conv2d_2_5.pdf}
    \caption{\hspace{-1pt}convolution}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_2_5.pdf}
    \caption{max-pool}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/conv2d_3_5.pdf}
    \caption{\hspace{-1pt}convolution}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_3_5.pdf}
    \caption{max-pool}
  \end{subfigure}
  \centering
  \setcounter{subfigure}{0}
  \renewcommand{\thesubfigure}{\alph{subfigure}2}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist1bw.pdf}
    \caption{input}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/conv2d_2_0.pdf}
    \caption{\hspace{-1pt}convolution}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_2_0.pdf}
    \caption{max-pool}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/conv2d_3_0.pdf}
    \caption{\hspace{-1pt}convolution}
  \end{subfigure}
  \hfill
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_3_0.pdf}
    \caption{max-pool}
  \end{subfigure}
  \caption[Feature Map]{Intermediary outputs of a
    convolutional neural network, starting with the input and ending
    with the corresponding feature map.}
  \label{fig:feature_map}
\end{figure}

% \subsubsection{Parallels to the Visual Cortex in Mammals}

% The choice of convolution for image classification tasks is not
% arbitrary. ... auge... bla bla


% \subsection{Limitations of the Gradient Descent Algorithm}

% -Hyperparameter guesswork
% -Problems navigating valleys -> momentum
% -Different scale of gradients for vars in different layers -> ADAdelta

\subsection{Stochastic Training Algorithms}
For many applications in which neural networks are used, such as
image classification or segmentation, large training data sets are
essential to capture the nuances of the
data. However, as training sets get larger, the memory requirement
during training grows with them.
To update the weights with the gradient descent algorithm,
derivatives of the loss with respect to each
parameter need to be computed for all data points.
Thus the amount of memory and computing power available limits the
size of the training data that can be efficiently used in fitting the
network.

A class of algorithms that augment the gradient descent
algorithm to lessen this problem are stochastic gradient
descent algorithms.
Here the full data set is split into smaller disjoint subsets.
Then in each iteration, a (different) subset of data is chosen to
compute the gradient (Algorithm~\ref{alg:sgd}).
The training period until each data point has been considered at least
once in
updating the parameters is commonly called an ``epoch''.

Using subsets reduces the amount of memory required for storing the
necessary values for each update, thus making it possible to use very
large training sets to fit the model.
Additionally, the noise introduced into the gradient can improve
the accuracy of the fit, as stochastic gradient descent algorithms are
less likely to get stuck in local extrema.

Another important benefit of using subsets is that, depending on their size, the
gradient can be calculated far more quickly, which allows for more parameter updates
in the same time. If the approximated gradient is close enough to the
``real'' one, this can drastically cut down the time required for
training the model to a given accuracy or improve the accuracy achievable in a given
amount of training time.

\begin{algorithm}
  \SetAlgoLined
  \KwInput{Function $f$, Weights $w$, Learning Rate $\gamma$, Batch Size $B$, Loss Function $L$,
    Training Data $D$, Epochs $E$.}
  \For{$i \in \left\{1,\dots,E\right\}$}{
    Initialize: $S \leftarrow D$\;
    \While{$\abs{S} \geq B$}{
      Draw $\tilde{D}$ from $S$ with $\vert\tilde{D}\vert = B$\;
      Update $S$: $S \leftarrow S \setminus \tilde{D}$\;
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        \tilde{D})}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
    \If{$S \neq \emptyset$}{
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        S)}{\mathrm{d} w}$\;
      Update:  $w \leftarrow w - \gamma g$\;
    }
  }
  \caption{Stochastic gradient descent.}
  \label{alg:sgd}
\end{algorithm}
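
The number of parameter updates per epoch in Algorithm~\ref{alg:sgd} is
$\lceil \vert D \vert / B \rceil$. As an illustration with the
hypothetical sizes $\vert D \vert = 100$ and $B = 32$, the inner loop
draws three full batches covering $96$ data points, and the remaining
four data points form one final, smaller batch, so that
\[
  \left\lceil \frac{100}{32} \right\rceil = 4
\]
updates are made in each epoch.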

To illustrate this behavior, we trained a convolutional neural
network to classify handwritten digits. The data set used for this is the
MNIST database of handwritten digits (\textcite{MNIST},
Figure~\ref{fig:MNIST}).

The network used consists of two convolution and max-pooling layers
followed by one fully connected hidden layer and the output layer.
Both convolutional layers utilize square filters of size five which are
applied with a stride of one.
The first layer consists of 32 filters and the second of 64. Both
pooling layers pool a $2\times 2$ area with a stride of two in both
directions. The fully connected layer 
consists of 256 nodes and the output layer of 10, one for each digit.
All layers use ReLU (\ref{eq:relu}) as activation function, except the output layer,
which uses softmax (\ref{eq:softmax}).
Categorical cross entropy (\ref{eq:cross_entropy}) is used as loss function.
The architecture of the convolutional neural network is summarized in
Figure~\ref{fig:mnist_architecture}.

% The network is trained with gradient descent and stochastic gradient
% descent five times for ... epochs. The reluts
The results of the network being trained with gradient descent and
stochastic gradient descent for 20 epochs are given in
Figure~\ref{fig:sgd_vs_gd}.
Here it can be seen that the network trained with stochastic gradient
descent is more accurate after the first epoch than the ones trained
with gradient descent after 20 epochs.
This is due to the former using a batch size of 32 and thus having
made $1875$ updates to the weights
after the first epoch ($60\,000$ training images divided into batches
of 32), in comparison to just one update. While each of
these updates only uses an approximate
gradient calculated on the subset, the network performs far better than the
one using true gradients when training for the same amount of
time.
\vfill
\input{Figures/mnist.tex}
\vfill
\begin{figure}[h]
  \includegraphics[width=\textwidth]{Figures/Data/convnet_fig.pdf}
  \caption[CNN Architecture for MNIST Handwritten
  Digits]{Convolutional neural network architecture used to model the
    MNIST handwritten digits data set. This figure was created with
    help of the
    {\sffamily{draw\textunderscore convnet}} Python script by \textcite{draw_convnet}.}
  \label{fig:mnist_architecture}
\end{figure}

\input{Figures/SGD_vs_GD.tex}
\clearpage
\subsection{Modified Stochastic Gradient Descent}
This section is based on \textcite{ruder}, \textcite{ADAGRAD},
\textcite{ADADELTA}, and \textcite{ADAM}.

While stochastic gradient descent can work quite well in fitting
models, its sensitivity to the learning rate $\gamma$ is an inherent
problem.
It is necessary to find an appropriate learning rate for each problem,
which is largely guesswork. The impact of choosing a bad learning rate
can be seen in Figure~\ref{fig:sgd_vs_gd}.
% There is a inherent problem in the sensitivity of the gradient descent
% algorithm regarding the learning rate $\gamma$.
% The difficulty of choosing the learning rate can be seen
% in Figure~\ref{sgd_vs_gd}.
For small rates the progress in each iteration is small,
but for learning rates that are too large the algorithm can become unstable.
This is caused by updates so large that they overshoot the minimum,
which can result in the parameters diverging to infinity.
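
This behavior can be made precise in a simple one-dimensional case. As
an illustrative assumption, let $f(x) = \frac{a}{2} x^2$ with $a > 0$, so
that $f'(x) = a x$. Gradient descent with learning rate $\gamma$ then
gives
\[
  x_{t+1} = x_t - \gamma a x_t = (1 - \gamma a) x_t
  \quad \Longrightarrow \quad
  x_t = (1 - \gamma a)^t x_0.
\]
The iterates converge to the minimum at $0$ if and only if
$\vert 1 - \gamma a \vert < 1$, that is, for $0 < \gamma < 2/a$. For
$\gamma > 2/a$ and $x_0 \neq 0$, each update overshoots the minimum by
more than the previous distance to it and $\vert x_t \vert$ grows without
bound.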

Even for learning rates small enough to ensure the parameters
do not diverge to infinity, steep valleys in the function to be
minimized can hinder the progress of
the algorithm.
If the bottom of the valley slowly slopes towards the minimum,
the steep nature of the valley can result in the
algorithm ``bouncing between'' the walls of the valley rather than
following the downwards trend.

A possible way to combat this is to alter the learning
rate over the course of training. This is often called learning rate
scheduling.
Three popular implementations of this are:
\begin{itemize}
  \item Time-based decay, where $d$ is the decay parameter and $n$ is the number of epochs,
  \[
    \gamma_{n+1} = \frac{\gamma_n}{1 + d n}.
  \]
  \item Step-based decay, where the learning rate is fixed for a span of $r$
  epochs and then decreased according to the parameter $d$,
  \[
    \gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor}.
  \]
  \item Exponential decay, where the learning rate is decreased after each epoch,
  \[
    \gamma_n = \gamma_0 e^{-n d}.
  \]
\end{itemize}
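To compare the three schedules, assume an initial learning rate
$\gamma_0 = 0.1$ (all values here are hypothetical and chosen for
illustration only). Time-based decay with $d = 0.1$ gives
\[
  \gamma_1 = \frac{0.1}{1 + 0} = 0.1, \quad
  \gamma_2 = \frac{\gamma_1}{1.1} \approx 0.0909, \quad
  \gamma_3 = \frac{\gamma_2}{1.2} \approx 0.0758.
\]
Step-based decay with $d = 0.5$ and $r = 10$ keeps $\gamma_n = 0.1$ for
$n \leq 8$ and then halves the rate every ten epochs,
\[
  \gamma_9 = 0.1 \cdot 0.5^{1} = 0.05, \quad
  \gamma_{19} = 0.1 \cdot 0.5^{2} = 0.025.
\]
Exponential decay with $d = 0.1$ yields
$\gamma_{10} = 0.1 \, e^{-1} \approx 0.0368$.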
These methods are able to increase the accuracy of models by large
margins, as seen in the training of ResNet by \textcite{resnet}, cf. Figure~\ref{fig:resnet}.
\begin{figure}[h]
  \centering
  \includegraphics[width=\textwidth]{Figures/Data/7780459-fig-4-source-hires.png}
  \caption[Learning Rate Decay]{Error history of a convolutional neural
    network trained with learning rate decay. The drops seen at 15\,000 and
    30\,000 iterations correspond to changes of the learning rate. \textcite[Figure
    4]{resnet}.}
  \label{fig:resnet}
\end{figure}


However, stochastic gradient descent with learning rate decay is
still highly sensitive to the choice of the hyperparameters $\gamma_0$
and $d$.
Several algorithms have been developed to mitigate this problem by
regularizing the learning rate with as little
hyperparameter guesswork as possible.

In the following, we will compare three algorithms that use an adaptive
learning rate, meaning they scale the updates according to past iterations.
The algorithms build upon each other, with the adaptive gradient
algorithm (\textsc{AdaGrad}, \textcite{ADAGRAD})
laying the groundwork.
Here, for each parameter update, the learning rate
is given by a constant global rate
$\gamma$ divided by the square root of the sum of the squares of the past partial
derivatives in this parameter. This results in a monotonically decaying
learning rate with faster
decay for parameters with large updates, whereas
parameters with small updates experience smaller decay.
The \textsc{AdaGrad}
algorithm is given in Algorithm~\ref{alg:ADAGRAD}. Note that while
this algorithm is still based upon the idea of gradient descent, it no
longer takes steps in the direction of the gradient when
updating. Due to the individual learning rates for each parameter, only
the sign of the update of each single parameter remains the same as in
gradient descent.

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Global learning rate $\gamma$}
  \KwInput{Constant $\varepsilon$}
  \KwInput{Initial parameter vector $x_1 \in \mathbb{R}^p$}
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Compute Update: $\Delta x_{t,i} \leftarrow
    -\frac{\gamma}{\norm{g_{1:t,i}}_2 + \varepsilon} g_{t,i}, \forall i =
    1, \dots,p$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }  
  \caption{\textsc{AdaGrad}}
  \label{alg:ADAGRAD}
\end{algorithm}
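
As a worked example with hypothetical values, consider a single
parameter with $\gamma = 0.1$ and gradients $g_1 = 3$ and $g_2 = 4$, and
neglect the small constant $\varepsilon$. Then
\begin{align*}
  \Delta x_1 & = -\frac{0.1}{\sqrt{3^2}} \cdot 3 = -0.1, \\
  \Delta x_2 & = -\frac{0.1}{\sqrt{3^2 + 4^2}} \cdot 4 = -0.08.
\end{align*}
Although the second gradient is larger than the first, the second update
is smaller, as the accumulated gradient norm has grown from $3$ to $5$.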

Building on \textsc{AdaGrad}, \textcite{ADADELTA} developed the
\textsc{AdaDelta} algorithm
to address the two main drawbacks of \textsc{AdaGrad}: the
continuous decay of the learning rate and the need for a manually
selected global learning rate $\gamma$.
As \textsc{AdaGrad} divides by the accumulated squared gradients, the learning rate will
eventually become vanishingly small.
Instead of summing the squared gradients, an exponentially decaying
average of the past squared gradients is used to regularize the
learning rate,
% In order to ensure that even after a significant of iterations
% learning continues to make progress instead of summing the squared gradients a
% exponentially decaying average of the past squared gradients is used to for
% regularizing the learning rate resulting in
\begin{align*}
  E[g^2]_t   & = \rho E[g^2]_{t-1} + (1-\rho) g_t^2, \\
  \Delta x_t & = -\frac{\gamma}{\sqrt{E[g^2]_t + \varepsilon}} g_t,
\end{align*}
for a decay rate $\rho$. This is done to ensure that learning can still
make progress even after a large number of iterations.

Additionally, the fixed global learning rate $\gamma$ is substituted by
an exponentially decaying average of the past parameter updates.
The use of the past parameter updates is motivated by ensuring that the
hypothetical units of the parameter vector match those of the
parameter update $\Delta x_t$. When only using the
gradient with a scalar learning rate, as in SGD, the resulting unit of
the parameter update is
\[
  \text{units of } \Delta x \propto \text{units of } g \propto
  \frac{\partial f}{\partial x} \propto \frac{1}{\text{units of } x},
\]
assuming the cost function $f$ is unitless. \textsc{AdaGrad} does not
have correct units either, since the update is given by a ratio of gradient
quantities, resulting in a unitless parameter update. If, however,
Hessian information or an approximation thereof is used to scale the
gradients, the unit of the updates will be correct:
\[
  \text{units of } \Delta x \propto H^{-1} g \propto
  \frac{\frac{\partial f}{\partial x}}{\frac{\partial ^2 f}{\partial
      x^2}} \propto \text{units of } x.
\]
Since using the second derivative results in correct units, Newton's
method (assuming a diagonal Hessian) is rearranged to determine the
quantities involved in the inverse of the second derivative:
\[
  \Delta x = \frac{\frac{\partial f}{\partial x}}{\frac{\partial ^2
      f}{\partial x^2}} \iff \frac{1}{\frac{\partial^2 f}{\partial
      x^2}} = \frac{\Delta x}{\frac{\partial f}{\partial x}}.
\]
As the root mean square of the past gradients is already used in the
denominator of the learning rate, an exponentially decaying root mean
square of the past updates is used to obtain a $\Delta x$ quantity for
the numerator, resulting in the correct unit of the update. The full
algorithm is given in Algorithm~\ref{alg:adadelta}.

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
  
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} +
    (1-\rho)g_t^2$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
        x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
    Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{\textsc{AdaDelta}, \textcite{ADADELTA}}
  \label{alg:adadelta}
\end{algorithm}
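
The role of the constant $\varepsilon$ becomes apparent in the first
iteration. As an illustration, take $\rho = 0.95$,
$\varepsilon = 10^{-7}$, and a hypothetical first gradient $g_1 = 1$.
Since both accumulation variables start at zero,
\begin{align*}
  E[g^2]_1   & = 0.95 \cdot 0 + 0.05 \cdot 1^2 = 0.05, \\
  \Delta x_1 & = -\frac{\sqrt{0 + 10^{-7}}}{\sqrt{0.05 + 10^{-7}}} \cdot 1
               \approx -1.41 \cdot 10^{-3}.
\end{align*}
With $\varepsilon = 0$ the numerator would vanish, so that all updates
would remain zero. The constant therefore not only prevents division by
zero but also determines the size of the first step.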

While the stochastic gradient descent algorithm is less susceptible to getting
stuck in local
extrema than gradient descent, the problem persists, especially
for saddle points (\textcite{DBLP:journals/corr/Dauphinpgcgb14}).

An approach to the problem of ``getting stuck'' in saddle points or
local minima/maxima is the addition of momentum to SGD. Instead of
using the actual gradient for the parameter update, an average over the
past gradients is used.
Usually, an exponentially decaying average is used to avoid the need to
hold the past values in memory, resulting in Algorithm~\ref{alg:sgd_m}.
% In order to avoid the need to hold the past 
% values in memory usually a exponentially decaying average is used resulting in
% Algorithm~\ref{alg:sgd_m}.
This is comparable to following the path
of a marble with mass rolling down the slope of the error
function. The decay rate of the average is comparable to the inertia
of the marble.
This results in the algorithm being able to escape some local extrema due to the
momentum built up while approaching them.

% \begin{itemize}
%   \item ADAM
%   \item momentum
%   \item ADADETLA \textcite{ADADELTA} 
% \end{itemize}


\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Learning Rate $\gamma$, Decay Rate $\rho$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $m_0 = 0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $m_t \leftarrow \rho m_{t-1} + (1-\rho) g_t$\;
    Compute Update: $\Delta x_t \leftarrow -\gamma m_t$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }  
  \caption{SGD with momentum}
  \label{alg:sgd_m}
\end{algorithm}
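
Unrolling the recursion in Algorithm~\ref{alg:sgd_m} with $m_0 = 0$
makes the weighting of the past gradients explicit,
\[
  m_t = (1-\rho) \sum_{k=1}^{t} \rho^{t-k} g_k.
\]
If, as a simplifying assumption, the gradient is constant, $g_k = g$ for
all $k$, the geometric sum gives
\[
  m_t = (1-\rho) \frac{1-\rho^t}{1-\rho} g = (1 - \rho^t) g.
\]
The accumulated gradient thus approaches $g$ only gradually. For
$\rho = 0.95$, for example, $m_1 = 0.05\, g$ and
$m_{14} \approx 0.51\, g$.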

In an effort to combine the properties of the momentum method and the
automatically adapted learning rate of \textsc{AdaDelta}, \textcite{ADAM}
developed the \textsc{Adam} algorithm, given in
Algorithm~\ref{alg:adam}. Here the exponentially decaying
root mean square of the gradients is still used for regularizing the
learning rate and is
combined with the momentum method. Both terms are normalized
(bias-corrected) such that
their expected values are the first and second moments of the gradient. However,
the term used in
\textsc{AdaDelta} to ensure correct units is dropped in favor of a scalar
global learning rate. This results in four tunable hyperparameters.
However, the
algorithm seems to be exceptionally stable with the recommended
parameters of $\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999, \varepsilon=10^{-7}$ and is a very reliable algorithm for training
neural networks.

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Stepsize $\alpha$, Stabilizing Constant $\varepsilon$}
  \KwInput{Decay Parameters $\beta_1$, $\beta_2$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate first Moment of the Gradient and correct for bias:
    $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t;$\hspace{\linewidth}
    $\hat{m}_t \leftarrow \frac{m_t}{1-\beta_1^t}$\;
    Accumulate second Moment of the Gradient and correct for bias:
    $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)g_t^2;$\hspace{\linewidth}
    $\hat{v}_t \leftarrow \frac{v_t}{1-\beta_2^t}$\;
    Compute Update: $\Delta x_t \leftarrow
    -\frac{\alpha}{\sqrt{\hat{v}_t + \varepsilon}}
    \hat{m}_t$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }  
  \caption{\textsc{Adam}, \cite{ADAM}}
  \label{alg:adam}
\end{algorithm}
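
The bias correction terms in Algorithm~\ref{alg:adam} can be motivated
by a short calculation. As a sketch, assume (this is an idealization,
not a property of actual training runs) that the expectation
$\mathbb{E}[g_i]$ is the same for all iterations $i \leq t$. Unrolling
the recursion with $m_0 = 0$ then gives
\[
  \mathbb{E}[m_t]
  = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, \mathbb{E}[g_i]
  = (1-\beta_1^t)\, \mathbb{E}[g_t].
\]
Dividing by $1-\beta_1^t$ hence removes the bias towards zero caused by
the initialization, and the same argument with $\beta_2$ and $g_i^2$
applies to $\hat{v}_t$. For example, with $\beta_2 = 0.999$ the
uncorrected value $v_1$ equals only $0.001\, g_1^2$, whereas
$\hat{v}_1 = g_1^2$.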

To get an understanding of the performance of the training algorithms
discussed above, the neural network given in
Figure~\ref{fig:mnist_architecture} has been
trained on the MNIST handwriting data set with each of these
algorithms. For all algorithms, a global learning rate of $0.001$ is
chosen. The parameter preventing divisions by zero is set to
$\varepsilon = 10^{-7}$. For \textsc{AdaDelta} and
momentum, $\rho = 0.95$ is used as decay rate. For \textsc{Adam}, the recommended
parameters are chosen.
The performance metrics of the resulting learned functions are given in
Figure~\ref{fig:comp_alg}.

Here it can be seen that \textsc{AdaDelta} is the least effective of
the algorithms for this problem. Stochastic gradient descent and
\textsc{AdaGrad} perform similarly, with \textsc{AdaGrad} being slightly
faster. \textsc{Adam} and stochastic gradient
descent with momentum achieve similar accuracies. However, the model
trained with \textsc{Adam} learns the fastest and achieves the best
accuracy. Thus, we will use \textsc{Adam} for the following comparisons.
\newpage

\input{Figures/sdg_comparison.tex}

\clearpage
\subsection{Combating Overfitting}

% As in many machine learning applications if the model is overfit in
% the data it can drastically reduce the generalization of the model. In
% many machine learning approaches noise introduced in the learning
% algorithm in order to reduce overfitting. This results in a higher
% bias of the model but the trade off of lower variance of the model is
% beneficial in many cases. For example the regression tree model
% ... benefits greatly from restricting the training algorithm on
% randomly selected features in every iteration and then averaging many
% such trained trees inserted of just using a single one. \todo{noch
%   nicht sicher ob ich das nehmen will} For neural networks similar
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit{dropout} which has been first introduced in
% \cite{Dropout}
This section is based on \textcite{Dropout1} and \textcite{Dropout}.
As with shallow networks, overfitting can still impact the quality of
convolutional neural networks.
For many models, effective ways to combat this problem are averaging
over multiple models trained on subsets of the data (bootstrap) or introducing
noise directly during training.
For example, decision trees benefit greatly from averaging many trees
trained on slightly different training sets and from the
introduction of noise during training by limiting the variables
available at each iteration
(cf. \textcite[Chapter~15]{hastie01statisticallearning}).
We explore two implementations of these approaches for neural networks:
dropout, which simulates a conglomerate of networks, and
the introduction of noise during training by slightly altering the input
images.
% A popular way to combat this problem is
% by introducing noise into the training of the model.
% This can be done in a variety 
% This is a
% successful strategy for ofter models as well, the a conglomerate of
% descision trees grown on bootstrapped trainig samples benefit greatly
% of randomizing the features available to use in each training
% iteration (Hastie, Bachelorarbeit??).
% There are two approaches to introduce noise to the model during
% learning, either by manipulating the model it self or by manipulating
% the input data.
\subsubsection{Dropout}
If a neural network has enough hidden nodes, there will be sets of
weights that accurately fit the training set (a proof for a small
scenario is given in Theorem~\ref{theo:overfit}). This especially
occurs when the relation between the in- and output is highly complex,
which requires a large network to model, and the training set is
limited in size. However, each of these sets of weights will result in different
predictions for a test set, and all of them will perform worse on the
test data than on the training data. A way to improve the predictions and
reduce the overfitting would be to train a large number of networks
and average their results.
However, this is often computationally infeasible in
training as well as in testing.
% Similarly to decision trees and random forests training multiple
% models on the same task and averaging the predictions can improve the
% results and combat overfitting. However training a very large
% number of neural networks is computationally expensive in training
%as well as testing.
In order to make this approach feasible,
\textcite{Dropout1} propose random dropout.
Instead of training different models, for each data point in a batch
randomly chosen nodes in the network are disabled (their output is
fixed to zero) and the updates for the weights in the remaining
smaller network are computed.
After updates have been obtained this way for each data point in a batch,
the updates are accumulated and applied to the full network.
This can be compared to simultaneously training many small networks
which share the weights of their active neurons.
For testing, the ``mean network'' with all nodes active is used, with the
output of the nodes scaled down to compensate for more nodes
being active than during training.
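
The scaling at test time can be motivated by a short calculation. As
a sketch, assume that each node is dropped independently with
probability $p$, and let $r_j \in \{0,1\}$ indicate whether node $j$
with output $o_j$ is kept, i.e. $\mathbb{P}(r_j = 1) = 1-p$ (the
symbols $p$, $r_j$, $o_j$ and $w_j$ are introduced here for
illustration only). For fixed outputs $o_j$, the expected input to a
node in the following layer with weights $w_j$ is then
\[
  \mathbb{E}\Big[\sum_{j} w_j r_j o_j\Big]
  = (1-p) \sum_{j} w_j o_j.
\]
Scaling the outputs of the full network by $1-p$ therefore reproduces
the average input a node receives during training. For $p = 0.5$ this
amounts to halving the outputs.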
%\todo{comparable to averaging dropout networks, beispiel für
%  besser in kleinem fall}
% Here for each training iteration from a before specified (sub)set of nodes
% randomly chosen ones are deactivated (their output is fixed to 0).
% During training 
% Instead of using different models and averaging them randomly
% deactivated nodes are used to simulate different networks which all
% share the same weights for present nodes.


% A simple but effective way to introduce noise to the model is by
% deactivating randomly chosen nodes in a layer 
% The way noise is introduced into
% the model is by deactivating certain nodes (setting the output of the
% node to 0) in the fully connected layers of the convolutional neural
% networks. The nodes are chosen at random and change in every
% iteration, this practice is called Dropout and was introduced by
% \textcite{Dropout}.

\subsubsection{Manipulation of Input Data}
Another way to combat overfitting is to randomly alter the training
inputs for each iteration of training.
% This is done keep the network from
% ``memorizing'' the training data rather than learning the relation
% between in- and output.
This can often be used in image-based tasks, as there are
often ways to manipulate the input while still being sure the labels
remain the same. For example, in an image classification task such as
handwritten digits, the associated label should remain correct when the
image is rotated or stretched by a small amount.
When applying this, one has to ensure that the alterations are
reasonable in the context of the data, or else the network might learn
false connections between in- and output.
In the case of handwritten digits, for example, too large a rotation angle
will make the distinction between a nine and a six hard and will lessen
the quality of the learned function.
The most common transformations are rotation, zoom, shear, brightness
adjustment, and mirroring. Examples of these are given in Figure~\ref{fig:datagen}. In
the following, this practice will be referred to as data generation.
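
Most of the geometric transformations above can be written as an
affine map of the pixel coordinates. As an illustrative sketch (the
symbols $u$, $A$, $b$, $\theta$, $s$, $z_1$ and $z_2$ are notation
introduced here, not parameters taken from the implementation), a
pixel at position $u \in \mathbb{R}^2$ is moved to $u' = A u + b$,
where $b$ describes a positional shift and $A$ is, for example, one of
\[
  A_{\text{rot}} =
  \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},
  \quad
  A_{\text{shear}} =
  \begin{pmatrix} 1 & s \\ 0 & 1 \end{pmatrix},
  \quad
  A_{\text{zoom}} =
  \begin{pmatrix} z_1 & 0 \\ 0 & z_2 \end{pmatrix}.
\]
Drawing the parameters at random from small ranges around the identity
transformation ($\theta = 0$, $s = 0$, $z_1 = z_2 = 1$, $b = 0$)
yields slightly altered images for which the original labels can be
expected to remain valid.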

\begin{figure}[h]
  \centering
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist0.pdf}
    \caption{original\\image}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist_gen_zoom.pdf}
    \caption{random\\zoom}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shear.pdf}
    \caption{random\\shear}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist_gen_rotation.pdf}
    \caption{random\\rotation}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shift.pdf}
    \caption{random\\positional shift}
  \end{subfigure}
  \caption[Image Data Generation]{Examples of the manipulations used in
    later comparisons. Brightness manipulation and mirroring are not
    used, as the images are equal in brightness and digits are not
    invariant to mirroring.}
  \label{fig:datagen}
\end{figure}

\subsubsection{Comparisons}

To compare the benefits obtained from implementing these
measures, we have trained the network given in
Figure~\ref{fig:mnist_architecture} on the handwriting recognition problem
with different combinations of data generation and dropout. The results
are given in Figure~\ref{fig:gen_dropout}. For each scenario, the
model was trained five times and the performance measures were
averaged.

It can be seen that implementing the measures does indeed
increase the performance of the model.
Using data generation to alter the training data seems to have a
larger impact than dropout; however, utilizing both measures yields the
best results.
%\todo{auf zahlen in tabelle verweisen?}

% Implementing data generation on
% its own seems to have a larger impact than dropout and applying both
% increases the accuracy even further.

The better performance most likely stems from reduced overfitting. The
reduction in overfitting can be seen in
Figure~\ref{fig:gen_dropout}~(\subref{fig:gen_dropout_b}), as the training
accuracy decreases while the test accuracy increases. However, utilizing
data generation together with dropout with a probability of 0.4 seems to
be too aggressive an approach, as the training accuracy drops below the
test accuracy.

\input{Figures/gen_dropout.tex}

\subsubsection{Effectiveness for Small Training Sets}

\label{sec:smalldata}

For some applications (e.g. medical problems with a small number of patients),
the available data can be highly limited.
In these scenarios, the networks are highly prone to overfitting the
data. To get an understanding of the accuracies achievable and the
impact of the measures against overfitting discussed above, we fit
networks with different measures implemented to data sets of
varying sizes.

For training, we use the MNIST handwriting data set as well as the fashion
MNIST data set. The fashion MNIST data set is a benchmark set built by
\textcite{fashionMNIST} to provide a more challenging set, as
state-of-the-art models are able to achieve accuracies of 99.88\%
(\textcite{10.1145/3206098.3206111}) on the handwriting set.
The data set contains 70\,000 preprocessed and labeled images of clothes from
Zalando. An overview is given in Figure~\ref{fig:fashionMNIST}.

\input{Figures/fashion_mnist.tex}

\afterpage{
  \noindent
\begin{minipage}{\textwidth}
  \small
  \begin{tabu} to \textwidth {@{}l*4{X[c]}@{}}
    \Tstrut \Bstrut & \textsc{Adam}     & D. 0.2            & Gen.              & Gen.+D. 0.2       \\
    \hline           
                    & 
    \multicolumn{4}{c}{Test Accuracy for 1 Sample}\Bstrut                                \\
    \cline{2-5}
    max  \Tstrut    & 0.5633            & 0.5312            & \textbf{0.6704}   & 0.6604            \\
    min             & 0.3230            & 0.4224            & 0.4878            & \textbf{0.5175}   \\
    mean            & 0.4570            & 0.4714            & 0.5862            & \textbf{0.6014}   \\
    var  \Bstrut    & 4.021e-3          & \textbf{1.175e-3} & 3.600e-3          & 2.348e-3          \\
    \hline
                    & 
    \multicolumn{4}{c}{Test Accuracy for 10 Samples}\Bstrut                             \\
    \cline{2-5}
    max  \Tstrut    & 0.8585            & 0.9423            & 0.9310            & \textbf{0.9441}   \\
    min             & 0.8148            & \textbf{0.9081}   & 0.9018            & 0.9061            \\
    mean            & 0.8377            & \textbf{0.9270}   & 0.9185            & 0.9232            \\
    var  \Bstrut    & 2.694e-4          & 1.278e-4          & \textbf{6.419e-5} & 1.504e-4          \\
    \hline
                    & 
    \multicolumn{4}{c}{Test Accuracy for 100 Samples}\Bstrut                            \\
    \cline{2-5}
    max  \Tstrut    & 0.9637            & 0.9796            & 0.9810            & \textbf{0.9811}   \\
    min             & 0.9506            & 0.9719            & 0.9702            & \textbf{0.9727}   \\
    mean            & 0.9582            & 0.9770            & 0.9769            & \textbf{0.9783}   \\
    var  \Bstrut    & 1.858e-5          & 5.778e-6          & 9.398e-6          & \textbf{4.333e-6} \\
    \hline
  \end{tabu}
  \normalsize
  \captionof{table}[Values of Test Accuracies for Models Trained on
  Subsets of MNIST Handwritten Digits]{Values of the test accuracy of
    the model trained 10 times on random MNIST handwritten digits
    training sets containing 1, 10, and 100 data points per class after
    125 epochs. \textsc{Adam} denotes the model without further
    measures, D.~0.2 dropout with a probability of 0.2, and Gen.\ data
    generation; the best value in each row is printed in bold. The mean
    accuracy achieved for the full set employing
    both measures against overfitting is 99.58\%.}
  \label{table:digitsOF}
  \small
  \centering
  \begin{tabu} to \textwidth {@{}l*4{X[c]}@{}}
    \Tstrut \Bstrut & \textsc{Adam}     & D. 0.2            & Gen.              & Gen.+D. 0.2       \\
    \hline           
                    & 
    \multicolumn{4}{c}{Test Accuracy for 1 Sample}\Bstrut                                \\
    \cline{2-5}
    max  \Tstrut    & 0.4885            & \textbf{0.5513}   & 0.5488            & 0.5475            \\
    min             & 0.3710            & \textbf{0.3858}   & 0.3736            & 0.3816            \\
    mean            & 0.4166            & 0.4838            & 0.4769            & \textbf{0.4957}   \\
    var  \Bstrut    & \textbf{1.999e-3} & 2.945e-3          & 3.375e-3          & 2.976e-3          \\
    \hline
                    & 
    \multicolumn{4}{c}{Test Accuracy for 10 Samples}\Bstrut                             \\
    \cline{2-5}
    max  \Tstrut    & 0.7370            & 0.7340            & 0.7236            & \textbf{0.7502}   \\
    min             & \textbf{0.6818}   & 0.6673            & 0.6709            & 0.6799            \\
    mean            & 0.7130            & \textbf{0.7156}   & 0.7031            & 0.7136            \\
    var  \Bstrut    & \textbf{3.184e-4} & 3.356e-4          & 3.194e-4          & 4.508e-4          \\
    \hline
                    & 
    \multicolumn{4}{c}{Test Accuracy for 100 Samples}\Bstrut                            \\
    \cline{2-5}
    max  \Tstrut    & 0.8454            & 0.8385            & 0.8456            & \textbf{0.8459}   \\
    min             & 0.8227            & 0.8200            & \textbf{0.8305}   & 0.8274            \\
    mean            & 0.8331            & 0.8289            & 0.8391            & \textbf{0.8409}   \\
    var  \Bstrut    & 3.847e-5          & 4.259e-5          & \textbf{2.315e-5} & 2.769e-5          \\
    \hline
  \end{tabu}
  \normalsize
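  \captionof{table}[Values of Test Accuracies for Models Trained on
  Subsets of Fashion MNIST]{Values of the test accuracy of the model
    trained 10 times on random fashion MNIST training sets containing
    1, 10, and 100 data points per class after 125 epochs.
    \textsc{Adam} denotes the model without further measures, D.~0.2
    dropout with a probability of 0.2, and Gen.\ data generation; the
    best value in each row is printed in bold. The mean
    accuracy achieved for the full set employing both measures
    against overfitting is 93.72\%.}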
  \label{table:fashionOF}
\end{minipage}
\clearpage 
}

The models are trained on subsets with a fixed number of randomly
chosen data points per class.
The sizes chosen for the comparisons are the full data set as well as 100, 10, and 1
data points per class.

For the task of classifying the fashion data a slightly altered model
is used. The convolutional layers with filters of size 5 are replaced
by two consecutive convolutional layers with filters of size 3.
\newpage
\begin{figure}[h]
  \includegraphics[width=\textwidth]{Figures/Data/cnn_fashion_fig.pdf}
  \caption[CNN Architecture for Fashion MNIST]{Convolutional neural
    network architecture used to model the 
    fashion MNIST data set. This figure was created using the
    draw\textunderscore convnet Python script by \textcite{draw_convnet}.} 
  \label{fig:fashion_MNIST}
\end{figure}
This is done in order to better accommodate
the more complex nature of the data by having 
more degrees of freedom. A diagram of the architecture is given in  
Figure~\ref{fig:fashion_MNIST}.

For both scenarios, the models are trained 10 times on randomly
sampled training sets.
The models are trained without measures against overfitting as well as
with combinations of dropout and data generation. The Python implementation
of the models and the parameters used for data generation are given
in Listing~\ref{lst:handwriting} for the handwriting model and in
Listing~\ref{lst:fashion} for the fashion model.

The models are trained for 125 epochs in order
to have enough random
augmentations of the input images present during training
for the networks to fully profit from the additional training data generated.
The test accuracies of the models after
training for 125
epochs are given in Table~\ref{table:digitsOF} for the handwritten digits
and in Table~\ref{table:fashionOF} for the fashion data set. Additionally, the
average test accuracies over the course of learning are given in
Figure~\ref{fig:plotOF_digits} for the handwriting application and
in Figure~\ref{fig:plotOF_fashion} for the
fashion application.

\begin{figure}[h]
  \centering
  \small
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
                     /pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
        height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {Epoch},ylabel = {Test Accuracy}, cycle
        list/Dark2, every axis plot/.append style={line width
          =1.25pt},
        ytick = {0.2,0.4,0.6},
        yticklabels = {$0.2$,$0.4$,$\phantom{0}0.6$}]
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_1.mean};  
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_dropout_02_1.mean};
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_1.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_dropout_02_1.mean};

        
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{1 Sample per Class}
    \vspace{0.25cm}
  \end{subfigure}
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
                     /pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
        height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {Epoch},ylabel = {Test Accuracy}, cycle
        list/Dark2, every axis plot/.append style={line width
          =1.25pt},
        ytick = {0.2,0.6,0.8},
        yticklabels = {$0.2$,$0.6$,$\phantom{0}0.8$}]
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_dropout_00_10.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_dropout_02_10.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_dropout_00_10.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_dropout_02_10.mean};


        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{10 Samples per Class}
  \end{subfigure}
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
                     /pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
        height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {Epoch}, ylabel = {Test Accuracy}, cycle
        list/Dark2, every axis plot/.append style={line width
          =1.25pt}, ymin = {0.92}]
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_dropout_00_100.mean};
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_dropout_02_100.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_dropout_00_100.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/adam_datagen_dropout_02_100.mean}; 
        
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{100 Samples per Class}
    \vspace{.25cm}
  \end{subfigure}
  \caption[Mean Test Accuracies for Subsets of MNIST Handwritten
  Digits]{Mean test accuracies of the models fitting the sampled MNIST
    handwriting data sets over the 125 epochs of training.}
  \label{fig:plotOF_digits}
\end{figure}

\begin{figure}[h]
  \centering
  \small
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
                     /pgf/number format/precision=3},tick style =
                   {draw = none}, width = 0.9875\textwidth,
        height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {Epoch},ylabel = {Test Accuracy}, cycle
        list/Dark2, every axis plot/.append style={line width
          =1.25pt},
        ytick = {0.2,0.3,0.4,0.5},
        yticklabels = {$0.2$,$0.3$,$0.4$,$\phantom{0}0.5$}]
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_0_1.mean};  
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_2_1.mean};
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_0_1.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_2_1.mean};

        
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{1 Sample per Class}
    \vspace{0.25cm}
  \end{subfigure}
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
                     /pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
        height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {Epoch},ylabel = {Test Accuracy}, cycle
        list/Dark2, every axis plot/.append style={line width
          =1.25pt}, ymin = {0.62}]
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_0_10.mean};  
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_2_10.mean};
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_0_10.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_2_10.mean};

        
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{10 Samples per Class}
  \end{subfigure}
  \begin{subfigure}[h]{\textwidth}
    \begin{tikzpicture}
      \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
                     /pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
        height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
        xlabel = {Epoch}, ylabel = {Test Accuracy}, cycle
        list/Dark2, every axis plot/.append style={line width
          =1.25pt}, ymin = {0.762}]
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_0_100.mean};  
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_dropout_2_100.mean};
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_0_100.mean}; 
        \addplot table
        [x=epoch, y=val_accuracy, col sep=comma, mark = none]
        {Figures/Data/fashion_datagen_dropout_2_100.mean};
        
        \addlegendentry{\footnotesize{Default}}
        \addlegendentry{\footnotesize{D. 0.2}}
        \addlegendentry{\footnotesize{G.}}
        \addlegendentry{\footnotesize{G. + D. 0.2}}
      \end{axis}
    \end{tikzpicture}
    \caption{100 Samples per Class}
    \vspace{.25cm}
  \end{subfigure}
  \caption[Mean Test Accuracies for Subsets of Fashion MNIST]{Mean test
    accuracies of the models fitting the sampled fashion MNIST
    data sets over the 125 epochs of training.}
  \label{fig:plotOF_fashion}
\end{figure}

It can be seen in Figure~\ref{fig:plotOF_digits} that for the
handwritten digits scenario,
using data generation greatly improves the accuracy for the smallest
training set of one sample per class.
While the addition of dropout only seems to have a small effect on the
accuracy of the model, it reduces the variance further than data
generation does (cf. Table~\ref{table:digitsOF}). This drop in variance
carries over to the combination of
both measures, resulting in the overall best performing model.

In the scenarios with 10 and 100 samples per class, the measures improve
the performance as well; however, the difference in performance between
the measures is much smaller than in the first scenario,
with the accuracy gain of dropout being similar to that of data generation.
While the observation regarding the variances persists for the scenario with
100 samples per class, it does not for the one with 10 samples per
class.
In all scenarios, the addition of the measures reduces the
variance of the model.

The model fit to the fashion MNIST data set benefits less from these
measures.
For the smallest scenario of one sample per class, a substantial
increase in accuracy can be observed for both measures.
In contrast to the digits data set, dropout improves the
model by a margin similar to that of data generation.
For the larger data sets, the benefits are much smaller. While
in the scenario with 100 samples per class a performance increase can
be seen with data generation, in the scenario with 10 samples per
class it performs worse than the baseline model.
Dropout seems to have a negligible impact on its own in both the 10
and 100 sample scenarios. In all scenarios, data generation seems to
benefit from the addition of dropout.

Additional figures and tables for the same comparisons with different
performance metrics are given in Appendix~\ref{app:comp}.
There it can be seen that while the measures are able to reduce overfitting
effectively for the handwritten digits data set, the neural networks
trained on the fashion data set overfit despite these measures being
in place.


% It can be seen in ...  that the usage of .. overfitting
% measures greatly improves the accuracy for small datasets. However for
% the smallest size of one datapoint per class generating more data
% ... outperforms dropout with only a ... improvment being seen by the
% implementation of dropout whereas data generation improves the accuracy
% by... . On the other hand the implementation of dropout seems to
% reduce the variance in the model accuracy, as the variance in accuracy
% for the dropout model is less than .. while the variance of the
% datagen .. model is nearly the same. The model with data generation
% ... a reduction in variance with the addition of dropout.

% For the slightly larger training sets of ten samples per class the
% difference between the two measures seems smaller. Here the
% improvement in accuracy
% seen by dropout is slightly larger than the one of
% data generation. However for the larger sized training set the variance
% in test accuracies is lower for the model with data generation than the
% one with dropout.

% The results for the training sets with 100 samples per class resemble
% the ones for the sets with 10 per class.

Overall, it seems that both measures are able to increase the performance of
a convolutional neural network; however, the success is dependent on the problem.
For the handwritten digits, the great result of data generation likely
stems from a large portion of the differences between two data points
of the same class being explainable by different positions, sizes, or
slants, which is what data generation emulates.

In the fashion data set, however, the alignment of all images is very
uniform, with little to no difference in size or angle between
data points, which might explain the worse performance of data generation.


\clearpage
\section{Summary and Outlook}

In this thesis, we have taken a look at neural networks, their
behavior in small scenarios and their application on image
classification with limited data sets.

We have explored the relation between ridge penalized neural networks
and slightly altered cubic smoothing splines, giving us an insight
into the behavior of the learned function of neural networks.

When comparing optimization algorithms, we have seen that choosing the
right training algorithm can have a
drastic impact on the efficiency of training and the quality of a model
obtainable in a reasonable time frame.
The \textsc{Adam} algorithm has performed well in training the
convolutional neural networks.
However, there is ongoing research in further
improving these algorithms. For example, \textcite{rADAM} propose an
alteration to the \textsc{Adam} algorithm in order to reduce variance
of the learning rate in the early phases of training.

We have seen that a convolutional network can benefit greatly from
measures combating overfitting, especially if the available training sets are of
a small size. The success of the measures we have examined
seems to be highly dependent on the use case, and further research is
being done on the topic of combating overfitting in neural networks.
\textcite{random_erasing} propose randomly erasing parts of the input
images during training and are able to achieve a high accuracy of 96.35\% on the fashion MNIST
data set this way.
While the data generation explored in this thesis is able to generate
new training data in a rudimentary way, further research is being done on more
elaborate ways
to enlarge the training set.
\textcite{gan} explore the usage of generative adversarial
networks to generate training images for the task of
classifying liver lesions.
These networks are trained to generate new images from
random noise, ideally resulting in completely new data that can be used
in training (cf. \textcite{goodfellow_gan}).

Overall, convolutional neural networks are able to achieve remarkable
results in many use cases
and are a staple here to stay.

% \begin{itemize}
%   \item generate more data, GAN etc \textcite{gan}
%   \item Transfer learning, use network trained on different task and
%   repurpose it / train it with the training data \textcite{transfer_learning}
%   \item random erasing fashion MNIST 96.35\% accuracy
%   \textcite{random_erasing}
%   \item However the \textsc{Adam} algorithm can have problems with high
%   variance of the adaptive learning rate early in training.
%   \textcite{rADAM} try to address these issues with the Rectified Adam
%   \item error measure: Robust error measure for supervised neural network learning with outliers
% algorithm
% \end{itemize}


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: