\section{Application of NN to higher complexity Problems}

As neural networks are applied to problems of higher complexity, which
often results in a higher dimensionality of the input, the number of
parameters in the network rises drastically. For example a network
with ...
A way to combat the

\subsection{Convolution}

Convolution is a mathematical operation, where the product of two
functions is integrated after one has been reversed and shifted.

\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) ds.
\]

This operation can be described as applying a filter function $g$
to $f$: each value $f(t)$ is replaced by an average of the values of
$f$ around position $t$, weighted by $g$.
Convolution allows for a wide range of data manipulations, a
simple example being the smoothing of real-time data. Consider a sensor
measuring the location of an object (e.g. via GPS). We expect the
output of the sensor to be noisy as a result of a number of factors
that impact its accuracy. In order to get a better estimate of
the actual location we want to smooth
the data to reduce the noise. Using convolution for this task, we
can control the significance we want to give each data point. We
might want to give a larger weight to more recent measurements than
to older ones. If we assume these measurements are taken on a discrete
timescale, we need to introduce discrete convolution first. Let $f$,
$g: \mathbb{Z} \to \mathbb{R}$, then

\[
(f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]
Applying this to the data with a suitably chosen filter $g$, we are
able to improve the accuracy, as can be seen in
Figure~\ref{fig:sin_conv}.
\input{Plots/sin_conv.tex}
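
As a small worked illustration of the discrete convolution, consider
the following sketch. The concrete values are chosen purely for
illustration and are not taken from the sensor example: let $f(0) = 1$,
$f(1) = 2$, $f(2) = 3$ and $g(0) = g(1) = \frac{1}{2}$, with both
functions being zero everywhere else. Only $i \in \{0, 1\}$ contribute
to the sum, hence
\begin{align*}
  (f * g)(t) &= \tfrac{1}{2} f(t) + \tfrac{1}{2} f(t-1), \\
  (f * g)(0) &= 0.5, \quad (f * g)(1) = 1.5, \quad (f * g)(2) = 2.5,
  \quad (f * g)(3) = 1.5,
\end{align*}
and $(f * g)(t) = 0$ for all other $t$. With this choice of $g$ each
value is replaced by the mean of the current and the previous value,
which is a two-point moving average.
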
This form of discrete convolution can also be applied to functions
with inputs of higher dimensionality. Let $f$, $g: \mathbb{Z}^d \to
\mathbb{R}$, then

\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
This will prove to be a useful framework for image manipulation, but
in order to apply convolution to images we first need to discuss the
representation of image data. Most often each pixel of an image is
represented as a mixture of base colors; these base colors define
the color space in which the image is encoded. Commonly used
color spaces are RGB (red, green, blue) and CMYK (cyan, magenta,
yellow, black). An example of an
image split into its red, green and blue channels is given in
Figure~\ref{fig:rgb}. Using this
encoding of the image we can define a corresponding discrete function
describing the image, by mapping the coordinates $(x,y)$ of a pixel
and the
channel (color) $c$ to the respective value $v$

\begin{align}
  \begin{split}    
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
              \label{def:I}
\end{align}

\begin{figure}
  \begin{adjustbox}{width=\textwidth}
    \begin{tikzpicture}  
      \begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
        \node[canvas is xy plane at z=0, transform shape] at (0,0)
        {\includegraphics[width=5cm]{Plots/Data/klammern_r.jpg}};
        \node[canvas is xy plane at z=2, transform shape] at (0,-0.2)
        {\includegraphics[width=5cm]{Plots/Data/klammern_g.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (0,-0.4)
        {\includegraphics[width=5cm]{Plots/Data/klammern_b.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (-8,-0.2)
        {\includegraphics[width=5.3cm]{Plots/Data/klammern_rgb.jpg}};
      \end{scope}
    \end{tikzpicture}
  \end{adjustbox}
  \caption{On the right the red, green and blue channels of the picture
    are displayed. In order to better visualize the color channels, the
    grayscale picture of each channel has been colored in the
    respective color. Combining the layers results in the image on the
    left.}
  \label{fig:rgb}
\end{figure}

With this representation of an image as a function, we can apply
filters to the image using convolution for multidimensional functions
as described above. In order to simplify the notation we will write
the function $I$ given in (\ref{def:I}) as well as the filter-function $g$
as a tensor from now on, resulting in the modified notation of
convolution

\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]

Simple examples of image manipulation using
convolution are smoothing operations or
rudimentary detection of edges in grayscale images, that is, images
with only one channel. A popular filter for smoothing images
is the Gaussian filter, which for a given $\sigma \in \mathbb{R}_+$ and
size $s \in \mathbb{N}$ is
defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{(x-c)^2 + (y-c)^2}{2
      \sigma^2}}, ~ x,y \in \left\{1,\dots,s\right\}, ~ c = \frac{s+1}{2}.
\]

For edge detection the Sobel operator is widely used. Here two
filters are applied to the
image $I$ and then combined. Edges in the $x$ direction are detected
by convolution with
\[
  G =\left[
  \begin{matrix}
    -1 & 0 & 1 \\
    -2 & 0 & 2 \\
    -1 & 0 & 1
  \end{matrix}\right],
\]
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by

\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied componentwise.
Examples of convolution with both kernels are given in Figure~\ref{fig:img_conv}.
\todo{padding}
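
To make the edge response concrete, consider the following sketch. The
indexing is an assumption made only for this illustration: the entries
of the Sobel kernel $G$ are written as $G_{i,j}$ with column offset
$i \in \{-1,0,1\}$ and row offset $j \in \{-1,0,1\}$, and $I$ is an
idealized grayscale image with a vertical step edge, $I_{x,y} = 1$ for
$x \geq 1$ and $I_{x,y} = 0$ otherwise. At the pixel $(0,0)$ only the
terms with $i = -1$ meet a nonzero image value, so
\begin{align*}
  (I * G)_{0,0} &= \sum_{i,j=-1}^{1} I_{-i,-j}\, G_{i,j}
                  = \sum_{j=-1}^{1} G_{-1,j} = -1 - 2 - 1 = -4, \\
  (I * G^T)_{0,0} &= \sum_{j=-1}^{1} G^T_{-1,j} = -1 + 0 + 1 = 0, \\
  O_{0,0} &= \sqrt{(-4)^2 + 0^2} = 4.
\end{align*}
In a region of constant intensity both sums vanish, as the entries of
$G$ sum to zero, so the output is nonzero only in the vicinity of the
edge.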


\begin{figure}[h]
  \centering
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/klammern.jpg}
    \caption{Original Picture}
    \label{subf:OrigPicGS}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv9.png}
    \caption{\hspace{-2pt}Gaussian Blur $\sigma^2 = 1$}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv10.png}
    \caption{Gaussian Blur $\sigma^2 = 4$}
  \end{subfigure}\\
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv4.png}
    \caption{Sobel Operator $x$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv5.png}
    \caption{Sobel Operator $y$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
    \caption{Sobel Operator combined}
  \end{subfigure}
%   \begin{subfigure}{0.24\textwidth}
%     \centering
%     \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
%     \caption{test}
%   \end{subfigure}
  \caption{Convolution of the original grayscale image (a) with different
    kernels. In (b) and (c) Gaussian kernels of size 11 and the stated
    $\sigma^2$ are used. In (d) to (f) the Sobel operator kernels
    defined above are used.}
  \label{fig:img_conv}
\end{figure}
\clearpage
\newpage
\subsection{Convolutional NN}
\todo{Introduction to CNNs}
% Conventional neural network as described in chapter .. are made up of
% fully connected layers, meaning each node in a layer is influenced by
% all nodes of the previous layer. If one wants to extract information
% out of high dimensional input such as images this results in a very
% large amount of variables in the model. This limits the 

% In conventional neural networks as described in chapter ... all layers
% are fully connected, meaning each output node in a layer is influenced
% by all inputs. For $i$ inputs and $o$ output nodes this results in $i
% + 1$ variables at each node (weights and bias) and a total $o(i + 1)$
% variables. For large inputs like image data the amount of variables
% that have to be trained in order to fit the model can get excessive
% and hinder the ability to train the model due to memory and
% computational restrictions. By using convolution we can extract
% meaningful information such as edges in an image with a kernel of a
% small size $k$ in the tens or hundreds independent of the size of the
% original image. Thus for a large image $k \cdot i$ can be several
% orders of magnitude smaller than $o\cdot i$ .

As seen in the previous section, convolution lends itself to the
manipulation of images or other large data, which motivates its usage in
neural networks.
This is achieved by implementing convolutional layers in which several
filters are applied to the input, where the values of the filters are
trainable parameters of the model.
Each node in such a layer corresponds to a pixel of the output of the
convolution with one of those filters, to which a bias and an activation
function are applied.
The usage of multiple filters results in multiple outputs of the same
size as the input. These are often called channels. Depending on the
size of the filters this can result in the dimension of the output
being one larger than that of the input.
However, for convolutional layers following a convolutional layer the
size of the filter is often chosen to coincide with the number of channels
of the output of the previous layer, without using padding in this
direction, in order to prevent gaining additional
dimensions\todo{odd phrasing} in the output.
This can also be used to flatten certain less interesting channels of
the input, for example the color channels.
Thus filters used in convolutional networks usually have the same
number of dimensions as the input or one more.

The size of the filters and the way they are applied can be tuned
while building the model, but should be the same for all filters in one
layer in order for the output to be of consistent size in all channels.
It is common to reduce the dimensionality of the output by not applying the
filters on each ``pixel'' but rather specifying a ``stride'' $s$ at which
the filter $g$ is moved over the input $I$,

\[
  O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{s x-i,\, s y-j,\, c-l}\, g_{i,j,l}.
\]

As seen, convolution lends itself to image manipulation. In this
chapter we will explore how we can incorporate convolution into neural
networks, and how that might be beneficial.

Convolutional Neural Networks as described by ... are made up of
convolutional layers, pooling layers, and fully connected ones. The
fully connected layers are layers in which each input node is
connected to each output node, which is the structure introduced in
chapter ...

In a convolutional layer, instead of combining all input nodes for each
output node, the input nodes are interpreted as a tensor to which a
kernel is applied via convolution, resulting in the output. Most often
multiple kernels are used, resulting in multiple output tensors. These
kernels are the variables which can be altered in order to fit the
model to the data. Using multiple kernels it is possible to extract
different features from the image (e.g. edges, as with the Sobel
operator). As this
increases the dimensionality even further, which is undesirable as it
increases the number of variables in later layers of the model, a convolutional layer
is often followed by a pooling one. In a pooling layer the input is
reduced in size by extracting a single value from a
neighborhood \todo{moving...}... . The resulting output size depends on
the offset of the neighborhoods used. Popular is max-pooling, where the
largest value in a neighborhood is used, or ...
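
A minimal illustration of max-pooling, assuming non-overlapping
$2 \times 2$ neighborhoods (an offset of two in both directions) and
input values chosen purely for illustration:
\[
  \left[
  \begin{matrix}
    1 & 3 & 2 & 0 \\
    4 & 2 & 1 & 1 \\
    0 & 1 & 5 & 6 \\
    2 & 2 & 7 & 8
  \end{matrix}\right]
  \mapsto
  \left[
  \begin{matrix}
    4 & 2 \\
    2 & 8
  \end{matrix}\right].
\]
Each entry of the output is the largest value of the corresponding
$2 \times 2$ block, so the $4 \times 4$ input is reduced to a
$2 \times 2$ output. A smaller offset would let the neighborhoods
overlap and lead to a larger output.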

This construct allows for the extraction of features from the input while
using far fewer variables.

... \todo{Example with a small image, ideally the one from above}

\subsubsection{Parallels to the Visual Cortex in Mammals}

The choice of convolution for image classification tasks is not
arbitrary. ... eye ... bla bla


% \subsection{Limitations of the Gradient Descent Algorithm}

% -Hyperparameter guesswork
% -Problems navigating valleys -> momentum
% -Different scale of gradients for vars in different layers -> ADAdelta

\subsection{Stochastic Training Algorithms}

For many applications in which neural networks are used, such as
image classification or segmentation, large training data sets become
essential to capture the nuances of the
data. However, as training sets get larger the memory requirement
during training grows with them.
In order to update the weights with the gradient descent algorithm,
derivatives of the network with respect to each
variable need to be calculated for all data points in order to get the
full gradient of the error of the network.
Thus the amount of memory and computing power available limits the
size of the training data that can be efficiently used in fitting the
network. A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
descent algorithms. Here the premise is that instead of using the whole
dataset a (different) subset of the data is chosen to
compute the gradient in each iteration (Algorithm~\ref{alg:sgd}).
The training period until each data point has been considered in
updating the parameters is commonly called an ``epoch''.
Using subsets reduces the amount of memory and computing power required for
each iteration. This makes it possible to use very large training
sets to fit the model.
Additionally, the noise introduced into the gradient can improve
the accuracy of the fit, as stochastic gradient descent algorithms are
less likely to get stuck in local extrema.

Another important benefit of using subsets is that, depending on their size, the
gradient can be calculated far quicker, which allows for more parameter updates
in the same time. If the approximated gradient is close enough to the
``real'' one this can drastically cut down the time required for
training the model to a certain degree of accuracy or improve the accuracy achievable in a given
amount of training time.

\begin{algorithm}
  \SetAlgoLined
  \KwInput{Function $f$, Weights $w$, Learning Rate $\gamma$, Batch Size $B$, Loss Function $L$,
    Training Data $D$, Epochs $E$.}
  \For{$i \in  \left\{1:E\right\}$}{
    $S \leftarrow D$\;
    \While{$\abs{S} \geq B$}{
      Draw $\tilde{D}$ from $S$ with $\vert\tilde{D}\vert = B$\;
      Update $S$: $S \leftarrow S \setminus \tilde{D}$\;
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        \tilde{D})}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
    \If{$S \neq \emptyset$}{
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        S)}{\mathrm{d} w}$\;
      Update:  $w \leftarrow w - \gamma g$\;
    }
  }  
  \caption{Stochastic gradient descent.}
  \label{alg:sgd}
\end{algorithm}

In order to illustrate this behavior we modeled a convolutional neural
network to ... handwritten digits. The data set used for this is the
MNIST database of handwritten digits (\textcite{MNIST},
Figure~\ref{fig:MNIST}).
\input{Plots/mnist.tex}
The network used consists of two convolutional layers, each followed
by a max pooling layer,
followed by one fully connected hidden layer and the output layer.
Both convolutional layers utilize square filters of size five which are
applied with a stride of one.
The first layer consists of 32 filters and the second of 64. Both
pooling layers pool a $2\times 2$ area. The fully connected layer
consists of 256 nodes and the output layer of 10, one for each digit.
All layers except the output layer use ReLU as activation function,
with the output layer using softmax (\ref{def:softmax}).
As loss function categorical cross-entropy is used (\ref{def:...}).
The architecture of the convolutional neural network is summarized in
Figure~\ref{fig:mnist_architecture}.

\begin{figure}
  \missingfigure{network architecture}
  \caption{architecture}
  \label{fig:mnist_architecture}
\end{figure}
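
As a rough check on the size of this model, the number of trainable
parameters can be tallied. The tally relies on assumptions that are not
stated above: grayscale inputs of $28 \times 28$ pixels, no padding in
the convolutional layers and non-overlapping pooling. A filter of size
$k$ applied with stride one and without padding to an input of width
$n$ produces an output of width $n - k + 1$, so the width of the
feature maps shrinks as $28 \to 24 \to 12 \to 8 \to 4$ and the fully
connected layer receives $4 \cdot 4 \cdot 64 = 1024$ inputs. Counting
one bias per filter or node gives
\begin{align*}
  \text{first convolutional layer:}  && 32 \cdot (5 \cdot 5 \cdot 1 + 1) &= 832, \\
  \text{second convolutional layer:} && 64 \cdot (5 \cdot 5 \cdot 32 + 1) &= 51\,264, \\
  \text{fully connected layer:}      && 256 \cdot (1024 + 1) &= 262\,400, \\
  \text{output layer:}               && 10 \cdot (256 + 1) &= 2\,570,
\end{align*}
for a total of $317\,066$ parameters under these assumptions. With a
different padding only the input size of the fully connected layer,
and with it the third count, would change.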

The results of the network being trained with gradient descent and
stochastic gradient descent for 20 epochs are given in Figure~\ref{fig:sgd_vs_gd}
and Table~\ref{table:sgd_vs_gd}.


Here it can be seen that the network trained with stochastic gradient
descent is more accurate after the first epoch than the ones trained
with gradient descent after 20 epochs.
This is due to the former using a batch size of 32 and thus having
made 1875 updates to the weights
after the first epoch, in comparison to one update. While each of
these updates uses an approximate
gradient calculated on the subset, it performs far better than the
network using true gradients when training for the same amount of time.
\todo{compare training time}

\input{Plots/SGD_vs_GD.tex}
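
The number of updates can be retraced from Algorithm~\ref{alg:sgd}:
each epoch performs one update per batch, that is
$\lceil \abs{D} / B \rceil$ updates. Assuming the $60\,000$ training
images of the MNIST database (a figure not stated above) and a batch
size of $B = 32$ this gives
\[
  \left\lceil \frac{60\,000}{32} \right\rceil = 1875
\]
updates per epoch. Over 20 epochs stochastic gradient descent thus
performs $20 \cdot 1875 = 37\,500$ updates, whereas gradient descent
performs only $20$.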
\clearpage
\subsection{\titlecap{modified stochastic gradient descent}}
An inherent problem of the stochastic gradient descent algorithm is
its sensitivity to the learning rate $\gamma$. This results in the
problem of having to find an appropriate learning rate for each problem,
which is largely guesswork. The impact of choosing a bad learning rate
can be seen in Figure~\ref{fig:sgd_vs_gd}.
% There is a inherent problem in the sensitivity of the gradient descent
% algorithm regarding the learning rate $\gamma$.
% The difficulty of choosing the learning rate can be seen
% in Figure~\ref{sgd_vs_gd}.
For small rates the progress in each iteration is small,
but as the rate is enlarged the algorithm can become unstable and the parameters
diverge to infinity. Even for learning rates small enough to ensure the parameters
do not diverge to infinity, steep valleys in the function to be
minimized can hinder the progress of
the algorithm, as for learning rates not small enough gradient descent
``bounces between'' the walls of the valley rather than following a
downward trend in the valley.

% \[
%   w - \gamma \nabla_w ...
% \]
%thus the weights grow to infinity.
\todo{explain unstable learning rate
  better}
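
The instability can be seen in a one-dimensional sketch, where the
quadratic objective is an assumption chosen purely for illustration.
For $f(w) = \frac{a}{2} w^2$ with $a > 0$ a gradient descent step reads
\[
  w_{n+1} = w_n - \gamma a w_n = (1 - \gamma a)\, w_n,
  \qquad\text{hence}\qquad
  w_n = (1 - \gamma a)^n\, w_0.
\]
The iterates converge to the minimum at $0$ if and only if
$\abs{1 - \gamma a} < 1$, that is for $0 < \gamma < \frac{2}{a}$. For
$\frac{1}{a} < \gamma < \frac{2}{a}$ the factor $1 - \gamma a$ is
negative and the iterates change sign in every step, which corresponds
to jumping from one wall of the valley to the other. For
$\gamma > \frac{2}{a}$ the iterates grow in absolute value and diverge.
The steeper the valley, meaning the larger $a$, the smaller the range
of stable learning rates.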

To combat this problem \todo{source} propose to alter the learning
rate over the course of training, often called learning rate
scheduling, in order to decrease the learning rate as training
progresses. The most popular implementations of this are time-based
decay
\[
  \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
\]
where $d$ is the decay parameter and $n$ is the number of epochs,
step-based decay where the learning rate is fixed for a span of $r$
epochs and then decreased according to the parameter $d$
\[
  \gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor},
\]
and exponential decay where the learning rate is decreased after each epoch
\[
  \gamma_n = \gamma_0 e^{-n d}.
\]
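As a numerical illustration with hypothetical values, take an initial
learning rate of $\gamma_0 = 0.1$. Step-based decay with $d = 0.5$ and
$r = 10$ keeps the learning rate at $0.1$ for $n = 0, \dots, 8$, halves
it to $0.05$ for $n = 9, \dots, 18$ and halves it again to $0.025$
at $n = 19$. Exponential decay with $d = 0.1$ yields
\[
  \gamma_{10} = 0.1 \cdot e^{-1} \approx 0.0368,
  \qquad
  \gamma_{20} = 0.1 \cdot e^{-2} \approx 0.0135.
\]
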
These methods are able to increase the accuracy of a model by large
margins, as seen in the training of ResNet by \textcite{resnet}.
\todo{maybe include a
  figure}
However, stochastic gradient descent with learning rate decay is
still highly sensitive to the choice of the hyperparameters $\gamma_0$
and $d$.
In order to mitigate this problem a number of algorithms have been
developed to regularize the learning rate with as little
hyperparameter guesswork as possible.

We will examine and compare a ... algorithms that use an adaptive
learning rate.
They all scale the gradient for the update depending on past gradients
for each weight individually.

The algorithms build upon each other, with the adaptive gradient
algorithm (\textsc{AdaGrad}, \textcite{ADAGRAD})
laying the groundwork. Here for each parameter update the learning rate
is given by a constant
$\gamma$ divided by the square root of the sum of the squares of the
past partial
derivatives in this parameter. This results in a monotonically
decreasing learning rate for each parameter, which decays faster for
parameters with large updates, whereas
parameters with small updates experience smaller decay. The \textsc{AdaGrad}
algorithm is given in Algorithm~\ref{alg:ADAGRAD}.

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Global learning rate $\gamma$}
  \KwInput{Constant $\varepsilon$}
  \KwInput{Initial parameter vector $x_1 \in \mathbb{R}^p$}
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Compute Update: $\Delta x_{t,i} \leftarrow
    -\frac{\gamma}{\norm{g_{1:t,i}}_2 + \varepsilon} g_{t,i}, \forall i =
    1, \dots,p$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }  
  \caption{\textls{\textsc{AdaGrad}}}
  \label{alg:ADAGRAD}
\end{algorithm}
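
The effect of this scaling can be seen in a small sketch with
hypothetical values. Consider a single parameter with $\gamma = 0.1$,
neglect $\varepsilon$ and assume the same partial derivative $g_t = 1$
in every iteration. Then $\norm{g_{1:t}}_2 = \sqrt{t}$ and
\[
  \Delta x_t = -\frac{0.1}{\sqrt{t}},
  \qquad
  \Delta x_1 = -0.1, \quad
  \Delta x_2 \approx -0.0707, \quad
  \Delta x_3 \approx -0.0577, \quad
  \Delta x_{100} = -0.01.
\]
For a second parameter with $g_t = 0.1$ in every iteration the norm is
$0.1 \sqrt{t}$ and the update is again $-0.1/\sqrt{t}$. In this
idealized setting the size of the update does not depend on the scale
of the partial derivatives, while the step size keeps shrinking as
more gradients are accumulated.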

Building on \textsc{AdaGrad}, \textcite{ADADELTA} developed the ... (\textsc{AdaDelta})
in order to improve upon the two main drawbacks of \textsc{AdaGrad}, being the
continual decay of the learning rate and the need for a manually
selected global learning rate $\gamma$.
As \textsc{AdaGrad} uses division by the accumulated squared gradients, the learning rate will
eventually become infinitely small.
In order to ensure that even after a significant number of iterations
learning continues to make progress, instead of summing the gradients an
exponentially decaying average of the past gradients is used to ....
Additionally the fixed global learning rate $\gamma$ is substituted by
an exponentially decaying average of the past parameter updates.
The usage of the past parameter updates is motivated by ensuring that
if the parameter vector had some hypothetical units they would be matched
by those of the parameter update $\Delta x_t$. This proper
\todo{explanation of units}

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} +
    (1-\rho)g_t^2$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
        x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
    Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }  
  \caption{\textsc{AdaDelta}, \textcite{ADADELTA}}
  \label{alg:ADADELTA}
\end{algorithm}

While the stochastic gradient descent algorithm is less susceptible to local
extrema than gradient descent, the problem still persists, especially
with saddle points (\textcite{DBLP:journals/corr/Dauphinpgcgb14}).

An approach to the problem of ``getting stuck'' in saddle points or
local minima/maxima is the addition of momentum to SGD. Instead of
using the actual gradient for the parameter update, an average over the
past gradients is used. In order to avoid the need to store the past
values, usually an exponentially decaying average is used, resulting in
Algorithm~\ref{alg_momentum}. This is comparable to following the path
of a marble with mass rolling down the slope of the error
function. The decay rate for the average is comparable to the inertia
of the marble.
This results in the algorithm being able to escape ... due to the
momentum built up while approaching it.

% \begin{itemize}
%   \item ADAM
%   \item momentum
%   \item ADADETLA \textcite{ADADELTA} 
% \end{itemize}


\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Learning Rate $\gamma$, Decay Rate $\rho$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $m_0 = 0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $m_t \leftarrow \rho m_{t-1} + (1-\rho) g_t$\;
    Compute Update: $\Delta x_t \leftarrow -\gamma m_t$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }  
  \caption{SGD with momentum}
  \label{alg_momentum}
\end{algorithm}
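
The behavior of the accumulated gradient can be made explicit by
unrolling the recursion $m_t = \rho m_{t-1} + (1-\rho) g_t$ with
$m_0 = 0$, which gives
\[
  m_t = (1-\rho) \sum_{k=1}^{t} \rho^{t-k} g_k.
\]
Two idealized gradient sequences, assumed here only for illustration,
show the effect. For a constant gradient $g_k = g$ the sum is geometric
and
\[
  m_t = (1 - \rho^t)\, g \to g \quad (t \to \infty),
\]
so the full step size is built up along a consistent direction. For a
gradient that changes sign in every iteration, $g_k = (-1)^k g$, one
obtains
\[
  m_t = (-1)^t\, \frac{1-\rho}{1+\rho} \left(1 - (-\rho)^t\right) g,
  \qquad
  \abs{m_t} \to \frac{1-\rho}{1+\rho} \abs{g},
\]
which for $\rho = 0.9$ is roughly $0.053 \abs{g}$. Oscillating
components of the gradient are therefore strongly damped, while
components that point in the same direction over many iterations are
retained.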

In an effort to combine the properties of the momentum method and the
automatically adapted learning rate of \textsc{AdaDelta}, \textcite{ADAM}
developed the \textsc{Adam} algorithm. The 

Problems / Improvements ADAM \textcite{rADAM}


\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Stepsize $\alpha$}
  \KwInput{Decay Parameters $\beta_1$, $\beta_2$, Constant $\varepsilon$}
  Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate first and second Moment of the Gradient:
    \begin{align*}
      m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
      v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\;
    \end{align*}
    Compute bias-corrected Moments: $\hat{m}_t \leftarrow
    \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t \leftarrow \frac{v_t}{1 -
      \beta_2^t}$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\alpha}{\sqrt{\hat{v}_t}
      + \varepsilon} \hat{m}_t$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }  
  \caption{\textsc{Adam}, \textcite{ADAM}}
  \label{alg:ADAM}
\end{algorithm}
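
The role of the two moment estimates can be illustrated by the very
first update. The sketch assumes the bias-corrected estimates of the
original \textsc{Adam} publication, $\hat{m}_t = m_t / (1 - \beta_1^t)$
and $\hat{v}_t = v_t / (1 - \beta_2^t)$, together with the update
$\Delta x_t = -\alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)$,
all applied componentwise. With $m_0 = v_0 = 0$ the first iteration
gives
\begin{align*}
  m_1 &= (1 - \beta_1)\, g_1, & \hat{m}_1 &= g_1, \\
  v_1 &= (1 - \beta_2)\, g_1^2, & \hat{v}_1 &= g_1^2,
\end{align*}
and therefore
\[
  \Delta x_1 = -\alpha\, \frac{g_1}{\abs{g_1} + \varepsilon}
  \approx -\alpha \operatorname{sign}(g_1).
\]
The size of the first step in each parameter is thus approximately
$\alpha$, independent of the scale of the gradient. Without the bias
correction the zero initialization would bias $m_1$ and $v_1$ towards
zero.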


\input{Plots/sdg_comparison.tex}

% \subsubsubsection{Stochastic Gradient Descent}
\clearpage
\subsection{Combating Overfitting}

% As in many machine learning applications if the model is overfit in
% the data it can drastically reduce the generalization of the model. In
% many machine learning approaches noise introduced in the learning
% algorithm in order to reduce overfitting. This results in a higher
% bias of the model but the trade off of lower variance of the model is
% beneficial in many cases. For example the regression tree model
% ... benefits greatly from restricting the training algorithm on
% randomly selected features in every iteration and then averaging many
% such trained trees inserted of just using a single one. \todo{noch
%   nicht sicher ob ich das nehmen will} For neural networks similar
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit{dropout} which has been first introduced in
% \cite{Dropout}

Similarly to shallow networks, overfitting can still impact the quality of
convolutional neural networks. A popular way to combat this problem is
by introducing noise into the training of the model. This is a
successful strategy for other models as well: a conglomerate of
decision trees grown on bootstrapped training samples benefits greatly
from randomizing the features available for use in each training
iteration (Hastie, bachelor thesis??).
There are two approaches to introducing noise to the model during
learning, either by manipulating the model itself or by manipulating
the input data.
\subsubsection{Dropout}
If a neural network has enough hidden nodes there will be sets of
weights that accurately fit the training set (proof for a small
scenario given in ...). This especially occurs when the relation
between the input and output is highly complex, which requires a large
network to model, and the training set is limited in size (cf. CNNs
with few images). However, each of these sets of weights will result in different
predictions for a test set and all of them will perform worse on the
test data than on the training data. A way to improve the predictions and
reduce the overfitting would
be to train a large number of networks and average their results (cf.
random forests); however, this is often computationally not feasible in
training as well as testing.
% Similarly to decision trees and random forests training multiple
% models on the same task and averaging the predictions can improve the
% results and combat overfitting. However training a very large
% number of neural networks is computationally expensive in training
%as well as testing.
In order to make this approach feasible
\textcite{Dropout1} propose random dropout.
Instead of training different models, for each data point in a batch
randomly chosen nodes in the network are disabled (their output is
fixed to zero) and the updates for the weights in the remaining
smaller network are computed. The updates computed for each data
point in the batch are then accumulated and applied to the full
network.
This can be compared to simultaneously training many small networks
which share their weights for their active neurons.
For testing the ``mean network'' with all nodes active but their
output scaled accordingly to compensate for more active nodes is
used. \todo{comparable to averaging dropout networks, example showing
  it is better in a small case}
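
The scaling used in the ``mean network'' can be motivated by an
expectation. As a sketch, assume that during training each node is kept
independently with probability $p$; the symbols $p$, $r_j$, $a_j$ and
$w_j$ are introduced here only for illustration. The input of a node in
the following layer is then
\[
  z = \sum_{j} w_j\, r_j\, a_j, \qquad r_j \sim \mathrm{Bernoulli}(p),
\]
where $a_j$ are the outputs of the previous layer and $w_j$ the
corresponding weights. Taking the expectation over the random masks
gives
\[
  \mathbb{E}[z] = p \sum_{j} w_j\, a_j.
\]
Using all nodes at test time and scaling their outputs by $p$ therefore
reproduces the expected input seen during training. For $p = 0.5$, for
example, the outputs are halved.
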
% Here for each training iteration from a before specified (sub)set of nodes
% randomly chosen ones are deactivated (their output is fixed to 0).
% During training 
% Instead of using different models and averaging them randomly
% deactivated nodes are used to simulate different networks which all
% share the same weights for present nodes.


% A simple but effective way to introduce noise to the model is by
% deactivating randomly chosen nodes in a layer 
% The way noise is introduced into
% the model is by deactivating certain nodes (setting the output of the
% node to 0) in the fully connected layers of the convolutional neural
% networks. The nodes are chosen at random and change in every
% iteration, this practice is called Dropout and was introduced by
% \textcite{Dropout}.

\subsubsection{\titlecap{manipulation of input data}}
Another way to combat overfitting is to keep the network from
memorizing the dataset by manipulating the inputs randomly in each
iteration of
training. This is commonly used in image-based tasks, as there are
often ways to manipulate the input while still being sure the labels
remain the same. For example in an image classification task such as
handwritten digits the associated label should remain correct when the
image is rotated or stretched by a small amount.
When using this, one has to be sure that the labels indeed remain the
same, or else the network will not learn the desired ...
In the case of handwritten digits, for example, a too high rotation angle
will ... a nine or six.
The most common transformations are rotation, zoom, shear, brightness
and mirroring.

\begin{figure}[h]
  \centering
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist0.pdf}
    \caption{original\\image}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist_gen_zoom.pdf}
    \caption{random\\zoom}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist_gen_shear.pdf}
    \caption{random\\shear}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist_gen_rotation.pdf}
    \caption{random\\rotation}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist_gen_shift.pdf}
    \caption{random\\positional shift}
  \end{subfigure}
  \caption{Example for the manipulations used in ... As all images are
    of the same intensity, brightness manipulation does not seem
    ... Additionally mirroring is not used for ...  reasons.}
\end{figure}

\input{Plots/gen_dropout.tex}

\todo{Compare different dropout rates on MNIST or similar, subset as
training set?}

\subsubsection{\titlecap{effectiveness for small training sets}}

For some applications (medical problems with a small number of patients)
the available data can be highly limited.
In these problems the networks are highly ... for overfitting the
data. In order to get an understanding of the accuracies achievable and the
impact of the measures to prevent overfitting discussed above, we train
the network on datasets of varying sizes.
First we use the MNIST handwriting dataset and then a slightly harder
problem given by the Fashion-MNIST dataset, which contains preprocessed
pictures of clothes from 10 different categories.

\input{Plots/fashion_mnist.tex}

For training, a certain number of random data points is chosen from
each class. The sizes chosen are:
\begin{itemize}
  \item full dataset: ... per class
  \item 1000 per class
  \item 100 per class
  \item 10 per class
\end{itemize}

The results for training .. are given in ... Here it can be seen...

\begin{figure}[h]
  \centering
  \missingfigure{datagen digits}
  \caption{Sample pictures of the Fashion-MNIST dataset, one per
    class.}
  \label{mnist fashion}
\end{figure}

\begin{figure}[h]
  \centering
  \missingfigure{datagen fashion}
  \caption{Sample pictures of the Fashion-MNIST dataset, one per
    class.}
  \label{mnist fashion}
\end{figure}


\clearpage
\section{Bla}
\begin{itemize}
  \item generate more data, GAN etc
  \item Transfer learning, use network trained on different task and
  repurpose it / train it with the training data
\end{itemize}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: