\section{Application of Neural Networks to Higher Complexity Problems}
As neural networks are applied to problems of higher complexity, which
often results in a higher dimensionality of the input, the number of
parameters in the network rises drastically. For example, a network
with ...
A way to combat the
Convolution is a mathematical operation, where the product of two
functions is integrated after one has been reversed and shifted.
\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) \, ds.
\]
This operation can be described as a filter function $g$ being applied
to $f$:
each value $f(t)$ is replaced by an average of the values of $f$,
weighted by $g$, around position $t$.
The convolution operation allows for many kinds of data manipulation, a
simple example being the smoothing of real-time data. Consider a sensor
measuring the location of an object (e.g.\ via GPS). We expect the
output of the sensor to be noisy, as a number of factors
impact its accuracy. In order to get a better estimate of
the actual location, we want to smooth
the data to reduce the noise. Using convolution for this task, we
can control the significance we give to each data point; for instance, we
might want to give a larger weight to more recent measurements than to
older ones. Since these measurements are taken on a discrete
timescale, we need to introduce discrete convolution first. Let $f$,
$g: \mathbb{Z} \to \mathbb{R}$, then
\[
  (f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]
Applying this to the data with a suitably chosen filter $g$, we
are able to improve the accuracy, which can be seen in ...
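As a small illustration of the discrete convolution defined above, consider
the following worked example. The measurement values and the filter weights
are hypothetical and chosen only to make the arithmetic easy to follow; they
are not taken from any data used in this work. Assume the last three
measurements are $f(1) = 4$, $f(2) = 3$, $f(3) = 7$ and the filter is
$g(0) = \frac{1}{2}$, $g(1) = \frac{1}{3}$, $g(2) = \frac{1}{6}$ with
$g(i) = 0$ otherwise, so that the weights sum to one and more recent
measurements receive a larger weight. The smoothed value at $t = 3$ is then
\[
  (f * g)(3) = f(3) g(0) + f(2) g(1) + f(1) g(2)
  = \frac{7}{2} + \frac{3}{3} + \frac{4}{6} = \frac{31}{6} \approx 5.17.
\]
The outlying measurement $f(3) = 7$ is thus pulled towards the preceding
values.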
This form of discrete convolution can also be applied to functions
with inputs of higher dimensionality. Let $f$, $g: \mathbb{Z}^d \to
\mathbb{R}$ then
\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
This will prove to be a useful framework for image manipulation, but
in order to apply convolution to images we need to discuss the
representation of image data first. Most often images are represented
by each pixel being a mixture of base colors. These base colors define
the color space in which the image is encoded. Often used
color spaces are RGB (red,
green, blue) and CMYK (cyan, magenta, yellow, black). An example of an
image split into its red, green and blue channels is given in
Figure~\ref{fig:rgb}. Using this
encoding of the image, we can define a corresponding discrete function
describing the image by mapping the coordinates $(x,y)$ of a pixel
and the
channel (color) $c$ to the respective value $v$:
\begin{align}
  \label{def:I}
  \begin{split}
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
\end{align}
\begin{figure}
  \centering
  % TikZ picture not recoverable: the composed image on the left and its
  % red, green and blue channel layers stacked on the right.
  \caption{On the right the red, green and blue channels of the picture
    are displayed. In order to better visualize the color channels, the
    black and white picture of each channel has been colored in the
    respective color. Combining the layers results in the image on the
    left.}
  \label{fig:rgb}
\end{figure}
With this representation of an image as a function, we can apply
filters to the image using convolution for multidimensional functions
as described above. In order to simplify the notation we will write
the function $I$ given in (\ref{def:I}) as well as the filter-function $g$
as a tensor from now on, resulting in the modified notation of
\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]
Simple examples of image manipulation using
convolution are smoothing operations and the
rudimentary detection of edges in grayscale images, i.e.\ images that only
have one channel. A popular filter for smoothing images
is the Gaussian filter, which for a given $\sigma \in \mathbb{R}_+$ and
size $s \in \mathbb{N}$ is
defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2
      \sigma^2}}, ~ x,y \in \left\{1,\dots,s\right\}.
\]
For edge detection purposes the Sobel operator is widespread. Here two
filters are applied to the
image $I$ and then combined. Edges in the $x$ direction are detected
by convolution with
\[
  G = \left[
    \begin{matrix}
      -1 & 0 & 1 \\
      -2 & 0 & 2 \\
      -1 & 0 & 1
    \end{matrix}
  \right]
\]
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by
\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied
componentwise. Examples of convolution with both kernels are given in Figure~\ref{fig:img_conv}.
\begin{figure}
  \centering
  % Six image panels (a)-(f) not recoverable from the source.
  \caption{Convolution of the original grayscale image (a) with different
    kernels. In (b) and (c) Gaussian kernels of size 11 and the stated
    $\sigma^2$ are used. In (d)--(f) the Sobel operator
    kernels defined above are used. Panels: (a) original picture,
    (b) Gaussian blur with $\sigma^2 = 1$, (c) Gaussian blur with
    $\sigma^2 = 4$, (d) Sobel operator in $x$-direction, (e) Sobel
    operator in $y$-direction, (f) Sobel operator combined.}
  \label{fig:img_conv}
\end{figure}
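To make the effect of the Sobel kernels concrete, the following worked
example may help. It is a sketch under assumptions that are not fixed in the
text above: the kernel is indexed as $G_{i,j}$ with $i, j \in \{-1, 0, 1\}$,
where $i$ is the column ($x$) offset and $j$ the row ($y$) offset, and the
image is a hypothetical vertical step edge with $I_{x,y} = 0$ for $x \leq 0$
and $I_{x,y} = 1$ for $x \geq 1$. Writing $(w_{-1}, w_0, w_1) = (1, 2, 1)$,
the columns of $G$ are $-w$, $0$ and $w$, so that
\begin{align*}
  (I * G)_{x,y} &= \sum_{i,j=-1}^{1} I_{x-i,y-j} G_{i,j}
  = \sum_{j=-1}^{1} w_j \left( I_{x-1,y-j} - I_{x+1,y-j} \right), \\
  (I * G)_{0,y} &= (I * G)_{1,y} = (1 + 2 + 1) \cdot (0 - 1) = -4,
  \qquad (I * G)_{x,y} = 0 \text{ for } x \notin \{0, 1\}.
\end{align*}
As the image does not change in the $y$ direction, $(I * G^T)_{x,y} = 0$
everywhere, and the combined output is $O_{x,y} = \sqrt{(-4)^2 + 0^2} = 4$ in
the two columns next to the edge and $0$ elsewhere.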
\subsection{Convolutional Neural Networks}
In conventional neural networks as described in chapter ..., all layers
are fully connected, meaning each output node in a layer is influenced
by all inputs. For $i$ inputs and $o$ output nodes this results in $i
+ 1$ variables at each node (weights and bias) and a total of $o(i + 1)$
variables. For large inputs such as image data, the number of variables
that have to be trained in order to fit the model can become excessive
and hinder the ability to train the model due to memory and
computational restrictions. By using convolution we can extract
meaningful information, such as edges in an image, with a kernel of
small size $k$, in the tens or hundreds, independent of the size of the
original image. Thus for a large image $k \cdot i$ can be several
orders of magnitude smaller than $o\cdot i$.
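The difference in scale can be illustrated with a short calculation. The
numbers are hypothetical and only meant as an example: a grayscale image of
$256 \times 256$ pixels gives $i = 65\,536$ inputs, and a fully connected
layer with $o = 1000$ output nodes then requires
\[
  o(i + 1) = 1000 \cdot 65\,537 = 65\,537\,000
\]
variables. A single kernel of size $11 \times 11$, by contrast, consists of
only $k = 121$ weights, regardless of the size of the image.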
As seen above, convolution lends itself to image manipulation. In this
chapter we will explore how convolution can be incorporated into neural
networks and how that might be beneficial.
Convolutional neural networks as described by ... are made up of
convolutional layers, pooling layers, and fully connected layers. The
fully connected layers are layers in which each input node is
connected to each output node, which is the structure introduced in
chapter ...
In a convolutional layer, instead of combining all input nodes for each
output node, the input nodes are interpreted as a tensor to which a
kernel is applied via convolution, resulting in the output. Most often
multiple kernels are used, resulting in multiple output tensors. These
kernels are the variables that can be altered in order to fit the
model to the data. Using multiple kernels, it is possible to extract
different features from the image (e.g.\ edges, as with the Sobel operator). As this
increases the dimensionality even further, which is undesirable because it
increases the number of variables in later layers of the model, a convolutional layer
is often followed by a pooling layer. In a pooling layer the input is
reduced in size by extracting a single value from each
neighborhood \todo{moving...}... . The resulting output size depends on
the offset of the neighborhoods used. A popular choice is max-pooling, where the
largest value in a neighborhood is used, or ...
This construct allows for the extraction of features from the input while
using far fewer variables.
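A small worked example may clarify the pooling step. The input values are
hypothetical, and the neighborhood size of $2 \times 2$ with an offset of $2$
(so that the neighborhoods do not overlap) is an assumption made for
illustration. Taking the largest value in each neighborhood reduces a
$4 \times 4$ input to a $2 \times 2$ output:
\[
  \left[
    \begin{matrix}
      1 & 3 & 2 & 0 \\
      4 & 6 & 1 & 1 \\
      0 & 2 & 5 & 7 \\
      1 & 2 & 3 & 4
    \end{matrix}
  \right]
  \; \longmapsto \;
  \left[
    \begin{matrix}
      \max\{1,3,4,6\} & \max\{2,0,1,1\} \\
      \max\{0,2,1,2\} & \max\{5,7,3,4\}
    \end{matrix}
  \right]
  =
  \left[
    \begin{matrix}
      6 & 2 \\
      2 & 7
    \end{matrix}
  \right].
\]
With an offset of $1$ instead, the neighborhoods would overlap and the
output would be of size $3 \times 3$.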
... \todo{Example with a small image, ideally the one from above}
\subsubsection{Parallels to the Visual Cortex in Mammals}
The choice of convolution for image classification tasks is not
arbitrary. ... eye ...
\subsection{Limitations of the Gradient Descent Algorithm}
\begin{itemize}
  \item Hyperparameter guesswork
  \item Problems navigating valleys $\to$ momentum
  \item Different scales of gradients for variables in different layers $\to$ ADADELTA
\end{itemize}
\subsection{Stochastic Training Algorithms}
For many applications in which neural networks are used, such as
image classification or segmentation, large training data sets become
essential to capture the nuances of the
data. However, as training sets get larger, the memory requirement
during training grows with them.
In order to update the weights with the gradient descent algorithm,
derivatives of the network with respect to each
variable need to be calculated for all data points in order to get the
full gradient of the error of the network.
Thus the amount of memory and computing power available limits the
size of the training data that can be efficiently used in fitting the
network. A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
descent algorithms. Here the premise is that, instead of using the whole
dataset, a (different) subset of the data is chosen to
compute the gradient in each iteration.
The number of iterations until each data point has been considered in
updating the parameters is commonly called an ``epoch''.
Using subsets reduces the amount of memory and computing power required for
each iteration, which allows for the use of very large training
sets. Additionally, the noise introduced on the gradient can improve
the accuracy of the fit, as stochastic gradient descent algorithms are
less likely to get stuck in local extrema.
Another benefit of using subsets, even if enough memory is available to
use the whole dataset, is that depending on the size of the subsets the
gradient can be calculated far more quickly, which allows more steps to be made
in the same time. If the approximated gradient is close enough to the
``real'' one, this can drastically cut down the time required for
training the model.
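The idea can be written down compactly as follows. The notation is an
assumption made for this sketch and may differ from the one used in the
earlier chapters: $\theta$ denotes the variables of the network,
$L_j(\theta)$ the error on the $j$-th of $N$ data points, $\gamma > 0$ the
step size and $B_t \subseteq \{1, \dots, N\}$ the subset chosen in iteration
$t$. Gradient descent uses the full gradient, whereas the stochastic variant
replaces it by the average over the subset:
\begin{align*}
  \text{gradient descent:} \quad
  \theta_{t+1} &= \theta_t - \gamma \, \frac{1}{N} \sum_{j=1}^{N} \nabla L_j(\theta_t), \\
  \text{stochastic gradient descent:} \quad
  \theta_{t+1} &= \theta_t - \gamma \, \frac{1}{|B_t|} \sum_{j \in B_t} \nabla L_j(\theta_t).
\end{align*}
If, for example, $N = 60\,000$ and the data is split into disjoint subsets
of size $|B_t| = 100$, one epoch consists of $60\,000 / 100 = 600$
iterations, each of which only requires the gradients for $100$ data points.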
\begin{itemize}
  \item ADAM
  \item momentum
  \item ADADELTA \textcite{ADADELTA}
\end{itemize}
% \subsubsubsection{Stochastic Gradient Descent}
\subsection{Combating Overfitting}
% As in many machine learning applications if the model is overfit in
% the data it can drastically reduce the generalization of the model. In
% many machine learning approaches noise introduced in the learning
% algorithm in order to reduce overfitting. This results in a higher
% bias of the model but the trade off of lower variance of the model is
% beneficial in many cases. For example the regression tree model
% ... benefits greatly from restricting the training algorithm on
% randomly selected features in every iteration and then averaging many
% such trained trees inserted of just using a single one. \todo{noch
% nicht sicher ob ich das nehmen will} For neural networks similar
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit{dropout} which has been first introduced in
% \cite{Dropout}
Similarly to shallow networks, overfitting can still impact the quality of
convolutional neural networks. A popular way to combat this problem is
to introduce noise into the training of the model. This is a
successful strategy for other models as well; for example, an ensemble of
decision trees grown on bootstrapped training samples benefits greatly
from randomizing the features available for use in each training
iteration (Hastie, bachelor thesis??). Here, noise is introduced into
the model by deactivating certain nodes (setting the output of the
node to 0) in the fully connected layers of the convolutional neural
network. The nodes are chosen at random and change in every
iteration. This practice is called dropout and was introduced by \textcite{Dropout}.
\todo{Comparison of different dropout sizes on MNIST or similar}
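Written as a formula, the deactivation of nodes can be sketched as follows.
The notation is an assumption for illustration: $h \in \mathbb{R}^n$ is the
output of a fully connected layer, $p \in [0,1)$ the probability with which
a node is deactivated, and $\odot$ the componentwise product. In each
training iteration a new mask $m$ is drawn and applied:
\[
  m_k \sim \operatorname{Bernoulli}(1 - p) \text{ independently for } k = 1, \dots, n,
  \qquad \tilde{h} = m \odot h.
\]
For example, for $h = (0.5, 1.2, 0.3, 0.8)$ and a drawn mask
$m = (1, 0, 1, 1)$ the layer passes on $\tilde{h} = (0.5, 0, 0.3, 0.8)$.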
