@ -1,4 +1,4 @@
\section { Application of NN to higher complexity Problems }
\section { \titlecap { application of neural networks to higher complexity problems} }
This section is based on \textcite [Chapter~9] { Goodfellow}
This section is based on \textcite [Chapter~9] { Goodfellow}
@ -155,36 +155,40 @@ in Figure~\ref{fig:img_conv}.
\begin { figure} [h]
\begin { figure} [h]
\centering
\centering
\begin { subfigure} { 0.3 \textwidth }
\begin { subfigure} { 0.27 \textwidth }
\centering
\centering
\includegraphics [width=\textwidth] { Figures/Data/klammern.jpg}
\includegraphics [width=\textwidth] { Figures/Data/klammern.jpg}
\caption { Original Picture}
\caption { \small Original Picture\\ ~}
\label { subf:OrigPicGS}
\label { subf:OrigPicGS}
\end { subfigure}
\end { subfigure}
\begin { subfigure} { 0.3\textwidth }
\hspace { 0.02\textwidth }
\begin { subfigure} { 0.27\textwidth }
\centering
\centering
\includegraphics [width=\textwidth] { Figures/Data/image_ conv9.png}
\includegraphics [width=\textwidth] { Figures/Data/image_ conv9.png}
\caption { \ hspace{ -2pt} Gaussian Blur $ \sigma ^ 2 = 1 $ }
\caption{\small Gaussian Blur $\sigma^2 = 1$}
\end { subfigure}
\end { subfigure}
\begin { subfigure} { 0.3\textwidth }
\hspace { 0.02\textwidth }
\begin { subfigure} { 0.27\textwidth }
\centering
\centering
\includegraphics [width=\textwidth] { Figures/Data/image_ conv10.png}
\includegraphics [width=\textwidth] { Figures/Data/image_ conv10.png}
\caption { Gaussian Blur $ \sigma ^ 2 = 4 $ }
\caption { \small Gaussian Blur $ \sigma ^ 2 = 4 $ }
\end { subfigure} \\
\end { subfigure} \\
\begin { subfigure} { 0.3 \textwidth }
\begin { subfigure} { 0.27 \textwidth }
\centering
\centering
\includegraphics [width=\textwidth] { Figures/Data/image_ conv4.png}
\includegraphics [width=\textwidth] { Figures/Data/image_ conv4.png}
\caption { Sobel Operator $ x $ -direction}
\caption { \small Sobel Operator $ x $ -direction}
\end { subfigure}
\end { subfigure}
\begin { subfigure} { 0.3\textwidth }
\hspace { 0.02\textwidth }
\begin { subfigure} { 0.27\textwidth }
\centering
\centering
\includegraphics [width=\textwidth] { Figures/Data/image_ conv5.png}
\includegraphics [width=\textwidth] { Figures/Data/image_ conv5.png}
\caption { Sobel Operator $ y $ -direction}
\caption { \small Sobel Operator $ y $ -direction}
\end { subfigure}
\end { subfigure}
\begin { subfigure} { 0.3\textwidth }
\hspace { 0.02\textwidth }
\begin { subfigure} { 0.27\textwidth }
\centering
\centering
\includegraphics [width=\textwidth] { Figures/Data/image_ conv6.png}
\includegraphics [width=\textwidth] { Figures/Data/image_ conv6.png}
\caption { Sobel Operator combined}
\caption { \small Sobel Operator combined}
\end { subfigure}
\end { subfigure}
% \begin { subfigure} { 0.24\textwidth }
% \begin { subfigure} { 0.24\textwidth }
% \centering
% \centering
@ -199,7 +203,7 @@ in Figure~\ref{fig:img_conv}.
\end { figure}
\end { figure}
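
The filters shown in Figure~\ref{fig:img_conv} can be reproduced with a few lines of Python. The following sketch uses \texttt{scipy} purely as an illustrative choice (the tooling used to generate the figure is not specified here) and a random array as a stand-in for the grayscale input image.
\begin{lstlisting}[language=Python]
import numpy as np
from scipy import ndimage

# stand-in for the grayscale input image of Figure fig:img_conv
img = np.random.rand(256, 256)

# Gaussian blur with variance sigma^2 = 1 and sigma^2 = 4
blur_1 = ndimage.gaussian_filter(img, sigma=1)
blur_4 = ndimage.gaussian_filter(img, sigma=2)

# Sobel filters in x- and y-direction and their combination
# as the gradient magnitude
sobel_x = ndimage.convolve(img, np.array([[-1, 0, 1],
                                          [-2, 0, 2],
                                          [-1, 0, 1]]))
sobel_y = ndimage.convolve(img, np.array([[-1, -2, -1],
                                          [ 0,  0,  0],
                                          [ 1,  2,  1]]))
sobel_xy = np.sqrt(sobel_x**2 + sobel_y**2)
\end{lstlisting}
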
\clearpage
\clearpage
\newpage
\newpage
\subsection { Convolutional NN}
\subsection { Convolutional Neural Networks }
\todo{Introduction to CNNs; amount of parameters}
% Conventional neural network as described in chapter .. are made up of
% Conventional neural network as described in chapter .. are made up of
% fully connected layers, meaning each node in a layer is influenced by
% fully connected layers, meaning each node in a layer is influenced by
@ -239,10 +243,10 @@ The usage of multiple filters results in multiple outputs of the same
size as the input (or slightly smaller if no padding is used). These
size as the input (or slightly smaller if no padding is used). These
are often called channels.
are often called channels.
For convolutional layers that are preceded by convolutional layers the
For convolutional layers that are preceded by convolutional layers the
size of the filter is often chosen to coincide with the amount of channels
sizes of the filters are often chosen to coincide with the number of channels
of the output of the previous layer and not padded in this
of the output of the previous layer and not padded in this
direction.
direction.
This results in the channels ``being squashed'' and prevents gaining
This results in these channels ``being squashed'' and prevents gaining
additional
additional
dimensions\todo{explain the filters spanning the full depth better} in the output.
This can also be used to flatten certain less interesting channels of
This can also be used to flatten certain less interesting channels of
@ -252,14 +256,15 @@ the input as for example color channels.
A way to additionally reduce the size using convolution is not applying the
convolution on every pixel, but rather specifying a certain ``stride''
convolution on every pixel, but rather specifying a certain ``stride''
$ s $ at which the filter $ g $ is moved over the input $ I $ ,
$ s $ for each direction at which the filter $ g $ is moved over the input $ I $ ,
\[
\[
O_ { x,y,c} = \sum _ { i,j,l \in \mathbb { Z} } I_ { x-i,y-j,c-l} g_ { i,j,l} .
O_{x,\dots,c} = \sum_{i,\dots,l \in \mathbb{Z}} I_{(x \cdot s_x)-i,\dots,(c \cdot s_c)-l} \cdot g_{i,\dots,l}.
\]
\]
The size and stride for all filters in a layer should be the same in
The sizes and stride should be the same for all filters in a layer in
order to get a uniform tensor as output.
order to get a uniform tensor as output.
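
A minimal NumPy sketch of such a strided convolution with filters spanning the full channel depth is given below; the function name and the shapes are chosen only for illustration. As in most deep learning frameworks the filter is not flipped, so strictly speaking a cross-correlation is computed.
\begin{lstlisting}[language=Python]
import numpy as np

def conv2d(inp, filters, stride=1):
    # inp has shape (height, width, channels_in); filters has shape
    # (k, k, channels_in, channels_out) and spans the full channel depth,
    # so the channel dimension of the input is "squashed" in the output.
    h, w, _ = inp.shape
    k, _, _, c_out = filters.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((out_h, out_w, c_out))
    for x in range(out_h):
        for y in range(out_w):
            patch = inp[x*stride:x*stride+k, y*stride:y*stride+k, :]
            out[x, y, :] = np.tensordot(patch, filters, axes=3)
    return out

image = np.random.rand(28, 28, 1)      # a single grayscale input image
kernels = np.random.rand(5, 5, 1, 32)  # 32 filters of size 5
print(conv2d(image, kernels).shape)    # (24, 24, 32)
\end{lstlisting}
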
T % he size of the filters and the way they are applied can be tuned
% The size of the filters and the way they are applied can be tuned
% while building the model should be the same for all filters in one
% while building the model should be the same for all filters in one
% layer in order for the output being of consistent size in all channels.
% layer in order for the output being of consistent size in all channels.
% It is common to reduce the d< by not applying the
% It is common to reduce the d< by not applying the
@ -288,14 +293,13 @@ T% he size of the filters and the way they are applied can be tuned
% model to the data. Using multiple kernels it is possible to extract
% model to the data. Using multiple kernels it is possible to extract
% different features from the image (e.g. edges -> sobel).
% different features from the image (e.g. edges -> sobel).
In order to further reduce the size towards the final layer, convolutional
As a means to further reduce the size towards the final layer, convolutional
layers are often followed by a pooling layer.
layers are often followed by a pooling layer.
In a pooling layer the input is
In a pooling layer the input is
reduced in size by extracting a single value from a
reduced in size by extracting a single value from a
neighborhood of pixels, often by taking the maximum value in the
neighborhood of pixels, often by taking the maximum value in the
neighborhood (max-pooling). The resulting output size is dependent on
neighborhood (max-pooling). The resulting output size is dependent on
the offset of the neighborhoods used, this offset is commonly called
the offset (stride) of the neighborhoods used.
``stride''\todo { zwei mal stride} .
The combination of convolution and pooling layers allows for
The combination of convolution and pooling layers allows for
extraction of features from the input in the form of feature maps while
using relatively few parameters that need to be trained.
using relatively few parameters that need to be trained.
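
A max-pooling layer can be sketched in NumPy as follows; names and shapes are again purely illustrative.
\begin{lstlisting}[language=Python]
import numpy as np

def max_pool(inp, size=2, stride=2):
    # inp has shape (height, width, channels); every channel is pooled
    # separately by taking the maximum over each neighborhood.
    h, w, c = inp.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w, c))
    for x in range(out_h):
        for y in range(out_w):
            patch = inp[x*stride:x*stride+size, y*stride:y*stride+size, :]
            out[x, y, :] = patch.max(axis=(0, 1))
    return out

feature_maps = np.random.rand(24, 24, 32)
print(max_pool(feature_maps).shape)  # (12, 12, 32)
\end{lstlisting}
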
@ -306,6 +310,27 @@ by two fully connected layers.
\begin { figure} [h]
\begin { figure} [h]
\centering
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/mnist0bw.pdf}
\caption { input}
\end { subfigure}
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/conv2d_ 6.pdf}
\caption { convolution}
\end { subfigure}
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/max_ pooling2d_ 6.pdf}
\caption { max-pool}
\end { subfigure}
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/conv2d_ 7.pdf}
\caption { convolution}
\end { subfigure}
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/max_ pooling2d_ 7.pdf}
\caption { max-pool}
\end { subfigure}
\centering
\centering
\begin { subfigure} { 0.19\textwidth }
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/mnist0bw.pdf}
\includegraphics [width=\textwidth] { Figures/Data/mnist0bw.pdf}
@ -333,10 +358,10 @@ by two fully connected layers.
\label { fig:feature_ map}
\label { fig:feature_ map}
\end { figure}
\end { figure}
\subsubsection { Parallels to the Visual Cortex in Mammals}
% \subsubsection { Parallels to the Visual Cortex in Mammals}
The choice of convolution for image classification tasks is not
% The choice of convolution for image classification tasks is not
arbitrary. ... auge... bla bla
% arbitrary. ... auge... bla bla
% \subsection { Limitations of the Gradient Descent Algorithm}
% \subsection { Limitations of the Gradient Descent Algorithm}
@ -345,7 +370,7 @@ arbitrary. ... auge... bla bla
% -Problems navigating valleys -> momentum
% -Problems navigating valleys -> momentum
% -Different scale of gradients for vars in different layers -> ADAdelta
% -Different scale of gradients for vars in different layers -> ADAdelta
\subsection { Stochastic Training Algorithms }
\subsection { \titlecap { stochastic training algorithms} }
For many applications in which neural networks are used such as
For many applications in which neural networks are used such as
image classification or segmentation, large training data sets become
image classification or segmentation, large training data sets become
necessary to capture the nuances of the
@ -356,15 +381,18 @@ derivatives of the network with respect for each
variable need to be computed for all data points.
variable need to be computed for all data points.
Thus the amount of memory and computing power available limits the
Thus the amount of memory and computing power available limits the
size of the training data that can be efficiently used in fitting the
size of the training data that can be efficiently used in fitting the
network. A class of algorithms that augment the gradient descent
network.
A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
algorithm in order to lessen this problem are stochastic gradient
descent algorithms.
descent algorithms.
Here the full dataset is split into smaller disjoint subsets.
Here the full dataset is split into smaller disjoint subsets.
Then in each iteration a (different) subset of data is chosen to
Then in each iteration a (different) subset of data is chosen to
compute the gradient (Algorithm~\ref { alg:sd g} ).
compute the gradient (Algorithm~\ref { alg:sgd } ).
The training period until each data point has been considered at least
The training period until each data point has been considered at least
once in
once in
updating the parameters is commonly called an ``epoch''.
updating the parameters is commonly called an ``epoch''.
Using subsets reduces the amount of memory required for storing the
Using subsets reduces the amount of memory required for storing the
necessary values for each update, thus making it possible to use very
necessary values for each update, thus making it possible to use very
large training sets to fit the model.
large training sets to fit the model.
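
A sketch of the resulting training loop is given below. The gradient function passed in is a hypothetical stand-in for the backpropagation step of Algorithm~\ref{alg:sgd} and not part of the splitting scheme itself.
\begin{lstlisting}[language=Python]
import numpy as np

def sgd(params, grad_fn, x_train, y_train,
        batch_size=32, epochs=10, gamma=0.01):
    # grad_fn(params, x_batch, y_batch) is assumed to return the gradient
    # of the loss on the given mini-batch with respect to params.
    n = x_train.shape[0]
    for _ in range(epochs):                 # one epoch: every sample seen once
        order = np.random.permutation(n)    # disjoint subsets of the data
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grad = grad_fn(params, x_train[batch], y_train[batch])
            params = params - gamma * grad  # update with the mini-batch gradient
    return params

# toy usage: least squares gradient for a linear model
x, y = np.random.rand(1000, 5), np.random.rand(1000)
grad = lambda w, xb, yb: 2 * xb.T @ (xb @ w - yb) / len(yb)
w = sgd(np.zeros(5), grad, x, y)
\end{lstlisting}
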
@ -407,7 +435,7 @@ In order to illustrate this behavior we modeled a convolutional neural
network to classify handwritten digits. The data set used for this is the
network to classify handwritten digits. The data set used for this is the
MNIST database of handwritten digits (\textcite { MNIST} ,
MNIST database of handwritten digits (\textcite { MNIST} ,
Figure~\ref { fig:MNIST} ).
Figure~\ref { fig:MNIST} ).
\input { Figures/mnist.tex}
The network used consists of two convolution and max pooling layers
The network used consists of two convolution and max pooling layers
followed by one fully connected hidden layer and the output layer.
followed by one fully connected hidden layer and the output layer.
Both convolutional layers utilize square filters of size five which are
@ -415,25 +443,15 @@ applied with a stride of one.
The first layer consists of 32 filters and the second of 64. Both
The first layer consists of 32 filters and the second of 64. Both
pooling layers pool a $ 2 \times 2 $ area. The fully connected layer
pooling layers pool a $ 2 \times 2 $ area. The fully connected layer
consists of 256 nodes and the output layer of 10, one for each digit.
consists of 256 nodes and the output layer of 10, one for each digit.
All layers use RE LU as activation function, except the output layer
All layers use the ReLU activation function, except the output layer
with the output layer which uses softmax (\ref { def:softmax} ).
which uses softmax (\ref { eq:softmax} ).
As loss function categorical crossentropy is used (\ref { eq:cross_ entropy} ).
As loss function, categorical cross entropy (\ref{eq:cross_entropy}) is used.
The architecture of the convolutional neural network is summarized in
The architecture of the convolutional neural network is summarized in
Figure~\ref { fig:mnist_ architecture} .
Figure~\ref { fig:mnist_ architecture} .
\begin { figure}
\includegraphics [width=\textwidth] { Figures/Data/convnet_ fig.pdf}
\caption { Convolutional neural network architecture used to model the
MNIST handwritten digits dataset. This figure was created using the
draw\textunderscore convnet Python script by \textcite { draw_ convnet} .}
\label { fig:mnist_ architecture}
\end { figure}
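
Assuming a TensorFlow/Keras implementation, which is not fixed by the description above, the architecture could be written roughly as follows; the padding of the convolutional layers is an assumption.
\begin{lstlisting}[language=Python]
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, 5, strides=1, padding="same", activation="relu",
                  input_shape=(28, 28, 1)),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, 5, strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),    # fully connected hidden layer
    layers.Dense(10, activation="softmax"),  # one output node per digit
])
model.compile(optimizer="adam",              # or "sgd" for the comparison below
              loss="categorical_crossentropy",
              metrics=["accuracy"])
\end{lstlisting}
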
The results of the network being trained with gradient descent and
The results of the network being trained with gradient descent and
stochastic gradient descent for 20 epochs are given in Figure~\ref { fig:sgd_ vs_ gd}
stochastic gradient descent for 20 epochs are given in Figure~\ref { fig:sgd_ vs_ gd}
and Table~\ref { table:sgd_ vs_ gd}
and Table~\ref { table:sgd_ vs_ gd} .
Here it can be seen that the network trained with stochastic gradient
descent is more accurate after the first epoch than the ones trained
descent is more accurate after the first epoch than the ones trained
with gradient descent after 20 epochs.
with gradient descent after 20 epochs.
@ -445,58 +463,75 @@ gradient calculated on the subset it performs far better than the
network using true gradients when training for the same amount of time.
\todo{compare the training times}
\input { Figures/mnist.tex}
\begin { figure}
\includegraphics [width=\textwidth] { Figures/Data/convnet_ fig.pdf}
\caption { Convolutional neural network architecture used to model the
MNIST handwritten digits dataset. This figure was created using the
draw\textunderscore convnet Python script by \textcite { draw_ convnet} .}
\label { fig:mnist_ architecture}
\end { figure}
\input { Figures/SGD_ vs_ GD.tex}
\input { Figures/SGD_ vs_ GD.tex}
\clearpage
\clearpage
\subsection { \titlecap { modified stochastic gradient descent} }
\subsection { \titlecap { modified stochastic gradient descent} }
This section is based on \textcite { ruder} .
This section is based on \textcite { ruder} , \textcite { ADAGRAD} ,
\textcite { ADADELTA} and \textcite { ADAM} .
An inherent problem of the stochastic gradient descent algorithm is
its sensitivity to the learning rate $ \gamma $ . This results in the
While stochastic gradient descent can work quite well in fitting
problem of having to find a appropriate learning rate for each problem
models, its sensitivity to the learning rate $\gamma$ is an inherent
which is largely guesswork, the impact of choosing a bad learning rate
problem.
This results in having to find an appropriate learning rate for each problem
which is largely guesswork. The impact of choosing a bad learning rate
can be seen in Figure~\ref { fig:sgd_ vs_ gd} .
can be seen in Figure~\ref { fig:sgd_ vs_ gd} .
% There is a inherent problem in the sensitivity of the gradient descent
% There is a inherent problem in the sensitivity of the gradient descent
% algorithm regarding the learning rate $ \gamma $ .
% algorithm regarding the learning rate $ \gamma $ .
% The difficulty of choosing the learning rate can be seen
% The difficulty of choosing the learning rate can be seen
% in Figure~\ref { sgd_ vs_ gd} .
% in Figure~\ref { sgd_ vs_ gd} .
For small rates the progress in each iteration is small
For small rates the progress in each iteration is small
but as the rate is enlarged the algorithm can become unstable and the parameters
but for learning rates too large the algorithm can become unstable, with
diverge to infinity. Even for learning rates small enough to ensure the parameters
updates being larger than the parameters themselves, which can result
in the parameters diverging to infinity.
Even for learning rates small enough to ensure the parameters
do not diverge to infinity, steep valleys in the function to be
do not diverge to infinity, steep valleys in the function to be
minimized can hinder the progress of
minimized can hinder the progress of
the algorithm as for leaning rates not small enough gradient descent
the algorithm.
``bounces between'' the walls of the valley rather then following a
If the bottom of the valley slowly slopes towards the minimum
downward trend in the valley.
the steep nature of the valley can result in the
algorithm ``bouncing between'' the walls of the valley rather than
following the downwards trend.
% \[
A possible way to combat this is to alter the learning
% w - \gamma \nabla _ w ...
% \]
% thus the weights grow to infinity.
\todo { unstable learning rate besser
erklären}
To combat this problem \todo { quelle} propose to alter the learning
rate over the course of training, often called learning rate
scheduling in order to decrease the learning rate over the course of
scheduling.
training. The most popular implementations of this are time based
The most popular implementations of this are time based
decay
decay
\[
\[
\gamma _ { n+1} = \frac { \gamma _ n} { 1 + d n} ,
\gamma _ { n+1} = \frac { \gamma _ n} { 1 + d n} ,
\]
\]
where $ d $ is the decay parameter and $ n $ is the number of epochs,
where $ d $ is the decay parameter and $ n $ is the number of epochs.
s tep based decay where the learning rate is fixed for a span of $ r $
Another is step based decay, where the learning rate is fixed for a span of $r$
epochs and then decreased according to parameter $ d $
epochs and then decreased according to parameter $ d $
\[
\[
\gamma _ n = \gamma _ 0 d^ { \text { floor} { \frac { n+1} { r} } }
\gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor}.
\]
\]
a nd exponential decay where the learning rate is decreased after each epoch
A third option is exponential decay, where the learning rate is decreased after each epoch
\[
\[
\gamma_n = \gamma_0 e^{-n d}.
\]
\] \todo{split this sentence up}
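
The three decay schemes can be written down directly as functions of the epoch; the following sketch uses illustrative default parameters only.
\begin{lstlisting}[language=Python]
import numpy as np

def time_based(gamma_0, d, epochs):
    # gamma_(n+1) = gamma_n / (1 + d * n)
    rates, gamma = [], gamma_0
    for n in range(epochs):
        rates.append(gamma)
        gamma = gamma / (1 + d * n)
    return rates

def step_based(gamma_0, d, r, epochs):
    # gamma_n = gamma_0 * d ** floor((n + 1) / r)
    return [gamma_0 * d ** np.floor((n + 1) / r) for n in range(epochs)]

def exponential(gamma_0, d, epochs):
    # gamma_n = gamma_0 * exp(-n * d)
    return [gamma_0 * np.exp(-n * d) for n in range(epochs)]
\end{lstlisting}
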
These methods are able to increase the accuracy of a model by large
These methods are able to increase the accuracy of models by large
margins as seen in the training of RESnet by \textcite { resnet} .
margins, as seen in the training of ResNet by \textcite{resnet}, cf. Figure~\ref{fig:resnet}.
\todo { vielleicht grafik
\begin { figure} [h]
einbauen}
\centering
\includegraphics [width=\textwidth] { Figures/Data/7780459-fig-4-source-hires.png}
\caption [Learning Rate Decay] { Error history of convolutional neural
network trained with learning rate decay. \textcite [Figure
4]{ resnet} }
\label { fig:resnet}
\end { figure}
However stochastic gradient descent with learning rate decay is
still highly sensitive to the choice of the hyperparameters $ \gamma _ 0 $
still highly sensitive to the choice of the hyperparameters $ \gamma _ 0 $
and $ d $ .
and $ d $ .
@ -504,25 +539,29 @@ In order to mitigate this problem a number of algorithms have been
developed to regularize the learning rate with as minimal
developed to regularize the learning rate with as minimal
hyperparameter guesswork as possible.
hyperparameter guesswork as possible.
We will examine and compare a ... algorithms that use a adaptive
In the following we will compare three algorithms that use an adaptive
learning rate.
learning rate, meaning they scale the updates according to past iterations.
They all scale the gradient for the update depending of past gradients
% We will examine and compare a four algorithms that use a adaptive
for each weight individually.
% learning rate.
% They all scale the gradient for the update depending of past gradients
% for each weight individually.
The algorithms build upon each other, with the adaptive gradient
algorithm (\textsc { AdaGrad} , \textcite { ADAGRAD} )
algorithm (\textsc { AdaGrad} , \textcite { ADAGRAD} )
laying the groundwork. Here for each parameter update the learning rate
is given my a constant
is given by a constant global rate
$ \gamma $ is divided by the sum of the squares of the past partial
$\gamma$ divided by the root of the sum of the squares of the past partial
derivatives in this parameter. This results in a monotonically decaying
learning rate with faster
learning rate with faster
decay for parameters with large updates, whereas
parameters with small updates experience smaller decay. The \textsc { AdaGrad}
parameters with small updates experience smaller decay.
The \textsc { AdaGrad}
algorithm is given in Algorithm~\ref { alg:ADAGRAD} . Note that while
algorithm is given in Algorithm~\ref { alg:ADAGRAD} . Note that while
this algorithm is still based upon the idea of gradient descent it no
this algorithm is still based upon the idea of gradient descent it no
longer takes steps in the direction of the gradient while
longer takes steps in the direction of the gradient while
updating. Due to the individual learning rates for each parameter only
updating. Due to the individual learning rates for each parameter only
the direction/sign for single parameters remain the same.
the direction/sign for each single parameter remains the same compared to
gradient descent.
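
A single \textsc{AdaGrad} step can be sketched as follows, with all operations applied elementwise to the parameter vector; the small constant in the denominator, which avoids division by zero, is an implementation detail not mentioned in the description above.
\begin{lstlisting}[language=Python]
import numpy as np

def adagrad_update(params, grad, accum, gamma=0.01, eps=1e-8):
    # accum holds the sum of squared past partial derivatives per parameter
    accum = accum + grad**2
    params = params - gamma / (np.sqrt(accum) + eps) * grad
    return params, accum
\end{lstlisting}
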
\begin { algorithm} [H]
\begin { algorithm} [H]
\SetAlgoLined
\SetAlgoLined
@ -589,7 +628,7 @@ As the root mean square of the past gradients is already used in the
denominator of the learning rate, an exponentially decaying root mean
square of the past updates is used to obtain a $ \Delta x $ quantity for
square of the past updates is used to obtain a $ \Delta x $ quantity for
the numerator, resulting in the correct unit of the update. The full
algorithm is given by Algorithm~\ref { alg:adadelta} .
algorithm is given in Algorithm~\ref { alg:adadelta} .
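
A sketch of the corresponding update is given below; the decaying averages of the squared gradients and of the squared updates are carried along as state between iterations.
\begin{lstlisting}[language=Python]
import numpy as np

def adadelta_update(params, grad, e_g2, e_dx2, rho=0.95, eps=1e-6):
    e_g2 = rho * e_g2 + (1 - rho) * grad**2                  # RMS of gradients
    dx = -np.sqrt(e_dx2 + eps) / np.sqrt(e_g2 + eps) * grad  # has the unit of x
    e_dx2 = rho * e_dx2 + (1 - rho) * dx**2                  # RMS of updates
    return params + dx, e_g2, e_dx2
\end{lstlisting}
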
\begin { algorithm} [H]
\begin { algorithm} [H]
\SetAlgoLined
\SetAlgoLined
@ -613,13 +652,13 @@ algorithm is given by Algorithm~\ref{alg:adadelta}.
While the stochastic gradient algorithm is less susceptible to getting
While the stochastic gradient algorithm is less susceptible to getting
stuck in local
stuck in local
extrema than gradient descent the problem still persists especially
extrema than gradient descent the problem still persists especially
for saddle points with steep .... \textcite { DBLP:journals/corr/Dauphinpgcgb14}
for saddle points ( \textcite { DBLP:journals/corr/Dauphinpgcgb14} ).
An approach to the problem of ``getting stuck'' in saddle points or
local minima/maxima is the addition of momentum to SGD. Instead of
using the actual gradient for the parameter update an average over the
using the actual gradient for the parameter update an average over the
past gradients is used. In order to avoid the need to SAVE the past
past gradients is used. In order to avoid the need to hold the past
values usually a exponentially decaying average is used resulting in
values in memory, usually an exponentially decaying average is used, resulting in
Algorithm~\ref{alg:sgd_m}. This is comparable to following the path
of a marble with mass rolling down the slope of the error
of a marble with mass rolling down the slope of the error
function. The decay rate for the average is comparable to the inertia
function. The decay rate for the average is comparable to the inertia
@ -653,13 +692,15 @@ In an effort to combine the properties of the momentum method and the
automatic adapted learning rate of \textsc { AdaDelta} \textcite { ADAM}
automatic adapted learning rate of \textsc { AdaDelta} \textcite { ADAM}
developed the \textsc { Adam} algorithm, given in
developed the \textsc { Adam} algorithm, given in
Algorithm~\ref { alg:adam} . Here the exponentially decaying
Algorithm~\ref { alg:adam} . Here the exponentially decaying
root mean square of the gradients is still used for realizing and
root mean square of the gradients is still used for regularizing the
learning rate and
combined with the momentum method. Both terms are normalized such that
combined with the momentum method. Both terms are normalized such that
the ... are the first and second moment of the gradient. However the term used in
their means are the first and second moment of the gradient. However the term used in
\textsc { AdaDelta} to ensure correct units is dropped for a scalar
\textsc { AdaDelta} to ensure correct units is dropped for a scalar
global learning rate. This results in .. hyperparameters, however the
global learning rate. This results in four tunable hyperparameters,
however the
algorithm seems to be exceptionally stable with the recommended
parameters of ... and is a very reliable algorithm for training
parameters of $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-7}$ and is a very reliable algorithm for training
neural networks.
neural networks.
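
A single \textsc{Adam} step with these recommended parameters can be sketched as follows; the counter $t$ starts at one and is needed for the bias correction of the two moment estimates.
\begin{lstlisting}[language=Python]
import numpy as np

def adam_update(params, grad, m, v, t,
                alpha=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    m = beta_1 * m + (1 - beta_1) * grad     # decaying mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad**2  # decaying mean of squared gradients
    m_hat = m / (1 - beta_1**t)              # bias corrected first moment
    v_hat = v / (1 - beta_2**t)              # bias corrected second moment
    params = params - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
\end{lstlisting}
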
\begin { algorithm} [H]
\begin { algorithm} [H]
@ -685,8 +726,10 @@ neural networks.
\end { algorithm}
\end { algorithm}
In order to get an understanding of the performance of the above
In order to get an understanding of the performance of the above
discussed training algorithms the neural network given in ... has been
discussed training algorithms, the neural network given in Figure~\ref{fig:mnist_architecture} has been
trained on the ... and the results are given in
trained on the MNIST handwriting dataset with the above described
algorithms.
The performance metrics of the resulting learned functions are given in
Figure~\ref { fig:comp_ alg} .
Figure~\ref { fig:comp_ alg} .
Here it can be seen that the \textsc{Adam} algorithm performs far better than
the other algorithms, with \textsc{AdaGrad} and \textsc{AdaDelta} following.
@ -696,7 +739,7 @@ the other algorithms, with AdaGrad and Adelta following... bla bla
% \subsubsubsection { Stochastic Gradient Descent}
% \subsubsubsection { Stochastic Gradient Descent}
\clearpage
\clearpage
\subsection { Combating Overfitting }
\subsection { \titlecap { combating overfitting} }
% As in many machine learning applications if the model is overfit in
% As in many machine learning applications if the model is overfit in
% the data it can drastically reduce the generalization of the model. In
% the data it can drastically reduce the generalization of the model. In
@ -754,12 +797,12 @@ training as well as testing.
% as well as testing.
% as well as testing.
In order to make this approach feasible
In order to make this approach feasible
\textcite { Dropout1} propose random dropout.
\textcite { Dropout1} propose random dropout.
Instead of training different models for each data point in a batch
Instead of training different models, for each data point in a batch
randomly chosen nodes in the network are disabled (their output is
randomly chosen nodes in the network are disabled (their output is
fixed to zero) and the updates for the weights in the remaining
fixed to zero) and the updates for the weights in the remaining
smaller network are comuted. These the updates computed for each data
smaller network are computed.
point in the batch are then accumulated and applied to the full
After updates have been computed this way for each data point in a batch
network.
the updates are accumulated and applied to the full network.
This can be compared to simultaneously training many small networks
that share the weights of their active neurons.
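
A sketch of such a dropout layer is given below: during training a random binary mask disables each node with probability $p$, while at test time all nodes stay active and the output is scaled so that its expectation matches the training setting.
\begin{lstlisting}[language=Python]
import numpy as np

def dropout(activations, p=0.2, training=True):
    if training:
        # disable each node independently with probability p
        mask = np.random.binomial(1, 1 - p, size=activations.shape)
        return activations * mask
    # "mean network": all nodes active, output scaled by the keep probability
    return activations * (1 - p)
\end{lstlisting}
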
For testing the ``mean network'' with all nodes active but their
For testing the ``mean network'' with all nodes active but their
@ -785,9 +828,12 @@ used. \todo{comparable to averaging dropout networks, beispiel für
% \textcite { Dropout} .
% \textcite { Dropout} .
\subsubsection { \titlecap { manipulation of input data} }
\subsubsection { \titlecap { manipulation of input data} }
Another way to combat overfitting is to keep the network from learning
Another way to combat overfitting and to keep the network from
the dataset by manipulating the inputs randomly for each iteration of
``memorizing''
training. This is commonly used in image based tasks as there are
the training data rather than learning the relation between in- and
output is to randomly alter the training inputs for
each iteration of training.
This is commonly used in image based tasks as there are
often ways to manipulate the input while still being sure the labels
remain the same. For example in an image classification task such as
handwritten digits the associated label should remain correct when the
@ -795,7 +841,8 @@ image is rotated or stretched by a small amount.
When using this one has to be sure that the labels indeed remain the
When using this one has to be sure that the labels indeed remain the
same or else the network will not learn the desired relation between in- and output.
In the case of handwritten digits for example a too high rotation angle
will ... a nine or six.
will make the distinction between a nine and a six hard and will lessen
the quality of the learned function.
The most common transformations are rotation, zoom, shear, brightness
adjustment and mirroring. Examples of these are given in Figure~\ref{fig:datagen}.
@ -827,15 +874,26 @@ mirroring. Examples of this are given in Figure~\ref{fig:datagen}.
\label { fig:datagen}
\label { fig:datagen}
\end { figure}
\end { figure}
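
In Keras such random transformations can for example be generated on the fly during training; the parameter values below are chosen for illustration and are not necessarily the ones used to create Figure~\ref{fig:datagen}.
\begin{lstlisting}[language=Python]
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,          # small rotations keep the labels intact
    zoom_range=0.1,
    shear_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
    brightness_range=(0.9, 1.1),
    # horizontal_flip is left out: mirroring would change digit labels
)
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=125)
\end{lstlisting}
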
\subsubsection { \titlecap { comparisons} }
In order to compare the benefits obtained from implementing these
In order to compare the benefits obtained from implementing these
measures we have trained the network given in ... on the same problem
measures we have trained the network given in
Figure~\ref{fig:mnist_architecture} on the handwriting recognition problem
and implemented different combinations of data generation and dropout. The results
and implemented different combinations of data generation and dropout. The results
are given in Figure~\ref{fig:gen_dropout}. For each scenario the
model was trained five times and the performance measures were
model was trained five times and the performance measures were
averaged. It can be seen that implementing the measures does indeed
averaged.
increase the performance of the model. Implementing data generation on
its own seems to have a larger impact than dropout and applying both
It can be seen that implementing the measures does indeed
increases the accuracy even further.
increase the performance of the model.
Using data generation to alter the training data seems to have a
larger impact than dropout, however utilizing both measures yields the
best results.
\todo{reference the numbers in the table?}
% Implementing data generation on
% its own seems to have a larger impact than dropout and applying both
% increases the accuracy even further.
The better performance stems most likely from reduced overfitting. The
The better performance stems most likely from reduced overfitting. The
reduction in overfitting can be seen in
reduction in overfitting can be seen in
@ -843,29 +901,29 @@ reduction in overfitting can be seen in
accuracy decreases with test accuracy increasing. However utilizing
data generation as well as dropout with a probability of 0.4 seems to
data generation as well as dropout with a probability of 0.4 seems to
be a too aggressive approach as the training accuracy drops below the
be a too aggressive approach as the training accuracy drops below the
test accuracy\todo{add a brief justification}.
\input { Figures/gen_ dropout.tex}
\input { Figures/gen_ dropout.tex}
\todo{Compare different dropout rates on MNIST or similar; use a subset as training set?}
\clearpage
\clearpage
\subsubsection{\titlecap{effectiveness for small training sets}}
For some applications (e.g. medical problems with a small number of patients)
the available data can be highly limited.
the available data can be highly limited.
In these problems the networks are highly ... for overfitting the
In these problems the networks are highly prone to overfit the
data. In order to get an understanding of the achievable accuracies and the
impact of the measures to prevent overfitting discussed above we and train
impact of the methods aimed at mitigating overfitting discussed above, we train
the network on datasets of varying sizes with different measures implemented.
networks with different measures implemented to fit datasets of
varying sizes.
For training we use the MNIST handwriting dataset as well as the fashion
MNIST dataset. The fashion MNIST dataset is a benchmark set built by
\textcite { fashionMNIST} in order to provide a harder set, as state of
\textcite { fashionMNIST} in order to provide a harder set, as state of
the art models are able to achieve accuracies of 99.88\%
(\textcite { 10.1145/3206098.3206111} ) on the handwriting set.
(\textcite { 10.1145/3206098.3206111} ) on the handwriting set.
The dataset contains 70.000 preprocessed images of clothes from
The dataset contains 70,000 preprocessed and labeled images of clothes from
z alando, a overview is given in Figure~\ref { fig:fashionMNIST} .
Zalando, an overview is given in Figure~\ref{fig:fashionMNIST}.
\input { Figures/fashion_ mnist.tex}
\input { Figures/fashion_ mnist.tex}
@ -874,90 +932,91 @@ zalando, a overview is given in Figure~\ref{fig:fashionMNIST}.
\begin { minipage} { \textwidth }
\begin { minipage} { \textwidth }
\small
\small
\begin { tabu} to \textwidth { @{ } l*4{ X[c]} @{ } }
\begin { tabu} to \textwidth { @{ } l*4{ X[c]} @{ } }
\Tstrut \Bstrut & \textsc { Adam} & D. 0.2 & Gen & Gen.+D. 0.2 \\
\Tstrut \Bstrut & \textsc { Adam} & D. 0.2 & Gen & Gen.+D. 0.2 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 1 sample} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 1 sample} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.5633 & 0.5312 & \textbf { 0.6704} & 0.6604 \\
max \Tstrut & 0.5633 & 0.5312 & \textbf { 0.6704} & 0.6604 \\
min & 0.3230 & 0.4224 & 0.4878 & \textbf { 0.5175} \\
min & 0.3230 & 0.4224 & 0.4878 & \textbf { 0.5175} \\
mean & 0.4570 & 0.4714 & 0.5862 & \textbf { 0.6014} \\
mean & 0.4570 & 0.4714 & 0.5862 & \textbf { 0.6014} \\
var \Bstrut & 0.0040 & \textbf { 0.0012} & 0.0036 & 0.002 3 \\
var \Bstrut & 4.021e-3 & \textbf { 1.175e-3} & 3.600e-3 & 2.348e- 3 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 10 samples} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 10 samples} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.8585 & 0.9423 & 0.9310 & \textbf { 0.9441} \\
max \Tstrut & 0.8585 & 0.9423 & 0.9310 & \textbf { 0.9441} \\
min & 0.8148 & \textbf { 0.9081} & 0.9018 & 0.9061 \\
min & 0.8148 & \textbf { 0.9081} & 0.9018 & 0.9061 \\
mean & 0.8377 & \textbf { 0.9270} & 0.9185 & 0.9232 \\
mean & 0.8377 & \textbf { 0.9270} & 0.9185 & 0.9232 \\
var \Bstrut & 2.7e-04 & 1.3e-04 & 6e-05 & 1.5e-04 \\
var \Bstrut & 2.694e-4 & \textbf { 1.278e-4} & 6.419e-5 & 1.504e-4 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 100 samples} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 100 samples} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.9637 & 0.9796 & 0.9810 & \textbf { 0.9811} \\
max \Tstrut & 0.9637 & 0.9796 & 0.9810 & \textbf { 0.9811} \\
min & 0.9506 & 0.9719 & 0.9702 & \textbf { 0.9727} \\
min & 0.9506 & 0.9719 & 0.9702 & \textbf { 0.9727} \\
mean & 0.9582 & 0.9770 & 0.9769 & \textbf { 0.9783} \\
mean & 0.9582 & 0.9770 & 0.9769 & \textbf { 0.9783} \\
var \Bstrut & 2e-05 & 1e-05 & 1e-05 & 1e-05 \\
var \Bstrut & 1.858e-5 & 5.778e-6 & 9.398e-6 & \textbf { 4.333e-6} \\
\hline
\hline
\end { tabu}
\end { tabu}
\normalsize
\normalsize
\captionof { table} { Values of the test accuracy of the model trained
\captionof { table} { Values of the test accuracy of the model trained
10 times
10 times
on random MNIST handwriting training sets containing 1, 10 and 100
on random MNIST handwriting training sets containing 1, 10 and 100
data points per class after 125 epochs. The mean achieved ac curacy
data points per class after 125 epochs. The mean accuracy achieved
for the full set employing both overfitting measures is }
for the full set employing both overfitting measures is }
\label { table:digitsOF}
\label { table:digitsOF}
\small
\small
\centering
\centering
\begin { tabu} to \textwidth { @{ } l*4{ X[c]} @{ } }
\begin { tabu} to \textwidth { @{ } l*4{ X[c]} @{ } }
\Tstrut \Bstrut & \textsc { Adam} & D. 0.2 & Gen & Gen.+D. 0.2 \\
\Tstrut \Bstrut & \textsc { Adam} & D. 0.2 & Gen & Gen.+D. 0.2 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 1 sample} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 1 sample} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.4885 & \textbf { 0.5613} & 0.5488 & 0.5475 \\
max \Tstrut & 0.4885 & \textbf { 0.5513} & 0.5488 & 0.5475 \\
min & 0.3710 & \textbf { 0.3858} & 0.3736 & 0.3816 \\
min & 0.3710 & \textbf { 0.3858} & 0.3736 & 0.3816 \\
mean \Bstrut & 0.4166 & 0.4838 & 0.4769 & \textbf { 0.4957} \\
mean \Bstrut & 0.4166 & 0.4838 & 0.4769 & \textbf { 0.4957} \\
var & \textbf { 0.002} & 0.00294 & 0.00338 & 0.0030 \\
var & \textbf { 1.999e-3} & 2.945e-3 & 3.375e-3 & 2.976e-3 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 10 samples} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 10 samples} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.7370 & 0.7340 & 0.7236 & \textbf { 0.7502} \\
max \Tstrut & 0.7370 & 0.7340 & 0.7236 & \textbf { 0.7502} \\
min & 0.6818 & 0.6673 & 0.6709 & \textbf { 0.6799} \\
min & \textbf { 0.6818} & 0.6673 & 0.6709 & 0.6799 \\
mean & 0.7130 & \textbf { 0.7156} & 0.7031 & 0.7136 \\
mean & 0.7130 & \textbf { 0.7156} & 0.7031 & 0.7136 \\
var \Bstrut & 3.2e-04 & 3.4e-04 & 3.2e-04 & 4.5e-04 \\
var \Bstrut & \textbf { 3.184e-4} & 3.356e-4 & 3.194e-4 & 4.508e-4 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 100 samples} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 100 samples} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.8454 & 0.8385 & 0.8456 & \textbf { 0.8459} \\
max \Tstrut & 0.8454 & 0.8385 & 0.8456 & \textbf { 0.8459} \\
min & 0.8227 & 0.8200 & \textbf { 0.8305} & 0.8274 \\
min & 0.8227 & 0.8200 & \textbf { 0.8305} & 0.8274 \\
mean & 0.8331 & 0.8289 & 0.8391 & \textbf { 0.8409} \\
mean & 0.8331 & 0.8289 & 0.8391 & \textbf { 0.8409} \\
var \Bstrut & 4e-05 & 4e-05 & 2e-05 & 3e-05 \\
var \Bstrut & 3.847e-5 & 4.259e-5 & \textbf { 2.315e-5} & 2.769e-5 \\
\hline
\hline
\end { tabu}
\end { tabu}
\normalsize
\normalsize
\captionof { table} { Values of the test accuracy of the model trained 10 times
\captionof { table} { Values of the test accuracy of the model trained
on random fashion MNIST training sets containing 1, 10 and 100 data points per
10 times
class. The mean achieved accuracy for the full dataset is: ....}
on random fashion MNIST training sets containing 1, 10 and 100
data points per class after 125 epochs. The mean accuracy achieved
for the full set employing both overfitting measures is }
\label { table:fashionOF}
\label { table:fashionOF}
\end { minipage}
\end { minipage} \todo { check values}
\clearpage % if needed/desired
\clearpage
}
}
The random datasets chosen for training are made up of a certain
The models are trained on subsets with a certain number of randomly
number of datapoints for each class, which are chosen at random. The
chosen datapoints per class.
sizes chosen for the comparisons are the full dataset, 100, 10 and 1
The sizes chosen for the comparisons are the full dataset, 100, 10 and 1
data points
data points per class.
per class.
For the task of classifying the fashion data a slightly altered model
For the task of classifying the fashion data a slightly altered model
is used. The convolutional layers with filters of size 5 are replaced
is used. The convolutional layers with filters of size 5 are replaced
by two consecutive convolutional layers with filters of size 3.
by two consecutive convolutional layers with filters of size 3.
This is done in order to have more ... in order to better ... the data
This is done in order to have more layers, which better accommodate
in the model . A diagram of the architecture is given in
the more complex nature of the data. A diagram of the architecture is given in
Figure~\ref { fig:fashion_ MNIST} .
Figure~\ref { fig:fashion_ MNIST} .
\afterpage {
\afterpage {
@ -981,7 +1040,7 @@ Listing~\ref{lst:fashion} for the fashion model.
The models are trained for 125 epochs in order
to have enough random
to have enough random
augmentations of the input images present during training
augmentations of the input images present during training
for the networks to fully profit from the additional training data generated.
The test accuracies of the models after
The test accuracies of the models after
training for 125
training for 125
@ -998,7 +1057,7 @@ fashion application.
\begin { tikzpicture}
\begin { tikzpicture}
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
height = 0.35 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
height = 0.4 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
list/Dark2, every axis plot/.append style={ line width
=1.25pt} ]
=1.25pt} ]
@ -1031,7 +1090,7 @@ fashion application.
\begin { tikzpicture}
\begin { tikzpicture}
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
height = 0.35 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
height = 0.4 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
list/Dark2, every axis plot/.append style={ line width
=1.25pt} ]
=1.25pt} ]
@ -1061,7 +1120,7 @@ fashion application.
\begin { tikzpicture}
\begin { tikzpicture}
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = 0.9875\textwidth ,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = 0.9875\textwidth ,
height = 0.35 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
height = 0.4 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} , ylabel = { Test Accuracy} , cycle
xlabel = { epoch} , ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
list/Dark2, every axis plot/.append style={ line width
=1.25pt} , ymin = { 0.92} ]
=1.25pt} , ymin = { 0.92} ]
@ -1100,7 +1159,7 @@ fashion application.
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style =
/pgf/number format/precision=3} ,tick style =
{ draw = none} , width = \textwidth ,
{ draw = none} , width = \textwidth ,
height = 0.35 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
height = 0.4 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
list/Dark2, every axis plot/.append style={ line width
=1.25pt} ]
=1.25pt} ]
@ -1132,7 +1191,7 @@ fashion application.
\begin { tikzpicture}
\begin { tikzpicture}
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
height = 0.35 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
height = 0.4 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
list/Dark2, every axis plot/.append style={ line width
=1.25pt} , ymin = { 0.62} ]
=1.25pt} , ymin = { 0.62} ]
@ -1162,7 +1221,7 @@ fashion application.
\begin { tikzpicture}
\begin { tikzpicture}
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = 0.9875\textwidth ,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = 0.9875\textwidth ,
height = 0.35 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
height = 0.4 \textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} , ylabel = { Test Accuracy} , cycle
xlabel = { epoch} , ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
list/Dark2, every axis plot/.append style={ line width
=1.25pt} , ymin = { 0.762} ]
=1.25pt} , ymin = { 0.762} ]
@ -1188,45 +1247,129 @@ fashion application.
\caption { 100 samples per class}
\caption { 100 samples per class}
\vspace { .25cm}
\vspace { .25cm}
\end { subfigure}
\end { subfigure}
\caption { Mean test accuracies of the models fitting the sampled MNIST
\caption { Mean test accuracies of the models fitting the sampled fashion MNIST
handwriting datasets over the 125 epochs of training.}
datasets over the 125 epochs of training.}
\label { fig:plotOF_ fashion}
\label { fig:plotOF_ fashion}
\end { figure}
\end { figure}
It can be seen in ... and ... that the usage of .. overfitting
It can be seen in figure ... that for the handwritten digits scenario
measures greatly improves the accuracy for small datasets. However for
using data generation greatly improves the accuracy for the smallest
the smallest size of one datapoint per class generating more data
training set of one sample per class.
... outperforms dropout with only a ... improvment being seen by the
While the addition of dropout only seems to have a small effect on the
implementation of dropout whereas datageneration improves the accuracy
accuracy of the model, the variance gets reduced further than with data
by... . On the other hand the implementation of dropout seems to
generation. This drop in variance translates to the combination of
reduce the variance in the model accuracy, as the variance in accuracy
both measures, resulting in the overall best performing model.
for the dropout model is less than .. while the variance of the
datagen .. model is nearly the same. The model with datageneration
In the scenario with 10 and 100 samples per class the measures improve
... a reduction in variance with the addition of dropout.
the performance as well, however the difference in performance between
overfitting measures is much smaller than in the first scenario
For the slightly larger training sets of ten samples per class the
with the accuracy gain of dropout being similar to data generation.
difference between the two measures seems smaller. Here the
While the observation of the variances persists for the scenario with
improvement in accuracy
100 samples per class it does not for the one with 10 samples per
seen by dropout is slightly larger than the one of
class.
datageneration. However for the larger sized training set the variance
However in all scenarios the addition of the measures reduces the
in test accuracies is lower for the model with datageneration than the
variance of the model.
one with dropout.
The model fit to the fashion MNIST data set benefits less from the
measures.
For the smallest scenario of one sample per class a substantial
increase in accuracy can be observed for the models with the
overfitting measures in place. Contrary to the digits data set dropout improves the
model by a similar margin to data generation.
For the larger data sets however the benefits are far smaller. While
in the scenario with 100 samples per class a performance increase can
be seen with the usage of data generation, it performs worse in the 10
samples per class scenario than the baseline model.
Dropout does seem to have negligible impact on its own in both the 10
and 100 sample scenario. However in all scenarios the addition of
dropout to data generation seems to slightly improve the performance over data generation alone.
Additional Figures and Tables for the same comparisons with different
performance metrics are given in Appendix ...
There it can be seen that while the measures reduce overfitting
effectively for the handwritten digits data set, the neural networks
trained on the fashion data set overfit despite these measures being
in place.
% It can be seen in ... that the usage of .. overfitting
% measures greatly improves the accuracy for small datasets. However for
% the smallest size of one datapoint per class generating more data
% ... outperforms dropout with only a ... improvment being seen by the
% implementation of dropout whereas datageneration improves the accuracy
% by... . On the other hand the implementation of dropout seems to
% reduce the variance in the model accuracy, as the variance in accuracy
% for the dropout model is less than .. while the variance of the
% datagen .. model is nearly the same. The model with datageneration
% ... a reduction in variance with the addition of dropout.
% For the slightly larger training sets of ten samples per class the
% difference between the two measures seems smaller. Here the
% improvement in accuracy
% seen by dropout is slightly larger than the one of
% datageneration. However for the larger sized training set the variance
% in test accuracies is lower for the model with datageneration than the
% one with dropout.
% The results for the training sets with 100 samples per class resemble
% the ones for the sets with 10 per class.
Overall it seems that both measures can increase the performance of
a convolutional neural network, however the success is dependent on the problem.
For the handwritten digits the great result of data generation likely
stems from the nature of the data. As the digits are not rotated the same way or
aligned the same way in all images, using images that are altered in such
a way can help the network learn to recognize digits that are written
at a different slant.
In the fashion data set however the alignment of all images is very
consistent and little to no difference between two data points of the
same class can be modeled by rotations, shifts or shear.
The results for the training sets with 100 samples per class resemble
the ones for the sets with 10 per class.
Overall the models employing both measures to combat overfitting seem to
perform considerably well compared to the ones without. The usage of
these measures has great potential in improving models used for
applications with limited training data. Additional tables and figures
visualizing the effects on the logarithmic cross entropy loss rather than
the accuracy are given in the appendix.\todo{figures for the appendix}
\clearpage
\clearpage
\section { Schluss}
\section { \titlecap { summary and outlook} }
In this thesis we have taken a look at neural networks, their
behavior in small scenarios and their application on image
classification with limited datasets.
We have shown that ridge penalized neural networks converge to
slightly altered cubic smoothing splines, giving us an insight about
the behavior of the learned function of neural networks.
We have seen that choosing the right training algorithm can have a
drastic impact on the efficiency of training and quality of a model
obtainable in a reasonable time frame.
The \textsc { Adam} algorithm has proven itself as best fit for the task
of classifying images. However there is still ongoing research in
improving these algorithms, for example \textcite { rADAM} propose an
alteration to the \textsc { Adam} algorithm in order to make the
adaptive learning rate term more stable in early phases of training.
We have seen that a convolutional network can benefit greatly from
measures combating overfitting, especially if the available training sets are of
a small size. However the success of the measures we have examined
seems to be highly dependent on the problem at hand.
Additionally, there is further research being done on the topic of combating
overfitting.
\textcite{random_erasing} propose randomly erasing parts of the input
images during training and are able to achieve a high accuracy on the fashion MNIST
data set this way (96.35\%).
While the data generation explored in this thesis is only able to
generate new training data in a rudimentary way, there is potential in
using more elaborate methods to enlarge the training set.
\textcite { gan} explore the application of generative adversarial
networks in order to generate additional training data for medical
imaging applications with small data sets.
These networks learn to generate new images resembling the training
data (cf. \textcite{goodfellow_gan}).
Convolutional neural networks are able to achieve remarkable results
and with further improvements will find ever more applications;
they are here to stay.
\begin { itemize}
\begin { itemize}
\item generate more data, GAN etc \textcite { gan}
\item generate more data, GAN etc \textcite { gan}
\item Transfer learning, use network trained on different task and
\item Transfer learning, use network trained on different task and