@ -1,5 +1,7 @@
\section { Application of NN to higher complexity Problems}
\section { Application of NN to higher complexity Problems}
This section is based on \textcite [Chapter~9] { Goodfellow} .
As neural networks are applied to problems of higher complexity, often
As neural networks are applied to problems of higher complexity, often
resulting in higher dimensionality of the input, the number of
resulting in higher dimensionality of the input, the number of
parameters in the network rises drastically.
parameters in the network rises drastically.
@ -7,8 +9,7 @@ For very large inputs such as high resolution image data due to the
fully connected nature of the neural network the amount of parameters
fully connected nature of the neural network the amount of parameters
can ... exceed the amount that is feasible for training and storage.
can ... exceed the amount that is feasible for training and storage.
A way to combat this is by using layers which are only sparsely
A way to combat this is by using layers which are only sparsely
connected and share parameters between nodes. This can be implemented
connected and share parameters between nodes.\todo { transition to conv?}
using convolution.\todo { write a better transition}
\subsection { Convolution}
\subsection { Convolution}
@ -27,13 +28,13 @@ The convolution operation allows plentiful manipulation of data, with
a simple example being smoothing of real-time data. Consider a sensor
a simple example being smoothing of real-time data. Consider a sensor
measuring the location of an object (e.g. via GPS). We expect the
measuring the location of an object (e.g. via GPS). We expect the
output of the sensor to be noisy as a result of a number of factors
output of the sensor to be noisy as a result of a number of factors
that will impact the accuracy. In order to get a better estimate of
that will impact the accuracy of the measurements. In order to get a better estimate of
the actual location we want to smooth
the actual location we want to smooth
the data to reduce the noise. Using convolution for this task, we
the data to reduce the noise. Using convolution for this task, we
can control the significance we want to give each data-point. We
can control the significance we want to give each data-point. We
might want to give a larger weight to more recent measurements than
might want to give a larger weight to more recent measurements than
older ones. If we assume these measurements are taken on a discrete
older ones. If we assume these measurements are taken on a discrete
timescale, we need to introduce discrete convolution first . \\ Let $ f $ ,
timescale, we need to define convolution for discrete functions . \\ Let $ f $ ,
$ g: \mathbb { Z } \to \mathbb { R } $ then
$ g: \mathbb { Z } \to \mathbb { R } $ then
\[
\[
@ -59,7 +60,7 @@ by each pixel being a mixture of base colors. These base colors define
the color-space in which the image is encoded. Often used are
the color-space in which the image is encoded. Often used are
color-spaces RGB (red,
color-spaces RGB (red,
green, blue) or CMYK (cyan, magenta, yellow, black). An example of an
green, blue) or CMYK (cyan, magenta, yellow, black). An example of an
image split into its red, green and blue channels is given in
image decomposed into its red, green and blue channels is given in
Figure~\ref { fig:rgb} . Using this
Figure~\ref { fig:rgb} . Using this
encoding of the image we can define a corresponding discrete function
encoding of the image we can define a corresponding discrete function
describing the image, by mapping the coordinates $ ( x,y ) $ of a pixel
describing the image, by mapping the coordinates $ ( x,y ) $ of a pixel
@ -108,13 +109,14 @@ convolution
(I * g)_ { x,y,c} = \sum _ { i,j,l \in \mathbb { Z} } I_ { x-i,y-j,c-l} g_ { i,j,l} .
(I * g)_ { x,y,c} = \sum _ { i,j,l \in \mathbb { Z} } I_ { x-i,y-j,c-l} g_ { i,j,l} .
\]
\]
As images are finite in size for pixels close enough to the border
As images are finite in size, for pixels too close to the border the
that the filter ... the convolution is not well defined. In such cases
convolution is not well defined.
padding can be used. With padding the image is enlarged beyond .. with
Thus the output will be of reduced size, with the new size in each
0 entries to
dimension $ d $ being \textit { (size of input in dimension $ d $ ) -
ensure the convolution is well defined for all pixels. If no padding
(size of kernel in dimension $ d $ ) +1} .
is used the size of the output is reduced to \textit { size of input -
In order to ensure the output is of the same size as the input the
size of kernel +1} in each dimension.
image can be padded in each dimension with 0 entries which ensures the
convolution is well defined for all pixels of the image.
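As a brief illustration (the sizes are chosen for this example only and
are not taken from the experiments), consider an input of size
$28 \times 28$ and a kernel of size $5 \times 5$. Without padding the
output has size
\[
  (28 - 5 + 1) \times (28 - 5 + 1) = 24 \times 24 .
\]
Padding the input with two rows and columns of 0 entries on each side
enlarges it to $32 \times 32$, which yields an output of size
$(32 - 5 + 1) \times (32 - 5 + 1) = 28 \times 28$, matching the size of
the original input.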
Simple examples for image manipulation using
Simple examples for image manipulation using
convolution are smoothing operations or
convolution are smoothing operations or
@ -147,8 +149,8 @@ output is given by
O = \sqrt { (I * G)^ 2 + (I*G^ T)^ 2}
O = \sqrt { (I * G)^ 2 + (I*G^ T)^ 2}
\]
\]
where $ \sqrt { \cdot } $ and $ \cdot ^ 2 $ are applied component
where $ \sqrt { \cdot } $ and $ \cdot ^ 2 $ are applied component
wise. Examples of convolution with both kernels are given in Figure~\ref { fig:img_ conv} .
wise. Examples for convolution of an image with both kernels are given
\todo { padding}
in Figure~\ref { fig:img_ conv} .
\begin { figure} [h]
\begin { figure} [h]
@ -222,65 +224,114 @@ As seen in the previous section convolution can lend itself to
manipulation of images or other large data which motivates its usage in
manipulation of images or other large data which motivates its usage in
neural networks.
neural networks.
This is achieved by implementing convolutional layers where several
This is achieved by implementing convolutional layers where several
filters are applied to the input. Where the values of the filters are
trainable filters are applied to the input.
trainable parameters of the model.
Each node in such a layer corresponds to a pixel of the output of
Each node in such a layer corresponds to a pixel of the output of
convolution with one of those filters on which a bias and activation
convolution with one of those filters, on which a bias and activation
function are applied.
function are applied.
Depending on the sizes this can drastically reduce the amount of
variables in a layer compared to fully connected ones.
As the variables of the filters are shared among all nodes a
convolutional layer with input of size $ s _ i $ , output size $ s _ o $ and
$ n $ filters of size $ f $ will contain $ n f + s _ o $ parameters whereas a
fully connected layer has $ ( s _ i + 1 ) s _ o $ trainable weights.
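To give a sense of scale, consider a hypothetical layer (the numbers
are chosen for illustration only) with a $28 \times 28$ input, so
$s_i = 784$, and $n = 1$ filter of size $3 \times 3$, so $f = 9$,
applied without padding, which gives $s_o = 26 \cdot 26 = 676$. Under
the counting used above this results in
\[
  n f + s_o = 9 + 676 = 685, \qquad (s_i + 1) s_o = 785 \cdot 676 = 530\,660 ,
\]
that is, the fully connected layer would need several hundred times as
many parameters as the convolutional one.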
The usage of multiple filters results in multiple outputs of the same
The usage of multiple filters results in multiple outputs of the same
size as the input. These are often called channels. Depending on the
size as the input (or slightly smaller if no padding is used). These
size of the filters this can result in the dimension of the output
are often called channels.
being one larger than the input.
For convolutional layers that are preceded by convolutional layers the
However for convolutional layers that are preceded by convolutional layers the
size of the filter is often chosen to coincide with the amount of channels
size of the filter is often chosen to coincide with the amount of channels
of the output of the previous layer without using padding in this
of the output of the previous layer and not padded in this
direction in order to prevent gaining additional
direction.
This results in the channels ``being squashed'' and prevents gaining
additional
dimensions\todo { explain filters spanning the full depth better} in the output.
dimensions\todo { explain filters spanning the full depth better} in the output.
This can also be used to flatten certain less interesting channels of
This can also be used to flatten certain less interesting channels of
the input as for example a color channels.
the input as for example color channels.
Thus filters used in convolutional networks are usually have the same
% Thus filters used in convolutional networks are usually have the same
amount of dimensions as the input or one more.
% amount of dimensions as the input or one more.
The size of the filters and the way they are applied can be tuned
while building the model should be the same for all filters in one
layer in order for the output being of consistent size in all channels.
It is common to reduce the d< by not applying the
filters on each ``pixel'' but rather specify a ``stride'' $ s $ at which
the filter $ g $ is moved over the input $ I $
A way to additionally reduce the size using convolution is to not apply the
convolution on every pixel, but rather to specify a certain ``stride''
$ s $ at which the filter $ g $ is moved over the input $ I $ ,
\[
\[
O_ { x,y,c} = \sum _ { i,j,l \in \mathbb { Z} } I_ { sx-i,sy-j,c-l} g_ { i,j,l} .
O_ { x,y,c} = \sum _ { i,j,l \in \mathbb { Z} } I_ { sx-i,sy-j,c-l} g_ { i,j,l} .
\]
\]
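A short worked example may help here. It assumes the usual count of
valid filter positions, which is not stated explicitly in the text:
moving a filter of size $k$ with stride $s$ over an unpadded input of
size $n$ in a given dimension yields
\[
  \left\lfloor \frac{n - k}{s} \right\rfloor + 1
\]
output entries in that dimension. For the illustrative values $n = 28$,
$k = 3$ and $s = 2$ this gives $\lfloor 25 / 2 \rfloor + 1 = 13$,
compared to $28 - 3 + 1 = 26$ for $s = 1$.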
As seen convolution lends itself for image manipulation. In this
The size and stride for all filters in a layer should be the same in
chapter we will explore how we can incorporate convolution in neural
order to get a uniform tensor as output.
networks, and how that might be beneficial.
% The size of the filters and the way they are applied can be tuned
% while building the model should be the same for all filters in one
Convolutional Neural Networks as described by ... are made up of
% layer in order for the output being of consistent size in all channels.
convolutional layers, pooling layers, and fully connected ones. The
% It is common to reduce the d< by not applying the
fully connected layers are layers in which each input node is
% filters on each ``pixel'' but rather specify a ``stride'' $ s $ at which
connected to each output node which is the structure introduced in
% the filter $ g $ is moved over the input $ I $
chapter ...
% \[
In a convolutional layer instead of combining all input nodes for each
% O_ { x,y,c} = \sum _ { i,j,l \in \mathbb { Z} } I_ { x-i,y-j,c-l} g_ { i,j,l} .
output node, the input nodes are interpreted as a tensor on which a
% \]
kernel is applied via convolution, resulting in the output. Most often
multiple kernels are used, resulting in multiple output tensors. These
% As seen convolution lends itself for image manipulation. In this
kernels are the variables, which can be altered in order to fit the
% chapter we will explore how we can incorporate convolution in neural
model to the data. Using multiple kernels it is possible to extract
% networks, and how that might be beneficial.
different features from the image (e.g. edges -> sobel). As this
increases dimensionality even further which is undesirable as it
% Convolutional Neural Networks as described by ... are made up of
increases the amount of variables in later layers of the model, a convolutional layer
% convolutional layers, pooling layers, and fully connected ones. The
is often followed by a pooling one. In a pooling layer the input is
% fully connected layers are layers in which each input node is
% connected to each output node which is the structure introduced in
% chapter ...
% In a convolutional layer instead of combining all input nodes for each
% output node, the input nodes are interpreted as a tensor on which a
% kernel is applied via convolution, resulting in the output. Most often
% multiple kernels are used, resulting in multiple output tensors. These
% kernels are the variables, which can be altered in order to fit the
% model to the data. Using multiple kernels it is possible to extract
% different features from the image (e.g. edges -> sobel).
In order to further reduce the size towards the final layer, convolutional
layers are often followed by a pooling layer.
In a pooling layer the input is
reduced in size by extracting a single value from a
reduced in size by extracting a single value from a
neighborhood \todo { moving...} ... . The resulting output size is dependent on
neighborhood of pixels, often by taking the maximum value in the
the offset of the neighborhoods used. Popular is max-pooling where the
neighborhood (max-pooling). The resulting output size is dependent on
largest value in a neighborhood is used or.
the offset of the neighborhoods used; this offset is commonly called
\todo { small graphic}
``stride''\todo { stride is used twice} .
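As a small illustration with made-up values, max-pooling with
$2 \times 2$ neighborhoods and an offset of $2$ in both directions maps
\[
  \left( \begin{array}{cccc}
    1 & 3 & 2 & 0 \\
    4 & 2 & 1 & 5 \\
    0 & 1 & 3 & 2 \\
    2 & 6 & 1 & 4
  \end{array} \right)
  \mapsto
  \left( \begin{array}{cc}
    4 & 5 \\
    6 & 4
  \end{array} \right) ,
\]
where each entry of the output is the maximum over one of the four
disjoint $2 \times 2$ blocks of the input. An offset of $1$ would
instead use overlapping neighborhoods and produce a $3 \times 3$ output.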
The combination of convolution and pooling layers allows for
The combination of convolution and pooling layers allows for
extraction of features from the input in the form of feature maps while
extraction of features from the input in the form of feature maps while
using relatively few parameters that need to be trained.
using relatively few parameters that need to be trained.
\todo { Beispiel feature maps}
An example of this is given in Figure~\ref { fig:feature_ map} , which shows
intermediary outputs of a small convolutional neural network consisting
of two convolutional and pooling layers, each with one filter, followed
by two fully connected layers.
\begin { figure} [h]
\centering
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/mnist0bw.pdf}
\caption { input}
\end { subfigure}
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/conv2d_ 6.pdf}
\caption { convolution}
\end { subfigure}
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/max_ pooling2d_ 6.pdf}
\caption { max-pool}
\end { subfigure}
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/conv2d_ 7.pdf}
\caption { convolution}
\end { subfigure}
\begin { subfigure} { 0.19\textwidth }
\includegraphics [width=\textwidth] { Figures/Data/max_ pooling2d_ 7.pdf}
\caption { max-pool}
\end { subfigure}
\caption [Feature map] { Intermediary outputs of a
convolutional neural network, starting with the input and ending
with the corresponding feature map.}
\label { fig:feature_ map}
\end { figure}
\subsubsection { Parallels to the Visual Cortex in Mammals}
\subsubsection { Parallels to the Visual Cortex in Mammals}
@ -295,7 +346,6 @@ arbitrary. ... auge... bla bla
% -Different scale of gradients for vars in different layers -> ADAdelta
% -Different scale of gradients for vars in different layers -> ADAdelta
\subsection { Stochastic Training Algorithms}
\subsection { Stochastic Training Algorithms}
For many applications in which neural networks are used such as
For many applications in which neural networks are used such as
image classification or segmentation, large training data sets become
image classification or segmentation, large training data sets become
essential to capture the nuances of the
essential to capture the nuances of the
@ -303,20 +353,21 @@ data. However as training sets get larger the memory requirement
during training grows with it.
during training grows with it.
In order to update the weights with the gradient descent algorithm
In order to update the weights with the gradient descent algorithm
derivatives of the network with respect to each
derivatives of the network with respect to each
variable need to be calculated for all data points in order to get the
variable need to be computed for all data points.
full gradient of the error of the network.
Thus the amount of memory and computing power available limits the
Thus the amount of memory and computing power available limits the
size of the training data that can be efficiently used in fitting the
size of the training data that can be efficiently used in fitting the
network. A class of algorithms that augment the gradient descent
network. A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
algorithm in order to lessen this problem are stochastic gradient
descent algorithms. Here the premise is that instead of using the whole
descent algorithms.
dataset a (different) subset of data is chosen to
Here the full dataset is split into smaller disjoint subsets.
compute the gradient in each iteration (Algorithm~\ref { alg:sdg} ).
Then in each iteration a (different) subset of data is chosen to
The training period until each data point has been considered in
compute the gradient (Algorithm~\ref { alg:sdg} ).
The training period until each data point has been considered at least
once in
updating the parameters is commonly called an ``epoch''.
updating the parameters is commonly called an ``epoch''.
Using subsets reduces the amount of memory and computing power required for
Using subsets reduces the amount of memory required for storing the
each iteration. This makes it possible to use very large training
necessary values for each update, thus making it possible to use very
sets to fit the model.
large training sets to fit the model.
Additionally the noise introduced on the gradient can improve
Additionally the noise introduced on the gradient can improve
the accuracy of the fit as stochastic gradient descent algorithms are
the accuracy of the fit as stochastic gradient descent algorithms are
less likely to get stuck on local extrema.
less likely to get stuck on local extrema.
@ -353,7 +404,7 @@ mount of training time.
\end { algorithm}
\end { algorithm}
In order to illustrate this behavior we modeled a convolutional neural
In order to illustrate this behavior we modeled a convolutional neural
network to ... handwritten digits. The data set used for this is the
network to classify handwritten digits. The data set used for this is the
MNIST database of handwritten digits (\textcite { MNIST} ,
MNIST database of handwritten digits (\textcite { MNIST} ,
Figure~\ref { fig:MNIST} ).
Figure~\ref { fig:MNIST} ).
\input { Figures/mnist.tex}
\input { Figures/mnist.tex}
@ -364,15 +415,17 @@ applied with a stride of one.
The first layer consists of 32 filters and the second of 64. Both
The first layer consists of 32 filters and the second of 64. Both
pooling layers pool a $ 2 \times 2 $ area. The fully connected layer
pooling layers pool a $ 2 \times 2 $ area. The fully connected layer
consists of 256 nodes and the output layer of 10, one for each digit.
consists of 256 nodes and the output layer of 10, one for each digit.
All layers except the output layer use RELU as activation function
All layers use RELU as activation function, except the output layer
with the output layer using softmax (\ref { def:softmax} ).
which uses softmax (\ref { def:softmax} ).
As loss function categorical crossentropy is used (\ref { def:... } ).
As loss function categorical crossentropy is used (\ref { eq:cross_ entropy } ).
The architecture of the convolutional neural network is summarized in
The architecture of the convolutional neural network is summarized in
Figure~\ref { fig:mnist_ architecture} .
Figure~\ref { fig:mnist_ architecture} .
\begin { figure}
\begin { figure}
\includegraphics [width=\textwidth] { Figures/Data/convnet_ fig.pdf}
\includegraphics [width=\textwidth] { Figures/Data/convnet_ fig.pdf}
\caption { architecture}
\caption { Convolutional neural network architecture used to model the
MNIST handwritten digits dataset. This figure was created using the
draw\textunderscore convnet Python script by \textcite { draw_ convnet} .}
\label { fig:mnist_ architecture}
\label { fig:mnist_ architecture}
\end { figure}
\end { figure}
@ -387,7 +440,7 @@ with gradient descent after 20 epochs.
This is due to the former using a batch size of 32 and thus having
This is due to the former using a batch size of 32 and thus having
made 1875 updates to the weights
made 1875 updates to the weights
after the first epoch in comparison to one update. While each of
after the first epoch in comparison to one update. While each of
these updates uses an approximate
these updates only uses an approximate
gradient calculated on the subset it performs far better than the
gradient calculated on the subset it performs far better than the
network using true gradients when training for the same amount of time.
network using true gradients when training for the same amount of time.
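The number of updates follows directly from the sizes involved. Assuming
the standard MNIST training set of 60\,000 images is used in full (the
training set size is an assumption here, as it is not stated above),
stochastic gradient descent with a batch size of 32 performs
\[
  \frac{60\,000}{32} = 1875
\]
updates per epoch, whereas gradient descent on the full dataset performs
a single update per epoch.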
\todo { compare training time}
\todo { compare training time}
@ -395,6 +448,8 @@ network using true gradients when training for the same mount of time.
\input { Figures/SGD_ vs_ GD.tex}
\input { Figures/SGD_ vs_ GD.tex}
\clearpage
\clearpage
\subsection { \titlecap { modified stochastic gradient descent} }
\subsection { \titlecap { modified stochastic gradient descent} }
This section is based on \textcite { ruder} .
An inherent problem of the stochastic gradient descent algorithm is
An inherent problem of the stochastic gradient descent algorithm is
its sensitivity to the learning rate $ \gamma $ . This results in the
its sensitivity to the learning rate $ \gamma $ . This results in the
problem of having to find an appropriate learning rate for each problem
problem of having to find an appropriate learning rate for each problem
@ -606,12 +661,6 @@ global learning rate. This results in .. hyperparameters, however the
algorithm seems to be exceptionally stable with the recommended
algorithm seems to be exceptionally stable with the recommended
parameters of ... and is a very reliable algorithm for training
parameters of ... and is a very reliable algorithm for training
neural networks.
neural networks.
However the \textsc { Adam} algorithm can have problems with high
variance of the adaptive learning rate early in training.
\textcite { rADAM} try to address these issues with the Rectified Adam
algorithm
\todo { will ich das einbauen?}
\begin { algorithm} [H]
\begin { algorithm} [H]
\SetAlgoLined
\SetAlgoLined
@ -662,7 +711,7 @@ the other algorithms, with AdaGrad and Adelta following... bla bla
% strategies exist. A popular approach in regularizing convolutional neural network
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit { dropout} which has been first introduced in
% is \textit { dropout} which has been first introduced in
% \cite { Dropout}
% \cite { Dropout}
This section is based on ....
Similarly to shallow networks overfitting still can impact the quality of
Similarly to shallow networks overfitting still can impact the quality of
convolutional neural networks.
convolutional neural networks.
Popular ways to combat this problem for a .. of models is averaging
Popular ways to combat this problem for a .. of models is averaging
@ -748,7 +797,7 @@ same or else the network will not learn the desired ...
In the case of handwritten digits for example a too high rotation angle
In the case of handwritten digits for example a too high rotation angle
will ... a nine or six.
will ... a nine or six.
The most common transformations are rotation, zoom, shear, brightness,
The most common transformations are rotation, zoom, shear, brightness,
mirroring.
mirroring. Examples of this are given in Figure~\ref { fig:datagen} .
\begin { figure} [h]
\begin { figure} [h]
\centering
\centering
@ -775,6 +824,7 @@ mirroring.
\caption [Image data generation] { Example for the manipulations used in ... As all images are
\caption [Image data generation] { Example for the manipulations used in ... As all images are
of the same intensity brightness manipulation does not seem
of the same intensity brightness manipulation does not seem
... Additionally mirroring is not used for ... reasons.}
... Additionally mirroring is not used for ... reasons.}
\label { fig:datagen}
\end { figure}
\end { figure}
In order to compare the benefits obtained from implementing these
In order to compare the benefits obtained from implementing these
@ -829,26 +879,26 @@ zalando, a overview is given in Figure~\ref{fig:fashionMNIST}.
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 1 sample} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 1 sample} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.5633 & 0.5312 & 0.6704 & 0.6604 \\
max \Tstrut & 0.5633 & 0.5312 & \textbf { 0.6704} & 0.6604 \\
min & 0.3230 & 0.4224 & 0.4878 & 0.5175 \\
min & 0.3230 & 0.4224 & 0.4878 & \textbf { 0.5175} \\
mean & 0.4570 & 0.4714 & 0.5862 & 0.6014 \\
mean & 0.4570 & 0.4714 & 0.5862 & \textbf { 0.6014} \\
var & 0.0040 & 0.0012 & 0.0036 & 0.0023 \\
var \Bstrut & 0.0040 & \textbf { 0.0012} & 0.0036 & 0.0023 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 10 samples} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 10 samples} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.8585 & 0.9423 & 0.9310 & 0.9441 \\
max \Tstrut & 0.8585 & 0.9423 & 0.9310 & \textbf { 0.9441} \\
min & 0.8148 & 0.9081 & 0.9018 & 0.9061 \\
min & 0.8148 & \textbf { 0.9081} & 0.9018 & 0.9061 \\
mean & 0.8377 & 0.9270 & 0.9185 & 0.9232 \\
mean & 0.8377 & \textbf { 0.9270} & 0.9185 & 0.9232 \\
var & 2.7e-4 & 1.3e-4 & 6e-05 & 1.5e-4 \\
var \Bstrut & 2.7e-04 & 1.3e-04 & 6e-05 & 1.5e-04 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 100 samples} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 100 samples} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max & 0.9637 & 0.9796 & 0.9810 & 0.9805 \\
max \Tstrut & 0.9637 & 0.9796 & 0.9810 & \textbf { 0.9811} \\
min & 0.9506 & 0.9719 & 0.9702 & 0.9727 \\
min & 0.9506 & 0.9719 & 0.9702 & \textbf { 0.9727} \\
mean & 0.9582 & 0.9770 & 0.9769 & 0.9783 \\
mean & 0.9582 & 0.9770 & 0.9769 & \textbf { 0.9783} \\
var & 2e-05 & 1e-05 & 1e-05 & 0 \\
var \Bstrut & 2e-05 & 1e-05 & 1e-05 & 1e-05 \\
\hline
\hline
\end { tabu}
\end { tabu}
\normalsize
\normalsize
@ -857,6 +907,7 @@ zalando, a overview is given in Figure~\ref{fig:fashionMNIST}.
on random MNIST handwriting training sets containing 1, 10 and 100
on random MNIST handwriting training sets containing 1, 10 and 100
data points per class after 125 epochs. The mean achieved accuracy
data points per class after 125 epochs. The mean achieved accuracy
for the full set employing both overfitting measures is }
for the full set employing both overfitting measures is }
\label { table:digitsOF}
\small
\small
\centering
\centering
\begin { tabu} to \textwidth { @{ } l*4{ X[c]} @{ } }
\begin { tabu} to \textwidth { @{ } l*4{ X[c]} @{ } }
@ -865,32 +916,33 @@ zalando, a overview is given in Figure~\ref{fig:fashionMNIST}.
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 1 sample} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 1 sample} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.5633 & 0.5312 & 0.6704 & 0.6604 \\
max \Tstrut & 0.4885 & \textbf { 0.5613} & 0.5488 & 0.5475 \\
min & 0.3230 & 0.4224 & 0.4878 & 0.5175 \\
min & 0.3710 & \textbf { 0.3858} & 0.3736 & 0.3816 \\
mean & 0.4570 & 0.4714 & 0.5862 & 0.6014 \\
mean & 0.4166 & 0.4838 & 0.4769 & \textbf { 0.4957} \\
var & 0.0040 & 0.0012 & 0.0036 & 0.0023 \\
var \Bstrut & \textbf { 0.002} & 0.00294 & 0.00338 & 0.0030 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 10 samples} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 10 samples} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max \Tstrut & 0.8585 & 0.9423 & 0.9310 & 0.9441 \\
max \Tstrut & 0.7370 & 0.7340 & 0.7236 & \textbf { 0.7502} \\
min & 0.8148 & 0.9081 & 0.9018 & 0.9061 \\
min & 0.6818 & 0.6673 & 0.6709 & \textbf { 0.6799} \\
mean & 0.8377 & 0.9270 & 0.9185 & 0.9232 \\
mean & 0.7130 & \textbf { 0.7156} & 0.7031 & 0.7136 \\
var & 2.7e-4 & 1.3e-4 & 6e-05 & 1.5e-4 \\
var \Bstrut & 3.2e-04 & 3.4e-04 & 3.2e-04 & 4.5e-04 \\
\hline
\hline
&
&
\multicolumn { 4} { c} { \titlecap { test accuracy for 100 samples} } \Bstrut \\
\multicolumn { 4} { c} { \titlecap { test accuracy for 100 samples} } \Bstrut \\
\cline { 2-5}
\cline { 2-5}
max & 0.9637 & 0.9796 & 0.9810 & 0.9805 \\
max \Tstrut & 0.8454 & 0.8385 & 0.8456 & \textbf { 0.8459} \\
min & 0.9506 & 0.9719 & 0.9702 & 0.9727 \\
min & 0.8227 & 0.8200 & \textbf { 0.8305} & 0.8274 \\
mean & 0.9582 & 0.9770 & 0.9769 & 0.9783 \\
mean & 0.8331 & 0.8289 & 0.8391 & \textbf { 0.8409} \\
var & 2e-05 & 1e-05 & 1e-05 & 0 \\
var \Bstrut & 4e-05 & 4e-05 & 2e-05 & 3e-05 \\
\hline
\hline
\end { tabu}
\end { tabu}
\normalsize
\normalsize
\captionof { table} { Values of the test accuracy of the model trained 10 times
\captionof { table} { Values of the test accuracy of the model trained 10 times
on random fashion MNIST training sets containing 1, 10 and 100 data points per
on random fashion MNIST training sets containing 1, 10 and 100 data points per
class. The mean achieved accuracy for the full dataset is: ....}
class. The mean achieved accuracy for the full dataset is: ....}
\label { table:fashionOF}
\end { minipage}
\end { minipage}
\clearpage % if needed/desired
\clearpage % if needed/desired
}
}
@ -908,26 +960,36 @@ This is done in order to have more ... in order to better ... the data
in the model. A diagram of the architecture is given in
in the model. A diagram of the architecture is given in
Figure~\ref { fig:fashion_ MNIST} .
Figure~\ref { fig:fashion_ MNIST} .
For both scenarios the model are trained 10 times on randomly
\afterpage {
... training sets. Additionally models of the same architecture where
\noindent
a dropout layer with a ... 20\% is implemented and/or datageneration
\begin { figure} [h]
is used to augment the data during training. The values for the
datageneration are given in CODE APPENDIX.
The models are trained for 125 epoch to ensure enough random
augmentations of the input images are considered to ensure
convergence. The test accuracies of the models after training for 125
epoch are given in Figure~\ref { ...} for the handwriting
and in Figure~\ref { ...} for the fashion scenario. Additionally the
average test accuracies of the models are given for each epoch in
Figure ... and Figure...
\begin { figure}
\includegraphics [width=\textwidth] { Figures/Data/cnn_ fashion_ fig.pdf}
\includegraphics [width=\textwidth] { Figures/Data/cnn_ fashion_ fig.pdf}
\caption { Convolutional neural network architecture used to model the
\caption { Convolutional neural network architecture used to model the
fashion MNIST dataset.}
fashion MNIST dataset. This figure was created using the
\label { fig:mnist_ architecture}
draw\textunderscore convnet Python script by \textcite { draw_ convnet} .}
\label { fig:fashion_ MNIST}
\end { figure}
\end { figure}
}
For both scenarios the models are trained 10 times on randomly
sampled training sets.
For each scenario the models are trained without overfitting measures and with combinations
of dropout and data generation implemented. The Python implementation
of the models and the parameters used for the data generation are given
in Listing~\ref { lst:handwriting} for the handwriting model and
Listing~\ref { lst:fashion} for the fashion model.
The models are trained for 125 epochs in order
to have enough random
augmentations of the input images present during training
for the networks to fully profit from the additional training data generated.
The test accuracies of the models after
training for 125
epochs are given in Table~\ref { table:digitsOF} for the handwritten digits
and in Table~\ref { table:fashionOF} for the fashion datasets. Additionally the
average test accuracies over the course of learning are given in
Figure~\ref { fig:plotOF_ digits} for the handwriting application and Figure~\ref { fig:plotOF_ fashion} for the
fashion application.
\begin { figure} [h]
\begin { figure} [h]
\centering
\centering
@ -937,7 +999,7 @@ Figure ... and Figure...
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
height = 0.35\textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
height = 0.35\textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
ylabel = { Test Accuracy} , cycle
xlabel = { epoch} , ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
list/Dark2, every axis plot/.append style={ line width
=1.25pt} ]
=1.25pt} ]
\addplot table
\addplot table
@ -970,7 +1032,7 @@ Figure ... and Figure...
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
height = 0.35\textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
height = 0.35\textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
ylabel = { Test Accuracy} , cycle
xlabel = { epoch} , ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
list/Dark2, every axis plot/.append style={ line width
=1.25pt} ]
=1.25pt} ]
\addplot table
\addplot table
@ -1025,18 +1087,143 @@ Figure ... and Figure...
\caption { 100 samples per class}
\caption { 100 samples per class}
\vspace { .25cm}
\vspace { .25cm}
\end { subfigure}
\end { subfigure}
\caption { }
\caption { Mean test accuracies of the models fitting the sampled MNIST
\label { fig:MNISTfashion}
handwriting datasets over the 125 epochs of training.}
\label { fig:plotOF_ digits}
\end { figure}
\end { figure}
\begin { figure} [h]
\begin { figure} [h]
\centering
\centering
\missingfigure { datagen fashion}
\small
\caption { Sample pictures of the mnist fashion dataset, one per
\begin { subfigure} [h]{ \textwidth }
class.}
\begin { tikzpicture}
\label { mnist fashion}
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style =
{ draw = none} , width = \textwidth ,
height = 0.35\textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
=1.25pt} ]
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ dropout_ 0_ 1.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ dropout_ 2_ 1.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ datagen_ dropout_ 0_ 1.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ datagen_ dropout_ 2_ 1.mean} ;
\addlegendentry { \footnotesize { Default} }
\addlegendentry { \footnotesize { D. 0.2} }
\addlegendentry { \footnotesize { G.} }
\addlegendentry { \footnotesize { G. + D. 0.2} }
\addlegendentry { \footnotesize { D. 0.4} }
\end { axis}
\end { tikzpicture}
\caption { 1 sample per class}
\vspace { 0.25cm}
\end { subfigure}
\begin { subfigure} [h]{ \textwidth }
\begin { tikzpicture}
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = \textwidth ,
height = 0.35\textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} ,ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
=1.25pt} , ymin = { 0.62} ]
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ dropout_ 0_ 10.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ dropout_ 2_ 10.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ datagen_ dropout_ 0_ 10.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ datagen_ dropout_ 2_ 10.mean} ;
\addlegendentry { \footnotesize { Default} }
\addlegendentry { \footnotesize { D. 0.2} }
\addlegendentry { \footnotesize { G.} }
\addlegendentry { \footnotesize { G. + D. 0.2} }
\end { axis}
\end { tikzpicture}
\caption { 10 samples per class}
\end { subfigure}
\begin { subfigure} [h]{ \textwidth }
\begin { tikzpicture}
\begin { axis} [legend cell align={ left} ,yticklabel style={ /pgf/number format/fixed,
/pgf/number format/precision=3} ,tick style = { draw = none} , width = 0.9875\textwidth ,
height = 0.35\textwidth , legend style={ at={ (0.9825,0.0175)} ,anchor=south east} ,
xlabel = { epoch} , ylabel = { Test Accuracy} , cycle
list/Dark2, every axis plot/.append style={ line width
=1.25pt} , ymin = { 0.762} ]
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ dropout_ 0_ 100.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ dropout_ 2_ 100.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ datagen_ dropout_ 0_ 100.mean} ;
\addplot table
[x=epoch, y=val_ accuracy, col sep=comma, mark = none]
{ Figures/Data/fashion_ datagen_ dropout_ 2_ 100.mean} ;
\addlegendentry { \footnotesize { Default} }
\addlegendentry { \footnotesize { D. 0.2} }
\addlegendentry { \footnotesize { G.} }
\addlegendentry { \footnotesize { G. + D. 0.2} }
\end { axis}
\end { tikzpicture}
\caption { 100 samples per class}
\vspace { .25cm}
\end { subfigure}
\caption { Mean test accuracies of the models fitting the sampled fashion MNIST
datasets over the 125 epochs of training.}
\label { fig:plotOF_ fashion}
\end { figure}
\end { figure}
It can be seen in ... and ... that the usage of .. overfitting
measures greatly improves the accuracy for small datasets. However, for
the smallest size of one datapoint per class, generating more data
... outperforms dropout, with only a ... improvement being seen from the
implementation of dropout, whereas data generation improves the accuracy
by ... . On the other hand, the implementation of dropout seems to
reduce the variance in the model accuracy, as the variance in accuracy
for the dropout model is less than .. while the variance of the
data generation .. model is nearly the same. The model with data generation
... a reduction in variance with the addition of dropout.
For the slightly larger training sets of ten samples per class, the
difference between the two measures seems smaller. Here the
improvement in accuracy
achieved by dropout is slightly larger than that achieved by
data generation. However, for this larger training set the variance
in test accuracies is lower for the model with data generation than for the
one with dropout.
The results for the training sets with 100 samples per class resemble
those for the sets with 10 samples per class.
Overall the models ... both measures to combat overfitting seem to
perform considerably better than the ones without. The use of
these measures has great potential for improving models used in
applications with limited training data. Additional tables and figures
visualizing the effects on the logarithmic cross-entropy loss rather than
the accuracy are given in the appendix.\todo { figs for appendix}
\clearpage
\clearpage
\section { Conclusion}
\section { Conclusion}
@ -1044,7 +1231,12 @@ Figure ... and Figure...
\item generate more data, GAN etc \textcite { gan}
\item generate more data, GAN etc \textcite { gan}
\item Transfer learning, use network trained on different task and
\item Transfer learning, use network trained on different task and
repurpose it / train it with the training data \textcite { transfer_ learning}
repurpose it / train it with the training data \textcite { transfer_ learning}
\item random erasing fashion mnist 96.35\% accuracy \textcite { random_ erasing}
\item random erasing fashion mnist 96.35\% accuracy
\textcite { random_ erasing}
\item However, the \textsc { Adam} algorithm can have problems with high
variance of the adaptive learning rate early in training.
\textcite { rADAM} try to address these issues with the Rectified Adam
algorithm.
\end { itemize}
\end { itemize}