\section{Application of Neural Networks to Problems of Higher Complexity}

As neural networks are applied to more complex problems, often resulting in
higher dimensional inputs, the number of parameters in the network rises
drastically.
For very large inputs such as high resolution image data, the fully connected
nature of the neural network causes the number of parameters to quickly
exceed what is feasible for training and storage.
A way to combat this is to use layers which are only sparsely
connected and share parameters between nodes. This can be implemented
using convolution.\todo{write a better transition}
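To give a rough sense of scale, the following back-of-the-envelope
comparison contrasts the parameter count of a single fully connected layer
with that of a convolutional layer as introduced in the following
subsections. The input resolution and layer sizes are made-up values used
purely for illustration.

\begin{verbatim}
# Rough parameter counts for a hypothetical 256 x 256 RGB input
# (all values chosen for illustration only).
inputs = 256 * 256 * 3           # flattened input size
hidden = 1024                    # width of one dense hidden layer

dense_params = (inputs + 1) * hidden    # weights + biases
conv_params = (5 * 5 * 3 + 1) * 32      # 32 filters of size 5x5x3

print(dense_params)   # 201,327,616 parameters (about 2 * 10^8)
print(conv_params)    # 2,432 parameters
\end{verbatim}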
|
|
|
|
\subsection{Convolution}

Convolution is a mathematical operation in which the product of two
functions is integrated after one of them has been reversed and shifted,

\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) ds.
\]

This operation can be interpreted as a filter function $g$ being applied
to $f$: each value $f(t)$ is replaced by an average of the values of $f$
around position $t$, weighted by the filter function $g$.
|
|
The convolution operation allows for a wide range of data manipulations, a
simple example being the smoothing of real-time data. Consider a sensor
measuring the location of an object (e.g.\ via GPS). We expect the
output of the sensor to be noisy as a result of a number of factors
that impact its accuracy. In order to get a better estimate of
the actual location we want to smooth
the data to reduce the noise. Using convolution for this task, we
can control the significance given to each data point. For example, we
might want to give a larger weight to recent measurements than to
older ones. As these measurements are taken on a discrete
timescale, we first need to introduce discrete convolution. \\Let $f$,
$g: \mathbb{Z} \to \mathbb{R}$, then

\[
  (f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]

Applying this to the data with a suitably chosen filter $g$, we are
able to improve the accuracy, as can be seen in
Figure~\ref{fig:sin_conv}.
\input{Figures/sin_conv.tex}
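A minimal numpy sketch of this kind of smoothing; the signal, the noise
level and the filter weights are made up for illustration and do not
correspond to the data used in Figure~\ref{fig:sin_conv}.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Noisy measurements of a smooth signal, e.g. a 1D position over time.
t = np.linspace(0, 2 * np.pi, 200)
measured = np.sin(t) + rng.normal(scale=0.2, size=t.size)

# Filter g: exponentially decaying weights favor recent values.
g = np.exp(-np.arange(10) / 3.0)
g /= g.sum()                      # normalize so the weights sum to one

# Discrete convolution (f * g)(t) = sum_i f(t - i) g(i).
smoothed = np.convolve(measured, g, mode="same")
\end{verbatim}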
|
|
This form of discrete convolution can also be applied to functions
with inputs of higher dimensionality. Let $f$, $g: \mathbb{Z}^d \to
\mathbb{R}$, then

\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
|
|
This will prove to be a useful framework for image manipulation, but
in order to apply convolution to images we first need to discuss the
representation of image data. Most often images are represented
by each pixel being a mixture of base colors. These base colors define
the color space in which the image is encoded. Commonly used
color spaces are RGB (red, green, blue) and CMYK (cyan, magenta, yellow,
black). An example of an
image split into its red, green and blue channels is given in
Figure~\ref{fig:rgb}. Using this
encoding of the image we can define a corresponding discrete function
describing the image by mapping the coordinates $(x,y)$ of a pixel
and the
channel (color) $c$ to the respective value $v$

\begin{align}
  \begin{split}
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
  \label{def:I}
\end{align}
|
|
|
|
\begin{figure}
|
|
\begin{adjustbox}{width=\textwidth}
|
|
\begin{tikzpicture}
|
|
\begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
|
|
\node[canvas is xy plane at z=0, transform shape] at (0,0)
|
|
{\includegraphics[width=5cm]{Figures/Data/klammern_r.jpg}};
|
|
\node[canvas is xy plane at z=2, transform shape] at (0,-0.2)
|
|
{\includegraphics[width=5cm]{Figures/Data/klammern_g.jpg}};
|
|
\node[canvas is xy plane at z=4, transform shape] at (0,-0.4)
|
|
{\includegraphics[width=5cm]{Figures/Data/klammern_b.jpg}};
|
|
\node[canvas is xy plane at z=4, transform shape] at (-8,-0.2)
|
|
{\includegraphics[width=5.3cm]{Figures/Data/klammern_rgb.jpg}};
|
|
\end{scope}
|
|
\end{tikzpicture}
|
|
\end{adjustbox}
|
|
\caption[Channel separation of color image]{On the right the red, green and
  blue channels of the picture are displayed. In order to better visualize
  the color channels, the greyscale picture of each channel has been colored
  in the respective color. Combining the layers results in the image on the
  left.}
|
|
\label{fig:rgb}
|
|
\end{figure}
|
|
|
|
With this representation of an image as a function, we can apply
filters to the image using convolution for multidimensional functions
as described above. In order to simplify the notation we will from now
on write the function $I$ given in (\ref{def:I}) as well as the filter
function $g$ as tensors, resulting in the modified notation of
convolution

\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]
|
|
|
|
As images are finite in size, the convolution is not well defined for
pixels close enough to the border that the filter extends beyond the
image. In such cases padding can be used. With padding the image is
enlarged beyond its borders with zero entries to
ensure the convolution is well defined for all pixels. If no padding
is used the size of the output is reduced to \textit{size of input -
  size of kernel + 1} in each dimension.
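A small numpy/scipy sketch illustrating the effect of zero padding on the
output size; the image content and the kernel are arbitrary examples.

\begin{verbatim}
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)       # e.g. a single-channel 28x28 image
kernel = np.ones((5, 5)) / 25.0      # 5x5 averaging filter

valid = convolve2d(image, kernel, mode="valid")  # no padding
same = convolve2d(image, kernel, mode="same")    # zero padding

print(valid.shape)  # (24, 24) = 28 - 5 + 1 in each dimension
print(same.shape)   # (28, 28), same size as the input
\end{verbatim}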
|
|
|
|
Simple examples of image manipulation using
convolution are smoothing operations or
rudimentary detection of edges in grayscale images, i.e.\ images with
only one channel. A popular filter for smoothing images
is the Gaussian filter, which for a given $\sigma \in \mathbb{R}_+$ and
size $s \in \mathbb{N}$ is
defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2
      \sigma^2}}, ~ x,y \in \left\{1,\dots,s\right\}.
\]
|
|
|
|
For edge detection purposes the Sobel operator is widespread. Here two
filters are applied to the
image $I$ and then combined. Edges in the $x$ direction are detected
by convolution with
\[
  G =\left[
  \begin{matrix}
    -1 & 0 & 1 \\
    -2 & 0 & 2 \\
    -1 & 0 & 1
  \end{matrix}\right],
\]
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by

\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied component-wise. Examples of
convolution with both kernels are given in Figure~\ref{fig:img_conv}.
\todo{padding}
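The Sobel edge detection described above could be sketched as follows;
\texttt{image} is assumed to be a two-dimensional greyscale array.

\begin{verbatim}
import numpy as np
from scipy.signal import convolve2d

G = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]])

def sobel(image):
    """Combine convolutions with G and G^T into an edge magnitude map."""
    edges_x = convolve2d(image, G, mode="same")    # edges in x direction
    edges_y = convolve2d(image, G.T, mode="same")  # edges in y direction
    return np.sqrt(edges_x**2 + edges_y**2)        # component-wise
\end{verbatim}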
|
|
|
|
|
|
\begin{figure}[h]
|
|
\centering
|
|
\begin{subfigure}{0.3\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/klammern.jpg}
|
|
\caption{Original Picture}
|
|
\label{subf:OrigPicGS}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.3\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv9.png}
|
|
\caption{\hspace{-2pt}Gaussian Blur $\sigma^2 = 1$}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.3\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv10.png}
|
|
\caption{Gaussian Blur $\sigma^2 = 4$}
|
|
\end{subfigure}\\
|
|
\begin{subfigure}{0.3\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv4.png}
|
|
\caption{Sobel Operator $x$-direction}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.3\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv5.png}
|
|
\caption{Sobel Operator $y$-direction}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.3\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv6.png}
|
|
\caption{Sobel Operator combined}
|
|
\end{subfigure}
|
|
% \begin{subfigure}{0.24\textwidth}
|
|
% \centering
|
|
% \includegraphics[width=\textwidth]{Figures/Data/image_conv6.png}
|
|
% \caption{test}
|
|
% \end{subfigure}
|
|
\caption[Convolution applied on image]{Convolution of the original greyscale
  image (a) with different kernels. In (b) and (c) Gaussian kernels of size
  11 with the stated $\sigma^2$ are used. In (d)--(f) the Sobel operator
  kernels defined above are used.}
|
|
\label{fig:img_conv}
|
|
\end{figure}
|
|
\clearpage
|
|
\newpage
|
|
\subsection{Convolutional Neural Networks}
\todo{introduction to CNNs, amount of parameters}
|
|
% Conventional neural network as described in chapter .. are made up of
|
|
% fully connected layers, meaning each node in a layer is influenced by
|
|
% all nodes of the previous layer. If one wants to extract information
|
|
% out of high dimensional input such as images this results in a very
|
|
% large amount of variables in the model. This limits the
|
|
|
|
% In conventional neural networks as described in chapter ... all layers
|
|
% are fully connected, meaning each output node in a layer is influenced
|
|
% by all inputs. For $i$ inputs and $o$ output nodes this results in $i
|
|
% + 1$ variables at each node (weights and bias) and a total $o(i + 1)$
|
|
% variables. For large inputs like image data the amount of variables
|
|
% that have to be trained in order to fit the model can get excessive
|
|
% and hinder the ability to train the model due to memory and
|
|
% computational restrictions. By using convolution we can extract
|
|
% meaningful information such as edges in an image with a kernel of a
|
|
% small size $k$ in the tens or hundreds independent of the size of the
|
|
% original image. Thus for a large image $k \cdot i$ can be several
|
|
% orders of magnitude smaller than $o\cdot i$ .
|
|
|
|
As seen in the previous section, convolution lends itself to the
manipulation of images and other large data, which motivates its usage in
neural networks.
This is achieved by implementing convolutional layers, in which several
filters are applied to the input and the values of these filters are
trainable parameters of the model.
Each node in such a layer corresponds to a pixel of the output of
convolution with one of those filters, on which a bias and an activation
function are applied.
The usage of multiple filters results in multiple outputs of the same
size as the input. These are often called channels. Depending on the
size of the filters this can result in the dimension of the output
being one larger than that of the input.
However, for convolutional layers that are preceded by convolutional layers
the size of the filter is often chosen to coincide with the number of
channels of the output of the previous layer, without using padding in this
direction, in order to prevent gaining additional
dimensions\todo{explain full-depth filters better} in the output.
This can also be used to collapse certain less interesting channels of
the input, such as the color channels.
Thus filters used in convolutional networks usually have the same
number of dimensions as the input or one more.

The size of the filters and the way they are applied can be tuned
while building the model, but should be the same for all filters in one
layer in order for the output to be of consistent size in all channels.
It is common to reduce the size of the output by not applying the
filters to every ``pixel'' but rather specifying a ``stride'' $s$ at which
the filter $g$ is moved over the input $I$,
|
|
|
|
\[
  O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x \cdot s - i, y \cdot s - j,
    c - l} g_{i,j,l}.
\]
|
|
|
|
As seen, convolution lends itself to image manipulation. In the
following we explore how convolution can be incorporated into neural
networks, and how that might be beneficial.

Convolutional neural networks as described by ... are made up of
convolutional layers, pooling layers, and fully connected ones. The
fully connected layers are layers in which each input node is
connected to each output node, which is the structure introduced in
chapter ...

In a convolutional layer, instead of combining all input nodes for each
output node, the input nodes are interpreted as a tensor on which a
kernel is applied via convolution, resulting in the output. Most often
multiple kernels are used, resulting in multiple output tensors. These
kernels are the variables which can be altered in order to fit the
model to the data. Using multiple kernels it is possible to extract
different features from the image (e.g.\ edges, as with the Sobel
operator). As this increases the dimensionality even further, which is
undesirable because it increases the number of variables in later layers of
the model, a convolutional layer
is often followed by a pooling one. In a pooling layer the input is
reduced in size by extracting a single value from a
neighborhood \todo{moving...}. The resulting output size depends on
the offset of the neighborhoods used. Popular is max-pooling, where the
largest value in a neighborhood is used.
\todo{small graphic}
The combination of convolution and pooling layers allows for the
extraction of features from the input in the form of feature maps while
using relatively few parameters that need to be trained.
\todo{example feature maps}
|
|
|
|
\subsubsection{Parallels to the Visual Cortex in Mammals}
|
|
|
|
The choice of convolution for image classification tasks is not
|
|
arbitrary. ... eye ... bla bla
|
|
|
|
|
|
% \subsection{Limitations of the Gradient Descent Algorithm}
|
|
|
|
% -Hyperparameter guesswork
|
|
% -Problems navigating valleys -> momentum
|
|
% -Different scale of gradients for vars in different layers -> ADAdelta
|
|
|
|
\subsection{Stochastic Training Algorithms}

For many applications in which neural networks are used, such as
image classification or segmentation, large training data sets are
needed to capture the nuances of the
data. However, as training sets get larger the memory required
during training grows with them.
In order to update the weights with the gradient descent algorithm,
the derivatives of the network with respect to each
variable need to be calculated for all data points in order to get the
full gradient of the error of the network.
Thus the amount of memory and computing power available limits the
size of the training data that can be efficiently used in fitting the
network. A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
descent algorithms. Here the premise is that instead of using the whole
dataset a (different) subset of data is chosen to
compute the gradient in each iteration (Algorithm~\ref{alg:sgd}).
The training period until each data point has been considered in
updating the parameters is commonly called an ``epoch''.
Using subsets reduces the amount of memory and computing power required for
each iteration. This makes it possible to use very large training
sets to fit the model.
Additionally, the noise introduced on the gradient can improve
the accuracy of the fit, as stochastic gradient descent algorithms are
less likely to get stuck in local extrema.

Another important benefit of using subsets is that, depending on their
size, the gradient can be calculated far quicker, which allows for more
parameter updates in the same time. If the approximated gradient is close
enough to the ``real'' one this can drastically cut down the time required
for training the model to a certain degree or improve the accuracy
achievable in a given amount of training time.
|
|
|
|
\begin{algorithm}
  \SetAlgoLined
  \KwInput{Function $f$, Weights $w$, Learning Rate $\gamma$, Batch Size $B$, Loss Function $L$,
    Training Data $D$, Epochs $E$.}
  \For{$i \in \left\{1:E\right\}$}{
    Initialize: $S \leftarrow D$\;
    \While{$\abs{S} \geq B$}{
      Draw $\tilde{D}$ from $S$ with $\vert\tilde{D}\vert = B$\;
      Update $S$: $S \leftarrow S \setminus \tilde{D}$\;
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        \tilde{D})}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
    \If{$S \neq \emptyset$}{
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        S)}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
    Increment: $i \leftarrow i+1$\;
  }
  \caption{Stochastic gradient descent.}
  \label{alg:sgd}
\end{algorithm}
|
|
|
|
In order to illustrate this behavior we modeled a convolutional neural
network to classify handwritten digits. The data set used for this is the
MNIST database of handwritten digits (\textcite{MNIST},
Figure~\ref{fig:MNIST}).
\input{Figures/mnist.tex}
The network used consists of two convolution and max pooling layers
followed by one fully connected hidden layer and the output layer.
Both convolutional layers utilize square filters of size five which are
applied with a stride of one.
The first layer consists of 32 filters and the second of 64. Both
pooling layers pool a $2\times 2$ area. The fully connected layer
consists of 256 nodes and the output layer of 10, one for each digit.
All layers except the output layer use ReLU as activation function,
with the output layer using softmax (\ref{def:softmax}).
As loss function categorical crossentropy is used (\ref{def:...}).
The architecture of the convolutional neural network is summarized in
Figure~\ref{fig:mnist_architecture}.
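A sketch of how an architecture like this might be specified in Keras; the
padding choice and other details are our assumptions and not taken from the
code actually used for the experiments.

\begin{verbatim}
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=5, strides=1, padding="same",
                  activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, kernel_size=5, strides=1, padding="same",
                  activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
\end{verbatim}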
|
|
|
|
\begin{figure}
|
|
\includegraphics[width=\textwidth]{Figures/Data/convnet_fig.pdf}
|
|
\caption{Convolutional neural network architecture used to model the MNIST
  handwriting dataset.}
|
|
\label{fig:mnist_architecture}
|
|
\end{figure}
|
|
|
|
The results of the network being trained with gradient descent and
stochastic gradient descent for 20 epochs are given in Figure~\ref{fig:sgd_vs_gd}
and Table~\ref{table:sgd_vs_gd}.

Here it can be seen that the network trained with stochastic gradient
descent is more accurate after the first epoch than the ones trained
with gradient descent after 20 epochs.
This is due to the former using a batch size of 32 and thus having
made 1875 updates to the weights
after the first epoch, in comparison to a single update. While each of
these updates uses an approximate
gradient calculated on a subset, it performs far better than the
network using true gradients when training for the same amount of time.
\todo{compare training time}
|
|
|
|
\input{Figures/SGD_vs_GD.tex}
|
|
\clearpage
|
|
\subsection{\titlecap{modified stochastic gradient descent}}
An inherent problem of the stochastic gradient descent algorithm is
its sensitivity to the learning rate $\gamma$. This results in the
problem of having to find an appropriate learning rate for each problem,
which is largely guesswork; the impact of choosing a bad learning rate
can be seen in Figure~\ref{fig:sgd_vs_gd}.
% There is a inherent problem in the sensitivity of the gradient descent
% algorithm regarding the learning rate $\gamma$.
% The difficulty of choosing the learning rate can be seen
% in Figure~\ref{sgd_vs_gd}.
For small rates the progress in each iteration is small,
but as the rate is increased the algorithm can become unstable and the
parameters diverge to infinity. Even for learning rates small enough to
ensure the parameters do not diverge to infinity, steep valleys in the
function to be minimized can hinder the progress of
the algorithm, as for learning rates that are not small enough gradient
descent ``bounces between'' the walls of the valley rather than following a
downward trend along the valley floor.

% \[
%   w - \gamma \nabla_w ...
% \]
%thus the weights grow to infinity.
\todo{explain unstable learning rate better}
|
|
|
To combat this problem \todo{source} propose altering the learning
rate over the course of training, often called learning rate
scheduling, in order to decrease the learning rate as training
progresses. The most popular implementations of this are time based
decay
\[
  \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
\]
where $d$ is the decay parameter and $n$ is the number of epochs,
step based decay where the learning rate is fixed for a span of $r$
epochs and then decreased according to the parameter $d$
\[
  \gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor},
\]
and exponential decay where the learning rate is decreased after each epoch
\[
  \gamma_n = \gamma_0 e^{-n d}.
\]
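As an illustration, the three schedules might be implemented as follows;
the values of $\gamma_0$, $d$ and $r$ are arbitrary examples.

\begin{verbatim}
import math

gamma_0, d, r = 0.1, 0.05, 10   # illustrative values only

def time_based_decay(gamma_prev, epoch):
    # gamma_{n+1} = gamma_n / (1 + d * n)
    return gamma_prev / (1 + d * epoch)

def step_based_decay(epoch):
    # gamma_n = gamma_0 * d^floor((n + 1) / r)
    return gamma_0 * d ** math.floor((epoch + 1) / r)

def exponential_decay(epoch):
    # gamma_n = gamma_0 * exp(-n * d)
    return gamma_0 * math.exp(-epoch * d)
\end{verbatim}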
|
|
These methods are able to increase the accuracy of a model by large
margins, as seen in the training of ResNet by \textcite{resnet}.
\todo{maybe include a graphic}
However, stochastic gradient descent with learning rate decay is
still highly sensitive to the choice of the hyperparameters $\gamma_0$
and $d$.
In order to mitigate this problem a number of algorithms have been
developed to regularize the learning rate with as little
hyperparameter guesswork as possible.

We will examine and compare a selection of algorithms that use an
adaptive learning rate.
They all scale the gradient for the update depending on past gradients
for each weight individually.
|
|
|
|
The algorithms build on each other, with the adaptive gradient
algorithm (\textsc{AdaGrad}, \textcite{ADAGRAD})
laying the groundwork. Here for each parameter update the learning rate
is given by a constant
$\gamma$ divided by the root of the sum of the squares of the past partial
derivatives of this parameter. This results in a monotonically decaying
learning rate, with faster
decay for parameters with large updates, whereas
parameters with small updates experience slower decay. The \textsc{AdaGrad}
algorithm is given in Algorithm~\ref{alg:ADAGRAD}. Note that while
this algorithm is still based upon the idea of gradient descent it no
longer takes steps in the direction of the gradient while
updating. Due to the individual learning rates for each parameter only
the direction/sign of the update for each single parameter remains the same.
|
|
|
|
\begin{algorithm}[H]
|
|
\SetAlgoLined
|
|
\KwInput{Global learning rate $\gamma$}
|
|
\KwInput{Constant $\varepsilon$}
|
|
\KwInput{Initial parameter vector $x_1 \in \mathbb{R}^p$}
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
Compute Gradient: $g_t$\;
|
|
Compute Update: $\Delta x_{t,i} \leftarrow
|
|
-\frac{\gamma}{\norm{g_{1:t,i}}_2 + \varepsilon} g_{t,i}, \forall i =
|
|
1, \dots,p$\;
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
}
|
|
\caption{\textsc{AdaGrad}}
|
|
\label{alg:ADAGRAD}
|
|
\end{algorithm}
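A minimal numpy sketch of the per-parameter scaling used by
\textsc{AdaGrad}; the objective function and all values are placeholders.

\begin{verbatim}
import numpy as np

def adagrad(grad, x, gamma=0.01, eps=1e-8, steps=100):
    """Per-parameter learning rates from accumulated squared gradients."""
    sum_sq = np.zeros_like(x)             # running sum of g_t^2
    for _ in range(steps):
        g = grad(x)
        sum_sq += g ** 2
        x = x - gamma / (np.sqrt(sum_sq) + eps) * g
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_opt = adagrad(lambda x: 2 * x, x=np.array([1.0, -3.0]))
\end{verbatim}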
|
|
|
|
Building on \textsc{AdaGrad}, \textcite{ADADELTA} developed the
\textsc{AdaDelta} algorithm
in order to improve upon the two main drawbacks of \textsc{AdaGrad}, being
the continual decay of the learning rate and the need for a manually
selected global learning rate $\gamma$.
As \textsc{AdaGrad} uses division by the accumulated squared gradients, the
learning rate will eventually become vanishingly small.
In order to ensure that learning continues to make progress even after a
significant number of iterations, instead of summing the squared gradients
an exponentially decaying average of the past squared gradients is used
for regularizing the learning rate, resulting in
|
|
\begin{align*}
|
|
E[g^2]_t & = \rho E[g^2]_{t-1} + (1-\rho) g_t^2, \\
|
|
\Delta x_t & = -\frac{\gamma}{\sqrt{E[g^2]_t + \varepsilon}} g_t,
|
|
\end{align*}
|
|
for a decay rate $\rho$.
Additionally, the fixed global learning rate $\gamma$ is substituted by
an exponentially decaying average of the past parameter updates.
The usage of the past parameter updates is motivated by ensuring that
hypothetical units of the parameter vector match those of the
parameter update $\Delta x_t$. When only using the
gradient with a scalar learning rate as in SGD, the resulting unit of
the parameter update is
\[
  \text{units of } \Delta x \propto \text{units of } g \propto
  \frac{\partial f}{\partial x} \propto \frac{1}{\text{units of } x},
\]
assuming the cost function $f$ is unitless. \textsc{AdaGrad} does not have
correct units either, since the update is given by a ratio of gradient
quantities, resulting in a unitless parameter update. If however
Hessian information or an approximation thereof is used to scale the
gradients, the unit of the updates will be correct:
\[
  \text{units of } \Delta x \propto H^{-1} g \propto
  \frac{\frac{\partial f}{\partial x}}{\frac{\partial ^2 f}{\partial
      x^2}} \propto \text{units of } x.
\]
|
|
Since using the second derivative results in correct units, Newton's
method (assuming a diagonal Hessian) is rearranged to determine the
quantities involved in the inverse of the second derivative:
\[
  \Delta x = \frac{\frac{\partial f}{\partial x}}{\frac{\partial ^2
      f}{\partial x^2}} \iff \frac{1}{\frac{\partial^2 f}{\partial
      x^2}} = \frac{\Delta x}{\frac{\partial f}{\partial x}}.
\]
As the root mean square of the past gradients is already used in the
denominator of the learning rate, an exponentially decaying root mean
square of the past updates is used to obtain a $\Delta x$ quantity for
the numerator, resulting in the correct unit of the update. The full
algorithm is given by Algorithm~\ref{alg:adadelta}.
|
|
|
|
\begin{algorithm}[H]
|
|
\SetAlgoLined
|
|
\KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
|
|
\KwInput{Initial parameter $x_1$}
|
|
Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
Compute Gradient: $g_t$\;
|
|
    Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} +
    (1-\rho)g_t^2$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
          x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
    Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
}
|
|
\caption{\textsc{AdaDelta}, \textcite{ADADELTA}}
|
|
\label{alg:adadelta}
|
|
\end{algorithm}
|
|
|
|
While the stochastic gradient algorithm is less susceptible to getting
stuck in local
extrema than gradient descent, the problem still persists, especially
for saddle points with steep .... \textcite{DBLP:journals/corr/Dauphinpgcgb14}

An approach to the problem of ``getting stuck'' in saddle points or
local minima/maxima is the addition of momentum to SGD. Instead of
using the actual gradient for the parameter update, an average over the
past gradients is used. In order to avoid the need to store the past
values, usually an exponentially decaying average is used, resulting in
Algorithm~\ref{alg:sgd_m}. This is comparable to following the path
of a marble with mass rolling down the slope of the error
function. The decay rate for the average is comparable to the inertia
of the marble.
This results in the algorithm being able to escape some local extrema due to
the momentum built up while approaching them.
|
|
|
|
% \begin{itemize}
|
|
% \item ADAM
|
|
% \item momentum
|
|
% \item ADADETLA \textcite{ADADELTA}
|
|
% \end{itemize}
|
|
|
|
|
|
\begin{algorithm}[H]
|
|
\SetAlgoLined
|
|
\KwInput{Learning Rate $\gamma$, Decay Rate $\rho$}
|
|
\KwInput{Initial parameter $x_1$}
|
|
Initialize accumulation variables $m_0 = 0$\;
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
Compute Gradient: $g_t$\;
|
|
Accumulate Gradient: $m_t \leftarrow \rho m_{t-1} + (1-\rho) g_t$\;
|
|
Compute Update: $\Delta x_t \leftarrow -\gamma m_t$\;
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
}
|
|
\caption{SGD with momentum}
|
|
\label{alg:sgd_m}
|
|
\end{algorithm}
|
|
|
|
In an effort to combine the properties of the momentum method and the
automatically adapted learning rate of \textsc{AdaDelta}, \textcite{ADAM}
developed the \textsc{Adam} algorithm, given in
Algorithm~\ref{alg:adam}. Here the exponentially decaying
root mean square of the gradients is still used for regularizing the
learning rate and combined with the momentum method. Both terms are
normalized such that the resulting estimates are the first and second
moment of the gradient. However, the term used in
\textsc{AdaDelta} to ensure correct units is dropped in favor of a scalar
global learning rate. This results in additional hyperparameters, however
the algorithm seems to be exceptionally stable with the recommended
parameters of ... and is a very reliable algorithm for training
neural networks.
However, the \textsc{Adam} algorithm can have problems with high
variance of the adaptive learning rate early in training.
\textcite{rADAM} try to address these issues with the Rectified Adam
algorithm.
\todo{do I want to include this?}
|
|
|
|
|
|
\begin{algorithm}[H]
|
|
\SetAlgoLined
|
|
\KwInput{Stepsize $\alpha$}
|
|
\KwInput{Decay Parameters $\beta_1$, $\beta_2$}
|
|
Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
Compute Gradient: $g_t$\;
|
|
Accumulate first Moment of the Gradient and correct for bias:
|
|
$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t;$\hspace{\linewidth}
|
|
$\hat{m}_t \leftarrow \frac{m_t}{1-\beta_1^t}$\;
|
|
Accumulate second Moment of the Gradient and correct for bias:
|
|
$v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)g_t^2;$\hspace{\linewidth}
|
|
$\hat{v}_t \leftarrow \frac{v_t}{1-\beta_2^t}$\;
|
|
Compute Update: $\Delta x_t \leftarrow
|
|
-\frac{\alpha}{\sqrt{\hat{v}_t + \varepsilon}}
|
|
\hat{m}_t$\;
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
}
|
|
\caption{ADAM, \cite{ADAM}}
|
|
\label{alg:adam}
|
|
\end{algorithm}
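A compact numpy sketch of the update rule in Algorithm~\ref{alg:adam}; the
gradient function and the hyperparameter values are placeholders chosen
close to commonly recommended defaults.

\begin{verbatim}
import numpy as np

def adam(grad, x, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=1000):
    """First and second moment estimates with bias correction."""
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g ** 2     # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha / np.sqrt(v_hat + eps) * m_hat
    return x
\end{verbatim}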
|
|
|
|
In order to get an understanding of the performance of the above
discussed training algorithms, the neural network given in ... has been
trained on the ... and the results are given in
Figure~\ref{fig:comp_alg}.
Here it can be seen that the \textsc{Adam} algorithm performs far better
than the other algorithms, with \textsc{AdaGrad} and \textsc{AdaDelta}
following... bla bla
|
|
|
|
|
|
\input{Figures/sdg_comparison.tex}
|
|
|
|
% \subsubsubsection{Stochastic Gradient Descent}
|
|
\clearpage
|
|
\subsection{Combating Overfitting}
|
|
|
|
% As in many machine learning applications if the model is overfit in
|
|
% the data it can drastically reduce the generalization of the model. In
|
|
% many machine learning approaches noise introduced in the learning
|
|
% algorithm in order to reduce overfitting. This results in a higher
|
|
% bias of the model but the trade off of lower variance of the model is
|
|
% beneficial in many cases. For example the regression tree model
|
|
% ... benefits greatly from restricting the training algorithm on
|
|
% randomly selected features in every iteration and then averaging many
|
|
% such trained trees inserted of just using a single one. \todo{noch
|
|
% nicht sicher ob ich das nehmen will} For neural networks similar
|
|
% strategies exist. A popular approach in regularizing convolutional neural network
|
|
% is \textit{dropout} which has been first introduced in
|
|
% \cite{Dropout}
|
|
|
|
Similarly to shallow networks, overfitting can still impact the quality of
convolutional neural networks.
Popular ways to combat this problem for a variety of models are averaging
over multiple models trained on subsets (bootstrap) or introducing
noise directly during the training (for example random forests, where a
conglomerate of decision trees benefits greatly from randomizing the
features available in each training iteration).
We explore implementations of these approaches for neural networks,
namely dropout for simulating a conglomerate of networks and
introducing noise during training by slightly altering the input
pictures.
|
|
% A popular way to combat this problem is
|
|
% by introducing noise into the training of the model.
|
|
% This can be done in a variety
|
|
% This is a
|
|
% successful strategy for ofter models as well, the a conglomerate of
|
|
% descision trees grown on bootstrapped trainig samples benefit greatly
|
|
% of randomizing the features available to use in each training
|
|
% iteration (Hastie, Bachelorarbeit??).
|
|
% There are two approaches to introduce noise to the model during
|
|
% learning, either by manipulating the model it self or by manipulating
|
|
% the input data.
|
|
\subsubsection{Dropout}
If a neural network has enough hidden nodes, there will be sets of
weights that accurately fit the training set (a proof for a small
scenario is given in ...). This especially occurs when the relation
between the input and output is highly complex, which requires a large
network to model, and the training set is limited in size (cf.\ CNNs with
few images). However, each of these sets of weights will result in
different predictions for a test set, and all of them will perform worse on
the test data than on the training data. A way to improve the predictions
and reduce the overfitting would
be to train a large number of networks and average their results (cf.\
random forests), however this is often computationally not feasible in
training as well as testing.
% Similarly to decision trees and random forests training multiple
% models on the same task and averaging the predictions can improve the
% results and combat overfitting. However training a very large
% number of neural networks is computationally expensive in training
%as well as testing.
In order to make this approach feasible,
\textcite{Dropout1} propose random dropout.
Instead of training different models, for each data point in a batch
randomly chosen nodes in the network are disabled (their output is
fixed to zero) and the updates for the weights in the remaining
smaller network are computed. The updates computed for each data
point in the batch are then accumulated and applied to the full
network.
This can be compared to many small networks, which share their weights
for their active neurons, being trained simultaneously.
For testing, the ``mean network'' with all nodes active but their
output scaled accordingly to compensate for more active nodes is
used. \todo{comparable to averaging dropout networks, example showing
  the benefit in a small case}
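A minimal numpy sketch of the idea; note that this uses the so-called
inverted variant of dropout, in which the rescaling is applied during
training rather than at test time, and the dropout probability is an
arbitrary example.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_drop=0.5, training=True):
    """Randomly disable nodes during training; use all nodes for testing."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop  # keep with prob 1 - p_drop
    return activations * mask / (1.0 - p_drop)      # rescale expected output
\end{verbatim}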
|
|
% Here for each training iteration from a before specified (sub)set of nodes
|
|
% randomly chosen ones are deactivated (their output is fixed to 0).
|
|
% During training
|
|
% Instead of using different models and averaging them randomly
|
|
% deactivated nodes are used to simulate different networks which all
|
|
% share the same weights for present nodes.
|
|
|
|
|
|
|
|
% A simple but effective way to introduce noise to the model is by
|
|
% deactivating randomly chosen nodes in a layer
|
|
% The way noise is introduced into
|
|
% the model is by deactivating certain nodes (setting the output of the
|
|
% node to 0) in the fully connected layers of the convolutional neural
|
|
% networks. The nodes are chosen at random and change in every
|
|
% iteration, this practice is called Dropout and was introduced by
|
|
% \textcite{Dropout}.
|
|
|
|
\subsubsection{\titlecap{manipulation of input data}}
Another way to combat overfitting is to keep the network from memorizing
the dataset by manipulating the inputs randomly for each iteration of
training. This is commonly used in image based tasks, as there are
often ways to manipulate the input while still being sure the labels
remain the same. For example, in an image classification task such as
handwritten digits the associated label should remain correct when the
image is rotated or stretched by a small amount.
When using this one has to be sure that the labels indeed remain the
same, or else the network will not learn the desired ...
In the case of handwritten digits, for example, a too high rotation angle
will ... a nine or six.
The most common transformations are rotation, zoom, shear, brightness
adjustment, and mirroring.
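A sketch of how such random manipulations might be configured in Keras; the
parameter values are illustrative assumptions and not the ones used for the
experiments below.

\begin{verbatim}
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random transformations applied anew in every training iteration.
datagen = ImageDataGenerator(
    rotation_range=10,        # rotate by up to 10 degrees
    zoom_range=0.1,           # zoom in or out by up to 10 %
    shear_range=10,           # shear by up to 10 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10 %
    height_shift_range=0.1,   # shift vertically by up to 10 %
)

# Hypothetical usage with training data x_train, y_train:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=125)
\end{verbatim}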
|
|
|
|
\begin{figure}[h]
|
|
\centering
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist0.pdf}
|
|
\caption{original\\image}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_zoom.pdf}
|
|
\caption{random\\zoom}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shear.pdf}
|
|
\caption{random\\shear}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_rotation.pdf}
|
|
\caption{random\\rotation}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shift.pdf}
|
|
\caption{random\\positional shift}
|
|
\end{subfigure}
|
|
\caption[Image data generation]{Examples of the manipulations used in ... As
  all images are of the same intensity, brightness manipulation does not
  seem ... Additionally mirroring is not used for ... reasons.}
|
|
\end{figure}
|
|
|
|
In order to compare the benefits obtained from implementing these
measures, we have trained the network given in ... on the same problem
and implemented different combinations of data generation and dropout. The
results are given in Figure~\ref{fig:gen_dropout}. For each scenario the
model was trained five times and the performance measures were
averaged. It can be seen that implementing the measures does indeed
increase the performance of the model. Implementing data generation on
its own seems to have a larger impact than dropout, and applying both
increases the accuracy even further.

The better performance most likely stems from reduced overfitting. The
reduction in overfitting can be seen in
Figure~\ref{fig:gen_dropout}~(\subref{fig:gen_dropout_b}), as the training
accuracy decreases while the test accuracy increases. However, utilizing
data generation as well as dropout with a probability of 0.4 seems to
be too aggressive an approach, as the training accuracy drops below the
test accuracy\todo{brief justification}.
|
|
|
|
\input{Figures/gen_dropout.tex}
|
|
|
|
\todo{compare different dropout sizes on MNIST or similar, use a subset as
  training set?}
|
|
|
|
\clearpage
|
|
\subsubsection{\titlecap{effectiveness for small training sets}}

For some applications (e.g.\ medical problems with a small number of
patients) the available data can be highly limited.
In these problems the networks are highly prone to overfitting the
data. In order to get an understanding of the achievable accuracies and the
impact of the measures to prevent overfitting discussed above, we train
the network on datasets of varying sizes with different measures
implemented.
For training we use the MNIST handwriting dataset as well as the fashion
MNIST dataset. The fashion MNIST dataset is a benchmark set built by
\textcite{fashionMNIST} in order to provide a harder set, as state of
the art models are able to achieve accuracies of 99.88\%
(\textcite{10.1145/3206098.3206111}) on the handwriting set.
The dataset contains 70,000 preprocessed images of clothes from
Zalando; an overview is given in Figure~\ref{fig:fashionMNIST}.
|
|
|
|
\input{Figures/fashion_mnist.tex}
|
|
|
|
\afterpage{
|
|
\noindent
|
|
\begin{minipage}{\textwidth}
|
|
\small
|
|
\begin{tabu} to \textwidth {@{}l*4{X[c]}@{}}
|
|
\Tstrut \Bstrut & \textsc{Adam} & D. 0.2 & Gen & Gen.+D. 0.2 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{\titlecap{test accuracy for 1 sample}}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.5633 & 0.5312 & 0.6704 & 0.6604 \\
|
|
min & 0.3230 & 0.4224 & 0.4878 & 0.5175 \\
|
|
mean & 0.4570 & 0.4714 & 0.5862 & 0.6014 \\
|
|
var & 0.0040 & 0.0012 & 0.0036 & 0.0023 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{\titlecap{test accuracy for 10 samples}}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.8585 & 0.9423 & 0.9310 & 0.9441 \\
|
|
min & 0.8148 & 0.9081 & 0.9018 & 0.9061 \\
|
|
mean & 0.8377 & 0.9270 & 0.9185 & 0.9232 \\
|
|
var & 2.7e-4 & 1.3e-4 & 6e-05 & 1.5e-4 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{\titlecap{test accuracy for 100 samples}}\Bstrut \\
|
|
\cline{2-5}
|
|
max & 0.9637 & 0.9796 & 0.9810 & 0.9805 \\
|
|
min & 0.9506 & 0.9719 & 0.9702 & 0.9727 \\
|
|
mean & 0.9582 & 0.9770 & 0.9769 & 0.9783 \\
|
|
var & 2e-05 & 1e-05 & 1e-05 & 0 \\
|
|
\hline
|
|
\end{tabu}
|
|
\normalsize
|
|
\captionof{table}{Values of the test accuracy of the model trained
|
|
10 times
|
|
on random MNIST handwriting training sets containing 1, 10 and 100
|
|
data points per class after 125 epochs. The mean achieved accuracy
|
|
for the full set employing both overfitting measures is }
|
|
\small
|
|
\centering
|
|
\begin{tabu} to \textwidth {@{}l*4{X[c]}@{}}
|
|
\Tstrut \Bstrut & \textsc{Adam} & D. 0.2 & Gen & Gen.+D. 0.2 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{\titlecap{test accuracy for 1 sample}}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.5633 & 0.5312 & 0.6704 & 0.6604 \\
|
|
min & 0.3230 & 0.4224 & 0.4878 & 0.5175 \\
|
|
mean & 0.4570 & 0.4714 & 0.5862 & 0.6014 \\
|
|
var & 0.0040 & 0.0012 & 0.0036 & 0.0023 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{\titlecap{test accuracy for 10 samples}}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.8585 & 0.9423 & 0.9310 & 0.9441 \\
|
|
min & 0.8148 & 0.9081 & 0.9018 & 0.9061 \\
|
|
mean & 0.8377 & 0.9270 & 0.9185 & 0.9232 \\
|
|
var & 2.7e-4 & 1.3e-4 & 6e-05 & 1.5e-4 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{\titlecap{test accuracy for 100 samples}}\Bstrut \\
|
|
\cline{2-5}
|
|
max & 0.9637 & 0.9796 & 0.9810 & 0.9805 \\
|
|
min & 0.9506 & 0.9719 & 0.9702 & 0.9727 \\
|
|
mean & 0.9582 & 0.9770 & 0.9769 & 0.9783 \\
|
|
var & 2e-05 & 1e-05 & 1e-05 & 0 \\
|
|
\hline
|
|
\end{tabu}
|
|
\normalsize
|
|
\captionof{table}{Values of the test accuracy of the model trained 10 times
|
|
on random fashion MNIST training sets containing 1, 10 and 100 data points per
|
|
class. The mean achieved accuracy for the full dataset is: ....}
|
|
\end{minipage}
|
|
\clearpage % if needed/desired
|
|
}
|
|
|
|
The random datasets chosen for training are made up of a certain
number of data points for each class, which are chosen at random. The
sizes chosen for the comparisons are the full dataset and 100, 10 and 1
data points per class.

For the task of classifying the fashion data a slightly altered model
is used. The convolutional layers with filters of size 5 are replaced
by two consecutive convolutional layers with filters of size 3.
This is done in order to have more ... in order to better ... the data
in the model. A diagram of the architecture is given in
Figure~\ref{fig:fashion_MNIST}.

For both scenarios the models are trained 10 times on randomly
drawn training sets. Additionally, models of the same architecture where
a dropout layer with a dropout rate of 20\% is implemented and/or data
generation is used to augment the data during training are
considered. The values used for the data generation are given in CODE
APPENDIX.

The models are trained for 125 epochs to ensure enough random
augmentations of the input images are considered and convergence is
reached. The test accuracies of the models after training for 125
epochs are given in Figure~\ref{...} for the handwriting
and in Figure~\ref{...} for the fashion scenario. Additionally, the
average test accuracies of the models are given for each epoch in
Figure ... and Figure ...
|
|
|
|
\begin{figure}
|
|
\includegraphics[width=\textwidth]{Figures/Data/cnn_fashion_fig.pdf}
|
|
\caption{Convolutional neural network architecture used to model the
|
|
fashion MNIST dataset.}
|
|
\label{fig:fashion_MNIST}
|
|
\end{figure}
|
|
|
|
\begin{figure}[h]
|
|
\centering
|
|
\small
|
|
\begin{subfigure}[h]{\textwidth}
|
|
\begin{tikzpicture}
|
|
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
|
|
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
|
|
height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
|
|
ylabel = {Test Accuracy}, cycle
|
|
list/Dark2, every axis plot/.append style={line width
|
|
=1.25pt}]
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_1.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_02_1.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_1.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_02_1.mean};
|
|
|
|
|
|
      \addlegendentry{\footnotesize{Default}}
      \addlegendentry{\footnotesize{D. 0.2}}
      \addlegendentry{\footnotesize{G.}}
      \addlegendentry{\footnotesize{G. + D. 0.2}}
|
|
\end{axis}
|
|
\end{tikzpicture}
|
|
\caption{1 sample per class}
|
|
\vspace{0.25cm}
|
|
\end{subfigure}
|
|
\begin{subfigure}[h]{\textwidth}
|
|
\begin{tikzpicture}
|
|
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
|
|
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
|
|
height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
|
|
ylabel = {Test Accuracy}, cycle
|
|
list/Dark2, every axis plot/.append style={line width
|
|
=1.25pt}]
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_00_10.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_02_10.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_00_10.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_02_10.mean};
|
|
|
|
|
|
\addlegendentry{\footnotesize{Default.}}
|
|
\addlegendentry{\footnotesize{D. 0.2}}
|
|
\addlegendentry{\footnotesize{G.}}
|
|
\addlegendentry{\footnotesize{G + D. 0.2}}
|
|
\end{axis}
|
|
\end{tikzpicture}
|
|
\caption{10 samples per class}
|
|
\end{subfigure}
|
|
\begin{subfigure}[h]{\textwidth}
|
|
\begin{tikzpicture}
|
|
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
|
|
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
|
|
height = 0.35\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
|
|
xlabel = {epoch}, ylabel = {Test Accuracy}, cycle
|
|
list/Dark2, every axis plot/.append style={line width
|
|
=1.25pt}, ymin = {0.92}]
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_00_100.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_02_100.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_00_100.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_02_100.mean};
|
|
|
|
\addlegendentry{\footnotesize{Default.}}
|
|
\addlegendentry{\footnotesize{D. 0.2}}
|
|
\addlegendentry{\footnotesize{G.}}
|
|
\addlegendentry{\footnotesize{G + D. 0.2}}
|
|
\end{axis}
|
|
\end{tikzpicture}
|
|
\caption{100 samples per class}
|
|
\vspace{.25cm}
|
|
\end{subfigure}
|
|
\caption{Mean test accuracies per epoch for models trained on 1, 10 and 100
  samples per class.}
|
|
\label{fig:MNISTfashion}
|
|
\end{figure}
|
|
|
|
\begin{figure}[h]
|
|
\centering
|
|
\missingfigure{datagen fashion}
|
|
\caption{Sample pictures of the MNIST fashion dataset, one per class.}
|
|
\label{mnist fashion}
|
|
\end{figure}
|
|
|
|
|
|
\clearpage
|
|
\section{Conclusion}
\begin{itemize}
\item generate more data, GANs etc.\ \textcite{gan}
\item transfer learning: use a network trained on a different task and
  repurpose it / fine-tune it with the training data \textcite{transfer_learning}
\item random erasing, fashion MNIST 96.35\% accuracy \textcite{random_erasing}
\end{itemize}
|
|
|
|
|
|
|
|
|
|
%%% Local Variables:
|
|
%%% mode: latex
|
|
%%% TeX-master: "main"
|
|
%%% End:
|