\section{Application of Neural Networks to Higher Complexity Problems}
|
|
\label{sec:cnn}
|
|
This section is based on \textcite[Chapter~9]{Goodfellow}.
|
|
|
|
As neural networks are applied to problems of higher complexity, which often
results in a higher dimensionality of the input, the number of
parameters in the network rises drastically.
For very large inputs such as high-resolution image data, the
fully connected nature of the network can push the number of parameters
beyond what is feasible for training and storage.
|
|
|
|
The number of parameters for a given network size can be reduced by
|
|
using layers which are only sparsely
|
|
connected and share parameters between nodes. An effective way to
|
|
implement this is by using convolution with filters that are shared
|
|
among the nodes of a layer.
|
|
|
|
\subsection{Convolution}
|
|
|
|
Convolution is a mathematical operation in which the product of two
functions is integrated after one of them has been reversed and shifted.
|
|
|
|
\[
|
|
(f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) ds.
|
|
\]
|
|
This operation can be interpreted as a filter function $g$ being applied
to $f$:
each value $f(t)$ is replaced by an average of the values of $f$
weighted by the filter function $g$ centered at position $t$.
The convolution operation allows for a wide range of data manipulations,
a simple example being the smoothing of real-time data.
|
|
|
|
Consider a sensor measuring the location of an object (e.g. via
|
|
GPS). We expect the output of the sensor to be noisy as a result of
|
|
some factors impacting the accuracy of the measurements. In order to
|
|
get a better estimate of the actual location, we want to smooth
|
|
the data to reduce the noise.
|
|
|
|
Using convolution for this task, we
|
|
can control the significance we want to give each data-point. We
|
|
might want to give a larger weight to more recent measurements than
|
|
older ones. If we assume these measurements are taken on a discrete
|
|
timescale, we need to define convolution for discrete functions. \\Let $f$,
|
|
$g: \mathbb{Z} \to \mathbb{R}$ then
|
|
|
|
\[
|
|
(f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
|
|
\]
|
|
Applying this on the data with the filter $g$ chosen accordingly we
|
|
are
|
|
able to improve the accuracy, which can be seen in
|
|
Figure~\ref{fig:sin_conv}.
|
|
\clearpage
|
|
\input{Figures/sin_conv.tex}
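To make the smoothing step concrete, the following minimal sketch applies a
discrete convolution with an exponentially decaying filter to noisy
one-dimensional measurements. The filter weights and the NumPy routines are
illustrative assumptions and not the exact setup used to produce
Figure~\ref{fig:sin_conv}.
\begin{lstlisting}[language=Python]
import numpy as np

# noisy measurements of a smooth signal on a discrete timescale
t = np.linspace(0, 2 * np.pi, 200)
f = np.sin(t) + np.random.normal(scale=0.1, size=t.size)

# filter g: exponentially decaying weights favoring recent values,
# normalized so that the weights sum to one
g = np.exp(-np.arange(10) / 3.0)
g /= g.sum()

# discrete convolution (f * g); mode='same' keeps the original length
smoothed = np.convolve(f, g, mode='same')
\end{lstlisting}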
|
|
This form of discrete convolution can also be applied to functions
|
|
with inputs of higher dimensionality. Let $f$, $g: \mathbb{Z}^d \to
|
|
\mathbb{R}$ then
|
|
|
|
\[
|
|
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d)\, g(i_1, \dots, i_d).
\]
|
|
This will prove to be a useful framework for image manipulation but
|
|
to apply convolution to images, we need to discuss the
|
|
representation of image data.
|
|
|
|
Most often images are represented
|
|
by each pixel being a mixture of base colors. These base colors define
|
|
the color-space in which the image is encoded. Commonly used
color-spaces are RGB (red, green, blue) and CMYK (cyan, magenta,
yellow, black). An example of an
image decomposed into its red, green, and blue channels is given in
|
|
Figure~\ref{fig:rgb}.
|
|
|
|
Using this encoding of the image we can define a corresponding
|
|
discrete function describing the image, by mapping the coordinates
|
|
$(x,y)$ of a pixel and the channel (color) $c$ to the respective value
|
|
$v$
|
|
\begin{align}
|
|
\begin{split}
|
|
I: \mathbb{N}^3 & \to \mathbb{R}, \\
|
|
(x,y,c) & \mapsto v.
|
|
\end{split}
|
|
\label{def:I}
|
|
\end{align}
|
|
|
|
\begin{figure}
|
|
\begin{adjustbox}{width=\textwidth}
|
|
\begin{tikzpicture}
|
|
\begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
|
|
\node[canvas is xy plane at z=0, transform shape] at (0,0)
|
|
{\includegraphics[width=5cm]{Figures/Data/klammern_r.jpg}};
|
|
\node[canvas is xy plane at z=2, transform shape] at (0,-0.2)
|
|
{\includegraphics[width=5cm]{Figures/Data/klammern_g.jpg}};
|
|
\node[canvas is xy plane at z=4, transform shape] at (0,-0.4)
|
|
{\includegraphics[width=5cm]{Figures/Data/klammern_b.jpg}};
|
|
\node[canvas is xy plane at z=4, transform shape] at (-8,-0.2)
|
|
{\includegraphics[width=5.3cm]{Figures/Data/klammern_rgb.jpg}};
|
|
\end{scope}
|
|
\end{tikzpicture}
|
|
\end{adjustbox}
|
|
\caption[Channel Separation of Color Image]{On the right the red, green, and blue channels of the picture
|
|
are displayed. In order to better visualize the color channels the
|
|
black and white picture of each channel has been colored in the
|
|
respective color. Combining the layers results in the image on the
|
|
left.}
|
|
\label{fig:rgb}
|
|
\end{figure}
|
|
|
|
With this representation of an image as a function, we can apply
|
|
filters to the image using convolution for multidimensional functions
|
|
as described~above. To simplify the notation, we will write
|
|
the function $I$ given in (\ref{def:I}) as well as the filter-function $g$
|
|
as tensors from now on, resulting in the modified notation of
|
|
convolution
|
|
|
|
\[
|
|
(I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
|
|
\]
|
|
|
|
As images are finite in size, the convolution is not well defined for
pixels too close to the border.
Thus the output will be of reduced size. With $s_i$ being the size of
the input in dimension $d$ and $s_k$ being the size of the kernel in
dimension $d$, the size of the output in dimension $d$ is $s_i - s_k
+ 1$. For example, convolving a $28 \times 28$ image with a $5 \times 5$
kernel yields a $24 \times 24$ output.
|
|
% with the new size in each
|
|
% dimension $d$ being \textit{(size of input in dimension $d$) -
|
|
% (size of kernel in dimension $d$) + 1}. \todo{den dims namen geben
|
|
% formal in eine zeile}
|
|
In order to obtain outputs of the same size as the input, the
image can be padded in each dimension with zero entries, which ensures that the
convolution is well defined for all pixels of the image.
|
|
|
|
Simple examples of image manipulation using
convolution are smoothing operations or
rudimentary edge detection in gray-scale images, i.e. images with only
one channel. A filter often used to smooth or blur images
is the Gaussian filter, which for a given $\sigma \in \mathbb{R}_+$ and
size $s \in \mathbb{N}$ is
defined as
|
|
\[
|
|
G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2
|
|
\sigma^2}}, ~ x,y \in \left\{1,\dots,s\right\}.
|
|
\]
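As a concrete illustration, the following sketch builds such a Gaussian kernel
in NumPy. Centering the kernel on the filter window and normalizing its
entries to sum to one are common conventions assumed here; the definition
above does neither explicitly.
\begin{lstlisting}[language=Python]
import numpy as np

def gauss_kernel(s, sigma):
    # grid of coordinates 1,...,s in both directions
    x, y = np.meshgrid(np.arange(1, s + 1), np.arange(1, s + 1))
    c = (s + 1) / 2.0   # center of the filter window (assumption)
    G = np.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
    G /= 2 * np.pi * sigma ** 2
    return G / G.sum()  # normalize so the filter preserves brightness
\end{lstlisting}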
|
|
\pagebreak[4]
|
|
|
|
\noindent An effective filter for edge detection purposes is the Sobel operator. Here two
|
|
filters are applied to the
|
|
image $I$ and then the outputs are combined. Edges in the $x$ direction are detected
|
|
by convolution with
|
|
\[
|
|
G =\left[
|
|
\begin{matrix}
|
|
-1 & 0 & 1 \\
|
|
-2 & 0 & 2 \\
|
|
-1 & 0 & 1
|
|
\end{matrix}\right],
|
|
\]
|
|
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by
|
|
\[
|
|
O = \sqrt{(I * G)^2 + (I*G^T)^2}
|
|
\]
|
|
where $\sqrt{\cdot}$ and $\cdot^2$ are applied component-wise. Examples
|
|
for convolution of an image with both kernels are given
|
|
in Figure~\ref{fig:img_conv}.
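A minimal sketch of this procedure on a gray-scale image, using SciPy's
two-dimensional convolution; the zero padding via \texttt{mode='same'} is an
assumption made to keep the output the same size as the input.
\begin{lstlisting}[language=Python]
import numpy as np
from scipy.signal import convolve2d

def sobel_edges(image):
    # image: 2D numpy array (gray-scale)
    G = np.array([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]])
    gx = convolve2d(image, G,   mode='same', boundary='fill')  # x-direction
    gy = convolve2d(image, G.T, mode='same', boundary='fill')  # y-direction
    return np.sqrt(gx ** 2 + gy ** 2)  # combined edge strength, component-wise
\end{lstlisting}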
|
|
\begin{figure}[H]
|
|
\centering
|
|
\begin{subfigure}{0.27\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/klammern.jpg}
|
|
\caption{\small Original Picture\\~}
|
|
\label{subf:OrigPicGS}
|
|
\end{subfigure}
|
|
\hspace{0.02\textwidth}
|
|
\begin{subfigure}{0.27\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv9.png}
|
|
\caption{\small Gaussian Blur $\sigma^2 = 1$}
|
|
\end{subfigure}
|
|
\hspace{0.02\textwidth}
|
|
\begin{subfigure}{0.27\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv10.png}
|
|
\caption{\small Gaussian Blur $\sigma^2 = 4$}
|
|
\end{subfigure}\\
|
|
\begin{subfigure}{0.27\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv4.png}
|
|
\caption{\small Sobel Operator $x$-direction}
|
|
\end{subfigure}
|
|
\hspace{0.02\textwidth}
|
|
\begin{subfigure}{0.27\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv5.png}
|
|
\caption{\small Sobel Operator $y$-direction}
|
|
\end{subfigure}
|
|
\hspace{0.02\textwidth}
|
|
\begin{subfigure}{0.27\textwidth}
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/image_conv6.png}
|
|
\caption{\small Sobel Operator combined}
|
|
\end{subfigure}
|
|
% \begin{subfigure}{0.24\textwidth}
|
|
% \centering
|
|
% \includegraphics[width=\textwidth]{Figures/Data/image_conv6.png}
|
|
% \caption{test}
|
|
% \end{subfigure}
|
|
\vspace{-0.1cm}
|
|
\caption[Convolution Applied on Image]{Convolution of original gray-scale Image (a) with different
|
|
kernels. In (b) and (c) Gaussian kernels of size 11 and stated
|
|
$\sigma^2$ are used. In (d) to (f) the above defined Sobel Operator
|
|
kernels are used.}
|
|
\label{fig:img_conv}
|
|
\end{figure}
|
|
\vspace{-0.2cm}
|
|
\clearpage
|
|
\subsection{Convolutional Neural Networks}
|
|
% Conventional neural network as described in chapter .. are made up of
|
|
% fully connected layers, meaning each node in a layer is influenced by
|
|
% all nodes of the previous layer. If one wants to extract information
|
|
% out of high dimensional input such as images this results in a very
|
|
% large amount of variables in the model. This limits the
|
|
|
|
% In conventional neural networks as described in chapter ... all layers
|
|
% are fully connected, meaning each output node in a layer is influenced
|
|
% by all inputs. For $i$ inputs and $o$ output nodes this results in $i
|
|
% + 1$ variables at each node (weights and bias) and a total $o(i + 1)$
|
|
% variables. For large inputs like image data the amount of variables
|
|
% that have to be trained in order to fit the model can get excessive
|
|
% and hinder the ability to train the model due to memory and
|
|
% computational restrictions. By using convolution we can extract
|
|
% meaningful information such as edges in an image with a kernel of a
|
|
% small size $k$ in the tens or hundreds independent of the size of the
|
|
% original image. Thus for a large image $k \cdot i$ can be several
|
|
% orders of magnitude smaller than $o\cdot i$ .
|
|
|
|
As seen in the previous section, convolution lends itself to the
manipulation of images and other large data, which motivates its usage in
neural networks.
This is achieved by implementing convolutional layers, in which several
trainable filters are applied to the input.
|
|
|
|
Each node in such a layer corresponds to a pixel of the output of
convolution with one of these filters, to which a bias and an activation
function are applied.
Depending on the sizes involved this can drastically reduce the number of
variables compared to fully connected layers.
As the filter weights are shared among all nodes, a
convolutional layer with input size $s_i$, output size $s_o$, and
$n$ filters of size $f$ contains $n f + s_o$ parameters, whereas a
fully connected layer has $(s_i + 1) s_o$ trainable weights.
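To put these quantities into perspective, consider an illustrative example
(the sizes are chosen for illustration only and do not correspond to a
specific model in this thesis): a convolutional layer with $n = 32$ filters of
size $f = 5 \cdot 5 = 25$ applied to a $28 \times 28$ gray-scale input
($s_i = 784$), producing padded outputs of total size
$s_o = 32 \cdot 784 = 25\,088$, contains
$32 \cdot 25 + 25\,088 = 25\,888$ parameters, whereas a fully connected layer
with the same input and output sizes would require
$(784 + 1) \cdot 25\,088 \approx 1.97 \cdot 10^7$ trainable weights.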
|
|
|
|
The usage of multiple filters results in multiple outputs of the same
|
|
size as the input (or slightly smaller if no padding is used). These
|
|
are often called (convolution) channels.
|
|
|
|
Filters in layers that are preceded by convolutional layers are
|
|
often chosen such that the convolution channels of the input are
|
|
flattened into a single layer. This prevents gaining additional
|
|
dimensions with each convolutional layer.
|
|
To accomplish this, no padding is used in the direction of the
convolution channels and the filter size in that direction is chosen to
match the number of these channels.
|
|
% Thus filters used in convolutional networks are usually have the same
|
|
% amount of dimensions as the input or one more.
|
|
|
|
An additional way to reduce the output size is to not apply the
convolution to every pixel, but rather to specify a certain ``stride''
$s$ for each direction at which the filter $g$ is moved over the input $I$,
|
|
\[
|
|
O_{x,\dots,c} = \sum_{i,\dots,l \in \mathbb{Z}} \left(I_{(x \cdot
|
|
s_x)-i,\dots,(c \cdot s_c)-l}\right) \left(g_{i,\dots,l}\right).
|
|
\]
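For example (with illustrative numbers), moving a $5 \times 5$ filter with a
stride of two over an unpadded $28 \times 28$ input produces an output of size
$\left\lfloor (28 - 5)/2 \right\rfloor + 1 = 12$ in each direction, a quarter
of the number of output pixels obtained with a stride of one.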
|
|
|
|
The sizes and stride should be the same for all filters in a layer in
|
|
order to get a uniform tensor as output.
|
|
% The size of the filters and the way they are applied can be tuned
|
|
% while building the model should be the same for all filters in one
|
|
% layer in order for the output being of consistent size in all channels.
|
|
% It is common to reduce the d< by not applying the
|
|
% filters on each ``pixel'' but rather specify a ``stride'' $s$ at which
|
|
% the filter $g$ is moved over the input $I$
|
|
|
|
% \[
|
|
% O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
|
|
% \]
|
|
|
|
% As seen convolution lends itself for image manipulation. In this
|
|
% chapter we will explore how we can incorporate convolution in neural
|
|
% networks, and how that might be beneficial.
|
|
|
|
% Convolutional Neural Networks as described by ... are made up of
|
|
% convolutional layers, pooling layers, and fully connected ones. The
|
|
% fully connected layers are layers in which each input node is
|
|
% connected to each output node which is the structure introduced in
|
|
% chapter ...
|
|
|
|
% In a convolutional layer instead of combining all input nodes for each
|
|
% output node, the input nodes are interpreted as a tensor on which a
|
|
% kernel is applied via convolution, resulting in the output. Most often
|
|
% multiple kernels are used, resulting in multiple output tensors. These
|
|
% kernels are the variables, which can be altered in order to fit the
|
|
% model to the data. Using multiple kernels it is possible to extract
|
|
% different features from the image (e.g. edges -> sobel).
|
|
|
|
As a means to further reduce the size towards the final layer, convolutional
|
|
layers are often followed by a pooling layer.
|
|
In a pooling layer, the input is
|
|
reduced in size by extracting a single value from a
|
|
neighborhood of pixels, often by taking the maximum value in the
|
|
neighborhood (max-pooling). The resulting output size is dependent on
|
|
the offset (stride) of the neighborhoods used.
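A small numeric example of $2 \times 2$ max-pooling with a stride of two
(the values are chosen for illustration only):
\[
  \left[
    \begin{matrix}
      1 & 3 & 2 & 0 \\
      5 & 2 & 1 & 4 \\
      0 & 1 & 6 & 2 \\
      3 & 2 & 1 & 1
    \end{matrix}\right]
  \xrightarrow{\text{max-pool}}
  \left[
    \begin{matrix}
      5 & 4 \\
      3 & 6
    \end{matrix}\right].
\]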
|
|
The combination of convolution and pooling layers allows for
|
|
extraction of features from the input in the form of feature maps while
|
|
using relatively few parameters that need to be trained.
|
|
|
|
An example of this is given in Figure~\ref{fig:feature_map}, where
intermediary outputs of a small convolutional neural network, consisting
of two convolutional and pooling layers with one filter each, followed
by two fully connected layers, are shown.
|
|
|
|
|
|
\begin{figure}[h]
|
|
\renewcommand{\thesubfigure}{\alph{subfigure}1}
|
|
\centering
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist0bw.pdf}
|
|
%\caption{input}
|
|
\caption{input}
|
|
\end{subfigure}
|
|
\hfill
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/conv2d_2_5.pdf}
|
|
\caption{\hspace{-1pt}convolution}
|
|
\end{subfigure}
|
|
\hfill
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_2_5.pdf}
|
|
\caption{max-pool}
|
|
\end{subfigure}
|
|
\hfill
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/conv2d_3_5.pdf}
|
|
\caption{\hspace{-1pt}convolution}
|
|
\end{subfigure}
|
|
\hfill
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_3_5.pdf}
|
|
\caption{max-pool}
|
|
\end{subfigure}
|
|
\centering
|
|
\setcounter{subfigure}{0}
|
|
\renewcommand{\thesubfigure}{\alph{subfigure}2}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist1bw.pdf}
|
|
\caption{input}
|
|
\end{subfigure}
|
|
\hfill
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/conv2d_2_0.pdf}
|
|
\caption{\hspace{-1pt}convolution}
|
|
\end{subfigure}
|
|
\hfill
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_2_0.pdf}
|
|
\caption{max-pool}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/conv2d_3_0.pdf}
|
|
\caption{\hspace{-1pt}convolution}
|
|
\end{subfigure}
|
|
\hfill
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/max_pooling2d_3_0.pdf}
|
|
\caption{max-pool}
|
|
\end{subfigure}
|
|
\caption[Feature Map]{Intermediary outputs of a
|
|
convolutional neural network, starting with the input and ending
|
|
with the corresponding feature map.}
|
|
\label{fig:feature_map}
|
|
\end{figure}
|
|
|
|
% \subsubsection{Parallels to the Visual Cortex in Mammals}
|
|
|
|
% The choice of convolution for image classification tasks is not
|
|
% arbitrary. ... auge... bla bla
|
|
|
|
|
|
% \subsection{Limitations of the Gradient Descent Algorithm}
|
|
|
|
% -Hyperparameter guesswork
|
|
% -Problems navigating valleys -> momentum
|
|
% -Different scale of gradients for vars in different layers -> ADAdelta
|
|
|
|
\subsection{Stochastic Training Algorithms}
|
|
For many applications in which neural networks are used, such as
image classification or segmentation, large training data sets are
needed to capture the nuances of the
data. However, as training sets get larger, the memory required
during training grows with them.
|
|
To update the weights with the gradient descent algorithm,
|
|
derivatives of the network with respect to each
|
|
variable need to be computed for all data points.
|
|
Thus the amount of memory and computing power available limits the
|
|
size of the training data that can be efficiently used in fitting the
|
|
network.
|
|
|
|
A class of algorithms that augment the gradient descent
|
|
algorithm to lessen this problem are stochastic gradient
|
|
descent algorithms.
|
|
Here the full data set is split into smaller disjoint subsets.
|
|
Then in each iteration, a (different) subset of data is chosen to
|
|
compute the gradient (Algorithm~\ref{alg:sgd}).
|
|
The training period until each data point has been considered at least
|
|
once in
|
|
updating the parameters is commonly called an ``epoch''.
|
|
|
|
Using subsets reduces the amount of memory required for storing the
|
|
necessary values for each update, thus making it possible to use very
|
|
large training sets to fit the model.
|
|
Additionally, the noise introduced on the gradient can improve
|
|
the accuracy of the fit as stochastic gradient descent algorithms are
|
|
less likely to get stuck on local extrema.
|
|
|
|
Another important benefit of using subsets is that, depending on their size, the
gradient can be computed far quicker, which allows for more parameter updates
in the same amount of time. If the approximated gradient is close enough to the
``real'' one, this can drastically cut down the time required to
train the model to a certain degree of accuracy, or improve the accuracy achievable in a given
amount of training time.
|
|
|
|
\begin{algorithm}
|
|
\SetAlgoLined
|
|
\KwInput{Function $f$, Weights $w$, Learning Rate $\gamma$, Batch Size $B$, Loss Function $L$,
|
|
Training Data $D$, Epochs $E$.}
|
|
\For{$i \in \left\{1:E\right\}$}{
|
|
Initialize: $S \leftarrow D$\;
|
|
\While{$\abs{S} \geq B$}{
|
|
Draw $\tilde{D}$ from $S$ with $\vert\tilde{D}\vert = B$\;
|
|
Update $S$: $S \leftarrow S \setminus \tilde{D}$\;
|
|
Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
|
|
\tilde{D})}{\mathrm{d} w}$\;
|
|
Update: $w \leftarrow w - \gamma g$\;
|
|
}
|
|
\If{$S \neq \emptyset$}{
|
|
Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
|
|
S)}{\mathrm{d} w}$\;
|
|
Update: $w \leftarrow w - \gamma g$\;
|
|
}
|
|
Increment: $i \leftarrow i+1$\;
|
|
}
|
|
\caption{Stochastic gradient descent.}
|
|
\label{alg:sgd}
|
|
\end{algorithm}
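A minimal sketch of this procedure in plain NumPy is given below. It assumes a
function \texttt{gradient(w, batch)} returning the gradient of the loss over a
batch; the function name and the batching details are illustrative and not the
implementation used for the experiments below.
\begin{lstlisting}[language=Python]
import numpy as np

def sgd(w, data, gradient, lr=0.01, batch_size=32, epochs=10):
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)        # shuffle the data set
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            g = gradient(w, batch)              # gradient on the subset only
            w = w - lr * g                      # parameter update
    return w
\end{lstlisting}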
|
|
|
|
To illustrate this behavior, we modeled a convolutional neural
|
|
network to classify handwritten digits. The data set used for this is the
|
|
MNIST database of handwritten digits (\textcite{MNIST},
|
|
Figure~\ref{fig:MNIST}).
|
|
|
|
The network used consists of two convolution and max-pooling layers
|
|
followed by one fully connected hidden layer and the output layer.
|
|
Both convolutional layers utilize square filters of size five which are
|
|
applied with a stride of one.
|
|
The first layer consists of 32 filters and the second of 64. Both
|
|
pooling layers pool a $2\times 2$ area with a stride of two in both
|
|
directions. The fully connected layer
|
|
consists of 256 nodes and the output layer of 10, one for each digit.
|
|
All layers use a ReLU (\ref{eq:relu}) as activation function, except the output layer
|
|
which uses softmax (\ref{eq:softmax}).
|
|
Categorical cross entropy (\ref{eq:cross_entropy}) is used as the loss function.
|
|
The architecture of the convolutional neural network is summarized in
|
|
Figure~\ref{fig:mnist_architecture}.
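A Keras-style sketch of this architecture is given below. It follows the
description above but is an illustrative approximation (e.g. the use of
\texttt{padding='same'} is an assumption) and not necessarily identical to the
implementation given in Listing~\ref{lst:handwriting}.
\begin{lstlisting}[language=Python]
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 5, padding='same', activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
\end{lstlisting}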
|
|
|
|
% The network is trained with gradient descent and stochastic gradient
|
|
% descent five times for ... epochs. The reluts
|
|
The results of the network being trained with gradient descent and
|
|
stochastic gradient descent for 20 epochs are given in
|
|
Figure~\ref{fig:sgd_vs_gd}.
|
|
Here it can be seen that the network trained with stochastic gradient
|
|
descent is more accurate after the first epoch than the ones trained
|
|
with gradient descent after 20 epochs.
|
|
This is due to the former using a batch size of 32 and thus having
made 1,875 updates to the weights
after the first epoch, in comparison to just one update. While each of
|
|
these updates only uses an approximate
|
|
gradient calculated on the subset it performs far better than the
|
|
network using true gradients when training for the same amount of
|
|
time.
|
|
\vfill
|
|
\input{Figures/mnist.tex}
|
|
\vfill
|
|
\begin{figure}[h]
|
|
\includegraphics[width=\textwidth]{Figures/Data/convnet_fig.pdf}
|
|
\caption[CNN Architecture for MNIST Handwritten
|
|
Digits]{Convolutional neural network architecture used to model the
|
|
MNIST handwritten digits data set. This figure was created with
|
|
help of the
|
|
{\sffamily{draw\textunderscore convnet}} Python script by \textcite{draw_convnet}.}
|
|
\label{fig:mnist_architecture}
|
|
\end{figure}
|
|
|
|
\input{Figures/SGD_vs_GD.tex}
|
|
\clearpage
|
|
\subsection{Modified Stochastic Gradient Descent}
|
|
This section is based on \textcite{ruder}, \textcite{ADAGRAD},
|
|
\textcite{ADADELTA}, and \textcite{ADAM}.
|
|
|
|
While stochastic gradient descent can work quite well in fitting
models, its sensitivity to the learning rate $\gamma$ is an inherent
problem.
|
|
It is necessary to find an appropriate learning rate for each problem
|
|
which is largely guesswork. The impact of choosing a bad learning rate
|
|
can be seen in Figure~\ref{fig:sgd_vs_gd}.
|
|
% There is a inherent problem in the sensitivity of the gradient descent
|
|
% algorithm regarding the learning rate $\gamma$.
|
|
% The difficulty of choosing the learning rate can be seen
|
|
% in Figure~\ref{sgd_vs_gd}.
|
|
For small rates the progress in each iteration is small
|
|
but for learning rates too large the algorithm can become unstable.
|
|
This is caused by updates being larger than the parameters themselves
|
|
which can result in the parameters diverging to infinity.
|
|
|
|
Even for learning rates small enough to ensure the parameters
|
|
do not diverge to infinity, steep valleys in the function to be
|
|
minimized can hinder the progress of
|
|
the algorithm.
|
|
If the bottom of the valley slopes gently towards the minimum,
the steep walls of the valley can result in the
algorithm ``bouncing'' between them rather than
following the downward trend.
|
|
|
|
A possible way to combat this is to alter the learning
|
|
rate over the course of training. This is often called learning rate
|
|
scheduling.
|
|
The three most popular implementations of this are:
|
|
\begin{itemize}
|
|
\item Time-based decay, where $d$ is the decay parameter and $n$ is the number of epochs
|
|
\[
|
|
\gamma_{n+1} = \frac{\gamma_n}{1 + d n}.
|
|
\]
|
|
\item Step based decay, where the learning rate is fixed for a span of $r$
|
|
epochs and then decreased according to parameter $d$
|
|
\[
|
|
\gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor}.
|
|
\]
|
|
\item Exponential decay, where the learning rate is decreased after each epoch
|
|
\[
|
|
\gamma_n = \gamma_0 e^{-n d}.
|
|
\]
|
|
\end{itemize}
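A minimal sketch of these three schedules as plain Python functions (the
parameter names mirror the formulas above and are otherwise illustrative):
\begin{lstlisting}[language=Python]
import math

def time_based(gamma_n, d, n):
    # applied iteratively: gamma_{n+1} = gamma_n / (1 + d * n)
    return gamma_n / (1 + d * n)

def step_based(gamma_0, d, r, n):
    return gamma_0 * d ** math.floor((n + 1) / r)

def exponential(gamma_0, d, n):
    return gamma_0 * math.exp(-n * d)
\end{lstlisting}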
|
|
% time-based
|
|
% decay
|
|
% \[
|
|
% \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
|
|
% \]
|
|
% where $d$ is the decay parameter and $n$ is the number of epochs.
|
|
% Step based decay where the learning rate is fixed for a span of $r$
|
|
% epochs and then decreased according to parameter $d$
|
|
% \[
|
|
% \gamma_n = \gamma_0 d^{\text{floor}{\frac{n+1}{r}}}.
|
|
% \]
|
|
% And exponential decay where the learning rate is decreased after each epoch
|
|
% \[
|
|
% \gamma_n = \gamma_o e^{-n d}.
|
|
% \]
|
|
These methods are able to increase the accuracy of models by large
margins, as seen in the training of ResNet by \textcite{resnet}, cf. Figure~\ref{fig:resnet}.
|
|
\begin{figure}[h]
|
|
\centering
|
|
\includegraphics[width=\textwidth]{Figures/Data/7780459-fig-4-source-hires.png}
|
|
\caption[Learning Rate Decay]{Error history of convolutional neural
|
|
  network trained with learning rate decay. The drops seen at 15,000 and
  30,000 iterations correspond to changes of the learning rate. \textcite[Figure
|
|
4]{resnet}.}
|
|
\label{fig:resnet}
|
|
\end{figure}
|
|
|
|
|
|
However, stochastic gradient descent with learning rate decay is
still highly sensitive to the choice of the hyperparameters $\gamma_0$
and $d$.
Several algorithms have been developed to mitigate this problem by
regularizing the learning rate with as little
hyperparameter guesswork as possible.
|
|
|
|
In the following, we will compare three algorithms that use an adaptive
|
|
learning rate, meaning they scale the updates according to past iterations.
|
|
The algorithms are built upon each other with the adaptive gradient
|
|
algorithm (\textsc{AdaGrad}, \textcite{ADAGRAD})
|
|
laying the base~work.
|
|
Here,~for~each~parameter~update, the learning rate
is given by a constant global rate
$\gamma$ divided by the root of the sum of the squares of the past partial
derivatives with respect to this parameter. This results in a monotonically
decaying learning rate with faster
decay for parameters with large updates, whereas
parameters with small updates experience a smaller decay.
|
|
The \textsc{AdaGrad}
|
|
algorithm is given in Algorithm~\ref{alg:ADAGRAD}. Note that while
|
|
this algorithm is still based upon the idea of gradient descent it no
|
|
longer takes steps in the direction of the gradient while
|
|
updating. Due to the individual learning rates for each parameter, only
|
|
the direction or sign for single parameters remains the same compared to
|
|
gradient descent.
|
|
|
|
\begin{algorithm}[H]
|
|
\SetAlgoLined
|
|
\KwInput{Global learning rate $\gamma$}
|
|
\KwInput{Constant $\varepsilon$}
|
|
\KwInput{Initial parameter vector $x_1 \in \mathbb{R}^p$}
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
Compute Gradient: $g_t$\;
|
|
Compute Update: $\Delta x_{t,i} \leftarrow
|
|
-\frac{\gamma}{\norm{g_{1:t,i}}_2 + \varepsilon} g_{t,i}, \forall i =
|
|
1, \dots,p$\;
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
}
|
|
\caption{\textsc{AdaGrad}}
|
|
\label{alg:ADAGRAD}
|
|
\end{algorithm}
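A per-parameter sketch of the \textsc{AdaGrad} update in NumPy, accumulating
the squared gradients instead of storing the full history (the variable names
are illustrative):
\begin{lstlisting}[language=Python]
import numpy as np

def adagrad_step(x, g, acc, gamma=0.01, eps=1e-7):
    # acc accumulates the squared partial derivatives, one entry per parameter,
    # so sqrt(acc) equals the 2-norm of the past gradients
    acc += g ** 2
    x -= gamma / (np.sqrt(acc) + eps) * g
    return x, acc
\end{lstlisting}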
|
|
|
|
Building on \textsc{AdaGrad}, \textcite{ADADELTA} developed the
|
|
\textsc{AdaDelta} algorithm
|
|
to improve upon the two main drawbacks of \textsc{AdaGrad}, being the
|
|
continuous decay of the learning rate and the need for a manually
|
|
selected global learning rate $\gamma$.
|
|
As \textsc{AdaGrad} uses division by the accumulated squared gradients, the learning rate will
eventually become vanishingly small.
Instead of summing the squared gradients, an exponentially decaying
average of the past squared gradients is used to regularize the
learning rate
|
|
% In order to ensure that even after a significant of iterations
|
|
% learning continues to make progress instead of summing the squared gradients a
|
|
% exponentially decaying average of the past squared gradients is used to for
|
|
% regularizing the learning rate resulting in
|
|
\begin{align*}
|
|
E[g^2]_t & = \rho E[g^2]_{t-1} + (1-\rho) g_t^2, \\
|
|
\Delta x_t & = -\frac{\gamma}{\sqrt{E[g^2]_t + \varepsilon}} g_t,
|
|
\end{align*}
|
|
for a decay rate $\rho$. This is done to ensure that even after a
|
|
significant amount of iterations learning can make progress.
|
|
|
|
Additionally, the fixed global learning rate $\gamma$ is substituted by
an exponentially decaying average of the past parameter updates.
The usage of the past parameter updates is motivated by ensuring that
the hypothetical units of the parameter vector match those of the
parameter update $\Delta x_t$. When only using the
gradient with a scalar learning rate as in SGD, the resulting unit of
the parameter update is:
|
|
\[
|
|
\text{units of } \Delta x \propto \text{units of } g \propto
|
|
\frac{\partial f}{\partial x} \propto \frac{1}{\text{units of } x},
|
|
\]
|
|
assuming the cost function $f$ is unitless. \textsc{AdaGrad} does not
have correct units either, since the update is given by a ratio of gradient
quantities, resulting in a unitless parameter update. If, however,
Hessian information or an approximation thereof is used to scale the
gradients, the units of the updates will be correct:
|
|
\[
|
|
\text{units of } \Delta x \propto H^{-1} g \propto
|
|
\frac{\frac{\partial f}{\partial x}}{\frac{\partial ^2 f}{\partial
|
|
x^2}} \propto \text{units of } x
|
|
\]
|
|
Since using the second derivative results in correct units, Newton's
method (assuming a diagonal Hessian) is rearranged to determine the
quantities involved in the inverse of the second derivative:
|
|
\[
|
|
\Delta x = \frac{\frac{\partial f}{\partial x}}{\frac{\partial ^2
|
|
f}{\partial x^2}} \iff \frac{1}{\frac{\partial^2 f}{\partial
|
|
x^2}} = \frac{\Delta x}{\frac{\partial f}{\partial x}}.
|
|
\]
|
|
As the root mean square of the past gradients is already used in the
|
|
denominator of the learning rate an exponentially decaying root mean
|
|
square of the past updates is used to obtain a $\Delta x$ quantity for
|
|
the denominator resulting in the correct unit of the update. The full
|
|
algorithm is given in Algorithm~\ref{alg:adadelta}.
|
|
|
|
\begin{algorithm}[H]
|
|
\SetAlgoLined
|
|
\KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
|
|
\KwInput{Initial parameter $x_1$}
|
|
Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
|
|
Compute Gradient: $g_t$\;
|
|
Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} +
|
|
(1-\rho)g_t^2$\;
|
|
Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
|
|
x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
|
|
Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
|
|
x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
}
|
|
\caption{\textsc{AdaDelta}, \textcite{ADADELTA}}
|
|
\label{alg:adadelta}
|
|
\end{algorithm}
|
|
|
|
While the stochastic gradient descent algorithm is less susceptible to getting
stuck in local
extrema than gradient descent, the problem persists, especially
for saddle points (\textcite{DBLP:journals/corr/Dauphinpgcgb14}).
|
|
|
|
An approach to the problem of ``getting stuck'' in saddle points or
local minima/maxima is the addition of momentum to SGD. Instead of
using the actual gradient for the parameter update, an average over the
past gradients is used.
|
|
Usually, an exponentially decaying average is used to avoid the need to
|
|
hold the past values in memory, resulting in Algorithm~\ref{alg:sgd_m}.
|
|
% In order to avoid the need to hold the past
|
|
% values in memory usually a exponentially decaying average is used resulting in
|
|
% Algorithm~\ref{alg:sgd_m}.
|
|
This is comparable to following the path
|
|
of a marble with mass rolling down the slope of the error
|
|
function. The decay rate for the average is comparable to the inertia
|
|
of the marble.
|
|
This results in the algorithm being able to escape some local extrema due to the
momentum built up while approaching them.
|
|
|
|
% \begin{itemize}
|
|
% \item ADAM
|
|
% \item momentum
|
|
% \item ADADETLA \textcite{ADADELTA}
|
|
% \end{itemize}
|
|
|
|
|
|
\begin{algorithm}[H]
|
|
\SetAlgoLined
|
|
\KwInput{Learning Rate $\gamma$, Decay Rate $\rho$}
|
|
\KwInput{Initial parameter $x_1$}
|
|
Initialize accumulation variables $m_0 = 0$\;
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
Compute Gradient: $g_t$\;
|
|
Accumulate Gradient: $m_t \leftarrow \rho m_{t-1} + (1-\rho) g_t$\;
|
|
Compute Update: $\Delta x_t \leftarrow -\gamma m_t$\;
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
}
|
|
\caption{SGD with momentum}
|
|
\label{alg:sgd_m}
|
|
\end{algorithm}
|
|
|
|
In an effort to combine the properties of the momentum method and the
automatically adapted learning rate of \textsc{AdaDelta}, \textcite{ADAM}
developed the \textsc{Adam} algorithm, given in
Algorithm~\ref{alg:adam}. Here the exponentially decaying
root mean square of the gradients is still used for regularizing the
learning rate and is
combined with the momentum method. Both terms are normalized such that
their means are the first and second moments of the gradient. However,
the term used in
\textsc{AdaDelta} to ensure correct units is dropped in favor of a scalar
global learning rate. This results in four tunable hyperparameters.
|
|
However, the
|
|
algorithm seems to be exceptionally stable with the recommended
|
|
parameters of $\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999, \varepsilon=10^{-7}$ and is a very reliable algorithm for training
|
|
neural networks.
|
|
|
|
\begin{algorithm}[H]
|
|
\SetAlgoLined
|
|
\KwInput{Stepsize $\alpha$}
|
|
\KwInput{Decay Parameters $\beta_1$, $\beta_2$}
|
|
Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
Compute Gradient: $g_t$\;
|
|
Accumulate first Moment of the Gradient and correct for bias:
|
|
$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t;$\hspace{\linewidth}
|
|
$\hat{m}_t \leftarrow \frac{m_t}{1-\beta_1^t}$\;
|
|
Accumulate second Moment of the Gradient and correct for bias:
|
|
$v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)g_t^2;$\hspace{\linewidth}
|
|
$\hat{v}_t \leftarrow \frac{v_t}{1-\beta_2^t}$\;
|
|
Compute Update: $\Delta x_t \leftarrow
|
|
-\frac{\alpha}{\sqrt{\hat{v}_t + \varepsilon}}
|
|
\hat{m}_t$\;
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
}
|
|
\caption{ADAM, \cite{ADAM}}
|
|
\label{alg:adam}
|
|
\end{algorithm}
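A NumPy sketch of a single \textsc{Adam} update with bias correction,
mirroring Algorithm~\ref{alg:adam} (the variable names are illustrative):
\begin{lstlisting}[language=Python]
import numpy as np

def adam_step(x, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # first and second moment estimates of the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    x -= alpha / np.sqrt(v_hat + eps) * m_hat
    return x, m, v
\end{lstlisting}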
|
|
|
|
To get an understanding of the performance of the training algorithms
discussed above, the neural network given in
Figure~\ref{fig:mnist_architecture} has been
trained on the MNIST handwriting data set with each of these
algorithms. For all algorithms, a global learning rate of $0.001$ is
|
|
chosen. The parameter preventing divisions by zero is set to
|
|
$\varepsilon = 10^{-7}$. For \textsc{AdaDelta} and
|
|
Momentum $\rho = 0.95$ is used as decay rate. For \textsc{Adam} the recommended
|
|
parameters are chosen.
|
|
The performance metrics of the resulting learned functions are given in
|
|
Figure~\ref{fig:comp_alg}.
|
|
|
|
Here it can be seen that \textsc{AdaDelta} is the least effective of
|
|
the algorithms for the problem. Stochastic gradient descent and
|
|
\textsc{AdaGrad} perform similarly with \textsc{AdaGrad} being slightly
|
|
faster. \textsc{Adam} and stochastic gradient
|
|
descent with momentum achieve similar accuracies. However, the model
|
|
trained with \textsc{Adam} learns the fastest and achieves the best
|
|
accuracy. Thus we will use \textsc{Adam} for the following comparisons.
|
|
\newpage
|
|
|
|
\input{Figures/sdg_comparison.tex}
|
|
|
|
\clearpage
|
|
\subsection{Combating Overfitting}
|
|
|
|
% As in many machine learning applications if the model is overfit in
|
|
% the data it can drastically reduce the generalization of the model. In
|
|
% many machine learning approaches noise introduced in the learning
|
|
% algorithm in order to reduce overfitting. This results in a higher
|
|
% bias of the model but the trade off of lower variance of the model is
|
|
% beneficial in many cases. For example the regression tree model
|
|
% ... benefits greatly from restricting the training algorithm on
|
|
% randomly selected features in every iteration and then averaging many
|
|
% such trained trees inserted of just using a single one. \todo{noch
|
|
% nicht sicher ob ich das nehmen will} For neural networks similar
|
|
% strategies exist. A popular approach in regularizing convolutional neural network
|
|
% is \textit{dropout} which has been first introduced in
|
|
% \cite{Dropout}
|
|
This section is based on \textcite{Dropout1} and \textcite{Dropout}.
|
|
As with shallow networks, overfitting can still impact the quality of
convolutional neural networks.
Effective ways to combat this problem for many models are averaging
over multiple models trained on (bootstrapped) subsets or introducing
noise directly during training.
For example, decision trees benefit greatly from averaging many trees
trained on slightly different training sets and from the
introduction of noise during training by limiting the variables
available at each iteration
(cf. \textcite[Chapter~15]{hastie01statisticallearning}).
We explore implementations of these approaches for neural networks:
dropout, which simulates a conglomerate of networks, and the
introduction of noise during training by slightly altering the input
pictures.
|
|
% A popular way to combat this problem is
|
|
% by introducing noise into the training of the model.
|
|
% This can be done in a variety
|
|
% This is a
|
|
% successful strategy for ofter models as well, the a conglomerate of
|
|
% descision trees grown on bootstrapped trainig samples benefit greatly
|
|
% of randomizing the features available to use in each training
|
|
% iteration (Hastie, Bachelorarbeit??).
|
|
% There are two approaches to introduce noise to the model during
|
|
% learning, either by manipulating the model it self or by manipulating
|
|
% the input data.
|
|
\subsubsection{Dropout}
|
|
If a neural network has enough hidden nodes, there will be sets of
weights that accurately fit the training set (a proof for a small
scenario is given in Theorem~\ref{theo:overfit}). This especially
occurs when the relation between the in- and output is highly complex,
which requires a large network to model, and the training set is
limited in size. However, each of these sets of weights will result in different
predictions for a test set and all of them will perform worse on the
test data than on the training data. A way to improve the predictions and
reduce the overfitting would be to train a large number of networks
and average their results.
However, this is often computationally not feasible in
training as well as in testing.
|
|
% Similarly to decision trees and random forests training multiple
|
|
% models on the same task and averaging the predictions can improve the
|
|
% results and combat overfitting. However training a very large
|
|
% number of neural networks is computationally expensive in training
|
|
%as well as testing.
|
|
In order to make this approach feasible
|
|
\textcite{Dropout1} propose random dropout.
|
|
Instead of training different models, for each data point in a batch
|
|
randomly chosen nodes in the network are disabled (their output is
|
|
fixed to zero) and the updates for the weights in the remaining
|
|
smaller network are computed.
|
|
After updates have been obtained this way for each data point in a batch,
|
|
the updates are accumulated and applied to the full network.
|
|
This can be compared to simultaneously training many small networks
which share the weights of their active neurons.
For testing, the ``mean network'' with all nodes active is used, but the
output of the nodes is scaled accordingly to compensate for more nodes
being active.
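A minimal sketch of dropout applied to the activations of a single layer,
following the description above (frameworks such as Keras provide a
\texttt{Dropout} layer that handles the train/test distinction internally):
\begin{lstlisting}[language=Python]
import numpy as np

def dropout_layer(a, p_drop, training):
    # a: activations of a layer, p_drop: probability of disabling a node
    if training:
        mask = np.random.random(a.shape) >= p_drop
        return a * mask                # disabled nodes output zero
    # "mean network": all nodes active, outputs scaled to compensate
    return a * (1 - p_drop)
\end{lstlisting}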
|
|
%\todo{comparable to averaging dropout networks, beispiel für
|
|
% besser in kleinem fall}
|
|
% Here for each training iteration from a before specified (sub)set of nodes
|
|
% randomly chosen ones are deactivated (their output is fixed to 0).
|
|
% During training
|
|
% Instead of using different models and averaging them randomly
|
|
% deactivated nodes are used to simulate different networks which all
|
|
% share the same weights for present nodes.
|
|
|
|
|
|
|
|
% A simple but effective way to introduce noise to the model is by
|
|
% deactivating randomly chosen nodes in a layer
|
|
% The way noise is introduced into
|
|
% the model is by deactivating certain nodes (setting the output of the
|
|
% node to 0) in the fully connected layers of the convolutional neural
|
|
% networks. The nodes are chosen at random and change in every
|
|
% iteration, this practice is called Dropout and was introduced by
|
|
% \textcite{Dropout}.
|
|
|
|
\subsubsection{Manipulation of Input Data}
|
|
Another way to combat overfitting is to randomly alter the training
|
|
inputs for each iteration of training.
|
|
% This is done keep the network from
|
|
% ``memorizing'' the training data rather than learning the relation
|
|
% between in- and output.
|
|
This can often be used in image-based tasks, as there are
often ways to manipulate the input while being sure the labels
remain the same. For example, in an image classification task such as
handwritten digits, the associated label should remain correct when the
image is rotated or stretched by a small amount.
When applying this, one has to ensure that the alterations are
reasonable in the context of the data, or else the network might learn
false connections between in- and output.
In the case of handwritten digits, for example, a rotation angle that is too large
will make the distinction between a nine and a six hard and will lessen
the quality of the learned function.
The most common transformations are rotation, zoom, shear, brightness
adjustment, and mirroring. Examples of these are given in Figure~\ref{fig:datagen}. In
the following, this practice will be referred to as data generation.
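A sketch of such random input manipulations using the Keras
\texttt{ImageDataGenerator} is given below; the concrete ranges are
illustrative assumptions, while the parameters actually used are given in
Listing~\ref{lst:handwriting}.
\begin{lstlisting}[language=Python]
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=10,       # random rotation (degrees)
    zoom_range=0.1,          # random zoom
    shear_range=10,          # random shear (degrees)
    width_shift_range=0.1,   # random horizontal shift
    height_shift_range=0.1)  # random vertical shift

# yields batches of randomly altered images during training, e.g.
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=125)
\end{lstlisting}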
|
|
|
|
\begin{figure}[h]
|
|
\centering
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist0.pdf}
|
|
\caption{original\\image}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_zoom.pdf}
|
|
\caption{random\\zoom}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shear.pdf}
|
|
\caption{random\\shear}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_rotation.pdf}
|
|
\caption{random\\rotation}
|
|
\end{subfigure}
|
|
\begin{subfigure}{0.19\textwidth}
|
|
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shift.pdf}
|
|
\caption{random\\positional shift}
|
|
\end{subfigure}
|
|
\caption[Image Data Generation]{Example for the manipulations used in
|
|
later comparisons. Brightness manipulation and mirroring are not
|
|
used, as the images are equal in brightness and digits are not
|
|
invariant to mirroring.}
|
|
\label{fig:datagen}
|
|
\end{figure}
|
|
|
|
\subsubsection{Comparisons}
|
|
|
|
To compare the benefits obtained from implementing these
|
|
measures, we have trained the network given in
Figure~\ref{fig:mnist_architecture} on the handwriting recognition problem
with different combinations of data generation and dropout implemented. The results
|
|
are given in Figure~\ref{fig:gen_dropout}. For each scenario, the
|
|
model was trained five times and the performance measures were
|
|
averaged.
|
|
|
|
It can be seen that implementing the measures does indeed
|
|
increase the performance of the model.
|
|
Using data generation to alter the training data seems to have a
larger impact than dropout; however, utilizing both measures yields the
best results.
|
|
%\todo{auf zahlen in tabelle verweisen?}
|
|
|
|
% Implementing data generation on
|
|
% its own seems to have a larger impact than dropout and applying both
|
|
% increases the accuracy even further.
|
|
|
|
The better performance most likely stems from reduced overfitting. The
reduction in overfitting can be seen in
Figure~\ref{fig:gen_dropout}~(\subref{fig:gen_dropout_b}), as the training
accuracy decreases while the test accuracy increases. However, utilizing
data generation as well as dropout with a probability of 0.4 seems to
be too aggressive an approach, as the training accuracy drops below the
test accuracy.
|
|
|
|
\input{Figures/gen_dropout.tex}
|
|
|
|
\subsubsection{Effectiveness for Small Training Sets}
|
|
|
|
\label{sec:smalldata}
|
|
|
|
For some applications (e.g. medical problems with a small number of patients)
the available data can be highly limited.
In these scenarios, the networks are highly prone to overfitting the
data. To get an understanding of the achievable accuracies and the
impact of the methods aimed at mitigating overfitting discussed above, we fit
networks with different measures implemented to data sets of
varying sizes.
|
|
|
|
For training, we use the MNIST handwriting data set as well as the fashion
|
|
MNIST data set. The fashion MNIST data set is a benchmark set built by
\textcite{fashionMNIST} to provide a more challenging set, as
state-of-the-art models are able to achieve accuracies of 99.88\%
(\textcite{10.1145/3206098.3206111}) on the handwriting set.
The data set contains 70,000 preprocessed and labeled images of clothes from
|
|
Zalando. An overview is given in Figure~\ref{fig:fashionMNIST}.
|
|
|
|
\input{Figures/fashion_mnist.tex}
|
|
|
|
\afterpage{
|
|
\noindent
|
|
\begin{minipage}{\textwidth}
|
|
\small
|
|
\begin{tabu} to \textwidth {@{}l*4{X[c]}@{}}
|
|
\Tstrut \Bstrut & \textsc{Adam} & D. 0.2 & Gen & Gen.+D. 0.2 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{Test Accuracy for 1 Sample}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.5633 & 0.5312 & \textbf{0.6704} & 0.6604 \\
|
|
min & 0.3230 & 0.4224 & 0.4878 & \textbf{0.5175} \\
|
|
mean & 0.4570 & 0.4714 & 0.5862 & \textbf{0.6014} \\
|
|
var \Bstrut & 4.021e-3 & \textbf{1.175e-3} & 3.600e-3 & 2.348e-3 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{Test Accuracy for 10 Samples}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.8585 & 0.9423 & 0.9310 & \textbf{0.9441} \\
|
|
min & 0.8148 & \textbf{0.9081} & 0.9018 & 0.9061 \\
|
|
mean & 0.8377 & \textbf{0.9270} & 0.9185 & 0.9232 \\
|
|
var \Bstrut & 2.694e-4 & \textbf{1.278e-4} & 6.419e-5 & 1.504e-4 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{Test Accuracy for 100 Samples}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.9637 & 0.9796 & 0.9810 & \textbf{0.9811} \\
|
|
min & 0.9506 & 0.9719 & 0.9702 & \textbf{0.9727} \\
|
|
mean & 0.9582 & 0.9770 & 0.9769 & \textbf{0.9783} \\
|
|
var \Bstrut & 1.858e-5 & 5.778e-6 & 9.398e-6 & \textbf{4.333e-6} \\
|
|
\hline
|
|
\end{tabu}
|
|
\normalsize
|
|
\captionof{table}[Values of Test Accuracies for Models Trained on
|
|
Subsets of MNIST Handwritten Digits]{Values of the test accuracy of
|
|
the model trained 10 times on random MNIST handwritten digits
|
|
training sets containing 1, 10, and 100 data points per class after
|
|
125 epochs. The mean accuracy achieved for the full set employing
|
|
both overfitting measures is 99.58\%.}
|
|
\label{table:digitsOF}
|
|
\small
|
|
\centering
|
|
\begin{tabu} to \textwidth {@{}l*4{X[c]}@{}}
|
|
\Tstrut \Bstrut & \textsc{Adam} & D. 0.2 & Gen & Gen.+D. 0.2 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{Test Accuracy for 1 Sample}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.4885 & \textbf{0.5513} & 0.5488 & 0.5475 \\
|
|
min & 0.3710 & \textbf{0.3858} & 0.3736 & 0.3816 \\
|
|
mean \Bstrut & 0.4166 & 0.4838 & 0.4769 & \textbf{0.4957} \\
|
|
var & \textbf{1.999e-3} & 2.945e-3 & 3.375e-3 & 2.976e-3 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{Test Accuracy for 10 Samples}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.7370 & 0.7340 & 0.7236 & \textbf{0.7502} \\
|
|
min & \textbf{0.6818} & 0.6673 & 0.6709 & 0.6799 \\
|
|
mean & 0.7130 & \textbf{0.7156} & 0.7031 & 0.7136 \\
|
|
var \Bstrut & \textbf{3.184e-4} & 3.356e-4 & 3.194e-4 & 4.508e-4 \\
|
|
\hline
|
|
&
|
|
\multicolumn{4}{c}{Test Accuracy for 100 Samples}\Bstrut \\
|
|
\cline{2-5}
|
|
max \Tstrut & 0.8454 & 0.8385 & 0.8456 & \textbf{0.8459} \\
|
|
min & 0.8227 & 0.8200 & \textbf{0.8305} & 0.8274 \\
|
|
mean & 0.8331 & 0.8289 & 0.8391 & \textbf{0.8409} \\
|
|
var \Bstrut & 3.847e-5 & 4.259e-5 & \textbf{2.315e-5} & 2.769e-5 \\
|
|
\hline
|
|
\end{tabu}
|
|
\normalsize
|
|
\captionof{table}[Values of Test Accuracies for Models Trained on
|
|
Subsets of Fashion MNIST]{Values of the test accuracy of the model
|
|
trained 10 times on random fashion MNIST training sets containing
|
|
1, 10, and 100 data points per class after 125 epochs. The mean
|
|
accuracy achieved for the full set employing both overfitting
|
|
measures is 93.72\%.}
|
|
\label{table:fashionOF}
|
|
\end{minipage}
|
|
\clearpage
|
|
}
|
|
|
|
The models are trained on subsets with a certain number of randomly
chosen data points per class.
|
|
The sizes chosen for the comparisons are the full data set, 100, 10, and 1
|
|
data points per class.
|
|
|
|
For the task of classifying the fashion data a slightly altered model
|
|
is used. The convolutional layers with filters of size 5 are replaced
|
|
by two consecutive convolutional layers with filters of size 3.
|
|
\newpage
|
|
\begin{figure}[h]
|
|
\includegraphics[width=\textwidth]{Figures/Data/cnn_fashion_fig.pdf}
|
|
\caption[CNN Architecture for Fashion MNIST]{Convolutional neural
|
|
network architecture used to model the
|
|
fashion MNIST data set. This figure was created using the
|
|
draw\textunderscore convnet Python script by \textcite{draw_convnet}.}
|
|
\label{fig:fashion_MNIST}
|
|
\end{figure}
|
|
This is done in order to better accommodate
|
|
the more complex nature of the data by having
|
|
more degrees of freedom. A diagram of the architecture is given in
|
|
Figure~\ref{fig:fashion_MNIST}.
|
|
|
|
For both scenarios, the models are trained 10 times on randomly
|
|
sampled training sets.
|
|
The models are trained both without overfitting measures and with combinations
of dropout and data generation implemented. The Python implementation
|
|
of the models and the parameters used for data generation are given
|
|
in Listing~\ref{lst:handwriting} for the handwriting model and in
|
|
Listing~\ref{lst:fashion} for the fashion model.
|
|
|
|
The models are trained for 125 epochs in order
|
|
to have enough random
|
|
augmentations of the input images present during training,
|
|
for the networks to fully profit from the additional training data generated.
|
|
The test accuracies of the models after
|
|
training for 125
|
|
epochs are given in Table~\ref{table:digitsOF} for the handwritten digits
|
|
and in Table~\ref{table:fashionOF} for the fashion data sets. Additionally the
|
|
average test accuracies over the course of learning are given in
|
|
Figure~\ref{fig:plotOF_digits} for the handwriting application and
|
|
Figure~\ref{fig:plotOF_fashion} for the
|
|
fashion application.
|
|
|
|
\begin{figure}[h]
|
|
\centering
|
|
\small
|
|
\begin{subfigure}[h]{\textwidth}
|
|
\begin{tikzpicture}
|
|
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
|
|
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
|
|
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
|
|
xlabel = {Epoch},ylabel = {Test Accuracy}, cycle
|
|
list/Dark2, every axis plot/.append style={line width
|
|
=1.25pt},
|
|
ytick = {0.2,0.4,0.6},
|
|
yticklabels = {$0.2$,$0.4$,$\phantom{0}0.6$}]
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_1.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_02_1.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_1.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_02_1.mean};
|
|
|
|
|
|
\addlegendentry{\footnotesize{Default}}
|
|
\addlegendentry{\footnotesize{D. 0.2}}
|
|
\addlegendentry{\footnotesize{G.}}
|
|
\addlegendentry{\footnotesize{G. + D. 0.2}}
|
|
|
|
\end{axis}
|
|
\end{tikzpicture}
|
|
\caption{1 Sample per Class}
|
|
\vspace{0.25cm}
|
|
\end{subfigure}
|
|
\begin{subfigure}[h]{\textwidth}
|
|
\begin{tikzpicture}
|
|
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
|
|
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
|
|
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
|
|
xlabel = {Epoch},ylabel = {Test Accuracy}, cycle
|
|
list/Dark2, every axis plot/.append style={line width
|
|
=1.25pt},
|
|
ytick = {0.2,0.6,0.8},
|
|
yticklabels = {$0.2$,$0.6$,$\phantom{0}0.8$}]
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_00_10.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_02_10.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_00_10.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_02_10.mean};
|
|
|
|
|
|
\addlegendentry{\footnotesize{Default.}}
|
|
\addlegendentry{\footnotesize{D. 0.2}}
|
|
\addlegendentry{\footnotesize{G.}}
|
|
\addlegendentry{\footnotesize{G + D. 0.2}}
|
|
\end{axis}
|
|
\end{tikzpicture}
|
|
\caption{10 Samples per Class}
|
|
\end{subfigure}
|
|
\begin{subfigure}[h]{\textwidth}
|
|
\begin{tikzpicture}
|
|
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
|
|
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
|
|
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
|
|
xlabel = {Epoch}, ylabel = {Test Accuracy}, cycle
|
|
list/Dark2, every axis plot/.append style={line width
|
|
=1.25pt}, ymin = {0.92}]
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_00_100.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_dropout_02_100.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_00_100.mean};
|
|
\addplot table
|
|
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
|
|
{Figures/Data/adam_datagen_dropout_02_100.mean};
|
|
|
|
\addlegendentry{\footnotesize{Default.}}
|
|
\addlegendentry{\footnotesize{D. 0.2}}
|
|
\addlegendentry{\footnotesize{G.}}
|
|
\addlegendentry{\footnotesize{G + D. 0.2}}
|
|
\end{axis}
|
|
\end{tikzpicture}
|
|
\caption{100 Samples per Class}
|
|
\vspace{.25cm}
|
|
\end{subfigure}
|
|
\caption[Mean Test Accuracies for Subsets of MNIST Handwritten
|
|
Digits]{Mean test accuracies of the models fitting the sampled MNIST
|
|
handwriting data sets over the 125 epochs of training.}
|
|
\label{fig:plotOF_digits}
|
|
\end{figure}
|
|
|
|
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
  \begin{tikzpicture}
  \begin{axis}[legend cell align={left},
               yticklabel style={/pgf/number format/fixed, /pgf/number format/precision=3},
               tick style = {draw = none}, width = 0.9875\textwidth, height = 0.4\textwidth,
               legend style={at={(0.9825,0.0175)},anchor=south east},
               xlabel = {Epoch}, ylabel = {Test Accuracy},
               cycle list/Dark2, every axis plot/.append style={line width = 1.25pt},
               ytick = {0.2,0.3,0.4,0.5},
               yticklabels = {$0.2$,$0.3$,$0.4$,$\phantom{0}0.5$}]
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_dropout_0_1.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_dropout_2_1.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_datagen_dropout_0_1.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_datagen_dropout_2_1.mean};

    \addlegendentry{\footnotesize{Default}}
    \addlegendentry{\footnotesize{D. 0.2}}
    \addlegendentry{\footnotesize{G.}}
    \addlegendentry{\footnotesize{G. + D. 0.2}}
  \end{axis}
  \end{tikzpicture}
  \caption{1 Sample per Class}
  \vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
  \begin{tikzpicture}
  \begin{axis}[legend cell align={left},
               yticklabel style={/pgf/number format/fixed, /pgf/number format/precision=3},
               tick style = {draw = none}, width = 0.9875\textwidth, height = 0.4\textwidth,
               legend style={at={(0.9825,0.0175)},anchor=south east},
               xlabel = {Epoch}, ylabel = {Test Accuracy},
               cycle list/Dark2, every axis plot/.append style={line width = 1.25pt},
               ymin = {0.62}]
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_dropout_0_10.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_dropout_2_10.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_datagen_dropout_0_10.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_datagen_dropout_2_10.mean};

    \addlegendentry{\footnotesize{Default}}
    \addlegendentry{\footnotesize{D. 0.2}}
    \addlegendentry{\footnotesize{G.}}
    \addlegendentry{\footnotesize{G. + D. 0.2}}
  \end{axis}
  \end{tikzpicture}
  \caption{10 Samples per Class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
  \begin{tikzpicture}
  \begin{axis}[legend cell align={left},
               yticklabel style={/pgf/number format/fixed, /pgf/number format/precision=3},
               tick style = {draw = none}, width = 0.9875\textwidth, height = 0.4\textwidth,
               legend style={at={(0.9825,0.0175)},anchor=south east},
               xlabel = {Epoch}, ylabel = {Test Accuracy},
               cycle list/Dark2, every axis plot/.append style={line width = 1.25pt},
               ymin = {0.762}]
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_dropout_0_100.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_dropout_2_100.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_datagen_dropout_0_100.mean};
    \addplot table
      [x=epoch, y=val_accuracy, col sep=comma, mark = none]
      {Figures/Data/fashion_datagen_dropout_2_100.mean};

    \addlegendentry{\footnotesize{Default}}
    \addlegendentry{\footnotesize{D. 0.2}}
    \addlegendentry{\footnotesize{G.}}
    \addlegendentry{\footnotesize{G. + D. 0.2}}
  \end{axis}
  \end{tikzpicture}
  \caption{100 Samples per Class}
  \vspace{.25cm}
\end{subfigure}
\caption[Mean Test Accuracies for Subsets of Fashion MNIST]{Mean test
  accuracies of the models fitting the sampled fashion MNIST data sets
  over the 125 epochs of training.}
\label{fig:plotOF_fashion}
\end{figure}

It can be seen in Figure~\ref{fig:plotOF_digits} that for the
handwritten digits, using data generation greatly improves the
accuracy for the smallest training set of one sample per class.
While the addition of dropout only seems to have a small effect on the
accuracy of the model, it reduces the variance further than data
generation does. This reduction in variance carries over to the
combination of both measures, resulting in the overall best performing
model.

In the scenarios with 10 and 100 samples per class, the measures
improve the performance as well; however, the differences between the
overfitting measures are much smaller than in the first scenario, with
the accuracy gain of dropout being similar to that of data generation.
While the observation regarding the variances persists for the
scenario with 100 samples per class, it does not hold for the one with
10 samples per class.
In all scenarios, the addition of the measures reduces the variance of
the model.

The model fit to the fashion MNIST data set benefits less from these
measures.
For the smallest scenario of one sample per class, a substantial
increase in accuracy can be observed for both measures.
Contrary to the digits data set, dropout improves the model by a
margin similar to that of data generation.
For the larger data sets, the benefits are much smaller. While in the
scenario with 100 samples per class a performance increase can be seen
with data generation, in the scenario with 10 samples per class it
performs worse than the baseline model.
Dropout on its own seems to have a negligible impact in both the 10
and 100 sample scenarios. In all scenarios, data generation seems to
benefit from the addition of dropout.

Additional figures and tables for the same comparisons with different
performance metrics are given in Appendix~\ref{app:comp}.
There it can be seen that while the measures are able to reduce
overfitting effectively for the handwritten digits data set, the
neural networks trained on the fashion data set overfit despite these
measures being in place.

Overall, it seems that both measures are able to increase the
performance of a convolutional neural network; however, their success
depends on the problem.
For the handwritten digits, the strong result of data generation
likely stems from a large portion of the differences between two data
points of the same class being explainable by different positions,
sizes or slants, which is exactly the kind of variation data
generation emulates.

In the fashion data set, however, the alignment of the images is very
uniform, with little to no difference in size or angle between data
points, which might explain the worse performance of data generation.
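
The listing below gives a minimal sketch of how such shift-, rotation-
and zoom-based data generation and a dropout rate of 0.2 can be
combined, assuming a Keras/TensorFlow setup; the architecture and
parameter values shown here are illustrative assumptions and not
necessarily the configuration used for the experiments above.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Load MNIST and scale the grey values to [0, 1].
(x_train, y_train), (x_test, y_test) = \
    tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

# Data generation: random shifts, rotations and zooms emulate the
# positional and size differences described above.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=0.1)

# Small convolutional model with a dropout rate of 0.2 ("D. 0.2").
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train on randomly transformed batches; for the experiments above,
# x_train would be replaced by the sampled training subset.
model.fit(datagen.flow(x_train, y_train, batch_size=32),
          validation_data=(x_test, y_test), epochs=125)
\end{verbatim}
Because \texttt{datagen.flow} applies a fresh random transformation to
every batch, the network never sees exactly the same image twice
during training.
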
\clearpage
\section{Summary and Outlook}
In this thesis, we have taken a look at neural networks, their
behavior in small scenarios and their application to image
classification with limited data sets.

We have explored the relation between ridge penalized neural networks
and slightly altered cubic smoothing splines, giving us an insight
into the behavior of the functions learned by neural networks.

When comparing optimization algorithms, we have seen that choosing the
right training algorithm can have a drastic impact on the efficiency
of training and on the quality of the model obtainable in a reasonable
time frame.
The \textsc{Adam} algorithm has performed well in training the
convolutional neural networks.
However, there is ongoing research into further improving these
algorithms. For example, \textcite{rADAM} propose an alteration to the
\textsc{Adam} algorithm in order to reduce the variance of the
adaptive learning rate in the early phases of training.

We have seen that a convolutional neural network can benefit greatly
from measures combating overfitting, especially if the available
training sets are small. The success of the measures we have examined
seems to be highly dependent on the use case, and further research is
being done on the topic of combating overfitting in neural networks.
\textcite{random_erasing} propose randomly erasing parts of the input
images during training and are able to achieve a high accuracy of
96.35\% on the fashion MNIST data set this way.
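
The listing below sketches the basic idea in Python; the sampling of
the erased patch is simplified here and does not follow the exact
procedure proposed by \textcite{random_erasing}.
\begin{verbatim}
import numpy as np

def random_erase(image, erase_prob=0.5, max_frac=0.3, rng=None):
    """Overwrite a random rectangle of the image with noise.

    Assumes grey values scaled to [0, 1]; patch size and position
    are sampled in a simplified way.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() > erase_prob:
        return image
    h, w = image.shape[:2]
    eh = rng.integers(1, max(2, int(h * max_frac)))
    ew = rng.integers(1, max(2, int(w * max_frac)))
    top = rng.integers(0, h - eh)
    left = rng.integers(0, w - ew)
    erased = image.copy()
    patch_shape = (eh, ew) + image.shape[2:]
    erased[top:top + eh, left:left + ew] = rng.random(patch_shape)
    return erased
\end{verbatim}
Applied on the fly during training, this discards part of the local
information in every image, forcing the network to base its prediction
on more than a single region of the input.
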

While the data generation explored in this thesis generates new
training data in a rather rudimentary way, further research is being
done on more elaborate ways to enlarge the training set.
\textcite{gan} explore the use of generative adversarial networks to
generate training images for the task of classifying liver lesions.
These networks are trained to generate new images from random noise,
ideally resulting in completely new data that can be used in training
(cf. \textcite{goodfellow_gan}).

Overall, convolutional neural networks are able to achieve remarkable
results in many use cases and are here to stay.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: