\section{Application of NN to higher complexity Problems}

As neural networks are applied to problems of higher complexity, which often
results in a higher dimensionality of the input, the number of
parameters in the network rises drastically. For example a network
with ...
A way to combat the

\subsection{Convolution}

Convolution is a mathematical operation in which the product of two
functions is integrated after one of them has been reversed and shifted:

\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) \, ds.
\]

This operation can be described as a filter function $g$ being applied
to $f$,
as each value $f(t)$ is replaced by an average of values of $f$
weighted by $g$ around position $t$.
The convolution operation allows for versatile manipulation of data, a
simple example being the smoothing of real-time data. Consider a sensor
measuring the location of an object (e.g. via GPS). We expect the
output of the sensor to be noisy as a result of a number of factors
that impact its accuracy. In order to get a better estimate of
the actual location we want to smooth
the data to reduce the noise. Using convolution for this task, we
can control the significance we give to each data point. We
might for example want to give a larger weight to more recent measurements than
to older ones. As these measurements are taken on a discrete
timescale, we first need to introduce discrete convolution. Let $f$,
$g: \mathbb{Z} \to \mathbb{R}$, then

\[
  (f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]
Applying this to the data with an appropriately chosen filter $g$, we
are able to improve the accuracy, as can be seen in
Figure~\ref{fig:sin_conv}.
\input{Plots/sin_conv.tex}
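As a small worked example of such a smoothing, consider a hypothetical
filter that puts more weight on recent measurements. The filter is chosen
here purely for illustration and is not the one used for
Figure~\ref{fig:sin_conv}. Let $g(0) = \frac{1}{2}$, $g(1) = \frac{1}{3}$,
$g(2) = \frac{1}{6}$ and $g(i) = 0$ otherwise. The discrete convolution
then reduces to
\[
  (f * g)(t) = \frac{1}{2} f(t) + \frac{1}{3} f(t-1) + \frac{1}{6} f(t-2).
\]
For the assumed measurements $f(t-2) = 1.2$, $f(t-1) = 0.9$ and
$f(t) = 1.5$ this gives
\[
  (f * g)(t) = 0.75 + 0.3 + 0.2 = 1.25.
\]
As the weights are non-negative and sum to one, the smoothed value is a
weighted average and therefore lies between the smallest and the largest
of the measurements used.
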
This form of discrete convolution can also be applied to functions
with inputs of higher dimensionality. Let $f$, $g: \mathbb{Z}^d \to
\mathbb{R}$, then

\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
This will prove to be a useful framework for image manipulation, but
in order to apply convolution to images we first need to discuss the
representation of image data. Most often images are represented
by each pixel being a mixture of base colors. These base colors define
the color space in which the image is encoded. Commonly used
color spaces are RGB (red,
green, blue) and CMYK (cyan, magenta, yellow, black). An example of an
image split into its red, green and blue channels is given in
Figure~\ref{fig:rgb}. Using this
encoding of the image we can define a corresponding discrete function
describing the image by mapping the coordinates $(x,y)$ of a pixel
and the
channel (color) $c$ to the respective value $v$

\begin{align}
  \begin{split}
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
  \label{def:I}
\end{align}

\begin{figure}
  \begin{adjustbox}{width=\textwidth}
    \begin{tikzpicture}
      \begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
        \node[canvas is xy plane at z=0, transform shape] at (0,0)
        {\includegraphics[width=5cm]{Plots/Data/klammern_r.jpg}};
        \node[canvas is xy plane at z=2, transform shape] at (0,-0.2)
        {\includegraphics[width=5cm]{Plots/Data/klammern_g.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (0,-0.4)
        {\includegraphics[width=5cm]{Plots/Data/klammern_b.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (-8,-0.2)
        {\includegraphics[width=5.3cm]{Plots/Data/klammern_rgb.jpg}};
      \end{scope}
    \end{tikzpicture}
  \end{adjustbox}
  \caption{On the right the red, green and blue channels of the picture
    are displayed. In order to better visualize the color channels the
    black and white picture of each channel has been colored in the
    respective color. Combining the layers results in the image on the
    left.}
  \label{fig:rgb}
\end{figure}

With this representation of an image as a function, we can apply
filters to the image using convolution for multidimensional functions
as described above. In order to simplify the notation we will from now on write
the function $I$ given in (\ref{def:I}) as well as the filter function $g$
as tensors, resulting in the modified notation of
convolution

\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]

Simple examples of image manipulation using
convolution are smoothing operations or the
rudimentary detection of edges in grayscale images, that is, images with only
one channel. A popular filter for smoothing images
is the Gauss filter, which for a given $\sigma \in \mathbb{R}_+$ and
size $s \in \mathbb{N}$ is
defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2
      \sigma^2}}, ~ x,y \in \left\{1,\dots,s\right\}.
\]

For edge detection purposes the Sobel operator is widespread. Here two
filters are applied to the
image $I$ and then combined. Edges in the $x$ direction are detected
by convolution with
\[
  G =\left[
    \begin{matrix}
      -1 & 0 & 1 \\
      -2 & 0 & 2 \\
      -1 & 0 & 1
    \end{matrix}\right],
\]
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by

\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied
componentwise. Examples of convolution with both kernels are given in Figure~\ref{fig:img_conv}.
\todo{padding}

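To illustrate how the two filters respond, consider as a hypothetical
example a $3 \times 3$ grayscale patch $P$ that contains a vertical edge.
The pixel values are assumptions made for this illustration only,
\[
  P =\left[
    \begin{matrix}
      0 & 0 & 1 \\
      0 & 0 & 1 \\
      0 & 0 & 1
    \end{matrix}\right].
\]
Multiplying $P$ entrywise with $G$ and summing the entries gives
$1 + 2 + 1 = 4$, whereas doing the same with $G^T$ gives $-1 + 0 + 1 = 0$.
Convolution as defined above additionally reverses the filter, which only
changes the sign of these responses. The combined output at the center
pixel is therefore
\[
  O = \sqrt{(\pm 4)^2 + 0^2} = 4,
\]
while a patch of constant intensity yields $O = 0$, as the entries of both
filters sum to zero.
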
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/klammern.jpg}
    \caption{Original Picture}
    \label{subf:OrigPicGS}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv9.png}
    \caption{\hspace{-2pt}Gaussian Blur $\sigma^2 = 1$}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv10.png}
    \caption{Gaussian Blur $\sigma^2 = 4$}
  \end{subfigure}\\
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv4.png}
    \caption{Sobel Operator $x$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv5.png}
    \caption{Sobel Operator $y$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
    \caption{Sobel Operator combined}
  \end{subfigure}
  % \begin{subfigure}{0.24\textwidth}
  %   \centering
  %   \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
  %   \caption{test}
  % \end{subfigure}
  \caption{Convolution of the original grayscale image (a) with different
    kernels. In (b) and (c) Gaussian kernels of size 11 and stated
    $\sigma^2$ are used. In (d) - (f) the above defined Sobel operator
    kernels are used.}
  \label{fig:img_conv}
\end{figure}
\clearpage
\newpage
\subsection{Convolutional NN}
\todo{Introduction to CNN}
% Conventional neural network as described in chapter .. are made up of
% fully connected layers, meaning each node in a layer is influenced by
% all nodes of the previous layer. If one wants to extract information
% out of high dimensional input such as images this results in a very
% large amount of variables in the model. This limits the

% In conventional neural networks as described in chapter ... all layers
% are fully connected, meaning each output node in a layer is influenced
% by all inputs. For $i$ inputs and $o$ output nodes this results in $i
% + 1$ variables at each node (weights and bias) and a total $o(i + 1)$
% variables. For large inputs like image data the amount of variables
% that have to be trained in order to fit the model can get excessive
% and hinder the ability to train the model due to memory and
% computational restrictions. By using convolution we can extract
% meaningful information such as edges in an image with a kernel of a
% small size $k$ in the tens or hundreds independent of the size of the
% original image. Thus for a large image $k \cdot i$ can be several
% orders of magnitude smaller than $o\cdot i$ .

As seen in the previous section, convolution lends itself to the
manipulation of images or other large data, which motivates its usage in
neural networks.
This is achieved by implementing convolutional layers in which several
filters are applied to the input, where the values of the filters are
trainable parameters of the model.
Each node in such a layer corresponds to a pixel of the output of the
convolution with one of those filters, to which a bias and an activation
function are applied.
The usage of multiple filters results in multiple outputs of the same
size as the input. These are often called channels. Depending on the
size of the filters this can result in the output having one dimension
more than the input.
However, for convolutional layers following a convolutional layer the
size of the filter is often chosen to coincide with the number of channels
of the output of the previous layer, without using padding in this
direction, in order to prevent gaining additional
dimensions\todo{odd} in the output.
This can also be used to flatten certain less interesting channels of
the input, for example the color channels.
Thus filters used in convolutional networks usually have the same
number of dimensions as the input or one more.

The size of the filters and the way they are applied can be tuned
while building the model, but should be the same for all filters in one
layer so that the output is of consistent size in all channels.
It is common to reduce the size of the output by not applying the
filters at each ``pixel'' but rather specifying a ``stride'' $s$ with which
the filter $g$ is moved over the input $I$,

\[
  O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{sx-i,sy-j,c-l} g_{i,j,l}.
\]

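The effect of the stride on the size of the output can be made concrete
with a short calculation. As a sketch under the assumptions that no
padding is used and that the filter is only applied where it fits entirely
inside the input, an input of width $n$ convolved with a filter of width
$k$ at stride $s$ yields an output of width
\[
  \left\lfloor \frac{n - k}{s} \right\rfloor + 1.
\]
The symbols $n$ and $k$ are introduced here for illustration only. For an
input of width $n = 28$ and a filter of width $k = 5$ this gives
$23 + 1 = 24$ for $s = 1$ and
$\left\lfloor \frac{23}{2} \right\rfloor + 1 = 12$ for $s = 2$, so a
stride of two halves the width of the output in this case.
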
As seen, convolution lends itself to image manipulation. In this
chapter we will explore how we can incorporate convolution into neural
networks, and how that might be beneficial.

Convolutional neural networks as described by ... are made up of
convolutional layers, pooling layers, and fully connected ones. The
fully connected layers are layers in which each input node is
connected to each output node, which is the structure introduced in
chapter ...

In a convolutional layer, instead of combining all input nodes for each
output node, the input nodes are interpreted as a tensor to which a
kernel is applied via convolution, resulting in the output. Most often
multiple kernels are used, resulting in multiple output tensors. These
kernels are the variables which can be altered in order to fit the
model to the data. Using multiple kernels it is possible to extract
different features from the image (e.g. edges, as with the Sobel operator). As this
increases the dimensionality even further, which is undesirable as it
increases the number of variables in later layers of the model, a convolutional layer
is often followed by a pooling layer. In a pooling layer the input is
reduced in size by extracting a single value from a
neighborhood \todo{moving...}... . The resulting output size depends on
the offset of the neighborhoods used. A popular choice is max-pooling, where the
largest value in a neighborhood is used, or ...

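As a small illustration of max-pooling, assume non-overlapping
neighborhoods of size $2 \times 2$, that is an offset of two in both
directions, applied to a $4 \times 4$ input with made-up values. Each
neighborhood is replaced by its largest entry,
\[
  \left[
    \begin{matrix}
      1 & 3 & 2 & 0 \\
      4 & 2 & 1 & 1 \\
      0 & 1 & 5 & 2 \\
      2 & 2 & 3 & 4
    \end{matrix}\right]
  \mapsto
  \left[
    \begin{matrix}
      4 & 2 \\
      2 & 5
    \end{matrix}\right],
\]
which reduces the number of values passed on to the next layer from 16 to
4.
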
This construct allows for the extraction of features from the input while
using far fewer variables.

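A rough parameter count illustrates the savings. The sizes used here are
illustrative assumptions: a grayscale input of $28 \times 28$ pixels, a
fully connected layer with 256 output nodes and a convolutional layer with
32 filters of size $5 \times 5$. Counting one bias per output node and per
filter respectively, the fully connected layer requires
\[
  256 \cdot (28 \cdot 28 + 1) = 256 \cdot 785 = 200960
\]
variables, whereas the convolutional layer requires
\[
  32 \cdot (5 \cdot 5 + 1) = 832.
\]
In contrast to the fully connected layer, the count for the convolutional
layer does not depend on the size of the input image.
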
... \todo{Example with a small image, ideally the one from above}

\subsubsection{Parallels to the Visual Cortex in Mammals}

The choice of convolution for image classification tasks is not
arbitrary. ... eye ... bla bla

% \subsection{Limitations of the Gradient Descent Algorithm}

% -Hyperparameter guesswork
% -Problems navigating valleys -> momentum
% -Different scale of gradients for vars in different layers -> ADAdelta

\subsection{Stochastic Training Algorithms}

For many applications in which neural networks are used, such as
image classification or segmentation, large training data sets become
essential to capture the nuances of the
data. However, as training sets get larger the memory requirement
during training grows with them.
In order to update the weights with the gradient descent algorithm,
derivatives of the network with respect to each
variable need to be calculated for all data points in order to get the
full gradient of the error of the network.
Thus the amount of memory and computing power available limits the
size of the training data that can be efficiently used in fitting the
network. A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
descent algorithms. Here the premise is that instead of using the whole
dataset a (different) subset of the data is chosen to
compute the gradient in each iteration (Algorithm~\ref{alg:sgd}).
The training period until each data point has been considered in
updating the parameters is commonly called an ``epoch''.
Using subsets reduces the amount of memory and computing power required for
each iteration. This makes it possible to use very large training
sets to fit the model.
Additionally the noise introduced into the gradient can improve
the accuracy of the fit, as stochastic gradient descent algorithms are
less likely to get stuck in local extrema.

Another important benefit of using subsets is that, depending on their size, the
gradient can be calculated far quicker, which allows for more parameter updates
in the same time. If the approximated gradient is close enough to the
``real'' one this can drastically cut down the time required for
training the model to a certain degree of accuracy, or improve the accuracy achievable in a given
amount of training time.

\begin{algorithm}
  \SetAlgoLined
  \KwInput{Function $f$, Weights $w$, Learning Rate $\gamma$, Batch Size $B$, Loss Function $L$,
    Training Data $D$, Epochs $E$.}
  \For{$i \in \left\{1,\dots,E\right\}$}{
    $S \leftarrow D$\;
    \While{$\abs{S} \geq B$}{
      Draw $\tilde{D}$ from $S$ with $\vert\tilde{D}\vert = B$\;
      Update $S$: $S \leftarrow S \setminus \tilde{D}$\;
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        \tilde{D})}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
    \If{$S \neq \emptyset$}{
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        S)}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
  }
  \caption{Stochastic gradient descent.}
  \label{alg:sgd}
\end{algorithm}

In order to illustrate this behavior we modeled a convolutional neural
network to ... handwritten digits. The data set used for this is the
MNIST database of handwritten digits (\textcite{MNIST},
Figure~\ref{fig:MNIST}).
\input{Plots/mnist.tex}
The network used consists of two convolutional and max-pooling layers
followed by one fully connected hidden layer and the output layer.
Both convolutional layers utilize square filters of size five which are
applied with a stride of one.
The first layer consists of 32 filters and the second of 64. Both
pooling layers pool a $2\times 2$ area. The fully connected layer
consists of 256 nodes and the output layer of 10, one for each digit.
All layers except the output layer use ReLU as activation function,
with the output layer using softmax (\ref{def:softmax}).
As loss function categorical cross-entropy is used (\ref{def:...}).
The architecture of the convolutional neural network is summarized in
Figure~\ref{fig:mnist_architecture}.

\begin{figure}
  \missingfigure{network architecture}
  \caption{architecture}
  \label{fig:mnist_architecture}
\end{figure}

The results of the network being trained with gradient descent and
stochastic gradient descent for 20 epochs are given in Figure~\ref{fig:sgd_vs_gd}
and Table~\ref{table:sgd_vs_gd}.

Here it can be seen that the network trained with stochastic gradient
descent is more accurate after the first epoch than the ones trained
with gradient descent after 20 epochs.
This is due to the former using a batch size of 32 and thus having
made 1875 updates to the weights
after the first epoch, in comparison to one update. While each of
these updates uses an approximate
gradient calculated on a subset of the data, the network performs far better than the
network using true gradients when training for the same amount of time.
\todo{compare training time}

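The number of updates follows directly from the size of the training set.
Assuming the standard MNIST training split of $60000$ images (the split
used is not stated above) and a batch size of $B = 32$, the inner loop of
Algorithm~\ref{alg:sgd} is executed
\[
  \frac{60000}{32} = 1875
\]
times per epoch, with no data points left over. After $20$ epochs
stochastic gradient descent has thus performed
$20 \cdot 1875 = 37500$ updates, compared to $20$ updates for gradient
descent on the full training set.
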
\input{Plots/SGD_vs_GD.tex}
\clearpage
\subsection{\titlecap{modified stochastic gradient descent}}
An inherent problem of the stochastic gradient descent algorithm is
its sensitivity to the learning rate $\gamma$. This results in the
problem of having to find an appropriate learning rate for each problem,
which is largely guesswork. The impact of choosing a bad learning rate
can be seen in Figure~\ref{fig:sgd_vs_gd}.
% There is a inherent problem in the sensitivity of the gradient descent
% algorithm regarding the learning rate $\gamma$.
% The difficulty of choosing the learning rate can be seen
% in Figure~\ref{sgd_vs_gd}.
For small rates the progress in each iteration is small,
but as the rate is increased the algorithm can become unstable and the parameters
diverge to infinity. Even for learning rates small enough to ensure that the parameters
do not diverge to infinity, steep valleys in the function to be
minimized can hinder the progress of
the algorithm, as for learning rates that are not small enough gradient descent
``bounces between'' the walls of the valley rather than following the
downward trend of the valley.

% \[
%   w - \gamma \nabla_w ...
% \]
%thus the weights grow to infinity.
\todo{explain unstable learning rate better}

To combat this problem \todo{source} propose to decrease the learning
rate over the course of training, which is often called learning rate
scheduling. The most popular implementations of this are time based
decay
\[
  \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
\]
where $d$ is the decay parameter and $n$ is the number of epochs,
step based decay, where the learning rate is fixed for a span of $r$
epochs and then decreased according to the parameter $d$,
\[
  \gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor},
\]
and exponential decay, where the learning rate is decreased after each epoch,
\[
  \gamma_n = \gamma_0 e^{-n d}.
\]
These methods are able to increase the accuracy of a model by large
margins, as seen in the training of ResNet by \textcite{resnet}.
\todo{maybe include a figure}
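
To get a feeling for the three schedules, the following values are
obtained for an initial learning rate of $\gamma_0 = 0.1$. All parameter
values are chosen for illustration only. For time based decay with
$d = 0.5$ the recursion gives
\[
  \gamma_1 = \frac{0.1}{1 + 0} = 0.1, \quad
  \gamma_2 = \frac{0.1}{1 + 0.5} \approx 0.0667, \quad
  \gamma_3 \approx \frac{0.0667}{1 + 1} \approx 0.0333.
\]
For step based decay with $d = 0.5$ and $r = 10$ the learning rate stays
at $0.1$ for $n \leq 8$, drops to $0.05$ for $9 \leq n \leq 18$ and to
$0.025$ at $n = 19$. For exponential decay with $d = 0.1$ the learning
rate after ten epochs is
\[
  \gamma_{10} = 0.1 \, e^{-1} \approx 0.0368.
\]
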
However, stochastic gradient descent with learning rate decay is
still highly sensitive to the choice of the hyperparameters $\gamma_0$
and $d$.
In order to mitigate this problem a number of algorithms have been
developed to regularize the learning rate with as little
hyperparameter guesswork as possible.

We will examine and compare a ... algorithms that use an adaptive
learning rate.
They all scale the gradient for the update depending on past gradients
for each weight individually.

The algorithms build upon each other, with the adaptive gradient
algorithm (\textsc{AdaGrad}, \textcite{ADAGRAD})
laying the groundwork. Here for each parameter update the learning rate
is given by a constant
$\gamma$ divided by the square root of the sum of the squares of the past partial
derivatives in this parameter. This results in a monotonically
decreasing learning rate for each parameter, which
decays faster for parameters with large updates, whereas
parameters with small updates experience a smaller decay. The \textsc{AdaGrad}
algorithm is given in Algorithm~\ref{alg:ADAGRAD}.

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Global learning rate $\gamma$}
  \KwInput{Constant $\varepsilon$}
  \KwInput{Initial parameter vector $x_1 \in \mathbb{R}^p$}
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Compute Update: $\Delta x_{t,i} \leftarrow
    -\frac{\gamma}{\norm{g_{1:t,i}}_2 + \varepsilon} g_{t,i}, \forall i =
    1, \dots,p$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{\textls{\textsc{AdaGrad}}}
  \label{alg:ADAGRAD}
\end{algorithm}

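A short numerical example with made-up values shows how the accumulated
gradients shrink the step in Algorithm~\ref{alg:ADAGRAD}. Consider a
single parameter, $\gamma = 0.1$, a negligibly small $\varepsilon$, and
the gradients $g_1 = 3$ and $g_2 = 4$. The first two updates are
\[
  \Delta x_1 = -\frac{0.1}{\sqrt{3^2}} \cdot 3 = -0.1, \qquad
  \Delta x_2 = -\frac{0.1}{\sqrt{3^2 + 4^2}} \cdot 4 = -0.08.
\]
Although the second gradient is larger than the first, the second step is
smaller, as the effective learning rate has decreased from
$\frac{0.1}{3} \approx 0.033$ to $\frac{0.1}{5} = 0.02$.
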
Building on \textsc{AdaGrad}, \textcite{ADADELTA} developed the ... (\textsc{AdaDelta})
in order to improve upon the two main drawbacks of \textsc{AdaGrad}, namely the
continual decay of the learning rate and the need for a manually
selected global learning rate $\gamma$.
As \textsc{AdaGrad} divides by the accumulated squared gradients, the learning rate will
eventually become infinitely small.
In order to ensure that even after a significant number of iterations
learning continues to make progress, instead of summing the squared gradients an
exponentially decaying average of the past squared gradients is used to ....
Additionally the fixed global learning rate $\gamma$ is substituted by
an exponentially decaying average of the past parameter updates.
The usage of the past parameter updates is motivated by ensuring that,
if the parameter vector had some hypothetical units, they would be matched
by those of the parameter update $\Delta x_t$. This proper
\todo{explanation of units}

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} +
    (1-\rho)g_t^2$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
        x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
    Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{\textsc{AdaDelta}, \textcite{ADADELTA}}
  \label{alg:ADADELTA}
\end{algorithm}

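The first update of \textsc{AdaDelta} illustrates how the algorithm
manages without a global learning rate. As both accumulation variables
are initialized with zero, the update in the first iteration is
\[
  \Delta x_1 = -\frac{\sqrt{\varepsilon}}{\sqrt{(1-\rho) g_1^2 +
      \varepsilon}} \, g_1.
\]
If $(1-\rho) g_1^2$ is large compared to $\varepsilon$, the absolute value
of this update is approximately $\sqrt{\varepsilon / (1 - \rho)}$,
independent of the scale of the gradient. For the illustrative values
$\rho = 0.95$ and $\varepsilon = 10^{-6}$, which are assumptions made here
and not prescribed above, this amounts to
\[
  \sqrt{\frac{10^{-6}}{0.05}} = \sqrt{2 \cdot 10^{-5}} \approx 0.0045.
\]
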
While the stochastic gradient algorithm is less susceptible to local
extrema than gradient descent, the problem still persists, especially
with saddle points, see \textcite{DBLP:journals/corr/Dauphinpgcgb14}.

An approach to the problem of ``getting stuck'' in saddle points or
local minima/maxima is the addition of momentum to SGD. Instead of
using the actual gradient for the parameter update, an average over the
past gradients is used. In order to avoid the need to store the past
values, usually an exponentially decaying average is used, resulting in
Algorithm~\ref{alg_momentum}. This is comparable to following the path
of a marble with mass rolling down the slope of the error
function. The decay rate for the average is comparable to the inertia
of the marble.
This results in the algorithm being able to escape ... due to the
built-up momentum from approaching it.

% \begin{itemize}
% \item ADAM
% \item momentum
% \item ADADETLA \textcite{ADADELTA}
% \end{itemize}

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Learning Rate $\gamma$, Decay Rate $\rho$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $m_0 = 0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $m_t \leftarrow \rho m_{t-1} + (1-\rho) g_t$\;
    Compute Update: $\Delta x_t \leftarrow -\gamma m_t$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{SGD with momentum}
  \label{alg_momentum}
\end{algorithm}

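The build-up and decay of the momentum can be followed in a small example
with assumed values. For a decay rate of $\rho = 0.9$, $m_0 = 0$ and a
constant gradient $g_t = 1$ the accumulated gradient is
\[
  m_1 = 0.1, \quad m_2 = 0.9 \cdot 0.1 + 0.1 = 0.19, \quad
  m_3 = 0.9 \cdot 0.19 + 0.1 = 0.271,
\]
and in general $m_t = 1 - 0.9^t$, which approaches the value of the
gradient. If the gradient vanishes from iteration $t+1$ on, for example in
a flat region around a saddle point, the accumulated gradient only decays
geometrically,
\[
  m_{t+k} = 0.9^k \, m_t,
\]
so the parameters keep moving for a number of iterations even though the
current gradient is zero.
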
In an effort to combine the properties of the momentum method and the
automatically adapted learning rate of \textsc{AdaDelta}, \textcite{ADAM}
developed the \textsc{Adam} algorithm. The

Problems / Improvements ADAM \textcite{rADAM}

\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Stepsize $\alpha$}
  \KwInput{Decay Parameters $\beta_1$, $\beta_2$, Constant $\varepsilon$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
  \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
    Compute Gradient: $g_t$\;
    Accumulate first and second Moment of the Gradient:
    \begin{align*}
      m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
      v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\;
    \end{align*}
    Correct Bias: $\hat{m}_t \leftarrow \frac{m_t}{1 - \beta_1^t}$,
    $\hat{v}_t \leftarrow \frac{v_t}{1 - \beta_2^t}$\;
    Compute Update: $\Delta x_t \leftarrow
    -\frac{\alpha}{\sqrt{\hat{v}_t} + \varepsilon} \hat{m}_t$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{ADAM, \cite{ADAM}}
  \label{alg:ADAM}
\end{algorithm}

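In the formulation of \textcite{ADAM} the accumulated moments are divided
by $1 - \beta_1^t$ and $1 - \beta_2^t$ respectively before they enter the
update, which compensates for their initialization with zero. The first
iteration shows the effect. With $m_1 = (1 - \beta_1) g_1$ and
$v_1 = (1 - \beta_2) g_1^2$ the corrected moments are
$\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the first update
\[
  \Delta x_1 = -\frac{\alpha}{\sqrt{g_1^2} + \varepsilon} \, g_1
\]
has an absolute value of approximately $\alpha$ for small $\varepsilon$.
Without the correction, and with the values $\beta_1 = 0.9$ and
$\beta_2 = 0.999$ suggested by \textcite{ADAM}, the ratio of the
uncorrected moments would instead be
\[
  \frac{\abs{m_1}}{\sqrt{v_1}} = \frac{0.1}{\sqrt{0.001}} \approx 3.16,
\]
that is, the first step would be roughly three times larger than the
stepsize $\alpha$.
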
\input{Plots/sdg_comparison.tex}

% \subsubsubsection{Stochastic Gradient Descent}
\clearpage
\subsection{Combating Overfitting}

% As in many machine learning applications if the model is overfit in
% the data it can drastically reduce the generalization of the model. In
% many machine learning approaches noise introduced in the learning
% algorithm in order to reduce overfitting. This results in a higher
% bias of the model but the trade off of lower variance of the model is
% beneficial in many cases. For example the regression tree model
% ... benefits greatly from restricting the training algorithm on
% randomly selected features in every iteration and then averaging many
% such trained trees instead of just using a single one. \todo{not yet
% sure whether I want to use this} For neural networks similar
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit{dropout} which has been first introduced in
% \cite{Dropout}

Similarly to shallow networks, overfitting can still impact the quality of
convolutional neural networks. A popular way to combat this problem is
to introduce noise into the training of the model. This is a
successful strategy for other models as well; for example, a conglomerate of
decision trees grown on bootstrapped training samples benefits greatly
from randomizing the features available for use in each training
iteration (Hastie, bachelor's thesis??).
There are two approaches to introducing noise to the model during
learning, either by manipulating the model itself or by manipulating
the input data.
\subsubsection{Dropout}
If a neural network has enough hidden nodes there will be sets of
weights that accurately fit the training set (proof for a small
scenario given in ...). This especially occurs when the relation
between the input and output is highly complex, which requires a large
network to model, and the training set is limited in size (cf. CNN with
few images). However, each of these sets of weights will result in different
predictions for a test set and all of them will perform worse on the
test data than on the training data. A way to improve the predictions and
reduce the overfitting would
be to train a large number of networks and average their results (cf.
random forests). However, this is often computationally not feasible in
training as well as testing.
% Similarly to decision trees and random forests training multiple
% models on the same task and averaging the predictions can improve the
% results and combat overfitting. However training a very large
% number of neural networks is computationally expensive in training
%as well as testing.
In order to make this approach feasible
\textcite{Dropout1} propose random dropout.
Instead of training different models, for each data point in a batch
randomly chosen nodes in the network are disabled (their output is
fixed to zero) and the updates for the weights in the remaining
smaller network are computed. The updates computed for each data
point in the batch are then accumulated and applied to the full
network.
This can be compared to simultaneously training many small networks which share their weights
for their active neurons.
For testing the ``mean network'' with all nodes active, but their
output scaled accordingly to compensate for more active nodes, is
used. \todo{comparable to averaging dropout networks, example for
  better in small case}
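
The scaling used in the ``mean network'' can be motivated by a short
calculation. As a sketch, assume that each node is kept active
independently with probability $q$, where the symbol $q$ is introduced
here for illustration. A node with output $a$ that is connected to a node
in the next layer with weight $w$ contributes $w a$ during training if it
is active and $0$ otherwise, so its expected contribution is
\[
  q \cdot w a + (1 - q) \cdot 0 = q \, w a.
\]
During testing all nodes are active, so multiplying the output of each
node by $q$ yields the same expected input for the next layer. For
$q = \frac{1}{2}$ for example, the outputs are halved.
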
% Here for each training iteration from a before specified (sub)set of nodes
% randomly chosen ones are deactivated (their output is fixed to 0).
% During training
% Instead of using different models and averaging them randomly
% deactivated nodes are used to simulate different networks which all
% share the same weights for present nodes.

% A simple but effective way to introduce noise to the model is by
% deactivating randomly chosen nodes in a layer
% The way noise is introduced into
% the model is by deactivating certain nodes (setting the output of the
% node to 0) in the fully connected layers of the convolutional neural
% networks. The nodes are chosen at random and change in every
% iteration, this practice is called Dropout and was introduced by
% \textcite{Dropout}.

\subsubsection{\titlecap{manipulation of input data}}
Another way to combat overfitting is to keep the network from memorizing
the dataset by randomly manipulating the inputs in each iteration of
training. This is commonly used in image based tasks as there are
often ways to manipulate the input while still being sure the labels
remain the same. For example in an image classification task such as
handwritten digits the associated label should remain correct when the
image is rotated or stretched by a small amount.
When using this one has to be sure that the labels indeed remain the
same, or else the network will not learn the desired ...
In the case of handwritten digits for example a too high rotation angle
will ... a nine or six.
The most common transformations are rotation, zoom, shear, brightness
and mirroring.

\begin{figure}[h]
  \centering
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist0.pdf}
    \caption{original\\image}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist_gen_zoom.pdf}
    \caption{random\\zoom}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist_gen_shear.pdf}
    \caption{random\\shear}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist_gen_rotation.pdf}
    \caption{random\\rotation}
  \end{subfigure}
  \begin{subfigure}{0.19\textwidth}
    \includegraphics[width=\textwidth]{Plots/Data/mnist_gen_shift.pdf}
    \caption{random\\positional shift}
  \end{subfigure}
  \caption{Example for the manipulations used in ... As all images are
    of the same intensity brightness manipulation does not seem
    ... Additionally mirroring is not used for ... reasons.}
\end{figure}

\input{Plots/gen_dropout.tex}

\todo{Comparison of different dropout rates on MNIST or similar, subset as
  training set?}

\subsubsection{\titlecap{effectiveness for small training sets}}

For some applications (medical problems with a small number of patients)
the available data can be highly limited.
In these problems the networks are highly ... for overfitting the
data. In order to get an understanding of the accuracies achievable and the
impact of the measures to prevent overfitting discussed above, we train
the network on datasets of varying sizes.
First we use the MNIST handwriting dataset and then a slightly harder
problem given by the Fashion-MNIST dataset, which contains pre-edited
pictures of clothes from 10 different categories.

\input{Plots/fashion_mnist.tex}

For training, a certain number of random data points are
chosen for each class. The sizes chosen are:
\begin{itemize}
  \item full dataset: ... per class
  \item 1000 per class
  \item 100 per class
  \item 10 per class
\end{itemize}

The results for training .. are given in ... Here can be seen...

\begin{figure}[h]
  \centering
  \missingfigure{datagen digits}
  \caption{Sample pictures of the Fashion-MNIST dataset, one per
    class.}
  \label{mnist fashion}
\end{figure}

\begin{figure}[h]
  \centering
  \missingfigure{datagen fashion}
  \caption{Sample pictures of the Fashion-MNIST dataset, one per
    class.}
  \label{mnist fashion}
\end{figure}

\clearpage
\section{Bla}
\begin{itemize}
  \item generate more data, GAN etc
  \item Transfer learning, use network trained on different task and
    repurpose it / train it with the training data
\end{itemize}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: