\section{Application of NN to higher complexity Problems}

As neural networks are applied to problems of higher complexity, which
often results in a higher dimensionality of the input, the number of
parameters in the network rises drastically. For example a network
with ...
A way to combat the

\subsection{Convolution}

Convolution is a mathematical operation in which the product of two
functions is integrated after one of them has been reversed and shifted:
\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) \, \mathrm{d}s.
\]

This operation can be described as a filter function $g$ being applied
to $f$: each value $f(t)$ is replaced by an average of the values of $f$
around position $t$, weighted by $g$.
The convolution operation allows for many kinds of data manipulation, a
simple example being the smoothing of real-time data. Consider a sensor
measuring the location of an object (e.g.\ via GPS). We expect the
output of the sensor to be noisy as a result of a number of factors
that impact its accuracy. In order to get a better estimate of
the actual location we want to smooth
the data to reduce the noise. Using convolution for this task, we
can control the significance we give to each data point. For example, we
might want to give a larger weight to more recent measurements than to
older ones. As these measurements are taken on a discrete
timescale, we need to introduce discrete convolution first. Let $f$,
$g: \mathbb{Z} \to \mathbb{R}$, then
\[
  (f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]
Applying this to the data with a suitably chosen filter $g$, we
are able to improve the accuracy, as can be seen in
Figure~\ref{fig:sin_conv}.
\input{Plots/sin_conv.tex}
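
As a small worked illustration of such a weighting, consider a filter
that is nonzero only for $i \in \{0, 1, 2\}$ with
$g(0) = \tfrac{1}{2}$, $g(1) = \tfrac{1}{3}$ and $g(2) = \tfrac{1}{6}$.
These filter values and the measurements below are made up for this
example and are not the ones used for Figure~\ref{fig:sin_conv}. The
weights sum to one and more recent measurements count more, so the
discrete convolution reduces to a weighted moving average,
\[
  (f * g)(t) = \tfrac{1}{2} f(t) + \tfrac{1}{3} f(t-1) + \tfrac{1}{6} f(t-2).
\]
For hypothetical measurements $f(t-2) = 6$, $f(t-1) = 3$ and $f(t) = 12$
the smoothed value is
\[
  \tfrac{1}{2} \cdot 12 + \tfrac{1}{3} \cdot 3 + \tfrac{1}{6} \cdot 6
  = 6 + 1 + 1 = 8.
\]
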

This form of discrete convolution can also be applied to functions
with inputs of higher dimensionality. Let $f$, $g: \mathbb{Z}^d \to
\mathbb{R}$, then
\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
This will prove to be a useful framework for image manipulation, but
in order to apply convolution to images we need to discuss the
representation of image data first. Most often images are represented
by each pixel being a mixture of base colors. These base colors define
the color space in which the image is encoded. Commonly used
color spaces are RGB (red, green, blue) and CMYK (cyan, magenta,
yellow, black). An example of an
image split into its red, green and blue channels is given in
Figure~\ref{fig:rgb}. Using this
encoding of the image we can define a corresponding discrete function
describing the image by mapping the coordinates $(x,y)$ of a pixel
and the
channel (color) $c$ to the respective value $v$:
\begin{align}
  \begin{split}
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
  \label{def:I}
\end{align}

\begin{figure}
  \begin{adjustbox}{width=\textwidth}
    \begin{tikzpicture}
      \begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
        \node[canvas is xy plane at z=0, transform shape] at (0,0)
        {\includegraphics[width=5cm]{Plots/Data/klammern_r.jpg}};
        \node[canvas is xy plane at z=2, transform shape] at (0,-0.2)
        {\includegraphics[width=5cm]{Plots/Data/klammern_g.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (0,-0.4)
        {\includegraphics[width=5cm]{Plots/Data/klammern_b.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (-8,-0.2)
        {\includegraphics[width=5.3cm]{Plots/Data/klammern_rgb.jpg}};
      \end{scope}
    \end{tikzpicture}
  \end{adjustbox}
  \caption{On the right the red, green and blue channels of the picture
    are displayed. In order to better visualize the color channels, the
    black and white picture of each channel has been colored in the
    respective color. Combining the layers results in the image on the
    left.}
  \label{fig:rgb}
\end{figure}

With this representation of an image as a function, we can apply
filters to the image using convolution for multidimensional functions
as described above. In order to simplify the notation we will write
the function $I$ given in (\ref{def:I}) as well as the filter function $g$
as tensors from now on, resulting in the modified notation of
convolution
\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]

Simple examples of image manipulation using
convolution are smoothing operations or the
rudimentary detection of edges in grayscale images, that is, images that
have only one channel. A popular filter for smoothing images
is the Gauss filter, which for a given $\sigma \in \mathbb{R}_+$ and
size $s \in \mathbb{N}$ is
defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2
      \sigma^2}}, ~ x,y \in \left\{1,\dots,s\right\}.
\]
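
To give a feeling for the resulting weights, the following evaluates
the Gauss filter for $\sigma = 1$. As an assumption made only for this
sketch, the indices are taken as offsets $x, y \in \{-1, 0, 1\}$
relative to the center of a $3 \times 3$ kernel, which is a common
convention but differs from the index set stated in the definition. All
values are rounded to four decimals.
\[
  \frac{1}{2\pi}\left[
    \begin{matrix}
      e^{-1} & e^{-1/2} & e^{-1} \\
      e^{-1/2} & 1 & e^{-1/2} \\
      e^{-1} & e^{-1/2} & e^{-1}
    \end{matrix}\right]
  \approx \left[
    \begin{matrix}
      0.0585 & 0.0965 & 0.0585 \\
      0.0965 & 0.1592 & 0.0965 \\
      0.0585 & 0.0965 & 0.0585
    \end{matrix}\right]
\]
The entries of this truncated kernel sum to approximately $0.7795$
rather than $1$. In practice such a kernel is therefore often rescaled
so that its entries sum to one, which keeps the overall brightness of
the image unchanged.
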
For edge detection purposes the Sobel operator is widespread. Here two
filters are applied to the
image $I$ and then combined. Edges in the $x$ direction are detected
by convolution with
\[
  G =\left[
    \begin{matrix}
      -1 & 0 & 1 \\
      -2 & 0 & 2 \\
      -1 & 0 & 1
    \end{matrix}\right],
\]
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by
\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied component-wise.
Examples of convolution with both kernels are given in Figure~\ref{fig:img_conv}.
\todo{padding}

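
As a small worked example of the Sobel operator, consider a
hypothetical $3 \times 3$ neighborhood $P$ of a grayscale image that
contains a vertical edge (the patch and the symbol $P$ are assumptions
for this illustration),
\[
  P =\left[
    \begin{matrix}
      0 & 0 & 1 \\
      0 & 0 & 1 \\
      0 & 0 & 1
    \end{matrix}\right].
\]
Multiplying $P$ entry-wise with $G$ and summing gives $1 + 2 + 1 = 4$.
Because convolution reverses the kernel, the response at the center
pixel is $\pm 4$ depending on the indexing convention, which no longer
matters after squaring. For $G^T$ the contributions of the top and
bottom rows cancel, $-1 + 1 = 0$. The combined output at the center
pixel is therefore
\[
  O = \sqrt{(\pm 4)^2 + 0^2} = 4,
\]
whereas a patch of constant intensity yields $O = 0$, since the entries
of both kernels sum to zero.
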
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/klammern.jpg}
    \caption{Original Picture}
    \label{subf:OrigPicGS}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv9.png}
    \caption{Gaussian Blur $\sigma^2 = 1$}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv10.png}
    \caption{Gaussian Blur $\sigma^2 = 4$}
  \end{subfigure}\\
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv4.png}
    \caption{Sobel Operator $x$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv5.png}
    \caption{Sobel Operator $y$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
    \caption{Sobel Operator combined}
  \end{subfigure}
  % \begin{subfigure}{0.24\textwidth}
  %   \centering
  %   \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
  %   \caption{test}
  % \end{subfigure}
  \caption{Convolution of the original grayscale image (a) with different
    kernels. In (b) and (c) Gaussian kernels of size 11 and the stated
    $\sigma^2$ are used. In (d) to (f) the Sobel operator
    kernels defined above are used.}
  \label{fig:img_conv}
\end{figure}
\clearpage
\newpage
\subsection{Convolutional NN}
\todo{Introduction to CNNs}
% Conventional neural network as described in chapter .. are made up of
% fully connected layers, meaning each node in a layer is influenced by
% all nodes of the previous layer. If one wants to extract information
% out of high dimensional input such as images this results in a very
% large amount of variables in the model. This limits the

% In conventional neural networks as described in chapter ... all layers
% are fully connected, meaning each output node in a layer is influenced
% by all inputs. For $i$ inputs and $o$ output nodes this results in $i
% + 1$ variables at each node (weights and bias) and a total $o(i + 1)$
% variables. For large inputs like image data the amount of variables
% that have to be trained in order to fit the model can get excessive
% and hinder the ability to train the model due to memory and
% computational restrictions. By using convolution we can extract
% meaningful information such as edges in an image with a kernel of a
% small size $k$ in the tens or hundreds independent of the size of the
% original image. Thus for a large image $k \cdot i$ can be several
% orders of magnitude smaller than $o\cdot i$ .

As seen in the previous section, convolution lends itself to the
manipulation of images or other large data, which motivates its usage in
neural networks.
This is achieved by implementing convolutional layers in which several
filters are applied to the input, where the values of the filters are
trainable parameters of the model.
Each node in such a layer corresponds to a pixel of the output of the
convolution with one of those filters, to which a bias and an activation
function are applied.
The usage of multiple filters results in multiple outputs of the same
size as the input. These are often called channels. Depending on the
size of the filters, this can result in the output having one
dimension more than the input.
However, for convolutional layers following a convolutional layer, the
size of the filter is often chosen to coincide with the number of channels
of the output of the previous layer, without using padding in this
direction, in order to prevent gaining additional
dimensions\todo{strange} in the output.
This can also be used to flatten certain less interesting channels of
the input, for example the color channels.
Thus filters used in convolutional networks usually have the same
number of dimensions as the input or one more.

The size of the filters and the way they are applied can be tuned
while building the model, but should be the same for all filters in one
layer in order for the output to be of consistent size in all channels.
It is common to reduce the size of the output by not applying the
filters at each ``pixel'' but rather specifying a ``stride'' $s$ at which
the filter $g$ is moved over the input $I$
\[
  O_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{sx-i,sy-j,c-l} g_{i,j,l}.
\]
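
How strongly a stride reduces the output can be made concrete with the
standard size formula for a convolution without padding. The symbols
$n$ (input width), $k$ (filter width) and $n_{\text{out}}$ (output
width) are introduced here only for illustration. Along one axis the
filter fits at
\[
  n_{\text{out}} = \left\lfloor \frac{n - k}{s} \right\rfloor + 1
\]
positions. For an input of width $n = 28$ and a filter of width $k = 5$
this gives $n_{\text{out}} = 24$ for $s = 1$ and
$n_{\text{out}} = \lfloor 23 / 2 \rfloor + 1 = 12$ for $s = 2$, so
doubling the stride roughly halves the output along each axis.
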

As seen, convolution lends itself to image manipulation. In this
chapter we will explore how we can incorporate convolution in neural
networks, and how that might be beneficial.

Convolutional Neural Networks as described by ... are made up of
convolutional layers, pooling layers, and fully connected ones. The
fully connected layers are layers in which each input node is
connected to each output node, which is the structure introduced in
chapter ...

In a convolutional layer, instead of combining all input nodes for each
output node, the input nodes are interpreted as a tensor to which a
kernel is applied via convolution, resulting in the output. Most often
multiple kernels are used, resulting in multiple output tensors. These
kernels are the variables which can be altered in order to fit the
model to the data. Using multiple kernels it is possible to extract
different features from the image (e.g.\ edges $\rightarrow$ Sobel). As this
increases the dimensionality even further, which is undesirable as it
increases the number of variables in later layers of the model, a convolutional layer
is often followed by a pooling layer. In a pooling layer the input is
reduced in size by extracting a single value from a
neighborhood \todo{moving...}... . The resulting output size depends on
the offset of the neighborhoods used. Popular is max-pooling, where the
largest value in a neighborhood is used or ...
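
To illustrate max-pooling with a small made-up example, consider a
$4 \times 4$ input that is pooled over non-overlapping $2 \times 2$
neighborhoods, i.e.\ with an offset of two in both directions (the
input values and both sizes are assumptions for this sketch):
\[
  \left[
    \begin{matrix}
      1 & 3 & 2 & 0 \\
      5 & 4 & 1 & 1 \\
      0 & 2 & 7 & 6 \\
      1 & 1 & 3 & 8
    \end{matrix}\right]
  \longmapsto
  \left[
    \begin{matrix}
      5 & 2 \\
      2 & 8
    \end{matrix}\right].
\]
Each output entry is the maximum over one block, e.g.\
$\max\{1, 3, 5, 4\} = 5$ for the upper left block, and the number of
values passed on to the next layer is reduced from 16 to 4.
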

This construct allows for the extraction of features from the input while
using far fewer variables.

... \todo{Example with a small image, ideally the one from above}

\subsubsection{Parallels to the Visual Cortex in Mammals}

The choice of convolution for image classification tasks is not
arbitrary. ... eye ... bla bla

\subsection{Limitations of the Gradient Descent Algorithm}

\begin{itemize}
  \item Hyperparameter guesswork
  \item Problems navigating valleys $\rightarrow$ momentum
  \item Different scale of gradients for variables in different layers
    $\rightarrow$ ADADELTA
\end{itemize}

\subsection{Stochastic Training Algorithms}

For many applications in which neural networks are used, such as
image classification or segmentation, large training data sets become
essential to capture the nuances of the
data. However, as training sets get larger, the memory requirement
during training grows with them.
In order to update the weights with the gradient descent algorithm,
derivatives of the network with respect to each
variable need to be calculated for all data points in order to get the
full gradient of the error of the network.
Thus the amount of memory and computing power available limits the
size of the training data that can be efficiently used in fitting the
network. A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
descent algorithms. Here the premise is that instead of using the whole
data set, a (different) subset of the data is chosen to
compute the gradient in each iteration (Algorithm~\ref{alg:sgd}).
The training period until each data point has been considered in
updating the parameters is commonly called an ``epoch''.
Using subsets reduces the amount of memory and computing power required for
each iteration. This makes it possible to use very large training
sets to fit the model.
Additionally, the noise introduced on the gradient can improve
the accuracy of the fit, as stochastic gradient descent algorithms are
less likely to get stuck in local extrema.

Another important benefit of using subsets is that, depending on their size, the
gradient can be calculated far quicker, which allows for more parameter updates
in the same time. If the approximated gradient is close enough to the
``real'' one, this can drastically cut down the time required for
training the model to a certain degree, or improve the accuracy achievable in a given
amount of training time.

\begin{algorithm}
  \SetAlgoLined
  \KwInput{Function $f$, Weights $w$, Learning Rate $\gamma$, Batch Size $B$, Loss Function $L$,
    Training Data $D$, Epochs $E$.}
  \For{$i \in \left\{1,\dots,E\right\}$}{
    $S \leftarrow D$\;
    \While{$\vert S \vert \geq B$}{
      Draw $\tilde{D}$ from $S$ with $\vert\tilde{D}\vert = B$\;
      Update $S$: $S \leftarrow S \setminus \tilde{D}$\;
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        \tilde{D})}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
    \If{$S \neq \emptyset$}{
      Compute Gradient: $g \leftarrow \frac{\mathrm{d} L(f_w,
        S)}{\mathrm{d} w}$\;
      Update: $w \leftarrow w - \gamma g$\;
    }
  }
  \caption{Stochastic gradient descent.}
  \label{alg:sgd}
\end{algorithm}

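
The number of parameter updates per epoch in
Algorithm~\ref{alg:sgd} follows directly from its loop structure. With
$N = \vert D \vert$ data points (the symbol $N$ is introduced here for
illustration) the inner loop runs $\lfloor N / B \rfloor$ times,
followed by one additional update on the remaining points if $B$ does
not divide $N$, so that
\[
  \text{updates per epoch} = \left\lceil \frac{N}{B} \right\rceil.
\]
For the assumed values $N = 60000$ and $B = 32$ this gives
$60000 / 32 = 1875$ updates per epoch, whereas gradient descent on the
full data set performs a single update. For $N = 1000$ and $B = 32$
there are $\lfloor 1000 / 32 \rfloor = 31$ full batches and one final
batch containing the remaining $1000 - 31 \cdot 32 = 8$ points, i.e.\
32 updates.
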
In order to illustrate this behavior we modeled a convolutional neural
network to ... handwritten digits. The data set used for this is the
MNIST database of handwritten digits (\textcite{MNIST},
Figure~\ref{fig:MNIST}).
\input{Plots/mnist.tex}
The network used consists of two convolutional and max pooling layers
followed by one fully connected hidden layer and the output layer.
Both convolutional layers utilize square filters of size five which are
applied with a stride of one.
The first layer consists of 32 filters and the second of 64. Both
pooling layers pool a $2\times 2$ area. The fully connected layer
consists of 256 nodes and the output layer of 10, one for each digit.
All layers except the output layer use ReLU as activation function,
with the output layer using softmax (\ref{def:softmax}).
As loss function categorical cross-entropy is used (\ref{def:...}).
The architecture of the convolutional neural network is summarized in
Figure~\ref{fig:mnist_architecture}.
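
As a rough check of the savings in parameters, the number of trainable
parameters in the two convolutional layers can be counted by hand. This
sketch assumes that each filter spans all channels of its input, that
each filter has exactly one bias, and that the input images are
grayscale with a single channel, as is the case for MNIST. Under these
assumptions the first and second convolutional layer have
\[
  32 \cdot (5 \cdot 5 \cdot 1 + 1) = 832
  \quad \text{and} \quad
  64 \cdot (5 \cdot 5 \cdot 32 + 1) = 51264
\]
parameters respectively, independent of the size of the image. For
comparison, a single fully connected layer mapping the
$28 \cdot 28 = 784$ pixels of an MNIST image directly to 256 nodes
would already require $784 \cdot 256 + 256 = 200960$ parameters.
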
\begin{figure}
  \missingfigure{network architecture}
  \caption{architecture}
  \label{fig:mnist_architecture}
\end{figure}

The results of the network being trained with gradient descent and
stochastic gradient descent for 20 epochs are given in Figure~\ref{fig:sgd_vs_gd}
and Table~\ref{table:sgd_vs_gd}.

\input{Plots/SGD_vs_GD.tex}

Here it can be seen that the network trained with stochastic gradient
descent is more accurate after the first epoch than the ones trained
with gradient descent after 20 epochs.
This is due to the former using a batch size of 32 and thus having
made 1875 updates to the weights
after the first epoch in comparison to one update. While each of
these updates uses an approximate
gradient calculated on the subset, it performs far better than the
network using true gradients when training for the same amount of time.
\todo{compare training time}
\clearpage
\subsection{Modified Stochastic Gradient Descent}
There is an inherent problem in the sensitivity of the gradient descent
algorithm regarding the learning rate $\gamma$.
The difficulty of choosing the learning rate can be seen
in Figure~\ref{fig:sgd_vs_gd}. For small rates the progress in each iteration is small,
but as the rate is enlarged the algorithm can become unstable and
diverge. Even for learning rates small enough to ensure the parameters
do not diverge to infinity, steep valleys can hinder the progress of
the algorithm, as with too large learning rates gradient descent
``bounces between'' the walls of the valley rather than follow ...

% \[
%   w - \gamma \nabla_w ...
% \]
thus the weights grow to infinity.
\todo{explain unstable learning rate better}

To combat this problem it is proposed \todo{source} to alter the learning
rate over the course of training, often called learning rate
scheduling. The most popular implementations of this are time-based
decay
\[
  \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
\]
where $d$ is the decay parameter and $n$ is the number of epochs,
step-based decay, where the learning rate is fixed for a span of $r$
epochs and then decreased according to the parameter $d$,
\[
  \gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor},
\]
and exponential decay, where the learning rate is decreased after each epoch,
\[
  \gamma_n = \gamma_0 e^{-n d}.
\]
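
For a feeling of how quickly these schedules reduce the learning rate,
the following evaluates them for illustrative values that are
assumptions of this sketch and not taken from any experiment:
$\gamma_0 = 0.1$ throughout, $d = 0.1$ for time-based and exponential
decay, and $d = 0.5$, $r = 10$ for step-based decay. Values are rounded
to four decimals. Time-based decay gives
\[
  \gamma_1 = \frac{0.1}{1 + 0.1 \cdot 0} = 0.1, \quad
  \gamma_2 = \frac{\gamma_1}{1 + 0.1 \cdot 1} \approx 0.0909, \quad
  \gamma_3 = \frac{\gamma_2}{1 + 0.1 \cdot 2} \approx 0.0758,
\]
step-based decay gives
\[
  \gamma_n = 0.1 \text{ for } 0 \leq n \leq 8, \quad
  \gamma_n = 0.05 \text{ for } 9 \leq n \leq 18, \quad
  \gamma_n = 0.025 \text{ for } 19 \leq n \leq 28,
\]
where the exponent is $\frac{n+1}{r}$ rounded down, and exponential
decay gives
\[
  \gamma_{10} = 0.1 \cdot e^{-10 \cdot 0.1} = 0.1 \cdot e^{-1} \approx 0.0368.
\]
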

These methods are able to increase the accuracy of a model by a large
margin, as seen in the training of ResNet by \textcite{resnet}
\todo{maybe include a figure}. However, stochastic gradient descent
with learning rate decay is
still highly sensitive to the choice of the hyperparameters $\gamma$
and $d$.
In order to mitigate this problem a number of algorithms have been
developed to regularize the learning rate with as little
hyperparameter guesswork as possible.
One of these algorithms is the ADADELTA algorithm developed by \textcite{ADADELTA}.
\clearpage

\begin{itemize}
  \item ADAM
  \item momentum
  \item ADADELTA \textcite{ADADELTA}
\end{itemize}
\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
  \KwInput{Initial parameter $x_1$}
  Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
  \For{$t \in \left\{1,\dots,T\right\}$}{
    Compute Gradient: $g_t$\;
    Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} +
    (1-\rho)g_t^2$\;
    Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
        x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
    Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
    Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
  }
  \caption{ADADELTA, \textcite{ADADELTA}}
  \label{alg:gd}
\end{algorithm}

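
To see how the update is bootstrapped from the zero-initialized
accumulators, the first iteration of ADADELTA can be traced by hand,
using the weights $\rho$ and $1 - \rho$ in both running averages. The
values $\rho = 0.95$, $\varepsilon = 10^{-6}$ and the gradient
$g_1 = 1$ of a single scalar parameter are assumptions chosen purely
for illustration:
\begin{align*}
  E[g^2]_1 &= 0.95 \cdot 0 + 0.05 \cdot 1^2 = 0.05, \\
  \Delta x_1 &= -\frac{\sqrt{0 + 10^{-6}}}{\sqrt{0.05 + 10^{-6}}} \cdot 1
  \approx -0.00447, \\
  E[\Delta x^2]_1 &= 0.95 \cdot 0 + 0.05 \cdot (0.00447)^2
  \approx 1.0 \cdot 10^{-6}.
\end{align*}
Since $E[\Delta x^2]_0 = 0$, the size of this first step is determined
by $\varepsilon$ alone. Later steps are scaled by the accumulated
magnitude of the previous updates.
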
% \subsubsubsection{Stochastic Gradient Descent}

\subsection{Combating Overfitting}

% As in many machine learning applications if the model is overfit in
% the data it can drastically reduce the generalization of the model. In
% many machine learning approaches noise introduced in the learning
% algorithm in order to reduce overfitting. This results in a higher
% bias of the model but the trade off of lower variance of the model is
% beneficial in many cases. For example the regression tree model
% ... benefits greatly from restricting the training algorithm on
% randomly selected features in every iteration and then averaging many
% such trained trees inserted of just using a single one. \todo{noch
% nicht sicher ob ich das nehmen will} For neural networks similar
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit{dropout} which has been first introduced in
% \cite{Dropout}

Similarly to shallow networks, overfitting can still impact the quality of
convolutional neural networks. A popular way to combat this problem is
by introducing noise into the training of the model. This is a
successful strategy for other models as well; for example, a conglomerate of
decision trees grown on bootstrapped training samples benefits greatly
from randomizing the features available for use in each training
iteration (Hastie, bachelor's thesis??). The way noise is introduced into
the model is by deactivating certain nodes (setting the output of the
node to 0) in the fully connected layers of the convolutional neural
networks. The nodes are chosen at random and change in every
iteration. This practice is called dropout and was introduced by
\textcite{Dropout}.
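
The effect of deactivating nodes at random can be quantified with a
short calculation. As an assumption for this sketch, each node is
dropped independently with probability $p$; the symbols $p$, $r_j$,
$y_j$ and $\tilde{y}_j$ are introduced here for illustration. Writing
$y_j$ for the output of node $j$ and $r_j$ for a random mask that
equals $1$ with probability $1 - p$ and $0$ otherwise, the output used
during training is
\[
  \tilde{y}_j = r_j y_j,
  \qquad
  \mathbb{E}[\tilde{y}_j] = (1 - p) \, y_j.
\]
For a fully connected layer with 256 nodes and $p = 0.5$, on average
$256 \cdot (1 - 0.5) = 128$ nodes are active in each iteration, and
which nodes these are changes from iteration to iteration.
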
\todo{Compare different dropout rates on MNIST or similar, subset as
  training set?}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: