|
|
|
@ -150,7 +150,7 @@ wise. Examples of convolution with both kernels are given in Figure~\ref{fig:img
|
|
|
|
|
\begin{subfigure}{0.3\textwidth}
|
|
|
|
|
\centering
|
|
|
|
|
\includegraphics[width=\textwidth]{Plots/Data/image_conv9.png}
|
|
|
|
|
\caption{Gaussian Blur $\sigma^2 = 1$}
|
|
|
|
|
\caption{\hspace{-2pt}Gaussian Blur $\sigma^2 = 1$}
|
|
|
|
|
\end{subfigure}
|
|
|
|
|
\begin{subfigure}{0.3\textwidth}
|
|
|
|
|
\centering
|
|
|
|
@ -383,15 +383,22 @@ network using true gradients when training for the same mount of time.
|
|
|
|
|
\input{Plots/SGD_vs_GD.tex}
|
|
|
|
|
\clearpage
|
|
|
|
|
\subsection{\titlecap{modified stochastic gradient descent}}
|
|
|
|
|
There is a inherent problem in the sensitivity of the gradient descent
|
|
|
|
|
algorithm regarding the learning rate $\gamma$.
|
|
|
|
|
The difficulty of choosing the learning rate can be seen
|
|
|
|
|
in Figure~\ref{sgd_vs_gd}. For small rates the progress in each iteration is small
|
|
|
|
|
but as the rate is enlarged the algorithm can become unstable and
|
|
|
|
|
diverge. Even for learning rates small enough to ensure the parameters
|
|
|
|
|
do not diverge to infinity steep valleys can hinder the progress of
|
|
|
|
|
the algorithm as with to large leaning rates gradient descent
|
|
|
|
|
``bounces between'' the walls of the valley rather then follow a
|
|
|
|
|
An inherent problem of the stochastic gradient descent algorithm is
|
|
|
|
|
its sensitivity to the learning rate $\gamma$. This results in the
|
|
|
|
|
need to find an appropriate learning rate for each problem,
|
|
|
|
|
which is largely guesswork; the impact of choosing a bad learning rate
|
|
|
|
|
can be seen in Figure~\ref{fig:sgd_vs_gd}.
|
|
|
|
|
% There is a inherent problem in the sensitivity of the gradient descent
|
|
|
|
|
% algorithm regarding the learning rate $\gamma$.
|
|
|
|
|
% The difficulty of choosing the learning rate can be seen
|
|
|
|
|
% in Figure~\ref{sgd_vs_gd}.
|
|
|
|
|
For small rates the progress in each iteration is small
|
|
|
|
|
but as the rate is enlarged the algorithm can become unstable and the parameters
|
|
|
|
|
diverge to infinity. Even for learning rates small enough to ensure the parameters
|
|
|
|
|
do not diverge to infinity, steep valleys in the function to be
|
|
|
|
|
minimized can hinder the progress of
|
|
|
|
|
the algorithm, since for learning rates that are not small enough gradient descent
|
|
|
|
|
``bounces between'' the walls of the valley rather than following a
|
|
|
|
|
downward trend in the valley.
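As a minimal illustration (the quadratic objective and the curvature
constant $a$ are assumptions made only for this example), consider
$f(x) = \frac{a}{2} x^2$ with $a > 0$ and the update
$x_{n+1} = x_n - \gamma f'(x_n)$. Then
\[
  x_{n+1} = (1 - \gamma a)\, x_n, \qquad x_n = (1 - \gamma a)^n x_0,
\]
so the iterates converge to the minimum only if $|1 - \gamma a| < 1$,
that is $0 < \gamma < \frac{2}{a}$. For $\frac{1}{a} < \gamma < \frac{2}{a}$
the factor $1 - \gamma a$ is negative and the iterates alternate in
sign while shrinking, which is the one dimensional analogue of
bouncing between the walls of a valley. For $\gamma > \frac{2}{a}$
the iterates diverge.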
|
|
|
|
|
|
|
|
|
|
% \[
|
|
|
|
@ -403,7 +410,8 @@ downward trend in the valley.
|
|
|
|
|
|
|
|
|
|
To combat this problem \todo{source} propose to alter the learning
|
|
|
|
|
rate over the course of training, often called learning rate
|
|
|
|
|
scheduling. The most popular implementations of this are time based
|
|
|
|
|
scheduling in order to decrease the learning rate over the course of
|
|
|
|
|
training. The most popular implementations of this are time based
|
|
|
|
|
decay
|
|
|
|
|
\[
|
|
|
|
|
\gamma_{n+1} = \frac{\gamma_n}{1 + d n},
|
|
|
|
@ -414,12 +422,12 @@ epochs and then decreased according to parameter $d$
|
|
|
|
|
\[
|
|
|
|
|
  \gamma_n = \gamma_0 d^{\left\lfloor \frac{n+1}{r} \right\rfloor}
|
|
|
|
|
\]
|
|
|
|
|
and exponential decay, where the learning rate is decreased after each epoch,
|
|
|
|
|
and exponential decay where the learning rate is decreased after each epoch
|
|
|
|
|
\[
|
|
|
|
|
  \gamma_n = \gamma_0 e^{-n d}.
|
|
|
|
|
\]
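To give a feeling for the magnitudes involved, consider the following
values, which are purely illustrative choices and not taken from any
experiment. With $\gamma_0 = 0.1$ and $d = 0.1$ exponential decay
gives
\[
  \gamma_{10} = 0.1\, e^{-1} \approx 0.037,
\]
while step decay with $\gamma_0 = 0.1$, $d = 0.5$ and $r = 10$ keeps
$\gamma_n = 0.1$ for $n \leq 8$ and yields
\[
  \gamma_9 = 0.1 \cdot 0.5^{\lfloor 10/10 \rfloor} = 0.05.
\]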
|
|
|
|
|
These methods are able to increase the accuracy of a model by a large
|
|
|
|
|
margin as seen in the training of RESnet by \textcite{resnet}.
|
|
|
|
|
These methods are able to increase the accuracy of a model by large
|
|
|
|
|
margins, as seen in the training of ResNet by \textcite{resnet}.
|
|
|
|
|
\todo{maybe include a figure}
|
|
|
|
|
However stochastic gradient descent with weight decay is
|
|
|
|
@ -500,9 +508,9 @@ While the stochastic gradient algorithm is less susceptible to local
|
|
|
|
|
extrema than gradient descent the problem still persists especially
|
|
|
|
|
with saddle points. \textcite{DBLP:journals/corr/Dauphinpgcgb14}
|
|
|
|
|
|
|
|
|
|
A approach to the problem of ``getting stuck'' in saddle point or
|
|
|
|
|
An approach to the problem of ``getting stuck'' in saddle points or
|
|
|
|
|
local minima/maxima is the addition of momentum to SGD. Instead of
|
|
|
|
|
using the actual gradient for the parameter update a average over the
|
|
|
|
|
using the actual gradient for the parameter update, an average over the
|
|
|
|
|
past gradients is used. In order to avoid the need to save the past
|
|
|
|
|
values, usually an exponentially decaying average is used, resulting in
|
|
|
|
|
Algorithm~\ref{alg_momentum}. This is comparable to following the path
|
|
|
|
@ -534,6 +542,10 @@ build up momentum from approaching it.
|
|
|
|
|
\label{alg:gd}
|
|
|
|
|
\end{algorithm}
|
|
|
|
|
|
|
|
|
|
In an effort to combine the properties of the momentum method and the
|
|
|
|
|
automatically adapted learning rate of \textsc{AdaDelta}, \textcite{ADAM}
|
|
|
|
|
developed the \textsc{Adam} algorithm. The
|
|
|
|
|
|
|
|
|
|
Problems / Improvements ADAM \textcite{rADAM}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@ -541,11 +553,14 @@ Problems / Improvements ADAM \textcite{rADAM}
|
|
|
|
|
\SetAlgoLined
|
|
|
|
|
\KwInput{Stepsize $\alpha$}
|
|
|
|
|
\KwInput{Decay Parameters $\beta_1$, $\beta_2$}
|
|
|
|
|
Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
|
|
|
|
|
Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
|
|
|
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
|
|
|
Compute Gradient: $g_t$\;
|
|
|
|
|
Accumulate Gradient: $[E[g^2]_t \leftarrow \rho D[g^2]_{t-1} +
|
|
|
|
|
(1-\rho)g_t^2$\;
|
|
|
|
|
Accumulate first and second Moment of the Gradient:
|
|
|
|
|
\begin{align*}
|
|
|
|
|
m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
|
|
|
|
|
v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\;
|
|
|
|
|
\end{align*}
|
|
|
|
|
Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
|
|
|
|
|
x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
|
|
|
|
|
Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
|
|
|
|
@ -589,41 +604,88 @@ There are two approaches to introduce noise to the model during
|
|
|
|
|
learning, either by manipulating the model itself or by manipulating
|
|
|
|
|
the input data.
|
|
|
|
|
\subsubsection{Dropout}
|
|
|
|
|
If a neural network has enough hidden nodes to model a training set
|
|
|
|
|
accuratly
|
|
|
|
|
Similarly to decision trees and random forests training multiple
|
|
|
|
|
models on the same task and averaging the predictions can improve the
|
|
|
|
|
results and combat overfitting. However training a very large
|
|
|
|
|
number of neural networks is computationally expensive in training
|
|
|
|
|
as well as testing. In order to make this approach feasible
|
|
|
|
|
\textcite{Dropout1} introduced random dropout.
|
|
|
|
|
Here for each training iteration from a before specified (sub)set of nodes
|
|
|
|
|
randomly chosen ones are deactivated (their output is fixed to 0).
|
|
|
|
|
During training
|
|
|
|
|
Instead of using different models and averaging them randomly
|
|
|
|
|
deactivated nodes are used to simulate different networks which all
|
|
|
|
|
share the same weights for present nodes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A simple but effective way to introduce noise to the model is by
|
|
|
|
|
deactivating randomly chosen nodes in a layer
|
|
|
|
|
The way noise is introduced into
|
|
|
|
|
the model is by deactivating certain nodes (setting the output of the
|
|
|
|
|
node to 0) in the fully connected layers of the convolutional neural
|
|
|
|
|
networks. The nodes are chosen at random and change in every
|
|
|
|
|
iteration, this practice is called Dropout and was introduced by
|
|
|
|
|
\textcite{Dropout}.
|
|
|
|
|
If a neural network has enough hidden nodes, there will be sets of
|
|
|
|
|
weights that accurately fit the training set (proof for a small
|
|
|
|
|
scenario given in ...). This especially occurs when the relation
|
|
|
|
|
between the input and output is highly complex, which requires a large
|
|
|
|
|
network to model, and the training set is limited in size (cf. CNNs
with few images). However, each of these sets of weights will result in different
|
|
|
|
|
predictions for a test set and all of them will perform worse on the
|
|
|
|
|
test data than the training data. A way to improve the predictions and
|
|
|
|
|
reduce the overfitting would
|
|
|
|
|
be to train a large number of networks and average their results (cf.
|
|
|
|
|
random forests); however, this is often computationally infeasible in
|
|
|
|
|
training as well as testing.
|
|
|
|
|
% Similarly to decision trees and random forests training multiple
|
|
|
|
|
% models on the same task and averaging the predictions can improve the
|
|
|
|
|
% results and combat overfitting. However training a very large
|
|
|
|
|
% number of neural networks is computationally expensive in training
|
|
|
|
|
%as well as testing.
|
|
|
|
|
In order to make this approach feasible
|
|
|
|
|
\textcite{Dropout1} propose random dropout.
|
|
|
|
|
Instead of training different models, for each data point in a batch
|
|
|
|
|
randomly chosen nodes in the network are disabled (their output is
|
|
|
|
|
fixed to zero) and the updates for the weights in the remaining
|
|
|
|
|
smaller network are computed. The updates computed for each data
|
|
|
|
|
point in the batch are then accumulated and applied to the full
|
|
|
|
|
network.
|
|
|
|
|
This can be compared to many small networks which share their weights
|
|
|
|
|
for their active neurons being trained simultaneously.
|
|
|
|
|
For testing the ``mean network'' with all nodes active but their
|
|
|
|
|
output scaled accordingly to compensate for more active nodes is
|
|
|
|
|
used. \todo{comparable to averaging dropout networks, example showing
  it performs better in a small case}
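The scaling can be motivated by a short calculation, sketched here
under the assumption that each node is kept independently with
probability $p$ (the symbols $p$, $r$ and $y$ are introduced only for
this illustration). If $y$ is the output of a node and
$r \sim \text{Bernoulli}(p)$ indicates whether the node is active, the
expected output during training is
\[
  E[r y] = p\, y,
\]
so multiplying the output of every node by $p$ at test time matches
the expected input the following layer received during training.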
|
|
|
|
|
% Here for each training iteration from a before specified (sub)set of nodes
|
|
|
|
|
% randomly chosen ones are deactivated (their output is fixed to 0).
|
|
|
|
|
% During training
|
|
|
|
|
% Instead of using different models and averaging them randomly
|
|
|
|
|
% deactivated nodes are used to simulate different networks which all
|
|
|
|
|
% share the same weights for present nodes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
% A simple but effective way to introduce noise to the model is by
|
|
|
|
|
% deactivating randomly chosen nodes in a layer
|
|
|
|
|
% The way noise is introduced into
|
|
|
|
|
% the model is by deactivating certain nodes (setting the output of the
|
|
|
|
|
% node to 0) in the fully connected layers of the convolutional neural
|
|
|
|
|
% networks. The nodes are chosen at random and change in every
|
|
|
|
|
% iteration, this practice is called Dropout and was introduced by
|
|
|
|
|
% \textcite{Dropout}.
|
|
|
|
|
|
|
|
|
|
\subsubsection{\titlecap{manipulation of input data}}
|
|
|
|
|
Another way to combat overfitting is to keep the network from learning
|
|
|
|
|
the dataset by manipulating the inputs randomly for each iteration of
|
|
|
|
|
training. This is commonly used in image based tasks as there are
|
|
|
|
|
often ways to manipulate the input while still being sure the labels
|
|
|
|
|
remain the same. For example, in an image classification task such as
|
|
|
|
|
handwritten digits the associated label should remain correct when the
|
|
|
|
|
image is rotated or stretched by a small amount.
|
|
|
|
|
When using this technique one has to be sure that the labels indeed remain the
|
|
|
|
|
same or else the network will not learn the desired ...
|
|
|
|
|
In the case of handwritten digits, for example, a too high rotation angle
|
|
|
|
|
will ... a nine or six.
|
|
|
|
|
The most common transformations are rotation, zoom, shear, brightness changes, and mirroring.
|
|
|
|
|
|
|
|
|
|
\todo{Compare different dropout rates on MNIST or similar, subset as
  training set?}
|
|
|
|
|
|
|
|
|
|
\subsubsection{Effectivety for small training sets}
|
|
|
|
|
\subsubsection{\titlecap{effectiveness for small training sets}}
|
|
|
|
|
|
|
|
|
|
For some applications (e.g.\ medical problems with a small number of patients)
|
|
|
|
|
the available data can be highly limited. In the following the impact
|
|
|
|
|
on highly reduced training sets has been ... for ... and the results
|
|
|
|
|
are given in Figure ...
|
|
|
|
|
the available data can be highly limited.
|
|
|
|
|
In order to get an understanding of the achievable accuracy in such a
|
|
|
|
|
scenario in the following we examine the ... and .. with a highly
|
|
|
|
|
reduced training set and the impact the above mentioned strategies on
|
|
|
|
|
combating overfitting have.
|
|
|
|
|
|
|
|
|
|
\clearpage
|
|
|
|
|
\section{Bla}
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item generate more data, GAN etc
|
|
|
|
|
\item Transfer learning, use network trained on different task and
|
|
|
|
|
repurpose it / train it with the training data
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
%%% Local Variables:
|
|
|
|
|
%%% mode: latex
|
|
|
|
|