|
|
@ -276,11 +276,11 @@ The choice of convolution for image classification tasks is not
|
|
|
|
arbitrary. ... auge... bla bla
|
|
|
|
arbitrary. ... auge... bla bla
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Limitations of the Gradient Descent Algorithm}
|
|
|
|
% \subsection{Limitations of the Gradient Descent Algorithm}
|
|
|
|
|
|
|
|
|
|
|
|
-Hyperparameter guesswork
|
|
|
|
% -Hyperparameter guesswork
|
|
|
|
-Problems navigating valleys -> momentum
|
|
|
|
% -Problems navigating valleys -> momentum
|
|
|
|
-Different scale of gradients for vars in different layers -> ADAdelta
|
|
|
|
% -Different scale of gradients for vars in different layers -> ADAdelta
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Stochastic Training Algorithms}
|
|
|
|
\subsection{Stochastic Training Algorithms}
|
|
|
|
|
|
|
|
|
|
|
@ -368,20 +368,21 @@ The results of the network being trained with gradient descent and
|
|
|
|
stochastic gradient descent for 20 epochs are given in Figure~\ref{fig:sgd_vs_gd}
|
|
|
|
stochastic gradient descent for 20 epochs are given in Figure~\ref{fig:sgd_vs_gd}
|
|
|
|
and Table~\ref{table:sgd_vs_gd}.
|
|
|
|
and Table~\ref{table:sgd_vs_gd}.
|
|
|
|
|
|
|
|
|
|
|
|
\input{Plots/SGD_vs_GD.tex}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Here it can be seen that the network trained with stochastic gradient
|
|
|
|
Here it can be seen that the network trained with stochastic gradient
|
|
|
|
descent is more accurate after the first epoch than the ones trained
|
|
|
|
descent is more accurate after the first epoch than the ones trained
|
|
|
|
with gradient descent after 20 epochs.
|
|
|
|
with gradient descent after 20 epochs.
|
|
|
|
This is due to the former using a batch size of 32 and thus having
|
|
|
|
This is due to the former using a batch size of 32 and thus having
|
|
|
|
made 1875 updates to the weights
|
|
|
|
made 1875 updates to the weights
|
|
|
|
after the first epoch in comparison to one update . While each of
|
|
|
|
after the first epoch in comparison to one update. While each of
|
|
|
|
these updates uses an approximate
|
|
|
|
these updates uses an approximate
|
|
|
|
gradient calculated on the subset, it performs far better than the
|
|
|
|
gradient calculated on the subset, it performs far better than the
|
|
|
|
network using true gradients when training for the same amount of time.
|
|
|
|
network using true gradients when training for the same amount of time.
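The update counts compared above follow from dividing the training set size by the batch size. The following is a minimal sketch of that arithmetic; the training set size of 60000 samples is an assumption made for illustration (it is not stated in this section), and the function name \texttt{updates\_per\_epoch} is hypothetical.

```python
import math


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of weight updates made in one pass over the training data."""
    # The last batch may be smaller than batch_size, so round up.
    return math.ceil(n_samples / batch_size)


# Assumed training set size (hypothetical, not taken from the text).
N_SAMPLES = 60_000

sgd_updates = updates_per_epoch(N_SAMPLES, 32)         # mini-batches of 32 samples
gd_updates = updates_per_epoch(N_SAMPLES, N_SAMPLES)   # one update on the full set
print(sgd_updates, gd_updates)
```

Under this assumption the mini-batch variant makes 1875 updates per epoch, while full-batch gradient descent makes exactly one.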
|
|
|
|
\todo{compare training time}
|
|
|
|
\todo{compare training time}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\input{Plots/SGD_vs_GD.tex}
|
|
|
|
\clearpage
|
|
|
|
\clearpage
|
|
|
|
\subsection{Modified Stochastic Gradient Descent}
|
|
|
|
\subsection{\titlecap{modified stochastic gradient descent}}
|
|
|
|
There is an inherent problem in the sensitivity of the gradient descent
|
|
|
|
There is an inherent problem in the sensitivity of the gradient descent
|
|
|
|
algorithm regarding the learning rate $\gamma$.
|
|
|
|
algorithm regarding the learning rate $\gamma$.
|
|
|
|
The difficulty of choosing the learning rate can be seen
|
|
|
|
The difficulty of choosing the learning rate can be seen
|
|
|
@ -434,7 +435,7 @@ They all scale the gradient for the update depending of past gradients
|
|
|
|
for each weight individually.
|
|
|
|
for each weight individually.
|
|
|
|
|
|
|
|
|
|
|
|
The algorithms build on each other, with the adaptive gradient
|
|
|
|
The algorithms build on each other, with the adaptive gradient
|
|
|
|
algorithm (ADAGRAD, \textcite{ADAGRAD})
|
|
|
|
algorithm (\textsc{AdaGrad}, \textcite{ADAGRAD})
|
|
|
|
laying the groundwork. Here, for each parameter update, the learning rate
|
|
|
|
laying the groundwork. Here, for each parameter update, the learning rate
|
|
|
|
is given by a constant
|
|
|
|
is given by a constant
|
|
|
|
$\gamma$ divided by the sum of the squares of the past partial
|
|
|
|
$\gamma$ divided by the sum of the squares of the past partial
|
|
|
@ -456,11 +457,11 @@ algorithm is given in Algorithm~\ref{alg:ADAGRAD}.
|
|
|
|
1, \dots,p$\;
|
|
|
|
1, \dots,p$\;
|
|
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
\caption{ADAGRAD}
|
|
|
|
\caption{\textls{ADAGRAD}}
|
|
|
|
\label{alg:ADAGRAD}
|
|
|
|
\label{alg:ADAGRAD}
|
|
|
|
\end{algorithm}
|
|
|
|
\end{algorithm}
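The per-weight scaling can be illustrated with a short sketch. It follows the standard formulation of AdaGrad, in which the constant $\gamma$ is divided by the square root of the accumulated squared partial derivatives. The small constant \texttt{eps}, the toy objective and all function names are assumptions introduced only for this illustration.

```python
import math


def adagrad(grad, x0, gamma=0.5, eps=1e-8, steps=200):
    """Minimise a function, given its gradient, with per-coordinate AdaGrad."""
    x = list(x0)
    sq_sum = [0.0] * len(x)  # running sum of squared partial derivatives
    for _ in range(steps):
        g = grad(x)
        for i, g_i in enumerate(g):
            sq_sum[i] += g_i * g_i
            # eps is an assumed small constant that avoids division by zero.
            x[i] -= gamma / (math.sqrt(sq_sum[i]) + eps) * g_i
    return x


def toy_grad(x):
    """Gradient of the toy objective f(x) = x_0^2 + 10 x_1^2."""
    return [2.0 * x[0], 20.0 * x[1]]


x_final = adagrad(toy_grad, [1.0, 1.0])
print(x_final)
```

Although the two partial derivatives differ by a factor of ten, both coordinates take steps of nearly the same size, because each one is normalised by its own gradient history.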
|
|
|
|
|
|
|
|
|
|
|
|
Building on ADAGRAD, \textcite{ADADELTA} developed the ... (ADADELTA)
|
|
|
|
Building on \textsc{AdaGrad}, \textcite{ADADELTA} developed the ... (ADADELTA)
|
|
|
|
in order to address the two main drawbacks of ADAGRAD, namely the
|
|
|
|
in order to address the two main drawbacks of ADAGRAD, namely the
|
|
|
|
continual decay of the learning rate and the need for a manually
|
|
|
|
continual decay of the learning rate and the need for a manually
|
|
|
|
selected global learning rate $\gamma$.
|
|
|
|
selected global learning rate $\gamma$.
|
|
|
@ -476,22 +477,70 @@ if the parameter vector had some hypothetical units they would be matched
|
|
|
|
by those of the parameter update $\Delta x_t$. This proper
|
|
|
|
by those of the parameter update $\Delta x_t$. This proper
|
|
|
|
\todo{explain units}
|
|
|
|
\todo{explain units}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\begin{algorithm}[H]
|
|
|
|
|
|
|
|
\SetAlgoLined
|
|
|
|
|
|
|
|
\KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
|
|
|
|
|
|
|
|
\KwInput{Initial parameter $x_1$}
|
|
|
|
|
|
|
|
Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
|
|
|
|
|
|
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
|
|
|
|
|
|
Compute Gradient: $g_t$\;
|
|
|
|
|
|
|
|
    Accumulate Gradient: $E[g^2]_t \leftarrow \rho E[g^2]_{t-1} +
    (1-\rho)g_t^2$\;
|
|
|
|
|
|
|
|
Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
|
|
|
|
|
|
|
|
x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
|
|
|
|
|
|
|
|
    Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
|
|
|
|
|
|
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
\caption{ADADELTA, \textcite{ADADELTA}}
|
|
|
|
|
|
|
|
\label{alg:gd}
|
|
|
|
|
|
|
|
\end{algorithm}
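The steps of the ADADELTA listing can be sketched in a few lines of code. This is a minimal illustration rather than a reference implementation; the default values for $\rho$ and $\varepsilon$, the toy objective and the function names are assumptions chosen for the example.

```python
import math


def adadelta(grad, x0, rho=0.95, eps=1e-6, steps=500):
    """Minimise a function, given its gradient, with per-coordinate ADADELTA."""
    x = list(x0)
    avg_sq_grad = [0.0] * len(x)  # E[g^2], decaying average of squared gradients
    avg_sq_step = [0.0] * len(x)  # E[dx^2], decaying average of squared updates
    for _ in range(steps):
        g = grad(x)
        for i, g_i in enumerate(g):
            # Accumulate gradient.
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1.0 - rho) * g_i * g_i
            # Compute update from the ratio of the two root mean squares.
            step = -(math.sqrt(avg_sq_step[i] + eps)
                     / math.sqrt(avg_sq_grad[i] + eps)) * g_i
            # Accumulate updates.
            avg_sq_step[i] = rho * avg_sq_step[i] + (1.0 - rho) * step * step
            # Apply update.
            x[i] += step
    return x


def toy_grad(x):
    """Gradient of the toy objective f(x) = x_0^2 + 10 x_1^2."""
    return [2.0 * x[0], 20.0 * x[1]]


x_final = adadelta(toy_grad, [1.0, 1.0])
print(x_final)
```

No learning rate appears in the sketch: the step size is determined by the ratio of the two accumulated averages alone.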
|
|
|
|
|
|
|
|
|
|
|
|
While the stochastic gradient algorithm is less susceptible to local
|
|
|
|
While the stochastic gradient algorithm is less susceptible to local
|
|
|
|
extrema than gradient descent, the problem still persists, especially
|
|
|
|
extrema than gradient descent, the problem still persists, especially
|
|
|
|
with saddle points. \textcite{DBLP:journals/corr/Dauphinpgcgb14}
|
|
|
|
with saddle points. \textcite{DBLP:journals/corr/Dauphinpgcgb14}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
One approach to the problem of ``getting stuck'' in saddle points or
local minima/maxima is the addition of momentum to SGD. Instead of
using the actual gradient for the parameter update, an average over the
past gradients is used. In order to avoid the need to store the past
values, usually an exponentially decaying average is used, resulting in
Algorithm~\ref{alg_momentum}. This is comparable to following the path
of a marble with mass rolling down the slope of the error
function. The decay rate for the average is comparable to the inertia
of the marble.
This results in the algorithm being able to escape ... due to the
built-up momentum from approaching it.
|
|
|
|
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
\begin{itemize}
|
|
|
|
\item ADAM
|
|
|
|
\item ADAM
|
|
|
|
\item momentum
|
|
|
|
\item momentum
|
|
|
|
  \item ADADELTA \textcite{ADADELTA}
|
|
|
|
  \item ADADELTA \textcite{ADADELTA}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\end{itemize}
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\begin{algorithm}[H]
|
|
|
|
\begin{algorithm}[H]
|
|
|
|
\SetAlgoLined
|
|
|
|
\SetAlgoLined
|
|
|
|
\KwInput{Decay Rate $\rho$, Constant $\varepsilon$}
|
|
|
|
\KwInput{Learning Rate $\gamma$, Decay Rate $\rho$}
|
|
|
|
\KwInput{Initial parameter $x_1$}
|
|
|
|
\KwInput{Initial parameter $x_1$}
|
|
|
|
|
|
|
|
Initialize accumulation variables $m_0 = 0$\;
|
|
|
|
|
|
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
|
|
|
|
|
|
Compute Gradient: $g_t$\;
|
|
|
|
|
|
|
|
Accumulate Gradient: $m_t \leftarrow \rho m_{t-1} + (1-\rho) g_t$\;
|
|
|
|
|
|
|
|
Compute Update: $\Delta x_t \leftarrow -\gamma m_t$\;
|
|
|
|
|
|
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
  \caption{SGD with momentum}
|
|
|
|
|
|
|
|
\label{alg:gd}
|
|
|
|
|
|
|
|
\end{algorithm}
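The momentum update with an exponentially decaying average can be sketched as follows. The values for $\gamma$ and $\rho$, the toy objective and the function names are assumptions made for this illustration only.

```python
def sgd_momentum(grad, x0, gamma=0.1, rho=0.9, steps=300):
    """Gradient descent using an exponentially decaying average of gradients."""
    x = list(x0)
    m = [0.0] * len(x)  # accumulated (averaged) gradient, m_0 = 0
    for _ in range(steps):
        g = grad(x)
        for i, g_i in enumerate(g):
            # Accumulate gradient: old average decays by rho.
            m[i] = rho * m[i] + (1.0 - rho) * g_i
            # Compute and apply update.
            x[i] += -gamma * m[i]
    return x


def toy_grad(x):
    """Gradient of the toy objective f(x) = x_0^2 + 10 x_1^2."""
    return [2.0 * x[0], 20.0 * x[1]]


x_final = sgd_momentum(toy_grad, [1.0, 1.0])
print(x_final)
```

Only the current average \texttt{m} has to be kept in memory, which is the reason for preferring the decaying average over an average of stored past gradients.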
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Problems / Improvements ADAM \textcite{rADAM}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\begin{algorithm}[H]
|
|
|
|
|
|
|
|
\SetAlgoLined
|
|
|
|
|
|
|
|
\KwInput{Stepsize $\alpha$}
|
|
|
|
|
|
|
|
\KwInput{Decay Parameters $\beta_1$, $\beta_2$}
|
|
|
|
Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
|
|
|
|
Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
|
|
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
|
|
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
|
|
|
|
Compute Gradient: $g_t$\;
|
|
|
|
Compute Gradient: $g_t$\;
|
|
|
@ -503,10 +552,12 @@ with saddle points. \textcite{DBLP:journals/corr/Dauphinpgcgb14}
|
|
|
|
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
|
|
|
|
    x^2]_{t-1} + (1-\rho)\Delta x_t^2$\;
|
|
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
|
|
Apply Update: $x_{t+1} \leftarrow x_t + \Delta x_t$\;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
\caption{ADADELTA, \textcite{ADADELTA}}
|
|
|
|
\caption{ADAM, \cite{ADAM}}
|
|
|
|
\label{alg:gd}
|
|
|
|
\label{alg:gd}
|
|
|
|
\end{algorithm}
|
|
|
|
\end{algorithm}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\input{Plots/sdg_comparison.tex}
|
|
|
|
\input{Plots/sdg_comparison.tex}
|
|
|
|
|
|
|
|
|
|
|
|
% \subsubsubsection{Stochastic Gradient Descent}
|
|
|
|
% \subsubsubsection{Stochastic Gradient Descent}
|
|
|
@ -533,7 +584,31 @@ by introducing noise into the training of the model. This is a
|
|
|
|
successful strategy for other models as well: a conglomerate of
|
|
|
|
successful strategy for other models as well: a conglomerate of
|
|
|
|
decision trees grown on bootstrapped training samples benefits greatly
|
|
|
|
decision trees grown on bootstrapped training samples benefits greatly
|
|
|
|
from randomizing the features available to use in each training
|
|
|
|
from randomizing the features available to use in each training
|
|
|
|
iteration (Hastie, Bachelorarbeit??). The way noise is introduced into
|
|
|
|
iteration (Hastie, Bachelorarbeit??).
|
|
|
|
|
|
|
|
There are two approaches to introducing noise to the model during
learning, either by manipulating the model itself or by manipulating
the input data.
|
|
|
|
|
|
|
|
\subsubsection{Dropout}
|
|
|
|
|
|
|
|
If a neural network has enough hidden nodes to model a training set
|
|
|
|
|
|
|
|
accurately
|
|
|
|
|
|
|
|
Similarly to decision trees and random forests, training multiple
models on the same task and averaging the predictions can improve the
results and combat overfitting. However, training a very large
number of neural networks is computationally expensive in training
as well as testing. In order to make this approach feasible,
\textcite{Dropout1} introduced random dropout.
Here, in each training iteration, randomly chosen nodes from a previously
specified (sub)set of nodes are deactivated (their output is fixed to 0).
|
|
|
|
|
|
|
|
During training
|
|
|
|
|
|
|
|
Instead of using different models and averaging them, randomly
deactivated nodes are used to simulate different networks, which all
share the same weights for present nodes.
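The deactivation step described above can be sketched as a random mask applied to the outputs of a layer. This is a minimal illustration under stated assumptions: the drop probability \texttt{drop\_prob} and the function name are hypothetical, each node is dropped independently, and no rescaling of the remaining outputs is performed since the text does not describe one.

```python
import random


def dropout(activations, drop_prob, rng):
    """Set each activation to 0 independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


rng = random.Random(0)          # fixed seed so the sketch is reproducible
layer_output = [1.0] * 1000     # stand-in for the outputs of one layer

# A new mask is drawn on every call, i.e. in every training iteration.
masked = dropout(layer_output, 0.5, rng)
print(masked.count(0.0), "of", len(masked), "nodes deactivated")
```

Every call simulates a different thinned network, while the weights of the nodes that remain active are shared between all of them.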
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A simple but effective way to introduce noise to the model is by
|
|
|
|
|
|
|
|
deactivating randomly chosen nodes in a layer
|
|
|
|
|
|
|
|
The way noise is introduced into
|
|
|
|
the model is by deactivating certain nodes (setting the output of the
|
|
|
|
the model is by deactivating certain nodes (setting the output of the
|
|
|
|
node to 0) in the fully connected layers of the convolutional neural
|
|
|
|
node to 0) in the fully connected layers of the convolutional neural
|
|
|
|
networks. The nodes are chosen at random and change in every
|
|
|
|
networks. The nodes are chosen at random and change in every
|
|
|
@ -543,7 +618,12 @@ iteration, this practice is called Dropout and was introduced by
|
|
|
|
\todo{Compare different dropout rates on MNIST or similar, subset as
|
|
|
|
\todo{Compare different dropout rates on MNIST or similar, subset as
|
|
|
|
training set?}
|
|
|
|
training set?}
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Effectively for small training sets}
|
|
|
|
\subsubsection{Effectiveness for small training sets}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
For some applications (e.g.\ medical problems with a small number of patients)
the available data can be highly limited. In the following, the impact
on highly reduced training sets has been ... for ... and the results
are given in Figure ...
|
|
|
|
|
|
|
|
|
|
|
|
%%% Local Variables:
|
|
|
|
%%% Local Variables:
|
|
|
|
%%% mode: latex
|
|
|
|
%%% mode: latex
|
|
|
|