main
Tobias Arndt 4 years ago
parent 7ffa8e63f2
commit 1a45e7d596

@ -1,4 +1,4 @@
\documentclass{article} \documentclass[a4paper, 12pt, draft=true]{article}
\usepackage{pgfplots} \usepackage{pgfplots}
\usepackage{filecontents} \usepackage{filecontents}
\usepackage{subcaption} \usepackage{subcaption}
@ -78,12 +78,8 @@ plot coordinates {
\end{tabu} \end{tabu}
\caption{Performace metrics after 20 epochs} \caption{Performace metrics after 20 epochs}
\end{subfigure} \end{subfigure}
\caption{The neural network given in ?? trained with different \caption{Performance metrics of the network given in ... trained
algorithms on the MNIST handwritten digits data set. For gradient with different optimization algorithms}
descent the learning rated 0.01, 0.05 and 0.1 are (GD$_{
rate}$). For
stochastic gradient descend a batch size of 32 and learning rate
of 0.01 is used (SDG$_{0.01}$)}
\end{figure} \end{figure}
\begin{center} \begin{center}
@ -147,6 +143,58 @@ plot coordinates {
left} left}
\end{figure} \end{figure}
\begin{figure}
\centering
\begin{subfigure}{.45\linewidth}
\centering
\begin{tikzpicture}
\begin{axis}[enlargelimits=false, ymin=0, ymax = 1, width=\textwidth]
\addplot [domain=-5:5, samples=101,unbounded coords=jump]{1/(1+exp(-x)};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\begin{subfigure}{.45\linewidth}
\centering
\begin{tikzpicture}
\begin{axis}[enlargelimits=false, width=\textwidth]
\addplot[domain=-5:5, samples=100]{tanh(x)};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\begin{subfigure}{.45\linewidth}
\centering
\begin{tikzpicture}
\begin{axis}[enlargelimits=false, width=\textwidth,
ytick={0,2,4},yticklabels={\hphantom{4.}0,2,4}, ymin=-1]
\addplot[domain=-5:5, samples=100]{max(0,x)};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\begin{subfigure}{.45\linewidth}
\centering
\begin{tikzpicture}
\begin{axis}[enlargelimits=false, width=\textwidth, ymin=-1,
ytick={0,2,4},yticklabels={$\hphantom{-5.}0$,2,4}]
\addplot[domain=-5:5, samples=100]{max(0,x)+ 0.1*min(0,x)};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\end{figure}
\begin{tikzpicture}
\begin{axis}[enlargelimits=false]
\addplot [domain=-5:5, samples=101,unbounded coords=jump]{1/(1+exp(-x)};
\addplot[domain=-5:5, samples=100]{tanh(x)};
\addplot[domain=-5:5, samples=100]{max(0,x)};
\end{axis}
\end{tikzpicture}
\begin{tikzpicture}
\begin{axis}[enlargelimits=false]
\addplot[domain=-2*pi:2*pi, samples=100]{cos(deg(x))};
\end{axis}
\end{tikzpicture}
\end{document} \end{document}

@ -150,7 +150,7 @@ wise. Examples of convolution with both kernels are given in Figure~\ref{fig:img
\begin{subfigure}{0.3\textwidth} \begin{subfigure}{0.3\textwidth}
\centering \centering
\includegraphics[width=\textwidth]{Plots/Data/image_conv9.png} \includegraphics[width=\textwidth]{Plots/Data/image_conv9.png}
\caption{Gaussian Blur $\sigma^2 = 1$} \caption{\hspace{-2pt}Gaussian Blur $\sigma^2 = 1$}
\end{subfigure} \end{subfigure}
\begin{subfigure}{0.3\textwidth} \begin{subfigure}{0.3\textwidth}
\centering \centering
@ -383,15 +383,22 @@ network using true gradients when training for the same mount of time.
\input{Plots/SGD_vs_GD.tex} \input{Plots/SGD_vs_GD.tex}
\clearpage \clearpage
\subsection{\titlecap{modified stochastic gradient descent}} \subsection{\titlecap{modified stochastic gradient descent}}
There is a inherent problem in the sensitivity of the gradient descent An inherent problem of the stochastic gradient descent algorithm is
algorithm regarding the learning rate $\gamma$. its sensitivity to the learning rate $\gamma$. This results in the
The difficulty of choosing the learning rate can be seen problem of having to find a appropriate learning rate for each problem
in Figure~\ref{sgd_vs_gd}. For small rates the progress in each iteration is small which is largely guesswork, the impact of choosing a bad learning rate
but as the rate is enlarged the algorithm can become unstable and can be seen in Figure~\ref{fig:sgd_vs_gd}.
diverge. Even for learning rates small enough to ensure the parameters % There is a inherent problem in the sensitivity of the gradient descent
do not diverge to infinity steep valleys can hinder the progress of % algorithm regarding the learning rate $\gamma$.
the algorithm as with to large leaning rates gradient descent % The difficulty of choosing the learning rate can be seen
``bounces between'' the walls of the valley rather then follow a % in Figure~\ref{sgd_vs_gd}.
For small rates the progress in each iteration is small
but as the rate is enlarged the algorithm can become unstable and the parameters
diverge to infinity. Even for learning rates small enough to ensure the parameters
do not diverge to infinity, steep valleys in the function to be
minimized can hinder the progress of
the algorithm as for leaning rates not small enough gradient descent
``bounces between'' the walls of the valley rather then following a
downward trend in the valley. downward trend in the valley.
% \[ % \[
@ -403,7 +410,8 @@ downward trend in the valley.
To combat this problem \todo{quelle} propose to alter the learning To combat this problem \todo{quelle} propose to alter the learning
rate over the course of training, often called leaning rate rate over the course of training, often called leaning rate
scheduling. The most popular implementations of this are time based scheduling in order to decrease the learning rate over the course of
training. The most popular implementations of this are time based
decay decay
\[ \[
\gamma_{n+1} = \frac{\gamma_n}{1 + d n}, \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
@ -414,12 +422,12 @@ epochs and then decreased according to parameter $d$
\[ \[
\gamma_n = \gamma_0 d^{\text{floor}{\frac{n+1}{r}}} \gamma_n = \gamma_0 d^{\text{floor}{\frac{n+1}{r}}}
\] \]
and exponential decay, where the learning rate is decreased after each epoch, and exponential decay where the learning rate is decreased after each epoch
\[ \[
\gamma_n = \gamma_o e^{-n d}. \gamma_n = \gamma_o e^{-n d}.
\] \]
These methods are able to increase the accuracy of a model by a large These methods are able to increase the accuracy of a model by large
margin as seen in the training of RESnet by \textcite{resnet}. margins as seen in the training of RESnet by \textcite{resnet}.
\todo{vielleicht grafik \todo{vielleicht grafik
einbauen} einbauen}
However stochastic gradient descent with weight decay is However stochastic gradient descent with weight decay is
@ -500,9 +508,9 @@ While the stochastic gradient algorithm is less susceptible to local
extrema than gradient descent the problem still persists especially extrema than gradient descent the problem still persists especially
with saddle points. \textcite{DBLP:journals/corr/Dauphinpgcgb14} with saddle points. \textcite{DBLP:journals/corr/Dauphinpgcgb14}
A approach to the problem of ``getting stuck'' in saddle point or An approach to the problem of ``getting stuck'' in saddle point or
local minima/maxima is the addition of momentum to SDG. Instead of local minima/maxima is the addition of momentum to SDG. Instead of
using the actual gradient for the parameter update a average over the using the actual gradient for the parameter update an average over the
past gradients is used. In order to avoid the need to SAVE the past past gradients is used. In order to avoid the need to SAVE the past
values usually a exponentially decaying average is used resulting in values usually a exponentially decaying average is used resulting in
Algorithm~\ref{alg_momentum}. This is comparable of following the path Algorithm~\ref{alg_momentum}. This is comparable of following the path
@ -534,6 +542,10 @@ build up momentum from approaching it.
\label{alg:gd} \label{alg:gd}
\end{algorithm} \end{algorithm}
In an effort to combine the properties of the momentum method and the
automatic adapted learning rate of \textsc{AdaDelta} \textcite{ADAM}
developed the \textsc{Adam} algorithm. The
Problems / Improvements ADAM \textcite{rADAM} Problems / Improvements ADAM \textcite{rADAM}
@ -541,11 +553,14 @@ Problems / Improvements ADAM \textcite{rADAM}
\SetAlgoLined \SetAlgoLined
\KwInput{Stepsize $\alpha$} \KwInput{Stepsize $\alpha$}
\KwInput{Decay Parameters $\beta_1$, $\beta_2$} \KwInput{Decay Parameters $\beta_1$, $\beta_2$}
Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\; Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
\For{$t \in \left\{1,\dots,T\right\};\, t+1$}{ \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
Compute Gradient: $g_t$\; Compute Gradient: $g_t$\;
Accumulate Gradient: $[E[g^2]_t \leftarrow \rho D[g^2]_{t-1} + Accumulate first and second Moment of the Gradient:
(1-\rho)g_t^2$\; \begin{align*}
m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\;
\end{align*}
Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\; x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
@ -589,41 +604,88 @@ There are two approaches to introduce noise to the model during
learning, either by manipulating the model it self or by manipulating learning, either by manipulating the model it self or by manipulating
the input data. the input data.
\subsubsection{Dropout} \subsubsection{Dropout}
If a neural network has enough hidden nodes to model a training set If a neural network has enough hidden nodes there will be sets of
accuratly weights that accurately fit the training set (proof for a small
Similarly to decision trees and random forests training multiple scenario given in ...) this expecially occurs when the relation
models on the same task and averaging the predictions can improve the between the input and output is highly complex, which requires a large
results and combat overfitting. However training a very large network to model and the training set is limited in size (vgl cnn
number of neural networks is computationally expensive in training wening bilder). However each of these weights will result in different
as well as testing. In order to make this approach feasible predicitons for a test set and all of them will perform worse on the
\textcite{Dropout1} introduced random dropout. test data than the training data. A way to improve the predictions and
Here for each training iteration from a before specified (sub)set of nodes reduce the overfitting would
randomly chosen ones are deactivated (their output is fixed to 0). be to train a large number of networks and average their results (vgl
During training random forests) however this is often computational not feasible in
Instead of using different models and averaging them randomly training as well as testing.
deactivated nodes are used to simulate different networks which all % Similarly to decision trees and random forests training multiple
share the same weights for present nodes. % models on the same task and averaging the predictions can improve the
% results and combat overfitting. However training a very large
% number of neural networks is computationally expensive in training
%as well as testing.
A simple but effective way to introduce noise to the model is by In order to make this approach feasible
deactivating randomly chosen nodes in a layer \textcite{Dropout1} propose random dropout.
The way noise is introduced into Instead of training different models for each data point in a batch
the model is by deactivating certain nodes (setting the output of the randomly chosen nodes in the network are disabled (their output is
node to 0) in the fully connected layers of the convolutional neural fixed to zero) and the updates for the weights in the remaining
networks. The nodes are chosen at random and change in every smaller network are comuted. These the updates computed for each data
iteration, this practice is called Dropout and was introduced by point in the batch are then accumulated and applied to the full
\textcite{Dropout}. network.
This can be compared to many small networks which share their weights
for their active neurons being trained simultaniously.
For testing the ``mean network'' with all nodes active but their
output scaled accordingly to compensate for more active nodes is
used. \todo{comparable to averaging dropout networks, beispiel für
besser in kleinem fall}
% Here for each training iteration from a before specified (sub)set of nodes
% randomly chosen ones are deactivated (their output is fixed to 0).
% During training
% Instead of using different models and averaging them randomly
% deactivated nodes are used to simulate different networks which all
% share the same weights for present nodes.
% A simple but effective way to introduce noise to the model is by
% deactivating randomly chosen nodes in a layer
% The way noise is introduced into
% the model is by deactivating certain nodes (setting the output of the
% node to 0) in the fully connected layers of the convolutional neural
% networks. The nodes are chosen at random and change in every
% iteration, this practice is called Dropout and was introduced by
% \textcite{Dropout}.
\subsubsection{\titlecap{manipulation of input data}}
Another way to combat overfitting is to keep the network from learning
the dataset by manipulating the inputs randomly for each iteration of
training. This is commonly used in image based tasks as there are
often ways to maipulate the input while still being sure the labels
remain the same. For example in a image classification task such as
handwritten digits the associated label should remain right when the
image is rotated or stretched by a small amount.
When using this one has to be sure that the labels indeed remain the
same or else the network will not learn the desired ...
In the case of handwritten digits for example a to high rotation angle
will ... a nine or six.
The most common transformations are rotation, zoom, shear, brightness, mirroring.
\todo{Vergleich verschiedene dropout größen auf MNSIT o.ä., subset als \todo{Vergleich verschiedene dropout größen auf MNSIT o.ä., subset als
training set?} training set?}
\subsubsection{Effectivety for small training sets} \subsubsection{\titlecap{effectivety for small training sets}}
For some applications (medical problems with small amount of patients) For some applications (medical problems with small amount of patients)
the available data can be highly limited. In the following the impact the available data can be highly limited.
on highly reduced training sets has been ... for ... and the results In order to get a understanding for the achievable accuracy for such a
are given in Figure ... scenario in the following we examine the ... and .. with a highly
reduced training set and the impact the above mentioned strategies on
combating overfitting have.
\clearpage
\section{Bla}
\begin{itemize}
\item generate more data, GAN etc
\item Transfer learning, use network trained on different task and
repurpose it / train it with the training data
\end{itemize}
%%% Local Variables: %%% Local Variables:
%%% mode: latex %%% mode: latex

@ -87,6 +87,93 @@ except for the input layer, which recieves the components of the input.
\subsection{Nonlinearity of Neural Networks} \subsection{Nonlinearity of Neural Networks}
The arguably most important feature of neural networks that sets them
apart from linear models is the activation function implemented in the
neurons. As seen in Figure~\ref{fig:neuron} on the weighted sum of the
inputs a activation function $\sigma$ is applied in order to obtain
the output resulting in the output being given by
\[
o_k = \sigma\left(b_k + \sum_{j=1}^m w_{k,j} i_j\right).
\]
The activation function is usually chosen nonlinear (a linear one
would result in the entire model collapsing into a linear one) which
allows it to better model data (beispiel satz ...).
There are two types of activation functions, saturating and not
saturating ones. Popular examples for the former are sigmoid
functions where most commonly the standard logisitc function or tanh are used
as they have easy to compute derivatives which is ... for gradient
based optimization algorithms. The standard logistic function (often
referred to simply as sigmoid function) is given by
\[
f(x) = \frac{1}{1+e^{-x}}
\]
and has a realm of $[0,1]$. Its usage as an activation function is
motivated by modeling neurons which
are close to deactive until a certain threshold where they grow in
intensity until they are fully
active, which is similar to the behavior of neurons in brains
\todo{besser schreiben}. The tanh function is given by
\[
tanh(x) = \frac{2}{e^{2x}+1}
\]
The downside of these saturating activation functions is that given
their ... their derivatives are close to zero for large or small
input values which can ... the ... of gradient based methods.
The nonsaturating activation functions commonly used are the recified
linear using (ReLU) or the leaky RelU. The ReLU is given by
\[
r(x) = \max\left\{0, x\right\}.
\]
This has the benefit of having a constant derivative for values larger
than zero. However the derivative being zero ... . The leaky ReLU is
an attempt to counteract this problem by assigning a small constant
derivative to all values smaller than zero and for scalar $\alpha$ is given by
\[
l(x) = \max\left\{0, x\right\} + \alpha.
\]
In order to illustrate these functions plots of them are given in Figure~\ref{fig:activation}.
\begin{figure}
\centering
\begin{subfigure}{.45\linewidth}
\centering
\begin{tikzpicture}
\begin{axis}[enlargelimits=false, ymin=0, ymax = 1, width=\textwidth]
\addplot [domain=-5:5, samples=101,unbounded coords=jump]{1/(1+exp(-x)};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\begin{subfigure}{.45\linewidth}
\centering
\begin{tikzpicture}
\begin{axis}[enlargelimits=false, width=\textwidth]
\addplot[domain=-5:5, samples=100]{tanh(x)};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\begin{subfigure}{.45\linewidth}
\centering
\begin{tikzpicture}
\begin{axis}[enlargelimits=false, width=\textwidth,
ytick={0,2,4},yticklabels={\hphantom{4.}0,2,4}, ymin=-1]
\addplot[domain=-5:5, samples=100]{max(0,x)};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\begin{subfigure}{.45\linewidth}
\centering
\begin{tikzpicture}
\begin{axis}[enlargelimits=false, width=\textwidth, ymin=-1,
ytick={0,2,4},yticklabels={$\hphantom{-5.}0$,2,4}]
\addplot[domain=-5:5, samples=100]{max(0,x)+ 0.1*min(0,x)};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\caption{Plots of the activation fucntoins...}
\label{fig:activation}
\end{figure}
\begin{figure} \begin{figure}
@ -173,6 +260,7 @@ except for the input layer, which recieves the components of the input.
\end{tikzpicture} \end{tikzpicture}
\caption{Structure of a single neuron} \caption{Structure of a single neuron}
\label{fig:neuron}
\end{figure} \end{figure}
\clearpage \clearpage
@ -345,19 +433,19 @@ large in networks with multiple layers of high neuron count naively
computing these can get quite memory and computational expensive. But computing these can get quite memory and computational expensive. But
by using the chain rule and exploiting the layered structure we can by using the chain rule and exploiting the layered structure we can
compute the gradient much more efficiently by using backpropagation compute the gradient much more efficiently by using backpropagation
first introduced by \textcite{backprop}. introduced by \textcite{backprop}.
\subsubsection{Backpropagation} % \subsubsection{Backpropagation}
As with an increasing amount of layers the derivative of a loss % As with an increasing amount of layers the derivative of a loss
function with respect to a certain variable becomes more intensive to % function with respect to a certain variable becomes more intensive to
compute there have been efforts in increasing the efficiency of % compute there have been efforts in increasing the efficiency of
computing these derivatives. Today the BACKPROPAGATION algorithm is % computing these derivatives. Today the BACKPROPAGATION algorithm is
widely used to compute the derivatives needed for the optimization % widely used to compute the derivatives needed for the optimization
algorithms. Here instead of naively calculating the derivative for % algorithms. Here instead of naively calculating the derivative for
each variable, the chain rule is used in order to compute derivatives % each variable, the chain rule is used in order to compute derivatives
for each layer from output layer towards the first layer while only % for each layer from output layer towards the first layer while only
needing to .... % needing to ....
\[ \[
\frac{\partial L(...)}{} \frac{\partial L(...)}{}

@ -220,7 +220,7 @@ plot coordinates {
Networks} Networks}
In this section we will analyze the connection of randomized shallow This section is based on \textcite{heiss2019}. We will analyze the connection of randomized shallow
Neural Networks with one dimensional input and regression splines. We Neural Networks with one dimensional input and regression splines. We
will see that the punishment of the size of the weights in training will see that the punishment of the size of the weights in training
the randomized shallow the randomized shallow

Loading…
Cancel
Save