progress
This commit is contained in:
parent 7ffa8e63f2
commit 1a45e7d596
@@ -1,4 +1,4 @@
-\documentclass{article}
+\documentclass[a4paper, 12pt, draft=true]{article}
 \usepackage{pgfplots}
 \usepackage{filecontents}
 \usepackage{subcaption}
@@ -78,12 +78,8 @@ plot coordinates {
 \end{tabu}
 \caption{Performance metrics after 20 epochs}
 \end{subfigure}
-\caption{The neural network given in ?? trained with different
-algorithms on the MNIST handwritten digits data set. For gradient
-descent the learning rated 0.01, 0.05 and 0.1 are (GD$_{
-rate}$). For
-stochastic gradient descend a batch size of 32 and learning rate
-of 0.01 is used (SDG$_{0.01}$)}
+\caption{Performance metrics of the network given in ... trained
+with different optimization algorithms}
 \end{figure}
 
 \begin{center}
@@ -147,6 +143,58 @@ plot coordinates {
 left}
 \end{figure}
 
+\begin{figure}
+\centering
+\begin{subfigure}{.45\linewidth}
+\centering
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false, ymin=0, ymax = 1, width=\textwidth]
+\addplot [domain=-5:5, samples=101,unbounded coords=jump]{1/(1+exp(-x))};
+\end{axis}
+\end{tikzpicture}
+\end{subfigure}
+\begin{subfigure}{.45\linewidth}
+\centering
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false, width=\textwidth]
+\addplot[domain=-5:5, samples=100]{tanh(x)};
+\end{axis}
+\end{tikzpicture}
+\end{subfigure}
+\begin{subfigure}{.45\linewidth}
+\centering
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false, width=\textwidth,
+ytick={0,2,4},yticklabels={\hphantom{4.}0,2,4}, ymin=-1]
+\addplot[domain=-5:5, samples=100]{max(0,x)};
+\end{axis}
+\end{tikzpicture}
+\end{subfigure}
+\begin{subfigure}{.45\linewidth}
+\centering
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false, width=\textwidth, ymin=-1,
+ytick={0,2,4},yticklabels={$\hphantom{-5.}0$,2,4}]
+\addplot[domain=-5:5, samples=100]{max(0,x)+ 0.1*min(0,x)};
+\end{axis}
+\end{tikzpicture}
+\end{subfigure}
+\end{figure}
+
+
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false]
+\addplot [domain=-5:5, samples=101,unbounded coords=jump]{1/(1+exp(-x))};
+\addplot[domain=-5:5, samples=100]{tanh(x)};
+\addplot[domain=-5:5, samples=100]{max(0,x)};
+\end{axis}
+\end{tikzpicture}
+
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false]
+\addplot[domain=-2*pi:2*pi, samples=100]{cos(deg(x))};
+\end{axis}
+\end{tikzpicture}
+
 \end{document}
 
 
@@ -150,7 +150,7 @@ wise. Examples of convolution with both kernels are given in Figure~\ref{fig:img
 \begin{subfigure}{0.3\textwidth}
 \centering
 \includegraphics[width=\textwidth]{Plots/Data/image_conv9.png}
-\caption{Gaussian Blur $\sigma^2 = 1$}
+\caption{\hspace{-2pt}Gaussian Blur $\sigma^2 = 1$}
 \end{subfigure}
 \begin{subfigure}{0.3\textwidth}
 \centering
@@ -383,15 +383,22 @@ network using true gradients when training for the same amount of time.
 \input{Plots/SGD_vs_GD.tex}
 \clearpage
 \subsection{\titlecap{modified stochastic gradient descent}}
-There is a inherent problem in the sensitivity of the gradient descent
-algorithm regarding the learning rate $\gamma$.
-The difficulty of choosing the learning rate can be seen
-in Figure~\ref{sgd_vs_gd}. For small rates the progress in each iteration is small
-but as the rate is enlarged the algorithm can become unstable and
-diverge. Even for learning rates small enough to ensure the parameters
-do not diverge to infinity steep valleys can hinder the progress of
-the algorithm as with to large leaning rates gradient descent
-``bounces between'' the walls of the valley rather then follow a
+An inherent problem of the stochastic gradient descent algorithm is
+its sensitivity to the learning rate $\gamma$. This results in the
+problem of having to find an appropriate learning rate for each problem,
+which is largely guesswork; the impact of choosing a bad learning rate
+can be seen in Figure~\ref{fig:sgd_vs_gd}.
+% There is a inherent problem in the sensitivity of the gradient descent
+% algorithm regarding the learning rate $\gamma$.
+% The difficulty of choosing the learning rate can be seen
+% in Figure~\ref{sgd_vs_gd}.
+For small rates the progress in each iteration is small,
+but as the rate is enlarged the algorithm can become unstable and the parameters
+diverge to infinity. Even for learning rates small enough to ensure the parameters
+do not diverge to infinity, steep valleys in the function to be
+minimized can hinder the progress of
+the algorithm, as for learning rates not small enough gradient descent
+``bounces between'' the walls of the valley rather than following a
 downward trend in the valley.
 
 % \[
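The sensitivity to the learning rate described in this hunk can be illustrated with a minimal sketch (illustrative Python, not part of the diff; the quadratic objective and the particular rates are assumptions chosen for demonstration):

```python
def gradient_descent(gamma, steps=50, x0=5.0):
    """Minimize f(x) = x^2 (gradient f'(x) = 2x) with fixed learning rate gamma."""
    x = x0
    for _ in range(steps):
        x -= gamma * 2 * x  # update step: x <- x - gamma * f'(x)
    return x

# A small rate makes steady progress towards the minimum at 0,
# while a rate that is too large makes the iterates grow without bound.
small = gradient_descent(0.1)
large = gradient_descent(1.1)
```

With `gamma = 0.1` each step multiplies `x` by `0.8`, so the iterates shrink; with `gamma = 1.1` each step multiplies `x` by `-1.2`, so they oscillate with growing amplitude, the "bouncing" behavior described above.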
@@ -403,7 +410,8 @@ downward trend in the valley.
 
 To combat this problem \todo{source} propose to alter the learning
 rate over the course of training, often called learning rate
-scheduling. The most popular implementations of this are time based
+scheduling, in order to decrease the learning rate over the course of
+training. The most popular implementations of this are time based
 decay
 \[
 \gamma_{n+1} = \frac{\gamma_n}{1 + d n},
@@ -414,12 +422,12 @@ epochs and then decreased according to parameter $d$
 \[
 \gamma_n = \gamma_0 d^{\lfloor \frac{n+1}{r} \rfloor}
 \]
-and exponential decay, where the learning rate is decreased after each epoch,
+and exponential decay, where the learning rate is decreased after each epoch,
 \[
 \gamma_n = \gamma_0 e^{-n d}.
 \]
-These methods are able to increase the accuracy of a model by a large
-margin as seen in the training of RESnet by \textcite{resnet}.
+These methods are able to increase the accuracy of a model by large
+margins, as seen in the training of ResNet by \textcite{resnet}.
 \todo{maybe include a figure}
 However stochastic gradient descent with weight decay is
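The three schedules in this hunk can be sketched directly (illustrative Python, not part of the source; the function and parameter names are assumptions mirroring the formulas):

```python
import math

def time_based_decay(gamma_n, d, n):
    # gamma_{n+1} = gamma_n / (1 + d*n): rate shrinks relative to the previous rate
    return gamma_n / (1 + d * n)

def step_decay(gamma_0, d, r, n):
    # gamma_n = gamma_0 * d ** floor((n+1)/r): rate held for r epochs, then dropped
    return gamma_0 * d ** math.floor((n + 1) / r)

def exp_decay(gamma_0, d, n):
    # gamma_n = gamma_0 * exp(-n*d): rate decreased after every epoch
    return gamma_0 * math.exp(-n * d)

# with r = 10 the step schedule has decayed exactly once after epoch 9
rate = step_decay(0.1, 0.5, 10, 9)
```

Each schedule is monotonically decreasing in `n`; the difference is only whether the decrease happens continuously, per epoch, or in discrete drops every `r` epochs.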
@@ -500,9 +508,9 @@ While the stochastic gradient algorithm is less susceptible to local
 extrema than gradient descent, the problem still persists, especially
 with saddle points. \textcite{DBLP:journals/corr/Dauphinpgcgb14}
 
-A approach to the problem of ``getting stuck'' in saddle point or
+An approach to the problem of ``getting stuck'' in saddle points or
 local minima/maxima is the addition of momentum to SGD. Instead of
-using the actual gradient for the parameter update a average over the
+using the actual gradient for the parameter update an average over the
 past gradients is used. In order to avoid the need to save the past
 values usually an exponentially decaying average is used, resulting in
 Algorithm~\ref{alg_momentum}. This is comparable to following the path
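The exponentially decaying average of past gradients described here can be sketched as follows (illustrative Python, not the thesis's algorithm listing; parameter names are assumptions):

```python
def momentum_update(x, grad, velocity, gamma=0.01, rho=0.9):
    """One SGD-with-momentum step: the parameter update uses an exponentially
    decaying average of past gradients instead of the raw gradient."""
    velocity = rho * velocity + (1 - rho) * grad
    return x - gamma * velocity, velocity

# an oscillating gradient sequence largely cancels out in the average,
# so the parameter moves much less than with raw gradient steps
x, v = 0.0, 0.0
for g in [1.0, -1.0, 1.0, -1.0]:
    x, v = momentum_update(x, g, v)
```

Because opposite-signed gradients partially cancel in `velocity`, this damps the "bouncing between valley walls" behavior while consistent gradients build up momentum.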
@@ -534,6 +542,10 @@ build up momentum from approaching it.
 \label{alg:gd}
 \end{algorithm}
 
+In an effort to combine the properties of the momentum method and the
+automatically adapted learning rate of \textsc{AdaDelta}, \textcite{ADAM}
+developed the \textsc{Adam} algorithm.
+
 Problems / Improvements ADAM \textcite{rADAM}
 
 
@@ -541,11 +553,14 @@ Problems / Improvements ADAM \textcite{rADAM}
 \SetAlgoLined
 \KwInput{Stepsize $\alpha$}
 \KwInput{Decay Parameters $\beta_1$, $\beta_2$}
-Initialize accumulation variables $E[g^2]_0 = 0, E[\Delta x^2]_0 =0$\;
+Initialize accumulation variables $m_0 = 0$, $v_0 = 0$\;
 \For{$t \in \left\{1,\dots,T\right\};\, t+1$}{
 Compute Gradient: $g_t$\;
-Accumulate Gradient: $[E[g^2]_t \leftarrow \rho D[g^2]_{t-1} +
-(1-\rho)g_t^2$\;
+Accumulate first and second moment of the gradient:
+\begin{align*}
+m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
+v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
+\end{align*}
 Compute Update: $\Delta x_t \leftarrow -\frac{\sqrt{E[\Delta
 x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}} g_t$\;
 Accumulate Updates: $E[\Delta x^2]_t \leftarrow \rho E[\Delta
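The moment accumulation introduced in this hunk can be sketched in Python (illustrative, not part of the source; the bias correction and update step follow the published Adam method, which the excerpt's pseudocode only partially shows, so treat those lines as assumptions):

```python
import numpy as np

def adam_step(x, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # accumulate first and second moments of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias-corrected estimates (the moments start at zero)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # parameter update scaled per coordinate by the second-moment estimate
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

x, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
x, m, v = adam_step(x, np.array([1.0, -1.0]), m, v, t=1)
```

On the first step the bias correction makes `m_hat` equal the raw gradient and `v_hat` its square, so each coordinate moves by roughly `alpha` in the descent direction regardless of the gradient's magnitude.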
@@ -589,41 +604,88 @@ There are two approaches to introduce noise to the model during
 learning, either by manipulating the model itself or by manipulating
 the input data.
 \subsubsection{Dropout}
-If a neural network has enough hidden nodes to model a training set
-accuratly
-Similarly to decision trees and random forests training multiple
-models on the same task and averaging the predictions can improve the
-results and combat overfitting. However training a very large
-number of neural networks is computationally expensive in training
-as well as testing. In order to make this approach feasible
-\textcite{Dropout1} introduced random dropout.
-Here for each training iteration from a before specified (sub)set of nodes
-randomly chosen ones are deactivated (their output is fixed to 0).
-During training
-Instead of using different models and averaging them randomly
-deactivated nodes are used to simulate different networks which all
-share the same weights for present nodes.
+If a neural network has enough hidden nodes there will be sets of
+weights that accurately fit the training set (proof for a small
+scenario given in ...). This especially occurs when the relation
+between the input and output is highly complex, which requires a large
+network to model, and the training set is limited in size (cf. CNNs
+with few images). However each of these weights will result in different
+predictions for a test set and all of them will perform worse on the
+test data than the training data. A way to improve the predictions and
+reduce the overfitting would
+be to train a large number of networks and average their results (cf.
+random forests), however this is often computationally not feasible in
+training as well as testing.
+% Similarly to decision trees and random forests training multiple
+% models on the same task and averaging the predictions can improve the
+% results and combat overfitting. However training a very large
+% number of neural networks is computationally expensive in training
+% as well as testing.
+In order to make this approach feasible
+\textcite{Dropout1} propose random dropout.
+Instead of training different models for each data point in a batch,
+randomly chosen nodes in the network are disabled (their output is
+fixed to zero) and the updates for the weights in the remaining
+smaller network are computed. The updates computed for each data
+point in the batch are then accumulated and applied to the full
+network.
+This can be compared to many small networks which share their weights
+for their active neurons being trained simultaneously.
+For testing the ``mean network'' with all nodes active but their
+output scaled accordingly to compensate for more active nodes is
+used. \todo{comparable to averaging dropout networks, example where
+this is better in a small case}
+% Here for each training iteration from a before specified (sub)set of nodes
+% randomly chosen ones are deactivated (their output is fixed to 0).
+% During training
+% Instead of using different models and averaging them randomly
+% deactivated nodes are used to simulate different networks which all
+% share the same weights for present nodes.
+
+
+
-A simple but effective way to introduce noise to the model is by
-deactivating randomly chosen nodes in a layer
-The way noise is introduced into
-the model is by deactivating certain nodes (setting the output of the
-node to 0) in the fully connected layers of the convolutional neural
-networks. The nodes are chosen at random and change in every
-iteration, this practice is called Dropout and was introduced by
-\textcite{Dropout}.
+% A simple but effective way to introduce noise to the model is by
+% deactivating randomly chosen nodes in a layer
+% The way noise is introduced into
+% the model is by deactivating certain nodes (setting the output of the
+% node to 0) in the fully connected layers of the convolutional neural
+% networks. The nodes are chosen at random and change in every
+% iteration, this practice is called Dropout and was introduced by
+% \textcite{Dropout}.
 
+\subsubsection{\titlecap{manipulation of input data}}
+Another way to combat overfitting is to keep the network from learning
+the dataset by manipulating the inputs randomly for each iteration of
+training. This is commonly used in image based tasks as there are
+often ways to manipulate the input while still being sure the labels
+remain the same. For example in an image classification task such as
+handwritten digits the associated label should remain right when the
+image is rotated or stretched by a small amount.
+When using this one has to be sure that the labels indeed remain the
+same or else the network will not learn the desired ...
+In the case of handwritten digits for example a too high rotation angle
+will ... a nine or six.
+The most common transformations are rotation, zoom, shear, brightness, mirroring.
+
 \todo{compare different dropout sizes on MNIST or similar, subset as
 training set?}
 
-\subsubsection{Effectivety for small training sets}
+\subsubsection{\titlecap{effectiveness for small training sets}}
 
 For some applications (medical problems with a small number of patients)
-the available data can be highly limited. In the following the impact
-on highly reduced training sets has been ... for ... and the results
-are given in Figure ...
+the available data can be highly limited.
+In order to get an understanding of the achievable accuracy in such a
+scenario, in the following we examine the ... and ... with a highly
+reduced training set and the impact the above mentioned strategies on
+combating overfitting have.
+
+\clearpage
+\section{Bla}
+\begin{itemize}
+\item generate more data, GAN etc
+\item Transfer learning, use a network trained on a different task and
+repurpose it / train it with the training data
+\end{itemize}
 
 %%% Local Variables:
 %%% mode: latex
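The train-time masking and test-time "mean network" scaling described in this hunk can be sketched as follows (illustrative Python, not part of the source; the function shape is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p_drop, train):
    if train:
        # training: randomly chosen nodes are disabled, their output fixed to zero
        mask = rng.random(x.shape) >= p_drop
        return x * mask
    # testing: the "mean network" keeps all nodes active but scales their
    # outputs to compensate for the larger number of active nodes
    return x * (1.0 - p_drop)

activations = np.ones(8)
train_out = dropout(activations, 0.5, train=True)   # some entries zeroed
test_out = dropout(activations, 0.5, train=False)   # all entries scaled down
```

The scaling keeps the expected output of each node the same between training and testing, which is what makes the single scaled network a stand-in for averaging the many thinned networks.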
@@ -87,6 +87,93 @@ except for the input layer, which receives the components of the input.
 
 \subsection{Nonlinearity of Neural Networks}
 
+The arguably most important feature of neural networks that sets them
+apart from linear models is the activation function implemented in the
+neurons. As seen in Figure~\ref{fig:neuron}, an activation function
+$\sigma$ is applied to the weighted sum of the inputs in order to obtain
+the output, resulting in the output being given by
+\[
+o_k = \sigma\left(b_k + \sum_{j=1}^m w_{k,j} i_j\right).
+\]
+The activation function is usually chosen nonlinear (a linear one
+would result in the entire model collapsing into a linear one) which
+allows it to better model data (example ...).
+There are two types of activation functions, saturating and
+non-saturating ones. Popular examples for the former are sigmoid
+functions, where most commonly the standard logistic function or tanh are used
+as they have easy to compute derivatives which is ... for gradient
+based optimization algorithms. The standard logistic function (often
+referred to simply as sigmoid function) is given by
+\[
+f(x) = \frac{1}{1+e^{-x}}
+\]
+and has a range of $(0,1)$. Its usage as an activation function is
+motivated by modeling neurons which
+are close to inactive until a certain threshold where they grow in
+intensity until they are fully
+active, which is similar to the behavior of neurons in brains
+\todo{improve wording}. The tanh function is given by
+\[
+\tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}.
+\]
+
+The downside of these saturating activation functions is that given
+their ... their derivatives are close to zero for large or small
+input values which can ... the ... of gradient based methods.
+
+The non-saturating activation functions commonly used are the rectified
+linear unit (ReLU) and the leaky ReLU. The ReLU is given by
+\[
+r(x) = \max\left\{0, x\right\}.
+\]
+This has the benefit of having a constant derivative for values larger
+than zero. However the derivative being zero ... . The leaky ReLU is
+an attempt to counteract this problem by assigning a small constant
+derivative to all values smaller than zero and for scalar $\alpha$ is given by
+\[
+l(x) = \max\left\{0, x\right\} + \alpha \min\left\{0, x\right\}.
+\]
+In order to illustrate these functions plots of them are given in Figure~\ref{fig:activation}.
+
+\begin{figure}
+\centering
+\begin{subfigure}{.45\linewidth}
+\centering
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false, ymin=0, ymax = 1, width=\textwidth]
+\addplot [domain=-5:5, samples=101,unbounded coords=jump]{1/(1+exp(-x))};
+\end{axis}
+\end{tikzpicture}
+\end{subfigure}
+\begin{subfigure}{.45\linewidth}
+\centering
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false, width=\textwidth]
+\addplot[domain=-5:5, samples=100]{tanh(x)};
+\end{axis}
+\end{tikzpicture}
+\end{subfigure}
+\begin{subfigure}{.45\linewidth}
+\centering
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false, width=\textwidth,
+ytick={0,2,4},yticklabels={\hphantom{4.}0,2,4}, ymin=-1]
+\addplot[domain=-5:5, samples=100]{max(0,x)};
+\end{axis}
+\end{tikzpicture}
+\end{subfigure}
+\begin{subfigure}{.45\linewidth}
+\centering
+\begin{tikzpicture}
+\begin{axis}[enlargelimits=false, width=\textwidth, ymin=-1,
+ytick={0,2,4},yticklabels={$\hphantom{-5.}0$,2,4}]
+\addplot[domain=-5:5, samples=100]{max(0,x)+ 0.1*min(0,x)};
+\end{axis}
+\end{tikzpicture}
+\end{subfigure}
+\caption{Plots of the activation functions...}
+\label{fig:activation}
+\end{figure}
+
+
 \begin{figure}
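The four activation functions added in this hunk can be written out directly (illustrative Python mirroring the pgfplots expressions, not part of the source):

```python
import math

def sigmoid(x):
    # standard logistic function, range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # tanh(x) = (e^{2x} - 1) / (e^{2x} + 1), range (-1, 1)
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

def relu(x):
    # rectified linear unit: zero for negative inputs, identity otherwise
    return max(0.0, x)

def leaky_relu(x, alpha=0.1):
    # leaky ReLU: small constant slope alpha for negative inputs,
    # matching the plotted expression max(0,x) + 0.1*min(0,x)
    return max(0.0, x) + alpha * min(0.0, x)
```

Note how sigmoid and tanh saturate (their slope vanishes for large `|x|`) while ReLU and leaky ReLU keep a constant slope for positive inputs, which is the distinction the surrounding text draws.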
@@ -173,6 +260,7 @@ except for the input layer, which receives the components of the input.
 
 \end{tikzpicture}
 \caption{Structure of a single neuron}
+\label{fig:neuron}
 \end{figure}
 
 \clearpage
@@ -345,19 +433,19 @@ large in networks with multiple layers of high neuron count naively
 computing these can get quite memory and computationally expensive. But
 by using the chain rule and exploiting the layered structure we can
 compute the gradient much more efficiently by using backpropagation
-first introduced by \textcite{backprop}.
+introduced by \textcite{backprop}.
 
-\subsubsection{Backpropagation}
+% \subsubsection{Backpropagation}
 
-As with an increasing amount of layers the derivative of a loss
-function with respect to a certain variable becomes more intensive to
-compute there have been efforts in increasing the efficiency of
-computing these derivatives. Today the BACKPROPAGATION algorithm is
-widely used to compute the derivatives needed for the optimization
-algorithms. Here instead of naively calculating the derivative for
-each variable, the chain rule is used in order to compute derivatives
-for each layer from output layer towards the first layer while only
-needing to ....
+% As with an increasing amount of layers the derivative of a loss
+% function with respect to a certain variable becomes more intensive to
+% compute there have been efforts in increasing the efficiency of
+% computing these derivatives. Today the BACKPROPAGATION algorithm is
+% widely used to compute the derivatives needed for the optimization
+% algorithms. Here instead of naively calculating the derivative for
+% each variable, the chain rule is used in order to compute derivatives
+% for each layer from output layer towards the first layer while only
+% needing to ....
 
 \[
 \frac{\partial L(...)}{}
@@ -220,7 +220,7 @@ plot coordinates {
 Networks}
 
 
-In this section we will analyze the connection of randomized shallow
+This section is based on \textcite{heiss2019}. We will analyze the connection of randomized shallow
 Neural Networks with one dimensional input and regression splines. We
 will see that the punishment of the size of the weights in training
 the randomized shallow