\section{Introduction to Neural Networks}

Neural networks (NN) are a mathematical construct inspired by the
structure of brains in mammals. A network consists of an array of
neurons that receive inputs and compute an accumulated output. These
neurons are arranged in layers, with one input and one output layer
and an arbitrary number of hidden layers between them.
The number of neurons in the input and output layers corresponds to
the desired dimensions of the inputs and outputs of the model.
In conventional neural networks the information is passed in one
direction from the input layer towards the output layer, hence they
are often called feed forward networks. Each neuron in a layer
receives the outputs of all neurons in the preceding layer as input
(fully connected). An illustration of an example neural network is
given in Figure~\ref{fig:nn} and one of a single neuron in
Figure~\ref{fig:neuron}.

\tikzset{%
|
|
every neuron/.style={
|
|
circle,
|
|
draw,
|
|
minimum size=1cm
|
|
},
|
|
neuron missing/.style={
|
|
draw=none,
|
|
scale=1.5,
|
|
text height=0.333cm,
|
|
execute at begin node=\color{black}$\vdots$
|
|
},
|
|
}
|
|
\begin{figure}[h!]
  \centering
  % \fbox{
  \resizebox{\textwidth}{!}{%
    \begin{tikzpicture}[x=1.75cm, y=1.75cm, >=stealth]
      \tikzset{myptr/.style={decoration={markings,mark=at position 1 with %
        {\arrow[scale=1.5,>=stealth]{>}}},postaction={decorate}}}

      \foreach \m/\l [count=\y] in {1,2,3,missing,4}
        \node [every neuron/.try, neuron \m/.try] (input-\m) at (0,2.5-\y) {};

      \foreach \m [count=\y] in {1,missing,2}
        \node [every neuron/.try, neuron \m/.try ] (hidden1-\m) at (2,2-\y*1.25) {};

      \foreach \m [count=\y] in {1,missing,2}
        \node [every neuron/.try, neuron \m/.try ] (hidden2-\m) at (5,2-\y*1.25) {};

      \foreach \m [count=\y] in {1,missing,2}
        \node [every neuron/.try, neuron \m/.try ] (output-\m) at (7,1.5-\y) {};

      \foreach \l [count=\i] in {1,2,3,d_i}
        \draw [myptr] (input-\i)+(-1,0) -- (input-\i)
          node [above, midway] {$x_{\l}$};

      \foreach \l [count=\i] in {1,n_1}
        \node [above] at (hidden1-\i.north) {$\mathcal{N}_{1,\l}$};

      \foreach \l [count=\i] in {1,n_l}
        \node [above] at (hidden2-\i.north) {$\mathcal{N}_{l,\l}$};

      \foreach \l [count=\i] in {1,d_o}
        \draw [myptr] (output-\i) -- ++(1,0)
          node [above, midway] {$O_{\l}$};

      \foreach \i in {1,...,4}
        \foreach \j in {1,...,2}
          \draw [myptr] (input-\i) -- (hidden1-\j);

      \foreach \i in {1,...,2}
        \foreach \j in {1,...,2}
          \draw [myptr] (hidden1-\i) -- (hidden2-\j);

      \foreach \i in {1,...,2}
        \foreach \j in {1,...,2}
          \draw [myptr] (hidden2-\i) -- (output-\j);

      \node [align=center, above] at (0,2) {Input\\layer};
      \node [align=center, above] at (2,2) {Hidden \\layer $1$};
      \node [align=center, above] at (5,2) {Hidden \\layer $l$};
      \node [align=center, above] at (7,2) {Output \\layer};

      \node[fill=white,scale=1.5,inner xsep=10pt,inner ysep=10mm] at ($(hidden1-1)!.5!(hidden2-2)$) {$\dots$};

    \end{tikzpicture}}%}
  \caption[Illustration of a neural network]{Illustration of a neural
    network with $d_i$ inputs, $l$ hidden layers with $n_{\cdot}$
    nodes in each layer, as well as $d_o$ outputs.}
  \label{fig:nn}
\end{figure}

\subsection{Nonlinearity of Neural Networks}

The arguably most important feature of neural networks that sets them
apart from linear models is the activation function implemented in the
neurons. As seen in Figure~\ref{fig:neuron}, an activation function
$\sigma$ is applied to the weighted sum of the inputs in order to
obtain the output, resulting in the output of the $k$-th neuron in a
layer being given by
\[
  o_k = \sigma\left(b_k + \sum_{j=1}^m w_{k,j} i_j\right)
\]
for weights $w_{k,j}$ and biases $b_k$.
The activation function is usually chosen to be nonlinear (a linear
one would result in the entire model collapsing into a linear
one\todo{proof?}), which allows the network to better model data where
the relation between input and output is of a nonlinear nature.

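To see why a linear activation would collapse the model, consider as a
sketch (in notation not used elsewhere in this text) two consecutive
fully connected layers with weight matrices $W^{(1)}, W^{(2)}$, biases
$b^{(1)}, b^{(2)}$ and the identity as activation function. Their
composition is
\[
  W^{(2)}\left(W^{(1)} x + b^{(1)}\right) + b^{(2)}
  = \left(W^{(2)} W^{(1)}\right) x
    + \left(W^{(2)} b^{(1)} + b^{(2)}\right),
\]
which is again an affine function of the input $x$; by induction the
same holds for any number of layers, so the whole network could be
replaced by a single linear model.
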
There are two types of activation functions: saturating and
non-saturating ones. Popular examples of the former are sigmoid
functions, where most commonly the standard logistic function or the
tangens hyperbolicus are used, as they have easy to compute
derivatives, which is desirable for gradient based optimization
algorithms. The standard logistic function (often referred to simply
as sigmoid function) is given by
\[
  f(x) = \frac{1}{1+e^{-x}}
\]
and has a range of $(0,1)$. Its usage as an activation function is
motivated by modeling neurons which are close to inactive until a
certain threshold is reached, after which they grow in intensity until
they are fully active, which is similar to the behavior of neurons in
brains\todo{phrase better}. The tangens hyperbolicus is given by
\[
  \tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}.
\]

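The easy to compute derivatives mentioned above can be made explicit:
both functions satisfy simple identities that express the derivative
through the function value itself,
\[
  f'(x) = \frac{e^{-x}}{\left(1+e^{-x}\right)^2}
        = f(x)\left(1 - f(x)\right),
  \qquad
  \frac{\mathrm{d}}{\mathrm{d}x}\tanh(x) = 1 - \tanh(x)^2,
\]
so once the activations have been computed in a forward pass their
derivatives are available at essentially no additional cost.
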
The downside of these saturating activation functions is that, given
their saturating nature, their derivatives are close to zero for large
or small input values, which can slow or hinder the progress of
gradient based methods.

The non-saturating activation functions commonly used are the
rectified linear unit (ReLU) and the leaky ReLU. The ReLU is given by
\[
  r(x) = \max\left\{0, x\right\}.
\]
This has the benefit of having a constant derivative for values larger
than zero. However, for values smaller than zero the derivative is
zero, which has the same downside for fitting the model with gradient
based methods. The leaky ReLU is an attempt to counteract this problem
by assigning a small constant slope to all values smaller than zero;
for a scalar $\alpha$ it is given by
\[
  l(x) = \max\left\{0, x\right\} + \alpha \min \left\{0, x\right\}.
\]
Plots of these functions are given in Figure~\ref{fig:activation}.

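This behavior is visible directly in the derivatives of both
functions, which are constant on each side of zero (and undefined at
zero itself, where in practice a fixed value is usually chosen):
\[
  r'(x) =
  \begin{cases}
    1, & x > 0, \\
    0, & x < 0,
  \end{cases}
  \qquad
  l'(x) =
  \begin{cases}
    1, & x > 0, \\
    \alpha, & x < 0.
  \end{cases}
\]
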
\begin{figure}
  \centering
  \begin{subfigure}{.45\linewidth}
    \centering
    \begin{tikzpicture}
      \begin{axis}[enlargelimits=false, ymin=0, ymax = 1, width=\textwidth]
        \addplot [domain=-5:5, samples=101, unbounded coords=jump]{1/(1+exp(-x))};
      \end{axis}
    \end{tikzpicture}
    \caption{\titlecap{standard logistic function}}
  \end{subfigure}
  \begin{subfigure}{.45\linewidth}
    \centering
    \begin{tikzpicture}
      \begin{axis}[enlargelimits=false, width=\textwidth]
        \addplot[domain=-5:5, samples=100]{tanh(x)};
      \end{axis}
    \end{tikzpicture}
    \caption{\titlecap{tangens hyperbolicus}}
  \end{subfigure}
  \begin{subfigure}{.45\linewidth}
    \centering
    \begin{tikzpicture}
      \begin{axis}[enlargelimits=false, width=\textwidth,
          ytick={0,2,4}, yticklabels={\hphantom{4.}0,2,4}, ymin=-1]
        \addplot[domain=-5:5, samples=100]{max(0,x)};
      \end{axis}
    \end{tikzpicture}
    \caption{ReLU}
  \end{subfigure}
  \begin{subfigure}{.45\linewidth}
    \centering
    \begin{tikzpicture}
      \begin{axis}[enlargelimits=false, width=\textwidth, ymin=-1,
          ytick={0,2,4}, yticklabels={$\hphantom{-5.}0$,2,4}]
        \addplot[domain=-5:5, samples=100]{max(0,x)+ 0.1*min(0,x)};
      \end{axis}
    \end{tikzpicture}
    \caption{Leaky ReLU, $\alpha = 0.1$}
  \end{subfigure}
  \caption{Plots of the activation functions}
  \label{fig:activation}
\end{figure}

\begin{figure}
  \begin{tikzpicture}[x=1.5cm, y=1.5cm, >=stealth]

    \tikzset{myptr/.style={decoration={markings,mark=at position 1 with %
      {\arrow[scale=1.5,>=stealth]{>}}},postaction={decorate}}}

    \node [circle, draw, fill=black, inner sep = 0pt, minimum size =
      1.5mm, left] (i_1) at (0, 2.5) {};
    \node [align=left, left] at (-0.125, 2.5) {\(i_1\)};
    \node [circle, draw, fill=black, inner sep = 0pt, minimum size =
      1.5mm] (i_2) at (0, 1.25) {};
    \node [align=left, left] at (-0.125, 1.25) {\(i_2\)};
    \node [neuron missing] (i_3) at (0, 0) {};
    \node [circle, draw, fill=black, inner sep = 0pt, minimum size =
      1.5mm] (i_4) at (0, -1.25) {};
    \node [align=left, left] at (-0.125, -1.25) {\(i_m\)};
    \draw[decoration={calligraphic brace,amplitude=5pt, mirror}, decorate, line width=1.25pt]
      (-0.6,2.7) -- (-0.6,-1.45) node [black, midway, xshift=-0.6cm, left] {Inputs};

    \node [align = center, above] at (1.25, 3) {Synaptic\\weights};
    \node [every neuron] (w_1) at (1.25, 2.5) {\(w_{k, 1}\)};
    \node [every neuron] (w_2) at (1.25, 1.25) {\(w_{k, 2}\)};
    \node [neuron missing] (w_3) at (1.25, 0) {};
    \node [every neuron] (w_4) at (1.25, -1.25) {\(w_{k, m}\)};

    \node [circle, draw] (sig) at (3, 0.625) {\Large\(\sum\)};
    \node [align = center, below] at (3, 0) {Summing \\junction};

    \node [draw, minimum size = 1.25cm] (act) at (4.5, 0.625)
      {\(\sigma(.)\)};
    \node [align = center, above] at (4.5, 1.25) {Activation \\function};

    \node [circle, draw, fill=black, inner sep = 0pt, minimum size =
      1.5mm] (b) at (3, 2.5) {};
    \node [align = center, above] at (3, 2.75) {Bias \\\(b_k\)};

    \node [align = center] (out) at (6, 0.625) {Output \\\(o_k\)};

    \draw [myptr] (i_1) -- (w_1);
    \draw [myptr] (i_2) -- (w_2);
    \draw [myptr] (i_4) -- (w_4);

    \draw [myptr] (w_1) -- (sig);
    \draw [myptr] (w_2) -- (sig);
    \draw [myptr] (w_4) -- (sig);

    \draw [myptr] (b) -- (sig);

    \draw [myptr] (sig) -- (act);

    \draw [myptr] (act) -- (out);

    % \foreach \m [count=\y] in {1,2,missing,3,4}
    %   \node [every neuron/.try, neuron \m/.try ] (hidden-\m) at (1.25,3.25-\y*1.25) {\(w_{k,\y}\)};

    % \foreach \m [count=\y] in {1}
    %   \node [every neuron/.try, neuron \m/.try ] (output-\m) at (2.5,0.5-\y) {};

    % \foreach \l [count=\i] in {1}
    %   \draw [<-] (input-\i) -- ++(-1,0)
    %     node [above, midway] {$x$};

    % \foreach \l [count=\i] in {1,2,n-1,n}
    %   \node [above] at (hidden-\i.north) {$\mathcal{N}_{\l}$};

    % \foreach \l [count=\i] in {1,n_l}
    %   \node [above] at (output-\i.north) {};

    % \foreach \l [count=\i] in {1}
    %   \draw [->] (output-\i) -- ++(1,0)
    %     node [above, midway] {$y$};

    % \foreach \i in {1}
    %   \foreach \j in {1,2,...,3,4}
    %     \draw [->] (input-\i) -- (hidden-\j);

    % \foreach \i in {1,2,...,3,4}
    %   \foreach \j in {1}
    %     \draw [->] (hidden-\i) -- (output-\j);

  \end{tikzpicture}
  \caption{Structure of a single neuron}
  \label{fig:neuron}
\end{figure}

\clearpage

\subsection{Training Neural Networks}

As neural networks are a parametric model, we need to fit them to
input data in order to get meaningful output from the network. In
order to do this, we first need to discuss how we interpret the output
of the neural network.

% After a neural network model is designed, like most statistical models
% it has to be fit to the data. In the machine learning context this is
% often called ``training'' as due to the complexity and amount of
% variables in these models they are fitted iteratively to the data,
% ``learing'' the properties of the data better with each iteration.

% There are two main categories of machine learning models, being
% supervised and unsupervised learners. Unsupervised learners learn
% structure in the data without guidance form outside (as labeling data
% beforehand for training) popular examples of this are clustering
% algorithms\todo{quelle}. Supervised learners on the other hand are as
% the name suggest supervised during learning. This generally amounts to
% using data with the expected response (label) attached to each
% data-point in fitting the model, where usually some distance between
% the model output and the labels is minimized.

\subsubsection{\titlecap{nonlinearity in the last layer}}

Given the nature of the neural network, the outputs of the last layer
are real numbers. For regression tasks this is desirable; for
classification problems, however, some transformation might be
necessary.
As the goal in the latter is to predict a certain class or classes for
an object, the output needs to be of a form that allows this
interpretation.
Commonly the nodes in the output layer each correspond to a class, and
the class chosen as prediction is the one with the highest value at
the corresponding output node.
This corresponds to a transformation of the output vector $o$ into a
one-hot vector
\[
  \text{pred}_i =
  \begin{cases}
    1,& \text{if } o_i = \max_j o_j, \\
    0,& \text{else}.
  \end{cases}
\]
This, however, makes training the model with gradient based methods
impossible, as the derivative of the transformation is either zero or
undefined.

A continuous transformation that is close to the argmax one is given
by the softmax function
\[
  \text{softmax}(o)_i = \frac{e^{o_i}}{\sum_j e^{o_j}}.
\]
The softmax function maps the output into the interval $[0,1]$ and the
individual values sum to one; thus the output can be interpreted as a
probability for each class given the input.
In addition to being differentiable, this allows for evaluating the
certainty of a prediction, rather than just whether it is accurate.

A similar effect is obtained when, for a binary or two class problem,
the sigmoid function
\[
  f(x) = \frac{1}{1 + e^{-x}}
\]
is used and the output $f(x)$ is interpreted as the probability of the
first class and $1-f(x)$ as that of the second class.

\todo{maybe add additive invariance}

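One additional property of the softmax function, which follows
directly from the definition, is its invariance to adding a constant
$c \in \mathbb{R}$ to every component of $o$:
\[
  \text{softmax}(o + c)_i = \frac{e^{o_i + c}}{\sum_j e^{o_j + c}}
  = \frac{e^c \, e^{o_i}}{e^c \sum_j e^{o_j}} = \text{softmax}(o)_i.
\]
This is commonly exploited for numerically stable implementations by
choosing $c = -\max_j o_j$ before exponentiating.
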
% Another property that makes softmax attractive is the invariance to addition
% \[
%   \text{sofmax}(o) = \text{softmax}(o + c
% \]

% In order to properly interpret the output of a neural network and
% training it, depending on the problem it might be advantageous to
% transform the output form the last layer. Given the nature of the
% neural network the value at each output node is a real number. This is
% desirable for applications where the desired output is a real numbered
% vector (e.g. steering inputs for a autonomous car), however for
% classification problems it is desirable to transform this
% output. Often classification problems are modeled in such a way that
% each output node corresponds to a class. Then the output vector needs
% to be normalized in order to give a prediction. The naive approach is
% to transform the output vector $o$ into a one-hot vector $p$
% corresponding to a $0$
% entry for all classes except one, which is the predicted class.

% \[
%   p_i =
%   \begin{cases}
%     1,& i < j, \forall i,j \in \text{arg}\max o_i, \\
%     0,& \text{else.}
%   \end{cases}
% \]\todo{besser formulieren}

% However this imposes difficulties in training the network as with this
% addition the model is no longer differentiable which imitates the
% ways the model can be trained. Additionally information about the
% ``certainty'' for each class in the prediction gets lost. A popular
% way to circumvent this problem is to normalize the output vector is
% such a way that the entries add up to one, this allows for the
% interpretation of probabilities assigned to each class.

\subsubsection{Error Measurement}

In order to make assessments about the quality of a network
$\mathcal{NN}$ and to train it, we need to discuss how we measure
error. The choice of the error function is highly dependent on the
type of the problem. For regression problems a commonly used error
measure is the mean squared error (MSE), which for a function $f$ and
data $(x_i,y_i), i=1,\dots,n$ is given by
\[
  MSE(f) = \frac{1}{n} \sum_{i=1}^n \left(f(x_i) - y_i\right)^2.
\]
However, depending on the problem, error measures with different
properties might be needed; for example, in some contexts it is
required to consider a proportional rather than an absolute error, as
is common in time series models. \todo{odd phrasing}
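A common example of such a proportional measure would be the mean
absolute percentage error,
\[
  MAPE(f) = \frac{1}{n} \sum_{i=1}^n
    \left|\frac{f(x_i) - y_i}{y_i}\right|,
\]
which weights each error relative to the magnitude of the
corresponding observation (and requires $y_i \neq 0$).
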
As discussed above, the output of a neural network for a
classification problem can be interpreted as a probability
distribution over the classes conditioned on the input. In this case
it is desirable to use error functions designed to compare probability
distributions. A widespread error function for this use case is the
cross entropy (\textcite{PRML}), which for two discrete distributions
$p, q$ with the same support $C$ is given by
\[
  H(p, q) = \sum_{c \in C} p(c) \ln\left(\frac{1}{q(c)}\right)
\]
and compares a distribution $q$ to a true underlying distribution
$p$. For a data set $(x_i,y_i), i = 1,\dots,n$, where each $y_{i,c}$
corresponds to the probability of class $c$ given $x_i$, and a
predictor $f$, we get the loss function
\[
  L(f) = \sum_{i=1}^n H(y_i, f(x_i)).
\]

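If each label $y_i$ is a one-hot vector, i.e. $y_{i,c_i} = 1$ for the
true class $c_i$ and zero otherwise, this loss reduces to the negative
log-likelihood of the model,
\[
  L(f) = \sum_{i=1}^n H(y_i, f(x_i))
       = -\sum_{i=1}^n \ln\left(f(x_i)_{c_i}\right),
\]
so minimizing the cross entropy amounts to maximum likelihood
estimation. Moreover, if the last layer of the network applies the
softmax function to outputs $o$, the derivative of
$H(y, \text{softmax}(o))$ with respect to $o_c$ takes the simple form
$\text{softmax}(o)_c - y_c$, which makes this combination particularly
convenient for gradient based training.
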
% - Maximum likelihood
% - The derivative with softmax is pseudo-linear -> fast improvements possible

\subsubsection{Gradient Descent Algorithm}

Trying to find the optimal parameters for fitting the model to the
data can be a hard problem. Given the complex nature of a neural
network with many layers and neurons, it is hard to predict the impact
of a single parameter on the accuracy of the output.
Thus applying numerical optimization algorithms is the only feasible
way to fit the model. An attractive algorithm for training neural
networks is gradient descent, where each parameter $\theta_i$ is
iteratively updated according to the gradient of the error measure and
a step size $\gamma$. For this, all parameters are initialized (often
randomly or close to zero) and then iteratively updated until a
certain stopping criterion is hit, mostly either a fixed number of
iterations or a desired upper limit for the error measure.
% For a function $f_\theta$ with parameters $\theta \in \mathbb{R}^n$
% and a error function $L(f_\theta)$ the gradient descent algorithm is
% given in \ref{alg:gd}.

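As a small illustration of the update rule, using the notation of the
algorithm below and values chosen purely for demonstration, consider
minimizing $L(\theta) = \theta^2$ with step size $\gamma = 0.1$ and
initialization $\theta^0 = 1$. Each iteration computes the gradient
$\nabla = 2\theta^{i-1}$, giving
\[
  \theta^1 = 1 - 0.1 \cdot 2 = 0.8, \qquad
  \theta^2 = 0.8 - 0.1 \cdot 1.6 = 0.64, \qquad
  \theta^3 = 0.64 - 0.1 \cdot 1.28 = 0.512,
\]
so the iterates approach the minimum at $\theta = 0$ geometrically
with factor $1 - 2\gamma$.
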
\begin{algorithm}[H]
  \SetAlgoLined
  \KwInput{function $f_\theta$ with parameters $\theta \in
    \mathbb{R}^n$ \newline step size $\gamma$}
  initialize $\theta^0$\;
  $i \leftarrow 1$\;
  \While{termination condition is not met}{
    $\nabla \leftarrow \frac{\mathrm{d}f_\theta}{\mathrm{d} \theta}\vert_{\theta^{i-1}}$\;
    $\theta^i \leftarrow \theta^{i-1} - \gamma \nabla $\;
    $i \leftarrow i +1$\;
  }
  \caption{Gradient Descent}
  \label{alg:gd}
\end{algorithm}

The algorithm for gradient descent is given in
Algorithm~\ref{alg:gd}. In the context of fitting a neural network,
$f_\theta$ corresponds to the error measurement of the network
$L\left(\mathcal{NN}_{\theta}\right)$, where $\theta$ is a vector
containing all the weights and biases of the network.
As can be seen, this requires computing the derivative of the network
with respect to each variable. As the number of variables gets large
in networks with multiple layers of high neuron count, naively
computing these derivatives can become quite expensive in terms of
memory and computation. But by using the chain rule and exploiting the
layered structure, we can compute the gradient much more efficiently
with backpropagation, introduced by \textcite{backprop}.

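As a rough sketch of the underlying idea (the notation here is not
used elsewhere in this text and serves only as an illustration), write
the activations of layer $k$ as $o^{(k)} = \sigma\left(z^{(k)}\right)$
with $z^{(k)} = W^{(k)} o^{(k-1)} + b^{(k)}$. The chain rule then
yields
\[
  \frac{\partial L}{\partial W^{(k)}}
  = \frac{\partial L}{\partial z^{(k)}}
    \left(o^{(k-1)}\right)^{\mathrm{T}},
  \qquad
  \frac{\partial L}{\partial z^{(k)}}
  = \left(\left(W^{(k+1)}\right)^{\mathrm{T}}
    \frac{\partial L}{\partial z^{(k+1)}}\right)
    \odot \sigma'\left(z^{(k)}\right),
\]
where $\odot$ denotes the component-wise product. The derivatives for
layer $k$ can therefore be computed from those of layer $k+1$ in a
single backward pass, reusing the quantities already obtained in the
forward pass instead of differentiating the whole network with respect
to every variable separately.
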
% \subsubsection{Backpropagation}

% As with an increasing amount of layers the derivative of a loss
% function with respect to a certain variable becomes more intensive to
% compute there have been efforts in increasing the efficiency of
% computing these derivatives. Today the BACKPROPAGATION algorithm is
% widely used to compute the derivatives needed for the optimization
% algorithms. Here instead of naively calculating the derivative for
% each variable, the chain rule is used in order to compute derivatives
% for each layer from output layer towards the first layer while only
% needing to ....

\todo{Write up backpropagation properly}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: