\section{Application of NN to Higher Complexity Problems}

As neural networks are applied to problems of higher complexity,
often resulting in a higher dimensionality of the input, the number
of parameters in the network rises drastically. For example a
network with ...
A way to combat the
\subsection{Convolution}

Convolution is a mathematical operation in which the product of two
functions is integrated after one of the functions has been reversed
and shifted:
\[
  (f * g) (t) \coloneqq \int_{-\infty}^{\infty} f(t-s) g(s) ds.
\]

This operation can be described as a filter function $g$ being
applied to $f$, as each value $f(t)$ is replaced by a weighted
average of the values of $f$ around position $t$, with the weights
given by $g$.
The convolution operation allows plentiful manipulation of data; a
simple example is the smoothing of real-time data. Consider a sensor
measuring the location of an object (e.g. via GPS). We expect the
output of the sensor to be noisy as a result of a number of factors
that impact its accuracy. In order to get a better estimate of the
actual location we want to smooth the data to reduce the noise.
Using convolution for this task, we can control the significance we
want to give each data point. We might, for example, want to give a
larger weight to more recent measurements than to older ones. If we
assume these measurements are taken on a discrete timescale, we need
to introduce discrete convolution first. Let $f, g: \mathbb{Z} \to
\mathbb{R}$, then

\[
  (f * g)(t) = \sum_{i \in \mathbb{Z}} f(t-i) g(i).
\]
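
As an illustration (the concrete weights here are chosen
arbitrarily), a filter that averages the three most recent
measurements with decreasing weights can be defined by $g(0) = 0.5$,
$g(1) = 0.3$, $g(2) = 0.2$ and $g(i) = 0$ otherwise, resulting in

\[
  (f * g)(t) = 0.5 f(t) + 0.3 f(t-1) + 0.2 f(t-2).
\]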

Applying this to the data with the filter $g$ chosen accordingly, we
are able to improve the accuracy, as can be seen in
Figure~\ref{fig:sin_conv}.
\input{Plots/sin_conv.tex}
This form of discrete convolution can also be applied to functions
with inputs of higher dimensionality. Let $f, g: \mathbb{Z}^d \to
\mathbb{R}$, then

\[
  (f * g)(x_1, \dots, x_d) = \sum_{i \in \mathbb{Z}^d} f(x_1 - i_1,
  \dots, x_d - i_d) g(i_1, \dots, i_d).
\]
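
For instance, in the two-dimensional case $d = 2$, which is the
relevant one for single-channel images, this reduces to

\[
  (f * g)(x, y) = \sum_{i, j \in \mathbb{Z}} f(x - i, y - j) g(i, j).
\]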

This will prove to be a useful framework for image manipulation, but
in order to apply convolution to images we need to discuss the
representation of image data first. Most often images are
represented by each pixel being a mixture of base colors. These base
colors define the color space in which the image is encoded.
Commonly used color spaces are RGB (red, green, blue) and CMYK
(cyan, magenta, yellow, black). An example of an image split into
its red, green and blue channels is given in Figure~\ref{fig:rgb}.
Using this encoding of the image we can define a corresponding
discrete function describing the image, by mapping the coordinates
$(x,y)$ of a pixel and the channel (color) $c$ to the respective
value $v$

\begin{align}
  \begin{split}
    I: \mathbb{N}^3 & \to \mathbb{R}, \\
    (x,y,c) & \mapsto v.
  \end{split}
  \label{def:I}
\end{align}
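
If, for example, the first channel is taken to be the red one (the
numbering of the channels is a matter of convention), then
$I(10, 20, 1)$ is the intensity of the red channel of the pixel at
position $(10, 20)$.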

\begin{figure}
  \begin{adjustbox}{width=\textwidth}
    \begin{tikzpicture}
      \begin{scope}[x = (0:1cm), y=(90:1cm), z=(15:-0.5cm)]
        \node[canvas is xy plane at z=0, transform shape] at (0,0)
        {\includegraphics[width=5cm]{Plots/Data/klammern_r.jpg}};
        \node[canvas is xy plane at z=2, transform shape] at (0,-0.2)
        {\includegraphics[width=5cm]{Plots/Data/klammern_g.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (0,-0.4)
        {\includegraphics[width=5cm]{Plots/Data/klammern_b.jpg}};
        \node[canvas is xy plane at z=4, transform shape] at (-8,-0.2)
        {\includegraphics[width=5.3cm]{Plots/Data/klammern_rgb.jpg}};
      \end{scope}
    \end{tikzpicture}
  \end{adjustbox}
  \caption{On the right the red, green and blue channels of the
    picture are displayed. In order to better visualize the color
    channels, the black and white picture of each channel has been
    colored in the respective color. Combining the layers results in
    the image on the left.}
  \label{fig:rgb}
\end{figure}

With this representation of an image as a function, we can apply
filters to the image using convolution for multidimensional
functions as described above. In order to simplify the notation we
will write the function $I$ given in (\ref{def:I}) as well as the
filter function $g$ as tensors from now on, resulting in the
modified notation of convolution

\[
  (I * g)_{x,y,c} = \sum_{i,j,l \in \mathbb{Z}} I_{x-i,y-j,c-l} g_{i,j,l}.
\]

Simple examples of image manipulation using convolution are
smoothing operations or rudimentary detection of edges in grayscale
images, i.e. images with only one channel. A popular filter for
smoothing images is the Gauss filter, which for a given $\sigma \in
\mathbb{R}_+$ and size $s \in \mathbb{N}$ is defined as
\[
  G_{x,y} = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2
      \sigma^2}}, ~ x,y \in \left\{1,\dots,s\right\}.
\]
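
As an illustration, a widely used normalised $3 \times 3$ kernel
with the indices shifted so that the filter is centred on the
current pixel is
\[
  G \approx \frac{1}{16} \left[
  \begin{matrix}
    1 & 2 & 1 \\
    2 & 4 & 2 \\
    1 & 2 & 1
  \end{matrix}\right],
\]
which should be read as a common binomial approximation of the
Gaussian weights rather than as an exact instance of the definition
above; the exact entries depend on $\sigma$ and on whether the
kernel is normalised to sum to one.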

For edge detection purposes the Sobel operator is widespread. Here
two filters are applied to the image $I$ and then combined. Edges in
the $x$ direction are detected by convolution with
\[
  G =\left[
  \begin{matrix}
    -1 & 0 & 1 \\
    -2 & 0 & 2 \\
    -1 & 0 & 1
  \end{matrix}\right],
\]
and edges in the $y$ direction by convolution with $G^T$. The final
output is given by

\[
  O = \sqrt{(I * G)^2 + (I*G^T)^2},
\]

where $\sqrt{\cdot}$ and $\cdot^2$ are applied componentwise.
Examples of convolution with both kernels are given in
Figure~\ref{fig:img_conv}.
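
As a small sanity check (with intensities chosen arbitrarily for
illustration), applying $G$ to the patch
\[
  \left[
  \begin{matrix}
    0 & 0 & 1 \\
    0 & 0 & 1 \\
    0 & 0 & 1
  \end{matrix}\right],
\]
which contains a vertical edge, yields a response of magnitude
$1 + 2 + 1 = 4$ at the center pixel (the sign depends on the
orientation of the edge and on whether the kernel is mirrored in the
convolution), whereas any patch of constant intensity yields $0$,
since the entries of $G$ sum to zero.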

\begin{figure}[h]
  \centering
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/klammern.jpg}
    \caption{Original Picture}
    \label{subf:OrigPicGS}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv9.png}
    \caption{Gaussian Blur $\sigma^2 = 1$}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv10.png}
    \caption{Gaussian Blur $\sigma^2 = 4$}
  \end{subfigure}\\
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv4.png}
    \caption{Sobel Operator $x$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv5.png}
    \caption{Sobel Operator $y$-direction}
  \end{subfigure}
  \begin{subfigure}{0.3\textwidth}
    \centering
    \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
    \caption{Sobel Operator combined}
  \end{subfigure}
  % \begin{subfigure}{0.24\textwidth}
  %   \centering
  %   \includegraphics[width=\textwidth]{Plots/Data/image_conv6.png}
  %   \caption{test}
  % \end{subfigure}
  \caption{Convolution of the original greyscale image (a) with
    different kernels. In (b) and (c) Gaussian kernels of size 11
    and the stated $\sigma^2$ are used. In (d)--(f) the Sobel
    operator kernels defined above are used.}
  \label{fig:img_conv}
\end{figure}
\clearpage
\newpage
\subsection{Convolutional NN}

In conventional neural networks as described in chapter ... all
layers are fully connected, meaning each output node in a layer is
influenced by all inputs. For $i$ inputs and $o$ output nodes this
results in $i + 1$ variables at each node (weights and bias) and a
total of $o(i + 1)$ variables. For large inputs like image data the
number of variables that have to be trained in order to fit the
model can get excessive and hinder the ability to train the model
due to memory and computational restrictions. By using convolution
we can extract meaningful information, such as edges in an image,
with a kernel of a small size $k$ in the tens or hundreds,
independent of the size of the original image. Thus for a large
image $k \cdot i$ can be several orders of magnitude smaller than
$o \cdot i$.
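
To give a rough sense of the magnitudes involved (the numbers are
purely illustrative): a $256 \times 256$ grayscale image yields
$i = 256 \cdot 256 = 65{,}536$ inputs. A fully connected layer with
$o = 1{,}000$ output nodes then contains $o(i+1) \approx 6.6 \cdot
10^7$ trainable variables, whereas a single $5 \times 5$ kernel
contains only $k = 25$, and applying it at every pixel costs on the
order of $k \cdot i \approx 1.6 \cdot 10^6$ operations compared to
$o \cdot i \approx 6.6 \cdot 10^7$ for the fully connected layer.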

As seen above, convolution lends itself to image manipulation. In
this chapter we will explore how we can incorporate convolution into
neural networks, and how that might be beneficial.

Convolutional Neural Networks as described by ... are made up of
convolutional layers, pooling layers, and fully connected ones. The
fully connected layers are layers in which each input node is
connected to each output node, which is the structure introduced in
chapter ...

In a convolutional layer, instead of combining all input nodes for
each output node, the input nodes are interpreted as a tensor on
which a kernel is applied via convolution, resulting in the output.
Most often multiple kernels are used, resulting in multiple output
tensors. These kernels are the variables which can be altered in
order to fit the model to the data. Using multiple kernels it is
possible to extract different features from the image (e.g. edges,
as with the Sobel operator). As this increases the dimensionality
even further, which is undesirable as it increases the number of
variables in later layers of the model, a convolutional layer is
often followed by a pooling one. In a pooling layer the input is
reduced in size by extracting a single value from a
neighborhood \todo{moving...}... . The resulting output size depends
on the offset of the neighborhoods used. A popular choice is
max-pooling, where the largest value in a neighborhood is used; a
small example is sketched below.
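
As an illustration (with arbitrarily chosen values), max-pooling
with $2 \times 2$ neighborhoods and an offset of $2$ reduces a
$4 \times 4$ input to a $2 \times 2$ output:
\[
  \left[
  \begin{matrix}
    1 & 3 & 2 & 0 \\
    4 & 2 & 1 & 1 \\
    0 & 1 & 5 & 2 \\
    2 & 3 & 1 & 4
  \end{matrix}\right]
  \longmapsto
  \left[
  \begin{matrix}
    4 & 2 \\
    3 & 5
  \end{matrix}\right].
\]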

This construct allows for extraction of features from the input
while using far fewer variables.

... \todo{Example with a small image, ideally the one from above}

\subsubsection{Parallels to the Visual Cortex in Mammals}

The choice of convolution for image classification tasks is not
arbitrary. ... eye ... bla bla

\subsection{Limitations of the Gradient Descent Algorithm}

\begin{itemize}
\item Hyperparameter guesswork
\item Problems navigating valleys $\rightarrow$ momentum
\item Different scale of gradients for variables in different layers
  $\rightarrow$ ADADELTA
\end{itemize}

\subsection{Stochastic Training Algorithms}

For many applications in which neural networks are used, such as
image classification or segmentation, large training data sets are
needed in order to capture the nuances of the data. However, as
training sets get larger, the memory requirement during training
grows with them. In order to update the weights with the gradient
descent algorithm, the derivatives of the network with respect to
each variable need to be calculated for all data points in order to
obtain the full gradient of the error of the network. Thus the
amount of memory and computing power available limits the size of
the training data that can be efficiently used in fitting the
network. A class of algorithms that augment the gradient descent
algorithm in order to lessen this problem are stochastic gradient
descent algorithms. Here the premise is that instead of using the
whole dataset a (different) subset of the data is chosen to compute
the gradient in each iteration. The number of iterations needed
until each data point has been considered in updating the parameters
is commonly called an ``epoch''. This reduces the amount of memory
and computing power required for each iteration and thus allows the
use of very large training sets. Additionally, the noise introduced
on the gradient can improve the accuracy of the fit, as stochastic
gradient descent algorithms are less likely to get stuck in local
extrema.
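
Written out for a single iteration (the notation is only a sketch,
assuming a network $f_\theta$ with parameters $\theta$, a loss
function $L$, a learning rate $\gamma$ and a randomly chosen subset
$B_t \subset \{1, \dots, N\}$ of the training data), the update
takes the form
\[
  \theta_{t+1} = \theta_t - \gamma \frac{1}{|B_t|} \sum_{i \in B_t}
  \nabla_\theta L\left(f_{\theta_t}(x_i), y_i\right),
\]
which coincides with ordinary gradient descent whenever $B_t$
contains the whole training set.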

\input{Plots/SGD_vs_GD.tex}

Another benefit of using subsets, even if enough memory is available
to use the whole dataset, is that depending on the size of the
subsets the gradient can be calculated far quicker, which allows
more steps to be taken in the same time. If the approximated
gradient is close enough to the ``real'' one, this can drastically
cut down the time required for training the model.

\begin{itemize}
\item ADAM
\item momentum (sketched below)
\item ADADELTA \textcite{ADADELTA}
\end{itemize}
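
Of the methods listed above, momentum can be sketched as follows
(only schematically, with step size $\gamma$ and momentum
coefficient $\rho \in [0,1)$ as additional hyperparameters): a
running velocity is kept alongside the parameters,
\[
  v_{t+1} = \rho v_t - \gamma \nabla_\theta L(\theta_t), \qquad
  \theta_{t+1} = \theta_t + v_{t+1},
\]
which dampens oscillations across narrow valleys while accumulating
speed along directions of consistent descent.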

% \subsubsubsection{Stochastic Gradient Descent}

\subsection{Combating Overfitting}

% As in many machine learning applications if the model is overfit in
% the data it can drastically reduce the generalization of the model. In
% many machine learning approaches noise introduced in the learning
% algorithm in order to reduce overfitting. This results in a higher
% bias of the model but the trade off of lower variance of the model is
% beneficial in many cases. For example the regression tree model
% ... benefits greatly from restricting the training algorithm on
% randomly selected features in every iteration and then averaging many
% such trained trees inserted of just using a single one. \todo{noch
% nicht sicher ob ich das nehmen will} For neural networks similar
% strategies exist. A popular approach in regularizing convolutional neural network
% is \textit{dropout} which has been first introduced in
% \cite{Dropout}

Similarly to shallow networks, overfitting can still impact the
quality of convolutional neural networks. A popular way to combat
this problem is to introduce noise into the training of the model.
This is a successful strategy for other models as well; for example,
a conglomerate of decision trees grown on bootstrapped training
samples benefits greatly from randomizing the features available for
use in each training iteration (Hastie, Bachelorarbeit??). The way
noise is introduced into the model is by deactivating certain nodes
(setting the output of the node to 0) in the fully connected layers
of the convolutional neural networks. The nodes are chosen at random
and change in every iteration; this practice is called dropout and
was introduced by \textcite{Dropout}.
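
Sketched in formulas (the notation is only illustrative, assuming a
retention probability $p$), the output $y$ of a fully connected
layer is replaced during training by
\[
  \tilde{y} = m \odot y, \qquad m_j \sim \operatorname{Bernoulli}(p),
\]
where the mask $m$ is drawn anew in every iteration and $\odot$
denotes the componentwise product; at test time all nodes remain
active and the outgoing weights are scaled by $p$ to compensate, as
described in \textcite{Dropout}.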

\todo{Compare different dropout rates on MNIST or similar}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: