\end{scope}
\end{tikzpicture}
\end{adjustbox}
\caption[Channel Separation of Color Image]{On the right, the red, green, and blue channels of the picture
are displayed. In order to better visualize the color channels, the
black and white picture of each channel has been colored in the
respective color. Combining the layers results in the image on the
Simple examples of image manipulation using
convolution are smoothing operations or
rudimentary detection of edges in gray-scale images, meaning they only
have one channel. A filter often used to smooth or blur images
is the Gauss-filter which for a given $\sigma \in \mathbb{R}_+$ and
size $s \in \mathbb{N}$ is
output is given by
\[
O = \sqrt{(I * G)^2 + (I * G^T)^2}
\]
where $\sqrt{\cdot}$ and $\cdot^2$ are applied component-wise. Examples
for convolution of an image with both kernels are given
in Figure~\ref{fig:img_conv}.
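As an illustration of these operations, the following Python sketch applies a Gaussian smoothing kernel and the Sobel-based gradient magnitude from the formula above to a gray-scale image. It uses the common textbook forms of both kernels, so normalization and sign conventions may differ slightly from the definitions used here, and the input image is a random placeholder.
\begin{lstlisting}[language=Python]
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(size, sigma):
    # Sample a 2D Gaussian on a size x size grid and normalize it to sum to one.
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

# Common form of the Sobel kernel; its transpose responds to edges in the
# other direction.
G = np.array([[1.0, 0.0, -1.0],
              [2.0, 0.0, -2.0],
              [1.0, 0.0, -1.0]])

img = np.random.rand(28, 28)  # placeholder gray-scale image

blurred = convolve2d(img, gaussian_kernel(11, 2.0), mode="same", boundary="symm")
gx = convolve2d(img, G, mode="same", boundary="symm")
gy = convolve2d(img, G.T, mode="same", boundary="symm")
edges = np.sqrt(gx**2 + gy**2)  # component-wise, as in the formula above
\end{lstlisting}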
\begin{figure}[H]
% \caption{test}
% \end{subfigure}
\vspace{-0.1cm}
\caption[Convolution Applied on Image]{Convolution of the original gray-scale image (a) with different
kernels. In (b) and (c) Gaussian kernels of size 11 and stated
$\sigma^2$ are used. In (d) to (f) the above defined Sobel operator
kernels are used.}
A class of algorithms that augment the gradient descent
algorithm to lessen this problem is that of stochastic gradient
descent algorithms.
Here the full data set is split into smaller disjoint subsets.
Then in each iteration, a (different) subset of data is chosen to
compute the gradient (Algorithm~\ref{alg:sgd}).
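As a minimal sketch of this splitting, assuming a generic gradient function rather than the concrete loss used in Algorithm~\ref{alg:sgd}, one epoch of mini-batch stochastic gradient descent can be written as follows; \texttt{grad\_fn} and all parameter values are placeholders.
\begin{lstlisting}[language=Python]
import numpy as np

def sgd_epoch(params, grad_fn, X, y, batch_size, lr):
    # Shuffle the data, split it into disjoint mini-batches and take one
    # gradient step per mini-batch.
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        params = params - lr * grad_fn(params, X[batch], y[batch])
    return params
\end{lstlisting}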
The training period until each data point has been considered at least
\includegraphics[width=\textwidth]{Figures/Data/convnet_fig.pdf}
\caption[CNN Architecture for MNIST Handwritten
Digits]{Convolutional neural network architecture used to model the
MNIST handwritten digits data set. This figure was created with the
help of the
{\sffamily draw\textunderscore convnet} Python script by \textcite{draw_convnet}.}
\label{fig:mnist_architecture}
The three most popular implementations of this are:
\[
\gamma_n = \gamma_0 d^{\text{floor}\left(\frac{n+1}{r}\right)}.
\]
\item Exponential decay, where the learning rate is decreased after each epoch
\[
\gamma_n = \gamma_0 e^{-n d}.
\]
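As a sketch of how these schedules look in code, the following functions implement the step decay and exponential decay rules stated above, with $\gamma_0$, $d$, $r$ and the epoch index $n$ corresponding to the symbols in the formulas; the concrete values in the example call are placeholders.
\begin{lstlisting}[language=Python]
import math

def step_decay(n, gamma0, d, r):
    # The learning rate is multiplied by d every r epochs.
    return gamma0 * d ** math.floor((n + 1) / r)

def exponential_decay(n, gamma0, d):
    # The learning rate shrinks by a factor exp(-d) after every epoch.
    return gamma0 * math.exp(-n * d)

# Example: start at 0.1 and either halve every 10 epochs or decay smoothly.
for n in range(0, 30, 10):
    print(n, step_decay(n, 0.1, 0.5, 10), exponential_decay(n, 0.1, 0.05))
\end{lstlisting}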
To get an understanding of the performance of the training algorithms
discussed above, the neural network given in
Figure~\ref{fig:mnist_architecture} has been
trained on the MNIST handwriting data set with the algorithms described
above. For all algorithms, a global learning rate of $0.001$ is
chosen. The parameter preventing divisions by zero is set to
$\varepsilon = 10^{-7}$. For \textsc{AdaDelta} and
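For reference, optimizers with these global settings could be instantiated in Keras roughly as follows; apart from the learning rate of $0.001$ and $\varepsilon = 10^{-7}$ stated above, all remaining arguments are library defaults and not necessarily the exact configuration used here.
\begin{lstlisting}[language=Python]
import tensorflow as tf

# Hypothetical optimizer setup matching the stated global hyperparameters.
adam = tf.keras.optimizers.Adam(learning_rate=0.001, epsilon=1e-7)
adadelta = tf.keras.optimizers.Adadelta(learning_rate=0.001, epsilon=1e-7)
sgd = tf.keras.optimizers.SGD(learning_rate=0.001)
\end{lstlisting}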
\includegraphics[width=\textwidth]{Figures/Data/mnist_gen_shift.pdf}
\caption{random\\ positional shift}
\end{subfigure}
\caption[Image Data Generation]{Example of the manipulations used in
later comparisons. Brightness manipulation and mirroring are not
used, as the images are equal in brightness and digits are not
invariant to mirroring.}
the available data can be highly limited.
In these scenarios, the networks are highly prone to overfit the
data. To get an understanding of accuracies achievable and the
impact of the methods aimed at mitigating overfitting discussed above, we fit
networks with different measures implemented to data sets of
varying sizes.
For training, we use the MNIST handwriting data set as well as the fashion
MNIST data set. The fashion MNIST data set is a benchmark set built by
\textcite{fashionMNIST} to provide a more challenging set, as
state-of-the-art models are able to achieve accuracies of 99.88\%
(\textcite{10.1145/3206098.3206111}) on the handwriting set.
The data set contains 70,000 preprocessed and labeled images of clothes from
Zalando. An overview is given in Figure~\ref{fig:fashionMNIST}.
\input{Figures/fashion_mnist.tex}
The models are trained on subsets with a certain number of randomly
chosen data points per class.
The sizes chosen for the comparisons are the full data set, 100, 10, and 1
data points per class.
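As an illustration of this sampling, the following sketch draws a fixed number of examples per class from the fashion MNIST training data via Keras; the helper name \texttt{sample\_per\_class} and the seed are placeholders and not the exact procedure used for the experiments.
\begin{lstlisting}[language=Python]
import numpy as np
import tensorflow as tf

def sample_per_class(x, y, n_per_class, seed=0):
    # Randomly choose n_per_class examples of every class label.
    rng = np.random.default_rng(seed)
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_per_class, replace=False)
        for c in np.unique(y)
    ])
    return x[keep], y[keep]

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_small, y_small = sample_per_class(x_train, y_train, n_per_class=10)
\end{lstlisting}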
For the task of classifying the fashion data, a slightly altered model
is used that contains two consecutive convolutional layers with filters of size 3.
\includegraphics[width=\textwidth]{Figures/Data/cnn_fashion_fig.pdf}
\caption[CNN Architecture for Fashion MNIST]{Convolutional neural
network architecture used to model the
fashion MNIST data set. This figure was created using the
draw\textunderscore convnet Python script by \textcite{draw_convnet}.}
\label{fig:fashion_MNIST}
\end { figure}
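The exact layer sizes of this architecture are shown in the figure and not repeated in the text; purely to illustrate the pattern of two consecutive convolutional layers with filters of size 3, a Keras definition could look roughly as follows, where all filter counts and the dense layer width are placeholders.
\begin{lstlisting}[language=Python]
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative only: layer sizes are placeholders, not the values from the figure.
model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
\end{lstlisting}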
of the models and the parameters used for data generation are given
in Listing~\ref{lst:handwriting} for the handwriting model and in
Listing~\ref{lst:fashion} for the fashion model.
The models are trained for 125 epochs in order
to have enough random
augmentations of the input images present during training
for the networks to fully profit from the additional training data generated.
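Such data generation can be implemented, for example, with the Keras \texttt{ImageDataGenerator}; the sketch below only illustrates random shifts and rotations, and the parameter values are placeholders rather than the ones given in the listings.
\begin{lstlisting}[language=Python]
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder augmentation parameters, not the values from the listings.
datagen = ImageDataGenerator(
    rotation_range=10,       # small random rotations
    width_shift_range=0.1,   # random positional shift, horizontal
    height_shift_range=0.1,  # random positional shift, vertical
)

# x_small, y_small as in the sampling sketch above; Keras expects a channel axis.
batches = datagen.flow(x_small[..., None] / 255.0, y_small, batch_size=32)
\end{lstlisting}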
The test accuracies of the models after
training for 125
epochs are given in Table~\ref{table:digitsOF} for the handwritten digits
and in Table~\ref{table:fashionOF} for the fashion data sets. Additionally, the
average test accuracies over the course of learning are given in
Figure~\ref{fig:plotOF_digits} for the handwriting application and
Figure~\ref{fig:plotOF_fashion} for the
fashion application.
\end{subfigure}
\caption[Mean Test Accuracies for Subsets of MNIST Handwritten
Digits]{Mean test accuracies of the models fitting the sampled MNIST
handwriting data sets over the 125 epochs of training.}
\label{fig:plotOF_digits}
\end{figure}
In all scenarios, the addition of the measures reduces the
variance of the model.
The model fit to the fashion MNIST data set benefits less from these
measures.
For the smallest scenario of one sample per class, a substantial
increase in accuracy can be observed for both measures.
In contrast to the digits data set, dropout improves the
model by a similar margin to data generation.
For the larger data sets, the benefits are much smaller. While
in the scenario with 100 samples per class a performance increase can
be seen with data generation, in the scenario with 10 samples per
class it performs worse than the baseline model.
and 100 sample scenario. In all scenarios, data generation seems to
benefit from the addition of dropout.
Additional figures and tables for the same comparisons with different
performance metrics are given in Appendix~\ref{app:comp}.
There it can be seen that while the measures are able to reduce overfitting
effectively for the handwritten digits data set, the neural networks
trained on the fashion data set overfit despite these measures being
In this thesis, we have taken a look at neural networks, their
behavior in small scenarios, and their application to image
classification with limited data sets.
We have explored the relation between ridge penalized neural networks
and slightly altered cubic smoothing splines, giving us an insight
about the behavior of the learned function of neural networks.
When comparing optimization algorithms, we have seen that choosing the
right training algorithm can have a
drastic impact on the efficiency of training and the quality of a model
obtainable in a reasonable time frame.
The \textsc{Adam} algorithm has performed well in training the
convolutional neural networks.
measures combating overfitting, especially if the available training sets are of
a small size. The success of the measures we have examined
seems to be highly dependent on the use case, and further research is
being done on the topic of combating overfitting in neural networks.
\textcite{random_erasing} propose randomly erasing parts of the input
images during training and are able to achieve a high accuracy of 96.35\% on the fashion MNIST
data set this way.
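As a sketch of this idea, and not the authors' exact procedure, random erasing can be implemented by overwriting a randomly placed rectangle of each training image; the size range and fill value below are placeholders.
\begin{lstlisting}[language=Python]
import numpy as np

def random_erase(img, rng, min_frac=0.1, max_frac=0.3, fill=0.0):
    # Overwrite a randomly placed rectangle of the image with a constant value.
    h, w = img.shape[:2]
    eh = int(h * rng.uniform(min_frac, max_frac))
    ew = int(w * rng.uniform(min_frac, max_frac))
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    out = img.copy()
    out[top:top + eh, left:left + ew] = fill
    return out

rng = np.random.default_rng(0)
augmented = random_erase(np.random.rand(28, 28), rng)
\end{lstlisting}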
While data generation explored in this thesis is able to rudimentary