hopefully final

master
Tobias Arndt 4 years ago
parent 3bae82eaf9
commit 2ef7cda1dd

1
.gitignore vendored

@ -16,6 +16,7 @@ main-blx.bib
*.tex~ *.tex~
*#*.tex* *#*.tex*
*~ *~
*#*
# no pdfs # no pdfs
*.pdf *.pdf

@ -1,19 +1,16 @@
\section{Code...} \section{Implementations}
In this ... the implementations of the models used in ... are In this section the implementations of the models used are given.
given. The randomized shallow neural network used in CHAPTER... are The randomized shallow neural network used in Section~\ref{sec:conv} is
implemented in Scala from ground up to ensure the model is exactly to implemented in Scala. No preexisting frameworks were used to ensure
... of Theorem~\ref{theo:main1}. that the implementation follows the definitions used in Theorem~\ref{theo:main1}.
The neural networks used in CHAPTER are implemented in python using The neural networks used in Section~\ref{sec:cnn} are implemented in Python using
the Keras framework given in Tensorflow. Tensorflow is a library the Keras framework provided by TensorFlow. TensorFlow is a library
containing highly efficient GPU implementations of most important containing highly efficient GPU implementations of a wide variety of
tensor operations, such as convolution as well as efficient algorithms tensor operations, such as convolution as well as efficient algorithms
for training neural networks (computing derivatives, updating parameters). for training neural networks.% (computing derivatives, updating parameters).
\begin{itemize}
\item Code for randomized shallow neural network
\item Code for keras
\end{itemize}
\vspace*{-0.5cm}
\begin{lstfloat} \begin{lstfloat}
\begin{lstlisting}[language=iPython] \begin{lstlisting}[language=iPython]
import breeze.stats.distributions.Uniform import breeze.stats.distributions.Uniform
@ -72,10 +69,11 @@ class RSNN(val n: Int, val gamma: Double = 0.001) {
} }
\end{lstlisting} \end{lstlisting}
\caption{Scala code used to build and train the ridge penalized \caption{Scala code used to build and train the ridge penalized
randomized shallow neural network in .... The parameter \textit{lam} randomized shallow neural network in Section~\ref{sec:rsnn_sim}.}
in the train function represents the $\lambda$ parameter in the error % The parameter \textit{lam}
function. The parameters \textit{n} and \textit{gamma} set the number % in the train function represents the $\lambda$ parameter in the error
of hidden nodes and the stepsize for training.} % function. The parameters \textit{n} and \textit{gamma} set the number
% of hidden nodes and the stepsize for training.}
\label{lst:rsnn} \label{lst:rsnn}
\end{lstfloat} \end{lstfloat}
\clearpage \clearpage
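The idea behind the listing above can be sketched in a few lines of plain Python. This is only an illustration, not the Scala implementation from the thesis: the class name `RandomizedShallowNetwork`, the ReLU activation, the uniform range $[-1, 1]$ for the hidden parameters, the squared-error objective with penalty $\lambda \lVert w \rVert^2$, and the `steps` argument are all assumptions made for this sketch. Only the roles of `n` (number of hidden nodes), `gamma` (stepsize) and `lam` ($\lambda$) follow the caption.

```python
import math
import random


class RandomizedShallowNetwork:
    """Shallow network with a fixed random hidden layer and trainable output weights."""

    def __init__(self, n, gamma=0.001, seed=0):
        rng = random.Random(seed)
        self.gamma = gamma  # stepsize of the gradient descent updates
        # Hidden weights and biases are drawn once and never trained.
        # The range [-1, 1] is an assumption of this sketch.
        self.hidden = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(n)]
        self.output_weights = [0.0] * n

    def features(self, x):
        # ReLU activation (assumed) applied to every hidden node.
        return [max(0.0, v * x + b) for v, b in self.hidden]

    def predict(self, x):
        return sum(w * h for w, h in zip(self.output_weights, self.features(x)))

    def loss(self, xs, ys, lam):
        # Sum of squared errors plus ridge penalty on the output weights.
        fit = sum((self.predict(x) - y) ** 2 for x, y in zip(xs, ys))
        return fit + lam * sum(w * w for w in self.output_weights)

    def train(self, xs, ys, lam, steps=2000):
        rows = [self.features(x) for x in xs]  # hidden outputs never change
        for _ in range(steps):
            residuals = [
                sum(w * h for w, h in zip(self.output_weights, row)) - y
                for row, y in zip(rows, ys)
            ]
            # Gradient of the penalized loss with respect to each output weight.
            gradient = [
                2.0 * sum(r * row[k] for r, row in zip(residuals, rows)) + 2.0 * lam * w
                for k, w in enumerate(self.output_weights)
            ]
            self.output_weights = [
                w - self.gamma * g for w, g in zip(self.output_weights, gradient)
            ]


# Toy data: six equidistant points of a sine curve on [-1, 1].
xs = [-1.0 + 0.4 * i for i in range(6)]
ys = [math.sin(math.pi * x) for x in xs]

network = RandomizedShallowNetwork(n=10)
initial_loss = network.loss(xs, ys, lam=0.1)
network.train(xs, ys, lam=0.1)
final_loss = network.loss(xs, ys, lam=0.1)
```

Because the hidden layer is fixed, training reduces to a ridge regression on the hidden outputs, so a larger `lam` shrinks the output weights towards zero.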
@ -126,8 +124,8 @@ validation_data=(x_test, y_test),
steps_per_epoch = x_train.shape[0]//50) steps_per_epoch = x_train.shape[0]//50)
\end{lstlisting} \end{lstlisting}
\caption{Python code for the model used... the MNIST handwritten digits \caption{Python code used to build the network modeling the MNIST
dataset.} handwritten digits data set.}
\label{lst:handwriting} \label{lst:handwriting}
\end{lstfloat} \end{lstfloat}
\clearpage \clearpage
@ -163,11 +161,11 @@ model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(optimizer=tf.keras.optimizers.Adam(lr = 1e-3), loss="categorical_crossentropy", metrics=["accuracy"]) model.compile(optimizer=tf.keras.optimizers.Adam(lr = 1e-3), loss="categorical_crossentropy", metrics=["accuracy"])
datagen = ImageDataGenerator( datagen = ImageDataGenerator(
rotation_range = 15, rotation_range = 6,
zoom_range = 0.1, zoom_range = 0.15,
width_shift_range=2, width_shift_range=2,
height_shift_range=2, height_shift_range=2,
shear_range = 0.5, shear_range = 0.15,
fill_mode = 'constant', fill_mode = 'constant',
cval = 0) cval = 0)
@ -180,8 +178,8 @@ datagen = ImageDataGenerator(
shuffle=True) shuffle=True)
\end{lstlisting} \end{lstlisting}
\caption{Python code for the model used... the fashion MNIST \caption[Python Code for fashion MNIST]{Python code
dataset.} used to build the network modeling the fashion MNIST data set.}
\label{lst:fashion} \label{lst:fashion}
\end{lstfloat} \end{lstfloat}
\clearpage \clearpage
@ -205,6 +203,418 @@ def get_random_sample(a, b, number_of_samples=10):
\caption{Python code used to generate the datasets containing a \caption{Python code used to generate the datasets containing a
given number of random datapoints per class.} given number of random datapoints per class.}
\end{lstfloat} \end{lstfloat}
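Drawing a fixed number of random datapoints per class can be sketched as follows. The function below is a hypothetical stand-in written with the standard library only; it is not the `get_random_sample` function from the listing, and its name, arguments, and error handling are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_per_class(features, labels, number_of_samples=10, seed=None):
    """Draw `number_of_samples` random datapoints from every class."""
    rng = random.Random(seed)

    # Group the indices of all datapoints by their class label.
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    chosen = []
    for label in sorted(indices_by_class):
        candidates = indices_by_class[label]
        if len(candidates) < number_of_samples:
            raise ValueError(f"class {label!r} has only {len(candidates)} datapoints")
        # Sample without replacement so no datapoint is drawn twice.
        chosen.extend(rng.sample(candidates, number_of_samples))

    rng.shuffle(chosen)  # avoid presenting the classes in blocks
    return [features[i] for i in chosen], [labels[i] for i in chosen]
```

Calling `sample_per_class(x_train, y_train, number_of_samples=1)` would, under these assumptions, produce a training set of the kind used for the "1 Sample per Class" experiments.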
\section{Additional Comparisons}
\label{app:comp}
This section compares the test cross entropy loss and the training
accuracy of the models trained in Section~\ref{sec:smalldata}.
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch},ylabel = {Test Loss}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_1.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{1 Sample per Class}
\vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch},ylabel = {Test Loss}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_00_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_00_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_10.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{10 Samples per Class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch}, ylabel = {Test Loss}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_00_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_00_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_100.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{100 Samples per Class}
\vspace{.25cm}
\end{subfigure}
\caption[Mean Test Loss for Subsets of MNIST Handwritten
Digits]{Mean test cross entropy loss of the models fitting the
sampled subsets of MNIST
handwritten digits over the 125 epochs of training.}
\end{figure}
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style =
{draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch},ylabel = {Test Loss}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_1.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{1 Sample per Class}
\vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch},ylabel = {Test Loss}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}, ymin = {0.62}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_10.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{10 Samples per Class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch}, ylabel = {Test Loss}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_100.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{100 Samples per Class}
\vspace{.25cm}
\end{subfigure}
\caption[Mean Test Loss for Subsets of Fashion MNIST]{Mean
test cross entropy loss of the models fitting the sampled subsets
of fashion MNIST
over the 125 epochs of training.}
\end{figure}
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch},ylabel = {Training Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_1.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{1 Sample per Class}
\vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch},ylabel = {Training Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_00_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_00_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_10.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{10 Samples per Class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch}, ylabel = {Training Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}, ymin = {0.92}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_00_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_00_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_100.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{100 Samples per Class}
\vspace{.25cm}
\end{subfigure}
\caption[Mean Training Accuracies for Subsets of MNIST Handwritten
Digits]{Mean training accuracies of the models fitting the sampled
subsets of MNIST
handwritten digits over the 125 epochs of training.}
\end{figure}
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style =
{draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch},ylabel = {Training Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_1.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{1 Sample per Class}
\vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch},ylabel = {Training Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}, ymin = {0.62}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_10.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{10 Samples per Class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch}, ylabel = {Training Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_100.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{100 Samples per Class}
\vspace{.25cm}
\end{subfigure}
\caption[Mean Training Accuracies for Subsets of Fashion MNIST]{Mean
training accuracies of the models fitting the sampled subsets of fashion MNIST
over the 125 epochs of training.}
\end{figure}
%%% Local Variables: %%% Local Variables:
%%% mode: latex %%% mode: latex
%%% TeX-master: "main" %%% TeX-master: "main"

@ -10,13 +10,14 @@ plot coordinates {
} }
} }
\begin{figure} \begin{figure}
\begin{subfigure}[b]{0.5\textwidth} \begin{subfigure}[b]{0.48\textwidth}
\begin{subfigure}[b]{\textwidth} \begin{subfigure}[b]{\textwidth}
\begin{adjustbox}{width=\textwidth, height=0.25\textheight} \begin{adjustbox}{width=\textwidth, height=0.25\textheight}
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis}[ \begin{axis}[
ytick = {-1, 0, 1, 2}, ytick = {-1, 0, 1, 2},
yticklabels = {$-1$, $\phantom{-0.}0$, $1$, $2$},] yticklabels = {$-1$, $\phantom{-0.}0$, $1$, $2$},
restrict x to domain=-4:4, enlarge x limits = {0.1}]
\addplot table [x=x, y=y, col sep=comma, only marks, \addplot table [x=x, y=y, col sep=comma, only marks,
forget plot] {Figures/Data/sin_6.csv}; forget plot] {Figures/Data/sin_6.csv};
\addplot [black, line width=2pt] table [x=x, y=y, col \addplot [black, line width=2pt] table [x=x, y=y, col
@ -33,7 +34,7 @@ plot coordinates {
\begin{subfigure}[b]{\textwidth} \begin{subfigure}[b]{\textwidth}
\begin{adjustbox}{width=\textwidth, height=0.25\textheight} \begin{adjustbox}{width=\textwidth, height=0.25\textheight}
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis} \begin{axis}[restrict x to domain=-4:4, enlarge x limits = {0.1}]
\addplot table [x=x, y=y, col sep=comma, only marks, \addplot table [x=x, y=y, col sep=comma, only marks,
forget plot] {Figures/Data/sin_6.csv}; forget plot] {Figures/Data/sin_6.csv};
\addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_1.csv}; \addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_1.csv};
@ -49,7 +50,7 @@ plot coordinates {
\begin{subfigure}[b]{\textwidth} \begin{subfigure}[b]{\textwidth}
\begin{adjustbox}{width=\textwidth, height=0.25\textheight} \begin{adjustbox}{width=\textwidth, height=0.25\textheight}
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis} \begin{axis}[restrict x to domain=-4:4, enlarge x limits = {0.1}]
\addplot table [x=x, y=y, col sep=comma, only marks, \addplot table [x=x, y=y, col sep=comma, only marks,
forget plot] {Figures/Data/sin_6.csv}; forget plot] {Figures/Data/sin_6.csv};
\addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_3.csv}; \addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_3.csv};
@ -63,13 +64,14 @@ plot coordinates {
\caption{$\lambda = 3.0$} \caption{$\lambda = 3.0$}
\end{subfigure} \end{subfigure}
\end{subfigure} \end{subfigure}
\begin{subfigure}[b]{0.5\textwidth} \begin{subfigure}[b]{0.48\textwidth}
\begin{subfigure}[b]{\textwidth} \begin{subfigure}[b]{\textwidth}
\begin{adjustbox}{width=\textwidth, height=0.245\textheight} \begin{adjustbox}{width=\textwidth, height=0.245\textheight}
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis}[ \begin{axis}[
ytick = {-2,-1, 0, 1, 2}, ytick = {-2,-1, 0, 1, 2},
yticklabels = {$-2$,$-1$, $\phantom{-0.}0$, $1$, $2$},] yticklabels = {$-2$,$-1$, $\phantom{-0.}0$, $1$, $2$},
restrict x to domain=-4:4, enlarge x limits = {0.1}]
\addplot table [x=x, y=y, col sep=comma, only marks, \addplot table [x=x, y=y, col sep=comma, only marks,
forget plot] {Figures/Data/data_sin_d_t.csv}; forget plot] {Figures/Data/data_sin_d_t.csv};
\addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_sin_d_01.csv}; \addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_sin_d_01.csv};
@ -85,7 +87,7 @@ plot coordinates {
\begin{subfigure}[b]{\textwidth} \begin{subfigure}[b]{\textwidth}
\begin{adjustbox}{width=\textwidth, height=0.25\textheight} \begin{adjustbox}{width=\textwidth, height=0.25\textheight}
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis} \begin{axis}[restrict x to domain=-4:4, enlarge x limits = {0.1}]
\addplot table [x=x, y=y, col sep=comma, only marks, \addplot table [x=x, y=y, col sep=comma, only marks,
forget plot] {Figures/Data/data_sin_d_t.csv}; forget plot] {Figures/Data/data_sin_d_t.csv};
\addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_sin_d_1.csv}; \addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_sin_d_1.csv};
@ -101,7 +103,7 @@ plot coordinates {
\begin{subfigure}[b]{\textwidth} \begin{subfigure}[b]{\textwidth}
\begin{adjustbox}{width=\textwidth, height=0.25\textheight} \begin{adjustbox}{width=\textwidth, height=0.25\textheight}
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis} \begin{axis}[restrict x to domain=-4:4, enlarge x limits = {0.1}]
\addplot table [x=x, y=y, col sep=comma, only marks, \addplot table [x=x, y=y, col sep=comma, only marks,
forget plot] {Figures/Data/data_sin_d_t.csv}; forget plot] {Figures/Data/data_sin_d_t.csv};
\addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_sin_d_3.csv}; \addplot [black, line width=2pt] table [x=x, y=y, col sep=comma, mark=none] {Figures/Data/matlab_sin_d_3.csv};
@ -115,8 +117,8 @@ plot coordinates {
\caption{$\lambda = 3.0$} \caption{$\lambda = 3.0$}
\end{subfigure} \end{subfigure}
\end{subfigure} \end{subfigure}
\caption[Comparison of shallow neural networks and regression \caption[Comparison of Shallow Neural Networks and Regression
splines]{% In these Figures the behaviour stated in ... is Splines] {% In these Figures the behaviour stated in ... is
% visualized % visualized
% in two examples. For $(a), (b), (c)$ six values of the sine function equidistantly % in two examples. For $(a), (b), (c)$ six values of the sine function equidistantly
% spaced on $[-\pi, \pi]$ have been used as training data. For % spaced on $[-\pi, \pi]$ have been used as training data. For

@ -4,28 +4,32 @@ legend image code/.code={
\draw[mark repeat=2,mark phase=2] \draw[mark repeat=2,mark phase=2]
plot coordinates { plot coordinates {
(0cm,0cm) (0cm,0cm)
(0.0cm,0cm) %% default is (0.3cm,0cm) (0.15cm,0cm) %% default is (0.3cm,0cm)
(0.0cm,0cm) %% default is (0.6cm,0cm) (0.3cm,0cm) %% default is (0.6cm,0cm)
};% };%
} }
} }
\begin{figure} \begin{figure}
\begin{subfigure}[h!]{\textwidth} \begin{subfigure}[h!]{\textwidth}
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis}[tick style = {draw = none}, width = \textwidth, \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
height = 0.6\textwidth, /pgf/number format/precision=3},tick style = {draw = none}, width = 0.975\textwidth,
height = 0.6\textwidth, legend
style={at={(0.0125,0.7)},anchor=north west},
xlabel = {Epoch}, ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt, mark = *, mark size=1pt},
xtick = {1, 3, 5,7,9,11,13,15,17,19}, xtick = {1, 3, 5,7,9,11,13,15,17,19},
xticklabels = {$2$, $4$, $6$, $8$, xticklabels = {$2$, $4$, $6$, $8$,
$10$,$12$,$14$,$16$,$18$,$20$}, $10$,$12$,$14$,$16$,$18$,$20$}]
xlabel = {training epoch}, ylabel = {classification accuracy}]
\addplot table \addplot table
[x=epoch, y=val_accuracy, col sep=comma] {Figures/Data/GD_01.log}; [x=epoch, y=val_accuracy, col sep=comma] {Figures/Data/GD_01.log};
\addplot table \addplot table
[x=epoch, y=val_accuracy, col sep=comma] {Figures/Data/GD_05.log}; [x=epoch, y=val_accuracy, col sep=comma, mark = *] {Figures/Data/GD_05.log};
\addplot table \addplot table
[x=epoch, y=val_accuracy, col sep=comma] {Figures/Data/GD_1.log}; [x=epoch, y=val_accuracy, col sep=comma, mark = *] {Figures/Data/GD_1.log};
\addplot table \addplot table
[x=epoch, y=val_accuracy, col sep=comma] [x=epoch, y=val_accuracy, col sep=comma, mark = *]
{Figures/Data/SGD_01_b32.log}; {Figures/Data/SGD_01_b32.log};
\addlegendentry{GD$_{0.01}$} \addlegendentry{GD$_{0.01}$}
@ -34,59 +38,65 @@ plot coordinates {
\addlegendentry{SGD$_{0.01}$} \addlegendentry{SGD$_{0.01}$}
\end{axis} \end{axis}
\end{tikzpicture} \end{tikzpicture}
%\caption{Classification accuracy} \caption{Test accuracy during training.}
\end{subfigure} \end{subfigure}
\begin{subfigure}[b]{\textwidth} % \begin{subfigure}[b]{\textwidth}
\begin{tikzpicture} % \begin{tikzpicture}
\begin{axis}[tick style = {draw = none}, width = \textwidth, % \begin{axis}[tick style = {draw = none}, width = \textwidth,
height = 0.6\textwidth, % height = 0.6\textwidth,
ytick = {0, 1, 2, 3, 4}, % ytick = {0, 1, 2, 3, 4},
yticklabels = {$0$, $1$, $\phantom{0.}2$, $3$, $4$}, % yticklabels = {$0$, $1$, $\phantom{0.}2$, $3$, $4$},
xtick = {1, 3, 5,7,9,11,13,15,17,19}, % xtick = {1, 3, 5,7,9,11,13,15,17,19},
xticklabels = {$2$, $4$, $6$, $8$, % xticklabels = {$2$, $4$, $6$, $8$,
$10$,$12$,$14$,$16$,$18$,$20$}, % $10$,$12$,$14$,$16$,$18$,$20$},
xlabel = {training epoch}, ylabel = {error measure\vphantom{fy}}] % xlabel = {training epoch}, ylabel = {error measure\vphantom{fy}}]
\addplot table % \addplot table
[x=epoch, y=val_loss, col sep=comma] {Figures/Data/GD_01.log}; % [x=epoch, y=val_loss, col sep=comma] {Figures/Data/GD_01.log};
\addplot table % \addplot table
[x=epoch, y=val_loss, col sep=comma] {Figures/Data/GD_05.log}; % [x=epoch, y=val_loss, col sep=comma] {Figures/Data/GD_05.log};
\addplot table % \addplot table
[x=epoch, y=val_loss, col sep=comma] {Figures/Data/GD_1.log}; % [x=epoch, y=val_loss, col sep=comma] {Figures/Data/GD_1.log};
\addplot table % \addplot table
[x=epoch, y=val_loss, col sep=comma] {Figures/Data/SGD_01_b32.log}; % [x=epoch, y=val_loss, col sep=comma] {Figures/Data/SGD_01_b32.log};
\addlegendentry{GD$_{0.01}$} % \addlegendentry{GD$_{0.01}$}
\addlegendentry{GD$_{0.05}$} % \addlegendentry{GD$_{0.05}$}
\addlegendentry{GD$_{0.1}$} % \addlegendentry{GD$_{0.1}$}
\addlegendentry{SGD$_{0.01}$} % \addlegendentry{SGD$_{0.01}$}
\end{axis} % \end{axis}
\end{tikzpicture} % \end{tikzpicture}
\caption{Performance metrics during training} % \caption{Performance metrics during training}
\end{subfigure} % \end{subfigure}
% \\~\\ % \\~\\
\caption[Performance comparison of SGD and GD]{The neural network
given in Figure~\ref{fig:mnist_architecture} trained with different
algorithms on the MNIST handwritten digits data set. For gradient
descent the learning rates 0.01, 0.05 and 0.1 are used (GD$_{\cdot}$). For
stochastic gradient descent a batch size of 32 and a learning rate
of 0.01 are used (SGD$_{0.01}$).}
\label{fig:sgd_vs_gd}
\end{figure}
\begin{table}[h] \begin{subfigure}[b]{1.0\linewidth}
\begin{tabu} to \textwidth {@{} *4{X[c]}c*4{X[c]} @{}} \begin{tabu} to \textwidth {@{} *4{X[c]}c*4{X[c]} @{}}
\multicolumn{4}{c}{Classification Accuracy} \multicolumn{4}{c}{Test Accuracy}
&~&\multicolumn{4}{c}{Error Measure} &~&\multicolumn{4}{c}{Test Loss}
\\\cline{1-4}\cline{6-9} \\\cline{1-4}\cline{6-9}
GD$_{0.01}$&GD$_{0.05}$&GD$_{0.1}$&SGD$_{0.01}$&&GD$_{0.01}$&GD$_{0.05}$&GD$_{0.1}$&SGD$_{0.01}$ GD$_{0.01}$&GD$_{0.05}$&GD$_{0.1}$&SGD$_{0.01}$&&GD$_{0.01}$&GD$_{0.05}$&GD$_{0.1}$&SGD$_{0.01}$
\\\cline{1-4}\cline{6-9} \\\cline{1-4}\cline{6-9}
0.265&0.633&0.203&0.989&&2.267&1.947&3.91&0.032 0.265&0.633&0.203&0.989&&2.267&1.947&3.911&0.032 \\
\multicolumn{4}{c}{Training Accuracy}
&~&\multicolumn{4}{c}{Training Loss}
\\\cline{1-4}\cline{6-9}
GD$_{0.01}$&GD$_{0.05}$&GD$_{0.1}$&SGD$_{0.01}$&&GD$_{0.01}$&GD$_{0.05}$&GD$_{0.1}$&SGD$_{0.01}$
\\\cline{1-4}\cline{6-9}
0.250&0.599&0.685&0.996&&2.271&1.995&1.089&0.012 \\
\end{tabu} \end{tabu}
\caption{Performance metrics of the networks trained in \caption{Performance metrics after 20 training epochs.}
Figure~\ref{fig:sgd_vs_gd} after 20 training epochs.}
\label{table:sgd_vs_gd} \label{table:sgd_vs_gd}
\end{table} \end{subfigure}
\caption[Performance Comparison of SGD and GD]{The neural network
given in Figure~\ref{fig:mnist_architecture} trained with different
algorithms on the MNIST handwritten digits data set. For gradient
descent the learning rates 0.01, 0.05, and 0.1 are used (GD$_{\cdot}$). For
stochastic gradient descent a batch size of 32 and a learning rate
of 0.01 are used (SGD$_{0.01}$).}
\label{fig:sgd_vs_gd}
\end{figure}
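The difference between the two training schemes in the caption can be illustrated on a toy least-squares problem. The sketch below is not the Keras setup used for the figure: the linear model, the synthetic data, and the helper names `mse`, `gradient`, and `train` are assumptions chosen so that the example runs with the standard library only. It shows one mechanical difference: with the same learning rate and number of epochs, gradient descent performs one parameter update per epoch, while stochastic gradient descent with batch size 32 performs one update per batch.

```python
import random


def mse(w, b, data):
    """Mean squared error of the line w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def gradient(w, b, batch):
    """Gradient of the mean squared error on one batch."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def train(data, learning_rate, epochs, batch_size=None, seed=0):
    """batch_size=None uses the full data set (GD), otherwise minibatches (SGD)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    batch_size = batch_size or len(data)
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        # One parameter update per batch.
        for start in range(0, len(shuffled), batch_size):
            grad_w, grad_b = gradient(w, b, shuffled[start:start + batch_size])
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic data around the line y = 2x + 1 (an assumption of this sketch).
rng = random.Random(42)
xs = [rng.uniform(-1.0, 1.0) for _ in range(320)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]

gd = train(data, learning_rate=0.01, epochs=20)                  # 20 updates in total
sgd = train(data, learning_rate=0.01, epochs=20, batch_size=32)  # 200 updates in total
```

On this toy problem the minibatch run ends with a lower error than the full-batch run after the same 20 epochs, simply because it has taken ten times as many steps. Whether this fully explains the gap reported for the MNIST network is not something the sketch can establish.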
%%% Local Variables: %%% Local Variables:
%%% mode: latex %%% mode: latex
%%% TeX-master: "../main" %%% TeX-master: "../main"

@ -40,11 +40,11 @@
\includegraphics[width=\textwidth]{Figures/Data/fashion_mnist9.pdf} \includegraphics[width=\textwidth]{Figures/Data/fashion_mnist9.pdf}
\caption{Ankle boot} \caption{Ankle boot}
\end{subfigure} \end{subfigure}
\caption[Fashion MNIST data set]{The fashion MNIST data set contains 70,000 \caption[Fashion MNIST Data Set]{The fashion MNIST data set contains 70,000
preprocessed product images from Zalando, which are categorized as preprocessed product images from Zalando, which are categorized as
T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt,
Sneaker, Bag, Ankle boot. Of these images 60,000 are used as training images, while Sneaker, Bag, Ankle boot. Of these images 60,000 are used as training images, while
the rest are used to validate the models trained.} the rest is used to validate the models trained.}
\label{fig:fashionMNIST} \label{fig:fashionMNIST}
\end{figure} \end{figure}
%%% Local Variables: %%% Local Variables:

@ -16,7 +16,7 @@ plot coordinates {
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed, \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.975\textwidth, /pgf/number format/precision=3},tick style = {draw = none}, width = 0.975\textwidth,
height = 0.6\textwidth, ymin = 0.988, legend style={at={(0.9825,0.0175)},anchor=south east}, height = 0.6\textwidth, ymin = 0.988, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch}, ylabel = {Classification Accuracy}, cycle xlabel = {Epoch}, ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width =1.25pt}] list/Dark2, every axis plot/.append style={line width =1.25pt}]
\addplot table \addplot table
[x=epoch, y=val_accuracy, col sep=comma, mark = none] [x=epoch, y=val_accuracy, col sep=comma, mark = none]
@ -45,18 +45,18 @@ plot coordinates {
\addlegendentry{\footnotesize{Default}} \addlegendentry{\footnotesize{Default}}
\end{axis} \end{axis}
\end{tikzpicture} \end{tikzpicture}
\caption{Classification accuracy} \caption{Test Accuracy}
\vspace{.25cm} \vspace{.25cm}
\end{subfigure} \end{subfigure}
\begin{subfigure}[h]{1.0\linewidth} \begin{subfigure}[h]{1.0\linewidth}
\begin{tabu} to \textwidth {@{}lc*5{X[c]}@{}} \begin{tabu} to \textwidth {@{}lc*5{X[c]}@{}}
\Tstrut \Bstrut & \textsc{\,Adam\,} & D. 0.2 & D. 0.4 & G. &G.+D.\,0.2 & G.+D.\,0.4 \\ \Tstrut \Bstrut & Default & D. 0.2 & D. 0.4 & G. &G.+D.\,0.2 & G.+D.\,0.4 \\
\hline \hline
\multicolumn{7}{c}{Test Accuracy}\Bstrut \\ \multicolumn{7}{c}{Test Accuracy}\Bstrut \\
\cline{2-7} \cline{2-7}
mean \Tstrut & 0.9914 & 0.9923 & 0.9930 & 0.9937 & 0.9938 & 0.9943 \\ mean \Tstrut & 0.9914 & 0.9923 & 0.9930 & 0.9937 & 0.9943 & 0.9944 \\
max & 0.9926 & 0.9930 & 0.9934 & 0.9946 & 0.9955 & 0.9956 \\ max & 0.9926 & 0.9930 & 0.9934 & 0.9946 & 0.9957 & 0.9956 \\
min & 0.9887 & 0.9909 & 0.9922 & 0.9929 & 0.9929 & 0.9934 \\ min & 0.9887 & 0.9909 & 0.9922 & 0.9929 & 0.9930 & 0.9934 \\
\hline \hline
\multicolumn{7}{c}{Training Accuracy}\Bstrut \\ \multicolumn{7}{c}{Training Accuracy}\Bstrut \\
\cline{2-7} \cline{2-7}
@ -64,15 +64,16 @@ plot coordinates {
max & 0.9996 & 0.9996 & 0.9992 & 0.9979 & 0.9971 & 0.9937 \\ max & 0.9996 & 0.9996 & 0.9992 & 0.9979 & 0.9971 & 0.9937 \\
min & 0.9992 & 0.9990 & 0.9984 & 0.9947 & 0.9926 & 0.9908 \\ min & 0.9992 & 0.9990 & 0.9984 & 0.9947 & 0.9926 & 0.9908 \\
\end{tabu} \end{tabu}
\caption{Mean and maximum accuracy after 48 epochs of training.} \caption{Mean, maximum and minimum accuracy after 50 epochs of training.}
\label{fig:gen_dropout_b} \label{fig:gen_dropout_b}
\end{subfigure} \end{subfigure}
\caption[Performance comparison of overfitting measures]{Accuracy for the net given in ... with Dropout (D.), \caption[Performance Comparison of Overfitting Measures]{Accuracy
for the net given in Figure~\ref{fig:mnist_architecture} with Dropout (D.),
data generation (G.), a combination, or neither (Default) implemented and trained data generation (G.), a combination, or neither (Default) implemented and trained
with \textsc{Adam}. For each epoch the 60.000 training samples with \textsc{Adam}. For each epoch the 60.000 training samples
were used, or for data generation 10.000 steps with each using were used, or for data generation 10.000 steps with each using
batches of 60 generated data points. For each configuration the batches of 60 generated data points. For each configuration the
model was trained 5 times and the average accuracies at each epoch model was trained five times and the average accuracies at each epoch
are given in (a). Mean, maximum and minimum values of accuracy on are given in (a). Mean, maximum and minimum values of accuracy on
the test and training set are given in (b).} the test and training set are given in (b).}
\label{fig:gen_dropout} \label{fig:gen_dropout}

@ -30,9 +30,10 @@
\begin{subfigure}{0.19\textwidth} \begin{subfigure}{0.19\textwidth}
\includegraphics[width=\textwidth]{Figures/Data/mnist9.pdf} \includegraphics[width=\textwidth]{Figures/Data/mnist9.pdf}
\end{subfigure} \end{subfigure}
\caption[MNIST data set]{The MNIST data set contains 70.000 images of preprocessed handwritten \caption[MNIST Database of Handwritten Digits]{The MNIST database of handwritten
digits contains 70.000 images of preprocessed handwritten
digits. Of these images 60.000 are used as training images, while digits. Of these images 60.000 are used as training images, while
the rest are used to validate the models trained.} the rest is used to validate the models trained.}
\label{fig:MNIST} \label{fig:MNIST}
\end{figure} \end{figure}
%%% Local Variables: %%% Local Variables:

@ -4,34 +4,56 @@ legend image code/.code={
\draw[mark repeat=2,mark phase=2] \draw[mark repeat=2,mark phase=2]
plot coordinates { plot coordinates {
(0cm,0cm) (0cm,0cm)
(0.0cm,0cm) %% default is (0.3cm,0cm) (0.15cm,0cm) %% default is (0.3cm,0cm)
(0.0cm,0cm) %% default is (0.6cm,0cm) (0.3cm,0cm) %% default is (0.6cm,0cm)
};% };%
} }
} }
\begin{figure} \begin{figure}
\begin{subfigure}[h]{\textwidth} \begin{subfigure}[h]{\textwidth}
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis}[tick style = {draw = none}, width = \textwidth, \begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
height = 0.6\textwidth, ymin = 0.92, legend style={at={(0.9825,0.75)},anchor=north east}, /pgf/number format/precision=3},tick style = {draw = none}, width = 0.975\textwidth,
xlabel = {epoch}, ylabel = {Classification Accuracy}] height = 0.6\textwidth, ymin = 0.885, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {Epoch}, ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
% [tick style = {draw = none}, width = \textwidth,
% height = 0.6\textwidth, ymin = 0.905, legend style={at={(0.9825,0.75)},anchor=north east},
% xlabel = {epoch}, ylabel = {Classification Accuracy}]
% \addplot table
% [x=epoch, y=val_accuracy, col sep=comma, mark = none]
% {Figures/Data/adagrad.log};
% \addplot table
% [x=epoch, y=val_accuracy, col sep=comma, mark = none]
% {Figures/Data/adadelta.log};
% \addplot table
% [x=epoch, y=val_accuracy, col sep=comma, mark = none]
% {Figures/Data/adam.log};
\addplot table
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
{Figures/Data/Adagrad.mean};
\addplot table
[x=epoch, y=val_accuracy, col sep=comma, mark = none]
{Figures/Data/Adadelta.mean};
\addplot table \addplot table
[x=epoch, y=val_accuracy, col sep=comma, mark = none] [x=epoch, y=val_accuracy, col sep=comma, mark = none]
{Figures/Data/adagrad.log}; {Figures/Data/Adam.mean};
\addplot table \addplot table
[x=epoch, y=val_accuracy, col sep=comma, mark = none] [x=epoch, y=val_accuracy, col sep=comma, mark = none]
{Figures/Data/adadelta.log}; {Figures/Data/SGD_00.mean};
\addplot table \addplot table
[x=epoch, y=val_accuracy, col sep=comma, mark = none] [x=epoch, y=val_accuracy, col sep=comma, mark = none]
{Figures/Data/adam.log}; {Figures/Data/SGD_09.mean};
\addlegendentry{\footnotesize{ADAGRAD}} \addlegendentry{\footnotesize{\textsc{AdaGrad}}}
\addlegendentry{\footnotesize{ADADELTA}} \addlegendentry{\footnotesize{\textsc{Adadelta}}}
\addlegendentry{\footnotesize{ADAM}} \addlegendentry{\footnotesize{\textsc{Adam}}}
\addlegendentry{SGD$_{0.01}$} \addlegendentry{\footnotesize{\textsc{Sgd}}}
\addlegendentry{\footnotesize{Momentum}}
\end{axis} \end{axis}
\end{tikzpicture} \end{tikzpicture}
%\caption{Classification accuracy} \caption{Test accuracies during training}
\vspace{.25cm} \vspace{.25cm}
\end{subfigure} \end{subfigure}
% \begin{subfigure}[b]{\textwidth} % \begin{subfigure}[b]{\textwidth}
@ -58,18 +80,27 @@ plot coordinates {
% \vspace{.25cm} % \vspace{.25cm}
% \end{subfigure} % \end{subfigure}
\begin{subfigure}[b]{1.0\linewidth} \begin{subfigure}[b]{1.0\linewidth}
\begin{tabu} to \textwidth {@{} *3{X[c]}c*3{X[c]} @{}} \begin{tabu} to \textwidth {@{}l*5{X[c]}@{}}
\multicolumn{3}{c}{Classification Accuracy} \Tstrut \Bstrut &\textsc{AdaGrad}& \textsc{AdaDelta}&
&~&\multicolumn{3}{c}{Error Measure} \textsc{Adam} & \textsc{Sgd} & Momentum \\
\\\cline{1-3}\cline{5-7} \hline
\textsc{AdaGad}&\textsc{AdaDelta}&\textsc{Adam}&&\textsc{AdaGrad}&\textsc{AdaDelta}&\textsc{Adam} \Tstrut Accuracy &0.9870 & 0.9562 & 0.9925 & 0.9866 & 0.9923 \\
\\\cline{1-3}\cline{5-7} \Tstrut Loss &0.0404 & 0.1447 & 0.0999 & 0.0403 & 0.0246 \\
1&1&1&&1&1&1
\end{tabu} \end{tabu}
\caption{Performance metrics after 20 epochs} % \begin{tabu} to \textwidth {@{} *3{X[c]}c*3{X[c]} @{}}
% \multicolumn{3}{c}{Classification Accuracy}
% &~&\multicolumn{3}{c}{Error Measure}
% \\\cline{1-3}\cline{5-7}
% \textsc{AdaGad}&\textsc{AdaDelta}&\textsc{Adam}&&\textsc{AdaGrad}&\textsc{AdaDelta}&\textsc{Adam}
% \\\cline{1-3}\cline{5-7}
% 1&1&1&&1&1&1
% \end{tabu}
\caption{Performance metrics after 50 epochs}
\end{subfigure} \end{subfigure}
\caption[Performance comparison of training algorithms]{Classification accuracy on the test set and ...Performance metrics of the network given in ... trained \caption[Performance Comparison of Training Algorithms]{
with different optimization algorithms} Average performance metrics of the neural network given in
Figure~\ref{fig:mnist_architecture} trained five times for 50 epochs
using different optimization algorithms.}
\label{fig:comp_alg} \label{fig:comp_alg}
\end{figure} \end{figure}
%%% Local Variables: %%% Local Variables:

@ -14,6 +14,7 @@
\end{adjustbox} \end{adjustbox}
\caption{True position (\textcolor{red}{red}), distorted position data (black)} \caption{True position (\textcolor{red}{red}), distorted position data (black)}
\end{subfigure} \end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth} \begin{subfigure}[b]{0.49\textwidth}
\centering \centering
\begin{adjustbox}{width=\textwidth, height=0.25\textheight} \begin{adjustbox}{width=\textwidth, height=0.25\textheight}
@ -28,7 +29,7 @@
\end{adjustbox} \end{adjustbox}
\caption{True position (\textcolor{red}{red}), filtered position data (black)} \caption{True position (\textcolor{red}{red}), filtered position data (black)}
\end{subfigure} \end{subfigure}
\caption[Signal smoothing using convolution]{Example for noise reduction using convolution with simulated \caption[Signal Smoothing Using Convolution]{Example for noise reduction using convolution with simulated
positional data. As filter positional data. As filter
$g(i)=\left(\nicefrac{1}{3},\nicefrac{1}{4},\nicefrac{1}{5},\nicefrac{1}{6},\nicefrac{1}{20}\right)_{(i-1)}$ $g(i)=\left(\nicefrac{1}{3},\nicefrac{1}{4},\nicefrac{1}{5},\nicefrac{1}{6},\nicefrac{1}{20}\right)_{(i-1)}$
is chosen and applied to the $x$ and $y$ coordinate is chosen and applied to the $x$ and $y$ coordinate

@ -2,216 +2,330 @@
\newpage \newpage
\begin{appendices} \begin{appendices}
\counterwithin{lstfloat}{section} \counterwithin{lstfloat}{section}
\section{Proofs for sone Lemmata in ...} \section{Notes on Proofs of Lemmata in Section~\ref{sec:conv}}
In the following there will be proofs for some important Lemmata in \label{appendix:proofs}
Section~\ref{sec:theo38}. Further proofs not discussed here can be Contrary to \textcite{heiss2019} we do not make the distinction between $f_+$ and
found in \textcite{heiss2019} $f_-$.
The proves in this section are based on \textcite{heiss2019}. Slight This results in some alterations in the proofs being necessary. In
alterations have been made to accommodate for not splitting $f$ into the following the affected proofs and the required changes are given.
$f_+$ and $f_-$. % Because of that slight alterations are needed in the proofs of
\begin{Theorem}[Proof of Lemma~\ref{theo38}] % .. auxiliary lemmata.
\end{Theorem} % Alterations that go beyond substituting $F_{+-}^{}$
% As the proofs are ... for the most part only
\begin{Lemma}[$\frac{w^{*,\tilde{\lambda}}_k}{v_k}\approx\mathcal{O}(\frac{1}{n})$] % the alterations needed are specified.
For any $\lambda > 0$ and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in
\left\{1,\dots,N\right\}$, we have
\[
\max_{k \in \left\{1,\dots,n\right\}} \frac{w^{*,
\tilde{\lambda}}_k}{v_k} = \po_{n\to\infty}
\]
\end{Lemma}
\begin{Proof}[Proof of Lemma~\ref{lem:s3}]
\[ % In the following there will be proofs for some important Lemmata in
\sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k) % Section~\ref{sec:theo38}. Further proofs not discussed here can be
h_{k,n} = \sum_{\substack{l \in \mathbb{Z} \\ [\delta l, \delta % found in \textcite{heiss2019}
(l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T \}]}} % The proves in this section are based on \textcite{heiss2019}. Slight
\left(\sum_{\substack{k \in \kappa \\ \xi_k \in % alterations have been made to accommodate for not splitting $f$ into
[\delta l , \delta(l+1))}} \varphi(\xi_k, v_k) % $f_+$ and $f_-$.
h_{k,n}\right) \approx % \begin{Theorem}[Proof of Lemma~\ref{theo38}]
\] % \end{Theorem}
\[
\approx \sum_{\substack{l \in \mathbb{Z} \\ [\delta l, \delta % \begin{Lemma}[$\frac{w^{*,\tilde{\lambda}}_k}{v_k}\approx\mathcal{O}(\frac{1}{n})$]
(l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T \}]}} % For any $\lambda > 0$ and training data $(x_i^{\text{train}},
\left(\sum_{\substack{k \in \kappa \\ \xi_k \in % y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in
[\delta l , \delta(l+1))}} \left(\varphi(\delta l, v_k) % \left\{1,\dots,N\right\}$, we have
\frac{1}{n g_\xi (\delta l)} \pm \frac{\varepsilon}{n}\right) % \[
\frac{\abs{\left\{m \in \kappa : \xi_m \in [\delta l, % \max_{k \in \left\{1,\dots,n\right\}} \frac{w^{*,
\delta(l+1))\right\}}}{\abs{\left\{m \in \kappa : \xi_m % \tilde{\lambda}}_k}{v_k} = \po_{n\to\infty}
\in [\delta l, \delta(l+1))\right\}}}\right) % \]
\]
\[
\approx \sum_{\substack{l \in \mathbb{Z} \\ [\delta l, \delta % \end{Lemma}
(l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T \}]}}
\left(\frac{\sum_{\substack{k \in \kappa \\ \xi_k \in \begin{Proof}[Heiss, Teichmann, and Wutte (2019, Lemma A.9)]~\\\noindent
[\delta l , \delta(l+1))}}\varphi(\delta l, \label{proof:lem9}
v_k)}{\abs{\left\{m \in \kappa : \xi_m With $\tilde{\lambda} \coloneqq \lambda n g(0)$ Lemma~\ref{lem:cnvh} follows
\in [\delta l, \delta(l+1))\right\}}} analogously when considering $\tilde{w}$, $f_g^{*, \lambda}$, and $h_k$
\frac{\abs{\left\{m \in \kappa : \xi_m \in [\delta l, instead of $\tilde{w}^+$, $f_{g,+}^{*, \lambda}$, and $\bar{h}_k$.
\delta(l+1))\right\}}}{n g_\xi (\delta l)}\right) \pm \varepsilon Consider $\kappa = \left\{1, \dots, n \right\}$ for $n$ nodes
\] instead of $\kappa^+$. With $h_k = \frac{1}{n g_\xi(\xi_k)}$
The amount of kinks in a given interval of length $\delta$ follows a instead of $\bar{h}_k$
binomial distribution, and \[
\[
\mathbb{E} \left[\abs{\left\{m \in \kappa : \xi_m \in [\delta l, \mathbb{E} \left[\abs{\left\{m \in \kappa : \xi_m \in [\delta l,
\delta(l+1))\right\}}\right] = n \int_{\delta \delta(l+1))\right\}}\right] = n \int_{\delta
l}^{\delta(l+1)}g_\xi (x) dx \approx n (\delta g_\xi(\delta l) l}^{\delta(l+1)}g_\xi (x) dx \approx n (\delta g_\xi(\delta l)
\pm \delta \tilde{\varepsilon}), \pm \delta \tilde{\varepsilon}).
\] \]
for any $\delta \leq \delta(\varepsilon, \tilde{\varepsilon})$, since $g_\xi$ is uniformly continuous on its % \[
support by Assumption.. % \sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k)
As the distribution of $v$ is continuous as well we get that % h_{k,n} = \sum_{\substack{l \in \mathbb{Z} \\ [\delta l, \delta
$\mathcal{L}(v_k) = \mathcal{L} v| \xi = \delta l) \forall k \in % (l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T \}]}}
\kappa : \xi_k \in [\delta l, \delta(l+1))$ for $\delta \leq % \left(\sum_{\substack{k \in \kappa \\ \xi_k \in
\delta(\varepsilon, \tilde{\varepsilon})$. Thus we get with the law of % [\delta l , \delta(l+1))}} \varphi(\xi_k, v_k)
large numbers % h_{k,n}\right) \approx
\begin{align*} % \]
&\sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k) % \[
h_{k,n} \approx\\ % \approx \sum_{\substack{l \in \mathbb{Z} \\ [\delta l, \delta
&\approx \sum_{\substack{l \in \mathbb{Z} \\ [\delta l, \delta % (l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T \}]}}
(l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T % \left(\sum_{\substack{k \in \kappa \\ \xi_k \in
\}]}}\left(\mathbb{E}[\phi(\xi, v)|\xi=\delta l] % [\delta l , \delta(l+1))}} \left(\varphi(\delta l, v_k)
\stackrel{\mathbb{P}}{\pm}\right) \delta \left(1 \pm % \frac{1}{n g_\xi (\delta l)} \pm \frac{\varepsilon}{n}\right)
\frac{\tilde{\varepsilon}}{g_\xi(\delta l)}\right) \pm \varepsilon % \frac{\abs{\left\{m \in \kappa : \xi_m \in [\delta l,
\\ % \delta(l+1))\right\}}}{\abs{\left\{m \in \kappa : \xi_m
&\approx \left(\sum_{\substack{l \in \mathbb{Z} \\ [\delta % \in [\delta l, \delta(l+1))\right\}}}\right)
l, \delta % \]
(l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T % \[
\}]}}\mathbb{E}[\phi(\xi, v)|\xi=\delta l] \delta % \approx \sum_{\substack{l \in \mathbb{Z} \\ [\delta l, \delta
\stackrel{\mathbb{P}}{\pm}\tilde{\tilde{\varepsilon}} % (l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T \}]}}
\abs{C_{g_\xi}^u - C_{g_\xi}^l} % \left(\frac{\sum_{\substack{k \in \kappa \\ \xi_k \in
\right)\\ % [\delta l , \delta(l+1))}}\varphi(\delta l,
&\phantom{\approx}\cdot \left(1 \pm % v_k)}{\abs{\left\{m \in \kappa : \xi_m
\frac{\tilde{\varepsilon}}{g_\xi(\delta l)}\right) \pm \varepsilon % \in [\delta l, \delta(l+1))\right\}}}
\end{align*} % \frac{\abs{\left\{m \in \kappa : \xi_m \in [\delta l,
% \delta(l+1))\right\}}}{n g_\xi (\delta l)}\right) \pm \varepsilon
% \]
% The amount of kinks in a given interval of length $\delta$ follows a
% binomial distribution,
% \[
% \mathbb{E} \left[\abs{\left\{m \in \kappa : \xi_m \in [\delta l,
% \delta(l+1))\right\}}\right] = n \int_{\delta
% l}^{\delta(l+1)}g_\xi (x) dx \approx n (\delta g_\xi(\delta l)
% \pm \delta \tilde{\varepsilon}),
% \]
% for any $\delta \leq \delta(\varepsilon, \tilde{\varepsilon})$, since $g_\xi$ is uniformly continuous on its
% support by Assumption..
% As the distribution of $v$ is continuous as well we get that
% $\mathcal{L}(v_k) = \mathcal{L} v| \xi = \delta l) \forall k \in
% \kappa : \xi_k \in [\delta l, \delta(l+1))$ for $\delta \leq
% \delta(\varepsilon, \tilde{\varepsilon})$. Thus we get with the law of
% large numbers
% \begin{align*}
% &\sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k)
% h_{k,n} \approx\\
% &\approx \sum_{\substack{l \in \mathbb{Z} \\ [\delta l, \delta
% (l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T
% \}]}}\left(\mathbb{E}[\phi(\xi, v)|\xi=\delta l]
% \stackrel{\mathbb{P}}{\pm}\right) \delta \left(1 \pm
% \frac{\tilde{\varepsilon}}{g_\xi(\delta l)}\right) \pm \varepsilon
% \\
% &\approx \left(\sum_{\substack{l \in \mathbb{Z} \\ [\delta
% l, \delta
% (l+1)) \in [C_{g_\xi}^l,\min\{C_{g_\xi}^u, T
% \}]}}\mathbb{E}[\phi(\xi, v)|\xi=\delta l] \delta
% \stackrel{\mathbb{P}}{\pm}\tilde{\tilde{\varepsilon}}
% \abs{C_{g_\xi}^u - C_{g_\xi}^l}
% \right)\\
% &\phantom{\approx}\cdot \left(1 \pm
% \frac{\tilde{\varepsilon}}{g_\xi(\delta l)}\right) \pm \varepsilon
% \end{align*}
\end{Proof} \end{Proof}
\begin{Lemma}[($L(f_n) \to L(f)$), Heiss, Teichmann, and % \begin{Lemma}[($L(f_n) \to L(f)$), Heiss, Teichmann, and
Wutte (2019, Lemma A.11)] % Wutte (2019, Lemma A.11)]
For any data $(x_i^{\text{train}}, y_i^{\text{train}}) \in % For any data $(x_i^{\text{train}}, y_i^{\text{train}}) \in
\mathbb{R}^2, i \in \left\{1,\dots,N\right\}$, let $(f_n)_{n \in % \mathbb{R}^2, i \in \left\{1,\dots,N\right\}$, let $(f_n)_{n \in
\mathbb{N}}$ be a sequence of functions that converges point-wise % \mathbb{N}}$ be a sequence of functions that converges point-wise
in probability to a function $f : \mathbb{R}\to\mathbb{R}$, then the % in probability to a function $f : \mathbb{R}\to\mathbb{R}$, then the
loss $L$ of $f_n$ converges is probability to $L(f)$ as $n$ tends to % loss $L$ of $f_n$ converges is probability to $L(f)$ as $n$ tends to
infinity, % infinity,
\[ % \[
\plimn L(f_n) = L(f). % \plimn L(f_n) = L(f).
\] % \]
\proof Vgl. ... % \proof Vgl. ...
\end{Lemma} % \end{Lemma}
\begin{Proof}[Step 2] \begin{Proof}[Heiss, Teichmann, and Wutte (2019, Lemma A.12)]~\\\noindent
We start by showing that \label{proof:lem12}
\[ With $\tilde{\lambda} \coloneqq \lambda n g(0)$ Lemma~\ref{lem:s2} follows
\plimn \tilde{\lambda} \norm{\tilde{w}}_2^2 = \lambda g(0) analogously when considering $\tilde{w}$, $f_g^{*, \lambda}$, and $h_k$
\left(\int \frac{\left(f_g^{*,\lambda''}\right)^2}{g(x)} dx\right) instead of $\tilde{w}^+$, $f_{g,+}^{*, \lambda}$, and $\bar{h}_k$.
\] % We start by showing that
With the definitions of $\tilde{w}$, $\tilde{\lambda}$ and % \[
$h$ we have % \plimn \tilde{\lambda} \norm{\tilde{w}}_2^2 = \lambda g(0)
\begin{align*} % \left(\int \frac{\left(f_g^{*,\lambda''}\right)^2}{g(x)} dx\right)
\tilde{\lambda} \norm{\tilde{w}}_2^2 % \]
&= \tilde{\lambda} \sum_{k \in % With the definitions of $\tilde{w}$, $\tilde{\lambda}$ and
\kappa}\left(f_g^{*,\lambda''}(\xi_k) \frac{h_k % $h$ we have
v_k}{\mathbb{E}v^2|\xi = \xi_k]}\right)^2\\ % \begin{align*}
&= \tilde{\lambda} \sum_{k \in % \tilde{\lambda} \norm{\tilde{w}}_2^2
\kappa}\left(\left(f_g^{*,\lambda''}\right)^2(\xi_k) \frac{h_k % &= \tilde{\lambda} \sum_{k \in
v_k^2}{\mathbb{E}v^2|\xi = \xi_k]}\right) h_k\\ % \kappa}\left(f_g^{*,\lambda''}(\xi_k) \frac{h_k
& = \lambda g(0) \sum_{k \in % v_k}{\mathbb{E}v^2|\xi = \xi_k]}\right)^2\\
\kappa}\left(\left(f_g^{*,\lambda''}\right)^2(\xi_k)\frac{v_k^2}{g_\xi(\xi_k)\mathbb{E} % &= \tilde{\lambda} \sum_{k \in
[v^2|\xi=\xi_k]}\right)h_k. % \kappa}\left(\left(f_g^{*,\lambda''}\right)^2(\xi_k) \frac{h_k
\end{align*} % v_k^2}{\mathbb{E}v^2|\xi = \xi_k]}\right) h_k\\
By using Lemma~\ref{lem} with $\phi(x,y) = % & = \lambda g(0) \sum_{k \in
\left(f_g^{*,\lambda''}\right)^2(x)\frac{y^2}{g_\xi(\xi)\mathbb{E}[v^2|\xi=y]}$ % \kappa}\left(\left(f_g^{*,\lambda''}\right)^2(\xi_k)\frac{v_k^2}{g_\xi(\xi_k)\mathbb{E}
this converges to % [v^2|\xi=\xi_k]}\right)h_k.
\begin{align*} % \end{align*}
&\plimn \tilde{\lambda}\norm{\tilde{w}}_2^2 = \\ % By using Lemma~\ref{lem} with $\phi(x,y) =
&=\lambda % \left(f_g^{*,\lambda''}\right)^2(x)\frac{y^2}{g_\xi(\xi)\mathbb{E}[v^2|\xi=y]}$
g_\xi(0)\mathbb{E}[v^2|\xi=0]\int_{\supp{g_\xi}}\mathbb{E}\left[ % this converges to
\left(f_g^{*,\lambda''}\right)^2(\xi)\frac{v^2}{ % \begin{align*}
g_\xi(\xi)\mathbb{E}[v^2|\xi=x]^2}\Big{|} \xi = x\right]dx\\ % &\plimn \tilde{\lambda}\norm{\tilde{w}}_2^2 = \\
&=\lambda g_\xi(0) \mathbb{E}[v^2|\xi=0] \int_{\supp{g_xi}} % &=\lambda
\frac{\left(f_g^{*,\lambda''}\right)^2 (x)}{g_\xi(x) % g_\xi(0)\mathbb{E}[v^2|\xi=0]\int_{\supp{g_\xi}}\mathbb{E}\left[
\mathbb{E}[v^2|\xi=x]} dx \\ % \left(f_g^{*,\lambda''}\right)^2(\xi)\frac{v^2}{
&=\lambda g(0) \int_{\supp{g_\xi}} \frac{\left(f_g^{*,\lambda''}\right)^2}{g(x)}dx. % g_\xi(\xi)\mathbb{E}[v^2|\xi=x]^2}\Big{|} \xi = x\right]dx\\
\end{align*} % &=\lambda g_\xi(0) \mathbb{E}[v^2|\xi=0] \int_{\supp{g_xi}}
% \frac{\left(f_g^{*,\lambda''}\right)^2 (x)}{g_\xi(x)
% \mathbb{E}[v^2|\xi=x]} dx \\
% &=\lambda g(0) \int_{\supp{g_\xi}} \frac{\left(f_g^{*,\lambda''}\right)^2}{g(x)}dx.
% \end{align*}
\end{Proof}
\begin{Proof}[Heiss, Teichmann, and Wutte (2019, Lemma A.14)]~\\\noindent
\label{proof:lem14}
Substitute $F_{+-}^{\lambda, g}\left(f_{g,+}^{*,\lambda},
f_{g,-}^{*,\lambda}\right)$ with $F^{\lambda,g}\left(f_g^{*,\lambda}\right)$.
\end{Proof}
% \begin{Lemma}[Heiss, Teichmann, and
% Wutte (2019, Lemma A.13)]
% Using the notation of Definition .. and ... the following statement
% holds:
% $\forall \varepsilon \in \mathbb{R}_{>0} : \exists \delta \in
% \mathbb{R}_{>0} : \forall \omega \in \Omega : \forall l, l' \in
% \left\{1,\dots,N\right\} : \forall n \in \mathbb{N}$
% \[
% \left(\abs{\xi_l(\omega) - \xi_{l'}(\omega)} < \delta \wedge
% \text{sign}(v_l(\omega)) = \text{sign}(v_{l'}(\omega))\right)
% \implies \abs{\frac{w_l^{*, \tilde{\lambda}}(\omega)}{v_l(\omega)}
% - \frac{w_{l'}^{*, \tilde{\lambda}}(\omega)}{v_{l'}(\omega)}} <
% \frac{\varepsilon}{n},
% \]
% if we assume that $v_k$ is never zero.
% \proof given in ..
% \end{Lemma}
% \begin{Lemma}[$\frac{w^{*,\tilde{\lambda}}}{v} \approx
% \mathcal{O}(\frac{1}{n})$, Heiss, Teichmann, and
% Wutte (2019, Lemma A.14)]
% For any $\lambda > 0$ and data $(x_i^{\text{train}},
% y_i^{\text{train}}) \in \mathbb{R}^2, i\in
% \left\{1,\dots,\right\}$, we have
% \[
% \forall P \in (0,1) : \exists C \in \mathbb{R}_{>0} : \exists
% n_0 \in \mathbb{N} : \forall n > n_0 : \mathbb{P}
% \left[\max_{k\in \left\{1,\dots,n\right\}}
% \frac{w_k^{*,\tilde{\lambda}}}{v_k} < C
% \frac{1}{n}\right] > P
% % \max_{k\in \left\{1,\dots,n\right\}}
% % \frac{w_k^{*,\tilde{\lambda}}}{v_k} = \plimn
% \]
% \proof
% Let $k^*_+ \in \argmax_{k\in
% \left\{1,\dots,n\right\}}\frac{w^{*,\tilde{\lambda}}}{v_k} : v_k
% > 0$ and $k^*_- \in \argmax_{k\in
% \left\{1,\dots,n\right\}}\frac{w^{*,\tilde{\lambda}}}{v_k} : v_k
% < 0$. W.l.o.g. assume $\frac{w_{k_+^*}^2}{v_{k_+^*}^2} \geq
% \frac{w_{k_-^*}^2}{v_{k_-^*}^2}$
% \begin{align*}
% \frac{F^{\lambda,
% g}\left(f^{*,\lambda}_g\right)}{\tilde{\lambda}}
% \makebox[2cm][c]{$\stackrel{\mathbb{P}}{\geq}$}
% & \frac{1}{2 \tilde{\lambda}}
% F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
% = \frac{1}{2 \tilde{\lambda}}\left[\sum ... + \tilde{\lambda} \norm{w}_2^2\right]
% \\
% \makebox[2cm][c]{$\geq$}
% & \frac{1}{2}\left( \sum_{\substack{k: v_k
% > 0 \\\xi_k\in(\xi_{k^*}, \xi_{k^*}
% + \delta)}} \left(w_k^{*,\tilde{\lambda}}\right)^2 +
% \sum_{\substack{k: v_k < 0 \\\xi_k\in(\xi_{k^*}, \xi_{k^*}
% + \delta)}} \left(w_k^{*,\tilde{\lambda}}\right)^2\right) \\
% \makebox[2cm][c]{$\overset{\text{Lem. A.6}}{\underset{\delta \text{
% small enough}}{\geq}} $}
% &
% \frac{1}{4}\left(\left(\frac{w_{k_+^*}^{*,\tilde{\lambda}}}
% {v_{k_+^*}}\right)^2\sum_{\substack{k:
% v_k > 0 \\\xi_k\in(\xi_{k^*}, \xi_{k^*} + \delta)}}v_k^2 +
% \left(\frac{w_{k_-^*}^{*,\tilde{\lambda}}}{v_{k_-^*}}\right)^2
% \sum_{\substack{k:
% v_k < 0 \\\xi_k\in(\xi_{k^*}, \xi_{k^*} +
% \delta)}}v_k^2\right)\\
% \makebox[2cm][c]{$\stackrel{\mathbb{P}}{\geq}$}
% & \frac{1}{8}
% \left(\frac{w_{k_+^*}^{*,\tilde{\lambda}}}{v_{k^*}}\right)^2
% n \delta g_\xi(\xi_{k_+^*}) \mathbb{P}(v_k
% >0)\mathbb{E}[v_k^2|\xi_k = \xi_{k^*_+}]
% \end{align*}
% \end{Lemma}
\begin{Proof}[Heiss, Teichmann, and Wutte (2019, Lemma A.15)]~\\\noindent
\label{proof:lem15}
Consider $\mathcal{RN}^{*,\tilde{\lambda}}$,
$f^{w^{*,\tilde{\lambda}}}$, and $\kappa = \left\{1, \dots, n
\right\}$ instead of $\mathcal{RN}_+^{*,\tilde{\lambda}}$,
$f_+^{w^{*,\tilde{\lambda}}}$, and $\kappa^+$.
Assuming w.l.o.g. $\max_{k \in
\kappa^+}\abs{\frac{w_k^{*,\tilde{\lambda}}}{v_k}} \geq \max_{k \in
\kappa^-}\abs{\frac{w_k^{*,\tilde{\lambda}}}{v_k}}$,
Lemma~\ref{lem:s3} follows analogously by multiplying (58b) by two.
\end{Proof} \end{Proof}
\begin{Lemma}[Heiss, Teichmann, and \begin{Proof}[Heiss, Teichmann, and Wutte (2019, Lemma
Wutte (2019, Lemma A.13)] A.16)]~\\\noindent
Using the notation of Definition .. and ... the following statement \label{proof:lem16}
holds: As we are considering $F^{\lambda,g}$ instead of
$\forall \varepsilon \in \mathbb{R}_{>0} : \exists \delta \in $F^{\lambda,g}_{+-}$ we need to substitute $2\lambda g(0)$ with
\mathbb{R}_{>0} : \forall \omega \in \Omega : \forall l, l' \in $\lambda g(0)$
\left\{1,\dots,N\right\} : \forall n \in \mathbb{N}$ and thus get
\[ \[
\left(\abs{\xi_l(\omega) - \xi_{l'}(\omega)} < \delta \wedge \left(f^{w^{*,\tilde{\lambda}}}\right)''(x) \approx
\text{sign}(v_l(\omega)) = \text{sign}(v_{l'}(\omega))\right) \frac{w_{l_x}^{*,\tilde{\lambda}}}{v_{l_x}} n g_\xi(x)
\implies \abs{\frac{w_l^{*, \tilde{\lambda}}(\omega)}{v_l(\omega)} \mathbb{E}\left[v_k^2|\xi_k = x\right] \stackrel{\mathbb{P}}{\pm} \varepsilon_3
- \frac{w_{l'}^{*, \tilde{\lambda}}(\omega)}{v_{l'}(\omega)}} <
\frac{\varepsilon}{n},
\] \]
if we assume that $v_k$ is never zero. and use this to conclude
\proof given in ..
\end{Lemma}
\begin{Lemma}[$\frac{w^{*,\tilde{\lambda}}}{v} \approx
\mathcal{O}(\frac{1}{n})$, Heiss, Teichmann, and
Wutte (2019, Lemma A.14)]
For any $\lambda > 0$ and data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2, i\in
\left\{1,\dots,\right\}$, we have
\[ \[
\forall P \in (0,1) : \exists C \in \mathbb{R}_{>0} : \exists \lambda g(0)
n_0 \in \mathbb{N} : \forall n > n_0 : \mathbb{P} \int_{\supp(g)}\hspace{-0.15cm}\frac{\left(\left(f^{w^{*,\tilde{\lambda}}}\right)''(x)\right)^2}{g(x)}dx
\left[\max_{k\in \left\{1,\dots,n\right\}} \approx \tilde{\lambda} n
\frac{w_k^{*,\tilde{\lambda}}}{v_k} < C \int_{\supp(g)}\left(\frac{w_{l_x}^{*,\tilde{\lambda}}}{v_{l_x}}\right)^2 \hspace{-0.1cm}
\frac{1}{n}\right] > P g_\xi(x) \mathbb{E}\left[v_k^2|\xi_k=x\right]dx
% \max_{k\in \left\{1,\dots,n\right\}}
% \frac{w_k^{*,\tilde{\lambda}}}{v_k} = \plimn
\] \]
\proof Analogous to the proof of \textcite{heiss2019} we get
Let $k^*_+ \in \argmax_{k\in
\left\{1,\dots,n\right\}}\frac{w^{*,\tilde{\lambda}}}{v_k} : v_k
> 0$ and $k^*_- \in \argmax_{k\in
\left\{1,\dots,n\right\}}\frac{w^{*,\tilde{\lambda}}}{v_k} : v_k
< 0$. W.l.o.g. assume $\frac{w_{k_+^*}^2}{v_{k_+^*}^2} \geq
\frac{w_{k_-^*}^2}{v_{k_-^*}^2}$
\begin{align*} \begin{align*}
\frac{F^{\lambda, \tilde{\lambda} \sum_{k \in \kappa}
g}\left(f^{*,\lambda}_g\right)}{\tilde{\lambda}} \left(w_k^{*,\tilde{\lambda}}\right)^2
\makebox[2cm][c]{$\stackrel{\mathbb{P}}{\geq}$} &= \tilde{\lambda} \sum_{k \in \kappa^+}
& \frac{1}{2 \tilde{\lambda}} \left(w_k^{*,\tilde{\lambda}}\right)^2 + \tilde{\lambda} \sum_{k \in \kappa^-}
F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right) \left(w_k^{*,\tilde{\lambda}}\right)^2 \\
= \frac{1}{2 \tilde{\lambda}}\left[\sum ... + \tilde{\lambda} \norm{w}_2^2\right] &\approx \left(\mathbb{P}[v_k <0] + \mathbb{P}[v_k >0]\right)\\
\\ &\phantom{=}
\makebox[2cm][c]{$\geq$} \int_{\supp(g_\xi)}
& \frac{1}{2}\left( \sum_{\substack{k: v_k \left(\frac{w_{l_x}^{*,\tilde{\lambda}}}{v_{l_x}}\right)^2
> 0 \\\xi_k\in(\xi_{k^*}, \xi_{k^*} g_\xi(x) \mathbb{E}\left[v_k^2|\xi_k = x\right] dx
+ \delta)}} \left(w_k^{*,\tilde{\lambda}}\right)^2 + \stackrel{\mathbb{P}}{\pm} \varepsilon_9 \\
\sum_{\substack{k: v_k < 0 \\\xi_k\in(\xi_{k^*}, \xi_{k^*} &= \int_{\supp(g_\xi)}
+ \delta)}} \left(w_k^{*,\tilde{\lambda}}\right)^2\right) \\ \left(\frac{w_{l_x}^{*,\tilde{\lambda}}}{v_{l_x}}\right)^2
\makebox[2cm][c]{$\overset{\text{Lem. A.6}}{\underset{\delta \text{ g_\xi(x) \mathbb{E}\left[v_k^2|\xi_k = x\right] dx
small enough}}{\geq}} $} \stackrel{\mathbb{P}}{\pm} \varepsilon_9.
&
\frac{1}{4}\left(\left(\frac{w_{k_+^*}^{*,\tilde{\lambda}}}
{v_{k_+^*}}\right)^2\sum_{\substack{k:
v_k > 0 \\\xi_k\in(\xi_{k^*}, \xi_{k^*} + \delta)}}v_k^2 +
\left(\frac{w_{k_-^*}^{*,\tilde{\lambda}}}{v_{k_-^*}}\right)^2
\sum_{\substack{k:
v_k < 0 \\\xi_k\in(\xi_{k^*}, \xi_{k^*} +
\delta)}}v_k^2\right)\\
\makebox[2cm][c]{$\stackrel{\mathbb{P}}{\geq}$}
& \frac{1}{8}
\left(\frac{w_{k_+^*}^{*,\tilde{\lambda}}}{v_{k^*}}\right)^2
n \delta g_\xi(\xi_{k_+^*}) \mathbb{P}(v_k
>0)\mathbb{E}[v_k^2|\xi_k = \xi_{k^*_+}]
\end{align*} \end{align*}
With these transformations Lemma~\ref{lem:s4} follows analogously.
\end{Proof}
\end{Lemma} \begin{Proof}[Heiss, Teichmann, and Wutte (2019, Lemma A.19)]~\\\noindent
\label{proof:lem19}
The proof works analogously if $F_{+-}^{\lambda,g}$ is substituted
by
\begin{align*}
F_{+-}^{\lambda,g '}(f_+, f_-) =
& \sum_{i =
1}^N \left(f(x_i^{\text{train}}) -
y_i^{\text{train}}\right)^2 \\
& + \lambda g(0) \left(\int_{\supp(g)}\frac{\left(f_+''(x)\right)^2}{g(x)}
dx + \int_{\supp(g)}\frac{\left(f''_-(x)\right)^2}{g(x)}
dx\right).
\end{align*}
For $f^n = f_+^n + f_-^n$ such that $\supp(f_+^n) \cap \supp(f_-^n) =
\emptyset$ and $h = h_+ + h_-$ such that $\supp(h_+) \cap \supp(h_-) =
\emptyset$ it holds that
\[
\plimn F^{\lambda, g}(f^n) = F^{\lambda, g}(h) \implies
\plimn F_{+-}^{\lambda,g '}(f_+^n,f_-^n) = F_{+-}^{\lambda,g '}(h_+,h_-).
\]
As all functions can be split into two functions with disjoint support,
Lemma~\ref{lem:s7} follows.
\end{Proof}
\input{Appendix_code.tex} \input{Appendix_code.tex}
\end{appendices} \end{appendices}

@ -296,3 +296,18 @@ year = {2014},
publisher = {Curran Associates, Inc.}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf} url = {http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf}
} }
@book{hastie01statisticallearning,
added-at = {2008-05-16T16:17:42.000+0200},
address = {New York, NY, USA},
author = {Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome},
biburl = {https://www.bibsonomy.org/bibtex/2f58afc5c9793fcc8ad8389824e57984c/sb3000},
interhash = {d585aea274f2b9b228fc1629bc273644},
intrahash = {f58afc5c9793fcc8ad8389824e57984c},
keywords = {ml statistics},
publisher = {Springer New York Inc.},
series = {Springer Series in Statistics},
timestamp = {2008-05-16T16:17:43.000+0200},
title = {The Elements of Statistical Learning},
year = 2001
}

File diff suppressed because it is too large

@ -1,22 +1,74 @@
\section{Introduction} \section{Introduction}
Neural networks have become a widely used model as they are relatively Neural networks have become a widely used model for a plethora of
easy to build with modern frameworks like tensorflow and are able to applications.
model complex data. They are an attractive choice as they are able to
In this thesis we will .. networks .. model complex data with relatively little additional input to the
training data needed.
In order to get some understanding about the behavior of the learned Additionally, as the price of parallelized computing
function of neural networks we examine the convergence behavior for power in the form of graphics processing units has decreased drastically over the last
.... years, it has become far more accessible to train and use large
neural networks.
An interesting application of neural networks is the application to Furthermore, highly optimized and parallelized frameworks for tensor
image classification tasks. We ... impact of ... on the performance of operations have been developed.
a neural network in such a task. With these frameworks, such as TensorFlow and PyTorch, building neural
networks has become a much more straightforward process.
As in some applications such as medical imaging one might be limited % Furthermore, with the development of highly optimized and
to very small training data we study the impact of two measures in % parallelized implementations of mathematical operations needed for
improving the accuracy in such a case by trying to ... the model from % neural networks, such as TensorFlow or PyTorch, building neural network
overfitting the data. % models has become a much more straightforward process.
% For example the flagship consumer GPU GeForce RTX 3080 of NVIDIA's current
% generation has 5.888 CUDS cores at a ... price of 799 Euro compared
% to the last generations flagship GeForce RTX 2080 Ti with 4352 CUDA
% cores at a ... price of 1259 Euro. These CUDA cores are computing
% cores specialized for tensor operations, which are necessary in
% fitting and using neural networks.
In this thesis we want to get an understanding of the behavior of neural %
networks and
how we can use them for problems with a complex relationship between
in- and output.
In Section 2 we introduce the mathematical construct of neural
networks and how to fit them to training data.
To gain some insight about the learned function,
we examine a simple class of neural networks that only contain one
hidden layer.
In Section~\ref{sec:shallownn} we prove a relation between such networks and
functions that minimize the distance to training data
with respect to its second derivative.
An interesting application of neural networks is the task of
classifying images.
However, for such complex problems the number of parameters in fully
connected neural networks can exceed what is
feasible for training.
In Section~\ref{sec:cnn} we explore the addition of convolution to neural
networks to reduce the number of parameters.
As these large networks are commonly trained using gradient descent
algorithms we compare the performance of different algorithms based on
gradient descent in Section~4.4.
% and
% show that it is beneficial to only use small subsets of the training
% data in each iteration rather than using the whole data set to update
% the parameters.
Most statistical models, especially those with a large number of
trainable parameters, can struggle with overfitting the data.
In Section 4.5 we examine the impact of two measures designed to combat
overfitting.
In some applications such as working with medical images the data
available for training can be scarce, which results in the networks
being prone to overfitting.
As these are interesting applications of neural networks we examine
the benefit of the measures to combat overfitting for
scenarios with limited amounts of training data.
% As in some applications such as medical imaging one might be limited
% to very small training data we study the impact of two measures in
% improving the accuracy in such a case by trying to ... the model from
% overfitting the data.

@ -1,23 +1,26 @@
\section{\titlecap{Introduction to Neural Networks}} \section{Introduction to Neural Networks}
This chapter is based on \textcite[Chapter~6]{Goodfellow} and \textcite{Haykin}. This chapter is based on \textcite[Chapter~6]{Goodfellow} and \textcite{Haykin}.
Neural Networks (NN) are a mathematical construct inspired by the Neural Networks are a mathematical construct inspired by the
structure of brains in mammals. It consists of an array of neurons that structure of brains in mammals. They consist of an array of neurons that
receive inputs and compute an accumulated output. These neurons are receive inputs and compute an accumulated output. These neurons are
arranged in layers, with one input and output layer arranged in layers, with one input and output layer
and a arbirtary and an arbitrary
amount of hidden layer between them. number of hidden layers between them.
The amount of neurons in the in- and output layers correspond to the The number of neurons in the in- and output layers corresponds to the
desired dimensions of in- and outputs of the model. desired dimensions of in- and outputs of the model.
In conventional neural networks the information is fed forward from the
input layer towards the output layer hence they are often called feed In conventional neural networks, the information is fed forward from the
forward networks. Each neuron in a layer has the outputs of all input layer towards the output layer, hence they are often called
neurons in the preceding layer as input and computes a accumulated feed forward networks. Each neuron in a layer has the outputs of all
value from these (fully connected). A neurons in the preceding layer as input and computes an accumulated
illustration of an example neuronal network is given in value from these (fully connected).
Figure~\ref{fig:nn} and one of a neuron in Figure~\ref{fig:neuron}. % An illustration of an example neural network is given in
% Figure~\ref{fig:nn} and one of a neuron in Figure~\ref{fig:neuron}.
Illustrations of a neural network and the structure of a neuron are given
in Figure~\ref{fig:nn} and Figure~\ref{fig:neuron}.
\tikzset{% \tikzset{%
every neuron/.style={ every neuron/.style={
@ -88,71 +91,71 @@ Figure~\ref{fig:nn} and one of a neuron in Figure~\ref{fig:neuron}.
\node[fill=white,scale=1.5,inner xsep=10pt,inner ysep=10mm] at ($(hidden1-1)!.5!(hidden2-2)$) {$\dots$}; \node[fill=white,scale=1.5,inner xsep=10pt,inner ysep=10mm] at ($(hidden1-1)!.5!(hidden2-2)$) {$\dots$};
\end{tikzpicture}}%} \end{tikzpicture}}%}
\caption[Illustration of a neural network]{Illustration of a neural network with $d_i$ inputs, $l$ \caption[Illustration of a Neural Network]{Illustration of a neural network with $d_i$ inputs, $l$
hidden layers with $n_{\cdot}$ nodes in each layer, as well as hidden layers with $n_{\cdot}$ nodes in each layer, as well as
$d_o$ outputs. $d_o$ outputs.
} }
\label{fig:nn} \label{fig:nn}
\end{figure} \end{figure}
\subsection{\titlecap{nonlinearity of neural networks}} \subsection{Nonlinearity of Neural Networks}
The arguably most important feature of neural networks that sets them The arguably most important feature of neural networks which sets them
apart from linear models is the activation function implemented in the apart from linear models is the activation function implemented in the
neurons. As seen in Figure~\ref{fig:neuron} on the weighted sum of the neurons. As illustrated in Figure~\ref{fig:neuron} on the weighted sum of the
inputs a activation function $\sigma$ is applied resulting in the inputs an activation function $\sigma$ is applied resulting in the
output of the $k$-th neuron in a layer $l$ with $m$ nodes in layer $l-1$ output of the $k$-th neuron in a layer $l$ with $m$ nodes in layer $l-1$
being given by being given by
\begin{align*} \begin{align*}
o_{l,k} = \sigma\left(b_{l,k} + \sum_{j=1}^{m} w_{l,k,j} o_{l,k} = \sigma\left(b_{l,k} + \sum_{j=1}^{m} w_{l,k,j}
o_{l-1,j}\right) o_{l-1,j}\right),
\end{align*} \end{align*}
for weights $w_{l,k,j}$ and biases $b_{l,k}$. For a network with $L$ for weights $w_{l,k,j}$ and biases $b_{l,k}$. For a network with $L$
hidden layers and inputs $o_{0}$ the final outputs of the network hidden layers and inputs $o_{0}$ the final outputs of the network
are thus given by $o_{L+1}$. are thus given by $o_{L+1}$.
The activation function is usually chosen nonlinear (a linear one The activation function is usually chosen nonlinear (a linear one
would result in the entire model collapsing into a linear one\todo{beweis?}) which would result in the entire network collapsing into a linear model) which
allows it to better model data where the relation of in- and output is allows it to better model data where the relation of in- and output is
of nonlinear nature. of nonlinear nature.
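To sketch why a linear activation collapses the network, consider the
case $L = 1$ with $\sigma$ chosen as the identity in both layers.
Writing $W_l$ for the matrix with entries $w_{l,k,j}$ and $b_l$ for the
vector of biases $b_{l,k}$ (a matrix notation assumed here only for
illustration), we obtain
\begin{align*}
  o_2 &= b_2 + W_2 o_1 \\
      &= b_2 + W_2\left(b_1 + W_1 o_0\right) \\
      &= \left(b_2 + W_2 b_1\right) + \left(W_2 W_1\right) o_0,
\end{align*}
which is again an affine function of the input $o_0$. The same
argument applies inductively for any number of hidden layers.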
There are two types of activation functions, saturating and not There are two types of activation functions, saturating and not
saturating ones. Popular examples for the former are sigmoid saturating ones. Popular examples for the former are sigmoid
functions where most commonly the standard logisitc function or tangens functions where most commonly the standard logistic function or tangens
hyperbolicus are used hyperbolicus are used
as they have easy to compute derivatives which is desirable for gradient as they have easy to compute derivatives which is desirable for
based optimization algorithms. The standard logistic function (often gradient-based optimization algorithms. The standard logistic function
referred to simply as sigmoid function) is given by (often simply referred to as sigmoid function) is given by
\[ \[
f(x) = \frac{1}{1+e^{-x}} f(x) = \frac{1}{1+e^{-x}}
\] \]
and has a realm of $[0,1]$. Its usage as an activation function is and has a range of $(0,1)$. The tangens hyperbolicus is given by
motivated by modeling neurons which
are close to deactive until a certain threshold is hit and then grow in
intensity until they are fully
active. This is similar to the behavior of neurons in
brains\todo{besser schreiben}. The tangens hyperbolicus is given by
\[ \[
\tanh(x) = 1 - \frac{2}{e^{2x}+1} \tanh(x) = 1 - \frac{2}{e^{2x}+1}
\] \]
and has a realm of $[-1,1]$. and has a range of $(-1,1)$. Both functions result in neurons that are
The downside of these saturating activation functions is that given close to inactive until a certain threshold is reached where they grow
their saturating nature their derivatives are close to zero for large or small until saturation.
input values. This can slow or hinder the progress of gradient based methods. The downside of these saturating activation functions is that their
derivatives are close to zero on most of their domain, only assuming
The nonsaturating activation functions commonly used are the recified larger values in proximity to zero.
This can hinder the progress of gradient-based methods.
The nonsaturating activation functions commonly used are the rectified
linear unit (ReLU) or the leaky ReLU. The ReLU is given by linear unit (ReLU) or the leaky ReLU. The ReLU is given by
\[ \begin{equation}
r(x) = \max\left\{0, x\right\}. r(x) = \max\left\{0, x\right\}.
\] \label{eq:relu}
\end{equation}
This has the benefit of having a constant derivative for values larger This has the benefit of having a constant derivative for values larger
than zero. However the derivative being zero for negative values has than zero. However, the derivative being zero for negative values has
the same downside for the same downside for
fitting the model with gradient based methods. The leaky ReLU is fitting the model with gradient-based methods. The leaky ReLU is
an attempt to counteract this problem by assigning a small constant an attempt to counteract this problem by assigning a small constant
derivative to all values smaller than zero and for scalar $\alpha$ is given by derivative to all values smaller than zero and for a scalar $\alpha$ is given by
\[ \[
l(x) = \max\left\{0, x\right\} + \alpha \min \left\{0, x\right\}. l(x) = \max\left\{0, x\right\} + \alpha \min \left\{0, x\right\}.
\] \]
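For illustration, the derivatives of both functions away from zero are
piecewise constant,
\[
  r'(x) =
  \begin{cases}
    1, & x > 0,\\
    0, & x < 0,
  \end{cases}
  \qquad
  l'(x) =
  \begin{cases}
    1, & x > 0,\\
    \alpha, & x < 0,
  \end{cases}
\]
and neither function is differentiable at $x = 0$ for $\alpha \neq 1$.
With an example value of $\alpha = 0.1$, a gradient passing through a
leaky ReLU neuron with negative input is thus scaled by $0.1$ rather
than set to zero.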
In order to illustrate these functions plots of them are given in Figure~\ref{fig:activation}. In Figure~\ref{fig:activation} visualizations of these functions are given.
%In order to illustrate these functions plots of them are given in Figure~\ref{fig:activation}.
\begin{figure} \begin{figure}
@ -238,7 +241,7 @@ In order to illustrate these functions plots of them are given in Figure~\ref{fi
% \draw [->] (hidden-\i) -- (output-\j); % \draw [->] (hidden-\i) -- (output-\j);
\end{tikzpicture} \end{tikzpicture}
\caption{Structure of a single neuron} \caption[Structure of a Single Neuron]{Structure of a single neuron.}
\label{fig:neuron} \label{fig:neuron}
\end{figure} \end{figure}
@ -251,7 +254,7 @@ In order to illustrate these functions plots of them are given in Figure~\ref{fi
\addplot [domain=-5:5, samples=101,unbounded coords=jump]{1/(1+exp(-x))}; \addplot [domain=-5:5, samples=101,unbounded coords=jump]{1/(1+exp(-x))};
\end{axis} \end{axis}
\end{tikzpicture} \end{tikzpicture}
\caption{\titlecap{standard logistic function}} \caption{Standard Logistic Function}
\end{subfigure} \end{subfigure}
\begin{subfigure}{.45\linewidth} \begin{subfigure}{.45\linewidth}
\centering \centering
@ -260,7 +263,7 @@ In order to illustrate these functions plots of them are given in Figure~\ref{fi
\addplot[domain=-5:5, samples=100]{tanh(x)}; \addplot[domain=-5:5, samples=100]{tanh(x)};
\end{axis} \end{axis}
\end{tikzpicture} \end{tikzpicture}
\caption{\titlecap{tangens hyperbolicus}} \caption{Tangens Hyperbolicus}
\end{subfigure} \end{subfigure}
\begin{subfigure}{.45\linewidth} \begin{subfigure}{.45\linewidth}
\centering \centering
@ -282,7 +285,7 @@ In order to illustrate these functions plots of them are given in Figure~\ref{fi
\end{tikzpicture} \end{tikzpicture}
\caption{Leaky ReLU, $\alpha = 0.1$} \caption{Leaky ReLU, $\alpha = 0.1$}
\end{subfigure} \end{subfigure}
\caption{Plots of the activation functions} \caption[Plots of the Activation Functions]{Plots of the activation functions.}
\label{fig:activation} \label{fig:activation}
\end{figure} \end{figure}
@ -291,9 +294,9 @@ In order to illustrate these functions plots of them are given in Figure~\ref{fi
As neural networks are a parametric model we need to fit the As neural networks are a parametric model we need to fit the
parameters to the input parameters to the input
data in order to get meaningful results from the network. To be able data to get meaningful predictions from the network. In order
do this we first need to discuss how we interpret the output of the to accomplish this we need to discuss how we interpret the output of the
neural network. neural network and assess the quality of predictions.
% After a neural network model is designed, like most statistical models % After a neural network model is designed, like most statistical models
% it has to be fit to the data. In the machine learning context this is % it has to be fit to the data. In the machine learning context this is
@ -311,20 +314,20 @@ neural network.
% data-point in fitting the model, where usually some distance between % data-point in fitting the model, where usually some distance between
% the model output and the labels is minimized. % the model output and the labels is minimized.
\subsubsection{\titlecap{nonliniarity in the last layer}} \subsubsection{Nonlinearity in the Last Layer}
Given the nature of the neural net the outputs of the last layer are Given the nature of the neural net, the outputs of the last layer are
real numbers. For regression tasks this is desirable, for real numbers. For regression tasks, this is desirable, for
classification problems however some transformations might be classification problems however some transformations might be
necessary. necessary.
As the goal in the latter is to predict a certain class or classes for As the goal in the latter is to predict a certain class or classes for
an object the output needs to be of a form that allows this an object, the output needs to be of a form that allows this
interpretation. interpretation.
Commonly the nodes in the output layer each correspond to a class and Commonly the nodes in the output layer each correspond to a class and
the class chosen as prediction is the one with the highest value at the class chosen as prediction is the one with the highest value at
the corresponding output node. the corresponding output node.
This corresponds to a transformation of the output This can be modeled as a transformation of the output
vector $o$ into a one-hot vector vector $o \in \mathbb{R}^n$ into a one-hot vector
\[ \[
\text{pred}_i = \text{pred}_i =
\begin{cases} \begin{cases}
@ -332,9 +335,9 @@ vector $o$ into a one-hot vector
0,& \text{else}. 0,& \text{else}.
\end{cases} \end{cases}
\] \]
This however makes training the model with gradient based methods impossible, as the derivative of This however makes training the model with gradient-based methods impossible, as the derivative of
the transformation is either zero or undefined. the transformation is either zero or undefined.
A continuous transformation that is close to the argmax one is given by A continuous transformation that is close to argmax is given by
softmax softmax
\begin{equation} \begin{equation}
\text{softmax}(o)_i = \frac{e^{o_i}}{\sum_j e^{o_j}}. \text{softmax}(o)_i = \frac{e^{o_i}}{\sum_j e^{o_j}}.
@ -342,10 +345,10 @@ softmax
\end{equation} \end{equation}
The softmax function transforms the range of the output to the interval $[0,1]$ The softmax function transforms the range of the output to the interval $[0,1]$
and the individual values sum to one, thus the output can be interpreted as and the individual values sum to one, thus the output can be interpreted as
a probability for each class given the input. a probability for each class conditioned on the input.
Additionally to being differentiable this allows for evaluataing the In addition to being differentiable, this allows evaluating the
cetainiy of a prediction, rather than just whether it is accurate. certainty of a prediction, rather than just whether it is accurate.
A similar effect is obtained when for a binary or two class problem the A similar effect is obtained when for a binary or two-class problem the
sigmoid function sigmoid function
\[ \[
f(x) = \frac{1}{1 + e^{-x}} f(x) = \frac{1}{1 + e^{-x}}
@ -353,7 +356,6 @@ sigmoid function
is used and the output $f(x)$ is interpreted as the probability for is used and the output $f(x)$ is interpreted as the probability for
the first class and $1-f(x)$ for the second class. the first class and $1-f(x)$ for the second class.
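The connection to softmax can be made explicit. As a short sketch,
assume a network with two output nodes $o = (o_1, o_2)$ (a setting
chosen here only for illustration). Then
\begin{align*}
  \text{softmax}(o)_1 = \frac{e^{o_1}}{e^{o_1} + e^{o_2}}
  = \frac{1}{1 + e^{-(o_1 - o_2)}} = f(o_1 - o_2),
\end{align*}
and $\text{softmax}(o)_2 = 1 - f(o_1 - o_2)$. Applying the sigmoid
function to a single output $x$ therefore corresponds to softmax on two
outputs whose difference is $x$.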
\todo{vielleicht additiv invarianz}
% Another property that makes softmax attractive is the invariance to addition % Another property that makes softmax attractive is the invariance to addition
% \[ % \[
% \text{sofmax}(o) = \text{softmax}(o + c % \text{sofmax}(o) = \text{softmax}(o + c
@ -389,26 +391,26 @@ the first class and $1-f(x)$ for the second class.
% way to circumvent this problem is to normalize the output vector is % way to circumvent this problem is to normalize the output vector is
% such a way that the entries add up to one, this allows for the % such a way that the entries add up to one, this allows for the
% interpretation of probabilities assigned to each class. % interpretation of probabilities assigned to each class.
\clearpage
\subsubsection{Error Measurement} \subsubsection{Error Measurement}
In order to train the network we need to be able to make an assessment In order to train the network we need to be able to assess the quality
about the quality of predictions using some error measure. of predictions using some error measure.
The choice of the error The choice of the error
function is highly dependent on the type of the problem. For function is highly dependent on the type of problem. For
regression problems a commonly used error measure is the mean squared regression problems, a commonly used error measure is the mean squared
error (MSE) error (MSE)
which for a function $f$ and data $(x_i,y_i), i=1,\dots,n$ is given by which for a function $f$ and data $(x_i,y_i), i \in \left\{1,\dots,n\right\}$ is given by
\[ \[
MSE(f) = \frac{1}{n} \sum_{i=1}^n \left(f(x_i) - y_i\right)^2. MSE(f) = \frac{1}{n} \sum_{i=1}^n \left(f(x_i) - y_i\right)^2.
\] \]
However depending on the problem error measures with different However, depending on the problem error measures with different
properties might be needed, for example in some contexts it is properties might be needed. For example in some contexts it is
required to consider a proportional rather than absolute error. required to consider a proportional rather than absolute error.
As discussed above the output of a neural network for a classification As discussed above the output of a neural network for a classification
problem can be interpreted as a probability distribution over the classes problem can be interpreted as a probability distribution over the classes
conditioned on the input. In this case it is desirable to conditioned on the input. In this case, it is desirable to
use error functions designed to compare probability distributions. A use error functions designed to compare probability distributions. A
widespread error function for this use case is the categorical cross entropy (\textcite{PRML}), widespread error function for this use case is the categorical cross entropy (\textcite{PRML}),
which for two discrete distributions $p, q$ defined on the same set $C$ is given by which for two discrete distributions $p, q$ defined on the same set $C$ is given by
@ -416,33 +418,35 @@ which for two discrete distributions $p, q$ with the same realm $C$ is given by
H(p, q) = \sum_{c \in C} p(c) \ln\left(\frac{1}{q(c)}\right), H(p, q) = \sum_{c \in C} p(c) \ln\left(\frac{1}{q(c)}\right),
\] \]
comparing $q$ to a target density $p$. comparing $q$ to a target density $p$.
For a data set $(x_i,y_i), i = 1,\dots,n$ where each $y_{i,c}$ For a data set $(x_i,y_i), i \in \left\{1,\dots,n\right\}$ where each $y_{i,c}$
corresponds to the probability of class $c$ given $x_i$ and predictor corresponds to the probability of class $c$ given $x_i$ and a predictor
$f$ we get the loss function $f$ we get the loss function
\begin{equation} \begin{equation}
CE(f) = \sum_{i=1}^n H(y_i, f(x_i)). CE(f) = \sum_{i=1}^n H(y_i, f(x_i)).
\label{eq:cross_entropy} \label{eq:cross_entropy}
\end{equation} \end{equation}
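As a worked special case, assume each label $y_i$ is a one-hot vector
with $y_{i,c_i} = 1$ for the true class $c_i$ (a symbol introduced here
only for illustration). Then all but one term of each $H(y_i, f(x_i))$
vanish and
\[
  CE(f) = \sum_{i=1}^n \sum_{c \in C} y_{i,c}
  \ln\left(\frac{1}{f(x_i)_c}\right)
  = - \sum_{i=1}^n \ln\left(f(x_i)_{c_i}\right),
\]
so that the loss only penalizes a low predicted probability for the true
class. For example, a predicted probability of $0.5$ for the true class
contributes $\ln 2 \approx 0.69$ to the loss, while a predicted
probability of $0.9$ contributes only about $0.105$.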
\todo{Den satz einbauen} % \todo{Den satz einbauen}
-Maximum Likelihood % -Maximum Likelihood
-Ableitung mit softmax pseudo linear -> fast improvemtns possible % -Ableitung mit softmax pseudo linear -> fast improvemtns possible
\subsubsection{Gradient Descent Algorithm} \subsubsection{Gradient Descent Algorithm}
Trying to find the optimal parameter for fitting the model to the data Trying to find the optimal parameter for fitting the model to the data
can be a hard problem. Given the complex nature of a neural network can be a hard problem. Given the complex nature of a neural network
with many layers and neurons it is hard to predict the impact of with many layers and neurons, it is hard to predict the impact of
single parameters on the accuracy of the output. single parameters on the accuracy of the output.
Thus using numeric optimization algorithms is the only Thus using numeric optimization algorithms is the only
feasible way to fit the model. A attractive algorithm for training feasible way to fit the model.
neural networks is gradient descent where each parameter
$\theta_i$\todo{parameter name?} is An attractive algorithm for training
iterative changed according to the gradient regarding the error neural networks is gradient descent. Here all parameters are
measure and a step size $\gamma$. For this all parameters are initialized with certain values (often random or close to zero) and
initialized (often random or close to zero) and then iteratively then iteratively updated. The updates are made in the direction of the
updated until a certain stopping criterion is hit, mostly either being a fixed negative gradient of the error with a step size $\gamma$ until a
number of iterations or a desired upper limit for the error measure. specified stopping criterion is hit.
% This mostly either being a fixed
% number of iterations or a desired upper limit for the error measure.
% For a function $f_\theta$ with parameters $\theta \in \mathbb{R}^n$ % For a function $f_\theta$ with parameters $\theta \in \mathbb{R}^n$
% and a error function $L(f_\theta)$ the gradient descent algorithm is % and a error function $L(f_\theta)$ the gradient descent algorithm is
% given in \ref{alg:gd}. % given in \ref{alg:gd}.
@ -465,21 +469,21 @@ number of iterations or a desired upper limit for the error measure.
The algorithm for gradient descent is given in The algorithm for gradient descent is given in
Algorithm~\ref{alg:gd}. In the context of fitting a neural network Algorithm~\ref{alg:gd}. In the context of fitting a neural network
$f_\theta$ corresponds to a error measurement of a neural network $f_\theta$ corresponds to an error measurement of a neural network
$\mathcal{NN}_{\theta}$ where $\theta$ is a vector $\mathcal{NN}_{\theta}$ where $\theta$ is a vector
containing all the weights and biases of the network. containing all the weights and biases of the network.
As can be seen this requires computing the derivative of the network As can be seen, this requires computing the derivative of the network
with regard to each variable. With the number of variables getting with regard to each variable. With the number of variables getting
large in networks with multiple layers of high neuron count naively large in networks with multiple layers of high neuron count naively
computing the derivatives can become quite expensive in memory and computing the derivatives can become quite expensive in memory and
computation. computation.
By using the chain rule and exploiting the layered structure we can By using the chain rule and exploiting the layered structure we can
compute the parameter update much more efficiently, this practice is compute the parameter update much more efficiently. This practice is
called backpropagation and was introduced by called backpropagation and was introduced for use in neural networks by
\textcite{backprop}\todo{nachsehen ob richtige quelle}. The algorithm \textcite{backprop}. The algorithm
for one data point is given in Algorithm~\ref{alg:backprop}, but for all error for one data point is given in Algorithm~\ref{alg:backprop}, but for all error
functions that are sums of errors for single data points (MSE, cross functions that are sums of errors for single data points (MSE, cross
entropy) backpropagation works analogous for larger training data. entropy) backpropagation works analogously for larger training data.
% \subsubsection{Backpropagation} % \subsubsection{Backpropagation}
@ -496,8 +500,9 @@ entropy) backpropagation works analogous for larger training data.
\begin{algorithm}[H] \begin{algorithm}[H]
\SetAlgoLined \SetAlgoLined
\KwInput{Inputs $o_0$, neural network \KwInput{Inputs $o_0$, neural network
with $L$ hidden layers and weights $w$ and biases $b$ for $n_l$ with $L$ hidden layers, weights $w$, and biases $b$ for $n_l$
nodes and activation function $\sigma_l$ in layer $l$, loss $\tilde{L}$.} nodes as well as an activation function $\sigma_l$ in layer $l$
and loss function $\tilde{L}$.}
Forward Propagation: Forward Propagation:
\For{$l \in \left\{1, \dots, L+1\right\}$}{ \For{$l \in \left\{1, \dots, L+1\right\}$}{
Compute values for layer $l$: Compute values for layer $l$:

@ -1,8 +1,6 @@
\boolfalse {citerequest}\boolfalse {citetracker}\boolfalse {pagetracker}\boolfalse {backtracker}\relax \boolfalse {citerequest}\boolfalse {citetracker}\boolfalse {pagetracker}\boolfalse {backtracker}\relax
\babel@toc {english}{} \babel@toc {english}{}
\defcounter {refsection}{0}\relax \defcounter {refsection}{0}\relax
\contentsline {table}{\numberline {4.1}{\ignorespaces Performance metrics of the networks trained in Figure~\ref {fig:sgd_vs_gd} after 20 training epochs.\relax }}{29}{table.caption.32}% \contentsline {table}{\numberline {4.1}{\ignorespaces Values of Test Accuracies for Models Trained on Subsets of MNIST Handwritten Digits}}{41}%
\defcounter {refsection}{0}\relax \defcounter {refsection}{0}\relax
\contentsline {table}{\numberline {4.2}{\ignorespaces Values of the test accuracy of the model trained 10 times on random MNIST handwriting training sets containing 1, 10 and 100 data points per class after 125 epochs. The mean accuracy achieved for the full set employing both overfitting measures is \relax }}{42}{table.4.2}% \contentsline {table}{\numberline {4.2}{\ignorespaces Values of Test Accuracies for Models Trained on Subsets of Fashion MNIST}}{41}%
\defcounter {refsection}{0}\relax
\contentsline {table}{\numberline {4.3}{\ignorespaces Values of the test accuracy of the model trained 10 times on random fashion MNIST training sets containing 1, 10 and 100 data points per class after 125 epochs. The mean accuracy achieved for the full set employing both overfitting measures is \relax }}{42}{table.4.3}%

@ -0,0 +1,25 @@
\BOOKMARK [1][-]{section.1}{Introduction}{}% 1
\BOOKMARK [1][-]{section.2}{Introduction to Neural Networks}{}% 2
\BOOKMARK [2][-]{subsection.2.1}{Nonlinearity of Neural Networks}{section.2}% 3
\BOOKMARK [2][-]{subsection.2.2}{Training Neural Networks}{section.2}% 4
\BOOKMARK [3][-]{subsubsection.2.2.1}{Nonlinearity in the Last Layer}{subsection.2.2}% 5
\BOOKMARK [3][-]{subsubsection.2.2.2}{Error Measurement}{subsection.2.2}% 6
\BOOKMARK [3][-]{subsubsection.2.2.3}{Gradient Descent Algorithm}{subsection.2.2}% 7
\BOOKMARK [1][-]{section.3}{Shallow Neural Networks}{}% 8
\BOOKMARK [2][-]{subsection.3.1}{Convergence Behavior of One-Dimensional Randomized Shallow Neural Networks}{section.3}% 9
\BOOKMARK [2][-]{subsection.3.2}{Simulations}{section.3}% 10
\BOOKMARK [1][-]{section.4}{Application of Neural Networks to Higher Complexity Problems}{}% 11
\BOOKMARK [2][-]{subsection.4.1}{Convolution}{section.4}% 12
\BOOKMARK [2][-]{subsection.4.2}{Convolutional Neural Networks}{section.4}% 13
\BOOKMARK [2][-]{subsection.4.3}{Stochastic Training Algorithms}{section.4}% 14
\BOOKMARK [2][-]{subsection.4.4}{Modified Stochastic Gradient Descent}{section.4}% 15
\BOOKMARK [2][-]{subsection.4.5}{Combating Overfitting}{section.4}% 16
\BOOKMARK [3][-]{subsubsection.4.5.1}{Dropout}{subsection.4.5}% 17
\BOOKMARK [3][-]{subsubsection.4.5.2}{Manipulation of Input Data}{subsection.4.5}% 18
\BOOKMARK [3][-]{subsubsection.4.5.3}{Comparisons}{subsection.4.5}% 19
\BOOKMARK [3][-]{subsubsection.4.5.4}{Effectiveness for Small Training Sets}{subsection.4.5}% 20
\BOOKMARK [1][-]{section.5}{Summary and Outlook}{}% 21
\BOOKMARK [1][-]{section*.28}{Appendices}{}% 22
\BOOKMARK [1][-]{Appendix.a.A}{Notes on Proofs of Lemmata in Section 3.1}{}% 23
\BOOKMARK [1][-]{Appendix.a.B}{Implementations}{}% 24
\BOOKMARK [1][-]{Appendix.a.C}{Additional Comparisons}{}% 25

@ -1,4 +1,4 @@
\documentclass[a4paper, 12pt, draft=true]{article} \documentclass[a4paper, 12pt]{article}
%\usepackage[margin=1in]{geometry} %\usepackage[margin=1in]{geometry}
%\geometry{a4paper, left=30mm, right=40mm,top=25mm, bottom=20mm} %\geometry{a4paper, left=30mm, right=40mm,top=25mm, bottom=20mm}
@ -34,17 +34,17 @@
\usepackage{todonotes} \usepackage{todonotes}
\usepackage{lipsum} \usepackage{lipsum}
\usepackage[ruled,vlined]{algorithm2e} \usepackage[ruled,vlined]{algorithm2e}
\usepackage{showframe} %\usepackage{showframe}
\usepackage[protrusion=true, expansion=true, kerning=true, letterspace \usepackage[protrusion=true, expansion=true, kerning=true, letterspace
= 150]{microtype} = 150]{microtype}
\usepackage{titlecaps} %\usepackage{titlecaps}
\usepackage{afterpage} \usepackage{afterpage}
\usepackage{xcolor} \usepackage{xcolor}
\usepackage{chngcntr} \usepackage{chngcntr}
\usepackage{hyperref} %\usepackage{hyperref}
\hypersetup{ % \hypersetup{
linktoc=all, %set to all if you want both sections and subsections linked % linktoc=all, %set to all if you want both sections and subsections linked
} % }
\allowdisplaybreaks \allowdisplaybreaks
\captionsetup[sub]{justification=centering} \captionsetup[sub]{justification=centering}
@ -245,8 +245,7 @@
\begin{center} \begin{center}
\vspace{1cm} \vspace{1cm}
\huge \textbf{\titlecap{neural networks and their application on \huge \textbf{Neural Networks and their Application on Higher Complexity Problems}\\
higher complexity problems}}\\
\vspace{1cm} \vspace{1cm}
\huge \textbf{Tim Tobias Arndt}\\ \huge \textbf{Tim Tobias Arndt}\\
\vspace{1cm} \vspace{1cm}
@ -260,7 +259,6 @@
\clearpage \clearpage
\listoffigures \listoffigures
\listoftables \listoftables
\listoftodos
\newpage \newpage
\pagenumbering{arabic} \pagenumbering{arabic}
% Introduction % Introduction
@ -288,413 +286,6 @@
% Appendix A % Appendix A
\input{appendixA.tex} \input{appendixA.tex}
\section{\titlecap{additional comparisons}}
In this section we show additional comparisons for the neural networks
trained in Section~\ref{...}. In ... the same comparisons given for
the test accuracy are given for the cross entropy loss on the test
set, as well as on the training set.
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch},ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_1.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\addlegendentry{\footnotesize{D. 0.4}}
\addlegendentry{\footnotesize{Default}}
\end{axis}
\end{tikzpicture}
\caption{1 sample per class}
\vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch},ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_00_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_00_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_10.mean};
\addlegendentry{\footnotesize{Default.}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{10 samples per class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch}, ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_00_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_00_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_100.mean};
\addlegendentry{\footnotesize{Default.}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{100 samples per class}
\vspace{.25cm}
\end{subfigure}
\caption{Mean test accuracies of the models fitting the sampled MNIST
handwriting datasets over the 125 epochs of training.}
\end{figure}
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style =
{draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch},ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_1.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_1.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\addlegendentry{\footnotesize{D. 0.4}}
\end{axis}
\end{tikzpicture}
\caption{1 sample per class}
\vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch},ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}, ymin = {0.62}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_10.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_10.mean};
\addlegendentry{\footnotesize{Default.}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{10 samples per class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch}, ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_100.mean};
\addplot table
[x=epoch, y=val_loss, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_100.mean};
\addlegendentry{\footnotesize{Default.}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{100 samples per class}
\vspace{.25cm}
\end{subfigure}
\caption{Mean test accuracies of the models fitting the sampled fashion MNIST
over the 125 epochs of training.}
\end{figure}
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch},ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_1.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\addlegendentry{\footnotesize{D. 0.4}}
\addlegendentry{\footnotesize{Default}}
\end{axis}
\end{tikzpicture}
\caption{1 sample per class}
\vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch},ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_00_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_00_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_10.mean};
\addlegendentry{\footnotesize{Default.}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{10 samples per class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch}, ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}, ymin = {0.92}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_00_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_dropout_02_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_00_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/adam_datagen_dropout_02_100.mean};
\addlegendentry{\footnotesize{Default.}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{100 samples per class}
\vspace{.25cm}
\end{subfigure}
\caption{Mean test accuracies of the models fitting the sampled MNIST
handwriting datasets over the 125 epochs of training.}
\end{figure}
\begin{figure}[h]
\centering
\small
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style =
{draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch},ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_1.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_1.mean};
\addlegendentry{\footnotesize{Default}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G. + D. 0.2}}
\addlegendentry{\footnotesize{D. 0.4}}
\end{axis}
\end{tikzpicture}
\caption{1 sample per class}
\vspace{0.25cm}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = \textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch},ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}, ymin = {0.62}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_10.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_10.mean};
\addlegendentry{\footnotesize{Default.}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{10 samples per class}
\end{subfigure}
\begin{subfigure}[h]{\textwidth}
\begin{tikzpicture}
\begin{axis}[legend cell align={left},yticklabel style={/pgf/number format/fixed,
/pgf/number format/precision=3},tick style = {draw = none}, width = 0.9875\textwidth,
height = 0.4\textwidth, legend style={at={(0.9825,0.0175)},anchor=south east},
xlabel = {epoch}, ylabel = {Test Accuracy}, cycle
list/Dark2, every axis plot/.append style={line width
=1.25pt}]
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_0_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_dropout_2_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_0_100.mean};
\addplot table
[x=epoch, y=accuracy, col sep=comma, mark = none]
{Figures/Data/fashion_datagen_dropout_2_100.mean};
\addlegendentry{\footnotesize{Default.}}
\addlegendentry{\footnotesize{D. 0.2}}
\addlegendentry{\footnotesize{G.}}
\addlegendentry{\footnotesize{G + D. 0.2}}
\end{axis}
\end{tikzpicture}
\caption{100 samples per class}
\vspace{.25cm}
\end{subfigure}
\caption{Mean test accuracies of the models fitting the sampled fashion MNIST
over the 125 epochs of training.}
\end{figure}
\end{document} \end{document}
%%% Local Variables: %%% Local Variables:

@ -5,16 +5,16 @@
%%% TeX-master: "main" %%% TeX-master: "main"
%%% End: %%% End:
\section{Shallow Neural Networks} \section{Shallow Neural Networks}
\label{sec:shallownn}
% In order to get a some understanding of the behavior of neural % In order to get a some understanding of the behavior of neural
% networks we study a simplified class of networks called shallow neural % networks we study a simplified class of networks called shallow neural
% networks in this chapter. % networks in this chapter.
% We consider shallow neural networks consist of a single % We consider shallow neural networks consist of a single
% hidden layer and % hidden layer and
In order to get some understanding of the behavior of neural networks To get some understanding of the behavior of neural networks
we examine a simple class of networks in this chapter. We consider we examine a simple class of networks in this chapter. We consider
networks that contain only one hidden layer and have a single output networks that contain only one hidden layer and have a single output
node. We call these networks shallow neural networks. node and call these networks shallow neural networks.
\begin{Definition}[Shallow neural network, Heiss, Teichmann, and \begin{Definition}[Shallow neural network, Heiss, Teichmann, and
Wutte (2019, Definition 1.4)] Wutte (2019, Definition 1.4)]
For an input dimension $d$ and a Lipschitz continuous activation function $\sigma: For an input dimension $d$ and a Lipschitz continuous activation function $\sigma:
@ -85,8 +85,8 @@ with
% \label{fig:shallowNN} % \label{fig:shallowNN}
% \end{figure} % \end{figure}
As neural networks with a large amount of nodes have a large amount of As neural networks with a large number of nodes have a large amount of
parameters that can be tuned they can often fit the data quite well. If tunable parameters they can often fit data quite well. If
a ReLU activation function a ReLU activation function
\[ \[
\sigma(x) \coloneqq \max{(0, x)} \sigma(x) \coloneqq \max{(0, x)}
@ -106,7 +106,7 @@ on MSE will perfectly fit the data.
minimizing squared error loss. minimizing squared error loss.
\proof \proof
W.l.o.g. all values $x_{ij}^{\text{train}} \in [0,1],~\forall i \in W.l.o.g. all values $x_{ij}^{\text{train}} \in [0,1],~\forall i \in
\left\{1,\dots\right\}, j \in \left\{1,\dots,d\right\}$. Now we \left\{1,\dots, t\right\}, j \in \left\{1,\dots,d\right\}$. Now we
choose $v^*$ in order to calculate a unique value for all choose $v^*$ in order to calculate a unique value for all
$x_i^{\text{train}}$: $x_i^{\text{train}}$:
\[ \[
@ -142,30 +142,32 @@ on MSE will perfectly fit the data.
and $\vartheta^* = (w^*, b^*, v^*, c = 0)$ we get and $\vartheta^* = (w^*, b^*, v^*, c = 0)$ we get
\[ \[
\mathcal{NN}_{\vartheta^*} (x_i^{\text{train}}) = \sum_{k = \mathcal{NN}_{\vartheta^*} (x_i^{\text{train}}) = \sum_{k =
1}^{i-1} w_k\left(\left(v^*\right)^{\mathrm{T}} 1}^{i-1} w_k\left(b_k^* + \left(v^*\right)^{\mathrm{T}}
x_i^{\text{train}}\right) + w_i\left(\left(v^*\right)^{\mathrm{T}} x_i^{\text{train}}\right) + w_i\left(b_i^* +\left(v^*\right)^{\mathrm{T}}
x_i^{\text{train}}\right) = y_i^{\text{train}}. x_i^{\text{train}}\right) = y_i^{\text{train}}.
\] \]
As the squared error of $\mathcal{NN}_{\vartheta^*}$ is zero all As the squared error of $\mathcal{NN}_{\vartheta^*}$ is zero all
squared error loss minimizing shallow networks with at least $t$ hidden squared error loss minimizing shallow networks with at least $t$ hidden
nodes will perfectly fit the data. nodes will perfectly fit the data. \qed
\qed
\label{theo:overfit} \label{theo:overfit}
\end{Theorem} \end{Theorem}
However this behavior is often not desired as over fit models generally However, this behavior is often not desired as overfit models tend to
have bad generalization properties especially if noise is present in have bad generalization properties, especially if noise is present in
the data. This effect is illustrated in the data. This effect is illustrated in
Figure~\ref{fig:overfit}. Here a shallow neural network that perfectly fits the Figure~\ref{fig:overfit}.
training data is
constructed according to the proof of Theorem~\ref{theo:overfit} and Here a shallow neural network is
constructed according to the proof of Theorem~\ref{theo:overfit} to
perfectly fit some data and
compared to a cubic smoothing spline compared to a cubic smoothing spline
(Definition~\ref{def:wrs}). While the neural network (Definition~\ref{def:wrs}). While the neural network
fits the data better than the spline, the spline represents the fits the data better than the spline, the spline represents the
underlying mechanism that was used to generate the data more accurately. The better underlying mechanism that was used to generate the data more accurately. The better
generalization of the spline compared to the network is further generalization of the spline compared to the network is further
demonstrated by the better validation error computed on newly generated demonstrated by the better performance on newly generated
test data. test data.
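The interpolating construction can be illustrated with a minimal sketch, assuming a one-dimensional input with strictly increasing training points. The function names and the placement of the kinks (one unit left of the first point, then at each preceding training point) are assumptions made for illustration and are not taken from the proof of Theorem~\ref{theo:overfit}. Because node $k$ is inactive at every training point left of its kink, the output weights can be obtained by forward substitution.

```python
def relu(z):
    """ReLU activation, max(0, z)."""
    return max(0.0, z)


def fit_interpolating_network(xs, ys):
    """Build a shallow ReLU network with len(xs) hidden nodes that
    interpolates the points (xs[i], ys[i]).

    Assumes xs is strictly increasing. The network has the form
    f(x) = sum_k w_k * relu(x + b_k) with hidden weights fixed to 1.
    """
    # Hypothetical kink placement: node 0 switches on left of all data,
    # node k >= 1 switches on at the previous training point.
    kinks = [xs[0] - 1.0] + list(xs[:-1])
    biases = [-k for k in kinks]

    weights = []
    for i in range(len(xs)):
        # Nodes k > i are still inactive at xs[i], so only the already
        # determined weights contribute; solve for w_i directly.
        partial = sum(weights[k] * relu(xs[i] + biases[k]) for k in range(i))
        weights.append((ys[i] - partial) / relu(xs[i] + biases[i]))
    return weights, biases


def predict(weights, biases, x):
    """Evaluate the network at a single point x."""
    return sum(w * relu(x + b) for w, b in zip(weights, biases))
```

The resulting network has zero squared error on the training data regardless of how noisy the responses are, which is exactly the behavior that harms generalization.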
In order to improve the accuracy of the model we want to reduce In order to improve the accuracy of the model we want to reduce
overfitting. A possible way to achieve this is by explicitly overfitting. A possible way to achieve this is by explicitly
regularizing the network through the cost function as done with regularizing the network through the cost function as done with
@ -173,24 +175,23 @@ ridge penalized networks
(Definition~\ref{def:rpnn}) where large weights $w$ are punished. In (Definition~\ref{def:rpnn}) where large weights $w$ are punished. In
Theorem~\ref{theo:main1} we will Theorem~\ref{theo:main1} we will
prove that this will result in the shallow neural network converging to prove that this will result in the shallow neural network converging to
regressions splines as the amount of nodes in the hidden layer is a form of splines as the number of nodes in the hidden layer is
increased. increased.
\vfill
\begin{figure}[h]
\begin{figure}
\pgfplotsset{ \pgfplotsset{
compat=1.11, compat=1.11,
legend image code/.code={ legend image code/.code={
\draw[mark repeat=2,mark phase=2] \draw[mark repeat=2,mark phase=2]
plot coordinates { plot coordinates {
(0cm,0cm) (0cm,0cm)
(0.15cm,0cm) %% default is (0.3cm,0cm) (0.15cm,0cm) %% default is (0.3cm,0cm)
(0.3cm,0cm) %% default is (0.6cm,0cm) (0.3cm,0cm) %% default is (0.6cm,0cm)
};% };%
} }
} }
\begin{tikzpicture} \begin{tikzpicture}
\begin{axis}[tick style = {draw = none}, width = \textwidth, \begin{axis}[tick style = {draw = none}, width = \textwidth,
height = 0.6\textwidth] height = 0.6\textwidth]
@ -209,12 +210,12 @@ plot coordinates {
\addlegendentry{\footnotesize{spline}}; \addlegendentry{\footnotesize{spline}};
\end{axis} \end{axis}
\end{tikzpicture} \end{tikzpicture}
\caption[Overfitting of shallow neural networks]{For data of the form $y=\sin(\frac{x+\pi}{2 \pi}) + \caption[Overfitting of Shallow Neural Networks]{For data of the form $y=\sin(\frac{x+\pi}{2 \pi}) +
\varepsilon,~ \varepsilon \sim \mathcal{N}(0,0.4)$ \varepsilon,~ \varepsilon \sim \mathcal{N}(0,0.4)$
(\textcolor{blue}{blue dots}) the neural network constructed (\textcolor{blue}{blue}) the neural network constructed
according to the proof of Theorem~\ref{theo:overfit} (black) and the according to the proof of Theorem~\ref{theo:overfit} (black) and the
underlying signal (\textcolor{red}{red}). While the network has no underlying signal (\textcolor{red}{red}). While the network has no
bias a cubic smoothing spline (black dashed) fits the data much bias a cubic smoothing spline (black, dashed) fits the data much
better. For a test set of size 20 with uniformly distributed $x$ better. For a test set of size 20 with uniformly distributed $x$
values and responses of the same fashion as the training data the MSE of the neural network is values and responses of the same fashion as the training data the MSE of the neural network is
0.30, while the MSE of the spline is only 0.14 thus generalizing 0.30, while the MSE of the spline is only 0.14 thus generalizing
@ -223,17 +224,20 @@ plot coordinates {
\label{fig:overfit} \label{fig:overfit}
\end{figure} \end{figure}
\clearpage \vfill
\subsection{\titlecap{convergence behaviour of 1-dim. randomized shallow neural
networks}}
\clearpage
\subsection{Convergence Behavior of One-Dimensional Randomized Shallow
Neural Networks}
\label{sec:conv}
This section is based on \textcite{heiss2019}. This section is based on \textcite{heiss2019}.
In this section, we examine the convergence behavior of certain shallow
... shallow neural networks with a one dimensional input where the parameters in the neural networks.
We consider shallow neural networks with a one dimensional input where the parameters in the
hidden layer are randomized resulting in only the weights in the hidden layer are randomized resulting in only the weights in the
output layer being trainable. output layer being trainable.
Additionally we assume all neurons use a ReLU as activation function Additionally, we assume all neurons use a ReLU as an activation function
and call such networks randomized shallow neural networks. and call such networks randomized shallow neural networks.
% We will analyze the % We will analyze the
@ -271,14 +275,12 @@ and call such networks randomized shallow neural networks.
% are penalized in the loss % are penalized in the loss
% function ridge penalized neural networks. % function ridge penalized neural networks.
We will prove that if we penalize the amount of the trainable weights
We will prove that ... nodes .. a randomized shallow neural network will when fitting a randomized shallow neural network it will
converge to a function that minimizes the distance to the training converge to a function that minimizes the distance to the training
data with .. to its second derivative, data with respect to its second derivative as the amount of nodes is increased.
if the $L^2$ norm of the trainable weights $w$ is
penalized in the loss function.
We call such a network that is fitted according to MSE and a penalty term for We call such a network that is fitted according to MSE and a penalty term for
the amount of the weights a ridge penalized neural network. the $L^2$ norm of the trainable weights $w$ a ridge penalized neural network.
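As the hidden layer is fixed, fitting a ridge penalized network is a linear ridge regression on the hidden layer outputs. The following is a minimal sketch under that observation and is not the Scala implementation given in the appendix. All function names, the uniform sampling range and the hand-written linear solver are assumptions chosen to keep the example self-contained.

```python
import random


def relu(z):
    return max(0.0, z)


def sample_hidden_parameters(n, seed=0, scale=5.0):
    """Draw hidden weights v and biases b once; they are never trained.
    The uniform range is an illustrative assumption."""
    rng = random.Random(seed)
    v = [rng.uniform(-scale, scale) for _ in range(n)]
    b = [rng.uniform(-scale, scale) for _ in range(n)]
    return v, b


def hidden_features(x, v, b):
    """Outputs of the hidden layer for a scalar input x."""
    return [relu(bk + vk * x) for vk, bk in zip(v, b)]


def solve(a, rhs):
    """Solve a * sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - tail) / m[r][r]
    return sol


def fit_ridge_weights(xs, ys, v, b, lam):
    """Minimize sum_i (RN(x_i) - y_i)^2 + lam * ||w||^2 over w (lam > 0)
    by solving the normal equations (Phi^T Phi + lam * I) w = Phi^T y."""
    n = len(v)
    phi = [hidden_features(x, v, b) for x in xs]
    gram = [[sum(row[j] * row[k] for row in phi) + (lam if j == k else 0.0)
             for k in range(n)] for j in range(n)]
    rhs = [sum(row[j] * y for row, y in zip(phi, ys)) for j in range(n)]
    return solve(gram, rhs)


def ridge_loss(xs, ys, v, b, w, lam):
    """Squared error on the data plus the L2 penalty on the output weights."""
    fit = sum(
        (sum(wk * h for wk, h in zip(w, hidden_features(x, v, b))) - y) ** 2
        for x, y in zip(xs, ys)
    )
    return fit + lam * sum(wk * wk for wk in w)
```

For a positive penalty the problem is strictly convex, so the returned weights are the unique minimizer, and increasing \texttt{lam} shrinks the norm of the trained weights.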
% $\lam$ % $\lam$
% We call a randomized shallow neural network trained on MSE and % We call a randomized shallow neural network trained on MSE and
% punished for the amount of the weights $w$ according to a % punished for the amount of the weights $w$ according to a
@ -300,7 +302,7 @@ the amount of the weights a ridge penalized neural network.
\mathcal{RN}^{*, \tilde{\lambda}}_{\omega}(x) \coloneqq \mathcal{RN}^{*, \tilde{\lambda}}_{\omega}(x) \coloneqq
\mathcal{RN}_{w^{*, \tilde{\lambda}}(\omega), \omega} \mathcal{RN}_{w^{*, \tilde{\lambda}}(\omega), \omega}
\] \]
with with \
\[ \[
w^{*,\tilde{\lambda}}(\omega) :\in \argmin_{w \in w^{*,\tilde{\lambda}}(\omega) :\in \argmin_{w \in
\mathbb{R}^n} \underbrace{ \left\{\overbrace{\sum_{i = 1}^N \left(\mathcal{RN}_{w, \mathbb{R}^n} \underbrace{ \left\{\overbrace{\sum_{i = 1}^N \left(\mathcal{RN}_{w,
@ -316,7 +318,7 @@ having minimal weights, resulting in the \textit{minimum norm
network} $\mathcal{RN}_{w^{\text{min}}, \omega}$. network} $\mathcal{RN}_{w^{\text{min}}, \omega}$.
\[ \[
\mathcal{RN}_{w^{\text{min}}, \omega} \text{ randomized shallow \mathcal{RN}_{w^{\text{min}}, \omega} \text{ randomized shallow
Neural network with weights } w^{\text{min}}: neural network with weights } w^{\text{min}}\colon
\] \]
\[ \[
w^{\text{min}} \in \argmin_{w \in \mathbb{R}^n} \norm{w}, \text{ w^{\text{min}} \in \argmin_{w \in \mathbb{R}^n} \norm{w}, \text{
@ -328,8 +330,8 @@ For $\tilde{\lambda} \to \infty$ the learned
function will resemble the data less and with the weights function will resemble the data less and with the weights
approaching $0$ will converge to the constant $0$ function. approaching $0$ will converge to the constant $0$ function.
In order to make the notation more convinient in the following the To make the notation more convenient, in the following the
$\omega$ used to express the realised random parameters will no longer $\omega$ used to express the realized random parameters will no longer
be explicitly mentioned. be explicitly mentioned.
We call a function that minimizes the cubic distance between training points We call a function that minimizes the cubic distance between training points
@ -348,9 +350,9 @@ derivative of the function a cubic smoothing spline.
\] \]
\end{Definition} \end{Definition}
We will show that for specific hyper parameters the ridge penalized We will show that for specific hyperparameters the ridge penalized
shallow neural networks converge to a slightly modified variant of the shallow neural networks converge to a slightly modified variant of the
cubic smoothing spline. We will need to incorporate the densities of the cubic smoothing spline. We need to incorporate the densities of the
random parameters in the loss function of the spline to ensure random parameters in the loss function of the spline to ensure
convergence. Thus we define convergence. Thus we define
the adapted weighted cubic smoothing spline where the loss for the second the adapted weighted cubic smoothing spline where the loss for the second
@ -371,7 +373,8 @@ definition is given in Definition~\ref{def:wrs}.
% Definition~\ref{def:rpnn} converges a weighted cubic smoothing spline, as % Definition~\ref{def:rpnn} converges a weighted cubic smoothing spline, as
% the amount of hidden nodes is grown to infinity. % the amount of hidden nodes is grown to infinity.
\begin{Definition}[Adapted weighted cubic smoothing spline] \begin{Definition}[Adapted weighted cubic smoothing spline, Heiss, Teichmann, and
Wutte (2019, Definition 3.5)]
\label{def:wrs} \label{def:wrs}
Let $x_i^{\text{train}}, y_i^{\text{train}} \in \mathbb{R}, i \in Let $x_i^{\text{train}}, y_i^{\text{train}} \in \mathbb{R}, i \in
\left\{1,\dots,N\right\}$ be training data. For a given $\lambda \in \mathbb{R}_{>0}$ \left\{1,\dots,N\right\}$ be training data. For a given $\lambda \in \mathbb{R}_{>0}$
@ -385,16 +388,15 @@ definition is given in Definition~\ref{def:wrs}.
\lambda g(0) \int_{\supp(g)}\frac{\left(f''(x)\right)^2}{g(x)} \lambda g(0) \int_{\supp(g)}\frac{\left(f''(x)\right)^2}{g(x)}
dx\right\}}_{\eqqcolon F^{\lambda, g}(f)}. dx\right\}}_{\eqqcolon F^{\lambda, g}(f)}.
\] \]
\todo{Requirement on the derivative of f, or not after all?} % \todo{Requirement on the derivative of f, or not after all?}
\end{Definition} \end{Definition}
Similarly to ridge weight penalized neural networks the parameter Similarly to ridge weight penalized neural networks the parameter
$\lambda$ controls a trade-off between accuracy on the training data $\lambda$ controls a trade-off between accuracy on the training data
and smoothness or low second dreivative. For $g \equiv 1$ and $\lambda \to 0$ the and smoothness or low second derivative. For $g \equiv 1$ and $\lambda \to 0$ the
resulting function $f^{*, 0+}$ will interpolate the training data while minimizing resulting function $f^{*, 0+}$ will interpolate the training data while minimizing
the second derivative. Such a function is known as cubic spline the second derivative. Such a function is known as cubic spline
interpolation. interpolation.
\vspace{-0.2cm}
\[ \[
f^{*, 0+} \text{ smooth spline interpolation: } f^{*, 0+} \text{ smooth spline interpolation: }
\] \]
@ -403,7 +405,6 @@ interpolation.
\argmin_{\substack{f \in \mathcal{C}^2(\mathbb{R}), \\ f(x_i^{\text{train}}) = \argmin_{\substack{f \in \mathcal{C}^2(\mathbb{R}), \\ f(x_i^{\text{train}}) =
y_i^{\text{train}}}} \left( \int _{\mathbb{R}} (f''(x))^2dx\right). y_i^{\text{train}}}} \left( \int _{\mathbb{R}} (f''(x))^2dx\right).
\] \]
For $\lambda \to \infty$ on the other hand $f_g^{*, \lambda}$ converges For $\lambda \to \infty$ on the other hand $f_g^{*, \lambda}$ converges
to linear regression of the data. to linear regression of the data.
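The limit $\lambda \to \infty$ can be motivated by the following heuristic
sketch. It assumes that the data term of $F^{\lambda, g}$ is the sum of squared
residuals on the training data; the symbols $\alpha$ and $\beta$ are introduced
here for illustration only. For $F^{\lambda, g}(f)$ to stay bounded as $\lambda$
grows, the penalty integral has to vanish, which forces
\[
  f''(x) = 0 \quad \text{for almost every } x \in \supp(g)
  \quad \Longrightarrow \quad
  f(x) = \alpha + \beta x \text{ on each interval contained in } \supp(g).
\]
Among such affine functions the data term is minimized by the least squares
line, which is the linear regression of the training data.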
@ -412,16 +413,16 @@ the ridge penalized shallow neural network to adapted cubic smoothing splines.
% In order to show that ridge penalized shallow neural networks converge % In order to show that ridge penalized shallow neural networks converge
% to adapted cubic smoothing splines for a growing amount of hidden nodes we % to adapted cubic smoothing splines for a growing amount of hidden nodes we
% define two intermediary functions. % define two intermediary functions.
One being a smooth approximation of One being a smooth approximation of a
the neural network, and a randomized shallow neural network designed neural network and the other being a randomized shallow neural network designed
to approximate a spline. to approximate a spline.
In order to properly BUILD these functions we need to take the points In order to properly construct these functions, we need to take the points
of the network into consideration where the TRAJECTORY of the learned of the network into consideration where the trajectory of the learned
function changes function changes
(or the points where its derivative is discontinuous). (or the points where its derivative is discontinuous).
As we use the ReLU activation the function learned by the As we use the ReLU activation the function learned by the
network will have a discontinuous derivative at the points where a neuron in the hidden network will have a discontinuous derivative at the points where a neuron in the hidden
layer gets activated (goes from 0 -> x>0). We formalize these points layer gets activated and their output is no longer zero. We formalize these points
as kinks in Definition~\ref{def:kink}. as kinks in Definition~\ref{def:kink}.
\begin{Definition} \begin{Definition}
\label{def:kink} \label{def:kink}
@ -439,9 +440,9 @@ as kinks in Definition~\ref{def:kink}.
\item Let $\xi_k \coloneqq -\frac{b_k}{v_k}$ be the k-th kink of $\mathcal{RN}_w$. \item Let $\xi_k \coloneqq -\frac{b_k}{v_k}$ be the k-th kink of $\mathcal{RN}_w$.
\item Let $g_{\xi}(\xi_k)$ be the density of the kinks $\xi_k = \item Let $g_{\xi}(\xi_k)$ be the density of the kinks $\xi_k =
- \frac{b_k}{v_k}$ in accordance with the distributions of $b_k$ and - \frac{b_k}{v_k}$ in accordance with the distributions of $b_k$ and
$v_k$. $v_k$. With $\supp(g_\xi) = \left[C_{g_\xi}^l, C_{g_\xi}^u\right]$.
\item Let $h_{k,n} \coloneqq \frac{1}{n g_{\xi}(\xi_k)}$ be the \item Let $h_{k,n} \coloneqq \frac{1}{n g_{\xi}(\xi_k)}$ be the
average estmated distance from kink $\xi_k$ to the next nearest average estimated distance from kink $\xi_k$ to the next nearest
one. one.
\end{enumerate} \end{enumerate}
\end{Definition} \end{Definition}
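To make the location of the kinks concrete, the following short computation
sketches where the $k$-th hidden node switches on. It assumes $v_k \neq 0$,
$g_{\xi}(\xi_k) > 0$ and the ReLU activation $\sigma(z) = \max\left\{0, z\right\}$;
the interval length $\delta$ is an auxiliary symbol introduced here for
illustration only.
\[
  \sigma(b_k + v_k x) \neq 0 \iff b_k + v_k x > 0,
  \qquad
  b_k + v_k x = 0 \iff x = -\frac{b_k}{v_k} = \xi_k.
\]
Heuristically, of $n$ independently drawn kinks about $n g_{\xi}(\xi_k) \delta$
fall into a short interval of length $\delta$ around $\xi_k$. Neighboring
kinks are therefore on average about
\[
  \frac{\delta}{n g_{\xi}(\xi_k) \delta} = \frac{1}{n g_{\xi}(\xi_k)} = h_{k,n}
\]
apart, which is the quantity used in the definition above.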
@ -457,40 +458,36 @@ network by applying the kernel similar to convolution.
corresponding kink density $g_{\xi}$ as given by corresponding kink density $g_{\xi}$ as given by
Definition~\ref{def:kink}. Definition~\ref{def:kink}.
In order to smooth the RSNN consider the following kernel for every $x$: In order to smooth the RSNN consider the following kernel for every $x$:
\begin{align*}
\[ \kappa_x(s) &\coloneqq \mathds{1}_{\left\{\abs{s} \leq \frac{1}{2 \sqrt{n}
\kappa_x(s) \coloneqq \mathds{1}_{\left\{\abs{s} \leq \frac{1}{2 \sqrt{n} g_{\xi}(x)}\right\}}(s)\sqrt{n} g_{\xi}(x), \, \forall s \in \mathbb{R}\\
g_{\xi}(x)}\right\}}(s)\sqrt{n} g_{\xi}(x), \, \forall s \in \mathbb{R} \intertext{Using this kernel we define a smooth approximation of
\] $\mathcal{RN}_w$ by}
f^w(x) &\coloneqq \int_{\mathds{R}} \mathcal{RN}_w(x-s)
Using this kernel we define a smooth approximation of \kappa_x(s) ds.
$\mathcal{RN}_w$ by \end{align*}
\[
f^w(x) \coloneqq \int_{\mathds{R}} \mathcal{RN}_w(x-s) \kappa_x(s) ds.
\]
\end{Definition} \end{Definition}
Note that the kernel introduced in Definition~\ref{def:srsnn} Note that the kernel introduced in Definition~\ref{def:srsnn}
satisfies $\int_{\mathbb{R}}\kappa_x dx = 1$. While $f^w$ looks highly satisfies $\int_{\mathbb{R}}\kappa_x dx = 1$. While $f^w$ looks
similar to a convolution, it differs slightly as the kernel $\kappa_x(s)$ similar to a convolution, it differs slightly as the kernel $\kappa_x(s)$
is dependent on $x$. Therefore only $f^w = (\mathcal{RN}_w * is dependent on $x$. Therefore only $f^w = (\mathcal{RN}_w *
\kappa_x)(x)$ is well defined, while $\mathcal{RN}_w * \kappa$ is not. \kappa_x)(x)$ is well defined, while $\mathcal{RN}_w * \kappa$ is not.
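The normalization of the kernel can be checked directly. The following is a
short verification, assuming $g_{\xi}(x) > 0$; as $\kappa_x$ is a function of
$s$ for fixed $x$, the integral is taken with respect to $s$.
\[
  \int_{\mathbb{R}} \kappa_x(s) ds
  = \sqrt{n} g_{\xi}(x)
  \int_{-\frac{1}{2 \sqrt{n} g_{\xi}(x)}}^{\frac{1}{2 \sqrt{n} g_{\xi}(x)}} ds
  = \sqrt{n} g_{\xi}(x) \cdot \frac{1}{\sqrt{n} g_{\xi}(x)} = 1.
\]
In other words, $f^w(x)$ is the average of $\mathcal{RN}_w$ over a window of
width $\frac{1}{\sqrt{n} g_{\xi}(x)}$ centered at $x$.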
We use $f^{w^{*,\tilde{\lambda}}}$ to describe the spline We use $f^{w^{*,\tilde{\lambda}}}$ to describe the spline
approximating the ridge penalized network approximating the ridge penalized network
$\mathrm{RN}^{*,\tilde{\lambda}}$. $\mathcal{RN}^{*,\tilde{\lambda}}$.
Next we construct a randomized shallow neural network which Next, we construct a randomized shallow neural network that
approximates a spline independent from the realization of the random is designed to be close to a spline, independent from the realization of the random
parameters. In order to achieve this we ... parameters, by approximating the splines curvature between the
kinks.
\begin{Definition}[Spline approximating Randomised Shallow Neural \begin{Definition}[Spline approximating Randomized Shallow Neural
Network] Network]
\label{def:sann} \label{def:sann}
Let $\mathcal{RN}$ be a randomised shallow Neural Network according Let $\mathcal{RN}$ be a randomized shallow Neural Network according
to Definition~\ref{def:rsnn} and $f^{*, \lambda}_g$ be the weighted to Definition~\ref{def:rsnn} and $f^{*, \lambda}_g$ be the weighted
cubic smoothing spline as introduced in Definition~\ref{def:wrs}. Then cubic smoothing spline as introduced in Definition~\ref{def:wrs}. Then
the randomised shallow neural network approximating $f^{*, the randomized shallow neural network approximating $f^{*,
\lambda}_g$ is given by \lambda}_g$ is given by
\[ \[
\mathcal{RN}_{\tilde{w}}(x) = \sum_{k = 1}^n \tilde{w}_k \sigma(b_k + v_k x), \mathcal{RN}_{\tilde{w}}(x) = \sum_{k = 1}^n \tilde{w}_k \sigma(b_k + v_k x),
@ -498,7 +495,7 @@ parameters. In order to achieve this we ...
with the weights $\tilde{w}_k$ defined as with the weights $\tilde{w}_k$ defined as
\[ \[
\tilde{w}_k \coloneqq \frac{h_{k,n} v_k}{\mathbb{E}[v^2 \vert \xi \tilde{w}_k \coloneqq \frac{h_{k,n} v_k}{\mathbb{E}[v^2 \vert \xi
= \xi_k]} (f_g^{*, \lambda})''(\xi_k). = \xi_k]} \left(f_g^{*, \lambda}\right)''(\xi_k).
\] \]
\end{Definition} \end{Definition}
@ -512,16 +509,16 @@ derivative of $\mathcal{RN}_{\tilde{w}}(x)$ which is given by
x}} \tilde{w}_k v_k \nonumber \\ x}} \tilde{w}_k v_k \nonumber \\
&= \frac{1}{n} \sum_{\substack{k \in \mathbb{N} \\ &= \frac{1}{n} \sum_{\substack{k \in \mathbb{N} \\
\xi_k < x}} \frac{v_k^2}{g_{\xi}(\xi_k) \mathbb{E}[v^2 \vert \xi \xi_k < x}} \frac{v_k^2}{g_{\xi}(\xi_k) \mathbb{E}[v^2 \vert \xi
= \xi_k]} (f_g^{*, \lambda})''(\xi_k). \label{eq:derivnn} = \xi_k]} \left(f_g^{*, \lambda}\right)''(\xi_k). \label{eq:derivnn}
\end{align} \end{align}
As the expression (\ref{eq:derivnn}) behaves similary to a As the expression (\ref{eq:derivnn}) behaves similarly to a
Riemann-sum for $n \to \infty$ it will converge in probability to the Riemann-sum for $n \to \infty$ it will converge in probability to the
first derivative of $f^{*,\lambda}_g$. A formal proof of this behaviour first derivative of $f^{*,\lambda}_g$. A formal proof of this behavior
is given in Lemma~\ref{lem:s0}. is given in Lemma~\ref{lem:s0}.
In order to ensure the functions used in the proof of the convergence In order to ensure the functions used in the proof of the convergence
are well defined we need to assume some properties of the random are well defined we need to make some assumptions about properties of the random
parameters and their densities parameters and their densities.
% In order to formulate the theorem describing the convergence of $RN_w$ % In order to formulate the theorem describing the convergence of $RN_w$
% we need to make a couple of assumptions. % we need to make a couple of assumptions.
@ -530,7 +527,7 @@ parameters and their densities
\begin{Assumption}~ \begin{Assumption}~
\label{ass:theo38} \label{ass:theo38}
\begin{enumerate}[label=(\alph*)] \begin{enumerate}[label=(\alph*)]
\item The probability density fucntion of the kinks $\xi_k$, \item The probability density function of the kinks $\xi_k$,
namely $g_{\xi}$ as defined in Definition~\ref{def:kink} exists namely $g_{\xi}$ as defined in Definition~\ref{def:kink} exists
and is well defined. and is well defined.
\item The density function $g_\xi$ \item The density function $g_\xi$
@ -545,7 +542,7 @@ parameters and their densities
\end{enumerate} \end{enumerate}
\end{Assumption} \end{Assumption}
As we will prove the convergence in the Sobolev space, we hereby As we will prove the convergence in the Sobolev Space, we hereby
introduce it and the corresponding induced norm. introduce it and the corresponding induced norm.
\begin{Definition}[Sobolev Space] \begin{Definition}[Sobolev Space]
@ -563,7 +560,7 @@ introduce it and the corresponding induced norm.
\norm{u^{(\alpha)}}_{L^p} < \infty. \norm{u^{(\alpha)}}_{L^p} < \infty.
\] \]
\label{def:sobonorm} \label{def:sobonorm}
The natural norm of the sobolev space is given by The natural norm of the Sobolev Space is given by
\[ \[
\norm{f}_{W^{k,p}(K)} = \norm{f}_{W^{k,p}(K)} =
\begin{cases} \begin{cases}
@ -577,18 +574,21 @@ introduce it and the corresponding induced norm.
\] \]
\end{Definition} \end{Definition}
With the important definitions and assumptions in place we can now With the important definitions and assumptions in place, we can now
formulate the main theorem ... the convergence of ridge penalized formulate the main theorem.
random neural networks to adapted cubic smoothing splines when the % ... the convergence of ridge penalized
parameters are chosen accordingly. % random neural networks to adapted cubic smoothing splines when the
% parameters are chosen accordingly.
\begin{Theorem}[Ridge weight penaltiy corresponds to weighted cubic smoothing spline] \begin{Theorem}[Ridge Weight Penalty Corresponds to Weighted Cubic
Smoothing Spline]
\label{theo:main1} \label{theo:main1}
For $N \in \mathbb{N}$ arbitrary training data For $N \in \mathbb{N}$, arbitrary training data
\(\left(x_i^{\text{train}}, y_i^{\text{train}} $\left(x_i^{\text{train}}, y_i^{\text{train}}
\right)\) and $\mathcal{RN}^{*, \tilde{\lambda}}, f_g^{*, \lambda}$ \right)~\in~\mathbb{R}^2$, with $i \in \left\{1,\dots,N\right\}$,
and $\mathcal{RN}^{*, \tilde{\lambda}}, f_g^{*, \lambda}$
according to Definition~\ref{def:rpnn} and Definition~\ref{def:wrs} according to Definition~\ref{def:rpnn} and Definition~\ref{def:wrs}
respectively with Assumption~\ref{ass:theo38} it holds respectively with Assumption~\ref{ass:theo38} it holds that
\begin{equation} \begin{equation}
\label{eq:main1} \label{eq:main1}
@ -604,7 +604,7 @@ parameters are chosen accordingly.
\end{align*} \end{align*}
\end{Theorem} \end{Theorem}
As mentioned above we will prove Theorem~\ref{theo:main1} utilizing As mentioned above we will prove Theorem~\ref{theo:main1} utilizing
the ... functions. We show that intermediary functions. We show that
\begin{equation} \begin{equation}
\label{eq:main2} \label{eq:main2}
\plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f^{w^*}}_{W^{1, \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f^{w^*}}_{W^{1,
@ -616,13 +616,13 @@ and
\plimn \norm{f^{w^*} - f_g^{*, \lambda}}_{W^{1,\infty}(K)} = 0 \plimn \norm{f^{w^*} - f_g^{*, \lambda}}_{W^{1,\infty}(K)} = 0
\end{equation} \end{equation}
and then get (\ref{eq:main1}) using the triangle inequality. In and then get (\ref{eq:main1}) using the triangle inequality. In
order to prove (\ref{eq:main2}) and (\ref{eq:main3}) we will need to order to prove (\ref{eq:main2}) and (\ref{eq:main3}) we need to
introduce a number of auxiliary lemmmata, proofs of these will be introduce a number of auxiliary lemmata, proofs of which are
provided in the appendix. given in \textcite{heiss2019} and Appendix~\ref{appendix:proofs}.
\begin{Lemma}[Poincar\'e typed inequality] \begin{Lemma}[Poincar\'e Typed Inequality]
\label{lem:pieq} \label{lem:pieq}
Let \(f:\mathbb{R} \to \mathbb{R}\) be differentiable with \(f' : Let \(f:\mathbb{R} \to \mathbb{R}\) be differentiable with \(f' :
\mathbb{R} \to \mathbb{R}\) Lebesgue integrable. Then for \(K=[a,b] \mathbb{R} \to \mathbb{R}\) Lebesgue integrable. Then for \(K=[a,b]
@ -634,13 +634,14 @@ provided in the appendix.
\norm{f'}_{L^{\infty}(K)}. \norm{f'}_{L^{\infty}(K)}.
\end{equation*} \end{equation*}
If additionally \(f'\) is differentiable with \(f'': \mathbb{R} \to If additionally \(f'\) is differentiable with \(f'': \mathbb{R} \to
\mathbb{R}\) Lebesgue integrable then additionally \mathbb{R}\) Lebesgue integrable then
\begin{equation*} \begin{equation*}
\label{eq:pti2} \label{eq:pti2}
\exists C_K^2 \in \mathbb{R}_{>0} : \norm{f}_{W^{1,\infty}(K)} \leq \exists C_K^2 \in \mathbb{R}_{>0} : \norm{f}_{W^{1,\infty}(K)} \leq
C_K^2 \norm{f''}_{L^2(K)}. C_K^2 \norm{f''}_{L^2(K)}.
\end{equation*} \end{equation*}
\proof The proof is given in the appendix... % \proof The proof is given in the appendix...
% With the fundamental theorem of calculus, if % With the fundamental theorem of calculus, if
% \(\norm{f}_{L^{\infty}(K)}<\infty\) we get % \(\norm{f}_{L^{\infty}(K)}<\infty\) we get
% \begin{equation} % \begin{equation}
@ -682,6 +683,7 @@ provided in the appendix.
\forall x \in \supp(g_{\xi}) : \mathbb{E}\left[\varphi(\xi, v) \forall x \in \supp(g_{\xi}) : \mathbb{E}\left[\varphi(\xi, v)
\frac{1}{n g_{\xi}(\xi)} \vert \xi = x \right] < \infty, \frac{1}{n g_{\xi}(\xi)} \vert \xi = x \right] < \infty,
\] \]
\clearpage
it holds that it holds that
\[ \[
\plimn \sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k) \plimn \sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k)
@ -690,7 +692,7 @@ provided in the appendix.
\mathbb{E}\left[\varphi(\xi, v) \vert \xi = x \right] dx \mathbb{E}\left[\varphi(\xi, v) \vert \xi = x \right] dx
\] \]
uniformly in \(T \in K\). uniformly in \(T \in K\).
\proof The proof is given in appendix... % \proof The proof is given in appendix...
% For \(T \leq C_{g_{\xi}}^l\) both sides equal 0, so it is sufficient to % For \(T \leq C_{g_{\xi}}^l\) both sides equal 0, so it is sufficient to
% consider \(T > C_{g_{\xi}}^l\). With \(\varphi\) and % consider \(T > C_{g_{\xi}}^l\). With \(\varphi\) and
% \(\nicefrac{1}{g_{\xi}}\) uniformly continuous in \(\xi\), % \(\nicefrac{1}{g_{\xi}}\) uniformly continuous in \(\xi\),
@ -735,7 +737,7 @@ provided in the appendix.
% \kappa : \xi_m \in [\delta l, \delta(l + % \kappa : \xi_m \in [\delta l, \delta(l +
% 1)]\right\}}}{ng_{\xi}(l\delta)}\right) \pm \varepsilon .\\ % 1)]\right\}}}{ng_{\xi}(l\delta)}\right) \pm \varepsilon .\\
% \intertext{We use the mean to approximate the number of kinks in % \intertext{We use the mean to approximate the number of kinks in
% each $\delta$-strip, as it follows a bonomial distribution this % each $\delta$-strip, as it follows a binomial distribution this
% amounts to % amounts to
% \[ % \[
% \mathbb{E}\left[\abs{\left\{m \in \kappa : \xi_m \in [\delta l, % \mathbb{E}\left[\abs{\left\{m \in \kappa : \xi_m \in [\delta l,
@ -746,12 +748,13 @@ provided in the appendix.
% Bla Bla Bla $v_k$} % Bla Bla Bla $v_k$}
% \circled{1} & \approx % \circled{1} & \approx
% \end{align*} % \end{align*}
\proof Notes on the proof are given in Proof~\ref{proof:lem9}.
\end{Lemma} \end{Lemma}
\begin{Lemma} \begin{Lemma}
For any $\lambda > 0$, training data $(x_i^{\text{train}}, For any $\lambda > 0$, $N \in \mathbb{N}$, training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2$, with $ i \in y_i^{\text{train}}) \in \mathbb{R}^2$, with $ i \in
\left\{1,\dots,N\right\}$ and subset $K \subset \mathbb{R}$ the spline approximating randomized \left\{1,\dots,N\right\}$, and subset $K \subset \mathbb{R}$ the spline approximating randomized
shallow neural network $\mathcal{RN}_{\tilde{w}}$ converges to the shallow neural network $\mathcal{RN}_{\tilde{w}}$ converges to the
cubic smoothing spline $f^{*, \lambda}_g$ in cubic smoothing spline $f^{*, \lambda}_g$ in
$\norm{.}_{W^{1,\infty}(K)}$ as the node count $n$ increases, $\norm{.}_{W^{1,\infty}(K)}$ as the node count $n$ increases,
@ -767,50 +770,63 @@ provided in the appendix.
\lambda}_g)'}_{L^{\infty}} = 0. \lambda}_g)'}_{L^{\infty}} = 0.
\] \]
This can be achieved by using Lemma~\ref{lem:cnvh} with $\varphi(\xi_k, This can be achieved by using Lemma~\ref{lem:cnvh} with $\varphi(\xi_k,
v_k) = \frac{v_k^2}{\mathbb{E}[v^2|\xi = \xi_k]} (f^{*, \lambda}_w)''(\xi_k) $ v_k) = \frac{v_k^2}{\mathbb{E}[v^2|\xi = \xi_k]} (f^{*, \lambda}_g)''(\xi_k) $
thus obtaining thus obtaining
\begin{align*} \begin{align*}
\plimn \frac{\partial \mathcal{RN}_{\tilde{w}}}{\partial x} \plimn \frac{\partial \mathcal{RN}_{\tilde{w}}}{\partial x} (x)
\stackrel{(\ref{eq:derivnn})}{=} \equals^{(\ref{eq:derivnn})}_{\phantom{\text{Lemma 3.1.4}}}
& \plimn \sum_{\substack{k \in \mathbb{N} \\ %\stackrel{(\ref{eq:derivnn})}{=}
&
\plimn \sum_{\substack{k \in \mathbb{N} \\
\xi_k < x}} \frac{v_k^2}{\mathbb{E}[v^2 \vert \xi \xi_k < x}} \frac{v_k^2}{\mathbb{E}[v^2 \vert \xi
= \xi_k]} (f_g^{*, \lambda})''(\xi_k) h_{k,n} = \xi_k]} (f_g^{*, \lambda})''(\xi_k) h_{k,n} \\
\stackrel{\text{Lemma}~\ref{lem:cnvh}}{=} \\ \stackrel{\text{Lemma}~\ref{lem:cnvh}}{=}
\stackrel{\phantom{(\ref{eq:derivnn})}}{=} %\stackrel{\phantom{(\ref{eq:derivnn})}}{=}
& &
\int_{\min\left\{C_{g_{\xi}}^l,T\right\}}^{min\left\{C_{g_{\xi}}^u,T\right\}} \int_{\max\left\{C_{g_{\xi}}^l,x\right\}}^{\min\left\{C_{g_{\xi}}^u,x\right\}}
\mathbb{E}\left[\frac{v^2}{\mathbb{E}[v^2|\xi = z]} (f^{*, \mathbb{E}\left[\frac{v^2}{\mathbb{E}[v^2|\xi = z]} (f^{*,
\lambda}_w)''(\xi) \vert \lambda}_g)''(\xi) \vert
\xi = x \right] dx \equals^{\text{Tower-}}_{\text{property}} \\ \xi = z \right] dz\\
\stackrel{\phantom{(\ref{eq:derivnn})}}{=} \mathmakebox[\widthof{$\stackrel{\text{Lemma 3.14}}{=}$}][c]{\equals^{\text{Tower-}}_{\text{property}}}
%\stackrel{\phantom{(\ref{eq:derivnn})}}{=}
& &
\int_{\min\left\{C_{g_{\xi}}^l, \int_{\max\left\{C_{g_{\xi}}^l,
T\right\}}^{min\left\{C_{g_{\xi}}^u,T\right\}}(f^{*,\lambda}_w)''(x) x\right\}}^{\min\left\{C_{g_{\xi}}^u,x\right\}}(f^{*,\lambda}_g)''(z)
dx. dz.
\end{align*} \end{align*}
By the fundamental theorem of calculus and $\supp(f') \subset With the fundamental theorem of calculus we get
\supp(f)$, (\ref{eq:s0}) follows with Lemma~\ref{lem:pieq}. \[
\todo{ist die 0 wichtig?} \plimn \mathcal{RN}_{\tilde{w}}'(x) = f_g^{*,\lambda
'}(\min\left\{C_{g_{\xi}}^u, x\right\}) - f_g^{*,\lambda
'}(\max\left\{C_{g_{\xi}}^l, x\right\})
\]
As $f_g^{*,\lambda '}$ is constant on $\left[C_{g_\xi}^l,
C_{g_\xi}^u\right]^C$ because $\supp(f_g^{*,\lambda ''}) \subseteq
\supp(g) \subseteq \supp(g_\xi)$ we get
\[
\plimn \mathcal{RN}_{\tilde{w}}'(x) = f_g^{*,\lambda
'},
\]
thus (\ref{eq:s0}) follows with Lemma~\ref{lem:pieq}.
\qed \qed
\label{lem:s0} \label{lem:s0}
\end{Lemma} \end{Lemma}
\begin{Lemma} \begin{Lemma}
For any $\lambda > 0$ and training data $(x_i^{\text{train}}, For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
\left\{1,\dots,N\right\}$, we have \left\{1,\dots,N\right\}$, we have
\[ \[
\plimn F^{\tilde{\lambda}}_n(\mathcal{RN}_{\tilde{w}}) = \plimn F^{\tilde{\lambda}}_n(\mathcal{RN}_{\tilde{w}}) =
F^{\lambda, g}(f^{*, \lambda}_g) = 0. F^{\lambda, g}(f^{*, \lambda}_g) = 0.
\] \]
\proof \proof Notes on the proof are given in Proof~\ref{proof:lem14}.
The proof is given in the appendix...
\label{lem:s2} \label{lem:s2}
\end{Lemma} \end{Lemma}
\begin{Lemma} \begin{Lemma}
For any $\lambda > 0$ and training data $(x_i^{\text{train}}, For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
\left\{1,\dots,N\right\}$, with $w^*$ as \left\{1,\dots,N\right\}$, with $w^*$ as
defined in Definition~\ref{def:rpnn} and $\tilde{\lambda}$ as defined in Definition~\ref{def:rpnn} and $\tilde{\lambda}$ as
defined in Theorem~\ref{theo:main1}, it holds defined in Theorem~\ref{theo:main1}, it holds
@ -818,13 +834,13 @@ provided in the appendix.
\plimn \norm{\mathcal{RN}^{*,\tilde{\lambda}} - \plimn \norm{\mathcal{RN}^{*,\tilde{\lambda}} -
f^{w*, \tilde{\lambda}}}_{W^{1,\infty}(K)} = 0. f^{w*, \tilde{\lambda}}}_{W^{1,\infty}(K)} = 0.
\] \]
\proof The proof is given in Appendix .. \proof Notes on the proof are given in Proof~\ref{proof:lem15}.
\label{lem:s3} \label{lem:s3}
\end{Lemma} \end{Lemma}
\begin{Lemma} \begin{Lemma}
For any $\lambda > 0$ and training data $(x_i^{\text{train}}, For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
\left\{1,\dots,N\right\}$, with $w^*$ and $\tilde{\lambda}$ as \left\{1,\dots,N\right\}$, with $w^*$ and $\tilde{\lambda}$ as
defined in Definition~\ref{def:rpnn} and Theorem~\ref{theo:main1} defined in Definition~\ref{def:rpnn} and Theorem~\ref{theo:main1}
respectively, it holds respectively, it holds
@ -832,13 +848,13 @@ provided in the appendix.
\plimn \abs{F_n^{\tilde{\lambda}}(\mathcal{RN}^{*,\tilde{\lambda}}) - \plimn \abs{F_n^{\tilde{\lambda}}(\mathcal{RN}^{*,\tilde{\lambda}}) -
F^{\lambda, g}(f^{w*, \tilde{\lambda}})} = 0. F^{\lambda, g}(f^{w*, \tilde{\lambda}})} = 0.
\] \]
\proof The proof is given in appendix... \proof Notes on the proof are given in Proof~\ref{proof:lem16}.
\label{lem:s4} \label{lem:s4}
\end{Lemma} \end{Lemma}
\begin{Lemma} \begin{Lemma}
For any $\lambda > 0$ and training data $(x_i^{\text{train}}, For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
\left\{1,\dots,N\right\}$, for any sequence of functions $f^n \in \left\{1,\dots,N\right\}$, for any sequence of functions $f^n \in
W^{2,2}$ with W^{2,2}$ with
\[ \[
@ -848,39 +864,45 @@ provided in the appendix.
\[ \[
\plimn \norm{f^n - f^{*, \lambda}} = 0. \plimn \norm{f^n - f^{*, \lambda}} = 0.
\] \]
\proof The proof is given in appendix ... \proof Notes on the proof are given in Proof~\ref{proof:lem19}.
\label{lem:s7} \label{lem:s7}
\end{Lemma} \end{Lemma}
Using these lemmata we can now prove Theorem~\ref{theo:main1}. We Using these lemmata we can now prove Theorem~\ref{theo:main1}. We
start by showing that the error measure of the smooth approximation of start by showing that the error measure of the smooth approximation of
the ridge penalized randomized shallow neural network $F^{\lambda, the ridge penalized randomized shallow neural network $F^{\lambda,
g}\left(f^{w^{*,\tilde{\lambda}}}\right)$ g}(f^{w^{*,\tilde{\lambda}}})$
will converge in probability to the error measure of the adapted weighted regression will converge in probability to the error measure of the adapted weighted regression
spline $F^{\lambda, g}\left(f^{*,\lambda}\right)$ for the specified spline $F^{\lambda, g}\left(f^{*,\lambda}\right)$ for the specified
parameters. parameters.
Using Lemma~\ref{lem:s4} we get that for every $P \in (0,1)$ and Using Lemma~\ref{lem:s4} we get that for every $P \in (0,1)$ and
$\varepsilon > 0$ there exists a $n_1 \in \mathbb{N}$ such that $\varepsilon > 0$ there exists a $n_1 \in \mathbb{N}$ such that
\[ \begin{equation}
\mathbb{P}\left[F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) \in \mathbb{P}\left[F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) \in
F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right) F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
+[-\varepsilon, \varepsilon]\right] > P, \forall n \in \mathbb{N}_{> n_1}. +[-\varepsilon, \varepsilon]\right] > P, \forall n \in
\] \mathbb{N}_{> n_1}.
\label{eq:squeeze_1}
\end{equation}
As $\mathcal{RN}^{*,\tilde{\lambda}}$ is the optimal network for As $\mathcal{RN}^{*,\tilde{\lambda}}$ is the optimal network for
$F_n^{\tilde{\lambda}}$ we know that $F_n^{\tilde{\lambda}}$ we know that
\[ \begin{equation}
F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right) F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
\leq F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right). \leq F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right).
\] \label{eq:squeeze_2}
\end{equation}
Using Lemma~\ref{lem:s2} we get that for every $P \in (0,1)$ and Using Lemma~\ref{lem:s2} we get that for every $P \in (0,1)$ and
$\varepsilon > 0$ there exists a $n_2 \in \mathbb{N}$ such that $\varepsilon > 0$ a $n_2 \in \mathbb{N}$ exists such that
\[ \begin{equation}
\mathbb{P}\left[F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right) \mathbb{P}\left[F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right)
\in F^{\lambda, g}\left(f^{*,\lambda}_g\right)+[-\varepsilon, \in F^{\lambda, g}\left(f^{*,\lambda}_g\right)+[-\varepsilon,
\varepsilon]\right] > P, \forall n \in \mathbb{N}_{> n_2}. \varepsilon]\right] > P, \forall n \in \mathbb{N}_{> n_2}.
\] \label{eq:squeeze_3}
If we combine these ... we get that for every $P \in (0,1)$ and \end{equation}
$\varepsilon > 0$ and $n_3 \geq Combining (\ref{eq:squeeze_1}), (\ref{eq:squeeze_2}), and
(\ref{eq:squeeze_3}) we get that for every $P \in (0,1)$ and for \linebreak
every
$\varepsilon > 0$ with $n_3 \geq
\max\left\{n_1,n_2\right\}$ \max\left\{n_1,n_2\right\}$
\[ \[
\mathbb{P}\left[F^{\lambda, \mathbb{P}\left[F^{\lambda,
@ -888,47 +910,51 @@ $\varepsilon > 0$ and $n_3 \geq
g}\left(f^{*,\lambda}_g\right)+2\varepsilon\right] > P, \forall g}\left(f^{*,\lambda}_g\right)+2\varepsilon\right] > P, \forall
n \in \mathbb{N}_{> n_3}. n \in \mathbb{N}_{> n_3}.
\] \]
As ... is in ... and ... is optimal we know that As $\supp(f^{w^{*,\tilde{\lambda}}}) \subseteq \supp(g_\xi)$ and $f^{*,\lambda}_g$ is optimal we know that
\[ \[
F^{\lambda, g}\left(f^{*,\lambda}_g\right) \leq F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) F^{\lambda, g}\left(f^{*,\lambda}_g\right) \leq F^{\lambda,
g}\left(f^{w^{*,\tilde{\lambda}}}\right)
\] \]
and thus get with the squeeze theorem and thus get with the squeeze theorem
\[ \[
\plimn F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) = F^{\lambda, g}\left(f^{*,\lambda}_g\right). \plimn F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) = F^{\lambda, g}\left(f^{*,\lambda}_g\right).
\] \]
We can now use Lemma~\ref{lem:s7} to follow that With Lemma~\ref{lem:s7} it follows that
\begin{equation} \begin{equation}
\plimn \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g} \plimn \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g}
_{W^{1,\infty}} = 0. _{W^{1,\infty}} = 0.
\label{eq:main4} \label{eq:main4}
\end{equation} \end{equation}
Now by using the triangle inequality with Lemma~\ref{lem:s3} and By using the triangle inequality with Lemma~\ref{lem:s3} and
(\ref{eq:main4}) we get (\ref{eq:main4}) we get
\begin{align*} \begin{multline}
\plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f_g^{*,\lambda}}_{W^{1,\infty}} \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f_g^{*,\lambda}}_{W^{1,\infty}}\\
\leq& \plimn \bigg(\norm{\mathcal{RN}^{*, \tilde{\lambda}} - \leq \plimn \bigg(\norm{\mathcal{RN}^{*, \tilde{\lambda}} -
f^{w^{*,\tilde{\lambda}}}}_{W^{1,\infty}}\\ f^{w^{*,\tilde{\lambda}}}}_{W^{1,\infty}}
&+ \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g} + \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g}
_{W^{1,\infty}}\bigg) = 0 _{W^{1,\infty}}\bigg) = 0
\end{align*} \end{multline}
and thus have proven Theorem~\ref{theo:main1}. and thus have proven Theorem~\ref{theo:main1}.
We now know that randomized shallow neural networks behave similarly to We now know that randomized shallow neural networks behave similarly to
spline regression if we regularize the size of the weights during spline regression if we regularize the size of the weights during
training. training.
\textcite{heiss2019} further explore a connection between ridge penalized \textcite{heiss2019} further explore a connection between ridge penalized
networks and randomized shallow neural networks trained using gradient networks and randomized shallow neural networks trained using gradient
descent. descent.
They come to the conclusion that the effect of weight regularization They infer that the effect of weight regularization
can be achieved by stopping the training of the randomized shallow can be achieved by stopping the training of the randomized shallow
neural network early, with the amount of epochs being proportional to neural network early, with the number of iterations being proportional to
the punishment for weight size. the tuning parameter penalizing the size of the weights.
This ... that randomized shallow neural networks trained for a certain They use this to further conclude that for a large number of training epochs and number of
amount of iterations converge for a increasing amount of nodes to neurons shallow neural networks trained with gradient descent are
cubic smoothing splines with appropriate weights. very close to spline interpolations. Alternatively if the training
\todo{nochmal nachlesen wie es genau war} is stopped early, they are close to adapted weighted cubic smoothing splines.
\newpage \newpage
\subsection{Simulations} \subsection{Simulations}
\label{sec:rsnn_sim}
In the following, the behavior described in Theorem~\ref{theo:main1} In the following, the behavior described in Theorem~\ref{theo:main1}
is visualized in a simulated example. For this, two sets of training is visualized in a simulated example. For this, two sets of training
data have been generated. data have been generated.
@ -962,20 +988,26 @@ Theorem~\ref{theo:main1}
would equate to $g(x) = \frac{\mathbb{E}[v_k^2|\xi_k = x]}{10}$. In would equate to $g(x) = \frac{\mathbb{E}[v_k^2|\xi_k = x]}{10}$. In
order to utilize the order to utilize the
smoothing spline implemented in Matlab, $g$ has been simplified to $g smoothing spline implemented in Matlab, $g$ has been simplified to $g
\equiv \frac{1}{10}$ instead. For all figures $f_1^{*, \lambda}$ has \equiv \frac{1}{10}$ instead.
been calculated with Matlab's ``smoothingspline'', as this minimizes
For all figures $f_1^{*, \lambda}$ has
been calculated with Matlab's {\sffamily{smoothingspline}}, as this minimizes
\[ \[
\bar{\lambda} \sum_{i=1}^N(y_i^{train} - f(x_i^{train}))^2 + (1 - \bar{\lambda} \sum_{i=1}^N(y_i^{train} - f(x_i^{train}))^2 + (1 -
\bar{\lambda}) \int (f''(x))^2 dx \bar{\lambda}) \int (f''(x))^2 dx
\] \]
the smoothing parameter used for fittment is $\bar{\lambda} = the smoothing parameter used for fitment is $\bar{\lambda} =
\frac{1}{1 + \lambda}$. The parameter $\tilde{\lambda}$ for training \frac{1}{1 + \lambda}$. The parameter $\tilde{\lambda}$ for training
the networks is chosen as defined in Theorem~\ref{theo:main1} and each the networks is chosen as defined in Theorem~\ref{theo:main1}.
network is trained on the full training data for 5000 epochs using
Each
network contains 10\,000 hidden nodes and is trained on the full
training data for 100\,000 epochs using
gradient descent. The gradient descent. The
results are given in Figure~\ref{fig:rn_vs_rs}, here it can be seen that in results are given in Figure~\ref{fig:rn_vs_rs}, where it can be seen
the intervall of the traing data $[-\pi, \pi]$ the neural network and that the neural network and
smoothing spline are nearly identical, coinciding with the proposition. smoothing spline are nearly identical, coinciding with the
proposition.
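The choice of $\bar{\lambda}$ can be motivated by a short computation. This is
a sketch which assumes that the data term of $F^{\lambda, g}$ is the sum of
squared residuals, as in the Matlab objective above. Dividing that objective by
$\bar{\lambda} > 0$ does not change its minimizer and gives
\[
  \sum_{i=1}^N \left(y_i^{\text{train}} - f(x_i^{\text{train}})\right)^2 +
  \frac{1 - \bar{\lambda}}{\bar{\lambda}} \int (f''(x))^2 dx,
  \qquad
  \frac{1 - \bar{\lambda}}{\bar{\lambda}} = (1 + \lambda) - 1 = \lambda
  \quad \text{for } \bar{\lambda} = \frac{1}{1 + \lambda}.
\]
For a constant $g$ the weight in the penalty term cancels,
\[
  \lambda g(0) \int \frac{(f''(x))^2}{g(x)} dx = \lambda \int (f''(x))^2 dx,
\]
so under this assumption both objectives share the same minimizer.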
\input{Figures/RN_vs_RS} \input{Figures/RN_vs_RS}
