@ -6,14 +6,15 @@
% % % End:
% % % End:
\section { Shallow Neural Networks}
\section { Shallow Neural Networks}
In order to get some understanding of the behavior of neural
% In order to get some understanding of the behavior of neural
networks we study a simplified class of networks called shallow neural
% networks we study a simplified class of networks called shallow neural
networks in this chapter. We consider shallow neural networks that consist of a single
% networks in this chapter.
hidden layer and
% We consider shallow neural networks consist of a single
In order to examine some behavior of neural networks in this chapter
% hidden layer and
we consider a simple class of networks, the shallow ones. These
In order to get some understanding of the behavior of neural networks
networks only contain one hidden layer and have a single output node.
we examine a simple class of networks in this chapter. We consider
networks that contain only one hidden layer and have a single output
node. We call these networks shallow neural networks.
\begin { Definition} [Shallow neural network]
\begin { Definition} [Shallow neural network]
For an input dimension $ d $ and a Lipschitz continuous activation function $ \sigma :
For an input dimension $ d $ and a Lipschitz continuous activation function $ \sigma :
\mathbb { R} \to \mathbb { R} $ we define a shallow neural network with
\mathbb { R} \to \mathbb { R} $ we define a shallow neural network with
@ -84,15 +85,16 @@ with
% \end { figure}
% \end { figure}
As neural networks with a large number of nodes have a large number of
As neural networks with a large number of nodes have a large number of
parameters that can be tuned, they can often fit the data quite well. If a ReLU
parameters that can be tuned, they can often fit the data quite well. If
a ReLU activation function
\[
\[
\sigma (x) \coloneqq \max { (0, x)}
\sigma (x) \coloneqq \max { (0, x)}
\]
\]
is chosen as activation function, one can easily prove that if the
is chosen, one can easily prove that if the
number of hidden nodes exceeds the
number of hidden nodes exceeds the
number of data points in the training data, a shallow network trained
number of data points in the training data, a shallow network trained
on MSE will perfectly fit the data.
on MSE will perfectly fit the data.
\begin { Theorem} [meaningful title ]
\begin { Theorem} [Shallow neural network can fit data perfectly ]
For training data of size t
For training data of size t
\[
\[
\left (x_ i^ { \text { train} } , y_ i^ { \text { train} } \right ) \in \mathbb { R} ^ d
\left (x_ i^ { \text { train} } , y_ i^ { \text { train} } \right ) \in \mathbb { R} ^ d
@ -150,17 +152,18 @@ on MSE will perfectly fit the data.
\label { theo:overfit}
\label { theo:overfit}
\end { Theorem}
\end { Theorem}
However, this behavior is often not desired, as overfit models often
However, this behavior is often not desired, as overfit models generally
have bad generalization properties especially if noise is present in
have bad generalization properties especially if noise is present in
the data. This effect can be seen in
the data. This effect is illustrated in
Figure~\ref { fig:overfit} . Here a network that perfectly fits the
Figure~\ref { fig:overfit} . Here a shallow neural network that perfectly fits the
training data regarding the MSE is \todo { wording}
training data regarding MSE is \todo { wording}
constructed and compared to a regression spline
constructed according to the proof of Theorem~\ref { theo:overfit} and
(Definition~\ref { def:wrs} ). While the network
compared to a regression spline
fits the data better than the spline, the spline is much closer to the
(Definition~\ref { def:wrs} ). While the neural network
underlying mechanism that was used to generate the data. The better
fits the data better than the spline, the spline represents the
underlying mechanism that was used to generate the data more accurately. The better
generalization of the spline compared to the network is further
generalization of the spline compared to the network is further
illustrated by the better validation error computed with newly generated
demonstrated by the better validation error computed on newly generated
test data.
test data.
In order to improve the accuracy of the model we want to reduce
In order to improve the accuracy of the model we want to reduce
overfitting. A possible way to achieve this is by explicitly
overfitting. A possible way to achieve this is by explicitly
@ -168,7 +171,7 @@ regularizing the network through the cost function as done with
ridge penalized networks
ridge penalized networks
(Definition~\ref { def:rpnn} ) where large weights $ w $ are punished. In
(Definition~\ref { def:rpnn} ) where large weights $ w $ are punished. In
Theorem~\ref { theo:main1} we will
Theorem~\ref { theo:main1} we will
prove that this will result in the network converging to
prove that this will result in the shallow neural network converging to
regression splines as the number of nodes in the hidden layer is
regression splines as the number of nodes in the hidden layer is
increased.
increased.
@ -205,7 +208,7 @@ plot coordinates {
\addlegendentry { \footnotesize { spline} } ;
\addlegendentry { \footnotesize { spline} } ;
\end { axis}
\end { axis}
\end { tikzpicture}
\end { tikzpicture}
\caption { For data of the form $ y = \sin ( \frac { x + \pi } { 2 \pi } ) +
\caption [Overfitting of shallow neural networks] { For data of the form $ y = \sin ( \frac { x + \pi } { 2 \pi } ) +
\varepsilon ,~ \varepsilon \sim \mathcal { N} (0,0.4)$
\varepsilon ,~ \varepsilon \sim \mathcal { N} (0,0.4)$
(\textcolor { blue} { blue dots} ) the neural network constructed
(\textcolor { blue} { blue dots} ) the neural network constructed
according to the proof of Theorem~\ref { theo:overfit} (black) and the
according to the proof of Theorem~\ref { theo:overfit} (black) and the
@ -224,14 +227,24 @@ plot coordinates {
Networks}
Networks}
This section is based on \textcite { heiss2019} . We will analyze the connection of randomized shallow
This section is based on \textcite { heiss2019} . We will analyze the
Neural Networks with one dimensional input and regression splines. We
connection between randomized shallow
will see that the punishment of the size of the weights in training
Neural Networks with one-dimensional input and a ReLU as activation
function for all neurons, and regression splines.
% \[
% \sigma (x) = \max \left \{ 0,x\right \} .
% \]
We will see that the punishment of the size of the weights in training
the randomized shallow
the randomized shallow
Neural Network will result in a function that minimizes the second
Neural Network will result in a learned function that minimizes the second
derivative as the amount of hidden nodes is grown to infinity. In order
derivative as the amount of hidden nodes is grown to infinity. In order
to properly formulate this relation we will first need to introduce
to properly formulate this relation we will first need to introduce
some definitions.
some definitions. All neural networks introduced in the following
use a ReLU as activation at all neurons.
A randomized shallow network is characterized by only the weight
parameter of the output layer being trainable, whereas the other
parameters are random numbers.
\begin { Definition} [Randomized shallow neural network]
\begin { Definition} [Randomized shallow neural network]
For an input dimension $ d $ , let $ n \in \mathbb { N } $ be the number of
For an input dimension $ d $ , let $ n \in \mathbb { N } $ be the number of
@ -244,11 +257,20 @@ some definitions.
\]
\]
\label { def:rsnn}
\label { def:rsnn}
\end { Definition}
\end { Definition}
We call a one-dimensional randomized shallow neural network in which the
$ L ^ 2 $ norm of the trainable weights $ w $ is penalized in the loss
function a ridge penalized neural network.
% We call a randomized shallow neural network where the size of the trainable
% weights is punished in the error function a ridge penalized
% neural network. For a tuning parameter $ \tilde { \lambda } $ .. the extent
% of penalization we get:
\begin { Definition} [Ridge penalized Neural Network]
\begin { Definition} [Ridge penalized Neural Network]
\label { def:rpnn}
\label { def:rpnn}
Let $ \mathcal { RN } _ { w, \omega } $ be a randomized shallow neural
Let $ \mathcal { RN } _ { w, \omega } $ be a randomized shallow neural
network, as introduced in ???. Then the optimal ridge penalized
network, as introduced in Definition~\ref { def:rsnn} , and let
$ \tilde { \lambda } \in \mathbb { R } $ be a tuning parameter. Then the optimal ridge
penalized
network is given by
network is given by
\[
\[
\mathcal { RN} ^ { *, \tilde { \lambda } } _ { \omega } (x) \coloneqq
\mathcal { RN} ^ { *, \tilde { \lambda } } _ { \omega } (x) \coloneqq
@ -263,9 +285,8 @@ some definitions.
\tilde { \lambda } \norm { w} _ 2^ 2\right \} } _ { \eqqcolon F_ n^ { \tilde { \lambda } } (\mathcal { RN} _ { w,\omega } )} .
\tilde { \lambda } \norm { w} _ 2^ 2\right \} } _ { \eqqcolon F_ n^ { \tilde { \lambda } } (\mathcal { RN} _ { w,\omega } )} .
\]
\]
\end { Definition}
\end { Definition}
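As a hedged illustration that is not part of the source, assume that the data term of $F_n^{\tilde{\lambda}}$ is the sum of squared errors over the $N$ training points and that $\tilde{\lambda} > 0$. Since only the output weights $w$ are trainable, the minimization is then an ordinary ridge regression in the random features. With the illustrative notation $\Phi \in \mathbb{R}^{N \times n}$, $\Phi_{ik} \coloneqq \sigma(b_k + v_k x_i^{\text{train}})$ and $y \coloneqq (y_1^{\text{train}}, \dots, y_N^{\text{train}})^T$, the objective reads $\norm{\Phi w - y}_2^2 + \tilde{\lambda} \norm{w}_2^2$ and its unique minimizer is
\[
  w^{*, \tilde{\lambda}} = \left(\Phi^T \Phi + \tilde{\lambda} I_n\right)^{-1} \Phi^T y .
\]
The matrix $\Phi^T \Phi + \tilde{\lambda} I_n$ is positive definite for $\tilde{\lambda} > 0$, so the inverse exists regardless of the realization of the random parameters.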
In the ridge penalized Neural Network large weights are penalized, the
If the number of hidden nodes $ n $ is larger than the number of
extent of which can be tuned with the parameter $ \tilde { \lambda } $ . If
training samples $ N $ then for
$ n $ is larger than the number of training samples $ N $ then for
$ \tilde { \lambda } \to 0 $ the network will interpolate the data while
$ \tilde { \lambda } \to 0 $ the network will interpolate the data while
having minimal weights, resulting in the \textit { minimum norm
having minimal weights, resulting in the \textit { minimum norm
network} $ \mathcal { RN } _ { w ^ { \text { min } } , \omega } $ .
network} $ \mathcal { RN } _ { w ^ { \text { min } } , \omega } $ .
@ -280,15 +301,109 @@ having minimal weights, resulting in the \textit{minimum norm
\left \{ 1,\dots ,N\right \} .
\left \{ 1,\dots ,N\right \} .
\]
\]
For $ \tilde { \lambda } \to \infty $ the learned
For $ \tilde { \lambda } \to \infty $ the learned
function will resemble the data less and less with the weights
function will resemble the data less and, with the weights
approaching $ 0 $ .\par
approaching $ 0 $ , will converge to the constant $ 0 $ function.
In order to make the notation more convenient in the following the
In order to make the notation more convenient in the following the
$ \omega $ used to express the realised random parameters will no longer
$ \omega $ used to express the realised random parameters will no longer
be explicitly mentioned.
be explicitly mentioned.
We call a function that minimizes the squared distance between the training points
and the function, penalized\todo { right word} by the size of the second
derivative of the function, a regression spline.
\begin { Definition} [Regression Spline]
Let $ x _ i ^ { \text { train } } , y _ i ^ { \text { train } } \in \mathbb { R } , i \in
\left \{ 1,\dots ,N\right \} $ be training data. For a given $ \lambda \in
\mathbb { R} $ the regression spline is given by
\[
f^ { *,\lambda } :\in \argmin _ { f \in
\mathcal { C} ^ 2} \left \{ \sum _ { i=1} ^ N
\left (f\left (x_ i^ { \text { train} } \right ) -
y_ i^ { \text { train} } \right )^ 2 + \lambda \int f^ { ''} (x)^ 2dx\right \} .
\]
\end { Definition}
We will show that for specific hyper parameters the ridge penalized
shallow neural networks converge to a slightly modified variant of the
regression spline. We will need to incorporate the densities of the
random parameters in the loss function of the spline to ensure
convergence. Thus we define
the adapted weighted regression spline where the loss for the second
derivative is weighted by a function $ g $ and the support of the second
derivative of $ f $ has to be a subset of the support of $ g $ . The formal
definition is given in Definition~\ref { def:wrs} .
% We will later ... the converging .. of the ridge penalized shallow
% neural network, in order to do so we will need a slightly modified
% version of the regression
% spline that allows for weighting the penalty term for the second
% derivative with a weight function $ g $ . This is needed to ...the
% distributions of the random parameters ... We call this the adapted
% weighted regression spline.
% Now we take a look at weighted regression splines. Later we will prove
% that the ridge penalized neural network as defined in
% Definition~\ref { def:rpnn} converges a weighted regression spline, as
% the amount of hidden nodes is grown to inifity.
\begin { Definition} [Adapted Weighted regression spline]
\label { def:wrs}
Let $ x _ i ^ { \text { train } } , y _ i ^ { \text { train } } \in \mathbb { R } , i \in
\left \{ 1,\dots ,N\right \} $ be training data. For a given $ \lambda \in \mathbb { R} _ { >0} $
and a function $ g: \mathbb { R } \to \mathbb { R } _ { > 0 } $ the weighted
regression spline $ f ^ { * , \lambda } _ g $ is given by
\[
f^ { *, \lambda } _ g :\in \argmin _ { \substack { f \in \mathcal { C} ^ 2(\mathbb { R} )
\\ \supp (f'') \subseteq \supp (g)} } \underbrace { \left \{ \overbrace { \sum _ { i =
1} ^ N \left (f(x_ i^ { \text { train} } ) - y_ i^ { \text { train} } \right )^ 2} ^ { L(f)} +
\lambda g(0) \int _ { \supp (g)} \frac { \left (f''(x)\right )^ 2} { g(x)}
dx\right \} } _ { \eqqcolon F^ { \lambda , g} (f)} .
\]
\todo { Requirement on the derivative of f, or not after all?}
\end { Definition}
Similarly to ridge weight penalized neural networks the parameter
$ \lambda $ controls a trade-off between accuracy on the training data
and smoothness or low second derivative. For $ g \equiv 1 $ and $ \lambda \to 0 $ the
resulting function $ f ^ { * , 0 + } $ will interpolate the training data while minimizing
the second derivative. Such a function is known as cubic spline
interpolation.
\todo { cite cubic spline}
\[
f^ { *, 0+} \text { smooth spline interpolation: }
\]
\[
f^ { *, 0+} \coloneqq \lim _ { \lambda \to 0+} f^ { *, \lambda } _ 1 \in
\argmin _ { \substack { f \in \mathcal { C} ^ 2(\mathbb { R} ), \\ f(x_ i^ { \text { train} } ) =
y_ i^ { \text { train} } } } \left ( \int _ { \mathbb { R} } (f''(x))^ 2dx\right ).
\]
For $ \lambda \to \infty $ , on the other hand, $ f _ g ^ { * , \lambda } $ converges
to the linear regression fit of the data.
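A brief sketch of why this limit is plausible (an informal argument, not taken from the source): as $\lambda \to \infty$, any $f$ with $\int_{\supp(g)} \frac{(f''(x))^2}{g(x)} dx > 0$ incurs an unbounded penalty, so the minimizer is pushed toward functions with $f'' \equiv 0$, that is, affine functions $f(x) = a + bx$, where $a, b \in \mathbb{R}$ are names introduced here for illustration only. On this class the penalty vanishes and the objective reduces to the least squares problem
\[
  \min_{a, b \in \mathbb{R}} \sum_{i=1}^N \left(a + b x_i^{\text{train}} - y_i^{\text{train}}\right)^2 ,
\]
which is ordinary linear regression on the training data.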
We use two intermediary functions in order to show the convergence of
the ridge penalized shallow neural network to adapted regression splines.
% In order to show that ridge penalized shallow neural networks converge
% to adapted regression splines for a growing amount of hidden nodes we
% define two intermediary functions.
The first is a smooth approximation of
the neural network, the second a randomized shallow neural network designed
to approximate a spline.
In order to properly construct these functions we need to consider the points
at which the slope of the function learned by the network changes.
As we use the ReLU activation, the function learned by the
network is continuous and piecewise linear. Its derivative is discontinuous at the points where a neuron in the hidden
layer gets activated, i.e.\ where $ b_k + v_k x $ changes sign. We formalize these points
as kinks in Definition~\ref { def:kink} .
\begin { Definition}
\begin { Definition}
\label { def:kink}
\label { def:kink}
Let $ \mathcal { RN } _ w $ be a randomized shallow Neural
Let $ \mathcal { RN } _ w $ be a randomized shallow Neural
Network according to Definition~\ref { def:rsnn} , then kinks depending on the random parameters can
Network according to Definition~\ref { def:rsnn} , then kinks depending
on the random parameters can
be observed.
be observed.
\[
\[
\mathcal { RN} _ w(x) = \sum _ { k = 1} ^ n w_ k \sigma (b_ k + v_ kx)
\mathcal { RN} _ w(x) = \sum _ { k = 1} ^ n w_ k \sigma (b_ k + v_ kx)
@ -307,15 +422,14 @@ be explizitly mentioned.
\end { enumerate}
\end { enumerate}
\end { Definition}
\end { Definition}
In order to later prove the connection between randomised shallow
Using the density of the kinks we construct a kernel and smooth the
Neural Networks and regression splines, we first take a look at a
network by applying the kernel similar to convolution.
smooth approximation of the RSNN.
\begin { Definition} [Smooth Approximation of Randomized Shallow Neural
\begin { Definition} [Smooth Approximation of Randomized Shallow Neural
Network]
Network]
\label { def:srsnn}
\label { def:srsnn}
Let $ RS _ { w } $ be a randomized shallow Neural Network according to
Let $ RS _ { w } $ be a randomized shallow Neural Network according to
Definition~\ref { def:RSNN } with weights $ w $ and kinks $ \xi _ k $ with
Definition~\ref { def:rsnn } with weights $ w $ and kinks $ \xi _ k $ with
corresponding kink density $ g _ { \xi } $ as given by
corresponding kink density $ g _ { \xi } $ as given by
Definition~\ref { def:kink} .
Definition~\ref { def:kink} .
In order to smooth the RSNN consider the following kernel for every $ x $ :
In order to smooth the RSNN consider the following kernel for every $ x $ :
@ -338,53 +452,19 @@ satisfies $\int_{\mathbb{R}}\kappa_x dx = 1$. While $f^w$ looks highly
similar to a convolution, it differs slightly as the kernel $ \kappa _ x ( s ) $
similar to a convolution, it differs slightly as the kernel $ \kappa _ x ( s ) $
is dependent on $ x $ . Therefore only $ f ^ w = ( \mathcal { RN } _ w *
is dependent on $ x $ . Therefore only $ f ^ w = ( \mathcal { RN } _ w *
\kappa _ x)(x)$ is well defined, while $ \mathcal { RN} _ w * \kappa $ is not.
\kappa _ x)(x)$ is well defined, while $ \mathcal { RN} _ w * \kappa $ is not.
We use $ f ^ { w ^ { * , \tilde { \lambda } } } $ to describe the spline
approximating the ... ridge penalized network
$ \mathrm { RN } ^ { * , \tilde { \lambda } } $ .
Now we take a look at weighted regression splines. Later we will prove
Next we construct a randomized shallow neural network which
that the ridge penalized neural network as defined in
approximates a spline independent from the realization of the random
Definition~\ref { def:rpnn} converges to a weighted regression spline, as
parameters. In order to achieve this we ...
the number of hidden nodes is grown to infinity.
\begin { Definition} [Adapted Weighted regression spline]
\label { def:wrs}
Let $ x _ i ^ { \text { train } } , y _ i ^ { \text { train } } \in \mathbb { R } , i \in
\left \{ 1,\dots ,N\right \} $ be training data. For a given $ \lambda \in \mathbb { R} _ { >0} $
and a function $ g: \mathbb { R } \to \mathbb { R } _ { > 0 } $ the weighted
regression spline $ f ^ { * , \lambda } _ g $ is given by
\[
f^ { *, \lambda } _ g :\in \argmin _ { \substack { f \in \mathcal { C} ^ 2(\mathbb { R} )
\\ \supp (f) \subseteq \supp (g)} } \underbrace { \left \{ \overbrace { \sum _ { i =
1} ^ N \left (f(x_ i^ { \text { train} } ) - y_ i^ { \text { train} } \right )^ 2} ^ { L(f)} +
\lambda g(0) \int _ { \supp (g)} \frac { \left (f''(x)\right )^ 2} { g(x)}
dx\right \} } _ { \eqqcolon F^ { \lambda , g} (f)} .
\]
\todo { Requirement on the derivative of f, or not after all?}
\end { Definition}
Similarly to ridge weight penalized neural networks the parameter
$ \lambda $ controls a trade-off between accuracy on the training data
and smoothness or low second derivative. For $ g \equiv 1 $ and $ \lambda \to 0 $ the
resulting function $ f ^ { * , 0 + } $ will interpolate the training data while minimizing
the second derivative. Such a function is known as cubic spline
interpolation.
\todo { cite cubic spline}
\[
f^ { *, 0+} \text { smooth spline interpolation: }
\]
\[
f^ { *, 0+} \coloneqq \lim _ { \lambda \to 0+} f^ { *, \lambda } _ 1 \in
\argmin _ { \substack { f \in \mathcal { C} ^ 2(\mathbb { R} ), \\ f(x_ i^ { \text { train} } ) =
y_ i^ { \text { train} } } } \left ( \int _ { \mathbb { R} } (f''(x))^ 2dx\right ).
\]
For $ \lambda \to \infty $ , on the other hand, $ f _ g ^ { * , \lambda } $ converges
to the linear regression fit of the data.
\begin { Definition} [Spline approximating Randomised Shallow Neural
\begin { Definition} [Spline approximating Randomised Shallow Neural
Network]
Network]
\label { def:sann}
\label { def:sann}
Let $ \mathcal { RN } $ be a randomised shallow Neural Network according
Let $ \mathcal { RN } $ be a randomised shallow Neural Network according
to Definition~\ref { def:RSNN } and $ f ^ { * , \lambda } _ g $ be the weighted
to Definition~\ref { def:rsnn} and $ f ^ { * , \lambda } _ g $ be the weighted
regression spline as introduced in Definition~\ref { def:wrs} . Then
regression spline as introduced in Definition~\ref { def:wrs} . Then
the randomised shallow neural network approximating $ f ^ { * ,
the randomised shallow neural network approximating $ f ^ { * ,
\lambda } _ g$ is given by
\lambda } _ g$ is given by
@ -399,9 +479,8 @@ to linear regression of the data.
\end { Definition}
\end { Definition}
The approximating nature of the network in
The approximating nature of the network in
Definition~\ref { def:sann} can be seen by looking \todo { find a better
Definition~\ref { def:sann} can be seen by examining the first
word} at the first derivative of $ \mathcal { RN } _ { \tilde { w } } ( x ) $ which is given
derivative of $ \mathcal { RN } _ { \tilde { w } } ( x ) $ which is given by
by
\begin { align}
\begin { align}
\frac { \partial \mathcal { RN} _ { \tilde { w} } } { \partial x}
\frac { \partial \mathcal { RN} _ { \tilde { w} } } { \partial x}
\Big { |} _ { x} & = \sum _ k^ n \tilde { w} _ k \mathds { 1} _ { \left \{ b_ k + v_ k x >
\Big { |} _ { x} & = \sum _ k^ n \tilde { w} _ k \mathds { 1} _ { \left \{ b_ k + v_ k x >
@ -411,16 +490,18 @@ by
\xi _ k < x} } \frac { v_ k^ 2} { g_ { \xi } (\xi _ k) \mathbb { E} [v^ 2 \vert \xi
\xi _ k < x} } \frac { v_ k^ 2} { g_ { \xi } (\xi _ k) \mathbb { E} [v^ 2 \vert \xi
= \xi _ k]} (f_ g^ { *, \lambda } )''(\xi _ k). \label { eq:derivnn}
= \xi _ k]} (f_ g^ { *, \lambda } )''(\xi _ k). \label { eq:derivnn}
\end { align}
\end { align}
\todo { proper derivative notation}
As the expression (\ref { eq:derivnn} ) behaves similarly to a
As the expression (\ref { eq:derivnn} ) behaves similarly to a
Riemann sum for $ n \to \infty $ it will converge to the first
Riemann sum for $ n \to \infty $ it will converge in probability to the
derivative of $ f ^ { * , \lambda } _ g $ . A formal proof of this behaviour
first derivative of $ f ^ { * , \lambda } _ g $ . A formal proof of this behaviour
is given in Lemma~\ref { lem:s0} .
is given in Lemma~\ref { lem:s0} .
In order to ensure the functions used in the proof of the convergence
are well defined we need to assume some properties of the random
parameters and their densities.
In order to formulate the theorem describing the convergence of $ RN _ w $
% In order to formulate the theorem describing the convergence of $ RN _ w $
we need to make a couple of assumptions.
% we need to make a couple of assumptions.
\todo { Better wording}
% \todo { Better wording}
\begin { Assumption} ~
\begin { Assumption} ~
\label { ass:theo38}
\label { ass:theo38}
@ -440,8 +521,8 @@ we need to make a couple of assumptions.
\end { enumerate}
\end { enumerate}
\end { Assumption}
\end { Assumption}
As we will prove the proposition in the Sobolev space, we hereby
As we will prove the convergence in the Sobolev space, we hereby
introduce it and its induced\todo { correct word?} norm.
introduce it and the corresponding induced norm.
\begin { Definition} [Sobolev Space]
\begin { Definition} [Sobolev Space]
For $ K \subset \mathbb { R } ^ n $ open and $ 1 \leq p \leq \infty $ we
For $ K \subset \mathbb { R } ^ n $ open and $ 1 \leq p \leq \infty $ we
@ -473,9 +554,10 @@ introduce it and its inuced\todo{richtiges wort?} norm.
\]
\]
\end { Definition}
\end { Definition}
With these assumptions in place we can formulate the main theorem.
With the important definitions and assumptions in place we can now
\todo { Reference to the space}
formulate the main theorem ... the convergence of ridge penalized
random neural networks to adapted regression splines when the
parameters are chosen accordingly.
\begin { Theorem} [Ridge weight penalty corresponds to weighted regression spline]
\begin { Theorem} [Ridge weight penalty corresponds to weighted regression spline]
\label { theo:main1}
\label { theo:main1}
@ -498,7 +580,8 @@ With these assumption in place we can formulate the main theorem.
\tilde { \lambda } & \coloneqq \lambda n g(0).
\tilde { \lambda } & \coloneqq \lambda n g(0).
\end { align*}
\end { align*}
\end { Theorem}
\end { Theorem}
We will prove Theorem~\ref { theo:main1} by showing that
As mentioned above we will prove Theorem~\ref { theo:main1} utilizing
the ... functions. We show that
\begin { equation}
\begin { equation}
\label { eq:main2}
\label { eq:main2}
\plimn \norm { \mathcal { RN} ^ { *, \tilde { \lambda } } - f^ { w^ *} } _ { W^ { 1,
\plimn \norm { \mathcal { RN} ^ { *, \tilde { \lambda } } - f^ { w^ *} } _ { W^ { 1,
@ -509,10 +592,10 @@ and
\label { eq:main3}
\label { eq:main3}
\plimn \norm { f^ { w^ *} - f_ g^ { *, \lambda } } _ { W^ { 1,\infty } (K)} = 0
\plimn \norm { f^ { w^ *} - f_ g^ { *, \lambda } } _ { W^ { 1,\infty } (K)} = 0
\end { equation}
\end { equation}
and then using the triangle inequality to follow (\ref { eq:main1} ) . In
and then get (\ref { eq:main1} ) using the triangle inequality. In
order to prove (\ref { eq:main2} ) and (\ref { eq:main3} ) we will need to
order to prove (\ref { eq:main2} ) and (\ref { eq:main3} ) we will need to
introduce a number of auxiliary lemmata; proofs of these will be
introduce a number of auxiliary lemmata; proofs of these will be
provided in the appendix, as they would exceed the scope of this section.
provided in the appendix.
@ -534,7 +617,7 @@ provided in the appendix, as they would SPRENGEN DEN RAHMEN.
\exists C_ K^ 2 \in \mathbb { R} _ { >0} : \norm { f} _ { W^ { 1,\infty } (K)} \leq
\exists C_ K^ 2 \in \mathbb { R} _ { >0} : \norm { f} _ { W^ { 1,\infty } (K)} \leq
C_ K^ 2 \norm { f''} _ { L^ 2(K)} .
C_ K^ 2 \norm { f''} _ { L^ 2(K)} .
\end { equation*}
\end { equation*}
% \proof
\proof The proof is given in the appendix...
% With the fundamental theorem of calculus, if
% With the fundamental theorem of calculus, if
% \( \norm { f } _ { L ^ { \infty } ( K ) } < \infty \) we get
% \( \norm { f } _ { L ^ { \infty } ( K ) } < \infty \) we get
% \begin { equation}
% \begin { equation}
@ -584,7 +667,7 @@ provided in the appendix, as they would SPRENGEN DEN RAHMEN.
\mathbb { E} \left [\varphi(\xi, v) \vert \xi = x \right] dx
\mathbb { E} \left [\varphi(\xi, v) \vert \xi = x \right] dx
\]
\]
uniformly in \( T \in K \) .
uniformly in \( T \in K \) .
% \proof
\proof The proof is given in appendix...
% For \( T \leq C _ { g _ { \xi } } ^ l \) both sides equal 0, so it is sufficient to
% For \( T \leq C _ { g _ { \xi } } ^ l \) both sides equal 0, so it is sufficient to
% consider \( T > C _ { g _ { \xi } } ^ l \) . With \( \varphi \) and
% consider \( T > C _ { g _ { \xi } } ^ l \) . With \( \varphi \) and
% \( \nicefrac { 1 } { g _ { \xi } } \) uniformly continous in \( \xi \) ,
% \( \nicefrac { 1 } { g _ { \xi } } \) uniformly continous in \( \xi \) ,
@ -620,7 +703,7 @@ provided in the appendix, as they would SPRENGEN DEN RAHMEN.
% \kappa : \xi _ m \in [\delta l, \delta (l + 1)]\right \} } } { \abs { \left \{ m \in
% \kappa : \xi _ m \in [\delta l, \delta (l + 1)]\right \} } } { \abs { \left \{ m \in
% \kappa : \xi _ m \in [\delta l, \delta (l + 1)]\right \} } } } _ { =
% \kappa : \xi _ m \in [\delta l, \delta (l + 1)]\right \} } } } _ { =
% 1} \right ) \\
% 1} \right ) \\
% % \intertext { }
% \intertext { }
% & = \sum _ { l \in I_ { \delta } } \left ( \frac { \sum _ { \substack { k \in \kappa \\
% & = \sum _ { l \in I_ { \delta } } \left ( \frac { \sum _ { \substack { k \in \kappa \\
% \xi _ k \in \left [\delta l, \delta (l + 1)\right] } }
% \xi _ k \in \left [\delta l, \delta (l + 1)\right] } }
% \varphi \left (l\delta , v_ k\right )}
% \varphi \left (l\delta , v_ k\right )}
@ -685,6 +768,7 @@ provided in the appendix, as they would SPRENGEN DEN RAHMEN.
By the fundamental theorem of calculus and $ \supp ( f' ) \subset
By the fundamental theorem of calculus and $ \supp ( f' ) \subset
\supp (f)$ , (\ref { eq:s0} ) follows with Lemma~\ref { lem:pieq} .
\supp (f)$ , (\ref { eq:s0} ) follows with Lemma~\ref { lem:pieq} .
\qed
\qed
\label { lem:s0}
\end { Lemma}
\end { Lemma}
\begin { Lemma} [Step 2]
\begin { Lemma} [Step 2]
@ -696,19 +780,22 @@ provided in the appendix, as they would SPRENGEN DEN RAHMEN.
F^ { \lambda , g} (f^ { *, \lambda } _ g) = 0.
F^ { \lambda , g} (f^ { *, \lambda } _ g) = 0.
\]
\]
\proof
\proof
This can be proven by showing
The proof is given in the appendix...
\label { lem:s2}
\end { Lemma}
\end { Lemma}
\begin { Lemma} [Step 3]
\begin { Lemma} [Step 3]
For any $ \lambda > 0 $ and training data $ ( x _ i ^ { \text { train } } ,
For any $ \lambda > 0 $ and training data $ ( x _ i ^ { \text { train } } ,
y_ i^ { \text { train} } ) \in \mathbb { R} ^ 2, \, i \in
y_ i^ { \text { train} } ) \in \mathbb { R} ^ 2, \, i \in
\left \{ 1,\dots ,N\right \} $ , with $ w^ *$ and $ \tilde { \lambda } $ as
\left \{ 1,\dots ,N\right \} $ , with $ w^ *$ as
defined in Definition~\ref { def:rpnn} and Theorem~\ref { theo:main1}
defined in Definition~\ref { def:rpnn} and $ \tilde { \lambda } $ as
respectively, it holds
defined in Theorem~\ref { theo:main1} , it holds
\[
\[
\plimn \norm { \mathcal { RN} ^ { *,\tilde { \lambda } } -
\plimn \norm { \mathcal { RN} ^ { *,\tilde { \lambda } } -
f^ { w*, \tilde { \lambda } } } _ { W^ { 1,\infty } (K)} = 0.
f^ { w*, \tilde { \lambda } } } _ { W^ { 1,\infty } (K)} = 0.
\]
\]
\proof The proof is given in Appendix ..
\label { lem:s3}
\end { Lemma}
\end { Lemma}
\begin { Lemma} [Step 4]
\begin { Lemma} [Step 4]
@ -718,9 +805,11 @@ provided in the appendix, as they would SPRENGEN DEN RAHMEN.
defined in Definition~\ref { def:rpnn} and Theorem~\ref { theo:main1}
defined in Definition~\ref { def:rpnn} and Theorem~\ref { theo:main1}
respectively, it holds
respectively, it holds
\[
\[
\plimn \abs { F_ n^ { \lambda } (\mathcal { RN} ^ { *,\tilde { \lambda } } ) -
\plimn \abs { F_ n^ { \tilde { \lambda } } (\mathcal { RN} ^ { *,\tilde { \lambda } } ) -
F^ { \lambda , g} (f^ { w*, \tilde { \lambda } } )} = 0.
F^ { \lambda , g} (f^ { w*, \tilde { \lambda } } )} = 0.
\]
\]
\proof The proof is given in appendix...
\label { lem:s4}
\end { Lemma}
\end { Lemma}
\begin { Lemma} [Step 7]
\begin { Lemma} [Step 7]
@ -735,11 +824,81 @@ provided in the appendix, as they would SPRENGEN DEN RAHMEN.
\[
\[
\plimn \norm { f^ n - f^ { *, \lambda } } = 0.
\plimn \norm { f^ n - f^ { *, \lambda } } = 0.
\]
\]
\proof The proof is given in appendix ...
\label { lem:s7}
\end { Lemma}
\end { Lemma}
Using these lemmata we can now prove Theorem~\ref { theo:main1} . We
\textcite { heiss2019} further show a link between ridge penalized
start by showing that the error measure of the smooth approximation of
networks and randomized shallow neural networks which are trained with
the ridge penalized randomized shallow neural network $ F ^ { \lambda ,
gradient descent which is stopped after a certain amount of iterations.
g} \left (f^ { w^ { *,\tilde { \lambda } } } \right )$
will converge in probability to the error measure of the adapted weighted regression
spline $ F ^ { \lambda , g } \left ( f ^ { * , \lambda } \right ) $ for the specified
parameters.
Using Lemma~\ref { lem:s4} we get that for every $ P \in ( 0 , 1 ) $ and
$ \varepsilon > 0 $ there exists a $ n _ 1 \in \mathbb { N } $ such that
\[
\mathbb { P} \left [F^ { \lambda , g} \left (f^ { w^ { *,\tilde { \lambda } } } \right ) \in
F_ n^ { \tilde { \lambda } } \left (\mathcal { RN} ^ { *,\tilde { \lambda } } \right )
+[-\varepsilon , \varepsilon ]\right ] > P, \forall n \in \mathbb { N} _ { > n_ 1} .
\]
As $ \mathcal { RN } ^ { * , \tilde { \lambda } } $ is the optimal network for
$ F _ n ^ { \tilde { \lambda } } $ we know that
\[
F_ n^ { \tilde { \lambda } } \left (\mathcal { RN} ^ { *,\tilde { \lambda } } \right )
\leq F_ n^ { \tilde { \lambda } } \left (\mathcal { RN} _ { \tilde { w} } \right ).
\]
Using Lemma~\ref { lem:s2} we get that for every $ P \in ( 0 , 1 ) $ and
$ \varepsilon > 0 $ there exists an $ n _ 2 \in \mathbb { N } $ such that
\[
\mathbb { P} \left [F_ n^ { \tilde { \lambda } } \left (\mathcal { RN} _ { \tilde { w} } \right )
\in F^ { \lambda , g} \left (f^ { *,\lambda } _ g\right )+[-\varepsilon ,
\varepsilon ]\right ] > P, \forall n \in \mathbb { N} _ { > n_ 2} .
\]
If we combine these ... we get that for every $ P \in ( 0 , 1 ) $ and
$ \varepsilon > 0 $ there exists an $ n _ 3 \geq
\max \left \{ n_ 1,n_ 2\right \} $ such that
\[
\mathbb { P} \left [F^ { \lambda ,
g} \left (f^ { w^ { *,\tilde { \lambda } } } \right ) \leq F^ { \lambda ,
g} \left (f^ { *,\lambda } _ g\right )+2\varepsilon \right ] > P, \forall
n \in \mathbb { N} _ { > n_ 3} .
\]
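To make the combination explicit: on the event where both of the
inclusions above hold, the three displayed relations chain together as
\begin{align*}
  F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right)
  &\leq F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
    + \varepsilon \\
  &\leq F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right)
    + \varepsilon \\
  &\leq F^{\lambda, g}\left(f^{*,\lambda}_g\right) + 2\varepsilon.
\end{align*}
The probability bookkeeping is not spelled out here; one possible reading
(an assumption, not taken from the source) is to invoke both lemmata with
$P' \coloneqq \frac{1 + P}{2}$ instead of $P$. By the union bound both
inclusions then hold simultaneously with probability greater than
$2P' - 1 = P$.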
As ... is in ... and ... is optimal we know that
\[
F^ { \lambda , g} \left (f^ { *,\lambda } _ g\right ) \leq F^ { \lambda , g} \left (f^ { w^ { *,\tilde { \lambda } } } \right )
\]
and thus obtain by the squeeze theorem
\[
\plimn F^ { \lambda , g} \left (f^ { w^ { *,\tilde { \lambda } } } \right ) = F^ { \lambda , g} \left (f^ { *,\lambda } _ g\right ).
\]
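Spelled out, the lower bound holds for every $n$ and the upper bound holds
with probability greater than $P$, so for every $P \in (0, 1)$,
$\varepsilon > 0$ and all sufficiently large $n$
\[
  \mathbb{P}\left[0 \leq F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right)
    - F^{\lambda, g}\left(f^{*,\lambda}_g\right) \leq 2\varepsilon\right] > P.
\]
As $\varepsilon$ and $P$ are arbitrary, this is exactly the stated
convergence in probability.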
We can now use Lemma~\ref { lem:s7} to conclude that
\begin { equation}
\plimn \norm { f^ { w^ { *,\tilde { \lambda } } } - f^ { *,\lambda } _ g}
_ { W^ { 1,\infty } } = 0.
\label { eq:main2}
\end { equation}
Now by using the triangle inequality with Lemma~\ref { lem:s3} and
(\ref { eq:main2} ) we get
\begin { align*}
\plimn \norm { \mathcal { RN} ^ { *, \tilde { \lambda } } - f_ g^ { *,\lambda } }
_ { W^ { 1,\infty } }
\leq & \plimn \bigg (\norm { \mathcal { RN} ^ { *, \tilde { \lambda } } -
f^ { w^ { *,\tilde { \lambda } } } } _ { W^ { 1,\infty } } \\
& + \norm { f^ { w^ { *,\tilde { \lambda } } } - f^ { *,\lambda } _ g}
_ { W^ { 1,\infty } } \bigg ) = 0
\end { align*}
and thus have proven Theorem~\ref { theo:main1} .
We now know that randomized shallow neural networks behave similarly to
spline regression if we regularize the size of the weights during
training.
\textcite { heiss2019} further explore a connection between ridge penalized
networks and randomized shallow neural networks which are only trained
for a certain number of epochs using gradient descent.
And ... that the effect of weight regularization can be achieved by
training for a certain number of iterations this ... between adapted
weighted regression splines and randomized shallow neural networks
where training is stopped early.
\newpage
\newpage
\subsection { Simulations}
\subsection { Simulations}
@ -755,7 +914,7 @@ data have been generated.
y_ { i, A} ^ { \text { train} } & \coloneqq \sin ( x_ { i, A} ^ { \text { train} } ). \phantom { (i - 1),
y_ { i, A} ^ { \text { train} } & \coloneqq \sin ( x_ { i, A} ^ { \text { train} } ). \phantom { (i - 1),
i \in \left \{ 1, \dots , 6\right \} }
i \in \left \{ 1, \dots , 6\right \} }
\end { align*}
\end { align*}
\item $ \text { data } _ b = ( x _ { i, B } ^ { \text { train } } , y _ { i,
\item $ \text { data } _ B = ( x _ { i, B } ^ { \text { train } } , y _ { i,
B} ^ { \text { train} } )$ with
B} ^ { \text { train} } )$ with
\begin { align*}
\begin { align*}
x_ { i, B} ^ { \text { train} } & \coloneqq \pi \frac { i - 8} { 7} ,
x_ { i, B} ^ { \text { train} } & \coloneqq \pi \frac { i - 8} { 7} ,
@ -785,9 +944,9 @@ been calculated with Matlab's ..... As ... minimizes
the smoothing parameter used for fitting is $ \bar { \lambda } =
the smoothing parameter used for fitting is $ \bar { \lambda } =
\frac { 1} { 1 + \lambda } $ . The parameter $ \tilde { \lambda } $ for training
\frac { 1} { 1 + \lambda } $ . The parameter $ \tilde { \lambda } $ for training
the networks is chosen as defined in Theorem~\ref { theo:main1} and each
the networks is chosen as defined in Theorem~\ref { theo:main1} and each
one is trained on the full training data for 5000 iterations using
one is trained on the full training data for 5000 epochs using
gradient descent. The
gradient descent. The
results are given in Figure~\ref { blblb } . Here it can be seen that in
results are given in Figure~\ref { fig:rs_ vs_ rs } . Here it can be seen that in
the interval of the training data $ [ - \pi , \pi ] $ the neural network and
the interval of the training data $ [ - \pi , \pi ] $ the neural network and
smoothing spline are nearly identical, coinciding with the proposition.
smoothing spline are nearly identical, coinciding with the proposition.
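
The relation $\bar{\lambda} = \frac{1}{1 + \lambda}$ used above can be
motivated by a short calculation. As a sketch, assume that the spline
routine minimizes a convex combination of data fit and roughness. This
form is an assumption for illustration; the exact objective of the
routine is not stated here, and $(x_i, y_i)$ denotes generic training
points:
\[
  \bar{\lambda} \sum_{i} \left(f(x_i) - y_i\right)^2
  + (1 - \bar{\lambda}) \int \left(f''(x)\right)^2 dx.
\]
Dividing by $\bar{\lambda} > 0$ does not change the minimizer and gives
\[
  \sum_{i} \left(f(x_i) - y_i\right)^2
  + \frac{1 - \bar{\lambda}}{\bar{\lambda}} \int \left(f''(x)\right)^2 dx,
  \qquad
  \frac{1 - \bar{\lambda}}{\bar{\lambda}}
  = \frac{1 - \frac{1}{1 + \lambda}}{\frac{1}{1 + \lambda}} = \lambda.
\]
Under this assumption, fitting with $\bar{\lambda}$ is therefore
equivalent to penalizing the second derivative with weight $\lambda$.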