%%% TeX-master: "main"
%%% End:

\section{Shallow Neural Networks}
\label{sec:shallownn}
To get some understanding of the behavior of neural networks
we examine a simple class of networks in this chapter. We consider
networks that contain only one hidden layer and have a single output
node and call these networks shallow neural networks.
\begin{Definition}[Shallow neural network, Heiss, Teichmann, and
  Wutte (2019, Definition 1.4)]
  For an input dimension $d$ and a Lipschitz continuous activation function $\sigma:
As neural networks with a large number of nodes have a large number of
tunable parameters, they can often fit data quite well. If
a ReLU activation function
\[
  \sigma(x) \coloneqq \max{(0, x)}
\]
  minimizing squared error loss.
  \proof
  W.l.o.g. all values $x_{ij}^{\text{train}} \in [0,1],~\forall i \in
  \left\{1,\dots,t\right\}, j \in \left\{1,\dots,d\right\}$. Now we
  choose $v^*$ in order to calculate a unique value for all
  $x_i^{\text{train}}$:
  \[
  and $\vartheta^* = (w^*, b^*, v^*, c = 0)$ we get
  \[
    \mathcal{NN}_{\vartheta^*}(x_i^{\text{train}}) = \sum_{k =
    1}^{i-1} w_k\left(b_k^* + \left(v^*\right)^{\mathrm{T}}
    x_i^{\text{train}}\right) + w_i\left(b_i^* + \left(v^*\right)^{\mathrm{T}}
    x_i^{\text{train}}\right) = y_i^{\text{train}}.
  \]
  As the squared error of $\mathcal{NN}_{\vartheta^*}$ is zero, all
  squared error loss minimizing shallow networks with at least $t$ hidden
  nodes will perfectly fit the data. \qed
  \label{theo:overfit}
\end{Theorem}
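
To make the construction used in the proof of Theorem~\ref{theo:overfit} more
tangible, the following Python sketch interpolates $t$ one-dimensional training
points with $t$ ReLU neurons. It is purely illustrative: the kink placement,
the choice $v = 1$ and all variable names are assumptions of this sketch and
not the exact values chosen in the proof.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 10))                  # t = 10 training inputs
y = np.sin((x + np.pi) / (2 * np.pi)) + rng.normal(0, 0.4, 10)

v = 1.0                                             # shared inner weight (assumption)
# one kink just left of the smallest point, then one between
# every pair of consecutive training points
b = -np.concatenate(([x[0] - 1e-3], (x[:-1] + x[1:]) / 2))
relu = lambda z: np.maximum(0, z)

Phi = relu(v * x[:, None] + b[None, :])             # t x t, lower triangular
w = np.linalg.solve(Phi, y)                         # output weights interpolating y

nn = lambda t_: relu(v * np.asarray(t_)[:, None] + b[None, :]) @ w
print(np.allclose(nn(x), y))                        # True: zero squared error loss
\end{verbatim}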
However, this behavior is often not desired as overfit models tend to
have bad generalization properties, especially if noise is present in
the data. This effect is illustrated in Figure~\ref{fig:overfit}.
Here a shallow neural network is
constructed according to the proof of Theorem~\ref{theo:overfit} to
perfectly fit some data and
compared to a cubic smoothing spline
(Definition~\ref{def:wrs}). While the neural network
fits the data better than the spline, the spline more accurately
represents the underlying mechanism that was used to generate the
data. The better
generalization of the spline compared to the network is further
demonstrated by the better performance on newly generated
test data.
In order to improve the accuracy of the model we want to reduce
overfitting. A possible way to achieve this is by explicitly
regularizing the network through the cost function as done with
ridge penalized networks
(Definition~\ref{def:rpnn}) where large weights $w$ are punished. In
Theorem~\ref{theo:main1} we will
prove that this will result in the shallow neural network converging to
a form of splines as the number of nodes in the hidden layer is
increased.
\begin{figure}
  \pgfplotsset{
    compat=1.11,
    legend image code/.code={
      \draw[mark repeat=2, mark phase=2]
      plot coordinates {
        (0cm,0cm)
        (0.15cm,0cm)        %% default is (0.3cm,0cm)
        (0.3cm,0cm)         %% default is (0.6cm,0cm)
      };%
    }
  }
  \begin{tikzpicture}
    \begin{axis}[tick style = {draw = none}, width = \textwidth,
                 height = 0.6\textwidth]
      \addplot table
        [x=x, y=y, col sep=comma, only marks, mark options={scale =
        0.7}] {Figures/Data/overfit.csv};
      \addplot [red, line width=0.8pt] table [x=x_n, y=s_n, col
        sep=comma, forget plot] {Figures/Data/overfit.csv};
      \addplot [black, line width=0.8pt] table [x=x_n, y=y_n, col
        sep=comma] {Figures/Data/overfit.csv};
      \addplot [black, line width=0.8pt, dashed] table [x=x, y=y, col
        sep=comma] {Figures/Data/overfit_spline.csv};
      \addlegendentry{\footnotesize{data}};
      \addlegendentry{\footnotesize{$\mathcal{NN}_{\vartheta^*}$}};
      \addlegendentry{\footnotesize{spline}};
    \end{axis}
  \end{tikzpicture}
  \caption[Overfitting of shallow neural networks]{For data of the form $y = \sin(\frac{x + \pi}{2 \pi}) +
    \varepsilon,~ \varepsilon \sim \mathcal{N}(0,0.4)$
    (\textcolor{blue}{blue}) the neural network constructed
    according to the proof of Theorem~\ref{theo:overfit} (black) and the
    underlying signal (\textcolor{red}{red}). While the network has no
    bias, a cubic smoothing spline (black, dashed) fits the data much
    better. For a test set of size 20 with uniformly distributed $x$
    values and responses of the same fashion as the training data the MSE of the neural network is
    0.30, while the MSE of the spline is only 0.14, thus generalizing
    better.}
  \label{fig:overfit}
\end{figure}
\vfill
\clearpage
\subsection{Convergence Behavior of One-Dimensional Randomized Shallow
  Neural Networks}
\label{sec:conv}
This section is based on \textcite{heiss2019}.
In this section, we examine the convergence behavior of certain shallow
neural networks.
We consider shallow neural networks with a one-dimensional input where the parameters in the
hidden layer are randomized, resulting in only the weights of the
output layer being trainable.
Additionally, we assume all neurons use a ReLU as an activation function
and call such networks randomized shallow neural networks.
We will prove that, as the number of nodes increases, a randomized shallow neural network will
converge to a function that minimizes the distance to the training
data with respect to its second derivative,
if the $L^2$ norm of the trainable weights $w$ is
penalized in the loss function.
We call such a network that is fitted according to MSE and a penalty term for
the $L^2$ norm of the trainable weights $w$ a ridge penalized neural network.
  \mathcal{RN}^{*, \tilde{\lambda}}_{\omega}(x) \coloneqq
  \mathcal{RN}_{w^{*, \tilde{\lambda}}(\omega), \omega}
\]
with
\[
  w^{*,\tilde{\lambda}}(\omega) :\in \argmin_{w \in
  \mathbb{R}^n} \underbrace{\left\{\overbrace{\sum_{i = 1}^N \left(\mathcal{RN}_{w,
having minimal weights, resulting in the \textit{minimum norm
network} $\mathcal{RN}_{w^{\text{min}}, \omega}$.
\[
  \mathcal{RN}_{w^{\text{min}}, \omega} \text{ randomized shallow
  neural network with weights } w^{\text{min}} \colon
\]
\[
  w^{\text{min}} \in \argmin_{w \in \mathbb{R}^n} \norm{w}, \text{
For $\tilde{\lambda} \to \infty$ the learned
function will resemble the data less and with the weights
approaching $0$ will converge to the constant $0$ function.
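
Since only the output weights $w$ of a randomized shallow neural network are
trainable, minimizing the penalized loss above is essentially an ordinary ridge
regression on the hidden-layer features and therefore has a closed-form
solution. The following Python sketch illustrates this; the distributions of
the frozen parameters $(b, v)$ and the value of $\tilde{\lambda}$ are
assumptions made purely for illustration.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, N = 500, 30                                   # hidden nodes, training points
x = np.sort(rng.uniform(-np.pi, np.pi, N))
y = np.sin((x + np.pi) / (2 * np.pi)) + rng.normal(0, 0.4, N)

v = rng.standard_normal(n)                       # frozen inner weights (assumption)
b = rng.uniform(-np.pi, np.pi, n) * np.abs(v)    # frozen biases, kinks -b/v in [-pi, pi]
Phi = np.maximum(0, b[None, :] + v[None, :] * x[:, None])   # N x n ReLU features

lam_tilde = 0.1
# w* minimizes sum_i (RN_w(x_i) - y_i)^2 + lam_tilde * ||w||^2
w_star = np.linalg.solve(Phi.T @ Phi + lam_tilde * np.eye(n), Phi.T @ y)
rn = lambda t: np.maximum(0, b[None, :] + v[None, :] * np.asarray(t)[:, None]) @ w_star
\end{verbatim}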
To make the notation more convenient, in the following the
$\omega$ used to express the realized random parameters will no longer
be explicitly mentioned.

We call a function that minimizes the cubic distance between training points
\]
\end{Definition}
We will show that for specific hyperparameters the ridge penalized
shallow neural networks converge to a slightly modified variant of the
cubic smoothing spline. We need to incorporate the densities of the
random parameters in the loss function of the spline to ensure
convergence. Thus we define
the adapted weighted cubic smoothing spline where the loss for the second
derivative is weighted by a function $g$ and the support of the second
derivative of $f$ has to be a subset of the support of $g$. The formal
definition is given in Definition~\ref{def:wrs}.
\begin{Definition}[Adapted weighted cubic smoothing spline, Heiss, Teichmann, and
  Wutte (2019, Definition 3.5)]
  \label{def:wrs}
  Let $x_i^{\text{train}}, y_i^{\text{train}} \in \mathbb{R}, i \in
  \left\{1,\dots,N\right\}$ be training data. For a given $\lambda \in \mathbb{R}_{>0}$
    \lambda g(0) \int_{\supp(g)} \frac{\left(f''(x)\right)^2}{g(x)}
    dx\right\}}_{\eqqcolon F^{\lambda, g}(f)}.
  \]
\end{Definition}
Similarly to ridge weight penalized neural networks the parameter
$\lambda$ controls a trade-off between accuracy on the training data
and smoothness or low second derivative. For $g \equiv 1$ and $\lambda \to 0$ the
resulting function $f^{*, 0+}$ will interpolate the training data while minimizing
the second derivative. Such a function is known as cubic spline
interpolation.
\[
  f^{*, 0+} \text{ smooth spline interpolation: }
\]
\[
  f^{*, 0+} \in
  \argmin_{\substack{f \in \mathcal{C}^2(\mathbb{R}), \\ f(x_i^{\text{train}}) =
  y_i^{\text{train}}}} \left( \int_{\mathbb{R}} (f''(x))^2 dx\right).
\]
For $\lambda \to \infty$ on the other hand $f_g^{*,\lambda}$ converges
to linear regression of the data.
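
To illustrate the role of $\lambda$, the following Python sketch fits cubic
smoothing splines for several smoothing parameters. It corresponds to the case
$g \equiv 1$ and assumes SciPy version 1.10 or newer, whose
\texttt{make\_smoothing\_spline} routine minimizes the analogous objective
$\sum_i (y_i^{\text{train}} - f(x_i^{\text{train}}))^2 + \lambda \int (f''(x))^2 dx$;
the data and parameter values are assumptions of this sketch.
\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline   # SciPy >= 1.10 assumed

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-np.pi, np.pi, 30))
y = np.sin((x + np.pi) / (2 * np.pi)) + rng.normal(0, 0.4, 30)

for lam in (1e-6, 1e-1, 1e3):
    f = make_smoothing_spline(x, y, lam=lam)
    print(f"lambda = {lam:g}: training MSE = {np.mean((f(x) - y) ** 2):.3f}")
# small lambda: near interpolation; large lambda: close to a linear fit
\end{verbatim}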
In order to show the convergence of the ridge penalized shallow neural
network to adapted cubic smoothing splines, we define two intermediary
functions: one being a smooth approximation of the
neural network and the other being a randomized shallow neural network designed
to approximate a spline.
In order to properly construct these functions, we need to take the points
of the network into consideration where the trajectory of the learned
function changes
(or its points of discontinuity).
As we use the ReLU activation the function learned by the
network will possess points of discontinuity where a neuron in the hidden
layer gets activated and its output is no longer zero. We formalize these points
as kinks in Definition~\ref{def:kink}.
\begin{Definition}
  \label{def:kink}
  \item Let $\xi_k \coloneqq -\frac{b_k}{v_k}$ be the $k$-th kink of $\mathcal{RN}_w$.
  \item Let $g_{\xi}(\xi_k)$ be the density of the kinks $\xi_k =
    -\frac{b_k}{v_k}$ in accordance with the distributions of $b_k$ and
    $v_k$, with $\supp(g_\xi) = \left[C_{g_\xi}^l, C_{g_\xi}^u\right]$.
  \item Let $h_{k,n} \coloneqq \frac{1}{n g_{\xi}(\xi_k)}$ be the
    average estimated distance from kink $\xi_k$ to the next nearest
    one.
  \end{enumerate}
\end{Definition}
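
The quantities of Definition~\ref{def:kink} are easy to compute numerically.
In the following sketch the distributions of $v_k$ and $b_k$ are assumptions
chosen so that the kinks are uniform on $[-\pi,\pi]$, i.e. $g_\xi \equiv
\frac{1}{2\pi}$ on that interval; the sketch only serves to illustrate that
$h_{k,n}$ matches the average kink spacing.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n = 1000
v = rng.standard_normal(n)                      # assumed distribution of v_k
b = rng.uniform(-np.pi, np.pi, n) * np.abs(v)   # assumed distribution of b_k

xi = -b / v                                     # kinks xi_k = -b_k / v_k
g_xi = 1 / (2 * np.pi)                          # kink density on its support
h = 1 / (n * g_xi)                              # h_{k,n}, the same for every k here

print(h, np.mean(np.diff(np.sort(xi))))         # both approximately 2*pi / n
\end{verbatim}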
  corresponding kink density $g_{\xi}$ as given by
  Definition~\ref{def:kink}.
  In order to smooth the RSNN consider the following kernel for every $x$:
  \[
    \kappa_x(s) \coloneqq \mathds{1}_{\left\{\abs{s} \leq \frac{1}{2 \sqrt{n}
    g_{\xi}(x)}\right\}}(s) \sqrt{n} g_{\xi}(x), \, \forall s \in \mathbb{R}.
  \]
  Using this kernel we define a smooth approximation of
  $\mathcal{RN}_w$ by
  \[
    f^w(x) \coloneqq \int_{\mathds{R}} \mathcal{RN}_w(x-s) \kappa_x(s) ds.
  \]
\end{Definition}
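
Definition~\ref{def:srsnn} amounts to averaging $\mathcal{RN}_w$ over a window
of width $\frac{1}{\sqrt{n} g_\xi(x)}$ centered at $x$. The short Python sketch
below evaluates $f^w$ by a simple Riemann approximation of the integral; the
network parameters, the output weights and the constant kink density are
assumptions carried over from the previous sketches.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n = 1000
v = rng.standard_normal(n)
b = rng.uniform(-np.pi, np.pi, n) * np.abs(v)
w = rng.standard_normal(n) / n                        # arbitrary output weights (assumption)
rn = lambda t: np.maximum(0, b + v * np.asarray(t)[..., None]) @ w
g_xi = 1 / (2 * np.pi)                                # kink density, see previous sketch

def f_w(x, num=401):
    half = 1 / (2 * np.sqrt(n) * g_xi)                # half width of supp(kappa_x)
    s = np.linspace(-half, half, num)
    # kappa_x is constant sqrt(n)*g_xi on its support, so the integral is a mean
    return np.mean(rn(x - s)) * np.sqrt(n) * g_xi * (2 * half)

print(rn(0.5), f_w(0.5))                              # f^w smooths RN_w around x = 0.5
\end{verbatim}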
Note that the kernel introduced in Definition~\ref{def:srsnn}
satisfies $\int_{\mathbb{R}} \kappa_x(s) ds = 1$. While $f^w$ looks
similar to a convolution, it differs slightly as the kernel $\kappa_x(s)$
is dependent on $x$. Therefore only $f^w = (\mathcal{RN}_w *
\kappa_x)(x)$ is well defined, while $\mathcal{RN}_w * \kappa$ is not.
We use $f^{w^{*,\tilde{\lambda}}}$ to describe this smooth approximation
of the ridge penalized network
$\mathcal{RN}^{*,\tilde{\lambda}}$.
Next, we construct a randomized shallow neural network that
is designed to be close to a spline, independent from the realization of the random
parameters, by approximating the spline's curvature between the
kinks.
\begin{Definition}[Spline approximating Randomized Shallow Neural
  Network]
  \label{def:sann}
  Let $\mathcal{RN}$ be a randomized shallow neural network according
  to Definition~\ref{def:rsnn} and $f^{*,\lambda}_g$ be the weighted
  cubic smoothing spline as introduced in Definition~\ref{def:wrs}. Then
  the randomized shallow neural network approximating $f^{*,
  \lambda}_g$ is given by
  \[
    \mathcal{RN}_{\tilde{w}}(x) = \sum_{k = 1}^n \tilde{w}_k \sigma(b_k + v_k x),
  \]
  with the weights $\tilde{w}_k$ defined as
  \[
    \tilde{w}_k \coloneqq \frac{h_{k,n} v_k}{\mathbb{E}[v^2 \vert \xi
    = \xi_k]} \left(f_g^{*, \lambda}\right)''(\xi_k).
  \]
\end{Definition}
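
The weights of Definition~\ref{def:sann} can be generated directly from any
twice differentiable stand-in for $f^{*,\lambda}_g$. The sketch below uses a
SciPy smoothing spline in that role and assumes the same parameter
distributions as before, for which $\mathbb{E}[v^2 \vert \xi = \xi_k] = 1$;
all of this is illustrative and not part of \textcite{heiss2019}.
\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline   # SciPy >= 1.10 assumed

rng = np.random.default_rng(5)
n, N = 2000, 30
xt = np.sort(rng.uniform(-np.pi, np.pi, N))
yt = np.sin((xt + np.pi) / (2 * np.pi)) + rng.normal(0, 0.4, N)
spline = make_smoothing_spline(xt, yt, lam=1.0)        # stand-in for f*_g^lambda

v = rng.standard_normal(n)
b = rng.uniform(-np.pi, np.pi, n) * np.abs(v)
xi = -b / v                                            # kinks, uniform on [-pi, pi]
h = 1 / (n * (1 / (2 * np.pi)))                        # h_{k,n}
E_v2 = 1.0                                             # E[v^2 | xi = xi_k] for v ~ N(0,1)

w_tilde = h * v / E_v2 * spline.derivative(2)(xi)      # weights of Definition (def:sann)
rn_tilde = lambda t: np.maximum(0, b + v * np.asarray(t)[..., None]) @ w_tilde
\end{verbatim}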
To see how $\mathcal{RN}_{\tilde{w}}$ relates to the spline, consider the first
derivative of $\mathcal{RN}_{\tilde{w}}(x)$, which is given by
\begin{align}
  \frac{\partial \mathcal{RN}_{\tilde{w}}}{\partial x}(x) &= \sum_{\substack{k \in \mathbb{N} \\ \xi_k <
  x}} \tilde{w}_k v_k \nonumber \\
  &= \frac{1}{n} \sum_{\substack{k \in \mathbb{N} \\
  \xi_k < x}} \frac{v_k^2}{g_{\xi}(\xi_k) \mathbb{E}[v^2 \vert \xi
  = \xi_k]} \left(f_g^{*, \lambda}\right)''(\xi_k). \label{eq:derivnn}
\end{align}
As the expression (\ref{eq:derivnn}) behaves similarly to a
Riemann sum for $n \to \infty$, it will converge in probability to the
first derivative of $f^{*,\lambda}_g$. A formal proof of this behavior
is given in Lemma~\ref{lem:s0}.
In order to ensure the functions used in the proof of the convergence
are well defined we need to make some assumptions about the properties of the random
parameters and their densities.
\begin{Assumption}~
  \label{ass:theo38}
  \begin{enumerate}[label=(\alph*)]
  \item The probability density function of the kinks $\xi_k$,
    namely $g_{\xi}$ as defined in Definition~\ref{def:kink} exists
    and is well defined.
  \item The density function $g_\xi$
  \end{enumerate}
\end{Assumption}
As we will prove the convergence in a Sobolev space, we hereby
introduce it and the corresponding induced norm.
\begin{Definition}[Sobolev Space]
    \norm{u^{(\alpha)}}_{L^p} < \infty.
  \]
  \label{def:sobonorm}
  The natural norm of the Sobolev space is given by
  \[
    \norm{f}_{W^{k,p}(K)} =
    \begin{cases}
      \left(\sum_{\alpha \leq k} \norm{f^{(\alpha)}}^p_{L^p(K)}\right)^{\nicefrac{1}{p}}, & \text{if } 1 \leq p < \infty, \\
      \max_{\alpha \leq k} \norm{f^{(\alpha)}}_{L^{\infty}(K)}, & \text{if } p = \infty,
    \end{cases}
  \]
\end{Definition}
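
For $k = 1$ and $p = \infty$ the norm above is the larger of the supremum norms
of $f$ and $f'$ on $K$. A rough numerical estimate of this quantity, useful for
informally checking the convergence statements below, can be obtained on a
grid; the finite-difference approximation of $f'$ and the grid size are
assumptions of this sketch.
\begin{verbatim}
import numpy as np

def sobolev_1_inf(f, a, b, m=10_000):
    """Grid estimate of the W^{1,infty}([a, b]) norm of f."""
    x = np.linspace(a, b, m)
    y = f(x)
    dy = np.gradient(y, x)                 # finite-difference estimate of f'
    return max(np.max(np.abs(y)), np.max(np.abs(dy)))

print(sobolev_1_inf(np.sin, 0.0, np.pi))   # approximately 1
\end{verbatim}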
With the important definitions and assumptions in place, we can now
formulate the main theorem, which describes the convergence of ridge penalized
randomized shallow neural networks to adapted cubic smoothing splines when the
parameters are chosen accordingly.
\begin{Theorem}[Ridge Weight Penalty Corresponds to Weighted Cubic
  Smoothing Spline]
  \label{theo:main1}
  For $N \in \mathbb{N}$, arbitrary training data
  $\left(x_i^{\text{train}}, y_i^{\text{train}}
  \right)~\in~\mathbb{R}^2$, with $i \in \left\{1,\dots,N\right\}$,
  and $\mathcal{RN}^{*, \tilde{\lambda}}, f_g^{*,\lambda}$
  according to Definition~\ref{def:rpnn} and Definition~\ref{def:wrs}
  respectively, with Assumption~\ref{ass:theo38} it holds that
  \begin{equation}
    \label{eq:main1}
  \end{align*}
\end{Theorem}
As mentioned above we will prove Theorem~\ref{theo:main1} utilizing
the intermediary functions introduced above. We show that
\begin{equation}
  \label{eq:main2}
  \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f^{w^*}}_{W^{1,
  \infty}(K)} = 0
\end{equation}
and
\begin{equation}
  \label{eq:main3}
  \plimn \norm{f^{w^*} - f_g^{*, \lambda}}_{W^{1,\infty}(K)} = 0
\end{equation}
and then get (\ref{eq:main1}) using the triangle inequality. In
order to prove (\ref{eq:main2}) and (\ref{eq:main3}) we need to
introduce a number of auxiliary lemmata, proofs of which are
given in \textcite{heiss2019} and Appendix~\ref{appendix:proofs}.
\begin{Lemma}[Poincar\'e Type Inequality]
  \label{lem:pieq}
  Let \(f:\mathbb{R} \to \mathbb{R}\) be differentiable with \(f' :
  \mathbb{R} \to \mathbb{R}\) Lebesgue integrable. Then for \(K = [a,b]
    \norm{f'}_{L^{\infty}(K)}.
  \end{equation*}
  If additionally \(f'\) is differentiable with \(f'': \mathbb{R} \to
  \mathbb{R}\) Lebesgue integrable then
  \begin{equation*}
    \label{eq:pti2}
    \exists C_K^2 \in \mathbb{R}_{>0} : \norm{f}_{W^{1,\infty}(K)} \leq
    C_K^2 \norm{f''}_{L^2(K)}.
  \end{equation*}
    \forall x \in \supp(g_{\xi}) : \mathbb{E}\left[\varphi(\xi, v)
    \frac{1}{n g_{\xi}(\xi)} \vert \xi = x \right] < \infty,
  \]
  \clearpage
  it holds that
  \[
    \plimn \sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k)
\plimn \sum _ { k \in \kappa : \xi _ k < T} \varphi (\xi _ k, v_ k)
@ -690,7 +692,7 @@ provided in the appendix.
    \mathbb{E}\left[\varphi(\xi, v) \vert \xi = x \right] dx
  \]
  uniformly in \(T \in K\).
  \proof Notes on the proof are given in Proof~\ref{proof:lem9}.
\end{Lemma}
\begin{Lemma}
  For any $\lambda > 0$, $N \in \mathbb{N}$, training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
  \left\{1,\dots,N\right\}$, and subset $K \subset \mathbb{R}$ the spline approximating randomized
  shallow neural network $\mathcal{RN}_{\tilde{w}}$ converges to the
  cubic smoothing spline $f^{*,\lambda}_g$ in
  $\norm{.}_{W^{1,\infty}(K)}$ as the node count $n$ increases,
  i.e.,
  \begin{equation}
    \label{eq:s0}
    \plimn \norm{\mathcal{RN}_{\tilde{w}} - f^{*,\lambda}_g}_{W^{1,\infty}(K)} = 0.
  \end{equation}
  \proof
  To prove this we first show the convergence of the derivatives
  \[
    \plimn \norm{(\mathcal{RN}_{\tilde{w}})' - (f^{*,
\lambda } _ g)'} _ { L^ { \infty } } = 0.
\lambda } _ g)'} _ { L^ { \infty } } = 0.
\]
\]
This can be achieved by using Lemma~\ref { lem:cnvh} with $ \varphi ( \xi _ k,
This can be achieved by using Lemma~\ref { lem:cnvh} with $ \varphi ( \xi _ k,
v_ k) = \frac { v_ k^ 2} { \mathbb { E} [v^ 2|\xi = z]} (f^ { *, \lambda } _ w )''(\xi _ k) $
v_ k) = \frac { v_ k^ 2} { \mathbb { E} [v^ 2|\xi = z]} (f^ { *, \lambda } _ g )''(\xi _ k) $
thus obtaining
thus obtaining
\begin { align*}
\begin { align*}
\plimn \frac { \partial \mathcal { RN} _ { \tilde { w} } } { \partial x}
\plimn \frac { \partial \mathcal { RN} _ { \tilde { w} } } { \partial x} (x)
\stackrel { (\ref { eq:derivnn} )} { =}
\equals ^ { (\ref { eq:derivnn} )} _ { \phantom { \text { Lemma 3.1.4} } }
& \plimn \sum _ { \substack { k \in \mathbb { N} \\
% \stackrel { (\ref { eq:derivnn} )} { =}
\xi _ k < x} } \frac { v_ k^ 2} { \mathbb { E} [v^ 2 \vert \xi
&
= \xi _ k]} (f_ g^ { *, \lambda } )''(\xi _ k) h_ { k,n}
\plimn \sum _ { \substack { k \in \mathbb { N} \\
\stackrel { \text { Lemma} ~\ref { lem:cnvh} } { =} \\
\xi _ k < x} } \frac { v_ k^ 2} { \mathbb { E} [v^ 2 \vert \xi
\stackrel { \phantom { (\ref { eq:derivnn} )} } { =}
= \xi _ k]} (f_ g^ { *, \lambda } )''(\xi _ k) h_ { k,n} \\
&
\stackrel { \text { Lemma} ~\ref { lem:cnvh} } { =}
\int _ { \min \left \{ C_ { g_ { \xi } } ^ l,T\right \} } ^ { min\left \{ C_ { g_ { \xi } } ^ u,T\right \} }
% \stackrel { \phantom { (\ref { eq:derivnn} )} } { =}
&
\int _ { \max \left \{ C_ { g_ { \xi } } ^ l,x\right \} } ^ { \min \left \{ C_ { g_ { \xi } } ^ u,x\right \} }
\mathbb { E} \left [\frac{v^2}{\mathbb{E}[v^2|\xi = z] } (f^ { *,
\mathbb { E} \left [\frac{v^2}{\mathbb{E}[v^2|\xi = z] } (f^ { *,
\lambda } _ w)''(\xi ) \vert
\lambda } _ g)''(\xi ) \vert
\xi = x \right ] dx \equals ^ { \text { Tower-} } _ { \text { property} } \\
\xi = z \right ] dz\\
\stackrel { \phantom { (\ref { eq:derivnn} )} } { =}
\mathmakebox [\widthof{$\stackrel{\text{Lemma 3.14}}{=}$}] [c] { \equals ^ { \text { Tower-} } _ { \text { property} } }
&
% \stackrel { \phantom { (\ref { eq:derivnn} )} } { =}
\int _ { \min \left \{ C_ { g_ { \xi } } ^ l,
&
T\right \} } ^ { min\left \{ C_ { g_ { \xi } } ^ u,T\right \} } (f^ { *,\lambda } _ w)''(x)
\int _ { \max \left \{ C_ { g_ { \xi } } ^ l,
dx.
x\right \} } ^ { \min \left \{ C_ { g_ { \xi } } ^ u,x\right \} } (f^ { *,\lambda } _ g)''(z)
dz.
\end { align*}
\end { align*}
By the fundamental theorem of calculus and $ \supp ( f' ) \subset
With the fundamental theorem of calculus we get
\supp (f)$ , ( \ref { eq:s 0 } ) follows with Lemma~ \ref { lem:pieq } .
\[
\todo { ist die 0 wichtig?}
\plimn \mathcal { RN} _ { \tilde { w} } '(x) = f_ g^ { *,\lambda
'} (\min \left \{ C_ { g_ { \xi } } ^ u, x\right \} ) - f_ g^ { *,\lambda
'} (\max \left \{ C_ { g_ { \xi } } ^ l, x\right \} )
\]
As $ f _ g ^ { * , \lambda ' } $ is constant on $ \left [ C _ { g _ \xi } ^ l,
C_ { g_ \xi } ^ u\right ]^ C$ because $ \supp (f_ g^ { *,\lambda ''} ) \subseteq
\supp (g) \subseteq \supp (g_ \xi )$ we get
\[
\plimn \mathcal { RN} _ { \tilde { w} } '(x) = f_ g^ { *,\lambda
'} ,
\]
thus (\ref { eq:s0} ) follows with Lemma~\ref { lem:pieq} .
\qed
\qed
\label { lem:s0}
\label { lem:s0}
\end { Lemma}
\end { Lemma}
\begin{Lemma}
  For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
  \left\{1,\dots,N\right\}$, we have
  \[
    \plimn F^{\tilde{\lambda}}_n(\mathcal{RN}_{\tilde{w}}) =
    F^{\lambda, g}(f^{*, \lambda}_g) = 0.
  \]
  \proof Notes on the proof are given in Proof~\ref{proof:lem14}.
  \label{lem:s2}
\end{Lemma}
\begin{Lemma}
  For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
  \left\{1,\dots,N\right\}$, with $w^*$ as
  defined in Definition~\ref{def:rpnn} and $\tilde{\lambda}$ as
  defined in Theorem~\ref{theo:main1}, it holds
  \[
    \plimn \norm{\mathcal{RN}^{*,\tilde{\lambda}} -
    f^{w*, \tilde{\lambda}}}_{W^{1,\infty}(K)} = 0.
  \]
  \proof Notes on the proof are given in Proof~\ref{proof:lem15}.
  \label{lem:s3}
\end{Lemma}
\begin{Lemma}
  For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
  \left\{1,\dots,N\right\}$, with $w^*$ and $\tilde{\lambda}$ as
  defined in Definition~\ref{def:rpnn} and Theorem~\ref{theo:main1}
  respectively, it holds
  \[
    \plimn \abs{F_n^{\tilde{\lambda}}(\mathcal{RN}^{*,\tilde{\lambda}}) -
    F^{\lambda, g}(f^{w*, \tilde{\lambda}})} = 0.
  \]
  \proof Notes on the proof are given in Proof~\ref{proof:lem16}.
  \label{lem:s4}
\end{Lemma}
\begin{Lemma}
  For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
  \left\{1,\dots,N\right\}$, for any sequence of functions $f^n \in
  W^{2,2}$ with
  \[
    \plimn F^{\lambda, g}(f^n) = F^{\lambda, g}(f^{*, \lambda}_g),
  \]
  it holds
  \[
    \plimn \norm{f^n - f^{*, \lambda}} = 0.
  \]
  \proof Notes on the proof are given in Proof~\ref{proof:lem19}.
  \label{lem:s7}
\end{Lemma}
Using these lemmata we can now prove Theorem~\ref{theo:main1}. We
start by showing that the error measure of the smooth approximation of
the ridge penalized randomized shallow neural network, $F^{\lambda,
g}(f^{w^{*,\tilde{\lambda}}})$,
will converge in probability to the error measure of the adapted weighted cubic smoothing
spline $F^{\lambda, g}\left(f^{*,\lambda}\right)$ for the specified
parameters.
Using Lemma~\ref{lem:s4} we get that for every $P \in (0,1)$ and
$\varepsilon > 0$ there exists an $n_1 \in \mathbb{N}$ such that
\begin{equation}
  \mathbb{P}\left[F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) \in
  F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
  +[-\varepsilon, \varepsilon]\right] > P, \forall n \in
  \mathbb{N}_{> n_1}.
  \label{eq:squeeze_1}
\end{equation}
As $\mathcal{RN}^{*,\tilde{\lambda}}$ is the optimal network for
$F_n^{\tilde{\lambda}}$ we know that
\begin{equation}
  F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
  \leq F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right).
  \label{eq:squeeze_2}
\end{equation}
Using Lemma~\ref{lem:s2} we get that for every $P \in (0,1)$ and
$\varepsilon > 0$ an $n_2 \in \mathbb{N}$ exists such that
\begin{equation}
  \mathbb{P}\left[F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right)
  \in F^{\lambda, g}\left(f^{*,\lambda}_g\right)+[-\varepsilon,
  \varepsilon]\right] > P, \forall n \in \mathbb{N}_{> n_2}.
  \label{eq:squeeze_3}
\end{equation}
Combining (\ref{eq:squeeze_1}), (\ref{eq:squeeze_2}), and
(\ref{eq:squeeze_3}) we get that for every $P \in (0,1)$ and
every
$\varepsilon > 0$, with $n_3 \geq
\max\left\{n_1,n_2\right\}$,
\[
  \mathbb{P}\left[F^{\lambda,
  g}\left(f^{w^{*,\tilde{\lambda}}}\right) \in F^{\lambda,
  g}\left(f^{*,\lambda}_g\right)+2\varepsilon\right] > P, \forall
  n \in \mathbb{N}_{> n_3}.
\]
As $\supp(f^{w^{*,\tilde{\lambda}}}) \subseteq \supp(g_\xi)$ and $f^{*,\lambda}_g$ is optimal we know that
\[
  F^{\lambda, g}\left(f^{*,\lambda}_g\right) \leq F^{\lambda,
  g}\left(f^{w^{*,\tilde{\lambda}}}\right)
\]
and thus get with the squeeze theorem
\[
  \plimn F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) = F^{\lambda, g}\left(f^{*,\lambda}_g\right).
\]
With Lemma~\ref{lem:s7} it follows that
\begin{equation}
  \plimn \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g}
  _{W^{1,\infty}} = 0.
  \label{eq:main4}
\end{equation}
By using the triangle inequality with Lemma~\ref{lem:s3} and
(\ref{eq:main4}) we get
\begin{multline}
  \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f_g^{*,\lambda}} \\
  \leq \plimn \bigg(\norm{\mathcal{RN}^{*, \tilde{\lambda}} -
  f^{w^{*,\tilde{\lambda}}}}_{W^{1,\infty}}
  + \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g}
  _{W^{1,\infty}}\bigg) = 0
\end{multline}
and thus have proven Theorem~\ref{theo:main1}.
We now know that randomized shallow neural networks behave similarly to
spline regression if we regularize the size of the weights during
training.
\textcite{heiss2019} further explore a connection between ridge penalized
networks and randomized shallow neural networks trained using gradient
descent.
They infer that the effect of weight regularization
can be achieved by stopping the training of the randomized shallow
neural network early, with the number of iterations being proportional to
the tuning parameter penalizing the size of the weights.
They use this to further conclude that for a large number of training epochs and number of
neurons shallow neural networks trained with gradient descent are
very close to spline interpolations. Alternatively, if the training
is stopped early, they are close to adapted weighted cubic smoothing splines.
\newpage
\subsection{Simulations}
\label{sec:rsnn_sim}
In the following the behaviour described in Theorem~\ref{theo:main1}
is visualized in a simulated example. For this, two sets of training
data have been generated.
would equate to $g(x) = \frac{\mathbb{E}[v_k^2|\xi_k = x]}{10}$. In
order to utilize the
smoothing spline implemented in Matlab, $g$ has been simplified to $g
\equiv \frac{1}{10}$ instead.
For all figures $f_1^{*, \lambda}$ has
been calculated with Matlab's {\sffamily{smoothingspline}}, as this minimizes
\[
  \bar{\lambda} \sum_{i=1}^N(y_i^{train} - f(x_i^{train}))^2 + (1 -
  \bar{\lambda}) \int (f''(x))^2 dx.
\]
The smoothing parameter used for fitting is $\bar{\lambda} =
\frac{1}{1 + \lambda}$, since multiplying the objective above by
$(1 + \lambda)$ recovers the squared error loss plus $\lambda \int (f''(x))^2 dx$.
The parameter $\tilde{\lambda}$ for training
the networks is chosen as defined in Theorem~\ref{theo:main1}.
Each
network contains 10,000 hidden nodes and is trained on the full
training data for 100,000 epochs using
gradient descent. The
results are given in Figure~\ref{fig:rn_vs_rs}, where it can be seen
that the neural network and
smoothing spline are nearly identical, coinciding with the
proposition.
\input{Figures/RN_vs_RS}