
\section{Shallow Neural Networks}
\label{sec:shallownn}
% In order to get a some understanding of the behavior of neural
% networks we study a simplified class of networks called shallow neural
% networks in this chapter.
% We consider shallow neural networks consist of a single
% hidden layer and
To gain some understanding of the behavior of neural networks
we examine a simple class of networks in this chapter. We consider
networks that contain only a single hidden layer and a single output
node and call these networks shallow neural networks.
\begin{Definition}[Shallow neural network, Heiss, Teichmann, and
Wutte (2019, Definition 1.4)]
  For an input dimension $d$ and a Lipschitz continuous activation function $\sigma:
  \mathbb{R} \to \mathbb{R}$ we define a shallow neural network with
  $n$ hidden nodes as the function
  $\mathcal{NN}_\vartheta : \mathbb{R}^d \to \mathbb{R}$,
  \[
    \mathcal{NN}_\vartheta(x) \coloneqq \sum_{k=1}^n w_k \sigma\left(b_k +
    \sum_{j=1}^d v_{k,j} x_j\right) + c ~~ \forall x \in \mathbb{R}^d,
  \]
with
\begin{itemize}
\item weights $w_k \in \mathbb{R},~k \in \left\{1,\dots,n\right\}$
\item biases $b_k \in \mathbb{R},~k \in \left\{1, \dots,n\right\}$
\item weights $v_k \in \mathbb{R}^d,~k\in\left\{1,\dots,n\right\}$
\item bias $c \in \mathbb{R}$
\item these weights and biases collected in
\[
    \vartheta \coloneqq (w, b, v, c) \in \Theta \coloneqq
    \mathbb{R}^{n} \times \mathbb{R}^{n} \times \mathbb{R}^{n \times d} \times \mathbb{R}.
  \]
\end{itemize}
\end{Definition}
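The definition translates directly into code. The following NumPy
sketch (the function name \texttt{shallow\_nn} and the packing of the
parameters into a tuple are our own choices, not part of the
definition) evaluates $\mathcal{NN}_\vartheta$ at a single input $x$,
with the ReLU used later in this section as the default activation.
\begin{verbatim}
import numpy as np

def shallow_nn(x, w, b, v, c, sigma=lambda z: np.maximum(0.0, z)):
    """Evaluate NN_theta(x) = sum_k w_k sigma(b_k + <v_k, x>) + c.

    x: (d,) input, w: (n,) output weights, b: (n,) hidden biases,
    v: (n, d) hidden weights, c: scalar output bias.
    """
    pre_activation = b + v @ x    # b_k + sum_j v_{k,j} x_j for every k
    return float(w @ sigma(pre_activation) + c)

# Example with n = 3 hidden nodes and input dimension d = 2.
rng = np.random.default_rng(0)
theta = (rng.normal(size=3), rng.normal(size=3),
         rng.normal(size=(3, 2)), 0.0)
print(shallow_nn(np.array([0.5, -1.0]), *theta))
\end{verbatim}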
% \begin{figure}
% \begin{tikzpicture}[x=1.5cm, y=1.5cm]
% \tikzset{myptr/.style={decoration={markings,mark=at position 1 with %
% {\arrow[scale=1.5,>=stealth]{>}}},postaction={decorate}}}
% \foreach \m/\l [count=\y] in {1}
% \node [every neuron/.try, neuron \m/.try] (input-\m) at (0,0.5-\y) {};
% \foreach \m [count=\y] in {1,2,missing,3,4}
% \node [every neuron/.try, neuron \m/.try ] (hidden-\m) at (1.25,3.25-\y*1.25) {};
% \foreach \m [count=\y] in {1}
% \node [every neuron/.try, neuron \m/.try ] (output-\m) at (2.5,0.5-\y) {};
% \foreach \l [count=\i] in {1}
% \draw [myptr] (input-\i)+(-1,0) -- (input-\i)
% node [above, midway] {$x$};
% \foreach \l [count=\i] in {1,2,n-1,n}
% \node [above] at (hidden-\i.north) {$\mathcal{N}_{\l}$};
% \foreach \l [count=\i] in {1,n_l}
% \node [above] at (output-\i.north) {};
% \foreach \l [count=\i] in {1}
% \draw [myptr, >=stealth] (output-\i) -- ++(1,0)
% node [above, midway] {$y$};
% \foreach \i in {1}
% \foreach \j in {1,2,...,3,4}
% \draw [myptr, >=stealth] (input-\i) -- (hidden-\j);
% \foreach \i in {1,2,...,3,4}
% \foreach \j in {1}
% \draw [myptr, >=stealth] (hidden-\i) -- (output-\j);
% \node [align=center, above] at (0,1) {Input \\layer};
% \node [align=center, above] at (1.25,3) {Hidden layer};
% \node [align=center, above] at (2.5,1) {Output \\layer};
% \end{tikzpicture}
% \caption{Shallow Neural Network with input- and output-dimension of \(d
% = 1\)}
% \label{fig:shallowNN}
% \end{figure}
As a neural network with a large number of nodes has a large number of
tunable parameters, it can often fit data quite well. If the ReLU activation function
\[
  \sigma(x) \coloneqq \max{(0, x)}
\]
is chosen, one can easily prove that a shallow network trained on MSE
will fit the data perfectly whenever the number of hidden nodes exceeds the
number of data points in the training data.
\begin{Theorem}[Shallow neural network can fit data perfectly]
  For training data of size $t$,
\[
\left(x_i^{\text{train}}, y_i^{\text{train}}\right) \in \mathbb{R}^d
\times \mathbb{R},~i\in\left\{1,\dots,t\right\}
\]
a shallow neural network $\mathcal{NN}_\vartheta$ with $n \geq t$
hidden nodes will perfectly fit the data when
minimizing squared error loss.
\proof
  W.l.o.g. assume all values $x_{ij}^{\text{train}} \in [0,1],~\forall i \in
  \left\{1,\dots, t\right\}, j \in \left\{1,\dots,d\right\}$. Now we
  choose $v^*$ such that the scalar product with $x_i^{\text{train}}$
  yields distinct values for all $i \in \left\{1,\dots,t\right\}$:
\[
v^*_{k,j} = v^*_{j} = 10^{j-1}, ~ \forall k \in \left\{1,\dots,n\right\}.
\]
Assuming $x_i^{\text{train}} \neq x_j^{\text{train}},~\forall i\neq
j$ we get
\[
\left(v_k^*\right)^{\mathrm{T}} x_i^{\text{train}} \neq
\left(v_k^*\right)^{\mathrm{T}} x_j^{\text{train}}, ~ \forall i
\neq j.
\]
  W.l.o.g. assume the $x_i^{\text{train}}$ are ordered such that
  $\left(v_k^*\right)^{\mathrm{T}} x_i^{\text{train}} <
  \left(v_k^*\right)^{\mathrm{T}} x_j^{\text{train}}, ~\forall i<j$.
  Then we can choose $b^*_k$ such that neuron $k$ is active only for the
  $x_i^{\text{train}}$ with $i \geq k$:
\begin{align*}
b^*_1 &> -\left(v^*\right)^{\mathrm{T}} x_1^{\text{train}},\\
b^*_k &= -\left(v^*\right)^{\mathrm{T}}
x_{k-1}^{\text{train}},~\forall k \in \left\{2, \dots,
t\right\}, \\
b_k^* &\leq -\left(v^*\right)^{\mathrm{T}}
x_{t}^{\text{train}},~\forall k > t.
\end{align*}
  With
  \begin{align*}
    w_k^* &= \frac{y_k^{\text{train}} - \sum_{j =1}^{k-1} w^*_j\left(b^*_j +
            \left(v^*\right)^{\mathrm{T}} x_k^{\text{train}}\right)}{b_k^* + \left(v^*\right)^{\mathrm{T}}
            x_k^{\text{train}}},~\forall k \in \left\{1,\dots,t\right\},\\
    w_k^* &\in \mathbb{R} \text{ arbitrary, } \forall k > t,
  \end{align*}
and $\vartheta^* = (w^*, b^*, v^*, c = 0)$ we get
\[
    \mathcal{NN}_{\vartheta^*} (x_i^{\text{train}}) = \sum_{k =
      1}^{i-1} w_k^*\left(b_k^* + \left(v^*\right)^{\mathrm{T}}
      x_i^{\text{train}}\right) + w_i^*\left(b_i^* +\left(v^*\right)^{\mathrm{T}}
      x_i^{\text{train}}\right) = y_i^{\text{train}}.
\]
  As the squared error of $\mathcal{NN}_{\vartheta^*}$ is zero, all
  shallow networks with at least $t$ hidden
  nodes that minimize squared error loss will fit the data perfectly. \qed
\label{theo:overfit}
\end{Theorem}
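The construction used in the proof can be carried out numerically. The
following sketch (function names are ours and the data are purely
illustrative) builds $v^*$, $b^*$ and $w^*$ as above for one-dimensional
inputs in $[0,1]$ and checks that the resulting network reproduces the
training responses exactly.
\begin{verbatim}
import numpy as np

def fit_exactly(x_train, y_train):
    """Construct (w, b, v, c) as in the proof above: one hidden ReLU node
    per training point, zero squared error.  Assumes entries of x_train
    lie in [0, 1] and the projections <v*, x_i> are pairwise distinct."""
    t, d = x_train.shape
    v = 10.0 ** np.arange(d)           # v*_j = 10^(j-1), shared by all nodes
    z = x_train @ v                    # projections <v*, x_i^train>
    order = np.argsort(z)              # reorder so the projections increase
    z, y = z[order], y_train[order]

    b = np.empty(t)
    b[0] = -z[0] + 1.0                 # node 1 is active for every point
    b[1:] = -z[:-1]                    # node k becomes active just after x_{k-1}

    w = np.empty(t)
    for k in range(t):                 # triangular system, solved recursively
        w[k] = (y[k] - np.sum(w[:k] * (b[:k] + z[k]))) / (b[k] + z[k])
    return w, b, v, 0.0

def predict(x, w, b, v, c):
    z = x @ v
    return np.maximum(0.0, b[None, :] + z[:, None]) @ w + c

rng = np.random.default_rng(1)
x_train = rng.uniform(size=(6, 1))
y_train = np.sin(2 * np.pi * x_train[:, 0]) + rng.normal(scale=0.4, size=6)
w, b, v, c = fit_exactly(x_train, y_train)
print(np.allclose(predict(x_train, w, b, v, c), y_train))   # True
\end{verbatim}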
However, this behavior is often not desired, as overfitted models tend to
have poor generalization properties, especially if noise is present in
the data. This effect is illustrated in
Figure~\ref{fig:overfit}.
Here a shallow neural network is
constructed according to the proof of Theorem~\ref{theo:overfit} to
fit some data perfectly and is
compared to a cubic smoothing spline
(Definition~\ref{def:wrs}). While the neural network
fits the data better than the spline, the spline represents the
underlying mechanism that was used to generate the data more accurately. The better
generalization of the spline compared to the network is further
demonstrated by the better performance on newly generated
test data.
In order to improve the accuracy of the model we want to reduce
overfitting. A possible way to achieve this is by explicitly
regularizing the network through the cost function, as done with
ridge penalized networks
(Definition~\ref{def:rpnn}), where large weights $w$ are penalized. In
Theorem~\ref{theo:main1} we will
prove that this results in the shallow neural network converging to
a form of spline as the number of nodes in the hidden layer is
increased.
\vfill
\begin{figure}[h]
\pgfplotsset{
compat=1.11,
legend image code/.code={
\draw[mark repeat=2,mark phase=2]
plot coordinates {
(0cm,0cm)
(0.15cm,0cm) %% default is (0.3cm,0cm)
(0.3cm,0cm) %% default is (0.6cm,0cm)
};%
}
}
\begin{tikzpicture}
\begin{axis}[tick style = {draw = none}, width = \textwidth,
height = 0.6\textwidth]
\addplot table
[x=x, y=y, col sep=comma, only marks,mark options={scale =
0.7}] {Figures/Data/overfit.csv};
      \addplot [red, line width=0.8pt] table [x=x_n, y=s_n, col
      sep=comma] {Figures/Data/overfit.csv};
\addplot [black, line width=0.8pt] table [x=x_n, y=y_n, col
sep=comma] {Figures/Data/overfit.csv};
\addplot [black, line width=0.8pt, dashed] table [x=x, y=y, col
sep=comma] {Figures/Data/overfit_spline.csv};
\addlegendentry{\footnotesize{Data}};
\addlegendentry{\footnotesize{Truth}};
\addlegendentry{\footnotesize{$\mathcal{NN}_{\vartheta^*}$}};
\addlegendentry{\footnotesize{Spline}};
\end{axis}
\end{tikzpicture}
  \caption[Overfitting of Shallow Neural Networks]{Data of the form $y=\sin(\frac{x+\pi}{2 \pi}) +
    \varepsilon,~ \varepsilon \sim \mathcal{N}(0,0.4)$
    (\textcolor{blue}{blue}), the neural network constructed
    according to the proof of Theorem~\ref{theo:overfit} (black), and the
    underlying signal (\textcolor{red}{red}). While the network
    reproduces the training data without error, the cubic smoothing
    spline (black, dashed) captures the underlying signal much
    better. For a test set of size 20 with uniformly distributed $x$
    values and responses generated in the same fashion as the training data, the MSE of the neural network is
    0.30, while the MSE of the spline is only 0.14; the spline thus generalizes
    much better.
  }
\label{fig:overfit}
\end{figure}
\vfill
\clearpage
\subsection{Convergence Behavior of One-Dimensional Randomized Shallow
Neural Networks}
\label{sec:conv}
This section is based on \textcite{heiss2019}.
In this section, we examine the convergence behavior of certain shallow
neural networks.
We consider shallow neural networks with a one-dimensional input where the parameters in the
hidden layer are randomized, resulting in only the weights of the
output layer being trainable.
Additionally, we assume all neurons use the ReLU as activation function
and call such networks randomized shallow neural networks.
% We will analyze the
% connection between randomized shallow
% Neural Networks with one dimensional input with a ReLU as activation
% function for all neurons and cubic smoothing splines.
% % \[
% % \sigma(x) = \max\left\{0,x\right\}.
% % \]
% We will see that the punishment of the size of the weights in training
% the randomized shallow
% Neural Network will result in a learned function that minimizes the second
% derivative as the amount of hidden nodes is grown to infinity. In order
% to properly formulate this relation we will first need to introduce
% some definitions, all neural networks introduced in the following will
% use a ReLU as activation at all neurons.
% A randomized shallow network is characterized by only the weight
% parameter of the output layer being trainable, whereas the other
% parameters are random numbers.
\begin{Definition}[Randomized shallow neural network, Heiss, Teichmann, and
Wutte (2019, Definition 2.1)]
  For an input dimension $d$, let $n \in \mathbb{N}$ be the number of
  hidden nodes and $v(\omega) \in \mathbb{R}^{n \times d}, b(\omega)
  \in \mathbb{R}^n$ randomly drawn weights and biases. Then for a weight vector
  $w \in \mathbb{R}^n$ the corresponding randomized shallow neural network is given by
  \[
    \mathcal{RN}_{w, \omega} (x) = \sum_{k=1}^n w_k
    \sigma\left(b_k(\omega) + \sum_{j=1}^d v_{k, j}(\omega) x_j\right).
  \]
\label{def:rsnn}
\end{Definition}
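A randomized shallow neural network separates cleanly into a fixed
random feature map and a trainable weight vector $w$. The sketch below
(class and method names are ours; the standard normal distributions for
$v$ and $b$ are only placeholders, the distributions relevant for the
theory are discussed below) keeps the hidden parameters frozen after
construction.
\begin{verbatim}
import numpy as np

class RandomizedShallowNet:
    """RN_{w, omega}(x) = sum_k w_k * relu(b_k(omega) + <v_k(omega), x>).

    Hidden parameters are drawn once and then frozen; only w is trained."""

    def __init__(self, n, d=1, rng=None):
        rng = np.random.default_rng(rng)
        self.v = rng.normal(size=(n, d))  # placeholder distribution for v(omega)
        self.b = rng.normal(size=n)       # placeholder distribution for b(omega)

    def features(self, X):
        """Hidden-layer activations; X has shape (N, d), result (N, n)."""
        return np.maximum(0.0, self.b + X @ self.v.T)

    def __call__(self, X, w):
        return self.features(X) @ w
\end{verbatim}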
% We call a one dimensional randomized shallow neural network were the
% are penalized in the loss
% function ridge penalized neural networks.
We will prove that if we penalize the size of the trainable weights
when fitting a randomized shallow neural network, it
converges, as the number of nodes increases, to a function that trades
off closeness to the training data against the size of its second derivative.
We call a network that is fitted according to MSE plus a penalty term on
the $L^2$ norm of the trainable weights $w$ a ridge penalized neural network.
% $\lam$
% We call a randomized shallow neural network trained on MSE and
% punished for the amount of the weights $w$ according to a
% ... $\lambda$ ridge penalized neural networks.
% We call a randomized shallow neural network where the size of the trainable
% weights is punished in the error function a ridge penalized
% neural network. For a tuning parameter $\tilde{\lambda}$ .. the extent
% of penalization we get:
\begin{Definition}[Ridge penalized Neural Network, Heiss, Teichmann, and
Wutte (2019, Definition 3.2)]
\label{def:rpnn}
  Let $\mathcal{RN}_{w, \omega}$ be a randomized shallow neural
  network as introduced in Definition~\ref{def:rsnn} and let
  $\tilde{\lambda} \in \mathbb{R}_{>0}$ be a tuning parameter. Then the
  optimal ridge penalized
  network is given by
\[
\mathcal{RN}^{*, \tilde{\lambda}}_{\omega}(x) \coloneqq
\mathcal{RN}_{w^{*, \tilde{\lambda}}(\omega), \omega}
\]
  with
\[
w^{*,\tilde{\lambda}}(\omega) :\in \argmin_{w \in
\mathbb{R}^n} \underbrace{ \left\{\overbrace{\sum_{i = 1}^N \left(\mathcal{RN}_{w,
\omega}(x_i^{\text{train}}) -
y_i^{\text{train}}\right)^2}^{L(\mathcal{RN}_{w, \omega})} +
\tilde{\lambda} \norm{w}_2^2\right\}}_{\eqqcolon F_n^{\tilde{\lambda}}(\mathcal{RN}_{w,\omega})}.
\]
\end{Definition}
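Since only the output weights $w$ enter $\mathcal{RN}_{w,\omega}$
linearly, $F_n^{\tilde{\lambda}}$ is an ordinary ridge regression
problem in the matrix of hidden activations, so $w^{*,\tilde{\lambda}}$
can be computed in closed form. A minimal sketch (function name ours),
reusing the hypothetical \texttt{features} method from the sketch above:
\begin{verbatim}
import numpy as np

def ridge_output_weights(Phi, y, lam_tilde):
    """Minimize ||Phi w - y||^2 + lam_tilde * ||w||^2 over w in closed form.

    Phi: (N, n) hidden activations relu(b_k + v_k x_i), y: (N,) targets."""
    n = Phi.shape[1]
    # Regularized normal equations; solve rather than invert explicitly.
    return np.linalg.solve(Phi.T @ Phi + lam_tilde * np.eye(n), Phi.T @ y)

# Usage with the RandomizedShallowNet sketch from above:
# w_star = ridge_output_weights(net.features(X_train), y_train, lam_tilde)
\end{verbatim}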
If the number of hidden nodes $n$ is larger than the number of
training samples $N$, then for
$\tilde{\lambda} \to 0$ the network will interpolate the data while
having minimal weights, resulting in the \textit{minimum norm
  network} $\mathcal{RN}_{w^{\text{min}}, \omega}$ with weights
\[
  w^{\text{min}} \in \argmin_{w \in \mathbb{R}^n} \norm{w}, \text{
    s.t. }
  \mathcal{RN}_{w,\omega}(x_i^{\text{train}}) = y_i^{\text{train}}, \, \forall i \in
  \left\{1,\dots,N\right\}.
\]
For $\tilde{\lambda} \to \infty$ the learned
function will resemble the data less and less and, with the weights
approaching $0$, will converge to the constant $0$ function.
To make the notation more convenient, the
$\omega$ used to denote the realized random parameters will no longer
be mentioned explicitly in the following.
We call a function that minimizes the squared distance between the
training points and the function, penalized by the integral of the
squared second derivative of the function, a cubic smoothing spline.
\begin{Definition}[Cubic Smoothing Spline]
  Let $x_i^{\text{train}}, y_i^{\text{train}} \in \mathbb{R}, i \in
  \left\{1,\dots,N\right\}$ be training data. For a given $\lambda \in
  \mathbb{R}_{>0}$ the cubic smoothing spline is given by
  \[
    f^{*,\lambda} :\in \argmin_{f \in
      \mathcal{C}^2}\left\{\sum_{i=1}^N
      \left(f\left(x_i^{\text{train}}\right) -
        y_i^{\text{train}}\right)^2 + \lambda \int f''(x)^2 \, dx\right\}.
  \]
\end{Definition}
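For experimentation, recent versions of SciPy provide
\texttt{scipy.interpolate.make\_smoothing\_spline}, which, to our
understanding, minimizes a (weighted) version of exactly this objective
for a given \texttt{lam}; the data below are purely illustrative.
\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-np.pi, np.pi, size=30))   # x must be increasing
y = np.sin(x) + rng.normal(scale=0.4, size=x.size)

lam = 0.1                          # larger lam -> smoother, flatter fit
spline = make_smoothing_spline(x, y, lam=lam)

grid = np.linspace(-np.pi, np.pi, 200)
print(np.max(np.abs(spline(grid) - np.sin(grid))))   # rough quality check
\end{verbatim}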
We will show that for specific hyperparameters the ridge penalized
shallow neural networks converge to a slightly modified variant of the
cubic smoothing spline. To ensure convergence, we need to incorporate
the densities of the random parameters into the loss function of the
spline. Thus we define
the adapted weighted cubic smoothing spline, where the loss for the second
derivative is weighted by a function $g$ and the support of the second
derivative of $f$ has to be a subset of the support of $g$. The formal
definition is given in Definition~\ref{def:wrs}.
% We will later ... the converging .. of the ridge penalized shallow
% neural network, in order to do so we will need a slightly modified
% version of the regression
% spline that allows for weighting the penalty term for the second
% derivative with a weight function $g$. This is needed to ...the
% distributions of the random parameters ... We call this the adapted
% weighted cubic smoothing spline.
% Now we take a look at weighted cubic smoothing splines. Later we will prove
% that the ridge penalized neural network as defined in
% Definition~\ref{def:rpnn} converges a weighted cubic smoothing spline, as
% the amount of hidden nodes is grown to inifity.
\begin{Definition}[Adapted weighted cubic smoothing spline, Heiss, Teichmann, and
Wutte (2019, Definition 3.5)]
\label{def:wrs}
  Let $x_i^{\text{train}}, y_i^{\text{train}} \in \mathbb{R}, i \in
  \left\{1,\dots,N\right\}$ be training data. For a given $\lambda \in \mathbb{R}_{>0}$
  and a function $g: \mathbb{R} \to \mathbb{R}_{>0}$ the adapted weighted
  cubic smoothing spline $f^{*, \lambda}_g$ is given by
\[
f^{*, \lambda}_g :\in \argmin_{\substack{f \in \mathcal{C}^2(\mathbb{R})
\\ \supp(f'') \subseteq \supp(g)}} \underbrace{\left\{ \overbrace{\sum_{i =
1}^N \left(f(x_i^{\text{train}}) - y_i^{\text{train}}\right)^2}^{L(f)} +
\lambda g(0) \int_{\supp(g)}\frac{\left(f''(x)\right)^2}{g(x)}
dx\right\}}_{\eqqcolon F^{\lambda, g}(f)}.
\]
  % \todo{Requirement on the derivative of f, or not after all?}
\end{Definition}
Similarly to ridge weight penalized neural networks, the parameter
$\lambda$ controls a trade-off between accuracy on the training data
and smoothness, i.e. a small second derivative. For $g \equiv 1$ and $\lambda \to 0$ the
resulting function $f^{*, 0+}$ will interpolate the training data while minimizing
the second derivative. Such a function is known as a cubic spline
interpolation,
\vspace{-0.2cm}
\[
  f^{*, 0+} \coloneqq \lim_{\lambda \to 0+} f^{*, \lambda}_1 \in
  \argmin_{\substack{f \in \mathcal{C}^2(\mathbb{R}), \\ f(x_i^{\text{train}}) =
      y_i^{\text{train}}}} \left( \int _{\mathbb{R}} (f''(x))^2dx\right).
\]
For $\lambda \to \infty$, on the other hand, $f_g^{*,\lambda}$ converges
to the linear regression fit of the data.
We use two intermediary functions in order to show the convergence of
the ridge penalized shallow neural network to adapted cubic smoothing splines.
% In order to show that ridge penalized shallow neural networks converge
% to adapted cubic smoothing splines for a growing amount of hidden nodes we
% define two intermediary functions.
One is a smooth approximation of a
neural network and the other is a randomized shallow neural network designed
to approximate a spline.
In order to properly construct these functions, we need to take into
consideration the points where the slope of the learned
function changes, i.e. the points where its derivative is discontinuous.
As we use the ReLU activation, the function learned by the
network has such points wherever a neuron in the hidden
layer becomes active and its output is no longer zero. We formalize these points
as kinks in Definition~\ref{def:kink}.
\begin{Definition}
\label{def:kink}
  Let $\mathcal{RN}_w$ be a randomized shallow neural
  network according to Definition~\ref{def:rsnn},
  \[
    \mathcal{RN}_w(x) = \sum_{k = 1}^n w_k \sigma(b_k + v_kx).
  \]
  Then kinks depending on the random parameters can be observed.
  Because we specified $\sigma(y) \coloneqq \max\left\{0, y\right\}$, a
  kink of $\sigma$ occurs at $\sigma(0) = 0$. As $b_k + v_kx = 0$ for $x
  = -\frac{b_k}{v_k}$ we define the following:
\begin{enumerate}[label=(\alph*)]
  \item Let $\xi_k \coloneqq -\frac{b_k}{v_k}$ be the $k$-th kink of $\mathcal{RN}_w$.
  \item Let $g_{\xi}(\xi_k)$ be the density of the kinks $\xi_k =
    - \frac{b_k}{v_k}$ in accordance with the distributions of $b_k$ and
    $v_k$, with $\supp(g_\xi) = \left[C_{g_\xi}^l, C_{g_\xi}^u\right]$.
\item Let $h_{k,n} \coloneqq \frac{1}{n g_{\xi}(\xi_k)}$ be the
average estimated distance from kink $\xi_k$ to the next nearest
one.
\end{enumerate}
\end{Definition}
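A small sketch of these quantities (function name ours): the kink
locations follow directly from the hidden parameters, and $h_{k,n}$ can
be compared against the empirical mean gap between neighboring kinks.
\begin{verbatim}
import numpy as np

def kinks_and_spacing(v, b, g_xi):
    """xi_k = -b_k / v_k and h_{k,n} = 1 / (n g_xi(xi_k)) as in the
    definition above; g_xi is the kink density."""
    xi = -b / v
    h = 1.0 / (xi.size * g_xi(xi))
    return xi, h

# Check for uniformly distributed kinks on [-5, 5], i.e. g_xi = 1/10:
rng = np.random.default_rng(4)
n = 1000
xi_true = rng.uniform(-5, 5, size=n)
v = rng.normal(size=n)
b = -xi_true * v
xi, h = kinks_and_spacing(v, b, lambda t: np.full_like(t, 0.1))
print(np.mean(np.diff(np.sort(xi))), np.mean(h))   # both roughly 10 / n
\end{verbatim}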
Using the density of the kinks we construct a kernel and smooth the
network by applying the kernel similarly to a convolution.
\begin{Definition}[Smooth Approximation of Randomized Shallow Neural
Network]
\label{def:srsnn}
  Let $\mathcal{RN}_{w}$ be a randomized shallow neural network according to
  Definition~\ref{def:rsnn} with weights $w$ and kinks $\xi_k$ with
  corresponding kink density $g_{\xi}$ as given by
  Definition~\ref{def:kink}.
  In order to smooth the network consider the following kernel for every $x$:
\begin{align*}
\kappa_x(s) &\coloneqq \mathds{1}_{\left\{\abs{s} \leq \frac{1}{2 \sqrt{n}
g_{\xi}(x)}\right\}}(s)\sqrt{n} g_{\xi}(x), \, \forall s \in \mathbb{R}\\
\intertext{Using this kernel we define a smooth approximation of
$\mathcal{RN}_w$ by}
    f^w(x) &\coloneqq \int_{\mathbb{R}} \mathcal{RN}_w(x-s)
             \kappa_x(s) ds.
\end{align*}
\end{Definition}
Note that the kernel introduced in Definition~\ref{def:srsnn}
satisfies $\int_{\mathbb{R}}\kappa_x(s) \, ds = 1$. While $f^w$ looks
similar to a convolution, it differs slightly as the kernel $\kappa_x(s)$
depends on $x$. Therefore only $f^w(x) = (\mathcal{RN}_w *
\kappa_x)(x)$ is well defined, while $\mathcal{RN}_w * \kappa$ is not.
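Numerically, $f^w(x)$ is simply the average of $\mathcal{RN}_w$ over a
window of half-width $\frac{1}{2\sqrt{n}\, g_\xi(x)}$ centered at $x$. A
minimal sketch (names and the equidistant quadrature grid are our
choices; \texttt{rn} is assumed to be a vectorized evaluation of the
network):
\begin{verbatim}
import numpy as np

def smooth_approximation(rn, x, g_xi, n, n_grid=101):
    """f^w(x): average of RN_w over [x - delta, x + delta] with
    delta = 1 / (2 sqrt(n) g_xi(x)), approximated on an equidistant grid."""
    delta = 1.0 / (2.0 * np.sqrt(n) * g_xi(x))
    s = np.linspace(x - delta, x + delta, n_grid)
    return np.mean(rn(s))
\end{verbatim}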
We use $f^{w^{*,\tilde{\lambda}}}$ to denote the smooth
approximation of the ridge penalized network
$\mathcal{RN}^{*,\tilde{\lambda}}$.
Next, we construct a randomized shallow neural network that
is designed to be close to a spline, independently of the realization of the random
parameters, by approximating the spline's curvature between the
kinks.
\begin{Definition}[Spline approximating Randomized Shallow Neural
Network]
\label{def:sann}
  Let $\mathcal{RN}$ be a randomized shallow neural network according
  to Definition~\ref{def:rsnn} and $f^{*, \lambda}_g$ be the adapted weighted
  cubic smoothing spline as introduced in Definition~\ref{def:wrs}. Then
the randomized shallow neural network approximating $f^{*,
\lambda}_g$ is given by
\[
\mathcal{RN}_{\tilde{w}}(x) = \sum_{k = 1}^n \tilde{w}_k \sigma(b_k + v_k x),
\]
with the weights $\tilde{w}_k$ defined as
\[
\tilde{w}_k \coloneqq \frac{h_{k,n} v_k}{\mathbb{E}[v^2 \vert \xi
= \xi_k]} \left(f_g^{*, \lambda}\right)''(\xi_k).
\]
\end{Definition}
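A sketch of this construction (names are ours; the kink density
$g_\xi$, the conditional second moment $\mathbb{E}[v^2 \mid \xi = \cdot\,]$
and the second derivative of the target spline are assumed to be
available as functions):
\begin{verbatim}
import numpy as np

def spline_approximating_weights(v, b, g_xi, cond_second_moment, spline_dd):
    """w~_k = h_{k,n} v_k / E[v^2 | xi = xi_k] * f''(xi_k) as in the
    definition above."""
    xi = -b / v                           # kink locations xi_k
    h = 1.0 / (v.size * g_xi(xi))         # spacing h_{k,n}
    return h * v / cond_second_moment(xi) * spline_dd(xi)
\end{verbatim}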
The approximating nature of the network in
Definition~\ref{def:sann} can be seen by examining its first
derivative, which is given by
\begin{align}
  \frac{\partial \mathcal{RN}_{\tilde{w}}}{\partial x}
  \Big{|}_{x} &= \sum_{k=1}^n \tilde{w}_k v_k \mathds{1}_{\left\{b_k + v_k x >
                0\right\}} = \sum_{\substack{k \in \mathbb{N} \\ \xi_k <
                x}} \tilde{w}_k v_k \nonumber \\
&= \frac{1}{n} \sum_{\substack{k \in \mathbb{N} \\
\xi_k < x}} \frac{v_k^2}{g_{\xi}(\xi_k) \mathbb{E}[v^2 \vert \xi
= \xi_k]} \left(f_g^{*, \lambda}\right)''(\xi_k). \label{eq:derivnn}
\end{align}
As the expression (\ref{eq:derivnn}) behaves similarly to a
Riemann sum, for $n \to \infty$ it converges in probability to the
first derivative of $f^{*,\lambda}_g$. A formal proof of this behavior
is given in Lemma~\ref{lem:s0}.
In order to ensure the functions used in the proof of the convergence
are well defined we need to make some assumptions about properties of the random
parameters and their densities.
% In order to formulate the theorem describing the convergence of $RN_w$
% we need to make a couple of assumptions.
% \todo{Better phrasing}
\begin{Assumption}~
\label{ass:theo38}
\begin{enumerate}[label=(\alph*)]
  \item The probability density function of the kinks $\xi_k$,
    namely $g_{\xi}$ as defined in Definition~\ref{def:kink}, exists
    and is well defined.
  \item The density function $g_\xi$
    has compact support $\supp(g_{\xi})$.
\item The density function $g_{\xi}$ is uniformly continuous on $\supp(g_{\xi})$.
\item $g_{\xi}(0) \neq 0$.
\item $\frac{1}{g_{\xi}}\Big|_{\supp(g_{\xi})}$ is uniformly
continuous on $\supp(g_{\xi})$.
\item The conditional distribution $\mathcal{L}(v_k|\xi_k = x)$
is uniformly continuous on $\supp(g_{\xi})$.
\item $\mathbb{E}\left[v_k^2\right] < \infty$.
\end{enumerate}
\end{Assumption}
As we will prove convergence in a Sobolev space, we
introduce it and the corresponding norm here.
\begin{Definition}[Sobolev Space]
  For $K \subset \mathbb{R}^n$ open and $1 \leq p \leq \infty$ we
  define the Sobolev space $W^{k,p}(K)$ as the space containing all
  real-valued functions $u \in L^p(K)$ such that for every multi-index
  $\alpha \in \mathbb{N}^n$ with $\abs{\alpha} \leq
  k$ the mixed partial derivatives
  \[
    u^{(\alpha)} = \frac{\partial^{\abs{\alpha}} u}{\partial
      x_1^{\alpha_1} \dots \partial x_n^{\alpha_n}}
  \]
  exist in the weak sense and
\[
\norm{u^{(\alpha)}}_{L^p} < \infty.
\]
\label{def:sobonorm}
  The natural norm of the Sobolev space is given by
  \[
    \norm{f}_{W^{k,p}(K)} =
    \begin{cases}
      \left(\sum_{\abs{\alpha} \leq k}
        \norm{f^{(\alpha)}}^p_{L^p}\right)^{\nicefrac{1}{p}},&
      \text{for } 1 \leq p < \infty, \\
      \max_{\abs{\alpha} \leq k}\left\{\norm{f^{(\alpha)}}_{L^{\infty}}\right\},& \text{for
      } p = \infty.
    \end{cases}
  \]
\end{Definition}
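When comparing functions numerically, as in the simulations below, the
$W^{1,\infty}$ distance used in the following can be estimated crudely
on a grid (a rough sketch with our own names; the derivative is
replaced by finite differences):
\begin{verbatim}
import numpy as np

def w1_inf_distance(f, g, a, b, n_grid=1000):
    """Rough estimate of ||f - g||_{W^{1,infty}([a, b])}: the larger of the
    sup-norm of the difference and of its finite-difference derivative."""
    x = np.linspace(a, b, n_grid)
    diff = f(x) - g(x)
    d_diff = np.gradient(diff, x)       # approximates (f - g)'
    return max(np.max(np.abs(diff)), np.max(np.abs(d_diff)))
\end{verbatim}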
With the important definitions and assumptions in place, we can now
formulate the main theorem.
% ... the convergence of ridge penalized
% random neural networks to adapted cubic smoothing splines when the
% parameters are chosen accordingly.
\begin{Theorem}[Ridge Weight Penalty Corresponds to Weighted Cubic
Smoothing Spline]
\label{theo:main1}
For $N \in \mathbb{N}$, arbitrary training data
$\left(x_i^{\text{train}}, y_i^{\text{train}}
\right)~\in~\mathbb{R}^2$, with $i \in \left\{1,\dots,N\right\}$,
  and $\mathcal{RN}^{*, \tilde{\lambda}}, f_g^{*, \lambda}$
  according to Definition~\ref{def:rpnn} and Definition~\ref{def:wrs}
  respectively, under Assumption~\ref{ass:theo38} it holds that
  \begin{equation}
    \label{eq:main1}
    \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f^{*,
        \lambda}_{g}}_{W^{1,\infty}(K)} = 0,
  \end{equation}
  with
\begin{align*}
g(x) & \coloneqq g_{\xi}(x)\mathbb{E}\left[ v_k^2 \vert \xi_k = x
\right], \forall x \in \mathbb{R}, \\
\tilde{\lambda} & \coloneqq \lambda n g(0).
\end{align*}
\end{Theorem}
As mentioned above, we will prove Theorem~\ref{theo:main1} utilizing the
intermediary functions. We show that
\begin{equation}
\label{eq:main2}
\plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f^{w^*}}_{W^{1,
\infty}(K)} = 0
\end{equation}
and
\begin{equation}
\label{eq:main3}
\plimn \norm{f^{w^*} - f_g^{*, \lambda}}_{W^{1,\infty}(K)} = 0
\end{equation}
and then obtain (\ref{eq:main1}) using the triangle inequality. In
order to prove (\ref{eq:main2}) and (\ref{eq:main3}) we need to
introduce a number of auxiliary lemmata, proofs of which are
given in \textcite{heiss2019} and Appendix~\ref{appendix:proofs}.
\begin{Lemma}[Poincar\'e Type Inequality]
\label{lem:pieq}
  Let \(f:\mathbb{R} \to \mathbb{R}\) be differentiable with \(f' :
  \mathbb{R} \to \mathbb{R}\) Lebesgue integrable. Then for \(K=[a,b]
  \subset \mathbb{R}\) with \(f(a)=0\) it holds that
\begin{equation*}
\label{eq:pti1}
\exists C_K^{\infty} \in \mathbb{R}_{>0} :
    \norm{f}_{W^{1,\infty}(K)} \leq C_K^{\infty}
\norm{f'}_{L^{\infty}(K)}.
\end{equation*}
If additionally \(f'\) is differentiable with \(f'': \mathbb{R} \to
\mathbb{R}\) Lebesgue integrable then
\begin{equation*}
\label{eq:pti2}
\exists C_K^2 \in \mathbb{R}_{>0} : \norm{f}_{W^{1,\infty}(K)} \leq
C_K^2 \norm{f''}_{L^2(K)}.
\end{equation*}
% \proof The proof is given in the appendix...
% With the fundamental theorem of calculus, if
% \(\norm{f}_{L^{\infty}(K)}<\infty\) we get
% \begin{equation}
% \label{eq:f_f'}
% \norm{f}_{L^{\infty}(K)} = \sup_{x \in K}\abs{\int_a^x f'(s) ds} \leq
% \sup_{x \in K}\abs{\int_a^x \sup_{y \in K} \abs{f'(y)} ds} \leq \abs{b-a}
% \sup_{y \in K}\abs{f'(y)}.
% \end{equation}
% Using this we can bound \(\norm{f}_{w^{1,\infty}(K)}\) by
% \[
% \norm{f}_{w^{1,\infty}(K)} \stackrel{\text{Def~\ref{def:sobonorm}}}{=}
% \max\left\{\norm{f}_{L^{\infty}(K)},
% \norm{f'}_{L^{\infty}(K)}\right\}
% \stackrel{(\ref{eq:f_f'})}{\leq} max\left\{\abs{b-a},
% 1\right\}\norm{f'}_{L^{\infty}(K)}.
% \]
% With \(C_k^{\infty} \coloneqq max\left\{\abs{b-a}, 1\right\}\) we
% get (\ref{eq:pti1}).
% By using the Hölder inequality, we can proof the second claim.
% \begin{align*}
% \norm{f'}_{L^{\infty}(K)} &= \sup_{x \in K} \abs{\int_a^bf''(y)
% \mathds{1}_{[a,x]}(y)dy} \leq \sup_{x \in
% K}\norm{f''\mathds{1}_{[a,x]}}_{L^1(K)}\\
% &\hspace{-6pt} \stackrel{\text{Hölder}}{\leq} sup_{x
% \in
% K}\norm{f''}_{L^2(K)}\norm{\mathds{1}_{[a,x]}}_{L^2(K)}
% = \abs{b-a}\norm{f''}_{L^2(K)}.
% \end{align*}
% Thus (\ref{eq:pti2}) follows with \(C_K^2 \coloneqq
% \abs{b-a}C_K^{\infty}\).
% \qed
\end{Lemma}
\begin{Lemma}
\label{lem:cnvh}
  Let $\mathcal{RN}$ be a randomized shallow neural network. For \(\varphi :
  \mathbb{R}^2 \to \mathbb{R}\) uniformly continuous such that
\[
\forall x \in \supp(g_{\xi}) : \mathbb{E}\left[\varphi(\xi, v)
\frac{1}{n g_{\xi}(\xi)} \vert \xi = x \right] < \infty,
\]
\clearpage
it holds, that
\[
\plimn \sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k)
h_{k,n}
    =\int_{\min\left\{C_{g_{\xi}}^l, T\right\}}^{\min\left\{C_{g_{\xi}}^u,T\right\}}
\mathbb{E}\left[\varphi(\xi, v) \vert \xi = x \right] dx
\]
uniformly in \(T \in K\).
% \proof The proof is given in appendix...
% For \(T \leq C_{g_{\xi}}^l\) both sides equal 0, so it is sufficient to
% consider \(T > C_{g_{\xi}}^l\). With \(\varphi\) and
% \(\nicefrac{1}{g_{\xi}}\) uniformly continous in \(\xi\),
% \begin{equation}
% \label{eq:psi_stet}
% \forall \varepsilon > 0 : \exists \delta(\varepsilon) : \forall
% \abs{\xi - \xi'} < \delta(\varepsilon) : \abs{\varphi(\xi, v)
% \frac{1}{g_{\xi}(\xi)} - \varphi(\xi', v)
% \frac{1}{g_{\xi}(\xi')}} < \varepsilon
% \end{equation}
% uniformly in \(v\). In order to
% save space we use the notation \((a \wedge b) \coloneqq \min\{a,b\}\) for $a$ and $b
% \in \mathbb{R}$. W.l.o.g. assume \(\sup(g_{\xi})\) in an
% intervall. By splitting the interval in disjoint strips of length \(\delta
% \leq \delta(\varepsilon)\) we get:
% \[
% \underbrace{\sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k)
% \frac{\bar{h}_k}{2}}_{\circled{1}} =
% \underbrace{\sum_{l \in \mathbb{Z}:
% \left[\delta l, \delta (l + 1)\right] \subseteq
% \left[C_{g_{\xi}}^l, C_{g_{\xi}}^u \wedge T
% \right]}}_{\coloneqq \, l \in I_{\delta}} \left( \, \sum_{\substack{k \in \kappa\\
% \xi_k \in \left[\delta l, \delta (l + 1)\right]}}
% \varphi\left(\xi_k, v_k\right)\frac{\bar{h}_k}{2} \right)
% \]
% Using (\ref{eq:psi_stet}) we can approximate $\circled{1}$ by
% \begin{align*}
% \circled{1} & \approx \sum_{l \in I_{\delta}} \left( \, \sum_{\substack{k \in \kappa\\
% \xi_k \in \left[\delta l, \delta (l + 1)\right]}}
% \left(\varphi\left(l\delta, v_k\right)\frac{1}{g_{\xi}(l\delta)}
% \pm \varepsilon\right)\frac{1}{n} \underbrace{\frac{\abs{\left\{m \in
% \kappa : \xi_m \in [\delta l, \delta(l + 1)]\right\}}}{\abs{\left\{m \in
% \kappa : \xi_m \in [\delta l, \delta(l + 1)]\right\}}}}_{=
% 1}\right) \\
% \intertext{}
% &= \sum_{l \in I_{\delta}} \left( \frac{ \sum_{ \substack{k \in \kappa\\
% \xi_k \in \left[\delta l, \delta (l + 1)\right]}}
% \varphi\left(l\delta, v_k\right)}
% {\abs{\left\{m \in
% \kappa : \xi_m \in [\delta l, \delta(l + 1)]\right\}}}\frac{\abs{\left\{m \in
% \kappa : \xi_m \in [\delta l, \delta(l +
% 1)]\right\}}}{ng_{\xi}(l\delta)}\right) \pm \varepsilon .\\
% \intertext{We use the mean to approximate the number of kinks in
% each $\delta$-strip, as it follows a binomial distribution this
% amounts to
% \[
% \mathbb{E}\left[\abs{\left\{m \in \kappa : \xi_m \in [\delta l,
% \delta(l + 1)]\right\}\right]} = n \int_{[\delta l, \delta (l +
% 1)]} g_{\xi}(x)dx \approx n (\delta g_{\xi}(l\delta) \pm
% \tilde{\varepsilon}).
% \]
% Bla Bla Bla $v_k$}
% \circled{1} & \approx
% \end{align*}
\proof Notes on the proof are given in Proof~\ref{proof:lem9}.
\end{Lemma}
\begin{Lemma}
  For any $\lambda > 0$, $N \in \mathbb{N}$, training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2$, with $ i \in
\left\{1,\dots,N\right\}$, and subset $K \subset \mathbb{R}$ the spline approximating randomized
shallow neural network $\mathcal{RN}_{\tilde{w}}$ converges to the
cubic smoothing spline $f^{*, \lambda}_g$ in
$\norm{.}_{W^{1,\infty}(K)}$ as the node count $n$ increases,
\begin{equation}
\label{eq:s0}
\plimn \norm{\mathcal{RN}_{\tilde{w}} - f^{*, \lambda}_g}_{W^{1,
\infty}(K)} = 0
\end{equation}
\proof
Using Lemma~\ref{lem:pieq} it is sufficient to show
\[
\plimn \norm{\mathcal{RN}_{\tilde{w}}' - (f^{*,
\lambda}_g)'}_{L^{\infty}} = 0.
\]
  This can be achieved by using Lemma~\ref{lem:cnvh} with $\varphi(\xi_k,
  v_k) = \frac{v_k^2}{\mathbb{E}[v^2|\xi = \xi_k]} (f^{*, \lambda}_g)''(\xi_k)$,
thus obtaining
\begin{align*}
\plimn \frac{\partial \mathcal{RN}_{\tilde{w}}}{\partial x} (x)
\equals^{(\ref{eq:derivnn})}_{\phantom{\text{Lemma 3.1.4}}}
%\stackrel{(\ref{eq:derivnn})}{=}
&
\plimn \sum_{\substack{k \in \mathbb{N} \\
\xi_k < x}} \frac{v_k^2}{\mathbb{E}[v^2 \vert \xi
= \xi_k]} (f_g^{*, \lambda})''(\xi_k) h_{k,n} \\
\stackrel{\text{Lemma}~\ref{lem:cnvh}}{=}
%\stackrel{\phantom{(\ref{eq:derivnn})}}{=}
&
    \int_{\min\left\{C_{g_{\xi}}^l,x\right\}}^{\min\left\{C_{g_{\xi}}^u,x\right\}}
\mathbb{E}\left[\frac{v^2}{\mathbb{E}[v^2|\xi = z]} (f^{*,
\lambda}_g)''(\xi) \vert
\xi = z \right] dz\\
\mathmakebox[\widthof{$\stackrel{\text{Lemma 3.14}}{=}$}][c]{\equals^{\text{Tower-}}_{\text{property}}}
%\stackrel{\phantom{(\ref{eq:derivnn})}}{=}
&
    \int_{\min\left\{C_{g_{\xi}}^l,
      x\right\}}^{\min\left\{C_{g_{\xi}}^u,x\right\}}(f^{*,\lambda}_g)''(z)
dz.
\end{align*}
  With the fundamental theorem of calculus we get
  \[
    \plimn \mathcal{RN}_{\tilde{w}}'(x) = \left(f_g^{*,\lambda}\right)'\left(\min\left\{C_{g_{\xi}}^u, x\right\}\right) -
    \left(f_g^{*,\lambda}\right)'\left(\min\left\{C_{g_{\xi}}^l, x\right\}\right).
  \]
  As $\left(f_g^{*,\lambda}\right)'$ is constant on $\left[C_{g_\xi}^l,
    C_{g_\xi}^u\right]^C$, because $\supp\left(\left(f_g^{*,\lambda}\right)''\right) \subseteq
  \supp(g) \subseteq \supp(g_\xi)$, we get
  \[
    \plimn \mathcal{RN}_{\tilde{w}}'(x) = \left(f_g^{*,\lambda}\right)'(x),
  \]
thus (\ref{eq:s0}) follows with Lemma~\ref{lem:pieq}.
\qed
\label{lem:s0}
\end{Lemma}
\begin{Lemma}
For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
\left\{1,\dots,N\right\}$, we have
\[
    \plimn F^{\tilde{\lambda}}_n(\mathcal{RN}_{\tilde{w}}) =
    F^{\lambda, g}(f^{*, \lambda}_g).
\]
\proof Notes on the proof are given in Proof~\ref{proof:lem14}.
\label{lem:s2}
\end{Lemma}
\begin{Lemma}
For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
  \left\{1,\dots,N\right\}$, with $w^*$ as
  defined in Definition~\ref{def:rpnn} and $\tilde{\lambda}$ as
  defined in Theorem~\ref{theo:main1}, it holds
  \[
    \plimn \norm{\mathcal{RN}^{*,\tilde{\lambda}} -
      f^{w^{*, \tilde{\lambda}}}}_{W^{1,\infty}(K)} = 0.
\]
\proof Notes on the proof are given in Proof~\ref{proof:lem15}.
\label{lem:s3}
\end{Lemma}
\begin{Lemma}
For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
  \left\{1,\dots,N\right\}$, with $w^*$ and $\tilde{\lambda}$ as
  defined in Definition~\ref{def:rpnn} and Theorem~\ref{theo:main1}
  respectively, it holds
  \[
    \plimn \abs{F_n^{\tilde{\lambda}}(\mathcal{RN}^{*,\tilde{\lambda}}) -
      F^{\lambda, g}(f^{w^{*, \tilde{\lambda}}})} = 0.
\]
\proof Notes on the proof are given in Proof~\ref{proof:lem16}.
\label{lem:s4}
\end{Lemma}
\begin{Lemma}
For any $\lambda > 0$, $N \in \mathbb{N}$, and training data $(x_i^{\text{train}},
y_i^{\text{train}}) \in \mathbb{R}^2$, with $i \in
\left\{1,\dots,N\right\}$, for any sequence of functions $f^n \in
W^{2,2}$ with
\[
    \plimn F^{\lambda, g} (f^n) = F^{\lambda, g}(f^{*, \lambda}_g),
  \]
  it follows
  \[
    \plimn \norm{f^n - f^{*, \lambda}_g}_{W^{1,\infty}(K)} = 0.
\]
\proof Notes on the proof are given in Proof~\ref{proof:lem19}.
\label{lem:s7}
\end{Lemma}
Using these lemmata we can now prove Theorem~\ref{theo:main1}. We
start by showing that the error measure of the smooth approximation of
the ridge penalized randomized shallow neural network, $F^{\lambda,
  g}(f^{w^{*,\tilde{\lambda}}})$,
converges in probability to the error measure of the adapted weighted
cubic smoothing spline, $F^{\lambda, g}\left(f^{*,\lambda}_g\right)$, for the specified
parameters.
Using Lemma~\ref{lem:s4} we get that for every $P \in (0,1)$ and
$\varepsilon > 0$ there exists an $n_1 \in \mathbb{N}$ such that
\begin{equation}
\mathbb{P}\left[F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) \in
F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
+[-\varepsilon, \varepsilon]\right] > P, \forall n \in
\mathbb{N}_{> n_1}.
\label{eq:squeeze_1}
\end{equation}
As $\mathcal{RN}^{*,\tilde{\lambda}}$ is the optimal network for
$F_n^{\tilde{\lambda}}$ we know that
\begin{equation}
F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
\leq F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right).
\label{eq:squeeze_2}
\end{equation}
Using Lemma~\ref{lem:s2} we get that for every $P \in (0,1)$ and
$\varepsilon > 0$ an $n_2 \in \mathbb{N}$ exists such that
\begin{equation}
\mathbb{P}\left[F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right)
\in F^{\lambda, g}\left(f^{*,\lambda}_g\right)+[-\varepsilon,
\varepsilon]\right] > P, \forall n \in \mathbb{N}_{> n_2}.
\label{eq:squeeze_3}
\end{equation}
Combining (\ref{eq:squeeze_1}), (\ref{eq:squeeze_2}), and
(\ref{eq:squeeze_3}) we get that for every $P \in (0,1)$ and for \linebreak
every
$\varepsilon > 0$, with $n_3 \geq
\max\left\{n_1,n_2\right\}$,
\[
\mathbb{P}\left[F^{\lambda,
g}\left(f^{w^{*,\tilde{\lambda}}}\right) \leq F^{\lambda,
g}\left(f^{*,\lambda}_g\right)+2\varepsilon\right] > P, \forall
n \in \mathbb{N}_{> n_3}.
\]
As $\supp(f^{w^{*,\tilde{\lambda}}}) \subseteq \supp(g_\xi)$ and $f^{*,\lambda}_g$ is optimal we know that
\[
F^{\lambda, g}\left(f^{*,\lambda}_g\right) \leq F^{\lambda,
g}\left(f^{w^{*,\tilde{\lambda}}}\right)
\]
and thus get with the squeeze theorem
\[
\plimn F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) = F^{\lambda, g}\left(f^{*,\lambda}_g\right).
\]
With Lemma~\ref{lem:s7} it follows that
\begin{equation}
\plimn \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g}
_{W^{1,\infty}} = 0.
\label{eq:main4}
\end{equation}
By using the triangle inequality with Lemma~\ref{lem:s3} and
(\ref{eq:main4}) we get
\begin{multline}
  \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f_g^{*,\lambda}}_{W^{1,\infty}}\\
  \leq \plimn \bigg(\norm{\mathcal{RN}^{*, \tilde{\lambda}} -
    f^{w^{*,\tilde{\lambda}}}}_{W^{1,\infty}}
  + \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g}
  _{W^{1,\infty}}\bigg) = 0
\end{multline}
and have thus proven Theorem~\ref{theo:main1}.
We now know that randomized shallow neural networks behave similarly to
spline regression if we regularize the size of the weights during
training.
\textcite{heiss2019} further explore a connection between ridge penalized
networks and randomized shallow neural networks trained using gradient
descent.
They infer that the effect of weight regularization
can be achieved by stopping the training of the randomized shallow
neural network early, with the number of iterations corresponding to
the tuning parameter penalizing the size of the weights.
They use this to further conclude that for a large number of training epochs and
neurons, shallow neural networks trained with gradient descent are
very close to spline interpolations. Alternatively, if the training
is stopped early, they are close to adapted weighted cubic smoothing splines.
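A minimal sketch of this training scheme under our reading of
\textcite{heiss2019} (names and the constant step size are our own
choices; the precise correspondence between the iteration budget and
$\tilde{\lambda}$ is discussed there): only the output weights are
updated by full-batch gradient descent on the unpenalized squared
error, and stopping early takes over the role of the weight penalty.
\begin{verbatim}
import numpy as np

def train_output_weights(Phi, y, n_iter=10_000):
    """Full-batch gradient descent on L(w) = ||Phi w - y||^2, started at
    w = 0; stopping after n_iter iterations acts as implicit regularization."""
    lr = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)   # stable step size
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        w -= lr * 2.0 * Phi.T @ (Phi @ w - y)
    return w
\end{verbatim}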
\newpage
\subsection{Simulations}
\label{sec:rsnn_sim}
In the following, the behavior described in Theorem~\ref{theo:main1}
is visualized in a simulated example. For this, two sets of training
data have been generated:
\begin{itemize}
\item $\text{data}_A = (x_{i, A}^{\text{train}},
y_{i,A}^{\text{train}})$ with
\begin{align*}
x_{i, A}^{\text{train}} &\coloneqq -\pi + \frac{2 \pi}{5} (i - 1),
i \in \left\{1, \dots, 6\right\}, \\
y_{i, A}^{\text{train}} &\coloneqq \sin( x_{i, A}^{\text{train}}). \phantom{(i - 1),
i \in \left\{1, \dots, 6\right\}}
\end{align*}
\item $\text{data}_B = (x_{i, B}^{\text{train}}, y_{i,
B}^{\text{train}})$ with
\begin{align*}
x_{i, B}^{\text{train}} &\coloneqq \pi\frac{i - 8}{7},
i \in \left\{1, \dots, 15\right\}, \\
y_{i, B}^{\text{train}} &\coloneqq \sin( x_{i, B}^{\text{train}}). \phantom{(i - 1),
i \in \left\{1, \dots, 6\right\}}
\end{align*}
\end{itemize}
For the $\mathcal{RN}$ the random parameters are generated
as follows:
\begin{align*}
  \xi_i &\stackrel{i.i.d.}{\sim} \text{Unif}(-5,5), \\
  v_i &\stackrel{i.i.d.}{\sim} \mathcal{N}(0, 5), \\
  b_i &\stackrel{\phantom{i.i.d.}}{=} -\xi_i v_i.
\end{align*}
Note that with these choices of distributions, $g$ as defined in
Theorem~\ref{theo:main1}
equates to $g(x) = \frac{\mathbb{E}[v_k^2|\xi_k = x]}{10}$. In
order to utilize the
smoothing spline implemented in Matlab, $g$ has been simplified to $g
\equiv \frac{1}{10}$ instead.
For all figures $f_1^{*, \lambda}$ has
been calculated with Matlab's {\sffamily{smoothingspline}}. As this minimizes
\[
  \bar{\lambda} \sum_{i=1}^N(y_i^{\text{train}} - f(x_i^{\text{train}}))^2 + (1 -
  \bar{\lambda}) \int (f''(x))^2 dx,
\]
the smoothing parameter used for fitting is $\bar{\lambda} =
\frac{1}{1 + \lambda}$. The parameter $\tilde{\lambda}$ for training
the networks is chosen as defined in Theorem~\ref{theo:main1}.
Each
network contains 10,000 hidden nodes and is trained on the full
training data for 100,000 epochs using
gradient descent. The
results are given in Figure~\ref{fig:rn_vs_rs}, where it can be seen
that the neural network and the
smoothing spline are nearly identical, in line with
Theorem~\ref{theo:main1}.
\input{Figures/RN_vs_RS}
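The comparison can be reproduced in spirit with the following sketch
(our own code, not the code used for the figures): the hidden
parameters are drawn as above, the output weights are obtained from the
ridge problem of Definition~\ref{def:rpnn} via its equivalent dual
formulation, and the result is compared with a cubic smoothing spline
computed with SciPy's \texttt{make\_smoothing\_spline}, interpreting
$\mathcal{N}(0,5)$ as variance $5$.
\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(3)

# Training set data_A from above.
x_train = -np.pi + 2 * np.pi / 5 * np.arange(6)
y_train = np.sin(x_train)

# Randomized hidden layer: xi ~ Unif(-5, 5), v ~ N(0, 5), b = -xi * v.
n = 10_000
xi = rng.uniform(-5, 5, size=n)
v = rng.normal(0.0, np.sqrt(5.0), size=n)
b = -xi * v

def features(x):
    return np.maximum(0.0, b + np.outer(x, v))    # (len(x), n) activations

# Ridge weights with lambda_tilde = lambda * n * g(0) and g == 1/10,
# via the dual (N x N) system: w = Phi^T (Phi Phi^T + lt I)^{-1} y.
lam = 0.1
lam_tilde = lam * n / 10
Phi = features(x_train)
alpha = np.linalg.solve(Phi @ Phi.T + lam_tilde * np.eye(len(x_train)), y_train)
w = Phi.T @ alpha

# Reference: cubic smoothing spline with the corresponding lambda.
spline = make_smoothing_spline(x_train, y_train, lam=lam)

grid = np.linspace(-np.pi, np.pi, 200)
print(np.max(np.abs(features(grid) @ w - spline(grid))))   # sup-norm gap
\end{verbatim}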
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: