\section{Shallow Neural Networks}

In order to get some understanding of the behavior of neural networks
we examine a simple class of networks in this chapter. We consider
networks that contain only one hidden layer and have a single output
node. We call these networks shallow neural networks.

\begin{Definition}[Shallow neural network, Heiss, Teichmann, and
  Wutte (2019, Definition 1.4)]
  For an input dimension $d$ and a Lipschitz continuous activation
  function $\sigma: \mathbb{R} \to \mathbb{R}$ we define a shallow
  neural network with $n$ hidden nodes as
  $\mathcal{NN}_\vartheta : \mathbb{R}^d \to \mathbb{R}$ given by
  \[
    \mathcal{NN}_\vartheta(x) \coloneqq \sum_{k=1}^n w_k
    \sigma\left(b_k + \sum_{j=1}^d v_{k,j} x_j\right) + c ~~ \forall x
    \in \mathbb{R}^d
  \]
  with
  \begin{itemize}
  \item weights $w_k \in \mathbb{R},~k \in \left\{1,\dots,n\right\}$
  \item biases $b_k \in \mathbb{R},~k \in \left\{1, \dots,n\right\}$
  \item weights $v_k \in \mathbb{R}^d,~k\in\left\{1,\dots,n\right\}$
  \item bias $c \in \mathbb{R}$
  \item these weights and biases collected in
    \[
      \vartheta \coloneqq (w, b, v, c) \in \Theta \coloneqq
      \mathbb{R}^{n} \times \mathbb{R}^{n} \times \mathbb{R}^{n \times d}
      \times \mathbb{R}
    \]
  \end{itemize}
\end{Definition}

As a neural network with a large number of hidden nodes has a large
number of tunable parameters, it can often fit the training data very
closely. If the ReLU activation function
\[
  \sigma(x) \coloneqq \max{(0, x)}
\]
is chosen, one can easily prove that a shallow network trained on the
mean squared error will fit the data perfectly whenever the number of
hidden nodes is at least as large as the number of training points.
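For illustration, the following short Python sketch evaluates
$\mathcal{NN}_\vartheta$ for the ReLU activation; the code and its
names are purely illustrative and not part of any cited
implementation.
\begin{verbatim}
import numpy as np

def shallow_nn(x, w, b, V, c):
    """Evaluate NN_theta(x) = sum_k w_k * relu(b_k + <v_k, x>) + c.

    x: input of shape (d,), w, b: shape (n,), V: shape (n, d), c: scalar.
    """
    hidden = np.maximum(0.0, b + V @ x)   # ReLU applied to b_k + <v_k, x>
    return w @ hidden + c
\end{verbatim}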
\begin{Theorem}[Shallow neural network can fit data perfectly]
  For training data of size $t$
  \[
    \left(x_i^{\text{train}}, y_i^{\text{train}}\right) \in
    \mathbb{R}^d \times \mathbb{R},~i\in\left\{1,\dots,t\right\}
  \]
  a shallow neural network $\mathcal{NN}_\vartheta$ with $n \geq t$
  hidden nodes will perfectly fit the data when minimizing squared
  error loss.
  \proof
  W.l.o.g. assume all values $x_{i,j}^{\text{train}} \in [0,1],~\forall
  i \in \left\{1,\dots,t\right\}, j \in \left\{1,\dots,d\right\}$. Now
  we choose $v^*$ such that it maps each $x_i^{\text{train}}$ to a
  distinct value:
  \[
    v^*_{k,j} = v^*_{j} = 10^{j-1}, ~ \forall k \in \left\{1,\dots,n\right\}.
  \]
  Assuming $x_i^{\text{train}} \neq x_j^{\text{train}},~\forall i\neq j$
  we get
  \[
    \left(v_k^*\right)^{\mathrm{T}} x_i^{\text{train}} \neq
    \left(v_k^*\right)^{\mathrm{T}} x_j^{\text{train}}, ~ \forall i \neq j.
  \]
  W.l.o.g. assume the $x_i^{\text{train}}$ are ordered such that
  $\left(v_k^*\right)^{\mathrm{T}} x_i^{\text{train}} <
  \left(v_k^*\right)^{\mathrm{T}} x_j^{\text{train}}, ~\forall j > i$.
  We choose the biases as
  \begin{align*}
    b^*_1 &> -\left(v^*\right)^{\mathrm{T}} x_1^{\text{train}},\\
    b^*_k &= -\left(v^*\right)^{\mathrm{T}}
            x_{k-1}^{\text{train}},~\forall k \in \left\{2, \dots, t\right\}, \\
    b_k^* &\leq -\left(v^*\right)^{\mathrm{T}} x_{t}^{\text{train}},~\forall k > t.
  \end{align*}
  With
  \begin{align*}
    w_k^* &= \frac{y_k^{\text{train}} - \sum_{j =1}^{k-1}
            w^*_j\left(b^*_j + \left(v^*\right)^{\mathrm{T}}
            x_k^{\text{train}}\right)}{b_k^* + \left(v^*\right)^{\mathrm{T}}
            x_k^{\text{train}}},~\forall k \in \left\{1,\dots,t\right\}\\
    w_k^* &\in \mathbb{R} \text{ arbitrary, } \forall k > t,
  \end{align*}
  and $\vartheta^* = (w^*, b^*, v^*, c = 0)$ we get
  \[
    \mathcal{NN}_{\vartheta^*} (x_i^{\text{train}}) = \sum_{k =
      1}^{i-1} w_k^*\left(b_k^* + \left(v^*\right)^{\mathrm{T}}
      x_i^{\text{train}}\right) + w_i^*\left(b_i^* +
      \left(v^*\right)^{\mathrm{T}} x_i^{\text{train}}\right) =
    y_i^{\text{train}}.
  \]
  As the squared error of $\mathcal{NN}_{\vartheta^*}$ is zero, every
  shallow network with at least $t$ hidden nodes that minimizes the
  squared error loss will perfectly fit the data.
  \qed
  \label{theo:overfit}
\end{Theorem}
However, this behavior is often not desired, as overfit models
generally have bad generalization properties, especially if noise is
present in the data. This effect is illustrated in
Figure~\ref{fig:overfit}. Here a shallow neural network that
perfectly fits the training data is constructed according to the
proof of Theorem~\ref{theo:overfit} and compared to a cubic smoothing
spline (Definition~\ref{def:wrs}). While the neural network fits the
data better than the spline, the spline represents the underlying
mechanism that was used to generate the data more accurately. The
better generalization of the spline compared to the network is
further demonstrated by the lower validation error computed on newly
generated test data.
In order to improve the accuracy of the model we want to reduce
overfitting. A possible way to achieve this is to explicitly
regularize the network through the cost function, as done with ridge
penalized networks (Definition~\ref{def:rpnn}) where large weights
$w$ are penalized. In Theorem~\ref{theo:main1} we will prove that
this results in the shallow neural network converging to regression
splines as the number of nodes in the hidden layer is increased.
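The construction used in the proof of Theorem~\ref{theo:overfit} can
be carried out numerically. The following Python sketch does so for a
one-dimensional input (so that $v^* \equiv 1$ suffices); it is only an
illustration of the proof, and the helper names are made up for this
purpose.
\begin{verbatim}
import numpy as np

def fit_exactly(x_train, y_train):
    """Construct (w, b, v) of a 1-d shallow ReLU network with t hidden
    nodes that interpolates the data (x values assumed distinct),
    following the constructive proof above."""
    order = np.argsort(x_train)
    x, y = x_train[order], y_train[order]
    t = len(x)
    v = np.ones(t)              # v* = 1 suffices for d = 1
    b = np.empty(t)
    b[0] = 1.0 - x[0]           # any b_1 > -x_1 works
    b[1:] = -x[:-1]             # node k switches on right after x_{k-1}
    w = np.empty(t)
    for k in range(t):          # recursion for the output weights
        partial = sum(w[j] * (b[j] + x[k]) for j in range(k))
        w[k] = (y[k] - partial) / (b[k] + x[k])
    return w, b, v

def nn(z, w, b, v, c=0.0):
    return np.sum(w * np.maximum(0.0, b + v * z)) + c

x_tr = np.array([0.1, 0.35, 0.6, 0.9])
y_tr = np.array([1.0, -0.5, 0.25, 2.0])
w, b, v = fit_exactly(x_tr, y_tr)
# reproduces y_tr up to floating point error
print([round(nn(z, w, b, v), 8) for z in x_tr])
\end{verbatim}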
\begin{figure}
  \pgfplotsset{
    compat=1.11,
    legend image code/.code={
      \draw[mark repeat=2,mark phase=2]
      plot coordinates {
        (0cm,0cm)
        (0.15cm,0cm)        %% default is (0.3cm,0cm)
        (0.3cm,0cm)         %% default is (0.6cm,0cm)
      };%
    }
  }
  \begin{tikzpicture}
    \begin{axis}[tick style = {draw = none}, width = \textwidth,
      height = 0.6\textwidth]
      \addplot table [x=x, y=y, col sep=comma, only marks,
      mark options={scale = 0.7}] {Figures/Data/overfit.csv};
      \addplot [red, line width=0.8pt] table [x=x_n, y=s_n, col
      sep=comma, forget plot] {Figures/Data/overfit.csv};
      \addplot [black, line width=0.8pt] table [x=x_n, y=y_n, col
      sep=comma] {Figures/Data/overfit.csv};
      \addplot [black, line width=0.8pt, dashed] table [x=x, y=y, col
      sep=comma] {Figures/Data/overfit_spline.csv};
      \addlegendentry{\footnotesize{data}};
      \addlegendentry{\footnotesize{$\mathcal{NN}_{\vartheta^*}$}};
      \addlegendentry{\footnotesize{spline}};
    \end{axis}
  \end{tikzpicture}
  \caption[Overfitting of shallow neural networks]{For data of the
    form $y=\sin(\frac{x+\pi}{2 \pi}) + \varepsilon,~ \varepsilon \sim
    \mathcal{N}(0,0.4)$ (\textcolor{blue}{blue dots}) the plot shows
    the neural network constructed according to the proof of
    Theorem~\ref{theo:overfit} (black) and the underlying signal
    (\textcolor{red}{red}). While the network fits the training data
    perfectly, a cubic smoothing spline (black dashed) reproduces the
    underlying signal much better. For a test set of size 20 with
    uniformly distributed $x$ values and responses generated in the
    same fashion as the training data, the MSE of the neural network
    is 0.30, while the MSE of the spline is only 0.14; the spline thus
    generalizes much better.}
  \label{fig:overfit}
\end{figure}
\clearpage
\subsection{Convergence Behaviour of 1-dim. Randomized Shallow Neural
  Networks}

This section is based on \textcite{heiss2019}. We consider shallow
neural networks with a one-dimensional input whose hidden-layer
parameters are randomized, so that only the weights of the output
layer are trainable. Additionally we assume all neurons use the ReLU
as activation function and call such networks randomized shallow
neural networks.

\begin{Definition}[Randomized shallow neural network, Heiss,
  Teichmann, and Wutte (2019, Definition 2.1)]
  For an input dimension $d$, let $n \in \mathbb{N}$ be the number of
  hidden nodes and $v(\omega) \in \mathbb{R}^{n \times d}, b(\omega)
  \in \mathbb{R}^n$ randomly drawn weights and biases. Then for a
  weight vector $w$ the corresponding randomized shallow neural
  network is given by
  \[
    \mathcal{RN}_{w, \omega} (x) = \sum_{k=1}^n w_k
    \sigma\left(b_k(\omega) + \sum_{j=1}^d v_{k, j}(\omega) x_j\right).
  \]
  \label{def:rsnn}
\end{Definition}
We will prove that, as the number of hidden nodes grows, a randomized
shallow neural network converges to a function that trades off the
distance to the training data against the size of its second
derivative, provided the $L^2$ norm of the trainable weights $w$ is
penalized in the loss function. We call a network that is fitted
according to the MSE plus a penalty term on the size of the weights a
ridge penalized neural network.

\begin{Definition}[Ridge penalized Neural Network, Heiss, Teichmann,
  and Wutte (2019, Definition 3.2)]
  \label{def:rpnn}
  Let $\mathcal{RN}_{w, \omega}$ be a randomized shallow neural
  network, as introduced in Definition~\ref{def:rsnn}, and let
  $\tilde{\lambda} \in \mathbb{R}$ be a tuning parameter. Then the
  optimal ridge penalized network is given by
  \[
    \mathcal{RN}^{*, \tilde{\lambda}}_{\omega}(x) \coloneqq
    \mathcal{RN}_{w^{*, \tilde{\lambda}}(\omega), \omega}(x)
  \]
  with
  \[
    w^{*,\tilde{\lambda}}(\omega) :\in \argmin_{w \in \mathbb{R}^n}
    \underbrace{ \left\{\overbrace{\sum_{i = 1}^N
          \left(\mathcal{RN}_{w, \omega}(x_i^{\text{train}}) -
            y_i^{\text{train}}\right)^2}^{L(\mathcal{RN}_{w, \omega})} +
        \tilde{\lambda} \norm{w}_2^2\right\}}_{\eqqcolon
      F_n^{\tilde{\lambda}}(\mathcal{RN}_{w,\omega})}.
  \]
\end{Definition}
If the number of hidden nodes $n$ is larger than the number of
training samples $N$, then for $\tilde{\lambda} \to 0$ the network
will interpolate the data while having minimal weights, resulting in
the \textit{minimum norm network} $\mathcal{RN}_{w^{\text{min}},
  \omega}$ with weights
\[
  w^{\text{min}} \in \argmin_{w \in \mathbb{R}^n} \norm{w}, \text{
    s.t. } \mathcal{RN}_{w,\omega}(x_i^{\text{train}}) =
  y_i^{\text{train}}, \, \forall i \in \left\{1,\dots,N\right\}.
\]
For $\tilde{\lambda} \to \infty$ the learned function will resemble
the data less and less and, with the weights approaching $0$, will
converge to the constant $0$ function. In order to make the notation
more convenient, in the following the $\omega$ used to express the
realized random parameters will no longer be mentioned explicitly.

We call a function that minimizes the squared distance to the
training data plus a penalty on its second derivative a cubic
smoothing spline.

\begin{Definition}[Cubic Smoothing Spline]
  Let $x_i^{\text{train}}, y_i^{\text{train}} \in \mathbb{R}, i \in
  \left\{1,\dots,N\right\}$ be training data. For a given $\lambda \in
  \mathbb{R}$ the cubic smoothing spline is given by
  \[
    f^{*,\lambda} :\in \argmin_{f \in
      \mathcal{C}^2}\left\{\sum_{i=1}^N
      \left(f\left(x_i^{\text{train}}\right) -
        y_i^{\text{train}}\right)^2 + \lambda \int f''(x)^2 dx\right\}.
  \]
\end{Definition}
We will show that for specific hyperparameters the ridge penalized
shallow neural networks converge to a slightly modified variant of
the cubic smoothing spline.
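Since $\mathcal{RN}_{w, \omega}$ is linear in the trainable weights
$w$, the minimizer $w^{*,\tilde{\lambda}}$ is simply the solution of a
ridge regression on the random ReLU features. The following Python
sketch illustrates this closed form for a one-dimensional input; the
sampling distributions mirror those used in the simulation section
below, and all names and numerical choices are illustrative.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def ridge_penalized_rsnn(x_train, y_train, n, lam_tilde):
    """Closed-form minimizer of F_n^lambda for a 1-d randomized shallow
    ReLU network: a ridge regression on the random features
    phi_k(x) = relu(b_k + v_k * x)."""
    xi = rng.uniform(-5.0, 5.0, size=n)   # random kink positions (illustrative)
    v = rng.normal(0.0, 5.0, size=n)      # random hidden weights (illustrative)
    b = -xi * v                           # biases placing the kinks at xi
    Phi = np.maximum(0.0, b[None, :] + v[None, :] * x_train[:, None])
    w = np.linalg.solve(Phi.T @ Phi + lam_tilde * np.eye(n), Phi.T @ y_train)
    return lambda x: np.maximum(0.0, b + v * x) @ w

x_tr = np.linspace(-np.pi, np.pi, 15)
y_tr = np.sin(x_tr)
rn = ridge_penalized_rsnn(x_tr, y_tr, n=1000, lam_tilde=1.0)
print(rn(0.5))                            # prediction at a single point
\end{verbatim}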
We will need to incorporate the densities of the random parameters
into the loss function of the spline to ensure convergence. Thus we
define the adapted weighted cubic smoothing spline, where the penalty
on the second derivative is weighted by a function $g$ and the
support of the second derivative of $f$ has to be a subset of the
support of $g$. The formal definition is given in
Definition~\ref{def:wrs}.

\begin{Definition}[Adapted weighted cubic smoothing spline]
  \label{def:wrs}
  Let $x_i^{\text{train}}, y_i^{\text{train}} \in \mathbb{R}, i \in
  \left\{1,\dots,N\right\}$ be training data. For a given $\lambda \in
  \mathbb{R}_{>0}$ and a function $g: \mathbb{R} \to \mathbb{R}_{>0}$
  the weighted cubic smoothing spline $f^{*, \lambda}_g$ is given by
  \[
    f^{*, \lambda}_g :\in \argmin_{\substack{f \in
        \mathcal{C}^2(\mathbb{R}) \\ \supp(f'') \subseteq \supp(g)}}
    \underbrace{\left\{ \overbrace{\sum_{i = 1}^N
          \left(f(x_i^{\text{train}}) -
            y_i^{\text{train}}\right)^2}^{L(f)} + \lambda g(0)
        \int_{\supp(g)}\frac{\left(f''(x)\right)^2}{g(x)}
        dx\right\}}_{\eqqcolon F^{\lambda, g}(f)}.
  \]
  \todo{requirement on the derivative of $f$, or not after all?}
\end{Definition}
Note that for a constant weight function $g \equiv g(0) > 0$ the
factor $\nicefrac{g(0)}{g(x)}$ equals one and $F^{\lambda, g}$ reduces
to the criterion of the unweighted cubic smoothing spline.
Similarly to ridge weight penalized neural networks, the parameter
$\lambda$ controls a trade-off between accuracy on the training data
and smoothness, i.e. a small second derivative. For $g \equiv 1$ and
$\lambda \to 0$ the resulting function $f^{*, 0+}$ will interpolate
the training data while minimizing the second derivative. Such a
function is known as a cubic spline interpolation,
\[
  f^{*, 0+} \coloneqq \lim_{\lambda \to 0+} f^{*, \lambda}_1 \in
  \argmin_{\substack{f \in \mathcal{C}^2(\mathbb{R}), \\
      f(x_i^{\text{train}}) = y_i^{\text{train}}}} \left( \int
    _{\mathbb{R}} (f''(x))^2 dx\right).
\]
For $\lambda \to \infty$ on the other hand $f_g^{*,\lambda}$ converges
to the linear regression of the data.

We use two intermediary functions in order to show the convergence of
the ridge penalized shallow neural network to adapted cubic smoothing
splines. One is a smooth approximation of the neural network, the
other a randomized shallow neural network designed to approximate a
spline. In order to construct these functions we need to take into
consideration the points at which the slope of the learned function
changes, that is, the points of discontinuity of its derivative. As
we use the ReLU activation, the derivative of the function learned by
the network jumps at every point where a neuron in the hidden layer
switches from inactive to active (its input crosses zero). We
formalize these points as kinks in Definition~\ref{def:kink}.
\begin{Definition}
  \label{def:kink}
  Let $\mathcal{RN}_w$ be a randomized shallow Neural Network
  according to Definition~\ref{def:rsnn} with one-dimensional input,
  \[
    \mathcal{RN}_w(x) = \sum_{k = 1}^n w_k \sigma(b_k + v_kx).
  \]
  Then kinks depending on the random parameters can be observed.
  Because we specified $\sigma(y) \coloneqq \max\left\{0, y\right\}$,
  the activation $\sigma$ has a kink at $0$. As $b_k + v_kx = 0$ for
  $x = -\frac{b_k}{v_k}$ we define the following:
  \begin{enumerate}[label=(\alph*)]
  \item Let $\xi_k \coloneqq -\frac{b_k}{v_k}$ be the $k$-th kink of
    $\mathcal{RN}_w$.
  \item Let $g_{\xi}(\xi_k)$ be the density of the kinks $\xi_k = -
    \frac{b_k}{v_k}$ induced by the distributions of $b_k$ and $v_k$.
  \item Let $h_{k,n} \coloneqq \frac{1}{n g_{\xi}(\xi_k)}$ be the
    estimated average distance from kink $\xi_k$ to its nearest
    neighboring kink.
  \end{enumerate}
\end{Definition}
Using the density of the kinks we construct a kernel and smooth the
network by applying the kernel similarly to a convolution.

\begin{Definition}[Smooth Approximation of Randomized Shallow Neural
  Network]
  \label{def:srsnn}
  Let $\mathcal{RN}_{w}$ be a randomized shallow Neural Network
  according to Definition~\ref{def:rsnn} with weights $w$ and kinks
  $\xi_k$ with corresponding kink density $g_{\xi}$ as given by
  Definition~\ref{def:kink}.
  In order to smooth the network consider the following kernel for
  every $x$:
  \[
    \kappa_x(s) \coloneqq \mathds{1}_{\left\{\abs{s} \leq \frac{1}{2
          \sqrt{n} g_{\xi}(x)}\right\}}(s)\sqrt{n} g_{\xi}(x), \,
    \forall s \in \mathbb{R}.
  \]
  Using this kernel we define a smooth approximation of
  $\mathcal{RN}_w$ by
  \[
    f^w(x) \coloneqq \int_{\mathds{R}} \mathcal{RN}_w(x-s)
    \kappa_x(s) ds.
  \]
\end{Definition}
Note that the kernel introduced in Definition~\ref{def:srsnn}
satisfies $\int_{\mathbb{R}}\kappa_x(s) ds = 1$. While $f^w$ looks
highly similar to a convolution, it differs slightly as the kernel
$\kappa_x(s)$ depends on $x$. Therefore only $f^w = (\mathcal{RN}_w *
\kappa_x)(x)$ is well defined, while $\mathcal{RN}_w * \kappa$ is
not. We use $f^{w^{*,\tilde{\lambda}}}$ to denote the smooth
approximation of the ridge penalized network
$\mathcal{RN}^{*,\tilde{\lambda}}$.

Next we construct a randomized shallow neural network which
approximates a spline independently of the realization of the random
parameters. In order to achieve this, the trainable weights are
chosen based on the second derivative of the spline at the
corresponding kinks.

\begin{Definition}[Spline approximating Randomised Shallow Neural
  Network]
  \label{def:sann}
  Let $\mathcal{RN}$ be a randomised shallow Neural Network according
  to Definition~\ref{def:rsnn} and $f^{*, \lambda}_g$ be the weighted
  cubic smoothing spline as introduced in Definition~\ref{def:wrs}.
  Then the randomised shallow neural network approximating $f^{*,
    \lambda}_g$ is given by
  \[
    \mathcal{RN}_{\tilde{w}}(x) = \sum_{k = 1}^n \tilde{w}_k
    \sigma(b_k + v_k x),
  \]
  with the weights $\tilde{w}_k$ defined as
  \[
    \tilde{w}_k \coloneqq \frac{h_{k,n} v_k}{\mathbb{E}[v^2 \vert \xi
      = \xi_k]} (f_g^{*, \lambda})''(\xi_k).
  \]
\end{Definition}
The approximating nature of the network in Definition~\ref{def:sann}
can be seen by examining the first derivative of
$\mathcal{RN}_{\tilde{w}}(x)$, which is given by
\begin{align}
  \frac{\partial \mathcal{RN}_{\tilde{w}}}{\partial x}
  \Big{|}_{x} &= \sum_{k=1}^n \tilde{w}_k v_k \mathds{1}_{\left\{b_k +
                v_k x > 0\right\}} = \sum_{\substack{k \in \mathbb{N} \\
                \xi_k < x}} \tilde{w}_k v_k \nonumber \\
              &= \frac{1}{n} \sum_{\substack{k \in \mathbb{N} \\ \xi_k < x}}
                \frac{v_k^2}{g_{\xi}(\xi_k) \mathbb{E}[v^2 \vert \xi =
                \xi_k]} (f_g^{*, \lambda})''(\xi_k).
  \label{eq:derivnn}
\end{align}
As expression (\ref{eq:derivnn}) behaves similarly to a Riemann sum,
for $n \to \infty$ it will converge in probability to the first
derivative of $f^{*,\lambda}_g$. A formal proof of this behaviour is
given in Lemma~\ref{lem:s0}.

In order to ensure that the functions used in the proof of the
convergence are well defined, we need to assume some properties of
the random parameters and their densities.

\begin{Assumption}~
  \label{ass:theo38}
  \begin{enumerate}[label=(\alph*)]
  \item The probability density function of the kinks $\xi_k$, namely
    $g_{\xi}$ as defined in Definition~\ref{def:kink}, exists and is
    well defined.
  \item The density function $g_\xi$ has compact support.
  \item The density function $g_{\xi}$ is uniformly continuous on
    $\supp(g_{\xi})$.
  \item $g_{\xi}(0) \neq 0$.
  \item $\frac{1}{g_{\xi}}\Big|_{\supp(g_{\xi})}$ is uniformly
    continuous on $\supp(g_{\xi})$.
  \item The conditional distribution $\mathcal{L}(v_k|\xi_k = x)$ is
    uniformly continuous in $x$ on $\supp(g_{\xi})$.
  \item $\mathbb{E}\left[v_k^2\right] < \infty$.
  \end{enumerate}
\end{Assumption}
As we will prove the convergence in a Sobolev norm, we now introduce
the Sobolev space and the corresponding norm.

\begin{Definition}[Sobolev Space]
  For $K \subset \mathbb{R}^n$ open and $1 \leq p \leq \infty$ we
  define the Sobolev space $W^{k,p}(K)$ as the space containing all
  real valued functions $u \in L^p(K)$ such that for every
  multi-index $\alpha \in \mathbb{N}^n$ with $\abs{\alpha} \leq k$
  the mixed partial derivatives
  \[
    u^{(\alpha)} = \frac{\partial^{\abs{\alpha}} u}{\partial
      x_1^{\alpha_1} \dots \partial x_n^{\alpha_n}}
  \]
  exist in the weak sense and
  \[
    \norm{u^{(\alpha)}}_{L^p} < \infty.
  \]
  \label{def:sobonorm}
  The natural norm of the Sobolev space is given by
  \[
    \norm{f}_{W^{k,p}(K)} =
    \begin{cases}
      \left(\sum_{\abs{\alpha} \leq k}
        \norm{f^{(\alpha)}}^p_{L^p}\right)^{\nicefrac{1}{p}},& \text{for }
      1 \leq p < \infty \\
      \max_{\abs{\alpha} \leq
        k}\left\{\norm{f^{(\alpha)}}_{L^{\infty}}\right\},& \text{for }
      p = \infty
    \end{cases}
    .
  \]
\end{Definition}
With the important definitions and assumptions in place, we can now
formulate the main theorem, which describes the convergence of ridge
penalized random neural networks to adapted cubic smoothing splines
when the parameters are chosen accordingly.

\begin{Theorem}[Ridge weight penalty corresponds to weighted cubic
  smoothing spline]
  \label{theo:main1}
  For $N \in \mathbb{N}$ arbitrary training data
  \(\left(x_i^{\text{train}}, y_i^{\text{train}} \right)\) and
  $\mathcal{RN}^{*, \tilde{\lambda}}, f_g^{*, \lambda}$ according to
  Definition~\ref{def:rpnn} and Definition~\ref{def:wrs} respectively,
  under Assumption~\ref{ass:theo38} it holds that
  \begin{equation}
    \label{eq:main1}
    \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f^{*,
        \lambda}_{g}}_{W^{1,\infty}(K)} = 0,
  \end{equation}
  with
  \begin{align*}
    g(x) & \coloneqq g_{\xi}(x)\mathbb{E}\left[ v_k^2 \vert \xi_k = x
           \right], \forall x \in \mathbb{R}, \\
    \tilde{\lambda} & \coloneqq \lambda n g(0).
  \end{align*}
\end{Theorem}
As mentioned above, we will prove Theorem~\ref{theo:main1} utilizing
the intermediary functions $f^{w^{*,\tilde{\lambda}}}$ and
$\mathcal{RN}_{\tilde{w}}$.
We show that
\begin{equation}
  \label{eq:main2}
  \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} - f^{w^*}}_{W^{1,
      \infty}(K)} = 0
\end{equation}
and
\begin{equation}
  \label{eq:main3}
  \plimn \norm{f^{w^*} - f_g^{*, \lambda}}_{W^{1,\infty}(K)} = 0
\end{equation}
and then obtain (\ref{eq:main1}) using the triangle inequality. In
order to prove (\ref{eq:main2}) and (\ref{eq:main3}) we will need to
introduce a number of auxiliary lemmata; their proofs are provided in
the appendix.

\begin{Lemma}[Poincar\'e typed inequality]
  \label{lem:pieq}
  Let \(f:\mathbb{R} \to \mathbb{R}\) be differentiable with \(f' :
  \mathbb{R} \to \mathbb{R}\) Lebesgue integrable. Then for
  \(K=[a,b] \subset \mathbb{R}\) with \(f(a)=0\) it holds that
  \begin{equation*}
    \label{eq:pti1}
    \exists C_K^{\infty} \in \mathbb{R}_{>0} :
    \norm{f}_{W^{1,\infty}(K)} \leq C_K^{\infty}
    \norm{f'}_{L^{\infty}(K)}.
  \end{equation*}
  If additionally \(f'\) is differentiable with \(f'': \mathbb{R} \to
  \mathbb{R}\) Lebesgue integrable, then
  \begin{equation*}
    \label{eq:pti2}
    \exists C_K^2 \in \mathbb{R}_{>0} : \norm{f}_{W^{1,\infty}(K)}
    \leq C_K^2 \norm{f''}_{L^2(K)}.
  \end{equation*}
  \proof
  The proof is given in the appendix.
\end{Lemma}

\begin{Lemma}
  \label{lem:cnvh}
  Let $\mathcal{RN}$ be a randomized shallow Neural network. For
  \(\varphi : \mathbb{R}^2 \to \mathbb{R}\) uniformly continuous such
  that
  \[
    \forall x \in \supp(g_{\xi}) : \mathbb{E}\left[\varphi(\xi, v)
      \frac{1}{n g_{\xi}(\xi)} \vert \xi = x \right] < \infty,
  \]
  it holds that
  \[
    \plimn \sum_{k \in \kappa : \xi_k < T} \varphi(\xi_k, v_k) h_{k,n}
    =\int_{\min\left\{C_{g_{\xi}}^l,
        T\right\}}^{\min\left\{C_{g_{\xi}}^u,T\right\}}
    \mathbb{E}\left[\varphi(\xi, v) \vert \xi = x \right] dx
  \]
  uniformly in \(T \in K\).
  \proof
  The proof is given in the appendix.
\end{Lemma}

\begin{Lemma}
  For any $\lambda > 0$, training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2$, with $ i \in
  \left\{1,\dots,N\right\}$, and subset $K \subset \mathbb{R}$, the
  spline approximating randomized shallow neural network
  $\mathcal{RN}_{\tilde{w}}$ converges to the adapted weighted cubic
  smoothing spline $f^{*, \lambda}_g$ in $\norm{.}_{W^{1,\infty}(K)}$
  as the node count $n$ increases,
  \begin{equation}
    \label{eq:s0}
    \plimn \norm{\mathcal{RN}_{\tilde{w}} - f^{*,
        \lambda}_g}_{W^{1, \infty}(K)} = 0.
  \end{equation}
  \proof
  Using Lemma~\ref{lem:pieq} it is sufficient to show
  \[
    \plimn \norm{\mathcal{RN}_{\tilde{w}}' - (f^{*,
        \lambda}_g)'}_{L^{\infty}} = 0.
  \]
  This can be achieved by using Lemma~\ref{lem:cnvh} with
  $\varphi(\xi_k, v_k) = \frac{v_k^2}{\mathbb{E}[v^2|\xi = \xi_k]}
  (f^{*, \lambda}_g)''(\xi_k)$, thus obtaining
  \begin{align*}
    \plimn \frac{\partial \mathcal{RN}_{\tilde{w}}}{\partial x}
    \stackrel{(\ref{eq:derivnn})}{=} & \plimn \sum_{\substack{k \in
    \mathbb{N} \\ \xi_k < x}} \frac{v_k^2}{\mathbb{E}[v^2 \vert \xi =
    \xi_k]} (f_g^{*, \lambda})''(\xi_k) h_{k,n}
    \stackrel{\text{Lemma}~\ref{lem:cnvh}}{=} \\
    \stackrel{\phantom{(\ref{eq:derivnn})}}{=} &
    \int_{\min\left\{C_{g_{\xi}}^l,T\right\}}^{\min\left\{C_{g_{\xi}}^u,T\right\}}
    \mathbb{E}\left[\frac{v^2}{\mathbb{E}[v^2|\xi = x]} (f^{*,
    \lambda}_g)''(\xi) \vert \xi = x \right] dx
    \equals^{\text{Tower-}}_{\text{property}} \\
    \stackrel{\phantom{(\ref{eq:derivnn})}}{=} &
    \int_{\min\left\{C_{g_{\xi}}^l,
    T\right\}}^{\min\left\{C_{g_{\xi}}^u,T\right\}}(f^{*,\lambda}_g)''(x)
    dx.
  \end{align*}
  By the fundamental theorem of calculus and $\supp(f') \subset
  \supp(f)$, (\ref{eq:s0}) follows with Lemma~\ref{lem:pieq}.
  \todo{is the 0 important?}
  \qed
  \label{lem:s0}
\end{Lemma}

\begin{Lemma}
  For any $\lambda > 0$ and training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in
  \left\{1,\dots,N\right\}$, we have
  \[
    \plimn F^{\tilde{\lambda}}_n(\mathcal{RN}_{\tilde{w}}) =
    F^{\lambda, g}(f^{*, \lambda}_g) = 0.
  \]
  \proof
  The proof is given in the appendix.
  \label{lem:s2}
\end{Lemma}

\begin{Lemma}
  For any $\lambda > 0$ and training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in
  \left\{1,\dots,N\right\}$, with $w^*$ as defined in
  Definition~\ref{def:rpnn} and $\tilde{\lambda}$ as defined in
  Theorem~\ref{theo:main1}, it holds
  \[
    \plimn \norm{\mathcal{RN}^{*,\tilde{\lambda}} -
      f^{w^{*,\tilde{\lambda}}}}_{W^{1,\infty}(K)} = 0.
  \]
  \proof
  The proof is given in the appendix.
  \label{lem:s3}
\end{Lemma}

\begin{Lemma}
  For any $\lambda > 0$ and training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in
  \left\{1,\dots,N\right\}$, with $w^*$ and $\tilde{\lambda}$ as
  defined in Definition~\ref{def:rpnn} and Theorem~\ref{theo:main1}
  respectively, it holds
  \[
    \plimn \abs{F_n^{\tilde{\lambda}}(\mathcal{RN}^{*,\tilde{\lambda}})
      - F^{\lambda, g}(f^{w^{*,\tilde{\lambda}}})} = 0.
  \]
  \proof
  The proof is given in the appendix.
  \label{lem:s4}
\end{Lemma}

\begin{Lemma}
  For any $\lambda > 0$ and training data $(x_i^{\text{train}},
  y_i^{\text{train}}) \in \mathbb{R}^2, \, i \in
  \left\{1,\dots,N\right\}$, and any sequence of functions $f^n \in
  W^{2,2}$ with
  \[
    \plimn F^{\lambda, g} (f^n) = F^{\lambda, g}(f^{*, \lambda}_g),
  \]
  it follows that
  \[
    \plimn \norm{f^n - f^{*, \lambda}_g}_{W^{1,\infty}} = 0.
  \]
  \proof
  The proof is given in the appendix.
  \label{lem:s7}
\end{Lemma}
Using these lemmata we can now prove Theorem~\ref{theo:main1}. We
start by showing that the error measure of the smooth approximation
of the ridge penalized randomized shallow neural network, $F^{\lambda,
  g}\left(f^{w^{*,\tilde{\lambda}}}\right)$, converges in probability
to the error measure of the adapted weighted cubic smoothing spline,
$F^{\lambda, g}\left(f^{*,\lambda}_g\right)$, for the specified
parameters.
Using Lemma~\ref{lem:s4} we get that for every $P \in (0,1)$ and
$\varepsilon > 0$ there exists an $n_1 \in \mathbb{N}$ such that
\[
  \mathbb{P}\left[F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right)
    \in
    F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
    +[-\varepsilon, \varepsilon]\right] > P, \forall n \in
  \mathbb{N}_{> n_1}.
\]
As $\mathcal{RN}^{*,\tilde{\lambda}}$ is the optimal network for
$F_n^{\tilde{\lambda}}$ we know that
\[
  F_n^{\tilde{\lambda}}\left(\mathcal{RN}^{*,\tilde{\lambda}}\right)
  \leq F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right).
\]
Using Lemma~\ref{lem:s2} we get that for every $P \in (0,1)$ and
$\varepsilon > 0$ there exists an $n_2 \in \mathbb{N}$ such that
\[
  \mathbb{P}\left[F_n^{\tilde{\lambda}}\left(\mathcal{RN}_{\tilde{w}}\right)
    \in F^{\lambda, g}\left(f^{*,\lambda}_g\right)+[-\varepsilon,
    \varepsilon]\right] > P, \forall n \in \mathbb{N}_{> n_2}.
\]
If we combine these two statements we get that for every $P \in
(0,1)$ and $\varepsilon > 0$, with $n_3 \coloneqq
\max\left\{n_1,n_2\right\}$,
\[
  \mathbb{P}\left[F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right)
    \leq
    F^{\lambda, g}\left(f^{*,\lambda}_g\right)+2\varepsilon\right] >
  P, \forall n \in \mathbb{N}_{> n_3}.
\]
As $f^{w^{*,\tilde{\lambda}}}$ is admissible for the minimization
problem defining $f^{*,\lambda}_g$ and $f^{*,\lambda}_g$ is its
minimizer, we know that
\[
  F^{\lambda, g}\left(f^{*,\lambda}_g\right) \leq F^{\lambda,
    g}\left(f^{w^{*,\tilde{\lambda}}}\right)
\]
and thus get with the squeeze theorem
\[
  \plimn F^{\lambda, g}\left(f^{w^{*,\tilde{\lambda}}}\right) =
  F^{\lambda, g}\left(f^{*,\lambda}_g\right).
\]
We can now use Lemma~\ref{lem:s7} to conclude that
\begin{equation}
  \plimn \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g}
  _{W^{1,\infty}} = 0.
  \label{eq:main4}
\end{equation}
Now by using the triangle inequality with Lemma~\ref{lem:s3} and
(\ref{eq:main4}) we get
\begin{align*}
  \plimn \norm{\mathcal{RN}^{*, \tilde{\lambda}} -
  f_g^{*,\lambda}}_{W^{1,\infty}}
  \leq& \plimn \bigg(\norm{\mathcal{RN}^{*, \tilde{\lambda}} -
       f^{w^{*,\tilde{\lambda}}}}_{W^{1,\infty}}\\
      &+ \norm{f^{w^{*,\tilde{\lambda}}} - f^{*,\lambda}_g}
        _{W^{1,\infty}}\bigg) = 0
\end{align*}
and thus have proven Theorem~\ref{theo:main1}.

We now know that randomized shallow neural networks behave similarly
to spline regression if we regularize the size of the weights during
training.
\textcite{heiss2019} further explore a connection between ridge
penalized networks and randomized shallow neural networks trained
using gradient descent. They come to the conclusion that the effect
of weight regularization can be achieved by stopping the training of
the randomized shallow neural network early, with the number of
training epochs playing the role of the weight penalty. This means
that randomized shallow neural networks trained for a certain number
of iterations converge, as the number of nodes increases, to cubic
smoothing splines with appropriate weights.
\todo{re-read how exactly this was stated}
\newpage

\subsection{Simulations}
In the following, the behaviour described in Theorem~\ref{theo:main1}
is visualized in a simulated example. For this, two sets of training
data have been generated.
\begin{itemize}
\item $\text{data}_A = (x_{i, A}^{\text{train}},
  y_{i,A}^{\text{train}})$ with
  \begin{align*}
    x_{i, A}^{\text{train}} &\coloneqq -\pi + \frac{2 \pi}{5} (i - 1),
                              i \in \left\{1, \dots, 6\right\}, \\
    y_{i, A}^{\text{train}} &\coloneqq \sin( x_{i, A}^{\text{train}}).
                              \phantom{(i - 1), i \in \left\{1, \dots,
                              6\right\}}
  \end{align*}
\item $\text{data}_B = (x_{i, B}^{\text{train}},
  y_{i, B}^{\text{train}})$ with
  \begin{align*}
    x_{i, B}^{\text{train}} &\coloneqq \pi\frac{i - 8}{7},
                              i \in \left\{1, \dots, 15\right\}, \\
    y_{i, B}^{\text{train}} &\coloneqq \sin( x_{i, B}^{\text{train}}).
                              \phantom{(i - 1), i \in \left\{1, \dots,
                              6\right\}}
  \end{align*}
\end{itemize}
For the $\mathcal{RN}$ the random parameters are distributed as
follows
\begin{align*}
  \xi_i &\stackrel{i.i.d.}{\sim} \text{Unif}(-5,5), \\
  v_i &\stackrel{i.i.d.}{\sim} \mathcal{N}(0, 5), \\
  b_i &\stackrel{\phantom{i.i.d.}}{=} -\xi_i v_i.
\end{align*}
Note that, with these choices for the distributions, $g$ as defined
in Theorem~\ref{theo:main1} equates to $g(x) =
\frac{\mathbb{E}[v_k^2|\xi_k = x]}{10}$. In order to utilize the
smoothing spline implemented in Matlab, $g$ has been simplified to $g
\equiv \frac{1}{10}$ instead.
For all figures $f_1^{*, \lambda}$ has been calculated with Matlab's
'smoothingspline'. As this minimizes
\[
  \bar{\lambda} \sum_{i=1}^N(y_i^{\text{train}} -
  f(x_i^{\text{train}}))^2 + (1 - \bar{\lambda}) \int (f''(x))^2 dx,
\]
the smoothing parameter used for fitting is $\bar{\lambda} =
\frac{1}{1 + \lambda}$. The parameter $\tilde{\lambda}$ for training
the networks is chosen as defined in Theorem~\ref{theo:main1} and
each network is trained on the full training data for 5000 epochs
using gradient descent.
The results are given in Figure~\ref{fig:rn_vs_rs}. It can be seen
that on the interval of the training data, $[-\pi, \pi]$, the neural
network and the smoothing spline are nearly identical, in accordance
with Theorem~\ref{theo:main1}.
\input{Figures/RN_vs_RS}
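The described experiment can be re-created in Python. The following
sketch is only an illustration and not the Matlab code used for the
figures: it trains the output weights by plain gradient descent on
$F_n^{\tilde{\lambda}}$ and compares the result to SciPy's
\texttt{make\_smoothing\_spline} (available in recent SciPy versions),
which minimizes a criterion of the same form as the unweighted cubic
smoothing spline; the node count, $\lambda$, step size and random
seed are illustrative choices.
\begin{verbatim}
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(1)

# training data A: six points of a sine wave
x_tr = -np.pi + 2 * np.pi / 5 * np.arange(6)
y_tr = np.sin(x_tr)

# random hidden layer, cf. the distributions above
n = 1000
xi = rng.uniform(-5.0, 5.0, size=n)        # kink positions
v = rng.normal(0.0, 5.0, size=n)           # N(0, 5), read here as std. dev. 5
b = -xi * v

lam = 0.1                                  # spline parameter lambda
lam_tilde = lam * n * (1.0 / 10.0)         # tilde(lambda) = lambda * n * g(0)

Phi = np.maximum(0.0, b[None, :] + v[None, :] * x_tr[:, None])

# gradient descent on F_n(w) = ||Phi w - y||^2 + lam_tilde * ||w||^2
w = np.zeros(n)
step = 1.0 / (2 * (np.linalg.norm(Phi, 2) ** 2 + lam_tilde))
for _ in range(5000):
    w -= step * (2 * Phi.T @ (Phi @ w - y_tr) + 2 * lam_tilde * w)

spline = make_smoothing_spline(x_tr, y_tr, lam=lam)

x_plot = np.linspace(-np.pi, np.pi, 200)
rn_plot = np.maximum(0.0, b[None, :] + v[None, :] * x_plot[:, None]) @ w
print(np.max(np.abs(rn_plot - spline(x_plot))))  # sup-distance on [-pi, pi]
\end{verbatim}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: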