\section{Introduction}

Neural networks have become a widely used model for a plethora of applications. They are an attractive choice because they can model complex relationships while requiring little input beyond the training data itself. Additionally, as the price of parallel computing power in the form of graphics processing units (GPUs) has dropped drastically in recent years, training and using large neural networks has become far more accessible. Furthermore, highly optimized and parallelized frameworks for tensor operations have been developed; with frameworks such as TensorFlow and PyTorch, building neural networks has become a much more straightforward process.

In this thesis we want to gain an understanding of the behavior of neural networks and of how we can use them for problems with a complex relationship between input and output. In Section~2 we introduce the mathematical construction of neural networks and describe how they are fitted to training data. To gain some insight into the learned function, we examine a simple class of neural networks that contain only one hidden layer. In Section~\ref{sec:shallownn} we prove a relation between such networks and functions that minimize the distance to the training data while being regularized by their second derivative.

An interesting application of neural networks is the task of classifying images. However, for such complex problems the number of parameters in fully connected neural networks can exceed what is feasible for training. In Section~\ref{sec:cnn} we explore the addition of convolutional layers to neural networks as a way to reduce the number of parameters. As these large networks are commonly trained with gradient-based methods, we compare the performance of different gradient descent algorithms in Section~4.4.

Most statistical models, especially those with large numbers of trainable parameters, can struggle with overfitting the training data. In Section~4.5 we examine the impact of two measures designed to combat overfitting. In some applications, such as working with medical images, the data available for training can be scarce, which makes the resulting networks particularly prone to overfitting. As these are interesting applications of neural networks, we also examine the benefit of these measures in scenarios with limited amounts of training data.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: