239 lines
13 KiB
TeX
239 lines
13 KiB
TeX
|
%% Template for MLP Coursework 1 / 14 October 2024
|
||
|
|
||
|
%% Based on LaTeX template for ICML 2017 - example_paper.tex at
|
||
|
%% https://2017.icml.cc/Conferences/2017/StyleAuthorInstructions
|
||
|
|
||
|
\documentclass{article}
|
||
|
\input{mlp2022_includes}
|
||
|
|
||
|
|
||
|
\definecolor{red}{rgb}{0.95,0.4,0.4}
|
||
|
\definecolor{blue}{rgb}{0.4,0.4,0.95}
|
||
|
\definecolor{orange}{rgb}{1, 0.65, 0}
|
||
|
|
||
|
\newcommand{\youranswer}[1]{{\color{red} \bf[#1]}} %your answer:
|
||
|
|
||
|
|
||
|
%% START of YOUR ANSWERS
|
||
|
\input{mlp-cw1-questions}
|
||
|
%% END of YOUR ANSWERS
|
||
|
|
||
|
|
||
|
|
||
|
%% Do not change anything in this file. Add your answers to mlp-cw1-questions.tex
|
||
|
|
||
|
|
||
|
|
||
|
\begin{document}
|
||
|
|
||
|
\twocolumn[
|
||
|
\mlptitle{MLP Coursework 1}
|
||
|
\centerline{\studentNumber}
|
||
|
\vskip 7mm
|
||
|
]
|
||
|
|
||
|
|
||
|
\begin{abstract}
|
||
|
In this report we study the problem of overfitting, which is the training regime where performance increases on the training set but decrease on validation data. Overfitting prevents our trained model from generalizing well to unseen data.
|
||
|
We first analyse the given example and discuss the probable causes of the underlying problem.
|
||
|
Then we investigate how the depth and width of a neural network can affect overfitting in a feedforward architecture and observe that increasing width and depth tend to enable further overfitting.
|
||
|
Next we discuss how two standard methods, Dropout and Weight Penalty, can
|
||
|
mitigate overfitting, then describe their implementation and use them in our experiments
|
||
|
to reduce the overfitting on the EMNIST dataset.
|
||
|
Based on our results, we ultimately find that both dropout and weight penalty are able to mitigate overfitting.
|
||
|
Finally, we conclude the report with our observations and related work.
|
||
|
Our main findings indicate that preventing overfitting is achievable through regularization, although combining different methods together is not straightforward.
|
||
|
\end{abstract}
|
||
|
|
||
|
|
||
|
\section{Introduction}
|
||
|
\label{sec:intro}
|
||
|
In this report we focus on a common and important problem while training machine learning models known as overfitting, or overtraining, which is the training regime where performances increase on the training set but decrease on unseen data.
|
||
|
We first start with analysing the given problem in Figure~\ref{fig:example}, study it in different architectures and then investigate different strategies to mitigate the problem.
|
||
|
In particular, Section~\ref{sec:task1} identifies and discusses the given problem, and investigates the effect of network width and depth in terms of generalization gap (see Ch.~5 in \citealt{Goodfellow-et-al-2016}) and generalization performance.
|
||
|
Section \ref{sec:task2.1} introduces two regularization techniques to alleviate overfitting: Dropout \cite{srivastava2014dropout} and L1/L2 Weight Penalties (see Section~7.1 in \citealt{Goodfellow-et-al-2016}).
|
||
|
We first explain them in detail and discuss why they are used for alleviating overfitting.
|
||
|
In Section~\ref{sec:task2.2}, we incorporate each of them and their various combinations to a three hidden layer\footnote{We denote all layers as hidden except the final (output) one. This means that depth of a network is equal to the number of its hidden layers + 1.} neural network, train it on the EMNIST dataset, which contains 131,600 images of characters and digits, each of size 28x28, from 47 classes.
|
||
|
We evaluate them in terms of generalization gap and performance, and discuss the results and effectiveness of the tested regularization strategies.
|
||
|
Our results show that both dropout and weight penalty are able to mitigate overfitting.
|
||
|
|
||
|
Finally, we conclude our study in Section~\ref{sec:concl}, noting that preventing overfitting is achievable through regularization, although combining different methods together is not straightforward.
|
||
|
|
||
|
|
||
|
\section{Problem identification}
|
||
|
\label{sec:task1}
|
||
|
|
||
|
\begin{figure}[t]
|
||
|
\centering
|
||
|
\begin{subfigure}{\linewidth}
|
||
|
\includegraphics[width=\linewidth]{figures/fig1_acc.png}
|
||
|
\caption{accuracy by epoch}
|
||
|
\label{fig:example_acccurves}
|
||
|
\end{subfigure}
|
||
|
\begin{subfigure}{\linewidth}
|
||
|
\centering
|
||
|
\includegraphics[width=\linewidth]{figures/fig1_err.png}
|
||
|
\caption{error by epoch}
|
||
|
\label{fig:example_errorcurves}
|
||
|
\end{subfigure}
|
||
|
\caption{Training and validation curves in terms of classification accuracy (a) and cross-entropy error (b) on the EMNIST dataset for the baseline model.}
|
||
|
\label{fig:example}
|
||
|
\end{figure}
|
||
|
|
||
|
Overfitting to training data is a very common and important issue that needs to be dealt with when training neural networks or other machine learning models in general (see Ch.~5 in \citealt{Goodfellow-et-al-2016}).
|
||
|
A model is said to be overfitting when as the training progresses, its performance on the training data keeps improving, while its is degrading on validation data.
|
||
|
Effectively, the model stops learning related patterns for the task and instead starts to memorize specificities of each training sample that are irrelevant to new samples.
|
||
|
Overfitting leads to bad generalization performance in unseen data, as performance on validation data is indicative of performance on test data and (to an extent) during deployment.
|
||
|
|
||
|
Although it eventually happens to all gradient-based training, it is most often caused by models that are too large with respect to the amount and diversity of training data. The more free parameters the model has, the easier it will be to memorize complex data patterns that only apply to a restricted amount of samples.
|
||
|
A prominent symptom of overfitting is the generalization gap, defined as the difference between the validation and training error.
|
||
|
A steady increase in this quantity is usually interpreted as the model entering the overfitting regime.
|
||
|
|
||
|
|
||
|
Figure~\ref{fig:example_acccurves} and \ref{fig:example_errorcurves} show a prototypical example of overfitting.
|
||
|
We see in Figure~\ref{fig:example_acccurves} that \questionOne.
|
||
|
|
||
|
The extent to which our model overfits depends on many factors.
|
||
|
For example, the quality and quantity of the training set and the complexity of the model.
|
||
|
If we have sufficiently many diverse training samples, or if our model contains few hidden units, it will in general be less prone to overfitting.
|
||
|
Any form of regularization will also limit the extent to which the model overfits.
|
||
|
|
||
|
|
||
|
\subsection{Network width}
|
||
|
|
||
|
\questionTableOne
|
||
|
\questionFigureTwo
|
||
|
|
||
|
First we investigate the effect of increasing the number of hidden units in a single hidden layer network when training on the EMNIST dataset.
|
||
|
The network is trained using the Adam optimizer
|
||
|
with a learning rate of $9 \times 10^{-4}$ and a batch size of 100, for a total of 100 epochs.
|
||
|
|
||
|
The input layer is of size 784, and output layer consists of 47 units.
|
||
|
Three different models were trained, with a single hidden layer of 32, 64 and 128 ReLU hidden units respectively.
|
||
|
Figure~\ref{fig:width} depicts the error and accuracy curves over 100 epochs for the model with varying number of hidden units.
|
||
|
Table~\ref{tab:width_exp} reports the final accuracy, training error, and validation error.
|
||
|
We observe that \questionTwo.
|
||
|
|
||
|
\questionThree.
|
||
|
|
||
|
|
||
|
\subsection{Network depth}
|
||
|
|
||
|
\questionTableTwo
|
||
|
\questionFigureThree
|
||
|
|
||
|
Next we investigate the effect of varying the number of hidden layers in the network.
|
||
|
Table~\ref{tab:depth_exps} and Figure~\ref{fig:depth} depict results from training three models with one, two and three hidden layers respectively, each with 128 ReLU hidden units.
|
||
|
As with previous experiments, they are trained with the Adam optimizer with a learning rate of $9 \times 10^{-4}$ and a batch size of 100.
|
||
|
|
||
|
We observe that \questionFour.
|
||
|
|
||
|
\questionFive.
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
\section{Regularization}
|
||
|
\label{sec:task2.1}
|
||
|
|
||
|
In this section, we investigate three regularization methods to alleviate the overfitting problem, specifically dropout layers, the L1 and L2 weight penalties and label smoothing.
|
||
|
|
||
|
|
||
|
\subsection{Dropout}
|
||
|
|
||
|
Dropout~\cite{srivastava2014dropout} is a stochastic method that randomly inactivates neurons in a neural network according to an hyperparameter, the inclusion rate (\textit{i.e.} the rate that an unit is included).
|
||
|
Dropout is commonly represented by an additional layer inserted between the linear layer and activation function.
|
||
|
Its forward pass during training is defined as follows:
|
||
|
\begin{align}
|
||
|
\text{mask} &\sim \text{bernoulli}(p)\\
|
||
|
\bm{y}' &= \text{mask} \odot \bm{y}\
|
||
|
\end{align}
|
||
|
where $\bm{y}, \bm{y}' \in \mathbb{R}^d$ are the output of the linear layer before and after applying dropout, respectively.
|
||
|
$\text{mask} \in \mathbb{R}^d$ is a mask vector randomly sampled from the Bernoulli distribution with inclusion probability $p$, and $\odot$ denotes the element-wise multiplication.
|
||
|
|
||
|
At inference time, stochasticity is not desired, so no neurons are dropped.
|
||
|
To account for the change in expectations of the output values, we scale them down by the inclusion probability $p$:
|
||
|
\begin{align}
|
||
|
\bm{y}' &= \bm{y}*p\
|
||
|
\end{align}
|
||
|
|
||
|
As there is no nonlinear calculation involved, the backward propagation is just the element-wise product of the gradients with respect to the layer outputs and mask created in the forward calculation.
|
||
|
The backward propagation for dropout is therefore formulated as follows:
|
||
|
\begin{align}
|
||
|
\frac{\partial \bm{y}'}{\partial \bm{y}} = mask
|
||
|
\end{align}
|
||
|
|
||
|
Dropout is an easy to implement and highly scalable method.
|
||
|
It can be implemented as a layer-based calculation unit, and be placed on any layer of the neural network at will.
|
||
|
Dropout can reduce the dependence of hidden units between layers so that the neurons of the next layer will not rely on only few features from of the previous layer.
|
||
|
Instead, it forces the network to extract diverse features and evenly distribute information among all features.
|
||
|
By randomly dropping some neurons in training, dropout makes use of a subset of the whole architecture, so it can also be viewed as bagging different sub networks and averaging their outputs.
|
||
|
|
||
|
|
||
|
\subsection{Weight penalty}
|
||
|
|
||
|
L1 and L2 regularization~\cite{ng2004feature} are simple but effective methods to mitigate overfitting to training data.
|
||
|
The application of L1 and L2 regularization strategies could be formulated as adding penalty terms with L1 and L2 norm square of weights in the cost function without changing other formulations.
|
||
|
The idea behind this regularization method is to penalize the weights by adding a term to the cost function, and explicitly constrain the magnitude of the weights with either the L1 and L2 norms.
|
||
|
The optimization problem takes a different form:
|
||
|
\begin{align}
|
||
|
\text{L1: } & \text{min}_{\bm{w}} \; E_{\text{data}}(\bm{X}, \bm{y}, \bm{w}) + \lambda ||w||_1\\
|
||
|
\text{L2: } & \text{min}_{\bm{w}} \; E_{\text{data}}(\bm{X}, \bm{y}, \bm{w}) + \lambda ||w||^2_2
|
||
|
\end{align}
|
||
|
where $E_{\text{data}}$ denotes the cross entropy error function, and $\{\bm{X}, \bm{y}\}$ denotes the input and target training pairs.
|
||
|
$\lambda$ controls the strength of regularization.
|
||
|
|
||
|
Weight penalty works by constraining the scale of parameters and preventing them to grow too much, avoiding overly sensitive behaviour on unseen data.
|
||
|
While L1 and L2 regularization are similar to each other in calculation, they have different effects.
|
||
|
Gradient magnitude in L1 regularization does not depend on the weight value and tends to bring small weights to 0, which can be used as a form of feature selection, whereas L2 regularization tends to shrink the weights to a smaller scale uniformly.
|
||
|
|
||
|
|
||
|
\subsection{Label smoothing}
|
||
|
Label smoothing regularizes a model based on a softmax with $K$ output values by replacing the hard target 0 labels and 1 labels with $\frac{\alpha}{K-1}$ and $1-\alpha$ respectively.
|
||
|
$\alpha$ is typically set to a small number such as $0.1$.
|
||
|
\begin{equation}
|
||
|
\begin{cases}
|
||
|
\frac{\alpha}{K-1}, & \quad \text{if} \quad t_k=0\\
|
||
|
1 - \alpha, & \quad \text{if} \quad t_k=1
|
||
|
\end{cases}
|
||
|
\end{equation}
|
||
|
The standard cross-entropy error is typically used with these \emph{soft} targets to train the neural network.
|
||
|
Hence, implementing label smoothing requires only modifying the targets of training set.
|
||
|
This strategy may prevent a neural network to obtain very large weights by discouraging very high output values.
|
||
|
|
||
|
\section{Balanced EMNIST Experiments}
|
||
|
|
||
|
\questionTableThree
|
||
|
|
||
|
\questionFigureFour
|
||
|
|
||
|
\label{sec:task2.2}
|
||
|
|
||
|
Here we evaluate the effectiveness of the given regularization methods for reducing the overfitting on the EMNIST dataset.
|
||
|
We build a baseline architecture with three hidden layers, each with 128 neurons, which suffers from overfitting as shown in section \ref{sec:task1}.
|
||
|
|
||
|
Here we train the network with a lower learning rate of $10^{-4}$, as the previous runs were overfitting after only a handful of epochs.
|
||
|
Results for the new baseline (c.f. Table~\ref{tab:hp_search}) confirm that lower learning rate helps, so all further experiments are run using it.
|
||
|
|
||
|
Here, we apply the L1 or L2 regularization with dropout to our baseline and search for good hyperparameters on the validation set.
|
||
|
Then, we apply the label smoothing with $\alpha=0.1$ to our baseline.
|
||
|
We summarize all the experimental results in Table~\ref{tab:hp_search}. For each method except the label smoothing, we plot the relationship between generalization gap and validation accuracy in Figure~\ref{fig:hp_search}.
|
||
|
|
||
|
First we analyze three methods separately, train each over a set of hyperparameters and compare their best performing results.
|
||
|
|
||
|
\questionSix.
|
||
|
|
||
|
\questionSeven.
|
||
|
|
||
|
|
||
|
\section{Conclusion}
|
||
|
\label{sec:concl}
|
||
|
|
||
|
\questionEight.
|
||
|
|
||
|
\newpage
|
||
|
\bibliography{refs}
|
||
|
|
||
|
\end{document}
|
||
|
|