%% Template for MLP Coursework 2 / 13 November 2023

%% Based on LaTeX template for ICML 2017 - example_paper.tex at
%% https://2017.icml.cc/Conferences/2017/StyleAuthorInstructions

\documentclass{article}
\input{mlp2022_includes}


\definecolor{red}{rgb}{0.95,0.4,0.4}
\definecolor{blue}{rgb}{0.4,0.4,0.95}
\definecolor{orange}{rgb}{1, 0.65, 0}

\newcommand{\youranswer}[1]{{\color{red} \bf[#1]}} %your answer:


%% START of YOUR ANSWERS
\input{mlp-cw2-questions}
%% END of YOUR ANSWERS


%% Do not change anything in this file. Add your answers to mlp-cw2-questions.tex
\begin{document}

\twocolumn[
\mlptitle{MLP Coursework 2}
\centerline{\studentNumber}
\vskip 7mm
]

\begin{abstract}
Deep neural networks have become the state-of-the-art
in many standard computer vision problems thanks to their powerful
representations and the availability of large labeled datasets.
While very deep networks allow their layers to learn more levels of abstraction from the data, training these models successfully is a challenging task due to problematic gradient flow through the layers, known as the vanishing/exploding gradient problem.
In this report, we first analyze this problem in VGG models with 8 and 38 hidden layers on the CIFAR100 image dataset, by monitoring the gradient flow during training.
We then explore known solutions to this problem, including batch normalization and residual connections, and explain their theory and implementation details.
Our experiments show that batch normalization and residual connections effectively address the aforementioned problem and hence enable a deeper model to outperform shallower ones in the same experimental setup.
\end{abstract}

\section{Introduction}
\label{sec:intro}
Despite the remarkable progress of modern convolutional neural networks (CNNs) in image classification problems~\cite{simonyan2014very, he2016deep}, training very deep networks is a challenging procedure.
One of the major problems is the Vanishing Gradient Problem (VGP), a phenomenon where the gradients of the error function with respect to the network weights shrink to zero as they are backpropagated to earlier layers, hence preventing effective weight updates.
This phenomenon is prevalent and has been extensively studied in various deep neural networks including feedforward networks~\cite{glorot2010understanding}, RNNs~\cite{bengio1993problem}, and CNNs~\cite{he2016deep}.
Multiple solutions have been proposed to mitigate this problem, including weight initialization strategies~\cite{glorot2010understanding},
activation functions~\cite{glorot2010understanding}, input normalization~\cite{bishop1995neural},
batch normalization~\cite{ioffe2015batch}, and shortcut connections~\cite{he2016deep, huang2017densely}.

This report focuses on diagnosing the VGP occurring in the VGG38 model\footnote{VGG stands for the Visual Geometry Group at the University of Oxford.} and addressing it by implementing two standard solutions.
In particular, we first study a ``broken'' network in terms of its gradient flow, i.e.\ the L1 norm of the gradients with respect to each layer's weights, and contrast it with that of the healthy and shallower VGG08 to pinpoint the problem.
Next, we review two standard solutions for this problem, batch normalization (BN)~\cite{ioffe2015batch} and residual connections (RC)~\cite{he2016deep}, in detail and discuss how they can address the gradient problem.
We then incorporate batch normalization (denoted as VGG38+BN), residual connections (denoted as VGG38+RC), and their combination (denoted as VGG38+BN+RC) into the given VGG38 architecture.
We train the resulting three configurations, along with the VGG08 and VGG38 models, on the CIFAR100 (pronounced `see far 100') dataset and present the results.
The results show that although the separate use of BN and RC does mitigate the vanishing/exploding gradient problem, thereby enabling effective training of the VGG38 model, the best results are obtained by combining both BN and RC.


\section{Identifying training problems of a deep CNN}
\label{sec:task1}

\begin{figure}[t]
\begin{subfigure}{\linewidth}
\centering
\includegraphics[width=\linewidth]{figures/loss_plot.pdf}
\caption{Cross entropy error per epoch}
\label{fig:loss_curves}
\end{subfigure}

\begin{subfigure}{\linewidth}
\centering
\includegraphics[width=\linewidth]{figures/accuracy_plot.pdf}
\caption{Classification accuracy per epoch}
\label{fig:acc_curves}
\end{subfigure}
\caption{Training curves for VGG08 and VGG38 in terms of (a) cross-entropy error and (b) classification accuracy}
\label{fig:curves}
\end{figure}

\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/grad_flow_vgg08.pdf}
\caption{Gradient flow on VGG08}
\label{fig:grad_flow_08}
\end{figure}

\questionFigureThree

Concretely, training deep neural networks typically involves three steps: the forward
pass, the backward pass (or backpropagation algorithm~\cite{rumelhart1986learning}), and the weight update.
The first step involves passing the input $\bx^{(0)}$ to the network and producing
the network prediction as well as the error value.
In detail, each layer takes in the output of the previous layer and applies
a non-linear transformation:
\begin{equation}
\label{eq.fprop}
\bx^{(l)} = f^{(l)}(\bx^{(l-1)}; W^{(l)})
\end{equation}
where $(l)$ denotes the $l$-th layer in an $L$-layer deep network,
$f^{(l)}(\cdot\,; W^{(l)})$ is a non-linear transformation for layer $l$, and $W^{(l)}$ are the weights of layer $l$.
For instance, in convolutional neural networks $f^{(l)}$ is typically a convolution operation followed by an activation function.
The second step involves the backpropagation algorithm, where we calculate the gradient of an error function $E$ (\textit{e.g.} cross-entropy) with respect to each layer's weights as follows:
\begin{equation}
\label{eq.bprop}
\frac{\partial E}{\partial W^{(l)}} = \frac{\partial E}{\partial \bx^{(L)}} \frac{\partial \bx^{(L)}}{\partial \bx^{(L-1)}} \dots \frac{\partial \bx^{(l+1)}}{\partial \bx^{(l)}}\frac{\partial \bx^{(l)}}{\partial W^{(l)}}.
\end{equation}

This step involves consecutive tensor multiplications between multiple
partial derivative terms.
The final step updates the model weights by using the computed
$\frac{\partial E}{\partial W^{(l)}}$ with an update rule.
The exact update rule depends on the optimizer.
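To make these three steps concrete, the following minimal NumPy sketch (not part of the coursework framework; all variable names are hypothetical) mirrors Eqs.~\ref{eq.fprop} and~\ref{eq.bprop} for a toy network of scalar layers $x^{(l)} = w^{(l)} x^{(l-1)}$ with error $E = x^{(L)}$:
\begin{verbatim}
import numpy as np

# Toy network of scalar layers
# x^(l) = w_l * x^(l-1); E = x^(L).
rng = np.random.default_rng(0)
L = 5
weights = rng.normal(size=L)

# Step 1: forward pass (Eq. 1).
activations = [1.0]       # x^(0), the input
for w in weights:
    activations.append(w * activations[-1])

# Step 2: backward pass (Eq. 2), accumulating
# the chain rule from the output backwards.
grads = np.zeros(L)
running = 1.0             # dE/dx^(L) for E = x^(L)
for l in reversed(range(L)):
    grads[l] = running * activations[l]  # dE/dw_l
    running *= weights[l]                # dE/dx^(l-1)

# Step 3: weight update (plain gradient descent).
weights -= 0.01 * grads
\end{verbatim}
Here \texttt{running} holds the accumulated product of per-layer factors from Eq.~\ref{eq.bprop}; its magnitude is precisely what the following paragraph is concerned with.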

A notorious problem in training deep neural networks is the vanishing/exploding gradient
problem~\cite{bengio1993problem}, which typically occurs in the backpropagation step when some of the partial derivative terms in Eq.~\ref{eq.bprop} contain values consistently larger or smaller than 1.
In this case, due to the multiple consecutive multiplications, the gradients \textit{w.r.t.} the weights can become exponentially small (close to 0) or exponentially large (approaching infinity),
preventing effective learning of the network weights.
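
As a purely illustrative example, suppose every Jacobian factor in Eq.~\ref{eq.bprop} scales the gradient magnitude by roughly $0.8$. After backpropagating through 30 layers, the gradient reaching the earliest layers is attenuated by a factor of
\begin{equation*}
0.8^{30} \approx 1.2 \times 10^{-3},
\end{equation*}
whereas a per-factor scale of $1.2$ would instead amplify it by $1.2^{30} \approx 237$. Early layers therefore receive either negligible or highly unstable weight updates, even though each individual factor looks harmless.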

Figures~\ref{fig:grad_flow_08} and \ref{fig:grad_flow_38} depict the gradient flows through VGG architectures \cite{simonyan2014very} with 8 and 38 layers respectively, trained and evaluated for a total of 100 epochs on the CIFAR100 dataset. \questionOne.

\section{Background Literature}
\label{sec:lit_rev}
In this section we highlight some of the most influential
papers that have been central to overcoming the VGP in
deep CNNs.

\paragraph{Batch Normalization}\cite{ioffe2015batch}
BN seeks to solve the problem of
internal covariate shift (ICS), whereby the distribution of each layer's
inputs changes during training as the parameters of the previous layers change.
The authors argue that without batch normalization, the distribution of
each layer's inputs can vary significantly due to the stochastic nature of randomly sampling mini-batches from the
training set.
Layers in the network must hence continuously adapt to these high-variance distributions, which hinders the rate of convergence of gradient-based optimizers.
This optimization problem is exacerbated further with network depth, due
to the updating of parameters at layer $l$ being dependent on
the previous $l-1$ layers.

It is hence beneficial to embed the normalization of
training data into the network architecture, after work from
LeCun \emph{et al.} showed that training converges faster with
this addition \cite{lecun2012efficient}. Through standardizing
the inputs to each layer, we take a step towards achieving
the fixed distributions of inputs that remove the ill effects
of ICS. Ioffe and Szegedy demonstrate the effectiveness of
their technique by training an ensemble of BN
networks which achieve an accuracy on the ImageNet classification
task exceeding that of humans, in 14 times fewer
training steps than the state-of-the-art of the time.
It should be noted, however, that the exact reason for BN's effectiveness is still not completely understood and remains
an open research question~\cite{santurkar2018does}.

\paragraph{Residual networks (ResNet)}\cite{he2016deep} A well-known way of mitigating the VGP is proposed by He~\emph{et al.} in \cite{he2016deep}. In their paper, the authors depict the error curves of a 20-layer and a 56-layer network to motivate their method. Both the training and testing error of the 56-layer network are significantly higher than those of the shallower one.

\questionTwo.

Residual networks, colloquially
known as ResNets, aim to alleviate the VGP by
incorporating into the network architecture skip connections
that bypass blocks of weight layers.
The authors argue that the resulting mapping is significantly easier
to optimize, since if an identity mapping were optimal, the
network could comfortably learn to push the residual to
zero rather than attempting to fit an identity mapping via
a stack of nonlinear layers.
They bolster their argument
by successfully training ResNets with depths exceeding
1000 layers on the CIFAR10 dataset.
Prior to their work, training even a 100-layer network was regarded
as a great challenge within the deep learning community.
The addition of skip connections mitigates the VGP by
enabling information to flow more freely through the
network architecture without adding extra
parameters or computational complexity.

\section{Solution overview}
\subsection{Batch normalization}
BN has been a standard component in state-of-the-art
convolutional neural networks \cite{he2016deep,huang2017densely}.
% As mentioned in Section~\ref{sec:lit_rev},
Concretely, BN is a
layer transformation that is performed to whiten the activations
originating from each layer.
As computing full dataset statistics at each training iteration
would be computationally expensive, BN computes batch statistics
to approximate them.
Given a minibatch of $B$ training samples and their feature maps
$X = (\bx^1, \bx^2,\ldots , \bx^B)$ at an arbitrary layer, where $X \in \mathbb{R}^{B\times H \times W \times C}$, $H$ and $W$ are the height and width of the feature map and $C$ is the number of channels, batch normalization first computes the following statistics:
\begin{align}
\label{eq.bnstats}
\mu_c &= \frac{1}{BWH} \sum_{n=1}^{B}\sum_{i,j=1}^{H,W} \bx_{cij}^{n}\\
\sigma^2_c &= \frac{1}{BWH}
\sum_{n=1}^{B}\sum_{i,j=1}^{H,W} (\bx_{cij}^{n} - \mu_{c})^2
\end{align} where $c$, $i$ and $j$ index the channel, height and width coordinates of the feature maps, and $\bm{\mu}$ and $\bm{\sigma}^2$ are the per-channel mean and variance of the batch.

BN then applies the following operation to each feature map in the batch, for every $c$, $i$, $j$:
\begin{equation}
\label{eq.bnop}
\text{BN}(\bx_{cij}) = \frac{\bx_{cij} - \mu_{c}}{\sqrt{\sigma^2_{c}} + \epsilon} * \gamma_{c} + \beta_{c}
\end{equation} where $\gamma \in \mathbb{R}^C$ and $\beta\in \mathbb{R}^C$ are learnable parameters and $\epsilon$ is a small constant introduced to ensure numerical stability.

At inference time, using batch statistics is a poor choice, as it introduces noise into the evaluation and might not even be well defined. Therefore, $\bm{\mu}$ and $\bm{\sigma}$ are replaced by running averages of the mean and variance computed during training, which provide a better approximation of the full dataset statistics.
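For concreteness, the following NumPy sketch (a minimal illustration only, not the coursework implementation; the function and variable names are hypothetical) applies Eqs.~\ref{eq.bnstats}--\ref{eq.bnop} in training mode and maintains the running averages used at inference time:
\begin{verbatim}
import numpy as np

def batch_norm_train(x, gamma, beta,
                     run_mu, run_var,
                     eps=1e-5, momentum=0.1):
    # x: (B, H, W, C) feature maps
    # gamma, beta, run_mu, run_var: (C,)
    mu = x.mean(axis=(0, 1, 2))    # Eq. (3)
    var = x.var(axis=(0, 1, 2))    # Eq. (4)
    # Normalize, scale and shift as in Eq. (5).
    x_hat = (x - mu) / (np.sqrt(var) + eps)
    out = gamma * x_hat + beta
    # Running averages, used in place of the
    # batch statistics at inference time.
    run_mu = (1 - momentum) * run_mu + momentum * mu
    run_var = (1 - momentum) * run_var + momentum * var
    return out, run_mu, run_var
\end{verbatim}
At inference time one would normalize with \texttt{run\_mu} and \texttt{run\_var} instead of the batch statistics.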

Recent work
has shown that BatchNorm has a more fundamental
benefit of smoothing the optimization landscape during
training \cite{santurkar2018does}, thus enhancing the predictive
power of gradients as a guide during optimization.
Furthermore, a smoother optimization landscape should
additionally enable the use of a wider range of learning
rates and initialization schemes, which is congruent with the
findings of Ioffe and Szegedy in the original BatchNorm
paper~\cite{ioffe2015batch}.

\subsection{Residual connections}

Residual connections are another approach, used in the state-of-the-art Residual Networks~\cite{he2016deep}, to tackle the vanishing gradient problem.
Introduced by He \emph{et al.}~\cite{he2016deep}, a residual block consists of a
convolutional layer (or a group of convolutional layers) ``short-circuited'' with an identity mapping.
More precisely, given a mapping $F^{(b)}$ that denotes the transformation of block $b$ (multiple consecutive layers), $F^{(b)}$ is applied to the block's input
feature map $\bx^{(b-1)}$ as $\bx^{(b)} = \bx^{(b-1)} + F^{(b)}(\bx^{(b-1)})$.

Intuitively, stacking residual blocks creates an architecture where the input of each block
is given two paths: passing through the convolutions or skipping to the next block. A residual network can therefore be seen as an ensemble model averaging every sub-network
created by choosing one of the two paths. The skip connections allow gradients to flow
easily into the early layers, since
\begin{equation}
\frac{\partial \bx^{(b)}}{\partial \bx^{(b-1)}} = \mathbbm{1} + \frac{\partial F^{(b)}(\bx^{(b-1)})}{\partial \bx^{(b-1)}}
\label{eq.grad_skip}
\end{equation} where $\bx^{(b-1)} \in \mathbb{R}^{C \times H \times W }$ and $\mathbbm{1}$ is an $\mathbb{R}^{C \times H \times W}$-dimensional tensor of ones, with $C$, $H$ and $W$ denoting the number of feature maps, their height and their width respectively.
Importantly, the identity term $\mathbbm{1}$ prevents the gradient from vanishing even when $\partial F^{(b)}/\partial \bx^{(b-1)}$ is small.
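A small autograd check (hypothetical toy code, not part of the coursework framework) makes Eq.~\ref{eq.grad_skip} tangible: even when the residual branch contributes almost no gradient, the identity path keeps $\partial \bx^{(b)}/\partial \bx^{(b-1)}$ close to one:
\begin{verbatim}
import torch

# Input of a toy residual block.
x = torch.randn(16, requires_grad=True)

# A "dead" residual branch: zero weights,
# so dF/dx is zero.
W = torch.zeros(16, 16)
F = torch.tanh(x @ W)

y_residual = x + F   # with skip (Eq. 6)
y_plain = F          # without skip

y_residual.sum().backward(retain_graph=True)
print(x.grad)  # ~1: identity path keeps gradient

x.grad = None
y_plain.sum().backward()
print(x.grad)  # ~0: gradient vanishes
\end{verbatim}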

\section{Experiment Setup}

\questionFigureFour

\questionFigureFive

\questionTableOne

We conduct our experiments on the CIFAR100 dataset \cite{krizhevsky2009learning},
which consists of 60,000 $32\times32$ colour images from 100 different classes. The number of samples per class is balanced, and the
samples are split into training, validation, and test sets while
maintaining balanced class proportions. In total, there are 47,500; 2,500; and 10,000 instances in the training, validation,
and test sets, respectively. Moreover, we apply data augmentation strategies (cropping, horizontal flipping) to improve the generalization of the model.
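As an illustration only (the coursework uses its own data provider, so this torchvision-based sketch and its crop padding are assumptions), the augmentation described above could be expressed as:
\begin{verbatim}
import torchvision.transforms as T
from torchvision.datasets import CIFAR100

# Hypothetical training-time pipeline: random
# crop (padding assumed) plus horizontal flip;
# no augmentation at test time.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
test_transform = T.ToTensor()

train_set = CIFAR100(root="data", train=True,
                     download=True,
                     transform=train_transform)
test_set = CIFAR100(root="data", train=False,
                    download=True,
                    transform=test_transform)
\end{verbatim}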

With the goal of understanding whether BN or skip connections
help fight vanishing gradients, we first test these
methods independently, before combining them in an attempt
to fully exploit the depth of the VGG38 model.

All experiments are conducted using the Adam optimizer with the default
learning rate of 1e-3 (unless otherwise specified), cosine annealing of the learning rate, and a batch size of 100,
for 100 epochs.
Additionally, training images are augmented with random
cropping and horizontal flipping.
Note that we do not use data augmentation at test time.
These hyperparameters, along with the augmentation strategy, are used
to produce the results shown in Fig.~\ref{fig:curves}.
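A minimal sketch of this optimization setup, assuming a PyTorch training loop (the names \texttt{model} and \texttt{num\_epochs} are placeholders, not coursework code):
\begin{verbatim}
import torch

# Placeholder model; in the report this would be
# VGG08 / VGG38 (+BN / +RC).
model = torch.nn.Linear(10, 100)
num_epochs = 100

# Adam with the default learning rate of 1e-3 and
# cosine annealing over the full 100 epochs.
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3)
scheduler = torch.optim.lr_scheduler.\
    CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... one pass over the training set with
    # batch size 100 goes here ...
    scheduler.step()
\end{verbatim}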

When used, BN is applied
after each convolutional layer, before the Leaky
ReLU non-linearity.
Similarly, the skip connections run from
before the convolutional layers to before the final activation function
of the block, as per Fig.~2 of \cite{he2016deep}.
Note that adding residual connections between the feature maps before and after downsampling requires special treatment, as there is a dimension mismatch between them.
Therefore, in this coursework, we do not use residual connections in the downsampling blocks; batch normalization, however, is still applied in these blocks.
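For illustration, a PyTorch-style sketch of such a non-downsampling processing block (an interpretation of the description above, not the provided coursework code) could look as follows:
\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlockBNRC(nn.Module):
    # Conv -> BN -> LeakyReLU twice, with an
    # identity skip around the whole block.
    def __init__(self, channels):
        super().__init__()
        # Equal in/out channels, stride 1 and
        # 'same' padding, so the identity skip
        # needs no projection.
        self.conv1 = nn.Conv2d(channels, channels,
            kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels,
            kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.leaky_relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x  # skip, before final activation
        return F.leaky_relu(out)
\end{verbatim}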

\subsection{Residual Connections to Downsampling Layers}
\label{subsec:rescimp}

\questionThree.

\section{Results and Discussion}
\label{sec:disc}

\questionFour.

\section{Conclusion}
\label{sec:concl}

\questionFive.

\bibliography{refs}

\end{document}