%% REPLACE sXXXXXXX with your student number
\def\studentNumber{s2759177}
%% START of YOUR ANSWERS

%% Add answers to the questions below, by replacing the text inside the brackets {} for \youranswer{ "Text to be replaced with your answer." }.
%
% Do not delete the commands for adding figures and tables. Instead fill in the missing values with your experiment results, and replace the images with your own respective figures.
%
% You can generally delete the placeholder text, such as for example the text "Question Figure 3 - Replace the images ..."
%
% There are 5 TEXT QUESTIONS. Replace the text inside the brackets of the command \youranswer with your answer to the question.
%
% There are also 3 "questions" to replace some placeholder FIGURES with your own, and 1 "question" asking you to fill in the missing entries in the TABLE provided.
%
% NOTE! that questions are ordered by the order of appearance of their answers in the text, and not necessarily by the order you should tackle them. You should attempt to fill in the TABLE and FIGURES before discussing the results presented there.
%
% NOTE! If for some reason you do not manage to produce results for some FIGURES and the TABLE, then you can get partial marks by discussing your expectations of the results in the relevant TEXT QUESTIONS. The TABLE specifically has enough information in it already for you to draw meaningful conclusions.
%
% Please refer to the coursework specification for more details.
%% - - - - - - - - - - - - TEXT QUESTIONS - - - - - - - - - - - -

%% Question 1:
% Use Figures 1, 2, and 3 to identify the Vanishing Gradient Problem (which of these models suffers from it, and what are the consequences depicted?).
% The average length for an answer to this question is approximately 1/5 of the columns in a 2-column page

\newcommand{\questionOne} {
\youranswer{
Figures 1 and 2 show the 8-layer network (VGG08) learning steadily: its loss decreases and its accuracy improves over training, even though the final accuracy is modest. The 38-layer network (VGG38), by contrast, fails to learn at all; its loss and accuracy remain essentially flat. Figure 3 reveals the cause: the average gradient magnitudes in VGG38 are close to zero in all but the last few layers, so backpropagation delivers almost no signal to the earlier layers and their weights are barely updated. The deeper network is therefore unable to extract meaningful features or reduce its loss.

We conclude that VGG08 trains nominally, maintaining a healthy gradient flow across all of its layers, whereas VGG38 suffers from the vanishing gradient problem: its gradients diminish to near zero in the early layers, which prevents effective weight updates, nullifies the advantage of the added depth, and leaves both training and validation performance stagnant.
}
}
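% Illustrative note for the answer above (kept as a comment, not compiled into the report):
% the vanishing-gradient behaviour follows from the backpropagation chain rule. Schematically,
% for hidden activations h^{(k)} in an L-layer network, the gradient reaching the weights of an
% early layer l is a product of per-layer Jacobians,
%   \frac{\partial \mathcal{L}}{\partial W^{(l)}}
%     = \frac{\partial \mathcal{L}}{\partial h^{(L)}}
%       \Big( \prod_{k=l+1}^{L} \frac{\partial h^{(k)}}{\partial h^{(k-1)}} \Big)
%       \frac{\partial h^{(l)}}{\partial W^{(l)}},
% so when the per-layer Jacobian norms are typically below one, the product shrinks roughly
% geometrically with depth. At 38 layers this leaves the early layers with near-zero gradients
% (Figure 3), whereas 8 layers are shallow enough for a usable learning signal to survive.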

%% Question 2:
% Consider these results (including Figure 1 from \cite{he2016deep}). Discuss the relation between network capacity and overfitting, and whether, and how, this is reflected in these results. What other factors may have led to this difference in performance?
% The average length for an answer to this question is approximately 1/5 of the columns in a 2-column page

\newcommand{\questionTwo} {
\youranswer{In principle, the much larger capacity of VGG38 should make it more prone to overfitting than VGG08, that is, to fitting the training data well while generalizing poorly. Our results do not show this pattern: VGG38 has higher training error as well as higher validation error than VGG08, consistent with Figure 1 of \cite{he2016deep}, where the 56-layer network exhibits higher training error and, consequently, higher test error than the 20-layer network. The degradation we observe is therefore not classical overfitting but a failure of optimization: the vanishing gradient problem prevents VGG38 from exploiting its additional capacity in the first place.

Capacity nevertheless remains relevant. If the optimization issue were resolved, the greater capacity of VGG38 would increase the risk of overfitting on a dataset of this size, so factors such as the strength of regularization and the amount of data augmentation would also shape the gap between training and validation performance and may contribute to the observed difference between the two architectures.}
}

%% Question 3:
% In this coursework, we didn't incorporate residual connections to the downsampling layers. Explain and justify what would need to be changed in order to add residual connections to the downsampling layers. Give and explain 2 ways of incorporating these changes and discuss pros and cons of each.

\newcommand{\questionThree} {
\youranswer{
Our work does not add residual connections across the downsampling layers because downsampling changes the shape of the feature map: the spatial dimensions of the output no longer match those of the input, so the two cannot be added element-wise. To introduce a shortcut here, the residual branch must apply the same shape change. One approach is a projection shortcut: a $1\times 1$ convolution whose stride and padding match the downsampling operation, transforming the input to exactly the shape of the output. Another approach is to apply average or max pooling directly on the residual branch to match the spatial dimensions, followed by a linear transformation to align the channel dimensions.

The difference between the two is that the $1\times 1$ convolution is learned, which makes the shortcut more flexible and can improve model expressiveness at the cost of additional parameters and computation, whereas the pooling-based shortcut is computationally cheaper and simpler but may lose fine-grained information because the pooling operation is fixed and non-learnable.
}
}
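% Illustrative sketch of the two downsampling shortcuts discussed in Question 3 above, kept as a
% comment so it does not affect the compiled report. This is a minimal PyTorch sketch under the
% assumption that a downsampling block halves the spatial resolution and changes the channel
% count; the class and argument names are hypothetical, not the coursework's own code.
%
% import torch
% import torch.nn as nn
% import torch.nn.functional as F
%
% class DownsampleResidualBlock(nn.Module):
%     """Convolutional block that halves H and W, with a selectable residual shortcut."""
%     def __init__(self, in_channels, out_channels, shortcut="projection"):
%         super().__init__()
%         # Main path: a strided 3x3 convolution performs the downsampling.
%         self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1)
%         self.bn = nn.BatchNorm2d(out_channels)
%         if shortcut == "projection":
%             # Option 1: learned 1x1 convolution with matching stride (flexible, extra compute).
%             self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2)
%         else:
%             # Option 2: fixed average pooling matches the spatial size, then a 1x1
%             # convolution (a per-pixel linear map) aligns the channel count.
%             self.shortcut = nn.Sequential(
%                 nn.AvgPool2d(kernel_size=2, stride=2),
%                 nn.Conv2d(in_channels, out_channels, kernel_size=1),
%             )
%     def forward(self, x):
%         out = F.relu(self.bn(self.conv(x)))
%         return out + self.shortcut(x)  # shapes now match, so the residual addition is valid
%
% # Quick shape check for both variants: each maps (4, 32, 16, 16) -> (4, 64, 8, 8).
% x = torch.randn(4, 32, 16, 16)
% for mode in ("projection", "pooling"):
%     print(mode, DownsampleResidualBlock(32, 64, shortcut=mode)(x).shape)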

%% Question 4:
% Question 4 - Present and discuss the experiment results (all of the results and not just the ones you had to fill in) in Table 1 and Figures 4 and 5 (you may use any of the other Figures if you think they are relevant to your analysis). You will have to determine what data are relevant to the discussion, and what information can be extracted from it. Also, discuss what further experiments you would have run on any combination of VGG08, VGG38, BN, RC in order to
% \begin{itemize}
% \item Improve performance of the model trained (explain why you expect your suggested experiments will help with this).
% \item Learn more about the behaviour of BN and RC (explain what you are trying to learn and how).
% \end{itemize}
%
% The average length for an answer to this question is approximately 1 of the columns in a 2-column page

\newcommand{\questionFour} {
\youranswer{
Our results demonstrate the effectiveness of batch normalization (BN) \citep{ioffe2015batch} and residual connections (RC) \citep{he2016deep} in enabling the training of deep convolutional networks, as shown by the large improvement in training and validation performance of VGG38 once these techniques are added. Table~\ref{tab:CIFAR_results} shows that adding BN alone (VGG38 BN) reduces both training and validation loss relative to the baseline VGG38, with validation accuracy rising from near zero to $47.68\%$ at a learning rate (LR) of $1\mathrm{e}{-3}$. Adding RC alone helps even more, with VGG38 RC reaching $52.32\%$ validation accuracy under the same conditions. The combination of the two (VGG38 BN + RC) yields the best results, $53.76\%$ validation accuracy at LR $1\mathrm{e}{-3}$, and it benefits markedly from a higher learning rate, improving further to $58.20\%$ at LR $1\mathrm{e}{-2}$. BN alone, in contrast, deteriorates at the higher learning rate, with validation accuracy dropping to $46.72\%$, which underlines the stabilizing role of RC. \autoref{fig:training_curves_bestModel} confirms the synergy of BN and RC: the VGG38 BN + RC model reaches roughly $74\%$ training accuracy and plateaus near $60\%$ validation accuracy. \autoref{fig:avg_grad_flow_bestModel} shows stable gradient flow across the network, with BN mitigating vanishing gradients and RC maintaining gradient propagation through the deeper layers, particularly in the later stages of the network.

While this work did not evaluate residual connections on the downsampling layers, a natural further experiment would be to train VGG38 BN + RC with each of the two shortcut designs discussed earlier ($1\times 1$ projection versus pooling-based), comparing their effect on gradient flow, feature learning, and overall performance. Such an evaluation would clarify whether the additional computational cost of the $1\times 1$ convolution is justified by improved accuracy, or whether the simpler pooling-based approach suffices, particularly when computational efficiency is a priority.
}
}

%% Question 5:
% Briefly draw your conclusions based on the results from the previous sections (what are the take-away messages?) and conclude your report with a recommendation for future work.
%
% Good recommendations for future work also draw on the broader literature (the papers already referenced are good starting points). Great recommendations for future work are not just incremental (an example of an incremental suggestion would be: ``we could also train with different learning rates'') but instead also identify meaningful questions or, in other words, questions with answers that might be somewhat more generally applicable.
%
% For example, \citep{huang2017densely} end with \begin{quote}``Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features, e.g., [4,5].''\end{quote}
%
% while \cite{bengio1993problem} state in their conclusions that \begin{quote}``There remains theoretical questions to be considered, such as whether the problem with simple gradient descent discussed in this paper would be observed with chaotic attractors that are not hyperbolic.''\\\end{quote}
%
% The length of this question description is indicative of the average length of a conclusion section

\newcommand{\questionFive} {
\youranswer{
The results presented showcase a clear remedy for the vanishing gradient problem. With batch normalization and residual connections we are able to train much deeper networks effectively, as evidenced by the improved performance of VGG38 once these modifications are added. Their combination not only stabilizes gradient flow but also improves both training and validation accuracy, particularly when paired with an appropriate learning rate. These findings reinforce the utility of the architectural innovations proposed in \cite{he2016deep} and \cite{ioffe2015batch}, which have become foundational in modern deep learning.

While these methods enable the training of deeper neural networks, the question of how well such architectural enhancements generalize across datasets and tasks remains open. Future work could investigate the effectiveness of BN and RC on large-scale datasets such as ImageNet, or in domains such as natural language processing and generative modelling, where deep architectures also face optimization challenges. Exploring the interplay between residual connections and attention mechanisms \citep{vaswani2017attention} might uncover further synergies, and a better theoretical understanding of how residual connections shape the optimization landscape and gradient flow could yield insights applicable to the design of novel architectures.}
}

%% - - - - - - - - - - - - FIGURES - - - - - - - - - - - -

%% Question Figure 3:
\newcommand{\questionFigureThree} {
% Question Figure 3 - Replace this image with a figure depicting the average gradient across layers, for the VGG38 model.
%\textit{(The provided figure is correct, and can be used in your analysis. It is partially obscured so you can get credit for producing your own copy).}
\youranswer{
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/gradplot_38.pdf}
\caption{Gradient flow (average gradient magnitude per layer) for the baseline VGG38 model.}
\label{fig:avg_grad_flow_38}
\end{figure}
}
}

%% Question Figure 4:
% Question Figure 4 - Replace this image with a figure depicting the training curves for the model with the best performance \textit{across experiments you have available (you don't need to run the experiments for the models we already give you results for)}. Edit the caption so that it clearly identifies the model and what is depicted.
\newcommand{\questionFigureFour} {
\youranswer{
\begin{figure}[t]
\begin{subfigure}{\linewidth}
\centering
\includegraphics[width=\linewidth]{figures/VGG38_BN_RC_loss_performance.pdf}
\caption{Cross-entropy error per epoch}
\label{fig:vgg38_loss_curves}
\end{subfigure}

\begin{subfigure}{\linewidth}
\centering
\includegraphics[width=\linewidth]{figures/VGG38_BN_RC_accuracy_performance.pdf}
\caption{Classification accuracy per epoch}
\label{fig:vgg38_acc_curves}
\end{subfigure}
\caption{Training curves for the 38-layer CNN with batch normalization and residual connections (VGG38 BN + RC), trained with a learning rate of $0.01$.}
\label{fig:training_curves_bestModel}
\end{figure}
}
}

%% Question Figure 5:
% Question Figure 5 - Replace this image with a figure depicting the average gradient across layers, for the model with the best performance \textit{across experiments you have available (you don't need to run the experiments for the models we already give you results for)}. Edit the caption so that it clearly identifies the model and what is depicted.
\newcommand{\questionFigureFive} {
\youranswer{
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/gradplot_38_bn_rc.pdf}
\caption{Gradient flow (average gradient magnitude per layer) for the 38-layer CNN with batch normalization and residual connections (VGG38 BN + RC), trained with a learning rate of $0.01$.}
\label{fig:avg_grad_flow_bestModel}
\end{figure}
}
}

%% - - - - - - - - - - - - TABLES - - - - - - - - - - - -

%% Question Table 1:
% Question Table 1 - Fill in Table 1 with the results from your experiments on
% \begin{enumerate}
% \item \textit{VGG38 BN (LR 1e-3)}, and
% \item \textit{VGG38 BN + RC (LR 1e-2)}.
% \end{enumerate}
\newcommand{\questionTableOne} {
\youranswer{
%
\begin{table*}[t]
\centering
\begin{tabular}{lr|ccccc}
\toprule
Model & LR & \# Params & Train loss & Train acc (\%) & Val loss & Val acc (\%) \\
\midrule
VGG08 & 1e-3 & 60 K & 1.74 & 51.59 & 1.95 & 46.84 \\
VGG38 & 1e-3 & 336 K & 4.61 & 00.01 & 4.61 & 00.01 \\
VGG38 BN & 1e-3 & 339 K & 1.76 & 50.62 & 1.95 & 47.68 \\
VGG38 RC & 1e-3 & 336 K & 1.33 & 61.52 & 1.84 & 52.32 \\
VGG38 BN + RC & 1e-3 & 339 K & 1.26 & 62.99 & 1.73 & 53.76 \\
VGG38 BN & 1e-2 & 339 K & 1.70 & 52.28 & 1.99 & 46.72 \\
VGG38 BN + RC & 1e-2 & 339 K & 0.83 & 74.35 & 1.70 & 58.20 \\
\bottomrule
\end{tabular}
\caption{Experiment results (number of model parameters, training and validation loss, and accuracy in \%) for different combinations of VGG08, VGG38, batch normalization (BN), and residual connections (RC); LR is the learning rate.}
\label{tab:CIFAR_results}
\end{table*}
}
}

%% END of YOUR ANSWERS