Compare commits
22 Commits
mlp2024-25 ... mlp2024-25

Author | SHA1 | Date
---|---|---
 | 46ca7c6dfd |
 | c29681b4ba |
 | ae0e14b5fb |
 | 7861133463 |
 | 94d3a1d484 |
 | cb5c6f4e19 |
 | 92fccb8eb2 |
 | 05e53aacaf |
 | 58613aee35 |
 | 26364ec94e |
 | 98e232af70 |
 | a404c62b6f |
 | 45a2df1b11 |
 | be1f124dff |
 | 9b9a7d50fa |
 | 5d52a22448 |
 | 4657cca862 |
 | 6a17a30da1 |
 | 2fda722e3d |
 | 6883eb77c2 |
 | 207595b4a1 |
 | 9f1f3ccd04 |
35  .gitignore  (vendored)

@@ -1,8 +1,6 @@
#editors
*.idea/

#dropbox stuff
*.dropbox*
.idea/*

# Byte-compiled / optimized / DLL files
__pycache__/
@@ -28,6 +26,7 @@ var/
*.egg-info/
.installed.cfg
*.egg
*.tar.gz

# PyInstaller
# Usually these files are written by a python script from a template
@@ -61,8 +60,30 @@ docs/_build/

# PyBuilder
target/
*.tar.gz
google-cloud-sdk/
solutions/

# Pycharm
.idea/*

#Notebook stuff
notebooks/.ipynb_checkpoints/

#Google Cloud stuff
/google-cloud-sdk
.ipynb_checkpoints/
emnist_tutorial/
data/cifar-100-python/
data/MNIST/
solutions/
report/mlp-cw1-template.aux
report/mlp-cw1-template.out
report/mlp-cw1-template.pdf
report/mlp-cw1-template.synctex.gz
.DS_Store
report/mlp-cw2-template.aux
report/mlp-cw2-template.out
report/mlp-cw2-template.pdf
report/mlp-cw2-template.synctex.gz
report/mlp-cw2-template.bbl
report/mlp-cw2-template.blg

venv
saved_models
27  README.md

@@ -1,22 +1,15 @@
# MLP Compute Engines Tutorials Branch
# Machine Learning Practical

A short code repo that guides you through the process of running experiments on the Google Cloud Platform.
This repository contains the code for the University of Edinburgh [School of Informatics](http://www.inf.ed.ac.uk) course [Machine Learning Practical](http://www.inf.ed.ac.uk/teaching/courses/mlp/).

## Why do I need it?
Most deep learning experiments require a large amount of compute, as you have noticed in term 1. A GPU can accelerate experiments by roughly 30-50x, slashing runtimes and making otherwise prohibitively long experiments feasible. For a simple example, consider an experiment that takes a month to run; that is infeasible to do research with. If the same experiment takes only one day, you can iterate over methodologies, tune hyperparameters and overall try far more things. This simple example captures one of the main reasons behind the GPU hype that surrounds machine learning research today.
This assignment-based course is focused on the implementation and evaluation of machine learning systems. Students who do this course will have experience in the design, implementation, training, and evaluation of machine learning systems.

## Introduction
The code in this repository is split into:

The material available includes tutorial documents and code, as well as tooling that provides more advanced features to aid you in your quest to train lots of learnable differentiable computational graphs.
* a Python package `mlp`, a [NumPy](http://www.numpy.org/) based neural network package designed specifically for the course that students will implement parts of and extend during the course labs and assignments,
* a series of [Jupyter](http://jupyter.org/) notebooks in the `notebooks` directory containing explanatory material and coding exercises to be completed during the course labs.

## Getting Started

### Google Cloud Platform

Google Cloud Platform (GCP) is a cloud computing platform that provides a number of services, including the ability to run virtual machines (VMs) on its infrastructure. The VMs are called Compute Engine instances.

As an MLP course student, you will be given $50 worth of credits. This is enough to run a number of experiments on the cloud.

To get started with GCP, please read [this getting started guide](notes/google_cloud_setup.md).

The guide will take you through the process of setting up a GCP account, creating a project, creating a VM instance, and connecting to it. The VM instance will be a GPU-endowed Linux machine that already includes the necessary PyTorch packages for you to run your experiments.
## Coursework 2
This branch contains the Python code and LaTeX files of the second coursework. The code follows the same structure as the labs, in particular the `mlp` package, and a specific notebook is provided to help you run experiments.
* Detailed instructions are given in MLP2024_25_CW2_Spec.pdf (see Learn, Assessment, CW2).
* The [report directory](https://github.com/VICO-UoE/mlpractical/tree/mlp2024-25/coursework2/report) contains the LaTeX files that you will use to create your report.
102
VGG_08/result_outputs/summary.csv
Normal file
@ -0,0 +1,102 @@
|
||||
train_acc,train_loss,val_acc,val_loss
|
||||
0.010694736842105264,4.827323,0.024800000000000003,4.5659676
|
||||
0.03562105263157895,4.3888855,0.0604,4.136276
|
||||
0.0757684210526316,3.998175,0.09480000000000001,3.8678854
|
||||
0.10734736842105265,3.784943,0.12159999999999999,3.6687074
|
||||
0.13741052631578948,3.6023798,0.15439999999999998,3.4829779
|
||||
0.16888421052631578,3.4196754,0.1864,3.3093607
|
||||
0.1941263157894737,3.2674048,0.20720000000000002,3.2223148
|
||||
0.21861052631578948,3.139925,0.22880000000000003,3.1171055
|
||||
0.24134736842105264,3.0145736,0.24760000000000001,3.0554724
|
||||
0.26399999999999996,2.9004965,0.2552,2.9390912
|
||||
0.27898947368421056,2.815607,0.2764,2.9205213
|
||||
0.29532631578947366,2.7256868,0.2968,2.7410471
|
||||
0.31138947368421044,2.6567938,0.3016,2.7083752
|
||||
0.3236842105263158,2.595405,0.322,2.665904
|
||||
0.33486315789473686,2.5434496,0.3176,2.688214
|
||||
0.3462526315789474,2.5021079,0.33159999999999995,2.648656
|
||||
0.35381052631578946,2.4609485,0.342,2.5658453
|
||||
0.36157894736842106,2.4152951,0.34119999999999995,2.5403407
|
||||
0.36774736842105266,2.382958,0.3332,2.6936982
|
||||
0.37753684210526317,2.3510027,0.36160000000000003,2.4663532
|
||||
0.38597894736842114,2.319616,0.3608,2.4559999
|
||||
0.3912421052631579,2.294115,0.3732,2.3644555
|
||||
0.39840000000000003,2.2598042,0.3716,2.4516551
|
||||
0.4036,2.2318766,0.37439999999999996,2.4189563
|
||||
0.4105263157894737,2.2035582,0.3772,2.3899698
|
||||
0.41501052631578944,2.1830406,0.3876,2.3215945
|
||||
0.4193263157894737,2.158597,0.37800000000000006,2.3831298
|
||||
0.4211578947368421,2.148888,0.38160000000000005,2.3436418
|
||||
0.4260842105263159,2.1250536,0.39840000000000003,2.3471045
|
||||
0.4313684210526315,2.107519,0.4044,2.2744477
|
||||
0.4370526315789474,2.0837262,0.398,2.245617
|
||||
0.439642105263158,2.0691078,0.41200000000000003,2.216309
|
||||
0.4440842105263158,2.046351,0.4096,2.2329648
|
||||
0.44696842105263157,2.0330904,0.4104,2.1841388
|
||||
0.4518105263157895,2.0200553,0.4244,2.1780539
|
||||
0.45298947368421055,2.0069249,0.42719999999999997,2.1625984
|
||||
0.4602105263157895,1.9896894,0.4204,2.2195568
|
||||
0.46023157894736844,1.9788533,0.4244,2.1803434
|
||||
0.46101052631578954,1.9693571,0.4128,2.1858895
|
||||
0.46774736842105263,1.9547894,0.4204,2.1908271
|
||||
0.4671157894736842,1.9390026,0.4244,2.1841395
|
||||
0.4698105263157895,1.924038,0.424,2.1843896
|
||||
0.4738736842105264,1.9161719,0.43,2.154806
|
||||
0.47541052631578945,1.9033127,0.4463999999999999,2.1130056
|
||||
0.48,1.8961077,0.44439999999999996,2.113019
|
||||
0.48456842105263154,1.8838875,0.43079999999999996,2.1191697
|
||||
0.4857263157894737,1.8711865,0.44920000000000004,2.1213412
|
||||
0.4887578947368421,1.8590263,0.44799999999999995,2.1077166
|
||||
0.49035789473684216,1.8479114,0.4428,2.0737479
|
||||
0.4908421052631579,1.845268,0.4436,2.07655
|
||||
0.4939368421052632,1.8336699,0.4548,2.0769904
|
||||
0.49924210526315793,1.8237538,0.4548,2.061769
|
||||
0.49677894736842104,1.8111013,0.44240000000000007,2.0676718
|
||||
0.5008842105263157,1.8031327,0.4548,2.0859065
|
||||
0.5,1.8026625,0.458,2.0704215
|
||||
0.5030736842105263,1.792004,0.4596,2.1113508
|
||||
0.505578947368421,1.7810374,0.45679999999999993,2.0382714
|
||||
0.5090315789473684,1.7691813,0.4444000000000001,2.0911386
|
||||
0.512042105263158,1.7633294,0.4616,2.0458508
|
||||
0.5142736842105263,1.7549652,0.4464,2.0786576
|
||||
0.5128421052631579,1.7518128,0.4656,2.026332
|
||||
0.518042105263158,1.7420768,0.46,2.0141299
|
||||
0.5182315789473684,1.7321203,0.45960000000000006,2.0226884
|
||||
0.5192842105263158,1.7264535,0.46279999999999993,2.0182638
|
||||
0.5217894736842105,1.7245325,0.46399999999999997,2.0110855
|
||||
0.5229684210526316,1.7184331,0.46679999999999994,2.0191038
|
||||
0.5227578947368421,1.7116771,0.4604,2.0334535
|
||||
0.5245894736842105,1.7009526,0.4692,2.0072439
|
||||
0.5262315789473684,1.6991171,0.4700000000000001,2.0296187
|
||||
0.5278526315789474,1.6958193,0.4708,1.9912667
|
||||
0.527157894736842,1.6907407,0.4736,2.006095
|
||||
0.5299578947368421,1.6808176,0.4715999999999999,2.012164
|
||||
0.5313052631578947,1.676356,0.47239999999999993,1.9955354
|
||||
0.5338315789473685,1.6731659,0.47839999999999994,2.005768
|
||||
0.5336000000000001,1.662152,0.4672,2.015392
|
||||
0.5354736842105263,1.6638054,0.4692,1.9890119
|
||||
0.5397894736842105,1.6575475,0.4768,2.0090258
|
||||
0.5386526315789474,1.6595734,0.4824,1.9728817
|
||||
0.5376631578947368,1.6536722,0.4816,1.9769167
|
||||
0.5384842105263159,1.6495628,0.47600000000000003,1.9980135
|
||||
0.5380842105263157,1.6488388,0.478,1.9884782
|
||||
0.5393473684210528,1.6408547,0.48,1.9772192
|
||||
0.5415157894736843,1.632917,0.4828,1.9732709
|
||||
0.5394947368421052,1.6340653,0.4776,1.9623082
|
||||
0.5429052631578948,1.6340532,0.47759999999999997,1.9812362
|
||||
0.5452421052631579,1.6246406,0.48119999999999996,1.9846246
|
||||
0.5436210526315789,1.6288266,0.4864,1.9822198
|
||||
0.5437684210526316,1.6240481,0.48279999999999995,1.9768158
|
||||
0.546357894736842,1.6208181,0.4804,1.9625885
|
||||
0.5485052631578946,1.6164333,0.47839999999999994,1.9738724
|
||||
0.5466736842105263,1.6169226,0.47800000000000004,1.9842362
|
||||
0.547621052631579,1.6159856,0.4828,1.9709526
|
||||
0.5480421052631579,1.6175526,0.48560000000000003,1.967775
|
||||
0.5468421052631579,1.6149833,0.48119999999999996,1.9626708
|
||||
0.5493894736842105,1.6063902,0.4835999999999999,1.96621
|
||||
0.5490736842105263,1.6096952,0.48120000000000007,1.9742922
|
||||
0.5514736842105264,1.6084315,0.4867999999999999,1.9604725
|
||||
0.5489263157894737,1.6069487,0.4831999999999999,1.9733659
|
||||
0.5494947368421053,1.6030664,0.49079999999999996,1.9693874
|
||||
0.5516842105263158,1.6043342,0.486,1.9647765
|
||||
0.552442105263158,1.6039867,0.48480000000000006,1.9649359
|
|
2  VGG_08/result_outputs/test_summary.csv  (Normal file)

@@ -0,0 +1,2 @@
test_acc,test_loss
0.49950000000000006,1.9105633
101
VGG_38/result_outputs/summary.csv
Normal file
@ -0,0 +1,101 @@
|
||||
train_acc,train_loss,val_acc,val_loss
|
||||
0.009263157894736843,4.8649125,0.0104,4.630689
|
||||
0.009810526315789474,4.6264124,0.009600000000000001,4.618983
|
||||
0.009705263157894738,4.621914,0.011200000000000002,4.6184525
|
||||
0.008989473684210525,4.619472,0.0064,4.6164784
|
||||
0.009747368421052633,4.6168556,0.0076,4.6138463
|
||||
0.00951578947368421,4.6156826,0.0108,4.6139345
|
||||
0.009789473684210525,4.614809,0.008400000000000001,4.6116896
|
||||
0.009936842105263159,4.613147,0.0104,4.6148276
|
||||
0.009810526315789474,4.612325,0.0076,4.6123877
|
||||
0.009094736842105263,4.6117926,0.007200000000000001,4.6149993
|
||||
0.008421052631578947,4.611283,0.011600000000000001,4.6114736
|
||||
0.009010526315789472,4.6105323,0.009600000000000001,4.607559
|
||||
0.009894736842105263,4.6103206,0.008400000000000001,4.6086206
|
||||
0.00934736842105263,4.6095214,0.011200000000000002,4.6091933
|
||||
0.009473684210526316,4.6095295,0.008,4.6095695
|
||||
0.010252631578947369,4.609189,0.0104,4.610459
|
||||
0.009536842105263158,4.6087623,0.0092,4.6091356
|
||||
0.00848421052631579,4.6086617,0.009600000000000001,4.609126
|
||||
0.008421052631578947,4.6083455,0.011200000000000002,4.6088147
|
||||
0.009410526315789473,4.608145,0.0068000000000000005,4.608519
|
||||
0.009263157894736843,4.6078997,0.0092,4.6085033
|
||||
0.009389473684210526,4.607453,0.01,4.6083508
|
||||
0.008989473684210528,4.6075597,0.008400000000000001,4.6073136
|
||||
0.009326315789473686,4.607266,0.008,4.6069093
|
||||
0.01,4.607154,0.0076,4.6069508
|
||||
0.008778947368421053,4.607089,0.011200000000000002,4.60659
|
||||
0.009326315789473684,4.606807,0.0068,4.6072598
|
||||
0.009031578947368422,4.6068263,0.011200000000000002,4.607257
|
||||
0.008842105263157896,4.6066294,0.008,4.606883
|
||||
0.008968421052631579,4.606647,0.006400000000000001,4.607275
|
||||
0.008947368421052631,4.6065364,0.0092,4.606976
|
||||
0.008842105263157896,4.6064167,0.0076,4.607016
|
||||
0.008799999999999999,4.606425,0.0096,4.607184
|
||||
0.009326315789473686,4.606305,0.0072,4.6068683
|
||||
0.00905263157894737,4.606274,0.0072,4.606982
|
||||
0.00934736842105263,4.6062336,0.007200000000000001,4.607209
|
||||
0.009221052631578948,4.606221,0.0076,4.607369
|
||||
0.009557894736842105,4.60607,0.0076,4.6074376
|
||||
0.009073684210526317,4.6061006,0.0072,4.607068
|
||||
0.009242105263157895,4.606005,0.0064,4.6067224
|
||||
0.009957894736842107,4.605986,0.0072,4.6068263
|
||||
0.009052631578947368,4.605935,0.0072,4.6067867
|
||||
0.008694736842105264,4.6059127,0.0064,4.6070905
|
||||
0.009536842105263158,4.605874,0.006400000000000001,4.606976
|
||||
0.009663157894736842,4.605872,0.0072,4.6068897
|
||||
0.008821052631578948,4.6057997,0.0064,4.607028
|
||||
0.009768421052631579,4.605778,0.0072,4.6069264
|
||||
0.0092,4.6057644,0.007200000000000001,4.607018
|
||||
0.008926315789473685,4.6057386,0.0072,4.60698
|
||||
0.008989473684210525,4.6057277,0.0064,4.6070237
|
||||
0.009242105263157895,4.6057053,0.0064,4.6069183
|
||||
0.009094736842105263,4.605692,0.006400000000000001,4.6068764
|
||||
0.009473684210526316,4.60566,0.0064,4.606909
|
||||
0.009494736842105262,4.605613,0.0064,4.606978
|
||||
0.009747368421052631,4.6056285,0.0064,4.606753
|
||||
0.009789473684210527,4.605578,0.006400000000000001,4.6068797
|
||||
0.009199999999999998,4.6055675,0.0064,4.606888
|
||||
0.009073684210526317,4.6055593,0.0064,4.606874
|
||||
0.008821052631578948,4.6055293,0.006400000000000001,4.606851
|
||||
0.009326315789473684,4.6055255,0.0064,4.606871
|
||||
0.009557894736842105,4.6055083,0.006400000000000001,4.606851
|
||||
0.009600000000000001,4.605491,0.0064,4.6068635
|
||||
0.00856842105263158,4.605466,0.0064,4.606862
|
||||
0.009894736842105263,4.605463,0.006400000000000001,4.6068873
|
||||
0.009494736842105262,4.605441,0.0064,4.6068926
|
||||
0.008673684210526314,4.6054277,0.0064,4.6068554
|
||||
0.009221052631578948,4.6054296,0.0063999999999999994,4.6068907
|
||||
0.008989473684210528,4.605404,0.0064,4.6068807
|
||||
0.00928421052631579,4.6053905,0.006400000000000001,4.6068707
|
||||
0.0092,4.6053743,0.0064,4.606894
|
||||
0.008989473684210525,4.605368,0.0064,4.606845
|
||||
0.009515789473684212,4.605355,0.0064,4.6068635
|
||||
0.009073684210526317,4.605352,0.0064,4.6068773
|
||||
0.009642105263157895,4.6053243,0.0064,4.606883
|
||||
0.009747368421052633,4.6053176,0.0064,4.6069
|
||||
0.009873684210526316,4.6053023,0.0064,4.6068873
|
||||
0.009536842105263156,4.605297,0.0064,4.6068654
|
||||
0.009515789473684212,4.6052866,0.0064,4.6068883
|
||||
0.009978947368421053,4.605265,0.006400000000000001,4.606894
|
||||
0.009957894736842107,4.605259,0.0064,4.6068826
|
||||
0.009410526315789475,4.6052504,0.0064,4.6068697
|
||||
0.01002105263157895,4.6052403,0.006400000000000001,4.6068807
|
||||
0.01002105263157895,4.6052313,0.0064,4.606872
|
||||
0.00951578947368421,4.605224,0.0064,4.6068883
|
||||
0.009852631578947368,4.605219,0.006400000000000001,4.606871
|
||||
0.009894736842105265,4.605209,0.0064,4.606871
|
||||
0.00922105263157895,4.605204,0.0064,4.6068654
|
||||
0.010042105263157896,4.605193,0.0064,4.6068764
|
||||
0.009978947368421053,4.6051874,0.006400000000000001,4.6068697
|
||||
0.009747368421052633,4.605183,0.0064,4.6068673
|
||||
0.010189473684210526,4.605178,0.0064,4.606873
|
||||
0.009789473684210527,4.605173,0.0064,4.6068773
|
||||
0.009936842105263159,4.605169,0.0064,4.606874
|
||||
0.010042105263157894,4.605166,0.0064,4.606877
|
||||
0.009494736842105262,4.6051593,0.0064,4.606874
|
||||
0.009536842105263158,4.6051593,0.0063999999999999994,4.606874
|
||||
0.010021052631578946,4.6051564,0.006400000000000001,4.6068716
|
||||
0.009747368421052631,4.605154,0.0064,4.6068726
|
||||
0.009642105263157895,4.605153,0.0064,4.606872
|
||||
0.009305263157894737,4.6051517,0.0064,4.6068726
|
|
2  VGG_38/result_outputs/test_summary.csv  (Normal file)

@@ -0,0 +1,2 @@
test_acc,test_loss
0.01,4.608619
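The summary.csv files above log one row per epoch with train_acc, train_loss, val_acc and val_loss, and each test_summary.csv holds the final test-set metrics. A minimal sketch of inspecting them, assuming pandas is installed (only the column names and paths come from the files above):

import pandas as pd

# Per-epoch training/validation log written by the experiment framework.
summary = pd.read_csv("VGG_08/result_outputs/summary.csv")
best_epoch = summary["val_acc"].idxmax()
print("best val_acc {:.4f} at epoch {}".format(summary["val_acc"].max(), best_epoch))

# Final test-set metrics for the best validation model.
test = pd.read_csv("VGG_08/result_outputs/test_summary.csv")
print("test_acc {:.4f}, test_loss {:.4f}".format(test["test_acc"][0], test["test_loss"][0]))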
@@ -1,87 +0,0 @@
import argparse
import json
import os
import sys

def str2bool(v):
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')


def get_args():
    """
    Returns a namedtuple with arguments extracted from the command line.
    :return: A namedtuple with arguments
    """
    parser = argparse.ArgumentParser(
        description='Welcome to the MLP course\'s Pytorch training and inference helper script')

    parser.add_argument('--batch_size', nargs="?", type=int, default=100, help='Batch_size for experiment')
    parser.add_argument('--continue_from_epoch', nargs="?", type=int, default=-1, help='Epoch to continue training from (-1 starts from scratch)')
    parser.add_argument('--dataset_name', type=str, help='Dataset on which the system will train/eval our model')
    parser.add_argument('--seed', nargs="?", type=int, default=7112018,
                        help='Seed to use for random number generator for experiment')
    parser.add_argument('--image_num_channels', nargs="?", type=int, default=1,
                        help='The channel dimensionality of our image-data')
    parser.add_argument('--image_height', nargs="?", type=int, default=28, help='Height of image data')
    parser.add_argument('--image_width', nargs="?", type=int, default=28, help='Width of image data')
    parser.add_argument('--dim_reduction_type', nargs="?", type=str, default='strided_convolution',
                        help='One of [strided_convolution, dilated_convolution, max_pooling, avg_pooling]')
    parser.add_argument('--num_layers', nargs="?", type=int, default=4,
                        help='Number of convolutional layers in the network (excluding '
                             'dimensionality reduction layers)')
    parser.add_argument('--num_filters', nargs="?", type=int, default=64,
                        help='Number of convolutional filters per convolutional layer in the network (excluding '
                             'dimensionality reduction layers)')
    parser.add_argument('--num_epochs', nargs="?", type=int, default=100, help='The experiment\'s epoch budget')
    parser.add_argument('--experiment_name', nargs="?", type=str, default="exp_1",
                        help='Experiment name - to be used for building the experiment folder')
    parser.add_argument('--use_gpu', nargs="?", type=str2bool, default=False,
                        help='A flag indicating whether we will use GPU acceleration or not')
    parser.add_argument('--weight_decay_coefficient', nargs="?", type=float, default=1e-05,
                        help='Weight decay to use for Adam')
    parser.add_argument('--filepath_to_arguments_json_file', nargs="?", type=str, default=None,
                        help='')

    args = parser.parse_args()

    if args.filepath_to_arguments_json_file is not None:
        args = extract_args_from_json(json_file_path=args.filepath_to_arguments_json_file, existing_args_dict=args)

    arg_str = [(str(key), str(value)) for (key, value) in vars(args).items()]
    print(arg_str)

    import torch

    if torch.cuda.is_available():  # checks whether a cuda gpu is available
        device = torch.cuda.current_device()
        print("use {} GPU(s)".format(torch.cuda.device_count()), file=sys.stderr)
    else:
        print("use CPU", file=sys.stderr)
        device = torch.device('cpu')  # sets the device to be CPU

    return args, device


class AttributeAccessibleDict(object):
    def __init__(self, adict):
        self.__dict__.update(adict)


def extract_args_from_json(json_file_path, existing_args_dict=None):

    summary_filename = json_file_path
    with open(summary_filename) as f:
        arguments_dict = json.load(fp=f)

    for key, value in vars(existing_args_dict).items():  # fill keys missing from the JSON with the parsed CLI defaults
        if key not in arguments_dict:
            arguments_dict[key] = value

    arguments_dict = AttributeAccessibleDict(arguments_dict)

    return arguments_dict
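A brief usage sketch for the argument helper above. The deleted file's name is not preserved in this diff, so `arg_extractor` below is an assumed module name; everything else follows the signatures shown:

# Assumed module name; the original file path is not shown in this export.
from arg_extractor import get_args

# Parses sys.argv; if --filepath_to_arguments_json_file is given, values from that
# JSON file take priority over the command-line defaults via extract_args_from_json().
args, device = get_args()
print(args.batch_size, args.experiment_name, device)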
@ -1,43 +0,0 @@
|
||||
#!/bin/sh
|
||||
#SBATCH -N 1 # nodes requested
|
||||
#SBATCH -n 1 # tasks requested
|
||||
#SBATCH --partition=Teach-Standard
|
||||
#SBATCH --gres=gpu:1
|
||||
#SBATCH --mem=12000 # memory in Mb
|
||||
#SBATCH --time=0-08:00:00
|
||||
|
||||
export CUDA_HOME=/opt/cuda-9.0.176.1/
|
||||
|
||||
export CUDNN_HOME=/opt/cuDNN-7.0/
|
||||
|
||||
export STUDENT_ID=$(whoami)
|
||||
|
||||
export LD_LIBRARY_PATH=${CUDNN_HOME}/lib64:${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
export LIBRARY_PATH=${CUDNN_HOME}/lib64:$LIBRARY_PATH
|
||||
|
||||
export CPATH=${CUDNN_HOME}/include:$CPATH
|
||||
|
||||
export PATH=${CUDA_HOME}/bin:${PATH}
|
||||
|
||||
export PYTHON_PATH=$PATH
|
||||
|
||||
mkdir -p /disk/scratch/${STUDENT_ID}
|
||||
|
||||
|
||||
export TMPDIR=/disk/scratch/${STUDENT_ID}/
|
||||
export TMP=/disk/scratch/${STUDENT_ID}/
|
||||
|
||||
mkdir -p ${TMP}/datasets/
|
||||
export DATASET_DIR=${TMP}/datasets/
|
||||
# Activate the relevant virtual environment:
|
||||
|
||||
|
||||
source /home/${STUDENT_ID}/miniconda3/bin/activate mlp
|
||||
cd ..
|
||||
python train_evaluate_emnist_classification_system.py --batch_size 100 --continue_from_epoch -1 --seed 0 \
|
||||
--image_num_channels 3 --image_height 32 --image_width 32 \
|
||||
--dim_reduction_type "strided" --num_layers 4 --num_filters 64 \
|
||||
--num_epochs 100 --experiment_name 'cifar100_test_exp' \
|
||||
--use_gpu "True" --weight_decay_coefficient 0. \
|
||||
--dataset_name "cifar100"
|
@ -1,38 +0,0 @@
|
||||
#!/bin/sh
|
||||
#SBATCH -N 1 # nodes requested
|
||||
#SBATCH -n 1 # tasks requested
|
||||
#SBATCH --partition=Teach-Standard
|
||||
#SBATCH --gres=gpu:1
|
||||
#SBATCH --mem=12000 # memory in Mb
|
||||
#SBATCH --time=0-08:00:00
|
||||
|
||||
export CUDA_HOME=/opt/cuda-9.0.176.1/
|
||||
|
||||
export CUDNN_HOME=/opt/cuDNN-7.0/
|
||||
|
||||
export STUDENT_ID=$(whoami)
|
||||
|
||||
export LD_LIBRARY_PATH=${CUDNN_HOME}/lib64:${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
export LIBRARY_PATH=${CUDNN_HOME}/lib64:$LIBRARY_PATH
|
||||
|
||||
export CPATH=${CUDNN_HOME}/include:$CPATH
|
||||
|
||||
export PATH=${CUDA_HOME}/bin:${PATH}
|
||||
|
||||
export PYTHON_PATH=$PATH
|
||||
|
||||
mkdir -p /disk/scratch/${STUDENT_ID}
|
||||
|
||||
|
||||
export TMPDIR=/disk/scratch/${STUDENT_ID}/
|
||||
export TMP=/disk/scratch/${STUDENT_ID}/
|
||||
|
||||
mkdir -p ${TMP}/datasets/
|
||||
export DATASET_DIR=${TMP}/datasets/
|
||||
# Activate the relevant virtual environment:
|
||||
|
||||
|
||||
source /home/${STUDENT_ID}/miniconda3/bin/activate mlp
|
||||
cd ..
|
||||
python train_evaluate_emnist_classification_system.py --filepath_to_arguments_json_file experiment_configs/cifar10_tutorial_config.json
|
@ -1,43 +0,0 @@
|
||||
#!/bin/sh
|
||||
#SBATCH -N 1 # nodes requested
|
||||
#SBATCH -n 1 # tasks requested
|
||||
#SBATCH --partition=Teach-LongJobs
|
||||
#SBATCH --gres=gpu:1
|
||||
#SBATCH --mem=12000 # memory in Mb
|
||||
#SBATCH --time=0-08:00:00
|
||||
|
||||
export CUDA_HOME=/opt/cuda-9.0.176.1/
|
||||
|
||||
export CUDNN_HOME=/opt/cuDNN-7.0/
|
||||
|
||||
export STUDENT_ID=$(whoami)
|
||||
|
||||
export LD_LIBRARY_PATH=${CUDNN_HOME}/lib64:${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
export LIBRARY_PATH=${CUDNN_HOME}/lib64:$LIBRARY_PATH
|
||||
|
||||
export CPATH=${CUDNN_HOME}/include:$CPATH
|
||||
|
||||
export PATH=${CUDA_HOME}/bin:${PATH}
|
||||
|
||||
export PYTHON_PATH=$PATH
|
||||
|
||||
mkdir -p /disk/scratch/${STUDENT_ID}
|
||||
|
||||
|
||||
export TMPDIR=/disk/scratch/${STUDENT_ID}/
|
||||
export TMP=/disk/scratch/${STUDENT_ID}/
|
||||
|
||||
mkdir -p ${TMP}/datasets/
|
||||
export DATASET_DIR=${TMP}/datasets/
|
||||
# Activate the relevant virtual environment:
|
||||
|
||||
|
||||
source /home/${STUDENT_ID}/miniconda3/bin/activate mlp
|
||||
cd ..
|
||||
python train_evaluate_emnist_classification_system.py --batch_size 100 --continue_from_epoch -1 --seed 0 \
|
||||
--image_num_channels 1 --image_height 28 --image_width 28 \
|
||||
--dim_reduction_type "strided" --num_layers 4 --num_filters 64 \
|
||||
--num_epochs 100 --experiment_name 'emnist_test_exp' \
|
||||
--use_gpu "True" --weight_decay_coefficient 0. \
|
||||
--dataset_name "emnist"
|
@ -1,43 +0,0 @@
|
||||
#!/bin/sh
|
||||
#SBATCH -N 1 # nodes requested
|
||||
#SBATCH -n 1 # tasks requested
|
||||
#SBATCH --partition=Teach-Short
|
||||
#SBATCH --gres=gpu:1
|
||||
#SBATCH --mem=12000 # memory in Mb
|
||||
#SBATCH --time=0-03:59:00
|
||||
|
||||
export CUDA_HOME=/opt/cuda-9.0.176.1/
|
||||
|
||||
export CUDNN_HOME=/opt/cuDNN-7.0/
|
||||
|
||||
export STUDENT_ID=$(whoami)
|
||||
|
||||
export LD_LIBRARY_PATH=${CUDNN_HOME}/lib64:${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
export LIBRARY_PATH=${CUDNN_HOME}/lib64:$LIBRARY_PATH
|
||||
|
||||
export CPATH=${CUDNN_HOME}/include:$CPATH
|
||||
|
||||
export PATH=${CUDA_HOME}/bin:${PATH}
|
||||
|
||||
export PYTHON_PATH=$PATH
|
||||
|
||||
mkdir -p /disk/scratch/${STUDENT_ID}
|
||||
|
||||
|
||||
export TMPDIR=/disk/scratch/${STUDENT_ID}/
|
||||
export TMP=/disk/scratch/${STUDENT_ID}/
|
||||
|
||||
mkdir -p ${TMP}/datasets/
|
||||
export DATASET_DIR=${TMP}/datasets/
|
||||
# Activate the relevant virtual environment:
|
||||
|
||||
|
||||
source /home/${STUDENT_ID}/miniconda3/bin/activate mlp
|
||||
cd ..
|
||||
python train_evaluate_emnist_classification_system.py --batch_size 100 --continue_from_epoch -1 --seed 0 \
|
||||
--image_num_channels 1 --image_height 28 --image_width 28 \
|
||||
--dim_reduction_type "strided" --num_layers 4 --num_filters 64 \
|
||||
--num_epochs 100 --experiment_name 'emnist_test_exp' \
|
||||
--use_gpu "True" --weight_decay_coefficient 0. \
|
||||
--dataset_name "emnist"
|
@ -1,44 +0,0 @@
|
||||
#!/bin/sh
|
||||
#SBATCH -N 1 # nodes requested
|
||||
#SBATCH -n 1 # tasks requested
|
||||
#SBATCH --partition=Teach-Standard
|
||||
#SBATCH --gres=gpu:4
|
||||
#SBATCH --mem=12000 # memory in Mb
|
||||
#SBATCH --time=0-08:00:00
|
||||
|
||||
|
||||
export CUDA_HOME=/opt/cuda-9.0.176.1/
|
||||
|
||||
export CUDNN_HOME=/opt/cuDNN-7.0/
|
||||
|
||||
export STUDENT_ID=$(whoami)
|
||||
|
||||
export LD_LIBRARY_PATH=${CUDNN_HOME}/lib64:${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
export LIBRARY_PATH=${CUDNN_HOME}/lib64:$LIBRARY_PATH
|
||||
|
||||
export CPATH=${CUDNN_HOME}/include:$CPATH
|
||||
|
||||
export PATH=${CUDA_HOME}/bin:${PATH}
|
||||
|
||||
export PYTHON_PATH=$PATH
|
||||
|
||||
mkdir -p /disk/scratch/${STUDENT_ID}
|
||||
|
||||
|
||||
export TMPDIR=/disk/scratch/${STUDENT_ID}/
|
||||
export TMP=/disk/scratch/${STUDENT_ID}/
|
||||
|
||||
mkdir -p ${TMP}/datasets/
|
||||
export DATASET_DIR=${TMP}/datasets/
|
||||
# Activate the relevant virtual environment:
|
||||
|
||||
|
||||
source /home/${STUDENT_ID}/miniconda3/bin/activate mlp
|
||||
cd ..
|
||||
python train_evaluate_emnist_classification_system.py --batch_size 100 --continue_from_epoch -1 --seed 0 \
|
||||
--image_num_channels 1 --image_height 28 --image_width 28 \
|
||||
--dim_reduction_type "strided" --num_layers 4 --num_filters 64 \
|
||||
--num_epochs 100 --experiment_name 'emnist_test_multi_gpu_exp' \
|
||||
--use_gpu "True" --weight_decay_coefficient 0. \
|
||||
--dataset_name "emnist"
|
@ -1,43 +0,0 @@
|
||||
#!/bin/sh
|
||||
#SBATCH -N 1 # nodes requested
|
||||
#SBATCH -n 1 # tasks requested
|
||||
#SBATCH --partition=Teach-Standard
|
||||
#SBATCH --gres=gpu:1
|
||||
#SBATCH --mem=12000 # memory in Mb
|
||||
#SBATCH --time=0-08:00:00
|
||||
|
||||
export CUDA_HOME=/opt/cuda-9.0.176.1/
|
||||
|
||||
export CUDNN_HOME=/opt/cuDNN-7.0/
|
||||
|
||||
export STUDENT_ID=$(whoami)
|
||||
|
||||
export LD_LIBRARY_PATH=${CUDNN_HOME}/lib64:${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
export LIBRARY_PATH=${CUDNN_HOME}/lib64:$LIBRARY_PATH
|
||||
|
||||
export CPATH=${CUDNN_HOME}/include:$CPATH
|
||||
|
||||
export PATH=${CUDA_HOME}/bin:${PATH}
|
||||
|
||||
export PYTHON_PATH=$PATH
|
||||
|
||||
mkdir -p /disk/scratch/${STUDENT_ID}
|
||||
|
||||
|
||||
export TMPDIR=/disk/scratch/${STUDENT_ID}/
|
||||
export TMP=/disk/scratch/${STUDENT_ID}/
|
||||
|
||||
mkdir -p ${TMP}/datasets/
|
||||
export DATASET_DIR=${TMP}/datasets/
|
||||
# Activate the relevant virtual environment:
|
||||
|
||||
|
||||
source /home/${STUDENT_ID}/miniconda3/bin/activate mlp
|
||||
cd ..
|
||||
python train_evaluate_emnist_classification_system.py --batch_size 100 --continue_from_epoch -1 --seed 0 \
|
||||
--image_num_channels 1 --image_height 28 --image_width 28 \
|
||||
--dim_reduction_type "strided" --num_layers 4 --num_filters 64 \
|
||||
--num_epochs 100 --experiment_name 'emnist_test_exp' \
|
||||
--use_gpu "True" --weight_decay_coefficient 0. \
|
||||
--dataset_name "emnist"
|
@ -1,57 +0,0 @@
|
||||
import os
|
||||
import subprocess
|
||||
import argparse
|
||||
import tqdm
|
||||
import getpass
|
||||
import time
|
||||
|
||||
parser = argparse.ArgumentParser(description='Welcome to the run N at a time script')
|
||||
parser.add_argument('--num_parallel_jobs', type=int)
|
||||
parser.add_argument('--total_epochs', type=int)
|
||||
args = parser.parse_args()
|
||||
|
||||
|
||||
def check_if_experiment_with_name_is_running(experiment_name):
|
||||
result = subprocess.run(['squeue --name {}'.format(experiment_name), '-l'], stdout=subprocess.PIPE, shell=True)
|
||||
lines = result.stdout.split(b'\n')
|
||||
if len(lines) > 2:
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
student_id = getpass.getuser().encode()[:5]
|
||||
list_of_scripts = [item for item in
|
||||
subprocess.run(['ls'], stdout=subprocess.PIPE).stdout.split(b'\n') if
|
||||
item.decode("utf-8").endswith(".sh")]
|
||||
|
||||
for script in list_of_scripts:
|
||||
print('sbatch', script.decode("utf-8"))
|
||||
|
||||
epoch_dict = {key.decode("utf-8"): 0 for key in list_of_scripts}
|
||||
total_jobs_finished = 0
|
||||
|
||||
while total_jobs_finished < args.total_epochs * len(list_of_scripts):
|
||||
curr_idx = 0
|
||||
with tqdm.tqdm(total=len(list_of_scripts)) as pbar_experiment:
|
||||
while curr_idx < len(list_of_scripts):
|
||||
number_of_jobs = 0
|
||||
result = subprocess.run(['squeue', '-l'], stdout=subprocess.PIPE)
|
||||
for line in result.stdout.split(b'\n'):
|
||||
if student_id in line:
|
||||
number_of_jobs += 1
|
||||
|
||||
if number_of_jobs < args.num_parallel_jobs:
|
||||
while check_if_experiment_with_name_is_running(
|
||||
experiment_name=list_of_scripts[curr_idx].decode("utf-8")) or epoch_dict[
|
||||
list_of_scripts[curr_idx].decode("utf-8")] >= args.total_epochs:
|
||||
curr_idx += 1
|
||||
if curr_idx >= len(list_of_scripts):
|
||||
curr_idx = 0
|
||||
|
||||
str_to_run = 'sbatch {}'.format(list_of_scripts[curr_idx].decode("utf-8"))
|
||||
total_jobs_finished += 1
|
||||
os.system(str_to_run)
|
||||
print(str_to_run)
|
||||
curr_idx += 1
|
||||
else:
|
||||
time.sleep(1)
|
1023  data/HadSSP_daily_qc.txt  (Normal file)
BIN   data/VGG38_BN_RC_accuracy_performance.pdf  (Normal file)
BIN   data/VGG38_BN_RC_loss_performance.pdf  (Normal file)
BIN   data/ccpp_data.npz  (Normal file)
BIN   data/mnist-test.npz  (Normal file)
BIN   data/mnist-train.npz  (Normal file)
BIN   data/mnist-valid.npz  (Normal file)
BIN   data/problem_model_accuracy_performance.pdf  (Normal file)
BIN   data/problem_model_loss_performance.pdf  (Normal file)
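The mnist-*.npz archives listed above are NumPy bundles of the kind the data providers at the end of this diff load with np.load. A minimal inspection sketch, assuming they use the same 'inputs'/'targets' keys as the EMNIST archives loaded there:

import numpy as np

# Assumption: same key names as the EMNIST archives read by EMNISTDataProvider below.
loaded = np.load("data/mnist-train.npz")
inputs, targets = loaded["inputs"], loaded["targets"]
print(inputs.shape, inputs.dtype, targets.shape, targets.dtype)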
@@ -1,55 +0,0 @@
from PIL import Image
from numpy import random
from torchvision import transforms
import numpy as np
import torch


class Cutout(object):
    """Randomly mask out one or more patches from an image.
    Args:
        n_holes (int): Number of patches to cut out of each image.
        length (int): The length (in pixels) of each square patch.
    """

    def __init__(self, n_holes, length):
        self.n_holes = n_holes
        self.length = length

    def __call__(self, img):
        """
        Args:
            img (Tensor): Tensor image of size (C, H, W).
        Returns:
            Tensor: Image with n_holes of dimension length x length cut out of it.
        """

        from_PIL = False

        if type(img) == Image.Image:
            from_PIL = True
            img = transforms.ToTensor()(img)

        h = img.size(1)
        w = img.size(2)

        mask = np.ones((h, w), np.float32)

        for n in range(self.n_holes):
            y = random.randint(0, h)
            x = random.randint(0, w)

            y1 = np.clip(y - self.length // 2, 0, h)
            y2 = np.clip(y + self.length // 2, 0, h)
            x1 = np.clip(x - self.length // 2, 0, w)
            x2 = np.clip(x + self.length // 2, 0, w)

            mask[y1: y2, x1: x2] = 0.

        mask = torch.from_numpy(mask)
        mask = mask.expand_as(img)
        img = img * mask

        if from_PIL:
            img = transforms.ToPILImage()(img)

        return img
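A hedged usage sketch for the (now removed) Cutout transform above, composing it with a standard torchvision pipeline. The dataset choice and hole parameters are illustrative, not taken from the repo:

from torchvision import transforms, datasets

# Cutout accepts a tensor (or a PIL image), so it is placed after ToTensor here.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    Cutout(n_holes=1, length=8),  # mask one 8x8 patch per image (illustrative values)
])
train_set = datasets.CIFAR100(root="data", train=True, download=True,
                              transform=train_transform)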
@ -1,306 +0,0 @@
|
||||
import sys
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.optim as optim
|
||||
import torch.nn.functional as F
|
||||
import tqdm
|
||||
import os
|
||||
import numpy as np
|
||||
import time
|
||||
|
||||
from torch.optim.adam import Adam
|
||||
|
||||
from storage_utils import save_statistics
|
||||
|
||||
class ExperimentBuilder(nn.Module):
|
||||
def __init__(self, network_model, experiment_name, num_epochs, train_data, val_data,
|
||||
test_data, weight_decay_coefficient, use_gpu, continue_from_epoch=-1):
|
||||
"""
|
||||
Initializes an ExperimentBuilder object. Such an object takes care of running training and evaluation of a deep net
|
||||
on a given dataset. It also takes care of saving per epoch models and automatically inferring the best val model
|
||||
to be used for evaluating the test set metrics.
|
||||
:param network_model: A pytorch nn.Module which implements a network architecture.
|
||||
:param experiment_name: The name of the experiment. This is used mainly for keeping track of the experiment and creating and directory structure that will be used to save logs, model parameters and other.
|
||||
:param num_epochs: Total number of epochs to run the experiment
|
||||
:param train_data: An object of the DataProvider type. Contains the training set.
|
||||
:param val_data: An object of the DataProvider type. Contains the val set.
|
||||
:param test_data: An object of the DataProvider type. Contains the test set.
|
||||
:param weight_decay_coefficient: A float indicating the weight decay to use with the adam optimizer.
|
||||
:param use_gpu: A boolean indicating whether to use a GPU or not.
|
||||
:param continue_from_epoch: An int indicating whether we'll start from scrach (-1) or whether we'll reload a previously saved model of epoch 'continue_from_epoch' and continue training from there.
|
||||
"""
|
||||
super(ExperimentBuilder, self).__init__()
|
||||
|
||||
self.experiment_name = experiment_name
|
||||
self.model = network_model
|
||||
self.model.reset_parameters()
|
||||
self.device = torch.cuda.current_device()
|
||||
|
||||
if torch.cuda.device_count() > 1 and use_gpu:
|
||||
self.device = torch.cuda.current_device()
|
||||
self.model.to(self.device)
|
||||
self.model = nn.DataParallel(module=self.model)
|
||||
print('Use Multi GPU', self.device)
|
||||
elif torch.cuda.device_count() == 1 and use_gpu:
|
||||
self.device = torch.cuda.current_device()
|
||||
self.model.to(self.device) # sends the model from the cpu to the gpu
|
||||
print('Use GPU', self.device)
|
||||
else:
|
||||
print("use CPU")
|
||||
self.device = torch.device('cpu') # sets the device to be CPU
|
||||
print(self.device)
|
||||
|
||||
# re-initialize network parameters
|
||||
self.train_data = train_data
|
||||
self.val_data = val_data
|
||||
self.test_data = test_data
|
||||
self.optimizer = Adam(self.parameters(), amsgrad=False,
|
||||
weight_decay=weight_decay_coefficient)
|
||||
|
||||
print('System learnable parameters')
|
||||
num_conv_layers = 0
|
||||
num_linear_layers = 0
|
||||
total_num_parameters = 0
|
||||
for name, value in self.named_parameters():
|
||||
print(name, value.shape)
|
||||
if all(item in name for item in ['conv', 'weight']):
|
||||
num_conv_layers += 1
|
||||
if all(item in name for item in ['linear', 'weight']):
|
||||
num_linear_layers += 1
|
||||
total_num_parameters += np.prod(value.shape)
|
||||
|
||||
print('Total number of parameters', total_num_parameters)
|
||||
print('Total number of conv layers', num_conv_layers)
|
||||
print('Total number of linear layers', num_linear_layers)
|
||||
|
||||
# Generate the directory names
|
||||
self.experiment_folder = os.path.abspath(experiment_name)
|
||||
self.experiment_logs = os.path.abspath(os.path.join(self.experiment_folder, "result_outputs"))
|
||||
self.experiment_saved_models = os.path.abspath(os.path.join(self.experiment_folder, "saved_models"))
|
||||
print(self.experiment_folder, self.experiment_logs)
|
||||
# Set best models to be at 0 since we are just starting
|
||||
self.best_val_model_idx = 0
|
||||
self.best_val_model_acc = 0.
|
||||
|
||||
if not os.path.exists(self.experiment_folder): # If experiment directory does not exist
|
||||
os.mkdir(self.experiment_folder) # create the experiment directory
|
||||
|
||||
if not os.path.exists(self.experiment_logs):
|
||||
os.mkdir(self.experiment_logs) # create the experiment log directory
|
||||
|
||||
if not os.path.exists(self.experiment_saved_models):
|
||||
os.mkdir(self.experiment_saved_models) # create the experiment saved models directory
|
||||
|
||||
self.num_epochs = num_epochs
|
||||
self.criterion = nn.CrossEntropyLoss().to(self.device) # send the loss computation to the GPU
|
||||
if continue_from_epoch == -2:
|
||||
try:
|
||||
self.best_val_model_idx, self.best_val_model_acc, self.state = self.load_model(
|
||||
model_save_dir=self.experiment_saved_models, model_save_name="train_model",
|
||||
model_idx='latest') # reload existing model from epoch and return best val model index
|
||||
# and the best val acc of that model
|
||||
self.starting_epoch = self.state['current_epoch_idx']
|
||||
except:
|
||||
print("Model objects cannot be found, initializing a new model and starting from scratch")
|
||||
self.starting_epoch = 0
|
||||
self.state = dict()
|
||||
|
||||
elif continue_from_epoch != -1: # if continue from epoch is not -1 then
|
||||
self.best_val_model_idx, self.best_val_model_acc, self.state = self.load_model(
|
||||
model_save_dir=self.experiment_saved_models, model_save_name="train_model",
|
||||
model_idx=continue_from_epoch) # reload existing model from epoch and return best val model index
|
||||
# and the best val acc of that model
|
||||
self.starting_epoch = self.state['current_epoch_idx']
|
||||
else:
|
||||
self.starting_epoch = 0
|
||||
self.state = dict()
|
||||
|
||||
def get_num_parameters(self):
|
||||
total_num_params = 0
|
||||
for param in self.parameters():
|
||||
total_num_params += np.prod(param.shape)
|
||||
|
||||
return total_num_params
|
||||
|
||||
def run_train_iter(self, x, y):
|
||||
"""
|
||||
Receives the inputs and targets for the model and runs a training iteration. Returns loss and accuracy metrics.
|
||||
:param x: The inputs to the model. A numpy array of shape batch_size, channels, height, width
|
||||
:param y: The targets for the model. A numpy array of shape batch_size, num_classes
|
||||
:return: the loss and accuracy for this batch
|
||||
"""
|
||||
self.train() # sets model to training mode (in case batch normalization or other methods have different procedures for training and evaluation)
|
||||
|
||||
if len(y.shape) > 1:
|
||||
y = np.argmax(y, axis=1) # convert one hot encoded labels to single integer labels
|
||||
|
||||
#print(type(x))
|
||||
|
||||
if type(x) is np.ndarray:
|
||||
x, y = torch.Tensor(x).float().to(device=self.device), torch.Tensor(y).long().to(
|
||||
device=self.device) # send data to device as torch tensors
|
||||
|
||||
x = x.to(self.device)
|
||||
y = y.to(self.device)
|
||||
|
||||
out = self.model.forward(x) # forward the data in the model
|
||||
loss = F.cross_entropy(input=out, target=y) # compute loss
|
||||
|
||||
self.optimizer.zero_grad() # set all weight grads from previous training iters to 0
|
||||
loss.backward() # backpropagate to compute gradients for current iter loss
|
||||
|
||||
self.optimizer.step() # update network parameters
|
||||
_, predicted = torch.max(out.data, 1) # get argmax of predictions
|
||||
accuracy = np.mean(list(predicted.eq(y.data).cpu())) # compute accuracy
|
||||
return loss.data.detach().cpu().numpy(), accuracy
|
||||
|
||||
def run_evaluation_iter(self, x, y):
|
||||
"""
|
||||
Receives the inputs and targets for the model and runs an evaluation iterations. Returns loss and accuracy metrics.
|
||||
:param x: The inputs to the model. A numpy array of shape batch_size, channels, height, width
|
||||
:param y: The targets for the model. A numpy array of shape batch_size, num_classes
|
||||
:return: the loss and accuracy for this batch
|
||||
"""
|
||||
self.eval() # sets the system to validation mode
|
||||
if len(y.shape) > 1:
|
||||
y = np.argmax(y, axis=1) # convert one hot encoded labels to single integer labels
|
||||
if type(x) is np.ndarray:
|
||||
x, y = torch.Tensor(x).float().to(device=self.device), torch.Tensor(y).long().to(
|
||||
device=self.device) # convert data to pytorch tensors and send to the computation device
|
||||
|
||||
x = x.to(self.device)
|
||||
y = y.to(self.device)
|
||||
out = self.model.forward(x) # forward the data in the model
|
||||
loss = F.cross_entropy(out, y) # compute loss
|
||||
_, predicted = torch.max(out.data, 1) # get argmax of predictions
|
||||
accuracy = np.mean(list(predicted.eq(y.data).cpu())) # compute accuracy
|
||||
return loss.data.detach().cpu().numpy(), accuracy
|
||||
|
||||
def save_model(self, model_save_dir, model_save_name, model_idx, state):
|
||||
"""
|
||||
Save the network parameter state and current best val epoch idx and best val accuracy.
|
||||
:param model_save_name: Name to use to save model without the epoch index
|
||||
:param model_idx: The index to save the model with.
|
||||
:param best_validation_model_idx: The index of the best validation model to be stored for future use.
|
||||
:param best_validation_model_acc: The best validation accuracy to be stored for use at test time.
|
||||
:param model_save_dir: The directory to store the state at.
|
||||
:param state: The dictionary containing the system state.
|
||||
|
||||
"""
|
||||
state['network'] = self.state_dict() # save network parameter and other variables.
|
||||
torch.save(state, f=os.path.join(model_save_dir, "{}_{}".format(model_save_name, str(
|
||||
model_idx)))) # save state at prespecified filepath
|
||||
|
||||
def run_training_epoch(self, current_epoch_losses):
|
||||
with tqdm.tqdm(total=len(self.train_data), file=sys.stdout) as pbar_train: # create a progress bar for training
|
||||
for idx, (x, y) in enumerate(self.train_data): # get data batches
|
||||
loss, accuracy = self.run_train_iter(x=x, y=y) # take a training iter step
|
||||
current_epoch_losses["train_loss"].append(loss) # add current iter loss to the train loss list
|
||||
current_epoch_losses["train_acc"].append(accuracy) # add current iter acc to the train acc list
|
||||
pbar_train.update(1)
|
||||
pbar_train.set_description("loss: {:.4f}, accuracy: {:.4f}".format(loss, accuracy))
|
||||
|
||||
return current_epoch_losses
|
||||
|
||||
def run_validation_epoch(self, current_epoch_losses):
|
||||
|
||||
with tqdm.tqdm(total=len(self.val_data), file=sys.stdout) as pbar_val: # create a progress bar for validation
|
||||
for x, y in self.val_data: # get data batches
|
||||
loss, accuracy = self.run_evaluation_iter(x=x, y=y) # run a validation iter
|
||||
current_epoch_losses["val_loss"].append(loss) # add current iter loss to val loss list.
|
||||
current_epoch_losses["val_acc"].append(accuracy) # add current iter acc to val acc lst.
|
||||
pbar_val.update(1) # add 1 step to the progress bar
|
||||
pbar_val.set_description("loss: {:.4f}, accuracy: {:.4f}".format(loss, accuracy))
|
||||
|
||||
return current_epoch_losses
|
||||
|
||||
def run_testing_epoch(self, current_epoch_losses):
|
||||
|
||||
with tqdm.tqdm(total=len(self.test_data), file=sys.stdout) as pbar_test: # ini a progress bar
|
||||
for x, y in self.test_data: # sample batch
|
||||
loss, accuracy = self.run_evaluation_iter(x=x,
|
||||
y=y) # compute loss and accuracy by running an evaluation step
|
||||
current_epoch_losses["test_loss"].append(loss) # save test loss
|
||||
current_epoch_losses["test_acc"].append(accuracy) # save test accuracy
|
||||
pbar_test.update(1) # update progress bar status
|
||||
pbar_test.set_description(
|
||||
"loss: {:.4f}, accuracy: {:.4f}".format(loss, accuracy)) # update progress bar string output
|
||||
return current_epoch_losses
|
||||
|
||||
|
||||
def load_model(self, model_save_dir, model_save_name, model_idx):
|
||||
"""
|
||||
Load the network parameter state and the best val model idx and best val acc to be compared with the future val accuracies, in order to choose the best val model
|
||||
:param model_save_dir: The directory to store the state at.
|
||||
:param model_save_name: Name to use to save model without the epoch index
|
||||
:param model_idx: The index to save the model with.
|
||||
:return: best val idx and best val model acc, also it loads the network state into the system state without returning it
|
||||
"""
|
||||
state = torch.load(f=os.path.join(model_save_dir, "{}_{}".format(model_save_name, str(model_idx))))
|
||||
self.load_state_dict(state_dict=state['network'])
|
||||
return state['best_val_model_idx'], state['best_val_model_acc'], state
|
||||
|
||||
def run_experiment(self):
|
||||
"""
|
||||
Runs experiment train and evaluation iterations, saving the model and best val model and val model accuracy after each epoch
|
||||
:return: The summary current_epoch_losses from starting epoch to total_epochs.
|
||||
"""
|
||||
total_losses = {"train_acc": [], "train_loss": [], "val_acc": [],
|
||||
"val_loss": [], "curr_epoch": []} # initialize a dict to keep the per-epoch metrics
|
||||
for i, epoch_idx in enumerate(range(self.starting_epoch, self.num_epochs)):
|
||||
epoch_start_time = time.time()
|
||||
current_epoch_losses = {"train_acc": [], "train_loss": [], "val_acc": [], "val_loss": []}
|
||||
|
||||
current_epoch_losses = self.run_training_epoch(current_epoch_losses)
|
||||
current_epoch_losses = self.run_validation_epoch(current_epoch_losses)
|
||||
|
||||
val_mean_accuracy = np.mean(current_epoch_losses['val_acc'])
|
||||
if val_mean_accuracy > self.best_val_model_acc: # if current epoch's mean val acc is greater than the saved best val acc then
|
||||
self.best_val_model_acc = val_mean_accuracy # set the best val model acc to be current epoch's val accuracy
|
||||
self.best_val_model_idx = epoch_idx # set the experiment-wise best val idx to be the current epoch's idx
|
||||
|
||||
for key, value in current_epoch_losses.items():
|
||||
total_losses[key].append(np.mean(value))
|
||||
# get mean of all metrics of current epoch metrics dict,
|
||||
# to get them ready for storage and output on the terminal.
|
||||
|
||||
total_losses['curr_epoch'].append(epoch_idx)
|
||||
save_statistics(experiment_log_dir=self.experiment_logs, filename='summary.csv',
|
||||
stats_dict=total_losses, current_epoch=i,
|
||||
continue_from_mode=True if (self.starting_epoch != 0 or i > 0) else False) # save statistics to stats file.
|
||||
|
||||
# load_statistics(experiment_log_dir=self.experiment_logs, filename='summary.csv') # How to load a csv file if you need to
|
||||
|
||||
out_string = "_".join(
|
||||
["{}_{:.4f}".format(key, np.mean(value)) for key, value in current_epoch_losses.items()])
|
||||
# create a string to use to report our epoch metrics
|
||||
epoch_elapsed_time = time.time() - epoch_start_time # calculate time taken for epoch
|
||||
epoch_elapsed_time = "{:.4f}".format(epoch_elapsed_time)
|
||||
print("Epoch {}:".format(epoch_idx), out_string, "epoch time", epoch_elapsed_time, "seconds")
|
||||
self.state['current_epoch_idx'] = epoch_idx
|
||||
self.state['best_val_model_acc'] = self.best_val_model_acc
|
||||
self.state['best_val_model_idx'] = self.best_val_model_idx
|
||||
self.save_model(model_save_dir=self.experiment_saved_models,
|
||||
# save model and best val idx and best val acc, using the model dir, model name and model idx
|
||||
model_save_name="train_model", model_idx=epoch_idx, state=self.state)
|
||||
self.save_model(model_save_dir=self.experiment_saved_models,
|
||||
# save model and best val idx and best val acc, using the model dir, model name and model idx
|
||||
model_save_name="train_model", model_idx='latest', state=self.state)
|
||||
|
||||
print("Generating test set evaluation metrics")
|
||||
self.load_model(model_save_dir=self.experiment_saved_models, model_idx=self.best_val_model_idx,
|
||||
# load best validation model
|
||||
model_save_name="train_model")
|
||||
current_epoch_losses = {"test_acc": [], "test_loss": []} # initialize a statistics dict
|
||||
|
||||
current_epoch_losses = self.run_testing_epoch(current_epoch_losses=current_epoch_losses)
|
||||
|
||||
test_losses = {key: [np.mean(value)] for key, value in
|
||||
current_epoch_losses.items()} # save test set metrics in dict format
|
||||
|
||||
save_statistics(experiment_log_dir=self.experiment_logs, filename='test_summary.csv',
|
||||
# save test set metrics on disk in .csv format
|
||||
stats_dict=test_losses, current_epoch=0, continue_from_mode=False)
|
||||
|
||||
return total_losses, test_losses
|
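A hedged construction sketch for the ExperimentBuilder class above, following its constructor signature. `model` and the three data providers are placeholders for whatever network and DataProvider objects you are using; nothing beyond the signature comes from the repo:

# `model` must expose reset_parameters(); train/val/test_data iterate (inputs, targets) batches.
experiment = ExperimentBuilder(network_model=model,
                               experiment_name="emnist_test_exp",
                               num_epochs=100,
                               train_data=train_data,
                               val_data=val_data,
                               test_data=test_data,
                               weight_decay_coefficient=1e-05,
                               use_gpu=True,
                               continue_from_epoch=-1)

# Trains for num_epochs, tracks the best validation model, then writes
# result_outputs/summary.csv and result_outputs/test_summary.csv.
total_losses, test_losses = experiment.run_experiment()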
@@ -1,16 +0,0 @@
{
  "batch_size": 100,
  "dataset_name": "cifar10",
  "continue_from_epoch": -2,
  "seed": 0,
  "image_num_channels": 3,
  "image_height": 32,
  "image_width": 32,
  "dim_reduction_type": "avg_pooling",
  "num_layers": 4,
  "num_filters": 64,
  "num_epochs": 250,
  "experiment_name": "cifar10_tutorial",
  "use_gpu": true,
  "weight_decay_coefficient": 1e-05
}
@@ -1,16 +0,0 @@
{
  "batch_size": 100,
  "dataset_name": "emnist",
  "continue_from_epoch": -2,
  "seed": 0,
  "image_num_channels": 1,
  "image_height": 28,
  "image_width": 28,
  "dim_reduction_type": "avg_pooling",
  "num_layers": 4,
  "num_filters": 32,
  "num_epochs": 250,
  "experiment_name": "emnist_tutorial",
  "use_gpu": true,
  "weight_decay_coefficient": 1e-05
}
@@ -1,6 +0,0 @@
conda install -c conda-forge opencv
conda install numpy scipy matplotlib
conda install -c conda-forge pbzip2 pydrive
conda install pillow tqdm
pip install GPUtil
conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
@@ -1,12 +0,0 @@
#!/bin/sh

cd ..
export DATASET_DIR="data/"
# Activate the relevant virtual environment:

python train_evaluate_emnist_classification_system.py --batch_size 100 --continue_from_epoch -1 --seed 0 \
    --image_num_channels 3 --image_height 32 --image_width 32 \
    --dim_reduction_type "strided" --num_layers 4 --num_filters 64 \
    --num_epochs 100 --experiment_name 'cifar100_test_exp' \
    --use_gpu "True" --weight_decay_coefficient 0. \
    --dataset_name "cifar100"
@@ -1,12 +0,0 @@
#!/bin/sh

cd ..
export DATASET_DIR="data/"
# Activate the relevant virtual environment:

python train_evaluate_emnist_classification_system.py --batch_size 100 --continue_from_epoch -1 --seed 0 \
    --image_num_channels 3 --image_height 32 --image_width 32 \
    --dim_reduction_type "strided" --num_layers 4 --num_filters 64 \
    --num_epochs 100 --experiment_name 'cifar10_test_exp' \
    --use_gpu "True" --weight_decay_coefficient 0. \
    --dataset_name "cifar10"
@@ -1,7 +0,0 @@
#!/bin/sh

cd ..
export DATASET_DIR="data/"
# Activate the relevant virtual environment:

python train_evaluate_emnist_classification_system.py --filepath_to_arguments_json_file experiment_configs/cifar10_tutorial_config.json
@@ -1,11 +0,0 @@
#!/bin/sh

cd ..
export DATASET_DIR="data/"
# Activate the relevant virtual environment:

python train_evaluate_emnist_classification_system.py --batch_size 100 --continue_from_epoch -1 --seed 0 \
    --image_num_channels 1 --image_height 28 --image_width 28 \
    --dim_reduction_type "strided" --num_layers 4 --num_filters 64 \
    --num_epochs 100 --experiment_name 'emnist_test_exp' \
    --use_gpu "True" --weight_decay_coefficient 0.
@@ -1,7 +0,0 @@
#!/bin/sh

cd ..
export DATASET_DIR="data/"
# Activate the relevant virtual environment:

python train_evaluate_emnist_classification_system.py --filepath_to_arguments_json_file experiment_configs/emnist_tutorial_config.json
6  mlp/__init__.py  (Normal file)

@@ -0,0 +1,6 @@
# -*- coding: utf-8 -*-
"""Machine Learning Practical package."""

__authors__ = ['Pawel Swietojanski', 'Steve Renals', 'Matt Graham', 'Antreas Antoniou']

DEFAULT_SEED = 123456  # Default random number generator seed if none provided.
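A one-line sketch of how the package-level seed above is typically consumed: seeding the NumPy RandomState that the data providers below accept through their `rng` argument.

import numpy as np
from mlp import DEFAULT_SEED

# Seeded generator to pass as `rng=` to the DataProvider classes further down.
rng = np.random.RandomState(DEFAULT_SEED)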
@ -4,25 +4,23 @@
|
||||
This module provides classes for loading datasets and iterating over batches of
|
||||
data points.
|
||||
"""
|
||||
from __future__ import print_function
|
||||
|
||||
import pickle
|
||||
import gzip
|
||||
import numpy as np
|
||||
import os
|
||||
DEFAULT_SEED = 20112018
|
||||
from PIL import Image
|
||||
import os
|
||||
import os.path
|
||||
import numpy as np
|
||||
import sys
|
||||
if sys.version_info[0] == 2:
|
||||
import cPickle as pickle
|
||||
else:
|
||||
import pickle
|
||||
|
||||
import torch.utils.data as data
|
||||
import numpy as np
|
||||
import os
|
||||
|
||||
from PIL import Image
|
||||
from torch.utils import data
|
||||
from torch.utils.data import Dataset
|
||||
from torchvision import transforms
|
||||
from torchvision.datasets.utils import download_url, check_integrity
|
||||
|
||||
from mlp import DEFAULT_SEED
|
||||
|
||||
|
||||
class DataProvider(object):
|
||||
"""Generic data provider."""
|
||||
|
||||
@ -174,7 +172,7 @@ class MNISTDataProvider(DataProvider):
|
||||
# separator for the current platform / OS is used
|
||||
# MLP_DATA_DIR environment variable should point to the data directory
|
||||
data_path = os.path.join(
|
||||
"data", 'mnist-{0}.npz'.format(which_set))
|
||||
os.environ['MLP_DATA_DIR'], 'mnist-{0}.npz'.format(which_set))
|
||||
assert os.path.isfile(data_path), (
|
||||
'Data file does not exist at expected path: ' + data_path
|
||||
)
|
||||
@ -240,7 +238,7 @@ class EMNISTDataProvider(DataProvider):
|
||||
# separator for the current platform / OS is used
|
||||
# MLP_DATA_DIR environment variable should point to the data directory
|
||||
data_path = os.path.join(
|
||||
"data", 'emnist-{0}.npz'.format(which_set))
|
||||
os.environ['MLP_DATA_DIR'], 'emnist-{0}.npz'.format(which_set))
|
||||
assert os.path.isfile(data_path), (
|
||||
'Data file does not exist at expected path: ' + data_path
|
||||
)
|
||||
@ -249,18 +247,16 @@ class EMNISTDataProvider(DataProvider):
|
||||
print(loaded.keys())
|
||||
inputs, targets = loaded['inputs'], loaded['targets']
|
||||
inputs = inputs.astype(np.float32)
|
||||
targets = targets.astype(np.int)
|
||||
if flatten:
|
||||
inputs = np.reshape(inputs, newshape=(-1, 28*28))
|
||||
else:
|
||||
inputs = np.reshape(inputs, newshape=(-1, 1, 28, 28))
|
||||
inputs = np.reshape(inputs, newshape=(-1, 28, 28, 1))
|
||||
inputs = inputs / 255.0
|
||||
# pass the loaded data to the parent class __init__
|
||||
super(EMNISTDataProvider, self).__init__(
|
||||
inputs, targets, batch_size, max_num_batches, shuffle_order, rng)
|
||||
|
||||
def __len__(self):
|
||||
return self.num_batches
|
||||
|
||||
def next(self):
|
||||
"""Returns next data batch or raises `StopIteration` if at end."""
|
||||
inputs_batch, targets_batch = super(EMNISTDataProvider, self).next()
|
||||
@ -285,7 +281,6 @@ class EMNISTDataProvider(DataProvider):
|
||||
one_of_k_targets[range(int_targets.shape[0]), int_targets] = 1
|
||||
return one_of_k_targets
|
||||
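The `to_one_of_k` helper above builds the one-of-k (one-hot) target matrix with a single NumPy fancy-indexing assignment. A minimal standalone sketch of the same trick, for readers skimming the diff:

```python
import numpy as np

int_targets = np.array([2, 0, 1])   # class indices for a batch of 3 examples
num_classes = 3
one_of_k = np.zeros((int_targets.shape[0], num_classes))
# each row gets exactly one 1, at the column given by the class index
one_of_k[range(int_targets.shape[0]), int_targets] = 1
print(one_of_k)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```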
|
||||
|
||||
class MetOfficeDataProvider(DataProvider):
|
||||
"""South Scotland Met Office weather data provider."""
|
||||
|
||||
@ -308,7 +303,7 @@ class MetOfficeDataProvider(DataProvider):
|
||||
rng (RandomState): A seeded random number generator.
|
||||
"""
|
||||
data_path = os.path.join(
|
||||
os.environ['DATASET_DIR'], 'HadSSP_daily_qc.txt')
|
||||
os.environ['MLP_DATA_DIR'], 'HadSSP_daily_qc.txt')
|
||||
assert os.path.isfile(data_path), (
|
||||
'Data file does not exist at expected path: ' + data_path
|
||||
)
|
||||
@ -356,7 +351,7 @@ class CCPPDataProvider(DataProvider):
|
||||
rng (RandomState): A seeded random number generator.
|
||||
"""
|
||||
data_path = os.path.join(
|
||||
os.environ['DATASET_DIR'], 'ccpp_data.npz')
|
||||
os.environ['MLP_DATA_DIR'], 'ccpp_data.npz')
|
||||
assert os.path.isfile(data_path), (
|
||||
'Data file does not exist at expected path: ' + data_path
|
||||
)
|
||||
@ -379,6 +374,21 @@ class CCPPDataProvider(DataProvider):
|
||||
super(CCPPDataProvider, self).__init__(
|
||||
inputs, targets, batch_size, max_num_batches, shuffle_order, rng)
|
||||
|
||||
class EMNISTPytorchDataProvider(Dataset):
|
||||
def __init__(self, which_set='train', batch_size=100, max_num_batches=-1,
|
||||
shuffle_order=True, rng=None, flatten=False, transforms=None):
|
||||
self.numpy_data_provider = EMNISTDataProvider(which_set=which_set, batch_size=batch_size, max_num_batches=max_num_batches,
|
||||
shuffle_order=shuffle_order, rng=rng, flatten=flatten)
|
||||
self.transforms = transforms
|
||||
|
||||
def __getitem__(self, item):
|
||||
x = self.numpy_data_provider.inputs[item]
|
||||
for augmentation in self.transforms:
|
||||
x = augmentation(x)
|
||||
return x, int(self.numpy_data_provider.targets[item])
|
||||
|
||||
def __len__(self):
|
||||
return len(self.numpy_data_provider.targets)
|
||||
|
||||
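A hedged usage sketch (not part of the diff) showing how `EMNISTPytorchDataProvider` can be fed to a PyTorch `DataLoader`. It assumes the module path `mlp.data_providers`, that the `emnist-*.npz` files sit where the underlying `EMNISTDataProvider` looks for them (the `MLP_DATA_DIR`/`DATASET_DIR` environment variable, depending on which side of this diff you are on), and that torchvision is installed. Note that `__getitem__` iterates over `transforms` directly, so pass a list (possibly empty) rather than `None`.

```python
from torch.utils.data import DataLoader
from torchvision import transforms

from mlp.data_providers import EMNISTPytorchDataProvider  # assumed module path

train_set = EMNISTPytorchDataProvider(which_set='train', flatten=False,
                                      transforms=[transforms.ToTensor()])
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)

x_batch, y_batch = next(iter(train_loader))
# ToTensor moves the channel axis first, so e.g. (100, 1, 28, 28) and (100,)
print(x_batch.shape, y_batch.shape)
```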
class AugmentedMNISTDataProvider(MNISTDataProvider):
|
||||
"""Data provider for MNIST dataset which randomly transforms images."""
|
||||
@ -417,12 +427,8 @@ class AugmentedMNISTDataProvider(MNISTDataProvider):
|
||||
transformed_inputs_batch = self.transformer(inputs_batch, self.rng)
|
||||
return transformed_inputs_batch, targets_batch
|
||||
|
||||
|
||||
|
||||
|
||||
class CIFAR10(data.Dataset):
|
||||
class Omniglot(data.Dataset):
|
||||
"""`CIFAR10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ Dataset.
|
||||
|
||||
Args:
|
||||
root (string): Root directory of dataset where directory
|
||||
``cifar-10-batches-py`` exists or will be saved to if download is set to True.
|
||||
@ -435,7 +441,118 @@ class CIFAR10(data.Dataset):
|
||||
download (bool, optional): If true, downloads the dataset from the internet and
|
||||
puts it in root directory. If dataset is already downloaded, it is not
|
||||
downloaded again.
|
||||
"""
|
||||
def collect_data_paths(self, root):
|
||||
data_dict = dict()
|
||||
print(root)
|
||||
for subdir, dir, files in os.walk(root):
|
||||
for file in files:
|
||||
if file.endswith('.png'):
|
||||
filepath = os.path.join(subdir, file)
|
||||
class_label = '_'.join(subdir.split("/")[-2:])
|
||||
if class_label in data_dict:
|
||||
data_dict[class_label].append(filepath)
|
||||
else:
|
||||
data_dict[class_label] = [filepath]
|
||||
|
||||
return data_dict
|
||||
|
||||
def __init__(self, root, set_name,
|
||||
transform=None, target_transform=None,
|
||||
download=False):
|
||||
self.root = os.path.expanduser(root)
|
||||
self.root = os.path.abspath(os.path.join(self.root, 'omniglot_dataset'))
|
||||
self.transform = transform
|
||||
self.target_transform = target_transform
|
||||
self.set_name = set_name # training set or test set
|
||||
self.data_dict = self.collect_data_paths(root=self.root)
|
||||
|
||||
x = []
|
||||
label_to_idx = {label: idx for idx, label in enumerate(self.data_dict.keys())}
|
||||
y = []
|
||||
|
||||
for key, value in self.data_dict.items():
|
||||
x.extend(value)
|
||||
y.extend(len(value) * [label_to_idx[key]])
|
||||
|
||||
y = np.array(y)
|
||||
|
||||
|
||||
rng = np.random.RandomState(seed=0)
|
||||
|
||||
idx = np.arange(len(x))
|
||||
rng.shuffle(idx)
|
||||
|
||||
x = [x[current_idx] for current_idx in idx]
|
||||
y = y[idx]
|
||||
|
||||
train_sample_idx = rng.choice(a=[i for i in range(len(x))], size=int(len(x) * 0.80), replace=False)
|
||||
evaluation_sample_idx = [i for i in range(len(x)) if i not in train_sample_idx]
|
||||
validation_sample_idx = rng.choice(a=evaluation_sample_idx, size=int(len(evaluation_sample_idx) * 0.40), replace=False)
|
||||
test_sample_idx = [i for i in evaluation_sample_idx if i not in validation_sample_idx]
|
||||
|
||||
if self.set_name=='train':
|
||||
self.data = [item for idx, item in enumerate(x) if idx in train_sample_idx]
|
||||
self.labels = y[train_sample_idx]
|
||||
|
||||
elif self.set_name=='val':
|
||||
self.data = [item for idx, item in enumerate(x) if idx in validation_sample_idx]
|
||||
self.labels = y[validation_sample_idx]
|
||||
|
||||
else:
|
||||
self.data = [item for idx, item in enumerate(x) if idx in test_sample_idx]
|
||||
self.labels = y[test_sample_idx]
|
||||
|
||||
def __getitem__(self, index):
|
||||
"""
|
||||
Args:
|
||||
index (int): Index
|
||||
Returns:
|
||||
tuple: (image, target) where target is index of the target class.
|
||||
"""
|
||||
img, target = self.data[index], self.labels[index]
|
||||
|
||||
img = Image.open(img)
|
||||
# img.show()  # debugging leftover: opening an image viewer for every item would block batch loading
|
||||
|
||||
if self.transform is not None:
|
||||
img = self.transform(img)
|
||||
|
||||
if self.target_transform is not None:
|
||||
target = self.target_transform(target)
|
||||
|
||||
return img, target
|
||||
|
||||
def __len__(self):
|
||||
return len(self.data)
|
||||
|
||||
|
||||
def __repr__(self):
|
||||
fmt_str = 'Dataset ' + self.__class__.__name__ + '\n'
|
||||
fmt_str += ' Number of datapoints: {}\n'.format(self.__len__())
|
||||
tmp = self.set_name
|
||||
fmt_str += ' Split: {}\n'.format(tmp)
|
||||
fmt_str += ' Root Location: {}\n'.format(self.root)
|
||||
tmp = ' Transforms (if any): '
|
||||
fmt_str += '{0}{1}\n'.format(tmp, self.transform.__repr__().replace('\n', '\n' + ' ' * len(tmp)))
|
||||
tmp = ' Target Transforms (if any): '
|
||||
fmt_str += '{0}{1}'.format(tmp, self.target_transform.__repr__().replace('\n', '\n' + ' ' * len(tmp)))
|
||||
return fmt_str
|
||||
|
||||
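A hedged usage sketch (not part of the diff) for the `Omniglot` dataset defined above. It assumes the module path `mlp.data_providers`, that a directory of per-class PNGs already exists under `<root>/omniglot_dataset` (the class never downloads anything), and that torchvision is installed; note that `__getitem__` calls `img.show()`, which you will probably want to disable before batch loading.

```python
from torch.utils.data import DataLoader
from torchvision import transforms

from mlp.data_providers import Omniglot  # assumed module path

train_set = Omniglot(root='data', set_name='train',
                     transform=transforms.Compose([transforms.Resize(28),
                                                   transforms.ToTensor()]))
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # e.g. torch.Size([32, 1, 28, 28]) torch.Size([32])
```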
class CIFAR10(data.Dataset):
|
||||
"""`CIFAR10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ Dataset.
|
||||
Args:
|
||||
root (string): Root directory of dataset where directory
|
||||
``cifar-10-batches-py`` exists or will be saved to if download is set to True.
|
||||
train (bool, optional): If True, creates dataset from training set, otherwise
|
||||
creates from test set.
|
||||
transform (callable, optional): A function/transform that takes in an PIL image
|
||||
and returns a transformed version. E.g, ``transforms.RandomCrop``
|
||||
target_transform (callable, optional): A function/transform that takes in the
|
||||
target and transforms it.
|
||||
download (bool, optional): If true, downloads the dataset from the internet and
|
||||
puts it in root directory. If dataset is already downloaded, it is not
|
||||
downloaded again.
|
||||
"""
|
||||
base_folder = 'cifar-10-batches-py'
|
||||
url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
|
||||
@ -551,7 +668,6 @@ class CIFAR10(data.Dataset):
|
||||
"""
|
||||
Args:
|
||||
index (int): Index
|
||||
|
||||
Returns:
|
||||
tuple: (image, target) where target is index of the target class.
|
||||
"""
|
||||
@ -615,7 +731,6 @@ class CIFAR10(data.Dataset):
|
||||
|
||||
class CIFAR100(CIFAR10):
|
||||
"""`CIFAR100 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ Dataset.
|
||||
|
||||
This is a subclass of the `CIFAR10` Dataset.
|
||||
"""
|
||||
base_folder = 'cifar-100-python'
|
||||
@ -628,4 +743,4 @@ class CIFAR100(CIFAR10):
|
||||
|
||||
test_list = [
|
||||
['test', 'f0ef6b0ae62326f3e7ffdfab6717acfc'],
|
||||
]
|
||||
]
|
176
mlp/errors.py
Normal file
@ -0,0 +1,176 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Error functions.
|
||||
|
||||
This module defines error functions, with the aim of model training being to
|
||||
minimise the error function given a set of inputs and target outputs.
|
||||
|
||||
The error functions will typically measure some concept of distance between the
|
||||
model outputs and target outputs, averaged over all data points in the data set
|
||||
or batch.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
class SumOfSquaredDiffsError(object):
|
||||
"""Sum of squared differences (squared Euclidean distance) error."""
|
||||
|
||||
def __call__(self, outputs, targets):
|
||||
"""Calculates error function given a batch of outputs and targets.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Scalar cost function value.
|
||||
"""
|
||||
return 0.5 * np.mean(np.sum((outputs - targets)**2, axis=1))
|
||||
|
||||
def grad(self, outputs, targets):
|
||||
"""Calculates gradient of error function with respect to outputs.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Gradient of error function with respect to outputs.
|
||||
"""
|
||||
return (outputs - targets) / outputs.shape[0]
|
||||
|
||||
def __repr__(self):
|
||||
return 'SumOfSquaredDiffsError'
|
||||
|
||||
|
||||
class BinaryCrossEntropyError(object):
|
||||
"""Binary cross entropy error."""
|
||||
|
||||
def __call__(self, outputs, targets):
|
||||
"""Calculates error function given a batch of outputs and targets.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Scalar error function value.
|
||||
"""
|
||||
return -np.mean(
|
||||
targets * np.log(outputs) + (1. - targets) * np.log(1. - outputs))
|
||||
|
||||
def grad(self, outputs, targets):
|
||||
"""Calculates gradient of error function with respect to outputs.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Gradient of error function with respect to outputs.
|
||||
"""
|
||||
return ((1. - targets) / (1. - outputs) -
|
||||
(targets / outputs)) / outputs.shape[0]
|
||||
|
||||
def __repr__(self):
|
||||
return 'BinaryCrossEntropyError'
|
||||
|
||||
|
||||
class BinaryCrossEntropySigmoidError(object):
|
||||
"""Binary cross entropy error with logistic sigmoid applied to outputs."""
|
||||
|
||||
def __call__(self, outputs, targets):
|
||||
"""Calculates error function given a batch of outputs and targets.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Scalar error function value.
|
||||
"""
|
||||
probs = 1. / (1. + np.exp(-outputs))
|
||||
return -np.mean(
|
||||
targets * np.log(probs) + (1. - targets) * np.log(1. - probs))
|
||||
|
||||
def grad(self, outputs, targets):
|
||||
"""Calculates gradient of error function with respect to outputs.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Gradient of error function with respect to outputs.
|
||||
"""
|
||||
probs = 1. / (1. + np.exp(-outputs))
|
||||
return (probs - targets) / outputs.shape[0]
|
||||
|
||||
def __repr__(self):
|
||||
return 'BinaryCrossEntropySigmoidError'
|
||||
|
||||
|
||||
class CrossEntropyError(object):
|
||||
"""Multi-class cross entropy error."""
|
||||
|
||||
def __call__(self, outputs, targets):
|
||||
"""Calculates error function given a batch of outputs and targets.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Scalar error function value.
|
||||
"""
|
||||
return -np.mean(np.sum(targets * np.log(outputs), axis=1))
|
||||
|
||||
def grad(self, outputs, targets):
|
||||
"""Calculates gradient of error function with respect to outputs.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Gradient of error function with respect to outputs.
|
||||
"""
|
||||
return -(targets / outputs) / outputs.shape[0]
|
||||
|
||||
def __repr__(self):
|
||||
return 'CrossEntropyError'
|
||||
|
||||
|
||||
class CrossEntropySoftmaxError(object):
|
||||
"""Multi-class cross entropy error with Softmax applied to outputs."""
|
||||
|
||||
def __call__(self, outputs, targets):
|
||||
"""Calculates error function given a batch of outputs and targets.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Scalar error function value.
|
||||
"""
|
||||
normOutputs = outputs - outputs.max(-1)[:, None]
|
||||
logProb = normOutputs - np.log(np.sum(np.exp(normOutputs), axis=-1)[:, None])
|
||||
return -np.mean(np.sum(targets * logProb, axis=1))
|
||||
|
||||
def grad(self, outputs, targets):
|
||||
"""Calculates gradient of error function with respect to outputs.
|
||||
|
||||
Args:
|
||||
outputs: Array of model outputs of shape (batch_size, output_dim).
|
||||
targets: Array of target outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Gradient of error function with respect to outputs.
|
||||
"""
|
||||
probs = np.exp(outputs - outputs.max(-1)[:, None])
|
||||
probs /= probs.sum(-1)[:, None]
|
||||
return (probs - targets) / outputs.shape[0]
|
||||
|
||||
def __repr__(self):
|
||||
return 'CrossEntropySoftmaxError'
|
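A small numerical check (not part of the diff) that the fused `CrossEntropySoftmaxError` above agrees with applying `SoftmaxLayer` followed by `CrossEntropyError`; the fused version only differs by the max-subtraction trick used for numerical stability. Module paths follow the files shown in this diff (`mlp.errors`, `mlp.layers`).

```python
import numpy as np

from mlp.errors import CrossEntropyError, CrossEntropySoftmaxError
from mlp.layers import SoftmaxLayer

rng = np.random.RandomState(123)
logits = rng.normal(size=(4, 5))                 # pre-softmax model outputs
targets = np.eye(5)[rng.randint(0, 5, size=4)]   # one-of-k targets

fused = CrossEntropySoftmaxError()(logits, targets)
separate = CrossEntropyError()(SoftmaxLayer().fprop(logits), targets)

print(np.allclose(fused, separate))                            # True
print(CrossEntropySoftmaxError().grad(logits, targets).shape)  # (4, 5)
```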
143
mlp/initialisers.py
Normal file
@ -0,0 +1,143 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Parameter initialisers.
|
||||
|
||||
This module defines classes to initialise the parameters in a layer.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
from mlp import DEFAULT_SEED
|
||||
|
||||
|
||||
class ConstantInit(object):
|
||||
"""Constant parameter initialiser."""
|
||||
|
||||
def __init__(self, value):
|
||||
"""Construct a constant parameter initialiser.
|
||||
|
||||
Args:
|
||||
value: Value to initialise parameter to.
|
||||
"""
|
||||
self.value = value
|
||||
|
||||
def __call__(self, shape):
|
||||
return np.ones(shape=shape) * self.value
|
||||
|
||||
|
||||
class UniformInit(object):
|
||||
"""Random uniform parameter initialiser."""
|
||||
|
||||
def __init__(self, low, high, rng=None):
|
||||
"""Construct a random uniform parameter initialiser.
|
||||
|
||||
Args:
|
||||
low: Lower bound of interval to sample from.
|
||||
high: Upper bound of interval to sample from.
|
||||
rng (RandomState): Seeded random number generator.
|
||||
"""
|
||||
self.low = low
|
||||
self.high = high
|
||||
if rng is None:
|
||||
rng = np.random.RandomState(DEFAULT_SEED)
|
||||
self.rng = rng
|
||||
|
||||
def __call__(self, shape):
|
||||
return self.rng.uniform(low=self.low, high=self.high, size=shape)
|
||||
|
||||
|
||||
class NormalInit(object):
|
||||
"""Random normal parameter initialiser."""
|
||||
|
||||
def __init__(self, mean, std, rng=None):
|
||||
"""Construct a random uniform parameter initialiser.
|
||||
|
||||
Args:
|
||||
mean: Mean of distribution to sample from.
|
||||
std: Standard deviation of distribution to sample from.
|
||||
rng (RandomState): Seeded random number generator.
|
||||
"""
|
||||
self.mean = mean
|
||||
self.std = std
|
||||
if rng is None:
|
||||
rng = np.random.RandomState(DEFAULT_SEED)
|
||||
self.rng = rng
|
||||
|
||||
def __call__(self, shape):
|
||||
return self.rng.normal(loc=self.mean, scale=self.std, size=shape)
|
||||
|
||||
class GlorotUniformInit(object):
|
||||
"""Glorot and Bengio (2010) random uniform weights initialiser.
|
||||
|
||||
Initialises a two-dimensional parameter array using the 'normalized
|
||||
initialisation' scheme suggested in [1] which attempts to maintain a
|
||||
roughly constant variance in the activations and backpropagated gradients
|
||||
of a multi-layer model consisting of interleaved affine and logistic
|
||||
sigmoidal transformation layers.
|
||||
|
||||
Weights are sampled from a zero-mean uniform distribution with standard
|
||||
deviation `sqrt(2 / (input_dim + output_dim))` where `input_dim` and
|
||||
`output_dim` are the input and output dimensions of the weight matrix
|
||||
respectively.
|
||||
|
||||
References:
|
||||
[1]: Understanding the difficulty of training deep feedforward neural
|
||||
networks, Glorot and Bengio (2010)
|
||||
"""
|
||||
|
||||
def __init__(self, gain=1., rng=None):
|
||||
"""Construct a normalised initilisation random initialiser object.
|
||||
|
||||
Args:
|
||||
gain: Multiplicative factor to scale initialised weights by.
|
||||
Recommended value is 1 for affine layers followed by
|
||||
logistic sigmoid layers (or another affine layer).
|
||||
rng (RandomState): Seeded random number generator.
|
||||
"""
|
||||
self.gain = gain
|
||||
if rng is None:
|
||||
rng = np.random.RandomState(DEFAULT_SEED)
|
||||
self.rng = rng
|
||||
|
||||
def __call__(self, shape):
|
||||
assert len(shape) == 2, (
|
||||
'Initialiser should only be used for two dimensional arrays.')
|
||||
std = self.gain * (2. / (shape[0] + shape[1]))**0.5
|
||||
half_width = 3.**0.5 * std
|
||||
return self.rng.uniform(low=-half_width, high=half_width, size=shape)
|
||||
|
||||
|
||||
class GlorotNormalInit(object):
|
||||
"""Glorot and Bengio (2010) random normal weights initialiser.
|
||||
|
||||
Initialises a two-dimensional parameter array using the 'normalized
|
||||
initialisation' scheme suggested in [1] which attempts to maintain a
|
||||
roughly constant variance in the activations and backpropagated gradients
|
||||
of a multi-layer model consisting of interleaved affine and logistic
|
||||
sigmoidal transformation layers.
|
||||
|
||||
Weights are sampled from a zero-mean normal distribution with standard
|
||||
deviation `sqrt(2 / (input_dim + output_dim))` where `input_dim` and
|
||||
`output_dim` are the input and output dimensions of the weight matrix
|
||||
respectively.
|
||||
|
||||
References:
|
||||
[1]: Understanding the difficulty of training deep feedforward neural
|
||||
networks, Glorot and Bengio (2010)
|
||||
"""
|
||||
|
||||
def __init__(self, gain=1., rng=None):
|
||||
"""Construct a normalised initilisation random initialiser object.
|
||||
|
||||
Args:
|
||||
gain: Multiplicative factor to scale initialised weights by.
|
||||
Recommended value is 1 for affine layers followed by
|
||||
logistic sigmoid layers (or another affine layer).
|
||||
rng (RandomState): Seeded random number generator.
|
||||
"""
|
||||
self.gain = gain
|
||||
if rng is None:
|
||||
rng = np.random.RandomState(DEFAULT_SEED)
|
||||
self.rng = rng
|
||||
|
||||
def __call__(self, shape):
|
||||
std = self.gain * (2. / (shape[0] + shape[1]))**0.5
|
||||
return self.rng.normal(loc=0., scale=std, size=shape)
|
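A quick sanity check (not part of the diff) of the Glorot initialisers: a uniform distribution on `[-sqrt(3) * std, sqrt(3) * std]` has standard deviation `std`, which is why `GlorotUniformInit` scales its half-width by `sqrt(3)`. Module path as in this diff (`mlp.initialisers`).

```python
import numpy as np

from mlp.initialisers import GlorotUniformInit

init = GlorotUniformInit(gain=1., rng=np.random.RandomState(123))
weights = init((784, 100))   # a 784 x 100 weight matrix

target_std = (2. / (784 + 100)) ** 0.5
# empirical standard deviation should be close to sqrt(2 / (fan_in + fan_out))
print(weights.std(), target_std)
```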
824
mlp/layers.py
Normal file
@ -0,0 +1,824 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Layer definitions.
|
||||
|
||||
This module defines classes which encapsulate a single layer.
|
||||
|
||||
These layers map input activations to output activation with the `fprop`
|
||||
method and map gradients with respect to outputs to gradients with respect to
|
||||
their inputs with the `bprop` method.
|
||||
|
||||
Some layers will have learnable parameters and so will additionally define
|
||||
methods for getting and setting parameters and calculating gradients with
|
||||
respect to the layer parameters.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import mlp.initialisers as init
|
||||
from mlp import DEFAULT_SEED
|
||||
|
||||
|
||||
class Layer(object):
|
||||
"""Abstract class defining the interface for a layer."""
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
|
||||
class LayerWithParameters(Layer):
|
||||
"""Abstract class defining the interface for a layer with parameters."""
|
||||
|
||||
def grads_wrt_params(self, inputs, grads_wrt_outputs):
|
||||
"""Calculates gradients with respect to layer parameters.
|
||||
|
||||
Args:
|
||||
inputs: Array of inputs to layer of shape (batch_size, input_dim).
|
||||
grads_wrt_to_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
List of arrays of gradients with respect to the layer parameters
|
||||
with parameter gradients appearing in same order in tuple as
|
||||
returned from `get_params` method.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def params_penalty(self):
|
||||
"""Returns the parameter dependent penalty term for this layer.
|
||||
|
||||
If no parameter-dependent penalty terms are set this returns zero.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
@property
|
||||
def params(self):
|
||||
"""Returns a list of parameters of layer.
|
||||
|
||||
Returns:
|
||||
List of current parameter values. This list should be in the
|
||||
corresponding order to the `values` argument to `set_params`.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
@params.setter
|
||||
def params(self, values):
|
||||
"""Sets layer parameters from a list of values.
|
||||
|
||||
Args:
|
||||
values: List of values to set parameters to. This list should be
|
||||
in the corresponding order to what is returned by `get_params`.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
|
||||
class StochasticLayerWithParameters(Layer):
|
||||
"""Specialised layer which uses a stochastic forward propagation."""
|
||||
|
||||
def __init__(self, rng=None):
|
||||
"""Constructs a new StochasticLayer object.
|
||||
|
||||
Args:
|
||||
rng (RandomState): Seeded random number generator object.
|
||||
"""
|
||||
if rng is None:
|
||||
rng = np.random.RandomState(DEFAULT_SEED)
|
||||
self.rng = rng
|
||||
|
||||
def fprop(self, inputs, stochastic=True):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
stochastic: Flag allowing different deterministic
|
||||
forward-propagation mode in addition to default stochastic
|
||||
forward-propagation e.g. for use at test time. If False
|
||||
a deterministic forward-propagation transformation
|
||||
corresponding to the expected output of the stochastic
|
||||
forward-propagation is applied.
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def grads_wrt_params(self, inputs, grads_wrt_outputs):
|
||||
"""Calculates gradients with respect to layer parameters.
|
||||
|
||||
Args:
|
||||
inputs: Array of inputs to layer of shape (batch_size, input_dim).
|
||||
grads_wrt_to_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
List of arrays of gradients with respect to the layer parameters
|
||||
with parameter gradients appearing in same order in tuple as
|
||||
returned from `get_params` method.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def params_penalty(self):
|
||||
"""Returns the parameter dependent penalty term for this layer.
|
||||
|
||||
If no parameter-dependent penalty terms are set this returns zero.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
@property
|
||||
def params(self):
|
||||
"""Returns a list of parameters of layer.
|
||||
|
||||
Returns:
|
||||
List of current parameter values. This list should be in the
|
||||
corresponding order to the `values` argument to `set_params`.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
@params.setter
|
||||
def params(self, values):
|
||||
"""Sets layer parameters from a list of values.
|
||||
|
||||
Args:
|
||||
values: List of values to set parameters to. This list should be
|
||||
in the corresponding order to what is returned by `get_params`.
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
|
||||
class StochasticLayer(Layer):
|
||||
"""Specialised layer which uses a stochastic forward propagation."""
|
||||
|
||||
def __init__(self, rng=None):
|
||||
"""Constructs a new StochasticLayer object.
|
||||
|
||||
Args:
|
||||
rng (RandomState): Seeded random number generator object.
|
||||
"""
|
||||
if rng is None:
|
||||
rng = np.random.RandomState(DEFAULT_SEED)
|
||||
self.rng = rng
|
||||
|
||||
def fprop(self, inputs, stochastic=True):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
stochastic: Flag allowing different deterministic
|
||||
forward-propagation mode in addition to default stochastic
|
||||
forward-propagation e.g. for use at test time. If False
|
||||
a deterministic forward-propagation transformation
|
||||
corresponding to the expected output of the stochastic
|
||||
forward-propagation is applied.
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs. This should correspond to
|
||||
default stochastic forward-propagation.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
|
||||
class AffineLayer(LayerWithParameters):
|
||||
"""Layer implementing an affine tranformation of its inputs.
|
||||
|
||||
This layer is parameterised by a weight matrix and bias vector.
|
||||
"""
|
||||
|
||||
def __init__(self, input_dim, output_dim,
|
||||
weights_initialiser=init.UniformInit(-0.1, 0.1),
|
||||
biases_initialiser=init.ConstantInit(0.),
|
||||
weights_penalty=None, biases_penalty=None):
|
||||
"""Initialises a parameterised affine layer.
|
||||
|
||||
Args:
|
||||
input_dim (int): Dimension of inputs to the layer.
|
||||
output_dim (int): Dimension of the layer outputs.
|
||||
weights_initialiser: Initialiser for the weight parameters.
|
||||
biases_initialiser: Initialiser for the bias parameters.
|
||||
weights_penalty: Weights-dependent penalty term (regulariser) or
|
||||
None if no regularisation is to be applied to the weights.
|
||||
biases_penalty: Biases-dependent penalty term (regulariser) or
|
||||
None if no regularisation is to be applied to the biases.
|
||||
"""
|
||||
self.input_dim = input_dim
|
||||
self.output_dim = output_dim
|
||||
self.weights = weights_initialiser((self.output_dim, self.input_dim))
|
||||
self.biases = biases_initialiser(self.output_dim)
|
||||
self.weights_penalty = weights_penalty
|
||||
self.biases_penalty = biases_penalty
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
For inputs `x`, outputs `y`, weights `W` and biases `b` the layer
|
||||
corresponds to `y = W.dot(x) + b`.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
return self.weights.dot(inputs.T).T + self.biases
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
return grads_wrt_outputs.dot(self.weights)
|
||||
|
||||
def grads_wrt_params(self, inputs, grads_wrt_outputs):
|
||||
"""Calculates gradients with respect to layer parameters.
|
||||
|
||||
Args:
|
||||
inputs: array of inputs to layer of shape (batch_size, input_dim)
|
||||
grads_wrt_to_outputs: array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim)
|
||||
|
||||
Returns:
|
||||
list of arrays of gradients with respect to the layer parameters
|
||||
`[grads_wrt_weights, grads_wrt_biases]`.
|
||||
"""
|
||||
|
||||
grads_wrt_weights = np.dot(grads_wrt_outputs.T, inputs)
|
||||
grads_wrt_biases = np.sum(grads_wrt_outputs, axis=0)
|
||||
|
||||
if self.weights_penalty is not None:
|
||||
grads_wrt_weights += self.weights_penalty.grad(parameter=self.weights)
|
||||
|
||||
if self.biases_penalty is not None:
|
||||
grads_wrt_biases += self.biases_penalty.grad(parameter=self.biases)
|
||||
|
||||
return [grads_wrt_weights, grads_wrt_biases]
|
||||
|
||||
def params_penalty(self):
|
||||
"""Returns the parameter dependent penalty term for this layer.
|
||||
|
||||
If no parameter-dependent penalty terms are set this returns zero.
|
||||
"""
|
||||
params_penalty = 0
|
||||
if self.weights_penalty is not None:
|
||||
params_penalty += self.weights_penalty(self.weights)
|
||||
if self.biases_penalty is not None:
|
||||
params_penalty += self.biases_penalty(self.biases)
|
||||
return params_penalty
|
||||
|
||||
@property
|
||||
def params(self):
|
||||
"""A list of layer parameter values: `[weights, biases]`."""
|
||||
return [self.weights, self.biases]
|
||||
|
||||
@params.setter
|
||||
def params(self, values):
|
||||
self.weights = values[0]
|
||||
self.biases = values[1]
|
||||
|
||||
def __repr__(self):
|
||||
return 'AffineLayer(input_dim={0}, output_dim={1})'.format(
|
||||
self.input_dim, self.output_dim)
|
||||
|
||||
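A hedged sketch (not part of the diff) of checking `AffineLayer.grads_wrt_params` against a central finite-difference estimate; the scalar function being differentiated is `sum(fprop(inputs) * grads_wrt_outputs)`, whose gradient with respect to the weights is exactly what `grads_wrt_params` should return. Module path as in this diff (`mlp.layers`).

```python
import numpy as np

from mlp.layers import AffineLayer

rng = np.random.RandomState(123)
layer = AffineLayer(input_dim=3, output_dim=2)
inputs = rng.normal(size=(5, 3))
grads_wrt_outputs = rng.normal(size=(5, 2))

analytic = layer.grads_wrt_params(inputs, grads_wrt_outputs)[0]  # grads wrt weights

eps = 1e-6
fd = np.zeros_like(layer.weights)
for i in range(layer.weights.shape[0]):
    for j in range(layer.weights.shape[1]):
        layer.weights[i, j] += eps
        plus = np.sum(layer.fprop(inputs) * grads_wrt_outputs)
        layer.weights[i, j] -= 2 * eps
        minus = np.sum(layer.fprop(inputs) * grads_wrt_outputs)
        layer.weights[i, j] += eps
        fd[i, j] = (plus - minus) / (2 * eps)

print(np.allclose(analytic, fd, atol=1e-5))  # True
```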
|
||||
class SigmoidLayer(Layer):
|
||||
"""Layer implementing an element-wise logistic sigmoid transformation."""
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
For inputs `x` and outputs `y` this corresponds to
|
||||
`y = 1 / (1 + exp(-x))`.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
return 1. / (1. + np.exp(-inputs))
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
return grads_wrt_outputs * outputs * (1. - outputs)
|
||||
|
||||
def __repr__(self):
|
||||
return 'SigmoidLayer'
|
||||
|
||||
|
||||
class ConvolutionalLayer(LayerWithParameters):
|
||||
"""Layer implementing a 2D convolution-based transformation of its inputs.
|
||||
The layer is parameterised by a set of 2D convolutional kernels, a four
|
||||
dimensional array of shape
|
||||
(num_output_channels, num_input_channels, kernel_height, kernel_width)
|
||||
and a bias vector, a one dimensional array of shape
|
||||
(num_output_channels,)
|
||||
i.e. one shared bias per output channel.
|
||||
Assuming no-padding is applied to the inputs so that outputs are only
|
||||
calculated for positions where the kernel filters fully overlap with the
|
||||
inputs, and that unit strides are used the outputs will have spatial extent
|
||||
output_height = input_height - kernel_height + 1
|
||||
output_width = input_width - kernel_width + 1
|
||||
"""
|
||||
|
||||
def __init__(self, num_input_channels, num_output_channels,
|
||||
input_height, input_width,
|
||||
kernel_height, kernel_width,
|
||||
kernels_init=init.UniformInit(-0.01, 0.01),
|
||||
biases_init=init.ConstantInit(0.),
|
||||
kernels_penalty=None, biases_penalty=None):
|
||||
"""Initialises a parameterised convolutional layer.
|
||||
Args:
|
||||
num_input_channels (int): Number of channels in inputs to
|
||||
layer (this may be number of colour channels in the input
|
||||
images if used as the first layer in a model, or the
|
||||
number of output channels, a.k.a. feature maps, from a
|
||||
previous convolutional layer).
|
||||
num_output_channels (int): Number of channels in outputs
|
||||
from the layer, a.k.a. number of feature maps.
|
||||
input_height (int): Size of first input dimension of each 2D
|
||||
channel of inputs.
|
||||
input_width (int): Size of second input dimension of each 2D
|
||||
channel of inputs.
|
||||
kernel_height (int): Size of first dimension of each 2D channel of
|
||||
kernels.
|
||||
kernel_width (int): Size of second dimension of each 2D channel of
|
||||
kernels.
|
||||
kernels_init: Initialiser for the kernel parameters.
|
||||
biases_init: Initialiser for the bias parameters.
|
||||
kernels_penalty: Kernel-dependent penalty term (regulariser) or
|
||||
None if no regularisation is to be applied to the kernels.
|
||||
biases_penalty: Biases-dependent penalty term (regulariser) or
|
||||
None if no regularisation is to be applied to the biases.
|
||||
"""
|
||||
self.num_input_channels = num_input_channels
|
||||
self.num_output_channels = num_output_channels
|
||||
self.input_height = input_height
|
||||
self.input_width = input_width
|
||||
self.kernel_height = kernel_height
|
||||
self.kernel_width = kernel_width
|
||||
self.kernels_init = kernels_init
|
||||
self.biases_init = biases_init
|
||||
self.kernels_shape = (
|
||||
num_output_channels, num_input_channels, kernel_height, kernel_width
|
||||
)
|
||||
self.inputs_shape = (
|
||||
None, num_input_channels, input_height, input_width
|
||||
)
|
||||
self.kernels = self.kernels_init(self.kernels_shape)
|
||||
self.biases = self.biases_init(num_output_channels)
|
||||
self.kernels_penalty = kernels_penalty
|
||||
self.biases_penalty = biases_penalty
|
||||
|
||||
self.cache = None
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
For inputs `x`, outputs `y`, kernels `K` and biases `b` the layer
|
||||
corresponds to `y = conv2d(x, K) + b`.
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, num_input_channels, image_height, image_width).
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, num_output_channels, output_height, output_width).
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape
|
||||
(batch_size, num_input_channels, input_height, input_width).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape
|
||||
(batch_size, num_output_channels, output_height, output_width).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape
|
||||
(batch_size, num_output_channels, output_height, output_width).
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, num_input_channels, input_height, input_width).
|
||||
"""
|
||||
# Pad the grads_wrt_outputs
|
||||
raise NotImplementedError
|
||||
|
||||
def grads_wrt_params(self, inputs, grads_wrt_outputs):
|
||||
"""Calculates gradients with respect to layer parameters.
|
||||
Args:
|
||||
inputs: array of inputs to layer of shape (batch_size, input_dim)
|
||||
grads_wrt_to_outputs: array of gradients with respect to the layer
|
||||
outputs of shape
|
||||
(batch_size, num_output_channels, output_height, output_width).
|
||||
Returns:
|
||||
list of arrays of gradients with respect to the layer parameters
|
||||
`[grads_wrt_kernels, grads_wrt_biases]`.
|
||||
"""
|
||||
# Get inputs_col from previous fprop
|
||||
raise NotImplementedError
|
||||
|
||||
def params_penalty(self):
|
||||
"""Returns the parameter dependent penalty term for this layer.
|
||||
If no parameter-dependent penalty terms are set this returns zero.
|
||||
"""
|
||||
params_penalty = 0
|
||||
if self.kernels_penalty is not None:
|
||||
params_penalty += self.kernels_penalty(self.kernels)
|
||||
if self.biases_penalty is not None:
|
||||
params_penalty += self.biases_penalty(self.biases)
|
||||
return params_penalty
|
||||
|
||||
@property
|
||||
def params(self):
|
||||
"""A list of layer parameter values: `[kernels, biases]`."""
|
||||
return [self.kernels, self.biases]
|
||||
|
||||
@params.setter
|
||||
def params(self, values):
|
||||
self.kernels = values[0]
|
||||
self.biases = values[1]
|
||||
|
||||
def __repr__(self):
|
||||
return (
|
||||
'ConvolutionalLayer(\n'
|
||||
' num_input_channels={0}, num_output_channels={1},\n'
|
||||
' input_height={2}, input_width={3},\n'
|
||||
' kernel_height={4}, kernel_width={5}\n'
|
||||
')'
|
||||
.format(self.num_input_channels, self.num_output_channels,
|
||||
self.input_height, self.input_width, self.kernel_height,
|
||||
self.kernel_width)
|
||||
)
|
||||
|
||||
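A worked shape example for the 'valid' convolution described in the docstring above: with no padding and unit strides, `output_height = input_height - kernel_height + 1` and `output_width = input_width - kernel_width + 1`. Since `fprop`/`bprop` are deliberately left as `NotImplementedError` (coursework), this only exercises the constructor; module path as in this diff (`mlp.layers`).

```python
from mlp.layers import ConvolutionalLayer

layer = ConvolutionalLayer(num_input_channels=1, num_output_channels=8,
                           input_height=28, input_width=28,
                           kernel_height=5, kernel_width=5)

print(layer.kernels.shape)      # (8, 1, 5, 5)
print(layer.biases.shape)       # (8,)
print(28 - 5 + 1, 28 - 5 + 1)   # output spatial extent: 24 24
```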
|
||||
class ReluLayer(Layer):
|
||||
"""Layer implementing an element-wise rectified linear transformation."""
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
For inputs `x` and outputs `y` this corresponds to `y = max(0, x)`.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
return np.maximum(inputs, 0.)
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
return (outputs > 0) * grads_wrt_outputs
|
||||
|
||||
def __repr__(self):
|
||||
return 'ReluLayer'
|
||||
|
||||
|
||||
class TanhLayer(Layer):
|
||||
"""Layer implementing an element-wise hyperbolic tangent transformation."""
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
For inputs `x` and outputs `y` this corresponds to `y = tanh(x)`.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
return np.tanh(inputs)
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
return (1. - outputs ** 2) * grads_wrt_outputs
|
||||
|
||||
def __repr__(self):
|
||||
return 'TanhLayer'
|
||||
|
||||
|
||||
class SoftmaxLayer(Layer):
|
||||
"""Layer implementing a softmax transformation."""
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
For inputs `x` and outputs `y` this corresponds to
|
||||
|
||||
`y = exp(x) / sum(exp(x))`.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
# subtract max inside exponential to improve numerical stability -
|
||||
# when we divide through by sum this term cancels
|
||||
exp_inputs = np.exp(inputs - inputs.max(-1)[:, None])
|
||||
return exp_inputs / exp_inputs.sum(-1)[:, None]
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
return (outputs * (grads_wrt_outputs -
|
||||
(grads_wrt_outputs * outputs).sum(-1)[:, None]))
|
||||
|
||||
def __repr__(self):
|
||||
return 'SoftmaxLayer'
|
||||
|
||||
|
||||
class RadialBasisFunctionLayer(Layer):
|
||||
"""Layer implementing projection to a grid of radial basis functions."""
|
||||
|
||||
def __init__(self, grid_dim, intervals=[[0., 1.]]):
|
||||
"""Creates a radial basis function layer object.
|
||||
|
||||
Args:
|
||||
grid_dim: Integer specifying how many basis functions to use in
|
||||
grid across input space per dimension (so total number of
|
||||
basis functions will be grid_dim**input_dim)
|
||||
intervals: List of intervals (two element lists or tuples)
|
||||
specifying extents of axis-aligned region in input-space to
|
||||
tile basis functions in grid across. For example for a 2D input
|
||||
space spanning [0, 1] x [0, 1] use intervals=[[0, 1], [0, 1]].
|
||||
"""
|
||||
self.grid_dim = grid_dim  # stored so that __repr__ below can report it
num_basis = grid_dim ** len(intervals)
|
||||
self.centres = np.array(np.meshgrid(*[
|
||||
np.linspace(low, high, grid_dim) for (low, high) in intervals])
|
||||
).reshape((len(intervals), -1))
|
||||
self.scales = np.array([
|
||||
[(high - low) * 1. / grid_dim] for (low, high) in intervals])
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
return np.exp(-(inputs[..., None] - self.centres[None, ...]) ** 2 /
|
||||
self.scales ** 2).reshape((inputs.shape[0], -1))
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
num_basis = self.centres.shape[1]
|
||||
return -2 * (
|
||||
((inputs[..., None] - self.centres[None, ...]) / self.scales ** 2) *
|
||||
grads_wrt_outputs.reshape((inputs.shape[0], -1, num_basis))
|
||||
).sum(-1)
|
||||
|
||||
def __repr__(self):
|
||||
return 'RadialBasisFunctionLayer(grid_dim={0})'.format(self.grid_dim)
|
||||
|
||||
|
||||
class DropoutLayer(StochasticLayer):
|
||||
"""Layer which stochastically drops input dimensions in its output."""
|
||||
|
||||
def __init__(self, rng=None, incl_prob=0.5, share_across_batch=True):
|
||||
"""Construct a new dropout layer.
|
||||
|
||||
Args:
|
||||
rng (RandomState): Seeded random number generator.
|
||||
incl_prob: Scalar value in (0, 1] specifying the probability of
|
||||
each input dimension being included in the output.
|
||||
share_across_batch: Whether to use same dropout mask across
|
||||
all inputs in a batch or use per input masks.
|
||||
"""
|
||||
super(DropoutLayer, self).__init__(rng)
|
||||
assert incl_prob > 0. and incl_prob <= 1.
|
||||
self.incl_prob = incl_prob
|
||||
self.share_across_batch = share_across_batch
|
||||
# rng is already stored by StochasticLayer.__init__ above; re-assigning the raw
# argument here would overwrite the seeded default whenever rng is None
|
||||
|
||||
def fprop(self, inputs, stochastic=True):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
stochastic: Flag allowing different deterministic
|
||||
forward-propagation mode in addition to default stochastic
|
||||
forward-propagation e.g. for use at test time. If False
|
||||
a deterministic forward-propagation transformation
|
||||
corresponding to the expected output of the stochastic
|
||||
forward-propagation is applied.
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
if stochastic:
|
||||
mask_shape = (1,) + inputs.shape[1:] if self.share_across_batch else inputs.shape
|
||||
self._mask = (self.rng.uniform(size=mask_shape) < self.incl_prob)
|
||||
return inputs * self._mask
|
||||
else:
|
||||
return inputs * self.incl_prob
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs. This should correspond to
|
||||
default stochastic forward-propagation.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
return grads_wrt_outputs * self._mask
|
||||
|
||||
def __repr__(self):
|
||||
return 'DropoutLayer(incl_prob={0:.1f})'.format(self.incl_prob)
|
||||
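A small check (not part of the diff) that the deterministic `DropoutLayer` mode matches the expectation of the stochastic mode: with `incl_prob=0.5` each input is kept with probability 0.5, so scaling by `incl_prob` gives the expected output. Module path as in this diff (`mlp.layers`).

```python
import numpy as np

from mlp.layers import DropoutLayer

layer = DropoutLayer(rng=np.random.RandomState(123), incl_prob=0.5)
inputs = np.ones((4, 10))

stochastic_out = layer.fprop(inputs, stochastic=True)       # randomly zeroed entries
deterministic_out = layer.fprop(inputs, stochastic=False)   # scaled by incl_prob

print(stochastic_out.mean())     # roughly 0.5, depends on the sampled mask
print(deterministic_out.mean())  # exactly 0.5
```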
|
||||
|
||||
class ReshapeLayer(Layer):
|
||||
"""Layer which reshapes dimensions of inputs."""
|
||||
|
||||
def __init__(self, output_shape=None):
|
||||
"""Create a new reshape layer object.
|
||||
|
||||
Args:
|
||||
output_shape: Tuple specifying shape each input in batch should
|
||||
be reshaped to in outputs. This **excludes** the batch size
|
||||
so the shape of the final output array will be
|
||||
(batch_size, ) + output_shape
|
||||
Similarly to numpy.reshape, one shape dimension can be -1. In
|
||||
this case, the value is inferred from the size of the input
|
||||
array and remaining dimensions. The shape specified must be
|
||||
compatible with the input array shape - i.e. the total number
|
||||
of values in the array cannot be changed. If set to `None` the
|
||||
output shape will be set to
|
||||
(batch_size, -1)
|
||||
which will flatten all the inputs to vectors.
|
||||
"""
|
||||
self.output_shape = (-1,) if output_shape is None else output_shape
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""Forward propagates activations through the layer transformation.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
|
||||
Returns:
|
||||
outputs: Array of layer outputs of shape (batch_size, output_dim).
|
||||
"""
|
||||
return inputs.reshape((inputs.shape[0],) + self.output_shape)
|
||||
|
||||
def bprop(self, inputs, outputs, grads_wrt_outputs):
|
||||
"""Back propagates gradients through a layer.
|
||||
|
||||
Given gradients with respect to the outputs of the layer calculates the
|
||||
gradients with respect to the layer inputs.
|
||||
|
||||
Args:
|
||||
inputs: Array of layer inputs of shape (batch_size, input_dim).
|
||||
outputs: Array of layer outputs calculated in forward pass of
|
||||
shape (batch_size, output_dim).
|
||||
grads_wrt_outputs: Array of gradients with respect to the layer
|
||||
outputs of shape (batch_size, output_dim).
|
||||
|
||||
Returns:
|
||||
Array of gradients with respect to the layer inputs of shape
|
||||
(batch_size, input_dim).
|
||||
"""
|
||||
return grads_wrt_outputs.reshape(inputs.shape)
|
||||
|
||||
def __repr__(self):
|
||||
return 'ReshapeLayer(output_shape={0})'.format(self.output_shape)
|
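A small illustration (not part of the diff) of `ReshapeLayer` flattening image-shaped inputs so they can be fed to an `AffineLayer`; `bprop` simply reshapes gradients back to the original input shape. Module path as in this diff (`mlp.layers`).

```python
import numpy as np

from mlp.layers import AffineLayer, ReshapeLayer

batch = np.zeros((50, 1, 28, 28))   # (batch_size, channels, height, width)
flatten = ReshapeLayer()            # output_shape=None flattens to (batch_size, -1)
flat = flatten.fprop(batch)
print(flat.shape)                   # (50, 784)

affine = AffineLayer(input_dim=784, output_dim=100)
print(affine.fprop(flat).shape)     # (50, 100)

print(flatten.bprop(batch, flat, np.zeros_like(flat)).shape)  # (50, 1, 28, 28)
```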
388
mlp/learning_rules.py
Normal file
@ -0,0 +1,388 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Learning rules.
|
||||
|
||||
This module contains classes implementing gradient based learning rules.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
class GradientDescentLearningRule(object):
|
||||
"""Simple (stochastic) gradient descent learning rule.
|
||||
|
||||
For a scalar error function `E(p[0], p[1], ...)` of some set of
|
||||
potentially multidimensional parameters this attempts to find a local
|
||||
minimum of the loss function by applying updates to each parameter of the
|
||||
form
|
||||
|
||||
p[i] := p[i] - learning_rate * dE/dp[i]
|
||||
|
||||
With `learning_rate` a positive scaling parameter.
|
||||
|
||||
The error function used in successive applications of these updates may be
|
||||
a stochastic estimator of the true error function (e.g. when the error with
|
||||
respect to only a subset of data-points is calculated) in which case this
|
||||
will correspond to a stochastic gradient descent learning rule.
|
||||
"""
|
||||
|
||||
def __init__(self, learning_rate=1e-3):
|
||||
"""Creates a new learning rule object.
|
||||
|
||||
Args:
|
||||
learning_rate: A positive scalar to scale gradient updates to the
|
||||
parameters by. This needs to be carefully set - if too large
|
||||
the learning dynamic will be unstable and may diverge, while
|
||||
if set too small learning will proceed very slowly.
|
||||
|
||||
"""
|
||||
assert learning_rate > 0., 'learning_rate should be positive.'
|
||||
self.learning_rate = learning_rate
|
||||
|
||||
def initialise(self, params):
|
||||
"""Initialises the state of the learning rule for a set or parameters.
|
||||
|
||||
This must be called before `update_params` is first called.
|
||||
|
||||
Args:
|
||||
params: A list of the parameters to be optimised. Note these will
|
||||
be updated *in-place* to avoid reallocating arrays on each
|
||||
update.
|
||||
"""
|
||||
self.params = params
|
||||
|
||||
def reset(self):
|
||||
"""Resets any additional state variables to their intial values.
|
||||
|
||||
For this learning rule there are no additional state variables so we
|
||||
do nothing here.
|
||||
"""
|
||||
pass
|
||||
|
||||
def update_params(self, grads_wrt_params):
|
||||
"""Applies a single gradient descent update to all parameters.
|
||||
|
||||
All parameter updates are performed using in-place operations and so
|
||||
nothing is returned.
|
||||
|
||||
Args:
|
||||
grads_wrt_params: A list of gradients of the scalar loss function
|
||||
with respect to each of the parameters passed to `initialise`
|
||||
previously, with this list expected to be in the same order.
|
||||
"""
|
||||
for param, grad in zip(self.params, grads_wrt_params):
|
||||
param -= self.learning_rate * grad
|
||||
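A hedged sketch (not part of the diff) of how a learning rule object is driven: `initialise` registers the parameter arrays once and `update_params` then applies in-place updates each step. The toy linear-regression 'model' below is a placeholder for whatever produces `grads_wrt_params`; module path as in this diff (`mlp.learning_rules`).

```python
import numpy as np

from mlp.learning_rules import GradientDescentLearningRule

rng = np.random.RandomState(123)
weights, biases = rng.normal(size=(1, 3)), np.zeros(1)
inputs = rng.normal(size=(100, 3))
targets = inputs.dot(np.array([[1.], [-2.], [0.5]])) + 0.3

rule = GradientDescentLearningRule(learning_rate=0.1)
rule.initialise([weights, biases])   # updates are applied in place to these arrays

for step in range(200):
    outputs = inputs.dot(weights.T) + biases
    grads_wrt_outputs = (outputs - targets) / outputs.shape[0]   # sum-of-squares error gradient
    grads_wrt_params = [grads_wrt_outputs.T.dot(inputs),         # grads wrt weights
                        grads_wrt_outputs.sum(axis=0)]           # grads wrt biases
    rule.update_params(grads_wrt_params)

print(weights, biases)   # should approach [[1., -2., 0.5]] and [0.3]
```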
|
||||
|
||||
class MomentumLearningRule(GradientDescentLearningRule):
|
||||
"""Gradient descent with momentum learning rule.
|
||||
|
||||
This extends the basic gradient learning rule by introducing extra
|
||||
momentum state variables for each parameter. These can help the learning
|
||||
dynamic overcome shallow local minima and speed convergence when
|
||||
making multiple successive steps in a similar direction in parameter space.
|
||||
|
||||
For parameter p[i] and corresponding momentum m[i] the updates for a
|
||||
scalar loss function `L` are of the form
|
||||
|
||||
m[i] := mom_coeff * m[i] - learning_rate * dL/dp[i]
|
||||
p[i] := p[i] + m[i]
|
||||
|
||||
with `learning_rate` a positive scaling parameter for the gradient updates
|
||||
and `mom_coeff` a value in [0, 1] that determines how much 'friction' there
|
||||
is in the system and so how quickly previous momentum contributions decay.
|
||||
"""
|
||||
|
||||
def __init__(self, learning_rate=1e-3, mom_coeff=0.9):
|
||||
"""Creates a new learning rule object.
|
||||
|
||||
Args:
|
||||
learning_rate: A positive scalar to scale gradient updates to the
|
||||
parameters by. This needs to be carefully set - if too large
|
||||
the learning dynamic will be unstable and may diverge, while
|
||||
if set too small learning will proceed very slowly.
|
||||
mom_coeff: A scalar in the range [0, 1] inclusive. This determines
|
||||
the contribution of the previous momentum value to the value
|
||||
after each update. If equal to 0 the momentum is set to exactly
|
||||
the negative scaled gradient each update and so this rule
|
||||
collapses to standard gradient descent. If equal to 1 the
|
||||
momentum will just be decremented by the scaled gradient at
|
||||
each update. This is equivalent to simulating the dynamic in
|
||||
a frictionless system. Due to energy conservation the loss
|
||||
of 'potential energy' as the dynamics moves down the loss
|
||||
function surface will lead to an increasingly large 'kinetic
|
||||
energy' and so speed, meaning the updates will become
|
||||
increasingly large, potentially unstably so. Typically a value
|
||||
less than but close to 1 will avoid these issues and cause the
|
||||
dynamic to converge to a local minimum where the gradients are
|
||||
by definition zero.
|
||||
"""
|
||||
super(MomentumLearningRule, self).__init__(learning_rate)
|
||||
assert mom_coeff >= 0. and mom_coeff <= 1., (
|
||||
'mom_coeff should be in the range [0, 1].'
|
||||
)
|
||||
self.mom_coeff = mom_coeff
|
||||
|
||||
def initialise(self, params):
|
||||
"""Initialises the state of the learning rule for a set or parameters.
|
||||
|
||||
This must be called before `update_params` is first called.
|
||||
|
||||
Args:
|
||||
params: A list of the parameters to be optimised. Note these will
|
||||
be updated *in-place* to avoid reallocating arrays on each
|
||||
update.
|
||||
"""
|
||||
super(MomentumLearningRule, self).initialise(params)
|
||||
self.moms = []
|
||||
for param in self.params:
|
||||
self.moms.append(np.zeros_like(param))
|
||||
|
||||
def reset(self):
|
||||
"""Resets any additional state variables to their intial values.
|
||||
|
||||
For this learning rule this corresponds to zeroing all the momenta.
|
||||
"""
|
||||
for mom in self.moms:
|
||||
mom *= 0.
|
||||
|
||||
def update_params(self, grads_wrt_params):
|
||||
"""Applies a single update to all parameters.
|
||||
|
||||
All parameter updates are performed using in-place operations and so
|
||||
nothing is returned.
|
||||
|
||||
Args:
|
||||
grads_wrt_params: A list of gradients of the scalar loss function
|
||||
with respect to each of the parameters passed to `initialise`
|
||||
previously, with this list expected to be in the same order.
|
||||
"""
|
||||
for param, mom, grad in zip(self.params, self.moms, grads_wrt_params):
|
||||
mom *= self.mom_coeff
|
||||
mom -= self.learning_rate * grad
|
||||
param += mom
|
||||
|
||||
|
||||
class AdamLearningRule(GradientDescentLearningRule):
|
||||
"""Adaptive moments (Adam) learning rule.
|
||||
First-order gradient-descent based learning rule which uses adaptive
|
||||
estimates of first and second moments of the parameter gradients to
|
||||
calculate the parameter updates.
|
||||
References:
|
||||
[1]: Adam: a method for stochastic optimisation
|
||||
Kingma and Ba, 2015
|
||||
"""
|
||||
|
||||
def __init__(self, learning_rate=1e-3, beta_1=0.9, beta_2=0.999,
|
||||
epsilon=1e-8):
|
||||
"""Creates a new learning rule object.
|
||||
Args:
|
||||
learning_rate: A positive scalar to scale gradient updates to the
|
||||
parameters by. This needs to be carefully set - if too large
|
||||
the learning dynamic will be unstable and may diverge, while
|
||||
if set too small learning will proceed very slowly.
|
||||
beta_1: Exponential decay rate for gradient first moment estimates.
|
||||
This should be a scalar value in [0, 1]. The running gradient
|
||||
first moment estimate is calculated using
|
||||
`m_1 = beta_1 * m_1_prev + (1 - beta_1) * g`
|
||||
where `m_1_prev` is the previous estimate and `g` the current
|
||||
parameter gradients.
|
||||
beta_2: Exponential decay rate for gradient second moment
|
||||
estimates. This should be a scalar value in [0, 1]. The running
|
||||
gradient second moment estimate is calculated using
|
||||
`m_2 = beta_2 * m_2_prev + (1 - beta_2) * g**2`
|
||||
where `m_2_prev` is the previous estimate and `g` the current
|
||||
parameter gradients.
|
||||
epsilon: 'Softening' parameter to stop updates diverging when
|
||||
second moment estimates are close to zero. Should be set to
|
||||
a small positive value.
|
||||
"""
|
||||
super(AdamLearningRule, self).__init__(learning_rate)
|
||||
assert beta_1 >= 0. and beta_1 <= 1., 'beta_1 should be in [0, 1].'
|
||||
assert beta_2 >= 0. and beta_2 <= 1., 'beta_2 should be in [0, 1].'
|
||||
assert epsilon > 0., 'epsilon should be > 0.'
|
||||
self.beta_1 = beta_1
|
||||
self.beta_2 = beta_2
|
||||
self.epsilon = epsilon
|
||||
|
||||
def initialise(self, params):
|
||||
"""Initialises the state of the learning rule for a set or parameters.
|
||||
This must be called before `update_params` is first called.
|
||||
Args:
|
||||
params: A list of the parameters to be optimised. Note these will
|
||||
be updated *in-place* to avoid reallocating arrays on each
|
||||
update.
|
||||
"""
|
||||
super(AdamLearningRule, self).initialise(params)
|
||||
self.moms_1 = []
|
||||
for param in self.params:
|
||||
self.moms_1.append(np.zeros_like(param))
|
||||
self.moms_2 = []
|
||||
for param in self.params:
|
||||
self.moms_2.append(np.zeros_like(param))
|
||||
self.step_count = 0
|
||||
|
||||
def reset(self):
|
||||
"""Resets any additional state variables to their initial values.
|
||||
For this learning rule this corresponds to zeroing the estimates of
|
||||
the first and second moments of the gradients.
|
||||
"""
|
||||
for mom_1, mom_2 in zip(self.moms_1, self.moms_2):
|
||||
mom_1 *= 0.
|
||||
mom_2 *= 0.
|
||||
self.step_count = 0
|
||||
|
||||
def update_params(self, grads_wrt_params):
|
||||
"""Applies a single update to all parameters.
|
||||
All parameter updates are performed using in-place operations and so
|
||||
nothing is returned.
|
||||
Args:
|
||||
grads_wrt_params: A list of gradients of the scalar loss function
|
||||
with respect to each of the parameters passed to `initialise`
|
||||
previously, with this list expected to be in the same order.
|
||||
"""
|
||||
for param, mom_1, mom_2, grad in zip(
|
||||
self.params, self.moms_1, self.moms_2, grads_wrt_params):
|
||||
mom_1 *= self.beta_1
|
||||
mom_1 += (1. - self.beta_1) * grad
|
||||
mom_2 *= self.beta_2
|
||||
mom_2 += (1. - self.beta_2) * grad ** 2
|
||||
alpha_t = (
|
||||
self.learning_rate *
|
||||
(1. - self.beta_2 ** (self.step_count + 1)) ** 0.5 /
|
||||
(1. - self.beta_1 ** (self.step_count + 1))
|
||||
)
|
||||
param -= alpha_t * mom_1 / (mom_2 ** 0.5 + self.epsilon)
|
||||
self.step_count += 1
|
||||
|
||||
|
||||
class AdaGradLearningRule(GradientDescentLearningRule):
|
||||
"""Adaptive gradients (AdaGrad) learning rule.
|
||||
First-order gradient-descent based learning rule which normalises gradient
|
||||
updates by a running sum of the past squared gradients.
|
||||
References:
|
||||
[1]: Adaptive Subgradient Methods for Online Learning and Stochastic
|
||||
Optimization. Duchi, Hazan and Singer, 2011
|
||||
"""
|
||||
|
||||
def __init__(self, learning_rate=1e-2, epsilon=1e-8):
|
||||
"""Creates a new learning rule object.
|
||||
Args:
|
||||
learning_rate: A positive scalar to scale gradient updates to the
|
||||
parameters by. This needs to be carefully set - if too large
|
||||
the learning dynamic will be unstable and may diverge, while
|
||||
if set too small learning will proceed very slowly.
|
||||
epsilon: 'Softening' parameter to stop updates diverging when
|
||||
sums of squared gradients are close to zero. Should be set to
|
||||
a small positive value.
|
||||
"""
|
||||
super(AdaGradLearningRule, self).__init__(learning_rate)
|
||||
assert epsilon > 0., 'epsilon should be > 0.'
|
||||
self.epsilon = epsilon
|
||||
|
||||
def initialise(self, params):
|
||||
"""Initialises the state of the learning rule for a set or parameters.
|
||||
This must be called before `update_params` is first called.
|
||||
Args:
|
||||
params: A list of the parameters to be optimised. Note these will
|
||||
be updated *in-place* to avoid reallocating arrays on each
|
||||
update.
|
||||
"""
|
||||
super(AdaGradLearningRule, self).initialise(params)
|
||||
self.sum_sq_grads = []
|
||||
for param in self.params:
|
||||
self.sum_sq_grads.append(np.zeros_like(param))
|
||||
|
||||
def reset(self):
|
||||
"""Resets any additional state variables to their initial values.
|
||||
For this learning rule this corresponds to zeroing all the sum of
|
||||
squared gradient states.
|
||||
"""
|
||||
for sum_sq_grad in self.sum_sq_grads:
|
||||
sum_sq_grad *= 0.
|
||||
|
||||
def update_params(self, grads_wrt_params):
|
||||
"""Applies a single update to all parameters.
|
||||
All parameter updates are performed using in-place operations and so
|
||||
nothing is returned.
|
||||
Args:
|
||||
grads_wrt_params: A list of gradients of the scalar loss function
|
||||
with respect to each of the parameters passed to `initialise`
|
||||
previously, with this list expected to be in the same order.
|
||||
"""
|
||||
for param, sum_sq_grad, grad in zip(
|
||||
self.params, self.sum_sq_grads, grads_wrt_params):
|
||||
sum_sq_grad += grad ** 2
|
||||
param -= (self.learning_rate * grad /
|
||||
(sum_sq_grad + self.epsilon) ** 0.5)
|
||||
|
||||
|
||||
class RMSPropLearningRule(GradientDescentLearningRule):
|
||||
"""Root mean squared gradient normalised learning rule (RMSProp).
|
||||
First-order gradient-descent based learning rule which normalises gradient
|
||||
updates by an exponentially smoothed estimate of the gradient second
|
||||
moments.
|
||||
References:
|
||||
[1]: Neural Networks for Machine Learning: Lecture 6a slides
|
||||
University of Toronto, Computer Science Course CSC321
|
||||
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
|
||||
"""
|
||||
|
||||
def __init__(self, learning_rate=1e-3, beta=0.9, epsilon=1e-8):
|
||||
"""Creates a new learning rule object.
|
||||
Args:
|
||||
learning_rate: A positive scalar to scale gradient updates to the
|
||||
parameters by. This needs to be carefully set - if too large
|
||||
the learning dynamic will be unstable and may diverge, while
|
||||
if set too small learning will proceed very slowly.
|
||||
beta: Exponential decay rate for gradient second moment
|
||||
estimates. This should be a scalar value in [0, 1]. The running
|
||||
gradient second moment estimate is calculated using
|
||||
`m_2 = beta * m_2_prev + (1 - beta) * g**2`
|
||||
where `m_2_prev` is the previous estimate and `g` the current
|
||||
parameter gradients.
|
||||
epsilon: 'Softening' parameter to stop updates diverging when
|
||||
gradient second moment estimates are close to zero. Should be
|
||||
set to a small positive value.
|
||||
"""
|
||||
super(RMSPropLearningRule, self).__init__(learning_rate)
|
||||
assert beta >= 0. and beta <= 1., 'beta should be in [0, 1].'
|
||||
assert epsilon > 0., 'epsilon should be > 0.'
|
||||
self.beta = beta
|
||||
self.epsilon = epsilon
|
||||
|
||||
def initialise(self, params):
|
||||
"""Initialises the state of the learning rule for a set or parameters.
|
||||
This must be called before `update_params` is first called.
|
||||
Args:
|
||||
params: A list of the parameters to be optimised. Note these will
|
||||
be updated *in-place* to avoid reallocating arrays on each
|
||||
update.
|
||||
"""
|
||||
super(RMSPropLearningRule, self).initialise(params)
|
||||
self.moms_2 = []
|
||||
for param in self.params:
|
||||
self.moms_2.append(np.zeros_like(param))
|
||||
|
||||
def reset(self):
|
||||
"""Resets any additional state variables to their initial values.
|
||||
For this learning rule this corresponds to zeroing all gradient
|
||||
second moment estimates.
|
||||
"""
|
||||
for mom_2 in self.moms_2:
|
||||
mom_2 *= 0.
|
||||
|
||||
def update_params(self, grads_wrt_params):
|
||||
"""Applies a single update to all parameters.
|
||||
All parameter updates are performed using in-place operations and so
|
||||
nothing is returned.
|
||||
Args:
|
||||
grads_wrt_params: A list of gradients of the scalar loss function
|
||||
with respect to each of the parameters passed to `initialise`
|
||||
previously, with this list expected to be in the same order.
|
||||
"""
|
||||
for param, mom_2, grad in zip(
|
||||
self.params, self.moms_2, grads_wrt_params):
|
||||
mom_2 *= self.beta
|
||||
mom_2 += (1. - self.beta) * grad ** 2
|
||||
param -= (self.learning_rate * grad /
|
||||
(mom_2 + self.epsilon) ** 0.5)
|
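The learning rules above all share the same `initialise` / `update_params` interface, so they are interchangeable from the optimiser's point of view. Below is a minimal usage sketch (not part of the diff) that exercises `AdamLearningRule` on a toy quadratic loss; it assumes these classes live in `mlp/learning_rules.py` and is for illustration only.

```python
import numpy as np

# Assumption: the classes shown above are importable from mlp.learning_rules.
from mlp.learning_rules import AdamLearningRule

# Toy problem: minimise 0.5 * ||p - target||^2, whose gradient is (p - target).
target = np.array([1.0, -2.0, 3.0])
params = [np.zeros_like(target)]  # list of parameter arrays, updated in place

rule = AdamLearningRule(learning_rate=0.1)
rule.initialise(params)  # must be called once before update_params

for step in range(200):
    grads_wrt_params = [params[0] - target]  # gradient of the toy loss
    rule.update_params(grads_wrt_params)     # in-place Adam update

print(params[0])  # should now be close to target
```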
145
mlp/models.py
Normal file
@ -0,0 +1,145 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Model definitions.
|
||||
|
||||
This module implements objects encapsulating learnable models of input-output
|
||||
relationships. The model objects implement methods for forward propagating
|
||||
the inputs through the transformation(s) defined by the model to produce
|
||||
outputs (and intermediate states) and for calculating gradients of scalar
|
||||
functions of the outputs with respect to the model parameters.
|
||||
"""
|
||||
|
||||
from mlp.layers import LayerWithParameters, StochasticLayer, StochasticLayerWithParameters
|
||||
|
||||
|
||||
class SingleLayerModel(object):
|
||||
"""A model consisting of a single transformation layer."""
|
||||
|
||||
def __init__(self, layer):
|
||||
"""Create a new single layer model instance.
|
||||
|
||||
Args:
|
||||
layer: The layer object defining the model architecture.
|
||||
"""
|
||||
self.layer = layer
|
||||
|
||||
@property
|
||||
def params(self):
|
||||
"""A list of all of the parameters of the model."""
|
||||
return self.layer.params
|
||||
|
||||
def fprop(self, inputs, evaluation=False):
|
||||
"""Calculate the model outputs corresponding to a batch of inputs.
|
||||
|
||||
Args:
|
||||
inputs: Batch of inputs to the model.
|
||||
|
||||
Returns:
|
||||
List which is a concatenation of the model inputs and model
|
||||
outputs, this being done for consistency of the interface with
|
||||
multi-layer models for which `fprop` returns a list of
|
||||
activations through all intermediate layers of the model and including
|
||||
the inputs and outputs.
|
||||
"""
|
||||
activations = [inputs, self.layer.fprop(inputs)]
|
||||
return activations
|
||||
|
||||
def grads_wrt_params(self, activations, grads_wrt_outputs):
|
||||
"""Calculates gradients with respect to the model parameters.
|
||||
|
||||
Args:
|
||||
activations: List of all activations from forward pass through
|
||||
model using `fprop`.
|
||||
grads_wrt_outputs: Gradient with respect to the model outputs of
|
||||
the scalar function parameter gradients are being calculated
|
||||
for.
|
||||
|
||||
Returns:
|
||||
List of gradients of the scalar function with respect to all model
|
||||
parameters.
|
||||
"""
|
||||
return self.layer.grads_wrt_params(activations[0], grads_wrt_outputs)
|
||||
|
||||
def __repr__(self):
|
||||
return 'SingleLayerModel(' + str(self.layer) + ')'
|
||||
|
||||
|
||||
class MultipleLayerModel(object):
|
||||
"""A model consisting of multiple layers applied sequentially."""
|
||||
|
||||
def __init__(self, layers):
|
||||
"""Create a new multiple layer model instance.
|
||||
|
||||
Args:
|
||||
layers: List of the layer objects defining the model in the
|
||||
order they should be applied from inputs to outputs.
|
||||
"""
|
||||
self.layers = layers
|
||||
|
||||
@property
|
||||
def params(self):
|
||||
"""A list of all of the parameters of the model."""
|
||||
params = []
|
||||
for layer in self.layers:
|
||||
if isinstance(layer, LayerWithParameters) or isinstance(layer, StochasticLayerWithParameters):
|
||||
params += layer.params
|
||||
return params
|
||||
|
||||
def fprop(self, inputs, evaluation=False):
|
||||
"""Forward propagates a batch of inputs through the model.
|
||||
|
||||
Args:
|
||||
inputs: Batch of inputs to the model.
|
||||
|
||||
Returns:
|
||||
List of the activations at the output of all layers of the model
|
||||
plus the inputs (to the first layer) as the first element. The
|
||||
last element of the list corresponds to the model outputs.
|
||||
"""
|
||||
activations = [inputs]
|
||||
for i, layer in enumerate(self.layers):
|
||||
if evaluation:
|
||||
if issubclass(type(self.layers[i]), StochasticLayer) or issubclass(type(self.layers[i]),
|
||||
StochasticLayerWithParameters):
|
||||
current_activations = self.layers[i].fprop(activations[i], stochastic=False)
|
||||
else:
|
||||
current_activations = self.layers[i].fprop(activations[i])
|
||||
else:
|
||||
if issubclass(type(self.layers[i]), StochasticLayer) or issubclass(type(self.layers[i]),
|
||||
StochasticLayerWithParameters):
|
||||
current_activations = self.layers[i].fprop(activations[i], stochastic=True)
|
||||
else:
|
||||
current_activations = self.layers[i].fprop(activations[i])
|
||||
activations.append(current_activations)
|
||||
return activations
|
||||
|
||||
def grads_wrt_params(self, activations, grads_wrt_outputs):
|
||||
"""Calculates gradients with respect to the model parameters.
|
||||
|
||||
Args:
|
||||
activations: List of all activations from forward pass through
|
||||
model using `fprop`.
|
||||
grads_wrt_outputs: Gradient with respect to the model outputs of
|
||||
the scalar function parameter gradients are being calculated
|
||||
for.
|
||||
|
||||
Returns:
|
||||
List of gradients of the scalar function with respect to all model
|
||||
parameters.
|
||||
"""
|
||||
grads_wrt_params = []
|
||||
for i, layer in enumerate(self.layers[::-1]):
|
||||
inputs = activations[-i - 2]
|
||||
outputs = activations[-i - 1]
|
||||
grads_wrt_inputs = layer.bprop(inputs, outputs, grads_wrt_outputs)
|
||||
if isinstance(layer, LayerWithParameters) or isinstance(layer, StochasticLayerWithParameters):
|
||||
grads_wrt_params += layer.grads_wrt_params(
|
||||
inputs, grads_wrt_outputs)[::-1]
|
||||
grads_wrt_outputs = grads_wrt_inputs
|
||||
return grads_wrt_params[::-1]
|
||||
|
||||
def __repr__(self):
|
||||
return (
|
||||
'MultipleLayerModel(\n    ' +
|
||||
'\n '.join([str(layer) for layer in self.layers]) +
|
||||
'\n)'
|
||||
)
|
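As a quick illustration of the `fprop` contract described in the docstrings above, the sketch below (not part of the diff) feeds a batch through a `MultipleLayerModel` built from two toy, parameter-free stand-in layers, so it does not rely on any layer classes that are not shown in this diff; only the `mlp.models` import path is assumed.

```python
import numpy as np

# Assumption: the module above is importable as mlp.models.
from mlp.models import MultipleLayerModel


class ToyScaleLayer(object):
    """Toy parameter-free layer that doubles its inputs."""

    def fprop(self, inputs):
        return 2.0 * inputs


class ToyReluLayer(object):
    """Toy parameter-free rectified linear layer."""

    def fprop(self, inputs):
        return np.maximum(inputs, 0.0)


model = MultipleLayerModel([ToyScaleLayer(), ToyReluLayer()])
inputs = np.array([[-1.0, 0.5], [2.0, -3.0]])

# activations == [inputs, doubled inputs, rectified doubled inputs]
activations = model.fprop(inputs)
print(len(activations), activations[-1])
```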
148
mlp/optimisers.py
Normal file
@ -0,0 +1,148 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Model optimisers.
|
||||
|
||||
This module contains objects implementing (batched) stochastic gradient descent
|
||||
based optimisation of models.
|
||||
"""
|
||||
|
||||
import time
|
||||
import logging
|
||||
from collections import OrderedDict
|
||||
import numpy as np
|
||||
import tqdm
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class Optimiser(object):
|
||||
"""Basic model optimiser."""
|
||||
|
||||
def __init__(self, model, error, learning_rule, train_dataset,
|
||||
valid_dataset=None, data_monitors=None, notebook=False):
|
||||
"""Create a new optimiser instance.
|
||||
|
||||
Args:
|
||||
model: The model to optimise.
|
||||
error: The scalar error function to minimise.
|
||||
learning_rule: Gradient based learning rule to use to minimise
|
||||
error.
|
||||
train_dataset: Data provider for training set data batches.
|
||||
valid_dataset: Data provider for validation set data batches.
|
||||
data_monitors: Dictionary of functions evaluated on targets and
|
||||
model outputs (averaged across both full training and
|
||||
validation data sets) to monitor during training in addition
|
||||
to the error. Keys should correspond to a string label for
|
||||
the statistic being evaluated.
|
||||
"""
|
||||
self.model = model
|
||||
self.error = error
|
||||
self.learning_rule = learning_rule
|
||||
self.learning_rule.initialise(self.model.params)
|
||||
self.train_dataset = train_dataset
|
||||
self.valid_dataset = valid_dataset
|
||||
self.data_monitors = OrderedDict([('error', error)])
|
||||
if data_monitors is not None:
|
||||
self.data_monitors.update(data_monitors)
|
||||
self.notebook = notebook
|
||||
if notebook:
|
||||
self.tqdm_progress = tqdm.tqdm_notebook
|
||||
else:
|
||||
self.tqdm_progress = tqdm.tqdm
|
||||
|
||||
def do_training_epoch(self):
|
||||
"""Do a single training epoch.
|
||||
|
||||
This iterates through all batches in the training dataset, for each
|
||||
calculating the gradient of the estimated error given the batch with
|
||||
respect to all the model parameters and then updates the model
|
||||
parameters according to the learning rule.
|
||||
"""
|
||||
with self.tqdm_progress(total=self.train_dataset.num_batches) as train_progress_bar:
|
||||
train_progress_bar.set_description("Epoch Progress")
|
||||
for inputs_batch, targets_batch in self.train_dataset:
|
||||
activations = self.model.fprop(inputs_batch)
|
||||
grads_wrt_outputs = self.error.grad(activations[-1], targets_batch)
|
||||
grads_wrt_params = self.model.grads_wrt_params(
|
||||
activations, grads_wrt_outputs)
|
||||
self.learning_rule.update_params(grads_wrt_params)
|
||||
train_progress_bar.update(1)
|
||||
|
||||
def eval_monitors(self, dataset, label):
|
||||
"""Evaluates the monitors for the given dataset.
|
||||
|
||||
Args:
|
||||
dataset: Dataset to perform evaluation with.
|
||||
label: Tag to add to end of monitor keys to identify dataset.
|
||||
|
||||
Returns:
|
||||
OrderedDict of monitor values evaluated on dataset.
|
||||
"""
|
||||
data_mon_vals = OrderedDict([(key + label, 0.) for key
|
||||
in self.data_monitors.keys()])
|
||||
for inputs_batch, targets_batch in dataset:
|
||||
activations = self.model.fprop(inputs_batch, evaluation=True)
|
||||
for key, data_monitor in self.data_monitors.items():
|
||||
data_mon_vals[key + label] += data_monitor(
|
||||
activations[-1], targets_batch)
|
||||
for key, data_monitor in self.data_monitors.items():
|
||||
data_mon_vals[key + label] /= dataset.num_batches
|
||||
return data_mon_vals
|
||||
|
||||
def get_epoch_stats(self):
|
||||
"""Computes training statistics for an epoch.
|
||||
|
||||
Returns:
|
||||
An OrderedDict with keys corresponding to the statistic labels and
|
||||
values corresponding to the value of the statistic.
|
||||
"""
|
||||
epoch_stats = OrderedDict()
|
||||
epoch_stats.update(self.eval_monitors(self.train_dataset, '(train)'))
|
||||
if self.valid_dataset is not None:
|
||||
epoch_stats.update(self.eval_monitors(
|
||||
self.valid_dataset, '(valid)'))
|
||||
return epoch_stats
|
||||
|
||||
def log_stats(self, epoch, epoch_time, stats):
|
||||
"""Outputs stats for a training epoch to a logger.
|
||||
|
||||
Args:
|
||||
epoch (int): Epoch counter.
|
||||
epoch_time: Time taken in seconds for the epoch to complete.
|
||||
stats: Monitored stats for the epoch.
|
||||
"""
|
||||
logger.info('Epoch {0}: {1:.1f}s to complete\n {2}'.format(
|
||||
epoch, epoch_time,
|
||||
', '.join(['{0}={1:.2e}'.format(k, v) for (k, v) in stats.items()])
|
||||
))
|
||||
|
||||
def train(self, num_epochs, stats_interval=5):
|
||||
"""Trains a model for a set number of epochs.
|
||||
|
||||
Args:
|
||||
num_epochs: Number of epochs (complete passes through the training
|
||||
dataset) to train for.
|
||||
stats_interval: Training statistics will be recorded and logged
|
||||
every `stats_interval` epochs.
|
||||
|
||||
Returns:
|
||||
Tuple with first value being an array of training run statistics
|
||||
and the second being a dict mapping the labels for the statistics
|
||||
recorded to their column index in the array, and the third value being the total training time in seconds.
|
||||
"""
|
||||
start_train_time = time.time()
|
||||
stats = self.get_epoch_stats()  # also keeps `stats` defined if no logging epoch runs
run_stats = [list(stats.values())]
|
||||
with self.tqdm_progress(total=num_epochs) as progress_bar:
|
||||
progress_bar.set_description("Experiment Progress")
|
||||
for epoch in range(1, num_epochs + 1):
|
||||
start_time = time.time()
|
||||
self.do_training_epoch()
|
||||
epoch_time = time.time() - start_time
|
||||
if epoch % stats_interval == 0:
|
||||
stats = self.get_epoch_stats()
|
||||
self.log_stats(epoch, epoch_time, stats)
|
||||
run_stats.append(list(stats.values()))
|
||||
progress_bar.update(1)
|
||||
finish_train_time = time.time()
|
||||
total_train_time = finish_train_time - start_train_time
|
||||
return np.array(run_stats), {k: i for i, k in enumerate(stats.keys())}, total_train_time
|
||||
|
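To show how the pieces fit together, here is a hedged end-to-end sketch (not part of the diff) of driving the `Optimiser` above. The data provider, error and layer classes it imports (`MNISTDataProvider`, `CrossEntropySoftmaxError`, `AffineLayer`) are assumed to exist elsewhere in the `mlp` package with these names and default constructors; only `Optimiser.train` and its three return values are taken directly from the code above.

```python
# Illustration only; all imports other than Optimiser, SingleLayerModel and
# GradientDescentLearningRule are assumed names from elsewhere in the package.
from mlp.data_providers import MNISTDataProvider      # assumed data provider
from mlp.errors import CrossEntropySoftmaxError       # assumed error class
from mlp.layers import AffineLayer                    # assumed layer class
from mlp.learning_rules import GradientDescentLearningRule
from mlp.models import SingleLayerModel
from mlp.optimisers import Optimiser

train_data = MNISTDataProvider('train', batch_size=100)
valid_data = MNISTDataProvider('valid', batch_size=100)

model = SingleLayerModel(AffineLayer(784, 10))   # 784 inputs -> 10 classes
error = CrossEntropySoftmaxError()
learning_rule = GradientDescentLearningRule(learning_rate=0.1)

optimiser = Optimiser(model, error, learning_rule, train_data, valid_data)

# train returns (stats array, {label: column index}, total training time).
stats, keys, run_time = optimiser.train(num_epochs=10, stats_interval=2)
print('final training error:', stats[-1, keys['error(train)']])
```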
90
mlp/penalties.py
Normal file
@ -0,0 +1,90 @@
|
||||
import numpy as np
|
||||
|
||||
seed = 22102017
|
||||
rng = np.random.RandomState(seed)
|
||||
|
||||
|
||||
class L1Penalty(object):
|
||||
"""L1 parameter penalty.
|
||||
|
||||
Term to add to the objective function penalising parameters
|
||||
based on their L1 norm.
|
||||
"""
|
||||
|
||||
def __init__(self, coefficient):
|
||||
"""Create a new L1 penalty object.
|
||||
|
||||
Args:
|
||||
coefficient: Positive constant to scale penalty term by.
|
||||
"""
|
||||
assert coefficient > 0., 'Penalty coefficient must be positive.'
|
||||
self.coefficient = coefficient
|
||||
|
||||
def __call__(self, parameter):
|
||||
"""Calculate L1 penalty value for a parameter.
|
||||
|
||||
Args:
|
||||
parameter: Array corresponding to a model parameter.
|
||||
|
||||
Returns:
|
||||
Value of penalty term.
|
||||
"""
|
||||
return self.coefficient * abs(parameter).sum()
|
||||
|
||||
def grad(self, parameter):
|
||||
"""Calculate the penalty gradient with respect to the parameter.
|
||||
|
||||
Args:
|
||||
parameter: Array corresponding to a model parameter.
|
||||
|
||||
Returns:
|
||||
Value of penalty gradient with respect to parameter. This
|
||||
should be an array of the same shape as the parameter.
|
||||
"""
|
||||
return self.coefficient * np.sign(parameter)
|
||||
|
||||
def __repr__(self):
|
||||
return 'L1Penalty({0})'.format(self.coefficient)
|
||||
|
||||
|
||||
class L2Penalty(object):
|
||||
"""L1 parameter penalty.
|
||||
|
||||
Term to add to the objective function penalising parameters
|
||||
based on their L2 norm.
|
||||
"""
|
||||
|
||||
def __init__(self, coefficient):
|
||||
"""Create a new L2 penalty object.
|
||||
|
||||
Args:
|
||||
coefficient: Positive constant to scale penalty term by.
|
||||
"""
|
||||
assert coefficient > 0., 'Penalty coefficient must be positive.'
|
||||
self.coefficient = coefficient
|
||||
|
||||
def __call__(self, parameter):
|
||||
"""Calculate L2 penalty value for a parameter.
|
||||
|
||||
Args:
|
||||
parameter: Array corresponding to a model parameter.
|
||||
|
||||
Returns:
|
||||
Value of penalty term.
|
||||
"""
|
||||
return 0.5 * self.coefficient * (parameter ** 2).sum()
|
||||
|
||||
def grad(self, parameter):
|
||||
"""Calculate the penalty gradient with respect to the parameter.
|
||||
|
||||
Args:
|
||||
parameter: Array corresponding to a model parameter.
|
||||
|
||||
Returns:
|
||||
Value of penalty gradient with respect to parameter. This
|
||||
should be an array of the same shape as the parameter.
|
||||
"""
|
||||
return self.coefficient * parameter
|
||||
|
||||
def __repr__(self):
|
||||
return 'L2Penalty({0})'.format(self.coefficient)
|
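A short usage sketch for the two penalty classes (illustration only; it assumes the module above is importable as `mlp.penalties`):

```python
import numpy as np

from mlp.penalties import L1Penalty, L2Penalty  # assumed module path

weights = np.array([[0.5, -1.5],
                    [2.0, 0.0]])

l1 = L1Penalty(coefficient=1e-3)
l2 = L2Penalty(coefficient=1e-2)

print(l1(weights))       # 1e-3 * (0.5 + 1.5 + 2.0 + 0.0) = 0.004
print(l1.grad(weights))  # 1e-3 * sign(weights), same shape as weights
print(l2(weights))       # 0.5 * 1e-2 * 6.5 = 0.0325
print(l2.grad(weights))  # 1e-2 * weights
```

During training the penalty value would typically be added to the error and `grad` added to the corresponding parameter gradient; how the penalties are wired into the layers is not shown in this diff.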
34
mlp/schedulers.py
Normal file
@ -0,0 +1,34 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Training schedulers.
|
||||
|
||||
This module contains classes implementing schedulers which control the
|
||||
evolution of learning rule hyperparameters (such as learning rate) over a
|
||||
training run.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
class ConstantLearningRateScheduler(object):
|
||||
"""Example of scheduler interface which sets a constant learning rate."""
|
||||
|
||||
def __init__(self, learning_rate):
|
||||
"""Construct a new constant learning rate scheduler object.
|
||||
|
||||
Args:
|
||||
learning_rate: Learning rate to use in learning rule.
|
||||
"""
|
||||
self.learning_rate = learning_rate
|
||||
|
||||
def update_learning_rule(self, learning_rule, epoch_number):
|
||||
"""Update the hyperparameters of the learning rule.
|
||||
|
||||
Run at the beginning of each epoch.
|
||||
|
||||
Args:
|
||||
learning_rule: Learning rule object being used in training run,
|
||||
any scheduled hyperparameters to be altered should be
|
||||
attributes of this object.
|
||||
epoch_number: Integer index of training epoch about to be run.
|
||||
"""
|
||||
learning_rule.learning_rate = self.learning_rate
|
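The constant scheduler above mainly documents the scheduler interface: any object with an `update_learning_rule(learning_rule, epoch_number)` method can be used. As an illustration (not part of the diff), here is a sketch of a decaying scheduler following the same interface; the `1 / (1 + epoch / decay_rate)` decay is a common choice, not something defined in this repository.

```python
class TimeDependentLearningRateScheduler(object):
    """Sketch: scales the learning rate by 1 / (1 + epoch / decay_rate)."""

    def __init__(self, init_learning_rate, decay_rate):
        assert decay_rate > 0., 'decay_rate should be positive.'
        self.init_learning_rate = init_learning_rate
        self.decay_rate = decay_rate

    def update_learning_rule(self, learning_rule, epoch_number):
        # Run at the beginning of each epoch, mirroring the interface above.
        learning_rule.learning_rate = (
            self.init_learning_rate / (1. + epoch_number / self.decay_rate))
```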
@ -1,208 +0,0 @@
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
|
||||
|
||||
class FCCNetwork(nn.Module):
|
||||
def __init__(self, input_shape, num_output_classes, num_filters, num_layers, use_bias=False):
|
||||
"""
|
||||
Initializes a fully connected network similar to the ones implemented previously in the MLP package.
|
||||
:param input_shape: The shape of the inputs going in to the network.
|
||||
:param num_output_classes: The number of outputs the network should have (for classification those would be the number of classes)
|
||||
:param num_filters: Number of filters used in every fcc layer.
|
||||
:param num_layers: Number of fcc layers (excluding dim reduction stages)
|
||||
:param use_bias: Whether our fcc layers will use a bias.
|
||||
"""
|
||||
super(FCCNetwork, self).__init__()
|
||||
# set up class attributes useful in building the network and inference
|
||||
self.input_shape = input_shape
|
||||
self.num_filters = num_filters
|
||||
self.num_output_classes = num_output_classes
|
||||
self.use_bias = use_bias
|
||||
self.num_layers = num_layers
|
||||
# initialize a module dict, which is effectively a dictionary that can collect layers and integrate them into pytorch
|
||||
self.layer_dict = nn.ModuleDict()
|
||||
# build the network
|
||||
self.build_module()
|
||||
|
||||
def build_module(self):
|
||||
print("Building basic block of FCCNetwork using input shape", self.input_shape)
|
||||
x = torch.zeros((self.input_shape))
|
||||
|
||||
out = x
|
||||
out = out.view(out.shape[0], -1)
|
||||
# flatten inputs to shape (b, -1) where -1 is the dim resulting from multiplying the
|
||||
# shapes of all dimensions after the 0th dim
|
||||
|
||||
for i in range(self.num_layers):
|
||||
self.layer_dict['fcc_{}'.format(i)] = nn.Linear(in_features=out.shape[1], # initialize a fcc layer
|
||||
out_features=self.num_filters,
|
||||
bias=self.use_bias)
|
||||
|
||||
out = self.layer_dict['fcc_{}'.format(i)](out) # apply ith fcc layer to the previous layers outputs
|
||||
out = F.relu(out) # apply a ReLU on the outputs
|
||||
|
||||
self.logits_linear_layer = nn.Linear(in_features=out.shape[1], # initialize the prediction output linear layer
|
||||
out_features=self.num_output_classes,
|
||||
bias=self.use_bias)
|
||||
out = self.logits_linear_layer(out) # apply the layer to the previous layer's outputs
|
||||
print("Block is built, output volume is", out.shape)
|
||||
return out
|
||||
|
||||
def forward(self, x):
|
||||
"""
|
||||
Forward prop data through the network and return the preds
|
||||
:param x: Input batch x, a batch of samples, each of any dimensionality.
|
||||
:return: preds of shape (b, num_classes)
|
||||
"""
|
||||
out = x
|
||||
out = out.view(out.shape[0], -1)
|
||||
# flatten inputs to shape (b, -1) where -1 is the dim resulting from multiplying the
|
||||
# shapes of all dimensions after the 0th dim
|
||||
|
||||
for i in range(self.num_layers):
|
||||
out = self.layer_dict['fcc_{}'.format(i)](out) # apply ith fcc layer to the previous layers outputs
|
||||
out = F.relu(out) # apply a ReLU on the outputs
|
||||
|
||||
out = self.logits_linear_layer(out) # apply the layer to the previous layer's outputs
|
||||
return out
|
||||
|
||||
def reset_parameters(self):
|
||||
"""
|
||||
Re-initializes the networks parameters
|
||||
"""
|
||||
for item in self.layer_dict.children():
|
||||
item.reset_parameters()
|
||||
|
||||
self.logits_linear_layer.reset_parameters()
|
||||
|
||||
class ConvolutionalNetwork(nn.Module):
|
||||
def __init__(self, input_shape, dim_reduction_type, num_output_classes, num_filters, num_layers, use_bias=False):
|
||||
"""
|
||||
Initializes a convolutional network module object.
|
||||
:param input_shape: The shape of the inputs going in to the network.
|
||||
:param dim_reduction_type: The type of dimensionality reduction to apply after each convolutional stage, should be one of ['max_pooling', 'avg_pooling', 'strided_convolution', 'dilated_convolution']
|
||||
:param num_output_classes: The number of outputs the network should have (for classification those would be the number of classes)
|
||||
:param num_filters: Number of filters used in every conv layer, except dim reduction stages, where those are automatically inferred.
|
||||
:param num_layers: Number of conv layers (excluding dim reduction stages)
|
||||
:param use_bias: Whether our convolutions will use a bias.
|
||||
"""
|
||||
super(ConvolutionalNetwork, self).__init__()
|
||||
# set up class attributes useful in building the network and inference
|
||||
self.input_shape = input_shape
|
||||
self.num_filters = num_filters
|
||||
self.num_output_classes = num_output_classes
|
||||
self.use_bias = use_bias
|
||||
self.num_layers = num_layers
|
||||
self.dim_reduction_type = dim_reduction_type
|
||||
# initialize a module dict, which is effectively a dictionary that can collect layers and integrate them into pytorch
|
||||
self.layer_dict = nn.ModuleDict()
|
||||
# build the network
|
||||
self.build_module()
|
||||
|
||||
def build_module(self):
|
||||
"""
|
||||
Builds network whilst automatically inferring shapes of layers.
|
||||
"""
|
||||
print("Building basic block of ConvolutionalNetwork using input shape", self.input_shape)
|
||||
x = torch.zeros((self.input_shape)) # create dummy inputs to be used to infer shapes of layers
|
||||
|
||||
out = x
|
||||
# torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
|
||||
for i in range(self.num_layers): # for number of layers times
|
||||
self.layer_dict['conv_{}'.format(i)] = nn.Conv2d(in_channels=out.shape[1],
|
||||
# add a conv layer in the module dict
|
||||
kernel_size=3,
|
||||
out_channels=self.num_filters, padding=1,
|
||||
bias=self.use_bias)
|
||||
|
||||
out = self.layer_dict['conv_{}'.format(i)](out) # use layer on inputs to get an output
|
||||
out = F.relu(out) # apply relu
|
||||
print(out.shape)
|
||||
if self.dim_reduction_type == 'strided_convolution': # if dim reduction is strided conv, then add a strided conv
|
||||
self.layer_dict['dim_reduction_strided_conv_{}'.format(i)] = nn.Conv2d(in_channels=out.shape[1],
|
||||
kernel_size=3,
|
||||
out_channels=out.shape[1],
|
||||
padding=1,
|
||||
bias=self.use_bias, stride=2,
|
||||
dilation=1)
|
||||
|
||||
out = self.layer_dict['dim_reduction_strided_conv_{}'.format(i)](
|
||||
out) # use strided conv to get an output
|
||||
out = F.relu(out) # apply relu to the output
|
||||
elif self.dim_reduction_type == 'dilated_convolution': # if dim reduction is dilated conv, then add a dilated conv, using an arbitrary dilation rate of i + 2 (so it gets smaller as we go, you can choose other dilation rates should you wish to do it.)
|
||||
self.layer_dict['dim_reduction_dilated_conv_{}'.format(i)] = nn.Conv2d(in_channels=out.shape[1],
|
||||
kernel_size=3,
|
||||
out_channels=out.shape[1],
|
||||
padding=1,
|
||||
bias=self.use_bias, stride=1,
|
||||
dilation=i + 2)
|
||||
out = self.layer_dict['dim_reduction_dilated_conv_{}'.format(i)](
|
||||
out) # run dilated conv on input to get output
|
||||
out = F.relu(out) # apply relu on output
|
||||
|
||||
elif self.dim_reduction_type == 'max_pooling':
|
||||
self.layer_dict['dim_reduction_max_pool_{}'.format(i)] = nn.MaxPool2d(2, padding=1)
|
||||
out = self.layer_dict['dim_reduction_max_pool_{}'.format(i)](out)
|
||||
|
||||
elif self.dim_reduction_type == 'avg_pooling':
|
||||
self.layer_dict['dim_reduction_avg_pool_{}'.format(i)] = nn.AvgPool2d(2, padding=1)
|
||||
out = self.layer_dict['dim_reduction_avg_pool_{}'.format(i)](out)
|
||||
|
||||
print(out.shape)
|
||||
if out.shape[-1] != 2:
|
||||
out = F.adaptive_avg_pool2d(out,
|
||||
2) # apply adaptive pooling to make sure output of conv layers is always (2, 2) spatially (helps with comparisons).
|
||||
print('shape before final linear layer', out.shape)
|
||||
out = out.view(out.shape[0], -1)
|
||||
self.logit_linear_layer = nn.Linear(in_features=out.shape[1], # add a linear layer
|
||||
out_features=self.num_output_classes,
|
||||
bias=self.use_bias)
|
||||
out = self.logit_linear_layer(out) # apply linear layer on flattened inputs
|
||||
print("Block is built, output volume is", out.shape)
|
||||
return out
|
||||
|
||||
def forward(self, x):
|
||||
"""
|
||||
Forward propagates the network given an input batch
|
||||
:param x: Inputs x (b, c, h, w)
|
||||
:return: preds (b, num_classes)
|
||||
"""
|
||||
out = x
|
||||
for i in range(self.num_layers): # for number of layers
|
||||
|
||||
out = self.layer_dict['conv_{}'.format(i)](out) # pass through conv layer indexed at i
|
||||
out = F.relu(out) # pass conv outputs through ReLU
|
||||
if self.dim_reduction_type == 'strided_convolution': # if strided convolution dim reduction then
|
||||
out = self.layer_dict['dim_reduction_strided_conv_{}'.format(i)](
|
||||
out) # pass previous outputs through a strided convolution indexed i
|
||||
out = F.relu(out) # pass strided conv outputs through ReLU
|
||||
|
||||
elif self.dim_reduction_type == 'dilated_convolution':
|
||||
out = self.layer_dict['dim_reduction_dilated_conv_{}'.format(i)](out)
|
||||
out = F.relu(out)
|
||||
|
||||
elif self.dim_reduction_type == 'max_pooling':
|
||||
out = self.layer_dict['dim_reduction_max_pool_{}'.format(i)](out)
|
||||
|
||||
elif self.dim_reduction_type == 'avg_pooling':
|
||||
out = self.layer_dict['dim_reduction_avg_pool_{}'.format(i)](out)
|
||||
|
||||
if out.shape[-1] != 2:
|
||||
out = F.adaptive_avg_pool2d(out, 2)
|
||||
out = out.view(out.shape[0], -1) # flatten outputs from (b, c, h, w) to (b, c*h*w)
|
||||
out = self.logit_linear_layer(out) # pass through a linear layer to get logits/preds
|
||||
return out
|
||||
|
||||
def reset_parameters(self):
|
||||
"""
|
||||
Re-initialize the network parameters.
|
||||
"""
|
||||
for item in self.layer_dict.children():
|
||||
try:
|
||||
item.reset_parameters()
|
||||
except:
|
||||
pass
|
||||
|
||||
self.logit_linear_layer.reset_parameters()
|
665
notebooks/Coursework_2_Pytorch_Introduction.ipynb
Normal file
215
notebooks/Plot_Results.ipynb
Normal file
2030
notebooks/res/code_scheme.svg
Normal file
BIN
notebooks/res/fprop-bprop-block-diagram.pdf
Normal file
BIN
notebooks/res/fprop-bprop-block-diagram.png
Normal file
65
notebooks/res/fprop-bprop-block-diagram.tex
Normal file
@ -0,0 +1,65 @@
|
||||
\documentclass[tikz]{standalone}
|
||||
|
||||
\usepackage{amsmath}
|
||||
\usepackage{tikz}
|
||||
\usetikzlibrary{arrows}
|
||||
\usetikzlibrary{calc}
|
||||
\usepackage{ifthen}
|
||||
|
||||
\newcommand{\vct}[1]{\boldsymbol{#1}}
|
||||
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
|
||||
|
||||
\tikzstyle{fprop} = [draw,fill=blue!20,minimum size=2em,align=center]
|
||||
\tikzstyle{bprop} = [draw,fill=red!20,minimum size=2em,align=center]
|
||||
|
||||
\begin{document}
|
||||
|
||||
\begin{tikzpicture}[xscale=1.75] %
|
||||
% define number of layers
|
||||
\def\nl{2};
|
||||
% model input
|
||||
\node at (0, 0) (input) {$\vct{x}$};
|
||||
% draw fprop through model layers
|
||||
\foreach \l in {0,...,\nl} {
|
||||
\node[fprop] at (2 * \l + 1, 0) (fprop\l) {\texttt{layers[\l]} \\ \texttt{.fprop}};
|
||||
\ifthenelse{\l > 0}{
|
||||
\node at (2 * \l, 0) (hidden\l) {$\vct{h}_\l$};
|
||||
\draw[->] (hidden\l) -- (fprop\l);
|
||||
\draw[->] let \n1={\l - 1} in (fprop\n1) -- (hidden\l);
|
||||
}{
|
||||
\draw[->] (input) -- (fprop\l);
|
||||
}
|
||||
}
|
||||
% model output
|
||||
\node at (2 * \nl + 2, 0) (output) {$\mathbf{y}$};
|
||||
% error function
|
||||
\node[fprop] at (2 * \nl + 3, 0) (errorfunc) {\texttt{error}};
|
||||
% error value
|
||||
\node at (2 * \nl + 3, -1) (error) {$\bar{E}$};
|
||||
% targets
|
||||
\node at (2 * \nl + 4, -1) (tgt) {$\vct{t}$};
|
||||
% error gradient
|
||||
\node[bprop] at (2 * \nl + 3, -2) (errorgrad) {\texttt{error} \\ \texttt{.grad}};
|
||||
% gradient wrt outputs
|
||||
\node at (2 * \nl + 2, -2) (gradoutput) {$\pd{\bar{E}}{\vct{y}}$};
|
||||
\draw[->] (fprop\nl) -- (output);
|
||||
\draw[->] (output) -- (errorfunc);
|
||||
\draw[->] (errorfunc) -- (error);
|
||||
\draw[->] (error) -- (errorgrad);
|
||||
\draw[->] (errorgrad) -- (gradoutput);
|
||||
\draw[->] (tgt) |- (errorfunc);
|
||||
\draw[->] (tgt) |- (errorgrad);
|
||||
\foreach \l in {0,...,\nl} {
|
||||
\node[bprop] at (2 * \l + 1, -2) (bprop\l) {\texttt{layers[\l]} \\ \texttt{.bprop}};
|
||||
\ifthenelse{\l > 0}{
|
||||
\node at (2 * \l, -2) (grad\l) {$\pd{\bar{E}}{\vct{h}_\l}$};
|
||||
\draw[<-] (grad\l) -- (bprop\l);
|
||||
\draw[<-] let \n1={\l - 1} in (bprop\n1) -- (grad\l);
|
||||
}{}
|
||||
}
|
||||
\node at (0, -2) (gradinput) {$\pd{\bar{E}}{\vct{x}}$};
|
||||
\draw[->] (bprop0) -- (gradinput);
|
||||
\draw[->] (gradoutput) -- (bprop\nl);
|
||||
\end{tikzpicture}
|
||||
|
||||
\end{document}
|
BIN
notebooks/res/jupyter-dashboard.png
Normal file
BIN
notebooks/res/jupyter-notebook-interface.png
Normal file
BIN
notebooks/res/singleLayerNetBP-1.png
Normal file
BIN
notebooks/res/singleLayerNetPredict.png
Normal file
BIN
notebooks/res/singleLayerNetWts-1.png
Normal file
BIN
notebooks/res/singleLayerNetWtsBP.pdf
Normal file
BIN
notebooks/res/singleLayerNetWtsEqns-1.png
Normal file
BIN
notebooks/res/singleLayerNetWtsEqns.pdf
Normal file
@ -1,175 +0,0 @@
|
||||
# Google Cloud Usage Tutorial
|
||||
|
||||
This document has been created to help you set up a Google Cloud instance to be used for the MLP course, using the student credit the course has acquired.
|
||||
This document is non-exhaustive; much more useful information is available on the [google cloud documentation page](https://cloud.google.com/docs/).
|
||||
For any question you might have that is not covered here, a quick Google search should get you what you need. Anything in the official Google Cloud docs should be very helpful.
|
||||
|
||||
| WARNING: Read these instructions carefully. You will be given $50 worth of credits and you will need to manage them properly. We will not be able to provide more credits. |
|
||||
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
|
||||
|
||||
### To create your account and start a project funded by the student credit
|
||||
|
||||
1. Login with your preferred gmail id to the [google cloud console](https://cloud.google.com/). Click on `Console` (upper right corner), which will lead you to a new page; once there, click on Select a Project on the left-hand side of the search bar at the top of the page and then click on New Project on the right-hand side of the pop-up.
|
||||
Name your project sxxxxxxx-MLPractical - replacing the sxxxxxxx with your student number. **Make sure you are on this project before following the next steps**.
|
||||
2. Get your coupon by following the instructions in the coupon retrieval link that you received.
|
||||
3. Once you receive your coupon, follow the email instructions to add your coupon to your account.
|
||||
4. Once you have added your coupon, join the [MLPractical GCP Google Group](https://groups.google.com/forum/#!forum/mlpractical_gcp) using the same Google account you used to redeem your coupon. This ensures access to the shared disk images.
|
||||
5. Make sure that the financial source for your project is the MLPractical credit. You can check this by going to the [Google Cloud Console](https://console.cloud.google.com/) and selecting your project. Then, click on the `Billing` tile. Once on the `Billing` page, you should be prompted to add the billing account if you haven't yet done so. Choose `Billing Account for Education` as your billing account. Then, under the billing account, click `account management` on the left-hand side tab. You should see your project under `Projects linked to this billing account`. If not, you can add it by clicking on `Add projects` and selecting your project from the list of available projects.
|
||||
|
||||
### To create an instance
|
||||
|
||||
1. On the console page, click the button with the three lines at the top left corner.
|
||||
2. In the ```Compute Engine``` sub-menu select ```VM Instances```.
|
||||
3. Enable ```Compute Engine API``` if prompted.
|
||||
4. Click the ```CREATE INSTANCE``` button at the top of the window.
|
||||
5. Click on ```VM FROM INSTANCE TEMPLATE```, and create your VM template for this coursework:
|
||||
6. Name the template ```mlpractical-1```.
|
||||
7. Select ```Regional``` as the location type and ```us-west1(Oregon)``` as the region.
|
||||
|
||||
![VM location](figures/vm_instance_location.png)
|
||||
|
||||
8. Under ```Machine Configuration```, select the ```GPU``` machine family. Select one NVIDIA T4. These are the cheapest; be careful, as others can cost up to 8 times more to run.
|
||||
9. Below, in ```Machine type```, under ```PRESET``` select ```n1-standard-2 (2 vCPU, 1 core, 7.5Gb memory)```.
|
||||
|
||||
![VM location](figures/vm_instance_configuration.png)
|
||||
|
||||
10. Under ```Boot disk```, click change.
|
||||
11. In the new menu that appears on the right-hand side (under ```PUBLIC IMAGES```), select
|
||||
* ```Deep Learning on Linux``` operating system,
|
||||
* ```Deep Learning VM for PyTorch 2.0 with CUDA 11.8 M125```
|
||||
* **Note**: If the above version is not available, you can use any ```Deep Learning VM for PyTorch 2.0 with CUDA 11.8 M***``` instead.
|
||||
* ```Balanced persistent disk``` as boot disk type,
|
||||
* ```100```GB as disk size, and then click select at the bottom.
|
||||
|
||||
![Boot disk](figures/boot_disk.png)
|
||||
|
||||
12. Under ```Availability policies```, in the ```VM provisioning model``` drop down menu, select ```Spot```. Using this option will be helpful if you're running low on credits.
|
||||
13. You can ```Enable display device``` if you want to use a GUI. This is not necessary for the coursework.
|
||||
14. Leave other options as default and click ```CREATE```.
|
||||
15. Tick your newly created template and click ```CREATE VM``` (top centre).
|
||||
16. Click ```CREATE```. Your instance should be ready in a minute or two.
|
||||
17. If your instance failed to create due to the following error - ```The GPUS-ALL-REGIONS-per-project quota maximum has been exceeded. Current limit: 0.0. Metric: compute.googleapis.com/gpus_all_regions.```, click on ```REQUEST QUOTA``` in the notification.
|
||||
18. Tick ```Compute Engine API``` and then click ```EDIT QUOTAS``` (top right).
|
||||
|
||||
![VM location](figures/increase_quota.png)
|
||||
|
||||
19. This will open a box on the right-hand side. Set your ```New Limit``` to ```1``` and in the description you can mention you need a GPU for machine learning coursework.
|
||||
20. Click ```NEXT```, fill in your details and then click ```SUBMIT REQUEST```.
|
||||
21. You will receive a confirmation email once your quota limit has been increased. This may take a few minutes.
|
||||
22. After the confirmation email, you can recheck that the GPU (All Regions) quota limit has been set to 1. This usually shows up 10-15 minutes after the confirmation email.
|
||||
23. Retry creating the VM instance as before by choosing your template, and you should now have your instance.
|
||||
|
||||
|
||||
#### Note
|
||||
Be careful to select 1 x T4 GPU (Others can be much more expensive).
|
||||
|
||||
You only have $50 worth of credit, which should be about 6 days of GPU usage on a T4.
|
||||
|
||||
|
||||
### To login into your instance via terminal:
|
||||
|
||||
1. Install `google-cloud-sdk` (or similarly named) package using your OS package manager
|
||||
2. To authorize the current machine to access your nodes, run ```gcloud auth login```. This will authenticate your Google account login.
|
||||
3. Follow the prompts to get a token for your current machine.
|
||||
4. Run ```gcloud config set project PROJECT_ID```, replacing `PROJECT_ID` with your project ID. You can find that in the projects drop-down menu at the top of the Google Compute Engine window; this sets the current project as the active one. If you followed the above instructions, your project ID should be `sxxxxxxx-mlpractical`, where `sxxxxxxx` is your student number.
|
||||
5. In your compute engine window, in the line for the instance that you have started (`mlpractical-1`), click on the downward arrow next to ```SSH```. Choose ```View gcloud command```. Copy the command to your terminal and press enter. Make sure your VM is up and running before doing this.
|
||||
6. Don't add a password to the SSH key.
|
||||
7. On your first login, you will be asked if you want to install nvidia drivers, **DO NOT AGREE** and follow the nvidia drivers installation below.
|
||||
8. Install the R470 Nvidia driver by running the following commands:
|
||||
* Add "contrib" and "non-free" components to /etc/apt/sources.list
|
||||
```bash
|
||||
sudo tee -a /etc/apt/sources.list >/dev/null <<'EOF'
|
||||
deb http://deb.debian.org/debian/ bullseye main contrib non-free
|
||||
deb-src http://deb.debian.org/debian/ bullseye main contrib non-free
|
||||
EOF
|
||||
```
|
||||
* Check that the lines were well added by running:
|
||||
```bash
|
||||
cat /etc/apt/sources.list
|
||||
```
|
||||
* Update the list of available packages and install the nvidia-driver package:
|
||||
```bash
|
||||
sudo apt update
|
||||
sudo apt install nvidia-driver firmware-misc-nonfree
|
||||
```
|
||||
9. Run ```nvidia-smi``` to confirm that the GPU can be found. This should report 1 Tesla T4 GPU. If not, the driver might have failed to install.
|
||||
10. To test that PyTorch has access to the GPU, you can type the commands below in your terminal. You should see `torch.cuda.is_available()` return `True`.
|
||||
```
|
||||
python
|
||||
```
|
||||
```
|
||||
import torch
|
||||
torch.cuda.is_available()
|
||||
```
|
||||
```
|
||||
exit()
|
||||
```
|
||||
11. Well done, you are now in your instance and ready to use it for your coursework.
|
||||
12. Clone a fresh mlpractical repository, and checkout branch `mlp2024-25/mlp_compute_engines`:
|
||||
|
||||
```
|
||||
git clone https://github.com/VICO-UoE/mlpractical.git ~/mlpractical
|
||||
cd ~/mlpractical
|
||||
git checkout mlp2024-25/mlp_compute_engines
|
||||
```
|
||||
|
||||
Then, to test PyTorch running on the GPU, run this script that trains a small convolutional network on EMNIST dataset:
|
||||
|
||||
```
|
||||
python train_evaluate_emnist_classification_system.py --filepath_to_arguments_json_file experiment_configs/emnist_tutorial_config.json
|
||||
```
|
||||
|
||||
You should be able to see an experiment running, using the GPU. It should be doing about 260-300 it/s (iterations per second). You can stop it whenever you like using `ctrl-c`.
|
||||
|
||||
If all the above matches what’s stated then you should be ready to run your experiments.
|
||||
|
||||
To log out of your instance, simply type ```exit``` in the terminal.
|
||||
|
||||
### Remember to ```stop``` your instance when not using it. You pay for the time you use the machine, not for the computational cycles used.
|
||||
To stop the instance, go to `Compute Engine -> VM instances` on the Google Cloud Platform, select the instance and click ```Stop```.
|
||||
|
||||
#### Future ssh access:
|
||||
To access the instance in the future simply run the `gcloud` command you copied from the google compute engine instance page.
|
||||
|
||||
|
||||
## Copying data to and from an instance
|
||||
|
||||
Please look at the [transferring files to VMs from Linux, macOS and Windows](https://cloud.google.com/compute/docs/instances/transfer-files?hl=en) and [google docs page on copying data](https://cloud.google.com/filestore/docs/copying-data). Note also the link on the page for [setting up your SSH keys (Linux or MacOS)](https://cloud.google.com/compute/docs/instances/access-overview?hl=en).
|
||||
|
||||
To copy from local machine to a google instance, have a look at this [stackoverflow post](https://stackoverflow.com/questions/27857532/rsync-to-google-compute-engine-instance-from-jenkins).
|
||||
|
||||
## Running experiments over ssh:
|
||||
|
||||
If the ssh connection fails while an experiment is running, the experiment is normally killed.
|
||||
To avoid this, use the command ```screen```. It creates a session process that keeps running whether a user is signed in or not.
|
||||
|
||||
The basic usage is to run ```screen``` to create a new session. Then, to enter an existing session, use:
|
||||
```screen -ls```
|
||||
to get a list of all available sessions. Then, once you find the one you want, use:
|
||||
```screen -d -r screen_id```
|
||||
replacing `screen_id` with the id of the session you want to enter.
|
||||
|
||||
While in a session, you can use:
|
||||
- ```ctrl+a+esc``` To pause process and be able to scroll.
|
||||
- ```ctrl+a+d``` to detach from session while leaving it running (once you detach you can reattach using ```screen -r```).
|
||||
- ```ctrl+a+n``` to see the next session.
|
||||
- ```ctrl+a+c``` to create a new session.
|
||||
|
||||
You are also free to use other tools such as `nohup` or `tmux`; online tutorials cover these well.
|
||||
|
||||
## Troubleshooting:
|
||||
|
||||
| Error| Fix|
|
||||
| --- | --- |
|
||||
| ```ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].``` | Delete the ssh key files and try again: ```rm ~/.ssh/google_compute_engine*``` |
|
||||
|"Mapping" error after following step 3 (```tar zxvf google-cloud-sdk-365.0.0-linux-x86_64.tar.gz; bash google-cloud-sdk/install.sh```) | This is due to conflicts and several packages not being installed properly according to your Python version when creating your Conda environment. Run ```conda create --name mlp python=3.9``` to recreate the environment supported with Python 3.9. Then, activate the environment ```conda activate mlp``` and follow the instructions from step 3 again. |
|
||||
|"Mapping" error even after successfully completing steps 3 and 4 when using the ```gcloud``` command | Restart your computer and run the following command: ```export CLOUDSDK_PYTHON="/usr/bin/python3"``` |
|
||||
| ```gcloud command not found``` | Restart your computer and run the following command: ```export CLOUDSDK_PYTHON="/usr/bin/python3"``` |
|
||||
| ```module 'collections' has no attribute 'Mapping'``` when installing the Google Cloud SDK | Install Google Cloud SDK with brew: ```brew install --cask google-cloud-sdk```|
|
||||
| ```Access blocked: authorisation error``` in your browser after running ```gcloud auth login``` | Run ```gcloud components update``` and retry to login again. |
|
||||
| ```ModuleNotFoundError: No module named 'GPUtil'``` | Install the GPUtil package and you should be able to run the script afterwards: ```pip install GPUtil``` |
|
||||
| ```module mlp not found``` | Install the mlp package in your environment: ```python setup.py develop``` |
|
||||
| ```NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.``` | Remove the current driver by running: ```cd /``` and ```sudo apt purge nvidia-*``` Follow step 11 of the instructions or the following commands: (1) download the R470 driver ```wget https://us.download.nvidia.com/XFree86/Linux-x86_64/470.223.02/NVIDIA-Linux-x86_64-470.223.02.run```, (2) change the file permissions to make it executable with ```chmod +x NVIDIA-Linux-x86_64-470.223.02.run``` and (3) install the driver ```sudo ./NVIDIA-Linux-x86_64-470.223.02.run``` |
|
||||
| ```module 'torch' has no attribute 'cuda'``` | You most probably have a file named ```torch.py``` in your current directory. Rename it to something else and try again. You might need to run the setup again. Else ```import torch``` will be calling this file instead of the PyTorch library and thus causing a conflict. |
|
||||
| ```Finalizing NVIDIA driver installation. Error! Your kernel headers for kernel 5.10.0-26-cloud-amd64 cannot be found. Please install the linux-headers-5.10.0-26-cloud-amd64 package, or use the --kernelsourcedir option to tell DKMS where it's located. Driver updated for latest kernel.``` | Install the header package with ```sudo apt install linux-headers-5.10.0-26-cloud-amd64``` |
|
@ -1,176 +0,0 @@
|
||||
# MLP GPU Cluster Usage Tutorial
|
||||
|
||||
This guide introduces the basics of using the Charles GPU cluster. It is not intended to be
|
||||
an exhaustive guide that goes deep into micro-details of the Slurm ecosystem. For an exhaustive guide please visit
|
||||
[the Slurm Documentation page.](https://slurm.schedmd.com/)
|
||||
|
||||
|
||||
##### For info on clusters and some tips on good cluster etiquette please have a look at the complementary lecture slides https://docs.google.com/presentation/d/1SU4ExARZLbenZtxm3K8Unqch5282jAXTq0CQDtfvtI0/edit?usp=sharing
|
||||
|
||||
## Getting Started
|
||||
|
||||
### Accessing the Cluster:
|
||||
1. If you are not on a DICE machine, then ssh into your DICE home directory using ```ssh sxxxxxx@student.ssh.inf.ed.ac.uk```
|
||||
2. Then ssh into either mlp1 or mlp2 which are the headnodes of the GPU cluster - it does not matter which you use. To do that
|
||||
run ```ssh mlp1``` or ```ssh mlp2```.
|
||||
3. You are now logged into the MLP GPU cluster. If this is your first time logging in you'll need to build your environment. This is because your home directory on the GPU cluster is separate from your usual AFS home directory on DICE.
|
||||
- Note: Alternatively you can just ```ssh sxxxxxxx@mlp.inf.ed.ac.uk``` to get there in one step.
|
||||
|
||||
### Installing requirements:
|
||||
1. Start by downloading the miniconda3 installation file using
|
||||
```wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh```.
|
||||
2. Now run the installation using ```bash Miniconda3-latest-Linux-x86_64.sh```. At the first prompt reply yes.
|
||||
```
|
||||
Do you accept the license terms? [yes|no]
|
||||
[no] >>> yes
|
||||
```
|
||||
3. At the second prompt simply press enter.
|
||||
```
|
||||
Miniconda3 will now be installed into this location:
|
||||
/home/sxxxxxxx/miniconda3
|
||||
|
||||
- Press ENTER to confirm the location
|
||||
- Press CTRL-C to abort the installation
|
||||
- Or specify a different location below
|
||||
```
|
||||
4. At the last prompt to initialise conda reply 'yes':
|
||||
```
|
||||
Do you wish the installer to initialize Miniconda3
|
||||
by running conda init [yes|no]
|
||||
[no] >>> yes
|
||||
```
|
||||
5. Now you need to activate your environment by first running:
|
||||
```source .bashrc```.
|
||||
This reloads .bashrc which includes the new miniconda path.
|
||||
6. Run ```source activate``` to load miniconda root.
|
||||
7. Now run ```conda create -n mlp python=3``` this will create the mlp environment. At the prompt choose y.
|
||||
8. Now run ```source activate mlp```.
|
||||
9. Install git using ```conda install git```. Then configure git using:
|
||||
```git config --global user.name "[your name]"; git config --global user.email "[matric-number]@sms.ed.ac.uk"```
|
||||
10. Now clone the mlpractical repo using ```git clone https://github.com/VICO-UoE/mlpractical.git```.
|
||||
11. ```cd mlpractical```
|
||||
12. Checkout the compute engines tutorial branch using ```git checkout mlp2023-24/mlp_compute_engines```.
|
||||
13. Install the required packages using ```bash install.sh```.
|
||||
|
||||
> Note: Check that you can use the GPU version of PyTorch by running ```python -c "import torch; print(torch.cuda.is_available())"``` in a `bash` script (see the example below). If this returns `True`, then you are good to go. If it returns `False`, then you need to install the GPU version of PyTorch manually. To do this, run ```conda uninstall pytorch``` and then ```pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118``` or ```pip install torch torchvision```. This will install the latest version of PyTorch with CUDA support. This version is also compatible with older CUDA versions installed on the cluster.
|
||||
|
||||
14. This completes all of the required installations. Proceed to the next section outlining how to use the Slurm cluster
management software. Please remember to clean up your setup files using ```conda clean -t```.
|
||||
|
||||
### Using Slurm
|
||||
Slurm provides us with commands to submit, delete, view and explore current jobs, nodes and resources, among others.
To submit a job you use ```sbatch script.sh```, which automatically finds available nodes and passes on the job, along with the
resources and restrictions required. Here `script.sh` is the bash script containing the job that we want to run. Since we will be using the NVIDIA CUDA and CUDNN libraries,
we have provided a sample script which should be used for your job submissions. The script is explained in detail below:
|
||||
|
||||
```bash
|
||||
#!/bin/sh
|
||||
#SBATCH -N 1 # nodes requested
|
||||
#SBATCH -n 1 # tasks requested
|
||||
#SBATCH --partition=Teach-Standard
|
||||
#SBATCH --gres=gpu:1
|
||||
#SBATCH --mem=12000 # memory in Mb
|
||||
#SBATCH --time=0-08:00:00
|
||||
|
||||
export CUDA_HOME=/opt/cuda-9.0.176.1/
|
||||
|
||||
export CUDNN_HOME=/opt/cuDNN-7.0/
|
||||
|
||||
export STUDENT_ID=$(whoami)
|
||||
|
||||
export LD_LIBRARY_PATH=${CUDNN_HOME}/lib64:${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
export LIBRARY_PATH=${CUDNN_HOME}/lib64:$LIBRARY_PATH
|
||||
|
||||
export CPATH=${CUDNN_HOME}/include:$CPATH
|
||||
|
||||
export PATH=${CUDA_HOME}/bin:${PATH}
|
||||
|
||||
export PYTHON_PATH=$PATH
|
||||
|
||||
mkdir -p /disk/scratch/${STUDENT_ID}
|
||||
|
||||
|
||||
export TMPDIR=/disk/scratch/${STUDENT_ID}/
|
||||
export TMP=/disk/scratch/${STUDENT_ID}/
|
||||
|
||||
mkdir -p ${TMP}/datasets/
|
||||
export DATASET_DIR=${TMP}/datasets/
|
||||
# Activate the relevant virtual environment:
|
||||
|
||||
source /home/${STUDENT_ID}/miniconda3/bin/activate mlp
|
||||
cd ..
|
||||
python train_evaluate_emnist_classification_system.py --filepath_to_arguments_json_file experiment_configs/emnist_tutorial_config.json
|
||||
```
|
||||
|
||||
To actually run this use ```sbatch emnist_single_gpu_tutorial.sh```. When you do this, the job will be submitted and you will be given a job id.
|
||||
```bash
|
||||
[burly]sxxxxxxx: sbatch emnist_single_gpu_tutorial.sh
|
||||
Submitted batch job 147
|
||||
|
||||
```
|
||||
|
||||
To view a list of all running jobs use ```squeue``` for a minimal presentation and ```smap``` for a more involved presentation. Furthermore to view node information use ```sinfo```.
|
||||
```bash
|
||||
squeue
|
||||
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
|
||||
143 interacti bash iainr R 8:00 1 landonia05
|
||||
147 interacti gpu_clus sxxxxxxx R 1:05 1 landonia02
|
||||
|
||||
```
|
||||
In case you want to stop/delete a job, use ```scancel job_id``` where job_id is the id of the job.

Furthermore, in case you want to test some of your code interactively to prototype your solution before you submit it to
a node, you can use ```srun -p interactive --gres=gpu:2 --pty python my_code_exp.py```.
|
||||
|
||||
## Slurm Cheatsheet
|
||||
For a nice list of most commonly used Slurm commands please visit [here](https://bitsanddragons.wordpress.com/2017/04/12/slurm-user-cheatsheet/).
|
||||
|
||||
## Syncing or copying data over to DICE
|
||||
|
||||
At some point you will need to copy your data to DICE so you can analyse it, produce charts, write reports, store it for future use, etc.

1. If you are on a terminal within I.F/A.T, then skip to step 2; if you are not, you'll first have to open a VPN into the university network using the instructions found [here](http://computing.help.inf.ed.ac.uk/openvpn).
|
||||
2. From your local machine:
|
||||
1. To send data from a local machine to the cluster: ```rsync -ua --progress <local_path_of_data_to_transfer> <studentID>@mlp.inf.ed.ac.uk:/home/<studentID>/path/to/folder```
|
||||
2. To receive data from the cluster to your local machine ```rsync -ua --progress <studentID>@mlp.inf.ed.ac.uk:/home/<studentID>/path/to/folder <local_path_of_data_to_transfer> ```
|
||||
|
||||
## Running an experiment
|
||||
To run a default image classification experiment using the template models provided:
|
||||
1. Sign into the cluster using ```ssh sxxxxxxx@mlp1.inf.ed.ac.uk```
2. Activate your conda environment using ```source miniconda3/bin/activate; conda activate mlp```
3. ```cd mlpractical```
4. ```cd cluster_experiment_scripts```
5. Find which experiment(s) you want to run (make sure the experiment ends in 'gpu_cluster.sh'). Decide if you want to run a single experiment or multiple experiments in parallel.
    1. For a single experiment: ```sbatch experiment_script.sh```
    2. To run multiple experiments using the "hurdle-reducing" script, which automatically submits jobs and makes sure there are always jobs in the queue/running:
        1. Make sure the cluster_experiment_scripts folder contains ***only*** the jobs you want to run.
        2. Run the command:
```
python run_jobs.py --num_parallel_jobs <number of jobs to keep in the slurm queue at all times> --num_epochs <number of epochs to run each job>
```
|
||||
|
||||
## Additional Help
|
||||
|
||||
If you require additional help please post on Piazza or, if you are experiencing technical problems (actual system/hardware problems), then please submit a [computing support ticket](https://www.inf.ed.ac.uk/systems/support/form/).
|
||||
|
||||
## List of very useful slurm commands:
|
||||
- squeue: Shows all jobs from all users currently in the queue/running
|
||||
- squeue -u <user_id>: Shows all jobs from user <user_id> in the queue/running
|
||||
- sprio: Shows the priority score of all of your current jobs that are not yet running
|
||||
- scontrol show job <job_id>: Shows all information about job <job_id>
|
||||
- scancel <job_id>: Cancels job with id <job_id>
|
||||
- scancel -u <user_id>: Cancels all jobs, belonging to user <user_id>, that are currently in the queue/running
|
||||
- sinfo: Provides info about the cluster/partitions
|
||||
- sbatch <job_script>: Submit a job that will run the script <job_script> to the slurm scheduler.
|
||||
|
||||
## Overview of code:
|
||||
- [arg_extractor.py](arg_extractor.py): Contains an array of utility methods that can parse python arguments or convert
|
||||
a json config file into an argument NamedTuple.
|
||||
- [data_providers.py](data_providers.py): A sample data provider, of the same type used in the MLPractical course.
|
||||
- [experiment_builder.py](experiment_builder.py): Builds and executes a simple image classification experiment, keeping track
|
||||
of relevant statistics, taking care of storing and re-loading pytorch models, as well as choosing the best validation-performing model to evaluate the test set on.
|
||||
- [model_architectures.py](model_architectures.py): Provides a fully connected network and convolutional neural network
|
||||
sample models, which have a number of moving parts indicated as hyperparameters.
|
||||
- [storage_utils.py](storage_utils.py): Provides a number of storage/loading methods for the experiment statistics.
|
||||
- [train_evaluate_emnist_classification_system.py](train_evaluate_emnist_classification_system.py): Runs an experiment
|
||||
given a data provider, an experiment builder instance and a model architecture
|
@ -122,4 +122,4 @@ First course of action should be to search the web and then to refer to the PyTo
|
||||
[tutorials](https://pytorch.org/tutorials/) and [github](https://github.com/pytorch/pytorch) sites.
|
||||
|
||||
If you still can't get an answer to your question then as always, post on Piazza and/or come to the lab sessions.
|
||||
|
||||
|
||||
|
@ -114,24 +114,24 @@ Here we provide a detailed guide for setting-up PuTTY with tunnel forwarding so
|
||||
|
||||
1. To start off, run the PuTTY executable file you downloaded, navigate to **Session** on the left column and enter the **hostname** as `student.ssh.inf.ed.ac.uk`. Put any name in the **Saved Sessions** box so that you can retrieve your saved PuTTY session for future use.
|
||||
|
||||
Change the remaining options as shown in the screenshot below.

<center><img src="./figures/putty1.png" width="400" height="300"></center>
|
||||
|
||||
2. Now navigate to **Connection** and drop-down on **Data**. In **Auto-Login username** , enter your student id `sXXXXXXX`.
|
||||
|
||||
<center><img src="./figures/putty2.png" width="400" height="300"></center>
|
||||
<center><img src="./figures/putty2.png" width="400" height="300"></center>
|
||||
|
||||
3. After step 1 and 2, follow the instructions [here](http://computing.help.inf.ed.ac.uk/installing-putty) from screenshots 3-5 to set-up **Auth** and **X11 Forwarding**. To avoid errors later, strictly follow the instructions for this set-up.
|
||||
|
||||
4. In this step, we will configure SSH tunneling to locally run the notebooks. On the left side of the PuTTY window, navigate to **Tunnels** under SSH and then add a `[local-port]` in **Source port** and `localhost:[local-port]` in **Destination**. Remember the `[local-port]` you used here as we will need this later.
|
||||
|
||||
|
||||
<center><img src="./figures/putty3.png" width="400" height="300"></center>
|
||||
|
||||
Then press **Add** near the Source port box to add your new forwarded port. Once you add, you will see your newly added port as shown below -
|
||||
<center><img src="./figures/putty3.png" width="400" height="300"></center>
|
||||
|
||||
<center><img src="./figures/putty4.png" width="400" height="300"></center>
|
||||
|
||||
|
||||
5. After you have done steps 1-4, navigate back to **Session** on the left side and click **Save** to save all your current configurations.
|
||||
|
||||
@ -139,36 +139,36 @@ Here we provide a detailed guide for setting-up PuTTY with tunnel forwarding so
|
||||
|
||||
6. Then click **Open** and a terminal window will pop-up asking for your DICE password. After you enter the password, you will be logged in to SSH Gateway Server. As the message printed when you log in points out this is intended only for accessing the Informatics network externally and you should not attempt to work on this server. You should log in to one of the student.compute shared-use servers by running -
|
||||
|
||||
```
|
||||
ssh student.compute
|
||||
```
|
||||
You should now be logged on to one of the shared-use compute servers. The name of the server you are logged on to will appear at the bash prompt e.g.
|
||||
|
||||
|
||||
```
|
||||
ashbury:~$
|
||||
```
|
||||
You will need to know the name of this remote server you are using later on.
|
||||
|
||||
|
||||
|
||||
7. You can setup your `mlp` environment by following the instructions [here](environment-set-up.md). If you have correctly set-up the environment, activate your `conda` environment and navigate to the jupyter notebooks as detailed [here](remote-working-guide.md#starting-a-notebook-server-on-the-remote-computer). You should also secure your notebook server by following the instructions [here](remote-working-guide.md#running-jupyter-notebooks-over-ssh).
|
||||
|
||||
Once the notebook server starts running you should take note of the port it is being served on as indicated in the `The Jupyter Notebook is running at: https://localhost:[port]/` message.
|
||||
|
||||
|
||||
8. Now that the notebook server is running on the remote server you need to connect to it on your local machine. We will do this by forwarding the port the notebook server is being run on over SSH to your local machine.
|
||||
|
||||
For doing this, open another session of PuTTY and load the session that you saved in the **Session** on the left side. Enter the password in the prompt and this will login to the SSH gateway server. **Do not** run `ssh student.compute` now.
|
||||
|
||||
In this terminal window, enter the command below -
|
||||
|
||||
```
|
||||
ssh -N -f -L localhost:[local-port]:localhost:[port] [dice-username]@[remote-server-name]
|
||||
```
|
||||
The `[local-port]` is the source port you entered in Step 4, `[port]` is the remote port running on the remote server as in Step 7 and `[remote-server-name]` is the name of the remote server you got connected to in Step 6.
|
||||
|
||||
If asked for a password at this stage, enter your DICE password again to login.
|
||||
|
||||
|
||||
9. Assuming you have set everything up correctly, the remote port will now be forwarded to the specified local port on your computer. If you now open up a browser on your computer and go to `https://localhost:[local-port]` you should (potentially after seeing a security warning about the self-signed certificate) now be asked to enter the notebook server password you specified earlier. Once you enter this password you should be able to access the notebook dashboard and open and edit notebooks as you usually do in laboratories.
|
||||
|
||||
When you are finished working you should both close down the notebook server, by entering `Ctrl+C` twice in the terminal window in which the SSH session you used to start the notebook server is running, and halt the port forwarding command, by entering `Ctrl+C` in the terminal it is running in.
|
||||
|
||||
|
||||
|
133
pytorch_mlp_framework/arg_extractor.py
Normal file
@ -0,0 +1,133 @@
|
||||
import argparse
|
||||
|
||||
|
||||
def str2bool(v):
|
||||
if v.lower() in ("yes", "true", "t", "y", "1"):
|
||||
return True
|
||||
elif v.lower() in ("no", "false", "f", "n", "0"):
|
||||
return False
|
||||
else:
|
||||
raise argparse.ArgumentTypeError("Boolean value expected.")
|
||||
|
||||
|
||||
def get_args():
|
||||
"""
|
||||
Returns a namedtuple with arguments extracted from the command line.
|
||||
:return: A namedtuple with arguments
|
||||
"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Welcome to the MLP course's Pytorch training and inference helper script"
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--batch_size",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=100,
|
||||
help="Batch_size for experiment",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--continue_from_epoch",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=-1,
|
||||
help="Epoch you want to continue training from while restarting an experiment",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--seed",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=7112018,
|
||||
help="Seed to use for random number generator for experiment",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--image_num_channels",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=3,
|
||||
help="The channel dimensionality of our image-data",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--learning-rate",
|
||||
nargs="?",
|
||||
type=float,
|
||||
default=1e-3,
|
||||
help="The learning rate (default 1e-3)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--image_height", nargs="?", type=int, default=32, help="Height of image data"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--image_width", nargs="?", type=int, default=32, help="Width of image data"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_stages",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=3,
|
||||
help="Number of convolutional stages in the network. A stage is considered a sequence of "
|
||||
"convolutional layers where the input volume remains the same in the spacial dimension and"
|
||||
" is always terminated by a dimensionality reduction stage",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_blocks_per_stage",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=5,
|
||||
help="Number of convolutional blocks in each stage, not including the reduction stage."
|
||||
" A convolutional block is made up of two convolutional layers activated using the "
|
||||
" leaky-relu non-linearity",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_filters",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=16,
|
||||
help="Number of convolutional filters per convolutional layer in the network (excluding "
|
||||
"dimensionality reduction layers)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_epochs",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=100,
|
||||
help="Total number of epochs for model training",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_classes",
|
||||
nargs="?",
|
||||
type=int,
|
||||
default=100,
|
||||
help="Number of classes in the dataset",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--experiment_name",
|
||||
nargs="?",
|
||||
type=str,
|
||||
default="exp_1",
|
||||
help="Experiment name - to be used for building the experiment folder",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--use_gpu",
|
||||
nargs="?",
|
||||
type=str2bool,
|
||||
default=True,
|
||||
help="A flag indicating whether we will use GPU acceleration or not",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--weight_decay_coefficient",
|
||||
nargs="?",
|
||||
type=float,
|
||||
default=0,
|
||||
help="Weight decay to use for Adam",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--block_type",
|
||||
type=str,
|
||||
default="conv_block",
|
||||
help="Type of convolutional blocks to use in our network"
|
||||
"(This argument will be useful in running experiments to debug your network)",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
print(args)
|
||||
return args
|
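For quick reference, below is a minimal sketch of how `get_args` might be consumed by a training script; the snippet itself is an illustrative assumption and not one of the files in this change.

```python
# Hypothetical usage sketch for pytorch_mlp_framework/arg_extractor.py.
# Assumes the repository root is importable (e.g. after `python setup.py develop`).
from pytorch_mlp_framework.arg_extractor import get_args

if __name__ == "__main__":
    args = get_args()  # parses flags such as --batch_size 100 --use_gpu True, falling back to the defaults above
    print(args.experiment_name, args.batch_size, args.num_epochs)
    print(args.learning_rate)  # note: argparse maps the --learning-rate flag to args.learning_rate
```

On the command line, any of the defaults can be overridden with the corresponding flag, e.g. `--num_epochs 50 --block_type conv_block`.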
462
pytorch_mlp_framework/experiment_builder.py
Normal file
@ -0,0 +1,462 @@
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.optim as optim
|
||||
import torch.nn.functional as F
|
||||
import tqdm
|
||||
import os
|
||||
import numpy as np
|
||||
import time
|
||||
|
||||
from pytorch_mlp_framework.storage_utils import save_statistics
|
||||
from matplotlib import pyplot as plt
|
||||
import matplotlib
|
||||
|
||||
matplotlib.rcParams.update({"font.size": 8})
|
||||
|
||||
|
||||
class ExperimentBuilder(nn.Module):
|
||||
def __init__(
|
||||
self,
|
||||
network_model,
|
||||
experiment_name,
|
||||
num_epochs,
|
||||
train_data,
|
||||
val_data,
|
||||
test_data,
|
||||
weight_decay_coefficient,
|
||||
learning_rate,
|
||||
use_gpu,
|
||||
continue_from_epoch=-1,
|
||||
):
|
||||
"""
|
||||
Initializes an ExperimentBuilder object. Such an object takes care of running training and evaluation of a deep net
|
||||
on a given dataset. It also takes care of saving per epoch models and automatically inferring the best val model
|
||||
to be used for evaluating the test set metrics.
|
||||
:param network_model: A pytorch nn.Module which implements a network architecture.
|
||||
:param experiment_name: The name of the experiment. This is used mainly for keeping track of the experiment and creating a directory structure that will be used to save logs, model parameters and other outputs.
|
||||
:param num_epochs: Total number of epochs to run the experiment
|
||||
:param train_data: An object of the DataProvider type. Contains the training set.
|
||||
:param val_data: An object of the DataProvider type. Contains the val set.
|
||||
:param test_data: An object of the DataProvider type. Contains the test set.
|
||||
:param weight_decay_coefficient: A float indicating the weight decay to use with the adam optimizer.
|
||||
:param use_gpu: A boolean indicating whether to use a GPU or not.
|
||||
:param continue_from_epoch: An int indicating whether we'll start from scratch (-1) or whether we'll reload a previously saved model of epoch 'continue_from_epoch' and continue training from there.
|
||||
"""
|
||||
super(ExperimentBuilder, self).__init__()
|
||||
|
||||
self.experiment_name = experiment_name
|
||||
self.model = network_model
|
||||
|
||||
if torch.cuda.device_count() >= 1 and use_gpu:
|
||||
self.device = torch.device("cuda")
|
||||
self.model.to(self.device) # sends the model from the cpu to the gpu
|
||||
print("Use GPU", self.device)
|
||||
else:
|
||||
print("use CPU")
|
||||
self.device = torch.device("cpu") # sets the device to be CPU
|
||||
print(self.device)
|
||||
|
||||
print("here")
|
||||
|
||||
self.model.reset_parameters() # re-initialize network parameters
|
||||
self.train_data = train_data
|
||||
self.val_data = val_data
|
||||
self.test_data = test_data
|
||||
|
||||
print("System learnable parameters")
|
||||
num_conv_layers = 0
|
||||
num_linear_layers = 0
|
||||
total_num_parameters = 0
|
||||
for name, value in self.named_parameters():
|
||||
print(name, value.shape)
|
||||
if all(item in name for item in ["conv", "weight"]):
|
||||
num_conv_layers += 1
|
||||
if all(item in name for item in ["linear", "weight"]):
|
||||
num_linear_layers += 1
|
||||
total_num_parameters += np.prod(value.shape)
|
||||
|
||||
print("Total number of parameters", total_num_parameters)
|
||||
print("Total number of conv layers", num_conv_layers)
|
||||
print("Total number of linear layers", num_linear_layers)
|
||||
|
||||
print(f"Learning rate: {learning_rate}")
|
||||
self.optimizer = optim.Adam(
|
||||
self.parameters(),
|
||||
amsgrad=False,
|
||||
weight_decay=weight_decay_coefficient,
|
||||
lr=learning_rate,
|
||||
)
|
||||
self.learning_rate_scheduler = optim.lr_scheduler.CosineAnnealingLR(
|
||||
self.optimizer, T_max=num_epochs, eta_min=0.00002
|
||||
)
|
||||
# Generate the directory names
|
||||
self.experiment_folder = os.path.abspath(experiment_name)
|
||||
self.experiment_logs = os.path.abspath(
|
||||
os.path.join(self.experiment_folder, "result_outputs")
|
||||
)
|
||||
self.experiment_saved_models = os.path.abspath(
|
||||
os.path.join(self.experiment_folder, "saved_models")
|
||||
)
|
||||
|
||||
# Set best models to be at 0 since we are just starting
|
||||
self.best_val_model_idx = 0
|
||||
self.best_val_model_acc = 0.0
|
||||
|
||||
if not os.path.exists(
|
||||
self.experiment_folder
|
||||
): # If experiment directory does not exist
|
||||
os.mkdir(self.experiment_folder) # create the experiment directory
|
||||
os.mkdir(self.experiment_logs) # create the experiment log directory
|
||||
os.mkdir(
|
||||
self.experiment_saved_models
|
||||
) # create the experiment saved models directory
|
||||
|
||||
self.num_epochs = num_epochs
|
||||
self.criterion = nn.CrossEntropyLoss().to(
|
||||
self.device
|
||||
) # send the loss computation to the GPU
|
||||
|
||||
if (
|
||||
continue_from_epoch == -2
|
||||
): # if continue from epoch is -2 then continue from latest saved model
|
||||
self.state, self.best_val_model_idx, self.best_val_model_acc = (
|
||||
self.load_model(
|
||||
model_save_dir=self.experiment_saved_models,
|
||||
model_save_name="train_model",
|
||||
model_idx="latest",
|
||||
)
|
||||
) # reload existing model from epoch and return best val model index
|
||||
# and the best val acc of that model
|
||||
self.starting_epoch = int(self.state["model_epoch"])
|
||||
|
||||
elif continue_from_epoch > -1: # if continue from epoch is greater than -1 then
|
||||
self.state, self.best_val_model_idx, self.best_val_model_acc = (
|
||||
self.load_model(
|
||||
model_save_dir=self.experiment_saved_models,
|
||||
model_save_name="train_model",
|
||||
model_idx=continue_from_epoch,
|
||||
)
|
||||
) # reload existing model from epoch and return best val model index
|
||||
# and the best val acc of that model
|
||||
self.starting_epoch = continue_from_epoch
|
||||
else:
|
||||
self.state = dict()
|
||||
self.starting_epoch = 0
|
||||
|
||||
def get_num_parameters(self):
|
||||
total_num_params = 0
|
||||
for param in self.parameters():
|
||||
total_num_params += np.prod(param.shape)
|
||||
|
||||
return total_num_params
|
||||
|
||||
def plot_func_def(self, all_grads, layers):
|
||||
"""
|
||||
Plot function definition to plot the average gradient with respect to the number of layers in the given model
|
||||
:param all_grads: Gradients wrt weights for each layer in the model.
|
||||
:param layers: Layer names corresponding to the model parameters
|
||||
:return: plot for gradient flow
|
||||
"""
|
||||
plt.plot(all_grads, alpha=0.3, color="b")
|
||||
plt.hlines(0, 0, len(all_grads) + 1, linewidth=1, color="k")
|
||||
plt.xticks(range(0, len(all_grads), 1), layers, rotation="vertical")
|
||||
plt.xlim(xmin=0, xmax=len(all_grads))
|
||||
plt.xlabel("Layers")
|
||||
plt.ylabel("Average Gradient")
|
||||
plt.title("Gradient flow")
|
||||
plt.grid(True)
|
||||
plt.tight_layout()
|
||||
|
||||
return plt
|
||||
|
||||
def plot_grad_flow(self, named_parameters):
|
||||
"""
|
||||
The function is being called in Line 298 of this file.
|
||||
Receives the parameters of the model being trained. Returns plot of gradient flow for the given model parameters.
|
||||
|
||||
"""
|
||||
all_grads = []
|
||||
layers = []
|
||||
|
||||
"""
|
||||
Complete the code in the block below to collect absolute mean of the gradients for each layer in all_grads with the layer names in layers.
|
||||
"""
|
||||
|
||||
for name, param in named_parameters:
|
||||
if "bias" in name:
|
||||
continue
|
||||
# Check if the parameter requires gradient and has a gradient
|
||||
if param.requires_grad and param.grad is not None:
|
||||
try:
|
||||
_, a, _, b, _ = name.split(".", 4)
|
||||
except:
|
||||
b, a = name.split(".", 1)
|
||||
|
||||
layers.append(f"{a}_{b}")
|
||||
# Collect the mean of the absolute gradients
|
||||
all_grads.append(param.grad.abs().mean().item())
|
||||
|
||||
plt = self.plot_func_def(all_grads, layers)
|
||||
|
||||
return plt
|
||||
|
||||
def run_train_iter(self, x, y):
|
||||
|
||||
self.train() # sets model to training mode (in case batch normalization or other methods have different procedures for training and evaluation)
|
||||
x, y = x.float().to(device=self.device), y.long().to(
|
||||
device=self.device
|
||||
) # send data to device as torch tensors
|
||||
out = self.model.forward(x) # forward the data in the model
|
||||
|
||||
loss = F.cross_entropy(input=out, target=y) # compute loss
|
||||
|
||||
self.optimizer.zero_grad() # set all weight grads from previous training iters to 0
|
||||
loss.backward() # backpropagate to compute gradients for current iter loss
|
||||
|
||||
self.optimizer.step() # update network parameters
|
||||
self.learning_rate_scheduler.step() # update learning rate scheduler
|
||||
|
||||
_, predicted = torch.max(out.data, 1) # get argmax of predictions
|
||||
accuracy = np.mean(list(predicted.eq(y.data).cpu())) # compute accuracy
|
||||
return loss.cpu().data.numpy(), accuracy
|
||||
|
||||
def run_evaluation_iter(self, x, y):
|
||||
"""
|
||||
Receives the inputs and targets for the model and runs an evaluation iteration. Returns loss and accuracy metrics.
|
||||
:param x: The inputs to the model. A numpy array of shape batch_size, channels, height, width
|
||||
:param y: The targets for the model. A numpy array of shape batch_size, num_classes
|
||||
:return: the loss and accuracy for this batch
|
||||
"""
|
||||
self.eval() # sets the system to validation mode
|
||||
x, y = x.float().to(device=self.device), y.long().to(
|
||||
device=self.device
|
||||
) # convert data to pytorch tensors and send to the computation device
|
||||
out = self.model.forward(x) # forward the data in the model
|
||||
|
||||
loss = F.cross_entropy(input=out, target=y) # compute loss
|
||||
|
||||
_, predicted = torch.max(out.data, 1) # get argmax of predictions
|
||||
accuracy = np.mean(list(predicted.eq(y.data).cpu())) # compute accuracy
|
||||
return loss.cpu().data.numpy(), accuracy
|
||||
|
||||
def save_model(
|
||||
self,
|
||||
model_save_dir,
|
||||
model_save_name,
|
||||
model_idx,
|
||||
best_validation_model_idx,
|
||||
best_validation_model_acc,
|
||||
):
|
||||
"""
|
||||
Save the network parameter state and current best val epoch idx and best val accuracy.
|
||||
:param model_save_name: Name to use to save model without the epoch index
|
||||
:param model_idx: The index to save the model with.
|
||||
:param best_validation_model_idx: The index of the best validation model to be stored for future use.
|
||||
:param best_validation_model_acc: The best validation accuracy to be stored for use at test time.
|
||||
:param model_save_dir: The directory to store the state at.
|
||||
:param state: The dictionary containing the system state.
|
||||
|
||||
"""
|
||||
self.state["network"] = (
|
||||
self.state_dict()
|
||||
) # save network parameter and other variables.
|
||||
self.state["best_val_model_idx"] = (
|
||||
best_validation_model_idx # save current best val idx
|
||||
)
|
||||
self.state["best_val_model_acc"] = (
|
||||
best_validation_model_acc # save current best val acc
|
||||
)
|
||||
torch.save(
|
||||
self.state,
|
||||
f=os.path.join(
|
||||
model_save_dir, "{}_{}".format(model_save_name, str(model_idx))
|
||||
),
|
||||
) # save state at prespecified filepath
|
||||
|
||||
def load_model(self, model_save_dir, model_save_name, model_idx):
|
||||
"""
|
||||
Load the network parameter state and the best val model idx and best val acc to be compared with the future val accuracies, in order to choose the best val model
|
||||
:param model_save_dir: The directory to store the state at.
|
||||
:param model_save_name: Name to use to save model without the epoch index
|
||||
:param model_idx: The index to save the model with.
|
||||
:return: best val idx and best val model acc, also it loads the network state into the system state without returning it
|
||||
"""
|
||||
state = torch.load(
|
||||
f=os.path.join(
|
||||
model_save_dir, "{}_{}".format(model_save_name, str(model_idx))
|
||||
)
|
||||
)
|
||||
self.load_state_dict(state_dict=state["network"])
|
||||
return state, state["best_val_model_idx"], state["best_val_model_acc"]
|
||||
|
||||
def run_experiment(self):
|
||||
"""
|
||||
Runs experiment train and evaluation iterations, saving the model and best val model and val model accuracy after each epoch
|
||||
:return: The summary current_epoch_losses from starting epoch to total_epochs.
|
||||
"""
|
||||
total_losses = {
|
||||
"train_acc": [],
|
||||
"train_loss": [],
|
||||
"val_acc": [],
|
||||
"val_loss": [],
|
||||
} # initialize a dict to keep the per-epoch metrics
|
||||
for i, epoch_idx in enumerate(range(self.starting_epoch, self.num_epochs)):
|
||||
epoch_start_time = time.time()
|
||||
current_epoch_losses = {
|
||||
"train_acc": [],
|
||||
"train_loss": [],
|
||||
"val_acc": [],
|
||||
"val_loss": [],
|
||||
}
|
||||
self.current_epoch = epoch_idx
|
||||
with tqdm.tqdm(
|
||||
total=len(self.train_data)
|
||||
) as pbar_train: # create a progress bar for training
|
||||
for idx, (x, y) in enumerate(self.train_data): # get data batches
|
||||
loss, accuracy = self.run_train_iter(
|
||||
x=x, y=y
|
||||
) # take a training iter step
|
||||
current_epoch_losses["train_loss"].append(
|
||||
loss
|
||||
) # add current iter loss to the train loss list
|
||||
current_epoch_losses["train_acc"].append(
|
||||
accuracy
|
||||
) # add current iter acc to the train acc list
|
||||
pbar_train.update(1)
|
||||
pbar_train.set_description(
|
||||
"loss: {:.4f}, accuracy: {:.4f}".format(loss, accuracy)
|
||||
)
|
||||
|
||||
with tqdm.tqdm(
|
||||
total=len(self.val_data)
|
||||
) as pbar_val: # create a progress bar for validation
|
||||
for x, y in self.val_data: # get data batches
|
||||
loss, accuracy = self.run_evaluation_iter(
|
||||
x=x, y=y
|
||||
) # run a validation iter
|
||||
current_epoch_losses["val_loss"].append(
|
||||
loss
|
||||
) # add current iter loss to val loss list.
|
||||
current_epoch_losses["val_acc"].append(
|
||||
accuracy
|
||||
) # add current iter acc to val acc lst.
|
||||
pbar_val.update(1) # add 1 step to the progress bar
|
||||
pbar_val.set_description(
|
||||
"loss: {:.4f}, accuracy: {:.4f}".format(loss, accuracy)
|
||||
)
|
||||
val_mean_accuracy = np.mean(current_epoch_losses["val_acc"])
|
||||
if (
|
||||
val_mean_accuracy > self.best_val_model_acc
|
||||
): # if current epoch's mean val acc is greater than the saved best val acc then
|
||||
self.best_val_model_acc = val_mean_accuracy # set the best val model acc to be current epoch's val accuracy
|
||||
self.best_val_model_idx = epoch_idx # set the experiment-wise best val idx to be the current epoch's idx
|
||||
|
||||
for key, value in current_epoch_losses.items():
|
||||
total_losses[key].append(
|
||||
np.mean(value)
|
||||
) # get mean of all metrics of current epoch metrics dict, to get them ready for storage and output on the terminal.
|
||||
|
||||
save_statistics(
|
||||
experiment_log_dir=self.experiment_logs,
|
||||
filename="summary.csv",
|
||||
stats_dict=total_losses,
|
||||
current_epoch=i,
|
||||
continue_from_mode=(
|
||||
True if (self.starting_epoch != 0 or i > 0) else False
|
||||
),
|
||||
) # save statistics to stats file.
|
||||
|
||||
# load_statistics(experiment_log_dir=self.experiment_logs, filename='summary.csv') # How to load a csv file if you need to
|
||||
|
||||
out_string = "_".join(
|
||||
[
|
||||
"{}_{:.4f}".format(key, np.mean(value))
|
||||
for key, value in current_epoch_losses.items()
|
||||
]
|
||||
)
|
||||
# create a string to use to report our epoch metrics
|
||||
epoch_elapsed_time = (
|
||||
time.time() - epoch_start_time
|
||||
) # calculate time taken for epoch
|
||||
epoch_elapsed_time = "{:.4f}".format(epoch_elapsed_time)
|
||||
print(
|
||||
"Epoch {}:".format(epoch_idx),
|
||||
out_string,
|
||||
"epoch time",
|
||||
epoch_elapsed_time,
|
||||
"seconds",
|
||||
)
|
||||
self.state["model_epoch"] = epoch_idx
|
||||
self.save_model(
|
||||
model_save_dir=self.experiment_saved_models,
|
||||
# save model and best val idx and best val acc, using the model dir, model name and model idx
|
||||
model_save_name="train_model",
|
||||
model_idx=epoch_idx,
|
||||
best_validation_model_idx=self.best_val_model_idx,
|
||||
best_validation_model_acc=self.best_val_model_acc,
|
||||
)
|
||||
self.save_model(
|
||||
model_save_dir=self.experiment_saved_models,
|
||||
# save model and best val idx and best val acc, using the model dir, model name and model idx
|
||||
model_save_name="train_model",
|
||||
model_idx="latest",
|
||||
best_validation_model_idx=self.best_val_model_idx,
|
||||
best_validation_model_acc=self.best_val_model_acc,
|
||||
)
|
||||
|
||||
################################################################
|
||||
##### Plot Gradient Flow at each Epoch during Training ######
|
||||
print("Generating Gradient Flow Plot at epoch {}".format(epoch_idx))
|
||||
plt = self.plot_grad_flow(self.model.named_parameters())
|
||||
if not os.path.exists(
|
||||
os.path.join(self.experiment_saved_models, "gradient_flow_plots")
|
||||
):
|
||||
os.mkdir(
|
||||
os.path.join(self.experiment_saved_models, "gradient_flow_plots")
|
||||
)
|
||||
# plt.legend(loc="best")
|
||||
plt.savefig(
|
||||
os.path.join(
|
||||
self.experiment_saved_models,
|
||||
"gradient_flow_plots",
|
||||
"epoch{}.pdf".format(str(epoch_idx)),
|
||||
)
|
||||
)
|
||||
################################################################
|
||||
|
||||
print("Generating test set evaluation metrics")
|
||||
self.load_model(
|
||||
model_save_dir=self.experiment_saved_models,
|
||||
model_idx=self.best_val_model_idx,
|
||||
# load best validation model
|
||||
model_save_name="train_model",
|
||||
)
|
||||
current_epoch_losses = {
|
||||
"test_acc": [],
|
||||
"test_loss": [],
|
||||
} # initialize a statistics dict
|
||||
with tqdm.tqdm(total=len(self.test_data)) as pbar_test:  # init a progress bar
|
||||
for x, y in self.test_data: # sample batch
|
||||
loss, accuracy = self.run_evaluation_iter(
|
||||
x=x, y=y
|
||||
) # compute loss and accuracy by running an evaluation step
|
||||
current_epoch_losses["test_loss"].append(loss) # save test loss
|
||||
current_epoch_losses["test_acc"].append(accuracy) # save test accuracy
|
||||
pbar_test.update(1) # update progress bar status
|
||||
pbar_test.set_description(
|
||||
"loss: {:.4f}, accuracy: {:.4f}".format(loss, accuracy)
|
||||
) # update progress bar string output
|
||||
|
||||
test_losses = {
|
||||
key: [np.mean(value)] for key, value in current_epoch_losses.items()
|
||||
} # save test set metrics in dict format
|
||||
save_statistics(
|
||||
experiment_log_dir=self.experiment_logs,
|
||||
filename="test_summary.csv",
|
||||
# save test set metrics on disk in .csv format
|
||||
stats_dict=test_losses,
|
||||
current_epoch=0,
|
||||
continue_from_mode=False,
|
||||
)
|
||||
|
||||
return total_losses, test_losses
|
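As a rough smoke test, the sketch below shows one way `ExperimentBuilder` could be wired up with the `ConvolutionalNetwork` defined in the next file; the random `TensorDataset` and the small hyperparameter values are illustrative assumptions, not values used by the course scripts.

```python
# Minimal smoke-test sketch for ExperimentBuilder (assumption: run from the repo root).
import torch
from torch.utils.data import DataLoader, TensorDataset

from pytorch_mlp_framework.experiment_builder import ExperimentBuilder
from pytorch_mlp_framework.model_architectures import ConvolutionalNetwork

# A tiny random dataset standing in for the real data providers: 200 RGB 32x32 images, 100 classes.
dummy = TensorDataset(torch.randn(200, 3, 32, 32), torch.randint(0, 100, (200,)))
loader = DataLoader(dummy, batch_size=100, shuffle=True)

model = ConvolutionalNetwork(
    input_shape=(100, 3, 32, 32),  # (batch, channels, height, width) used to infer layer shapes
    num_output_classes=100,
    num_filters=16,
    num_blocks_per_stage=1,
    num_stages=3,
)

experiment = ExperimentBuilder(
    network_model=model,
    experiment_name="smoke_test",   # directory created in the current working directory
    num_epochs=1,
    train_data=loader,
    val_data=loader,
    test_data=loader,
    weight_decay_coefficient=0.0,
    learning_rate=1e-3,
    use_gpu=False,                  # CPU-only so the sketch runs anywhere
    continue_from_epoch=-1,         # start from scratch
)
total_losses, test_losses = experiment.run_experiment()
```

In the actual coursework the loaders would come from the CIFAR-100 data providers and the constructor arguments from `get_args()`, but the call pattern is the same.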
640
pytorch_mlp_framework/model_architectures.py
Normal file
@ -0,0 +1,640 @@
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
|
||||
|
||||
class FCCNetwork(nn.Module):
|
||||
def __init__(
|
||||
self, input_shape, num_output_classes, num_filters, num_layers, use_bias=False
|
||||
):
|
||||
"""
|
||||
Initializes a fully connected network similar to the ones implemented previously in the MLP package.
|
||||
:param input_shape: The shape of the inputs going in to the network.
|
||||
:param num_output_classes: The number of outputs the network should have (for classification those would be the number of classes)
|
||||
:param num_filters: Number of filters used in every fcc layer.
|
||||
:param num_layers: Number of fcc layers (excluding dim reduction stages)
|
||||
:param use_bias: Whether our fcc layers will use a bias.
|
||||
"""
|
||||
super(FCCNetwork, self).__init__()
|
||||
# set up class attributes useful in building the network and inference
|
||||
self.input_shape = input_shape
|
||||
self.num_filters = num_filters
|
||||
self.num_output_classes = num_output_classes
|
||||
self.use_bias = use_bias
|
||||
self.num_layers = num_layers
|
||||
# initialize a module dict, which is effectively a dictionary that can collect layers and integrate them into pytorch
|
||||
self.layer_dict = nn.ModuleDict()
|
||||
# build the network
|
||||
self.build_module()
|
||||
|
||||
def build_module(self):
|
||||
print("Building basic block of FCCNetwork using input shape", self.input_shape)
|
||||
x = torch.zeros((self.input_shape))
|
||||
|
||||
out = x
|
||||
out = out.view(out.shape[0], -1)
|
||||
# flatten inputs to shape (b, -1) where -1 is the dim resulting from multiplying the
|
||||
# shapes of all dimensions after the 0th dim
|
||||
|
||||
for i in range(self.num_layers):
|
||||
self.layer_dict["fcc_{}".format(i)] = nn.Linear(
|
||||
in_features=out.shape[1], # initialize a fcc layer
|
||||
out_features=self.num_filters,
|
||||
bias=self.use_bias,
|
||||
)
|
||||
|
||||
out = self.layer_dict["fcc_{}".format(i)](
|
||||
out
|
||||
) # apply ith fcc layer to the previous layers outputs
|
||||
out = F.relu(out) # apply a ReLU on the outputs
|
||||
|
||||
self.logits_linear_layer = nn.Linear(
|
||||
in_features=out.shape[1], # initialize the prediction output linear layer
|
||||
out_features=self.num_output_classes,
|
||||
bias=self.use_bias,
|
||||
)
|
||||
out = self.logits_linear_layer(
|
||||
out
|
||||
) # apply the layer to the previous layer's outputs
|
||||
print("Block is built, output volume is", out.shape)
|
||||
return out
|
||||
|
||||
def forward(self, x):
|
||||
"""
|
||||
Forward prop data through the network and return the preds
|
||||
:param x: Input batch x a batch of shape batch number of samples, each of any dimensionality.
|
||||
:return: preds of shape (b, num_classes)
|
||||
"""
|
||||
out = x
|
||||
out = out.view(out.shape[0], -1)
|
||||
# flatten inputs to shape (b, -1) where -1 is the dim resulting from multiplying the
|
||||
# shapes of all dimensions after the 0th dim
|
||||
|
||||
for i in range(self.num_layers):
|
||||
out = self.layer_dict["fcc_{}".format(i)](
|
||||
out
|
||||
) # apply ith fcc layer to the previous layers outputs
|
||||
out = F.relu(out) # apply a ReLU on the outputs
|
||||
|
||||
out = self.logits_linear_layer(
|
||||
out
|
||||
) # apply the layer to the previous layer's outputs
|
||||
return out
|
||||
|
||||
def reset_parameters(self):
|
||||
"""
|
||||
Re-initializes the networks parameters
|
||||
"""
|
||||
for item in self.layer_dict.children():
|
||||
item.reset_parameters()
|
||||
|
||||
self.logits_linear_layer.reset_parameters()
|
||||
|
||||
|
||||
class EmptyBlock(nn.Module):
|
||||
def __init__(
|
||||
self,
|
||||
input_shape=None,
|
||||
num_filters=None,
|
||||
kernel_size=None,
|
||||
padding=None,
|
||||
bias=None,
|
||||
dilation=None,
|
||||
reduction_factor=None,
|
||||
):
|
||||
super(EmptyBlock, self).__init__()
|
||||
|
||||
self.num_filters = num_filters
|
||||
self.kernel_size = kernel_size
|
||||
self.input_shape = input_shape
|
||||
self.padding = padding
|
||||
self.bias = bias
|
||||
self.dilation = dilation
|
||||
|
||||
self.build_module()
|
||||
|
||||
def build_module(self):
|
||||
self.layer_dict = nn.ModuleDict()
|
||||
x = torch.zeros(self.input_shape)
|
||||
self.layer_dict["Identity"] = nn.Identity()
|
||||
|
||||
def forward(self, x):
|
||||
out = x
|
||||
|
||||
out = self.layer_dict["Identity"].forward(out)
|
||||
|
||||
return out
|
||||
|
||||
|
||||
class EntryConvolutionalBlock(nn.Module):
|
||||
def __init__(self, input_shape, num_filters, kernel_size, padding, bias, dilation):
|
||||
super(EntryConvolutionalBlock, self).__init__()
|
||||
|
||||
self.num_filters = num_filters
|
||||
self.kernel_size = kernel_size
|
||||
self.input_shape = input_shape
|
||||
self.padding = padding
|
||||
self.bias = bias
|
||||
self.dilation = dilation
|
||||
|
||||
self.build_module()
|
||||
|
||||
def build_module(self):
|
||||
self.layer_dict = nn.ModuleDict()
|
||||
x = torch.zeros(self.input_shape)
|
||||
out = x
|
||||
|
||||
self.layer_dict["conv_0"] = nn.Conv2d(
|
||||
in_channels=out.shape[1],
|
||||
out_channels=self.num_filters,
|
||||
bias=self.bias,
|
||||
kernel_size=self.kernel_size,
|
||||
dilation=self.dilation,
|
||||
padding=self.padding,
|
||||
stride=1,
|
||||
)
|
||||
|
||||
out = self.layer_dict["conv_0"].forward(out)
|
||||
self.layer_dict["bn_0"] = nn.BatchNorm2d(num_features=out.shape[1])
|
||||
out = F.leaky_relu(self.layer_dict["bn_0"].forward(out))
|
||||
|
||||
print(out.shape)
|
||||
|
||||
def forward(self, x):
|
||||
out = x
|
||||
|
||||
out = self.layer_dict["conv_0"].forward(out)
|
||||
out = F.leaky_relu(self.layer_dict["bn_0"].forward(out))
|
||||
|
||||
return out
|
||||
|
||||
|
||||
class ConvolutionalProcessingBlock(nn.Module):
|
||||
def __init__(self, input_shape, num_filters, kernel_size, padding, bias, dilation):
|
||||
super(ConvolutionalProcessingBlock, self).__init__()
|
||||
|
||||
self.num_filters = num_filters
|
||||
self.kernel_size = kernel_size
|
||||
self.input_shape = input_shape
|
||||
self.padding = padding
|
||||
self.bias = bias
|
||||
self.dilation = dilation
|
||||
|
||||
self.build_module()
|
||||
|
||||
def build_module(self):
|
||||
self.layer_dict = nn.ModuleDict()
|
||||
x = torch.zeros(self.input_shape)
|
||||
out = x
|
||||
|
||||
self.layer_dict["conv_0"] = nn.Conv2d(
|
||||
in_channels=out.shape[1],
|
||||
out_channels=self.num_filters,
|
||||
bias=self.bias,
|
||||
kernel_size=self.kernel_size,
|
||||
dilation=self.dilation,
|
||||
padding=self.padding,
|
||||
stride=1,
|
||||
)
|
||||
|
||||
out = self.layer_dict["conv_0"].forward(out)
|
||||
out = F.leaky_relu(out)
|
||||
|
||||
self.layer_dict["conv_1"] = nn.Conv2d(
|
||||
in_channels=out.shape[1],
|
||||
out_channels=self.num_filters,
|
||||
bias=self.bias,
|
||||
kernel_size=self.kernel_size,
|
||||
dilation=self.dilation,
|
||||
padding=self.padding,
|
||||
stride=1,
|
||||
)
|
||||
|
||||
out = self.layer_dict["conv_1"].forward(out)
|
||||
out = F.leaky_relu(out)
|
||||
|
||||
print(out.shape)
|
||||
|
||||
def forward(self, x):
|
||||
out = x
|
||||
|
||||
out = self.layer_dict["conv_0"].forward(out)
|
||||
out = F.leaky_relu(out)
|
||||
|
||||
out = self.layer_dict["conv_1"].forward(out)
|
||||
out = F.leaky_relu(out)
|
||||
|
||||
return out
|
||||
|
||||
|
||||
class ConvolutionalDimensionalityReductionBlock(nn.Module):
|
||||
def __init__(
|
||||
self,
|
||||
input_shape,
|
||||
num_filters,
|
||||
kernel_size,
|
||||
padding,
|
||||
bias,
|
||||
dilation,
|
||||
reduction_factor,
|
||||
):
|
||||
super(ConvolutionalDimensionalityReductionBlock, self).__init__()
|
||||
|
||||
self.num_filters = num_filters
|
||||
self.kernel_size = kernel_size
|
||||
self.input_shape = input_shape
|
||||
self.padding = padding
|
||||
self.bias = bias
|
||||
self.dilation = dilation
|
||||
self.reduction_factor = reduction_factor
|
||||
self.build_module()
|
||||
|
||||
def build_module(self):
|
||||
self.layer_dict = nn.ModuleDict()
|
||||
x = torch.zeros(self.input_shape)
|
||||
out = x
|
||||
|
||||
self.layer_dict["conv_0"] = nn.Conv2d(
|
||||
in_channels=out.shape[1],
|
||||
out_channels=self.num_filters,
|
||||
bias=self.bias,
|
||||
kernel_size=self.kernel_size,
|
||||
dilation=self.dilation,
|
||||
padding=self.padding,
|
||||
stride=1,
|
||||
)
|
||||
|
||||
out = self.layer_dict["conv_0"].forward(out)
|
||||
out = F.leaky_relu(out)
|
||||
|
||||
out = F.avg_pool2d(out, self.reduction_factor)
|
||||
|
||||
self.layer_dict["conv_1"] = nn.Conv2d(
|
||||
in_channels=out.shape[1],
|
||||
out_channels=self.num_filters,
|
||||
bias=self.bias,
|
||||
kernel_size=self.kernel_size,
|
||||
dilation=self.dilation,
|
||||
padding=self.padding,
|
||||
stride=1,
|
||||
)
|
||||
|
||||
out = self.layer_dict["conv_1"].forward(out)
|
||||
out = F.leaky_relu(out)
|
||||
|
||||
print(out.shape)
|
||||
|
||||
def forward(self, x):
|
||||
out = x
|
||||
|
||||
out = self.layer_dict["conv_0"].forward(out)
|
||||
out = F.leaky_relu(out)
|
||||
|
||||
out = F.avg_pool2d(out, self.reduction_factor)
|
||||
|
||||
out = self.layer_dict["conv_1"].forward(out)
|
||||
out = F.leaky_relu(out)
|
||||
|
||||
return out
|
||||
|
||||
|
||||
class ConvolutionalNetwork(nn.Module):
|
||||
def __init__(
|
||||
self,
|
||||
input_shape,
|
||||
num_output_classes,
|
||||
num_filters,
|
||||
num_blocks_per_stage,
|
||||
num_stages,
|
||||
use_bias=False,
|
||||
processing_block_type=ConvolutionalProcessingBlock,
|
||||
dimensionality_reduction_block_type=ConvolutionalDimensionalityReductionBlock,
|
||||
):
|
||||
"""
|
||||
Initializes a convolutional network module
|
||||
:param input_shape: The shape of the tensor to be passed into this network
|
||||
:param num_output_classes: Number of output classes
|
||||
:param num_filters: Number of filters per convolutional layer
|
||||
:param num_blocks_per_stage: Number of blocks per "stage". Each block is composed of 2 convolutional layers.
|
||||
:param num_stages: Number of stages in a network. A stage is defined as a sequence of layers within which the
|
||||
data dimensionality remains constant in the spatial axis (h, w) and can change in the channel axis. After each stage
|
||||
there exists a dimensionality reduction stage, composed of two convolutional layers and an avg pooling layer.
|
||||
:param use_bias: Whether to use biases in our convolutional layers
|
||||
:param processing_block_type: Type of processing block to use within our stages
|
||||
:param dimensionality_reduction_block_type: Type of dimensionality reduction block to use after each stage in our network
|
||||
"""
|
||||
super(ConvolutionalNetwork, self).__init__()
|
||||
# set up class attributes useful in building the network and inference
|
||||
self.input_shape = input_shape
|
||||
self.num_filters = num_filters
|
||||
self.num_output_classes = num_output_classes
|
||||
self.use_bias = use_bias
|
||||
self.num_blocks_per_stage = num_blocks_per_stage
|
||||
self.num_stages = num_stages
|
||||
self.processing_block_type = processing_block_type
|
||||
self.dimensionality_reduction_block_type = dimensionality_reduction_block_type
|
||||
|
||||
# build the network
|
||||
self.build_module()
|
||||
|
||||
def build_module(self):
|
||||
"""
|
||||
Builds network whilst automatically inferring shapes of layers.
|
||||
"""
|
||||
self.layer_dict = nn.ModuleDict()
|
||||
# initialize a module dict, which is effectively a dictionary that can collect layers and integrate them into pytorch
|
||||
print(
|
||||
"Building basic block of ConvolutionalNetwork using input shape",
|
||||
self.input_shape,
|
||||
)
|
||||
x = torch.zeros(
|
||||
(self.input_shape)
|
||||
) # create dummy inputs to be used to infer shapes of layers
|
||||
|
||||
out = x
|
||||
self.layer_dict["input_conv"] = EntryConvolutionalBlock(
|
||||
input_shape=out.shape,
|
||||
num_filters=self.num_filters,
|
||||
kernel_size=3,
|
||||
padding=1,
|
||||
bias=self.use_bias,
|
||||
dilation=1,
|
||||
)
|
||||
out = self.layer_dict["input_conv"].forward(out)
|
||||
# torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
|
||||
for i in range(self.num_stages): # for number of layers times
|
||||
for j in range(self.num_blocks_per_stage):
|
||||
self.layer_dict["block_{}_{}".format(i, j)] = (
|
||||
self.processing_block_type(
|
||||
input_shape=out.shape,
|
||||
num_filters=self.num_filters,
|
||||
bias=self.use_bias,
|
||||
kernel_size=3,
|
||||
dilation=1,
|
||||
padding=1,
|
||||
)
|
||||
)
|
||||
out = self.layer_dict["block_{}_{}".format(i, j)].forward(out)
|
||||
self.layer_dict["reduction_block_{}".format(i)] = (
|
||||
self.dimensionality_reduction_block_type(
|
||||
input_shape=out.shape,
|
||||
num_filters=self.num_filters,
|
||||
bias=True,
|
||||
kernel_size=3,
|
||||
dilation=1,
|
||||
padding=1,
|
||||
reduction_factor=2,
|
||||
)
|
||||
)
|
||||
out = self.layer_dict["reduction_block_{}".format(i)].forward(out)
|
||||
|
||||
out = F.avg_pool2d(out, out.shape[-1])
|
||||
print("shape before final linear layer", out.shape)
|
||||
out = out.view(out.shape[0], -1)
|
||||
self.logit_linear_layer = nn.Linear(
|
||||
in_features=out.shape[1], # add a linear layer
|
||||
out_features=self.num_output_classes,
|
||||
bias=True,
|
||||
)
|
||||
out = self.logit_linear_layer(out) # apply linear layer on flattened inputs
|
||||
print("Block is built, output volume is", out.shape)
|
||||
return out
|
||||
|
||||
def forward(self, x):
|
||||
"""
|
||||
Forward propagates the network given an input batch
|
||||
:param x: Inputs x (b, c, h, w)
|
||||
:return: preds (b, num_classes)
|
||||
"""
|
||||
out = x
|
||||
out = self.layer_dict["input_conv"].forward(out)
|
||||
for i in range(self.num_stages): # for number of layers times
|
||||
for j in range(self.num_blocks_per_stage):
|
||||
out = self.layer_dict["block_{}_{}".format(i, j)].forward(out)
|
||||
out = self.layer_dict["reduction_block_{}".format(i)].forward(out)
|
||||
|
||||
out = F.avg_pool2d(out, out.shape[-1])
|
||||
out = out.view(
|
||||
out.shape[0], -1
|
||||
) # flatten outputs from (b, c, h, w) to (b, c*h*w)
|
||||
out = self.logit_linear_layer(
|
||||
out
|
||||
) # pass through a linear layer to get logits/preds
|
||||
return out
|
||||
|
||||
def reset_parameters(self):
|
||||
"""
|
||||
Re-initialize the network parameters.
|
||||
"""
|
||||
for item in self.layer_dict.children():
|
||||
try:
|
||||
item.reset_parameters()
|
||||
except:
|
||||
pass
|
||||
|
||||
self.logit_linear_layer.reset_parameters()
|
||||
|
||||
|
||||
# My Implementation:
|
||||
|
||||
|
||||
class ConvolutionalProcessingBlockBN(nn.Module):
    def __init__(self, input_shape, num_filters, kernel_size, padding, bias, dilation):
        super().__init__()

        self.num_filters = num_filters
        self.kernel_size = kernel_size
        self.input_shape = input_shape
        self.padding = padding
        self.bias = bias
        self.dilation = dilation

        self.build_module()

    def build_module(self):
        self.layer_dict = nn.ModuleDict()
        x = torch.zeros(self.input_shape)
        out = x

        # First convolutional layer with Batch Normalization
        self.layer_dict["conv_0"] = nn.Conv2d(
            in_channels=out.shape[1],
            out_channels=self.num_filters,
            bias=self.bias,
            kernel_size=self.kernel_size,
            dilation=self.dilation,
            padding=self.padding,
            stride=1,
        )
        self.layer_dict["bn_0"] = nn.BatchNorm2d(self.num_filters)
        out = F.leaky_relu(self.layer_dict["bn_0"](self.layer_dict["conv_0"](out)))

        # Second convolutional layer with Batch Normalization
        self.layer_dict["conv_1"] = nn.Conv2d(
            in_channels=out.shape[1],
            out_channels=self.num_filters,
            bias=self.bias,
            kernel_size=self.kernel_size,
            dilation=self.dilation,
            padding=self.padding,
            stride=1,
        )
        self.layer_dict["bn_1"] = nn.BatchNorm2d(self.num_filters)
        out = F.leaky_relu(self.layer_dict["bn_1"](self.layer_dict["conv_1"](out)))

        print(out.shape)

    def forward(self, x):
        out = x

        # Apply first conv layer + BN + leaky ReLU
        out = F.leaky_relu(self.layer_dict["bn_0"](self.layer_dict["conv_0"](out)))

        # Apply second conv layer + BN + leaky ReLU
        out = F.leaky_relu(self.layer_dict["bn_1"](self.layer_dict["conv_1"](out)))

        return out


class ConvolutionalDimensionalityReductionBlockBN(nn.Module):
    def __init__(
        self,
        input_shape,
        num_filters,
        kernel_size,
        padding,
        bias,
        dilation,
        reduction_factor,
    ):
        super().__init__()

        self.num_filters = num_filters
        self.kernel_size = kernel_size
        self.input_shape = input_shape
        self.padding = padding
        self.bias = bias
        self.dilation = dilation
        self.reduction_factor = reduction_factor

        self.build_module()

    def build_module(self):
        self.layer_dict = nn.ModuleDict()
        x = torch.zeros(self.input_shape)
        out = x

        # First convolutional layer with Batch Normalization
        self.layer_dict["conv_0"] = nn.Conv2d(
            in_channels=out.shape[1],
            out_channels=self.num_filters,
            bias=self.bias,
            kernel_size=self.kernel_size,
            dilation=self.dilation,
            padding=self.padding,
            stride=1,
        )
        self.layer_dict["bn_0"] = nn.BatchNorm2d(self.num_filters)
        out = F.leaky_relu(self.layer_dict["bn_0"](self.layer_dict["conv_0"](out)))

        # Dimensionality reduction through average pooling
        out = F.avg_pool2d(out, self.reduction_factor)

        # Second convolutional layer with Batch Normalization
        self.layer_dict["conv_1"] = nn.Conv2d(
            in_channels=out.shape[1],
            out_channels=self.num_filters,
            bias=self.bias,
            kernel_size=self.kernel_size,
            dilation=self.dilation,
            padding=self.padding,
            stride=1,
        )
        self.layer_dict["bn_1"] = nn.BatchNorm2d(self.num_filters)
        out = F.leaky_relu(self.layer_dict["bn_1"](self.layer_dict["conv_1"](out)))

        print(out.shape)

    def forward(self, x):
        out = x

        # Apply first conv layer + BN + leaky ReLU
        out = F.leaky_relu(self.layer_dict["bn_0"](self.layer_dict["conv_0"](out)))

        # Dimensionality reduction through average pooling
        out = F.avg_pool2d(out, self.reduction_factor)

        # Apply second conv layer + BN + leaky ReLU
        out = F.leaky_relu(self.layer_dict["bn_1"](self.layer_dict["conv_1"](out)))

        return out


class ConvolutionalProcessingBlockBNRC(nn.Module):
    def __init__(self, input_shape, num_filters, kernel_size, padding, bias, dilation):
        super().__init__()
        self.num_filters = num_filters
        self.kernel_size = kernel_size
        self.input_shape = input_shape
        self.padding = padding
        self.bias = bias
        self.dilation = dilation
        self.build_module()

    def build_module(self):
        self.layer_dict = nn.ModuleDict()
        x = torch.zeros(self.input_shape)
        out = x

        # First convolutional layer with BN
        self.layer_dict["conv_0"] = nn.Conv2d(
            in_channels=out.shape[1],
            out_channels=self.num_filters,
            bias=self.bias,
            kernel_size=self.kernel_size,
            dilation=self.dilation,
            padding=self.padding,
            stride=1,
        )
        self.layer_dict["bn_0"] = nn.BatchNorm2d(self.num_filters)

        out = self.layer_dict["conv_0"].forward(out)
        out = self.layer_dict["bn_0"].forward(out)
        out = F.leaky_relu(out)

        # Second convolutional layer with BN
        self.layer_dict["conv_1"] = nn.Conv2d(
            in_channels=out.shape[1],
            out_channels=self.num_filters,
            bias=self.bias,
            kernel_size=self.kernel_size,
            dilation=self.dilation,
            padding=self.padding,
            stride=1,
        )
        self.layer_dict["bn_1"] = nn.BatchNorm2d(self.num_filters)

        out = self.layer_dict["conv_1"].forward(out)
        out = self.layer_dict["bn_1"].forward(out)
        out = F.leaky_relu(out)

        # Print final output shape for debugging
        print(out.shape)

    def forward(self, x):
        residual = x  # Save input for residual connection
        out = x

        # Apply first conv layer + BN + leaky ReLU
        out = F.leaky_relu(self.layer_dict["bn_0"](self.layer_dict["conv_0"](out)))

        # Apply second conv layer + BN + leaky ReLU
        out = F.leaky_relu(self.layer_dict["bn_1"](self.layer_dict["conv_1"](out)))

        # Add residual connection; shapes must match because padding preserves the
        # spatial size and the block keeps the channel count at num_filters
        assert residual.shape == out.shape
        out += residual

        return out
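These BN and BN+RC blocks are swapped into the network through the processing_block_type and dimensionality_reduction_block_type arguments of ConvolutionalNetwork, which the training script further down selects via its block_type argument. Below is a minimal wiring sketch, assuming the classes above are importable from pytorch_mlp_framework.model_architectures (as the training script's imports suggest); it is not part of the diff itself.

import torch
from pytorch_mlp_framework.model_architectures import (
    ConvolutionalNetwork,
    ConvolutionalProcessingBlockBNRC,
    ConvolutionalDimensionalityReductionBlockBN,
)

# Small VGG-style network: BN + residual processing blocks, BN reduction blocks.
net = ConvolutionalNetwork(
    input_shape=(2, 3, 32, 32),  # (batch, channels, height, width)
    num_output_classes=100,
    num_filters=16,
    use_bias=False,
    num_blocks_per_stage=2,
    num_stages=3,
    processing_block_type=ConvolutionalProcessingBlockBNRC,
    dimensionality_reduction_block_type=ConvolutionalDimensionalityReductionBlockBN,
)

logits = net(torch.randn(2, 3, 32, 32))
print(logits.shape)  # expected: torch.Size([2, 100])

The assert in ConvolutionalProcessingBlockBNRC.forward only holds because padding preserves the spatial size and every processing block sees num_filters input channels once the entry convolution has run, so the identity shortcut needs no projection.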
@ -17,7 +17,14 @@ def load_from_stats_pkl_file(experiment_log_filepath, filename):
    return stats


def save_statistics(experiment_log_dir, filename, stats_dict, current_epoch, continue_from_mode=False, save_full_dict=False):
def save_statistics(
    experiment_log_dir,
    filename,
    stats_dict,
    current_epoch,
    continue_from_mode=False,
    save_full_dict=False,
):
    """
    Saves the statistics in stats dict into a csv file. Using the keys as the header entries and the values as the
    columns of a particular header entry
@ -29,7 +36,7 @@ def save_statistics(experiment_log_dir, filename, stats_dict, current_epoch, con
    :return: The filepath to the summary file
    """
    summary_filename = os.path.join(experiment_log_dir, filename)
    mode = 'a' if continue_from_mode else 'w'
    mode = "a" if continue_from_mode else "w"
    with open(summary_filename, mode) as f:
        writer = csv.writer(f)
        if not continue_from_mode:
@ -57,7 +64,7 @@ def load_statistics(experiment_log_dir, filename):
    """
    summary_filename = os.path.join(experiment_log_dir, filename)

    with open(summary_filename, 'r+') as f:
    with open(summary_filename, "r+") as f:
        lines = f.readlines()

    keys = lines[0].split(",")
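For context, the summary.csv and test_summary.csv files added further down in this diff are written by this helper: the keys of stats_dict become the CSV header and each epoch contributes one row. A hedged usage sketch follows, assuming stats_dict maps each metric name to a per-epoch list indexed by current_epoch (in practice the calls are presumably made inside ExperimentBuilder rather than by hand, and the values below are illustrative only):

stats = {"train_acc": [0.45], "train_loss": [1.99], "val_acc": [0.43], "val_loss": [2.09]}
save_statistics(
    experiment_log_dir="VGG38_BN/result_outputs",
    filename="summary.csv",
    stats_dict=stats,
    current_epoch=0,
    continue_from_mode=False,  # "w" on the first epoch, "a" when appending later epochs
)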
87
pytorch_mlp_framework/tests.py
Normal file
@ -0,0 +1,87 @@
import unittest
import torch
from model_architectures import (
    ConvolutionalProcessingBlockBN,
    ConvolutionalDimensionalityReductionBlockBN,
    ConvolutionalProcessingBlockBNRC,
)


class TestBatchNormalizationBlocks(unittest.TestCase):
    def setUp(self):
        # Common parameters
        self.input_shape = (1, 3, 32, 32)  # Batch size 1, 3 channels, 32x32 input
        self.num_filters = 16
        self.kernel_size = 3
        self.padding = 1
        self.bias = False
        self.dilation = 1
        self.reduction_factor = 2

    def test_convolutional_processing_block(self):
        # Create a ConvolutionalProcessingBlockBN instance
        block = ConvolutionalProcessingBlockBN(
            input_shape=self.input_shape,
            num_filters=self.num_filters,
            kernel_size=self.kernel_size,
            padding=self.padding,
            bias=self.bias,
            dilation=self.dilation,
        )

        # Generate a random tensor matching the input shape
        input_tensor = torch.randn(self.input_shape)

        # Forward pass
        try:
            output = block(input_tensor)
            self.assertIsNotNone(output, "Output should not be None.")
        except Exception as e:
            self.fail(f"ConvolutionalProcessingBlockBN raised an error: {e}")

    def test_convolutional_processing_block_with_rc(self):
        # Create a ConvolutionalProcessingBlockBNRC instance
        block = ConvolutionalProcessingBlockBNRC(
            input_shape=self.input_shape,
            num_filters=self.num_filters,
            kernel_size=self.kernel_size,
            padding=self.padding,
            bias=self.bias,
            dilation=self.dilation,
        )

        # Generate a random tensor matching the input shape
        input_tensor = torch.randn(self.input_shape)

        # Forward pass
        try:
            output = block(input_tensor)
            self.assertIsNotNone(output, "Output should not be None.")
        except Exception as e:
            self.fail(f"ConvolutionalProcessingBlockBNRC raised an error: {e}")

    def test_convolutional_dimensionality_reduction_block(self):
        # Create a ConvolutionalDimensionalityReductionBlockBN instance
        block = ConvolutionalDimensionalityReductionBlockBN(
            input_shape=self.input_shape,
            num_filters=self.num_filters,
            kernel_size=self.kernel_size,
            padding=self.padding,
            bias=self.bias,
            dilation=self.dilation,
            reduction_factor=self.reduction_factor,
        )

        # Generate a random tensor matching the input shape
        input_tensor = torch.randn(self.input_shape)

        # Forward pass
        try:
            output = block(input_tensor)
            self.assertIsNotNone(output, "Output should not be None.")
        except Exception as e:
            self.fail(f"ConvolutionalDimensionalityReductionBlockBN raised an error: {e}")


if __name__ == "__main__":
    unittest.main()
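Note that tests.py imports model_architectures without the package prefix, so it is presumably meant to be run from inside the pytorch_mlp_framework directory, for example:

cd pytorch_mlp_framework
python tests.py              # runs the three block tests via unittest.main()
python -m unittest -v tests  # equivalent, with verbose output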
@ -0,0 +1,102 @@
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

import mlp.data_providers as data_providers
from pytorch_mlp_framework.arg_extractor import get_args
from pytorch_mlp_framework.experiment_builder import ExperimentBuilder
from pytorch_mlp_framework.model_architectures import *
import os

# os.environ["CUDA_VISIBLE_DEVICES"]="0"

args = get_args()  # get arguments from command line
rng = np.random.RandomState(seed=args.seed)  # set the seeds for the experiment
torch.manual_seed(seed=args.seed)  # sets pytorch's seed

# set up data augmentation transforms for training and testing
transform_train = transforms.Compose(
    [
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ]
)

transform_test = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ]
)

train_data = data_providers.CIFAR100(
    root="data", set_name="train", transform=transform_train, download=True
)  # load the CIFAR-100 training split
val_data = data_providers.CIFAR100(
    root="data", set_name="val", transform=transform_test, download=True
)  # load the CIFAR-100 validation split
test_data = data_providers.CIFAR100(
    root="data", set_name="test", transform=transform_test, download=True
)  # load the CIFAR-100 test split

train_data_loader = DataLoader(
    train_data, batch_size=args.batch_size, shuffle=True, num_workers=2
)
val_data_loader = DataLoader(
    val_data, batch_size=args.batch_size, shuffle=True, num_workers=2
)
test_data_loader = DataLoader(
    test_data, batch_size=args.batch_size, shuffle=True, num_workers=2
)

if args.block_type == "conv_block":
    processing_block_type = ConvolutionalProcessingBlock
    dim_reduction_block_type = ConvolutionalDimensionalityReductionBlock
elif args.block_type == "empty_block":
    processing_block_type = EmptyBlock
    dim_reduction_block_type = EmptyBlock
elif args.block_type == "conv_bn":
    processing_block_type = ConvolutionalProcessingBlockBN
    dim_reduction_block_type = ConvolutionalDimensionalityReductionBlockBN
elif args.block_type == "conv_bn_rc":
    processing_block_type = ConvolutionalProcessingBlockBNRC
    dim_reduction_block_type = ConvolutionalDimensionalityReductionBlockBN
else:
    raise ModuleNotFoundError

custom_conv_net = (
    ConvolutionalNetwork(  # initialize our network object, in this case a ConvNet
        input_shape=(
            args.batch_size,
            args.image_num_channels,
            args.image_height,
            args.image_width,
        ),
        num_output_classes=args.num_classes,
        num_filters=args.num_filters,
        use_bias=False,
        num_blocks_per_stage=args.num_blocks_per_stage,
        num_stages=args.num_stages,
        processing_block_type=processing_block_type,
        dimensionality_reduction_block_type=dim_reduction_block_type,
    )
)

conv_experiment = ExperimentBuilder(
    network_model=custom_conv_net,
    experiment_name=args.experiment_name,
    num_epochs=args.num_epochs,
    weight_decay_coefficient=args.weight_decay_coefficient,
    learning_rate=args.learning_rate,
    use_gpu=args.use_gpu,
    continue_from_epoch=args.continue_from_epoch,
    train_data=train_data_loader,
    val_data=val_data_loader,
    test_data=test_data_loader,
)  # build an experiment object
experiment_metrics, test_metrics = (
    conv_experiment.run_experiment()
)  # run experiment and return experiment metrics
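The block_type branches above presumably correspond to the result folders added below (conv_bn to VGG38_BN, conv_bn_rc to VGG38_BN_RC, and the plain conv_block to VGG38_default). Assuming arg_extractor.get_args exposes command-line flags named after the attributes read above (the script's filename is not shown in this hunk, hence the placeholder), a run could be launched roughly as:

python <path_to_this_script> --block_type conv_bn_rc --experiment_name VGG38_BN_RC \
    --num_epochs 100 --batch_size 100 --seed 0 --use_gpu True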
4
report/.gitignore
vendored
Normal file
@ -0,0 +1,4 @@
*.fls
*.fdb_latexmk
s2759177/
*.zip
1
report/README.txt
Normal file
@ -0,0 +1 @@
Most reasonable LaTeX distributions should have no problem building the document from what is in the provided LaTeX source directory. However, certain LaTeX distributions are missing certain files, and they are included in this directory. If you get an error message when you build the LaTeX document saying one of these files is missing, then move the relevant file into your LaTeX source directory.
101
report/VGG38_BN/result_outputs/summary.csv
Normal file
@ -0,0 +1,101 @@
|
||||
train_acc,train_loss,val_acc,val_loss
|
||||
0.027410526315789472,4.440032,0.0368,4.238186
|
||||
0.0440842105263158,4.1909122,0.0644,4.1239405
|
||||
0.05604210526315791,4.0817885,0.0368,4.495799
|
||||
0.0685263157894737,3.984858,0.0964,3.8527937
|
||||
0.08345263157894738,3.8947835,0.09080000000000002,3.8306112
|
||||
0.09391578947368423,3.8246264,0.10399999999999998,3.7504945
|
||||
0.10189473684210527,3.760145,0.1124,3.6439042
|
||||
0.11197894736842108,3.704831,0.0992,3.962508
|
||||
0.12534736842105265,3.6408415,0.1404,3.516474
|
||||
0.1385894736842105,3.5672796,0.1444,3.5242612
|
||||
0.14873684210526317,3.5145628,0.12960000000000002,3.5745378
|
||||
0.16103157894736844,3.4476008,0.1852,3.3353982
|
||||
0.16846315789473681,3.399858,0.15600000000000003,3.453797
|
||||
0.1760210526315789,3.3611393,0.1464,3.5799885
|
||||
0.18625263157894736,3.3005812,0.196,3.201007
|
||||
0.19233684210526317,3.26565,0.17439999999999997,3.397586
|
||||
0.19625263157894737,3.2346153,0.212,3.169959
|
||||
0.20717894736842105,3.174345,0.2132,3.0981174
|
||||
0.2136,3.1425776,0.2036,3.2191591
|
||||
0.2217684210526316,3.094137,0.236,3.0018876
|
||||
0.23069473684210529,3.0539455,0.20440000000000003,3.1800296
|
||||
0.23395789473684211,3.0338168,0.22599999999999998,3.0360818
|
||||
0.24463157894736842,2.9761615,0.2588,2.8876188
|
||||
0.25311578947368424,2.931479,0.2,3.242481
|
||||
0.25795789473684216,2.900163,0.28320000000000006,2.830947
|
||||
0.26789473684210524,2.8484874,0.2768,2.8190458
|
||||
0.2709263157894737,2.833472,0.2352,3.0098538
|
||||
0.2816421052631579,2.7842317,0.29560000000000003,2.7288156
|
||||
0.28764210526315787,2.745757,0.2648,2.8955112
|
||||
0.2930315789473684,2.7276495,0.27680000000000005,2.8336413
|
||||
0.3001263157894737,2.6826382,0.316,2.6245823
|
||||
0.3068421052631579,2.658441,0.27,2.9279957
|
||||
0.30909473684210526,2.638565,0.31160000000000004,2.637653
|
||||
0.3213263157894737,2.5939283,0.31799999999999995,2.627816
|
||||
0.3211157894736843,2.579544,0.25079999999999997,2.9502957
|
||||
0.3259999999999999,2.5540712,0.3332,2.569941
|
||||
0.3336421052631579,2.5239582,0.278,2.7676308
|
||||
0.3371368421052632,2.5109046,0.2916,2.725589
|
||||
0.34404210526315787,2.4714804,0.34120000000000006,2.4782379
|
||||
0.3500631578947368,2.4545348,0.30600000000000005,2.6625924
|
||||
0.34976842105263156,2.4408882,0.342,2.5351026
|
||||
0.3586315789473684,2.4116046,0.3452,2.450749
|
||||
0.3568421052631579,2.4133172,0.3288,2.5647113
|
||||
0.3630947368421052,2.3772728,0.36519999999999997,2.388074
|
||||
0.37069473684210524,2.3505116,0.324,2.5489926
|
||||
0.37132631578947367,2.352426,0.33680000000000004,2.5370462
|
||||
0.37606315789473677,2.319005,0.3712,2.3507965
|
||||
0.3800210526315789,2.3045664,0.33,2.6327293
|
||||
0.38185263157894733,2.2965574,0.3764,2.364877
|
||||
0.38785263157894734,2.269467,0.37799999999999995,2.330837
|
||||
0.3889684210526316,2.26941,0.3559999999999999,2.513778
|
||||
0.3951789473684211,2.2413251,0.3888,2.2839465
|
||||
0.3944421052631579,2.2319226,0.35919999999999996,2.4310353
|
||||
0.4,2.220305,0.3732,2.348543
|
||||
0.4051157894736842,2.1891508,0.39440000000000003,2.2730627
|
||||
0.40581052631578945,2.1873925,0.33399999999999996,2.5648093
|
||||
0.4067789473684211,2.1817088,0.4044,2.2244952
|
||||
0.41555789473684207,2.1543047,0.39759999999999995,2.220972
|
||||
0.4170526315789474,2.14905,0.33399999999999996,2.6612198
|
||||
0.41762105263157895,2.1321266,0.3932,2.2343464
|
||||
0.42341052631578946,2.1131704,0.37800000000000006,2.327929
|
||||
0.4212842105263158,2.112597,0.376,2.3302126
|
||||
0.4295157894736842,2.0925663,0.4100000000000001,2.175698
|
||||
0.4299368421052632,2.0846903,0.3772,2.3750577
|
||||
0.43134736842105265,2.075184,0.4044,2.1888158
|
||||
0.43829473684210524,2.045202,0.41239999999999993,2.1673117
|
||||
0.43534736842105265,2.0590534,0.37440000000000007,2.3269994
|
||||
0.4417684210526316,2.0356588,0.42,2.1668334
|
||||
0.4442736842105263,2.028207,0.41239999999999993,2.2346516
|
||||
0.44581052631578943,2.021492,0.40519999999999995,2.2030904
|
||||
0.44884210526315793,2.0058675,0.4296,2.0948715
|
||||
0.45071578947368424,1.993417,0.39,2.2856123
|
||||
0.45130526315789476,1.9970801,0.43599999999999994,2.110219
|
||||
0.45686315789473686,1.9651922,0.4244,2.1253593
|
||||
0.4557263157894737,1.9701725,0.3704,2.4576838
|
||||
0.4609684210526315,1.956996,0.4412,2.0626938
|
||||
0.4639789473684211,1.9407912,0.398,2.3076272
|
||||
0.46311578947368426,1.9410807,0.4056,2.2181008
|
||||
0.4686736842105263,1.918824,0.45080000000000003,2.030652
|
||||
0.4650315789473684,1.924879,0.3948,2.2926931
|
||||
0.46964210526315786,1.9188553,0.43599999999999994,2.107239
|
||||
0.47357894736842104,1.8991861,0.43119999999999997,2.067097
|
||||
0.47212631578947367,1.8987728,0.41359999999999997,2.1667569
|
||||
0.4773263157894737,1.8892545,0.46,2.0283196
|
||||
0.4802526315789474,1.8736148,0.41960000000000003,2.1698954
|
||||
0.47406315789473685,1.8849738,0.43399999999999994,2.1001608
|
||||
0.48627368421052636,1.8492608,0.45520000000000005,1.9936249
|
||||
0.48589473684210527,1.8534511,0.38439999999999996,2.354954
|
||||
0.48667368421052637,1.8421199,0.44120000000000004,2.0467849
|
||||
0.4902736842105263,1.8265136,0.45519999999999994,2.0044358
|
||||
0.4879789473684211,1.838593,0.3984,2.3019247
|
||||
0.49204210526315795,1.8199797,0.4656,1.9858631
|
||||
0.4945894736842105,1.805858,0.436,2.1293921
|
||||
0.4939578947368421,1.8174701,0.4388,2.0611947
|
||||
0.4961684210526316,1.7953233,0.4612,1.9728945
|
||||
0.49610526315789477,1.7908033,0.42440000000000005,2.1648548
|
||||
0.4996,1.7908286,0.4664,1.9897026
|
||||
0.5070105263157895,1.7658812,0.452,2.0411723
|
||||
0.5027368421052631,1.7692825,0.4136000000000001,2.280331
|
||||
0.5062315789473685,1.7649119,0.4768,1.9493303
|
|
2
report/VGG38_BN/result_outputs/test_summary.csv
Normal file
@ -0,0 +1,2 @@
test_acc,test_loss
0.46970000000000006,1.9579598
101
report/VGG38_BN_RC/result_outputs/summary.csv
Normal file
@ -0,0 +1,101 @@
|
||||
train_acc,train_loss,val_acc,val_loss
|
||||
0.04040000000000001,4.2986817,0.07600000000000001,3.9793916
|
||||
0.07663157894736841,3.948711,0.09840000000000002,3.8271046
|
||||
0.1072842105263158,3.7670445,0.0908,3.8834984
|
||||
0.14671578947368422,3.544252,0.1784,3.3180876
|
||||
0.18690526315789474,3.3382895,0.1672,3.4958847
|
||||
0.2185684210526316,3.1613564,0.23240000000000002,3.0646808
|
||||
0.2584,2.9509778,0.2904,2.7620668
|
||||
0.2886736842105263,2.7674758,0.2504,3.083242
|
||||
0.3186736842105263,2.6191177,0.34600000000000003,2.5320892
|
||||
0.3488421052631579,2.4735146,0.3556,2.463249
|
||||
0.36701052631578945,2.3815694,0.32480000000000003,2.6590502
|
||||
0.39258947368421054,2.2661598,0.41200000000000003,2.215237
|
||||
0.40985263157894736,2.1811035,0.3644,2.4625826
|
||||
0.42557894736842106,2.1193688,0.3896,2.2802749
|
||||
0.4452,2.0338347,0.45080000000000003,2.0216491
|
||||
0.45298947368421055,1.9886738,0.3768,2.4903286
|
||||
0.4690105263157895,1.9385177,0.46519999999999995,1.9589043
|
||||
0.48627368421052636,1.8654134,0.46199999999999997,1.9572229
|
||||
0.4910947368421053,1.836772,0.3947999999999999,2.371203
|
||||
0.5033052631578947,1.7882212,0.4864,1.8270072
|
||||
0.515578947368421,1.7451773,0.418,2.2281988
|
||||
0.5166526315789474,1.7310464,0.4744,1.9468222
|
||||
0.532,1.6639497,0.5176,1.7627875
|
||||
0.534821052631579,1.6504371,0.426,2.2908173
|
||||
0.5399578947368422,1.6263881,0.5092,1.7892419
|
||||
0.5538105263157893,1.5786182,0.5184,1.7781507
|
||||
0.5530526315789474,1.5743873,0.45480000000000004,2.052206
|
||||
0.5610526315789474,1.5367776,0.5404000000000001,1.6886607
|
||||
0.5709263157894736,1.508275,0.5072000000000001,1.8317349
|
||||
0.5693894736842106,1.5026951,0.49760000000000004,1.9268813
|
||||
0.5827368421052632,1.4614111,0.5484,1.6791071
|
||||
0.583557894736842,1.4580216,0.4744,2.084504
|
||||
0.5856842105263159,1.4402864,0.5468,1.6674811
|
||||
0.5958105263157895,1.4054152,0.5468,1.7081916
|
||||
0.5964631578947368,1.4043275,0.4988,1.8901508
|
||||
0.6044631578947368,1.3692447,0.548,1.6456038
|
||||
0.6065473684210526,1.3562685,0.5448,1.7725601
|
||||
0.6055578947368421,1.3638091,0.52,1.803752
|
||||
0.6169684210526316,1.3224502,0.5688,1.6048553
|
||||
0.6184421052631579,1.3228824,0.4772,2.0309162
|
||||
0.6193894736842105,1.312684,0.5496,1.6357917
|
||||
0.6287368421052631,1.2758818,0.5552,1.7120187
|
||||
0.6270105263157894,1.2829372,0.4872000000000001,1.9630791
|
||||
0.6313473684210527,1.2609128,0.5632,1.6049384
|
||||
0.6374736842105263,1.2429903,0.5516,1.7101723
|
||||
0.6342947368421055,1.2540665,0.5272,1.8112053
|
||||
0.642778947368421,1.2098345,0.5692,1.5996393
|
||||
0.6447368421052632,1.217454,0.5056,2.087292
|
||||
0.6437052631578949,1.2123955,0.5660000000000001,1.6426488
|
||||
0.6533263157894735,1.1804259,0.5672,1.6429158
|
||||
0.6521052631578947,1.1856273,0.5316000000000001,1.8833923
|
||||
0.658021052631579,1.1663536,0.5652,1.6239171
|
||||
0.6622947368421054,1.1522906,0.5376000000000001,1.8352613
|
||||
0.6543789473684212,1.1700194,0.5539999999999999,1.7920883
|
||||
0.6664,1.1246897,0.5828,1.5657492
|
||||
0.6645473684210526,1.1307288,0.5296,1.8285477
|
||||
0.6647157894736843,1.1294464,0.5852,1.59438
|
||||
0.6713473684210526,1.1020554,0.5647999999999999,1.6256377
|
||||
0.6691368421052631,1.1129124,0.5224,1.9497899
|
||||
0.6737684210526315,1.0941163,0.5708,1.5900868
|
||||
0.6765473684210527,1.0844595,0.55,1.7522817
|
||||
0.6762947368421053,1.0832069,0.5428000000000001,1.8020345
|
||||
0.6799789473684209,1.0637755,0.5864,1.5690281
|
||||
0.6808421052631578,1.066873,0.5168,1.9964217
|
||||
0.6843157894736842,1.0618489,0.5720000000000001,1.6391727
|
||||
0.6866736842105262,1.0432214,0.5731999999999999,1.6571078
|
||||
0.6877684210526315,1.0442319,0.5192,2.0341485
|
||||
0.6890105263157895,1.0338738,0.5836,1.5887364
|
||||
0.693642105263158,1.0206536,0.5456,1.8537303
|
||||
0.6905894736842106,1.0271776,0.5548000000000001,1.8022745
|
||||
0.6981263157894737,1.001102,0.5852,1.5923084
|
||||
0.6986105263157896,1.0052379,0.512,2.011443
|
||||
0.698042105263158,0.9990784,0.5744,1.638558
|
||||
0.7031578947368421,0.977477,0.5816,1.5790274
|
||||
0.7013473684210526,0.98766434,0.5448000000000001,1.8414693
|
||||
0.7069684210526315,0.9691622,0.59,1.5866013
|
||||
0.7061894736842105,0.9620083,0.55,1.7695292
|
||||
0.7050526315789474,0.9689725,0.5408,1.8329593
|
||||
0.7101052631578948,0.95279986,0.5852,1.5835829
|
||||
0.7122315789473684,0.9483001,0.5224,1.9749893
|
||||
0.7115157894736842,0.94911486,0.5808,1.6965445
|
||||
0.7166315789473684,0.9338312,0.5788,1.6249495
|
||||
0.7120631578947368,0.9428737,0.5224,1.9721117
|
||||
0.7197263157894737,0.92057914,0.5960000000000001,1.6235417
|
||||
0.7258315789473684,0.9071854,0.528,2.0651033
|
||||
0.7186947368421053,0.922529,0.5628,1.7508049
|
||||
0.7257684210526316,0.9007169,0.5980000000000001,1.5797865
|
||||
0.7254105263157896,0.89657074,0.5472,1.8673587
|
||||
0.7229263157894736,0.90324384,0.5771999999999999,1.6998875
|
||||
0.7308842105263157,0.8757633,0.5856,1.6750972
|
||||
0.7254947368421052,0.8956531,0.5479999999999999,1.9809356
|
||||
0.7302105263157894,0.8803156,0.5960000000000001,1.6343199
|
||||
0.7353473684210525,0.8630421,0.56,1.9686066
|
||||
0.732021052631579,0.8823739,0.5632,1.8139118
|
||||
0.7324631578947367,0.8676047,0.5952000000000001,1.6235788
|
||||
0.7366526315789473,0.85581774,0.5392,1.9346147
|
||||
0.7340210526315789,0.8636227,0.5868,1.6743768
|
||||
0.7416631578947368,0.84529686,0.5836,1.6691054
|
||||
0.734757894736842,0.85352796,0.516,2.227477
|
||||
0.7435368421052632,0.83374214,0.582,1.697568
|
|
2
report/VGG38_BN_RC/result_outputs/test_summary.csv
Normal file
@ -0,0 +1,2 @@
test_acc,test_loss
0.6018000000000001,1.5933747
101
report/VGG38_default/result_outputs/summary.csv
Normal file
@ -0,0 +1,101 @@
|
||||
train_acc,train_loss,val_acc,val_loss
|
||||
0.009600000000000001,4.609349,0.0104,4.6072426
|
||||
0.009326315789473684,4.6068563,0.0092,4.606588
|
||||
0.009747368421052631,4.6062207,0.0084,4.606326
|
||||
0.009621052631578947,4.6059957,0.0076,4.6067405
|
||||
0.009873684210526314,4.605887,0.0076,4.6068487
|
||||
0.009136842105263157,4.605854,0.008,4.6074386
|
||||
0.009536842105263158,4.605795,0.007200000000000001,4.6064863
|
||||
0.009578947368421051,4.6057415,0.006400000000000001,4.6065035
|
||||
0.009410526315789473,4.6058245,0.0076,4.606772
|
||||
0.009094736842105263,4.6057224,0.007600000000000001,4.6064925
|
||||
0.00911578947368421,4.605707,0.007200000000000001,4.6067533
|
||||
0.009852631578947368,4.605685,0.007200000000000001,4.6068745
|
||||
0.01031578947368421,4.6056952,0.0072,4.6067533
|
||||
0.009789473684210527,4.6057863,0.0072,4.6070247
|
||||
0.01031578947368421,4.6056023,0.0064,4.607134
|
||||
0.010189473684210526,4.605698,0.0064,4.606934
|
||||
0.009957894736842107,4.605643,0.006400000000000001,4.6068535
|
||||
0.009452631578947369,4.605595,0.0064,4.6070676
|
||||
0.009368421052631578,4.6057224,0.008,4.6070356
|
||||
0.010210526315789474,4.6056094,0.009600000000000001,4.6070833
|
||||
0.009557894736842105,4.6056895,0.0076,4.6069493
|
||||
0.009600000000000001,4.605709,0.008400000000000001,4.60693
|
||||
0.00985263157894737,4.6055284,0.0084,4.6068263
|
||||
0.009200000000000002,4.60564,0.0076,4.6071053
|
||||
0.009031578947368422,4.6056323,0.008400000000000001,4.606731
|
||||
0.009663157894736842,4.60559,0.0068,4.6069546
|
||||
0.008484210526315789,4.605676,0.009600000000000001,4.6063976
|
||||
0.0096,4.605595,0.011200000000000002,4.6067076
|
||||
0.00951578947368421,4.605619,0.0096,4.6068506
|
||||
0.009242105263157895,4.6056657,0.0072,4.6067576
|
||||
0.009326315789473684,4.6055913,0.012,4.6070724
|
||||
0.01023157894736842,4.605646,0.012000000000000002,4.6066885
|
||||
0.009494736842105262,4.605563,0.0072,4.6067305
|
||||
0.009810526315789474,4.6055746,0.007200000000000001,4.6067824
|
||||
0.010147368421052632,4.605596,0.0072,4.607214
|
||||
0.009536842105263156,4.6055007,0.007200000000000001,4.607186
|
||||
0.009452631578947369,4.605547,0.0072,4.607297
|
||||
0.009578947368421055,4.6055694,0.0072,4.607313
|
||||
0.009410526315789475,4.6055374,0.0072,4.60726
|
||||
0.00985263157894737,4.605587,0.0072,4.6072307
|
||||
0.009389473684210526,4.605559,0.0072,4.607227
|
||||
0.009852631578947368,4.6055884,0.008,4.6070976
|
||||
0.008968421052631579,4.6055803,0.008,4.607156
|
||||
0.009536842105263158,4.605502,0.0076,4.6073594
|
||||
0.009410526315789473,4.6055517,0.008,4.607176
|
||||
0.01,4.6055126,0.006400000000000001,4.606937
|
||||
0.009915789473684213,4.6055126,0.008,4.607185
|
||||
0.009305263157894737,4.605594,0.0064,4.606834
|
||||
0.009326315789473684,4.6054907,0.008,4.6070714
|
||||
0.009094736842105263,4.6055007,0.0076,4.6068645
|
||||
0.009052631578947368,4.6055903,0.008400000000000001,4.606755
|
||||
0.010294736842105263,4.605449,0.008,4.6068816
|
||||
0.009578947368421055,4.6054883,0.0064,4.6067166
|
||||
0.009452631578947369,4.60552,0.01,4.6066008
|
||||
0.008821052631578948,4.6054573,0.009600000000000001,4.6065955
|
||||
0.008968421052631579,4.605544,0.008,4.6063676
|
||||
0.010147368421052632,4.605516,0.0064,4.6068606
|
||||
0.009600000000000001,4.6054597,0.0096,4.6072354
|
||||
0.01008421052631579,4.605526,0.0076,4.6074166
|
||||
0.010126315789473685,4.6054554,0.0076,4.6074657
|
||||
0.009705263157894736,4.6054635,0.0088,4.607237
|
||||
0.009726315789473684,4.605516,0.007200000000000001,4.606978
|
||||
0.009894736842105262,4.6054883,0.0072,4.607135
|
||||
0.009663157894736842,4.605501,0.007200000000000001,4.607015
|
||||
0.00976842105263158,4.605536,0.008,4.6073785
|
||||
0.009473684210526316,4.6055303,0.009600000000000001,4.6070166
|
||||
0.009347368421052632,4.6054993,0.0076,4.607084
|
||||
0.009178947368421054,4.6054535,0.0084,4.6070604
|
||||
0.008842105263157892,4.605507,0.0076,4.6069884
|
||||
0.009726315789473684,4.6055107,0.007599999999999999,4.6069903
|
||||
0.009536842105263156,4.6054244,0.0084,4.6070695
|
||||
0.009452631578947369,4.605474,0.0072,4.607035
|
||||
0.009621052631578949,4.605444,0.0076,4.6071277
|
||||
0.010084210526315791,4.6054263,0.0076,4.6071534
|
||||
0.009326315789473686,4.605477,0.0088,4.607115
|
||||
0.009010526315789472,4.60548,0.0076,4.6072206
|
||||
0.010042105263157897,4.605475,0.0076,4.607185
|
||||
0.00976842105263158,4.6054463,0.008400000000000001,4.6071196
|
||||
0.01,4.605421,0.008,4.6069384
|
||||
0.009536842105263156,4.605482,0.008,4.607035
|
||||
0.009915789473684213,4.6054354,0.008,4.6071534
|
||||
0.010042105263157894,4.6054177,0.007200000000000001,4.607074
|
||||
0.009242105263157895,4.605473,0.0072,4.606825
|
||||
0.009726315789473684,4.6054006,0.0072,4.606701
|
||||
0.009684210526315788,4.6054583,0.0104,4.606925
|
||||
0.009642105263157895,4.6054606,0.0104,4.6068645
|
||||
0.00936842105263158,4.605405,0.0076,4.606976
|
||||
0.009263157894736843,4.605455,0.0076,4.606981
|
||||
0.00905263157894737,4.6054463,0.0092,4.6070757
|
||||
0.009915789473684213,4.605465,0.0068000000000000005,4.607151
|
||||
0.009389473684210526,4.605481,0.008400000000000001,4.606995
|
||||
0.009789473684210527,4.605436,0.0068000000000000005,4.6071105
|
||||
0.010273684210526315,4.605466,0.007200000000000001,4.606909
|
||||
0.009789473684210527,4.605443,0.0072,4.6066866
|
||||
0.009957894736842107,4.6053886,0.0076,4.606541
|
||||
0.010168421052631578,4.605481,0.006400000000000001,4.606732
|
||||
0.009242105263157894,4.605444,0.006400000000000001,4.606939
|
||||
0.009621052631578949,4.6054454,0.008,4.606915
|
||||
0.00976842105263158,4.60547,0.0076,4.6068935
|
||||
0.009873684210526316,4.6055245,0.0064,4.6072345
|
|
2
report/VGG38_default/result_outputs/test_summary.csv
Normal file
@ -0,0 +1,2 @@
test_acc,test_loss
0.01,4.6053004
1
report/additional-latex-files/README.txt
Normal file
@ -0,0 +1 @@
Most reasonable LaTeX distributions should have no problem building the document from what is in the provided LaTeX source directory. However, certain LaTeX distributions are missing certain files, and they are included in this directory. If you get an error message when you build the LaTeX document saying one of these files is missing, then move the relevant file into your LaTeX source directory.
79
report/additional-latex-files/algorithm.sty
Normal file
@ -0,0 +1,79 @@
|
||||
% ALGORITHM STYLE -- Released 8 April 1996
|
||||
% for LaTeX-2e
|
||||
% Copyright -- 1994 Peter Williams
|
||||
% E-mail Peter.Williams@dsto.defence.gov.au
|
||||
\NeedsTeXFormat{LaTeX2e}
|
||||
\ProvidesPackage{algorithm}
|
||||
\typeout{Document Style `algorithm' - floating environment}
|
||||
|
||||
\RequirePackage{float}
|
||||
\RequirePackage{ifthen}
|
||||
\newcommand{\ALG@within}{nothing}
|
||||
\newboolean{ALG@within}
|
||||
\setboolean{ALG@within}{false}
|
||||
\newcommand{\ALG@floatstyle}{ruled}
|
||||
\newcommand{\ALG@name}{Algorithm}
|
||||
\newcommand{\listalgorithmname}{List of \ALG@name s}
|
||||
|
||||
% Declare Options
|
||||
% first appearance
|
||||
\DeclareOption{plain}{
|
||||
\renewcommand{\ALG@floatstyle}{plain}
|
||||
}
|
||||
\DeclareOption{ruled}{
|
||||
\renewcommand{\ALG@floatstyle}{ruled}
|
||||
}
|
||||
\DeclareOption{boxed}{
|
||||
\renewcommand{\ALG@floatstyle}{boxed}
|
||||
}
|
||||
% then numbering convention
|
||||
\DeclareOption{part}{
|
||||
\renewcommand{\ALG@within}{part}
|
||||
\setboolean{ALG@within}{true}
|
||||
}
|
||||
\DeclareOption{chapter}{
|
||||
\renewcommand{\ALG@within}{chapter}
|
||||
\setboolean{ALG@within}{true}
|
||||
}
|
||||
\DeclareOption{section}{
|
||||
\renewcommand{\ALG@within}{section}
|
||||
\setboolean{ALG@within}{true}
|
||||
}
|
||||
\DeclareOption{subsection}{
|
||||
\renewcommand{\ALG@within}{subsection}
|
||||
\setboolean{ALG@within}{true}
|
||||
}
|
||||
\DeclareOption{subsubsection}{
|
||||
\renewcommand{\ALG@within}{subsubsection}
|
||||
\setboolean{ALG@within}{true}
|
||||
}
|
||||
\DeclareOption{nothing}{
|
||||
\renewcommand{\ALG@within}{nothing}
|
||||
\setboolean{ALG@within}{true}
|
||||
}
|
||||
\DeclareOption*{\edef\ALG@name{\CurrentOption}}
|
||||
|
||||
% ALGORITHM
|
||||
%
|
||||
\ProcessOptions
|
||||
\floatstyle{\ALG@floatstyle}
|
||||
\ifthenelse{\boolean{ALG@within}}{
|
||||
\ifthenelse{\equal{\ALG@within}{part}}
|
||||
{\newfloat{algorithm}{htbp}{loa}[part]}{}
|
||||
\ifthenelse{\equal{\ALG@within}{chapter}}
|
||||
{\newfloat{algorithm}{htbp}{loa}[chapter]}{}
|
||||
\ifthenelse{\equal{\ALG@within}{section}}
|
||||
{\newfloat{algorithm}{htbp}{loa}[section]}{}
|
||||
\ifthenelse{\equal{\ALG@within}{subsection}}
|
||||
{\newfloat{algorithm}{htbp}{loa}[subsection]}{}
|
||||
\ifthenelse{\equal{\ALG@within}{subsubsection}}
|
||||
{\newfloat{algorithm}{htbp}{loa}[subsubsection]}{}
|
||||
\ifthenelse{\equal{\ALG@within}{nothing}}
|
||||
{\newfloat{algorithm}{htbp}{loa}}{}
|
||||
}{
|
||||
\newfloat{algorithm}{htbp}{loa}
|
||||
}
|
||||
\floatname{algorithm}{\ALG@name}
|
||||
|
||||
\newcommand{\listofalgorithms}{\listof{algorithm}{\listalgorithmname}}
|
||||
|
201
report/additional-latex-files/algorithmic.sty
Normal file
@ -0,0 +1,201 @@
|
||||
% ALGORITHMIC STYLE -- Released 8 APRIL 1996
|
||||
% for LaTeX version 2e
|
||||
% Copyright -- 1994 Peter Williams
|
||||
% E-mail PeterWilliams@dsto.defence.gov.au
|
||||
%
|
||||
% Modified by Alex Smola (08/2000)
|
||||
% E-mail Alex.Smola@anu.edu.au
|
||||
%
|
||||
\NeedsTeXFormat{LaTeX2e}
|
||||
\ProvidesPackage{algorithmic}
|
||||
\typeout{Document Style `algorithmic' - environment}
|
||||
%
|
||||
\RequirePackage{ifthen}
|
||||
\RequirePackage{calc}
|
||||
\newboolean{ALC@noend}
|
||||
\setboolean{ALC@noend}{false}
|
||||
\newcounter{ALC@line}
|
||||
\newcounter{ALC@rem}
|
||||
\newlength{\ALC@tlm}
|
||||
%
|
||||
\DeclareOption{noend}{\setboolean{ALC@noend}{true}}
|
||||
%
|
||||
\ProcessOptions
|
||||
%
|
||||
% ALGORITHMIC
|
||||
\newcommand{\algorithmicrequire}{\textbf{Require:}}
|
||||
\newcommand{\algorithmicensure}{\textbf{Ensure:}}
|
||||
\newcommand{\algorithmiccomment}[1]{\{#1\}}
|
||||
\newcommand{\algorithmicend}{\textbf{end}}
|
||||
\newcommand{\algorithmicif}{\textbf{if}}
|
||||
\newcommand{\algorithmicthen}{\textbf{then}}
|
||||
\newcommand{\algorithmicelse}{\textbf{else}}
|
||||
\newcommand{\algorithmicelsif}{\algorithmicelse\ \algorithmicif}
|
||||
\newcommand{\algorithmicendif}{\algorithmicend\ \algorithmicif}
|
||||
\newcommand{\algorithmicfor}{\textbf{for}}
|
||||
\newcommand{\algorithmicforall}{\textbf{for all}}
|
||||
\newcommand{\algorithmicdo}{\textbf{do}}
|
||||
\newcommand{\algorithmicendfor}{\algorithmicend\ \algorithmicfor}
|
||||
\newcommand{\algorithmicwhile}{\textbf{while}}
|
||||
\newcommand{\algorithmicendwhile}{\algorithmicend\ \algorithmicwhile}
|
||||
\newcommand{\algorithmicloop}{\textbf{loop}}
|
||||
\newcommand{\algorithmicendloop}{\algorithmicend\ \algorithmicloop}
|
||||
\newcommand{\algorithmicrepeat}{\textbf{repeat}}
|
||||
\newcommand{\algorithmicuntil}{\textbf{until}}
|
||||
|
||||
%changed by alex smola
|
||||
\newcommand{\algorithmicinput}{\textbf{input}}
|
||||
\newcommand{\algorithmicoutput}{\textbf{output}}
|
||||
\newcommand{\algorithmicset}{\textbf{set}}
|
||||
\newcommand{\algorithmictrue}{\textbf{true}}
|
||||
\newcommand{\algorithmicfalse}{\textbf{false}}
|
||||
\newcommand{\algorithmicand}{\textbf{and\ }}
|
||||
\newcommand{\algorithmicor}{\textbf{or\ }}
|
||||
\newcommand{\algorithmicfunction}{\textbf{function}}
|
||||
\newcommand{\algorithmicendfunction}{\algorithmicend\ \algorithmicfunction}
|
||||
\newcommand{\algorithmicmain}{\textbf{main}}
|
||||
\newcommand{\algorithmicendmain}{\algorithmicend\ \algorithmicmain}
|
||||
%end changed by alex smola
|
||||
|
||||
\def\ALC@item[#1]{%
|
||||
\if@noparitem \@donoparitem
|
||||
\else \if@inlabel \indent \par \fi
|
||||
\ifhmode \unskip\unskip \par \fi
|
||||
\if@newlist \if@nobreak \@nbitem \else
|
||||
\addpenalty\@beginparpenalty
|
||||
\addvspace\@topsep \addvspace{-\parskip}\fi
|
||||
\else \addpenalty\@itempenalty \addvspace\itemsep
|
||||
\fi
|
||||
\global\@inlabeltrue
|
||||
\fi
|
||||
\everypar{\global\@minipagefalse\global\@newlistfalse
|
||||
\if@inlabel\global\@inlabelfalse \hskip -\parindent \box\@labels
|
||||
\penalty\z@ \fi
|
||||
\everypar{}}\global\@nobreakfalse
|
||||
\if@noitemarg \@noitemargfalse \if@nmbrlist \refstepcounter{\@listctr}\fi \fi
|
||||
\sbox\@tempboxa{\makelabel{#1}}%
|
||||
\global\setbox\@labels
|
||||
\hbox{\unhbox\@labels \hskip \itemindent
|
||||
\hskip -\labelwidth \hskip -\ALC@tlm
|
||||
\ifdim \wd\@tempboxa >\labelwidth
|
||||
\box\@tempboxa
|
||||
\else \hbox to\labelwidth {\unhbox\@tempboxa}\fi
|
||||
\hskip \ALC@tlm}\ignorespaces}
|
||||
%
|
||||
\newenvironment{algorithmic}[1][0]{
|
||||
\let\@item\ALC@item
|
||||
\newcommand{\ALC@lno}{%
|
||||
\ifthenelse{\equal{\arabic{ALC@rem}}{0}}
|
||||
{{\footnotesize \arabic{ALC@line}:}}{}%
|
||||
}
|
||||
\let\@listii\@listi
|
||||
\let\@listiii\@listi
|
||||
\let\@listiv\@listi
|
||||
\let\@listv\@listi
|
||||
\let\@listvi\@listi
|
||||
\let\@listvii\@listi
|
||||
\newenvironment{ALC@g}{
|
||||
\begin{list}{\ALC@lno}{ \itemsep\z@ \itemindent\z@
|
||||
\listparindent\z@ \rightmargin\z@
|
||||
\topsep\z@ \partopsep\z@ \parskip\z@\parsep\z@
|
||||
\leftmargin 1em
|
||||
\addtolength{\ALC@tlm}{\leftmargin}
|
||||
}
|
||||
}
|
||||
{\end{list}}
|
||||
\newcommand{\ALC@it}{\addtocounter{ALC@line}{1}\addtocounter{ALC@rem}{1}\ifthenelse{\equal{\arabic{ALC@rem}}{#1}}{\setcounter{ALC@rem}{0}}{}\item}
|
||||
\newcommand{\ALC@com}[1]{\ifthenelse{\equal{##1}{default}}%
|
||||
{}{\ \algorithmiccomment{##1}}}
|
||||
\newcommand{\REQUIRE}{\item[\algorithmicrequire]}
|
||||
\newcommand{\ENSURE}{\item[\algorithmicensure]}
|
||||
\newcommand{\STATE}{\ALC@it}
|
||||
\newcommand{\COMMENT}[1]{\algorithmiccomment{##1}}
|
||||
%changes by alex smola
|
||||
\newcommand{\INPUT}{\item[\algorithmicinput]}
|
||||
\newcommand{\OUTPUT}{\item[\algorithmicoutput]}
|
||||
\newcommand{\SET}{\item[\algorithmicset]}
|
||||
% \newcommand{\TRUE}{\algorithmictrue}
|
||||
% \newcommand{\FALSE}{\algorithmicfalse}
|
||||
\newcommand{\AND}{\algorithmicand}
|
||||
\newcommand{\OR}{\algorithmicor}
|
||||
\newenvironment{ALC@func}{\begin{ALC@g}}{\end{ALC@g}}
|
||||
\newenvironment{ALC@main}{\begin{ALC@g}}{\end{ALC@g}}
|
||||
%end changes by alex smola
|
||||
\newenvironment{ALC@if}{\begin{ALC@g}}{\end{ALC@g}}
|
||||
\newenvironment{ALC@for}{\begin{ALC@g}}{\end{ALC@g}}
|
||||
\newenvironment{ALC@whl}{\begin{ALC@g}}{\end{ALC@g}}
|
||||
\newenvironment{ALC@loop}{\begin{ALC@g}}{\end{ALC@g}}
|
||||
\newenvironment{ALC@rpt}{\begin{ALC@g}}{\end{ALC@g}}
|
||||
\renewcommand{\\}{\@centercr}
|
||||
\newcommand{\IF}[2][default]{\ALC@it\algorithmicif\ ##2\ \algorithmicthen%
|
||||
\ALC@com{##1}\begin{ALC@if}}
|
||||
\newcommand{\SHORTIF}[2]{\ALC@it\algorithmicif\ ##1\
|
||||
\algorithmicthen\ {##2}}
|
||||
\newcommand{\ELSE}[1][default]{\end{ALC@if}\ALC@it\algorithmicelse%
|
||||
\ALC@com{##1}\begin{ALC@if}}
|
||||
\newcommand{\ELSIF}[2][default]%
|
||||
{\end{ALC@if}\ALC@it\algorithmicelsif\ ##2\ \algorithmicthen%
|
||||
\ALC@com{##1}\begin{ALC@if}}
|
||||
\newcommand{\FOR}[2][default]{\ALC@it\algorithmicfor\ ##2\ \algorithmicdo%
|
||||
\ALC@com{##1}\begin{ALC@for}}
|
||||
\newcommand{\FORALL}[2][default]{\ALC@it\algorithmicforall\ ##2\ %
|
||||
\algorithmicdo%
|
||||
\ALC@com{##1}\begin{ALC@for}}
|
||||
\newcommand{\SHORTFORALL}[2]{\ALC@it\algorithmicforall\ ##1\ %
|
||||
\algorithmicdo\ {##2}}
|
||||
\newcommand{\WHILE}[2][default]{\ALC@it\algorithmicwhile\ ##2\ %
|
||||
\algorithmicdo%
|
||||
\ALC@com{##1}\begin{ALC@whl}}
|
||||
\newcommand{\LOOP}[1][default]{\ALC@it\algorithmicloop%
|
||||
\ALC@com{##1}\begin{ALC@loop}}
|
||||
%changed by alex smola
|
||||
\newcommand{\FUNCTION}[2][default]{\ALC@it\algorithmicfunction\ ##2\ %
|
||||
\ALC@com{##1}\begin{ALC@func}}
|
||||
\newcommand{\MAIN}[2][default]{\ALC@it\algorithmicmain\ ##2\ %
|
||||
\ALC@com{##1}\begin{ALC@main}}
|
||||
%end changed by alex smola
|
||||
\newcommand{\REPEAT}[1][default]{\ALC@it\algorithmicrepeat%
|
||||
\ALC@com{##1}\begin{ALC@rpt}}
|
||||
\newcommand{\UNTIL}[1]{\end{ALC@rpt}\ALC@it\algorithmicuntil\ ##1}
|
||||
\ifthenelse{\boolean{ALC@noend}}{
|
||||
\newcommand{\ENDIF}{\end{ALC@if}}
|
||||
\newcommand{\ENDFOR}{\end{ALC@for}}
|
||||
\newcommand{\ENDWHILE}{\end{ALC@whl}}
|
||||
\newcommand{\ENDLOOP}{\end{ALC@loop}}
|
||||
\newcommand{\ENDFUNCTION}{\end{ALC@func}}
|
||||
\newcommand{\ENDMAIN}{\end{ALC@main}}
|
||||
}{
|
||||
\newcommand{\ENDIF}{\end{ALC@if}\ALC@it\algorithmicendif}
|
||||
\newcommand{\ENDFOR}{\end{ALC@for}\ALC@it\algorithmicendfor}
|
||||
\newcommand{\ENDWHILE}{\end{ALC@whl}\ALC@it\algorithmicendwhile}
|
||||
\newcommand{\ENDLOOP}{\end{ALC@loop}\ALC@it\algorithmicendloop}
|
||||
\newcommand{\ENDFUNCTION}{\end{ALC@func}\ALC@it\algorithmicendfunction}
|
||||
\newcommand{\ENDMAIN}{\end{ALC@main}\ALC@it\algorithmicendmain}
|
||||
}
|
||||
\renewcommand{\@toodeep}{}
|
||||
\begin{list}{\ALC@lno}{\setcounter{ALC@line}{0}\setcounter{ALC@rem}{0}%
|
||||
\itemsep\z@ \itemindent\z@ \listparindent\z@%
|
||||
\partopsep\z@ \parskip\z@ \parsep\z@%
|
||||
\labelsep 0.5em \topsep 0.2em%
|
||||
\ifthenelse{\equal{#1}{0}}
|
||||
{\labelwidth 0.5em }
|
||||
{\labelwidth 1.2em }
|
||||
\leftmargin\labelwidth \addtolength{\leftmargin}{\labelsep}
|
||||
\ALC@tlm\labelsep
|
||||
}
|
||||
}
|
||||
{\end{list}}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
485
report/additional-latex-files/fancyhdr.sty
Normal file
@ -0,0 +1,485 @@
|
||||
% fancyhdr.sty version 3.2
|
||||
% Fancy headers and footers for LaTeX.
|
||||
% Piet van Oostrum,
|
||||
% Dept of Computer and Information Sciences, University of Utrecht,
|
||||
% Padualaan 14, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
|
||||
% Telephone: +31 30 2532180. Email: piet@cs.uu.nl
|
||||
% ========================================================================
|
||||
% LICENCE:
|
||||
% This file may be distributed under the terms of the LaTeX Project Public
|
||||
% License, as described in lppl.txt in the base LaTeX distribution.
|
||||
% Either version 1 or, at your option, any later version.
|
||||
% ========================================================================
|
||||
% MODIFICATION HISTORY:
|
||||
% Sep 16, 1994
|
||||
% version 1.4: Correction for use with \reversemargin
|
||||
% Sep 29, 1994:
|
||||
% version 1.5: Added the \iftopfloat, \ifbotfloat and \iffloatpage commands
|
||||
% Oct 4, 1994:
|
||||
% version 1.6: Reset single spacing in headers/footers for use with
|
||||
% setspace.sty or doublespace.sty
|
||||
% Oct 4, 1994:
|
||||
% version 1.7: changed \let\@mkboth\markboth to
|
||||
% \def\@mkboth{\protect\markboth} to make it more robust
|
||||
% Dec 5, 1994:
|
||||
% version 1.8: corrections for amsbook/amsart: define \@chapapp and (more
|
||||
% importantly) use the \chapter/sectionmark definitions from ps@headings if
|
||||
% they exist (which should be true for all standard classes).
|
||||
% May 31, 1995:
|
||||
% version 1.9: The proposed \renewcommand{\headrulewidth}{\iffloatpage...
|
||||
% construction in the doc did not work properly with the fancyplain style.
|
||||
% June 1, 1995:
|
||||
% version 1.91: The definition of \@mkboth wasn't restored on subsequent
|
||||
% \pagestyle{fancy}'s.
|
||||
% June 1, 1995:
|
||||
% version 1.92: The sequence \pagestyle{fancyplain} \pagestyle{plain}
|
||||
% \pagestyle{fancy} would erroneously select the plain version.
|
||||
% June 1, 1995:
|
||||
% version 1.93: \fancypagestyle command added.
|
||||
% Dec 11, 1995:
|
||||
% version 1.94: suggested by Conrad Hughes <chughes@maths.tcd.ie>
|
||||
% CJCH, Dec 11, 1995: added \footruleskip to allow control over footrule
|
||||
% position (old hardcoded value of .3\normalbaselineskip is far too high
|
||||
% when used with very small footer fonts).
|
||||
% Jan 31, 1996:
|
||||
% version 1.95: call \@normalsize in the reset code if that is defined,
|
||||
% otherwise \normalsize.
|
||||
% this is to solve a problem with ucthesis.cls, as this doesn't
|
||||
% define \@currsize. Unfortunately for latex209 calling \normalsize doesn't
|
||||
% work as this is optimized to do very little, so there \@normalsize should
|
||||
% be called. Hopefully this code works for all versions of LaTeX known to
|
||||
% mankind.
|
||||
% April 25, 1996:
|
||||
% version 1.96: initialize \headwidth to a magic (negative) value to catch
|
||||
% most common cases that people change it before calling \pagestyle{fancy}.
|
||||
% Note it can't be initialized when reading in this file, because
|
||||
% \textwidth could be changed afterwards. This is quite probable.
|
||||
% We also switch to \MakeUppercase rather than \uppercase and introduce a
|
||||
% \nouppercase command for use in headers. and footers.
|
||||
% May 3, 1996:
|
||||
% version 1.97: Two changes:
|
||||
% 1. Undo the change in version 1.8 (using the pagestyle{headings} defaults
|
||||
% for the chapter and section marks. The current version of amsbook and
|
||||
% amsart classes don't seem to need them anymore. Moreover the standard
|
||||
% latex classes don't use \markboth if twoside isn't selected, and this is
|
||||
% confusing as \leftmark doesn't work as expected.
|
||||
% 2. include a call to \ps@empty in ps@@fancy. This is to solve a problem
|
||||
% in the amsbook and amsart classes, that make global changes to \topskip,
|
||||
% which are reset in \ps@empty. Hopefully this doesn't break other things.
|
||||
% May 7, 1996:
|
||||
% version 1.98:
|
||||
% Added % after the line \def\nouppercase
|
||||
% May 7, 1996:
|
||||
% version 1.99: This is the alpha version of fancyhdr 2.0
|
||||
% Introduced the new commands \fancyhead, \fancyfoot, and \fancyhf.
|
||||
% Changed \headrulewidth, \footrulewidth, \footruleskip to
|
||||
% macros rather than length parameters, In this way they can be
|
||||
% conditionalized and they don't consume length registers. There is no need
|
||||
% to have them as length registers unless you want to do calculations with
|
||||
% them, which is unlikely. Note that this may make some uses of them
|
||||
% incompatible (i.e. if you have a file that uses \setlength or \xxxx=)
|
||||
% May 10, 1996:
|
||||
% version 1.99a:
|
||||
% Added a few more % signs
|
||||
% May 10, 1996:
|
||||
% version 1.99b:
|
||||
% Changed the syntax of \f@nfor to be resistent to catcode changes of :=
|
||||
% Removed the [1] from the defs of \lhead etc. because the parameter is
|
||||
% consumed by the \@[xy]lhead etc. macros.
|
||||
% June 24, 1997:
|
||||
% version 1.99c:
|
||||
% corrected \nouppercase to also include the protected form of \MakeUppercase
|
||||
% \global added to manipulation of \headwidth.
|
||||
% \iffootnote command added.
|
||||
% Some comments added about \@fancyhead and \@fancyfoot.
|
||||
% Aug 24, 1998
|
||||
% version 1.99d
|
||||
% Changed the default \ps@empty to \ps@@empty in order to allow
|
||||
% \fancypagestyle{empty} redefinition.
|
||||
% Oct 11, 2000
|
||||
% version 2.0
|
||||
% Added LPPL license clause.
|
||||
%
|
||||
% A check for \headheight is added. An errormessage is given (once) if the
|
||||
% header is too large. Empty headers don't generate the error even if
|
||||
% \headheight is very small or even 0pt.
|
||||
% Warning added for the use of 'E' option when twoside option is not used.
|
||||
% In this case the 'E' fields will never be used.
|
||||
%
|
||||
% Mar 10, 2002
|
||||
% version 2.1beta
|
||||
% New command: \fancyhfoffset[place]{length}
|
||||
% defines offsets to be applied to the header/footer to let it stick into
|
||||
% the margins (if length > 0).
|
||||
% place is like in fancyhead, except that only E,O,L,R can be used.
|
||||
% This replaces the old calculation based on \headwidth and the marginpar
|
||||
% area.
|
||||
% \headwidth will be dynamically calculated in the headers/footers when
|
||||
% this is used.
|
||||
%
|
||||
% Mar 26, 2002
|
||||
% version 2.1beta2
|
||||
% \fancyhfoffset now also takes h,f as possible letters in the argument to
|
||||
% allow the header and footer widths to be different.
|
||||
% New commands \fancyheadoffset and \fancyfootoffset added comparable to
|
||||
% \fancyhead and \fancyfoot.
|
||||
% Errormessages and warnings have been made more informative.
|
||||
%
|
||||
% Dec 9, 2002
|
||||
% version 2.1
|
||||
% The defaults for \footrulewidth, \plainheadrulewidth and
|
||||
% \plainfootrulewidth are changed from \z@skip to 0pt. In this way when
|
||||
% someone inadvertantly uses \setlength to change any of these, the value
|
||||
% of \z@skip will not be changed, rather an errormessage will be given.
|
||||
|
||||
% March 3, 2004
|
||||
% Release of version 3.0
|
||||
|
||||
% Oct 7, 2004
|
||||
% version 3.1
|
||||
% Added '\endlinechar=13' to \fancy@reset to prevent problems with
|
||||
% includegraphics in header when verbatiminput is active.
|
||||
|
||||
% March 22, 2005
|
||||
% version 3.2
|
||||
% reset \everypar (the real one) in \fancy@reset because spanish.ldf does
|
||||
% strange things with \everypar between << and >>.
|
||||
|
||||
\def\ifancy@mpty#1{\def\temp@a{#1}\ifx\temp@a\@empty}
|
||||
|
||||
\def\fancy@def#1#2{\ifancy@mpty{#2}\fancy@gbl\def#1{\leavevmode}\else
|
||||
\fancy@gbl\def#1{#2\strut}\fi}
|
||||
|
||||
\let\fancy@gbl\global
|
||||
|
||||
\def\@fancyerrmsg#1{%
|
||||
\ifx\PackageError\undefined
|
||||
\errmessage{#1}\else
|
||||
\PackageError{Fancyhdr}{#1}{}\fi}
|
||||
\def\@fancywarning#1{%
|
||||
\ifx\PackageWarning\undefined
|
||||
\errmessage{#1}\else
|
||||
\PackageWarning{Fancyhdr}{#1}{}\fi}
|
||||
|
||||
% Usage: \@forc \var{charstring}{command to be executed for each char}
|
||||
% This is similar to LaTeX's \@tfor, but expands the charstring.
|
||||
|
||||
\def\@forc#1#2#3{\expandafter\f@rc\expandafter#1\expandafter{#2}{#3}}
|
||||
\def\f@rc#1#2#3{\def\temp@ty{#2}\ifx\@empty\temp@ty\else
|
||||
\f@@rc#1#2\f@@rc{#3}\fi}
|
||||
\def\f@@rc#1#2#3\f@@rc#4{\def#1{#2}#4\f@rc#1{#3}{#4}}
|
||||
|
||||
% Usage: \f@nfor\name:=list\do{body}
|
||||
% Like LaTeX's \@for but an empty list is treated as a list with an empty
|
||||
% element
|
||||
|
||||
\newcommand{\f@nfor}[3]{\edef\@fortmp{#2}%
|
||||
\expandafter\@forloop#2,\@nil,\@nil\@@#1{#3}}
|
||||
|
||||
% Usage: \def@ult \cs{defaults}{argument}
|
||||
% sets \cs to the characters from defaults appearing in argument
|
||||
% or defaults if it would be empty. All characters are lowercased.
|
||||
|
||||
\newcommand\def@ult[3]{%
|
||||
\edef\temp@a{\lowercase{\edef\noexpand\temp@a{#3}}}\temp@a
|
||||
\def#1{}%
|
||||
\@forc\tmpf@ra{#2}%
|
||||
{\expandafter\if@in\tmpf@ra\temp@a{\edef#1{#1\tmpf@ra}}{}}%
|
||||
\ifx\@empty#1\def#1{#2}\fi}
|
||||
%
|
||||
% \if@in <char><set><truecase><falsecase>
|
||||
%
|
||||
\newcommand{\if@in}[4]{%
|
||||
\edef\temp@a{#2}\def\temp@b##1#1##2\temp@b{\def\temp@b{##1}}%
|
||||
\expandafter\temp@b#2#1\temp@b\ifx\temp@a\temp@b #4\else #3\fi}
|
||||
|
||||
\newcommand{\fancyhead}{\@ifnextchar[{\f@ncyhf\fancyhead h}%
|
||||
{\f@ncyhf\fancyhead h[]}}
|
||||
\newcommand{\fancyfoot}{\@ifnextchar[{\f@ncyhf\fancyfoot f}%
|
||||
{\f@ncyhf\fancyfoot f[]}}
|
||||
\newcommand{\fancyhf}{\@ifnextchar[{\f@ncyhf\fancyhf{}}%
|
||||
{\f@ncyhf\fancyhf{}[]}}
|
||||
|
||||
% New commands for offsets added
|
||||
|
||||
\newcommand{\fancyheadoffset}{\@ifnextchar[{\f@ncyhfoffs\fancyheadoffset h}%
|
||||
{\f@ncyhfoffs\fancyheadoffset h[]}}
|
||||
\newcommand{\fancyfootoffset}{\@ifnextchar[{\f@ncyhfoffs\fancyfootoffset f}%
|
||||
{\f@ncyhfoffs\fancyfootoffset f[]}}
|
||||
\newcommand{\fancyhfoffset}{\@ifnextchar[{\f@ncyhfoffs\fancyhfoffset{}}%
|
||||
{\f@ncyhfoffs\fancyhfoffset{}[]}}
|
||||
|
||||
% The header and footer fields are stored in command sequences with
|
||||
% names of the form: \f@ncy<x><y><z> with <x> for [eo], <y> from [lcr]
|
||||
% and <z> from [hf].
|
||||
|
||||
\def\f@ncyhf#1#2[#3]#4{%
|
||||
\def\temp@c{}%
|
||||
\@forc\tmpf@ra{#3}%
|
||||
{\expandafter\if@in\tmpf@ra{eolcrhf,EOLCRHF}%
|
||||
{}{\edef\temp@c{\temp@c\tmpf@ra}}}%
|
||||
\ifx\@empty\temp@c\else
|
||||
\@fancyerrmsg{Illegal char `\temp@c' in \string#1 argument:
|
||||
[#3]}%
|
||||
\fi
|
||||
\f@nfor\temp@c{#3}%
|
||||
{\def@ult\f@@@eo{eo}\temp@c
|
||||
\if@twoside\else
|
||||
\if\f@@@eo e\@fancywarning
|
||||
{\string#1's `E' option without twoside option is useless}\fi\fi
|
||||
\def@ult\f@@@lcr{lcr}\temp@c
|
||||
\def@ult\f@@@hf{hf}{#2\temp@c}%
|
||||
\@forc\f@@eo\f@@@eo
|
||||
{\@forc\f@@lcr\f@@@lcr
|
||||
{\@forc\f@@hf\f@@@hf
|
||||
{\expandafter\fancy@def\csname
|
||||
f@ncy\f@@eo\f@@lcr\f@@hf\endcsname
|
||||
{#4}}}}}}
|
||||
|
||||
\def\f@ncyhfoffs#1#2[#3]#4{%
|
||||
\def\temp@c{}%
|
||||
\@forc\tmpf@ra{#3}%
|
||||
{\expandafter\if@in\tmpf@ra{eolrhf,EOLRHF}%
|
||||
{}{\edef\temp@c{\temp@c\tmpf@ra}}}%
|
||||
\ifx\@empty\temp@c\else
|
||||
\@fancyerrmsg{Illegal char `\temp@c' in \string#1 argument:
|
||||
[#3]}%
|
||||
\fi
|
||||
\f@nfor\temp@c{#3}%
|
||||
{\def@ult\f@@@eo{eo}\temp@c
|
||||
\if@twoside\else
|
||||
\if\f@@@eo e\@fancywarning
|
||||
{\string#1's `E' option without twoside option is useless}\fi\fi
|
||||
\def@ult\f@@@lcr{lr}\temp@c
|
||||
\def@ult\f@@@hf{hf}{#2\temp@c}%
|
||||
\@forc\f@@eo\f@@@eo
|
||||
{\@forc\f@@lcr\f@@@lcr
|
||||
{\@forc\f@@hf\f@@@hf
|
||||
{\expandafter\setlength\csname
|
||||
f@ncyO@\f@@eo\f@@lcr\f@@hf\endcsname
|
||||
{#4}}}}}%
|
||||
\fancy@setoffs}
|
||||
|
||||
% Fancyheadings version 1 commands. These are more or less deprecated,
|
||||
% but they continue to work.
|
||||
|
||||
\newcommand{\lhead}{\@ifnextchar[{\@xlhead}{\@ylhead}}
|
||||
\def\@xlhead[#1]#2{\fancy@def\f@ncyelh{#1}\fancy@def\f@ncyolh{#2}}
|
||||
\def\@ylhead#1{\fancy@def\f@ncyelh{#1}\fancy@def\f@ncyolh{#1}}
|
||||
|
||||
\newcommand{\chead}{\@ifnextchar[{\@xchead}{\@ychead}}
|
||||
\def\@xchead[#1]#2{\fancy@def\f@ncyech{#1}\fancy@def\f@ncyoch{#2}}
|
||||
\def\@ychead#1{\fancy@def\f@ncyech{#1}\fancy@def\f@ncyoch{#1}}
|
||||
|
||||
\newcommand{\rhead}{\@ifnextchar[{\@xrhead}{\@yrhead}}
|
||||
\def\@xrhead[#1]#2{\fancy@def\f@ncyerh{#1}\fancy@def\f@ncyorh{#2}}
|
||||
\def\@yrhead#1{\fancy@def\f@ncyerh{#1}\fancy@def\f@ncyorh{#1}}
|
||||
|
||||
\newcommand{\lfoot}{\@ifnextchar[{\@xlfoot}{\@ylfoot}}
|
||||
\def\@xlfoot[#1]#2{\fancy@def\f@ncyelf{#1}\fancy@def\f@ncyolf{#2}}
|
||||
\def\@ylfoot#1{\fancy@def\f@ncyelf{#1}\fancy@def\f@ncyolf{#1}}
|
||||
|
||||
\newcommand{\cfoot}{\@ifnextchar[{\@xcfoot}{\@ycfoot}}
|
||||
\def\@xcfoot[#1]#2{\fancy@def\f@ncyecf{#1}\fancy@def\f@ncyocf{#2}}
|
||||
\def\@ycfoot#1{\fancy@def\f@ncyecf{#1}\fancy@def\f@ncyocf{#1}}
|
||||
|
||||
\newcommand{\rfoot}{\@ifnextchar[{\@xrfoot}{\@yrfoot}}
|
||||
\def\@xrfoot[#1]#2{\fancy@def\f@ncyerf{#1}\fancy@def\f@ncyorf{#2}}
|
||||
\def\@yrfoot#1{\fancy@def\f@ncyerf{#1}\fancy@def\f@ncyorf{#1}}
|
||||
|
||||
\newlength{\fancy@headwidth}
|
||||
\let\headwidth\fancy@headwidth
|
||||
\newlength{\f@ncyO@elh}
|
||||
\newlength{\f@ncyO@erh}
|
||||
\newlength{\f@ncyO@olh}
|
||||
\newlength{\f@ncyO@orh}
|
||||
\newlength{\f@ncyO@elf}
|
||||
\newlength{\f@ncyO@erf}
|
||||
\newlength{\f@ncyO@olf}
|
||||
\newlength{\f@ncyO@orf}
|
||||
\newcommand{\headrulewidth}{0.4pt}
|
||||
\newcommand{\footrulewidth}{0pt}
|
||||
\newcommand{\footruleskip}{.3\normalbaselineskip}
|
||||
|
||||
% Fancyplain stuff shouldn't be used anymore (rather
|
||||
% \fancypagestyle{plain} should be used), but it must be present for
|
||||
% compatibility reasons.
|
||||
|
||||
\newcommand{\plainheadrulewidth}{0pt}
|
||||
\newcommand{\plainfootrulewidth}{0pt}
|
||||
\newif\if@fancyplain \@fancyplainfalse
|
||||
\def\fancyplain#1#2{\if@fancyplain#1\else#2\fi}
|
||||
|
||||
\headwidth=-123456789sp %magic constant
|
||||
|
||||
% Command to reset various things in the headers:
|
||||
% a.o. single spacing (taken from setspace.sty)
|
||||
% and the catcode of ^^M (so that epsf files in the header work if a
|
||||
% verbatim crosses a page boundary)
|
||||
% It also defines a \nouppercase command that disables \uppercase and
|
||||
% \Makeuppercase. It can only be used in the headers and footers.
|
||||
\let\fnch@everypar\everypar% save real \everypar because of spanish.ldf
|
||||
\def\fancy@reset{\fnch@everypar{}\restorecr\endlinechar=13
|
||||
\def\baselinestretch{1}%
|
||||
\def\nouppercase##1{{\let\uppercase\relax\let\MakeUppercase\relax
|
||||
\expandafter\let\csname MakeUppercase \endcsname\relax##1}}%
|
||||
\ifx\undefined\@newbaseline% NFSS not present; 2.09 or 2e
|
||||
\ifx\@normalsize\undefined \normalsize % for ucthesis.cls
|
||||
\else \@normalsize \fi
|
||||
\else% NFSS (2.09) present
|
||||
\@newbaseline%
|
||||
\fi}
|
||||
|
||||
% Initialization of the head and foot text.
|
||||
|
||||
% The default values still contain \fancyplain for compatibility.
|
||||
\fancyhf{} % clear all
|
||||
% lefthead empty on ``plain'' pages, \rightmark on even, \leftmark on odd pages
|
||||
% evenhead empty on ``plain'' pages, \leftmark on even, \rightmark on odd pages
|
||||
\if@twoside
|
||||
\fancyhead[el,or]{\fancyplain{}{\sl\rightmark}}
|
||||
\fancyhead[er,ol]{\fancyplain{}{\sl\leftmark}}
|
||||
\else
|
||||
\fancyhead[l]{\fancyplain{}{\sl\rightmark}}
|
||||
\fancyhead[r]{\fancyplain{}{\sl\leftmark}}
|
||||
\fi
|
||||
\fancyfoot[c]{\rm\thepage} % page number
|
||||
|
||||
% Use box 0 as a temp box and dimen 0 as temp dimen.
|
||||
% This can be done, because this code will always
|
||||
% be used inside another box, and therefore the changes are local.
|
||||
|
||||
\def\@fancyvbox#1#2{\setbox0\vbox{#2}\ifdim\ht0>#1\@fancywarning
|
||||
{\string#1 is too small (\the#1): ^^J Make it at least \the\ht0.^^J
|
||||
We now make it that large for the rest of the document.^^J
|
||||
This may cause the page layout to be inconsistent, however\@gobble}%
|
||||
\dimen0=#1\global\setlength{#1}{\ht0}\ht0=\dimen0\fi
|
||||
\box0}
|
||||
|
||||
% Put together a header or footer given the left, center and
|
||||
% right text, fillers at left and right and a rule.
|
||||
% The \lap commands put the text into an hbox of zero size,
|
||||
% so overlapping text does not generate an errormessage.
|
||||
% These macros have 5 parameters:
|
||||
% 1. LEFTSIDE BEARING % This determines at which side the header will stick
|
||||
% out. When \fancyhfoffset is used this calculates \headwidth, otherwise
|
||||
% it is \hss or \relax (after expansion).
|
||||
% 2. \f@ncyolh, \f@ncyelh, \f@ncyolf or \f@ncyelf. This is the left component.
|
||||
% 3. \f@ncyoch, \f@ncyech, \f@ncyocf or \f@ncyecf. This is the middle comp.
|
||||
% 4. \f@ncyorh, \f@ncyerh, \f@ncyorf or \f@ncyerf. This is the right component.
|
||||
% 5. RIGHTSIDE BEARING. This is always \relax or \hss (after expansion).
|
||||
|
||||
\def\@fancyhead#1#2#3#4#5{#1\hbox to\headwidth{\fancy@reset
|
||||
\@fancyvbox\headheight{\hbox
|
||||
{\rlap{\parbox[b]{\headwidth}{\raggedright#2}}\hfill
|
||||
\parbox[b]{\headwidth}{\centering#3}\hfill
|
||||
\llap{\parbox[b]{\headwidth}{\raggedleft#4}}}\headrule}}#5}
|
||||
|
||||
\def\@fancyfoot#1#2#3#4#5{#1\hbox to\headwidth{\fancy@reset
|
||||
\@fancyvbox\footskip{\footrule
|
||||
\hbox{\rlap{\parbox[t]{\headwidth}{\raggedright#2}}\hfill
|
||||
\parbox[t]{\headwidth}{\centering#3}\hfill
|
||||
\llap{\parbox[t]{\headwidth}{\raggedleft#4}}}}}#5}
|
||||
|
||||
\def\headrule{{\if@fancyplain\let\headrulewidth\plainheadrulewidth\fi
|
||||
\hrule\@height\headrulewidth\@width\headwidth \vskip-\headrulewidth}}
|
||||
|
||||
\def\footrule{{\if@fancyplain\let\footrulewidth\plainfootrulewidth\fi
|
||||
\vskip-\footruleskip\vskip-\footrulewidth
|
||||
\hrule\@width\headwidth\@height\footrulewidth\vskip\footruleskip}}
|
||||
|
||||
\def\ps@fancy{%
|
||||
\@ifundefined{@chapapp}{\let\@chapapp\chaptername}{}%for amsbook
|
||||
%
|
||||
% Define \MakeUppercase for old LaTeXen.
|
||||
% Note: we used \def rather than \let, so that \let\uppercase\relax (from
|
||||
% the version 1 documentation) will still work.
|
||||
%
|
||||
\@ifundefined{MakeUppercase}{\def\MakeUppercase{\uppercase}}{}%
|
||||
\@ifundefined{chapter}{\def\sectionmark##1{\markboth
|
||||
{\MakeUppercase{\ifnum \c@secnumdepth>\z@
|
||||
\thesection\hskip 1em\relax \fi ##1}}{}}%
|
||||
\def\subsectionmark##1{\markright {\ifnum \c@secnumdepth >\@ne
|
||||
\thesubsection\hskip 1em\relax \fi ##1}}}%
|
||||
{\def\chaptermark##1{\markboth {\MakeUppercase{\ifnum \c@secnumdepth>\m@ne
|
||||
\@chapapp\ \thechapter. \ \fi ##1}}{}}%
|
||||
\def\sectionmark##1{\markright{\MakeUppercase{\ifnum \c@secnumdepth >\z@
|
||||
\thesection. \ \fi ##1}}}}%
|
||||
%\csname ps@headings\endcsname % use \ps@headings defaults if they exist
|
||||
\ps@@fancy
|
||||
\gdef\ps@fancy{\@fancyplainfalse\ps@@fancy}%
|
||||
% Initialize \headwidth if the user didn't
|
||||
%
|
||||
\ifdim\headwidth<0sp
|
||||
%
|
||||
% This catches the case that \headwidth hasn't been initialized and the
|
||||
% case that the user added something to \headwidth in the expectation that
|
||||
% it was initialized to \textwidth. We compensate this now. This loses if
|
||||
% the user intended to multiply it by a factor. But that case is more
|
||||
% likely done by saying something like \headwidth=1.2\textwidth.
|
||||
% The doc says you have to change \headwidth after the first call to
|
||||
% \pagestyle{fancy}. This code is just to catch the most common cases were
|
||||
% that requirement is violated.
|
||||
%
|
||||
\global\advance\headwidth123456789sp\global\advance\headwidth\textwidth
|
||||
\fi}
|
||||
\def\ps@fancyplain{\ps@fancy \let\ps@plain\ps@plain@fancy}
|
||||
\def\ps@plain@fancy{\@fancyplaintrue\ps@@fancy}
|
||||
\let\ps@@empty\ps@empty
|
||||
\def\ps@@fancy{%
|
||||
\ps@@empty % This is for amsbook/amsart, which do strange things with \topskip
|
||||
\def\@mkboth{\protect\markboth}%
|
||||
\def\@oddhead{\@fancyhead\fancy@Oolh\f@ncyolh\f@ncyoch\f@ncyorh\fancy@Oorh}%
|
||||
\def\@oddfoot{\@fancyfoot\fancy@Oolf\f@ncyolf\f@ncyocf\f@ncyorf\fancy@Oorf}%
|
||||
\def\@evenhead{\@fancyhead\fancy@Oelh\f@ncyelh\f@ncyech\f@ncyerh\fancy@Oerh}%
|
||||
\def\@evenfoot{\@fancyfoot\fancy@Oelf\f@ncyelf\f@ncyecf\f@ncyerf\fancy@Oerf}%
|
||||
}
|
||||
% Default definitions for compatibility mode:
|
||||
% These cause the header/footer to take the defined \headwidth as width
|
||||
% And to shift in the direction of the marginpar area
|
||||
|
||||
\def\fancy@Oolh{\if@reversemargin\hss\else\relax\fi}
|
||||
\def\fancy@Oorh{\if@reversemargin\relax\else\hss\fi}
|
||||
\let\fancy@Oelh\fancy@Oorh
|
||||
\let\fancy@Oerh\fancy@Oolh
|
||||
|
||||
\let\fancy@Oolf\fancy@Oolh
|
||||
\let\fancy@Oorf\fancy@Oorh
|
||||
\let\fancy@Oelf\fancy@Oelh
|
||||
\let\fancy@Oerf\fancy@Oerh
|
||||
|
||||
% New definitions for the use of \fancyhfoffset
|
||||
% These calculate the \headwidth from \textwidth and the specified offsets.
|
||||
|
||||
\def\fancy@offsolh{\headwidth=\textwidth\advance\headwidth\f@ncyO@olh
|
||||
\advance\headwidth\f@ncyO@orh\hskip-\f@ncyO@olh}
|
||||
\def\fancy@offselh{\headwidth=\textwidth\advance\headwidth\f@ncyO@elh
|
||||
\advance\headwidth\f@ncyO@erh\hskip-\f@ncyO@elh}
|
||||
|
||||
\def\fancy@offsolf{\headwidth=\textwidth\advance\headwidth\f@ncyO@olf
|
||||
\advance\headwidth\f@ncyO@orf\hskip-\f@ncyO@olf}
|
||||
\def\fancy@offself{\headwidth=\textwidth\advance\headwidth\f@ncyO@elf
|
||||
\advance\headwidth\f@ncyO@erf\hskip-\f@ncyO@elf}
|
||||
|
||||
\def\fancy@setoffs{%
|
||||
% Just in case \let\headwidth\textwidth was used
|
||||
\fancy@gbl\let\headwidth\fancy@headwidth
|
||||
\fancy@gbl\let\fancy@Oolh\fancy@offsolh
|
||||
\fancy@gbl\let\fancy@Oelh\fancy@offselh
|
||||
\fancy@gbl\let\fancy@Oorh\hss
|
||||
\fancy@gbl\let\fancy@Oerh\hss
|
||||
\fancy@gbl\let\fancy@Oolf\fancy@offsolf
|
||||
\fancy@gbl\let\fancy@Oelf\fancy@offself
|
||||
\fancy@gbl\let\fancy@Oorf\hss
|
||||
\fancy@gbl\let\fancy@Oerf\hss}
|
||||
|
||||
\newif\iffootnote
|
||||
\let\latex@makecol\@makecol
|
||||
\def\@makecol{\ifvoid\footins\footnotetrue\else\footnotefalse\fi
|
||||
\let\topfloat\@toplist\let\botfloat\@botlist\latex@makecol}
|
||||
\def\iftopfloat#1#2{\ifx\topfloat\empty #2\else #1\fi}
|
||||
\def\ifbotfloat#1#2{\ifx\botfloat\empty #2\else #1\fi}
|
||||
\def\iffloatpage#1#2{\if@fcolmade #1\else #2\fi}
|
||||
|
||||
\newcommand{\fancypagestyle}[2]{%
|
||||
\@namedef{ps@#1}{\let\fancy@gbl\relax#2\relax\ps@fancy}}
|
1246  report/additional-latex-files/natbib.sty  Normal file
BIN   report/figures/VGG38_BN_RC_acc.pdf  Normal file
BIN   report/figures/VGG38_BN_RC_accuracy_performance.pdf  Normal file
BIN   report/figures/VGG38_BN_RC_loss.pdf  Normal file
BIN   report/figures/VGG38_BN_RC_loss_performance.pdf  Normal file
BIN   report/figures/accuracy_plot.pdf  Normal file
BIN   report/figures/grad_flow_vgg08.pdf  Normal file
BIN   report/figures/gradplot_38.pdf  Normal file
BIN   report/figures/gradplot_38_bn.pdf  Normal file
BIN   report/figures/gradplot_38_bn_rc.pdf  Normal file
BIN   report/figures/gradplot_38_watermarked.pdf  Normal file
BIN   report/figures/gradplot_38bnrc.pdf  Normal file
BIN   report/figures/loss_plot.pdf  Normal file
1441  report/icml2017.bst  Normal file
176   report/mlp-cw2-questions.tex  Normal file
@ -0,0 +1,176 @@
|
||||
%% REPLACE sXXXXXXX with your student number
|
||||
\def\studentNumber{s2759177}
|
||||
|
||||
|
||||
%% START of YOUR ANSWERS
|
||||
%% Add answers to the questions below, by replacing the text inside the brackets {} for \youranswer{ "Text to be replaced with your answer." }.
|
||||
%
|
||||
% Do not delete the commands for adding figures and tables. Instead fill in the missing values with your experiment results, and replace the images with your own respective figures.
|
||||
%
|
||||
% You can generally delete the placeholder text, such as for example the text "Question Figure 3 - Replace the images ..."
|
||||
%
|
||||
% There are 5 TEXT QUESTIONS. Replace the text inside the brackets of the command \youranswer with your answer to the question.
|
||||
%
|
||||
% There are also 3 "questions" to replace some placeholder FIGURES with your own, and 1 "question" asking you to fill in the missing entries in the TABLE provided.
|
||||
%
|
||||
% NOTE! that questions are ordered by the order of appearance of their answers in the text, and not necessarily by the order you should tackle them. You should attempt to fill in the TABLE and FIGURES before discussing the results presented there.
|
||||
%
|
||||
% NOTE! If for some reason you do not manage to produce results for some FIGURES and the TABLE, then you can get partial marks by discussing your expectations of the results in the relevant TEXT QUESTIONS. The TABLE specifically has enough information in it already for you to draw meaningful conclusions.
|
||||
%
|
||||
% Please refer to the coursework specification for more details.
|
||||
|
||||
|
||||
%% - - - - - - - - - - - - TEXT QUESTIONS - - - - - - - - - - - -
|
||||
|
||||
%% Question 1:
|
||||
% Use Figures 1, 2, and 3 to identify the Vanishing Gradient Problem (which of these model suffers from it, and what are the consequences depicted?).
|
||||
% The average length for an answer to this question is approximately 1/5 of the columns in a 2-column page}
|
||||
|
||||
\newcommand{\questionOne} {
|
||||
\youranswer{
The 8-layer network learns (even though it does not reach high accuracy), whereas the 38-layer network fails to learn: its gradients vanish almost entirely in the earlier layers. This is evident in Figure 3, where the gradients in VGG38 are close to zero for all but the last few layers, preventing effective weight updates during backpropagation. Consequently, the deeper network cannot extract meaningful features or reduce its loss, and both training and validation performance stagnate.

We conclude that VGG08 trains as expected, while VGG38 suffers from the vanishing gradient problem: its gradients shrink to near zero in the early layers, impeding weight updates and preventing the network from learning meaningful features. This nullifies the advantage of its greater depth, as reflected in its flat loss and accuracy curves throughout training. In stark contrast, VGG08 maintains a healthy gradient flow across layers, allowing effective weight updates and enabling the network to learn features, reduce its loss, and improve its accuracy despite its smaller depth.
}
|
||||
}
|
||||
|
||||
%% Question 2:
|
||||
% Consider these results (including Figure 1 from \cite{he2016deep}). Discuss the relation between network capacity and overfitting, and whether, and how, this is reflected in these results. What other factors may have led to this difference in performance?
|
||||
% The average length for an answer to this question is approximately 1/5 of the columns in a 2-column page
|
||||
\newcommand{\questionTwo} {
|
||||
\youranswer{Our results thus corroborate that increasing network depth can lead to higher training and testing errors, as seen in the comparison between VGG08 and VGG38. While deeper networks, like VGG38, have a larger capacity to learn complex features, they may struggle to generalize effectively, resulting in overfitting and poor performance on unseen data. This is consistent with the behaviour observed in Figure 1 from \cite{he2016deep}, where the 56-layer network exhibits higher training error and, consequently, higher test error compared to the 20-layer network.
|
||||
|
||||
Our results suggest that the increased capacity of VGG38 does not translate into better generalization, likely due to the vanishing gradient problem, which hinders learning in deeper networks. Other factors, such as inadequate regularization or insufficient data augmentation, could also contribute to the observed performance difference, leading to overfitting in deeper architectures.}
|
||||
}
|
||||
|
||||
%% Question 3:
|
||||
% In this coursework, we didn't incorporate residual connections to the downsampling layers. Explain and justify what would need to be changed in order to add residual connections to the downsampling layers. Give and explain 2 ways of incorporating these changes and discuss pros and cons of each.
|
||||
\newcommand{\questionThree} {
|
||||
\youranswer{
Our work does not add residual connections across the downsampling layers, because downsampling creates a dimension mismatch between the input and output feature maps. To add residual connections there, one approach is to pass the shortcut through a convolutional layer with a $1\times 1$ kernel whose stride and padding match the downsampling operation, so that the input is projected to the same shape as the output. Another approach is to apply average or max pooling directly on the shortcut to downsample the input feature map to the output's spatial dimensions, followed by a linear transformation to align the channel dimensions (a brief sketch of both options follows this answer).

The difference between these two methods is that the $1\times 1$ convolution learns the projection, which adds flexibility and can enhance model expressiveness but increases computational cost, whereas the pooling-based shortcut is computationally cheaper and simpler but may lose fine-grained information due to the fixed, non-learnable nature of pooling operations.
}
|
||||
}
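To make the two options in the answer above concrete, here is a minimal, hypothetical sketch in a PyTorch style. The class names, the stride of 2, and the convention of passing the block output `fx` into the shortcut module are illustrative assumptions, not the coursework's actual API.

```python
import torch.nn as nn

class ConvProjectionShortcut(nn.Module):
    """Option 1: learnable 1x1 convolution whose stride matches the downsampling."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                              stride=stride, bias=False)

    def forward(self, x, fx):
        # fx is the output of the downsampling block F(x); proj(x) now matches its shape.
        return fx + self.proj(x)

class PoolingShortcut(nn.Module):
    """Option 2: fixed pooling for the spatial reduction, then a 1x1 convolution
    (a per-position linear map) to align the channel dimensions."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride)
        self.channel_map = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x, fx):
        return fx + self.channel_map(self.pool(x))
```

The first variant adds learnable parameters to the shortcut; the second keeps the spatial reduction fixed and only learns the channel alignment, which is cheaper but less expressive.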
|
||||
|
||||
%% Question 4:
|
||||
% Question 4 - Present and discuss the experiment results (all of the results and not just the ones you had to fill in) in Table 1 and Figures 4 and 5 (you may use any of the other Figures if you think they are relevant to your analysis). You will have to determine what data are relevant to the discussion, and what information can be extracted from it. Also, discuss what further experiments you would have ran on any combination of VGG08, VGG38, BN, RC in order to
|
||||
% \begin{itemize}
|
||||
% \item Improve performance of the model trained (explain why you expect your suggested experiments will help with this).
|
||||
% \item Learn more about the behaviour of BN and RC (explain what you are trying to learn and how).
|
||||
% \end{itemize}
|
||||
%
|
||||
% The average length for an answer to this question is approximately 1 of the columns in a 2-column page
|
||||
\newcommand{\questionFour} {
|
||||
\youranswer{
|
||||
Our results demonstrate the effectiveness of batch normalization~\cite{ioffe2015batch} and residual connections~\cite{he2016deep}, enabling effective training of deep convolutional networks, as shown by the significant improvement in training and validation performance for VGG38 when these techniques are incorporated. Table~\ref{tab:CIFAR_results} highlights that adding BN alone (VGG38 BN) reduces both training and validation losses compared to the baseline VGG38, with validation accuracy increasing from near-zero to $47.68\%$ at a learning rate (LR) of $1\mathrm{e}{-3}$. Adding RC further enhances performance, as seen in VGG38 RC achieving $52.32\%$ validation accuracy under the same conditions. The combination of BN and RC (VGG38 BN + RC) yields the best results, achieving $53.76\%$ validation accuracy with LR $1\mathrm{e}{-3}$. BN + RC appears to benefit greatly from a higher learning rate, as it improves further to $58.20\%$ at an LR of $1\mathrm{e}{-2}$. BN alone, however, deteriorates at higher learning rates, as evidenced by its lower validation accuracy, emphasizing the stabilizing role of RC. \autoref{fig:training_curves_bestModel} confirms the synergy of BN and RC, with the VGG38 BN + RC model reaching $74\%$ training accuracy and plateauing near $60\%$ validation accuracy. \autoref{fig:avg_grad_flow_bestModel} illustrates stable gradient flow, with BN mitigating vanishing gradients and RC maintaining gradient propagation through deeper layers, particularly in the later stages of the network.
|
||||
|
||||
While this work did not evaluate residual connections on downsampling layers, a thorough evaluation of both methods put forth earlier would be required to complete the picture, highlighting how exactly residual connections in downsampling layers affect gradient flow, feature learning, and overall performance. Such an evaluation would clarify whether the additional computational cost of using $1\times 1$ convolutions for matching dimensions is justified by improved accuracy or if the simpler pooling-based approach suffices, particularly for tasks where computational efficiency is crucial.
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
%% Question 5:
|
||||
% Briefly draw your conclusions based on the results from the previous sections (what are the take-away messages?) and conclude your report with a recommendation for future work.
|
||||
%
|
||||
% Good recommendations for future work also draw on the broader literature (the papers already referenced are good starting points). Great recommendations for future work are not just incremental (an example of an incremental suggestion would be: ``we could also train with different learning rates'') but instead also identify meaningful questions or, in other words, questions with answers that might be somewhat more generally applicable.
|
||||
%
|
||||
% For example, \citep{huang2017densely} end with \begin{quote}``Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features, e.g., [4,5].''\end{quote}
|
||||
%
|
||||
% while \cite{bengio1993problem} state in their conclusions that \begin{quote}``There remains theoretical questions to be considered, such as whether the problem with simple gradient descent discussed in this paper would be observed with chaotic attractors that are not hyperbolic.''\\\end{quote}
|
||||
%
|
||||
% The length of this question description is indicative of the average length of a conclusion section
|
||||
\newcommand{\questionFive} {
|
||||
\youranswer{
|
||||
The results presented showcase a clear solution to the vanishing gradient problem. With batch normalization and residual connections, we are able to train much deeper neural networks effectively, as evidenced by the improved performance of VGG38 with these modifications. The combination of BN and RC not only stabilizes gradient flow but also enhances both training and validation accuracy, particularly when paired with an appropriate learning rate. These findings reinforce the utility of architectural innovations like those proposed in \cite{he2016deep} and \cite{ioffe2015batch}, which have become foundational in modern deep learning.
|
||||
|
||||
While these methods appear to enable training of deeper neural networks, the critical question of how these architectural enhancements generalize across different datasets and tasks remains open. Future work could investigate the effectiveness of BN and RC in scenarios involving large-scale datasets, such as ImageNet, or in domains like natural language processing and generative models, where deep architectures also face optimization challenges. Additionally, exploring the interplay between residual connections and emerging techniques like attention mechanisms \citep{vaswani2017attention} might uncover further synergies. Beyond this, understanding the theoretical underpinnings of how residual connections influence optimization landscapes and gradient flow could yield insights applicable to designing novel architectures.}
|
||||
}
|
||||
|
||||
|
||||
%% - - - - - - - - - - - - FIGURES - - - - - - - - - - - -
|
||||
|
||||
%% Question Figure 3:
|
||||
\newcommand{\questionFigureThree} {
|
||||
% Question Figure 3 - Replace this image with a figure depicting the average gradient across layers, for the VGG38 model.
|
||||
%\textit{(The provided figure is correct, and can be used in your analysis. It is partially obscured so you can get credit for producing your own copy).}
|
||||
\youranswer{
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/gradplot_38.pdf}
|
||||
\caption{Gradient Flow on VGG38}
|
||||
\label{fig:avg_grad_flow_38}
|
||||
\end{figure}
|
||||
}
|
||||
}
|
||||
|
||||
%% Question Figure 4:
|
||||
% Question Figure 4 - Replace this image with a figure depicting the training curves for the model with the best performance \textit{across experiments you have available (you don't need to run the experiments for the models we already give you results for)}. Edit the caption so that it clearly identifies the model and what is depicted.
|
||||
\newcommand{\questionFigureFour} {
|
||||
\youranswer{
|
||||
\begin{figure}[t]
|
||||
\begin{subfigure}{\linewidth}
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/VGG38_BN_RC_loss_performance.pdf}
|
||||
\caption{Cross entropy error per epoch}
|
||||
\label{fig:vgg38_loss_curves}
|
||||
\end{subfigure}
|
||||
|
||||
\begin{subfigure}{\linewidth}
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/VGG38_BN_RC_accuracy_performance.pdf}
|
||||
\caption{Classification accuracy per epoch}
|
||||
\label{fig:vgg38_acc_curves}
|
||||
\end{subfigure}
|
||||
\caption{Training curves for the 38 layer CNN with batch normalization and residual connections, trained with LR of $0.01$}
|
||||
\label{fig:training_curves_bestModel}
|
||||
\end{figure}
|
||||
}
|
||||
}
|
||||
|
||||
%% Question Figure 5:
|
||||
% Question Figure 5 - Replace this image with a figure depicting the average gradient across layers, for the model with the best performance \textit{across experiments you have available (you don't need to run the experiments for the models we already give you results for)}. Edit the caption so that it clearly identifies the model and what is depicted.
|
||||
\newcommand{\questionFigureFive} {
|
||||
\youranswer{
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/gradplot_38_bn_rc.pdf}
|
||||
\caption{Gradient Flow for the 38 layer CNN with batch normalization and residual connections, trained with LR of $0.01$}
|
||||
\label{fig:avg_grad_flow_bestModel}
|
||||
\end{figure}
|
||||
}
|
||||
}
|
||||
|
||||
%% - - - - - - - - - - - - TABLES - - - - - - - - - - - -
|
||||
|
||||
%% Question Table 1:
|
||||
% Question Table 1 - Fill in Table 1 with the results from your experiments on
|
||||
% \begin{enumerate}
|
||||
% \item \textit{VGG38 BN (LR 1e-3)}, and
|
||||
% \item \textit{VGG38 BN + RC (LR 1e-2)}.
|
||||
% \end{enumerate}
|
||||
\newcommand{\questionTableOne} {
|
||||
\youranswer{
|
||||
%
|
||||
\begin{table*}[t]
|
||||
\centering
|
||||
\begin{tabular}{lr|ccccc}
|
||||
\toprule
|
||||
Model & LR & \# Params & Train loss & Train acc & Val loss & Val acc \\
|
||||
\midrule
|
||||
VGG08 & 1e-3 & 60 K & 1.74 & 51.59 & 1.95 & 46.84 \\
|
||||
VGG38 & 1e-3 & 336 K & 4.61 & 00.01 & 4.61 & 00.01 \\
|
||||
VGG38 BN & 1e-3 & 339 K & 1.76 & 50.62 & 1.95 & 47.68 \\
|
||||
VGG38 RC & 1e-3 & 336 K & 1.33 & 61.52 & 1.84 & 52.32 \\
|
||||
VGG38 BN + RC & 1e-3 & 339 K & 1.26 & 62.99 & 1.73 & 53.76 \\
|
||||
VGG38 BN & 1e-2 & 339 K & 1.70 & 52.28 & 1.99 & 46.72 \\
|
||||
VGG38 BN + RC & 1e-2 & 339 K & 0.83 & 74.35 & 1.70 & 58.20 \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\caption{Experiment results (number of model parameters, Training and Validation loss and accuracy) for different combinations of VGG08, VGG38, Batch Normalisation (BN), and Residual Connections (RC), LR is learning rate.}
|
||||
\label{tab:CIFAR_results}
|
||||
\end{table*}
|
||||
}
|
||||
}
|
||||
|
||||
%% END of YOUR ANSWERS
|
314   report/mlp-cw2-template.tex  Normal file
@ -0,0 +1,314 @@
|
||||
%% Template for MLP Coursework 2 / 13 November 2023
|
||||
|
||||
%% Based on LaTeX template for ICML 2017 - example_paper.tex at
|
||||
%% https://2017.icml.cc/Conferences/2017/StyleAuthorInstructions
|
||||
|
||||
\documentclass{article}
|
||||
\input{mlp2022_includes}
|
||||
|
||||
|
||||
\definecolor{red}{rgb}{0.95,0.4,0.4}
|
||||
\definecolor{blue}{rgb}{0.4,0.4,0.95}
|
||||
\definecolor{orange}{rgb}{1, 0.65, 0}
|
||||
|
||||
\newcommand{\youranswer}[1]{{\color{red} \bf[#1]}} %your answer:
|
||||
|
||||
|
||||
%% START of YOUR ANSWERS
|
||||
\input{mlp-cw2-questions}
|
||||
%% END of YOUR ANSWERS
|
||||
|
||||
|
||||
|
||||
%% Do not change anything in this file. Add your answers to mlp-cw1-questions.tex
|
||||
|
||||
|
||||
|
||||
\begin{document}
|
||||
|
||||
\twocolumn[
|
||||
\mlptitle{MLP Coursework 2}
|
||||
\centerline{\studentNumber}
|
||||
\vskip 7mm
|
||||
]
|
||||
|
||||
\begin{abstract}
|
||||
Deep neural networks have become the state-of-the-art
|
||||
in many standard computer vision problems thanks to their powerful
|
||||
representations and availability of large labeled datasets.
|
||||
While very deep networks allow for learning more levels of abstraction from the data in their layers, training these models successfully is a challenging task due to problematic gradient flow through the layers, known as the vanishing/exploding gradient problem.
|
||||
In this report, we first analyze this problem in VGG models with 8 and 38 hidden layers on the CIFAR100 image dataset, by monitoring the gradient flow during training.
|
||||
We explore known solutions to this problem, including batch normalization and residual connections, and explain their theory and implementation details.
|
||||
Our experiments show that batch normalization and residual connections effectively address the aforementioned problem and hence enable a deeper model to outperform shallower ones in the same experimental setup.
|
||||
\end{abstract}
|
||||
|
||||
\section{Introduction}
|
||||
\label{sec:intro}
|
||||
Despite the remarkable progress of modern convolutional neural networks (CNNs) in image classification problems~\cite{simonyan2014very, he2016deep}, training very deep networks is a challenging procedure.
|
||||
One of the major problems is the Vanishing Gradient Problem (VGP), a phenomenon where the gradients of the error function with respect to network weights shrink to zero, as they backpropagate to earlier layers, hence preventing effective weight updates.
|
||||
This phenomenon is prevalent and has been extensively studied in various deep neural networks including feedforward networks~\cite{glorot2010understanding}, RNNs~\cite{bengio1993problem}, and CNNs~\cite{he2016deep}.
|
||||
Multiple solutions have been proposed to mitigate this problem by using weight initialization strategies~\cite{glorot2010understanding},
|
||||
activation functions~\cite{glorot2010understanding}, input normalization~\cite{bishop1995neural},
|
||||
batch normalization~\cite{ioffe2015batch}, and shortcut connections \cite{he2016deep, huang2017densely}.
|
||||
|
||||
This report focuses on diagnosing the VGP occurring in the VGG38 model\footnote{VGG stands for the Visual Geometry Group in the University of Oxford.} and addressing it by implementing two standard solutions.
|
||||
In particular, we first study a ``broken'' network in terms of its gradient flow, the L1 norm of the gradients with respect to the weights of each layer, and contrast it with that of the healthy, shallower VGG08 to pinpoint the problem.
|
||||
Next, we review two standard solutions for this problem, batch normalization (BN)~\cite{ioffe2015batch} and residual connections (RC)~\cite{he2016deep} in detail and discuss how they can address the gradient problem.
|
||||
We first incorporate batch normalization (denoted as VGG38+BN), residual connections (denoted as VGG38+RC), and their combination (denoted as VGG38+BN+RC) to the given VGG38 architecture.
|
||||
We train the resulting three configurations, along with the VGG08 and VGG38 baselines, on the CIFAR100 (pronounced `see far 100') dataset and present the results.
The results show that although the separate use of BN and RC does mitigate the vanishing/exploding gradient problem, thereby enabling effective training of the VGG38 model, the best results are obtained by combining BN and RC.
|
||||
|
||||
%
|
||||
|
||||
|
||||
\section{Identifying training problems of a deep CNN}
|
||||
\label{sec:task1}
|
||||
|
||||
\begin{figure}[t]
|
||||
\begin{subfigure}{\linewidth}
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/loss_plot.pdf}
|
||||
\caption{Cross entropy error per epoch}
|
||||
\label{fig:loss_curves}
|
||||
\end{subfigure}
|
||||
|
||||
\begin{subfigure}{\linewidth}
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/accuracy_plot.pdf}
|
||||
\caption{Classification accuracy per epoch}
|
||||
\label{fig:acc_curves}
|
||||
\end{subfigure}
|
||||
\caption{Training curves for VGG08 and VGG38 in terms of (a) cross-entropy error and (b) classification accuracy}
|
||||
\label{fig:curves}
|
||||
\end{figure}
|
||||
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{figures/grad_flow_vgg08.pdf}
|
||||
\caption{Gradient flow on VGG08}
|
||||
\label{fig:grad_flow_08}
|
||||
\end{figure}
|
||||
|
||||
\questionFigureThree
|
||||
|
||||
Concretely, training deep neural networks typically involves three steps: forward
|
||||
pass, backward pass (or backpropagation algorithm~\cite{rumelhart1986learning}) and weight update.
|
||||
The first step involves passing the input $\bx^{(0)}$ to the network and producing
|
||||
the network prediction and also the error value.
|
||||
In detail, each layer takes in the output of the previous layer and applies
|
||||
a non-linear transformation:
|
||||
\begin{equation}
|
||||
\label{eq.fprop}
|
||||
\bx^{(l)} = f^{(l)}(\bx^{(l-1)}; W^{(l)})
|
||||
\end{equation}
|
||||
where $(l)$ denotes the $l$-th layer in $L$ layer deep network,
|
||||
$f^{(l)}(\cdot,W^{(l)})$ is a non-linear transformation for layer $l$, and $W^{(l)}$ are the weights of layer $l$.
|
||||
For instance, $f^{(l)}$ is typically a convolution operation followed by an activation function in convolutional neural networks.
|
||||
The second step involves the backpropagation algorithm, where we calculate the gradient of an error function $E$ (\textit{e.g.} cross-entropy) for each layer's weight as follows:
|
||||
|
||||
\begin{equation}
|
||||
\label{eq.bprop}
|
||||
\frac{\partial E}{\partial W^{(l)}} = \frac{\partial E}{\partial \bx^{(L)}} \frac{\partial \bx^{(L)}}{\partial \bx^{(L-1)}} \dots \frac{\partial \bx^{(l+1)}}{\partial \bx^{(l)}}\frac{\partial \bx^{(l)}}{\partial W^{(l)}}.
|
||||
\end{equation}
|
||||
|
||||
This step includes consecutive tensor multiplications between multiple
|
||||
partial derivative terms.
|
||||
The final step involves updating model weights by using the computed
|
||||
$\frac{\partial E}{\partial W^{(l)}}$ with an update rule.
|
||||
The exact update rule depends on the optimizer.
|
||||
|
||||
A notorious problem for training deep neural networks is the vanishing/exploding gradient
|
||||
problem~\cite{bengio1993problem} that typically occurs in the backpropagation step when some of the partial gradient terms in Eq.~\ref{eq.bprop} include values larger or smaller than 1.
In this case, due to the multiple consecutive multiplications, the gradients \textit{w.r.t.} the weights can become exponentially small (close to 0) or exponentially large (close to infinity) and
|
||||
prevent effective learning of network weights.
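As a toy illustration of why these repeated multiplications are problematic, the short sketch below (a made-up numerical example, not taken from the report's experiments) multiplies a chain of per-layer scaling factors of 0.5 and 1.5 for depths of 8 and 38:

```python
import numpy as np

# Toy illustration: if each factor in the chain of Eq. (bprop) scales gradients
# by roughly 0.5 (or 1.5), the product shrinks (or grows) exponentially with depth.
for depth in [8, 38]:
    vanish = np.prod(np.full(depth, 0.5))   # factors < 1  -> vanishing gradients
    explode = np.prod(np.full(depth, 1.5))  # factors > 1  -> exploding gradients
    print(f"depth={depth:2d}  0.5^depth={vanish:.2e}  1.5^depth={explode:.2e}")

# depth= 8  0.5^depth=3.91e-03  1.5^depth=2.56e+01
# depth=38  0.5^depth=3.64e-12  1.5^depth=4.91e+06
```

At a depth of 38 the product is already twelve orders of magnitude smaller than at a depth of 8, which is qualitatively the behaviour seen in the gradient-flow figures below.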
|
||||
|
||||
|
||||
%
|
||||
|
||||
|
||||
Figures~\ref{fig:grad_flow_08} and \ref{fig:grad_flow_38} depict the gradient flows through VGG architectures \cite{simonyan2014very} with 8 and 38 layers respectively, trained and evaluated for a total of 100 epochs on the CIFAR100 dataset. \questionOne.
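Gradient-flow plots of this kind are produced by recording, after a backward pass, the average absolute gradient of each layer's weights. A minimal sketch of that measurement, assuming a PyTorch-style model (the helper name and the use of `named_parameters` are assumptions, not the coursework code):

```python
def average_gradient_per_layer(model):
    """Return (layer_name, mean |dE/dW|) pairs after loss.backward() has run."""
    stats = []
    for name, param in model.named_parameters():
        if param.grad is not None and name.endswith("weight"):
            stats.append((name, param.grad.abs().mean().item()))
    return stats

# Usage sketch (assumes model and loss are already defined):
#   loss.backward()
#   for name, g in average_gradient_per_layer(model):
#       print(f"{name}: {g:.3e}")   # near-zero values in early layers signal the VGP
```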
|
||||
|
||||
|
||||
\section{Background Literature}
|
||||
\label{sec:lit_rev}
|
||||
In this section we will highlight some of the most influential
|
||||
papers that have been central to overcoming the VGP in
|
||||
deep CNNs.
|
||||
|
||||
\paragraph{Batch Normalization}\cite{ioffe2015batch}
|
||||
BN seeks to solve the problem of internal covariate shift (ICS): the distribution of each layer's inputs changes during training as the parameters of the previous layers change.
The authors argue that without batch normalization, the distribution of each layer's inputs can vary significantly due to the stochastic nature of sampling mini-batches from the training set.
Layers in the network must therefore continuously adapt to these high-variance distributions, which hinders the convergence rate of gradient-based optimizers.
|
||||
This optimization problem is exacerbated further with network depth due
|
||||
to the updating of parameters at layer $l$ being dependent on
|
||||
the previous $l-1$ layers.
|
||||
|
||||
It is hence beneficial to embed the normalization of
|
||||
training data into the network architecture after work from
|
||||
LeCun \emph{et al.} showed that training converges faster with
|
||||
this addition \cite{lecun2012efficient}. Through standardizing
|
||||
the inputs to each layer, we take a step towards achieving
|
||||
the fixed distributions of inputs that remove the ill effects
|
||||
of ICS. Ioffe and Szegedy demonstrate the effectiveness of
|
||||
their technique through training an ensemble of BN
|
||||
networks which achieve an accuracy on the ImageNet classification
|
||||
task exceeding that of humans in 14 times fewer
|
||||
training steps than the state-of-the-art of the time.
|
||||
It should be noted, however, that the exact reason for BN’s effectiveness is still not completely understood and it is
|
||||
an open research question~\cite{santurkar2018does}.
|
||||
|
||||
|
||||
|
||||
\paragraph{Residual networks (ResNet)}\cite{he2016deep} A well-known way of mitigating the VGP is proposed by He~\emph{et al.} in \cite{he2016deep}. In their paper, the authors depict the error curves of a 20-layer and a 56-layer network to motivate their method. Both training and testing error of the 56-layer network are significantly higher than those of the shallower one.
|
||||
|
||||
\questionTwo.
|
||||
|
||||
Residual networks, colloquially known as ResNets, aim to alleviate the VGP through the incorporation of skip connections that bypass the linear transformations in the network architecture.
|
||||
The authors argue that this new mapping is significantly easier
|
||||
to optimize since if an identity mapping were optimal, the
|
||||
network could comfortably learn to push the residual to
|
||||
zero rather than attempting to fit an identity mapping via
|
||||
a stack of nonlinear layers.
|
||||
They bolster their argument
|
||||
by successfully training ResNets with depths exceeding
|
||||
1000 layers on the CIFAR10 dataset.
|
||||
Prior to their work, training even a 100-layer network was considered a great challenge within the deep learning community.
The addition of skip connections mitigates the VGP by enabling information to flow more freely through the network architecture, without adding extra parameters or computational complexity.
|
||||
|
||||
\section{Solution overview}
|
||||
\subsection{Batch normalization}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
BN has been a standard component in the state-of-the-art
|
||||
convolutional neural networks \cite{he2016deep,huang2017densely}.
|
||||
% As mentioned in Section~\ref{sec:lit_rev},
|
||||
Concretely, BN is a
|
||||
layer transformation that is performed to whiten the activations
|
||||
originating from each layer.
|
||||
As computing full dataset statistics at each training iteration
|
||||
would be computationally expensive, BN computes batch statistics
|
||||
to approximate them.
|
||||
Given a minibatch of $B$ training samples and their feature maps
|
||||
$X = (\bx^1, \bx^2,\ldots , \bx^B)$ at an arbitrary layer where $X \in \mathbb{R}^{B\times H \times W \times C}$, $H, W$ are the height, width of the feature map and $C$ is the number of channels, the batch normalization first computes the following statistics:
|
||||
|
||||
\begin{align}
|
||||
\label{eq.bnstats}
|
||||
\mu_c &= \frac{1}{BWH} \sum_{n=1}^{B}\sum_{i,j=1}^{H,W} \bx_{cij}^{n}\\
|
||||
\sigma^2_c &= \frac{1}{BWH}
|
||||
\sum_{n=1}^{B}\sum_{i,j=1}^{H,W} (\bx_{cij}^{n} - \mu_{c})^2
|
||||
\end{align} where $c$, $i$, $j$ denote the index values for $y$, $x$ and channel coordinates of feature maps, and $\bm{\mu}$ and $\bm{\sigma}^2$ are the mean and variance of the batch.
|
||||
|
||||
BN applies the following operation on each feature map in batch B for every $c,i,j$:
|
||||
\begin{equation}
|
||||
\label{eq.bnop}
|
||||
\text{BN}(\bx_{cij}) = \frac{\bx_{cij} - \mu_{c}}{\sqrt{\sigma^2_{c} + \epsilon}} \cdot \gamma_{c} + \beta_{c}
|
||||
\end{equation} where $\gamma \in \mathbb{R}^C$ and $\beta\in \mathbb{R}^C$ are learnable parameters and $\epsilon$ is a small constant introduced to ensure numerical stability.
|
||||
|
||||
At inference time, using batch statistics is a poor choice as it introduces noise in the evaluation and might not even be well defined. Therefore, $\bm{\mu}$ and $\bm{\sigma}$ are replaced by running averages of the mean and variance computed during training, which is a better approximation of the full dataset statistics.
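The per-channel batch statistics above, the normalization step, and the running averages used at inference time can be sketched as follows. This is a minimal NumPy illustration under the stated $B \times H \times W \times C$ shape convention; the function name, the momentum value, and the in/out signature are assumptions, not the `mlp` package implementation.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mu, running_var,
                     eps=1e-5, momentum=0.1):
    """x has shape (B, H, W, C); gamma, beta and running_* have shape (C,)."""
    mu = x.mean(axis=(0, 1, 2))                # per-channel batch mean
    var = x.var(axis=(0, 1, 2))                # per-channel batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalise each feature map
    out = gamma * x_hat + beta                 # learnable scale and shift
    # Running averages approximate full-dataset statistics for inference.
    running_mu = (1 - momentum) * running_mu + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return out, running_mu, running_var
```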
|
||||
|
||||
Recent work
|
||||
has shown that BatchNorm has a more fundamental
|
||||
benefit of smoothing the optimization landscape during
|
||||
training \cite{santurkar2018does} thus enhancing the predictive
|
||||
power of gradients as our guide to the global minimum.
|
||||
Furthermore, a smoother optimization landscape should
|
||||
additionally enable the use of a wider range of learning
|
||||
rates and initialization schemes which is congruent with the
|
||||
findings of Ioffe and Szegedy in the original BatchNorm
|
||||
paper~\cite{ioffe2015batch}.
|
||||
|
||||
|
||||
\subsection{Residual connections}
|
||||
|
||||
Residual connections are another approach used in the state-of-the-art Residual Networks~\cite{he2016deep} to tackle the vanishing gradient problem.
|
||||
Introduced by He et al.~\cite{he2016deep}, a residual block consists of a
|
||||
convolution (or group of convolutions) layer, ``short-circuited'' with an identity mapping.
|
||||
More precisely, given a mapping $F^{(b)}$ that denotes the transformation of the block $b$ (multiple consecutive layers), $F^{(b)}$ is applied to its input
|
||||
feature map $\bx^{(b-1)}$ as $\bx^{(b)} = \bx^{(b-1)} + {F}(\bx^{(b-1)})$.
|
||||
|
||||
Intuitively, stacking residual blocks creates an architecture in which the input of each block
is given two paths: passing through the convolutions or skipping to the next layer. A residual network can therefore be seen as an ensemble model averaging every sub-network
|
||||
created by choosing one of the two paths. The skip connections allow gradients to flow
|
||||
easily into early layers, since
|
||||
\begin{equation}
|
||||
\frac{\partial \bx^{(b)}}{\partial \bx^{(b-1)}} = \mathbbm{1} + \frac{\partial{F}(\bx^{(b-1)})}{\partial \bx^{(b-1)}}
|
||||
\label{eq.grad_skip}
|
||||
\end{equation} where $\bx^{(b-1)} \in \mathbb{R}^{C \times H \times W }$ and $\mathbbm{1}$ is a $\mathbb{R}^{C \times H \times W}$-dimensional tensor with entries 1 where $C$, $H$ and $W$ denote the number of feature maps, its height and width respectively.
|
||||
Importantly, $\mathbbm{1}$ prevents the zero gradient flow.
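A residual block of this form can be written down in a few lines. The following is a PyTorch-style sketch of the identity shortcut $\bx^{(b)} = \bx^{(b-1)} + F(\bx^{(b-1)})$ with assumed layer choices (two $3\times 3$ convolutions, BN before the non-linearity); it is an illustration, not the exact block used in the experiments.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Identity shortcut around a small stack of layers: out = act(x + F(x)).
    Spatial size and channel count are preserved so the two paths' shapes match."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.LeakyReLU()

    def forward(self, x):
        # The "+ x" term is what contributes the identity in the gradient above,
        # keeping the gradient away from zero even if F's contribution is small.
        return self.act(x + self.f(x))
```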
|
||||
|
||||
|
||||
\section{Experiment Setup}
|
||||
|
||||
\questionFigureFour
|
||||
|
||||
\questionFigureFive
|
||||
|
||||
\questionTableOne
|
||||
|
||||
We conduct our experiment on the CIFAR100 dataset \cite{krizhevsky2009learning},
|
||||
which consists of 60,000 32x32 colour images from 100 different classes. The number of samples per class is balanced, and the
|
||||
samples are split into training, validation, and test set while
|
||||
maintaining balanced class proportions. In total, there are 47,500; 2,500; and 10,000 instances in the training, validation,
|
||||
and test set, respectively. Moreover, we apply data augmentation strategies (cropping, horizontal flipping) to improve the generalization of the model.
|
||||
|
||||
With the goal of understanding whether BN or skip connections
|
||||
help fighting vanishing gradients, we first test these
|
||||
methods independently, before combining them in an attempt
|
||||
to fully exploit the depth of the VGG38 model.
|
||||
|
||||
All experiments are conducted using the Adam optimizer with the default
|
||||
learning rate (1e-3) unless otherwise specified, cosine annealing of the learning rate, and a batch size of 100
for 100 epochs.
|
||||
Additionally, training images are augmented with random
|
||||
cropping and horizontal flipping.
|
||||
Note that we do not use data augmentation at test time.
|
||||
These hyperparameters along with the augmentation strategy are used
|
||||
to produce the results shown in Fig.~\ref{fig:curves}.
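For concreteness, the described setup roughly corresponds to the following sketch, assuming PyTorch and torchvision; the crop padding, the placeholder model, and the variable names are assumptions rather than the coursework's training script.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Augmentation at training time only (random crop + horizontal flip); none at test time.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # padding value is an assumption
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
test_transform = transforms.ToTensor()

model = nn.Conv2d(3, 100, kernel_size=3)     # placeholder; stands in for the VGG model under test
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)              # default Adam LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # 100 epochs
# A batch size of 100 would be set on the DataLoader that feeds the training loop.
```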
|
||||
|
||||
When used, BN is applied
|
||||
after each convolutional layer, before the Leaky
|
||||
ReLU non-linearity.
|
||||
Similarly, the skip connections are applied from
|
||||
before the convolution layer to before the final activation function
|
||||
of the block as per Fig.~2 of \cite{he2016deep}.
|
||||
Note that adding residual connections between the feature maps before and after downsampling requires special treatment, as there is a dimension mismatch between them.
|
||||
Therefore in the coursework, we do not use residual connections in the down-sampling blocks. However, please note that batch normalization should still be implemented for these blocks.
|
||||
|
||||
\subsection{Residual Connections to Downsampling Layers}
|
||||
\label{subsec:rescimp}
|
||||
|
||||
\questionThree.
|
||||
|
||||
|
||||
\section{Results and Discussion}
|
||||
\label{sec:disc}
|
||||
|
||||
\questionFour.
|
||||
|
||||
\section{Conclusion}
|
||||
\label{sec:concl}
|
||||
|
||||
\questionFive.
|
||||
|
||||
\bibliography{refs}
|
||||
|
||||
\end{document}
|
||||
|
||||
|
||||
|
||||
|
||||
|
720   report/mlp2022.sty  Normal file
@ -0,0 +1,720 @@
|
||||
% File: mlp2017.sty (LaTeX style file for ICML-2017, version of 2017-05-31)
|
||||
|
||||
% Modified by Daniel Roy 2017: changed byline to use footnotes for affiliations, and removed emails
|
||||
|
||||
% This file contains the LaTeX formatting parameters for a two-column
|
||||
% conference proceedings that is 8.5 inches wide by 11 inches high.
|
||||
%
|
||||
% Modified by Percy Liang 12/2/2013: changed the year, location from the previous template for ICML 2014
|
||||
|
||||
% Modified by Fei Sha 9/2/2013: changed the year, location form the previous template for ICML 2013
|
||||
%
|
||||
% Modified by Fei Sha 4/24/2013: (1) remove the extra whitespace after the first author's email address (in %the camera-ready version) (2) change the Proceeding ... of ICML 2010 to 2014 so PDF's metadata will show up % correctly
|
||||
%
|
||||
% Modified by Sanjoy Dasgupta, 2013: changed years, location
|
||||
%
|
||||
% Modified by Francesco Figari, 2012: changed years, location
|
||||
%
|
||||
% Modified by Christoph Sawade and Tobias Scheffer, 2011: added line
|
||||
% numbers, changed years
|
||||
%
|
||||
% Modified by Hal Daume III, 2010: changed years, added hyperlinks
|
||||
%
|
||||
% Modified by Kiri Wagstaff, 2009: changed years
|
||||
%
|
||||
% Modified by Sam Roweis, 2008: changed years
|
||||
%
|
||||
% Modified by Ricardo Silva, 2007: update of the ifpdf verification
|
||||
%
|
||||
% Modified by Prasad Tadepalli and Andrew Moore, merely changing years.
|
||||
%
|
||||
% Modified by Kristian Kersting, 2005, based on Jennifer Dy's 2004 version
|
||||
% - running title. If the original title is to long or is breaking a line,
|
||||
% use \mlptitlerunning{...} in the preamble to supply a shorter form.
|
||||
% Added fancyhdr package to get a running head.
|
||||
% - Updated to store the page size because pdflatex does compile the
|
||||
% page size into the pdf.
|
||||
%
|
||||
% Hacked by Terran Lane, 2003:
|
||||
% - Updated to use LaTeX2e style file conventions (ProvidesPackage,
|
||||
% etc.)
|
||||
% - Added an ``appearing in'' block at the base of the first column
|
||||
% (thus keeping the ``appearing in'' note out of the bottom margin
|
||||
% where the printer should strip in the page numbers).
|
||||
% - Added a package option [accepted] that selects between the ``Under
|
||||
% review'' notice (default, when no option is specified) and the
|
||||
% ``Appearing in'' notice (for use when the paper has been accepted
|
||||
% and will appear).
|
||||
%
|
||||
% Originally created as: ml2k.sty (LaTeX style file for ICML-2000)
|
||||
% by P. Langley (12/23/99)
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%
|
||||
%% This version of the style file supports both a ``review'' version
|
||||
%% and a ``final/accepted'' version. The difference is only in the
|
||||
%% text that appears in the note at the bottom of the first column of
|
||||
%% the first page. The default behavior is to print a note to the
|
||||
%% effect that the paper is under review and don't distribute it. The
|
||||
%% final/accepted version prints an ``Appearing in'' note. To get the
|
||||
%% latter behavior, in the calling file change the ``usepackage'' line
|
||||
%% from:
|
||||
%% \usepackage{icml2017}
|
||||
%% to
|
||||
%% \usepackage[accepted]{icml2017}
|
||||
%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
\NeedsTeXFormat{LaTeX2e}
|
||||
\ProvidesPackage{mlp2022}[2021/10/16 MLP Coursework Style File]
|
||||
|
||||
% Use fancyhdr package
|
||||
\RequirePackage{fancyhdr}
|
||||
\RequirePackage{color}
|
||||
\RequirePackage{algorithm}
|
||||
\RequirePackage{algorithmic}
|
||||
\RequirePackage{natbib}
|
||||
\RequirePackage{eso-pic} % used by \AddToShipoutPicture
|
||||
\RequirePackage{forloop}
|
||||
|
||||
%%%%%%%% Options
|
||||
%\DeclareOption{accepted}{%
|
||||
% \renewcommand{\Notice@String}{\ICML@appearing}
|
||||
\gdef\isaccepted{1}
|
||||
%}
|
||||
\DeclareOption{nohyperref}{%
|
||||
\gdef\nohyperref{1}
|
||||
}
|
||||
|
||||
\ifdefined\nohyperref\else\ifdefined\hypersetup
|
||||
\definecolor{mydarkblue}{rgb}{0,0.08,0.45}
|
||||
\hypersetup{ %
|
||||
pdftitle={},
|
||||
pdfauthor={},
|
||||
pdfsubject={MLP Coursework 2021-22},
|
||||
pdfkeywords={},
|
||||
pdfborder=0 0 0,
|
||||
pdfpagemode=UseNone,
|
||||
colorlinks=true,
|
||||
linkcolor=mydarkblue,
|
||||
citecolor=mydarkblue,
|
||||
filecolor=mydarkblue,
|
||||
urlcolor=mydarkblue,
|
||||
pdfview=FitH}
|
||||
|
||||
\ifdefined\isaccepted \else
|
||||
\hypersetup{pdfauthor={Anonymous Submission}}
|
||||
\fi
|
||||
\fi\fi
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%
|
||||
% This string is printed at the bottom of the page for the
|
||||
% final/accepted version of the ``appearing in'' note. Modify it to
|
||||
% change that text.
|
||||
%%%%%%%%%%%%%%%%%%%%
|
||||
\newcommand{\ICML@appearing}{\textit{MLP Coursework 1 2021--22}}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%
|
||||
% This string is printed at the bottom of the page for the draft/under
|
||||
% review version of the ``appearing in'' note. Modify it to change
|
||||
% that text.
|
||||
%%%%%%%%%%%%%%%%%%%%
|
||||
\newcommand{\Notice@String}{MLP Coursework 1 2021--22}
|
||||
|
||||
% Cause the declared options to actually be parsed and activated
|
||||
\ProcessOptions\relax
|
||||
|
||||
% Uncomment the following for debugging. It will cause LaTeX to dump
|
||||
% the version of the ``appearing in'' string that will actually appear
|
||||
% in the document.
|
||||
%\typeout{>> Notice string='\Notice@String'}
|
||||
|
||||
% Change citation commands to be more like old ICML styles
|
||||
\newcommand{\yrcite}[1]{\citeyearpar{#1}}
|
||||
\renewcommand{\cite}[1]{\citep{#1}}
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
% to ensure the letter format is used. pdflatex does compile the
|
||||
% page size into the pdf. This is done using \pdfpagewidth and
|
||||
% \pdfpageheight. As Latex does not know this directives, we first
|
||||
% check whether pdflatex or latex is used.
|
||||
%
|
||||
% Kristian Kersting 2005
|
||||
%
|
||||
% in order to account for the more recent use of pdfetex as the default
|
||||
% compiler, I have changed the pdf verification.
|
||||
%
|
||||
% Ricardo Silva 2007
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
\paperwidth=210mm
|
||||
\paperheight=297mm
|
||||
|
||||
% old PDFLaTex verification, circa 2005
|
||||
%
|
||||
%\newif\ifpdf\ifx\pdfoutput\undefined
|
||||
% \pdffalse % we are not running PDFLaTeX
|
||||
%\else
|
||||
% \pdfoutput=1 % we are running PDFLaTeX
|
||||
% \pdftrue
|
||||
%\fi
|
||||
|
||||
\newif\ifpdf %adapted from ifpdf.sty
|
||||
\ifx\pdfoutput\undefined
|
||||
\else
|
||||
\ifx\pdfoutput\relax
|
||||
\else
|
||||
\ifcase\pdfoutput
|
||||
\else
|
||||
\pdftrue
|
||||
\fi
|
||||
\fi
|
||||
\fi
|
||||
|
||||
\ifpdf
|
||||
% \pdfpagewidth=\paperwidth
|
||||
% \pdfpageheight=\paperheight
|
||||
\setlength{\pdfpagewidth}{210mm}
|
||||
\setlength{\pdfpageheight}{297mm}
|
||||
\fi
|
||||
|
||||
% Physical page layout
|
||||
|
||||
\evensidemargin -5.5mm
|
||||
\oddsidemargin -5.5mm
|
||||
\setlength\textheight{248mm}
|
||||
\setlength\textwidth{170mm}
|
||||
\setlength\columnsep{6.5mm}
|
||||
\setlength\headheight{10pt}
|
||||
\setlength\headsep{10pt}
|
||||
\addtolength{\topmargin}{-20pt}
|
||||
|
||||
%\setlength\headheight{1em}
|
||||
%\setlength\headsep{1em}
|
||||
\addtolength{\topmargin}{-6mm}
|
||||
|
||||
%\addtolength{\topmargin}{-2em}
|
||||
|
||||
%% The following is adapted from code in the acmconf.sty conference
|
||||
%% style file. The constants in it are somewhat magical, and appear
|
||||
%% to work well with the two-column format on US letter paper that
|
||||
%% ICML uses, but will break if you change that layout, or if you use
|
||||
%% a longer block of text for the copyright notice string. Fiddle with
|
||||
%% them if necessary to get the block to fit/look right.
|
||||
%%
|
||||
%% -- Terran Lane, 2003
|
||||
%%
|
||||
%% The following comments are included verbatim from acmconf.sty:
|
||||
%%
|
||||
%%% This section (written by KBT) handles the 1" box in the lower left
|
||||
%%% corner of the left column of the first page by creating a picture,
|
||||
%%% and inserting the predefined string at the bottom (with a negative
|
||||
%%% displacement to offset the space allocated for a non-existent
|
||||
%%% caption).
|
||||
%%%
|
||||
\def\ftype@copyrightbox{8}
|
||||
\def\@copyrightspace{
|
||||
% Create a float object positioned at the bottom of the column. Note
|
||||
% that because of the mystical nature of floats, this has to be called
|
||||
% before the first column is populated with text (e.g., from the title
|
||||
% or abstract blocks). Otherwise, the text will force the float to
|
||||
% the next column. -- TDRL.
|
||||
\@float{copyrightbox}[b]
|
||||
\begin{center}
|
||||
\setlength{\unitlength}{1pc}
|
||||
\begin{picture}(20,1.5)
|
||||
% Create a line separating the main text from the note block.
|
||||
% 4.818pc==0.8in.
|
||||
\put(0,2.5){\line(1,0){4.818}}
|
||||
% Insert the text string itself. Note that the string has to be
|
||||
% enclosed in a parbox -- the \put call needs a box object to
|
||||
% position. Without the parbox, the text gets splattered across the
|
||||
% bottom of the page semi-randomly. The 19.75pc distance seems to be
|
||||
% the width of the column, though I can't find an appropriate distance
|
||||
% variable to substitute here. -- TDRL.
|
||||
\put(0,0){\parbox[b]{19.75pc}{\small \Notice@String}}
|
||||
\end{picture}
|
||||
\end{center}
|
||||
\end@float}
|
||||
|
||||
% Note: A few Latex versions need the next line instead of the former.
|
||||
% \addtolength{\topmargin}{0.3in}
|
||||
% \setlength\footheight{0pt}
|
||||
\setlength\footskip{0pt}
|
||||
%\pagestyle{empty}
|
||||
\flushbottom \twocolumn
|
||||
\sloppy
|
||||
|
||||
% Clear out the addcontentsline command
|
||||
\def\addcontentsline#1#2#3{}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%% commands for formatting paper title, author names, and addresses.
|
||||
|
||||
%%start%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%%%%% title as running head -- Kristian Kersting 2005 %%%%%%%%%%%%%
|
||||
|
||||
|
||||
%\makeatletter
|
||||
%\newtoks\mytoksa
|
||||
%\newtoks\mytoksb
|
||||
%\newcommand\addtomylist[2]{%
|
||||
% \mytoksa\expandafter{#1}%
|
||||
% \mytoksb{#2}%
|
||||
% \edef#1{\the\mytoksa\the\mytoksb}%
|
||||
%}
|
||||
%\makeatother
|
||||
|
||||
% box to check the size of the running head
|
||||
\newbox\titrun
|
||||
|
||||
% general page style
|
||||
\pagestyle{fancy}
|
||||
\fancyhf{}
|
||||
\fancyhead{}
|
||||
\fancyfoot{}
|
||||
% set the width of the head rule to 1 point
|
||||
\renewcommand{\headrulewidth}{1pt}
|
||||
|
||||
% definition to set the head as running head in the preamble
|
||||
\def\mlptitlerunning#1{\gdef\@mlptitlerunning{#1}}
|
||||
|
||||
% main definition adapting \mlptitle from 2004
|
||||
\long\def\mlptitle#1{%
|
||||
|
||||
%check whether @mlptitlerunning exists
|
||||
% if not, the full title passed to \mlptitle is used as the running head
|
||||
\ifx\undefined\@mlptitlerunning%
|
||||
\gdef\@mlptitlerunning{#1}
|
||||
\fi
|
||||
|
||||
%add it to pdf information
|
||||
\ifdefined\nohyperref\else\ifdefined\hypersetup
|
||||
\hypersetup{pdftitle={#1}}
|
||||
\fi\fi
|
||||
|
||||
%get the dimension of the running title
|
||||
\global\setbox\titrun=\vbox{\small\bf\@mlptitlerunning}
|
||||
|
||||
% error flag
|
||||
\gdef\@runningtitleerror{0}
|
||||
|
||||
% running title too long
|
||||
\ifdim\wd\titrun>\textwidth%
|
||||
{\gdef\@runningtitleerror{1}}%
|
||||
% running title breaks a line
|
||||
\else\ifdim\ht\titrun>6.25pt
|
||||
{\gdef\@runningtitleerror{2}}%
|
||||
\fi
|
||||
\fi
|
||||
|
||||
% if there is something wrong with the running title
|
||||
\ifnum\@runningtitleerror>0
|
||||
\typeout{}%
|
||||
\typeout{}%
|
||||
\typeout{*******************************************************}%
|
||||
\typeout{Title exceeds size limitations for running head.}%
|
||||
\typeout{Please supply a shorter form for the running head}
|
||||
\typeout{with \string\mlptitlerunning{...}\space prior to \string\begin{document}}%
|
||||
\typeout{*******************************************************}%
|
||||
\typeout{}%
|
||||
\typeout{}%
|
||||
% set default running title
|
||||
\chead{\small\bf Title Suppressed Due to Excessive Size}%
|
||||
\else
|
||||
% 'everything' fine, set provided running title
|
||||
\chead{\small\bf\@mlptitlerunning}%
|
||||
\fi
|
||||
|
||||
% no running title on the first page of the paper
|
||||
\thispagestyle{empty}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%% Kristian Kersting %%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%end%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
{\center\baselineskip 18pt
|
||||
\toptitlebar{\Large\bf #1}\bottomtitlebar}
|
||||
}
|
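The macro above falls back to the full title as the running head unless a short form was registered, and the \typeout warning asks for that short form to be given before \begin{document}. A minimal usage sketch (the document class, title text, and student number below are illustrative placeholders, not taken from this style file):

```latex
% Illustrative usage sketch for the title macros above
% (class name, title, and student number are placeholders).
\documentclass{article}
\usepackage{mlp2022}

% Short form used in the page header; give it before \begin{document},
% otherwise the full title becomes the running head and may trigger
% the "Title exceeds size limitations" warning above.
\mlptitlerunning{MLP Coursework 2 (s1234567)}

\begin{document}
\mlptitle{A Deliberately Long Full Title That Would Overflow the Running Head}
Body text.
\end{document}
```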
||||
|
||||
|
||||
\gdef\icmlfullauthorlist{}
|
||||
\newcommand\addstringtofullauthorlist{\g@addto@macro\icmlfullauthorlist}
|
||||
\newcommand\addtofullauthorlist[1]{%
|
||||
\ifdefined\icmlanyauthors%
|
||||
\addstringtofullauthorlist{, #1}%
|
||||
\else%
|
||||
\addstringtofullauthorlist{#1}%
|
||||
\gdef\icmlanyauthors{1}%
|
||||
\fi%
|
||||
\ifdefined\nohyperref\else\ifdefined\hypersetup%
|
||||
\hypersetup{pdfauthor=\icmlfullauthorlist}%
|
||||
\fi\fi}
|
||||
|
||||
|
||||
\def\toptitlebar{\hrule height1pt \vskip .25in}
|
||||
\def\bottomtitlebar{\vskip .22in \hrule height1pt \vskip .3in}
|
||||
|
||||
\newenvironment{icmlauthorlist}{%
|
||||
\setlength\topsep{0pt}
|
||||
\setlength\parskip{0pt}
|
||||
\begin{center}
|
||||
}{%
|
||||
\end{center}
|
||||
}
|
||||
|
||||
\newcounter{@affiliationcounter}
|
||||
\newcommand{\@pa}[1]{%
|
||||
% ``#1''
|
||||
\ifcsname the@affil#1\endcsname
|
||||
% do nothing
|
||||
\else
|
||||
\ifcsname @icmlsymbol#1\endcsname
|
||||
% nothing
|
||||
\else
|
||||
\stepcounter{@affiliationcounter}%
|
||||
\newcounter{@affil#1}%
|
||||
\setcounter{@affil#1}{\value{@affiliationcounter}}%
|
||||
\fi
|
||||
\fi%
|
||||
\ifcsname @icmlsymbol#1\endcsname
|
||||
\textsuperscript{\csname @icmlsymbol#1\endcsname\,}%
|
||||
\else
|
||||
%\expandafter\footnotemark[\arabic{@affil#1}\,]%
|
||||
\textsuperscript{\arabic{@affil#1}\,}%
|
||||
\fi
|
||||
}
|
||||
|
||||
%\newcommand{\icmlauthor}[2]{%
|
||||
%\addtofullauthorlist{#1}%
|
||||
%#1\@for\theaffil:=#2\do{\pa{\theaffil}}%
|
||||
%}
|
||||
\newcommand{\icmlauthor}[2]{%
|
||||
\ifdefined\isaccepted
|
||||
\mbox{\bf #1}\,\@for\theaffil:=#2\do{\@pa{\theaffil}} \addtofullauthorlist{#1}%
|
||||
\else
|
||||
\ifdefined\@icmlfirsttime
|
||||
\else
|
||||
\gdef\@icmlfirsttime{1}
|
||||
\mbox{\bf Anonymous Authors}\@pa{@anon} \addtofullauthorlist{Anonymous Authors}
|
||||
\fi
|
||||
\fi
|
||||
}
|
||||
|
||||
\newcommand{\icmlsetsymbol}[2]{%
|
||||
\expandafter\gdef\csname @icmlsymbol#1\endcsname{#2}
|
||||
}
|
||||
|
||||
|
||||
\newcommand{\icmlaffiliation}[2]{%
|
||||
\ifdefined\isaccepted
|
||||
\ifcsname the@affil#1\endcsname
|
||||
\expandafter\gdef\csname @affilname\csname the@affil#1\endcsname\endcsname{#2}%
|
||||
\else
|
||||
{\bf AUTHORERR: Error in use of the \textbackslash{}icmlaffiliation command. Label ``#1'' does not appear in any preceding \textbackslash{}icmlauthor\{author name\}\{labels here\} command. }
|
||||
\typeout{}%
|
||||
\typeout{}%
|
||||
\typeout{*******************************************************}%
|
||||
\typeout{Affiliation label undefined. }%
|
||||
\typeout{Make sure \string\icmlaffiliation\space follows }
|
||||
\typeout{all of \string\icmlauthor\space commands}%
|
||||
\typeout{*******************************************************}%
|
||||
\typeout{}%
|
||||
\typeout{}%
|
||||
\fi
|
||||
\else % \isaccepted
|
||||
% can be called multiple times... it's idempotent
|
||||
\expandafter\gdef\csname @affilname1\endcsname{Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country}
|
||||
\fi
|
||||
}
|
||||
|
||||
\newcommand{\icmlcorrespondingauthor}[2]{
|
||||
\ifdefined\isaccepted
|
||||
\ifdefined\icmlcorrespondingauthor@text
|
||||
\g@addto@macro\icmlcorrespondingauthor@text{, #1 \textless{}#2\textgreater{}}
|
||||
\else
|
||||
\gdef\icmlcorrespondingauthor@text{#1 \textless{}#2\textgreater{}}
|
||||
\fi
|
||||
\else
|
||||
\gdef\icmlcorrespondingauthor@text{Anonymous Author \textless{}anon.email@domain.com\textgreater{}}
|
||||
\fi
|
||||
}
|
||||
|
||||
\newcommand{\icmlEqualContribution}{\textsuperscript{*}Equal contribution }
|
||||
|
||||
\newcounter{@affilnum}
|
||||
\newcommand{\printAffiliationsAndNotice}[1]{%
|
||||
\stepcounter{@affiliationcounter}%
|
||||
{\let\thefootnote\relax\footnotetext{\hspace*{-\footnotesep}#1%
|
||||
\forloop{@affilnum}{1}{\value{@affilnum} < \value{@affiliationcounter}}{
|
||||
\textsuperscript{\arabic{@affilnum}}\ifcsname @affilname\the@affilnum\endcsname%
|
||||
\csname @affilname\the@affilnum\endcsname%
|
||||
\else
|
||||
{\bf AUTHORERR: Missing \textbackslash{}icmlaffiliation.}
|
||||
\fi
|
||||
}.
|
||||
\ifdefined\icmlcorrespondingauthor@text
|
||||
Correspondence to: \icmlcorrespondingauthor@text.
|
||||
\else
|
||||
{\bf AUTHORERR: Missing \textbackslash{}icmlcorrespondingauthor.}
|
||||
\fi
|
||||
|
||||
\ \\
|
||||
\Notice@String
|
||||
}
|
||||
}
|
||||
}
|
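Taken together, \icmlauthor, \icmlaffiliation and \icmlcorrespondingauthor feed \printAffiliationsAndNotice, and the AUTHORERR messages above require every affiliation label to be introduced by some \icmlauthor call and a corresponding author to be set. A hedged sketch of the intended call order (names, labels, and addresses are placeholders):

```latex
% Illustrative author block (placeholder names/labels); the anonymous
% variants are printed instead unless \isaccepted is defined.
\begin{icmlauthorlist}
\icmlauthor{Jane Doe}{edin}
\icmlauthor{John Smith}{edin,other}
\end{icmlauthorlist}

% Each label used above must then be given an affiliation.
\icmlaffiliation{edin}{School of Informatics, University of Edinburgh, UK}
\icmlaffiliation{other}{Another Department, Another University}
\icmlcorrespondingauthor{Jane Doe}{jane.doe@example.com}

% Emits the numbered affiliation footnote, the correspondence line,
% and \Notice@String; \icmlEqualContribution is optional.
\printAffiliationsAndNotice{\icmlEqualContribution}
```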
||||
|
||||
|
||||
%\makeatother
|
||||
|
||||
\long\def\icmladdress#1{%
|
||||
{\bf The \textbackslash{}icmladdress command is no longer used. See the example\_paper .tex and PDF for usage of \textbackslash{}icmlauthor and \textbackslash{}icmlaffiliation.}
|
||||
}
|
||||
|
||||
%% keywords as first class citizens
|
||||
\def\icmlkeywords#1{%
|
||||
% \ifdefined\isaccepted \else
|
||||
% \par {\bf Keywords:} #1%
|
||||
% \fi
|
||||
% \ifdefined\nohyperref\else\ifdefined\hypersetup
|
||||
% \hypersetup{pdfkeywords={#1}}
|
||||
% \fi\fi
|
||||
% \ifdefined\isaccepted \else
|
||||
% \par {\bf Keywords:} #1%
|
||||
% \fi
|
||||
\ifdefined\nohyperref\else\ifdefined\hypersetup
|
||||
\hypersetup{pdfkeywords={#1}}
|
||||
\fi\fi
|
||||
}
|
||||
|
||||
% modification to natbib citations
|
||||
\setcitestyle{authoryear,round,citesep={;},aysep={,},yysep={;}}
|
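With this \setcitestyle, natbib produces round author-year citations, a comma between author and year, and semicolons between multiple references. For example, using keys from the refs.bib file added later in this change (the rendered forms in the comments are approximate and depend on the icml2017 bibliography style):

```latex
% Approximate rendered output shown in the comments.
\citet{vaswani2017attention}          % Vaswani et al. (2017)
\citep{vaswani2017attention}          % (Vaswani et al., 2017)
\citep{he2016deep,ioffe2015batch}     % (He et al., 2016; Ioffe & Szegedy, 2015)
```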
||||
|
||||
% Redefinition of the abstract environment.
|
||||
\renewenvironment{abstract}
|
||||
{%
|
||||
% Insert the ``appearing in'' copyright notice.
|
||||
%\@copyrightspace
|
||||
\centerline{\large\bf Abstract}
|
||||
\vspace{-0.12in}\begin{quote}}
|
||||
{\par\end{quote}\vskip 0.12in}
|
||||
|
||||
% numbered section headings with different treatment of numbers
|
||||
|
||||
\def\@startsection#1#2#3#4#5#6{\if@noskipsec \leavevmode \fi
|
||||
\par \@tempskipa #4\relax
|
||||
\@afterindenttrue
|
||||
% Altered the following line to indent a section's first paragraph.
|
||||
% \ifdim \@tempskipa <\z@ \@tempskipa -\@tempskipa \@afterindentfalse\fi
|
||||
\ifdim \@tempskipa <\z@ \@tempskipa -\@tempskipa \fi
|
||||
\if@nobreak \everypar{}\else
|
||||
\addpenalty{\@secpenalty}\addvspace{\@tempskipa}\fi \@ifstar
|
||||
{\@ssect{#3}{#4}{#5}{#6}}{\@dblarg{\@sict{#1}{#2}{#3}{#4}{#5}{#6}}}}
|
||||
|
||||
\def\@sict#1#2#3#4#5#6[#7]#8{\ifnum #2>\c@secnumdepth
|
||||
\def\@svsec{}\else
|
||||
\refstepcounter{#1}\edef\@svsec{\csname the#1\endcsname}\fi
|
||||
\@tempskipa #5\relax
|
||||
\ifdim \@tempskipa>\z@
|
||||
\begingroup #6\relax
|
||||
\@hangfrom{\hskip #3\relax\@svsec.~}{\interlinepenalty \@M #8\par}
|
||||
\endgroup
|
||||
\csname #1mark\endcsname{#7}\addcontentsline
|
||||
{toc}{#1}{\ifnum #2>\c@secnumdepth \else
|
||||
\protect\numberline{\csname the#1\endcsname}\fi
|
||||
#7}\else
|
||||
\def\@svsechd{#6\hskip #3\@svsec #8\csname #1mark\endcsname
|
||||
{#7}\addcontentsline
|
||||
{toc}{#1}{\ifnum #2>\c@secnumdepth \else
|
||||
\protect\numberline{\csname the#1\endcsname}\fi
|
||||
#7}}\fi
|
||||
\@xsect{#5}}
|
||||
|
||||
\def\@sect#1#2#3#4#5#6[#7]#8{\ifnum #2>\c@secnumdepth
|
||||
\def\@svsec{}\else
|
||||
\refstepcounter{#1}\edef\@svsec{\csname the#1\endcsname\hskip 0.4em }\fi
|
||||
\@tempskipa #5\relax
|
||||
\ifdim \@tempskipa>\z@
|
||||
\begingroup #6\relax
|
||||
\@hangfrom{\hskip #3\relax\@svsec}{\interlinepenalty \@M #8\par}
|
||||
\endgroup
|
||||
\csname #1mark\endcsname{#7}\addcontentsline
|
||||
{toc}{#1}{\ifnum #2>\c@secnumdepth \else
|
||||
\protect\numberline{\csname the#1\endcsname}\fi
|
||||
#7}\else
|
||||
\def\@svsechd{#6\hskip #3\@svsec #8\csname #1mark\endcsname
|
||||
{#7}\addcontentsline
|
||||
{toc}{#1}{\ifnum #2>\c@secnumdepth \else
|
||||
\protect\numberline{\csname the#1\endcsname}\fi
|
||||
#7}}\fi
|
||||
\@xsect{#5}}
|
||||
|
||||
% section headings with less space above and below them
|
||||
\def\thesection {\arabic{section}}
|
||||
\def\thesubsection {\thesection.\arabic{subsection}}
|
||||
\def\section{\@startsection{section}{1}{\z@}{-0.12in}{0.02in}
|
||||
{\large\bf\raggedright}}
|
||||
\def\subsection{\@startsection{subsection}{2}{\z@}{-0.10in}{0.01in}
|
||||
{\normalsize\bf\raggedright}}
|
||||
\def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-0.08in}{0.01in}
|
||||
{\normalsize\sc\raggedright}}
|
||||
\def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus
|
||||
0.5ex minus .2ex}{-1em}{\normalsize\bf}}
|
||||
\def\subparagraph{\@startsection{subparagraph}{5}{\z@}{1.5ex plus
|
||||
0.5ex minus .2ex}{-1em}{\normalsize\bf}}
|
||||
|
||||
% Footnotes
|
||||
\footnotesep 6.65pt %
|
||||
\skip\footins 9pt
|
||||
\def\footnoterule{\kern-3pt \hrule width 0.8in \kern 2.6pt }
|
||||
\setcounter{footnote}{0}
|
||||
|
||||
% Lists and paragraphs
|
||||
\parindent 0pt
|
||||
\topsep 4pt plus 1pt minus 2pt
|
||||
\partopsep 1pt plus 0.5pt minus 0.5pt
|
||||
\itemsep 2pt plus 1pt minus 0.5pt
|
||||
\parsep 2pt plus 1pt minus 0.5pt
|
||||
\parskip 6pt
|
||||
|
||||
\leftmargin 2em \leftmargini\leftmargin \leftmarginii 2em
|
||||
\leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em
|
||||
\leftmarginvi .5em
|
||||
\labelwidth\leftmargini\advance\labelwidth-\labelsep \labelsep 5pt
|
||||
|
||||
\def\@listi{\leftmargin\leftmargini}
|
||||
\def\@listii{\leftmargin\leftmarginii
|
||||
\labelwidth\leftmarginii\advance\labelwidth-\labelsep
|
||||
\topsep 2pt plus 1pt minus 0.5pt
|
||||
\parsep 1pt plus 0.5pt minus 0.5pt
|
||||
\itemsep \parsep}
|
||||
\def\@listiii{\leftmargin\leftmarginiii
|
||||
\labelwidth\leftmarginiii\advance\labelwidth-\labelsep
|
||||
\topsep 1pt plus 0.5pt minus 0.5pt
|
||||
\parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt
|
||||
\itemsep \topsep}
|
||||
\def\@listiv{\leftmargin\leftmarginiv
|
||||
\labelwidth\leftmarginiv\advance\labelwidth-\labelsep}
|
||||
\def\@listv{\leftmargin\leftmarginv
|
||||
\labelwidth\leftmarginv\advance\labelwidth-\labelsep}
|
||||
\def\@listvi{\leftmargin\leftmarginvi
|
||||
\labelwidth\leftmarginvi\advance\labelwidth-\labelsep}
|
||||
|
||||
\abovedisplayskip 7pt plus2pt minus5pt%
|
||||
\belowdisplayskip \abovedisplayskip
|
||||
\abovedisplayshortskip 0pt plus3pt%
|
||||
\belowdisplayshortskip 4pt plus3pt minus3pt%
|
||||
|
||||
% Less leading in most fonts (due to the narrow columns)
|
||||
% The choices were between 1-pt and 1.5-pt leading
|
||||
\def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt}
|
||||
\def\small{\@setsize\small{10pt}\ixpt\@ixpt}
|
||||
\def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt}
|
||||
\def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt}
|
||||
\def\tiny{\@setsize\tiny{7pt}\vipt\@vipt}
|
||||
\def\large{\@setsize\large{14pt}\xiipt\@xiipt}
|
||||
\def\Large{\@setsize\Large{16pt}\xivpt\@xivpt}
|
||||
\def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt}
|
||||
\def\huge{\@setsize\huge{23pt}\xxpt\@xxpt}
|
||||
\def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt}
|
||||
|
||||
% Revised formatting for figure captions and table titles.
|
||||
\newsavebox\newcaptionbox\newdimen\newcaptionboxwid
|
||||
|
||||
\long\def\@makecaption#1#2{
|
||||
\vskip 10pt
|
||||
\baselineskip 11pt
|
||||
\setbox\@tempboxa\hbox{#1. #2}
|
||||
\ifdim \wd\@tempboxa >\hsize
|
||||
\sbox{\newcaptionbox}{\small\sl #1.~}
|
||||
\newcaptionboxwid=\wd\newcaptionbox
|
||||
\usebox\newcaptionbox {\footnotesize #2}
|
||||
% \usebox\newcaptionbox {\small #2}
|
||||
\else
|
||||
\centerline{{\small\sl #1.} {\small #2}}
|
||||
\fi}
|
||||
|
||||
\def\fnum@figure{Figure \thefigure}
|
||||
\def\fnum@table{Table \thetable}
|
||||
|
||||
% Strut macros for skipping spaces above and below text in tables.
|
||||
\def\abovestrut#1{\rule[0in]{0in}{#1}\ignorespaces}
|
||||
\def\belowstrut#1{\rule[-#1]{0in}{#1}\ignorespaces}
|
||||
|
||||
\def\abovespace{\abovestrut{0.20in}}
|
||||
\def\aroundspace{\abovestrut{0.20in}\belowstrut{0.10in}}
|
||||
\def\belowspace{\belowstrut{0.10in}}
|
||||
|
||||
% Various personal itemization commands.
|
||||
\def\texitem#1{\par\noindent\hangindent 12pt
|
||||
\hbox to 12pt {\hss #1 ~}\ignorespaces}
|
||||
\def\icmlitem{\texitem{$\bullet$}}
|
||||
|
||||
% To comment out multiple lines of text.
|
||||
\long\def\comment#1{}
|
||||
|
||||
|
||||
|
||||
|
||||
%% Line counter (not in final version). Adapted from NIPS style file by Christoph Sawade
|
||||
|
||||
% Vertical Ruler
|
||||
% This code is, largely, from the CVPR 2010 conference style file
|
||||
% ----- define vruler
|
||||
\makeatletter
|
||||
\newbox\icmlrulerbox
|
||||
\newcount\icmlrulercount
|
||||
\newdimen\icmlruleroffset
|
||||
\newdimen\cv@lineheight
|
||||
\newdimen\cv@boxheight
|
||||
\newbox\cv@tmpbox
|
||||
\newcount\cv@refno
|
||||
\newcount\cv@tot
|
||||
% NUMBER with left flushed zeros \fillzeros[<WIDTH>]<NUMBER>
|
||||
\newcount\cv@tmpc@ \newcount\cv@tmpc
|
||||
\def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi
|
||||
\cv@tmpc=1 %
|
||||
\loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi
|
||||
\ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat
|
||||
\ifnum#2<0\advance\cv@tmpc1\relax-\fi
|
||||
\loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat
|
||||
\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}%
|
||||
% \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
|
||||
\def\makevruler[#1][#2][#3][#4][#5]{
|
||||
\begingroup\offinterlineskip
|
||||
\textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt%
|
||||
\global\setbox\icmlrulerbox=\vbox to \textheight{%
|
||||
{
|
||||
\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight
|
||||
\cv@lineheight=#1\global\icmlrulercount=#2%
|
||||
\cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2%
|
||||
\cv@refno1\vskip-\cv@lineheight\vskip1ex%
|
||||
\loop\setbox\cv@tmpbox=\hbox to0cm{ % side margin
|
||||
\hfil {\hfil\fillzeros[#4]\icmlrulercount}
|
||||
}%
|
||||
\ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break
|
||||
\advance\cv@refno1\global\advance\icmlrulercount#3\relax
|
||||
\ifnum\cv@refno<\cv@tot\repeat
|
||||
}
|
||||
}
|
||||
\endgroup
|
||||
}%
|
||||
\makeatother
|
||||
% ----- end of vruler
|
||||
|
||||
|
||||
% \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
|
||||
\def\icmlruler#1{\makevruler[12pt][#1][1][3][\textheight]\usebox{\icmlrulerbox}}
|
||||
\AddToShipoutPicture{%
|
||||
\icmlruleroffset=\textheight
|
||||
\advance\icmlruleroffset by 5.2pt % top margin
|
||||
\color[rgb]{.7,.7,.7}
|
||||
\ifdefined\isaccepted \else
|
||||
\AtTextUpperLeft{%
|
||||
\put(\LenToUnit{-35pt},\LenToUnit{-\icmlruleroffset}){%left ruler
|
||||
\icmlruler{\icmlrulercount}}
|
||||
% \put(\LenToUnit{1.04\textwidth},\LenToUnit{-\icmlruleroffset}){%right ruler
|
||||
% \icmlruler{\icmlrulercount}}
|
||||
}
|
||||
\fi
|
||||
}
|
||||
\endinput
|
50
report/mlp2022_includes.tex
Normal file
@ -0,0 +1,50 @@
|
||||
\usepackage[T1]{fontenc}
|
||||
\usepackage{amssymb,amsmath}
|
||||
\usepackage{txfonts}
|
||||
\usepackage{microtype}
|
||||
|
||||
% For figures
|
||||
\usepackage{graphicx}
|
||||
\usepackage{subcaption}
|
||||
|
||||
% For citations
|
||||
\usepackage{natbib}
|
||||
|
||||
% For algorithms
|
||||
\usepackage{algorithm}
|
||||
\usepackage{algorithmic}
|
||||
|
||||
% the hyperref package is used to produce hyperlinks in the
|
||||
% resulting PDF. If this breaks your system, please comment out the
|
||||
% following usepackage line and replace \usepackage{mlp2022} with
|
||||
% \usepackage[nohyperref]{mlp2022} below.
|
||||
\usepackage{hyperref}
|
||||
\usepackage{url}
|
||||
\urlstyle{same}
|
||||
|
||||
\usepackage{color}
|
||||
\usepackage{booktabs} % To thicken table lines
|
||||
\usepackage{multirow} % Multirow cells in table
|
||||
|
||||
% Packages hyperref and algorithmic misbehave sometimes. We can fix
|
||||
% this with the following command.
|
||||
\newcommand{\theHalgorithm}{\arabic{algorithm}}
|
||||
|
||||
|
||||
% Set up MLP coursework style (based on ICML style)
|
||||
\usepackage{mlp2022}
|
||||
\mlptitlerunning{MLP Coursework 2 (\studentNumber)}
|
||||
\bibliographystyle{icml2017}
|
||||
\usepackage{bm,bbm}
|
||||
\usepackage{soul}
|
||||
|
||||
\DeclareMathOperator{\softmax}{softmax}
|
||||
\DeclareMathOperator{\sigmoid}{sigmoid}
|
||||
\DeclareMathOperator{\sgn}{sgn}
|
||||
\DeclareMathOperator{\relu}{relu}
|
||||
\DeclareMathOperator{\lrelu}{lrelu}
|
||||
\DeclareMathOperator{\elu}{elu}
|
||||
\DeclareMathOperator{\selu}{selu}
|
||||
\DeclareMathOperator{\maxout}{maxout}
|
||||
\newcommand{\bx}{\bm{x}}
|
||||
|
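The \DeclareMathOperator lines and the \bx shorthand above are there so that activation and output function names are typeset as upright operators rather than as products of italic variables. A small illustrative equation (the formula itself is only an example, not part of the coursework files):

```latex
% Example equation using the operators and \bx defined above
% (requires amsmath and bm, both loaded in this file).
\begin{align}
  \bm{h} &= \relu(W_1 \bx + \bm{b}_1), \\
  \bm{y} &= \softmax(W_2 \bm{h} + \bm{b}_2).
\end{align}
```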
184
report/refs.bib
Normal file
@ -0,0 +1,184 @@
|
||||
|
||||
@inproceedings{goodfellow2013maxout,
|
||||
title={Maxout networks},
|
||||
author={Goodfellow, Ian and Warde-Farley, David and Mirza, Mehdi and Courville, Aaron and Bengio, Yoshua},
|
||||
booktitle={International conference on machine learning},
|
||||
pages={1319--1327},
|
||||
year={2013},
|
||||
organization={PMLR}
|
||||
}
|
||||
|
||||
@article{srivastava2014dropout,
|
||||
title={Dropout: a simple way to prevent neural networks from overfitting},
|
||||
author={Srivastava, Nitish and Hinton, Geoffrey and Krizhevsky, Alex and Sutskever, Ilya and Salakhutdinov, Ruslan},
|
||||
journal={The Journal of Machine Learning Research},
|
||||
volume={15},
|
||||
number={1},
|
||||
pages={1929--1958},
|
||||
year={2014},
|
||||
publisher={JMLR.org}
|
||||
}
|
||||
|
||||
@book{Goodfellow-et-al-2016,
|
||||
title={Deep Learning},
|
||||
author={Ian Goodfellow and Yoshua Bengio and Aaron Courville},
|
||||
publisher={MIT Press},
|
||||
note={\url{http://www.deeplearningbook.org}},
|
||||
year={2016}
|
||||
}
|
||||
|
||||
@inproceedings{ng2004feature,
|
||||
title={Feature selection, L1 vs. L2 regularization, and rotational invariance},
|
||||
author={Ng, Andrew Y},
|
||||
booktitle={Proceedings of the twenty-first international conference on Machine learning},
|
||||
pages={78},
|
||||
year={2004}
|
||||
}
|
||||
|
||||
@article{simonyan2014very,
|
||||
title={Very deep convolutional networks for large-scale image recognition},
|
||||
author={Simonyan, Karen and Zisserman, Andrew},
|
||||
journal={arXiv preprint arXiv:1409.1556},
|
||||
year={2014}
|
||||
}
|
||||
|
||||
@inproceedings{he2016deep,
|
||||
title={Deep residual learning for image recognition},
|
||||
author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
|
||||
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
|
||||
pages={770--778},
|
||||
year={2016}
|
||||
}
|
||||
|
||||
@inproceedings{glorot2010understanding,
|
||||
title={Understanding the difficulty of training deep feedforward neural networks},
|
||||
author={Glorot, Xavier and Bengio, Yoshua},
|
||||
booktitle={Proceedings of the thirteenth international conference on artificial intelligence and statistics},
|
||||
pages={249--256},
|
||||
year={2010},
|
||||
organization={JMLR Workshop and Conference Proceedings}
|
||||
}
|
||||
|
||||
@inproceedings{bengio1993problem,
|
||||
title={The problem of learning long-term dependencies in recurrent networks},
|
||||
author={Bengio, Yoshua and Frasconi, Paolo and Simard, Patrice},
|
||||
booktitle={IEEE international conference on neural networks},
|
||||
pages={1183--1188},
|
||||
year={1993},
|
||||
organization={IEEE}
|
||||
}
|
||||
|
||||
@inproceedings{ide2017improvement,
|
||||
title={Improvement of learning for CNN with ReLU activation by sparse regularization},
|
||||
author={Ide, Hidenori and Kurita, Takio},
|
||||
booktitle={2017 International Joint Conference on Neural Networks (IJCNN)},
|
||||
pages={2684--2691},
|
||||
year={2017},
|
||||
organization={IEEE}
|
||||
}
|
||||
|
||||
@inproceedings{ioffe2015batch,
|
||||
title={Batch normalization: Accelerating deep network training by reducing internal covariate shift},
|
||||
author={Ioffe, Sergey and Szegedy, Christian},
|
||||
booktitle={International conference on machine learning},
|
||||
pages={448--456},
|
||||
year={2015},
|
||||
organization={PMLR}
|
||||
}
|
||||
|
||||
@inproceedings{huang2017densely,
|
||||
title={Densely connected convolutional networks},
|
||||
author={Huang, Gao and Liu, Zhuang and Van Der Maaten, Laurens and Weinberger, Kilian Q},
|
||||
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
|
||||
pages={4700--4708},
|
||||
year={2017}
|
||||
}
|
||||
|
||||
@article{rumelhart1986learning,
|
||||
title={Learning representations by back-propagating errors},
|
||||
author={Rumelhart, David E and Hinton, Geoffrey E and Williams, Ronald J},
|
||||
journal={Nature},
|
||||
volume={323},
|
||||
number={6088},
|
||||
pages={533--536},
|
||||
year={1986},
|
||||
publisher={Nature Publishing Group}
|
||||
}
|
||||
|
||||
@inproceedings{du2019gradient,
|
||||
title={Gradient descent finds global minima of deep neural networks},
|
||||
author={Du, Simon and Lee, Jason and Li, Haochuan and Wang, Liwei and Zhai, Xiyu},
|
||||
booktitle={International Conference on Machine Learning},
|
||||
pages={1675--1685},
|
||||
year={2019},
|
||||
organization={PMLR}
|
||||
}
|
||||
|
||||
@inproceedings{pascanu2013difficulty,
|
||||
title={On the difficulty of training recurrent neural networks},
|
||||
author={Pascanu, Razvan and Mikolov, Tomas and Bengio, Yoshua},
|
||||
booktitle={International conference on machine learning},
|
||||
pages={1310--1318},
|
||||
year={2013},
|
||||
organization={PMLR}
|
||||
}
|
||||
|
||||
@article{li2017visualizing,
|
||||
title={Visualizing the loss landscape of neural nets},
|
||||
author={Li, Hao and Xu, Zheng and Taylor, Gavin and Studer, Christoph and Goldstein, Tom},
|
||||
journal={arXiv preprint arXiv:1712.09913},
|
||||
year={2017}
|
||||
}
|
||||
|
||||
@inproceedings{santurkar2018does,
|
||||
title={How does batch normalization help optimization?},
|
||||
author={Santurkar, Shibani and Tsipras, Dimitris and Ilyas, Andrew and M{\k{a}}dry, Aleksander},
|
||||
booktitle={Proceedings of the 32nd international conference on neural information processing systems},
|
||||
pages={2488--2498},
|
||||
year={2018}
|
||||
}
|
||||
|
||||
@article{krizhevsky2009learning,
|
||||
title={Learning multiple layers of features from tiny images},
|
||||
author={Krizhevsky, Alex and Hinton, Geoffrey and others},
|
||||
journal={},
|
||||
year={2009},
|
||||
publisher={Citeseer}
|
||||
}
|
||||
|
||||
@incollection{lecun2012efficient,
|
||||
title={Efficient backprop},
|
||||
author={LeCun, Yann A and Bottou, L{\'e}on and Orr, Genevieve B and M{\"u}ller, Klaus-Robert},
|
||||
booktitle={Neural networks: Tricks of the trade},
|
||||
pages={9--48},
|
||||
year={2012},
|
||||
publisher={Springer}
|
||||
}
|
||||
|
||||
@book{bishop1995neural,
|
||||
title={Neural networks for pattern recognition},
|
||||
author={Bishop, Christopher M and others},
|
||||
year={1995},
|
||||
publisher={Oxford university press}
|
||||
}
|
||||
|
||||
@article{vaswani2017attention,
|
||||
author = {Ashish Vaswani and
|
||||
Noam Shazeer and
|
||||
Niki Parmar and
|
||||
Jakob Uszkoreit and
|
||||
Llion Jones and
|
||||
Aidan N. Gomez and
|
||||
Lukasz Kaiser and
|
||||
Illia Polosukhin},
|
||||
title = {Attention Is All You Need},
|
||||
journal = {CoRR},
|
||||
volume = {abs/1706.03762},
|
||||
year = {2017},
|
||||
url = {http://arxiv.org/abs/1706.03762},
|
||||
eprinttype = {arXiv},
|
||||
eprint = {1706.03762},
|
||||
timestamp = {Sat, 23 Jan 2021 01:20:40 +0100},
|
||||
biburl = {https://dblp.org/rec/journals/corr/VaswaniSPUJGKP17.bib},
|
||||
bibsource = {dblp computer science bibliography, https://dblp.org}
|
||||
}
|
1
run_vgg_08_default.sh
Normal file
@ -0,0 +1 @@
|
||||
python pytorch_mlp_framework/train_evaluate_image_classification_system.py --batch_size 100 --seed 0 --num_filters 32 --num_stages 3 --num_blocks_per_stage 0 --experiment_name VGG_08_experiment --use_gpu True --num_classes 100 --block_type 'conv_block' --continue_from_epoch -1
|