Updating third lab notebook on multilayer models.
This commit is contained in:
parent
b1a031deac
commit
ee0c6ce1d5
@ -4,195 +4,166 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Introduction\n",
|
||||
"$\\newcommand{\\vct}[1]{\\boldsymbol{#1}}\n",
|
||||
"\\newcommand{\\mtx}[1]{\\mathbf{#1}}\n",
|
||||
"\\newcommand{\\tr}{^\\mathrm{T}}\n",
|
||||
"\\newcommand{\\reals}{\\mathbb{R}}\n",
|
||||
"\\newcommand{\\lpa}{\\left(}\n",
|
||||
"\\newcommand{\\rpa}{\\right)}\n",
|
||||
"\\newcommand{\\lsb}{\\left[}\n",
|
||||
"\\newcommand{\\rsb}{\\right]}\n",
|
||||
"\\newcommand{\\lbr}{\\left\\lbrace}\n",
|
||||
"\\newcommand{\\rbr}{\\right\\rbrace}\n",
|
||||
"\\newcommand{\\fset}[1]{\\lbr #1 \\rbr}\n",
|
||||
"\\newcommand{\\pd}[2]{\\frac{\\partial #1}{\\partial #2}}$\n",
|
||||
"\n",
|
||||
"This tutorial is an introduction to the first coursework about multi-layer networks (also known as Multi-Layer Perceptrons - MLPs - or Deep Neural Networks - DNNs). Here, we will show how to build a single layer linear model (similar to the one from the previous lab) for MNIST digit classification using the provided code-base. \n",
|
||||
"# Multiple layer models\n",
|
||||
"\n",
|
||||
"The principal purpose of this introduction is to get you familiar with how to connect the code blocks (and what operations each of them implements) in order to set up an experiment that includes 1) building the model structure 2) optimising the model's parameters (weights) and 3) evaluating the model on test data. \n",
|
||||
"In this notebook we will explore network models with multiple layers of transformations. This will build upon the single-layer affine model we looked at in the previous notebook and use material covered in the [second](http://www.inf.ed.ac.uk/teaching/courses/mlp/2016/mlp02-sln.pdf) and [third](http://www.inf.ed.ac.uk/teaching/courses/mlp/2016/mlp03-mlp.pdf) lectures.\n",
|
||||
"\n",
|
||||
"## For those affected by notebook kernel issues\n",
|
||||
"You will need to use these models for the experiments you will be running in the first coursework so part of the aim of this lab will be to get you familiar with how to construct multiple layer models in our framework and how to train them.\n",
|
||||
"\n",
|
||||
"In case you are still having issues with running notebook kernels, have a look at [this note](https://github.com/CSTR-Edinburgh/mlpractical/blob/master/kernel_issue_fix.md) on the GitHub.\n",
|
||||
"## What is a layer?\n",
|
||||
"\n",
|
||||
"## Virtual environments\n",
|
||||
"Often when discussing (neural) network models, a network layer is taken to mean an input to output transformation of the form\n",
|
||||
"\n",
|
||||
"Before you proceed onwards, remember to activate your virtual environment:\n",
|
||||
" * If you were in last week's Tuesday or Wednesday group type `activate_mlp` or `source ~/mlpractical/venv/bin/activate`\n",
|
||||
" * If you were in the Monday group:\n",
|
||||
" + and if you have chosen the **comfy** way type: `workon mlpractical`\n",
|
||||
" + and if you have chosen the **generic** way, `source` your virutal environment using `source` and specyfing the path to the activate script (you need to localise it yourself, there were not any general recommendations w.r.t dir structure and people have installed it in different places, usually somewhere in the home directories. If you cannot easily find it by yourself, use something like: `find . -iname activate` ):\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\vct{y} = \\vct{f}(\\mtx{W} \\vct{x} + \\vct{b})\n",
|
||||
" \\qquad\n",
|
||||
" \\Leftrightarrow\n",
|
||||
" \\qquad\n",
|
||||
" y_k = f\\lpa\\sum_{d=1}^D \\lpa W_{kd} x_d \\rpa + b_k \\rpa\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"## Syncing the git repository\n",
|
||||
"where $\\mtx{W}$ and $\\vct{b}$ parameterise an affine transformation as discussed in the previous notebook, and $f$ is a function applied elementwise to the result of the affine transformation. For example a common choice for $f$ is the logistic sigmoid function \n",
|
||||
"\\begin{equation}\n",
|
||||
" f(u) = \\frac{1}{1 + \\exp(-u)}.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"Look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a> for more details. But in short, we recommend to create a separate branch for the coursework, as follows:\n",
|
||||
"In the second lecture slides you were shown how to train a model consisting of an affine transformation followed by the elementwise logistic sigmoid using gradient descent. This was referred to as a 'sigmoid single-layer network'.\n",
|
||||
"\n",
|
||||
"1. Enter the mlpractical directory `cd ~/mlpractical/repo-mlp`\n",
|
||||
"2. List the branches and check which is currently active by typing: `git checkout`\n",
|
||||
"3. If you are not in `master` branch, switch to it by typing: \n",
|
||||
"```\n",
|
||||
"git checkout master\n",
|
||||
" ```\n",
|
||||
"4. Then update the repository (note, assuming master does not have any conflicts), if there are some, have a look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>\n",
|
||||
"```\n",
|
||||
"git pull\n",
|
||||
"```\n",
|
||||
"5. And now, create the new branch & swith to it by typing:\n",
|
||||
"```\n",
|
||||
"git checkout -b coursework1\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Multi Layer Models\n",
|
||||
"In the previous notebook we also referred to single-layer models, where in that case the layer was an affine transformation, with you implementing the various necessary methods for the `AffineLayer` class before using an instance of that class within a `SingleLayerModel` on a regression problem. We could in that case consider the function $f$ to simply be the identity function $f(u) = u$. In the code for the labs we will however use a slightly different convention. Here we will consider the affine transformations and subsequent elementwise function $f$ to each be a separate transformation layer. \n",
|
||||
"\n",
|
||||
"Today, we shall build models which can have an arbitrary number of hidden layers. Please have a look at the diagram below, and the corresponding computations (which have an *exact* matrix form as expected by numpy, and row-wise orientation; note that $\\circ$ denotes an element-wise product). In the diagram, we briefly describe how each comptation relates to the code we have provided.\n",
|
||||
"This will mean we can combine our already implemented `AffineLayer` class with any non-linear function applied to the outputs just by implementing a layer object for the relevant non-linearity and then stacking the two layers together. An alternative would be to have our new layer objects inherit from `AffineLayer` and then call the relevant parent class methods in the child class however this would mean we need to include a lot of the same boilerplate code in every new class.\n",
|
||||
"\n",
|
||||
"![Making Predictions](res/code_scheme.svg)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Structuring the model\n",
|
||||
" * The model (for now) is allowed to have a sequence of layers, mapping inputs $\\mathbf{x}$ to outputs $\\mathbf{y}$. \n",
|
||||
" * This operation is implemented as a special type of a layer in `mlp.layers.MLP` class. It keeps a sequence of other layers (of various typyes like Linear, Sigmoid, Softmax, etc.) as well as the internal state of a model for a mini-batch, that is, the intermediate data produced in *forward* and *backward* passes.\n",
|
||||
"2. Forward computation\n",
|
||||
" * `mlp.layers.MLP` provides an `fprop()` method that iterates over defined layers propagates $\\mathbf{x}$ to $\\mathbf{y}$. \n",
|
||||
" * Each layer (look at `mlp.layers.Linear` attached below) also implements an `fprop()` method, which performs an atomic, for the given layer, operation. Most often, for the $i$-th layer, we want to obtain a linear transform $\\mathbf a^i$ of the inputs, and apply some non-linear transfer function $f^i(\\mathbf a^i)$ to produce the output $\\mathbf h^i$. Note, in general each layer may implement different activation functions $f^i()$, however for now we will use only `sigmoid` and `softmax`\n",
|
||||
"3. Backward computation\n",
|
||||
" * Similarly, `mlp.layers.MLP` also implements a `bprop()` function, to back-propagate the errors from the top to the bottom layer. This class also keeps the back-propagated statistics ($\\delta$) to be used later when computing the gradients with respect to the parameters.\n",
|
||||
" * This functionality is also re-implemented by particular layers (again, have a look at the `bprop` function of `mlp.layers.Linear`). `bprop()` returns both $\\delta$ (needed to update the parameters) but also back-progapates the gradient down to the inputs. Also note, that depending on whether the layer is the top or not (i.e. if it deals directly with the cost function or not) some simplifications may apply ( as with cross-entropy and softmax). That's why when implementing a new type of layer that may be used as an output layer one also need to specify the implementation of `bprop_cost()`.\n",
|
||||
"4. Learning the model\n",
|
||||
" * The actual evaluation of the cost as well as the *forward* and *backward* passes may be found in the `train_epoch()` method of `mlp.optimisers.SGDOptimiser`\n",
|
||||
" * This function also calls the `pgrads()` method on each layer, that given activations and deltas, returns the list of the gradients of the cost with respect to the model parameters, i.e. $\\frac{\\partial{\\mathbf{E}}}{\\partial{\\mathbf{W^i}}}$ and $\\frac{\\partial{\\mathbf{E}}}{\\partial{\\mathbf{b}^i}}$ at the above diagram (look at an example implementation in `mlp.layers.Linear`)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"Example code for the above\n",
|
||||
"```python\n",
|
||||
"# %load -s Linear mlp/layers.py\n",
|
||||
"class Linear(Layer):\n",
|
||||
"To give a concrete example, in the `mlp.layers` module there is a definition for a `SigmoidLayer` equivalent to the following (documentation strings have been removed here for brevity)\n",
|
||||
"\n",
|
||||
" def __init__(self, idim, odim,\n",
|
||||
" rng=None,\n",
|
||||
" irange=0.1):\n",
|
||||
"\n",
|
||||
" super(Linear, self).__init__(rng=rng)\n",
|
||||
"\n",
|
||||
" self.idim = idim\n",
|
||||
" self.odim = odim\n",
|
||||
"\n",
|
||||
" self.W = self.rng.uniform(\n",
|
||||
" -irange, irange,\n",
|
||||
" (self.idim, self.odim))\n",
|
||||
"\n",
|
||||
" self.b = numpy.zeros((self.odim,), dtype=numpy.float32)\n",
|
||||
"```Python\n",
|
||||
"class SigmoidLayer(Layer):\n",
|
||||
"\n",
|
||||
" def fprop(self, inputs):\n",
|
||||
" \"\"\"\n",
|
||||
" Implements a forward propagation through the i-th layer, that is\n",
|
||||
" some form of:\n",
|
||||
" a^i = xW^i + b^i\n",
|
||||
" h^i = f^i(a^i)\n",
|
||||
" with f^i, W^i, b^i denoting a non-linearity, weight matrix and\n",
|
||||
" biases of this (i-th) layer, respectively and x denoting inputs.\n",
|
||||
" return 1. / (1. + np.exp(-inputs))\n",
|
||||
"\n",
|
||||
" :param inputs: matrix of features (x) or the output of the previous layer h^{i-1}\n",
|
||||
" :return: h^i, matrix of transformed by layer features\n",
|
||||
" \"\"\"\n",
|
||||
" a = numpy.dot(inputs, self.W) + self.b\n",
|
||||
" # here f() is an identity function, so just return a linear transformation\n",
|
||||
" return a\n",
|
||||
" def bprop(self, inputs, outputs, grads_wrt_outputs):\n",
|
||||
" return grads_wrt_outputs * outputs * (1. - outputs)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
" def bprop(self, h, igrads):\n",
|
||||
" \"\"\"\n",
|
||||
" Implements a backward propagation through the layer, that is, given\n",
|
||||
" h^i denotes the output of the layer and x^i the input, we compute:\n",
|
||||
" dh^i/dx^i which by chain rule is dh^i/da^i da^i/dx^i\n",
|
||||
" x^i could be either features (x) or the output of the lower layer h^{i-1}\n",
|
||||
" :param h: it's an activation produced in forward pass\n",
|
||||
" :param igrads, error signal (or gradient) flowing to the layer, note,\n",
|
||||
" this in general case does not corresponds to 'deltas' used to update\n",
|
||||
" the layer's parameters, to get deltas ones need to multiply it with\n",
|
||||
" the dh^i/da^i derivative\n",
|
||||
" :return: a tuple (deltas, ograds) where:\n",
|
||||
" deltas = igrads * dh^i/da^i\n",
|
||||
" ograds = deltas \\times da^i/dx^i\n",
|
||||
" \"\"\"\n",
|
||||
"As you can see this `SigmoidLayer` class has a very lightweight definition, defining just two key methods:\n",
|
||||
"\n",
|
||||
" # since df^i/da^i = 1 (f is assumed identity function),\n",
|
||||
" # deltas are in fact the same as igrads\n",
|
||||
" ograds = numpy.dot(igrads, self.W.T)\n",
|
||||
" return igrads, ograds\n",
|
||||
" * `fprop` which takes a batch of activations at the input to the layer and forward propagates them to produce activates at the outputs (directly equivalently to the `fprop` method you implemented for then `AffineLayer` in the previous notebook),\n",
|
||||
" * `brop` which takes a batch of gradients with respect to the outputs of the layer and backward propagates them to calculate gradients with respect to the inputs of the layer (explained in more detail below).\n",
|
||||
" \n",
|
||||
"This `SigmoidLayer` class only implements the logistic sigmoid non-linearity transformation and so does not have any parameters. Therefore unlike `AffineLayer` it is derived directly from the base `Layer` class rather than `LayerWithParameters` and does not need to implement `grads_wrt_params` or `params` methods. \n",
|
||||
"\n",
|
||||
" def bprop_cost(self, h, igrads, cost):\n",
|
||||
" \"\"\"\n",
|
||||
" Implements a backward propagation in case the layer directly\n",
|
||||
" deals with the optimised cost (i.e. the top layer)\n",
|
||||
" By default, method should implement a bprop for default cost, that is\n",
|
||||
" the one that is natural to the layer's output, i.e.:\n",
|
||||
" here we implement linear -> mse scenario\n",
|
||||
" :param h: it's an activation produced in forward pass\n",
|
||||
" :param igrads, error signal (or gradient) flowing to the layer, note,\n",
|
||||
" this in general case does not corresponds to 'deltas' used to update\n",
|
||||
" the layer's parameters, to get deltas ones need to multiply it with\n",
|
||||
" the dh^i/da^i derivative\n",
|
||||
" :param cost, mlp.costs.Cost instance defining the used cost\n",
|
||||
" :return: a tuple (deltas, ograds) where:\n",
|
||||
" deltas = igrads * dh^i/da^i\n",
|
||||
" ograds = deltas \\times da^i/dx^i\n",
|
||||
" \"\"\"\n",
|
||||
"To create a model consisting of an affine transformation followed by applying an elementwise logistic sigmoid transformation we first create a list of the two layer objects (in the order they are applied from inputs to outputs) and then use this to instantiate a new `MultipleLayerModel` object:\n",
|
||||
"\n",
|
||||
" if cost is None or cost.get_name() == 'mse':\n",
|
||||
" # for linear layer and mean square error cost,\n",
|
||||
" # cost back-prop is the same as standard back-prop\n",
|
||||
" return self.bprop(h, igrads)\n",
|
||||
" else:\n",
|
||||
" raise NotImplementedError('Linear.bprop_cost method not implemented '\n",
|
||||
" 'for the %s cost' % cost.get_name())\n",
|
||||
"```Python\n",
|
||||
"from mlp.layers import AffineLayer, SigmoidLayer\n",
|
||||
"from mlp.models import MultipleLayerModel\n",
|
||||
"\n",
|
||||
" def pgrads(self, inputs, deltas):\n",
|
||||
" \"\"\"\n",
|
||||
" Return gradients w.r.t parameters\n",
|
||||
"layers = [AffineLayer(input_dim, output_dim), SigmoidLayer()]\n",
|
||||
"model = MultipleLayerModel(layers)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
" :param inputs, input to the i-th layer\n",
|
||||
" :param deltas, deltas computed in bprop stage up to -ith layer\n",
|
||||
" :return list of grads w.r.t parameters dE/dW and dE/db in *exactly*\n",
|
||||
" the same order as the params are returned by get_params()\n",
|
||||
"Because of the modular way in which the layers are defined we can also stack an arbitrarily long sequence of layers together to produce deeper models. For instance the following would define a model consisting of three pairs of affine and logistic sigmoid transformations.\n",
|
||||
"\n",
|
||||
" Note: deltas here contain the whole chain rule leading\n",
|
||||
" from the cost up to the the i-th layer, i.e.\n",
|
||||
" dE/dy^L dy^L/da^L da^L/dh^{L-1} dh^{L-1}/da^{L-1} ... dh^{i}/da^{i}\n",
|
||||
" and here we are just asking about\n",
|
||||
" 1) da^i/dW^i and 2) da^i/db^i\n",
|
||||
" since W and b are only layer's parameters\n",
|
||||
" \"\"\"\n",
|
||||
"```Python\n",
|
||||
"model = MultipleLayerModel([\n",
|
||||
" AffineLayer(input_dim, hidden_dim), SigmoidLayer(),\n",
|
||||
" AffineLayer(hidden_dim, hidden_dim), SigmoidLayer(),\n",
|
||||
" AffineLayer(hidden_dim, output_dim), SigmoidLayer(),\n",
|
||||
"])\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
" grad_W = numpy.dot(inputs.T, deltas)\n",
|
||||
" grad_b = numpy.sum(deltas, axis=0)\n",
|
||||
"## Back-propagation of gradients\n",
|
||||
" \n",
|
||||
"To allow training models consisting of a stack of multiple layers, all layers need to implement a `bprop` method in addition to the `fprop` we encountered in the previous week. \n",
|
||||
"\n",
|
||||
" return [grad_W, grad_b]\n",
|
||||
"The `bprop` method takes gradients of an error function with respect to the *outputs* of a layer and uses these gradients to calculate gradients of the error function with respect to the *inputs* of a layer. As the inputs to a non-input layer in a multiple-layer model consist of the outputs of the previous layer, this means we can calculate the gradients of the error function with respect to the outputs of every layer in the model by iteratively propagating the gradients backwards through the layers of the model (i.e. from the last to first layer), hence the term 'back-propagation' or 'bprop' for short. A block diagram illustrating this is shown for a three layer model below.\n",
|
||||
"\n",
|
||||
" def get_params(self):\n",
|
||||
" return [self.W, self.b]\n",
|
||||
"<img src='res/fprop-bprop-block-diagram.png' />\n",
|
||||
"\n",
|
||||
" def set_params(self, params):\n",
|
||||
" #we do not make checks here, but the order on the list\n",
|
||||
" #is assumed to be exactly the same as get_params() returns\n",
|
||||
" self.W = params[0]\n",
|
||||
" self.b = params[1]\n",
|
||||
"For a layer with parameters, the gradients with respect to the layer outputs are required to calculate gradients with respect to the layer parameters. Therefore by combining backward propagation of gradients through the model with computing the gradients with respect to parameters in the relevant layers we can calculate gradients of the error function with respect to all of the parameters of a multiple-layer model in a very efficient manner (in fact the computational cost of computing gradients with respect to all of the parameters of the model using this method will only be a constant factor times the cost of calculating the model outputs in the forwards pass).\n",
|
||||
"\n",
|
||||
" def get_name(self):\n",
|
||||
" return 'linear'\n",
|
||||
"We so far have abstractly talked about calculating gradients with respect to the inputs of a layer using gradients with respect to the layer outputs. More concretely we will be using the chain rule for derivatives to do this, similarly to how we used the chain rule in exercise 4 of the previous notebook to calculate gradients with respect to the parameters of an affine layer given gradients with respect to the outputs of the layer.\n",
|
||||
"\n",
|
||||
"In particular if our layer has a batch of $B$ vector inputs each of dimension $D$, $\\fset{\\vct{x}^{(b)}}_{b=1}^B$, and produces a batch of $B$ vector outputs each of dimension $K$, $\\fset{\\vct{y}^{(b)}}_{b=1}^B$, then we can calculate the gradient with respect to the $d^\\textrm{th}$ dimension of the $b^{\\textrm{th}}$ input given the gradients with respect to the $b^{\\textrm{th}}$ output using\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\pd{\\bar{E}}{x^{(b)}_d} = \\sum_{k=1}^K \\lpa \\pd{\\bar{E}}{y^{(b)}_k} \\pd{y^{(b)}_k}{x^{(b)}_d} \\rpa.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"Mathematically therefore the `bprop` method takes an array of gradients with respect to the outputs $\\pd{y^{(b)}_k}{x^{(b)}_d}$ and applies a sum-product operation with the partial derivatives of each output with respect to each input $\\pd{y^{(b)}_k}{x^{(b)}_d}$ to produce gradients with respect to the inputs of the layer $\\pd{\\bar{E}}{x^{(b)}_d}$.\n",
|
||||
"\n",
|
||||
"For the affine transformation used in the `AffineLayer` implemented last week, i.e a forwards propagation corresponding to \n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" y^{(b)}_k = \\sum_{d=1}^D \\lpa W_{kd} x^{(b)}_d \\rpa + b_k\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"then the corresponding partial derivatives of layer outputs with respect to inputs are\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\pd{y^{(b)}_k}{x^{(b)}_d} = W_{kd}\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"and so the backwards-propagation method for the `AffineLayer` takes the following form\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\pd{\\bar{E}}{x^{(b)}_d} = \\sum_{k=1}^K \\lpa \\pd{\\bar{E}}{y^{(b)}_k} W_{kd} \\rpa.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"This can be efficiently implemented in NumPy using the `dot` function\n",
|
||||
"\n",
|
||||
"```Python\n",
|
||||
"class AffineLayer(LayerWithParameters):\n",
|
||||
"\n",
|
||||
" # ... [implementation of remaining methods from previous week] ...\n",
|
||||
" \n",
|
||||
" def bprop(self, inputs, outputs, grads_wrt_outputs):\n",
|
||||
" return grads_wrt_outputs.dot(self.weights)\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"An important special case applies when the outputs of a layer are an elementwise function of the inputs such that $y^{(b)}_k$ only depends on $x^{(b)}_d$ when $d = k$. In this case the partial derivatives $\\pd{y^{(b)}_k}{x^{(b)}_d}$ will be zero for $k \\neq d$ and so the above summation collapses to a single term, giving\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\pd{\\bar{E}}{x^{(b)}_d} = \\pd{\\bar{E}}{y^{(b)}_d} \\pd{y^{(b)}_d}{x^{(b)}_d}\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"i.e. to calculate the gradient with respect to the $b^{\\textrm{th}}$ input vector we just perform an elementwise multiplication of the gradient with respect to the $b^{\\textrm{th}}$ output vector with the vector of derivatives of the outputs with respect to the inputs. This case applies to the `SigmoidLayer` and to all other layers applying an elementwise function to their inputs.\n",
|
||||
"\n",
|
||||
"For the logistic sigmoid layer we have that\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" y^{(b)}_d = \\frac{1}{1 + \\exp(-x^{(b)}_d)}\n",
|
||||
" \\qquad\n",
|
||||
" \\Rightarrow\n",
|
||||
" \\qquad\n",
|
||||
" \\pd{y^{(b)}_d}{x^{(b)}_d} = \n",
|
||||
" \\frac{\\exp(-x^{(b)}_d)}{\\lsb 1 + \\exp(-x^{(b)}_d) \\rsb^2} =\n",
|
||||
" y^{(b)}_d \\lsb 1 - y^{(b)}_d \\rsb\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"which you should now be able relate to the implementation of `SigmoidLayer.bprop` given earlier:\n",
|
||||
"\n",
|
||||
"```Python\n",
|
||||
"class SigmoidLayer(Layer):\n",
|
||||
"\n",
|
||||
" def fprop(self, inputs):\n",
|
||||
" return 1. / (1. + np.exp(-inputs))\n",
|
||||
"\n",
|
||||
" def bprop(self, inputs, outputs, grads_wrt_outputs):\n",
|
||||
" return grads_wrt_outputs * outputs * (1. - outputs)\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
@ -200,17 +171,15 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Example 1: Experiment with linear models and MNIST\n",
|
||||
"## Exercise 1: training a softmax model on MNIST\n",
|
||||
"\n",
|
||||
"The below snippet demonstrates how to use the code we have provided for the coursework 1. Get familiar with it, as from now on we will use till the end of the course, including the 2nd coursework.\n",
|
||||
"For this first exercise we will train a model consisting of an affine transformation plus softmax on a multiclass classification task: classifying the digit labels for handwritten digit images from the MNIST data set introduced in the first notebook.\n",
|
||||
"\n",
|
||||
"It should be straightforward to extend the following code to more complex models, like stack more layers, change the cost, the optimiser, learning rate schedules, etc.. But **ask** in case something is not clear.\n",
|
||||
"First run the cell below to import the necessary modules and classes and to load the MNIST data provider objects. As it takes a little while to load the MNIST data from disk into memory it is worth loading the data providers just once in a separate cell like this rather than recreating the objects for every training run.\n",
|
||||
"\n",
|
||||
"In this particular example, we use the following components:\n",
|
||||
" * One layer mapping data-points ($\\mathbf x$) straight to 10 digits classes represented as 10 (linear) outputs ($\\mathbf y$). This operation is implemented as a linear layer in `mlp.layers.Linear`. Get familiar with this class (read the comments, etc.) as it is going to be a building block for the coursework.\n",
|
||||
" * One can stack as many different layers as required through the container `mlp.layers.MLP`\n",
|
||||
" * As an objective here we use the Mean Square Error cost defined in `mlp.costs.MSECost`\n",
|
||||
" * Our *Stochastic Gradient Descent* optimiser can be found in `mlp.optimisers.SGDOptimiser`. Its parent `mlp.optimisers.Optimiser` implements validation functionality (and an interface in case one need to implement a different optimiser)."
|
||||
"We are loading two data provider objects here - one corresponding to the training data set and a second to use as a *validation* data set. This is data we do not train the model on but measure the performance of the trained model on to assess its ability to *generalise* to unseen data. \n",
|
||||
"\n",
|
||||
"If you are in the Monday or Tuesday lab sessions you will not yet have had the lecture introducing the concepts of generalisation and validation data sets (though those doing MLPR alongside this course should already be familiar with these ideas). As you will need to report both training and validation set performances in your experiments for the first coursework assignment we are providing code here to give an example of how to do this."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -221,52 +190,39 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy\n",
|
||||
"import numpy as np\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"import logging\n",
|
||||
"from mlp.layers import AffineLayer, SoftmaxLayer, SigmoidLayer\n",
|
||||
"from mlp.errors import CrossEntropyError, CrossEntropySoftmaxError\n",
|
||||
"from mlp.models import SingleLayerModel, MultipleLayerModel\n",
|
||||
"from mlp.initialisers import UniformInit\n",
|
||||
"from mlp.learning_rules import GradientDescentLearningRule\n",
|
||||
"from mlp.data_providers import MNISTDataProvider\n",
|
||||
"from mlp.optimisers import Optimiser\n",
|
||||
"%matplotlib inline\n",
|
||||
"plt.style.use('ggplot')\n",
|
||||
"\n",
|
||||
"# Seed a random number generator\n",
|
||||
"seed = 6102016 \n",
|
||||
"rng = np.random.RandomState(seed)\n",
|
||||
"\n",
|
||||
"# Set up a logger object to print info about the training run to stdout\n",
|
||||
"logger = logging.getLogger()\n",
|
||||
"logger.setLevel(logging.INFO)\n",
|
||||
"logger.handlers = [logging.StreamHandler()]\n",
|
||||
"\n",
|
||||
"from mlp.layers import MLP, Linear #import required layer types\n",
|
||||
"from mlp.optimisers import SGDOptimiser #import the optimiser\n",
|
||||
"from mlp.dataset import MNISTDataProvider #import data provider\n",
|
||||
"from mlp.costs import MSECost #import the cost we want to use for optimisation\n",
|
||||
"from mlp.schedulers import LearningRateFixed\n",
|
||||
"\n",
|
||||
"rng = numpy.random.RandomState([2015,10,10])\n",
|
||||
"\n",
|
||||
"# define the model structure, here just one linear layer\n",
|
||||
"# and mean square error cost\n",
|
||||
"cost = MSECost()\n",
|
||||
"model = MLP(cost=cost)\n",
|
||||
"model.add_layer(Linear(idim=784, odim=10, rng=rng))\n",
|
||||
"#one can stack more layers here\n",
|
||||
"\n",
|
||||
"# define the optimiser, here stochasitc gradient descent\n",
|
||||
"# with fixed learning rate and max_epochs as stopping criterion\n",
|
||||
"lr_scheduler = LearningRateFixed(learning_rate=0.01, max_epochs=20)\n",
|
||||
"optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
|
||||
"\n",
|
||||
"logger.info('Initialising data providers...')\n",
|
||||
"train_dp = MNISTDataProvider(dset='train', batch_size=100, max_num_batches=-10, randomize=True)\n",
|
||||
"valid_dp = MNISTDataProvider(dset='valid', batch_size=100, max_num_batches=-10, randomize=False)\n",
|
||||
"\n",
|
||||
"logger.info('Training started...')\n",
|
||||
"optimiser.train(model, train_dp, valid_dp)\n",
|
||||
"\n",
|
||||
"logger.info('Testing the model on test set:')\n",
|
||||
"test_dp = MNISTDataProvider(dset='eval', batch_size=100, max_num_batches=-10, randomize=False)\n",
|
||||
"cost, accuracy = optimiser.validate(model, test_dp)\n",
|
||||
"logger.info('MNIST test set accuracy is %.2f %% (cost is %.3f)'%(accuracy*100., cost))\n"
|
||||
"# Create data provider objects for the MNIST data set\n",
|
||||
"train_data = MNISTDataProvider('train', rng=rng)\n",
|
||||
"valid_data = MNISTDataProvider('valid', rng=rng)\n",
|
||||
"input_dim, output_dim = 784, 10"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise\n",
|
||||
"\n",
|
||||
"Modify the above code by adding an intemediate linear layer of size 200 hidden units between input and output layers."
|
||||
"To minimise replication of code and allow you to run experiments more quickly a helper function is provided below which trains a model and plots the evolution of the error and classification accuracy of the model (on both training and validation sets) over training."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -276,14 +232,288 @@
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def train_model_and_plot_stats(\n",
|
||||
" model, error, learning_rule, train_data, valid_data, num_epochs, stats_interval):\n",
|
||||
"\n",
|
||||
" # As well as monitoring the error over training also monitor classification\n",
|
||||
" # accuracy i.e. proportion of most-probable predicted classes being equal to targets\n",
|
||||
" data_monitors={'acc': lambda y, t: (y.argmax(-1) == t.argmax(-1)).mean()}\n",
|
||||
"\n",
|
||||
" # Use the created objects to initialise a new Optimiser instance.\n",
|
||||
" optimiser = Optimiser(\n",
|
||||
" model, error, learning_rule, train_data, valid_data, data_monitors)\n",
|
||||
"\n",
|
||||
" # Run the optimiser for 5 epochs (full passes through the training set)\n",
|
||||
" # printing statistics every epoch.\n",
|
||||
" stats, keys = optimiser.train(num_epochs=num_epochs, stats_interval=stats_interval)\n",
|
||||
"\n",
|
||||
" # Plot the change in the validation and training set error over training.\n",
|
||||
" fig_1 = plt.figure(figsize=(8, 4))\n",
|
||||
" ax_1 = fig_1.add_subplot(111)\n",
|
||||
" for k in ['error(train)', 'error(valid)']:\n",
|
||||
" ax_1.plot(np.arange(1, stats.shape[0]) * stats_interval, \n",
|
||||
" stats[1:, keys[k]], label=k)\n",
|
||||
" ax_1.legend(loc=0)\n",
|
||||
" ax_1.set_xlabel('Epoch number')\n",
|
||||
"\n",
|
||||
" # Plot the change in the validation and training set accuracy over training.\n",
|
||||
" fig_2 = plt.figure(figsize=(8, 4))\n",
|
||||
" ax_2 = fig_2.add_subplot(111)\n",
|
||||
" for k in ['acc(train)', 'acc(valid)']:\n",
|
||||
" ax_2.plot(np.arange(1, stats.shape[0]) * stats_interval, \n",
|
||||
" stats[1:, keys[k]], label=k)\n",
|
||||
" ax_2.legend(loc=0)\n",
|
||||
" ax_2.set_xlabel('Epoch number')\n",
|
||||
" \n",
|
||||
" return stats, keys, fig_1, ax_1, fig_2, ax_2"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Running the cell below will create a model consisting of an affine layer follower by a softmax transformation and train it on the MNIST data set by minimising the multi-class cross entropy error function using a basic gradient descent learning rule. By using the helper function defined above, at the end of training curves of the evolution of the error function and also classification accuracy of the model over the training epochs will be plotted.\n",
|
||||
"\n",
|
||||
"You should try running the code for various settings of the training hyperparameters defined at the beginning of the cell to get a feel for how these affect how training proceeds. You may wish to create multiple copies of the cell below to allow you to keep track of and compare the results across different hyperparameter settings."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Set training run hyperparameters\n",
|
||||
"batch_size = 100 # number of data points in a batch\n",
|
||||
"init_scale = 0.1 # scale for random parameter initialisation\n",
|
||||
"learning_rate = 0.1 # learning rate for gradient descent\n",
|
||||
"num_epochs = 100 # number of training epochs to perform\n",
|
||||
"stats_interval = 5 # epoch interval between recording and printing stats\n",
|
||||
"\n",
|
||||
"# Reset random number generator and data provider states on each run\n",
|
||||
"# to ensure reproducibility of results\n",
|
||||
"rng.seed(seed)\n",
|
||||
"train_data.reset()\n",
|
||||
"valid_data.reset()\n",
|
||||
"\n",
|
||||
"# Alter data-provider batch size\n",
|
||||
"train_data.batch_size = batch_size \n",
|
||||
"valid_data.batch_size = batch_size\n",
|
||||
"\n",
|
||||
"# Create a parameter initialiser which will sample random uniform values\n",
|
||||
"# from [-init_scale, init_scale]\n",
|
||||
"param_init = UniformInit(-init_scale, init_scale, rng=rng)\n",
|
||||
"\n",
|
||||
"# Create affine + softmax model\n",
|
||||
"model = MultipleLayerModel([\n",
|
||||
" AffineLayer(input_dim, output_dim, param_init, param_init),\n",
|
||||
" SoftmaxLayer()\n",
|
||||
"])\n",
|
||||
"\n",
|
||||
"# Initialise a cross entropy error object\n",
|
||||
"error = CrossEntropyError()\n",
|
||||
"\n",
|
||||
"# Use a basic gradient descent learning rule\n",
|
||||
"learning_rule = GradientDescentLearningRule(learning_rate=learning_rate)\n",
|
||||
"\n",
|
||||
"_ = train_model_and_plot_stats(\n",
|
||||
" model, error, learning_rule, train_data, valid_data, num_epochs, stats_interval)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Optional extra: more efficient softmax gradient evaluation\n",
|
||||
"\n",
|
||||
"In the lectures you were shown that for certain combinations of error function and final output layers, that the expressions for the gradients take particularly simple forms. \n",
|
||||
"\n",
|
||||
"In particular it can be shown that the combinations of \n",
|
||||
"\n",
|
||||
" * logistic sigmoid output layer and binary cross entropy error function\n",
|
||||
" * softmax output layer and cross entropy error function\n",
|
||||
" \n",
|
||||
"lead to particularly simple forms for the gradients of the error function with respect to the inputs to the final layer. In particular for the latter softmax and cross entropy error function case we have that\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" y^{(b)}_k = \\textrm{Softmax}_k\\lpa\\vct{x}^{(b)}\\rpa = \\frac{\\exp(x^{(b)}_k)}{\\sum_{d=1}^D \\lbr \\exp(x^{(b)}_d) \\rbr}\n",
|
||||
" \\qquad\n",
|
||||
" E^{(b)} = \\textrm{CrossEntropy}\\lpa\\vct{y}^{(b)},\\,\\vct{t}^{(b)}\\rpa = -\\sum_{d=1}^D \\lbr t^{(b)}_k \\log(y^{(b)}_k) \\rbr\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"and it can be shown (this is an instructive mathematical exercise if you want a challenge!) that\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\pd{E^{(b)}}{x^{(b)}_d} = y^{(b)}_d - t^{(b)}_d.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"The combination of `CrossEntropyError` and `SoftmaxLayer` used to train the model above calculate this gradient less directly by first calculating the gradient of the error with respect to the model outputs in `CrossEntropyError.grad` and then back-propagating this gradient to the inputs of the softmax layer using `SoftmaxLayer.bprop`.\n",
|
||||
"\n",
|
||||
"Rather than computing the gradient in two steps like this we can instead wrap the softmax transformation in to the definition of the error function and make use of the simpler gradient expression above. More explicitly we define an error function as follows\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" E^{(b)} = \\textrm{CrossEntropySoftmax}\\lpa\\vct{y}^{(b)},\\,\\vct{t}^{(b)}\\rpa = -\\sum_{d=1}^D \\lbr t^{(b)}_k \\log\\lsb\\textrm{Softmax}_k\\lpa \\vct{y}^{(b)}\\rpa\\rsb\\rbr\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"with corresponding gradient\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\pd{E^{(b)}}{y^{(b)}_d} = \\textrm{Softmax}_d\\lpa \\vct{y}^{(b)}\\rpa - t^{(b)}_d.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"The final layer of the model will then be an affine transformation which produces unbounded output values corresponding to the logarithms of the unnormalised predicted class probabilities. An implementation of this error function is provided in `CrossEntropySoftmaxError`. The cell below sets up a model with a single affine transformation layer and trains it on MNIST using this new cost. If you run it with equivalent hyperparameters to one of your runs with the alternative formulation above you should get identical error and classification curves (other than floating point error) but with a minor improvement in training speed.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Set training run hyperparameters\n",
|
||||
"batch_size = 100 # number of data points in a batch\n",
|
||||
"init_scale = 0.1 # scale for random parameter initialisation\n",
|
||||
"learning_rate = 0.1 # learning rate for gradient descent\n",
|
||||
"num_epochs = 100 # number of training epochs to perform\n",
|
||||
"stats_interval = 5 # epoch interval between recording and printing stats\n",
|
||||
"\n",
|
||||
"# Reset random number generator and data provider states on each run\n",
|
||||
"# to ensure reproducibility of results\n",
|
||||
"rng.seed(seed)\n",
|
||||
"train_data.reset()\n",
|
||||
"valid_data.reset()\n",
|
||||
"\n",
|
||||
"# Alter data-provider batch size\n",
|
||||
"train_data.batch_size = batch_size \n",
|
||||
"valid_data.batch_size = batch_size\n",
|
||||
"\n",
|
||||
"# Create a parameter initialiser which will sample random uniform values\n",
|
||||
"# from [-init_scale, init_scale]\n",
|
||||
"param_init = UniformInit(-init_scale, init_scale, rng=rng)\n",
|
||||
"\n",
|
||||
"# Create affine model (outputs are logs of unnormalised class probabilities)\n",
|
||||
"model = SingleLayerModel(\n",
|
||||
" AffineLayer(input_dim, output_dim, param_init, param_init)\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Initialise the error object\n",
|
||||
"error = CrossEntropySoftmaxError()\n",
|
||||
"\n",
|
||||
"# Use a basic gradient descent learning rule\n",
|
||||
"learning_rule = GradientDescentLearningRule(learning_rate=learning_rate)\n",
|
||||
"\n",
|
||||
"_ = train_model_and_plot_stats(\n",
|
||||
" model, error, learning_rule, train_data, valid_data, num_epochs, stats_interval)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 2: training deeper models on MNIST\n",
|
||||
"\n",
|
||||
"We are now going to investigate using deeper multiple-layer model archictures for the MNIST classification task. You should experiment with training models with two to five `AffineLayer` transformations interleaved with `SigmoidLayer` nonlinear transformations. Intermediate hidden layers between the input and output should have a dimension of 100. For example the `layers` definition of a model with two `AffineLayer` transformations would be\n",
|
||||
"\n",
|
||||
"```Python\n",
|
||||
"layers = [\n",
|
||||
" AffineLayer(input_dim, 100),\n",
|
||||
" SigmoidLayer(),\n",
|
||||
" AffineLayer(100, output_dim),\n",
|
||||
" SoftmaxLayer()\n",
|
||||
"]\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"If you read through the extension to the first exercise you may wish to use the `CrossEntropySoftmaxError` without the final `SoftmaxLayer`.\n",
|
||||
"\n",
|
||||
"Use the code from the first exercise as a starting point and start with training hyperparameters which gave reasonable performance for the shallow architecture trained previously.\n",
|
||||
"\n",
|
||||
"Some questions to investigate:\n",
|
||||
"\n",
|
||||
" * How does increasing the number of layers affect the model's performance on the training data set? And on the validation data set?\n",
|
||||
" * Do deeper models seem to be harder or easier to train (e.g. in terms of ease of choosing training hyperparameters to give good final performance and/or quick convergence)?\n",
|
||||
" * Do the models seem to be sensitive to the choice of the parameter initialisation range? Can you think of any reasons for why setting individual parameter initialisation scales for each `AffineLayer` in a model might be useful? Can you come up with (or find) any heuristics for setting the parameter initialisation scales?\n",
|
||||
" \n",
|
||||
"You do not need to come up with explanations for all of these (though if you can that's great!), they are meant as prompts to get you thinking about the various issues involved in training multiple-layer models. \n",
|
||||
"\n",
|
||||
"You may wish to start with shorter pilot training runs (by decreasing the number of training epochs) for each of the model architectures to get an initial idea of appropriate hyperparameter settings before doing one or two longer training runs to assess the final performance of the architectures."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Models with two affine layers"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Models with three affine layers"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Models with four affine layers"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Models with five affine layers"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"anaconda-cloud": {},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 2",
|
||||
"display_name": "Python [conda env:mlp]",
|
||||
"language": "python",
|
||||
"name": "python2"
|
||||
"name": "conda-env-mlp-py"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
@ -295,7 +525,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython2",
|
||||
"version": "2.7.10"
|
||||
"version": "2.7.12"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|