Merge pull request #16 from pswietojanski/master
next lab and coursework
This commit is contained in:
commit
a618c4c9dc
622
00_Introduction_solution.ipynb
Normal file
File diff suppressed because one or more lines are too long
@ -4,45 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Introduction\n",
|
||||
"\n",
|
||||
"This tutorial is about linear transforms - a basic building block of neural networks, including deep learning models.\n",
|
||||
"\n",
|
||||
"# Virtual environments and syncing repositories\n",
|
||||
"\n",
|
||||
"Before you proceed onwards, remember to activate you virtual environments so you can use the software you installed last week as well as run the notebooks in interactive mode, not through the github.com website.\n",
|
||||
"\n",
|
||||
"## Virtual environments\n",
|
||||
"\n",
|
||||
"To activate the virtual environment:\n",
|
||||
" * If you were in last week's Tuesday or Wednesday group type `activate_mlp` or `source ~/mlpractical/venv/bin/activate`\n",
|
||||
" * If you were in the Monday group:\n",
|
||||
" + and if you have chosen the **comfy** way type: `workon mlpractical`\n",
|
||||
" + and if you have chosen the **generic** way, `source` your virutal environment using `source` and specyfing the path to the activate script (you need to localise it yourself, there were not any general recommendations w.r.t dir structure and people have installed it in different places, usually somewhere in the home directories. If you cannot easily find it by yourself, use something like: `find . -iname activate` ):\n",
|
||||
"\n",
|
||||
"## On Synchronising repositories\n",
|
||||
"\n",
|
||||
"Enter the git mlp repository you set up last week (i.e. `~/mlpractical/repo-mlp`) and once you sync the repository (in one of the two below ways, or look at our short Git FAQ <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>), start the notebook session by typing:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"ipython notebook\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"### Default way\n",
|
||||
"\n",
|
||||
"To avoid potential conflicts between the changes you have made since last week and our additions, we recommend `stash` your changes and `pull` the new code from the mlpractical repository by typing:\n",
|
||||
"\n",
|
||||
"1. `git stash save \"Lab1 work\"`\n",
|
||||
"2. `git pull`\n",
|
||||
"\n",
|
||||
"Then, if you need to, you can always (temporaily) restore a desired state of the repository (look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>).\n",
|
||||
"\n",
|
||||
"**Otherwise** you may also create a branch for each lab separately (again, look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a> and git tutorials we linked there), this will allow you to keep `master` branch clean, and pull changes into it every week from the central repository. At the same time branching gives you much more flexibility with changes you introduce to the code as potential conflicts will not occur until you try to make an explicit merge.\n",
|
||||
"\n",
|
||||
"### For advanced github users\n",
|
||||
"\n",
|
||||
"It is OK if you want to keep your changes and merge the new code with whatever you already have, but you need to know what you are doing and how to resolve conflicts.\n",
|
||||
" "
|
||||
"-"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
890
01_Linear_Models_solution.ipynb
Normal file
@ -0,0 +1,890 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Introduction\n",
|
||||
"\n",
|
||||
"This tutorial is about linear transforms - a basic building block of neural networks, including deep learning models.\n",
|
||||
"\n",
|
||||
"# Virtual environments and syncing repositories\n",
|
||||
"\n",
|
||||
"Before you proceed onwards, remember to activate you virtual environments so you can use the software you installed last week as well as run the notebooks in interactive mode, not through the github.com website.\n",
|
||||
"\n",
|
||||
"## Virtual environments\n",
|
||||
"\n",
|
||||
"To activate the virtual environment:\n",
|
||||
" * If you were in last week's Tuesday or Wednesday group type `activate_mlp` or `source ~/mlpractical/venv/bin/activate`\n",
|
||||
" * If you were in the Monday group:\n",
|
||||
" + and if you have chosen the **comfy** way type: `workon mlpractical`\n",
|
||||
" + and if you have chosen the **generic** way, `source` your virutal environment using `source` and specyfing the path to the activate script (you need to localise it yourself, there were not any general recommendations w.r.t dir structure and people have installed it in different places, usually somewhere in the home directories. If you cannot easily find it by yourself, use something like: `find . -iname activate` ):\n",
|
||||
"\n",
|
||||
"## On Synchronising repositories\n",
|
||||
"\n",
|
||||
"Enter the git mlp repository you set up last week (i.e. `~/mlpractical/repo-mlp`) and once you sync the repository (in one of the two below ways, or look at our short Git FAQ <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>), start the notebook session by typing:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"ipython notebook\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"### Default way\n",
|
||||
"\n",
|
||||
"To avoid potential conflicts between the changes you have made since last week and our additions, we recommend `stash` your changes and `pull` the new code from the mlpractical repository by typing:\n",
|
||||
"\n",
|
||||
"1. `git stash save \"Lab1 work\"`\n",
|
||||
"2. `git pull`\n",
|
||||
"\n",
|
||||
"Then, if you need to, you can always (temporaily) restore a desired state of the repository (look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>).\n",
|
||||
"\n",
|
||||
"**Otherwise** you may also create a branch for each lab separately (again, look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a> and git tutorials we linked there), this will allow you to keep `master` branch clean, and pull changes into it every week from the central repository. At the same time branching gives you much more flexibility with changes you introduce to the code as potential conflicts will not occur until you try to make an explicit merge.\n",
|
||||
"\n",
|
||||
"### For advanced github users\n",
|
||||
"\n",
|
||||
"It is OK if you want to keep your changes and merge the new code with whatever you already have, but you need to know what you are doing and how to resolve conflicts.\n",
|
||||
" \n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Single Layer Models\n",
|
||||
"\n",
|
||||
"***\n",
|
||||
"### Note on storing matrices in computer memory\n",
|
||||
"\n",
|
||||
"Consider you want to store the following array in memory: $\\left[ \\begin{array}{ccc}\n",
|
||||
"1 & 2 & 3 \\\\\n",
|
||||
"4 & 5 & 6 \\\\\n",
|
||||
"7 & 8 & 9 \\end{array} \\right]$ \n",
|
||||
"\n",
|
||||
"In computer memory the above matrix would be organised as a vector in either (assume you allocate the memory at once for the whole matrix):\n",
|
||||
"\n",
|
||||
"* Row-wise layout where the order would look like: $\\left [ \\begin{array}{ccccccccc}\n",
|
||||
"1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\end{array} \\right ]$\n",
|
||||
"* Column-wise layout where the order would look like: $\\left [ \\begin{array}{ccccccccc}\n",
|
||||
"1 & 4 & 7 & 2 & 5 & 8 & 3 & 6 & 9 \\end{array} \\right ]$\n",
|
||||
"\n",
|
||||
"Although `numpy` can easily handle both formats (possibly with some computational overhead), in our code we will stick with modern (and default) `c`-like approach and use row-wise format (contrary to Fortran that used column-wise approach). \n",
|
||||
"\n",
|
||||
"This means, that in this tutorial:\n",
|
||||
"* vectors are kept row-wise $\\mathbf{x} = (x_1, x_1, \\ldots, x_D) $ (rather than $\\mathbf{x} = (x_1, x_1, \\ldots, x_D)^T$)\n",
|
||||
"* similarly, in case of matrices we will stick to: $\\left[ \\begin{array}{cccc}\n",
|
||||
"x_{11} & x_{12} & \\ldots & x_{1D} \\\\\n",
|
||||
"x_{21} & x_{22} & \\ldots & x_{2D} \\\\\n",
|
||||
"x_{31} & x_{32} & \\ldots & x_{3D} \\\\ \\end{array} \\right]$ and each row (i.e. $\\left[ \\begin{array}{cccc} x_{11} & x_{12} & \\ldots & x_{1D} \\end{array} \\right]$) represents a single data-point (like one MNIST image or one window of observations)\n",
|
||||
"\n",
|
||||
"In lecture slides you will find the equations following the conventional mathematical column-wise approach, but you can easily map them one way or the other using using matrix transpose.\n",
|
||||
"\n",
|
||||
"***\n",
|
||||
"\n",
|
||||
"## Linear and Affine Transforms\n",
|
||||
"\n",
|
||||
"The basis of all linear models is so called affine transform, that is a transform that implements some linear transformation and translation of input features. The transforms we are going to use are parameterised by:\n",
|
||||
"\n",
|
||||
" * Weight matrix $\\mathbf{W} \\in \\mathbb{R}^{D\\times K}$: where element $w_{ik}$ is the weight from input $x_i$ to output $y_k$\n",
|
||||
" * Bias vector $\\mathbf{b}\\in R^{K}$ : where element $b_{k}$ is the bias for output $k$\n",
|
||||
"\n",
|
||||
"Note, the bias is simply some additve term, and can be easily incorporated into an additional row in weight matrix and an additinal input in the inputs which is set to $1.0$ (as in the below picture taken from the lecture slides). However, here (and in the code) we will keep them separate.\n",
|
||||
"\n",
|
||||
"![Making Predictions](res/singleLayerNetWts-1.png)\n",
|
||||
"\n",
|
||||
"For instance, for the above example of 5-dimensional input vector by $\\mathbf{x} = (x_1, x_2, x_3, x_4, x_5)$, weight matrix $\\mathbf{W}=\\left[ \\begin{array}{ccc}\n",
|
||||
"w_{11} & w_{12} & w_{13} \\\\\n",
|
||||
"w_{21} & w_{22} & w_{23} \\\\\n",
|
||||
"w_{31} & w_{32} & w_{33} \\\\\n",
|
||||
"w_{41} & w_{42} & w_{43} \\\\\n",
|
||||
"w_{51} & w_{52} & w_{53} \\\\ \\end{array} \\right]$, bias vector $\\mathbf{b} = (b_1, b_2, b_3)$ and outputs $\\mathbf{y} = (y_1, y_2, y_3)$, one can write the transformation as follows:\n",
|
||||
"\n",
|
||||
"(for the $i$-th output)\n",
|
||||
"\n",
|
||||
"(1) $\n",
|
||||
"\\begin{equation}\n",
|
||||
" y_i = b_i + \\sum_j x_jw_{ji}\n",
|
||||
"\\end{equation}\n",
|
||||
"$\n",
|
||||
"\n",
|
||||
"or the equivalent vector form (where $\\mathbf w_i$ is the $i$-th column of $\\mathbf W$, but note, when we **slice** the $i$th column we will get a **vector** $\\mathbf w_i = (w_{1i}, w_{2i}, w_{3i}, w_{4i}, w_{5i})$, hence the transpose for $\\mathbf w_i$ in the below equation):\n",
|
||||
"\n",
|
||||
"(2) $\n",
|
||||
"\\begin{equation}\n",
|
||||
" y_i = b_i + \\mathbf x \\mathbf w_i^T\n",
|
||||
"\\end{equation}\n",
|
||||
"$\n",
|
||||
"\n",
|
||||
"The same operation can be also written in matrix form, to compute all the outputs $\\mathbf{y}$ at the same time:\n",
|
||||
"\n",
|
||||
"(3) $\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\mathbf y=\\mathbf x\\mathbf W + \\mathbf b\n",
|
||||
"\\end{equation}\n",
|
||||
"$\n",
|
||||
"\n",
|
||||
"This is equivalent to slides 12/13 in lecture 1, except we are using row vectors.\n",
|
||||
"\n",
|
||||
"When $\\mathbf{x}$ is a mini-batch (contains $B$ data-points of dimension $D$ each), i.e. $\\left[ \\begin{array}{cccc}\n",
|
||||
"x_{11} & x_{12} & \\ldots & x_{1D} \\\\\n",
|
||||
"x_{21} & x_{22} & \\ldots & x_{2D} \\\\\n",
|
||||
"\\cdots \\\\\n",
|
||||
"x_{B1} & x_{B2} & \\ldots & x_{BD} \\\\ \\end{array} \\right]$ equation (3) effectively becomes to be\n",
|
||||
"\n",
|
||||
"(4) $\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\mathbf Y=\\mathbf X\\mathbf W + \\mathbf b\n",
|
||||
"\\end{equation}\n",
|
||||
"$\n",
|
||||
"\n",
|
||||
"where both $\\mathbf{X}\\in\\mathbb{R}^{B\\times D}$ and $\\mathbf{Y}\\in\\mathbb{R}^{B\\times K}$ are matrices, and $\\mathbf{b}$ needs to be <a href=\"http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html\">broadcasted</a> $B$ times (numpy will do this by default). However, we will not make an explicit distinction between a special case for $B=1$ and $B>1$ and simply use equation (3) instead, although $\\mathbf{x}$ and hence $\\mathbf{y}$ could be matrices. From an implementation point of view, it does not matter.\n",
|
||||
"\n",
|
||||
"The desired functionality for matrix multiplication in numpy is provided by <a href=\"http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html\">numpy.dot</a> function. If you haven't use it so far, get familiar with it as we will use it extensively."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### A general note on random number generators\n",
|
||||
"\n",
|
||||
"It is generally a good practice (for machine learning applications **not** for cryptography!) to seed a pseudo-random number generator once at the beginning of the experiment, and use it later through the code where necesarry. This makes it easier to reproduce results since random initialisations can be replicated. As such, within this course we are going use a single random generator object, similar to the below:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy\n",
|
||||
"import sys\n",
|
||||
"\n",
|
||||
"#initialise the random generator to be used later\n",
|
||||
"seed=[2015, 10, 1]\n",
|
||||
"random_generator = numpy.random.RandomState(seed)"
|
||||
]
|
||||
},
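As a quick check of the reproducibility point made above, two generators created with the same seed produce identical draws (a standalone sketch, not part of the lab code):

```python
import numpy

rng_a = numpy.random.RandomState([2015, 10, 1])
rng_b = numpy.random.RandomState([2015, 10, 1])

# identical seeds give identical sequences of pseudo-random numbers
print(numpy.allclose(rng_a.uniform(-0.1, 0.1, (3,)),
                     rng_b.uniform(-0.1, 0.1, (3,))))  # True
```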
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 1 \n",
|
||||
"\n",
|
||||
"Using numpy.dot, implement **forward** propagation through the linear transform defined by equations (3) and (4) for $B=1$ and $B>1$. As data ($\\mathbf{x}$) use `MNISTDataProvider` from previous laboratories. For case when $B=1$ write a function to compute the 1st output ($y_1$) using equations (1) and (2). Check if the output is the same as the corresponding one obtained with numpy. \n",
|
||||
"\n",
|
||||
"Tip: To generate random data you can use `random_generator.uniform(-0.1, 0.1, (D, 10))` from the preceeding cell."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from mlp.dataset import MNISTDataProvider\n",
|
||||
"\n",
|
||||
"mnist_dp = MNISTDataProvider(dset='valid', batch_size=3, max_num_batches=1, randomize=False)\n",
|
||||
"\n",
|
||||
"irange = 0.1\n",
|
||||
"W = random_generator.uniform(-irange, irange, (784,10)) \n",
|
||||
"b = numpy.zeros((10,))\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"y1e1 0.55861474982\n",
|
||||
"y1e2 0.55861474982\n",
|
||||
"ye3 [[ 0.55861475 0.79450077 0.17439693 0.00265688 0.66272539 -0.09985686\n",
|
||||
" 0.56468591 0.58105588 -0.18613727 0.08151257]\n",
|
||||
" [-0.43965864 0.59573972 -0.22691119 0.26767124 -0.31343979 0.07224664\n",
|
||||
" -0.19616183 0.0851733 -0.24088286 -0.19305162]\n",
|
||||
" [-0.20176359 0.42394166 -1.03984446 0.15492101 0.15694745 -0.53741022\n",
|
||||
" 0.05887668 -0.21124527 -0.07870156 -0.00506471]]\n",
|
||||
"ye4 [[ 0.55861475 0.79450077 0.17439693 0.00265688 0.66272539 -0.09985686\n",
|
||||
" 0.56468591 0.58105588 -0.18613727 0.08151257]\n",
|
||||
" [-0.43965864 0.59573972 -0.22691119 0.26767124 -0.31343979 0.07224664\n",
|
||||
" -0.19616183 0.0851733 -0.24088286 -0.19305162]\n",
|
||||
" [-0.20176359 0.42394166 -1.03984446 0.15492101 0.15694745 -0.53741022\n",
|
||||
" 0.05887668 -0.21124527 -0.07870156 -0.00506471]]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"\n",
|
||||
"mnist_dp.reset()\n",
|
||||
"\n",
|
||||
"#implement following functions, then run the cell\n",
|
||||
"def y1_equation_1(x, W, b):\n",
|
||||
" y1=0\n",
|
||||
" for j in xrange(0, x.shape[0]):\n",
|
||||
" y1 += x[j]*W[j,0]\n",
|
||||
" return y1 + b[0]\n",
|
||||
" \n",
|
||||
"def y1_equation_2(x, W, b):\n",
|
||||
" return numpy.dot(x, W[:,0].T) + b[0]\n",
|
||||
"\n",
|
||||
"def y_equation_3(x, W, b):\n",
|
||||
" return numpy.dot(x,W) + b\n",
|
||||
"\n",
|
||||
"def y_equation_4(x, W, b):\n",
|
||||
" return numpy.dot(x,W) + b\n",
|
||||
"\n",
|
||||
"for x, t in mnist_dp:\n",
|
||||
" y1e1 = y1_equation_1(x[0], W, b)\n",
|
||||
" y1e2 = y1_equation_2(x[0], W, b)\n",
|
||||
" ye3 = y_equation_3(x, W, b)\n",
|
||||
" ye4 = y_equation_4(x, W, b)\n",
|
||||
"\n",
|
||||
"print 'y1e1', y1e1\n",
|
||||
"print 'y1e2', y1e2\n",
|
||||
"print 'ye3', ye3\n",
|
||||
"print 'ye4', ye4\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"## Exercise 2\n",
|
||||
"\n",
|
||||
"Modify (if necessary) examples from Exercise 1 to perform **backward** propagation, that is, given $\\mathbf{y}$ (obtained in previous step) and weight matrix $\\mathbf{W}$, project $\\mathbf{y}$ onto the input space $\\mathbf{x}$ (ignore or set to zero the biases towards $\\mathbf{x}$ in backward pass). Mathematically, we are interested in the following transformation: $\\mathbf{z}=\\mathbf{y}\\mathbf{W}^T$"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[[-0.00683757 -0.13638553 0.00203203 ..., 0.02690207 -0.07364245\n",
|
||||
" 0.04403087]\n",
|
||||
" [-0.00447621 -0.06409652 0.01211384 ..., 0.0402248 -0.04490571\n",
|
||||
" -0.05013801]\n",
|
||||
" [ 0.03981022 -0.13705957 0.05882239 ..., 0.04491902 -0.08644539\n",
|
||||
" -0.07106441]]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"y = y_equation_3(x, W, b)\n",
|
||||
"z = numpy.dot(y, W.T)\n",
|
||||
"\n",
|
||||
"print z\n",
|
||||
"assert z.shape == x.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"***\n",
|
||||
"## Exercise 3 (optional)\n",
|
||||
"\n",
|
||||
"In case you do not fully understand how matrix-vector and/or matrix-matrix products work, consider implementing `my_dot_mat_mat` function (you have been given `my_dot_vec_mat` code to look at as an example) which takes as the input the following arguments:\n",
|
||||
"\n",
|
||||
"* D-dimensional input vector $\\mathbf{x} = (x_1, x_2, \\ldots, x_D) $.\n",
|
||||
"* Weight matrix $\\mathbf{W}\\in\\mathbb{R}^{D\\times K}$:\n",
|
||||
"\n",
|
||||
"and returns:\n",
|
||||
"\n",
|
||||
"* K-dimensional output vector $\\mathbf{y} = (y_1, \\ldots, y_K) $\n",
|
||||
"\n",
|
||||
"Your job is to write a variant that works in a mini-batch mode where both $\\mathbf{x}\\in\\mathbb{R}^{B\\times D}$ and $\\mathbf{y}\\in\\mathbb{R}^{B\\times K}$ are matrices in which each rows contain one of $B$ data-points from mini-batch (rather than $\\mathbf{x}\\in\\mathbb{R}^{D}$ and $\\mathbf{y}\\in\\mathbb{R}^{K}$)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def my_dot_vec_mat(x, W):\n",
|
||||
" J = x.shape[0]\n",
|
||||
" K = W.shape[1]\n",
|
||||
" assert (J == W.shape[0]), (\n",
|
||||
" \"Number of columns of x expected to \"\n",
|
||||
" \" to be equal to the number of rows in \"\n",
|
||||
" \"W, bot got shapes %s, %s\" % (x.shape, W.shape)\n",
|
||||
" )\n",
|
||||
" y = numpy.zeros((K,))\n",
|
||||
" for k in xrange(0, K):\n",
|
||||
" for j in xrange(0, J):\n",
|
||||
" y[k] += x[j] * W[j,k]\n",
|
||||
" \n",
|
||||
" return y"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Well done!\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"irange = 0.1 #+-range from which we draw the random numbers\n",
|
||||
"\n",
|
||||
"x = random_generator.uniform(-irange, irange, (5,)) \n",
|
||||
"W = random_generator.uniform(-irange, irange, (5,3)) \n",
|
||||
"\n",
|
||||
"y_my = my_dot_vec_mat(x, W)\n",
|
||||
"y_np = numpy.dot(x, W)\n",
|
||||
"\n",
|
||||
"same = numpy.allclose(y_my, y_np)\n",
|
||||
"\n",
|
||||
"if same:\n",
|
||||
" print 'Well done!'\n",
|
||||
"else:\n",
|
||||
" print 'Matrices are different:'\n",
|
||||
" print 'y_my is: ', y_my\n",
|
||||
" print 'y_np is: ', y_np"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def my_dot_mat_mat(x, W):\n",
|
||||
" I = x.shape[0]\n",
|
||||
" J = x.shape[1]\n",
|
||||
" K = W.shape[1]\n",
|
||||
" assert (J == W.shape[0]), (\n",
|
||||
" \"Number of columns in of x expected to \"\n",
|
||||
" \" to be the same as rows in W, got\"\n",
|
||||
" )\n",
|
||||
" #allocate the output container\n",
|
||||
" y = numpy.zeros((I, K))\n",
|
||||
" \n",
|
||||
" #implement here matrix-matrix inner product here\n",
|
||||
" for i in xrange(0, I):\n",
|
||||
" for k in xrange(0, K):\n",
|
||||
" for j in xrange(0, J):\n",
|
||||
" y[i, k] += x[i, j] * W[j,k]\n",
|
||||
" \n",
|
||||
" return y"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Test whether you get comparable numbers to what numpy is producing:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Well done!\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"irange = 0.1 #+-range from which we draw the random numbers\n",
|
||||
"\n",
|
||||
"x = random_generator.uniform(-irange, irange, (2,5)) \n",
|
||||
"W = random_generator.uniform(-irange, irange, (5,3)) \n",
|
||||
"\n",
|
||||
"y_my = my_dot_mat_mat(x, W)\n",
|
||||
"y_np = numpy.dot(x, W)\n",
|
||||
"\n",
|
||||
"same = numpy.allclose(y_my, y_np)\n",
|
||||
"\n",
|
||||
"if same:\n",
|
||||
" print 'Well done!'\n",
|
||||
"else:\n",
|
||||
" print 'Matrices are different:'\n",
|
||||
" print 'y_my is: ', y_my\n",
|
||||
" print 'y_np is: ', y_np"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we benchmark each approach (we do it in separate cells, as timeit currently can measure whole cell execuiton only)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#generate bit bigger matrices, to better evaluate timings\n",
|
||||
"x = random_generator.uniform(-irange, irange, (10, 1000))\n",
|
||||
"W = random_generator.uniform(-irange, irange, (1000, 100))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"my_dot timings:\n",
|
||||
"10 loops, best of 3: 726 ms per loop\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print 'my_dot timings:'\n",
|
||||
"%timeit -n10 my_dot_mat_mat(x, W)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"numpy.dot timings:\n",
|
||||
"10 loops, best of 3: 1.17 ms per loop\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print 'numpy.dot timings:'\n",
|
||||
"%timeit -n10 numpy.dot(x, W)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Optional section ends here**\n",
|
||||
"***"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Iterative learning of linear models\n",
|
||||
"\n",
|
||||
"We will learn the model with stochastic gradient descent on N data-points using mean square error (MSE) loss function, which is defined as follows:\n",
|
||||
"\n",
|
||||
"(5) $\n",
|
||||
"E = \\frac{1}{2} \\sum_{n=1}^N ||\\mathbf{y}^n - \\mathbf{t}^n||^2 = \\sum_{n=1}^N E^n \\\\\n",
|
||||
" E^n = \\frac{1}{2} ||\\mathbf{y}^n - \\mathbf{t}^n||^2\n",
|
||||
"$\n",
|
||||
"\n",
|
||||
"(6) $ E^n = \\frac{1}{2} \\sum_{k=1}^K (y_k^n - t_k^n)^2 $\n",
|
||||
" \n",
|
||||
"Hence, the gradient w.r.t (with respect to) the $r$ output y of the model is defined as, so called delta function, $\\delta_r$: \n",
|
||||
"\n",
|
||||
"(8) $\\frac{\\partial{E^n}}{\\partial{y_{r}}} = (y^n_r - t^n_r) = \\delta^n_r \\quad ; \\quad\n",
|
||||
" \\delta^n_r = y^n_r - t^n_r \\\\\n",
|
||||
" \\frac{\\partial{E}}{\\partial{y_{r}}} = \\sum_{n=1}^N \\frac{\\partial{E^n}}{\\partial{y_{r}}} = \\sum_{n=1}^N \\delta^n_r\n",
|
||||
"$\n",
|
||||
"\n",
|
||||
"Similarly, using the above $\\delta^n_r$ one can express the gradient of the weight $w_{sr}$ (from the s-th input to the r-th output) for linear model and MSE cost as follows:\n",
|
||||
"\n",
|
||||
"(9) $\n",
|
||||
" \\frac{\\partial{E^n}}{\\partial{w_{sr}}} = (y^n_r - t^n_r)x_s^n = \\delta^n_r x_s^n \\quad\\\\\n",
|
||||
" \\frac{\\partial{E}}{\\partial{w_{sr}}} = \\sum_{n=1}^N \\frac{\\partial{E^n}}{\\partial{w_{rs}}} = \\sum_{n=1}^N \\delta^n_r x_s^n\n",
|
||||
"$\n",
|
||||
"\n",
|
||||
"and the gradient for bias parameter at the $r$-th output is:\n",
|
||||
"\n",
|
||||
"(10) $\n",
|
||||
" \\frac{\\partial{E}}{\\partial{b_{r}}} = \\sum_{n=1}^N \\frac{\\partial{E^n}}{\\partial{b_{r}}} = \\sum_{n=1}^N \\delta^n_r\n",
|
||||
"$"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"![Making Predictions](res/singleLayerNetPredict.png)\n",
|
||||
" \n",
|
||||
" * Input vector $\\mathbf{x} = (x_1, x_2, \\ldots, x_D) $\n",
|
||||
" * Output scalar $y_1$\n",
|
||||
" * Weight matrix $\\mathbf{W}$: $w_{ik}$ is the weight from input $x_i$ to output $y_k$. Note, here this is really a vector since a single scalar output, y_1.\n",
|
||||
" * Scalar bias $b$ for the only output in our model \n",
|
||||
" * Scalar target $t$ for the only output in out model\n",
|
||||
" \n",
|
||||
"First, ensure you can make use of data provider (note, for training data has been normalised to zero mean and unit variance, hence different effective range than one can find in file):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Observations: [[-0.12 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13]\n",
|
||||
" [-0.11 -0.1 0.09 -0.06 -0.09 -0. 0.28 -0.12 -0.12 -0.08]\n",
|
||||
" [-0.13 0.05 -0.13 -0.01 -0.11 -0.13 -0.13 -0.13 -0.13 -0.13]\n",
|
||||
" [ 0.2 0.12 0.25 0.16 0.03 -0. 0.15 0.08 -0.08 -0.11]\n",
|
||||
" [-0.13 -0.12 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13]\n",
|
||||
" [-0.1 0.51 1.52 0.14 -0.02 0.77 0.11 0.79 -0.02 0.08]\n",
|
||||
" [ 0.24 0.15 -0.01 0.08 -0.1 0.45 -0.12 -0.1 -0.13 0.48]\n",
|
||||
" [ 0.13 -0.06 -0.07 -0.11 -0.11 -0.11 -0.13 -0.11 -0.02 -0.12]\n",
|
||||
" [-0.06 0.28 -0.13 0.06 0.09 0.09 0.01 -0.07 0.14 -0.11]\n",
|
||||
" [-0.13 -0.13 -0.1 -0.06 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13]]\n",
|
||||
"To predict: [[-0.12]\n",
|
||||
" [-0.12]\n",
|
||||
" [-0.13]\n",
|
||||
" [-0.1 ]\n",
|
||||
" [-0.13]\n",
|
||||
" [-0.08]\n",
|
||||
" [ 0.24]\n",
|
||||
" [-0.13]\n",
|
||||
" [-0.02]\n",
|
||||
" [-0.13]]\n",
|
||||
"Observations: [[-0.09 -0.13 -0.13 -0.03 -0.05 -0.11 -0.13 -0.13 -0.13 -0.13]\n",
|
||||
" [-0.03 0.32 0.28 0.09 -0.04 0.19 0.31 -0.13 0.37 0.34]\n",
|
||||
" [ 0.12 0.13 0.06 -0.1 -0.1 0.94 0.24 0.12 0.28 -0.04]\n",
|
||||
" [ 0.26 0.17 -0.04 -0.13 -0.12 -0.09 -0.12 -0.13 -0.1 -0.13]\n",
|
||||
" [-0.1 -0.1 -0.01 -0.03 -0.07 0.05 -0.03 -0.12 -0.05 -0.13]\n",
|
||||
" [-0.13 -0.13 -0.13 -0.13 -0.13 -0.13 0.1 -0.13 -0.13 -0.13]\n",
|
||||
" [-0.01 -0.1 -0.13 -0.13 -0.12 -0.13 -0.13 -0.13 -0.13 -0.11]\n",
|
||||
" [-0.11 -0.06 -0.11 0.02 -0.03 -0.02 -0.05 -0.11 -0.13 -0.13]\n",
|
||||
" [-0.01 0.25 -0.08 0.04 -0.1 -0.12 0.06 -0.1 0.08 -0.06]\n",
|
||||
" [-0.09 -0.09 -0.09 -0.13 -0.11 -0.12 -0. -0.02 0.19 -0.11]]\n",
|
||||
"To predict: [[-0.13]\n",
|
||||
" [-0.11]\n",
|
||||
" [-0.09]\n",
|
||||
" [-0.08]\n",
|
||||
" [ 0.19]\n",
|
||||
" [-0.13]\n",
|
||||
" [-0.13]\n",
|
||||
" [-0.03]\n",
|
||||
" [-0.13]\n",
|
||||
" [-0.11]]\n",
|
||||
"Observations: [[-0.08 -0.11 -0.11 0.32 0.05 -0.11 -0.13 0.07 0.08 0.63]\n",
|
||||
" [-0.07 -0.1 -0.09 -0.08 0.26 -0.05 -0.1 -0. 0.36 -0.12]\n",
|
||||
" [-0.03 -0.1 0.19 -0.02 0.35 0.38 -0.1 0.44 -0.02 0.21]\n",
|
||||
" [-0.12 -0. -0.02 0.19 -0.11 -0.11 -0.13 -0.11 -0.02 -0.13]\n",
|
||||
" [ 0.09 0.1 -0.03 -0.05 0. -0.12 -0.12 -0.13 -0.13 -0.13]\n",
|
||||
" [ 0.21 0.05 -0.12 -0.05 -0.08 -0.1 -0.13 -0.13 -0.13 -0.13]\n",
|
||||
" [-0.04 -0.11 0.19 0.16 -0.01 -0.07 -0. -0.06 -0.03 0.16]\n",
|
||||
" [ 0.09 0.05 0.51 0.34 0.16 0.51 0.56 0.21 -0.06 -0. ]\n",
|
||||
" [-0.13 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13 -0.13 -0.09 0.49]\n",
|
||||
" [-0.06 -0.11 -0.13 0.06 -0.01 -0.12 0.54 0.2 -0.1 -0.11]]\n",
|
||||
"To predict: [[ 0.1 ]\n",
|
||||
" [ 0.09]\n",
|
||||
" [ 0.16]\n",
|
||||
" [-0.13]\n",
|
||||
" [-0.13]\n",
|
||||
" [ 0.04]\n",
|
||||
" [-0.1 ]\n",
|
||||
" [ 0.05]\n",
|
||||
" [-0.1 ]\n",
|
||||
" [-0.11]]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from mlp.dataset import MetOfficeDataProvider\n",
|
||||
"\n",
|
||||
"modp = MetOfficeDataProvider(10, batch_size=10, max_num_batches=3, randomize=True)\n",
|
||||
"\n",
|
||||
"%precision 2\n",
|
||||
"for x, t in modp:\n",
|
||||
" print 'Observations: ', x\n",
|
||||
" print 'To predict: ', t"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 4\n",
|
||||
"\n",
|
||||
"The below code implements a very simple variant of stochastic gradient descent for the weather regression example. Your task is to implement 5 functions in the next cell and then run two next cells that 1) build sgd functions and 2) run the actual training."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"#When implementing those, take into account the mini-batch case, for which one is\n",
|
||||
"#expected to sum the errors for each example\n",
|
||||
"\n",
|
||||
"def fprop(x, W, b):\n",
|
||||
" #code implementing eq. (3)\n",
|
||||
" return numpy.dot(x, W) + b\n",
|
||||
"\n",
|
||||
"def cost(y, t):\n",
|
||||
" #Mean Square Error cost, equation (5)\n",
|
||||
" return numpy.mean(0.5*numpy.sum((y - t)**2, axis=1))\n",
|
||||
"\n",
|
||||
"def cost_grad(y, t):\n",
|
||||
" #Gradient of the cost w.r.t y equation (8)\n",
|
||||
" return y - t\n",
|
||||
"\n",
|
||||
"def cost_wrt_W(cost_grad, x):\n",
|
||||
" #Gradient of the cost w.r.t W, equation (9)\n",
|
||||
" return numpy.dot(x.T, cost_grad)\n",
|
||||
" \n",
|
||||
"def cost_wrt_b(cost_grad):\n",
|
||||
" #Gradient of the cost w.r.t to b, equation (10)\n",
|
||||
" return numpy.sum(cost_grad, axis = 0)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"def sgd_epoch(data_provider, W, b, learning_rate):\n",
|
||||
" mse_stats = []\n",
|
||||
" \n",
|
||||
" #get the minibatch of data\n",
|
||||
" for x, t in data_provider:\n",
|
||||
"\n",
|
||||
" #1. get the estimate of y\n",
|
||||
" y = fprop(x, W, b)\n",
|
||||
"\n",
|
||||
" #2. compute the loss function\n",
|
||||
" tmp = cost(y, t)\n",
|
||||
" mse_stats.append(tmp)\n",
|
||||
" \n",
|
||||
" #3. compute the grad of the cost w.r.t the output layer activation y\n",
|
||||
" #i.e. how the cost changes when output y changes\n",
|
||||
" cost_grad_deltas = cost_grad(y, t)\n",
|
||||
"\n",
|
||||
" #4. compute the gradients w.r.t model's parameters\n",
|
||||
" grad_W = cost_wrt_W(cost_grad_deltas, x)\n",
|
||||
" grad_b = cost_wrt_b(cost_grad_deltas)\n",
|
||||
"\n",
|
||||
" #4. Update the model, we update with the mean gradient\n",
|
||||
" # over the minibatch, rather than sum of particular gradients\n",
|
||||
" # in a minibatch, to do so we scale the learning rate by batch_size\n",
|
||||
" batch_size = x.shape[0]\n",
|
||||
" effect_learn_rate = learning_rate / batch_size\n",
|
||||
"\n",
|
||||
" W = W - effect_learn_rate * grad_W\n",
|
||||
" b = b - effect_learn_rate * grad_b\n",
|
||||
" \n",
|
||||
" return W, b, numpy.mean(mse_stats)\n",
|
||||
"\n",
|
||||
"def sgd(data_provider, W, b, learning_rate=0.1, max_epochs=10):\n",
|
||||
" \n",
|
||||
" for epoch in xrange(0, max_epochs):\n",
|
||||
" #reset the data provider\n",
|
||||
" data_provider.reset()\n",
|
||||
" \n",
|
||||
" #train for one epoch\n",
|
||||
" W, b, mean_cost = \\\n",
|
||||
" sgd_epoch(data_provider, W, b, learning_rate)\n",
|
||||
" \n",
|
||||
" print \"MSE training cost after %d-th epoch is %f\" % (epoch + 1, mean_cost)\n",
|
||||
" \n",
|
||||
" return W, b\n",
|
||||
" \n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"MSE training cost after 1-th epoch is 0.017213\n",
|
||||
"MSE training cost after 2-th epoch is 0.016103\n",
|
||||
"MSE training cost after 3-th epoch is 0.015705\n",
|
||||
"MSE training cost after 4-th epoch is 0.015437\n",
|
||||
"MSE training cost after 5-th epoch is 0.015255\n",
|
||||
"MSE training cost after 6-th epoch is 0.015128\n",
|
||||
"MSE training cost after 7-th epoch is 0.015041\n",
|
||||
"MSE training cost after 8-th epoch is 0.014981\n",
|
||||
"MSE training cost after 9-th epoch is 0.014936\n",
|
||||
"MSE training cost after 10-th epoch is 0.014903\n",
|
||||
"MSE training cost after 11-th epoch is 0.014879\n",
|
||||
"MSE training cost after 12-th epoch is 0.014862\n",
|
||||
"MSE training cost after 13-th epoch is 0.014849\n",
|
||||
"MSE training cost after 14-th epoch is 0.014839\n",
|
||||
"MSE training cost after 15-th epoch is 0.014830\n",
|
||||
"MSE training cost after 16-th epoch is 0.014825\n",
|
||||
"MSE training cost after 17-th epoch is 0.014820\n",
|
||||
"MSE training cost after 18-th epoch is 0.014813\n",
|
||||
"MSE training cost after 19-th epoch is 0.014813\n",
|
||||
"MSE training cost after 20-th epoch is 0.014810\n",
|
||||
"MSE training cost after 21-th epoch is 0.014808\n",
|
||||
"MSE training cost after 22-th epoch is 0.014805\n",
|
||||
"MSE training cost after 23-th epoch is 0.014806\n",
|
||||
"MSE training cost after 24-th epoch is 0.014804\n",
|
||||
"MSE training cost after 25-th epoch is 0.014796\n",
|
||||
"MSE training cost after 26-th epoch is 0.014798\n",
|
||||
"MSE training cost after 27-th epoch is 0.014801\n",
|
||||
"MSE training cost after 28-th epoch is 0.014802\n",
|
||||
"MSE training cost after 29-th epoch is 0.014801\n",
|
||||
"MSE training cost after 30-th epoch is 0.014799\n",
|
||||
"MSE training cost after 31-th epoch is 0.014799\n",
|
||||
"MSE training cost after 32-th epoch is 0.014793\n",
|
||||
"MSE training cost after 33-th epoch is 0.014800\n",
|
||||
"MSE training cost after 34-th epoch is 0.014796\n",
|
||||
"MSE training cost after 35-th epoch is 0.014799\n",
|
||||
"MSE training cost after 36-th epoch is 0.014800\n",
|
||||
"MSE training cost after 37-th epoch is 0.014798\n",
|
||||
"MSE training cost after 38-th epoch is 0.014799\n",
|
||||
"MSE training cost after 39-th epoch is 0.014799\n",
|
||||
"MSE training cost after 40-th epoch is 0.014794\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(array([[ 0.01],\n",
|
||||
" [ 0.03],\n",
|
||||
" [ 0.03],\n",
|
||||
" [ 0.04],\n",
|
||||
" [ 0.06],\n",
|
||||
" [ 0.07],\n",
|
||||
" [ 0.26]]), array([-0.]))"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"\n",
|
||||
"#some hyper-parameters\n",
|
||||
"window_size = 7\n",
|
||||
"irange = 0.1\n",
|
||||
"learning_rate = 0.001\n",
|
||||
"max_epochs=40\n",
|
||||
"\n",
|
||||
"# note, while developing you can set max_num_batches to some positive number to limit\n",
|
||||
"# the number of training data-points (you will get feedback faster)\n",
|
||||
"mdp = MetOfficeDataProvider(window_size, batch_size=10, max_num_batches=-100, randomize=True)\n",
|
||||
"\n",
|
||||
"#initialise the parameters\n",
|
||||
"W = random_generator.uniform(-irange, irange, (window_size, 1))\n",
|
||||
"b = random_generator.uniform(-irange, irange, (1, ))\n",
|
||||
"\n",
|
||||
"#train the model\n",
|
||||
"sgd(mdp, W, b, learning_rate=learning_rate, max_epochs=max_epochs)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"## Exercise 5\n",
|
||||
"\n",
|
||||
"Modify the above regression problem so the model makes binary classification whether the the weather is going to be one of those \\{rainy, sunny} (look at slide 12 of the 2nd lecture)\n",
|
||||
"\n",
|
||||
"Tip: You need to introduce the following changes:\n",
|
||||
"1. Modify `MetOfficeDataProvider` (for example, inherit from MetOfficeDataProvider to create a new class MetOfficeDataProviderBin) and modify `next()` function so it returns as `targets` either 0 (sunny - if the the amount of rain [before mean/variance normalisation] is equal to 0 or 1 (rainy -- otherwise).\n",
|
||||
"2. Modify the functions from previous exercise so the fprop implements `sigmoid` on top of affine transform.\n",
|
||||
"3. Modify cost function to binary cross-entropy\n",
|
||||
"4. Make sure you compute the gradients correctly (as you have changed both the output and the cost)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#sorry, this one will be added later..."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 2",
|
||||
"language": "python",
|
||||
"name": "python2"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 2
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython2",
|
||||
"version": "2.7.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
302
02_MNIST_SLN.ipynb
Normal file
@ -0,0 +1,302 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Introduction\n",
|
||||
"\n",
|
||||
"This tutorial is an introduction to the coursework about multi-layer preceptron (MLP) models, or Deep Neural Networks (DNNs). Here, we will show how to build a single layer linear model (similar to the one from the previous lab) for MNIST classification using the provided code-base. \n",
|
||||
"\n",
|
||||
"The principal purpose of this introduction is to get you familiar with how to connect the provided blocks (and what operations each of them implements) to set up an experiment, including 1) build the model structure 2) optimise the model's parameters and 3) evaluate the model on test data. \n",
|
||||
"\n",
|
||||
"## For those affected by notebook kernel issues\n",
|
||||
"\n",
|
||||
"In case you are still having issues with running notebook kernels, have a look at [this note](https://github.com/CSTR-Edinburgh/mlpractical/blob/master/kernel_issue_fix.md) on the GitHub.\n",
|
||||
"\n",
|
||||
"## Virtual environments\n",
|
||||
"\n",
|
||||
"Before you proceed onwards, remember to activate your virtual environment:\n",
|
||||
" * If you were in last week's Tuesday or Wednesday group type `activate_mlp` or `source ~/mlpractical/venv/bin/activate`\n",
|
||||
" * If you were in the Monday group:\n",
|
||||
" + and if you have chosen the **comfy** way type: `workon mlpractical`\n",
|
||||
" + and if you have chosen the **generic** way, `source` your virutal environment using `source` and specyfing the path to the activate script (you need to localise it yourself, there were not any general recommendations w.r.t dir structure and people have installed it in different places, usually somewhere in the home directories. If you cannot easily find it by yourself, use something like: `find . -iname activate` ):\n",
|
||||
"\n",
|
||||
"## Syncing repos\n",
|
||||
"\n",
|
||||
"Look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a> for more details. But in short, we recommend to create a separate branch for the coursework, as follows:\n",
|
||||
"\n",
|
||||
"1. Enter the mlpractical directory `cd ~/mlpractical/repo-mlp`\n",
|
||||
"2. List the branches and check which is currently active by typing: `git checkout`\n",
|
||||
"3. If you are not in `master` branch, switch to it by typing: \n",
|
||||
"```\n",
|
||||
"git checkout master\n",
|
||||
" ```\n",
|
||||
"4. Then update the repository (note, assuming master does not have any conflicts), if there are some, have a look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>\n",
|
||||
"```\n",
|
||||
"git pull\n",
|
||||
"```\n",
|
||||
"5. And now, create the new branch & swith to it by typing:\n",
|
||||
"```\n",
|
||||
"git checkout -b coursework1\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Multi Layer Models\n",
|
||||
"\n",
|
||||
"Today, we are going to build the models with an arbitrary number of hidden layers, please have a look at the below diagram and the corresponding computations (which have an *exact* matrix form as expected by numpy and row-wise orientation, $\\circ$ denotes an element-wise product). Below the diagram, we briefly describe how each comptation relates to the code we have provided.\n",
|
||||
"\n",
|
||||
"![Making Predictions](res/code_scheme.svg)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Structuring the model\n",
|
||||
" * The model (for now) is allowed to have a sequence of layers, mapping inputs $\\mathbf{x}$ to outputs $\\mathbf{y}$. \n",
|
||||
" * This operation is implemented as a special type of a layer in `mlp.layers.MLP` class. It keeps a sequence of other layers (of various typyes like Linear, Sigmoid, Softmax, etc.) as well as the internal state of a model for a mini-batch, that is, the intermediate data produced in *forward* and *backward* passes.\n",
|
||||
"2. Forward computation\n",
|
||||
" * `mlp.layers.MLP` provides a `fprop()` method that iterates over defined layers propagates $\\mathbf{x}$ to $\\mathbf{y}$. \n",
|
||||
" * Each layer (look at `mlp.layers.Linear` attached below) also implements `fprop()` method, which performs an atomic, for the given layer, operation. Most often, for the $i$-th layer, we want to obtain a linear transform $\\mathbf a^i$ of the inputs, and apply some non-linear transfer function $f^i(\\mathbf a^i)$ to produce the output $\\mathbf h^i$. Note, in general each layer may implement different activation functions $f^i()$, however for now we will use only `sigmoid` and `softmax`\n",
|
||||
"3. Backward computation\n",
|
||||
" * Similarly, `mlp.layers.MLP` also implements `bprop()` function, to back-propagate the errors from the top to the bottom layer. This class also keeps the back-propagated stats ($\\delta$) to be used later when computing the gradients w.r.t the parameters.\n",
|
||||
" * This functionality is also re-implemented by particular layers (again, have a look at `bprop` function of `mlp.layers.Linear`). `bprop()` is suppsed to return both $\\delta$ (needed to update the parameters) but also back-progapate the gradient down to the inputs. Also note, that depending on whether the layer is the top or not (deals directly with the cost or not) some simplifications may apply (i.e. as with cross-entropy and softmax). That's why when implementing a new type of layer that may be used as an output layer one also need to specify the implementation of `bprop_cost()`.\n",
|
||||
"4. Learning the model\n",
|
||||
" * The actual evaluation of the cost as well as the *forward* and *backward* passes one may find `train_epoch()` method of `mlp.optimisers.SGDOptimiser`\n",
|
||||
" * This function also calls the `pgrads()` method on each layer, that given activations and deltas, is supposed to return the list of the gradients of the cost w.r.t the model parameters, i.e. $\\frac{\\partial{\\mathbf{E}}}{\\partial{\\mathbf{W^i}}}$ and $\\frac{\\partial{\\mathbf{E}}}{\\partial{\\mathbf{b}^i}}$ at the above diagram (look at an example implementation in `mlp.layers.Linear`)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# %load -s Linear mlp/layers.py\n",
|
||||
"class Linear(Layer):\n",
|
||||
"\n",
|
||||
" def __init__(self, idim, odim,\n",
|
||||
" rng=None,\n",
|
||||
" irange=0.1):\n",
|
||||
"\n",
|
||||
" super(Linear, self).__init__(rng=rng)\n",
|
||||
"\n",
|
||||
" self.idim = idim\n",
|
||||
" self.odim = odim\n",
|
||||
"\n",
|
||||
" self.W = self.rng.uniform(\n",
|
||||
" -irange, irange,\n",
|
||||
" (self.idim, self.odim))\n",
|
||||
"\n",
|
||||
" self.b = numpy.zeros((self.odim,), dtype=numpy.float32)\n",
|
||||
"\n",
|
||||
" def fprop(self, inputs):\n",
|
||||
" \"\"\"\n",
|
||||
" Implements a forward propagation through the i-th layer, that is\n",
|
||||
" some form of:\n",
|
||||
" a^i = xW^i + b^i\n",
|
||||
" h^i = f^i(a^i)\n",
|
||||
" with f^i, W^i, b^i denoting a non-linearity, weight matrix and\n",
|
||||
" biases of this (i-th) layer, respectively and x denoting inputs.\n",
|
||||
"\n",
|
||||
" :param inputs: matrix of features (x) or the output of the previous layer h^{i-1}\n",
|
||||
" :return: h^i, matrix of transformed by layer features\n",
|
||||
" \"\"\"\n",
|
||||
" a = numpy.dot(inputs, self.W) + self.b\n",
|
||||
" # here f() is an identity function, so just return a linear transformation\n",
|
||||
" return a\n",
|
||||
"\n",
|
||||
" def bprop(self, h, igrads):\n",
|
||||
" \"\"\"\n",
|
||||
" Implements a backward propagation through the layer, that is, given\n",
|
||||
" h^i denotes the output of the layer and x^i the input, we compute:\n",
|
||||
" dh^i/dx^i which by chain rule is dh^i/da^i da^i/dx^i\n",
|
||||
" x^i could be either features (x) or the output of the lower layer h^{i-1}\n",
|
||||
" :param h: it's an activation produced in forward pass\n",
|
||||
" :param igrads, error signal (or gradient) flowing to the layer, note,\n",
|
||||
" this in general case does not corresponds to 'deltas' used to update\n",
|
||||
" the layer's parameters, to get deltas ones need to multiply it with\n",
|
||||
" the dh^i/da^i derivative\n",
|
||||
" :return: a tuple (deltas, ograds) where:\n",
|
||||
" deltas = igrads * dh^i/da^i\n",
|
||||
" ograds = deltas \\times da^i/dx^i\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" # since df^i/da^i = 1 (f is assumed identity function),\n",
|
||||
" # deltas are in fact the same as igrads\n",
|
||||
" ograds = numpy.dot(igrads, self.W.T)\n",
|
||||
" return igrads, ograds\n",
|
||||
"\n",
|
||||
" def bprop_cost(self, h, igrads, cost):\n",
|
||||
" \"\"\"\n",
|
||||
" Implements a backward propagation in case the layer directly\n",
|
||||
" deals with the optimised cost (i.e. the top layer)\n",
|
||||
" By default, method should implement a bprop for default cost, that is\n",
|
||||
" the one that is natural to the layer's output, i.e.:\n",
|
||||
" here we implement linear -> mse scenario\n",
|
||||
" :param h: it's an activation produced in forward pass\n",
|
||||
" :param igrads, error signal (or gradient) flowing to the layer, note,\n",
|
||||
" this in general case does not corresponds to 'deltas' used to update\n",
|
||||
" the layer's parameters, to get deltas ones need to multiply it with\n",
|
||||
" the dh^i/da^i derivative\n",
|
||||
" :param cost, mlp.costs.Cost instance defining the used cost\n",
|
||||
" :return: a tuple (deltas, ograds) where:\n",
|
||||
" deltas = igrads * dh^i/da^i\n",
|
||||
" ograds = deltas \\times da^i/dx^i\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" if cost is None or cost.get_name() == 'mse':\n",
|
||||
" # for linear layer and mean square error cost,\n",
|
||||
" # cost back-prop is the same as standard back-prop\n",
|
||||
" return self.bprop(h, igrads)\n",
|
||||
" else:\n",
|
||||
" raise NotImplementedError('Linear.bprop_cost method not implemented '\n",
|
||||
" 'for the %s cost' % cost.get_name())\n",
|
||||
"\n",
|
||||
" def pgrads(self, inputs, deltas):\n",
|
||||
" \"\"\"\n",
|
||||
" Return gradients w.r.t parameters\n",
|
||||
"\n",
|
||||
" :param inputs, input to the i-th layer\n",
|
||||
" :param deltas, deltas computed in bprop stage up to -ith layer\n",
|
||||
" :return list of grads w.r.t parameters dE/dW and dE/db in *exactly*\n",
|
||||
" the same order as the params are returned by get_params()\n",
|
||||
"\n",
|
||||
" Note: deltas here contain the whole chain rule leading\n",
|
||||
" from the cost up to the the i-th layer, i.e.\n",
|
||||
" dE/dy^L dy^L/da^L da^L/dh^{L-1} dh^{L-1}/da^{L-1} ... dh^{i}/da^{i}\n",
|
||||
" and here we are just asking about\n",
|
||||
" 1) da^i/dW^i and 2) da^i/db^i\n",
|
||||
" since W and b are only layer's parameters\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" grad_W = numpy.dot(inputs.T, deltas)\n",
|
||||
" grad_b = numpy.sum(deltas, axis=0)\n",
|
||||
"\n",
|
||||
" return [grad_W, grad_b]\n",
|
||||
"\n",
|
||||
" def get_params(self):\n",
|
||||
" return [self.W, self.b]\n",
|
||||
"\n",
|
||||
" def set_params(self, params):\n",
|
||||
" #we do not make checks here, but the order on the list\n",
|
||||
" #is assumed to be exactly the same as get_params() returns\n",
|
||||
" self.W = params[0]\n",
|
||||
" self.b = params[1]\n",
|
||||
"\n",
|
||||
" def get_name(self):\n",
|
||||
" return 'linear'\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Example 1: Experiment with linear models and MNIST\n",
|
||||
"\n",
|
||||
"The below snippet demonstrates how to use the code we have provided for the coursework 1. Get familiar with it, as from now on we will use till the end of the course, including the 2nd coursework.\n",
|
||||
"\n",
|
||||
"It should be straightforward to extend the following code to more complex models, like stack more layers, change the cost, the optimiser, learning rate schedules, etc.. But **ask** in case something is not clear.\n",
|
||||
"\n",
|
||||
"In this particular example, we use the following components:\n",
|
||||
" * One layer mapping data-points ($\\mathbf x$) straight to 10 digits classes represented as 10 (linear) outputs ($\\mathbf y$). This operation is implemented as a linear layer in `mlp.layers.Linear`. Get familiar with this class (read the comments, etc.) as it is going to be a building block for the coursework.\n",
|
||||
" * One can stack as many different layers as required through the container `mlp.layers.MLP`\n",
|
||||
" * As an objective here we use the Mean Square Error cost defined in `mlp.costs.MSECost`\n",
|
||||
" * Our *Stochastic Gradient Descent* optimiser can be found in `mlp.optimisers.SGDOptimiser`. Its parent `mlp.optimisers.Optimiser` implements validation functionality (and an interface in case one need to implement a different optimiser)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy\n",
|
||||
"import logging\n",
|
||||
"\n",
|
||||
"logger = logging.getLogger()\n",
|
||||
"logger.setLevel(logging.INFO)\n",
|
||||
"\n",
|
||||
"from mlp.layers import MLP, Linear #import required layer types\n",
|
||||
"from mlp.optimisers import SGDOptimiser #import the optimiser\n",
|
||||
"from mlp.dataset import MNISTDataProvider #import data provider\n",
|
||||
"from mlp.costs import MSECost #import the cost we want to use for optimisation\n",
|
||||
"from mlp.schedulers import LearningRateFixed\n",
|
||||
"\n",
|
||||
"rng = numpy.random.RandomState([2015,10,10])\n",
|
||||
"\n",
|
||||
"# define the model structure, here just one linear layer\n",
|
||||
"# and mean square error cost\n",
|
||||
"cost = MSECost()\n",
|
||||
"model = MLP(cost=cost)\n",
|
||||
"model.add_layer(Linear(idim=784, odim=10, rng=rng))\n",
|
||||
"#one can stack more layers here\n",
|
||||
"\n",
|
||||
"# define the optimiser, here stochasitc gradient descent\n",
|
||||
"# with fixed learning rate and max_epochs as stopping criterion\n",
|
||||
"lr_scheduler = LearningRateFixed(learning_rate=0.01, max_epochs=20)\n",
|
||||
"optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
|
||||
"\n",
|
||||
"logger.info('Initialising data providers...')\n",
|
||||
"train_dp = MNISTDataProvider(dset='train', batch_size=100, max_num_batches=-10, randomize=True)\n",
|
||||
"valid_dp = MNISTDataProvider(dset='valid', batch_size=100, max_num_batches=-10, randomize=False)\n",
|
||||
"\n",
|
||||
"logger.info('Training started...')\n",
|
||||
"optimiser.train(model, train_dp, valid_dp)\n",
|
||||
"\n",
|
||||
"logger.info('Testing the model on test set:')\n",
|
||||
"test_dp = MNISTDataProvider(dset='eval', batch_size=100, max_num_batches=-10, randomize=False)\n",
|
||||
"cost, accuracy = optimiser.validate(model, test_dp)\n",
|
||||
"logger.info('MNIST test set accuracy is %.2f %% (cost is %.3f)'%(accuracy*100., cost))\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise\n",
|
||||
"\n",
|
||||
"Modify the above code by adding an intemediate linear layer of size 200 hidden units between input and output layers."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 2",
|
||||
"language": "python",
|
||||
"name": "python2"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 2
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython2",
|
||||
"version": "2.7.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
328
03_MLP_Coursework1.ipynb
Normal file
@ -0,0 +1,328 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Coursework #1\n",
|
||||
"\n",
|
||||
"## Introduction\n",
|
||||
"\n",
|
||||
"This coursework is concerned with building multi-layer networks to address the MNIST digit classification problem. It builds on the previous labs, in particular [02_MNIST_SLN.ipynb](02_MNIST_SLN.ipynb) in which single layer networks were trained for MNIST digit classification. The course will involve extending that code to use Sigmoid and Softmax layers, combining these into multi-layer networks, and carrying out a number of MNIST digit classification experiments, to investigate the effect of learning rate, the number of hidden units, and the number of hidden layers.\n",
|
||||
"\n",
|
||||
"The coursework is divided into 4 tasks:\n",
|
||||
"* **Task 1**: *Implementing a sigmoid layer* - 15 marks. \n",
|
||||
"This task involves extending the `Linear` class in file `mlp/layers.py` to `Sigmoid`, with code for forward prop, backprop computation of the gradient, and weight update.\n",
|
||||
"* **Task 2**: *Implementing a softmax layer* - 15 marks. \n",
|
||||
"This task involves extending the `Linear` class in file `mlp/layers.py` to `Softmax`, with code for forward prop, backprop computation of the gradient, and weight update.\n",
|
||||
"* **Task 3**: *Constructing a multi-layer network* - 40 marks. \n",
|
||||
"This task involves putting together a Sigmoid and a Softmax layer to create a multi-layer network, with one hidden layer (100 units) and one output layer, that is trained to classify MNIST digits. This task will include reporting classification results, exploring the effect of learning rates, and plotting Hinton Diagrams for the hidden units and output units.\n",
|
||||
"* **Task 4**: *Experiments with different architectures* - 30 marks. \n",
|
||||
"This task involves further MNIST classification experiments, primarily looking at the effect of using different numbers of hidden layers.\n",
|
||||
"The coursework will be marked out of 100, and will contribute 30% of the total mark in the MLP course.\n",
|
||||
"\n",
|
||||
"## Previous Tutorials\n",
|
||||
"\n",
|
||||
"Before starting this coursework make sure that you have completed the first three labs:\n",
|
||||
"\n",
|
||||
"* [00_Introduction.ipynb](00_Introduction.ipynb) - setting up your environment; *Solutions*: [00_Introduction_solution.ipynb](00_Introduction_solution.ipynb)\n",
|
||||
"* [01_Linear_Models.ipynb](01_Linear_Models.ipynb) - training single layer networks; *Solutions*: [01_Linear_Models_solution.ipynb](01_Linear_Models_solution.ipynb)\n",
|
||||
"* [02_MNIST_SLN.ipynb](02_MNIST_SLN.ipynb) - training a single layer network for MNIST digit classification\n",
|
||||
"\n",
|
||||
"To ensure that your virtual environment is correct, please see [this note](https://github.com/CSTR-Edinburgh/mlpractical/blob/master/kernel_issue_fix.md) on the GitHub.\n",
|
||||
"## Submission\n",
|
||||
"**Submission Deadline: Thursday 29 October, 16:00** \n",
|
||||
"\n",
|
||||
"Submit the coursework as an ipython notebook file, using the `submit` command in the terminal on a DICE machine. If your file is `03_MLP_Coursework1.ipynb` then you would enter:\n",
|
||||
"\n",
|
||||
"`submit mlp 1 03_MLP_Coursework1.ipynb` \n",
|
||||
"\n",
|
||||
"where `mlp 1` indicates this is the first coursework of MLP.\n",
|
||||
"\n",
|
||||
"After submitting, you should receive an email of acknowledgment from the system confirming that your submission has been received successfully. Keep the email as evidence of your coursework submission.\n",
|
||||
"\n",
|
||||
"**Please make sure you submit a single `ipynb` file (and nothing else)!**\n",
|
||||
"\n",
|
||||
"**Submission Deadline: Thursday 29 October, 16:00** \n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Getting Started\n",
|
||||
"Please enter your exam number and the date in the next code cell."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#MLP Coursework 1\n",
|
||||
"#Exam number: <ENTER EXAM NUMBER>\n",
|
||||
"#Date: <ENTER DATE>\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Please run the next code cell, which imports `numpy` and seeds the random number generator. Please **do not** modify the random number generator seed!"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy\n",
|
||||
"\n",
|
||||
"#Seed a random number generator running the below cell, but do **not** modify the seed.\n",
|
||||
"rng = numpy.random.RandomState([2015,10,10])\n",
|
||||
"rng_state = rng.get_state()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Task 1 - Sigmoid Layer (15%)\n",
|
||||
"\n",
|
||||
"In this task you need to create a class `Sigmoid` which encapsulates a layer of `sigmoid` units. You should do this by extending the `mlp.layers.Linear` class (in file `mlp/layers.py`), which implements a a layer of linear units (i.e. weighted sum plus bias). The `Sigmoid` class extends this by applying the `sigmoid` transfer function to the weighted sum in the forward propagation, and applying the derivative of the `sigmoid` in the gradient descent back propagation and computing the gradients with respect to layer's parameters. Do **not** copy the implementation provided in `Linear` class but rather, **reuse** it through inheritance.\n",
|
||||
"\n",
|
||||
"When you have implemented `Sigmoid` (in the `mlp.layers` module), then please test it by running the below code cell.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from mlp.layers import Sigmoid\n",
|
||||
"\n",
|
||||
"a = numpy.asarray([-20.1, 52.4, 0, 0.05, 0.05, 49])\n",
|
||||
"b = numpy.asarray([-20.1, 52.4, 0, 0.05, 0.05, 49, 20, 20])\n",
|
||||
"\n",
|
||||
"rng.set_state(rng_state)\n",
|
||||
"sigm = Sigmoid(idim=a.shape[0], odim=b.shape[0], rng=rng)\n",
|
||||
"\n",
|
||||
"fp = sigm.fprop(a)\n",
|
||||
"deltas, ograds = sigm.bprop(h=fp, igrads=b)\n",
|
||||
"\n",
|
||||
"print fp.sum()\n",
|
||||
"print deltas.sum()\n",
|
||||
"print ograds.sum()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"***\n",
|
||||
"To include the `Sigmoid` code in the notebook please run the below code cell. (The `%load` notebook command is used to load the source of the `Sigmoid` class from `mlp/layers.py`.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%load -s Sigmoid mlp/layers.py\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Task 2 - Softmax (15%)\n",
|
||||
"\n",
|
||||
"In this task you need to create a class `Softmax` which encapsulates a layer of `softmax` units. As in the previous task, you should do this by extending the `mlp.layers.Linear` class (in file `mlp/layers.py`).\n",
|
||||
"\n",
|
||||
"When you have implemented `Softmax` (in the `mlp.layers` module), then please test it by running the below code cell.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from mlp.layers import Softmax\n",
|
||||
"\n",
|
||||
"a = numpy.asarray([-20.1, 52.4, 0, 0.05, 0.05, 49])\n",
|
||||
"b = numpy.asarray([0, 0, 0, 0, 0, 0, 0, 1])\n",
|
||||
"\n",
|
||||
"rng.set_state(rng_state)\n",
|
||||
"softmax = Softmax(idim=a.shape[0], odim=b.shape[0], rng=rng)\n",
|
||||
"\n",
|
||||
"fp = softmax.fprop(a)\n",
|
||||
"deltas, ograds = softmax.bprop_cost(h=None, igrads=fp-b, cost=None)\n",
|
||||
"\n",
|
||||
"print fp.sum()\n",
|
||||
"print deltas.sum()\n",
|
||||
"print ograds.sum()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"***\n",
|
||||
"To include the `Softmax` code in the notebook please run the below code cell. (The notebook `%load` command is used to load the source of the `Softmax` class from `mlp/layers.py`.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%load -s Softmax mlp/layers.py"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Task 3 - Multi-layer network for MNIST classification (40%)\n",
|
||||
"\n",
|
||||
"**(a)** (20%) Building on the single layer linear network for MNIST classification used in lab [02_MNIST_SLN.ipynb](02_MNIST_SLN.ipynb), and using the `Sigmoid` and `Softmax` classes that you implemented in tasks 1 and 2, construct and learn a model that classifies MNIST images and:\n",
|
||||
" * Has one hidden layer with a `sigmoid` transfer function and 100 units\n",
|
||||
" * Uses a `softmax` output layer to discriminate between the 10 digit classes (use the `mlp.costs.CECost()` cost)\n",
|
||||
"\n",
|
||||
"Your code should print the final values of the error function and the classification accuracy for train, validation, and test sets (please keep also the log information printed by default by the optimiser). Limit the number of training epochs to 30. You can, of course, split the solution at as many cells as you think is necessary."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# include here the complete code that constructs the model, performs training,\n",
|
||||
"# and prints the error and accuracy for train/valid/test"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**(b)** (10%) Investigate the impact of different learning rates $\\eta \\in \\{0.5, 0.2, 0.1, 0.05, 0.01, 0.005\\}$ on the convergence of the network training as well as the final accuracy:\n",
|
||||
" * Plot (on a single graph) the error rate curves for each learning rate as a function of training epochs for training set\n",
|
||||
" * Plot (on another single graph) the error rate curves as a function of training epochs for validation set\n",
|
||||
" * Include a table of the corresponding error rates for test set\n",
|
||||
"\n",
|
||||
"The notebook command `%matplotlib inline` ensures that your graphs will be added to the notebook, rather than opened as additional windows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%matplotlib inline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**(c)** (10%) Plot the following graphs:\n",
|
||||
" * Display the 784-element weight vector of each of the 100 hidden units as 10x10 grid plot of 28x28 images, in order to visualise what features of the input they are encoding. To do this, take the weight vector of each hidden unit, reshape to 28x28, and plot using the `imshow` function).\n",
|
||||
" * Plot a Hinton Diagram of the output layer weight matrix for digits 0 and 1"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%matplotlib inline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"## Task 4 - Experiments with 1-5 hidden layers (30%)\n",
|
||||
"\n",
|
||||
"In this task use the learning rate which resulted in the best accuracy in your experiments in Task 3 (b). Perform the following experiments:\n",
|
||||
"\n",
|
||||
" * Train a similar model to Task 3, with one hidden layer, but with 800 hidden units. \n",
|
||||
" * Train 4 additional models with 2, 3, 4 and 5 hidden layers. Set the number of hidden units for each model, such that all the models have similar number of trainable weights ($\\pm$2%). For simplicity, for a given model, keep the number of units in each hidden layer the same.\n",
|
||||
" * Plot value of the error function for training and validation sets as a function of training epochs for each model\n",
|
||||
" * Plot the test set classification accuracy as a function of the number of hidden layers\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"This is the end of coursework 1.\n",
|
||||
"\n",
|
||||
"Please remember to save your notebook, and submit your notebook following the instructions at the top. Please make sure that you have executed all the code cells when you submit the notebook.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 2",
|
||||
"language": "python",
|
||||
"name": "python2"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 2
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython2",
|
||||
"version": "2.7.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
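As a side note on Task 3 (c), here is a minimal sketch of plotting one hidden unit's 784-element weight vector as a 28x28 image. The randomly initialised `W` below is only a stand-in for the trained first-layer weight matrix, which in an actual solution would come from `model.layers[0].get_params()[0]`:

```
import numpy
import matplotlib.pyplot as plt

rng = numpy.random.RandomState([2015, 10, 10])
W = rng.uniform(-0.1, 0.1, (784, 100))  # stand-in for the trained weight matrix

plt.imshow(W[:, 0].reshape(28, 28), cmap='gray', interpolation='none')
plt.title('weights of hidden unit 0')
plt.show()
```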
64
mlp/costs.py
Normal file
64
mlp/costs.py
Normal file
@ -0,0 +1,64 @@
|
||||
# Machine Learning Practical (INFR11119),
|
||||
# Pawel Swietojanski, University of Edinburgh
|
||||
|
||||
|
||||
import numpy
|
||||
|
||||
|
||||
class Cost(object):
|
||||
"""
|
||||
Defines an interface for the cost object
|
||||
"""
|
||||
def cost(self, y, t, **kwargs):
|
||||
"""
|
||||
Implements a cost for monitoring purposes
|
||||
:param y: matrix -- an output of the model
|
||||
:param t: matrix -- an expected output the model should produce
|
||||
:param kwargs: -- some optional parameters required by the cost
|
||||
:return: the scalar value representing the cost given y and t
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def grad(self, y, t, **kwargs):
|
||||
"""
|
||||
Implements a gradient of the cost w.r.t y
|
||||
:param y: matrix -- an output of the model
|
||||
:param t: matrix -- an expected output the model should produce
|
||||
:param kwargs: -- some optional parameters required by the cost
|
||||
:return: matrix - the gradient of the cost w.r.t y
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def get_name(self):
|
||||
return 'cost'
|
||||
|
||||
|
||||
class MSECost(Cost):
|
||||
def cost(self, y, t, **kwargs):
|
||||
se = 0.5*numpy.sum((y - t)**2, axis=1)
|
||||
return numpy.mean(se)
|
||||
|
||||
def grad(self, y, t, **kwargs):
|
||||
return y - t
|
||||
|
||||
def get_name(self):
|
||||
return 'mse'
|
||||
|
||||
|
||||
class CECost(Cost):
|
||||
"""
|
||||
Cross Entropy (Negative log-likelihood) cost for multiple classes
|
||||
"""
|
||||
def cost(self, y, t, **kwargs):
|
||||
#assumes t is 1-of-K coded and y is a softmax
|
||||
#transformed estimate at the output layer
|
||||
nll = t * numpy.log(y)
|
||||
return -numpy.mean(numpy.sum(nll, axis=1))
|
||||
|
||||
def grad(self, y, t, **kwargs):
|
||||
#assumes t is 1-of-K coded and y is a softmax
|
||||
#transformed estimate at the output layer
|
||||
return y - t
|
||||
|
||||
def get_name(self):
|
||||
return 'ce'
|
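A minimal sanity check of the two cost classes above (not part of `mlp/costs.py`; the numbers are illustrative only):

```
import numpy
from mlp.costs import MSECost, CECost

y = numpy.asarray([[0.2, 0.8], [0.6, 0.4]])  # model outputs for two examples
t = numpy.asarray([[0.0, 1.0], [1.0, 0.0]])  # 1-of-K coded targets

mse = MSECost()
print mse.cost(y, t)   # mean over examples of 0.5*sum((y-t)**2, axis=1) -> 0.1
print mse.grad(y, t)   # simply y - t

ce = CECost()
print ce.cost(y, t)    # -mean(sum(t*log(y), axis=1)) -> approx. 0.367
print ce.grad(y, t)    # y - t, assuming y is the output of a softmax layer
```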
@ -60,6 +60,13 @@ class DataProvider(object):
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def num_examples(self):
|
||||
"""
|
||||
Returns a number of data-points in dataset
|
||||
"""
|
||||
return NotImplementedError()
|
||||
|
||||
|
||||
|
||||
class MNISTDataProvider(DataProvider):
|
||||
"""
|
||||
@ -142,6 +149,9 @@ class MNISTDataProvider(DataProvider):
|
||||
|
||||
return rval_x, self.__to_one_of_k(rval_t)
|
||||
|
||||
def num_examples(self):
|
||||
return self.x.shape[0]
|
||||
|
||||
def __to_one_of_k(self, y):
|
||||
rval = numpy.zeros((y.shape[0], self.num_classes), dtype=numpy.float32)
|
||||
for i in xrange(y.shape[0]):
|
||||
|
281
mlp/layers.py
Normal file
281
mlp/layers.py
Normal file
@ -0,0 +1,281 @@
|
||||
|
||||
# Machine Learning Practical (INFR11119),
|
||||
# Pawel Swietojanski, University of Edinburgh
|
||||
|
||||
import numpy
|
||||
import logging
|
||||
from mlp.costs import Cost
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class MLP(object):
|
||||
"""
|
||||
This is a container for an arbitrary sequence of other transforms.
|
||||
On top of this, the class also keeps the state of the model, i.e.
|
||||
the result of forward (activations) and backward (deltas) passes
|
||||
through the model (for a mini-batch), which is required to compute
|
||||
the gradients for the parameters
|
||||
"""
|
||||
def __init__(self, cost):
|
||||
|
||||
assert isinstance(cost, Cost), (
|
||||
"Cost needs to be of type mlp.costs.Cost, got %s" % type(cost)
|
||||
)
|
||||
|
||||
self.layers = [] #the actual list of network layers
|
||||
self.activations = [] #keeps forward-pass activations (h from equations)
|
||||
# for a given minibatch (or features at 0th index)
|
||||
self.deltas = [] #keeps back-propagated error signals (deltas from equations)
|
||||
# for a given minibatch and each layer
|
||||
self.cost = cost
|
||||
|
||||
def fprop(self, x):
|
||||
"""
|
||||
|
||||
:param inputs: mini-batch of data-points x
|
||||
:return: y (top layer activation) which is an estimate of y given x
|
||||
"""
|
||||
|
||||
if len(self.activations) != len(self.layers) + 1:
|
||||
self.activations = [None]*(len(self.layers) + 1)
|
||||
|
||||
self.activations[0] = x
|
||||
for i in xrange(0, len(self.layers)):
|
||||
self.activations[i+1] = self.layers[i].fprop(self.activations[i])
|
||||
return self.activations[-1]
|
||||
|
||||
def bprop(self, cost_grad):
|
||||
"""
|
||||
:param cost_grad: matrix -- grad of the cost w.r.t y
|
||||
:return: None, the deltas are kept in the model
|
||||
"""
|
||||
|
||||
# allocate the list of deltas for each layer
|
||||
# note, we do not use all of those fields but
|
||||
# want to keep it aligned 1:1 with activations,
|
||||
# which will simplify indexing later on when
|
||||
# computing grads w.r.t parameters
|
||||
if len(self.deltas) != len(self.activations):
|
||||
self.deltas = [None]*len(self.activations)
|
||||
|
||||
# treat the top layer in special way, as it deals with the
|
||||
# cost, which may lead to some simplifications
|
||||
top_layer_idx = len(self.layers)
|
||||
self.deltas[top_layer_idx], ograds = self.layers[top_layer_idx - 1].\
|
||||
bprop_cost(self.activations[top_layer_idx], cost_grad, self.cost)
|
||||
|
||||
# then back-prop through remaining layers
|
||||
for i in xrange(top_layer_idx - 1, 0, -1):
|
||||
self.deltas[i], ograds = self.layers[i - 1].\
|
||||
bprop(self.activations[i], ograds)
|
||||
|
||||
def add_layer(self, layer):
|
||||
self.layers.append(layer)
|
||||
|
||||
def set_layers(self, layers):
|
||||
self.layers = layers
|
||||
|
||||
def get_name(self):
|
||||
return 'mlp'
|
||||
|
||||
|
||||
class Layer(object):
|
||||
"""
|
||||
Abstract class defining an interface for
|
||||
other transforms.
|
||||
"""
|
||||
def __init__(self, rng=None):
|
||||
|
||||
if rng is None:
|
||||
seed=[2015, 10, 1]
|
||||
self.rng = numpy.random.RandomState(seed)
|
||||
else:
|
||||
self.rng = rng
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""
|
||||
Implements a forward propagation through the i-th layer, that is
|
||||
some form of:
|
||||
a^i = xW^i + b^i
|
||||
h^i = f^i(a^i)
|
||||
with f^i, W^i, b^i denoting a non-linearity, weight matrix and
|
||||
biases at the i-th layer, respectively and x denoting inputs.
|
||||
|
||||
:param inputs: matrix of features (x) or the output of the previous layer h^{i-1}
|
||||
:return: h^i, matrix of features transformed by the layer
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def bprop(self, h, igrads):
|
||||
"""
|
||||
Implements a backward propagation through the layer, that is, given
|
||||
h^i denotes the output of the layer and x^i the input, we compute:
|
||||
dh^i/dx^i which by chain rule is dh^i/da^i da^i/dx^i
|
||||
x^i could be either features (x) or the output of the lower layer h^{i-1}
|
||||
:param h: it's an activation produced in forward pass
|
||||
:param igrads, error signal (or gradient) flowing to the layer, note,
|
||||
this in the general case does not correspond to the 'deltas' used to update
|
||||
the layer's parameters; to get deltas one needs to multiply it by
|
||||
the dh^i/da^i derivative
|
||||
:return: a tuple (deltas, ograds) where:
|
||||
deltas = igrads * dh^i/da^i
|
||||
ograds = deltas \times da^i/dx^i
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def bprop_cost(self, h, igrads, cost=None):
|
||||
"""
|
||||
Implements a backward propagation in case the layer directly
|
||||
deals with the optimised cost (i.e. the top layer)
|
||||
By default, method should implement a back-prop for default cost, that is
|
||||
the one that is natural to the layer's output, i.e.:
|
||||
linear -> mse, softmax -> cross-entropy, sigmoid -> binary cross-entropy
|
||||
:param h: it's an activation produced in forward pass
|
||||
:param igrads, error signal (or gradient) flowing to the layer, note,
|
||||
this in the general case does not correspond to the 'deltas' used to update
|
||||
the layer's parameters; to get deltas one needs to multiply it by
|
||||
the dh^i/da^i derivative
|
||||
:return: a tuple (deltas, ograds) where:
|
||||
deltas = igrads * dh^i/da^i
|
||||
ograds = deltas \times da^i/dx^i
|
||||
"""
|
||||
|
||||
raise NotImplementedError()
|
||||
|
||||
def pgrads(self, inputs, deltas):
|
||||
"""
|
||||
Return gradients w.r.t parameters
|
||||
"""
|
||||
raise NotImplementedError()
|
||||
|
||||
def get_params(self):
|
||||
raise NotImplementedError()
|
||||
|
||||
def set_params(self):
|
||||
raise NotImplementedError()
|
||||
|
||||
def get_name(self):
|
||||
return 'abstract_layer'
|
||||
|
||||
|
||||
class Linear(Layer):
|
||||
|
||||
def __init__(self, idim, odim,
|
||||
rng=None,
|
||||
irange=0.1):
|
||||
|
||||
super(Linear, self).__init__(rng=rng)
|
||||
|
||||
self.idim = idim
|
||||
self.odim = odim
|
||||
|
||||
self.W = self.rng.uniform(
|
||||
-irange, irange,
|
||||
(self.idim, self.odim))
|
||||
|
||||
self.b = numpy.zeros((self.odim,), dtype=numpy.float32)
|
||||
|
||||
def fprop(self, inputs):
|
||||
"""
|
||||
Implements a forward propagation through the i-th layer, that is
|
||||
some form of:
|
||||
a^i = xW^i + b^i
|
||||
h^i = f^i(a^i)
|
||||
with f^i, W^i, b^i denoting a non-linearity, weight matrix and
|
||||
biases of this (i-th) layer, respectively and x denoting inputs.
|
||||
|
||||
:param inputs: matrix of features (x) or the output of the previous layer h^{i-1}
|
||||
:return: h^i, matrix of features transformed by the layer
|
||||
"""
|
||||
a = numpy.dot(inputs, self.W) + self.b
|
||||
# here f() is an identity function, so just return a linear transformation
|
||||
return a
|
||||
|
||||
def bprop(self, h, igrads):
|
||||
"""
|
||||
Implements a backward propagation through the layer, that is, given
|
||||
h^i denotes the output of the layer and x^i the input, we compute:
|
||||
dh^i/dx^i which by chain rule is dh^i/da^i da^i/dx^i
|
||||
x^i could be either features (x) or the output of the lower layer h^{i-1}
|
||||
:param h: it's an activation produced in forward pass
|
||||
:param igrads, error signal (or gradient) flowing to the layer, note,
|
||||
this in the general case does not correspond to the 'deltas' used to update
|
||||
the layer's parameters; to get deltas one needs to multiply it by
|
||||
the dh^i/da^i derivative
|
||||
:return: a tuple (deltas, ograds) where:
|
||||
deltas = igrads * dh^i/da^i
|
||||
ograds = deltas \times da^i/dx^i
|
||||
"""
|
||||
|
||||
# since df^i/da^i = 1 (f is assumed identity function),
|
||||
# deltas are in fact the same as igrads
|
||||
ograds = numpy.dot(igrads, self.W.T)
|
||||
return igrads, ograds
|
||||
|
||||
def bprop_cost(self, h, igrads, cost):
|
||||
"""
|
||||
Implements a backward propagation in case the layer directly
|
||||
deals with the optimised cost (i.e. the top layer)
|
||||
By default, method should implement a bprop for default cost, that is
|
||||
the one that is natural to the layer's output, i.e.:
|
||||
here we implement linear -> mse scenario
|
||||
:param h: it's an activation produced in forward pass
|
||||
:param igrads, error signal (or gradient) flowing to the layer, note,
|
||||
this in the general case does not correspond to the 'deltas' used to update
|
||||
the layer's parameters; to get deltas one needs to multiply it by
|
||||
the dh^i/da^i derivative
|
||||
:param cost, mlp.costs.Cost instance defining the used cost
|
||||
:return: a tuple (deltas, ograds) where:
|
||||
deltas = igrads * dh^i/da^i
|
||||
ograds = deltas \times da^i/dx^i
|
||||
"""
|
||||
|
||||
if cost is None or cost.get_name() == 'mse':
|
||||
# for linear layer and mean square error cost,
|
||||
# cost back-prop is the same as standard back-prop
|
||||
return self.bprop(h, igrads)
|
||||
else:
|
||||
raise NotImplementedError('Linear.bprop_cost method not implemented '
|
||||
'for the %s cost' % cost.get_name())
|
||||
|
||||
def pgrads(self, inputs, deltas):
|
||||
"""
|
||||
Return gradients w.r.t parameters
|
||||
|
||||
:param inputs, input to the i-th layer
|
||||
:param deltas, deltas computed in bprop stage up to -ith layer
|
||||
:return list of grads w.r.t parameters dE/dW and dE/db in *exactly*
|
||||
the same order as the params are returned by get_params()
|
||||
|
||||
Note: deltas here contain the whole chain rule leading
|
||||
from the cost up to the i-th layer, i.e.
|
||||
dE/dy^L dy^L/da^L da^L/dh^{L-1} dh^{L-1}/da^{L-1} ... dh^{i}/da^{i}
|
||||
and here we are just asking about
|
||||
1) da^i/dW^i and 2) da^i/db^i
|
||||
since W and b are only layer's parameters
|
||||
"""
|
||||
|
||||
grad_W = numpy.dot(inputs.T, deltas)
|
||||
grad_b = numpy.sum(deltas, axis=0)
|
||||
|
||||
return [grad_W, grad_b]
|
||||
|
||||
def get_params(self):
|
||||
return [self.W, self.b]
|
||||
|
||||
def set_params(self, params):
|
||||
#we do not make checks here, but the order on the list
|
||||
#is assumed to be exactly the same as get_params() returns
|
||||
self.W = params[0]
|
||||
self.b = params[1]
|
||||
|
||||
def get_name(self):
|
||||
return 'linear'
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
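A minimal shape check for the `Linear` layer above (assuming the `mlp` package is importable, as in the labs); it only exercises `fprop`, `bprop` and `pgrads` on a toy mini-batch:

```
import numpy
from mlp.layers import Linear

rng = numpy.random.RandomState([2015, 10, 10])
layer = Linear(idim=3, odim=2, rng=rng)

x = numpy.ones((5, 3))                       # mini-batch of 5 data-points
h = layer.fprop(x)                           # a = xW + b (identity non-linearity)
deltas, ograds = layer.bprop(h, igrads=numpy.ones((5, 2)))
grad_W, grad_b = layer.pgrads(x, deltas)

print h.shape, ograds.shape                  # (5, 2) (5, 3)
print grad_W.shape, grad_b.shape             # (3, 2) (2,)
```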
170
mlp/optimisers.py
Normal file
170
mlp/optimisers.py
Normal file
@ -0,0 +1,170 @@
|
||||
# Machine Learning Practical (INFR11119),
|
||||
# Pawel Swietojanski, University of Edinburgh
|
||||
|
||||
import numpy
|
||||
import time
|
||||
import logging
|
||||
|
||||
from mlp.layers import MLP
|
||||
from mlp.dataset import DataProvider
|
||||
from mlp.schedulers import LearningRateScheduler
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class Optimiser(object):
|
||||
def train_epoch(self, model, train_iter):
|
||||
raise NotImplementedError()
|
||||
|
||||
def train(self, model, train_iter, valid_iter=None):
|
||||
raise NotImplementedError()
|
||||
|
||||
def validate(self, model, valid_iterator):
|
||||
assert isinstance(model, MLP), (
|
||||
"Expected model to be a subclass of 'mlp.layers.MLP'"
|
||||
" class but got %s " % type(model)
|
||||
)
|
||||
|
||||
assert isinstance(valid_iterator, DataProvider), (
|
||||
"Expected iterator to be a subclass of 'mlp.dataset.DataProvider'"
|
||||
" class but got %s " % type(valid_iterator)
|
||||
)
|
||||
|
||||
acc_list, nll_list = [], []
|
||||
for x, t in valid_iterator:
|
||||
y = model.fprop(x)
|
||||
nll_list.append(model.cost.cost(y, t))
|
||||
acc_list.append(numpy.mean(self.classification_accuracy(y, t)))
|
||||
|
||||
acc = numpy.mean(acc_list)
|
||||
nll = numpy.mean(nll_list)
|
||||
|
||||
return nll, acc
|
||||
|
||||
@staticmethod
|
||||
def classification_accuracy(y, t):
|
||||
"""
|
||||
Returns classification accuracy given the estimate y and targets t
|
||||
:param y: matrix -- estimate produced by the model in fprop
|
||||
:param t: matrix -- target 1-of-K coded
|
||||
:return: vector of y.shape[0] size with binary values set to 0
|
||||
if the example was misclassified or 1 otherwise
|
||||
"""
|
||||
y_idx = numpy.argmax(y, axis=1)
|
||||
t_idx = numpy.argmax(t, axis=1)
|
||||
rval = numpy.equal(y_idx, t_idx)
|
||||
return rval
|
||||
|
||||
|
||||
class SGDOptimiser(Optimiser):
|
||||
def __init__(self, lr_scheduler):
|
||||
super(SGDOptimiser, self).__init__()
|
||||
|
||||
assert isinstance(lr_scheduler, LearningRateScheduler), (
|
||||
"Expected lr_scheduler to be a subclass of 'mlp.schedulers.LearningRateScheduler'"
|
||||
" class but got %s " % type(lr_scheduler)
|
||||
)
|
||||
|
||||
self.lr_scheduler = lr_scheduler
|
||||
|
||||
def train_epoch(self, model, train_iterator, learning_rate):
|
||||
|
||||
assert isinstance(model, MLP), (
|
||||
"Expected model to be a subclass of 'mlp.layers.MLP'"
|
||||
" class but got %s " % type(model)
|
||||
)
|
||||
assert isinstance(train_iterator, DataProvider), (
|
||||
"Expected iterator to be a subclass of 'mlp.dataset.DataProvider'"
|
||||
" class but got %s " % type(train_iterator)
|
||||
)
|
||||
|
||||
acc_list, nll_list = [], []
|
||||
for x, t in train_iterator:
|
||||
# get the prediction
|
||||
y = model.fprop(x)
|
||||
|
||||
# compute the cost and grad of the cost w.r.t y
|
||||
cost = model.cost.cost(y, t)
|
||||
cost_grad = model.cost.grad(y, t)
|
||||
|
||||
# do backward pass through the model
|
||||
model.bprop(cost_grad)
|
||||
|
||||
#update the model, here we iterate over layers
|
||||
#and then over each parameter in the layer
|
||||
effective_learning_rate = learning_rate / x.shape[0]
|
||||
|
||||
for i in xrange(0, len(model.layers)):
|
||||
params = model.layers[i].get_params()
|
||||
grads = model.layers[i].pgrads(model.activations[i], model.deltas[i + 1])
|
||||
uparams = []
|
||||
for param, grad in zip(params, grads):
|
||||
param = param - effective_learning_rate * grad
|
||||
uparams.append(param)
|
||||
model.layers[i].set_params(uparams)
|
||||
|
||||
nll_list.append(cost)
|
||||
acc_list.append(numpy.mean(self.classification_accuracy(y, t)))
|
||||
|
||||
return numpy.mean(nll_list), numpy.mean(acc_list)
|
||||
|
||||
def train(self, model, train_iterator, valid_iterator=None):
|
||||
|
||||
converged = False
|
||||
cost_name = model.cost.get_name()
|
||||
tr_stats, valid_stats = [], []
|
||||
|
||||
# do the initial validation
|
||||
tr_nll, tr_acc = self.validate(model, train_iterator)
|
||||
logger.info('Epoch %i: Training cost (%s) for random model is %.3f. Accuracy is %.2f%%'
|
||||
% (self.lr_scheduler.epoch, cost_name, tr_nll, tr_acc * 100.))
|
||||
tr_stats.append((tr_nll, tr_acc))
|
||||
|
||||
if valid_iterator is not None:
|
||||
valid_iterator.reset()
|
||||
valid_nll, valid_acc = self.validate(model, valid_iterator)
|
||||
logger.info('Epoch %i: Validation cost (%s) for random model is %.3f. Accuracy is %.2f%%'
|
||||
% (self.lr_scheduler.epoch, cost_name, valid_nll, valid_acc * 100.))
|
||||
valid_stats.append((valid_nll, valid_acc))
|
||||
|
||||
while not converged:
|
||||
train_iterator.reset()
|
||||
|
||||
tstart = time.clock()
|
||||
tr_nll, tr_acc = self.train_epoch(model=model,
|
||||
train_iterator=train_iterator,
|
||||
learning_rate=self.lr_scheduler.get_rate())
|
||||
tstop = time.clock()
|
||||
tr_stats.append((tr_nll, tr_acc))
|
||||
|
||||
logger.info('Epoch %i: Training cost (%s) is %.3f. Accuracy is %.2f%%'
|
||||
% (self.lr_scheduler.epoch + 1, cost_name, tr_nll, tr_acc * 100.))
|
||||
|
||||
vstart = time.clock()
|
||||
if valid_iterator is not None:
|
||||
valid_iterator.reset()
|
||||
valid_nll, valid_acc = self.validate(model, valid_iterator)
|
||||
logger.info('Epoch %i: Validation cost (%s) is %.3f. Accuracy is %.2f%%'
|
||||
% (self.lr_scheduler.epoch + 1, cost_name, valid_nll, valid_acc * 100.))
|
||||
self.lr_scheduler.get_next_rate(valid_acc)
|
||||
valid_stats.append((valid_nll, valid_acc))
|
||||
else:
|
||||
self.lr_scheduler.get_next_rate(None)
|
||||
vstop = time.clock()
|
||||
|
||||
train_speed = train_iterator.num_examples() / (tstop - tstart)
|
||||
valid_speed = valid_iterator.num_examples() / (vstop - vstart)
|
||||
tot_time = vstop - tstart
|
||||
#pps = presentations per second
|
||||
logger.info("Epoch %i: Took %.0f seconds. Training speed %.0f pps. "
|
||||
"Validation speed %.0f pps."
|
||||
% (self.lr_scheduler.epoch, tot_time, train_speed, valid_speed))
|
||||
|
||||
# we stop training when learning rate, as returned by lr scheduler, is 0
|
||||
# this is implementation dependent and depending on lr schedule could happen,
|
||||
# for example, when max_epochs has been reached or if the progress between
|
||||
# two consecutive epochs is too small, etc.
|
||||
converged = (self.lr_scheduler.get_rate() == 0)
|
||||
|
||||
return tr_stats, valid_stats
|
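A minimal illustration of the parameter update performed inside `train_epoch` above. Note that the learning rate is divided by the mini-batch size (`x.shape[0]`), so the update effectively uses the mean rather than the sum of the per-example gradients; plain numpy arrays stand in here for a layer's parameters and gradients:

```
import numpy

learning_rate, batch_size = 0.01, 100
W = numpy.zeros((784, 10))        # stands in for one layer parameter from get_params()
grad_W = numpy.ones((784, 10))    # stands in for the pgrads() output (summed over the batch)

effective_learning_rate = learning_rate / batch_size
W = W - effective_learning_rate * grad_W   # the update then applied via set_params()

print W[0, 0]                     # -0.0001
```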
155
mlp/schedulers.py
Normal file
155
mlp/schedulers.py
Normal file
@ -0,0 +1,155 @@
|
||||
# Machine Learning Practical (INFR11119),
|
||||
# Pawel Swietojanski, University of Edinburgh
|
||||
|
||||
import logging
|
||||
|
||||
|
||||
class LearningRateScheduler(object):
|
||||
"""
|
||||
Define an interface for determining learning rates
|
||||
"""
|
||||
def __init__(self, max_epochs=100):
|
||||
self.epoch = 0
|
||||
self.max_epochs = max_epochs
|
||||
|
||||
def get_rate(self):
|
||||
raise NotImplementedError()
|
||||
|
||||
def get_next_rate(self, current_error=None):
|
||||
self.epoch += 1
|
||||
|
||||
|
||||
class LearningRateList(LearningRateScheduler):
|
||||
def __init__(self, learning_rates_list, max_epochs):
|
||||
|
||||
super(LearningRateList, self).__init__(max_epochs)
|
||||
|
||||
assert isinstance(learning_rates_list, list), (
|
||||
"The learning_rates_list argument expected"
|
||||
" to be of type list, got %s" % type(learning_rates_list)
|
||||
)
|
||||
self.lr_list = learning_rates_list
|
||||
|
||||
def get_rate(self):
|
||||
if self.epoch < len(self.lr_list):
|
||||
return self.lr_list[self.epoch]
|
||||
return 0.0
|
||||
|
||||
def get_next_rate(self, current_error=None):
|
||||
super(LearningRateList, self).get_next_rate(current_error=None)
|
||||
return self.get_rate()
|
||||
|
||||
|
||||
class LearningRateFixed(LearningRateList):
|
||||
|
||||
def __init__(self, learning_rate, max_epochs):
|
||||
assert learning_rate > 0, (
|
||||
"learning rate expected to be > 0, got %f" % learning_rate
|
||||
)
|
||||
super(LearningRateFixed, self).__init__([learning_rate], max_epochs)
|
||||
|
||||
def get_rate(self):
|
||||
if self.epoch < self.max_epochs:
|
||||
return self.lr_list[0]
|
||||
return 0.0
|
||||
|
||||
def get_next_rate(self, current_error=None):
|
||||
super(LearningRateFixed, self).get_next_rate(current_error=None)
|
||||
return self.get_rate()
|
||||
|
||||
|
||||
class LearningRateNewBob(LearningRateScheduler):
|
||||
"""
|
||||
Exponential learning rate schedule
|
||||
"""
|
||||
|
||||
def __init__(self, start_rate, scale_by=.5, max_epochs=99, \
|
||||
min_derror_ramp_start=.5, min_derror_stop=.5, init_error=100.0, \
|
||||
patience=0, zero_rate=None, ramping=False):
|
||||
"""
|
||||
:type start_rate: float
|
||||
:param start_rate:
|
||||
|
||||
:type scale_by: float
|
||||
:param scale_by:
|
||||
|
||||
:type max_epochs: int
|
||||
:param max_epochs:
|
||||
|
||||
:type min_derror_ramp_start: float
|
||||
:param min_derror_ramp_start:
|
||||
|
||||
:type min_derror_stop: float
|
||||
:param min_derror_stop:
|
||||
|
||||
:type init_error: float
|
||||
:param init_error:
|
||||
|
||||
"""
|
||||
self.start_rate = start_rate
|
||||
self.init_error = init_error
|
||||
self.init_patience = patience
|
||||
|
||||
self.rate = start_rate
|
||||
self.scale_by = scale_by
|
||||
self.max_epochs = max_epochs
|
||||
self.min_derror_ramp_start = min_derror_ramp_start
|
||||
self.min_derror_stop = min_derror_stop
|
||||
self.lowest_error = init_error
|
||||
|
||||
self.epoch = 1
|
||||
self.ramping = ramping
|
||||
self.patience = patience
|
||||
self.zero_rate = zero_rate
|
||||
|
||||
def reset(self):
|
||||
self.rate = self.start_rate
|
||||
self.lowest_error = self.init_error
|
||||
self.epoch = 1
|
||||
self.ramping = False
|
||||
self.patience = self.init_patience
|
||||
|
||||
def get_rate(self):
|
||||
if (self.epoch==1 and self.zero_rate!=None):
|
||||
return self.zero_rate
|
||||
return self.rate
|
||||
|
||||
def get_next_rate(self, current_error):
|
||||
"""
|
||||
:type current_error: float
|
||||
:param current_error: percentage error
|
||||
|
||||
"""
|
||||
|
||||
diff_error = 0.0
|
||||
|
||||
if ( (self.max_epochs > 10000) or (self.epoch >= self.max_epochs) ):
|
||||
#logging.debug('Setting rate to 0.0. max_epochs or epoch>=max_epochs')
|
||||
self.rate = 0.0
|
||||
else:
|
||||
diff_error = self.lowest_error - current_error
|
||||
|
||||
if (current_error < self.lowest_error):
|
||||
self.lowest_error = current_error
|
||||
|
||||
if (self.ramping):
|
||||
if (diff_error < self.min_derror_stop):
|
||||
if (self.patience > 0):
|
||||
#logging.debug('Patience decreased to %f' % self.patience)
|
||||
self.patience -= 1
|
||||
self.rate *= self.scale_by
|
||||
else:
|
||||
#logging.debug('diff_error (%f) < min_derror_stop (%f)' % (diff_error, self.min_derror_stop))
|
||||
self.rate = 0.0
|
||||
else:
|
||||
self.rate *= self.scale_by
|
||||
else:
|
||||
if (diff_error < self.min_derror_ramp_start):
|
||||
#logging.debug('Start ramping.')
|
||||
self.ramping = True
|
||||
self.rate *= self.scale_by
|
||||
|
||||
self.epoch += 1
|
||||
|
||||
return self.rate
|
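A minimal illustration of how `LearningRateFixed` above interacts with the optimiser's stopping criterion: the rate stays constant for `max_epochs` epochs and then drops to 0.0, which `SGDOptimiser.train` treats as convergence:

```
from mlp.schedulers import LearningRateFixed

lr_scheduler = LearningRateFixed(learning_rate=0.5, max_epochs=2)
print lr_scheduler.get_rate()       # 0.5 (epoch 0)
print lr_scheduler.get_next_rate()  # 0.5 (epoch 1, still below max_epochs)
print lr_scheduler.get_next_rate()  # 0.0 (epoch 2 == max_epochs, training stops)
```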
2030
res/code_scheme.svg
Normal file
2030
res/code_scheme.svg
Normal file
File diff suppressed because it is too large