This commit is contained in:
pswietojanski 2015-10-05 09:07:04 +01:00
parent c71d6973f6
commit e5ffdfeb60
9 changed files with 652 additions and 169 deletions

File diff suppressed because one or more lines are too long

View File

@ -8,7 +8,7 @@
"\n",
"This tutorial is about linear transforms - a basic building block of many, including deep learning, models.\n",
"\n",
"# Short recap and syncing repositories\n",
"# Virtual environments and syncing repositories\n",
"\n",
"Before you proceed onwards, remember to activate you virtual environments so you can use the software you installed last week as well as run the notebooks in interactive mode, no through github.com website.\n",
"\n",
@ -22,134 +22,408 @@
"\n",
"## On Synchronising repositories\n",
"\n",
"I started writing this, but do not think giving students a choice is a good way to progess, the most painless way to follow would be to ask them to stash their changes (with some meaningful message) and work on the clean updated repository. This way one can always (temporarily) recover the work once needed but everyone starts smoothly the next lab. We do not want to help anyone how to resolve the conflicts...\n",
"\n",
"Enter your git mlp repository you set up last week (i.e. ~/mlpractical/repo-mlp) and depending on how you want to proceed you either can:\n",
" 1. Overridde some changes you have made (both in the notebooks and/or in the code if you happen to modify parts that were updated by us) with the code we have provided for this lab\n",
" 2. Try to merge your code with ours (for example, if you want to use `MetOfficeDataProvider` you have written)\n",
" \n",
"Our recommendation is, you should at least keep the progress in the notebooks (so you can peek some details when needed)\n",
"Enter your git mlp repository you set up last week (i.e. ~/mlpractical/repo-mlp) and once you synced the repository (in one of the two below ways), start the notebook session by typing:\n",
"\n",
"```\n",
"git pull\n",
"ipython notebook\n",
"```\n",
"\n",
"## Default Synchronising Strategy\n",
"### Default way\n",
"\n",
"Need to think/discuss this."
"To avoid potential conflicts between the changes you have made since last week and our additions, we recommend `stash` your changes and `pull` the new code from the mlpractical repository by typing:\n",
"\n",
"1. `git stash save \"my 1st lab work\"`\n",
"2. `git pull`\n",
"\n",
"Then, once you need you can always (temporaily) restore a desired state of the repository.\n",
"\n",
"### For advanced github users\n",
"\n",
"It is OK if you want to keep your changes and merge the new code with whatever you already have, but you need to know what you are doing and how to resolve conflicts.\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linear and Affine Transforms\n",
"# Single Layer Models\n",
"\n",
"Depending on the required level of details, one may need to. The basis of all linear models is so called affine transform, that is the transform that implements some (linear) rotation of some input points and shift (translation) them. Denote by $\\vec x$ some input vector, then the affine transform is defined as follows:\n",
"***\n",
"### Note on storing matrices in computer memory\n",
"\n",
"Consider you want to store the following array in memory: $\\left[ \\begin{array}{ccc}\n",
"1 & 2 & 3 \\\\\n",
"4 & 5 & 6 \\\\\n",
"7 & 8 & 9 \\end{array} \\right]$ \n",
"\n",
"In computer memory the above matrix would be organised as a vector in either (assume you allocate the memory at once for the whole matrix):\n",
"\n",
"* Row-wise layout where the order would look like: $\\left [ \\begin{array}{ccccccccc}\n",
"1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\end{array} \\right ]$\n",
"* Column-wise layout where the order would look like: $\\left [ \\begin{array}{ccccccccc}\n",
"1 & 4 & 7 & 2 & 5 & 8 & 3 & 6 & 9 \\end{array} \\right ]$\n",
"\n",
"Although `numpy` can easily handle both formats (possibly with some computational overhead), in our code we will stick with modern (and default) `c`-like approach and use row-wise format (contrary to Fortran that used column-wise approach). \n",
"\n",
"This means, that in this tutorial:\n",
"* vectors are kept row-wise $\\mathbf{x} = (x_1, x_1, \\ldots, x_D) $ (rather than $\\mathbf{x} = (x_1, x_1, \\ldots, x_D)^T$)\n",
"* similarly, in case of matrices we will stick to: $\\left[ \\begin{array}{cccc}\n",
"x_{11} & x_{12} & \\ldots & x_{1D} \\\\\n",
"x_{21} & x_{22} & \\ldots & x_{2D} \\\\\n",
"x_{31} & x_{32} & \\ldots & x_{3D} \\\\ \\end{array} \\right]$ and each row (i.e. $\\left[ \\begin{array}{cccc} x_{11} & x_{12} & \\ldots & x_{1D} \\end{array} \\right]$) represents a single data-point (like one MNIST image or one window of observations)\n",
"\n",
"In lecture slides you will find the equations following the conventional mathematical column-wise approach, but you can easily map them one way or the other using using matrix transpose.\n",
"\n",
"***\n",
"\n",
"## Linear and Affine Transforms\n",
"\n",
"The basis of all linear models is so called affine transform, that is a transform that implements some linear transformation and translation of input features. The transforms we are going to use are parameterised by:\n",
"\n",
" * Weight matrix $\\mathbf{W} \\in \\mathbb{R}^{D\\times K}$: where element $w_{ik}$ is the weight from input $x_i$ to output $y_k$\n",
" * Bias vector $\\mathbf{b}\\in R^{K}$ : where element $b_{k}$ is the bias for output $k$\n",
"\n",
"Note, the bias is simply some additve term, and can be easily incorporated into an additional row in weight matrix and an additinal input in the inputs which is set to $1.0$ (as in the below picture taken from the lecture slides). However, here (and in the code) we will keep them separate.\n",
"\n",
"![Making Predictions](res/singleLayerNetWts-1.png)\n",
"\n",
"$\n",
"For instance, for the above example of 5-dimensional input vector by $\\mathbf{x} = (x_1, x_2, x_3, x_4, x_5)$, weight matrix $\\mathbf{W}=\\left[ \\begin{array}{ccc}\n",
"w_{11} & w_{12} & w_{13} \\\\\n",
"w_{21} & w_{22} & w_{23} \\\\\n",
"w_{31} & w_{32} & w_{33} \\\\\n",
"w_{41} & w_{42} & w_{43} \\\\\n",
"w_{51} & x_{52} & 2_{53} \\\\ \\end{array} \\right]$, bias vector $\\mathbf{b} = (b_1, b_2, b_3)$ and outputs $\\mathbf{y} = (y_1, y_2, y_3)$, one can write the transformation as follows:\n",
"\n",
"(for the $i$-th output)\n",
"\n",
"(1) $\n",
"\\begin{equation}\n",
" \\mathbf y=\\mathbf W \\mathbf x + \\mathbf b\n",
" y_i = b_i + \\sum_j x_jw_{ji}\n",
"\\end{equation}\n",
"$\n",
"\n",
"<b>Note:</b> the bias term can be incorporated as an additional column in the weight matrix, though in this tutorials we will use a separate variable to for this purpose.\n",
"or the equivalent vector form (where $\\mathbf w_i$ is the $i$-th column of $\\mathbf W$):\n",
"\n",
"An $i$th element of vecotr $\\mathbf y$ is hence computed as:\n",
"\n",
"$\n",
"(2) $\n",
"\\begin{equation}\n",
" y_i=\\mathbf w_i \\mathbf x + b_i\n",
" y_i = b_i + \\mathbf x \\mathbf w_i^T\n",
"\\end{equation}\n",
"$\n",
"\n",
"where $\\mathbf w_i$ is the $i$th row of $\\mathbf W$\n",
"The same operation can be also written in matrix form, to compute all the outputs $\\mathbf{y}$ at the same time:\n",
"\n",
"$\n",
"(3) $\n",
"\\begin{equation}\n",
" y_i=\\sum_j w_{ji}x_j + b_i\n",
" \\mathbf y=\\mathbf x\\mathbf W + \\mathbf b\n",
"\\end{equation}\n",
"$\n",
"\n",
"???\n",
"\n"
"When $\\mathbf{x}$ is a mini-batch (contains $B$ data-points of dimension $D$ each), i.e. $\\left[ \\begin{array}{cccc}\n",
"x_{11} & x_{12} & \\ldots & x_{1D} \\\\\n",
"x_{21} & x_{22} & \\ldots & x_{2D} \\\\\n",
"\\cdots \\\\\n",
"x_{B1} & x_{B2} & \\ldots & x_{BD} \\\\ \\end{array} \\right]$ equation (3) effectively becomes to be\n",
"\n",
"(4) $\n",
"\\begin{equation}\n",
" \\mathbf Y=\\mathbf X\\mathbf W + \\mathbf b\n",
"\\end{equation}\n",
"$\n",
"\n",
"where both $\\mathbf{X}\\in\\mathbb{R}^{B\\times D}$ and $\\mathbf{Y}\\in\\mathbb{R}^{B\\times K}$ are matrices, and $\\mathbf{b}$ needs to be <a href=\"http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html\">broadcased</a> $B$ times (numpy will do this by default). However, we will not make an explicit distinction between a special case for $B=1$ and $B>1$ and simply use equation (3) instead, although $\\mathbf{x}$ and hence $\\mathbf{y}$ could be matrices. From implementation point of view, it does not matter.\n",
"\n",
"The desired functionality for matrix multiplication in numpy is provided by <a href=\"http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html\">numpy.dot</a> function. If you haven't use it so far, get familiar with it as we will use it extensively."
]
},
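{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the shapes involved in equation (4) (this is only an illustration with made-up numbers, not part of the exercises), note how `numpy.dot` and broadcasting behave:\n",
"\n",
"```python\n",
"import numpy\n",
"\n",
"X = numpy.ones((2, 5))        # a mini-batch of B=2 data-points, D=5 features each\n",
"W = numpy.ones((5, 3))        # weight matrix, D=5 inputs to K=3 outputs\n",
"b = numpy.array([0.1, 0.2, 0.3])\n",
"\n",
"Y = numpy.dot(X, W) + b       # b is broadcast across the B rows\n",
"print Y.shape                 # (2, 3), i.e. B x K\n",
"```"
]
},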
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Note on random number generators (could move it somewhere else)\n",
"\n",
"It is generally a good practice (for machine learning applications **not** cryptography!) to seed a pseudo-random number generator once at the beginning of the experiment, and use it later through the code where necesarry. This allows to avoid hard to reproduce scenariors where a particular action happens only for a particular sequence of numbers (which you cannot reproduce easily due to unknown random seeds sequence on the way!). As such, within this course we are going use a single random generator object. For instance, the one similar to the below:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0.06875593 -0.69616488 0.08823301 0.34533413 -0.22129962]\n"
]
}
],
"outputs": [],
"source": [
"import numpy\n",
"x=numpy.random.uniform(-1,1,(4,)); \n",
"W=numpy.random.uniform(-1,1,(5,4)); \n",
"y=numpy.dot(W,x);\n",
"print y"
"\n",
"#initialise the random generator to be used later\n",
"seed=[2015, 10, 1]\n",
"random_generator = numpy.random.RandomState(seed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1 \n",
"\n",
"Using numpy.dot, implement **forward** propagation through the linear transform defined by equations (3) and (4) for $B=1$ and $B>1$. As data ($\\mathbf{x}$) use `MNISTDataProvider` from previous laboratories. For case when $B=1$ write a function to compute the 1st output ($y_1$) using equations (1) and (2). Check if the output is the same as the corresponding one obtained with numpy. \n",
"\n",
"Tip: To generate random data you can use `random_generator.uniform(-0.1, 0.1, (D, 10))` from the preceeding cell."
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.63711 0.11566944 0.74416104]\n",
" [-0.01335825 0.46206922 -0.1109265 ]\n",
" [-0.37523063 -0.06755371 0.04352121]\n",
" [ 0.25885831 -0.53660826 -0.40905639]]\n"
]
}
],
"outputs": [],
"source": [
"def my_dot(x, W, b):\n",
" y = numpy.zeros_like((x.shape[0], W.shape[1]))\n",
"from mlp.dataset import MNISTDataProvider\n",
"\n",
"mnist_dp = MNISTDataProvider(dset='valid', batch_size=3, max_num_batches=1, randomize=False)\n",
"\n",
"irange = 0.1\n",
"W = random_generator.uniform(-irange, irange, (784,10)) \n",
"b = numpy.zeros((10,))\n"
]
},
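{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before implementing the equations, it may help to check what the data provider actually yields. This is an optional sanity check; the exact shapes below are an assumption based on the setup above, so verify them yourself:\n",
"\n",
"```python\n",
"for x, t in mnist_dp:\n",
"    print x.shape, t.shape   # for this setup, something like (3, 784) and (3, 10)\n",
"mnist_dp.reset()\n",
"```"
]
},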
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"\n",
"mnist_dp.reset()\n",
"\n",
"#implement following functions, then run the cell\n",
"def y1_equation_1(x, W, b):\n",
" raise NotImplementedError()\n",
" \n",
"def y1_equation_2(x, W, b):\n",
" raise NotImplementedError()\n",
"\n",
"def y_equation_3(x, W, b):\n",
" #use numpy.dot\n",
" raise NotImplementedError()\n",
"\n",
"def y_equation_4(x, W, b):\n",
" #use numpy.dot\n",
" raise NotImplementedError()\n",
"\n",
"for x, t in mnist_dp:\n",
" y1e1 = y1_equation_1(x[0], W, b)\n",
" y1e2 = y1_equation_2(x[0], W, b)\n",
" ye3 = y_equation_3(x, W, b)\n",
" ye4 = y_equation_4(x, W, b)\n",
"\n",
"print 'y1e1', y1e1\n",
"print 'y1e1', y1e1\n",
"print 'ye3', ye3\n",
"print 'ye4', ye4\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Exercise 2\n",
"\n",
"Modify (if necessary) examples from Exercise 1 to perform **backward** propagation, that is, given $\\mathbf{y}$ (obtained in previous step) and weight matrix $\\mathbf{W}$, project $\\mathbf{y}$ onto the input space $\\mathbf{x}$ (ignore or set to zero the biases towards $\\mathbf{x}$ in backward pass). Mathematically, we are interested in the following transformation: $\\mathbf{z}=\\mathbf{y}\\mathbf{W}^T$"
]
},
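{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want something to compare your answer against, a minimal sketch of the backward projection (assuming `y` and `W` are the arrays from Exercise 1) is just the matrix product from the formula above:\n",
"\n",
"```python\n",
"z = numpy.dot(y, W.T)   # project y (B x K) back onto the input space (B x D)\n",
"```"
]
},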
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"## Exercise 3 (optional)\n",
"\n",
"In case you do not fully understand how matrix-vector and/or matrix-matrix products work, consider implementing `my_dot_mat_mat` function (you have been given `my_dot_vec_mat` code to look at as an example) which takes as the input the following arguments:\n",
"\n",
"* D-dimensional input vector $\\mathbf{x} = (x_1, x_2, \\ldots, x_D) $.\n",
"* Weight matrix $\\mathbf{W}\\in\\mathbb{R}^{D\\times K}$:\n",
"\n",
"and returns:\n",
"\n",
"* K-dimensional output vector $\\mathbf{y} = (y_1, \\ldots, y_K) $\n",
"\n",
"Your job is to write a variant that works in a mini-batch mode where both $\\mathbf{x}\\in\\mathbb{R}^{B\\times D}$ and $\\mathbf{y}\\in\\mathbb{R}^{B\\times K}$ are matrices in which each rows contain one of $B$ data-points from mini-batch (rather than $\\mathbf{x}\\in\\mathbb{R}^{D}$ and $\\mathbf{y}\\in\\mathbb{R}^{K}$)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def my_dot_vec_mat(x, W):\n",
" J = x.shape[0]\n",
" K = W.shape[1]\n",
" assert (J == W.shape[0]), (\n",
" \"Number of columns of x expected to \"\n",
" \" to be equal to the number of rows in \"\n",
" \"W, bot got shapes %s, %s\" % (x.shape, W.shape)\n",
" )\n",
" y = numpy.zeros((K,))\n",
" for k in xrange(0, K):\n",
" for j in xrange(0, J):\n",
" y[k] += x[j] * W[j,k]\n",
" \n",
" return y"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"irange = 0.1 #+-range from which we draw the random numbers\n",
"\n",
"x = random_generator.uniform(-irange, irange, (5,)) \n",
"W = random_generator.uniform(-irange, irange, (5,3)) \n",
"\n",
"y_my = my_dot_vec_mat(x, W)\n",
"y_np = numpy.dot(x, W)\n",
"\n",
"same = numpy.allclose(y_my, y_np)\n",
"\n",
"if same:\n",
" print 'Well done!'\n",
"else:\n",
" print 'Matrices are different:'\n",
" print 'y_my is: ', y_my\n",
" print 'y_np is: ', y_np"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def my_dot_mat_mat(x, W):\n",
" I = x.shape[0]\n",
" J = x.shape[1]\n",
" K = W.shape[1]\n",
" assert (J == W.shape[0]), (\n",
" \"Number of columns in of x expected to \"\n",
" \" to be the same as rows in W, got\"\n",
" )\n",
" #allocate the output container\n",
" y = numpy.zeros((I, K))\n",
" \n",
" #implement here matrix-matrix inner product here\n",
" raise NotImplementedError('Write me!')\n",
" \n",
" return y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
"source": [
"Test whether you get comparable numbers to what numpy is producing:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0 1 2 3 4 5 6 7 8 9 10]\n"
]
}
],
"outputs": [],
"source": [
"irange = 0.1 #+-range from which we draw the random numbers\n",
"\n",
"for itr in xrange(0,100):\n",
" my_dot(W,x)\n",
" \n"
"x = random_generator.uniform(-irange, irange, (2,5)) \n",
"W = random_generator.uniform(-irange, irange, (5,3)) \n",
"\n",
"y_my = my_dot_mat_mat(x, W)\n",
"y_np = numpy.dot(x, W)\n",
"\n",
"same = numpy.allclose(y_my, y_np)\n",
"\n",
"if same:\n",
" print 'Well done!'\n",
"else:\n",
" print 'Matrices are different:'\n",
" print 'y_my is: ', y_my\n",
" print 'y_np is: ', y_np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we benchmark each approach (we do it in separate cells, as timeit currently can measure whole cell execuiton only)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#generate bit bigger matrices, to better evaluate timings\n",
"x = random_generator.uniform(-irange, irange, (10, 1000))\n",
"W = random_generator.uniform(-irange, irange, (1000, 100))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print 'my_dot timings:'\n",
"%timeit -n10 my_dot_mat_mat(x, W)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print 'numpy.dot timings:'\n",
"%timeit -n10 numpy.dot(x, W)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Optional section ends here**\n",
"***"
]
},
{
@ -158,93 +432,212 @@
"source": [
"# Iterative learning of linear models\n",
"\n",
"We will learn the model with (batch for now) gradient descent.\n",
"We will learn the model with stochastic gradient descent using mean square error (MSE) loss function, which is defined as follows:\n",
"\n",
"\n",
"## Running example\n",
"\n",
"![Making Predictions](res/singleLayerNetPredict.png)\n",
" \n",
"\n",
" * Input vector $\\mathbf{x} = (x_1, x_1, \\ldots, x_d)^T $\n",
" * Output vector $\\mathbf{y} = (y_1, \\ldots, y_K)^T $\n",
" * Weight matrix $\\mathbf{W}$: $w_{ki}$ is the weight from input $x_i$ to output $y_k$\n",
" * Bias $w_{k0}$ is the bias for output $k$\n",
" * Targets vector $\\mathbf{t} = (t_1, \\ldots, t_K)^T $\n",
"\n",
"\n",
"$\n",
" y_k = \\sum_{i=1}^d w_{ki} x_i + w_{k0}\n",
"$\n",
"\n",
"If we define $x_0=1$ we can simplify the above to\n",
"\n",
"$\n",
" y_k = \\sum_{i=0}^d w_{ki} x_i \\quad ; \\quad \\mathbf{y} = \\mathbf{Wx}\n",
"$\n",
"\n",
"$\n",
"(5) $\n",
"E = \\frac{1}{2} \\sum_{n=1}^N ||\\mathbf{y}^n - \\mathbf{t}^n||^2 = \\sum_{n=1}^N E^n \\\\\n",
" E^n = \\frac{1}{2} ||\\mathbf{y}^n - \\mathbf{t}^n||^2\n",
"$\n",
"\n",
" $ E^n = \\frac{1}{2} \\sum_{k=1}^K (y_k^n - t_k^n)^2 $\n",
" set $\\mathbf{W}$ to minimise $E$ given the training set\n",
"(6) $ E^n = \\frac{1}{2} \\sum_{k=1}^K (y_k^n - t_k^n)^2 $\n",
" \n",
"$\n",
" E^n = \\frac{1}{2} \\sum_{k=1}^K (y^n_k - t^n_k)^2 \n",
" = \\frac{1}{2} \\sum_{k=1}^K \\left( \\sum_{i=0}^d w_{ki} x^n_i - t^n_k \\right)^2 \\\\\n",
" \\pderiv{E^n}{w_{rs}} = (y^n_r - t^n_r)x_s^n = \\delta^n_r x_s^n \\quad ; \\quad\n",
" \\delta^n_r = y^n_r - t^n_r \\\\\n",
" \\pderiv{E}{w_{rs}} = \\sum_{n=1}^N \\pderiv{E^n}{w_{rs}} = \\sum_{n=1}^N \\delta^n_r x_s^n\n",
"Hence, the gradient w.r.t (with respect to) the $r$ output y of the model is defined as, so called delta function, $\\delta_r$: \n",
"\n",
"(8) $\\frac{\\partial{E^n}}{\\partial{y_{r}}} = (y^n_r - t^n_r) = \\delta^n_r \\quad ; \\quad\n",
" \\delta^n_r = y^n_r - t^n_r \n",
"$\n",
"\n",
"Similarly, using the above $\\delta^n_r$ one can express the gradient of the weight $w_{sr}$ (from the s-th input to the r-th output) for linear model and MSE cost as follows:\n",
"\n",
"\\begin{algorithmic}[1]\n",
" \\Procedure{gradientDescentTraining}{$\\mvec{X}, \\mvec{T},\n",
" \\mvec{W}$}\n",
" \\State initialize $\\mvec{W}$ to small random numbers\n",
"% \\State randomize order of training examples in $\\mvec{X}\n",
" \\While{not converged}\n",
" \\State for all $k,i$: $\\Delta w_{ki} \\gets 0$\n",
" \\For{$n \\gets 1,N$}\n",
" \\For{$k \\gets 1,K$}\n",
" \\State $y_k^n \\gets \\sum_{i=0}^d w_{ki} x_{ki}^n$\n",
" \\State $\\delta_k^n \\gets y_k^n - t_k^n$\n",
" \\For{$i \\gets 1,d$}\n",
" \\State $\\Delta w_{ki} \\gets \\Delta w_{ki} + \\delta_k^n \\cdot x_i^n$\n",
" \\EndFor\n",
" \\EndFor\n",
" \\EndFor\n",
" \\State for all $k,i$: $w_{ki} \\gets w_{ki} - \\eta \\cdot \\Delta w_{ki}$\n",
" \\EndWhile\n",
" \\EndProcedure\n",
"\\end{algorithmic}"
"(9) $\n",
" \\frac{\\partial{E^n}}{\\partial{w_{sr}}} = (y^n_r - t^n_r)x_s^n = \\delta^n_r x_s^n \\quad\\\\\n",
" \\frac{\\partial{E}}{\\partial{w_{sr}}} = \\sum_{n=1}^N \\frac{\\partial{E^n}}{\\partial{w_{rs}}} = \\sum_{n=1}^N \\delta^n_r x_s^n\n",
"$\n",
"\n",
"and the gradient for bias parameter at the $r$-th output is:\n",
"\n",
"(10) $\n",
" \\frac{\\partial{E}}{\\partial{b_{r}}} = \\sum_{n=1}^N \\frac{\\partial{E^n}}{\\partial{b_{r}}} = \\sum_{n=1}^N \\delta^n_r\n",
"$"
]
},
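{
"cell_type": "markdown",
"metadata": {},
"source": [
"A useful habit when deriving gradients like (9) and (10) is to verify them numerically with finite differences. The sketch below is only an illustration on a made-up, single data-point toy model (it is not part of the exercises, and the sizes are arbitrary):\n",
"\n",
"```python\n",
"import numpy\n",
"\n",
"rng = numpy.random.RandomState(0)\n",
"x = rng.uniform(-1, 1, (3,))       # one data-point, D=3\n",
"t = rng.uniform(-1, 1, (2,))       # target, K=2\n",
"W = rng.uniform(-0.1, 0.1, (3, 2))\n",
"b = numpy.zeros((2,))\n",
"\n",
"def mse(W, b):\n",
"    y = numpy.dot(x, W) + b              # equation (3)\n",
"    return 0.5 * numpy.sum((y - t)**2)   # equation (6)\n",
"\n",
"delta = numpy.dot(x, W) + b - t          # equation (8)\n",
"grad_W = numpy.outer(x, delta)           # equation (9): dE/dw_sr = delta_r * x_s\n",
"grad_b = delta                           # equation (10)\n",
"\n",
"# numerical check of grad_W with central differences\n",
"eps = 1e-6\n",
"num_grad_W = numpy.zeros_like(W)\n",
"for s in xrange(W.shape[0]):\n",
"    for r in xrange(W.shape[1]):\n",
"        Wp = W.copy(); Wp[s, r] += eps\n",
"        Wm = W.copy(); Wm[s, r] -= eps\n",
"        num_grad_W[s, r] = (mse(Wp, b) - mse(Wm, b)) / (2 * eps)\n",
"\n",
"print numpy.allclose(grad_W, num_grad_W)  # expect True\n",
"```"
]
},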
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Excercises"
"\n",
"![Making Predictions](res/singleLayerNetPredict.png)\n",
" \n",
" * Input vector $\\mathbf{x} = (x_1, x_2, \\ldots, x_D) $\n",
" * Output scalar $y_1$\n",
" * Weight matrix $\\mathbf{W}$: $w_{ik}$ is the weight from input $x_i$ to output $y_k$. Note, here this is really a vector since a single scalar output, y_1.\n",
" * Scalar bias $b$ for the only output in our model \n",
" * Scalar target $t$ for the only output in out model\n",
" \n",
"First, ensure you can make use of data provider (note, for training data has been normalised to zero mean and unit variance, hence different effective range than one can find in file):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from mlp.dataset import MetOfficeDataProvider\n",
"\n",
"modp = MetOfficeDataProvider(10, batch_size=10, max_num_batches=2, randomize=False)\n",
"\n",
"%precision 2\n",
"for x, t in modp:\n",
" print 'Observations: ', x\n",
" print 'To predict: ', t"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
"source": [
"## Exercise 4\n",
"\n",
"The below code implements a very simple variant of stochastic gradient descent for the weather regression example. Your task is to implement 5 functions in the next cell and then run two next cells that 1) build sgd functions and 2) run the actual training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"\n",
"#When implementing those, take into account the mini-batch case, for which one is\n",
"#expected to sum the errors for each example\n",
"\n",
"def fprop(x, W, b):\n",
" #code implementing eq. (3)\n",
" #return: y\n",
" raise NotImplementedError('Write me!')\n",
"\n",
"def cost(y, t):\n",
" #Mean Square Error cost, equation (5)\n",
" raise NotImplementedError('Write me!')\n",
"\n",
"def cost_grad(y, t):\n",
" #Gradient of the cost w.r.t y equation (8)\n",
" raise NotImplementedError('Write me!')\n",
"\n",
"def cost_wrt_W(cost_grad, x):\n",
" #Gradient of the cost w.r.t W, equation (9)\n",
" raise NotImplementedError('Write me!')\n",
" \n",
"def cost_wrt_b(cost_grad):\n",
" #Gradient of the cost w.r.t to b, equation (10)\n",
" raise NotImplementedError('Write me!')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"\n",
"def sgd_epoch(data_provider, W, b, learning_rate):\n",
" mse_stats = []\n",
" \n",
" #get the minibatch of data\n",
" for x, t in data_provider:\n",
" \n",
" #1. get the estimate of y\n",
" y = fprop(x, W, b)\n",
"\n",
" #2. compute the loss function\n",
" tmp = cost(y, t)\n",
" mse_stats.append(tmp)\n",
" \n",
" #3. compute the grad of the cost w.r.t the output layer activation y\n",
" #i.e. how the cost changes when output y changes\n",
" cost_grad_deltas = cost_grad(y, t)\n",
"\n",
" #4. compute the gradients w.r.t model's parameters\n",
" grad_W = cost_wrt_W(cost_grad_deltas, x)\n",
" grad_b = cost_wrt_b(cost_grad_deltas)\n",
"\n",
" #4. Update the model, we update with the mean gradient\n",
" # over the minibatch, rather than sum of particular gradients\n",
" # in a minibatch, to do so we scale the learning rate by batch_size\n",
" mb_size = x.shape[0]\n",
" effect_learn_rate = learning_rate / mb_size\n",
"\n",
" W = W - effect_learn_rate * grad_W\n",
" b = b - effect_learn_rate * grad_b\n",
" \n",
" return W, b, numpy.mean(mse_stats)\n",
"\n",
"def sgd(data_provider, W, b, learning_rate=0.1, max_epochs=10):\n",
" \n",
" for epoch in xrange(0, max_epochs):\n",
" #reset the data provider\n",
" data_provider.reset()\n",
" \n",
" #train for one epoch\n",
" W, b, mean_cost = \\\n",
" sgd_epoch(data_provider, W, b, learning_rate)\n",
" \n",
" print \"MSE training cost after %d-th epoch is %f\" % (epoch + 1, mean_cost)\n",
" \n",
" return W, b\n",
" \n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"\n",
"#some hyper-parameters\n",
"window_size = 12\n",
"irange = 0.1\n",
"learning_rate = 0.01\n",
"max_epochs=40\n",
"\n",
"# note, while developing you can set max_num_batches to some positive number to limit\n",
"# the number of training data-points (you will get feedback faster)\n",
"mdp = MetOfficeDataProvider(window_size, batch_size=10, max_num_batches=-100, randomize=False)\n",
"\n",
"#initialise the parameters\n",
"W = random_generator.uniform(-irange, irange, (window_size, 1))\n",
"b = random_generator.uniform(-irange, irange, (1, ))\n",
"\n",
"#train the model\n",
"sgd(mdp, W, b, learning_rate=learning_rate, max_epochs=max_epochs)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"collapsed": true
},
"source": [
"# Fun Stuff\n",
"## Exercise 5\n",
"\n",
"So what on can do with linear transform, and what are the properties of those?\n",
"Modify the above regression problem so the model makes binary classification whether the the weather is going to be one of those \\{rainy, sunny} (look at slide 12 of the 2nd lecture)\n",
"\n",
"Exercise, show, the LT is invertible, basically, solve the equation:\n",
"\n",
"y=Wx+b, given y (transformed image), find such x that is the same as the original one."
"Tip: You need to introduce the following changes:\n",
"1. Modify `MetOfficeDataProvider` (for example, inherit from MetOfficeDataProvider to create a new class MetOfficeDataProviderBin) and modify `next()` function so it returns as `targets` either 0 (sunny - if the the amount of rain [before mean/variance normalisation] is equal to 0 or 1 (rainy -- otherwise).\n",
"2. Modify the functions from previous exercise so the fprop implements `sigmoid` on top of affine transform.\n",
"3. Modify cost function to binary cross-entropy\n",
"4. Make sure you compute the gradients correctly (as you have changed both the output and the cost)\n"
]
},
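{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a starting point for tips 2 and 3, the standard definitions are sketched below. This is only a sketch under the usual conventions; the mini-batch handling and the corresponding gradients are still up to you:\n",
"\n",
"```python\n",
"def sigmoid(a):\n",
"    return 1.0 / (1.0 + numpy.exp(-a))\n",
"\n",
"def bce_cost(y, t):\n",
"    #binary cross-entropy, summed over the examples in the mini-batch\n",
"    return -numpy.sum(t * numpy.log(y) + (1 - t) * numpy.log(1 - y))\n",
"```"
]
},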
{

View File

@ -64,7 +64,7 @@ class MNISTDataProvider(DataProvider):
"""
def __init__(self, dset,
batch_size=10,
max_num_batches=-1,
randomize=True):
super(MNISTDataProvider, self).\
@ -75,6 +75,10 @@ class MNISTDataProvider(DataProvider):
"'valid' or 'eval' got %s" % dset
)
assert max_num_batches != 0, (
"max_num_batches should be != 0"
)
dset_path = './data/mnist_%s.pkl.gz' % dset
assert os.path.isfile(dset_path), (
"File %s was expected to exist!." % dset_path
@ -83,7 +87,7 @@ class MNISTDataProvider(DataProvider):
with gzip.open(dset_path) as f:
x, t = cPickle.load(f)
self._max_num_batches = max_num_batches
self.x = x
self.t = t
self.num_classes = 10
@ -104,8 +108,7 @@ class MNISTDataProvider(DataProvider):
def next(self):
has_enough = (self._curr_idx + self.batch_size) <= self.x.shape[0]
presented_max = (0 < self._max_num_batches < (self._curr_idx / self.batch_size))
if not has_enough or presented_max:
raise StopIteration()
@ -122,8 +125,7 @@ class MNISTDataProvider(DataProvider):
self._curr_idx += self.batch_size
return rval_x, self.__to_one_of_k(rval_t)
def __to_one_of_k(self, y):
rval = numpy.zeros((y.shape[0], self.num_classes), dtype=numpy.float32)
@ -132,7 +134,7 @@ class MNISTDataProvider(DataProvider):
return rval
class MetOfficeDataProvider(DataProvider):
"""
The class iterates over South Scotland Weather, in possibly
random order.
@ -142,7 +144,7 @@ class MetOfficeDataProvider_(DataProvider):
max_num_batches=-1,
randomize=True):
super(MetOfficeDataProvider, self).\
__init__(batch_size, randomize)
dset_path = './data/HadSSP_daily_qc.txt'
@ -152,27 +154,35 @@ class MetOfficeDataProvider_(DataProvider):
raw = numpy.loadtxt(dset_path, skiprows=3, usecols=range(2, 32))
self.window_size = window_size
self._max_num_batches = max_num_batches
#filter out all missing datapoints and
#flatten a matrix to a vector, so we will get
#a time preserving representation of measurments
#with self.x[0] being the first day and self.x[-1] the last
self.x = raw[raw >= 0].flatten()
#normalise data to zero mean, unit variance
mean = numpy.mean(self.x)
var = numpy.var(self.x)
assert var >= 0.01, (
"Variance too small %f " % var
)
self.x = (self.x-mean)/var
self._rand_idx = None
if self.randomize:
self._rand_idx = self.__randomize()
def reset(self):
super(MetOfficeDataProvider, self).reset()
if self.randomize:
self._rand_idx = self.__randomize()
def __randomize(self):
assert isinstance(self.x, numpy.ndarray)
# we generate random indexes starting from window_size, i.e. 10th absolute element
# in the self.x vector, as we later during mini-batch preparation slice
# the self.x container backwards, i.e. given we want to get a training
# data-point for the 11th day, we look at the 10 preceding days.
# Note, we cannot do this, for example, for the 5th day as
@ -182,8 +192,7 @@ class MetOfficeDataProvider_(DataProvider):
def next(self):
has_enough = (self._curr_idx + self.batch_size) <= self.x.shape[0]
presented_max = (0 < self._max_num_batches < (self._curr_idx / self.batch_size))
if not has_enough or presented_max:
raise StopIteration()
@ -198,18 +207,24 @@ class MetOfficeDataProvider_(DataProvider):
#build slicing matrix of size minibatch, which will contain batch_size
#rows, each keeping indexes that select window_size+1 [for (x,t)] elements
#from data vector (self.x) that itself stays always sorted w.r.t time
range_slices = numpy.zeros((self.batch_size, self.window_size + 1), dtype=numpy.int32)
for i in xrange(0, self.batch_size):
range_slices[i, :] = \
numpy.arange(range_idx[i],
range_idx[i] - self.window_size - 1,
-1,
dtype=numpy.int32)[::-1]
#here we use advanced indexing to select slices from observation vector
#last column of tmp_x makes our targets t (as we slice window_size + 1 elements)
tmp_x = self.x[range_slices]
rval_x = tmp_x[:,:-1]
rval_t = tmp_x[:,-1].reshape(self.batch_size, -1)
self._curr_idx += self.batch_size
return rval_x, rval_t
class FuncDataProvider(DataProvider):

BIN res/singleLayerNetBP-1.png (new file, 69 KiB; binary content not shown)
BIN res/singleLayerNetWts-1.png (new file, 61 KiB; binary content not shown)
BIN res/singleLayerNetWtsBP.pdf (new file; binary content not shown)
(further changed binary image files, 62 KiB and 73 KiB; names and contents not shown)