some progress with 05 solutions

2015-11-14 17:06:12 +00:00 · 2015-11-14 17:06:12 +00:00 · c130785b79
commit c130785b79
parent 078094d8fc
6 changed files with 800 additions and 14 deletions
--- a/04_Regularisation_solution.ipynb
+++ b/04_Regularisation_solution.ipynb
@ -165,7 +165,10 @@
    "import logging\n",
    "from mlp.dataset import MNISTDataProvider\n",
    "\n",
+    "logger = logging.getLogger()\n",
+    "logger.setLevel(logging.INFO)\n",
    "logger.info('Initialising data providers...')\n",
+    "\n",
    "train_dp = MNISTDataProvider(dset='train', batch_size=10, max_num_batches=100, randomize=True)\n",
    "valid_dp = MNISTDataProvider(dset='valid', batch_size=10000, max_num_batches=-10, randomize=False)\n",
    "test_dp = MNISTDataProvider(dset='eval', batch_size=10000, max_num_batches=-10, randomize=False)"
@ -467,8 +470,6 @@
    "from mlp.schedulers import LearningRateFixed\n",
    "from scipy.optimize import leastsq\n",
    "\n",
-    "logger = logging.getLogger()\n",
-    "logger.setLevel(logging.INFO)\n",
    "rng = numpy.random.RandomState([2015,10,10])\n",
    "\n",
    "#some hyper-parameters\n",
--- a/05_Transfer_functions_solution.ipynb
+++ b/05_Transfer_functions_solution.ipynb
@ -0,0 +1,637 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Introduction\n",
+    "\n",
+    "This tutorial focuses on implementing alternatives to sigmoid transfer functions for hidden units.  (*Transfer functions* are also called *activation functions* or *nonlinearities*.) First, we will work with the hyperbolic tangent (tanh) and then with unbounded (or partially unbounded) piecewise linear functions: Rectified Linear Units (ReLU) and Maxout.\n",
+    "\n",
+    "\n",
+    "## Virtual environments\n",
+    "\n",
+    "Before you proceed onwards, remember to activate your virtual environment by typing `activate_mlp` or `source ~/mlpractical/venv/bin/activate` (or if you did the original install the \"comfy way\" type: `workon mlpractical`).\n",
+    "\n",
+    "\n",
+    "## Syncing the git repository\n",
+    "\n",
+    "Look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a> for more details. In short, we recommend creating a separate branch for this lab, as follows:\n",
+    "\n",
+    "1. Enter the mlpractical directory `cd ~/mlpractical/repo-mlp`\n",
+    "2. List the branches and check which one is currently active by typing: `git branch`\n",
+    "3. If you have followed our recommendations, you should be in the `lab4` branch. Please commit your local changes to the repo index by typing:\n",
+    "```\n",
+    "git commit -am \"finished lab4\"\n",
+    "```\n",
+    "4. Now you can switch to `master` branch by typing: \n",
+    "```\n",
+    "git checkout master\n",
+    " ```\n",
+    "5. Update the repository (this assumes `master` does not have any conflicts; if there are some, have a look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>):\n",
+    "```\n",
+    "git pull\n",
+    "```\n",
+    "6. And now, create the new branch & switch to it by typing:\n",
+    "```\n",
+    "git checkout -b lab5\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Overview of alternative transfer functions\n",
+    "\n",
+    "Now, we briefly summarise some other possible choices for hidden layer transfer functions.\n",
+    "\n",
+    "## Tanh\n",
+    "\n",
+    "Given a linear activation $a_{i}$, tanh implements the following operation:\n",
+    "\n",
+    "(1) $h_i(a_i) = \\mbox{tanh}(a_i) = \\frac{\\exp(a_i) - \\exp(-a_i)}{\\exp(a_i) + \\exp(-a_i)}$\n",
+    "\n",
+    "Hence, the derivative of $h_i$ with respect to $a_i$ is:\n",
+    "\n",
+    "(2) $\\begin{align}\n",
+    "\\frac{\\partial h_i}{\\partial a_i} &= 1 - h^2_i\n",
+    "\\end{align}\n",
+    "$\n",
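+    "\n",
+    "As a quick check of (2), here is a short derivation sketch. The shorthand $u$ and $v$ is introduced only for this illustration: let $u = \\exp(a_i) - \\exp(-a_i)$ and $v = \\exp(a_i) + \\exp(-a_i)$, so that $h_i = u/v$, $\\frac{\\partial u}{\\partial a_i} = v$ and $\\frac{\\partial v}{\\partial a_i} = u$. The quotient rule then gives:\n",
+    "\n",
+    "$\\begin{align}\n",
+    "\\frac{\\partial h_i}{\\partial a_i} &= \\frac{v \\cdot v - u \\cdot u}{v^2} = 1 - \\left(\\frac{u}{v}\\right)^2 = 1 - h^2_i\n",
+    "\\end{align}\n",
+    "$\n",
+    "\n",
+    "A practical consequence is that the backward pass can reuse the output $h_i$ stored during the forward pass instead of recomputing the exponentials.\n",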
+    "\n",
+    "\n",
+    "## ReLU\n",
+    "\n",
+    "Given a linear activation $a_{i}$, ReLU implements the following operation:\n",
+    "\n",
+    "(3) $h_i(a_i) = \\max(0, a_i)$\n",
+    "\n",
+    "Hence, the gradient is:\n",
+    "\n",
+    "(4) $\\begin{align}\n",
+    "\\frac{\\partial h_i}{\\partial a_i} &=\n",
+    "\\begin{cases}\n",
+    "     1     & \\quad \\text{if } a_i > 0 \\\\\n",
+    "     0       & \\quad \\text{if } a_i \\leq 0 \\\\\n",
+    "\\end{cases}\n",
+    "\\end{align}\n",
+    "$\n",
+    "\n",
+    "ReLU implements a form of data-driven sparsity: on average the activations are sparse (many of them are 0), but the sparsity pattern depends on the particular data-point. This differs from the sparsity in the model's parameters that one can obtain with $L1$ regularisation, as the latter affects all data-points in the same way.\n",
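+    "\n",
+    "To illustrate (3) and (4) on made-up numbers (chosen here purely as an example), take three linear activations $a = (-1.5, 0, 2)$. Then:\n",
+    "\n",
+    "$\\begin{align}\n",
+    "h &= (\\max(0, -1.5), \\max(0, 0), \\max(0, 2)) = (0, 0, 2) \\\\\n",
+    "\\frac{\\partial h}{\\partial a} &= (0, 0, 1)\n",
+    "\\end{align}\n",
+    "$\n",
+    "\n",
+    "Two of the three outputs are exactly zero, and only the unit with a positive activation passes a gradient back; a different input would in general switch off a different set of units.\n",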
+    "\n",
+    "## Maxout\n",
+    "\n",
+    "Maxout is an example of a data-driven type of non-linearity in which the transfer function can be learned from data. That is, the model can build a non-linear transfer function from piecewise linear components.  These linear components, depending on the number of linear regions used in the pooling operator (given by parameter $K$), can approximate arbitrary convex functions, such as ReLU, abs, etc.\n",
+    "\n",
+    "Given some subset (group, pool) of $K$ linear activations $a_{j}, a_{j+1}, \\ldots, a_{j+K-1}$ at the $l$-th layer, maxout implements the following operation:\n",
+    "\n",
+    "(5) $h_i(a_j, a_{j+1}, \\ldots, a_{j+K-1}) = \\max(a_j, a_{j+1}, \\ldots, a_{j+K-1})$\n",
+    "\n",
+    "Hence, the gradient of $h_i$ w.r.t. the pooling region $a_{j}, a_{j+1}, \\ldots, a_{j+K-1}$ is:\n",
+    "\n",
+    "(6) $\\begin{align}\n",
+    "\\frac{\\partial h_i}{\\partial (a_j, a_{j+1}, \\ldots, a_{j+K-1})} &=\n",
+    "\\begin{cases}\n",
+    "     1     & \\quad \\text{for the max activation}  \\\\\n",
+    "     0       & \\quad \\text{otherwise} \\\\\n",
+    "\\end{cases}\n",
+    "\\end{align}\n",
+    "$\n",
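+    "\n",
+    "As a small worked example of (5) and (6), with numbers made up for illustration, consider a pool of $K = 3$ linear activations with values $(0.2, -1.0, 1.5)$:\n",
+    "\n",
+    "$ h_i = \\max(0.2, -1.0, 1.5) = 1.5 $\n",
+    "\n",
+    "The gradient with respect to the three pooled activations is then $(0, 0, 1)$: only the winning activation receives the back-propagated error signal. With $K = 2$, tying one of the pooled activations to zero gives $\\max(0, a)$, that is ReLU, while pooling $a$ together with $-a$ gives $\\max(a, -a) = |a|$.\n",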
+    "\n",
+    "Implementation tips are given in Exercise 3.\n",
+    "\n",
+    "# On weight initialisation\n",
+    "\n",
+    "Activation functions directly affect the \"network dynamics\", that is, the magnitudes of the statistics each layer produces. For example, *squashing* non-linearities like sigmoid or tanh bring the linear activations to a certain bounded range. ReLU, on the contrary, has an unbounded positive side. This directly affects all statistics collected in forward and backward passes as well as the gradients w.r.t. parameters, and hence also the pace at which the model learns. That is why the learning rate usually needs to be tuned for the characteristics of the non-linearities used.\n",
+    "\n",
+    "Another important hyperparameter is the initial range used to initialise the weight matrices.  We have largely ignored it so far (although if you did further experiments in coursework 1, you may have found setting it had an effect on training deeper networks with 4 or 5 hidden layers).  However, for sigmoidal non-linearities (sigmoid, tanh) the initialisation range is an important hyperparameter and a considerable amount of research has been put into determining the best strategy for choosing it. In fact, one of the early triggers of the recent resurgence of deep learning was pre-training: techniques for initialising weights in an unsupervised manner so that deeper models can later be trained effectively in a supervised fashion.\n",
+    "\n",
+    "## Sigmoidal transfer functions\n",
+    "\n",
+    "Y. LeCun in [Efficient Backprop](http://link.springer.com/chapter/10.1007%2F3-540-49430-8_2) recommends the following setting of the initial range $r$ for sigmoidal units (assuming that the data has been normalised to zero mean, unit variance): \n",
+    "\n",
+    "(7) $ r = \\frac{1}{\\sqrt{N_{IN}}} $\n",
+    "\n",
+    "where $N_{IN}$ is the number of inputs to the given layer and the weights are then sampled from the (usually uniform) distribution $U(-r,r)$. The motivation is to keep the initial forward-pass signal in the linear region of the sigmoid non-linearity so that the gradients are large enough for training to proceed (note that the sigmoidal non-linearities saturate when activations are either very positive or very negative, leading to very small gradients and hence poor learning dynamics).\n",
+    "\n",
+    "The initialisation in (7), however, leads to different magnitudes of activations/gradients at different layers (due to the multiplicative nature of the computations). More recently, [Glorot et al.](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) proposed the so-called *normalised initialisation*, which ensures the variance of the forward signal (activations) is approximately the same in each layer. The same applies to the gradients obtained in the backward pass.\n",
+    "\n",
+    "The $r$ in the *normalised initialisation* for $\\mbox{tanh}$ non-linearity is then:\n",
+    "\n",
+    "(8) $ r = \\frac{\\sqrt{6}}{\\sqrt{N_{IN}+N_{OUT}}} $\n",
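+    "\n",
+    "One way to read (8), sketched here as a side note: a uniform distribution $U(-r, r)$ has zero mean and variance $r^2/3$, so the range in (8) corresponds to a weight variance of\n",
+    "\n",
+    "$ \\mbox{Var}(w) = \\frac{r^2}{3} = \\frac{6}{3(N_{IN}+N_{OUT})} = \\frac{2}{N_{IN}+N_{OUT}} $\n",
+    "\n",
+    "whereas (7) gives $\\mbox{Var}(w) = \\frac{1}{3 N_{IN}}$, which depends on the number of inputs only.\n",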
+    "\n",
+    "For the sigmoid (logistic) non-linearity, to get similar characteristics, one should scale $r$ in (8) by 4, that is:\n",
+    "\n",
+    "(9) $ r = \\frac{4\\sqrt{6}}{\\sqrt{N_{IN}+N_{OUT}}} $\n",
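+    "\n",
+    "To get a feel for the magnitudes involved, here is the arithmetic for one illustrative layer. The sizes $N_{IN} = 784$ (MNIST inputs) and $N_{OUT} = 800$ are assumptions picked for this example only:\n",
+    "\n",
+    "$\\begin{align}\n",
+    "\\text{(7):} \\quad r &= \\frac{1}{\\sqrt{784}} = \\frac{1}{28} \\approx 0.036 \\\\\n",
+    "\\text{(8):} \\quad r &= \\frac{\\sqrt{6}}{\\sqrt{784+800}} = \\sqrt{\\frac{6}{1584}} \\approx 0.062 \\\\\n",
+    "\\text{(9):} \\quad r &= \\frac{4\\sqrt{6}}{\\sqrt{784+800}} \\approx 0.246\n",
+    "\\end{align}\n",
+    "$\n",
+    "\n",
+    "For a layer of this shape the three rules therefore differ by a factor of up to about 7, which is one reason the choice can matter more as layers are stacked.\n",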
+    "\n",
+    "## Piecewise linear transfer functions (ReLU, Maxout)\n",
+    "\n",
+    "For unbounded transfer functions, initialisation is not as crucial as for sigmoidal ones. This is because their gradients do not diminish (they are actually more likely to explode) and they do not saturate (ReLU saturates at 0, but not on the positive slope, where the gradient is 1 everywhere).  (In practice ReLU is sometimes \"clipped\" with a maximum value, typically 20.)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 1:  Implement the tanh transfer function\n",
+    "\n",
+    "Your implementation should follow the code conventions used to build other layer types (for example, Sigmoid and Softmax). Test your solution by training a one-hidden-layer model with 100 hidden units, similar to the one used in Task 3a in the coursework. \n",
+    "\n",
+    "Tune the learning rate and compare the initial ranges in equations (7) and (8). Note that there might not be much difference for a one-hidden-layer model, but you can easily notice a substantial gain from using (8) (or (9) for the logistic sigmoid activation) for deeper models, for example, the 5 hidden-layer network from the first coursework.\n",
+    "\n",
+    "Implementation tip: Use numpy.tanh() to compute the non-linearity.  Use the irange argument when creating the given layer type to provide the initial sampling range."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO:root:Initialising data providers...\n"
+     ]
+    }
+   ],
+   "source": [
+    "import numpy\n",
+    "import logging\n",
+    "from mlp.dataset import MNISTDataProvider\n",
+    "\n",
+    "logger = logging.getLogger()\n",
+    "logger.setLevel(logging.INFO)\n",
+    "\n",
+    "# Note: you were asked to run the experiments on all data.\n",
+    "# Here I am running those examples on 1000 training data-points only (similar to the regularisation notebook)\n",
+    "logger.info('Initialising data providers...')\n",
+    "train_dp = MNISTDataProvider(dset='train', batch_size=10, max_num_batches=100, randomize=True)\n",
+    "valid_dp = MNISTDataProvider(dset='valid', batch_size=10000, max_num_batches=-10, randomize=False)\n",
+    "test_dp = MNISTDataProvider(dset='eval', batch_size=10000, max_num_batches=-10, randomize=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO:root:Training started...\n",
+      "INFO:mlp.optimisers:Epoch 0: Training cost (ce) for initial model is 2.368. Accuracy is 7.80%\n",
+      "INFO:mlp.optimisers:Epoch 0: Validation cost (ce) for initial model is 2.347. Accuracy is 9.86%\n",
+      "INFO:mlp.optimisers:Epoch 1: Training cost (ce) is 1.421. Accuracy is 64.70%\n",
+      "INFO:mlp.optimisers:Epoch 1: Validation cost (ce) is 0.479. Accuracy is 85.95%\n",
+      "INFO:mlp.optimisers:Epoch 1: Took 10 seconds. Training speed 233 pps. Validation speed 1624 pps.\n",
+      "INFO:mlp.optimisers:Epoch 2: Training cost (ce) is 0.571. Accuracy is 81.60%\n",
+      "INFO:mlp.optimisers:Epoch 2: Validation cost (ce) is 0.484. Accuracy is 85.23%\n",
+      "INFO:mlp.optimisers:Epoch 2: Took 11 seconds. Training speed 214 pps. Validation speed 1637 pps.\n",
+      "INFO:mlp.optimisers:Epoch 3: Training cost (ce) is 0.411. Accuracy is 87.40%\n",
+      "INFO:mlp.optimisers:Epoch 3: Validation cost (ce) is 0.507. Accuracy is 85.40%\n",
+      "INFO:mlp.optimisers:Epoch 3: Took 11 seconds. Training speed 226 pps. Validation speed 1640 pps.\n",
+      "INFO:mlp.optimisers:Epoch 4: Training cost (ce) is 0.318. Accuracy is 90.10%\n",
+      "INFO:mlp.optimisers:Epoch 4: Validation cost (ce) is 0.596. Accuracy is 84.40%\n",
+      "INFO:mlp.optimisers:Epoch 4: Took 10 seconds. Training speed 232 pps. Validation speed 1616 pps.\n",
+      "INFO:mlp.optimisers:Epoch 5: Training cost (ce) is 0.257. Accuracy is 91.80%\n",
+      "INFO:mlp.optimisers:Epoch 5: Validation cost (ce) is 0.468. Accuracy is 87.76%\n",
+      "INFO:mlp.optimisers:Epoch 5: Took 11 seconds. Training speed 229 pps. Validation speed 1629 pps.\n",
+      "INFO:mlp.optimisers:Epoch 6: Training cost (ce) is 0.244. Accuracy is 92.30%\n",
+      "INFO:mlp.optimisers:Epoch 6: Validation cost (ce) is 0.535. Accuracy is 86.31%\n",
+      "INFO:mlp.optimisers:Epoch 6: Took 11 seconds. Training speed 230 pps. Validation speed 1600 pps.\n",
+      "INFO:mlp.optimisers:Epoch 7: Training cost (ce) is 0.169. Accuracy is 94.30%\n",
+      "INFO:mlp.optimisers:Epoch 7: Validation cost (ce) is 0.554. Accuracy is 86.59%\n",
+      "INFO:mlp.optimisers:Epoch 7: Took 11 seconds. Training speed 226 pps. Validation speed 1631 pps.\n",
+      "INFO:mlp.optimisers:Epoch 8: Training cost (ce) is 0.130. Accuracy is 96.60%\n",
+      "INFO:mlp.optimisers:Epoch 8: Validation cost (ce) is 0.562. Accuracy is 86.83%\n",
+      "INFO:mlp.optimisers:Epoch 8: Took 11 seconds. Training speed 225 pps. Validation speed 1603 pps.\n",
+      "INFO:mlp.optimisers:Epoch 9: Training cost (ce) is 0.113. Accuracy is 96.90%\n",
+      "INFO:mlp.optimisers:Epoch 9: Validation cost (ce) is 0.605. Accuracy is 85.94%\n",
+      "INFO:mlp.optimisers:Epoch 9: Took 11 seconds. Training speed 231 pps. Validation speed 1616 pps.\n",
+      "INFO:mlp.optimisers:Epoch 10: Training cost (ce) is 0.087. Accuracy is 97.10%\n",
+      "INFO:mlp.optimisers:Epoch 10: Validation cost (ce) is 0.564. Accuracy is 87.50%\n",
+      "INFO:mlp.optimisers:Epoch 10: Took 11 seconds. Training speed 226 pps. Validation speed 1637 pps.\n",
+      "INFO:mlp.optimisers:Epoch 11: Training cost (ce) is 0.054. Accuracy is 98.70%\n",
+      "INFO:mlp.optimisers:Epoch 11: Validation cost (ce) is 0.599. Accuracy is 87.04%\n",
+      "INFO:mlp.optimisers:Epoch 11: Took 10 seconds. Training speed 232 pps. Validation speed 1640 pps.\n",
+      "INFO:mlp.optimisers:Epoch 12: Training cost (ce) is 0.045. Accuracy is 98.60%\n",
+      "INFO:mlp.optimisers:Epoch 12: Validation cost (ce) is 0.574. Accuracy is 87.75%\n",
+      "INFO:mlp.optimisers:Epoch 12: Took 10 seconds. Training speed 237 pps. Validation speed 1653 pps.\n",
+      "INFO:mlp.optimisers:Epoch 13: Training cost (ce) is 0.025. Accuracy is 99.30%\n",
+      "INFO:mlp.optimisers:Epoch 13: Validation cost (ce) is 0.615. Accuracy is 86.88%\n",
+      "INFO:mlp.optimisers:Epoch 13: Took 11 seconds. Training speed 232 pps. Validation speed 1616 pps.\n",
+      "INFO:mlp.optimisers:Epoch 14: Training cost (ce) is 0.011. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 14: Validation cost (ce) is 0.610. Accuracy is 87.50%\n",
+      "INFO:mlp.optimisers:Epoch 14: Took 11 seconds. Training speed 201 pps. Validation speed 1634 pps.\n",
+      "INFO:mlp.optimisers:Epoch 15: Training cost (ce) is 0.009. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 15: Validation cost (ce) is 0.599. Accuracy is 87.87%\n",
+      "INFO:mlp.optimisers:Epoch 15: Took 10 seconds. Training speed 233 pps. Validation speed 1637 pps.\n",
+      "INFO:mlp.optimisers:Epoch 16: Training cost (ce) is 0.007. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 16: Validation cost (ce) is 0.612. Accuracy is 87.71%\n",
+      "INFO:mlp.optimisers:Epoch 16: Took 10 seconds. Training speed 241 pps. Validation speed 1645 pps.\n",
+      "INFO:mlp.optimisers:Epoch 17: Training cost (ce) is 0.006. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 17: Validation cost (ce) is 0.614. Accuracy is 87.73%\n",
+      "INFO:mlp.optimisers:Epoch 17: Took 10 seconds. Training speed 237 pps. Validation speed 1634 pps.\n",
+      "INFO:mlp.optimisers:Epoch 18: Training cost (ce) is 0.005. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 18: Validation cost (ce) is 0.620. Accuracy is 87.77%\n",
+      "INFO:mlp.optimisers:Epoch 18: Took 10 seconds. Training speed 245 pps. Validation speed 1645 pps.\n",
+      "INFO:mlp.optimisers:Epoch 19: Training cost (ce) is 0.005. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 19: Validation cost (ce) is 0.623. Accuracy is 87.94%\n",
+      "INFO:mlp.optimisers:Epoch 19: Took 10 seconds. Training speed 234 pps. Validation speed 1631 pps.\n",
+      "INFO:mlp.optimisers:Epoch 20: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 20: Validation cost (ce) is 0.625. Accuracy is 87.84%\n",
+      "INFO:mlp.optimisers:Epoch 20: Took 11 seconds. Training speed 217 pps. Validation speed 1631 pps.\n",
+      "INFO:mlp.optimisers:Epoch 21: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 21: Validation cost (ce) is 0.633. Accuracy is 87.83%\n",
+      "INFO:mlp.optimisers:Epoch 21: Took 10 seconds. Training speed 235 pps. Validation speed 1618 pps.\n",
+      "INFO:mlp.optimisers:Epoch 22: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 22: Validation cost (ce) is 0.637. Accuracy is 87.93%\n",
+      "INFO:mlp.optimisers:Epoch 22: Took 11 seconds. Training speed 225 pps. Validation speed 1648 pps.\n",
+      "INFO:mlp.optimisers:Epoch 23: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 23: Validation cost (ce) is 0.639. Accuracy is 87.90%\n",
+      "INFO:mlp.optimisers:Epoch 23: Took 10 seconds. Training speed 238 pps. Validation speed 1626 pps.\n",
+      "INFO:mlp.optimisers:Epoch 24: Training cost (ce) is 0.003. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 24: Validation cost (ce) is 0.642. Accuracy is 87.86%\n",
+      "INFO:mlp.optimisers:Epoch 24: Took 10 seconds. Training speed 233 pps. Validation speed 1659 pps.\n",
+      "INFO:mlp.optimisers:Epoch 25: Training cost (ce) is 0.003. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 25: Validation cost (ce) is 0.645. Accuracy is 87.91%\n",
+      "INFO:mlp.optimisers:Epoch 25: Took 12 seconds. Training speed 179 pps. Validation speed 1618 pps.\n",
+      "INFO:mlp.optimisers:Epoch 26: Training cost (ce) is 0.003. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 26: Validation cost (ce) is 0.650. Accuracy is 87.90%\n",
+      "INFO:mlp.optimisers:Epoch 26: Took 10 seconds. Training speed 241 pps. Validation speed 1637 pps.\n",
+      "INFO:mlp.optimisers:Epoch 27: Training cost (ce) is 0.003. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 27: Validation cost (ce) is 0.653. Accuracy is 87.98%\n",
+      "INFO:mlp.optimisers:Epoch 27: Took 10 seconds. Training speed 250 pps. Validation speed 1629 pps.\n",
+      "INFO:mlp.optimisers:Epoch 28: Training cost (ce) is 0.003. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 28: Validation cost (ce) is 0.656. Accuracy is 87.89%\n",
+      "INFO:mlp.optimisers:Epoch 28: Took 10 seconds. Training speed 232 pps. Validation speed 1640 pps.\n",
+      "INFO:mlp.optimisers:Epoch 29: Training cost (ce) is 0.003. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 29: Validation cost (ce) is 0.659. Accuracy is 87.92%\n",
+      "INFO:mlp.optimisers:Epoch 29: Took 10 seconds. Training speed 235 pps. Validation speed 1613 pps.\n",
+      "INFO:mlp.optimisers:Epoch 30: Training cost (ce) is 0.003. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 30: Validation cost (ce) is 0.663. Accuracy is 87.91%\n",
+      "INFO:mlp.optimisers:Epoch 30: Took 11 seconds. Training speed 223 pps. Validation speed 1613 pps.\n",
+      "INFO:root:Testing the model on test set:\n",
+      "INFO:root:MNIST test set accuracy is 87.69 %, cost (ce) is 0.665\n"
+     ]
+    }
+   ],
+   "source": [
+    "\n",
+    "from mlp.layers import MLP, Tanh, Softmax #import required layer types\n",
+    "from mlp.optimisers import SGDOptimiser #import the optimiser\n",
+    "\n",
+    "from mlp.costs import CECost #import the cost we want to use for optimisation\n",
+    "from mlp.schedulers import LearningRateFixed\n",
+    "from scipy.optimize import leastsq\n",
+    "\n",
+    "rng = numpy.random.RandomState([2015,10,10])\n",
+    "\n",
+    "#some hyper-parameters\n",
+    "nhid = 800\n",
+    "learning_rate = 0.2\n",
+    "max_epochs = 30\n",
+    "cost = CECost()\n",
+    "    \n",
+    "stats = []\n",
+    "for layer in xrange(1, 2):\n",
+    "\n",
+    "    train_dp.reset()\n",
+    "    valid_dp.reset()\n",
+    "    test_dp.reset()\n",
+    "    \n",
+    "    #define the model\n",
+    "    model = MLP(cost=cost)\n",
+    "    model.add_layer(Tanh(idim=784, odim=nhid, irange=1./numpy.sqrt(784), rng=rng))\n",
+    "    for i in xrange(1, layer):\n",
+    "        logger.info(\"Stacking hidden layer (%s)\" % str(i+1))\n",
+    "        model.add_layer(Tanh(idim=nhid, odim=nhid, irange=0.2, rng=rng))\n",
+    "    model.add_layer(Softmax(idim=nhid, odim=10, rng=rng))\n",
+    "\n",
+    "    # define the optimiser, here stochastic gradient descent\n",
+    "    # with fixed learning rate and max_epochs\n",
+    "    lr_scheduler = LearningRateFixed(learning_rate=learning_rate, max_epochs=max_epochs)\n",
+    "    optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
+    "\n",
+    "    logger.info('Training started...')\n",
+    "    tr_stats, valid_stats = optimiser.train(model, train_dp, valid_dp)\n",
+    "\n",
+    "    logger.info('Testing the model on test set:')\n",
+    "    tst_cost, tst_accuracy = optimiser.validate(model, test_dp)\n",
+    "    logger.info('MNIST test set accuracy is %.2f %%, cost (%s) is %.3f'%(tst_accuracy*100., cost.get_name(), tst_cost))\n",
+    "    \n",
+    "    stats.append((tr_stats, valid_stats, (tst_cost, tst_accuracy)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 2: Implement ReLU\n",
+    "\n",
+    "Again, your implementation should follow the conventions used to build Linear, Sigmoid and Softmax layers. As in exercise 1, test your solution by training a one-hidden-layer model with 100 hidden units, similar to the one used in Task 3a in the coursework. Tune the learning rate (start with the initial one set to 0.1) with the initial weight range set to 0.05."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO:root:Training started...\n",
+      "INFO:mlp.optimisers:Epoch 0: Training cost (ce) for initial model is 2.362. Accuracy is 9.30%\n",
+      "INFO:mlp.optimisers:Epoch 0: Validation cost (ce) for initial model is 2.338. Accuracy is 10.80%\n",
+      "INFO:mlp.optimisers:Epoch 1: Training cost (ce) is 1.002. Accuracy is 68.60%\n",
+      "INFO:mlp.optimisers:Epoch 1: Validation cost (ce) is 0.623. Accuracy is 81.52%\n",
+      "INFO:mlp.optimisers:Epoch 1: Took 10 seconds. Training speed 227 pps. Validation speed 1698 pps.\n",
+      "INFO:mlp.optimisers:Epoch 2: Training cost (ce) is 0.483. Accuracy is 86.10%\n",
+      "INFO:mlp.optimisers:Epoch 2: Validation cost (ce) is 0.416. Accuracy is 88.84%\n",
+      "INFO:mlp.optimisers:Epoch 2: Took 10 seconds. Training speed 255 pps. Validation speed 1710 pps.\n",
+      "INFO:mlp.optimisers:Epoch 3: Training cost (ce) is 0.361. Accuracy is 90.20%\n",
+      "INFO:mlp.optimisers:Epoch 3: Validation cost (ce) is 0.388. Accuracy is 89.08%\n",
+      "INFO:mlp.optimisers:Epoch 3: Took 10 seconds. Training speed 232 pps. Validation speed 1710 pps.\n",
+      "INFO:mlp.optimisers:Epoch 4: Training cost (ce) is 0.294. Accuracy is 91.80%\n",
+      "INFO:mlp.optimisers:Epoch 4: Validation cost (ce) is 0.384. Accuracy is 88.91%\n",
+      "INFO:mlp.optimisers:Epoch 4: Took 10 seconds. Training speed 237 pps. Validation speed 1672 pps.\n",
+      "INFO:mlp.optimisers:Epoch 5: Training cost (ce) is 0.246. Accuracy is 94.10%\n",
+      "INFO:mlp.optimisers:Epoch 5: Validation cost (ce) is 0.375. Accuracy is 89.32%\n",
+      "INFO:mlp.optimisers:Epoch 5: Took 10 seconds. Training speed 236 pps. Validation speed 1672 pps.\n",
+      "INFO:mlp.optimisers:Epoch 6: Training cost (ce) is 0.217. Accuracy is 94.10%\n",
+      "INFO:mlp.optimisers:Epoch 6: Validation cost (ce) is 0.382. Accuracy is 88.88%\n",
+      "INFO:mlp.optimisers:Epoch 6: Took 10 seconds. Training speed 245 pps. Validation speed 1689 pps.\n",
+      "INFO:mlp.optimisers:Epoch 7: Training cost (ce) is 0.184. Accuracy is 96.10%\n",
+      "INFO:mlp.optimisers:Epoch 7: Validation cost (ce) is 0.420. Accuracy is 87.86%\n",
+      "INFO:mlp.optimisers:Epoch 7: Took 10 seconds. Training speed 234 pps. Validation speed 1692 pps.\n",
+      "INFO:mlp.optimisers:Epoch 8: Training cost (ce) is 0.148. Accuracy is 97.00%\n",
+      "INFO:mlp.optimisers:Epoch 8: Validation cost (ce) is 0.392. Accuracy is 88.87%\n",
+      "INFO:mlp.optimisers:Epoch 8: Took 11 seconds. Training speed 209 pps. Validation speed 1689 pps.\n",
+      "INFO:mlp.optimisers:Epoch 9: Training cost (ce) is 0.135. Accuracy is 97.60%\n",
+      "INFO:mlp.optimisers:Epoch 9: Validation cost (ce) is 0.381. Accuracy is 89.10%\n",
+      "INFO:mlp.optimisers:Epoch 9: Took 10 seconds. Training speed 238 pps. Validation speed 1667 pps.\n",
+      "INFO:mlp.optimisers:Epoch 10: Training cost (ce) is 0.109. Accuracy is 98.80%\n",
+      "INFO:mlp.optimisers:Epoch 10: Validation cost (ce) is 0.389. Accuracy is 89.04%\n",
+      "INFO:mlp.optimisers:Epoch 10: Took 10 seconds. Training speed 244 pps. Validation speed 1675 pps.\n",
+      "INFO:mlp.optimisers:Epoch 11: Training cost (ce) is 0.102. Accuracy is 98.40%\n",
+      "INFO:mlp.optimisers:Epoch 11: Validation cost (ce) is 0.406. Accuracy is 88.57%\n",
+      "INFO:mlp.optimisers:Epoch 11: Took 10 seconds. Training speed 236 pps. Validation speed 1667 pps.\n",
+      "INFO:mlp.optimisers:Epoch 12: Training cost (ce) is 0.085. Accuracy is 99.00%\n",
+      "INFO:mlp.optimisers:Epoch 12: Validation cost (ce) is 0.415. Accuracy is 88.49%\n",
+      "INFO:mlp.optimisers:Epoch 12: Took 11 seconds. Training speed 211 pps. Validation speed 1701 pps.\n",
+      "INFO:mlp.optimisers:Epoch 13: Training cost (ce) is 0.069. Accuracy is 99.40%\n",
+      "INFO:mlp.optimisers:Epoch 13: Validation cost (ce) is 0.423. Accuracy is 88.44%\n",
+      "INFO:mlp.optimisers:Epoch 13: Took 11 seconds. Training speed 209 pps. Validation speed 1704 pps.\n",
+      "INFO:mlp.optimisers:Epoch 14: Training cost (ce) is 0.057. Accuracy is 99.60%\n",
+      "INFO:mlp.optimisers:Epoch 14: Validation cost (ce) is 0.433. Accuracy is 88.47%\n",
+      "INFO:mlp.optimisers:Epoch 14: Took 10 seconds. Training speed 234 pps. Validation speed 1684 pps.\n",
+      "INFO:mlp.optimisers:Epoch 15: Training cost (ce) is 0.050. Accuracy is 99.70%\n",
+      "INFO:mlp.optimisers:Epoch 15: Validation cost (ce) is 0.430. Accuracy is 88.60%\n",
+      "INFO:mlp.optimisers:Epoch 15: Took 10 seconds. Training speed 231 pps. Validation speed 1704 pps.\n",
+      "INFO:mlp.optimisers:Epoch 16: Training cost (ce) is 0.042. Accuracy is 99.90%\n",
+      "INFO:mlp.optimisers:Epoch 16: Validation cost (ce) is 0.437. Accuracy is 88.57%\n",
+      "INFO:mlp.optimisers:Epoch 16: Took 10 seconds. Training speed 241 pps. Validation speed 1684 pps.\n",
+      "INFO:mlp.optimisers:Epoch 17: Training cost (ce) is 0.039. Accuracy is 99.80%\n",
+      "INFO:mlp.optimisers:Epoch 17: Validation cost (ce) is 0.452. Accuracy is 88.24%\n",
+      "INFO:mlp.optimisers:Epoch 17: Took 10 seconds. Training speed 233 pps. Validation speed 1684 pps.\n",
+      "INFO:mlp.optimisers:Epoch 18: Training cost (ce) is 0.032. Accuracy is 99.80%\n",
+      "INFO:mlp.optimisers:Epoch 18: Validation cost (ce) is 0.453. Accuracy is 88.39%\n",
+      "INFO:mlp.optimisers:Epoch 18: Took 10 seconds. Training speed 236 pps. Validation speed 1712 pps.\n",
+      "INFO:mlp.optimisers:Epoch 19: Training cost (ce) is 0.028. Accuracy is 99.90%\n",
+      "INFO:mlp.optimisers:Epoch 19: Validation cost (ce) is 0.447. Accuracy is 89.01%\n",
+      "INFO:mlp.optimisers:Epoch 19: Took 10 seconds. Training speed 238 pps. Validation speed 1678 pps.\n",
+      "INFO:mlp.optimisers:Epoch 20: Training cost (ce) is 0.025. Accuracy is 99.90%\n",
+      "INFO:mlp.optimisers:Epoch 20: Validation cost (ce) is 0.466. Accuracy is 88.41%\n",
+      "INFO:mlp.optimisers:Epoch 20: Took 10 seconds. Training speed 233 pps. Validation speed 1710 pps.\n",
+      "INFO:mlp.optimisers:Epoch 21: Training cost (ce) is 0.023. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 21: Validation cost (ce) is 0.464. Accuracy is 88.72%\n",
+      "INFO:mlp.optimisers:Epoch 21: Took 10 seconds. Training speed 220 pps. Validation speed 1695 pps.\n",
+      "INFO:mlp.optimisers:Epoch 22: Training cost (ce) is 0.021. Accuracy is 99.90%\n",
+      "INFO:mlp.optimisers:Epoch 22: Validation cost (ce) is 0.465. Accuracy is 88.70%\n",
+      "INFO:mlp.optimisers:Epoch 22: Took 11 seconds. Training speed 201 pps. Validation speed 1695 pps.\n",
+      "INFO:mlp.optimisers:Epoch 23: Training cost (ce) is 0.019. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 23: Validation cost (ce) is 0.472. Accuracy is 88.55%\n",
+      "INFO:mlp.optimisers:Epoch 23: Took 11 seconds. Training speed 188 pps. Validation speed 1675 pps.\n",
+      "INFO:mlp.optimisers:Epoch 24: Training cost (ce) is 0.017. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 24: Validation cost (ce) is 0.477. Accuracy is 88.53%\n",
+      "INFO:mlp.optimisers:Epoch 24: Took 11 seconds. Training speed 197 pps. Validation speed 1640 pps.\n",
+      "INFO:mlp.optimisers:Epoch 25: Training cost (ce) is 0.016. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 25: Validation cost (ce) is 0.482. Accuracy is 88.59%\n",
+      "INFO:mlp.optimisers:Epoch 25: Took 11 seconds. Training speed 214 pps. Validation speed 1689 pps.\n",
+      "INFO:mlp.optimisers:Epoch 26: Training cost (ce) is 0.014. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 26: Validation cost (ce) is 0.482. Accuracy is 88.73%\n",
+      "INFO:mlp.optimisers:Epoch 26: Took 11 seconds. Training speed 210 pps. Validation speed 1675 pps.\n",
+      "INFO:mlp.optimisers:Epoch 27: Training cost (ce) is 0.014. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 27: Validation cost (ce) is 0.490. Accuracy is 88.65%\n",
+      "INFO:mlp.optimisers:Epoch 27: Took 12 seconds. Training speed 165 pps. Validation speed 1684 pps.\n",
+      "INFO:mlp.optimisers:Epoch 28: Training cost (ce) is 0.013. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 28: Validation cost (ce) is 0.496. Accuracy is 88.47%\n",
+      "INFO:mlp.optimisers:Epoch 28: Took 12 seconds. Training speed 164 pps. Validation speed 1672 pps.\n",
+      "INFO:mlp.optimisers:Epoch 29: Training cost (ce) is 0.012. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 29: Validation cost (ce) is 0.496. Accuracy is 88.55%\n",
+      "INFO:mlp.optimisers:Epoch 29: Took 12 seconds. Training speed 172 pps. Validation speed 1650 pps.\n",
+      "INFO:mlp.optimisers:Epoch 30: Training cost (ce) is 0.011. Accuracy is 100.00%\n",
+      "INFO:mlp.optimisers:Epoch 30: Validation cost (ce) is 0.500. Accuracy is 88.56%\n",
+      "INFO:mlp.optimisers:Epoch 30: Took 10 seconds. Training speed 235 pps. Validation speed 1667 pps.\n",
+      "INFO:root:Testing the model on test set:\n",
+      "INFO:root:MNIST test set accuracy is 88.10 %, cost (ce) is 0.497\n"
+     ]
+    }
+   ],
+   "source": [
+    "\n",
+    "from mlp.layers import MLP, Relu, Softmax #import required layer types\n",
+    "from mlp.optimisers import SGDOptimiser #import the optimiser\n",
+    "\n",
+    "from mlp.costs import CECost #import the cost we want to use for optimisation\n",
+    "from mlp.schedulers import LearningRateFixed\n",
+    "from scipy.optimize import leastsq\n",
+    "\n",
+    "rng = numpy.random.RandomState([2015,10,10])\n",
+    "\n",
+    "#some hyper-parameters\n",
+    "nhid = 800\n",
+    "learning_rate = 0.1\n",
+    "max_epochs = 30\n",
+    "cost = CECost()\n",
+    "    \n",
+    "stats = []\n",
+    "for layer in xrange(1, 2):\n",
+    "\n",
+    "    train_dp.reset()\n",
+    "    valid_dp.reset()\n",
+    "    test_dp.reset()\n",
+    "    \n",
+    "    #define the model\n",
+    "    model = MLP(cost=cost)\n",
+    "    model.add_layer(Relu(idim=784, odim=nhid, irange=0.05, rng=rng))\n",
+    "    for i in xrange(1, layer):\n",
+    "        logger.info(\"Stacking hidden layer (%s)\" % str(i+1))\n",
+    "        model.add_layer(Relu(idim=nhid, odim=nhid, irange=0.2, rng=rng))\n",
+    "    model.add_layer(Softmax(idim=nhid, odim=10, rng=rng))\n",
+    "\n",
+    "    # define the optimiser, here stochastic gradient descent\n",
+    "    # with fixed learning rate and max_epochs\n",
+    "    lr_scheduler = LearningRateFixed(learning_rate=learning_rate, max_epochs=max_epochs)\n",
+    "    optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
+    "\n",
+    "    logger.info('Training started...')\n",
+    "    tr_stats, valid_stats = optimiser.train(model, train_dp, valid_dp)\n",
+    "\n",
+    "    logger.info('Testing the model on test set:')\n",
+    "    tst_cost, tst_accuracy = optimiser.validate(model, test_dp)\n",
+    "    logger.info('MNIST test set accuracy is %.2f %%, cost (%s) is %.3f'%(tst_accuracy*100., cost.get_name(), tst_cost))\n",
+    "    \n",
+    "    stats.append((tr_stats, valid_stats, (tst_cost, tst_accuracy)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 3: Implement Maxout\n",
+    "\n",
+    "As with the previous two exercises, your implementation should follow the conventions used to build the Linear, Sigmoid and Softmax layers. For now, implement only non-overlapping pools (i.e. each of the linear activations $a_{j}, a_{j+1}, \\ldots, a_{j+K-1}$ belongs to exactly one pool). As before, test your solution by training a one-hidden-layer model with 100 hidden units, similar to the one used in Task 3a in the coursework. Use the same optimisation hyper-parameters (learning rate, initial weights range) as you used for ReLU models. Tune the pool size $K$ (but keep the total number of parameters fixed).\n",
+    "\n",
+    "Note: The Max operator reduces dimensionality, hence for example, to get 100 hidden maxout units with pooling size set to $K=2$ the size of linear part needs to be set to $100K$ (assuming non-overlapping pools). This affects how you compute the total number of weights in the model.\n",
+    "\n",
+    "Implementation tips: To back-propagate through the maxout layer, one needs to keep track of which linear activation $a_{j}, a_{j+1}, \\ldots, a_{j+K-1}$ was the maximum in each pool. A convenient way to do so is to store the indices of the maximum units in the fprop function and then, in the backprop stage, pass the gradient only through those. For example, one can build an auxiliary matrix where each element is either 1 (if the unit was the maximum, and was passed forward through the max operator for a given data-point) or 0 otherwise. Then in the backward pass it suffices to upsample the maxout *igrads* signal to the linear layer dimension and element-wise multiply it by the aforementioned auxiliary matrix.\n",
+    "\n",
+    "*Optional:* Implement the generic pooling mechanism by introducing an additional *stride* hyper-parameter $0<S\\leq K$. It specifies how many units you move to build the next pool. For instance, for non-overlapping pooling with $S=K=3$ one would build the first two maxout units as: $h_1=\\max(a_1,a_2,a_3)$ and $h_2=\\max(a_4,a_5,a_6)$. However, after setting $S=1$ the pools should share some subset of linear activations: $h_1=\\max(a_1,a_2,a_3)$ and $h_2=\\max(a_2,a_3,a_4)$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO:root:Training started...\n"
+     ]
+    },
+    {
+     "ename": "ValueError",
+     "evalue": "total size of new array must be unchanged",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
+      "\u001b[1;32m<ipython-input-6-f32b43e5484f>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m     38\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     39\u001b[0m     \u001b[0mlogger\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minfo\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'Training started...'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 40\u001b[1;33m     \u001b[0mtr_stats\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalid_stats\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0moptimiser\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtrain\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtrain_dp\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalid_dp\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     41\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     42\u001b[0m     \u001b[0mlogger\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0minfo\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'Testing the model on test set:'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+      "\u001b[1;32m/afs/inf.ed.ac.uk/user/s11/s1136550/Dropbox/repos/mlpractical/mlp/optimisers.pyc\u001b[0m in \u001b[0;36mtrain\u001b[1;34m(self, model, train_iterator, valid_iterator)\u001b[0m\n\u001b[0;32m    160\u001b[0m         \u001b[1;31m# do the initial validation\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    161\u001b[0m         \u001b[0mtrain_iterator\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreset\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 162\u001b[1;33m         \u001b[0mtr_nll\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtr_acc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvalidate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtrain_iterator\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0ml1_weight\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0ml2_weight\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    163\u001b[0m         logger.info('Epoch %i: Training cost (%s) for initial model is %.3f. Accuracy is %.2f%%'\n\u001b[0;32m    164\u001b[0m                     % (self.lr_scheduler.epoch, cost_name, tr_nll, tr_acc * 100.))\n",
+      "\u001b[1;32m/afs/inf.ed.ac.uk/user/s11/s1136550/Dropbox/repos/mlpractical/mlp/optimisers.pyc\u001b[0m in \u001b[0;36mvalidate\u001b[1;34m(self, model, valid_iterator, l1_weight, l2_weight)\u001b[0m\n\u001b[0;32m     34\u001b[0m         \u001b[0macc_list\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnll_list\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m[\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     35\u001b[0m         \u001b[1;32mfor\u001b[0m \u001b[0mx\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mt\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mvalid_iterator\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 36\u001b[1;33m             \u001b[0my\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfprop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mx\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     37\u001b[0m             \u001b[0mnll_list\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcost\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcost\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mt\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     38\u001b[0m             \u001b[0macc_list\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnumpy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclassification_accuracy\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mt\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+      "\u001b[1;32m/afs/inf.ed.ac.uk/user/s11/s1136550/Dropbox/repos/mlpractical/mlp/layers.pyc\u001b[0m in \u001b[0;36mfprop\u001b[1;34m(self, x)\u001b[0m\n\u001b[0;32m     49\u001b[0m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mactivations\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mx\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     50\u001b[0m         \u001b[1;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mxrange\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mlayers\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 51\u001b[1;33m             \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mactivations\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m+\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mlayers\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfprop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mactivations\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mi\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     52\u001b[0m         \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mactivations\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m-\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     53\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
+      "\u001b[1;32m/afs/inf.ed.ac.uk/user/s11/s1136550/Dropbox/repos/mlpractical/mlp/layers.pyc\u001b[0m in \u001b[0;36mfprop\u001b[1;34m(self, inputs)\u001b[0m\n\u001b[0;32m    466\u001b[0m         \u001b[1;31m#get the linear activations\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    467\u001b[0m         \u001b[0ma\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msuper\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mMaxout\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfprop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0minputs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 468\u001b[1;33m         \u001b[0mar\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0ma\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0modim\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mk\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    469\u001b[0m         \u001b[0mh\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mh_argmax\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mmax_and_argmax\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mar\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxes\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkeepdims_argmax\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    470\u001b[0m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mh_argmax\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mh_argmax\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
+      "\u001b[1;31mValueError\u001b[0m: total size of new array must be unchanged"
+     ]
+    }
+   ],
+   "source": [
+    "#%load_ext autoreload\n",
+    "%autoreload\n",
+    "from mlp.layers import MLP, Maxout, Softmax #import required layer types\n",
+    "from mlp.optimisers import SGDOptimiser #import the optimiser\n",
+    "\n",
+    "from mlp.costs import CECost #import the cost we want to use for optimisation\n",
+    "from mlp.schedulers import LearningRateFixed\n",
+    "from scipy.optimize import leastsq\n",
+    "\n",
+    "rng = numpy.random.RandomState([2015,10,10])\n",
+    "\n",
+    "#some hyper-parameters\n",
+    "nhid = 800\n",
+    "learning_rate = 0.1\n",
+    "k = 2\n",
+    "max_epochs = 30\n",
+    "cost = CECost()\n",
+    "    \n",
+    "stats = []\n",
+    "for layer in xrange(1, 2):\n",
+    "\n",
+    "    train_dp.reset()\n",
+    "    valid_dp.reset()\n",
+    "    test_dp.reset()\n",
+    "    \n",
+    "    #define the model\n",
+    "    model = MLP(cost=cost)\n",
+    "    model.add_layer(Maxout(idim=784, odim=nhid, k=k, irange=0.05, rng=rng))\n",
+    "    for i in xrange(1, layer):\n",
+    "        logger.info(\"Stacking hidden layer (%s)\" % str(i+1))\n",
+    "        model.add_layer(Maxout(idim=nhid, odim=nhid, k=k, irange=0.2, rng=rng))\n",
+    "    model.add_layer(Softmax(idim=nhid, odim=10, rng=rng))\n",
+    "\n",
+    "    # define the optimiser, here stochastic gradient descent\n",
+    "    # with fixed learning rate and max_epochs\n",
+    "    lr_scheduler = LearningRateFixed(learning_rate=learning_rate, max_epochs=max_epochs)\n",
+    "    optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
+    "\n",
+    "    logger.info('Training started...')\n",
+    "    tr_stats, valid_stats = optimiser.train(model, train_dp, valid_dp)\n",
+    "\n",
+    "    logger.info('Testing the model on test set:')\n",
+    "    tst_cost, tst_accuracy = optimiser.validate(model, test_dp)\n",
+    "    logger.info('MNIST test set accuracy is %.2f %%, cost (%s) is %.3f'%(tst_accuracy*100., cost.get_name(), tst_cost))\n",
+    "    \n",
+    "    stats.append((tr_stats, valid_stats, (tst_cost, tst_accuracy)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 4: Train all the above models with dropout\n",
+    "\n",
+    "Try all of the above non-linearities with dropout training. Use the dropout hyper-parameters $\\{p_{inp}, p_{hid}\\}$ that worked best for sigmoid models from the previous lab.\n",
+    "\n",
+    "Note: the code for dropout you were asked to implement last week has not been given as a solution for this week - as a result you need to move/merge the required dropout parts from your previous *lab4* branch (or implement it if you haven't already done so). \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "#This one is a simple merge of above experiments with last exercise in previous tutorial."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
--- a/06_MLP_Coursework2_Introduction.ipynb
+++ b/06_MLP_Coursework2_Introduction.ipynb
@ -10,7 +10,7 @@
    "\n",
    "# Store the intermediate results (check-pointing and pickling)\n",
    "\n",
-    "Once you have finished a certain task it is a good idea to check-point your current notebook's status (logs, plots and whatever else has been stored in the notebook). By doing this, you can always revert to this state later when necessary. You can do this by going to menus `File->Save and Checkpoint` and `File->Revert to Checkpoint`.\n",
+    "Once you have finished a certain task it is a good idea to check-point your current notebook's status (logs, plots and whatever else has been stored in the notebook). By doing this, you can always revert to this state later when necessary (without rerunning experiments). You can do this by going to menus `File->Save and Checkpoint` and `File->Revert to Checkpoint`.\n",
    "\n",
    "The other good practice would be to dump to disk models and produced statistics. You can easily do it in python by using pickles, as in the following example."
   ]
@ -59,7 +59,8 @@
    "* `numpy.amax` - the same as with sum\n",
    "* `numpy.transpose` - can specify which axes you want to get transposed in a tensor\n",
    "* `numpy.argmax` - gives you the argument (index) of the maximum value in a tensor\n",
-    "* `numpy.flatten` - collapses the n-dimensional tensor into vector\n",
+    "* `numpy.ndarray.flatten` - collapses the n-dimensional tensor into a vector (always returns a copy)\n",
+    "* `numpy.ravel` - collapses the n-dimensional tensor into a vector (returns a view whenever possible)\n",
    "* `numpy.reshape` - allows to reshape tensor into another (valid from data perspective) tensor (matrix, vector) with different shape (but the same number of total elements)\n",
    "* `numpy.rot90` - rotate a matrix by 90 (or multiply of 90) degrees counter-clockwise\n",
    "* `numpy.newaxis` - adds an axis with dimension 1 (handy for keeping tensor shapes compatible with expected broadcasting)\n",
@ -181,7 +182,7 @@
    "f'(x) \\approx \\frac{f(x+\\epsilon) - f(x)}{\\epsilon}\n",
    "$\n",
    "\n",
-    "Because $\\epsilon$ is usually very small (1e-4 or smaller) it is recommended (due to finite precision of numerical machines) to use the centred variant (which was implemented in mlp.utils):\n",
+    "Because $\\epsilon$ is usually very small (1e-4 or smaller) it is recommended (due to finite precision of numerical machines) to use the centred variant (which was implemented in `mlp.utils`):\n",
    "\n",
    "$\n",
    "f'(x) \\approx \\frac{f(x+\\epsilon) - f(x-\\epsilon)}{2\\epsilon}\n",
@ -270,7 +271,7 @@
    "\n",
    "## Using Cython for the crucial bottleneck pieces\n",
    "\n",
-    "Cython will compile them to C and the code should be comparable in terms of efficiency to numpy using similar operations in numpy. Of course, one can only rely on numpy. Slicing numpy across many dimensions gets much more complicated than working than working with vectors and matrices and we do undersand those can be confusing for some people. Hence, we allow the basic implementation (with any penalty or preference from our side) to be loop only based (which is perhaps much easier to comprehend and debug).\n",
+    "Cython will compile them to C, and the resulting code should be comparable in efficiency to similar operations implemented in numpy. Of course, one can rely on numpy only. Slicing numpy arrays across many dimensions gets much more complicated than working with vectors and matrices, and we do understand that this can be confusing for some people. Hence, we allow the basic implementation of convolution and/or pooling (without any penalty or preference from our side) to be loop-only based (which is perhaps much easier to comprehend and debug).\n",
    "\n",
    "Below we give an example cython code for matrix-matrix dot function from the second tutorial so you can see the basic differences and compare obtained speeds. They give you all the necessary pattern needed to implement naive (reasonably efficient) convolution. Naive looping in (native) python is gonna be *very* slow.\n",
    "\n",
@ -278,7 +279,7 @@
    " * [Cython, language basics](http://docs.cython.org/src/userguide/language_basics.html#language-basics)\n",
    " * [Cython, basic tutorial](http://docs.cython.org/src/tutorial/cython_tutorial.html)\n",
    " * [Cython in ipython notebooks](http://docs.cython.org/src/quickstart/build.html)\n",
-    " * [A tutorial on how to optimise the cython code](http://docs.cython.org/src/tutorial/numpy.html) (a working example is actually a simple convolution code)\n",
+    " * [A tutorial on how to optimise the cython code](http://docs.cython.org/src/tutorial/numpy.html) (a working example is actually a simple convolution code, do not use it `as is`)\n",
    " \n",
    "\n",
    "Before you proceed, make sure `cython` is installed (it should have been installed together with scipy). In case the imports below do not work, type the following while staying in the activated virtual environment:\n",
@ -420,7 +421,7 @@
   "source": [
    "You can optimise the code further as in the [linked](http://docs.cython.org/src/tutorial/numpy.html) tutorial. However, the above example seems to be a reasonable compromise for developing the code - it gives reasonably accelerated code, with all the safety checks one may expect during development (checking bounds of indices, whether types of variables match, tracking overflows etc.). Look [here](http://docs.cython.org/src/reference/compilation.html) for more optimisation decorators one can use to speed things up.\n",
    "\n",
-    "Below we do some benchmarks on each of the above functions. Notice huge speed-up from going from non-optimised cython code to optimised one (on my machine, 643ms -> 6.35ms - this is 2 orders!). It's still around two times slower than BLAS accelerated numpy.dot routine (non-cached result is around 3.3ms). But our method just benchmarks the dot product, operation that has been optimised incredibly well in numerical libraries. Of course, we **do not** want you to use this code for dot products and you should rely on functions provided by numpy (whenever reasonably possible). The above code was just given as an example how to produce much more efficient code with very small effort. In many scenarios (convolution is an example) the code is more complex than a single dot product and some looping is necessary anyway, especially when dealing with multi-dimensional tensors where atom operations using direct loop-based indexing may be much easier to comprehend (and debug) than a direct multi-dimensional manipulation of numpy tensors."
+    "Below we do some benchmarks on each of the above functions. Notice the huge speed-up when going from non-optimised cython code to the optimised one (on my machine, 643ms -> 6.35ms - this is two orders of magnitude!). It's still around two times slower than the BLAS accelerated numpy.dot routine (non-cached result is around 3.3ms). But our method just benchmarks the dot product, an operation that has been optimised incredibly well in numerical libraries. Of course, we **do not** want you to use this code for dot products and you should rely on functions provided by numpy (whenever reasonably possible). The above code was just given as an example of how to produce much more efficient code with very small programming effort. In many scenarios (convolution is an example) the code is more complex than a single dot product and some looping is necessary anyway, especially when dealing with multi-dimensional tensors where atomic operations using direct loop-based indexing may be much easier to comprehend (and debug) than a direct multi-dimensional manipulation of numpy tensors."
   ]
  },
  {
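As a side note to the finite-difference formulas in the notebook text above, the gap between the forward and the centred variant can be seen on a function with a known derivative. This is a minimal sketch, not the routine from `mlp.utils`; the helper names `forward_difference` and `centred_difference` are made up for this illustration.

```python
import numpy


def forward_difference(f, x, eps=1e-4):
    # one-sided estimate, truncation error shrinks linearly with eps
    return (f(x + eps) - f(x)) / eps


def centred_difference(f, x, eps=1e-4):
    # two-sided estimate, truncation error shrinks quadratically with eps
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


x = 0.5
exact = 1.0 - numpy.tanh(x) ** 2  # analytic derivative of tanh
err_forward = abs(forward_difference(numpy.tanh, x) - exact)
err_centred = abs(centred_difference(numpy.tanh, x) - exact)
```

For the same `eps`, `err_centred` should come out several orders of magnitude smaller than `err_forward`, which is why the centred variant is the one recommended for gradient checking.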
--- a/mlp/layers.py
+++ b/mlp/layers.py
@ -18,7 +18,7 @@ class MLP(object):
    through the model (for a mini-batch), which is required to compute
    the gradients for the parameters
    """
-    def __init__(self, cost):
+    def __init__(self, cost, rng=None):

        assert isinstance(cost, Cost), (
            "Cost needs to be of type mlp.costs.Cost, got %s" % type(cost)
@ -31,6 +31,11 @@ class MLP(object):
                         # for a given minibatch and each layer
        self.cost = cost

+        if rng is None:
+            self.rng = numpy.random.RandomState([2015,11,11])
+        else:
+            self.rng = rng
+
    def fprop(self, x):
        """

@ -46,6 +51,32 @@ class MLP(object):
            self.activations[i+1] = self.layers[i].fprop(self.activations[i])
        return self.activations[-1]

+    def fprop_dropout(self, x, dp_scheduler):
+        """
+        :param x: mini-batch of data-points x
+        :param dp_scheduler: dropout scheduler
+        :return: y (top layer activation) which is an estimate of y given x
+        """
+
+        if len(self.activations) != len(self.layers) + 1:
+            self.activations = [None]*(len(self.layers) + 1)
+
+        p_inp, p_hid = dp_scheduler.get_rate()
+
+        d_inp = 1
+        p_inp_scaler, p_hid_scaler = 1.0/p_inp, 1.0/p_hid
+        if p_inp < 1:
+            d_inp = self.rng.binomial(1, p_inp, size=x.shape)
+
+        self.activations[0] = p_inp_scaler*d_inp*x
+        for i in xrange(0, len(self.layers)):
+            d_hid = 1
+            if p_hid < 1 and i > 0:
+                d_hid = p_hid_scaler*self.rng.binomial(1, p_hid, size=self.activations[i].shape)
+            self.activations[i+1] = self.layers[i].fprop(d_hid*self.activations[i])
+
+        return self.activations[-1]
+
    def bprop(self, cost_grad):
        """
        :param cost_grad: matrix -- grad of the cost w.r.t y
@ -258,8 +289,20 @@ class Linear(Layer):
        since W and b are only layer's parameters
        """

-        grad_W = numpy.dot(inputs.T, deltas)
-        grad_b = numpy.sum(deltas, axis=0)
+        #you could basically use different scalers for biases
+        #and weights, but it is not implemented here like this
+        l2_W_penalty, l2_b_penalty = 0, 0
+        if l2_weight > 0:
+            l2_W_penalty = l2_weight*self.W
+            l2_b_penalty = l2_weight*self.b
+
+        l1_W_penalty, l1_b_penalty = 0, 0
+        if l1_weight > 0:
+            l1_W_penalty = l1_weight*numpy.sign(self.W)
+            l1_b_penalty = l1_weight*numpy.sign(self.b)
+
+        grad_W = numpy.dot(inputs.T, deltas) + l2_W_penalty + l1_W_penalty
+        grad_b = numpy.sum(deltas, axis=0) + l2_b_penalty + l1_b_penalty

        return [grad_W, grad_b]

@ -323,12 +366,12 @@ class Softmax(Linear):
                                      odim,
                                      rng=rng,
                                      irange=irange)
-    
+
    def fprop(self, inputs):

        # compute the linear outputs
        a = super(Softmax, self).fprop(inputs)
-        # apply numerical stabilisation by subtracting max 
+        # apply numerical stabilisation by subtracting max
        # from each row (not required for the coursework)
        # then compute exponent
        assert a.ndim in [1, 2], (
@ -355,3 +398,88 @@ class Softmax(Linear):

    def get_name(self):
        return 'softmax'
+
+
+class Relu(Linear):
+    def __init__(self,  idim, odim,
+                 rng=None,
+                 irange=0.1):
+
+        super(Relu, self).__init__(idim, odim, rng, irange)
+
+    def fprop(self, inputs):
+        #get the linear activations
+        a = super(Relu, self).fprop(inputs)
+        h = numpy.clip(a, 0, 20.0)
+        #h = numpy.maximum(a, 0)
+        return h
+
+    def bprop(self, h, igrads):
+        deltas = ((h > 0) & (h < 20.0))*igrads  # no gradient where fprop clipped the activation
+        ___, ograds = super(Relu, self).bprop(h=None, igrads=deltas)
+        return deltas, ograds
+
+    def cost_bprop(self, h, igrads, cost):
+        raise NotImplementedError('Relu.bprop_cost method not implemented '
+                                      'for the %s cost' % cost.get_name())
+
+    def get_name(self):
+        return 'relu'
+
+
+class Tanh(Linear):
+    def __init__(self,  idim, odim,
+                 rng=None,
+                 irange=0.1):
+
+        super(Tanh, self).__init__(idim, odim, rng, irange)
+
+    def fprop(self, inputs):
+        #get the linear activations
+        a = super(Tanh, self).fprop(inputs)
+        numpy.clip(a, -30.0, 30.0, out=a)
+        h = numpy.tanh(a)
+        return h
+
+    def bprop(self, h, igrads):
+        deltas = (1.0 - h**2) * igrads
+        ___, ograds = super(Tanh, self).bprop(h=None, igrads=deltas)
+        return deltas, ograds
+
+    def cost_bprop(self, h, igrads, cost):
+        raise NotImplementedError('Tanh.bprop_cost method not implemented '
+                                      'for the %s cost' % cost.get_name())
+
+    def get_name(self):
+        return 'tanh'
+
+
+class Maxout(Linear):
+    def __init__(self,  idim, odim, k,
+                 rng=None,
+                 irange=0.05):
+
+        super(Maxout, self).__init__(idim, odim*k, rng, irange)  # k linear units per pool
+        self.k = k
+
+    def fprop(self, inputs):
+        #get the linear activations and group them into non-overlapping pools of size k
+        a = super(Maxout, self).fprop(inputs)
+        ar = a.reshape(a.shape[0], -1, self.k)
+        h = ar.max(axis=2)
+        self.h_argmax = (numpy.arange(self.k) == ar.argmax(axis=2)[:, :, numpy.newaxis])
+        return h
+
+    def bprop(self, h, igrads):
+        #upsample igrads to the linear layer dimension (broadcast over the pool axis)
+        igrads_up = igrads.reshape(igrads.shape[0], -1, 1)
+        deltas = (igrads_up * self.h_argmax).reshape(igrads.shape[0], -1)
+        ___, ograds = super(Maxout, self).bprop(h=None, igrads=deltas)
+        return deltas, ograds
+
+    def cost_bprop(self, h, igrads, cost):
+        raise NotImplementedError('Maxout.bprop_cost method not implemented '
+                                      'for the %s cost' % cost.get_name())
+
+    def get_name(self):
+        return 'maxout'
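The pooling and gradient routing used by a maxout layer can be tried out in isolation, without the `Linear` parent class. The following is a minimal sketch assuming non-overlapping pools of `k` consecutive activations; `maxout_fprop` and `maxout_bprop` are hypothetical stand-alone helpers written for this illustration, not part of `mlp.layers`.

```python
import numpy


def maxout_fprop(a, k):
    """Max over non-overlapping pools of k consecutive linear activations."""
    ar = a.reshape(a.shape[0], -1, k)      # (batch, pools, k)
    h = ar.max(axis=2)                     # (batch, pools)
    winners = ar.argmax(axis=2)            # index of the maximum in each pool
    # one-hot mask of shape (batch, pools, k) marking the winning unit
    mask = numpy.arange(k) == winners[:, :, numpy.newaxis]
    return h, mask


def maxout_bprop(igrads, mask):
    """Route each pooled gradient back to the unit that won its pool."""
    igrads_up = igrads[:, :, numpy.newaxis]  # broadcast over the pool axis
    return (igrads_up * mask).reshape(igrads.shape[0], -1)


a = numpy.array([[1.0, 3.0, -2.0, 0.5],
                 [4.0, 0.0, 2.0, 7.0]])
h, mask = maxout_fprop(a, k=2)
deltas = maxout_bprop(numpy.ones_like(h), mask)
```

Building the mask from `argmax` rather than from an equality test against the maximum means that, when two units in a pool tie, only one of them receives the gradient.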
--- a/mlp/optimisers.py
+++ b/mlp/optimisers.py
@ -112,8 +112,12 @@ class SGDOptimiser(Optimiser):

        acc_list, nll_list = [], []
        for x, t in train_iterator:
+
            # get the prediction
-            y = model.fprop(x)
+            if self.dp_scheduler is not None:
+                y = model.fprop_dropout(x, self.dp_scheduler)
+            else:
+                y = model.fprop(x)

            # compute the cost and grad of the cost w.r.t y
            cost = model.cost.cost(y, t)
--- a/mlp/schedulers.py
+++ b/mlp/schedulers.py
@ -153,3 +153,18 @@ class LearningRateNewBob(LearningRateScheduler):
            self.epoch += 1
    
        return self.rate
+
+
+class DropoutFixed(LearningRateList):
+
+    def __init__(self, p_inp_keep, p_hid_keep):
+        assert 0 < p_inp_keep <= 1 and 0 < p_hid_keep <= 1, (
+            "Dropout 'keep' probabilities are supposed to be in (0, 1] range"
+        )
+        super(DropoutFixed, self).__init__([(p_inp_keep, p_hid_keep)], max_epochs=999)
+
+    def get_rate(self):
+        return self.lr_list[0]
+
+    def get_next_rate(self, current_error=None):
+        return self.get_rate()
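The reason `fprop_dropout` multiplies by `1/p` for a given "keep" probability `p` is that the expected value of the masked activations then matches the unmasked ones, so no rescaling is needed at test time. This is a small stand-alone sketch of that property; `dropout_mask` is a hypothetical helper for illustration and is not part of the `mlp` package.

```python
import numpy


def dropout_mask(shape, p_keep, rng):
    """Binary keep-mask scaled by 1/p_keep, so that E[mask * x] == x."""
    if p_keep >= 1:
        return numpy.ones(shape)
    return rng.binomial(1, p_keep, size=shape) / float(p_keep)


rng = numpy.random.RandomState(0)
x = numpy.ones((1000, 50))
dropped = dropout_mask(x.shape, 0.5, rng) * x
no_dropout = dropout_mask((2, 2), 1.0, rng)
```

With `p_keep=0.5` every entry of `dropped` is either 0 or 2, and the mean over the whole array stays close to 1.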