From b291e0dbd13f4ae5ef4a9465464124ca29a7a944 Mon Sep 17 00:00:00 2001
From: pswietojanski
Date: Sun, 15 Nov 2015 21:36:35 +0000
Subject: [PATCH] solutions 04 and 05
---
05_Transfer_functions_solution.ipynb | 1417 +++++++++++++-------------
mlp/layers.py | 7 +-
2 files changed, 714 insertions(+), 710 deletions(-)
diff --git a/05_Transfer_functions_solution.ipynb b/05_Transfer_functions_solution.ipynb
index 17b5016..1f6d7d8 100644
--- a/05_Transfer_functions_solution.ipynb
+++ b/05_Transfer_functions_solution.ipynb
@@ -1,709 +1,708 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Introduction\n",
- "\n",
- "This tutorial focuses on implementing alternatives to the sigmoid transfer function for hidden units. (*Transfer functions* are also called *activation functions* or *nonlinearities*.) First, we will work with the hyperbolic tangent (tanh) and then with unbounded (or partially unbounded) piecewise linear functions: Rectified Linear Units (ReLU) and Maxout.\n",
- "\n",
- "\n",
- "## Virtual environments\n",
- "\n",
- "Before you proceed onwards, remember to activate your virtual environment by typing `activate_mlp` or `source ~/mlpractical/venv/bin/activate` (or if you did the original install the \"comfy way\" type: `workon mlpractical`).\n",
- "\n",
- "\n",
- "## Syncing the git repository\n",
- "\n",
- "Look here for more details. But in short, we recommend creating a separate branch for this lab, as follows:\n",
- "\n",
- "1. Enter the mlpractical directory `cd ~/mlpractical/repo-mlp`\n",
- "2. List the branches and check which are currently active by typing: `git branch`\n",
- "3. If you have followed our recommendations, you should be in the `lab4` branch. Please commit your local changes to the repo index by typing:\n",
- "```\n",
- "git commit -am \"finished lab4\"\n",
- "```\n",
- "4. Now you can switch to `master` branch by typing: \n",
- "```\n",
- "git checkout master\n",
- " ```\n",
- "5. Update the repository (this assumes that master does not have any conflicts; if there are some, have a look here):\n",
- "```\n",
- "git pull\n",
- "```\n",
- "6. And now, create the new branch & switch to it by typing:\n",
- "```\n",
- "git checkout -b lab5\n",
- "```"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Overview of alternative transfer functions\n",
- "\n",
- "Now, we briefly summarise some other possible choices for hidden layer transfer functions.\n",
- "\n",
- "## Tanh\n",
- "\n",
- "Given a linear activation $a_{i}$, tanh implements the following operation:\n",
- "\n",
- "(1) $h_i(a_i) = \\mbox{tanh}(a_i) = \\frac{\\exp(a_i) - \\exp(-a_i)}{\\exp(a_i) + \\exp(-a_i)}$\n",
- "\n",
- "Hence, the derivative of $h_i$ with respect to $a_i$ is:\n",
- "\n",
- "(2) $\\begin{align}\n",
- "\\frac{\\partial h_i}{\\partial a_i} &= 1 - h^2_i\n",
- "\\end{align}\n",
- "$\n",
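- "\n",
- "As a short sketch of where (2) comes from (a standard quotient-rule derivation added for illustration; the symbols $u$ and $v$ are introduced only for this sketch): write $u = \\exp(a_i) - \\exp(-a_i)$ and $v = \\exp(a_i) + \\exp(-a_i)$, so that $h_i = u/v$, $\\partial u / \\partial a_i = v$ and $\\partial v / \\partial a_i = u$. Then\n",
- "\n",
- "$\\begin{align}\n",
- "\\frac{\\partial h_i}{\\partial a_i} &= \\frac{v \\cdot v - u \\cdot u}{v^2} = 1 - \\frac{u^2}{v^2} = 1 - h^2_i\n",
- "\\end{align}\n",
- "$\n",
- "\n",
- "One practical consequence is that the backward pass can reuse the output $h_i$ already computed in the forward pass.\n",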
- "\n",
- "\n",
- "## ReLU\n",
- "\n",
- "Given a linear activation $a_{i}$, ReLU implements the following operation:\n",
- "\n",
- "(3) $h_i(a_i) = \\max(0, a_i)$\n",
- "\n",
- "Hence, the gradient is:\n",
- "\n",
- "(4) $\\begin{align}\n",
- "\\frac{\\partial h_i}{\\partial a_i} &=\n",
- "\\begin{cases}\n",
- " 1 & \\quad \\text{if } a_i > 0 \\\\\n",
- " 0 & \\quad \\text{if } a_i \\leq 0 \\\\\n",
- "\\end{cases}\n",
- "\\end{align}\n",
- "$\n",
- "\n",
- "ReLU implements a form of data-driven sparsity: on average the activations are sparse (many of them are 0), but the particular sparsity pattern depends on the data-point. This is different from the sparsity in the model's parameters that one can obtain with $L1$ regularisation, as the latter affects all data-points in the same way.\n",
- "\n",
- "## Maxout\n",
- "\n",
- "Maxout is an example of a data-driven non-linearity in which the transfer function can be learned from data. That is, the model can build a non-linear transfer function from piecewise linear components. Depending on the number of linear regions used in the pooling operator (given by the parameter $K$), these linear components can approximate arbitrary functions, such as ReLU, abs, etc.\n",
- "\n",
- "Given some subset (group, pool) of $K$ linear activations $a_{j}, a_{j+1}, \\ldots, a_{j+K-1}$ at the $l$-th layer, maxout implements the following operation:\n",
- "\n",
- "(5) $h_i(a_j, a_{j+1}, \\ldots, a_{j+K-1}) = \\max(a_j, a_{j+1}, \\ldots, a_{j+K-1})$\n",
- "\n",
- "Hence, the gradient of $h_i$ w.r.t. the pooling region $a_{j}, a_{j+1}, \\ldots, a_{j+K-1}$ is:\n",
- "\n",
- "(6) $\\begin{align}\n",
- "\\frac{\\partial h_i}{\\partial (a_j, a_{j+1}, \\ldots, a_{j+K-1})} &=\n",
- "\\begin{cases}\n",
- " 1 & \\quad \\text{for the max activation} \\\\\n",
- " 0 & \\quad \\text{otherwise} \\\\\n",
- "\\end{cases}\n",
- "\\end{align}\n",
- "$\n",
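- "\n",
- "As a small worked illustration of (5) and (6) (the numbers are made up for this sketch): for a pool of $K=3$ activations with values $(0.5, -1.0, 2.0)$, the maxout output is $h_i = \\max(0.5, -1.0, 2.0) = 2.0$ and the gradient w.r.t. the pool is $(0, 0, 1)$, i.e. only the third activation receives the back-propagated signal. Note also that with $K=2$ and one of the two linear components fixed to $0$, (5) reduces to the ReLU in (3).\n",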
- "\n",
- "Implementation tips are given in Exercise 3.\n",
- "\n",
- "# On weight initialisation\n",
- "\n",
- "Activation functions directly affect the \"network dynamics\", that is, the magnitudes of the statistics each layer produces. For example, *squashing* non-linearities like sigmoid or tanh bring the linear activations to a certain bounded range. ReLU, on the contrary, has an unbounded positive side. This directly affects all statistics collected in the forward and backward passes as well as the gradients w.r.t. the parameters, and hence also the pace at which the model learns. That is why the learning rate usually needs to be tuned to the characteristics of the non-linearities used.\n",
- "\n",
- "Another important hyperparameter is the initial range used to initialise the weight matrices. We have largely ignored it so far (although if you did further experiments in coursework 1, you may have found setting it had an effect on training deeper networks with 4 or 5 hidden layers). However, for sigmoidal non-linearities (sigmoid, tanh) the initialisation range is an important hyperparameter and a considerable amount of research has been put into determining what is the best strategy for choosing it. In fact, one of the early triggers of the recent resurgence of deep learning was pre-training - techniques for initialising weights in an unsupervised manner so that one can effectively train deeper models in supervised fashion later. \n",
- "\n",
- "## Sigmoidal transfer functions\n",
- "\n",
- "Y. LeCun in [Efficient Backprop](http://link.springer.com/chapter/10.1007%2F3-540-49430-8_2) recommends the following setting of the initial range $r$ for sigmoidal units (assuming that the data has been normalised to zero mean, unit variance): \n",
- "\n",
- "(7) $ r = \\frac{1}{\\sqrt{N_{IN}}} $\n",
- "\n",
- "where $N_{IN}$ is the number of inputs to the given layer and the weights are then sampled from the (usually uniform) distribution $U(-r,r)$. The motivation is to keep the initial forward-pass signal in the linear region of the sigmoid non-linearity so that the gradients are large enough for training to proceed (note that the sigmoidal non-linearities saturate when activations are either very positive or very negative, leading to very small gradients and hence poor learning dynamics).\n",
- "\n",
- "The initialisation in (7), however, leads to different magnitudes of activations/gradients at different layers (due to the multiplicative nature of the computations). More recently, [Glorot et al.](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) proposed the so-called *normalised initialisation*, which ensures that the variance of the forward signal (activations) is approximately the same in each layer. The same applies to the gradients obtained in the backward pass.\n",
- "\n",
- "The $r$ in the *normalised initialisation* for $\\mbox{tanh}$ non-linearity is then:\n",
- "\n",
- "(8) $ r = \\frac{\\sqrt{6}}{\\sqrt{N_{IN}+N_{OUT}}} $\n",
- "\n",
- "For the sigmoid (logistic) non-linearity, to get similar characteristics, one should scale $r$ in (8) by 4, that is:\n",
- "\n",
- "(9) $ r = \\frac{4\\sqrt{6}}{\\sqrt{N_{IN}+N_{OUT}}} $\n",
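- "\n",
- "As a worked example (the layer sizes $N_{IN}=784$ and $N_{OUT}=100$ are an assumption chosen to match the first hidden layer in the exercises below, and the subscripts on $r$ are just labels for the equation each value comes from), the three ranges evaluate to approximately:\n",
- "\n",
- "$\\begin{align}\n",
- "r_{(7)} &= \\frac{1}{\\sqrt{784}} = \\frac{1}{28} \\approx 0.036 \\\\\n",
- "r_{(8)} &= \\frac{\\sqrt{6}}{\\sqrt{784+100}} \\approx \\frac{2.449}{29.732} \\approx 0.082 \\\\\n",
- "r_{(9)} &= 4 \\, r_{(8)} \\approx 0.330\n",
- "\\end{align}\n",
- "$\n",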
- "\n",
- "## Piece-wise linear transfer functions (ReLU, Maxout)\n",
- "\n",
- "For unbounded transfer functions, initialisation is not as crucial as for sigmoidal ones. This is because their gradients do not diminish (they are actually more likely to explode) and they do not saturate (ReLU saturates at 0, but not on the positive slope, where the gradient is 1 everywhere). (In practice ReLU is sometimes \"clipped\" with a maximum value, typically 20.)\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Exercise 1: Implement the tanh transfer function\n",
- "\n",
- "Your implementation should follow the code conventions used to build other layer types (for example, Sigmoid and Softmax). Test your solution by training a one-hidden-layer model with 100 hidden units, similar to the one used in Task 3a in the coursework.\n",
- "\n",
- "Tune the learning rate and compare the initial ranges in equations (7) and (8). Note that there might not be much difference for a one-hidden-layer model, but you can easily notice a substantial gain from using (8) (or (9) for the logistic sigmoid activation) for deeper models, for example, the 5-hidden-layer network from the first coursework.\n",
- "\n",
- "Implementation tip: Use numpy.tanh() to compute the non-linearity. Use the irange argument when creating the given layer type to provide the initial sampling range."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {
- "collapsed": false
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "INFO:root:Initialising data providers...\n"
- ]
- }
- ],
- "source": [
- "import numpy\n",
- "import logging\n",
- "from mlp.dataset import MNISTDataProvider\n",
- "\n",
- "logger = logging.getLogger()\n",
- "logger.setLevel(logging.INFO)\n",
- "\n",
- "# Note, you were asked to run the experiments on all data and smaller models. \n",
- "# Here I am running the exercises on 1000 training data-points only (similar to regularisation notebook)\n",
- "logger.info('Initialising data providers...')\n",
- "train_dp = MNISTDataProvider(dset='train', batch_size=10, max_num_batches=100, randomize=True)\n",
- "valid_dp = MNISTDataProvider(dset='valid', batch_size=10000, max_num_batches=-10, randomize=False)\n",
- "test_dp = MNISTDataProvider(dset='eval', batch_size=10000, max_num_batches=-10, randomize=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "collapsed": false,
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "INFO:root:Training started...\n",
- "INFO:mlp.optimisers:Epoch 0: Training cost (ce) for initial model is 2.319. Accuracy is 10.50%\n",
- "INFO:mlp.optimisers:Epoch 0: Validation cost (ce) for initial model is 2.315. Accuracy is 11.33%\n",
- "INFO:mlp.optimisers:Epoch 1: Training cost (ce) is 1.048. Accuracy is 66.30%\n",
- "INFO:mlp.optimisers:Epoch 1: Validation cost (ce) is 0.571. Accuracy is 82.72%\n",
- "INFO:mlp.optimisers:Epoch 1: Took 2 seconds. Training speed 764 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 2: Training cost (ce) is 0.485. Accuracy is 84.40%\n",
- "INFO:mlp.optimisers:Epoch 2: Validation cost (ce) is 0.455. Accuracy is 86.58%\n",
- "INFO:mlp.optimisers:Epoch 2: Took 2 seconds. Training speed 720 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 3: Training cost (ce) is 0.362. Accuracy is 87.70%\n",
- "INFO:mlp.optimisers:Epoch 3: Validation cost (ce) is 0.435. Accuracy is 86.90%\n",
- "INFO:mlp.optimisers:Epoch 3: Took 2 seconds. Training speed 788 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 4: Training cost (ce) is 0.251. Accuracy is 92.10%\n",
- "INFO:mlp.optimisers:Epoch 4: Validation cost (ce) is 0.417. Accuracy is 88.09%\n",
- "INFO:mlp.optimisers:Epoch 4: Took 2 seconds. Training speed 788 pps. Validation speed 13159 pps.\n",
- "INFO:mlp.optimisers:Epoch 5: Training cost (ce) is 0.175. Accuracy is 95.40%\n",
- "INFO:mlp.optimisers:Epoch 5: Validation cost (ce) is 0.405. Accuracy is 88.16%\n",
- "INFO:mlp.optimisers:Epoch 5: Took 2 seconds. Training speed 776 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 6: Training cost (ce) is 0.121. Accuracy is 96.40%\n",
- "INFO:mlp.optimisers:Epoch 6: Validation cost (ce) is 0.458. Accuracy is 87.24%\n",
- "INFO:mlp.optimisers:Epoch 6: Took 2 seconds. Training speed 690 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 7: Training cost (ce) is 0.091. Accuracy is 97.90%\n",
- "INFO:mlp.optimisers:Epoch 7: Validation cost (ce) is 0.418. Accuracy is 88.37%\n",
- "INFO:mlp.optimisers:Epoch 7: Took 2 seconds. Training speed 841 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 8: Training cost (ce) is 0.065. Accuracy is 98.70%\n",
- "INFO:mlp.optimisers:Epoch 8: Validation cost (ce) is 0.400. Accuracy is 89.44%\n",
- "INFO:mlp.optimisers:Epoch 8: Took 2 seconds. Training speed 794 pps. Validation speed 12501 pps.\n",
- "INFO:mlp.optimisers:Epoch 9: Training cost (ce) is 0.043. Accuracy is 99.30%\n",
- "INFO:mlp.optimisers:Epoch 9: Validation cost (ce) is 0.406. Accuracy is 89.35%\n",
- "INFO:mlp.optimisers:Epoch 9: Took 2 seconds. Training speed 747 pps. Validation speed 12822 pps.\n",
- "INFO:mlp.optimisers:Epoch 10: Training cost (ce) is 0.029. Accuracy is 99.50%\n",
- "INFO:mlp.optimisers:Epoch 10: Validation cost (ce) is 0.410. Accuracy is 89.69%\n",
- "INFO:mlp.optimisers:Epoch 10: Took 2 seconds. Training speed 953 pps. Validation speed 12822 pps.\n",
- "INFO:mlp.optimisers:Epoch 11: Training cost (ce) is 0.023. Accuracy is 99.80%\n",
- "INFO:mlp.optimisers:Epoch 11: Validation cost (ce) is 0.424. Accuracy is 89.41%\n",
- "INFO:mlp.optimisers:Epoch 11: Took 2 seconds. Training speed 953 pps. Validation speed 13159 pps.\n",
- "INFO:mlp.optimisers:Epoch 12: Training cost (ce) is 0.018. Accuracy is 99.80%\n",
- "INFO:mlp.optimisers:Epoch 12: Validation cost (ce) is 0.429. Accuracy is 89.50%\n",
- "INFO:mlp.optimisers:Epoch 12: Took 2 seconds. Training speed 870 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 13: Training cost (ce) is 0.015. Accuracy is 99.90%\n",
- "INFO:mlp.optimisers:Epoch 13: Validation cost (ce) is 0.428. Accuracy is 89.58%\n",
- "INFO:mlp.optimisers:Epoch 13: Took 2 seconds. Training speed 878 pps. Validation speed 12822 pps.\n",
- "INFO:mlp.optimisers:Epoch 14: Training cost (ce) is 0.012. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 14: Validation cost (ce) is 0.436. Accuracy is 89.41%\n",
- "INFO:mlp.optimisers:Epoch 14: Took 2 seconds. Training speed 894 pps. Validation speed 12501 pps.\n",
- "INFO:mlp.optimisers:Epoch 15: Training cost (ce) is 0.010. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 15: Validation cost (ce) is 0.433. Accuracy is 89.64%\n",
- "INFO:mlp.optimisers:Epoch 15: Took 2 seconds. Training speed 834 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 16: Training cost (ce) is 0.009. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 16: Validation cost (ce) is 0.439. Accuracy is 89.63%\n",
- "INFO:mlp.optimisers:Epoch 16: Took 2 seconds. Training speed 820 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 17: Training cost (ce) is 0.008. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 17: Validation cost (ce) is 0.443. Accuracy is 89.78%\n",
- "INFO:mlp.optimisers:Epoch 17: Took 2 seconds. Training speed 902 pps. Validation speed 12501 pps.\n",
- "INFO:mlp.optimisers:Epoch 18: Training cost (ce) is 0.008. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 18: Validation cost (ce) is 0.446. Accuracy is 89.72%\n",
- "INFO:mlp.optimisers:Epoch 18: Took 2 seconds. Training speed 870 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 19: Training cost (ce) is 0.007. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 19: Validation cost (ce) is 0.445. Accuracy is 89.83%\n",
- "INFO:mlp.optimisers:Epoch 19: Took 2 seconds. Training speed 918 pps. Validation speed 12822 pps.\n",
- "INFO:mlp.optimisers:Epoch 20: Training cost (ce) is 0.007. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 20: Validation cost (ce) is 0.451. Accuracy is 89.75%\n",
- "INFO:mlp.optimisers:Epoch 20: Took 2 seconds. Training speed 834 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 21: Training cost (ce) is 0.006. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 21: Validation cost (ce) is 0.454. Accuracy is 89.80%\n",
- "INFO:mlp.optimisers:Epoch 21: Took 2 seconds. Training speed 902 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 22: Training cost (ce) is 0.006. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 22: Validation cost (ce) is 0.456. Accuracy is 89.77%\n",
- "INFO:mlp.optimisers:Epoch 22: Took 2 seconds. Training speed 863 pps. Validation speed 12501 pps.\n",
- "INFO:mlp.optimisers:Epoch 23: Training cost (ce) is 0.005. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 23: Validation cost (ce) is 0.458. Accuracy is 89.84%\n",
- "INFO:mlp.optimisers:Epoch 23: Took 2 seconds. Training speed 820 pps. Validation speed 12822 pps.\n",
- "INFO:mlp.optimisers:Epoch 24: Training cost (ce) is 0.005. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 24: Validation cost (ce) is 0.460. Accuracy is 89.80%\n",
- "INFO:mlp.optimisers:Epoch 24: Took 2 seconds. Training speed 856 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 25: Training cost (ce) is 0.005. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 25: Validation cost (ce) is 0.461. Accuracy is 89.86%\n",
- "INFO:mlp.optimisers:Epoch 25: Took 2 seconds. Training speed 902 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 26: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 26: Validation cost (ce) is 0.467. Accuracy is 89.86%\n",
- "INFO:mlp.optimisers:Epoch 26: Took 2 seconds. Training speed 910 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 27: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 27: Validation cost (ce) is 0.466. Accuracy is 89.81%\n",
- "INFO:mlp.optimisers:Epoch 27: Took 2 seconds. Training speed 827 pps. Validation speed 12501 pps.\n",
- "INFO:mlp.optimisers:Epoch 28: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 28: Validation cost (ce) is 0.468. Accuracy is 89.84%\n",
- "INFO:mlp.optimisers:Epoch 28: Took 2 seconds. Training speed 894 pps. Validation speed 12501 pps.\n",
- "INFO:mlp.optimisers:Epoch 29: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 29: Validation cost (ce) is 0.471. Accuracy is 89.83%\n",
- "INFO:mlp.optimisers:Epoch 29: Took 2 seconds. Training speed 902 pps. Validation speed 12659 pps.\n",
- "INFO:mlp.optimisers:Epoch 30: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 30: Validation cost (ce) is 0.473. Accuracy is 89.81%\n",
- "INFO:mlp.optimisers:Epoch 30: Took 2 seconds. Training speed 918 pps. Validation speed 11495 pps.\n",
- "INFO:root:Testing the model on test set:\n",
- "INFO:root:MNIST test set accuracy is 89.33 %, cost (ce) is 0.480\n"
- ]
- }
- ],
- "source": [
- "\n",
- "from mlp.layers import MLP, Tanh, Softmax #import required layer types\n",
- "from mlp.optimisers import SGDOptimiser #import the optimiser\n",
- "from mlp.costs import CECost #import the cost we want to use for optimisation\n",
- "from mlp.schedulers import LearningRateFixed\n",
- "\n",
- "rng = numpy.random.RandomState([2015,10,10])\n",
- "\n",
- "#some hyper-parameters\n",
- "nhid = 100\n",
- "learning_rate = 0.2\n",
- "max_epochs = 30\n",
- "cost = CECost()\n",
- " \n",
- "stats = []\n",
- "for layer in xrange(1, 2):\n",
- "\n",
- "    train_dp.reset()\n",
- "    valid_dp.reset()\n",
- "    test_dp.reset()\n",
- "\n",
- "    #define the model\n",
- "    model = MLP(cost=cost)\n",
- "    model.add_layer(Tanh(idim=784, odim=nhid, irange=1./numpy.sqrt(784), rng=rng))\n",
- "    for i in xrange(1, layer):\n",
- "        logger.info(\"Stacking hidden layer (%s)\" % str(i+1))\n",
- "        model.add_layer(Tanh(idim=nhid, odim=nhid, irange=0.2, rng=rng))\n",
- "    model.add_layer(Softmax(idim=nhid, odim=10, rng=rng))\n",
- "\n",
- "    # define the optimiser, here stochastic gradient descent\n",
- "    # with fixed learning rate and max_epochs\n",
- "    lr_scheduler = LearningRateFixed(learning_rate=learning_rate, max_epochs=max_epochs)\n",
- "    optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
- "\n",
- "    logger.info('Training started...')\n",
- "    tr_stats, valid_stats = optimiser.train(model, train_dp, valid_dp)\n",
- "\n",
- "    logger.info('Testing the model on test set:')\n",
- "    tst_cost, tst_accuracy = optimiser.validate(model, test_dp)\n",
- "    logger.info('MNIST test set accuracy is %.2f %%, cost (%s) is %.3f'%(tst_accuracy*100., cost.get_name(), tst_cost))\n",
- "\n",
- "    stats.append((tr_stats, valid_stats, (tst_cost, tst_accuracy)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Exercise 2: Implement ReLU\n",
- "\n",
- "Again, your implementation should follow the conventions used to build Linear, Sigmoid and Softmax layers. As in exercise 1, test your solution by training a one-hidden-layer model with 100 hidden units, similar to the one used in Task 3a in the coursework. Tune the learning rate (start with the initial one set to 0.1) with the initial weight range set to 0.05."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "collapsed": false,
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "INFO:root:Training started...\n",
- "INFO:mlp.optimisers:Epoch 0: Training cost (ce) for initial model is 2.317. Accuracy is 15.20%\n",
- "INFO:mlp.optimisers:Epoch 0: Validation cost (ce) for initial model is 2.317. Accuracy is 13.98%\n",
- "INFO:mlp.optimisers:Epoch 1: Training cost (ce) is 1.452. Accuracy is 60.20%\n",
- "INFO:mlp.optimisers:Epoch 1: Validation cost (ce) is 0.750. Accuracy is 81.69%\n",
- "INFO:mlp.optimisers:Epoch 1: Took 2 seconds. Training speed 820 pps. Validation speed 13335 pps.\n",
- "INFO:mlp.optimisers:Epoch 2: Training cost (ce) is 0.632. Accuracy is 82.40%\n",
- "INFO:mlp.optimisers:Epoch 2: Validation cost (ce) is 0.503. Accuracy is 86.74%\n",
- "INFO:mlp.optimisers:Epoch 2: Took 2 seconds. Training speed 788 pps. Validation speed 13335 pps.\n",
- "INFO:mlp.optimisers:Epoch 3: Training cost (ce) is 0.446. Accuracy is 87.50%\n",
- "INFO:mlp.optimisers:Epoch 3: Validation cost (ce) is 0.438. Accuracy is 87.24%\n",
- "INFO:mlp.optimisers:Epoch 3: Took 2 seconds. Training speed 788 pps. Validation speed 13159 pps.\n",
- "INFO:mlp.optimisers:Epoch 4: Training cost (ce) is 0.359. Accuracy is 90.00%\n",
- "INFO:mlp.optimisers:Epoch 4: Validation cost (ce) is 0.444. Accuracy is 86.44%\n",
- "INFO:mlp.optimisers:Epoch 4: Took 2 seconds. Training speed 710 pps. Validation speed 12822 pps.\n",
- "INFO:mlp.optimisers:Epoch 5: Training cost (ce) is 0.304. Accuracy is 90.80%\n",
- "INFO:mlp.optimisers:Epoch 5: Validation cost (ce) is 0.408. Accuracy is 87.90%\n",
- "INFO:mlp.optimisers:Epoch 5: Took 2 seconds. Training speed 782 pps. Validation speed 13335 pps.\n",
- "INFO:mlp.optimisers:Epoch 6: Training cost (ce) is 0.255. Accuracy is 93.80%\n",
- "INFO:mlp.optimisers:Epoch 6: Validation cost (ce) is 0.390. Accuracy is 88.56%\n",
- "INFO:mlp.optimisers:Epoch 6: Took 2 seconds. Training speed 782 pps. Validation speed 13515 pps.\n",
- "INFO:mlp.optimisers:Epoch 7: Training cost (ce) is 0.225. Accuracy is 93.80%\n",
- "INFO:mlp.optimisers:Epoch 7: Validation cost (ce) is 0.425. Accuracy is 87.46%\n",
- "INFO:mlp.optimisers:Epoch 7: Took 2 seconds. Training speed 725 pps. Validation speed 13890 pps.\n",
- "INFO:mlp.optimisers:Epoch 8: Training cost (ce) is 0.205. Accuracy is 95.00%\n",
- "INFO:mlp.optimisers:Epoch 8: Validation cost (ce) is 0.399. Accuracy is 88.51%\n",
- "INFO:mlp.optimisers:Epoch 8: Took 2 seconds. Training speed 834 pps. Validation speed 13335 pps.\n",
- "INFO:mlp.optimisers:Epoch 9: Training cost (ce) is 0.163. Accuracy is 96.20%\n",
- "INFO:mlp.optimisers:Epoch 9: Validation cost (ce) is 0.474. Accuracy is 85.74%\n",
- "INFO:mlp.optimisers:Epoch 9: Took 2 seconds. Training speed 814 pps. Validation speed 13700 pps.\n",
- "INFO:mlp.optimisers:Epoch 10: Training cost (ce) is 0.140. Accuracy is 96.40%\n",
- "INFO:mlp.optimisers:Epoch 10: Validation cost (ce) is 0.418. Accuracy is 88.06%\n",
- "INFO:mlp.optimisers:Epoch 10: Took 2 seconds. Training speed 788 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 11: Training cost (ce) is 0.120. Accuracy is 97.70%\n",
- "INFO:mlp.optimisers:Epoch 11: Validation cost (ce) is 0.427. Accuracy is 87.93%\n",
- "INFO:mlp.optimisers:Epoch 11: Took 2 seconds. Training speed 731 pps. Validation speed 13335 pps.\n",
- "INFO:mlp.optimisers:Epoch 12: Training cost (ce) is 0.105. Accuracy is 98.10%\n",
- "INFO:mlp.optimisers:Epoch 12: Validation cost (ce) is 0.449. Accuracy is 87.51%\n",
- "INFO:mlp.optimisers:Epoch 12: Took 2 seconds. Training speed 725 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 13: Training cost (ce) is 0.088. Accuracy is 98.50%\n",
- "INFO:mlp.optimisers:Epoch 13: Validation cost (ce) is 0.479. Accuracy is 87.14%\n",
- "INFO:mlp.optimisers:Epoch 13: Took 2 seconds. Training speed 715 pps. Validation speed 12822 pps.\n",
- "INFO:mlp.optimisers:Epoch 14: Training cost (ce) is 0.086. Accuracy is 98.30%\n",
- "INFO:mlp.optimisers:Epoch 14: Validation cost (ce) is 0.455. Accuracy is 87.97%\n",
- "INFO:mlp.optimisers:Epoch 14: Took 2 seconds. Training speed 681 pps. Validation speed 13515 pps.\n",
- "INFO:mlp.optimisers:Epoch 15: Training cost (ce) is 0.070. Accuracy is 99.00%\n",
- "INFO:mlp.optimisers:Epoch 15: Validation cost (ce) is 0.465. Accuracy is 87.76%\n",
- "INFO:mlp.optimisers:Epoch 15: Took 2 seconds. Training speed 758 pps. Validation speed 12988 pps.\n",
- "INFO:mlp.optimisers:Epoch 16: Training cost (ce) is 0.054. Accuracy is 99.50%\n",
- "INFO:mlp.optimisers:Epoch 16: Validation cost (ce) is 0.467. Accuracy is 88.07%\n",
- "INFO:mlp.optimisers:Epoch 16: Took 2 seconds. Training speed 776 pps. Validation speed 12501 pps.\n",
- "INFO:mlp.optimisers:Epoch 17: Training cost (ce) is 0.052. Accuracy is 99.60%\n",
- "INFO:mlp.optimisers:Epoch 17: Validation cost (ce) is 0.485. Accuracy is 87.69%\n",
- "INFO:mlp.optimisers:Epoch 17: Took 2 seconds. Training speed 801 pps. Validation speed 13159 pps.\n",
- "INFO:mlp.optimisers:Epoch 18: Training cost (ce) is 0.042. Accuracy is 99.70%\n",
- "INFO:mlp.optimisers:Epoch 18: Validation cost (ce) is 0.500. Accuracy is 87.61%\n",
- "INFO:mlp.optimisers:Epoch 18: Took 2 seconds. Training speed 686 pps. Validation speed 13335 pps.\n",
- "INFO:mlp.optimisers:Epoch 19: Training cost (ce) is 0.035. Accuracy is 99.80%\n",
- "INFO:mlp.optimisers:Epoch 19: Validation cost (ce) is 0.499. Accuracy is 87.76%\n",
- "INFO:mlp.optimisers:Epoch 19: Took 2 seconds. Training speed 764 pps. Validation speed 12822 pps.\n",
- "INFO:mlp.optimisers:Epoch 20: Training cost (ce) is 0.031. Accuracy is 99.80%\n",
- "INFO:mlp.optimisers:Epoch 20: Validation cost (ce) is 0.506. Accuracy is 87.77%\n",
- "INFO:mlp.optimisers:Epoch 20: Took 2 seconds. Training speed 801 pps. Validation speed 13159 pps.\n",
- "INFO:mlp.optimisers:Epoch 21: Training cost (ce) is 0.027. Accuracy is 99.90%\n",
- "INFO:mlp.optimisers:Epoch 21: Validation cost (ce) is 0.506. Accuracy is 87.61%\n",
- "INFO:mlp.optimisers:Epoch 21: Took 2 seconds. Training speed 731 pps. Validation speed 13515 pps.\n",
- "INFO:mlp.optimisers:Epoch 22: Training cost (ce) is 0.025. Accuracy is 99.80%\n",
- "INFO:mlp.optimisers:Epoch 22: Validation cost (ce) is 0.516. Accuracy is 87.68%\n",
- "INFO:mlp.optimisers:Epoch 22: Took 2 seconds. Training speed 758 pps. Validation speed 13335 pps.\n",
- "INFO:mlp.optimisers:Epoch 23: Training cost (ce) is 0.022. Accuracy is 99.90%\n",
- "INFO:mlp.optimisers:Epoch 23: Validation cost (ce) is 0.529. Accuracy is 87.33%\n",
- "INFO:mlp.optimisers:Epoch 23: Took 2 seconds. Training speed 770 pps. Validation speed 13159 pps.\n",
- "INFO:mlp.optimisers:Epoch 24: Training cost (ce) is 0.020. Accuracy is 99.90%\n",
- "INFO:mlp.optimisers:Epoch 24: Validation cost (ce) is 0.526. Accuracy is 87.70%\n",
- "INFO:mlp.optimisers:Epoch 24: Took 2 seconds. Training speed 715 pps. Validation speed 13700 pps.\n",
- "INFO:mlp.optimisers:Epoch 25: Training cost (ce) is 0.018. Accuracy is 99.90%\n",
- "INFO:mlp.optimisers:Epoch 25: Validation cost (ce) is 0.535. Accuracy is 87.55%\n",
- "INFO:mlp.optimisers:Epoch 25: Took 2 seconds. Training speed 770 pps. Validation speed 13159 pps.\n",
- "INFO:mlp.optimisers:Epoch 26: Training cost (ce) is 0.016. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 26: Validation cost (ce) is 0.540. Accuracy is 87.55%\n",
- "INFO:mlp.optimisers:Epoch 26: Took 2 seconds. Training speed 741 pps. Validation speed 13515 pps.\n",
- "INFO:mlp.optimisers:Epoch 27: Training cost (ce) is 0.015. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 27: Validation cost (ce) is 0.546. Accuracy is 87.57%\n",
- "INFO:mlp.optimisers:Epoch 27: Took 2 seconds. Training speed 681 pps. Validation speed 13515 pps.\n",
- "INFO:mlp.optimisers:Epoch 28: Training cost (ce) is 0.014. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 28: Validation cost (ce) is 0.546. Accuracy is 87.78%\n",
- "INFO:mlp.optimisers:Epoch 28: Took 2 seconds. Training speed 753 pps. Validation speed 13700 pps.\n",
- "INFO:mlp.optimisers:Epoch 29: Training cost (ce) is 0.012. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 29: Validation cost (ce) is 0.556. Accuracy is 87.56%\n",
- "INFO:mlp.optimisers:Epoch 29: Took 2 seconds. Training speed 758 pps. Validation speed 13700 pps.\n",
- "INFO:mlp.optimisers:Epoch 30: Training cost (ce) is 0.012. Accuracy is 100.00%\n",
- "INFO:mlp.optimisers:Epoch 30: Validation cost (ce) is 0.558. Accuracy is 87.74%\n",
- "INFO:mlp.optimisers:Epoch 30: Took 2 seconds. Training speed 747 pps. Validation speed 13515 pps.\n",
- "INFO:root:Testing the model on test set:\n",
- "INFO:root:MNIST test set accuracy is 87.19 %, cost (ce) is 0.554\n"
- ]
- }
- ],
- "source": [
- "\n",
- "from mlp.layers import MLP, Relu, Softmax \n",
- "from mlp.optimisers import SGDOptimiser \n",
- "from mlp.costs import CECost \n",
- "from mlp.schedulers import LearningRateFixed\n",
- "\n",
- "rng = numpy.random.RandomState([2015,10,10])\n",
- "\n",
- "#some hyper-parameters\n",
- "nhid = 100\n",
- "learning_rate = 0.1\n",
- "max_epochs = 30\n",
- "cost = CECost()\n",
- " \n",
- "stats = []\n",
- "for layer in xrange(1, 2):\n",
- "\n",
- "    train_dp.reset()\n",
- "    valid_dp.reset()\n",
- "    test_dp.reset()\n",
- "\n",
- "    #define the model\n",
- "    model = MLP(cost=cost)\n",
- "    model.add_layer(Relu(idim=784, odim=nhid, irange=0.05, rng=rng))\n",
- "    for i in xrange(1, layer):\n",
- "        logger.info(\"Stacking hidden layer (%s)\" % str(i+1))\n",
- "        model.add_layer(Relu(idim=nhid, odim=nhid, irange=0.2, rng=rng))\n",
- "    model.add_layer(Softmax(idim=nhid, odim=10, rng=rng))\n",
- "\n",
- "    # define the optimiser, here stochastic gradient descent\n",
- "    # with fixed learning rate and max_epochs\n",
- "    lr_scheduler = LearningRateFixed(learning_rate=learning_rate, max_epochs=max_epochs)\n",
- "    optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
- "\n",
- "    logger.info('Training started...')\n",
- "    tr_stats, valid_stats = optimiser.train(model, train_dp, valid_dp)\n",
- "\n",
- "    logger.info('Testing the model on test set:')\n",
- "    tst_cost, tst_accuracy = optimiser.validate(model, test_dp)\n",
- "    logger.info('MNIST test set accuracy is %.2f %%, cost (%s) is %.3f'%(tst_accuracy*100., cost.get_name(), tst_cost))\n",
- "\n",
- "    stats.append((tr_stats, valid_stats, (tst_cost, tst_accuracy)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Exercise 3: Implement Maxout\n",
- "\n",
- "As with the previous two exercises, your implementation should follow the conventions used to build the Linear, Sigmoid and Softmax layers. For now implement only non-overlapping pools (i.e. the pool in which all activations $a_{j}, a_{j+1}, \\ldots, a_{j+K}$ belong to only one pool). As before, test your solution by training a one-hidden-layer model with 100 hidden units, similiar to the one used in Task 3a in the coursework. Use the same optimisation hyper-parameters (learning rate, initial weights range) as you used for ReLU models. Tune the pool size $K$ (but keep the number of total parameters fixed).\n",
- "\n",
- "Note: The Max operator reduces dimensionality, hence for example, to get 100 hidden maxout units with pooling size set to $K=2$ the size of linear part needs to be set to $100K$ (assuming non-overlapping pools). This affects how you compute the total number of weights in the model.\n",
- "\n",
- "Implementation tips: To back-propagate through the maxout layer, one needs to keep track of which linear activation $a_{j}, a_{j+1}, \\ldots, a_{j+K}$ was the maximum in each pool. The convenient way to do so is by storing the indices of the maximum units in the fprop function and then in the backprop stage pass the gradient only through those (i.e. for example, one can build an auxiliary matrix where each element is either 1 (if unit was maximum, and passed forward through the max operator for a given data-point) or 0 otherwise. Then in the backward pass it suffices to upsample the maxout *igrads* signal to the linear layer dimension and element-wise multiply by the aforemenioned auxiliary matrix.\n",
- "\n",
- "*Optional:* Implement the generic pooling mechanism by introducing an additional *stride* hyper-parameter $0here for more details. But in short, we recommend to create a separate branch for this lab, as follows:\n",
+ "\n",
+ "1. Enter the mlpractical directory `cd ~/mlpractical/repo-mlp`\n",
+ "2. List the branches and check which one is currently active by typing: `git branch`\n",
+ "3. If you have followed our recommendations, you should be in the `lab4` branch. Please commit your local changes to the repo index by typing:\n",
+ "```\n",
+ "git commit -am \"finished lab4\"\n",
+ "```\n",
+ "4. Now you can switch to `master` branch by typing: \n",
+ "```\n",
+ "git checkout master\n",
+ " ```\n",
+ "5. Update the repository (this assumes `master` does not have any conflicts; if there are some, have a look here):\n",
+ "```\n",
+ "git pull\n",
+ "```\n",
+ "6. And now, create the new branch & switch to it by typing:\n",
+ "```\n",
+ "git checkout -b lab5\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Overview of alternative transfer functions\n",
+ "\n",
+ "Now, we briefly summarise some other possible choices for hidden layer transfer functions.\n",
+ "\n",
+ "## Tanh\n",
+ "\n",
+ "Given a linear activation $a_{i}$, tanh implements the following operation:\n",
+ "\n",
+ "(1) $h_i(a_i) = \\mbox{tanh}(a_i) = \\frac{\\exp(a_i) - \\exp(-a_i)}{\\exp(a_i) + \\exp(-a_i)}$\n",
+ "\n",
+ "Hence, the derivative of $h_i$ with respect to $a_i$ is:\n",
+ "\n",
+ "(2) $\\begin{align}\n",
+ "\\frac{\\partial h_i}{\\partial a_i} &= 1 - h^2_i\n",
+ "\\end{align}\n",
+ "$\n",
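+ "\n",
+ "As a quick check of (2), here is a short derivation sketch that applies the quotient rule to (1); it introduces no new symbols:\n",
+ "\n",
+ "$\\begin{align}\n",
+ "\\frac{\\partial h_i}{\\partial a_i} &= \\frac{(\\exp(a_i) + \\exp(-a_i))^2 - (\\exp(a_i) - \\exp(-a_i))^2}{(\\exp(a_i) + \\exp(-a_i))^2} = 1 - \\mbox{tanh}^2(a_i) = 1 - h^2_i\n",
+ "\\end{align}\n",
+ "$\n",
+ "\n",
+ "For example, at $a_i = 0$ we get $h_i = 0$ and a gradient of 1, while for large $|a_i|$ the output approaches $\\pm 1$ and the gradient approaches 0. In an implementation, the backward pass can therefore reuse the output $h_i$ stored in the forward pass.\n",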
+ "\n",
+ "\n",
+ "## ReLU\n",
+ "\n",
+ "Given a linear activation $a_{i}$, ReLU implements the following operation:\n",
+ "\n",
+ "(3) $h_i(a_i) = \\max(0, a_i)$\n",
+ "\n",
+ "Hence, the gradient is:\n",
+ "\n",
+ "(4) $\\begin{align}\n",
+ "\\frac{\\partial h_i}{\\partial a_i} &=\n",
+ "\\begin{cases}\n",
+ " 1 & \\quad \\text{if } a_i > 0 \\\\\n",
+ " 0 & \\quad \\text{if } a_i \\leq 0 \\\\\n",
+ "\\end{cases}\n",
+ "\\end{align}\n",
+ "$\n",
+ "\n",
+ "ReLU implements a form of data-driven sparsity: on average the activations are sparse (many of them are 0), but the particular sparsity pattern depends on the data-point. This is different from the sparsity in the model's parameters that one can obtain with $L1$ regularisation, as the latter affects all data-points in the same way.\n",
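+ "\n",
+ "To illustrate (3) and (4), consider a small worked example with made-up numbers (chosen purely for illustration). For one data-point with linear activations $a = (-1.5, 0, 2)$:\n",
+ "\n",
+ "$h = (0, 0, 2), \\quad \\frac{\\partial h}{\\partial a} = (0, 0, 1)$\n",
+ "\n",
+ "For another data-point with $a' = (0.7, -0.2, -3)$:\n",
+ "\n",
+ "$h' = (0.7, 0, 0), \\quad \\frac{\\partial h'}{\\partial a'} = (1, 0, 0)$\n",
+ "\n",
+ "Both outputs are sparse, but different units are switched off for each data-point, and no gradient flows through the inactive units.\n",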
+ "\n",
+ "## Maxout\n",
+ "\n",
+ "Maxout is an example of a data-driven non-linearity, in which the transfer function itself is learned from data. That is, the model builds a non-linear transfer function from piecewise linear components. Depending on the number of linear pieces used in the pooling operator (given by the parameter $K$), these components can approximate arbitrary convex functions, such as ReLU or abs.\n",
+ "\n",
+ "Given some subset (group, pool) of $K$ linear activations $a_{j}, a_{j+1}, \\ldots, a_{j+K-1}$ at the $l$-th layer, maxout implements the following operation:\n",
+ "\n",
+ "(5) $h_i(a_j, a_{j+1}, \\ldots, a_{j+K-1}) = \\max(a_j, a_{j+1}, \\ldots, a_{j+K-1})$\n",
+ "\n",
+ "Hence, the gradient of $h_i$ w.r.t. the pooling region $a_{j}, a_{j+1}, \\ldots, a_{j+K-1}$ is:\n",
+ "\n",
+ "(6) $\\begin{align}\n",
+ "\\frac{\\partial h_i}{\\partial (a_j, a_{j+1}, \\ldots, a_{j+K-1})} &=\n",
+ "\\begin{cases}\n",
+ " 1 & \\quad \\text{for the max activation} \\\\\n",
+ " 0 & \\quad \\text{otherwise} \\\\\n",
+ "\\end{cases}\n",
+ "\\end{align}\n",
+ "$\n",
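+ "\n",
+ "As a small worked example of (5) and (6), with made-up numbers and the symbols $\\delta_1, \\delta_2$ introduced here only for illustration: take pool size $K=2$, non-overlapping pools, and four linear activations $(0.3, -1.2, 2.0, 0.5)$ for one data-point. The pools are $(0.3, -1.2)$ and $(2.0, 0.5)$, so\n",
+ "\n",
+ "$h = (\\max(0.3, -1.2), \\max(2.0, 0.5)) = (0.3, 2.0)$\n",
+ "\n",
+ "and the 0/1 pattern of (6) over the four linear activations is $(1, 0, 1, 0)$. If the gradients arriving at the two maxout outputs are $(\\delta_1, \\delta_2)$, the gradients passed back to the linear activations are\n",
+ "\n",
+ "$(\\delta_1, \\delta_1, \\delta_2, \\delta_2) \\odot (1, 0, 1, 0) = (\\delta_1, 0, \\delta_2, 0)$\n",
+ "\n",
+ "where $\\odot$ denotes element-wise multiplication.\n",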
+ "\n",
+ "Implementation tips are given in Exercise 3.\n",
+ "\n",
+ "# On weight initialisation\n",
+ "\n",
+ "Activation functions directly affect the \"network dynamics\", that is, the magnitudes of the statistics each layer produces. For example, *squashing* non-linearities like sigmoid or tanh bring the linear activations into a certain bounded range. ReLU, on the contrary, has an unbounded positive side. This directly affects all statistics collected in the forward and backward passes, as well as the gradients w.r.t. the parameters, and hence also the pace at which the model learns. That is why the learning rate usually needs to be tuned to the characteristics of the non-linearities used. \n",
+ "\n",
+ "Another important hyperparameter is the initial range used to initialise the weight matrices. We have largely ignored it so far (although if you did further experiments in coursework 1, you may have found setting it had an effect on training deeper networks with 4 or 5 hidden layers). However, for sigmoidal non-linearities (sigmoid, tanh) the initialisation range is an important hyperparameter and a considerable amount of research has been put into determining what is the best strategy for choosing it. In fact, one of the early triggers of the recent resurgence of deep learning was pre-training - techniques for initialising weights in an unsupervised manner so that one can effectively train deeper models in supervised fashion later. \n",
+ "\n",
+ "## Sigmoidal transfer functions\n",
+ "\n",
+ "Y. LeCun in [Efficient Backprop](http://link.springer.com/chapter/10.1007%2F3-540-49430-8_2) recommends the following setting of the initial range $r$ for sigmoidal units (assuming that the data has been normalised to zero mean, unit variance): \n",
+ "\n",
+ "(7) $ r = \\frac{1}{\\sqrt{N_{IN}}} $\n",
+ "\n",
+ "where $N_{IN}$ is the number of inputs to the given layer and the weights are then sampled from the (usually uniform) distribution $U(-r,r)$. The motivation is to keep the initial forward-pass signal in the linear region of the sigmoid non-linearity so that the gradients are large enough for training to proceed (note that the sigmoidal non-linearities saturate when activations are either very positive or very negative, leading to very small gradients and hence poor learning dynamics).\n",
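+ "\n",
+ "A rough sketch of why (7) keeps the initial signal away from saturation, under simplifying assumptions that are ours rather than taken from the reference (independent, zero-mean, unit-variance inputs $x_j$, zero biases, and weights $w_{ij}$ drawn independently from $U(-r,r)$): a uniform variable on $(-r, r)$ has variance $r^2/3$, so for $a_i = \\sum_j w_{ij} x_j$ we get\n",
+ "\n",
+ "$\\mathrm{Var}(a_i) = N_{IN} \\cdot \\frac{r^2}{3} = \\frac{1}{3} \\quad \\text{for } r = \\frac{1}{\\sqrt{N_{IN}}}$\n",
+ "\n",
+ "That is, the linear activations have a standard deviation of about 0.58 regardless of the layer width, so most of them start out in the region where the sigmoidal non-linearity is not yet strongly saturated.\n",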
+ "\n",
+ "The initialisation in (7), however, leads to different magnitudes of activations/gradients at different layers (due to the multiplicative nature of the computations). More recently, [Glorot et al.](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) proposed the so-called *normalised initialisation*, which ensures that the variance of the forward signal (activations) is approximately the same in each layer. The same applies to the gradients obtained in the backward pass. \n",
+ "\n",
+ "The $r$ in the *normalised initialisation* for $\\mbox{tanh}$ non-linearity is then:\n",
+ "\n",
+ "(8) $ r = \\frac{\\sqrt{6}}{\\sqrt{N_{IN}+N_{OUT}}} $\n",
+ "\n",
+ "For the sigmoid (logistic) non-linearity, to get similar characteristics, one should scale $r$ in (8) by 4, that is:\n",
+ "\n",
+ "(9) $ r = \\frac{4\\sqrt{6}}{\\sqrt{N_{IN}+N_{OUT}}} $\n",
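+ "\n",
+ "As a worked example of (7)-(9), take the layer sizes used in the experiments in this notebook (784 inputs, 100 hidden units); all values are rounded. For the first hidden layer, $N_{IN}=784$ and $N_{OUT}=100$:\n",
+ "\n",
+ "$r_{(7)} = \\frac{1}{\\sqrt{784}} \\approx 0.036, \\quad r_{(8)} = \\frac{\\sqrt{6}}{\\sqrt{884}} \\approx 0.082, \\quad r_{(9)} = 4 r_{(8)} \\approx 0.330$\n",
+ "\n",
+ "For a stacked hidden layer with $N_{IN}=N_{OUT}=100$:\n",
+ "\n",
+ "$r_{(7)} = \\frac{1}{\\sqrt{100}} = 0.1, \\quad r_{(8)} = \\frac{\\sqrt{6}}{\\sqrt{200}} \\approx 0.173, \\quad r_{(9)} = 4 r_{(8)} \\approx 0.693$\n",
+ "\n",
+ "The subscripts on $r$ are used here only to refer to the equation numbers.\n",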
+ "\n",
+ "## Piece-wise linear transfer functions (ReLU, Maxout)\n",
+ "\n",
+ "For unbounded transfer functions, initialisation is not as crucial as for sigmoidal ones. This is because their gradients do not diminish (they are actually more likely to explode) and they do not saturate (ReLU saturates at 0, but not on the positive side, where the gradient is 1 everywhere). In practice, ReLU is sometimes \"clipped\" at a maximum value, typically 20.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Exercise 1: Implement the tanh transfer function\n",
+ "\n",
+ "Your implementation should follow the code conventions used to build other layer types (for example, Sigmoid and Softmax). Test your solution by training a one-hidden-layer model with 100 hidden units, similar to the one used in Task 3a in the coursework. \n",
+ "\n",
+ "Tune the learning rate and compare the initial ranges in equations (7) and (8). Note that there might not be much difference for a one-hidden-layer model, but you can easily notice a substantial gain from using (8) (or (9) for the logistic sigmoid activation) for deeper models, for example, the 5 hidden-layer network from the first coursework.\n",
+ "\n",
+ "Implementation tip: Use numpy.tanh() to compute the non-linearity. Use the irange argument when creating the given layer type to provide the initial sampling range."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "INFO:root:Initialising data providers...\n"
+ ]
+ }
+ ],
+ "source": [
+ "import numpy\n",
+ "import logging\n",
+ "from mlp.dataset import MNISTDataProvider\n",
+ "\n",
+ "logger = logging.getLogger()\n",
+ "logger.setLevel(logging.INFO)\n",
+ "\n",
+ "# Note: you were asked to run the experiments on all data and smaller models. \n",
+ "# Here I am running the exercises on 1000 training data-points only (similar to the regularisation notebook)\n",
+ "logger.info('Initialising data providers...')\n",
+ "train_dp = MNISTDataProvider(dset='train', batch_size=10, max_num_batches=100, randomize=True)\n",
+ "valid_dp = MNISTDataProvider(dset='valid', batch_size=10000, max_num_batches=-10, randomize=False)\n",
+ "test_dp = MNISTDataProvider(dset='eval', batch_size=10000, max_num_batches=-10, randomize=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "collapsed": false,
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "INFO:root:Training started...\n",
+ "INFO:mlp.optimisers:Epoch 0: Training cost (ce) for initial model is 2.319. Accuracy is 10.50%\n",
+ "INFO:mlp.optimisers:Epoch 0: Validation cost (ce) for initial model is 2.315. Accuracy is 11.33%\n",
+ "INFO:mlp.optimisers:Epoch 1: Training cost (ce) is 1.048. Accuracy is 66.30%\n",
+ "INFO:mlp.optimisers:Epoch 1: Validation cost (ce) is 0.571. Accuracy is 82.72%\n",
+ "INFO:mlp.optimisers:Epoch 1: Took 2 seconds. Training speed 764 pps. Validation speed 12988 pps.\n",
+ "INFO:mlp.optimisers:Epoch 2: Training cost (ce) is 0.485. Accuracy is 84.40%\n",
+ "INFO:mlp.optimisers:Epoch 2: Validation cost (ce) is 0.455. Accuracy is 86.58%\n",
+ "INFO:mlp.optimisers:Epoch 2: Took 2 seconds. Training speed 720 pps. Validation speed 12988 pps.\n",
+ "INFO:mlp.optimisers:Epoch 3: Training cost (ce) is 0.362. Accuracy is 87.70%\n",
+ "INFO:mlp.optimisers:Epoch 3: Validation cost (ce) is 0.435. Accuracy is 86.90%\n",
+ "INFO:mlp.optimisers:Epoch 3: Took 2 seconds. Training speed 788 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 4: Training cost (ce) is 0.251. Accuracy is 92.10%\n",
+ "INFO:mlp.optimisers:Epoch 4: Validation cost (ce) is 0.417. Accuracy is 88.09%\n",
+ "INFO:mlp.optimisers:Epoch 4: Took 2 seconds. Training speed 788 pps. Validation speed 13159 pps.\n",
+ "INFO:mlp.optimisers:Epoch 5: Training cost (ce) is 0.175. Accuracy is 95.40%\n",
+ "INFO:mlp.optimisers:Epoch 5: Validation cost (ce) is 0.405. Accuracy is 88.16%\n",
+ "INFO:mlp.optimisers:Epoch 5: Took 2 seconds. Training speed 776 pps. Validation speed 12988 pps.\n",
+ "INFO:mlp.optimisers:Epoch 6: Training cost (ce) is 0.121. Accuracy is 96.40%\n",
+ "INFO:mlp.optimisers:Epoch 6: Validation cost (ce) is 0.458. Accuracy is 87.24%\n",
+ "INFO:mlp.optimisers:Epoch 6: Took 2 seconds. Training speed 690 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 7: Training cost (ce) is 0.091. Accuracy is 97.90%\n",
+ "INFO:mlp.optimisers:Epoch 7: Validation cost (ce) is 0.418. Accuracy is 88.37%\n",
+ "INFO:mlp.optimisers:Epoch 7: Took 2 seconds. Training speed 841 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 8: Training cost (ce) is 0.065. Accuracy is 98.70%\n",
+ "INFO:mlp.optimisers:Epoch 8: Validation cost (ce) is 0.400. Accuracy is 89.44%\n",
+ "INFO:mlp.optimisers:Epoch 8: Took 2 seconds. Training speed 794 pps. Validation speed 12501 pps.\n",
+ "INFO:mlp.optimisers:Epoch 9: Training cost (ce) is 0.043. Accuracy is 99.30%\n",
+ "INFO:mlp.optimisers:Epoch 9: Validation cost (ce) is 0.406. Accuracy is 89.35%\n",
+ "INFO:mlp.optimisers:Epoch 9: Took 2 seconds. Training speed 747 pps. Validation speed 12822 pps.\n",
+ "INFO:mlp.optimisers:Epoch 10: Training cost (ce) is 0.029. Accuracy is 99.50%\n",
+ "INFO:mlp.optimisers:Epoch 10: Validation cost (ce) is 0.410. Accuracy is 89.69%\n",
+ "INFO:mlp.optimisers:Epoch 10: Took 2 seconds. Training speed 953 pps. Validation speed 12822 pps.\n",
+ "INFO:mlp.optimisers:Epoch 11: Training cost (ce) is 0.023. Accuracy is 99.80%\n",
+ "INFO:mlp.optimisers:Epoch 11: Validation cost (ce) is 0.424. Accuracy is 89.41%\n",
+ "INFO:mlp.optimisers:Epoch 11: Took 2 seconds. Training speed 953 pps. Validation speed 13159 pps.\n",
+ "INFO:mlp.optimisers:Epoch 12: Training cost (ce) is 0.018. Accuracy is 99.80%\n",
+ "INFO:mlp.optimisers:Epoch 12: Validation cost (ce) is 0.429. Accuracy is 89.50%\n",
+ "INFO:mlp.optimisers:Epoch 12: Took 2 seconds. Training speed 870 pps. Validation speed 12988 pps.\n",
+ "INFO:mlp.optimisers:Epoch 13: Training cost (ce) is 0.015. Accuracy is 99.90%\n",
+ "INFO:mlp.optimisers:Epoch 13: Validation cost (ce) is 0.428. Accuracy is 89.58%\n",
+ "INFO:mlp.optimisers:Epoch 13: Took 2 seconds. Training speed 878 pps. Validation speed 12822 pps.\n",
+ "INFO:mlp.optimisers:Epoch 14: Training cost (ce) is 0.012. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 14: Validation cost (ce) is 0.436. Accuracy is 89.41%\n",
+ "INFO:mlp.optimisers:Epoch 14: Took 2 seconds. Training speed 894 pps. Validation speed 12501 pps.\n",
+ "INFO:mlp.optimisers:Epoch 15: Training cost (ce) is 0.010. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 15: Validation cost (ce) is 0.433. Accuracy is 89.64%\n",
+ "INFO:mlp.optimisers:Epoch 15: Took 2 seconds. Training speed 834 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 16: Training cost (ce) is 0.009. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 16: Validation cost (ce) is 0.439. Accuracy is 89.63%\n",
+ "INFO:mlp.optimisers:Epoch 16: Took 2 seconds. Training speed 820 pps. Validation speed 12988 pps.\n",
+ "INFO:mlp.optimisers:Epoch 17: Training cost (ce) is 0.008. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 17: Validation cost (ce) is 0.443. Accuracy is 89.78%\n",
+ "INFO:mlp.optimisers:Epoch 17: Took 2 seconds. Training speed 902 pps. Validation speed 12501 pps.\n",
+ "INFO:mlp.optimisers:Epoch 18: Training cost (ce) is 0.008. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 18: Validation cost (ce) is 0.446. Accuracy is 89.72%\n",
+ "INFO:mlp.optimisers:Epoch 18: Took 2 seconds. Training speed 870 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 19: Training cost (ce) is 0.007. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 19: Validation cost (ce) is 0.445. Accuracy is 89.83%\n",
+ "INFO:mlp.optimisers:Epoch 19: Took 2 seconds. Training speed 918 pps. Validation speed 12822 pps.\n",
+ "INFO:mlp.optimisers:Epoch 20: Training cost (ce) is 0.007. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 20: Validation cost (ce) is 0.451. Accuracy is 89.75%\n",
+ "INFO:mlp.optimisers:Epoch 20: Took 2 seconds. Training speed 834 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 21: Training cost (ce) is 0.006. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 21: Validation cost (ce) is 0.454. Accuracy is 89.80%\n",
+ "INFO:mlp.optimisers:Epoch 21: Took 2 seconds. Training speed 902 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 22: Training cost (ce) is 0.006. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 22: Validation cost (ce) is 0.456. Accuracy is 89.77%\n",
+ "INFO:mlp.optimisers:Epoch 22: Took 2 seconds. Training speed 863 pps. Validation speed 12501 pps.\n",
+ "INFO:mlp.optimisers:Epoch 23: Training cost (ce) is 0.005. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 23: Validation cost (ce) is 0.458. Accuracy is 89.84%\n",
+ "INFO:mlp.optimisers:Epoch 23: Took 2 seconds. Training speed 820 pps. Validation speed 12822 pps.\n",
+ "INFO:mlp.optimisers:Epoch 24: Training cost (ce) is 0.005. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 24: Validation cost (ce) is 0.460. Accuracy is 89.80%\n",
+ "INFO:mlp.optimisers:Epoch 24: Took 2 seconds. Training speed 856 pps. Validation speed 12988 pps.\n",
+ "INFO:mlp.optimisers:Epoch 25: Training cost (ce) is 0.005. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 25: Validation cost (ce) is 0.461. Accuracy is 89.86%\n",
+ "INFO:mlp.optimisers:Epoch 25: Took 2 seconds. Training speed 902 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 26: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 26: Validation cost (ce) is 0.467. Accuracy is 89.86%\n",
+ "INFO:mlp.optimisers:Epoch 26: Took 2 seconds. Training speed 910 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 27: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 27: Validation cost (ce) is 0.466. Accuracy is 89.81%\n",
+ "INFO:mlp.optimisers:Epoch 27: Took 2 seconds. Training speed 827 pps. Validation speed 12501 pps.\n",
+ "INFO:mlp.optimisers:Epoch 28: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 28: Validation cost (ce) is 0.468. Accuracy is 89.84%\n",
+ "INFO:mlp.optimisers:Epoch 28: Took 2 seconds. Training speed 894 pps. Validation speed 12501 pps.\n",
+ "INFO:mlp.optimisers:Epoch 29: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 29: Validation cost (ce) is 0.471. Accuracy is 89.83%\n",
+ "INFO:mlp.optimisers:Epoch 29: Took 2 seconds. Training speed 902 pps. Validation speed 12659 pps.\n",
+ "INFO:mlp.optimisers:Epoch 30: Training cost (ce) is 0.004. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 30: Validation cost (ce) is 0.473. Accuracy is 89.81%\n",
+ "INFO:mlp.optimisers:Epoch 30: Took 2 seconds. Training speed 918 pps. Validation speed 11495 pps.\n",
+ "INFO:root:Testing the model on test set:\n",
+ "INFO:root:MNIST test set accuracy is 89.33 %, cost (ce) is 0.480\n"
+ ]
+ }
+ ],
+ "source": [
+ "\n",
+ "from mlp.layers import MLP, Tanh, Softmax #import required layer types\n",
+ "from mlp.optimisers import SGDOptimiser #import the optimiser\n",
+ "from mlp.costs import CECost #import the cost we want to use for optimisation\n",
+ "from mlp.schedulers import LearningRateFixed\n",
+ "\n",
+ "rng = numpy.random.RandomState([2015,10,10])\n",
+ "\n",
+ "#some hyper-parameters\n",
+ "nhid = 100\n",
+ "learning_rate = 0.2\n",
+ "max_epochs = 30\n",
+ "cost = CECost()\n",
+ " \n",
+ "stats = []\n",
+ "for layer in xrange(1, 2):\n",
+ "\n",
+ " train_dp.reset()\n",
+ " valid_dp.reset()\n",
+ " test_dp.reset()\n",
+ " \n",
+ " #define the model\n",
+ " model = MLP(cost=cost)\n",
+ " model.add_layer(Tanh(idim=784, odim=nhid, irange=1./numpy.sqrt(784), rng=rng))\n",
+ " for i in xrange(1, layer):\n",
+ " logger.info(\"Stacking hidden layer (%s)\" % str(i+1))\n",
+ " model.add_layer(Tanh(idim=nhid, odim=nhid, irange=0.2, rng=rng))\n",
+ " model.add_layer(Softmax(idim=nhid, odim=10, rng=rng))\n",
+ "\n",
+ " # define the optimiser, here stochastic gradient descent\n",
+ " # with fixed learning rate and max_epochs\n",
+ " lr_scheduler = LearningRateFixed(learning_rate=learning_rate, max_epochs=max_epochs)\n",
+ " optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
+ "\n",
+ " logger.info('Training started...')\n",
+ " tr_stats, valid_stats = optimiser.train(model, train_dp, valid_dp)\n",
+ "\n",
+ " logger.info('Testing the model on test set:')\n",
+ " tst_cost, tst_accuracy = optimiser.validate(model, test_dp)\n",
+ " logger.info('MNIST test set accuracy is %.2f %%, cost (%s) is %.3f'%(tst_accuracy*100., cost.get_name(), tst_cost))\n",
+ " \n",
+ " stats.append((tr_stats, valid_stats, (tst_cost, tst_accuracy)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Exercise 2: Implement ReLU\n",
+ "\n",
+ "Again, your implementation should follow the conventions used to build Linear, Sigmoid and Softmax layers. As in exercise 1, test your solution by training a one-hidden-layer model with 100 hidden units, similar to the one used in Task 3a in the coursework. Tune the learning rate (start with a value of 0.1) with the initial weight range set to 0.05."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "collapsed": false,
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "INFO:root:Training started...\n",
+ "INFO:mlp.optimisers:Epoch 0: Training cost (ce) for initial model is 2.317. Accuracy is 15.20%\n",
+ "INFO:mlp.optimisers:Epoch 0: Validation cost (ce) for initial model is 2.317. Accuracy is 13.98%\n",
+ "INFO:mlp.optimisers:Epoch 1: Training cost (ce) is 1.400. Accuracy is 57.10%\n",
+ "INFO:mlp.optimisers:Epoch 1: Validation cost (ce) is 0.666. Accuracy is 84.13%\n",
+ "INFO:mlp.optimisers:Epoch 1: Took 0 seconds. Training speed 6600 pps. Validation speed 53922 pps.\n",
+ "INFO:mlp.optimisers:Epoch 2: Training cost (ce) is 0.607. Accuracy is 82.20%\n",
+ "INFO:mlp.optimisers:Epoch 2: Validation cost (ce) is 0.486. Accuracy is 85.70%\n",
+ "INFO:mlp.optimisers:Epoch 2: Took 0 seconds. Training speed 3536 pps. Validation speed 80001 pps.\n",
+ "INFO:mlp.optimisers:Epoch 3: Training cost (ce) is 0.413. Accuracy is 88.20%\n",
+ "INFO:mlp.optimisers:Epoch 3: Validation cost (ce) is 0.427. Accuracy is 87.59%\n",
+ "INFO:mlp.optimisers:Epoch 3: Took 0 seconds. Training speed 6251 pps. Validation speed 72222 pps.\n",
+ "INFO:mlp.optimisers:Epoch 4: Training cost (ce) is 0.313. Accuracy is 90.30%\n",
+ "INFO:mlp.optimisers:Epoch 4: Validation cost (ce) is 0.399. Accuracy is 88.12%\n",
+ "INFO:mlp.optimisers:Epoch 4: Took 0 seconds. Training speed 5038 pps. Validation speed 65450 pps.\n",
+ "INFO:mlp.optimisers:Epoch 5: Training cost (ce) is 0.233. Accuracy is 93.90%\n",
+ "INFO:mlp.optimisers:Epoch 5: Validation cost (ce) is 0.370. Accuracy is 89.09%\n",
+ "INFO:mlp.optimisers:Epoch 5: Took 0 seconds. Training speed 5425 pps. Validation speed 69818 pps.\n",
+ "INFO:mlp.optimisers:Epoch 6: Training cost (ce) is 0.189. Accuracy is 94.90%\n",
+ "INFO:mlp.optimisers:Epoch 6: Validation cost (ce) is 0.394. Accuracy is 88.26%\n",
+ "INFO:mlp.optimisers:Epoch 6: Took 0 seconds. Training speed 5226 pps. Validation speed 73042 pps.\n",
+ "INFO:mlp.optimisers:Epoch 7: Training cost (ce) is 0.141. Accuracy is 96.60%\n",
+ "INFO:mlp.optimisers:Epoch 7: Validation cost (ce) is 0.386. Accuracy is 88.72%\n",
+ "INFO:mlp.optimisers:Epoch 7: Took 0 seconds. Training speed 3155 pps. Validation speed 58826 pps.\n",
+ "INFO:mlp.optimisers:Epoch 8: Training cost (ce) is 0.105. Accuracy is 98.50%\n",
+ "INFO:mlp.optimisers:Epoch 8: Validation cost (ce) is 0.375. Accuracy is 89.52%\n",
+ "INFO:mlp.optimisers:Epoch 8: Took 0 seconds. Training speed 5681 pps. Validation speed 72784 pps.\n",
+ "INFO:mlp.optimisers:Epoch 9: Training cost (ce) is 0.084. Accuracy is 98.80%\n",
+ "INFO:mlp.optimisers:Epoch 9: Validation cost (ce) is 0.385. Accuracy is 89.54%\n",
+ "INFO:mlp.optimisers:Epoch 9: Took 0 seconds. Training speed 6656 pps. Validation speed 77418 pps.\n",
+ "INFO:mlp.optimisers:Epoch 10: Training cost (ce) is 0.070. Accuracy is 98.70%\n",
+ "INFO:mlp.optimisers:Epoch 10: Validation cost (ce) is 0.385. Accuracy is 89.74%\n",
+ "INFO:mlp.optimisers:Epoch 10: Took 0 seconds. Training speed 7067 pps. Validation speed 76853 pps.\n",
+ "INFO:mlp.optimisers:Epoch 11: Training cost (ce) is 0.051. Accuracy is 99.50%\n",
+ "INFO:mlp.optimisers:Epoch 11: Validation cost (ce) is 0.411. Accuracy is 88.81%\n",
+ "INFO:mlp.optimisers:Epoch 11: Took 0 seconds. Training speed 5863 pps. Validation speed 71343 pps.\n",
+ "INFO:mlp.optimisers:Epoch 12: Training cost (ce) is 0.044. Accuracy is 99.50%\n",
+ "INFO:mlp.optimisers:Epoch 12: Validation cost (ce) is 0.398. Accuracy is 89.34%\n",
+ "INFO:mlp.optimisers:Epoch 12: Took 0 seconds. Training speed 5974 pps. Validation speed 64065 pps.\n",
+ "INFO:mlp.optimisers:Epoch 13: Training cost (ce) is 0.035. Accuracy is 99.50%\n",
+ "INFO:mlp.optimisers:Epoch 13: Validation cost (ce) is 0.400. Accuracy is 89.37%\n",
+ "INFO:mlp.optimisers:Epoch 13: Took 0 seconds. Training speed 6211 pps. Validation speed 82847 pps.\n",
+ "INFO:mlp.optimisers:Epoch 14: Training cost (ce) is 0.027. Accuracy is 99.70%\n",
+ "INFO:mlp.optimisers:Epoch 14: Validation cost (ce) is 0.411. Accuracy is 89.30%\n",
+ "INFO:mlp.optimisers:Epoch 14: Took 0 seconds. Training speed 5834 pps. Validation speed 68986 pps.\n",
+ "INFO:mlp.optimisers:Epoch 15: Training cost (ce) is 0.023. Accuracy is 99.80%\n",
+ "INFO:mlp.optimisers:Epoch 15: Validation cost (ce) is 0.398. Accuracy is 89.99%\n",
+ "INFO:mlp.optimisers:Epoch 15: Took 0 seconds. Training speed 5601 pps. Validation speed 73777 pps.\n",
+ "INFO:mlp.optimisers:Epoch 16: Training cost (ce) is 0.020. Accuracy is 99.80%\n",
+ "INFO:mlp.optimisers:Epoch 16: Validation cost (ce) is 0.413. Accuracy is 89.64%\n",
+ "INFO:mlp.optimisers:Epoch 16: Took 0 seconds. Training speed 6732 pps. Validation speed 62236 pps.\n",
+ "INFO:mlp.optimisers:Epoch 17: Training cost (ce) is 0.017. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 17: Validation cost (ce) is 0.411. Accuracy is 89.74%\n",
+ "INFO:mlp.optimisers:Epoch 17: Took 0 seconds. Training speed 5339 pps. Validation speed 70669 pps.\n",
+ "INFO:mlp.optimisers:Epoch 18: Training cost (ce) is 0.016. Accuracy is 99.90%\n",
+ "INFO:mlp.optimisers:Epoch 18: Validation cost (ce) is 0.414. Accuracy is 89.78%\n",
+ "INFO:mlp.optimisers:Epoch 18: Took 0 seconds. Training speed 6595 pps. Validation speed 72214 pps.\n",
+ "INFO:mlp.optimisers:Epoch 19: Training cost (ce) is 0.014. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 19: Validation cost (ce) is 0.418. Accuracy is 89.79%\n",
+ "INFO:mlp.optimisers:Epoch 19: Took 0 seconds. Training speed 5860 pps. Validation speed 64213 pps.\n",
+ "INFO:mlp.optimisers:Epoch 20: Training cost (ce) is 0.012. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 20: Validation cost (ce) is 0.418. Accuracy is 90.04%\n",
+ "INFO:mlp.optimisers:Epoch 20: Took 0 seconds. Training speed 5726 pps. Validation speed 66591 pps.\n",
+ "INFO:mlp.optimisers:Epoch 21: Training cost (ce) is 0.011. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 21: Validation cost (ce) is 0.422. Accuracy is 89.87%\n",
+ "INFO:mlp.optimisers:Epoch 21: Took 0 seconds. Training speed 4780 pps. Validation speed 62159 pps.\n",
+ "INFO:mlp.optimisers:Epoch 22: Training cost (ce) is 0.011. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 22: Validation cost (ce) is 0.425. Accuracy is 89.98%\n",
+ "INFO:mlp.optimisers:Epoch 22: Took 0 seconds. Training speed 4839 pps. Validation speed 79642 pps.\n",
+ "INFO:mlp.optimisers:Epoch 23: Training cost (ce) is 0.010. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 23: Validation cost (ce) is 0.425. Accuracy is 90.09%\n",
+ "INFO:mlp.optimisers:Epoch 23: Took 0 seconds. Training speed 6742 pps. Validation speed 82643 pps.\n",
+ "INFO:mlp.optimisers:Epoch 24: Training cost (ce) is 0.009. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 24: Validation cost (ce) is 0.429. Accuracy is 90.00%\n",
+ "INFO:mlp.optimisers:Epoch 24: Took 0 seconds. Training speed 6590 pps. Validation speed 80128 pps.\n",
+ "INFO:mlp.optimisers:Epoch 25: Training cost (ce) is 0.008. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 25: Validation cost (ce) is 0.432. Accuracy is 90.04%\n",
+ "INFO:mlp.optimisers:Epoch 25: Took 0 seconds. Training speed 4635 pps. Validation speed 69917 pps.\n",
+ "INFO:mlp.optimisers:Epoch 26: Training cost (ce) is 0.008. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 26: Validation cost (ce) is 0.435. Accuracy is 89.99%\n",
+ "INFO:mlp.optimisers:Epoch 26: Took 0 seconds. Training speed 6685 pps. Validation speed 79693 pps.\n",
+ "INFO:mlp.optimisers:Epoch 27: Training cost (ce) is 0.007. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 27: Validation cost (ce) is 0.437. Accuracy is 90.04%\n",
+ "INFO:mlp.optimisers:Epoch 27: Took 0 seconds. Training speed 6992 pps. Validation speed 76700 pps.\n",
+ "INFO:mlp.optimisers:Epoch 28: Training cost (ce) is 0.007. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 28: Validation cost (ce) is 0.438. Accuracy is 90.07%\n",
+ "INFO:mlp.optimisers:Epoch 28: Took 0 seconds. Training speed 5097 pps. Validation speed 57735 pps.\n",
+ "INFO:mlp.optimisers:Epoch 29: Training cost (ce) is 0.007. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 29: Validation cost (ce) is 0.440. Accuracy is 90.04%\n",
+ "INFO:mlp.optimisers:Epoch 29: Took 0 seconds. Training speed 5390 pps. Validation speed 77974 pps.\n",
+ "INFO:mlp.optimisers:Epoch 30: Training cost (ce) is 0.006. Accuracy is 100.00%\n",
+ "INFO:mlp.optimisers:Epoch 30: Validation cost (ce) is 0.444. Accuracy is 90.08%\n",
+ "INFO:mlp.optimisers:Epoch 30: Took 0 seconds. Training speed 5589 pps. Validation speed 60936 pps.\n",
+ "INFO:root:Testing the model on test set:\n",
+ "INFO:root:MNIST test set accuracy is 89.13 %, cost (ce) is 0.444\n"
+ ]
+ }
+ ],
+ "source": [
+ "\n",
+ "from mlp.layers import MLP, Relu, Softmax \n",
+ "from mlp.optimisers import SGDOptimiser \n",
+ "from mlp.costs import CECost \n",
+ "from mlp.schedulers import LearningRateFixed\n",
+ "\n",
+ "rng = numpy.random.RandomState([2015,10,10])\n",
+ "\n",
+ "#some hyper-parameters\n",
+ "nhid = 100\n",
+ "learning_rate = 0.1\n",
+ "max_epochs = 30\n",
+ "cost = CECost()\n",
+ " \n",
+ "stats = []\n",
+ "for layer in xrange(1, 2):\n",
+ "\n",
+ " train_dp.reset()\n",
+ " valid_dp.reset()\n",
+ " test_dp.reset()\n",
+ " \n",
+ " #define the model\n",
+ " model = MLP(cost=cost)\n",
+ " model.add_layer(Relu(idim=784, odim=nhid, irange=0.05, rng=rng))\n",
+ " for i in xrange(1, layer):\n",
+ " logger.info(\"Stacking hidden layer (%s)\" % str(i+1))\n",
+ " model.add_layer(Relu(idim=nhid, odim=nhid, irange=0.2, rng=rng))\n",
+ " model.add_layer(Softmax(idim=nhid, odim=10, rng=rng))\n",
+ "\n",
+ " # define the optimiser, here stochastic gradient descent\n",
+ " # with fixed learning rate and max_epochs\n",
+ " lr_scheduler = LearningRateFixed(learning_rate=learning_rate, max_epochs=max_epochs)\n",
+ " optimiser = SGDOptimiser(lr_scheduler=lr_scheduler)\n",
+ "\n",
+ " logger.info('Training started...')\n",
+ " tr_stats, valid_stats = optimiser.train(model, train_dp, valid_dp)\n",
+ "\n",
+ " logger.info('Testing the model on test set:')\n",
+ " tst_cost, tst_accuracy = optimiser.validate(model, test_dp)\n",
+ " logger.info('MNIST test set accuracy is %.2f %%, cost (%s) is %.3f'%(tst_accuracy*100., cost.get_name(), tst_cost))\n",
+ " \n",
+ " stats.append((tr_stats, valid_stats, (tst_cost, tst_accuracy)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Exercise 3: Implement Maxout\n",
+ "\n",
+ "As with the previous two exercises, your implementation should follow the conventions used to build the Linear, Sigmoid and Softmax layers. For now, implement only non-overlapping pools (i.e. each activation $a_{j}, a_{j+1}, \\ldots, a_{j+K}$ belongs to exactly one pool). As before, test your solution by training a one-hidden-layer model with 100 hidden units, similar to the one used in Task 3a in the coursework. Use the same optimisation hyper-parameters (learning rate, initial weights range) as you used for ReLU models. Tune the pool size $K$ (but keep the total number of parameters fixed).\n",
+ "\n",
+ "Note: The max operator reduces dimensionality. For example, to get 100 hidden maxout units with the pooling size set to $K=2$, the size of the linear part needs to be set to $100K$ (assuming non-overlapping pools). This affects how you compute the total number of weights in the model.\n",
+ "\n",
+ "Implementation tips: To back-propagate through the maxout layer, one needs to keep track of which linear activation $a_{j}, a_{j+1}, \\ldots, a_{j+K}$ was the maximum in each pool. A convenient way to do so is to store the indices of the maximum units in the fprop function and then, in the backprop stage, pass the gradient only through those units. For example, one can build an auxiliary matrix where each element is either 1 (if the unit was the maximum, and so was passed forward through the max operator for a given data-point) or 0 otherwise. Then in the backward pass it suffices to upsample the maxout *igrads* signal to the linear layer dimension and element-wise multiply it by the aforementioned auxiliary matrix.\n",
+ "\n",
+ "*Optional:* Implement the generic pooling mechanism by introducing an additional *stride* hyper-parameter $0 0)*igrads + (h <= 0)*igrads
+ deltas = (h > 0)*igrads
___, ograds = super(Relu, self).bprop(h=None, igrads=deltas)
return deltas, ograds
@@ -527,6 +527,11 @@ class Maxout(Linear):
return h[:, :, 0] #get rid of the last reduced dimensison (of size 1)
def bprop(self, h, igrads):
+ #hack for dropout backprop (ignore dropped neurons). Note, this is not
+ #entirely correct when h is exactly 0 without having been dropped, in which
+ #case the derivative should be 1. However, this is rather unlikely to happen
+ #and can probably be ignored for now
+ igrads = (h != 0)*igrads
#convert into the shape where upsampling is easier
igrads_up = igrads.reshape(igrads.shape[0], self.max_odim, 1)
#upsample to the linear dimension (but reshaped to (batch_size, maxed_num (1), pool_size)