{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dropout and maxout\n", "In this lab we will explore the methods of [dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf), a regularisation method which stochastically drops out activations from the model during training, and [maxout](http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf), another non-linear transformation that can be used in multiple layer models. This is based on material covered in the [fifth lecture slides](http://www.inf.ed.ac.uk/teaching/courses/mlp/2016/mlp05-hid.pdf)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1: Implementing a dropout layer\n", "\n", "During training the forward propagation through a dropout layer produces outputs where a subset of the input dimensions are set to zero ('dropped out'). The dimensions to be dropped out are randomly sampled for each new batch, with each dimension having a probability $p$ of being included and the inclusion (or not) of each dimension independent of all the others. If the inputs to a dropout layer are $D$ dimensional vectors then we can represent the dropout operation by an elementwise multiplication by a $D$ dimensional *binary mask* vector $\\boldsymbol{m} = \\left[m_1 ~ m_2 ~\\dots~ m_D\\right]^{\\rm T}$ where $m_d \\sim \\text{Bernoulli}(p) ~~\\forall d \\in \\lbrace 1 \\dots D\\rbrace$. \n", "\n", "As a first step implement a `random_binary_mask` function in the cell below to generate a binary mask array of a specified shape, where each value in the outputted array is either a one with probability `prob_1` or zero with probability `1 - prob_1` and all values are sampled independently."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def random_binary_mask(prob_1, shape, rng):\n", "    \"\"\"Generates a random binary mask array of a given shape.\n", "    \n", "    Each value in the outputted array should be an independently sampled\n", "    binary value i.e. in {0, 1} with the probability of each value\n", "    being 1 being equal to `prob_1`.\n", "    \n", "    Args:\n", "        prob_1: Scalar value in [0, 1] specifying probability each\n", "            entry in output array is equal to one.\n", "        shape: Shape of returned mask array.\n", "        rng (RandomState): Seeded random number generator object.\n", "    \n", "    Returns:\n", "        Random binary mask array of specified shape.\n", "    \"\"\"\n", "    return rng.uniform(size=shape) < prob_1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your `random_binary_mask` function using the cell below (if your implementation is incorrect you will get an `AssertionError` - look at what the assert statement is checking for a clue as to what is wrong)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "test_shapes = [(1, 1000), (10, 10, 10)]\n", "test_probs = [0.1, 0.5, 0.7]\n", "for i in range(10):\n", "    for shape in test_shapes:\n", "        for prob in test_probs:\n", "            output = random_binary_mask(prob, shape, np.random)\n", "            # Check generating correct shape output\n", "            assert output.shape == shape\n", "            # Check all outputs are binary values\n", "            assert np.all((output == 1.) 
| (output == 0.))\n", "            # Check proportion equal to one plausible\n", "            # This will be noisy so there is a chance this will error\n", "            # even for a correct implementation\n", "            assert np.abs(output.mean() - prob) < 0.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a randomly sampled binary mask $\\boldsymbol{m}$, the outputs $\\lbrace \\boldsymbol{y}^{(b)} \\rbrace_{b=1}^B$ of the stochastic forward propagation through a dropout layer given a batch of inputs $\\lbrace \\boldsymbol{x}^{(b)} \\rbrace_{b=1}^B$ can be calculated by simply performing an elementwise multiplication of the inputs with the mask\n", "\n", "\\begin{equation}\n", " y^{(b)}_d = m_d x^{(b)}_d \\qquad \\forall d \\in \\lbrace 1 \\dots D \\rbrace\n", "\\end{equation}\n", "\n", "The corresponding partial derivatives required for implementing back-propagation through a dropout layer are\n", "\n", "\\begin{equation}\n", " \\frac{\\partial y^{(b)}_k}{\\partial x^{(b)}_d} = \n", " \\begin{cases}\n", " m_k & \\quad k = d \\\\\n", " 0 & \\quad k \\neq d\n", " \\end{cases}\n", " \\qquad \\forall k,\\,d \\in \\lbrace 1 \\dots D \\rbrace\n", "\\end{equation}\n", "\n", "As discussed in the lecture slides, when a model trained with dropout is used at test time, dimensions are no longer stochastically dropped out and instead all activations are deterministically fed forward through the model. So that the expected (mean) outputs of each layer are the same at test and training time, we scale the forward propagated inputs during testing by $p$, the probability of each dimension being included in the output. 
If we denote the deterministically forward-propagated batch of outputs of a dropout layer at test time as $\\lbrace \\boldsymbol{z}^{(b)} \\rbrace_{b=1}^B$ then we have\n", "\n", "\\begin{equation}\n", " z^{(b)}_d =\n", " \\mathbb{E}\\left[ y^{(b)}_d \\right] = \n", " \\sum_{m_d \\in \\lbrace 0,1 \\rbrace} \\left( \\mathbb{P}\\left[\\mathrm{m}_d = m_d\\right] m_d x^{(b)}_d \\right) =\n", " (p) (1) x^{(b)}_d + (1-p) (0) x^{(b)}_d =\n", " p x^{(b)}_d \\qquad \\forall d \\in \\lbrace 1 \\dots D \\rbrace\n", "\\end{equation}\n", "\n", "To allow switching between this stochastic training time behaviour and deterministic test time behaviour, a new abstract `StochasticLayer` class has been defined in the `mlp.layers` module. This acts similarly to the layer objects we have already encountered other than adding an extra boolean argument `stochastic` to the `fprop` method interface. When `stochastic = True` (the default) a stochastic forward propagation should be calculated; for dropout this corresponds to $\\boldsymbol{x}^{(b)} \\to \\boldsymbol{y}^{(b)}$ above. When `stochastic = False` a deterministic forward-propagation corresponding to the expected output of the stochastic forward-propagation should be calculated; for dropout this corresponds to $\\boldsymbol{x}^{(b)} \\to \\boldsymbol{z}^{(b)}$ above.\n", "\n", "Using the skeleton `DropoutLayer` class definition below, implement the `fprop` and `bprop` methods. You may wish to store the binary mask used in the forward propagation as an attribute of the class for use in back-propagation - it is fine to assume that the `fprop` and `bprop` will always be called in sync."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from mlp.layers import StochasticLayer\n", "\n", "class DropoutLayer(StochasticLayer):\n", "    \"\"\"Layer which stochastically drops input dimensions in its output.\"\"\"\n", "    \n", "    def __init__(self, rng=None, incl_prob=0.5, share_across_batch=True):\n", "        \"\"\"Construct a new dropout layer.\n", "        \n", "        Args:\n", "            rng (RandomState): Seeded random number generator.\n", "            incl_prob: Scalar value in (0, 1] specifying the probability of\n", "                each input dimension being included in the output.\n", "            share_across_batch: Whether to use same dropout mask across\n", "                all inputs in a batch or use per input masks.\n", "        \"\"\"\n", "        super(DropoutLayer, self).__init__(rng)\n", "        assert incl_prob > 0. and incl_prob <= 1.\n", "        self.incl_prob = incl_prob\n", "        self.share_across_batch = share_across_batch\n", "        \n", "    def fprop(self, inputs, stochastic=True):\n", "        \"\"\"Forward propagates activations through the layer transformation.\n", "\n", "        Args:\n", "            inputs: Array of layer inputs of shape (batch_size, input_dim).\n", "            stochastic: Flag allowing different deterministic\n", "                forward-propagation mode in addition to default stochastic\n", "                forward-propagation e.g. for use at test time. 
If False\n", "                a deterministic forward-propagation transformation\n", "                corresponding to the expected output of the stochastic\n", "                forward-propagation is applied.\n", "\n", "        Returns:\n", "            outputs: Array of layer outputs of shape (batch_size, output_dim).\n", "        \"\"\"\n", "        if stochastic:\n", "            # Use one mask row broadcast over the batch, or one mask per input\n", "            mask_shape = ((1,) + inputs.shape[1:] if self.share_across_batch\n", "                          else inputs.shape)\n", "            # Sample from the layer's own generator rather than a global `rng`\n", "            self._mask = (self.rng.uniform(size=mask_shape) < self.incl_prob)\n", "            return inputs * self._mask\n", "        else:\n", "            return inputs * self.incl_prob\n", "    \n", "    def bprop(self, inputs, outputs, grads_wrt_outputs):\n", "        \"\"\"Back propagates gradients through a layer.\n", "\n", "        Given gradients with respect to the outputs of the layer calculates the\n", "        gradients with respect to the layer inputs. This should correspond to\n", "        default stochastic forward-propagation.\n", "\n", "        Args:\n", "            inputs: Array of layer inputs of shape (batch_size, input_dim).\n", "            outputs: Array of layer outputs calculated in forward pass of\n", "                shape (batch_size, output_dim).\n", "            grads_wrt_outputs: Array of gradients with respect to the layer\n", "                outputs of shape (batch_size, output_dim).\n", "\n", "        Returns:\n", "            Array of gradients with respect to the layer inputs of shape\n", "            (batch_size, input_dim).\n", "        \"\"\"\n", "        return grads_wrt_outputs * self._mask\n", "\n", "    def __repr__(self):\n", "        return 'DropoutLayer(incl_prob={0:.1f})'.format(self.incl_prob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test your implementation by running the cell below (if your implementation is incorrect you will get an `AssertionError` - look at what the assert statement is checking for a clue as to what is wrong)."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "seed = 31102016\n", "rng = np.random.RandomState(seed)\n", "test_incl_probs = [0.1, 0.5, 0.7]\n", "input_shape = (5, 10)\n", "for incl_prob in test_incl_probs:\n", "    layer = DropoutLayer(rng, incl_prob)\n", "    inputs = rng.normal(size=input_shape)\n", "    grads_wrt_outputs = rng.normal(size=input_shape)\n", "    for t in range(100):\n", "        outputs = layer.fprop(inputs, stochastic=True)\n", "        # Check outputted array correct shape\n", "        assert outputs.shape == inputs.shape\n", "        # Check all outputs are either equal to inputs or zero\n", "        assert np.all((outputs == inputs) | (outputs == 0))\n", "        grads_wrt_inputs = layer.bprop(inputs, outputs, grads_wrt_outputs)\n", "        # Check back-propagated gradients only non-zero for non-zero outputs\n", "        assert np.all((outputs != 0) == (grads_wrt_inputs != 0))\n", "        assert np.all(grads_wrt_outputs[outputs != 0] == grads_wrt_inputs[outputs != 0])\n", "    det_outputs = layer.fprop(inputs, stochastic=False)\n", "    # Check deterministic fprop outputs are correct shape\n", "    assert det_outputs.shape == inputs.shape\n", "    # Check deterministic fprop outputs scaled correctly\n", "    assert np.allclose(det_outputs, incl_prob * inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optional extension\n", "\n", "Above we assumed the same dropout mask was applied to each input in a batch, as specified in the lecture slides. In practice sometimes a different mask is sampled for each input. As an extension you could try implementing this per-input form of dropout either by defining a new layer or adding an extra argument to the constructor of the above layer which allows you to switch between the two forms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: Training with dropout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Experiment with training models with dropout layers to classify MNIST digits. 
Code has been provided below as a starting point for setting up the model objects, though feel free to use any additional adaptive learning rules or learning rule schedulers you wrote during the coursework instead. You may also wish to change the model architecture to use a larger model with more parameters in which the regularisation provided by dropout is likely to have a more pronounced effect. You will probably also find that models with dropout generally need to be trained over more epochs than those without (can you suggest why this might be?).\n", "\n", "You should try training with a few different `incl_prob` settings for the dropout layers and try to establish how the values chosen affect the training performance. You may wish to experiment with using a different dropout probability at the input than for the intermediate layers (why?).\n", "\n", "You may wish to start reading through and implementing exercise 3 while waiting for training runs to complete." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import logging\n", "from mlp.data_providers import MNISTDataProvider\n", "from mlp.models import MultipleLayerModel\n", "from mlp.layers import ReluLayer, AffineLayer\n", "from mlp.errors import CrossEntropySoftmaxError\n", "from mlp.initialisers import GlorotUniformInit, ConstantInit\n", "from mlp.learning_rules import MomentumLearningRule\n", "from mlp.optimisers import Optimiser\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Seed a random number generator\n", "seed = 31102016\n", "rng = np.random.RandomState(seed)\n", "\n", "# Set up a logger object to print info about the training run to stdout\n", "logger = logging.getLogger()\n", "logger.setLevel(logging.INFO)\n", "logger.handlers = [logging.StreamHandler()]\n", "\n", "# Create data provider objects for the MNIST data set\n", "train_data = MNISTDataProvider('train', batch_size=50, rng=rng)\n", "valid_data = 
MNISTDataProvider('valid', batch_size=50, rng=rng)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1570e8d049a44c52b2608f768bf7431f", "version_major": 2, "version_minor": 0 }, "text/html": [ "
