mlpractical/notebooks/06_Dropout_and_maxout.ipynb
2017-10-29 21:30:15 +00:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dropout and maxout\n",
"In this lab we will explore the methods of [dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf), a regularisation method which stochastically drops out activations from the model during training, and [maxout](http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf), another non-linear transformation that can be used in multiple layer models. This is based on material covered in the [fifth lecture slides](http://www.inf.ed.ac.uk/teaching/courses/mlp/2016/mlp05-hid.pdf)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1: Implementing a dropout layer\n",
"\n",
"During training the forward propagation through a dropout layer produces outputs where a subset of the input dimensions are set to zero ('dropped out'). The dimensions to be dropped out are randomly sampled for each new batch, with each dimension having a probability $p$ of being included and the inclusion (or not) of each dimension independent of all the others. If the inputs to a dropout layer are $D$ dimensional vectors then we can represent the dropout operation by an elementwise multiplication by a $D$ dimensional *binary mask* vector $\\boldsymbol{m} = \\left[m_1 ~ m_2 ~\\dots~ m_D\\right]^{\\rm T}$ where $m_d \\sim \\text{Bernoulli}(p) ~~\\forall d \\in \\lbrace 1 \\dots D\\rbrace$. \n",
"\n",
"As a first step implement a `random_binary_mask` function in the cell below to generate a binary mask array of a specified shape, where each value in the outputted array is either a one with probablity `prob_1` or zero with probability `1 - prob_1` and all values are sampled independently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def random_binary_mask(prob_1, shape, rng):\n",
" \"\"\"Generates a random binary mask array of a given shape.\n",
" \n",
" Each value in the outputted array should be an indepedently sampled\n",
" binary value i.e in {0, 1} with the probability of each value\n",
" being 1 being equal to `prob_1`.\n",
" \n",
" Args:\n",
" prob_1: Scalar value in [0, 1] specifying probability each\n",
" entry in output array is equal to one.\n",
" shape: Shape of returned mask array.\n",
" rng (RandomState): Seeded random number generator object.\n",
" \n",
" Returns:\n",
" Random binary mask array of specified shape.\n",
" \"\"\"\n",
" raise NotImplementedError()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test your `random_binary_mask` function using the cell below (if your implementation is incorrect you will get an `AssertionError` - look at what the assert statement is checking for a clue as to what is wrong)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"test_shapes = [(1, 1000), (10, 10, 10)]\n",
"test_probs = [0.1, 0.5, 0.7]\n",
"for i in range(10):\n",
" for shape in test_shapes:\n",
" for prob in test_probs:\n",
" output = random_binary_mask(prob, shape, np.random)\n",
" # Check generating correct shape output\n",
" assert output.shape == shape\n",
" # Check all outputs are binary values\n",
" assert np.all((output == 1.) | (output == 0.))\n",
" # Check proportion equal to one plausible\n",
" # This will be noisy so there is a chance this will error\n",
" # even for a correct implementation\n",
" assert np.abs(output.mean() - prob) < 0.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given a randomly sampled binary mask $\\boldsymbol{m}$, the outputs $\\lbrace \\boldsymbol{y}^{(b)} \\rbrace_{b=1}^B$ of the stochastic forward propagation through a dropout layer given a batch of inputs $\\lbrace \\boldsymbol{x}^{(b)} \\rbrace_{b=1}^B$ can be calculated by simply performing an elementwise multiplication of the inputs with the mask\n",
"\n",
"\\begin{equation}\n",
" y^{(b)}_d = m_k x^{(b)}_d \\qquad \\forall d \\in \\lbrace 1 \\dots D \\rbrace\n",
"\\end{equation}\n",
"\n",
"The corresponding partial derivatives required for implementing back-propagation through a dropout layer are\n",
"\n",
"\\begin{equation}\n",
" \\frac{\\partial y^{(b)}_k}{\\partial x^{(b)}_d} = \n",
" \\begin{cases}\n",
" m_k & \\quad k = d \\\\\n",
" 0 & \\quad k \\neq d\n",
" \\end{cases}\n",
" \\qquad \\forall k,\\,d \\in \\lbrace 1 \\dots D \\rbrace\n",
"\\end{equation}\n",
"\n",
"As discussed in the lecture slides, when using a model trained with dropout at test time dimensions are no longer stochastically dropped out and instead all activations are deterministically fed forward through the model. So that the expected (mean) outputs of each layer are the same at test and training we scale the forward propagated inputs during testing by $p$ the probability of each dimension being included in the output. If we denote the deterministically forward-propagated batch of outputs of a dropout layer at test time as $\\lbrace \\boldsymbol{z}^{(b)} \\rbrace_{b=1}^B$ then we have\n",
"\n",
"\\begin{equation}\n",
" z^{(b)}_d =\n",
" \\mathbb{E}\\left[ y^{(b)}_d \\right] = \n",
" \\sum_{m_d \\in \\lbrace 0,1 \\rbrace} \\left( \\mathbb{P}\\left[\\mathrm{m}_d = m_d\\right] m_d x^{(b)}_d \\right) =\n",
" (p) (1) x^{(b)}_d + (1-p) (0) x^{(b)}_d =\n",
" p x^{(b)}_d \\qquad \\forall d \\in \\lbrace 1 \\dots D \\rbrace\n",
"\\end{equation}\n",
"\n",
"To allow switching between this stochastic training time behaviour and deterministic test time behaviour, a new abstract `StochasticLayer` class has been defined in the `mlp.layers` module. This acts similarly to the layer objects we have already encountered other than adding an extra boolean argument `stochastic` to the `fprop` method interface. When `stochastic = True` (the default) a stochastic forward propagation should be caculated, for dropout this corresponding to $\\boldsymbol{x}^{(b)} \\to \\boldsymbol{y}^{(b)}$ above. When `stochastic = False` a deterministic forward-propagation corresponding to the expected output of the stochastic forward-propagation should be calculated, for dropout this corresponding to $\\boldsymbol{x}^{(b)} \\to \\boldsymbol{z}^{(b)}$ above.\n",
"\n",
"Using the skeleton `DropoutLayer` class definition below, implement the `fprop` and `bprop` methods. You may wish to store the binary mask used in the forward propagation as an attribute of the class for use in back-propagation - it is fine to assume that the `fprop` and `bprop` will always be called in sync."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from mlp.layers import StochasticLayer\n",
"\n",
"class DropoutLayer(StochasticLayer):\n",
" \"\"\"Layer which stochastically drops input dimensions in its output.\"\"\"\n",
" \n",
" def __init__(self, rng=None, incl_prob=0.5):\n",
" \"\"\"Construct a new dropout layer.\n",
" \n",
" Args:\n",
" rng (RandomState): Seeded random number generator.\n",
" incl_prob: Scalar value in (0, 1] specifying the probability of\n",
" each input dimension being included in the output.\n",
" \"\"\"\n",
" super(DropoutLayer, self).__init__(rng)\n",
" assert incl_prob > 0. and incl_prob <= 1.\n",
" self.incl_prob = incl_prob\n",
" \n",
" def fprop(self, inputs, stochastic=True):\n",
" \"\"\"Forward propagates activations through the layer transformation.\n",
"\n",
" Args:\n",
" inputs: Array of layer inputs of shape (batch_size, input_dim).\n",
" stochastic: Flag allowing different deterministic\n",
" forward-propagation mode in addition to default stochastic\n",
" forward-propagation e.g. for use at test time. If False\n",
" a deterministic forward-propagation transformation\n",
" corresponding to the expected output of the stochastic\n",
" forward-propagation is applied.\n",
"\n",
" Returns:\n",
" outputs: Array of layer outputs of shape (batch_size, output_dim).\n",
" \"\"\"\n",
" raise NotImplementedError()\n",
" \n",
" def bprop(self, inputs, outputs, grads_wrt_outputs):\n",
" \"\"\"Back propagates gradients through a layer.\n",
"\n",
" Given gradients with respect to the outputs of the layer calculates the\n",
" gradients with respect to the layer inputs. This should correspond to\n",
" default stochastic forward-propagation.\n",
"\n",
" Args:\n",
" inputs: Array of layer inputs of shape (batch_size, input_dim).\n",
" outputs: Array of layer outputs calculated in forward pass of\n",
" shape (batch_size, output_dim).\n",
" grads_wrt_outputs: Array of gradients with respect to the layer\n",
" outputs of shape (batch_size, output_dim).\n",
"\n",
" Returns:\n",
" Array of gradients with respect to the layer inputs of shape\n",
" (batch_size, input_dim).\n",
" \"\"\"\n",
" raise NotImplementedError()\n",
"\n",
" def __repr__(self):\n",
" return 'DropoutLayer(incl_prob={0:.1f})'.format(self.incl_prob)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test your implementation by running the cell below (if your implementation is incorrect you will get an `AssertionError` - look at what the assert statement is checking for a clue as to what is wrong)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"seed = 31102016 \n",
"rng = np.random.RandomState(seed)\n",
"test_incl_probs = [0.1, 0.5, 0.7]\n",
"input_shape = (5, 10)\n",
"for incl_prob in test_incl_probs:\n",
" layer = DropoutLayer(rng, incl_prob)\n",
" inputs = rng.normal(size=input_shape)\n",
" grads_wrt_outputs = rng.normal(size=input_shape)\n",
" for t in range(100):\n",
" outputs = layer.fprop(inputs, stochastic=True)\n",
" # Check outputted array correct shape\n",
" assert outputs.shape == inputs.shape\n",
" # Check all outputs are either equal to inputs or zero\n",
" assert np.all((outputs == inputs) | (outputs == 0))\n",
" grads_wrt_inputs = layer.bprop(inputs, outputs, grads_wrt_outputs)\n",
" # Check back-propagated gradients only non-zero for non-zero outputs\n",
" assert np.all((outputs != 0) == (grads_wrt_inputs != 0))\n",
" assert np.all(grads_wrt_outputs[outputs != 0] == grads_wrt_inputs[outputs != 0])\n",
" det_outputs = layer.fprop(inputs, stochastic=False)\n",
" # Check deterministic fprop outputs are correct shape\n",
" assert det_outputs.shape == inputs.shape\n",
" # Check deterministic fprop outputs scaled correctly\n",
" assert np.allclose(det_outputs, incl_prob * inputs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional extension\n",
"\n",
"Above we assumed the same dropout mask was applied to each input in a batch, as specified in the lecture slides. In practice sometimes a different mask is sampled for each input. As an extension you could try implementing this per-input form of dropout either by defining a new layer or adding an extra argument to the constructor of the above layer which allows you to switch between the two forms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2: Training with dropout"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Experiment with training models with dropout layers to classify MNIST digits. Code has been provided below as a starting point for setting up the model objects though feel free to use any additional adaptive learning rules or learning rule schedulers you wrote during the coursework instead. You may also wish to change the model architecture to use a larger model with more parameters in which the regularisation provided by dropout is likely to have a more pronounced effect. You will probably also find that models with dropout generally need to be trained over more epochs than those without (can you suggest why this might be?).\n",
"\n",
"You should training with a few different `incl_prob` settings for the dropout layers and try to establish how the values chosen affect the training performance. You may wish to experiment with using a different dropout probability at the input than for the intermediate layers (why?).\n",
"\n",
"You may wish to start reading through and implementing exercise 3 while waiting for training runs to complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import logging\n",
"from mlp.data_providers import MNISTDataProvider\n",
"from mlp.models import MultipleLayerModel\n",
"from mlp.layers import ReluLayer, AffineLayer\n",
"from mlp.errors import CrossEntropySoftmaxError\n",
"from mlp.initialisers import GlorotUniformInit, ConstantInit\n",
"from mlp.learning_rules import MomentumLearningRule\n",
"from mlp.optimisers import Optimiser\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# Seed a random number generator\n",
"seed = 31102016 \n",
"rng = np.random.RandomState(seed)\n",
"\n",
"# Set up a logger object to print info about the training run to stdout\n",
"logger = logging.getLogger()\n",
"logger.setLevel(logging.INFO)\n",
"logger.handlers = [logging.StreamHandler()]\n",
"\n",
"# Create data provider objects for the MNIST data set\n",
"train_data = MNISTDataProvider('train', batch_size=50, rng=rng)\n",
"valid_data = MNISTDataProvider('valid', batch_size=50, rng=rng)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Probability of input being included in output in dropout layer\n",
"incl_prob = 0.5\n",
"\n",
"input_dim, output_dim, hidden_dim = 784, 10, 125\n",
"\n",
"# Use Glorot initialisation scheme for weights and zero biases\n",
"weights_init = GlorotUniformInit(rng=rng, gain=2.**0.5)\n",
"biases_init = ConstantInit(0.)\n",
"\n",
"# Create three affine layer model with rectified linear non-linearities\n",
"# and dropout layers before every affine layer\n",
"model = MultipleLayerModel([\n",
" DropoutLayer(rng, incl_prob),\n",
" AffineLayer(input_dim, hidden_dim, weights_init, biases_init), \n",
" ReluLayer(),\n",
" DropoutLayer(rng, incl_prob),\n",
" AffineLayer(hidden_dim, hidden_dim, weights_init, biases_init), \n",
" ReluLayer(),\n",
" DropoutLayer(rng, incl_prob),\n",
" AffineLayer(hidden_dim, output_dim, weights_init, biases_init)\n",
"])\n",
"\n",
"# Multiclass classification therefore use cross-entropy + softmax error\n",
"error = CrossEntropySoftmaxError()\n",
"\n",
"# Use a momentum learning rule - you could use an adaptive learning rule\n",
"# implemented for the coursework here instead\n",
"learning_rule = MomentumLearningRule(0.02, 0.9)\n",
"\n",
"# Monitor classification accuracy during training\n",
"data_monitors={'acc': lambda y, t: (y.argmax(-1) == t.argmax(-1)).mean()}\n",
"\n",
"optimiser = Optimiser(\n",
" model, error, learning_rule, train_data, valid_data, data_monitors)\n",
"\n",
"num_epochs = 100\n",
"stats_interval = 5\n",
"\n",
"stats, keys, run_time = optimiser.train(num_epochs=num_epochs, stats_interval=stats_interval)\n",
"\n",
"# Plot the change in the validation and training set error over training.\n",
"fig_1 = plt.figure(figsize=(8, 4))\n",
"ax_1 = fig_1.add_subplot(111)\n",
"for k in ['error(train)', 'error(valid)']:\n",
" ax_1.plot(np.arange(1, stats.shape[0]) * stats_interval, \n",
" stats[1:, keys[k]], label=k)\n",
"ax_1.legend(loc=0)\n",
"ax_1.set_xlabel('Epoch number')\n",
"\n",
"# Plot the change in the validation and training set accuracy over training.\n",
"fig_2 = plt.figure(figsize=(8, 4))\n",
"ax_2 = fig_2.add_subplot(111)\n",
"for k in ['acc(train)', 'acc(valid)']:\n",
" ax_2.plot(np.arange(1, stats.shape[0]) * stats_interval, \n",
" stats[1:, keys[k]], label=k)\n",
"ax_2.legend(loc=0)\n",
"ax_2.set_xlabel('Epoch number')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 3: Implementing maxout\n",
"\n",
"[Maxout](http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf) can be considered a generalisation of the rectified linear transformation implemented in the previous lab. \n",
"\n",
"For a rectified linear (`Relu`) layer the forward propagation corresponds to\n",
"\n",
"\\begin{equation}\n",
" y^{(b)}_k = \n",
" \\max\\left\\lbrace 0,\\,x^{(b)}_k \\right\\rbrace\n",
"\\end{equation}\n",
"\n",
"i.e. each output corresponds to an pairwise maximum of a constant (0) and the input.\n",
"\n",
"Instead of taking the maximum of the input and a constant, we could instead consider taking the maximum over sets of inputs of a fixed size $s$.\n",
"\n",
"\\begin{equation}\n",
" y^{(b)}_k = \n",
" \\max\\left\\lbrace x^{(b)}_{(k-1)s + 1},\\, x^{(b)}_{(k-1)s + 2},\\, \\dots ,\\, x^{(b)}_{ks} \\right\\rbrace\n",
"\\end{equation}\n",
"\n",
"If these inputs $x^{(b)}_d$ are themselves the outputs of an affine layer, then this corresponds to taking the maximum of a series of affine functions of the previous layer outputs. Like a rectified linear layer this leads to piecewise linear input-output relationships (which have well-behaved gradients which do not suffer from the saturation problems of logistic sigmoid / hyperbolic tangent transformations) but unlike the rectified linear case we do not end force a portion of the outputs to be zero. \n",
"\n",
"Experimentally this form of transformation has been found to give good performance, with the name *maxout* chosen because the *out*put is the *max*imum of a set of inputs. Maxout is also commonly used with dropout layers however note they are not directly related - maxout defines a deterministic non-linear transformation which can help improve the representational capacity and trainability of models; dropout defines a stochastic transformation which is mainly aimed at regularising a model to reduce overfitting.\n",
"\n",
"Using layers which take the maximum of fixed sized sets of inputs is also a common technique in models with convolutional layers which we will cover later in the course, with here the layer commonly being termed a *max-pooling* layer (with there being natural generalisation to other choices of reduction functions over pools such as the mean). We will adopt this terminology here for a layer implementing the transformation described above and we will be able to reuse our code implementing this maximum operation when experimenting with convolutional models.\n",
"\n",
"The partial derivatives of this max-pooling transformation are sparse (lots of values are zero), with only the partial derivative of the output of a pool with respect to the maximum input in the pool non-zero. This can be expressed as\n",
"\n",
"\\begin{equation}\n",
" \\frac{\\partial y^{(b)}_k}{\\partial x^{(b)}_d} = \n",
" \\begin{cases} \n",
" 1 & \\quad (k-1)s + 1 \\leq d \\leq ks \\quad\\textrm{and} &x^{(b)}_d = \\max\\left\\lbrace x^{(b)}_{(k-1)s + 1},\\, x^{(b)}_{(k-1)s + 2},\\, \\dots ,\\, x^{(b)}_{ks} \\right\\rbrace \\\\\n",
" 0 & \\quad \\textrm{otherwise}\n",
" \\end{cases}.\n",
"\\end{equation}\n",
"\n",
"Using these definitions implement the `fprop` and `bprop` methods of the skeleton `MaxPoolingLayer` class below.\n",
"\n",
"Some hints\n",
"\n",
" * One way of organising the inputs into non-overlapping pools is using the `numpy.reshape` function.\n",
" * The `numpy.max` function has an `axis` argument which allows you specify the axis (dimension) of the input array to take the maximum over.\n",
" * It may help to construct a binary mask corresponding to the definitions of the partial derivatives above to allow you to implement the `bprop` method. \n",
" * As with the `DropoutLayer` it is fine to temporarily store values calculated in the `fprop` method as attributes of the object (e.g. `self.val = val`) to use in the `bprop` method (although you don't necessarily need to do this)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from mlp.layers import Layer\n",
"\n",
"class MaxPoolingLayer(Layer):\n",
" \n",
" def __init__(self, pool_size=2):\n",
" \"\"\"Construct a new max-pooling layer.\n",
" \n",
" Args:\n",
" pool_size: Positive integer specifying size of pools over\n",
" which to take maximum value. The outputs of the layer\n",
" feeding in to this layer must have a dimension which\n",
" is a multiple of this pool size such that the outputs\n",
" can be split in to pools with no dimensions left over.\n",
" \"\"\"\n",
" assert pool_size > 0\n",
" self.pool_size = pool_size\n",
" \n",
" def fprop(self, inputs):\n",
" \"\"\"Forward propagates activations through the layer transformation.\n",
" \n",
" This corresponds to taking the maximum over non-overlapping pools of\n",
" inputs of a fixed size `pool_size`.\n",
"\n",
" Args:\n",
" inputs: Array of layer inputs of shape (batch_size, input_dim).\n",
"\n",
" Returns:\n",
" outputs: Array of layer outputs of shape (batch_size, output_dim).\n",
" \"\"\"\n",
" assert inputs.shape[-1] % self.pool_size == 0, (\n",
" 'Last dimension of inputs must be multiple of pool size')\n",
" raise NotImplementedError()\n",
"\n",
" def bprop(self, inputs, outputs, grads_wrt_outputs):\n",
" \"\"\"Back propagates gradients through a layer.\n",
"\n",
" Given gradients with respect to the outputs of the layer calculates the\n",
" gradients with respect to the layer inputs.\n",
"\n",
" Args:\n",
" inputs: Array of layer inputs of shape (batch_size, input_dim).\n",
" outputs: Array of layer outputs calculated in forward pass of\n",
" shape (batch_size, output_dim).\n",
" grads_wrt_outputs: Array of gradients with respect to the layer\n",
" outputs of shape (batch_size, output_dim).\n",
"\n",
" Returns:\n",
" Array of gradients with respect to the layer inputs of shape\n",
" (batch_size, input_dim).\n",
" \"\"\"\n",
" raise NotImplementedError()\n",
"\n",
" def __repr__(self):\n",
" return 'MaxPoolingLayer(pool_size={0})'.format(self.pool_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test your implementation by running the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"test_inputs = np.array([[-3, -4, 5, 8], [0, -2, 3, -8], [1, 5, 3, 2]])\n",
"test_outputs_1 = np.array([[8], [3], [5]])\n",
"test_grads_wrt_outputs_1 = np.array([[10], [5], [-3]])\n",
"test_grads_wrt_inputs_1 = np.array([[0, 0, 0, 10], [0, 0, 5, 0], [0, -3, 0, 0]])\n",
"test_outputs_2 = np.array([[-3, 8], [0, 3], [5, 3]])\n",
"test_grads_wrt_outputs_2 = np.array([[3, -1], [2, 5], [5, 3]])\n",
"test_grads_wrt_inputs_2 = np.array([[3, 0, 0, -1], [2, 0, 5, 0], [0, 5, 3, 0]])\n",
"layer_1 = MaxPoolingLayer(4)\n",
"layer_2 = MaxPoolingLayer(2)\n",
"# Check fprop with pool_size = 4\n",
"assert np.allclose(layer_1.fprop(test_inputs), test_outputs_1)\n",
"# Check bprop with pool_size = 4\n",
"assert np.allclose(\n",
" layer_1.bprop(test_inputs, test_outputs_1, test_grads_wrt_outputs_1),\n",
" test_grads_wrt_inputs_1\n",
")\n",
"# Check fprop with pool_size = 2\n",
"assert np.allclose(layer_2.fprop(test_inputs), test_outputs_2)\n",
"# Check bprop with pool_size = 2\n",
"assert np.allclose(\n",
" layer_2.bprop(test_inputs, test_outputs_2, test_grads_wrt_outputs_2),\n",
" test_grads_wrt_inputs_2\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 4: Training with maxout\n",
"\n",
"Use your `MaxPoolingLayer` implementation in a multiple layer models to experiment with how well maxout networks are able to classify MNIST digits. As with the dropout training exercise, code has been provided below as a starting point for setting up the model objects, but again feel free to substitute any components.\n",
"\n",
"If you have time you may wish to experiment with training a model using a combination of maxout and dropout or another regularisation method covered in the last lab notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import logging\n",
"from mlp.data_providers import MNISTDataProvider\n",
"from mlp.models import MultipleLayerModel\n",
"from mlp.layers import AffineLayer\n",
"from mlp.errors import CrossEntropySoftmaxError\n",
"from mlp.initialisers import GlorotUniformInit, ConstantInit\n",
"from mlp.learning_rules import MomentumLearningRule\n",
"from mlp.optimisers import Optimiser\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# Seed a random number generator\n",
"seed = 31102016 \n",
"rng = np.random.RandomState(seed)\n",
"\n",
"# Set up a logger object to print info about the training run to stdout\n",
"logger = logging.getLogger()\n",
"logger.setLevel(logging.INFO)\n",
"logger.handlers = [logging.StreamHandler()]\n",
"\n",
"# Create data provider objects for the MNIST data set\n",
"train_data = MNISTDataProvider('train', batch_size=50, rng=rng)\n",
"valid_data = MNISTDataProvider('valid', batch_size=50, rng=rng)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Size of pools to take maximum over\n",
"pool_size = 2\n",
"\n",
"input_dim, output_dim, hidden_dim = 784, 10, 100\n",
"\n",
"# Use Glorot initialisation scheme for weights and zero biases\n",
"weights_init = GlorotUniformInit(rng=rng)\n",
"biases_init = ConstantInit(0.)\n",
"\n",
"# Create three affine layer model interleaved with max-pooling layers\n",
"model = MultipleLayerModel([\n",
" AffineLayer(input_dim, hidden_dim * pool_size, weights_init, biases_init), \n",
" MaxPoolingLayer(pool_size),\n",
" AffineLayer(hidden_dim, hidden_dim * pool_size, weights_init, biases_init), \n",
" MaxPoolingLayer(pool_size),\n",
" AffineLayer(hidden_dim, output_dim, weights_init, biases_init)\n",
"])\n",
"\n",
"# Multiclass classification therefore use cross-entropy + softmax error\n",
"error = CrossEntropySoftmaxError()\n",
"\n",
"# Use a momentum learning rule - you could use an adaptive learning rule\n",
"# implemented for the coursework here instead\n",
"learning_rule = MomentumLearningRule(0.02, 0.9)\n",
"\n",
"# Monitor classification accuracy during training\n",
"data_monitors={'acc': lambda y, t: (y.argmax(-1) == t.argmax(-1)).mean()}\n",
"\n",
"optimiser = Optimiser(\n",
" model, error, learning_rule, train_data, valid_data, data_monitors)\n",
"\n",
"num_epochs = 100\n",
"stats_interval = 5\n",
"\n",
"stats, keys, run_time = optimiser.train(num_epochs=num_epochs, stats_interval=stats_interval)\n",
"\n",
"# Plot the change in the validation and training set error over training.\n",
"fig_1 = plt.figure(figsize=(8, 4))\n",
"ax_1 = fig_1.add_subplot(111)\n",
"for k in ['error(train)', 'error(valid)']:\n",
" ax_1.plot(np.arange(1, stats.shape[0]) * stats_interval, \n",
" stats[1:, keys[k]], label=k)\n",
"ax_1.legend(loc=0)\n",
"ax_1.set_xlabel('Epoch number')\n",
"\n",
"# Plot the change in the validation and training set accuracy over training.\n",
"fig_2 = plt.figure(figsize=(8, 4))\n",
"ax_2 = fig_2.add_subplot(111)\n",
"for k in ['acc(train)', 'acc(valid)']:\n",
" ax_2.plot(np.arange(1, stats.shape[0]) * stats_interval, \n",
" stats[1:, keys[k]], label=k)\n",
"ax_2.legend(loc=0)\n",
"ax_2.set_xlabel('Epoch number')"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:mlp]",
"language": "python",
"name": "conda-env-mlp-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2.0
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}