Adding new lab 5 notebook.
parent
ab060d556c
commit
167508cc63
700
notebooks/05_Non-linearities_and_regularisation.ipynb
Normal file
@ -0,0 +1,700 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Non-linearities and regularisation\n",
|
||||
"\n",
|
||||
"In this lab we will explore layers using alternative elementwise non-linear functions to the logistic sigmoid used previously as well as different methods for regularising networks to reduce overfitting and improve generalisation. This uses the material covered in the [fifth lecture slides](http://www.inf.ed.ac.uk/teaching/courses/mlp/2016/mlp05-hid.pdf)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 1: Hyperbolic tangent and rectified linear layers\n",
|
||||
"\n",
|
||||
"In the models we have been investigating so far we have been applying elementwise logistic sigmoid transformations to the outputs of intermediate (affine) layers. The logistic sigmoid is just one particular choice of an elementwise non-linearity we can use. \n",
|
||||
"\n",
|
||||
"As mentioned in the lecture although logistic sigmoid has some favourable properties in terms of interpretability, there are also disadvantages from a computational perspective. In particular that the gradients of the sigmoid become very close to zero (and may actually become exactly zero to a finite numerical precision) for very positive or negative inputs, and that the outputs are non-centred - they cover the interval $[0,\\,1]$ so negative outputs are never produced.\n",
|
||||
"\n",
|
||||
"Two alternative elementwise non-linearities which are often used in multiple layer models are the hyperbolic tangent and the rectified linear function.\n",
|
||||
"\n",
|
||||
"For a hyperbolic tangent (`Tanh`) layer the forward propagation corresponds to\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" y^{(b)}_k = \n",
|
||||
" \\tanh\\left(x^{(b)}_k\\right) = \n",
|
||||
" \\frac{\\exp\\left(x^{(b)}_k\\right) - \\exp\\left(-x^{(b)}_k\\right)}{\\exp\\left(x^{(b)}_k\\right) + \\exp\\left(-x^{(b)}_k\\right)}\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"which has corresponding partial derivatives\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\frac{\\partial y^{(b)}_k}{\\partial x^{(b)}_d} = \n",
|
||||
" \\begin{cases} \n",
|
||||
" 1 - \\left(y^{(b)}_k\\right)^2 & \\quad k = d \\\\\n",
|
||||
" 0 & \\quad k \\neq d\n",
|
||||
" \\end{cases}.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"For a rectified linear (`Relu`) layer the forward propagation corresponds to\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" y^{(b)}_k = \n",
|
||||
" \\max\\left(0,\\,x^{(b)}_k\\right)\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"which has corresponding partial derivatives\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\frac{\\partial y^{(b)}_k}{\\partial x^{(b)}_d} = \n",
|
||||
" \\begin{cases} \n",
|
||||
" 1 & \\quad k = d \\quad\\textrm{and} &x^{(b)}_d > 0 \\\\\n",
|
||||
" 0 & \\quad k \\neq d \\quad\\textrm{or} &x^{(b)}_d < 0\n",
|
||||
" \\end{cases}.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"Using these definitions implement the `fprop` and `bprop` methods for the skeleton `TanhLayer` and `ReluLayer` class definitions below. If you are not sure what the `bprop` method should be doing you may wish to go back over [the section in third lab notebook](03_Multiple_layer_models.ipynb#Back-propagation-of-gradients) where this was covered."
|
||||
]
|
||||
},
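The derivative formulas above can be checked numerically before they are wired into a `bprop` method. The following is a minimal sketch in plain NumPy, not the `mlp` framework implementation: the helper names `tanh_grad_from_outputs`, `relu_grad_from_inputs` and `central_difference` are hypothetical, and the rectified linear derivative at exactly zero is assumed to be 0 by convention. Because each output depends only on the input with the same index, the Jacobian is diagonal, so back-propagation reduces to an elementwise product with the incoming gradients.

```python
import numpy as np


def tanh_grad_from_outputs(outputs):
    """Derivative of tanh expressed through its outputs: 1 - y**2."""
    return 1.0 - outputs ** 2


def relu_grad_from_inputs(inputs):
    """Derivative of max(0, x): 1 where x > 0 and 0 elsewhere (assumed at 0)."""
    return (inputs > 0).astype(float)


def central_difference(func, x, eps=1e-6):
    """Elementwise numerical derivative of an elementwise function."""
    return (func(x + eps) - func(x - eps)) / (2.0 * eps)


x = np.array([[0.1, -0.2, 0.3], [-0.4, 0.5, -0.6]])
grads_wrt_outputs = np.array([[5., 10., -10.], [-5., 0., 10.]])

# Compare the analytic derivatives with finite-difference estimates.
tanh_check = np.allclose(
    tanh_grad_from_outputs(np.tanh(x)), central_difference(np.tanh, x))
relu_check = np.allclose(
    relu_grad_from_inputs(x),
    central_difference(lambda v: np.maximum(0.0, v), x))

# Diagonal Jacobian: the chain rule becomes an elementwise product.
tanh_grads_wrt_inputs = grads_wrt_outputs * tanh_grad_from_outputs(np.tanh(x))
relu_grads_wrt_inputs = grads_wrt_outputs * relu_grad_from_inputs(x)

print(tanh_check, relu_check)
```

The finite-difference comparison is only reliable away from the kink of the rectified linear function at zero, which is why none of the test inputs here are close to zero.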
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"from mlp.layers import Layer\n",
|
||||
"\n",
|
||||
"class TanhLayer(Layer):\n",
|
||||
" \"\"\"Layer implementing an element-wise hyperbolic tangent transformation.\"\"\"\n",
|
||||
"\n",
|
||||
" def fprop(self, inputs):\n",
|
||||
" \"\"\"Forward propagates activations through the layer transformation.\n",
|
||||
"\n",
|
||||
" For inputs `x` and outputs `y` this corresponds to `y = tanh(x)`.\n",
|
||||
" \"\"\"\n",
|
||||
" raise NotImplementedError()\n",
|
||||
"\n",
|
||||
" def bprop(self, inputs, outputs, grads_wrt_outputs):\n",
|
||||
" \"\"\"Back propagates gradients through a layer.\n",
|
||||
"\n",
|
||||
" Given gradients with respect to the outputs of the layer calculates the\n",
|
||||
" gradients with respect to the layer inputs.\n",
|
||||
" \"\"\"\n",
|
||||
" raise NotImplementedError()\n",
|
||||
"\n",
|
||||
" def __repr__(self):\n",
|
||||
" return 'TanhLayer'\n",
|
||||
" \n",
|
||||
"\n",
|
||||
"class ReluLayer(Layer):\n",
|
||||
" \"\"\"Layer implementing an element-wise rectified linear transformation.\"\"\"\n",
|
||||
"\n",
|
||||
" def fprop(self, inputs):\n",
|
||||
" \"\"\"Forward propagates activations through the layer transformation.\n",
|
||||
"\n",
|
||||
" For inputs `x` and outputs `y` this corresponds to `y = max(0, x)`.\n",
|
||||
" \"\"\"\n",
|
||||
" raise NotImplementedError()\n",
|
||||
"\n",
|
||||
" def bprop(self, inputs, outputs, grads_wrt_outputs):\n",
|
||||
" \"\"\"Back propagates gradients through a layer.\n",
|
||||
"\n",
|
||||
" Given gradients with respect to the outputs of the layer calculates the\n",
|
||||
" gradients with respect to the layer inputs.\n",
|
||||
" \"\"\"\n",
|
||||
" raise NotImplementedError()\n",
|
||||
"\n",
|
||||
" def __repr__(self):\n",
|
||||
" return 'ReluLayer'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Test your implementations by running the cells below."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test_inputs = np.array([[0.1, -0.2, 0.3], [-0.4, 0.5, -0.6]])\n",
|
||||
"test_grads_wrt_outputs = np.array([[5., 10., -10.], [-5., 0., 10.]])\n",
|
||||
"test_tanh_outputs = np.array(\n",
|
||||
" [[ 0.09966799, -0.19737532, 0.29131261],\n",
|
||||
" [-0.37994896, 0.46211716, -0.53704957]])\n",
|
||||
"test_tanh_grads_wrt_inputs = np.array(\n",
|
||||
" [[ 4.95033145, 9.61042983, -9.15136962],\n",
|
||||
" [-4.27819393, 0., 7.11577763]])\n",
|
||||
"tanh_layer = TanhLayer()\n",
|
||||
"tanh_outputs = tanh_layer.fprop(test_inputs)\n",
|
||||
"all_correct = True\n",
|
||||
"if not tanh_outputs.shape == test_tanh_outputs.shape:\n",
|
||||
" print('TanhLayer.fprop returned array with wrong shape.')\n",
|
||||
" all_correct = False\n",
|
||||
"elif not np.allclose(test_tanh_outputs, tanh_outputs):\n",
|
||||
" print('TanhLayer.fprop calculated incorrect outputs.')\n",
|
||||
" all_correct = False\n",
|
||||
"tanh_grads_wrt_inputs = tanh_layer.bprop(\n",
|
||||
" test_inputs, tanh_outputs, test_grads_wrt_outputs)\n",
|
||||
"if not tanh_grads_wrt_inputs.shape == test_tanh_grads_wrt_inputs.shape:\n",
|
||||
" print('TanhLayer.bprop returned array with wrong shape.')\n",
|
||||
" all_correct = False\n",
|
||||
"elif not np.allclose(tanh_grads_wrt_inputs, test_tanh_grads_wrt_inputs):\n",
|
||||
" print('TanhLayer.bprop calculated incorrect gradients with respect to inputs.')\n",
|
||||
" all_correct = False\n",
|
||||
"if all_correct:\n",
|
||||
" print('Outputs and gradients calculated correctly for TanhLayer.')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test_inputs = np.array([[0.1, -0.2, 0.3], [-0.4, 0.5, -0.6]])\n",
|
||||
"test_grads_wrt_outputs = np.array([[5., 10., -10.], [-5., 0., 10.]])\n",
|
||||
"test_relu_outputs = np.array([[0.1, 0., 0.3], [0., 0.5, 0.]])\n",
|
||||
"test_relu_grads_wrt_inputs = np.array([[5., 0., -10.], [-0., 0., 0.]])\n",
|
||||
"relu_layer = ReluLayer()\n",
|
||||
"relu_outputs = relu_layer.fprop(test_inputs)\n",
|
||||
"all_correct = True\n",
|
||||
"if not relu_outputs.shape == test_relu_outputs.shape:\n",
|
||||
" print('ReluLayer.fprop returned array with wrong shape.')\n",
|
||||
" all_correct = False\n",
|
||||
"elif not np.allclose(test_relu_outputs, relu_outputs):\n",
|
||||
" print('ReluLayer.fprop calculated incorrect outputs.')\n",
|
||||
" all_correct = False\n",
|
||||
"relu_grads_wrt_inputs = relu_layer.bprop(\n",
|
||||
" test_inputs, relu_outputs, test_grads_wrt_outputs)\n",
|
||||
"if not relu_grads_wrt_inputs.shape == test_relu_grads_wrt_inputs.shape:\n",
|
||||
" print('ReluLayer.bprop returned array with wrong shape.')\n",
|
||||
" all_correct = False\n",
|
||||
"elif not np.allclose(relu_grads_wrt_inputs, test_relu_grads_wrt_inputs):\n",
|
||||
" print('ReluLayer.bprop calculated incorrect gradients with respect to inputs.')\n",
|
||||
" all_correct = False\n",
|
||||
"if all_correct:\n",
|
||||
" print('Outputs and gradients calculated correctly for ReluLayer.')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 2: L1 and L2 penalties\n",
|
||||
"\n",
|
||||
"In the previous lab notebook we explored the issue of overfitting. There we saw that this arises when the model is 'too complex' ($\\sim$ has too many degrees of freedom / parameters) for the amount of data we have available.\n",
|
||||
"\n",
|
||||
"One method for trying to reduce overfitting is therefore to try to decrease the flexibility of the model. We can do this by simply reducing the number of free parameters in the model (e.g. by using a shallower model with fewer layers or layers with smaller dimensionality). More generally however we might want some way of more continuously varying the effective flexibility of a model with a fixed architecture.\n",
|
||||
"\n",
|
||||
"A common method for doing this is to add an additional term to the objective function being minimised during training which penalises some measure of the complexity of a model as a function of the model parameters. The aim of training is then to minimise with respect to the model parameters the sum $E^\\star$ of the data-driven error function term $\\bar{E}$ and a model complexity term $C$.\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" E^\\star =\n",
|
||||
" \\underbrace{\\bar{E}}_{\\textrm{data term}} + \\underbrace{C}_{\\textrm{complexity term}}\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"We need the complexity term $C$ to be easy to compute and differentiable with respect to the model parameters. A common choice is to use terms involving the *norms* ($\\sim$ a measure of size) of the parameters. This penalises models with large parameter values. Two commonly used norms are the L1 and L2 norms. \n",
|
||||
"\n",
|
||||
"For a $D$ dimensional vector $\\mathbf{v}$ the L1 norm is defined as\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
"\\| \\boldsymbol{v} \\|_1 = \\sum_{d=1}^D \\left| v_d \\right|,\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"and the L2 norm is defined as\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
"\\| \\boldsymbol{v} \\|_2 = \\left[ \\sum_{d=1}^D \\left( v_d^2 \\right) \\right]^{\\frac{1}{2}}.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"For a $K \\times D$ matrix $\\mathbf{M}$, we will define norms by collapsing the matrix to a vector $\\boldsymbol{m} = \\mathrm{vec}\\left[\\mathbf{M}\\right] = \\left[ M_{1,1} \\dots M_{1,D} ~ M_{2,1} \\dots M_{K,D} \\right]^{\\rm T}$ and then taking the norm as defined above of this resulting vector (practically this just results in summing over two sets of indices rather than one).\n",
|
||||
"\n",
|
||||
"The overall complexity penalty term $C$ is defined as a sum over individual complexity terms for each of the $P$ parameters of the model \n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" C = \\sum_{i=1}^P \\left[ C^{(i)} \\right]\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"Some of these per-parameter penalty terms $C^{(i)}$ may be set to zero if we do not wish to penalise the size of the corresponding parameter.\n",
|
||||
"\n",
|
||||
"To enable us to tradeoff between the model complexity and data error terms, it is typical to introduce positive scalar coefficients $\\beta_i$ to scale the penalty term on the $i$th parameter. A *L1 penalty* on the $i$th vector parameter $\\boldsymbol{p}^{(i)}$ (or matrix parameter collapsed to a vector) is then commonly defined as\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" C^{(i)}_{\\textrm{L1}} = \n",
|
||||
" \\beta_i \\left\\| \\boldsymbol{p}^{(i)} \\right\\|_1 = \n",
|
||||
" \\beta_i \\sum_{d=1}^D \\left| p^{(i)}_d \\right|.\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"This has a gradient with respect to the parameter vector\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\frac{\\partial C^{(i)}_{\\textrm{L1}}}{\\partial p^{(i)}_d} = \\beta_i \\, \\textrm{sgn}\\left( p^{(i)}_d \\right)\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"where $\\textrm{sgn}(u) = +1$ if $u > 0$, $\\textrm{sgn}(u) = -1$ if $u < 0$ (and is not well defined for $u=0$ though a common convention is to have $\\textrm{sgn}(0) = 0$).\n",
|
||||
"\n",
|
||||
"Similarly a *L2 penalty* on the $i$th vector parameter $\\boldsymbol{p}^{(i)}$ (or matrix parameter collapsed to a vector) is commonly defined as\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" C^{(i)}_{\\textrm{L2}} = \n",
|
||||
" \\frac{1}{2} \\beta_i \\left\\| \\boldsymbol{p}^{(i)} \\right\\|_2^2 =\n",
|
||||
" \\frac{1}{2} \\beta_i \\sum_{d=1}^D \\left[ \\left( p^{(i)}_d \\right)^2 \\right].\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"Somewhat confusingly this is proportional to the square of the L2 norm rather than the L2 norm itself, however it is an almost universal convention to call this an L2 penalty so we will stick with this nomenclature here. The $\\frac{1}{2}$ term is less universal and is sometimes not included; we include it here for consistency with how we defined the sum of squared errors cost. Similarly to that case, the $\\frac{1}{2}$ cancels when calculating the gradient with respect to the parameter\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\frac{\\partial C^{(i)}_{\\textrm{L2}}}{\\partial p^{(i)}_d} = \\beta_i p^{(i)}_d\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"Use the above definitions for the L1 and L2 penalties for a parameter and corresponding gradients to implement the `__call__` and `grad` methods respectively for the skeleton `L1Penalty` and `L2Penalty` class definitions below. The `coefficient` propert of these classes should be used as the $\\beta_i$ value in the equations above. The parameter the penalty term (or gradient) is being evaluated for will be either a one or two-dimensional NumPy array (corresponding to a vector or matrix parameter respectively) and your implementations should be able to cope with both cases."
|
||||
]
|
||||
},
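To make the norm definitions above concrete, here is a small sketch that evaluates them on a vector and on a matrix collapsed to a vector. The arrays `v` and `M` are made-up illustrative values. One point worth noting, offered as a caution rather than as part of the exercise: calling `np.linalg.norm(M, 1)` directly on a two-dimensional array computes the induced matrix 1-norm (the maximum absolute column sum), which is not the collapsed-vector L1 norm used in this notebook, whereas the default matrix norm of `np.linalg.norm` (Frobenius) does coincide with the collapsed-vector L2 norm.

```python
import numpy as np

v = np.array([3.0, -4.0])
l1_v = np.abs(v).sum()           # 7.0
l2_v = np.sqrt((v ** 2).sum())   # 5.0

M = np.array([[1.0, -2.0], [3.0, -4.0]])
m = M.ravel()  # collapse the matrix to a vector, row by row

# Summing over both indices of M matches the norm of the collapsed vector.
l1_M = np.abs(M).sum()           # 10.0
l2_M = np.sqrt((M ** 2).sum())

# The induced matrix 1-norm is a different quantity: max column sum = 6.0.
induced_l1_M = np.linalg.norm(M, 1)

print(l1_v, l2_v, l1_M, l2_M, induced_l1_M)
```

Writing the sums explicitly with `np.abs(...).sum()` and `(... ** 2).sum()` sidesteps this ambiguity and handles one and two-dimensional arrays in the same way.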
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"class L1Penalty(object):\n",
|
||||
" \"\"\"L1 parameter penalty.\n",
|
||||
" \n",
|
||||
" Term to add to the objective function penalising parameters\n",
|
||||
" based on their L1 norm.\n",
|
||||
" \"\"\"\n",
|
||||
" \n",
|
||||
" def __init__(self, coefficient):\n",
|
||||
" \"\"\"Create a new L1 penalty object.\n",
|
||||
" \n",
|
||||
" Args:\n",
|
||||
" coefficient: Positive constant to scale penalty term by.\n",
|
||||
" \"\"\"\n",
|
||||
" assert coefficient > 0., 'Penalty coefficient must be positive.'\n",
|
||||
" self.coefficient = coefficient\n",
|
||||
" \n",
|
||||
" def __call__(self, parameter):\n",
|
||||
" \"\"\"Calculate L1 penalty value for a parameter.\n",
|
||||
" \n",
|
||||
" Args:\n",
|
||||
" parameter: Array corresponding to a model parameter.\n",
|
||||
" \n",
|
||||
" Returns:\n",
|
||||
" Value of penalty term.\n",
|
||||
" \"\"\"\n",
|
||||
" raise NotImplementedError()\n",
|
||||
" \n",
|
||||
" def grad(self, parameter):\n",
|
||||
" \"\"\"Calculate the penalty gradient with respect to the parameter.\n",
|
||||
" \n",
|
||||
" Args:\n",
|
||||
" parameter: Array corresponding to a model parameter.\n",
|
||||
" \n",
|
||||
" Returns:\n",
|
||||
" Value of penalty gradient with respect to parameter. This\n",
|
||||
" should be an array of the same shape as the parameter.\n",
|
||||
" \"\"\"\n",
|
||||
" raise NotImplementedError()\n",
|
||||
" \n",
|
||||
" def __repr__(self):\n",
|
||||
" return 'L1Penalty({0})'.format(self.coefficient)\n",
|
||||
" \n",
|
||||
"\n",
|
||||
"class L2Penalty(object):\n",
|
||||
" \"\"\"L1 parameter penalty.\n",
|
||||
" \n",
|
||||
" Term to add to the objective function penalising parameters\n",
|
||||
" based on their L2 norm.\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" def __init__(self, coefficient):\n",
|
||||
" \"\"\"Create a new L2 penalty object.\n",
|
||||
" \n",
|
||||
" Args:\n",
|
||||
" coefficient: Positive constant to scale penalty term by.\n",
|
||||
" \"\"\"\n",
|
||||
" assert coefficient > 0., 'Penalty coefficient must be positive.'\n",
|
||||
" self.coefficient = coefficient\n",
|
||||
" \n",
|
||||
" def __call__(self, parameter):\n",
|
||||
" \"\"\"Calculate L2 penalty value for a parameter.\n",
|
||||
" \n",
|
||||
" Args:\n",
|
||||
" parameter: Array corresponding to a model parameter.\n",
|
||||
" \n",
|
||||
" Returns:\n",
|
||||
" Value of penalty term.\n",
|
||||
" \"\"\"\n",
|
||||
" raise NotImplementedError()\n",
|
||||
" \n",
|
||||
" def grad(self, parameter):\n",
|
||||
" \"\"\"Calculate the penalty gradient with respect to the parameter.\n",
|
||||
" \n",
|
||||
" Args:\n",
|
||||
" parameter: Array corresponding to a model parameter.\n",
|
||||
" \n",
|
||||
" Returns:\n",
|
||||
" Value of penalty gradient with respect to parameter. This\n",
|
||||
" should be an array of the same shape as the parameter.\n",
|
||||
" \"\"\"\n",
|
||||
" raise NotImplementedError()\n",
|
||||
" \n",
|
||||
" def __repr__(self):\n",
|
||||
" return 'L2Penalty({0})'.format(self.coefficient)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Test your implementations by running the cells below."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test_params_1 = np.array([[0.5, 0.3, -1.2, 5.8], [0.2, -3.1, 4.9, -5.0]])\n",
|
||||
"test_params_2 = np.array([0.8, -0.6, -0.3, 1.5, 2.8])\n",
|
||||
"true_l1_cost_1 = 10.5\n",
|
||||
"true_l1_grad_1 = np.array([[0.5, 0.5, -0.5, 0.5], [0.5, -0.5, 0.5, -0.5]])\n",
|
||||
"true_l1_cost_2 = 3.\n",
|
||||
"true_l1_grad_2 = np.array([0.5, -0.5, -0.5, 0.5, 0.5])\n",
|
||||
"l1 = L1Penalty(0.5)\n",
|
||||
"if (not np.allclose(l1(test_params_1), true_l1_cost_1) or\n",
|
||||
" not np.allclose(l1(test_params_2), true_l1_cost_2)):\n",
|
||||
" print('L1Penalty.__call__ giving incorrect value(s).')\n",
|
||||
"elif (not np.allclose(l1.grad(test_params_1), true_l1_grad_1) or \n",
|
||||
" not np.allclose(l1.grad(test_params_2), true_l1_grad_2)):\n",
|
||||
" print('L1Penalty.grad giving incorrect value(s).')\n",
|
||||
"else:\n",
|
||||
" print('All test values calculated correctly for L1Penalty')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"test_params_1 = np.array([[0.5, 0.3, -1.2, 5.8], [0.2, -3.1, 4.9, -5.0]])\n",
|
||||
"test_params_2 = np.array([0.8, -0.6, -0.3, 1.5, 2.8])\n",
|
||||
"true_l2_cost_1 = 23.52\n",
|
||||
"true_l2_grad_1 = np.array([[0.25, 0.15, -0.6, 2.9], [0.1, -1.55, 2.45, -2.5]])\n",
|
||||
"true_l2_cost_2 = 2.795\n",
|
||||
"true_l2_grad_2 = np.array([0.4, -0.3, -0.15, 0.75, 1.4])\n",
|
||||
"l2 = L2Penalty(0.5)\n",
|
||||
"if (not np.allclose(l2(test_params_1), true_l2_cost_1) or\n",
|
||||
" not np.allclose(l2(test_params_2), true_l2_cost_2)):\n",
|
||||
" print('L2Penalty.__call__ giving incorrect value(s).')\n",
|
||||
"elif (not np.allclose(l2.grad(test_params_1), true_l2_grad_1) or \n",
|
||||
" not np.allclose(l2.grad(test_params_2), true_l2_grad_2)):\n",
|
||||
" print('L2Penalty.grad giving incorrect value(s).')\n",
|
||||
"else:\n",
|
||||
" print('All test values calculated correctly for L2Penalty')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 3: Training with regularisation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Previously in the second laboratory you implemented a function `grads_wrt_params` to calculate the gradient of an error function with respect to the parameters of an affine model (layer), given gradients of the error function with respect to the model (layer) outputs.\n",
|
||||
"\n",
|
||||
"If we are training a model using a regularised objective function, we need to additionally calculate the gradients of the regularisation penalty terms with respect to the parameters and add these to the error function gradient terms. Following from the definition of the regularised objective $E^\\star$ above we have that the gradient of the overall objective with respect to the $d$th element of the $i$th model parameter is\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\frac{\\partial E^\\star}{\\partial p^{(i)}_d} =\n",
|
||||
" \\frac{\\partial \\bar{E}}{\\partial p^{(i)}_d} + \n",
|
||||
" \\frac{\\partial C}{\\partial p^{(i)}_d}\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"We have already discussed in the second lab notebook how to calculate the error function gradient term $\\frac{\\partial \\bar{E}}{\\partial p^{(i)}_d}$. As the model complexity term is composed of a sum of per parameter terms and only one of these will depend on the $i$th parameter we can write\n",
|
||||
"\n",
|
||||
"\\begin{equation}\n",
|
||||
" \\frac{\\partial C}{\\partial p^{(i)}_d} = \\frac{\\partial C^{(i)}}{\\partial p^{(i)}_d}\n",
|
||||
"\\end{equation}\n",
|
||||
"\n",
|
||||
"which corresponds to the penalty term gradients you implemented above. To enable us to use the same `Optimiser` implementation that we have previously used to train models without regularisation, we have altered the implementation of the `AffineLayer` class (this being the only layer we currently have defined with parameters) to allow us to specify penalty terms on the weight matrix and bias vector when creating an instance of the class and to add the corresponding penalty gradients to the returned value from the `grads_wrt_params` method. \n",
|
||||
"\n",
|
||||
"The penalty terms need to be specified as a class matching the interface of the `L1Penalty` and `L2Penalty` classes you implemented above, defining both a `__call__` method to calculate the penalty value for a parameter and a `grad` method to calculate the gradient of the penalty with respect to the parameter. Separate penalties can be specified for the weight and bias parameters, with it common to only regularise the weight parameters. \n",
|
||||
"\n",
|
||||
"The penalty terms for a layer are specifed using the `weights_penalty` and `biases_penalty` arguments to the `__init__` method of the `AffineLayer` class. If either (or both) ofthese are set to `None` (the default) no regularisation is applied to the corresponding parameter.\n",
|
||||
"\n",
|
||||
"Using the `L1Penalty` and `L2Penalty` classes you implemented in the previous exercise, train models to classify MNIST digit images with\n",
|
||||
"\n",
|
||||
" * no regularisation\n",
|
||||
" * an L1 penalty with coefficient 0.1 on the all of the weight matrix parameters\n",
|
||||
" * an L1 penalty with coefficient 1.0 on the all of the weight matrix parameters\n",
|
||||
" * an L2 penalty with coefficient 0.1 on the all of the weight matrix parameters\n",
|
||||
" * an L2 penalty with coefficient 1.0 on the all of the weight matrix parameters\n",
|
||||
" \n",
|
||||
"The models should all have three affine layers interspersed with rectified linear layers (as implemented in the first exercise) and intermediate layers between the input and output should have dimensionalities of 100. The final output layer should be an `AffineLayer` (the model outputting the logarithms of the non-normalised class probabilties) and you should use the `CrossEntropySoftmaxError` as the error function (which calculates the softmax of the model outputs to convert to normalised class probabilities before calculating the corresponding multi-class cross entropy error). \n",
|
||||
"\n",
|
||||
"Use the `GlorotInit` class introduced in the first coursework to initialise the weights in all layers, using a gain of 0.5 (this adjusts for the fact that the rectified linear sets zeros all negative inputs), and initialises the biases to zero with a `ConstantInit` object. \n",
|
||||
"\n",
|
||||
"As an example the necessary parameter initialisers, model and error can be defined using\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"weights_init = GlorotUniformInit(0.5, rng)\n",
|
||||
"biases_init = ConstantInit(0.)\n",
|
||||
"input_dim, output_dim, hidden_dim = 784, 10, 100\n",
|
||||
"model = MultipleLayerModel([\n",
|
||||
" AffineLayer(input_dim, hidden_dim, weights_init, \n",
|
||||
" biases_init, weights_penalty=weights_penalty),\n",
|
||||
" ReluLayer(),\n",
|
||||
" AffineLayer(hidden_dim, hidden_dim, weights_init, \n",
|
||||
" biases_init, weights_penalty=weights_penalty),\n",
|
||||
" ReluLayer(),\n",
|
||||
" AffineLayer(hidden_dim, output_dim, weights_init, \n",
|
||||
" biases_init, weights_penalty=weights_penalty)\n",
|
||||
"])\n",
|
||||
"error = CrossEntropySoftmaxError()\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"This assumes all the relevant classes have been imported from their modules, a penalty object has been assigned to `weights_penalty` and a seeded random number generator assigned to `rng`.\n",
|
||||
"\n",
|
||||
"For each regularisation scheme, train the model for 100 epochs with a batch size of 50 and using a gradient descent with momentum learning rule with learning rate 0.05 and momentum coefficient 0.8. For each regularisation scheme you should store the run statistics (output of `Optimiser.train`) and the final values of the first layer weights for each of the trained models."
|
||||
]
|
||||
},
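The decomposition of the regularised gradient into a data term plus a penalty term can be illustrated on a toy problem that does not depend on the `mlp` package. The sketch below assumes a linear least-squares model with an L2 penalty; the data, the function names and the coefficient value `beta = 0.1` are all made up for illustration. It checks that adding the penalty gradient $\beta_i p^{(i)}_d$ to the data gradient matches a finite-difference estimate of the gradient of the full objective $E^\star$.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed for reproducibility
inputs = rng.normal(size=(20, 3))
targets = rng.normal(size=20)
weights = rng.normal(size=3)
beta = 0.1  # hypothetical penalty coefficient


def data_error(w):
    """Data term: half the mean squared error of a linear model."""
    residuals = inputs.dot(w) - targets
    return 0.5 * np.mean(residuals ** 2)


def data_error_grad(w):
    residuals = inputs.dot(w) - targets
    return inputs.T.dot(residuals) / inputs.shape[0]


def l2_penalty(w):
    """Complexity term: 0.5 * beta * squared L2 norm."""
    return 0.5 * beta * np.sum(w ** 2)


def l2_penalty_grad(w):
    return beta * w


def objective(w):
    return data_error(w) + l2_penalty(w)


def objective_grad(w):
    # The gradient of a sum is the sum of the gradients.
    return data_error_grad(w) + l2_penalty_grad(w)


def numerical_grad(func, w, eps=1e-6):
    """Central finite-difference estimate of the gradient of func at w."""
    grad = np.zeros_like(w)
    for d in range(w.size):
        step = np.zeros_like(w)
        step[d] = eps
        grad[d] = (func(w + step) - func(w - step)) / (2.0 * eps)
    return grad


grads_match = np.allclose(
    objective_grad(weights), numerical_grad(objective, weights))
print(grads_match)
```

In the lab framework the same addition is described as happening inside `AffineLayer.grads_wrt_params`, which is why the optimiser itself needs no changes.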
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Plot the training set error against epoch number for all different regularisation schemes on the same axis. On a second axis plot the validation set error against epoch number for all the different regularisation schemes. Interpret and comment on what you see."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The cell below defines two functions for visualising the first layer weights of the models trained above. The first plots a histogram of the weight values and the second plots the first layer weights as feature maps, i.e. each row of the first layer weight matrix (corresponding to the weights going from the input MNIST image to a particular first layer output) is visualised as a $28\\times 28$ image. In these feature maps white corresponds to negative weights, black to positive weights and grey to weights close to zero. \n",
|
||||
"\n",
|
||||
"Use these functions to plot a histogram and feature map visualisation for the first layer weights of each model trained above. You should try to interpret the plots in the context of what you were told in the lecture about the behaviour of L1 versus L2 regularisation."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def plot_param_histogram(param, fig_size=(6, 3), interval=[-1.5, 1.5]):\n",
|
||||
" \"\"\"Plots a normalised histogram of an array of parameter values.\"\"\"\n",
|
||||
" fig = plt.figure(figsize=fig_size)\n",
|
||||
" ax = fig.add_subplot(111)\n",
|
||||
" ax.hist(param.flatten(), 50, interval, normed=True)\n",
|
||||
" ax.set_xlabel('Parameter value')\n",
|
||||
" ax.set_ylabel('Normalised frequency density')\n",
|
||||
" return fig, ax\n",
|
||||
"\n",
|
||||
"def visualise_first_layer_weights(weights, fig_size=(5, 5)):\n",
|
||||
" \"\"\"Plots a grid of first layer weights as feature maps.\"\"\"\n",
|
||||
" fig = plt.figure(figsize=fig_size)\n",
|
||||
" num_feature_maps = weights.shape[0]\n",
|
||||
" grid_size = int(num_feature_maps**0.5)\n",
|
||||
" max_abs = np.abs(model.params[0]).max()\n",
|
||||
" tiled = -np.ones((30 * grid_size, \n",
|
||||
" 30 * num_feature_maps // grid_size)) * max_abs\n",
|
||||
" for i, fm in enumerate(model.params[0]):\n",
|
||||
" r, c = i % grid_size, i // grid_size\n",
|
||||
" tiled[1 + r * 30:(r + 1) * 30 - 1, \n",
|
||||
" 1 + c * 30:(c + 1) * 30 - 1] = fm.reshape((28, 28))\n",
|
||||
" ax = fig.add_subplot(111)\n",
|
||||
" max_abs = np.abs(tiled).max()\n",
|
||||
" ax.imshow(tiled, cmap='Greys', vmin=-max_abs, vmax=max_abs)\n",
|
||||
" ax.axis('off')\n",
|
||||
" fig.tight_layout()\n",
|
||||
" plt.show()\n",
|
||||
" return fig, ax"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 4: Random data augmentation\n",
|
||||
"\n",
|
||||
"Another technique mentioned in the lectures for trying to reduce overfitting is to artificially augment the training data set by applying random transformations to the original training data inputs. The idea is to produce further artificial inputs corresponding to the same target class as the original input. The closer the artificially generated inputs are to appearing like true inputs the better, as they provide more realistic additional examples for the model to learn from.\n",
"\n",
"For the handwritten image inputs in the MNIST dataset, an obvious way to augment the dataset is to apply small rotations to the original images. Provided the rotations are small, we would generally expect that what we would identify as the class of a digit image will remain the same.\n",
"\n",
"Implement a function which, given a batch of MNIST images as 784-dimensional vectors, i.e. an array of shape `(batch_size, 784)`,\n",
"\n",
" * chooses 25% of the images in the batch at random\n",
" * for each image in the 25% chosen, rotates the image by a random angle in $\\left[-30^\\circ,\\,30^\\circ\\right]$\n",
" * returns a new array of shape `(batch_size, 784)` in which the rows corresponding to the 25% chosen images are the vectors corresponding to the new randomly rotated images, while the remaining rows correspond to the original images.\n",
"\n",
"You will need to make use of the [`scipy.ndimage.interpolation.rotate`](https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.ndimage.interpolation.rotate.html#scipy.ndimage.interpolation.rotate) function, which is imported below for you. For computational efficiency you should use bilinear interpolation by setting `order=1` as a keyword argument to this function, rather than using the default of bicubic interpolation. Additionally you should make sure the original shape of the images is maintained by passing a `reshape=False` keyword argument."
]
},
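{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the `rotate` call described above, the sketch below rotates a single image. It is only a sketch under stated assumptions: the random array `image` is a stand-in for a real MNIST digit, the seed and variable names are made up for the example, and `rotate` is imported from the top-level `scipy.ndimage` namespace rather than from `scipy.ndimage.interpolation`. Choosing 25% of a batch and copying the results into a new array is left to your own implementation.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.ndimage import rotate\n",
"\n",
"# Assumption: a seeded generator and a random 28x28 array standing in\n",
"# for one MNIST image (not real data).\n",
"rng = np.random.RandomState(0)\n",
"image = rng.uniform(size=(28, 28))\n",
"\n",
"# Draw a rotation angle in degrees from [-30, 30].\n",
"angle = rng.uniform(-30., 30.)\n",
"\n",
"# order=1 selects bilinear interpolation; reshape=False keeps the\n",
"# output the same shape as the input.\n",
"rotated = rotate(image, angle, order=1, reshape=False)\n",
"\n",
"# Flatten back to a 784-dimensional vector, as used for batch rows.\n",
"flat = rotated.reshape(784)\n",
"```\n",
"\n",
"Because `reshape=False` is passed, `rotated` keeps the 28 by 28 shape of `image`, so it can be flattened straight back into a row of a `(batch_size, 784)` array."
]
},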
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from scipy.ndimage.interpolation import rotate\n",
"\n",
"def random_rotation(inputs, rng):\n",
"    \"\"\"Randomly rotates a subset of images in a batch.\n",
"\n",
"    Args:\n",
"        inputs: Input image batch, an array of shape (batch_size, 784).\n",
"        rng: A seeded random number generator.\n",
"\n",
"    Returns:\n",
"        An array of shape (batch_size, 784) corresponding to a copy\n",
"        of the original `inputs` array with the randomly selected\n",
"        images rotated by a random angle. The original `inputs`\n",
"        array should not be modified.\n",
"    \"\"\"\n",
"    raise NotImplementedError()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the cell below to test your implementation. This uses the `show_batch_of_images` function we implemented in the first lab notebook to visualise the images in a batch before and after application of the random rotation transformation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def show_batch_of_images(img_batch, fig_size=(3, 3)):\n",
"    fig = plt.figure(figsize=fig_size)\n",
"    batch_size, im_height, im_width = img_batch.shape\n",
"    # calculate no. columns per grid row to give square grid\n",
"    grid_size = int(batch_size**0.5)\n",
"    # initialise empty array to tile image grid into\n",
"    tiled = np.empty((im_height * grid_size,\n",
"                      im_width * batch_size // grid_size))\n",
"    # iterate over images in batch + indexes within batch\n",
"    for i, img in enumerate(img_batch):\n",
"        # calculate grid row and column indices\n",
"        r, c = i % grid_size, i // grid_size\n",
"        # rows are sliced by image height, columns by image width\n",
"        tiled[r * im_height:(r + 1) * im_height,\n",
"              c * im_width:(c + 1) * im_width] = img\n",
"    ax = fig.add_subplot(111)\n",
"    ax.imshow(tiled, cmap='Greys') #, vmin=0., vmax=1.)\n",
"    ax.axis('off')\n",
"    fig.tight_layout()\n",
"    plt.show()\n",
"    return fig, ax\n",
"\n",
"test_data = MNISTDataProvider('test', 100, rng=rng)\n",
"inputs, targets = test_data.next()\n",
"_ = show_batch_of_images(inputs.reshape((-1, 28, 28)))\n",
"transformed_inputs = random_rotation(inputs, rng)\n",
"_ = show_batch_of_images(transformed_inputs.reshape((-1, 28, 28)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 5: Training with data augmentation\n",
"\n",
"One simple way to use data augmentation is to just statically augment the training data set - for example we could iterate through the training dataset applying a transformation function like that implemented above to generate new artificial inputs, and use both the original and newly generated data in a new data provider object. However, we are quite limited in how far we can augment the dataset with a static method like this: if we wanted to apply 9 random rotations to each image in the original dataset, we would end up with a dataset with 10 times the memory requirements that would take 10 times as long to run through each epoch.\n",
"\n",
"An alternative is to randomly augment the data on the fly as we iterate through the data provider in each epoch. In this method a new data provider class can be defined that inherits from the original data provider to be augmented, and provides a new `next` method which applies a random transformation function like that implemented in the previous exercise to each input batch before returning it. This method means that on every epoch a different set of training examples is provided to the model, and so in some ways corresponds to an 'infinite' data set (although the amount of variability in the dataset will still be significantly less than the variability in all possible digit images). Compared to static augmentation, this dynamic augmentation scheme comes at the computational cost of having to apply the random transformation each time a new batch is provided. We can vary this overhead by changing the proportion of images in a batch that are randomly transformed.\n",
"\n",
"An implementation of this scheme has been provided for the MNIST data set in the `AugmentedMNISTDataProvider` object in the `mlp.data_providers` module. In addition to the arguments of the original `MNISTDataProvider.__init__` method, this additionally takes a `transformer` argument, which should be a function which takes as arguments an inputs batch array and a random number generator object, and returns an array corresponding to a random transformation of the inputs.\n",
"\n",
"Train a model with the same architecture as in exercise 3 and with no L1 / L2 regularisation using a training data provider which randomly augments the training images using your `random_rotation` transformer function. Plot the training and validation set errors over the training epochs and compare this plot to your previous results from exercise 3."
]
},
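{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the on-the-fly scheme described above more concrete, here is a minimal sketch of the idea. It is not the `AugmentedMNISTDataProvider` implementation: `augmented_batches` and `add_noise` are hypothetical names invented for this example, the data are random stand-ins, and a simple noise transformer is used in place of `random_rotation` so that the snippet is self-contained.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def add_noise(inputs, rng):\n",
"    # Hypothetical stand-in transformer: returns a noisy copy of the\n",
"    # batch and leaves the original array unmodified.\n",
"    return inputs + 0.1 * rng.normal(size=inputs.shape)\n",
"\n",
"def augmented_batches(inputs, targets, batch_size, transformer, rng):\n",
"    # Yield (inputs, targets) batches, applying a fresh random\n",
"    # transformation to the inputs each time a batch is produced.\n",
"    for start in range(0, inputs.shape[0], batch_size):\n",
"        stop = start + batch_size\n",
"        yield transformer(inputs[start:stop], rng), targets[start:stop]\n",
"\n",
"rng = np.random.RandomState(123)\n",
"inputs = rng.uniform(size=(6, 784))  # stand-in for training images\n",
"targets = np.arange(6)               # stand-in for class labels\n",
"\n",
"# Each pass over the data draws new random transformations, so the\n",
"# batches differ between epochs although `inputs` never changes.\n",
"for epoch in range(2):\n",
"    for batch_inputs, batch_targets in augmented_batches(\n",
"            inputs, targets, 3, add_noise, rng):\n",
"        assert batch_inputs.shape == (3, 784)\n",
"```\n",
"\n",
"Only the original array is held in memory here; the price is that the transformer runs again for every batch, which is the overhead discussed above."
]
},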
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from mlp.data_providers import AugmentedMNISTDataProvider\n",
"\n",
"aug_train_data = AugmentedMNISTDataProvider('train', rng=rng, transformer=random_rotation)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:mlp]",
"language": "python",
"name": "conda-env-mlp-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}