{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"This tutorial focuses on the implementation of three regularisation techniques: two of them add a regularisation term to the cost function based on the *L1* and *L2* norms; the third technique, called *Dropout*, is a form of noise injection by random corruption of information carried by the hidden units during training.\n",
"\n",
"\n",
"## Virtual environments\n",
"\n",
"Before you proceed onwards, remember to activate your virtual environment by typing `activate_mlp` or `source ~/mlpractical/venv/bin/activate` (or if you did the original install the \"comfy way\" type: `workon mlpractical`).\n",
"\n",
"\n",
"## Syncing the git repository\n",
"\n",
"Look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a> for more details. In short, we recommend creating a separate branch for this lab, as follows:\n",
"\n",
"1. Enter the mlpractical directory `cd ~/mlpractical/repo-mlp`\n",
"2. List the branches and check which one is currently active by typing: `git branch`\n",
"3. If you have followed our recommendations, you should be on the `coursework1` branch; commit your local changes to the repository by typing:\n",
"```\n",
"git commit -am \"finished coursework\"\n",
"```\n",
"4. Now you can switch to the `master` branch by typing:\n",
"```\n",
"git checkout master\n",
"```\n",
"5. Update the repository (this assumes `master` has no conflicts; if there are some, have a look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>):\n",
"```\n",
"git pull\n",
"```\n",
"6. Finally, create the new branch and switch to it by typing:\n",
"```\n",
"git checkout -b lab4\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Regularisation\n",
"\n",
"Regularisation adds a *complexity term* to the cost function. Its purpose is to put a prior on the model's parameters that penalises complexity. Perhaps the most common prior assumes that smoother solutions (those that do not fit the training data too closely) are better, as they are more likely to generalise to unseen data.\n",
"\n",
"One way to incorporate such a prior into the model is to add a term that penalises certain configurations of the parameters: either one that keeps the weights from growing too large ($L_2$), or one that prefers solutions that can be modelled with fewer parameters ($L_1$), hence encouraging some parameters to become 0. One can, of course, combine many such priors when optimising the model; in this lab, however, we shall use the $L_1$ and/or $L_2$ priors.\n",
"\n",
"$L_1$ and $L_2$ priors can be easily incorporated into the training objective through additive terms, as follows:\n",
"\n",
"(1) $\n",
" \\begin{align*}\n",
" E^n &= \\underbrace{E^n_{\\text{train}}}_{\\text{data term}} + \n",
" \\underbrace{\\beta_{L_1} E^n_{L_1}}_{\\text{prior term}} + \\underbrace{\\beta_{L_2} E^n_{L_2}}_{\\text{prior term}}\n",
"\\end{align*}\n",
"$\n",
"\n",
"where $ E^n_{\\text{train}} = - \\sum_{k=1}^K t^n_k \\ln y^n_k $ is the cross-entropy cost function, $\\beta_{L_1}$ and $\\beta_{L_2}$ are non-negative constants specified in advance (hyper-parameters) and $E^n_{L_1}$ and $E^n_{L_2}$ are norm metrics specifying certain properties of the parameters:\n",
"\n",
"(2) $\n",
" \\begin{align*}\n",
" E^n_{L_p}(\\mathbf{W}) = ||\\mathbf{W}||_p = \\left ( \\sum_{i,j \\in \\mathbf{W}} |w_{i,j}|^p \\right )^{\\frac{1}{p}}\n",
"\\end{align*}\n",
"$\n",
"\n",
"where $p$ denotes the order of the norm (for regularisation, either 1 or 2). Note that in practice, for computational convenience, we use the squared $L_{p=2}$ norm, which omits the square root in (2), that is:\n",
"\n",
"(3)$ \\begin{align*}\n",
" E^n_{L_{p=2}}(\\mathbf{W}) = ||\\mathbf{W}||^2_2 = \\left ( \\left ( \\sum_{i,j \\in \\mathbf{W}} |w_{i,j}|^2 \\right )^{\\frac{1}{2}} \\right )^2 = \\sum_{i,j \\in \\mathbf{W}} |w_{i,j}|^2\n",
"\\end{align*}\n",
"$\n",
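"\n",
"As a quick worked example of (2) and (3), take a hypothetical weight matrix $\\mathbf{W} = \\begin{pmatrix} 3 & -4 \\end{pmatrix}$ (the values are chosen purely for illustration):\n",
"\n",
"$\n",
"\\begin{align*}\n",
" ||\\mathbf{W}||_1 &= |3| + |-4| = 7 \\\\\n",
" ||\\mathbf{W}||_2 &= \\left ( 3^2 + (-4)^2 \\right )^{\\frac{1}{2}} = 5 \\\\\n",
" ||\\mathbf{W}||^2_2 &= 3^2 + (-4)^2 = 25\n",
"\\end{align*}\n",
"$\n",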
"\n",
"## $L_{p=2}$ (Weight Decay)\n",
"\n",
"With the $L_{2}$ regulariser, our cost becomes (the factor $\\frac{1}{2}$ simplifies the derivative later):\n",
"\n",
"(4) $\n",
" \\begin{align*}\n",
" E^n &= \\underbrace{E^n_{\\text{train}}}_{\\text{data term}} + \n",
" \\underbrace{\\beta_{L_2} \\frac{1}{2} E^n_{L_2}}_{\\text{prior term}}\n",
"\\end{align*}\n",
"$\n",
"\n",
"Hence, the gradient of the cost with respect to the parameter $w_i$ is given as follows:\n",
"\n",
"(5) $\n",
"\\begin{align*}\\frac{\\partial E^n}{\\partial w_i} &= \\frac{\\partial (E^n_{\\text{train}} + \\beta_{L_2} \\frac{1}{2} E^n_{L_2}) }{\\partial w_i} \n",
" = \\left( \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_2} \\frac{1}{2} \\frac{\\partial\n",
" E^n_{L_2}}{\\partial w_i} \\right) \n",
" = \\left( \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_2} w_i \\right)\n",
"\\end{align*}\n",
"$\n",
"\n",
"And the actual update we apply to the parameter $w_i$ is:\n",
"\n",
"(6) $\n",
"\\begin{align*}\n",
" \\Delta w_i &= -\\eta \\left( \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_2} w_i \\right) \n",
"\\end{align*}\n",
"$\n",
"\n",
"where $\\eta$ is the learning rate.\n",
"\n",
"Exercise 1 gives further suggestions on how to incorporate this technique into the lab code. The prior-related contributions to the cost (equation (1)) are computed in `mlp.optimisers.Optimiser.compute_prior_costs()`; your job is to add the corresponding code when computing the gradients with respect to the parameters.\n",
"\n",
"## $L_{p=1}$ (Sparsity)\n",
"\n",
"With the $L_{1}$ regulariser, our cost becomes:\n",
"\n",
"(7) $\n",
" \\begin{align*}\n",
" E^n &= \\underbrace{E^n_{\\text{train}}}_{\\text{data term}} + \n",
" \\underbrace{\\beta_{L_1} E^n_{L_1}}_{\\text{prior term}} \n",
"\\end{align*}\n",
"$\n",
"\n",
"Hence, the gradient of the cost with respect to the parameter $w_i$ is given as follows:\n",
"\n",
"(8) $\\begin{align*}\n",
" \\frac{\\partial E^n}{\\partial w_i} = \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_1} \\frac{\\partial E^n_{L_1}}{\\partial w_i} = \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_1} \\mbox{sgn}(w_i)\n",
"\\end{align*}\n",
"$\n",
"\n",
"And the actual update we apply to the parameter $w_i$ is:\n",
"\n",
"(9) $\\begin{align*}\n",
" \\Delta w_i &= -\\eta \\left( \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_1} \\mbox{sgn}(w_i) \\right) \n",
"\\end{align*}$\n",
"\n",
"where $\\mbox{sgn}(w_i)$ is the sign of $w_i$: $\\mbox{sgn}(w_i) = 1$ if $w_i>0$ and $\\mbox{sgn}(w_i) = -1$ if $w_i<0$ (for $w_i = 0$ one conventionally takes $\\mbox{sgn}(w_i) = 0$).\n",
"\n",
"These penalty terms can just as easily be applied to the biases; however, this is usually not necessary, as the biases do not affect the smoothness of the solution (given the data).\n",
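"\n",
"The penalty gradients in equations (5) and (8), and the resulting updates (6) and (9), can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions, not the lab's implementation: the names `l2_penalty_grad`, `l1_penalty_grad` and `penalised_update` are hypothetical and are not part of the `mlp` package, and the data-term gradient is simply passed in as an array.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def l2_penalty_grad(W, beta_l2):\n",
"    # Gradient of beta_l2 * 0.5 * sum(W ** 2) w.r.t. W, as in equation (5).\n",
"    return beta_l2 * W\n",
"\n",
"def l1_penalty_grad(W, beta_l1):\n",
"    # (Sub)gradient of beta_l1 * sum(|W|), as in equation (8); np.sign(0) is 0.\n",
"    return beta_l1 * np.sign(W)\n",
"\n",
"def penalised_update(W, grad_train, eta, beta_l1=0.0, beta_l2=0.0):\n",
"    # Hypothetical helper: returns W + Delta W, combining equations (6) and (9).\n",
"    grad = grad_train + l1_penalty_grad(W, beta_l1) + l2_penalty_grad(W, beta_l2)\n",
"    return W - eta * grad\n",
"\n",
"W = np.array([[0.5, -2.0], [0.0, 1.0]])\n",
"grad_train = np.zeros_like(W)  # assume a flat data term for this illustration\n",
"W_new = penalised_update(W, grad_train, eta=0.1, beta_l2=0.1)\n",
"```\n",
"\n",
"With a flat data term, the $L_2$ update shrinks every weight by the factor $(1 - \\eta \\beta_{L_2})$, which is why this penalty is also known as weight decay.\n",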
"\n",
"## Dropout\n",
"\n",
"For a given layer's output $\\mathbf{h}^l \\in \\mathbb{R}^{B \\times H^l}$ (where $B$ is the batch size and $H^l$ is the output dimensionality of the $l$-th layer), Dropout implements the following transformation:\n",
"\n",
"(10) $\\mathbf{\\hat h}^l = \\mathbf{d}^l\\circ\\mathbf{h}^l$\n",
"\n",
"where $\\circ$ denotes an elementwise product and $\\mathbf{d}^l \\in \\{0,1\\}^{B \\times H^l}$ is a matrix in which each element $d^l_{ij}$ is sampled from the Bernoulli distribution:\n",
"\n",
"(11) $d^l_{ij} \\sim \\mbox{Bernoulli}(p^l_d)$\n",
"\n",
"with $0<p^l_d<1$ denoting the probability that the given unit is kept unchanged (the \"dropping probability\" is thus $1-p^l_d$). We ignore here the extreme scenarios in which $p^l_d=1$, where no dropout is applied (hence training would be exactly the same as in standard SGD), or $p^l_d=0$, where all units would be dropped and the model would not learn anything.\n",
"\n",
"The probability $p^l_d$ is a hyperparameter (like the learning rate), meaning it needs to be provided before training and is very often tuned for the given task. As the notation suggests, it can be specified separately for each layer, including the case $l=0$, in which some random dimensions of the input features (pixels in the image for MNIST) are also corrupted.\n",
"\n",
"### Keeping the $l$-th layer output $\\mathbf{\\hat h}^l$ (input to the upper layer) appropriately scaled at test-time\n",
"\n",
"The other issue one needs to take into account is the mismatch that arises between the training and test (run-time) stages when dropout is applied. Since dropout is not applied at the test stage, the input to a unit in the upper layer will be on average $1/p^l_d$ times bigger than during training (where some inputs were set to 0).\n",
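"\n",
"To see where this factor comes from, one can take the expectation of equation (10) over the mask while treating $h^l_{ij}$ as fixed (a short sketch that uses only the mean of a Bernoulli variable):\n",
"\n",
"$\n",
"\\begin{align*}\n",
" \\mathbb{E}\\left[ \\hat h^l_{ij} \\right] = \\mathbb{E}\\left[ d^l_{ij} \\right] h^l_{ij} = p^l_d \\, h^l_{ij}\n",
"\\end{align*}\n",
"$\n",
"\n",
"so the unmasked run-time activation $h^l_{ij}$ is $1/p^l_d$ times its training-time expectation.\n",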
"\n",
"To account for this mismatch one could either:\n",
"\n",
"1. When training is finished, scale the final weight matrices $\\mathbf{W}^l, l=1,\\ldots,L$ by $p^{l-1}_d$ (remember, $p^{0}_d$ is the probability related to dropping input features), as mentioned in the lecture.\n",
"2. Scale the activations in equation (10) during training, that is, for each mini-batch multiply $\\mathbf{\\hat h}^l$ by $1/p^l_d$ to compensate for dropped units and then at run-time use the model as usual, **without** scaling. Make sure the $1/p^l_d$ scaling factor is taken into account in both the forward and backward passes.\n",
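"\n",
"Option 2 can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names `dropout_fprop` and `dropout_bprop` and the explicit `rng` argument are hypothetical and do not mirror the API of the `mlp` package.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def dropout_fprop(h, p_keep, rng):\n",
"    # Sample d ~ Bernoulli(p_keep) for every element, as in equation (11).\n",
"    d = rng.binomial(1, p_keep, size=h.shape)\n",
"    # Apply the mask (equation (10)) and rescale by 1 / p_keep (option 2).\n",
"    return d * h / p_keep, d\n",
"\n",
"def dropout_bprop(grad_h_hat, d, p_keep):\n",
"    # The same mask and scaling are applied to the back-propagated gradients.\n",
"    return d * grad_h_hat / p_keep\n",
"\n",
"rng = np.random.RandomState(0)\n",
"h = np.ones((1000, 50))  # B=1000, H=50, all activations equal to 1\n",
"h_hat, d = dropout_fprop(h, p_keep=0.8, rng=rng)\n",
"mean_activation = h_hat.mean()  # stays close to the unscaled mean of h\n",
"```\n",
"\n",
"Because of the $1/p^l_d$ rescaling, the mean activation seen during training roughly matches the unscaled one, so no correction is needed at run-time.\n",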
"\n",
"In this lab we recommend option 2 as it will make some things easier to implement. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from mlp.dataset import MNISTDataProvider\n",
"\n",
"train_dp = MNISTDataProvider(dset='train', batch_size=10, max_num_batches=100, randomize=True)\n",
"valid_dp = MNISTDataProvider(dset='valid', batch_size=10000, randomize=False)\n",
"test_dp = MNISTDataProvider(dset='eval', batch_size=10000, randomize=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 1: Implement L2 based regularisation\n",
"\n",
"Implement an L2 regularisation method (for the weight matrices, optionally for the biases). Test your solution on a one-hidden-layer model similar to the one used in Task 4 of coursework 1 (800 hidden units), but limit the training data to 1000 (random) data-points (keep the validation and test sets the same). You may use the data providers specified in the above cell. \n",
"\n",
"*Note (optional): We limit both the amount of data and the size of a mini-batch because those two parameters directly affect the number of updates we make to the model's parameters per epoch (i.e. for `batch_size=100` and `max_num_batches=10` one can only adjust the parameters `10` times per epoch, versus `100` times when `batch_size=10` and `max_num_batches=100`). Since SGD relies on making many small updates, this ratio (number of updates for a given amount of data) is another hyper-parameter one should consider before optimisation.*\n",
"\n",
"First build and train an unregularised model as a baseline. Then train regularised models starting with $\\beta_{L_2}$ set to 0.0001 and do a search over different values of $\\beta_{L_2}$. Observe how different $L_2$ penalties affect the model's ability to fit training and validation data.\n",
"\n",
"Implementation tips:\n",
"* Have a look at the constructor of the `mlp.optimisers.SGDOptimiser` class; it has been modified to take more optimisation-related arguments.\n",
"* The best place to implement regularisation terms is in the `pgrads` method of the `mlp.layers.Layer` class or its subclasses. See equations (6) and (9)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 2: Implement L1 based regularisation\n",
"\n",
"Implement the L1 regularisation penalty. Test your solution on a one-hidden-layer model similar to the one used in Exercise 1. Then train an $L_1$ regularised model starting with $\\beta_{L_1}=0.0001$ and again search over different values of this parameter. Observe how different $L_1$ penalties affect the model's ability to fit training and validation data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 3:\n",
"\n",
"Dropout applied to input features (switching some random pixels off) may also be viewed as a form of data augmentation: we effectively create images that differ in some way from those in the training set, and the model is also tasked with properly classifying imperfect data-points.\n",
"\n",
"Your task in this exercise is to pick a random digit from the MNIST dataset (use `MNISTDataProvider`) and corrupt it pixel-wise with different probabilities $p_{d} \\in \\{0.9, 0.7, 0.5, 0.2, 0.1\\}$ (reminder: the dropout probability is $1-p_d$), that is, for each pixel $x_{i,j}$ in image $\\mathbf{X} \\in \\mathbb{R}^{W\\times H}$:\n",
"\n",
"$\\begin{align}\n",
"d_{i,j} & \\sim\\ \\mbox{Bernoulli}(p_{d}) \\\\\n",
"x_{i,j} &=\n",
"\\begin{cases}\n",
" 0 & \\quad \\text{if } d_{i,j} = 0\\\\\n",
" x_{i,j} & \\quad \\text{if } d_{i,j} = 1\\\\\n",
"\\end{cases}\n",
"\\end{align}\n",
"$\n",
"\n",
"Plot the solution as a 2x3 grid of images, one for each $p_d$ scenario; at position (0, 0) plot the original (uncorrupted) image.\n",
"\n",
"Tip: You may use the `numpy.random.binomial` function to draw samples from the Bernoulli distribution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 4: Implement Dropout \n",
"\n",
"Implement the dropout regularisation technique. Then, for the same initial configuration as used in Exercise 1, investigate the effectiveness of different dropout rates applied to input features and/or hidden layers. Start with $p_{inp}=0.5$ and $p_{hid}=0.5$ and do a search for better settings of these parameters. Dropout usually slows training down (approximately by a factor of two), so train dropout models for around twice as many epochs as the baseline model.\n",
"\n",
"Implementation tips:\n",
"* Add a function `fprop_dropout` to the `mlp.layers.MLP` class which (on top of the `inputs` argument) also takes dropout-related argument(s) and performs dropout forward propagation through the model.\n",
"* You also need to introduce some modifications to the `mlp.optimisers.SGDOptimiser.train_epoch()` function.\n",
"* Design and implement a dropout scheduler in a similar way to how learning rates are handled (that is, allowing for a schedule which is kept independent of the implementation in `mlp.optimisers.SGDOptimiser.train()`). \n",
" + For this exercise implement only a fixed dropout scheduler - `DropoutFixed`, but your implementation should make it easy to add other schedules in the future. \n",
" + A dropout scheduler of any type should return a tuple of two numbers $(p_{inp},\\; p_{hid})$: the first is the dropout factor for input features (data-points), and the second is the dropout factor for hidden layers (assumed the same for all hidden layers)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}