157 lines
6.7 KiB
Plaintext
157 lines
6.7 KiB
Plaintext
|
{
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Introduction\n",
|
||
|
"\n",
|
||
|
"This tutorial focuses on implementation of three reqularisaion techniques, two of them are norm based approaches - L2 and L1 as well as technique called droput, that.\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"## Virtual environments\n",
|
||
|
"\n",
|
||
|
"Before you proceed onwards, remember to activate your virtual environment:\n",
|
||
|
" * If you were in last week's Tuesday or Wednesday group type `activate_mlp` or `source ~/mlpractical/venv/bin/activate`\n",
|
||
|
" * If you were in the Monday group:\n",
|
||
|
" + and if you have chosen the **comfy** way type: `workon mlpractical`\n",
|
||
|
" + and if you have chosen the **generic** way, `source` your virutal environment using `source` and specyfing the path to the activate script (you need to localise it yourself, there were not any general recommendations w.r.t dir structure and people have installed it in different places, usually somewhere in the home directories. If you cannot easily find it by yourself, use something like: `find . -iname activate` ):\n",
|
||
|
"\n",
|
||
|
"## Syncing the git repository\n",
|
||
|
"\n",
|
||
|
"Look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a> for more details. But in short, we recommend to create a separate branch for this lab, as follows:\n",
|
||
|
"\n",
|
||
|
"1. Enter the mlpractical directory `cd ~/mlpractical/repo-mlp`\n",
|
||
|
"2. List the branches and check which is currently active by typing: `git branch`\n",
|
||
|
"3. If you have followed recommendations, you should be in the `coursework1` branch, please commit your local changed to the repo index by typing:\n",
|
||
|
"```\n",
|
||
|
"git commit -am \"stuff I did for the coursework\"\n",
|
||
|
"```\n",
|
||
|
"4. Now you can switch to `master` branch by typing: \n",
|
||
|
"```\n",
|
||
|
"git checkout master\n",
|
||
|
" ```\n",
|
||
|
"5. To update the repository (note, assuming master does not have any conflicts), if there are some, have a look <a href=\"https://github.com/CSTR-Edinburgh/mlpractical/blob/master/gitFAQ.md\">here</a>\n",
|
||
|
"```\n",
|
||
|
"git pull\n",
|
||
|
"```\n",
|
||
|
"6. And now, create the new branch & swith to it by typing:\n",
|
||
|
"```\n",
|
||
|
"git checkout -b lab4\n",
|
||
|
"```"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Regularisation\n",
|
||
|
"\n",
|
||
|
"Today, we shall build models which can have an arbitrary number of hidden layers. Please have a look at the diagram below, and the corresponding computations (which have an *exact* matrix form as expected by numpy, and row-wise orientation; note that $\\circ$ denotes an element-wise product). In the diagram, we briefly describe how each comptation relates to the code we have provided.\n",
|
||
|
"\n",
|
||
|
"(1) $E = \\log(\\mathbf{y}|\\mathbf{x}; \\theta) + \\alpha J_{L2}(\\theta) + \\beta J_{L1}(\\theta)$\n",
|
||
|
"\n",
|
||
|
"## L2 Weight Decay\n",
|
||
|
"\n",
|
||
|
"(1) $J_{L2}(\\theta) = \\frac{1}{2}||\\theta||^2$\n",
|
||
|
"\n",
|
||
|
"(1) $\\frac{\\partial J_{L2}}{\\partial\\theta} = \\frac{1}{2}||\\theta||^2$\n",
|
||
|
"\n",
|
||
|
"## L1 Sparsity \n",
|
||
|
"\n",
|
||
|
"## Dropout\n",
|
||
|
"\n",
|
||
|
"Dropout, for a given layer's output $\\mathbf{h}^i \\in \\mathbb{R}^{BxH^l}$ (where $B$ is batch size and $H^l$ is the $l$-th layer output dimensionality) implements the following transformation:\n",
|
||
|
"\n",
|
||
|
"(1) $\\mathbf{\\hat h}^l = \\mathbf{d}^l\\circ\\mathbf{h}^l$\n",
|
||
|
"\n",
|
||
|
"where $\\circ$ denotes an elementwise product and $\\mathbf{d}^l \\in \\{0,1\\}^{BxH^i}$ is a matrix in which $d^l_{ij}$ element is sampled from the Bernoulli distribution:\n",
|
||
|
"\n",
|
||
|
"(2) $d^l_{ij} \\sim \\mbox{Bernoulli}(p^l_d)$\n",
|
||
|
"\n",
|
||
|
"with $0<p^l_d<1$ denoting the probability the unit is kept unchanged (dropping probability is thus $1-p^l_d$). We ignore here edge scenarios where $p^l_d=1$ and there is no dropout applied (and the training is exactly the same as in standard SGD) and $p^l_d=0$ where all units would have been dropped, hence the model would not learn anything.\n",
|
||
|
"\n",
|
||
|
"The probability $p^l_d$ is a hyperparameter (like learning rate) meaning it needs to be provided before training and also very often tuned for the given task. As the notation suggest, it can be specified separately for each layer, including scenario where $l=0$ and one randomply drops also input features.\n",
|
||
|
"\n",
|
||
|
"### Keeping the $l$-th layer output $\\mathbf{\\hat h}^l$ (input to the upper layer) appropiately scaled at test-time\n",
|
||
|
"\n",
|
||
|
"The other issue one needs to take into account is the mismatch that arises between training and test (runtime) stages of when dropout is applied. It is due to the fact that droput is not applied when testing hence the average input to the next layer is gonna be bigger when compared to training stage, in average $1/p^l_d$ times bigger. \n",
|
||
|
"\n",
|
||
|
"So to account for this you can either (it's up to you which way you decide to implement):\n",
|
||
|
"\n",
|
||
|
"1. When training is finished scale the final weight matrices $\\mathbf{W}^l, l=1,\\ldots,L$ by $p^{l-1}_d$ (remember, $p^{0}_d$ is the probability related to the input features)\n",
|
||
|
"2. Scale the activations in equation (1) during training, that is, for each mini-batch multiply $\\mathbf{\\hat h}^l$ by $1/p^l_d$ to compensate for dropped units and then at run-time use the model as usual, **without** scaling. Make sure the $1/p^l_d$ scaler is taken into account for both forward and backward passes."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": []
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Exercise 1: Implement L1 based regularisation\n",
|
||
|
"\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Exercise 2: Implement L2 based regularisation\n",
|
||
|
"\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {
|
||
|
"collapsed": true
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": []
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Exercise 3: Implement Dropout \n",
|
||
|
"\n",
|
||
|
"Modify the above code by adding an intemediate linear layer of size 200 hidden units between input and output layers."
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"metadata": {
|
||
|
"collapsed": true
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": []
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 2",
|
||
|
"language": "python",
|
||
|
"name": "python2"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 2
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython2",
|
||
|
"version": "2.7.9"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 0
|
||
|
}
|