{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "This tutorial focuses on the implementation of three regularisation techniques: two of them are norm-based approaches added to the optimised objective, and the third, called *dropout*, is a form of noise injection that randomly corrupts the information carried by the hidden units during training.\n",
    "\n",
    "\n",
    "## Virtual environments\n",
    "\n",
    "Before you proceed, remember to activate your virtual environment:\n",
    " * If you were in last week's Tuesday or Wednesday group, type `activate_mlp` or `source ~/mlpractical/venv/bin/activate`\n",
    " * If you were in the Monday group:\n",
    "   + and if you have chosen the **comfy** way, type: `workon mlpractical`\n",
    "   + and if you have chosen the **generic** way, `source` your virtual environment by specifying the path to its activate script (you need to locate it yourself; there were no general recommendations regarding the directory structure and people have installed it in different places, usually somewhere in their home directories. If you cannot easily find it, use something like `find . -iname activate`):\n",
    "\n",
    "## Syncing the git repository\n",
    "\n",
    "Look here for more details. In short, we recommend creating a separate branch for this lab, as follows:\n",
    "\n",
    "1. Enter the mlpractical directory: `cd ~/mlpractical/repo-mlp`\n",
    "2. List the branches and check which one is currently active by typing: `git branch`\n",
    "3. If you have followed our recommendations, you should be on the `coursework1` branch; commit your local changes to the repository by typing:\n",
    "```\n",
    "git commit -am \"finished coursework\"\n",
    "```\n",
    "4. Now you can switch to the `master` branch by typing:\n",
    "```\n",
    "git checkout master\n",
    "```\n",
    "5. Update the repository (assuming `master` does not have any conflicts; if there are some, have a look here):\n",
    "```\n",
    "git pull\n",
    "```\n",
    "6. Finally, create the new branch and switch to it by typing:\n",
    "```\n",
    "git checkout -b lab4\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regularisation\n",
    "\n",
    "Regularisation adds some *complexity term* to the cost function. Its purpose is to put a prior on the model's parameters. The most common prior is perhaps the one which assumes that smoother solutions (ones which cannot fit the training data too closely) are better, as they are more likely to generalise well to unseen data.\n",
    "\n",
    "A way to incorporate such a prior into the model is to add a term that penalises certain configurations of the parameters: either keeping them from growing too large ($L_2$), or preferring solutions that could be modelled with fewer parameters ($L_1$), hence encouraging some parameters to become 0.\n",
    "One can, of course, combine many such priors when optimising the model; however, in this lab we shall use the $L_1$ and/or $L_2$ priors.\n",
    "\n",
    "They can be easily incorporated into the training objective as additive terms, as follows:\n",
    "\n",
    "(1) $\n",
    " \begin{align*}\n",
    " E^n &= \underbrace{E^n_{\text{train}}}_{\text{data term}} + \n",
    " \underbrace{\beta_{L_1} E^n_{L_1}}_{\text{prior term}} + \underbrace{\beta_{L_2} E^n_{L_2}}_{\text{prior term}}\n",
    "\end{align*}\n",
    "$\n",
    "\n",
    "where $ E^n_{\text{train}} = - \sum_{k=1}^K t^n_k \ln y^n_k $ is the data term, $\beta_{L_1}$ and $\beta_{L_2}$ are some non-negative constants specified a priori (hyper-parameters), and $E^n_{L_1}$ and $E^n_{L_2}$ are norms measuring certain properties of the parameters:\n",
    "\n",
    "(2) $\n",
    " \begin{align*}\n",
    " E^n_{L_p}(\mathbf{W}) = \left ( \sum_{i,j \in \mathbf{W}} |w_{i,j}|^p \right )^{\frac{1}{p}}\n",
    "\end{align*}\n",
    "$\n",
    "\n",
    "where $p$ denotes the norm order (for regularisation, either 1 or 2). Note that for $p=2$ one usually skips the square root and scales the penalty by $\frac{1}{2}$, as in (3) below: the squared penalty expresses the same preference for small weights, and its gradient with respect to $w_i$ is then simply $w_i$, which keeps the update rule simple.\n",
    "\n",
    "## $L_{p=2}$ (Weight Decay)\n",
    "\n",
    "(3) $\n",
    " \begin{align*}\n",
    " E^n &= \underbrace{E^n_{\text{train}}}_{\text{data term}} + \n",
    " \underbrace{\beta_{L_2} E^n_{L_2}}_{\text{prior term}} = E^n_{\text{train}} + \beta_{L_2} \frac{1}{2} \sum_i w_i^2\n",
    "\end{align*}\n",
    "$\n",
    "\n",
    "(4) $\n",
    "\begin{align*}\n",
    " \frac{\partial E^n}{\partial w_i} &= \frac{\partial (E^n_{\text{train}} + \beta_{L_2} E_{L_2}) }{\partial w_i} \n",
    " = \left( \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_2} \frac{\partial E_{L_2}}{\partial w_i} \right) \n",
    " = \left( \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_2} w_i \right)\n",
    "\end{align*}\n",
    "$\n",
    "\n",
    "(5) $\n",
    "\begin{align*}\n",
    " \Delta w_i &= -\eta \left( \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_2} w_i \right) \n",
    "\end{align*}\n",
    "$\n",
    "\n",
    "where $\eta$ is the learning rate.\n",
    "\n",
    "## $L_{p=1}$ (Sparsity)\n",
    "\n",
    "(6) $\n",
    " \begin{align*}\n",
    " E^n &= \underbrace{E^n_{\text{train}}}_{\text{data term}} + \n",
    " \underbrace{\beta_{L_1} E^n_{L_1}}_{\text{prior term}} \n",
    " = E^n_{\text{train}} + \beta_{L_1} \sum_i |w_i|\n",
    "\end{align*}\n",
    "$\n",
    "\n",
    "(7) $\begin{align*}\n",
    " \frac{\partial E^n}{\partial w_i} = \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_1} \frac{\partial E_{L_1}}{\partial w_i} = \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_1} \mbox{sgn}(w_i)\n",
    "\end{align*}\n",
    "$\n",
    "\n",
    "(8) $\begin{align*}\n",
    " \Delta w_i &= -\eta \left( \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_1} \mbox{sgn}(w_i) \right) \n",
    "\end{align*}$\n",
    "\n",
    "where $\mbox{sgn}(w_i)$ is the sign of $w_i$: $\mbox{sgn}(w_i) = 1$ if $w_i>0$, $\mbox{sgn}(w_i) = -1$ if $w_i<0$, and, by the usual convention, $\mbox{sgn}(w_i) = 0$ if $w_i=0$.\n",
    "\n",
    "One can also apply these penalty terms to the biases; however, this is usually not necessary, as the biases have only a secondary impact on the smoothness of the solution.\n",
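    "\n",
    "To make the update rules (5) and (8) concrete, here is a minimal NumPy sketch of a single gradient-descent step with both penalty terms added to the data-term gradient. It is only an illustration, not the interface of the `mlp` framework used in the labs; all names and shapes (`weights`, `grad_train`, `beta_l1`, `beta_l2`, `learning_rate`) are assumptions made for this example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# illustrative set-up (assumed shapes and hyper-parameters, not from the lab framework)\n",
    "rng = np.random.RandomState(42)\n",
    "weights = rng.normal(size=(784, 100))        # weight matrix W of one layer\n",
    "grad_train = rng.normal(size=weights.shape)  # dE_train/dW, as obtained from backprop\n",
    "beta_l1, beta_l2 = 1e-4, 1e-4                # prior coefficients\n",
    "learning_rate = 0.1                          # eta\n",
    "\n",
    "# as in (4) and (7): add the prior gradients to the data-term gradient\n",
    "grad = grad_train + beta_l2 * weights + beta_l1 * np.sign(weights)\n",
    "\n",
    "# gradient-descent update, combining (5) and (8)\n",
    "weights -= learning_rate * grad\n",
    "```\n",
    "\n",
    "In practice such terms would be added wherever the parameter gradients are computed or applied, for example in the optimiser's update step, once per weight matrix.\n",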
    "\n",
    "## Dropout\n",
    "\n",
    "Dropout, for a given layer's output $\mathbf{h}^l \in \mathbb{R}^{B\times H^l}$ (where $B$ is the batch size and $H^l$ is the $l$-th layer's output dimensionality), implements the following transformation:\n",
    "\n",
    "(9) $\mathbf{\hat h}^l = \mathbf{d}^l\circ\mathbf{h}^l$\n",
    "\n",
    "where $\circ$ denotes an elementwise product and $\mathbf{d}^l \in \{0,1\}^{B\times H^l}$ is a matrix whose elements $d^l_{ij}$ are sampled from the Bernoulli distribution:\n",
    "\n",
    "(10) $d^l_{ij} \sim \mbox{Bernoulli}(p^l_d)$\n",
    "\n",
    "with $0