From 586c7477ebeb938f601994238762dfdd045958d8 Mon Sep 17 00:00:00 2001
From: Matt Graham
Date: Fri, 21 Oct 2016 01:25:08 +0100
Subject: [PATCH] Removing old regularisation notebook.
---
 notebooks/05_Regularisation.ipynb | 293 ------------------------------
 1 file changed, 293 deletions(-)
 delete mode 100644 notebooks/05_Regularisation.ipynb

diff --git a/notebooks/05_Regularisation.ipynb b/notebooks/05_Regularisation.ipynb
deleted file mode 100644
index 24f2349..0000000
--- a/notebooks/05_Regularisation.ipynb
+++ /dev/null
@@ -1,293 +0,0 @@

# Introduction

This tutorial focuses on the implementation of three regularisation techniques: two of them add a regularisation term to the cost function based on the *L1* and *L2* norms; the third technique, called *Dropout*, is a form of noise injection by random corruption of the information carried by the hidden units during training.

## Virtual environments

Before you proceed, remember to activate your virtual environment by typing `activate_mlp` or `source ~/mlpractical/venv/bin/activate` (or, if you did the original install the "comfy way", type: `workon mlpractical`).

## Syncing the git repository

Look here for more details. In short, we recommend creating a separate branch for this lab, as follows:

1. Enter the mlpractical directory: `cd ~/mlpractical/repo-mlp`
2. List the branches and check which one is currently active by typing: `git branch`
3. If you have followed our recommendations, you should be in the `coursework1` branch. Commit your local changes to the repo index by typing:
```
git commit -am "finished coursework"
```
4. Now you can switch to the `master` branch by typing:
```
git checkout master
```
5. Update the repository (assuming `master` does not have any conflicts; if there are some, have a look here):
```
git pull
```
6. And now, create the new branch and switch to it by typing:
```
git checkout -b lab4
```

# Regularisation

Regularisation adds a *complexity term* to the cost function. Its purpose is to put some prior on the model's parameters which penalises complexity. The most common prior is perhaps the one which assumes that smoother solutions (ones which are not able to fit the training data too well) are better, as they are more likely to generalise to unseen data.

A way to incorporate such a prior in the model is to add a term that penalises certain configurations of the parameters -- either preventing them from growing too large ($L_2$) or preferring solutions that could be modelled with fewer parameters ($L_1$), hence encouraging some parameters to become 0.
One can, of course, combine many such priors when optimising the model; however, in this lab we shall use the $L_1$ and/or $L_2$ priors.

$L_1$ and $L_2$ priors can be easily incorporated into the training objective through additive terms, as follows:

(1) $
\begin{align*}
 E^n &= \underbrace{E^n_{\text{train}}}_{\text{data term}} +
 \underbrace{\beta_{L_1} E^n_{L_1}}_{\text{prior term}} + \underbrace{\beta_{L_2} E^n_{L_2}}_{\text{prior term}}
\end{align*}
$

where $E^n_{\text{train}} = - \sum_{k=1}^K t^n_k \ln y^n_k$ is the cross-entropy cost function, $\beta_{L_1}$ and $\beta_{L_2}$ are non-negative constants specified in advance (hyper-parameters), and $E^n_{L_1}$ and $E^n_{L_2}$ are norm metrics specifying certain properties of the parameters:

(2) $
\begin{align*}
 E^n_{L_p}(\mathbf{W}) = ||\mathbf{W}||_p = \left ( \sum_{i,j \in \mathbf{W}} |w_{i,j}|^p \right )^{\frac{1}{p}}
\end{align*}
$

where $p$ denotes the norm order (for regularisation either 1 or 2). Note that, in practice, for computational convenience we will compute the squared $L_{p=2}$ norm, which omits the square root in (2), that is:

(3) $
\begin{align*}
 E^n_{L_{p=2}}(\mathbf{W}) = ||\mathbf{W}||^2_2 = \left ( \left ( \sum_{i,j \in \mathbf{W}} |w_{i,j}|^2 \right )^{\frac{1}{2}} \right )^2 = \sum_{i,j \in \mathbf{W}} |w_{i,j}|^2
\end{align*}
$

## $L_{p=2}$ (Weight Decay)

Our cost with the $L_2$ regulariser then becomes (the factor $\frac{1}{2}$ simplifies a derivative later):

(4) $
\begin{align*}
 E^n &= \underbrace{E^n_{\text{train}}}_{\text{data term}} +
 \underbrace{\beta_{L_2} \frac{1}{2} E^n_{L_2}}_{\text{prior term}}
\end{align*}
$

Hence, the gradient of the cost w.r.t. parameter $w_i$ is given as follows:

(5) $
\begin{align*}
 \frac{\partial E^n}{\partial w_i} &= \frac{\partial (E^n_{\text{train}} + \beta_{L_2} 0.5 E^n_{L_2}) }{\partial w_i}
 = \left( \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_2} 0.5 \frac{\partial E^n_{L_2}}{\partial w_i} \right)
 = \left( \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_2} w_i \right)
\end{align*}
$

And the actual update we make to the parameter $w_i$ is:

(6) $
\begin{align*}
 \Delta w_i &= -\eta \left( \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_2} w_i \right)
\end{align*}
$

where $\eta$ is the learning rate.

Exercise 1 gives some more implementational suggestions on how to incorporate this technique into the lab code: the cost-related prior contributions (equation (1)) are computed in `mlp.optimisers.Optimiser.compute_prior_costs()`, and your job is to add the relevant optimisation-related code when computing the gradients w.r.t. the parameters.
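As a concrete illustration, here is a minimal numpy sketch of the update in equations (5)-(6). The function and argument names (`l2_weight_update`, `grad_train`, `beta_l2`, `eta`) are illustrative only and are not part of the `mlp` framework's API.

```python
import numpy as np

def l2_weight_update(W, grad_train, beta_l2, eta):
    """One gradient-descent step with L2 (weight decay) regularisation.

    W          -- current weight matrix
    grad_train -- gradient of the data term E_train w.r.t. W
    beta_l2    -- weight decay coefficient (hyper-parameter)
    eta        -- learning rate
    """
    grad = grad_train + beta_l2 * W  # equation (5): add the weight-decay term
    return W - eta * grad            # equation (6): apply the update
```

In the lab code the analogous term would be added at the point where the gradients w.r.t. the parameters are computed, before the learning rule applies the update.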
## $L_{p=1}$ (Sparsity)

Our cost with the $L_1$ regulariser then becomes:

(7) $
\begin{align*}
 E^n &= \underbrace{E^n_{\text{train}}}_{\text{data term}} +
 \underbrace{\beta_{L_1} E^n_{L_1}}_{\text{prior term}}
\end{align*}
$

Hence, the gradient of the cost w.r.t. parameter $w_i$ is given as follows:

(8) $
\begin{align*}
 \frac{\partial E^n}{\partial w_i} = \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_1} \frac{\partial E^n_{L_1}}{\partial w_i} = \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_1} \mbox{sgn}(w_i)
\end{align*}
$

And the actual update we make to the parameter $w_i$ is:

(9) $
\begin{align*}
 \Delta w_i &= -\eta \left( \frac{\partial E^n_{\text{train}}}{\partial w_i} + \beta_{L_1} \mbox{sgn}(w_i) \right)
\end{align*}
$

where $\mbox{sgn}(w_i)$ is the sign of $w_i$: $\mbox{sgn}(w_i) = 1$ if $w_i > 0$ and $\mbox{sgn}(w_i) = -1$ if $w_i < 0$.

One can also easily apply those penalty terms to the biases; however, this is usually not necessary as biases do not affect the smoothness of the solution (given the data).

## Dropout

For a given layer's output $\mathbf{h}^l \in \mathbb{R}^{B \times H^l}$ (where $B$ is the batch size and $H^l$ is the $l$-th layer's output dimensionality), Dropout implements the following transformation:

(10) $\mathbf{\hat h}^l = \mathbf{d}^l \circ \mathbf{h}^l$

where $\circ$ denotes an elementwise product and $\mathbf{d}^l \in \{0,1\}^{B \times H^l}$ is a matrix in which element $d^l_{ij}$ is sampled from the Bernoulli distribution:

(11) $d^l_{ij} \sim \mbox{Bernoulli}(p^l_d)$

with $0 < p^l_d \le 1$ the probability of sampling a 1, that is, of keeping the corresponding unit active.
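As an illustration, below is a minimal numpy sketch of the forward-pass transformation in equations (10)-(11). The function name `dropout_fprop`, the argument `p_keep` (the retention probability $p^l_d$) and the random state are illustrative assumptions, not part of the `mlp` framework.

```python
import numpy as np

def dropout_fprop(h, p_keep, rng):
    """Apply the dropout transformation of equations (10)-(11).

    h      -- layer outputs, shape (B, H_l)
    p_keep -- retention probability p_d, with 0 < p_keep <= 1
    rng    -- numpy RandomState used to sample the binary mask
    """
    # equation (11): each mask entry is 1 with probability p_keep, else 0
    d = (rng.uniform(size=h.shape) < p_keep).astype(h.dtype)
    # equation (10): elementwise product zeros out the dropped units
    return d * h

# Example usage with arbitrary toy values:
rng = np.random.RandomState(42)
h = rng.normal(size=(4, 5))          # a batch of 4 examples, 5 hidden units
h_hat = dropout_fprop(h, 0.8, rng)   # keep each unit with probability 0.8
```

Similarly, the $L_1$ update in equation (9) can be sketched in the same way as the $L_2$ example given earlier, by replacing the weight-decay term `beta_l2 * W` with `beta_l1 * np.sign(W)`.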