This commit is contained in:
pswietojanski 2015-11-01 19:24:35 +00:00
parent 35490a68fc
commit faaa6cb172
2 changed files with 39 additions and 32 deletions


@ -64,22 +64,31 @@
"\n",
"(2) $\n",
" \\begin{align*}\n",
" E^n_{L_p}(\\mathbf{W}) = \\left ( \\sum_{i,j \\in \\mathbf{W}} |w_{i,j}|^p \\right )^{\\frac{1}{p}}\n",
" E^n_{L_p}(\\mathbf{W}) = ||\\mathbf{W}||_p = \\left ( \\sum_{i,j \\in \\mathbf{W}} |w_{i,j}|^p \\right )^{\\frac{1}{p}}\n",
"\\end{align*}\n",
"$\n",
"\n",
"where $p$ denotes the norm-order (for regularisation either 1 or 2). (TODO: explain here why we usualy skip square root for p=2)\n",
"where $p$ denotes the norm-order (for regularisation either 1 or 2). Notice, in practice for computational purposes we will rather compute squared $L_{p=2}$ norm, which omits the square root in (2), that is:\n",
"\n",
"(3)$ \\begin{align*}\n",
" E^n_{L_{p=2}}(\\mathbf{W}) = ||\\mathbf{W}||^2_2 = \\left ( \\left ( \\sum_{i,j \\in \\mathbf{W}} |w_{i,j}|^2 \\right )^{\\frac{1}{2}} \\right )^2 = \\sum_{i,j \\in \\mathbf{W}} |w_{i,j}|^2\n",
"\\end{align*}\n",
"$\n",
"\n",
"## $L_{p=2}$ (Weight Decay)\n",
"\n",
"(3) $\n",
"Our cost with $L_{2}$ regulariser then becomes ($\\frac{1}{2}$ simplifies a derivative later):\n",
"\n",
"(4) $\n",
" \\begin{align*}\n",
" E^n &= \\underbrace{E^n_{\\text{train}}}_{\\text{data term}} + \n",
" \\underbrace{\\beta E^n_{L_2}}_{\\text{prior term}} = E^n_{\\text{train}} + \\beta_{L_2} \\frac{1}{2}|\\mathbf{W}|^2\n",
" \\underbrace{\\beta_{L_2} \\frac{1}{2} E^n_{L_2}}_{\\text{prior term}}\n",
"\\end{align*}\n",
"$\n",
"\n",
"(4) $\n",
"Hence, the gradient of the cost w.r.t parameter $w_i$ is given as follows:\n",
"\n",
"(5) $\n",
"\\begin{align*}\\frac{\\partial E^n}{\\partial w_i} &= \\frac{\\partial (E^n_{\\text{train}} + \\beta_{L_2} E_{L_2}) }{\\partial w_i} \n",
" = \\left( \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_2} \\frac{\\partial\n",
" E_{L_2}}{\\partial w_i} \\right) \n",
@ -87,7 +96,9 @@
"\\end{align*}\n",
"$\n",
"\n",
"(5) $\n",
"And the actual update we to the $W_i$ parameter is:\n",
"\n",
"(6) $\n",
"\\begin{align*}\n",
" \\Delta w_i &= -\\eta \\left( \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_2} w_i \\right) \n",
"\\end{align*}\n",
@ -97,49 +108,54 @@
"\n",
"## $L_{p=1}$ (Sparsity)\n",
"\n",
"(6) $\n",
"Our cost with $L_{1}$ regulariser then becomes:\n",
"\n",
"(7) $\n",
" \\begin{align*}\n",
" E^n &= \\underbrace{E^n_{\\text{train}}}_{\\text{data term}} + \n",
" \\underbrace{\\beta E^n_{L_1}}_{\\text{prior term}} \n",
" = E^n_{\\text{train}} + \\beta_{L_1} |\\mathbf{W}|\n",
" \\underbrace{\\beta_{L_1} E^n_{L_1}}_{\\text{prior term}} \n",
"\\end{align*}\n",
"$\n",
"\n",
"(7) $\\begin{align*}\n",
"Hence, the gradient of the cost w.r.t parameter $w_i$ is given as follows:\n",
"\n",
"(8) $\\begin{align*}\n",
" \\frac{\\partial E^n}{\\partial w_i} = \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_1} \\frac{\\partial E_{L_1}}{\\partial w_i} = \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_1} \\mbox{sgn}(w_i)\n",
"\\end{align*}\n",
"$\n",
"\n",
"(8) $\\begin{align*}\n",
"And the actual update we to the $W_i$ parameter is:\n",
"\n",
"(9) $\\begin{align*}\n",
" \\Delta w_i &= -\\eta \\left( \\frac{\\partial E^n_{\\text{train}}}{\\partial w_i} + \\beta_{L_1} \\mbox{sgn}(w_i) \\right) \n",
"\\end{align*}$\n",
"\n",
"Where $\\mbox{sgn}(w_i)$ is the sign of $w_i$: $\\mbox{sgn}(w_i) = 1$ if $w_i>0$ and $\\mbox{sgn}(w_i) = -1$ if $w_i<0$\n",
"\n",
"One can also apply those penalty terms for biases, however, this is usually not necessary as biases have secondary impact on smoothnes of the given solution.\n",
"One can also easily apply those penalty terms for biases, however, this is usually not necessary as biases do not affect the smoothness of the solution (given data).\n",
"\n",
"## Dropout\n",
"\n",
"Dropout, for a given layer's output $\\mathbf{h}^i \\in \\mathbb{R}^{BxH^l}$ (where $B$ is batch size and $H^l$ is the $l$-th layer output dimensionality) implements the following transformation:\n",
"\n",
"(9) $\\mathbf{\\hat h}^l = \\mathbf{d}^l\\circ\\mathbf{h}^l$\n",
"(10) $\\mathbf{\\hat h}^l = \\mathbf{d}^l\\circ\\mathbf{h}^l$\n",
"\n",
"where $\\circ$ denotes an elementwise product and $\\mathbf{d}^l \\in \\{0,1\\}^{BxH^i}$ is a matrix in which $d^l_{ij}$ element is sampled from the Bernoulli distribution:\n",
"\n",
"(10) $d^l_{ij} \\sim \\mbox{Bernoulli}(p^l_d)$\n",
"(11) $d^l_{ij} \\sim \\mbox{Bernoulli}(p^l_d)$\n",
"\n",
"with $0<p^l_d<1$ denoting the probability the given unit is kept unchanged (dropping probability is thus $1-p^l_d$). We ignore here edge scenarios where $p^l_d=1$ and there is no dropout applied (and the training would be exactly the same as in standard SGD) and $p^l_d=0$ where all units would have been dropped, hence the model would not learn anything.\n",
"\n",
"The probability $p^l_d$ is a hyperparameter (like learning rate) meaning it needs to be provided before training and also very often tuned for the given task. As the notation suggest, it can be specified separately for each layer, including scenario where $l=0$ when some random input features (pixels in the image for MNIST) are being also ommitted.\n",
"The probability $p^l_d$ is a hyperparameter (like learning rate) meaning it needs to be provided before training and also very often tuned for the given task. As the notation suggest, it can be specified separately for each layer, including scenario where $l=0$ when some random dimensions in input features (pixels in the image for MNIST) are being also corrupted.\n",
"\n",
"### Keeping the $l$-th layer output $\\mathbf{\\hat h}^l$ (input to the upper layer) appropiately scaled at test-time\n",
"\n",
"The other issue one needs to take into account is the mismatch that arises between training and test (runtime) stages when dropout is applied. It is due to the fact that droput is not applied when testing hence the average input to the unit in upper layer is going to be bigger when compared to training stage (where some inputs are set to 0), in average $1/p^l_d$ times bigger. \n",
"The other issue one needs to take into account is the mismatch that arises between training and test (runtime) stages when dropout is applied. It is due to the fact that droput is not applied at testing (run-time) stage hence the average input to the unit in the upper layer is going to be bigger compared to training stage (where some inputs were set to 0), in average $1/p^l_d$ times bigger. \n",
"\n",
"So to account for this mismatch one could either:\n",
"\n",
"1. When training is finished scale the final weight matrices $\\mathbf{W}^l, l=1,\\ldots,L$ by $p^{l-1}_d$ (remember, $p^{0}_d$ is the probability related to the input features)\n",
"2. Scale the activations in equation (9) during training, that is, for each mini-batch multiply $\\mathbf{\\hat h}^l$ by $1/p^l_d$ to compensate for dropped units and then at run-time use the model as usual, **without** scaling. Make sure the $1/p^l_d$ scaler is taken into account for both forward and backward passes.\n",
"1. When training is finished scale the final weight matrices $\\mathbf{W}^l, l=1,\\ldots,L$ by $p^{l-1}_d$ (remember, $p^{0}_d$ is the probability related to dropping input features)\n",
"2. Scale the activations in equation (10) during training, that is, for each mini-batch multiply $\\mathbf{\\hat h}^l$ by $1/p^l_d$ to compensate for dropped units and then at run-time use the model as usual, **without** scaling. Make sure the $1/p^l_d$ scaler is taken into account for both forward and backward passes.\n",
"\n",
"Our recommendation is option 2 as it will make some things easier from implementation perspective. "
]
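The following is a minimal sketch of option 2 (scaling during training), assuming a standalone NumPy setting. The function names `dropout_fprop` and `dropout_bprop`, the `p_keep` argument and the explicit random state are hypothetical illustrations, not the `mlp` package API. The same mask and the same $1/p^l_d$ factor are used in the forward and backward passes.

```python
import numpy


def dropout_fprop(h, p_keep, rng):
    """Mask activations h and rescale by 1/p_keep (hypothetical helper).

    Each mask element is drawn from Bernoulli(p_keep), as in equation (11).
    """
    d = rng.binomial(1, p_keep, size=h.shape)
    h_hat = d * h / p_keep
    return h_hat, d


def dropout_bprop(igrads, d, p_keep):
    """Back-propagate through dropout with the same mask and scaling."""
    return d * igrads / p_keep


rng = numpy.random.RandomState(0)
h = numpy.ones((1000, 50))  # toy layer output: batch of 1000, 50 units
h_hat, d = dropout_fprop(h, p_keep=0.8, rng=rng)
grads = dropout_bprop(numpy.ones_like(h), d, p_keep=0.8)
```

Because of the $1/p^l_d$ rescaling, the mean of `h_hat` stays close to the mean of `h`, which is why no extra scaling is needed at run-time under this option.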
@ -173,7 +189,7 @@
"\n",
"Implementation tips:\n",
"* Have a look at the constructor of mlp.optimiser.SGDOptimiser class, it has been modified to take more optimisation-related arguments.\n",
"* The best place to implement regularisation terms is `pgrads` method of mlp.layers.Layer (sub)-classes "
"* The best place to implement regularisation terms is `pgrads` method of mlp.layers.Layer (sub)-classes. See equations (6) and (9) why."
]
},
{
@ -234,7 +250,7 @@
"source": [
"# Exercise 4: Implement Dropout \n",
"\n",
"Implement dropout regularisation technique. Then for the same initial configuration as in Exercise 1. investigate effectivness of different dropout rates applied to input features and/or hidden layers. Start with $p_{inp}=0.5$ and $p_{hid}=0.5$ and do some search for better settings.\n",
"Implement dropout regularisation technique. Then for the same initial configuration as in Exercise 1. investigate effectivness of different dropout rates applied to input features and/or hidden layers. Start with $p_{inp}=0.5$ and $p_{hid}=0.5$ and do some search for better settings. Dropout usually slows training down (approximately two times) so train dropout models for around twice as many epochs as baseline model.\n",
"\n",
"Implementation tips:\n",
"* Add a function `fprop_dropout` to `mlp.layers.MLP` class which (on top of `inputs` argument) takes also dropout-related argument(s) and perform dropout forward propagation through the model.\n",


@ -257,18 +257,9 @@ class Linear(Layer):
1) da^i/dW^i and 2) da^i/db^i
since W and b are only layer's parameters
"""
l2_W_penalty, l2_b_penalty = 0, 0
if l2_weight > 0:
l2_W_penalty = l2_weight*self.W
l2_b_penalty = l2_weight*self.b
l1_W_penalty, l1_b_penalty = 0, 0
if l1_weight > 0:
l1_W_penalty = l1_weight*numpy.sign(self.W)
l1_b_penalty = l1_weight*numpy.sign(self.b)
grad_W = numpy.dot(inputs.T, deltas) + l2_W_penalty + l1_W_penalty
grad_b = numpy.sum(deltas, axis=0) + l2_b_penalty + l1_b_penalty
grad_W = numpy.dot(inputs.T, deltas)
grad_b = numpy.sum(deltas, axis=0)
return [grad_W, grad_b]
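For reference, here is a standalone sketch of how the penalty gradients from equations (6) and (9) could be added to the parameter gradients. The function `penalised_pgrads` and its argument names are hypothetical; it is not the method of `Linear` itself. It assumes `W` has shape (inputs, outputs) and leaves the bias gradient unpenalised, following the notebook's remark that penalising biases is usually not necessary.

```python
import numpy


def penalised_pgrads(inputs, deltas, W, l1_weight=0.0, l2_weight=0.0):
    """Gradients w.r.t. W and b with optional L1/L2 penalties on W.

    Hypothetical helper: l2_weight * W is the L2 term from equation (6),
    l1_weight * sign(W) is the L1 term from equation (9).
    """
    grad_W = numpy.dot(inputs.T, deltas)
    grad_b = numpy.sum(deltas, axis=0)
    if l2_weight > 0:
        grad_W = grad_W + l2_weight * W
    if l1_weight > 0:
        grad_W = grad_W + l1_weight * numpy.sign(W)
    return [grad_W, grad_b]


inputs = numpy.ones((2, 3))       # batch of 2, 3 input features
deltas = numpy.ones((2, 4))       # batch of 2, 4 output units
W = numpy.full((3, 4), -2.0)
grad_W, grad_b = penalised_pgrads(inputs, deltas, W,
                                  l1_weight=0.5, l2_weight=0.1)
```

In this toy case the data term contributes 2 to every entry of `grad_W`, the L2 term contributes $0.1 \cdot (-2)$ and the L1 term contributes $0.5 \cdot (-1)$.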
@ -352,7 +343,7 @@ class Softmax(Linear):
return y
def bprop(self, h, igrads):
raise NotImplementedError()
raise NotImplementedError('Softmax.bprop not implemented for hidden layer.')
def bprop_cost(self, h, igrads, cost):