light text editing

Steve Renals 2015-11-08 20:20:47 +00:00
parent b8db1fd658
commit 05c8fb9021

@ -6,7 +6,8 @@
"source": [
"# Introduction\n",
"\n",
"This tutorial focuses on implementation of alternative to sigmoid transfer functions (sometimes also called activation functions or non-linearities). First, we will implement another type of sigmoidal-like activation - hyperboilc tangent (Tanh) (to date, we have implemented sigmoid (logistic) activation). Then we go to unbounded (or partially bounded) activations: Rectifying Linear Units (ReLU) and Maxout.\n",
"This tutorial focuses on implementation of alternatives to sigmoid transfer functions for hidden units. (*Transfer functions* are also called *activation functions* or *nonlinearities*.) First, we will work with hyperboilc tangent (tanh) and then unbounded (or partially unbounded) piecewise linear functions: Rectifying Linear Units (ReLU) and Maxout.\n",
"\n",
"\n",
"## Virtual environments\n",
"\n",
@ -31,7 +32,7 @@
"```\n",
"git pull\n",
"```\n",
"6. And now, create the new branch & swith to it by typing:\n",
"6. And now, create the new branch & switch to it by typing:\n",
"```\n",
"git checkout -b lab5\n",
"```"
@ -49,16 +50,17 @@
"\n",
"(1) $h_i(a_i) = \\mbox{tanh}(a_i) = \\frac{\\exp(a_i) - \\exp(-a_i)}{\\exp(a_i) + \\exp(-a_i)}$\n",
"\n",
"Hence, the derivative h_i w.r.t a_i is:\n",
"Hence, the derivative of $h_i$ with respect to $a_i$ is:\n",
"\n",
"(2) $\n",
"\\frac{\\partial h_i}{\\partial a_i} = 1 - h^2_i\n",
"(2) $\\begin{align}\n",
"\\frac{\\partial h_i}{\\partial a_i} &= 1 - h^2_i\n",
"\\end{align}\n",
"$\n",
"\n",
"\n",
"## ReLU\n",
"\n",
"Given linear activation $a_{i}$ relu implements the following operation:\n",
"Given a linear activation $a_{i}$ relu implements the following operation:\n",
"\n",
"(3) $h_i(a_i) = \\max(0, a_i)$\n",
"\n",
@ -73,13 +75,13 @@
"\\end{align}\n",
"$\n",
"\n",
"ReLU implements a form of data-driven sparsity, that is, on average the activations are sparse (many of them are 0) but the general sparsity pattern will depend on particular data-point. This is different from sparsity obtained in model's parameters one can obtain with L1 regularisation as the latter affect all data-points in the same way.\n",
"ReLU implements a form of data-driven sparsity, that is, on average the activations are sparse (many of them are 0) but the general sparsity pattern will depend on particular data-point. This is different from sparsity obtained in model's parameters one can obtain with $L1$ regularisation as the latter affect all data-points in the same way.\n",
"\n",
"## Maxout\n",
"\n",
"Maxout is an example of data-driven type of non-linearity in which \"the transfer function\" can be learn from data. That is, the model can build non-linear transfer-function from piece-wise linear components. Those (linear components), depending on the number of linear regions used in the pooling operator ($K$ parameter), can approximate an arbitrary functions, like ReLU, abs, etc.\n",
"Maxout is an example of data-driven type of non-linearity in which the transfer function can be learned from data. That is, the model can build a non-linear transfer function from piecewise linear components. These linear components, depending on the number of linear regions used in the pooling operator (given by parameter $K$), can approximate arbitrary functions, such as ReLU, abs, etc.\n",
"\n",
"The maxout non-linearity, given some sub-set (group, pool) of $K$ linear activations $a_{j}, a_{j+1}, \\ldots, a_{j+K}$ at the layer $l$-th, implements the following operation:\n",
"Given some subset (group, pool) of $K$ linear activations $a_{j}, a_{j+1}, \\ldots, a_{j+K}$ at the $l$-th layer, maxout implements the following operation:\n",
"\n",
"(5) $h_i(a_j, a_{j+1}, \\ldots, a_{j+K}) = \\max(a_j, a_{j+1}, \\ldots, a_{j+K})$\n",
"\n",
@ -88,54 +90,54 @@
"(6) $\\begin{align}\n",
"\\frac{\\partial h_i}{\\partial (a_j, a_{j+1}, \\ldots, a_{j+K})} &=\n",
"\\begin{cases}\n",
" 1 & \\quad \\text{for maxed activation} \\\\\n",
" 0 & \\quad \\text{otherwsie} \\\\\n",
" 1 & \\quad \\text{for the max activation} \\\\\n",
" 0 & \\quad \\text{otherwise} \\\\\n",
"\\end{cases}\n",
"\\end{align}\n",
"$\n",
"\n",
"Implemenation tips are given in the Exercise 3.\n",
"Implementation tips are given in Exercise 3.\n",
"\n",
"# On the weight initialisation\n",
"# On weight initialisation\n",
"\n",
"Activation functions directly affect the \"network's dynamic\", that is, the magnitudes of the statistics each layer is producing. For example, *slashing* non-linearities like sigmoid or tanh bring the linear activations to a certain bounded range. ReLU, contrary has unbounded positive side. This directly affects all statistics collected in forward and backward passes as well as the gradients w.r.t paramters hence also the pace at which the model learns. That is why learning rate usually requires to be tuned for certain characterictics of the non-linearities. \n",
"Activation functions directly affect the \"network dynamics\", that is, the magnitudes of the statistics each layer is producing. For example, *slashing* non-linearities like sigmoid or tanh bring the linear activations to a certain bounded range. ReLU, on the contrary, has an unbounded positive side. This directly affects all statistics collected in forward and backward passes as well as the gradients w.r.t paramters - hence also the pace at which the model learns. That is why learning rate is usually required to be tuned for given the characterictics of the non-linearities used. \n",
"\n",
"The other, to date mostly \"omitted\" in the lab hyper-parameter, is the initial range the weight matrices are initialised with. For sigmoidal non-linearities (sigmoid, tanh) it is an important hyper-parameter and considerable amount of research has been put into determining what is the best strategy for choosing it. In fact, one of the early triggers of the recent resurgence of deep learning was pre-training - techniques allowing to better initialise the weights in unsupervised manner so one can effectively train deeper models in supervised fashion later. \n",
"Another important hyperparameter is the initial range used to initialise the weight matrices. We have largely ignored it so far (although if you did further experiments in coursework 1, you may have found setting it had an effect on training deeper networks with 4 or 5 hidden layers). However, for sigmoidal non-linearities (sigmoid, tanh) the initialisation range is an important hyperparameter and a considerable amount of research has been put into determining what is the best strategy for choosing it. In fact, one of the early triggers of the recent resurgence of deep learning was pre-training - techniques for initialising weights in an unsupervised manner so that one can effectively train deeper models in supervised fashion later. \n",
"\n",
"## Sigmoidal transfer functions\n",
"\n",
"Y. LeCun in [Efficient Backprop](http://link.springer.com/chapter/10.1007%2F3-540-49430-8_2) paper for sigmoidal units recommends the following setting (assuming the data has been normalised to zero mean, unit variance): \n",
"Y. LeCun in [Efficient Backprop](http://link.springer.com/chapter/10.1007%2F3-540-49430-8_2) recommends the following setting of the initial range $r$ for sigmoidal units (assuming that the data has been normalised to zero mean, unit variance): \n",
"\n",
"(7) $ r = \\frac{1}{\\sqrt{N_{IN}}} $\n",
"\n",
"where $N_{IN}$ is the number of inputs to the given layer and the weights are then sampled from (usually uniform) distribution $U(-r,r)$. The motivation is to keep the initial forward-pass signal in the linear region of the sigmoid non-linearity so the gradients are large enough for training to proceed (notice the sigmoidal non-linearities sature when activations get either very positive or very negatvie leading to very small gradients and hence poor learning dynamics as a result).\n",
"where $N_{IN}$ is the number of inputs to the given layer and the weights are then sampled from the (usually uniform) distribution $U(-r,r)$. The motivation is to keep the initial forward-pass signal in the linear region of the sigmoid non-linearity so that the gradients are large enough for training to proceed (note that the sigmoidal non-linearities saturate when activations are either very positive or very negative, leading to very small gradients and hence poor learning dynamics).\n",
"\n",
"Initialisation used in (7) however leads to different magnitues of activations/gradients at different layers (due to multiplicative narute of performed comutations) and more recently, [Glorot et. al](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) proposed so called *normalised initialisation*, which ensures the variance of the forward signal (activations) is approximately the same in each layer. The same applies to the gradients obtained in backward pass. \n",
"The initialisation used in (7) however leads to different magnitudes of activations/gradients at different layers (due to multiplicative nature of the computations) and more recently, [Glorot et. al](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) proposed the so-called *normalised initialisation*, which ensures the variance of the forward signal (activations) is approximately the same in each layer. The same applies to the gradients obtained in backward pass. \n",
"\n",
"The $r$ in the *normalised initialisation* for $\\mbox{tanh}$ non-linearity is then:\n",
"\n",
"(8) $ r = \\frac{\\sqrt{6}}{\\sqrt{N_{IN}+N_{OUT}}} $\n",
"\n",
"For sigmoid (logistic) non-linearity, to get similiar characteristics, one should scale $r$ in (8) by 4, that is:\n",
"For the sigmoid (logistic) non-linearity, to get similiar characteristics, one should scale $r$ in (8) by 4, that is:\n",
"\n",
"(9) $ r = \\frac{4\\sqrt{6}}{\\sqrt{N_{IN}+N_{OUT}}} $\n",
"\n",
"## Piece-wise linear transfer functions (ReLU, Maxout)\n",
"\n",
"For unbounded transfer functions initialisation is not that crucial as with sigmoidal ones. It's due to the fact their gradients do not diminish (they are acutally more likely to explode) and they do not saturate (ReLU saturates at 0, but not on the positive slope, where gradient is 1 everywhere).\n"
"For unbounded transfer functions initialisation is not as crucial as for sigmoidal ones. This is due to the fact that their gradients do not diminish (they are acutally more likely to explode) and they do not saturate (ReLU saturates at 0, but not on the positive slope, where gradient is 1 everywhere). (In practice ReLU is sometimes \"clipped\" with a maximum value, typically 20).\n"
]
},
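The initial ranges in equations (7), (8) and (9) of the cell above can be written out as a small sketch. The function names and the 784 -> 100 layer size are illustrative assumptions, not part of the course code; the weights themselves would then be sampled from $U(-r, r)$ as the text describes.

```python
import math


def lecun_range(n_in):
    """Equation (7): r = 1 / sqrt(N_IN)."""
    return 1.0 / math.sqrt(n_in)


def glorot_range(n_in, n_out, sigmoid=False):
    """Equation (8) for tanh; equation (9), i.e. 4 times larger, if sigmoid=True."""
    r = math.sqrt(6.0) / math.sqrt(n_in + n_out)
    return 4.0 * r if sigmoid else r


# Illustrative layer sizes (assumed): 784 inputs feeding 100 hidden units.
r7 = lecun_range(784)
r8 = glorot_range(784, 100)
r9 = glorot_range(784, 100, sigmoid=True)
print(r7, r8, r9)
```

For these sizes the normalised range (8) comes out roughly twice as wide as (7); how much that matters in practice is what Exercise 1 asks you to measure.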
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 1: Implement Tanh\n",
"# Exercise 1: Implement the tanh transfer function\n",
"\n",
"Implementation should follow the conventions used to build other layer types (for example, Sigmoid and Softmax). Test the solution by training one-hidden-layer (100 hidden units) model (similiar to the one used in Task 3a in the coursework). \n",
"Your implementation should follow the code conventions used to build other layer types (for example, Sigmoid and Softmax). Test your solution by training a one-hidden-layer model with 100 hidden units, similiar to the one used in Task 3a in the coursework. \n",
"\n",
"Tune the learning rate and compare the initial ranges in equations (7) and (8) (notice, there might not be much difference for one-hidden-layer model, but you can easily notice substantial gain from using (8) (or (9) for logistic activation) for deeper models, for example, 5 hidden-layer one from the first coursework).\n",
"Tune the learning rate and compare the initial ranges in equations (7) and (8). Note that there might not be much difference for one-hidden-layer model, but you can easily notice a substantial gain from using (8) (or (9) for logistic sigmoid activation) for deeper models, for example, the 5 hidden-layer network from the first coursework.\n",
"\n",
"Implementation tip: Use numpy.tanh() to compute non-linearity. Use irange argument when creating the given layer type to provide the initial sampling range."
"Implementation tip: Use numpy.tanh() to compute the non-linearity. Use the irange argument when creating the given layer type to provide the initial sampling range."
]
},
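As a starting point for Exercise 1, here is a minimal sketch of equations (1) and (2) using `numpy.tanh`, with a finite-difference check of the derivative. The function names are hypothetical and this is not the course's layer class: the actual layer conventions (as used by Sigmoid and Softmax) are not shown in this diff, so the code would need adapting to them.

```python
import numpy as np


def tanh_fprop(a):
    """Equation (1): h = tanh(a), applied element-wise."""
    return np.tanh(a)


def tanh_deriv(h):
    """Equation (2): dh/da = 1 - h^2, expressed through the output h."""
    return 1.0 - h ** 2


def tanh_bprop(h, igrads):
    """Chain rule: scale the incoming gradients by the local derivative."""
    return igrads * tanh_deriv(h)


# Compare the analytic derivative with a central finite difference.
a = np.linspace(-3.0, 3.0, 13)
h = tanh_fprop(a)
eps = 1e-6
numeric = (np.tanh(a + eps) - np.tanh(a - eps)) / (2 * eps)
max_err = np.max(np.abs(numeric - tanh_deriv(h)))
print(max_err)
```

Writing the derivative in terms of `h` rather than `a` means the backward pass can reuse the output stored during the forward pass instead of recomputing tanh.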
{
@ -153,7 +155,7 @@
"source": [
"# Exercise 2: Implement ReLU\n",
"\n",
"Implementation should follow the conventions used to build Linear, Sigmoid and Softmax layers. Test the solution by training one-hidden-layer (100 hidden units) model (similiar to the one used in Task 3a in the coursework). Tune the learning rate (start with the initial one set to 0.1) and the initial weight range set to 0.05."
"Again, your implementation should follow the conventions used to build Linear, Sigmoid and Softmax layers. As in exercise 1, test your solution by training a one-hidden-layer model with 100 hidden units, similiar to the one used in Task 3a in the coursework. Tune the learning rate (start with the initial one set to 0.1) with the initial weight range set to 0.05."
]
},
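For Exercise 2, a rough sketch of the ReLU operation in equation (3) and its gradient might look as follows. The function names are hypothetical, the random input is only for illustration, and treating the gradient at exactly $a_i = 0$ as 0 is an assumed convention; the course's layer interface is not reproduced here.

```python
import numpy as np


def relu_fprop(a):
    """Equation (3): h = max(0, a), applied element-wise."""
    return np.maximum(0.0, a)


def relu_bprop(a, igrads):
    """Pass gradients only where the unit was active (a > 0).

    The gradient at exactly a == 0 is taken to be 0 here (an assumption).
    """
    return igrads * (a > 0)


# Illustrative input: 4 data-points, 6 units, values in [-1, 1).
rng = np.random.RandomState(0)
a = rng.uniform(-1.0, 1.0, size=(4, 6))
h = relu_fprop(a)

# Fraction of exactly-zero activations: the data-driven sparsity.
sparsity = np.mean(h == 0)
print(sparsity)
```

Each row of `h == 0` is a different pattern, which illustrates the point made earlier that the sparsity depends on the particular data-point.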
{
@ -171,11 +173,11 @@
"source": [
"# Exercise 3: Implement Maxout\n",
"\n",
"Implementation should follow the conventions used to build Linear, Sigmoid and Softmax layers. Implement scenario with non-overlapping pooling regions. Test the solution by training a one-hidden-layer model with the total number of weights similar to the models used in the previous exercises. Use the same optimisation hyper-parameters (learning rate, initial weights range) as you used for ReLU models. Tune the pool size $K$ (but keep the number of total parameters fixed).\n",
"As with the previous two exercises, your implementation should follow the conventions used to build the Linear, Sigmoid and Softmax layers. As before, test your solution by training a one-hidden-layer model with 100 hidden units, similiar to the one used in Task 3a in the coursework. Use the same optimisation hyper-parameters (learning rate, initial weights range) as you used for ReLU models. Tune the pool size $K$ (but keep the number of total parameters fixed).\n",
"\n",
"Note: Max operator reduces dimensionality, hence for example, to get 100 hidden maxout units with pooling size set to $K=2$ the size of linear part needs to be set to 100*K (assuming non-overlapping pools). This affects how you compute the total number of weights in the model.\n",
"Note: The Max operator reduces dimensionality, hence for example, to get 100 hidden maxout units with pooling size set to $K=2$ the size of linear part needs to be set to $100K$ (assuming non-overlapping pools). This affects how you compute the total number of weights in the model.\n",
"\n",
"Implementation tips: To back-propagate through maxout layer, one needs to keep track of which linear activation $a_{j}, a_{j+1}, \\ldots, a_{j+K}$ was maxed-out in each pool. The convenient way to do so is by storing the maxed indices in fprop function and then in back-prop stage pass the gradient only through those (i.e. for example, one can build an auxiliary matrix where each element is either 1 (if unit was passed-forward through max operator for a given data-point) or 0 otherwise. Then in backward pass it suffices to upsample the maxout *igrads* signal to linear layer dimension and element-wise multiply by the aforemenioned auxiliary matrix."
"Implementation tips: To back-propagate through the maxout layer, one needs to keep track of which linear activation $a_{j}, a_{j+1}, \\ldots, a_{j+K}$ was the maximum in each pool. The convenient way to do so is by storing the indices of the maximum units in the fprop function and then in the backprop stage pass the gradient only through those (i.e. for example, one can build an auxiliary matrix where each element is either 1 (if unit was maximum, and passed forward through the max operator for a given data-point) or 0 otherwise. Then in the backward pass it suffices to upsample the maxout *igrads* signal to the linear layer dimension and element-wise multiply by the aforemenioned auxiliary matrix."
]
},
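The implementation tips in Exercise 3 can be sketched roughly as below. This assumes non-overlapping pools formed from $K$ adjacent columns of the linear activations, and uses hypothetical function names; it is one possible reading of the tips rather than the reference solution.

```python
import numpy as np


def maxout_fprop(a, K):
    """a has shape (batch, units * K); returns (h, mask).

    h has shape (batch, units). mask is the auxiliary 0/1 matrix with the
    same shape as a, marking which activation was the maximum in each pool.
    """
    batch, dim = a.shape
    pools = a.reshape(batch, dim // K, K)
    idx = pools.argmax(axis=2)  # position of the max inside each pool
    h = pools.max(axis=2)
    mask = (np.arange(K) == idx[:, :, None]).astype(a.dtype)
    return h, mask.reshape(batch, dim)


def maxout_bprop(igrads, mask, K):
    """Upsample igrads to the linear layer dimension, then apply the mask."""
    return np.repeat(igrads, K, axis=1) * mask


# Two data-points, two maxout units, pool size K = 2.
a = np.array([[1.0, 3.0, -2.0, -5.0],
              [0.5, 0.1, 4.0, 7.0]])
h, mask = maxout_fprop(a, K=2)
ograds = maxout_bprop(np.ones_like(h), mask, K=2)
print(h)
print(ograds)
```

With ties, `argmax` picks the first maximal element, so exactly one activation per pool receives gradient.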
{
@ -195,7 +197,7 @@
"\n",
"Try all of the above non-linearities with dropout training. Use the dropout hyper-parameters $\\{p_{inp}, p_{hid}\\}$ that worked best for sigmoid models from the previous lab.\n",
"\n",
"Note: the code for dropout you were asked to implement last week has not been given as a solution for this week - as a result you need to move/merge the required dropout parts from your previous *lab4* branch (or implement it if you haven't done it so far). \n"
"Note: the code for dropout you were asked to implement last week has not been given as a solution for this week - as a result you need to move/merge the required dropout parts from your previous *lab4* branch (or implement it if you haven't already done so). \n"
]
},
{
@ -224,7 +226,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
"version": "2.7.10"
}
},
"nbformat": 4,