Update 05_Transfer_functions.ipynb

This commit is contained in:
Pawel Swietojanski 2015-11-09 10:50:02 +00:00
parent 25481eb471
commit b78144169e

View File

@ -102,7 +102,7 @@
"\n",
"# On weight initialisation\n",
"\n",
"Activation functions directly affect the \"network dynamics\", that is, the magnitudes of the statistics each layer is producing. For example, *slashing* non-linearities like sigmoid or tanh bring the linear activations to a certain bounded range. ReLU, on the contrary, has an unbounded positive side. This directly affects all statistics collected in forward and backward passes as well as the gradients w.r.t paramters - hence also the pace at which the model learns. That is why learning rate is usually required to be tuned for given the characterictics of the non-linearities used. \n",
"Activation functions directly affect the \"network dynamics\", that is, the magnitudes of the statistics each layer is producing. For example, *squashing* non-linearities like sigmoid or tanh bring the linear activations to a certain bounded range. ReLU, on the contrary, has an unbounded positive side. This directly affects all statistics collected in forward and backward passes as well as the gradients w.r.t paramters - hence also the pace at which the model learns. That is why learning rate is usually required to be tuned for given the characterictics of the non-linearities used. \n",
"\n",
"Another important hyperparameter is the initial range used to initialise the weight matrices. We have largely ignored it so far (although if you did further experiments in coursework 1, you may have found setting it had an effect on training deeper networks with 4 or 5 hidden layers). However, for sigmoidal non-linearities (sigmoid, tanh) the initialisation range is an important hyperparameter and a considerable amount of research has been put into determining what is the best strategy for choosing it. In fact, one of the early triggers of the recent resurgence of deep learning was pre-training - techniques for initialising weights in an unsupervised manner so that one can effectively train deeper models in supervised fashion later. \n",
"\n",