"Although `numpy` can easily handle both formats (possibly with some computational overhead), in our code we will stick with the more modern (and default) `C`-like approach and use the row-wise format (in contrast to Fortran that used a column-wise approach). \n",
"\n",
"This means, that in this tutorial:\n",
"* vectors are kept row-wise $\\mathbf{x} = (x_1, x_1, \\ldots, x_D) $ (rather than $\\mathbf{x} = (x_1, x_1, \\ldots, x_D)^T$)\n",
"* similarly, in case of matrices we will stick to: $\\left[ \\begin{array}{cccc}\n",
"x_{11} & x_{12} & \\ldots & x_{1D} \\\\\n",
"x_{21} & x_{22} & \\ldots & x_{2D} \\\\\n",
"x_{31} & x_{32} & \\ldots & x_{3D} \\\\ \\end{array} \\right]$ and each row (i.e. $\\left[ \\begin{array}{cccc} x_{11} & x_{12} & \\ldots & x_{1D} \\end{array} \\right]$) represents a single data-point (like one MNIST image or one window of observations)\n",
"\n",
"In lecture slides you will find the equations following the conventional mathematical approach, using column vectors, but you can easily map between column-major and row-major organisations using a matrix transpose.\n",
"\n",
"***\n",
"\n",
"## Linear and Affine Transforms\n",
"\n",
"The basis of all linear models is the so called affine transform, which is a transform that implements a linear transformation and translation of the input features. The transforms we are going to use are parameterised by:\n",
"\n",
" * A weight matrix $\\mathbf{W} \\in \\mathbb{R}^{D\\times K}$: where element $w_{ik}$ is the weight from input $x_i$ to output $y_k$\n",
" * A bias vector $\\mathbf{b}\\in R^{K}$ : where element $b_{k}$ is the bias for output $k$\n",
"\n",
"Note, the bias is simply some additive term, and can be easily incorporated into an additional row in weight matrix and an additional input in the inputs which is set to $1.0$ (as in the below picture taken from the lecture slides). However, here (and in the code) we will keep them separate.\n",
"For instance, for the above example of 5-dimensional input vector by $\\mathbf{x} = (x_1, x_2, x_3, x_4, x_5)$, weight matrix $\\mathbf{W}=\\left[ \\begin{array}{ccc}\n",
"w_{11} & w_{12} & w_{13} \\\\\n",
"w_{21} & w_{22} & w_{23} \\\\\n",
"w_{31} & w_{32} & w_{33} \\\\\n",
"w_{41} & w_{42} & w_{43} \\\\\n",
"w_{51} & w_{52} & w_{53} \\\\ \\end{array} \\right]$, bias vector $\\mathbf{b} = (b_1, b_2, b_3)$ and outputs $\\mathbf{y} = (y_1, y_2, y_3)$, one can write the transformation as follows:\n",
"\n",
"(for the $i$-th output)\n",
"\n",
"(1) $\n",
"\\begin{equation}\n",
" y_i = b_i + \\sum_j x_jw_{ji}\n",
"\\end{equation}\n",
"$\n",
"\n",
"or the equivalent vector form (where $\\mathbf w_i$ is the $i$-th column of $\\mathbf W$, but note, when we **slice** the $i$th column we will get a **vector** $\\mathbf w_i = (w_{1i}, w_{2i}, w_{3i}, w_{4i}, w_{5i})$, hence the transpose for $\\mathbf w_i$ in the below equation):\n",
"\n",
"(2) $\n",
"\\begin{equation}\n",
" y_i = b_i + \\mathbf x \\mathbf w_i^T\n",
"\\end{equation}\n",
"$\n",
"\n",
"The same operation can be also written in matrix form, to compute all the outputs $\\mathbf{y}$ at the same time:\n",
"\n",
"(3) $\n",
"\\begin{equation}\n",
" \\mathbf y=\\mathbf x\\mathbf W + \\mathbf b\n",
"\\end{equation}\n",
"$\n",
"\n",
"This is equivalent to slides 12/13 in lecture 1, except we are using row vectors.\n",
"\n",
"When $\\mathbf{x}$ is a mini-batch (contains $B$ data-points of dimension $D$ each), i.e. $\\left[ \\begin{array}{cccc}\n",
" \\mathbf Y=\\mathbf X\\mathbf W + \\mathbf b\n",
"\\end{equation}\n",
"$\n",
"\n",
"where both $\\mathbf{X}\\in\\mathbb{R}^{B\\times D}$ and $\\mathbf{Y}\\in\\mathbb{R}^{B\\times K}$ are matrices, and $\\mathbf{b}$ needs to be <a href=\"http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html\">broadcasted</a> $B$ times (numpy will do this by default). However, we will not make an explicit distinction between a special case for $B=1$ and $B>1$ and simply use equation (3) instead, although $\\mathbf{x}$ and hence $\\mathbf{y}$ could be matrices. From an implementation point of view, it does not matter.\n",
"\n",
"The desired functionality for matrix multiplication in numpy is provided by <a href=\"http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html\">numpy.dot</a> function. If you haven't use it so far, get familiar with it as we will use it extensively."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A general note on random number generators\n",
"\n",
"It is generally a good practice (for machine learning applications **not** for cryptography!) to seed a pseudo-random number generator once at the beginning of the experiment, and use it later through the code where necesarry. This makes it easier to reproduce results since random initialisations can be replicated. As such, within this course we are going use a single random generator object, similar to the below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy\n",
"\n",
"#initialise the random generator to be used later\n",
"Using `numpy.dot`, implement **forward** propagation through the linear transform defined by equations (3) and (4) for $B=1$ and $B>1$. As data ($\\mathbf{x}$) use `MNISTDataProvider` introduced last week. For the case when $B=1$, write a function to compute the 1st output ($y_1$) using equations (1) and (2). Check if the output is the same as the corresponding one obtained with numpy. \n",
"\n",
"Tip: To generate random data you can use `random_generator.uniform(-0.1, 0.1, (D, 10))` from above."
"#implement following functions, then run the cell\n",
"def y1_equation_1(x, W, b):\n",
" raise NotImplementedError()\n",
" \n",
"def y1_equation_2(x, W, b):\n",
" raise NotImplementedError()\n",
"\n",
"def y_equation_3(x, W, b):\n",
" #use numpy.dot\n",
" raise NotImplementedError()\n",
"\n",
"def y_equation_4(x, W, b):\n",
" #use numpy.dot\n",
" raise NotImplementedError()\n",
"\n",
"for x, t in mnist_dp:\n",
" y1e1 = y1_equation_1(x[0], W, b)\n",
" y1e2 = y1_equation_2(x[0], W, b)\n",
" ye3 = y_equation_3(x, W, b)\n",
" ye4 = y_equation_4(x, W, b)\n",
"\n",
"print 'y1e1', y1e1\n",
"print 'y1e1', y1e1\n",
"print 'ye3', ye3\n",
"print 'ye4', ye4\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Exercise 2\n",
"\n",
"Modify the examples from Exercise 1 to perform **backward** propagation, that is, given $\\mathbf{y}$ (obtained in the previous step) and weight matrix $\\mathbf{W}$, project $\\mathbf{y}$ onto the input space $\\mathbf{x}$ (ignore or set to zero the biases towards $\\mathbf{x}$ in backward pass, and note, we are **not** trying to reconstruct the original $\\mathbf{x}$). Mathematically, we are interested in the following transformation: $\\mathbf{z}=\\mathbf{y}\\mathbf{W}^T$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"## Exercise 3 (optional)\n",
"\n",
"In case you do not fully understand how matrix-vector and/or matrix-matrix products work, consider implementing `my_dot_mat_mat` function (you have been given `my_dot_vec_mat` code to look at as an example) which takes as the input the following arguments:\n",
"Your job is to write a variant that works in a mini-batch mode where both $\\mathbf{x}\\in\\mathbb{R}^{B\\times D}$ and $\\mathbf{y}\\in\\mathbb{R}^{B\\times K}$ are matrices in which each rows contain one of $B$ data-points from mini-batch (rather than $\\mathbf{x}\\in\\mathbb{R}^{D}$ and $\\mathbf{y}\\in\\mathbb{R}^{K}$)."
"We will learn the model with stochastic gradient descent on N data-points using mean square error (MSE) loss function, which is defined as follows:\n",
"Similarly, using the above $\\delta^n_r$ one can express the gradient of the weight $w_{sr}$ (from the s-th input to the r-th output) for linear model and MSE cost as follows:\n",
" * Weight matrix $\\mathbf{W}$: $w_{ik}$ is the weight from input $x_i$ to output $y_k$. Note, here this is really a vector since a single scalar output, y_1.\n",
" * Scalar bias $b$ for the only output in our model \n",
" * Scalar target $t$ for the only output in out model\n",
" \n",
"First, ensure you can make use of the data provider (note, for training data has been normalised to zero mean and unit variance, hence different effective range than one can find in file):"
"The below code implements a very simple variant of stochastic gradient descent for the rainfall prediction example. Your task is to implement 5 functions in the next cell and then run two next cells that 1) build sgd functions and 2) run the actual training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"\n",
"#When implementing those, take into account the mini-batch case, for which one is\n",
"#expected to sum the errors for each example\n",
"\n",
"def fprop(x, W, b):\n",
" #code implementing eq. (3)\n",
" raise NotImplementedError('Write me!')\n",
"\n",
"def cost(y, t):\n",
" #Mean Square Error cost, equation (5)\n",
" raise NotImplementedError('Write me!')\n",
"\n",
"def cost_grad(y, t):\n",
" #Gradient of the cost w.r.t y equation (8)\n",
" raise NotImplementedError('Write me!')\n",
"\n",
"def cost_wrt_W(cost_grad, x):\n",
" #Gradient of the cost w.r.t W, equation (9)\n",
" raise NotImplementedError('Write me!')\n",
" \n",
"def cost_wrt_b(cost_grad):\n",
" #Gradient of the cost w.r.t to b, equation (10)\n",
"Modify the above prediction (regression) problem so the model makes a binary classification whether the the weather is going to be one of those \\{rainy, not-rainy} (look at slide 12 of the 2nd lecture)\n",
"\n",
"Tip: You need to introduce the following changes:\n",
"1. Modify `MetOfficeDataProvider` (for example, inherit from MetOfficeDataProvider to create a new class MetOfficeDataProviderBin) and modify `next()` function so it returns as `targets` either 0 (not-rainy - if the the amount of rain [before mean/variance normalisation] is equal to 0) or 1 (rainy -- otherwise).\n",
"2. Modify the functions from previous exercise so the fprop implements `sigmoid` on top of affine transform.\n",
"3. Modify cost function to binary cross-entropy\n",
"4. Make sure you compute the gradients correctly (as you have changed both the output and the cost)\n"