mlpractical/notebooks/01_Introduction.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "533c10f0-95ba-4684-a72d-fd52cef0d007"
    }
   },
   "source": [
    "# Exercises\n",
    "\n",
    "Today's exercises are meant to allow you to get some initial familiarisation with the `mlp` package and how data is provided to the learning functions. You are going to implement variants of a `DataProvider` class, which preprocesses data and serves data in batches when the `next()` function is called. \n",
    "\n",
    "If you are new to Python and/or NumPy and are struggling to complete the exercises, you may find going through [this Stanford University tutorial](http://cs231n.github.io/python-numpy-tutorial/) by Justin Johnson first helps. There is also a derived Jupyter notebook by Volodymyr Kuleshov and Isaac Caswell which you can download [from here](https://github.com/kuleshov/teaching-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) - if you save this in to your `mlpractical/notebooks` directory you should be able to open the notebook from the dashboard to run the examples.\n",
    "\n",
    "## Data providers\n",
    "\n",
    "Open (in the browser) the [`mlp.data_providers`](../mlp/data_providers.py) module. Have a look through the code and comments, then follow to the exercises.\n",
    "\n",
    "### Exercise 1 \n",
    "\n",
    "The `MNISTDataProvider` iterates over input images and target classes (digit IDs) from the [MNIST database of handwritten digit images](http://yann.lecun.com/exdb/mnist/), a common supervised learning benchmark task. Using the data provider and `matplotlib` we can for example iterate over the first couple of images in the dataset and display them using the following code:\n",
    "\n",
    "* NOTE: If you encounter `KeyError: 'MLP_DATA_DIR'`, check that you have correctly set the environment variable following the setup instructions, and that you are in the `mlp` environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbpresent": {
     "id": "978c1095-a9ce-4626-a113-e0be5fe51ecb"
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import sys\n",
    "# sys.path.append('/path/to/mlpractical')\n",
    "import matplotlib.pyplot as plt\n",
    "import mlp.data_providers as data_providers\n",
    "# If error while importing mlp.data_providers: add path to your folder mlpractical using sys.path.append('/path/to/mlpractical/')\n",
    "def show_single_image(img, fig_size=(2, 2)):\n",
    "    fig = plt.figure(figsize=fig_size)\n",
    "    ax = fig.add_subplot(111)\n",
    "    ax.imshow(img, cmap='Greys')\n",
    "    ax.axis('off')\n",
    "    plt.show()\n",
    "    return fig, ax\n",
    "\n",
    "# An example for a single MNIST image\n",
    "mnist_dp = data_providers.MNISTDataProvider(\n",
    "    which_set='valid', batch_size=1, max_num_batches=2, shuffle_order=True)\n",
    "\n",
    "for inputs, target in mnist_dp:\n",
    "    # The reshape operation reorganizes data from 1D array of size 784 to 2D array of size 28x28\n",
    "    # See notes in the next cell\n",
    "    square_inputs = inputs.reshape((28, 28))\n",
    "    show_single_image(square_inputs)\n",
    "    print('Image target: {0}'.format(target))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generally we will want to deal with batches of multiple images i.e. `batch_size > 1`. \n",
    "\n",
    "**Your tasks**:\n",
    "\n",
    "* Using `MNISTDataProvider`, write code that iterates over the first 5 minibatches of size 100 data-points. \n",
    "* Display each batch of MNIST digits in a $10\\times10$ grid of images. \n",
    "  \n",
    "**Notes**:\n",
    "\n",
    "  * Images are returned from the provider as tuples of numpy arrays `(inputs, targets)`. The `inputs` matrix has shape `(batch_size, input_dim)` while the `targets` array is of shape `(batch_size,)`, where `batch_size` is the number of data points in a single batch and `input_dim` is dimensionality of the input features. \n",
    "  * Each input data-point (image) is stored as a 784 dimensional vector of pixel intensities normalised to $[0, 1]$ from inital integer values in $[0, 255]$. However, the original spatial domain is two dimensional, so before plotting you will need to reshape the one dimensional input arrays in to two dimensional arrays 2D (MNIST images have the same height and width dimensions).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write your code here for iterating over five batches of \n",
    "# 100 data points each and displaying as 10x10 grids\n",
    "\n",
    "def show_batch_of_images(img_batch, fig_size=(3, 3)):\n",
    "    #Expected shape of img_batch: (batch_size, im_height, im_width)\n",
    "    raise NotImplementedError('Write me!')\n",
    "\n",
    "batch_size = 100\n",
    "num_batches = 5\n",
    "\n",
    "#TODO: initialize the MNISTDataProvider class and iterate over batches\n",
    "# with the show_batch_of_images function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "d2d525de-5d5b-41d5-b2fb-a83874dba986"
    }
   },
   "source": [
    "### Exercise 2\n",
    "\n",
    "The `targets` variable in `MNISTDataProvider` currently returns a vector of integers, where each element in this vector represents an the class of the corresponding data-point (0 to 9). \n",
    "\n",
    "It is easier to train neural networks using a 1-of-K representation for multi-class targets. Instead of representing class identity by an integer, each target is replaced by a vector of length equal to teh number of classes whose values are zero everywhere except on the index corresponding to the class.\n",
    "\n",
    "For instance, given a batch of 5 integer targets `[2, 2, 0, 1, 0]` and assuming there are 3 different classes \n",
    "the corresponding 1-of-K encoded targets would be\n",
    "```\n",
    "[[0, 0, 1],\n",
    " [0, 0, 1],\n",
    " [1, 0, 0],\n",
    " [0, 1, 0],\n",
    " [1, 0, 0]]\n",
    "```\n",
    "**Your Tasks**:\n",
    "  * Implement the `to_one_of_k` method of `MNISTDataProvider` class. \n",
    "  * Uncomment the overloaded `next` method, so the raw targets are converted to 1-of-K coding. \n",
    "  * Test your code by running the the cell below. As you have changed the `mlp` package, reload the notebook kernel before running the cell to make sure the changes are picked up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mlp.data_providers as data_providers\n",
    "import numpy as np\n",
    "\n",
    "mnist_dp = data_providers.MNISTDataProvider(\n",
    "    which_set='valid', batch_size=5, max_num_batches=5, shuffle_order=False)\n",
    "\n",
    "for inputs, targets in mnist_dp:\n",
    "    # Check that values are either 0 or 1\n",
    "    assert np.all(np.logical_or(targets == 0., targets == 1.))\n",
    "    # Check that there is exactly a single 1\n",
    "    assert np.all(targets.sum(-1) == 1.)\n",
    "    print(targets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "471093b7-4b94-4295-823a-5285c79d3119"
    }
   },
   "source": [
    "### Exercise 3\n",
    "\n",
    "Here you will write your own data provider `MetOfficeDataProvider` that wraps weather data for south Scotland. This data is stored in `data/HadSSP_daily_qc.txt` for your convenience and skeleton code for the class provided in `mlp/data_providers.py`.\n",
    "\n",
    "The data is organised in the text file as a table, with the first two columns indexing the year and month of the readings and the following 31 columns giving daily precipitation values for the corresponding month. As not all months have 31 days some of the entries correspond to non-existing days. These values are indicated by a non-physical value of `-99.9`.\n",
    "\n",
    "**Your tasks**:\n",
    "\n",
    "  * Implement the `MetOfficeDataProvider` class in `mlp/data_providers.py`. You only need to implement the `__init__()` function, following the instructions below:\n",
    "    * You should read all of the data from the file ([`np.loadtxt`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) may be useful for this) and then filter out the `-99.9` values and collapse the table to a one-dimensional array corresponding to a sequence of daily measurements for the whole period data is available for. [NumPy's boolean indexing feature](http://docs.scipy.org/doc/numpy/user/basics.indexing.html#boolean-or-mask-index-arrays) could be helpful here.\n",
    "    * A common initial preprocessing step in machine learning tasks is to normalise data so that it has zero mean and a standard deviation of one. Normalise the data sequence so that its overall mean is zero and standard deviation one.\n",
    "    * Each data point in the data provider should correspond to a window of length specified in the `__init__` method as `window_size` of this contiguous data sequence, with the model inputs being the first `window_size - 1` elements of the window and the target output being the last element of the window. For example if the original data sequence was `[1, 2, 3, 4, 5, 6]` and `window_size=3` then `input, target` pairs iterated over by the data provider should be\n",
    "  ```\n",
    "  [1, 2], 3\n",
    "  [4, 5], 6\n",
    "  ```\n",
    "  * **Extension**: The current data provider only produces `len(data)/window_size` sample points. A better approach is to have it return overlapping windows of the sequence so that more training data instances are produced. For example for the same sequence `[1, 2, 3, 4, 5, 6]` the corresponding `input, target` pairs with `window_size=3` would be\n",
    "\n",
    "```\n",
    "[1, 2], 3\n",
    "[2, 3], 4\n",
    "[3, 4], 5\n",
    "[4, 5], 6\n",
    "```\n",
    "  * Test your code by running the cell below. (Remember to reload kernel after making changes in the `mlp` package)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbpresent": {
     "id": "c8553a56-9f25-4198-8a1a-d7e9572b4382"
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import mlp.data_providers as data_providers\n",
    "import numpy as np\n",
    "\n",
    "batch_size = 3\n",
    "for window_size in [2, 5, 10]:\n",
    "    met_dp = data_providers.MetOfficeDataProvider(\n",
    "        window_size=window_size, batch_size=batch_size,\n",
    "        max_num_batches=1, shuffle_order=False)\n",
    "    fig = plt.figure(figsize=(6, 3))\n",
    "    ax = fig.add_subplot(111)\n",
    "    ax.set_title('Window size {0}'.format(window_size))\n",
    "    ax.set_xlabel('Day in window')\n",
    "    ax.set_ylabel('Normalised reading')\n",
    "    # iterate over data provider batches checking size and plotting\n",
    "    for inputs, targets in met_dp:\n",
    "        assert inputs.shape == (batch_size, window_size - 1)\n",
    "        assert targets.shape == (batch_size, )\n",
    "        ax.plot(np.c_[inputs, targets].T, '.-')\n",
    "        ax.plot([window_size - 1] * batch_size, targets, 'ko')"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}