"Today's exercises are meant to allow you to get some initial familiarisation with the `mlp` package and how data is provided to the learning functions. You are going to implement variants of a `DataProvider` class, which preprocesses data and serves data in batches when the `next()` function is called. \n",
"If you are new to Python and/or NumPy and are struggling to complete the exercises, you may find going through [this Stanford University tutorial](http://cs231n.github.io/python-numpy-tutorial/) by Justin Johnson first helps. There is also a derived Jupyter notebook by Volodymyr Kuleshov and Isaac Caswell which you can download [from here](https://github.com/kuleshov/teaching-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) - if you save this in to your `mlpractical/notebooks` directory you should be able to open the notebook from the dashboard to run the examples.\n",
"Open (in the browser) the [`mlp.data_providers`](../mlp/data_providers.py) module. Have a look through the code and comments, then follow to the exercises.\n",
"The `MNISTDataProvider` iterates over input images and target classes (digit IDs) from the [MNIST database of handwritten digit images](http://yann.lecun.com/exdb/mnist/), a common supervised learning benchmark task. Using the data provider and `matplotlib` we can for example iterate over the first couple of images in the dataset and display them using the following code:\n",
"\n",
"* NOTE: If you encounter `KeyError: 'MLP_DATA_DIR'`, check that you have correctly set the environment variable following the setup instructions, and that you are in the `mlp` environment."
" * Images are returned from the provider as tuples of numpy arrays `(inputs, targets)`. The `inputs` matrix has shape `(batch_size, input_dim)` while the `targets` array is of shape `(batch_size,)`, where `batch_size` is the number of data points in a single batch and `input_dim` is dimensionality of the input features. \n",
" * Each input data-point (image) is stored as a 784 dimensional vector of pixel intensities normalised to $[0, 1]$ from inital integer values in $[0, 255]$. However, the original spatial domain is two dimensional, so before plotting you will need to reshape the one dimensional input arrays in to two dimensional arrays 2D (MNIST images have the same height and width dimensions).\n"
"The `targets` variable in `MNISTDataProvider` currently returns a vector of integers, where each element in this vector represents an the class of the corresponding data-point (0 to 9). \n",
"It is easier to train neural networks using a 1-of-K representation for multi-class targets. Instead of representing class identity by an integer, each target is replaced by a vector of length equal to teh number of classes whose values are zero everywhere except on the index corresponding to the class.\n",
" * Test your code by running the the cell below. As you have changed the `mlp` package, reload the notebook kernel before running the cell to make sure the changes are picked up."
"Here you will write your own data provider `MetOfficeDataProvider` that wraps weather data for south Scotland. This data is stored in `data/HadSSP_daily_qc.txt` for your convenience and skeleton code for the class provided in `mlp/data_providers.py`.\n",
"The data is organised in the text file as a table, with the first two columns indexing the year and month of the readings and the following 31 columns giving daily precipitation values for the corresponding month. As not all months have 31 days some of the entries correspond to non-existing days. These values are indicated by a non-physical value of `-99.9`.\n",
" * Implement the `MetOfficeDataProvider` class in `mlp/data_providers.py`. You only need to implement the `__init__()` function, following the instructions below:\n",
" * You should read all of the data from the file ([`np.loadtxt`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) may be useful for this) and then filter out the `-99.9` values and collapse the table to a one-dimensional array corresponding to a sequence of daily measurements for the whole period data is available for. [NumPy's boolean indexing feature](http://docs.scipy.org/doc/numpy/user/basics.indexing.html#boolean-or-mask-index-arrays) could be helpful here.\n",
" * A common initial preprocessing step in machine learning tasks is to normalise data so that it has zero mean and a standard deviation of one. Normalise the data sequence so that its overall mean is zero and standard deviation one.\n",
" * Each data point in the data provider should correspond to a window of length specified in the `__init__` method as `window_size` of this contiguous data sequence, with the model inputs being the first `window_size - 1` elements of the window and the target output being the last element of the window. For example if the original data sequence was `[1, 2, 3, 4, 5, 6]` and `window_size=3` then `input, target` pairs iterated over by the data provider should be\n",
" * **Extension**: The current data provider only produces `len(data)/window_size` sample points. A better approach is to have it return overlapping windows of the sequence so that more training data instances are produced. For example for the same sequence `[1, 2, 3, 4, 5, 6]` the corresponding `input, target` pairs with `window_size=3` would be\n",