{ "cells": [ { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "b167e6e2-05e0-4a4b-a6cc-47cab1c728b4" } }, "source": [ "# Introduction\n", "\n", "## Getting started with Jupyter notebooks\n", "\n", "The majority of your work in this course will be done using Jupyter notebooks so we will here introduce some of the basics of the notebook system. If you are already comfortable using notebooks or just would rather get on with some coding feel free to [skip straight to the exercises below](#Exercises).\n", "\n", "*Note: Jupyter notebooks are also known as IPython notebooks. The Jupyter system now supports languages other than Python [hence the name was changed to make it more language agnostic](https://ipython.org/#jupyter-and-the-future-of-ipython) however IPython notebook is still commonly used.*\n", "\n", "### Jupyter basics: the server, dashboard and kernels\n", "\n", "In launching this notebook you will have already come across two of the other key components of the Jupyter system - the notebook *server* and *dashboard* interface.\n", "\n", "We began by starting a notebook server instance in the terminal by running\n", "\n", "```\n", "jupyter notebook\n", "```\n", "\n", "This will have begun printing a series of log messages to terminal output similar to\n", "\n", "```\n", "$ jupyter notebook\n", "[I 08:58:24.417 NotebookApp] Serving notebooks from local directory: ~/mlpractical\n", "[I 08:58:24.417 NotebookApp] 0 active kernels\n", "[I 08:58:24.417 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/\n", "```\n", "\n", "The last message included here indicates the URL the application is being served at. The default behaviour of the `jupyter notebook` command is to open a tab in a web browser pointing to this address after the server has started up. The server can be launched without opening a browser window by running `jupyter notebook --no-browser`. 
This can be useful for example when running a notebook server on a remote machine over SSH. Descriptions of various other command options can be found by displaying the command help page using\n", "\n", "```\n", "jupyter notebook --help\n", "```\n", "\n", "While the notebook server is running it will continue printing log messages to the terminal it was started from. Unless you detach the process from the terminal session you will need to keep the session open to keep the notebook server alive. If you want to close down a running server instance from the terminal you can use `Ctrl+C` - this will bring up a message asking you to confirm you wish to shut the server down. You can either enter `y` or skip the confirmation by hitting `Ctrl+C` again.\n", "\n", "When the notebook application first opens in your browser you are taken to the notebook *dashboard*. This will appear something like this\n", "\n", "\n", "\n", "The dashboard above is showing the `Files` tab, a list of files in the directory the notebook server was launched from. We can navigate into a sub-directory by clicking on a directory name and back up to the parent directory by clicking the `..` link. An important point to note is that the top-most level that you will be able to navigate to is the directory you run the server from. This is a security feature and generally you should try to limit the access the server has by launching it in the most specific (deepest) directory that still gives you access to all the files you need to work with.\n", "\n", "As well as allowing you to launch existing notebooks, the `Files` tab of the dashboard also allows new notebooks to be created using the `New` drop-down on the right. 
It can also perform basic file-management tasks such as renaming and deleting files (select a file by checking the box alongside it to bring up a context menu toolbar).\n", "\n", "In addition to opening notebook files, we can also edit text files such as `.py` source files, directly in the browser by opening them from the dashboard. The in-built text-editor is less-featured than a full IDE but is useful for quick edits of source files and previewing data files.\n", "\n", "The `Running` tab of the dashboard gives a list of the currently running notebook instances. This can be useful to keep track of which notebooks are still running and to shutdown (or reopen) old notebook processes when the corresponding tab has been closed.\n", "\n", "### The notebook interface\n", "\n", "The top of your notebook window should appear something like this:\n", "\n", "\n", "\n", "The name of the current notebook is displayed at the top of the page and can be edited by clicking on the text of the name. Displayed alongside this is an indication of the last manual *checkpoint* of the notebook file. On-going changes are auto-saved at regular intervals; the check-point mechanism is mainly meant as a way to recover an earlier version of a notebook after making unwanted changes. Note the default system only currently supports storing a single previous checkpoint despite the `Revert to checkpoint` dropdown under the `File` menu perhaps suggesting otherwise.\n", "\n", "As well as having options to save and revert to checkpoints, the `File` menu also allows new notebooks to be created in same directory as the current notebook, a copy of the current notebook to be made and the ability to export the current notebook to various formats.\n", "\n", "The `Edit` menu contains standard clipboard functions as well as options for reorganising notebook *cells*. 
Cells are the basic units of notebooks, and can contain formatted text like the one you are reading at the moment or runnable code as we will see below. The `Edit` and `Insert` drop down menus offer various options for moving cells around the notebook, merging and splitting cells and inserting new ones, while the `Cell` menu allows running of code cells and changing cell types.\n", "\n", "The `Kernel` menu offers some useful commands for managing the Python process (kernel) running in the notebook. In particular it provides options for interrupting a busy kernel (useful for example if you realise you have set a slow code cell running with incorrect parameters) and for restarting the current kernel. Restarting will cause all variables currently defined in the workspace to be lost but may be necessary to get the kernel back to a consistent state after polluting the namespace with lots of global variables or when trying to run code from an updated module and `reload` is failing to work. \n", "\n", "To the far right of the menu toolbar is a kernel status indicator. When a dark filled circle is shown this means the kernel is currently busy and any further code cell run commands will be queued to happen after the currently running cell has completed. An open status circle indicates the kernel is currently idle.\n", "\n", "The final row of the top notebook interface is the notebook toolbar which contains shortcut buttons to some common commands such as clipboard actions and cell / kernel management. If you are interested in learning more about the notebook user interface you may wish to run through the `User Interface Tour` under the `Help` menu drop down.\n", "\n", "### Markdown cells: easy text formatting\n", "\n", "This entire introduction has been written in what is termed a *Markdown* cell of a notebook. [Markdown](https://en.wikipedia.org/wiki/Markdown) is a lightweight markup language intended to be readable in plain-text. 
As you may wish to use Markdown cells to keep your own formatted notes in notebooks, a small sampling of the formatting syntax available is below (escaped mark-up on top and corresponding rendered output below that); there are many much more extensive syntax guides - for example [this cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).\n", "\n", "---\n", "\n", "```\n", "## Level 2 heading\n", "### Level 3 heading\n", "\n", "*Italicised* and **bold** text.\n", "\n", " * bulleted\n", " * lists\n", " \n", "and\n", "\n", " 1. enumerated\n", " 2. lists\n", "\n", "Inline maths $y = mx + c$ using [MathJax](https://www.mathjax.org/) as well as display style\n", "\n", "$$ ax^2 + bx + c = 0 \\qquad \\Rightarrow \\qquad x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a} $$\n", "```\n", "---\n", "\n", "## Level 2 heading\n", "### Level 3 heading\n", "\n", "*Italicised* and **bold** text.\n", "\n", " * bulleted\n", " * lists\n", " \n", "and\n", "\n", " 1. enumerated\n", " 2. lists\n", "\n", "Inline maths $y = mx + c$ using [MathJax](https://www.mathjax.org/) as well as display style\n", "\n", "$$ ax^2 + bx + c = 0 \\qquad \\Rightarrow \\qquad x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a} $$\n", "\n", "---\n", "\n", "We can also directly use HTML tags in Markdown cells to embed rich content such as images and videos.\n", "\n", "---\n", "```\n", "\n", "```\n", "---\n", "\n", "\n", "\n", "---\n", "\n", " \n", "### Code cells: in browser code execution\n", "\n", "Up to now we have not seen any runnable code. An example of an executable code cell is below. To run it, first click on the cell so that it is highlighted, then either click the run button on the notebook toolbar, go to `Cell > Run Cells` or use the keyboard shortcut `Ctrl+Enter`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from __future__ import print_function\n", "import sys\n", "\n", "print('Hello world!')\n", "print('Alarming hello!', file=sys.stderr)\n", "print('Hello again!')\n", "'And again!'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example shows the three main components of a code cell.\n", "\n", "The most obvious is the input area. This (unsurprisingly) is used to enter the code to be run, which will be automatically syntax highlighted.\n", "\n", "To the immediate left of the input area is the execution indicator / counter. Before a code cell is first run this will display `In [ ]:`. After the cell is run this is updated to `In [n]:` where `n` is a number corresponding to the current execution counter which is incremented whenever any code cell in the notebook is run. This can therefore be used to keep track of the relative order in which cells were last run. There is no fundamental requirement to run cells in the order they are organised in the notebook, though things will usually be more readable if you keep them roughly in order!\n", "\n", "Immediately below the input area is the output area. This shows any output produced by the code in the cell. This is dealt with a little bit confusingly in the current Jupyter version. At the top any output to [`stdout`](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29) is displayed. Immediately below that output to [`stderr`](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_.28stderr.29) is displayed. All of the output to `stdout` is displayed together even if there has been output to `stderr` in between, as shown by the surprising ordering in the output here. \n", "\n", "The final part of the output area is the *display* area. 
By default this will just display the returned output of the last Python statement as would usually be the case in a (I)Python interpreter run in a terminal. What is displayed for a particular object is by default determined by its special `__repr__` method e.g. for a string it is just the quote enclosed value of the string itself.\n", "\n", "### Useful keyboard shortcuts\n", "\n", "There are a wealth of keyboard shortcuts available in the notebook interface. For an exhaustive list see the `Keyboard Shortcuts` option under the `Help` menu. We will cover a few of those we find most useful below.\n", "\n", "Shortcuts come in two flavours: those applicable in *command mode*, active when no cell is currently being edited and indicated by a blue highlight around the current cell; those applicable in *edit mode* when the content of a cell is being edited, indicated by a green current cell highlight.\n", "\n", "In edit mode of a code cell, two of the more generically useful keyboard shortcuts are offered by the `Tab` key.\n", "\n", " * Pressing `Tab` a single time while editing code will bring up suggested completions of what you have typed so far. This is done in a scope aware manner so for example typing `a` + `[Tab]` in a code cell will come up with a list of objects beginning with `a` in the current global namespace, while typing `np.a` + `[Tab]` (assuming `import numpy as np` has been run already) will bring up a list of objects in the root NumPy namespace beginning with `a`.\n", " * Pressing `Shift+Tab` once immediately after opening parenthesis of a function or method will cause a tool-tip to appear with the function signature (including argument names and defaults) and its docstring. Pressing `Shift+Tab` twice in succession will cause an expanded version of the same tooltip to appear, useful for longer docstrings. 
Pressing `Shift+Tab` four times in succession will cause the information to be instead displayed in a pager docked to the bottom of the notebook interface which stays attached even when making further edits to the code cell and so can be useful for keeping documentation visible when editing e.g. to help remember the names of arguments to a function and their purposes.\n", "\n", "A series of useful shortcuts available in both command and edit mode are `[modifier]+Enter` where `[modifier]` is one of `Ctrl` (run selected cell), `Shift` (run selected cell and select next) or `Alt` (run selected cell and insert a new cell after).\n", "\n", "A useful command mode shortcut to know about is the ability to toggle line numbers on and off for a cell by pressing `L` which can be useful when trying to diagnose stack traces printed when an exception is raised or when referring someone else to a section of code.\n", " \n", "### Magics\n", "\n", "There is a range of *magic* commands in IPython notebooks that provide helpful tools outside of the usual Python syntax. A full list of the inbuilt magic commands is given [here](http://ipython.readthedocs.io/en/stable/interactive/magics.html), but three are particularly useful for this course:\n", "\n", " * [`%%timeit`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=matplotlib#magic-timeit) Put at the beginning of a cell to time its execution and print the resulting timing statistics.\n", " * [`%precision`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=matplotlib#magic-precision) Set the precision for pretty printing of floating point values and NumPy arrays.\n", " * [`%debug`](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=matplotlib#magic-debug) Activates the interactive debugger in a cell. 
Run it after an exception has occurred to help diagnose the issue.\n", " \n", "### Plotting with `matplotlib`\n", "\n", "When setting up your environment one of the dependencies we asked you to install was `matplotlib`. This is an extensive plotting and data visualisation library which is tightly integrated with NumPy and Jupyter notebooks.\n", "\n", "When using `matplotlib` in a notebook you should first run the [magic command](http://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=matplotlib)\n", "\n", "```\n", "%matplotlib inline\n", "```\n", "\n", "This will cause all plots to be automatically displayed as images in the output area of the cell they are created in. Below we give a toy example of plotting two sinusoids using `matplotlib` to showcase some of the basic plot options. To see the output produced select the cell and then run it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "2bced39d-ae3a-4603-ac94-fbb6a6283a96" } }, "outputs": [], "source": [ "# use the matplotlib magic to display plots inline in the notebook\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "# generate a pair of sinusoids\n", "x = np.linspace(0., 2. * np.pi, 100)\n", "y1 = np.sin(x)\n", "y2 = np.cos(x)\n", "\n", "# produce a new figure object with a defined (width, height) in inches\n", "fig = plt.figure(figsize=(8, 4))\n", "# add a single axis to the figure\n", "ax = fig.add_subplot(111)\n", "# plot the two sinusoidal traces on the axis, adjusting the line width\n", "# and adding LaTeX legend labels\n", "ax.plot(x, y1, linewidth=2, label=r'$\\sin(x)$')\n", "ax.plot(x, y2, linewidth=2, label=r'$\\cos(x)$')\n", "# set the axis labels\n", "ax.set_xlabel('$x$', fontsize=16)\n", "ax.set_ylabel('$y$', fontsize=16)\n", "# force the legend to be displayed\n", "ax.legend()\n", "# adjust the limits of the horizontal axis\n", "ax.set_xlim(0., 2. 
* np.pi)\n", "# make a grid be displayed in the axis background\n", "ax.grid(True)" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "533c10f0-95ba-4684-a72d-fd52cef0d007" } }, "source": [ "# Exercises\n", "\n", "Today's exercises are meant to allow you to get some initial familiarisation with the `mlp` package and how data is provided to the learning functions. Next week onwards, we will follow with the material covered in lectures. \n", "\n", "If you are new to Python and/or NumPy and are struggling to complete the exercises, you may find going through [this Stanford University tutorial](http://cs231n.github.io/python-numpy-tutorial/) by [Justin Johnson](http://cs.stanford.edu/people/jcjohns/) first helps. There is also a derived Jupyter notebook by [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) and [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) which you can download [from here](https://github.com/kuleshov/cs228-material/raw/master/tutorials/python/cs228-python-tutorial.ipynb) - if you save this in to your `mlpractical/notebooks` directory you should be able to open the notebook from the dashboard to run the examples.\n", "\n", "## Data providers\n", "\n", "Open (in the browser) the [`mlp.data_providers`](../../edit/mlp/data_providers.py) module. Have a look through the code and comments, then follow to the exercises.\n", "\n", "### Exercise 1 \n", "\n", "The `MNISTDataProvider` iterates over input images and target classes (digit IDs) from the [MNIST database of handwritten digit images](http://yann.lecun.com/exdb/mnist/), a common supervised learning benchmark task. 
Using the data provider and `matplotlib` we can for example iterate over the first couple of images in the dataset and display them using the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "978c1095-a9ce-4626-a113-e0be5fe51ecb" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import sys\n", "# sys.path.append('/path/to/mlpractical')\n", "import matplotlib.pyplot as plt\n", "import mlp.data_providers as data_providers\n", "# If error while importing mlp.data_providers: add path to your folder mlpractical using sys.path.append('/path/to/mlpractical/')\n", "def show_single_image(img, fig_size=(2, 2)):\n", " fig = plt.figure(figsize=fig_size)\n", " ax = fig.add_subplot(111)\n", " ax.imshow(img, cmap='Greys')\n", " ax.axis('off')\n", " plt.show()\n", " return fig, ax\n", "\n", "# An example for a single MNIST image\n", "mnist_dp = data_providers.MNISTDataProvider(\n", " which_set='valid', batch_size=1, max_num_batches=2, shuffle_order=True)\n", "\n", "for inputs, target in mnist_dp:\n", " # The reshape operation reorganizes data from 1D array of size 784 to 2D array of size 28x28\n", " # See notes in the next cell\n", " square_inputs = inputs.reshape((28, 28))\n", " show_single_image(square_inputs)\n", " print('Image target: {0}'.format(target))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally we will want to deal with batches of multiple images i.e. `batch_size > 1`. As a first task:\n", "\n", " * Using MNISTDataProvider, write code that iterates over the first 5 minibatches of size 100 data-points. \n", " * Display each batch of MNIST digits in a $10\\times10$ grid of images. \n", " \n", "**Notes**:\n", "\n", " * Images are returned from the provider as tuples of numpy arrays `(inputs, targets)`. 
The `inputs` matrix has shape `(batch_size, input_dim)` while the `targets` array is of shape `(batch_size,)`, where `batch_size` is the number of data points in a single batch and `input_dim` is the dimensionality of the input features. \n", " * Each input data-point (image) is stored as a 784 dimensional vector of pixel intensities normalised to $[0, 1]$ from initial integer values in $[0, 255]$. However, the original spatial domain is two dimensional, so before plotting you will need to reshape the one dimensional input arrays into two dimensional arrays (MNIST images have the same height and width dimensions)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# write your code here for iterating over five batches of \n", "# 100 data points each and displaying as 10x10 grids\n", "\n", "def show_batch_of_images(img_batch):\n", " raise NotImplementedError('Write me!')" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "d2d525de-5d5b-41d5-b2fb-a83874dba986" } }, "source": [ "### Exercise 2\n", "\n", "`MNISTDataProvider` currently returns `targets` as a vector of integers, where each element is the integer ID of the class the corresponding data-point belongs to. \n", "\n", "It is easier to train neural networks using a 1-of-K representation of multi-class targets. Instead of representing class identity by an integer, each target is replaced by a vector of length equal to the number of classes, whose values are zero everywhere except at the index corresponding to the class, where the value is one.\n", "\n", "For instance, given a batch of 5 integer targets `[2, 2, 0, 1, 0]` and assuming there are 3 different classes \n", "the corresponding 1-of-K encoded targets would be\n", "```\n", "[[0, 0, 1],\n", " [0, 0, 1],\n", " [1, 0, 0],\n", " [0, 1, 0],\n", " [1, 0, 0]]\n", "```\n", "\n", " * Implement the `to_one_of_k` method of the `MNISTDataProvider` class. 
\n", " * Uncomment the overloaded `next` method, so the raw targets are converted to 1-of-K coding. \n", " * Test your code by running the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mnist_dp = data_providers.MNISTDataProvider(\n", " which_set='valid', batch_size=5, max_num_batches=5, shuffle_order=False)\n", "\n", "for inputs, targets in mnist_dp:\n", " # Check that values are either 0 or 1\n", " assert np.all(np.logical_or(targets == 0., targets == 1.))\n", " # Check that there is exactly a single 1\n", " assert np.all(targets.sum(-1) == 1.)\n", " print(targets)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "nbpresent": { "id": "471093b7-4b94-4295-823a-5285c79d3119" } }, "source": [ "### Exercise 3\n", "\n", "Here you will write your own data provider `MetOfficeDataProvider` that wraps [weather data for south Scotland](http://www.metoffice.gov.uk/hadobs/hadukp/data/daily/HadSSP_daily_qc.txt). A previous version of this data has been stored in the `data` directory for your convenience and skeleton code for the class is provided in `mlp/data_providers.py`.\n", "\n", "The data is organised in the text file as a table, with the first two columns indexing the year and month of the readings and the following 31 columns giving daily precipitation values for the corresponding month. As not all months have 31 days some of the entries correspond to non-existent days. These values are indicated by a non-physical value of `-99.9`.\n", "\n", " * You should read all of the data from the file ([`np.loadtxt`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) may be useful for this) and then filter out the `-99.9` values and collapse the table to a one-dimensional array corresponding to a sequence of daily measurements for the whole period data is available for. 
[NumPy's boolean indexing feature](http://docs.scipy.org/doc/numpy/user/basics.indexing.html#boolean-or-mask-index-arrays) could be helpful here.\n", " * A common initial preprocessing step in machine learning tasks is to normalise data so that it has zero mean and a standard deviation of one. Normalise the data sequence so that its overall mean is zero and standard deviation one.\n", " * Each data point in the data provider should correspond to a window of length specified in the `__init__` method as `window_size` of this contiguous data sequence, with the model inputs being the first `window_size - 1` elements of the window and the target output being the last element of the window. For example if the original data sequence was `[1, 2, 3, 4, 5, 6]` and `window_size=3` then `input, target` pairs iterated over by the data provider should be\n", " ```\n", " [1, 2], 3\n", " [4, 5], 6\n", " ```\n", " * **Extension**: The current data provider only produces `len(data)/window_size` sample points. A better approach is to have it return overlapping windows of the sequence so that more training data instances are produced. For example for the same sequence `[1, 2, 3, 4, 5, 6]` the corresponding `input, target` pairs with `window_size=3` would be\n", "\n", "```\n", "[1, 2], 3\n", "[2, 3], 4\n", "[3, 4], 5\n", "[4, 5], 6\n", "```\n", " * Test your code by running the cell below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "c8553a56-9f25-4198-8a1a-d7e9572b4382" }, "scrolled": false }, "outputs": [], "source": [ "batch_size = 3\n", "for window_size in [2, 5, 10]:\n", " met_dp = data_providers.MetOfficeDataProvider(\n", " window_size=window_size, batch_size=batch_size,\n", " max_num_batches=1, shuffle_order=False)\n", " fig = plt.figure(figsize=(6, 3))\n", " ax = fig.add_subplot(111)\n", " ax.set_title('Window size {0}'.format(window_size))\n", " ax.set_xlabel('Day in window')\n", " ax.set_ylabel('Normalised reading')\n", " # iterate over data provider batches checking size and plotting\n", " for inputs, targets in met_dp:\n", " assert inputs.shape == (batch_size, window_size - 1)\n", " assert targets.shape == (batch_size, )\n", " ax.plot(np.c_[inputs, targets].T, '.-')\n", " ax.plot([window_size - 1] * batch_size, targets, 'ko')" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 1 }