From 7867a372e41e4ccd0d2f740c08a5a7fd10ded838 Mon Sep 17 00:00:00 2001 From: Matt Graham Date: Sun, 22 Jan 2017 12:47:14 +0000 Subject: [PATCH] Adding MSD intro notebook. --- ...cation_with_the_Million_Song_Dataset.ipynb | 319 ++++++++++++++++++ 1 file changed, 319 insertions(+) create mode 100644 notebooks/09b_Music_genre_classification_with_the_Million_Song_Dataset.ipynb diff --git a/notebooks/09b_Music_genre_classification_with_the_Million_Song_Dataset.ipynb b/notebooks/09b_Music_genre_classification_with_the_Million_Song_Dataset.ipynb new file mode 100644 index 0000000..11168eb --- /dev/null +++ b/notebooks/09b_Music_genre_classification_with_the_Million_Song_Dataset.ipynb @@ -0,0 +1,319 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import tensorflow as tf\n", + "import numpy as np\n", + "from mlp.data_providers import MSD10GenreDataProvider, MSD25GenreDataProvider\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Genre classification with the Million Song Dataset\n", + "\n", + "The [Million Song Dataset](http://labrosa.ee.columbia.edu/millionsong/) is a \n", + "\n", + "> freely-available collection of audio features and metadata for a million contemporary popular music tracks\n", + "\n", + "originally collected and compiled by Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere.\n", + "\n", + "The dataset is intended to encourage development of algorithms in the field of [music information retrieval](https://en.wikipedia.org/wiki/Music_information_retrieval). The [data for each track](http://labrosa.ee.columbia.edu/millionsong/pages/example-track-description) includes both textual features such as artist and album names, numerical descriptors such as duration and various audio features derived using a music analysis platform provided by [The Echo Nest](https://en.wikipedia.org/wiki/The_Echo_Nest) (since acquired by Spotify). Of the various audio features and segmentations included in the full dataset, the most detailed information is included at a 'segment' level: each segment corresponds to an automatically identified 'quasi-stable music event' - roughly contiguous sections of the audio with similar perceptual quality. The number of segments per track is variable and each segment can itself be of variable length - typically they seem to be around 0.2 - 0.4 seconds but can be as long as 10 seconds or more. \n", + "\n", + "For each segment of the track various extracted audio features are available - a 12 dimensional vector of [chroma features](https://en.wikipedia.org/wiki/Chroma_feature), a 12 dimensional vector of ['MFCC-like'](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) timbre features and various measures of the loudness of the segment, including loudness at the segment start and maximum loudness. In the version of the data we provide, we include a 25 dimensional vector for each included segment, consisting of the 12 timbre features, 12 chroma features and loudness at start of segment concatenated in that order. To allow easier integration in to standard feedforward models, the basic version of the data we provide includes features only for a fixed length crop of the central 120 segments of each track (with tracks with less than 120 segments therefore not being included). This gives an overall input dimension per track of 120×25=3000. Each of the 3000 input dimensions has been been preprocessed by subtracting the per-dimension mean across the training data and dividing by the per-dimension standard deviation across the training data.\n", + "\n", + "We provide data providers for the fixed length crops versions of the input features, with the inputs being returned in batches of 3000 dimensional vectors (these can be reshaped to (120, 25) to get the per-segment features). To allow for more complex variable-length sequence modelling with for example recurrent neural networks, we also provide a variable length version of the data. This is only provided as compressed NumPy (`.npz`) data files rather than data provider objects - you will need to write your own data provider if you wish to use this version of the data. As the inputs are of variable number of segments they have been ['bucketed'](https://www.tensorflow.org/tutorials/seq2seq/#bucketing_and_padding) into groups of similar maximum length, with the following binning scheme used:\n", + "\n", + " 1 - 250 segments\n", + " 251 - 500 segments\n", + " 501 - 650 segments\n", + " 651 - 800 segments\n", + " 801 - 950 segments\n", + " 951 - 1200 segments\n", + " 1201 - 2000 segments\n", + " \n", + "For each bucket the NumPy data files include inputs and targets arrays with second dimension equal to the maximum sgement size in the bucket (e.g. 250 for the bucket) and first dimension equal to the number of tracks with number of segments in that bucket. These are named `inputs_{n}` and `targets_{n}` in the data file where `{n}` is the maximal number of segments in the bucket e.g. `inputs_250` and `targets_250` for the first bucket. For tracks with less segments than the maximum size in the bucket, the features for the track have been padded with `NaN` values.\n", + "\n", + "The Million Song Dataset in its original form does not provide any genre labels, however various external groups have proposed genre labels for portions of the data by cross-referencing the track IDs against external music tagging databases. Analagously to the provision of both simpler and more complex classifications tasks for the CIFAR-10 / CIFAR-100 datasets, we provide two classification task datasets derived from the Million Song Dataset - one with 10 coarser level genre classes, and another with 25 finer-grained genre / style classifications.\n", + "\n", + "The 10-genre classification task uses the [*CD2C tagtraum genre annotations*](http://www.tagtraum.com/msd_genre_datasets.html) derived from multiple source databases (beaTunes genre dataset, Last.fm dataset, Top-MAGD dataset), with the *CD2C* variant using only non-ambiguous annotations (i.e. not including tracks with multiple genre labels). Of the 15 genre labels provided in the CD2C annotations, 5 (World, Latin, Punk, Folk and New Age) were not included due to having fewer than 5000 examples available. This left 10 remaining genre classes:\n", + "\n", + " Rap\n", + " Rock\n", + " RnB\n", + " Electronic\n", + " Metal\n", + " Blues\n", + " Pop\n", + " Jazz\n", + " Country\n", + " Reggae\n", + "\n", + "For each of these 10 classes, 5000 labelled examples have been collected for training (i.e. 50000 example in total) and a further 1000 example per class which you are provided inputs but not targets for for testing, with the exception of the Blues class for which only 991 testing examples are provided due to there being insufficient labelled tracks of the minimum required length (i.e. a total of 9991 test examples).\n", + "\n", + "The 25-genre classification tasks uses the [*MSD Allmusic Style Dataset*](http://www.ifs.tuwien.ac.at/mir/msd/MASD.html) labels derived from the [AllMusic.com](http://www.allmusic.com/) database by [Alexander Schindler, Rudolf Mayer and Andreas Rauber of Vienna University of Technology](http://www.ifs.tuwien.ac.at/~schindler/pubs/ISMIR2012.pdf). The 25 genre / style labels used are:\n", + "\n", + " Big Band\n", + " Blues Contemporary\n", + " Country Traditional\n", + " Dance\n", + " Electronica\n", + " Experimental\n", + " Folk International\n", + " Gospel\n", + " Grunge Emo\n", + " Hip Hop Rap\n", + " Jazz Classic\n", + " Metal Alternative\n", + " Metal Death\n", + " Metal Heavy\n", + " Pop Contemporary\n", + " Pop Indie\n", + " Pop Latin\n", + " Punk\n", + " Reggae\n", + " RnB Soul\n", + " Rock Alternative\n", + " Rock College\n", + " Rock Contemporary\n", + " Rock Hard\n", + " Rock Neo Psychedelia\n", + " \n", + "For each of these 25 classes, 2000 labelled examples have been collected for training (i.e. 50000 example in total) and a further 400 example per class which you are provided inputs but not targets for for testing (i.e. 10000 examples in total). The tracks used for the 25-genre classification task only partially overlap with those used for the 10-genre classification task and we do not provide any mapping between the two.\n", + "\n", + "The 50000 labelled examples provided for each of the two tasks have been split in to a 40000 example training dataset and a 10000 example validation dataset, each with target labels provided. If you wish to use a more complex cross-fold validation scheme you may want to combine these two portions of the dataset and define your own functions / classes for separating out a validation set.\n", + "\n", + "Data provider classes for both fixed length input data for the 10 and 25 genre classification tasks in the `mlp.data_providers` module as `MSD10GenreDataProvider` and `MSD25GenreDataProvider`. Both have similar behaviour to the `MNISTDataProvider` used extensively last semester. A `which_set` argument can be used to specify whether to return a data provided for the training dataset (`which_set='train'`) or validation dataset (`which_set='valid'`). Both data provider classes provide a `label_map` attribute which is a list of strings which are the class labels corresponding to the integer targets (i.e. prior to conversion to a 1-of-K encoded binary vector).\n", + "\n", + "Below example code is given for creating instances of the 10-genre and 25-genre fixed-length input data provider objects and using them to train simple two-layer feedforward network models with rectified linear activations in TensorFlow." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example two-layer classifier model on MSD 10-genre task" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "train_data = MSD10GenreDataProvider('train', batch_size=50)\n", + "valid_data = MSD10GenreDataProvider('valid', batch_size=50)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def fully_connected_layer(inputs, input_dim, output_dim, nonlinearity=tf.nn.relu):\n", + " weights = tf.Variable(\n", + " tf.truncated_normal(\n", + " [input_dim, output_dim], stddev=2. / (input_dim + output_dim)**0.5), \n", + " 'weights')\n", + " biases = tf.Variable(tf.zeros([output_dim]), 'biases')\n", + " outputs = tf.matmul(inputs, weights) + biases\n", + " return outputs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "inputs = tf.placeholder(tf.float32, [None, train_data.inputs.shape[1]], 'inputs')\n", + "targets = tf.placeholder(tf.float32, [None, train_data.num_classes], 'targets')\n", + "num_hidden = 200\n", + "\n", + "with tf.name_scope('fc-layer-1'):\n", + " hidden_1 = fully_connected_layer(inputs, train_data.inputs.shape[1], num_hidden)\n", + "with tf.name_scope('output-layer'):\n", + " outputs = fully_connected_layer(hidden_1, num_hidden, train_data.num_classes, tf.identity)\n", + "\n", + "with tf.name_scope('error'):\n", + " error = tf.reduce_mean(\n", + " tf.nn.softmax_cross_entropy_with_logits(outputs, targets))\n", + "with tf.name_scope('accuracy'):\n", + " accuracy = tf.reduce_mean(tf.cast(\n", + " tf.equal(tf.argmax(outputs, 1), tf.argmax(targets, 1)), \n", + " tf.float32))\n", + "\n", + "with tf.name_scope('train'):\n", + " train_step = tf.train.AdamOptimizer().minimize(error)\n", + " \n", + "init = tf.global_variables_initializer()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "with tf.Session() as sess:\n", + " sess.run(init)\n", + " for e in range(25):\n", + " running_error = 0.\n", + " running_accuracy = 0.\n", + " for input_batch, target_batch in train_data:\n", + " _, batch_error, batch_acc = sess.run(\n", + " [train_step, error, accuracy], \n", + " feed_dict={inputs: input_batch, targets: target_batch})\n", + " running_error += batch_error\n", + " running_accuracy += batch_acc\n", + " running_error /= train_data.num_batches\n", + " running_accuracy /= train_data.num_batches\n", + " print('End of epoch {0:02d}: err(train)={1:.2f} acc(train)={2:.2f}'\n", + " .format(e + 1, running_error, running_accuracy))\n", + " if (e + 1) % 5 == 0:\n", + " valid_error = 0.\n", + " valid_accuracy = 0.\n", + " for input_batch, target_batch in valid_data:\n", + " batch_error, batch_acc = sess.run(\n", + " [error, accuracy], \n", + " feed_dict={inputs: input_batch, targets: target_batch})\n", + " valid_error += batch_error\n", + " valid_accuracy += batch_acc\n", + " valid_error /= valid_data.num_batches\n", + " valid_accuracy /= valid_data.num_batches\n", + " print(' err(valid)={0:.2f} acc(valid)={1:.2f}'\n", + " .format(valid_error, valid_accuracy))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example two-layer classifier model on MSD 25-genre task" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "train_data = MSD25GenreDataProvider('train', batch_size=50)\n", + "valid_data = MSD25GenreDataProvider('valid', batch_size=50)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "tf.reset_default_graph()\n", + "\n", + "inputs = tf.placeholder(tf.float32, [None, train_data.inputs.shape[1]], 'inputs')\n", + "targets = tf.placeholder(tf.float32, [None, train_data.num_classes], 'targets')\n", + "num_hidden = 200\n", + "\n", + "with tf.name_scope('fc-layer-1'):\n", + " hidden_1 = fully_connected_layer(inputs, train_data.inputs.shape[1], num_hidden)\n", + "with tf.name_scope('output-layer'):\n", + " outputs = fully_connected_layer(hidden_1, num_hidden, train_data.num_classes, tf.identity)\n", + "\n", + "with tf.name_scope('error'):\n", + " error = tf.reduce_mean(\n", + " tf.nn.softmax_cross_entropy_with_logits(outputs, targets))\n", + "with tf.name_scope('accuracy'):\n", + " accuracy = tf.reduce_mean(tf.cast(\n", + " tf.equal(tf.argmax(outputs, 1), tf.argmax(targets, 1)), \n", + " tf.float32))\n", + "\n", + "with tf.name_scope('train'):\n", + " train_step = tf.train.AdamOptimizer().minimize(error)\n", + " \n", + "init = tf.global_variables_initializer()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "with tf.Session() as sess:\n", + " sess.run(init)\n", + " for e in range(25):\n", + " running_error = 0.\n", + " running_accuracy = 0.\n", + " for input_batch, target_batch in train_data:\n", + " _, batch_error, batch_acc = sess.run(\n", + " [train_step, error, accuracy], \n", + " feed_dict={inputs: input_batch, targets: target_batch})\n", + " running_error += batch_error\n", + " running_accuracy += batch_acc\n", + " running_error /= train_data.num_batches\n", + " running_accuracy /= train_data.num_batches\n", + " print('End of epoch {0:02d}: err(train)={1:.2f} acc(train)={2:.2f}'\n", + " .format(e + 1, running_error, running_accuracy))\n", + " if (e + 1) % 5 == 0:\n", + " valid_error = 0.\n", + " valid_accuracy = 0.\n", + " for input_batch, target_batch in valid_data:\n", + " batch_error, batch_acc = sess.run(\n", + " [error, accuracy], \n", + " feed_dict={inputs: input_batch, targets: target_batch})\n", + " valid_error += batch_error\n", + " valid_accuracy += batch_acc\n", + " valid_error /= valid_data.num_batches\n", + " valid_accuracy /= valid_data.num_batches\n", + " print(' err(valid)={0:.2f} acc(valid)={1:.2f}'\n", + " .format(valid_error, valid_accuracy))" + ] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python [conda env:mlp]", + "language": "python", + "name": "conda-env-mlp-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.12" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}