Copyright (C) ‹ 2019 › ‹ Anna Scaife - anna.scaife@manchester.ac.uk ›

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

[AMS - 190215] Notebook created for Chris Engelbrecht Summer School, Sani Pass, Jan 2019
[AMS - 190910] Notebook updated for CERN School of Computing, Cluj-Napoca, Sept 2019

This notebook was created for the CHPC/NITheP 2019 Chris Engelbrecht Summer School on the Foundations of Theoretical and Computational Science. It was inspired by Rob Lyon's pulsar classification tutorials in the IAU OAD Data Science Toolkit.

Keep track of your progress:

  • [ ] Exercise 1 (basic)
  • [ ] Exercise 2 (basic)
  • [ ] Exercise 3 (basic)
  • [ ] Exercise 4 (basic)
  • [ ] Exercise 5 (basic)
  • [ ] Exercise 6 (intermediate)
  • [ ] Exercise 7 (optional; super-advanced)

First we import some libraries:

In [1]:
import numpy as np   # for array stuff
import pylab as pl   # for plotting stuff

We'll use the scikit-learn library for the machine learning tasks, so let's import a whole bunch of stuff from there:

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score

This library is useful for making evaluation plots:

In [3]:
import scikitplot as skplt

This library is for data handling:

In [4]:
import pandas as pd  

Some code to control the size of figures in the notebook:

In [5]:
pl.rcParams['figure.figsize'] = [10, 5]
pl.rcParams['figure.dpi'] = 300

We're going to import the HTRU2 dataset:

In [6]:
df = pd.read_csv('data/pulsar.csv')

The first row of the CSV file tells us what the features are:

In [7]:
feature_names = df.columns.values[0:-1]
['mean_int_pf' 'std_pf' 'ex_kurt_pf' 'skew_pf' 'mean_dm' 'std_dm'
 'kurt_dm' 'skew_dm']

and we can check just how much data we're dealing with:

In [8]:
# Show some information
print ('Dataset has %d rows and %d columns including features and labels'%(df.shape[0],df.shape[1]))
Dataset has 17898 rows and 9 columns including features and labels

We're going to start by separating the numerical feature data from class labels for all the candidates. To get the feature data on its own we can just strip off the column containing the class labels:

In [9]:
features = df.drop('class', axis=1)

The labels for each object tell us abut the target class and we can create an array of those data by extracting the column from the original dataset:

In [10]:
targets = df['class']

Let's take a look at how the two classes are distributed in parameter space. We'll plot the value of one feature against another and colour code the data samples according to their class. This is a pretty basic plot, if you want to try creating something more fancy the seaborn library is a good place to start.

In [11]:
pl.scatter(df['std_pf'], df['mean_dm'],c=df['class'])
pl.xlabel('Integrated Profile Stdev')
pl.ylabel('Dispersion Measure Mean')

Exercise 1: try plotting different combinations of features and see if there are any clear divisions.

Now we need to split our labelled data into two separate datasets: one to train the classifier and one to test the fitted machine learning model.

To do this we can use the function train_test_split from the scikit_learn library:

In [12]:
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.33, random_state=66)

Exercise 2: Once you've run through the tutorial, come back to this point and see what difference changing the relative size of your train:test datasets makes.

At this point we now have our dataset in the suitable state to start training our classifier.

To start with we need to initiate the random forest classifier from scikit_learn:

In [13]:
RFC = RandomForestClassifier(n_jobs=2,n_estimators=10)

...and we can immediately fit the machine learning model to our training data:

In [14]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=None, verbose=0,

Exercise 3: You could try changing the split criterion from the "gini" (Gini impurity) to "entropy" (information gain). Does it make a difference?

We can then used the trained classifier to predict the label for the test data that we split out earlier:

In [15]:
rfc_predict = RFC.predict(X_test)

So how did we do? We need to evaluate the performance of our classifier.

A good first step is to evaluate the cross-validation. This will tell us how well our machine learning model generalises, i.e. whether we have over-fitted the training data.

In [16]:
rfc_cv_score = cross_val_score(RFC, features, targets, cv=10, scoring='roc_auc')

Let's print out the various evaluation criteria:

In [17]:
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict, target_names=['Non Pulsar','Pulsar']))
print("=== All AUC Scores ===")
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())
=== Confusion Matrix ===
[[5328   34]
 [  92  453]]

=== Classification Report ===
              precision    recall  f1-score   support

  Non Pulsar       0.98      0.99      0.99      5362
      Pulsar       0.93      0.83      0.88       545

   micro avg       0.98      0.98      0.98      5907
   macro avg       0.96      0.91      0.93      5907
weighted avg       0.98      0.98      0.98      5907

=== All AUC Scores ===
[0.92090421 0.95277015 0.96520903 0.96037148 0.96569091 0.94410757
 0.96378026 0.95787208 0.96092836 0.96338084]

=== Mean AUC Score ===
Mean AUC Score - Random Forest:  0.955501488135929

We can make a more visual representation of the confusion matrix using the scikit-plot library. To do this we need to know the predictions from our cross validation, rather than the Area Under Curve (AUC) value:

In [18]:
predictions = cross_val_predict(RFC, features, targets, cv=2)
In [19]:
skplt.metrics.plot_confusion_matrix(targets, predictions, normalize=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1231ebdd8>

To plot the ROC curve we need to find the probabilities for each target class separately. We can do this with the predict_proba function:

In [20]:
probas = RFC.predict_proba(X_test)
In [21]:
skplt.metrics.plot_roc(y_test, probas)
<matplotlib.axes._subplots.AxesSubplot at 0x1231411d0>