Copyright (C) 2019 Anna Scaife - email@example.com
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
[AMS - 190215] Notebook created for Chris Engelbrecht Summer School, Sani Pass, Jan 2019
[AMS - 190910] Notebook updated for CERN School of Computing, Cluj-Napoca, Sept 2019
This notebook was created for the CHPC/NITheP 2019 Chris Engelbrecht Summer School on the Foundations of Theoretical and Computational Science. It was inspired by Rob Lyon's pulsar classification tutorials in the IAU OAD Data Science Toolkit.
First we import some libraries:
import numpy as np   # for array stuff
import pylab as pl   # for plotting stuff
We'll use the scikit-learn library for the machine learning tasks, so let's import a whole bunch of stuff from there:
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
This library is useful for making evaluation plots:
import scikitplot as skplt
This library is for data handling:
import pandas as pd
Some code to control the size of figures in the notebook:
pl.rcParams['figure.figsize'] = [10, 5]
pl.rcParams['figure.dpi'] = 300
df = pd.read_csv('data/pulsar.csv')
The first row of the CSV file tells us what the features are:
feature_names = df.columns.values[0:-1]
print(feature_names)
['mean_int_pf' 'std_pf' 'ex_kurt_pf' 'skew_pf' 'mean_dm' 'std_dm' 'kurt_dm' 'skew_dm']
and we can check just how much data we're dealing with:
# Show some information
print('Dataset has %d rows and %d columns including features and labels' % (df.shape[0], df.shape[1]))
Dataset has 17898 rows and 9 columns including features and labels
We're going to start by separating the numerical feature data from class labels for all the candidates. To get the feature data on its own we can just strip off the column containing the class labels:
features = df.drop('class', axis=1)
The labels for each object tell us about the target class, and we can create an array of those data by extracting the column from the original dataset:
targets = df['class']
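It's worth checking how the two classes are balanced before going any further, since pulsar datasets typically contain far more non-pulsars than pulsars. A minimal sketch using pandas `value_counts()`, with a short stand-in label series in place of the real `class` column:

```python
import pandas as pd

# Hypothetical labels standing in for the 'class' column of pulsar.csv
targets = pd.Series([0, 0, 0, 1, 0, 1])

# value_counts() counts how many candidates fall in each class;
# normalize=True returns the class fractions instead of raw counts
counts = targets.value_counts()
fractions = targets.value_counts(normalize=True)
print(counts)
print(fractions)
```

On the real dataset this reveals a strong imbalance, which is worth keeping in mind when interpreting accuracy-style metrics later on.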
Let's take a look at how the two classes are distributed in parameter space. We'll plot the value of one feature against another and colour code the data samples according to their class. This is a pretty basic plot; if you want to try creating something more fancy, the seaborn library is a good place to start.
pl.subplot(111)
pl.scatter(df['std_pf'], df['mean_dm'], c=df['class'])
pl.xlabel('Integrated Profile Stdev')
pl.ylabel('Dispersion Measure Mean')
pl.show()
Exercise 1: try plotting different combinations of features and see if there are any clear divisions.
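One way to approach this exercise is to loop over every pair of features with `itertools.combinations` and make one scatter plot per pair. A sketch, using a tiny hypothetical stand-in for the pulsar DataFrame:

```python
import itertools
import pandas as pd
import pylab as pl

# Hypothetical miniature version of the pulsar DataFrame for illustration
df = pd.DataFrame({'std_pf':  [40, 50, 90, 30],
                   'mean_dm': [2, 3, 60, 1],
                   'skew_pf': [1, 2, 8, 0],
                   'class':   [0, 0, 1, 0]})
feature_names = ['std_pf', 'mean_dm', 'skew_pf']

# Loop over every pair of features and plot one against the other,
# colour-coded by class, to look for clean separations
for f1, f2 in itertools.combinations(feature_names, 2):
    pl.figure()
    pl.scatter(df[f1], df[f2], c=df['class'])
    pl.xlabel(f1)
    pl.ylabel(f2)
pl.show()
```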
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.33, random_state=66)
Exercise 2: Once you've run through the tutorial, come back to this point and see what difference changing the relative size of your train:test datasets makes.
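For exercise 2, a simple approach is to loop over a few values of `test_size` and see how the split sizes (and, later, the evaluation scores) change. A sketch with toy data in place of the pulsar features; the `stratify=y` argument is an addition here that keeps the class balance the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the pulsar data
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Loop over candidate test fractions and report the resulting set sizes
for ts in [0.2, 0.33, 0.5]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=ts,
                                              random_state=66, stratify=y)
    print(ts, len(X_tr), len(X_te))
```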
At this point our dataset is in a suitable state to start training the classifier.
To start with we need to initialise the random forest classifier from scikit-learn:
RFC = RandomForestClassifier(n_jobs=2,n_estimators=10)
...and we can immediately fit the machine learning model to our training data:
RFC.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=None, verbose=0, warm_start=False)
Exercise 3: You could try changing the split criterion from the "gini" (Gini impurity) to "entropy" (information gain). Does it make a difference?
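One way to answer exercise 3 is to train one forest per criterion on the same split and compare held-out accuracy. A sketch using synthetic two-class data (via `make_classification`) as a stand-in for the pulsar features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the pulsar features
X, y = make_classification(n_samples=500, n_features=8, random_state=66)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=66)

# Train one forest per split criterion and compare held-out accuracy;
# fixing random_state makes the comparison repeatable
for criterion in ['gini', 'entropy']:
    rfc = RandomForestClassifier(n_estimators=10, criterion=criterion,
                                 random_state=66)
    rfc.fit(X_tr, y_tr)
    print(criterion, rfc.score(X_te, y_te))
```

In practice the difference between the two criteria is often small compared to the scatter from the random seed, so it's worth repeating the comparison over a few seeds.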
We can then use the trained classifier to predict the label for the test data that we split out earlier:
rfc_predict = RFC.predict(X_test)
So how did we do? We need to evaluate the performance of our classifier.
A good first step is to run a cross-validation. This tells us how well our machine learning model generalises, i.e. whether we have over-fitted the training data.
rfc_cv_score = cross_val_score(RFC, features, targets, cv=10, scoring='roc_auc')
Let's print out the various evaluation criteria:
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict, target_names=['Non Pulsar','Pulsar']))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())
=== Confusion Matrix ===
[[5328   34]
 [  92  453]]

=== Classification Report ===
              precision    recall  f1-score   support

  Non Pulsar       0.98      0.99      0.99      5362
      Pulsar       0.93      0.83      0.88       545

   micro avg       0.98      0.98      0.98      5907
   macro avg       0.96      0.91      0.93      5907
weighted avg       0.98      0.98      0.98      5907

=== All AUC Scores ===
[0.92090421 0.95277015 0.96520903 0.96037148 0.96569091 0.94410757
 0.96378026 0.95787208 0.96092836 0.96338084]

=== Mean AUC Score ===
Mean AUC Score - Random Forest:  0.955501488135929
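It's worth seeing how the per-class numbers in the report follow from the confusion matrix. For the Pulsar class, with the matrix laid out as `[[TN, FP], [FN, TP]]`, precision, recall and F1 can be recomputed by hand:

```python
# Entries of the confusion matrix printed above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 5328, 34, 92, 453

precision = tp / (tp + fp)   # fraction of predicted pulsars that are real
recall = tp / (tp + fn)      # fraction of real pulsars that are recovered
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(round(precision, 2), round(recall, 2), round(f1, 2))  # → 0.93 0.83 0.88
```

These match the Pulsar row of the classification report, and make clear that recall (0.83) is where the classifier loses most of its performance: 92 real pulsars are missed.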
predictions = cross_val_predict(RFC, features, targets, cv=2)
skplt.metrics.plot_confusion_matrix(targets, predictions, normalize=True)
probas = RFC.predict_proba(X_test)
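With the class probabilities in hand, the ROC curve can also be built directly from `roc_curve` and `roc_auc_score` (both imported earlier). A minimal sketch with toy labels and scores standing in for `y_test` and `probas[:, 1]`:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy true labels and predicted pulsar probabilities standing in for
# y_test and probas[:, 1] from the classifier above
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# roc_curve returns the false-positive and true-positive rates at each
# decision threshold; roc_auc_score summarises the curve as a single number
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(auc)
```

On the real test set, passing `y_test` and `probas[:, 1]` reproduces the kind of AUC values reported by the cross-validation above.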