Pulsar classification is a great example of where machine learning can be used beneficially in astrophysics. It’s not the most straightforward classification problem, but here I’m going to outline the basics using the scikit-learn random forest classifier. This post was inspired by Rob Lyon’s pulsar classification tutorials in the IAU OAD Data Science Toolkit.
- The Pulsar Classification Problem
- Pulsar Data in Python
- Decision Trees & Random Forests
- Train / Validation / Test Split
- Further Reading
I see dead ...stars?
Pulsars are “pulsating radio sources”, now known to be caused by rapidly rotating neutron stars. Neutron stars are the relics of dead massive stars: they’re small and extremely dense - think of something with the same mass as the Sun crammed into a radius roughly the same as that of the M25 motorway around London. You can read all about them here.
A characteristic property of pulsars is the periodic burst of emission produced by their radio-emitting jets. As the pulsar rotates, the direction of this emission rotates with it, and astronomers see a pulse of radio emission each time one of the jets points towards the Earth.
You can even listen to the pulses (if you really want to…)
Pulsars are pretty interesting objects in their own right: they are used as probes of stellar evolution and, thanks to their extremely high densities, as tests of general relativity. These days they’re also used to detect and map gravitational wave signatures. However, identifying them in the data streams from radio telescopes is not trivial. There are lots of man-made sources of radio frequency interference (RFI) that can mimic the signals from pulsars. Classifying candidate data samples as pulsar or not pulsar is serious business.
The Pulsar Classification Problem
In order to classify a data sample as a pulsar or not a pulsar, we need to be able to extract some information on the data sample that can characterise its class. The individual bursts of emission from a pulsar (i.e. the pulses) do not have a constant shape or amplitude, so individually they’re not very useful for uniquely identifying a pulsar.
Because the individual pulses are all different, astronomers stack them up and create an average integrated pulse profile to characterise a particular pulsar:
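To see why stacking works, here is a toy numpy sketch (the pulse shape, amplitudes, and noise level are all made up for illustration): averaging many noisy single pulses suppresses the noise by roughly the square root of the number of pulses and recovers a stable integrated profile.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy example: 500 noisy single pulses, each sampled in 128 phase bins.
phase = np.linspace(0.0, 1.0, 128)
template = np.exp(-0.5 * ((phase - 0.5) / 0.02) ** 2)  # underlying pulse shape

pulses = template * rng.uniform(0.2, 1.5, size=(500, 1))  # varying amplitudes
pulses += rng.normal(0.0, 0.5, size=pulses.shape)         # receiver noise

# Stacking (averaging) the pulses beats the noise down and recovers
# a stable integrated profile with its peak at the pulse phase.
profile = pulses.mean(axis=0)
print('peak bin:', np.argmax(profile))
```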
Additionally, the pulse will arrive at slightly different times across different radio frequencies. This frequency-dependent delay is caused by the ionised interstellar medium and is known as dispersion. It looks like this:
Astronomers fit for the shape of the delay in order to compensate for its effect, but there’s always an uncertainty associated with the fit. That uncertainty is expressed in the DM-SNR (“dispersion measure - signal-to-noise ratio”) curve, which looks like this:
When you put these two curves together it means that for each pulsar candidate there are eight numerical characteristic features that can be extracted as standard: four from the integrated pulse profile and four from the DM-SNR curve:
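The four statistics per curve are the mean, standard deviation, excess kurtosis and skewness (matching the feature names that appear in the dataset below). A minimal sketch of extracting them with scipy, using made-up stand-ins for the two curves:

```python
import numpy as np
from scipy.stats import kurtosis, skew


def curve_features(curve):
    """Return the four summary statistics used for each HTRU2 curve:
    mean, standard deviation, excess kurtosis and skewness.
    scipy's kurtosis() returns *excess* kurtosis by default (Fisher)."""
    return [np.mean(curve), np.std(curve), kurtosis(curve), skew(curve)]


# Toy stand-ins for a candidate's two curves (not real pulsar data).
rng = np.random.default_rng(42)
integrated_profile = rng.normal(loc=100.0, scale=20.0, size=256)
dm_snr_curve = rng.exponential(scale=5.0, size=256)

features = curve_features(integrated_profile) + curve_features(dm_snr_curve)
print(len(features))  # eight numbers per candidate
```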
Pulsar Data in Python
The HTRU2 dataset compiles the eight features described above for 1,639 true known pulsars, as well as 16,259 additional candidate pulsars later identified to be RFI/noise. You can find a full description of the dataset in this paper.
I added a row to the original CSV that lists the feature names - you can find my version in the IAU OAD Data Science Toolkit here.
To read these data into Python you can import the pandas library:
import pandas as pd # for data handling
and use it to read the CSV file:
df = pd.read_csv('data/pulsar.csv')
You can take a look at the names of the features in the file like this:
feature_names = df.columns.values[0:-1]
print(feature_names)
['mean_int_pf' 'std_pf' 'ex_kurt_pf' 'skew_pf' 'mean_dm' 'std_dm' 'kurt_dm' 'skew_dm']
(pf = integrated profile, dm = DM-SNR curve)
and we can check just how much data we’re dealing with:
print('Dataset has %d rows and %d columns including features and labels' % (df.shape[0], df.shape[1]))
To get the feature data on its own we can just strip off the column containing the class labels:
features = df.drop('class', axis=1)
The labels for each object tell us about the target class, and we can create an array of those data by extracting the column from the original dataset:
targets = df['class']
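Because HTRU2 is heavily imbalanced - far more RFI/noise samples than true pulsars - it’s worth checking the class balance at this point. A minimal sketch using a toy stand-in for the real dataframe (with the real file you would simply call `value_counts()` on the `targets` series above):

```python
import pandas as pd

# Toy stand-in for the HTRU2 frame - the real file has 17,898 rows.
toy_df = pd.DataFrame({'mean_int_pf': [110.5, 99.3, 140.2, 103.7],
                       'class': [0, 0, 1, 0]})

toy_targets = toy_df['class']
print(toy_targets.value_counts())                # samples per class
print(toy_targets.value_counts(normalize=True))  # class fractions
```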
The objective for pulsar astronomers is to classify each data sample as pulsar or not pulsar. These two possible outcomes are known as the target classes. There are multiple machine learning algorithms that can be used for this kind of binary classification (i.e. only two possible target classes). The principles behind all of them are the same.
Test data The ultimate goal of building a classifier is to be able to use it on previously unseen data and recover the correct classifications for each data sample. This unseen dataset is typically referred to as your test data.
Training data In order to build/train your classifier, you will need to provide a library of examples of each target class. For supervised learning, this dataset must be pre-labelled and it is typically referred to as your training data.
Validation data The validation dataset is a subset of your training data. It is not your test data. You must not use your test data to train your classifier in any way.
Learning algorithm The learning algorithm is whatever form of machine learning you have chosen to use for your dataset. You will often have to specify not only the form of the algorithm itself, but also the form of the cost function that the algorithm employs. Each machine learning model partitions feature space in a different way, driven by the goal of optimising the cost function.
Machine learning model The machine learning model is the output of your learning algorithm. You apply it to your test data to derive the predicted class of each test sample.
In this example we’ll use a random forest machine learning algorithm to classify the HTRU2 pulsar dataset. A random forest is a learning algorithm constructed from multiple decision trees.
Decision Trees & Random Forests
A decision tree classifies data samples using a hierarchical set of data partitions in feature space. An example of one such partition could be classifying an animal as a mouse or a cat based on length data. Here length is our feature and our target classes are cat and mouse. If we draw a partition such that (length>10cm) = 'cat' and (length<10cm) = 'mouse' we have implemented a decision tree with a single node.
In reality if we created a training dataset by measuring 1000 cats and 1000 mice we would find that there was a distribution of lengths for each class, and that 10cm might not be the optimal partition value. The machine learning part of a decision tree is learning where that optimal value is. If we have multiple features, say length, height and weight, the algorithm will learn the best split point for all three.
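We can see this learning step in action with scikit-learn’s DecisionTreeClassifier on a toy cat/mouse dataset (the length distributions below are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training set: body lengths in cm for 'mouse' (0) and 'cat' (1).
rng = np.random.default_rng(0)
mouse_lengths = rng.normal(8.0, 1.0, 100)
cat_lengths = rng.normal(46.0, 5.0, 100)

X = np.concatenate([mouse_lengths, cat_lengths]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

# max_depth=1 gives a single split node, like the example above.
tree = DecisionTreeClassifier(max_depth=1)
tree.fit(X, y)

# The learned threshold is wherever the tree found the optimal
# partition value - not a hand-picked 10cm.
print('learned split at %.1f cm' % tree.tree_.threshold[0])
print(tree.predict([[5.0], [50.0]]))  # -> mouse (0), cat (1)
```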
A random forest employs a collection of decision trees and then makes a final classification by majority consensus. Each tree in the random forest is given a subset of the complete training dataset from which to learn and uses a subset of the features to represent the data. This technique is known as bootstrap aggregating, or bagging for short. It is used to mitigate against biases that can be introduced into the final classification by the presence of particular data samples or collections of samples, or an over-reliance on one particular feature or set of features.
An easy way to implement a random forest in Python is to use the scikit-learn library. For what follows, these are the library routines that we need to import:
import numpy as np # for array stuff
import pylab as pl # for plotting stuff
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
We’ll also be using the scikit-plot library, which I only recently discovered and which has made my life much easier :-)
import scikitplot as skplt
Train / Validation / Test Split
Now we need to split our labelled data into two separate datasets: one to train the classifier and one to test the fitted machine learning model. To do this we can use the function train_test_split from the scikit-learn library:
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.33, random_state=66)
At this point we now have our dataset in a suitable state to start training the classifier.
To start with we need to instantiate the random forest classifier from scikit-learn:
RFC = RandomForestClassifier(n_jobs=2,n_estimators=10)
…and we can immediately fit the machine learning model to our training data:

RFC.fit(X_train, y_train)
We can then use the trained classifier to predict labels for the test data that we split out earlier:
rfc_predict = RFC.predict(X_test)
So how did we do? We need to evaluate the performance of our classifier.
There are a variety of ways to evaluate the performance of a machine learning model. Which one you choose should depend on the objective of your classification. Before we look at some common performance metrics we first need to define a few terms.
Suppose we have two target classes; these could be cat and mouse, or alternatively pulsar and non-pulsar, but here I’m just going to call them positive and negative. When we apply a machine learning model to the unlabelled test data composed of these classes it fits a split that looks like this:
An unlabelled (test) data sample from the positive class that is correctly labelled is called a true positive, while one that is incorrectly labelled is a false negative. Likewise, a sample from the negative class that is correctly labelled is a true negative, and one that is incorrectly labelled is a false positive. We use these terms to describe the different types of error, and hence the performance metrics, of the machine learning model.
Some commonly used performance metrics are:
Precision The fraction of positive predictions that are truly positive.
Recall The fraction of true positives that are predicted to be positive.
F1-score A measure of accuracy that considers both precision and recall.
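All three metrics are available directly from scikit-learn. A toy sketch with hand-built labels, chosen so that there are 4 true positives, 1 false positive and 2 false negatives:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0]

# precision = TP / (TP + FP) = 4 / 5
print(precision_score(y_true, y_pred))  # 0.8

# recall = TP / (TP + FN) = 4 / 6
print(recall_score(y_true, y_pred))     # 0.666...

# F1 is the harmonic mean of precision and recall
print(f1_score(y_true, y_pred))
```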
There are many other performance metrics, which are adapted to suit particular classification problems - for example if you are looking for a rare type of object in an imbalanced dataset. However, here we will only consider these three common metrics.
We can evaluate these metrics using our test dataset, but at this point the machine learning model has already been trained and we cannot refine it further. Poor performance metrics evaluated on test data are often an indication that the machine learning model has over-fitted the training data and does not generalise well to new input data. To guard against this we can use validation data during the training process. As described above, validation data is a subset of our training data that we reserve for on-the-fly performance testing.
A good way of using validation data is to perform k-fold cross-validation. This will tell us how well our machine learning model generalises, i.e. whether we have over-fitted the training data.
The term k-fold refers to how many different validation datasets you select from the training data. For example, in 5-fold cross-validation the training data would be partitioned into five chunks and the training procedure iterated five times, each time choosing a different chunk as the validation dataset. The cross-validation performance metrics are then reported as an average across the five trained machine learning models.
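The partitioning itself can be seen with scikit-learn’s KFold; a minimal sketch on ten dummy samples:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # ten dummy samples

# 5-fold cross-validation: each iteration holds out a different fifth
# of the data as the validation set and trains on the rest.
for train_idx, val_idx in KFold(n_splits=5).split(X):
    print('train:', train_idx, 'validate:', val_idx)
```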
We can do this for the pulsar dataset. Here we are implementing 10-fold cross-validation:
rfc_cv_score = cross_val_score(RFC, features, targets, cv=10, scoring='roc_auc')
Let’s print out the various evaluation criteria:
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict, target_names=['Non Pulsar','Pulsar']))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())
We can make a more visual representation of the confusion matrix using the scikit-plot library. To do this we need to know the predictions from our cross validation, rather than the Area Under Curve (AUC) value:
predictions = cross_val_predict(RFC, features, targets, cv=2)
skplt.metrics.plot_confusion_matrix(targets, predictions, normalize=True)
To plot the ROC curve we need to find the probabilities for each target class separately. We can do this with the predict_proba function:
probas = RFC.predict_proba(X_test)
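These probabilities can then be fed to the roc_curve and roc_auc_score routines we imported earlier (the second column of predict_proba is the pulsar probability). A self-contained toy sketch with made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy true labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 1, 1])
pulsar_probs = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns the false positive
# rate and true positive rate at each step.
fpr, tpr, thresholds = roc_curve(y_true, pulsar_probs)
print('AUC:', roc_auc_score(y_true, pulsar_probs))  # 0.75 for this toy case
```

With scikit-plot this becomes a one-liner: skplt.metrics.plot_roc(y_test, probas) draws the per-class ROC curves along with the micro- and macro-averages.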
In a balanced dataset there should be no difference between the micro-average ROC curve and the macro-average ROC curve. Where there is a class imbalance (as here), a macro-average ROC curve that sits below the micro-average curve indicates more mis-classifications in the minority class.
We can use the output of the RFC.predict_proba() function to plot a Precision-Recall curve.
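The underlying routine is sklearn.metrics.precision_recall_curve, which, like roc_curve, sweeps the decision threshold; a toy sketch with the same made-up labels and probabilities as before:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
pulsar_probs = np.array([0.1, 0.4, 0.35, 0.8])

# precision and recall at each candidate threshold; by convention the
# returned arrays end at precision 1, recall 0.
precision, recall, thresholds = precision_recall_curve(y_true, pulsar_probs)
print(precision)
print(recall)
```

With scikit-plot, skplt.metrics.plot_precision_recall(y_test, probas) produces the per-class curves directly.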
Let’s take a look at the relative importance of the different features that we fed to our classifier:
importances = RFC.feature_importances_
indices = np.argsort(importances)
pl.figure(1)
pl.title('Feature Importances')
pl.barh(range(len(indices)), importances[indices], color='b', align='center')
pl.yticks(range(len(indices)), feature_names[indices])
pl.xlabel('Relative Importance')
pl.show()