Note: While we have seen machine learning techniques applied to various problem contexts from the business domain as part of the case studies so far, this case study aims to demonstrate their applicability to problem statements from the natural sciences as well.
Astronomy is one such discipline, with an abundance of vast datasets; machine learning algorithms are often a necessity because analyzing all this data and deriving insights and conclusions manually would be prohibitively labor-intensive.
The classification of celestial objects observed through telescopes as either a star, a galaxy, or a quasar is an important task in astronomy. Stars have been known to humanity since time immemorial, but the idea that whole galaxies of stars exist outside our own galaxy (the Milky Way) was first theorized by the philosopher Immanuel Kant in 1755 and conclusively confirmed in 1925 by the American astronomer Edwin Hubble. Quasars are a more recent discovery, made possible in large part by the emergence of radio astronomy in the 1950s.
Descriptions of these three celestial objects are provided below:
Quasars were understood to be different from other stars and galaxies because their spectral measurements (which indicate their chemical composition) and their changes in luminosity were strange and initially defied explanation based on conventional knowledge: they were observed to be far more luminous than galaxies, yet far more compact, indicating tremendous power density. Crucially, it was the extreme "redshift" observed in the spectra of quasars that stood out and gave rise to the realization that they were separate entities from other, less luminous stars and galaxies.
Note: In astronomy, redshift refers to an increase in wavelength, and hence decrease in energy/frequency of any observed electromagnetic radiation, such as light. The loss of energy of the radiation due to some factor is the key reason behind the observed redshift of that radiation. Redshift is a specific example of what’s called the Doppler Effect in Physics.
An everyday example of the Doppler Effect is the change in the wailing sound of the siren of an ambulance as it drives further away from us – when the ambulance is driving away from us, it feels as if the sound of the siren falls in pitch, in comparison to the higher pitched sound when the ambulance was initially driving towards us before passing our position.
While redshift may occur for relativistic or gravitational reasons, the most significant cause of redshift for any sufficiently distant astronomical object is the expansion of the universe: as the radiation travels through expanding space, its wavelength is stretched and its energy correspondingly reduced.
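To make the definition concrete, redshift is usually quantified as the fractional increase in wavelength, z = (observed wavelength - emitted wavelength) / emitted wavelength. The short sketch below illustrates this with made-up example values; they are purely for illustration and are not taken from the dataset used later.

In [ ]:
# Redshift z is defined as the fractional shift in wavelength:
#     z = (lambda_observed - lambda_emitted) / lambda_emitted
# The wavelength values below are hypothetical example numbers.
lambda_emitted = 121.6    # rest-frame wavelength of a spectral line, in nanometers
lambda_observed = 486.4   # wavelength at which the same line is observed, in nanometers

z = (lambda_observed - lambda_emitted) / lambda_emitted
print(f"Redshift z = {z:.2f}")  # z = 3.00 for these example values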
For cosmological reasons, quasars are more common in the early universe, which corresponds to the parts of the observable universe that are furthest away from us here on Earth. It is also known from astrophysics (Hubble's law, with the accelerating expansion attributed to the existence of "dark energy" in the universe) that not only is the universe expanding, but the further away an astronomical object is, the faster it appears to be receding from Earth (similar to points on an expanding balloon); this causes the redshift of far-away galaxies and quasars to be much higher than that of galaxies closer to Earth.
This high redshift is one of the defining traits of Quasars, as we will see from the insights in this case study.
The objective of this problem is to use the tabular features available for each astronomical object to predict whether the object is a star, a galaxy, or a quasar, using supervised machine learning methods.
In this notebook, we will use simple non-linear methods such as k-Nearest Neighbors and Decision Trees to perform this classification.
The source for this dataset is the Sloan Digital Sky Survey (SDSS), one of the most comprehensive public sources of astronomical datasets available on the web today. SDSS has been one of the most successful surveys in astronomy history, having created highly detailed three-dimensional maps of the universe and curated spectroscopic and photometric information on over three million astronomical objects in the night sky. SDSS uses a dedicated 2.5 m wide-angle optical telescope which is located at the Apache Point Observatory in New Mexico, USA.
The survey was named after the Alfred P. Sloan Foundation (established by Alfred P. Sloan, ex-president of General Motors), a major donor to this initiative and, among others, to the MIT Sloan School of Management.
The following dataset consists of 250,000 celestial object observations taken by SDSS. Each observation is described by 17 feature columns and 1 class column that identifies the object as a star, a galaxy, or a quasar.
In [ ]:
# Importing the basic libraries we will require for the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import csv, json
import os

# Importing the Machine Learning models we require from Scikit-Learn
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import recall_score, roc_curve, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn import metrics, model_selection

# Setting the random seed to 1 for reproducibility of results
import random
random.seed(1)
np.random.seed(1)

# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')
In [ ]:
# Uncomment if you are using Google Colab
from google.colab import drive
drive.mount('/content/drive')
# os.chdir("/content/drive/MyDrive")
In [ ]:
# Loading the dataset into a Pandas DataFrame
df_astro = pd.read_csv('/content/drive/MyDrive/Skyserver250k.csv')
In [ ]:
df_astro.head()
In [ ]:
df_astro.shape
In [ ]:
df_astro['class'].value_counts()
Let’s view the first few rows and last few rows of the dataset in order to understand its structure a little better.
We will use the head() and tail() methods from Pandas to do this.
In [ ]:
# Let's view the first 5 rows of the dataset
df_astro.head()
In [ ]:
# Now let's view the last 5 rows of the dataset
df_astro.tail()
Next, let’s check the datatypes of the columns in the dataset.
We are interested to know how many numerical and how many categorical features this dataset possesses.
In [ ]:
# Let's check the datatypes of the columns in the dataset
df_astro.info()
Apart from the class variable (the target variable), which is of the object datatype and is categorical in nature, all the other predictor variables here are numerical, as they have int64 and float64 datatypes.

In [ ]:
# Taking a random sample of 50,000 rows from the dataset to work with
df_astro = df_astro.sample(n=50000)
In [ ]:
# Checking for any missing values just in case
df_astro.isnull().sum()
Let’s also do a quick check to see if any of the rows in this dataset may be duplicates of each other,
even though we know that will not be the case given the source of this data.
In [ ]:
# Let's also check for duplicate rows in the dataset
df_astro.duplicated().sum()
As seen above, there are no duplicate rows in the dataset either.
Let’s now look at the percentage class distribution of the target variable class in this classification dataset.
In [ ]:
### Percentage class distribution of the target variable "class"
df_astro['class'].value_counts(1)*100
Let’s now use Scikit-Learn's LabelEncoder to numerically encode the categories of the target class variable.

In [ ]:
# Encoding the categorical target variable "class" into integer labels
le = LabelEncoder()
df_astro["class"] = le.fit_transform(df_astro["class"])
df_astro["class"] = df_astro["class"].astype(int)
In [ ]:
df_astro['class']
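Since LabelEncoder assigns integer codes in alphabetical order of the original string labels, it is worth confirming the exact mapping before interpreting later results. The quick check below uses the le object fitted above; the specific label strings mentioned in the comment are what the SDSS class column typically contains and should be verified against the actual output.

In [ ]:
# The fitted encoder stores the original labels in sorted order in le.classes_;
# for SDSS-style labels this would give e.g. GALAXY -> 0, QSO -> 1, STAR -> 2
print(dict(zip(le.classes_, le.transform(le.classes_))))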
Since the predictor variables in this machine learning problem are all numerical, a statistical summary is definitely required so that we can understand some of the statistical properties of the features of our dataset.
In [ ]:
# We would like the format of the values in the table to be simple float numbers with 5 decimal places, hence the code below
pd.set_option('display.float_format', lambda x: '%.5f' % x)

# Let's view the statistical summary of the columns in the dataset
df_astro.describe().T
Observations:

- The maximum value of the redshift variable is 6.4, and the minimum value is 0.16.
- The mean of right ascension (ra) is 178.3 and its standard deviation is 77.87, whereas the mean and standard deviation of the declination (dec) variable are 24.4 and 20.8 respectively.
- The r, i and z variables are more or less similar; their ranges of values are the same.

Let’s make sure we also understand the number of unique values in each feature of the dataset.
In [ ]:
# Number of unique values in each column
df_astro.nunique()
The objid and specobjid columns are clearly unique IDs, which is why they have the same number of values as the total number of rows in the dataset.

Since the objid, specobjid and fiberid columns are unique IDs, they will not add any predictive power to the machine learning model, and they can hence be removed.
In [ ]:
# Removing the objid, specobjid and fiberid columns from the dataset
df_astro.drop(columns=['objid', 'specobjid', 'fiberid'], inplace=True)
We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.
In [ ]:
def hist_box(data, feature, figsize=(12, 7), kde=False, bins=None, stat='density'):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots

    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column

    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, stat='density', palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, stat='density', ax=ax_hist2
    )  # For histogram

    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
hist_box(df_astro,'redshift', bins = 30)
In [ ]:
hist_box(df_astro,'ra')
In [ ]:
hist_box(df_astro,'dec')
In [ ]:
hist_box(df_astro,'u')
In [ ]:
hist_box(df_astro,'g')
In [ ]:
hist_box(df_astro,'i', bins = 50)
In [ ]:
hist_box(df_astro,'r')
In [ ]:
hist_box(df_astro,'z', bins = 50);
In [ ]:
hist_box(df_astro,'run')
In [ ]:
hist_box(df_astro,'rerun', bins = 20)
In [ ]:
hist_box(df_astro,'camcol');
In [ ]:
hist_box(df_astro,'field');
In [ ]:
hist_box(df_astro,'plate');
In [ ]:
hist_box(df_astro,'mjd');
Observations:
In [ ]:
# class vs redshift
sns.boxplot(x=df_astro['class'], y=df_astro['redshift'], palette="PuBu")
In [ ]:
# class vs [u, g, r, i, z]
cols = df_astro[['u', 'g', 'r', 'i', 'z']].columns.tolist()
plt.figure(figsize=(10, 10))

for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=df_astro["class"], y=df_astro[variable], palette="PuBu")
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [ ]:
# class vs [run, rerun, camcol, field]
cols = df_astro[['run', 'rerun', 'camcol', 'field']].columns.tolist()
plt.figure(figsize=(10, 10))

for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=df_astro["class"], y=df_astro[variable], palette="PuBu")
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [ ]:
# class vs [plate, mjd]
cols = df_astro[['plate', 'mjd']].columns.tolist()
plt.figure(figsize=(10, 10))

for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=df_astro["class"], y=df_astro[variable], palette="PuBu")
    plt.tight_layout()
    plt.title(variable)

plt.show()
The kdeplot is a kernel density estimate (KDE) plot, a smoothed graph of the distribution of a numerical variable. The helper functions defined below overlay the density of a given column for each class, along with the density over all classes combined; log_plot() does the same on the log of the column.
In [ ]:
def plot(column):
    for i in range(3):
        sns.kdeplot(data=df_astro[df_astro["class"] == i][column], label=le.inverse_transform([i]))
    sns.kdeplot(data=df_astro[column], label=["All"])
    plt.legend();
In [ ]:
def log_plot(column):
    for i in range(3):
        sns.kdeplot(data=np.log(df_astro[df_astro["class"] == i][column]), label=le.inverse_transform([i]))
    sns.kdeplot(data=np.log(df_astro[column]), label=["All"])
    plt.legend();
Rerun Number to specify how the image was processed
In [ ]:
df_astro["rerun"].nunique()
The rerun column has only one unique value; a constant column provides no information to a predictive model, so we can drop it.
In [ ]:
df_astro = df_astro.drop("rerun",axis=1)
Right Ascension angle (at J2000 epoch)
In [ ]:
plot("ra")
There is not much difference in the distribution across classes, but we can see some characteristics that help distinguish the STAR class here.
Declination angle (at J2000 epoch)
In [ ]:
plot("dec")
Although there is no significant difference in the distribution across classes, we can see some characteristics that help distinguish the QSO class.
The red filter in the photometric system
In [ ]:
plot("r")
It can be seen that the distribution of the QSO class for this variable follows a noticeably different pattern from the other categories.
Near Infrared filter in the photometric system
In [ ]:
plot("i")
We can see that the distribution of the QSO class is somewhat distinctive for this variable as well.
Run Number used to identify the specific scan
In [ ]:
plot("run")
There is no significant difference in the distribution by class, so we can drop this column.
In [ ]:
df_astro = df_astro.drop("run",axis=1)
Field number to identify each field
In [ ]:
plot("field")
There is no significant difference in the distribution by class, so we can drop this column as well.
In [ ]:
df_astro = df_astro.drop("field",axis=1)
Redshift value based on the increase in wavelength
In [ ]:
plot("redshift")
It is difficult to read the graph due to extreme values, so let’s apply a log transformation.
In [ ]:
log_plot("redshift")
After the log transformation, we can see that the distribution of redshift is clearly distinctive across classes.
Plate ID, which identifies each plate in SDSS
In [ ]:
plot("plate")
We can see that the distribution of this variable shows distinct patterns across classes.
Modified Julian Date, used to indicate when a given piece of SDSS data was taken
In [ ]:
plot("mjd")
We can see that the distribution of this variable also shows distinct patterns across classes.
Camera column to identify the scanline within the run
In [ ]:
sns.countplot(x=df_astro["camcol"]);
We can see that camcol is distributed fairly evenly.
In [ ]:
sns.countplot(x=df_astro["camcol"],hue=df_astro["class"]);
It seems difficult to distinguish the classes based on camcol, so we can drop this column.
In [ ]:
df_astro = df_astro.drop("camcol",axis=1)
Object class (galaxy, star or quasar)
In [ ]:
sns.countplot(x=df_astro["class"]);
We can see that the class distribution is imbalanced.
Pairwise Correlation
In [ ]:
plt.figure(figsize=(20, 10))
sns.heatmap(df_astro.corr(), annot=True, fmt=".2f")
plt.show()
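As a numerical complement to the heatmap, the short sketch below (an illustrative addition, not part of the original analysis) lists the most strongly correlated feature pairs explicitly.

In [ ]:
# Keep only the upper triangle of the absolute correlation matrix (to avoid duplicate pairs)
# and show the ten most strongly correlated feature pairs
corr = df_astro.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.unstack().dropna().sort_values(ascending=False).head(10))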
Observations:
In [ ]:
# Separating the dependent and independent columns in the dataset
X = df_astro.drop(['class'], axis=1)
Y = df_astro[['class']]
In [ ]:
df_astro[['class']]
In [ ]:
# Splitting the dataset into the Training and Testing set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42, stratify=Y)

# Checking the shape of the Train and Test sets
print('X Train Shape:', X_train.shape)
print('X Test Shape:', X_test.shape)
print('Y Train Shape:', y_train.shape)
print('Y Test Shape:', y_test.shape)
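Because we passed stratify=Y to train_test_split, the class proportions in the train and test splits should closely mirror those of the full dataset. The quick check below is an added sanity check rather than part of the original notebook.

In [ ]:
# Compare the class proportions in the full data, the training split and the test split
print("Full dataset:\n", Y['class'].value_counts(normalize=True))
print("Training set:\n", y_train['class'].value_counts(normalize=True))
print("Test set:\n", y_test['class'].value_counts(normalize=True))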
In [ ]:
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    # LabelEncoder assigns codes alphabetically, so 0 = Galaxy, 1 = Quasar (QSO), 2 = Star
    sns.heatmap(cm, annot=True, fmt='.2f',
                xticklabels=['Galaxy', 'Quasar', 'Star'],
                yticklabels=['Galaxy', 'Quasar', 'Star'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
In [ ]:
dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, y_train)

y_train_pred_dt = dt.predict(X_train)
metrics_score(y_train, y_train_pred_dt)
In [ ]:
y_test_pred_dt = dt.predict(X_test)
metrics_score(y_test, y_test_pred_dt)
In [ ]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(dt, X_test, y_test, cv=5)
print(f"The average score of the model with K=5 cross validation is {np.average(scores)}")
In [ ]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
dt = DecisionTreeClassifier(random_state=1)

X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)  # use the scaler fitted on the training data, not a refit

dt.fit(X_train_std, y_train)
y_train_pred_dt = dt.predict(X_train_std)
metrics_score(y_train, y_train_pred_dt)
In [ ]:
y_test_pred_dt = dt.predict(X_test_std)
metrics_score(y_test, y_test_pred_dt)
As expected, scaling doesn’t make much of a difference in the performance of the Decision Tree model, since it is not a distance-based algorithm and rather tries to separate instances with orthogonal splits in vector space.
In [ ]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(dt, X_test_std, y_test, cv=5)
print(f"The average score of the model with K=5 cross validation is {np.average(scores)}")
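In contrast, k-Nearest Neighbors (mentioned earlier as one of the methods for this problem) is distance-based, so feature scaling is expected to matter for it. The sketch below is an illustrative addition (not part of the original notebook) that fits a KNeighborsClassifier on both the raw and the standardized features so the two test accuracies can be compared.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier

# k-NN on the raw (unscaled) features
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train.values.ravel())
print("k-NN test accuracy (unscaled):", knn_raw.score(X_test, y_test))

# k-NN on the standardized features, scaled with the StandardScaler fitted above
knn_std = KNeighborsClassifier(n_neighbors=5)
knn_std.fit(X_train_std, y_train.values.ravel())
print("k-NN test accuracy (scaled):  ", knn_std.score(X_test_std, y_test))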
In [ ]:
features = list(X.columns)

plt.figure(figsize=(30, 20))
tree.plot_tree(dt, max_depth=3, feature_names=features, filled=True, fontsize=12, node_ids=True, class_names=True)
plt.show()
Observation:
Let’s look at the feature importances of the decision tree model.
In [ ]:
# Plotting the feature importances
importances = dt.feature_importances_
columns = X.columns

importance_df_astro = pd.DataFrame(importances, index=columns, columns=['Importance']).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(13, 13))
sns.barplot(x=importance_df_astro.Importance, y=importance_df_astro.index)
plt.show()

print(importance_df_astro)