To understand how Bagging and Random Forest models are used in the real world, let’s walk through an example in which we build a model to predict whether a person will develop diabetes, based on a dataset of medical attributes.
Diabetes is one of the most common diseases worldwide, and the number of diabetic patients has been growing over the years. The main cause of diabetes remains unknown, yet scientists believe that both genetic and environmental (lifestyle) factors play a major role.
Individuals with diabetes are at risk of developing secondary health issues such as heart disease and nerve damage. Thus, early detection and treatment of diabetes can prevent complications and reduce the risk of severe health problems. Although diabetes is incurable, it can be managed with treatment and medication.
Researchers at the Bio-Solutions lab want to better understand this disease among women and are planning to use machine learning models to help them identify patients who are at risk of diabetes.
As a data scientist at Bio-Solutions, you have to build a classification model using a dataset collected by the “National Institute of Diabetes and Digestive and Kidney Diseases”, consisting of several attributes that help identify whether a person is at risk of diabetes.
The objective: to build a model that predicts whether an individual is at risk of diabetes.
In [1]:
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings('ignore')

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Libraries to tune the model and get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV
In [2]:
pima=pd.read_csv("pima-indians-diabetes.csv")
In [3]:
# copying data to another variable to avoid any changes to original data
data = pima.copy()
In [4]:
data.head()
Out[4]:
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | Pedigree | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
In [5]:
data.tail()
Out[5]:
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | Pedigree | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
In [6]:
data.shape
Out[6]:
(768, 9)
In [7]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Pregnancies    768 non-null    int64
 1   Glucose        768 non-null    int64
 2   BloodPressure  768 non-null    int64
 3   SkinThickness  768 non-null    int64
 4   Insulin        768 non-null    int64
 5   BMI            768 non-null    float64
 6   Pedigree       768 non-null    float64
 7   Age            768 non-null    int64
 8   Class          768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Observations –

- The dataset has 768 rows and 9 columns.
- All columns are numeric: 7 of type int64 and 2 of type float64.
- There are no missing (null) values in the data.
In [8]:
data.describe().T
Out[8]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.000 | 1.00000 | 3.0000 | 6.00000 | 17.00 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.000 | 99.00000 | 117.0000 | 140.25000 | 199.00 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.000 | 62.00000 | 72.0000 | 80.00000 | 122.00 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.000 | 0.00000 | 23.0000 | 32.00000 | 99.00 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.000 | 0.00000 | 30.5000 | 127.25000 | 846.00 |
| BMI | 768.0 | 31.992578 | 7.884160 | 0.000 | 27.30000 | 32.0000 | 36.60000 | 67.10 |
| Pedigree | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.000 | 24.00000 | 29.0000 | 41.00000 | 81.00 |
| Class | 768.0 | 0.348958 | 0.476951 | 0.000 | 0.00000 | 0.0000 | 1.00000 | 1.00 |
Observations –

- The minimum value of Glucose, BloodPressure, SkinThickness, Insulin, and BMI is 0, which is not physiologically possible. These zeros most likely represent missing values and will need to be treated.
- The average age of the women in the data is about 33 years, ranging from 21 to 81.
- The mean of Class is roughly 0.35, i.e., about 35% of the women in the dataset are diabetic.
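To quantify the zero-value issue before treating it, we can count the zeros in each of these columns. This quick check is an addition to the original notebook:

# number of zero entries in columns where zero is not a plausible value
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((data[cols] == 0).sum())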
In [9]:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
In [10]:
histogram_boxplot(data, "Pregnancies")
In [11]:
histogram_boxplot(data,"Glucose")
In [12]:
histogram_boxplot(data,"BloodPressure")
In [13]:
histogram_boxplot(data,"SkinThickness")
In [14]:
data[data['SkinThickness']>80]
Out[14]:
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | Pedigree | Age | Class | |
---|---|---|---|---|---|---|---|---|---|
579 | 2 | 197 | 70 | 99 | 0 | 34.7 | 0.575 | 62 | 1 |
In [15]:
histogram_boxplot(data,"Insulin")
In [16]:
histogram_boxplot(data,"BMI")
In [17]:
histogram_boxplot(data,"Pedigree")
In [18]:
histogram_boxplot(data,"Age")
In [19]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [20]:
labeled_barplot(data,"Class",perc=True)
In [21]:
labeled_barplot(data,"Pregnancies",perc=True)
In [22]:
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, cmap="Spectral")
plt.show()
Observations –
In [23]:
sns.pairplot(data=data, hue="Class")
plt.show()
In [24]:
### Function to plot boxplot
def boxplot(x):
    plt.figure(figsize=(10, 7))
    sns.boxplot(data=data, x="Class", y=data[x], palette="PuBu")
    plt.show()
In [25]:
data.columns
Out[25]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Pedigree', 'Age', 'Class'], dtype='object')
In [26]:
boxplot('Pregnancies')
In [27]:
boxplot('Glucose')
In [28]:
boxplot('BloodPressure')
In [29]:
boxplot('SkinThickness')
In [30]:
boxplot('Insulin')
In [31]:
boxplot('BMI')
In [32]:
boxplot('Pedigree')
In [33]:
boxplot('Age')
In [34]:
# replacing the physiologically impossible 0 values with the median of the respective column
data.loc[data.Glucose == 0, 'Glucose'] = data.Glucose.median()
data.loc[data.BloodPressure == 0, 'BloodPressure'] = data.BloodPressure.median()
data.loc[data.SkinThickness == 0, 'SkinThickness'] = data.SkinThickness.median()
data.loc[data.Insulin == 0, 'Insulin'] = data.Insulin.median()
data.loc[data.BMI == 0, 'BMI'] = data.BMI.median()
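As a quick sanity check (an addition, not part of the original notebook), we can confirm that none of the treated columns still contain zeros:

# all counts should now be zero after the median replacement above
cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((data[cols] == 0).sum())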
In [35]:
X = data.drop('Class', axis=1)
y = data['Class']
In [36]:
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(537, 8) (231, 8)
The stratify argument maintains the original distribution of classes in the target variable while splitting the data into train and test sets.
In [37]:
y.value_counts(1)
Out[37]:
0    0.651042
1    0.348958
Name: Class, dtype: float64
In [38]:
y_test.value_counts(1)
Out[38]:
0    0.649351
1    0.350649
Name: Class, dtype: float64
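The test set retains roughly the same 65/35 class split as the full data. For completeness, the training split can be verified the same way (a quick check that is an addition to the original notebook):

# proportions of each class in the training target; with stratify=y these
# should closely match the full-data proportions shown above
y_train.value_counts(1)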
Let’s define a function to compute accuracy, recall, precision, F1-score, and ROC-AUC on the train and test sets, and a function to plot the confusion matrix, so that we do not have to repeat the same code while evaluating each model.
In [39]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    roc_auc = roc_auc_score(target, pred)  # to compute ROC-AUC, which the outputs below report

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
            "ROC-AUC": roc_auc,
        },
        index=[0],
    )

    return df_perf
In [40]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [41]:
# Fitting the model
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train, y_train)

# Calculating different metrics
dtree_model_train_perf = model_performance_classification_sklearn(d_tree, X_train, y_train)
print("Training performance:\n", dtree_model_train_perf)
dtree_model_test_perf = model_performance_classification_sklearn(d_tree, X_test, y_test)
print("Testing performance:\n", dtree_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(d_tree, X_test, y_test)
Training performance:
    Accuracy  Recall  Precision   F1  ROC-AUC
0        1.0     1.0        1.0  1.0      1.0
Testing performance:
    Accuracy    Recall  Precision        F1  ROC-AUC
0   0.731602  0.580247   0.626667  0.602564  0.69679
In [42]:
# Fitting the model
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train, y_train)

# Calculating different metrics
rf_estimator_model_train_perf = model_performance_classification_sklearn(rf_estimator, X_train, y_train)
print("Training performance:\n", rf_estimator_model_train_perf)
rf_estimator_model_test_perf = model_performance_classification_sklearn(rf_estimator, X_test, y_test)
print("Testing performance:\n", rf_estimator_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
Training performance:
    Accuracy  Recall  Precision   F1  ROC-AUC
0        1.0     1.0        1.0  1.0      1.0
Testing performance:
    Accuracy   Recall  Precision        F1   ROC-AUC
0   0.753247  0.54321     0.6875  0.606897  0.704938
In [43]:
# Fitting the model
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train, y_train)

# Calculating different metrics
bagging_classifier_model_train_perf = model_performance_classification_sklearn(bagging_classifier, X_train, y_train)
print("Training performance:\n", bagging_classifier_model_train_perf)
bagging_classifier_model_test_perf = model_performance_classification_sklearn(bagging_classifier, X_test, y_test)
print("Testing performance:\n", bagging_classifier_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(bagging_classifier, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1   ROC-AUC
0   0.994413  0.983957        1.0  0.991914  0.991979
Testing performance:
    Accuracy    Recall  Precision        F1   ROC-AUC
0   0.744589  0.555556   0.661765  0.604027  0.701111
In [44]:
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight={0: 0.35, 1: 0.65}, random_state=1)

# Grid of parameters to choose from
parameters = {
    'max_depth': np.arange(2, 10),
    'min_samples_leaf': [5, 7, 10, 15],
    'max_leaf_nodes': [2, 3, 5, 10, 15],
    'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
Out[44]:
DecisionTreeClassifier(class_weight={0: 0.35, 1: 0.65}, max_depth=4, max_leaf_nodes=5, min_impurity_decrease=0.0001, min_samples_leaf=5, random_state=1)
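Before evaluating the refit estimator, it can be useful to see exactly what the search selected. This quick inspection is an addition to the original notebook and assumes grid_obj from the cell above is still in scope:

# best hyperparameter combination and its mean cross-validated recall
print(grid_obj.best_params_)
print(grid_obj.best_score_)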
In [45]:
# Calculating different metrics
dtree_estimator_model_train_perf = model_performance_classification_sklearn(dtree_estimator, X_train, y_train)
print("Training performance:\n", dtree_estimator_model_train_perf)
dtree_estimator_model_test_perf = model_performance_classification_sklearn(dtree_estimator, X_test, y_test)
print("Testing performance:\n", dtree_estimator_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(dtree_estimator, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1   ROC-AUC
0   0.759777  0.839572   0.613281  0.708804  0.778358
Testing performance:
    Accuracy    Recall  Precision        F1   ROC-AUC
0   0.692641  0.753086   0.544643  0.632124  0.706543
In [46]:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(class_weight={0: 0.35, 1: 0.65}, random_state=1)

# Grid of parameters to choose from
parameters = {
    'max_depth': list(np.arange(3, 10, 1)),
    'max_features': np.arange(0.6, 1.1, 0.1),
    'max_samples': np.arange(0.7, 1.1, 0.1),
    'min_samples_split': np.arange(2, 20, 5),
    'n_estimators': np.arange(30, 160, 20),
    'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
Out[46]:
RandomForestClassifier(class_weight={0: 0.35, 1: 0.65}, max_depth=6, max_features=0.6, max_samples=0.9999999999999999, min_impurity_decrease=0.01, n_estimators=30, random_state=1)
In [47]:
# Calculating different metrics
rf_tuned_model_train_perf = model_performance_classification_sklearn(rf_tuned, X_train, y_train)
print("Training performance:\n", rf_tuned_model_train_perf)
rf_tuned_model_test_perf = model_performance_classification_sklearn(rf_tuned, X_test, y_test)
print("Testing performance:\n", rf_tuned_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(rf_tuned, X_test, y_test)
Training performance:
    Accuracy    Recall  Precision        F1   ROC-AUC
0   0.815642  0.882353   0.681818  0.769231  0.831176
Testing performance:
    Accuracy    Recall  Precision        F1   ROC-AUC
0   0.748918  0.716049   0.623656  0.666667  0.741358
In [48]:
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    'max_samples': [0.7, 0.8, 0.9, 1],
    'max_features': [0.7, 0.8, 0.9, 1],
    'n_estimators': [10, 20, 30, 40, 50],
}

# Type of scoring used to compare parameter combinations (recall, as for the other searches)
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
Out[48]:
BaggingClassifier(max_features=0.9, max_samples=0.7, n_estimators=50, random_state=1)
In [49]:
# Calculating different metrics
bagging_estimator_tuned_model_train_perf = model_performance_classification_sklearn(bagging_estimator_tuned, X_train, y_train)
print("Training performance:\n", bagging_estimator_tuned_model_train_perf)
bagging_estimator_tuned_model_test_perf = model_performance_classification_sklearn(bagging_estimator_tuned, X_test, y_test)
print("Testing performance:\n", bagging_estimator_tuned_model_test_perf)

# Creating confusion matrix
confusion_matrix_sklearn(bagging_estimator_tuned, X_test, y_test)
Training performance:
   Accuracy    Recall  Precision        F1   ROC-AUC
0   0.98324  0.957219   0.994444  0.975477  0.977181
Testing performance:
   Accuracy    Recall  Precision        F1   ROC-AUC
0   0.74026  0.518519   0.666667  0.583333  0.689259
In [50]:
# training performance comparison
models_train_comp_df = pd.concat(
    [
        dtree_model_train_perf.T,
        dtree_estimator_model_train_perf.T,
        rf_estimator_model_train_perf.T,
        rf_tuned_model_train_perf.T,
        bagging_classifier_model_train_perf.T,
        bagging_estimator_tuned_model_train_perf.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Estimator",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "Bagging Classifier",
    "Bagging Estimator Tuned",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[50]:
| | Decision Tree | Decision Tree Estimator | Random Forest Estimator | Random Forest Tuned | Bagging Classifier | Bagging Estimator Tuned |
|---|---|---|---|---|---|---|
| Accuracy | 1.0 | 0.759777 | 1.0 | 0.815642 | 0.994413 | 0.983240 |
| Recall | 1.0 | 0.839572 | 1.0 | 0.882353 | 0.983957 | 0.957219 |
| Precision | 1.0 | 0.613281 | 1.0 | 0.681818 | 1.000000 | 0.994444 |
| F1 | 1.0 | 0.708804 | 1.0 | 0.769231 | 0.991914 | 0.975477 |
| ROC-AUC | 1.0 | 0.778358 | 1.0 | 0.831176 | 0.991979 | 0.977181 |
In [51]:
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        dtree_model_test_perf.T,
        dtree_estimator_model_test_perf.T,
        rf_estimator_model_test_perf.T,
        rf_tuned_model_test_perf.T,
        bagging_classifier_model_test_perf.T,
        bagging_estimator_tuned_model_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Estimator",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "Bagging Classifier",
    "Bagging Estimator Tuned",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[51]:
| | Decision Tree | Decision Tree Estimator | Random Forest Estimator | Random Forest Tuned | Bagging Classifier | Bagging Estimator Tuned |
|---|---|---|---|---|---|---|
| Accuracy | 0.731602 | 0.692641 | 0.753247 | 0.748918 | 0.744589 | 0.740260 |
| Recall | 0.580247 | 0.753086 | 0.543210 | 0.716049 | 0.555556 | 0.518519 |
| Precision | 0.626667 | 0.544643 | 0.687500 | 0.623656 | 0.661765 | 0.666667 |
| F1 | 0.602564 | 0.632124 | 0.606897 | 0.666667 | 0.604027 | 0.583333 |
| ROC-AUC | 0.696790 | 0.706543 | 0.704938 | 0.741358 | 0.701111 | 0.689259 |
In [52]:
# Text report showing the rules of the tuned decision tree
feature_names = list(X_train.columns)
print(tree.export_text(dtree_estimator, feature_names=feature_names, show_weights=True))
|--- Glucose <= 127.50
|   |--- Age <= 28.50
|   |   |--- weights: [59.50, 7.80] class: 0
|   |--- Age > 28.50
|   |   |--- Glucose <= 99.50
|   |   |   |--- weights: [16.10, 2.60] class: 0
|   |   |--- Glucose > 99.50
|   |   |   |--- weights: [19.60, 29.90] class: 1
|--- Glucose > 127.50
|   |--- BMI <= 28.85
|   |   |--- weights: [12.25, 9.10] class: 0
|   |--- BMI > 28.85
|   |   |--- weights: [15.05, 72.15] class: 1
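The same tree can also be rendered graphically. This cell is an addition to the original notebook; it uses sklearn's tree.plot_tree (already imported above via from sklearn import tree) and the feature_names list from the previous cell:

# a minimal sketch: plot the tuned decision tree with feature names and class labels
plt.figure(figsize=(15, 8))
tree.plot_tree(
    dtree_estimator,
    feature_names=feature_names,
    class_names=["0", "1"],
    filled=True,
    fontsize=10,
)
plt.show()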
In [53]:
feature_names = X_train.columns
importances = dtree_estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
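Since the tuned random forest posted the best test F1 and ROC-AUC in the comparison above, its feature importances are worth a look as well. This cell is an addition to the original notebook; it reuses the same plotting approach with rf_tuned.feature_importances_:

# same bar plot, but for the tuned random forest's impurity-based importances
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title('Feature Importances (Tuned Random Forest)')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()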