Wine is a beverage made from fermented grapes and other fruit juices with a low amount of alcohol content. Wine is the second most popular alcoholic drink in the world after beer, and it is one of the most highly consumed beverages.
Generally, the quality of wine is graded based on the taste of the wine and vintage but this process is time-consuming, costly, and not efficient as the quality of the wine also depends on other physiochemical attributes like fixed acidity, volatile acidity, etc. Also, it is not always possible to ensure wine quality by experts when there is a huge demand for the product as it will increase the cost significantly.
Moonshine is a red wine company that produces premium high-quality wines. The company wants to improve its production efficiency and reduce the cost and additional time involved in wine tasting. You as a data scientist at Moonshine company have to build a predictive model that can help to identify the premium quality wines using the available data.
In [6]:
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings('ignore')
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Libraries to split data, impute missing values
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
# Libtune to tune model, get different metric scores
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
In [7]:
wine = pd.read_csv('winequality.csv',sep=';')
In [8]:
# copying data to another varaible to avoid any changes to original data data = wine.copy()
In [9]:
data.head()
Out[9]:
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
In [10]:
data.tail()
Out[10]:
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
In [11]:
data.shape
Out[11]:
(1599, 12)
In [12]:
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1599 entries, 0 to 1598 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fixed acidity 1599 non-null float64 1 volatile acidity 1599 non-null float64 2 citric acid 1599 non-null float64 3 residual sugar 1599 non-null float64 4 chlorides 1599 non-null float64 5 free sulfur dioxide 1599 non-null float64 6 total sulfur dioxide 1599 non-null float64 7 density 1599 non-null float64 8 pH 1599 non-null float64 9 sulphates 1599 non-null float64 10 alcohol 1599 non-null float64 11 quality 1599 non-null int64 dtypes: float64(11), int64(1) memory usage: 150.0 KB
Observations-
In [13]:
data.describe().T
Out[13]:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1599.0 | 8.319637 | 1.741096 | 4.60000 | 7.1000 | 7.90000 | 9.200000 | 15.90000 |
| volatile acidity | 1599.0 | 0.527821 | 0.179060 | 0.12000 | 0.3900 | 0.52000 | 0.640000 | 1.58000 |
| citric acid | 1599.0 | 0.270976 | 0.194801 | 0.00000 | 0.0900 | 0.26000 | 0.420000 | 1.00000 |
| residual sugar | 1599.0 | 2.538806 | 1.409928 | 0.90000 | 1.9000 | 2.20000 | 2.600000 | 15.50000 |
| chlorides | 1599.0 | 0.087467 | 0.047065 | 0.01200 | 0.0700 | 0.07900 | 0.090000 | 0.61100 |
| free sulfur dioxide | 1599.0 | 15.874922 | 10.460157 | 1.00000 | 7.0000 | 14.00000 | 21.000000 | 72.00000 |
| total sulfur dioxide | 1599.0 | 46.467792 | 32.895324 | 6.00000 | 22.0000 | 38.00000 | 62.000000 | 289.00000 |
| density | 1599.0 | 0.996747 | 0.001887 | 0.99007 | 0.9956 | 0.99675 | 0.997835 | 1.00369 |
| pH | 1599.0 | 3.311113 | 0.154386 | 2.74000 | 3.2100 | 3.31000 | 3.400000 | 4.01000 |
| sulphates | 1599.0 | 0.658149 | 0.169507 | 0.33000 | 0.5500 | 0.62000 | 0.730000 | 2.00000 |
| alcohol | 1599.0 | 10.422983 | 1.065668 | 8.40000 | 9.5000 | 10.20000 | 11.100000 | 14.90000 |
| quality | 1599.0 | 5.636023 | 0.807569 | 3.00000 | 5.0000 | 6.00000 | 6.000000 | 8.00000 |
Observations-
In [14]:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
In [15]:
histogram_boxplot(data,'fixed acidity')

In [16]:
histogram_boxplot(data,'volatile acidity')

In [17]:
histogram_boxplot(data,'citric acid')

In [18]:
histogram_boxplot(data,'residual sugar')

In [19]:
histogram_boxplot(data,'chlorides')

In [20]:
histogram_boxplot(data,'free sulfur dioxide')

In [21]:
histogram_boxplot(data,'total sulfur dioxide')

In [22]:
#Calculating top 5 values data['total sulfur dioxide'].sort_values(ascending=False).head()
Out[22]:
1081 289.0 1079 278.0 354 165.0 1244 160.0 651 155.0 Name: total sulfur dioxide, dtype: float64
In [23]:
#Capping the two extreme values data['total sulfur dioxide']=data['total sulfur dioxide'].clip(upper=165)
In [24]:
histogram_boxplot(data,'density')

In [25]:
histogram_boxplot(data,'pH')

In [26]:
histogram_boxplot(data,'sulphates')

In [27]:
histogram_boxplot(data,'alcohol')

In [28]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
In [29]:
labeled_barplot(data,"quality",perc=True)

In [30]:
# defining bins bins = (2, 6, 8) # defining labels labels = ['non-premium', 'premium'] data['quality_class'] = pd.cut(x = data['quality'], bins = bins, labels = labels)
In [31]:
data['quality_class'].value_counts()
Out[31]:
non-premium 1382 premium 217 Name: quality_class, dtype: int64
In [32]:
plt.figure(figsize=(10,7)) sns.heatmap(data.corr(),annot=True,vmin=-1,vmax=1,fmt='.1g',cmap="Spectral") plt.show()

In [33]:
sns.pairplot(data,hue='quality_class') plt.show()

In [34]:
cols = data[['fixed acidity', 'volatile acidity', 'citric acid']].columns.tolist()
plt.figure(figsize=(12,5))
for i, variable in enumerate(cols):
plt.subplot(1,3,i+1)
sns.boxplot(data['quality_class'],data[variable],palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()

In [35]:
cols = data[['free sulfur dioxide', 'total sulfur dioxide', 'sulphates']].columns.tolist()
plt.figure(figsize=(12,5))
for i, variable in enumerate(cols):
plt.subplot(1,3,i+1)
sns.boxplot(data['quality_class'],data[variable],palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()

In [36]:
## function to plot boxplots w.rt quality
def boxplot(x):
plt.figure(figsize=(7,5))
sns.boxplot(data['quality_class'],x,palette="PuBu")
plt.show()
In [37]:
boxplot(data['chlorides'])

In [38]:
sns.boxplot(data['quality_class'],data['chlorides'],showfliers=False,palette='PuBu');

In [39]:
boxplot(data['density'])

In [40]:
boxplot(data['pH'])

In [41]:
boxplot(data['residual sugar'])

In [42]:
boxplot(data['alcohol'])

In [43]:
data.drop('quality', axis=1, inplace=True)
X = data.drop('quality_class', axis=1)
y = data['quality_class'].apply(lambda x : 0 if x=='non-premium' else 1 )
In [44]:
# Splitting data into training and test set: X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1,stratify=y) print(X_train.shape, X_test.shape)
(1119, 11) (480, 11)
Note: The stratify argument maintains the original distribution of classes in the target variable while splitting the data into train and test sets.
In [45]:
y.value_counts(1)
Out[45]:
0 0.86429 1 0.13571 Name: quality_class, dtype: float64
In [46]:
y_test.value_counts(1)
Out[46]:
0 0.864583 1 0.135417 Name: quality_class, dtype: float64
The model can make wrong predictions as:
Which case is more important?
Which metric to optimize?
Let’s define a function to provide metric scores on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
In [47]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
In [48]:
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
In [49]:
#Fitting the model
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train,y_train)
#Calculating different metrics
d_tree_model_train_perf=model_performance_classification_sklearn(d_tree,X_train,y_train)
print("Training performance:\n",d_tree_model_train_perf)
d_tree_model_test_perf=model_performance_classification_sklearn(d_tree,X_test,y_test)
print("Testing performance:\n",d_tree_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(d_tree,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.8875 0.6 0.58209 0.590909

In [50]:
#Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight={0:0.18,1:0.72},random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2,30),
'min_samples_leaf': [1, 2, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10,15],
'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer,n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
Out[50]:
DecisionTreeClassifier(class_weight={0: 0.18, 1: 0.72}, max_depth=4,
max_leaf_nodes=15, min_impurity_decrease=0.0001,
min_samples_leaf=10, random_state=1)
In [51]:
#Calculating different metrics
dtree_estimator_model_train_perf=model_performance_classification_sklearn(dtree_estimator,X_train,y_train)
print("Training performance:\n",dtree_estimator_model_train_perf)
dtree_estimator_model_test_perf=model_performance_classification_sklearn(dtree_estimator,X_test,y_test)
print("Testing performance:\n",dtree_estimator_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(dtree_estimator,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.853441 0.809211 0.476744 0.6
Testing performance:
Accuracy Recall Precision F1
0 0.789583 0.707692 0.359375 0.476684

In [52]:
#Fitting the model
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)
#Calculating different metrics
rf_estimator_model_train_perf=model_performance_classification_sklearn(rf_estimator,X_train,y_train)
print("Training performance:\n",rf_estimator_model_train_perf)
rf_estimator_model_test_perf=model_performance_classification_sklearn(rf_estimator,X_test,y_test)
print("Testing performance:\n",rf_estimator_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(rf_estimator,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.91875 0.492308 0.842105 0.621359

In [53]:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(class_weight={0:0.18,1:0.82},random_state=1,oob_score=True,bootstrap=True)
parameters = {
'max_depth': list(np.arange(5,30,5)) + [None],
'max_features': ['sqrt','log2',None],
'min_samples_leaf': np.arange(1,15,5),
'min_samples_split': np.arange(2, 20, 5),
'n_estimators': np.arange(10,110,10)}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5,n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
Out[53]:
RandomForestClassifier(class_weight={0: 0.18, 1: 0.82}, max_depth=10,
max_features='sqrt', min_samples_leaf=6,
min_samples_split=17, n_estimators=40, oob_score=True,
random_state=1)
In [54]:
#Calculating different metrics
rf_tuned_model_train_perf=model_performance_classification_sklearn(rf_tuned,X_train,y_train)
print("Training performance:\n",rf_tuned_model_train_perf)
rf_tuned_model_test_perf=model_performance_classification_sklearn(rf_tuned,X_test,y_test)
print("Testing performance:\n",rf_tuned_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(rf_tuned,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.935657 0.881579 0.712766 0.788235
Testing performance:
Accuracy Recall Precision F1
0 0.9 0.723077 0.61039 0.661972

In [55]:
#Fitting the model bagging_classifier = BaggingClassifier(random_state=1) bagging_classifier.fit(X_train,y_train) #Calculating different metrics bagging_classifier_model_train_perf=model_performance_classification_sklearn(bagging_classifier,X_train,y_train) print(bagging_classifier_model_train_perf) bagging_classifier_model_test_perf=model_performance_classification_sklearn(bagging_classifier,X_test,y_test) print(bagging_classifier_model_test_perf) #Creating confusion matrix confusion_matrix_sklearn(bagging_classifier,X_test,y_test)
Accuracy Recall Precision F1 0 0.983021 0.875 1.0 0.933333 Accuracy Recall Precision F1 0 0.916667 0.584615 0.745098 0.655172

In [56]:
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_samples': [0.7,0.8,0.9,1],
'max_features': [0.7,0.8,0.9,1],
'n_estimators' : [10,20,30,40,50],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
Out[56]:
BaggingClassifier(max_features=0.7, max_samples=0.9, n_estimators=50,
random_state=1)
In [57]:
#Calculating different metrics bagging_estimator_tuned_model_train_perf=model_performance_classification_sklearn(bagging_estimator_tuned,X_train,y_train) print(bagging_estimator_tuned_model_train_perf) bagging_estimator_tuned_model_test_perf=model_performance_classification_sklearn(bagging_estimator_tuned,X_test,y_test) print(bagging_estimator_tuned_model_test_perf) #Creating confusion matrix confusion_matrix_sklearn(bagging_estimator_tuned,X_test,y_test)
Accuracy Recall Precision F1 0 0.999106 0.993421 1.0 0.9967 Accuracy Recall Precision F1 0 0.90625 0.461538 0.75 0.571429

In [58]:
#Fitting the model ab_classifier = AdaBoostClassifier(random_state=1) ab_classifier.fit(X_train,y_train) #Calculating different metrics ab_classifier_model_train_perf=model_performance_classification_sklearn(ab_classifier,X_train,y_train) print(ab_classifier_model_train_perf) ab_classifier_model_test_perf=model_performance_classification_sklearn(ab_classifier,X_test,y_test) print(ab_classifier_model_test_perf) #Creating confusion matrix confusion_matrix_sklearn(ab_classifier,X_test,y_test)
Accuracy Recall Precision F1 0 0.913315 0.565789 0.735043 0.639405 Accuracy Recall Precision F1 0 0.875 0.415385 0.55102 0.473684

In [59]:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3)],
"n_estimators": np.arange(10,110,10),
"learning_rate":np.arange(0.1,2,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
Out[59]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2),
learning_rate=1.4000000000000001, n_estimators=40,
random_state=1)
In [60]:
#Calculating different metrics abc_tuned_model_train_perf=model_performance_classification_sklearn(abc_tuned,X_train,y_train) print(abc_tuned_model_train_perf) abc_tuned_model_test_perf=model_performance_classification_sklearn(abc_tuned,X_test,y_test) print(abc_tuned_model_test_perf) #Creating confusion matrix confusion_matrix_sklearn(abc_tuned,X_test,y_test)
Accuracy Recall Precision F1 0 0.989276 0.934211 0.986111 0.959459 Accuracy Recall Precision F1 0 0.877083 0.492308 0.551724 0.520325

In [61]:
#Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)
#Calculating different metrics
gb_classifier_model_train_perf=model_performance_classification_sklearn(gb_classifier,X_train,y_train)
print("Training performance:\n",gb_classifier_model_train_perf)
gb_classifier_model_test_perf=model_performance_classification_sklearn(gb_classifier,X_test,y_test)
print("Testing performance:\n",gb_classifier_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(gb_classifier,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.969616 0.782895 0.991667 0.875
Testing performance:
Accuracy Recall Precision F1
0 0.90625 0.492308 0.727273 0.587156

In [62]:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
Out[62]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9, n_estimators=250, random_state=1,
subsample=0.9)
In [63]:
#Calculating different metrics
gbc_tuned_model_train_perf=model_performance_classification_sklearn(gbc_tuned,X_train,y_train)
print("Training performance:\n",gbc_tuned_model_train_perf)
gbc_tuned_model_test_perf=model_performance_classification_sklearn(gbc_tuned,X_test,y_test)
print("Testing performance:\n",gbc_tuned_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(gbc_tuned,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.993744 0.953947 1.0 0.976431
Testing performance:
Accuracy Recall Precision F1
0 0.910417 0.523077 0.73913 0.612613

In [64]:
#Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_classifier.fit(X_train,y_train)
#Calculating different metrics
xgb_classifier_model_train_perf=model_performance_classification_sklearn(xgb_classifier,X_train,y_train)
print("Training performance:\n",xgb_classifier_model_train_perf)
xgb_classifier_model_test_perf=model_performance_classification_sklearn(xgb_classifier,X_test,y_test)
print("Testing performance:\n",xgb_classifier_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(xgb_classifier,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.95353 0.703947 0.938596 0.804511
Testing performance:
Accuracy Recall Precision F1
0 0.9 0.492308 0.680851 0.571429

In [65]:
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric='logloss')
# Grid of parameters to choose from
parameters = {
"n_estimators": [10,30,50],
"scale_pos_weight":[1,2,5],
"subsample":[0.7,0.9,1],
"learning_rate":[0.05, 0.1,0.2],
"colsample_bytree":[0.7,0.9,1],
"colsample_bylevel":[0.5,0.7,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
Out[65]:
XGBClassifier(colsample_bylevel=0.7, colsample_bytree=0.9,
eval_metric='logloss', learning_rate=0.2, n_estimators=50,
random_state=1, scale_pos_weight=5)
In [66]:
#Calculating different metrics
xgb_tuned_model_train_perf=model_performance_classification_sklearn(xgb_tuned,X_train,y_train)
print("Training performance:\n",xgb_tuned_model_train_perf)
xgb_tuned_model_test_perf=model_performance_classification_sklearn(xgb_tuned,X_test,y_test)
print("Testing performance:\n",xgb_tuned_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(xgb_tuned,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.924039 0.973684 0.646288 0.776903
Testing performance:
Accuracy Recall Precision F1
0 0.858333 0.707692 0.484211 0.575

In [67]:
estimators = [('Random Forest',rf_tuned), ('Gradient Boosting',gbc_tuned), ('Decision Tree',dtree_estimator)]
final_estimator = xgb_tuned
stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)
stacking_classifier.fit(X_train,y_train)
Out[67]:
StackingClassifier(estimators=[('Random Forest',
RandomForestClassifier(class_weight={0: 0.18,
1: 0.82},
max_depth=10,
max_features='sqrt',
min_samples_leaf=6,
min_samples_split=17,
n_estimators=40,
oob_score=True,
random_state=1)),
('Gradient Boosting',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9,
n_estimators=250,
random_state=1,
subsample=0.9)),
('Decision Tree',
DecisionTreeClassifier(class_weight={0: 0.18,
1: 0.72},
max_depth=4,
max_leaf_nodes=15,
min_impurity_decrease=0.0001,
min_samples_leaf=10,
random_state=1))],
final_estimator=XGBClassifier(colsample_bylevel=0.7,
colsample_bytree=0.9,
eval_metric='logloss',
learning_rate=0.2,
n_estimators=50,
random_state=1,
scale_pos_weight=5))
In [68]:
#Calculating different metrics
stacking_classifier_model_train_perf=model_performance_classification_sklearn(stacking_classifier,X_train,y_train)
print("Training performance:\n",stacking_classifier_model_train_perf)
stacking_classifier_model_test_perf=model_performance_classification_sklearn(stacking_classifier,X_test,y_test)
print("Testing performance:\n",stacking_classifier_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.898123 0.986842 0.572519 0.724638
Testing performance:
Accuracy Recall Precision F1
0 0.852083 0.876923 0.475 0.616216

In [69]:
# training performance comparison
models_train_comp_df = pd.concat(
[d_tree_model_train_perf.T,dtree_estimator_model_train_perf.T,rf_estimator_model_train_perf.T,rf_tuned_model_train_perf.T,
bagging_classifier_model_train_perf.T,bagging_estimator_tuned_model_train_perf.T,ab_classifier_model_train_perf.T,
abc_tuned_model_train_perf.T,gb_classifier_model_train_perf.T,gbc_tuned_model_train_perf.T,xgb_classifier_model_train_perf.T,
xgb_tuned_model_train_perf.T,stacking_classifier_model_train_perf.T],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree",
"Decision Tree Estimator",
"Random Forest Estimator",
"Random Forest Tuned",
"Bagging Classifier",
"Bagging Estimator Tuned",
"Adaboost Classifier",
"Adabosst Classifier Tuned",
"Gradient Boost Classifier",
"Gradient Boost Classifier Tuned",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[69]:
| Decision Tree | Decision Tree Estimator | Random Forest Estimator | Random Forest Tuned | Bagging Classifier | Bagging Estimator Tuned | Adaboost Classifier | Adabosst Classifier Tuned | Gradient Boost Classifier | Gradient Boost Classifier Tuned | XGBoost Classifier | XGBoost Classifier Tuned | Stacking Classifier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 1.0 | 0.853441 | 1.0 | 0.935657 | 0.983021 | 0.999106 | 0.913315 | 0.989276 | 0.969616 | 0.993744 | 0.953530 | 0.924039 | 0.898123 |
| Recall | 1.0 | 0.809211 | 1.0 | 0.881579 | 0.875000 | 0.993421 | 0.565789 | 0.934211 | 0.782895 | 0.953947 | 0.703947 | 0.973684 | 0.986842 |
| Precision | 1.0 | 0.476744 | 1.0 | 0.712766 | 1.000000 | 1.000000 | 0.735043 | 0.986111 | 0.991667 | 1.000000 | 0.938596 | 0.646288 | 0.572519 |
| F1 | 1.0 | 0.600000 | 1.0 | 0.788235 | 0.933333 | 0.996700 | 0.639405 | 0.959459 | 0.875000 | 0.976431 | 0.804511 | 0.776903 | 0.724638 |
In [70]:
# testing performance comparison
models_test_comp_df = pd.concat(
[d_tree_model_test_perf.T,dtree_estimator_model_test_perf.T,rf_estimator_model_test_perf.T,rf_tuned_model_test_perf.T,
bagging_classifier_model_test_perf.T,bagging_estimator_tuned_model_test_perf.T,ab_classifier_model_test_perf.T,
abc_tuned_model_test_perf.T,gb_classifier_model_test_perf.T,gbc_tuned_model_test_perf.T,xgb_classifier_model_test_perf.T,
xgb_tuned_model_test_perf.T,stacking_classifier_model_test_perf.T],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree",
"Decision Tree Estimator",
"Random Forest Estimator",
"Random Forest Tuned",
"Bagging Classifier",
"Bagging Estimator Tuned",
"Adaboost Classifier",
"Adabosst Classifier Tuned",
"Gradient Boost Classifier",
"Gradient Boost Classifier Tuned",
"XGBoost Classifier",
"XGBoost Classifier Tuned",
"Stacking Classifier"]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[70]:
| Decision Tree | Decision Tree Estimator | Random Forest Estimator | Random Forest Tuned | Bagging Classifier | Bagging Estimator Tuned | Adaboost Classifier | Adabosst Classifier Tuned | Gradient Boost Classifier | Gradient Boost Classifier Tuned | XGBoost Classifier | XGBoost Classifier Tuned | Stacking Classifier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.887500 | 0.789583 | 0.918750 | 0.900000 | 0.916667 | 0.906250 | 0.875000 | 0.877083 | 0.906250 | 0.910417 | 0.900000 | 0.858333 | 0.852083 |
| Recall | 0.600000 | 0.707692 | 0.492308 | 0.723077 | 0.584615 | 0.461538 | 0.415385 | 0.492308 | 0.492308 | 0.523077 | 0.492308 | 0.707692 | 0.876923 |
| Precision | 0.582090 | 0.359375 | 0.842105 | 0.610390 | 0.745098 | 0.750000 | 0.551020 | 0.551724 | 0.727273 | 0.739130 | 0.680851 | 0.484211 | 0.475000 |
| F1 | 0.590909 | 0.476684 | 0.621359 | 0.661972 | 0.655172 | 0.571429 | 0.473684 | 0.520325 | 0.587156 | 0.612613 | 0.571429 | 0.575000 | 0.616216 |
In [71]:
feature_names = X_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
