When a bank receives a loan application, it has to decide, based on the applicant’s profile, whether or not to approve the loan. Two types of risk are associated with the bank’s decision.
To minimize the losses that come with these risks, HRE bank wants to automate this process with a predictive model that predicts whether a customer is at risk of default, based on the customer’s demographic and socio-economic profile.
As a data scientist at HRE bank, you have been assigned the task of building this predictive model.
The objective is to build a model that predicts whether a person will default or not. In this dataset, the target variable is ‘Risk’.
In [2]:
# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To get different metric scores, and split data
# (plot_confusion_matrix is not imported: it is unused below and was removed in scikit-learn 1.2)
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
The nb_black extension is already loaded. To reload it, use: %reload_ext nb_black
In [3]:
# Loading the dataset
german = pd.read_csv("German_Credit.csv")
In [4]:
# Checking the number of rows and columns in the data
german.shape
Out[4]:
(1000, 10)
In [5]:
data = german.copy()
In [6]:
# let's view the first 5 rows of the data
data.head()
Out[6]:
| | Age | Sex | Job | Housing | Saving accounts | Checking account | Credit amount | Duration | Purpose | Risk |
---|---|---|---|---|---|---|---|---|---|---|
0 | 67 | male | 2 | own | NaN | little | 1169 | 6 | radio/TV | 0 |
1 | 22 | female | 2 | own | little | moderate | 5951 | 48 | radio/TV | 1 |
2 | 49 | male | 1 | own | little | NaN | 2096 | 12 | education | 0 |
3 | 45 | male | 2 | free | little | little | 7882 | 42 | furniture/equipment | 0 |
4 | 53 | male | 2 | free | little | little | 4870 | 24 | car | 1 |
In [7]:
# let's view the last 5 rows of the data
data.tail()
Out[7]:
| | Age | Sex | Job | Housing | Saving accounts | Checking account | Credit amount | Duration | Purpose | Risk |
---|---|---|---|---|---|---|---|---|---|---|
995 | 31 | female | 1 | own | little | NaN | 1736 | 12 | furniture/equipment | 0 |
996 | 40 | male | 3 | own | little | little | 3857 | 30 | car | 0 |
997 | 38 | male | 2 | own | little | NaN | 804 | 12 | radio/TV | 0 |
998 | 23 | male | 2 | free | little | little | 1845 | 45 | radio/TV | 1 |
999 | 27 | male | 2 | own | moderate | moderate | 4576 | 45 | car | 0 |
In [8]:
# let's check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Age               1000 non-null   int64
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64
 3   Housing           1000 non-null   object
 4   Saving accounts   817 non-null    object
 5   Checking account  606 non-null    object
 6   Credit amount     1000 non-null   int64
 7   Duration          1000 non-null   int64
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   int64
dtypes: int64(5), object(5)
memory usage: 78.2+ KB
In [9]:
# let's check for duplicate values in the data
data.duplicated().sum()
Out[9]:
0
In [10]:
# let's check the percentage of missing values in each column
round(data.isnull().sum() / data.isnull().count() * 100, 2)
Out[10]:
Age                 0.000
Sex                 0.000
Job                 0.000
Housing             0.000
Saving accounts    18.300
Checking account   39.400
Credit amount       0.000
Duration            0.000
Purpose             0.000
Risk                0.000
dtype: float64
The Saving accounts column has 18.3% missing values out of the total observations.
The Checking account column has 39.4% missing values out of the total observations.
In [11]:
# Checking the number of null values in each column
data.isna().sum()
Out[11]:
Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64
Let’s check the number of unique values in each column
In [12]:
data.nunique()
Out[12]:
Age                  53
Sex                   2
Job                   4
Housing               3
Saving accounts       4
Checking account      3
Credit amount       921
Duration             33
Purpose               8
Risk                  2
dtype: int64
In [13]:
# let's view the statistical summary of the numerical columns in the data
data.describe().T
Out[13]:
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Age | 1000.000 | 35.546 | 11.375 | 19.000 | 27.000 | 33.000 | 42.000 | 75.000 |
Job | 1000.000 | 1.904 | 0.654 | 0.000 | 2.000 | 2.000 | 2.000 | 3.000 |
Credit amount | 1000.000 | 3271.258 | 2822.737 | 250.000 | 1365.500 | 2319.500 | 3972.250 | 18424.000 |
Duration | 1000.000 | 20.903 | 12.059 | 4.000 | 12.000 | 18.000 | 24.000 | 72.000 |
Risk | 1000.000 | 0.300 | 0.458 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
Checking the value count for each category of categorical variables
In [14]:
# Making a list of all categorical variables
cat_col = [
    "Sex",
    "Job",
    "Housing",
    "Saving accounts",
    "Checking account",
    "Purpose",
    "Risk",
]

# Printing the count of each unique value in each column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 40)
male      690
female    310
Name: Sex, dtype: int64
----------------------------------------
2    630
1    200
3    148
0     22
Name: Job, dtype: int64
----------------------------------------
own     713
rent    179
free    108
Name: Housing, dtype: int64
----------------------------------------
little        603
moderate      103
quite rich     63
rich           48
Name: Saving accounts, dtype: int64
----------------------------------------
little      274
moderate    269
rich         63
Name: Checking account, dtype: int64
----------------------------------------
car                    337
radio/TV               280
furniture/equipment    181
business                97
education               59
repairs                 22
vacation/others         12
domestic appliances     12
Name: Purpose, dtype: int64
----------------------------------------
0    700
1    300
Name: Risk, dtype: int64
----------------------------------------
In [15]:
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    # histogram; bins is passed only when it is given
    # (the original palette argument is dropped, since it has no effect without hue)
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [16]:
# Observations on Age
histogram_boxplot(data, "Age")
In [17]:
histogram_boxplot(data, "Credit amount")
In [18]:
histogram_boxplot(data, "Duration")
In [19]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with count or percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage

    plt.show()  # show the plot
In [20]:
# observations on Risk
labeled_barplot(data, "Risk")
In [21]:
# observations on Sex
labeled_barplot(data, "Sex")
In [22]:
# observations on Housing
labeled_barplot(data, "Housing")
In [23]:
# observations on Job
labeled_barplot(data, "Job")
In [24]:
# observations on Saving accounts
labeled_barplot(data, "Saving accounts")
In [25]:
# observations on Checking account
labeled_barplot(data, "Checking account")
In [26]:
# observations on Purpose
labeled_barplot(data, "Purpose")
In [27]:
sns.pairplot(data, hue="Risk")
Out[27]:
<seaborn.axisgrid.PairGrid at 0x267080d8ac0>
In [28]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Age", data=data, orient="vertical")
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x267075b2d00>
In [29]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Credit amount", data=data, orient="vertical")
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x267080cfa60>
In [30]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Risk", y="Duration", data=data, orient="vertical")
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x2670769e400>
In [31]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Saving accounts", y="Age", data=data)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x2670b4509d0>
In [32]:
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # place the legend outside the plot area
    # (the earlier "lower left" legend call was overridden by this one and is dropped)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [33]:
stacked_barplot(data, "Sex", "Risk")
Risk      0    1   All
Sex
All     700  300  1000
male    499  191   690
female  201  109   310
------------------------------------------------------------------------------------------------------------------------
In [34]:
stacked_barplot(data, "Job", "Risk")
Risk    0    1   All
Job
All   700  300  1000
2     444  186   630
1     144   56   200
3      97   51   148
0      15    7    22
------------------------------------------------------------------------------------------------------------------------
In [35]:
stacked_barplot(data, "Housing", "Risk")
Risk       0    1   All
Housing
All      700  300  1000
own      527  186   713
rent     109   70   179
free      64   44   108
------------------------------------------------------------------------------------------------------------------------
In [36]:
stacked_barplot(data, "Saving accounts", "Risk")
Risk               0    1  All
Saving accounts
All              549  268  817
little           386  217  603
moderate          69   34  103
quite rich        52   11   63
rich              42    6   48
------------------------------------------------------------------------------------------------------------------------
In [37]:
stacked_barplot(data, "Checking account", "Risk")
Risk                0    1  All
Checking account
All               352  254  606
little            139  135  274
moderate          164  105  269
rich               49   14   63
------------------------------------------------------------------------------------------------------------------------
In [38]:
stacked_barplot(data, "Purpose", "Risk")
Risk                   0    1   All
Purpose
All                  700  300  1000
car                  231  106   337
radio/TV             218   62   280
furniture/equipment  123   58   181
business              63   34    97
education             36   23    59
repairs               14    8    22
vacation/others        7    5    12
domestic appliances    8    4    12
------------------------------------------------------------------------------------------------------------------------
In [39]:
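plt.figure(figsize=(15, 7))
# correlation is computed on the numeric columns only; recent pandas versions
# raise an error if corr() is called on a dataframe that contains string columns
sns.heatmap(
    data.select_dtypes(include="number").corr(),
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap="Spectral",
)
plt.show()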
In [40]:
df = data.copy()
In [41]:
X = df.drop(["Risk"], axis=1)
y = df["Risk"]
In [42]:
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(600, 9) (200, 9) (200, 9)
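The two-stage split gives 60/20/20 because the second split takes 25% of the remaining 80%, which is 20% of the full data. A minimal sketch on synthetic labels (the toy arrays are an assumption, not the German credit data) also checks that `stratify` keeps the 30% default rate in each part:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 30% positives
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 300 + [0] * 700)

# Stage 1: hold out 20% of the rows as the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
# Stage 2: 25% of the remaining 80% becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1, stratify=y_tmp
)

sizes = (len(X_tr), len(X_va), len(X_te))
rates = tuple(round(float(part.mean()), 2) for part in (y_tr, y_va, y_te))
print(sizes, rates)
```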
In [43]:
# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Saving accounts", "Checking account"]

# fit and transform the imputer on train data
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])

# transform validation and test data using the imputer fitted on train data
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])
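The imputer is fitted on the training data only, so the fill value for validation and test rows comes from the training distribution. A small sketch with made-up values (the toy frames are hypothetical) shows this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: "little" is the most frequent value in train, "rich" in test
train = pd.DataFrame({"Checking account": ["little", "little", "moderate", np.nan]})
test = pd.DataFrame({"Checking account": ["rich", np.nan, "rich"]})

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputer.fit(train)  # learns the mode from the training data only

# The missing test value is filled with the training mode, not the test mode
filled = imputer.transform(test)
print(imputer.statistics_, filled.ravel())
```

Filling with the training mode keeps information from the held-out rows out of the preprocessing step.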
In [45]:
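# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)

# Aligning validation and test columns with the training columns,
# in case a category is absent from one of the splits
X_val = X_val.reindex(columns=X_train.columns, fill_value=0)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)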
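Calling `pd.get_dummies` on each split separately can produce different columns when a category is missing from one split, which matters for rare levels such as the 12-row `Purpose` categories. A hedged sketch with toy frames (the values are made up for illustration) shows the mismatch and one way to align the columns:

```python
import pandas as pd

# Toy frames: "vacation" never appears in the validation split
train = pd.DataFrame({"Purpose": ["car", "education", "vacation"]})
val = pd.DataFrame({"Purpose": ["car", "education"]})

train_enc = pd.get_dummies(train, drop_first=True)
val_enc = pd.get_dummies(val, drop_first=True)

# Encoding each split separately gives different column sets
print(list(train_enc.columns))
print(list(val_enc.columns))

# Align validation columns to the training columns, filling absent dummies with 0
val_aligned = val_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(val_aligned.columns))
```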
In [46]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models
score = []  # Empty list to store validation recall of the models

# loop through all models to get the mean cross validated score (printed as a percentage)
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

# recall on the validation set (printed as a fraction)
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    score.append(scores)
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 24.444444444444446
Random forest: 24.444444444444446
GBM: 25.0
Adaboost: 25.0
Xgboost: 35.0
dtree: 43.33333333333333

Validation Performance:

Bagging: 0.2833333333333333
Random forest: 0.31666666666666665
GBM: 0.31666666666666665
Adaboost: 0.26666666666666666
Xgboost: 0.36666666666666664
dtree: 0.31666666666666665
In [47]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
We will tune decision tree and xgboost models using GridSearchCV and RandomizedSearchCV. We will also compare the performance and time taken by these two methods – grid search and randomized search.
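Much of the time difference between the two methods comes from how many parameter combinations each one fits. As a rough sketch, `ParameterGrid` from scikit-learn can count the candidates in the two grids used in this notebook; the variable names `dtree_grid` and `xgb_grid` are introduced here only for illustration:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Decision tree grid used in this notebook
dtree_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, None],
    "min_samples_split": [2, 4, 7, 10, 15],
}
# XGBoost grid used in this notebook
xgb_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
    "reg_lambda": [5, 10],
}

n_dtree = len(ParameterGrid(dtree_grid))  # candidates grid search must try
n_xgb = len(ParameterGrid(xgb_grid))
cv = 5
print("decision tree candidates:", n_dtree)
print("xgboost candidates:", n_xgb, "-> grid search fits:", n_xgb * cv)
print("randomized search fits with n_iter=50:", 50 * cv)
```

Grid search fits every candidate, while randomized search fits only `n_iter` sampled candidates, so its cost does not grow with the size of the grid.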
First let’s create two functions to calculate different metrics and confusion matrix, so that we don’t have to use the same code repeatedly for each model.
In [48]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf
In [49]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    # each cell is labelled with its count and its share of all observations
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [50]:
# Defining the model
model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in GridSearchCV
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, None],
    "min_samples_split": [2, 4, 7, 10, 15],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scorer, cv=5)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)

print(
    "Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'criterion': 'gini', 'max_depth': None, 'min_samples_split': 2}
Score: 0.40555555555555556
In [51]:
# Building the model with the best parameters
dtree_tuned1 = DecisionTreeClassifier(
    random_state=1, criterion="gini", max_depth=None, min_samples_split=2
)

# Fit the model on training data
dtree_tuned1.fit(X_train, y_train)
Out[51]:
DecisionTreeClassifier(random_state=1)
In [52]:
# Calculating different metrics on train set
dtree_grid_train = model_performance_classification_sklearn(
    dtree_tuned1, X_train, y_train
)
print("Training performance:")
dtree_grid_train
Training performance:
Out[52]:
| | Accuracy | Recall | Precision | F1 |
---|---|---|---|---|
0 | 1.000 | 1.000 | 1.000 | 1.000 |
In [53]:
# Calculating different metrics on validation set
dtree_grid_val = model_performance_classification_sklearn(dtree_tuned1, X_val, y_val)
print("Validation performance:")
dtree_grid_val
Validation performance:
Out[53]:
| | Accuracy | Recall | Precision | F1 |
---|---|---|---|---|
0 | 0.595 | 0.317 | 0.322 | 0.319 |
In [54]:
# creating confusion matrix
confusion_matrix_sklearn(dtree_tuned1, X_val, y_val)
In [55]:
# Defining the model
model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, None],
    "min_samples_split": [2, 4, 7, 10, 15],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV (samples 20 of the 40 combinations in the grid)
randomized_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=20,
    scoring=scorer,
    cv=5,
    random_state=1,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'min_samples_split': 2, 'max_depth': None, 'criterion': 'entropy'} with CV score=0.36666666666666664:
In [56]:
# Building the model with the best parameters
dtree_tuned2 = DecisionTreeClassifier(
    random_state=1, criterion="entropy", max_depth=None, min_samples_split=2
)

# Fit the model on training data
dtree_tuned2.fit(X_train, y_train)
Out[56]:
DecisionTreeClassifier(criterion='entropy', random_state=1)
In [57]:
# Calculating different metrics on train set
dtree_random_train = model_performance_classification_sklearn(
    dtree_tuned2, X_train, y_train
)
print("Training performance:")
dtree_random_train
Training performance:
Out[57]:
| | Accuracy | Recall | Precision | F1 |
---|---|---|---|---|
0 | 1.000 | 1.000 | 1.000 | 1.000 |
In [58]:
# Calculating different metrics on validation set
dtree_random_val = model_performance_classification_sklearn(dtree_tuned2, X_val, y_val)
print("Validation performance:")
dtree_random_val
Validation performance:
Out[58]:
| | Accuracy | Recall | Precision | F1 |
---|---|---|---|---|
0 | 0.575 | 0.450 | 0.342 | 0.388 |
In [59]:
# creating confusion matrix for the tree tuned with randomized search
confusion_matrix_sklearn(dtree_tuned2, X_val, y_val)
In [60]:
%%time

# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")

# Parameter grid to pass in GridSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
    "reg_lambda": [5, 10],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(
    estimator=model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs=-1, verbose=2
)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        grid_cv.best_params_, grid_cv.best_score_
    )
)
Fitting 5 folds for each of 2304 candidates, totalling 11520 fits
Best parameters are {'gamma': 0, 'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 50, 'reg_lambda': 5, 'scale_pos_weight': 10, 'subsample': 0.8} with CV score=1.0:
Wall time: 4min 15s
In [61]:
# building model with best parameters
xgb_tuned1 = XGBClassifier(
    random_state=1,
    n_estimators=50,
    scale_pos_weight=10,
    subsample=0.8,
    learning_rate=0.01,
    gamma=0,
    eval_metric="logloss",
    reg_lambda=5,
    max_depth=1,
)

# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
Out[61]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, eval_metric='logloss', gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.01, max_delta_step=0, max_depth=1, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=50, n_jobs=8, num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=5, scale_pos_weight=10, subsample=0.8, tree_method='exact', validate_parameters=1, verbosity=None)
In [62]:
# Calculating different metrics on train set
xgboost_grid_train = model_performance_classification_sklearn(
    xgb_tuned1, X_train, y_train
)
print("Training performance:")
xgboost_grid_train
Training performance:
Out[62]:
| | Accuracy | Recall | Precision | F1 |
---|---|---|---|---|
0 | 0.300 | 1.000 | 0.300 | 0.462 |
In [63]:
# Calculating different metrics on validation set
xgboost_grid_val = model_performance_classification_sklearn(xgb_tuned1, X_val, y_val)
print("Validation performance:")
xgboost_grid_val
Validation performance:
Out[63]:
| | Accuracy | Recall | Precision | F1 |
---|---|---|---|---|
0 | 0.300 | 1.000 | 0.300 | 0.462 |
In [64]:
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned1, X_val, y_val)
In [65]:
%%time

# defining model
model = XGBClassifier(random_state=1, eval_metric="logloss")

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 150, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(1, 5, 1),
    "reg_lambda": [5, 10],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
# (the search object gets its own name so it is not confused with the final model xgb_tuned2)
randomized_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    n_jobs=-1,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'reg_lambda': 5, 'n_estimators': 50, 'max_depth': 1, 'learning_rate': 0.01, 'gamma': 1} with CV score=1.0:
Wall time: 5.39 s
In [66]:
# building model with best parameters
xgb_tuned2 = XGBClassifier(
    random_state=1,
    n_estimators=50,
    scale_pos_weight=10,
    gamma=1,
    subsample=0.9,
    learning_rate=0.01,
    eval_metric="logloss",
    max_depth=1,
    reg_lambda=5,
)

# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
Out[66]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, eval_metric='logloss', gamma=1, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.01, max_delta_step=0, max_depth=1, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=50, n_jobs=8, num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=5, scale_pos_weight=10, subsample=0.9, tree_method='exact', validate_parameters=1, verbosity=None)
In [67]:
# Calculating different metrics on train set
xgboost_random_train = model_performance_classification_sklearn(
    xgb_tuned2, X_train, y_train
)
print("Training performance:")
xgboost_random_train
Training performance:
Out[67]:
| | Accuracy | Recall | Precision | F1 |
---|---|---|---|---|
0 | 0.300 | 1.000 | 0.300 | 0.462 |
In [68]:
# Calculating different metrics on validation set
xgboost_random_val = model_performance_classification_sklearn(xgb_tuned2, X_val, y_val)
print("Validation performance:")
xgboost_random_val
Validation performance:
Out[68]:
| | Accuracy | Recall | Precision | F1 |
---|---|---|---|---|
0 | 0.300 | 1.000 | 0.300 | 0.462 |
In [69]:
# creating confusion matrix
confusion_matrix_sklearn(xgb_tuned2, X_val, y_val)
In [70]:
# training performance comparison
models_train_comp_df = pd.concat(
    [
        dtree_grid_train.T,
        dtree_random_train.T,
        xgboost_grid_train.T,
        xgboost_random_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree Tuned with Grid search",
    "Decision Tree Tuned with Random search",
    "Xgboost Tuned with Grid search",
    "Xgboost Tuned with Random Search",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[70]:
| | Decision Tree Tuned with Grid search | Decision Tree Tuned with Random search | Xgboost Tuned with Grid search | Xgboost Tuned with Random Search |
---|---|---|---|---|
Accuracy | 1.000 | 1.000 | 0.300 | 0.300 |
Recall | 1.000 | 1.000 | 1.000 | 1.000 |
Precision | 1.000 | 1.000 | 0.300 | 0.300 |
F1 | 1.000 | 1.000 | 0.462 | 0.462 |
In [71]:
# Validation performance comparison
models_val_comp_df = pd.concat(
    [
        dtree_grid_val.T,
        dtree_random_val.T,
        xgboost_grid_val.T,
        xgboost_random_val.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Decision Tree Tuned with Grid search",
    "Decision Tree Tuned with Random search",
    "Xgboost Tuned with Grid search",
    "Xgboost Tuned with Random Search",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
Out[71]:
| | Decision Tree Tuned with Grid search | Decision Tree Tuned with Random search | Xgboost Tuned with Grid search | Xgboost Tuned with Random Search |
---|---|---|---|---|
Accuracy | 0.595 | 0.575 | 0.300 | 0.300 |
Recall | 0.317 | 0.450 | 1.000 | 1.000 |
Precision | 0.322 | 0.342 | 0.300 | 0.300 |
F1 | 0.319 | 0.388 | 0.462 | 0.462 |
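The XGBoost columns (accuracy 0.300, recall 1.000, precision 0.300, F1 0.462) match what a classifier would score if it flagged every applicant as a defaulter. The notebook does not inspect the predictions directly, so this reading is an inference rather than a confirmed result, but the arithmetic is easy to check. The sketch below rebuilds a 200-row validation target with the stratified 30% default rate and scores an all-positive prediction:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Validation target with the notebook's proportions: 200 rows, 60 defaulters (30%)
y_true = np.array([1] * 60 + [0] * 140)
# A trivial "model" that predicts default for every applicant
y_all_pos = np.ones_like(y_true)

baseline = {
    "Accuracy": accuracy_score(y_true, y_all_pos),
    "Recall": recall_score(y_true, y_all_pos),
    "Precision": precision_score(y_true, y_all_pos),
    "F1": f1_score(y_true, y_all_pos),
}
baseline = {name: round(float(value), 3) for name, value in baseline.items()}
print(baseline)
```

Because recall alone rewards this behaviour, a tuned model with a perfect recall score is worth checking against its precision or F1 before it is preferred over the decision trees.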
In [72]:
feature_names = X_train.columns
importances = xgb_tuned1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
In [73]:
# creating a list of numerical variables
numerical_features = ["Age", "Credit amount", "Duration"]

# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])

# creating a list of categorical variables
categorical_features = [
    "Sex",
    "Job",
    "Housing",
    "Saving accounts",
    "Checking account",
    "Purpose",
]

# creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
# handle_unknown = "ignore" allows the model to handle any unknown category in the test data

# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="passthrough",
)
# remainder = "passthrough" allows variables that are present in the original data
# but not in "numerical_features" or "categorical_features" to pass through the column transformer without any changes
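To illustrate what `handle_unknown="ignore"` does, here is a small sketch on a toy `Housing` column; the category "shared" is made up for the example and does not occur in the dataset. A category not seen during fitting is encoded as a row of zeros instead of raising an error:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
# Fit on the three housing categories present in the data
enc.fit(np.array([["own"], ["rent"], ["free"]], dtype=object))

# Learned categories are stored in sorted order: free, own, rent
known = enc.transform(np.array([["rent"]], dtype=object)).toarray()
# "shared" was never seen during fit, so every one-hot column is 0
unseen = enc.transform(np.array([["shared"]], dtype=object)).toarray()

print(enc.categories_[0])
print(known, unseen)
```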
In [74]:
# Separating target variable and other variables
X = data.drop("Risk", axis=1)
Y = data["Risk"]
In [75]:
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(700, 9) (300, 9)
In [76]:
# Creating new pipeline with best parameters
model = Pipeline(
    steps=[
        ("pre", preprocessor),
        (
            "XGB",
            XGBClassifier(
                random_state=1,
                n_estimators=50,
                scale_pos_weight=10,
                subsample=0.8,
                learning_rate=0.01,
                gamma=0,
                eval_metric="logloss",
                reg_lambda=5,
                max_depth=1,
            ),
        ),
    ]
)

# Fit the model on training data
model.fit(X_train, y_train)
Out[76]:
Pipeline(steps=[('pre', ColumnTransformer(remainder='passthrough', transformers=[('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median'))]), ['Age', 'Credit amount', 'Duration']), ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['Sex', 'Job', 'Housing', 'Saving accounts'... gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.01, max_delta_step=0, max_depth=1, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=50, n_jobs=8, num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=5, scale_pos_weight=10, subsample=0.8, tree_method='exact', validate_parameters=1, verbosity=None))])