General Middleware

Creating Decision Trees for Classification Problems – Find Loan delinquency

Context

DRS bank is facing challenging times. Their NPAs (Non-Performing Assets) has been on a rise recently and a large part of these are due to the loans given to individual customers(borrowers). Chief Risk Officer of the bank decides to put in a scientifically robust framework for approval of loans to individual customers to minimize the risk of loans converting into NPAs and initiates a project for the data science team at the bank. You, as a senior member of the team, are assigned this project.

Objective

To identify the criteria to approve loans for an individual customer such that the likelihood of the loan delinquency is minimized

Key questions to be answered

What are the factors that drive the behavior of loan delinquency?

Dataset

  • ID: Customer ID
  • isDelinquent : indicates whether the customer is delinquent or not (1 => Yes, 0 => No)
  • term: Loan term in months
  • gender: Gender of the borrower
  • age: Age of the borrower
  • purpose: Purpose of Loan
  • home_ownership: Status of borrower’s home
  • FICO: FICO (i.e. the bureau score) of the borrower

Domain Information

  • Transactor – A person who pays his due amount balance full and on time.
  • Revolver – A person who pays the minimum due amount but keeps revolving his balance and does not pay the full amount.
  • Delinquent – Delinquency means that you are behind on payments, a person who fails to pay even the minimum due amount.
  • Defaulter – Once you are delinquent for a certain period your lender will declare you to be in the default stage.
  • Risk Analytics – A wide domain in the financial and banking industry, basically analyzing the risk of the customer.

Import the necessary packages

In [1]:

# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

# Library to suppress warnings or deprecation notes
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# Library to split data
from sklearn.model_selection import train_test_split

# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    plot_confusion_matrix,
    make_scorer,
)

Read the dataset

In [2]:

data = pd.read_csv("Loan_Delinquent_Dataset.csv")

In [3]:

# copying data to another varaible to avoid any changes to original data
loan = data.copy()

View the first and last 5 rows of the dataset.

In [4]:

loan.head()

Out[4]:

IDisDelinquenttermgenderpurposehome_ownershipageFICO
01136 monthsFemaleHouseMortgage>25300-500
12036 monthsFemaleHouseRent20-25>500
23136 monthsFemaleHouseRent>25300-500
34136 monthsFemaleCarMortgage>25300-500
45136 monthsFemaleHouseRent>25300-500

In [5]:

loan.tail()

Out[5]:

IDisDelinquenttermgenderpurposehome_ownershipageFICO
1154311544060 monthsMaleotherMortgage>25300-500
1154411545136 monthsMaleHouseRent20-25300-500
1154511546036 monthsFemalePersonalMortgage20-25>500
1154611547136 monthsFemaleHouseRent20-25300-500
1154711548136 monthsMalePersonalMortgage20-25300-500

Understand the shape of the dataset.

In [6]:

loan.shape

Out[6]:

(11548, 8)
  • The dataset has 11548 rows and 8 columns of data

Check the data types of the columns for the dataset.

In [7]:

loan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11548 entries, 0 to 11547
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID              11548 non-null  int64 
 1   isDelinquent    11548 non-null  int64 
 2   term            11548 non-null  object
 3   gender          11548 non-null  object
 4   purpose         11548 non-null  object
 5   home_ownership  11548 non-null  object
 6   age             11548 non-null  object
 7   FICO            11548 non-null  object
dtypes: int64(2), object(6)
memory usage: 721.9+ KB

Observations –

  • isDelinquent is the dependent variable – type integer.
  • All the dependent variables except for ID are object type.

Summary of the dataset.

In [8]:

loan.describe(include="all")

Out[8]:

IDisDelinquenttermgenderpurposehome_ownershipageFICO
count11548.00000011548.000000115481154811548115481154811548
uniqueNaNNaN227322
topNaNNaN36 monthsMaleHouseMortgage20-25300-500
freqNaNNaN1058965556892546158886370
mean5774.5000000.668601NaNNaNNaNNaNNaNNaN
std3333.7647890.470737NaNNaNNaNNaNNaNNaN
min1.0000000.000000NaNNaNNaNNaNNaNNaN
25%2887.7500000.000000NaNNaNNaNNaNNaNNaN
50%5774.5000001.000000NaNNaNNaNNaNNaNNaN
75%8661.2500001.000000NaNNaNNaNNaNNaNNaN
max11548.0000001.000000NaNNaNNaNNaNNaNNaN

Observations-

  • Most of the loans are for a 36-month term loan.
  • More males have applied for loans than females.
  • Most loan applications are for house loans.
  • Most customers have either mortgaged their houses.
  • Mostly customers in the age group 20-25 have applied for a loan.
  • Most customers have a FICO score between 300 and 500.

In [9]:

# checking for unique values in ID column
loan["ID"].nunique()

Out[9]:

11548
  • Since all the values in ID column are unique we can drop it

In [10]:

loan.drop(["ID"], axis=1, inplace=True)

Check for missing values

In [11]:

loan.isnull().sum()

Out[11]:

isDelinquent      0
term              0
gender            0
purpose           0
home_ownership    0
age               0
FICO              0
dtype: int64
  • There are no missing vaues in out dataset

Univariate analysis

In [12]:

# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

Observations on isDelinquent

In [13]:

labeled_barplot(loan, "isDelinquent", perc=True)
  • 66.9% of the customers are delinquent

Observations on term

In [14]:

labeled_barplot(loan, "term", perc=True)
  • 91.7% of the loans are for a 36 month term.

Observations on gender

In [15]:

labeled_barplot(loan, "gender", perc=True)
  • There are more male applicants (56.8%) than female applicants (43.2%)

Observations on purpose

In [16]:

labeled_barplot(loan, "purpose", perc=True)
  • Most loan applications are for house loans (59.7%) followed by car loans (18%)
  • There are 2 levels named ‘other’ and ‘Other’ under the purpose variable. Since we do not have any other information about these, we can merge these levels.

Observations on home_ownership

In [17]:

labeled_barplot(loan, "home_ownership", perc=True)
  • Very few applicants <10% own their house, Most customers have either mortgaged their houses or live on rent. 

Observations on age

In [18]:

labeled_barplot(loan, "age", perc=True)
  • Almost an equal percentage of people aged 20-25 and >25 have applied for the loan.

Observations on FICO

In [19]:

labeled_barplot(loan, "FICO", perc=True)
  • Most customers have a FICO score between 300 and 500 (55.2%) followed by a score of greater than 500 (44.8%)

Data Cleaning

In [20]:

loan["purpose"].unique()

Out[20]:

array(['House', 'Car', 'Other', 'Personal', 'Wedding', 'Medical', 'other'],
      dtype=object)

We can merge the purpose – ‘other’ and ‘Other’ together

In [21]:

loan["purpose"].replace("other", "Other", inplace=True)

In [22]:

loan["purpose"].unique()

Out[22]:

array(['House', 'Car', 'Other', 'Personal', 'Wedding', 'Medical'],
      dtype=object)

Bivariate Analysis

In [23]:

# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

In [24]:

stacked_barplot(loan, "term", "isDelinquent")
isDelinquent     0     1    All
term                           
All           3827  7721  11548
36 months     3168  7421  10589
60 months      659   300    959
------------------------------------------------------------------------------------------------------------------------
  • Most loan delinquent customers have taken loan for 36 months.

In [25]:

stacked_barplot(loan, "gender", "isDelinquent")
isDelinquent     0     1    All
gender                         
All           3827  7721  11548
Male          1977  4578   6555
Female        1850  3143   4993
------------------------------------------------------------------------------------------------------------------------
  • There’s not much difference between male and female customers.

In [26]:

stacked_barplot(loan, "purpose", "isDelinquent")
isDelinquent     0     1    All
purpose                        
All           3827  7721  11548
House         2272  4620   6892
Car            678  1402   2080
Other          357   653   1010
Personal       274   618    892
Wedding        139   269    408
Medical        107   159    266
------------------------------------------------------------------------------------------------------------------------
  • Most loan delinquent customers are those who have applied for house loans followed by car and personal loans.

In [27]:

stacked_barplot(loan, "home_ownership", "isDelinquent")
isDelinquent       0     1    All
home_ownership                   
All             3827  7721  11548
Mortgage        1831  3630   5461
Rent            1737  3479   5216
Own              259   612    871
------------------------------------------------------------------------------------------------------------------------
  • Those customers who have their own house are less delinquent than the ones who live in a rented place or have mortgaged their home.

In [28]:

stacked_barplot(loan, "age", "isDelinquent")
isDelinquent     0     1    All
age                            
All           3827  7721  11548
>25           1969  3691   5660
20-25         1858  4030   5888
------------------------------------------------------------------------------------------------------------------------
  • Customers between 20-25 years of age are more delinquent.

In [29]:

stacked_barplot(loan, "FICO", "isDelinquent")
isDelinquent     0     1    All
FICO                           
All           3827  7721  11548
>500          2886  2292   5178
300-500        941  5429   6370
------------------------------------------------------------------------------------------------------------------------
  • If FICO score is >500 the chances of delinquency decrease quite a lot compared to when FICO score is between 300-500.

Key Observations –

  • FICO score and term of loan application appear to be very strong indicators of delinquency.
  • Other factors appear to be not very good indicators of delinquency. (We can use chi-square tests to determine statistical significance in the association between two categorical variables).

We observed that a high FICO score means that the chances of delinquency are lower, let us see if any of the other variables indicate higher a FICO score.

In [30]:

stacked_barplot(loan, "home_ownership", "FICO")
FICO            300-500  >500    All
home_ownership                      
All                6370  5178  11548
Mortgage           2857  2604   5461
Rent               3033  2183   5216
Own                 480   391    871
------------------------------------------------------------------------------------------------------------------------

In [31]:

stacked_barplot(loan, "age", "FICO")
FICO   300-500  >500    All
age                        
All       6370  5178  11548
>25       2443  3217   5660
20-25     3927  1961   5888
------------------------------------------------------------------------------------------------------------------------

In [32]:

stacked_barplot(loan, "gender", "FICO")
FICO    300-500  >500    All
gender                      
All        6370  5178  11548
Male       3705  2850   6555
Female     2665  2328   4993
------------------------------------------------------------------------------------------------------------------------

Key Observations

  1. Home ownership and gender seem to have a slight impact on the FICO scores.
  2. Age seems to have a much bigger impact on FICO scores.

Model Building – Approach

  1. Data preparation
  2. Partition the data into train and test set.
  3. Built a CART model on the train data.
  4. Tune the model and prune the tree, if required.

Split Data

In [36]:

X = loan.drop(["isDelinquent"], axis=1)
y = loan["isDelinquent"]

In [37]:

# encoding the categorical variables
X = pd.get_dummies(X, drop_first=True)
X.head()

Out[37]:

term_60 monthsgender_Malepurpose_Housepurpose_Medicalpurpose_Otherpurpose_Personalpurpose_Weddinghome_ownership_Ownhome_ownership_Rentage_>25FICO_>500
000100000010
100100000101
200100000110
300000000010
400100000110

In [38]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

In [39]:

print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6928
Number of rows in test data = 4620

In [40]:

print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Percentage of classes in training set:
1    0.677396
0    0.322604
Name: isDelinquent, dtype: float64
Percentage of classes in test set:
1    0.655411
0    0.344589
Name: isDelinquent, dtype: float64

Build Decision Tree Model

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting a customer will not be behind on payments (Non-Delinquent) but in reality the customer would be behind on payments.
  2. Predicting a customer will be behind on payments (Delinquent) but in reality the customer would not be behind on payments (Non-Delinquent).

Which case is more important?

  • If we predict a non-delinquent customer as a delinquent customer bank would lose an opportunity of providing loan to a potential customer.

How to reduce this loss i.e need to reduce False Negatives?

  • recall should be maximized, the greater the recall higher the chances of minimizing the false negatives.

First, let’s create functions to calculate different metrics and confusion matrix so that we don’t have to use the same code repeatedly for each model.

  • The model_performance_classification_sklearn function will be used to check the model performance of models. 
  • The make_confusion_matrix function will be used to plot confusion matrix.

In [41]:

# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf

In [42]:

def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Build Decision Tree Model

In [43]:

model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)

Out[43]:

DecisionTreeClassifier(random_state=1)

Checking model performance on training set

In [44]:

decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train

Out[44]:

AccuracyRecallPrecisionF1
00.8555140.90880.8815630.894974

In [45]:

confusion_matrix_sklearn(model, X_train, y_train)

Checking model performance on test set

In [46]:

decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test

Out[46]:

AccuracyRecallPrecisionF1
00.8437230.8972920.8686060.882716

In [47]:

confusion_matrix_sklearn(model, X_test, y_test)
  • Model is giving good and generalized results on training and test set.

Visualizing the Decision Tree

In [48]:

column_names = list(X.columns)
feature_names = column_names
print(feature_names)
['term_60 months', 'gender_Male', 'purpose_House', 'purpose_Medical', 'purpose_Other', 'purpose_Personal', 'purpose_Wedding', 'home_ownership_Own', 'home_ownership_Rent', 'age_>25', 'FICO_>500']

In [49]:

plt.figure(figsize=(20, 30))

out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

In [50]:

# Text report showing the rules of a decision tree -

print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- FICO_>500 <= 0.50
|   |--- term_60 months <= 0.50
|   |   |--- age_>25 <= 0.50
|   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |--- gender_Male <= 0.50
|   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 18.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.00, 82.00] class: 1
|   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 3.00] class: 1
|   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 2.00] class: 0
|   |   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [4.00, 16.00] class: 1
|   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- gender_Male >  0.50
|   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [30.00, 147.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [4.00, 15.00] class: 1
|   |   |   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [8.00, 30.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 13.00] class: 1
|   |   |   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [9.00, 51.00] class: 1
|   |   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 18.00] class: 1
|   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [57.00, 438.00] class: 1
|   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |--- weights: [13.00, 70.00] class: 1
|   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |--- gender_Male <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 27.00] class: 1
|   |   |   |   |   |--- gender_Male >  0.50
|   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |--- weights: [8.00, 91.00] class: 1
|   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 13.00] class: 1
|   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |--- gender_Male <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 14.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- gender_Male >  0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [60.00, 201.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 10.00] class: 1
|   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |--- gender_Male <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 10.00] class: 1
|   |   |   |   |   |   |   |   |--- gender_Male >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [9.00, 52.00] class: 1
|   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |--- gender_Male <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 8.00] class: 1
|   |   |   |   |   |   |   |--- gender_Male >  0.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 33.00] class: 1
|   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |--- gender_Male <= 0.50
|   |   |   |   |   |   |   |--- weights: [14.00, 86.00] class: 1
|   |   |   |   |   |   |--- gender_Male >  0.50
|   |   |   |   |   |   |   |--- weights: [114.00, 536.00] class: 1
|   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |--- weights: [0.00, 11.00] class: 1
|   |   |--- age_>25 >  0.50
|   |   |   |--- weights: [0.00, 1291.00] class: 1
|   |--- term_60 months >  0.50
|   |   |--- weights: [196.00, 0.00] class: 0
|--- FICO_>500 >  0.50
|   |--- gender_Male <= 0.50
|   |   |--- age_>25 <= 0.50
|   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [31.00, 11.00] class: 0
|   |   |   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [120.00, 33.00] class: 0
|   |   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [8.00, 1.00] class: 0
|   |   |   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [17.00, 7.00] class: 0
|   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [4.00, 2.00] class: 0
|   |   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 1.00] class: 0
|   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [21.00, 4.00] class: 0
|   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [28.00, 9.00] class: 0
|   |   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [13.00, 5.00] class: 0
|   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |--- weights: [120.00, 28.00] class: 0
|   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |--- weights: [6.00, 1.00] class: 0
|   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |--- weights: [9.00, 1.00] class: 0
|   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |--- weights: [2.00, 1.00] class: 0
|   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |--- weights: [37.00, 6.00] class: 0
|   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |--- age_>25 >  0.50
|   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [49.00, 12.00] class: 0
|   |   |   |   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [53.00, 16.00] class: 0
|   |   |   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [10.00, 3.00] class: 0
|   |   |   |   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [17.00, 3.00] class: 0
|   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 1.00] class: 0
|   |   |   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 1.00] class: 0
|   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [170.00, 48.00] class: 0
|   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |--- weights: [29.00, 10.00] class: 0
|   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |--- weights: [168.00, 54.00] class: 0
|   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |--- weights: [44.00, 21.00] class: 0
|   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |--- weights: [1.00, 1.00] class: 0
|   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |--- weights: [5.00, 1.00] class: 0
|   |   |   |--- purpose_Other >  0.50
|   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |--- weights: [31.00, 15.00] class: 0
|   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |--- weights: [24.00, 13.00] class: 0
|   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |--- weights: [4.00, 1.00] class: 0
|   |--- gender_Male >  0.50
|   |   |--- age_>25 <= 0.50
|   |   |   |--- term_60 months <= 0.50
|   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [21.00, 7.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [96.00, 33.00] class: 0
|   |   |   |   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 1.00] class: 0
|   |   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [18.00, 5.00] class: 0
|   |   |   |   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |   |   |   |--- weights: [17.00, 3.00] class: 0
|   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |--- weights: [2.00, 2.00] class: 0
|   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 2.00] class: 0
|   |   |   |   |   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |--- weights: [12.00, 5.00] class: 0
|   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [14.00, 8.00] class: 0
|   |   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 1.00] class: 0
|   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |--- weights: [71.00, 38.00] class: 0
|   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |--- weights: [8.00, 6.00] class: 0
|   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |--- weights: [9.00, 3.00] class: 0
|   |   |   |--- term_60 months >  0.50
|   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 13.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 9.00] class: 1
|   |   |   |   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 7.00] class: 1
|   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 14.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 8.00] class: 1
|   |   |   |   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 3.00] class: 1
|   |   |   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |   |   |--- weights: [2.00, 4.00] class: 1
|   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |--- weights: [14.00, 53.00] class: 1
|   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |--- weights: [10.00, 41.00] class: 1
|   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |--- weights: [2.00, 12.00] class: 1
|   |   |--- age_>25 >  0.50
|   |   |   |--- term_60 months <= 0.50
|   |   |   |   |--- purpose_Medical <= 0.50
|   |   |   |   |   |--- purpose_Wedding <= 0.50
|   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [19.00, 67.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 29.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [9.00, 48.00] class: 1
|   |   |   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.00, 64.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 33.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [53.00, 219.00] class: 1
|   |   |   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [51.00, 206.00] class: 1
|   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |--- purpose_Other <= 0.50
|   |   |   |   |   |   |   |   |--- purpose_Personal <= 0.50
|   |   |   |   |   |   |   |   |   |--- purpose_House <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 15.00] class: 1
|   |   |   |   |   |   |   |   |   |--- purpose_House >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 38.00] class: 1
|   |   |   |   |   |   |   |   |--- purpose_Personal >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 4.00] class: 1
|   |   |   |   |   |   |   |--- purpose_Other >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 3.00] class: 1
|   |   |   |   |   |--- purpose_Wedding >  0.50
|   |   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |   |--- home_ownership_Own <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |   |   |   |--- home_ownership_Own >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 2.00] class: 0
|   |   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |   |--- weights: [7.00, 14.00] class: 1
|   |   |   |   |--- purpose_Medical >  0.50
|   |   |   |   |   |--- home_ownership_Rent <= 0.50
|   |   |   |   |   |   |--- weights: [4.00, 9.00] class: 1
|   |   |   |   |   |--- home_ownership_Rent >  0.50
|   |   |   |   |   |   |--- weights: [4.00, 8.00] class: 1
|   |   |   |--- term_60 months >  0.50
|   |   |   |   |--- weights: [138.00, 0.00] class: 0

In [51]:

importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • FICO score, duration of loan and gender are the top 3 important features.

Using GridSearch for Hyperparameter tuning of our tree model

  • Let’s see if we can improve our model performance even more.

In [52]:

# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from

parameters = {
    "max_depth": [np.arange(2, 50, 5), None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)

Out[52]:

DecisionTreeClassifier(min_impurity_decrease=0.0001, random_state=1)

Checking performance on training set

In [53]:

decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)
decision_tree_tune_perf_train

Out[53]:

AccuracyRecallPrecisionF1
00.8552250.9100790.8802560.894919

In [54]:

confusion_matrix_sklearn(estimator, X_train, y_train)
  • The Recall has improved on the training set as compared to the initial model.

Checking model performance on test set

In [55]:

decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)

decision_tree_tune_perf_test

Out[55]:

AccuracyRecallPrecisionF1
00.8439390.8986130.8679430.883012

In [56]:

confusion_matrix_sklearn(estimator, X_test, y_test)
  • After hyperparameter tuning the model has performance has remained same and the model has become simpler.

In [57]:

plt.figure(figsize=(15, 12))

tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
plt.show()
  • We are getting a simplified tree after pre-pruning.

Cost Complexity Pruning

In [58]:

clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

In [59]:

pd.DataFrame(path)

Out[59]:

ccp_alphasimpurities
00.000000e+000.226403
10.000000e+000.226403
22.794668e-090.226403
32.244984e-070.226403
44.918264e-070.226404
56.998390e-070.226404
67.597561e-070.226405
71.058874e-060.226406
81.184343e-060.226407
91.386119e-060.226409
102.183321e-060.226411
112.291140e-060.226416
123.665824e-060.226419
133.778517e-060.226423
144.160227e-060.226431
154.169086e-060.226435
164.245347e-060.226440
175.155064e-060.226445
185.244266e-060.226450
195.492923e-060.226456
206.045620e-060.226462
218.340601e-060.226470
228.765875e-060.226479
239.056740e-060.226488
249.751114e-060.226498
251.058022e-050.226519
261.138027e-050.226542
271.155642e-050.226553
281.156951e-050.226576
291.169925e-050.226600
301.174875e-050.226611
311.202848e-050.226623
321.323848e-050.226637
331.507632e-050.226652
341.608110e-050.226668
351.753314e-050.226685
361.979545e-050.226705
372.032168e-050.226725
382.166168e-050.226747
392.168081e-050.226812
402.216324e-050.226834
412.421893e-050.226931
422.477532e-050.226956
432.568272e-050.226982
443.132587e-050.227013
453.194772e-050.227077
463.204299e-050.227109
473.303016e-050.227142
483.424580e-050.227176
493.522919e-050.227211
503.529801e-050.227247
513.745085e-050.227284
523.999700e-050.227324
534.034344e-050.227566
544.156233e-050.227608
554.295438e-050.227651
564.320199e-050.227694
574.340672e-050.227737
585.348017e-050.227791
595.773672e-050.227849
605.995736e-050.227968
617.314108e-050.228115
627.574157e-050.228190
637.818003e-050.228347
648.769179e-050.228435
658.831375e-050.228523
669.072968e-050.228795
671.049759e-040.229005
681.076388e-040.229436
691.117546e-040.229771
701.193296e-040.230009
711.217918e-040.230131
721.233812e-040.230255
731.527711e-040.230407
741.553389e-040.230563
751.773114e-040.230917
761.799582e-040.231097
772.040456e-040.231301
786.198757e-040.231921
795.448168e-030.237369
801.124860e-020.248618
811.417137e-020.276961
823.466595e-020.311627
834.376431e-020.355391
848.167025e-020.437061

In [60]:

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [61]:

clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08167024657332106

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [62]:

clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

Recall vs alpha for training and testing sets

In [63]:

recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

In [64]:

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)

In [65]:

fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()

In [66]:

# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.014171370928955346, random_state=1)

Checking model performance on training set

In [67]:

decision_tree_postpruned_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_postpruned_perf_train

Out[67]:

AccuracyRecallPrecisionF1
00.8122110.9339440.8155940.870766

In [68]:

confusion_matrix_sklearn(best_model, X_train, y_train)

Checking model performance on test set

In [69]:

decision_tree_postpruned_perf_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)
decision_tree_postpruned_perf_test

Out[69]:

AccuracyRecallPrecisionF1
00.7980520.9247030.7988590.857187

In [70]:

confusion_matrix_sklearn(best_model, X_train, y_train)
  • With post-pruning we are getting good and generalized model performance on both training and test set.
  • The recall has improved further.

Visualizing the Decision Tree

In [71]:

plt.figure(figsize=(10, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
plt.show()

In [72]:

# Text report showing the rules of a decision tree -

print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- FICO_>500 <= 0.50
|   |--- term_60 months <= 0.50
|   |   |--- weights: [358.00, 3318.00] class: 1
|   |--- term_60 months >  0.50
|   |   |--- weights: [196.00, 0.00] class: 0
|--- FICO_>500 >  0.50
|   |--- gender_Male <= 0.50
|   |   |--- weights: [1048.00, 310.00] class: 0
|   |--- gender_Male >  0.50
|   |   |--- weights: [633.00, 1065.00] class: 1

In [73]:

# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        best_model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                          Imp
FICO_>500            0.510119
term_60 months       0.273355
gender_Male          0.216526
purpose_House        0.000000
purpose_Medical      0.000000
purpose_Other        0.000000
purpose_Personal     0.000000
purpose_Wedding      0.000000
home_ownership_Own   0.000000
home_ownership_Rent  0.000000
age_>25              0.000000

In [74]:

importances = best_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • FICO score, duration of the loan, and gender remain the most important feature with post-pruning too.

Comparing all the decision tree models

In [75]:

# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_postpruned_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:

Out[75]:

Decision Tree sklearnDecision Tree (Pre-Pruning)Decision Tree (Post-Pruning)
Accuracy0.8555140.8552250.812211
Recall0.9088000.9100790.933944
Precision0.8815630.8802560.815594
F10.8949740.8949190.870766

In [76]:

# test performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_postpruned_perf_test.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_train_comp_df
Test set performance comparison:

Out[76]:

Decision Tree sklearnDecision Tree (Pre-Pruning)Decision Tree (Post-Pruning)
Accuracy0.8437230.8439390.798052
Recall0.8972920.8986130.924703
Precision0.8686060.8679430.798859
F10.8827160.8830120.857187
  • Decision tree with post-pruning is giving the highest recall on the test set.
  • The tree with post pruning is not complex and easy to interpret.

Business Insights

  • FICO, term and gender (in that order) are the most important variables in determining if a borrower will get into a delinquent stage 
  • No borrower shall be given a loan if they are applying for a 36 month term loan and have a FICO score in the range 300-500.
  • Female borrowers with a FICO score greater than 500 should be our target customers.
  • Criteria to approve loan according to decision tree model should depend on three main factors – FICO score, duration of loan and gender that is – If the FICO score is less than 500 and the duration of loan is less than 60 months then the customer will not be able to repay the loans. If the customer has greater than 500 FICO score and is a female higher chances that they will repay the loans.

Leave a Reply

Your email address will not be published. Required fields are marked *