
Predicting Classification Models

Case Study: German Credit Analysis

Context:

To minimize losses, the bank needs a decision rule to determine which loan applications to approve and which to reject. Loan managers consider an applicant's demographic and socio-economic profile before taking a decision on his/her loan application.

In this dataset, each entry represents a person who takes credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes.

Objective:

The objective is to build a predictive model on this data to help the bank take a decision on whether to approve a loan to a prospective applicant.

Attribute Information:

  • The data contains characteristics of the people
    • Age (Numeric: Age in years)
    • Sex (Categories: male, female)
    • Job (Categories: unskilled and non-resident, unskilled and resident, skilled, highly skilled)
    • Housing (Categories: own, rent, or free)
    • Saving accounts (Categories: little, moderate, quite rich, rich)
    • Checking account (Categories: little, moderate, rich)
    • Credit amount (Numeric: Amount of credit in DM – Deutsche Mark)
    • Duration (Numeric: Duration for which the credit is given in months)
    • Purpose (Categories: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
    • Risk (0 – Person is not at risk, 1 – Person is at risk (defaulter))


Learning Outcomes:

  • Exploratory Data Analysis
  • Preparing the data to train a model
  • Training and understanding of data using a logistic regression model
  • Model evaluation

Steps and Tasks:

  • Import Libraries and Load Dataset
  • Overview of data
  • Data Visualization
  • Data preparation
  • Choose Model, Train, and Evaluate
  • Conclusion

Dataset:

Solution Jupyter Notebook:

Let’s start by importing necessary libraries

In [1]:

# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)


# To build model for prediction

from sklearn.linear_model import LogisticRegression

# To get different metric scores

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
)

Load and overview the dataset

In [2]:

# Loading the dataset from the CSV file
data = pd.read_csv("German_Credit.csv")

In [3]:

data.head()

Out[3]:

   Age     Sex                         Job  Housing  Saving accounts  Checking account  Credit amount  Duration  Risk              Purpose
0   67    male                     skilled      own           little            little           1169         6     0             radio/TV
1   22  female                     skilled      own           little          moderate           5951        48     1             radio/TV
2   49    male  unskilled_and_non-resident      own           little            little           2096        12     0            education
3   45    male                     skilled     free           little            little           7882        42     0  furniture/equipment
4   53    male                     skilled     free           little            little           4870        24     1                  car

Understand the shape of the dataset.

In [4]:

data.shape

Out[4]:

(1000, 10)

Check the data types of the columns in the dataset.

In [5]:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   object
 3   Housing           1000 non-null   object
 4   Saving accounts   1000 non-null   object
 5   Checking account  1000 non-null   object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Risk              1000 non-null   int64 
 9   Purpose           1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 78.2+ KB
  • There are 10 columns and 1,000 observations in the dataset.
  • There are only three continuous variables – Age, Credit amount, and Duration.
  • All other variables are categorical.
  • There are no missing values in the dataset (a quick check follows below).
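A quick way to confirm the absence of missing values (not shown in the original notebook) is to count the nulls per column of the data frame loaded above:

# Count of missing values in each column - all zeros for this file
print(data.isnull().sum())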

Summary of the data

In [6]:

data.describe().T

Out[6]:

                count      mean          std    min     25%     50%      75%      max
Age            1000.0    35.546    11.375469   19.0    27.0    33.0    42.00     75.0
Credit amount  1000.0  3271.258  2822.736876  250.0  1365.5  2319.5  3972.25  18424.0
Duration       1000.0    20.903    12.058814    4.0    12.0    18.0    24.00     72.0
Risk           1000.0     0.300     0.458487    0.0     0.0     0.0     1.00      1.0

Observations

  • The mean age is approx 35 years and the median is 33, so at least half of the customers are 33 or younger (a quick check of the share under 35 follows below).
  • The mean credit amount is approx 3,271 DM, but it has a wide range, from 250 to 18,424. We will explore this further in univariate analysis.
  • The mean duration for which credit is given is approx 21 months.
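As an optional verification of the age observation above (not part of the original notebook), the share of customers younger than 35 can be computed directly from the same data:

# Proportion of customers younger than 35 years (mean of a boolean series = fraction of True values)
print((data["Age"] < 35).mean())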

In [7]:

# Making a list of all categorical variables
cat_col = [
    "Sex",
    "Job",
    "Housing",
    "Saving accounts",
    "Checking account",
    "Purpose",
    "Risk",
]

# Printing the count of each unique value in each categorical column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 40)
male      690
female    310
Name: Sex, dtype: int64
----------------------------------------
skilled                       630
unskilled_and_non-resident    222
highly skilled                148
Name: Job, dtype: int64
----------------------------------------
own     713
rent    179
free    108
Name: Housing, dtype: int64
----------------------------------------
little        786
moderate      103
quite rich     63
rich           48
Name: Saving accounts, dtype: int64
----------------------------------------
moderate    472
little      465
rich         63
Name: Checking account, dtype: int64
----------------------------------------
car                    337
radio/TV               280
furniture/equipment    181
business                97
education               59
repairs                 22
domestic appliances     12
vacation/others         12
Name: Purpose, dtype: int64
----------------------------------------
0    700
1    300
Name: Risk, dtype: int64
----------------------------------------
  • There are more male customers than female customers.
  • The skilled category dominates the Job variable (630 observations); unskilled_and_non-resident (222) and highly skilled (148) are less common.
  • The distribution of classes in the target variable is imbalanced, i.e., only 30% of observations are defaulters (expressed as proportions in the snippet below).
  • Most of the customers are not at risk.
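The class imbalance can also be expressed directly as proportions, an optional extra check on the same data:

# Share of each class in the target variable (0 = not at risk, 1 = defaulter)
print(data["Risk"].value_counts(normalize=True))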

Univariate analysis

In [8]:

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Observation on Age

In [9]:

histogram_boxplot(data, "Age")
  • The distribution of age is right-skewed
  • The boxplot shows that there are outliers at the right end
  • We will not treat these outliers as they represent the real market trend

Observation on Credit Amount

In [10]:

histogram_boxplot(data, "Credit amount")
  • The distribution of the credit amount is right-skewed
  • The boxplot shows that there are outliers at the right end
  • We will not treat these outliers as they represent the real market trend

Observations on Duration

In [11]:

histogram_boxplot(data, "Duration")
  • The distribution of the duration for which the credit is given is right-skewed
  • The boxplot shows that there are outliers at the right end
  • We will not treat these outliers as they represent the real market trend

In [12]:

# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar center
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

Observations on Risk

In [13]:

labeled_barplot(data, "Risk", perc=True)
  • As mentioned earlier, the class distribution in the target variable is imbalanced.
  • We have 70% observations for non-defaulters and 30% observations for defaulters.

Observations on Sex of Customers

In [14]:

labeled_barplot(data, "Sex", perc=True)
  • More of the customers taking credit are male than female.
  • 69% of the customers are male and 31% are female.

Observations on Housing

In [15]:

labeled_barplot(data, "Housing", perc=True)
  • The majority of customers who take credit, approx 71%, own their house.
  • Approx 18% of customers live in a rented house.
  • Only approx 11% of customers have free housing. These are customers who live in a house provided by their company or organization.

Observations on Job

In [16]:

labeled_barplot(data, "Job", perc=True)
  • The majority of customers, i.e., 63%, fall into the skilled category.
  • Only approx 15% of customers are in the highly skilled category, which makes sense as these may be people with higher education or extensive experience.
  • Approx 22% of customers are in the unskilled_and_non-resident category.

Observations on Saving accounts

In [17]:

labeled_barplot(data, "Saving accounts", perc=True)
  • Approx 89% of customers who take credit have a little or moderate amount in their savings account. This makes sense, as these customers would need credit more than the other categories.
  • Approx 11% of customers who take credit are in the quite rich or rich categories based on their savings account balance.

Observations on Checking account

In [18]:

labeled_barplot(data, "Checking account", perc=True)
  • Approx 94% of customers who take credit have a little or moderate amount in their checking account. This makes sense, as these customers would need credit more than the other categories.
  • Approx 6% of customers who take credit are in the rich category based on their checking account balance.

Observations on Purpose

In [19]:

labeled_barplot(data, "Purpose", perc=True)
  • The plot shows that most customers take credit for items like cars, radio/TV, furniture/equipment, and domestic appliances.
  • Only approx 16% of customers take credit for business or education.

Bivariate Analysis

In [20]:

sns.pairplot(data, hue="Risk")
plt.show()
  • There are overlaps, i.e., no clear distinction in the distributions of the variables between people who defaulted and those who did not.
  • Let’s explore this further with the help of other plots.

In [21]:

### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Risk vs Age

In [22]:

distribution_plot_wrt_target(data, "Age", "Risk")
  • We can see that the median age of defaulters is less than the median age of non-defaulters.
  • This shows that younger customers are more likely to default.
  • There are outliers in boxplots of both class distributions

Risk vs Credit amount

In [23]:

distribution_plot_wrt_target(data, "Credit amount", "Risk")
  • We can see that the third quartile amount of defaulters is much more than the third quartile amount of non-defaulters.
  • This shows that customers with high credit amount are more likely to default.
  • There are outliers in boxplots of both class distributions

Risk vs Duration

In [24]:

distribution_plot_wrt_target(data, "Duration", "Risk")
  • We can see that the second and third quartiles of duration for defaulters are much higher than those for non-defaulters.
  • This shows that customers with longer credit durations are more likely to default.

Saving accounts vs Age

In [25]:

distribution_plot_wrt_target(data, "Age", "Saving accounts")
  • The plot shows that customers in the rich and quite rich categories tend to be older.
  • Customers in the little and moderate categories are slightly younger, but there are outliers in both distributions.

In [26]:

# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

Risk vs Sex

In [27]:

stacked_barplot(data, "Sex", "Risk")
Risk      0    1   All
Sex                   
All     700  300  1000
male    499  191   690
female  201  109   310
------------------------------------------------------------------------------------------------------------------------
  • We saw earlier that there are more male customers than female customers. This plot shows that female customers are more likely to default than male customers, as the default rates computed below confirm.
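The same conclusion can be checked numerically with a small optional groupby on the data loaded above:

# Default rate by sex: mean of the 0/1 Risk column within each group
# (from the crosstab above: females ~0.35 vs males ~0.28)
print(data.groupby("Sex")["Risk"].mean())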

Risk vs Job

In [28]:

stacked_barplot(data, "Job", "Risk")
Risk                          0    1   All
Job                                       
All                         700  300  1000
skilled                     444  186   630
unskilled_and_non-resident  159   63   222
highly skilled               97   51   148
------------------------------------------------------------------------------------------------------------------------
  • There is no striking difference in default rates across job levels.
  • However, highly skilled customers appear slightly more likely to default than skilled or unskilled_and_non-resident customers.

Risk vs Housing

In [29]:

stacked_barplot(data, "Housing", "Risk")
Risk       0    1   All
Housing                
All      700  300  1000
own      527  186   713
rent     109   70   179
free      64   44   108
------------------------------------------------------------------------------------------------------------------------
  • Customers owning a house are less likely to default
  • Customers with free or rented housing are almost at same risk of default

Risk vs Saving accounts

In [30]:

stacked_barplot(data, "Saving accounts", "Risk")
Risk               0    1   All
Saving accounts                
All              700  300  1000
little           537  249   786
moderate          69   34   103
quite rich        52   11    63
rich              42    6    48
------------------------------------------------------------------------------------------------------------------------
  • As we saw earlier, customers with a little or moderate amount in their saving accounts take more credit, but at the same time they are the most likely to default.
  • Rich customers are slightly less likely to default than quite rich customers.

Risk vs Checking account

In [31]:

stacked_barplot(data, "Checking account", "Risk")
Risk                0    1   All
Checking account                
All               700  300  1000
little            304  161   465
moderate          347  125   472
rich               49   14    63
------------------------------------------------------------------------------------------------------------------------
  • The plot further confirms the findings of the previous plot.
  • Customers with a little amount in their checking account are the most likely to default, followed by customers with a moderate amount, and then the rich customers.

Risk vs Purpose

In [32]:

stacked_barplot(data, "Purpose", "Risk")
Risk                   0    1   All
Purpose                            
All                  700  300  1000
car                  231  106   337
radio/TV             218   62   280
furniture/equipment  123   58   181
business              63   34    97
education             36   23    59
repairs               14    8    22
vacation/others        7    5    12
domestic appliances    8    4    12
------------------------------------------------------------------------------------------------------------------------
  • Customers who take credit for radio/TV are least likely to default. This might be because their credit amount is small.
  • Customers who take credit for education or vacation are most likely to default.
  • Other categories have no significant difference between their default and non-default ratio.

In [33]:

plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • Credit amount and duration have a positive correlation (computed directly below), which makes sense as customers might take credit for a longer duration when the credit amount is high.
  • Other variables have no significant correlation between them.
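As mentioned in the first bullet, the specific correlation between credit amount and duration can be read off directly; this small optional check uses the same data:

# Pairwise Pearson correlation between the two related numeric columns
print(data["Credit amount"].corr(data["Duration"]))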

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting a customer is not going to default but in reality the customer will default – Loss of resources (FN)
  2. Predicting a customer is going to default but in reality the customer will not default – Loss of opportunity (FP)

Which loss is greater?

  • Loss of resources will be the greater loss as the bank will be losing on its resources/money.

How can this loss be reduced, i.e., how do we reduce false negatives?

  • The bank would want to reduce false negatives; this can be done by maximizing recall. The greater the recall, the lower the chance of false negatives; a small sketch below makes this concrete.
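To make the trade-off concrete, here is a minimal sketch with made-up labels (illustrative only, not from the German Credit data) showing how recall penalizes false negatives while precision does not:

import numpy as np
from sklearn.metrics import recall_score, precision_score

# Hypothetical ground truth and predictions (illustrative only)
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # 4 actual defaulters
y_pred_toy = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # model catches only 1 of them

print(recall_score(y_true_toy, y_pred_toy))     # 1 / (1 + 3) = 0.25 -> 3 costly false negatives
print(precision_score(y_true_toy, y_pred_toy))  # 1 / (1 + 0) = 1.00 -> looks perfect despite the misses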

First, let’s create functions to calculate different metrics and confusion matrix so that we don’t have to use the same code repeatedly for each model.

  • The model_performance_classification_sklearn_with_threshold function will be used to check model performance.
  • The confusion_matrix_sklearn_with_threshold function will be used to plot the confusion matrix.

In [34]:

# defining a function to compute different metrics to check performance of a classification model built using sklearn

def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    Function to compute different metrics, based on the threshold specified, to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # predicting using the independent variables
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred_thres = pred_prob > threshold
    pred = np.round(pred_thres)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf

In [35]:

# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix, based on the threshold specified, with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred_thres = pred_prob > threshold
    y_pred = np.round(pred_thres)

    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Data Preparation

In [36]:

# Converting the Duration column from months to years
data["Duration"] = data["Duration"] / 12

In [37]:

X = data.drop("Risk", axis=1)
Y = data["Risk"]

# creating dummy variables
X = pd.get_dummies(X, drop_first=True)

# splitting in training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)
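Since the target classes are imbalanced, an optional variation (not used in the rest of this notebook) is to stratify the split so that both sets keep the 70/30 risk ratio; the suffixed variable names below are just for illustration:

# Alternative, stratified split (kept under different names so the original split above is unchanged)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)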

Logistic Regression

In [38]:

# There are different solvers available in sklearn's LogisticRegression;
# here we use the newton-cg solver, which supports the default l2 penalty

model = LogisticRegression(solver="newton-cg", random_state=1)
lg = model.fit(X_train, y_train)

Finding the coefficients

In [39]:

log_odds = lg.coef_[0]
pd.DataFrame(log_odds, X_train.columns, columns=["coef"]).T

Out[39]:

                                    coef
Age                            -0.034633
Credit amount                   0.000025
Duration                        0.354898
Sex_male                       -0.387275
Job_skilled                     0.009656
Job_unskilled_and_non-resident  0.139326
Housing_own                     -0.81353
Housing_rent                   -0.376454
Saving accounts_moderate        0.008775
Saving accounts_quite rich      -0.67143
Saving accounts_rich             -0.8547
Checking account_moderate      -0.214163
Checking account_rich          -0.304406
Purpose_car                    -0.065784
Purpose_domestic appliances     0.211993
Purpose_education               0.308467
Purpose_furniture/equipment    -0.395234
Purpose_radio/TV                -0.66456
Purpose_repairs                -0.044962
Purpose_vacation/others         0.204389

Coefficient interpretations

  • The coefficients of Duration, Credit amount, and some categorical levels of Purpose and Job are positive; an increase in these increases the chances of a customer being a defaulter.
  • The coefficients of Age, Sex_male, Housing, Saving accounts, and some categorical levels of Purpose and Job are negative; an increase in these decreases the chances of a customer being a defaulter.

Converting coefficients to odds

  • The coefficients of a logistic regression model are expressed in terms of log(odds); to obtain the odds we take the exponential of the coefficients.
  • Therefore, odds = exp(b)
  • The percentage change in odds is given by change in odds% = (exp(b) – 1) * 100; a quick numeric check follows below.
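As a quick numeric check of these formulas (mentioned in the last bullet above), plugging in the Duration coefficient from the coefficient table reproduces the Duration entries of the odds table computed in the next cell:

# Sanity check using the Duration coefficient (0.354898) from the coefficient table above
b_duration = 0.354898
print(np.exp(b_duration))                # ~1.43 -> odds multiplier
print((np.exp(b_duration) - 1) * 100)    # ~42.6 -> percentage change in odds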

Odds from coefficients

In [40]:

# converting coefficients to odds
odds = np.exp(lg.coef_[0])

# finding the percentage change
perc_change_odds = (np.exp(lg.coef_[0]) - 1) * 100

# removing limit from number of columns to display
pd.set_option("display.max_columns", None)

# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train.columns).T

Out[40]:

                                    Odds  Change_odd%
Age                             0.965960    -3.404032
Credit amount                   1.000025     0.002502
Duration                        1.426035    42.603495
Sex_male                        0.678904   -32.109587
Job_skilled                     1.009703     0.970264
Job_unskilled_and_non-resident  1.149499    14.949893
Housing_own                     0.443291   -55.670940
Housing_rent                    0.686291   -31.370935
Saving accounts_moderate        1.008814     0.881353
Saving accounts_quite rich      0.510978   -48.902239
Saving accounts_rich            0.425411   -57.458924
Checking account_moderate       0.807217   -19.278290
Checking account_rich           0.737562   -26.243843
Purpose_car                     0.936333    -6.366731
Purpose_domestic appliances     1.236139    23.613943
Purpose_education               1.361336    36.133607
Purpose_furniture/equipment     0.673522   -32.647772
Purpose_radio/TV                0.514500   -48.550029
Purpose_repairs                 0.956034    -4.396613
Purpose_vacation/others         1.226775    22.677505

Coefficient interpretations

  • Age: Holding all other features constant, a 1-year increase in Age multiplies the odds of a customer being a defaulter by about 0.97, i.e., a 3.40% decrease in the odds.
  • Credit amount: Holding all other features constant, a 1-unit increase in Credit amount leaves the odds essentially unchanged (about a 0.0025% increase in the odds).
  • Duration: Holding all other features constant, a 1-year increase in Duration (recall that Duration was converted to years) multiplies the odds of a customer being a defaulter by about 1.43, i.e., a 42.60% increase in the odds.
  • Sex: The odds of a male customer being a defaulter are about 0.68 times those of a female customer, i.e., 32.1% lower odds.
  • Housing: The odds of a customer who owns a house being a defaulter are about 0.44 times those of a customer with free housing (a house provided by his/her organization), i.e., 55.67% lower odds. Similarly, the odds of a customer living in a rented house being a defaulter are about 0.69 times those of a customer with free housing, i.e., 31.37% lower odds. [Housing_free is the reference category]

Interpretation for other attributes can be made similarly.
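To see at a glance which features move the odds the most, the odds computed above can be ranked. This is a small optional helper (the name odds_df is just for illustration) that reuses the odds, perc_change_odds, and X_train objects from the cells above:

# Rank features by their percentage effect on the odds of default
odds_df = pd.DataFrame(
    {"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train.columns
)
print(odds_df.sort_values("Change_odd%", ascending=False))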

Checking model performance on training set

In [41]:

# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg, X_train, y_train)

In [42]:

log_reg_model_train_perf = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train
)

print("Training performance:")
log_reg_model_train_perf
Training performance:

Out[42]:

   Accuracy    Recall  Precision        F1
0  0.731429  0.244019    0.62963  0.351724

ROC-AUC

  • ROC-AUC on training set

In [43]:

logit_roc_auc_train = roc_auc_score(y_train, lg.predict_proba(X_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
  • The Logistic Regression model gives good overall performance on the training set, but the recall is low.

Model Performance Improvement

  • Let’s see if the recall score can be improved further, by changing the model threshold using AUC-ROC Curve.

Optimal threshold using AUC-ROC curve

In [44]:

# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3285584253725299

Checking model performance on training set

In [45]:

# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)

In [46]:

# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:

Out[46]:

   Accuracy    Recall  Precision        F1
0  0.682857  0.583732   0.474708  0.523605
  • Model performance has improved significantly on training set.
  • Model is giving a recall of 0.58 on the training set.

Let’s use Precision-Recall curve and see if we can find a better threshold

In [47]:

y_scores = lg.predict_proba(X_train)[:, 1]
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
  • At a threshold of around 0.37 precision and recall are equal, but stepping back to a value of around 0.34 gives a higher recall while keeping a good precision.

In [48]:

# setting the threshold
optimal_threshold_curve = 0.34

Checking model performance on training set

In [49]:

# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_curve
)

In [50]:

log_reg_model_train_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
    lg, X_train, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:

Out[50]:

   Accuracy    Recall  Precision        F1
0      0.69  0.550239   0.483193  0.514541
  • Recall has improved compared to the initial model.
  • The model with the 0.32 threshold gave a slightly better recall.

Model Performance Summary

In [51]:

# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.32 Threshold",
    "Logistic Regression-0.34 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:

Out[51]:

           Logistic Regression sklearn  Logistic Regression-0.32 Threshold  Logistic Regression-0.34 Threshold
Accuracy                      0.731429                            0.682857                            0.690000
Recall                        0.244019                            0.583732                            0.550239
Precision                     0.629630                            0.474708                            0.483193
F1                            0.351724                            0.523605                            0.514541

Let’s check the performance on the test set

Using the model with default threshold

In [52]:

# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg, X_test, y_test)

In [53]:

log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
    lg, X_test, y_test
)

print("Test set performance:")
log_reg_model_test_perf
Test set performance:

Out[53]:

   Accuracy    Recall  Precision        F1
0  0.666667  0.164835   0.384615  0.230769
  • ROC-AUC on test set

In [54]:

logit_roc_auc_test = roc_auc_score(y_test, lg.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, lg.predict_proba(X_test)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

Using the model with threshold of 0.32

In [55]:

# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)

In [56]:

# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
print("Test set performance:")
log_reg_model_test_perf_threshold_auc_roc
Test set performance:

Out[56]:

   Accuracy    Recall  Precision        F1
0  0.643333  0.483516   0.423077  0.451282

Using the model with threshold 0.34

In [57]:

# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_curve
)

In [58]:

log_reg_model_test_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
    lg, X_test, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:

Out[58]:

   Accuracy    Recall  Precision        F1
0  0.653333  0.461538    0.43299  0.446809

Model performance comparison

In [59]:

# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.32 Threshold",
    "Logistic Regression-0.34 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:

Out[59]:

           Logistic Regression sklearn  Logistic Regression-0.32 Threshold  Logistic Regression-0.34 Threshold
Accuracy                      0.731429                            0.682857                            0.690000
Recall                        0.244019                            0.583732                            0.550239
Precision                     0.629630                            0.474708                            0.483193
F1                            0.351724                            0.523605                            0.514541

In [60]:

# testing performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression sklearn",
    "Logistic Regression-0.32 Threshold",
    "Logistic Regression-0.34 Threshold",
]

print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:

Out[60]:

           Logistic Regression sklearn  Logistic Regression-0.32 Threshold  Logistic Regression-0.34 Threshold
Accuracy                      0.666667                            0.643333                            0.653333
Recall                        0.164835                            0.483516                            0.461538
Precision                     0.384615                            0.423077                            0.432990
F1                            0.230769                            0.451282                            0.446809
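As an illustrative sketch of how the chosen cut-off would be used in practice (the applicant values below are hypothetical and not part of the original notebook), a new application could be scored with the fitted model lg, the training columns X_train.columns, and the 0.32 threshold from above:

# Hypothetical new applicant; any dummy columns not listed here are filled with 0 below
new_applicant = pd.DataFrame(
    [
        {
            "Age": 30,
            "Credit amount": 4000,
            "Duration": 2.0,  # in years, matching the transformation applied earlier
            "Sex_male": 1,
            "Job_skilled": 1,
            "Housing_rent": 1,
            "Checking account_moderate": 1,
            "Purpose_car": 1,
        }
    ]
)

# Align the columns with the training data
new_applicant = new_applicant.reindex(columns=X_train.columns, fill_value=0)

# Predicted probability of default and the decision at the 0.32 threshold
prob_default = lg.predict_proba(new_applicant)[:, 1][0]
print(prob_default, "-> at risk (reject)" if prob_default > 0.32 else "-> not at risk (approve)")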

Conclusion

  • By changing the threshold of the logistic regression model, we were able to achieve a significant improvement in model performance.
  • With the threshold set at 0.32, the model achieved a recall of 0.58 on the training set and 0.48 on the test set.

Recommendations

  • From our logistic regression model we identified that Duration is a significant predictor of a customer defaulting.
  • The bank could target more male customers, as they have lower odds of defaulting.
  • Our analysis showed that customers with a little or moderate amount in their saving or checking accounts are more likely to default. The bank can be stricter with its rules or charge higher interest rates to compensate for the risk.
  • Customers who have rented or free housing are more likely to default. The bank should keep more details about such customers, like hometown addresses, to be able to track them.
  • Younger customers are slightly more likely to default. The bank can alter its policies to deal with this.
