General Middleware

Creating Decision Trees for Classification Problems – Machine Failure Prediction

Problem Statement

Business Context

System failure is a common issue across the manufacturing industry, where a variety of machines and equipment are used. In most cases, it becomes important to be able to predict machine failures by analyzing system data and taking preventive measures to be able to tackle them. This is known as predictive maintenance and with the rising availability of data and computational resources, the use of such data-driven, proactive maintenance methods has resulted in several benefits like minimized downtime of the equipment, minimized cost associated with spares and supplies, etc.

AutoMobi Engineering Pvt. Ltd is an auto component manufacturing company. The manufacturing facility of AutoMobi consists of numerous products machined on several CNC (Computer Numerical Controlled) machines. In an attempt to transition to a data-driven maintenance process, the company had set up sensors in various locations to collect data regarding the various parameters involved in the manufacturing process. Initially, they want to try it in an injector nozzle manufacturing shop where they are manufacturing fuel injector nozzles for automobile engines using various manufacturing processes (like turning, drilling, etc). The company has been collecting data on an hourly basis from these sensors and aims to build ML-based solutions using the data to optimize cost, improve failure predictability, and minimize the downtime of equipment.

Objective

AutoMobi has recently been encountering a problem with frequent equipment failure in the fuel injector nozzle manufacture unit, leading to disturbance in the manufacturing process. They have reached out to the Data Science team for a solution and shared data for the past three months. As a member of the Data Science team, you are tasked with analyzing the data and developing a Machine Learning model to detect potential machine failures, determine the most influencing factors on machine health, and provide recommendations for cost optimization to the management.

Dataset

Data Description

The data contains the different attributes of machines and health. The detailed data dictionary is given below.

Data Dictionary

  • UDI: Unique identifier ranging from 1 to 10000
  • Type: The type of product consisting of low(60% of all products), medium(30%), and high(10%) quality corresponding to L, M, and H
  • Air temperature: Ambient temperature (in the machine shop) measured in Kelvin
  • Process Temperature: Tool tip temperature measured in Kelvin
  • Rotational Speed: Rotational speed of the machine spindle measured in revolutions per minute (rpm)
  • Torque: Torque acting on the machine spindle measured in Newton meter (Nm)
  • Tool wear: Tool wear measured in micrometers (During the manufacturing process, continuous rubbing of the tool on the workpiece leads to the wearing of the tool material. The tool wear is measured as the amount of wear on the cutting face of the tool measured in micrometers)
  • Failure: 0-No failure, 1-Failure

Importing the necessary libraries

In [ ]:

# this will help in making the Python code more structured automatically (help adhere to good coding practices)
%load_ext nb_black

import warnings
warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
)

Loading the dataset

In [ ]:

df_main = pd.read_csv("Predictive_Maintenance_Case_Study.csv")

In [ ]:

# copying data to another variable to avoid any changes to original data
data = df_main.copy()

Overview of the dataset

View the first and last 5 rows of the dataset.

In [ ]:

data.head()

Out[ ]:

UDITypeAir temperatureProcess temperatureRotational speedTorqueTool wearFailure
01M298.10000323.74074155142.8000000
12L298.20000324.11111140846.3000030
23L298.10000323.37037149849.4000050
34L298.20000323.74074143339.5000070
45L298.20000324.11111140840.0000090

In [ ]:

data.tail()

Out[ ]:

UDITypeAir temperatureProcess temperatureRotational speedTorqueTool wearFailure
99959996M298.80000323.00000160429.50000140
99969997H298.90000323.00000163231.80000170
99979998M299.00000323.74074164533.40000220
99989999H299.00000324.11111140848.50000250
999910000M299.00000324.11111150040.20000300
  • There are three types of products those are L, M, and H (Low, Medium, and High quality).
  • The UDI column is containing unique values.

Understand the shape of the dataset.

In [ ]:

data.shape

Out[ ]:

(10000, 8)
  • The dataset has 10000 rows and 8 columns.

Check the data types of the columns for the dataset.

In [ ]:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   UDI                  10000 non-null  int64  
 1   Type                 10000 non-null  object 
 2   Air temperature      10000 non-null  float64
 3   Process temperature  10000 non-null  float64
 4   Rotational speed     10000 non-null  int64  
 5   Torque               10000 non-null  float64
 6   Tool wear            10000 non-null  int64  
 7   Failure              10000 non-null  int64  
dtypes: float64(3), int64(4), object(1)
memory usage: 625.1+ KB
  • The Type column is of object type while the rest columns are numeric in nature

Checking for missing values

In [ ]:

# checking for null values
data.isnull().sum()

Out[ ]:

UDI                    0
Type                   0
Air temperature        0
Process temperature    0
Rotational speed       0
Torque                 0
Tool wear              0
Failure                0
dtype: int64
  • There are no null values in the dataset

Dropping the duplicate values

In [ ]:

# checking for duplicate values
data.duplicated().sum()

Out[ ]:

0
  • There are no duplicate values in the data.

Dropping the columns with all unique values

In [ ]:

data.UDI.nunique()

Out[ ]:

10000
  • The UDI column contains only unique values, so we can drop it

In [ ]:

data = data.drop(["UDI"], axis=1)

Statistical summary of the data

Let’s check the statistical summary of the data.

In [ ]:

data.describe().T

Out[ ]:

countmeanstdmin25%50%75%max
Air temperature10000.00000300.004932.00026295.30000298.30000300.10000301.50000304.50000
Process temperature10000.00000328.946525.49531313.00000324.48148329.29630333.00000343.00000
Rotational speed10000.000001538.77610179.284101168.000001423.000001503.000001612.000002886.00000
Torque10000.0000039.986919.968933.8000033.2000040.1000046.8000076.60000
Tool wear10000.00000107.9510063.654150.0000053.00000108.00000162.00000253.00000
Failure10000.000000.033900.180980.000000.000000.000000.000001.00000
  • The air temperature ranges from 300K to 304.5K. Usually, machine shops are maintained in control environment so the temperature range looks usual.
  • The process temperature is a bit higher than the air temperature and that’s quite usual because heat is continuously generated during the machining process.
  • The rotational speed has a max value of 2886rpm while 1612rpm at the 75th percentile. Some of the processes are performed at a higher speed than usual.

Exploratory Data Analysis (EDA) Summary

Note: The EDA section has been covered multiple times in the previous case studies. In this case study, we will mainly focus on the model building aspects. We will only be looking at the key observations from EDA. The detailed EDA can be found in the appendix section.

The below functions need to be defined to carry out the EDA.

In [ ]:

def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [ ]:

# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [ ]:

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

In [ ]:

### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate Analysis

In [ ]:

histogram_boxplot(data, "Air temperature")
  • The air temperature distribution looks slightly left skewed with a mean temperature around 300K.
  • There is no outlier present.

In [ ]:

histogram_boxplot(data, "Process temperature")
  • The process temperature distribution looks slightly left skewed with a mean temperature around 329K.
  • There is no outlier present.

In [ ]:

histogram_boxplot(data, "Rotational speed")
  • The rotational speed is right skewed with many outliers on the upper quartile.
  • Some of the manufacturing operations are performed at a higher speed. 

In [ ]:

histogram_boxplot(data, "Torque")
  • The distribution of torque is normal with mean torque around 40 Nm.
  • Outliers are present on both sides.

In [ ]:

histogram_boxplot(data, "Tool wear")
  • Tool wear is uniformly distributed with some of the higher values being less frequent.

In [ ]:

labeled_barplot(data, "Type", perc=True)
  • Around 60% of products are of low quality, 30% are of medium quality whereas 10% are of high quality

In [ ]:

labeled_barplot(data, "Failure", perc=True)
  • In 96.6% of observations the machine does not fail while in 3.4% of observations it fails.

Bivariate Analysis

In [ ]:

cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove('Failure')

plt.figure(figsize=(12, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
  • There’s a positive correlation between the air temperature and process temperature.
  • There’s a negative correlation between the rotational speed and torque
  • No other variables are correlated. We will analyze it further.

Let’s see how the target variable varies across the type of the product

In [ ]:

stacked_barplot(data, "Type", "Failure")
Failure     0    1    All
Type                     
All      9661  339  10000
L        5765  235   6000
M        2914   83   2997
H         982   21   1003
------------------------------------------------------------------------------------------------------------------------
  • Around 70 % of the failure occurred during machining of L type i.e., low-quality products.
  • Machining of high-quality products is less prone to failure.

Let’s analyze the relation between Process temperature and Failure.

In [ ]:

distribution_plot_wrt_target(data, "Process temperature", "Failure")
  • Most of the failures of the manufacturing system occur at higher Process temperature.

Let’s analyze the relation between Rotational speed and Failure.

In [ ]:

distribution_plot_wrt_target(data, "Rotational speed", "Failure")
  • There is a clear boundary showing separation of failure status based of the values of Rotational speed.
  • Manufacturing system is more prone to failure at lower Rotational speed than at higher rotational speed.

Data Preprocessing

Outlier Detection

Let’s check for outliers in the data.

In [ ]:

# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observations

  • There are quite a few outliers in the data.
  • However, we will not treat them as they are proper values

Data Preparation for Modeling

In [ ]:

X = data.drop(["Failure"], axis=1)
Y = data["Failure"]

X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)

In [ ]:

print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (7000, 7)
Shape of test set :  (3000, 7)
Percentage of classes in training set:
0   0.96629
1   0.03371
Name: Failure, dtype: float64
Percentage of classes in test set:
0   0.96567
1   0.03433
Name: Failure, dtype: float64
  • We had seen that around 96.6% of observations belongs to class 0 (Not Failed) and 3.37% observations belongs to class 1 (Failed), and this is preserved in the train and test sets

Model Building

Decision Tree (default)

In [ ]:

model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)

Out[ ]:

DecisionTreeClassifier(random_state=1)

Model Evaluation

Model evaluation criterion

Model can make wrong predictions as:

  • Predicting a machine will not fail but in reality, the machine will fail (FN)
  • Predicting a machine will fail but in reality, the machine will not fail (FP)

Which case is more important?

  • If we predict that a machine will not fail but in reality, the machine fails, then the company will have to bear the cost of repair/replacement and also face equipment downtime losses
  • If we predict that a machine will fail but in reality, the machine does not fail, then the company will have to bear the cost of inspection
  • The inspection cost is generally less compared to the repair/replacement cost

How to reduce the losses?

The company would want the recall to be maximized, greater the recall score higher are the chances of minimizing the False Negatives.

In [ ]:

# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf

In [ ]:

def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

In [ ]:

confusion_matrix_sklearn(model0, X_train, y_train)

In [ ]:

decision_tree_perf_train_without = model_performance_classification_sklearn(
    model0, X_train, y_train
)
decision_tree_perf_train_without

Out[ ]:

AccuracyRecallPrecisionF1
01.000001.000001.000001.00000

In [ ]:

confusion_matrix_sklearn(model0, X_test, y_test)

In [ ]:

decision_tree_perf_test_without = model_performance_classification_sklearn(
    model0, X_test, y_test
)
decision_tree_perf_test_without

Out[ ]:

AccuracyRecallPrecisionF1
00.975670.728160.625000.67265

Decision Tree (with class_weights)

  • If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward the dominant classes
  • In this case, we will set class_weight = “balanced”, which will automatically adjust the weights to be inversely proportional to the class frequencies in the input data
  • class_weight is a hyperparameter for the decision tree classifier

In [ ]:

model = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model.fit(X_train, y_train)

Out[ ]:

DecisionTreeClassifier(class_weight='balanced', random_state=1)

In [ ]:

confusion_matrix_sklearn(model, X_train, y_train)

In [ ]:

decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train

Out[ ]:

AccuracyRecallPrecisionF1
01.000001.000001.000001.00000
  • Model is able to perfectly classify all the data points on the training set.
  • 0 errors on the training set, each sample has been classified correctly.
  • As we know a decision tree will continue to grow and classify each data point correctly if no restrictions are applied as the trees will learn all the patterns in the training set.
  • This generally leads to overfitting of the model as Decision Tree will perform well on the training set but will fail to replicate the performance on the test set.

In [ ]:

confusion_matrix_sklearn(model, X_test, y_test)

In [ ]:

decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test

Out[ ]:

AccuracyRecallPrecisionF1
00.977000.611650.684780.64615
  • There is a huge disparity in performance of model on training set and test set, which suggests that the model is overfiiting.

Let’s use pruning techniques to try and reduce overfitting.

Decision Tree (Pre-pruning)

Using GridSearch for Hyperparameter tuning of our tree model

  • Hyperparameter tuning is also tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. i.e we’ll use Grid search
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters. 
  • It is an exhaustive search that is performed on a the specific parameter values of a model.
  • The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

In [ ]:

# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "class_weight": [None, "balanced"],
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)

Out[ ]:

DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=50,
                       min_samples_split=70, random_state=1)

In [ ]:

confusion_matrix_sklearn(estimator, X_train, y_train)

In [ ]:

decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)
decision_tree_tune_perf_train

Out[ ]:

AccuracyRecallPrecisionF1
00.873430.966100.206150.33979

In [ ]:

confusion_matrix_sklearn(estimator, X_test, y_test)

In [ ]:

decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)
decision_tree_tune_perf_test

Out[ ]:

AccuracyRecallPrecisionF1
00.881670.961170.220000.35805
  • The model is giving a generalized result now since the recall scores on both the train and test data are coming to be around 0.96 which shows that the model is able to generalize well on unseen data.

In [ ]:

feature_names = list(X_train.columns)
importances = estimator.feature_importances_
indices = np.argsort(importances)

In [ ]:

plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

In [ ]:

# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Rotational speed <= 1379.50
|   |--- Air temperature <= 301.55
|   |   |--- Torque <= 52.85
|   |   |   |--- Tool wear <= 206.50
|   |   |   |   |--- weights: [176.97, 0.00] class: 0
|   |   |   |--- Tool wear >  206.50
|   |   |   |   |--- weights: [10.35, 59.32] class: 1
|   |   |--- Torque >  52.85
|   |   |   |--- Tool wear <= 185.50
|   |   |   |   |--- weights: [151.61, 341.10] class: 1
|   |   |   |--- Tool wear >  185.50
|   |   |   |   |--- weights: [9.31, 474.58] class: 1
|   |--- Air temperature >  301.55
|   |   |--- Process temperature <= 336.89
|   |   |   |--- Process temperature <= 330.96
|   |   |   |   |--- weights: [1.03, 504.24] class: 1
|   |   |   |--- Process temperature >  330.96
|   |   |   |   |--- weights: [42.95, 845.34] class: 1
|   |   |--- Process temperature >  336.89
|   |   |   |--- weights: [32.08, 59.32] class: 1
|--- Rotational speed >  1379.50
|   |--- Tool wear <= 202.50
|   |   |--- Torque <= 15.45
|   |   |   |--- weights: [11.90, 281.78] class: 1
|   |   |--- Torque >  15.45
|   |   |   |--- Torque <= 58.40
|   |   |   |   |--- weights: [2865.61, 118.64] class: 0
|   |   |   |--- Torque >  58.40
|   |   |   |   |--- weights: [5.69, 177.97] class: 1
|   |--- Tool wear >  202.50
|   |   |--- Tool wear <= 219.50
|   |   |   |--- Torque <= 51.85
|   |   |   |   |--- weights: [154.72, 311.44] class: 1
|   |   |   |--- Torque >  51.85
|   |   |   |   |--- weights: [2.59, 59.32] class: 1
|   |   |--- Tool wear >  219.50
|   |   |   |--- Air temperature <= 297.20
|   |   |   |   |--- weights: [3.10, 0.00] class: 0
|   |   |   |--- Air temperature >  297.20
|   |   |   |   |--- weights: [32.08, 266.95] class: 1

Observations from the pre-pruned tree:

Using the above extracted decision rules we can make interpretations from the decision tree model like:

  • If the rotational speed is less than or equal to 1379.50, the air temperature is less than or equal to 301.55, the torque is less than or equal to 52.85 and the tool wear is greater than 206.50, then the machine is most likey to fail

Interpretations from other decision rules can be made similarly

In [ ]:

importances = estimator.feature_importances_
importances

Out[ ]:

array([0.03002803, 0.00726903, 0.40080865, 0.32726572, 0.23462857,
       0.        , 0.        ])

In [ ]:

# importance of features in the tree building

importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • In the pre tuned decision tree also, rotational speed and torque are the most important features.

Decision Tree (Post pruning)

The DecisionTreeClassifier provides parameters such as min_samples_leafand max_depth to prevent a tree from overfiting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect ofccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the “weakest link”. The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn providesDecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

In [ ]:

clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities

In [ ]:

pd.DataFrame(path)

Out[ ]:

ccp_alphasimpurities
00.00000-0.00000
10.00000-0.00000
20.00000-0.00000
30.00000-0.00000
40.00000-0.00000
50.00000-0.00000
60.00000-0.00000
70.00000-0.00000
80.00000-0.00000
90.00000-0.00000
100.000000.00000
110.000000.00000
120.000000.00000
130.000000.00000
140.000000.00000
150.000070.00015
160.000070.00029
170.000070.00044
180.000070.00058
190.000070.00073
200.000070.00088
210.000070.00117
220.000070.00132
230.000070.00147
240.000140.00175
250.000140.00231
260.000140.00246
270.000150.00260
280.000150.00304
290.000150.00319
300.000150.00333
310.000150.00348
320.000150.00363
330.000150.00377
340.000150.00392
350.000150.00407
360.000150.00421
370.000190.00479
380.000210.00522
390.000220.00565
400.000230.00680
410.000260.00783
420.000260.00809
430.000280.00837
440.000280.00893
450.000290.00921
460.000290.00950
470.000290.00978
480.000290.01095
490.000350.01338
500.000360.01409
510.000390.01448
520.000400.01488
530.000400.01568
540.000400.01608
550.000400.01648
560.000430.01692
570.000470.01739
580.000530.01792
590.000560.01848
600.000580.02021
610.000610.02082
620.000630.02271
630.000670.02338
640.000690.02407
650.000720.02551
660.000730.02625
670.000740.02698
680.000860.02784
690.000890.03138
700.001040.03242
710.001060.03348
720.001070.03455
730.001150.03570
740.001160.03919
750.001190.04632
760.001220.04754
770.001220.04875
780.001340.05543
790.001340.05812
800.001410.05953
810.001530.06106
820.001610.07559
830.001670.08559
840.001810.08740
850.001850.09109
860.002170.09544
870.002470.10037
880.004010.10839
890.010350.11875
900.015780.18187
910.042680.22456
920.057570.28212
930.069120.35124
940.148760.50000

In [ ]:

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [ ]:

clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.14875976077076158

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [ ]:

clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

In [ ]:

recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

In [ ]:

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)

In [ ]:

train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]

In [ ]:

fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()

In [ ]:

# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.004008680486241742, class_weight='balanced',
                       random_state=1)

In [ ]:

confusion_matrix_sklearn(best_model, X_train, y_train)

In [ ]:

decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_post_perf_train

Out[ ]:

AccuracyRecallPrecisionF1
00.911430.966100.271430.42379

In [ ]:

confusion_matrix_sklearn(best_model, X_test, y_test)

In [ ]:

decision_tree_post_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)
decision_tree_post_test

Out[ ]:

AccuracyRecallPrecisionF1
00.917330.961170.288630.44395
  • In the post-pruned tree also, the model is giving a generalized result since the recall scores on both the train and test data are coming to be around 0.96 which shows that the model is able to generalize well on unseen data.

In [ ]:

plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

In [ ]:

# Text report showing the rules of a decision tree -

print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Rotational speed <= 1379.50
|   |--- Air temperature <= 301.55
|   |   |--- Torque <= 52.85
|   |   |   |--- Tool wear <= 206.50
|   |   |   |   |--- weights: [176.97, 0.00] class: 0
|   |   |   |--- Tool wear >  206.50
|   |   |   |   |--- weights: [10.35, 59.32] class: 1
|   |   |--- Torque >  52.85
|   |   |   |--- Tool wear <= 185.50
|   |   |   |   |--- Torque <= 62.35
|   |   |   |   |   |--- weights: [140.75, 0.00] class: 0
|   |   |   |   |--- Torque >  62.35
|   |   |   |   |   |--- weights: [10.87, 341.10] class: 1
|   |   |   |--- Tool wear >  185.50
|   |   |   |   |--- weights: [9.31, 474.58] class: 1
|   |--- Air temperature >  301.55
|   |   |--- weights: [76.06, 1408.90] class: 1
|--- Rotational speed >  1379.50
|   |--- Tool wear <= 202.50
|   |   |--- Torque <= 15.45
|   |   |   |--- weights: [11.90, 281.78] class: 1
|   |   |--- Torque >  15.45
|   |   |   |--- Torque <= 58.40
|   |   |   |   |--- weights: [2865.61, 118.64] class: 0
|   |   |   |--- Torque >  58.40
|   |   |   |   |--- weights: [5.69, 177.97] class: 1
|   |--- Tool wear >  202.50
|   |   |--- weights: [192.49, 637.71] class: 1

  • We can see that the observation we got from the pre-pruned tree is also matching with the decision tree rules of the post pruned tree.

In [ ]:

importances = best_model.feature_importances_
indices = np.argsort(importances)

In [ ]:

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Rotational speed and Torque are the most important features for the post pruned tree

Comparison of Models and Final Model Selection

In [ ]:

# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train_without.T,
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:

Out[ ]:

Decision Tree without class_weightDecision Tree with class_weightDecision Tree (Pre-Pruning)Decision Tree (Post-Pruning)
Accuracy1.000001.000000.873430.91143
Recall1.000001.000000.966100.96610
Precision1.000001.000000.206150.27143
F11.000001.000000.339790.42379

In [ ]:

# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test_without.T,
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:

Out[ ]:

Decision Tree without class_weightDecision Tree with class_weightDecision Tree (Pre-Pruning)Decision Tree (Post-Pruning)
Accuracy0.975670.977000.881670.91733
Recall0.728160.611650.961170.96117
Precision0.625000.684780.220000.28863
F10.672650.646150.358050.44395
  • Decision tree models with pre-pruning and post-pruning both are giving equally high recall scores on both training and test sets.
  • However, we will choose the post pruned tree as the best model since it is giving a slightly high precision score on the train and test sets than the pre-pruned tree. 

Conclusions and Recommendations

  • The model built can be used to predict if a machine is going to fail or not and can correctly identify 96.1% of the machine failures
  • Rotational speed, torque and tool wear are the most important variables in predicting whether a machine will fail or not
  • From the decision tree, it has been observed that if the rotational speed is less than or equal to 1379.50, the air temperature is less than or equal to 301.55, the torque is less than or equal to 52.85 and the tool wear is greater than 206.50, then the machine is most likey to fail
    • The company should give a vigilant eye for these values in order to detect machine failure.
  • The company should use more data for the analysis to get more reliable results
  • As the variable used vary with the type of operation (turning, drilling, etc.) being performed, the company can look to build separate models for each different type of operation

Appendix: Detailed Exploratory Data Analysis (EDA)

Univariate Analysis

Observation on Air temperature

In [ ]:

histogram_boxplot(data, "Air temperature")
  • The air temperature distribution looks slightly left skewed with a mean temperature around 300K.
  • There is no outlier present.

Observation on Process temperature

In [ ]:

histogram_boxplot(data, "Process temperature")
  • The process temperature distribution looks slightly left skewed with a mean temperature around 329K.
  • There is no outlier present.

Observation on Rotational Speed

In [ ]:

histogram_boxplot(data, "Rotational speed")
  • The rotational speed is right skewed with many outliers on the upper quartile.
  • Some of the manufacturing operations are performed at a higher speed. 

Observation on Torque

In [ ]:

histogram_boxplot(data, "Torque")
  • The distribution of torque is normal with mean torque around 40 Nm.
  • Outliers are present on both sides.

Observation on Tool Wear

In [ ]:

histogram_boxplot(data, "Tool wear")
  • Tool wear is uniformly distributed with some of the higher values being less frequent.

Observation on Type of product

In [ ]:

labeled_barplot(data, "Type", perc=True)
  • Around 60% of products are of low quality, 30% are of medium quality whereas 10% are of high quality

Observation on Failure

In [ ]:

labeled_barplot(data, "Failure", perc=True)
  • In 96.6% of observations the machine does not fail while in 3.4% of observations it fails.

Bivariate Analysis

Correlation Check

In [ ]:

cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove('Failure')

plt.figure(figsize=(12, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
  • There’s a positive correlation between the air temperature and process temperature.
  • There’s a negative correlation between the rotational speed and torque
  • No other variables are correlated. We will analyze it further.

Observation on Type vs Air temperature

In [ ]:

plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Air temperature")
plt.show()
  • There is no distinct difference in values of Air temperature and Type

Observation on Type vs Process temperature

In [ ]:

plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Process temperature")
plt.show()
  • There is no distinct difference in values of Process temperature and Type for M and L types. 
  • Lesser Process temperature is observed in manufacturing H type of products.

Observation on Type vs Rotational speed

In [ ]:

plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Rotational speed")
plt.show()
  • Some of the L type of products are manufactured at higher rotational speed

Observation on Type vs Tool wear

In [ ]:

plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Tool wear")
plt.show()
  • There is no distinct difference in values of Tool wear and Type

Observation on Type vs Torque

In [ ]:

plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Torque")
plt.show()
  • Some of the L type products are manufactured at lower Toque as compared to M and H type of products.

Failure vs Type

Let’s see how the target variable varies across the type of the product

In [ ]:

stacked_barplot(data, "Type", "Failure")
Failure     0    1    All
Type                     
All      9661  339  10000
L        5765  235   6000
M        2914   83   2997
H         982   21   1003
------------------------------------------------------------------------------------------------------------------------
  • Around 70 % of the failure occurred during machining of L type i.e., low-quality products.
  • Machining of high-quality products is less prone to failure.

Distribution of numerical input variables by failure status

Let’s analyze the relation between Air temperature and Failure.

In [ ]:

distribution_plot_wrt_target(data, "Air temperature", "Failure")
  • Most of the failures of the manufacturing system occur at higher Air temperature.

Let’s analyze the relation between Process temperature and Failure.

In [ ]:

distribution_plot_wrt_target(data, "Process temperature", "Failure")
  • Most of the failures of the manufacturing system occur at higher Process temperature.

Let’s analyze the relation between Rotational speed and Failure.

In [ ]:

distribution_plot_wrt_target(data, "Rotational speed", "Failure")
  • Manufacturing system is more prone to failure at lower Rotational speed than at higher rotational speed.

Let’s analyze the relation between Torque and Failure.

In [ ]:

distribution_plot_wrt_target(data, "Torque", "Failure")
  • Most of the failures of the manufacturing system occur at higher torque as compared to lower values of torque.

Let’s analyze the relation between Tool wear and Failure.

In [ ]:

distribution_plot_wrt_target(data, "Tool wear", "Failure")
  • Most of the failures occur at higher values of tool wear than at lower tool wear.

In [ ]:

sns.pairplot(data, hue="Failure")

Out[ ]:

<seaborn.axisgrid.PairGrid at 0x7f68dcb4ac10>
  • The correlation between air temperature and process temperature and torque and rotational speed is visible here too

Leave a Reply

Your email address will not be published. Required fields are marked *