System failure is a common issue across the manufacturing industry, where a wide variety of machines and equipment are used. It is therefore important to predict machine failures by analyzing system data and to take preventive measures before they occur. This is known as predictive maintenance. With the rising availability of data and computational resources, such data-driven, proactive maintenance methods have delivered benefits such as reduced equipment downtime and lower costs for spares and supplies.
AutoMobi Engineering Pvt. Ltd is an auto component manufacturing company. The manufacturing facility of AutoMobi consists of numerous products machined on several CNC (Computer Numerical Controlled) machines. In an attempt to transition to a data-driven maintenance process, the company had set up sensors in various locations to collect data regarding the various parameters involved in the manufacturing process. Initially, they want to try it in an injector nozzle manufacturing shop where they are manufacturing fuel injector nozzles for automobile engines using various manufacturing processes (like turning, drilling, etc). The company has been collecting data on an hourly basis from these sensors and aims to build ML-based solutions using the data to optimize cost, improve failure predictability, and minimize the downtime of equipment.
AutoMobi has recently been encountering a problem with frequent equipment failure in the fuel injector nozzle manufacture unit, leading to disturbance in the manufacturing process. They have reached out to the Data Science team for a solution and shared data for the past three months. As a member of the Data Science team, you are tasked with analyzing the data and developing a Machine Learning model to detect potential machine failures, determine the most influencing factors on machine health, and provide recommendations for cost optimization to the management.
The data contains the different attributes of machines and health. The detailed data dictionary is given below.
Data Dictionary
In [ ]:
# this will help in making the Python code more structured automatically (help adhere to good coding practices)
%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
)
In [ ]:
df_main = pd.read_csv("Predictive_Maintenance_Case_Study.csv")
In [ ]:
# copying data to another variable to avoid any changes to original data
data = df_main.copy()
In [ ]:
data.head()
Out[ ]:
| | UDI | Type | Air temperature | Process temperature | Rotational speed | Torque | Tool wear | Failure |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | 298.10000 | 323.74074 | 1551 | 42.80000 | 0 | 0 |
| 1 | 2 | L | 298.20000 | 324.11111 | 1408 | 46.30000 | 3 | 0 |
| 2 | 3 | L | 298.10000 | 323.37037 | 1498 | 49.40000 | 5 | 0 |
| 3 | 4 | L | 298.20000 | 323.74074 | 1433 | 39.50000 | 7 | 0 |
| 4 | 5 | L | 298.20000 | 324.11111 | 1408 | 40.00000 | 9 | 0 |
In [ ]:
data.tail()
Out[ ]:
| | UDI | Type | Air temperature | Process temperature | Rotational speed | Torque | Tool wear | Failure |
|---|---|---|---|---|---|---|---|---|
| 9995 | 9996 | M | 298.80000 | 323.00000 | 1604 | 29.50000 | 14 | 0 |
| 9996 | 9997 | H | 298.90000 | 323.00000 | 1632 | 31.80000 | 17 | 0 |
| 9997 | 9998 | M | 299.00000 | 323.74074 | 1645 | 33.40000 | 22 | 0 |
| 9998 | 9999 | H | 299.00000 | 324.11111 | 1408 | 48.50000 | 25 | 0 |
| 9999 | 10000 | M | 299.00000 | 324.11111 | 1500 | 40.20000 | 30 | 0 |
The UDI column appears to contain unique values.
In [ ]:
data.shape
Out[ ]:
(10000, 8)
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   UDI                  10000 non-null  int64
 1   Type                 10000 non-null  object
 2   Air temperature      10000 non-null  float64
 3   Process temperature  10000 non-null  float64
 4   Rotational speed     10000 non-null  int64
 5   Torque               10000 non-null  float64
 6   Tool wear            10000 non-null  int64
 7   Failure              10000 non-null  int64
dtypes: float64(3), int64(4), object(1)
memory usage: 625.1+ KB
The Type column is of object type, while the remaining columns are numeric.
In [ ]:
# checking for null values
data.isnull().sum()
Out[ ]:
UDI                    0
Type                   0
Air temperature        0
Process temperature    0
Rotational speed       0
Torque                 0
Tool wear              0
Failure                0
dtype: int64
In [ ]:
# checking for duplicate values
data.duplicated().sum()
Out[ ]:
0
In [ ]:
data.UDI.nunique()
Out[ ]:
10000
The UDI column contains only unique values (it is a row identifier), so we can drop it.
In [ ]:
data = data.drop(["UDI"], axis=1)
Let’s check the statistical summary of the data.
In [ ]:
data.describe().T
Out[ ]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Air temperature | 10000.00000 | 300.00493 | 2.00026 | 295.30000 | 298.30000 | 300.10000 | 301.50000 | 304.50000 |
| Process temperature | 10000.00000 | 328.94652 | 5.49531 | 313.00000 | 324.48148 | 329.29630 | 333.00000 | 343.00000 |
| Rotational speed | 10000.00000 | 1538.77610 | 179.28410 | 1168.00000 | 1423.00000 | 1503.00000 | 1612.00000 | 2886.00000 |
| Torque | 10000.00000 | 39.98691 | 9.96893 | 3.80000 | 33.20000 | 40.10000 | 46.80000 | 76.60000 |
| Tool wear | 10000.00000 | 107.95100 | 63.65415 | 0.00000 | 53.00000 | 108.00000 | 162.00000 | 253.00000 |
| Failure | 10000.00000 | 0.03390 | 0.18098 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
Air temperature ranges from 295.3 K to 304.5 K. Machine shops are usually kept in a controlled environment, so this range looks normal.
Process temperature is somewhat higher than the air temperature, which is expected because heat is continuously generated during machining.
Rotational speed has a maximum of 2886 rpm against 1612 rpm at the 75th percentile, so some processes run at a much higher speed than usual.
The functions below need to be defined to carry out the EDA.
In [ ]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage

    plt.show()  # show the plot
In [ ]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [ ]:
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
In [ ]:
histogram_boxplot(data, "Air temperature")

The air temperature distribution looks slightly left-skewed, with a mean of around 300 K.
In [ ]:
histogram_boxplot(data, "Process temperature")

The process temperature distribution looks slightly left-skewed, with a mean of around 329 K.
In [ ]:
histogram_boxplot(data, "Rotational speed")

Rotational speed is right-skewed, with many outliers above the upper whisker.
In [ ]:
histogram_boxplot(data, "Torque")

Torque is approximately normally distributed, with a mean of around 40 Nm.
In [ ]:
histogram_boxplot(data, "Tool wear")

Tool wear is roughly uniformly distributed, with some of the higher values being less frequent.
In [ ]:
labeled_barplot(data, "Type", perc=True)

In [ ]:
labeled_barplot(data, "Failure", perc=True)

In [ ]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove('Failure')
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

Variable pairs highlighted by the heatmap: air temperature and process temperature; rotational speed and torque.
Let’s see how the target variable varies across the type of the product.
In [ ]:
stacked_barplot(data, "Type", "Failure")
Failure     0    1    All
Type
All      9661  339  10000
L        5765  235   6000
M        2914   83   2997
H         982   21   1003
------------------------------------------------------------------------------------------------------------------------

Let’s analyze the relation between Process temperature and Failure.
In [ ]:
distribution_plot_wrt_target(data, "Process temperature", "Failure")

Process temperature.
Let’s analyze the relation between Rotational speed and Failure.
In [ ]:
distribution_plot_wrt_target(data, "Rotational speed", "Failure")

Rotational speed.
Rotational speed than at higher rotational speed.
Let’s check for outliers in the data.
In [ ]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
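The boxplots above flag points lying more than 1.5 times the interquartile range beyond the quartiles (`whis=1.5`). As a rough numeric companion, the sketch below counts such points per column. The helper name `iqr_outlier_counts` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def iqr_outlier_counts(df, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < lower) | (numeric > upper)
    return outside.sum()


# Toy data (hypothetical values): one extreme speed reading.
toy = pd.DataFrame(
    {
        "Rotational speed": [1400, 1450, 1500, 1550, 1600, 2886],
        "Torque": [38.0, 39.5, 40.0, 40.5, 41.0, 42.0],
    }
)
counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data this could be called as `iqr_outlier_counts(data)`; whether flagged values should be treated at all is a separate judgment, since high speeds may be genuine operating conditions.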

Observations
In [ ]:
X = data.drop(["Failure"], axis=1)
Y = data["Failure"]
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
In [ ]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (7000, 7)
Shape of test set :  (3000, 7)
Percentage of classes in training set:
0   0.96629
1   0.03371
Name: Failure, dtype: float64
Percentage of classes in test set:
0   0.96567
1   0.03433
Name: Failure, dtype: float64
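The split above happens to keep the failure rate close in both sets (about 3.4% each). With a rare positive class, passing `stratify` to `train_test_split` makes that explicit rather than dependent on the random seed. The sketch below uses synthetic labels (an assumption for illustration); it is not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 34 positives out of 1000 rows (hypothetical data).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 34 + [0] * 966, name="Failure")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

# With stratify, both splits keep roughly the same share of positives.
print(y_tr.mean(), y_te.mean())
```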
In [ ]:
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(random_state=1)
Model evaluation criterion
Model can make wrong predictions as:
Which case is more important?
How to reduce the losses?
The company would want recall to be maximized: the higher the recall, the fewer failures go undetected (false negatives).
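Recall is the share of actual failures that the model flags, TP / (TP + FN), so each missed failure (false negative) lowers it directly. A small sketch with made-up labels (assumed purely for illustration) shows how the value follows from the confusion matrix:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = failure, 0 = no failure.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(manual_recall, recall_score(y_true, y_pred))
```

Precision, TP / (TP + FP), moves the other way when the model is pushed to flag more machines, which is why the later tables report both.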
In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [ ]:
confusion_matrix_sklearn(model0, X_train, y_train)

In [ ]:
decision_tree_perf_train_without = model_performance_classification_sklearn(
model0, X_train, y_train
)
decision_tree_perf_train_without
Out[ ]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
In [ ]:
confusion_matrix_sklearn(model0, X_test, y_test)

In [ ]:
decision_tree_perf_test_without = model_performance_classification_sklearn(
model0, X_test, y_test
)
decision_tree_perf_test_without
Out[ ]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.97567 | 0.72816 | 0.62500 | 0.67265 |
In [ ]:
model = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight='balanced', random_state=1)
In [ ]:
confusion_matrix_sklearn(model, X_train, y_train)

In [ ]:
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
Out[ ]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
In [ ]:
confusion_matrix_sklearn(model, X_test, y_test)

In [ ]:
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
Out[ ]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.97700 | 0.61165 | 0.68478 | 0.64615 |
Let’s use pruning techniques to try and reduce overfitting.
Using GridSearch for Hyperparameter tuning of our tree model
In [ ]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"class_weight": [None, "balanced"],
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=50,
min_samples_split=70, random_state=1)
In [ ]:
confusion_matrix_sklearn(estimator, X_train, y_train)

In [ ]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
Out[ ]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.87343 | 0.96610 | 0.20615 | 0.33979 |
In [ ]:
confusion_matrix_sklearn(estimator, X_test, y_test)

In [ ]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
Out[ ]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.88167 | 0.96117 | 0.22000 | 0.35805 |
In [ ]:
feature_names = list(X_train.columns)
importances = estimator.feature_importances_
indices = np.argsort(importances)
In [ ]:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

In [ ]:
# Text report showing the rules of a decision tree
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Rotational speed <= 1379.50
|   |--- Air temperature <= 301.55
|   |   |--- Torque <= 52.85
|   |   |   |--- Tool wear <= 206.50
|   |   |   |   |--- weights: [176.97, 0.00] class: 0
|   |   |   |--- Tool wear > 206.50
|   |   |   |   |--- weights: [10.35, 59.32] class: 1
|   |   |--- Torque > 52.85
|   |   |   |--- Tool wear <= 185.50
|   |   |   |   |--- weights: [151.61, 341.10] class: 1
|   |   |   |--- Tool wear > 185.50
|   |   |   |   |--- weights: [9.31, 474.58] class: 1
|   |--- Air temperature > 301.55
|   |   |--- Process temperature <= 336.89
|   |   |   |--- Process temperature <= 330.96
|   |   |   |   |--- weights: [1.03, 504.24] class: 1
|   |   |   |--- Process temperature > 330.96
|   |   |   |   |--- weights: [42.95, 845.34] class: 1
|   |   |--- Process temperature > 336.89
|   |   |   |--- weights: [32.08, 59.32] class: 1
|--- Rotational speed > 1379.50
|   |--- Tool wear <= 202.50
|   |   |--- Torque <= 15.45
|   |   |   |--- weights: [11.90, 281.78] class: 1
|   |   |--- Torque > 15.45
|   |   |   |--- Torque <= 58.40
|   |   |   |   |--- weights: [2865.61, 118.64] class: 0
|   |   |   |--- Torque > 58.40
|   |   |   |   |--- weights: [5.69, 177.97] class: 1
|   |--- Tool wear > 202.50
|   |   |--- Tool wear <= 219.50
|   |   |   |--- Torque <= 51.85
|   |   |   |   |--- weights: [154.72, 311.44] class: 1
|   |   |   |--- Torque > 51.85
|   |   |   |   |--- weights: [2.59, 59.32] class: 1
|   |   |--- Tool wear > 219.50
|   |   |   |--- Air temperature <= 297.20
|   |   |   |   |--- weights: [3.10, 0.00] class: 0
|   |   |   |--- Air temperature > 297.20
|   |   |   |   |--- weights: [32.08, 266.95] class: 1
Observations from the pre-pruned tree:
Using the above extracted decision rules we can make interpretations from the decision tree model like:
Interpretations from other decision rules can be made similarly
In [ ]:
importances = estimator.feature_importances_
importances
Out[ ]:
array([0.03002803, 0.00726903, 0.40080865, 0.32726572, 0.23462857,
0. , 0. ])
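The raw array is easier to read when paired with the column names. The sketch below assumes the column order produced earlier by `pd.get_dummies(X, drop_first=True)`, i.e. the five numeric columns followed by `Type_L` and `Type_M`. That order is an inference from the notebook, so check it against `X_train.columns` before relying on it.

```python
import pandas as pd

# Values copied from the output above; the names are an assumed column order.
assumed_feature_names = [
    "Air temperature",
    "Process temperature",
    "Rotational speed",
    "Torque",
    "Tool wear",
    "Type_L",
    "Type_M",
]
importance_values = [
    0.03002803,
    0.00726903,
    0.40080865,
    0.32726572,
    0.23462857,
    0.0,
    0.0,
]

# Pair each importance with its (assumed) feature name and rank them.
ranked = pd.Series(importance_values, index=assumed_feature_names).sort_values(
    ascending=False
)
print(ranked)
```

Under that assumption, Rotational speed, Torque, and Tool wear account for most of the importance, and the product type is not used by this tree, which agrees with the splits in the text report.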
In [ ]:
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
Total impurity of leaves vs effective alphas of pruned tree
Minimal cost complexity pruning recursively finds the node with the “weakest link”. The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path, which returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
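For reference, the quantities involved can be written out as follows. This follows the formulation given in the scikit-learn documentation; the symbols R, T, and t are notation introduced here rather than in the notebook.

```latex
% Cost-complexity measure of a tree T with |\tilde{T}| terminal nodes
R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|

% Effective alpha of an internal node t, where T_t is the subtree rooted at t
\alpha_{\text{eff}}(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}
```

Here R(T) is the total sample-weighted impurity of the terminal nodes. The internal node with the smallest effective alpha is pruned first, and pruning stops once the smallest remaining effective alpha exceeds the chosen ccp_alpha.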
In [ ]:
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [ ]:
pd.DataFrame(path)
Out[ ]:
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00000 | -0.00000 |
| 1 | 0.00000 | -0.00000 |
| 2 | 0.00000 | -0.00000 |
| 3 | 0.00000 | -0.00000 |
| 4 | 0.00000 | -0.00000 |
| 5 | 0.00000 | -0.00000 |
| 6 | 0.00000 | -0.00000 |
| 7 | 0.00000 | -0.00000 |
| 8 | 0.00000 | -0.00000 |
| 9 | 0.00000 | -0.00000 |
| 10 | 0.00000 | 0.00000 |
| 11 | 0.00000 | 0.00000 |
| 12 | 0.00000 | 0.00000 |
| 13 | 0.00000 | 0.00000 |
| 14 | 0.00000 | 0.00000 |
| 15 | 0.00007 | 0.00015 |
| 16 | 0.00007 | 0.00029 |
| 17 | 0.00007 | 0.00044 |
| 18 | 0.00007 | 0.00058 |
| 19 | 0.00007 | 0.00073 |
| 20 | 0.00007 | 0.00088 |
| 21 | 0.00007 | 0.00117 |
| 22 | 0.00007 | 0.00132 |
| 23 | 0.00007 | 0.00147 |
| 24 | 0.00014 | 0.00175 |
| 25 | 0.00014 | 0.00231 |
| 26 | 0.00014 | 0.00246 |
| 27 | 0.00015 | 0.00260 |
| 28 | 0.00015 | 0.00304 |
| 29 | 0.00015 | 0.00319 |
| 30 | 0.00015 | 0.00333 |
| 31 | 0.00015 | 0.00348 |
| 32 | 0.00015 | 0.00363 |
| 33 | 0.00015 | 0.00377 |
| 34 | 0.00015 | 0.00392 |
| 35 | 0.00015 | 0.00407 |
| 36 | 0.00015 | 0.00421 |
| 37 | 0.00019 | 0.00479 |
| 38 | 0.00021 | 0.00522 |
| 39 | 0.00022 | 0.00565 |
| 40 | 0.00023 | 0.00680 |
| 41 | 0.00026 | 0.00783 |
| 42 | 0.00026 | 0.00809 |
| 43 | 0.00028 | 0.00837 |
| 44 | 0.00028 | 0.00893 |
| 45 | 0.00029 | 0.00921 |
| 46 | 0.00029 | 0.00950 |
| 47 | 0.00029 | 0.00978 |
| 48 | 0.00029 | 0.01095 |
| 49 | 0.00035 | 0.01338 |
| 50 | 0.00036 | 0.01409 |
| 51 | 0.00039 | 0.01448 |
| 52 | 0.00040 | 0.01488 |
| 53 | 0.00040 | 0.01568 |
| 54 | 0.00040 | 0.01608 |
| 55 | 0.00040 | 0.01648 |
| 56 | 0.00043 | 0.01692 |
| 57 | 0.00047 | 0.01739 |
| 58 | 0.00053 | 0.01792 |
| 59 | 0.00056 | 0.01848 |
| 60 | 0.00058 | 0.02021 |
| 61 | 0.00061 | 0.02082 |
| 62 | 0.00063 | 0.02271 |
| 63 | 0.00067 | 0.02338 |
| 64 | 0.00069 | 0.02407 |
| 65 | 0.00072 | 0.02551 |
| 66 | 0.00073 | 0.02625 |
| 67 | 0.00074 | 0.02698 |
| 68 | 0.00086 | 0.02784 |
| 69 | 0.00089 | 0.03138 |
| 70 | 0.00104 | 0.03242 |
| 71 | 0.00106 | 0.03348 |
| 72 | 0.00107 | 0.03455 |
| 73 | 0.00115 | 0.03570 |
| 74 | 0.00116 | 0.03919 |
| 75 | 0.00119 | 0.04632 |
| 76 | 0.00122 | 0.04754 |
| 77 | 0.00122 | 0.04875 |
| 78 | 0.00134 | 0.05543 |
| 79 | 0.00134 | 0.05812 |
| 80 | 0.00141 | 0.05953 |
| 81 | 0.00153 | 0.06106 |
| 82 | 0.00161 | 0.07559 |
| 83 | 0.00167 | 0.08559 |
| 84 | 0.00181 | 0.08740 |
| 85 | 0.00185 | 0.09109 |
| 86 | 0.00217 | 0.09544 |
| 87 | 0.00247 | 0.10037 |
| 88 | 0.00401 | 0.10839 |
| 89 | 0.01035 | 0.11875 |
| 90 | 0.01578 | 0.18187 |
| 91 | 0.04268 | 0.22456 |
| 92 | 0.05757 | 0.28212 |
| 93 | 0.06912 | 0.35124 |
| 94 | 0.14876 | 0.50000 |
In [ ]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
In [ ]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.14875976077076158
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.
In [ ]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

In [ ]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
In [ ]:
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [ ]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [ ]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()

In [ ]:
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.004008680486241742, class_weight='balanced',
random_state=1)
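Picking the alpha with the highest test recall, as above, reuses the test set for model selection, so the reported test score is likely optimistic. One common alternative is to choose `ccp_alpha` by cross-validation on the training data only. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration; names such as `X_demo` are hypothetical and this is not the procedure used in the notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training split (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, weights=[0.9, 0.1], random_state=1
)

# Candidate alphas from the pruning path; drop the last one (single-node tree).
path = DecisionTreeClassifier(
    random_state=1, class_weight="balanced"
).cost_complexity_pruning_path(X_demo, y_demo)
candidate_alphas = np.unique(np.abs(path.ccp_alphas[:-1]))

# Mean 5-fold cross-validated recall for each candidate alpha.
cv_recall = [
    cross_val_score(
        DecisionTreeClassifier(
            random_state=1, class_weight="balanced", ccp_alpha=alpha
        ),
        X_demo,
        y_demo,
        scoring="recall",
        cv=5,
    ).mean()
    for alpha in candidate_alphas
]

best_alpha = candidate_alphas[int(np.argmax(cv_recall))]
print(best_alpha, max(cv_recall))
```

With this approach the test set is touched only once, to report the performance of the tree refit with `best_alpha`.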
In [ ]:
confusion_matrix_sklearn(best_model, X_train, y_train)

In [ ]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
Out[ ]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.91143 | 0.96610 | 0.27143 | 0.42379 |
In [ ]:
confusion_matrix_sklearn(best_model, X_test, y_test)

In [ ]:
decision_tree_post_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_test
Out[ ]:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.91733 | 0.96117 | 0.28863 | 0.44395 |
In [ ]:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

In [ ]:
# Text report showing the rules of a decision tree
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Rotational speed <= 1379.50
|   |--- Air temperature <= 301.55
|   |   |--- Torque <= 52.85
|   |   |   |--- Tool wear <= 206.50
|   |   |   |   |--- weights: [176.97, 0.00] class: 0
|   |   |   |--- Tool wear > 206.50
|   |   |   |   |--- weights: [10.35, 59.32] class: 1
|   |   |--- Torque > 52.85
|   |   |   |--- Tool wear <= 185.50
|   |   |   |   |--- Torque <= 62.35
|   |   |   |   |   |--- weights: [140.75, 0.00] class: 0
|   |   |   |   |--- Torque > 62.35
|   |   |   |   |   |--- weights: [10.87, 341.10] class: 1
|   |   |   |--- Tool wear > 185.50
|   |   |   |   |--- weights: [9.31, 474.58] class: 1
|   |--- Air temperature > 301.55
|   |   |--- weights: [76.06, 1408.90] class: 1
|--- Rotational speed > 1379.50
|   |--- Tool wear <= 202.50
|   |   |--- Torque <= 15.45
|   |   |   |--- weights: [11.90, 281.78] class: 1
|   |   |--- Torque > 15.45
|   |   |   |--- Torque <= 58.40
|   |   |   |   |--- weights: [2865.61, 118.64] class: 0
|   |   |   |--- Torque > 58.40
|   |   |   |   |--- weights: [5.69, 177.97] class: 1
|   |--- Tool wear > 202.50
|   |   |--- weights: [192.49, 637.71] class: 1
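Because the post-pruned tree is small, its rules can be transcribed into plain conditionals, which can help when discussing them with maintenance staff. The function below is a hand transcription of the text report above (thresholds copied as printed); the function name and dictionary keys are illustrative, and `best_model.predict` remains the reference.

```python
def predict_failure_from_rules(row):
    """Return 1 (failure) or 0 using the post-pruned tree's printed rules."""
    if row["Rotational speed"] <= 1379.50:
        if row["Air temperature"] > 301.55:
            return 1
        if row["Torque"] <= 52.85:
            return 1 if row["Tool wear"] > 206.50 else 0
        # Torque > 52.85 from here on
        if row["Tool wear"] > 185.50:
            return 1
        return 1 if row["Torque"] > 62.35 else 0
    # Rotational speed > 1379.50
    if row["Tool wear"] > 202.50:
        return 1
    if row["Torque"] <= 15.45:
        return 1
    return 1 if row["Torque"] > 58.40 else 0


# First row shown by data.head() earlier in the notebook.
first_row = {
    "Air temperature": 298.1,
    "Rotational speed": 1551,
    "Torque": 42.8,
    "Tool wear": 0,
}
# Hypothetical reading: slow spindle in warm air.
hot_slow_row = {
    "Air temperature": 302.0,
    "Rotational speed": 1300,
    "Torque": 40.0,
    "Tool wear": 50,
}
print(predict_failure_from_rules(first_row), predict_failure_from_rules(hot_slow_row))
```

Read this way, the only "no failure" leaf at normal speeds requires moderate torque (between 15.45 and 58.40) and tool wear of at most 202.50.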
In [ ]:
importances = best_model.feature_importances_
indices = np.argsort(importances)
In [ ]:
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

In [ ]:
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train_without.T,
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree without class_weight",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
| | Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|
| Accuracy | 1.00000 | 1.00000 | 0.87343 | 0.91143 |
| Recall | 1.00000 | 1.00000 | 0.96610 | 0.96610 |
| Precision | 1.00000 | 1.00000 | 0.20615 | 0.27143 |
| F1 | 1.00000 | 1.00000 | 0.33979 | 0.42379 |
In [ ]:
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test_without.T,
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree without class_weight",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[ ]:
| | Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|
| Accuracy | 0.97567 | 0.97700 | 0.88167 | 0.91733 |
| Recall | 0.72816 | 0.61165 | 0.96117 | 0.96117 |
| Precision | 0.62500 | 0.68478 | 0.22000 | 0.28863 |
| F1 | 0.67265 | 0.64615 | 0.35805 | 0.44395 |
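The trade-off behind these numbers is easier to see as counts. The test set has 3,000 rows with a failure share of 0.03433, i.e. about 103 failures, so true and false positives can be backed out of the reported recall and precision. The helper below is an illustrative sketch (its name is hypothetical); rounding is used because the metrics are printed to five decimals.

```python
def counts_from_metrics(recall, precision, n_positive):
    """Back out (TP, FN, FP) from recall, precision and the number of actual positives."""
    tp = round(recall * n_positive)
    fn = n_positive - tp
    fp = round(tp / precision) - tp
    return tp, fn, fp


n_failures_test = round(3000 * 0.03433)  # about 103 failures in the test set

# (recall, precision) pairs copied from the test comparison table above.
reported = {
    "Decision Tree without class_weight": (0.72816, 0.62500),
    "Decision Tree with class_weight": (0.61165, 0.68478),
    "Decision Tree (Pre-Pruning)": (0.96117, 0.22000),
    "Decision Tree (Post-Pruning)": (0.96117, 0.28863),
}
for name, (recall, precision) in reported.items():
    tp, fn, fp = counts_from_metrics(recall, precision, n_failures_test)
    print(f"{name}: TP={tp}, FN={fn}, FP={fp}")
```

Under these assumptions, both pruned trees miss about 4 of the 103 failures, but the post-pruned tree raises roughly 244 false alarms against roughly 351 for the pre-pruned one, which is what its higher precision reflects.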
In [ ]:
histogram_boxplot(data, "Air temperature")

The air temperature distribution looks slightly left-skewed, with a mean of around 300 K.
In [ ]:
histogram_boxplot(data, "Process temperature")

The process temperature distribution looks slightly left-skewed, with a mean of around 329 K.
In [ ]:
histogram_boxplot(data, "Rotational speed")

Rotational speed is right-skewed, with many outliers above the upper whisker.
In [ ]:
histogram_boxplot(data, "Torque")

Torque is approximately normally distributed, with a mean of around 40 Nm.
In [ ]:
histogram_boxplot(data, "Tool wear")

Tool wear is roughly uniformly distributed, with some of the higher values being less frequent.
In [ ]:
labeled_barplot(data, "Type", perc=True)

Failure
In [ ]:
labeled_barplot(data, "Failure", perc=True)

In [ ]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove('Failure')
plt.figure(figsize=(12, 7))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

Variable pairs highlighted by the heatmap: air temperature and process temperature; rotational speed and torque.
Type vs Air temperature
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Air temperature")
plt.show()

Air temperature and Type
Type vs Process temperature
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Process temperature")
plt.show()

Process temperature and Type for M and L types. Process temperature is observed in manufacturing H type of products.
Type vs Rotational speed
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Rotational speed")
plt.show()

Type vs Tool wear
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Tool wear")
plt.show()

Tool wear and Type
Type vs Torque
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Torque")
plt.show()

Torque as compared to M and H type of products.
Let’s see how the target variable varies across the type of the product.
In [ ]:
stacked_barplot(data, "Type", "Failure")
Failure     0    1    All
Type
All      9661  339  10000
L        5765  235   6000
M        2914   83   2997
H         982   21   1003
------------------------------------------------------------------------------------------------------------------------

Let’s analyze the relation between Air temperature and Failure.
In [ ]:
distribution_plot_wrt_target(data, "Air temperature", "Failure")

Air temperature.
Let’s analyze the relation between Process temperature and Failure.
In [ ]:
distribution_plot_wrt_target(data, "Process temperature", "Failure")

Process temperature.
Let’s analyze the relation between Rotational speed and Failure.
In [ ]:
distribution_plot_wrt_target(data, "Rotational speed", "Failure")

Rotational speed than at higher rotational speed.
Let’s analyze the relation between Torque and Failure.
In [ ]:
distribution_plot_wrt_target(data, "Torque", "Failure")

Let’s analyze the relation between Tool wear and Failure.
In [ ]:
distribution_plot_wrt_target(data, "Tool wear", "Failure")

In [ ]:
sns.pairplot(data, hue="Failure")
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7f68dcb4ac10>
