System failure is a common issue across the manufacturing industry, where a wide variety of machines and equipment is in use. It is therefore important to predict machine failures by analyzing system data and to take preventive measures before they occur. This is known as predictive maintenance. With the rising availability of data and computational resources, such data-driven, proactive maintenance methods have delivered several benefits, such as reduced equipment downtime and lower costs for spares and supplies.
AutoMobi Engineering Pvt. Ltd is an auto component manufacturing company. Its manufacturing facility produces numerous products machined on several CNC (Computer Numerical Control) machines. To transition to a data-driven maintenance process, the company has set up sensors at various locations to collect data on the parameters involved in the manufacturing process. Initially, it wants to try this in the injector nozzle manufacturing shop, where fuel injector nozzles for automobile engines are made using various processes (turning, drilling, etc.). The company has been collecting data from these sensors on an hourly basis and aims to build ML-based solutions that use the data to optimize cost, improve failure predictability, and minimize equipment downtime.
AutoMobi has recently been encountering frequent equipment failures in the fuel injector nozzle manufacturing unit, which disrupt the manufacturing process. They have reached out to the Data Science team for a solution and shared data for the past three months. As a member of the Data Science team, you are tasked with analyzing the data and developing a Machine Learning model to detect potential machine failures, determine the factors that most influence machine health, and provide cost optimization recommendations to the management.
The data contains the different attributes of machines and health. The detailed data dictionary is given below.
Data Dictionary
In [ ]:
# this will help in making the Python code more structured automatically (help adhere to good coding practices)
%load_ext nb_black

import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
)
In [ ]:
df_main = pd.read_csv("Predictive_Maintenance_Case_Study.csv")
In [ ]:
# copying data to another variable to avoid any changes to original data
data = df_main.copy()
In [ ]:
data.head()
Out[ ]:
| UDI | Type | Air temperature | Process temperature | Rotational speed | Torque | Tool wear | Failure |
0 | 1 | M | 298.10000 | 323.74074 | 1551 | 42.80000 | 0 | 0 |
1 | 2 | L | 298.20000 | 324.11111 | 1408 | 46.30000 | 3 | 0 |
2 | 3 | L | 298.10000 | 323.37037 | 1498 | 49.40000 | 5 | 0 |
3 | 4 | L | 298.20000 | 323.74074 | 1433 | 39.50000 | 7 | 0 |
4 | 5 | L | 298.20000 | 324.11111 | 1408 | 40.00000 | 9 | 0 |
In [ ]:
data.tail()
Out[ ]:
| UDI | Type | Air temperature | Process temperature | Rotational speed | Torque | Tool wear | Failure |
9995 | 9996 | M | 298.80000 | 323.00000 | 1604 | 29.50000 | 14 | 0 |
9996 | 9997 | H | 298.90000 | 323.00000 | 1632 | 31.80000 | 17 | 0 |
9997 | 9998 | M | 299.00000 | 323.74074 | 1645 | 33.40000 | 22 | 0 |
9998 | 9999 | H | 299.00000 | 324.11111 | 1408 | 48.50000 | 25 | 0 |
9999 | 10000 | M | 299.00000 | 324.11111 | 1500 | 40.20000 | 30 | 0 |
The UDI column contains unique values.
In [ ]:
data.shape
Out[ ]:
(10000, 8)
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   UDI                  10000 non-null  int64
 1   Type                 10000 non-null  object
 2   Air temperature      10000 non-null  float64
 3   Process temperature  10000 non-null  float64
 4   Rotational speed     10000 non-null  int64
 5   Torque               10000 non-null  float64
 6   Tool wear            10000 non-null  int64
 7   Failure              10000 non-null  int64
dtypes: float64(3), int64(4), object(1)
memory usage: 625.1+ KB
The Type column is of object type, while the remaining columns are numeric.
In [ ]:
# checking for null values
data.isnull().sum()
Out[ ]:
UDI                    0
Type                   0
Air temperature        0
Process temperature    0
Rotational speed       0
Torque                 0
Tool wear              0
Failure                0
dtype: int64
In [ ]:
# checking for duplicate values
data.duplicated().sum()
Out[ ]:
In [ ]:
Out[ ]:
The UDI column contains only unique values, so we can drop it.
In [ ]:
data = data.drop(["UDI"], axis=1)
Let’s check the statistical summary of the data.
In [ ]:
data.describe().T
Out[ ]:
| count | mean | std | min | 25% | 50% | 75% | max |
Air temperature | 10000.00000 | 300.00493 | 2.00026 | 295.30000 | 298.30000 | 300.10000 | 301.50000 | 304.50000 |
Process temperature | 10000.00000 | 328.94652 | 5.49531 | 313.00000 | 324.48148 | 329.29630 | 333.00000 | 343.00000 |
Rotational speed | 10000.00000 | 1538.77610 | 179.28410 | 1168.00000 | 1423.00000 | 1503.00000 | 1612.00000 | 2886.00000 |
Torque | 10000.00000 | 39.98691 | 9.96893 | 3.80000 | 33.20000 | 40.10000 | 46.80000 | 76.60000 |
Tool wear | 10000.00000 | 107.95100 | 63.65415 | 0.00000 | 53.00000 | 108.00000 | 162.00000 | 253.00000 |
Failure | 10000.00000 | 0.03390 | 0.18098 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
Air temperature ranges from 295.3 K to 304.5 K. Machine shops are usually kept in a controlled environment, so this range looks normal.
Process temperature is somewhat higher than the air temperature, which is expected because heat is generated continuously during machining.
Rotational speed has a maximum of 2886 rpm against 1612 rpm at the 75th percentile, so some processes run at a much higher speed than usual.
The functions below need to be defined to carry out the EDA.
In [ ]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage

    plt.show()  # show the plot
In [ ]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(
        loc="lower left",
        frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
In [ ]:
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
In [ ]:
histogram_boxplot(data, "Air temperature")
The Air temperature distribution looks slightly left-skewed, with a mean of around 300 K.
In [ ]:
histogram_boxplot(data, "Process temperature")
The Process temperature distribution looks slightly left-skewed, with a mean of around 329 K.
In [ ]:
histogram_boxplot(data, "Rotational speed")
Rotational speed is right-skewed, with many outliers above the upper quartile.
In [ ]:
histogram_boxplot(data, "Torque")
Torque is approximately normally distributed, with a mean of around 40 Nm.
In [ ]:
histogram_boxplot(data, "Tool wear")
Tool wear is roughly uniformly distributed, with some of the higher values being less frequent.
In [ ]:
labeled_barplot(data, "Type", perc=True)
In [ ]:
labeled_barplot(data, "Failure", perc=True)
In [ ]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove("Failure")

plt.figure(figsize=(12, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
The correlations to note are between Air temperature and Process temperature, and between Rotational speed and Torque.
Let’s see how the target variable varies across the type of the product
In [ ]:
stacked_barplot(data, "Type", "Failure")
Failure     0    1    All
Type
All      9661  339  10000
L        5765  235   6000
M        2914   83   2997
H         982   21   1003
------------------------------------------------------------------------------------------------------------------------
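The counts printed above are easier to compare as failure rates per product type. The following is a small worked calculation that uses only the numbers from that table; the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the crosstab printed above: (non-failures, failures) per Type.
counts = {
    "L": (5765, 235),
    "M": (2914, 83),
    "H": (982, 21),
}

failure_rates = {}
for product_type, (ok, failed) in counts.items():
    total = ok + failed
    failure_rates[product_type] = failed / total
    print(f"Type {product_type}: {failed}/{total} = {failure_rates[product_type]:.2%}")

# Overall failure rate across all 10,000 rows.
overall = sum(f for _, f in counts.values()) / sum(o + f for o, f in counts.values())
print(f"Overall: {overall:.2%}")
```

On these counts, type L has the highest failure share (about 3.9%), followed by M (about 2.8%) and H (about 2.1%).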
Let’s analyze the relation between Process temperature and Failure
In [ ]:
distribution_plot_wrt_target(data, "Process temperature", "Failure")
Process temperature.
Let’s analyze the relation between Rotational speed and Failure
In [ ]:
distribution_plot_wrt_target(data, "Rotational speed", "Failure")
Rotational speed
.Rotational speed
than at higher rotational speed.
Let’s check for outliers in the data.
In [ ]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
In [ ]:
X = data.drop(["Failure"], axis=1)
Y = data["Failure"]

X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [ ]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (7000, 7)
Shape of test set :  (3000, 7)
Percentage of classes in training set:
0   0.96629
1   0.03371
Name: Failure, dtype: float64
Percentage of classes in test set:
0   0.96567
1   0.03433
Name: Failure, dtype: float64
In [ ]:
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
Out[ ]:
Model evaluation criterion
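Model can make wrong predictions as:
Predicting that a machine will fail when it actually does not (false positive).
Predicting that a machine will not fail when it actually does (false negative).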
Which case is more important?
How to reduce the losses?
The company would want recall to be maximized: the greater the recall score, the fewer the false negatives, i.e. actual failures that the model misses.
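To make the link between recall and false negatives concrete, here is a minimal sketch that computes the four metrics used in this notebook directly from confusion-matrix counts. The helper `classification_metrics` and the counts passed to it are illustrative assumptions, not output of the notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, recall, precision and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall: share of actual failures that the model catches.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of predicted failures that are real failures.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}


# Illustrative counts: 103 actual failures, of which 28 are missed (false negatives).
metrics = classification_metrics(tp=75, fp=45, fn=28, tn=2852)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

Every missed failure moves a case from `tp` to `fn`, which lowers recall while leaving precision's denominator untouched, so recall is the metric that tracks undetected failures most directly.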
In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [ ]:
confusion_matrix_sklearn(model0, X_train, y_train)
In [ ]:
decision_tree_perf_train_without = model_performance_classification_sklearn(
    model0, X_train, y_train
)
decision_tree_perf_train_without
Out[ ]:
| Accuracy | Recall | Precision | F1 |
0 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
In [ ]:
confusion_matrix_sklearn(model0, X_test, y_test)
In [ ]:
decision_tree_perf_test_without = model_performance_classification_sklearn(
    model0, X_test, y_test
)
decision_tree_perf_test_without
Out[ ]:
| Accuracy | Recall | Precision | F1 |
0 | 0.97567 | 0.72816 | 0.62500 | 0.67265 |
In [ ]:
model = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight='balanced', random_state=1)
In [ ]:
confusion_matrix_sklearn(model, X_train, y_train)
In [ ]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
Out[ ]:
| Accuracy | Recall | Precision | F1 |
0 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
In [ ]:
confusion_matrix_sklearn(model, X_test, y_test)
In [ ]:
decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test
Out[ ]:
| Accuracy | Recall | Precision | F1 |
0 | 0.97700 | 0.61165 | 0.68478 | 0.64615 |
Let’s use pruning techniques to try and reduce overfitting.
Using GridSearch for Hyperparameter tuning of our tree model
In [ ]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "class_weight": [None, "balanced"],
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight='balanced', max_depth=4, max_leaf_nodes=50, min_samples_split=70, random_state=1)
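To get a feel for how much work the grid search above does, the sketch below enumerates the same parameter grid and counts the cross-validation fits. It uses only the standard library; `range(2, 7, 2)` stands in for `np.arange(2, 7, 2)` and yields the same values, and the count excludes the final refit on the full training set.

```python
from itertools import product

# Same grid as in the search above.
parameters = {
    "class_weight": [None, "balanced"],
    "max_depth": list(range(2, 7, 2)),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Every combination of one value per parameter is a candidate model.
combinations = list(product(*parameters.values()))
cv_folds = 5
n_cv_fits = len(combinations) * cv_folds

print("max_depth candidates:", parameters["max_depth"])
print("parameter combinations:", len(combinations))
print("cross-validation fits:", n_cv_fits)
```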
In [ ]:
confusion_matrix_sklearn(estimator, X_train, y_train)
In [ ]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)
decision_tree_tune_perf_train
Out[ ]:
| Accuracy | Recall | Precision | F1 |
0 | 0.87343 | 0.96610 | 0.20615 | 0.33979 |
In [ ]:
confusion_matrix_sklearn(estimator, X_test, y_test)
In [ ]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)
decision_tree_tune_perf_test
Out[ ]:
| Accuracy | Recall | Precision | F1 |
0 | 0.88167 | 0.96117 | 0.22000 | 0.35805 |
In [ ]:
feature_names = list(X_train.columns)
importances = estimator.feature_importances_
indices = np.argsort(importances)
In [ ]:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
In [ ]:
# Text report showing the rules of a decision tree
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Rotational speed <= 1379.50
|   |--- Air temperature <= 301.55
|   |   |--- Torque <= 52.85
|   |   |   |--- Tool wear <= 206.50
|   |   |   |   |--- weights: [176.97, 0.00] class: 0
|   |   |   |--- Tool wear > 206.50
|   |   |   |   |--- weights: [10.35, 59.32] class: 1
|   |   |--- Torque > 52.85
|   |   |   |--- Tool wear <= 185.50
|   |   |   |   |--- weights: [151.61, 341.10] class: 1
|   |   |   |--- Tool wear > 185.50
|   |   |   |   |--- weights: [9.31, 474.58] class: 1
|   |--- Air temperature > 301.55
|   |   |--- Process temperature <= 336.89
|   |   |   |--- Process temperature <= 330.96
|   |   |   |   |--- weights: [1.03, 504.24] class: 1
|   |   |   |--- Process temperature > 330.96
|   |   |   |   |--- weights: [42.95, 845.34] class: 1
|   |   |--- Process temperature > 336.89
|   |   |   |--- weights: [32.08, 59.32] class: 1
|--- Rotational speed > 1379.50
|   |--- Tool wear <= 202.50
|   |   |--- Torque <= 15.45
|   |   |   |--- weights: [11.90, 281.78] class: 1
|   |   |--- Torque > 15.45
|   |   |   |--- Torque <= 58.40
|   |   |   |   |--- weights: [2865.61, 118.64] class: 0
|   |   |   |--- Torque > 58.40
|   |   |   |   |--- weights: [5.69, 177.97] class: 1
|   |--- Tool wear > 202.50
|   |   |--- Tool wear <= 219.50
|   |   |   |--- Torque <= 51.85
|   |   |   |   |--- weights: [154.72, 311.44] class: 1
|   |   |   |--- Torque > 51.85
|   |   |   |   |--- weights: [2.59, 59.32] class: 1
|   |   |--- Tool wear > 219.50
|   |   |   |--- Air temperature <= 297.20
|   |   |   |   |--- weights: [3.10, 0.00] class: 0
|   |   |   |--- Air temperature > 297.20
|   |   |   |   |--- weights: [32.08, 266.95] class: 1
Observations from the pre-pruned tree:
Using the above extracted decision rules we can make interpretations from the decision tree model like:
Interpretations from other decision rules can be made similarly
In [ ]:
importances = estimator.feature_importances_
importances
Out[ ]:
array([0.03002803, 0.00726903, 0.40080865, 0.32726572, 0.23462857, 0. , 0. ])
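The raw array is easier to read once each value is paired with its feature name. The sketch below does that in plain Python with the values copied from the output above. The column order is an assumption: it is the order `pd.get_dummies(X, drop_first=True)` would be expected to produce for this data (numeric columns first, then `Type_L` and `Type_M`).

```python
# Importances copied from the array printed above.
importances = [0.03002803, 0.00726903, 0.40080865, 0.32726572, 0.23462857, 0.0, 0.0]

# Assumed column order of X_train: the five numeric columns followed by the
# dummy columns created for Type (the first level, H, is dropped).
feature_names = [
    "Air temperature",
    "Process temperature",
    "Rotational speed",
    "Torque",
    "Tool wear",
    "Type_L",
    "Type_M",
]

# Sort features from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<20} {score:.4f}")
```

Under that assumed ordering, Rotational speed, Torque, and Tool wear account for most of the importance, and the two Type dummies are not used by the tree at all, which is consistent with the rules printed earlier.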
In [ ]:
# importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
Total impurity of leaves vs effective alphas of pruned tree
Minimal cost complexity pruning recursively finds the node with the “weakest link”. The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path, which returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
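As a self-contained illustration of the pruning path, the sketch below runs `cost_complexity_pruning_path` on a small synthetic dataset and refits one tree per effective alpha. The data from `make_classification` and all parameter values are stand-ins chosen for the example, not the case-study data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the case-study data).
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=1,
)

clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X, y)

# Floating-point noise can yield tiny negative alphas; clip them to zero.
ccp_alphas = np.clip(path.ccp_alphas, 0, None)
impurities = path.impurities

# Refit one tree per alpha and record how many nodes survive pruning.
node_counts = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y).tree_.node_count
    for alpha in ccp_alphas
]

for alpha, impurity, nodes in zip(ccp_alphas, impurities, node_counts):
    print(f"alpha={alpha:.5f}  total leaf impurity={impurity:.5f}  nodes={nodes}")
```

Reading the printout from top to bottom shows the trade-off described above: larger alphas leave fewer nodes and a higher total leaf impurity.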
In [ ]:
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [ ]:
pd.DataFrame(path)
Out[ ]:
| ccp_alphas | impurities |
0 | 0.00000 | -0.00000 |
1 | 0.00000 | -0.00000 |
2 | 0.00000 | -0.00000 |
3 | 0.00000 | -0.00000 |
4 | 0.00000 | -0.00000 |
5 | 0.00000 | -0.00000 |
6 | 0.00000 | -0.00000 |
7 | 0.00000 | -0.00000 |
8 | 0.00000 | -0.00000 |
9 | 0.00000 | -0.00000 |
10 | 0.00000 | 0.00000 |
11 | 0.00000 | 0.00000 |
12 | 0.00000 | 0.00000 |
13 | 0.00000 | 0.00000 |
14 | 0.00000 | 0.00000 |
15 | 0.00007 | 0.00015 |
16 | 0.00007 | 0.00029 |
17 | 0.00007 | 0.00044 |
18 | 0.00007 | 0.00058 |
19 | 0.00007 | 0.00073 |
20 | 0.00007 | 0.00088 |
21 | 0.00007 | 0.00117 |
22 | 0.00007 | 0.00132 |
23 | 0.00007 | 0.00147 |
24 | 0.00014 | 0.00175 |
25 | 0.00014 | 0.00231 |
26 | 0.00014 | 0.00246 |
27 | 0.00015 | 0.00260 |
28 | 0.00015 | 0.00304 |
29 | 0.00015 | 0.00319 |
30 | 0.00015 | 0.00333 |
31 | 0.00015 | 0.00348 |
32 | 0.00015 | 0.00363 |
33 | 0.00015 | 0.00377 |
34 | 0.00015 | 0.00392 |
35 | 0.00015 | 0.00407 |
36 | 0.00015 | 0.00421 |
37 | 0.00019 | 0.00479 |
38 | 0.00021 | 0.00522 |
39 | 0.00022 | 0.00565 |
40 | 0.00023 | 0.00680 |
41 | 0.00026 | 0.00783 |
42 | 0.00026 | 0.00809 |
43 | 0.00028 | 0.00837 |
44 | 0.00028 | 0.00893 |
45 | 0.00029 | 0.00921 |
46 | 0.00029 | 0.00950 |
47 | 0.00029 | 0.00978 |
48 | 0.00029 | 0.01095 |
49 | 0.00035 | 0.01338 |
50 | 0.00036 | 0.01409 |
51 | 0.00039 | 0.01448 |
52 | 0.00040 | 0.01488 |
53 | 0.00040 | 0.01568 |
54 | 0.00040 | 0.01608 |
55 | 0.00040 | 0.01648 |
56 | 0.00043 | 0.01692 |
57 | 0.00047 | 0.01739 |
58 | 0.00053 | 0.01792 |
59 | 0.00056 | 0.01848 |
60 | 0.00058 | 0.02021 |
61 | 0.00061 | 0.02082 |
62 | 0.00063 | 0.02271 |
63 | 0.00067 | 0.02338 |
64 | 0.00069 | 0.02407 |
65 | 0.00072 | 0.02551 |
66 | 0.00073 | 0.02625 |
67 | 0.00074 | 0.02698 |
68 | 0.00086 | 0.02784 |
69 | 0.00089 | 0.03138 |
70 | 0.00104 | 0.03242 |
71 | 0.00106 | 0.03348 |
72 | 0.00107 | 0.03455 |
73 | 0.00115 | 0.03570 |
74 | 0.00116 | 0.03919 |
75 | 0.00119 | 0.04632 |
76 | 0.00122 | 0.04754 |
77 | 0.00122 | 0.04875 |
78 | 0.00134 | 0.05543 |
79 | 0.00134 | 0.05812 |
80 | 0.00141 | 0.05953 |
81 | 0.00153 | 0.06106 |
82 | 0.00161 | 0.07559 |
83 | 0.00167 | 0.08559 |
84 | 0.00181 | 0.08740 |
85 | 0.00185 | 0.09109 |
86 | 0.00217 | 0.09544 |
87 | 0.00247 | 0.10037 |
88 | 0.00401 | 0.10839 |
89 | 0.01035 | 0.11875 |
90 | 0.01578 | 0.18187 |
91 | 0.04268 | 0.22456 |
92 | 0.05757 | 0.28212 |
93 | 0.06912 | 0.35124 |
94 | 0.14876 | 0.50000 |
In [ ]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
In [ ]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.14875976077076158
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.
In [ ]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
In [ ]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
In [ ]:
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [ ]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [ ]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
In [ ]:
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.004008680486241742, class_weight='balanced', random_state=1)
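The selection above picks the alpha with the best recall on the test set. An alternative, sketched here as an assumption rather than as what this notebook does, is to choose ccp_alpha by cross-validated recall on the training data only, so that the test set stays untouched for the final check. The synthetic data and all names below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the case-study data).
X, y = make_classification(
    n_samples=600,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

base = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = base.cost_complexity_pruning_path(X_tr, y_tr)

# Drop the last alpha (single-node tree), clip float noise, remove duplicates.
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

# Pick the alpha with the best 5-fold cross-validated recall on the training set.
search = GridSearchCV(base, {"ccp_alpha": candidate_alphas}, scoring="recall", cv=5)
search.fit(X_tr, y_tr)

print("selected ccp_alpha:", search.best_params_["ccp_alpha"])
print("cross-validated recall:", round(search.best_score_, 3))
print("held-out recall:", round(recall_score(y_te, search.predict(X_te)), 3))
```

The held-out recall printed last is then an estimate that did not influence the choice of alpha.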
In [ ]:
confusion_matrix_sklearn(best_model, X_train, y_train)
In [ ]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_post_perf_train
Out[ ]:
| Accuracy | Recall | Precision | F1 |
0 | 0.91143 | 0.96610 | 0.27143 | 0.42379 |
In [ ]:
confusion_matrix_sklearn(best_model, X_test, y_test)
In [ ]:
decision_tree_post_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)
decision_tree_post_test
Out[ ]:
| Accuracy | Recall | Precision | F1 |
0 | 0.91733 | 0.96117 | 0.28863 | 0.44395 |
In [ ]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
In [ ]:
# Text report showing the rules of a decision tree
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Rotational speed <= 1379.50
|   |--- Air temperature <= 301.55
|   |   |--- Torque <= 52.85
|   |   |   |--- Tool wear <= 206.50
|   |   |   |   |--- weights: [176.97, 0.00] class: 0
|   |   |   |--- Tool wear > 206.50
|   |   |   |   |--- weights: [10.35, 59.32] class: 1
|   |   |--- Torque > 52.85
|   |   |   |--- Tool wear <= 185.50
|   |   |   |   |--- Torque <= 62.35
|   |   |   |   |   |--- weights: [140.75, 0.00] class: 0
|   |   |   |   |--- Torque > 62.35
|   |   |   |   |   |--- weights: [10.87, 341.10] class: 1
|   |   |   |--- Tool wear > 185.50
|   |   |   |   |--- weights: [9.31, 474.58] class: 1
|   |--- Air temperature > 301.55
|   |   |--- weights: [76.06, 1408.90] class: 1
|--- Rotational speed > 1379.50
|   |--- Tool wear <= 202.50
|   |   |--- Torque <= 15.45
|   |   |   |--- weights: [11.90, 281.78] class: 1
|   |   |--- Torque > 15.45
|   |   |   |--- Torque <= 58.40
|   |   |   |   |--- weights: [2865.61, 118.64] class: 0
|   |   |   |--- Torque > 58.40
|   |   |   |   |--- weights: [5.69, 177.97] class: 1
|   |--- Tool wear > 202.50
|   |   |--- weights: [192.49, 637.71] class: 1
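Because the post-pruned tree is small, its rules can be read as plain if/else logic. The function below is a hand-transcribed sketch of the rules printed above; `predict_failure` is a hypothetical helper for illustration and is not part of the notebook. Process temperature does not appear because this tree never splits on it.

```python
def predict_failure(air_temperature, rotational_speed, torque, tool_wear):
    """Hand-coded copy of the rules printed above; returns 1 for a predicted failure, else 0."""
    if rotational_speed <= 1379.50:
        if air_temperature <= 301.55:
            if torque <= 52.85:
                return 0 if tool_wear <= 206.50 else 1
            if tool_wear <= 185.50:
                return 0 if torque <= 62.35 else 1
            return 1
        return 1
    if tool_wear <= 202.50:
        if torque <= 15.45:
            return 1
        return 0 if torque <= 58.40 else 1
    return 1


# First row of the dataset shown earlier: 298.1 K, 1551 rpm, 42.8 Nm, tool wear 0.
print(predict_failure(298.1, 1551, 42.8, 0))
# A hypothetical reading with low speed and high air temperature.
print(predict_failure(302.0, 1300, 45.0, 100))
```

The first call follows the high-speed, low-wear, moderate-torque branch and returns 0, which agrees with the Failure value recorded for that row; the second falls into the low-speed, high-air-temperature leaf and returns 1.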
In [ ]:
importances = best_model.feature_importances_
indices = np.argsort(importances)
In [ ]:
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
In [ ]:
# training performance comparison
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train_without.T,
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
| Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
Accuracy | 1.00000 | 1.00000 | 0.87343 | 0.91143 |
Recall | 1.00000 | 1.00000 | 0.96610 | 0.96610 |
Precision | 1.00000 | 1.00000 | 0.20615 | 0.27143 |
F1 | 1.00000 | 1.00000 | 0.33979 | 0.42379 |
In [ ]:
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test_without.T,
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree without class_weight",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[ ]:
| Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
Accuracy | 0.97567 | 0.97700 | 0.88167 | 0.91733 |
Recall | 0.72816 | 0.61165 | 0.96117 | 0.96117 |
Precision | 0.62500 | 0.68478 | 0.22000 | 0.28863 |
F1 | 0.67265 | 0.64615 | 0.35805 | 0.44395 |
Type vs Air temperature
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Air temperature")
Air temperature
and Type
Type vs Process temperature
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Process temperature")
Process temperature
and Type
for M and L types. Process temperature
is observed in manufacturing H type of products.
Type vs Rotational speed
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Rotational speed")
Type vs Tool wear
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Tool wear")
Tool wear
and Type
Type vs Torque
In [ ]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="Type", y="Torque")
as compared to M and H type of products.
Let’s see how the target variable varies across the type of the product
In [ ]:
stacked_barplot(data, "Type", "Failure")
Failure     0    1    All
Type
All      9661  339  10000
L        5765  235   6000
M        2914   83   2997
H         982   21   1003
------------------------------------------------------------------------------------------------------------------------
Let’s analyze the relation between Air temperature and Failure
In [ ]:
distribution_plot_wrt_target(data, "Air temperature", "Failure")
Air temperature.
Let’s analyze the relation between Process temperature and Failure
In [ ]:
distribution_plot_wrt_target(data, "Process temperature", "Failure")
Process temperature.
Let’s analyze the relation between Rotational speed and Failure
In [ ]:
distribution_plot_wrt_target(data, "Rotational speed", "Failure")
Rotational speed
than at higher rotational speed.
Let’s analyze the relation between Torque and Failure
In [ ]:
distribution_plot_wrt_target(data, "Torque", "Failure")
Let’s analyze the relation between Tool wear and Failure
In [ ]:
distribution_plot_wrt_target(data, "Tool wear", "Failure")
In [ ]:
sns.pairplot(data, hue="Failure")
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7f68dcb4ac10>