
Product Segmentation using K-Means Clustering

Case Study

Context

When you think of footwear for a trip, the importance of good shoes cannot be overstated, and the two obvious brands that come to mind are Adidas and Nike. Adidas vs. Nike is a perennial debate, as the two giants of the apparel market, each with a large market cap and market share, battle to come out on top. As a newly hired Data Scientist at a market research company, you have been tasked with extracting insights from data on men’s and women’s shoes, and grouping products together to identify similarities and differences between the product ranges of these renowned brands.

Objective

To perform an exploratory data analysis and cluster the products based on various factors

Key Questions

  • Which variables are most important for clustering?
  • How is each cluster different from the others?
  • What are the business recommendations?

Data Description

The dataset consists of 3268 products from Nike and Adidas, with features including product name, product ID, listing price, sale price, discount, brand, rating, and number of reviews.

  • Product Name: Name of the product
  • Product ID: ID of the product
  • Listing Price: Listed price of the product
  • Sale Price: Sale price of the product
  • Discount: Percentage of discount on the product
  • Brand: Brand of the product
  • Rating: Rating of the product
  • Reviews: Number of reviews for the product

Let’s start coding!

Importing necessary libraries

In [1]:

# please uncomment and run the next line if yellowbrick library is not previously installed
#!pip install yellowbrick

In [2]:

# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

In [3]:

# loading the dataset
data = pd.read_csv("data_add_nik.csv")

In [4]:

data.shape

Out[4]:

(3268, 8)
  • The dataset has 3268 rows and 8 columns 

In [5]:

# viewing a random sample of the dataset
data.sample(n=10, random_state=1)

Out[5]:

| | Product Name | Product ID | Listing Price | Sale Price | Discount | Brand | Rating | Reviews |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 255 | Women’s adidas Originals POD-S3.1 Shoes | CG6188 | 13999 | 6999 | 50 | Adidas ORIGINALS | 3.3 | 8 |
| 1551 | Men’s adidas Originals Superstar Pure Shoes | FV3013 | 11999 | 11999 | 0 | Adidas ORIGINALS | 3.9 | 10 |
| 1352 | Men’s adidas Originals Superstar Shoes | FV2806 | 7999 | 7999 | 0 | Adidas ORIGINALS | 4.4 | 42 |
| 1060 | Men’s adidas Football Nemeziz 19.3 Indoor Shoes | F34411 | 5999 | 3599 | 40 | Adidas SPORT PERFORMANCE | 4.5 | 75 |
| 808 | Men’s adidas Sport Inspired Court 80s Shoes | EE9676 | 5999 | 3599 | 40 | Adidas CORE / NEO | 4.5 | 55 |
| 836 | Men’s adidas Running Stargon 1.0 Shoes | CM4935 | 4799 | 3839 | 20 | Adidas CORE / NEO | 3.5 | 21 |
| 2107 | Men’s adidas Originals Yung-96 Chasm Shoes | EE7238 | 7999 | 4799 | 40 | Adidas ORIGINALS | 3.0 | 62 |
| 3002 | Nike SB Air Max Stefan Janoski 2 | AQ7477-009 | 0 | 9995 | 0 | Nike | 2.6 | 11 |
| 2329 | Men’s adidas Originals Rivalry Low Shoes | FV4287 | 10999 | 10999 | 0 | Adidas ORIGINALS | 2.8 | 8 |
| 602 | Men’s adidas Sport Inspired Lite Racer RBN Shoes | F36642 | 5599 | 3919 | 30 | Adidas CORE / NEO | 3.0 | 56 |

In [6]:

# copying the data to another variable to avoid any changes to original data
df = data.copy()

In [7]:

# fixing column names
df.columns = [c.replace(" ", "_") for c in df.columns]

In [8]:

# let's look at the structure of the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3268 entries, 0 to 3267
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Product_Name   3268 non-null   object 
 1   Product_ID     3268 non-null   object 
 2   Listing_Price  3268 non-null   int64  
 3   Sale_Price     3268 non-null   int64  
 4   Discount       3268 non-null   int64  
 5   Brand          3268 non-null   object 
 6   Rating         3268 non-null   float64
 7   Reviews        3268 non-null   int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 204.4+ KB

We won’t need Product_ID for analysis, so let’s drop this column.

In [9]:

df.drop("Product_ID", axis=1, inplace=True)

In [10]:

# let's check for duplicate observations
df.duplicated().sum()

Out[10]:

117
  • There are 117 duplicate observations. We will remove them from the data.

In [11]:

df = df[(~df.duplicated())].copy()

Let’s take a look at the summary of the data

In [12]:

df.describe()

Out[12]:

| | Listing_Price | Sale_Price | Discount | Rating | Reviews |
| --- | --- | --- | --- | --- | --- |
| count | 3151.000000 | 3151.000000 | 3151.000000 | 3151.000000 | 3151.000000 |
| mean | 7045.960330 | 5983.166931 | 27.860997 | 3.285687 | 41.891146 |
| std | 4652.089511 | 4173.708897 | 22.442753 | 1.371611 | 31.283464 |
| min | 0.000000 | 449.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4599.000000 | 2999.000000 | 0.000000 | 2.600000 | 12.000000 |
| 50% | 5999.000000 | 4599.000000 | 40.000000 | 3.600000 | 40.000000 |
| 75% | 8999.000000 | 7799.000000 | 50.000000 | 4.400000 | 69.000000 |
| max | 29999.000000 | 36500.000000 | 60.000000 | 5.000000 | 223.000000 |

Observations

  • 0 in the listing price indicates missing values.
  • The average listing price is 7046.
  • The average sale price is 5983.
  • The average discount is 28%.
  • The average rating is 3.3.
  • The average number of reviews is 42.

In [13]:

# let's check how many products have listing price 0
(df.Listing_Price == 0).sum()

Out[13]:

336

In [14]:

# let's check the products which have listing price 0
df[(df.Listing_Price == 0)]

Out[14]:

| | Product_Name | Listing_Price | Sale_Price | Discount | Brand | Rating | Reviews |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2625 | Nike Air Force 1 ’07 Essential | 0 | 7495 | 0 | Nike | 0.0 | 0 |
| 2626 | Nike Air Force 1 ’07 | 0 | 7495 | 0 | Nike | 0.0 | 0 |
| 2627 | Nike Air Force 1 Sage Low LX | 0 | 9995 | 0 | Nike | 0.0 | 0 |
| 2628 | Nike Air Max Dia SE | 0 | 9995 | 0 | Nike | 0.0 | 0 |
| 2629 | Nike Air Max Verona | 0 | 9995 | 0 | Nike | 0.0 | 0 |
| … | … | … | … | … | … | … | … |
| 3254 | Nike Mercurial Vapor 13 Club MG | 0 | 4995 | 0 | Nike | 0.0 | 0 |
| 3257 | Air Jordan 5 Retro | 0 | 15995 | 0 | Nike | 3.3 | 3 |
| 3260 | Nike Tiempo Legend 8 Academy TF | 0 | 6495 | 0 | Nike | 0.0 | 0 |
| 3262 | Nike React Metcon AMP | 0 | 13995 | 0 | Nike | 3.0 | 1 |
| 3266 | Nike Air Max 98 | 0 | 16995 | 0 | Nike | 4.0 | 4 |

336 rows × 7 columns

In [15]:

df[(df.Listing_Price == 0)].describe()

Out[15]:

| | Listing_Price | Sale_Price | Discount | Rating | Reviews |
| --- | --- | --- | --- | --- | --- |
| count | 336.0 | 336.000000 | 336.0 | 336.000000 | 336.000000 |
| mean | 0.0 | 11203.050595 | 0.0 | 2.797619 | 8.261905 |
| std | 0.0 | 4623.825788 | 0.0 | 2.150445 | 19.708393 |
| min | 0.0 | 1595.000000 | 0.0 | 0.000000 | 0.000000 |
| 25% | 0.0 | 7995.000000 | 0.0 | 0.000000 | 0.000000 |
| 50% | 0.0 | 10995.000000 | 0.0 | 3.950000 | 1.000000 |
| 75% | 0.0 | 13995.000000 | 0.0 | 4.700000 | 6.000000 |
| max | 0.0 | 36500.000000 | 0.0 | 5.000000 | 223.000000 |
  • There are 336 observations that have missing values in the listing price column
  • We see that the discount for the products with listing price 0 is 0.
  • So, we will replace the listing price with the corresponding sale price for those observations.

In [16]:

df.loc[(df.Listing_Price == 0), ["Listing_Price"]] = df.loc[
    (df.Listing_Price == 0), ["Sale_Price"]
].values

In [17]:

df.Listing_Price.describe()

Out[17]:

count     3151.000000
mean      8240.573151
std       4363.018245
min        899.000000
25%       4999.000000
50%       7599.000000
75%      10995.000000
max      36500.000000
Name: Listing_Price, dtype: float64

In [18]:

# checking missing values
df.isna().sum()

Out[18]:

Product_Name     0
Listing_Price    0
Sale_Price       0
Discount         0
Brand            0
Rating           0
Reviews          0
dtype: int64
  • There are no missing values in the data.

EDA

In [19]:

# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [20]:

# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()

for item in num_col:
    histogram_boxplot(df, item)

Observations

  • Listing price and sale price have right-skewed distributions with upper outliers, which indicates the presence of very expensive products.
  • The maximum discount given is 60%.
  • Rating is left-skewed and most of the ratings are between 2.5 and 4.5.
  • Most products have between 0 and 100 reviews, with an outlier value above 200.
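
As a quick numeric check of the skewness claims above, a small sketch reusing the df and num_col defined earlier:

# skewness > 0 indicates a right-skewed feature, < 0 a left-skewed one
df[num_col].skew()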

In [21]:

fig, axes = plt.subplots(3, 2, figsize=(20, 15))
fig.suptitle("CDF plot of numerical variables", fontsize=20)
counter = 0
for ii in range(3):
    sns.ecdfplot(ax=axes[ii][0], x=df[num_col[counter]])
    counter = counter + 1
    if counter != 5:
        sns.ecdfplot(ax=axes[ii][1], x=df[num_col[counter]])
        counter = counter + 1
    else:
        pass

fig.tight_layout(pad=2.0)

Observations

  • 90% of the products have a listing price of less than 15000.
  • 95% of the products have a sale price of less than 15000.
  • 80% of the products have a discount of 50% or less.
  • 50% of the products have a rating of 3.5 or less.
  • Almost all products have 100 or fewer reviews.
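
These thresholds can also be checked numerically rather than read off the plots; a sketch using quantiles:

# quantiles backing the CDF observations above
df[num_col].quantile([0.50, 0.80, 0.90, 0.95])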

In [22]:

# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [23]:

# let's explore discounts further
labeled_barplot(df, "Discount", perc=True)

Observations

  • 37.6% of the products did not have any discount.
  • More than 50% of the products have a 40-50% discount.
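
If exact shares are needed, a sketch that tabulates the discount distribution directly:

# fraction of products at each discount level, sorted by discount
df["Discount"].value_counts(normalize=True).sort_index()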

Bivariate Analysis

Let’s check for correlations.

In [24]:

plt.figure(figsize=(15, 7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Observations

  • The sale price and the listing price are highly positively correlated, as one would expect.
  • Discount is somewhat negatively correlated with both listing price and sale price, which indicates that expensive products tend to have smaller discounts or none at all.
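
To pull these coefficients out programmatically instead of reading them off the heatmap, a small sketch:

# correlation of Discount with the two price columns
df[num_col].corr().loc["Discount", ["Listing_Price", "Sale_Price"]]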

In [25]:

sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()

Observations

  • 50-60% discounts are mostly given to products whose listing price is 10,000 or less.
  • Many products have no discount.
  • As seen before, most of the products have 100 or fewer reviews.

Data Preprocessing

In [26]:

# variables used for clustering
num_col

Out[26]:

['Listing_Price', 'Sale_Price', 'Discount', 'Rating', 'Reviews']

In [27]:

# scaling the dataset before clustering
scaler = StandardScaler()
subset = df[num_col].copy()
subset_scaled = scaler.fit_transform(subset)

In [28]:

# creating a dataframe of the scaled columns
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
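
A quick sanity check that scaling behaved as expected (a sketch; every column should show a mean of roughly 0 and a standard deviation of roughly 1):

# verify the z-scored features
subset_scaled_df.describe().round(2)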

K-means Clustering

In [29]:

clusters = range(1, 9)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction = model.predict(subset_scaled_df)
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average distortion")
plt.title("Selecting k with the Elbow Method")
plt.show()
Number of Clusters: 1 	Average Distortion: 2.0907211006458244
Number of Clusters: 2 	Average Distortion: 1.71203567523211
Number of Clusters: 3 	Average Distortion: 1.5554822489540914
Number of Clusters: 4 	Average Distortion: 1.4451709755028774
Number of Clusters: 5 	Average Distortion: 1.3083067148001288
Number of Clusters: 6 	Average Distortion: 1.2302992759867148
Number of Clusters: 7 	Average Distortion: 1.1629923806047235
Number of Clusters: 8 	Average Distortion: 1.1008904924660998

The appropriate value of k seems to be 3 or 4.
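
The KElbowVisualizer imported earlier automates this check and marks the detected elbow; a minimal sketch (the k range is an assumption):

# automated elbow detection with yellowbrick (distortion metric by default)
elbow_visualizer = KElbowVisualizer(KMeans(random_state=1), k=(2, 9))
elbow_visualizer.fit(subset_scaled_df)
elbow_visualizer.show()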

Let’s check the silhouette scores.

In [30]:

sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.33477783024143065
For n_clusters = 3, the silhouette score is 0.29744330547812764
For n_clusters = 4, the silhouette score is 0.30575395153576057
For n_clusters = 5, the silhouette score is 0.27039882323588793
For n_clusters = 6, the silhouette score is 0.2813209185631928
For n_clusters = 7, the silhouette score is 0.2908294156005331
For n_clusters = 8, the silhouette score is 0.29281420335577346
For n_clusters = 9, the silhouette score is 0.2860040582312006

The silhouette score for k = 4 is higher than that for k = 3, so we will choose 4 as the value of k.

In [31]:

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()

Out[31]:

<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 3151 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

In [32]:

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()

Out[32]:

<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 3151 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

In [33]:

# let's take 4 as number of clusters
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(subset_scaled_df)

Out[33]:

KMeans(n_clusters=4, random_state=0)
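
To read the fitted centers in the original units rather than z-scores, a sketch reusing the scaler object from earlier:

# map the scaled cluster centers back to the original feature scale
pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=num_col
).round(1)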

In [34]:

# adding kmeans cluster labels to the original and scaled dataframes

df["K_means_segments"] = kmeans.labels_
subset_scaled_df["K_means_segments"] = kmeans.labels_

Cluster Profiling

In [35]:

# numeric_only=True skips the non-numeric columns (Product_Name, Brand)
cluster_profile = df.groupby("K_means_segments").mean(numeric_only=True)

In [36]:

cluster_profile["count_in_each_segments"] = (
    df.groupby("K_means_segments")["Sale_Price"].count().values
)

In [37]:

# let's display cluster profiles
cluster_profile.style.highlight_max(color="lightgreen", axis=0)

Out[37]:

| K_means_segments | Listing_Price | Sale_Price | Discount | Rating | Reviews | count_in_each_segments |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 15541.586283 | 13406.792035 | 8.761062 | 3.926327 | 31.869469 | 452 |
| 1 | 6801.354260 | 3686.673206 | 45.465247 | 3.390863 | 50.173206 | 1784 |
| 2 | 6498.135802 | 6159.969136 | 1.018519 | 3.797685 | 38.438272 | 648 |
| 3 | 9725.996255 | 8331.101124 | 7.715356 | 0.255805 | 11.898876 | 267 |
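
Since listing and sale prices carry outliers, medians are a useful complement to the mean-based profile above; a sketch:

# per-cluster medians are more robust to the price outliers seen in the EDA
df.groupby("K_means_segments")[num_col].median()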

In [38]:

fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle("Boxplot of scaled numerical variables for each cluster", fontsize=20)
counter = 0
for ii in range(5):
    sns.boxplot(
        ax=axes[ii],
        y=subset_scaled_df[num_col[counter]],
        x=subset_scaled_df["K_means_segments"],
    )
    counter = counter + 1

fig.tight_layout(pad=2.0)

In [39]:

fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle("Boxplot of original numerical variables for each cluster", fontsize=20)
counter = 0
for ii in range(5):
    sns.boxplot(ax=axes[ii], y=df[num_col[counter]], x=df["K_means_segments"])
    counter = counter + 1

fig.tight_layout(pad=2.0)

Insights

  • Cluster 0:
    • This cluster contains more expensive premium products, with an average listing price of approximately 15500.
    • Almost no discount is given to products in this cluster.
    • The median rating of this cluster is around 4, suggesting that customers are very satisfied with at least half of the products in this cluster.
    • Products in this cluster have fewer reviews, which may be because these products are very expensive and fewer people buy them.
  • Cluster 1:
    • The listing price is low for products in this cluster, and the sale price is even lower, as these products carry heavier discounts.
    • 50% of the products have discounts between 30% and 40%.
    • More than 50% of the products have a rating between 2.5 and 4.5, which indicates that customer satisfaction varies across products in this cluster.
    • On average, 50 reviews are given to products in this cluster.
  • Cluster 2:
    • The listing price of products in this cluster is similar to those in Cluster 1, but the sale price is comparatively higher as there are almost no discounts.
    • On average, 38 reviews are given to products in this cluster.
  • Cluster 3:
    • Listing and sale price of products in this cluster suggest that these are medium-range products.
    • Almost no discount is given to products in this cluster.
    • Products in this cluster have a low rating or no rating at all.

Let’s compare Cluster vs Brand

In [40]:

pd.crosstab(df.K_means_segments, df.Brand).style.highlight_max(
    color="lightgreen", axis=0
)

Out[40]:

| K_means_segments | Adidas Adidas ORIGINALS | Adidas CORE / NEO | Adidas ORIGINALS | Adidas SPORT PERFORMANCE | Nike |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 135 | 128 | 189 |
| 1 | 1 | 864 | 563 | 356 | 0 |
| 2 | 0 | 229 | 155 | 99 | 165 |
| 3 | 0 | 18 | 53 | 23 | 173 |

Observations

  • Adidas gives more discounts on their products, while Nike barely gives any.
  • Nearly one-third of Nike products are premium in nature (see the normalized crosstab sketch below).
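
A column-normalized crosstab makes these brand-level shares explicit; a sketch:

# fraction of each brand's products that falls in each cluster
pd.crosstab(df.K_means_segments, df.Brand, normalize="columns").round(2)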

Business Recommendations

  • Cluster 0 consists of premium products, so focused marketing campaigns can be done to target customers with higher purchasing power.
  • Cluster 0 products can also be made available in showrooms situated in premium locations because people generally tend to buy premium products from showrooms.
  • Cluster 1 consists of products that are sold at higher discounts, so these might be the ones introduced in large quantities during festive sales (like Christmas, Black Friday, and New Year). This strategy can be continued by maintaining a high inventory of these products.
  • Clusters 2 and 3 consist of products with comparatively low and medium prices and almost no discounts, so general marketing campaigns targeting the general public can be run for these.
