
Creating Linear Regression Models using Python

Let’s go through a case study in detail to see how a linear regression model is built from a given dataset. Use the dataset below for reference.

Anime Rating Case Study

Context

Streamist is a streaming company that streams web series and movies for a worldwide audience. All content on their portal is rated by viewers, and the portal also provides other information about the content, such as the number of people who have watched it, the number of people who want to watch it, the number of episodes, the duration of an episode, etc.

They are currently focusing on the anime available on their portal and want to identify the most important factors involved in rating an anime. As a data scientist at Streamist, you are tasked with identifying these factors and building a predictive model to predict the rating of an anime.

Objective

To analyze the data and build a linear regression model to predict the ratings of anime.

Key Questions

  1. What are the key factors influencing the rating of an anime?
  2. Is there a good predictive model for the rating of an anime? What does the performance assessment look like for such a model?

Data Information

Each record in the database provides a description of an anime. A detailed data dictionary can be found below.

Data Dictionary

  • title – the title of anime
  • description – the synopsis of the plot
  • mediaType – format of publication
  • eps – number of episodes (movies are considered 1 episode)
  • duration – duration of an episode in minutes
  • ongoing – whether it is ongoing
  • sznOfRelease – the season of release (Winter, Spring, Fall)
  • years_running – number of years the anime ran/is running
  • studio_primary – primary studio of production
  • studios_colab – whether there was a collaboration between studios to produce the anime
  • contentWarn – whether anime has a content warning
  • watched – number of users that completed it
  • watching – number of users that are watching it
  • wantWatch – number of users that want to watch it
  • dropped – number of users that dropped it before completion
  • rating – average user rating
  • votes – number of votes that contribute to rating
  • tag_Based_on_a_Manga – whether the anime is based on a manga
  • tag_Comedy – whether the anime is of Comedy genre
  • tag_Action – whether the anime is of Action genre
  • tag_Fantasy – whether the anime is of Fantasy genre
  • tag_Sci_Fi – whether the anime is of Sci-Fi genre
  • tag_Shounen – whether the anime has a tag Shounen
  • tag_Original_Work – whether the anime is an original work
  • tag_Non_Human_Protagonists – whether the anime has any non-human protagonists
  • tag_Drama – whether the anime is of Drama genre
  • tag_Adventure – whether the anime is of Adventure genre
  • tag_Family_Friendly – whether the anime is family-friendly
  • tag_Short_Episodes – whether the anime has short episodes
  • tag_School_Life – whether the anime is regarding school life
  • tag_Romance – whether the anime is of Romance genre
  • tag_Shorts – whether the anime has a tag Shorts
  • tag_Slice_of_Life – whether the anime has a tag Slice of Life
  • tag_Seinen – whether the anime has a tag Seinen
  • tag_Supernatural – whether the anime has a tag Supernatural
  • tag_Magic – whether the anime has a tag Magic
  • tag_Animal_Protagonists – whether the anime has animal protagonists
  • tag_Ecchi – whether the anime has a tag Ecchi
  • tag_Mecha – whether the anime has a tag Mecha
  • tag_Based_on_a_Light_Novel – whether the anime is based on a light novel
  • tag_CG_Animation – whether the anime has a tag CG Animation
  • tag_Superpowers – whether the anime has a tag Superpowers
  • tag_Others – whether the anime has other tags
  • tag_is_missing – whether tag is missing or not

Let’s start coding!

Import necessary libraries

In [1]:

# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to split the data into train and test
from sklearn.model_selection import train_test_split

# to build linear regression_model
from sklearn.linear_model import LinearRegression

# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:

# loading the dataset
df = pd.read_csv("anime_data.csv")

In [3]:

# checking shape of the data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
There are 12101 rows and 44 columns.

In [4]:

# to view first 5 rows of the dataset
df.head()

Out[4]:

[Output: the first five rows of the dataset (Fullmetal Alchemist: Brotherhood, your name., A Silent Voice, Haikyuu!! Karasuno High School vs Shiratorizawa, and Attack on Titan 3rd Season: Part II), shown across all 44 columns. The full wide table is omitted here for readability.]

In [5]:

# to view last 5 rows of the dataset
df.tail()

Out[5]:

[Output: the last five rows of the dataset (indices 12096 to 12100), shown across all 44 columns. The full wide table is omitted here for readability.]

Note:

1. The first section of the notebook covers material from the previous case studies. For this session, this part can be skipped, and we can refer directly to this summary of data cleaning steps and observations from EDA.

2. For this session, we will drop the missing values in the dataset. We will discuss better ways of dealing with missing values in the next session.

In [6]:

# let's create a copy of the data to avoid any changes to original data
data = df.copy()

In [7]:

# checking for duplicate values in the data
data.duplicated().sum()

Out[7]:

0
  • There are no duplicate values in the data.

In [8]:

# checking the names of the columns in the data
print(data.columns)
Index(['title', 'description', 'mediaType', 'eps', 'duration', 'ongoing',
       'sznOfRelease', 'years_running', 'studio_primary', 'studios_colab',
       'contentWarn', 'watched', 'watching', 'wantWatch', 'dropped', 'rating',
       'votes', 'tag_Based_on_a_Manga', 'tag_Comedy', 'tag_Action',
       'tag_Fantasy', 'tag_Sci_Fi', 'tag_Shounen', 'tag_Original_Work',
       'tag_Non_Human_Protagonists', 'tag_Drama', 'tag_Adventure',
       'tag_Family_Friendly', 'tag_Short_Episodes', 'tag_School_Life',
       'tag_Romance', 'tag_Shorts', 'tag_Slice_of_Life', 'tag_Seinen',
       'tag_Supernatural', 'tag_Magic', 'tag_Animal_Protagonists', 'tag_Ecchi',
       'tag_Mecha', 'tag_Based_on_a_Light_Novel', 'tag_CG_Animation',
       'tag_Superpowers', 'tag_Others', 'tag_missing'],
      dtype='object')

In [9]:

# checking column datatypes and number of non-null values
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12101 entries, 0 to 12100
Data columns (total 44 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   title                       12101 non-null  object 
 1   description                 7633 non-null   object 
 2   mediaType                   12101 non-null  object 
 3   eps                         12101 non-null  int64  
 4   duration                    7465 non-null   float64
 5   ongoing                     12101 non-null  bool   
 6   sznOfRelease                12101 non-null  object 
 7   years_running               12101 non-null  int64  
 8   studio_primary              12101 non-null  object 
 9   studios_colab               12101 non-null  int64  
 10  contentWarn                 12101 non-null  int64  
 11  watched                     12101 non-null  float64
 12  watching                    12101 non-null  int64  
 13  wantWatch                   12101 non-null  int64  
 14  dropped                     12101 non-null  int64  
 15  rating                      12101 non-null  float64
 16  votes                       12101 non-null  int64  
 17  tag_Based_on_a_Manga        12101 non-null  int64  
 18  tag_Comedy                  12101 non-null  int64  
 19  tag_Action                  12101 non-null  int64  
 20  tag_Fantasy                 12101 non-null  int64  
 21  tag_Sci_Fi                  12101 non-null  int64  
 22  tag_Shounen                 12101 non-null  int64  
 23  tag_Original_Work           12101 non-null  int64  
 24  tag_Non_Human_Protagonists  12101 non-null  int64  
 25  tag_Drama                   12101 non-null  int64  
 26  tag_Adventure               12101 non-null  int64  
 27  tag_Family_Friendly         12101 non-null  int64  
 28  tag_Short_Episodes          12101 non-null  int64  
 29  tag_School_Life             12101 non-null  int64  
 30  tag_Romance                 12101 non-null  int64  
 31  tag_Shorts                  12101 non-null  int64  
 32  tag_Slice_of_Life           12101 non-null  int64  
 33  tag_Seinen                  12101 non-null  int64  
 34  tag_Supernatural            12101 non-null  int64  
 35  tag_Magic                   12101 non-null  int64  
 36  tag_Animal_Protagonists     12101 non-null  int64  
 37  tag_Ecchi                   12101 non-null  int64  
 38  tag_Mecha                   12101 non-null  int64  
 39  tag_Based_on_a_Light_Novel  12101 non-null  int64  
 40  tag_CG_Animation            12101 non-null  int64  
 41  tag_Superpowers             12101 non-null  int64  
 42  tag_Others                  12101 non-null  int64  
 43  tag_missing                 12101 non-null  int64  
dtypes: bool(1), float64(3), int64(35), object(5)
memory usage: 4.0+ MB
  • The dependent variable is the rating of an anime, which is of float type.
  • title, description, mediaType, sznOfRelease, and studio_primary are of object type.
  • The ongoing column is of bool type.
  • All other columns are numeric in nature.
  • There are missing values in the description and duration columns.

Let’s check for missing values in the data.

In [10]:

# checking for missing values in the data
data.isnull().sum()

Out[10]:

title                            0
description                   4468
mediaType                        0
eps                              0
duration                      4636
ongoing                          0
sznOfRelease                     0
years_running                    0
studio_primary                   0
studios_colab                    0
contentWarn                      0
watched                          0
watching                         0
wantWatch                        0
dropped                          0
rating                           0
votes                            0
tag_Based_on_a_Manga             0
tag_Comedy                       0
tag_Action                       0
tag_Fantasy                      0
tag_Sci_Fi                       0
tag_Shounen                      0
tag_Original_Work                0
tag_Non_Human_Protagonists       0
tag_Drama                        0
tag_Adventure                    0
tag_Family_Friendly              0
tag_Short_Episodes               0
tag_School_Life                  0
tag_Romance                      0
tag_Shorts                       0
tag_Slice_of_Life                0
tag_Seinen                       0
tag_Supernatural                 0
tag_Magic                        0
tag_Animal_Protagonists          0
tag_Ecchi                        0
tag_Mecha                        0
tag_Based_on_a_Light_Novel       0
tag_CG_Animation                 0
tag_Superpowers                  0
tag_Others                       0
tag_missing                      0
dtype: int64
  • duration column has 4636 missing values, and description column has 4468 missing values.
  • No other column has missing values.

In [11]:

# Let's look at the statistical summary of the data
data.describe().T

Out[11]:

                              count         mean          std     min     25%      50%       75%         max
eps                         12101.0    13.393356    57.925097   1.000   1.000    2.000    12.000    2527.000
duration                     7465.0    24.230141    31.468171   1.000   4.000    8.000    30.000     163.000
years_running               12101.0     0.283200     1.152234   0.000   0.000    0.000     0.000      51.000
studios_colab               12101.0     0.051649     0.221326   0.000   0.000    0.000     0.000       1.000
contentWarn                 12101.0     0.115362     0.319472   0.000   0.000    0.000     0.000       1.000
watched                     12101.0  2862.605694  7724.347024   0.000  55.000  341.000  2026.000  161567.000
watching                    12101.0   256.334435  1380.840902   0.000   2.000   14.000   100.000   74537.000
wantWatch                   12101.0  1203.681431  2294.327380   0.000  49.000  296.000  1275.000   28541.000
dropped                     12101.0   151.568383   493.931710   0.000   3.000   12.000    65.000   19481.000
rating                      12101.0     2.949037     0.827385   0.844   2.304    2.965     3.616       4.702
votes                       12101.0  2088.124700  5950.332228  10.000  34.000  219.000  1414.000  131067.000
tag_Based_on_a_Manga        12101.0     0.290802     0.454151   0.000   0.000    0.000     1.000       1.000
tag_Comedy                  12101.0     0.272870     0.445453   0.000   0.000    0.000     1.000       1.000
tag_Action                  12101.0     0.231221     0.421631   0.000   0.000    0.000     0.000       1.000
tag_Fantasy                 12101.0     0.181555     0.385493   0.000   0.000    0.000     0.000       1.000
tag_Sci_Fi                  12101.0     0.166267     0.372336   0.000   0.000    0.000     0.000       1.000
tag_Shounen                 12101.0     0.144864     0.351978   0.000   0.000    0.000     0.000       1.000
tag_Original_Work           12101.0     0.135195     0.341946   0.000   0.000    0.000     0.000       1.000
tag_Non_Human_Protagonists  12101.0     0.112470     0.315957   0.000   0.000    0.000     0.000       1.000
tag_Drama                   12101.0     0.106107     0.307987   0.000   0.000    0.000     0.000       1.000
tag_Adventure               12101.0     0.103793     0.305005   0.000   0.000    0.000     0.000       1.000
tag_Family_Friendly         12101.0     0.097017     0.295993   0.000   0.000    0.000     0.000       1.000
tag_Short_Episodes          12101.0     0.096934     0.295880   0.000   0.000    0.000     0.000       1.000
tag_School_Life             12101.0     0.092306     0.289470   0.000   0.000    0.000     0.000       1.000
tag_Romance                 12101.0     0.092141     0.289237   0.000   0.000    0.000     0.000       1.000
tag_Shorts                  12101.0     0.089662     0.285709   0.000   0.000    0.000     0.000       1.000
tag_Slice_of_Life           12101.0     0.080820     0.272569   0.000   0.000    0.000     0.000       1.000
tag_Seinen                  12101.0     0.077101     0.266763   0.000   0.000    0.000     0.000       1.000
tag_Supernatural            12101.0     0.070903     0.256674   0.000   0.000    0.000     0.000       1.000
tag_Magic                   12101.0     0.064292     0.245283   0.000   0.000    0.000     0.000       1.000
tag_Animal_Protagonists     12101.0     0.060326     0.238099   0.000   0.000    0.000     0.000       1.000
tag_Ecchi                   12101.0     0.057433     0.232678   0.000   0.000    0.000     0.000       1.000
tag_Mecha                   12101.0     0.054541     0.227091   0.000   0.000    0.000     0.000       1.000
tag_Based_on_a_Light_Novel  12101.0     0.053384     0.224807   0.000   0.000    0.000     0.000       1.000
tag_CG_Animation            12101.0     0.050079     0.218116   0.000   0.000    0.000     0.000       1.000
tag_Superpowers             12101.0     0.044624     0.206486   0.000   0.000    0.000     0.000       1.000
tag_Others                  12101.0     0.090654     0.287128   0.000   0.000    0.000     0.000       1.000
tag_missing                 12101.0     0.025866     0.158741   0.000   0.000    0.000     0.000       1.000
  • We can see that the anime ratings vary between 0.844 and 4.702, which suggests that the anime were rated on a scale of 0-5.
  • 50% of the anime (among those with a recorded duration) have an episode duration of 8 minutes or less.
  • Some anime even have a runtime of 1 minute.
    • This seems strange at first, but a Google search reveals that there are indeed such anime.
  • At least 75% of the anime have run for less than a year.
    • This may be because most of the listed anime have only a few episodes.
  • At least 75% of the anime have no content warnings.
  • The number of users who have watched an anime has a very wide range (0 to more than 160,000).

Let’s look at the non-numeric columns.

In [12]:

# filtering non-numeric columns
cat_columns = data.select_dtypes(exclude=np.number).columns
cat_columns

Out[12]:

Index(['title', 'description', 'mediaType', 'ongoing', 'sznOfRelease',
       'studio_primary'],
      dtype='object')

In [13]:

# we will skip the title and description columns as they will have a lot of unique values
cat_col = ["mediaType", "ongoing", "sznOfRelease", "studio_primary"]

# printing the number of occurrences of each unique value in each categorical column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 50)
TV             3993
Movie          1928
OVA            1770
Music Video    1290
Web            1170
DVD Special     803
Other           580
TV Special      504
is_missing       63
Name: mediaType, dtype: int64
--------------------------------------------------
False    11986
True       115
Name: ongoing, dtype: int64
--------------------------------------------------
is_missing    8554
Spring        1135
Fall          1011
Winter         717
Summer         684
Name: sznOfRelease, dtype: int64
--------------------------------------------------
Others                  4340
is_missing              3208
Toei Animation           636
Sunrise                  430
J.C. Staff               341
MADHOUSE                 337
TMS Entertainment        317
Production I.G           271
Studio Deen              260
Studio Pierrot           221
OLM                      210
A-1 Pictures             194
AIC                      167
Shin-Ei Animation        164
Nippon Animation         145
Tatsunoko Production     144
DLE                      130
GONZO                    124
Bones                    121
Shaft                    118
XEBEC                    117
Kyoto Animation          106
Name: studio_primary, dtype: int64
--------------------------------------------------
  • Most of the anime in the data are either TV series or movies.
  • Most of the anime in the data are not ongoing.
  • The season of release is missing for most of the anime in the data.
  • Toei Animation and Sunrise are the top two studios (excluding the Others and is_missing categories).

We will drop the title and description columns before moving forward as they have a lot of text in them.

In [14]:

data.drop(["title", "description"], axis=1, inplace=True)

We will drop the missing values in the dataset.

In [15]:

data.dropna(inplace=True)
data.shape

Out[15]:

(7465, 42)

We will discuss better ways of dealing with missing values in the next session.

Let’s visualize the data

Univariate Analysis

In [16]:

# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # histogram with the given bins
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram with default bins
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram

rating

In [17]:

histogram_boxplot(data, "rating")
  • The anime ratings are close to normally distributed.

eps

In [18]:

histogram_boxplot(data, "eps", bins=100)
  • The distribution is heavily right-skewed, as there are many anime movies in the data, and they are considered to be of only one episode (as per data description).

duration

In [19]:

histogram_boxplot(data, "duration")
  • The distribution is right-skewed with a median runtime of less than 10 minutes.

years_running

In [20]:

histogram_boxplot(data, "years_running")
  • The distribution is heavily right-skewed, and most of the anime have run for less than 1 year.

watched

In [21]:

histogram_boxplot(data, "watched", bins=50)
  • The distribution is heavily right-skewed, with most of the anime having fewer than 500 viewers.

watching

In [22]:

histogram_boxplot(data, "watching", bins=50)
  • The distribution is heavily right-skewed.

wantWatch

In [23]:

histogram_boxplot(data, "wantWatch", bins=50)
  • The distribution is heavily right-skewed.

dropped

In [24]:

histogram_boxplot(data, "dropped", bins=50)
  • The distribution is heavily right-skewed.

votes

In [25]:

histogram_boxplot(data, "votes", bins=50)
  • The distribution is heavily right-skewed, and few shows have more than 5000 votes.

In [26]:

# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar center
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage

    plt.show()  # show the plot

mediaType

In [27]:

labeled_barplot(data, "mediaType", perc=True)
  • Distribution of the media types has changed a lot after dropping the rows with missing values.
  • Most of the anime now are movies or music videos.

ongoing

In [28]:

labeled_barplot(data, "ongoing", perc=True)
  • Very few anime in the data are ongoing.

sznOfRelease

In [29]:

labeled_barplot(data, "sznOfRelease", perc=True)
  • The season of release of anime is spread out across all seasons when the value is available.

studio_primary

In [30]:

labeled_barplot(data, "studio_primary", perc=True)
  • Toei Animation is the most common studio among the available studio names.

studios_colab

In [31]:

labeled_barplot(data, "studios_colab", perc=True)
  • More than 95% of the anime in the data do not involve a collaboration between studios.

contentWarn

In [32]:

labeled_barplot(data, "contentWarn", perc=True)
  • ~9% of the anime in the data have an associated content warning.

In [33]:

# creating a list of tag columns
tag_cols = [item for item in data.columns if "tag" in item]

# printing the number of occurrences of each unique value in each tag column
for column in tag_cols:
    print(data[column].value_counts())
    print("-" * 50)
0    5718
1    1747
Name: tag_Based_on_a_Manga, dtype: int64
--------------------------------------------------
0    5545
1    1920
Name: tag_Comedy, dtype: int64
--------------------------------------------------
0    6227
1    1238
Name: tag_Action, dtype: int64
--------------------------------------------------
0    6342
1    1123
Name: tag_Fantasy, dtype: int64
--------------------------------------------------
0    6511
1     954
Name: tag_Sci_Fi, dtype: int64
--------------------------------------------------
0    6625
1     840
Name: tag_Shounen, dtype: int64
--------------------------------------------------
0    6548
1     917
Name: tag_Original_Work, dtype: int64
--------------------------------------------------
0    6553
1     912
Name: tag_Non_Human_Protagonists, dtype: int64
--------------------------------------------------
0    6896
1     569
Name: tag_Drama, dtype: int64
--------------------------------------------------
0    6913
1     552
Name: tag_Adventure, dtype: int64
--------------------------------------------------
0    6736
1     729
Name: tag_Family_Friendly, dtype: int64
--------------------------------------------------
0    6292
1    1173
Name: tag_Short_Episodes, dtype: int64
--------------------------------------------------
0    7029
1     436
Name: tag_School_Life, dtype: int64
--------------------------------------------------
0    6978
1     487
Name: tag_Romance, dtype: int64
--------------------------------------------------
0    6386
1    1079
Name: tag_Shorts, dtype: int64
--------------------------------------------------
0    6875
1     590
Name: tag_Slice_of_Life, dtype: int64
--------------------------------------------------
0    7019
1     446
Name: tag_Seinen, dtype: int64
--------------------------------------------------
0    7035
1     430
Name: tag_Supernatural, dtype: int64
--------------------------------------------------
0    7132
1     333
Name: tag_Magic, dtype: int64
--------------------------------------------------
0    6885
1     580
Name: tag_Animal_Protagonists, dtype: int64
--------------------------------------------------
0    7159
1     306
Name: tag_Ecchi, dtype: int64
--------------------------------------------------
0    7189
1     276
Name: tag_Mecha, dtype: int64
--------------------------------------------------
0    7224
1     241
Name: tag_Based_on_a_Light_Novel, dtype: int64
--------------------------------------------------
0    6985
1     480
Name: tag_CG_Animation, dtype: int64
--------------------------------------------------
0    7218
1     247
Name: tag_Superpowers, dtype: int64
--------------------------------------------------
0    6521
1     944
Name: tag_Others, dtype: int64
--------------------------------------------------
0    7184
1     281
Name: tag_missing, dtype: int64
--------------------------------------------------
  • There are 1747 anime that are based on manga.
  • There are 1920 anime of the Comedy genre.
  • There are 487 anime of the Romance genre.

Bivariate analysis

We will not consider the tag columns for correlation check as they have only 0 or 1 values.

In [34]:

# creating a list of non-tag columns
corr_cols = [item for item in data.columns if "tag" not in item]
print(corr_cols)
['mediaType', 'eps', 'duration', 'ongoing', 'sznOfRelease', 'years_running', 'studio_primary', 'studios_colab', 'contentWarn', 'watched', 'watching', 'wantWatch', 'dropped', 'rating', 'votes']

In [35]:

plt.figure(figsize=(15, 7))
sns.heatmap(
    data[corr_cols].select_dtypes(include=np.number).corr(),  # non-numeric columns are excluded
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap="Spectral",
)
plt.show()
  • watched and wantWatch columns are highly correlated.
  • watched and votes columns are very highly correlated.
  • wantWatch and votes columns are highly correlated (the sketch below ranks these pairs programmatically).
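To make these observations precise without reading values off the heatmap, here is a small sketch (not part of the original notebook; it reuses the data and corr_cols objects defined above) that ranks the numeric column pairs by absolute correlation:

# rank the unique numeric column pairs by absolute correlation
num_corr = data[corr_cols].select_dtypes(include=np.number).corr()
mask = np.triu(np.ones(num_corr.shape, dtype=bool), k=1)  # keep each pair only once (upper triangle)
pairs = num_corr.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs.head())  # the watched/votes pair should appear at the top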

Checking the relation of different variables with rating

Note – Double-click on the plot below to zoom in, or open it in a new tab for better visibility.

In [36]:

# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(data, y_vars="rating")
plt.show()
  • The duration, ongoing, and watched columns tend to show an increasing trend, meaning that as these values increase, the rating tends to increase as well.
  • contentWarn tends to show a decreasing trend.

Let’s check the variation in rating with some of the categorical columns in our data

mediaType vs rating

In [37]:

plt.figure(figsize=(10, 5))
sns.boxplot(x="mediaType", y="rating", data=data)
plt.show()
  • Anime available as TV series, web series, or music videos have a lower rating in general.

sznOfRelease vs rating

In [38]:

plt.figure(figsize=(10, 5))
sns.boxplot(x="sznOfRelease", y="rating", data=data)
plt.show()
  • Anime ratings have a similar distribution across all the seasons of release.

studio_primary vs rating

In [39]:

plt.figure(figsize=(15, 5))
sns.boxplot(x="studio_primary", y="rating", data=data)
plt.xticks(rotation=90)
plt.show()
  • In general, the ratings are low for anime produced by DLE studios.
  • Ratings are also low, in general, for anime produced by studios other than the ones in the data.

Summary of EDA

Data Description:

  • The target variable (rating) is of float type.
  • title, description, mediaType, sznOfRelease, and studio_primary are of object type.
  • ongoing column is of bool type.
  • All other columns are numeric in nature.
  • The title and description columns are dropped for modeling as they are highly textual in nature.
  • There are no duplicate values in the data.
  • There are missing values in the data. The rows with missing data have been dropped.

Observations from EDA:

  • rating: The anime ratings are close to normally distributed, with a mean rating of 2.74. The rating increases with an increase in the number of people who have watched or want to watch the anime.
  • eps: The distribution is heavily right-skewed as there are many anime movies in the data (at least 50%), and they are considered to be of only one episode as per data description. The number of episodes increases as the anime runs for more years.
  • duration: The distribution is right-skewed with a median anime runtime of less than 10 minutes. With the increase in rating, the duration increases.
  • years_running: The distribution is heavily right-skewed, and at least 75% of the anime have run for less than 1 year.
  • watched: The distribution is heavily right-skewed, and most of the anime have less than 500 viewers. This attribute is highly correlated with the wantWatch and votes attributes.
  • watching: The distribution is heavily right-skewed and highly correlated with the dropped attribute.
  • wantWatch: The distribution is heavily right-skewed with a median value of 132 potential watchers.
  • dropped: The distribution is heavily right-skewed, with an average of 25 viewers dropping an anime before completion.
  • votes: The distribution is heavily right-skewed, and few shows have more than 5000 votes.
  • mediaType: 23% of the anime are published for TV, 17% as music videos, and 14% as web series. Anime available as TV series, web series, or music videos have a lower rating in general.
  • ongoing: Less than 1% of the anime in the data are ongoing.
  • sznOfRelease: The season of release is missing for nearly 90% of the anime in the data, and is spread out almost evenly across all seasons when available. Anime ratings have a similar distribution across all the seasons of release.
  • studio_primary: Nearly 40% of the anime in the data are produced by studios not listed in the data. Toei Animation is the most common studio among the available studio names. In general, the ratings are low for anime produced by DLE studios and studios other than the ones listed in the data.
  • studios_colab: More than 95% of the anime in the data do not involve collaboration between studios.
  • contentWarn: Less than 10% of the anime in the data have an associated content warning.
  • tag_<tag/genre>: There are 1747 anime that are based on a manga, 1920 of the Comedy genre, 1238 of the Action genre, 487 of the Romance genre, and so on.

Model Building

Define independent and dependent variables

In [40]:

X = data.drop(["rating"], axis=1)
y = data["rating"]

Creating dummy variables

In [41]:

X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)
X.head()

Out[41]:

[Output: the first five rows of the encoded feature matrix, now 71 columns wide, containing the numeric features, the tag_* indicators, and one-hot dummy columns for mediaType, sznOfRelease, and studio_primary. The full wide table is omitted here for readability.]
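As a minimal illustration (toy data, not from this dataset) of what drop_first=True does, note how the first category level becomes the all-zeros baseline:

# toy example: one categorical column with three levels
toy = pd.DataFrame({"mediaType": ["TV", "Movie", "OVA", "TV"]})
print(pd.get_dummies(toy, columns=["mediaType"], drop_first=True).astype(int))
#    mediaType_OVA  mediaType_TV
# 0              0             1
# 1              0             0   <- "Movie" (first alphabetically) is the dropped baseline
# 2              1             0
# 3              0             1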

In [42]:

X.shape

Out[42]:

(7465, 71)

Split the data into train and test

In [43]:

# splitting the data in 70:30 ratio for train to test data

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [44]:

print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 5225
Number of rows in test data = 2240

Fitting a linear model

In [45]:

lin_reg_model = LinearRegression()
lin_reg_model.fit(x_train, y_train)

Out[45]:

LinearRegression()
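Before checking the error metrics, a quick way (a sketch, not part of the original notebook) to start answering the first key question is to pair the fitted coefficients with the feature names; keep in mind that coefficients of unscaled features are not directly comparable in magnitude:

# inspecting the fitted intercept and the largest-magnitude coefficients
coefs = pd.Series(lin_reg_model.coef_, index=x_train.columns)
print("Intercept:", lin_reg_model.intercept_)
print(coefs.sort_values(key=abs, ascending=False).head(10))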

Model performance check

  • We will be using the metric functions defined in sklearn for RMSE, MAE, and R².
  • We will define functions to calculate adjusted R² and MAPE (see the formulas below).
    • The mean absolute percentage error (MAPE) measures the accuracy of predictions as a percentage. For each observation, we take the absolute difference between the actual and predicted values, divide it by the actual value, and average these percentage errors. It works best when there are no extreme values in the data and none of the actual values are 0.
  • We will create a function that prints out all the above metrics in one go.
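For reference, the two quantities implemented below follow the standard definitions, where n is the number of observations and k the number of predictors:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}, \qquad \text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$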

In [46]:

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

In [47]:

# Checking model performance on train set
print("Training Performance\n")
lin_reg_model_train_perf = model_performance_regression(lin_reg_model, x_train, y_train)
lin_reg_model_train_perf
Training Performance

Out[47]:

      RMSE       MAE  R-squared  Adj. R-squared       MAPE
0  0.57729  0.467123    0.52152        0.514927  19.550725

In [48]:

# Checking model performance on test set
print("Test Performance\n")
lin_reg_model_test_perf = model_performance_regression(lin_reg_model, x_test, y_test)
lin_reg_model_test_perf
Test Performance

Out[48]:

       RMSE       MAE  R-squared  Adj. R-squared       MAPE
0  0.569001  0.463416   0.515888        0.500034  19.386434

Observations

  • The train and test R² are 0.522 and 0.516, indicating that the model explains 52.2% and 51.6% of the total variation in the train and test sets, respectively. The two scores are comparable.
  • RMSE values on the train and test sets are also comparable.
  • This shows that the model is not overfitting.
  • MAE indicates that our current model is able to predict anime ratings within a mean error of 0.46.
  • A MAPE of 19.4 on the test data means that, on average, our predictions are off by 19.4% of the actual rating.
  • However, the overall performance is not great.

Forward Feature Selection using SequentialFeatureSelector

We will see how to select a subset of important features with forward feature selection using SequentialFeatureSelector.

Why should we do feature selection?

  • Reduces dimensionality
  • Discards deceptive features (Deceptive features appear to aid learning on the training set, but impair generalization)
  • Speeds training/testing

How does forward feature selection work?

  • It starts with an empty model and adds variables one by one.
  • In each forward step, we add the one variable that gives the greatest improvement to the model (a minimal sketch of this greedy loop follows below).
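Before turning to the library, here is a minimal from-scratch sketch of the greedy loop (an illustration only, not the mlxtend implementation; it scores candidates on a single validation split rather than with cross-validation, and the helper name forward_select is ours):

# a toy greedy forward-selection loop: at each step, add the candidate
# feature that most improves validation R-squared
def forward_select(X, y, k):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)
    selected, remaining = [], list(X.columns)
    for _ in range(k):
        best_feat, best_score = None, float("-inf")
        for feat in remaining:  # try adding each remaining feature in turn
            cols = selected + [feat]
            model = LinearRegression().fit(X_tr[cols], y_tr)
            score = r2_score(y_val, model.predict(X_val[cols]))
            if score > best_score:
                best_feat, best_score = feat, score
        selected.append(best_feat)  # keep the single best addition
        remaining.remove(best_feat)
    return selected

# e.g., forward_select(X, y, k=5) returns the first five greedily chosen features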

We’ll use forward feature selection on all the variables

In [49]:

# please uncomment and run the next line if mlxtend library is not previously installed
#!pip install mlxtend

In [50]:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

reg = LinearRegression()

# Build step forward feature selection
sfs = SFS(
    reg,
    k_features=x_train.shape[1],  # k_features denotes the number of features to select (here, all of them)
    forward=True,
    floating=False,
    scoring="r2",
    n_jobs=-1,  # n_jobs=-1 means all processor cores will be used
    verbose=2,
    cv=5,
)

# Perform SFS
sfs = sfs.fit(x_train, y_train)
Features: 1/71 -- score: 0.261089359587903
Features: 2/71 -- score: 0.3522663085348247
Features: 3/71 -- score: 0.4005056411016484
Features: 4/71 -- score: 0.4174196547714809
Features: 5/71 -- score: 0.43222076293523193
Features: 6/71 -- score: 0.44172592802086363
Features: 7/71 -- score: 0.45317074001480134
Features: 8/71 -- score: 0.4593783988688152
Features: 9/71 -- score: 0.4652518355617037
Features: 10/71 -- score: 0.4712356970027415
Features: 11/71 -- score: 0.4748379884383242
Features: 12/71 -- score: 0.4788759723773873
Features: 13/71 -- score: 0.48172194185993045
Features: 14/71 -- score: 0.484668059638815
Features: 15/71 -- score: 0.48779553232899603
Features: 16/71 -- score: 0.4896860726135774
Features: 17/71 -- score: 0.49150488345245213
Features: 18/71 -- score: 0.4929449983855525
Features: 19/71 -- score: 0.4947138573328343
Features: 20/71 -- score: 0.49600065751933853
Features: 21/71 -- score: 0.4977994983620356
Features: 22/71 -- score: 0.49899929829971656
Features: 23/71 -- score: 0.5000700135002234
Features: 24/71 -- score: 0.5013008518232682
Features: 25/71 -- score: 0.5022616354843437
Features: 26/71 -- score: 0.5030824639798428
Features: 27/71 -- score: 0.5038498567527055
Features: 28/71 -- score: 0.5046324131355144
Features: 29/71 -- score: 0.5052024675128335
Features: 30/71 -- score: 0.5055993791428787
Features: 31/71 -- score: 0.5063018797524649
Features: 32/71 -- score: 0.5066930489801373
Features: 33/71 -- score: 0.5070901137609963
Features: 34/71 -- score: 0.5073763036837493
Features: 35/71 -- score: 0.5076591087988501
Features: 36/71 -- score: 0.5079459351928932
Features: 37/71 -- score: 0.5081872546172406
Features: 38/71 -- score: 0.508395794079204
Features: 39/71 -- score: 0.5086339743081216
Features: 40/71 -- score: 0.5087726457563382
Features: 41/71 -- score: 0.5089607360510882
Features: 42/71 -- score: 0.5090827109049949
Features: 43/71 -- score: 0.5091085185502525
Features: 44/71 -- score: 0.5091265928152826
Features: 45/71 -- score: 0.5090907405512923
Features: 46/71 -- score: 0.50906209335144
Features: 47/71 -- score: 0.5090304912705245
Features: 48/71 -- score: 0.5089922497120654
Features: 49/71 -- score: 0.5090922010721937
Features: 50/71 -- score: 0.5090371290359708
Features: 51/71 -- score: 0.5089610658812349
Features: 52/71 -- score: 0.5088768190842234
Features: 53/71 -- score: 0.5087913729346901
Features: 54/71 -- score: 0.5086628068184877
Features: 55/71 -- score: 0.5085185487373822
Features: 56/71 -- score: 0.5083562980939954
Features: 57/71 -- score: 0.5081893699944808
Features: 58/71 -- score: 0.50793955067985
Features: 59/71 -- score: 0.5076745579858717
Features: 60/71 -- score: 0.5073880804211568
Features: 61/71 -- score: 0.5070412907695172
Features: 62/71 -- score: 0.506680479226362
Features: 63/71 -- score: 0.5062991320679325
Features: 64/71 -- score: 0.5059491560394366
Features: 65/71 -- score: 0.505542220462096
Features: 66/71 -- score: 0.505046141804182
Features: 67/71 -- score: 0.5045388389504056
Features: 68/71 -- score: 0.5040342974289594
Features: 69/71 -- score: 0.503466989481843
Features: 70/71 -- score: 0.5027134493085308
Features: 71/71 -- score: 0.47766875396767494

In [51]:

# to plot the performance with addition of each feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

fig1 = plot_sfs(sfs.get_metric_dict(), kind="std_err", figsize=(15, 5))
plt.title("Sequential Forward Selection (w. StdErr)")
plt.xticks(rotation=90)
plt.show()
  • We can see that performance increases steadily up to around the 30th feature, then stays nearly constant, and drops sharply when the 71st (final) feature is added.
  • The choice of k_features is now a trade-off between the cross-validated R² and the complexity of the model.
    • With 30 features, we get a cross-validated R² of 0.506.
    • With 48 features, we get a cross-validated R² of 0.509.
    • With 71 features, we get a cross-validated R² of 0.477.
  • The gain from 30 to 48 features (0.506 to 0.509) is negligible, so we get essentially the same performance from a much less complex model.
  • So we'll use only 30 features to build our model, but you can experiment with a different number (one way to automate this choice is sketched right after this list).
  • The number of features chosen will also depend on the business context and the use case of the model.
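
Rather than eyeballing the plot, we can pick k programmatically from the selector's metric dictionary. Below is a minimal sketch, assuming the fitted sfs object from the full run above; the 0.005 tolerance is a hypothetical threshold, not a recommendation.

# pick the smallest subset size whose CV score is within a tolerance
# of the best score across all subset sizes
metrics = sfs.get_metric_dict()  # {k: {"avg_score": ..., "feature_idx": ...}, ...}
best_score = max(m["avg_score"] for m in metrics.values())
tolerance = 0.005  # assumed accuracy/complexity trade-off; tune as needed
k_chosen = min(k for k, m in metrics.items() if m["avg_score"] >= best_score - tolerance)
print(k_chosen)  # lands near 30 for this run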

In [52]:

reg = LinearRegression()

# Build step forward feature selection
sfs = SFS(
    reg,
    k_features=30,
    forward=True,
    floating=False,
    scoring="r2",
    n_jobs=-1,
    verbose=2,
    cv=5,
)

# Perform SFS (floating=False, so this is plain forward selection)
sfs = sfs.fit(x_train, y_train)
[2022-04-12 17:56:37] Features: 1/30 -- score: 0.261089359587903
[2022-04-12 17:56:37] Features: 2/30 -- score: 0.3522663085348247
[2022-04-12 17:56:38] Features: 3/30 -- score: 0.4005056411016484
[2022-04-12 17:56:38] Features: 4/30 -- score: 0.4174196547714809
[2022-04-12 17:56:38] Features: 5/30 -- score: 0.43222076293523193
[2022-04-12 17:56:39] Features: 6/30 -- score: 0.44172592802086363
[2022-04-12 17:56:39] Features: 7/30 -- score: 0.45317074001480134
[2022-04-12 17:56:39] Features: 8/30 -- score: 0.4593783988688152
[2022-04-12 17:56:40] Features: 9/30 -- score: 0.4652518355617037
[2022-04-12 17:56:40] Features: 10/30 -- score: 0.4712356970027415
[2022-04-12 17:56:41] Features: 11/30 -- score: 0.4748379884383242
[2022-04-12 17:56:41] Features: 12/30 -- score: 0.4788759723773873
[2022-04-12 17:56:42] Features: 13/30 -- score: 0.48172194185993045
[2022-04-12 17:56:42] Features: 14/30 -- score: 0.484668059638815
[2022-04-12 17:56:43] Features: 15/30 -- score: 0.48779553232899603
[2022-04-12 17:56:43] Features: 16/30 -- score: 0.4896860726135774
[2022-04-12 17:56:44] Features: 17/30 -- score: 0.49150488345245213
[2022-04-12 17:56:45] Features: 18/30 -- score: 0.4929449983855525
[2022-04-12 17:56:45] Features: 19/30 -- score: 0.4947138573328343
[2022-04-12 17:56:46] Features: 20/30 -- score: 0.49600065751933853
[2022-04-12 17:56:47] Features: 21/30 -- score: 0.4977994983620356
[2022-04-12 17:56:47] Features: 22/30 -- score: 0.49899929829971656
[2022-04-12 17:56:48] Features: 23/30 -- score: 0.5000700135002234
[2022-04-12 17:56:49] Features: 24/30 -- score: 0.5013008518232682
[2022-04-12 17:56:50] Features: 25/30 -- score: 0.5022616354843437
[2022-04-12 17:56:51] Features: 26/30 -- score: 0.5030824639798428
[2022-04-12 17:56:52] Features: 27/30 -- score: 0.5038498567527055
[2022-04-12 17:56:52] Features: 28/30 -- score: 0.5046324131355144
[2022-04-12 17:56:53] Features: 29/30 -- score: 0.5052024675128335
[2022-04-12 17:56:54] Features: 30/30 -- score: 0.5055993791428787

In [53]:

# let us select the features which are important
feat_cols = list(sfs.k_feature_idx_)
print(feat_cols)
[1, 2, 3, 5, 6, 8, 10, 11, 13, 14, 16, 19, 21, 22, 25, 26, 27, 28, 33, 38, 40, 42, 49, 50, 52, 54, 59, 64, 68, 70]

In [54]:

# let us look at the names of the important features
x_train.columns[feat_cols]

Out[54]:

Index(['duration', 'ongoing', 'years_running', 'contentWarn', 'watched',
       'wantWatch', 'votes', 'tag_Based_on_a_Manga', 'tag_Action',
       'tag_Fantasy', 'tag_Shounen', 'tag_Drama', 'tag_Family_Friendly',
       'tag_Short_Episodes', 'tag_Shorts', 'tag_Slice_of_Life', 'tag_Seinen',
       'tag_Supernatural', 'tag_Based_on_a_Light_Novel', 'mediaType_Movie',
       'mediaType_OVA', 'mediaType_TV', 'sznOfRelease_is_missing',
       'studio_primary_AIC', 'studio_primary_DLE', 'studio_primary_J.C. Staff',
       'studio_primary_Others', 'studio_primary_Studio Pierrot',
       'studio_primary_Toei Animation', 'studio_primary_is_missing'],
      dtype='object')
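
Since the selector was fit on a pandas DataFrame, mlxtend also stores the selected column names directly on the fitted object, so the same list can be retrieved without indexing into x_train.columns:

# equivalent shortcut: mlxtend records the names when fit on a DataFrame
print(list(sfs.k_feature_names_))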

Now we will fit an sklearn model using these features only.

In [55]:

x_train_final = x_train[x_train.columns[feat_cols]]

In [56]:

# Creating new x_test with the same 30 variables that we selected for x_train
x_test_final = x_test[x_train_final.columns]
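
As an aside, the fitted selector's transform method performs the same column subsetting in one call; note that it returns NumPy arrays, so the column names are lost:

# alternative route via the fitted selector (returns NumPy arrays, not DataFrames)
x_train_sel = sfs.transform(x_train)
x_test_sel = sfs.transform(x_test)
print(x_train_sel.shape, x_test_sel.shape)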

In [57]:

# Fitting linear model
lin_reg_model2 = LinearRegression()
lin_reg_model2.fit(x_train_final, y_train)

Out[57]:

LinearRegression()

In [58]:

# model performance on train set
print("Training Performance\n")
lin_reg_model2_train_perf = model_performance_regression(
    lin_reg_model2, x_train_final, y_train
)
lin_reg_model2_train_perf
Training Performance

Out[58]:

       RMSE       MAE  R-squared  Adj. R-squared       MAPE
0  0.582771  0.472809   0.512392        0.509575  19.791472

In [59]:

# model performance on test set
print("Test Performance\n")
lin_reg_model2_test_perf = model_performance_regression(
    lin_reg_model2, x_test_final, y_test
)
lin_reg_model2_test_perf
Test Performance

Out[59]:

       RMSE       MAE  R-squared  Adj. R-squared      MAPE
0  0.577022  0.470557   0.502142        0.495381  19.67366
  • The performance is slightly worse than that of the previous model, which used all the features.
  • Let’s compare the two models we built.

In [60]:

# training performance comparison

models_train_comp_df = pd.concat(
    [lin_reg_model_train_perf.T, lin_reg_model2_train_perf.T], axis=1,
)

models_train_comp_df.columns = [
    "Linear Regression sklearn",
    "Linear Regression sklearn (SFS features)",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:

Out[60]:

                Linear Regression sklearn  Linear Regression sklearn (SFS features)
RMSE                             0.577290                                   0.582771
MAE                              0.467123                                   0.472809
R-squared                        0.521520                                   0.512392
Adj. R-squared                   0.514927                                   0.509575
MAPE                            19.550725                                  19.791472

In [61]:

# test performance comparison

models_test_comp_df = pd.concat(
    [lin_reg_model_test_perf.T, lin_reg_model2_test_perf.T], axis=1,
)

models_test_comp_df.columns = [
    "Linear Regression sklearn",
    "Linear Regression sklearn (SFS features)",
]

print("Test performance comparison:")
models_test_comp_df
Test performance comparison:

Out[61]:

                Linear Regression sklearn  Linear Regression sklearn (SFS features)
RMSE                             0.569001                                   0.577022
MAE                              0.463416                                   0.470557
R-squared                        0.515888                                   0.502142
Adj. R-squared                   0.500034                                   0.495381
MAPE                            19.386434                                  19.673660
  • The new model (lin_reg_model2) uses fewer than half as many features as the previous model (lin_reg_model).
  • Its performance, however, is close to that of the previous model.
  • Depending on time sensitivity and storage constraints, we can choose between the two models (a quick latency check is sketched below).
  • We will move forward with lin_reg_model as it shows better performance.
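
To make the time-sensitivity point concrete, here is a minimal, machine-dependent sketch comparing the prediction latency of the two models; treat the absolute numbers as indicative only:

import timeit

# time 1,000 batch predictions with each model; results vary by machine
full_t = timeit.timeit(lambda: lin_reg_model.predict(x_test), number=1000)
sfs_t = timeit.timeit(lambda: lin_reg_model2.predict(x_test_final), number=1000)
print(f"All features (lin_reg_model):     {full_t:.3f} s")
print(f"30 SFS features (lin_reg_model2): {sfs_t:.3f} s")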

Conclusions

  • We have been able to build a predictive model that Streamist can use to predict the rating of an anime, with an R² of 0.52 on the training set.
  • Streamist can use this model to predict anime ratings within a mean absolute error of about 0.46.
  • From the analysis, we found that factors like the duration of each episode and whether an anime is ongoing tend to increase its rating, while factors like a content warning tend to decrease it.
  • We can also explore improving the linear model by applying non-linear transformations to some of the attributes (a minimal sketch follows below); this might help us better capture the patterns in the data and predict anime ratings more accurately.
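
For instance, count-style attributes such as watched, wantWatch, and votes are usually heavily right-skewed, and a log transform often helps a linear model. A minimal sketch, assuming these columns are indeed skewed (check their distributions first):

import numpy as np

# assumed right-skewed count columns; verify with histograms before applying
skewed_cols = ["watched", "wantWatch", "votes"]

x_train_log = x_train_final.copy()
x_test_log = x_test_final.copy()
for col in skewed_cols:
    x_train_log[col] = np.log1p(x_train_log[col])  # log(1 + x) handles zeros
    x_test_log[col] = np.log1p(x_test_log[col])

lin_reg_model3 = LinearRegression()
lin_reg_model3.fit(x_train_log, y_train)
model_performance_regression(lin_reg_model3, x_test_log, y_test)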
