Data Pre-Processing for Creating Models

Data pre-processing is one of the most important steps to complete before analysis and model building begin. A large dataset raises several concerns: outlier handling, missing-value treatment, handling of categorical variables, identifying and dropping unnecessary columns, and variable transformation. A data scientist typically spends more than half of their time on these data cleaning tasks.

Anime Rating Case Study

Let’s look at the contents of the dataset below, clean the data, and create a model.

In this case study, we introduce the raw (uncleaned) version of the dataset, walk through the steps of cleaning it, and apply transformations to a few columns. This will help us build an improved linear regression model that predicts anime ratings more accurately using transformed versions of features such as the number of episodes, the duration of episodes, and the number of people who have watched an anime.

Context

Streamist is a streaming company that streams web series and movies for a worldwide audience. Every piece of content on their portal is rated by viewers, and the portal also provides other information about the content, such as the number of people who have watched it, the number of people who want to watch it, the number of episodes, and the duration of an episode.

They are currently focusing on the anime available on their portal and want to identify the most important factors that influence an anime's rating. As a data scientist at Streamist, you are tasked with identifying these factors and building a model to predict the rating of an anime.

Objective

To preprocess the raw data, analyze it, and build a linear regression model to predict the ratings of anime.

Key Question

Is there a good predictive model for the rating of an anime? What does the performance assessment look like for such a model?
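
As a rough sketch of what such a performance assessment could look like, the snippet below fits a linear regression on synthetic data (not the anime dataset) and reports MAE, RMSE, and R-squared on the train and test splits. The helper name `regression_report` and the generated features are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def regression_report(model, X, y):
    """Return MAE, RMSE and R-squared of a fitted model on (X, y)."""
    pred = model.predict(X)
    return {
        "MAE": mean_absolute_error(y, pred),
        "RMSE": np.sqrt(mean_squared_error(y, pred)),
        "R2": r2_score(y, pred),
    }


# synthetic stand-in data: two features and a noisy linear target (assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

# comparing train and test metrics gives a first hint of over- or underfitting
train_report = regression_report(model, X_train, y_train)
test_report = regression_report(model, X_test, y_test)
print(train_report)
print(test_report)
```

Similar train and test scores would suggest the model generalizes reasonably well; a large gap would point to overfitting.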

Data Information

Each record in the database provides a description of an anime. A detailed data dictionary can be found below.

Data Dictionary

  • title – the title of anime
  • mediaType – format of publication
  • eps – number of episodes (movies are considered 1 episode)
  • duration – duration of an episode
  • ongoing – whether it is ongoing
  • startYr – year that airing started
  • finishYr – year that airing finished
  • sznOfRelease – the season of release (Winter, Spring, Summer, Fall)
  • description – the synopsis of the plot
  • studios – studios responsible for creation
  • contentWarn – whether anime has a content warning
  • watched – number of users that completed it
  • watching – number of users that are watching it
  • wantWatch – number of users that want to watch it
  • dropped – number of users that dropped it before completion
  • rating – average user rating
  • votes – number of votes that contribute to rating
  • tag_<tag/genre> – whether the anime has the certain tag or falls in the certain genre

Let’s start coding!

Import necessary libraries

In [1]:

# auto-formats the code in each notebook cell with Black (requires the nb_black extension)
%load_ext nb_black

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to split the data into train and test
from sklearn.model_selection import train_test_split

# to build the linear regression model
from sklearn.linear_model import LinearRegression

# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:

# loading the dataset
data = pd.read_csv("anime_data_raw.csv")

In [3]:

# checking the shape of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")  # f-string
There are 14578 rows and 48 columns.

In [4]:

# let's view a sample of the data
data.sample(
    10, random_state=2
)  # setting the random_state will ensure we get the same results every time

Out[4]:

titlemediaTypeepsdurationongoingstartYrfinishYrsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Others
13764Spy Penguin (2013): White ChristmasWeb1.02minFalse2013.02013.0NaNNaN[‘Next Media Animation’]08.00100NaNNaN0010001011000000100010000000000
3782A Little Snow Fairy Sugar Summer SpecialsTV Special2.0NaNFalse2003.02003.0NaNOne day, when Saga finds an old princess costu…[‘J.C. Staff’]01056.024576163.449571.00001001100000000000100000000000
2289Umineko: When They CryTV26.0NaNFalse2009.02009.0SummerIn the year 1986, eighteen members of the Ushi…[‘Studio Deen’]110896.01451848012363.7879463.00000000000000000001000000000000
5081Unbreakable Machine-Doll SpecialsDVD Special6.05minFalse2013.02014.0NaNNaN[‘Lerche’]11957.0201756503.1691312.00001000000000000000100110000000
9639Hanako Oku: HanabiTV1.06minFalse2015.02015.0NaNNaN[]046.015412.16633.00000000000000101000000000000000
12608Tamagotchi Honto no HanashiMovie1.020minFalse1997.01997.0NaNNaN[]011.02180NaNNaN0001001000000000000000000000000
6735Violinist of Hamelin MovieMovie1.030minFalse1996.01996.0NaNWhile on their quest to stop the Demon King, t…[‘Nippon Animation’]0247.0616782.826152.01001000000000000000000000000000
12846Neko KikakuMovie1.037minFalse2018.02018.0NaNNyagoya City is a trendy town where cats live….[‘Speed Inc.’]012.031022NaNNaN1000000110000000100000001000000
884Saint Young Men MovieMovie1.01hr 30minFalse2013.02013.0NaNJesus and Buddha are enjoying their vacation i…[‘A-1 Pictures’]02726.0682074374.1561962.01101000010000001010000000000000
10524Delinquent Hamsters / papalion ft. Piso StudioWeb1.02minFalse2017.02017.0FallNaN[‘Piso Studio’]018.001801.92710.01000000010001000100010000000000
  • The duration column has values in hours and minutes.
  • The studios column has a list of values.
  • There are a lot of missing values.
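
The first two observations hint at the cleaning these columns will need. A minimal sketch of one way to handle them is shown below; the helper names `duration_to_minutes` and `parse_studios` are hypothetical, and the code assumes the raw CSV stores durations as strings like "1hr 30min" and studios as Python-style list literals.

```python
import ast
import re


def duration_to_minutes(value):
    """Convert strings such as '1hr 30min' or '5min' to minutes.

    Returns None when the value is missing or has no recognisable unit.
    """
    if not isinstance(value, str):
        return None  # e.g. NaN read from the CSV
    hours = re.search(r"(\d+)\s*hr", value)
    minutes = re.search(r"(\d+)\s*min", value)
    if hours is None and minutes is None:
        return None
    total = 0
    if hours:
        total += 60 * int(hours.group(1))
    if minutes:
        total += int(minutes.group(1))
    return total


def parse_studios(value):
    """Turn a string like "['OLM', 'Sunrise']" into a Python list."""
    # assumes the column holds a Python-style list literal (assumption)
    return list(ast.literal_eval(value))


print(duration_to_minutes("1hr 30min"))  # 90
print(duration_to_minutes("2min"))  # 2
print(parse_studios("['TMS Entertainment', '8 Pan']"))
```

On the real data, these helpers could be applied column-wise, for example with `df["duration"].apply(duration_to_minutes)`.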

In [5]:

# creating a copy of the data so that original data remains unchanged
df = data.copy()

In [6]:

# checking for duplicate values in the data
df.duplicated().sum()

Out[6]:

0
  • There are no duplicate rows in the data.

In [7]:

# checking the names of the columns in the data
print(df.columns)
Index(['title', 'mediaType', 'eps', 'duration', 'ongoing', 'startYr',
       'finishYr', 'sznOfRelease', 'description', 'studios', 'contentWarn',
       'watched', 'watching', 'wantWatch', 'dropped', 'rating', 'votes',
       "tag_'Comedy'", "tag_'Based on a Manga'", "tag_'Action'",
       "tag_'Fantasy'", "tag_'Sci Fi'", "tag_'Shounen'",
       "tag_'Family Friendly'", "tag_'Original Work'",
       "tag_'Non-Human Protagonists'", "tag_'Adventure'",
       "tag_'Short Episodes'", "tag_'Drama'", "tag_'Shorts'", "tag_'Romance'",
       "tag_'School Life'", "tag_'Slice of Life'", "tag_'Animal Protagonists'",
       "tag_'Seinen'", "tag_'Supernatural'", "tag_'Magic'",
       "tag_'CG Animation'", "tag_'Mecha'", "tag_'Ecchi'",
       "tag_'Based on a Light Novel'", "tag_'Anthropomorphic'",
       "tag_'Superpowers'", "tag_'Promotional'", "tag_'Sports'",
       "tag_'Historical'", "tag_'Vocaloid'", 'tag_Others'],
      dtype='object')

In [8]:

# checking column datatypes and number of non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14578 entries, 0 to 14577
Data columns (total 48 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   title                         14578 non-null  object 
 1   mediaType                     14510 non-null  object 
 2   eps                           14219 non-null  float64
 3   duration                      9137 non-null   object 
 4   ongoing                       14578 non-null  bool   
 5   startYr                       14356 non-null  float64
 6   finishYr                      14134 non-null  float64
 7   sznOfRelease                  3767 non-null   object 
 8   description                   8173 non-null   object 
 9   studios                       14578 non-null  object 
 10  contentWarn                   14578 non-null  int64  
 11  watched                       14356 non-null  float64
 12  watching                      14578 non-null  int64  
 13  wantWatch                     14578 non-null  int64  
 14  dropped                       14578 non-null  int64  
 15  rating                        12107 non-null  float64
 16  votes                         12119 non-null  float64
 17  tag_'Comedy'                  14578 non-null  int64  
 18  tag_'Based on a Manga'        14578 non-null  int64  
 19  tag_'Action'                  14578 non-null  int64  
 20  tag_'Fantasy'                 14578 non-null  int64  
 21  tag_'Sci Fi'                  14578 non-null  int64  
 22  tag_'Shounen'                 14578 non-null  int64  
 23  tag_'Family Friendly'         14578 non-null  int64  
 24  tag_'Original Work'           14578 non-null  int64  
 25  tag_'Non-Human Protagonists'  14578 non-null  int64  
 26  tag_'Adventure'               14578 non-null  int64  
 27  tag_'Short Episodes'          14578 non-null  int64  
 28  tag_'Drama'                   14578 non-null  int64  
 29  tag_'Shorts'                  14578 non-null  int64  
 30  tag_'Romance'                 14578 non-null  int64  
 31  tag_'School Life'             14578 non-null  int64  
 32  tag_'Slice of Life'           14578 non-null  int64  
 33  tag_'Animal Protagonists'     14578 non-null  int64  
 34  tag_'Seinen'                  14578 non-null  int64  
 35  tag_'Supernatural'            14578 non-null  int64  
 36  tag_'Magic'                   14578 non-null  int64  
 37  tag_'CG Animation'            14578 non-null  int64  
 38  tag_'Mecha'                   14578 non-null  int64  
 39  tag_'Ecchi'                   14578 non-null  int64  
 40  tag_'Based on a Light Novel'  14578 non-null  int64  
 41  tag_'Anthropomorphic'         14578 non-null  int64  
 42  tag_'Superpowers'             14578 non-null  int64  
 43  tag_'Promotional'             14578 non-null  int64  
 44  tag_'Sports'                  14578 non-null  int64  
 45  tag_'Historical'              14578 non-null  int64  
 46  tag_'Vocaloid'                14578 non-null  int64  
 47  tag_Others                    14578 non-null  int64  
dtypes: bool(1), float64(6), int64(35), object(6)
memory usage: 5.2+ MB
  • There are many numeric (float and int type) and string (object type) columns in the data.
  • The dependent variable is the rating of an anime, which is of float type.
  • The ongoing column is of bool type.

In [9]:

# checking for missing values in the data.
df.isnull().sum()

Out[9]:

title                               0
mediaType                          68
eps                               359
duration                         5441
ongoing                             0
startYr                           222
finishYr                          444
sznOfRelease                    10811
description                      6405
studios                             0
contentWarn                         0
watched                           222
watching                            0
wantWatch                           0
dropped                             0
rating                           2471
votes                            2459
tag_'Comedy'                        0
tag_'Based on a Manga'              0
tag_'Action'                        0
tag_'Fantasy'                       0
tag_'Sci Fi'                        0
tag_'Shounen'                       0
tag_'Family Friendly'               0
tag_'Original Work'                 0
tag_'Non-Human Protagonists'        0
tag_'Adventure'                     0
tag_'Short Episodes'                0
tag_'Drama'                         0
tag_'Shorts'                        0
tag_'Romance'                       0
tag_'School Life'                   0
tag_'Slice of Life'                 0
tag_'Animal Protagonists'           0
tag_'Seinen'                        0
tag_'Supernatural'                  0
tag_'Magic'                         0
tag_'CG Animation'                  0
tag_'Mecha'                         0
tag_'Ecchi'                         0
tag_'Based on a Light Novel'        0
tag_'Anthropomorphic'               0
tag_'Superpowers'                   0
tag_'Promotional'                   0
tag_'Sports'                        0
tag_'Historical'                    0
tag_'Vocaloid'                      0
tag_Others                          0
dtype: int64
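  • Several columns have missing values; sznOfRelease (10,811), description (6,405), and duration (5,441) have the most. The target column, rating, is missing for 2,471 rows.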
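
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one possible way to summarise that; `missing_summary` is a hypothetical helper and the toy frame holds made-up values rather than the anime data.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": (100 * counts / len(frame)).round(2)}
    )
    # keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# toy frame standing in for df (values are invented for illustration)
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "duration": ["5min", np.nan, np.nan, "1hr"],
        "rating": [3.1, np.nan, 4.2, 2.7],
    }
)
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the real data would list only the affected columns, which is easier to scan than the full 48-row output.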

In [10]:

# Let's look at the statistical summary of the data
df.describe(include="all").T

Out[10]:

                                count  unique                                              top   freq         mean          std     min     25%     50%     75%       max
title                           14578   14578                 Fullmetal Alchemist: Brotherhood      1          NaN          NaN     NaN     NaN     NaN     NaN       NaN
mediaType                       14510       8                                               TV   4510          NaN          NaN     NaN     NaN     NaN     NaN       NaN
eps                           14219.0     NaN                                              NaN    NaN    13.501231    62.262185     1.0     1.0     1.0    12.0    2527.0
duration                         9137     147                                             4min    964          NaN          NaN     NaN     NaN     NaN     NaN       NaN
ongoing                         14578       2                                            False  14356          NaN          NaN     NaN     NaN     NaN     NaN       NaN
startYr                       14356.0     NaN                                              NaN    NaN  2005.457788    14.707105  1907.0  2000.0  2010.0  2016.0    2026.0
finishYr                      14134.0     NaN                                              NaN    NaN  2005.515919    14.656509  1907.0  2000.0  2010.0  2016.0    2026.0
sznOfRelease                     3767       4                                           Spring   1202          NaN          NaN     NaN     NaN     NaN     NaN       NaN
description                      8173    8108  In 19th century Belgium, in the Flanders count…      3          NaN          NaN     NaN     NaN     NaN     NaN       NaN
studios                         14578     864                                               []   4808          NaN          NaN     NaN     NaN     NaN     NaN       NaN
contentWarn                   14578.0     NaN                                              NaN    NaN     0.098024     0.297358     0.0     0.0     0.0     0.0       1.0
watched                       14356.0     NaN                                              NaN    NaN  2408.043396  7168.368428     0.0    25.0   165.0  1469.5  161567.0
watching                      14578.0     NaN                                              NaN    NaN   213.026684   1261.70764     0.0     1.0     7.0    63.0   74537.0
wantWatch                     14578.0     NaN                                              NaN    NaN  1021.729112  2145.010604     0.0    24.0   175.0   980.0   28541.0
dropped                       14578.0     NaN                                              NaN    NaN   125.963026   453.577348     0.0     1.0     7.0    40.0   19481.0
rating                        12107.0     NaN                                              NaN    NaN     2.948697     0.827642   0.844  2.3035   2.965  3.6155     4.702
votes                         12119.0     NaN                                              NaN    NaN  2085.787771  5946.283685    10.0    34.0   218.0  1412.5  131067.0
tag_'Comedy'                  14578.0     NaN                                              NaN    NaN     0.262999     0.440277     0.0     0.0     0.0     1.0       1.0
tag_'Based on a Manga'        14578.0     NaN                                              NaN    NaN     0.262176     0.439833     0.0     0.0     0.0     1.0       1.0
tag_'Action'                  14578.0     NaN                                              NaN    NaN     0.211003     0.408034     0.0     0.0     0.0     0.0       1.0
tag_'Fantasy'                 14578.0     NaN                                              NaN    NaN     0.175881     0.380732     0.0     0.0     0.0     0.0       1.0
tag_'Sci Fi'                  14578.0     NaN                                              NaN    NaN     0.153931     0.360895     0.0     0.0     0.0     0.0       1.0
tag_'Shounen'                 14578.0     NaN                                              NaN    NaN     0.128001     0.334102     0.0     0.0     0.0     0.0       1.0
tag_'Family Friendly'         14578.0     NaN                                              NaN    NaN     0.126835     0.332799     0.0     0.0     0.0     0.0       1.0
tag_'Original Work'           14578.0     NaN                                              NaN    NaN     0.126149     0.332029     0.0     0.0     0.0     0.0       1.0
tag_'Non-Human Protagonists'  14578.0     NaN                                              NaN    NaN     0.120593     0.325664     0.0     0.0     0.0     0.0       1.0
tag_'Adventure'               14578.0     NaN                                              NaN    NaN      0.10557     0.307297     0.0     0.0     0.0     0.0       1.0
tag_'Short Episodes'          14578.0     NaN                                              NaN    NaN     0.104267     0.305617     0.0     0.0     0.0     0.0       1.0
tag_'Drama'                   14578.0     NaN                                              NaN    NaN     0.102552     0.303383     0.0     0.0     0.0     0.0       1.0
tag_'Shorts'                  14578.0     NaN                                              NaN    NaN     0.092811     0.290177     0.0     0.0     0.0     0.0       1.0
tag_'Romance'                 14578.0     NaN                                              NaN    NaN     0.082453     0.275063     0.0     0.0     0.0     0.0       1.0
tag_'School Life'             14578.0     NaN                                              NaN    NaN     0.080327     0.271807     0.0     0.0     0.0     0.0       1.0
tag_'Slice of Life'           14578.0     NaN                                              NaN    NaN     0.076691      0.26611     0.0     0.0     0.0     0.0       1.0
tag_'Animal Protagonists'     14578.0     NaN                                              NaN    NaN     0.072781     0.259785     0.0     0.0     0.0     0.0       1.0
tag_'Seinen'                  14578.0     NaN                                              NaN    NaN     0.067979     0.251719     0.0     0.0     0.0     0.0       1.0
tag_'Supernatural'            14578.0     NaN                                              NaN    NaN     0.064824     0.246223     0.0     0.0     0.0     0.0       1.0
tag_'Magic'                   14578.0     NaN                                              NaN    NaN     0.057004     0.231858     0.0     0.0     0.0     0.0       1.0
tag_'CG Animation'            14578.0     NaN                                              NaN    NaN     0.055632     0.229217     0.0     0.0     0.0     0.0       1.0
tag_'Mecha'                   14578.0     NaN                                              NaN    NaN     0.049664     0.217257     0.0     0.0     0.0     0.0       1.0
tag_'Ecchi'                   14578.0     NaN                                              NaN    NaN     0.048909     0.215686     0.0     0.0     0.0     0.0       1.0
tag_'Based on a Light Novel'  14578.0     NaN                                              NaN    NaN     0.048018     0.213811     0.0     0.0     0.0     0.0       1.0
tag_'Anthropomorphic'         14578.0     NaN                                              NaN    NaN      0.04397     0.205036     0.0     0.0     0.0     0.0       1.0
tag_'Superpowers'             14578.0     NaN                                              NaN    NaN     0.039374     0.194491     0.0     0.0     0.0     0.0       1.0
tag_'Promotional'             14578.0     NaN                                              NaN    NaN      0.03814      0.19154     0.0     0.0     0.0     0.0       1.0
tag_'Sports'                  14578.0     NaN                                              NaN    NaN     0.036974     0.188703     0.0     0.0     0.0     0.0       1.0
tag_'Historical'              14578.0     NaN                                              NaN    NaN      0.03615     0.186671     0.0     0.0     0.0     0.0       1.0
tag_'Vocaloid'                14578.0     NaN                                              NaN    NaN     0.034916     0.183572     0.0     0.0     0.0     0.0       1.0
tag_Others                    14578.0     NaN                                              NaN    NaN     0.080601     0.272231     0.0     0.0     0.0     0.0       1.0
  • We can see that the anime ratings vary between 0.844 and 4.702, which suggests that the anime were rated on a scale of 0-5.
  • TV is the most frequent media type.
  • For anime whose season of release is available, Spring is the most common season.
  • The number of users who have watched an anime spans a very wide range (0 to more than 160,000).
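
Count columns with such a wide range are usually heavily right-skewed, which is one reason to transform them before fitting a linear model. As an illustrative sketch only (the transformation actually chosen for this case study may differ), a `log1p` transform on made-up counts looks like this:

```python
import numpy as np
import pandas as pd

# made-up counts with a spread similar in spirit to the watched column
toy = pd.DataFrame({"watched": [0, 25, 165, 1470, 161567]})

# log1p computes log(1 + x), so zero counts map to 0 instead of -inf
toy["watched_log"] = np.log1p(toy["watched"])

print(toy)
print("skew before:", toy["watched"].skew())
print("skew after: ", toy["watched_log"].skew())
```

The transformed values are far more evenly spread, which tends to make the relationship with the target easier for a linear model to capture.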

From the data overview, we see that many columns in the data need to be preprocessed before they can be used for analysis.

Data Preprocessing

We will drop the rows with a missing rating, since rating is the target variable and a row without it cannot be used to train or evaluate the model.

In [11]:

df.dropna(subset=["rating"], inplace=True)

In [12]:

# let us reset the dataframe index
df.reset_index(inplace=True, drop=True)

In [13]:

# checking missing values in rest of the data
df.isnull().sum()

Out[13]:

title                              0
mediaType                         63
eps                                0
duration                        4636
ongoing                            0
startYr                            6
finishYr                         121
sznOfRelease                    8560
description                     4474
studios                            0
contentWarn                        0
watched                          115
watching                           0
wantWatch                          0
dropped                            0
rating                             0
votes                              0
tag_'Comedy'                       0
tag_'Based on a Manga'             0
tag_'Action'                       0
tag_'Fantasy'                      0
tag_'Sci Fi'                       0
tag_'Shounen'                      0
tag_'Family Friendly'              0
tag_'Original Work'                0
tag_'Non-Human Protagonists'       0
tag_'Adventure'                    0
tag_'Short Episodes'               0
tag_'Drama'                        0
tag_'Shorts'                       0
tag_'Romance'                      0
tag_'School Life'                  0
tag_'Slice of Life'                0
tag_'Animal Protagonists'          0
tag_'Seinen'                       0
tag_'Supernatural'                 0
tag_'Magic'                        0
tag_'CG Animation'                 0
tag_'Mecha'                        0
tag_'Ecchi'                        0
tag_'Based on a Light Novel'       0
tag_'Anthropomorphic'              0
tag_'Superpowers'                  0
tag_'Promotional'                  0
tag_'Sports'                       0
tag_'Historical'                   0
tag_'Vocaloid'                     0
tag_Others                         0
dtype: int64

Let us look at the entries with no start year.

In [14]:

df[df.startYr.isnull()]

Out[14]:

titlemediaTypeepsdurationongoingstartYrfinishYrsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Others
1405Unbelievable Space LoveWeb10.01minFalseNaNNaNNaNNaN[]090.01634304.01254.00000000100100100000000000000000
5222Manbo-P: Irokoizata wa Subete Sakuzu de Kaiket…Music Video1.05minFalseNaNNaNNaNNaN[]041.002503.13920.01000000000000100000000000000010
9813Mameshiba: Mamerry ChristmasOther1.01minFalseNaNNaNNaNNaN[]057.011702.11935.00000001000001000000000000000000
10258Meow no HoshiOther1.05minFalseNaNNaNNaNNaN[]040.002501.99925.00000000110010000100000000000000
11970LandmarkWeb1.04minFalseNaNNaNNaNNaN[]034.00901.25621.00000000000000000000000000000001
12077Burutabu-chanOther3.01minFalseNaNNaNNaNNaN[]046.011011.04633.01000000000100000000000000000000
  • We will drop the entries with no start year as this is a difficult column to impute.
  • Whether to drop rows with missing values or impute a suitable value depends on domain knowledge, and model performance will vary with the approach taken.
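
To make the trade-off concrete, the sketch below contrasts the two options on a small invented frame. Median imputation is used here only as an example of "a suitable value"; it is an assumption, not the approach taken in this case study.

```python
import numpy as np
import pandas as pd

# toy frame; the values are invented for illustration
toy = pd.DataFrame(
    {
        "startYr": [2013.0, np.nan, 1997.0, 2018.0, np.nan],
        "rating": [3.4, 4.0, 2.8, 3.9, 2.1],
    }
)

# Option 1: drop rows with a missing start year (the route taken here)
dropped = toy.dropna(subset=["startYr"]).reset_index(drop=True)

# Option 2: impute with the median start year instead of dropping
imputed = toy.copy()
imputed["startYr"] = imputed["startYr"].fillna(imputed["startYr"].median())

print(len(dropped), "rows after dropping")
print(len(imputed), "rows after imputing")
print(imputed)
```

Dropping loses rows but keeps only observed values; imputing keeps every row at the cost of introducing values that were never observed.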

In [15]:

df.dropna(subset=["startYr"], inplace=True)

# let us reset the dataframe index
df.reset_index(inplace=True, drop=True)

In [16]:

# checking missing values in rest of the data
df.isnull().sum()

Out[16]:

title                              0
mediaType                         63
eps                                0
duration                        4636
ongoing                            0
startYr                            0
finishYr                         115
sznOfRelease                    8554
description                     4468
studios                            0
contentWarn                        0
watched                          115
watching                           0
wantWatch                          0
dropped                            0
rating                             0
votes                              0
tag_'Comedy'                       0
tag_'Based on a Manga'             0
tag_'Action'                       0
tag_'Fantasy'                      0
tag_'Sci Fi'                       0
tag_'Shounen'                      0
tag_'Family Friendly'              0
tag_'Original Work'                0
tag_'Non-Human Protagonists'       0
tag_'Adventure'                    0
tag_'Short Episodes'               0
tag_'Drama'                        0
tag_'Shorts'                       0
tag_'Romance'                      0
tag_'School Life'                  0
tag_'Slice of Life'                0
tag_'Animal Protagonists'          0
tag_'Seinen'                       0
tag_'Supernatural'                 0
tag_'Magic'                        0
tag_'CG Animation'                 0
tag_'Mecha'                        0
tag_'Ecchi'                        0
tag_'Based on a Light Novel'       0
tag_'Anthropomorphic'              0
tag_'Superpowers'                  0
tag_'Promotional'                  0
tag_'Sports'                       0
tag_'Historical'                   0
tag_'Vocaloid'                     0
tag_Others                         0
dtype: int64

Let us look at the entries with no finish year.

In [17]:

df[df.finishYr.isnull()]

Out[17]:

titlemediaTypeepsdurationongoingstartYrfinishYrsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Others
13Kaguya-sama: Love Is War?TV10.0NaNTrue2020.0NaNSpringThe battle between love and pride continues! N…[‘A-1 Pictures’]0NaN63685747964.6172359.01100000000010110010000000000000
46Douluo Dalu 2Web82.022minTrue2018.0NaNNaNSecond season of Douluo Dalu.[]0NaN1167990324.540549.00011000000000000000010000000000
70Fruits Basket 2nd SeasonTV10.0NaNTrue2020.0NaNSpringSecond season of Fruits Basket.[‘TMS Entertainment’, ‘8 Pan’]0NaN41604427554.5271194.01101000000010100000000000000000
111Ascendance of a Bookworm: Part IITV11.0NaNTrue2020.0NaNSpringWith her baptism ceremony complete, Myne begin…[‘Ajia-do’]0NaN31831916294.4831139.00001000000010000000100010000000
115Rakshasa Street 2nd SeasonWeb5.0NaNTrue2019.0NaNNaNNaN[]0NaN4710204.48210.00010010000000000001000000000000
121Kingdom 3TV4.0NaNTrue2020.0NaNSpringThird season of Kingdom.[‘Studio Pierrot’, ‘St. Signpost’]0NaN515740144.476202.00110000000010000010000000000100
239One PieceTV929.0NaNTrue1999.0NaNFallLong ago the infamous Gol D. Roger was the str…[‘Toei Animation’]0NaN7453716987124454.40259737.01111010001000000000000000100000
262Tower of GodTV11.0NaNTrue2020.0NaNSpringFame. Glory. Power. Anything in your wildest d…[‘Telecom Animation Film’]1NaN956850851874.3913387.00111000001010000000000000000000
314Wu Geng Ji 3rd SeasonWeb21.0NaNTrue2019.0NaNNaNThird season of Wu Geng Ji.[]0NaN5014014.36619.00111000000000000000010000000000
324A Certain Scientific Railgun TTV15.0NaNTrue2020.0NaNWinterMikoto Misaka and her friends prepare for the …[‘J.C. Staff’]0NaN18252939434.365638.00110100000000000001000000100000
350My Next Life as a Villainess: All Routes Lead …TV11.0NaNTrue2020.0NaNSpringWealthy heiress Katarina Claes is hit in the h…[‘SILVER LINK’]0NaN597141071444.3482126.01001000000000100000100010000000
400Ling Jian Zun 3rd SeasonWeb40.010minTrue2019.0NaNNaNThird season of Ling Jian Zun.[]0NaN563114.33218.00011000000100000000010000000000
406Major 2nd: Second SeasonTV7.0NaNTrue2020.0NaNSpringSecond season of Major 2nd.[‘OLM’]0NaN30730684.329102.00100000000000000000000000001000
418IDOLiSH7: Second BEAT!TV4.0NaNTrue2020.0NaNSpringSecond season of IDOLiSH7.[‘TROYCA’]0NaN344764134.325106.00000000000010000000000000000000
546Food Wars! The Fifth PlateTV2.0NaNTrue2020.0NaNSpringNaN[‘J.C. Staff’]0NaN33543673344.276987.01100010000000000000000100000000
597Seitokai Yakuindomo* OVAOVA9.0NaNTrue2014.0NaNNaNNaN[‘GoHands’]1NaN18661390794.2521271.01100010000000010000000000000000
604Touhou Gensou Mangekyou: The Memories of PhantasmOther14.015minTrue2011.0NaNNaNMarisa, an ordinary magician, suspects a youka…[]0NaN7291134714.248558.00001000000000000000100000000000
660Yaoshenji 4th SeasonWeb13.0NaNTrue2020.0NaNNaNFourth season of Yaoshenji.[]0NaN1068034.22642.00011000000000000000010010000000
678The Millionaire Detective – Balance:UnlimitedTV2.0NaNTrue2020.0NaNSpringDetective Daisuke Kambe has no problems using …[‘CloverWorks’]0NaN31332990894.2231028.01010000000010000000000000000000
767Quanzhi Fashi 4th SeasonWeb3.018minTrue2020.0NaNNaNFourth season of Quanzhi Fashi.[‘Foch’]0NaN9443123.79525.00011000000000010001100000000000
772That Time I Got Reincarnated as a Slime OVAOVA3.023minTrue2019.0NaNNaNIn the midst of his everyday life, Rimuru sudd…[‘8-Bit’]0NaN38593225704.1881849.01111010001000000000100100000000
842Huyao Xiao Hongniang 2Web80.0NaNTrue2017.0NaNNaNContinuation of Huyao Xiao Hongniang (Fox Spir…[‘Haoliners Animation League’]0NaN217497184.17087.00001000010000100000000000000000
906Feng Ling Yu XiuWeb4.0NaNTrue2017.0NaNNaNNaN[]0NaN2812154.15113.00000000000000000000000000000001
930Sing “Yesterday” for MeTV11.0NaNTrue2020.0NaNSpringRikuo has graduated from college, but has zero…[‘Doga Kobo’]0NaN275631191374.136827.00100000000010101010000000000000
953KakushigotoTV11.0NaNTrue2020.0NaNSpringSingle father Kakushi Goto has a secret. He’s …[‘Ajia-do’]0NaN273530931644.143891.01100010000010001000000000000000
994Detective ConanTV974.0NaNTrue1996.0NaNWinterShinichi Kudo is a famous teenage detective wh…[‘TMS Entertainment’, ‘V1 Studio’]1NaN14928503547714.12614422.00100010000010000000000000000000
1133Pokemon: Twilight WingsWeb5.06minTrue2020.0NaNWinterSet in the Galar region, where Pokémon battles…[‘Studio Colorido’]0NaN735482204.083243.00001001000100001000000000000000
1305Ling Long: IncarnationWeb6.030minTrue2019.0NaNNaNNaN[]0NaN397854.03422.00000100000000000000010000000000
1314Katarina Nounai KaigiWeb26.01minTrue2020.0NaNWinterNaN[]0NaN5814624.03218.01001000000100000000000010010000
1317Great PretenderTV10.0NaNTrue2020.0NaNSummerMakoto Edamura is a con man who is said to be …[‘Wit Studio’]0NaN23193464.12478.00000000100000000000000000000000
1323Black CloverTV132.0NaNTrue2017.0NaNFallIn a world where magic is everything, Asta and…[‘Studio Pierrot’]0NaN323131229534244.03117866.00111010001000000000100000000000
1473Ahiru no SoraTV35.0NaNTrue2019.0NaNFallHe may be shorter in stature, but Sora Kurumat…[‘diomedea’]0NaN306718133023.9891305.00100010000000000000000000001000
1550Pokemon JourneysTV24.0NaNTrue2019.0NaNFallPokémon Trainer Ash Ketchum has a new plan: se…[‘OLM’]0NaN821773683.962314.00001000001000000000000000000000
1551ArteTV11.0NaNTrue2020.0NaNSpring16th century Firenze, Italy. One girl, One ART…[‘Seven Arcs’]0NaN174619201303.966578.00100000000010100010000000000100
1683Wan Jie Shen ZhuWeb92.09minTrue2019.0NaNNaNTime traveling from 21st century to the Nanzho…[]0NaN12289103.93053.00001000000100000000010000000000
1839Black Clover: Petit Clover AdvanceDVD Special4.06minTrue2018.0NaNNaNNaN[]0NaN6310553.89431.01000000000000000000000000000000
1903Soukyuu no Fafner: The BeyondOVA6.030minTrue2019.0NaNNaNNaN[‘XEBEC Zwei’]0NaN4549853.87818.00010100100000000000001000000000
1958Appare-Ranman!TV3.0NaNTrue2020.0NaNSpringThe socially awkward yet genius engineer, Appa…[‘P.A. Works’]0NaN11111431653.864325.00000000101000000000000000000000
1972Hifuu Katsudou Kiroku: The Sealed Esoteric His…Other2.0NaNTrue2015.0NaNNaNNaN[]0NaN3212973.86217.00000000000000000000000000000001
2154Digimon Adventure Memorial Story ProjectWeb1.06minTrue2020.0NaNWinterOver ten years have passed since that initial …[‘TYO Animations’]0NaN14132353.81742.00000100000010000000000000000000
2171Doraemon (2005)TV607.0NaNTrue2005.0NaNSpringRobotic cat Doraemon is sent back in time from…[‘Shin-Ei Animation’]0NaN7463902223.815557.01100011011000000000000000000000
2263Digimon Adventure:TV3.0NaNTrue2020.0NaNSpringIt’s the year 2020. The Network has become som…[‘Toei Animation’]0NaN961671443.797306.00010100001000000000000000000000
2286GleipnirTV11.0NaNTrue2020.0NaNSpringShuichi Kagaya an ordinary high school kid in …[‘Pine Jam’]1NaN382133112703.7921424.00110100000000000010000100100000
2294Wave, Listen to Me!TV11.0NaNTrue2020.0NaNSpringThe stage is Sapporo, Hokkaido. One night, our…[‘Sunrise’]0NaN8577761153.785298.01100000000010001010000000000000
2359Tsugumomo 2TV11.0NaNTrue2020.0NaNSpringThanks to his unique condition that causes him…[‘Zero-G’]0NaN7391112313.769227.01110000000000000010000100000000
2378Strike the Blood IVOVA2.024minTrue2020.0NaNNaNNaN[‘CONNECT’]1NaN3941263103.800115.00011000000000000000100010000000
2400Diary of Our Days at the BreakwaterTV3.0NaNTrue2020.0NaNSpringHina Tsurugi is a first-year student who moves…[‘Doga Kobo’]0NaN532513453.763182.01100000000000001010000000000000
2430Gundam Build Divers Re:RISE 2nd SeasonWeb5.025minTrue2020.0NaNSpringTwo years have passed since the legendary forc…[‘Sunrise Beyond’]0NaN14317133.75249.00010100000000000000001000000000
2475Prince of Tennis: Best Games!!OVA2.0NaNTrue2018.0NaNNaNThe new anime will depict previously untold st…[‘M.S.C.’]0NaN6232543.74121.00100010000000000000000000001000
2550Letters from HibakushaTV Special10.05minTrue2017.0NaNNaNThe program will consist of three anime shorts…[]0NaN166313.72210.00000000000110000000000000000100
2865Sing “Yesterday” for Me ExtraWeb4.02minTrue2020.0NaNSpringNaN[‘Doga Kobo’]0NaN19823713.63861.00100000000100001010000000000000
3041Crayon Shin-chanTV1034.0NaNTrue1992.0NaNNaNShinnosuke Nohara is a crude and rude five-yea…[‘Shin-Ei Animation’]1NaN5103136327703.6115600.01100000000000001000000100000000
3063A3! Season Spring & SummerTV10.0NaNTrue2020.0NaNWinterMankai Company is a far cry from its glory day…[‘Studio 3Hz’, ‘P.A. Works’]0NaN470962743.571175.01000000000000000000000000000000
3091OjarumaruTV1738.010minTrue1998.0NaNFallIn the Heian era, around 1000 years ago, a you…[‘Gallop’]0NaN252443.60011.00100001000100000000000000000000
3174Nintama RantarouTV1888.010minTrue1993.0NaNNaNRantarou, Shinbei and Kirimaru are ninja appre…[‘Ajia-do’]0NaN92135433.58264.01110011001100000000000000000000
3480Beyblade: Burst Super KingTV8.0NaNTrue2020.0NaNSpringNaN[‘OLM’]0NaN5710313.51517.00010000000000000000000000000000
3520Ryoutei no AjiOther8.02minTrue2014.0NaNNaNNaN[‘Studio Colorido’]0NaN192733.50724.00000000000110000000000000010000
3529Boruto: Naruto Next GenerationsTV154.0NaNTrue2017.0NaNSpringThe life of the shinobi is beginning to change…[‘Studio Pierrot’]0NaN22446848028893.50411722.00111010000000000000000000000000
3594Argonavis from BanG Dream! AnimationTV10.0NaNTrue2020.0NaNSpringNaN[‘SANZIGEN’]0NaN137324123.49147.00000000000000000000000000000001
3736Healin’ Good Pretty CureTV12.0NaNTrue2020.0NaNWinterThe Healing Garden, a secret world that treats…[‘Toei Animation’]0NaN12619083.48846.00000001000000000000000000000000
3848BanG Dream! Girls Band Party! Pico: OhmoriWeb6.03minTrue2020.0NaNSpringNaN[‘W-Toon Studio’, ‘SANZIGEN’]0NaN559903.43311.00000000000100000000000000000000
3864Chibi Maruko-chan (1995)TV1208.0NaNTrue1995.0NaNWinterNaN[‘Nippon Animation’]0NaN8286263.43050.01100001000000011000000000000000
3936Disney Tsum TsumWeb37.02minTrue2014.0NaNNaNNaN[‘Polygon Pictures’]0NaN80172263.41463.01000001010100000100010000010000
3938The 8th Son? Are You Kidding Me?TV11.0NaNTrue2020.0NaNSpringShingo Ichinomiya, a 25-year-old man working a…[‘Synergy SP’, ‘Shin-Ei Animation’]0NaN451428952673.4281670.00011000001010000000100010000000
4000Queen’s Blade: UnlimitedOVA2.0NaNTrue2018.0NaNNaNNaN[‘Fortes’]0NaN121793183.37651.00001000000000000000000100000000
4022Move to the FutureOther2.02minTrue2019.0NaNNaNThe animated short tells the heartwarming stor…[‘Signal.MD’]0NaN182813.39411.00000100010100000000000000010000
4055Aikatsu on Parade! (2020)Web4.012minTrue2020.0NaNSpringNaN[‘BN Pictures’]0NaN5112333.38817.00000000000100000000000000000000
4061Touhou Niji Sousaku Doujin Anime: Musou KakyouOVA4.023minTrue2008.0NaNNaNNaN[]0NaN10741045913.387934.00001000000000000000100000000000
4066GudetamaTV951.01minTrue2014.0NaNSpringGudetama, an egg that is dead to the world and…[‘Gathering’]0NaN193411653.385129.01000000010100000000000001000000
4125Tenchi Muyo! Ryo-Ohki 5OVA2.0NaNTrue2020.0NaNNaNNaN[‘AIC’]0NaN7623233.37129.01001000110000000000000100000000
4351Komatta Jii-sanTV10.01minTrue2020.0NaNSpringAn old man who pulls stereotypical ikemen (han…[‘Kachidoki Studio’]0NaN8913383.33731.01100000000100000000000000000000
4588Princess Connect! Re: DiveTV10.0NaNTrue2020.0NaNSpringIn the beautiful land of Astraea where a gentl…[‘CygamesPictures’]0NaN180413861623.274594.01011000001000000000000000000000
4797PlundererTV22.0NaNTrue2020.0NaNWinterEvery human inhabiting the world of Alcia is b…[‘Geek Toys’]1NaN531239608713.2282507.00111010001000000000000100000000
4884Zoids Wild ZeroTV31.0NaNTrue2019.0NaNFallNaN[‘OLM’]0NaN5235183.21117.00010100000000000000001000000000
52556HP – Six Hearts PrincessTV Special7.0NaNTrue2016.0NaNNaNHaruka Hani is a second-year junior high schoo…[‘Poncotan’]0NaN50584143.13227.00011000100000000000100000000000
5261Extra Olympia KyklosTV4.05minTrue2020.0NaNSpringDemetrios was a young man in Ancient Greece wh…[‘Gosay Studio’]0NaN184171233.13470.01100000000100001010000000001100
5438Dorohedoro: Ma no OmakeDVD Special1.05minTrue2020.0NaNNaNNaN[‘Mappa’]0NaN4327713.09413.01101000000000000001000001000000
5529Sore Ike! AnpanmanTV1527.0NaNTrue1988.0NaNNaNOne night, a Star of Life falls down the chimn…[‘TMS Entertainment’]0NaN3649213.07632.01001001010000000000000001100000
5636Bonobono (2016)TV212.06minTrue2016.0NaNSpringBonobono, a young sea otter, bonds with Chipmu…[‘Eiken’]0NaN6983373.05346.01100000010100000100000000000000
5824Woodpecker Detective’s OfficeTV9.0NaNTrue2020.0NaNSpringIt is the end of the Meiji Era. The genius poe…[‘LIDEN FILMS’]1NaN530917933.014171.00000000000000000000000000000100
6374Hana KappaTV390.010minTrue2010.0NaNSpringThe life of a Kappa is never dull, especially …[‘XEBEC’, ‘OLM’, ‘Group TAC’]0NaN4152112.89915.00001001000100000000000000000000
6422Symphogear XV SpecialsDVD Special3.07minTrue2019.0NaNNaNNaN[]0NaN182812.89016.01000000000000001000000000000000
6726Super ShiroTV32.06minTrue2019.0NaNFallThe Nohara family dog Shiro becomes a superher…[‘Science SARU’]0NaN4484132.82524.01000000010100000100000000000000
6999Shironeko Project: Zero ChronicleTV10.0NaNTrue2020.0NaNSpringThere are two kingdoms in this world – the Kin…[‘Project No. 9’]1NaN107810671292.783386.00011000000000000000100000000000
7143Sazae-sanTV2527.0NaNTrue1969.0NaNNaNSazae Fuguta, married to Masuo and mother of T…[‘Eiken’]0NaN233216762.737139.01100000000000001000000000000000
7175Super Dragon Ball HeroesWeb22.09minTrue2018.0NaNSummerNaN[‘Toei Animation’]0NaN206112333432.7311193.00010010000100000000000000110000
7403Zenonzard The AnimationWeb4.015minTrue2020.0NaNWinterNaN[‘8-Bit’]0NaN142292152.68256.00000100000000000000000000000000
7617Puzzle & DragonsTV104.0NaNTrue2018.0NaNSpringTaiga Akaishi is a passionate boy who is aimin…[‘Studio Pierrot’]0NaN3513692.63613.00001000000000000000000000000000
7876Chokotto Anime Kemono Friends 3Web17.01minTrue2019.0NaNNaNNaN[]0NaN286542.57913.00000000010100000000010001000000
7898Jakusansei Million ArthurWeb83.02minTrue2015.0NaNFallNaN[‘Gathering’]0NaN5019372.57433.01101000000100000000000000000000
8103ListenersTV11.0NaNTrue2020.0NaNSpringIn a world where the entire idea of music vani…[‘Mappa’]0NaN7679342622.527316.00000100100000000000001000000000
8347TAMAYOMI: The Baseball GirlsTV11.0NaNTrue2020.0NaNSpringEven with her “miracle ball,” junior high stud…[‘Studio A-Cat’]0NaN4996531412.450194.00100000000000010010000000001000
8350SD Gundam World: Sangoku SouketsudenWeb7.0NaNTrue2019.0NaNSummerA mysterious virus called the “Yellow Zombie V…[‘Sunrise’]0NaN60117152.47427.00010100010000000000011000000000
8610Yu-Gi-Oh! SevensTV6.0NaNTrue2020.0NaNSpringIn the future in the town of Gouha, Yuga Ohdo …[‘Bridge’]0NaN125222202.41454.00000000000000010000000000000000
9283Gal & DinoTV7.0NaNTrue2020.0NaNSpringAfter a night of drinking, Kaede wakes up real…[‘Kamikaze Douga’]0NaN2844931612.228158.01100000000000001010000000000000
9305Asatir: Mirai no MukashibanashiTV11.0NaNTrue2020.0NaNSpringIn 2050 Riyadh, Asma transports her grandchild…[‘Toei Animation’]1NaN67152382.24934.00001000100000000000000000000000
9315OBSOLETEWeb6.012minTrue2019.0NaNFallIn 2014, aliens suddenly appear on earth and p…[‘Buemon’]0NaN118454282.24695.00010100100100000000011000000000
9336Donbei x KemurikusaWeb1.01minTrue2019.0NaNNaNNaN[]0NaN172622.24112.00000000000000000000010000010000
9367MewkledreamyTV7.0NaNTrue2020.0NaNSpringA middle school girl named Yume sees something…[‘J.C. Staff’]0NaN78200162.19930.00001001000000000000000000000000
9429Deluxe Da yo! KaishainWeb19.02minTrue2019.0NaNSummerThe new web series follows Kamoyama, a human, …[‘DLE’]0NaN222952.21912.01000000010100000100000001000000
9495Bungo and Alchemist: Gears of JudgementTV7.0NaNTrue2020.0NaNSpringThe series follows a group of historic writers…[‘OLM’]1NaN65010532332.177246.00001000000000000001000000000000
9698Chicken Ramen x GudetamaWeb3.01minTrue2019.0NaNNaNNaN[]0NaN163002.14810.01000000010100000100000001010000
9798ShadowverseTV10.0NaNTrue2020.0NaNSpringWhile attending Tensei Academy, Hiiro Ryugasak…[‘Zexcs’]0NaN222232692.12299.00000000000000000000000000000001
9884Wacky TV Na Na Na: Chase the Kraken Monster!TV9.03minTrue2020.0NaNSpringThird season of Wacky TV Na Na Na.[‘Studio Crocodile’]0NaN334822.10210.01000000111100000000000001000000
10099Sakura Wars the AnimationTV11.0NaNTrue2020.0NaNSpringSet in 1940, it’s been 10 years since the grea…[‘SANZIGEN’]0NaN2525941252.044123.00010100000000000000010000000100
10210YeastkenWeb5.01minTrue2018.0NaNNaNNaN[]0NaN304132.01311.00000001010100000100000000000000
10345Shachibato! President, It’s Time for Battle!TV10.0NaNTrue2020.0NaNSpringIn the developed city of Gatepia close to the …[‘C2C’]0NaN5535121721.979239.00011000000000000000000000000000
10693RebirthTV16.04minTrue2020.0NaNWinterNaN[‘LIDEN FILMS Osaka Studio’]0NaN61111201.86830.01000000000100000000000000000000
10972The House Spirit Tatami-chanWeb10.04minTrue2020.0NaNSpringTatami-chan is a sardonic ghost from Iwate Pre…[‘Zero-G’]0NaN162178591.76182.01000000110000000001000000000000
11346Yodel no OnnaWeb4.01minTrue2017.0NaNNaNNaN[‘DLE’]0NaN2926131.64532.01000000000100000000000000000000
11377Hatachi no Ryouma with Kurofune-kun!Web5.02minTrue2018.0NaNNaNNaN[‘DLE’]0NaN141031.63214.00000000100100000000000000010000
11630GJ8 ManWeb40.06minTrue2016.0NaNFallJoe Gorou lives a carefree life in the small h…[]0NaN214161.51812.01000000000100000000000000100000
12093Knyacki!TV48.05minTrue1995.0NaNSpringNaN[]0NaN102032.56210.01000001000100000000000000000000
12094Da Li Si Ri ZhiWeb7.017minTrue2020.0NaNNaNNaN[]0NaN196423.65610.00001000010000000100000001000000
12099Xing Chen Bian 2nd SeasonWeb3.024minTrue2020.0NaNNaNSecond season of Xing Chen Bian.[]0NaN312203.94110.00010000000000000000010000000000

In [18]:

# checking the summary of the data with missing values in finishYr
df[df.finishYr.isnull()].describe(include="all").T

Out[18]:

                              count  unique                                              top  freq         mean          std     min     25%     50%     75%      max
title                           115     115                        Kaguya-sama: Love Is War?     1          NaN          NaN     NaN     NaN     NaN     NaN      NaN
mediaType                       115       6                                               TV    64          NaN          NaN     NaN     NaN     NaN     NaN      NaN
eps                           115.0     NaN                                              NaN   NaN   136.521739   408.981219     1.0     4.5    10.0    22.0   2527.0
duration                         50      18                                             1min     8          NaN          NaN     NaN     NaN     NaN     NaN      NaN
ongoing                         115       1                                             True   115          NaN          NaN     NaN     NaN     NaN     NaN      NaN
startYr                       115.0     NaN                                              NaN   NaN  2016.521739     8.053928  1969.0  2018.0  2020.0  2020.0   2020.0
finishYr                        0.0     NaN                                              NaN   NaN          NaN          NaN     NaN     NaN     NaN     NaN      NaN
sznOfRelease                     75       4                                           Spring    50          NaN          NaN     NaN     NaN     NaN     NaN      NaN
description                      79      79  The battle between love and pride continues! N…     1          NaN          NaN     NaN     NaN     NaN     NaN      NaN
studios                         115      66                                               []    23          NaN          NaN     NaN     NaN     NaN     NaN      NaN
contentWarn                   115.0     NaN                                              NaN   NaN     0.095652     0.295401     0.0     0.0     0.0     0.0      1.0
watched                         0.0     NaN                                              NaN   NaN          NaN          NaN     NaN     NaN     NaN     NaN      NaN
watching                      115.0     NaN                                              NaN   NaN  2101.513043  7943.927373    10.0    50.0   142.0   909.0  74537.0
wantWatch                     115.0     NaN                                              NaN   NaN  1164.765217  2317.386384    10.0   104.0   324.0  1060.0  16987.0
dropped                       115.0     NaN                                              NaN   NaN   285.730435  1316.734606     0.0     5.0    20.0    84.0  12445.0
rating                        115.0     NaN                                              NaN   NaN     3.390748     0.801144   1.518    2.76   3.515  4.0315    4.617
votes                         115.0     NaN                                              NaN   NaN  1266.808696  6019.774866    10.0    20.0    63.0   355.5  59737.0
tag_'Comedy'                  115.0     NaN                                              NaN   NaN     0.356522     0.481068     0.0     0.0     0.0     1.0      1.0
tag_'Based on a Manga'        115.0     NaN                                              NaN   NaN     0.330435     0.472428     0.0     0.0     0.0     1.0      1.0
tag_'Action'                  115.0     NaN                                              NaN   NaN     0.295652     0.458332     0.0     0.0     0.0     1.0      1.0
tag_'Fantasy'                 115.0     NaN                                              NaN   NaN     0.330435     0.472428     0.0     0.0     0.0     1.0      1.0
tag_'Sci Fi'                  115.0     NaN                                              NaN   NaN     0.121739     0.328415     0.0     0.0     0.0     0.0      1.0
tag_'Shounen'                 115.0     NaN                                              NaN   NaN     0.130435     0.338255     0.0     0.0     0.0     0.0      1.0
tag_'Family Friendly'         115.0     NaN                                              NaN   NaN     0.104348     0.307049     0.0     0.0     0.0     0.0      1.0
tag_'Original Work'           115.0     NaN                                              NaN   NaN     0.095652     0.295401     0.0     0.0     0.0     0.0      1.0
tag_'Non-Human Protagonists'  115.0     NaN                                              NaN   NaN     0.147826     0.356481     0.0     0.0     0.0     0.0      1.0
tag_'Adventure'               115.0     NaN                                              NaN   NaN     0.113043     0.318032     0.0     0.0     0.0     0.0      1.0
tag_'Short Episodes'          115.0     NaN                                              NaN   NaN     0.278261     0.450104     0.0     0.0     0.0     1.0      1.0
tag_'Drama'                   115.0     NaN                                              NaN   NaN      0.13913     0.347597     0.0     0.0     0.0     0.0      1.0
tag_'Shorts'                  115.0     NaN                                              NaN   NaN          0.0          0.0     0.0     0.0     0.0     0.0      0.0
tag_'Romance'                 115.0     NaN                                              NaN   NaN     0.052174     0.223351     0.0     0.0     0.0     0.0      1.0
tag_'School Life'             115.0     NaN                                              NaN   NaN     0.052174     0.223351     0.0     0.0     0.0     0.0      1.0
tag_'Slice of Life'           115.0     NaN                                              NaN   NaN     0.104348     0.307049     0.0     0.0     0.0     0.0      1.0
tag_'Animal Protagonists'     115.0     NaN                                              NaN   NaN      0.06087     0.240137     0.0     0.0     0.0     0.0      1.0
tag_'Seinen'                  115.0     NaN                                              NaN   NaN     0.104348     0.307049     0.0     0.0     0.0     0.0      1.0
tag_'Supernatural'            115.0     NaN                                              NaN   NaN     0.052174     0.223351     0.0     0.0     0.0     0.0      1.0
tag_'Magic'                   115.0     NaN                                              NaN   NaN     0.095652     0.295401     0.0     0.0     0.0     0.0      1.0
tag_'CG Animation'            115.0     NaN                                              NaN   NaN     0.113043     0.318032     0.0     0.0     0.0     0.0      1.0
tag_'Mecha'                   115.0     NaN                                              NaN   NaN     0.052174     0.223351     0.0     0.0     0.0     0.0      1.0
tag_'Ecchi'                   115.0     NaN                                              NaN   NaN     0.069565     0.255526     0.0     0.0     0.0     0.0      1.0
tag_'Based on a Light Novel'  115.0     NaN                                              NaN   NaN     0.052174     0.223351     0.0     0.0     0.0     0.0      1.0
tag_'Anthropomorphic'         115.0     NaN                                              NaN   NaN     0.069565     0.255526     0.0     0.0     0.0     0.0      1.0
tag_'Superpowers'             115.0     NaN                                              NaN   NaN     0.052174     0.223351     0.0     0.0     0.0     0.0      1.0
tag_'Promotional'             115.0     NaN                                              NaN   NaN     0.069565     0.255526     0.0     0.0     0.0     0.0      1.0
tag_'Sports'                  115.0     NaN                                              NaN   NaN     0.043478     0.204824     0.0     0.0     0.0     0.0      1.0
tag_'Historical'              115.0     NaN                                              NaN   NaN     0.052174     0.223351     0.0     0.0     0.0     0.0      1.0
tag_'Vocaloid'                115.0     NaN                                              NaN   NaN          0.0          0.0     0.0     0.0     0.0     0.0      0.0
tag_Others                    115.0     NaN                                              NaN   NaN     0.034783     0.184031     0.0     0.0     0.0     0.0      1.0
  • At least 75% of the entries with a missing finish year started in or after 2018 (the 25th percentile of startYr is 2018), and all 115 of them are flagged as ongoing.
  • So, we will assume that the anime with missing values in finishYr are still airing, and fill the values with 2020 (the year the data was collected).
  • You can experiment by dropping the entries where the finish year is missing.
  • The decision to drop these missing values or impute them by a suitable value is subject to domain knowledge, and based on the steps taken to deal with them, the model performance will vary.
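
To make the trade-off concrete, here is a minimal sketch on a made-up five-row frame. The `toy` frame and its values are illustrative assumptions, not taken from the Streamist data. It contrasts imputing a missing finishYr with 2020 against dropping those rows.

```python
import numpy as np
import pandas as pd

# toy data: made-up values, only for illustration
toy = pd.DataFrame(
    {
        "startYr": [2015, 2018, 2020, 1992, 2019],
        "finishYr": [2016, np.nan, np.nan, np.nan, 2019],
    }
)

# option 1: assume the anime is still airing and impute the collection year
imputed = toy.assign(finishYr=toy["finishYr"].fillna(2020))

# option 2: drop every row where the finish year is missing
dropped = toy.dropna(subset=["finishYr"])

print("rows kept when imputing:", imputed.shape[0])
print("rows kept when dropping:", dropped.shape[0])
```

Imputing keeps every row at the cost of an assumption about the finish year, while dropping keeps only rows with a known finish year. Which option works better for the model is something to check empirically.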

In [19]:

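# assigning the result back avoids chained in-place modification,
# which recent pandas versions warn about
df["finishYr"] = df["finishYr"].fillna(2020)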

# checking missing values in rest of the data
df.isnull().sum()

Out[19]:

title                              0
mediaType                         63
eps                                0
duration                        4636
ongoing                            0
startYr                            0
finishYr                           0
sznOfRelease                    8554
description                     4468
studios                            0
contentWarn                        0
watched                          115
watching                           0
wantWatch                          0
dropped                            0
rating                             0
votes                              0
tag_'Comedy'                       0
tag_'Based on a Manga'             0
tag_'Action'                       0
tag_'Fantasy'                      0
tag_'Sci Fi'                       0
tag_'Shounen'                      0
tag_'Family Friendly'              0
tag_'Original Work'                0
tag_'Non-Human Protagonists'       0
tag_'Adventure'                    0
tag_'Short Episodes'               0
tag_'Drama'                        0
tag_'Shorts'                       0
tag_'Romance'                      0
tag_'School Life'                  0
tag_'Slice of Life'                0
tag_'Animal Protagonists'          0
tag_'Seinen'                       0
tag_'Supernatural'                 0
tag_'Magic'                        0
tag_'CG Animation'                 0
tag_'Mecha'                        0
tag_'Ecchi'                        0
tag_'Based on a Light Novel'       0
tag_'Anthropomorphic'              0
tag_'Superpowers'                  0
tag_'Promotional'                  0
tag_'Sports'                       0
tag_'Historical'                   0
tag_'Vocaloid'                     0
tag_Others                         0
dtype: int64
  • The missing values in startYr and finishYr columns have been dealt with.
  • We will now create a new variable years_running, which will be calculated as finishYr minus startYr.
  • We will also drop the startYr and finishYr columns.

In [20]:

df["years_running"] = df["finishYr"] - df["startYr"]
df.drop(["startYr", "finishYr"], axis=1, inplace=True)
df.head()

Out[20]:

titlemediaTypeepsdurationongoingsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Othersyears_running
0Fullmetal Alchemist: BrotherhoodTV64.0NaNFalseSpringThe foundation of alchemy is based on the law …[‘Bones’]1103707.0143512581026564.70286547.001110100010100000000000000000001.0
1your name.Movie1.01hr 47minFalseNaNMitsuha and Taki are two total strangers livin…[‘CoMix Wave Films’]058831.01453217331244.66343960.000000001000101100010000000000000.0
2A Silent VoiceMovie1.02hr 10minFalseNaNAfter transferring into a new school, a deaf g…[‘Kyoto Animation’]145892.0946171481324.66133752.001000100000100100000000000000000.0
3Haikyuu!! Karasuno High School vs Shiratorizaw…TV10.0NaNFalseFallPicking up where the second season ended, the …[‘Production I.G’]025134.0218380821674.66017422.001000100000000100000000000010000.0
4Attack on Titan 3rd Season: Part IITV10.0NaNFalseSpringThe battle to retake Wall Maria begins now! Wi…[‘Wit Studio’]121308.0321778641744.65015789.001110100000000000000000000000000.0

Let’s convert the duration column from string to numeric.

In [21]:

# we define a function to convert the duration column to numeric


def time_to_minutes(var):
    if isinstance(var, str):  # only string values can be parsed
        if "hr" in var:  # the duration has an hours part, e.g. "1hr 47min"
            spl = var.split(" ")  # splitting the value by space
            hr = (
                float(spl[0].replace("hr", "")) * 60
            )  # taking numeric part and converting hours to minutes
            # the minutes part may be absent (e.g. "2hr"), so default to 0
            mt = float(spl[1].replace("min", "")) if len(spl) > 1 else 0.0
            return hr + mt
        else:
            return float(var.replace("min", ""))  # taking numeric part of minutes
    else:
        return np.nan  # will return NaN if value is not string

In [22]:

# let's apply the function to the duration column and overwrite the column
df["duration"] = df["duration"].apply(time_to_minutes)
df.head()

Out[22]:

titlemediaTypeepsdurationongoingsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Othersyears_running
0Fullmetal Alchemist: BrotherhoodTV64.0NaNFalseSpringThe foundation of alchemy is based on the law …[‘Bones’]1103707.0143512581026564.70286547.001110100010100000000000000000001.0
1your name.Movie1.0107.0FalseNaNMitsuha and Taki are two total strangers livin…[‘CoMix Wave Films’]058831.01453217331244.66343960.000000001000101100010000000000000.0
2A Silent VoiceMovie1.0130.0FalseNaNAfter transferring into a new school, a deaf g…[‘Kyoto Animation’]145892.0946171481324.66133752.001000100000100100000000000000000.0
3Haikyuu!! Karasuno High School vs Shiratorizaw…TV10.0NaNFalseFallPicking up where the second season ended, the …[‘Production I.G’]025134.0218380821674.66017422.001000100000000100000000000010000.0
4Attack on Titan 3rd Season: Part IITV10.0NaNFalseSpringThe battle to retake Wall Maria begins now! Wi…[‘Wit Studio’]121308.0321778641744.65015789.001110100000000000000000000000000.0

In [23]:

# let's check the summary of the duration column
df["duration"].describe()

Out[23]:

count    7465.000000
mean       24.230141
std        31.468171
min         1.000000
25%         4.000000
50%         8.000000
75%        30.000000
max       163.000000
Name: duration, dtype: float64
  • 50% of the anime with a recorded duration have a runtime of 8 minutes or less (duration is still missing for 4636 entries).
  • Some anime even have a runtime of 1 minute.
    • This seems strange at first, but a Google search can reveal that there are indeed such anime.
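
As a cross-check on the duration conversion above, the same parsing can be sketched in a vectorized way with a regular expression. This is an alternative under stated assumptions, not the method used in this case study: the sample strings are made up, and every non-missing value is assumed to follow the "Xhr Ymin" style seen in the data.

```python
import numpy as np
import pandas as pd

# sample strings in the same style as the duration column (made-up selection)
dur = pd.Series(["1hr 47min", "24min", "2hr", np.nan, "5min"])

# pull out the optional hours and minutes parts as named groups
parts = dur.str.extract(r"(?:(?P<hr>\d+)hr)?\s*(?:(?P<min>\d+)min)?")
parts = parts.astype(float)

# total minutes: an absent part counts as 0, fully missing rows stay NaN
minutes = parts["hr"].fillna(0) * 60 + parts["min"].fillna(0)
minutes = minutes.where(dur.notna())

print(minutes.tolist())
```

The `where` call is needed because filling absent parts with 0 would otherwise turn a missing duration into 0 minutes.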

We will fill the missing values in the sznOfRelease column with 'is_missing', which will act as a new category.

In [24]:

df["sznOfRelease"] = df["sznOfRelease"].fillna("is_missing")
df.isnull().sum()

Out[24]:

title                              0
mediaType                         63
eps                                0
duration                        4636
ongoing                            0
sznOfRelease                       0
description                     4468
studios                            0
contentWarn                        0
watched                          115
watching                           0
wantWatch                          0
dropped                            0
rating                             0
votes                              0
tag_'Comedy'                       0
tag_'Based on a Manga'             0
tag_'Action'                       0
tag_'Fantasy'                      0
tag_'Sci Fi'                       0
tag_'Shounen'                      0
tag_'Family Friendly'              0
tag_'Original Work'                0
tag_'Non-Human Protagonists'       0
tag_'Adventure'                    0
tag_'Short Episodes'               0
tag_'Drama'                        0
tag_'Shorts'                       0
tag_'Romance'                      0
tag_'School Life'                  0
tag_'Slice of Life'                0
tag_'Animal Protagonists'          0
tag_'Seinen'                       0
tag_'Supernatural'                 0
tag_'Magic'                        0
tag_'CG Animation'                 0
tag_'Mecha'                        0
tag_'Ecchi'                        0
tag_'Based on a Light Novel'       0
tag_'Anthropomorphic'              0
tag_'Superpowers'                  0
tag_'Promotional'                  0
tag_'Sports'                       0
tag_'Historical'                   0
tag_'Vocaloid'                     0
tag_Others                         0
years_running                      0
dtype: int64
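
The effect of the placeholder category can be sketched on a toy column. The values below are made up, and the one-hot step only assumes that the column will be encoded with `pd.get_dummies` before modelling.

```python
import numpy as np
import pandas as pd

# toy season column with gaps (made-up values)
szn = pd.Series(["Spring", np.nan, "Fall", np.nan, "Spring"], name="sznOfRelease")

# treat "missing" as its own level instead of dropping the rows
szn_filled = szn.fillna("is_missing")
print(szn_filled.value_counts())

# after one-hot encoding, the placeholder is just another indicator column
dummies = pd.get_dummies(szn_filled, prefix="szn")
print(dummies.columns.tolist())
```

Keeping the gap as a category lets the model learn whether a missing season is itself informative, rather than guessing a season for those rows.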

Let’s check the number of unique values and the number of times they occur for the mediaType column.

In [25]:

df.mediaType.value_counts()

Out[25]:

TV             3993
Movie          1928
OVA            1770
Music Video    1290
Web            1170
DVD Special     803
Other           580
TV Special      504
Name: mediaType, dtype: int64

We will fill the missing values in the mediaType column with 'Other', as the exact values for that category are not known.

In [26]:

df["mediaType"] = df["mediaType"].fillna("Other")

# checking the number of unique values and the number of times they occur
df.mediaType.value_counts()

Out[26]:

TV             3993
Movie          1928
OVA            1770
Music Video    1290
Web            1170
DVD Special     803
Other           643
TV Special      504
Name: mediaType, dtype: int64
  • We saw that each value in the studios column is a list of studio names.
  • Let us remove the leading and trailing square brackets from the values in the column.
  • We will also replace the entries that are blank lists with NaN.

In [27]:

df["studios"] = df["studios"].str.lstrip("[").str.rstrip("]")
df["studios"] = df["studios"].replace(
    "", np.nan
)  # mark as NaN if the value is a blank string

df.head()

Out[27]:

titlemediaTypeepsdurationongoingsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Othersyears_running
0Fullmetal Alchemist: BrotherhoodTV64.0NaNFalseSpringThe foundation of alchemy is based on the law …‘Bones’1103707.0143512581026564.70286547.001110100010100000000000000000001.0
1your name.Movie1.0107.0Falseis_missingMitsuha and Taki are two total strangers livin…‘CoMix Wave Films’058831.01453217331244.66343960.000000001000101100010000000000000.0
2A Silent VoiceMovie1.0130.0Falseis_missingAfter transferring into a new school, a deaf g…‘Kyoto Animation’145892.0946171481324.66133752.001000100000100100000000000000000.0
3Haikyuu!! Karasuno High School vs Shiratorizaw…TV10.0NaNFalseFallPicking up where the second season ended, the …‘Production I.G’025134.0218380821674.66017422.001000100000000100000000000010000.0
4Attack on Titan 3rd Season: Part IITV10.0NaNFalseSpringThe battle to retake Wall Maria begins now! Wi…‘Wit Studio’121308.0321778641744.65015789.001110100000000000000000000000000.0

In [28]:

# checking missing values in rest of the data
df.isnull().sum()

Out[28]:

title                              0
mediaType                          0
eps                                0
duration                        4636
ongoing                            0
sznOfRelease                       0
description                     4468
studios                         3208
contentWarn                        0
watched                          115
watching                           0
wantWatch                          0
dropped                            0
rating                             0
votes                              0
tag_'Comedy'                       0
tag_'Based on a Manga'             0
tag_'Action'                       0
tag_'Fantasy'                      0
tag_'Sci Fi'                       0
tag_'Shounen'                      0
tag_'Family Friendly'              0
tag_'Original Work'                0
tag_'Non-Human Protagonists'       0
tag_'Adventure'                    0
tag_'Short Episodes'               0
tag_'Drama'                        0
tag_'Shorts'                       0
tag_'Romance'                      0
tag_'School Life'                  0
tag_'Slice of Life'                0
tag_'Animal Protagonists'          0
tag_'Seinen'                       0
tag_'Supernatural'                 0
tag_'Magic'                        0
tag_'CG Animation'                 0
tag_'Mecha'                        0
tag_'Ecchi'                        0
tag_'Based on a Light Novel'       0
tag_'Anthropomorphic'              0
tag_'Superpowers'                  0
tag_'Promotional'                  0
tag_'Sports'                       0
tag_'Historical'                   0
tag_'Vocaloid'                     0
tag_Others                         0
years_running                      0
dtype: int64

Treating the studios column

In [29]:

df.sample(
    10, random_state=2
)  # setting the random_state will ensure we get the same results every time

Out[29]:

titlemediaTypeepsdurationongoingsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Othersyears_running
7002Tales of the Rays: Mirrage PrisonWeb1.01.0Falseis_missingNaNNaN081.038012.76738.000010000000000000000000000100000.0
11871Onikiri ShoujoWeb1.01.0Falseis_missingNaNNaN044.002511.35925.000100000000010000000000000000000.0
7492Triage XTV10.0NaNFalseSpringMochizuki General Hospital boasts some of the …‘XEBEC’14129.087128677882.6653485.001100100000000000000001000000000.0
3852Rainbow Days OVAOVA1.0NaNFalseis_missingNaN‘Ashi Productions’0580.02498783.432291.001000000000001100000000000000000.0
4506Endro~!TV12.0NaNFalseWinterIn the land of Naral Island, a land of magic a…‘Studio Gokumi’01033.037212052543.290976.000010001000000010001000000000000.0
9863Heybot!TV50.0NaNFalseSummerThe story takes place on Screw Island, a screw…‘BN Pictures’033.01462362.10754.010000011100000000000000000000001.0
3513UnicoMovie1.090.0Falseis_missingUnico is a special unicorn with the ability to…‘MADHOUSE’0748.07371173.508459.000010010110000001000000000000000.0
10605Ali Baba to 40-hiki no TouzokuMovie1.056.0Falseis_missingGenerations ago, the wily Ali Baba stole a cav…‘Toei Animation’0305.0799121.897147.000000010010000000000000000000000.0
10270rerulili: Girls TalkMusic Video1.04.0Falseis_missingNaNNaN018.00501.99513.000000000000000000000000000000010.0
2942Majestic Prince Movie: Kakusei no IdenshikoMovie1.0NaNFalseis_missingNaN‘Seven Arcs Pictures’, ‘Orange’0261.01142383.634168.001101000000100000100010000000000.0
  • We can see that row 2942 has more than one studio, which indicates a collaboration between studios.
  • We will split the studios column by ', ' and take all the values in one dataframe for further analysis.

In [30]:

studio_df = pd.DataFrame(
    df.studios.str.split(", ", expand=True).values.flatten(), columns=["Studios"]
)
val_c = studio_df.Studios.value_counts()
val_c

Out[30]:

'Toei Animation'       636
'Sunrise'              433
'J.C. Staff'           341
'MADHOUSE'             339
'TMS Entertainment'    319
                      ... 
'Studio Giants'          1
'BigFireBird'            1
'Pandanium'              1
'MMT Technology'         1
'Office Nobu'            1
Name: Studios, Length: 488, dtype: int64
  • There are too many studios in the data, and adding them all as separate columns will make our data dimension very large.
  • We will use a threshold, and keep only those studios with at least as many entries as the threshold.

In [31]:

# we take 100 as threshold
threshold = 100
val_c[val_c.values >= threshold]

Out[31]:

'Toei Animation'          636
'Sunrise'                 433
'J.C. Staff'              341
'MADHOUSE'                339
'TMS Entertainment'       319
'Production I.G'          279
'Studio Deen'             266
'Studio Pierrot'          223
'OLM'                     216
'A-1 Pictures'            194
'AIC'                     167
'Shin-Ei Animation'       165
'Tatsunoko Production'    146
'Nippon Animation'        145
'XEBEC'                   143
'DLE'                     134
'GONZO'                   132
'Bones'                   122
'Shaft'                   119
'Kyoto Animation'         108
Name: Studios, dtype: int64
  • 100 looks to be a good threshold.
  • We will keep only those studios that have created at least 100 anime, and the rest we will assign as 'Others'.
  • You can experiment by using a different threshold.
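
One way to run that experiment is to tabulate, for a few candidate thresholds, how many studios survive and what share of all studio credits they cover. The sketch below uses made-up counts (the hypothetical `counts` series stands in for val_c), so the numbers say nothing about the real data.

```python
import pandas as pd

# made-up studio counts standing in for val_c (not the real data)
counts = pd.Series(
    {
        "Studio A": 636,
        "Studio B": 341,
        "Studio C": 122,
        "Studio D": 95,
        "Studio E": 40,
        "Studio F": 1,
    }
)

summary = {}
for threshold in (50, 100, 300):
    kept = counts[counts >= threshold]
    # share of all studio credits covered by the kept studios
    coverage = kept.sum() / counts.sum()
    summary[threshold] = (len(kept), round(coverage, 3))

for threshold, (n_kept, coverage) in summary.items():
    print(threshold, n_kept, coverage)
```

A lower threshold adds more indicator columns for little extra coverage, while a higher one pushes more anime into 'Others'.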

In [32]:

# list of studios
studios_list = val_c[val_c.values >= threshold].index.tolist()
print("Studio names taken into consideration:", len(studios_list), studios_list)
Studio names taken into consideration: 20 ["'Toei Animation'", "'Sunrise'", "'J.C. Staff'", "'MADHOUSE'", "'TMS Entertainment'", "'Production I.G'", "'Studio Deen'", "'Studio Pierrot'", "'OLM'", "'A-1 Pictures'", "'AIC'", "'Shin-Ei Animation'", "'Tatsunoko Production'", "'Nippon Animation'", "'XEBEC'", "'DLE'", "'GONZO'", "'Bones'", "'Shaft'", "'Kyoto Animation'"]

In [33]:

# let us create a copy of our dataframe
df1 = df.copy()

In [34]:

# first we will fill missing values in the column with 'Others'
df1["studios"] = df1["studios"].fillna("'Others'")
df1.studios.isnull().sum()

Out[34]:

0
  • We will now assign the studio names to the entries.
  • We will also create a new variable that will show if collaboration between studios was involved for creating an anime.
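
Both steps can be sketched on toy data with list operations. This is a simplified variant under stated assumptions, not the code used in this case study: the rows are made up, `kept` is a hypothetical stand-in for studios_list, and a row crediting several considered studios gets the first one listed in that row.

```python
import pandas as pd

# toy inputs: made-up rows in the same quoted, comma-separated style
studios = pd.Series(
    ["'Bones'", "'Seven Arcs Pictures', 'Orange'", "'Others'", "'Orange', 'Bones'"]
)
kept = ["'Bones'"]  # hypothetical stand-in for studios_list

# split each entry into a list of studio names
split = studios.str.split(", ")

# first listed studio that is in the kept list, otherwise 'Others'
primary = split.apply(lambda names: next((n for n in names if n in kept), "'Others'"))
primary = primary.str.strip("'")

# 1 when more than one studio is credited, else 0
colab = (split.str.len() > 1).astype(int)

print(primary.tolist())
print(colab.tolist())
```

Comparing whole names after the split, rather than searching for substrings, avoids matching one studio name inside another.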

In [35]:

studio_val = []

for i in range(df1.shape[0]):  # iterate over all rows in data
    txt = df1.studios.values[i]  # getting the values in studios column
    flag = 0  # flag variable
    for item in studios_list:  # iterate over the list of studios considered
        if item in txt and flag == 0:  # checking if studio name is in the row
            studio_val.append(item)
            flag = 1
    if flag == 0:  # no studio from the considered list appears in this row
        studio_val.append("'Others'")

# we will strip the leading and trailing ', and assign the values to a column
df1["studio_primary"] = [item.strip("'") for item in studio_val]
df1.tail()

Out[35]:

titlemediaTypeepsdurationongoingsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Othersyears_runningstudio_primary
12096Sore Ike! Anpanman: Kirameke! Ice no Kuni no V…Movie1.0NaNFalseis_missingPrincess Vanilla is a princess in a land of ic…‘TMS Entertainment’022.012912.80710.000000010100000000000000010000000.0TMS Entertainment
12097Hulaing Babies PetitTV12.05.0FalseWinterNaN‘Fukushima Gaina’013.0107722.09010.010000001001000000000000000000000.0Others
12098Marco & The Galaxy DragonOVA1.0NaNFalseis_missingNaN‘Others’017.006502.54310.010100000000000000000000000000000.0Others
12099Xing Chen Bian 2nd SeasonWeb3.024.0Trueis_missingSecond season of Xing Chen Bian.‘Others’0NaN312203.94110.000100000000000000000100000000000.0Others
12100Ultra B: Black Hole kara no Dokusaisha BB!!Movie1.020.0Falseis_missingNaN‘Shin-Ei Animation’015.011912.92510.011001000000000000000000001000000.0Shin-Ei Animation

In [36]:

# we will create a list defining whether there is a collaboration between studios
# we will check if the second split has None values, which will mean no collaboration between studios
studio_val2 = [
    0 if item is None else 1
    for item in df1.studios.str.split(", ", expand=True).iloc[:, 1]
]

df1["studios_colab"] = studio_val2
df1.tail()

Out[36]:

titlemediaTypeepsdurationongoingsznOfReleasedescriptionstudioscontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Othersyears_runningstudio_primarystudios_colab
12096Sore Ike! Anpanman: Kirameke! Ice no Kuni no V…Movie1.0NaNFalseis_missingPrincess Vanilla is a princess in a land of ic…‘TMS Entertainment’022.012912.80710.000000010100000000000000010000000.0TMS Entertainment0
12097Hulaing Babies PetitTV12.05.0FalseWinterNaN‘Fukushima Gaina’013.0107722.09010.010000001001000000000000000000000.0Others0
12098Marco & The Galaxy DragonOVA1.0NaNFalseis_missingNaN‘Others’017.006502.54310.010100000000000000000000000000000.0Others0
12099Xing Chen Bian 2nd SeasonWeb3.024.0Trueis_missingSecond season of Xing Chen Bian.‘Others’0NaN312203.94110.000100000000000000000100000000000.0Others0
12100Ultra B: Black Hole kara no Dokusaisha BB!!Movie1.020.0Falseis_missingNaN‘Shin-Ei Animation’015.011912.92510.011001000000000000000000001000000.0Shin-Ei Animation0
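
The two derived columns can be illustrated on a small toy frame. The sketch below is not the notebook's code: the rows in `toy`, the two-item shortlist, and the helper `first_listed_studio` are made up for illustration. It assumes, as the loop above does, that the first match in `studios_list` order wins.

```python
import pandas as pd

# Hypothetical shortlist and rows; the notebook uses a list of 20 studio names
studios_list = ["'Bones'", "'MADHOUSE'"]
toy = pd.DataFrame(
    {"studios": ["'Bones'", "'MADHOUSE', 'Bones'", "'Fukushima Gaina'", None]}
)

# Missing entries become the catch-all category, as in the notebook
toy["studios"] = toy["studios"].fillna("'Others'")


def first_listed_studio(txt, candidates):
    """Return the first candidate (in list order) contained in txt, else 'Others'."""
    for name in candidates:
        if name in txt:
            return name.strip("'")
    return "Others"


toy["studio_primary"] = toy["studios"].apply(
    lambda txt: first_listed_studio(txt, studios_list)
)

# More than one studio in a row shows up as a ", " separator
toy["studios_colab"] = toy["studios"].str.contains(", ", regex=False).astype(int)

print(toy)
```

The second toy row is labeled Bones even though MADHOUSE appears first in the row, because the match follows the order of the shortlist rather than the order within the row.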

We will now drop the studios column.

In [37]:

df1.drop("studios", axis=1, inplace=True)

# let's check the data once
df1.head()

Out[37]:

titlemediaTypeepsdurationongoingsznOfReleasedescriptioncontentWarnwatchedwatchingwantWatchdroppedratingvotestag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Othersyears_runningstudio_primarystudios_colab
0Fullmetal Alchemist: BrotherhoodTV64.0NaNFalseSpringThe foundation of alchemy is based on the law …1103707.0143512581026564.70286547.001110100010100000000000000000001.0Bones0
1your name.Movie1.0107.0Falseis_missingMitsuha and Taki are two total strangers livin…058831.01453217331244.66343960.000000001000101100010000000000000.0Others0
2A Silent VoiceMovie1.0130.0Falseis_missingAfter transferring into a new school, a deaf g…145892.0946171481324.66133752.001000100000100100000000000000000.0Kyoto Animation0
3Haikyuu!! Karasuno High School vs Shiratorizaw…TV10.0NaNFalseFallPicking up where the second season ended, the …025134.0218380821674.66017422.001000100000000100000000000010000.0Production I.G0
4Attack on Titan 3rd Season: Part IITV10.0NaNFalseSpringThe battle to retake Wall Maria begins now! Wi…121308.0321778641744.65015789.001110100000000000000000000000000.0Others0

We have now preprocessed the columns that held lists of values, so the data matches the clean dataset with which we started the previous session.

The only change is that the 'is_missing' category in the studio_primary column has been replaced by 'Others'.

Next, we will impute the missing values in the data.

In [38]:

# checking missing values in rest of the data
df1.isnull().sum()

Out[38]:

title                              0
mediaType                          0
eps                                0
duration                        4636
ongoing                            0
sznOfRelease                       0
description                     4468
contentWarn                        0
watched                          115
watching                           0
wantWatch                          0
dropped                            0
rating                             0
votes                              0
tag_'Comedy'                       0
tag_'Based on a Manga'             0
tag_'Action'                       0
tag_'Fantasy'                      0
tag_'Sci Fi'                       0
tag_'Shounen'                      0
tag_'Family Friendly'              0
tag_'Original Work'                0
tag_'Non-Human Protagonists'       0
tag_'Adventure'                    0
tag_'Short Episodes'               0
tag_'Drama'                        0
tag_'Shorts'                       0
tag_'Romance'                      0
tag_'School Life'                  0
tag_'Slice of Life'                0
tag_'Animal Protagonists'          0
tag_'Seinen'                       0
tag_'Supernatural'                 0
tag_'Magic'                        0
tag_'CG Animation'                 0
tag_'Mecha'                        0
tag_'Ecchi'                        0
tag_'Based on a Light Novel'       0
tag_'Anthropomorphic'              0
tag_'Superpowers'                  0
tag_'Promotional'                  0
tag_'Sports'                       0
tag_'Historical'                   0
tag_'Vocaloid'                     0
tag_Others                         0
years_running                      0
studio_primary                     0
studios_colab                      0
dtype: int64

We will fill the missing values in the duration and watched columns with the median of each group defined by studio_primary and mediaType.

In [39]:

df2 = df1.copy()

df2[["duration", "watched"]] = df2.groupby(["studio_primary", "mediaType"])[
    ["duration", "watched"]
].transform(lambda x: x.fillna(x.median()))

# checking for missing values
df2.isnull().sum()

Out[39]:

title                              0
mediaType                          0
eps                                0
duration                         155
ongoing                            0
sznOfRelease                       0
description                     4468
contentWarn                        0
watched                            0
watching                           0
wantWatch                          0
dropped                            0
rating                             0
votes                              0
tag_'Comedy'                       0
tag_'Based on a Manga'             0
tag_'Action'                       0
tag_'Fantasy'                      0
tag_'Sci Fi'                       0
tag_'Shounen'                      0
tag_'Family Friendly'              0
tag_'Original Work'                0
tag_'Non-Human Protagonists'       0
tag_'Adventure'                    0
tag_'Short Episodes'               0
tag_'Drama'                        0
tag_'Shorts'                       0
tag_'Romance'                      0
tag_'School Life'                  0
tag_'Slice of Life'                0
tag_'Animal Protagonists'          0
tag_'Seinen'                       0
tag_'Supernatural'                 0
tag_'Magic'                        0
tag_'CG Animation'                 0
tag_'Mecha'                        0
tag_'Ecchi'                        0
tag_'Based on a Light Novel'       0
tag_'Anthropomorphic'              0
tag_'Superpowers'                  0
tag_'Promotional'                  0
tag_'Sports'                       0
tag_'Historical'                   0
tag_'Vocaloid'                     0
tag_Others                         0
years_running                      0
studio_primary                     0
studios_colab                      0
dtype: int64

We will fill the remaining missing values in the duration column with the column median.

In [40]:

# assign the result back instead of using inplace=True on a selected column
df2["duration"] = df2["duration"].fillna(df2["duration"].median())
df2.isnull().sum()

Out[40]:

title                              0
mediaType                          0
eps                                0
duration                           0
ongoing                            0
sznOfRelease                       0
description                     4468
contentWarn                        0
watched                            0
watching                           0
wantWatch                          0
dropped                            0
rating                             0
votes                              0
tag_'Comedy'                       0
tag_'Based on a Manga'             0
tag_'Action'                       0
tag_'Fantasy'                      0
tag_'Sci Fi'                       0
tag_'Shounen'                      0
tag_'Family Friendly'              0
tag_'Original Work'                0
tag_'Non-Human Protagonists'       0
tag_'Adventure'                    0
tag_'Short Episodes'               0
tag_'Drama'                        0
tag_'Shorts'                       0
tag_'Romance'                      0
tag_'School Life'                  0
tag_'Slice of Life'                0
tag_'Animal Protagonists'          0
tag_'Seinen'                       0
tag_'Supernatural'                 0
tag_'Magic'                        0
tag_'CG Animation'                 0
tag_'Mecha'                        0
tag_'Ecchi'                        0
tag_'Based on a Light Novel'       0
tag_'Anthropomorphic'              0
tag_'Superpowers'                  0
tag_'Promotional'                  0
tag_'Sports'                       0
tag_'Historical'                   0
tag_'Vocaloid'                     0
tag_Others                         0
years_running                      0
studio_primary                     0
studios_colab                      0
dtype: int64
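
A grouped median fill can only leave a value missing when every value in its group is missing, which is why a second, global fill was needed for duration. The toy frame below is a minimal sketch of the two-stage imputation; the rows are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the DLE/Web group has no observed duration at all
toy = pd.DataFrame(
    {
        "studio_primary": ["Bones", "Bones", "Bones", "DLE", "DLE"],
        "mediaType": ["TV", "TV", "TV", "Web", "Web"],
        "duration": [24.0, np.nan, 26.0, np.nan, np.nan],
    }
)

# Stage 1: fill with the median of each (studio_primary, mediaType) group
group_filled = toy.groupby(["studio_primary", "mediaType"])["duration"].transform(
    lambda s: s.fillna(s.median())
)
print(group_filled.isna().sum())  # values still missing after the grouped fill

# Stage 2: fall back to the overall median for groups without any observed value
fully_filled = group_filled.fillna(group_filled.median())
print(fully_filled.tolist())
```

Here the Bones/TV gap is filled from its own group, while both DLE/Web rows survive stage 1 as missing and only receive a value from the overall median.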

We will now drop the description and title columns.

In [41]:

df2.drop(["description", "title"], axis=1, inplace=True)

# let's check the summary of our data
df2.describe(include="all").T

Out[41]:

                                count unique        top  freq        mean         std    min    25%    50%     75%       max
mediaType                       12101      8         TV  3993         NaN         NaN    NaN    NaN    NaN     NaN       NaN
eps                           12101.0    NaN        NaN   NaN   13.393356   57.925097    1.0    1.0    2.0    12.0    2527.0
duration                      12101.0    NaN        NaN   NaN   20.025287   27.130296    1.0    5.0    7.0    25.0     163.0
ongoing                         12101      2      False 11986         NaN         NaN    NaN    NaN    NaN     NaN       NaN
sznOfRelease                    12101      5 is_missing  8554         NaN         NaN    NaN    NaN    NaN     NaN       NaN
contentWarn                   12101.0    NaN        NaN   NaN    0.115362    0.319472    0.0    0.0    0.0     0.0       1.0
watched                       12101.0    NaN        NaN   NaN 2861.241302 7724.622443    0.0   55.0  342.0  2026.0  161567.0
watching                      12101.0    NaN        NaN   NaN  256.334435 1380.840902    0.0    2.0   14.0   100.0   74537.0
wantWatch                     12101.0    NaN        NaN   NaN 1203.681431  2294.32738    0.0   49.0  296.0  1275.0   28541.0
dropped                       12101.0    NaN        NaN   NaN  151.568383   493.93171    0.0    3.0   12.0    65.0   19481.0
rating                        12101.0    NaN        NaN   NaN    2.949037    0.827385  0.844  2.304  2.965   3.616     4.702
votes                         12101.0    NaN        NaN   NaN   2088.1247 5950.332228   10.0   34.0  219.0  1414.0  131067.0
tag_'Comedy'                  12101.0    NaN        NaN   NaN     0.27287    0.445453    0.0    0.0    0.0     1.0       1.0
tag_'Based on a Manga'        12101.0    NaN        NaN   NaN    0.290802    0.454151    0.0    0.0    0.0     1.0       1.0
tag_'Action'                  12101.0    NaN        NaN   NaN    0.231221    0.421631    0.0    0.0    0.0     0.0       1.0
tag_'Fantasy'                 12101.0    NaN        NaN   NaN    0.181555    0.385493    0.0    0.0    0.0     0.0       1.0
tag_'Sci Fi'                  12101.0    NaN        NaN   NaN    0.166267    0.372336    0.0    0.0    0.0     0.0       1.0
tag_'Shounen'                 12101.0    NaN        NaN   NaN    0.144864    0.351978    0.0    0.0    0.0     0.0       1.0
tag_'Family Friendly'         12101.0    NaN        NaN   NaN    0.097017    0.295993    0.0    0.0    0.0     0.0       1.0
tag_'Original Work'           12101.0    NaN        NaN   NaN    0.135195    0.341946    0.0    0.0    0.0     0.0       1.0
tag_'Non-Human Protagonists'  12101.0    NaN        NaN   NaN     0.11247    0.315957    0.0    0.0    0.0     0.0       1.0
tag_'Adventure'               12101.0    NaN        NaN   NaN    0.103793    0.305005    0.0    0.0    0.0     0.0       1.0
tag_'Short Episodes'          12101.0    NaN        NaN   NaN    0.096934     0.29588    0.0    0.0    0.0     0.0       1.0
tag_'Drama'                   12101.0    NaN        NaN   NaN    0.106107    0.307987    0.0    0.0    0.0     0.0       1.0
tag_'Shorts'                  12101.0    NaN        NaN   NaN    0.089662    0.285709    0.0    0.0    0.0     0.0       1.0
tag_'Romance'                 12101.0    NaN        NaN   NaN    0.092141    0.289237    0.0    0.0    0.0     0.0       1.0
tag_'School Life'             12101.0    NaN        NaN   NaN    0.092306     0.28947    0.0    0.0    0.0     0.0       1.0
tag_'Slice of Life'           12101.0    NaN        NaN   NaN     0.08082    0.272569    0.0    0.0    0.0     0.0       1.0
tag_'Animal Protagonists'     12101.0    NaN        NaN   NaN    0.060326    0.238099    0.0    0.0    0.0     0.0       1.0
tag_'Seinen'                  12101.0    NaN        NaN   NaN    0.077101    0.266763    0.0    0.0    0.0     0.0       1.0
tag_'Supernatural'            12101.0    NaN        NaN   NaN    0.070903    0.256674    0.0    0.0    0.0     0.0       1.0
tag_'Magic'                   12101.0    NaN        NaN   NaN    0.064292    0.245283    0.0    0.0    0.0     0.0       1.0
tag_'CG Animation'            12101.0    NaN        NaN   NaN    0.050079    0.218116    0.0    0.0    0.0     0.0       1.0
tag_'Mecha'                   12101.0    NaN        NaN   NaN    0.054541    0.227091    0.0    0.0    0.0     0.0       1.0
tag_'Ecchi'                   12101.0    NaN        NaN   NaN    0.057433    0.232678    0.0    0.0    0.0     0.0       1.0
tag_'Based on a Light Novel'  12101.0    NaN        NaN   NaN    0.053384    0.224807    0.0    0.0    0.0     0.0       1.0
tag_'Anthropomorphic'         12101.0    NaN        NaN   NaN    0.037848    0.190837    0.0    0.0    0.0     0.0       1.0
tag_'Superpowers'             12101.0    NaN        NaN   NaN    0.044624    0.206486    0.0    0.0    0.0     0.0       1.0
tag_'Promotional'             12101.0    NaN        NaN   NaN    0.036361    0.187194    0.0    0.0    0.0     0.0       1.0
tag_'Sports'                  12101.0    NaN        NaN   NaN    0.038013    0.191236    0.0    0.0    0.0     0.0       1.0
tag_'Historical'              12101.0    NaN        NaN   NaN    0.033303    0.179434    0.0    0.0    0.0     0.0       1.0
tag_'Vocaloid'                12101.0    NaN        NaN   NaN    0.039336      0.1944    0.0    0.0    0.0     0.0       1.0
tag_Others                    12101.0    NaN        NaN   NaN    0.074457    0.262523    0.0    0.0    0.0     0.0       1.0
years_running                 12101.0    NaN        NaN   NaN      0.2832    1.152234    0.0    0.0    0.0     0.0      51.0
studio_primary                  12101     21     Others  7548         NaN         NaN    NaN    NaN    NaN     NaN       NaN
studios_colab                 12101.0    NaN        NaN   NaN    0.051649    0.221326    0.0    0.0    0.0     0.0       1.0

Note: The next section of the notebook has been covered several times in previous case studies. For this session, it can be skipped, and we can refer directly to the summary of data cleaning steps and observations from EDA.

Let’s visualize the data

Univariate Analysis

In [42]:

# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; showmeans=True adds a marker at the mean value of the column
    # histogram: pass bins only when the caller supplied a value
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

rating

In [43]:

histogram_boxplot(df2, "rating")
  • The anime ratings are close to normally distributed, with a mean rating of ~2.95.

eps

In [44]:

histogram_boxplot(df2, "eps", bins=100)
  • The distribution is heavily right-skewed, as there are many anime movies in the data, and they are considered to be of only one episode (as per data description).

duration

In [45]:

histogram_boxplot(df2, "duration")
  • The distribution is right-skewed with a median runtime of less than 10 minutes.

watched

In [46]:

histogram_boxplot(df2, "watched", bins=50)
  • The distribution is heavily right-skewed, with most of the anime having fewer than 500 viewers.

watching

In [47]:

histogram_boxplot(df2, "watching", bins=50)
  • The distribution is heavily right-skewed.

wantWatch

In [48]:

histogram_boxplot(df2, "wantWatch", bins=50)
  • The distribution is heavily right-skewed.

dropped

In [49]:

histogram_boxplot(df2, "dropped", bins=50)
  • The distribution is heavily right-skewed.

votes

In [50]:

histogram_boxplot(df2, "votes", bins=50)
  • The distribution is heavily right-skewed, and few shows have more than 10000 votes.

years_running

In [51]:

histogram_boxplot(df2, "years_running")
  • The distribution is heavily right-skewed, and most of the anime have run for less than 1 year.

In [52]:

# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

mediaType

In [53]:

labeled_barplot(df2, "mediaType", perc=True)
  • One-third of the anime in the data are published for TV.
  • Movies and web series account for another 25% of the anime in the data.

ongoing

In [54]:

labeled_barplot(df2, "ongoing", perc=True)
  • Very few (1%) anime in the data are ongoing.

sznOfRelease

In [55]:

labeled_barplot(df2, "sznOfRelease", perc=True)
  • More anime are released in spring and fall compared to summer and winter.

studio_primary

In [56]:

labeled_barplot(df2, "studio_primary", perc=True)
  • Toei Animation is the most common studio among the available studio names.

studios_colab

In [57]:

labeled_barplot(df2, "studios_colab", perc=True)
  • Nearly 95% of the anime in the data do not involve a collaboration between studios.

contentWarn

In [58]:

labeled_barplot(df2, "contentWarn", perc=True)
  • Nearly 90% of the anime in the data do not have an associated content warning.

In [59]:

# creating a list of tag columns
tag_cols = [item for item in df2.columns if "tag" in item]

In [60]:

# checking the values in tag columns
for column in tag_cols:
    print(df2[column].value_counts())
    print("-" * 50)
0    8799
1    3302
Name: tag_'Comedy', dtype: int64
--------------------------------------------------
0    8582
1    3519
Name: tag_'Based on a Manga', dtype: int64
--------------------------------------------------
0    9303
1    2798
Name: tag_'Action', dtype: int64
--------------------------------------------------
0    9904
1    2197
Name: tag_'Fantasy', dtype: int64
--------------------------------------------------
0    10089
1     2012
Name: tag_'Sci Fi', dtype: int64
--------------------------------------------------
0    10348
1     1753
Name: tag_'Shounen', dtype: int64
--------------------------------------------------
0    10927
1     1174
Name: tag_'Family Friendly', dtype: int64
--------------------------------------------------
0    10465
1     1636
Name: tag_'Original Work', dtype: int64
--------------------------------------------------
0    10740
1     1361
Name: tag_'Non-Human Protagonists', dtype: int64
--------------------------------------------------
0    10845
1     1256
Name: tag_'Adventure', dtype: int64
--------------------------------------------------
0    10928
1     1173
Name: tag_'Short Episodes', dtype: int64
--------------------------------------------------
0    10817
1     1284
Name: tag_'Drama', dtype: int64
--------------------------------------------------
0    11016
1     1085
Name: tag_'Shorts', dtype: int64
--------------------------------------------------
0    10986
1     1115
Name: tag_'Romance', dtype: int64
--------------------------------------------------
0    10984
1     1117
Name: tag_'School Life', dtype: int64
--------------------------------------------------
0    11123
1      978
Name: tag_'Slice of Life', dtype: int64
--------------------------------------------------
0    11371
1      730
Name: tag_'Animal Protagonists', dtype: int64
--------------------------------------------------
0    11168
1      933
Name: tag_'Seinen', dtype: int64
--------------------------------------------------
0    11243
1      858
Name: tag_'Supernatural', dtype: int64
--------------------------------------------------
0    11323
1      778
Name: tag_'Magic', dtype: int64
--------------------------------------------------
0    11495
1      606
Name: tag_'CG Animation', dtype: int64
--------------------------------------------------
0    11441
1      660
Name: tag_'Mecha', dtype: int64
--------------------------------------------------
0    11406
1      695
Name: tag_'Ecchi', dtype: int64
--------------------------------------------------
0    11455
1      646
Name: tag_'Based on a Light Novel', dtype: int64
--------------------------------------------------
0    11643
1      458
Name: tag_'Anthropomorphic', dtype: int64
--------------------------------------------------
0    11561
1      540
Name: tag_'Superpowers', dtype: int64
--------------------------------------------------
0    11661
1      440
Name: tag_'Promotional', dtype: int64
--------------------------------------------------
0    11641
1      460
Name: tag_'Sports', dtype: int64
--------------------------------------------------
0    11698
1      403
Name: tag_'Historical', dtype: int64
--------------------------------------------------
0    11625
1      476
Name: tag_'Vocaloid', dtype: int64
--------------------------------------------------
0    11200
1      901
Name: tag_Others, dtype: int64
--------------------------------------------------
  • There are 3519 anime that are based on manga.
  • There are 3302 anime of the Comedy genre.
  • There are 1115 anime of the Romance genre.
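
Because every tag column holds only 0 or 1, the number of anime carrying a tag is simply the column sum, which gives a more compact view than printing value_counts for each column. A minimal sketch on made-up rows (the values in `toy` are hypothetical, not from the dataset):

```python
import pandas as pd

# Hypothetical toy data with two 0/1 tag columns and one non-tag column
toy = pd.DataFrame(
    {
        "tag_'Comedy'": [1, 0, 1, 1],
        "tag_'Romance'": [0, 0, 1, 0],
        "rating": [3.1, 2.0, 4.2, 3.3],
    }
)

# Select the indicator columns by their prefix
tag_cols = [col for col in toy.columns if col.startswith("tag_")]

# Summing a 0/1 column counts the rows where the tag is present
tag_counts = toy[tag_cols].sum().sort_values(ascending=False)
print(tag_counts)
```

Applied to the full data, the same sum should reproduce the counts of 1s listed above, sorted from the most to the least frequent tag.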

Bivariate analysis

We will not consider the tag columns for correlation check as they have only 0 or 1 values.

In [61]:

# creating a list of non-tag columns
corr_cols = [item for item in df2.columns if "tag" not in item]
print(corr_cols)
['mediaType', 'eps', 'duration', 'ongoing', 'sznOfRelease', 'contentWarn', 'watched', 'watching', 'wantWatch', 'dropped', 'rating', 'votes', 'years_running', 'studio_primary', 'studios_colab']

In [62]:

plt.figure(figsize=(12, 7))
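sns.heatmap(
    # numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
    df2[corr_cols].corr(numeric_only=True),
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap="Spectral",
)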
plt.show()
  • watched and wantWatch columns are highly correlated.
  • watched and votes columns are very highly correlated.
  • wantWatch and votes columns are highly correlated.
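
Reading strongly correlated pairs off a heatmap gets harder as the number of columns grows, so it can help to list them programmatically. The helper below is a sketch, not part of the original notebook: `high_corr_pairs` and the 0.7 threshold are assumptions chosen for illustration, and the toy columns are synthetic.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.7):
    """List column pairs whose absolute Pearson correlation is at least threshold."""
    corr = frame.select_dtypes(include="number").corr()
    pairs = {}
    for a, b in combinations(corr.columns, 2):  # each unordered pair once
        r = corr.loc[a, b]
        if abs(r) >= threshold:
            pairs[(a, b)] = r
    return pd.Series(pairs, dtype=float).sort_values(ascending=False)


# Synthetic data: votes is built from watched, duration is independent of both
rng = np.random.default_rng(0)
watched = rng.lognormal(mean=5, sigma=1, size=200)
toy = pd.DataFrame(
    {
        "watched": watched,
        "votes": 0.7 * watched + rng.normal(0, 5, size=200),
        "duration": rng.uniform(1, 100, size=200),
    }
)

print(high_corr_pairs(toy))
```

Only the watched/votes pair is built to be correlated here; on the real data the threshold would need to be chosen to match what "highly correlated" is meant to cover.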

Let’s check the variation in rating with some of the categorical columns in our data

mediaType vs rating

In [63]:

plt.figure(figsize=(10, 5))
sns.boxplot(x="mediaType", y="rating", data=df2)
plt.show()
  • Anime available as web series or music videos have a lower rating in general.

sznOfRelease vs rating

In [64]:

plt.figure(figsize=(10, 5))
sns.boxplot(x="sznOfRelease", y="rating", data=df2)
plt.show()
  • Anime ratings have more or less similar distribution across all the seasons of release.

studio_primary vs rating

In [65]:

plt.figure(figsize=(15, 5))
sns.boxplot(x="studio_primary", y="rating", data=df2)
plt.xticks(rotation=90)
plt.show()
  • In general, the ratings are low for anime created by DLE studios.
  • Ratings are also low, in general, for anime created by studios other than the ones listed in the data.

Summary of EDA

Data Description:

  • The target variable (rating) is of float type.
  • Columns like title, description, mediaType, studios, etc. are of object type.
  • ongoing column is of bool type.
  • All other columns are numeric in nature.
  • There are no duplicate values in the data.
  • There are a lot of missing values in the data.

Data Cleaning:

  • The title and description columns are dropped for modeling as they are highly textual in nature.
  • The duration column was converted from string to numeric by applying the time_to_minutes function.
  • The studios column was processed to convert the list of values into a suitable format for analysis and modeling.
  • The missing values in the data are treated as follows:
    • Missing values in the target variable rating were dropped.
    • Missing values in startYr were dropped.
    • Missing values in finishYr were imputed with 2020.
    • Missing values in sznOfRelease were imputed with a new category ‘is_missing’.
    • Missing values in mediaType were imputed with a new category ‘Other’.
    • Missing values in duration and watched columns were imputed by the median values grouped by studio_primary and mediaType. The remaining missing values in these columns, if any, were imputed by column medians over the entire data.
  • The startYr and finishYr columns were combined to create a new feature years_running. The original columns were then dropped.

Observations from EDA:

  • rating: The anime ratings are close to normally distributed, with a mean rating of ~2.95. The rating increases with an increase in the number of people who have watched or want to watch the anime.
  • eps: The distribution is heavily right-skewed, as a large share of the anime (at least 25%, since the 25th percentile of eps equals the minimum of 1) have only one episode; movies are counted as a single episode as per the data description. The number of episodes increases as the anime runs for more years.
  • duration: The distribution is right-skewed with a median anime runtime of less than 10 minutes.
  • years_running: The distribution is heavily right-skewed, and at least 75% of the anime have run for less than 1 year.
  • watched: The distribution is heavily right-skewed, and most of the anime have fewer than 500 viewers. This attribute is highly correlated with the wantWatch and votes attributes.
  • watching: The distribution is heavily right-skewed and highly correlated with the dropped attribute.
  • wantWatch: The distribution is heavily right-skewed with a median value of 296 potential watchers.
  • dropped: The distribution is heavily right-skewed with a drop of 152 viewers on average.
  • votes: The distribution is heavily right-skewed, and few shows have more than 10000 votes.
  • mediaType: 33% of the anime are published for TV, 11% as music videos, and 10% as web series. Anime available as web series or music videos have a lower rating in general.
  • ongoing: 1% of the anime in the data are ongoing.
  • sznOfRelease: The season of release is missing for more than 70% of the anime in the data, and more anime are released in spring and fall compared to summer and winter. Anime ratings have a similar distribution across all the seasons of release.
  • studio_primary: More than 60% of the anime in the data are produced by studios not listed in the data. Toei Animation is the most common studio among the available studio names. In general, the ratings are low for anime produced by DLE studios and studios other than the ones listed in the data.
  • studios_colab: Around 95% of the anime in the data do not involve collaboration between studios.
  • contentWarn: Nearly 90% of the anime in the data do not have an associated content warning.
  • tag_<tag/genre>: There are 3519 anime that are based on manga, 3302 of the Comedy genre, 2798 of the Action genre, 1115 anime of the Romance genre, and more.

Variable Transformations

Let us check the numeric columns other than the tag columns for skewness

In [66]:

# creating a list of non-tag columns
dist_cols = [
    item for item in df2.select_dtypes(include=np.number).columns if "tag" not in item
]

# let's plot a histogram of all non-tag columns

plt.figure(figsize=(15, 45))

for i in range(len(dist_cols)):
    plt.subplot(12, 3, i + 1)
    plt.hist(df2[dist_cols[i]], bins=50)
    # sns.histplot(data=df2, x=dist_cols[i], kde=True)  # you can comment the previous line and run this one to get distribution curves
    plt.tight_layout()
    plt.title(dist_cols[i], fontsize=25)

plt.show()
  • We see that most of the columns have a very skewed distribution.
  • We will apply the log transformation to all but the contentWarn, studios_colab, and rating columns to deal with skewness in the data.

In [67]:

# creating a copy of the dataframe
df3 = df2.copy()

# removing contentWarn and studios_colab columns as they have only 0 and 1 values
dist_cols.remove("contentWarn")
dist_cols.remove("studios_colab")

# also excluding rating from the list to transform, as it is the target and is almost normally distributed
dist_cols.remove("rating")

In [68]:

# using log transforms on some columns

for col in dist_cols:
    df3[col + "_log"] = np.log(df3[col] + 1)

# dropping the original columns
df3.drop(dist_cols, axis=1, inplace=True)
df3.head()

Out[68]:

mediaTypeongoingsznOfReleasecontentWarnratingtag_’Comedy’tag_’Based on a Manga’tag_’Action’tag_’Fantasy’tag_’Sci Fi’tag_’Shounen’tag_’Family Friendly’tag_’Original Work’tag_’Non-Human Protagonists’tag_’Adventure’tag_’Short Episodes’tag_’Drama’tag_’Shorts’tag_’Romance’tag_’School Life’tag_’Slice of Life’tag_’Animal Protagonists’tag_’Seinen’tag_’Supernatural’tag_’Magic’tag_’CG Animation’tag_’Mecha’tag_’Ecchi’tag_’Based on a Light Novel’tag_’Anthropomorphic’tag_’Superpowers’tag_’Promotional’tag_’Sports’tag_’Historical’tag_’Vocaloid’tag_Othersstudio_primarystudios_colabeps_logduration_logwatched_logwatching_logwantWatch_logdropped_logvotes_logyears_running_log
0TVFalseSpring14.7020111010001010000000000000000000Bones04.1743871.38629411.5493359.57164510.1585567.88495311.3684540.693147
1MovieFalseis_missing04.6630000000100010110001000000000000Others00.6931474.68213110.9824417.2820749.9866334.82831410.6910580.000000
2MovieFalseis_missing14.6610100010000010010000000000000000Kyoto Animation00.6931474.87519710.7340686.8532999.7496954.89034910.4268250.000000
3TVFalseFall04.6600100010000000010000000000001000Production I.G02.3978952.56494910.1320177.6889138.9975185.1239649.7655460.000000
4TVFalseSpring14.6500111010000000000000000000000000Others02.3978951.7917599.9668858.0765158.9701785.1647869.6671320.000000
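
The `+ 1` inside the logarithm matters because several of these columns have a minimum of 0, and log(0) is undefined. NumPy's `np.log1p` computes the same log(1 + x). The sketch below checks that equivalence and the first row of the output above (eps = 64 maps to eps_log ≈ 4.174387). The short `counts` series is illustrative: it reuses the five-number summary of dropped from the describe table rather than the full column.

```python
import numpy as np
import pandas as pd

# Illustrative values only: min, quartiles and max of `dropped` from the summary table
counts = pd.Series([0.0, 3.0, 12.0, 65.0, 19481.0], name="dropped")

# log1p(x) is log(1 + x); zeros map to 0 instead of -inf
logged = np.log1p(counts)
assert np.allclose(logged, np.log(counts + 1))

# The transform compresses the long right tail, which lowers the sample skewness
print("skew before:", counts.skew())
print("skew after: ", logged.skew())

# First row of the notebook output: eps = 64 -> eps_log
print(round(float(np.log1p(64.0)), 6))
```

The transformed values remain right-skewed in this toy series, only less so, which matches the observation that the columns are still skewed after the transformation.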

Let’s check for skewness after applying the log transformation.

In [69]:

# creating a list of non-tag columns
dist_cols = [
    item for item in df3.select_dtypes(include=np.number).columns if "tag" not in item
]

# let's plot histogram of all non-tag columns

plt.figure(figsize=(15, 45))

for i in range(len(dist_cols)):
    plt.subplot(12, 3, i + 1)
    plt.hist(df3[dist_cols[i]], bins=50)
    # sns.histplot(data=df3, x=dist_cols[i], kde=True)  # you can comment the previous line and run this one to get distribution curves
    plt.tight_layout()
    plt.title(dist_cols[i], fontsize=25)

plt.show()
  • The columns are still skewed, but not as heavily as before.

Let’s check for correlations between the columns (other than the tag columns)

In [70]:

plt.figure(figsize=(12, 7))
sns.heatmap(
    df3[dist_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
  • There are still a few highly correlated columns.

Model Building

Define independent and dependent variables

In [71]:

X = df3.drop(["rating"], axis=1)
y = df3["rating"]

Creating dummy variables

In [72]:

X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)

X.head()

Out[72]:

   ongoing  contentWarn  tag_’Comedy’  ...  studio_primary_Toei Animation  studio_primary_XEBEC
0    False            1             0  ...                              0                     0
1    False            0             0  ...                              0                     0
2    False            1             0  ...                              0                     0
3    False            0             0  ...                              0                     0
4    False            1             0  ...                              0                     0

[5 rows x 73 columns]
New dummy columns: mediaType_ (Movie, Music Video, OVA, Other, TV, TV Special, Web), sznOfRelease_ (Spring, Summer, Winter, is_missing), and 20 studio_primary_ columns (AIC through XEBEC).
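
To see what `drop_first=True` does, here is a small sketch on a toy frame. The values are made up; only the column names `sznOfRelease` and `eps_log` are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame(
    {
        "sznOfRelease": ["Fall", "Spring", "Summer", "Winter", "Fall"],
        "eps_log": [2.4, 0.7, 1.1, 3.0, 2.2],
    }
)

encoded = pd.get_dummies(toy, columns=["sznOfRelease"], drop_first=True)
dummy_cols = [col for col in encoded.columns if col.startswith("sznOfRelease_")]
print(encoded.columns.tolist())
```

The first category in sorted order (`Fall`) gets no column of its own, so a `Fall` row is the one where all the season dummies are 0. This matches the output above, where there is no `sznOfRelease_Fall` column. Dropping one level per variable avoids perfectly collinear dummy columns in a linear model with an intercept.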

Split the data into train and test

In [73]:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [74]:

print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 8470
Number of rows in test data = 3631
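
The row counts follow from `test_size=0.3` applied to 8470 + 3631 = 12101 rows. The sketch below reproduces only the sizes, using placeholder arrays in place of the real `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8470 + 3631  # total rows implied by the printed train and test sizes
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_demo = np.zeros(n_rows)  # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)
print(len(X_tr), len(X_te))
```

Because 0.3 × 12101 is not a whole number, the test share is rounded up to 3631 rows and the remaining 8470 rows form the train set.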

Fitting a linear model

In [75]:

from sklearn.linear_model import LinearRegression

lin_reg_model = LinearRegression()
lin_reg_model.fit(x_train, y_train)

Out[75]:

LinearRegression()

Model performance check

  • We will be using metric functions defined in sklearn for RMSE, MAE, and R².
  • We will define functions to calculate adjusted R² and MAPE.
    • The mean absolute percentage error (MAPE) measures the accuracy of predictions as a percentage. It is calculated by taking the absolute difference between each actual and predicted value, dividing it by the actual value, and averaging the result over all records. It works best if there are no extreme values in the data and none of the actual values are 0.
  • We will create a function that will print out all the above metrics in one go.
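
As a quick worked example of the MAPE definition, the sketch below uses three made-up ratings and predictions (none of these numbers come from the dataset). It also shows why actual values close to 0 are a problem.

```python
import numpy as np

actual = np.array([4.0, 2.0, 5.0])  # hypothetical ratings
predicted = np.array([3.6, 2.5, 5.0])  # hypothetical predictions

# absolute percentage error per record: |actual - predicted| / actual
ape = np.abs(actual - predicted) / actual  # roughly 0.10, 0.25, 0.00
mape = ape.mean() * 100
print(round(mape, 2))

# a single actual value close to 0 dominates the average
actual_small = np.array([4.0, 2.0, 0.1])
predicted_small = np.array([3.6, 2.5, 0.6])
mape_small = (np.abs(actual_small - predicted_small) / actual_small).mean() * 100
print(round(mape_small, 2))
```

In the second case the absolute error on the last record is the same 0.5 as on the second record, yet it contributes 500% instead of 25%.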

In [76]:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

In [77]:

# Checking model performance on train set
print("Training Performance\n")
lin_reg_model_train_perf = model_performance_regression(lin_reg_model, x_train, y_train)
lin_reg_model_train_perf
Training Performance

Out[77]:

       RMSE       MAE  R-squared  Adj. R-squared       MAPE
0  0.456805  0.357049   0.694278        0.691619  14.200784

In [78]:

# Checking model performance on test set
print("Test Performance\n")
lin_reg_model_test_perf = model_performance_regression(lin_reg_model, x_test, y_test)
lin_reg_model_test_perf
Test Performance

Out[78]:

       RMSE       MAE  R-squared  Adj. R-squared       MAPE
0  0.476857  0.371374   0.669603        0.662823  14.780452

Observations

  • The train and test R² are 0.69 and 0.67, indicating that the model explains 69% and 67% of the total variation in the train and test sets respectively. Also, both scores are comparable.
  • RMSE values on the train and test sets are also comparable.
  • This shows that the model is not overfitting.
  • MAE indicates that our current model is able to predict anime ratings within a mean error of 0.37 on the test set.
  • MAPE of 14.78 on the test data means that, on average, our predictions are off by about 15% of the actual anime rating.
  • The overall performance is much better than the model we built in the previous session.
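
The adjusted R² values in the tables above can be reproduced from the reported R² with the same formula as `adj_r2_score`. This is a small arithmetic check; it assumes k = 73 predictors, which is a count of the columns in the `X.head()` output rather than a number stated in the text.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


k = 73  # assumption: number of predictor columns counted from the X.head() header
train_adj = adjusted_r2(0.694278, n=8470, k=k)
test_adj = adjusted_r2(0.669603, n=3631, k=k)
print(round(train_adj, 4), round(test_adj, 4))
```

The penalty is larger on the test set because the same 73 predictors are spread over fewer rows, which is why the gap between R² and adjusted R² is wider there.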

Conclusions

  • We have been able to build a predictive model that can be used by Streamist to predict the rating of an anime with an R² of 0.694 (adjusted R² of 0.692) on the training set.
  • Streamist can use this model to predict the anime ratings within a mean error of 0.37 on the test set.
  • We improved our linear regression model performance by applying non-linear transformations to some of the attributes.
  • Streamist can also explore non-linear models, which might be able to better identify the patterns in the data to predict the anime ratings with even higher accuracy.
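
As a rough illustration of the last point, the sketch below compares a linear regression with one possible non-linear model on synthetic data with a deliberately non-linear target. The choice of `RandomForestRegressor` and the data-generating formula are assumptions made for this example; the case study itself does not name a specific non-linear model, and the result says nothing about how such a model would perform on the anime data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the features in a non-linear way
rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(2000, 2))
y_demo = X_demo[:, 0] ** 2 + np.sin(X_demo[:, 1]) + rng.normal(scale=0.2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# score() returns R-squared on the held-out data for both regressors
linear_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
forest_r2 = (
    RandomForestRegressor(n_estimators=100, random_state=1)
    .fit(X_tr, y_tr)
    .score(X_te, y_te)
)
print(f"linear R2: {linear_r2:.2f}, random forest R2: {forest_r2:.2f}")
```

On the real data, the same train and test split and the same `model_performance_regression` function could be reused to make the comparison like for like.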
