Ridesharing is a service that arranges transportation on short notice. It is a highly volatile market whose demand fluctuates with time, place, weather, local events, and so on. The key to success in this business is detecting patterns in these fluctuations and catering to the demand at any given time.
Uber Technologies, Inc. is an American multinational transportation network company based in San Francisco, with operations in over 785 metropolitan areas and over 110 million users worldwide. As a newly hired Data Scientist in Uber's New York office, you have been given the task of extracting actionable insights from data that will help in the growth of the business.
The data contains the details for the Uber rides across various boroughs (subdivisions) of New York City at an hourly level and attributes associated with weather conditions at that time.
In [4]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [ ]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')
In [ ]:
data = pd.read_csv('/content/Uber_Data.csv')
In [8]:
# copying data to another variable to avoid any changes to original data
df = data.copy()
The initial steps to get an overview of any dataset are to look at its first few rows, check its shape and the data types of its columns, and review its statistical summary.
In [ ]:
# looking at head (5 observations)
df.head()
Out[ ]:
pickup_dt | borough | pickups | spd | vsb | temp | dewp | slp | pcp01 | pcp06 | pcp24 | sd | hday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-01-2015 01:00 | Bronx | 152 | 5.0 | 10.0 | 30.0 | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y |
1 | 01-01-2015 01:00 | Brooklyn | 1519 | 5.0 | 10.0 | NaN | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y |
2 | 01-01-2015 01:00 | EWR | 0 | 5.0 | 10.0 | 30.0 | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y |
3 | 01-01-2015 01:00 | Manhattan | 5258 | 5.0 | 10.0 | 30.0 | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y |
4 | 01-01-2015 01:00 | Queens | 405 | 5.0 | 10.0 | 30.0 | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y |
- pickup_dt: the date and time of the pickup
- borough: the name of the New York borough in which the pickup was made
- pickups: the number of pickups in the borough at the given time
- spd to sd: weather-related columns, all numerical in nature
- hday: whether the day of the pickup is a holiday or not (Y: Holiday, N: Not a holiday)

In [ ]:
df.shape
Out[ ]:
(29101, 13)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pickup_dt  29101 non-null  object
 1   borough    26058 non-null  object
 2   pickups    29101 non-null  int64
 3   spd        29101 non-null  float64
 4   vsb        29101 non-null  float64
 5   temp       28742 non-null  float64
 6   dewp       29101 non-null  float64
 7   slp        29101 non-null  float64
 8   pcp01      29101 non-null  float64
 9   pcp06      29101 non-null  float64
 10  pcp24      29101 non-null  float64
 11  sd         29101 non-null  float64
 12  hday       29101 non-null  object
dtypes: float64(9), int64(1), object(3)
memory usage: 2.9+ MB
- borough and temp have 26058 and 28742 non-null observations respectively, indicating that there are some missing values in them
- The pickup_dt column is being read as an 'object' data type, but it should be in date-time format
- The borough and hday columns are of object type, while the rest of the columns are numerical in nature

In [ ]:
df.describe().T
Out[ ]:
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
pickups | 29101.0 | 490.215903 | 995.649536 | 0.0 | 1.0 | 54.0 | 449.000000 | 7883.00 |
spd | 29101.0 | 5.984924 | 3.699007 | 0.0 | 3.0 | 6.0 | 8.000000 | 21.00 |
vsb | 29101.0 | 8.818125 | 2.442897 | 0.0 | 9.1 | 10.0 | 10.000000 | 10.00 |
temp | 28742.0 | 47.900019 | 19.798783 | 2.0 | 32.0 | 46.5 | 65.000000 | 89.00 |
dewp | 29101.0 | 30.823065 | 21.283444 | -16.0 | 14.0 | 30.0 | 50.000000 | 73.00 |
slp | 29101.0 | 1017.817938 | 7.768796 | 991.4 | 1012.5 | 1018.2 | 1022.900000 | 1043.40 |
pcp01 | 29101.0 | 0.003830 | 0.018933 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.28 |
pcp06 | 29101.0 | 0.026129 | 0.093125 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.24 |
pcp24 | 29101.0 | 0.090464 | 0.219402 | 0.0 | 0.0 | 0.0 | 0.050000 | 2.10 |
sd | 29101.0 | 2.529169 | 4.520325 | 0.0 | 0.0 | 0.0 | 2.958333 | 19.00 |
The mean is much greater than the median for the number of pickups (pickups) and snow depth (sd), indicating that there might be outliers to the right in these variables.

Let's check the count of each unique category in each of the categorical/object type variables.
In [ ]:
df['borough'].unique()
Out[ ]:
array(['Bronx', 'Brooklyn', 'EWR', 'Manhattan', 'Queens', 'Staten Island', nan], dtype=object)
In [ ]:
df['hday'].value_counts(normalize=True)
Out[ ]:
N    0.961479
Y    0.038521
Name: hday, dtype: float64
We observed earlier that the pickup_dt column is of object data type. Let us change it to date-time format.
In [ ]:
df['pickup_dt'] = pd.to_datetime(df['pickup_dt'], format="%d-%m-%Y %H:%M")
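As an aside (a general workflow note, not something this dataset needs): if some strings failed to match the given format, passing errors='coerce' to pd.to_datetime() would convert them to NaT instead of raising an error, which makes the bad rows easy to locate with df['pickup_dt'].isnull().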
Let’s check the data types of the columns again to ensure that the change has been executed properly.
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pickup_dt  29101 non-null  datetime64[ns]
 1   borough    26058 non-null  object
 2   pickups    29101 non-null  int64
 3   spd        29101 non-null  float64
 4   vsb        29101 non-null  float64
 5   temp       28742 non-null  float64
 6   dewp       29101 non-null  float64
 7   slp        29101 non-null  float64
 8   pcp01      29101 non-null  float64
 9   pcp06      29101 non-null  float64
 10  pcp24      29101 non-null  float64
 11  sd         29101 non-null  float64
 12  hday       29101 non-null  object
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 2.9+ MB
The pickup_dt column has been successfully changed to date-time format.

Now let's check the range of the time period over which the data has been collected.
In [ ]:
df['pickup_dt'].min() # this will display the date from which data observations have been started
Out[ ]:
Timestamp('2015-01-01 01:00:00')
In [ ]:
df['pickup_dt'].max() # this will display the last date of the dataset
Out[ ]:
Timestamp('2015-06-30 23:00:00')
Since the pickup_dt
column contains the combined information in the form of date, month, year and time of the day, let’s extract each piece of information as a separate entity to get the trend of rides varying over time.
In [ ]:
# Extracting date parts from pickup date
df['start_year'] = df.pickup_dt.dt.year           # extracting the year from the date
df['start_month'] = df.pickup_dt.dt.month_name()  # extracting the month name from the date
df['start_hour'] = df.pickup_dt.dt.hour           # extracting the hour from the time
df['start_day'] = df.pickup_dt.dt.day             # extracting the day of the month from the date
df['week_day'] = df.pickup_dt.dt.day_name()       # extracting the day of the week from the date
Now we can remove the pickup_dt
column from our dataset as it will not be required for further analysis.
In [ ]:
# removing the pickup date column
df.drop('pickup_dt', axis=1, inplace=True)
Let's check the first few rows of the dataset to see if the changes have been applied properly.
In [ ]:
df.head()
Out[ ]:
borough | pickups | spd | vsb | temp | dewp | slp | pcp01 | pcp06 | pcp24 | sd | hday | start_year | start_month | start_hour | start_day | week_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Bronx | 152 | 5.0 | 10.0 | 30.0 | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 1 | 1 | Thursday |
1 | Brooklyn | 1519 | 5.0 | 10.0 | NaN | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 1 | 1 | Thursday |
2 | EWR | 0 | 5.0 | 10.0 | 30.0 | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 1 | 1 | Thursday |
3 | Manhattan | 5258 | 5.0 | 10.0 | 30.0 | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 1 | 1 | Thursday |
4 | Queens | 405 | 5.0 | 10.0 | 30.0 | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 1 | 1 | Thursday |
We can see the changes have been applied to the dataset properly.
Let’s analyze the statistical summary for the new columns added in the dataset.
In [ ]:
df.describe(include='all').T  # include='all' gives the statistical summary for both numerical and categorical variables
Out[ ]:
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
borough | 26058 | 6 | Bronx | 4343 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
pickups | 29101.0 | NaN | NaN | NaN | 490.215903 | 995.649536 | 0.0 | 1.0 | 54.0 | 449.0 | 7883.0 |
spd | 29101.0 | NaN | NaN | NaN | 5.984924 | 3.699007 | 0.0 | 3.0 | 6.0 | 8.0 | 21.0 |
vsb | 29101.0 | NaN | NaN | NaN | 8.818125 | 2.442897 | 0.0 | 9.1 | 10.0 | 10.0 | 10.0 |
temp | 28742.0 | NaN | NaN | NaN | 47.900019 | 19.798783 | 2.0 | 32.0 | 46.5 | 65.0 | 89.0 |
dewp | 29101.0 | NaN | NaN | NaN | 30.823065 | 21.283444 | -16.0 | 14.0 | 30.0 | 50.0 | 73.0 |
slp | 29101.0 | NaN | NaN | NaN | 1017.817938 | 7.768796 | 991.4 | 1012.5 | 1018.2 | 1022.9 | 1043.4 |
pcp01 | 29101.0 | NaN | NaN | NaN | 0.00383 | 0.018933 | 0.0 | 0.0 | 0.0 | 0.0 | 0.28 |
pcp06 | 29101.0 | NaN | NaN | NaN | 0.026129 | 0.093125 | 0.0 | 0.0 | 0.0 | 0.0 | 1.24 |
pcp24 | 29101.0 | NaN | NaN | NaN | 0.090464 | 0.219402 | 0.0 | 0.0 | 0.0 | 0.05 | 2.1 |
sd | 29101.0 | NaN | NaN | NaN | 2.529169 | 4.520325 | 0.0 | 0.0 | 0.0 | 2.958333 | 19.0 |
hday | 29101 | 2 | N | 27980 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
start_year | 29101.0 | NaN | NaN | NaN | 2015.0 | 0.0 | 2015.0 | 2015.0 | 2015.0 | 2015.0 | 2015.0 |
start_month | 29101 | 6 | May | 5058 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
start_hour | 29101.0 | NaN | NaN | NaN | 11.597574 | 6.907042 | 0.0 | 6.0 | 12.0 | 18.0 | 23.0 |
start_day | 29101.0 | NaN | NaN | NaN | 15.62314 | 8.72504 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
week_day | 29101 | 7 | Friday | 4219 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
We have earlier seen that the borough
and temp
columns have missing values in them. So let us see them in detail before moving on to do our EDA.
One of the commonly used methods to deal with missing values is to impute them with a measure of central tendency – the mean, median, or mode of the column (a small sketch follows the list):

- Replacing with the mean: the missing values are imputed with the mean of the column. The mean is impacted by the presence of outliers, so in cases where the column has outliers this method may lead to erroneous imputations.
- Replacing with the median: the missing values are imputed with the median of the column. In cases where the column has outliers, the median is a more appropriate measure of central tendency than the mean.
- Replacing with the mode: the missing values are imputed with the mode of the column. This method is generally preferred for categorical data.
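Below is a minimal, self-contained sketch of the three strategies on a toy pandas Series (the values and names here are illustrative only, not part of this dataset):

In [ ]:

# Toy illustration of the three imputation strategies
import pandas as pd

s = pd.Series([1.0, 2.0, None, 4.0, 100.0])  # one missing value and one outlier (100.0)

s_mean = s.fillna(s.mean())      # mean imputation - pulled upward by the outlier
s_median = s.fillna(s.median())  # median imputation - robust to the outlier
s_mode = s.fillna(s.mode()[0])   # mode imputation - the usual choice for categorical columns

Let's check how many missing values are present in each variable.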
In [ ]:
# checking missing values across each column
df.isnull().sum()
Out[ ]:
borough        3043
pickups           0
spd               0
vsb               0
temp            359
dewp              0
slp               0
pcp01             0
pcp06             0
pcp24             0
sd                0
hday              0
start_year        0
start_month       0
start_hour        0
start_day         0
week_day          0
dtype: int64
borough and temp have 3043 and 359 missing values respectively.

Let us first look at the missing values in the borough column in detail.
In [ ]:
# Checking the missing values further
df.borough.value_counts(normalize=True, dropna=False)
Out[ ]:
Bronx            0.149239
Brooklyn         0.149239
EWR              0.149239
Manhattan        0.149239
Queens           0.149239
Staten Island    0.149239
NaN              0.104567
Name: borough, dtype: float64
We can replace the null values present in the borough column with a new label, Unknown.
In [ ]:
# Replacing NaN with Unknown
df['borough'].fillna('Unknown', inplace=True)
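Note: an equivalent idiom is df['borough'] = df['borough'].fillna('Unknown'), which assigns the result back instead of modifying a column slice in place; newer versions of pandas warn against this chained inplace pattern.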
In [ ]:
df['borough'].unique()
Out[ ]:
array(['Bronx', 'Brooklyn', 'EWR', 'Manhattan', 'Queens', 'Staten Island', 'Unknown'], dtype=object)
It can be observed that the new label Unknown has been added to the borough column.
In [ ]:
df.isnull().sum()
Out[ ]:
borough           0
pickups           0
spd               0
vsb               0
temp            359
dewp              0
slp               0
pcp01             0
pcp06             0
pcp24             0
sd                0
hday              0
start_year        0
start_month       0
start_hour        0
start_day        0
week_day          0
dtype: int64
The missing values in the borough column have been treated. Let us now move on to the temp variable and see how to deal with the missing values present there.

Since temp is a numerical variable, we can impute the missing values with the mean or the median, but before imputation let's analyze the temp variable in detail.

Let us print the rows where the temp variable has missing values.
In [ ]:
df.loc[df['temp'].isnull()==True]
Out[ ]:
borough | pickups | spd | vsb | temp | dewp | slp | pcp01 | pcp06 | pcp24 | sd | hday | start_year | start_month | start_hour | start_day | week_day | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Brooklyn | 1519 | 5.0 | 10.0 | NaN | 7.0 | 1023.5 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 1 | 1 | Thursday |
8 | Brooklyn | 1229 | 3.0 | 10.0 | NaN | 6.0 | 1023.0 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 2 | 1 | Thursday |
15 | Brooklyn | 1601 | 5.0 | 10.0 | NaN | 8.0 | 1022.3 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 3 | 1 | Thursday |
22 | Brooklyn | 1390 | 5.0 | 10.0 | NaN | 9.0 | 1022.0 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 4 | 1 | Thursday |
29 | Brooklyn | 759 | 5.0 | 10.0 | NaN | 9.0 | 1021.8 | 0.0 | 0.0 | 0.0 | 0.0 | Y | 2015 | January | 5 | 1 | Thursday |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2334 | Brooklyn | 594 | 5.0 | 10.0 | NaN | 13.0 | 1016.2 | 0.0 | 0.0 | 0.0 | 0.0 | N | 2015 | January | 19 | 15 | Thursday |
2340 | Brooklyn | 620 | 5.0 | 10.0 | NaN | 13.0 | 1015.5 | 0.0 | 0.0 | 0.0 | 0.0 | N | 2015 | January | 20 | 15 | Thursday |
2347 | Brooklyn | 607 | 3.0 | 10.0 | NaN | 14.0 | 1015.4 | 0.0 | 0.0 | 0.0 | 0.0 | N | 2015 | January | 21 | 15 | Thursday |
2354 | Brooklyn | 648 | 9.0 | 10.0 | NaN | 14.0 | 1015.4 | 0.0 | 0.0 | 0.0 | 0.0 | N | 2015 | January | 22 | 15 | Thursday |
2361 | Brooklyn | 602 | 5.0 | 10.0 | NaN | 16.0 | 1015.4 | 0.0 | 0.0 | 0.0 | 0.0 | N | 2015 | January | 23 | 15 | Thursday |
359 rows × 17 columns
There are 359 observations where the temp variable has missing values. From the overview of the dataset, it seems as if the missing temperature values are from the Brooklyn borough in the month of January.
So let’s confirm our hypothesis by printing the unique boroughs and month names present for these missing values.
In [ ]:
df.loc[df['temp'].isnull()==True,'borough'].value_counts()
Out[ ]:
Brooklyn    359
Name: borough, dtype: int64
In [ ]:
df.loc[df['temp'].isnull()==True,'start_month'].value_counts()
Out[ ]:
January    359
Name: start_month, dtype: int64
The missing values in temp are all from the Brooklyn borough, and all from the month of January.

Let's check on which dates in January the missing values are present.
In [ ]:
df.loc[df['temp'].isnull()==True,'start_day'].unique() # days for which missing values are present
Out[ ]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
In [ ]:
df.loc[df['start_month']=='January', 'start_day'].unique() # unique days in the month of January
Out[ ]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
It can be observed that out of the 31 days in January, the data is missing for the first 15 days.
From the statistical summary, the mean and median values of temperature are close to each other, so we can impute the missing values in the temp column with the mean temperature recorded for the Brooklyn borough.

We will use the fillna() function to impute the missing values.

fillna() – The fillna() function is used to fill NaN values with a provided input value.

Syntax of fillna(): data['column'].fillna(value = x)
In [ ]:
df['temp'] = df['temp'].fillna(value=df.loc[df['borough'] == 'Brooklyn','temp'].mean())
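Note that the cell above uses the mean of all Brooklyn temperatures across the full six-month period. If one preferred the mean restricted to 16–31 January (the dates immediately after the gap), a small variant sketch would be:

In [ ]:

# Hypothetical variant: impute with the mean Brooklyn temperature for 16-31 January only
mask = (df['borough'] == 'Brooklyn') & (df['start_month'] == 'January') & (df['start_day'] >= 16)
df['temp'] = df['temp'].fillna(df.loc[mask, 'temp'].mean())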
In [ ]:
df.isnull().sum()
Out[ ]:
borough        0
pickups        0
spd            0
vsb            0
temp           0
dewp           0
slp            0
pcp01          0
pcp06          0
pcp24          0
sd             0
hday           0
start_year     0
start_month    0
start_hour     0
start_day      0
week_day       0
dtype: int64
Let’s now perform the Exploratory Data Analysis on the dataset
Let us first explore the numerical variables.
Univariate data visualization plots help us comprehend the descriptive summary of the particular data variable. These plots help in understanding the location/position of observations in the data variable, its distribution, and dispersion.
We can check the distribution of observations by plotting Histograms and Boxplots
A histogram takes as input a numeric variable only. The variable is cut into several bins, and the number of observations per bin is represented by the height of the bar
A boxplot gives a summary of one or several numeric variables. The line that divides the box into 2 parts represents the median of the data. The end of the box shows the upper and lower quartiles. The extreme lines show the highest and lowest value excluding outliers.
In [6]:
# Defining the function for creating a combined boxplot and histogram
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,                                      # number of rows of the subplot grid = 2
        sharex=True,                                  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="mediumturquoise"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, color="mediumpurple")
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, color="mediumpurple")  # for the histogram
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")   # add mean to the histogram
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")  # add median to the histogram
In [27]:
histogram_boxplot(df,'pickups')
In [28]:
histogram_boxplot(df,'vsb')
The visibility (vsb) column is left-skewed.

In [29]:
histogram_boxplot(df,'temp')
In [30]:
histogram_boxplot(df,'dewp')
In [31]:
histogram_boxplot(df,'slp')
1 hour liquid precipitation
In [32]:
histogram_boxplot(df,'pcp01')
6 hour liquid precipitation
In [33]:
histogram_boxplot(df,'pcp06')
24 hour liquid precipitation
In [34]:
histogram_boxplot(df,'pcp24')
In [35]:
histogram_boxplot(df,'sd')
Let’s explore the categorical variables now
Bar Charts can be used to explore the distribution of Categorical Variables. Each entity of the categorical variable is represented as a bar. The size of the bar represents its numeric value.
In [ ]:
sns.countplot(data=df,x='hday');
In [ ]:
sns.countplot(data=df, x='borough');
plt.xticks(rotation=90);
Bi means two and variate means variable, so bivariate analysis studies the relationship between two variables.

Several types of bivariate analysis can be done here: correlation heatmaps and scatterplots for pairs of numerical variables, line charts for trends over time, and box plots or bar plots for a categorical variable against a numerical one.
Let us plot bivariate charts between variables to understand their interaction with each other.
A heatmap is a graphical representation of data as a color-encoded matrix. It is a great way of representing the correlation for each pair of columns in the data. The heatmap() function of seaborn helps us create such a plot.
In [ ]:
# Check for correlation among numerical variables
num_var = ['pickups', 'spd', 'vsb', 'temp', 'dewp', 'slp', 'pcp01', 'pcp06', 'pcp24', 'sd']
corr = df[num_var].corr()

# plot the heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
A scatterplot displays the relationship between 2 numeric variables. For each data point, the value of its first variable is represented on the X axis, the second on the Y axis
In [ ]:
sns.pairplot(data=df[num_var], diag_kind="kde")
plt.show()
We can check the trends for time measures by plotting Line charts
A line chart is often used to visualize a trend in data over intervals of time, thus the line is often drawn chronologically.
In [ ]:
cats = df.start_month.unique().tolist()
df.start_month = pd.Categorical(df.start_month, ordered=True, categories=cats)  # order months as they appear in the data (January to June)

plt.figure(figsize=(15, 7))
sns.lineplot(data=df, x="start_month", y="pickups", ci=False, color="red", estimator='sum')
plt.ylabel('Total pickups')
plt.xlabel('Month')
plt.show()
In [ ]:
plt.figure(figsize=(15, 7))
sns.lineplot(data=df, x="start_day", y="pickups", estimator='sum', ci=False, color="red")
plt.ylabel('Total pickups')
plt.xlabel('Day of Month')
plt.show()
Let us drop the observations for the month of Feb and see the trend
In [ ]:
# Let us drop the Feb month and see the trend
df_not_feb = df[df['start_month'] != 'February']

plt.figure(figsize=(15, 7))
sns.lineplot(data=df_not_feb, x="start_day", y="pickups", estimator='sum', ci=False, color="red")
plt.ylabel('Total pickups')
plt.xlabel('Day of Month')
plt.show()
In [ ]:
plt.figure(figsize=(15, 7))
sns.lineplot(data=df, x="start_hour", y="pickups", estimator='sum', ci=False, color="red")
plt.ylabel('Total pickups')
plt.xlabel('Hour of the day')
plt.show()
In [ ]:
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df.week_day = pd.Categorical(df.week_day, ordered=True, categories=cats)

plt.figure(figsize=(15, 7))
sns.lineplot(data=df, x="week_day", y="pickups", ci=False, color="red", estimator='sum')
plt.ylabel('Total pickups')
plt.xlabel('Day of the week')
plt.show()
Let’s check if there is any significant effect of the categorical variables on the number of pickups
In [ ]:
plt.figure(figsize=(15, 7))
sns.boxplot(x=df['borough'], y=df['pickups'])
plt.ylabel('pickups')
plt.xlabel('Borough')
plt.show()
In [ ]:
# Dispersion of pickups in every borough
sns.catplot(x='pickups', col='borough', data=df, col_wrap=4, kind="violin")
plt.show()
In [ ]:
sns.catplot(x='hday', y='pickups', data=df, kind="bar")
plt.show()
In [ ]:
sns.catplot(x='borough', y='pickups', data=df, kind="bar", hue='hday')
plt.xticks(rotation=90)
plt.show()
The bars for EWR, Staten Island, and Unknown are barely visible. Let's check the mean pickups across all the boroughs to verify this.
In [ ]:
# Check if the trend is similar across boroughs
df.groupby(by=['borough', 'hday'])['pickups'].mean()
Out[ ]:
borough        hday
Bronx          N         50.771073
               Y         48.065868
Brooklyn       N        534.727969
               Y        527.011976
EWR            N          0.023467
               Y          0.041916
Manhattan      N       2401.302921
               Y       2035.928144
Queens         N        308.899904
               Y        320.730539
Staten Island  N          1.606082
               Y          1.497006
Unknown        N          2.057456
               Y          2.050420
Name: pickups, dtype: float64
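For an easier side-by-side comparison of holiday vs. non-holiday means, the same aggregation can be pivoted into columns (a small optional sketch):

In [ ]:

# Pivot the hday level into columns: one row per borough, one column per holiday flag
df.groupby(['borough', 'hday'])['pickups'].mean().unstack()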
Since we have seen that borough has a significant effect on the number of pickups, let’s check if that effect is present across different hours of the day.
In [ ]:
plt.figure(figsize=(15, 7))
sns.lineplot(data=df, x="start_hour", y="pickups", hue='borough', estimator='sum', ci=False)
plt.ylabel('Total pickups')
plt.xlabel('Hour of the day')
plt.show()
Let’s visualize all the outliers present in data together
In [ ]:
# outlier detection using boxplots
# selecting the numerical columns of the data and adding their names to a list
numeric_columns = ['pickups', 'spd', 'vsb', 'temp', 'dewp', 'slp', 'pcp01', 'pcp06', 'pcp24', 'sd']

plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)  # using df (not data) so the imputed temp column has no NaNs
    plt.tight_layout()
    plt.title(variable)
plt.show()
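As a complementary numeric check, one can count how many observations fall outside the same 1.5×IQR whiskers the boxplots use (a minimal sketch over the columns defined above):

In [ ]:

# Count observations outside the 1.5*IQR whiskers for each numeric column
for col in numeric_columns:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {n_out} observations outside the whiskers")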
- The pickups column has a wide range of values with lots of outliers. However, we are not going to treat this column, since the number of pickups can legitimately vary over a wide range and we could lose genuine values by treating it.
- From spd to sd, all the columns are related to weather. These variables have some outliers, but all of them appear to be genuine values, so we are not going to treat the outliers present in these columns either.

We analyzed a dataset of nearly 30K hourly Uber pickup records from the New York boroughs. The data spans every day of the first six months of 2015. The main feature of interest is the number of pickups. From both an environmental and a business perspective, having cars roam one area while the demand is in another, or filling the streets with cars during a low-demand period while lacking them during peak hours, is inefficient. Thus we determined the factors that affect pickups and the nature of their effect.
We have been able to conclude that –