Bike Sharing Demand Analysis using Python | Regression | Machine Learning Project Tutorial

Hackers Realm

Updated: May 30, 2023

Bike Sharing Demand Analysis is a regression problem in which we predict the demand for bicycles at a particular time of day with the help of Python. This article focuses on predicting bike rentals and returns in different areas of a city over a future period based on historical data, weather data, and time data. Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city.


Bike Sharing Demand Analysis

In this project tutorial, we will analyze and process the dataset to predict the bike rental demand in a specific time period and under given weather conditions, based on the collected data.


Dataset Information


Bike-sharing systems are a new generation of traditional bike rentals in which the whole process of membership, rental and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental and health issues.


Apart from the interesting real-world applications of bike-sharing systems, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as bus or subway, these systems explicitly record the duration of travel and the departure and arrival positions. This feature turns the bike-sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.


Attribute Information:


Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv.

  • instant: record index

  • dteday : date

  • season : season (1:winter, 2:spring, 3:summer, 4:fall)

  • yr : year (0: 2011, 1:2012)

  • mnth : month ( 1 to 12)

  • hr : hour (0 to 23)

  • holiday : whether the day is a holiday or not (extracted from [Web Link])

  • weekday : day of the week

  • workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.

  • weathersit : weather situation, one of:

      • 1: Clear, Few clouds, Partly cloudy

      • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

      • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

      • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

  • temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)

  • atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)

  • hum: Normalized humidity. The values are divided by 100 (max)

  • windspeed: Normalized wind speed. The values are divided by 67 (max)

  • casual: count of casual users

  • registered: count of registered users

  • cnt: count of total rental bikes including both casual and registered

Here, the output variable is "cnt".
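The normalization formulas in the attribute list can be inverted to recover the raw readings. The snippet below is a minimal sketch of that arithmetic; the helper name denormalize and the sample values are made up for illustration, while the bounds are the ones quoted in the list above.

```python
# Bounds quoted in the attribute list (hourly scale).
TEMP_MIN, TEMP_MAX = -8.0, 39.0      # used for 'temp'
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0   # used for 'atemp'


def denormalize(value, lower, upper):
    """Invert (t - t_min) / (t_max - t_min) to recover the raw reading."""
    return value * (upper - lower) + lower


# Hypothetical normalized readings, chosen only for illustration.
temp_celsius = denormalize(0.5, TEMP_MIN, TEMP_MAX)       # midpoint of the temp range
atemp_celsius = denormalize(0.5, ATEMP_MIN, ATEMP_MAX)    # midpoint of the atemp range
humidity_pct = 0.81 * 100                                 # 'hum' was divided by 100
windspeed_raw = 0.25 * 67                                 # 'windspeed' was divided by 67

print(temp_celsius, atemp_celsius, humidity_pct, windspeed_raw)
```

A normalized temp of 0.5 therefore corresponds to 15.5 degrees Celsius, halfway between -8 and +39.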


Download the Dataset here


Import Modules


Let us import all the basic modules we will be needing for this project.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
pd.options.display.max_columns = 999
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • matplotlib - used for data visualization and graphical plotting

  • seaborn - built on top of matplotlib with similar functionalities

  • %matplotlib inline - renders the plots inline in the notebook

  • warnings - used to control how warnings are displayed

  • filterwarnings('ignore') suppresses the warnings raised by the modules (gives clean results)

  • pd.options.display.max_columns = 999 makes pandas display all the features (columns).


Loading the Dataset


df = pd.read_csv('hour.csv')
df.head()
Bike sharing demand analysis dataset
  • Later we will drop the unnecessary columns "casual" and "registered".

  • During feature engineering, we will one-hot encode the categorical columns.


# statistical info
df.describe()
Statistics of dataset
  • Every column reports the same count, so there are no missing values in the dataset.

# datatype info
df.info()
Data information of dataset
  • We will drop the other unnecessary columns, 'instant' and 'dteday'.

  • The remaining columns are of float or integer datatype.


# unique values
df.apply(lambda x: len(x.unique()))
Number of unique values
  • Attributes with many unique values are numerical. The remaining attributes, which take only a few distinct values, are categorical.
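The rule of thumb above can be sketched in code. This is only an illustration on a tiny made-up frame (toy), and the cut-off MAX_CATEGORIES is an assumption; for hour.csv a value around 24 would be needed so that the hour column counts as categorical.

```python
import pandas as pd

# Tiny hypothetical frame that mimics a few columns of hour.csv.
toy = pd.DataFrame({
    "season": [1, 1, 2, 3, 4, 4],
    "holiday": [0, 0, 0, 1, 0, 0],
    "temp": [0.24, 0.22, 0.50, 0.76, 0.30, 0.31],
    "cnt": [16, 40, 120, 310, 95, 88],
})

# nunique() counts the distinct values per column.
unique_counts = toy.nunique()

MAX_CATEGORIES = 4  # assumed cut-off; tune it for the real data
categorical = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()
numerical = unique_counts[unique_counts > MAX_CATEGORIES].index.tolist()

print("categorical:", categorical)
print("numerical:  ", numerical)
```

A count-based split like this is only a first guess; the attribute descriptions remain the more reliable guide to which columns are categorical.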


Preprocessing the dataset


Data preprocessing refers to preparing (cleaning and organizing) the raw data to make it suitable for building and training Machine Learning models.


Let us check for NULL values in the dataset.

# check for null values
df.isnull().sum()
Number of NULL Values
  • There are no NULL values in the dataset.

  • We will rename some columns (for example, to year, month and hour) for better readability.


df = df.rename(columns={'weathersit':'weather',
                       'yr':'year',
                       'mnth':'month',
                       'hr':'hour',
                       'hum':'humidity',
                       'cnt':'count'})
df.head()
Dataset after renaming columns
  • The attributes now have meaningful names.


Let us drop unnecessary columns.

df = df.drop(columns=['instant', 'dteday', 'year'])

For better visualization, let us convert the integer-coded columns into categorical columns.

# change int columns to category
cols = ['season','month','hour','holiday','weekday','workingday','weather']

for col in cols:
    df[col] = df[col].astype('category')
df.info()
  • The selected columns are converted into categorical columns.

  • Later we will use the remaining numerical columns to find the correlation.


Exploratory Data Analysis


We will analyze the data using visual techniques in terms of time and other attributes.


Let us start with the Time.

fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='count', hue='weekday', ax=ax)
ax.set(title='Count of bikes during weekdays and weekends')
Count of bikes during weekdays and weekends
  • The X-axis is the hour and the Y-axis is the count of bikes.

  • On weekdays, we observe peaks in the morning and in the evening hours.

  • On weekends, the peak is in the afternoon.
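The peaks read off the plot can also be checked numerically. The sketch below uses a synthetic stand-in frame (toy) with random counts, so its numbers carry no meaning; it only shows how the hourly means that a point plot draws could be tabulated and how the peak hour per weekday could be extracted.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the renamed frame: two weeks of hourly records.
rng = np.random.default_rng(0)
hours = np.tile(np.arange(24), 14)            # 0..23 repeated for 14 days
weekday = np.repeat(np.arange(14) % 7, 24)    # 0..6, two full weeks
toy = pd.DataFrame({
    "hour": hours,
    "weekday": weekday,
    "count": rng.integers(1, 500, size=hours.size),
})

# Mean rentals per (hour, weekday): the values a point plot draws as markers.
hourly_mean = toy.pivot_table(index="hour", columns="weekday",
                              values="count", aggfunc="mean")

# Hour with the highest average demand for each weekday.
peak_hour = hourly_mean.idxmax()
print(peak_hour)
```

On the real data, the same table would let you confirm the commuter peaks on working days and the afternoon peak on weekends without relying on the eye alone.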


Let us plot the same attributes for casual (unregistered) users.

fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='casual', hue='weekday', ax=ax)
ax.set(title='Count of bikes during weekdays and weekends: Unregistered users')
Count of bikes during weekdays and weekends: Unregistered users
  • The graph shows the count of unregistered users throughout the week.

  • We observe a high count on weekends.

  • This pattern is likely related to weekend outdoor activities.


Let us use the same attributes with registered users.

fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='registered', hue='weekday', ax=ax)
ax.set(title='Count of bikes during weekdays and weekends: Registered users')
Count of bikes during weekdays and weekends: Registered users
  • The graph shows the count of registered users throughout the week.

  • This pattern is likely related to people commuting to and from work.


Let us explore the graph in terms of weather.

fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='count', hue='weather', ax=ax)
ax.set(title='Count of bikes during different weathers')
Count of bikes during different weathers
  • The curves are similar to the previous graphs, except for weather 4.

  • Weather 4 (the red line) corresponds to heavy rain or snow, when hardly any users book a bike.


Let us explore the graph in terms of season.

fig, ax = plt.subplots(figsize=(20,10))
sns.pointplot(data=df, x='hour', y='count', hue='season', ax=ax)
ax.set(title='Count of bikes during different seasons')
Count of bikes during different seasons
  • Out of the four seasons, three show a similar curve.


Let us explore the graph in terms of months.

fig, ax = plt.subplots(figsize=(20,10))
sns.barplot(data=df, x='month', y='count', ax=ax)
ax.set(title='Count of bikes during different months')
Count of bikes during different months
  • Over the months, the number of users first increases and then gradually decreases.



Let us explore the graph in terms of weekdays.

fig, ax = plt.subplots(figsize=(20,10))
sns.barplot(data=df, x='weekday', y='count', ax=ax)
ax.set(title='Count of bikes during different days')
Count of bikes during different days
  • In this graph, the average number of users is nearly the same on every day of the week.

  • Thus, the daily average on its own is of little use for prediction.


Regression plot of temperature and humidity with respect to count.

fig, (ax1,ax2) = plt.subplots(ncols=2, figsize=(20,6))
sns.regplot(x=df['temp'], y=df['count'], ax=ax1)
ax1.set(title="Relation between temperature and users")
sns.regplot(x=df['humidity'], y=df['count'], ax=ax2)
ax2.set(title="Relation between humidity and users")
Regression Plot between Temperature, Humidity & Users
  • As the temperature increases, the number of users increases.

  • As the humidity increases, the number of users decreases.



from statsmodels.graphics.gofplots import qqplot
fig, (ax1,ax2) = plt.subplots(ncols=2, figsize=(20,6))
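# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(df['count'], kde=True, ax=ax1)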
ax1.set(title='Distribution of the users')
qqplot(df['count'], ax=ax2, line='s')
ax2.set(title='Theoretical quantiles')
Distribution & theoretical quantiles of users
  • The distribution of the users is heavily skewed, so the data is far from normally distributed.

  • In the plot of theoretical quantiles, the points deviate from the red line, especially near zero, so we should transform the data to bring it as close as possible to the red line.


Now we will apply a log transformation to reduce the skew of the data.

df['count'] = np.log(df['count'])
fig, (ax1,ax2) = plt.subplots(ncols=2, figsize=(20,6))
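# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(df['count'], kde=True, ax=ax1)
ax1.set(title='Distribution of the users')
qqplot(df['count'], ax=ax2, line='s')
ax2.set(title='Theoretical quantiles')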
Distribution & theoretical quantiles of users after log transformation
  • Now the distribution is less skewed, meaning the transformation had the intended effect

  • Now the points in the plot of theoretical quantiles lie much closer to the red line

  • You may also try min-max normalization or standardization and compare the results
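The effect of the log transformation can be quantified with the sample skewness. The following is a minimal sketch on synthetic right-skewed counts, not on the bike data; the helper skewness and the generated array counts are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical right-skewed counts (all >= 1), standing in for the 'count' column.
counts = np.floor(rng.lognormal(mean=4.0, sigma=1.0, size=5000)) + 1


def skewness(values):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    centered = values - values.mean()
    return (centered ** 3).mean() / centered.std() ** 3


log_counts = np.log(counts)      # safe here because every count is >= 1
restored = np.exp(log_counts)    # inverse transform back to raw counts

print("skew before:", skewness(counts))
print("skew after: ", skewness(log_counts))
```

If a column can contain zeros, np.log1p and its inverse np.expm1 are the usual substitutes, since np.log(0) is negative infinity.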



Correlation Matrix


# numeric_only=True (pandas 1.5+) keeps the categorical columns out of the matrix
corr = df.corr(numeric_only=True)
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot=True, annot_kws={'size':15})
Correlation Matrix
  • We use the correlation matrix for the numerical data.

  • We observe a highly positive correlation between 'temp' and 'atemp' and between 'casual' and 'registered'.

  • 'windspeed' shows a negligible correlation with the count.

  • Hence, we will drop a few unnecessary columns later.
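Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below works on synthetic data in which atemp is built as a noisy copy of temp; the frame toy and the cut-off THRESHOLD are assumptions, not values taken from this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
temp = rng.uniform(0, 1, 500)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.02, 500),   # nearly a copy of temp
    "humidity": rng.uniform(0, 1, 500),
    "windspeed": rng.uniform(0, 1, 500),
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"
redundant = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
print("candidates to drop:", redundant)
```

Only one column of each correlated pair is flagged, which matches the idea of keeping 'temp' and dropping 'atemp'.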


One hot Encoding


pd.get_dummies(df['season'], prefix='season', drop_first=True)
  • The output shows the encoded seasons: if a specific season is present in a row, the corresponding column is set to 1 and the other columns are 0.

  • The prefix is prepended to each new column name, which makes the columns easier to identify.

  • drop_first drops the first column, so if all three remaining columns are 0, the row belongs to season 1.
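The behaviour of drop_first is easiest to see on a handful of rows. This is a small illustration on a made-up series (seasons), not on the project data; the cast to int is only there so the output reads as 0 and 1 regardless of the pandas version.

```python
import pandas as pd

# Five hypothetical rows covering all four seasons.
seasons = pd.Series([1, 2, 3, 4, 1], name="season").astype("category")

# drop_first=True removes the column for the first category (season 1).
dummies = pd.get_dummies(seasons, prefix="season", drop_first=True).astype(int)
print(dummies)
```

The rows for season 1 contain only zeros, which is how the dropped category stays recoverable.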


df_oh = df

def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

cols = ['season','month','hour','holiday','weekday','workingday','weather']

for col in cols:
    df_oh = one_hot_encoding(df_oh, col)
df_oh.head()
Preprocessed Dataset
  • New data frame after one-hot encoding the data, which adds new features

  • The additional features increase the training time and may also improve the accuracy


Input Split


Now we will drop the columns we don't need for the model training.

X = df_oh.drop(columns=['atemp', 'windspeed', 'casual', 'registered', 'count'])
y = df_oh['count']

Model Training


from sklearn.linear_model import LinearRegression, Ridge, HuberRegressor, ElasticNetCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor

models = [LinearRegression(),
         Ridge(),
         HuberRegressor(),
         ElasticNetCV(),
         DecisionTreeRegressor(),
         RandomForestRegressor(),
         ExtraTreesRegressor(),
         GradientBoostingRegressor()]


from sklearn import model_selection
def train(model):
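    # shuffle=True is needed for random_state to take effect (recent scikit-learn raises an error otherwise)
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)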
    pred = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
    cv_score = pred.mean()
    print('Model:',model)
    print('CV score:', abs(cv_score))

for model in models:
    train(model)
Scores of Trained Models

  • Several models are compared to see how their results differ

  • These are common models for regression problems; you may investigate and try other models as well.
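The cross-validation scoring can be tried in isolation on synthetic data. The sketch below assumes made-up arrays X_toy and y_toy in place of X and y and uses a single LinearRegression model; it shows why the sign of neg_mean_squared_error is flipped and how a root mean squared error per fold could be derived.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Shuffled 5-fold split with a fixed seed for reproducibility.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy,
                          cv=kfold, scoring="neg_mean_squared_error")

mse_per_fold = -neg_mse                 # scores are negated so that higher is better
rmse_per_fold = np.sqrt(mse_per_fold)   # same units as the target
print("mean MSE :", mse_per_fold.mean())
print("mean RMSE:", rmse_per_fold.mean())
```

Looking at the per-fold values, and not only their mean, also gives a rough idea of how stable each model is across splits.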




from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
  • Splitting the dataset for training and testing


model = RandomForestRegressor()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
  • Random Forest gave the lowest error, so we train it on the training split and inspect its residuals on the test data


# plot the error difference
error = y_test - y_pred
fig, ax = plt.subplots()
ax.scatter(y_test, error)
ax.axhline(lw=3, color='black')
ax.set_xlabel('Observed')
ax.set_ylabel('Error')
plt.show()
Error Difference of Actual & Predicted Values
  • Visualization of the prediction errors (residuals) on the test set, both positive and negative




from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, y_pred))
  • Root mean squared error (RMSE) between the test targets and the predictions. Because 'count' was log-transformed, this value is on the log scale.
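Since the target was log-transformed, the error can be converted back to bike counts by undoing the log before comparing. The snippet below is a sketch with made-up numbers (y_true_log and y_pred_log stand in for y_test and y_pred), not results from the model above.

```python
import numpy as np

# Hypothetical log-scale targets and predictions (stand-ins for y_test and y_pred).
y_true_log = np.log(np.array([10.0, 50.0, 200.0, 400.0]))
y_pred_log = np.log(np.array([12.0, 45.0, 220.0, 380.0]))

# RMSE on the log scale, the quantity reported when the target is log(count).
rmse_log = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))

# Undo the log with np.exp to express the error in bikes.
rmse_bikes = np.sqrt(np.mean((np.exp(y_true_log) - np.exp(y_pred_log)) ** 2))

print("RMSE (log scale):", rmse_log)
print("RMSE (bikes):    ", rmse_bikes)
```

The two numbers answer different questions: the log-scale value penalizes relative errors, while the back-transformed value is dominated by the busiest hours.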


Final Thoughts

  • Out of the 8 models, Random Forest Regressor is the top performer with the lowest cross-validation error.

  • You may carry out further analysis with the variety of results given by the different models.

  • You can also use hyperparameter tuning to improve the model performance.
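One common way to tune hyperparameters is a grid search with cross-validation. The sketch below is only an illustration on synthetic data; the arrays X_toy and y_toy and the deliberately tiny param_grid are assumptions and not values recommended for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X and y.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(150, 4))
y_toy = 3 * X_toy[:, 0] + np.sin(6 * X_toy[:, 1]) + rng.normal(scale=0.1, size=150)

# Deliberately tiny, hypothetical grid; real searches usually cover more values.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=3, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

print("best parameters:", search.best_params_)
print("best CV MSE    :", -search.best_score_)
```

For larger grids, RandomizedSearchCV from the same module samples a fixed number of combinations and is usually cheaper.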


In this article, we explored the Bike Sharing Demand dataset using various machine learning techniques and plots. We also compared different models on the data, ranging from basic to advanced.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
