
Black Friday Sales Prediction Analysis using Python | Regression | Machine Learning Project Tutorial

Updated: Jun 2, 2023

Unlock the power of sales prediction with Python! This tutorial delves into regression techniques for Black Friday sales analysis. Learn to build accurate models that forecast sales, gain insights into customer behavior, and optimize pricing strategies. Enhance your skills in machine learning, data analysis, and uncover valuable insights for business success. Dive into the realm of Black Friday sales prediction with this hands-on project tutorial. #BlackFridaySales #Python #Regression #MachineLearning #SalesPrediction #DataAnalysis #BusinessInsights

Black Friday Sales Prediction using Regression

In this project tutorial, we analyze and predict Black Friday sales, and present the results through plots and several prediction models.


You can watch the video-based tutorial with a step-by-step explanation down below


Dataset Information


This dataset comprises sales transactions captured at a retail store. It's a classic dataset for exploring and expanding your feature engineering skills and day-to-day understanding of multiple shopping experiences. This is a regression problem. The dataset has 550,069 rows and 12 columns.


Problem: Predict purchase amount


Attributes:

Black Friday Sales Dataset
  • Some attributes are masked to anonymize the underlying data.


Download the Dataset here



Import modules

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline

warnings.filterwarnings('ignore')
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • matplotlib - used for data visualization and graphical plotting

  • seaborn - built on top of matplotlib with similar functionalities

  • %matplotlib inline - enables inline plotting in the notebook

  • warnings - used to manipulate warning details

  • filterwarnings('ignore') ignores the warnings thrown by the modules (gives cleaner output)


Loading the dataset


df = pd.read_csv('train.csv')
df.head()
Black Friday Sales Dataset
  • Some columns have null values; those must be replaced with relevant values before further processing.



Let us see the statistical information of the attributes

# statistical info
df.describe()
Statistical Information of Dataset
  • Statistical information of the data

  • Product_Category_2 and Product_Category_3 have fewer non-null samples than Product_Category_1; both could be sub-categories.


Let us see the data type information of the attributes

# datatype info
df.info()
Data type information
  • We have categorical as well as numerical attributes, which we will process separately.

  • Product_Category_1 is stored as an integer while Product_Category_2 and Product_Category_3 are floats (because of their null values); that won't affect the process or the result.


# find unique values
df.apply(lambda x: len(x.unique()))
Unique values of each column
  • Attributes containing many unique values are of numerical type. The remaining attributes are of categorical type.
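As a side note, pandas has a built-in shortcut for the same count; nunique() skips NaN by default, so passing dropna=False matches the lambda above:

# equivalent to the lambda above; dropna=False counts NaN as a distinct value
df.nunique(dropna=False)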


Exploratory Data Analysis


# distplot for purchase
plt.style.use('fivethirtyeight')
plt.figure(figsize=(13, 7))
# note: distplot is deprecated in newer seaborn; sns.histplot(df['Purchase'], bins=25, kde=True) is the modern equivalent
sns.distplot(df['Purchase'], bins=25)
Distribution of Purchase
  • The first part of the graph follows a normal distribution, with some peaks forming later on

  • Evaluated as a whole, the graph is approximately normally distributed
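If you prefer a numeric check over an eyeball judgment, pandas' built-in skewness measure gives a quick read; a value near 0 suggests a roughly symmetric, normal-like shape:

# skewness near 0 indicates a roughly symmetric distribution
print("Skewness:", df['Purchase'].skew())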


# distribution of categorical variables
sns.countplot(x='Gender', data=df)
Distribution of Gender
  • Most buyers are male; female buyers are a minority.

  • The difference may stem from the categories on sale during Black Friday; filtering to a particular product category may change the gender balance (see the sketch below).
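To probe that idea, you could redraw the gender counts within a single product category; a minimal sketch (category 1 is an arbitrary example, not a claim about the data):

# gender counts restricted to one product category (1 is an arbitrary example)
subset = df[df['Product_Category_1'] == 1]
sns.countplot(x='Gender', data=subset)
plt.title('Gender counts for Product_Category_1 == 1')
plt.show()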


sns.countplot(x='Age', data=df)
Distribution of Age
  • There are 7 categories defined to classify the age of the buyers


sns.countplot(x='Marital_Status', data=df)
Distribution of Marital Status
  • The majority of the buyers are single


sns.countplot(x='Occupation', data=df)
Distribution of Occupation
  • Distribution of the buyers' occupations

  • Occupation 8 has an extremely low count compared with the others; it can be ignored since it won't affect the result much.


sns.countplot(x='Product_Category_1', data=df)
Distribution of Product Category 1
  • The majority of the products fall in categories 1, 5, and 8.

  • The low-count categories can be combined into a single category to reduce the complexity of the problem; a sketch of that merge follows below.
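One way to do that merge, sketched on a copy so the rest of the tutorial is unaffected (the 1% frequency threshold is an arbitrary assumption):

# merge rare Product_Category_1 values into a single 'other' code (-1)
df_merged = df.copy()
freq = df_merged['Product_Category_1'].value_counts(normalize=True)
rare = freq[freq < 0.01].index  # categories below 1% frequency
df_merged['Product_Category_1'] = df_merged['Product_Category_1'].where(
    ~df_merged['Product_Category_1'].isin(rare), -1)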


sns.countplot(x='Product_Category_2', data=df)
Distribution of Product Category 2
  • The categories are float values

  • Categories 2, 8, and 14 to 16 have higher counts than the others.


sns.countplot(x='Product_Category_3', data=df)
Distribution of Product Category 3
  • The categories are float values

  • Categories 14 to 17 have higher counts


sns.countplot(x='City_Category', data=df)
Distribution of City Category
  • The category with the higher count might represent an urban area, which would indicate a larger population


sns.countplot(x='Stay_In_Current_City_Years', data=df)
Distribution of Stay in Current City
  • Most buyers have lived in the city for one year

  • The remaining categories are roughly uniformly distributed


Now let us plot using two variables for analysis

# bivariate analysis
occupation_plot = df.pivot_table(index='Occupation', values='Purchase', aggfunc=np.mean)
occupation_plot.plot(kind='bar', figsize=(13, 7))
plt.xlabel('Occupation')
plt.ylabel("Purchase")
plt.title("Occupation and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()
Bivariate Analysis of Occupation and Purchase
  • aggfunc=np.mean displays the mean purchase per occupation

  • aggfunc=np.sum would display the total purchase per occupation instead (see the example below)

  • Based on the plot, the average purchase amount is fairly similar across all occupations.

  • This plot style is recommended for presentations
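For instance, switching the aggregation from the average to the total purchase is a one-argument change:

# total (rather than average) purchase per occupation
occupation_sum = df.pivot_table(index='Occupation', values='Purchase', aggfunc=np.sum)
occupation_sum.plot(kind='bar', figsize=(13, 7))
plt.title('Total Purchase by Occupation')
plt.show()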


age_plot = df.pivot_table(index='Age', values='Purchase', aggfunc=np.mean)
age_plot.plot(kind='bar', figsize=(13, 7))
plt.xlabel('Age')
plt.ylabel("Purchase")
plt.title("Age and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()
Bivariate Analysis of Age and Purchase
  • The Age vs. Purchase plot also shows a fairly uniform distribution across age groups.


gender_plot = df.pivot_table(index='Gender', values='Purchase', aggfunc=np.mean)
gender_plot.plot(kind='bar', figsize=(13, 7))
plt.xlabel('Gender')
plt.ylabel("Purchase")
plt.title("Gender and Purchase Analysis")
plt.xticks(rotation=0)
plt.show()
Bivariate Analysis of Gender and Purchase
  • Roughly uniform, with only a small difference between the genders


Preprocessing the dataset


First, we must check for null values in the data

# check for null values
df.isnull().sum()
Count of null values in each column
  • Null values are present in Product_Category_2 and Product_Category_3

  • Null values must be filled for easier processing


Now we fill the null values in the dataset

df['Product_Category_2'] = df['Product_Category_2'].fillna(-2.0).astype("float32")
df['Product_Category_3'] = df['Product_Category_3'].fillna(-2.0).astype("float32")
  • Null values are filled with a negative sentinel value so they cannot be confused with real categories and do not affect the results (a quick check is sketched below).

  • The fill value must have the same data type as the attribute.
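A quick sanity check, run before the fill, confirms the sentinel does not collide with any real category:

# confirm -2.0 is not an existing category before using it as the fill value
assert -2.0 not in df['Product_Category_2'].dropna().unique()
assert -2.0 not in df['Product_Category_3'].dropna().unique()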


Let us double-check the null values

df.isnull().sum()
Count of null values in each column


Now we must convert the categorical attributes to numerical using a dictionary

# encoding values using dict
gender_dict = {'F':0, 'M':1}
df['Gender'] = df['Gender'].apply(lambda x: gender_dict[x])
df.head()
Black Friday Sales Dataset
  • 'F' is now converted to numerical zero (0) and 'M' to one (1); an equivalent map-based idiom is shown below
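As a side note, pandas' map offers an equivalent, slightly more idiomatic spelling of the same encoding (an alternative to the apply call above, not an extra step to run after it):

# alternative to the apply/lambda above (run one or the other, not both);
# values missing from the dict would become NaN
df['Gender'] = df['Gender'].map(gender_dict)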


Label encoding is a quicker way to convert categorical columns into numerical ones

# to improve the metric use one hot encoding
# label encoding
cols = ['Age', 'City_Category', 'Stay_In_Current_City_Years']
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cols:
    df[col] = le.fit_transform(df[col])
df.head()
Black Friday Sales Dataset
  • One-hot encoding increases the number of columns but can improve accuracy (a sketch follows after this list)

  • More columns mean more data to train on, which increases training time

  • All categorical columns are now converted to numerical

  • For the model input, User_ID and Product_ID must be removed in order to generalize the results.
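For the one-hot alternative mentioned above, a minimal sketch using pandas' get_dummies, applied to the raw categorical columns (i.e., before the label encoding step):

# one-hot encoding: expands each category into its own 0/1 column
# (run on the raw string columns, before the label encoding above)
df_onehot = pd.get_dummies(df, columns=['Age', 'City_Category', 'Stay_In_Current_City_Years'])
df_onehot.head()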



Correlation Matrix


A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. The value is in the range of -1 to 1. If two variables have a high correlation, we can neglect one variable from those two.


# Product_ID is still a non-numeric column, so restrict the correlation
# to numeric columns (required on newer pandas versions)
corr = df.corr(numeric_only=True)
plt.figure(figsize=(14,7))
sns.heatmap(corr, annot=True, cmap='coolwarm')
Correlation Matrix
  • Purchase is most correlated with Product_Category_1 and Product_Category_3

  • Marital_Status and Age also have a positive correlation with each other
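To apply the "drop one of two highly correlated variables" rule programmatically, a minimal sketch (the 0.6 cutoff is an arbitrary assumption):

# list feature pairs whose absolute correlation exceeds a threshold
threshold = 0.6  # arbitrary cutoff for 'high' correlation
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle, diagonal excluded
high_pairs = corr.abs().where(mask).stack()
print(high_pairs[high_pairs > threshold])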


Input Split


df.head()
Preprocessed Black Friday Sales Dataset
  • User_ID and Product_ID must be removed for better results; otherwise the model will be biased toward specific User_ID or Product_ID values


Now we split the data for training

X = df.drop(columns=['User_ID', 'Product_ID', 'Purchase'])
y = df['Purchase']
  • Purchase is the output (target), which is why it is removed from X as well


Model Training

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
def train(model, X, y):
    # train-test split
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)
    model.fit(x_train, y_train)

    # predict the results
    pred = model.predict(x_test)

    # cross validation
    cv_score = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    cv_score = np.abs(np.mean(cv_score))

    print("Results")
    # taking the square root of the MSE gives the RMSE
    print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
    print("CV Score (RMSE):", np.sqrt(cv_score))
  • cross_val_score() is used for better validation of the model.

  • cv=5 means cross-validation splits the data into 5 folds, each used once for validation.

  • np.abs() converts the negative score to positive, and np.mean() averages the 5 fold scores.


Now let us try some basic models

from sklearn.linear_model import LinearRegression
# note: the `normalize` parameter was removed in scikit-learn 1.2; on newer
# versions, scale the features yourself (e.g., StandardScaler in a Pipeline)
model = LinearRegression(normalize=True)
train(model, X, y)
coef = pd.Series(model.coef_, X.columns).sort_values()
coef.plot(kind='bar', title='Model Coefficients')

Results RMSE: 4617.994034201719 CV Score (RMSE): 4625.252945835687

Model Coefficients of Linear Regression
  • The Linear Regression model needs normalized data to give better results

  • Gender has a high coefficient in the Linear Regression model


from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
train(model, X, y)
features = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
features.plot(kind='bar', title='Feature Importance')

Results RMSE: 3366.9672356860747 CV Score (RMSE): 3338.5905886644855

Feature Importance of Decision Tree
  • Results have improved compared to the Linear Regression model

  • Product_Category_1 has high feature importance here, unlike in the Linear Regression model.


from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1)
train(model, X, y)
features = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
features.plot(kind='bar', title='Feature Importance')

Results RMSE: 3062.66041010778 CV Score (RMSE): 3052.7778119222253

Feature Importance of Random Forest
  • Better results compared with the Decision Tree Regressor


from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor(n_jobs=-1)
train(model, X, y)
features = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
features.plot(kind='bar', title='Feature Importance')

Results RMSE: 3194.0770038232185 CV Score (RMSE): 3180.5476561931346

Feature Importance of Extra Trees
  • Better results than Linear Regression, but worse than the Random Forest Regressor

Note: these prediction steps are just an example; modify them accordingly for your own usage (a runnable variant is sketched after the bullet list below)
pred = model.predict(x_test)  # assumes x_test is available outside train()

submission = pd.DataFrame()
submission['User_ID'] = x_test['User_ID']  # assumes x_test still carries User_ID
submission['Purchase'] = pred

submission.to_csv('submission.csv', index=False)
  • Prediction on the attributes from the x_test data

  • Creates a new dataframe for submitting the prediction results

  • Stores each User_ID together with the Purchase prediction for that buyer.
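For the snippet above to run as-is, x_test must exist outside train() and still contain User_ID, which the feature matrix X no longer does. One hedged way to reproduce it, assuming you split the raw dataframe first and drop the ID columns only when fitting:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# split the full dataframe first so the test portion still carries User_ID
train_df, test_df = train_test_split(df, random_state=42, test_size=0.25)

feature_cols = [c for c in df.columns if c not in ('User_ID', 'Product_ID', 'Purchase')]
model = RandomForestRegressor(n_jobs=-1)
model.fit(train_df[feature_cols], train_df['Purchase'])

pred = model.predict(test_df[feature_cols])
submission = pd.DataFrame({'User_ID': test_df['User_ID'].values, 'Purchase': pred})
submission.to_csv('submission.csv', index=False)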


Final Thoughts

  • Out of the 4 models, the Random Forest Regressor is the top performer, with the lowest RMSE and CV score.

  • You can also use hyperparameter tuning to improve the model performance (a sketch follows after this list).

  • You can further try other models like XGBoost, CatBoost, etc.
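As a starting point for the hyperparameter tuning suggestion, a minimal RandomizedSearchCV sketch for the Random Forest (the search space and budget are illustrative assumptions, not tuned values):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# illustrative search space; the ranges are assumptions, not a tuned recipe
param_dist = {
    'n_estimators': [100, 200, 400],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1), param_dist,
    n_iter=5, cv=3, scoring='neg_mean_squared_error', random_state=42)
search.fit(X, y)
print(search.best_params_)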


In this project tutorial, we have explored Black Friday sales prediction. We also compared different models for training on the data, from basic to more advanced ones.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
