
Mall Customer Segmentation Analysis using Python | Clustering | Machine Learning Project Tutorial

Hackers Realm

Updated: Jun 2, 2023

Dive into the realm of customer segmentation analysis with Python! This tutorial guides you through mall customer segmentation using clustering techniques in machine learning. Uncover hidden patterns, understand customer behavior, and optimize marketing strategies. Enhance your skills in data analysis and unlock the power of segmentation with this hands-on project tutorial. #CustomerSegmentation #Python #Clustering #MachineLearning #DataAnalysis #Marketing

Mall Customer Segmentation Analysis using Clustering method

In this project tutorial, we will explore Mall Customer Segmentation Analysis using Python. Along the way, we will discuss unsupervised learning, principal component analysis, K-means clustering, and the elbow method.


You can watch the video-based tutorial with step by step explanation down below.


Dataset Information


You own a supermarket mall and, through membership cards, you have some basic data about your customers. Spending Score is a value you assign to each customer based on parameters you define, such as customer behavior and purchasing data.


Attributes

  • Customer ID

  • Age

  • Gender

  • Annual income

  • Spending score

Download the Dataset here


Import Modules

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
  • pandas - used to perform data manipulation and analysis

  • numpy - used to perform a wide variety of mathematical operations on arrays

  • matplotlib - used for data visualization and graphical plotting

  • seaborn - built on top of matplotlib with similar functionalities

  • %matplotlib inline - to enable the inline plotting

  • warnings - used to control how warning messages are shown

filterwarnings('ignore') suppresses the warnings raised by the modules, which keeps the output clean.



Load the Dataset

df = pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
df.head()
Mall Customer Segmentation Dataset
  • We can see the top 5 samples of the dataset

  • CustomerID is not necessary for the process so it can be dropped
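A minimal sketch of dropping the identifier column. The toy frame below is a hypothetical stand-in that reuses the dataset's column names so the snippet runs on its own; on the real data the same drop call would be applied to df.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Gender': ['Male', 'Male', 'Female'],
    'Age': [19, 21, 20],
    'Annual Income (k$)': [15, 15, 16],
    'Spending Score (1-100)': [39, 81, 6],
})

# CustomerID is only a row identifier, so it carries no signal for clustering
features = toy.drop(columns=['CustomerID'])
print(features.columns.tolist())
```

drop() returns a new DataFrame by default, so the original frame stays available for lookups by customer.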


# statistical info
df.describe()
Statistics of the Mall Customer Dataset
  • describe() summarizes every numerical column: count, mean, standard deviation, minimum, quartiles, and maximum.


# datatype info
df.info()
Data type Information of Dataset
  • Only one attribute (Gender) is categorical; the rest are numerical.

  • There are no NULL values in the data, so no further preprocessing is needed.

  • If a dataset does contain NULL values, either replace them with a suitable value or drop the affected rows.
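The mall dataset has no missing values, but the two options mentioned above can be sketched on a toy frame. The rows and the gaps below are made up for illustration; median and mode are just one reasonable choice of fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing values inserted on purpose.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', None, 'Female'],
    'Age': [19, np.nan, 35, 42],
    'Annual Income (k$)': [15, 16, 17, np.nan],
})

# count the missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# option 1: drop every row that contains a missing value
dropped = toy.dropna()

# option 2: fill numerical gaps with the column median and
# categorical gaps with the most frequent value
filled = toy.copy()
for col in ['Age', 'Annual Income (k$)']:
    filled[col] = filled[col].fillna(filled[col].median())
filled['Gender'] = filled['Gender'].fillna(filled['Gender'].mode()[0])
print(filled)
```

Dropping rows is simpler but loses data; filling keeps every customer at the cost of introducing estimated values.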


Exploratory Data Analysis


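# pass the column by keyword; recent seaborn versions treat the first positional argument as data
sns.countplot(x='Gender', data=df)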
Bar Plot for Gender and Count
  • The distribution is almost equal, with female customers in the majority


# distplot is deprecated in seaborn; histplot with kde=True is the closest replacement
sns.histplot(df['Age'], kde=True)
Distribution Plot for Age
  • The data is well distributed, and most customers are between 30 and 40 years old


sns.histplot(df['Annual Income (k$)'], kde=True)
Distribution plot for Annual Income
  • We can see the Annual Income, with a good distribution


sns.histplot(df['Spending Score (1-100)'], kde=True)
Distribution Plot for Spending Score
  • Most spending scores fall between 40 and 60


Correlation Matrix


A correlation matrix is a table showing correlation coefficients between variables.

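# numeric_only=True skips the categorical Gender column, which newer pandas versions refuse to correlate
corr = df.corr(numeric_only=True)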
sns.heatmap(corr, annot=True, cmap='coolwarm')
Correlation Matrix for Numerical Variables
  • The red color shows a positive correlation, and the blue color is a negative correlation.

  • In supervised learning, we can drop highly correlated attributes.

  • Since this is unsupervised learning, principal component analysis is one way to reduce the dimension of the dataset. In this tutorial we instead select two or three attributes by hand for clustering.
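As a rough illustration of dimensionality reduction with principal component analysis, the sketch below projects three numerical attributes onto two components. The data is randomly generated and only borrows the dataset's column names; standardizing with StandardScaler first is an assumption made here, not a step taken in the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical columns (values are random, not real customers)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, size=200),
    'Annual Income (k$)': rng.integers(15, 140, size=200),
    'Spending Score (1-100)': rng.integers(1, 101, size=200),
})

# PCA is sensitive to scale, so bring every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(demo)

# keep the two directions that explain the most variance
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

The explained_variance_ratio_ values show how much of the total variance each component retains, which helps judge how much information the projection loses.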


Clustering

df.head()
Mall customer dataset


# cluster on 2 features
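# .copy() gives an independent frame, so a Label column can be added later without a SettingWithCopyWarning
df1 = df[['Annual Income (k$)', 'Spending Score (1-100)']].copy()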
df1.head()
Filtering Annual Income and Spending Score
  • First, let us cluster on only two attributes



# scatter plot
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df1)
Scatter plot for Annual Income and Spending Score
  • Scatter plot of Annual Income and Spending Score

  • Most points sit in the center, which could form one cluster; the four corners could form four more clusters, or be grouped into two.


Now we can start clustering the data

from sklearn.cluster import KMeans
errors = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(df1)
    errors.append(kmeans.inertia_)   
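  • The errors list contains, for each number of clusters, the sum of squared distances of the samples to their closest cluster center (the inertia, also called WCSS, the within-cluster sum of squares)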
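To make the meaning of inertia_ concrete, the sketch below recomputes the sum of squared distances by hand and compares it with the value reported by KMeans. The points are synthetic blobs generated for illustration, and n_init and random_state are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D points in three loose groups (not the mall data)
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(20, 20), scale=3, size=(50, 2)),
    rng.normal(loc=(60, 50), scale=3, size=(50, 2)),
    rng.normal(loc=(90, 80), scale=3, size=(50, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# look up the center assigned to each sample, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
manual_wcss = np.sum((points - assigned_centers) ** 2)

print(manual_wcss, model.inertia_)
```

The two numbers should agree up to floating-point rounding, which confirms that the elbow plot is charting exactly this quantity.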

# plot the results for elbow method
plt.figure(figsize=(13,6))
plt.plot(range(1,11), errors, linewidth=3, color='red', marker='8')
plt.xlabel('No. of clusters')
plt.ylabel('WCSS')
plt.xticks(np.arange(1,11,1))
plt.show()
elbow method plot to find number of clusters
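  • We use the elbow method to find the number of clusters.

  • The curve is shaped like a bent arm, and the point where it bends is the elbow.

  • We take the number of clusters at the elbow, where adding more clusters stops reducing the WCSS by much.

  • Here the best number of clusters appears to be 5.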
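Reading the elbow off a plot is a judgment call, so a numerical cross-check can help. The sketch below uses the silhouette score, which is a different technique from the elbow method and is not part of the original tutorial. The points are synthetic: five blobs placed roughly like the center and corners of the income versus spending plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic blobs at hypothetical positions (center plus four corners)
rng = np.random.default_rng(1)
centers = [(20, 20), (20, 80), (55, 50), (90, 20), (90, 80)]
points = np.vstack([rng.normal(loc=c, scale=4, size=(40, 2)) for c in centers])

scores = {}
for k in range(2, 11):  # the silhouette score needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# the k with the highest average silhouette is the best-separated clustering
best_k = max(scores, key=scores.get)
print(best_k)
```

If the silhouette score and the elbow plot point to the same number of clusters, that is some reassurance the choice is not an artifact of how the curve was read.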



km = KMeans(n_clusters=5)
km.fit(df1)
y = km.predict(df1)
df1['Label'] = y
df1.head()
Filtered mall customer dataset
  • Added cluster label for each sample


sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df1, hue='Label', s=50, palette=['red', 'green', 'brown', 'blue', 'orange'])
Scatter plot between Annual Income and Spending Score with Cluster Label
  • Scatter plot graph of the clustered data

  • Based on this analysis, you can send targeted offers to the customers in a particular cluster
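Before deciding which offers go to which group, it helps to summarize each cluster. The sketch below builds a per-cluster profile with groupby. The rows and labels are hypothetical; on the tutorial's data the same aggregation would run on df1 after the Label column has been added.

```python
import pandas as pd

# Hypothetical labelled customers (values invented for illustration)
labelled = pd.DataFrame({
    'Annual Income (k$)': [15, 18, 80, 85, 55, 60],
    'Spending Score (1-100)': [80, 75, 10, 15, 50, 45],
    'Label': [0, 0, 1, 1, 2, 2],
})

# size, mean income, and mean spending score of every cluster
profile = labelled.groupby('Label').agg(
    customers=('Annual Income (k$)', 'size'),
    mean_income=('Annual Income (k$)', 'mean'),
    mean_score=('Spending Score (1-100)', 'mean'),
)
print(profile)
```

A cluster with high income but a low spending score, for example, might be a candidate for promotions, while a high-spending cluster might suit loyalty rewards.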


Now let us use three-dimensional data

# cluster on 3 features
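# .copy() gives an independent frame, so a Label column can be added later without a SettingWithCopyWarning
df2 = df[['Annual Income (k$)', 'Spending Score (1-100)', 'Age']].copy()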
df2.head()
filtered mall customer dataset


errors = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(df2)
    errors.append(kmeans.inertia_)

# plot the results for elbow method
plt.figure(figsize=(13,6))
plt.plot(range(1,11), errors, linewidth=3, color='red', marker='8')
plt.xlabel('No. of clusters')
plt.ylabel('WCSS')
plt.xticks(np.arange(1,11,1))
plt.show()
visualization for elbow method to find clusters
  • The optimal number of clusters is still 5.



km = KMeans(n_clusters=5)
km.fit(df2)
y = km.predict(df2)
df2['Label'] = y
df2.head()
dataset with cluster label
  • Added cluster label for each sample in new data



# 3d scatter plot
fig = plt.figure(figsize=(20,15))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df2['Age'][df2['Label']==0], df2['Annual Income (k$)'][df2['Label']==0], df2['Spending Score (1-100)'][df2['Label']==0], c='red', s=50)

ax.scatter(df2['Age'][df2['Label']==1], df2['Annual Income (k$)'][df2['Label']==1], df2['Spending Score (1-100)'][df2['Label']==1], c='green', s=50)

ax.scatter(df2['Age'][df2['Label']==2], df2['Annual Income (k$)'][df2['Label']==2], df2['Spending Score (1-100)'][df2['Label']==2], c='blue', s=50)

ax.scatter(df2['Age'][df2['Label']==3], df2['Annual Income (k$)'][df2['Label']==3], df2['Spending Score (1-100)'][df2['Label']==3], c='brown', s=50)

ax.scatter(df2['Age'][df2['Label']==4], df2['Annual Income (k$)'][df2['Label']==4], df2['Spending Score (1-100)'][df2['Label']==4], c='orange', s=50)

ax.view_init(30, 190)
ax.set_xlabel('Age')
ax.set_ylabel('Annual Income')
ax.set_zlabel('Spending Score')
plt.show()
3D plot for Annual Income, Spending Score & Age with Cluster Label
  • 3D scatter plot graph of the data

  • ax.scatter() - plots the points of one cluster, selected by filtering on Label, in the specified color

  • You may change the view_init() parameters to view the scatter plot from a different angle

  • You may also use a different plot method for a different view.
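The five ax.scatter() calls differ only in the label and the color, so they can be folded into a loop. The sketch below is one possible way to do it; plot_clusters_3d is a hypothetical helper name, and the demo frame is random data that reuses the tutorial's column names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_clusters_3d(data, colors):
    """Draw one 3D scatter per cluster label, using the matching color."""
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')
    for label, color in enumerate(colors):
        subset = data[data['Label'] == label]
        ax.scatter(subset['Age'], subset['Annual Income (k$)'],
                   subset['Spending Score (1-100)'],
                   c=color, s=50, label=f'Cluster {label}')
    ax.view_init(30, 190)
    ax.set_xlabel('Age')
    ax.set_ylabel('Annual Income')
    ax.set_zlabel('Spending Score')
    ax.legend()
    return fig, ax

# Random stand-in for df2 with a Label column (not real clustering output)
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, size=50),
    'Annual Income (k$)': rng.integers(15, 140, size=50),
    'Spending Score (1-100)': rng.integers(1, 101, size=50),
    'Label': rng.integers(0, 5, size=50),
})

fig, ax = plot_clusters_3d(demo, ['red', 'green', 'blue', 'brown', 'orange'])
```

With the loop, changing the number of clusters only requires changing the list of colors, and the legend makes it easier to tell the clusters apart.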



Final Thoughts

  • You can use different hyperparameters to obtain different results.

  • You need to find the best number of clusters based on the data available.
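As one example of hyperparameters worth experimenting with, the sketch below sets init, n_init, and random_state on KMeans. The points are random and the specific values are illustrative assumptions, not recommendations from the tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random 2D points used only to exercise the parameters
rng = np.random.default_rng(7)
points = rng.uniform(0, 100, size=(200, 2))

# fixing random_state makes repeated runs produce the same clustering
km_a = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42).fit(points)
km_b = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42).fit(points)
same_labels = np.array_equal(km_a.labels_, km_b.labels_)

# a single run from purely random starting centers, for comparison
km_random = KMeans(n_clusters=5, init='random', n_init=1, random_state=42).fit(points)

print(same_labels, km_a.inertia_, km_random.inertia_)
```

Without a fixed random_state, the cluster numbering (and sometimes the clusters themselves) can change between runs, which matters when colors or offers are tied to specific labels.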


In this article, we analyzed mall customer segmentation as a clustering problem using machine learning. We also discussed unsupervised learning with K-means clustering and the elbow method.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
