E-commerce sales EDA and Clustering for customer segmentation

An e-commerce store can generate thousands of interactions daily. With the help of e-commerce sales EDA, we can find fascinating insights that help us better understand the business and its customers.

In sales, cluster analysis can be used for customer segmentation, allowing us to define new strategies for pricing, special discounts, differentiation in presentation, and more. Better-targeted offers increase sales and help a business grow its customer base.

In marketing, cluster analysis is used to analyze customer behavior based on features. These features include customer demographics, purchasing behavior, and consumption.

In the present case, we have the sales data of an e-commerce store in the UK. The company wanted to know the performance of the store during the 2017 period and to identify what customer segments exist on its platform, because it wants to reward customers with similar purchasing behavior (in volume and profit margin) with special discounts.

For the clustering process, I have proposed 3 Machine Learning models (K-Means, Agglomerative Clustering, DBSCAN). K-Means and Agglomerative Clustering found 4 clusters or groups of clients.

And for the ecommerce EDA the following questions have been proposed:

  1. What was the total revenue?
  2. Which months had the highest sales?
  3. What products are in the top 5 in sales and revenue?
  4. Which products were returned more frequently?
  5. What are the top 5 countries that purchased the most?
  6. How much is the share of the revenue for each cluster?

You can take a peek at the entire process, code and figures on my GitHub profile.

I used Jupyter Notebook on Google Colab to develop this project.

Preparing the data for EDA and clustering

First of all, let’s import the libraries and the dataset.

Note: I imported sqldf from the pandasql library to see how SQL runs in Python. I'll comment on this a bit later.

#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno 
import scipy.cluster.hierarchy as sch

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering

!pip install pandasql
from pandasql import sqldf

#import the dataset
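The load step itself isn't shown above; a minimal sketch using `pd.read_csv`, where an inline sample stands in for the real CSV file (the filename in the comment and the sample rows are assumptions, not the actual data):

```python
import io
import pandas as pd

# Inline sample standing in for the store's CSV; column names follow the dataset
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2017-09-01 08:26,2.55,17850,United Kingdom\n"
    "536366,22633,HAND WARMER UNION JACK,6,2017-09-01 08:28,1.85,17850,United Kingdom\n"
)
df = pd.read_csv(sample_csv)  # in practice: pd.read_csv('ecommerce_sales.csv')
print(df.shape)  # (2, 8)
```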

Now we can take a look at the data. The dataset has 541909 entries.

As you can notice, the Description, CustomerID and Unnamed: 8 columns have empty or missing values. Thus, we have to deal with them.

Also, we need to change the InvoiceDate data type from object to datetime.

Finally, I calculated an additional column named 'Total_Revenue'.

#let's see what the data looks like

RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
 8   Unnamed: 8   0 non-null       float64
dtypes: float64(3), int64(1), object(5)
memory usage: 37.2+ MB

#Let's add an additional column with the total revenue per purchase
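A minimal sketch of that derived column, assuming revenue per invoice line is simply Quantity × UnitPrice (a toy frame stands in for the real df):

```python
import pandas as pd

# Toy frame standing in for the real df
df = pd.DataFrame({'Quantity': [6, 2, -1], 'UnitPrice': [2.55, 1.85, 4.95]})

# Revenue per invoice line: units times unit price (negative for returns)
df['Total_Revenue'] = df['Quantity'] * df['UnitPrice']
print(df['Total_Revenue'].round(2).tolist())  # [15.3, 3.7, -4.95]
```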

I created a new dataframe without NaN values.

For EDA, I'll keep the rows with no Description or CustomerID. Thus, I filled the NaN values in these columns with 0.

df_eda=df.drop(columns=['Unnamed: 8'])
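A sketch of the NaN handling for the EDA copy, assuming the fill-with-0 approach described above (the toy frame mimics the gaps in the real data):

```python
import numpy as np
import pandas as pd

# Toy frame with the kinds of gaps the real data has
df = pd.DataFrame({
    'Description': ['MUG', np.nan],
    'CustomerID': [17850.0, np.nan],
    'Unnamed: 8': [np.nan, np.nan],
})

# Drop the empty trailing column, then fill the remaining gaps with 0 for EDA
df_eda = df.drop(columns=['Unnamed: 8'])
df_eda[['Description', 'CustomerID']] = df_eda[['Description', 'CustomerID']].fillna(0)
print(int(df_eda.isna().sum().sum()))  # 0
```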

Let’s start with the ecommerce sales EDA

To conduct the exploratory data analysis, I focused on charts and tables. If you want to see the script and the step-by-step process, please check the GitHub repository.

1. What was the total revenue?

The total revenue was $ 9.75M.

Notice that negative values (returns) were considered in this total.

print ('Total Revenue:\n','$ ', round(Sum_revenue,2))

Total Revenue:
 $  9747747.93

2. Which months had the higher sales?

We can see that the months with higher sales were September, October and November 2017.

An interesting fact is that December 2016 does not have a relevant volume of sales, as could be expected due to the holidays. The same thing happened in February and March (Valentine's Day and Mother's Day in the UK).
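The monthly totals behind the chart can be obtained with a groupby on the invoice month; a sketch with toy data (column names follow the dataset, dates are made up):

```python
import pandas as pd

# Toy data standing in for df_eda
df_eda = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2017-09-05', '2017-09-20', '2017-10-02']),
    'Total_Revenue': [100.0, 50.0, 75.0],
})

# Total revenue per calendar month
monthly = df_eda.groupby(df_eda['InvoiceDate'].dt.to_period('M'))['Total_Revenue'].sum()
print(monthly)
```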

3. What products are in the top 5 in sales and revenue?

For the products analysis, we considered only the products with a StockCode. Descriptions like 'AMAZON FEE', 'POSTAGE', 'BANK COMMISSION', etc. were not included in this part.

We can see that the best seller is the ‘World War 2 Gliders Asstd Designs’ with 53847 units sold and revenue of $ 13587.93.

The best seller is not necessarily the top in revenue. The ‘Regency Cakestand 3 Tier’ is the most profitable product, with total revenue of $164762.19 and 13022 units sold.
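The two rankings come from the same aggregation; a sketch of how the top-5 lists might be built (product names and numbers here are toy data):

```python
import pandas as pd

# Toy product-level data (names and figures made up)
df_products = pd.DataFrame({
    'Description': ['GLIDER', 'GLIDER', 'CAKESTAND', 'MUG'],
    'Quantity': [30, 25, 10, 5],
    'Total_Revenue': [7.5, 6.25, 127.0, 12.5],
})

# Sum units and revenue per product, then rank by each measure
per_product = df_products.groupby('Description')[['Quantity', 'Total_Revenue']].sum()
top_by_units = per_product['Quantity'].nlargest(5)
top_by_revenue = per_product['Total_Revenue'].nlargest(5)
print(top_by_units.index[0], top_by_revenue.index[0])  # GLIDER CAKESTAND
```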

4. Which products were returned more frequently?

“Paper craft, little birdie” and “Medium ceramic top storage jar” are the products with the highest returns, cancellations, or discards. Both combined represent 57.16% of the total units lost.
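Returns show up in this dataset as negative quantities; a sketch of how they can be isolated and ranked (toy rows, not the real figures):

```python
import pandas as pd

# Toy data: negative quantities are returned/cancelled units
df_eda = pd.DataFrame({
    'Description': ['PAPER CRAFT', 'STORAGE JAR', 'PAPER CRAFT', 'MUG'],
    'Quantity': [-80995, -74494, 100, -2],
})

# Keep only rows with negative quantities, then rank by units lost
returns = df_eda[df_eda['Quantity'] < 0]
top_returns = returns.groupby('Description')['Quantity'].sum().sort_values()
print(top_returns.index[0])  # PAPER CRAFT
```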

5. What are the top 5 countries that purchased the most?

We can see in the figure that the clients from the United Kingdom purchased the most.

UK customer share represents 85% of total ecommerce revenue.

6. How much is the share of the revenue for each cluster?

I will answer this question a bit later. First, we need to find the clusters.

Last but not least, customer segmentation via cluster analysis.

For clustering, we need to identify customers with similar purchasing behavior. The main objectives are:

  • Identify the number of customer segments
  • Find the customers who are highly valued

Something we need to take into account in this part is that we only need the data for clients with a CustomerID. Also, we have to drop rows with NaN values, as well as those without a StockCode (some of them have codes like POST, AMAZON FEE, CHECK, etc.).

Let's prepare the df for the clustering algorithms.

Note: I used some SQL code to see how it works in Python. The result is good and it works well, but it takes time to complete the task: about 13 seconds to run the code.

sqldf could be an interesting and useful tool in some specific cases. Nevertheless, you can do the same thing with python and you won’t have to mix two languages in the script.

#We take only the Products

#let's group the purchases by customer
df_sales=sqldf("SELECT CustomerID, COUNTRY, COUNT(InvoiceNo) AS Total_purchases, SUM(Total_Revenue) AS Total_Revenue, SUM(Quantity) AS Quantity FROM df_products GROUP BY CustomerID ORDER BY Quantity desc;")
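The same aggregation in plain pandas (which avoids mixing two languages, as noted above) might look like the following sketch, with a toy frame standing in for df_products:

```python
import pandas as pd

# Toy per-line sales data standing in for df_products
df_products = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'Country': ['United Kingdom', 'United Kingdom', 'France'],
    'InvoiceNo': ['A1', 'A2', 'B1'],
    'Total_Revenue': [10.0, 5.0, 20.0],
    'Quantity': [3, 2, 4],
})

# Equivalent of the sqldf GROUP BY: one row per customer, sorted by Quantity
df_sales = (df_products
            .groupby(['CustomerID', 'Country'], as_index=False)
            .agg(Total_purchases=('InvoiceNo', 'count'),
                 Total_Revenue=('Total_Revenue', 'sum'),
                 Quantity=('Quantity', 'sum'))
            .sort_values('Quantity', ascending=False))
print(df_sales)
```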

#dropped the NaN values

#To better identify the cluster centers, we need to cap the top values of Total_Revenue and Quantity to discard outliers.

df_clustering=sqldf("SELECT CustomerID, Total_purchases, Total_Revenue, Quantity FROM df_sales WHERE Total_Revenue < 150000 AND Quantity < 70000")

First glance to the distribution

Every dot is a client, and the figure shows how many items they purchased during the year and the revenue we got from those purchases.

We can see that the majority of them bought fewer than 10,000 items.

We want to know how many segments of clients with similar purchasing behavior we have. This could help us understand our customers a little better and identify which of them have higher value in terms of volume and revenue.

Scaling using minmax scaler

In order to get reliable results, we have to scale the data we are going to use. I used the MinMax scaler, but there are other methods to scale data. I added the scaled 'Quantity' and 'Total_Revenue' columns as new columns in the dataframe 'df_clustering'.
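A sketch of that scaling step, assuming the two feature columns named above (the toy values stand in for the real customer figures):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy customer-level features standing in for df_clustering
df_clustering = pd.DataFrame({'Quantity': [100, 500, 1000],
                              'Total_Revenue': [200.0, 900.0, 2000.0]})

# Rescale both features to the [0, 1] range so neither dominates the distances
scaler = MinMaxScaler()
df_clustering[['Quantity_scaled', 'Total_Revenue_scaled']] = scaler.fit_transform(
    df_clustering[['Quantity', 'Total_Revenue']])
print(df_clustering[['Quantity_scaled', 'Total_Revenue_scaled']])
```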




Elbow method to find the number of clusters

To determine the optimal number of clusters (k) for our K-Means algorithm, we need to implement the elbow method.

Based on the figure, it seems that the optimal number of clusters is 4.
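The elbow loop itself might look like the sketch below, assuming the scaled features sit in a `points` array (synthetic data stands in for the real points):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the scaled (Quantity, Total_Revenue) points
points = np.vstack([rng.normal(c, 0.05, size=(30, 2)) for c in (0.1, 0.4, 0.7, 0.95)])

# Inertia (within-cluster sum of squares) for each candidate k
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(km.inertia_)

# The "elbow" is where this curve flattens; plot range(1, 9) vs inertias
print([round(i, 2) for i in inertias])
```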

Let's try the K-Means method

As you can see in the code below, I ran the algorithm with 4 clusters.
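A sketch of that fit, assuming the scaled features are in a `points` array as in the elbow step (synthetic points stand in for the real customers):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled customer features
points = np.vstack([rng.normal(c, 0.03, size=(40, 2)) for c in (0.1, 0.4, 0.7, 0.95)])

# Fit K-Means with the k suggested by the elbow method
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)  # one cluster label per customer
print(np.unique(labels))  # [0 1 2 3]
```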

In the figure we can see how the 4 segments of customers are distributed. Now we can answer the last EDA question.


6. How much is the share of the revenue for each cluster?

Cluster 2 accounts for 49.75% of the revenue. This group of 4007 clients contributed $3.6M, which means every customer in this segment purchased $907 of products on average.

Cluster 0 represents 32%. This segment purchased $2.4M in products during the year.

Cluster 1 represents 12.81%. 25 customers belong to this group; on average they purchased $37K of products.

Finally, we have cluster 3, which represents 5.22% of the total revenue and items sold during the year.

Only 5 clients belong to this segment. Each of them purchased on average $76K.
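The per-cluster shares above can be computed directly from the labelled frame; a sketch with toy numbers (not the real figures):

```python
import pandas as pd

# Toy labelled customers standing in for df_clustering with K-Means labels
df_clustering = pd.DataFrame({
    'Cluster': [0, 0, 1, 2, 2, 2],
    'Total_Revenue': [100.0, 100.0, 300.0, 150.0, 200.0, 150.0],
})

# Revenue per cluster, then each cluster's share of the total
per_cluster = df_clustering.groupby('Cluster')['Total_Revenue'].sum()
share = (per_cluster / per_cluster.sum() * 100).round(2)
print(share.to_dict())  # {0: 20.0, 1: 30.0, 2: 50.0}
```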

Other algorithms for clustering

Agglomerative algorithm

With the agglomerative algorithm I got almost the same result as with K-Means. Nevertheless, it is interesting to compare both methods.

#'ward' linkage minimizes within-cluster variance and requires Euclidean distances,
#which is the default; note that in recent scikit-learn versions the 'affinity'
#parameter has been renamed 'metric'
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')

y_agg = agg.fit_predict(points)

DBSCAN algorithm

This algorithm works well when you have clusters with similar density.

In this case, DBSCAN found only 3 clusters and flagged some outliers that don't meet the criteria to belong to any of the clusters.

from sklearn.cluster import DBSCAN

#eps is the neighborhood radius on the scaled features; min_samples is the
#minimum number of neighbors for a core point
db = DBSCAN(eps=0.08, min_samples=4)
y_db = db.fit_predict(points)  #label -1 marks outliers/noise

The e-commerce sales EDA showed us some interesting facts about the sales and the products, and the clustering helped us identify the segments of customers we have in our ecommerce.

Are you interested in Exploratory Data Analysis and Tableau? Check out this project: EDA of Oil Pipelines accidents from 2010 to 2016