Customer analysis for marketing strategies | Python | Tableau

In this project of customer analysis I used a marketing dataset with 2240 observations (customers) with 28 variables. These variables provide insights about:

  • Customer profiles
  • Products purchased
  • Marketing campaign results
  • Sales channels performance

The main objective of this analysis is to understand more about the store customers to improve the marketing results by running more efficient ad campaigns.

This project is divided into the following sections:

  1. Data cleaning and standardization
  2. Visual Exploratory Data Analysis
  3. Data visualization in Tableau
  4. Extra: I tested an ML algorithm just for fun 😁

I really like everything what is related with marketing, so I enjoyed doing this project. Let’s get started!

You can take a peek of the entire process, code and figures on my GitHub profile.

I used Jupyter Notebook on Google Colab to develop this project.

Data cleaning and standardization

These are the libraries I used in this project.

This time I decide to use the KNN algorithm to determinate missing values.

I used Google Colab to run Jupyter notebook, It’s pretty handy.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer

df=pd.read_csv(path+'/marketing_data.csv', sep=',')

Let’s get familiar with the data.

At first glance, we can see that the column ‘Income’ has an extra space in its name. We need to fix it.

In the same column, we can see that we have missing values and the Dtype is an object and not an int as it should be. We have to deal with these issues as well.

Finally the date of customer’s enrollment has the incorrect Dtype.

I renamed the column to erase the extra space it had in the column name.

As you can see, the problem with the Dtype was due to non-numeric characters.

df.rename(columns={' Income ':'Income'}, inplace=True)

I added 4 new columns that would be very useful in further analysis.

For the total years of education, I created a new data frame (df) with the education level and the number of years you should study in my country to get each degree.

Afterworth, I merged this df to the main one.


#Total Amount spended

#Total of children  at home
df['Children'] = df.Kidhome+df.Teenhome

#total years of education 
years_edu={'Education': ['Graduation', 'PhD', '2n Cycle', 'Master', 'Basic'],

Dealing with missing values

There are many ways to deal with missing values: this time I decided to use KNN imputer.

As you can remember, there were 24 missing values in the ‘Income’ column. To run this algorithm, I decided to fit the algorithm on the columns: Age, Income, and Edu_years.

We can infer that the missing customer income could be close to the income of other customers with similar education and age.

If you never tried this algorithm before, you should give it a chance! 👍

imputer = KNNImputer()
imputer = KNNImputer(n_neighbors=5,metric='nan_euclidean')

# fit on the dataset[['Income','Age','Edu_years']])

# transform the dataset
X = imputer.transform(df[['Income','Age','Edu_years']])

Statistical summary

This information is very useful to quickly understand more about the data distribution and the presence of outliers.

To identify these outliers, I decided to use 3 standard deviation above or under the mean as criteria to discard an observation.

We can see this more clear with a box plot of the columns we will use in further analysis.


We have to be very careful to determinate which observation is an outlier and which not.

It depends on the context. We can see in the income variable, that practically all the data is beneath the 200K. The observation that exceeds the 600K is definitely an outlier. We confirm this with the criteria we set before (3 std above the mean).

The same happens with the variable Age. We have 3 observations that exceeds the 120 years old.

The probability of having 3 contestants for the Guinness Record of the oldest person in the world purchasing in your store, is quite low 😂. So we can treat these observations as outliers.

“Spending” have one observation at the limit of the upper fence, so we’ll not remove it. The variables Children and Education have observations which we could interpreted as outliers, but they are not. Some Customer can have 3 children, and a some could have basic education. We’ll keep them all.

#Let's remove outliers from the columns 'Income'

#removing outliers from the column 'Age'

Visual Exploratory Data Analysis

Now that the data is ready for analysis, let’s begin with the EDA. We’ll focus on the next questions:

  1. Which country purchased more?
  2. What is the best-selling product category?
  3. Which educational segment purchased more wine?
  4. What is the marital status of the customers?
  5. People with no children buy less or more than people with kids?
  6. Which sale channel is more effective?
  7. How many customers accepted an offer throughout the 5 campaigns?
  8. There is any correlation between variables?

Which country purchased more?

Here we can see that Spain is the country with more sales by far. The second one is Saudi Arabia, and in the third position is Canada.

We don’t have information about the location of this store. Nevertheless, it is possible that it is located in Spain. We will analyze this a bit later when we go over the sales channels’ performance.

What is the best-selling product category?

Wine is the best-selling product in the store, with sales close to 700K.

Meat products are in the second position with sales around 400K.

It is an interesting fact that the main product category is wine. We have to understand more about our customer to use the insights to get better results in marketing campaigns.

What educational segment purchased more wine?

Now, we know that most of the people that purchase wine hold a bachelor degree.

We could continue doing more question to better understand the customer who love our wine and should be the target in further marketing campaigns, but I’ll let you answer other questions using the dashboard at the end of this post 👇. Have fun!

How are the customer distributed by age?

We can see that most of the customers in this dataset are in the age range of 40–70. This is one of the most important factors for market segmentation.

If you think about it, that age group differ from a younger generation on their purchasing habits and how they respond to advertisements.

A younger segment spends more time in social media, and platforms like Facebook, while an older segment tends to check their email frequently.

We should take this into account to develop marketing strategies.

What is the marital status of the customers?

More than the half of the customers are in a couple (64.5%).

21.4% is single, 10.3% is divorced and 3.8% other.

People with no children buy less or more than people with kids?

This figure shows the avg amount that customers spent in the store depending on the number of children they have.

We can see that customer with no children spend more than the ones with 1 or more children.

This information is very important to for marketing, we’ll check later if there is a correlation between spending and the number of children.

Which sale channel is the most effective?

The store had in total 33255 transactions. 39% were done in the store, 27% through the website, 18% by catalog and 16% offers and discounts.

We don’t know if the transactions with discount were done in the store or online. This fact is relevant to determinate the sale channel in which to focus the marketing resources.

How many customers accepted an offer throughout the 5 campaigns?

In total, 666 customers accepted an offer throughout the 5 campaign.

Campaigns 3, 4 and 5 had a similar performance and represents 73% of the total. The second campaign had the lowest performance.

We can use this information to build an ML model to predict if a customer would accept an offer in the initial campaigns.

This could help us to reduce the customer acquisition cost and improve the sales results with due to a more efficient use of resources.

campaign=df[['AcceptedCmp3',	'AcceptedCmp4',	'AcceptedCmp5',	'AcceptedCmp1',	'AcceptedCmp2']]


value	percent
AcceptedCmp1	144	0.22
AcceptedCmp2	30	0.05
AcceptedCmp3	163	0.24
AcceptedCmp4	167	0.25
AcceptedCmp5	162	0.24

There is any correlation between variables?

To categorize the correlations between variables, we need to consider the follow criteria:

For a natural/social/economics science, a correlation coefficient values below 0.3 are considered to be weak; 0.3-0.7 are moderate; >0.7 are strong.

The most interesting correlations for me are:

  • A moderate negative correlation between the total amount spent and the number of children.
  • A moderate positive correlation between the number of children and the purchases with discounts.
  • A strong correlation between the customers’ income and the purchases by catalog.

For sure there are other correlations that could be considered for a marketing strategy.

Visualization in Tableau

Extra: Logistic regression

I wanted to build a logistic regression model to determinate if a customer is more likely to accept an offer in the first campaign. It is well known that in the first contact (let’s say Facebook ads) the conversion rate is low.

Nevertheless, we saw that the first campaign performed well (just 2% less conversions than 3rd and 4th campaigns).

So, let’s build a model to predict if a client would accept a first campaign offer considering demographic variables as: Age, Education, Income, Number of children and Marital Status.

Doing analysis on the target customers

Let’s take a look at the kind of client that accept the first campaign offer.

We can see that this group is in avg 52 years old, graduated, the majority of them don’t have children, the income is above the global mean of the whole customers (52K).


I transformed the categorical variable “Marital Status” to a numerical one. If the customer is in a couple, it’s 1, and if single, 0.

In_couple={'Marital_Status':['Divorced', 'Single', 'Married', 'Together', 'Widow', 'Alone','Absurd', 'YOLO'],

df=pd.merge(df,In_couple, on='Marital_Status')

The ML model

I defined the demographic variables we’re going to need in this model in the variable X, and in the y variable the target.


Here we fit the model to the data and trainee the model. The model learned the relationship between the demographic information and the acceptance of the first offer.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.3)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(), y_train)

Now that the model is trained, we can do a prediction on the entire test data.

There are several ways of measuring the model performance. I used the accuracy. We can see that the model made 93% correct predictions over the total number of customer that accepted the first campaign offer. Nice!



I’m not an ML expert, I just like testing ML algorithms and learn more about their use in practical and real situation.

I really enjoyed doing this project, I hope you had some fun as well 😊. If you have interest in data analysis, EDA and clustering algorithms, check out this project: E-commerce sales EDA and Clustering for customer segmentation.