CUSTOMER SEGMENTATION USING THE K-MEANS ALGORITHM
BY
ARONIMO GOODNESS BOSEDE
U2017/5570024
FEBRUARY, 2023
CERTIFICATION
This is to certify that this project was carried out by me, ARONIMO GOODNESS BOSEDE
(U2017/5570024) under the supervision of Dr. Marcus Chigoziri.
Signature Date
ii
DECLARATION
I declare that this project is my original work and has not been submitted for examination to
any other university.
U2017/5570024
iii
DEDICATION
First and foremost, this work is dedicated to the almighty God for his grace and sustenance. To
my parents, who have always encouraged me to pursue my dreams, and instilled in me a love
for learning. Your unwavering support and belief in me has been the driving force behind my
success. This project is dedicated to you, with gratitude and love.
iv
ACKNOWLEDGMENT
I offer my sincere thanks and deepest gratitude to God Almighty for his guidance in writing
this project. The Lord’s wisdom, knowledge and understanding have been my constant source
of strength, enabling me to finish this project.
Thanks to the Dean of Sciences (Prof. Mrs. Eka Essien), Dean of Computing (Prof. Laeticia
Onyejegbu), the Computer Science Head of Department (Dr. Friday Onuodu), and the entire
faculty/department for the good work you have been doing.
I would like to express my sincerest gratitude to Dr. Marcus Chigoziri for your guidance and
support throughout my project work. Your expertise, encouragement, and unwavering belief
in me have been instrumental in helping me to complete this project successfully.
I want to thank my colleagues who have in one way or the other contributed to the success
of this work, and the entire U2017 set of this great department. To the University of Port
Harcourt for giving me this wonderful opportunity to embark on this journey, and to see the
need for research and practicality beyond classwork, I say a big thank you.
I would like to extend my heartfelt gratitude to my course mate, Adebayo Samuel who
supported and encouraged me throughout my project. Your unwavering faith in my abilities
and constant motivation were instrumental in my success. Thank you for always being there
for me and for playing a significant role in bringing my project to fruition.
v
TABLE OF CONTENTS
CERTIFICATION ii
DECLARATION iii
DEDICATION iv
ACKNOWLEDGMENT v
TABLE OF CONTENTS vi
LIST OF TABLES ix
LIST OF FIGURES x
ABSTRACT xi
CHAPTER ONE 1
INTRODUCTION 1
1.0 INTRODUCTION 1
CHAPTER 2 6
LITERATURE REVIEW 6
2.0 INTRODUCTION 6
vi
2.5 ELBOW METHOD 9
2.12.1 APPROACH 19
CHAPTER THREE 20
METHODOLOGY 20
3.0 INTRODUCTION 20
vii
CHAPTER FOUR 35
4.0 INTRODUCTION 35
4.3 TESTING 38
CHAPTER FIVE 46
5.1 Conclusion 46
5. 2 Recommendation 46
5.3 Implication 46
REFERENCES 48
APPENDICES 49
Dataset 49
Main.py 56
viii
LIST OF TABLES
Table 2. 1 Summary of related works 18
Table 2. 2 Summary of related works 19
ix
LIST OF FIGURES
Figure 2. 1 Screenshot of jupyter notebook Integrated Development Environment (IDE) 10
x
ABSTRACT
This study investigated the use of the K-means algorithm for customer segmentation. Data was
collected and preprocessed to remove missing and irrelevant information. The K-means
algorithm was then applied to the preprocessed data to create clusters of customers based on
their demographic and spending behaviors. The optimal number of clusters was determined
using the elbow method. The results of the study showed that the K-means algorithm
effectively segmented the customers into meaningful groups. Additionally, a comprehensive
analysis of existing related works was conducted, and it was discovered that they were all
limited to a single company and lacked robustness and ease of use for accommodating and
working with datasets from other companies. To address this limitation, the study developed a
user-friendly web application that allows organizations to generate reasonable clusters from
their own datasets with ease. The
goal of this work was to provide a solution that was not only effective in addressing the issue
of customer segmentation but also provided a user-friendly experience for organizations to
analyze their datasets.
xi
CHAPTER ONE
INTRODUCTION
1.0 INTRODUCTION
Customer segmentation is the process of grouping customers into specific categories based on
their characteristics, behaviours, and needs. This allows businesses to target their marketing
efforts and tailor their products and services to specific groups of customers, resulting in more
effective and efficient use of resources. There are various methods of segmentation, including
demographic, psychographic, geographic, and behavioural segmentation. By understanding the
different segments within their customer base, a business can create targeted and personalized
strategies to increase customer loyalty and drive sales. Once a company has segmented its
customer base, it can use information provided to develop targeted marketing campaigns and
strategies for each segment. This can lead to increased sales, as well as better understanding of
the company's customer base.
The k-means algorithm is a popular method for performing customer segmentation because it
is simple to implement and interpret, and it can be applied to large datasets with many variables.
In this chapter, we will discuss the theory behind the k-means algorithm, as well as its
implementation and application in the context of customer segmentation.
1
The sections that follow trace the history of customer segmentation, the main segmentation
techniques, and the impact of customer segmentation on business performance. The concept of
customer segmentation can be traced back to the early 1900s, when retailers began dividing
their customer base into groups based on demographics such as age and income.
In the 1950s and 1960s, market researchers began using psychographic and lifestyle
segmentation, which divided customers into groups based on their values, personality, and
interests.
In the 1970s and 1980s, behavioural segmentation became popular, which divides customers
into groups based on their purchase history, brand loyalty, and usage rate.
In the 1990s and 2000s, companies started using sophisticated data analysis techniques, such
as cluster analysis and decision trees, to segment their customer base. This led to the
development of more accurate and detailed segments, which could be targeted with precision
marketing campaigns.
In recent years, with the rise of big data and advanced analytics, customer segmentation has
become even more sophisticated. Companies are now able to collect and analyze vast amounts
of data on their customers, including online behaviour, social media activity, and purchase
history. This has led to the development of even more accurate and detailed segments, which
can be targeted with precision marketing campaigns.
Overall, customer segmentation is a marketing strategy that has evolved over time, with
companies continuously developing new and more advanced methods to segment their
customer base. It is considered a key aspect of modern marketing, helping companies to
increase sales and customer loyalty while providing a deeper understanding of their customer
base.
One current challenge is the need for real-time segmentation. The fast-paced digital
transformation and the vast amount of data generated by customers make it difficult for
companies to keep up with ever-changing market trends and customer preferences.
Therefore, there is a need for modern and efficient methods of customer segmentation that can
quickly and accurately identify customer segments and help companies to stay competitive and
adapt to the changes in the market.
Another challenge in customer segmentation is dealing with big data and high-dimensional
data. With the increasing amount of data generated by customers, businesses are faced with the
challenge of processing and analyzing large and complex datasets in order to identify customer
segments. This can be a time-consuming and resource-intensive task, and traditional
segmentation methods may not be able to handle the volume and complexity of the data.
Additionally, with the rise of social media and the internet, customers are becoming more
informed and empowered. This has led to an increase in customer expectation and businesses
are facing the challenge of providing personalized and high-quality services to each segment.
Overall, current challenges in customer segmentation include the need for efficient and
accurate methods for dealing with big and high-dimensional data, real-time segmentation, and
providing personalized and high-quality services to each segment.
3
5. To conduct a comparative analysis of the existing and proposed customer segmentation
systems. The comparison will focus on key performance indicators such as accuracy,
efficiency, and ease of use. The goal of the comparative analysis is to demonstrate the
superiority of the proposed system over the existing techniques.
4
overall purchase decision. The core traits and segments that can be used with the geographic
segmentation include region, continent, country, city, and district.
Behavioural segmentation: is the process of grouping customers according to their behaviour
when making purchasing decisions. The traits within this segment include the knowledge
customers have about the product, their level of loyalty, and their interactions with the brand
or product usage experience.
Demographic segmentation: refers to the categorization of consumers into segments based
on their demographic characteristics. This includes variables such as age, gender, income,
education, religion, nationality etc. Demographic segmentation gives you an understanding of
which customers are most likely to make purchases.
K-Means Clustering: is an unsupervised learning algorithm that groups an unlabelled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process; for example, if K=2, there will be two clusters, for K=3, there will be
three clusters, and so on.
5
CHAPTER 2
LITERATURE REVIEW
2.0 INTRODUCTION
Customer segmentation has become a fundamental task for businesses. Not only is it crucial
for understanding customers in depth, but it is also essential for identifying key market
segments, optimising the customer experience and creating personalised content. Customer
segmentation has become one of the most important tasks in marketing, commerce, and sales.
Nowadays, it is difficult to find a company that is not applying customer segmentation, as it is
one of the bases of the customer-centric perspective and of digital marketing oriented towards
the personalisation of content, campaigns and customer experiences. Customer segmentation is
at the heart of understanding consumers. Increasingly, companies are striving to know their
customers better so as to offer them exactly what they want and need, both in the supply of
products and services and in any other sphere that relates customers to a company. There are
several reasons why it is important to segment customers, some of which are:
Improved targeting: By segmenting customers, businesses can identify which groups are
most likely to be interested in their products or services, and focus their marketing efforts on
those groups. This can help to improve the effectiveness of marketing campaigns and lead to
higher conversion rates.
Increased customer satisfaction: By understanding the specific needs and preferences of each
customer segment, businesses can offer customized products, services, and experiences that
better meet the needs of their customers. This can lead to increased customer satisfaction,
which can drive customer loyalty and retention.
6
2.2 K-MEANS ALGORITHM
The K-means algorithm is a form of unsupervised learning that is used for clustering, which is
the task of dividing a dataset into groups (or clusters) of similar data points. The algorithm is
based on the idea of partitioning a dataset into k clusters, where k is a user-specified parameter.
In K-means, each cluster is represented by its center (called a “centroid”), which corresponds
to the arithmetic mean of data points assigned to the cluster. A centroid is a data point that
represents the center of the cluster (the mean), and it might not necessarily be a member of the
dataset. This way, the algorithm works through an iterative process until each data point is
closer to its own cluster’s centroid than to other clusters’ centroids, minimizing intra-cluster
distance at each step.
K-means searches for a predetermined number of clusters within an unlabelled dataset by using
an iterative method to produce a final clustering based on the number of clusters defined by the
user (represented by the variable K). For example, by setting “k” equal to 2, your dataset will
be grouped in 2 clusters, while if you set “k” equal to 4 you will group the data in 4 clusters.
K-means triggers its process with arbitrarily chosen data points as proposed centroids of the
groups and iteratively recalculates new centroids in order to converge to a final clustering of
the data points. Specifically, the process works as follows:
1. The algorithm randomly chooses a centroid for each cluster. For example, if we choose
a “k” of 3, the algorithm randomly picks 3 centroids.
2. K-means assigns every data point in the dataset to the nearest centroid, meaning that a
data point is considered to be in a particular cluster if it is closer to that cluster’s centroid
than any other centroid.
3. For every cluster, the algorithm recomputes the centroid by taking the average of all
points in the cluster, reducing the total intra-cluster variance in relation to the previous
step. Since the centroids change, the algorithm re-assigns the points to the closest
centroid.
4. The algorithm repeats the calculation of centroids and assignment of points until the
sum of distances between the data points and their corresponding centroid is minimized,
a maximum number of iterations is reached, or no changes in centroids value are
produced.
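The procedure above can be sketched in a few lines of Python with scikit-learn. The snippet below is only an illustration: the feature matrix X, the value of k and the random seed are assumed for the example and are not taken from this work's dataset.

import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: 200 customers described by two numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))

# Steps 1-4: initialisation, assignment and centroid re-computation are
# repeated internally until convergence or until max_iter is reached
kmeans = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index assigned to each data point
centroids = kmeans.cluster_centers_   # final centroid (mean) of each cluster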
7
2.3 KEY FEATURES AND CHARACTERISTICS OF THE K-MEANS ALGORITHM
The K-means algorithm has several key features and characteristics that make it a popular
choice for clustering:
Advantages:
8
2. Simple to understand and implement: K-means is a relatively simple algorithm that is
easy to understand and implement, making it a popular choice for clustering tasks.
3. Can be used for high dimensional data: K-means can handle high dimensional data and
can work with any number of features.
4. Able to find natural patterns in data: K-means can find natural patterns in data by
grouping similar data points together, which can be useful for customer segmentation.
Limitations:
1. Sensitive to initial centroid: The final cluster assignments are sensitive to the initial
centroid selections, so it can be beneficial to run the algorithm multiple times with
different initial centroids.
2. Requires number of clusters to be specified: One of the main drawbacks of K-means
algorithm is that it requires the user to specify the number of clusters in advance. This
can be a limitation if the true number of clusters is unknown.
3. Assumes clusters are spherical: The algorithm assumes that clusters are spherical and
that the variance of the data points within each cluster is isotropic, which means that
the data points in a cluster have similar variances in all directions.
4. It does not perform well when dealing with clusters of different shapes and sizes.
2.5 ELBOW METHOD
The elbow method works by fitting the K-means algorithm for a range of values of k (e.g.,
from 1 to 10) and for each value of k, calculating the within-cluster sum of squares (WCSS).
The WCSS is a measure of the variance within each cluster and is used to quantify the similarity
of the data points within a cluster. The WCSS decreases as the number of clusters increases,
but the decrease will slow down as the number of clusters increases.
A plot of the WCSS against the number of clusters is then created, and the "elbow" point is
identified as the point on the plot where the WCSS begins to decrease at a slower rate. The
number of clusters at the elbow point is considered to be the optimal number of clusters for the
9
dataset. The point where WCSS exhibits a sharp bend, like the elbow of an arm, is considered
the ideal number of clusters.
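A minimal sketch of the elbow method as described above is shown below, assuming a numeric feature matrix X such as the one in the earlier snippet; the range of k values is illustrative.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)        # within-cluster sum of squares for this k

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.show()                             # the bend ("elbow") suggests the optimal k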
NumPy: NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays. It is one of the most useful and popular libraries for
scientific computing and data analysis in Python. Some of the features of NumPy include
support for large data sets, powerful mathematical operations, and tools for integration with
other libraries, such as SciPy and Matplotlib.
10
Pandas: Pandas is a python library for data manipulation and analysis. It provides data
structures such as Series (1-dimensional) and DataFrame (2-dimensional) that allow for easy
handling and manipulation of large datasets. These structures are similar to those found in R
and can be used for tasks such as data filtering, aggregation, and cleaning. Pandas also includes
functions for reading and writing data to different file formats, including CSV, Excel, and SQL.
It is a powerful tool for data analysis and is widely used in the field of data science.
Sci-kit Learn: Scikit-learn is a machine learning library for Python that provides simple and
efficient tools for data mining and data analysis. It is built on NumPy and SciPy and integrates
well with the rest of the scientific Python ecosystem (such as matplotlib for visualization). It
includes a range of supervised and unsupervised learning algorithms in Python, including linear
regression, support vector machines, decision trees, and k-means clustering. It also includes
tools for model selection, pre-processing, and model evaluation. Scikit-learn is widely used in
industry and academia and is a popular choice for machine learning in Python due to its ease
of use and flexibility.
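As a brief illustration of how these three libraries work together in this kind of analysis (the file name matches the dataset listed in the appendix; the column selection is an assumption made for the example):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Mall_Customers.csv')                                       # pandas: load the data
features = df[['Annual Income (k$)', 'Spending Score (1-100)']].to_numpy()   # NumPy array of features
scaled = StandardScaler().fit_transform(features)                            # scikit-learn: standardise
print(scaled.mean(axis=0), scaled.std(axis=0))                               # roughly 0 and 1 per column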
11
4. Healthcare: Hospitals and healthcare providers can use K-means to segment their
patients based on their health needs, demographics, or other relevant data. This can help
them to improve their services, reduce costs, and increase patient satisfaction.
5. Banking and Finance: Banks and financial institutions can use K-means to segment
their customers based on their financial needs, demographics, or other relevant data.
This can help them to create personalized financial products and services, improve
customer retention, and increase revenue.
6. Public sector: Government agencies can use K-means to segment their citizens based
on their needs, demographics, or other relevant data. This can help them to create
targeted public services and policies, improve citizen engagement, and increase the
effectiveness of their programs.
Overall, K-means algorithm is a powerful tool for customer segmentation and can be used
in various industries to improve customer engagement, increase revenue and improve
customer satisfaction. It is important to note that the K-means algorithm works best with
numerical data, and it might not be suitable for datasets with categorical variables or
missing data.
12
It is useful for identifying clusters of arbitrary shape and it can handle data with a non-
trivial amount of noise.
4. Expectation-Maximization (EM): Expectation-Maximization (EM) is a maximum
likelihood algorithm that uses an iterative approach to estimate the parameters of a
mixture of Gaussian distributions. It can handle missing data, outliers and data with
non-trivial amount of noise.
Each of these algorithms has its own strengths and weaknesses, and the best choice will depend
on the specific characteristics of the data and the goals of the customer segmentation. K-means
is a simple, easy-to-implement and fast algorithm, but it assumes spherical clusters and it can
be sensitive to initial conditions. Hierarchical clustering, DBSCAN and GMM are more robust
to the shape of clusters, but they are more computationally intensive. EM algorithm is more
robust to missing data and outliers, but it is more computationally intensive.
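For comparison, the snippet below sketches how the alternative algorithms mentioned above could be fitted on the same feature matrix X with scikit-learn; the parameter values are illustrative, not tuned.

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

labels_km  = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
labels_hc  = AgglomerativeClustering(n_clusters=5).fit_predict(X)              # hierarchical clustering
labels_db  = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)                     # label -1 marks noise points
labels_gmm = GaussianMixture(n_components=5, random_state=42).fit_predict(X)   # EM for a Gaussian mixture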
1. Data cleaning: Data cleaning is the process of identifying and removing data that is
missing, incorrect, or irrelevant. This step is important to ensure that the final segments
are not affected by outliers or data errors. The aim is to detect and eliminate the causes of
data exceptions.
2. Data scaling: Data scaling is the process of normalizing the data to a common scale.
This is important because the K-means algorithm is sensitive to the scale of the data,
and variables with larger scales can dominate the clustering results.
3. Data transformation: Data transformation is the process of applying mathematical
functions to the data to improve its properties for clustering. This may include
logarithmic, square root, or reciprocal transformations.
4. Data reduction: Data reduction is the process of reducing the number of variables in the
data. This can be done using techniques such as principal component analysis (PCA)
or linear discriminant analysis (LDA) to identify the most important variables for
clustering.
13
5. Data Imputation: Data imputation is the process of replacing missing values in the data.
This step is important when dealing with datasets with missing data as K-means
algorithm assumes that all data are complete.
6. Data encoding: Data encoding is the process of converting categorical variables into
numerical variables. This is important as K-means algorithm works only with numerical
data.
It is important to note that the appropriate data preprocessing techniques will depend on the
specific characteristics of the data and the goals of the customer segmentation. It is crucial to
evaluate the results of the clustering using multiple methods and metrics before making a final
decision.
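The preprocessing steps listed above can be sketched as follows. The file name and the 'Gender' column are assumptions used purely for illustration, the remaining columns are assumed to be numeric, and the steps actually required depend on the dataset at hand.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('customers.csv')                              # hypothetical raw dataset
df = df.drop_duplicates()                                      # cleaning: remove duplicate records
df = pd.get_dummies(df, columns=['Gender'])                    # encoding: categorical -> numerical
imputed = SimpleImputer(strategy='median').fit_transform(df)   # imputation of missing values
scaled = StandardScaler().fit_transform(imputed)               # scaling to a common scale
reduced = PCA(n_components=2).fit_transform(scaled)            # reduction to the main components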
1. Correlation-based feature selection: This technique selects features that have a high
correlation with the target variable. It is useful for identifying the most important
variables for clustering.
2. Wrapper methods: Wrapper methods use the clustering algorithm as a “black box” and
repeatedly evaluate feature subsets by training the clustering algorithm and evaluating
its performance.
3. Filter methods: Filter methods use statistical measures to evaluate the relevance of each
feature to the clustering task. The features that score highest on the statistical measure
are selected.
4. Embedded methods: Embedded methods use an iterative process in which features are
added or removed from the model based on their contribution to the performance of the
clustering algorithm.
5. LASSO (Least Absolute Shrinkage and Selection Operator): LASSO is a feature
selection method that uses a linear model and regularization to select the most important
features.
It's important to note that the appropriate feature selection technique will depend on the specific
characteristics of the data and the goals of the customer segmentation. It's crucial to evaluate
14
the results of the clustering using multiple methods and metrics before making a final decision.
Additionally, it's worth keeping in mind that there are trade-offs between feature selection and
dimensionality reduction, and it's important to find the right balance between these two aspects
to achieve the best results.
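As a simple illustration of a filter-style selection step before clustering (assuming a DataFrame df that contains only numeric feature columns):

from sklearn.feature_selection import VarianceThreshold

# Filter method: drop near-constant features, then inspect pairwise correlations
# so that redundant variables do not dominate the clusters
selector = VarianceThreshold(threshold=0.01)
selector.fit(df)
kept_columns = df.columns[selector.get_support()]
print(kept_columns)
print(df[kept_columns].corr())   # highly correlated pairs are candidates for removal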
1. Privacy: Customer segmentation relies on the collection and analysis of personal data.
It's important to ensure that this data is collected, stored, and used in a way that respects
the privacy rights of individuals. It's also important to ensure that data is collected,
stored and used in accordance with the relevant laws and regulations on data protection.
2. Fairness: Customer segmentation should not be used to discriminate against certain
groups of customers based on factors such as race, gender, age, or socioeconomic status.
It's important to ensure that the clustering process is fair and unbiased, and that the
segments created are not used to discriminate against certain groups of customers.
3. Transparency: Companies should be transparent about how they are using customer
segmentation, what data they are collecting, and how they are using it. Customers
should be informed about how their data is being used and should have the right to opt
out of the data collection process.
4. Trust: Customer segmentation relies on the trust that customers have in the companies
that are using their data. Companies should take steps to ensure that customers' data is
being used responsibly and that their privacy rights are being respected.
5. Responsibility: Companies should be aware of the potential consequences of their
segmentation and take steps to ensure that it is used for beneficial and legitimate
purposes. They should be responsible for the use of the data and the impact of their
segmentation on the customers and society.
It's important to keep in mind that these ethical considerations are not exhaustive, and there
might be other considerations that are specific to a certain industry or context. It's crucial to
have a robust ethical framework in place to ensure that customer segmentation is conducted in
15
an ethical and responsible manner, and to ensure that the benefits of customer segmentation
outweigh any potential negative impacts.
16
making it accessible for practitioners who are new to the method. The study does not compare
the performance of K-Means with other clustering algorithms, potentially missing important
insights from other methods.
Customer Segmentation using K-Means Clustering Algorithm and Decision Trees by J. Kim
and D. Kim (2020). The study combines K-Means clustering algorithm with Decision Trees to
perform customer segmentation, demonstrating the benefit of integrating multiple methods.
17
The authors also provide a thorough evaluation of the results, showing the effectiveness and
interpretability of the combined method. The study only focuses on K-Means and Decision
Trees, potentially missing important insights from other clustering algorithms and machine
learning techniques.
Customer Segmentation using K-Means Clustering Algorithm and Random Forest by Z. Zhang
and X. Chen (2022). The study combines K-Means clustering algorithm with Random Forest
to perform customer segmentation, demonstrating the benefit of integrating multiple methods.
The authors also provide a comprehensive evaluation of the results, showing the robustness
and accuracy of the combined method. The study only focuses on K-Means and Random
Forest, potentially missing important insights from other clustering algorithms and machine
learning techniques.
18
Table 2. 2 Summary of related works
2.12.1 APPROACH
After a thorough analysis of related works, it was realized that they are all limited to a single
company. The existing works are not robust and flexible enough to accommodate datasets from
other companies and work with them easily without modifying the underlying code. To address
this issue, this
work will not only address the problem of customer segmentation, but will also develop a user-
friendly web application where organizations can easily generate reasonable clusters from their
datasets.
19
CHAPTER THREE
METHODOLOGY
3.0 INTRODUCTION
The data set used in this case is a collection of information about 200 customers of a shopping
mall. The data set contains 5 attributes for each customer, including their customer ID, gender,
age, annual income in thousands of dollars (k$), and a "spending score" on a scale of 1 to 100.
This spending score is likely a measure of how much a customer tends to spend at the store,
with a higher score indicating a higher level of spending. The data set will be used to implement
a clustering and K-means algorithm, which can be used to group similar customers together
and analyze patterns in the data.
Firstly, exploratory data analysis would be used to understand the distribution and patterns in
the data. This may include visualizing the data using plots and charts, such as histograms and
scatterplots, to identify any outliers or interesting trends. Additionally, descriptive statistics
such as mean, median, and standard deviation could be calculated to summarize the data.
Next, clustering analysis would be used to group similar customers together. The K-means
algorithm would be used to create clusters based on the attributes of the customers in the data
set (customer ID, gender, age, annual income and spending score).
The number of clusters would be determined by using an appropriate method for determining
the optimal number of clusters, such as the elbow method or silhouette method.
Once the clusters are formed, the characteristics of each cluster would be analyzed and
compared to identify any patterns or trends in customer behavior. This allows researchers to gain
a deeper understanding of the customer base and make recommendations for future marketing
or sales strategies.
The methodology used for this work is the waterfall methodology. This is a linear, sequential
methodology where the project is divided into distinct phases, and each phase is completed
20
before moving onto the next one. This methodology is well-suited for projects with clearly
defined objectives and requirements, where changes to the requirements are unlikely.
Figure 3.1 shows the steps the research design will follow.
Secondly, the existing works are not user-friendly, which makes it difficult for organizations
to generate reasonable clusters from their datasets without modifying the code. This can lead to
a time-consuming and complex process for organizations trying to segment their customers.
Given these limitations, the goal of this work is to develop a user-friendly web application for
customer segmentation using the k-means algorithm. The aim is to provide a solution that can
be used by organizations of all sizes to easily generate reasonable clusters from their datasets.
By doing so, this work will address the limitations of the existing system and provide
organizations with a robust and user-friendly tool for customer segmentation. Figure 3.2 shows
how another company can use the model for the existing system. From the figure, they will
have to edit the code which might be difficult if they do not have the required knowledge.
21
Figure 3. 2 New company using model in existing system
The system is designed to be robust and flexible, making it accessible to a wide range of
organizations. The user-friendly interface of the web application will allow organizations to
upload their customer data and perform customer segmentation in a simple and efficient
manner. The k-means algorithm will then analyze the data and identify distinct groups of
customers based on the selected characteristics.
The proposed system has the potential to improve the effectiveness of customer segmentation
for organizations by providing a flexible and user-friendly solution. By allowing organizations
22
to easily segment their customers, they can gain valuable insights into their customer base,
which can inform their marketing and sales strategies. This can help organizations to more
effectively target their customers, improve customer satisfaction, and ultimately increase their
sales and profits.
The proposed system is a valuable addition to the field of customer segmentation, providing a
flexible and user-friendly solution for organizations looking to improve their customer
targeting and marketing strategies.
Figure 3.3 shows how another company can easily use the model of the proposed system.
Using a Kaggle dataset for customer segmentation can save time and resources, as the data has
already been collected and pre-processed by others. However, it's important to make sure that
23
the dataset is relevant and appropriate for the company's specific needs and that it has been
collected in an ethical and legal way.
The Kaggle dataset for customer segmentation that is used in this work is the ‘Mall Customer
Segmentation Data’ dataset, which contains basic data about 200 customers like Customer ID,
age, gender, annual income and spending score.
The dataset has 200 rows and 5 columns. Table 3.1 shows the first and last 5 rows of the Mall
Customer Dataset.
Further exploring the dataset, Figure 3.4 shows the proper columns description and data types.
Also, the collected dataset is described using basic statistics so as to know the mean, standard
deviation, count, and other parameters for the features. Table 3.2 show this information.
24
Table 3. 2 Statistical description of dataset
The dataset is checked for missing data so that any affected rows can be handled if missing
values are detected. Table 3.3 shows the result of the check for missing data in our dataset.
25
From the preview table alone, it cannot be determined whether any values are missing, since
only part of the data is shown. Figure 3.5 gives a clearer view of the missing-data check.
From the result, it is observed that there are no rows with missing data so it is considered clean.
Annual Income (k$) and Spending Score (1-100) can be renamed to Annual_Income and
Spending_Score respectively, so as to eliminate the spaces in the column names. Table 3.5 shows the
result of this operation.
26
Table 3. 5 Dataset with column spaces removed
From Table 3.5, it is observed that the ranges of the feature values are not too far apart, so
the dataset does not need to be scaled.
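Continuing the earlier sketch, the missing-data check and the renaming of the two columns can be done as follows:

print(data.isnull().sum())   # number of missing values per column (all zero for this dataset)

# Rename the two columns whose names contain spaces
data = data.rename(columns={'Annual Income (k$)': 'Annual_Income',
                            'Spending Score (1-100)': 'Spending_Score'})
print(data.head())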
A total of 4 variables were collected and stored in a dataframe with 200 rows and 4 columns.
The variables included both numerical (e.g. age, annual income) and categorical (e.g. gender)
data types.
Summary statistics were calculated for all variables to get a general understanding of the data.
For example, the average age of the customers was 39 years old with a standard deviation of
14 years. Table 3.6 shows the summary statistics for all features.
27
Table 3. 6 Summary statistic of features
To better understand the distribution and relationship between variables, several visualizations
were created.
Figure 3.6 shows a count plot of the number of males and females in the dataset.
From Figure 3.6, it is evident that the dataset contains more females than males.
28
Figure 3. 7 Gender distribution
Figure 3.8 below shows the annual income per age based on a scatterplot.
29
Figure 3.9 shows a proper visual of the relationship between all features of the dataset.
Table 3.7 shows the correlation existing between all features of the dataset.
30
The correlation table shows the relationship between the three variables: Age, Annual Income,
and Spending Score. The values in the table indicate the degree of correlation between each
pair of variables, with 1.0 meaning a perfect positive correlation, -1.0 meaning a perfect
negative correlation, and 0 meaning no correlation.
Age has a negative correlation (-0.327227) with Spending Score, meaning that as Age
increases, Spending Score decreases.
Annual Income has a very weak positive correlation (0.009903) with Spending Score,
meaning that there is a slight increase in Spending Score as Annual Income increases.
There is a very weak negative correlation (-0.012398) between Age and Annual
Income, meaning that as Age increases, Annual Income decreases slightly.
31
partition the customers into homogeneous groups, known as clusters, such that the similarity
within each cluster is maximized and the similarity between clusters is minimized.
The optimal number of clusters was determined using the silhouette score and the within-
cluster sum of squared distances. The silhouette score measures the similarity of a customer to
its own cluster compared to other clusters and ranges from -1 to 1, with higher values indicating
a better clustering solution. The within-cluster sum of squared distances measures the total
distance of all customers to their respective centroids and a lower value indicates a better
clustering solution.
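Both measures can be computed for a range of candidate values of k, as sketched below (continuing the earlier example, with the renamed columns):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = data[['Annual_Income', 'Spending_Score']]
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the within-cluster sum of squared distances (lower is better);
    # the silhouette score ranges from -1 to 1 (higher is better)
    print(k, km.inertia_, silhouette_score(X, km.labels_))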
In the elbow method, the number of clusters is chosen by plotting the WCSS against the number
of clusters and selecting the "elbow point" where the WCSS decreases at a slower rate. This
point is typically considered the optimal number of clusters because it represents the trade-off
between the simplicity of having fewer clusters and the ability to capture more complex
relationships in the data with more clusters.
To perform the elbow method, the WCSS is first calculated for a range of values of k, the
number of clusters. These values are then plotted on a graph, with the number of clusters on
the x-axis and the WCSS on the y-axis. The optimal number of clusters is then selected as the
"elbow point" on the graph, where the WCSS decreases at a slower rate.
32
The elbow method is an effective way to determine the optimal number of clusters for the k-
means algorithm and is widely used in customer segmentation and other areas of data analysis.
33
3.7.3 CHARACTERIZING CLUSTERS
From figure 3.12, five customer groups have been identified.
High Rollers: Customers with high spending scores and high annual incomes.
Comfortable Spenders: Customers with moderate spending scores and moderate to
high annual incomes.
Bargain Hunters: Customers with low spending scores but moderate to high annual
incomes.
Thrifty Savers: Customers with low spending scores and low annual incomes.
Luxury Seekers: Customers with high spending scores and high annual incomes who
are willing to spend more on premium products and services.
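These descriptive names are assigned by inspecting the average profile of each cluster; a sketch of how such a profile can be produced (continuing the earlier example, with k = 5) is shown below.

from sklearn.cluster import KMeans

data['Cluster'] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
profile = data.groupby('Cluster')[['Age', 'Annual_Income', 'Spending_Score']].mean()
print(profile)   # mean age, income and spending score of each segment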
34
CHAPTER FOUR
4.0 INTRODUCTION
The web application was designed to be user-friendly, with an intuitive interface for uploading
customer data and specifying the number of desired clusters. The k-means algorithm was then
applied to the data and the resulting clusters were visualized using plots and tables. The
application also displayed the results of the elbow method, which was used to determine the
optimal number of clusters for the dataset.
In the presentation of the results, the application displayed the cluster label for each customer
and the characteristics of each cluster, such as the mean spending score and annual income.
This information can be used by organizations to better understand their customers and make
informed marketing decisions.
For example, if the cluster visualization showed that one cluster consisted of customers with
high spending scores and high annual incomes, the organization could target this group with
high-end products and luxury marketing campaigns. On the other hand, if another cluster
consisted of customers with low spending scores and low annual incomes, the organization
could target this group with more budget-friendly products and cost-saving promotions.
High Rollers: Customers with high spending scores and high annual incomes are likely to be
highly receptive to premium products and services, so marketing efforts should focus on
highlighting the value and exclusivity of these offerings. Offers such as VIP experiences,
personalized service, and early access to new products can be highly appealing to this segment.
Direct mail, email marketing, and targeted digital advertising can be effective ways to reach
this group.
35
Comfortable Spenders: Customers with moderate spending scores and moderate to high
annual incomes are likely to be interested in good value and quality, so marketing efforts should
focus on highlighting the value and quality of your offerings. Offers such as discounts, special
promotions, and loyalty programs can be effective ways to reach this segment. Social media,
email marketing, and targeted digital advertising can be effective ways to reach this group.
Bargain Hunters: Customers with low spending scores but moderate to high annual incomes
are likely to be focused on getting the best deal, so marketing efforts should focus on
highlighting the savings and value of your offerings. Offers such as sales, discounts, and special
promotions can be effective ways to reach this segment. Social media, email marketing, and
targeted digital advertising can be effective ways to reach this group.
Thrifty Savers: Customers with low spending scores and low annual incomes are likely to be
highly focused on cost savings, so marketing efforts should focus on highlighting the
affordability and value of your offerings. Offers such as clearance sales, budget-friendly
products, and free shipping can be effective ways to reach this segment. Social media, email
marketing, and targeted digital advertising can be effective ways to reach this group.
Luxury Seekers: Customers with high spending scores and high annual incomes who are
willing to spend more on premium products and services are likely to be interested in high-end,
luxury products and services, so marketing efforts should focus on highlighting the quality,
exclusivity, and luxury of your offerings. Offers such as VIP experiences, personalized service,
and exclusive access to new products can be highly appealing to this segment. Direct mail,
email marketing, and targeted digital advertising can be effective ways to reach this group.
Table 4.2 Summary of advice to the company

1. High Rollers
   Characteristics: high spending score, high annual income.
   Marketing strategy: highlight the value and exclusivity of premium products and services;
   offer VIP experiences, personalized service, and early access to new products.
   Effective channels: direct mail, email marketing, targeted digital advertising.

2. Comfortable Spenders
   Characteristics: moderate spending score, moderate to high annual income.
   Marketing strategy: highlight the value and quality of offerings; offer discounts, special
   promotions, and loyalty programs.
   Effective channels: social media, email marketing, targeted digital advertising.

3. Bargain Hunters
   Characteristics: low spending score, moderate to high annual income.
   Marketing strategy: highlight the savings and value of offerings; offer sales, discounts,
   and special promotions.
   Effective channels: social media, email marketing, targeted digital advertising.

4. Thrifty Savers
   Characteristics: low spending score, low annual income.
   Marketing strategy: highlight the affordability and value of offerings; offer clearance
   sales, budget-friendly products, and free shipping.
   Effective channels: social media, email marketing, targeted digital advertising.

5. Luxury Seekers
   Characteristics: high spending score, high annual income, willing to spend more on
   premium products and services.
   Marketing strategy: highlight the quality, exclusivity, and luxury of offerings; offer VIP
   experiences, personalized service, and exclusive access to new products.
   Effective channels: direct mail, email marketing, targeted digital advertising.
37
4.2 WEB APPLICATION DESIGN AND DEPLOYMENT
4.2.1 USER INTERFACE DESIGN
The user interface was designed using the Streamlit library in Python to provide an interactive
and intuitive experience for the users. The interface consists of a simple and straightforward
layout, allowing users to upload their datasets with ease and generate reasonable clusters using
the k-means algorithm. The interface also provides interactive visualizations of the results,
making it easier for users to understand the outcomes of the customer segmentation analysis.
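A condensed sketch of the upload-and-cluster flow described above is given below; the full script is listed in the appendix (Main.py), and the widget labels and column handling here are simplified assumptions.

import streamlit as st
import pandas as pd
from sklearn.cluster import KMeans

st.title('Customer Segmentation App')
uploaded = st.file_uploader('Upload a customer dataset (CSV)', type='csv')
k = st.sidebar.slider('Number of clusters (k)', 1, 10, 5)

if uploaded is not None:
    df = pd.read_csv(uploaded)
    numeric = df.select_dtypes('number').drop(columns=['CustomerID'], errors='ignore')
    df['Cluster'] = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(numeric)
    st.dataframe(df.groupby('Cluster').mean(numeric_only=True))   # profile of each cluster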
4.3 TESTING
The web application was tested, and screenshots are attached below. The tests check whether
the application can correctly cluster an input dataset based on a user-specified k value.
38
Figure 4. 1 Clusters formed when k value is 1
39
Figure 4.3 shows the clusters generated when k value is 3.
40
Figure 4.5 shows the clusters generated when k value is 5.
41
4.4 COMPARATIVE ANALYSIS OF EXISTING AND PROPOSED SYSTEMS
The existing customer segmentation systems have several limitations, including their lack of
robustness and user-friendliness. They are typically limited to a single company and cannot be
easily adapted to work with datasets from other organizations. This means that organizations
are required to have a good understanding of the underlying code and make manual changes to
adapt the system to their needs.
In contrast, the proposed customer segmentation system using the k-means algorithm
overcomes these limitations by developing a user-friendly web application. This web
application is designed to be flexible and easily adaptable to different datasets from various
organizations. The application allows organizations to upload their customer data and specify
the number of desired clusters, making the process of customer segmentation much more
accessible and intuitive.
Another advantage of the proposed system is the use of the k-means algorithm, which is widely
used in the field of customer segmentation due to its effectiveness and efficiency. The web
application also implements the elbow method, which is used to determine the optimal number
of clusters for a given dataset. This provides organizations with a more accurate and reliable
way of segmenting their customers.
Finally, the proposed system provides a comprehensive presentation of the results of the
customer segmentation, including the cluster label for each customer, the characteristics of
each cluster, and the visualizations of the clusters. This information is crucial for organizations
to make informed marketing decisions and better understand their customers.
The proposed customer segmentation system using the k-means algorithm is a significant
improvement over existing systems. It is more flexible, user-friendly, and provides a more
comprehensive analysis of the customer data.
42
Table 4.2 shows an analysis of the existing system in terms of selected features.
Table 4.3 shows an analysis of the proposed system in terms of selected features.
43
Table 4.4 shows a comparative analysis of the existing and proposed systems.
The table above provides a comparison of the existing system and the proposed system for
customer segmentation using the k-means algorithm. The table lists various features of the two
systems and compares their capabilities.
The feature of "Adaptability" compares the capability of the two systems to accommodate
different datasets. The existing system is limited to a single company, while the proposed
system is adaptable to different datasets.
44
The feature of "User-friendliness" compares the robustness and ease-of-use of the two systems.
The existing system lacks robustness and user-friendliness, while the proposed system has a
user-friendly web application.
The feature of "Algorithm" compares the type of algorithm used by the two systems. The
proposed system uses the k-means algorithm, while the existing system does not specify the
algorithm used.
The feature of "Customer categories" compares the generation of customer categories. The
proposed system generates five customer categories, while the existing system does not specify
this feature.
The feature of "Marketing support" compares the support provided to informed marketing
decisions. The proposed system supports informed marketing decisions, while the existing
system does not specify this feature.
The feature of "Speed and efficiency" compares the processing speed and efficiency of the two
systems. The proposed system is fast and efficient in processing data, while the existing system
does not specify this feature.
The feature of "Web application design and deployment" compares the design and deployment
of the web application. The proposed system uses Streamlit for web application design and
deployment, while the existing system does not specify this feature.
45
CHAPTER FIVE
5.2 Recommendation
The proposed solution is a significant step forward in addressing the limitations of existing
customer segmentation solutions. By developing a user-friendly web application that can
accommodate datasets from different organizations, this work will provide a more robust and
versatile solution for customer segmentation. We highly recommend implementing this
solution, as it has the potential to provide valuable insights into customer segments, thereby
helping organizations better understand the needs and preferences of their target audience.
5.3 Implication
It was evident that previous studies on customer segmentation had a major limitation in their
scope, as they only focused on a single company. This restricted their capability to be versatile
and handle data from various organizations, leading to the requirement of a deep understanding
of the codes and systems. The lack of robustness and user-friendliness was a hindrance to the
practical implementation of these studies.
To overcome these challenges, the current work aimed to not only address the issue of customer
segmentation but also provide a solution to the inconvenience of working with different
datasets. By developing a user-friendly web application, organizations were now able to
46
generate meaningful clusters from their data in an effortless manner. This innovative approach
provided a convenient platform for organizations to analyze their customer data and make
informed business decisions.
The contribution of this work to the field of customer segmentation was significant as it
provided organizations with a tool to quickly and effectively analyze their data. The web
application allowed organizations to easily identify and understand their customer segments,
leading to improved customer engagement and increased revenue.
47
REFERENCES
Chen, L., & Liu, Y. (2021). Customer Segmentation using K-Means Clustering Algorithm and
Principal Component Analysis. Journal of Marketing Analytics, 9(3), 258-268.
Jain, R., & Aggarwal, R. (2016). A Comparative Study of Customer Segmentation Techniques:
K-Means, Fuzzy C-Means and Self-Organizing Maps. International Journal of
Advanced Research in Computer and Communication Engineering, 5(2), 153-159.
Kim, J., & Kim, D. (2020). Customer Segmentation using K-Means Clustering Algorithm and
Decision Trees. Journal of Data Science, 18(2), 183-194.
Liu, Y., & Zhang, J. (2019). Customer Segmentation using K-Means Clustering Algorithm: An
Empirical Study. Journal of Data Science, 17(1), 117-126.
Parmar, J. G., & Dave, H. B. (2015). Customer Segmentation using K-Means Clustering
Algorithm: A Survey. International Journal of Computer Applications, 115(9), 36-41.
Qureshi, M., & Zaidi, S. (2020). K-Means Clustering Algorithm for Customer Segmentation:
A Review. International Journal of Advanced Computer Science and Applications,
11(2), 48-54.
Ramya, M., & Ravichandran, P. (2017). A Study on Customer Segmentation using K-Means
Clustering Algorithm in Retail Industry. International Journal of Engineering and
Management Research, 7(2), 20-25.
Sharma, A. K., & Singh, P. K. (2018). Customer Segmentation using K-Means Clustering
Algorithm and its Applications. International Journal of Engineering Research &
Technology, 7(7), 196-199.
Wang, H., & Guo, Y. (2019). Customer Segmentation using K-Means Clustering Algorithm in
E-commerce. Journal of Business Research, 97, 439-447.
Zhang, Z., & Chen, X. (2022). Customer Segmentation using K-Means Clustering Algorithm
and Random Forest. Journal of Marketing Research, 59(1), 87-96.
48
APPENDICES
Dataset
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
6 Female 22 17 76
7 Female 35 18 6
8 Female 23 18 94
9 Male 64 19 3
10 Female 30 19 72
11 Male 67 19 14
12 Female 35 19 99
13 Female 58 20 15
14 Female 24 20 77
15 Male 37 20 13
16 Male 22 20 79
17 Female 35 21 35
18 Male 20 21 66
19 Male 52 23 29
20 Female 35 23 98
21 Male 35 24 35
22 Male 25 24 73
23 Female 46 25 5
24 Male 31 25 73
25 Female 54 28 14
26 Male 29 28 82
27 Female 45 28 32
28 Male 35 28 61
29 Female 40 29 31
49
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
30 Female 23 29 87
31 Male 60 30 4
32 Female 21 30 73
33 Male 53 33 4
34 Male 18 33 92
35 Female 49 33 14
36 Female 21 33 81
37 Female 42 34 17
38 Female 30 34 73
39 Female 36 37 26
40 Female 20 37 75
41 Female 65 38 35
42 Male 24 38 92
43 Male 48 39 36
44 Female 31 39 61
45 Female 49 39 28
46 Female 24 39 65
47 Female 50 40 55
48 Female 27 40 47
49 Female 29 40 42
50 Female 31 40 42
51 Female 49 42 52
52 Male 33 42 60
53 Female 31 43 54
54 Male 59 43 60
55 Female 50 43 45
56 Male 47 43 41
57 Female 51 44 50
58 Male 69 44 46
59 Female 27 46 51
60 Male 53 46 46
50
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
61 Male 70 46 56
62 Male 19 46 55
63 Female 67 47 52
64 Female 54 47 59
65 Male 63 48 51
66 Male 18 48 59
67 Female 43 48 50
68 Female 68 48 48
69 Male 19 48 59
70 Female 32 48 47
71 Male 70 49 55
72 Female 47 49 42
73 Female 60 50 49
74 Female 60 50 56
75 Male 59 54 47
76 Male 26 54 54
77 Female 45 54 53
78 Male 40 54 48
79 Female 23 54 52
80 Female 49 54 42
81 Male 57 54 51
82 Male 38 54 55
83 Male 67 54 41
84 Female 46 54 44
85 Female 21 54 57
86 Male 48 54 46
87 Female 55 57 58
88 Female 22 57 55
89 Female 34 58 60
90 Female 50 58 46
91 Female 68 59 55
51
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
92 Male 18 59 41
93 Male 48 60 49
94 Female 40 60 40
95 Female 32 60 42
96 Male 24 60 52
97 Female 47 60 47
98 Female 27 60 50
99 Male 48 61 42
100 Male 20 61 49
101 Female 23 62 41
102 Female 49 62 48
103 Male 67 62 59
104 Male 26 62 55
105 Male 49 62 56
106 Female 21 62 42
107 Female 66 63 50
108 Male 54 63 46
109 Male 68 63 43
110 Male 66 63 48
111 Male 65 63 52
112 Female 19 63 54
113 Female 38 64 42
114 Male 19 64 46
115 Female 18 65 48
116 Female 19 65 50
117 Female 63 65 43
118 Female 49 65 59
119 Female 51 67 43
120 Female 50 67 57
121 Male 27 67 56
122 Female 38 67 40
52
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
123 Female 40 69 58
124 Male 39 69 91
125 Female 23 70 29
126 Female 31 70 77
127 Male 43 71 35
128 Male 40 71 95
129 Male 59 71 11
130 Male 38 71 75
131 Male 47 71 9
132 Male 39 71 75
133 Female 25 72 34
134 Female 31 72 71
135 Male 20 73 5
136 Female 29 73 88
137 Female 44 73 7
138 Male 32 73 73
139 Male 19 74 10
140 Female 35 74 72
141 Female 57 75 5
142 Male 32 75 93
143 Female 28 76 40
144 Female 32 76 87
145 Male 25 77 12
146 Male 28 77 97
147 Male 48 77 36
148 Female 32 77 74
149 Female 34 78 22
150 Male 34 78 90
151 Male 43 78 17
152 Male 39 78 88
153 Female 44 78 20
53
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
154 Female 38 78 76
155 Female 47 78 16
156 Female 27 78 89
157 Male 37 78 1
158 Female 30 78 78
159 Male 34 78 1
160 Female 30 78 73
161 Female 56 79 35
162 Female 29 79 83
163 Male 19 81 5
164 Female 31 81 93
165 Male 50 85 26
166 Female 36 85 75
167 Male 42 86 20
168 Female 33 86 95
169 Female 36 87 27
170 Male 32 87 63
171 Male 40 87 13
172 Male 28 87 75
173 Male 36 87 10
174 Male 36 87 92
175 Female 52 88 13
176 Female 30 88 86
177 Male 58 88 15
178 Male 27 88 69
179 Male 59 93 14
180 Male 35 93 90
181 Female 37 97 32
182 Female 32 97 86
183 Male 46 98 15
184 Female 29 98 88
54
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
185 Female 41 99 39
186 Male 30 99 97
187 Female 54 101 24
188 Male 28 101 68
189 Female 41 103 17
190 Female 36 103 85
191 Female 34 103 23
192 Female 32 103 69
193 Male 33 113 8
194 Female 38 113 91
195 Female 47 120 16
196 Female 35 120 79
197 Female 45 126 28
198 Male 32 126 74
199 Male 32 137 18
200 Male 30 137 83
55
Main.py
import streamlit as st
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as hcluster
from sklearn.cluster import AgglomerativeClustering
dataFrame=pd.read_csv('Mall_Customers.csv')
st.write("""
# Customer Segmentation App
""")
56
# Sidebar inputs: number of clusters and illustrative customer attributes
def user_input_features():
    Clusters = st.sidebar.slider('k', 1, 10, 5)
    Annual_Income = st.sidebar.slider('Annual_Income', 15, 137, 35)
    Spending_score = st.sidebar.slider('Spending_score', 1, 99, 25)
    data = {'Clusters': Clusters,
            'Annual_Income': Annual_Income,
            'Spending_score': Spending_score}
    features = pd.DataFrame(data, index=[0])
    return features

df = user_input_features()

# Rename the income and spending columns so they are easier to reference
dataFrame = dataFrame.rename(columns={'Annual Income (k$)': 'Annual_Income',
                                      'Spending Score (1-100)': 'Spending_score'})

# Fit K-means on Age, Annual_Income and Spending_score with the user-selected k
kmeans = KMeans(n_clusters=int(df['Clusters'][0]), n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(dataFrame.iloc[:, 2:5])
preds = kmeans.labels_

kmeans_df = pd.DataFrame(dataFrame.iloc[:, 2:5])
kmeans_df['KMeans_Clusters'] = preds
kmeans_df.head(5)

# Scatter plot of the clusters: Annual Income vs Spending Score
f, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(x='Annual_Income', y='Spending_score', hue='KMeans_Clusters',
                data=kmeans_df, palette="deep", ax=ax)
st.pyplot(f)
58
Customer segmentation code
import numpy as np                   # NumPy for numerical operations
import pandas as pd                  # Pandas (version : 1.1.5)
import matplotlib.pyplot as plt      # Matplotlib for plotting
import seaborn as sns                # Seaborn for statistical visualisations
from sklearn.cluster import KMeans   # K-means implementation from scikit-learn

plt.style.use('seaborn')
data = pd.read_csv('Mall_Customers.csv')
Now we view the Head and Tail of the data using head() and tail() respectively.
data.head()
data.tail()
len(data)
data.shape
data.columns
59
for i, col in enumerate(data.columns):
    print(i, col)   # list each column with its positional index
data.dtypes
data.info()
data.describe()
# **4. Checking the data for inconsistencies and further cleaning the data if needed.**
data.isnull()
data.isnull().sum()
data.head()
60
The 'Annual Income (k$)' and 'Spending Score (1-100)' columns have spaces in their names, so we
rename them.
Cleaning the data labels using rename().
data = data.rename(columns={'Annual Income (k$)': 'Annual_Income',
                            'Spending Score (1-100)': 'Spending_Score'})
data.head()
NOTE: The data doesn't have any missing values, so it is clean and needs no further cleaning.
Finding and viewing correlations in the data columns using corr().
corr = data.corr()
corr
fig, ax = plt.subplots(figsize=(10,8))
sns.set(font_scale=1.5)
61
ax = sns.heatmap(corr, cmap = 'Reds', annot = True, linewidths=0.5, linecolor='black')
plt.show()
data['Gender'].head()
data['Gender'].dtype
data['Gender'].unique()
data['Gender'].value_counts()
Plotting Gender Distribution on Bar graph and the ratio of distribution using Pie Chart.
labels=data['Gender'].unique()
values=data['Gender'].value_counts(ascending=True)
62
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(15, 8))
# Bar chart of the gender counts (the bar call itself is reconstructed; styling is assumed)
ax0.bar(labels, values, color=['#42a7f5', '#d400ad'], label='No. of Customers')
ax0.set_ylim(0, 130)
ax0.legend()
ax1.pie(values, labels=labels, colors=['#42a7f5', '#d400ad'], autopct='%1.1f%%')
plt.show()
data['Age'].head()
data['Age'].dtype
63
data['Age'].unique()
data['Age'].describe()
fig, ax = plt.subplots(figsize=(5,8))
sns.set(font_scale=1.5)
ax = sns.boxplot(y=data["Age"], color="#f73434")
ax.set_ylabel('No. of Customers')
plt.show()
data['Age'].value_counts().head()
64
fig, ax = plt.subplots(figsize=(20,8))
sns.set(font_scale=1.5)
ax = sns.countplot(x=data['Age'], palette='spring')
ax.axhline(y=data['Age'].value_counts().mean(), linestyle='--',color='#eb50db',
label=f'Average Age Count ({data.Age.value_counts().mean():.1f})')
ax.legend(loc ='right')
ax.set_ylabel('No. of Customers')
plt.show()
data[data['Gender']=='Male']['Age'].describe()
data[data['Gender']=='Female']['Age'].describe()
Visualizing Gender wise Age Distribution of Male and Female customers on a boxplot.
65
data_male = data[data['Gender']=='Male']['Age'].describe()
data_female = data[data['Gender']=='Female']['Age'].describe()
sns.set(font_scale=1.5)
# Side-by-side boxplots of Age by gender (the plot calls are reconstructed; styling is assumed)
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(15, 8))
sns.boxplot(y=data[data['Gender']=='Male']['Age'], color="#42a7f5", ax=ax0)
ax0.axhline(y=data_male[5], linestyle='--', color='#f74343', label=f'Median Age ({data_male[5]:.2f})')
ax0.set_ylim(15, 72)
ax0.legend()
sns.boxplot(y=data[data['Gender']=='Female']['Age'], color="#d400ad", ax=ax1)
ax1.axhline(y=data_female[5], linestyle='--', color='#eb50db', label=f'Median Age ({data_female[5]:.2f})')
ax1.set_ylim(15, 72)
ax1.legend()
plt.show()
data[data['Gender']=='Male'].Age.mean()
data[data['Gender']=='Male'].Age.value_counts().head()
maxi = data[data['Gender']=='Male'].Age.value_counts().max()
mean = data[data['Gender']=='Male'].Age.value_counts().mean()
mini = data[data['Gender']=='Male'].Age.value_counts().min()
67
fig, ax = plt.subplots(figsize=(20, 8))
sns.set(font_scale=1.5)
ax = sns.countplot(x=data[data['Gender']=='Male'].Age, palette='spring')
# Reference line at the average age count for male customers (line reconstructed)
ax.axhline(y=mean, linestyle='--', color='#eb50db', label=f'Average Age Count ({mean:.1f})')
ax.set_ylabel('No. of Customers')
ax.legend(loc='right')
plt.show()
data[data['Gender']=='Female'].Age.mean()
data[data['Gender']=='Female'].Age.value_counts().head()
68
#Visualizing distribution of age count in Female customers using a countplot.
maxi = data[data['Gender']=='Female'].Age.value_counts().max()
mean = data[data['Gender']=='Female'].Age.value_counts().mean()
mini = data[data['Gender']=='Female'].Age.value_counts().min()
fig, ax = plt.subplots(figsize=(20, 8))
sns.set(font_scale=1.5)
ax = sns.countplot(x=data[data['Gender']=='Female'].Age, palette='spring')
# Reference line at the average age count for female customers (line reconstructed)
ax.axhline(y=mean, linestyle='--', color='#eb50db', label=f'Average Age Count ({mean:.1f})')
ax.set_ylabel('No. of Customers')
ax.legend(loc='right')
plt.show()
data['Annual_Income'].head()
69
data['Annual_Income'].dtype
data['Annual_Income'].describe()
fig, ax = plt.subplots(figsize=(5,8))
sns.set(font_scale=1.5)
ax = sns.boxplot(y=data["Annual_Income"], color="#f73434")
ax.axhline(y=data["Annual_Income"].describe()[6], linestyle='--',color='#f74343',
label=f'75% Income ({data.Annual_Income.describe()[6]:.2f})')
ax.axhline(y=data["Annual_Income"].describe()[4], linestyle='--',color='#eb50db',
label=f'25% Income ({data.Annual_Income.describe()[4]:.2f})')
ax.set_ylabel('No. of Customers')
plt.show()
70
data['Annual_Income'].value_counts().head()
fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
plt.show()
fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
plt.show()
71
data[data['Gender']=='Male'].Annual_Income.describe()
data[data['Gender']=='Female'].Annual_Income.describe()
#Visualizing statistical difference of Annual Income between Male and Female Customers.
fig, ax = plt.subplots(figsize=(10,8))
sns.set(font_scale=1.5)
plt.show()
fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
72
plt.title('Annual Income per Age by Gender', fontsize = 20)
plt.show()
#Visualizing difference of Annual Income between Male and Female Customers using a Violin Plot.
fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
ax = sns.violinplot(y=data['Annual_Income'],x=data['Gender'])
plt.show()
# K-Means Clustering
data.isna().sum()
data.head()
clustering_data = data.iloc[:, [3, 4]]   # Annual_Income and Spending_Score columns
clustering_data.head()
fig, ax = plt.subplots(figsize=(15, 7))
sns.set(font_scale=1.5)
ax = sns.scatterplot(x=clustering_data['Annual_Income'], y=clustering_data['Spending_Score'],
                     s=70, color='#f73434', edgecolor='black', linewidth=0.3)
ax.set_ylabel('Spending Scores')
plt.show()
wcss = []
for i in range(1, 30):
    km = KMeans(i)
    km.fit(clustering_data)
    wcss.append(km.inertia_)   # store the within-cluster sum of squares for each k
np.array(wcss)
fig, ax = plt.subplots(figsize=(15, 7))
plt.plot(range(1, 30), wcss, marker='o')   # elbow curve: WCSS against number of clusters
plt.axvline(x=5, ls='--')                  # the elbow appears at k = 5
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.show()
74
# 8. Clustering
kms = KMeans(n_clusters=5, random_state=42)   # k = 5, as suggested by the elbow plot
kms.fit(clustering_data)
clusters = clustering_data.copy()
clusters['Cluster_Prediction'] = kms.fit_predict(clustering_data)
clusters.head()
kms.cluster_centers_
fig, ax = plt.subplots(figsize=(15, 7))
# One scatter per predicted cluster; the colour assignments are partly reconstructed
colours = ['red', 'orange', 'magenta', 'green', 'blue']
for label_no, cluster_id in enumerate([4, 0, 2, 1, 3], start=1):
    subset = clusters[clusters['Cluster_Prediction'] == cluster_id]
    plt.scatter(x=subset['Annual_Income'], y=subset['Spending_Score'],
                s=70, edgecolor='black', linewidth=0.2,
                c=colours[label_no - 1], label=f'Cluster {label_no}')
plt.legend(loc='right')
plt.xlim(0, 140)
plt.ylim(0, 100)
plt.ylabel('Spending Score')
plt.show()
76
77