Customer Segmentation Using K-Means Clustering
Abstract:
Customer segmentation is a critical business strategy to understand consumer behavior and improve
marketing efficiency. This project implements K-Means Clustering, an unsupervised machine
learning algorithm, to segment customers based on key features such as annual income and spending
score. Using Python libraries like Pandas, Matplotlib, Seaborn, and Scikit-learn, we analyze customer
data to identify distinct groups, which helps businesses target customers more effectively.
Table of Contents
1. Introduction
2. Problem Statement
3. Objectives
4. Scope
5. Literature Review
6. Methodology
7. System Architecture
8. Dataset Description
9. Data Preprocessing
10. Clustering Algorithm – K-Means
11. Results and Analysis
12. Advantages
13. Limitations
14. Conclusion
15. Future Scope
16. References
1. Introduction
Customer segmentation involves dividing a company’s customers into groups that reflect similarity
among customers in each group. In this project, we apply K-Means Clustering to achieve effective
segmentation based on data like age, income, and spending scores.
2. Problem Statement
Businesses often struggle to personalize marketing and services due to lack of insight into customer
behavior. Manual segmentation is ineffective for large datasets. Therefore, an automated, intelligent
system is needed to cluster customers based on various features.
3. Objectives
4. Scope
This project is limited to analyzing a dataset of customer features using unsupervised learning. It can
be applied across industries like retail, banking, and telecom to enhance customer relationship
management and marketing efficiency.
5. Literature Review
Past research shows the importance of data-driven decisions in marketing. Techniques like clustering
have been used in CRM and sales prediction. K-Means is widely accepted due to its simplicity and
effectiveness for large datasets.
6. Methodology
Data Collection: We use a sample dataset from an e-commerce or mall customer database.
Data Preprocessing: Includes cleaning, normalization, and selection of relevant features.
Clustering: K-Means algorithm is used to group customers.
Evaluation: Clusters are analyzed and visualized for interpretation.
7. System Architecture
Data Preprocessing
K-Means Clustering
Evaluation & Visualization
Output: Segmented customer groups
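The stages listed above map naturally onto a scikit-learn pipeline. The following is a minimal sketch under stated assumptions: the helper name `build_segmenter` is hypothetical, and randomly generated values stand in for the customer file, so the output is illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_segmenter(n_clusters=5, random_state=42):
    """Chain preprocessing (scaling) and K-Means clustering into one estimator."""
    return make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state),
    )


# Synthetic stand-in for (annual income in k$, spending score) pairs.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(15, 140, 200), rng.uniform(1, 100, 200)])

segmenter = build_segmenter()
labels = segmenter.fit_predict(X)  # output: one segment label per customer
print(np.bincount(labels))         # number of customers in each segment
```

Keeping the scaler and the clusterer in one object means new customers pass through the same scaling before being assigned to a segment.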
8. Dataset Description
CustomerID
Gender
Age
Annual Income (k$)
Spending Score (1–100)
9. Data Preprocessing
10. Clustering Algorithm – K-Means
11. Results and Analysis
The optimal number of clusters was found using the Elbow Method.
Each cluster showed distinct patterns, such as high-income high-spending customers and
low-income low-spending customers.
The results were visualized using scatter plots and cluster diagrams.
Business insights can be drawn based on cluster characteristics.
12. Advantages
13. Limitations
14. Conclusion
This project successfully demonstrates the use of K-Means clustering for customer segmentation. It
helps organizations better understand their customers and tailor their strategies accordingly.
16. References
1. Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques.
2. Scikit-learn Documentation – https://scikit-learn.org
3. Kaggle Dataset – https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial
4. McKinsey Insights on Customer Segmentation
Customer Segmentation using K-Means Clustering with Python
Table of Contents
1. Abstract
2. Introduction
2.1 Background
2.2 Problem Statement
2.3 Objectives
2.4 Scope of the Project
2.5 Significance of the Study
3. Literature Review
3.1 Existing Methods of Customer Segmentation
3.2 Applications of Machine Learning in Customer Segmentation
3.3 Comparison of Clustering Algorithms
3.4 Research Gaps
4. System Analysis
4.1 Requirement Analysis
4.2 Feasibility Study
4.3 System Architecture
4.4 Use Case Diagrams
5. Methodology
5.1 Data Collection
5.2 Data Preprocessing
5.3 Feature Selection and Engineering
5.4 Introduction to K-Means Clustering
5.5 Determining the Optimal Number of Clusters (Elbow Method/Silhouette Score)
5.6 Model Implementation using Python
5.7 Evaluation of Clusters
6. Implementation
6.1 Tools and Technologies Used
6.2 Dataset Description
6.3 Code Structure
6.4 Visualization of Clusters
6.5 Interpretation of Results
7. Results and Discussion
7.1 Cluster Profiles
7.2 Business Insights
7.3 Comparison with Other Clustering Methods
7.4 Limitations
8. Conclusion and Future Work
8.1 Conclusion
8.2 Recommendations
8.3 Future Enhancements
9. References
10. Appendices
10.1 Python Code Listings
10.2 Additional Graphs and Visualizations
10.3 User Guide (if applicable)
1. Abstract
Customer segmentation is a fundamental marketing strategy that involves dividing a customer base
into distinct groups based on common characteristics. This project focuses on implementing customer
segmentation using the K-Means clustering algorithm in Python. The objective is to analyze customer
behavior, group them into relevant segments, and extract meaningful business insights to enable
targeted marketing. Using an unsupervised machine learning approach, the model is trained on a
dataset that includes variables such as annual income and spending score. The implementation includes
data preprocessing, determining the optimal number of clusters using the Elbow Method, applying
K-Means clustering, and visualizing the results. The outcomes demonstrate how machine learning can
enhance decision-making by providing a clearer understanding of customer needs and preferences.
2. Introduction
2.1 Background
In today’s data-driven business environment, companies collect large volumes of customer data
through transactions, online activities, and surveys. Understanding this data is essential for designing
marketing strategies, optimizing customer experience, and increasing profitability. Customer
segmentation helps businesses tailor their products and services to specific groups of customers with
similar behaviors. Among various machine learning techniques, clustering algorithms, especially
K-Means, are widely used for this purpose due to their simplicity and efficiency.
2.2 Problem Statement
Businesses often struggle to understand their customers' needs and behaviors due to the vast diversity
in customer preferences. Without proper segmentation, marketing efforts become generalized, leading
to suboptimal outcomes. The problem addressed in this project is:
"How can we effectively segment customers into meaningful groups using machine learning
techniques to support targeted business strategies?"
2.3 Objectives
2.4 Scope of the Project
This project is limited to applying K-Means clustering on a sample customer dataset, which typically
includes attributes such as age, income, and spending score. The analysis will be conducted using
Python and associated libraries such as Pandas, NumPy, Matplotlib, and Scikit-learn. While other
clustering algorithms exist, this study emphasizes K-Means for its balance between simplicity and
performance.
3. Literature Review
3.1 Existing Methods of Customer Segmentation
Customer segmentation has long been a central focus in marketing and customer relationship
management. Traditional methods include demographic segmentation (age, gender, income),
geographic segmentation, psychographic segmentation (lifestyle, values), and behavioral segmentation
(purchasing behavior, brand interactions). These conventional approaches often rely on manual
categorization and lack the scalability and adaptability required in modern data-intensive
environments.
With the growth of big data and computational power, machine learning approaches have gained
prominence. These methods enable automated, data-driven segmentation based on patterns and
correlations hidden within large datasets.
3.2 Applications of Machine Learning in Customer Segmentation
Recent studies emphasize the use of unsupervised learning algorithms for customer segmentation.
These include:
K-Means Clustering: One of the most widely used algorithms for segmenting customers based
on numerical attributes. It assigns data points to the nearest cluster center.
Hierarchical Clustering: Builds nested clusters by either merging or splitting them
successively. Useful for visualizing cluster relationships but computationally expensive for
large datasets.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points
that are closely packed together while marking outliers. Effective for identifying
arbitrary-shaped clusters.
Gaussian Mixture Models (GMMs): A probabilistic model that assumes data points are
distributed according to multiple Gaussian distributions.
Among these, K-Means is often the most practical and efficient choice for structured datasets
with relatively distinct groupings, such as retail or customer purchase data.
It tends to perform well when clusters are well-separated and the dataset is of medium size,
which aligns with the goals of this project.
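To make the comparison concrete, the four algorithms can be run side by side on the same data. This is a rough sketch, not a benchmark: it assumes synthetic blobs from `make_blobs` in place of customer data, and the DBSCAN `eps` value is an illustrative guess that would need tuning on a real dataset.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for scaled customer features.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

models = {
    "K-Means": KMeans(n_clusters=5, n_init=10, random_state=42),
    "Hierarchical": AgglomerativeClustering(n_clusters=5),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),  # eps is an illustrative guess
    "GMM": GaussianMixture(n_components=5, random_state=42),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    n_found = len(set(labels) - {-1})  # DBSCAN marks noise points as -1
    # The silhouette score is only defined for at least two distinct labels.
    score = silhouette_score(X, labels) if len(set(labels)) > 1 else float("nan")
    results[name] = (n_found, score)
    print(f"{name:12s} clusters={n_found} silhouette={score:.3f}")
```

K-Means, hierarchical clustering, and the Gaussian mixture are told how many groups to find, while DBSCAN infers the count from density, which is why its number of clusters is reported rather than fixed.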
3.4 Research Gaps
While a considerable amount of work has been done on clustering algorithms for customer
segmentation, gaps remain:
Many implementations focus on accuracy without considering business interpretability.
There is a lack of studies using real-time or dynamic datasets where customer behavior changes
frequently.
Few papers address the integration of clustering results with actionable business intelligence or
marketing tools.
This project aims to address some of these limitations by offering a clear, interpretable clustering
approach with visualizations and business insights.
4. System Analysis
4.1 Requirement Analysis
Functional Requirements:
The system should allow data import from CSV or other structured formats.
The system must preprocess data (handling missing values, normalization).
The system should implement the K-Means clustering algorithm.
The system must determine the optimal number of clusters (e.g., using Elbow Method).
The system should visualize the clusters and allow interpretation of results.
Non-Functional Requirements:
Software Requirements:
Hardware Requirements:
4.2 Feasibility Study
Technical Feasibility:
The tools and algorithms required for this project are well-supported in Python and can be run on
standard hardware. Libraries like Scikit-learn simplify implementation, and various tutorials and
documentation are available.
Economic Feasibility:
There is no additional cost for tools or libraries used in this project since all required components are
open-source. The only investment is the time spent in development, testing, and analysis.
Operational Feasibility:
The system is designed to be operated by data analysts, students, or small business users. Minimal
technical knowledge is required to run the notebook and interpret the results, making it user-friendly
and practical.
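4.3 System Architecture
[Figure: system architecture diagram. Recoverable label: Visualization Layer (Plots, Graphs)]
4.4 Use Case Diagrams
[Figure: use case diagram. Recoverable actions: Upload Dataset, Preprocess Data, Visualize Segments, Extract Insights]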
5. Methodology
This section outlines the step-by-step methodology used in building the customer segmentation model
using K-Means clustering. The process includes data collection, preprocessing, feature selection,
algorithm implementation, and evaluation.
5.1 Data Collection
The dataset used in this project is sourced from a publicly available retail customer database. It
contains key attributes of customers, such as:
Customer ID
Gender
Age
Annual Income (k$, thousands of USD)
Spending Score (1–100)
This dataset simulates customer behavior and is ideal for demonstrating segmentation based on
spending patterns and income.
5.2 Data Preprocessing
Before applying machine learning, the raw data must be cleaned and standardized. The following
preprocessing steps were performed:
Handling Missing Values: The dataset was checked for null values using isnull() and cleaned
accordingly.
Data Transformation: Categorical variables such as Gender were encoded using label
encoding (e.g., Male = 1, Female = 0).
Feature Scaling: To ensure uniformity, StandardScaler was used to normalize features like
Annual Income and Spending Score.
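The three preprocessing steps can be sketched as follows. This is a minimal illustration under assumptions: the small data frame is made up (with one missing income injected on purpose), and dropping incomplete rows is only one possible reading of "cleaned accordingly".

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny illustrative frame using the dataset's column names; values are made up.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15.0, 15.0, 16.0, None, 17.0],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# 1. Missing values: count them per column, then drop incomplete rows
#    (dropping is an assumption; imputation would be another option).
print(df.isnull().sum())
df = df.dropna().reset_index(drop=True)

# 2. Label encoding: LabelEncoder sorts classes alphabetically,
#    so Female -> 0 and Male -> 1, matching the mapping in the text.
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# 3. Feature scaling: zero mean and unit variance per column.
features = ["Annual Income (k$)", "Spending Score (1-100)"]
scaled_data = StandardScaler().fit_transform(df[features])
print(scaled_data.mean(axis=0).round(6), scaled_data.std(axis=0).round(6))
```

Scaling matters for K-Means because the algorithm relies on Euclidean distance, so a feature with a larger numeric range would otherwise dominate the clustering.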
5.3 Feature Selection and Engineering
Annual Income and Spending Score were selected as the clustering features because they form
clear clusters when visualized and are commonly used in retail segmentation.
5.4 Introduction to K-Means Clustering
K-Means is an unsupervised learning algorithm that divides data into k clusters, minimizing the
variance within each cluster. It works in the following steps:
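As a hedged illustration of how the algorithm proceeds, the standard K-Means iteration (Lloyd's algorithm) is: choose k initial centroids, assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the centroids stop moving. The NumPy sketch below follows those steps; the function name `kmeans_steps` is hypothetical, and it uses plain random initialization rather than the k-means++ initialization used elsewhere in this report.

```python
import numpy as np


def kmeans_steps(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two obvious groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = kmeans_steps(X, k=2)
print(labels)
```

In practice scikit-learn's `KMeans` is preferable, since it adds k-means++ initialization and multiple restarts; the sketch is only meant to show the assign-and-update loop.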
5.5 Determining the Optimal Number of Clusters (Elbow Method/Silhouette Score)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
The optimal k is identified at the “elbow point” in the graph — where WCSS starts to decrease more
slowly. In our case, k = 5 was found optimal.
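The elbow computation can be sketched as below. The assumptions are labeled: synthetic blobs stand in for the scaled income and spending features, and the helper names `compute_wcss` and `plot_elbow` are hypothetical. In scikit-learn, WCSS is exposed as the fitted model's `inertia_` attribute.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def compute_wcss(X, k_values):
    """Return the within-cluster sum of squares (inertia) for each k."""
    wcss = []
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
        model.fit(X)
        wcss.append(model.inertia_)
    return wcss


def plot_elbow(k_values, wcss):
    """Plot WCSS against k; the bend in the curve suggests a value for k."""
    plt.plot(list(k_values), wcss, marker="o")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("WCSS")
    plt.title("Elbow Method")
    plt.show()


# Synthetic stand-in for the scaled income/spending features.
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)
scaled_data = StandardScaler().fit_transform(X)

k_values = range(1, 11)
wcss = compute_wcss(scaled_data, k_values)
# plot_elbow(k_values, wcss)  # uncomment to display the curve
```

Because WCSS keeps shrinking as k grows, the curve alone never picks k; the elbow is read off by eye, which is why it is often cross-checked with the silhouette score.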
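5.6 Model Implementation using Python
With the optimal number of clusters determined, the K-Means algorithm was applied:
# Fit K-Means with k = 5 on the scaled features; returns one cluster label per customer
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(scaled_data)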
5.7 Evaluation of Clusters
Silhouette Score: Measures how similar a point is to its own cluster compared to other clusters.
The score ranges from -1 to 1, with higher values indicating better-separated clusters.
from sklearn.metrics import silhouette_score
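A short sketch of how the silhouette score might be computed and used to compare candidate values of k. It assumes synthetic blobs in place of the scaled customer features, so the printed scores are illustrative and are not the project's results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled income/spending features.
X, _ = make_blobs(n_samples=200, centers=5, random_state=42)
scaled_data = StandardScaler().fit_transform(X)

# Compare the mean silhouette coefficient for several candidate values of k.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(scaled_data)
    scores[k] = silhouette_score(scaled_data, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Highest silhouette at k =", best_k)
```

The loop starts at k = 2 because the silhouette coefficient is undefined when all points share a single cluster.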
# Scatter plot of the two scaled features, colored by cluster label
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=cluster_labels, cmap='rainbow')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.title('Customer Segments')
plt.show()
This visualization helps interpret customer behavior and derive actionable insights.
6. Implementation
This section describes how the system was implemented using Python, focusing on the tools and
technologies used, the dataset structure, code modules, visualizations, and the interpretation of results.
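6.1 Tools and Technologies Used
The following open-source tools and libraries were used for implementation: Python, Pandas,
NumPy, Matplotlib, Seaborn, and Scikit-learn.
6.2 Dataset Description
The dataset used consists of 200+ customer records with the following key columns: CustomerID,
Gender, Age, Annual Income (k$), and Spending Score (1-100).
For this project, the focus was on Annual Income and Spending Score for clustering.
6.3 Code Structure
The code is modularized into the following sections for clarity and reusability: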
1. Import Libraries
2. Load Dataset
3. Data Preprocessing
4. Determine Optimal Clusters (Elbow Method)
5. Apply K-Means Algorithm
6. Evaluate and Visualize Clusters
7. Interpret Results
Each section is encapsulated in functions where applicable to make the code readable and
maintainable.
Example:
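import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_and_scale_data(file_path):
    """Load the customer CSV and standardize the two clustering features."""
    data = pd.read_csv(file_path)
    scaler = StandardScaler()
    scaled = scaler.fit_transform(data[['Annual Income (k$)', 'Spending Score (1-100)']])
    return data, scaled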
6.4 Visualization of Clusters
# Larger figure with a grid; points are colored by cluster label
plt.figure(figsize=(8, 6))
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=cluster_labels, cmap='rainbow')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.title('Customer Segments')
plt.grid(True)
plt.show()
Example Output:
This analysis can help a business tailor marketing campaigns, product recommendations, and customer
service based on cluster profiles.
7. Results and Discussion
After preprocessing the data and applying the K-Means algorithm with k = 5, five distinct customer
segments were identified. These clusters showed clear separation when visualized on a 2D scatter plot
using Annual Income and Spending Score as axes.
These visualizations made it easier for stakeholders to understand the segmentation without deep
technical knowledge.
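One way to turn cluster labels into readable segment profiles is to summarize each cluster by its size and average feature values. The sketch below is illustrative only: the eight rows are made up, and k = 4 is chosen because this toy sample has four obvious groups, so the numbers are not the project's results.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Small made-up sample with the dataset's column names (not real results).
df = pd.DataFrame({
    "Annual Income (k$)": [15, 16, 17, 18, 78, 80, 85, 88],
    "Spending Score (1-100)": [80, 77, 6, 10, 90, 85, 12, 15],
})
features = ["Annual Income (k$)", "Spending Score (1-100)"]

scaled = StandardScaler().fit_transform(df[features])
# k = 4 only because this toy sample has four obvious groups.
df["Cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)

# Profile each segment by its mean income, mean spending score, and size.
profile = df.groupby("Cluster")[features].mean().round(1)
profile["Customers"] = df.groupby("Cluster").size()
print(profile)
```

Reporting the means in the original units (k$ and score points) rather than scaled values keeps the profiles interpretable for non-technical stakeholders.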
The Silhouette Score was used to evaluate the effectiveness of clustering. The score obtained was:
Silhouette Score: ~0.55
7.5 Discussion
The results demonstrate the strength of K-Means in identifying distinct customer segments using just
two features. While simple, the insights derived are powerful and can directly support business
decisions like:
Personalized marketing
Customer retention
Sales forecasting
Resource allocation
However, the results can be further improved by including more features such as purchase frequency,
customer lifetime value, and online behavior.
8. Conclusion and Future Work
8.1 Conclusion
In this project, a machine learning-based approach for customer segmentation was developed and
implemented using the K-Means clustering algorithm in Python. The primary goal was to identify
distinct groups of customers based on two key features: Annual Income and Spending Score.
The results demonstrated that K-Means is an effective method for uncovering hidden patterns in
customer data, which can be used to improve marketing efforts, customer service, and overall business
decision-making.
8.3 Limitations
Only two features were used; more features could lead to more accurate and actionable
segmentation.
K-Means assumes spherical clusters and equal variance, which might not always represent real
customer behavior.
The model does not handle dynamic or time-series data where customer behavior changes
over time.
To improve and extend this project, the following enhancements are recommended:
9. References
The following sources were consulted and referenced during the course of the project:
1. Scikit-learn Documentation
https://scikit-learn.org/stable/modules/clustering.html#k-means
2. Pandas Documentation
https://pandas.pydata.org/
3. Matplotlib Documentation
https://matplotlib.org/stable/index.html
4. Seaborn Documentation
https://seaborn.pydata.org/
5. Customer Segmentation Dataset (Mall Customers Dataset)
Available on Kaggle:
https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python
6. J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 3rd
Edition, 2011.
7. A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, O'Reilly
Media, 2nd Edition, 2019.
10. Appendices
CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
1           Male    19   15                  39
2           Male    21   15                  81
3           Female  20   16                  6
4           Female  23   16                  77
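import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load dataset
df = pd.read_csv('Mall_Customers.csv')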
# Feature selection
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# KMeans clustering
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
clusters = kmeans.fit_predict(X_scaled)
df['Cluster'] = clusters
# Visualization
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.grid(True)
plt.show()