Customer Segmentation Using K

This document outlines a project on customer segmentation using K-Means Clustering in Python, emphasizing its importance for targeted marketing and improved customer understanding. The methodology includes data preprocessing, applying the K-Means algorithm, and visualizing results to derive business insights. The project highlights the advantages of machine learning in enhancing decision-making and addresses limitations and future enhancements for the approach.

Uploaded by

nascmoulishwari
Customer Segmentation using K-Means Clustering with Python

Abstract:

Customer segmentation is a critical business strategy to understand consumer behavior and improve
marketing efficiency. This project implements K-Means Clustering, an unsupervised machine
learning algorithm, to segment customers based on key features such as annual income and spending
score. Using Python libraries like Pandas, Matplotlib, Seaborn, and Scikit-learn, we analyze customer
data to identify distinct groups, which helps businesses target customers more effectively.

Abstract:

In today’s competitive business environment, customer segmentation is crucial for personalized
marketing, resource optimization, and customer satisfaction. This project explores the use of K-Means
Clustering, a machine learning technique, to segment customers into distinct groups based on
behavioral and demographic features. By using Python and its data science libraries, we analyze
customer data and identify hidden patterns, enabling targeted marketing strategies and business
growth.

Table of Contents

1. Introduction
2. Problem Statement
3. Objectives
4. Scope
5. Literature Review
6. Methodology
7. System Architecture
8. Dataset Description
9. Data Preprocessing
10. Clustering Algorithm – K-Means
11. Results and Analysis
12. Advantages
13. Limitations
14. Conclusion
15. Future Scope
16. References

1. Introduction

Customer segmentation involves dividing a company’s customers into groups that reflect similarity
among customers in each group. In this project, we apply K-Means Clustering to achieve effective
segmentation based on data like age, income, and spending scores.

2. Problem Statement

Businesses often struggle to personalize marketing and services because they lack insight into customer
behavior. Manual segmentation is impractical for large datasets. Therefore, an automated, data-driven
system is needed to cluster customers based on various features.

3. Objectives

• To segment customers using K-Means Clustering.
• To analyze customer behavior based on demographic and transactional data.
• To visualize the clusters for better understanding.
• To help businesses identify valuable customer groups.

4. Scope

This project is limited to analyzing a dataset of customer features using unsupervised learning. It can
be applied across industries like retail, banking, and telecom to enhance customer relationship
management and marketing efficiency.

5. Literature Review

Past research shows the importance of data-driven decisions in marketing. Techniques like clustering
have been used in CRM and sales prediction. K-Means is widely accepted due to its simplicity and
effectiveness for large datasets.

6. Methodology

• Data Collection: We use a sample mall customer dataset (see Dataset Description).
• Data Preprocessing: Includes cleaning, feature scaling, and selection of relevant features.
• Clustering: The K-Means algorithm is used to group customers.
• Evaluation: Clusters are analyzed and visualized for interpretation.

7. System Architecture

Input: Customer data (e.g., Age, Gender, Income, Spending Score)

Process:

• Data Preprocessing
• K-Means Clustering
• Evaluation & Visualization

Output: Segmented customer groups

8. Dataset Description

The dataset contains the following features:

• CustomerID
• Gender
• Age
• Annual Income (k$)
• Spending Score (1–100)

(Source: [Mall Customers Dataset – Kaggle])

9. Data Preprocessing

• Handling missing values
• Converting categorical data to numerical (e.g., Gender)
• Feature selection
• Scaling the data (StandardScaler or MinMaxScaler)

10. Clustering Algorithm – K-Means

• K-Means is an unsupervised learning algorithm that partitions data into K clusters.
• It minimizes the intra-cluster variance, i.e. the sum of squared distances from each point to its cluster centroid.
• The Elbow Method is used to determine the optimal K.
• Customers with similar purchasing habits are grouped together.
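The "intra-cluster variance" mentioned above is usually written as the within-cluster sum of squares (WCSS). The notation below is the standard textbook form rather than anything defined in this report: x_i is the feature vector of customer i, C_j is the set of customers assigned to cluster j, and mu_j is that cluster's centroid.

```latex
\[
\mathrm{WCSS} \;=\; \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
\]
```

K-Means looks for the assignment of customers to clusters that makes this quantity small for a given K; the Elbow Method plots it against K.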
11. Results and Analysis

• The optimal number of clusters was found using the Elbow Method.
• Each cluster showed distinct patterns such as high-income high-spending customers,
  low-income low-spending customers, etc.
• The results were visualized using scatter plots and cluster diagrams.
• Business insights can be drawn based on cluster characteristics.

12. Advantages

• Helps in targeted marketing
• Improves customer relationship management
• Increases business efficiency
• Simple and scalable method

13. Limitations

• K-Means requires specification of K (number of clusters)
• Not suitable for non-spherical clusters
• Sensitive to outliers and scaling

14. Conclusion

This project successfully demonstrates the use of K-Means clustering for customer segmentation. It
helps organizations better understand their customers and tailor their strategies accordingly.

15. Future Scope

• Use of advanced clustering techniques like DBSCAN or Hierarchical Clustering
• Integration with real-time customer databases
• Use of deep learning for feature extraction
• Integration into a marketing CRM system

16. References

1. Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques.
2. Scikit-learn Documentation – https://scikit-learn.org
3. Kaggle Dataset – https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial
4. McKinsey Insights on Customer Segmentation
Customer Segmentation using K-Means Clustering with Python
Table of Contents

1. Abstract
2. Introduction
2.1 Background
2.2 Problem Statement
2.3 Objectives
2.4 Scope of the Project
2.5 Significance of the Study
3. Literature Review
3.1 Existing Methods of Customer Segmentation
3.2 Applications of Machine Learning in Customer Segmentation
3.3 Comparison of Clustering Algorithms
3.4 Research Gaps
4. System Analysis
4.1 Requirement Analysis
4.2 Feasibility Study
4.3 System Architecture
4.4 Use Case Diagrams
5. Methodology
5.1 Data Collection
5.2 Data Preprocessing
5.3 Feature Selection and Engineering
5.4 Introduction to K-Means Clustering
5.5 Determining the Optimal Number of Clusters (Elbow Method/Silhouette Score)
5.6 Model Implementation using Python
5.7 Evaluation of Clusters
6. Implementation
6.1 Tools and Technologies Used
6.2 Dataset Description
6.3 Code Structure
6.4 Visualization of Clusters
6.5 Interpretation of Results
7. Results and Discussion
7.1 Summary of Cluster Results
7.2 Visualization Insights
7.3 Business Interpretation of Clusters
7.4 Model Evaluation
7.5 Discussion
8. Conclusion and Future Work
8.1 Conclusion
8.2 Key Takeaways
8.3 Limitations
8.4 Future Work
9. References
10. Appendices
Appendix A: Sample Dataset (Excerpt)
Appendix B: Sample Python Code Snippet
Appendix C: Silhouette Score Code
1. Abstract

Customer segmentation is a fundamental marketing strategy that involves dividing a customer base
into distinct groups based on common characteristics. This project focuses on implementing customer
segmentation using the K-Means clustering algorithm in Python. The objective is to analyze customer
behavior, group them into relevant segments, and extract meaningful business insights to enable
targeted marketing. Using an unsupervised machine learning approach, the model is trained on a
dataset that includes variables such as annual income and spending score. The implementation includes
data preprocessing, determining the optimal number of clusters using the Elbow Method, applying K-
Means clustering, and visualizing the results. The outcomes demonstrate how machine learning can
enhance decision-making by providing a clearer understanding of customer needs and preferences.

2. Introduction

2.1 Background

In today’s data-driven business environment, companies collect large volumes of customer data
through transactions, online activities, and surveys. Understanding this data is essential for designing
marketing strategies, optimizing customer experience, and increasing profitability. Customer
segmentation helps businesses tailor their products and services to specific groups of customers with
similar behaviors. Among various machine learning techniques, clustering algorithms, especially K-
Means, are widely used for this purpose due to their simplicity and efficiency.

2.2 Problem Statement

Businesses often struggle to understand their customers' needs and behaviors due to the vast diversity
in customer preferences. Without proper segmentation, marketing efforts become generalized, leading
to suboptimal outcomes. The problem addressed in this project is:
"How can we effectively segment customers into meaningful groups using machine learning
techniques to support targeted business strategies?"

2.3 Objectives

• To analyze customer data and identify patterns in behavior.
• To apply K-Means clustering for segmenting customers.
• To visualize and interpret the resulting customer segments.
• To provide actionable business insights based on clustering results.

2.4 Scope of the Project

This project is limited to applying K-Means clustering on a sample customer dataset, which typically
includes attributes such as age, income, and spending score. The analysis will be conducted using
Python and associated libraries such as Pandas, NumPy, Matplotlib, and Scikit-learn. While other
clustering algorithms exist, this study emphasizes K-Means for its balance between simplicity and
performance.

2.5 Significance of the Study

Effective customer segmentation leads to more personalized marketing, improved customer
satisfaction, and increased sales. This project demonstrates how machine learning, particularly K-
Means clustering, can be applied to real-world business problems, empowering organizations to make
data-informed decisions and enhance their competitive edge.
3. Literature Review

3.1 Existing Methods of Customer Segmentation

Customer segmentation has long been a central focus in marketing and customer relationship
management. Traditional methods include demographic segmentation (age, gender, income),
geographic segmentation, psychographic segmentation (lifestyle, values), and behavioral segmentation
(purchasing behavior, brand interactions). These conventional approaches often rely on manual
categorization and lack the scalability and adaptability required in modern data-intensive
environments.

With the growth of big data and computational power, machine learning approaches have gained
prominence. These methods enable automated, data-driven segmentation based on patterns and
correlations hidden within large datasets.

3.2 Applications of Machine Learning in Customer Segmentation

Recent studies emphasize the use of unsupervised learning algorithms for customer segmentation.
These include:

• K-Means Clustering: One of the most widely used algorithms for segmenting customers based
  on numerical attributes. It assigns each data point to the nearest cluster center.
• Hierarchical Clustering: Builds nested clusters by either merging or splitting them
  successively. Useful for visualizing cluster relationships but computationally expensive for
  large datasets.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups points
  that are closely packed together while marking outliers. Effective for identifying
  arbitrary-shaped clusters.
• Gaussian Mixture Models (GMMs): A probabilistic model that assumes data points are
  drawn from a mixture of several Gaussian distributions.

Among these, K-Means is often the most practical and efficient choice for structured datasets with
relatively distinct groupings, such as retail or customer purchase data.

3.3 Comparison of Clustering Algorithms


Algorithm    | Advantages                                    | Disadvantages
K-Means      | Simple, fast, easy to implement               | Sensitive to initial centroids, assumes spherical clusters
Hierarchical | No need to predefine clusters                 | Not scalable, computationally heavy
DBSCAN       | Finds non-spherical clusters, detects noise   | Difficult to choose parameters
GMM          | Soft clustering with probabilistic assignment | Requires more computation

K-Means tends to perform well when clusters are well separated and roughly spherical, and it remains
fast on small and medium-sized datasets, which aligns with the goals of this project.

3.4 Research Gaps

While a considerable amount of work has been done on clustering algorithms for customer
segmentation, gaps remain:

• Many implementations focus on accuracy without considering business interpretability.
• There is a lack of studies using real-time or dynamic datasets where customer behavior changes
  frequently.
• Few papers address the integration of clustering results with actionable business intelligence or
  marketing tools.

This project aims to address some of these limitations by offering a clear, interpretable clustering
approach with visualizations and business insights.

4. System Analysis

4.1 Requirement Analysis

Functional Requirements:

• The system should allow data import from CSV or other structured formats.
• The system must preprocess data (handling missing values, normalization).
• The system should implement the K-Means clustering algorithm.
• The system must determine the optimal number of clusters (e.g., using the Elbow Method).
• The system should visualize the clusters and allow interpretation of results.

Non-Functional Requirements:

• The system should be scalable to handle moderately large datasets.
• It should provide quick execution with minimal computational overhead.
• The visualizations should be clear, interactive (if possible), and interpretable.
• The codebase should be modular and reusable.

Software Requirements:

• Programming Language: Python 3.x
• Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
• IDE: Google Colab
• Operating System: Windows/Linux/macOS

Hardware Requirements:

• Processor: Minimum 2.0 GHz (Dual Core or higher)
• RAM: Minimum 4 GB (8 GB recommended)
• Storage: Minimum 1 GB free space for datasets and logs

4.2 Feasibility Study


Technical Feasibility:

The tools and algorithms required for this project are well-supported in Python and can be run on
standard hardware. Libraries like Scikit-learn simplify implementation, and various tutorials and
documentation are available.

Economic Feasibility:

There is no additional cost for tools or libraries used in this project since all required components are
open-source. The only investment is the time spent in development, testing, and analysis.

Operational Feasibility:

The system is designed to be operated by data analysts, students, or small business users. Minimal
technical knowledge is required to run the notebook and interpret the results, making it user-friendly
and practical.

4.3 System Architecture

Below is a simplified system architecture for the customer segmentation project:

Customer Dataset
  -> Data Preprocessing (Cleaning, Scaling)
  -> K-Means Clustering (n_clusters = optimal)
  -> Clustered Data Output
  -> Visualization Layer (Plots, Graphs)
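As a rough illustration, the preprocessing and clustering stages of this architecture can be chained into a single scikit-learn pipeline. This is a minimal sketch, not the project's actual code: the random data stands in for the customer dataset, and the name `segmenter` is made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the customer dataset: two numeric columns
# (annual income in k$, spending score 1-100).
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(15, 140, size=200),
    rng.uniform(1, 100, size=200),
])

# Preprocessing (scaling) and clustering chained as one estimator.
segmenter = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42),
)

# One cluster label per customer: the "Clustered Data Output" stage.
labels = segmenter.fit_predict(X)
```

Keeping the scaler inside the pipeline means any new customer data passed to `segmenter.predict` is scaled with the same statistics that were learned during fitting.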

4.4 Use Case Diagrams


Primary Use Case: Customer Segmentation
Actor: Data Analyst / Business User
Use Case:

• Load customer data.
• Clean and preprocess data.
• Choose clustering method (K-Means).
• Determine optimal number of clusters.
• Apply clustering and visualize results.
• Interpret segments and derive business insights.

Use case diagram (actor: User): Upload Dataset, Preprocess Data, Run Clustering Algorithm,
Visualize Segments, Extract Insights.
5. Methodology

This section outlines the step-by-step methodology used in building the customer segmentation model
using K-Means clustering. The process includes data collection, preprocessing, feature selection,
algorithm implementation, and evaluation.

5.1 Data Collection

The dataset used in this project is sourced from a publicly available retail customer database. It
contains key attributes of customers, such as:

• Customer ID
• Gender
• Age
• Annual Income (in thousands of USD, k$)
• Spending Score (1–100)

This dataset simulates customer behavior and is ideal for demonstrating segmentation based on
spending patterns and income.

5.2 Data Preprocessing

Before applying machine learning, the raw data must be cleaned and standardized. The following
preprocessing steps were performed:

• Handling Missing Values: The dataset was checked for null values using isnull() and cleaned
  accordingly.
• Data Transformation: Categorical variables such as Gender were encoded using label
  encoding (e.g., Male = 1, Female = 0).
• Feature Scaling: StandardScaler was used to standardize features like Annual Income and
  Spending Score to zero mean and unit variance, so that neither feature dominates the
  distance calculation.
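The three preprocessing steps can be sketched as follows. The small DataFrame is made-up data that borrows the Mall Customers column names, and dropping incomplete rows is only one possible way to "clean accordingly"; the report does not say which strategy was used.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame using the Mall Customers column names;
# the values here are made up for the example.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Annual Income (k$)": [15.0, 16.0, None, 20.0],
    "Spending Score (1-100)": [39, 81, 6, 77],
})

# 1. Missing values: count per column, then drop incomplete rows
#    (assumption: dropping rather than imputing).
missing_per_column = df.isnull().sum()
df = df.dropna().reset_index(drop=True)

# 2. Label-encode Gender (Male = 1, Female = 0).
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})

# 3. Standardize the clustering features to zero mean and unit variance.
features = ["Annual Income (k$)", "Spending Score (1-100)"]
scaled_data = StandardScaler().fit_transform(df[features])
```

The resulting `scaled_data` array has one row per remaining customer and one column per selected feature.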

5.3 Feature Selection and Engineering

Only relevant features were selected for clustering:

• Annual Income: Represents the customer's buying capacity.
• Spending Score: Represents customer engagement or spending behavior.

These two features were selected because they form clear clusters when visualized and are commonly
used in retail segmentation.

5.4 Introduction to K-Means Clustering

K-Means is an unsupervised learning algorithm that divides data into k clusters, minimizing the
variance within each cluster. It works in the following steps:

1. Choose the number of clusters k.
2. Initialize k centroids (randomly in the basic algorithm; the code in this report uses
   scikit-learn's k-means++ seeding).
3. Assign each point to the nearest centroid.
4. Recompute each centroid as the mean of the points assigned to it.
5. Repeat steps 3–4 until the centroids do not change significantly.
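To make the five steps concrete, here is a minimal from-scratch sketch in NumPy. It is for illustration only: the function name `kmeans_sketch`, its parameters, and the synthetic data are all made up here, and it uses plain random initialization rather than the k-means++ seeding that scikit-learn applies.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    """Basic K-Means loop following steps 1-5 (random initialization)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Two well-separated synthetic blobs (made-up data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
```

In practice scikit-learn's `KMeans` is preferable, since it adds better seeding and multiple restarts; the sketch only shows what happens inside the loop.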
5.5 Determining the Optimal Number of Clusters

To choose the right number of clusters, we used the Elbow Method:

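from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# scaled_data: the standardized feature array produced in Section 5.2
wcss = []  # Within-Cluster Sum of Squares for each candidate k

for i in range(1, 11):
    # n_init is set explicitly so results do not depend on the library version's default
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted model

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()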

The optimal k is identified at the “elbow point” in the graph, where adding another cluster no longer
reduces WCSS by much and the curve flattens. In our case, k = 5 was found optimal.

5.6 Model Implementation using Python

With the optimal number of clusters determined, the K-Means algorithm was applied:

# scaled_data is the standardized feature array from Section 5.2
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(scaled_data)

# Append cluster labels to the dataset
dataset['Cluster'] = cluster_labels

5.7 Evaluation of Clusters

Since K-Means is unsupervised, we evaluate the quality of clustering using:

• Silhouette Score: Measures how similar a point is to its own cluster compared to other clusters.
  The score ranges from -1 to 1; higher values indicate better-separated clusters.

from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all customers
score = silhouette_score(scaled_data, kmeans.labels_)
print("Silhouette Score: ", score)

• Visualization: Scatter plots were created to view the clusters.

# The axes show standardized features, so units are standard deviations from the mean
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=cluster_labels, cmap='rainbow')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.title('Customer Segments')
plt.show()

This visualization helps interpret customer behavior and derive actionable insights.
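The Silhouette Score can also be used alongside the Elbow Method to choose k, by computing it for several candidate values and picking the highest. The sketch below runs on synthetic blobs generated with `make_blobs` as a stand-in for the scaled customer features; the blob centers and the range of k are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer features: four clearly
# separated groups (made-up data).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

silhouette_by_k = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, labels)

# Candidate k with the highest mean silhouette coefficient.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k, round(silhouette_by_k[best_k], 3))
```

On real customer data the peak is usually less pronounced than on synthetic blobs, so it is worth reading the silhouette values together with the elbow plot rather than relying on either alone.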
6. Implementation
This section describes how the system was implemented using Python, focusing on the tools and
technologies used, the dataset structure, code modules, visualizations, and the interpretation of results.

6.1 Tools and Technologies Used

The following open-source tools and libraries were used for implementation:

Tool / Library | Purpose
Python 3.x     | Core programming language
Pandas         | Data manipulation and analysis
NumPy          | Numerical computations
Matplotlib     | Data visualization
Seaborn        | Statistical data visualization
Scikit-learn   | Machine learning algorithms and metrics
Google Colab   | Interactive environment for coding and output

6.2 Dataset Description

The dataset used consists of 200+ customer records with the following key columns:

Column Name    | Description
CustomerID     | Unique identifier for each customer
Gender         | Male/Female
Age            | Age of the customer
Annual Income  | Yearly income in thousands of USD (k$)
Spending Score | Score assigned based on customer behavior (1–100)

For this project, the focus was on Annual Income and Spending Score for clustering.

6.3 Code Structure

The code is modularized into the following sections for clarity and reusability:

1. Import Libraries
2. Load Dataset
3. Data Preprocessing
4. Determine Optimal Clusters (Elbow Method)
5. Apply K-Means Algorithm
6. Evaluate and Visualize Clusters
7. Interpret Results

Each section is encapsulated in functions where applicable to make the code readable and
maintainable.

Example:

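import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_and_scale_data(file_path):
    """Load the customer CSV and standardize the two clustering features."""
    data = pd.read_csv(file_path)
    scaler = StandardScaler()
    # Column names as they appear in the dataset (see Appendix A)
    scaled = scaler.fit_transform(data[['Annual Income (k$)', 'Spending Score (1-100)']])
    return data, scaled

6.4 Visualization of Clusters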

After applying K-Means, clusters were visualized using scatter plots.

# Scatter plot of the standardized features, colored by cluster label
plt.figure(figsize=(8, 6))
plt.scatter(scaled_data[:, 0], scaled_data[:, 1], c=cluster_labels, cmap='rainbow')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.title('Customer Segments')
plt.grid(True)
plt.show()

Example Output:

• Cluster 0: High income, high spending
• Cluster 1: Low income, low spending
• Cluster 2: Average income, moderate spending
• Cluster 3: High income, low spending
• Cluster 4: Low income, high spending

6.5 Interpretation of Results

Each cluster represented a unique group of customers:

Cluster | Characteristics               | Business Strategy
0       | Wealthy and frequent spenders | Offer premium services and loyalty rewards
1       | Budget-conscious low spenders | Offer discounts and budget plans
2       | Middle-class average spenders | Upsell and cross-sell opportunities
3       | Wealthy but low spenders      | Explore causes for low engagement
4       | Low income but high spending  | Higher risk; use limited-time offers

This analysis can help a business tailor marketing campaigns, product recommendations, and customer
service based on cluster profiles.
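Labels such as "high income, high spending" can be checked against the data by averaging the original, unscaled features within each cluster. The sketch below assumes a DataFrame shaped like the one in this report, with a `Cluster` column already appended; the rows themselves are made up.

```python
import pandas as pd

# Made-up rows in the shape the report describes: original (unscaled)
# features plus the 'Cluster' column appended after fit_predict.
dataset = pd.DataFrame({
    "Annual Income (k$)": [15, 18, 78, 85, 54, 60],
    "Spending Score (1-100)": [80, 75, 90, 88, 50, 45],
    "Cluster": [4, 4, 0, 0, 2, 2],
})

# Mean income and spending per cluster, plus the number of customers in each.
profile = dataset.groupby("Cluster").agg(
    mean_income=("Annual Income (k$)", "mean"),
    mean_spending=("Spending Score (1-100)", "mean"),
    size=("Annual Income (k$)", "size"),
)
print(profile)
```

Because K-Means numbers its clusters arbitrarily, a table like this is what ties each cluster number to a description; the numbering can change between runs or datasets.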

7. Results and Discussion


This section presents the results obtained from applying K-Means clustering on the customer dataset
and provides an interpretation of the clusters, their business significance, and the overall effectiveness
of the model.

7.1 Summary of Cluster Results

After preprocessing the data and applying the K-Means algorithm with k = 5, five distinct customer
segments were identified. These clusters showed clear separation when visualized on a 2D scatter plot
using Annual Income and Spending Score as axes.

Cluster No. | Income Level | Spending Behavior | Cluster Size
0           | High         | High              | 23
1           | Low          | Low               | 39
2           | Medium       | Medium            | 35
3           | High         | Low               | 24
4           | Low          | High              | 29

These counts sum to 150 and should be read as illustrative; exact sizes depend on the data and on centroid initialization.

7.2 Visualization Insights

The following visualizations were generated:

• Elbow Plot: Helped determine the optimal number of clusters (k = 5).
• 2D Cluster Plot: Customers plotted based on income and spending score.
• Cluster Color Coding: Each cluster was color-coded for easy interpretation.

These visualizations made it easier for stakeholders to understand the segmentation without deep
technical knowledge.

7.3 Business Interpretation of Clusters


Cluster 0 – “Target Premium Customers”

• High Income, High Spending
• These are loyal, profitable customers.
• Business Strategy: Offer personalized services, exclusive deals, early access to products.

Cluster 1 – “Minimal Engagement Segment”

• Low Income, Low Spending
• Least profitable segment.
• Business Strategy: Limit marketing resources or offer budget-friendly services.

Cluster 2 – “Average Segment”

• Moderate Income and Spending
• Balanced customers.
• Business Strategy: Potential to upsell and cross-sell.

Cluster 3 – “Cautious High Earners”

• High Income, Low Spending
• Financially strong but conservative in spending.
• Business Strategy: Investigate barriers, educate on product value, use targeted advertising.

Cluster 4 – “Value Seekers”

• Low Income, High Spending
• Spend more than expected for their income level.
• Business Strategy: Monitor for risk, incentivize with time-limited offers.

7.4 Model Evaluation

The Silhouette Score was used to evaluate the effectiveness of clustering. The score obtained was
approximately 0.55.

• A score above 0.5 is commonly read as indicating reasonable structure in the clusters.
• The clusters were largely well separated and fairly balanced in size.
• Some overlaps were noted, which are expected in real-world data.

7.5 Discussion

The results demonstrate the strength of K-Means in identifying distinct customer segments using just
two features. While simple, the insights derived are powerful and can directly support business
decisions like:

• Personalized marketing
• Customer retention
• Sales forecasting
• Resource allocation

However, the results can be further improved by including more features such as purchase frequency,
customer lifetime value, and online behavior.

8. Conclusion and Future Work


8.1 Conclusion

In this project, a machine learning-based approach for customer segmentation was developed and
implemented using the K-Means clustering algorithm in Python. The primary goal was to identify
distinct groups of customers based on two key features: Annual Income and Spending Score.

The project achieved the following:

• Successfully applied data preprocessing techniques such as normalization and encoding.
• Used the Elbow Method to determine the optimal number of clusters.
• Implemented K-Means clustering to divide customers into five meaningful segments.
• Visualized the clustering results effectively using 2D scatter plots.
• Interpreted each cluster to recommend specific business strategies for different customer
  types.

The results demonstrated that K-Means is an effective method for uncovering hidden patterns in
customer data, which can be used to improve marketing efforts, customer service, and overall business
decision-making.

8.2 Key Takeaways

• Customer segmentation helps companies target different groups more efficiently.
• K-Means clustering is simple, scalable, and interpretable, making it well suited to this task.
• Even a limited number of features can yield useful insights.
• Data visualization significantly enhances the understanding of model outcomes.

8.3 Limitations

• Only two features were used; more features could lead to more accurate and actionable
  segmentation.
• K-Means assumes spherical clusters and equal variance, which might not always represent real
  customer behavior.
• The model does not handle dynamic or time-series data where customer behavior changes
  over time.
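The spherical-cluster limitation can be seen on a standard synthetic example. The sketch below clusters scikit-learn's two-moons data with K-Means and with DBSCAN and compares each result to the known grouping; the data is not customer data, and the `eps` and `min_samples` values are illustrative choices, not tuned recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: a standard synthetic example of non-spherical groups.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
ari_dbscan = adjusted_rand_score(true_labels, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}, DBSCAN ARI: {ari_dbscan:.2f}")
```

K-Means splits the plane with a straight boundary between two centroids, so it cuts across the curved shapes, while a density-based method can follow them.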

8.4 Future Work

To improve and extend this project, the following enhancements are recommended:

1. Use of Additional Features:
   ◦ Include variables like product category preferences, website behavior, or purchase
     frequency.
2. Time-Based Analysis:
   ◦ Implement RFM (Recency, Frequency, Monetary) analysis for dynamic
     segmentation.
3. Algorithm Comparison:
   ◦ Compare K-Means with DBSCAN, Agglomerative Clustering, or Gaussian Mixture
     Models.
4. Interactive Dashboard:
   ◦ Build a web-based dashboard using Streamlit or Dash to make the results interactive
     and accessible for non-technical users.
5. Real-Time Segmentation:
   ◦ Integrate with real-time customer data systems for live clustering and personalized
     recommendations.
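As one possible starting point for the RFM idea, the three features can be derived from a transaction log with a pandas group-by. The dataset used in this report has no transaction history, so the table below, its column names (`customer_id`, `order_date`, `amount`), and the choice of snapshot date are all hypothetical.

```python
import pandas as pd

# Hypothetical transaction log; the column names are illustrative only.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-02-20", "2024-03-10", "2023-12-15",
    ]),
    "amount": [50.0, 20.0, 10.0, 15.0, 30.0, 200.0],
})

# Reference date for recency: one day after the last recorded order.
snapshot = transactions["order_date"].max() + pd.Timedelta(days=1)

rfm = transactions.groupby("customer_id").agg(
    # Recency: days since the customer's most recent order.
    recency_days=("order_date", lambda d: (snapshot - d.max()).days),
    # Frequency: number of orders placed.
    frequency=("order_date", "count"),
    # Monetary: total amount spent.
    monetary=("amount", "sum"),
)
print(rfm)
```

The resulting three columns could then be scaled and passed to K-Means in place of, or alongside, income and spending score.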

9. References

The following sources were consulted and referenced during the course of the project:

1. Scikit-learn Documentation
https://scikit-learn.org/stable/modules/clustering.html#k-means
2. Pandas Documentation
https://pandas.pydata.org/
3. Matplotlib Documentation
https://matplotlib.org/stable/index.html
4. Seaborn Documentation
https://seaborn.pydata.org/
5. Customer Segmentation Dataset (Mall Customers Dataset)
Available on Kaggle:
https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python
6. J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 3rd
Edition, 2011.
7. A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, O'Reilly
Media, 2nd Edition, 2019.

10. Appendices

Appendix A: Sample Dataset (Excerpt)


CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100)
1          | Male   | 19  | 15                 | 39
2          | Male   | 21  | 15                 | 81
3          | Female | 20  | 16                 | 6
4          | Female | 23  | 16                 | 77
...        | ...    | ... | ...                | ...

Appendix B: Sample Python Code Snippet


from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('Mall_Customers.csv')

# Feature selection
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# KMeans clustering
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
clusters = kmeans.fit_predict(X_scaled)
df['Cluster'] = clusters

# Visualization
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis')
plt.title('Customer Segments')
plt.xlabel('Annual Income (scaled)')
plt.ylabel('Spending Score (scaled)')
plt.grid(True)
plt.show()

Appendix C: Silhouette Score Code


from sklearn.metrics import silhouette_score

# X_scaled and kmeans come from the Appendix B snippet
score = silhouette_score(X_scaled, kmeans.labels_)
print(f"Silhouette Score: {score}")
