Cluster Analysis
The Beer dataset contains about 20 records. First, we will load and print the records.
Each observation corresponds to a beer brand and contains information about its calories,
alcohol and sodium content, and cost. Before clustering the brands into segments, the
features need to be normalized as they are on different scales. So, the first step in creating
clusters is to normalize the features in the beer dataset.
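A minimal sketch of these two steps is shown below. It assumes the data is stored in a file named beer.csv and that the feature columns are named calories, sodium, alcohol, and cost; the file name and column names are assumptions, not confirmed by the text.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load and print the beer records (file name 'beer.csv' is assumed)
beer_df = pd.read_csv('beer.csv')
print(beer_df)

# Normalize the features to zero mean and unit variance, since calories,
# sodium, alcohol and cost are measured on different scales
feature_cols = ['calories', 'sodium', 'alcohol', 'cost']
scaler = StandardScaler()
scaled_beer_df = pd.DataFrame(scaler.fit_transform(beer_df[feature_cols]),
                              columns=feature_cols)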
As there are four features, it is not possible to plot and visualize them to understand how many
clusters may exist. For high-dimensional data, the following techniques can be used for
discovering the possible number of clusters:
1. Dendrogram
2. Elbow method
2.2.1 Using a Dendrogram (Hierarchical Cluster Analysis)
A dendrogram is a cluster tree diagram that groups together entities which are closer to each
other. A dendrogram can be drawn using the clustermap() method in seaborn.
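A possible call, using the normalized features from the earlier sketch, could look like the following; the cmap and figsize values are arbitrary choices.

import seaborn as sns

# clustermap() draws the dendrogram (row and column trees) along with a
# heatmap of the normalized features
sns.clustermap(scaled_beer_df, cmap='Blues', figsize=(8, 8))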
As shown in the figure, the dendrogram reorders the observations based on how close they are
to each other, using Euclidean distances. The tree on the left of the dendrogram depicts the
relative distance between nodes. For example, the distance between beer brands 10 and 16 is
the smallest; they seem to be very similar to each other.
The brand Lowenbrau seems to have a very low alcohol content. This could be an outlier or a
data error, so it can be dropped from the dataset.
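A one-line sketch of this step is given below; it assumes the brand name is stored in a column called name, and it reuses scaler and feature_cols from the earlier sketch to keep the normalized data in sync after the row is removed.

# Drop the Lowenbrau record (the column name 'name' is an assumption)
beer_df = beer_df[beer_df['name'] != 'Lowenbrau'].reset_index(drop=True)

# Re-run the normalization so that scaled_beer_df matches the filtered data
scaled_beer_df = pd.DataFrame(scaler.fit_transform(beer_df[feature_cols]),
                              columns=feature_cols)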
The tree structure on the left of the dendrogram indicates that there may be four or five clusters
in the dataset. This is only a guideline or indication about the number of clusters, but the actual
number of clusters can be determined only after creating the clusters and interpreting them.
Creating a larger number of clusters increases the complexity of defining and managing them.
It is always advisable to have a small, reasonable number of clusters that make business
sense. We will create three clusters and verify whether they explain the product segments
clearly.
2.2.2 Finding the Optimal Number of Clusters Using the Elbow Curve Method (K-Means Cluster
Analysis)
If we assume that all the products belong to a single segment, the within-cluster variance will
be the highest. As we increase the number of clusters, the total within-cluster variance starts
reducing, and it becomes zero when each product is treated as a cluster by itself. The elbow
curve method therefore considers the percentage of variance explained as a function of the
number of clusters. The optimal number of clusters is chosen such that adding another cluster
does not change the explained variance significantly.
Let us create cluster solutions ranging from one to ten clusters, observe the within-cluster sum
of squares (WCSS) for each, and see how the marginal gain in explained variance gradually
diminishes. The inertia_ attribute of a fitted KMeans model provides the total within-cluster
variance for a particular number of clusters. The following code iterates over cluster counts
from 1 to 10, records the inertia for each, and plots the elbow curve.
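A sketch of such a loop is given below; it reuses the normalized features (scaled_beer_df) from the earlier sketches, and the random_state value is an arbitrary choice.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-means for k = 1 to 10 and record the total within-cluster variance
cluster_range = range(1, 11)
cluster_errors = []
for num_clusters in cluster_range:
    model = KMeans(n_clusters=num_clusters, random_state=42)
    model.fit(scaled_beer_df)
    cluster_errors.append(model.inertia_)

# Plot the elbow curve: inertia versus the number of clusters
plt.plot(cluster_range, cluster_errors, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Total within-cluster variance (inertia_)')
plt.show()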
Based on the elbow curve, we create three clusters with K-means; the clusters are numbered
0, 1, and 2, and the labels are stored in a clusterid column. We can then print each cluster
group independently by filtering the records on clusterid and interpret the characteristics of
each cluster.
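A minimal sketch of this step could look like the following (the random_state value is an arbitrary choice):

from sklearn.cluster import KMeans

# Fit K-means with three clusters on the normalized features and store the
# cluster labels in a new clusterid column
kmeans = KMeans(n_clusters=3, random_state=42)
beer_df['clusterid'] = kmeans.fit_predict(scaled_beer_df)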
In [12]: 1 beer_df[beer_df.clusterid == 0]
In cluster 0, beers with medium alcohol content and medium cost are grouped together. This is
the largest cluster and may be targeting the largest segment of customers.
In [14]: 1 beer_df[beer_df.clusterid == 1]
In cluster 1, all the light beers with low calorie and sodium content are clustered into one
group. This must be addressing the segment of customers who want to drink beer but are also
calorie conscious.
In [44]: 1 beer_df[beer_df.clusterid == 2]
These are expensive beers with relatively high alcohol content, and the sodium content is low.
The cost is high because the target customers could be brand sensitive and these beers are
promoted as premium brands.
Hierarchical clustering is a clustering algorithm which uses the following steps to develop
clusters:
1. Start by treating each observation as a separate cluster.
2. Compute the distances between all pairs of clusters and merge the two closest clusters.
3. Repeat step 2 until the desired number of clusters remains.
We will create three clusters using AgglomerativeClustering and store the new cluster labels in
an h_clusterid column.
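A minimal sketch, again using the normalized features and three clusters, could be:

from sklearn.cluster import AgglomerativeClustering

# Agglomerative (hierarchical) clustering with three clusters; the default
# Ward linkage on Euclidean distances is used here
h_cluster = AgglomerativeClustering(n_clusters=3)
beer_df['h_clusterid'] = h_cluster.fit_predict(scaled_beer_df)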
We will print each cluster independently and interpret the characteristics of each cluster.
In [48]: 1 beer_df[beer_df.h_clusterid == 0]
In [49]: 1 beer_df[beer_df.h_clusterid == 1]
In [50]: 1 beer_df[beer_df.h_clusterid == 2]
Both the clustering algorithms have created similar clusters; only the cluster IDs have changed.
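One quick way to verify this is to cross-tabulate the two sets of labels; a sketch is shown below.

import pandas as pd

# Each K-means cluster should map almost entirely to one hierarchical cluster
print(pd.crosstab(beer_df.clusterid, beer_df.h_clusterid))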