
Cluster Analysis

This document presents a case study on product segmentation using clustering analysis for a new beer brand. It details the process of analyzing a dataset of beer brands based on features like calories, sodium, alcohol, and cost, employing techniques such as dendrograms and the elbow method to determine the optimal number of clusters. The study ultimately identifies three distinct clusters representing different consumer segments: medium alcohol content beers, light beers for calorie-conscious consumers, and premium brands with high alcohol content.


2. Case Study: Cluster Analysis - Creating Product Segments Using Clustering
One more example of clustering is product segmentation. In this section, we discuss a case where a company would like to enter the market with a new beer brand. Before it decides what kind of beer to launch, it must understand what kinds of products already exist in the market and what segments those products address. To understand the segments, the company collects specifications of a few sample beer brands. We will use the dataset beer.csv (Feinberg et al., 2012), which lists beer brands and their features: calories, sodium, alcohol, and cost.

2.1 Beer Dataset:

The beer dataset contains 20 records. First, we will load and print them.

In [1]:
# Loading packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sn

In [2]:
beer_df = pd.read_csv("beer.csv")
beer_df
 
Out[2]:
    Name                  Calories  Sodium  Alcohol  Cost
0   Budwiser              144       15      4.7      0.43
1   Schlitz               151       19      4.9      0.43
2   Lowenbrau             157       15      0.9      0.48
3   Kronenbourg           170        7      5.2      0.73
4   Heineken              152       11      5.0      0.77
5   Old_Milwaukee         145       23      4.6      0.28
6   Augsberger            175       24      5.5      0.40
7   Srohs_Bohenian_Style  149       27      4.7      0.42
8   Miller_lite            99       10      4.3      0.43
9   Budwiser_Lite         113        8      3.7      0.40
10  Coors                 140       18      4.6      0.44
11  Coors_Light           102       15      4.1      0.46
12  Michelob_Ligh         135       11      4.2      0.50
13  Becks                 150       19      4.7      0.76
14  Kirin                 149        6      5.0      0.79
15  Pabst_Extra_Light      68       15      2.3      0.38
16  Hamms                 139       19      4.4      0.43
17  Heilemans_Old_Style   144       24      4.9      0.43
18  Olympia_Goled_Light    72        6      2.9      0.46
19  Schlitz_Light          97        7      4.2      0.47

Each observation belongs to a beer brand and contains its calories, alcohol and sodium content, and cost. Before clustering the brands into segments, the features need to be normalized because they are on different scales. So the first step in creating clusters is to normalize the features in the beer dataset.

In [3]:
# Scaling the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_beer_df = scaler.fit_transform(beer_df[["Calories", "Sodium", "Alcohol", "Cost"]])
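As a quick sanity check, StandardScaler should leave every transformed column with zero mean and unit variance. A minimal, self-contained sketch on a small made-up array (illustrative values only, not the actual beer data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only; the real beer_df has 20 rows and 4 features.
X = np.array([[144.0, 15.0],
              [151.0, 19.0],
              [99.0, 10.0],
              [113.0, 8.0]])
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True: column means are ~0
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True: column stds are ~1
```

Because every feature ends up on the same scale, no single feature (such as Calories, whose raw values are two orders of magnitude larger than Cost) dominates the distance computations used by the clustering algorithms.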

2.2 How Many Clusters Exist?

As there are four features, it is not possible to plot them directly and visually judge how many clusters may exist. For such high-dimensional data, the following techniques can be used to discover the possible number of clusters:

1. Dendrogram
2. Elbow method
2.2.1 Using a Dendrogram (Hierarchical Cluster Analysis)

A dendrogram is a cluster tree diagram that groups together entities that are nearest to each other. A dendrogram can be drawn using the clustermap() method in seaborn.

In [4]:
cmap = sn.cubehelix_palette(as_cmap=True, rot=-.3, light=1)
sn.clustermap(scaled_beer_df, cmap=cmap, linewidths=.2, figsize=(8, 8))

As shown in the figure, the dendrogram reorders the observations based on how close they are to each other in Euclidean distance. The tree on the left of the dendrogram depicts the relative distance between nodes. For example, the distance between beer brands 10 and 16 is the smallest; they appear to be very similar to each other.

In [5]:  beer_df.iloc[[10, 16]]

Out[5]:
    Name   Calories  Sodium  Alcohol  Cost
10  Coors  140       18      4.6      0.44
16  Hamms  139       19      4.4      0.43


It can be observed that the beer brands Coors and Hamms are very similar across all features. Similarly, brands 2 and 18 appear to be the most different, as the distance between them is the largest; they are represented at the two extremes of the dendrogram.

In [6]:  beer_df.iloc[[2, 18]]

Out[6]:
    Name                 Calories  Sodium  Alcohol  Cost
2   Lowenbrau            157       15      0.9      0.48
18  Olympia_Goled_Light   72        6      2.9      0.46

The brand Lowenbrau has a very low alcohol content (0.9). This could be an outlier or a data error, so it is a candidate for dropping from the dataset.
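If the record were treated as a data error, dropping it is a one-liner in pandas. A minimal, self-contained sketch (the tiny frame below is illustrative; the real code would filter beer_df the same way):

```python
import pandas as pd

# Three rows copied from the table above, just for illustration.
df = pd.DataFrame({
    "Name": ["Budwiser", "Schlitz", "Lowenbrau"],
    "Alcohol": [4.7, 4.9, 0.9],
})

# Drop the suspect record and reindex the remaining rows.
df_clean = df[df["Name"] != "Lowenbrau"].reset_index(drop=True)

print(len(df_clean))                          # 2
print("Lowenbrau" in set(df_clean["Name"]))   # False
```

In this case study the record is kept, so the later cluster outputs still include Lowenbrau.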

The tree structure on the left of the dendrogram suggests that there may be four or five clusters in the dataset. This is only an indication; the actual number of clusters can be determined only after creating the clusters and interpreting them. Creating more clusters increases the complexity of defining and managing them, so it is advisable to keep the number of clusters small and reasonable, such that they make business sense. We will later create three clusters and verify whether they explain the product segments clearly.

2.2.2 Finding the Optimal Number of Clusters Using the Elbow Curve Method (K-Means Cluster Analysis)

If we assume all the products belong to a single segment, the total within-cluster variance is at its highest. As we increase the number of clusters, the total variance across all clusters decreases, and it drops to zero if every product is treated as its own cluster. The elbow curve method therefore considers the percentage of variance explained as a function of the number of clusters: the optimal number of clusters is chosen such that adding another cluster does not change the variance explained significantly.

In [11]:
import warnings
warnings.filterwarnings("ignore")

Let us create cluster solutions ranging from one to ten clusters, observe the within-cluster sum of squares (WCSS) for each solution, and note how the marginal gain in explained variance gradually diminishes. The inertia_ attribute of a fitted KMeans model gives the total within-cluster variance for a particular number of clusters. The following code iterates over cluster counts from 1 to 10 and plots the resulting elbow curve.

In [12]:
# K-means cluster analysis: elbow curve
from sklearn.cluster import KMeans

cluster_range = range(1, 11)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters, random_state=42)  # fixed seed for reproducibility
    clusters.fit(scaled_beer_df)
    cluster_errors.append(clusters.inertia_)

plt.figure(figsize=(6, 4))
plt.plot(cluster_range, cluster_errors, marker="o")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia (Within-cluster SSE)")
plt.title("Elbow Method for Optimal Clusters")
plt.show()
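One heuristic way to read the elbow numerically is to compare how much inertia each extra cluster saves: the savings shrink sharply past the true number of clusters. A self-contained sketch on synthetic blobs (make_blobs data, not the beer dataset, so the snippet runs on its own; the choice of 3 centers is an illustrative assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 200 synthetic points drawn around 3 well-separated centers.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
            for k in range(1, 7)]
drops = -np.diff(inertias)  # inertia saved by each additional cluster

# The gain from going 2 -> 3 clusters dwarfs the gain from 3 -> 4,
# which is exactly the "elbow" the plot shows visually.
print(drops[1] > drops[2])
```

On the beer data the same comparison can be made directly on the cluster_errors list computed above.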

2.3 Creating Clusters


We will set k to 3 for running the KMeans algorithm and create a new column clusterid in beer_df to capture the cluster number assigned to each record.

In [13]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.cluster import KMeans

k = 3
clusters = KMeans(n_clusters=k, random_state=42, n_init="auto")
clusters.fit(scaled_beer_df)
beer_df["clusterid"] = clusters.labels_
2.4 Interpreting the Clusters

The three clusters are numbered 0, 1, and 2. We will print each cluster independently and interpret its characteristics; each cluster group can be printed by filtering the records on clusterid.

2.4.1 Cluster "0":

In [12]:  beer_df[beer_df.clusterid == 0]

Out[12]:
    Name                  Calories  Sodium  Alcohol  Cost  clusterid
0   Budwiser              144       15      4.7      0.43  0
1   Schlitz               151       19      4.9      0.43  0
5   Old_Milwaukee         145       23      4.6      0.28  0
6   Augsberger            175       24      5.5      0.40  0
7   Srohs_Bohenian_Style  149       27      4.7      0.42  0
10  Coors                 140       18      4.6      0.44  0
16  Hamms                 139       19      4.4      0.43  0
17  Heilemans_Old_Style   144       24      4.9      0.43  0

Cluster 0 groups the beers with medium alcohol content and medium cost. This is the largest cluster and may be targeting the largest segment of customers.

2.4.2 Cluster "1":

In [14]:  beer_df[beer_df.clusterid == 1]

Out[14]:
    Name                 Calories  Sodium  Alcohol  Cost  clusterid
2   Lowenbrau            157       15      0.9      0.48  1
8   Miller_lite           99       10      4.3      0.43  1
9   Budwiser_Lite        113        8      3.7      0.40  1
11  Coors_Light          102       15      4.1      0.46  1
12  Michelob_Ligh        135       11      4.2      0.50  1
15  Pabst_Extra_Light     68       15      2.3      0.38  1
18  Olympia_Goled_Light   72        6      2.9      0.46  1
19  Schlitz_Light         97        7      4.2      0.47  1

Cluster 1 groups all the light beers, with low calories and low sodium content. This cluster likely addresses the customer segment that wants to drink beer but is also calorie conscious.
2.4.3 Cluster "2":

In [44]:  beer_df[beer_df.clusterid == 2]

Out[44]:
    Name         Calories  Sodium  Alcohol  Cost  clusterid
3   Kronenbourg  170        7      5.2      0.73  2
4   Heineken     152       11      5.0      0.77  2
13  Becks        150       19      4.7      0.76  2
14  Kirin        149        6      5.0      0.79  2

Cluster 2 contains expensive beers with relatively high alcohol content and low sodium. The costs are high because the target customers may be brand sensitive, and these brands are promoted as premium brands.
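The three interpretations above can be verified at a glance with a per-cluster mean of each feature via groupby. A minimal sketch on illustrative rows (the column names match the beer table; on the real data the call would simply be beer_df.groupby("clusterid").mean(numeric_only=True)):

```python
import pandas as pd

# Six illustrative rows, two per cluster, with values taken from the tables above.
df = pd.DataFrame({
    "Calories":  [144, 151, 99, 68, 170, 152],
    "Cost":      [0.43, 0.43, 0.43, 0.38, 0.73, 0.77],
    "clusterid": [0, 0, 1, 1, 2, 2],
})

# Mean of each feature per cluster: a compact cluster profile.
profile = df.groupby("clusterid").mean()

print(profile.loc[2, "Cost"] > profile.loc[0, "Cost"])          # True: cluster 2 is the premium segment
print(profile.loc[1, "Calories"] < profile.loc[0, "Calories"])  # True: cluster 1 is the light segment
```

Such a profile table is often easier to present to business stakeholders than the raw per-brand listings.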

2.5 Hierarchical Clustering:

Hierarchical clustering is a clustering algorithm that develops clusters using the following steps:

1. Start with each data point in its own cluster.
2. Find the two clusters with the shortest distance between them (using an appropriate distance measure) and merge them into one cluster.
3. Repeat step 2 until all data points are merged into a single cluster.

This procedure is called agglomerative hierarchical clustering. AgglomerativeClustering in sklearn.cluster provides an implementation and takes the number of clusters to be created as an argument. Agglomerative hierarchical clustering can be represented and understood using a dendrogram, like the one created earlier.
In [14]:
import scipy.cluster.hierarchy as shc
dendrogram = shc.dendrogram(shc.linkage(scaled_beer_df, method="complete"))
plt.title("Dendrogram")

Out[14]: Text(0.5, 1.0, 'Dendrogram')

In [15]:  from sklearn.cluster import AgglomerativeClustering

We will create clusters using AgglomerativeClustering and store the new cluster labels in the h_clusterid column.

In [47]:
h_clusters = AgglomerativeClustering(n_clusters=3)
h_clusters.fit(scaled_beer_df)
beer_df["h_clusterid"] = h_clusters.labels_

2.5.1 Comparing the Clusters Created by K-Means and Hierarchical Clustering

We will print each cluster independently and compare it with the corresponding k-means cluster.

In [48]:  beer_df[beer_df.h_clusterid == 0]

Out[48]:
    Name                 Calories  Sodium  Alcohol  Cost  clusterid  h_clusterid
2   Lowenbrau            157       15      0.9      0.48  1          0
8   Miller_lite           99       10      4.3      0.43  1          0
9   Budwiser_Lite        113        8      3.7      0.40  1          0
11  Coors_Light          102       15      4.1      0.46  1          0
12  Michelob_Ligh        135       11      4.2      0.50  1          0
15  Pabst_Extra_Light     68       15      2.3      0.38  1          0
18  Olympia_Goled_Light   72        6      2.9      0.46  1          0
19  Schlitz_Light         97        7      4.2      0.47  1          0

In [49]:  beer_df[beer_df.h_clusterid == 1]

Out[49]:
    Name                  Calories  Sodium  Alcohol  Cost  clusterid  h_clusterid
0   Budwiser              144       15      4.7      0.43  0          1
1   Schlitz               151       19      4.9      0.43  0          1
5   Old_Milwaukee         145       23      4.6      0.28  0          1
6   Augsberger            175       24      5.5      0.40  0          1
7   Srohs_Bohenian_Style  149       27      4.7      0.42  0          1
10  Coors                 140       18      4.6      0.44  0          1
16  Hamms                 139       19      4.4      0.43  0          1
17  Heilemans_Old_Style   144       24      4.9      0.43  0          1

In [50]:  beer_df[beer_df.h_clusterid == 2]

Out[50]:
    Name         Calories  Sodium  Alcohol  Cost  clusterid  h_clusterid
3   Kronenbourg  170        7      5.2      0.73  2          2
4   Heineken     152       11      5.0      0.77  2          2
13  Becks        150       19      4.7      0.76  2          2
14  Kirin        149        6      5.0      0.79  2          2

Both clustering algorithms have produced the same groupings; only the cluster ids differ.
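This agreement can also be checked numerically: the adjusted Rand index (ARI) is invariant to label renaming and equals 1.0 when two partitions group the points identically. A self-contained sketch with label vectors mirroring the mapping seen above (k-means 0 corresponds to hierarchical 1, 1 to 0, 2 to 2; on the real data the arguments would be beer_df.clusterid and beer_df.h_clusterid):

```python
from sklearn.metrics import adjusted_rand_score

kmeans_ids = [0, 0, 0, 1, 1, 1, 2, 2]   # illustrative k-means labels
hier_ids   = [1, 1, 1, 0, 0, 0, 2, 2]   # same grouping, different ids

# ARI ignores the arbitrary numbering of clusters and scores only
# whether pairs of points are grouped together consistently.
print(adjusted_rand_score(kmeans_ids, hier_ids))  # 1.0
```

An ARI below 1.0 would indicate that the two algorithms actually disagree on some brands, not merely that the ids were shuffled.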

