Cluster Analysis
The Beer dataset contains about 20 records. First, we will load and print the records.
Each observation corresponds to a beer brand and contains information about its calories,
alcohol and sodium content, and cost. Before clustering the brands into segments, the
features need to be normalized as they are on different scales. So, the first step in creating
clusters is to normalize the features in the beer dataset.
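A minimal sketch of these two steps is shown below. It assumes the data is stored in a file named beer.csv and that the feature columns are named calories, sodium, alcohol, and cost; the file name and column names are assumptions, not confirmed by the text.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load and print the beer records (file name 'beer.csv' is assumed)
beer_df = pd.read_csv('beer.csv')
print(beer_df)

# Normalize the features to zero mean and unit variance, since calories,
# sodium, alcohol and cost are measured on different scales
feature_cols = ['calories', 'sodium', 'alcohol', 'cost']
scaler = StandardScaler()
scaled_beer_df = pd.DataFrame(scaler.fit_transform(beer_df[feature_cols]),
                              columns=feature_cols)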
As there are four features, it is not possible to plot and visualize them to understand how many
clusters may exist. For high-dimensional data, the following techniques can be used for
discovering the possible number of clusters:
1. Dendrogram
2. Elbow method
2.2.1 Using a Dendrogram (Hierarchical Cluster Analysis)
A dendrogram is a cluster tree diagram that groups together entities which are closer to each
other. A dendrogram can be drawn using the clustermap() method in seaborn.
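A possible call, using the normalized features from the earlier sketch, could look like the following; the cmap and figsize values are arbitrary choices.

import seaborn as sns

# clustermap() draws the dendrogram (row and column trees) along with a
# heatmap of the normalized features
sns.clustermap(scaled_beer_df, cmap='Blues', figsize=(8, 8))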
As shown in the figure, the dendrogram reorders the observations based on how close they are
to each other, using Euclidean distances. The tree on the left of the dendrogram depicts the
relative distance between nodes. For example, the distance between beer brands 10 and 16 is
the smallest; they seem to be very similar to each other.
The brand Lowenbrau seems to have a very low alcohol content. This could be an outlier or a
data error, so it can be dropped from the dataset.
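A one-line sketch of this step is given below; it assumes the brand name is stored in a column called name, and it reuses scaler and feature_cols from the earlier sketch to keep the normalized data in sync after the row is removed.

# Drop the Lowenbrau record (the column name 'name' is an assumption)
beer_df = beer_df[beer_df['name'] != 'Lowenbrau'].reset_index(drop=True)

# Re-run the normalization so that scaled_beer_df matches the filtered data
scaled_beer_df = pd.DataFrame(scaler.fit_transform(beer_df[feature_cols]),
                              columns=feature_cols)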
The tree structure on the left of the dendrogram indicates that there may be four or five clusters
in the dataset. This is only a guideline or indication about the number of clusters, but the actual
number of clusters can be determined only after creating the clusters and interpreting them.
Creating a larger number of clusters increases the complexity of defining and managing them.
It is always advisable to have a small, reasonable number of clusters that make business
sense. We will create three clusters and verify whether they explain the product segments
clearly.
2.2.2 Finding the Optimal Number of Clusters Using the Elbow Curve Method (K-Means Cluster
Analysis)
If we assume that all the products belong to a single segment, the within-cluster variance will
be the highest. As we increase the number of clusters, the total within-cluster variance starts
reducing, and it becomes zero when each product is treated as a cluster by itself. The elbow
curve method therefore considers the percentage of variance explained as a function of the
number of clusters. The optimal number of clusters is chosen such that adding another cluster
does not change the explained variance significantly.
Let us create cluster solutions ranging from one to ten clusters, observe the within-cluster sum
of squares (WCSS) for each, and see how the marginal gain in explained variance gradually
diminishes. The inertia_ attribute of a fitted KMeans model provides the total within-cluster
variance for a particular number of clusters. The following code iterates over cluster counts
from 1 to 10, records the inertia for each, and plots the elbow curve.
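A sketch of such a loop is given below; it reuses the normalized features (scaled_beer_df) from the earlier sketches, and the random_state value is an arbitrary choice.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-means for k = 1 to 10 and record the total within-cluster variance
cluster_range = range(1, 11)
cluster_errors = []
for num_clusters in cluster_range:
    model = KMeans(n_clusters=num_clusters, random_state=42)
    model.fit(scaled_beer_df)
    cluster_errors.append(model.inertia_)

# Plot the elbow curve: inertia versus the number of clusters
plt.plot(cluster_range, cluster_errors, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Total within-cluster variance (inertia_)')
plt.show()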
Based on the elbow curve, we create three clusters with K-means; the clusters are numbered
0, 1, and 2, and the labels are stored in a clusterid column. We can then print each cluster
group independently by filtering the records on clusterid and interpret the characteristics of
each cluster.
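A minimal sketch of this step could look like the following (the random_state value is an arbitrary choice):

from sklearn.cluster import KMeans

# Fit K-means with three clusters on the normalized features and store the
# cluster labels in a new clusterid column
kmeans = KMeans(n_clusters=3, random_state=42)
beer_df['clusterid'] = kmeans.fit_predict(scaled_beer_df)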
In [12]: 1 beer_df[beer_df.clusterid == 0]
In cluster 0, beers with medium alcohol content and medium cost are grouped together. This is
the largest cluster and may be targeting the largest segment of customers.
In [14]: 1 beer_df[beer_df.clusterid == 1]
In cluster 1, all the light beers with low calorie and sodium content are clustered into one
group. This must be addressing the segment of customers who want to drink beer but are also
calorie conscious.
In [44]: 1 beer_df[beer_df.clusterid == 2]
These are expensive beers with relatively high alcohol content, and the sodium content is low.
The cost is high because the target customers could be brand sensitive and these beers are
promoted as premium brands.
Hierarchical clustering is a clustering algorithm which uses the following steps to develop
clusters:
1. Start by treating each observation as a separate cluster.
2. Compute the distances between all pairs of clusters and merge the two closest clusters.
3. Repeat step 2 until the desired number of clusters remains.
We will create three clusters using AgglomerativeClustering and store the new cluster labels in
an h_clusterid column.
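A minimal sketch, again using the normalized features and three clusters, could be:

from sklearn.cluster import AgglomerativeClustering

# Agglomerative (hierarchical) clustering with three clusters; the default
# Ward linkage on Euclidean distances is used here
h_cluster = AgglomerativeClustering(n_clusters=3)
beer_df['h_clusterid'] = h_cluster.fit_predict(scaled_beer_df)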
We will print each cluster independently and interpret the characteristics of each cluster.
In [48]: 1 beer_df[beer_df.h_clusterid == 0]
In [49]: 1 beer_df[beer_df.h_clusterid == 1]
In [50]: 1 beer_df[beer_df.h_clusterid == 2]
Both the clustering algorithms have created similar clusters; only the cluster IDs have changed.
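One quick way to verify this is to cross-tabulate the two sets of labels; a sketch is shown below.

import pandas as pd

# Each K-means cluster should map almost entirely to one hierarchical cluster
print(pd.crosstab(beer_df.clusterid, beer_df.h_clusterid))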