Clustering Monograph
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Contents
1. Introduction
2. Clustering, Formally
3. Methods of Clustering
3.1 Measures of Distance
3.1.1 Common Measures of Distance
Computing distances
Use sample data through manual input
4. Hierarchical Clustering
4.1 Definition
4.2 Agglomerative Clustering
4.2.1 Concept of Linkage
5. K-Means Clustering
5.1 Determining the Number of Clusters
5.1.1 Elbow Method
5.1.2 Silhouette Method
6. Further Discussion and Considerations
6.1 Divisive Clustering Algorithm
6.2 Scaling
6.3 Discrete Attributes
6.4 K-Median Algorithm
6.5 K-Medoid Algorithm
6.6 K-Mode Clustering
List of Figures
Fig. 1: Data arrangement in matrix format
Fig. 2: Diagram of agglomerative clustering
Fig. 6: Histogram of the variables to be used to compute distance
1. Introduction
In today’s world of information explosion, knowledge is derived by classifying information.
Analytics derives its strength from eliminating random perturbations and identifying the
underlying structure. Every customer who walks into a departmental superstore is different. No
two persons will buy identical items in identical quantities. Customers who are young and
single are likely to buy certain items, whereas elderly couples are likely to buy certain other
items. To promote sales while utilizing resources in a cost-effective manner, superstores need to
understand the requirements of different groups of customers. Instead of a blanket
advertisement policy, where all items are advertised to all customers, a much more
efficient way of increasing sales is to classify customers into mutually exclusive groups and
come up with a tailor-made advertising policy for each group.
Clustering lies at the core of marketing strategy. In recent times many other application areas
of clustering have emerged. Clustering documents on the web helps to extract and classify
information. Clustering of genes helps to identify various properties, including their disease-carrying
propensity. Clustering is also a powerful tool in data mining and in pattern recognition for
image analysis.
Clustering is an unsupervised technique. The underlying assumption is that the observed data
come from multiple populations. To elaborate on the marketing strategy example, it is
assumed that distinct populations exist within the customer base of a supermarket. The
difference among populations may not be based on demographics only. A set of complex
characteristics based on demography, socio-economic strata and other conditions delineates the
populations, and these populations form a partition of the customers. Once the clusters are identified, they
can be studied more closely, and different strategies may be applied to garner more
business from the targeted groups.
The object of clustering is to form homogeneous partitions out of heterogeneous observations.
Clustering may be done in many ways, not all of them optimally data driven. It is possible
to define clusters based on age groups alone, or based on total amount purchased or visit
frequency. In subsequent sections we will discuss data-driven clustering procedures only. The
rationale behind such procedures is to define homogeneous partitions of the data.
2. Clustering, Formally
Clustering is an example of unsupervised learning. In unsupervised learning problems there
is no designated response. A predictive modelling problem is a supervised learning problem (see the
monograph on Predictive Modelling) since it contains a designated response variable which
depends on a set of predictors. In clustering problems the underlying populations are not
identified, nor is the number of possible populations known. It is assumed that the observed
data is heterogeneous, and partitioning it into multiple populations is expected to make the
clusters homogeneous. In other words, observations belonging to the same cluster are more similar
among themselves, while observations belonging to different groups are dissimilar. Clustering is
a process to find meaningful structure and features in a set of observations based on given
measures of similarity.
The following important points need to be clarified at the outset.
I. It is assumed that the observations are not generated from a single population. However,
there is no concrete evidence that it is so.
II. The rationale behind clustering is that the total variation among all observations gets
reduced substantially when the data set is partitioned into multiple groups.
III. Since the number of populations present, k, is unknown, it is also determined from the
data.
IV. Once the different populations are identified, they must be treated separately. This
implies that no parameters should be assumed to be common across populations. If a predictive model
is developed, a separate model needs to be developed for each population.
3. Methods of Clustering
Clustering aims to determine the intrinsic grouping among the data points. The only criterion
for grouping is that observations belonging to the same cluster are closer to one another than to observations
belonging to different groups. Hence a measure of closeness or similarity between pairs of
points needs to be developed. In subsequent sections we will discuss various options for measuring
similarity. Depending on the choice of similarity, a pair whose similarity measure is higher
than that of a second pair is assumed to be closer, with respect to that measure of similarity,
than the second pair.
There are two primary approaches to clustering: hierarchical (agglomerative)
clustering and k-means clustering. In hierarchical clustering, the closest points are combined
in a pairwise manner to form the clusters. It is an iterative procedure where, at every step of
the iteration, two points or two clusters are combined to form a bigger cluster. At the end of
the process all points are combined into a single cluster. The number of clusters is not
pre-determined.
In the k-means clustering process, k denotes the number of clusters, which is pre-determined.
Once k is fixed, the observations are allocated to one and only one cluster so that the closest
points belong to one cluster. The cluster size is not controlled.
3.1 Measures of Distance
Since the clustering process is completely controlled by the distance between two points and
the distance between two clusters, it is paramount that the concept of distance is clear before we
can move on to the actual clustering process.
Clustering works only when the observations are multivariate. For univariate observations the
concept of distance is trivial. We will also assume that the variables are numeric only. If the
variables are categorical, or if the type is mixed, these simple measures of distance will not work.
A few alternative distance measures are given in Section 6.
Let us introduce matrix notation to describe the observations. Let X denote the data matrix of
order n x p, i.e. there are n observations in the data and each observation has p dimensions. In
the matrix each row denotes an observation and each column, one dimension or attribute.
X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}
Example: Let us consider a small data set of Google user ratings on various places of interest.
Suppose there are 5 users and each of them rates 3 tourist attractions. The
rating may be any number between 1 and 5, where 1 is the lowest rating and 5 is the highest.
(Details of this data set are provided later.)
In Fig 1 below the left panel shows the observed data and the right panel shows the data
arranged in matrix notation. Since there are 5 rows and 3 columns, the matrix is of dimension
5 x 3. Unless otherwise mentioned, each row represents one data point, each column represents
one attribute. The (i, j)-th cell of a data matrix identifies the value of the j-th attribute on the i-
th observation. If there is a missing value, that cell remains blank.
For the purpose of clustering, distance is defined between a pair of observations combining all
the attributes. There are many possible distance measures. Distance is a non-negative quantity.
Distance between two identical observations must be zero. Distance between two non-identical
observations must always be positive.
Pairwise distances among a set of n p-dimensional observations can be arranged in an n x n
symmetric (square) matrix whose principal diagonal is all 0.
3.1.1 Common Measures of Distance
Euclidean Distance: Between two observations X1 and X2, each of dimension p, the Euclidean
distance is defined as

Ed(X_1, X_2) = \sqrt{\sum_{j=1}^{p} (X_{1j} - X_{2j})^2}

Squared Euclidean Distance: The squared Euclidean distance is simply the square of the above, i.e.

SEd(X_1, X_2) = \sum_{j=1}^{p} (X_{1j} - X_{2j})^2
Manhattan Distance: This is also known as the absolute value distance or L1 distance. The
Manhattan distance between two observations X1 and X2 is defined as

Md(X_1, X_2) = \sum_{j=1}^{p} |X_{1j} - X_{2j}|

Minkowski's Distance: The Minkowski distance of order q between two observations X1 and X2 is defined as

Mkd(X_1, X_2) = \left( \sum_{j=1}^{p} |X_{1j} - X_{2j}|^{q} \right)^{1/q}

where q is a positive integer. Note that for q = 1 Minkowski's distance coincides with the
Manhattan distance, and for q = 2 it is identical to the Euclidean distance.
There are many other important distance measures, among which the Mahalanobis Distance,
the Cosine Distance and the Jaccard Distance deserve mention.
Following is an illustration of distance matrices computed for data given in Fig 1. Note that the
matrices shown are all lower triangular since the principal diagonal is 0 and the distance
matrices are symmetric.
Computing distances
Use sample data through manual input
# 5 users (rows), 3 attractions (columns)
X <- cbind(c(0.58, 0.58, 0.54, 0.54, 0.58),
           c(3.65, 3.65, 3.66, 3.66, 3.67),
           c(3.67, 3.68, 3.68, 3.67, 3.66))
print(X)
# Euclidean distance
round(dist(X, method = 'euclidean'),2)
## 1 2 3 4
## 2 0.01
## 3 0.04 0.04
## 4 0.04 0.04 0.01
## 5 0.02 0.03 0.05 0.04
# Manhattan distance
round(dist(X, method = 'manhattan'),2)
## 1 2 3 4
## 2 0.01
## 3 0.06 0.05
## 4 0.05 0.06 0.01
## 5 0.03 0.04 0.07 0.06
# Minkowski's distance, q = 3
round(dist(X, method = 'minkowski', p = 3),2)
##      1    2    3    4
## 2 0.01
## 3 0.04 0.04
## 4 0.04 0.04 0.01
## 5 0.02 0.03 0.04 0.04
Note that the numerical values of the distances may differ depending on which method
has been used to compute them. The distance between Users 1 and 2 is 0.01 under all three
methods, while the distance between Users 1 and 3 depends on the method used. The Euclidean
distance between Users 1 and 3 is 0.04, which is identical to the Euclidean distance between Users 1 and
4. However, the Manhattan distance between Users 1 and 4 is less than that between Users 1
and 3. This observation indicates that the resulting clustering may depend on the distance measure.
4. Hierarchical Clustering
4.1 Definition
Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster,
and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are
performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner.
Agglomerative clustering is used more often than divisive clustering. In this monograph we
will not discuss divisive clustering in any detail (see Section 6).
Clustering of observations is the objective here. However, it is possible to apply hierarchical
clustering to cluster attributes (variables).
4.2 Agglomerative Clustering
Figure 2 shows a graphical representation of agglomerative clustering. At each stage two units,
two clusters, or one unit and one cluster are combined into a larger cluster. The graphical
representation is a tree diagram and is called a dendrogram. At the start of the algorithm, the
number of clusters is equal to the number of observations n. At the end of the algorithm, the
number of clusters is 1.
The optimum number of clusters is not pre-determined in this algorithm. Depending on the
height at which the dendrogram is cut, the final number of clusters is determined.
This method builds the hierarchy from the individual elements by progressively merging
clusters. In Fig 2, there are five elements {1} {2} {3} {4} and {5}. Each element represents
one cluster of one element, called singleton. The first step is to merge the two closest
elements into a single cluster.
Let us use Euclidean distance for illustration. In Fig 2, cutting after the first merge (from the
bottom) of the dendrogram will yield clusters {1, 2} {5} {3, 4}. Cutting after the second
merge will yield clusters {1, 2, 5} {3, 4}, which is a coarser clustering, with a smaller
number of clusters but one or more clusters having larger size.
Once one or more clusters contain more than one element, a rule for merging two
multi-element clusters needs to be defined.
4.2.1 Concept of Linkage
Single Linkage: Distance between two clusters is defined to be the smallest distance between
a pair of observations (points), one from each cluster. This is the distance between the two
closest points of the two clusters.
Complete Linkage: Distance between two clusters is defined to be the largest distance
between a pair of points, one from each cluster. This is the distance between the two points,
one from each cluster, that are farthest apart.
Average Linkage: Distance between two clusters is defined as the average of the distances between
all pairs of points, one from each cluster.
Centroid Linkage: The centroid of a cluster is defined as the vector of means of all attributes,
computed over all observations within that cluster. The centroid linkage between two clusters is the distance
between their respective centroids.
Note that the choice of distance and linkage method together define the outcome of a
clustering procedure. The same set of observations may be partitioned
differently depending on the choice of distance and linkage method.
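In R, the linkage is chosen through the method argument of hclust(). The following is a minimal sketch, reusing the small 5-user matrix X from Section 3 purely for illustration (the object names are my own):
d <- dist(X, method = "euclidean")             # pairwise Euclidean distances
hc.single   <- hclust(d, method = "single")    # single linkage
hc.complete <- hclust(d, method = "complete")  # complete linkage
hc.average  <- hclust(d, method = "average")   # average linkage
hc.centroid <- hclust(d, method = "centroid")  # centroid linkage
par(mfrow = c(2, 2))                           # compare the four dendrograms side by side
plot(hc.single); plot(hc.complete); plot(hc.average); plot(hc.centroid)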
Case Study:
Review ratings by 5456 travellers (Users) on 24 tourist attractions across various sites in
Europe are considered. This data is the primary input for an automatic recommender system.
Each reviewer (User) has rated various attractions, such as churches, resorts, bakeries etc., and
the ratings are all between 0 (lowest) and 5 (highest). The objective is to partition the
sample of travellers into clusters, so that tailored recommendations may be made for their next
travel destinations.
Statement of the problem: Cluster the Users into mutually exclusive groups according to their
preferences for various tourist attractions.
EDA on Review Ratings Data
dim(ReviewData)
## [1] 5456 25
names(ReviewData)
head(ReviewData, 5)
summary(ReviewData[,2:25]) # Summary statistics of the variables
There are 5456 reviewers’ ratings in the data. For each attribute the maximum value is
5, i.e. the highest possible score has been attained. However, not all attributes have been
given the lowest (0) rating. The mean is always greater than the median,
sometimes considerably, indicating that the distributions of the attributes are not symmetric but
positively skewed.
Histograms of the variables are shown below.
library(ggplot2)
library(tidyr)
library(purrr) # Enhances various functionalities in R
d1 <- ReviewData[,2:10]
d2 <- ReviewData[,11:19]
d3 <- ReviewData[,20:25]
d1 %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_histogram(binwidth=0.5, color="blue", fill="white")
# Improve visualization
d2 %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_histogram(binwidth=0.5, color="blue", fill="white")
# Improve visualization
d3 %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_histogram(binwidth=0.5, color="blue", fill="white")
# Improve visualization
Figure 6: Histogram of the variables to be used to compute distance
The histograms bear out the inherent skewness of the attributes. Except for two attributes
(churches and zoo), most of the data is concentrated towards the lower ratings, with a
disproportionate amount concentrated at the maximum rating of 5.
Clustering (Illustrative)
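The code that builds the distance objects used below is not shown in this extract. A minimal sketch, assuming the illustration is based on the ratings of Users 100 – 120 only (the subset and the object names d.eucld and d.manht are assumptions):
sub <- ReviewData[100:120, 2:25]              # illustrative subset of Users
d.eucld <- dist(sub, method = "euclidean")    # Euclidean distance matrix
d.manht <- dist(sub, method = "manhattan")    # Manhattan distance matrix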
hc2 <- hclust(d.eucld, "complete")
plot(hc2, hang=-1, cex = 0.6, col ="brown")
Figure 8: Ward’s method of constructing clusters
A dendrogram is a visual representation of cluster-making. On the x-axis are the item names
or item numbers. On the y-axis is the distance or height. The vertical straight lines denote the
height at which two items or two clusters combine. The higher the level of combining, the more
distant the individual items or clusters are. By definition of hierarchical clustering, all items
must eventually combine into one cluster.
However, the problem of interest here is to determine the optimum number of clusters. Recall
that each cluster is a representative of a population. Since this is an unsupervised
learning problem, there is no error measure to dictate the number of clusters. To some extent
this is a subjective choice. After considering the dendrogram, one may decide at what level
the resultant tree needs to be cut. This is another way of saying how close-knit or dispersed
the clusters should be. If the number of clusters is large, the clusters are small and
homogeneous. If the number of clusters is small, each contains more items and hence the
clusters are more heterogeneous. Often practical considerations or domain knowledge also
play a part in determining the number of clusters.
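Once a level is chosen, the tree is cut with cutree(), either at a desired number of clusters k or at a chosen height h. A minimal sketch for the illustrative tree hc2 (the values of k and h are illustrative):
cutree(hc2, k = 3)      # cluster membership for a 3-cluster solution
cutree(hc2, h = 2.5)    # alternatively, cut the tree at a given height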
Note that depending on the distance measure and linkage used, the number of clusters and
their composition may be different. Compare the dendrograms constructed with Euclidean
distance and complete linkage and with Manhattan distance and single linkage in Figures 7 - 8. In
the former there seem to be 3 clear clusters:
Cluster 1: Users 100 – 108 and User 110
Cluster 2: User 109, User 111, User 113 – 115 and User 117
Cluster 3: User 112, User 116, Users 118 – 120
In the other dendrogram (Manhattan distance and single linkage), there seem to be two multi-item clusters:
Cluster 1: Users 100 – 110
Cluster 2: Users 113 – 115
The rest of the Users are relatively distant from these two multi-item clusters and they all
form singletons. [It is possible to apply hierarchical clustering to identify multivariate
outliers.]
Ward’s method also seems to indicate two clusters.
Cluster 1: Users 100 – 110
Cluster 2: Users 111 – 120
Both clusters are of equal size and there are no singletons.
This does introduce subjectivity into the process. Nevertheless, the items (in this case Users) that
are really close combine into a single cluster irrespective of the distance measure or
linkage.
Now let us proceed to the final clustering involving all Users. Euclidean distance and
complete linkage are applied.
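The code that produced the dendrogram in Fig 9 is not shown in this extract. A plausible sketch, assuming Euclidean distance, complete linkage, a horizontal layout and suppressed leaf labels (the colouring by cluster seen in the figure is not reproduced here):
d.full <- dist(ReviewData[, 2:25], method = "euclidean")
hc2 <- hclust(d.full, method = "complete")
plot(as.dendrogram(hc2), leaflab = "none", horiz = TRUE)   # unlabelled, horizontal tree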
Figure 9: Dendrogram of the hierarchical clusters for the full data set
The dendrogram in Fig 9 needs some explanation. The default options in R will not
produce an interpretable tree diagram for hierarchical clustering with a large number of
objects. Hence the leaf nodes have not been labelled and a horizontal layout has been chosen.
The colour scheme is determined by the number of clusters, which has been
chosen as 12 after several iterations. Only the final results are shared here.
# Cluster membership
clust <- cutree(hc2, k = 12)
table(clust) # Number of observations in each cluster
## clust
## 1 2 3 4 5 6 7 8 9 10 11 12
## 416 865 900 402 809 130 246 88 475 321 572 232
## User clust
## 1 User 1 1
## 2 User 2 1
## 3 User 3 1
## 4 User 4 1
## 5 User 5 1
## 6 User 6 1
## 7 User 7 2
## 8 User 8 2
## 9 User 9 2
## 10 User 10 2
Finally, to verify the differences between the identified clusters, a few important variables are
chosen and their means and standard deviations are compared below.
library(dplyr)
ReviewData %>%
  group_by(clust) %>%
  summarise_at(vars(Lodgings, Fast.Food, Restaurants), funs(mean(., na.rm = TRUE)))
## # A tibble: 12 x 4
## clust Lodgings Fast.Food Restaurants
## <int> <dbl> <dbl> <dbl>
## 1 1 2.43 2.01 2.51
## 2 2 1.64 1.75 2.14
## 3 3 2.35 1.93 3.95
## 4 4 2.44 2.15 4.12
## 5 5 1.43 1.65 4.81
## 6 6 3.05 3.89 4.14
## 7 7 1.78 2.85 2.63
## 8 8 0.924 0.913 4.17
## 9 9 4.57 4.33 2.73
## 10 10 2.19 2.08 2.20
## 11 11 1.40 1.41 1.98
## 12 12 1.37 1.03 1.60
ReviewData %>%
  group_by(clust) %>%
  summarise_at(vars(Lodgings, Fast.Food, Restaurants), funs(sd(., na.rm = TRUE)))
## # A tibble: 12 x 4
## clust Lodgings Fast.Food Restaurants
## <int> <dbl> <dbl> <dbl>
## 1 1 1.48 1.10 0.980
## 2 2 1.09 1.13 0.784
## 3 3 1.47 1.07 1.08
## 4 4 1.16 0.704 0.923
## 5 5 0.308 0.446 0.458
## 6 6 1.91 1.53 0.815
## 7 7 1.65 1.59 0.523
## 8 8 0.185 0.190 1.44
## 9 9 0.803 1.03 0.329
## 10 10 0.515 0.562 0.564
## 11 11 0.530 0.423 1.08
## 12 12 1.21 0.302 0.802
Recall that the clustering process here is an input to a recommender system. Instead of
pushing all information indiscriminately to every user, more efficient planning can be made
based on the average ratings of the various attributes within the groups. Consider, for
example, that Clusters 4, 5, 6 and 8 have given very high ratings to Restaurants, whereas Clusters
4, 5 and especially 8 have given low ratings to Fast Food. It is clear that users belonging to
Cluster 8 are discriminating eaters, and information about fast food will not be
useful to them at all. Compared to them, Cluster 9 shows high ratings for Lodgings and
Fast Food but a medium rating for Restaurants. It is possible that these users are more convenience
loving, and information about good accommodation and perhaps nearby fast food joints will
be appreciated by this group.
5. K-Means Clustering
k-means clustering is the most widely used non-hierarchical clustering technique. It aims
to partition n observations into k clusters in which each observation belongs to
the cluster whose mean (centroid) is nearest to it; the centroid serves as a prototype of the cluster. It
minimizes the within-cluster variances (squared Euclidean distances).
The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is
often called "the k-means algorithm".
Given an initial set of k means m_1^{(1)}, \ldots, m_k^{(1)}, the algorithm proceeds by alternating between
two steps.
Assignment step: Assign each observation to the cluster whose mean has the least
squared Euclidean distance; intuitively, this is the "nearest" mean.
Update step: Calculate the new means (centroids) of the observations in the new clusters.
The algorithm has converged when the assignments no longer change.
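The two alternating steps can be written out directly in R. The following is a minimal sketch for illustration only; the function name simple_kmeans, the iteration cap and the lack of empty-cluster handling are simplifications of my own, and in practice the built-in kmeans() function is used.
simple_kmeans <- function(X, k, max.iter = 20) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]    # arbitrary initial means
  cl <- rep(0, nrow(X))
  for (step in seq_len(max.iter)) {
    # Assignment step: each observation goes to the centroid with the least squared distance
    d2 <- vapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2), numeric(nrow(X)))
    new.cl <- max.col(-d2)
    if (all(new.cl == cl)) break                      # converged: assignments unchanged
    cl <- new.cl
    # Update step: recompute each centroid as the mean of its members
    # (empty clusters are not handled in this sketch)
    for (j in seq_len(k)) centers[j, ] <- colMeans(X[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)
}
# e.g. simple_kmeans(ReviewData[, 2:25], k = 6)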
One advantage of k-means clustering is that computation of distances among all pairs of
observations is not necessary. However, the number of clusters k needs to be pre-specified.
Once that is determined, an arbitrary partition of the data into k clusters is the starting point. Then
sequential assignment and update will find the clusters that are most separated. The cardinal
rule of cluster building is that at every step the within-cluster variance will reduce (or stay the
same) while the between-cluster variance will increase (or stay the same).
Random assignment of initial centroids eliminates bias. However, it is not an efficient
algorithmic process. An alternative way to form clusters is to start with a set of pre-specified
starting points, though it is possible that such a set of initial clusters, if it converges to a local
optimum, may introduce bias. The following algorithm is a compromise between efficiency
and randomness.
Leader Algorithm:
Step 1: Select the first item from the list. This item forms the centroid of the first cluster.
Step 2: Search through the subsequent items until an item is found that is at least distance
δ away from every previously defined cluster centroid. This item forms the centroid of
the next cluster.
Step 3: Repeat Step 2 until all k cluster centroids are obtained or no further items can
be assigned.
Step 4: Obtain the initial clusters by assigning each item to the nearest cluster
centroid.
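A minimal R sketch of the Leader algorithm, for illustration only (the function name leader_init and its arguments are my own):
leader_init <- function(X, k, delta) {
  X <- as.matrix(X)
  centers <- X[1, , drop = FALSE]                   # Step 1: first item is the first centroid
  for (i in 2:nrow(X)) {
    if (nrow(centers) == k) break                   # Step 3: stop once k centroids are found
    d <- sqrt(colSums((t(centers) - X[i, ])^2))     # distances to the existing centroids
    if (all(d >= delta))                            # Step 2: at least delta away from all of them
      centers <- rbind(centers, X[i, ])
  }
  # Step 4: assign every item to the nearest centroid
  d2 <- vapply(seq_len(nrow(centers)), function(j) colSums((t(X) - centers[j, ])^2),
               numeric(nrow(X)))
  list(centers = centers, cluster = max.col(-d2))
}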
The distance δ and the number of clusters k are the inputs to this algorithm. It is possible
that domain knowledge or other subjective considerations determine the value of k. If
no such subjective input is available, k may be determined by empirical methods.
5.1 Determining the Number of Clusters
There are many methods recommended for determining an optimal number of
partitions. Unfortunately, however, there is no closed-form solution to the problem of
determining k. The choice is somewhat subjective and graphical methods are often employed.
The objective of partitioning is to separate out the observations or units so that the ‘most’ similar
items are put together.
For a given number of clusters, the total within-cluster sum of squares (WSS) is computed.
The value of k at which the addition of one more cluster no longer lowers the total WSS
appreciably is chosen as the optimum.
5.1.1 Elbow Method
The Elbow method looks at the total WSS as a function of the number of clusters.
library(ggplot2)
library(factoextra) # Required for choosing optimum K
set.seed(310)
# Plot total within-cluster sum of squares against the number of clusters
fviz_nbclust(ReviewData[, 2:25], kmeans, method = "wss", k.max = 25) +
  theme_minimal() + ggtitle("The Elbow Method")
Fig 10 does not indicate any clear break in the elbow. At k = 15, for the first time, adding one more
cluster does not reduce the total within-cluster sum of squares. Hence one option for the optimum
number of clusters is 15.
[Recall that hierarchical clustering of the same data suggested 12 clusters. There may be a wide
discrepancy in the number of clusters depending on the procedure applied. This will become clearer
in the subsequent analysis.]
Figure 10: Elbow method to determine optimum number of clusters
5.1.2 Silhouette Method
This method measures how tightly the observations are clustered and the average distance
between clusters. For each observation a silhouette width is computed. It is a function
of the average distance between the point and all other points in the cluster to which it
belongs, and the smallest average distance between the point and the points of any cluster to which it
does not belong. The value of k that maximizes the average silhouette width is taken as the optimum.
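The code that produced Fig 11 is not shown in this extract; a plausible sketch, mirroring the elbow-plot call above:
fviz_nbclust(ReviewData[, 2:25], kmeans, method = "silhouette", k.max = 25) +
  theme_minimal() + ggtitle("The Silhouette Method")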
Figure 11: Silhouette method to determine optimum number of clusters
It is clear from Fig 11 that the maximum value of the silhouette is achieved at k = 20, which,
therefore, is considered to be the optimum number of clusters for this data.
Let us now proceed with 20 clusters. Clustering, like predictive modelling, is an
iterative process. The final recommendation may be quite different from the initial starting
point.
The k-means algorithm is greedy and is sensitive to the random starting
points. The parameter nstart controls the number of random starting partitions; its default value in R is one.
However, it is strongly recommended to compute k-means clustering with a large value
of nstart, such as 25 or 50, in order to obtain a more stable result. If nstart = 25 is specified,
then R will try 25 different random starting assignments and then select the best result,
i.e. the one with the lowest within-cluster variation. In spite of that, it is possible
that a sub-optimal partition is found.
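The calls that created the k-means objects discussed below are not shown in this extract. A plausible sketch (the object names kmeans.clus20, kmeans.clus3 and kmeans.clus6 follow the output shown later; the seed and the value of nstart are assumptions):
set.seed(310)
kmeans.clus20 <- kmeans(ReviewData[, 2:25], centers = 20, nstart = 25)
kmeans.clus3  <- kmeans(ReviewData[, 2:25], centers = 3,  nstart = 25)
kmeans.clus6  <- kmeans(ReviewData[, 2:25], centers = 6,  nstart = 25)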
The output of k-means clustering includes a number of items: the total sum of squares,
the within-cluster sum of squares for each cluster, the total within-cluster sum of squares, the
between-cluster sum of squares and the cluster centres.
Before the detailed outcome of the partition algorithm is discussed, one needs to check how well
separated the clusters are. Ideally every cluster will be well separated, with
minimal overlap.
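A plausible sketch of the cluster plot in Fig 12; fviz_cluster() from factoextra (already loaded above) displays the clusters on the first two principal components:
fviz_cluster(kmeans.clus20, data = ReviewData[, 2:25], geom = "point") +
  theme_minimal()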
Figure 12 clearly shows that the 20 clusters are not at all well separated. One option
at this stage is to tweak the number of clusters and see whether a smaller number
works better. Another option is to exercise subjective judgement. A
third, quite computation-intensive, option also exists.
The package NbClust computes 30 indices for determining the number of clusters. It acts as an
ensemble and proposes the value of k voted for by the maximum number of
indices.
library(NbClust)
nc <- NbClust(data = ReviewData[, 2:25], distance = "euclidean", min.nc = 2,
              max.nc = 20, method = "kmeans")
## * 3 proposed 19 as the best number of clusters
## * 1 proposed 20 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
table(nc$Best.n[1,])
##
## 0 1 2 3 4 7 8 9 18 19 20
## 2 1 5 6 3 2 1 1 1 3 1
Note that the elbow plot and the silhouette plot both recommended a very large number of clusters in
this case. Though these two instruments are probably the most commonly used for determining k,
the final decision must not be based on a single consideration.
Fig 14 shows that the 3 clusters are much better separated, though the overlap cannot be totally
eliminated.
There are a number of merits to using a smaller number of clusters. Recall that the objective of
this particular clustering effort is to devise a suitable recommendation system. It may not be
practical to manage a very large number of tailor-made recommendations.
However, the final decision must be taken after considering the within-cluster and
between-cluster sums of squares. Recall that the within-cluster sum of squares is the sum of the
squared Euclidean distances of the points in a cluster from the cluster centroid, and the
between-cluster sum of squares measures the separation among the cluster centroids (the sum of the
squared distances of the centroids from the overall mean, weighted by cluster size). The formulas
are given below.
The output of the kmeans() procedure contains all the relevant information. A comparison of
the sums of squares for 20 clusters and 3 clusters reveals more insight.
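For reference, with clusters C_1, \ldots, C_k, centroids c_1, \ldots, c_k, cluster sizes n_1, \ldots, n_k and overall mean \bar{x}, the quantities reported by kmeans() are

WSS_m = \sum_{x_i \in C_m} \lVert x_i - c_m \rVert^2, \qquad \text{tot.withinss} = \sum_{m=1}^{k} WSS_m,

\text{betweenss} = \sum_{m=1}^{k} n_m \lVert c_m - \bar{x} \rVert^2, \qquad \text{totss} = \text{tot.withinss} + \text{betweenss}.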
# Cluster = 20
kmeans.clus20$totss
## [1] 215751.4
kmeans.clus20$withinss
kmeans.clus20$tot.withinss
## [1] 94295.43
kmeans.clus20$betweenss
## [1] 121455.9
kmeans.clus20$size
## [1] 235 149 196 223 414 557 104 273 314 197 529 289 352 261 349 242 111
## [18] 236 147 278
# Cluster = 3
kmeans.clus3$totss
## [1] 215751.4
kmeans.clus3$withinss
kmeans.clus3$tot.withinss
## [1] 165456.8
kmeans.clus3$betweenss
## [1] 50294.59
kmeans.clus3$size
Note that the total sum of squares in the data does not depend on the number of clusters. The total
sum of squares for the current data set is 215751.4, which is a measure of the total variability
in the data.
Each cluster has a different within-cluster sum of squares. For k = 20, the largest within-cluster
sum of squares is 10053.9 and the total within-cluster sum of squares is 94295.43. The
between-cluster sum of squares is 121455.9.
Let us now investigate the case with k = 3. The between-cluster sum of squares is smaller
than the within-cluster sum of squares of two of the clusters. Hence k = 3 cannot be recommended,
even though this value of k was determined by majority vote in NbClust().
Next a few other values of k are experimented with and k = 6 seems to be the optimal number
of clusters.
# Cluster = 6
kmeans.clus6$withinss
kmeans.clus6$tot.withinss
## [1] 134433.5
kmeans.clus6$betweenss
## [1] 81317.87
kmeans.clus6$size
Except for cluster 2, which has a small overlap with cluster 3, the clusters are well separated. All
within-cluster sums of squares are small compared to the between-cluster sum of squares.
Hence k = 6 is recommended as the optimum number of clusters in this case.
The last step of the clustering procedure is cluster profiling. Based on the mean values of
the attributes in the different clusters, the recommender system may be developed.
# Cluster Profile
round(kmeans.clus6$centers, 2)
## 2 1.23 1.59 1.69 2.19 2.29 2.56 3.70 3.64 4.53
## 3 1.40 2.76 3.19 3.68 4.35 3.83 3.66 2.44 2.63
## 4 1.64 2.24 3.13 4.30 4.07 3.19 2.73 2.19 2.67
## 5 1.15 2.89 2.31 2.29 2.53 3.44 4.63 3.23 4.45
## 6 2.35 2.68 2.47 2.22 2.03 1.86 1.94 1.57 1.75
## Monuments Gardens
## 1 0.71 0.81
## 2 1.10 1.19
## 3 1.48 1.67
## 4 2.49 1.78
## 5 0.90 1.03
## 6 2.46 2.62
6. Further Discussion and Considerations
6.1 Divisive Clustering Algorithm
This algorithm is rarely used. The basic principle of divisive clustering was published as the
DIANA (DIvisive ANAlysis Clustering) algorithm. Initially, all data is in one cluster,
and the largest cluster is split repeatedly until every object is separate. At each split, DIANA chooses the object with
the maximum average dissimilarity to start a new cluster and then moves into it all objects that are more
similar to the new cluster than to the remainder.
6.2 Scaling
Distance-based clustering is sensitive to the scales on which the attributes are measured: an
attribute with a large numerical range dominates the distance computation. When the attributes
are measured in different units or over very different ranges, it is therefore advisable to
standardize (scale) them, for example to zero mean and unit standard deviation, before
computing distances. In the review ratings data all attributes lie on the same 0 – 5 scale, so
scaling is less of a concern.
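A minimal sketch of scaling in R, shown for illustration (the object names are my own):
scaled.data <- scale(ReviewData[, 2:25])              # centre to mean 0, scale to sd 1
d.scaled <- dist(scaled.data, method = "euclidean")   # distances on the scaled data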
6.3 Discrete Attributes
The main problem with using discrete attributes in clustering lies in computing the distance
between two items. All the distance measures defined above (Euclidean, Minkowski etc.) are
defined for continuous variables only. To use binary or nominal variables in a clustering
algorithm, a meaningful distance measure for discrete attributes needs to be
defined. There are several options, such as cosine similarity and the Mahalanobis distance. We
will not go into depth on these cases. In R, the Gower distance may be used to deal with mixed
data types.
Gower distance: For each variable type, a different distance is used and scaled to fall
between 0 and 1. For continuous variables the Manhattan distance is used. Ordinal variables are
ranked first and then the Manhattan distance is used with a special adjustment for ties. Nominal
variables are converted into dummy variables and the Dice coefficient (a concordance-discordance
type coefficient) is computed. Finally, a linear combination using user-specified weights (an
average) is calculated to create the final distance matrix. A sketch is given below.
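In R, the Gower distance is available through daisy() in the cluster package. A minimal sketch on a hypothetical mixed-type data frame mixed.df:
library(cluster)
d.gower <- daisy(mixed.df, metric = "gower")                 # dissimilarities on mixed data
hc.gower <- hclust(as.dist(d.gower), method = "complete")    # usable by any distance-based method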
6.4 K-Median Algorithm
This is a variant of k-means clustering in which, instead of the squared deviations from the means,
the absolute deviations from the medians are minimized. The difficulty is that for
multivariate data there is no universally accepted definition of the median. Another option is
to use a k-medoid partitioning.
6.5 K-Medoid Algorithm
In this algorithm k representative observations, called medoids, are chosen. Each observation
is assigned to the cluster whose medoid is closest to it, and the medoids are then updated. A swapping
cost is incurred when a current medoid is replaced by a new one. Various algorithms have been
developed, notably PAM (partitioning around medoids), CLARA and randomized sampling.
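A minimal sketch of k-medoid clustering with pam() from the cluster package (the choice k = 6 is illustrative only):
library(cluster)
pam.fit <- pam(ReviewData[, 2:25], k = 6)   # partitioning around medoids
table(pam.fit$clustering)                   # cluster sizes
pam.fit$medoids                             # the k representative observations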
6.6 K-Mode Clustering
Since the mean is not defined for categorical data, k-mode clustering is a sensible alternative.
Once an optimal number of clusters is determined, initial cluster centres are arbitrarily
selected. Each data object is assigned to the cluster whose centre is closest to it by a suitably
defined distance. Note that for cluster analysis the distance measure is all-important. After
each allocation, the cluster modes are updated. A good discussion is given in A review on k-mode
clustering algorithm by M. Goyal and S. Aggarwal, published in the International Journal of
Advanced Research in Computer Science
(https://www.ijarcs.info/index.php/Ijarcs/article/view/4301/4008).
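A minimal sketch using kmodes() from the klaR package on a hypothetical data frame of categorical attributes cat.df (the number of clusters is illustrative):
library(klaR)
km.fit <- kmodes(cat.df, modes = 4)   # k-modes with 4 clusters
km.fit$size                           # number of objects in each cluster
km.fit$modes                          # the cluster modes (categorical 'centres')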
In summary, observations may be clustered hierarchically, producing a dendrogram that is cut to obtain the partition, or by a partitioning method such as k-means, which produces the final partition directly.