
A Short Monograph on Clustering

TO SERVE AS A REFRESHER FOR PGP-BABI

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Index

Contents
1. Introduction
2. Clustering, Formally
3. Methods of Clustering
   3.1 Measures of Distance
   3.1.1 Common Measures of Distance
   Computing distances
   Use sample data through manual input
4. Hierarchical Clustering
   4.1 Definition
   4.2 Agglomerative Clustering
   4.2.1 Concept of Linkage
5. K-Means Clustering
   5.1 Determining the Number of Clusters
   5.1.1 Elbow method
   5.1.2 Silhouette Method
6. Further Discussion and Considerations
   6.1 Divisive Clustering Algorithm
   6.2 Scaling
   6.3 Discrete Attributes
   6.4 K-Median Algorithm
   6.5 K-Medoid Algorithm
   6.6 K-Mode Clustering

List of Figures
Figure 1: Data arrangement in matrix format
Figure 2: Diagram of agglomerative clustering
Figure 3: Single linkage clustering
Figure 4: Complete linkage clustering
Figure 5: Centroid linkage clustering
Figure 6: Histogram of the variables to be used to compute distance
Figure 7: Comparison of distance and linkage method in constructing clusters
Figure 8: Ward's method of constructing clusters
Figure 9: Dendrogram of the hierarchical clusters for the full data set
Figure 10: Elbow method to determine optimum number of clusters
Figure 11: Silhouette method to determine optimum number of clusters
Figure 12: Cluster plot for 20 clusters
Figure 13: Options for optimum number of clusters
Figure 14: Cluster plot for 3 clusters
Figure 15: Cluster plot for 6 clusters

1. Introduction
In today’s world of information explosion, knowledge is derived by classifying information.
Analytics borrows its strength from eliminating the random perturbations and identifying the
underlying structure. Every customer who walks in a departmental superstore is different. No
two persons will buy identical items in identical quantities. Customers who are young and
single are likely to buy certain items whereas elderly couples are likely to buy certain other
items. To promote sales by utilizing resources in a cost-effective manner, superstores need to
understand the requirements of different groups of customers. Instead of a blanket
advertisement policy, where all items are being advertised to all customers, a much more
efficient way of increasing sales is to classify customers into mutually exclusive groups and
come up with a tailor-made advertising policy.
Clustering lies at the core of marketing strategy. In recent times many other application areas
of clustering have emerged. Clustering documents on the web helps to extract and classify
information. Clustering of genes helps to identify various properties, including their disease
carrying propensity. Clustering is a powerful tool for data mining and pattern recognition for
image analysis.
Clustering is an unsupervised technique. The underlying assumption is that the observed data
come from multiple populations. To elaborate on the marketing strategy example, it is
assumed that distinct populations exist within the customer base of a supermarket. The
difference among populations may not be based on demographics only. A complex set of
characteristics based on demography, socio-economic strata and other conditions delineates the
populations, and they form a partition of the customers. Once the clusters are identified, they
can be studied more closely, and different strategies can be applied to garner more
business from the targeted groups.
The objective of clustering is to form homogeneous partitions out of heterogeneous observations.
Clustering may be done in many ways, and not all of them are optimally data driven. It is possible
to define clusters based on age groups alone, or based on total amount purchased or visit
frequency. In subsequent sections we will discuss data-driven clustering procedures only. The
rationale behind such procedures is to define a homogeneous partition of the data.

Clustering is an unsupervised technique to partition the data into homogeneous segments.


Within a cluster the observations may be assumed to come from one population. Observations
belonging to different clusters are assumed to represent different populations.

2. Clustering, Formally
Clustering is an example of unsupervised learning. In unsupervised learning problems there
is no designated response. A predictive modelling problem is a supervised learning problem (see the
monograph on Predictive Modelling) since it contains a designated response variable which
depends on a set of predictors. In clustering problems the underlying populations are not
identified, nor is the number of possible populations known. It is assumed that the observed
data is heterogeneous and partitioning it into multiple populations is expected to make the
clusters homogeneous. In other words, observations belonging to a cluster are more similar
among themselves and observations belonging to different groups are dissimilar. Clustering is
a process to find meaningful structure and features in a set of observations based on given
measures of similarity.
The following important points need to be clarified at the outset.
I. It is assumed that the observations are not generated from a single population. However,
there is no concrete evidence that it is so.
II. The rationale behind clustering is that the total variation among all observations is
reduced substantially when the data set is partitioned into multiple groups (made precise
by the decomposition sketched below).
III. Since the number of populations present, k, is unknown, it is also determined from the
data.
IV. Once the different populations are identified, they must be treated separately. This
implies that there must not be any common parameters. If any predictive model
is developed, a separate model needs to be developed for each population.
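
The decomposition behind point II is standard and is added here only for reference (it is not spelled out in the original text): the total sum of squares splits into within-cluster and between-cluster parts,

$$\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2 \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert^2 \;+\; \sum_{k=1}^{K} n_k \, \lVert \bar{x}_k - \bar{x} \rVert^2 ,$$

where $\bar{x}$ is the overall centroid and $C_k$ is the k-th cluster with $n_k$ members and centroid $\bar{x}_k$. The first term is the within-cluster sum of squares and the second the between-cluster sum of squares; a good partition keeps the first term small relative to the second.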

3. Methods of Clustering
Clustering aims to determine the intrinsic grouping among the data points. The only criterion
for grouping is that observations belonging to the same cluster are closer to one another than to
observations belonging to different groups. Hence a measure of closeness or similarity between
pairs of points needs to be developed. In subsequent sections we will discuss various options for
measuring similarity. Given a choice of similarity measure, a pair whose similarity is higher
than that of a second pair is assumed to be closer, with respect to that measure, than the
second pair.
There are two primary approaches to clustering: hierarchical (agglomerative)
clustering and k-means clustering. In hierarchical clustering, the closest points are combined
in a pairwise manner to form the clusters. It is an iterative procedure, where at every step of
the iteration, two points or two clusters are combined to form a bigger cluster. At the end of
the process all points are combined into a single cluster. The number of clusters is not pre-
determined.
In the k-means clustering process, k denotes the number of clusters, which is pre-determined.
Once k is fixed, each observation is allocated to one and only one cluster so that the closest
points belong to the same cluster. The cluster sizes are not controlled.

3.1 Measures of Distance

Since the clustering process is completely controlled by the distance between two points and
distance between two clusters, it is paramount that the concept of distance is clear before we
can move on to the actual clustering process.
Clustering works only when the observations are multivariate. For univariate observations the
concept of distance is trivial. We will also assume that the variables are only numeric. If the
variables are categorical or if the type is mixed, these simple measures of distance will not work.
A few alternative distance measures are given in Section 6.
Let us introduce matrix notation to describe the observations. Let X denote the data matrix of
order n x p, i.e. there are n observations in the data and each observation has p dimensions. In
the matrix each row denotes an observation and each column, one dimension or attribute.

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}$$

Example: Let us consider a small data set of Google user ratings on various places of interest.
Suppose there are 5 users and each one of them gives a rating on 3 tourist attractions. The
rating may be any number between 0 and 5, where 0 is the lowest rating and 5 is the highest.
(Details of this data set are provided later)
In Fig 1 below the left panel shows the observed data and the right panel shows the data
arranged in matrix notation. Since there are 5 rows and 3 columns, the matrix is of dimension
5 x 3. Unless otherwise mentioned, each row represents one data point, each column represents
one attribute. The (i, j)-th cell of a data matrix identifies the value of the j-th attribute on the i-
th observation. If there is a missing value, that cell remains blank.

User   Resorts   Beaches   Parks
 1      0.58      3.65      3.67
 2      0.58      3.65      3.68
 3      0.54      3.66      3.68
 4      0.54      3.66      3.67
 5      0.58      3.67      3.66

$$X = \begin{bmatrix} 0.58 & 3.65 & 3.67 \\ 0.58 & 3.65 & 3.68 \\ 0.54 & 3.66 & 3.68 \\ 0.54 & 3.66 & 3.67 \\ 0.58 & 3.67 & 3.66 \end{bmatrix}$$

Figure 1: Data arrangement in matrix format

For the purpose of clustering, distance is defined between a pair of observations combining all
the attributes. There are many possible distance measures. Distance is a non-negative quantity.
Distance between two identical observations must be zero. Distance between two non-identical
observations must always be positive.
Pairwise distances among a set of n p-dimensional observations can be arranged in an n x n
symmetric (square) matrix whose principal diagonal is all 0.

3.1.1 Common Measures of Distance

Euclidean Distance: Between two observations X1 and X2, each of dimension p, Euclidean
distance is defined as

$$Ed(X_1, X_2) = \sqrt{\sum_{j=1}^{p} (X_{1j} - X_{2j})^2}$$

Example: Euclidean distance between Users 1 and 3 is


$$\sqrt{(0.58 - 0.54)^2 + (3.65 - 3.66)^2 + (3.67 - 3.68)^2} = \sqrt{0.0018} \approx 0.04$$
Squared Euclidean Distance: It is the Euclidean distance without the square root and the
distance between two observations X1 and X2 is defined as

$$SEd(X_1, X_2) = \sum_{j=1}^{p} (X_{1j} - X_{2j})^2$$

Example: Squared Euclidean distance between Users 1 and 3 is 0.0018.

Manhattan Distance: This is also known as absolute value distance or L1 distance. The
Manhattan distance between two observations X1 and X2 is defined as
$$Md(X_1, X_2) = \sum_{j=1}^{p} |X_{1j} - X_{2j}|$$

Example: Manhattan distance between Users 1 and 3 is 0.06.

Minkowski's Distance: This distance measure is a generalization of the Euclidean and
Manhattan distances and is named after the nineteenth-century German mathematician
Hermann Minkowski. It is defined as

$$Mind(X_1, X_2) = \left( \sum_{j=1}^{p} |X_{1j} - X_{2j}|^q \right)^{1/q}$$

where q is a positive number (q ≥ 1). Note that for q = 1 Minkowski's distance coincides with the
Manhattan distance, and for q = 2 it is identical to the Euclidean distance.

There are many other important distance measures, among which the Mahalanobis Distance,
the Cosine Distance and the Jaccard Distance deserve mention.

Following is an illustration of distance matrices computed for the data given in Fig 1. The
Euclidean and Manhattan matrices are printed in lower-triangular form; since a distance matrix
is symmetric with a zero principal diagonal, no information is lost.

Computing distances
Use sample data through manual input
X <- cbind(c(0.58, 0.58, 0.54, 0.54, 0.58),
           c(3.65, 3.65, 3.66, 3.66, 3.67),
           c(3.67, 3.68, 3.68, 3.67, 3.66))
print(X)

## [,1] [,2] [,3]


## [1,] 0.58 3.65 3.67
## [2,] 0.58 3.65 3.68
## [3,] 0.54 3.66 3.68
## [4,] 0.54 3.66 3.67
## [5,] 0.58 3.67 3.66

# Euclidean distance
round(dist(X, method = 'euclidean'),2)

## 1 2 3 4
## 2 0.01
## 3 0.04 0.04
## 4 0.04 0.04 0.01
## 5 0.02 0.03 0.05 0.04

# Manhattan distance
round(dist(X, method = 'manhattan'),2)

## 1 2 3 4
## 2 0.01
## 3 0.06 0.05
## 4 0.05 0.06 0.01
## 5 0.03 0.04 0.07 0.06

# Minkowski's distance, q = 3 (passed via the p argument of dist())
round(dist(X, method = 'minkowski', diag = TRUE, p = 3), 2)

## 1 2 3 4 5
## 1 0.00
## 2 0.01 0.00
## 3 0.04 0.04 0.00
## 4 0.04 0.04 0.01 0.00
## 5 0.02 0.03 0.04 0.04 0.00

Note that the numerical values of the distances may differ depending on which method
has been used to compute them. The distance between Users 1 and 2 is 0.01 under all three
methods, while the distance between Users 1 and 3 depends on the method used. The Euclidean
distance between Users 1 and 3 is 0.04, which is identical to that between Users 1 and 4.
However, the Manhattan distance between Users 1 and 4 (0.05) is less than that between Users 1
and 3 (0.06). This observation indicates that the clustering may depend on the distance measure.

4. Hierarchical Clustering
4.1 Definition

Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method
of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical
clustering generally fall into two types:

• Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster,
and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are
performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner.
Agglomerative clustering is used more often than divisive clustering. In this monograph we
will not discuss divisive clustering in any detail (see Section 6).
Clustering of observations is the objective here. However, it is possible to apply hierarchical
clustering to cluster attributes (variables).

4.2 Agglomerative Clustering

Figure 2 shows a graphical representation of agglomerative clustering. At each stage two units
or two clusters or one unit and one cluster are combined into a larger cluster. The graphical
representation is a tree diagram and is called a dendrogram. At the start of the algorithm, the
number of clusters is equal to the number of observations n. At the end of the algorithm, the
number of clusters is 1.
The optimum number of clusters is not pre-determined in this algorithm. The final number of
clusters depends on the height at which the dendrogram is cut.

Figure 2: Diagram of an agglomerative clustering

This method builds the hierarchy from the individual elements by progressively merging
clusters. In Fig 2, there are five elements {1} {2} {3} {4} and {5}. Each element represents
one cluster of one element, called singleton. The first step is to merge the two closest
elements into a single cluster.

Closeness of two elements depends on the chosen distance.

Let us use Euclidean distance for illustration. In Fig 2, cutting after the first merge (from the
bottom) of the dendrogram will yield clusters {1, 2} {5} {3, 4}. Cutting after the second
merge will yield clusters {1, 2, 5} {3, 4}, which is a coarser clustering, with a smaller
number of clusters but one or more clusters having larger size.
Once one or more clusters contain more than one element, a rule for merging two
multi-element clusters needs to be defined.

4.2.1 Concept of Linkage


Once at least one multiple-element cluster is formed, a rule is needed for computing the
distance between a cluster and a singleton, or between two clusters. This distance may be
defined in various ways.

Single Linkage: Distance between two clusters is defined to be the smallest distance between
a pair of observations (points), one from each cluster. This is the distance between two
closest points of the two clusters.

Figure 3: Single linkage clustering

Complete Linkage: Distance between two clusters is defined to be the largest distance
between a pair of points, one from each cluster. This is the distance between two points
belonging to two different clusters, which are the farthest apart.

Figure 4: Complete linkage clustering

Average Linkage: Distance between two clusters is defined as the average of the distances
between all pairs of points, one from each cluster.

Centroid Linkage: Centroid of a cluster is defined as the vector of means of all attributes,
over all observations within that cluster. Centroid linkage between two clusters is the distance
between the respective centroids.

Figure 5: Centroid linkage clustering

Ward's Linkage: This is an alternative clustering criterion, also known as the minimum
variance method. In that sense it is similar to k-means clustering (which is
discussed later); it is an iterative method where, after every merge, the distances are updated
successively. Ward's method often creates compact, even-sized clusters. The
drawback is that it may result in less than optimal clusters. Like all other clustering methods, it
is computationally intensive.

Note that the choice of distance and linkage method determines the outcome of a
clustering procedure. The same set of observations may be partitioned differently
depending on these choices.

Case Study:
Review ratings by 5456 travellers (Users) on 24 tourist attractions across various sites in
Europe are considered. This data is the primary input for an automatic recommender system.
Each reviewer (User) has rated various attractions, such as churches, resorts, bakeries etc. and
the given ratings are all between 0 (lowest) and 5 (highest). The objective is to partition the
sample of travellers into clusters, so that tailored recommendations may be made for their next
travel destinations.

Statement of the problem: Cluster the Users into mutually exclusive groups according to their
preferences for various tourist attractions.

EDA on Review Ratings Data

# Read data into R using .csv file

ReviewData <- read.csv("ReviewRatings.csv", header = TRUE)

dim(ReviewData)

## [1] 5456 25

names(ReviewData)

## [1] "User" "Churches" "Resorts" "Beaches"


## [5] "Parks" "Theatres" "Museums" "Malls"
## [9] "Zoo" "Restaurants" "Bars" "Local.Services
"
## [13] "Fast.Food" "Lodgings" "Juice.Bars" "Art.Galleries"
## [17] "Dance.Clubs" "Swimming.Pools" "Gyms" "Bakeries"
## [21] "Spas" "Cafes" "View.Points" "Monuments"
## [25] "Gardens"

head(ReviewData, 5)

## User Churches Resorts Beaches Parks Theatres Museums Malls Zoo


## 1 User 1 0 0.0 3.63 3.65 5 2.92 5 2.35
## 2 User 2 0 0.0 3.63 3.65 5 2.92 5 2.64
## 3 User 3 0 0.0 3.63 3.63 5 2.92 5 2.64
## 4 User 4 0 0.5 3.63 3.63 5 2.92 5 2.35
## 5 User 5 0 0.0 3.63 3.63 5 2.92 5 2.64
## Restaurants Bars Local.Services Fast.Food Lodgings Juice.Bars
## 1 2.33 2.64 1.7 1.69 1.7 1.72
## 2 2.33 2.65 1.7 1.69 1.7 1.72
## 3 2.33 2.64 1.7 1.69 1.7 1.72
## 4 2.33 2.64 1.73 1.69 1.7 1.72
## 5 2.33 2.64 1.7 1.69 1.7 1.72
## Art.Galleries Dance.Clubs Swimming.Pools Gyms Bakeries Spas Cafes
## 1 1.74 0.59 0.5 0 0.5 0 0
## 2 1.74 0.59 0.5 0 0.5 0 0
## 3 1.74 0.59 0.5 0 0.5 0 0
## 4 1.74 0.59 0.5 0 0.5 0 0
## 5 1.74 0.59 0.5 0 0.5 0 0
## View.Points Monuments Gardens
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0

summary(ReviewData[,2:25]) # Summary statistics of the variables

## Churches Resorts Beaches Parks


## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.830
## 1st Qu.:0.920 1st Qu.:1.360 1st Qu.:1.540 1st Qu.:1.730
## Median :1.340 Median :1.905 Median :2.060 Median :2.460
## Mean :1.456 Mean :2.320 Mean :2.489 Mean :2.797
## 3rd Qu.:1.810 3rd Qu.:2.683 3rd Qu.:2.740 3rd Qu.:4.093
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Theatres Museums Malls Zoo
## Min. :1.120 Min. :1.110 Min. :1.120 Min. :0.860
## 1st Qu.:1.770 1st Qu.:1.790 1st Qu.:1.930 1st Qu.:1.620
## Median :2.670 Median :2.680 Median :3.230 Median :2.170
## Mean :2.959 Mean :2.893 Mean :3.351 Mean :2.541
## 3rd Qu.:4.312 3rd Qu.:3.840 3rd Qu.:5.000 3rd Qu.:3.190
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Restaurants Bars Local.Services Fast.Food
## Min. :0.840 Min. :0.810 Min. :0.78 Min. :0.780
## 1st Qu.:1.800 1st Qu.:1.640 1st Qu.:1.58 1st Qu.:1.290
## Median :2.800 Median :2.680 Median :2.00 Median :1.690
## Mean :3.126 Mean :2.833 Mean :2.55 Mean :2.078
## 3rd Qu.:5.000 3rd Qu.:3.530 3rd Qu.:3.22 3rd Qu.:2.283
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## Lodgings Juice.Bars Art.Galleries Dance.Clubs
## Min. :0.770 Min. :0.760 Min. :0.000 Min. :0.000
## 1st Qu.:1.190 1st Qu.:1.030 1st Qu.:0.860 1st Qu.:0.690
## Median :1.610 Median :1.490 Median :1.330 Median :0.800
## Mean :2.126 Mean :2.191 Mean :2.207 Mean :1.193
## 3rd Qu.:2.360 3rd Qu.:2.740 3rd Qu.:4.440 3rd Qu.:1.160
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Swimming.Pools Gyms Bakeries Spas
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00
## 1st Qu.:0.5800 1st Qu.:0.5300 1st Qu.:0.5200 1st Qu.:0.54
## Median :0.7400 Median :0.6900 Median :0.6900 Median :0.69
## Mean :0.9492 Mean :0.8224 Mean :0.9698 Mean :1.00
## 3rd Qu.:0.9100 3rd Qu.:0.8400 3rd Qu.:0.8600 3rd Qu.:0.86
## Max. :5.0000 Max. :5.0000 Max. :5.0000 Max. :5.00
## Cafes View.Points Monuments Gardens
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.5700 1st Qu.:0.740 1st Qu.:0.790 1st Qu.:0.880
## Median :0.7600 Median :1.030 Median :1.070 Median :1.290
## Mean :0.9658 Mean :1.751 Mean :1.531 Mean :1.561
## 3rd Qu.:1.0000 3rd Qu.:2.070 3rd Qu.:1.560 3rd Qu.:1.660
## Max. :5.0000 Max. :5.000 Max. :5.000 Max. :5.000

There are 5456 reviewers’ ratings in the data. In each of the attributes, the maximum value is
5, i.e. the highest possible score has been reached. However, not all attributes have been
given the lowest (0) rating. The mean value is always greater than the median value,
sometimes considerably, indicating that the distribution of the attributes is not symmetric, but
positively skewed.

Histograms of the variables are shown below.

library(ggplot2)
library(tidyr)
library(purrr) # Provides keep(), used below

# For better resolution of visuals

d1 <- ReviewData[,2:10]
d2 <- ReviewData[,11:19]
d3 <- ReviewData[,20:25]

d1 %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_histogram(binwidth=0.5, color="blue", fill="white")
# Improve visualization

d2 %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_histogram(binwidth=0.5, color="blue", fill="white")
# Improve visualization

d3 %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_histogram(binwidth=0.5, color="blue", fill="white")
# Improve visualization

## The dataframe is partitioned into three parts so that the histograms are well
## arranged and properly visible ##

Figure 6: Histogram of the variables to be used to compute distance
The histograms bear out the inherent skewness of the attributes. Except for two attributes
(Churches and Zoo), most of the data is concentrated towards the lower ratings, with a
disproportionate amount concentrated at the maximum rating of 5.

Clustering (Illustrative)

Visualization plays an important role in identifying the number of clusters in a set of
observations. To illustrate the process, only the users with serial numbers 100 – 120 are
selected. Later the process is applied to the whole data set of 5456 users. All attributes are
used in the hierarchical clustering process.
Both Euclidean and Manhattan distances are used, with single and complete linkage.
The code is given in the following panels, and the dendrograms are shown together later
to facilitate comparison.

# Illustrative clustering with a small set of observations

d.eucld <- dist((ReviewData[100:120,2:25]), method="euclidean")


hc1 <- hclust(d.eucld, "single")
# List object contains outcome of clustering
plot(hc1, hang=-1, cex = 0.6, col="blue") # Plot dendrogram

hc2 <- hclust(d.eucld, "complete")
plot(hc2, hang=-1, cex = 0.6, col ="brown")

d.manhat <- dist((ReviewData[100:120,2:25]), method="manhattan")


hc3 <- hclust(d.manhat, "single")
plot(hc3, hang=-1, cex = 0.6, col = "magenta")

hc4 <- hclust(d.manhat, "complete")


plot(hc4, hang=-1, cex = 0.6, col ="dark green")

Figure 7: Comparison of distance and linkage method in constructing clusters

Illustrative clustering using Ward’s method:


d.eucld <- dist((ReviewData[100:120,2:25]), method="euclidean")

hcW <- hclust(d.eucld, "ward.D2") # Using Ward’s method


plot(hcW, hang=-1, cex = 0.6, col="red")

Figure 8: Ward’s method of constructing clusters

A dendrogram is a visual representation of cluster-making. On the x-axis are the item names
or item numbers. On the y-axis is the distance or height. The vertical straight lines denote the
height at which two items or two clusters combine. The higher the level at which they combine,
the more distant the individual items or clusters are. By definition of hierarchical clustering, all
items must eventually combine into a single cluster.
However, the problem of interest here is to determine the optimum number of clusters. Recall
that each cluster is a representative of a population. Since this is a problem of unsupervised
learning, there is no error measure to dictate the number of clusters. To some extent
this is a subjective choice. After considering the dendrogram, one may decide the level at which
the resulting tree needs to be cut. This is another way of deciding how close-knit or dispersed
the clusters should be. If the number of clusters is large, each cluster is small and
homogeneous. If the number of clusters is small, each contains more items and hence the
clusters are more heterogeneous. Often practical considerations or domain knowledge also
play a part in determining the number of clusters.
Note that depending on the distance measure and linkage used, the number of clusters and
their composition may be different. Compare the dendrograms constructed with Euclidean
distance and complete linkage and with Manhattan distance and single linkage in Figure 7. In
the former there seem to be 3 clear clusters:
• Cluster 1: Users 100 – 108 and User 110
• Cluster 2: User 109, User 111, Users 113 – 115 and User 117
• Cluster 3: User 112, User 116, Users 118 – 120
In the other dendrogram (Manhattan distance and single linkage), there seem to be only two
multi-item clusters:
• Cluster 1: Users 100 – 110
• Cluster 2: Users 113 – 115
The rest of the Users are relatively distant from these two multi-item clusters and they all
form singletons. [It is possible to apply hierarchical clustering to identify multivariate
outliers.]
Ward's method also seems to indicate two clusters.

• Cluster 1: Users 100 – 110
• Cluster 2: Users 111 – 120
Both clusters are of roughly equal size and there is no singleton.
This does introduce subjectivity into the process. Nevertheless, the items (in this case Users) that
are really close combine into a single cluster irrespective of the distance measure or
linkage.

Now let us proceed to the final clustering involving all Users. Euclidean distance and
complete linkage are applied.

library(dendextend) # Useful for nice dendrograms

# Euclidean distance and complete linkage

d.eucld <- dist((ReviewData[,2:25]), method="euclidean")

hc2 <- hclust(d.eucld, "complete")


hc2 <- as.dendrogram(hc2) # To facilitate nice graphics
nP <- list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.6, col = "red")
hc2 <- color_branches(hc2, k = 12) # Identify clusters by colours
plot(hc2, nodePar=nP, horiz=T, leaflab="none")

Figure 9: Dendrogram of the hierarchical clusters for the full data set

The dendrogram in Fig 9 needs some explanation. Default options in R cannot
produce an interpretable tree diagram for hierarchical clustering with a large number of
objects. Hence the leaf nodes have not been labelled and a horizontal orientation has been chosen.
The colour scheme identifies the clusters. The number of clusters has been
chosen as 12 after several iterations; only the final results are shown here.

# Cluster membership
clust <- cutree(hc2, k = 12)
table(clust) # Number of observations in each cluster

## clust
## 1 2 3 4 5 6 7 8 9 10 11 12
## 416 865 900 402 809 130 246 88 475 321 572 232

# Add cluster membership to original dataframe


ReviewData <- cbind.data.frame(ReviewData,clust)

# Print a sample of observations with cluster membership


ReviewData[1:10, c(1,26)]

## User clust
## 1 User 1 1
## 2 User 2 1
## 3 User 3 1
## 4 User 4 1
## 5 User 5 1
## 6 User 6 1
## 7 User 7 2
## 8 User 8 2
## 9 User 9 2
## 10 User 10 2

Finally, to verify the differences between the identified clusters, a few important variables are
chosen and their means and standard deviations are compared below.

library(dplyr)

ReviewData %>%
group_by(clust) %>%
summarise_at(vars(Lodgings, Fast.Food, Restaurants), funs(mean(., na.rm=
TRUE)))

## # A tibble: 12 x 4
## clust Lodgings Fast.Food Restaurants
## <int> <dbl> <dbl> <dbl>
## 1 1 2.43 2.01 2.51
## 2 2 1.64 1.75 2.14
## 3 3 2.35 1.93 3.95
## 4 4 2.44 2.15 4.12

## 5 5 1.43 1.65 4.81
## 6 6 3.05 3.89 4.14
## 7 7 1.78 2.85 2.63
## 8 8 0.924 0.913 4.17
## 9 9 4.57 4.33 2.73
## 10 10 2.19 2.08 2.20
## 11 11 1.40 1.41 1.98
## 12 12 1.37 1.03 1.60

ReviewData %>%
group_by(clust) %>%
summarise_at(vars(Lodgings, Fast.Food, Restaurants), funs(sd(., na.rm=TR
UE)))

## # A tibble: 12 x 4
## clust Lodgings Fast.Food Restaurants
## <int> <dbl> <dbl> <dbl>
## 1 1 1.48 1.10 0.980
## 2 2 1.09 1.13 0.784
## 3 3 1.47 1.07 1.08
## 4 4 1.16 0.704 0.923
## 5 5 0.308 0.446 0.458
## 6 6 1.91 1.53 0.815
## 7 7 1.65 1.59 0.523
## 8 8 0.185 0.190 1.44
## 9 9 0.803 1.03 0.329
## 10 10 0.515 0.562 0.564
## 11 11 0.530 0.423 1.08
## 12 12 1.21 0.302 0.802

Recall that the clustering process here is an input to a recommender system. Instead of
pushing all information indiscriminately to every user, more efficient planning may be based
on the average ratings of the attributes within each group. Consider, for
example, that Clusters 4, 5, 6 and 8 have given very high ratings to Restaurants, whereas Clusters
4, 5 and especially 8 have given low ratings to Fast Food. It appears that users belonging to
Cluster 8 are discriminating eaters, and information about fast food will not be
useful to them at all. Compared to them, Cluster 9 shows high ratings for Lodgings and
Fast Food but a medium rating for Restaurants. It is possible that these users value convenience,
and information about good accommodation and perhaps nearby fast-food joints will
be appreciated by this group.

As a practical consideration though, 12 may be a large number of clusters.

5. K-Means Clustering
k-means clustering is the most widely used non-hierarchical clustering technique. It aims
to partition n observations into k clusters in which each observation belongs to
the cluster whose mean (centroid) is nearest to it, serving as a prototype of the cluster. It
minimizes within-cluster variances (squared Euclidean distances).
The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is
often called "the k-means algorithm".
Given an initial set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm proceeds by alternating between
two steps:
Assignment step: Assign each observation to the cluster whose mean has the least
squared Euclidean distance; this is intuitively the "nearest" mean.
Update step: Calculate the new means (centroids) of the observations in the new clusters.
The algorithm has converged when the assignments no longer change.
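
To make the two steps concrete, the following is a minimal R sketch of one assignment/update pass, iterated until the assignments stop changing. This is an illustration only: the helper name kmeans_step is hypothetical, it assumes ReviewData has been loaded as in the earlier case study, the choice of k = 3 is arbitrary, it ignores the possibility of empty clusters, and in practice the built-in kmeans() function should be used.

# Illustrative sketch of the k-means assignment and update steps
kmeans_step <- function(X, centers) {
  # Assignment step: squared Euclidean distance from every point to every centroid
  d2 <- sapply(seq_len(nrow(centers)), function(k)
    rowSums(sweep(X, 2, centers[k, ])^2))
  cluster <- max.col(-d2)            # index of the nearest centroid for each point
  # Update step: recompute each centroid as the mean of its assigned points
  new.centers <- t(sapply(seq_len(nrow(centers)), function(k)
    colMeans(X[cluster == k, , drop = FALSE])))
  list(cluster = cluster, centers = new.centers)
}

X <- as.matrix(ReviewData[, 2:25])
centers <- X[sample(nrow(X), 3), ]   # k = 3 random starting centroids (assumption)
assign.old <- integer(nrow(X))
repeat {
  out <- kmeans_step(X, centers)
  if (identical(out$cluster, assign.old)) break   # converged: assignments unchanged
  assign.old <- out$cluster
  centers    <- out$centers
}
table(out$cluster)                   # cluster sizes at convergence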
One advantage of k-means clustering is that computation of distances among all pairs of
observations is not necessary. However, the number of clusters k needs to be pre-specified.
Once that is determined, an arbitrary partition of data into k clusters is the starting point. Then
sequential assignment and update will find the clusters that are most separated. The cardinal
rule of cluster building is that, at every step within-cluster variance will reduce (or stay the
same) but between-cluster variance will increase (or stay the same).
Random assignment of initial centroids eliminates bias. However, it is not an efficient
algorithmic process. An alternative is to start with a set of pre-specified
starting points, though such a set of initial clusters, if it converges to a local
optimum, may introduce bias. The following algorithm is a compromise between efficiency
and randomness.
Leader Algorithm:

• Step 1: Select the first item from the list. This item forms the centroid of the first cluster.
• Step 2: Search through the subsequent items until an item is found that is at least distance
δ away from every previously defined cluster centroid. This item forms the centroid of
the next cluster.
• Step 3: Step 2 is repeated until all k cluster centroids are obtained or no further items can
be assigned.
• Step 4: The initial clusters are obtained by assigning items to the nearest cluster
centroids.

The distance δ along with the number of clusters k is the input to this algorithm. It is possible
that domain knowledge or various other subjective considerations determine the value of k. If
no other subjective input is available, k may be determined by empirical methods.
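
The following is a minimal R sketch of the leader algorithm described above. It is an illustration under stated assumptions: the function name leader_init is hypothetical, Euclidean distance is used, the values δ = 5 and k = 6 are arbitrary choices, and ReviewData is assumed to be loaded as before.

# Illustrative sketch of the leader algorithm for choosing initial centroids
leader_init <- function(X, k, delta) {
  X <- as.matrix(X)
  centroids <- X[1, , drop = FALSE]       # Step 1: the first item starts the first cluster
  for (i in 2:nrow(X)) {
    if (nrow(centroids) == k) break       # Step 3: stop once k centroids are found
    # Step 2: accept item i as a new centroid if it is at least delta away
    # (Euclidean distance) from every previously defined centroid
    d <- sqrt(rowSums(sweep(centroids, 2, X[i, ])^2))
    if (all(d >= delta)) centroids <- rbind(centroids, X[i, ])
  }
  # Step 4: assign every item to its nearest centroid
  d2 <- sapply(seq_len(nrow(centroids)), function(j)
    rowSums(sweep(X, 2, centroids[j, ])^2))
  list(centers = centroids, cluster = max.col(-d2))
}

init <- leader_init(ReviewData[, 2:25], k = 6, delta = 5)
table(init$cluster)   # sizes of the initial clusters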

5.1 Determining the Number of Clusters

There are many methods that are recommended for determination of an optimal number of
partitions. Unfortunately, however, there is no closed form solution to the problem of
determining k. The choice is somewhat subjective and graphical methods are often employed.
The objective of partitioning is to separate out the observations or units so that the 'most' similar
items are put together.

5.1.1 Elbow method

For a given number of clusters, the total within-cluster sum of squares (WSS) is computed.
The optimum k is the value beyond which adding one more cluster does not lower the total
WSS appreciably.

The Elbow method looks at the total WSS as a function of the number of clusters.

# K-means Clustering of Reviewers


# Read data into R using .csv file

ReviewData <- read.csv("C:/Users/Srabashi/Desktop/GL Content Creation/Clus


tering/ReviewRatings.csv", header = TRUE)

library(ggplot2)
library(factoextra) # Required for choosing optimum K

set.seed(310)
# Total within-cluster sum of squares (WSS) as a function of k
fviz_nbclust(ReviewData[, 2:25], kmeans, method = "wss", k.max = 25) +
  theme_minimal() + ggtitle("The Elbow Method")

Fig 10 does not indicate any clear break in the elbow. At k = 15, for the first time, adding one more
cluster does not reduce the total within-cluster sum of squares appreciably. Hence one option for the
optimum number of clusters is 15.
[Recall that hierarchical clustering of the same data suggested 12 clusters. There may be a wide
discrepancy in the number of clusters depending on the procedure applied. This will become clearer
in the subsequent analysis.]

Figure 10: Elbow method to determine optimum number of clusters

5.1.2 Silhouette Method

This method measures how tightly the observations are clustered and the average distance
between clusters. For each observation a silhouette width is constructed as a function
of the average distance between the point and all other points in the cluster to which it
belongs, and the average distance between the point and the points of the clusters it
does not belong to. The value of k that maximizes the average silhouette width is taken as the optimum.
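
For reference, the standard definition of the silhouette width (not reproduced in the original text) for an observation $i$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

where $a(i)$ is the average distance from $i$ to the other points of its own cluster and $b(i)$ is the smallest average distance from $i$ to the points of any other cluster. The average of $s(i)$ over all observations, plotted against k, is the quantity shown in the silhouette plot below.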

fviz_nbclust(ReviewData[, 2:25], kmeans, method = "silhouette", k.max = 25


) + theme_minimal() + ggtitle("The Silhouette Plot")

Figure 11: Silhouette method to determine optimum number of clusters

It is clear from Fig 11 that the maximum value of silhouette is achieved for k = 20, which,
therefore, is considered to be the optimum number of clusters for this data.

Let us now proceed with 20 clusters. Clustering, like any other predictive modelling, is an
iterative process. The final recommendation may be quite different from the initial starting
point.

Case Study continued:

Let us decide to partition 5456 users into 20 partitions.

# Perform k-means clustering with 20 clusters
kmeans.clus = kmeans(ReviewData[, 2:25], centers = 20, nstart = 25)

The algorithm for k-means is a greedy algorithm and is sensitive to the random starting
points. The parameter nstart controls the starting partitions. Its default value in R is one.
However, it is strongly recommended to compute k-means clustering with a large value
of nstart such as 25 or 50, in order to have a more stable result. If nstart = 25 is specified,
then R will try 25 different random starting assignments and then select the best results

with the lowest within-cluster variation. Even so, it is possible that a sub-optimal partition is
found.

The output of k-means clustering includes a number of items: the total sum of squares,
the within sum of squares for each cluster, the total within sum of squares, the between-cluster
sum of squares and the cluster centres.

Before the detailed outcome of the partition algorithm is discussed, one needs to check how
separated the clusters are. Ideally, every cluster will be well separated, with minimal overlap.

library(fpc) # Plots the clusters

plotcluster( ReviewData[, 2:25], kmeans.clus$cluster)

# Uses a projection method for plotting

Figure 12: Cluster plot for 20 clusters

Figure 12 above clearly shows that the 20 clusters are not at all well separated. One option
at this stage is to tweak the number of clusters and see whether a smaller number of
clusters works better. Another option is to exercise subjective judgement. A
third, quite computation-intensive, option also exists.

The package NbClust contains 30 indices for determining the number of clusters. It acts as an
ensemble and proposes the value of k that is voted for by the maximum number of indices.

library(NbClust)
nc <- NbClust(data = ReviewData[, 2:25], distance = "euclidean", min.nc = 2,
              max.nc = 20, method = "kmeans")

Figure 13: Options for optimum number of clusters

## *** : The D index is a graphical method of determining the number of clusters.
##       In the plot of D index, we seek a significant knee (the significant peak in
##       Dindex second differences plot) that corresponds to a significant increase
##       of the value of the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 18 as the best number of clusters

## * 3 proposed 19 as the best number of clusters
## * 1 proposed 20 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************

table(nc$Best.n[1,])

##
## 0 1 2 3 4 7 8 9 18 19 20
## 2 1 5 6 3 2 1 1 1 3 1

With the current data, the value of k receiving the highest number of votes is 3.

kmeans.clus = kmeans(ReviewData[, 2:25], centers = 3, nstart = 25)


plotcluster( ReviewData[, 2:25], kmeans.clus$cluster )

Note that the elbow plot and the silhouette plot both recommended a very large number of clusters in
this case. Though these two instruments are probably the most widely used for determining k,
the final decision must not be based on only one consideration.

Figure 14: Cluster plot for 3 clusters

Fig 14 shows that the 3 clusters are much better separated, though the overlap cannot be totally
eliminated.

There are a number of merits to using a smaller number of clusters. Recall that the objective of
this particular clustering effort is to devise a suitable recommendation system. It may not be
practical to manage a very large number of tailor-made recommendations.

However, the final decision must be taken after considering the within-cluster and
between-cluster sums of squares. Recall that the within-cluster sum of squares is the sum of the
squared Euclidean distances of all the points within a cluster from the cluster centroid, and the
between-cluster sum of squares is the size-weighted sum of the squared Euclidean distances of the
cluster centroids from the overall centroid; together they add up to the total sum of squares.
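
A quick way to verify this decomposition on the fitted object (a one-line check using the kmeans.clus result obtained above):

# totss should equal tot.withinss + betweenss
with(kmeans.clus, all.equal(totss, tot.withinss + betweenss))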

names(kmeans.clus) # Output of kmeans procedure

## [1] "cluster" "centers" "totss" "withinss"


## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"

The output of the kmeans() procedure contains all the relevant information. A comparison of
the sums of squares for 20 clusters and 3 clusters reveals more insight.

# Cluster = 20

kmeans.clus20 = kmeans(ReviewData[, 2:25], centers = 20, nstart = 50)

kmeans.clus20$totss

## [1] 215751.4

kmeans.clus20$withinss

## [1] 3379.421 2626.124 3618.589 5098.038 6375.142 10053.915 2662.499


## [8] 6217.850 5264.760 3264.566 6913.228 5776.917 3813.161 3304.953
## [15] 7344.733 4639.568 1938.427 3688.510 2792.628 5522.405

kmeans.clus20$tot.withinss

## [1] 94295.43

kmeans.clus20$betweenss

## [1] 121455.9

kmeans.clus20$size

## [1] 235 149 196 223 414 557 104 273 314 197 529 289 352 261 349 242 111
## [18] 236 147 278

# Cluster = 3

kmeans.clus3 = kmeans(ReviewData[, 2:25], centers = 3, nstart = 50)


kmeans.clus3$totss

## [1] 215751.4

kmeans.clus3$withinss

## [1] 74468.48 56688.72 34299.57

kmeans.clus3$tot.withinss

## [1] 165456.8

kmeans.clus3$betweenss

## [1] 50294.59

kmeans.clus3$size

## [1] 2457 1906 1093

Note that the total sum of squares in the data does not depend on the number of clusters. The total
sum of squares for the current data set is 215751.4, which is a measure of the total variability
within the data.

Each cluster has a different within-cluster sum of squares. For k = 20, the largest within-cluster
sum of squares is 10053.9 and the total within-cluster sum of squares is 94295.43. The
between-cluster sum of squares is 121455.9.

Let us now investigate the case with k = 3. The between-cluster sum of squares is smaller
than the within-cluster sums of squares of two of the clusters. Hence k = 3 cannot be recommended,
even though this value of k has been determined by majority vote in NbClust().

Next a few other values of k are experimented with and k = 6 seems to be the optimal number
of clusters.

# Cluster = 6

kmeans.clus6 = kmeans(ReviewData[, 2:25], centers = 6, nstart = 50)

kmeans.clus6$withinss

## [1] 13484.60 15813.09 34311.71 20297.23 23566.84 26960.01

kmeans.clus6$tot.withinss

## [1] 134433.5

kmeans.clus6$betweenss

## [1] 81317.87

kmeans.clus6$size

## [1] 612 883 1334 726 967 934

Figure 15: Cluster plot for 6 clusters

Except for a small overlap between cluster 2 and cluster 3, the clusters are well separated. All
within-cluster sums of squares are small compared to the between-cluster sum of squares.
Hence k = 6 is recommended as the optimum number of clusters in this case.

The last step of the clustering procedure is cluster profiling. Based on the mean values of
the attributes in the different clusters, the recommender system may be developed.

# Cluster Profile
round(kmeans.clus6$centers, 2)

## Churches Resorts Beaches Parks Theatres Museums Malls Zoo Restaurants


## 1 0.81 1.06 1.67 1.65 1.67 1.68 3.04 1.97 2.74

## 2 1.23 1.59 1.69 2.19 2.29 2.56 3.70 3.64 4.53
## 3 1.40 2.76 3.19 3.68 4.35 3.83 3.66 2.44 2.63
## 4 1.64 2.24 3.13 4.30 4.07 3.19 2.73 2.19 2.67
## 5 1.15 2.89 2.31 2.29 2.53 3.44 4.63 3.23 4.45
## 6 2.35 2.68 2.47 2.22 2.03 1.86 1.94 1.57 1.75

## Bars Local.Services Fast.Food Lodgings Juice.Bars Art.Galleries


## 1 2.86 3.15 3.92 4.17 4.85 4.00
## 2 4.48 4.17 2.10 1.86 1.34 1.39
## 3 2.44 2.06 2.06 1.97 1.68 1.31
## 4 2.71 2.67 1.60 1.87 1.28 1.19
## 5 3.16 2.21 1.90 2.05 2.99 3.68
## 6 1.58 1.57 1.44 1.54 1.86 2.35

## Dance.Clubs Swimming.Pools Gyms Bakeries Spas Cafes View.Points


## 1 1.51 1.22 0.93 1.18 1.00 0.73 0.74
## 2 1.10 0.67 0.40 0.36 0.44 0.58 1.24
## 3 1.05 0.66 0.49 0.58 0.69 0.75 0.85
## 4 1.21 0.87 0.67 0.62 0.99 1.04 4.81
## 5 0.83 0.65 0.65 0.71 0.85 0.83 0.92
## 6 1.63 1.81 1.92 2.51 2.13 1.88 2.66

## Monuments Gardens
## 1 0.71 0.81
## 2 1.10 1.19
## 3 1.48 1.67
## 4 2.49 1.78
## 5 0.90 1.03
## 6 2.46 2.62

6. Further Discussion and Considerations

6.1 Divisive Clustering Algorithm

This algorithm is rarely used. The basic principle of divisive clustering was published as the
DIANA (DIvisive ANAlysis Clustering) algorithm. Initially, all data is in the same cluster,
and the largest cluster is split repeatedly until every object is separate. At each split, DIANA
chooses the object with the maximum average dissimilarity to start a new cluster and then moves to
this new cluster all objects that are more similar to it than to the remainder.
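
For completeness, a minimal R sketch of divisive clustering with diana() from the cluster package is shown below; the choice of the illustrative subset of users simply mirrors the earlier hierarchical example and is an assumption.

library(cluster)

# Divisive analysis clustering (DIANA) on the illustrative subset of users
dv <- diana(ReviewData[100:120, 2:25], metric = "euclidean")
dv$dc                                  # divisive coefficient (closer to 1 = stronger structure)
plot(dv, which.plots = 2, cex = 0.6)   # dendrogram of the divisive hierarchy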

6.2 Scaling

Standardization or scaling is an important aspect of data pre-processing. Many machine learning
algorithms, especially distance-based ones, are sensitive to the scale of the data. Recall that
scaling all numerical variables has been strongly recommended in the case of PCA and FA.
Similarly, for clustering too, scaling is usually applied.
However, in this case we have not applied scaling. The reason is that all the variables here are
scores with values between 0 and 5; the variables are naturally on a common scale, and it was not
necessary to standardize them any further.
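
Had the attributes been on different scales, a minimal sketch of standardizing them before clustering with base R's scale() would be as follows (the choice of 6 centres simply mirrors the earlier result and is an assumption):

# Standardize each attribute to mean 0 and standard deviation 1 before clustering
scaled.data   <- scale(ReviewData[, 2:25])
kmeans.scaled <- kmeans(scaled.data, centers = 6, nstart = 25)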

6.3 Discrete Attributes

The main problem with using discrete attributes in clustering lies in computing the distance
between two items. All the distance measures defined above (Euclidean, Minkowski's, etc.) are
defined for continuous variables only. To use binary or nominal variables in a clustering
algorithm, a meaningful distance measure between a pair of observations on discrete variables
needs to be defined. There are several options, such as cosine similarity and the Mahalanobis
distance. We will not go into depth with these cases. In R, the Gower distance may be used to
deal with mixed data types.
Gower distance: For each variable type, a different distance is used and scaled to fall
between 0 and 1. For continuous variables, the Manhattan distance is used. Ordinal variables are
ranked first and then the Manhattan distance is used with a special adjustment for ties. Nominal
variables are converted into dummy variables and the Dice coefficient (a concordance-discordance
type coefficient) is computed. Then, a linear combination using user-specified weights (an
average) is calculated to create the final distance matrix.
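
A minimal sketch of computing the Gower distance in R with daisy() from the cluster package is shown below; the small mixed-type data frame is entirely hypothetical.

library(cluster)

# Hypothetical data frame with numeric, nominal and ordinal variables
mixed.data <- data.frame(
  age    = c(25, 47, 31, 62),
  income = c(32000, 81000, 45000, 56000),
  gender = factor(c("F", "M", "F", "M")),
  tier   = ordered(c("low", "high", "medium", "high"),
                   levels = c("low", "medium", "high"))
)

# Gower distance handles the mixed variable types and scales each to [0, 1]
d.gower <- daisy(mixed.data, metric = "gower")

# The resulting dissimilarities can be fed into hierarchical clustering
hc.gower <- hclust(as.dist(d.gower), method = "complete")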

6.4 K-Median Algorithm

This is a variant of k-means clustering where, instead of minimizing the squared deviations,
the absolute deviations from the medians are minimized. The difficulty is that for
multivariate data there is no universally accepted definition of the median. Another option is
to use a k-medoid partitioning.

6.5 K-Medoid Algorithm

In this algorithm k representative observations, called medoids, are chosen. Each observation
is assigned to the cluster whose medoid is closest to it, and the medoids are then updated. A
swapping cost is involved when a current medoid is replaced by a new one. Various algorithms
have been developed, namely PAM (partitioning around medoids), CLARA and randomized sampling.
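
A minimal sketch of k-medoid clustering in R using pam() from the cluster package is given below; the choice of k = 6 simply mirrors the earlier k-means result and is an assumption, and for much larger data sets the sampling-based clara() is the usual alternative.

library(cluster)

# Partitioning around medoids (PAM); the medoids are actual observations
pam.clus <- pam(ReviewData[, 2:25], k = 6)
table(pam.clus$clustering)    # cluster sizes
pam.clus$medoids[, 1:4]       # first few attributes of the six medoid users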

6.6 K-Mode Clustering

Since the mean is not defined for categorical data, k-mode clustering is a sensible alternative.
Once an optimal number of clusters is determined, initial cluster centres are arbitrarily
selected. Each data object is assigned to the cluster whose centre is closest to it under a suitably
defined distance; note that for cluster analysis the distance measure is all-important. After
each allocation, the cluster centres (modes) are updated. A good discussion is given in A review on k-mode
clustering algorithm by M. Goyal and S Aggarwal published in International Journal of
Advanced Research in Computer Science.
(https://www.ijarcs.info/index.php/Ijarcs/article/view/4301/4008)
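
A minimal sketch of k-mode clustering in R is given below, assuming the klaR package is installed; the small categorical data frame and the choice of two clusters are hypothetical.

library(klaR)   # provides kmodes()

# Hypothetical categorical data
cat.data <- data.frame(
  colour = c("red", "blue", "red", "green", "blue", "red"),
  size   = c("S", "M", "S", "L", "M", "S"),
  shape  = c("round", "square", "round", "round", "square", "square")
)

# k-mode clustering with 2 clusters; the distance is the number of mismatched categories
km <- kmodes(cat.data, modes = 2)
km$cluster   # cluster membership of each object
km$modes     # modal category of each attribute within each cluster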

Overview diagram: observations may be clustered either hierarchically, summarized by a dendrogram, or by partitioning, yielding a final partition.


