Module 6: Clustering
What is clustering
• Clustering is a type of unsupervised machine learning
• It is distinguished from supervised learning by the fact that there is
no a priori output (i.e. no labels)
– The task is to learn the classification/grouping from the data
• A cluster is a collection of objects which are similar in some way
• Clustering is the process of grouping similar objects into groups
• E.g.: a group of people clustered based on their height and weight
• Normally, clusters are created using distance measures
– Two or more objects belong to the same cluster if they are “close” according to a
given distance (e.g. a geometric distance such as Euclidean or Manhattan; see the
sketch after this slide)
• Another measure is conceptual
– Two or more objects belong to the same cluster if the cluster defines a concept
common to all of those objects
– In other words, objects are grouped according to their fit to descriptive concepts, not
according to simple similarity measures
Note: Adapted from R. Palaniappan
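A minimal sketch of the two geometric distances named above (NumPy; the height/weight values are invented for illustration):

import numpy as np

# Two people described by (height in cm, weight in kg); values are made up.
a = np.array([170.0, 70.0])
b = np.array([160.0, 60.0])

euclidean = np.linalg.norm(a - b)   # sqrt(10**2 + 10**2) ~ 14.14
manhattan = np.abs(a - b).sum()     # |10| + |10| = 20
print(euclidean, manhattan)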
Unsupervised Learning
• Supervised learning uses labeled data pairs (x, y) to
learn a function f : X→Y
• But what if we don’t have labels?
• Strengths of K-means
– Relatively efficient: O(NKT), where N is the number of objects, K the number of
clusters, and T the number of iterations. Normally, K, T << N
– The procedure always terminates (but see the weaknesses below)
• Weaknesses of K-means
– Does not necessarily find the optimal configuration
– Significantly sensitive to the initial randomly selected cluster centres
– Applicable only when the mean is defined (i.e. can be computed)
– Need to specify K, the number of clusters, in advance (a minimal run is sketched below)
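Since these strengths and weaknesses describe K-means, here is a minimal run (scikit-learn; the blob data, K = 2, and the parameter values are made up):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two invented 2-D blobs; K must still be chosen in advance (a weakness above).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# n_init reruns with different random centres, reducing sensitivity to the
# initial selection (another weakness above).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the cluster means; requires a computable mean
print(km.inertia_)          # within-cluster sum of squares at termination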
[Figures: average linkage and max (complete) linkage]
Agglomerative clustering
• The K-means approach starts out with a fixed number of clusters
and allocates all data points into exactly that number of clusters
• But agglomeration does not require the number of clusters K
as an input
• Agglomeration starts out by treating each data point as its own cluster
– So, a data set of N objects starts with N clusters
• Next, using some distance (or similarity) measure, it reduces
the number of clusters (by one in each iteration) through a merging
process
• Finally, we have one big cluster that contains all the objects
• But then what is the point of having one big cluster in the end?
– The merge history, recorded as a dendrogram, lets us cut the tree to recover
any number of clusters (see the sketch below)
A dendrogram example
Note: Adapted from R. Palaniappan
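A sketch of agglomeration and its dendrogram (SciPy; the data and the choice of 'average' linkage are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

# Start from N singleton clusters and merge the closest pair at each step
# until one big cluster remains; Z records the full merge history.
Z = linkage(X, method="average")

dendrogram(Z)  # the merge history drawn as a tree
plt.show()

# The point of the single big cluster: cut the tree afterwards to obtain
# any desired number of clusters, without fixing K up front.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)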
Hierarchical clustering - algorithm
• Agglomerative clustering is a (bottom-up) type of hierarchical clustering
Impact of intra-cluster distance
Impact of scaling a dimension
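The two slide titles above refer to figures not reproduced here; as a rough sketch of the scaling effect, the following compares K-means labels before and after inflating one dimension (scikit-learn; the data and the factor 100 are made up):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (100, 2))
X[:50, 0] += 3  # the real separation lies along dimension 0

labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

X_scaled = X.copy()
X_scaled[:, 1] *= 100  # inflate dimension 1; distances are now dominated by it
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Permutation-invariant agreement; a low score means the grouping changed.
print(adjusted_rand_score(labels_raw, labels_scaled))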
Clustering validity problem
• Problem 1: deciding the optimal number of clusters that fits a
data set
• Problem 2: the various clustering algorithms behave
differently depending on
– the features of the data set
(geometry and density distribution of clusters)
– the input parameter values (e.g. for K-means, the initial
cluster choices influence the result)
– So, how do we know which clustering method is
better/more suitable?
• We need a clustering quality criterion
Note: Adapted from R. Palaniappan
Clustering validity problem
• In general, good clusters should have
– High intra-cluster similarity, i.e. low variance among intra-cluster members,
where the variance of x is defined by

var(x) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2

with \bar{x} as the mean of x
• E.g.: if x = [2 4 6 8], then \bar{x} = 5, so var(x) = 6.67
• Computing intra-cluster similarity is simple
• E.g.: for the two clusters shown, var(cluster1) = 2.33 while var(cluster2) = 12.33
• So, cluster 1 is tighter (better) than cluster 2
• Note: use ‘var’ function in MATLAB to compute variance
Note: Adapted from R. Palaniappan
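The same computation in NumPy, using the example vector above:

import numpy as np

x = np.array([2, 4, 6, 8])
print(x.mean())       # 5.0, the mean of x
# ddof=1 gives the (N-1)-denominator sample variance defined above,
# matching MATLAB's var() mentioned in the note.
print(x.var(ddof=1))  # 6.666...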
Clustering Quality Criteria
• But this does not tell us anything about how good the overall clustering is, or
about the suitable number of clusters!
• To solve this, we also need to compute inter-cluster variance
• Good clusters will also have low inter-cluster similarity (i.e. high variance
between clusters), in addition to high intra-cluster similarity (i.e. low variance
among intra-cluster members)
• One good measure of clustering quality is Davies-Bouldin index
• The others are:
– Dunn’s Validity Index
– Silhouette method
– C–index
– Goodman–Kruskal index
• So, we compute the DB index for different numbers of clusters K; the best
(lowest) DB value indicates the appropriate K and how good the clustering
method is
• R1 = 1 (max of R12 and R13); R2 = 1 (max of R21 and R23); R3 = 0.797 (max of R31
and R32)
• Finally, compute

DB = \frac{1}{k} \sum_{i=1}^{k} R_i

• DB = 0.932
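For reference, the DB index averages R_i = max_{j≠i} (S_i + S_j) / M_ij, where S_i is the within-cluster scatter and M_ij the distance between the centroids of clusters i and j; lower is better. A sketch of the K-sweep described above, using scikit-learn's built-in score (the three-blob data is made up):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Three invented blobs, so K = 3 should score best (lowest).
X = np.vstack([rng.normal(c, 0.6, (40, 2)) for c in (0, 4, 8)])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))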