Clustering


EE769: Intro to Machine Learning

Module 6: Clustering

Amit Sethi, EE, IIT Bombay


(asethi@iitb.ac.in)

1
Objectives

• Understand clustering approaches

• Understand clustering criteria

2
What is clustering
• Clustering is a type of unsupervised machine learning
• It is distinguished from supervised learning by the fact that there is no a priori output (i.e. no labels)
– The task is to learn the classification/grouping from the data
• A cluster is a collection of objects which are similar in some way
• Clustering is the process of grouping similar objects into groups
• E.g.: a group of people clustered based on their height and weight
• Normally, clusters are created using distance measures
– Two or more objects belong to the same cluster if they are “close” according to a
given distance (in this case geometrical distance like Euclidean or Manhattan)
• Another measure is conceptual
– Two or more objects belong to the same cluster if they share a concept common to all of them
– In other words, objects are grouped according to their fit to descriptive concepts, not
according to simple similarity measures
Note: Adapted from R. Palaniappan 3
Unsupervised Learning
• Supervised learning uses labeled data pairs (x, y) to
learn a function f : X→Y.
• But, what if we don’t have labels?

• No labels = unsupervised learning


• Only some points are labeled = semi-supervised
learning
– Labels may be expensive to obtain, so we only get a few.

• Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.
Slide adapted from Andrew Moore, CMU 4
Can you spot the clusters here?

Slide adapted from Andrew Moore, CMU 5


How about here?
Another example
K-Means Clustering
K-Means(k, data)
• Randomly choose k cluster center locations (centroids).
• Loop until convergence:
– Assign each point to the cluster of the closest centroid.
– Re-estimate the cluster centroids based on the data assigned to each.
• Hope that some “natural” clusters will be found

Slide adapted from Andrew Moore, CMU 8
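A minimal NumPy sketch of the loop above (an illustration of the stated algorithm, not the instructor's code; the data, k, and seed are placeholders, and ties and empty clusters are not handled):

```python
import numpy as np

def k_means(data, k, n_iter=100, seed=0):
    # Illustrative sketch only: ties and empty clusters are not handled.
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assign each point to the cluster of the closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid as the mean of the points assigned to it.
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids stop moving
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = k_means(np.random.rand(200, 2), k=3)   # placeholder data
```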




K-means objective function,
e.g. for facility location
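The formula itself is missing from the extracted slide; a standard reconstruction (my notation, so treat it as an assumption) is the sum of squared distances from each point to its assigned centroid, the facility-location style cost that both k-means steps decrease:

J(C_1,\ldots,C_k;\ \mu_1,\ldots,\mu_k) \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2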
Why should k-means converge?

Both steps can only decrease the objective: a point is reassigned only if another centroid is strictly closer, and each updated centroid minimizes the squared distance for the points assigned to it. Since there are only finitely many possible assignments, the updates must terminate at a local optimum.


Variants of k-means
K – means: Strengths and weaknesses

• Strengths
– Relatively efficient: roughly O(NKT) distance computations, where N is the number of objects, K is the
number of clusters, and T is the number of iterations. Normally, K, T << N.
– Procedure always terminates successfully (but see below)

• Weaknesses
– Does not necessarily find the optimal configuration (it may converge to a local optimum)
– Significantly sensitive to the initial randomly selected cluster centres
– Applicable only when mean is defined (i.e. can be computed)
– Need to specify K, the number of clusters, in advance

Note: Adapted from R. Palaniappan 15


Questions in clustering
• So, the goal of clustering is to determine the intrinsic grouping
in a set of unlabeled data
• But how to decide what constitutes a good clustering?
• It can be shown that there is no absolute “best” criterion
which would be independent of the final aim of the clustering
• Consequently, it is the user who must supply this criterion,
to suit the application
• Some possible applications of clustering
– data reduction – reduce data that are homogeneous (similar)
– find “natural clusters” and describe their unknown properties
– find useful and suitable groupings
– find unusual data objects (i.e. outlier detection)

Note: Adapted from R. Palaniappan 16


Clustering – Major approaches
• Exclusive (partitioning)
– Data are grouped in an exclusive way: each data point can belong to only one cluster
– Eg: K-means
• Agglomerative
– Every data point is its own cluster initially, and iterative unions between the two nearest
clusters reduce the number of clusters
– Eg: Hierarchical clustering
• Overlapping
– Uses fuzzy sets to cluster data, so that each point may belong to two or more
clusters with different degrees of membership
– In this case, data will be associated to an appropriate membership value
– Eg: Fuzzy C-Means
• Probabilistic
– Uses probability distribution measures to create the clusters
– Eg: Gaussian mixture model clustering, a soft (probabilistic) generalization of K-means
Note: Adapted from R. Palaniappan 17
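For orientation, a hedged sketch of how three of the approaches above are typically invoked in scikit-learn (fuzzy c-means is not in scikit-learn, so it is omitted; X is a placeholder dataset):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 2)                                            # placeholder data
exclusive     = KMeans(n_clusters=3, n_init=10).fit_predict(X)        # partitioning
agglomerative = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # hierarchical merging
probabilistic = GaussianMixture(n_components=3).fit(X).predict(X)     # soft clusters via GMM
```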
Agglomerative clustering criteria
• Single linkage

• Average linkage

• Max linkage
Agglomerative clustering
• The K-means approach starts out with a fixed number of clusters
and allocates all data into exactly that number of clusters
• But agglomeration does not require the number of clusters K
as an input
• Agglomeration starts out by forming each data point as its own cluster
– So, a data set of N objects will start with N clusters
• Next, using some distance (or similarity) measure, it reduces
the number of clusters (by one in each iteration) through a merging
process
• Finally, we have one big cluster that contains all the objects
• But then what is the point of having one big cluster in the end?

Note: Adapted from R. Palaniappan 19


Dendrogram
• While merging cluster one by one, we can draw a tree diagram
known as dendrogram
• Dendrograms are used to represent agglomerative clustering
• From dendrograms, we can get any number of clusters
• Eg: say we wish to have 2 clusters, then cut the topmost link
– Cluster 1: q, r
– Cluster 2: x, y, z, p
• Similarly for 3 clusters, cut 2 top links
– Cluster 1: q, r
– Cluster 2: x, y, z
– Cluster 3: p

A dendrogram example
Note: Adapted from R. Palaniappan 20
Hierarchical clustering - algorithm
• Hierarchical clustering algorithm is a type of agglomerative clustering

• Given a set of N items to be clustered, the hierarchical clustering algorithm is:


1. Start by assigning each item to its own cluster, so that if you have N items, you
now have N clusters, each containing just one item
2. Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that now you have one less cluster
3. Compute pairwise distances between the new cluster and each of the old
clusters
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N
5. Draw the dendrogram, and with the complete hierarchical tree, if you want K
clusters you just have to cut the K-1 top links

Note: any distance measure can be used (Euclidean, Manhattan, etc.); a from-scratch sketch of these steps follows below

Note: Adapted from R. Palaniappan 21
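A from-scratch sketch of steps 1-4 for 1-D data with Manhattan distance and single linkage (the function name and structure are mine and purely illustrative):

```python
def agglomerate(points):
    # Step 1: each item starts in its own cluster.
    clusters = [[p] for p in points]
    merges = []                               # records (cluster_a, cluster_b, distance)
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters (single linkage, Manhattan distance).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        # Step 3: merge the pair; distances to the new cluster are recomputed next pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges                             # Step 4: the loop ends with a single cluster

# The merge record is what a dendrogram visualizes (Step 5).
for a, b, d in agglomerate([3, 7, 10, 17, 18, 20]):
    print(a, '+', b, 'at distance', d)
```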


An example
Hierarchical clustering algorithm– step 3
• Computing distances between clusters for Step 3 can
be implemented in different ways (compared numerically in the sketch below):
– Single-linkage clustering
• The distance between one cluster and another cluster is computed as the
shortest distance from any member of one cluster to any member of the
other cluster

– Complete-linkage clustering (max linkage)


• The distance between one cluster and another cluster is computed as the
greatest distance from any member of one cluster to any member of the
other cluster

– Centroid clustering (similar to average linkage)


• The distance between one cluster and another cluster is computed as the
distance from one cluster centroid to the other cluster centroid
Note: Adapted from R. Palaniappan 23
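A tiny numerical comparison of the three rules (a NumPy sketch on two clusters from the running example; nothing in it comes from the slides themselves):

```python
import numpy as np

# Two tiny 1-D clusters from the running example; values are illustrative only.
A = np.array([7.0, 10.0])
B = np.array([17.0, 18.0, 20.0])

pairwise = np.abs(A[:, None] - B[None, :])   # all member-to-member distances
single   = pairwise.min()                    # shortest member-to-member distance -> 7
complete = pairwise.max()                    # greatest member-to-member distance -> 13
centroid = abs(A.mean() - B.mean())          # distance between the two centroids -> 9.83
print(single, complete, centroid)
```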


Hierarchical clustering – an example
• Assume X=[3 7 10 17 18 20]
1. There are 6 items, so create 6 clusters initially

2. Compute pairwise distances of clusters (assume Manhattan distance)


The closest clusters are 17 and 18 (with distance=1), so merge these two
clusters together

3. Repeat step 2 (assume single-linkage):


The closest clusters are {17, 18} and {20} (with single-linkage distance |18-20| = 2),
so merge these two clusters together

Note: Adapted from R. Palaniappan 25


Hierarchical clustering – an example
• Keep repeating the cluster mergers until one big cluster remains

• Draw the dendrogram (in reverse order of the cluster mergers) –
remember that the height of each merge corresponds to the distance at
which the clusters were agglomerated (the SciPy sketch below reproduces these merges)

Note: Adapted from R. Palaniappan 26
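As a cross-check (assuming SciPy is available; this is not the course's code), the same merges can be reproduced with scipy.cluster.hierarchy, where 'cityblock' is the Manhattan metric:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([3, 7, 10, 17, 18, 20], dtype=float).reshape(-1, 1)
Z = linkage(X, method='single', metric='cityblock')   # single linkage, Manhattan distance
print(Z[:, 2])                                 # merge distances: 1, 2, 3, 4, 7
print(fcluster(Z, t=3, criterion='maxclust'))  # a 3-cluster cut: {3}, {7,10}, {17,18,20}
```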


Where will you cut a dendrogram?
• Fixed number of clusters

• Distance-based upper bound

• Some natural largest merge distance


Density-based clustering, e.g. DBSCAN
• Identify core points
– at least minPts points are within distance ε
• Identify directly reachable points
– a non-core point within distance ε from a core point

• Identify reachable points
– a point q is reachable from a core point p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi
• Rest are outlier points
• Clusters contain at least one core
point
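A hedged scikit-learn sketch of these rules (eps and min_samples play the roles of ε and minPts; the toy points are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (made-up data).
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [50.0, 50.0]])
db = DBSCAN(eps=1.0, min_samples=2).fit(X)   # eps ~ ε, min_samples ~ minPts
print(db.labels_)                            # outliers are labelled -1
```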
Questions
• What happens when K is small?

• What happens when K is large?

• What happens when we use intra-cluster distance as a measure instead of K?

• What happens if we scale a dimension?

29
Impact of intra-cluster distance

30
Impact of scaling a dimension

31
Clustering validity problem
• Problem 1: A problem we face in clustering is to
decide the optimal number of clusters that fits a data
set
• Problem 2: The various clustering algorithms behave
in a different way depending on
– the features of the data set
(geometry and density distribution of clusters)
– the input parameters values (e.g.: for K-means, initial
cluster choices influence the result)
– So, how do we know which clustering method is
better/suitable?
• We need a clustering quality criteria
Note: Adapted from R. Palaniappan 32
Clustering validity problem
• In general, good clusters should have
– High intra–cluster similarity, i.e. low variance among intra-cluster members,
where the variance of x is defined by
var(x) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2
with \bar{x} as the mean of x
• Eg: if x = [2 4 6 8], then \bar{x} = 5, so var(x) = 6.67
• Computing intra-cluster similarity is simple
• Eg: for the clusters shown
• var(cluster1)=2.33 while var(cluster2)=12.33
• So, cluster 1 is better than cluster 2
• Note: use ‘var’ function in MATLAB to compute variance
Note: Adapted from R. Palaniappan 33
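A quick numerical check (a sketch assuming NumPy rather than MATLAB; the two clusters below are taken to be the ones from the later worked example, which match the variances quoted above):

```python
import numpy as np

x = np.array([2, 4, 6, 8], dtype=float)
print(np.var(x, ddof=1))              # 6.67: the N-1 (sample) variance defined above

cluster1 = np.array([17, 18, 20], dtype=float)
cluster2 = np.array([3, 7, 10], dtype=float)
print(np.var(cluster1, ddof=1))       # 2.33  -> tighter cluster
print(np.var(cluster2, ddof=1))       # 12.33 -> looser cluster
```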
Clustering Quality Criteria
• But this does not tell us anything about how good is the overall clustering or on
the suitable number of clusters needed!
• To solve this, we also need to compute inter-cluster variance
• Good clusters will also have low inter–cluster similarity, i.e. high separation between
clusters, in addition to high intra–cluster similarity, i.e. low variance
among intra-cluster members
• One good measure of clustering quality is Davies-Bouldin index
• The others are:
– Dunn’s Validity Index
– Silhouette method
– C–index
– Goodman–Kruskal index
• So, we compute the DB index for different numbers of clusters K, and the best (lowest)
value of the DB index tells us the appropriate K value, or how good the clustering
method is

Note: Adapted from R. Palaniappan 34


Davies-Bouldin index
• It is a function of the ratio of the sum of within-cluster (i.e. intra-cluster)
scatter to between cluster (i.e. inter-cluster) separation
• Because a low scatter and a high distance between clusters lead to low
values of Rij , a minimization of DB index is desired
• Let C = {C1, ....., Ck} be a clustering of a set of N objects:

DB = \frac{1}{k} \sum_{i=1}^{k} R_i

with R_i = \max_{j=1,\ldots,k,\ j \neq i} R_{ij} and R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\lVert c_i - c_j \rVert}

where C_i is the i-th cluster and c_i is the centroid of cluster i

• The numerator of R_{ij} measures intra-cluster scatter, while the denominator measures inter-cluster separation
• Note that R_{ij} = R_{ji}

Note: Adapted from R. Palaniappan 35
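A minimal NumPy sketch of this definition for 1-D data (my helper, not from the slides; it uses the sample variance and centroid distance exactly as above). Applied to the partitions of the running example X = [3 7 10 17 18 20], it gives roughly 1.26 for two clusters and 0.78 for three, matching the worked examples that follow:

```python
import numpy as np

def db_index(clusters):
    """Davies-Bouldin index for a list of 1-D clusters (lists of numbers)."""
    var = [np.var(c, ddof=1) if len(c) > 1 else 0.0 for c in clusters]  # scatter
    cen = [np.mean(c) for c in clusters]                                # centroids
    k = len(clusters)
    R = [[(var[i] + var[j]) / abs(cen[i] - cen[j]) if i != j else 0.0
          for j in range(k)] for i in range(k)]
    return np.mean([max(R[i][j] for j in range(k) if j != i) for i in range(k)])

print(db_index([[3, 7, 10], [17, 18, 20]]))    # about 1.26
print(db_index([[3], [7, 10], [17, 18, 20]]))  # about 0.78
```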


Davies-Bouldin index example
• For eg: take the three clusters {3}, {7, 10} and {17, 18, 20} from the earlier example
• Compute R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\lVert c_i - c_j \rVert}. Note: the variance of a
single-element cluster is zero, and its centroid is simply the element itself
• var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33
• The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33
• So, R12 = 4.5/5.5 = 0.82, R13 = 2.33/15.33 = 0.15, R23 = 6.83/9.83 = 0.69

• Now, compute R_i = \max_{j \neq i} R_{ij}

• R1 = 0.82 (max of R12 and R13); R2 = 0.82 (max of R21 and R23); R3 = 0.69 (max of R31
and R32)
• Finally, compute DB = \frac{1}{k} \sum_{i=1}^{k} R_i
• DB ≈ 0.78

Note: Adapted from R. Palaniappan 36


Davies-Bouldin index example (ctd)
• For eg: take the two clusters {3, 7, 10} and {17, 18, 20} from the earlier example

• Compute R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\lVert c_i - c_j \rVert}
• Only 2 clusters here
• var(C1) = 12.33 while var(C2) = 2.33; c1 = 6.67 while
c2 = 18.33
• R12 = (12.33 + 2.33) / |6.67 - 18.33| = 1.26

• Now compute R_i = \max_{j \neq i} R_{ij}

• Since we have only 2 clusters here, R1 = R12 = 1.26;
R2 = R21 = 1.26
• Finally, compute DB = \frac{1}{k} \sum_{i=1}^{k} R_i
• DB = 1.26

Note: Adapted from R. Palaniappan 37


Davies-Bouldin index example (ctd)
• DB with 2 clusters = 1.26, with 3 clusters ≈ 0.78
• So, K=3 is better than K=2 (since DB smaller, better clusters)
• In general, we will repeat DB index computation for all cluster
sizes from 2 to N-1
• So, if we have 10 data items, we will do clustering with
K = 2, ..., 9 and then compute DB for each value of K
– K = 10 is not done since each item would be its own cluster
• Then, the best clustering size (and the best set of
clusters) is taken to be the one with the minimum value of the DB index

• NOTE: This still does not address the impact of scaling a
dimension. “Standardizing” the dimensions can address it to
some extent (a brief sketch follows below)

Note: Adapted from R. Palaniappan 38
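One hedged way to do that standardization in practice (an illustrative scikit-learn pipeline with made-up data, not part of the slides):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data with two features on very different scales; the second
# column would otherwise dominate any Euclidean distance.
X = np.array([[170, 70000], [160, 65000], [182, 91000], [175, 62000]], dtype=float)
Xs = StandardScaler().fit_transform(X)          # zero mean, unit variance per column
print(KMeans(n_clusters=2, n_init=10).fit_predict(Xs))
```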


Spectral Clustering
• CONTENT TO BE ADDED
