SCA - Module 8
Clustering
Week 11
Clustering >> K-means Clustering
• Clustering is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups
• K-means clusters n objects, based on their attributes, into k partitions
• Strengths
• Simple iterative method
• User provides “K”
• Weaknesses
• Often too simple → bad results
• Difficult to guess the correct “K”
• Distance measures
• Euclidean distance
• Manhattan distance
• Maximum norm
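The three distance measures named above (Euclidean, Manhattan, maximum norm) can be written in a few lines. This is a minimal sketch; the function names and the two sample points are illustrative and do not come from the slides.

```python
import math


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """City-block distance: sum of the absolute differences per attribute."""
    return sum(abs(a - b) for a, b in zip(p, q))


def maximum_norm(p, q):
    """Maximum norm (Chebyshev distance): largest absolute difference."""
    return max(abs(a - b) for a, b in zip(p, q))


p, q = (2, 10), (5, 8)  # two illustrative points
print(round(euclidean(p, q), 3))  # 3.606
print(manhattan(p, q))            # 5
print(maximum_norm(p, q))         # 3
```

The choice of measure changes which objects count as "close", so the same data can yield different clusters under different measures.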
K-means Clustering
Basic Algorithm:
• Step 0: select K
• Step 1: randomly select initial cluster seeds
• Seed 1: 650
• Seed 2: 200
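Step 1 can be sketched as drawing K distinct objects from the data to serve as seeds. The slides do not say how the seeds are drawn, so picking existing data points is an assumption here (it is one common choice); `select_seeds` and the sample data are hypothetical.

```python
import random


def select_seeds(points, k, seed=None):
    """Pick k distinct data points at random to act as initial cluster seeds.

    ASSUMPTION: seeds are chosen among the data points themselves.
    Passing `seed` makes the random draw reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(points, k)


points = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]  # hypothetical data
print(select_seeds(points, k=2, seed=42))
```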
• Step 2: calculate the distance from each object to each cluster seed
• Step 3: assign each object to the closest cluster seed
• Step 4: Compute the new centroid for each cluster
• Cluster Seed 1: 708.9
• Cluster Seed 2: 214.2
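The new centroid of a cluster is the attribute-wise mean of the objects currently assigned to it. A minimal sketch follows; the `centroid` helper and the three sample points are hypothetical and are not the data behind the figures quoted above.

```python
def centroid(points):
    """Return the attribute-wise mean of a non-empty list of points."""
    n = len(points)
    # zip(*points) groups the values of each attribute together.
    return tuple(sum(values) / n for values in zip(*points))


cluster = [(2, 10), (5, 8), (4, 9)]  # hypothetical cluster members
print(centroid(cluster))
```

Here the second coordinate of the result is exactly 9.0 and the first is 11/3, i.e. the mean of each attribute taken separately.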
• Iterate
• Calculate distance from objects to cluster centroids
• Assign objects to closest cluster
• Recalculate new centroids
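The iteration above can be sketched as a short loop. This is one possible reading, not a definitive implementation: it assumes Euclidean distance, seeds drawn from the data points, and stopping when the assignments no longer change (the slides do not state a stopping rule). All function names are hypothetical.

```python
import math
import random


def assign(points, centroids):
    """For each point, return the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Recalculate each centroid as the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
    return new_centroids


def kmeans(points, k, seed=None, max_iter=100):
    """Return (centroids, labels) after iterating until assignments stabilise."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:  # no object changed cluster: stop
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]  # hypothetical data
print(kmeans(points, k=2, seed=0))
```

On this small sample the two obvious groups are recovered, although which group receives index 0 depends on the random seeds.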
• K-means
• Easy to use
• Need to know K
• May need to scale data
• Good initial method
• Local optima
• No guarantee of optimal solution
• Repeat with different starting values
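Two of the points above, scaling the data and repeating with different starting values, can be sketched as follows. The sketch assumes z-score scaling and assumes that runs are compared by their within-cluster sum of squared distances; the slides name neither choice. `run_once` is a hypothetical callable standing in for any K-means routine that returns `(centroids, labels)`.

```python
import math
import statistics


def standardize(points):
    """Rescale every attribute to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*points))
    means = [statistics.fmean(c) for c in columns]
    # A constant attribute has deviation 0; divide by 1 instead to avoid an error.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]


def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def best_of_restarts(run_once, points, k, n_restarts=10):
    """Run K-means n_restarts times and keep the lowest-scoring result.

    run_once(points, k, seed) -> (centroids, labels) is a hypothetical callable.
    Returns (score, centroids, labels) of the best run.
    """
    best = None
    for seed in range(n_restarts):
        centroids, labels = run_once(points, k, seed)
        score = within_cluster_ss(points, centroids, labels)
        if best is None or score < best[0]:
            best = (score, centroids, labels)
    return best


# Attributes on very different scales end up comparable after scaling.
print(standardize([(1, 100), (2, 200), (3, 300)]))
```

Keeping the best of several runs does not guarantee the optimal solution; it only lowers the chance of reporting a poor local optimum.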
Example
Data provided by 3 companies
Point  x  y
A1     2  10
A2     2  5
A3     8  4
B1     5  8
B2     7  5
B3     6  4
C1     1  2
C2     4  9
Scatter plot (figure not reproduced; x-axis 0 to 9, y-axis 0 to 12)
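One assignment pass over these eight points can be sketched as below. The slides shown here do not give the initial seeds, so the sketch assumes, purely for illustration, that the first point from each company (A1, B1, C1) is used, with K = 3 and Euclidean distance.

```python
import math

data = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4),
    "B1": (5, 8), "B2": (7, 5), "B3": (6, 4),
    "C1": (1, 2), "C2": (4, 9),
}

# ASSUMPTION: the initial seeds are not stated in the slides; A1, B1 and C1
# are chosen here only to make the example concrete.
seeds = {1: data["A1"], 2: data["B1"], 3: data["C1"]}

# Distance from every point to every seed, then assignment to the closest.
assignment = {}
for name, point in data.items():
    dists = {c: math.dist(point, s) for c, s in seeds.items()}
    assignment[name] = min(dists, key=dists.get)
    row = "  ".join(f"{d:5.2f}" for d in dists.values())
    print(f"{name}  {row}  -> cluster {assignment[name]}")

# New centroid of each cluster: the mean of its assigned points.
new_centroids = {}
for c in seeds:
    members = [data[n] for n, a in assignment.items() if a == c]
    new_centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))

print(new_centroids)  # {1: (2.0, 10.0), 2: (6.0, 6.0), 3: (1.5, 3.5)}
```

Under these assumed seeds, the first pass puts A1 alone in cluster 1, A3, B1, B2, B3 and C2 in cluster 2, and A2 and C1 in cluster 3. Further passes would repeat the same two steps with the new centroids.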
Data Distance to Cluster New Cluster