K Means Clustering
March 7, 2023
• Intra-cluster cohesion (compactness):
–Cohesion measures how near the data points in a cluster are to the cluster centroid.
–The sum of squared errors (SSE) is a commonly used measure.
• Inter-cluster separation (isolation):
–Separation means that different cluster centroids should be far away from one another.
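The cohesion measure described above can be computed in a few lines. This is a minimal sketch (the function name and the sample points are ours, not from the slides): SSE sums the squared Euclidean distance of every point to its cluster's centroid.

```python
def sse(clusters, centroids):
    """Sum of squared Euclidean distances of points to their cluster centroid."""
    total = 0.0
    for points, c in zip(clusters, centroids):
        for p in points:
            total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total

# Two 2-D clusters, each centroid at its cluster's mean point.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (7.0, 5.0)]]
centroids = [(1.0, 0.0), (6.0, 5.0)]
print(sse(clusters, centroids))  # 4.0 -- lower SSE means more compact clusters
```

A compact clustering yields a small SSE; separation can be checked analogously by measuring pairwise distances between the centroids.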
Measure the Quality of Clustering
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine
input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
k-means Algorithm
• Given k, the k-means algorithm is implemented in four steps:
–Partition the objects into k nonempty subsets.
–Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
–Assign each object to the cluster with the nearest seed point.
–Go back to Step 2; stop when the assignments no longer change.
K-means Clustering - Steps
1. Pick k starting means, m1 to mk.
   Can use: random values / dynamically picked / lower-upper bounds.
2. Repeat until convergence:
   i) Split the data into k sets, S1 to Sk, where x belongs to Si iff mi is the closest mean to x.
   ii) Update each mi to the mean of Si.
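The steps above can be sketched as a short program. This is our own illustrative implementation for 1-D data (function and variable names are not from the slides); it alternates the assignment step (i) and the update step (ii) until no mean moves.

```python
def kmeans(data, means, max_iter=100):
    """1-D k-means: returns (clusters, means) once the means stop changing."""
    for _ in range(max_iter):
        # i) split data into k sets by nearest mean
        clusters = [[] for _ in means]
        for x in data:
            i = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[i].append(x)
        # ii) update each mean to the mean of its set (keep old mean if empty)
        new_means = [sum(s) / len(s) if s else m
                     for s, m in zip(clusters, means)]
        if new_means == means:  # convergence: no more new assignments
            break
        means = new_means
    return clusters, means

clusters, means = kmeans([1.0, 2.0, 9.0, 10.0], [1.0, 9.0])
print(clusters)  # [[1.0, 2.0], [9.0, 10.0]]
print(means)     # [1.5, 9.5]
```

The same assign/update loop generalizes to higher dimensions by replacing the absolute difference with a vector distance and averaging coordinate-wise.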
k-means Clustering Method
• General Example (figure): with K=2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects and repeat until no reassignment occurs.
Example 1
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3, m2=4
Iteration 1:
• K1={2,3}, K2={4,10,12,20,30,11,25}
• Calculating the means of K1 and K2 gives m1=2.5, m2=16
Iteration 2:
• K1={2,3,4}, K2={10,12,20,30,11,25}
• Calculating the means of K1 and K2 gives m1=3, m2=18
Iteration 3:
• K1={2,3,4,10}, K2={12,20,30,11,25}
• Calculating the means of K1 and K2 gives m1=4.75, m2=19.6
Iteration 4:
• K1={2,3,4,10,11,12}, K2={20,30,25}
• Calculating the means gives m1=7, m2=25; reassigning the objects changes nothing, so the algorithm stops.
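The iterations above can be checked with a few lines of Python. This is a sketch of our own (the loop structure is not from the slides); it starts from the example's initial means and repeats assign/update until the means stop changing.

```python
# Reproducing Example 1: repeat assign/update until the means converge.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 3.0, 4.0                       # initial means from the example
while True:
    # assignment step: each point goes to its nearest mean
    k1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    k2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    # update step: recompute each mean
    n1, n2 = sum(k1) / len(k1), sum(k2) / len(k2)
    if (n1, n2) == (m1, m2):            # converged: no mean moved
        break
    m1, m2 = n1, n2

print(sorted(k1), sorted(k2))   # [2, 3, 4, 10, 11, 12] [20, 25, 30]
print(m1, m2)                   # 7.0 25.0
```

The output matches the final clusters and means reached in Iteration 4.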
Example 3
• The dataset contains 8 sample objects with their X, Y and Z coordinates:

Objects  X  Y  Z
OB-1     1  4  1
OB-2     1  2  2
OB-3     1  4  2
OB-4     2  1  2
OB-5     1  1  1
OB-6     2  4  2
OB-7     1  1  2
OB-8     2  1  1

• The task is to cluster these objects into two clusters (k=2).
• Let us consider OB-2 (1,2,2) and OB-6 (2,4,2) as the centroids of cluster 1 and cluster 2, respectively.
• For distance measurement, let Manhattan distance be used:
  d = |x2 – x1| + |y2 – y1| + |z2 – z1|
Example 3 (cont’d)
• After the initial pass of clustering, the state of the clustered objects will look like the following:
–Cluster 1: OB-2, OB-4, OB-5, OB-7, OB-8
–Cluster 2: OB-1, OB-3, OB-6
• Distances:

Objects  X  Y  Z  Distance from C1(1,2,2)  Distance from C2(2,4,2)
OB-1     1  4  1  3                        2
OB-2     1  2  2  0                        3
OB-3     1  4  2  2                        1
OB-4     2  1  2  2                        3
OB-5     1  1  1  2                        5
OB-6     2  4  2  3                        0
OB-7     1  1  2  1                        4
OB-8     2  1  1  3                        4

• Updated cluster centroids:
–For cluster 1: ((1+2+1+1+2)/5, (2+1+1+1+1)/5, (2+2+1+2+1)/5) = (1.4, 1.2, 1.6)
–For cluster 2: ((1+1+2)/3, (4+4+4)/3, (1+2+2)/3) = (1.33, 4, 1.66)
Example 3 (cont’d)
• The new assignments of the objects with respect to the updated centroids will be:
–Cluster 1: OB-2, OB-4, OB-5, OB-7, OB-8
–Cluster 2: OB-1, OB-3, OB-6
• Distances:

Objects  X  Y  Z  Distance from C1(1.4,1.2,1.6)  Distance from C2(1.33,4,1.66)
OB-1     1  4  1  3.8                            1
OB-2     1  2  2  1.6                            2.66
OB-3     1  4  2  3.6                            0.66
OB-4     2  1  2  1.2                            4
OB-5     1  1  1  1.2                            4
OB-6     2  4  2  3.8                            1
OB-7     1  1  2  1                              3.66
OB-8     2  1  1  1.4                            4.33

• Since there is no change in the current cluster formation, it is the same as the previous state of clusters.
• Hence, for k=2, the final state of the two clusters is as above.
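Example 3 can also be reproduced programmatically. The object names and coordinates below are from the slides; the helper functions (`manhattan`, `assign`, `update`) are our own sketch. It performs one assign/update pass from the initial centroids and then confirms that a second assignment leaves every object in its cluster.

```python
# Reproducing Example 3 with Manhattan distance.
objs = {"OB-1": (1, 4, 1), "OB-2": (1, 2, 2), "OB-3": (1, 4, 2),
        "OB-4": (2, 1, 2), "OB-5": (1, 1, 1), "OB-6": (2, 4, 2),
        "OB-7": (1, 1, 2), "OB-8": (2, 1, 1)}

def manhattan(p, q):
    """d = |x2-x1| + |y2-y1| + |z2-z1|, as defined in the example."""
    return sum(abs(a - b) for a, b in zip(p, q))

def assign(centroids):
    """Map each object to the index of its nearest centroid."""
    return {name: min(range(len(centroids)),
                      key=lambda i: manhattan(p, centroids[i]))
            for name, p in objs.items()}

def update(labels):
    """Recompute each centroid as the coordinate-wise mean of its cluster."""
    cents = []
    for i in range(2):
        pts = [objs[n] for n, c in labels.items() if c == i]
        cents.append(tuple(sum(v) / len(pts) for v in zip(*pts)))
    return cents

labels = assign([objs["OB-2"], objs["OB-6"]])  # initial centroids from slides
cents = update(labels)
print(cents)                     # (1.4, 1.2, 1.6) and (1.33..., 4.0, 1.66...)
print(assign(cents) == labels)   # True: no reassignment, so clustering stops
```

The recomputed centroids match the (1.4, 1.2, 1.6) and (1.33, 4, 1.66) values in the tables, and the unchanged second assignment is the stopping condition from the slides.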
Why use K-means?
• Strengths:
–Simple: easy to understand and to implement.
–Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.