ML13: k-Means Clustering


Partitioning Algorithms:

k-Means Clustering Algorithm

Construct a partition of a database D of n objects into a set of k
clusters (for a given k) that optimizes the chosen partitioning
criterion.
The k-Means Clustering Algorithm
• Start by choosing k points arbitrarily as the “centroids” of the
clusters.
• Partition the objects into k nonempty subsets by associating each
object with its nearest centroid.
• Replace each centroid with the average of the data points associated
with it; this is done for every centroid.
• Repeat the process until the centroids converge to fixed points.
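The steps above can be sketched in a few lines of Python. The data
points and starting centroids below are illustrative assumptions, not
taken from the slides:

```python
# Minimal sketch of the k-means loop: assign, re-average, repeat.
def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Replace each centroid with the mean of the points assigned to it.
        new = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        # Stop once the centroids no longer move.
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

points = [(1, 1), (2, 1), (4, 3), (5, 4)]          # illustrative data
centroids, clusters = kmeans(points, [(1, 1), (5, 4)])
```

With these four points the loop converges after one update, leaving one
centroid per pair of nearby points.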
[Figure: four scatter plots (axes 0–10) illustrating successive k-means
iterations: initial centroids, assignment of points, recomputed
centroids, and convergence.]
Use the k-means clustering algorithm to divide the following data into
two clusters, and compute the representative data points for the
clusters.

• In the problem, the required number of clusters is 2, so we take
k = 2.
• We choose two points arbitrarily as the initial cluster centres:
let us choose (2, 1) and (2, 3).
• We compute the distances of the given data points from the cluster
centres (using the Euclidean distance).
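This distance step can be sketched as follows. The centres (2, 1) and
(2, 3) come from the text; the data points below are illustrative
assumptions, since the slide's data table is not reproduced here:

```python
# Compute Euclidean distances from each point to the two initial centres
# and assign the point to the nearer one.
from math import dist  # Euclidean distance (Python 3.8+)

centres = [(2, 1), (2, 3)]
for p in [(1, 1), (3, 4)]:           # hypothetical data points
    d = [dist(p, c) for c in centres]
    nearest = d.index(min(d))        # index of the nearest centre
    print(p, [round(x, 2) for x in d], "-> cluster", nearest)
```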
• The cluster centres are recalculated as the means of the points
assigned to them, and the distances of the given data points are
recomputed.
• Repeating this, one of the recalculated centres becomes (4.5, 4).
• On the next recalculation there is no change in the centroids, so we
STOP.
Comments on the k-Means Method
• Strengths
• Relatively efficient
• Weaknesses
• Applicable only when a mean is defined (what about categorical
data?)
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
Optimal number of clusters (k)
• This method measures the homogeneity (or heterogeneity) within the
clusters for various values of k.
• The quality of a clustering is measured using the Sum of Squares
technique.
• The Within Cluster Sum of Squares (WCSS) for a given k is computed as

      WCSS = Σ_{j=1..k} Σ_{x in C_j} dist(x, c_j)^2

• where dist() is the Euclidean distance between the centroid c_j of
cluster C_j and a data point x in that cluster.
• Summing these squared distances over all k clusters gives the WCSS.
• Plot the Within Cluster Sum of Squares (WCSS) against k.
• The formation of an elbow in the plot suggests the choice of k.
• The lower the WCSS for a clustering solution, the better the
centroids represent their clusters.
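The WCSS computation described above can be sketched directly; the
clusters and centroids below are illustrative assumptions:

```python
# Within Cluster Sum of Squares: for each cluster, sum the squared
# Euclidean distances from its points to its centroid.
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wcss(clusters, centroids):
    return sum(sqdist(x, c) for cl, c in zip(clusters, centroids) for x in cl)

clusters = [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]   # illustrative clustering
centroids = [(1.5, 1.0), (4.5, 3.5)]
print(wcss(clusters, centroids))                  # lower = tighter clusters
```

To locate the elbow, one would run k-means for k = 1, 2, 3, … and plot
the resulting WCSS values against k.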
Example: Clustering using the K-means method
Suppose we measure two variables X1 and X2 for
each of four items A, B, C, and D. The data are given
as (X1, X2): A(5, 3), B(-1, 1), C(1, -2) and D(-3, -2).
[Figure: scatter plot of the four items on the X1–X2 plane: A(5, 3),
B(-1, 1), C(1, -2), D(-3, -2).]
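A sketch of the full procedure on these four points. The slide does not
state the initial centres, so initializing at A and B is an assumption:

```python
# k-means (k = 2) on A(5,3), B(-1,1), C(1,-2), D(-3,-2),
# starting from centres A and B (an assumed initialization).
def assign(points, centres):
    clusters = [[] for _ in centres]
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
        clusters[d.index(min(d))].append(p)
    return clusters

pts = {"A": (5, 3), "B": (-1, 1), "C": (1, -2), "D": (-3, -2)}
centres = [pts["A"], pts["B"]]
while True:
    clusters = assign(pts.values(), centres)
    # Recompute each centre as the mean of its assigned points.
    new = [tuple(sum(x) / len(cl) for x in zip(*cl)) for cl in clusters]
    if new == centres:       # converged: centroids no longer move
        break
    centres = new
print(centres)
```

With this initialization, A forms one cluster on its own and B, C, D
group together, giving representative points (5, 3) and (-1, -1).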
