K-means Clustering
K-means clustering is a popular unsupervised learning algorithm. In simple terms: imagine you have a
box of mixed candies and you want to sort them into groups based on their colors. Here's how K-
means works step by step:
Step-by-Step Explanation
1. Choosing the Number of Groups (K):
o First, you decide how many groups (clusters) you want to divide the candies into. Let's
say you want to sort them into 3 groups.
2. Placing the Initial Centers (Centroids):
o Imagine putting 3 invisible markers randomly in your box of candies. These markers
represent the centers (or centroids) of your 3 groups.
3. Assigning Candies to Groups:
o Look at each candy one by one and find out which marker (centroid) is closest to it.
Assign the candy to that group. For example, if a red candy is closest to the first
marker, it goes into the first group.
4. Moving the Markers:
o Once all the candies are assigned to groups, you move the markers to the center of
their respective groups. This means you calculate the average position of all the
candies in each group and place the marker there.
5. Repeating the Process:
o Repeat steps 3 and 4 until the markers (centroids) don't move much anymore. This
means the candies are well grouped, and the markers have found their best positions.
6. Final Groups:
o The candies are now sorted into 3 groups based on their colors, with each group
having its own centroid.
More formally, the algorithm from the slides works as follows (a runnable sketch is given right after this list):
1. Initialization:
o Choose K initial centroids, for example by picking K data points at random.
2. Assignment Step:
o Assign each data point to the cluster whose centroid is nearest to it.
3. Update Step:
o Calculate the new centroid of each cluster by taking the mean of all data points in the
cluster.
o Move the centroid to this new mean position.
4. Repeat:
o Repeat the Assignment and Update steps until the centroids no longer change
significantly (convergence).
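Here is a minimal NumPy sketch of that loop. The function name kmeans and its parameters are illustrative choices, not anything from the slides, and for simplicity it assumes no cluster ever ends up empty:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """One run of K-means on the rows of X (shape m x d)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iters):
        # 2. Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update step: move each centroid to the mean of its points.
        # (This sketch assumes no cluster becomes empty.)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 4. Repeat until the centroids stop moving (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```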
Simple Calculation Example
Let's say you have the following 2D points (candies):
(1, 2)
(2, 3)
(3, 4)
(8, 8)
(9, 9)
(10, 10)
1. Initialize Centroids:
o Let's randomly pick (1, 2) and (9, 9) as the initial centroids.
2. Assignment Step:
o Calculate the distance from each point to the centroids:
(1, 2) to (1, 2) = 0
(1, 2) to (9, 9) = 10.63
(2, 3) to (1, 2) = 1.41
(2, 3) to (9, 9) = 9.22
(3, 4) to (1, 2) = 2.83
(3, 4) to (9, 9) = 7.81
(8, 8) to (1, 2) = 9.22
(8, 8) to (9, 9) = 1.41
(9, 9) to (1, 2) = 10.63
(9, 9) to (9, 9) = 0
(10, 10) to (1, 2) = 12.04
(10, 10) to (9, 9) = 1.41
o Assign points to the nearest centroid:
Group 1: (1, 2), (2, 3), (3, 4)
Group 2: (8, 8), (9, 9), (10, 10)
3. Update Step:
o Calculate new centroids:
Group 1: Mean of (1, 2), (2, 3), (3, 4) = (2, 3)
Group 2: Mean of (8, 8), (9, 9), (10, 10) = (9, 9)
4. Repeat:
o Assign points again:
Group 1: (1, 2), (2, 3), (3, 4)
Group 2: (8, 8), (9, 9), (10, 10)
o New centroids remain the same, so we stop.
Final Clusters
Cluster 1: (1, 2), (2, 3), (3, 4)
Cluster 2: (8, 8), (9, 9), (10, 10)
That's K-means clustering! You started with random centroids, assigned points to the nearest ones,
updated the centroids, and repeated the process until the centroids stabilized.
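As a sanity check, here is a standalone snippet (assuming NumPy) that replays this exact example with the hand-picked initial centroids (1, 2) and (9, 9) instead of a random choice:

```python
import numpy as np

# The six example points from above.
X = np.array([[1, 2], [2, 3], [3, 4], [8, 8], [9, 9], [10, 10]], dtype=float)

# Hand-pick (1, 2) and (9, 9) as the initial centroids, as in the example.
centroids = np.array([[1.0, 2.0], [9.0, 9.0]])
for _ in range(100):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # [[2. 3.] [9. 9.]] -- the final centroids found above
print(labels)     # [0 0 0 1 1 1]    -- Cluster 1 vs. Cluster 2
```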
The optimization objective of K-means clustering is to minimize the within-cluster sum of squares
(WCSS), also known as the sum of squared errors (SSE). This objective can be understood as trying
to make the points within each cluster as close to each other as possible, which means minimizing the
distance between the points and their respective centroids.
Mathematical Formulation
For K clusters with centroids $\mu_1, \dots, \mu_K$, where $C_k$ is the set of points assigned to cluster $k$, the objective is

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

Each inner term is the squared distance from a point to the centroid of its cluster; summing over all points and all clusters gives the total within-cluster scatter that K-means tries to minimize.
Explanation
Let's go through a simple example with 2 clusters and a few data points.
Data Points:
(1, 2)
(2, 3)
(3, 4)
(8, 8)
(9, 9)
(10, 10)
Initial Centroids:
Centroid 1: (1, 2)
Centroid 2: (9, 9)
Assignment Step:
As before, (1, 2), (2, 3), (3, 4) are assigned to Centroid 1 and (8, 8), (9, 9), (10, 10) to Centroid 2. The squared distances to the initial centroids give:
Cluster 1 (centroid (1, 2)): 0 + 2 + 8 = 10
Cluster 2 (centroid (9, 9)): 2 + 0 + 2 = 4
Initial WCSS = 10 + 4 = 14
Centroid Calculation:
New centroid for Cluster 1: Mean of (1, 2), (2, 3), (3, 4) = (2, 3)
New centroid for Cluster 2: Mean of (8, 8), (9, 9), (10, 10) = (9, 9)
Total WCSS
With the updated centroids (2, 3) and (9, 9), the squared distances become:
Cluster 1: 2 + 0 + 2 = 4
Cluster 2: 2 + 0 + 2 = 4
Total WCSS = 4 + 4 = 8, down from 14 with the initial centroids.
The goal of K-means is to minimize this total WCSS value. During each iteration, the algorithm
updates the centroids and reassigns points to clusters in a way that gradually reduces the WCSS
until it cannot be reduced any further (convergence).
By minimizing WCSS, K-means ensures that the points within each cluster are as close to each
other (and their centroid) as possible, leading to more compact and well-defined clusters.
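To make this concrete, here is a small sketch (assuming NumPy; the function name wcss is an illustrative choice) that reproduces the WCSS values from the example above:

```python
import numpy as np

def wcss(X, centroids, labels):
    """Within-cluster sum of squares: the squared distance from every
    point to the centroid of the cluster it is assigned to."""
    return sum(((X[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))

X = np.array([[1, 2], [2, 3], [3, 4], [8, 8], [9, 9], [10, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

# Initial centroids (1, 2) and (9, 9): WCSS = 10 + 4 = 14
print(wcss(X, np.array([[1.0, 2.0], [9.0, 9.0]]), labels))  # 14.0
# Updated centroids (2, 3) and (9, 9): WCSS = 4 + 4 = 8
print(wcss(X, np.array([[2.0, 3.0], [9.0, 9.0]]), labels))  # 8.0
```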
Finally, this is how the algorithm behaves in practice: it repeats these same steps, unchanged, at every iteration until convergence.
Random Initialization of Cluster Centroids
We must have K < m (the number of training examples).
A simple approach is to pick K training examples at random and use them as the initial centroids.
Figure 1, where the chosen centroids are well spread out, is a good initialization, but the selection in
Figure 2 is poor. This means that K-means can converge to different solutions depending on the
random selection, so it is important to try multiple random initializations:
“If K (the number of clusters) is greater than 10, then multiple
random initialization is not worth it, because it will give the same
value most of the time. But if we have fewer than 10 clusters, it is
better to use multiple random initializations and, at the end, select
the one with the least amount of distortion (cost function).”
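A sketch of that recipe, reusing the kmeans() and wcss() functions from the earlier snippets (the name kmeans_multi_init and the choice of 50 restarts are illustrative assumptions, not values from the notes):

```python
import numpy as np

def kmeans_multi_init(X, K, n_init=50):
    """Run K-means n_init times with different random initializations
    and keep the run with the lowest distortion (WCSS)."""
    best_cost, best_centroids, best_labels = np.inf, None, None
    for seed in range(n_init):
        centroids, labels = kmeans(X, K, seed=seed)
        cost = wcss(X, centroids, labels)
        if cost < best_cost:
            best_cost, best_centroids, best_labels = cost, centroids, labels
    return best_centroids, best_labels, best_cost
```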