Pilot
K-Means Clustering
K-Means is one of the most popular unsupervised machine learning algorithms used for
clustering data points into a predefined number of clusters. The goal of K-Means is to
partition a dataset into K clusters in which each data point belongs to the cluster with the
nearest mean (centroid). It is widely used in data mining, pattern recognition, and customer
segmentation, among other tasks.
3. Recalculate centroids:
o Once all the points have been assigned to clusters, the centroids are updated.
The new centroid of each cluster is the mean of all points assigned to that
cluster:
$$c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
where $c_k$ is the centroid of cluster $k$, $C_k$ is the set of points in cluster $k$, and
$x_i$ are the points in the cluster.
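The centroid update can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `centroid` and the tuple-based point representation are assumptions made for the example.

```python
def centroid(points):
    """Return the component-wise mean of a non-empty list of equal-length tuples."""
    if not points:
        raise ValueError("cannot compute the centroid of an empty cluster")
    n = len(points)
    # zip(*points) groups the first coordinates together, then the second, and so on.
    return tuple(sum(coords) / n for coords in zip(*points))


cluster = [(8, 8), (9, 10)]
print(centroid(cluster))  # (8.5, 9.0)
```

An empty cluster has no mean, so the sketch raises an error in that case; real implementations typically re-seed or drop such a cluster instead.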
Key Points:
Example:
Let’s walk through a simple example with 2D data and K = 2 clusters.
Compute the Euclidean distance of each point from both centroids and assign each
point to the nearest centroid.
o Point (1, 2) is closer to Centroid 1: Assign to Cluster 1.
o Point (3, 3) is closer to Centroid 1: Assign to Cluster 1.
o Point (6, 5) is closer to Centroid 1: Assign to Cluster 1.
o Point (8, 8) is closer to Centroid 2: Assign to Cluster 2.
o Point (9, 10) is closer to Centroid 2: Assign to Cluster 2.
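The assignment step can be reproduced with a short Python sketch. The starting centroids are not stated in this section, so the values in `initial_centroids` are an assumption chosen only because they yield the assignments listed above; the helper name `nearest` is also hypothetical.

```python
import math

points = [(1, 2), (3, 3), (6, 5), (8, 8), (9, 10)]

# ASSUMPTION: the text does not give the initial centroids; these are illustrative.
initial_centroids = [(3, 3), (9, 10)]


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


assignments = [nearest(p, initial_centroids) for p in points]
for p, k in zip(points, assignments):
    print(p, "-> Cluster", k + 1)
```

With other starting centroids the point (6, 5) may land in Cluster 2 or be equidistant from both, which is one way to see why the initial choice matters.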
Compute the new centroids based on the points assigned to each cluster:
o New Centroid 1: Mean of points (1, 2), (3, 3), (6, 5) → (3.33, 3.33)
o New Centroid 2: Mean of points (8, 8), (9, 10) → (8.5, 9)
Repeat the process of assigning points to the nearest centroid and recalculating
centroids until the centroids stabilize.
After a few iterations, the centroids will no longer move, and the algorithm will converge to
the final clusters.
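The full assign-and-update loop for this example might look like the sketch below. It assumes the same illustrative starting centroids, (3, 3) and (9, 10), which are not given in the text, and it keeps an empty cluster's old centroid unchanged, which is one of several possible conventions. The function name `kmeans` and its parameters are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    clusters = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[k].append(p)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            return centroids, clusters, iteration
        centroids = new_centroids
    return centroids, clusters, max_iter


points = [(1, 2), (3, 3), (6, 5), (8, 8), (9, 10)]
# ASSUMPTION: illustrative starting centroids, not taken from the text.
final_centroids, final_clusters, n_iter = kmeans(points, [(3, 3), (9, 10)])
print([tuple(round(v, 2) for v in c) for c in final_centroids])
print("iterations:", n_iter)
```

Under these assumptions the second pass produces the same assignments as the first, so the loop stops there with the centroids computed in the example.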
Advantages of K-Means:
Disadvantages of K-Means:
Requires K: The number of clusters must be specified beforehand, which is not
always easy to determine.
Sensitivity to Initial Centroids: Random initialization of centroids can lead to
different final results. This is often addressed by running the algorithm multiple times
and keeping the best result, or by using K-Means++ initialization, which spreads the
initial centroids apart.
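K-Means++ seeding can be sketched as follows: the first centroid is drawn uniformly at random, and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and the fixed seed are assumptions for the illustration, and the sketch assumes the data contains at least k distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, rng):
    """Pick k starting centroids using K-Means++ style seeding."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid.
        # Points already chosen get weight 0, so they are not drawn again.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


points = [(1, 2), (3, 3), (6, 5), (8, 8), (9, 10)]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
print(kmeans_pp_init(points, 2, rng))
```

Seeding this way does not remove the randomness, so it is commonly combined with several restarts.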
Assumes spherical clusters: K-Means works best when clusters are roughly
spherical and of similar size, which may not be true for all datasets.
Applications: