DMDW Lab8
K-Means clustering is an unsupervised learning algorithm used to partition data into k distinct
clusters based on similarity. The algorithm works by first randomly selecting k initial centroids,
then iteratively assigning each data point to the nearest centroid, and recalculating the centroids
as the mean of all points in each cluster. This process is repeated until the centroids no longer
change significantly, signaling convergence. The goal is to minimize the sum of squared
distances between data points and their assigned centroids, which is known as the objective
function or inertia.
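The assignment and update steps described above can be sketched on a tiny example. This is only an illustration, not the lab's implementation: the six data points, the choice of starting centroids, and the helper names assign, update and inertia are all made up for the sketch. It runs a single iteration and shows that recomputing the centroids lowers the inertia.

```python
import numpy as np

# Toy data (made up for illustration): two well-separated groups in 2-D
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

# Hypothetical starting centroids: simply the first and the fourth point
centroids = X[[0, 3]].copy()

def assign(X, centroids):
    """Return the index of the nearest centroid for every point."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def update(X, labels, k):
    """Return new centroids as the mean of the points in each cluster.

    Assumes no cluster is empty, which holds for this toy data.
    """
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

labels = assign(X, centroids)            # assignment step
before = inertia(X, labels, centroids)   # inertia with the initial centroids
centroids = update(X, labels, k=2)       # update step
after = inertia(X, labels, centroids)    # inertia after moving the centroids

print("labels:", labels)
print("inertia before update:", before)
print("inertia after update:", after)
```

With these starting points the inertia drops from 4.0 to about 2.67 after one update. Repeating the two steps until the labels stop changing gives the full algorithm.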
One of the challenges in K-Means is determining the optimal number of clusters, k. Methods like
the Elbow Method, Silhouette Score, and Gap Statistic help identify a good value for k by
assessing how well the data is grouped. While K-Means is computationally efficient and widely
used, it has limitations: sensitivity to the initial choice of centroids, sensitivity to outliers,
and difficulty handling non-spherical clusters. Despite these
drawbacks, K-Means remains a powerful tool for clustering in fields like customer segmentation,
document clustering, and image compression.
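As a rough sketch of how the Elbow Method and the Silhouette Score might be applied, the snippet below fits K-Means for several values of k on the standardized Iris data and records both measures. It uses scikit-learn's own KMeans (imported under the alias SKKMeans, an arbitrary name chosen here to avoid clashing with a hand-written class) rather than the lab's implementation. The range of k from 2 to 6 and the fixed random_state are assumptions made for the example.

```python
from sklearn.cluster import KMeans as SKKMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize the four Iris features so each contributes equally to distances
X_scaled = StandardScaler().fit_transform(load_iris().data)

inertias = {}     # k -> sum of squared distances (used for the elbow plot)
silhouettes = {}  # k -> mean silhouette coefficient (higher is better)

for k in range(2, 7):
    # n_init=10 restarts from 10 random initializations and keeps the best run
    model = SKKMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# The silhouette criterion picks the k with the highest mean score
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

Inertia always falls as k grows, so the Elbow Method looks for the point where the decrease flattens out instead of the minimum. The silhouette score can be maximized directly, and the two criteria do not always agree.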
2. Implement the K-Means clustering algorithm using Python.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X = iris.data
y = iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
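class KMeans:
    def __init__(self, n_clusters=3, max_iters=100):
        self.n_clusters = n_clusters
        self.max_iters = max_iters

    def fit(self, X):
        # Assumed initialization (missing from the listing): pick k distinct
        # data points at random as the starting centroids.
        idx = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[idx]
        for _ in range(self.max_iters):
            self.labels = self._assign_clusters(X)
            new_centroids = self._calculate_centroids(X)
            # Converged once the centroids stop moving (within float tolerance)
            if np.allclose(new_centroids, self.centroids):
                break
            self.centroids = new_centroids
        return self

    def _assign_clusters(self, X):
        # Assumed helper body: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None] - self.centroids[None], axis=2)
        return dists.argmin(axis=1)

    def _calculate_centroids(self, X):
        # Assumed helper body: mean of each cluster's points; an empty
        # cluster keeps its previous centroid
        return np.array([X[self.labels == j].mean(axis=0)
                         if np.any(self.labels == j) else self.centroids[j]
                         for j in range(self.n_clusters)])
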
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_scaled)