UNIT-6 K-Means Clustering
Clustering is like sorting a bunch of similar items into different groups based on their
characteristics. In data mining and machine learning, it's a powerful technique used to group
similar data points together, making it easier to find patterns or understand large datasets.
Essentially, clustering helps identify natural groupings in your data. Some common
types of clustering methods are:
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into different
sets based on their degree of similarity.
Hierarchical clustering
  Agglomerative clustering
  Divisive clustering
Partitioning clustering
  K-Means clustering
Density-based clustering
What is K-Means Clustering?
K-Means clustering is a way of grouping data based on how similar or close the data points are
to each other. Imagine you have a bunch of points, and you want to group them into clusters.
The algorithm works by first randomly picking some central points (called centroids) and then
assigning every data point to the nearest centroid.
Once that's done, it recalculates the centroids based on the new groupings and repeats the
process until the clusters make sense. It's a pretty fast and efficient method, but it works
best when the clusters are distinct and not too mixed up. One challenge, though, is figuring
out the right number of clusters (K) beforehand. Plus, if there's a lot of noise or overlap in the
data, K-Means might not perform as well.
K-Means clustering primarily aims to organize similar data points into distinct groups. Here’s a
look at its key objectives:
K-Means is designed to cluster data points that share common traits, allowing patterns or
trends to emerge. Whether analyzing customer behavior or images, the method helps reveal
hidden relationships within your dataset.
Another objective is to keep data points in each group as close to the cluster's centroid as
possible. Reducing this internal distance results in compact, cohesive clusters, enhancing
the accuracy of your results.
K-Means also aims to maintain clear separation between different clusters. By maximizing the
distance between groups, the algorithm ensures that each cluster remains distinct, providing
a better understanding of data categories without overlap.
Properties of K-Means Clustering
Now, let’s look at the key properties that make K-means clustering algorithm effective in creating
meaningful groups:
One of the main things K-Means aims for is that all the data points in a cluster should be pretty
similar to each other. Imagine a bank that wants to group its customers based on income
and debt. If customers within the same cluster have vastly different situations, then a
one-size-fits-all approach to offers might not work. For example, a customer with high income and
high debt might have different needs compared to someone with low income and low debt.
By making sure the customers in each cluster are similar, the bank can create more tailored
and effective strategies.
Another important aspect is that the clusters themselves should be as distinct from each
other as possible. Going back to our bank example, if one cluster consists of high-income,
high-debt customers and another cluster has high-income, low-debt customers, the
differences between the clusters are clear. This separation helps the bank create different
strategies for each group. If the clusters are too similar, it can be challenging to treat them as
separate segments, which can make targeted marketing less effective.
Distance Measures
At the heart of K-Means clustering is the concept of distance. Euclidean distance, for example, is
a simple straight-line measurement between points and is commonly used in many
applications. Manhattan distance, however, follows a grid-like path, much like how you'd
navigate city streets. Squared Euclidean distance makes calculations easier by squaring the
values, while cosine distance is handy when working with text data because it measures the
angle between data vectors. Picking the right distance measure really depends on what kind of
problem you’re solving and the nature of your data.
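Here is a small NumPy sketch (my own illustration, not from the original notes; the function names are made up) of the four distance measures just described:
import numpy as np

def euclidean(a, b):
    # straight-line distance between two points
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # grid-like "city block" distance
    return np.sum(np.abs(a - b))

def squared_euclidean(a, b):
    # Euclidean distance without the square root
    return np.sum((a - b) ** 2)

def cosine_distance(a, b):
    # one minus the cosine of the angle between the vectors
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(euclidean(a, b))          # 5.0
print(manhattan(a, b))          # 7.0
print(squared_euclidean(a, b))  # 25.0
print(cosine_distance(a, b))    # about 0.008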
Customer Segmentation
One of the most popular uses of K-means clustering is for customer segmentation. From banks
to e-commerce, businesses use K-means clustering customer segmentation to group
customers based on their behaviors. For example, in telecom or sports industries, companies
can create targeted marketing campaigns by understanding different customer segments
better. This allows for personalized offers and communications, boosting customer
engagement and satisfaction.
Document Clustering
When dealing with a vast collection of documents, K-Means can be a lifesaver. It groups similar
documents together based on their content, which makes it easier to manage and retrieve
relevant information. For instance, if you have thousands of research papers, clustering can
quickly help you find related studies, improving both organization and efficiency in accessing
valuable information.
Image Segmentation
In image processing, K-Means clustering is commonly used to group pixels with similar colors,
which divides the image into distinct regions. This is incredibly helpful for tasks like object
detection and image enhancement. For instance, clustering can help separate objects within
an image, making analysis and processing more accurate. It’s also widely used to extract
meaningful features from images in various visual tasks.
Recommendation Engines
K-Means clustering also plays a vital role in recommendation systems. Say you want to suggest
new songs to a listener based on their past preferences; clustering can group similar songs
together, helping the system provide personalized suggestions. By clustering content that
shares similar features, recommendation engines can deliver a more tailored experience,
helping users discover new songs that match their taste.
Image Compression
K-Means can even help with image compression by reducing the number of colors in an image
while keeping the visual quality intact. K-Means reduces the image size without losing much
detail by clustering similar colors and replacing the pixels with the average of their cluster. It's a
practical method for compressing images for more accessible storage and transmission, all
while maintaining visual clarity.
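A rough sketch of this idea with scikit-learn (my own example; the file name sample.jpg is only a placeholder) might look like this:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# load the image as a (height, width, 3) array of RGB values
image = plt.imread('sample.jpg') / 255.0   # placeholder file name
pixels = image.reshape(-1, 3)              # one row per pixel

# cluster the pixel colors into 16 groups
kmeans = KMeans(n_clusters=16, random_state=42).fit(pixels)

# replace each pixel with the average color (centroid) of its cluster
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
plt.imshow(compressed)
plt.show()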
Advantages of K-means
1. Simple and easy to implement: The k-means algorithm is easy to understand and
implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets
with high dimensionality.
3. Scalability: K-means can handle large datasets with many data points and can be easily
scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used
with varying metrics of distance and initialization methods.
Disadvantages of K-Means
1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can
converge to a suboptimal solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be specified
before running the algorithm, which can be challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on
the resulting clusters.
Inertia
It calculates the sum of squared distances from each point to the cluster's center (or
centroid). Think of it as measuring how snugly the points are huddled together. Lower inertia
means that points are closer to the centroid and to each other, which generally indicates that
your clusters are well-formed. For most numeric data, you'll use Euclidean distance, but if your
data includes categorical features, Manhattan distance might be better.
It tells us how far apart the points within a cluster are. So, inertia actually calculates the sum of
distances of all the points within a cluster from the centroid of that cluster. Normally, we use
Euclidean distance as the distance metric, as long as most of the features are numeric.
We calculate this for all the clusters; the final inertia value is the sum of all these distances.
This distance within the clusters is known as the intracluster distance. So, inertia gives us the
total intracluster distance, a measure of how compact each cluster is.
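As a quick illustration (my own sketch, not part of the original notes; the tiny dataset X is made up), inertia can be computed by hand and checked against scikit-learn's inertia_ attribute:
import numpy as np
from sklearn.cluster import KMeans

# a tiny made-up dataset with two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# sum of squared distances from each point to its assigned centroid
manual_inertia = sum(
    np.sum((point - kmeans.cluster_centers_[label]) ** 2)
    for point, label in zip(X, kmeans.labels_)
)
print(manual_inertia, kmeans.inertia_)   # the two values should match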
How Does the K-Means Algorithm Work?
1. Initialization: Start by randomly selecting K points from the dataset. These points will
serve as the initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that
point and each of the K centroids. Assign the data point to the cluster whose centroid is
closest to it.
3. Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the
centroids no longer change significantly or a maximum number of iterations is reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster
assignments and centroids.
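A minimal NumPy sketch of these steps (my own illustration, not production code, and assuming no cluster ever ends up empty) could look like this:
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: label each point with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 5. Final result: cluster labels and centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)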
We have a data set for a grocery shop, and we want to find out how many clusters this has
to be spread across. To find the optimum number of clusters, we break it down into the
following steps:
Step 1:
The Elbow method is the best way to find the number of clusters. The elbow method
constitutes running K-Means clustering on the dataset for a range of values of K.
The WSS (within-cluster sum of squares) is measured for each value of K. The value of K at the
"elbow", beyond which WSS stops decreasing sharply, is taken as the optimum value.
You can see that there is a very gradual change in the value of WSS as the K value increases from
2.
So, you can take the elbow point value as the optimal value of K. It should be either two, three,
or at most four. Beyond that, increasing the number of clusters does not dramatically
change the WSS; it stabilizes.
Step 2:
We can randomly initialize two points, called the cluster centroids.
Step 3:
Now the distance of each location from the centroid is measured, and each data point is
assigned to the centroid, which is closest to it.
Step 4:
Compute the actual centroid of the data points for the first group.
Step 5:
Reposition the centroid to the actual centroid computed for the first group.
Step 6:
Compute the actual centroid of data points for the second group.
Step 7:
Once the clusters become static, the k-means algorithm is said to have converged. The final
clusters are the output of the algorithm.
Let's say we have x1, x2, x3, ..., x(n) as our inputs, and we want to split this into K clusters. The
steps to form the clusters are:
Step 1: Choose K random points as cluster centers, called centroids.
Step 2: Assign each x(i) to the closest cluster by computing its Euclidean distance to each
centroid.
Step 3: Identify new centroids by taking the average of the points assigned to each cluster.
Step 4: Repeat steps 2 and 3 until the cluster assignments no longer change.
Here’s how to use Python to implement the K-Means Clustering Algorithm. These are the
steps you need to take:
1. Data Pre-Processing. Import the libraries and the dataset, and extract the independent variables.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset and extracting the two feature columns
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values
2. Find the optimal number of clusters using the elbow method. Here's the code you use:
# computing WCSS for k = 1 to 10
from sklearn.cluster import KMeans
wcss_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()
3. Train the K-means algorithm on the training dataset. Use the same two lines of code
used in the previous section. However, instead of using i, use 5, because there are
5 clusters that need to be formed. Here’s the code:
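A reasonable reconstruction of that code (my own sketch, reusing the KMeans setup from the elbow-method step with n_clusters=5) would be:
# training the K-means model on the dataset with 5 clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)   # cluster label for each customer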