UNIT-6 K Means Clustering

Clustering is an unsupervised learning technique that groups similar data points based on their characteristics, with common methods including hierarchical, partitioning, and density-based clustering. K-means clustering organizes data into distinct groups by iteratively assigning points to the nearest centroid and recalculating centroids until convergence, minimizing within-cluster distance and maximizing between-cluster distance. Its applications range from customer segmentation and document clustering to image processing and recommendation systems, though it has limitations such as sensitivity to initial centroids and the need to specify the number of clusters in advance.

What is Clustering

Clustering is like sorting a bunch of similar items into different groups based on their
characteristics. In data mining and machine learning, it's a powerful technique used to group
similar data points together, making it easier to find patterns or understand large datasets.
Essentially, clustering helps identify natural groupings in your data. There are a few common
types of clustering methods:

Types of Clustering

Clustering is a type of unsupervised learning wherein data points are grouped into different
sets based on their degree of similarity.

The various types of clustering are:

Hierarchical clustering
Partitioning clustering
Density based clustering

Hierarchical clustering is further subdivided into:

Agglomerative clustering
Divisive clustering

Partitioning:

K-Means clustering

What is K-Means Clustering?

K-means clustering is a way of grouping data based on how similar or close the data points are
to each other. Imagine you have a bunch of points, and you want to group them into clusters.
The algorithm works by first randomly picking some central points (called centroids) and then
assigning every data point to the nearest centroid.

Once that's done, it recalculates the centroids based on the new groupings and repeats the
process until the clusters stabilize. It's a fast and efficient method, but it works
best when the clusters are distinct and not too mixed up. One challenge, though, is figuring
out the right number of clusters (K) beforehand. Plus, if there's a lot of noise or overlap in the
data, K-means might not perform as well.

Objective of K-Means Clustering

K-Means clustering primarily aims to organize similar data points into distinct groups. Here’s a
look at its key objectives:

Grouping Similar Data Points

K-Means is designed to cluster data points that share common traits, allowing patterns or
trends to emerge. Whether analyzing customer behavior or images, the method helps reveal
hidden relationships within your dataset.

Minimizing Within-Cluster Distance

Another objective is to keep data points in each group as close to the cluster's centroid as
possible. Reducing this internal distance results in compact, cohesive clusters, enhancing
the accuracy of your results.

Maximizing Between-Cluster Distance

K-Means also aims to maintain clear separation between different clusters. By maximizing the
distance between groups, the algorithm ensures that each cluster remains distinct, providing
a better understanding of data categories without overlap.

Properties of K-Means Clustering

Now, let’s look at the key properties that make K-means clustering algorithm effective in creating
meaningful groups:

Similarity Within a Cluster

One of the main things K-means aims for is that all the data points in a cluster should be pretty
similar to each other. Imagine a bank that wants to group its customers based on income
and debt. If customers within the same cluster have vastly different situations, then a
one-size-fits-all approach to offers might not work. For example, a customer with high income and
high debt might have different needs compared to someone with low income and low debt.
By making sure the customers in each cluster are similar, the bank can create more tailored
and effective strategies.

Differences Between Clusters

Another important aspect is that the clusters themselves should be as distinct from each
other as possible. Going back to our bank example, if one cluster consists of high-income,
high-debt customers and another cluster has high-income, low-debt customers, the
differences between the clusters are clear. This separation helps the bank create different
strategies for each group. If the clusters are too similar, it can be challenging to treat them as
separate segments, which can make targeted marketing less effective.

Distance Measures

At the heart of K-Means clustering is the concept of distance. Euclidean distance, for example, is
a simple straight-line measurement between points and is commonly used in many
applications. Manhattan distance, however, follows a grid-like path, much like how you'd
navigate city streets. Squared Euclidean distance skips the square-root step, which speeds up
computation, while cosine distance is handy when working with text data because it measures the
angle between data vectors. Picking the right distance measure really depends on what kind of
problem you’re solving and the nature of your data.
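To make these measures concrete, here is a minimal NumPy sketch (the vectors a and b are
made-up examples, not part of this unit):

# Illustrative only: computing the four distance measures mentioned above
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))           # straight-line distance
manhattan = np.sum(np.abs(a - b))                   # grid-like, city-block distance
squared_euclidean = np.sum((a - b) ** 2)            # Euclidean without the square root
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle between vectors

print(euclidean, manhattan, squared_euclidean, cosine)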

Applications of K-Means Clustering

Customer Segmentation

One of the most popular uses of K-means clustering is for customer segmentation. From banks
to e-commerce, businesses use K-means clustering for customer segmentation, grouping
customers based on their behaviors. For example, in telecom or sports industries, companies
can create targeted marketing campaigns by understanding different customer segments
better. This allows for personalized offers and communications, boosting customer
engagement and satisfaction.

Document Clustering

When dealing with a vast collection of documents, K-Means can be a lifesaver. It groups similar
documents together based on their content, which makes it easier to manage and retrieve
relevant information. For instance, if you have thousands of research papers, clustering can
quickly help you find related studies, improving both organization and efficiency in accessing
valuable information.

Image Segmentation

In image processing, K-Means clustering is commonly used to group pixels with similar colors,
which divides the image into distinct regions. This is incredibly helpful for tasks like object
detection and image enhancement. For instance, clustering can help separate objects within
an image, making analysis and processing more accurate. It’s also widely used to extract
meaningful features from images in various visual tasks.

Recommendation Engines

K-Means clustering also plays a vital role in recommendation systems. Say you want to suggest
new songs to a listener based on their past preferences; clustering can group similar songs
together, helping the system provide personalized suggestions. By clustering content that
shares similar features, recommendation engines can deliver a more tailored experience,
helping users discover new songs that match their taste.

K-Means for Image Compression

K-Means can even help with image compression by reducing the number of colors in an image
while keeping the visual quality intact. K-Means reduces the image size without losing much
detail by clustering similar colors and replacing the pixels with the average of their cluster. It’s a
practical method for compressing images for more accessible storage and transmission, all
while maintaining visual clarity.
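As a rough illustration, the sketch below quantizes an image to 16 colors with scikit-learn and
Pillow (the file name photo.jpg and the choice of 16 colors are assumptions for the example):

# Illustrative color quantization with K-Means (not from the original text)
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open('photo.jpg').convert('RGB'), dtype=np.float64) / 255.0
h, w, _ = img.shape
pixels = img.reshape(-1, 3)                          # one row per pixel

kmeans = KMeans(n_clusters=16, n_init=10, random_state=42).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_]  # replace each pixel with its cluster's average color
compressed = (compressed.reshape(h, w, 3) * 255).astype(np.uint8)

Image.fromarray(compressed).save('photo_16_colors.png')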

Advantages of K-means

1. Simple and easy to implement: The k-means algorithm is easy to understand and
implement, making it a popular choice for clustering tasks.

2. Fast and efficient: K-means is computationally efficient and can handle large datasets
with high dimensionality.

3. Scalability: K-means can handle large datasets with many data points and can be easily
scaled to handle even larger datasets.

4. Flexibility: K-means can be easily adapted to different applications and can be used
with varying metrics of distance and initialization methods.

Disadvantages of K-Means

1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can
converge to a suboptimal solution.

2. Requires specifying the number of clusters: The number of clusters k needs to be specified
before running the algorithm, which can be challenging in some applications.

3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on
the resulting clusters.

Inertia

It calculates the sum of squared distances from each point to the cluster's center (or
centroid). Think of it as measuring how snugly the points are huddled together. Lower inertia
means that points are closer to the centroid and to each other, which generally indicates that
your clusters are well-formed. For most numeric data, you'll use Euclidean distance, but if your
data includes categorical features, Manhattan distance might be better.

In other words, inertia is the sum of the distances of all the points within a cluster from the
centroid of that cluster. Normally, we use Euclidean distance as the distance metric, as long as
most of the features are numeric; otherwise, Manhattan distance is used when most of the
features are categorical.

We calculate this for all the clusters; the final inertia value is the sum of all these distances.
This distance within the clusters is known as the intracluster distance. So, inertia gives us the
sum of intracluster distances:
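Written out (in standard notation, filling in the expression the text refers to), for clusters
C_1, ..., C_K with centroids mu_1, ..., mu_K:

$$ \text{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 $$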

How Does K-Means Clustering Work?


The steps below show how k-means clustering works:

1. Initialization: Start by randomly selecting K points from the dataset. These points will
act as the initial cluster centroids.

2. Assignment: For each data point in the dataset, calculate the distance between that
point and each of the K centroids. Assign the data point to the cluster whose centroid is
closest to it. This step effectively forms K clusters.

3. Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.

4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the
centroids no longer change significantly or when a specified number of iterations is
reached.

5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster.
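As a rough, illustrative sketch of these five steps (the data X, the value of K, and the stopping
tolerance below are assumptions, not part of the original text):

# From-scratch K-Means following the five steps above (illustrative only)
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-4, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: Initialization - pick K random points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 2: Assignment - assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: Update centroids - mean of the points assigned to each cluster
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # Step 4: Repeat until the centroids stop moving significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # Step 5: Final result - centroids and the cluster assignment of each point
    return centroids, labels

# Example usage on made-up 2-D data
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, K=2)
print(centroids)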

Let's use a visualization example to understand this better.

We have a data set for a grocery shop, and we want to find out how many clusters this has to be
spread across. To find the optimum number of clusters, we break it down into the
following steps:

Step 1:

The Elbow method is the best way to find the number of clusters. The elbow method
involves running K-Means clustering on the dataset for a range of K values.

Next, we use the within-cluster sum of squares as a measure to find the optimum number of clusters
that can be formed for a given data set. The within-cluster sum of squares (WSS) is defined as the sum
of the squared distance between each member of the cluster and its centroid.

The WSS is measured for each value of K. The value of K at which the decrease in WSS starts to
level off (the "elbow" of the curve) is taken as the optimum value.

Now, we draw a curve between WSS and the number of clusters.


Here, WSS is on the y-axis and number of clusters on the x-axis.

You can see that there is a very gradual change in the value of WSS as the K value increases from
2.

So, you can take the elbow point value as the optimal value of K. It should be either two, three,
or at most four. Beyond that, increasing the number of clusters does not dramatically
change the value of WSS; it stabilizes.

Step 2:

Let's assume that these are our delivery points:

We can randomly initialize two points called the cluster centroids. Here,

C1 and C2 are the centroids assigned randomly.

Step 3:
Now the distance of each location from each centroid is measured, and each data point is
assigned to the centroid that is closest to it.

This is how the initial grouping is done:

Step 4:

Compute the actual centroid of data points for the first group.

Step 5:

Reposition the random centroid to the actual centroid.

Step 6:

Compute the actual centroid of data points for the second group.

Step 7:

Reposition the random centroid to the actual centroid.


Step 8:

Once the clusters become static, the k-means algorithm is said to have converged. The
final clusters with centroids C1 and C2 are as shown below:

K-Means Clustering Algorithm

Let's say we have x1, x2, x3, ..., x(n) as our inputs, and we want to split this into K clusters. The
steps to form clusters are:

Step 1: Choose K random points as cluster centers, called centroids.

Step 2: Assign each x(i) to the closest cluster by computing its Euclidean distance to each centroid.

Step 3: Identify new centroids by taking the average of the assigned points.

Step 4: Keep repeating steps 2 and 3 until convergence is achieved.
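To make the steps concrete, here is a small made-up one-dimensional example (the numbers are
illustrative, not from the original text). Take the points 2, 4, 10, 12 with K = 2 and initial
centroids c1 = 2 and c2 = 4:

Assignment: 2 goes to c1; 4, 10 and 12 are closer to c2, so the clusters are {2} and {4, 10, 12}.
Update: c1 stays at 2, while c2 becomes (4 + 10 + 12) / 3 ≈ 8.67.
Reassignment: now 2 and 4 are closer to c1, while 10 and 12 are closer to c2, giving {2, 4} and {10, 12}.
Update: c1 = (2 + 4) / 2 = 3 and c2 = (10 + 12) / 2 = 11.
Another assignment pass leaves the clusters unchanged, so the algorithm has converged.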


Python Implementation of the K-Means Clustering Algorithm

Here’s how to use Python to implement the K-Means Clustering Algorithm. These are the
steps you need to take:

Data pre-processing

Finding the optimal number of clusters using the elbow method

Training the K-Means algorithm on the training data set

Visualizing the clusters

1. Data Pre-Processing. Import the libraries and dataset, and extract the independent variables.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values

2. Find the optimal number of clusters using the elbow method. Here’s the code you use:

# finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()

3. Train the K-means algorithm on the training dataset. Use the same two lines of code
used in the previous section. However, instead of using i, use 5, because there are
5 clusters that need to be formed. Here’s the code:
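A minimal sketch consistent with the instructions above (the variable name y_predict and the
axis labels are assumptions, since the original code block is not shown) is:

# Training the K-means model on the dataset with K = 5
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)   # cluster index assigned to each data point

# Visualizing the clusters (step 4 from the list above; illustrative only)
for i in range(5):
    mtp.scatter(x[y_predict == i, 0], x[y_predict == i, 1], label='Cluster ' + str(i + 1))
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='black', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()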
