UNIT-6 K-Means Clustering
Clustering is like sorting a bunch of similar items into different groups based on their
characteristics. In data mining and machine learning, it's a powerful technique used to group
similar data points together, making it easier to find patterns or understand large datasets.
Essentially, clustering helps identify natural groupings in your data. Some common
types of clustering methods are:
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into different
sets based on their degree of similarity.
Hierarchical clustering
  Agglomerative clustering
  Divisive clustering
Partitioning clustering
  K-Means clustering
Density-based clustering
What is K-Means Clustering?
K-Means clustering is a way of grouping data based on how similar or close the data points are
to each other. Imagine you have a bunch of points, and you want to group them into clusters.
The algorithm works by first randomly picking some central points (called centroids) and then
assigning every data point to the nearest centroid.
Once that's done, it recalculates the centroids based on the new groupings and repeats the
process until the clusters make sense. It's a pretty fast and efficient method, but it works
best when the clusters are distinct and not too mixed up. One challenge, though, is figuring
out the right number of clusters (K) beforehand. Plus, if there's a lot of noise or overlap in the
data, K-Means might not perform as well.
K-Means clustering primarily aims to organize similar data points into distinct groups. Here’s a
look at its key objectives:
K-Means is designed to cluster data points that share common traits, allowing patterns or
trends to emerge. Whether analyzing customer behavior or images, the method helps reveal
hidden relationships within your dataset.
Another objective is to keep data points in each group as close to the cluster's centroid as
possible. Reducing this internal distance results in compact, cohesive clusters, enhancing
the accuracy of your results.
K-Means also aims to maintain clear separation between different clusters. By maximizing the
distance between groups, the algorithm ensures that each cluster remains distinct, providing
a better understanding of data categories without overlap.
Properties of K-Means Clustering
Now, let’s look at the key properties that make K-means clustering algorithm effective in creating
meaningful groups:
One of the main things K-Means aims for is that all the data points in a cluster should be pretty
similar to each other. Imagine a bank that wants to group its customers based on income
and debt. If customers within the same cluster have vastly different situations, then a
one-size-fits-all approach to offers might not work. For example, a customer with high income and
high debt might have different needs compared to someone with low income and low debt.
By making sure the customers in each cluster are similar, the bank can create more tailored
and effective strategies.
Another important aspect is that the clusters themselves should be as distinct from each
other as possible. Going back to our bank example, if one cluster consists of high-income,
high-debt customers and another cluster has high-income, low-debt customers, the
differences between the clusters are clear. This separation helps the bank create different
strategies for each group. If the clusters are too similar, it can be challenging to treat them as
separate segments, which can make targeted marketing less effective.
Distance Measures
At the heart of K-Means clustering is the concept of distance. Euclidean distance, for example, is
a simple straight-line measurement between points and is commonly used in many
applications. Manhattan distance, however, follows a grid-like path, much like how you'd
navigate city streets. Squared Euclidean distance makes calculations easier by squaring the
values, while cosine distance is handy when working with text data because it measures the
angle between data vectors. Picking the right distance measure really depends on what kind of
problem you’re solving and the nature of your data.
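Here is a small NumPy sketch (my own illustration, not from the original notes; the function names are made up) of the four distance measures just described:
import numpy as np

def euclidean(a, b):
    # straight-line distance between two points
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # grid-like "city block" distance
    return np.sum(np.abs(a - b))

def squared_euclidean(a, b):
    # Euclidean distance without the square root
    return np.sum((a - b) ** 2)

def cosine_distance(a, b):
    # one minus the cosine of the angle between the vectors
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(euclidean(a, b))          # 5.0
print(manhattan(a, b))          # 7.0
print(squared_euclidean(a, b))  # 25.0
print(cosine_distance(a, b))    # about 0.008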
Customer Segmentation
One of the most popular uses of K-means clustering is for customer segmentation. From banks
to e-commerce, businesses use K-means clustering customer segmentation to group
customers based on their behaviors. For example, in telecom or sports industries, companies
can create targeted marketing campaigns by understanding different customer segments
better. This allows for personalized offers and communications, boosting customer
engagement and satisfaction.
Document Clustering
When dealing with a vast collection of documents, K-Means can be a lifesaver. It groups similar
documents together based on their content, which makes it easier to manage and retrieve
relevant information. For instance, if you have thousands of research papers, clustering can
quickly help you find related studies, improving both organization and efficiency in accessing
valuable information.
Image Segmentation
In image processing, K-Means clustering is commonly used to group pixels with similar colors,
which divides the image into distinct regions. This is incredibly helpful for tasks like object
detection and image enhancement. For instance, clustering can help separate objects within
an image, making analysis and processing more accurate. It’s also widely used to extract
meaningful features from images in various visual tasks.
Recommendation Engines
K-Means clustering also plays a vital role in recommendation systems. Say you want to suggest
new songs to a listener based on their past preferences; clustering can group similar songs
together, helping the system provide personalized suggestions. By clustering content that
shares similar features, recommendation engines can deliver a more tailored experience,
helping users discover new songs that match their taste.
Image Compression
K-Means can even help with image compression by reducing the number of colors in an image
while keeping the visual quality intact. K-Means reduces the image size without losing much
detail by clustering similar colors and replacing the pixels with the average of their cluster. It's a
practical method for compressing images for more accessible storage and transmission, all
while maintaining visual clarity.
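A rough sketch of this idea with scikit-learn (my own example; the file name sample.jpg is only a placeholder) might look like this:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# load the image as a (height, width, 3) array of RGB values
image = plt.imread('sample.jpg') / 255.0   # placeholder file name
pixels = image.reshape(-1, 3)              # one row per pixel

# cluster the pixel colors into 16 groups
kmeans = KMeans(n_clusters=16, random_state=42).fit(pixels)

# replace each pixel with the average color (centroid) of its cluster
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
plt.imshow(compressed)
plt.show()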
Advantages of K-means
1. Simple and easy to implement: The k-means algorithm is easy to understand and
implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets
with high dimensionality.
3. Scalability: K-means can handle large datasets with many data points and can be easily
scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used
with varying metrics of distance and initialization methods.
Disadvantages of K-Means
1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can
converge to a suboptimal solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be specified
before running the algorithm, which can be challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on
the resulting clusters.
Inertia
It calculates the sum of squared distances from each point to the cluster's center (or
centroid). Think of it as measuring how snugly the points are huddled together. Lower inertia
means that points are closer to the centroid and to each other, which generally indicates that
your clusters are well-formed. For most numeric data, you'll use Euclidean distance, but if your
data includes categorical features, Manhattan distance might be better.
It tells us how far apart the points within a cluster are. So, inertia actually calculates the sum of
distances of all the points within a cluster from the centroid of that cluster. Normally, we use
Euclidean distance as the distance metric, as long as most of the features are numeric.
We calculate this for all the clusters; the final inertia value is the sum of all these distances.
This distance within the clusters is known as the intracluster distance. So, inertia gives us the
total intracluster distance, a measure of how compact each cluster is.
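As a quick illustration (my own sketch, not part of the original notes; the tiny dataset X is made up), inertia can be computed by hand and checked against scikit-learn's inertia_ attribute:
import numpy as np
from sklearn.cluster import KMeans

# a tiny made-up dataset with two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# sum of squared distances from each point to its assigned centroid
manual_inertia = sum(
    np.sum((point - kmeans.cluster_centers_[label]) ** 2)
    for point, label in zip(X, kmeans.labels_)
)
print(manual_inertia, kmeans.inertia_)   # the two values should match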
How Does the K-Means Algorithm Work?
1. Initialization: Start by randomly selecting K points from the dataset. These points will
serve as the initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that
point and each of the K centroids. Assign the data point to the cluster whose centroid is
closest to it.
3. Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the
centroids no longer change significantly or a maximum number of iterations is reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster
assignments and centroids.
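A minimal NumPy sketch of these steps (my own illustration, not production code, and assuming no cluster ever ends up empty) could look like this:
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: label each point with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 5. Final result: cluster labels and centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)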
We have a data set for a grocery shop, and we want to find out how many clusters this has
to be spread across. To find the optimum number of clusters, we break it down into the
following steps:
Step 1:
The Elbow method is the best way to find the number of clusters. The elbow method
constitutes running K-Means clustering on the dataset for a range of values of K.
The WSS (within-cluster sum of squares) is measured for each value of K. The value of K at the
"elbow", beyond which WSS stops decreasing sharply, is taken as the optimum value.
You can see that there is a very gradual change in the value of WSS as the K value increases from
2.
So, you can take the elbow point value as the optimal value of K. It should be either two, three,
or at most four. Beyond that, increasing the number of clusters does not dramatically
change the WSS; it stabilizes.
Step 2:
We can randomly initialize two points, called the cluster centroids.
Step 3:
Now the distance of each location from the centroid is measured, and each data point is
assigned to the centroid, which is closest to it.
Step 4:
Compute the actual centroid of the data points for the first group.
Step 5:
Reposition the centroid to the actual centroid computed for the first group.
Step 6:
Compute the actual centroid of data points for the second group.
Step 7:
Once the clusters become static, the k-means algorithm is said to have converged. The final
clusters are the output of the algorithm.
Let's say we have x1, x2, x3, ..., x(n) as our inputs, and we want to split this into K clusters. The
steps to form the clusters are:
Step 1: Choose K random points as cluster centers, called centroids.
Step 2: Assign each x(i) to the closest cluster by computing its Euclidean distance to each
centroid.
Step 3: Identify new centroids by taking the average of the points assigned to each cluster.
Step 4: Repeat steps 2 and 3 until the cluster assignments no longer change.
Here’s how to use Python to implement the K-Means Clustering Algorithm. These are the
steps you need to take:
1. Data Pre-Processing. Import the libraries and the dataset, and extract the independent variables.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset and extracting the two feature columns
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values
2. Find the optimal number of clusters using the elbow method. Here's the code you use:
# computing WCSS for k = 1 to 10
from sklearn.cluster import KMeans
wcss_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()
3. Train the K-means algorithm on the training dataset. Use the same two lines of code
used in the previous section. However, instead of using i, use 5, because there are
5 clusters that need to be formed. Here’s the code:
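A reasonable reconstruction of that code (my own sketch, reusing the KMeans setup from the elbow-method step with n_clusters=5) would be:
# training the K-means model on the dataset with 5 clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)   # cluster label for each customer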