SCA - Module 8

This document discusses K-means clustering, an algorithm that groups data points into K clusters based on their attributes and the distances between them. It explains the basic steps of the algorithm: initializing cluster seeds, assigning data points to the closest cluster, recalculating the centroids, and iterating until the clusters stabilize. It also covers issues such as dependence on the initial seeds and getting stuck in local optima.

Uploaded by

mahnoor

Features

Clustering

Week 11
Clustering >> K-means Clustering
• Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups
• K-means: cluster n objects, based on their attributes, into k partitions
• Strengths
• Simple iterative method
• User provides “K”
• Weaknesses
• Often too simple → bad results
• Difficult to guess the correct “K”
• Distance measures
• Euclidean distance
• Manhattan distance
• Maximum norm
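The three distance measures named above can be sketched in a few lines of Python. The function names and the two sample points are illustrative assumptions, not part of the slides.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def maximum_norm(p, q):
    """Largest absolute coordinate difference (Chebyshev distance)."""
    return max(abs(a - b) for a, b in zip(p, q))

# Two illustrative points (assumed for this sketch).
p, q = (2, 10), (5, 8)
print(euclidean(p, q))     # sqrt(3**2 + 2**2), about 3.6056
print(manhattan(p, q))     # 3 + 2 = 5
print(maximum_norm(p, q))  # max(3, 2) = 3
```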
K-means Clustering

Basic Algorithm:
• Step 0: select K
• Step 1: randomly select initial cluster seeds

[Figure: initial cluster seeds, Seed 1 = 650 and Seed 2 = 200]
K-means Clustering

• An initial cluster seed represents the “mean value” of its cluster.


• In the preceding figure:
• Cluster seed 1 = 650
• Cluster seed 2 = 200
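One common way to carry out Step 1 is to pick K distinct data values at random and use them as the seeds. The sketch below assumes that approach; the slides do not say how the seeds 650 and 200 were drawn, and the data values and the `pick_seeds` helper are hypothetical.

```python
import random

def pick_seeds(values, k, rng_seed=None):
    """Pick k distinct data values to serve as initial cluster seeds."""
    rng = random.Random(rng_seed)  # a fixed rng_seed makes the choice repeatable
    return rng.sample(values, k)

# Hypothetical one-dimensional data; not taken from the slides.
values = [120, 250, 400, 500, 700, 900]
seeds = pick_seeds(values, k=2, rng_seed=1)
print(seeds)
```

Because the result depends on the random draw, two runs with different `rng_seed` values can start from different seeds and end in different clusterings.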
K-means Clustering

• Step 2: calculate distance from each object to each cluster seed.

• What type of distance should we use?


• Squared Euclidean distance

• Step 3: Assign each object to the closest cluster
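Steps 2 and 3 can be sketched together: compute the squared Euclidean distance from each object to each seed, then take the seed with the smallest distance. The seeds 650 and 200 come from the slides; the data values and the `assign` helper are assumptions for illustration.

```python
def assign(values, seeds):
    """Return, for each value, the index of the closest seed
    using squared Euclidean distance (one-dimensional case)."""
    labels = []
    for v in values:
        dists = [(v - s) ** 2 for s in seeds]   # Step 2: distance to each seed
        labels.append(dists.index(min(dists)))  # Step 3: closest seed wins
    return labels

values = [120, 250, 400, 500, 700, 900]  # hypothetical data
seeds = [650, 200]                       # Seed 1 and Seed 2 from the slides
print(assign(values, seeds))             # [1, 1, 1, 0, 0, 0]
```

Index 0 stands for Seed 1 and index 1 for Seed 2. Squaring does not change which seed is closest, so plain Euclidean distance would give the same assignment.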


K-means Clustering

[Figure: objects assigned to Seed 1 and Seed 2]
K-means Clustering
• Step 4: Compute the new centroid for each cluster

[Figure: new centroids, Cluster Seed 1 = 708.9 and Cluster Seed 2 = 214.2]
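Step 4 replaces each seed with the mean of the objects currently assigned to it. A minimal sketch, assuming hypothetical data and labels (so the result does not reproduce the 708.9 and 214.2 shown on the slide) and assuming an empty cluster simply keeps its old centroid:

```python
def update_centroids(values, labels, old_centroids):
    """Return the mean of each cluster's members (one-dimensional case)."""
    new_centroids = []
    for j, old in enumerate(old_centroids):
        members = [v for v, lab in zip(values, labels) if lab == j]
        # Assumption: a cluster with no members keeps its previous centroid.
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids

values = [120, 250, 400, 500, 700, 900]  # hypothetical data
labels = [1, 1, 1, 0, 0, 0]              # 0 = Seed 1, 1 = Seed 2
print(update_centroids(values, labels, [650, 200]))
```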
K-means Clustering

• Iterate
• Calculate distance from objects to cluster centroids
• Assign objects to closest cluster
• Recalculate new centroids

• Stop based on convergence criteria


• No change in clusters
• Max iterations
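The whole loop, with both stopping criteria, can be sketched as follows. This is an illustrative implementation under stated assumptions: points are tuples, distance is squared Euclidean, an empty cluster keeps its old centroid, and the data values are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Iterate assignment and centroid update until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):                    # criterion: max iterations
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:                 # criterion: no change in clusters
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                          # empty cluster keeps old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical one-dimensional data, written as 1-tuples; seeds from the slides.
points = [(120,), (250,), (400,), (500,), (700,), (900,)]
centroids, labels = kmeans(points, [(650,), (200,)])
print(centroids, labels)
```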
K-means issues

• Distance measure is squared Euclidean

• Approach tries to minimize the within-cluster sum of squares error


• Implicit assumption that SSE is similar for each group
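The within-cluster sum of squares that K-means tries to minimize can be written as below. The symbols are assumptions introduced here, not notation from the slides: $C_j$ is the set of objects in cluster $j$, $\mu_j$ is its centroid, and $K$ is the number of clusters.

```latex
\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each squared norm is the squared Euclidean distance used in the assignment step, which is why that distance measure and this objective go together.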
Bottom line

• K-means
• Easy to use
• Need to know K
• May need to scale data
• Good initial method

• Local optima
• No guarantee of optimal solution
• Repeat with different starting values
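The advice to repeat with different starting values can be sketched as a restart loop that keeps the run with the lowest SSE. Everything here is an illustrative assumption: the helper names, the random choice of seeds from the data, the number of restarts, and the sample points.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Plain K-means: assign, update, stop when assignments no longer change."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def sse(points, centroids, labels):
    """Within-cluster sum of squared errors for a finished run."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def best_of_restarts(points, k, n_restarts=10, rng_seed=0):
    """Run K-means from several random starts and keep the lowest-SSE result."""
    rng = random.Random(rng_seed)
    best = None
    for _ in range(n_restarts):
        seeds = rng.sample(points, k)            # new starting values each time
        centroids, labels = kmeans(points, seeds)
        score = sse(points, centroids, labels)
        if best is None or score < best[0]:
            best = (score, centroids, labels)
    return best

# Hypothetical data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
score, centroids, labels = best_of_restarts(points, k=2)
print(score, centroids, labels)
```

Restarts do not guarantee the optimal solution either; they only make a poor local optimum less likely to be the one reported.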
Example
Data provided by 3 companies

Point   x    y
A1      2   10
A2      2    5
A3      8    4
B1      5    8
B2      7    5
B3      6    4
C1      1    2
C2      4    9

[Figure: scatter plot of the eight points, x axis 0 to 9, y axis 0 to 12]
Example

Iteration 1
Initial centroids: A1 (2, 10), B1 (5, 8), C1 (1, 2)

Point   x   y    Dist. to (2, 10)   Dist. to (5, 8)   Dist. to (1, 2)   New cluster
A1      2   10   0                  3.605551275       8.062257748       1
A2      2   5    5                  4.242640687       3.16227766        3
A3      8   4    8.485281374        5                 7.280109889       2
B1      5   8    3.605551275        0                 7.211102551       2
B2      7   5    7.071067812        3.605551275       6.708203932       2
B3      6   4    7.211102551        4.123105626       5.385164807       2
C1      1   2    8.062257748        7.211102551       0                 3
C2      4   9    2.236067977        1.414213562       7.615773106       2

Iteration 2
New centroids: cluster 1 (2, 10), cluster 2 (6, 6), cluster 3 (1.5, 3.5)

Point   x   y    Dist. to (2, 10)   Dist. to (6, 6)   Dist. to (1.5, 3.5)   Previous cluster   New cluster
A1      2   10   0                  5.656854249       6.519202405           1                  1
A2      2   5    5                  4.123105626       1.58113883            3                  3
A3      8   4    8.485281374        2.828427125       6.519202405           2                  2
B1      5   8    3.605551275        2.236067977       5.700877125           2                  2
B2      7   5    7.071067812        1.414213562       5.700877125           2                  2
B3      6   4    7.211102551        2                 4.527692569           2                  2
C1      1   2    8.062257748        6.403124237       1.58113883            3                  3
C2      4   9    2.236067977        3.605551275       6.041522987           2                  1

[Figure: scatter plot, x axis 0 to 10, y axis 0 to 15]
Iteration 3
New centroids: cluster 1 (3, 9.5), cluster 2 (6.5, 5.25), cluster 3 (1.5, 3.5)

Point   x   y    Dist. to (3, 9.5)   Dist. to (6.5, 5.25)   Dist. to (1.5, 3.5)   Previous cluster   New cluster
A1      2   10   1.118033989         6.543126164            6.519202405           1                  1
A2      2   5    4.609772229         4.506939094            1.58113883            3                  3
A3      8   4    7.433034374         1.952562419            6.519202405           2                  2
B1      5   8    2.5                 3.132491022            5.700877125           2                  1
B2      7   5    6.020797289         0.559016994            5.700877125           2                  2
B3      6   4    6.264982043         1.346291202            4.527692569           2                  2
C1      1   2    7.762087348         6.38846617             1.58113883            3                  3
C2      4   9    1.118033989         4.506939094            6.041522987           1                  1
Example
Iteration 4
New centroids: cluster 1 (3.67, 9), cluster 2 (7, 4.33), cluster 3 (1.5, 3.5)

Point   x   y    Dist. to (3.67, 9)   Dist. to (7, 4.33)   Dist. to (1.5, 3.5)   Previous cluster   New cluster
A1      2   10   1.946509697          7.559689147          6.519202405           1                  1
A2      2   5    4.334616477          5.044690278          1.58113883            3                  3
A3      8   4    6.614295125          1.053043209          6.519202405           2                  2
B1      5   8    1.664001202          4.179581319          5.700877125           1                  1
B2      7   5    5.204699799          0.67                 5.700877125           2                  2
B3      6   4    5.516239661          1.053043209          4.527692569           2                  2
C1      1   2    7.491922317          6.436528567          1.58113883            3                  3
C2      4   9    0.33                 5.550576547          6.041522987           1                  1

The distances to clusters 1 and 2 are computed from the centroids rounded to two decimals (3.67 and 4.33). No point changes cluster in this iteration, so the algorithm stops.

Final Cluster

Point   x   y    Cluster
A1      2   10   1
B1      5   8    1
C2      4   9    1
A3      8   4    2
B2      7   5    2
B3      6   4    2
A2      2   5    3
C1      1   2    3

[Figure: scatter plot of points, x axis 0 to 9, y axis 0 to 12]
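The worked example can be checked with a short script. This is a sketch rather than the method used to produce the tables: it assumes plain Euclidean distance (as the tabulated distances suggest; squared distance gives the same assignments), unrounded centroids, and helper names chosen here for illustration.

```python
import math

# The eight points from the example.
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}

def step(points, centroids):
    """One iteration: assign every point, then recompute the centroids."""
    labels = {
        name: min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for name, p in points.items()
    }
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [points[n] for n, lab in labels.items() if lab == j]
        # Assumption: an empty cluster would keep its old centroid.
        new_centroids.append(
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
        )
    return labels, new_centroids

# Start from A1, B1 and C1, as in the example.
centroids = [points["A1"], points["B1"], points["C1"]]
labels, history = None, []
while True:
    new_labels, new_centroids = step(points, centroids)
    if new_labels == labels:          # no change in clusters: converged
        break
    labels, centroids = new_labels, new_centroids
    history.append(centroids)

for i, cents in enumerate(history, start=1):
    print("centroids after iteration", i, cents)
# Clusters are numbered from 1 to match the tables.
print({name: lab + 1 for name, lab in labels.items()})
```

With unrounded centroids the iteration 4 distances differ slightly from the tabulated ones (for example 1.944 instead of 1.9465 for A1), but the assignments are the same.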
