K Means Clustering

The document provides an overview of k-means clustering, which is a method of grouping similar data objects into clusters based on their characteristics. It outlines the steps involved in the k-means algorithm, the quality measures for clustering, and its applications in various fields such as marketing and city planning. Additionally, it discusses the strengths and weaknesses of the k-means algorithm, including its efficiency and sensitivity to outliers.


Module 4

k-means Clustering Examples


What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
March 7, 2023
Clustering
• Clustering is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar (in some sense or another) to each other than to those in other groups (clusters)
• k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Quality: What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

• Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids should be far away from one another.
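A quick sketch of how SSE cohesion can be computed (the clusters below are made-up 1-D values for illustration, not data from the slides):

```python
# Sketch: intra-cluster cohesion via sum of squared error (SSE).
# Lower total SSE means more compact clusters.

def sse(cluster):
    """Sum of squared distances of the points to their cluster centroid."""
    centroid = sum(cluster) / len(cluster)
    return sum((x - centroid) ** 2 for x in cluster)

clusters = [[2, 3, 4], [10, 12, 11]]        # illustrative clustering
total_sse = sum(sse(c) for c in clusters)
print(total_sse)  # 4.0
```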
Measure the Quality of Clustering
• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
• Weights should be associated with different variables based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Distance (dissimilarity) measures
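As a sketch of two distance measures commonly used with k-means (Euclidean and Manhattan/city-block), with illustrative points:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))       # 5.0
print(manhattan((1, 2, 2), (2, 4, 2))) # 3
```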
Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

k-means Algorithm
• Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2; stop when no new assignments occur
K-means Clustering - Steps
1. Pick k starting means, m1 to mk
   (can use random values, dynamically picked values, or lower/upper bounds)
2. Repeat until convergence:
   i) Split the data into k sets, S1 to Sk, where x belongs to Si iff mi is the closest mean to x
   ii) Update each mi to the mean of Si
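The steps above can be sketched as a minimal k-means implementation (illustrative only, not production code; points are tuples and the caller supplies the starting means, as in step 1):

```python
# Minimal k-means sketch following the steps above.

def kmeans(points, means, max_iter=100):
    for _ in range(max_iter):
        # i) split the data: each point joins the cluster of its nearest mean
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, means[j])))
            clusters[nearest].append(p)
        # ii) update each mean to the centroid of its cluster
        new_means = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else m
                     for cl, m in zip(clusters, means)]
        if new_means == means:   # convergence: assignments stopped changing
            return clusters, means
        means = new_means
    return clusters, means

clusters, means = kmeans([(1,), (2,), (10,), (11,)], [(1,), (10,)])
print(means)  # [(1.5,), (10.5,)]
```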
k-means Clustering Method
• General example (K = 2):
[Figure: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; repeat updating and reassigning until the assignments no longer change.]
Example 1: k-means
• Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
• Randomly assign means: m1 = 3, m2 = 4
Iteration 1
• K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}
• Calculating the means of K1 and K2 gives m1 = 2.5, m2 = 16
Iteration 2
• K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}
• Calculating the means of K1 and K2 gives m1 = 3, m2 = 18
Iteration 3
• K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}
• Calculating the means of K1 and K2 gives m1 = 4.75, m2 = 19.6
Iteration 4
• K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}
• The new means are m1 = 7, m2 = 25; reassigning the points changes nothing, so the algorithm has converged.
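The iterations above can be checked with a short script (a sketch that mirrors the slide's 1-D example; ties, which do not occur here, would go to K1):

```python
# Re-running Example 1 to verify the slide's arithmetic.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 3.0, 4.0  # initial means from the slide
while True:
    # assign each point to its nearest mean
    k1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    k2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    # recompute the means; stop once they no longer change
    n1, n2 = sum(k1) / len(k1), sum(k2) / len(k2)
    if (n1, n2) == (m1, m2):
        break
    m1, m2 = n1, n2
print(sorted(k1), m1)  # [2, 3, 4, 10, 11, 12] 7.0
print(sorted(k2), m2)  # [20, 25, 30] 25.0
```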
Example 3
• The dataset contains 8 sample objects with their X, Y and Z coordinates.
• The task is to cluster these objects into two clusters (k = 2).
• Let us consider OB-2 (1, 2, 2) and OB-6 (2, 4, 2) as the centroids of cluster 1 and cluster 2, respectively.
• For distance measurement, let the Manhattan distance be used:
  d = |x2 – x1| + |y2 – y1| + |z2 – z1|
• Dataset:

  Object  X  Y  Z
  OB-1    1  4  1
  OB-2    1  2  2
  OB-3    1  4  2
  OB-4    2  1  2
  OB-5    1  1  1
  OB-6    2  4  2
  OB-7    1  1  2
  OB-8    2  1  1
Example 3 (cont’d)
• After the initial pass of clustering, the state of the clustered objects will look like the following:
  Cluster 1: OB-2, OB-4, OB-5, OB-7, OB-8
  Cluster 2: OB-1, OB-3, OB-6
• Distances:

  Object  X  Y  Z  Distance from C1 (1,2,2)  Distance from C2 (2,4,2)
  OB-1    1  4  1  3                         2
  OB-2    1  2  2  0                         3
  OB-3    1  4  2  2                         1
  OB-4    2  1  2  2                         3
  OB-5    1  1  1  2                         5
  OB-6    2  4  2  3                         0
  OB-7    1  1  2  1                         4
  OB-8    2  1  1  3                         4

• Updated cluster centroids:
  For cluster 1: ((1+2+1+1+2)/5, (2+1+1+1+1)/5, (2+2+1+2+1)/5) = (1.4, 1.2, 1.6)
  For cluster 2: ((1+1+2)/3, (4+4+4)/3, (1+2+2)/3) = (1.33, 4, 1.66)
Example 3 (cont’d)
• The new assignments of the objects with respect to the updated centroids are:
  Cluster 1: OB-2, OB-4, OB-5, OB-7, OB-8
  Cluster 2: OB-1, OB-3, OB-6
• Distances:

  Object  X  Y  Z  Distance from C1 (1.4,1.2,1.6)  Distance from C2 (1.33,4,1.66)
  OB-1    1  4  1  3.8                             1
  OB-2    1  2  2  1.6                             2.66
  OB-3    1  4  2  3.6                             0.66
  OB-4    2  1  2  1.2                             4
  OB-5    1  1  1  1.2                             4
  OB-6    2  4  2  3.8                             1
  OB-7    1  1  2  1                               3.66
  OB-8    2  1  1  1.4                             4.33

• Since there is no change in the cluster assignments, the clustering is the same as in the previous state.
• Hence, for k = 2, the final state of the two clusters is as above.
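Example 3 can likewise be re-checked in code (a sketch using the same objects, initial centroids, and Manhattan distance as the slides):

```python
# Re-checking Example 3: two passes of k-means with Manhattan distance.
objs = {"OB-1": (1, 4, 1), "OB-2": (1, 2, 2), "OB-3": (1, 4, 2),
        "OB-4": (2, 1, 2), "OB-5": (1, 1, 1), "OB-6": (2, 4, 2),
        "OB-7": (1, 1, 2), "OB-8": (2, 1, 1)}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

c1, c2 = objs["OB-2"], objs["OB-6"]  # initial centroids from the slide
for _ in range(2):                   # two passes suffice for this data
    # assign each object to the nearer centroid (ties go to cluster 1)
    g1 = [k for k, v in objs.items() if manhattan(v, c1) <= manhattan(v, c2)]
    g2 = [k for k, v in objs.items() if manhattan(v, c1) > manhattan(v, c2)]
    # update the centroids to the per-coordinate means of each group
    c1 = tuple(sum(objs[k][d] for k in g1) / len(g1) for d in range(3))
    c2 = tuple(sum(objs[k][d] for k in g2) / len(g2) for d in range(3))
print(g1)  # ['OB-2', 'OB-4', 'OB-5', 'OB-7', 'OB-8']
print(g2)  # ['OB-1', 'OB-3', 'OB-6']
```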
Why use K-means?
• Strengths:
– Simple: easy to understand and to implement
– Efficient: time complexity is O(tkn), where
  • n is the number of data points,
  • k is the number of clusters, and
  • t is the number of iterations.
– Since both k and t are usually small, k-means is considered a linear-time algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find due to complexity.
Weaknesses of K-means
• The algorithm is only applicable when the mean is defined.
– For categorical data, k-modes is used: the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
– Outliers are data points that are very far away from other data points.
– Outliers could be errors in the data recording or some special data points with very different values.
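To see the outlier sensitivity concretely: a single far-away point is enough to drag a cluster mean away from the rest of its points (made-up numbers for illustration):

```python
# One outlier pulls the mean far from every "normal" point in the cluster.
cluster = [1, 2, 3, 4, 100]          # 100 is an outlier
mean = sum(cluster) / len(cluster)
print(mean)  # 22.0, nowhere near the points 1..4
```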
References
1. Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Pearson, 2012.
2. https://www.datacamp.com/community/tutorials/k-means-clustering-python
