Lecture 2 - Clustering Methods

The document discusses various clustering methods. It describes partitioning methods like k-means and k-medoids which assign data points to clusters to minimize distances between points and cluster centers or medoids. It also covers hierarchical methods that create cluster hierarchies, density-based methods based on connectivity and density, grid-based methods using multi-level grids, and model-based methods fitting clusters to hypothesized models.

Uploaded by

Manikandan M


Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary

11/1/22 Data Mining: Concepts and Techniques 1

Major Clustering Approaches (I)

• Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DENCLUE


Major Clustering Approaches (II)

• Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE
• Model-based approach:
  - A model is hypothesized for each cluster, and the method tries to find the best fit of the data to that model
  - Typical methods: EM, SOM, COBWEB
• Frequent pattern-based approach:
  - Based on the analysis of frequent patterns
  - Typical methods: pCluster
• User-guided or constraint-based approach:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance between Clusters

In the formulas below, t_ip is an element of cluster K_i and t_jq is an element of cluster K_j.

• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min over p, q of d(t_ip, t_jq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max over p, q of d(t_ip, t_jq)
• Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg over p, q of d(t_ip, t_jq)
• Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = d(C_i, C_j)
• Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = d(M_i, M_j)
  - Medoid: one chosen, centrally located object in the cluster
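The five inter-cluster distances listed above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance and taking the medoid to be the object with the smallest total distance to the other objects in its cluster. The function names are hypothetical and do not come from the slides.

```python
import numpy as np

def pairwise_distances(A, B):
    """Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def medoid(X):
    """Assumed definition: the object with the smallest total distance to the others."""
    return X[pairwise_distances(X, X).sum(axis=1).argmin()]

def cluster_distance(Ki, Kj, method="single"):
    """Distance between clusters Ki and Kj under the chosen linkage."""
    d = pairwise_distances(Ki, Kj)
    if method == "single":      # smallest element-to-element distance
        return d.min()
    if method == "complete":    # largest element-to-element distance
        return d.max()
    if method == "average":     # mean element-to-element distance
        return d.mean()
    if method == "centroid":    # distance between the two mean points
        return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
    if method == "medoid":      # distance between the two medoids
        return np.linalg.norm(medoid(Ki) - medoid(Kj))
    raise ValueError(f"unknown method: {method}")

# Two small clusters on a line, made up for illustration.
Ki = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
Kj = np.array([[5.0, 0.0], [6.0, 0.0], [7.0, 0.0]])

for name in ("single", "complete", "average", "centroid", "medoid"):
    print(name, cluster_distance(Ki, Kj, name))
```

For these two clusters the single, complete, and average link distances are 3, 7, and 5, which shows how much the choice of linkage can change the result on the same data.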
Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Here N is the number of points in the cluster, and t_ip and t_jq denote points of the cluster.

• Centroid: the “middle” of a cluster

$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$

• Radius: square root of the average squared distance from any point of the cluster to its centroid

$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}$$

• Diameter: square root of the average squared distance between all pairs of points in the cluster

$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
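As a quick numerical check of the three definitions, the sketch below computes the centroid, radius, and diameter of a small cluster. It assumes Euclidean distance and points stored as rows of a NumPy array. The function names and the sample points are illustrative only.

```python
import numpy as np

def centroid(X):
    """Mean point of the cluster."""
    return X.mean(axis=0)

def radius(X):
    """Square root of the average squared distance to the centroid."""
    c = centroid(X)
    return np.sqrt(((X - c) ** 2).sum(axis=1).mean())

def diameter(X):
    """Square root of the average squared distance over all ordered pairs of distinct points."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    squared = (diff ** 2).sum(axis=2)   # zero on the diagonal, so self-pairs add nothing
    return np.sqrt(squared.sum() / (n * (n - 1)))

# Four corners of a 2 x 2 square.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])

print(centroid(X))   # the centre of the square, (1, 1)
print(radius(X))     # sqrt(2), about 1.414
print(diameter(X))   # sqrt(16/3), about 2.309
```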


Partitioning Algorithms: Basic Concept

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:

$$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$

• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen’67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster
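To make the partitioning criterion concrete, the following sketch evaluates the sum of squared distances for two candidate partitions of the same four points. It assumes C_m is the mean of cluster K_m, as in k-means. The data and the function name are made up for illustration.

```python
import numpy as np

def sum_squared_error(X, labels):
    """Sum over clusters of squared distances from each member to its cluster mean."""
    total = 0.0
    for m in np.unique(labels):
        members = X[labels == m]
        center = members.mean(axis=0)          # C_m
        total += ((members - center) ** 2).sum()
    return total

X = np.array([[1.0], [2.0], [9.0], [10.0]])
good = np.array([0, 0, 1, 1])   # {1, 2} and {9, 10}
bad = np.array([0, 1, 0, 1])    # {1, 9} and {2, 10}

print(sum_squared_error(X, good))  # 1.0
print(sum_squared_error(X, bad))   # 64.0
```

The partition that groups nearby points scores 1.0 against 64.0 for the mixed one, so the criterion prefers it.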


The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
  1. Partition objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when there are no more new assignments
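The four steps can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, at least k objects, and a simple guard that refills any cluster that becomes empty (the slides do not say how to handle that case). The function name, parameters, and sample data are hypothetical.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: partition the objects into k nonempty subsets (random balanced split).
    labels = rng.permutation(np.arange(len(X)) % k)
    for _ in range(max_iter):
        # Step 2: the seed points are the centroids (means) of the current clusters.
        centers = np.array([X[labels == m].mean(axis=0) for m in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Assumed guard: refill an empty cluster with the nearest object
        # taken from a cluster that has more than one member.
        counts = np.bincount(new_labels, minlength=k)
        for m in np.flatnonzero(counts == 0):
            donors = np.flatnonzero(counts[new_labels] > 1)
            i = donors[dists[donors, m].argmin()]
            counts[new_labels[i]] -= 1
            new_labels[i] = m
            counts[m] += 1
        # Step 4: stop when no object changes cluster, otherwise repeat from Step 2.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers

# Two well-separated groups of three points each.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.0], [8.5, 9.5]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)
```

On this data the first three points end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the random initial partition.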


The K-Means Clustering Method

• Example

[Figure: k-means example with K = 2, shown as a sequence of scatter plots with both axes running from 0 to 10. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; update the cluster means again; reassign.]


Comments on the K-Means Method

• Strength: relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
  - For comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k)), where s is the sample size
• Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
• Weaknesses
  - Applicable only when the mean is defined, which raises the question of how to handle categorical data
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes


Variations of the K-Means Method

• A few variants of k-means, which differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang’98)
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
  - A mixture of categorical and numerical data: k-prototype method

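The k-modes idea above can be sketched with the standard library alone. The sketch assumes simple matching (the number of attributes on which two objects differ) as the dissimilarity measure and the most frequent value per attribute as the frequency-based mode update. The function names, the initial modes, and the toy data are hypothetical.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(objects):
    """Frequency-based update: the most frequent value of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))

def k_modes(objects, initial_modes, max_iter=20):
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        # Assign each object to the cluster whose mode is least dissimilar.
        new_labels = [
            min(range(len(modes)), key=lambda m: matching_dissimilarity(obj, modes[m]))
            for obj in objects
        ]
        if new_labels == labels:      # no object changed cluster
            break
        labels = new_labels
        # Replace each mode by the attribute-wise most frequent values of its members.
        for m in range(len(modes)):
            members = [obj for obj, lab in zip(objects, labels) if lab == m]
            if members:
                modes[m] = cluster_mode(members)
    return labels, modes

objects = [
    ("red", "small", "round"), ("red", "small", "square"), ("red", "large", "round"),
    ("blue", "large", "square"), ("blue", "large", "round"), ("blue", "small", "square"),
]
labels, modes = k_modes(objects, initial_modes=[objects[0], objects[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(modes)   # [('red', 'small', 'round'), ('blue', 'large', 'square')]
```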

What Is the Problem of the K-Means Method?

• The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
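A one-dimensional example illustrates the difference. The sketch below compares the mean and the medoid of a small cluster that contains one extreme value. It assumes the medoid is the object with the smallest total absolute distance to the other objects. The data are made up.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100 is an outlier

mean = values.mean()

# Total distance from each object to all the others; the medoid minimizes it.
total_dist = np.abs(values[:, None] - values[None, :]).sum(axis=1)
medoid = values[total_dist.argmin()]

print(mean)    # 22.0, pulled far away from the bulk of the data
print(medoid)  # 3.0, still an actual object in the middle of the bulk
```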


The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets
• CLARA (Kaufman & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)


A Typical K-Medoids Algorithm (PAM)

[Figure: PAM example with K = 2, shown as a sequence of scatter plots with both axes running from 0 to 10. Arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (Total Cost = 20); randomly select a non-medoid object, O_random; compute the total cost of swapping (Total Cost = 26); swap O and O_random if the quality is improved; loop until no change.]


PAM (Partitioning Around Medoids) (1987)

• PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS
• Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily
  2. For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih
  3. For each pair of i and h:
     - If TC_ih < 0, i is replaced by h
     - Then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change
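The PAM loop can be sketched as below. This is a simplified illustration under stated assumptions: Euclidean distance, a precomputed distance matrix, and applying only the single most negative swap cost TC_ih in each pass (the slides only require TC_ih < 0). The function names and the sample data are hypothetical.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 1: select k representative objects arbitrarily.
    medoids = [int(m) for m in rng.choice(len(X), size=k, replace=False)]
    while True:
        current = total_cost(D, medoids)
        best_tc, best_swap = 0.0, None
        # Step 2: total swapping cost TC_ih for every (medoid i, non-medoid h) pair.
        for pos in range(k):
            for h in range(len(X)):
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                tc = total_cost(D, candidate) - current
                if tc < best_tc:
                    best_tc, best_swap = tc, (pos, h)
        # Step 4: stop when no swap has a negative cost.
        if best_swap is None:
            break
        # Step 3: replace medoid i by h (here: the swap with the most negative TC_ih).
        medoids[best_swap[0]] = best_swap[1]
    # Assign each object to the most similar representative object.
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.0], [8.5, 9.5]])
medoids, labels = pam(X, k=2)
print(sorted(medoids))  # [0, 4]
print(labels)
```

Recomputing the full cost for every candidate swap keeps the sketch short, but it is slower than accumulating the per-object contributions C_jih.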
PAM Clustering: Total swapping cost

$$TC_{ih} = \sum_{j} C_{jih}$$

[Figure: four scatter plots, one per case, showing a non-selected object j, the current medoid i, the candidate medoid h, and another medoid t. Both axes run from 0 to 10.]

The contribution C_jih of object j when medoid i is replaced by h falls into one of four cases:

• j is reassigned from i to h: C_jih = d(j, h) - d(j, i)
• j stays with another medoid t: C_jih = 0
• j is reassigned from i to another medoid t: C_jih = d(j, t) - d(j, i)
• j is reassigned from another medoid t to h: C_jih = d(j, h) - d(j, t)
What Is the Problem with PAM?

• PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
• PAM works efficiently for small data sets but does not scale well for large data sets
  - O(k(n-k)^2) for each iteration, where n is the number of data objects and k is the number of clusters
• This motivates a sampling-based method: CLARA (Clustering LARge Applications)


CLARA (Clustering Large Applications) (1990)

• CLARA (Kaufman and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S-PLUS
• It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weaknesses:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
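The sampling scheme can be sketched as follows. This is a rough illustration rather than the packaged implementation: the inner search is a simplified greedy PAM-style swap loop, and the defaults of 5 samples of size 40 + 2k are an assumption based on commonly cited settings, not something stated in the slides. All function and parameter names are hypothetical.

```python
import numpy as np

def cost(X, medoid_idx):
    """Sum of distances from every row of X to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam_on_sample(X, sample_idx, k, rng):
    """Simplified greedy swap search on the sample; returns medoid indices into X."""
    S = X[sample_idx]
    medoids = [int(m) for m in rng.choice(len(S), size=k, replace=False)]
    best = cost(S, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for h in range(len(S)):
                if h in medoids:
                    continue
                trial = medoids[:pos] + [h] + medoids[pos + 1:]
                c = cost(S, trial)
                if c < best:               # accept any improving swap
                    medoids, best, improved = trial, c, True
    return [int(sample_idx[m]) for m in medoids]

def clara(X, k, n_samples=5, sample_size=None, seed=0):
    rng = np.random.default_rng(seed)
    if sample_size is None:
        sample_size = min(len(X), 40 + 2 * k)   # assumed default
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        sample_idx = rng.choice(len(X), size=sample_size, replace=False)
        medoids = pam_on_sample(X, sample_idx, k, rng)
        c = cost(X, medoids)    # judge each candidate on the whole data set
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost

# Synthetic data: two groups of 50 points around (0, 0) and (10, 10).
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 0.5, (50, 2)),
               data_rng.normal(10.0, 0.5, (50, 2))])
medoids, best_cost = clara(X, k=2)
print(medoids, best_cost)
```

Each sample's medoids are scored against the full data set, which is what lets CLARA pick the best of several sample-based clusterings. The result is still only as good as the samples drawn.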
CLARANS (“Randomized” CLARA) (1994)

• CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)
• CLARANS draws a sample of neighbors dynamically
• The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
• When a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum
• It is more efficient and scalable than both PAM and CLARA
• Focusing techniques and spatial access structures may further improve its performance (Ester et al.’95)
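The graph search described above can be sketched as follows. In this illustration a node is a list of k medoid indices, and a neighbor differs from it in exactly one medoid. The parameters `num_local` (number of restarts) and `max_neighbor` (consecutive non-improving neighbors tried before a node is declared a local optimum) are names assumed for the sketch, as are the function name and the sample data. Euclidean distance and n > k are also assumed.

```python
import numpy as np

def clarans(X, k, num_local=2, max_neighbor=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def cost(medoids):
        # Sum of distances from every object to its nearest medoid.
        return D[:, medoids].min(axis=1).sum()

    best, best_cost = None, np.inf
    for _ in range(num_local):
        # Start from a randomly selected node (a set of k medoids).
        current = [int(m) for m in rng.choice(n, size=k, replace=False)]
        current_cost = cost(current)
        tried = 0
        while tried < max_neighbor:
            # Draw a random neighbor: swap one medoid for one non-medoid.
            pos = int(rng.integers(k))
            h = int(rng.integers(n))
            if h in current:
                continue
            neighbor = current.copy()
            neighbor[pos] = h
            c = cost(neighbor)
            if c < current_cost:
                # Move to the better neighbor and restart the neighbor count.
                current, current_cost, tried = neighbor, c, 0
            else:
                tried += 1
        # The current node is treated as a local optimum; keep the best one seen.
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.0], [8.5, 9.5]])
medoids, best_cost = clarans(X, k=2)
print(sorted(medoids), best_cost)
```

Unlike the PAM loop, no pass examines all k(n-k) swaps. Only randomly drawn neighbors are evaluated, which is where the efficiency gain on larger data sets comes from.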