
Chapter 7.

Cluster Analysis
1. What is Cluster Analysis?
2. A Categorization of Major Clustering Methods
3. Partitioning Methods
4. Hierarchical Methods
5. Density-Based Methods
6. Grid-Based Methods
7. Model-Based Methods
8. Clustering High-Dimensional Data
9. Constraint-Based Clustering
10. Link-based clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering for Data Understanding and
Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
 Climate: understanding the earth's climate; finding patterns in
atmospheric and ocean data
 Economic science: market research
Clustering as Preprocessing Tools (Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
Quality: What Is Good Clustering?
 A good clustering method will produce high-quality clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function,
typically a metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective


Distance Measures for Different Kinds of Data
Discussed in Chapter 2: Data Preprocessing
 Numerical (interval)-based:
 Minkowski distance:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
 Special cases: Euclidean (L2 norm, q = 2), Manhattan (L1 norm, q = 1)
 Binary variables:
 symmetric vs. asymmetric (Jaccard coefficient)
 Nominal variables: # of mismatches
 Ordinal variables: treated like interval-based
 Ratio-scaled variables: apply log-transformation first
 Vectors: cosine measure
 Mixed variables: weighted combinations
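
A minimal Python sketch of some of these measures using NumPy/SciPy; the example vectors x, y and the binary arrays a, b are illustrative, not from the text:

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 6.0])

# Minkowski distance with q = 3; q = 2 gives Euclidean, q = 1 gives Manhattan
d_minkowski = distance.minkowski(x, y, p=3)
d_euclidean = distance.euclidean(x, y)    # L2 norm
d_manhattan = distance.cityblock(x, y)    # L1 norm

# Cosine measure for vector data (SciPy returns cosine distance = 1 - similarity)
cos_sim = 1.0 - distance.cosine(x, y)

# Jaccard coefficient for asymmetric binary variables
a = np.array([1, 0, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 0, 1, 0], dtype=bool)
jaccard_dist = distance.jaccard(a, b)     # 1 - Jaccard coefficient

print(d_minkowski, d_euclidean, d_manhattan, cos_sim, jaccard_dist)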
Requirements of Clustering in Data Mining

 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Major Clustering Approaches (I)

 Partitioning approach:
 Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)
using some criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters, and the goal is to
find the best fit of the data to the given models
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific
constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus

Calculation of Distance between Clusters

 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min d(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max d(tip, tjq)
 Average: average distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg d(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = d(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = d(Mi, Mj)
 Medoid: one chosen, centrally located object in the cluster
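
A small NumPy/SciPy sketch of the first four of these inter-cluster distances; the function name and the example points are illustrative:

import numpy as np
from scipy.spatial.distance import cdist

def cluster_distances(Ki, Kj):
    # Ki, Kj: (n_points, n_features) arrays holding the two clusters
    D = cdist(Ki, Kj)                 # all pairwise Euclidean distances
    single = D.min()                  # smallest pairwise distance
    complete = D.max()                # largest pairwise distance
    average = D.mean()                # mean over all pairs
    centroid = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
    return single, complete, average, centroid

Ki = np.array([[1.0, 1.0], [2.0, 1.5]])
Kj = np.array([[6.0, 5.0], [7.0, 6.0], [8.0, 5.5]])
print(cluster_distances(Ki, Kj))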
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
Cm = ( Σ_{i=1..N} t_i ) / N
 Radius: square root of the average distance from any point of the
cluster to its centroid
Rm = sqrt( Σ_{i=1..N} (t_i − Cm)² / N )
 Diameter: square root of the average mean squared distance between
all pairs of points in the cluster
Dm = sqrt( Σ_{i=1..N} Σ_{j=1..N, j≠i} (t_i − t_j)² / (N(N−1)) )
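
A minimal NumPy sketch of these three statistics for one cluster stored as an (N, d) array, taking the squared terms as squared Euclidean distances (the function name is illustrative):

import numpy as np
from scipy.spatial.distance import pdist

def centroid_radius_diameter(T):
    # T: (N, d) array holding the N points of one cluster
    N = len(T)
    cm = T.mean(axis=0)                                # centroid
    rm = np.sqrt(((T - cm) ** 2).sum(axis=1).mean())   # radius
    # diameter: the sum over ordered pairs i != j equals twice the sum
    # over the unordered pairs returned by pdist
    sq = pdist(T, metric="sqeuclidean")
    dm = np.sqrt(2.0 * sq.sum() / (N * (N - 1)))
    return cm, rm, dm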

Partitioning Algorithms: Basic Concept

 Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters such that the sum of squared distances is minimized:
E = Σ_{i=1..k} Σ_{p ∈ Ci} (p − mi)²
 Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
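
A short sketch of this criterion in NumPy, assuming X holds the n objects, labels the cluster assignments, and centers the k representatives mi (all names are illustrative):

import numpy as np

def sse(X, labels, centers):
    # E = sum over clusters i of sum over p in Ci of ||p - m_i||^2
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centers))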

The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean point,
of the cluster)
 Assign each object to the cluster with the nearest seed point
 Go back to Step 2; stop when there are no more new assignments
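
A minimal NumPy sketch of these steps; the function name, initialization choice, and convergence test are illustrative:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # X: (n, d) array of objects; k: number of clusters
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial seed points
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest seed point
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each seed point as the mean of its current cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # no more new assignments
            break
        centers = new_centers
    return labels, centers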

The K-Means Clustering Method

 Example (K = 2): arbitrarily choose k objects as the initial cluster
centers, assign each object to the most similar center, update the
cluster means, and reassign; repeat until the assignments stabilize
[Figure: three scatter plots illustrating the assign/update/reassign
iterations of k-means]
Comments on the K-Means Method

 Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as deterministic annealing and
genetic algorithms
 Weakness
 Applicable only when the mean is defined; what about categorical
data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method

 A few variants of the k-means method differ in

 Selection of the initial k means

 Dissimilarity calculations

 Strategies to calculate cluster means

 Handling categorical data: k-modes (Huang’98), sketched below

 Replacing means of clusters with modes

 Using new dissimilarity measures to deal with categorical objects

 Using a frequency-based method to update modes of clusters

 A mixture of categorical and numerical data: k-prototype method
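
A simple sketch of the k-modes idea for purely categorical data, assuming the objects are given as an (n, m) array of integer category codes; the dissimilarity is the number of mismatching attributes and the modes are updated per attribute by frequency. The function name is illustrative and this is a simplified variant, not Huang's full algorithm:

import numpy as np

def k_modes(X, k, max_iter=100, seed=0):
    # X: (n, m) array of categorical attribute codes
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # dissimilarity = number of mismatching attributes
        dists = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_modes = modes.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # frequency-based update: most frequent value per attribute
            for a in range(X.shape[1]):
                vals, counts = np.unique(members[:, a], return_counts=True)
                new_modes[j, a] = vals[counts.argmax()]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes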

What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers!
 Since an object with an extremely large value may substantially
distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the objects in a cluster
as a reference point, a medoid can be used, which is the most
centrally located object in a cluster
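
A simplified alternating k-medoids sketch in NumPy (a Voronoi-iteration style variant, not the full PAM swap search; the function name and data layout are illustrative):

import numpy as np
from scipy.spatial.distance import cdist

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    D = cdist(X, X)                               # precomputed pairwise distances
    for _ in range(max_iter):
        labels = D[:, medoid_idx].argmin(axis=1)  # assign to the nearest medoid
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # new medoid = member minimizing total distance to the other members
            new_idx[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return labels, X[medoid_idx]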


Hierarchical Clustering
 Uses the distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input, but needs a
termination condition

[Figure: AGNES (agglomerative) merges objects a through e bottom-up into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e} over steps 0 to 4; DIANA (divisive) performs the same steps in reverse, splitting the full set top-down]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., S-PLUS
 Use the Single-Link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
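
A minimal sketch of single-link agglomerative clustering with SciPy; the data points are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 6.0], [5.2, 6.3], [9.0, 9.0]])

# AGNES with the single-link criterion: repeatedly merge the two clusters
# whose closest members have the least dissimilarity
Z = linkage(pdist(X), method="single")
print(Z)   # each row: the two clusters merged, their distance, new cluster size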

Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested partitioning
(a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected component then
forms a cluster.
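
A short SciPy sketch of drawing a dendrogram and cutting it at a distance threshold to obtain a flat clustering; the data points and the threshold 2.0 are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 6.0], [5.2, 6.3], [9.0, 9.0]])
Z = linkage(pdist(X), method="single")       # linkage matrix as on the AGNES slide
dendrogram(Z)                                # draws the tree (needs matplotlib)
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)                                # cluster label for each object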

