
Chapter 7.

Cluster Analysis
1. What is Cluster Analysis?
2. A Categorization of Major Clustering Methods
3. Partitioning Methods
4. Hierarchical Methods
5. Density-Based Methods
6. Grid-Based Methods
7. Model-Based Methods
8. Clustering High-Dimensional Data
9. Constraint-Based Clustering
10. Link-based clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering for Data Understanding and
Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
 Climate: understanding the earth's climate; finding patterns in
atmospheric and ocean data
 Economic science: market research
Clustering as Preprocessing Tools (Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
Quality: What Is Good Clustering?
 A good clustering method will produce high-quality clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function,
typically a metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective


Distance Measures for Different Kinds of Data
Discussed in Chapter 2: Data Preprocessing
 Numerical (interval)-based:
 Minkowski distance:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
 Special cases: Euclidean (L2 norm, q = 2), Manhattan (L1 norm, q = 1)
 Binary variables:
 symmetric vs. asymmetric (Jaccard coefficient)
 Nominal variables: # of mismatches
 Ordinal variables: treated like interval-based
 Ratio-scaled variables: apply log-transformation first
 Vectors: cosine measure
 Mixed variables: weighted combinations
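
A minimal Python sketch of some of these measures using NumPy/SciPy; the example vectors x, y and the binary arrays a, b are illustrative, not from the text:

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 6.0])

# Minkowski distance with q = 3; q = 2 gives Euclidean, q = 1 gives Manhattan
d_minkowski = distance.minkowski(x, y, p=3)
d_euclidean = distance.euclidean(x, y)    # L2 norm
d_manhattan = distance.cityblock(x, y)    # L1 norm

# Cosine measure for vector data (SciPy returns cosine distance = 1 - similarity)
cos_sim = 1.0 - distance.cosine(x, y)

# Jaccard coefficient for asymmetric binary variables
a = np.array([1, 0, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 0, 1, 0], dtype=bool)
jaccard_dist = distance.jaccard(a, b)     # 1 - Jaccard coefficient

print(d_minkowski, d_euclidean, d_manhattan, cos_sim, jaccard_dist)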
Requirements of Clustering in Data Mining

 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Major Clustering Approaches (I)

 Partitioning approach:
 Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)
using some criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters, and the goal is to
find the best fit of the data to the given models
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific
constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus

Calculation of Distance between Clusters

 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min d(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max d(tip, tjq)
 Average: average distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg d(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = d(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = d(Mi, Mj)
 Medoid: one chosen, centrally located object in the cluster
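
A small NumPy/SciPy sketch of the first four of these inter-cluster distances; the function name and the example points are illustrative:

import numpy as np
from scipy.spatial.distance import cdist

def cluster_distances(Ki, Kj):
    # Ki, Kj: (n_points, n_features) arrays holding the two clusters
    D = cdist(Ki, Kj)                 # all pairwise Euclidean distances
    single = D.min()                  # smallest pairwise distance
    complete = D.max()                # largest pairwise distance
    average = D.mean()                # mean over all pairs
    centroid = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))
    return single, complete, average, centroid

Ki = np.array([[1.0, 1.0], [2.0, 1.5]])
Kj = np.array([[6.0, 5.0], [7.0, 6.0], [8.0, 5.5]])
print(cluster_distances(Ki, Kj))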
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
Cm = ( Σ_{i=1..N} t_i ) / N
 Radius: square root of the average distance from any point of the
cluster to its centroid
Rm = sqrt( Σ_{i=1..N} (t_i − Cm)² / N )
 Diameter: square root of the average mean squared distance between
all pairs of points in the cluster
Dm = sqrt( Σ_{i=1..N} Σ_{j=1..N, j≠i} (t_i − t_j)² / (N(N−1)) )
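
A minimal NumPy sketch of these three statistics for one cluster stored as an (N, d) array, taking the squared terms as squared Euclidean distances (the function name is illustrative):

import numpy as np
from scipy.spatial.distance import pdist

def centroid_radius_diameter(T):
    # T: (N, d) array holding the N points of one cluster
    N = len(T)
    cm = T.mean(axis=0)                                # centroid
    rm = np.sqrt(((T - cm) ** 2).sum(axis=1).mean())   # radius
    # diameter: the sum over ordered pairs i != j equals twice the sum
    # over the unordered pairs returned by pdist
    sq = pdist(T, metric="sqeuclidean")
    dm = np.sqrt(2.0 * sq.sum() / (N * (N - 1)))
    return cm, rm, dm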

Partitioning Algorithms: Basic Concept

 Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters such that the sum of squared distances is minimized:
E = Σ_{i=1..k} Σ_{p ∈ Ci} (p − mi)²
 Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
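
A short sketch of this criterion in NumPy, assuming X holds the n objects, labels the cluster assignments, and centers the k representatives mi (all names are illustrative):

import numpy as np

def sse(X, labels, centers):
    # E = sum over clusters i of sum over p in Ci of ||p - m_i||^2
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centers))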

The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean point,
of the cluster)
 Assign each object to the cluster with the nearest seed point
 Go back to Step 2; stop when there are no more new assignments
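
A minimal NumPy sketch of these steps; the function name, initialization choice, and convergence test are illustrative:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # X: (n, d) array of objects; k: number of clusters
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial seed points
    for _ in range(max_iter):
        # assign each object to the cluster with the nearest seed point
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each seed point as the mean of its current cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # no more new assignments
            break
        centers = new_centers
    return labels, centers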

The K-Means Clustering Method

 Example (K = 2): arbitrarily choose k objects as the initial cluster
centers, assign each object to the most similar center, update the
cluster means, and reassign; repeat until the assignments stabilize
[Figure: three scatter plots illustrating the assign/update/reassign
iterations of k-means]
Comments on the K-Means Method

 Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as deterministic annealing and
genetic algorithms
 Weakness
 Applicable only when the mean is defined; what about categorical
data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method

 A few variants of the k-means method differ in

 Selection of the initial k means

 Dissimilarity calculations

 Strategies to calculate cluster means

 Handling categorical data: k-modes (Huang’98), sketched below

 Replacing means of clusters with modes

 Using new dissimilarity measures to deal with categorical objects

 Using a frequency-based method to update modes of clusters

 A mixture of categorical and numerical data: k-prototype method
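
A simple sketch of the k-modes idea for purely categorical data, assuming the objects are given as an (n, m) array of integer category codes; the dissimilarity is the number of mismatching attributes and the modes are updated per attribute by frequency. The function name is illustrative and this is a simplified variant, not Huang's full algorithm:

import numpy as np

def k_modes(X, k, max_iter=100, seed=0):
    # X: (n, m) array of categorical attribute codes
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # dissimilarity = number of mismatching attributes
        dists = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_modes = modes.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            # frequency-based update: most frequent value per attribute
            for a in range(X.shape[1]):
                vals, counts = np.unique(members[:, a], return_counts=True)
                new_modes[j, a] = vals[counts.argmax()]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes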

What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers!
 Since an object with an extremely large value may substantially
distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the objects in a cluster
as a reference point, a medoid can be used, which is the most
centrally located object in a cluster
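
A simplified alternating k-medoids sketch in NumPy (a Voronoi-iteration style variant, not the full PAM swap search; the function name and data layout are illustrative):

import numpy as np
from scipy.spatial.distance import cdist

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    D = cdist(X, X)                               # precomputed pairwise distances
    for _ in range(max_iter):
        labels = D[:, medoid_idx].argmin(axis=1)  # assign to the nearest medoid
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # new medoid = member minimizing total distance to the other members
            new_idx[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return labels, X[medoid_idx]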


Hierarchical Clustering
 Uses the distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input, but needs a
termination condition

[Figure: AGNES (agglomerative) merges objects a through e bottom-up into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e} over steps 0 to 4; DIANA (divisive) performs the same steps in reverse, splitting the full set top-down]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., S-PLUS
 Use the Single-Link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
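
A minimal sketch of single-link agglomerative clustering with SciPy; the data points are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 6.0], [5.2, 6.3], [9.0, 9.0]])

# AGNES with the single-link criterion: repeatedly merge the two clusters
# whose closest members have the least dissimilarity
Z = linkage(pdist(X), method="single")
print(Z)   # each row: the two clusters merged, their distance, new cluster size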

Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested partitioning
(a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected component then
forms a cluster.
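
A short SciPy sketch of drawing a dendrogram and cutting it at a distance threshold to obtain a flat clustering; the data points and the threshold 2.0 are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 6.0], [5.2, 6.3], [9.0, 9.0]])
Z = linkage(pdist(X), method="single")       # linkage matrix as on the AGNES slide
dendrogram(Z)                                # draws the tree (needs matplotlib)
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)                                # cluster label for each object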

