Cluster Analysis - Approach 1
Clustering approaches
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of
n objects into a set of k clusters, such that a chosen objective
(e.g., the total dissimilarity of each object to its cluster representative) is minimized
[Figure: a typical K-Medoids run. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; then, in a loop, randomly select a non-medoid object, compute the total cost of swapping it with a current medoid, and perform the swap if the quality is improved; repeat until there is no change.]
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
Uses real objects to represent the clusters:
1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i,
calculate the total swapping cost TC_ih
3. For each pair of i and h:
• If TC_ih < 0, i is replaced by h
• Then assign each non-selected object to the most similar
representative object
4. Repeat steps 2-3 until there is no change
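The swap loop above can be summarized in a short Python sketch. This is a minimal illustration, not the original S-Plus implementation; the function name pam(), the use of a precomputed dissimilarity matrix, and the random seed are my own choices:

```python
import numpy as np

def pam(dist, k, rng=np.random.default_rng(0)):
    """Minimal PAM sketch.
    dist : (n, n) precomputed pairwise dissimilarity matrix
    k    : number of clusters
    """
    n = dist.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))   # step 1: arbitrary medoids

    def total_cost(meds):
        # every object contributes its dissimilarity to its nearest medoid
        return dist[:, meds].min(axis=1).sum()

    changed = True
    while changed:                        # step 4: repeat steps 2-3 until no change
        changed = False
        for i in list(medoids):           # selected object i
            for h in range(n):            # non-selected object h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                tc_ih = total_cost(candidate) - total_cost(medoids)   # step 2: TC_ih
                if tc_ih < 0:             # step 3: swap i and h if the cost decreases
                    medoids, changed = candidate, True
    labels = dist[:, medoids].argmin(axis=1)   # assign objects to the most similar medoid
    return medoids, labels
```

The double loop over (i, h) pairs is what produces the O(k(n-k)²) per-iteration cost quoted below.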
What Is the Problem with PAM?
PAM is more robust than k-means in the presence of noise and
outliers, because a medoid is less influenced by outliers or
other extreme values than a mean
PAM works efficiently for small data sets but does not scale
well to large data sets:
– O(k(n-k)²) per iteration,
where n is the number of data objects and k is the number of clusters
• Sampling-based method: CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications)
CLARA (Kaufman and Rousseeuw, 1990)
• Built into statistical analysis packages, such as S-Plus
It draws multiple samples of the data set, applies PAM on each
sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
• Efficiency depends on the sample size
• A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the
sample is biased
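A minimal sketch of the CLARA idea, reusing the pam() function from the PAM sketch above; the number of samples, the sample size, the dist_fn argument, and the seed are illustrative assumptions, not values fixed by the method:

```python
import numpy as np

def clara(data, k, dist_fn, n_samples=5, sample_size=40, rng=np.random.default_rng(0)):
    """Draw several samples, run PAM on each, and keep the medoid set with the
    lowest total dissimilarity measured on the *whole* data set."""
    n = len(data)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        sample = rng.choice(n, size=min(sample_size, n), replace=False)
        # pairwise dissimilarities within the sample only
        dist = np.array([[dist_fn(data[i], data[j]) for j in sample] for i in sample])
        local_medoids, _ = pam(dist, k, rng)         # medoid positions within the sample
        medoids = sample[np.array(local_medoids)]    # map back to indices in the full data set
        # evaluate this candidate clustering on all n objects
        cost = sum(min(dist_fn(x, data[m]) for m in medoids) for x in data)
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost
```

Each PAM call only sees sample_size objects, which is why both efficiency and clustering quality hinge on how the samples are drawn.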
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han, 1994)
CLARANS draws a sample of neighbors dynamically
The clustering process can be presented as searching a graph
where every node is a potential solution, that is, a set of k
medoids
If a local optimum is found, CLARANS starts from a new
randomly selected node in search of a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may further
improve its performance.
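A rough sketch of the randomized search described above. numlocal (number of restarts) and maxneighbor (neighbors examined before declaring a local optimum) are the two CLARANS parameters; the default values and helper names here are illustrative:

```python
import numpy as np

def clarans(dist, k, numlocal=2, maxneighbor=50, rng=np.random.default_rng(0)):
    """Search the graph whose nodes are sets of k medoids (CLARANS sketch)."""
    n = dist.shape[0]

    def cost(meds):
        return dist[:, list(meds)].min(axis=1).sum()

    best_cost, best_node = np.inf, None
    for _ in range(numlocal):                        # restart from a new random node
        current = set(rng.choice(n, size=k, replace=False).tolist())
        current_cost = cost(current)
        examined = 0
        while examined < maxneighbor:                # sample neighbors dynamically
            i = rng.choice(list(current))            # medoid to swap out
            h = rng.choice([x for x in range(n) if x not in current])
            neighbor = (current - {i}) | {h}         # neighboring node differs in one medoid
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:         # move to the better neighbor
                current, current_cost, examined = neighbor, neighbor_cost, 0
            else:
                examined += 1
        if current_cost < best_cost:                 # keep the best local optimum found
            best_cost, best_node = current_cost, current
    return sorted(best_node), best_cost
```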
Hierarchical Clustering
Dendrogram: Shows How the Clusters Are Merged
Recent Hierarchical Clustering Methods
[Figure: five example points (3,4), (2,6), (4,5), (4,7), (3,8), used below to illustrate a BIRCH clustering feature]
CF-Tree in BIRCH
Clustering feature:
• summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd
moments of the subcluster from the statistical point of view
• registers crucial measurements for computing clusters and utilizes
storage efficiently
[Figure: a non-leaf CF-tree node stores entries CF1, CF2, CF3, ..., each with a pointer (child1, child2, child3, ...) to the child node it summarizes]
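As a small worked example, the clustering feature CF = (N, LS, SS), i.e. the 0-th, 1st, and 2nd moments mentioned above, can be computed for the five points shown earlier; the code below is an illustration, not BIRCH's actual data structure:

```python
import numpy as np

# the five example points from the earlier figure
points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

def clustering_feature(pts):
    """CF = (N, LS, SS): count, linear sum, and sum of squares of the points."""
    N = len(pts)                        # 0-th moment
    LS = pts.sum(axis=0)                # 1st moment (per dimension)
    SS = (pts ** 2).sum()               # 2nd moment (scalar form used here)
    return N, LS, SS

N, LS, SS = clustering_feature(points)
print(N, LS, SS)                        # 5 [16 30] 244

# CFs are additive, which is what lets a non-leaf CF entry summarize its children
N1, LS1, SS1 = clustering_feature(points[:3])
N2, LS2, SS2 = clustering_feature(points[3:])
assert N1 + N2 == N and (LS1 + LS2 == LS).all() and SS1 + SS2 == SS
```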
ROCK (RObust Clustering using linKs)
Major ideas:
• Use links to measure similarity/proximity
• Not distance-based
• Computational complexity:
Algorithm: sampling-based clustering
• Draw a random sample
• Cluster with links
• Label data on disk
Similarity Measure in ROCK
Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
Example: Two groups (clusters) of transactions
• C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e},
{a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Jaccard coefficient may lead to a wrong clustering result
• within C1: ranges from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
• across C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
Jaccard coefficient-based similarity function: sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
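The Jaccard values quoted above are easy to verify; a minimal Python check (the helper name jaccard() is mine):

```python
def jaccard(t1, t2):
    """Jaccard coefficient: |T1 ∩ T2| / |T1 ∪ T2|."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

print(jaccard({'a', 'b', 'c'}, {'b', 'd', 'e'}))  # 0.2  (both transactions are in C1)
print(jaccard({'a', 'b', 'c'}, {'a', 'b', 'd'}))  # 0.5  (both transactions are in C1)
print(jaccard({'a', 'b', 'c'}, {'a', 'b', 'f'}))  # 0.5  (one from C1, one from C2)
```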
Link Measure in ROCK
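In ROCK, two transactions are neighbors when their similarity is at least a threshold θ, and link(T1, T2) is the number of common neighbors the two transactions share. A minimal sketch of this link computation on the C1/C2 example, using the Jaccard coefficient with an illustrative θ = 0.5:

```python
from itertools import combinations

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, sim, theta=0.5):
    """link(Ti, Tj) = number of common neighbors; Ti, Tj are neighbors if sim >= theta."""
    n = len(transactions)
    neighbors = [
        {j for j in range(n) if j != i and sim(transactions[i], transactions[j]) >= theta}
        for i in range(n)
    ]
    return {(i, j): len(neighbors[i] & neighbors[j]) for i, j in combinations(range(n), 2)}

c1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
c2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
link_counts = links(c1 + c2, jaccard)
print(link_counts[(0, 1)])    # 5 common neighbors: two C1 transactions are strongly linked
print(link_counts[(0, 10)])   # 3 common neighbors: a C1 and a C2 transaction are less linked
```

Unlike the pairwise Jaccard value, the link count looks at the neighborhood structure, so the cross-cluster pair scores lower even though its Jaccard similarity was also 0.5.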