Lecture 18
[Figure: bottom-up versus top-down hierarchical clustering]
Hierarchical Clustering
• Two broad approaches: agglomerative (bottom-up) and divisive (top-down).
Hierarchical Agglomerative Clustering: Linkage Methods
• The single linkage method is based on minimum
distance, or the nearest neighbor rule.
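Linkage Methods of Clustering
• Single linkage: the minimum distance between a point in Cluster 1 and a point in Cluster 2.
• Complete linkage: the maximum distance between a point in Cluster 1 and a point in Cluster 2.
• Average linkage: the average distance over all pairs of points, one from Cluster 1 and one from Cluster 2.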
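The three linkage distances can be illustrated with a minimal sketch. It assumes Euclidean distance between points and clusters given as lists of coordinate tuples; the function names and the sample clusters are illustrative and do not come from the lecture.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b):
    """Minimum distance over all cross-cluster pairs (nearest neighbor)."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Maximum distance over all cross-cluster pairs (farthest neighbor)."""
    return max(dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


# Hypothetical clusters on a line, chosen so the distances are easy to check.
cluster1 = [(0.0, 0.0), (1.0, 0.0)]
cluster2 = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(cluster1, cluster2))    # 3.0
print(complete_linkage(cluster1, cluster2))  # 6.0
print(average_linkage(cluster1, cluster2))   # 4.5
```

The four cross-cluster distances here are 4, 6, 3 and 5, so the three methods pick out 3, 6 and 4.5 respectively.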
• Yet another distance between clusters is,
Dendrogram
• The single-link method can be seen as a graph-based method.
• Nodes are the data points.
• Every pair of points is joined by an edge whose cost is the distance between them.
• Single-link clustering is equivalent to minimum spanning tree clustering.
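A minimal sketch of the graph view, assuming Euclidean distances and a requested number of clusters k: running Kruskal's minimum spanning tree algorithm and stopping once k components remain gives the same partition as cutting the k - 1 heaviest edges of the full spanning tree. The function name and the sample points are illustrative only.

```python
from itertools import combinations
from math import dist


def single_link_clusters(points, k):
    """Single-link clustering via Kruskal's algorithm, stopped at k components."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Complete graph: one edge per pair of points, sorted by distance.
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))

    components = len(points)
    for i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:          # the edge joins two different clusters
            parent[ri] = rj
            components -= 1

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters = single_link_clusters(points, 3)
print(clusters)
# [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11)], [(20, 0)]]
```

Sorting all pairs costs O(n^2 log n), which is acceptable for a small illustration but not for large data sets.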
Minimum spanning tree clustering
Single-link Vs. Complete-link
Single-link is sensitive to noise, but it handles
arbitrarily shaped clusters well.
AGNES (Agglomerative Nesting)
DIANA (Divisive Analysis)
More on Hierarchical Clustering Methods
dynamic modeling
BIRCH (1996)
■ BIRCH: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD’96)
■ Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
■ Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
■ Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree
■ Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
■ Weakness: handles only numeric data and is sensitive to the
order of the data records.
Clustering Feature Vector
Example data points: (3,4), (2,6), (4,5), (4,7), (3,8)
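A minimal sketch of a clustering feature computed for the five points above. It assumes the CF = (N, LS, SS) notation used elsewhere in these notes, with N the number of points, LS the per-dimension linear sum and SS the per-dimension sum of squares. Keeping SS per dimension is an assumption of this sketch; some presentations store a single scalar instead, which here would be 54 + 190 = 244. The function name is illustrative.

```python
def clustering_feature(points):
    """Return (N, LS, SS) for a list of equal-length coordinate tuples."""
    n = len(points)
    dims = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dims))       # linear sum
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))  # square sum
    return n, ls, ss


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(points))  # (5, (16, 30), (54, 190))
```

The centroid follows directly from the summary as LS / N = (3.2, 6.0), without revisiting the individual points.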
CF Additive Theorem
● Suppose cluster C1 has CF1=(N1, LS1 ,SS1), cluster
C2 has CF2 =(N2,LS2,SS2)
● If we merge C1 with C2, the CF for the merged
cluster C is CF = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
● Why CF?
● Summarized info for single cluster
● Summarized info for two clusters
● Additive theorem
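The additive property can be checked numerically: adding the CFs of two disjoint clusters component by component should give the same result as computing the CF of their union from scratch. This is a minimal sketch; the helper names `cf` and `merge_cf`, the per-dimension SS, and the split of the sample points into two clusters are assumptions made for illustration.

```python
def cf(points):
    """CF = (N, LS, SS) with per-dimension linear and square sums."""
    dims = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dims))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))
    return len(points), ls, ss


def merge_cf(cf1, cf2):
    """Merge two CFs by component-wise addition."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))


c1 = [(3, 4), (2, 6), (4, 5)]
c2 = [(4, 7), (3, 8)]

merged = merge_cf(cf(c1), cf(c2))
print(merged)                    # (5, (16, 30), (54, 190))
print(merged == cf(c1 + c2))     # True
```

Because merging needs only the two summaries, clusters can be combined without touching the original data points.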
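CF Tree
[Figure: a CF tree. The root holds entries CF1, CF2, CF3, ..., CF6, each pointing to a child (child1, ..., child6). A non-leaf node holds entries CF1, CF2, CF3, ..., CF5, each pointing to a child (child1, ..., child5).]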
• A CF tree is a height-balanced tree that stores
the CFs in its nodes.
• Nonleaf nodes store sums of the CFs of their
children.
– Thus each nonleaf node summarizes the clustering information of its children.
• A CF tree has two parameters: branching
factor B, and threshold T.
• B is the maximum number of children a nonleaf
node can have.
• Threshold T is the maximum diameter of
subclusters stored at the leaf nodes of the
tree.
• A CF tree is built incrementally.
• An object is inserted into the closest leaf
entry (subcluster).
• If the diameter of the subcluster stored in the
leaf node after the insertion is larger than T,
the leaf node is split.
– This can result in splitting of the parent node(s).
• The procedure is similar to B+-tree insertion.
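The threshold test during insertion can be sketched as follows. This is a deliberately simplified illustration, not the full algorithm: it keeps a flat list of leaf entries instead of a height-balanced tree, and when absorbing a point would push the diameter above T it opens a new subcluster rather than splitting a node. It also assumes a scalar SS (sum of squared norms), Euclidean distance to the entry centroid as the closeness measure, and the diameter defined as the root of the average squared pairwise distance. All function names are hypothetical.

```python
from math import dist, sqrt


def diameter(entry):
    """Diameter from a CF (N, LS, SS): sqrt((2*N*SS - 2*|LS|^2) / (N*(N-1)))."""
    n, ls, ss = entry
    if n < 2:
        return 0.0
    ls_sq = sum(v * v for v in ls)
    return sqrt(max(0.0, (2 * n * ss - 2 * ls_sq) / (n * (n - 1))))


def centroid(entry):
    n, ls, _ = entry
    return tuple(v / n for v in ls)


def absorb(entry, point):
    """CF of the subcluster after adding one point (additive update)."""
    n, ls, ss = entry
    return (n + 1,
            tuple(a + b for a, b in zip(ls, point)),
            ss + sum(v * v for v in point))


def insert(entries, point, threshold):
    """Add point to the closest entry if its diameter stays within threshold."""
    if entries:
        i = min(range(len(entries)),
                key=lambda k: dist(centroid(entries[k]), point))
        candidate = absorb(entries[i], point)
        if diameter(candidate) <= threshold:
            entries[i] = candidate
            return
    # Simplification: start a new subcluster instead of splitting a tree node.
    entries.append((1, tuple(point), sum(v * v for v in point)))


entries = []
for p in [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0)]:
    insert(entries, p, threshold=2.0)

print([e[0] for e in entries])  # [3, 2]
```

Because only the CF of each entry is stored, the threshold check needs no access to the points already absorbed. Feeding the same points in a different order can yield different subclusters, which reflects the order sensitivity noted for BIRCH.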