Lecture 18

Hierarchical Clustering

• Two broad approaches: agglomerative (bottom-up) and divisive (top-down).

[Figure: agglomerative clustering (AGNES) works bottom-up over steps 0-4, merging a and b into {a,b}, d and e into {d,e}, then c with {d,e} into {c,d,e}, and finally everything into {a,b,c,d,e}; divisive clustering (DIANA) runs the same sequence top-down, from step 4 back to step 0.]

Hierarchical Agglomerative Clustering: Linkage Methods

• The single linkage method is based on minimum distance, or the nearest-neighbor rule.
• The complete linkage method is based on maximum distance, or the furthest-neighbor rule.
• In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one from each cluster. (A small numeric sketch follows the figure below.)
Linkage Methods of Clustering

[Figure: single linkage uses the minimum distance between a point in Cluster 1 and a point in Cluster 2; complete linkage uses the maximum such distance; average linkage uses the average distance over all cross-cluster pairs.]
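To make the three criteria concrete, here is a minimal sketch (assuming NumPy is available; the two clusters are made-up examples) that computes the single, complete, and average linkage distances between two small clusters:

import numpy as np

def linkage_distances(c1, c2):
    # all pairwise distances between points of cluster 1 and cluster 2
    d = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)
    return d.min(), d.max(), d.mean()   # single, complete, average

cluster1 = np.array([[1.0, 2.0], [2.0, 2.0]])   # made-up points
cluster2 = np.array([[8.0, 8.0], [9.0, 9.0]])
print(linkage_distances(cluster1, cluster2))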
• Yet another distance between clusters is the centroid distance: the distance between the centroids (mean vectors) of the two clusters.
Dendrogram

• A dendrogram is a tree diagram that records the sequence of merges (or splits); cutting the tree at a chosen dissimilarity level yields a clustering of the data.
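As an illustration, a dendrogram can be produced with SciPy and Matplotlib (a minimal sketch; the five points a-e are made up):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [5, 6]])
Z = linkage(X, method="single")          # also: "complete", "average"
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.show()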
• The single-link method can be seen as a graph-based method.
• Nodes are the data points.
• Every pair of points has an edge whose cost is the distance between them.
• Single-link clustering is equivalent to minimum spanning tree clustering: build the MST of this graph and cut its longest edges, as sketched below.

Minimum spanning tree clustering
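A sketch of this equivalence, assuming SciPy is available: build the MST of the complete distance graph, remove the k - 1 longest edges, and read the clusters off as the connected components that remain.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_single_link(X, k):
    D = squareform(pdist(X))                   # complete graph of pairwise distances
    mst = minimum_spanning_tree(D).toarray()   # n-1 edges; entry 0 means no edge
    edges = np.sort(mst[mst > 0])
    if k > 1:
        cutoff = edges[-(k - 1)]
        mst[mst >= cutoff] = 0                 # cut the k-1 longest edges (ties may cut more)
    _, labels = connected_components(mst, directed=False)
    return labels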
Single-link vs. Complete-link

• Single link is sensitive to noise, but is good with arbitrarily shaped clusters.
• Complete link is less sensitive to noise, but favors compact, globular clusters.
AGNES (Agglomerative Nesting)

• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Uses the single-link method and the dissimilarity matrix
• Merges the nodes that have the least dissimilarity (see the sketch after this list)
• Goes on in a non-descending fashion
• Eventually all nodes belong to the same cluster
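A naive AGNES sketch under these assumptions (single link over a precomputed dissimilarity matrix, plain O(n³) search; the function name is made up):

def agnes_single_link(D):
    # D: n x n dissimilarity matrix; returns the sequence of merges
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: least dissimilarity over all cross pairs
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))  # d is non-descending
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges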
DIANA (Divisive Analysis)

• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Works in the inverse order of AGNES
• The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster
More on Hierarchical Clustering Methods

■ Major weaknesses of agglomerative clustering methods
  ■ do not scale well: time complexity of at least O(n²), where n is the total number of objects
  ■ can never undo what was done previously

■ Integration of hierarchical with distance-based clustering
  ■ BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  ■ CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  ■ CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
■ Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
■ Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
■ Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
■ Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree
■ Scales linearly: finds a good clustering with a single scan
and improves the quality with a few additional scans
■ Weakness: handles only numeric data, and is sensitive to the order of the data records.
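For reference, scikit-learn ships a BIRCH implementation; a minimal usage sketch (the data and parameter values are illustrative only):

import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(200, 2)          # made-up numeric data
model = Birch(threshold=0.1, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)       # with n_clusters set, leaf subclusters are
                                    # clustered globally, akin to Phase 2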
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

• N: the number of data points in the cluster
• LS: the linear sum of the N points, LS = Σᵢ Xᵢ
• SS: the square sum of the N points (per dimension), SS = Σᵢ Xᵢ²

Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190)).
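A minimal sketch (NumPy assumed) that recomputes the CF of the five example points above, keeping SS per dimension as in the example:

import numpy as np

points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])
N = len(points)                     # 5
LS = points.sum(axis=0)             # linear sum  -> [16 30]
SS = (points ** 2).sum(axis=0)      # square sum  -> [54 190]
print(N, LS, SS)                    # CF = (5, (16, 30), (54, 190))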
CF Additive Theorem

• Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2).
• If we merge C1 with C2, the CF for the merged cluster C is
  CF = CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
• Why CF?
  • Summarized info for a single cluster
  • Summarized info for two clusters
  • Additive theorem: merges can be computed from the summaries alone
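The additive theorem in code: merging needs only the two summaries, never the raw points. A sketch (merge_cf is a made-up helper; LS and SS are NumPy arrays, and the second CF is fabricated for illustration):

import numpy as np

def merge_cf(cf1, cf2):
    # CF of the union of two disjoint clusters is the componentwise sum
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

cf1 = (5, np.array([16, 30]), np.array([54, 190]))   # from the example above
cf2 = (2, np.array([10, 10]), np.array([52, 50]))    # made-up second cluster
print(merge_cf(cf1, cf2))   # (7, array([26, 40]), array([106, 240]))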
CF Tree

[Figure: a CF tree. The root holds entries CF1 … CF6, each with a child pointer (child1 … child6); a non-leaf node holds entries CF1 … CF5 with child pointers; leaf nodes hold CF entries and are chained to neighboring leaves with prev/next pointers.]
• A CF tree is a height-balanced tree that stores the CFs in its nodes.
• Nonleaf nodes store the sums of the CFs of their children, and thus summarize the clustering information about their children.
• A CF tree has two parameters: the branching factor B and the threshold T.
  – B is the maximum number of children a nonleaf node can have.
  – T is the maximum diameter of the subclusters stored at the leaf nodes of the tree, and it can be checked from the CFs alone, as sketched below.
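This is why the threshold test is cheap: the diameter of a subcluster can be computed from its CF alone. A sketch (NumPy assumed; the per-dimension SS above is collapsed to the scalar sum of squared norms):

import numpy as np

def diameter(cf):
    # average pairwise distance: the sum over pairs of ||Xi - Xj||^2
    # equals 2*N*SS - 2*||LS||^2, divided by N*(N-1)
    n, ls, ss = cf
    if n < 2:
        return 0.0
    ss_scalar = float(np.sum(ss))
    return float(np.sqrt((2 * n * ss_scalar - 2 * np.dot(ls, ls)) / (n * (n - 1))))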
• A CF tree is built incrementally.
• An object is inserted into the closest leaf entry (subcluster).
• If the diameter of the subcluster stored in the leaf node after the insertion is larger than T, the leaf node is split.
  – This can result in splitting of the parent node(s), much like B+-tree insertion.
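A simplified sketch of the leaf step, reusing the merge_cf and diameter helpers above (names and structure are made up; a real CF tree would also split nodes that exceed the branching factor B and propagate splits upward):

import numpy as np

def insert_into_leaf(entries, x, T):
    # absorb point x into the closest leaf entry if the diameter stays <= T,
    # otherwise start a new entry in this leaf
    x = np.asarray(x, dtype=float)
    point_cf = (1, x, x ** 2)
    if entries:
        # closest entry = smallest distance from x to the entry centroid LS/N
        i = min(range(len(entries)),
                key=lambda j: np.linalg.norm(x - entries[j][1] / entries[j][0]))
        merged = merge_cf(entries[i], point_cf)
        if diameter(merged) <= T:      # threshold test, from the CFs alone
            entries[i] = merged
            return
    entries.append(point_cf)           # new subcluster; may trigger a leaf split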
