Single Link Example
For single link (MIN), the proximity of two clusters is defined as the minimum of the
distances between any two points, one from each of the two clusters.
Using graph terminology: we start with all points, each one a separate cluster on its
own (called a singleton cluster), and then add links between points one at a time,
shortest links first. These single links combine the points into clusters, i.e. the two
points with the shortest distance between them are combined into a cluster first, then
the pair with the next shortest distance, and so on.
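As a minimal sketch of this MIN rule (an illustration, not code from the lecture), assuming each cluster is represented as a list of (x, y) tuples and plain Euclidean distance is used:

    from math import dist  # Euclidean distance between two coordinate tuples (Python 3.8+)

    def single_link_distance(cluster_a, cluster_b):
        # Single link (MIN): the smallest distance between any point of one
        # cluster and any point of the other.
        return min(dist(p, q) for p in cluster_a for q in cluster_b)

    # Example: two small clusters of 2-D points.
    print(single_link_distance([(0.0, 0.0), (1.0, 0.0)], [(3.0, 0.0), (4.0, 0.0)]))  # 2.0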
[Figure: points A, B, C, D, E linked at distance thresholds 1, 2 and 3, and the resulting
dendrogram with distance on the vertical axis and the points A, B, C, D, E on the
horizontal axis.]
Example
Problem: Assume that the database D is given by the table below. Follow the single link
technique to find the clusters in D. Use the Euclidean distance measure.
D
x y
p1 0.40 0.53
p2 0.22 0.38
p3 0.35 0.32
p4 0.26 0.19
p5 0.08 0.41
p6 0.45 0.30
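For readers who want to follow along in code, the table can be written as a small Python dictionary (this is only a convenience for the snippets below, not part of the original problem statement):

    # The database D, keyed by point name: name -> (x, y)
    D = {
        "p1": (0.40, 0.53),
        "p2": (0.22, 0.38),
        "p3": (0.35, 0.32),
        "p4": (0.26, 0.19),
        "p5": (0.08, 0.41),
        "p6": (0.45, 0.30),
    }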
Solution:
Step 1. Plot the objects in n-dimensional space (where n is the number of attributes). In
our case we have 2 attributes – x and y, so we plot the objects p1, p2, … p6 in 2-
dimensional space:
[Figure: the six points p1, p2, ... p6 plotted in 2-dimensional (x, y) space, with both
axes running from 0 to 0.6.]
Step 2. Calculate the distance from each object (point) to all other points, using the
Euclidean distance measure, and place the numbers in a distance matrix.
We recall from the previous lecture that the formula for the Euclidean distance between
two points i and j with n attributes is:

dist(i, j) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2 )

In our case we only have 2 attributes, so the Euclidean distance between our points
p1 and p2, which have attributes x and y, is calculated as follows:

dist(p1, p2) = sqrt( (0.40 - 0.22)^2 + (0.53 - 0.38)^2 )
             = sqrt( 0.0324 + 0.0225 )
             = sqrt( 0.0549 )
             = 0.2343
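The same calculation in Python, as a small check of the arithmetic above:

    from math import sqrt

    def euclidean(p, q):
        # Euclidean distance between two points given as coordinate tuples.
        return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    print(round(euclidean((0.40, 0.53), (0.22, 0.38)), 4))  # 0.2343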
Distance matrix

     p1     p2     p3     p4     p5     p6
p1   0
p2   0.24   0
p3   0.22   0.15   0
p4   0.37   0.20   0.15   0
p5   0.34   0.14   0.28   0.29   0
p6   0.23   0.25   0.11   0.22   0.39   0
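The whole matrix can be generated with a short loop, reusing D and euclidean() from the snippets above; the printed values agree with the table up to rounding in the last digit:

    # Lower-triangular distance matrix, rounded to 2 decimals.
    names = list(D)  # ["p1", ..., "p6"]
    for i, a in enumerate(names):
        row = [round(euclidean(D[a], D[b]), 2) for b in names[: i + 1]]
        print(a, row)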
Step 3. Identify the two clusters with the shortest distance in the matrix and merge them
together. Re-compute the distance matrix, since those two clusters are now a single
cluster and no longer exist by themselves.
Looking at the distance matrix above, we see that p3 and p6 have the smallest distance
of all, 0.11. So, we merge those two into a single cluster and re-compute the distance
matrix.
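Finding the closest pair programmatically, a sketch that again reuses D and euclidean() from above:

    from itertools import combinations

    # The pair of points with the smallest distance is merged first.
    closest = min(combinations(D, 2), key=lambda pair: euclidean(D[pair[0]], D[pair[1]]))
    print(closest)  # ('p3', 'p6')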
[Figure: left, the points in (x, y) space with p3 and p6 grouped into one cluster; right,
the dendrogram (distance axis 0.05 to 0.30) with p3 and p6 joined.]
Distance matrix

           p1     p2     (p3, p6)  p4     p5
p1         0
p2         0.24   0
(p3, p6)   0.22   0.15   0
p4         0.37   0.20   0.15      0
p5         0.34   0.14   0.28      0.29   0
Since we have merged (p3, p6) together into a cluster, we now have one entry for
(p3, p6) in the table and no longer have p3 or p6 separately. Therefore, we need to
re-compute the distance from each point to our new cluster (p3, p6).
We recall that, with the single link method, the proximity of two clusters is defined
as the minimum of the distance between any two points in the two clusters.
Therefore, the distance between, let's say, (p3, p6) and p1 is calculated as follows:

dist( (p3, p6), p1 ) = MIN ( dist(p3, p1) , dist(p6, p1) )
                     = MIN ( 0.22 , 0.23 )   //from original matrix
                     = 0.22
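In code, the single-link re-computation for the new cluster (p3, p6) against p1 looks like this (again reusing D and euclidean() from the earlier snippets):

    # Single link: distance from the merged cluster (p3, p6) to p1 is the
    # minimum of the two point-to-point distances.
    d = min(euclidean(D["p3"], D["p1"]), euclidean(D["p6"], D["p1"]))
    print(round(d, 2))  # 0.22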
Looking at the last distance matrix above, we see that p2 and p5 have the smallest
distance of all, 0.14. So, we merge those two into a single cluster and re-compute the
distance matrix.
[Figure: left, the points in (x, y) space with the clusters (p3, p6) and (p2, p5) marked;
right, the dendrogram (distance axis 0.05 to 0.30) with leaves 3, 6, 2, 5.]
Distance matrix

           p1     (p2, p5)  (p3, p6)  p4
p1         0
(p2, p5)   0.24   0
(p3, p6)   0.22   0.15      0
p4         0.37   0.20      0.15      0

Since we have merged (p2, p5) together into a cluster, we now have one entry for
(p2, p5) in the table and no longer have p2 or p5 separately. Therefore, we need to
re-compute the distance from all other points / clusters to our new cluster (p2, p5).
The distance between (p3, p6) and (p2, p5) is calculated as follows:
dist( (p3, p6), (p2, p5) ) = MIN ( dist(p3, p2) , dist(p6, p2), dist(p3, p5), dist(p6, p5) )
                           = MIN ( 0.15 , 0.25, 0.28, 0.39 )   //from original matrix
                           = 0.15
[Figure: left, the points in (x, y) space with (p3, p6) and (p2, p5) merged into one
cluster; right, the dendrogram with leaves 3, 6, 2, 5 showing the two pairs joined.]
Distance matrix

                   p1     (p2, p5, p3, p6)  p4
p1                 0
(p2, p5, p3, p6)   0.22   0
p4                 0.37   0.15              0
Looking at the last distance matrix above, we see that (p2, p5, p3, p6) and p4 have the
smallest distance of all, 0.15. So, we merge those two into a single cluster and
re-compute the distance matrix.

[Figure: left, the points in (x, y) space with p4 joined to the cluster (p2, p5, p3, p6);
right, the dendrogram with leaves 3, 6, 2, 5, 4.]
Distance matrix

                       p1     (p2, p5, p3, p6, p4)
p1                     0
(p2, p5, p3, p6, p4)   0.22   0
Finally, only two clusters remain, p1 and (p2, p5, p3, p6, p4), at distance 0.22 from each
other, so they are merged and all points end up in a single cluster.

[Figure: all points grouped into one cluster in (x, y) space, and the complete dendrogram
with leaves 3, 6, 2, 5, 4, 1.]
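The whole procedure can be put into one small loop that repeats the merge step until a single cluster remains, printing each merge and the distance at which it happened. This is a self-contained sketch, not the lecture's own code. Note that p2 is essentially as close to p3 as it is to p5 (both distances are sqrt(0.0205), about 0.143), so the order of the middle merges can depend on tie-breaking, even though the merge distances are the same.

    from math import dist

    D = {"p1": (0.40, 0.53), "p2": (0.22, 0.38), "p3": (0.35, 0.32),
         "p4": (0.26, 0.19), "p5": (0.08, 0.41), "p6": (0.45, 0.30)}

    def single_link(c1, c2):
        # MIN over all cross-cluster point pairs.
        return min(dist(D[a], D[b]) for a in c1 for b in c2)

    # Start with every point in its own (singleton) cluster.
    clusters = [[name] for name in D]

    # Repeatedly merge the two closest clusters, recording the merge distance
    # (these distances are the heights of the joins in the dendrogram).
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        d = single_link(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        print(merged, round(d, 2))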
In the example above, we ended by merging all points into a single cluster. Of course,
that is not the goal of the user. The user would like the data partitioned into several
clusters for unsupervised learning purposes. Therefore, the algorithm has to stop
clustering at some point: either the user specifies the number of clusters he/she would
like to have, or the algorithm has to make the decision on its own.
In either case, it is a good strategy to let the algorithm run until the end (until all
points are merged into one cluster) and record the distance at which each merge was made
as it goes. (We recall that those distances are shown in the dendrogram.) We will notice
that the distances increase as we merge more points into each cluster. We can use that to
monitor for a large jump in the dendrogram, i.e. a point where the merge distance suddenly
increases dramatically. Such a jump tells us that the cluster we are about to merge is
very far away from the current one, which is a good indication that the two probably
should not be merged, since their objects have large dissimilarities. The algorithm can
use this strategy to decide on its own how many clusters should stay separate. The same
information can be used to decide which clusters should stay separate when the user has
provided the number of clusters he/she would like to have.
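As an illustration of this "large jump" idea, here is a small sketch using SciPy's hierarchical clustering routines. It assumes SciPy and NumPy are available, and the cut rule shown (cutting inside the largest gap between successive merge distances) is only one possible way to turn the heuristic into code:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                       [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

    Z = linkage(points, method="single")   # single link merge tree
    heights = Z[:, 2]                      # distance at which each merge happened
    gaps = np.diff(heights)                # jumps between successive merge distances
    cut = heights[gaps.argmax()] + gaps.max() / 2   # cut inside the largest jump
    labels = fcluster(Z, t=cut, criterion="distance")
    print(heights.round(2))                # the increasing merge distances
    print(labels)                          # cluster label for each of p1..p6

For this small data set the largest jump comes just before the final merge, so this rule would keep p1 in a cluster of its own and place the other five points together.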