M4 DM Clustering Part I
Clustering Algorithms
Priya R L
Faculty Incharge for CSC 504
Department of Computer Engineering
VES Institute of Technology, Mumbai
● Step 1: Assign each object to the nearest centroid, using Euclidean distance as the proximity measure.
● Step 2: Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}; their new centroids are recomputed as the means of these clusters.
● Step 3: Reassign the objects to the new centroids and repeat until cluster membership no longer changes.
[Figure/table: sample points plotted in 2-D; recoverable coordinates: B (2, 1), C (4, 3), D (5, 4); A's coordinates are not recoverable]
K-Means Algorithm: Example
● Step 1: Use initial seed points for partitioning; each object (A, B, C, D) is assigned to its nearest seed by Euclidean distance.
[Figure: points A, B, C, D partitioned around the initial seeds]
● Step 2: Compute new centroids of the current partition.
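A minimal NumPy sketch of the two-step loop just described (assign each object to the nearest centroid, then recompute centroids); the function name and the sample points are illustrative, not the slide's dataset:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: use k randomly chosen objects as the initial seed points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its current cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # membership has stabilized
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D objects; two groups are visually apparent.
X = np.array([[1.0, 2.0], [2.0, 2.0], [1.5, 1.8],
              [6.0, 5.0], [7.0, 6.0], [6.5, 5.5]])
print(kmeans(X, k=2))
```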
• Usually Euclidean distance (the L2 norm) is the best measure when the object points are defined in n-dimensional Euclidean space.
• Another measure, cosine similarity, is more appropriate when the objects are documents.
• Further, other types of proximity measures may be appropriate in the context of particular applications.
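A small sketch contrasting the two measures; the two "document" vectors are hypothetical term-frequency vectors, chosen so the effect is visible:

```python
import numpy as np

def euclidean(x, y):
    # L2 norm of the difference: natural for points in n-dimensional space.
    return np.linalg.norm(x - y)

def cosine_similarity(x, y):
    # Angle between the vectors: natural for documents, where orientation
    # (relative term frequencies) matters more than magnitude (length).
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

doc1 = np.array([3.0, 0.0, 2.0])  # hypothetical term frequencies
doc2 = np.array([6.0, 0.0, 4.0])  # same topic mix, document twice as long
print(euclidean(doc1, doc2))          # ~3.61: "far apart" under L2
print(cosine_similarity(doc1, doc2))  # 1.0: identical orientation
```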
● Weakness
○ Applicable only when the mean is defined; then what about categorical data?
○ k-Means finds a local optimum and may fail to reach the global optimum.
○ The k-Means algorithm does not really overcome the scalability issue (and is not so practical for large databases).
Fig 16.6: Some failure instances of the k-Means algorithm, e.g., non-convex shaped clusters.
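To reproduce such a failure, the sketch below (assuming scikit-learn is available; the two-moons data is an illustrative stand-in for the figure's non-convex clusters) runs k-Means on two interleaved crescents:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clusters that are not convex.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# k-Means separates the plane with a straight boundary between the two
# centroids, cutting each crescent in half, so agreement with the true
# clusters is poor (well below a perfect score of 1.0).
print(adjusted_rand_score(y_true, labels))
```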
What is the problem with the k-Means method?
● The k-means algorithm is sensitive to outliers!
○ An object with an extremely large value may substantially distort the distribution of the data.
● K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
[Figure: the K-Medoids swap procedure, K = 2]
1. Arbitrarily choose k objects as the initial medoids (Total Cost = 20).
2. Assign each remaining object to the nearest medoid.
3. Randomly select a non-medoid object, O_random.
4. Compute the total cost of swapping a medoid O with O_random (Total Cost = 26).
5. If the quality is improved, swap O and O_random; repeat until there is no change.
K-Medoids: Example
● The dissimilarity between an object Pi and its medoid Ci is calculated as E = |Pi - Ci|.
● Given a dataset D of n (n = 10) objects.
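Since the slide's ten objects are not shown, here is a sketch with a hypothetical one-dimensional dataset, illustrating how E = |Pi - Ci| accumulates into the total cost that K-Medoids tries to minimize:

```python
# Hypothetical 1-D dataset of n = 10 objects (not the slide's actual data).
D = [2, 3, 4, 6, 7, 8, 9, 12, 15, 25]

def total_cost(data, medoids):
    # Sum of E = |Pi - Ci| over all objects, where Ci is Pi's nearest medoid.
    return sum(min(abs(p - c) for c in medoids) for p in data)

print(total_cost(D, medoids=[4, 9]))   # cost of one candidate medoid pair
print(total_cost(D, medoids=[4, 12]))  # a cheaper pair; K-Medoids keeps it
```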
Contrast: k-Means vs. k-Medoids
● k-Means: algorithms using means are not computationally demanding.
● k-Medoids: algorithms using medoids are more computationally demanding.
● Agglomerative
○ Initially each item in its own cluster
○ Iteratively clusters are merged together
○ Bottom Up
● Divisive
○ Initially all items in one cluster
○ Large clusters are successively divided
○ Top Down
d(*,*)  A    B    C    D    E
A       0    0.1  0.8  0.7  1.0
B       0.1  0    0.5  0.6  0.9
C       0.8  0.5  0    0.3  0.4
D       0.7  0.6  0.3  0    0.2
E       1.0  0.9  0.4  0.2  0
[Figure: dendrogram of the Divisive Steps (DIANA); D and E merge at height 0.2, C joins at 0.3]
Hierarchical Algorithms
● Single Link
● Complete Link
● Average Link
Inter-Cluster Distance
• Common measures:
• Minimum distance (single link).
• Maximum distance (complete link).
• Average distance.
• Mean distance.
Single link
• In single-link hierarchical clustering, we merge at each step the two clusters whose two closest members have the smallest distance.
Complete link
• In complete-link hierarchical clustering, we merge at each step the two clusters whose two farthest members have the smallest distance.
• Consider the following distance matrix. Apply Single Link, Complete Link and Average Link.

Items  A  B  C  D  E
A      0  1  2  2  3
B      1  0  2  4  3
C      2  2  0  1  5
D      2  4  1  0  3
E      3  3  5  3  0
[Figure: the intermediate reduced distance matrices and merge steps for the single-link, complete-link and average-link solutions]
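The exercise can be checked mechanically with SciPy (assuming it is available): `linkage` accepts the condensed form of the matrix above and supports all three criteria:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# The exercise's distance matrix over items A..E.
D = np.array([
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
], dtype=float)

condensed = squareform(D)  # linkage() expects the condensed (vector) form
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    print(method)
    print(np.round(Z, 2))  # row: (cluster i, cluster j, merge distance, new size)
```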
Use the single-link technique to find clusters in the given data.

Object  X    Y
A       2    2
B       3    2
C       1    1
D       3    1
E       1.5  0.5

Euclidean distance matrix:

   A     B     C     D     E
A  0
B  1     0
C  1.41  2.24  0
D  1.41  1     2     0
E  1.58  2.12  0.71  1.58  0
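The matrix can be reproduced from the coordinates; a short SciPy check (assuming SciPy is available):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])  # A, B, C, D, E

# Pairwise Euclidean distances, matching the matrix above
# (e.g. d(C, E) = sqrt(0.5^2 + 0.5^2) ≈ 0.71).
D = squareform(pdist(pts, metric="euclidean"))
print(np.round(D, 2))
```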
Agglomerative Clustering: Example
● SINGLE LINK
● Step 1: Since d(C, E) = 0.71 is the minimum, we combine C and E into one cluster.

     A     B     C,E   D
A    0
B    1     0
C,E  1.41  2.12  0
D    1.41  1     1.58  0
● Step 2: Now d(A, B) = 1 is the minimum, so we combine A and B.

     A,B   C,E   D
A,B  0
C,E  1.41  0
D    1     1.58  0
● Step 3: Now d({A,B}, D) = 1 is the minimum, so D joins the cluster {A, B}.

       A,B,D  C,E
A,B,D  0
C,E    1.41   0
● Step 4: We are left with two clusters, {A, B, D} and {C, E}, which are combined at distance 1.41.
● Construct a dendrogram for the same: C and E merge at height 0.71, A and B at 1, D joins {A, B} at 1, and the final two clusters join at 1.41.
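The merge heights found above (0.71, 1, 1, 1.41) can be confirmed with SciPy's single-link agglomeration, and the same linkage matrix can be fed to `dendrogram` for plotting (assuming SciPy and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

pts = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])  # A..E

# Single-link agglomeration; rows of Z are (cluster i, cluster j, height, size).
Z = linkage(pdist(pts), method="single")
print(np.round(Z, 2))  # heights: 0.71 (C,E), 1.0 (A,B), 1.0 (+D), 1.41 (final)

dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.show()
```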