13 Clustering Techniques
Topics to be covered…
Introduction to clustering
Clustering techniques
Partitioning algorithms
Hierarchical algorithms
Density-based algorithms
Clustering Techniques (taxonomy):
• Hierarchical methods
  • Divisive methods: DIANA [1990]
  • Agglomerative methods: AGNES [1990], BIRCH [1996], CURE [1998], ROCK [1999], Chameleon [1999]
• Density-based methods: STING [1997], DBSCAN [1996], DENCLUE [1998], OPTICS [1999], CLIQUE [1998], Wave Cluster [1998]
• Model-based clustering: EM Algorithm [1977], AutoClass [1996], COBWEB [1987], ANN Clustering [1982, 1989]
Hierarchical methods
• DIANA (divisive algorithm)
• AGNES (agglomerative algorithm)
• ROCK
Density-based methods
• DBSCAN
3. Compute the “cluster centers” of each cluster. These become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop.
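As a minimal sketch (not part of the original slides), the loop implied by these steps could look as follows in Python with NumPy; the function name, the random initialization assumed for step 1, and the convergence test on centroid movement are illustrative assumptions:

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-Means: assign objects to the nearest centroid (L2 norm),
    then recompute each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    # Step 1 (assumed): pick k distinct objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the cluster centers as the mean of each cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move (convergence criterion).
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```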
A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1
[Figure: scatter plot of the 16 objects on the A1-A2 plane]
• Let us consider the Euclidean distance measure (L2 Norm) as the distance
measurement in our illustration.
• Let d1, d2 and d3 denote the distances from an object to c1, c2 and c3, respectively. The distance calculations are shown in Table 16.2.
• The assignment of each object to its nearest centroid is shown in the right-most column, and the clustering so obtained is shown in Fig. 16.2 (a small code sketch of this step is given after the centroid table below).
New Objects
Centroid   A1    A2
c1         4.6   7.1
c2         8.2   10.7
c3         6.6   18.6
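The distance and assignment computation described above can be sketched in a few lines of Python (NumPy), using the 16 objects from the data table and the centroids c1, c2, c3 listed above; the printed distances correspond to the d1, d2, d3 values of Table 16.2 (the variable names are illustrative):

```python
import numpy as np

# Objects (A1, A2) from the data table above.
X = np.array([
    [6.8, 12.6], [0.8, 9.8], [1.2, 11.6], [2.8, 9.6], [3.8, 9.9],
    [4.4, 6.5], [4.8, 1.1], [6.0, 19.9], [6.2, 18.5], [7.6, 17.4],
    [7.8, 12.2], [6.6, 7.7], [8.2, 4.5], [8.4, 6.9], [9.0, 3.4], [9.6, 11.1],
])

# Centroids c1, c2, c3 from the table above.
C = np.array([[4.6, 7.1], [8.2, 10.7], [6.6, 18.6]])

# d1, d2, d3: Euclidean (L2) distance from each object to each centroid.
d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)

# Assign each object to the centroid with the smallest distance.
assignment = d.argmin(axis=1) + 1   # 1-based: 1 -> c1, 2 -> c2, 3 -> c3
for obj, dist, a in zip(X, d, assignment):
    print(obj, np.round(dist, 2), f"-> c{a}")
```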
• Usually, this error is measured using a distance norm such as L1, L2, L3, or a similarity measure such as Cosine similarity, etc.
• It is also observed that the initial choice of centroids influences the ultimate cluster quality. In other words, the result may be trapped in a local optimum if the initial centroids are not chosen properly.
• For example, there are about 4.5 × 10^10 (the Stirling number S(20, 4)) different ways to partition 20 items into 4 non-empty clusters!
The Manhattan distance (L1 norm) is used as a proximity measure, where the objective is to minimize the sum-of-absolute error, denoted as SAE and defined as

SAE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert_1

where C_i denotes the i-th cluster and c_i its representative (centroid).
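For concreteness, here is a minimal sketch of how SAE (and, for comparison, SSE) could be computed for a given clustering; it assumes X holds the objects, labels holds each object's cluster index, and centroids holds the cluster representatives, all of which are illustrative names:

```python
import numpy as np

def sae(X, labels, centroids):
    """Sum of absolute errors: L1 distance of each object to the
    centroid of its assigned cluster."""
    return float(np.sum(np.abs(X - centroids[labels])))

def sse(X, labels, centroids):
    """Sum of squared errors: squared L2 distance to the assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))
```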
• In other words, the mean calculation assumes that each object is defined with numerical attribute(s). Thus, we cannot apply k-Means to objects that are defined with categorical attributes.
• More precisely, the k-Means algorithm requires that some definition of a cluster mean exists; it does not necessarily have to be the one defined in the above equation.
• In fact, the k-Means is a very general clustering algorithm and can be used with
a wide variety of data types, such as documents, time series, etc.
The above two interpretations can be readily verified as given in the next slide.
Setting the derivative of SSE with respect to the centroid c_i of cluster C_i (with n_i objects) to zero:

\frac{\partial}{\partial c_i} \sum_{x \in C_i} \lVert x - c_i \rVert_2^2 = 0

Or, \sum_{x \in C_i} 2 (c_i - x) = 0

Or, n_i c_i = \sum_{x \in C_i} x

Or, c_i = \frac{1}{n_i} \sum_{x \in C_i} x

Thus, the best centroid for minimizing SSE of a cluster is the mean of the objects in the cluster.

Similarly, for SAE:

c_i = \text{median} \{\, x \mid x \in C_i \,\}

Thus, the best centroid for minimizing SAE of a cluster is the median of the objects in the cluster.
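These two results can also be checked numerically. A small sketch (the 1-D sample values are made up purely for illustration) scans candidate centroids and confirms that the SSE minimizer is close to the mean and the SAE minimizer is close to the median:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 7.0, 9.5])            # made-up 1-D cluster

sse = lambda c: np.sum((x - c) ** 2)                # sum of squared errors
sae = lambda c: np.sum(np.abs(x - c))               # sum of absolute errors

candidates = np.linspace(x.min(), x.max(), 1001)
best_sse = candidates[np.argmin([sse(c) for c in candidates])]
best_sae = candidates[np.argmin([sae(c) for c in candidates])]

print(best_sse, x.mean())      # best SSE centroid is (approximately) the mean
print(best_sae, np.median(x))  # best SAE centroid is (approximately) the median
```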
? Interpret the best centroid for maximizing TC (with Cosine similarity measure)
of a cluster.
The above-mentioned discussion is quite sufficient for the validation of the k-Means algorithm.
Thus, the time requirement is of linear order in the number of objects: the cost is O(n·k·t), where n is the number of objects, k the number of clusters, and t the number of iterations. The algorithm therefore runs in modest time if k ≪ n and t ≪ n (the iteration count t can be moderately controlled by checking the convergence criterion).
• It is also efficient from both the storage-requirement and the execution-time points of view. By saving distance information from one iteration to the next, the actual number of distance calculations that must be made can be reduced (especially as the algorithm approaches termination).
? How can a similarity metric be utilized to run k-Means faster? What is updated in each iteration?
Limitations:
• The k-Means is not suitable for all types of data. For example, k-Means does not work on categorical data because a mean cannot be defined.
• k-Means finds a local optimum and may therefore miss the global optimum.
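One common workaround for the local-optimum issue (not part of the original slides) is to run k-Means several times with different random initial centroids and keep the run with the smallest SSE. A minimal sketch, reusing the hypothetical k_means function sketched earlier:

```python
import numpy as np

def k_means_restarts(X, k, n_restarts=10):
    """Run k-Means with several random initializations and keep the
    result with the lowest SSE, reducing the risk of a poor local optimum."""
    best = None
    for seed in range(n_restarts):
        centroids, labels = k_means(X, k, seed=seed)        # sketched earlier
        err = float(np.sum((X - centroids[labels]) ** 2))   # SSE of this run
        if best is None or err < best[0]:
            best = (err, centroids, labels)
    return best[1], best[2]
```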
Illustration of PAM
• Suppose there is a set of 12 objects and we are to cluster them into four clusters. At some instant, the four clusters are as shown in Fig. 16.7 (a). Also assume that one object in each of the four clusters is the current medoid of that cluster. For this clustering we can calculate the SAE.
• There are many ways to choose a non-medoid object to replace any one medoid object. Out of these, suppose that one particular swap of a non-medoid object for a current medoid yields the lowest SAE. That swap then defines the new set of medoids, and the new clustering is shown in Fig. 16.7 (b) (a code sketch of this swap evaluation follows the PAM outline below).
PAM (Partitioning around Medoids)
11. Stop
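A minimal sketch of the swap evaluation described in the illustration above, assuming the objects are rows of a NumPy array and using the Manhattan distance so that the cost being minimized is the SAE; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def total_cost(X, medoid_idx):
    """SAE-style cost: sum of Manhattan distances of every object
    to its nearest medoid."""
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def pam_swap_pass(X, medoid_idx):
    """Try every (medoid, non-medoid) swap and keep the one that
    lowers the total cost the most; return the updated medoid indices."""
    best_cost = total_cost(X, medoid_idx)
    best_medoids = list(medoid_idx)
    for i in range(len(medoid_idx)):
        for o in range(len(X)):
            if o in medoid_idx:
                continue
            candidate = list(medoid_idx)
            candidate[i] = o
            cost = total_cost(X, candidate)
            if cost < best_cost:
                best_cost, best_medoids = cost, candidate
    return best_medoids, best_cost
```

Repeating pam_swap_pass until no swap lowers the cost corresponds to the iterative part of PAM.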
References:
For PAM and CLARA:
• L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Analysis”, John Wiley & Sons, 1990.
For CLARANS:
• R. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining”, Proceedings of the International Conference on Very Large Data Bases (VLDB-94), 1994.