13 Clustering Techniques


Clustering techniques

Topics to be covered…

 Introduction to clustering

 Similarity and dissimilarity measures

 Clustering techniques

 Partitioning algorithms

 Hierarchical algorithms

 Density-based algorithm

CS 40003: Data Analytics 2


Clustering techniques
 Clustering has been studied extensively for more than 40 years and across
many disciplines due to its broad applications.
 As a result, many clustering techniques have been reported in the literature.

 Let us categorize the clustering methods. In fact, it is difficult to provide a
crisp categorization because many techniques overlap with each other in terms of
clustering paradigms or features.
 A broad taxonomy of existing clustering methods is shown in Fig. 16.1.

 It is not possible to cover all the techniques in this lecture series. We
emphasize the major techniques belonging to the partitioning and hierarchical
algorithms.

CS 40003: Data Analytics 3


Clustering Techniques
• Partitioning methods: k-Means algorithm [1957, 1967], k-Medoids algorithm, k-Modes [1998], Fuzzy c-means algorithm [1999], PAM [1990], CLARA [1990], CLARANS [1994]
• Hierarchical methods
  - Divisive methods: DIANA [1990]
  - Agglomerative methods: AGNES [1990], BIRCH [1996], CURE [1998], ROCK [1999], Chameleon [1999]
• Density-based methods: DBSCAN [1996], STING [1997], CLIQUE [1998], DENCLUE [1998], OPTICS [1999], Wave Cluster [1998]
• Graph-based methods: MST Clustering [1999], OPOSSUM [2000], SNN Similarity Clustering [2001, 2003]
• Model-based clustering: EM Algorithm [1977], Auto class [1996], COBWEB [1987], ANN Clustering [1982, 1989]

CS 40003: Data Analytics 4


Clustering techniques
 In this lecture, we shall cover the following clustering techniques only.
 Partitioning
 k-Means algorithm

 PAM (k-Medoids algorithm)

 Hierarchical
 DIANA (divisive algorithm)

 AGNES (agglomerative algorithm)

 ROCK

 Density-based
 DBSCAN

CS 40003: Data Analytics 5


k-Means Algorithm
 The k-Means clustering algorithm was proposed by J. Hartigan and M. A. Wong
[1979].
 Given a set of n distinct objects, the k-Means clustering algorithm partitions
the objects into k clusters such that intra-cluster similarity is high
but inter-cluster similarity is low.
 In this algorithm, the user has to specify k, the number of clusters. The objects
are assumed to be defined with numeric attributes, and any one distance metric
can then be used to demarcate the clusters.

CS 40003: Data Analytics 6


k-Means Algorithm
The algorithm can be stated as follows.
 First, it selects k objects at random from the set of n objects. These k
objects are treated as the centroids (centres of gravity) of k clusters.
 Each of the remaining objects is assigned to its closest centroid. The
collection of objects assigned to a centroid forms a cluster.
 Next, the centroid of each cluster is updated (by computing the mean
values of the attributes of the objects in that cluster).
 The assignment and update procedure is repeated until some stopping criterion is
reached (such as a maximum number of iterations, centroids remaining unchanged,
or no reassignment of objects).

CS 40003: Data Analytics 7


k-Means Algorithm
Algorithm 16.1: k-Means clustering
Input: D is a dataset containing n objects, k is the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each object in D do
• Compute the distance between the current object and the k cluster centroids
• Assign the current object to the cluster whose centroid is closest.

3. Compute the “cluster centre” of each cluster. These become the new cluster
centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied
5. Stop
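
To make the steps concrete, here is a minimal Python sketch of Algorithm 16.1 (assuming NumPy; the function name, the convergence test and the lack of empty-cluster handling are illustrative choices, not part of the original algorithm statement):

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal k-Means sketch; D is an (n x d) array of numeric objects."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k objects from D as the initial centroids.
    centroids = D[rng.choice(len(D), size=k, replace=False)]
    labels = np.zeros(len(D), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid (Euclidean distance).
        dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the objects assigned to it
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([D[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Running this sketch on the data of Table 16.1 (next slides) with k = 3 reproduces the behaviour illustrated there, up to the random choice of initial centroids.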

CS 40003: Data Analytics 8


k-Means Algorithm
Note:
1) Objects are defined in terms of a set of attributes x = (x_1, x_2, ..., x_d), where each x_i
is of continuous data type.
2) Distance computation: any distance metric, such as the L1, L2 or L3 norm, or cosine similarity.
3) Minimum distance is the measure of closeness between an object and a
centroid.
4) Mean calculation: the centroid is the mean value of each attribute over all objects in the cluster.
5) Convergence criteria: any one of the following can serve as the termination condition of
the algorithm.
• The maximum permissible number of iterations is reached.
• No change of centroid values in any cluster.
• Zero (or no significant) movement of objects from one cluster to another.
• Cluster quality reaches a certain level of acceptance.

CS 40003: Data Analytics 9


Illustration of k-Means clustering algorithms
Table 16.1: 16 objects with two attributes A1 and A2

A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1

Fig 16.1: Plot of the data in Table 16.1 (scatter plot of A2 versus A1)

CS 40003: Data Analytics 10


Illustration of k-Means clustering algorithms
• Suppose k = 3. Three objects are chosen at random, shown circled in Fig 16.1.
These three centroids are listed below.

Initial centroids (chosen randomly)

Centroid   A1    A2
c1         3.8   9.9
c2         7.8   12.2
c3         6.2   18.5

• Let us consider the Euclidean distance measure (L2 Norm) as the distance
measurement in our illustration.
• Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3
respectively. The distance calculations are shown in Table 16.2.
• Assignment of each object to the respective centroid is shown in the right-
most column and the clustering so obtained is shown in Fig 16.2.

CS 40003: Data Analytics 11


Illustration of k-Means clustering algorithms
Table 16.2: Distance calculation and initial cluster assignment (Fig 16.2: Initial clusters with respect to Table 16.2)

A1   A2    d1   d2   d3   cluster
6.8 12.6 4.0 1.1 5.9 2
0.8 9.8 3.0 7.4 10.2 1
1.2 11.6 3.1 6.6 8.5 1
2.8 9.6 1.0 5.6 9.5 1
3.8 9.9 0.0 4.6 8.9 1
4.4 6.5 3.5 6.6 12.1 1
4.8 1.1 8.9 11.5 17.5 1
6.0 19.9 10.2 7.9 1.4 3
6.2 18.5 8.9 6.5 0.0 3
7.6 17.4 8.4 5.2 1.8 3
7.8 12.2 4.6 0.0 6.5 2
6.6 7.7 3.6 4.7 10.8 1
8.2 4.5 7.0 7.7 14.1 1
8.4 6.9 5.5 5.3 11.8 2
9.0 3.4 8.3 8.9 15.4 1
9.6 11.1 5.9 2.1 8.1 2
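
As a quick cross-check of Table 16.2, the distances from an object to the three initial centroids can be computed as follows (a small sketch assuming NumPy; values are rounded to one decimal place as in the table):

```python
import numpy as np

centroids = np.array([[3.8, 9.9], [7.8, 12.2], [6.2, 18.5]])  # c1, c2, c3
x = np.array([6.8, 12.6])                                     # first object of Table 16.2
d = np.linalg.norm(centroids - x, axis=1)
print(d.round(1), "-> cluster", d.argmin() + 1)                # distances ~ 4.0, 1.1, 5.9 -> cluster 2
```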

CS 40003: Data Analytics 12


Illustration of k-Means clustering algorithms
The new centroids of the three clusters, computed as the mean of the attribute values A1
and A2 of their members, are shown in the table below. The clusters with the new centroids
are shown in Fig 16.3.

Calculation of new centroids

New centroid   A1    A2
c1             4.6   7.1
c2             8.2   10.7
c3             6.6   18.6

Fig 16.3: Initial cluster with new centroids
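
The new centroids in the table above can be reproduced by averaging the members of each cluster from Table 16.2; a short sketch (assuming NumPy; the arrays simply restate Table 16.1 and the cluster column of Table 16.2):

```python
import numpy as np

data = np.array([[6.8, 12.6], [0.8, 9.8], [1.2, 11.6], [2.8, 9.6],
                 [3.8, 9.9], [4.4, 6.5], [4.8, 1.1], [6.0, 19.9],
                 [6.2, 18.5], [7.6, 17.4], [7.8, 12.2], [6.6, 7.7],
                 [8.2, 4.5], [8.4, 6.9], [9.0, 3.4], [9.6, 11.1]])   # Table 16.1
labels = np.array([2, 1, 1, 1, 1, 1, 1, 3, 3, 3, 2, 1, 1, 2, 1, 2])  # cluster column of Table 16.2
new_centroids = [data[labels == i].mean(axis=0) for i in (1, 2, 3)]
print(np.round(new_centroids, 1))  # c1 ~ (4.6, 7.1), c2 ~ (8.2, 10.7), c3 ~ (6.6, 18.6)
```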


CS 40003: Data Analytics 13
Illustration of k-Means clustering algorithms
We next reassign the 16 objects to three clusters by determining which centroid is
closest to each one. This gives the revised set of clusters shown in Fig 16.4.
Note that point p moves from cluster C2 to cluster C1.

Fig 16.4: Cluster after first iteration

CS 40003: Data Analytics 14


Illustration of k-Means clustering algorithms
• The newly obtained centroids after the second iteration are given in the table below.
Note that centroid c3 remains unchanged, while c1 and c2 change a little.
• With respect to the newly obtained cluster centres, the 16 points are reassigned again.
These turn out to be the same clusters as before. Hence, their centroids also remain
unchanged.
• Considering this as the termination criterion, the k-Means algorithm stops here.
Hence, the final clustering in Fig 16.5 is the same as that in Fig 16.4.

Fig 16.5: Clusters after the second iteration

Cluster centres after the second iteration

Centroid   A1    A2
c1         5.0   7.1
c2         8.1   12.0
c3         6.6   18.6

CS 40003: Data Analytics 15


Comments on k-Means algorithm
Let us analyse the k-Means algorithm and discuss the pros and cons of the algorithm.
We shall refer to the following notations in our discussion.
• Notations:
• x : an object under clustering
• n : the number of objects under clustering
• C_i : the i-th cluster
• c_i : the centroid of cluster C_i
• n_i : the number of objects in cluster C_i
• c : the centroid of all objects
• k : the number of clusters

CS 40003: Data Analytics 16


Comments on k-Means algorithm
1. Value of k:
• The k-Means algorithm produces only one set of clusters, for which the user
must specify the desired number k of clusters.
• In fact, k should be the best guess on the number of clusters present in the
given data. Choosing the best value of k for a given dataset is, therefore, an
issue.
• We may not have an idea about the possible number of clusters for high
dimensional data, and for data that are not scatter-plotted.
• Further, possible number of clusters is hidden or ambiguous in image, audio,
video and multimedia clustering applications etc.
• There is no principled way to know what the value of k ought to be. We may
try successive values of k, starting with 2.
• The process is stopped when two consecutive k values produce more-or-less
identical results (with respect to some cluster quality estimation).
• Normally k ≪ n; a commonly cited heuristic is k ≈ √(n/2).

CS 40003: Data Analytics 17


Comments on k-Means algorithm
Example 16.1: k versus cluster quality
• Usually, there is some objective function to be met as the goal of clustering. One
such objective function is the sum-of-squared error, denoted SSE and defined as

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$$

• Here, dist(c_i, x) denotes the error if x is in cluster C_i with cluster centroid c_i.

• Usually, this error is measured using a distance norm such as L1, L2 or L3, or using
cosine similarity.

CS 40003: Data Analytics 18


Comments on k-Means algorithm
Example 16.1: k versus cluster quality
• With reference to an arbitrary experiment, suppose the following results are
obtained.

k      SSE
1      62.8
2      12.3
3      9.4
4      9.3
5      9.2
6      9.1
7      9.05
8      9.0

• With respect to this observation, we can choose the value of k as 2, since this
smallest value of k already gives a reasonably good result.

• Note: If k = n, then SSE = 0; however, such a clustering is useless! This is another
example of overfitting.
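
An experiment like this can be reproduced for any dataset by running k-Means for successive values of k and recording the SSE; a minimal sketch (assuming scikit-learn is available, where the SSE is exposed as the inertia_ attribute):

```python
from sklearn.cluster import KMeans

def sse_versus_k(X, k_max=8):
    """Run k-Means for k = 1 .. k_max and report the SSE (inertia) of each clustering."""
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(km.inertia_)
        print(k, round(km.inertia_, 2))
    return sse
```

Plotting SSE against k and picking the smallest k beyond which the improvement becomes marginal is exactly the reasoning applied to the table above (the so-called elbow method).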

CS 40003: Data Analytics 19


Comments on k-Means algorithm
2. Choosing initial centroids:
• Another requirement of the k-Means algorithm is to choose the initial cluster
centroids, one for each of the k clusters to be formed.

• It is observed that the k-Means algorithm terminates whatever the initial
choice of the cluster centroids.

• It is also observed that the initial choice influences the ultimate cluster quality.
In other words, the result may be trapped in a local optimum if the initial
centroids are not chosen properly.

• One technique that is usually followed to avoid the above problem is to perform
multiple runs, each with a different set of randomly chosen initial centroids, and
then select the best clustering (with respect to some quality measurement criterion,
e.g. SSE); a sketch of this strategy appears at the end of this discussion.

• However, this strategy suffers from a combinatorial explosion problem
due to the number of all possible solutions.

CS 40003: Data Analytics 20
Comments on k-Means algorithm
2. Choosing initial centroids:
• A detailed calculation reveals that there are approximately k^n / k! possible
clusterings (more exactly, the Stirling number of the second kind S(n, k)) to
examine in the search for the global optimum.

• For example, there are about 4.5 × 10^10 different ways to cluster 20 items into 4 clusters!

• Thus, the strategy, with its own limitations, is practical only if

1) the sample is relatively small (~100-1000), and
2) k is relatively small compared to n (i.e. k ≪ n).
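
Since examining all possible clusterings is infeasible, the multiple-run strategy of the previous slide is what is used in practice. A minimal sketch (assuming NumPy and reusing the illustrative kmeans() function sketched after Algorithm 16.1):

```python
import numpy as np

def kmeans_best_of(D, k, n_runs=10):
    """Run k-Means n_runs times with different random initial centroids; keep the lowest-SSE result."""
    best = None
    for seed in range(n_runs):
        labels, centroids = kmeans(D, k, seed=seed)   # kmeans() as sketched earlier
        sse = sum(np.sum((D[labels == i] - centroids[i]) ** 2) for i in range(k))
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best   # (sse, labels, centroids) of the best run
```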

CS 40003: Data Analytics 21


Comments on k-Means algorithm
3. Distance Measurement:
• To assign a point to the closest centroid, we need a proximity measure that
should quantify the notion of “closest” for the objects under clustering.
• Usually Euclidean distance (L2 norm) is the best measure when object points are
defined in n-dimensional Euclidean space.
• Another measure, namely cosine similarity, is more appropriate when the objects are of
document type.
• Further, there may be other types of proximity measures that are appropriate in the
context of specific applications.
• For example, Manhattan distance (L1 norm), Jaccard measure, etc.

CS 40003: Data Analytics 22


Comments on k-Means algorithm
3. Distance Measurement:
Thus, in the context of different proximity measures, the objective function (convergence
criterion) of a clustering can be stated as follows.

Data in Euclidean space (L2 norm): the Euclidean distance is used as the proximity measure,
and the objective is to minimize the sum-of-squared error

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert_2^2$$

Data in Euclidean space (L1 norm): the Manhattan distance (L1 norm) is used as the proximity
measure, and the objective is to minimize the sum-of-absolute error, denoted SAE and defined as

$$SAE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert_1$$
CS 40003: Data Analytics 23


Comments on k-Means algorithm
Distance with document objects
Suppose a set of n document objects is defined as a document-term matrix (DTM), of the
typical form shown below.

Document   t1   t2   ...
D1         .    .    ...
D2         .    .    ...
...
Dn         .    .    ...

Here, the objective function, which is called total cohesion and denoted TC, is defined as

$$TC = \sum_{i=1}^{k} \sum_{x \in C_i} \cos(x, c_i)$$

where

$$\cos(x, c_i) = \frac{x \cdot c_i}{\lVert x \rVert \, \lVert c_i \rVert}$$

and $\lVert x \rVert$ denotes the L2 norm (length) of the vector x.
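
Total cohesion can be computed in the same spirit for document vectors; a minimal sketch (assuming NumPy, with X an n x t document-term matrix whose rows are non-zero vectors; the function name is illustrative):

```python
import numpy as np

def total_cohesion(X, labels, centers):
    """TC = sum over clusters of the cosine similarity between each document and its centroid."""
    tc = 0.0
    for i, c in enumerate(centers):
        docs = X[labels == i]
        cos = (docs @ c) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(c))
        tc += cos.sum()
    return tc
```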

CS 40003: Data Analytics 24


Comments on k-Means algorithm
Note: The objective-function criteria with the different proximity measures are:

1. SSE (using the L2 norm): minimize the SSE.

2. SAE (using the L1 norm): minimize the SAE.

3. TC (using cosine similarity): maximize the TC.

CS 40003: Data Analytics 25


Comments on k-Means algorithm
4. Type of objects under clustering:
• The k-Means algorithm can be applied only when the mean of a cluster is
defined (hence the name k-Means). The cluster mean (also called the centroid) of a
cluster C_i is defined as

$$c_i = \frac{1}{n_i} \sum_{x \in C_i} x$$

• In other words, the mean calculation assumes that each object is defined with
numerical attribute(s). Thus, we cannot apply k-Means to objects that are
defined with categorical attributes.
• More precisely, the k-Means algorithm requires that some definition of a cluster mean
exists, but it does not necessarily have to be the one given in the above equation.
• In fact, k-Means is a very general clustering algorithm and can be used with
a wide variety of data types, such as documents, time series, etc.

? How to find the mean of objects with composite attributes?

CS 40003: Data Analytics 26


Comments on k-Means algorithm
Note:
1) When SSE (L2 norm) is used as the objective function and the objective is to
minimize it, the cluster centroid that minimizes it is the mean value of the objects
in the cluster.
2) When the objective function is defined as SAE (L1 norm), minimizing the
objective function implies that the cluster centroid is the median of the objects in
the cluster.

The above two interpretations can be readily verified as given in the next slide.

CS 40003: Data Analytics 27


Comments on k-Means algorithm
Case 1: SSE
We know,

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} (c_i - x)^2$$

To minimize SSE, we differentiate it with respect to the centroid c_i and set the
derivative to zero:

$$\frac{\partial}{\partial c_i} SSE = \frac{\partial}{\partial c_i} \sum_{i=1}^{k} \sum_{x \in C_i} (c_i - x)^2 = 0$$

Thus,

$$\sum_{x \in C_i} 2\,(c_i - x) = 0$$

Or,

$$n_i\, c_i - \sum_{x \in C_i} x = 0$$
CS 40003: Data Analytics 28


Comments on k-Means algorithm
Or,

$$n_i\, c_i = \sum_{x \in C_i} x$$

Or,

$$c_i = \frac{1}{n_i} \sum_{x \in C_i} x$$

Thus, the best centroid for minimizing the SSE of a cluster is the mean of the
objects in the cluster.

CS 40003: Data Analytics 29


Comments on k-Means algorithm
Case 2: SAE
We know,

$$SAE = \sum_{i=1}^{k} \sum_{x \in C_i} |c_i - x|$$

To minimize SAE, we differentiate it with respect to the centroid c_i and set the
derivative to zero:

$$\frac{\partial}{\partial c_i} SAE = \frac{\partial}{\partial c_i} \sum_{i=1}^{k} \sum_{x \in C_i} |c_i - x| = 0$$

Thus,

$$\sum_{x \in C_i} \frac{\partial}{\partial c_i} |c_i - x| = 0$$

Or,

$$\sum_{x \in C_i} \operatorname{sign}(c_i - x) = 0$$
CS 40003: Data Analytics 30


Comments on k-Means algorithm
Or, the number of objects with x < c_i must equal the number of objects with x > c_i.

Solving the above equation, we get

$$c_i = \operatorname{median}\{\, x \mid x \in C_i \,\}$$

Thus, the best centroid for minimizing the SAE of a cluster is the median of the
objects in the cluster.

? Interpret the best centroid for maximizing TC (with the cosine similarity measure)
of a cluster.

The above discussion is quite sufficient for the validation of the k-Means
algorithm.
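
The result can also be checked numerically: for a cluster, the median yields an SAE no larger than the mean does. A tiny one-dimensional example (assuming NumPy; the data values are arbitrary and include one outlier):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # a one-dimensional "cluster" with an outlier
for name, c in [("mean", x.mean()), ("median", np.median(x))]:
    print(name, c, "SAE =", np.abs(x - c).sum())
# mean 22.0 SAE = 156.0,  median 3.0 SAE = 101.0
```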

CS 40003: Data Analytics 31


Comments on k-Means algorithm
5. Complexity analysis of k-Means algorithm
Let us analyse the time and space complexities of the k-Means algorithm.
Time complexity:
The time complexity of the k-Means algorithm can be expressed as

$$T = O(n \cdot k \cdot d \cdot t)$$

where n = number of objects
d = number of attributes in the object definition
k = number of clusters
t = number of iterations

Thus, the time requirement is linear in the number of objects, and the algorithm
runs in a modest time if k ≪ n and d ≪ n (the number of iterations t can be moderately
controlled by checking the convergence criterion).

CS 40003: Data Analytics 32


Comments on k-Means algorithm
5. Complexity analysis of k-Means algorithm
Space complexity: The storage complexity can be expressed as follows.
The algorithm requires O(n · d) space to store the objects and O(k · d) space to store the
centroids and the proximity measures from the objects to the centroids of the clusters.
Thus, the total storage complexity is

$$S = O((n + k) \cdot d)$$

That is, the space requirement is linear in the number of objects n if k ≪ n.

CS 40003: Data Analytics 33


Comments on k-Means algorithm
6. Final comments:
Advantages:
• k-Means is simple and can be used for a wide variety of object types.

• It is also efficient in terms of both storage requirement and execution time. By saving
distance information from one iteration to the next, the actual number of distance
calculations that must be made can be reduced (especially as the algorithm approaches
termination).

? How can the similarity metric be utilized to run k-Means faster? What needs to be
updated in each iteration?

Limitations:

• k-Means is not suitable for all types of data. For example, k-Means does not
work on categorical data because the mean cannot be defined.
• k-Means finds a local optimum and may in fact miss the global optimum.

CS 40003: Data Analytics 34


Comments on k-Means algorithm
6. Final comments:
Limitations :
• k-Means has trouble clustering data that contain outliers. When the SSE is used
as the objective function, outliers can unduly influence the clusters that are produced:
in the presence of outliers, the cluster centroids are not as representative as they would
otherwise be, and the SSE measure is distorted as well.
• The k-Means algorithm cannot handle non-globular clusters, or clusters of different
sizes and densities (see Fig 16.6 in the next slide).
• The k-Means algorithm does not really overcome the scalability issue (and is not so
practical for large databases).

CS 40003: Data Analytics 35


Comments on k-Means algorithm

Panels: clusters with different sizes; clusters with different densities; non-convex shaped clusters

Fig 16.6: Some failure instances of the k-Means algorithm


CS 40003: Data Analytics 36
Different variants of k-means algorithm
There are quite a few variants of the k-Means algorithm. These differ in the
procedure for selecting the initial k means, the calculation of proximity, and the strategy
for calculating cluster means. Other variants of k-Means cluster categorical
data.
A few variants of the k-Means algorithm include
• Bisecting k-Means (addressing the issue of initial choice of cluster means).
1. M. Steinbach, G. Karypis and V. Kumar “A comparison of document clustering
techniques”, Proceedings of KDD workshop on Text mining, 2000.
• Mean of clusters (Proposing various strategies to define means and variants of
means).
• B. Zhang, “Generalised k-Harmonic Means - Dynamic Weighting of Data in
Unsupervised Learning”, Technical Report, HP Labs, 2000.
• A. D. Chaturvedi, P. E. Green, J. D. Carroll, “k-Modes clustering”, Journal of
classification, Vol. 18, PP. 35-36, 2001.
• D. Pelleg, A. Moore, “x-Means: Extending k-Means with efficient estimation of the
number of clusters”, 17th International conference on Machine Learning, 2000.

CS 40003: Data Analytics 37


Different variants of k-means algorithm
• N. B. Karayiannis, M. M. Randolph-Gips, “Non-Euclidean c-Means Clustering
Algorithms”, Intelligent Data Analysis, Vol. 7(5), pp. 405-425, 2003.

• J. V. de Oliveira, W. Pedrycz, “Advances in Fuzzy Clustering and its
Applications”, edited book, John Wiley, 2007. (Fuzzy c-Means algorithm).

• A. K. Jain and R. C. Dubes, “Algorithms for Clustering Data”, Prentice
Hall, 1988.
Online book at http://www.cse.msu.edu/~jain/clustering_Jain_Dubes.pdf
• A. K. Jain, M. N. Murty and P. J. Flynn, “Data Clustering: A Review”,
ACM Computing Surveys, 31(3), 264-323, 1999. Also available online.

CS 40003: Data Analytics 38


The k-Medoids algorithm
Now, we shall study a variant of partitioning algorithm called k-Medoids
algorithm.
Motivation: We have learnt that the k-Means algorithm is sensitive to outliers
because an object with an “extremely large value” may substantially distort the
distribution. The effect is particularly exacerbated due to the use of the SSE (sum-
of-squared error) objective function. The k-Medoids algorithm aims to diminish
the effect of outliers.
Basic concepts:
• The basic concept of this algorithm is to select an actual object as the cluster centre (one
representative object per cluster) instead of taking the mean value of the objects
in a cluster (as in the k-Means algorithm).
• We call this cluster representative a cluster medoid, or simply a medoid.
1. Initially, it selects a random set of k objects as the set of medoids.
2. Then, at each step, all objects that are not currently medoids are examined one by
one to see whether they should become medoids.

CS 40003: Data Analytics 39


The k-Medoids algorithm
• That is, the k-Medoids algorithm determines whether there is an object that
should replace one of the current medoids.
• This is accomplished by examining all pairs of (medoid, non-medoid) objects,
choosing the pair that improves the clustering objective function the most, and
exchanging them.
• The sum-of-absolute error (SAE) function is used as the objective function:

$$SAE = \sum_{i=1}^{k} \sum_{x \in C_i} \operatorname{dist}(x, m_i)$$

where m_i denotes the medoid of cluster C_i,

M is the set of all medoids at any instant, and
x is an object belonging to the set of non-medoid objects, that is, x belongs to some
cluster C_i and is not a medoid, i.e. x ∉ M.
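
A small helper for computing the SAE of a given set of medoids, which the PAM sketch on the following slides will reuse (assuming NumPy; each object is simply assigned to its nearest medoid):

```python
import numpy as np

def sae_of_medoids(D, medoid_idx):
    """SAE of a medoid set: sum over all objects of the distance to their nearest medoid."""
    dist = np.linalg.norm(D[:, None, :] - D[medoid_idx][None, :, :], axis=2)
    return dist.min(axis=1).sum()
```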

CS 40003: Data Analytics 40


PAM (Partitioning around Medoids)
• For a given set of medoids, at any iteration, PAM selects the exchange that yields the
minimum SAE.
• The procedure terminates if there is no change in SAE in successive
iterations (i.e. there is no change in the medoids).
• This k-Medoids algorithm is also known as PAM (Partitioning around
Medoids).

Illustration of PAM
• Suppose there is a set of 12 objects and we are to cluster them into four
clusters. At any instant, the four clusters are as shown in Fig. 16.7 (a). Also assume
that one object in each cluster is its medoid. For this clustering we can
calculate the SAE.
• There are many ways to choose a non-medoid object to replace one of the
medoid objects. Out of these, suppose that one particular non-medoid object, when
considered as a candidate medoid in place of a current medoid, gives the lowest SAE.
That swap is made, giving a new set of medoids. The new clustering is shown in Fig 16.7 (b).
CS 40003: Data Analytics 41
PAM (Partitioning around Medoids)

(a) Clusters with the current medoids      (b) Clusters after swapping (the chosen
non-medoid object becomes a new medoid)

Fig 16.7: Illustration of PAM

CS 40003: Data Analytics 42


PAM (Partitioning around Medoids)
The PAM algorithm is thus a procedure of iterative selection of medoids; it is
stated precisely in Algorithm 16.2.
Algorithm 16.2: PAM
Input: Database of objects D.
k, the number of desired clusters.
Output: Set of k clusters
Steps:
1. Arbitrarily select k medoids from D.
2. For each object x in D that is not a medoid do
3. For each medoid m do
4. Let M = the set of current medoids,
   and M_xm = the set of medoids with m swapped for the non-medoid x
5. Calculate cost(x, m) = SAE(M_xm) - SAE(M)
6. End of the for loop of step 2

CS 40003: Data Analytics 43


PAM (Partitioning around Medoids)
Algorithm 16.2: PAM

7. Find the pair (x, m) for which cost(x, m) is the smallest.

8. Replace the medoid m with x and accordingly update the set M.

9. Repeat step 2 - step 8 until cost(x, m) ≥ 0 for every pair (i.e. no swap decreases the SAE).

10. Return the clusters with M as the set of cluster centres.

11. Stop
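
A minimal Python sketch of Algorithm 16.2, reusing the sae_of_medoids() helper defined earlier (an illustrative, unoptimised implementation, not the canonical PAM code):

```python
import numpy as np

def pam(D, k, seed=0):
    """Partitioning Around Medoids: repeatedly apply the (medoid, non-medoid) swap
    that lowers the SAE the most, until no swap improves it."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), size=k, replace=False))    # step 1: arbitrary medoids
    current_sae = sae_of_medoids(D, medoids)
    while True:
        best_cost, best_swap = 0.0, None
        for x in range(len(D)):                                   # step 2: each non-medoid object
            if x in medoids:
                continue
            for j in range(k):                                     # step 3: each current medoid
                trial = medoids.copy()
                trial[j] = x                                       # step 4: medoid set after the swap
                cost = sae_of_medoids(D, trial) - current_sae      # step 5: cost of this swap
                if cost < best_cost:                               # step 7: remember the best swap
                    best_cost, best_swap = cost, (j, x)
        if best_swap is None:                                      # step 9: stop when no cost < 0
            break
        j, x = best_swap
        medoids[j] = x                                             # step 8: apply the best swap
        current_sae += best_cost
    labels = np.linalg.norm(D[:, None, :] - D[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels                                         # step 10: M is the set of centres
```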

CS 40003: Data Analytics 44


Comments on PAM
1. Comparing k-Means with k-Medoids:
• Both algorithms need k, the number of clusters, to be fixed prior to running.
Also, both algorithms choose the initial cluster centres arbitrarily.
• The k-Medoid method is more robust than k-Means in the presence of outliers,
because a medoid is less influenced by outliers than a mean.
2. Time complexity of PAM:
• In each iteration, PAM considers k(n - k) pairs of (medoid, non-medoid) objects for which
a cost is determined. Calculating the cost for each pair requires examining
all the other n - k non-medoid objects. Thus, the total time complexity per
iteration is O(k(n - k)^2). The total number of iterations may also be quite large.
3. Applicability of PAM:
• PAM does not scale well to large databases because of its computational
complexity.

CS 40003: Data Analytics 45


Other variants of k-Medoids algorithms
• Some variants of PAM that target mainly large datasets are
CLARA (Clustering LARge Applications) and CLARANS (Clustering Large
Applications based upon RANdomized Search), the latter being an improvement of
CLARA.

References:
For PAM and CLARA:
• L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to
Cluster Analysis”, John Wiley & Sons, 1990.
For CLARANS:
• R. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data
Mining”, Proceedings of the International Conference on Very Large Data Bases (VLDB-94), 1994.

CS 40003: Data Analytics 46


Any question?

You may post your question(s) at the “Discussion Forum” maintained on the course Web page!

CS 40003: Data Analytics 47
