Cluster Analysis
UNIT VI: Cluster Analysis: Contents
• Cluster Analysis: Basic Concepts and Methods
➢ What is Cluster Analysis?
• Partitioning Methods
➢ k-Means: A Centroid-Based Technique
• Hierarchical Methods
➢ Agglomerative versus Divisive Hierarchical Clustering
-The inter-cluster distance is the distance between a data point in one cluster
and a data point in another cluster.
-Outliers are extreme values that fall a long way outside of the other
observations.
Cluster Analysis: Applications
⚫ Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus, and species
⚫ Information retrieval: document clustering
⚫ Land use: Identification of areas of similar land use in an earth
⚫ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
⚫ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
⚫ Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
⚫ Climate: understanding earth climate, find patterns of atmospheric and
⚫ Economic Science: market research
Clustering as a Preprocessing Tool (Utility)
⚫ Summarization:
⚫ Preprocessing for regression, PCA, classification, and
association analysis
⚫ Compression:
⚫ Image processing: vector quantization
⚫ Outlier detection:
⚫ Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
Requirements for Cluster Analysis
The following are typical requirements of clustering in data mining:
⚫ Scalability
⚫ Clustering all the data instead of only samples
⚫ Therefore, we need highly scalable clustering algorithms to deal with
⚫ Partitioning criteria
⚫ Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
⚫ Separation of clusters
⚫ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive
(e.g., one document may belong to more than one class)
⚫ Similarity measure
⚫ Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
⚫ Clustering space
⚫ Full space (often when low-dimensional) vs. subspaces (often in
high-dimensional clustering)
k-Means: A Centroid-Based Technique
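⚫ The quality of cluster Ci can be measured by the within-cluster
variation, which is the sum of squared errors between all objects in Ci
and the centroid ci. Summed over all k clusters, it is defined as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2 \qquad (1)$$
where p is an object in cluster Ci and dist(p, ci) is the distance between p and the centroid ci.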
⚫ Given k, the k-means algorithm is implemented in four steps:
Clustering of a set of objects using the k-means method; for (b) update cluster centers and
reassign objects accordingly (the mean of each cluster is marked by a +)
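A minimal sketch of one common four-step formulation of k-means (often called Lloyd's algorithm), assuming squared Euclidean distance and initial centers sampled at random from the data. The names `kmeans` and `sq_dist` and the `max_iter` and `seed` parameters are illustrative, not taken from the slides.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: choose k objects as initial centers
    labels = None
    for _ in range(max_iter):
        # step 2: assign each object to the cluster with the nearest center
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # step 4: stop when no assignment changes
            break
        labels = new_labels
        # step 3: update each center to the mean of the objects assigned to it
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels
```

Each pass costs O(kn) distance computations, which is where the O(tkn) bound for t iterations comes from.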
An Example of K-Means Clustering
Comments on the K-Means Method
⚫ Strength: Efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
⚫ Weakness
⚫ Need to specify k, the number of clusters, in advance
(there are ways to automatically determine the best k)
⚫ Sensitive to noisy data and outliers
⚫ Not suitable to discover clusters with non-convex shapes
k-Medoids: A Representative Object-Based Technique
A drawback of k-means: Consider seven points in 1-D space having the values 1,
2, 3, 8, 9, 10, and 25, respectively. Intuitively, by visual inspection we may
imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, where
point 25 is excluded because it appears to be an outlier.
⚫ How would k-means partition the values? If we apply k-means using k = 2 and
Eq. (1), the partitioning {1, 2, 3}, {8, 9, 10, 25} has the within-cluster variation
$$(1-2)^2 + (2-2)^2 + (3-2)^2 + (8-13)^2 + (9-13)^2 + (10-13)^2 + (25-13)^2 = 196$$
⚫ given that the mean of cluster {1, 2, 3} is 2 and the mean of {8, 9, 10, 25} is 13.
⚫ Compare this to the partitioning {1, 2, 3, 8}, {9, 10, 25}, for which k-means
computes the within-cluster variation as
$$(1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + (8-3.5)^2 + (9-14.67)^2 + (10-14.67)^2 + (25-14.67)^2 = 189.67$$
⚫ given that 3.5 is the mean of cluster {1, 2, 3, 8} and 14.67 is the mean of cluster
{9, 10, 25}.
⚫ The latter partitioning has the lower within-cluster variation; therefore, the
k-means method assigns the value 8 to a cluster different from that
containing 9 and 10 due to the outlier point 25.
⚫ Moreover, the center of the second cluster, 14.67, is substantially far from all
the members in the cluster.
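The comparison between the two partitionings can be checked numerically. The sketch below assumes 1-D points and uses a hypothetical helper name, `within_cluster_variation`, to sum the squared deviations of each value from its cluster mean.

```python
def within_cluster_variation(clusters):
    """Sum of squared deviations of every value from its own cluster mean."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total


intuitive = within_cluster_variation([[1, 2, 3], [8, 9, 10, 25]])
skewed = within_cluster_variation([[1, 2, 3, 8], [9, 10, 25]])
print(round(intuitive, 2), round(skewed, 2))  # 196.0 189.67
```

Because k-means only minimizes this quantity, it prefers the second partitioning even though it splits 8 from 9 and 10.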
• One of the data objects acts as the cluster center (one representative object per
cluster) instead of taking the mean value of the objects in a cluster (as in the
k-means algorithm).
⚫ The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its
corresponding representative object. That is, an absolute-error criterion is
used, defined as:
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, o_i)$$
⚫ where E is the sum of the absolute error for all objects p in the data set, and oi
is the representative object of Ci. This is the basis for the k-medoids
method, which groups n objects into k clusters by minimizing the absolute error.
⚫ When k = 1, we can find the exact median in O(n^2) time. However, when k is a
general positive number, the k-medoid problem is NP-hard.
⚫ Like the k-means algorithm, the initial representative objects (called seeds) are
chosen arbitrarily.
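To illustrate the absolute-error criterion, the sketch below runs an exhaustive search over all pairs of candidate medoids on the 1-D values 1, 2, 3, 8, 9, 10, 25. This is brute force, not PAM, and is feasible only because the data set is tiny. The function names `absolute_error` and `best_medoids` are hypothetical.

```python
from itertools import combinations


def absolute_error(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)


def best_medoids(points, k):
    """Try every k-subset of the data as medoids (exponential in general)."""
    return min(combinations(points, k), key=lambda ms: absolute_error(points, ms))


points = [1, 2, 3, 8, 9, 10, 25]
medoids = best_medoids(points, 2)
clusters = {
    m: [p for p in points if min(medoids, key=lambda c: abs(p - c)) == m]
    for m in medoids
}
print(medoids, absolute_error(points, medoids))  # (2, 9) 20
print(clusters)  # {2: [1, 2, 3], 9: [8, 9, 10, 25]}
```

Unlike the k-means result on the same values, the value 8 stays with 9 and 10, and the representative object 9 is an actual member of its cluster. The pair (2, 10) gives the same error of 20; the search simply returns the first minimum it finds.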
Variations of k-Medoids
⚫ Applicability of PAM:
⚫ PAM does not scale well to large databases because of its computational complexity.
• The effectiveness of CLARA depends on the sample size.
• PAM searches for the best k-medoids among a given data set, whereas CLARA
searches for the best k-medoids among the selected sample of the data set.
• CLARA cannot find a good clustering if any of the best sampled medoids is far
from the best k-medoids.
• If an object is one of the best k-medoids but is not selected during sampling,
CLARA will never find the best clustering.
• CLARANS then randomly selects a current medoid x and an object y that is not one of
the current medoids.
• The set of the current medoids after the l steps is considered a local optimum.
• CLARANS repeats this randomized process m times and returns the best local
optimum as the final result.
Hierarchical Methods
⚫ A hierarchical clustering method works by grouping data objects into a
hierarchy or "tree" of clusters. It can be visualized as a dendrogram.
⚫ Types:
⚫ Agglomerative versus Divisive Hierarchical Clustering
⚫ BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
⚫ Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
⚫ Probabilistic Hierarchical Clustering
Hierarchical Methods: Agglomerative versus Divisive
Hierarchical Clustering
• An agglomerative hierarchical clustering method uses a bottom-
up strategy.
• It typically starts by letting each object form its own cluster and
iteratively merges clusters into larger and larger clusters, until all the
objects are in a single cluster or certain termination conditions are satisfied.
⚫ The single cluster becomes the hierarchy's root.
⚫ For the merging step, it finds the two clusters that are closest to each
other (according to some similarity measure), and combines the two to
form one cluster.
⚫ Because two clusters are merged per iteration, where each cluster
contains at least one object, an agglomerative method requires at most
n iterations.
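The merging loop can be sketched as follows. This is a minimal illustration on 1-D values, assuming the closest pair of clusters is chosen by single linkage (minimum distance between any two members); the linkage choice and the names `single_link` and `agglomerate` are assumptions made for this example.

```python
def single_link(a, b):
    """Distance between two clusters: the smallest member-to-member distance."""
    return min(abs(x - y) for x in a for y in b)


def agglomerate(points):
    """Merge the two closest clusters until one remains; returns the merge history."""
    clusters = [(p,) for p in points]  # every object starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters that are closest to each other
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # replace the two merged clusters with their union
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges


history = agglomerate([1, 2, 6, 7, 20])
for left, right in history:
    print(left, "+", right)
```

With n objects the loop performs n - 1 merges, which is within the bound of at most n iterations. The last merge produces the root of the hierarchy.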
• A divisive hierarchical clustering method employs a top-down strategy.
• The partitioning process continues until each cluster at the lowest level
is coherent enough: either it contains only one object, or the objects
within a cluster are sufficiently similar to each other.
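[Figure: AGNES (agglomerative) proceeds from Step 0 to Step 4, merging the objects a, b, c, d, e into the single cluster abcde; DIANA (divisive) proceeds in the opposite direction, from Step 0 to Step 4.]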
The figure above shows the application of AGNES (AGglomerative NESting),
an agglomerative hierarchical clustering method, and DIANA (DIvisive
ANAlysis), a divisive hierarchical clustering method, on a data set of five
objects, {a, b, c, d, e}.
⚫ A tree structure called a dendrogram is commonly used to represent the
process of hierarchical clustering. It shows how objects are grouped together (in
an agglomerative method) or partitioned (in a divisive method) step by step.
⚫ The figure shows a dendrogram for the five objects, where l = 0 shows the five
objects as singleton clusters at level 0. At l = 1, objects a and b are grouped
together to form the first cluster, and they stay together at all subsequent levels.
We can also use a vertical axis to show the similarity scale between clusters. For
example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is
roughly 0.16, they are merged together to form a single cluster.
Distance Measures in Hierarchical Methods
⚫ Whether using an agglomerative method or a divisive
method, a core need is to measure the distance between two clusters.
• CF-tree: a height-balanced tree that stores the clustering features for
hierarchical clustering.
BIRCH: Multiphase Hierarchical Clustering Using Clustering
Feature Trees
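• Measures of clustering: given n d-dimensional data objects or points x_1, ..., x_n in
a cluster, the centroid x0, radius R, and diameter D of the cluster are defined as follows.
Centroid (the middle of a cluster):
$$x_0 = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{LS}{n}$$
Radius (the average distance from member objects to the centroid):
$$R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}}$$
Diameter (the average pairwise distance within a cluster):
$$D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}}$$
where LS is the linear sum of the n points.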
• Here, R and D reflect the tightness of the cluster around the centroid.
• Moreover, clustering features are additive. That is, for two disjoint clusters, C1 and C2,
with the clustering features CF1 = <n1, LS1, SS1> and CF2 = <n2, LS2, SS2>, respectively, the
clustering feature for the cluster formed by merging C1 and C2 is simply
CF1 + CF2 = <n1+n2, LS1+LS2, SS1+SS2>
• Example: Clustering feature. Suppose there are three points, (2,5),(3,2), and
(4,3), in a cluster, C1. CF = <n,LS,SS>
• The clustering feature of C1 is
CF1 = <3, (2+3+4, 5+2+3), (2^2+3^2+4^2, 5^2+2^2+3^2)> = <3, (9,10), (29,38)>
• Suppose that C1 is disjoint to a second cluster, C2 , where
CF2 = <3, (35,36),(417,440)>
• The clustering feature of a new cluster, C3, that is formed by merging C1 and
C2, is derived by adding CF1 and CF2. That is,
CF3 = <3+3, (9+35,10+36), (29+417,38+440)> =<6, (44,46),(446,478)>
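The example can be reproduced with a few lines of code. This is a minimal sketch that stores a clustering feature as a plain tuple (n, LS, SS) with per-dimension sums, as in the worked example; the function names `clustering_feature`, `merge`, and `centroid` are illustrative.

```python
def clustering_feature(points):
    """Build CF = (n, LS, SS) with LS and SS kept per dimension."""
    n = len(points)
    ls = tuple(sum(dim) for dim in zip(*points))                 # linear sum
    ss = tuple(sum(x * x for x in dim) for dim in zip(*points))  # square sum
    return n, ls, ss


def merge(cf_a, cf_b):
    """Additivity: the CF of a merged cluster is the component-wise sum."""
    (na, lsa, ssa), (nb, lsb, ssb) = cf_a, cf_b
    return (
        na + nb,
        tuple(a + b for a, b in zip(lsa, lsb)),
        tuple(a + b for a, b in zip(ssa, ssb)),
    )


def centroid(cf):
    """The centroid is recoverable from the CF alone as LS / n."""
    n, ls, _ = cf
    return tuple(v / n for v in ls)


cf1 = clustering_feature([(2, 5), (3, 2), (4, 3)])
cf2 = (3, (35, 36), (417, 440))
cf3 = merge(cf1, cf2)
print(cf1)  # (3, (9, 10), (29, 38))
print(cf3)  # (6, (44, 46), (446, 478))
```

Merging never revisits the original points, which is why summaries of this kind can be kept in memory while the data set is scanned.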
• A CF-tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering.
• Additional scans can optionally be used to further improve the quality. The primary phases are:
• Phase 1: BIRCH scans the database to build an initial in-memory CF-tree,
which can be viewed as a multilevel compression of the data that tries to
preserve the data’s inherent clustering structure.
• Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf
nodes of the CF-tree, which removes sparse clusters as outliers and groups
dense clusters into larger ones.
⚫ Thus, Chameleon does not depend on a static, user-supplied model and can
automatically adapt to the internal characteristics of the clusters being
⚫ where EC{Ci, Cj} is the edge cut as previously defined for a cluster containing both Ci
and Cj. Similarly, EC_Ci (or EC_Cj) is the minimum sum of the cut edges that
partition Ci (or Cj) into two roughly equal parts.
⚫ where SEC{Ci, Cj} is the average weight of the edges that connect vertices in Ci to
vertices in Cj, and SEC_Ci (or SEC_Cj) is the average weight of the edges that belong to
the min-cut bisector of cluster Ci (or Cj).
⚫ Chameleon has been shown to have greater power at discovering arbitrarily shaped
clusters of high quality than several well-known algorithms such as BIRCH and
density-based DBSCAN.
⚫ However, the processing cost for high-dimensional data may require O(n^2) time for n
objects in the worst case.
⦁ The task of learning the generative model is to find the parameters μ and σ^2
such that the likelihood L(Ɲ(μ, σ^2) : X) is maximized, that is, finding
⦁ where P() is the maximum likelihood. If we merge two clusters, Cj1 and Cj2, into
a cluster, Cj1 ∪ Cj2, then the change in quality of the overall clustering is
⦁ Probabilistic models are more interpretable, but sometimes less flexible than
distance metrics.
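For the 1-D Gaussian model Ɲ(μ, σ^2) mentioned above, the maximizing parameters have a closed form: the sample mean and the mean squared deviation. The sketch below assumes 1-D data with at least two distinct values, and the helper names `fit_gaussian` and `log_likelihood` are illustrative.

```python
from math import log, pi


def fit_gaussian(xs):
    """Maximum-likelihood estimates of mu and sigma^2 for 1-D data."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # the MLE divides by n, not n - 1
    return mu, var


def log_likelihood(xs, mu, var):
    """Log of the likelihood L(N(mu, var) : X) for independent observations."""
    return sum(-0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var) for x in xs)


data = [1.0, 2.0, 3.0]
mu, var = fit_gaussian(data)
print(mu, var, log_likelihood(data, mu, var))
```

Any other choice of μ or σ^2 gives a lower log-likelihood on the same data, which is what makes these values the fitted model for a cluster.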
⚫ For example, the dataset in the figure below can easily be divided into three
clusters using the k-means algorithm.
Density-Based Methods
Consider the following figures:
The data points in these figures are grouped in arbitrary shapes or include
outliers. Density-based clustering algorithms are effective at finding
high-density regions and outliers. Detecting outliers is very important for
some tasks, e.g., anomaly detection.
⚫ DBSCAN stands for density-based spatial clustering of applications
with noise.
• There are two key parameters of DBSCAN:
⚫ eps: The distance that specifies the neighborhoods. Two points are
considered to be neighbors if the distance between them is less than or
equal to eps.
⚫ minPts: Minimum number of data points to define a cluster.
• Border points: A border point is not a core point, but falls within the
neighborhood of a core point. A border point can fall within the
neighborhoods of several core points.
• Noise points: A noise point is any point that is neither a core point nor a
border point.
• In the center-based approach, density is estimated for a particular point in
the data set by counting the number of points within a specified radius, Eps,
of that point. This includes the point itself.
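The point types can be sketched with a direct neighborhood count. The example assumes the usual definition of a core point (at least minPts points within eps, counting the point itself) and Euclidean distance; the function name `classify_points` is hypothetical, and the quadratic scan is for illustration only.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""

    def neighbors(i):
        # indices of all points within eps of point i, including i itself
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    core = {i for i in range(len(points)) if len(neighbors(i)) >= min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors(i)):
            labels.append("border")  # not core, but inside a core point's neighborhood
        else:
            labels.append("noise")   # neither core nor border
    return labels


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_points(pts, eps=1.0, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

A full DBSCAN run would additionally connect core points whose neighborhoods overlap into clusters and attach border points to one of them.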
• Disadvantages of DBSCAN:
• Does not work well when dealing with clusters of varying densities.
• It struggles with high-dimensional data.
