Data Mining Unit 5
UNIT-V
Cluster Analysis
B.Tech(CSE)-V SEM
UNIT V: Cluster Analysis: Contents
• Cluster Analysis: Basic Concepts and Methods
➢ What is Cluster Analysis?
• Partitioning Methods
➢ k-Means: A Centroid-Based Technique
• Hierarchical Methods
➢ Agglomerative versus Divisive Hierarchical Clustering
- Intra-cluster distance is the distance between data points within the same cluster, whereas inter-cluster distance is the distance between a data point in one cluster and data points in other clusters; good clusters have small intra-cluster distances and large inter-cluster distances.
- Outliers are extreme values that fall a long way outside of the other observations.
Cluster Analysis: Applications
⚫ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
⚫ Information retrieval: document clustering
⚫ Land use: identification of areas of similar land use in an earth observation database
⚫ Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
⚫ City-planning: identifying groups of houses according to their house type, value, and geographical location
⚫ Earthquake studies: observed earthquake epicenters should be clustered along continent faults
⚫ Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
⚫ Economic science: market research
Clustering as a Preprocessing Tool (Utility)
⚫ Summarization:
⚫ Preprocessing for regression, PCA, classification, and
association analysis
⚫ Compression:
⚫ Image processing: vector quantization
⚫ Outlier detection:
⚫ Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
⚫ A good clustering method produces high-quality clusters with high intra-cluster similarity and low inter-cluster similarity.
Requirements for Cluster Analysis
The following are typical requirements of clustering in data mining:
⚫ Scalability
⚫ Clustering all the data instead of only samples
⚫ Therefore, we need highly scalable clustering algorithms to deal with large databases.
⚫ Partitioning criteria
⚫ Single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
⚫ Separation of clusters
⚫ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
⚫ Similarity measure
⚫ Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
⚫ Clustering space
⚫ Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
k-Means: A Centroid-Based Technique
⚫ The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in Ci and the centroid ci, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)²        (1)

where E is the sum of the squared error for all objects in the data set, p is a given object, and ci is the centroid of cluster Ci.
k-Means: A Centroid-Based Technique
⚫ Given k, the k-means algorithm is implemented in four steps:
1. Arbitrarily choose k objects from the data set as the initial cluster centers;
2. Assign each object to the cluster whose center (mean) is nearest;
3. Update the cluster means, i.e., recompute the mean of each cluster from the objects currently assigned to it;
4. Repeat steps 2 and 3 until the assignment no longer changes.
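As a rough illustration (not from the slides), the following plain-Python sketch implements these four steps for 2-D points and also reports the within-cluster variation E of Eq. (1); the function name, data layout, and convergence test are illustrative choices.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch for 2-D points given as (x, y) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 1: arbitrarily choose k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # step 2: assign each object to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # step 3: update each cluster mean (keep the old centroid if a cluster went empty)
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # step 4: stop when the centroids (and hence the assignment) no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # within-cluster variation E: sum of squared distances of objects to their centroid (Eq. 1)
    E = sum((p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2
            for i, c in enumerate(clusters) for p in c)
    return centroids, clusters, E

centroids, clusters, E = kmeans([(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)], k=2)
print(centroids, E)
```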
k-Means: A Centroid-Based Technique
Figure: clustering of a set of objects using the k-means method; in (b) the cluster centers are updated and objects are reassigned accordingly (the mean of each cluster is marked by a +).
An Example of K-Means Clustering
K=2
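Since the worked figure for this example is not reproduced here, a small stand-in run with K = 2 is sketched below, assuming scikit-learn is available; the data points are made up for illustration.

```python
# Hypothetical two-cluster example (K = 2) using scikit-learn's KMeans.
from sklearn.cluster import KMeans

X = [[1, 1], [1.5, 2], [2, 1.5],      # one compact group
     [8, 8], [8.5, 9], [9, 8.5]]      # a second, well-separated group
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # two distinct labels, one per group, e.g. [0 0 0 1 1 1] or [1 1 1 0 0 0]
```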
Comments on the K-Means Method
⚫ Strength: Efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
⚫ Weakness
⚫ Need to specify k, the number of clusters, in advance
(there are ways to automatically determine the best k)
⚫ Sensitive to noisy data and outliers
⚫ Not suitable to discover clusters with non-convex shapes
k-Medoids: A Representative Object-Based Technique
A drawback of k-means: consider the points in 1-D space having the values 1, 2, 3, 8, 9, 10, and 25. Intuitively, by visual inspection we may imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, where point 25 is excluded because it appears to be an outlier.
⚫ How would k-means partition the values? If we apply k-means using k = 2 and Eq. (1), the partitioning {1, 2, 3}, {8, 9, 10, 25} has the within-cluster variation
(1 − 2)² + (2 − 2)² + (3 − 2)² + (8 − 13)² + (9 − 13)² + (10 − 13)² + (25 − 13)² = 196,
given that the mean of cluster {1, 2, 3} is 2 and the mean of {8, 9, 10, 25} is 13.
⚫ Comparing this to the partitioning {1, 2, 3, 8}, {9, 10, 25}, for which k-means computes the within-cluster variation as
(1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (8 − 3.5)² + (9 − 14.67)² + (10 − 14.67)² + (25 − 14.67)² = 189.67,
given that 3.5 is the mean of cluster {1, 2, 3, 8} and 14.67 is the mean of cluster {9, 10, 25}.
⚫ The latter partitioning has the lowest within-cluster variation; therefore, the k-means method assigns the value 8 to a cluster different from that containing 9 and 10 due to the outlier point 25.
⚫ Moreover, the center of the second cluster, 14.67, is substantially far from all the members in the cluster.
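The arithmetic above can be checked with a few lines of Python; the helper name below is illustrative.

```python
# Within-cluster variation (sum of squared distances to the cluster mean)
# for the two candidate partitionings discussed above.
def within_cluster_variation(clusters):
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return total

print(within_cluster_variation([[1, 2, 3], [8, 9, 10, 25]]))   # 196.0
print(within_cluster_variation([[1, 2, 3, 8], [9, 10, 25]]))   # ~189.67
```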
• One of the data objects acts as the cluster center (one representative object per cluster) instead of taking the mean value of the objects in a cluster (as in the k-means algorithm).
k-Medoids: A Representative Object-Based Technique
⚫ The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object. That is, an absolute-error criterion is used, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, oi)

⚫ where E is the sum of the absolute error for all objects p in the data set, and oi is the representative object of Ci. This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error.
⚫ When k = 1, we can find the exact median in O(n²) time. However, when k is a general positive number, the k-medoids problem is NP-hard.
⚫ Like the k-means algorithm, the initial representative objects (called seeds) are chosen arbitrarily.
Figure: PAM, a typical k-medoids algorithm: arbitrarily choose k objects as the initial medoids; assign each remaining object to its nearest medoid; then, in a loop, compute the total cost of swapping a current medoid O with a non-medoid object Orandom and perform the swap if the quality is improved, until no change occurs.
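The sketch below is a simplified, plain-Python rendering of this swap loop for 1-D values (not the exact PAM procedure from the slides): it greedily applies any cost-improving swap until none remains, and it assumes distinct values for simplicity.

```python
import random
from itertools import product

def total_cost(data, medoids):
    """Absolute-error criterion E: sum of distances from each object to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in data)

def pam_sketch(data, k, seed=0):
    """Start from arbitrary medoids and keep applying a cost-improving swap
    (medoid <-> non-medoid) until no swap improves E."""
    rng = random.Random(seed)
    medoids = rng.sample(data, k)                      # arbitrarily chosen seeds
    improved = True
    while improved:
        improved = False
        non_medoids = [p for p in data if p not in medoids]
        for m, o in product(list(medoids), non_medoids):
            candidate = [o if x == m else x for x in medoids]
            if total_cost(data, candidate) < total_cost(data, medoids):
                medoids, improved = candidate, True
                break                                  # restart the swap search with the new medoids
    return medoids, total_cost(data, medoids)

# The points from the earlier example, including the outlier 25:
print(pam_sketch([1, 2, 3, 8, 9, 10, 25], k=2))        # medoids such as [2, 9]; 25 no longer drags the center
```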
Variations of k-Medoids
⚫ Applicability of PAM:
⚫ PAM does not scale well to large databases because of its computational complexity.
k-Medoids: A Representative Object-Based Technique
• To handle larger data sets, a sampling-based method, CLARA (Clustering LARge Applications), runs the k-medoids search on random samples of the data instead of on the whole data set.
• The effectiveness of CLARA depends on the sample size.
• PAM searches for the best k-medoids among a given data set, whereas CLARA searches for the best k-medoids among the selected sample of the data set.
• CLARA cannot find a good clustering if any of the best sampled medoids is far from the best k-medoids.
• If an object is one of the best k-medoids but is not selected during sampling, CLARA will never find the best clustering.
• CLARANS (Clustering Large Applications based upon RANdomized Search) improves on CLARA: it starts with k randomly chosen medoids. It then randomly selects a current medoid x and an object y that is not one of the current medoids, and replaces x by y if the replacement improves the absolute error; this randomized search is repeated for l steps.
• The set of the current medoids after the l steps is considered a local optimum.
• CLARANS repeats this randomized process m times and returns the best local optimum as the final result.
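A minimal sketch of the CLARA idea for 1-D values follows; the exhaustive search over medoid subsets stands in for running PAM on each small sample, and all names and parameters are illustrative.

```python
import random
from itertools import combinations

def abs_error(data, medoids):
    """Absolute error: each object contributes its distance to the nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in data)

def clara_sketch(data, k, num_samples=5, sample_size=10, seed=0):
    """Search for good medoids on several random samples, but always evaluate
    each candidate medoid set against the FULL data set."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(data, min(sample_size, len(data)))
        # Stand-in for PAM on the sample: for tiny samples we can simply
        # try every k-subset of the sample as a candidate medoid set.
        medoids = min(combinations(sample, k), key=lambda m: abs_error(sample, m))
        cost = abs_error(data, medoids)            # quality measured on all objects
        if cost < best_cost:
            best, best_cost = list(medoids), cost
    return best, best_cost

print(clara_sketch([1, 2, 3, 8, 9, 10, 25], k=2))
```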
Hierarchical Methods
⚫ A hierarchical clustering method works by grouping data objects into a hierarchy or “tree” of clusters. It can be visualized as a dendrogram.
⚫ Types:
⚫ Agglomerative versus Divisive Hierarchical Clustering
⚫ BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
⚫ Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
⚫ Probabilistic Hierarchical Clustering
Hierarchical Methods: Agglomerative versus Divisive
Hierarchical Clustering
• An agglomerative hierarchical clustering method uses a bottom-
up strategy.
• It typically starts by letting each object form its own cluster and
iteratively merges clusters into larger and larger clusters, until all the
objects are in a single cluster or certain termination conditions are
satisfied.
⚫ The single cluster becomes the hierarchy’s root.
⚫ For the merging step, it finds the two clusters that are closest to each other (according to some similarity measure), and combines the two to form one cluster.
⚫ Because two clusters are merged per iteration, where each cluster
contains at least one object, an agglomerative method requires at most
n iterations.
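A minimal sketch of this bottom-up merging with the single-link (minimum) distance on 1-D values is shown below; the function name and the stopping rule are illustrative.

```python
def single_link_agnes(points, target_clusters=1):
    """Start with singleton clusters and repeatedly merge the two closest
    clusters (single-link distance) until target_clusters remain."""
    clusters = [[p] for p in points]                   # each object begins as its own cluster
    merges = []
    while len(clusters) > target_clusters:
        # find the pair of clusters with the smallest single-link (minimum) distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))   # one level of the dendrogram
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

clusters, merges = single_link_agnes([1, 2, 3, 8, 9, 10], target_clusters=1)
for left, right, dist in merges:
    print(f"merged {left} and {right} at distance {dist}")
```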
Hierarchical Methods: Agglomerative versus Divisive
Hierarchical Clustering
• A divisive hierarchical clustering method employs a top-down
strategy.
• It starts by placing all objects in one cluster, which is the hierarchy's root, and then divides the root into several smaller subclusters, recursively partitioning those clusters further.
• The partitioning process continues until each cluster at the lowest level is coherent enough: either it contains only one object, or the objects within a cluster are sufficiently similar to each other.
Hierarchical Methods: Agglomerative versus Divisive
Hierarchical Clustering
Figure: agglomerative (AGNES) and divisive (DIANA) clustering of the data set {a, b, c, d, e}. AGNES proceeds bottom-up over steps 0 to 4, merging a and b into ab, d and e into de, then de and c into cde, and finally ab and cde into abcde; DIANA performs the same splits top-down, reading the steps in reverse.
The above figure shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, on a data set of five objects, {a, b, c, d, e}.
⚫ A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step-by-step.
⚫ The figure shows a dendrogram for the five objects, where l = 0 shows the five objects as singleton clusters at level 0. At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels. We can also use a vertical axis to show the similarity scale between clusters. For example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they are merged together to form a single cluster.
Distance Measures in Hierarchical Methods
⚫ Whether using an agglomerative method or a divisive method, a core need is to measure the distance between two clusters, where each cluster is generally a set of objects.
⚫ Four widely used measures of the distance between clusters are the minimum distance (single link), maximum distance (complete link), mean distance (distance between the cluster means), and average distance (average pairwise distance).
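A small plain-Python illustration of these four measures on two made-up 1-D clusters:

```python
# Inter-cluster distance measures between two small 1-D clusters (hypothetical data).
from itertools import product

Ci, Cj = [1, 2, 3], [8, 9, 10]
pairs = list(product(Ci, Cj))

d_min  = min(abs(a - b) for a, b in pairs)                  # minimum distance (single link)
d_max  = max(abs(a - b) for a, b in pairs)                  # maximum distance (complete link)
d_mean = abs(sum(Ci)/len(Ci) - sum(Cj)/len(Cj))             # mean distance (between cluster means)
d_avg  = sum(abs(a - b) for a, b in pairs) / len(pairs)     # average pairwise distance

print(d_min, d_max, d_mean, d_avg)   # 5, 9, 7.0, 7.0
```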
BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
⚫ BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed for clustering large amounts of numeric data by integrating hierarchical clustering with other clustering methods.
⚫ It uses two concepts: the clustering feature (CF) and the CF-tree, a height-balanced tree that stores the clustering features for hierarchical clustering.
BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature
Trees
• Measures of a cluster: given n d-dimensional data objects or points in a cluster, the centroid x0, radius R, and diameter D are defined as
Centroid (middle of the cluster): x0 = (Σ_{i=1}^{n} xi) / n = LS / n
Radius (average distance from member objects to the centroid): R = √( Σ_{i=1}^{n} (xi − x0)² / n )
Diameter (average pairwise distance within the cluster): D = √( Σ_{i=1}^{n} Σ_{j=1}^{n} (xi − xj)² / (n(n − 1)) )
• Here, R and D reflect the tightness of the cluster around the centroid.
• Moreover, clustering features are additive. That is, for two disjoint clusters, C1 and C2, with the clustering features CF1 = <n1, LS1, SS1> and CF2 = <n2, LS2, SS2>, respectively, the clustering feature for the cluster that is formed by merging C1 and C2 is simply
CF1 + CF2 = <n1 + n2, LS1 + LS2, SS1 + SS2>
BIRCH: Multiphase Hierarchical Clustering Using Clustering
Feature Trees
• Example (clustering feature): suppose there are three points, (2, 5), (3, 2), and (4, 3), in a cluster, C1, and CF = <n, LS, SS>.
• The clustering feature of C1 is
CF1 = <3, (2 + 3 + 4, 5 + 2 + 3), (2² + 3² + 4², 5² + 2² + 3²)> = <3, (9, 10), (29, 38)>
• Suppose that C1 is disjoint to a second cluster, C2, where CF2 = <3, (35, 36), (417, 440)>.
• The clustering feature of a new cluster, C3, that is formed by merging C1 and C2, is derived by adding CF1 and CF2. That is,
CF3 = <3 + 3, (9 + 35, 10 + 36), (29 + 417, 38 + 440)> = <6, (44, 46), (446, 478)>
• A CF-tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering.
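The clustering-feature arithmetic of the example above can be reproduced with a short sketch (the helper names are illustrative):

```python
# A CF is the triple <n, LS, SS>: object count, linear sum, and square sum (per dimension).
def clustering_feature(points):
    dims = range(len(points[0]))
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in dims)         # linear sum per dimension
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)    # square sum per dimension
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CFs are additive: the CF of the merged cluster is the component-wise sum."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf1 = clustering_feature([(2, 5), (3, 2), (4, 3)])
print(cf1)                       # (3, (9, 10), (29, 38))
cf2 = (3, (35, 36), (417, 440))
print(merge_cf(cf1, cf2))        # (6, (44, 46), (446, 478))
```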
CF-tree structure
• BIRCH applies a multiphase clustering technique: a single scan of the data set yields a basic, good clustering, and one or more additional scans can optionally be used to further improve the quality. The primary phases are:
• Phase 1: BIRCH scans the database to build an initial in-memory CF-tree,
which can be viewed as a multilevel compression of the data that tries to
preserve the data’s inherent clustering structure.
• Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf
nodes of the CF-tree, which removes sparse clusters as outliers and groups
dense clusters into larger ones.
⚫ Thus, Chameleon does not depend on a static, user-supplied model and can
automatically adapt to the internal characteristics of the clusters being
merged.
⚫ The relative interconnectivity between two clusters, Ci and Cj, is defined as RI(Ci, Cj) = |EC{Ci, Cj}| / ((|ECCi| + |ECCj|) / 2), where EC{Ci, Cj} is the edge cut, as previously defined, for a cluster containing both Ci and Cj. Similarly, ECCi (or ECCj) is the minimum sum of the cut edges that partition Ci (or Cj) into two roughly equal parts.
⚫ The relative closeness between Ci and Cj is defined as RC(Ci, Cj) = SEC{Ci, Cj} / ((|Ci| / (|Ci| + |Cj|)) SECCi + (|Cj| / (|Ci| + |Cj|)) SECCj), where SEC{Ci, Cj} is the average weight of the edges that connect vertices in Ci to vertices in Cj, and SECCi (or SECCj) is the average weight of the edges that belong to the min-cut bisector of cluster Ci (or Cj).
⚫ Chameleon has been shown to have greater power at discovering arbitrarily shaped clusters of high quality than several well-known algorithms such as BIRCH and density-based DBSCAN.
⚫ However, the processing cost for high-dimensional data may require O(n²) time for n objects in the worst case.
⦁ The task of learning the generative model is to find the parameters μ and σ² such that the likelihood L(Ɲ(μ, σ²) : X) is maximized, that is, finding Ɲ(μ0, σ0²) = argmax{L(Ɲ(μ, σ²) : X)}.
⦁ The quality of a clustering is measured by the likelihood P(·) of the data given the cluster models. If we merge two clusters, Cj1 and Cj2, into a cluster Cj1 ∪ Cj2, the change in quality of the overall clustering is governed by the ratio P(Cj1 ∪ Cj2) / (P(Cj1) · P(Cj2)); merging improves the clustering only if the merged cluster explains the data at least as well as the two separate clusters.
⦁ Probabilistic models are more interpretable, but sometimes less flexible than distance metrics.
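For a concrete sense of the estimation step, the sketch below computes the maximum-likelihood μ and σ² for 1-D data and the corresponding log-likelihood; the data values are made up for illustration.

```python
import math

def gaussian_mle(xs):
    """Maximum-likelihood estimates of mu and sigma^2 for 1-D data."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n          # MLE uses 1/n, not 1/(n-1)
    return mu, var

def log_likelihood(xs, mu, var):
    """log L(N(mu, var) : X) = sum of log-densities of the points under N(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var) for x in xs)

xs = [1.0, 2.0, 3.0, 2.5, 1.5]
mu, var = gaussian_mle(xs)
print(mu, var, log_likelihood(xs, mu, var))
```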
⚫ For example, the dataset in the figure below can easily be divided into three
clusters using the k-means algorithm.
Density-Based Methods
Consider the following figures:
The data points in these figures are grouped in arbitrary shapes or include
outliers. Density-based clustering algorithms are very efficient at finding
high-density regions and outliers. It is very important to detect outliers for some tasks, e.g., anomaly detection.
DBSCAN
⚫ DBSCAN stands for density-based spatial clustering of applications
with noise.
DBSCAN
• There are two key parameters of DBSCAN:
⚫ eps: the distance that specifies the neighborhoods. Two points are considered to be neighbors if the distance between them is less than or equal to eps.
⚫ minPts: the minimum number of data points required to define a cluster.
• Core points: a point is a core point if its eps-neighborhood contains at least minPts points (including the point itself).
• Border points: a border point is not a core point, but falls within the neighborhood of a core point. A border point can fall within the neighborhoods of several core points.
• Noise points: a noise point is any point that is neither a core point nor a border point.
DBSCAN
• In the center-based approach, density is estimated for a particular point in
the data set by counting the number of points within a specified radius, Eps,
of that point. This count includes the point itself.
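A minimal DBSCAN sketch for 1-D values following this center-based approach is given below; the data set and parameter values are illustrative.

```python
def dbscan(points, eps, min_pts):
    """Label each 1-D point with a cluster id, or -1 for noise. Density is the
    center-based count of points within eps (the point itself included)."""
    labels = {}                                        # point index -> cluster id, or -1 for noise

    def neighbors(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:                       # not a core point (may become a border point later)
            labels[i] = -1
            continue
        labels[i] = cluster_id                         # start a new cluster from this core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id                 # noise turned into a border point
            if j in labels:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:            # j is also a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return [labels[i] for i in range(len(points))]

print(dbscan([1, 2, 3, 8, 9, 10, 25], eps=2, min_pts=2))   # two clusters, with 25 labeled as noise (-1)
```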
• Disadvantages of DBSCAN:
• Does not work well when dealing with clusters of varying densities.
• It struggles with high-dimensional data.