Data Mining Unit 5


Data Mining

UNIT-V
Cluster Analysis
B.Tech(CSE)-V SEM

UNIT-V: Cluster Analysis: Contents
• Cluster Analysis: Basic Concepts and Methods
➢ What is Cluster Analysis?

➢ Requirements for Cluster Analysis

➢ Overview of Basic Clustering Methods

• Partitioning Methods
➢ k-Means: A Centroid-Based Technique

➢ k-Medoids: A Representative Object-Based Technique

• Hierarchical Methods
➢ Agglomerative versus Divisive Hierarchical Clustering

➢ Distance Measures in Algorithmic Methods

➢ BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees

➢ Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling

➢ Probabilistic Hierarchical Clustering

• Density Based Methods


➢ DBSCAN: Density-Based Clustering Based on Connected Regions with High Density



Cluster: (Cluster Analysis – Unsupervised learning)
• Clustering is the process of grouping a set of data objects into
multiple groups or clusters so that objects within a cluster have
high similarity, but are very dissimilar to objects in other
clusters.

• Dissimilarities (or unrelated) and similarities (or related) are


assessed based on the attribute values describing the objects
and often involve distance measures.

• Clustering is known as unsupervised learning because the


class label information is not present.
- The intra-cluster distance measures how close the data points within a cluster are
to their cluster center.

- The inter-cluster distance measures how far the data points of one cluster are
from the data points of other clusters.

- Outliers are extreme values that fall a long way outside of the other
observations.
Cluster Analysis: Applications
⚫ Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
⚫ Information retrieval: document clustering
⚫ Land use: identification of areas of similar land use in an earth
observation database
⚫ Marketing: help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
⚫ City planning: identifying groups of houses according to their house
type, value, and geographical location
⚫ Earthquake studies: observed earthquake epicenters should be
clustered along continental faults
⚫ Climate: understanding the Earth's climate; finding patterns in the
atmosphere and ocean
⚫ Economic science: market research

Clustering as a Preprocessing Tool (Utility)
⚫ Summarization:
⚫ Preprocessing for regression, PCA, classification, and
association analysis

⚫ Compression:
⚫ Image processing: vector quantization

⚫ Finding K-nearest Neighbors:


⚫ Localizing search to one or a small number of clusters.

⚫ Outlier detection:
⚫ Outliers are often viewed as those “far away” from any cluster

Quality: What Is Good Clustering?

⚫ A good clustering method will produce high quality clusters


⚫ high intra-class similarity: cohesive within clusters

⚫ low inter-class similarity: distinctive between clusters

⚫ The quality of a clustering method depends on


⚫ the similarity measure used by the method

⚫ its implementation, and

⚫ its ability to discover some or all of the hidden patterns

Requirements for Cluster Analysis
The following are typical requirements of clustering in data mining:
⚫ Scalability
⚫ Clustering all the data instead of only samples
⚫ Therefore, we need highly scalable clustering algorithms to deal with
large databases.

⚫ Ability to deal with different types of attributes


⚫ Many algorithms are designed to cluster numeric (interval-based) data.
However, applications may require clustering other data types, such as
binary, nominal (categorical), and ordinal data, or mixtures of these data
types.
⚫ More applications need clustering techniques for complex data types
such as graphs, sequences, images, and documents.



Requirements for Cluster Analysis (Contd…)
⚫ Discovery of clusters with arbitrary shape
⚫ Many clustering algorithms determine clusters based on Euclidean or
Manhattan distance measures.
⚫ Algorithms based on such distance measures tend to find spherical
clusters with similar size and density. However, a cluster could be of any
shape.
⚫ It is important to develop algorithms that can detect clusters of arbitrary
shape.

⚫ Requirements for domain knowledge to determine input parameters
⚫ Many clustering algorithms require users to provide domain knowledge
in the form of input parameters such as the desired number of clusters.
⚫ Consequently, the clustering results may be sensitive to such parameters.
⚫ Requiring the specification of domain knowledge not only burdens
users, but also makes the quality of clustering difficult to control.



Requirements for Cluster Analysis (Contd…)
⚫ Ability to deal with noisy data
⚫ Most real-world data sets contain outliers and/or missing, unknown, or
erroneous data.
⚫ Clustering algorithms can be sensitive to such noise and may produce
poor-quality clusters. Therefore, we need clustering methods that are
robust to noise.

⚫ Incremental clustering and insensitivity to input order


⚫ In many applications, incremental updates (representing newer data)
may arrive at any time.
⚫ Clustering algorithms may also be sensitive to the input data order.
⚫ Incremental clustering algorithms, and algorithms that are insensitive to
the input order, are needed.



Requirements for Cluster Analysis (Contd…)
⚫ Capability of clustering high-dimensionality data
⚫ A data set can contain numerous dimensions or attributes. When
clustering documents, for example, each keyword can be regarded as a
dimension, and there are often thousands of keywords.
⚫ Finding clusters of data objects in a high-dimensional space is
challenging, especially considering that such data can be very sparse and
highly skewed.
⚫ Constraint-based clustering
⚫ Real-world applications may need to perform clustering under various
kinds of constraints.
⚫ A challenging task is to find data groups with good clustering behavior
that satisfy specified constraints.
⚫ Interpretability and usability
⚫ Users want clustering results to be interpretable, comprehensible, and
usable.
⚫ It is important to study how an application goal may influence the
selection of clustering features and clustering methods.



Considerations for Cluster Analysis
Clustering methods can be compared using the following aspects:

⚫ Partitioning criteria
⚫ Single-level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
⚫ Separation of clusters
⚫ Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)
⚫ Similarity measure
⚫ Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-
based (e.g., density or contiguity)
⚫ Clustering space
⚫ Full space (often when low-dimensional) vs. subspaces (often in high-
dimensional clustering)



Overview of Major Clustering Approaches

(Overview table of the major clustering approaches covered in this unit: partitioning,
hierarchical, and density-based methods.)


Partitioning Methods
⚫ The simplest and most fundamental version of cluster analysis is
partitioning, which organizes the objects of a set into several exclusive
groups or clusters.

⚫ Formally, given a data set, D, of n objects, and k, the number of clusters to
form, a partitioning algorithm organizes the objects into k partitions (k ≤ n),
where each partition represents a cluster.

⚫ The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance, so that the objects
within a cluster are “similar” to one another and “dissimilar” to objects in
other clusters in terms of the data set attributes.
k-Means: A Centroid-Based Technique
⚫ Suppose a data set, D, contains n objects in Euclidean space.
⚫ Partitioning methods distribute the objects in D into k clusters, C1, …, Ck, such that
Ci ⊆ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.
⚫ An objective function is used to assess the partitioning quality so that objects
within a cluster are similar to one another but dissimilar to objects in other
clusters.
⚫ The objective function aims for high intra-cluster similarity and low inter-
cluster similarity.
⚫ A centroid-based partitioning technique uses the centroid of a cluster, Ci, to
represent that cluster.
⚫ Conceptually, the centroid of a cluster is its center point. The centroid can be
defined in various ways, such as by the mean or medoid of the objects (or
points) assigned to the cluster.
⚫ The difference between an object p ∈ Ci and ci, the representative of the cluster,
is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between
two points x and y.

k-Means: A Centroid-Based Technique
⚫ The quality of cluster Ci can be measured by the within-cluster
variation, which is the sum of squared error between all objects in Ci
and the centroid ci, defined as

E = ki=1  pCi dist( p,ci ) 2


⚫ where E is the sum of the squared error for all objects in the data set; p
is the point in space representing a given object; and ci is the centroid of
cluster Ci (both p and ci are multidimensional).

⚫ This objective function tries to make the resulting k clusters as


compact and as separate as possible.

k-Means: A Centroid-Based Technique
⚫ Given k, the k-means algorithm is implemented in four steps:

⚫ Partition objects into k nonempty subsets


⚫ Compute seed points as the centroids of the clusters of the
current partitioning (the centroid is the center, i.e., mean
point, of the cluster)
⚫ Assign each object to the cluster with the nearest seed point

⚫ Go back to Step 2, stop when the assignment does not change
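A minimal from-scratch sketch of these four steps follows (Python/NumPy; the function and variable names are illustrative, not taken from the slides or from any library):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects at random as the initial seed points (centroids).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop when the assignment (and hence the centroids) no longer changes.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids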

k-Means: A Centroid-Based Technique

(Figure: clustering of a set of objects using the k-means method; in (b) the cluster
centers are updated and objects are reassigned accordingly. The mean of each cluster is marked by a +.)
An Example of K-Means Clustering

(Figure, with K = 2: the initial data set is arbitrarily partitioned into k groups, the
cluster centroids are updated, objects are reassigned to their nearest centroid, and
the loop repeats if needed until the assignment stabilizes.)

◼ Partition objects into k nonempty subsets
◼ Repeat
◼ Compute the centroid (i.e., mean point) of each partition
◼ Assign each object to the cluster of its nearest centroid
◼ Until no change
K-Means Clustering Algorithm (contd…)
Step 1: Take the mean value of each cluster.
Step 2: Assign each data point to the cluster whose mean is nearest.
Step 3: Repeat Steps 1 and 2 until the means no longer change.
Example:
Data points = {2, 4, 6, 9, 12, 16, 20, 24, 26}
Number of clusters = 2, initial means = {4, 12}

Iteration 1: k1 = {2, 4, 6}, k2 = {9, 12, 16, 20, 24, 26}
mean(k1) = (2 + 4 + 6)/3 = 4; mean(k2) = (9 + 12 + 16 + 20 + 24 + 26)/6 = 107/6 ≈ 18

Iteration 2: k1 = {2, 4, 6, 9}, k2 = {12, 16, 20, 24, 26}
mean(k1) = 21/4 = 5.25 ≈ 5; mean(k2) = 98/5 = 19.6 ≈ 20

Iteration 3: k1 = {2, 4, 6, 9, 12}, k2 = {16, 20, 24, 26}
mean(k1) = 33/5 = 6.6 ≈ 7; mean(k2) = 86/4 = 21.5 ≈ 22

Iteration 4: k1 = {2, 4, 6, 9, 12}, k2 = {16, 20, 24, 26}
mean(k1) ≈ 7; mean(k2) ≈ 22 (the means do not change, so the algorithm stops)
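The same 1-D example can be reproduced with scikit-learn's KMeans, seeding it with the initial means {4, 12} used above (an optional check; it assumes scikit-learn is installed):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 4, 6, 9, 12, 16, 20, 24, 26], dtype=float).reshape(-1, 1)
km = KMeans(n_clusters=2, init=np.array([[4.0], [12.0]]), n_init=1).fit(X)
print(km.labels_)           # {2, 4, 6, 9, 12} end up in one cluster, {16, 20, 24, 26} in the other
print(km.cluster_centers_)  # final means, approximately 6.6 and 21.5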
Comments on the K-Means Method
⚫ Strength: Efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.

⚫ Comment: often terminates at a local optimum.

⚫ Weakness
⚫ Need to specify k, the number of clusters, in advance
(there are ways to automatically determine the best k)
⚫ Sensitive to noisy data and outliers
⚫ Not suitable to discover clusters with non-convex shapes

k-Medoids: A Representative Object-Based Technique
A drawback of k-means: Consider seven points in 1-D space having the values 1,
2, 3, 8, 9, 10, and 25, respectively. Intuitively, by visual inspection we may
imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, where
point 25 is excluded because it appears to be an outlier.
⚫ How would k-means partition the values? If we apply k-means using k = 2 and
Eq. (1), the partitioning {1, 2, 3}, {8, 9, 10, 25} has the within-cluster variation

$(1-2)^2 + (2-2)^2 + (3-2)^2 + (8-13)^2 + (9-13)^2 + (10-13)^2 + (25-13)^2 = 196$

⚫ given that the mean of cluster {1, 2, 3} is 2 and the mean of {8, 9, 10, 25} is 13.
⚫ Comparing this to the partitioning {1, 2, 3, 8}, {9, 10, 25}, for which k-means
computes the within-cluster variation as

$(1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + (8-3.5)^2 + (9-14.67)^2 + (10-14.67)^2 + (25-14.67)^2 = 189.67$



k-Medoids: A Representative Object-Based Technique

⚫ Given that 3.5 is the mean of cluster {1, 2, 3, 8} and 14.67 is the mean of cluster
{9, 10, 25}.
⚫ The latter partitioning has the lowest within-cluster variation; therefore, the
k-means method assigns the value 8 to a cluster different from that
containing 9 and 10, due to the outlier point 25.

⚫ Moreover, the center of the second cluster, 14.67, is substantially far from all
the members in the cluster.

⚫ How can we modify the k-means algorithm to diminish such sensitivity to


outliers?



k-Medoids: A Representative Object-Based Technique

• One of the data objects acts as the cluster center (one representative object per
cluster) instead of taking the mean value of the objects in the cluster (as in the
k-means algorithm).

• We call this cluster representative a cluster medoid, or simply a medoid.

(Figure: two example plots on a 0–10 × 0–10 grid.)
k-Medoids: A Representative Object-Based Technique
⚫ The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its
corresponding representative object. That is, an absolute-error criterion is
used, defined as

$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, o_i)$

⚫ where E is the sum of the absolute error for all objects p in the data set, and oi
is the representative object of Ci. This is the basis for the k-medoids
method, which groups n objects into k clusters by minimizing the absolute
error.
⚫ When k = 1, we can find the exact median in O(n²) time. However, when k is a
general positive number, the k-medoids problem is NP-hard.

k-Medoids: A Representative Object-Based Technique
⚫ Partitioning Around Medoids (PAM) algorithm is a popular realization of k-
medoids clustering.

⚫ It tackles the problem in an iterative, greedy way.

⚫ Like the k-means algorithm, the initial representative objects (called seeds) are
chosen arbitrarily.

⚫ All the possible replacements are tried out.

⚫ The iterative process of replacing representative objects by other objects


continues until the quality of the resulting clustering cannot be improved by any
replacement.

⚫ This quality is measured by a cost function of the average dissimilarity between
an object and the representative object of its cluster.
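A rough sketch of this iterative, greedy replacement process is given below, assuming a precomputed pairwise distance matrix D (n × n); the function name and structure are illustrative, not a reference implementation of PAM:

import numpy as np

def pam(D, k, seed=0):
    n = len(D)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial seeds

    def cost(meds):
        # absolute-error criterion: each object contributes its distance
        # to the nearest representative object (medoid)
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        # try replacing each current medoid by each non-medoid object
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                c = cost(trial)
                if c < best:               # keep the swap only if clustering quality improves
                    medoids, best = trial, c
                    improved = True
    labels = D[:, medoids].argmin(axis=1)  # final assignment to the nearest medoid
    return medoids, labels, best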

k-Medoids: A Representative Object-Based Technique

Four cases of the cost function for k-medoids clustering.

PAM: A Typical K-Medoids Algorithm

(Figure, with K = 2 and plots on a 0–10 × 0–10 grid, illustrating the PAM iteration:)
• Arbitrarily choose k objects as the initial medoids.
• Assign each remaining object to its nearest medoid (total cost = 20 in the example).
• Randomly select a nonmedoid object, Orandom.
• Compute the total cost of swapping a medoid with Orandom (total cost = 26 in the example).
• Swap the medoid with Orandom if the quality is improved.
• Repeat (the do loop) until no change.
K-means vs. k-medoids
1. Comparing k-means with k-medoids:
• Both algorithms need a predefined value for k, the number of clusters,
prior to running the algorithm.
• Both algorithms also arbitrarily choose the initial cluster centers (centroids or medoids).
• The k-medoids method is more robust than k-means in the presence
of outliers, because a medoid is less influenced by outliers than a
mean.

2. Time complexity of k-means is O(tkn),
where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.

3. Time complexity of PAM is O(k(n − k)²).
Variations of k-Medoids
⚫ Applicability of PAM:
⚫ PAM does not scale well to large databases because of its computational
complexity.

⚫ Some variants of PAM that are designed mainly for large
data sets are:
⚫ CLARA (Clustering LARge Applications) and

⚫ CLARANS (Clustering Large Applications based upon RANdomized
Search)

⚫ CLARANS is an improvement of CLARA.
k-Medoids: A Representative Object-Based Technique
• CLARA draws a random sample of the data set and applies PAM on the sample
to find the medoids; the effectiveness of CLARA therefore depends on the sample size.
• PAM searches for the best k-medoids among a given data set, whereas CLARA
searches for the best k-medoids among the selected sample of the data set.
• CLARA cannot find a good clustering if any of the best sampled medoids is far
from the best k-medoids.
• If an object is one of the best k-medoids but is not selected during sampling,
CLARA will never find the best clustering.

How might we improve the quality and scalability of CLARA?

• When searching for better medoids, PAM examines every object in the data set
against every current medoid, whereas CLARA confines the candidate medoids
to only a random sample of the data set.
• A randomized algorithm called CLARANS (Clustering Large Applications
based upon RANdomized Search) presents a trade-off between the cost and the
effectiveness of using samples to obtain clustering.



k-Medoids: A Representative Object-Based Technique
CLARANS (Clustering Large Applications based upon RANdomized
Search):

• First, it randomly selects k objects in the data set as the current medoids.

• It then randomly selects a current medoid x and an object y that is not one of
the current medoids.

• Can replacing x by y improve the absolute-error criterion? If yes, the
replacement is made.

• CLARANS conducts such a randomized search l times.

• The set of the current medoids after the l steps is considered a local optimum.

• CLARANS repeats this randomized process m times and returns the best local
optimum as the final result.
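A hedged sketch of this randomized search, again assuming a precomputed pairwise distance matrix D; the parameter names num_local (the m restarts) and max_neighbor (the l swap trials) are illustrative:

import numpy as np

def clarans(D, k, num_local=5, max_neighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(D)

    def cost(meds):
        return D[:, list(meds)].min(axis=1).sum()        # absolute-error criterion

    best_meds, best_cost = None, np.inf
    for _ in range(num_local):                           # m randomized restarts
        current = set(rng.choice(n, size=k, replace=False).tolist())
        current_cost = cost(current)
        trials = 0
        while trials < max_neighbor:                     # up to l random swap attempts
            x = rng.choice(list(current))                # a current medoid
            y = rng.choice([o for o in range(n) if o not in current])
            candidate = (current - {int(x)}) | {int(y)}
            c = cost(candidate)
            if c < current_cost:                         # accept an improving swap
                current, current_cost, trials = candidate, c, 0
            else:
                trials += 1
        if current_cost < best_cost:                     # keep the best local optimum
            best_meds, best_cost = current, current_cost
    return sorted(best_meds), best_cost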

Hierarchical Methods
⚫ A hierarchical clustering method works by grouping data objects into a
hierarchy or “tree” of clusters. It can be visualized as a dendrogram.
⚫ Types:
⚫ Agglomerative versus Divisive Hierarchical Clustering
⚫ BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
⚫ Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
⚫ Probabilistic Hierarchical Clustering

⚫ Representing data objects in the form of a hierarchy is useful for data


summarization and visualization
⚫ For example, as the manager of human resources at AllElectronics, you may
organize your employees into major groups such as executives, managers, and
staff.
⚫ You can further partition these groups into smaller subgroups. For instance, the
general group of staff can be further divided into subgroups of senior officers,
officers, and trainees.
⚫ A hierarchical clustering method can be either agglomerative or divisive,
depending on whether the hierarchical decomposition is formed in a bottom-
up (merging) or top-down (splitting) fashion.

Hierarchical Methods: Agglomerative versus Divisive
Hierarchical Clustering
• An agglomerative hierarchical clustering method uses a bottom-
up strategy.
• It typically starts by letting each object form its own cluster and
iteratively merges clusters into larger and larger clusters, until all the
objects are in a single cluster or certain termination conditions are
satisfied.
⚫ The single cluster becomes the hierarchy’s root.
⚫ For the merging step, it finds the two clusters that are closest to each
other (according to some similarity measure), and combines the two to
form one cluster.
⚫ Because two clusters are merged per iteration, where each cluster
contains at least one object, an agglomerative method requires at most
n iterations.

Hierarchical Methods: Agglomerative versus Divisive
Hierarchical Clustering
• A divisive hierarchical clustering method employs a top-down
strategy.

• It starts by placing all objects in one cluster, which is the hierarchy’s


root. It then divides the root cluster into several smaller subclusters,
and recursively partitions thoseclusters intosmallerones.

• The partitioning process continues until each cluster at the lowest level
is coherent enough—either containing only one object, or the objects
within a cluster are sufficiently similar to each other.

• In either agglomerative or divisive hierarchical clustering, a user can
specify the desired number of clusters as a termination condition.

Hierarchical Methods: Agglomerative versus Divisive
Hierarchical Clustering
(Figure: over steps 0–4, AGNES merges the singleton clusters {a}, {b}, {c}, {d}, {e}
into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; DIANA performs the splits in
the reverse order, from step 4 back to step 0.)

The figure shows the application of AGNES (AGglomerative NESting),
an agglomerative hierarchical clustering method, and DIANA (DIvisive
ANAlysis), a divisive hierarchical clustering method, on a data set of five
objects, {a, b, c, d, e}.
⚫ A tree structure called a dendrogram is commonly used to represent the
process of hierarchical clustering. It shows how objects are grouped together (in
an agglomerative method) or partitioned (in a divisive method) step by step.
⚫ The figure shows a dendrogram for the five objects, where l = 0 shows the five
objects as singleton clusters at level 0. At l = 1, objects a and b are grouped
together to form the first cluster, and they stay together at all subsequent levels.
We can also use a vertical axis to show the similarity scale between clusters. For
example, when the similarity of the two groups of objects, {a, b} and {c, d, e}, is
roughly 0.16, they are merged together to form a single cluster.
Distance Measures in Hierarchical Methods
⚫ Whether using an agglomerative method or a divisive
method, a core need is to measure the distance between
two clusters, where each cluster is generally a set of
objects.
• Four widely used measures for the distance between clusters are given below, where
|p − p′| is the distance between two objects or points p and p′, mi is the mean of cluster Ci,
and ni is the number of objects in Ci. They are also known as linkage measures.

Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$

Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$

Mean distance: $d_{mean}(C_i, C_j) = |m_i - m_j|$

Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
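These linkage measures can be tried out directly with SciPy's agglomerative clustering routines; the sample points below are made up purely for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # one tight group
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])   # another tight group

# 'single'   -> minimum distance (nearest-neighbor / single-linkage)
# 'complete' -> maximum distance (farthest-neighbor / complete-linkage)
# 'average'  -> average distance between clusters
# 'centroid' -> distance between cluster means (mean distance)
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                     # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)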

Distance Measures in Hierarchical Methods
⚫ When an algorithm uses the minimum distance, dmin(Ci ,Cj), to measure the
distance between clusters, it is sometimes called a nearest-neighbor
clustering algorithm.
⚫ If the clustering process is terminated when the distance between nearest
clusters exceeds a user-defined threshold, it is called a single-linkage
algorithm
⚫ If we view the data points as nodes of a graph, with edges forming a path
between the nodes in a cluster, then the merging of two clusters, Ci and Cj ,
corresponds to adding an edge between the nearest pair of nodes in Ci and
Cj .
⚫ Because edges linking clusters always go between distinct clusters, the
resulting graph will generate a tree.
⚫ Thus, an agglomerative hierarchical clustering algorithm that uses the
minimum distance measure is also called a minimal spanning tree
algorithm

Distance Measures in Hierarchical Methods
⚫ When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the
distance between clusters, it is sometimes called a farthest-neighbor
clustering algorithm.
⚫ If the clustering process is terminated when the maximum distance between
nearest clusters exceeds a user-defined threshold, it is called a complete-
linkage algorithm.
⚫ The minimum and maximum measures represent two extremes in measuring
the distance between clusters.
⚫ They tend to be overly sensitive to outliers or noisy data.
⚫ The use of mean or average distance is a compromise between the minimum
and maximum distances and overcomes the outlier sensitivity problem.
⚫ Whereas the mean distance is the simplest to compute, the average distance is
advantageous in that it can handle categorical as well as numeric data.
⚫ The computation of the mean vector for categorical data can be difficult or
impossible to define.

BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature
Trees
• Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)
is designed for clustering a large amount of numeric data.

• It does so by integrating hierarchical clustering (at the initial microclustering
stage) and other clustering methods such as iterative partitioning (at the
later macroclustering stage).

• It overcomes two difficulties in agglomerative clustering methods:
(1) scalability and
(2) the inability to undo what was done in the previous step.

• BIRCH introduces two concepts:
(i) the clustering feature (CF), and
(ii) the clustering feature tree (CF-tree), which is used to summarize the
cluster representation.

• A CF-tree is a height-balanced tree that stores the clustering features for
hierarchical clustering.
BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature
Trees
• Measures of clustering: given n d-dimensional data objects or points in
a cluster, its centroid x0, radius R, and diameter D are defined as:

Centroid (middle of the cluster): $x_0 = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{LS}{n}$

Radius (average distance from member objects to the centroid): $R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}}$

Diameter (average pairwise distance within the cluster): $D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}}$

• Here, R and D reflect the tightness of the cluster around the centroid.
• The clustering feature (CF) of a cluster is the 3-D summary CF = <n, LS, SS>, where LS is the
linear sum of the n points and SS is the square sum of the points.
• Moreover, clustering features are additive. That is, for two disjoint clusters, C1 and C2,
with the clustering features CF1 = <n1, LS1, SS1> and CF2 = <n2, LS2, SS2>, respectively, the
clustering feature for the cluster formed by merging C1 and C2 is simply
CF1 + CF2 = <n1 + n2, LS1 + LS2, SS1 + SS2>
BIRCH: Multiphase Hierarchical Clustering Using Clustering
Feature Trees

• Example: Clustering feature. Suppose there are three points, (2,5),(3,2), and
(4,3), in a cluster, C1. CF = <n,LS,SS>
• The clustering feature of C1 is
CF1 = <3, (2+3+4,5+2+3),(22+32+42,52+22+32)> = <3,(9,10),(29,38)>
• Suppose that C1 is disjoint from a second cluster, C2, where
CF2 = <3, (35,36),(417,440)>
• The clustering feature of a new cluster, C3, that is formed by merging C1 and
C2, is derived by adding CF1 and CF2. That is,
CF3 = <3+3, (9+35,10+36), (29+417,38+440)> =<6, (44,46),(446,478)>
• A CF-tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering.
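A small sketch of clustering features and their additivity, following the CF = <n, LS, SS> definition above (the CF class is illustrative; cf2 is constructed directly from the values given for C2):

import numpy as np

class CF:
    # Clustering feature <n, LS, SS> of a set of d-dimensional points.
    def __init__(self, n, LS, SS):
        self.n, self.LS, self.SS = n, np.asarray(LS, float), np.asarray(SS, float)

    @classmethod
    def from_points(cls, points):
        pts = np.asarray(points, dtype=float)
        return cls(len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0))

    def __add__(self, other):
        # CF1 + CF2 = <n1 + n2, LS1 + LS2, SS1 + SS2>
        return CF(self.n + other.n, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.n                        # x0 = LS / n

    def radius(self):
        # R^2 = (sum of squares)/n - ||x0||^2, so R is computable from the CF alone
        x0 = self.centroid()
        return np.sqrt(max(self.SS.sum() / self.n - (x0 ** 2).sum(), 0.0))

cf1 = CF.from_points([(2, 5), (3, 2), (4, 3)])
print(cf1.n, cf1.LS, cf1.SS)                           # 3 [9. 10.] [29. 38.]

cf2 = CF(3, (35, 36), (417, 440))                      # CF2 as given above
cf3 = cf1 + cf2
print(cf3.n, cf3.LS, cf3.SS)                           # 6 [44. 46.] [446. 478.]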



BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature
Trees
• The nonleaf nodes store sums of the CFs of their children, and thus summarize
clustering information about their children.
• A CF-tree has two parameters: branching factor, B, and threshold, T.
• The branching factor specifies the maximum number of children per nonleaf
node.
• The threshold parameter specifies the maximum diameter of subclusters
stored at the leaf nodes of the tree.
• These two parameters implicitly control the resulting tree’s size.

(Figure: CF-tree structure.)



BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature
Trees
• Given a limited amount of main memory, an important consideration in
BIRCH is to minimize the time required for input/output (I/O).
• BIRCH applies a multiphase clustering technique: A single scan of the data
set yields a basic, good clustering, and one or more additional scans can

optionally be used to further improve the quality. The primary phases are
• Phase 1: BIRCH scans the database to build an initial in-memory CF-tree,
which can be viewed as a multilevel compression of the data that tries to
preserve the data’s inherent clustering structure.
• Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf
nodes of the CF-tree, which removes sparse clusters as outliers and groups
dense clusters into larger ones.



BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature
Trees
• The time complexity of the algorithm is O(n), where n is the number of
objects to be clustered.
• Experiments have shown the linear scalability of the algorithm with respect
to the number of objects, and good quality of clustering of the data.
• However, since each node in a CF-tree can hold only a limited number of
entries due to its size, a CF-tree node does not always correspond to what a
user may consider a natural cluster.
• Moreover, if the clusters are not spherical in shape, BIRCH does not perform
well because it uses the notion of radius or diameter to control the boundary
of a cluster.



Chameleon: Multiphase Hierarchical Clustering Using
Dynamic Modeling
⚫ Chameleon is a hierarchical clustering algorithm that uses dynamic modeling
to determine the similarity between pairs of clusters.
⚫ In Chameleon, cluster similarity is assessed based on:
(1) how well connected objects are within a cluster, and
(2) the proximity of clusters.
⚫ That is, two clusters are merged if their interconnectivity is high and they are
close together.

⚫ Thus, Chameleon does not depend on a static, user-supplied model and can
automatically adapt to the internal characteristics of the clusters being
merged.

⚫ The merge process facilitates the discovery of natural and homogeneous


clusters and applies to all data types as long as a similarity function can be
specified.
Chameleon: Multiphase Hierarchical Clustering Using
Dynamic Modeling
⚫ Chameleon then uses an agglomerative hierarchical clustering algorithm that
iteratively merges subclusters based on their similarity.
⚫ To determine the pairs of most similar subclusters, it takes into account both the
interconnectivityand the closeness of the clusters.
⚫ Specifically, Chameleon determines the similarity between each pair of clusters Ci
and Cj according to their relative interconnectivity, RI(Ci, Cj), and their relative
closeness, RC(Ci, Cj).
⚫ The relative interconnectivity, RI(Ci, Cj), between two clusters, Ci and Cj, is
defined as the absolute interconnectivity between Ci and Cj, normalized with
respect to the internal interconnectivity of the two clusters, Ci and Cj. That is,

$RI(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}{\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}$

⚫ where EC{Ci, Cj} is the edge cut, as previously defined, for a cluster containing both Ci
and Cj. Similarly, ECCi (or ECCj) is the minimum sum of the cut edges that
partition Ci (or Cj) into two roughly equal parts.



Chameleon: Multiphase Hierarchical Clustering Using
Dynamic Modeling
⚫ The relative closeness, RC(Ci, Cj), between a pair of clusters, Ci and Cj, is the
absolute closeness between Ci and Cj, normalized with respect to the internal
closeness of the two clusters, Ci and Cj. It is defined as

$RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}$

⚫ where S̄EC{Ci, Cj} is the average weight of the edges that connect vertices in Ci to
vertices in Cj, and S̄ECCi (or S̄ECCj) is the average weight of the edges that belong to
the min-cut bisector of cluster Ci (or Cj).
⚫ Chameleon has been shown to have greater power at discovering arbitrarily shaped
clusters of high quality than several well-known algorithms such as BIRCH and
density-based DBSCAN.
⚫ However, the processing cost for high-dimensional data may require O(n²) time for n
objects in the worst case.



Probabilistic Hierarchical Clustering
⚫ Algorithmic hierarchical clustering methods can suffer from several
drawbacks.
⚫ Nontrivial to choose a good distance measure
⚫ Hard to handle missing attribute values
⚫ Optimization goal not clear: heuristic, local search
⚫ Consequently, the optimization goal of the resulting cluster hierarchy can be unclear.

• Probabilistic hierarchical clustering aims to overcome some of these
disadvantages
• by using probabilistic models to measure distances between clusters.
• Generative model: regard the set of data objects to be clustered as a sample of the
underlying data generation mechanism to be analyzed.
• Easy to understand, with the same efficiency as algorithmic agglomerative clustering
methods, and it can handle partially observed data.

• In practice, we assume the generative models adopt common distribution functions, such
as the Gaussian distribution or Bernoulli distribution, which are governed by parameters.
Probabilistic Hierarchical Clustering
⚫ Given a set of 1-D points X = {x1, …, xn} for clustering analysis, assume that they
are generated by a Gaussian distribution

$\mathcal{N}(\mu, \sigma^2):\quad f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

where the parameters are μ (the mean) and σ² (the variance).

⚫ The probability that a point xi ∈ X is then generated by the model is

$P(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

Probabilistic Hierarchical Clustering

⦁ The likelihood that X is generated by the model is

$L(\mathcal{N}(\mu, \sigma^2) : X) = P(X \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

⦁ The task of learning the generative model is to find the parameters μ and σ²
such that the likelihood L(Ɲ(μ, σ²) : X) is maximized, that is, finding

$\mathcal{N}(\mu_0, \sigma_0^2) = \arg\max\,\{ L(\mathcal{N}(\mu, \sigma^2) : X) \}$

where max { L(Ɲ(μ, σ²) : X) } is called the maximum likelihood.
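As a small numerical illustration (the 1-D sample is made up), the maximizing parameters for a single Gaussian are the sample mean and the (biased) sample variance, which can be checked against nearby alternatives:

import numpy as np
from scipy.stats import norm

X = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 0.95])

mu_hat = X.mean()        # maximum-likelihood estimate of the mean
var_hat = X.var()        # maximum-likelihood estimate of the variance (divides by n)

def log_likelihood(mu, var):
    return norm.logpdf(X, loc=mu, scale=np.sqrt(var)).sum()

print(mu_hat, var_hat)
# The MLE parameters give a likelihood at least as high as perturbed alternatives:
print(log_likelihood(mu_hat, var_hat) >= log_likelihood(mu_hat + 0.1, var_hat))      # True
print(log_likelihood(mu_hat, var_hat) >= log_likelihood(mu_hat, 2.0 * var_hat))      # True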

Probabilistic Hierarchical Clustering
⦁ For a set of objects partitioned into m clusters C1, …, Cm, the quality can be
measured by

$Q(\{C_1, \ldots, C_m\}) = \prod_{i=1}^{m} P(C_i)$

⦁ where P( ) is the maximum likelihood. If we merge two clusters, Cj1 and Cj2, into
a cluster Cj1 ∪ Cj2, then the change in quality of the overall clustering is

$Q\big((\{C_1, \ldots, C_m\} - \{C_{j_1}, C_{j_2}\}) \cup \{C_{j_1} \cup C_{j_2}\}\big) - Q(\{C_1, \ldots, C_m\}) = \prod_{i=1}^{m} P(C_i)\left(\frac{P(C_{j_1} \cup C_{j_2})}{P(C_{j_1})\,P(C_{j_2})} - 1\right)$
Probabilistic Hierarchical Clustering

(Figure: Merging clusters in probabilistic hierarchical clustering. (a) Merging clusters C1 and
C2 leads to an increase in overall cluster quality, but merging clusters (b) C3 and (c)
C4 does not, as no Gaussian function can fit the merged cluster well.)

Probabilistic Hierarchical Clustering
⦁ Probabilistic hierarchical clustering methods are easy to understand, and
generally have the same efficiency as algorithmic agglomerative hierarchical
clustering methods; in fact, they share the same framework.

⦁ Probabilistic models are more interpretable, but sometimes less flexible than
distance metrics.

⦁ Probabilistic models can handle partially observed data.

⦁ A drawback of using probabilistic hierarchical clustering is that it outputs
only one hierarchy with respect to a chosen probabilistic model.

⦁ It cannot handle the uncertainty of cluster hierarchies.

Density-Based Methods
⚫ Partitioning and hierarchical methods are designed to find spherical-shaped
clusters or convex clusters. In other words, they are suitable only for compact
and well-separated clusters. Moreover, they are severely affected by the
presence of noise and outliers in the data; density-based techniques handle
such data more effectively.

⚫ For example, the data set in the figure below can easily be divided into three
clusters using the k-means algorithm.

Density-Based Methods
Consider the following figures:

The data points in these figures are grouped in arbitrary shapes or include
outliers. Density-based clustering algorithms are very efficient at finding
high-density regions and outliers. Detecting outliers is very important for
some tasks, e.g., anomaly detection.

DBSCAN
⚫ DBSCAN stands for density-based spatial clustering of applications
with noise.

⦁ To find clusters of arbitrary shape, we can alternatively model clusters
as dense regions in the data space, separated by sparse regions.

⦁ This is the main strategy behind density-based clustering methods,
which can discover clusters of non-spherical shape.

Clusters of arbitrary shape

DBSCAN
• There are two key parameters of DBSCAN:
⚫ eps: The distance that specifies the neighborhoods. Two points are
considered to be neighbors if the distance between them is less than or
equal to eps.
⚫ minPts: The minimum number of data points required to define a cluster.

⚫ Based on these two parameters, points are classified as core points,
border points, or outliers:
• Core points: A point is a core point if it has at least minPts points
(including the point itself) in its surrounding area with radius eps.

• Border points: A border point is not a core point, but falls within the
neighborhood of a core point. A border point can fall within the
neighborhoods of several core points.

• Noise points: A noise point is any point that is neither a core point nor a
border point.

DBSCAN
• In the center-based approach, density is estimated for a particular point in
the data set by counting the number of points within a specified radius, Eps,
of that point. This includes the point itself

(Figures: center-based density; core, border, and noise points.)



DBSCAN
DBSCAN Algorithm
1. Label all points as core, border, or noise points.
2. Eliminate noise points.
3. Put an edge between all core points that are within Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core
points.

• If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects.
• Otherwise, the complexity is O(n²).
• With appropriate settings of the user-defined parameters, Eps (ε) and MinPts,
the algorithm is effective in finding arbitrary-shaped clusters.
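A compact sketch of the five steps above (a brute-force O(n²) version without a spatial index; the function name and the toy data are illustrative):

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # pairwise distances
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]     # Eps-neighborhoods (incl. self)

    # Step 1: label core points (at least min_pts neighbors, counting the point itself).
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    # Steps 2-4: noise points keep label -1; connected core points form one cluster each.
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for q in neighbors[j]:
                if labels[q] == -1:
                    labels[q] = cluster_id       # Step 5: border points join a core's cluster
                    if core[q]:
                        frontier.append(q)       # only core points keep expanding the cluster
        cluster_id += 1
    return labels

X = np.array([[1, 1], [1.1, 1], [0.9, 1.2], [5, 5], [5.1, 5.2], [4.9, 5], [9, 0]], dtype=float)
print(dbscan(X, eps=0.5, min_pts=2))             # two clusters; the isolated point stays -1 (noise)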



DBSCAN
⚫ Advantages of DBSCAN:
⚫ It separates clusters of high density from clusters of low
density within a given data set.
⚫ It handles outliers within the data set.

• Disadvantages of DBSCAN:
• It does not work well when dealing with clusters of varying densities.
• It struggles with high-dimensionality data.

