
Data Mining (DM)

2101CS521

Unit-5
Clustering

Prof. Jayesh D. Vagadiya


Computer Engineering
Department
Darshan Institute of Engineering & Technology, Rajkot
jayesh.vagadiya@darshan.ac.in
9537133260
Topics to be covered
• Cluster Analysis
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Outlier Detection
• Outliers and Outlier Analysis
• Outlier Detection Methods
Classification

[Diagram: a training data set of labeled images (Apple, Banana) is fed to a classification algorithm; the trained model then predicts the label of a new, unseen input, e.g., "It's an Apple".]

Here we use labeled data to train the classifier (Supervised Learning).


Clustering (Grouping)

[Diagram: unlabeled input data is fed to a clustering algorithm, which groups similar objects together.]

Here we do not use labeled data to train the model (Unsupervised Learning).

Clustering finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters.
Clustering
 Cluster analysis or simply clustering is the process of partitioning a set of
data objects (or observations) into subsets.
 Each subset is a cluster.
 Cluster is a collection of data objects.
 Similar (or related) to one another within the same group
 Dissimilar (or unrelated) to the objects in other groups
 The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the characteristics found in the data
and grouping similar data objects into clusters
 Clustering is useful in that it can lead to the discovery of previously
unknown groups within the data.



Clustering
 Unsupervised Learning: no predefined classes (i.e., learning by observation); this is clustering.
 Supervised Learning: predefined classes are available (i.e., learning by examples); this is classification.
 Clustering can also be used for outlier detection.
 where outliers (values that are “far away” from any cluster) may be more interesting
than common cases.
 Applications of outlier detection include the detection of credit card fraud and the
monitoring of criminal activities in electronic commerce.



Applications of Clustering
 Real life examples where we use clustering:
 Marketing
 Finding groups of customers with similar behavior, given a large database of customers.
 The data contains their properties and past buying records.
 Biology
 Classification of plants and animals based on the properties under observation.
 Insurance
 Identifying groups of car insurance policy holders with a high average claim cost.
 City-Planning
 Groups of houses according to their house type, value and geographical location.
 Libraries
 It is used in clustering different books on the basis of topics and information.
 Earthquake studies
 By learning the earthquake-affected areas we can determine the dangerous zones.



Supervised and Unsupervised Learning

Supervised Learning                                   Unsupervised Learning
Input data is labeled                                 Input data is unlabeled
Has a feedback mechanism                              Has no feedback mechanism
Data is classified based on the training dataset      Assigns properties of the given data to classify it
Divided into Regression & Classification              Divided into Clustering & Association
Used for prediction                                   Used for analysis
Algorithms include decision trees, logistic           Algorithms include k-means clustering,
regression, support vector machines                   hierarchical clustering, the Apriori algorithm
A known number of classes                             An unknown number of classes


Good clustering Algorithm
 A good clustering method will produce high quality clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters



Requirements for Cluster Analysis
 Clustering is a challenging research field. In this section, you will learn
about the requirements for clustering as a data mining tool.
 Scalability:
 Clustering all the data instead of only on samples.
 Ability to deal with different types of attributes:
 Many algorithms are designed to cluster numeric (interval-based) data. However,
applications may require clustering other data types, such as binary, nominal
(categorical), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
 Many clustering algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any
shape.
 It is important to develop algorithms that can detect clusters of arbitrary shape.



Requirements for Cluster Analysis
 Requirements for domain knowledge to determine input
parameters:
 Many clustering algorithms require users to provide domain knowledge in the form of
input parameters such as the desired number of clusters.
 The clustering results may be sensitive to such parameters.
 Ability to deal with noisy data:
 Most real-world data sets contain outliers and/or missing, unknown, or erroneous
data.
 Therefore, we need clustering methods that are robust to noise.
 Incremental clustering and insensitivity to input order
 In many applications, incremental updates (representing newer data) may arrive at
any time. Some clustering algorithms cannot incorporate incremental updates into
existing clustering structures and, instead, have to recompute a new clustering from
scratch.
 Incremental clustering algorithms and algorithms that are insensitive to the input
order are needed.



Requirements for Cluster Analysis
 Capability of clustering high-dimensionality data
 A data set can contain numerous dimensions or attributes. When clustering
documents, for example, each keyword can be regarded as a dimension, and there
are often thousands of keywords.
 Finding clusters of data objects in a high- dimensional space is challenging.
 Constraint-based clustering
 Real-world applications may need to perform clustering under various kinds of
constraints.
 Suppose that your job is to choose the locations for a given number of new
automatic teller machines (ATMs) in a city. To decide upon this, you may cluster
households while considering constraints such as the city’s rivers and highway
networks and the types and number of customers per cluster
 Interpretability and usability
 Users want clustering results to be interpretable, comprehensible, and usable.
 That is, clustering may need to be tied in with specific semantic interpretations and
applications.
 It is important to study how an application goal may influence the selection of
clustering features and clustering methods.
Overview of Basic Clustering Methods: Partitioning Methods

 Partitioning Methods:
 Given a set of n objects, a partitioning method constructs k partitions of
the data, where each partition represents a cluster and k ≤ n.
 It divides the data into k groups such that each group must contain at
least one object.
 It finds mutually exclusive clusters of spherical shape (each object must
belong to exactly one group).
 Most partitioning methods are distance-based.
 Given k, the number of partitions to construct, a partitioning method
creates an initial partitioning.
 It then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
 May use mean or medoid (etc.) to represent cluster center.
 Effective for small- to medium-size data sets.
Hierarchical methods
 Hierarchical Methods:
 A hierarchical method creates a hierarchical decomposition of the given
set of data objects.
 A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed.
 Agglomerative approach:
 It is also called the bottom-up approach.
 It starts with each object forming a separate group.
 It successively merges the objects or groups close to one another, until all the groups
are merged into one (the topmost level of the hierarchy), or a termination condition
holds.
 Divisive approach:
 It is also called the top-down approach.
 It starts with all the objects in the same cluster.
 In each successive iteration, a cluster is split into smaller clusters, until eventually
each object is in one cluster, or a termination condition holds.
Hierarchical methods

[Diagram: agglomerative clustering (AGNES) runs left to right over steps 0-4, merging objects a, b, c, d, e into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) runs the same steps in reverse (steps 4-0), splitting the single cluster back into individual objects.]


Density-based methods
 Density-based methods:
 Distance-based methods only find spherical-shaped clusters and encounter
difficulty in discovering clusters of arbitrary shapes.
 It can find arbitrarily shaped clusters.
 Clusters are dense regions of objects in space that are separated by low-
density regions.
 Cluster density: Each point must have a minimum number of points within
its “neighborhood”.
 May filter out outliers.



Grid-based methods
 Grid-based methods:
 Grid-based methods quantize the object space into a finite number of cells
that form a grid structure.
 All the clustering operations are performed on the grid structure.
 The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only
on the number of cells in each dimension in the quantized space.



Partitioning Methods: k-Means : A Centroid-Based Technique

 The k-means algorithm defines the centroid of a cluster as the mean value
of the points within the cluster.
 First, it randomly selects k of the objects in D, each of which initially
represents a cluster mean or center.
 For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the Euclidean distance between the
object and the cluster mean.
 For each cluster, it computes the new mean using the objects assigned to
the cluster in the previous iteration.
 The iterations continue until the assignment is stable, that is, the clusters
formed in the current round are the same as those formed in the previous
round.



k-Means : Algorithm
INPUT:
k: the number of clusters,
D: a data set containing n objects.

Output:
A set of k clusters.

arbitrarily choose k objects from D as the initial cluster centers;


repeat
(re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
update the cluster means, that is, calculate the mean value of the objects for
each cluster;
until no change;
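The loop above can be sketched in a few lines of Python (a minimal illustration, not the official course code; it assumes 2-D points given as (x, y) tuples and uses random initial centers):

```python
import math
import random

def kmeans(points, k, max_iter=100):
    """Assign each point to its nearest center, recompute the means, repeat until stable."""
    centroids = random.sample(points, k)                    # arbitrarily choose k initial centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]                  # (re)assign by Euclidean distance
        if new_assignment == assignment:                    # no change -> stop
            break
        assignment = new_assignment
        for c in range(k):                                  # update each cluster mean
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment
```

For example, kmeans([(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)], k=2) reproduces the two clusters of the worked example that follows (up to the random initialization).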



k-Means : Algorithm Cont..
 The initial partitioning can be done in a variety of ways.
 Dynamically Chosen
 This method is good when the amount of data is expected to grow.
 The initial cluster means can simply be the first few items of data from the set.
 For instance, if the data will be grouped into 3 clusters, then the initial cluster means
will be the first 3 items of data.
 Randomly Chosen
 Almost self-explanatory, the initial cluster means are randomly chosen values within
the same range as the highest and lowest of the data values.
 Choosing from Upper and Lower Bounds
 Depending on the types of data in the set, the highest and lowest of the data range
are chosen as the initial cluster means.



Clustering

 A clustered scatter plot.


 The black dots are data points.
 The red lines illustrate the partitions
created by the k-means algorithm.
 The blue dots represent the
centroids which define the
partitions.



K-Means Example

Here K = 2, and we assume the Cluster 1 center is (1,1) and the Cluster 2 center is (5,7).

Sr.   X     Y
1     1.0   1.0
2     1.5   2.0
3     3.0   4.0
4     5.0   7.0
5     3.5   5.0
6     4.5   5.0
7     3.5   4.5

Initial centroids:
K1 = (1, 1)
K2 = (5, 7)


K-Means Example

Initial centroids: K1 = (1, 1), K2 = (5, 7)
Euclidean distance: ED = √((Xo - Xc)² + (Yo - Yc)²)

Data Point    Distance to K1 (1,1)   Distance to K2 (5,7)   Cluster
(1.0, 1.0)    0                      7.2                    1
(1.5, 2.0)    1.11                   6.10                   1
(3.0, 4.0)    3.6                    3.6                    1
(5.0, 7.0)    7.2                    0                      2
(3.5, 5.0)    4.7                    2.5                    2
(4.5, 5.0)    5.3                    2.06                   2
(3.5, 4.5)    4.3                    2.9                    2

New centroids:
K1 = (1.83, 2.33)
K2 = (4.12, 5.37)
K-Means Example

Old centroids: K1 = (1.83, 2.33), K2 = (4.12, 5.37)

Data Point    Distance to K1   Distance to K2   Cluster   New Cluster
(1.0, 1.0)    1.54             5.36             1         1
(1.5, 2.0)    0.44             4.26             1         1
(3.0, 4.0)    2.06             1.76             1         2
(5.0, 7.0)    5.66             1.8              2         2
(3.5, 5.0)    3.17             0.72             2         2
(4.5, 5.0)    3.79             0.53             2         2
(3.5, 4.5)    2.762            1.06             2         2

New centroids:
K1 = (1.25, 1.5)
K2 = (3.9, 5.1)
K-Means Example

Old centroids: K1 = (1.25, 1.5), K2 = (3.9, 5.1)

Data Point    Distance to K1   Distance to K2   Cluster   New Cluster
(1.0, 1.0)    0.55             5.02             1         1
(1.5, 2.0)    0.55             3.92             1         1
(3.0, 4.0)    3.05             1.42             2         2
(5.0, 7.0)    6.65             2.19             2         2
(3.5, 5.0)    4.16             0.41             2         2
(4.5, 5.0)    4.77             0.60             2         2
(3.5, 4.5)    3.75             0.72             2         2

There is no change in the cluster assignments, so we terminate the algorithm.
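As a quick check of the first iteration, a few lines of Python recompute the assignments and the new centroids (values match the tables above up to rounding):

```python
import math

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
k1, k2 = (1.0, 1.0), (5.0, 7.0)                       # initial centroids

clusters = {1: [], 2: []}
for p in points:
    d1, d2 = math.dist(p, k1), math.dist(p, k2)       # Euclidean distances to both centers
    clusters[1 if d1 <= d2 else 2].append(p)

for c, members in clusters.items():
    mx = sum(x for x, _ in members) / len(members)
    my = sum(y for _, y in members) / len(members)
    print(f"K{c} = ({mx:.2f}, {my:.2f})")
# K1 = (1.83, 2.33)
# K2 = (4.12, 5.37)
```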



k-Medoids: A Representative Object-Based Technique
 The k-means algorithm is sensitive to outliers because such objects are far
away from the majority of the data
 when assigned to a cluster, they can dramatically distort the mean value
of the cluster.
 K-Medoids: Instead of taking the mean value of the object in a cluster as
a reference point, medoids can be used, which is the most centrally
located object in a cluster
 The Partitioning Around Medoids (PAM) algorithm is a popular realization of
k-medoids clustering.
 Like the k-means algorithm, the initial representative objects (called seeds) are
chosen arbitrarily.
 We consider whether replacing a representative object by a nonrepresentative object
would improve the clustering quality.
 All the possible replacements are tried out.
 The iterative process of replacing representative objects by other objects continues
until the quality of the resulting clustering cannot be improved by any replacement.
This quality is measured by a cost function: the sum of the dissimilarities between
each object and the representative object of its cluster.
k-medoids : Algorithm
INPUT:
k: the number of clusters,
D: a data set containing n objects.

Output:
A set of k clusters.

arbitrarily choose k objects from D as the initial representative objects (seeds);
repeat
    assign each remaining object to the cluster with the nearest representative object;
    randomly select a nonrepresentative object, Orandom;
    compute the total cost, S, of swapping representative object Oj with Orandom;
    if S < 0 then swap Oj with Orandom to form the new set of k representative objects;
until no change;
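A minimal sketch of how one swap is evaluated in PAM (illustrative only; it uses Manhattan distance, as in the worked example on the following slides):

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid (the clustering quality)."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

points = [(8, 7), (3, 7), (4, 9), (9, 6), (8, 5), (5, 8), (7, 3), (8, 4), (7, 5), (4, 5)]
current   = [(4, 5), (8, 5)]     # current representative objects
candidate = [(4, 5), (8, 4)]     # try swapping (8, 5) with the nonrepresentative (8, 4)

S = total_cost(points, candidate) - total_cost(points, current)   # swap cost
print(S)     # S >= 0 -> keep the current medoids; S < 0 -> accept the swap
```

With this data the swap cost is 22 - 20 = 2, so the swap is rejected, exactly as in the example below.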



K-Medoids Clustering Algorithm - Example

Data set:
Sr.   X   Y
0     8   7
1     3   7
2     4   9
3     9   6
4     8   5
5     5   8
6     7   3
7     8   4
8     7   5
9     4   5

Step 1:
Let the randomly selected k = 2 medoids be C1 = (4, 5) and C2 = (8, 5).
The dissimilarity (Manhattan distance) of each non-medoid point from the medoids is calculated and tabulated:

Sr.   X   Y   Dissimilarity from C1     Dissimilarity from C2
0     8   7   |8-4| + |7-5| = 6         |8-8| + |7-5| = 2
1     3   7   3                         7
2     4   9   4                         8
3     9   6   6                         2
5     5   8   4                         6
6     7   3   5                         3
7     8   4   5                         1
8     7   5   3                         1


K-Medoids Clustering Algorithm – Example Cont..

Sr.   X   Y   Dissimilarity from C1   Dissimilarity from C2
0     8   7   6                       2
1     3   7   3                       7
2     4   9   4                       8
3     9   6   6                       2
5     5   8   4                       6
6     7   3   5                       3
7     8   4   5                       1
8     7   5   3                       1

• Each point is assigned to the cluster of the medoid whose dissimilarity is smaller.
• The points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
• The cost = (3 + 4 + 4) + (2 + 2 + 3 + 1 + 1) = 20


K-Medoids Clustering Algorithm – Example Cont..

• Step 3: Randomly select one non-medoid point and recalculate the cost.
• Let the randomly selected point be (8, 4).
• The dissimilarity of each non-medoid point from the medoids C1 (4, 5) and C2 (8, 4) is calculated and tabulated:

Sr.   X   Y   Dissimilarity from C1   Dissimilarity from C2
0     8   7   6                       3
1     3   7   3                       8
2     4   9   4                       9
3     9   6   6                       3
4     8   5   4                       1
5     5   8   4                       7
6     7   3   5                       2
8     7   5   3                       2


K-Medoids Clustering Algorithm – Example Cont..

Sr.   X   Y   Dissimilarity from C1   Dissimilarity from C2
0     8   7   6                       3
1     3   7   3                       8
2     4   9   4                       9
3     9   6   6                       3
4     8   5   4                       1
5     5   8   4                       7
6     7   3   5                       2
8     7   5   3                       2

• Each point is assigned to the cluster whose medoid has the smaller dissimilarity. So the points 1, 2, 5 go to cluster C1 and 0, 3, 4, 6, 8 go to cluster C2.
• The new cost = (3 + 4 + 4) + (3 + 3 + 1 + 2 + 2) = 22
• Swap cost = new cost - previous cost = 22 - 20 = 2
• Since 2 > 0 (positive), the previous medoids are better.
• The total cost with medoid (8, 4) is greater than the total cost when (8, 5) was the medoid, and it generates the same clusters as earlier.
• If the swap cost were negative, we would accept the new medoid and recalculate again.


K-Medoids Clustering Algorithm – Example Cont..

• As the swap cost is not less than zero, we undo the swap.
• Hence (4, 5) and (8, 5) are the final medoids.
• The final clustering is C1 = {1, 2, 5, 9} and C2 = {0, 3, 4, 6, 7, 8}:

Sr.   X   Y
0     8   7
1     3   7
2     4   9
3     9   6
4     8   5
5     5   8
6     7   3
7     8   4
8     7   5
9     4   5


K-Medoids Clustering Algorithm (Try Yourself!!)

Sr. X Y
0 2 6
1 3 4
2 3 8
3 4 7
4 6 2
5 6 4
6 7 3
7 7 4
8 8 5
9 7 6



Hierarchical methods
 A hierarchical method creates a hierarchical decomposition of the given
set of data objects.
 A hierarchical clustering method works by grouping data objects into a
hierarchy or “tree” of clusters.
 Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
 A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed.
 Hierarchical clustering methods can face challenges when it comes to
deciding when to combine or separate groups of objects.
 This choice is vital because once this action is taken, it sets the stage for
subsequent clustering steps.
 Importantly, these methods cannot reverse previous actions or swap
objects between clusters.
 Therefore, if the decisions to combine or separate are not made wisely, they
may lead to low-quality clusters.
Agglomerative approach vs Divisive approach
 Agglomerative approach (AGNES):
 It is also called the bottom-up approach.
 It starts with each object forming a separate group.
 It successively merges the objects or groups close to one another, until all the
groups are merged into one (the topmost level of the hierarchy), or a termination
condition holds.
 Divisive approach (DIANA):
 It is also called the top-down approach.
 It starts with all the objects in the same cluster.
 In each successive iteration, a cluster is split into smaller clusters, until
eventually each object is in one cluster, or a termination condition holds.

[Diagram: AGNES merges objects a, b, c, d, e bottom-up over steps 0-4; DIANA splits the single cluster top-down over the same steps in reverse.]
Dendrogram: Shows How Clusters are Merged
 Decompose data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram.
 A clustering of the data objects is obtained by cutting the dendrogram at the
desired level; each connected component then forms a cluster.

[Figure: example dendrogram, with the merge step shown on the vertical axis.]
Distance Measures in Algorithmic Methods
 Whether using an agglomerative method or a divisive method, a core
need is to measure the distance between two clusters, where each cluster
is generally a set of objects.
 Single link:
 Smallest distance between
an element in one cluster
and an element in the
other, i.e.,
 dist(Ki, Kj) = min(tip, tjq)

 Complete link:
 largest distance between
an element in one cluster
and an element in the
other, i.e.,
 dist(Ki, Kj) = max(tip, tjq)



Distance Measures in Algorithmic Methods
 Average link:
 Average distance between
an element in one cluster
and an element in the
other, i.e.,
 dist(Ki, Kj) = Avg(tip, tjq)

 Centroid link:
 distance between the
centroids of two clusters,
i.e.,
 dist(Ki, Kj) = dist(Ci, Cj)
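These four measures can be written directly in Python (a small sketch; clusters are assumed to be lists of 2-D points and Euclidean distance is used):

```python
import math
from itertools import product

def single_link(ki, kj):      # smallest pairwise distance between the two clusters
    return min(math.dist(p, q) for p, q in product(ki, kj))

def complete_link(ki, kj):    # largest pairwise distance between the two clusters
    return max(math.dist(p, q) for p, q in product(ki, kj))

def average_link(ki, kj):     # mean pairwise distance between the two clusters
    dists = [math.dist(p, q) for p, q in product(ki, kj)]
    return sum(dists) / len(dists)

def centroid_link(ki, kj):    # distance between the two cluster centroids
    centroid = lambda k: tuple(sum(v) / len(k) for v in zip(*k))
    return math.dist(centroid(ki), centroid(kj))
```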



Agglomerative Hierarchical Clustering - Example
 Calculate the Euclidean distance between every pair of points and create the distance matrix.
 Distance[(x, y), (a, b)] = √((x - a)² + (y - b)²)

      X     Y
P1    0.40  0.53
P2    0.22  0.38
P3    0.35  0.32
P4    0.26  0.19
P5    0.08  0.41
P6    0.45  0.30

       P1    P2    P3    P4    P5    P6
P1     0
P2     0.23  0
P3                 0
P4                       0
P5                             0
P6                                   0


Agglomerative Hierarchical Clustering - Example

      X     Y
P1    0.40  0.53
P2    0.22  0.38
P3    0.35  0.32
P4    0.26  0.19
P5    0.08  0.41
P6    0.45  0.30

       P1    P2    P3    P4    P5    P6
P1     0
P2     0.23  0
P3     0.22        0
P4                       0
P5                             0
P6                                   0


Agglomerative Hierarchical Clustering - Example

      X     Y
P1    0.40  0.53
P2    0.22  0.38
P3    0.35  0.32
P4    0.26  0.19
P5    0.08  0.41
P6    0.45  0.30

       P1    P2    P3    P4    P5    P6
P1     0
P2     0.23  0
P3     0.22  0.15  0
P4     0.37  0.20  0.15  0
P5     0.34  0.14  0.28  0.29  0
P6     0.23  0.25  0.11  0.22  0.39  0

The smallest entry is dist(P3, P6) = 0.11, so P3 and P6 are merged into cluster (P3, P6).


Agglomerative Hierarchical Clustering - Example
 To update the distance matrix: dist[(P3, P6), P1] = MIN(dist(P3, P1), dist(P6, P1)) = MIN(0.22, 0.23) = 0.22
 To update the distance matrix: dist[(P3, P6), P2] = MIN(dist(P3, P2), dist(P6, P2)) = MIN(0.15, 0.25) = 0.15

       P1    P2    P3    P4    P5    P6
P1     0
P2     0.23  0
P3     0.22  0.15  0
P4     0.37  0.20  0.15  0
P5     0.34  0.14  0.28  0.29  0
P6     0.23  0.25  0.11  0.22  0.39  0
Agglomerative Hierarchical Clustering - Example
 To update the distance matrix: dist[(P3, P6), P4] = MIN(dist(P3, P4), dist(P6, P4)) = MIN(0.15, 0.22) = 0.15
 To update the distance matrix: dist[(P3, P6), P5] = MIN(dist(P3, P5), dist(P6, P5)) = MIN(0.28, 0.39) = 0.28

       P1    P2    P3    P4    P5    P6
P1     0
P2     0.23  0
P3     0.22  0.15  0
P4     0.37  0.20  0.15  0
P5     0.34  0.14  0.28  0.29  0
P6     0.23  0.25  0.11  0.22  0.39  0
Agglomerative Hierarchical Clustering - Example
 The updated distance matrix for cluster (P3, P6):

        P1    P2    P3,P6  P4    P5
P1      0
P2      0.23  0
P3,P6   0.22  0.15   0
P4      0.37  0.20   0.15   0
P5      0.34  0.14   0.28   0.29  0

The smallest entry is dist(P2, P5) = 0.14, so P2 and P5 are merged into cluster (P2, P5).


Agglomerative Hierarchical Clustering - Example
 To update the distance matrix: dist[(P2, P5), P1] = MIN(dist(P2, P1), dist(P5, P1)) = MIN(0.23, 0.34) = 0.23
 To update the distance matrix: dist[(P2, P5), (P3, P6)] = MIN(dist(P2, (P3, P6)), dist(P5, (P3, P6))) = MIN(0.15, 0.28) = 0.15

        P1    P2    P3,P6  P4    P5
P1      0
P2      0.23  0
P3,P6   0.22  0.15   0
P4      0.37  0.20   0.15   0
P5      0.34  0.14   0.28   0.29  0


Agglomerative Hierarchical Clustering - Example
 To update the distance matrix: dist[(P2, P5), P4] = MIN(dist(P2, P4), dist(P5, P4)) = MIN(0.20, 0.29) = 0.20

        P1    P2    P3,P6  P4    P5
P1      0
P2      0.23  0
P3,P6   0.22  0.15   0
P4      0.37  0.20   0.15   0
P5      0.34  0.14   0.28   0.29  0


Agglomerative Hierarchical Clustering - Example

Before merging P2 and P5:
        P1    P2    P3,P6  P4    P5
P1      0
P2      0.23  0
P3,P6   0.22  0.15   0
P4      0.37  0.20   0.15   0
P5      0.34  0.14   0.28   0.29  0

After merging P2 and P5:
        P1    P2,P5  P3,P6  P4
P1      0
P2,P5   0.23  0
P3,P6   0.22  0.15   0
P4      0.37  0.20   0.15   0


Agglomerative Hierarchical Clustering - Example

        P1    P2,P5  P3,P6  P4
P1      0
P2,P5   0.23  0
P3,P6   0.22  0.15   0
P4      0.37  0.20   0.15   0

The smallest entry is dist((P2, P5), (P3, P6)) = 0.15, so the clusters (P2, P5) and (P3, P6) are merged.


Agglomerative Hierarchical Clustering - Example
 To update the distance matrix: dist[((P2, P5), (P3, P6)), P1] = MIN(dist((P2, P5), P1), dist((P3, P6), P1)) = MIN(0.23, 0.22) = 0.22
 To update the distance matrix: dist[((P2, P5), (P3, P6)), P4] = MIN(dist((P2, P5), P4), dist((P3, P6), P4)) = MIN(0.20, 0.15) = 0.15

             P1    P2,P5,P3,P6  P4
P1           0
P2,P5,P3,P6  0.22  0
P4           0.37  0.15         0


Agglomerative Hierarchical Clustering - Example

             P1    P2,P5,P3,P6  P4
P1           0
P2,P5,P3,P6  0.22  0
P4           0.37  0.15         0

The smallest entry is dist((P2, P5, P3, P6), P4) = 0.15, so P4 is merged into the cluster.
 To update the distance matrix: dist[(P2, P5, P3, P6, P4), P1] = MIN(dist((P2, P5, P3, P6), P1), dist(P4, P1)) = MIN(0.22, 0.37) = 0.22

                 P1    P2,P5,P3,P6,P4
P1               0
P2,P5,P3,P6,P4   0.22  0

Finally, P1 is merged at distance 0.22 and all objects form a single cluster.


Agglomerative Hierarchical Clustering - Example

      X     Y
P1    0.40  0.53
P2    0.22  0.38
P3    0.35  0.32
P4    0.26  0.19
P5    0.08  0.41
P6    0.45  0.30

[Dendrogram: leaves ordered P3, P6, P2, P5, P4, P1; P3 and P6 merge at 0.11, P2 and P5 at 0.14, these two clusters together with P4 at 0.15, and P1 joins last at 0.22.]
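The whole single-link (MIN) computation above can be cross-checked with SciPy, which performs the same agglomeration (a sketch, assuming SciPy is available; the merge order matches up to tie-breaking):

```python
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

points = [(0.40, 0.53), (0.22, 0.38), (0.35, 0.32),
          (0.26, 0.19), (0.08, 0.41), (0.45, 0.30)]   # P1 .. P6

Z = linkage(pdist(points), method="single")           # single-link agglomerative clustering
print(Z)   # each row: the two clusters merged, the merge distance, and the new cluster size
```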



Weakness of agglomerative clustering
methods
 Can never undo what was done previously.
 Do not scale well: time complexity of at least O(n²), where n is the number
of total objects.
 Integration of hierarchical & distance-based clustering
 BIRCH: uses CF-tree and incrementally adjusts the quality of sub-clusters.
 CHAMELEON: hierarchical clustering using dynamic modeling



BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)

 Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is


designed for clustering a large amount of numeric data by integrating
hierarchical clustering and other clustering methods such as iterative
partitioning.
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of
the data that tries to preserve the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-
tree.
 Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the
data record



Clustering Feature Vector in BIRCH
 Clustering Feature (CF): CF = (N, LS, SS)
 Where,
 N: the number of data points
 LS: the linear sum of the N points, LS = Σᵢ Xᵢ (computed per dimension)
 SS: the square sum of the N points, SS = Σᵢ Xᵢ² (computed per dimension)

Example: for the three points (3,4), (2,6), (4,5): CF = (3, (9,15), (29,77))
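A couple of lines confirm the CF triple for the three points shown (a sketch; LS and SS are computed per dimension, matching the example):

```python
points = [(3, 4), (2, 6), (4, 5)]

N  = len(points)
LS = tuple(sum(p[d] for p in points) for d in range(2))        # linear sum per dimension
SS = tuple(sum(p[d] ** 2 for p in points) for d in range(2))   # square sum per dimension

print((N, LS, SS))    # (3, (9, 15), (29, 77))
```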



CF Tree structure
 A CF tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering.
 By definition, a nonleaf node in a tree has descendants or "children.”
 The nonleaf nodes store sums of the CFs of their children, and thus
summarize clustering information about their children.
 A CF tree has two parameters: branching factor, B, and threshold, T.
 The branching factor specifies the maximum number of children per non
leaf node.
 The threshold parameter specifies the maximum diameter of subclusters
stored at the leaf nodes of the tree.
 These two parameters influence the size of the resulting tree.



The CF Tree Structure

[Diagram: a CF tree with branching factor B = 3 and threshold T = 1.5. The root holds CF entries CF1 ... CFN; each entry points to a child node whose CF entries summarize its subclusters, down to the leaf nodes.]


Centroid, Radius and Diameter of a Cluster
 Centroid: the "middle" of a cluster: Cm = (Σᵢ₌₁ᴺ tᵢₚ) / N
 Radius: the square root of the average distance from any point of the cluster to its centroid: R = √( Σᵢ₌₁ᴺ (tᵢₚ - Cm)² / N )
 Diameter: the square root of the average mean squared distance between all pairs of points in the cluster: D = √( Σᵢ₌₁ᴺ Σⱼ₌₁ᴺ (tᵢₚ - tⱼq)² / (N(N-1)) )
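The three measures can be computed directly from the points (a small sketch reusing the three points from the CF example above):

```python
import math
from itertools import combinations

points = [(3, 4), (2, 6), (4, 5)]
N = len(points)

centroid = tuple(sum(p[d] for p in points) / N for d in range(2))
radius = math.sqrt(sum(math.dist(p, centroid) ** 2 for p in points) / N)
# the double sum over ordered pairs counts each unordered pair twice, hence the factor 2
diameter = math.sqrt(2 * sum(math.dist(p, q) ** 2 for p, q in combinations(points, 2))
                     / (N * (N - 1)))
print(centroid, round(radius, 2), round(diameter, 2))
```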



BIRCH
 For each point in the input
 Find closest leaf entry
 Add point to leaf entry and update CF
 If entry diameter > max_diameter, then split leaf, and possibly parents
 Algorithm is O(n)
 Sensitive to insertion order of data points
 Since the size of leaf nodes is fixed, the resulting clusters may not be natural.
 Clusters tend to be spherical given the radius and diameter measures.



CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling
 Chameleon is a hierarchical clustering algorithm that uses dynamic
modeling to determine the similarity between pairs of clusters.
 In Chameleon, cluster similarity is assessed based on
 Interconnectivity: How well connected objects are within a cluster.
 Proximity: how close the objects are to one another.
 Algorithm:
 Chameleon uses a k-nearest-neighbor graph approach to construct a
sparse graph
 where each vertex of the graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among the k-most similar objects to
the other.
 Chameleon uses a graph partitioning algorithm to partition the k-nearest-
neighbor graph into a large number of relatively small subclusters
 Chameleon then uses an agglomerative hierarchical clustering algorithm
that iteratively merges subclusters based on their similarity.
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling

[Diagram: Step 1 - construct a sparse k-NN graph from the data set (p and q are connected if q is among the top-k closest neighbors of p); Step 2 - partition the graph into many relatively small subclusters; Step 3 - merge the partitions agglomeratively to obtain the final clusters.]


Density-Based Clustering Methods
 Clusters are dense regions of objects in space that are separated by low-
density regions.
 It can discover clusters of arbitrary shape.
 It can handle noise in the data set.
 It requires only one scan of the data.
 It needs density parameters as a termination condition.
 Several density-based algorithms exist:
 DBSCAN: Density-Based Clustering Based on Connected Regions with High Density
 OPTICS: Ordering Points To Identify the Clustering Structure
 DENCLUE: Clustering Based on Density Distribution Functions



DBSCAN
 The density of an object O can be measured by the number of objects
close to O.
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
finds core objects, that is, objects that have dense neighborhoods.
 It connects core objects and their neighborhoods to form dense regions as
clusters.
 A user-specified parameter Eps > 0 is used to specify the radius of a
neighborhood.
 the density of a neighborhood can be measured simply by the number of
objects in the neighborhood
 DBSCAN uses another user-specified parameter, MinPts, which specifies
the density threshold of dense regions.
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-neighbourhood of that point
DBSCAN
 An object is a core object if the Eps-neighborhood of the object contains at
least MinPts objects.

 Directly density-reachable:
 For a core object q and an object p, we say that p is directly density-reachable
from q (with respect to ε and MinPts) if p is within the ε-neighborhood of q.

[Figure: p lies inside the ε = 1 cm neighborhood of core object q, with MinPts = 5.]


DBSCAN
 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a
chain of points p1, ..., pn, with p1 = q and pn = p, such that pi+1 is directly
density-reachable from pi.
 Density-connected:
 A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a
point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
 Unit 5and
-Clustering 64
DBSCAN
INPUT:
D: a data set containing n objects,
ε: the radius parameter, and
MinPts: the neighborhood density threshold.

Output:
A set of density-based clusters.



DBSCAN
mark all objects as unvisited;
do
randomly select an unvisited object p;
mark p as visited;
if the ε-neighborhood of p has at least MinPts objects;
create a new cluster C, and add p to C;
let N be the set of objects in the ε-neighborhood of p;
for each point p′ in N
if p′ is unvisited
mark p′ as visited;
if the ε-neighborhood of p′ has at least MinPts points, add those
points to N;
if p′ is not yet a member of any cluster, add p′ to C;
end for
output C;
else mark p as noise;
until no object is unvisited;
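In practice the same algorithm is available off the shelf; a short sketch using scikit-learn (assuming it is installed), where Eps and MinPts map to the eps and min_samples parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],     # dense region 1
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],     # dense region 2
              [4.5, 4.5]])                             # isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)     # e.g. [0 0 0 1 1 1 -1]; -1 marks noise (the isolated point)
```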



OPTICS(Ordering Points To Identify the
Clustering Structure)
 In DBSCAN we use input parameters such as ε (the maximum radius of a
neighborhood) and MinPts.
 It burdens users with the responsibility of selecting parameter values that
will lead to the discovery of acceptable clusters.
 This is a problem associated with many other clustering algorithms.
 Most algorithms are sensitive to these parameter values.
 Slightly different settings may lead to very different clusterings of the
data.
 To overcome the difficulty in using one set of global parameters in
clustering analysis, a cluster analysis method called OPTICS was proposed.
 OPTICS does not explicitly produce a data set clustering. Instead, it
outputs a cluster ordering.
 This is a linear list of all objects under analysis and represents the density-
based clustering structure of the data.
OPTICS(Ordering Points To Identify the
Clustering Structure)
 Objects in a denser cluster are listed closer to each other in the cluster
ordering.
 This ordering is equivalent to density-based clustering obtained from a
wide range of parameter settings.
 OPTICS does not require the user to provide a specific density threshold.
 The cluster ordering can be used to extract basic clustering information
(e.g., cluster centers, or arbitrary-shaped clusters), derive the intrinsic
clustering structure, as well as provide a visualization of the clustering.
 Core Distance:
 The core-distance of an object p is the smallest value ε such that the
ε-neighborhood of p has at least MinPts objects.
 That is, it is the minimum distance threshold that makes p a core object.
 If p is not a core object with respect to ε and MinPts, the core-distance of p
is undefined.

[Figure: with MinPts = 5 and ε = 6 mm, the core-distance of P is 3 mm, the radius of the smallest neighborhood of P that contains 5 objects.]


OPTICS(Ordering Points To Identify the
Clustering Structure)
 Reachability-distance:
 The reachability-distance to object p from q is the minimum radius value that
makes p density-reachable from q.
 According to the definition of density-reachability, q has to be a core object
and p must be in the neighborhood of q.
 Therefore, the reachability-distance from q to p is max{core-distance(q), dist(p, q)}.

[Figure: with MinPts = 5 and ε = 6 mm, reachability-distance(p, q1) = 3 mm and reachability-distance(p, q2) = dist(p, q2).]



OPTICS(Ordering Points To Identify the
Clustering Structure)
 OPTICS begins with an arbitrary object from the input database as the
current object, p.
 It retrieves the ε-neighborhood of p, determines the core-distance, and
sets the reachability-distance to undefined.
 If p is not a core object, OPTICS simply moves on to the next object in the
OrderSeeds list (or the input database if OrderSeeds is empty).
 If p is a core object, then for each object q in the ε-neighborhood of p,
OPTICS updates its reachability-distance from p and inserts q into
OrderSeeds if q has not yet been processed.
 The iteration continues until the input is fully consumed and OrderSeeds is
empty.
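The cluster ordering and the reachability-distances can be inspected with scikit-learn's OPTICS implementation (a sketch, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],     # dense group
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],     # another dense group
              [9.0, 1.0]])                             # sparse point

optics = OPTICS(min_samples=3).fit(X)
for idx in optics.ordering_:                           # the cluster ordering of the objects
    print(idx, optics.reachability_[idx])              # small values = dense regions
```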



OPTICS (Ordering Points To Identify the Clustering Structure)

[Figure: reachability plot, i.e., the reachability-distance of each object (undefined for the first object) plotted against the cluster order of the objects; valleys in the plot correspond to dense clusters.]
What Are Outliers ?
 Assume that a given statistical process is used to generate a set of data
objects.
 An outlier is a data object that deviates significantly from the rest of the
objects, as if it were generated by a different mechanism.
 Outliers are different from noisy data.
 noise is a random error or variance in a measured variable.
 Noise should be removed before outlier detection
 In general, noise is not interesting in data analysis, including outlier detection.
 Outliers are interesting because they are suspected of not being generated by the
same mechanisms as the rest of the data.
 Outliers are interesting because they violate the mechanism that generates the
normal data.



What Are Outliers ?
 Applications
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis



Types of Outliers
 Three kinds: global, contextual and collective outliers
 Global outlier (or point anomaly)
 In a given data set, a data object is a global outlier if it deviates
significantly from the rest of the data set.
 Global outliers are sometimes called point anomalies, and are the simplest
type of outliers
 Most outlier detection methods are aimed at finding global outliers.
 Issue: Find an appropriate measurement of deviation
 Contextual outlier (or conditional outlier)
 An object is a contextual outlier if it deviates significantly with respect to a selected context.
 "The temperature today is 28° C. Is it exceptional (i.e., an outlier)?" It
depends, for example, on the time and location! If it is winter in Toronto,
yes, it is an outlier. If it is a summer day in Toronto, then it is normal.



Types of Outliers
 In a given data set, a data object is a contextual outlier if it deviates
significantly with respect to a specific context of the object
 Contextual outliers are also known as conditional outliers because they
are conditional on the selected context.
 Attributes of data objects should be divided into two groups
 Contextual attributes: defines the context, e.g., time & location
 Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g.,
temperature
 An object in a data set is a local outlier if its density significantly deviates
from the local area in which it occurs.



Types of Outliers
 Collective Outliers
 A subset of data objects collectively deviate significantly from the whole
data set, even if the individual data objects may not be outliers
 Detection of collective outliers
 Consider not only behavior of individual objects, but also that of groups of objects
 Need to have the background knowledge on the relationship among data objects,
such as a distance or similarity measure on objects.

[Figure: a small, dense group of objects that together deviate from the rest of the data set: a collective outlier.]
 A data set may have multiple types of outlier
 One object may belong to more than one type of outlier
Challenges of Outlier Detection
 Modeling normal objects and outliers properly
 Hard to enumerate all possible normal behaviors in an application
 The border between normal and outlier objects is often a gray area
 Application-specific outlier detection
 Choice of distance measure among objects and the model of relationship among
objects are often application-dependent
 E.g., in clinical data a small deviation could be an outlier, while in marketing
analysis larger fluctuations may be required before an object is considered an outlier.
 Handling noise in outlier detection
 Noise may distort the normal objects and blur the distinction between normal objects
and outliers. It may help hide outliers and reduce the effectiveness of outlier
detection
 Understandability
 Understand why these are outliers: Justification of the detection
 Specify the degree of an outlier: the unlikelihood of the object being generated by a
normal mechanism
Outlier Detection Methods
 Two ways to categorize outlier detection methods:
 Based on whether user-labeled examples of outliers can be
obtained:
 Supervised, semi-supervised vs. unsupervised methods
 Based on assumptions about normal data and outliers:
 Statistical, proximity-based, and clustering-based methods
 Supervised Methods
 Modeling outlier detection as a classification problem
 The task is to learn a classifier that can recognize outliers
 Samples examined by domain experts used for training & testing
 Methods for Learning a classifier for outlier detection effectively:
 Model normal objects & report those not matching the model as outliers, or
 Model outliers and treat those not matching the model as normal



Outlier Detection Methods
 Supervised Methods
 Challenges
 Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up some
artificial outliers
 Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e.,
not mislabeling normal objects as outliers)
 Unsupervised Methods
 Assume the normal objects are somewhat "clustered" into multiple
groups, each having some distinct features
 An outlier is expected to be far away from any groups of normal objects
 Weakness: Cannot detect collective outlier effectively
 Normal objects may not share any strong patterns, but the collective outliers may
share high similarity in a small area



Outlier Detection Methods
 Unsupervised Methods
 Many clustering methods can be adapted for unsupervised methods
 Find clusters, then outliers: not belonging to any cluster
 Problem 1: Hard to distinguish noise from outliers
 Problem 2: Costly since first clustering: but far less outliers than normal objects
 Semi-Supervised Methods
 Situation: In many applications, the number of labeled data is often small:
Labels could be on outliers only, normal objects only, or both
 Semi-supervised outlier detection: Regarded as applications of semi-
supervised learning
 If some labeled normal objects are available
 Use the labeled examples and the proximate unlabeled objects to train a model for
normal objects
 Those not fitting the model of normal objects are detected as outliers



Outlier Detection Methods
 If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well.
 To improve the quality of outlier detection, one can get help from models for normal
objects learned from unsupervised methods
 Statistical Methods:
 Statistical methods (also known as model-based methods) make
assumptions of data normality. They assume that normal data objects are
generated by a statistical (stochastic) model, and that data not following
the model are outliers.
 Example (right figure): First use Gaussian distribution
to model the normal data
 For each object y in region R, estimate gD(y), the probability
that y fits the Gaussian distribution.
 If gD(y) is very low, y is unlikely to have been generated by the Gaussian
model and is thus an outlier.
 Effectiveness of statistical methods: highly depends on
whether the assumption of statistical model holds in the
real data
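A minimal illustration of the Gaussian assumption, using z-scores on made-up one-dimensional data (purely illustrative):

```python
import statistics

values = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 25.0]   # 25.0 looks suspicious

mu = statistics.mean(values)
sigma = statistics.stdev(values)

# objects far from the fitted Gaussian (large z-score) are flagged as outliers
outliers = [v for v in values if abs(v - mu) / sigma > 2]
print(outliers)    # [25.0]
```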
Outlier Detection Methods
 Proximity-Based Methods
 An object is an outlier if the nearest neighbors of the object are far away,
i.e., the proximity of the object significantly deviates from the proximity
of most of the other objects in the same data set.
 Example (right figure):
 Model the proximity of an object using its 3 nearest neighbors
 Objects in region R are substantially different from other objects in
the data set.
 Thus the objects in R are outliers
 The effectiveness of proximity-based methods highly relies
on the proximity measure.
 In some applications, proximity or distance measures
cannot be obtained easily.
 Often has difficulty finding a group of outliers that stay close to each other.
 Two major types of proximity-based outlier detection: distance-based vs. density-based methods.
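A simple distance-based sketch: score each object by the distance to its k-th nearest neighbor; objects with a large score are reported as proximity-based outliers (illustrative data and helper names):

```python
import math

def knn_distance(points, i, k=3):
    """Distance from points[i] to its k-th nearest neighbor (itself excluded)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5), (10, 10)]
for i, p in enumerate(points):
    print(p, round(knn_distance(points, i), 2))
# (10, 10) has a much larger 3-NN distance than the rest, so it is flagged as an outlier
```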
Outlier Detection Methods
 Clustering-Based Methods
 Normal data belong to large and dense clusters, whereas outliers belong
to small or sparse clusters, or do not belong to any clusters
 Example (right figure): two clusters
 All points not in R form a large cluster
 The two points in R form a tiny cluster, thus are outliers
 Many clustering-based outlier detection methods are available.
 Clustering is expensive: straightforward adaption of a clustering method
for outlier detection can be costly and does not scale up well for large
data sets



IMP Questions
 What is clustering? How is it different from classification?
 Differentiate supervised learning vs. unsupervised learning.
 Explain the requirements for cluster analysis in detail.
 Explain the basic clustering methods (partitioning methods, hierarchical methods,
density-based methods, grid-based methods).
 Explain the k-means algorithm with an example.
 Explain the k-medoids algorithm with an example.
 Differentiate the agglomerative approach vs. the divisive approach for clustering.
 Explain BIRCH in detail.
 Explain CHAMELEON in detail.
 What is DBSCAN? Write down the algorithmic steps of DBSCAN.
 Explain OPTICS in detail.
 What are outliers? Explain the types of outliers.
 Explain outlier detection methods.

