Unit 5

Data Mining (DM)
2101CS521
Unit-5
Clustering
Prof. Jayesh D. vagadiya

Computer Engineering
Department
Darshan Institute of Engineering & Technology, Rajkot
jayesh.vagadiya@darshan.ac.in
9537133260
Topics to be covered
• Cluster Analysis
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Outlier Detection
• Outliers and Outlier Analysis
• Outlier Detection Methods
Classification
[Figure: a training data set of labeled fruit (Apple, Banana) is fed to a classification algorithm; for a new, unlabeled input the trained classifier predicts "It's an Apple".]

Here we use labeled data to train the classifier (Supervised Learning)
Prof. Jayesh D. Vagadiya #2101CS521 (DM)  Unit 5 -Clustering 3

Clustering (Grouping)
[Figure: unlabeled input data is fed to a clustering algorithm, which groups similar objects together.]
Here we don’t use labeled data to train the model (Unsupervised Learning)

Finding similarities between data according to the characteristics found in the
data and grouping similar data objects into clusters
Clustering
 Cluster analysis or simply clustering is the process of partitioning a set of
data objects (or observations) into subsets.
 Each subset is a cluster.
 A cluster is a collection of data objects that are:
 Similar (or related) to one another within the same group
 Dissimilar (or unrelated) to the objects in other groups
 The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
 Cluster analysis is also called clustering or data segmentation.
 Clustering is useful in that it can lead to the discovery of previously
unknown groups within the data.

Clustering
 Unsupervised learning: no predefined classes (i.e., learning by
observation), as in clustering.
 Supervised learning: predefined classes are available (i.e., learning by
examples), as in classification.
 Clustering can also be used for outlier detection,
 where outliers (values that are “far away” from any cluster) may be more interesting
than common cases.
 Applications of outlier detection include the detection of credit card fraud and the
monitoring of criminal activities in electronic commerce.

Applications of Clustering
 Real life examples where we use clustering:
 Marketing
 Finding groups of customers with similar behavior, given a large database of
customers.
 Data containing their properties and past buying records.
 Biology
 Classification of plants and animals based on the properties under observation.
 Insurance
 Identifying groups of car insurance policy holders with a high average claim cost.
 City-Planning
 Groups of houses according to their house type, value and geographical location.
 Libraries
 It is used in clustering different books on the basis of topics and information.
 Earthquake studies
 By learning the earthquake-affected areas we can determine the dangerous zones.

Supervised and Unsupervised Learning
Supervised Learning | Unsupervised Learning
Input data is labeled | Input data is unlabeled
Has a feedback mechanism | Has no feedback mechanism
Data is classified based on the training dataset | Data is grouped based on the properties of the given data
Divided into regression and classification | Divided into clustering and association
Used for prediction | Used for analysis
Algorithms include decision trees, logistic regression, support vector machines | Algorithms include k-means clustering, hierarchical clustering, the Apriori algorithm
A known number of classes | An unknown number of classes

Good clustering Algorithm
 A good clustering method will produce high quality clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters

Requirements for Cluster Analysis
 Clustering is a challenging research field. In this section, you will learn
about the requirements for clustering as a data mining tool.
 Scalability:
 Clustering all the data instead of only on samples.
 Ability to deal with different types of attributes:
 Many algorithms are designed to cluster numeric (interval-based) data. However,
applications may require clustering other data types, such as binary, nominal
(categorical), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
 Many clustering algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any
shape.
 It is important to develop algorithms that can detect clusters of arbitrary shape.

 Requirements for domain knowledge to determine input
parameters:
 Many clustering algorithms require users to provide domain knowledge in the form of
input parameters such as the desired number of clusters.
 The clustering results may be sensitive to such parameters.
 Ability to deal with noisy data:
 Most real-world data sets contain outliers and/or missing, unknown, or erroneous
data.
 Therefore, we need clustering methods that are robust to noise.
 Incremental clustering and insensitivity to input order
 In many applications, incremental updates (representing newer data) may arrive at
any time. Some clustering algorithms cannot incorporate incremental updates into
existing clustering structures and, instead, have to recompute a new clustering from
scratch.
 Incremental clustering algorithms and algorithms that are insensitive to the input
order are needed.

 Capability of clustering high-dimensionality data
 A data set can contain numerous dimensions or attributes. When clustering
documents, for example, each keyword can be regarded as a dimension, and there
are often thousands of keywords.
 Finding clusters of data objects in a high-dimensional space is challenging.
 Constraint-based clustering
 Real-world applications may need to perform clustering under various kinds of
constraints.
 Suppose that your job is to choose the locations for a given number of new
automatic teller machines (ATMs) in a city. To decide upon this, you may cluster
households while considering constraints such as the city’s rivers and highway
networks and the types and number of customers per cluster.
 Interpretability and usability
 Users want clustering results to be interpretable, comprehensible, and usable.
 That is, clustering may need to be tied in with specific semantic interpretations and
applications.
 It is important to study how an application goal may influence the selection of
clustering features and clustering methods.
Overview of Basic Clustering Methods : Partitioning methods
 Partitioning Methods:
 Given a set of n objects, a partitioning method constructs k partitions of
the data, where each partition represents a cluster and k ≤ n.
 It divides the data into k groups such that each group must contain at
least one object.
 It will find mutually exclusive clusters of spherical shape (each object must
belong to exactly one group).
 Most partitioning methods are distance-based.
 Given k, the number of partitions to construct, a partitioning method
creates an initial partitioning.
 It then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
 May use mean or medoid (etc.) to represent cluster center.
 Effective for small- to medium-size data sets.
Hierarchical methods
 Hierarchical Methods:
 A hierarchical method creates a hierarchical decomposition of the given
set of data objects.
 A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed.
 Agglomerative approach:
 It is also called the bottom-up approach.
 It starts with each object forming a separate group.
 It successively merges the objects or groups close to one another, until all the groups
are merged into one (the topmost level of the hierarchy), or a termination condition
holds.
 Divisive approach:
 It is also called the top-down approach.
 It starts with all the objects in the same cluster.
 In each successive iteration, a cluster is split into smaller clusters, until eventually
each object is in one cluster, or a termination condition holds.
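[Figure: AGNES (agglomerative) runs from step 0 to step 4 on the objects a, b, c, d, e, forming ab, de, cde and finally abcde; DIANA (divisive) runs the same steps in the reverse direction, from abcde at its step 0 to single objects at its step 4.]
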
Density-based methods
 Density-based methods:
 Distance-based methods find only spherical-shaped clusters and encounter
difficulty in discovering clusters of arbitrary shapes.
 Density-based methods can find arbitrarily shaped clusters.
 Clusters are dense regions of objects in space that are separated by low-
density regions.
 Cluster density: Each point must have a minimum number of points within
its “neighborhood”.
 May filter out outliers.

Grid-based methods
 Grid-based methods:
 Grid-based methods quantize the object space into a finite number of cells
that form a grid structure.
 All the clustering operations are performed on the grid structure.
 The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only
on the number of cells in each dimension in the quantized space.

Partitioning Methods: k-Means : A Centroid-Based Technique
 The k-means algorithm defines the centroid of a cluster as the mean value
of the points within the cluster.
 First, it randomly selects k of the objects in D, each of which initially
represents a cluster mean or center.
 For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the Euclidean distance between the
object and the cluster mean.
 For each cluster, it computes the new mean using the objects assigned to
the cluster in the previous iteration.
 The iterations continue until the assignment is stable, that is, the clusters
formed in the current round are the same as those formed in the previous
round.

k-Means : Algorithm
Input:
    k: the number of clusters,
    D: a data set containing n objects.
Output:
    A set of k clusters.

arbitrarily choose k objects from D as the initial cluster centers;
repeat
    (re)assign each object to the cluster to which the object is the most similar,
        based on the mean value of the objects in the cluster;
    update the cluster means, that is, calculate the mean value of the objects for
        each cluster;
until no change;

k-Means : Algorithm Cont..
 The initial partitioning can be done in a variety of ways.
 Dynamically Chosen
 This method is good when the amount of data is expected to grow.
 The initial cluster means can simply be the first few items of data from the set.
 For instance, if the data will be grouped into 3 clusters, then the initial cluster means
will be the first 3 items of data.
 Randomly Chosen
 The initial cluster means are values chosen at random between the lowest and
highest of the data values.
 Choosing from Upper and Lower Bounds
 Depending on the types of data in the set, the highest and lowest of the data range
are chosen as the initial cluster means.

Clustering
 A clustered scatter plot.

 The black dots are data points.
 The red lines illustrate the partitions
created by the k-means algorithm.
 The blue dots represent the
centroids which define the
partitions.

K-Means Example
Here K = 2, and we assume the Cluster 1 center is (1,1) and the Cluster 2 center is (5,7).

Sr.  X    Y
1    1.0  1.0
2    1.5  2.0
3    3.0  4.0
4    5.0  7.0
5    3.5  5.0
6    4.5  5.0
7    3.5  4.5

Initial Centroid
K1 = (1,1)
K2 = (5,7)

K-Means Example
Initial Centroid: K1 = (1,1), K2 = (5,7)
Euclidean distance between an object (X_o, Y_o) and a cluster center (X_c, Y_c):
$ED = \sqrt{(X_o - X_c)^2 + (Y_o - Y_c)^2}$

Data Point   Distance to K1 (1,1)   Distance to K2 (5,7)   Cluster
1.0  1.0     0                      7.2                    1
1.5  2.0     1.11                   6.10                   1
3.0  4.0     3.6                    3.6                    1
5.0  7.0     7.2                    0                      2
3.5  5.0     4.7                    2.5                    2
4.5  5.0     5.3                    2.06                   2
3.5  4.5     4.3                    2.9                    2

New Centroid
K1 = (1.83,2.33)
K2 = (4.12,5.37)
K-Means Example
Old Centroid: K1 = (1.83,2.33), K2 = (4.12,5.37)

Data Point   Distance to K1 (1.83,2.3)   Distance to K2 (4.12,5.37)   Old Cluster   New Cluster
1.0  1.0     1.54                        5.36                         1             1
1.5  2.0     0.44                        4.26                         1             1
3.0  4.0     2.06                        1.76                         1             2
5.0  7.0     5.66                        1.8                          2             2
3.5  5.0     3.17                        0.72                         2             2
4.5  5.0     3.79                        0.53                         2             2
3.5  4.5     2.76                        1.06                         2             2

New Centroid
K1 = (1.25,1.5)
K2 = (3.9,5.1)
K-Means Example
Old Centroid: K1 = (1.25,1.5), K2 = (3.9,5.1)

Data Point   Distance to K1 (1.25,1.5)   Distance to K2 (3.9,5.1)   Old Cluster   New Cluster
1.0  1.0     0.55                        5.02                       1             1
1.5  2.0     0.55                        3.92                       1             1
3.0  4.0     3.05                        1.42                       2             2
5.0  7.0     6.65                        2.19                       2             2
3.5  5.0     4.16                        0.41                       2             2
4.5  5.0     4.77                        0.60                       2             2
3.5  4.5     3.75                        0.72                       2             2

The cluster assignments no longer change, so the algorithm terminates.
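
The iterations of this example can be reproduced with a minimal Python sketch of k-means. The function name `kmeans` and its `max_iter` parameter are illustrative choices, not part of the slides. Clusters are numbered from 0 here (0 = K1, 1 = K2), and a tie in distance goes to the lower-numbered cluster, which is how the point (3.0, 4.0) ends up with K1 in the first iteration.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain k-means on tuples, starting from the given initial centers."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center by Euclidean distance.
        # min() returns the first minimum, so ties go to the lower index.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster: stop
        assignment = new_assignment
        # Update step: each center becomes the mean of its assigned objects.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers


points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
labels, centers = kmeans(points, [(1.0, 1.0), (5.0, 7.0)])
print(labels)   # [0, 0, 1, 1, 1, 1, 1]
print(centers)
```

The final centers agree with the centroids K1 = (1.25, 1.5) and K2 = (3.9, 5.1) of the last table.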

k-Medoids: A Representative Object-Based Technique
 The k-means algorithm is sensitive to outliers because such objects are far
away from the majority of the data.
 When assigned to a cluster, they can dramatically distort the mean value
of the cluster.
 K-Medoids: Instead of taking the mean value of the objects in a cluster as
a reference point, a medoid can be used, which is the most centrally
located object in a cluster.
 The Partitioning Around Medoids (PAM) algorithm is a popular realization of
k-medoids clustering.
 Like the k-means algorithm, the initial representative objects (called seeds) are
chosen arbitrarily.
 We consider whether replacing a representative object by a nonrepresentative object
would improve the clustering quality.
 All the possible replacements are tried out.
 The iterative process of replacing representative objects by other objects continues
until the quality of the resulting clustering cannot be improved by any replacement.
This quality is m
k-medoids : Algorithm
INPUT:
k: the number of clusters,
D: a data set containing n objects.
Output:
A set of k clusters.
arbitrarily choose k objects from D as the initial representative objects or seeds;
repeat
    assign each remaining object to the cluster with the nearest representative
        object;
    randomly select a nonrepresentative object, o_random;
    compute the total cost, S, of swapping representative object, o_j, with o_random;
    if S < 0 then swap o_j with o_random to form the new set of k representative
        objects;
until no change;
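
A minimal Python sketch of this swap-based search is given below, under some stated assumptions. The names `manhattan`, `total_cost` and `pam` are hypothetical, and dissimilarity is taken to be the Manhattan distance. Unlike the pseudocode, which draws one random nonrepresentative object per pass, this variant tries every possible replacement, as in the PAM description, and keeps a swap only when it lowers the total cost (S < 0). The sample points are those of the worked example in this unit.

```python
def manhattan(a, b):
    """Manhattan (city-block) dissimilarity between two 2-D points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_cost(points, medoids):
    """Sum of every object's dissimilarity to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def pam(points, medoids):
    """Swap medoids with non-medoids until no swap lowers the total cost."""
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # swap cost S = cost - best is negative
                    medoids, best, improved = trial, cost, True
    return medoids, best


points = [(8, 7), (3, 7), (4, 9), (9, 6), (8, 5),
          (5, 8), (7, 3), (8, 4), (7, 5), (4, 5)]
final_medoids, final_cost = pam(points, [(4, 5), (8, 5)])
print(final_medoids, final_cost)
```

Because the cost strictly decreases with every accepted swap, the loop is guaranteed to stop. The result is a local optimum that depends on the initial seeds.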

K-Medoids Clustering Algorithm - Example
Step 1:
Let k = 2, and let the two randomly selected medoids be C1 = (4, 5) and C2 = (8, 5).

Sr.  X  Y
0    8  7
1    3  7
2    4  9
3    9  6
4    8  5
5    5  8
6    7  3
7    8  4
8    7  5
9    4  5

The dissimilarity of each non-medoid point with the medoids is calculated and tabulated.
For example, for point 0: |8-4| + |7-5| = 6 from C1 and |8-8| + |7-5| = 2 from C2.

Sr.  X  Y  Dissimilarity From C1  Dissimilarity From C2
0    8  7  6                      2
1    3  7  3                      7
2    4  9  4                      8
3    9  6  6                      2
5    5  8  4                      6
6    7  3  5                      3
7    8  4  5                      1
8    7  5  3                      1

K-Medoids Clustering Algorithm – Example Cont..
Step 2:
• Each point is assigned to the cluster of the medoid to which its dissimilarity is smaller.
• The points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
• The Cost = (3 + 4 + 4) + (2 + 2 + 3 + 1 + 1) = 20

Sr.  X  Y  Dissimilarity From C1  Dissimilarity From C2
0    8  7  6                      2
1    3  7  3                      7
2    4  9  4                      8
3    9  6  6                      2
5    5  8  4                      6
6    7  3  5                      3
7    8  4  5                      1
8    7  5  3                      1

• Step 3: randomly select one non-medoid point and recalculate the cost.
• Let the randomly selected point be (8, 4).
• The dissimilarity of each non-medoid point with the medoids C1 (4, 5) and C2 (8, 4) is calculated and tabulated.

Sr.  X  Y  Dissimilarity From C1  Dissimilarity From C2
0    8  7  6                      3
1    3  7  3                      8
2    4  9  4                      9
3    9  6  6                      3
4    8  5  4                      1
5    5  8  4                      7
6    7  3  5                      2
8    7  5  3                      2

• Each point is assigned to the cluster whose medoid is less dissimilar. So, the points 1, 2, 5 go to cluster C1 and 0, 3, 4, 6, 8 go to cluster C2.
• The New Cost = (3 + 4 + 4) + (3 + 3 + 1 + 2 + 2) = 22
• Swap Cost = New Cost – Previous Cost = 22 – 20 = 2
• Since 2 > 0, the swap cost is positive, so the previous medoid is better.
• The total cost with medoid (8, 4) is greater than the total cost when (8, 5) was the medoid, and it generates the same clusters as earlier.
• If the swap cost is negative, take the new medoid and recalculate.

Sr.  X  Y  Dissimilarity From C1  Dissimilarity From C2
0    8  7  6                      3
1    3  7  3                      8
2    4  9  4                      9
3    9  6  6                      3
4    8  5  4                      1
5    5  8  4                      7
6    7  3  5                      2
8    7  5  3                      2

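• As the swap cost is not less than zero, we undo the swap.
• Hence (4, 5) and (8, 5) are the final medoids.
• The clustering is as follows: points 1, 2, 5 belong to the cluster of medoid (4, 5), which is point 9, and points 0, 3, 6, 7, 8 belong to the cluster of medoid (8, 5), which is point 4.

Sr.  X  Y
0    8  7
1    3  7
2    4  9
3    9  6
4    8  5
5    5  8
6    7  3
7    8  4
8    7  5
9    4  5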
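
The arithmetic of this worked example can be checked with a short Python sketch. The helper names `manhattan` and `assign` are hypothetical; the points, the medoids and the Manhattan dissimilarity are taken from the example.

```python
def manhattan(a, b):
    """Manhattan dissimilarity, as used in the tables of the example."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign(points, medoids):
    """Assign each non-medoid point to its nearest medoid; return clusters and cost."""
    clusters = {m: [] for m in medoids}
    cost = 0
    for sr, p in points.items():
        if p in medoids:
            continue
        nearest = min(medoids, key=lambda m: manhattan(p, m))
        clusters[nearest].append(sr)
        cost += manhattan(p, nearest)
    return clusters, cost


points = {0: (8, 7), 1: (3, 7), 2: (4, 9), 3: (9, 6), 4: (8, 5),
          5: (5, 8), 6: (7, 3), 7: (8, 4), 8: (7, 5), 9: (4, 5)}

old_clusters, old_cost = assign(points, [(4, 5), (8, 5)])
new_clusters, new_cost = assign(points, [(4, 5), (8, 4)])
swap_cost = new_cost - old_cost
print(old_cost, new_cost, swap_cost)  # 20 22 2
```

A positive swap cost means the candidate medoid (8, 4) is rejected and (8, 5) is kept.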

K-Medoids Clustering Algorithm (Try Yourself!!)
Sr. X Y
0 2 6
1 3 4
2 3 8
3 4 7
4 6 2
5 6 4
6 7 3
7 7 4
8 8 5
9 7 6

Hierarchical Methods
 A hierarchical method creates a hierarchical decomposition of the given
set of data objects.
 A hierarchical clustering method works by grouping data objects into a
hierarchy or “tree” of clusters.
 Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
 A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed.
 Hierarchical clustering methods can face challenges when it comes to
deciding when to combine or separate groups of objects.
 This choice is vital because once this action is taken, it sets the stage for
subsequent clustering steps.
 Importantly, these methods cannot reverse previous actions or swap
objects between clusters.
 Therefore, if the decisions to combine or separate are not made wisely,
the resulting clusters may be of low quality.
Agglomerative approach vs Divisive approach
 Agglomerative approach (AGNES):
 It is also called the bottom-up approach.
 It starts with each object forming a separate group.
 It successively merges the objects or groups close to one another, until all the
groups are merged into one (the topmost level of the hierarchy), or a termination
condition holds.
 Divisive approach (DIANA):
 It is also called the top-down approach.
 It starts with all the objects in the same cluster.
 In each successive iteration, a cluster is split into smaller clusters, until eventually
each object is in one cluster, or a termination condition holds.
[Figure: AGNES (agglomerative) runs from step 0 to step 4 on the objects a, b, c, d, e, forming ab, de, cde and finally abcde; DIANA (divisive) runs the same steps in the reverse direction.]
Dendrogram: Shows How Clusters are Merged
 Decompose data objects into several levels of nested partitioning (tree
of clusters), called a dendrogram
 A clustering of the data objects is obtained by cutting the dendrogram at
the desired level, then each connected component forms a cluster
[Figure: dendrogram with numbered merge steps.]

Distance Measures in Algorithmic Methods
 Whether using an agglomerative method or a divisive method, a core
need is to measure the distance between two clusters, where each cluster
is generally a set of objects.
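 Single link:
 Smallest distance between an element in one cluster and an element in the
other, i.e.,
 dist(Ki, Kj) = min dist(tip, tjq), taken over all tip in Ki and tjq in Kj
 Complete link:
 Largest distance between an element in one cluster and an element in the
other, i.e.,
 dist(Ki, Kj) = max dist(tip, tjq), taken over all tip in Ki and tjq in Kj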

Distance Measures in Algorithmic Methods
 Average link:
 Average distance between an element in one cluster and an element in the
other, i.e.,
 dist(Ki, Kj) = avg dist(tip, tjq), taken over all tip in Ki and tjq in Kj
 Centroid link:
 Distance between the centroids of two clusters, i.e.,
 dist(Ki, Kj) = dist(Ci, Cj)
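
The four inter-cluster distance measures can be sketched in Python as follows. The function names and the two small sample clusters `ki` and `kj` are made up for illustration, and Euclidean distance between elements is an assumption.

```python
import math
from itertools import product


def single_link(a, b):
    """Smallest element-to-element distance between clusters a and b."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Largest element-to-element distance between clusters a and b."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Average of all element-to-element distances between clusters a and b."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def centroid_link(a, b):
    """Distance between the centroids of clusters a and b."""
    return math.dist(centroid(a), centroid(b))


ki = [(0.0, 0.0), (0.0, 1.0)]
kj = [(3.0, 0.0), (4.0, 0.0)]
for measure in (single_link, complete_link, average_link, centroid_link):
    print(measure.__name__, round(measure(ki, kj), 3))
```

For any pair of clusters, the single-link value is never larger than the average-link value, which in turn is never larger than the complete-link value.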

Agglomerative Hierarchical Clustering - Example
 Calculate the Euclidean distance between every pair of points and create the distance matrix.
 Distance [(x,y),(a,b)] = $\sqrt{(x-a)^2 + (y-b)^2}$

      X     Y
P1    0.40  0.53
P2    0.22  0.38
P3    0.35  0.32
P4    0.26  0.19
P5    0.08  0.41
P6    0.45  0.30

      P1    P2    P3    P4    P5    P6
P1    0
P2    0.23  0
P3    0.22  0.15  0
P4    0.37  0.20  0.15  0
P5    0.34  0.14  0.28  0.29  0
P6    0.23  0.25  0.11  0.22  0.39  0

 The smallest distance in the matrix is 0.11, between P3 and P6, so these two points are merged first.

Example
 To update the distance matrix after merging P3 and P6, take the minimum of the distances from each remaining point to P3 and to P6:
 MIN[dist((P3,P6),P1)] = MIN(dist(P3,P1), dist(P6,P1)) = MIN(0.22, 0.23) = 0.22
 MIN[dist((P3,P6),P2)] = MIN(dist(P3,P2), dist(P6,P2)) = MIN(0.15, 0.25) = 0.15
 MIN[dist((P3,P6),P4)] = MIN(dist(P3,P4), dist(P6,P4)) = MIN(0.15, 0.22) = 0.15
 MIN[dist((P3,P6),P5)] = MIN(dist(P3,P5), dist(P6,P5)) = MIN(0.28, 0.39) = 0.28
 The updated distance matrix for cluster (P3,P6):

         P1    P2    P3,P6  P4    P5
P1       0
P2       0.23  0
P3,P6    0.22  0.15  0
P4       0.37  0.20  0.15   0
P5       0.34  0.14  0.28   0.29  0

 The smallest distance is now 0.14, between P2 and P5, so they are merged next.

Example
 To update the distance matrix after merging P2 and P5:
 MIN[dist((P2,P5),P1)] = MIN(dist(P2,P1), dist(P5,P1)) = MIN(0.23, 0.34) = 0.23
 MIN[dist((P2,P5),(P3,P6))] = MIN(dist(P2,(P3,P6)), dist(P5,(P3,P6))) = MIN(0.15, 0.28) = 0.15
 MIN[dist((P2,P5),P4)] = MIN(dist(P2,P4), dist(P5,P4)) = MIN(0.20, 0.29) = 0.20

Example
 The updated distance matrix for clusters (P2,P5) and (P3,P6):

         P1    P2,P5  P3,P6  P4
P1       0
P2,P5    0.23  0
P3,P6    0.22  0.15   0
P4       0.37  0.20   0.15   0

 The smallest distance is 0.15. It occurs both between (P2,P5) and (P3,P6) and between (P3,P6) and P4; this example merges (P2,P5) with (P3,P6) next.

Example
 To update the distance matrix after merging (P2,P5) and (P3,P6):
 MIN[dist(((P2,P5),(P3,P6)),P1)] = MIN(dist((P2,P5),P1), dist((P3,P6),P1)) = MIN(0.23, 0.22) = 0.22
 MIN[dist(((P2,P5),(P3,P6)),P4)] = MIN(dist((P2,P5),P4), dist((P3,P6),P4)) = MIN(0.20, 0.15) = 0.15

               P1    P2,P5,P3,P6  P4
P1             0
P2,P5,P3,P6    0.22  0
P4             0.37  0.15         0

Example
 To update the distance matrix after merging (P2,P5,P3,P6) and P4:
 MIN[dist(((P2,P5,P3,P6),P4),P1)] = MIN(dist((P2,P5,P3,P6),P1), dist(P4,P1)) = MIN(0.22, 0.37) = 0.22

                  P1    P2,P5,P3,P6,P4
P1                0
P2,P5,P3,P6,P4    0.22  0

Example
      X     Y
P1    0.40  0.53
P2    0.22  0.38
P3    0.35  0.32
P4    0.26  0.19
P5    0.08  0.41
P6    0.45  0.30

[Figure: the resulting dendrogram, with leaves in the order 3, 6, 2, 5, 4, 1.]
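
The merge sequence of this example can be traced with a small single-link sketch in Python that works directly on the distance matrix shown in the slides (the matrix values are used as given, not recomputed from the coordinates). The function name `single_link_agglomerative` and the dictionary layout are illustrative assumptions. Two candidate merges tie at 0.15, so the order of the merges at that level may differ from the slides, but the merge distances are the same.

```python
def single_link_agglomerative(names, dist):
    """Repeatedly merge the two closest clusters (single link).

    dist maps frozenset({a, b}) to the distance between original points a and b.
    Returns a list of (merged cluster, merge distance) in merge order.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((merged, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


names = ["P1", "P2", "P3", "P4", "P5", "P6"]
lower = [  # lower triangle of the distance matrix from the example
    [],
    [0.23],
    [0.22, 0.15],
    [0.37, 0.20, 0.15],
    [0.34, 0.14, 0.28, 0.29],
    [0.23, 0.25, 0.11, 0.22, 0.39],
]
dist = {frozenset((names[r], names[c])): d
        for r, row in enumerate(lower) for c, d in enumerate(row)}

merges = single_link_agglomerative(names, dist)
for merged, d in merges:
    print(merged, d)
```

The merge distances come out as 0.11, 0.14, 0.15, 0.15 and 0.22, matching the heights at which clusters join in the example.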

Weakness of agglomerative clustering
methods
 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n^2), where n is the total
number of objects
 Integration of hierarchical & distance-based clustering
 BIRCH: uses CF-tree and incrementally adjusts the quality of sub-clusters.
 CHAMELEON: hierarchical clustering using dynamic modeling

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
 Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is
designed for clustering a large amount of numeric data by integrating
hierarchical clustering and other clustering methods such as iterative
partitioning.
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of
the data that tries to preserve the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-
tree.
 Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
 Weakness: handles only numeric data, and is sensitive to the order of the
data records

Clustering Feature Vector in BIRCH
 Clustering Feature (CF): CF = (N, LS, SS)
 Where,
 N: number of data points
 LS: linear sum of the N points: $LS = \sum_{i=1}^{N} X_i$
 SS: square sum of the N points: $SS = \sum_{i=1}^{N} X_i^2$
 Example: for the three points (3,4), (2,6) and (4,5), CF = (3, (9,15), (29,77))
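
As a quick check of the example, a clustering feature can be computed with a few lines of Python. The helper names `clustering_feature` and `merge_cf` are hypothetical; the second one illustrates that two CFs can be combined by adding their components, so sub-clusters can be merged without revisiting the points.

```python
def clustering_feature(points):
    """Return CF = (N, LS, SS) with per-dimension linear and square sums."""
    n = len(points)
    ls = tuple(sum(c) for c in zip(*points))
    ss = tuple(sum(v * v for v in c) for c in zip(*points))
    return n, ls, ss


def merge_cf(cf1, cf2):
    """Combine two clustering features by component-wise addition."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))


cf = clustering_feature([(3, 4), (2, 6), (4, 5)])
print(cf)  # (3, (9, 15), (29, 77))

# Merging the CF of one point with the CF of the other two gives the same result.
merged = merge_cf(clustering_feature([(3, 4)]),
                  clustering_feature([(2, 6), (4, 5)]))
print(merged == cf)  # True
```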

CF Tree structure
 A CF tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering.
 By definition, a nonleaf node in a tree has descendants or “children.”
 The nonleaf nodes store sums of the CFs of their children, and thus
summarize clustering information about their children.
 A CF tree has two parameters: branching factor, B, and threshold, T.
 The branching factor specifies the maximum number of children per
nonleaf node.
 The threshold parameter specifies the maximum diameter of subclusters
stored at the leaf nodes of the tree.
 These two parameters influence the size of the resulting tree.

The CF Tree Structure
[Figure: example CF tree with B = 3 and T = 1.5, showing nodes that hold entries CF1, CF2, CF3, …]

Centroid, Radius and Diameter of a Cluster
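 Centroid: the “middle” of a cluster
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
 Radius: square root of the average squared distance from any point of the
cluster to its centroid
$R = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}$
 Diameter: square root of the average squared distance between all pairs of
points in the cluster
$D = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$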
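
A minimal Python sketch of these three measures follows, assuming Euclidean distance and at least two points in the cluster (the diameter divides by N(N-1)). The function name `centroid_radius_diameter` is hypothetical, and the three sample points are the ones from the clustering feature example.

```python
import math


def centroid_radius_diameter(points):
    """Centroid, radius and diameter of a cluster of N >= 2 points."""
    n = len(points)
    centroid = tuple(sum(c) / n for c in zip(*points))
    # Radius: root of the mean squared distance to the centroid.
    radius = math.sqrt(sum(math.dist(p, centroid) ** 2 for p in points) / n)
    # Diameter: root of the mean squared distance over all ordered pairs.
    pair_sum = sum(math.dist(p, q) ** 2 for p in points for q in points)
    diameter = math.sqrt(pair_sum / (n * (n - 1)))
    return centroid, radius, diameter


c, r, d = centroid_radius_diameter([(3, 4), (2, 6), (4, 5)])
print(c, round(r, 4), round(d, 4))
```

For these points the centroid is (3, 5), the radius is the square root of 4/3 and the diameter is 2.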

BIRCH
 For each point in the input
 Find closest leaf entry
 Add point to leaf entry and update CF
 If entry diameter > max_diameter, then split leaf, and possibly parents
 Algorithm is O(n)
 Sensitive to insertion order of data points
 Because the size of leaf nodes is fixed, the clusters may not be natural
 Clusters tend to be spherical given the radius and diameter measures

CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling
 Chameleon is a hierarchical clustering algorithm that uses dynamic
modeling to determine the similarity between pairs of clusters.
 In Chameleon, cluster similarity is assessed based on
 Interconnectivity: How well connected objects are within a cluster.
 Proximity: How close the clusters are to each other.
 Algorithm:
 Chameleon uses a k-nearest-neighbor graph approach to construct a
sparse graph
 where each vertex of the graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among the k-most similar objects to
the other.
 Chameleon uses a graph partitioning algorithm to partition the k-nearest-
neighbor graph into a large number of relatively small subclusters
 Chameleon then uses an agglomerative hierarchical clustering algorithm
that iteratively merges subclusters based on their similarity.
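
The first step, building the k-nearest-neighbor graph, can be sketched in Python as below. This covers only the graph construction, not the partitioning or merging phases. The function name `knn_graph` is hypothetical, Euclidean distance stands in for similarity, and ties are broken by the lower point index.

```python
import math


def knn_graph(points, k):
    """Undirected k-NN graph: i and j are linked if either is among the
    k nearest neighbors of the other. Edges are frozensets of indices."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            edges.add(frozenset((i, j)))
    return edges


points = [(0, 0), (1, 0), (2, 0), (10, 0)]
edges = knn_graph(points, k=1)
print(sorted(tuple(sorted(e)) for e in edges))  # [(0, 1), (1, 2), (2, 3)]
```

The isolated point at index 3 is still connected, because its own nearest neighbor is point 2, even though it is nobody else's nearest neighbor.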
CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling
[Figure: Data set → Step 1: construct a sparse k-NN graph (p and q are connected if q is among the top k closest neighbors of p) → Step 2: partition the graph → Step 3: merge the partitions.]

Density-Based Clustering Methods
 Clusters are dense regions of objects in space that are separated by low-
density regions.
 It can discover clusters of arbitrary shape.
 It can handle noise in the data set.
 It requires only one scan of the data.
 It needs density parameters as a termination condition.
 Several density-based algorithms exist:
 DBSCAN:(Density-Based Clustering Based on Connected Regions with High Density )
 OPTICS:(Ordering Points to Identify the Clustering Structure)
 DENCLUE: (Clustering Based on Density Distribution Functions )

DBSCAN
 The density of an object O can be measured by the number of objects
close to O.
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
finds core objects, that is, objects that have dense neighborhoods.
 It connects core objects and their neighborhoods to form dense regions as
clusters.
 A user-specified parameter Eps > 0 is used to specify the radius of a
neighborhood.
 The density of a neighborhood can be measured simply by the number of
objects in the neighborhood.
 DBSCAN uses another user-specified parameter, MinPts, which specifies
the density threshold of dense regions.
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-neighbourhood of that point
DBSCAN
 An object is a core object if the Eps-neighborhood of the object contains at
least MinPts objects.
 Directly density-reachable:
 For a core object q and an object p, we say that p is directly
density-reachable from q (with respect to Eps and MinPts) if p is within the
Eps-neighborhood of q.
[Figure: p lies inside the Eps-neighborhood of core object q, with MinPts = 5 and Eps = 1 cm.]

DBSCAN
 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly
density-reachable from pi
 Density-connected:
 A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a
point o such that both p and q are density-reachable from o w.r.t. Eps and
MinPts
[Figure: a chain q, p1, p illustrating density-reachability, and points p and q both reachable from o illustrating density-connectivity.]
DBSCAN
INPUT:
D: a data set containing n objects,
ε: the radius parameter, and
MinPts: the neighborhood density threshold.
Output:
A set of density-based clusters.

DBSCAN
mark all objects as unvisited;
do
    randomly select an unvisited object p;
    mark p as visited;
    if the ε-neighborhood of p has at least MinPts objects
        create a new cluster C, and add p to C;
        let N be the set of objects in the ε-neighborhood of p;
        for each point p′ in N
            if p′ is unvisited
                mark p′ as visited;
                if the ε-neighborhood of p′ has at least MinPts points,
                    add those points to N;
            if p′ is not yet a member of any cluster, add p′ to C;
        end for
        output C;
    else mark p as noise;
until no object is unvisited;
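
The pseudocode can be turned into a compact Python sketch under a few assumptions: objects are visited in index order rather than at random, the ε-neighborhood of an object includes the object itself, and a single label list doubles as the visited marker. The function name `dbscan` and the sample data are illustrative only.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core object
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:  # j is a core object: expand the cluster
                queue.extend(nj)
    return labels


points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 15)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two dense squares become clusters 0 and 1, and the isolated point (5, 15) is marked as noise.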

OPTICS(Ordering Points To Identify the
Clustering Structure)
 In DBSCAN we use input parameters such as ε (the maximum radius of a
neighborhood) and MinPts.
 This burdens users with the responsibility of selecting parameter values that
will lead to the discovery of acceptable clusters.
 This is a problem associated with many other clustering algorithms.
 Most algorithms are sensitive to these parameter values.
 Slightly different settings may lead to very different clusterings of the
data.
 To overcome the difficulty in using one set of global parameters in
clustering analysis, a cluster analysis method called OPTICS was proposed.
 OPTICS does not explicitly produce a data set clustering. Instead, it
outputs a cluster ordering.
 This is a linear list of all objects under analysis and represents the density-
based clustering structure of the data.
 Objects in a denser cluster are listed closer to each other in the cluster
ordering.
 This ordering is equivalent to density-based clustering obtained from a
wide range of parameter settings.
 OPTICS does not require the user to provide a specific density threshold.
 The cluster ordering can be used to extract basic clustering information
(e.g., cluster centers, or arbitrary-shaped clusters), derive the intrinsic
clustering structure, as well as provide a visualization of the clustering.
 Core Distance:
 The core-distance of an object p is the smallest value ε′ such that the
ε′-neighborhood of p has at least MinPts objects.
 That is, ε′ is the minimum distance threshold that makes p a core object.
 If p is not a core object with respect to ε and MinPts, the core-distance of p
is undefined.
[Figure: object p with MinPts = 5 and ε = 6 mm; the core-distance of p is ε′ = 3 mm.]

 Reachability-distance:
 The reachability-distance to object p from q is the minimum radius value that
makes p density-reachable from q.
 According to the definition of density-reachability, q has to be a core object
and p must be in the neighborhood of q.
 Therefore, the reachability-distance from q to p is
max{core-distance(q), dist(p, q)}.
[Figure: MinPts = 5 and ε = 6 mm, with an inner radius of 3 mm around p; Reachability-distance (p, q1) = 3 mm and Reachability-distance (p, q2) = dist (p, q2).]
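
Both definitions can be sketched in Python as follows. The function names, the sample points and the use of `None` for "undefined" are assumptions made for illustration; the neighborhood of an object is taken to include the object itself, and the reachability-distance is treated as undefined when p lies outside the ε-neighborhood of q, following the condition stated above.

```python
import math


def core_distance(p, points, eps, min_pts):
    """Smallest radius at which p's neighborhood holds min_pts objects
    (p included), or None if that radius would exceed eps."""
    dists = sorted(math.dist(p, q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]


def reachability_distance(p, q, points, eps, min_pts):
    """Reachability-distance of p from q: max(core-distance(q), dist(p, q)).
    None if q is not a core object or p is outside q's eps-neighborhood."""
    cd = core_distance(q, points, eps, min_pts)
    d = math.dist(p, q)
    if cd is None or d > eps:
        return None
    return max(cd, d)


pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(core_distance((0, 0), pts, eps=6, min_pts=3))                   # 2.0
print(reachability_distance((1, 0), (0, 0), pts, eps=6, min_pts=3))   # 2.0
print(reachability_distance((3, 0), (0, 0), pts, eps=6, min_pts=3))   # 3.0
print(core_distance((10, 0), pts, eps=6, min_pts=3))                  # None
```

The point (1, 0) is closer to q = (0, 0) than the core-distance of q, so its reachability-distance is lifted to the core-distance, while (3, 0) keeps its actual distance.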

 OPTICS begins with an arbitrary object from the input database as the
current object, p.
 It retrieves the ε-neighborhood of p, determines the core-distance, and
sets the reachability-distance to undefined.
 If p is not a core object, OPTICS simply moves on to the next object in the
OrderSeeds list (or the input database if OrderSeeds is empty).
 If p is a core object, then for each object, q, in the ε-neighborhood of p,
OPTICS updates its reachability-distance from p and inserts q into
OrderSeeds if q has not yet been processed.
 The iteration continues until the input is fully consumed and OrderSeeds is
empty.

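[Figure: reachability plot showing the reachability-distance of each object (undefined for some objects) against the cluster order of the objects.]
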
What Are Outliers ?
 Assume that a given statistical process is used to generate a set of data
objects.
 An outlier is a data object that deviates significantly from the rest of the
objects, as if it were generated by a different mechanism.
 Outliers are different from noisy data.
 Noise is a random error or variance in a measured variable.
 Noise should be removed before outlier detection.
 In general, noise is not interesting in data analysis, including outlier detection.
 Outliers are interesting because they are suspected of not being generated by the
same mechanisms as the rest of the data.
 Outliers are interesting: they violate the mechanism that generates the
normal data.

What Are Outliers ?
 Applications
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis

Types of Outliers
 Three kinds: global, contextual and collective outliers
 Global outlier (or point anomaly)
 In a given data set, a data object is a global outlier if it deviates
significantly from the rest of the data set.
 Global outliers are sometimes called point anomalies, and are the simplest
type of outliers
 Most outlier detection methods are aimed at finding global outliers.
 Issue: Find an appropriate measurement of deviation
 Contextual outlier (or conditional outlier)
 An object is a contextual outlier if it deviates significantly based on a selected context.
 “The temperature today is 28°C. Is it exceptional (i.e., an outlier)?” It
depends, for example, on the time and location! If it is in winter in Toronto,
yes, it is an outlier. If it is a summer day in Toronto, then it is normal.

Types of Outliers
 In a given data set, a data object is a contextual outlier if it deviates
significantly with respect to a specific context of the object
 Contextual outliers are also known as conditional outliers because they
are conditional on the selected context.
 Attributes of data objects should be divided into two groups
 Contextual attributes: define the context, e.g., time & location
 Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g.,
temperature
 An object in a data set is a local outlier if its density significantly deviates
from the local area in which it occurs.
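
The split into contextual and behavioral attributes can be illustrated with a small sketch. It assumes each record is a (context, value) pair, where the context stands in for the contextual attributes (e.g., season) and the value is the behavioral attribute (e.g., temperature). Scoring by z-score within each context and the cutoff of 2.0 are assumptions made for this example, not part of the slides.

```python
import statistics
from collections import defaultdict

def contextual_outliers(records, z_threshold=2.0):
    """Flag (context, value) records whose value deviates strongly
    from the other values that share the same context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        group = by_context[context]
        if len(group) < 2:
            continue                      # not enough data to judge this context
        mu = statistics.mean(group)
        sigma = statistics.stdev(group)
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            flagged.append((context, value))
    return flagged

winter = [("winter", t) for t in (-5, -3, -4, -6, -2, -4, -5, -3, 28)]
summer = [("summer", t) for t in (26, 27, 28, 29, 30, 28, 27, 29)]
print(contextual_outliers(winter + summer))
```

The same reading of 28 is flagged in the winter context but not in the summer context, which is the Toronto example in code form.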

Types of Outliers
 Collective Outliers
 A subset of data objects collectively deviates significantly from the whole
data set, even if the individual data objects may not be outliers
 Detection of collective outliers
 Consider not only behavior of individual objects, but also that of groups of objects
 Need to have the background knowledge on the relationship among data objects,
such as a distance or similarity measure on objects.
[Figure: Collective Outlier]
 A data set may have multiple types of outliers
 One object may belong to more than one type of outlier
Challenges of Outlier Detection
 Modeling normal objects and outliers properly
 Hard to enumerate all possible normal behaviors in an application
 The border between normal and outlier objects is often a gray area
 Application-specific outlier detection
 Choice of distance measure among objects and the model of relationship among
objects are often application-dependent
 E.g., in clinical data a small deviation could be an outlier, while in marketing analysis
larger fluctuations are normal
 Handling noise in outlier detection
 Noise may distort the normal objects and blur the distinction between normal objects
and outliers. It may help hide outliers and reduce the effectiveness of outlier
detection
 Understandability
 Understand why these are outliers: Justification of the detection
 Specify the degree of an outlier: the unlikelihood of the object being generated by a
normal mechanism
Outlier Detection Methods
 Two ways to categorize outlier detection methods:
 Based on whether user-labeled examples of outliers can be
obtained:
 Supervised, semi-supervised vs. unsupervised methods
 Based on assumptions about normal data and outliers:
 Statistical, proximity-based, and clustering-based methods
 Supervised Methods
 Modeling outlier detection as a classification problem
 The task is to learn a classifier that can recognize outliers
 Samples examined by domain experts used for training & testing
 Methods for Learning a classifier for outlier detection effectively:
 Model normal objects & report those not matching the model as outliers, or
 Model outliers and treat those not matching the model as normal

 Supervised Methods
 Challenges
 Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up some
artificial outliers
 Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e.,
not mislabeling normal objects as outliers)
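
A small arithmetic sketch shows why accuracy alone is misleading when outliers are rare. The label encoding (1 = outlier, 0 = normal) and the function name accuracy_and_recall are assumptions made for this example.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall on the positive class).
    Assumes y_true contains at least one positive label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return correct / len(y_true), true_pos / actual_pos

# 98 normal objects and 2 outliers; the "classifier" labels everything normal.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100
print(accuracy_and_recall(y_true, y_pred))  # (0.98, 0.0)
```

The trivial classifier reaches 98% accuracy yet finds none of the outliers, so recall on the outlier class is the more informative measure here.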
 Unsupervised Methods
 Assume the normal objects are somewhat “clustered” into multiple
groups, each having some distinct features
 An outlier is expected to be far away from any groups of normal objects
 Weakness: Cannot detect collective outliers effectively
 Normal objects may not share any strong patterns, but the collective outliers may
share high similarity in a small area

 Unsupervised Methods
 Many clustering methods can be adapted for unsupervised methods
 Find clusters first, then report the objects not belonging to any cluster as outliers
 Problem 1: Hard to distinguish noise from outliers
 Problem 2: Costly, since clustering must be done first even though there are far fewer outliers than normal objects
 Semi-Supervised Methods
 Situation: In many applications, the amount of labeled data is often small:
Labels could be on outliers only, normal objects only, or both
 Semi-supervised outlier detection: Regarded as applications of semi-
supervised learning
 If some labeled normal objects are available
 Use the labeled examples and the proximate unlabeled objects to train a model for
normal objects
 Those not fitting the model of normal objects are detected as outliers

 If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
 To improve the quality of outlier detection, one can get help from models for normal
objects learned from unsupervised methods
 Statistical Methods:
 Statistical methods (also known as model-based methods) make
assumptions of data normality. They assume that normal data objects are
generated by a statistical (stochastic) model, and that data not following
the model are outliers.
 Example (right figure): First use Gaussian distribution
to model the normal data
 For each object y in region R, estimate gD(y), the probability
that y fits the Gaussian distribution
 If gD(y) is very low, y is unlikely to have been generated by the Gaussian
model, and is thus an outlier
 Effectiveness of statistical methods: highly depends on
whether the assumption of statistical model holds in the
real data
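
A minimal sketch of the Gaussian example, assuming one-dimensional data. It fits a normal distribution to the values and flags those with a very low density under the fitted model. The density is used here as a stand-in for gD(y), and the cutoff of 0.01 is an arbitrary assumption that would need tuning for real data.

```python
from statistics import NormalDist

def gaussian_outliers(values, threshold=0.01):
    """Fit a Gaussian to the values and return those whose density
    under the fitted model falls below the threshold."""
    model = NormalDist.from_samples(values)   # estimates mean and standard deviation
    return [y for y in values if model.pdf(y) < threshold]

temperatures = [24.0, 24.5, 25.0, 25.5, 26.0, 24.8, 25.2, 40.0]
print(gaussian_outliers(temperatures))  # [40.0]
```

Note that the outlier itself inflates the estimated standard deviation, which is one reason the effectiveness of this approach depends on how well the model assumption holds.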
 Proximity-Based Methods
 An object is an outlier if the nearest neighbors of the object are far away,
i.e., the proximity of the object deviates significantly from the proximity
of most of the other objects in the same data set
 Example (right figure):
 Model the proximity of an object using its 3 nearest neighbors
 Objects in region R are substantially different from other objects in
the data set.
 Thus the objects in R are outliers
 The effectiveness of proximity-based methods highly relies
on the proximity measure.
 In some applications, proximity or distance measures
cannot be obtained easily.
 These methods often have difficulty finding a group of outliers that
stay close to each other
 Two major types of outlier detection: distance-based vs. density-based
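
The 3-nearest-neighbor example can be sketched as a simple distance-based detector. This assumes Euclidean distance on coordinate tuples and more than k objects in the data set; the function name and the threshold of 2.0 are hypothetical choices for illustration.

```python
import math

def knn_distance_outliers(points, k=3, threshold=2.0):
    """Flag objects whose distance to their k-th nearest neighbor
    exceeds the threshold. Requires len(points) > k."""
    flagged = []
    for i, p in enumerate(points):
        # distances from p to every other object, nearest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        if dists[k - 1] > threshold:
            flagged.append(p)
    return flagged

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
print(knn_distance_outliers(pts))  # [(8, 8)]
```

If several outliers sit close together and k is smaller than the size of that group, each of them has near neighbors and none is flagged, which is the weakness noted above.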
 Clustering-Based Methods
 Normal data belong to large and dense clusters, whereas outliers belong
to small or sparse clusters, or do not belong to any clusters
 Example (right figure): two clusters
 All points not in R form a large cluster
 The two points in R form a tiny cluster, thus are outliers
 Many clustering-based outlier detection methods are available.
 Clustering is expensive: straightforward adaptation of a clustering method
for outlier detection can be costly and does not scale up well for large
data sets
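
The idea that tiny clusters are outliers can be sketched with a deliberately simple clustering step. The sketch below links any two objects within link_dist of each other (a single-link grouping, chosen here only for brevity; it is not one of the specific methods covered in this unit) and reports the members of clusters smaller than min_cluster_size. Both parameters and the function name are assumptions.

```python
import math
from collections import deque

def small_cluster_outliers(points, link_dist=1.5, min_cluster_size=3):
    """Group objects that are chained together by distances <= link_dist,
    then return the members of clusters smaller than min_cluster_size."""
    n = len(points)
    label = [None] * n
    clusters = []
    for start in range(n):
        if label[start] is not None:
            continue
        cid = len(clusters)
        label[start] = cid
        members = [start]
        queue = deque([start])
        while queue:                      # breadth-first growth of one cluster
            i = queue.popleft()
            for j in range(n):
                if label[j] is None and math.dist(points[i], points[j]) <= link_dist:
                    label[j] = cid
                    members.append(j)
                    queue.append(j)
        clusters.append(members)
    return [points[i] for c in clusters if len(c) < min_cluster_size for i in c]

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (2, 1), (9, 9), (9.5, 9.5)]
print(small_cluster_outliers(pts))  # [(9, 9), (9.5, 9.5)]
```

The pairwise distance checks make this quadratic in the number of objects, which reflects the cost concern stated above.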

IMP Questions
 What is clustering? How is it different from classification?
 Differentiate supervised learning vs. unsupervised learning.
 Explain the requirements for cluster analysis in detail.
 Explain the basic clustering methods (partitioning methods, hierarchical methods, density-based
methods, grid-based methods).
 Explain the k-means algorithm with an example.
 Explain the k-medoids algorithm with an example.
 Differentiate the agglomerative approach vs. the divisive approach for clustering.
 Explain BIRCH in detail.
 Explain CHAMELEON in detail.
 What is DBSCAN? Write down the algorithmic steps of DBSCAN.
 Explain OPTICS in detail.
 What are outliers? Explain the types of outliers.
 Explain outlier detection methods.
