Unit 5
2101CS521
Unit-5
Clustering
[Figure: two panels. Prediction: input data labeled Apple, Banana, Apple, with an unknown item (???). Clustering: input data fed to a clustering algorithm.]
Partitioning Methods:
Given a set of n objects, a partitioning method constructs k partitions of
the data, where each partition represents a cluster and k ≤ n.
It divides the data into k groups such that each group must contain at
least one object.
It finds mutually exclusive clusters of spherical shape (each object must
belong to exactly one group).
Most partitioning methods are distance-based.
Given k, the number of partitions to construct, a partitioning method
creates an initial partitioning.
It then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
A cluster center may be represented by the mean or the medoid, among other choices.
Partitioning methods are effective for small- to medium-size data sets.
Prof. Jayesh D. Vagadiya #2101CS521 (DM) Unit 5 -Clustering 13
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given
set of data objects.
A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed.
Agglomerative approach:
It is also called the bottom-up approach.
It starts with each object forming a separate group.
It successively merges the objects or groups close to one another, until all the groups
are merged into one (the topmost level of the hierarchy), or a termination condition
holds.
Divisive approach:
It is also called the top-down approach.
It starts with all the objects in the same cluster.
In each successive iteration, a cluster is split into smaller clusters, until eventually
each object is in its own cluster, or a termination condition holds.
k-Means:
The k-means algorithm defines the centroid of a cluster as the mean value
of the points within the cluster.
First, it randomly selects k of the objects in D, each of which initially
represents a cluster mean or center.
Each of the remaining objects is assigned to the cluster to which it is
most similar, based on the Euclidean distance between the object and the
cluster mean.
For each cluster, it computes the new mean using the objects assigned to
the cluster in the previous iteration.
The iterations continue until the assignment is stable, that is, the clusters
formed in the current round are the same as those formed in the previous
round.
Output:
A set of k clusters.
Sr. X Y
0 2 6
1 3 4
2 3 8
3 4 7
4 6 2
5 6 4
6 7 3
7 7 4
8 8 5
9 7 6
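The k-means steps described above can be sketched on the ten points in the table. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` cap, and the choice of objects 0 and 4 as the initial centers are assumptions made here so the run is reproducible (the algorithm itself picks the initial objects at random).

```python
import math

# The ten (X, Y) points from the table above, in order Sr. 0..9.
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def kmeans(data, initial, max_iter=100):
    """Plain k-means: assign to nearest mean, recompute means, repeat until stable."""
    centers = list(initial)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center by Euclidean distance.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in data
        ]
        if new_labels == labels:  # assignment is stable, so stop
            break
        labels = new_labels
        # Update step: each center becomes the mean of its assigned objects.
        for c in range(len(centers)):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

# Assumed initial centers: objects 0 and 4 (fixed instead of random).
labels, centers = kmeans(points, initial=[points[0], points[4]])
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(centers)
```

With these starting centers the assignment stabilizes after the second pass: objects 0 to 3 form one cluster and objects 4 to 9 the other. A different initial choice may converge to a different partition.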
Complete link: largest distance between an element in one cluster and an
element in the other, i.e.,
$\mathrm{dist}(K_i, K_j) = \max_{p,q} \mathrm{dist}(t_{ip}, t_{jq})$
Centroid link: distance between the centroids of two clusters, i.e.,
$\mathrm{dist}(K_i, K_j) = \mathrm{dist}(C_i, C_j)$
[Figure: distance matrix and dendrograms for points P1 to P6. Surviving matrix entries: dist(P1, P2) = 0.23, dist(P1, P3) = 0.22.]
X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
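The agglomerative approach can be sketched on the six points above using the complete-link distance defined earlier. This is a minimal sketch under stated assumptions: the function names `complete_link` and `agglomerative`, and the stopping rule "stop when k clusters remain", are illustrative choices, not part of the slides.

```python
import math
from itertools import combinations

# The six points from the table above.
points = {
    "P1": (0.40, 0.53), "P2": (0.22, 0.38), "P3": (0.35, 0.32),
    "P4": (0.26, 0.19), "P5": (0.08, 0.41), "P6": (0.45, 0.30),
}

def complete_link(a, b, pts):
    """Largest distance between an element of cluster a and an element of cluster b."""
    return max(math.dist(pts[p], pts[q]) for p in a for q in b)

def agglomerative(pts, k, link=complete_link):
    """Bottom-up clustering: start with singletons, merge the closest pair until k remain."""
    clusters = [[name] for name in pts]
    merges = []  # record of merged clusters, in merge order
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: link(clusters[ij[0]], clusters[ij[1]], pts),
        )
        merged = sorted(clusters[i] + clusters[j])
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(clusters), merges

clusters, merges = agglomerative(points, k=2)
print(merges)    # first merge is P3 with P6, then P2 with P5
print(clusters)  # [['P1', 'P2', 'P5'], ['P3', 'P4', 'P6']]
```

Passing a different `link` function (for example one based on centroids) changes which pair counts as closest and can therefore change the merge order.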
[Figure: dendrogram with leaf order 3, 6, 2, 5, 4, 1]
[Figure: CF tree with entries CF1, CF2, CF3, ...; T = 1.5]
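Centroid of a cluster of N points:
$C_m = \frac{1}{N}\sum_{i=1}^{N} t_{ip}$
Radius: square root of the average squared distance from any point of the
cluster to its centroid:
$R = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (t_{ip} - C_m)^2}$
Diameter: square root of the average squared distance between all pairs of
points in the cluster:
$D = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N} (t_{ip} - t_{jp})^2}$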
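The centroid, radius, and diameter of a cluster can be computed directly from their definitions. The sketch below assumes Euclidean distance and uses a small hypothetical four-point cluster (not taken from the slides) so the values are easy to check by hand.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def radius(points):
    """Square root of the average squared distance from each point to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    """Square root of the average squared distance over all ordered pairs of distinct points."""
    n = len(points)
    # Pairs (p, p) contribute zero, so summing over all pairs is harmless.
    total = sum(math.dist(p, q) ** 2 for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

# Hypothetical cluster: the corners of a square with side 2.
cluster = [(1, 1), (3, 1), (1, 3), (3, 3)]
print(centroid(cluster))  # (2.0, 2.0)
print(radius(cluster))    # about 1.414
print(diameter(cluster))  # about 2.309
```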
Directly density-reachable:
For a core object q and an object p, we say that p is directly
density-reachable from q (with respect to ε and MinPts) if p is within the
ε-neighborhood of q.
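The definition can be checked with a few lines of code. This is a minimal sketch: the helper names are hypothetical, the sample points are made up for illustration, and the ε-neighborhood is assumed to include the object itself when it is counted against MinPts.

```python
import math

def eps_neighborhood(data, q, eps):
    """All objects within distance eps of q (q itself included)."""
    return [x for x in data if math.dist(x, q) <= eps]

def is_core(data, q, eps, min_pts):
    """q is a core object if its eps-neighborhood holds at least min_pts objects."""
    return len(eps_neighborhood(data, q, eps)) >= min_pts

def directly_density_reachable(data, p, q, eps, min_pts):
    """p is directly density-reachable from q: q is core and p lies in q's eps-neighborhood."""
    return is_core(data, q, eps, min_pts) and math.dist(p, q) <= eps

# Hypothetical data: a dense group around (1, 1) and a sparse tail to the right.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (3, 1)]
core, border = (1, 1), (2.2, 1)

print(directly_density_reachable(data, border, core, eps=1.5, min_pts=4))  # True
print(directly_density_reachable(data, core, border, eps=1.5, min_pts=4))  # False
```

The two calls show that the relation is not symmetric: the border point is reachable from the core object, but not the other way around, because the border point is not itself a core object.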
[Figure: points p and q with MinPts = 5 and Eps = 1 cm; points p, q, and o illustrating density-connectivity]
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is
a point o such that both p and q are density-reachable from o w.r.t. Eps
and MinPts.
DBSCAN
INPUT:
D: a data set containing n objects,
ε: the radius parameter, and
MinPts: the neighborhood density threshold.
Output:
A set of density-based clusters.
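Given these inputs and output, DBSCAN can be sketched as follows. This is a simplified illustration, not a tuned implementation: the function names, the use of label -1 for noise, the brute-force neighborhood search, and the sample data are assumptions made here, and the ε-neighborhood is assumed to count the object itself.

```python
import math

NOISE = -1  # label used here for objects that belong to no cluster

def region_query(data, i, eps):
    """Indices of all objects within eps of object i (including i itself)."""
    return [j for j, q in enumerate(data) if math.dist(data[i], q) <= eps]

def dbscan(data, eps, min_pts):
    """Return one label per object: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(data)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border object
            continue
        cluster_id += 1  # object i is a core object: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(data, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                seeds.extend(j_neighbors)
    return labels

# Hypothetical data: two dense groups and one isolated object.
data = [(1, 1), (1, 2), (2, 1), (2, 2),
        (8, 8), (8, 9), (9, 8), (9, 9),
        (5, 15)]
labels = dbscan(data, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated object has no neighbors within ε, so it is never absorbed into a cluster and stays labeled as noise.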
[Figure: cluster-order of the objects]
What Are Outliers ?
Assume that a given statistical process is used to generate a set of data
objects.
An outlier is a data object that deviates significantly from the rest of the
objects, as if it were generated by a different mechanism.
Outliers are different from noisy data.
Noise is a random error or variance in a measured variable.
Noise should be removed before outlier detection.
In general, noise is not interesting in data analysis, including outlier detection.
Outliers are interesting because they are suspected of not being generated by the
same mechanism as the rest of the data.
An outlier violates the mechanism that generates the normal data.
Collective Outlier
A data set may have multiple types of outliers.
One object may belong to more than one type of outlier.
Challenges of Outlier Detection
Modeling normal objects and outliers properly
Hard to enumerate all possible normal behaviors in an application
The border between normal and outlier objects is often a gray area
Application-specific outlier detection
Choice of distance measure among objects and the model of relationship among
objects are often application-dependent
E.g., in clinical data a small deviation could be an outlier, whereas in marketing
analysis larger fluctuations may still be normal
Handling noise in outlier detection
Noise may distort the normal objects and blur the distinction between normal objects
and outliers. It may help hide outliers and reduce the effectiveness of outlier
detection
Understandability
Understand why these are outliers: Justification of the detection
Specify the degree of an outlier: the unlikelihood of the object being generated by a
normal mechanism
Outlier Detection Methods
Two ways to categorize outlier detection methods:
Based on whether user-labeled examples of outliers can be
obtained:
Supervised, semi-supervised vs. unsupervised methods
Based on assumptions about normal data and outliers:
Statistical, proximity-based, and clustering-based methods
Supervised Methods
Modeling outlier detection as a classification problem
The task is to learn a classifier that can recognize outliers
Samples examined by domain experts are used for training and testing
Methods for learning a classifier for outlier detection effectively:
Model normal objects & report those not matching the model as outliers, or
Model outliers and treat those not matching the model as normal
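The first option (model the normal objects, then report whatever does not match) can be sketched with a very simple model. The code below assumes one numeric attribute, models the expert-labeled normal samples by their mean and standard deviation, and flags values more than three standard deviations away. The function names, the threshold of 3, and the sample values are all hypothetical choices for illustration.

```python
import statistics

def fit_normal_model(normal_samples):
    """Summarize the labeled normal objects by mean and population standard deviation."""
    return statistics.mean(normal_samples), statistics.pstdev(normal_samples)

def is_outlier(x, model, z_threshold=3.0):
    """Report x as an outlier if it lies too far from the model of normal objects."""
    mean, std = model
    return abs(x - mean) / std > z_threshold

# Hypothetical training samples labeled "normal" by a domain expert.
normal_samples = [10, 11, 9, 10, 12, 8, 10]
model = fit_normal_model(normal_samples)

print(is_outlier(11, model))  # False
print(is_outlier(20, model))  # True
```

A real application would use a richer model of the normal objects and would choose the threshold from labeled test data rather than fixing it in advance.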