
UNIT - V: Cluster Analysis

Basic Concepts and Algorithms: Overview, What Is Cluster Analysis?
Different Types of Clustering, Different Types of Clusters; K-means:
The Basic K-means Algorithm, K-means Additional Issues, Bisecting
K-means, Strengths and Weaknesses; Agglomerative Hierarchical
Clustering: Basic Agglomerative Hierarchical Clustering Algorithm;
DBSCAN: Traditional Density: Centre-Based Approach, DBSCAN
Algorithm, Strengths and Weaknesses.
What Is Cluster Analysis?

• Cluster analysis groups data objects based only on information
found in the data that describes the objects and their relationships.
• The goal is that the objects within a group be similar (or related)
to one another and different from (or unrelated to) the objects in
other groups.
Different ways of clustering the same set of points
Different Types of Clusterings

• An entire collection of clusters is referred to as a clustering.

Hierarchical versus Partitional

• A partitional clustering is a division of the set of data objects
into non-overlapping subsets (clusters) such that each data
object is in exactly one subset.
• If we permit clusters to have subclusters, then we obtain a
hierarchical clustering, which is a set of nested clusters that are
organized as a tree.
• Each node (cluster) in the tree (except for the leaf nodes) is the
union of its children (subclusters), and the root of the tree is the
cluster containing all the objects.
Exclusive versus Overlapping versus Fuzzy
• In an exclusive clustering, each object is assigned to a single cluster.
• In an overlapping or non-exclusive clustering, an object can
simultaneously belong to more than one group (class).
• In a fuzzy clustering, every object belongs to every cluster with a
membership weight that is between 0 (absolutely doesn't belong)
and 1 (absolutely belongs).
• Clusters are treated as fuzzy sets (a fuzzy set is one in which an
object belongs to any set with a weight that is between 0 and 1).
Complete versus Partial
• A complete clustering assigns every object to a cluster, whereas a
partial clustering does not.
• The motivation for a partial clustering is that some objects in a
data set may not belong to well-defined groups.
Different Types of Clusters

• Clustering aims to find useful groups of objects (clusters).


• Well-Separated: A cluster is a set of objects in which each
object is closer (or more similar) to every other object in the
cluster than to any object not in the cluster.
• Prototype-Based: Prototype-based clusters are often center-based
clusters.
• Each point is closer to the center of its cluster than to the center
of any other cluster.

Well-separated clusters Center-based clusters


• Graph-Based: If the data is represented as a graph, where the
nodes are objects and the links represent connections among
objects, then a cluster can be defined as a connected component,
i.e., a group of objects that are connected to one another but have
no connection to objects outside the group.
• An example is contiguity-based clusters, where two objects are
connected only if they are within a specified distance of each other.
• Each point is closer to at least one point in its cluster than to any
point in another cluster.
contiguity-based clusters
• Density-Based A cluster is a dense region of objects that is
surrounded by a region of low density.

Density-based clusters
• Conceptual Clusters: We can define a cluster as a set of
objects that share some property.

Conceptual clusters
K-means
• K-means defines a prototype in terms of a centroid, which is
usually the mean of a group of points, and is typically applied
to objects in a continuous n-dimensional space.
The Basic K-means Algorithm

• We first choose K initial centroids, where K is a user-specified
parameter, namely, the number of clusters desired.
• Each point is then assigned to the closest centroid, and each
collection of points assigned to a centroid is a cluster.
• The centroid of each cluster is then updated based on the
points assigned to the cluster.
• We repeat the assignment and update steps until no point
changes clusters, or equivalently, until the centroids remain the
same (see the sketch below).
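A minimal NumPy sketch of these steps is shown below. The function name kmeans, the random choice of initial centroids, and the convergence test are illustrative choices, not prescribed by the slides.

  import numpy as np

  def kmeans(X, k, max_iter=100, seed=0):
      """Basic K-means on an (m, n) array X; returns centroids and cluster labels."""
      rng = np.random.default_rng(seed)
      # Step 1: choose K initial centroids (here: K points picked at random).
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      for _ in range(max_iter):
          # Step 2: assign each point to the closest centroid (Euclidean distance).
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # Step 3: recompute each centroid as the mean of its assigned points.
          new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                    else centroids[i]  # keep the old centroid if a cluster is empty
                                    for i in range(k)])
          # Step 4: stop when the centroids no longer change.
          if np.allclose(new_centroids, centroids):
              break
          centroids = new_centroids
      return centroids, labels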
Example: Each figure shows
(1) the centroids at the start of the iteration and
(2) the assignment of the points to those centroids.
(3) The centroids are indicated by the “+” symbol; all points
belonging to the same cluster have the same shape.
Using the K-means algorithm to find three clusters in sample
data.
• In the first step, shown in Figure (a), points are assigned to the initial
centroids, which are all in the larger group of points.
• For this example, we use the mean as the centroid.
• After points are assigned to a centroid, the centroid is then
updated.
• In the second step, points are assigned to the updated centroids,
and the centroids are updated again.
• In steps 2, 3, and 4, which are shown in Figures (b), (c), and
(d), respectively, two of the centroids move to the two small
groups of points at the bottom of the figures.
• Then the K-means algorithm terminates in Figure 8.3(d),
because no more changes occur.
Assigning Points to the Closest Centroid
• To assign a point to the closest centroid, we need a proximity
measure that quantifies the notion of "closest" for the specific data.
• Euclidean (L2) distance is often used for data points in Euclidean
space, while cosine similarity is more appropriate for documents.
• Other measures may also suit a given type of data; for example,
Manhattan (L1) distance can be used for Euclidean data, while the
Jaccard measure is often employed for documents.
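As a small illustration (a NumPy sketch with made-up vectors), the two most common choices can be computed as follows:

  import numpy as np

  p, q = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])

  # Euclidean (L2) distance: suited to points in Euclidean space.
  euclidean = np.linalg.norm(p - q)

  # Cosine similarity: suited to document vectors, where only the direction matters.
  cosine = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))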
Centroids and Objective Functions
• Consider data whose proximity measure is Euclidean distance.
• For our objective function, which measures the quality of a
clustering, we use the sum of the squared error (SSE).
• We calculate the error of each data point, i.e., its Euclidean
distance to the closest centroid, and then compute the total sum of
the squared errors:

  SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist(ci, x)²

Table of notation:
• x: an object; Ci: the ith cluster; ci: the centroid of cluster Ci;
c: the centroid of all points; mi: the number of objects in the ith
cluster; m: the number of objects in the data set; K: the number of
clusters.
• The centroid (mean) of the ith cluster is defined by

  ci = (1 / mi) Σ_{x ∈ Ci} x
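A small helper for this objective, reusing the centroids and labels produced by the illustrative kmeans sketch above:

  def sse(X, centroids, labels):
      """Total squared Euclidean distance of each point to its assigned centroid."""
      diffs = X - centroids[labels]   # each point minus the centroid of its own cluster
      return float(np.sum(diffs ** 2))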
Choosing Initial Centroids
• When random initialization of centroids is used, different runs of
K-means typically produce different total SSEs.
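A common remedy, sketched below for a data array X using the illustrative kmeans and sse helpers defined earlier, is to perform several runs with different seeds and keep the clustering with the lowest SSE:

  runs = [kmeans(X, k=3, seed=s) for s in range(10)]            # 10 random initializations
  best_centroids, best_labels = min(runs, key=lambda r: sse(X, *r))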
Time and Space Complexity
• The space requirements for K-means are modest because only the
data points and centroids are stored.
• Specifically, the storage required is O((m + K)n), where m is the
number of points and n is the number of attributes.
• The time requirements for K-means are also modest—basically
linear in the number of data points.
• In particular, the time required is O(I ∗K ∗m ∗ n), where I is the
number of iterations.
K-means: Additional Issues

Handling Empty Clusters


• One of the problems with the basic K-means is that empty clusters
can be obtained if no points are allocated to a cluster.
• If this happens, then a strategy is needed to choose a replacement
centroid.
• One approach is to choose the point that is farthest away from any
current centroid.
• Another approach is to choose the replacement centroid from the
cluster that has the highest SSE.


Outliers
• In particular, when outliers are present, the resulting cluster
centroids may not be as representative as they otherwise would be,
and the SSE will be higher as well.
• Because of this, it is often useful to discover outliers and eliminate
them beforehand.
Reducing the SSE with Postprocessing
• An obvious way to reduce the SSE is to find more clusters, i.e., to
use a larger K.
• Two strategies that decrease the SSE by increasing the
number of clusters are :
• Split a cluster: The cluster with the largest SSE is usually
chosen.
• Introduce a new cluster centroid: The point that is farthest from
any cluster center is often chosen.
Two strategies that decrease the number of clusters, while trying to
minimize the increase in total SSE:

• Disperse a cluster: This is done by removing the centroid of a
cluster and reassigning the points to other clusters.
• Merge two clusters: The clusters with the closest centroids are
typically selected; a better approach is to merge the two clusters that
result in the smallest increase in total SSE.
Updating Centroids Incrementally
• Instead of updating cluster centroids after all points have been
assigned to a cluster, the centroids can be updated incrementally,
after each assignment of a point to a cluster.
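Conceptually, when a point x is assigned to a cluster that currently holds n points with centroid c, the centroid can be adjusted in place instead of being recomputed from scratch; a one-line sketch with illustrative names:

  def add_point(centroid, n, x):
      # New mean of n + 1 points: c + (x - c) / (n + 1).
      return centroid + (x - centroid) / (n + 1)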
Bisecting K-means

• The bisecting K-means algorithm is a straightforward extension of
the basic K-means algorithm.
• To obtain K clusters, split the set of all points into two clusters,
select one of these clusters to split, and so on, until K clusters
have been produced (see the sketch below).
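A minimal sketch of this loop, assuming the illustrative kmeans helper defined earlier and choosing the cluster with the largest SSE as the one to split (choosing the largest cluster is another common criterion):

  def bisecting_kmeans(X, k):
      def cluster_sse(pts):
          # SSE of a single cluster: squared distances to its own mean.
          return float(np.sum((pts - pts.mean(axis=0)) ** 2))

      clusters = [X]                       # start with one cluster containing all points
      while len(clusters) < k:
          # Select the cluster with the largest SSE and bisect it with 2-means.
          i = max(range(len(clusters)), key=lambda j: cluster_sse(clusters[j]))
          to_split = clusters.pop(i)
          centroids, labels = kmeans(to_split, k=2)
          clusters.append(to_split[labels == 0])
          clusters.append(to_split[labels == 1])
      return clusters

In practice the 2-means bisection is often tried several times and the split with the lowest SSE is kept.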
Bisecting K-means Example:
• In iteration 1, two pairs of clusters are found;
• In iteration 2, the rightmost pair of clusters is split; and
• In iteration 3, the leftmost pair of clusters is split.
Bisecting K-means on the four clusters example
Strengths and Weaknesses

• K-means is simple and can be used for a wide variety of data types.
• It is quite efficient, even though multiple runs are often performed.
• Some variants, such as bisecting K-means, are even more efficient and
are less susceptible to initialization problems.
• K-means is not suitable for all types of data.
• It cannot handle non-globular clusters or clusters of different sizes and
densities.
K-means with clusters of different size
K-means with clusters of different density
K-means with non-globular clusters
Agglomerative Hierarchical Clustering

• There are two basic approaches for generating a hierarchical
clustering:
• Agglomerative: Start with the points as individual clusters and, at
each step, merge the closest pair of clusters.
• Divisive: Start with one cluster and, at each step, split a cluster
until only singleton clusters of points remain.
• A hierarchical clustering is often displayed graphically using a
tree-like diagram called a dendrogram.
• a hierarchical clustering can also be graphically represented using
a nested cluster diagram
Basic Agglomerative Hierarchical Clustering Algorithm

• Agglomerative hierarchical clustering techniques start
with individual points as clusters and successively merge the two
closest clusters until only one cluster remains.
Defining Proximity between Clusters
• The key operation is the computation of the proximity between
two clusters.
• Different definitions of cluster proximity give rise to different
agglomerative hierarchical clustering techniques, such as MIN, MAX,
and Group Average.
• MIN defines cluster proximity as the proximity between the
closest two points that are in different clusters.
• MAX takes the proximity between the farthest two points in
different clusters.
• The group average technique defines cluster proximity to be the
average of the pairwise proximities of all pairs of points from different
clusters.
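These three definitions can be written directly as functions of two clusters; the NumPy sketch below uses Euclidean distance as the underlying proximity, which is one reasonable choice:

  import numpy as np

  def pairwise_dists(A, B):
      # All pairwise Euclidean distances between the points of clusters A and B.
      return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

  def proximity_min(A, B):             # MIN (single link): closest pair of points
      return pairwise_dists(A, B).min()

  def proximity_max(A, B):             # MAX (complete link): farthest pair of points
      return pairwise_dists(A, B).max()

  def proximity_group_average(A, B):   # Group Average: mean of all pairwise distances
      return pairwise_dists(A, B).mean()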
DBSCAN

• Density-based clustering locates regions of high density that
are separated from one another by regions of low density.


Traditional Density: Center-Based Approach

• In the center-based approach, density is estimated for a
particular point in the data set by counting the number of points
within a specified radius, Eps, of that point.
• This count includes the point itself.
Center-based density.
Classification of Points According to Center-Based Density
• The center-based approach to density allows us to classify a point
as being
(1) in the interior of a dense region (a core point),
(2) on the edge of a dense region (a border point), or
(3) in a sparsely occupied region (a noise or background point).
Core points:
• These points are in the interior of a density-based cluster.
• A point is a core point if there are at least MinPts points
(including the point itself) within a distance Eps of it.
• In the figure, point A is a core point for the indicated radius (Eps) if
MinPts ≤ 7.
Core, border, and noise points
• Border points: A border point is not a core point, but falls within
the neighborhood of a core point.
• In the figure, point B is a border point.
• Noise points: A noise point is any point that is neither a core point
nor a border point.
• In the figure, point C is a noise point.
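A compact sketch of this classification for a data array X (Eps and MinPts are the DBSCAN parameters; the function name is illustrative):

  import numpy as np

  def classify_points(X, eps, min_pts):
      """Label each point as core, border, or noise using center-based density."""
      dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
      neighbor_counts = (dists <= eps).sum(axis=1)     # counts include the point itself
      core = neighbor_counts >= min_pts
      # A border point is not a core point but lies within Eps of at least one core point.
      border = ~core & ((dists <= eps) & core[None, :]).any(axis=1)
      noise = ~core & ~border
      return core, border, noise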
The DBSCAN Algorithm

• Any two core points that are close enough—within a distance Eps
of one another—are put in the same cluster.
• Likewise, any border point that is close enough to a core point is
put in the same cluster as the core point.
• Noise points are discarded.
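In practice DBSCAN is rarely implemented from scratch; assuming scikit-learn is available, a typical call for a data array X looks like this (the parameter values are examples only):

  from sklearn.cluster import DBSCAN

  # eps corresponds to Eps and min_samples to MinPts in the description above.
  labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
  # Points labelled -1 are noise; the remaining labels identify the clusters.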
Time and Space Complexity
• The basic time complexity of the DBSCAN algorithm is O(m ×
time to find points in the Eps-neighborhood), where m is the
number of points.
• The space requirement of DBSCAN, even for high-dimensional
data, is O(m).
Strengths and Weaknesses

• Because DBSCAN uses a density-based definition of a cluster, it is
relatively resistant to noise and can handle clusters of arbitrary shapes
and sizes.
• Thus DBSCAN can find many clusters that could not be found using
K-means.
• However, DBSCAN has trouble when the clusters have widely varying
densities.
• It also has trouble with high-dimensional data because density is more
difficult to define for such data.
Thank you
The END
