
UNIT - V: Cluster Analysis

Basic Concepts and Algorithms: Overview, What Is Cluster Analysis?
Different Types of Clustering, Different Types of Clusters; K-means:
The Basic K-means Algorithm, K-means Additional Issues, Bisecting
K-means, Strengths and Weaknesses; Agglomerative Hierarchical
Clustering: Basic Agglomerative Hierarchical Clustering Algorithm;
DBSCAN: Traditional Density: Centre-Based Approach, DBSCAN
Algorithm, Strengths and Weaknesses.
What Is Cluster Analysis?

• Cluster analysis groups data objects based only on information
found in the data that describes the objects and their relationships.
• The goal is that the objects within a group be similar (or related)
to one another and different from (or unrelated to) the objects in
other groups.
Different ways of clustering the same set of points
Different Types of Clusterings

• An entire collection of clusters is referred to as a clustering.

Hierarchical versus Partitional

• A partitional clustering is a division of the set of data objects
into non-overlapping subsets (clusters) such that each data
object is in exactly one subset.
• If we permit clusters to have subclusters, then we obtain a
hierarchical clustering, which is a set of nested clusters that are
organized as a tree.
• Each node (cluster) in the tree (except for the leaf nodes) is the
union of its children (subclusters), and the root of the tree is the
cluster containing all the objects.
Exclusive versus Overlapping versus Fuzzy
• In an exclusive clustering, each object is assigned to a single cluster.
• In an overlapping or non-exclusive clustering, an object can
simultaneously belong to more than one group (class).
• In a fuzzy clustering, every object belongs to every cluster with a
membership weight that is between 0 (absolutely doesn't belong)
and 1 (absolutely belongs).
• Clusters are treated as fuzzy sets (a fuzzy set is one in which an
object belongs to any set with a weight that is between 0 and 1).
Complete versus Partial
• A complete clustering assigns every object to a cluster, whereas a
partial clustering does not.
• The motivation for a partial clustering is that some objects in a
data set may not belong to well-defined groups.
Different Types of Clusters

• Clustering aims to find useful groups of objects (clusters).


• Well-Separated: A cluster is a set of objects in which each
object is closer (or more similar) to every other object in the
cluster than to any object not in the cluster.
• Prototype-Based: Prototype-based clusters are often center-based
clusters.
• Each point is closer to the center of its cluster than to the center
of any other cluster.

Well-separated clusters Center-based clusters


• Graph-Based: If the data is represented as a graph, where the
nodes are objects and the links represent connections among
objects, then a cluster can be defined as a connected component,
i.e., a group of objects that are connected to one another but have
no connection to objects outside the group.
• An example is contiguity-based clusters, where two objects are
connected only if they are within a specified distance of each other.
• Each point is closer to at least one point in its cluster than to any
point in another cluster.
contiguity-based clusters
• Density-Based A cluster is a dense region of objects that is
surrounded by a region of low density.

Density-based clusters
• Conceptual Clusters: We can define a cluster as a set of
objects that share some property.

Conceptual clusters
K-means
• K-means defines a prototype in terms of a centroid, which is
usually the mean of a group of points, and is typically applied
to objects in a continuous n-dimensional space.
The Basic K-means Algorithm

• We first choose K initial centroids, where K is a user-specified
parameter, namely, the number of clusters desired.
• Each point is then assigned to the closest centroid, and each
collection of points assigned to a centroid is a cluster.
• The centroid of each cluster is then updated based on the
points assigned to the cluster.
• We repeat the assignment and update steps until no point
changes clusters, or equivalently, until the centroids remain the
same (see the sketch below).
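A minimal NumPy sketch of these steps is shown below. The function name kmeans, the random choice of initial centroids, and the convergence test are illustrative choices, not prescribed by the slides.

  import numpy as np

  def kmeans(X, k, max_iter=100, seed=0):
      """Basic K-means on an (m, n) array X; returns centroids and cluster labels."""
      rng = np.random.default_rng(seed)
      # Step 1: choose K initial centroids (here: K points picked at random).
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      for _ in range(max_iter):
          # Step 2: assign each point to the closest centroid (Euclidean distance).
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # Step 3: recompute each centroid as the mean of its assigned points.
          new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                    else centroids[i]  # keep the old centroid if a cluster is empty
                                    for i in range(k)])
          # Step 4: stop when the centroids no longer change.
          if np.allclose(new_centroids, centroids):
              break
          centroids = new_centroids
      return centroids, labels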
Example: Each figure shows
(1) the centroids at the start of the iteration and
(2) the assignment of the points to those centroids.
(3) The centroids are indicated by the “+” symbol; all points
belonging to the same cluster have the same shape.
Using the K-means algorithm to find three clusters in sample
data.
• In the first step, shown in Figure (a), points are assigned to the initial
centroids, which are all in the larger group of points.
• For this example, we use the mean as the centroid.
• After points are assigned to a centroid, the centroid is then
updated.
• In the second step, points are assigned to the updated centroids,
and the centroids are updated again.
• In steps 2, 3, and 4, which are shown in Figures (b), (c), and
(d), respectively, two of the centroids move to the two small
groups of points at the bottom of the figures.
• Then the K-means algorithm terminates in Figure 8.3(d),
because no more changes occur.
Assigning Points to the Closest Centroid
• To assign a point to the closest centroid, we need a proximity
measure that quantifies the notion of "closest" for the specific data.
• Euclidean (L2) distance is often used for data points in Euclidean
space, while cosine similarity is more appropriate for documents.
• Other measures may also suit a given type of data; for example,
Manhattan (L1) distance can be used for Euclidean data, while the
Jaccard measure is often employed for documents.
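As a small illustration (a NumPy sketch with made-up vectors), the two most common choices can be computed as follows:

  import numpy as np

  p, q = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])

  # Euclidean (L2) distance: suited to points in Euclidean space.
  euclidean = np.linalg.norm(p - q)

  # Cosine similarity: suited to document vectors, where only the direction matters.
  cosine = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))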
Centroids and Objective Functions
• Consider data whose proximity measure is Euclidean distance.
• For our objective function, which measures the quality of a
clustering, we use the sum of the squared error (SSE).
• We calculate the error of each data point, i.e., its Euclidean
distance to the closest centroid, and then compute the total sum of
the squared errors:

  SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist(ci, x)²

Table of notation:
• x: an object; Ci: the ith cluster; ci: the centroid of cluster Ci;
c: the centroid of all points; mi: the number of objects in the ith
cluster; m: the number of objects in the data set; K: the number of
clusters.
• The centroid (mean) of the ith cluster is defined by

  ci = (1 / mi) Σ_{x ∈ Ci} x
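A small helper for this objective, reusing the centroids and labels produced by the illustrative kmeans sketch above:

  def sse(X, centroids, labels):
      """Total squared Euclidean distance of each point to its assigned centroid."""
      diffs = X - centroids[labels]   # each point minus the centroid of its own cluster
      return float(np.sum(diffs ** 2))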
Choosing Initial Centroids
• When random initialization of centroids is used, different runs of
K-means typically produce different total SSEs.
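A common remedy, sketched below for a data array X using the illustrative kmeans and sse helpers defined earlier, is to perform several runs with different seeds and keep the clustering with the lowest SSE:

  runs = [kmeans(X, k=3, seed=s) for s in range(10)]            # 10 random initializations
  best_centroids, best_labels = min(runs, key=lambda r: sse(X, *r))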
Time and Space Complexity
• The space requirements for K-means are modest because only the
data points and centroids are stored.
• Specifically, the storage required is O((m + K)n), where m is the
number of points and n is the number of attributes.
• The time requirements for K-means are also modest—basically
linear in the number of data points.
• In particular, the time required is O(I ∗K ∗m ∗ n), where I is the
number of iterations.
K-means: Additional Issues

Handling Empty Clusters


• One of the problems with the basic K-means is that empty clusters
can be obtained if no points are allocated to a cluster.
• If this happens, then a strategy is needed to choose a replacement
centroid.
• One approach is to choose the point that is farthest away from any
current centroid.
• Another approach is to choose the replacement centroid from the
cluster that has the highest SSE.


Outliers
• In particular, when outliers are present, the resulting cluster
centroids may not be as representative as they otherwise would be,
and the SSE will be higher as well.
• Because of this, it is often useful to discover outliers and eliminate
them beforehand.
Reducing the SSE with Postprocessing
• An obvious way to reduce the SSE is to find more clusters, i.e., to
use a larger K.
• Two strategies that decrease the SSE by increasing the
number of clusters are :
• Split a cluster: The cluster with the largest SSE is usually
chosen.
• Introduce a new cluster centroid: The point that is farthest from
any cluster center is often chosen.
Two strategies that decrease the number of clusters, while trying to
minimize the increase in total SSE:

• Disperse a cluster: This is done by removing the centroid of a
cluster and reassigning the points to other clusters.
• Merge two clusters: The clusters with the closest centroids are
typically selected; a better approach is to merge the two clusters that
result in the smallest increase in total SSE.
Updating Centroids Incrementally
• Instead of updating cluster centroids after all points have been
assigned to a cluster, the centroids can be updated incrementally,
after each assignment of a point to a cluster.
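Conceptually, when a point x is assigned to a cluster that currently holds n points with centroid c, the centroid can be adjusted in place instead of being recomputed from scratch; a one-line sketch with illustrative names:

  def add_point(centroid, n, x):
      # New mean of n + 1 points: c + (x - c) / (n + 1).
      return centroid + (x - centroid) / (n + 1)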
Bisecting K-means

• The bisecting K-means algorithm is a straightforward extension of
the basic K-means algorithm.
• To obtain K clusters, split the set of all points into two clusters,
select one of these clusters to split, and so on, until K clusters
have been produced (see the sketch below).
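A minimal sketch of this loop, assuming the illustrative kmeans helper defined earlier and choosing the cluster with the largest SSE as the one to split (choosing the largest cluster is another common criterion):

  def bisecting_kmeans(X, k):
      def cluster_sse(pts):
          # SSE of a single cluster: squared distances to its own mean.
          return float(np.sum((pts - pts.mean(axis=0)) ** 2))

      clusters = [X]                       # start with one cluster containing all points
      while len(clusters) < k:
          # Select the cluster with the largest SSE and bisect it with 2-means.
          i = max(range(len(clusters)), key=lambda j: cluster_sse(clusters[j]))
          to_split = clusters.pop(i)
          centroids, labels = kmeans(to_split, k=2)
          clusters.append(to_split[labels == 0])
          clusters.append(to_split[labels == 1])
      return clusters

In practice the 2-means bisection is often tried several times and the split with the lowest SSE is kept.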
Bisecting K-means Example:
• In iteration 1, two pairs of clusters are found;
• In iteration 2, the rightmost pair of clusters is split; and
• In iteration 3, the leftmost pair of clusters is split.
Bisecting K-means on the four clusters example
Strengths and Weaknesses

• K-means is simple and can be used for a wide variety of data types.
• It is quite efficient, even though multiple runs are often performed.
• Some variants, such as bisecting K-means, are even more efficient and
are less susceptible to initialization problems.
• K-means is not suitable for all types of data.
• It cannot handle non-globular clusters or clusters of different sizes and
densities.
K-means with clusters of different size
K-means with clusters of different density
K-means with non-globular clusters
Agglomerative Hierarchical Clustering

• There are two basic approaches for generating a hierarchical
clustering:
• Agglomerative: Start with the points as individual clusters and, at
each step, merge the closest pair of clusters.
• Divisive: Start with one cluster and, at each step, split a cluster
until only singleton clusters of points remain.
• A hierarchical clustering is often displayed graphically using a
tree-like diagram called a dendrogram.
• a hierarchical clustering can also be graphically represented using
a nested cluster diagram
Basic Agglomerative Hierarchical Clustering Algorithm

• Agglomerative hierarchical clustering techniques start
with individual points as clusters and successively merge the two
closest clusters until only one cluster remains.
Defining Proximity between Clusters
• The key operation is the computation of the proximity between
two clusters.
• Different definitions of cluster proximity give rise to different
agglomerative hierarchical clustering techniques, such as MIN, MAX,
and Group Average.
• MIN defines cluster proximity as the proximity between the
closest two points that are in different clusters.
• MAX takes the proximity between the farthest two points in
different clusters.
• The group average technique defines cluster proximity to be the
average of the pairwise proximities of all pairs of points from different
clusters.
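These three definitions can be written directly as functions of two clusters; the NumPy sketch below uses Euclidean distance as the underlying proximity, which is one reasonable choice:

  import numpy as np

  def pairwise_dists(A, B):
      # All pairwise Euclidean distances between the points of clusters A and B.
      return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

  def proximity_min(A, B):             # MIN (single link): closest pair of points
      return pairwise_dists(A, B).min()

  def proximity_max(A, B):             # MAX (complete link): farthest pair of points
      return pairwise_dists(A, B).max()

  def proximity_group_average(A, B):   # Group Average: mean of all pairwise distances
      return pairwise_dists(A, B).mean()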
DBSCAN

• Density-based clustering locates regions of high density that
are separated from one another by regions of low density.


Traditional Density: Center-Based Approach

• In the center-based approach, density is estimated for a
particular point in the data set by counting the number of points
within a specified radius, Eps, of that point.
• This count includes the point itself.
Center-based density.
Classification of Points According to Center-Based Density
• The center-based approach to density allows us to classify a point
as being
(1) in the interior of a dense region (a core point),
(2) on the edge of a dense region (a border point), or
(3) in a sparsely occupied region (a noise or background point).
Core points:
• These points are in the interior of a density-based cluster.
• A point is a core point if there are at least MinPts points
(including the point itself) within a distance Eps of it.
• In the figure, point A is a core point for the indicated radius (Eps) if
MinPts ≤ 7.
Core, border, and noise points
• Border points: A border point is not a core point, but falls within
the neighborhood of a core point.
• In the figure, point B is a border point.
• Noise points: A noise point is any point that is neither a core point
nor a border point.
• In the figure, point C is a noise point.
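A compact sketch of this classification for a data array X (Eps and MinPts are the DBSCAN parameters; the function name is illustrative):

  import numpy as np

  def classify_points(X, eps, min_pts):
      """Label each point as core, border, or noise using center-based density."""
      dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
      neighbor_counts = (dists <= eps).sum(axis=1)     # counts include the point itself
      core = neighbor_counts >= min_pts
      # A border point is not a core point but lies within Eps of at least one core point.
      border = ~core & ((dists <= eps) & core[None, :]).any(axis=1)
      noise = ~core & ~border
      return core, border, noise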
The DBSCAN Algorithm

• Any two core points that are close enough—within a distance Eps
of one another—are put in the same cluster.
• Likewise, any border point that is close enough to a core point is
put in the same cluster as the core point.
• Noise points are discarded.
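In practice DBSCAN is rarely implemented from scratch; assuming scikit-learn is available, a typical call for a data array X looks like this (the parameter values are examples only):

  from sklearn.cluster import DBSCAN

  # eps corresponds to Eps and min_samples to MinPts in the description above.
  labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
  # Points labelled -1 are noise; the remaining labels identify the clusters.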
Time and Space Complexity
• The basic time complexity of the DBSCAN algorithm is O(m ×
time to find points in the Eps-neighborhood), where m is the
number of points.
• The space requirement of DBSCAN, even for high-dimensional
data, is O(m).
Strengths and Weaknesses

• Because DBSCAN uses a density-based definition of a cluster, it is
relatively resistant to noise and can handle clusters of arbitrary shapes
and sizes.
• Thus DBSCAN can find many clusters that could not be found using
K-means.
• However, DBSCAN has trouble when the clusters have widely varying
densities.
• It also has trouble with high-dimensional data because density is more
difficult to define for such data.
Thank you
The END
