
UNIT-V
What is Cluster Analysis?

• Cluster: a collection of data objects


– Similar to one another within the same cluster.
– Dissimilar to the objects in other clusters.
• Cluster analysis
– Grouping a set of data objects into clusters.
• Clustering is unsupervised classification: no predefined
classes.
• Applications
– As a stand-alone tool to get insight into data distribution.
– As a preprocessing step for other algorithms.
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
– Create thematic maps in GIS by clustering feature spaces
– Detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access
patterns
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine
input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Overview of Basic Clustering Methods
• The major clustering methods can be classified into the following
categories.
• Partitioning methods:
• Given a database of n objects or data tuples, a partitioning method
constructs k partitions of the data, where each partition represents a
cluster and k ≤ n.
• That is, it classifies the data into k groups, which together satisfy the
following requirements:
(1) each group must contain at least one object,
(2) each object must belong to exactly one group.
A few popular methods are:
(1) The k-means algorithm, where each cluster is represented by the
mean value of the objects in the cluster. [Each cluster is represented by the
center of the cluster].
(2) The k-medoids algorithm, where each cluster is represented by one of
the objects located near the center of the cluster. [Each cluster is represented
by one of the objects in the cluster ]
• Hierarchical methods:
• A hierarchical method creates a hierarchical decomposition of
the given set of data objects.
• A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical
decomposition is formed.
• The agglomerative approach, also called the bottom-up
approach, starts with each object forming a separate group.
• It successively merges the objects or groups that are close to
one another, until all of the groups are merged into one (the
topmost level of the hierarchy), or until a termination
condition holds.
• The divisive approach, also called the top-down approach,
starts with all of the objects in the same cluster.
• In each successive iteration, a cluster is split up into smaller
clusters, until eventually each object is in one cluster, or until
a termination condition holds.
• Density-based methods:
• The general idea is to continue growing the given cluster as
long as the density (number of objects or data points) in the
“neighborhood” exceeds some threshold;

• That is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

• Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.

• DBSCAN and its extension, OPTICS, are typical density-based methods.
• Grid-based methods:
• Grid-based methods quantize the object space into a finite
number of cells that form a grid structure.
• All of the clustering operations are performed on the grid
structure (i.e., on the quantized space).
• The main advantage of this approach is its fast processing
time, which is typically independent of the number of data
objects and dependent only on the number of cells in each
dimension in the quantized space.
• STING is a typical example of a grid-based method.

Partitioning Methods
K-Means: A Centroid-Based Technique
• Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, …, Ck, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.

• An objective function is used to assess the partition quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters.

• That is, the objective function aims for high intra-cluster similarity and low inter-cluster similarity.
• A centroid-based partition technique uses the centroid of a
cluster, Ci , to represent that cluster. The centroid of a cluster is its
center point.

• The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster.

• The difference between an object p ∈ Ci and ci, which represents the cluster, is measured by dist(p, ci), where dist(x, y) is the distance between two points x and y.
• The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid ci, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)²

where
• E is the sum of the squared error for all objects in the data set;
• p is a point in space representing a given object;
• ci is the centroid of cluster Ci (both p and ci are multidimensional).
• In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed.
Algorithm: k-means. The k-means algorithm for partitioning,
where each cluster’s center is represented by the mean value of
the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) Choose k objects from D as the initial cluster centers;
(2) Repeat
(3) (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) Update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) Until no change.
An Example of K-Means Clustering (K = 2)

Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated; objects are reassigned to their nearest centroid; the loop repeats if needed.

• Partition objects into k nonempty subsets
• Repeat
– Compute the centroid (i.e., mean point) of each partition
– Assign each object to the cluster of its nearest centroid
• Until no change
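A minimal Python sketch of this loop is shown below. It follows the steps on the algorithm slide but is illustrative only: the function name, the random initialization, and the convergence test are assumptions, and empty clusters are not handled.

```python
import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    """Minimal k-means sketch: D is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # (1) Choose k objects from D as the initial cluster centers.
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # (3) Assign each object to the cluster whose center (mean) is nearest.
        dist = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # (4) Update the cluster means (assumes no cluster becomes empty).
        new_centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        # (5) Until no change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```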
A drawback of k-means.
• Consider seven points in 1-D space having the values

1, 2, 3, 8, 9, 10, and 25

• Visually, we may imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, with point 25 excluded because it appears to be an outlier.

• How would k-means partition the values? If we apply k-means using k = 2, the partitioning {{1, 2, 3}, {8, 9, 10, 25}} has the within-cluster variation computed below.
• Within-cluster variation:

(1−2)² + (2−2)² + (3−2)² + (8−13)² + (9−13)² + (10−13)² + (25−13)² = 196

• Given that the mean of cluster {1, 2, 3} is 2 and the mean of {8, 9, 10, 25} is 13.

• Compare this to the partitioning {{1, 2, 3, 8}, {9, 10, 25}}, for which k-means computes the within-cluster variation as

• Within-cluster variation:

(1−3.5)² + (2−3.5)² + (3−3.5)² + (8−3.5)² + (9−14.67)² + (10−14.67)² + (25−14.67)² = 189.67

• Given that 3.5 is the mean of cluster {1, 2, 3, 8} and 14.67 is the mean of cluster {9, 10, 25}.
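The two variation values can be verified with a short Python check (not part of the original slides; the helper name is illustrative):

```python
def within_cluster_variation(clusters):
    """Sum of squared distances of each point to its cluster mean."""
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((p - mean) ** 2 for p in c)
    return total

print(within_cluster_variation([[1, 2, 3], [8, 9, 10, 25]]))   # 196.0
print(within_cluster_variation([[1, 2, 3, 8], [9, 10, 25]]))   # ~189.67
```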
• The latter partitioning has the lower within-cluster variation; therefore, the k-means method assigns the value 8 to a cluster different from the one containing 9 and 10, due to the outlier point 25. Moreover, the center of the second cluster, 14.67, is far from all the members of the cluster.

• “How can we modify the k-means algorithm to diminish such sensitivity to outliers?” Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster.
Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.
• For comparison: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
• Comment: Often terminates at a local optimum.
• Weakness
– Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of
data
– Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
– Sensitive to noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
The k-Medoids Method
The absolute-error criterion is used, defined as

E = Σ_{j=1}^{k} Σ_{p ∈ Cj} dist(p, oj)

where
• E is the sum of the absolute error for all objects p in the data set;
• oj is the representative object of Cj.

Figure: Four cases of the cost function for k-medoids clustering.


The k-Medoids Method (continued)
PAM: A Typical K-Medoids Algorithm (K = 2)

Figure: arbitrarily choose k objects as the initial medoids (total cost = 20); assign each remaining object to its nearest medoid; randomly select a nonmedoid object, O_random; compute the total cost of swapping a medoid O with O_random (total cost = 26 in the example); swap O and O_random if the quality is improved; repeat (do loop) until no change.
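A rough Python sketch of this swapping idea is given below. It is an illustration of the cost-based swap loop described above rather than the full PAM algorithm; Euclidean distance and the helper names are assumptions.

```python
import numpy as np

def total_cost(D, medoid_idx):
    """Sum of distances from every object to its nearest medoid."""
    dist = np.linalg.norm(D[:, None, :] - D[medoid_idx][None, :, :], axis=2)
    return dist.min(axis=1).sum()

def pam_sketch(D, k, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial medoids.
    medoids = list(rng.choice(len(D), size=k, replace=False))
    improved = True
    while improved:                      # repeat until no change
        improved = False
        for m in list(medoids):
            for o in range(len(D)):      # try swapping medoid m with non-medoid o
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids = candidate  # keep the swap if total cost decreases
                    improved = True
    return medoids
```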
Hierarchical Methods
• Agglomerative and Divisive Hierarchical Clustering
• Agglomerative hierarchical clustering:
• This bottom-up strategy starts by placing each object in its
own cluster and then merges these atomic clusters into larger
and larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are satisfied.
• Divisive hierarchical clustering:
• This top-down strategy does the reverse of agglomerative
hierarchical clustering by starting with all objects in one
cluster.
• It subdivides the cluster into smaller and smaller pieces, until
each object forms a cluster on its own or until it satisfies
certain termination conditions.
Figure: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e} (AGNES: AGglomerative NESting; DIANA: DIvisive ANAlysis)
Figure: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}

• A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering.
• It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step by step.
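As an aside, agglomerative clustering and its dendrogram can be produced with SciPy; the snippet below is a small illustration with made-up 2-D points standing in for objects a through e.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five small 2-D points standing in for objects a, b, c, d, e (made up).
X = np.array([[1, 1], [1.2, 1.1], [4, 4], [4.1, 4.2], [8, 8]])

Z = linkage(X, method='single')          # agglomerative (bottom-up) merging
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
plt.title('Agglomerative clustering dendrogram')
plt.show()
```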
Density-Based Methods
• Partitioning and hierarchical methods are designed to find
spherical-shaped clusters.
• They have difficulty finding clusters of arbitrary shape, such as the “S”-shaped and oval clusters in the figure below.

Figure: Clusters of arbitrary shape


• Given such data, they would likely inaccurately identify
convex regions, where noise or outliers are included in the
clusters.
• To find clusters of arbitrary shape, alternatively, we can model
clusters as dense regions in the data space, separated by
sparse regions.
• This is the main strategy behind density-based clustering
methods, which can discover clusters of nonspherical shape.

 DBSCAN (Density-Based Clustering Based on
Connected Regions with High Density)
• “How can we find dense regions in density-based clustering?”
The density of an object o can be measured by the number of
objects close to o.
• DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) finds core objects, that is, objects that have dense
neighborhoods.
• It connects core objects and their neighborhoods to form
dense regions as clusters.
• “How does DBSCAN quantify the neighborhood of an object?”
A user-specified parameter > 0 is used to specify the radius of
a neighborhood we consider for every object.
• The -neighborhood of an object o is the space within a radius
centered at o.
27
• Due to the fixed neighborhood size parameterized by ε, the density of a neighborhood can be measured simply by the number of objects in the neighborhood.
• To determine whether a neighborhood is dense or not, DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions.
• An object is a core object if the ε-neighborhood of the object contains at least MinPts objects. Core objects are the pillars of dense regions.
• Example: Density-reachability and density connectivity
• Consider the figure below for a given ε, represented by the radius of the circles, and, say, let MinPts = 3. Based on the above definitions:
• Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighborhood containing at least three points.
• q is directly density-reachable from m. Object m is directly
density-reachable from p and vice versa.
• Object q is (indirectly) density-reachable from p because q is
directly density-reachable from m and m is directly density-
reachable from p.
• However, p is not density-reachable from q because q is not a
core object.
• Similarly, r and s are density-reachable from o, and o is density-
reachable from r. Thus o, r, and s are all density-connected.
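The core-object test used in this example is easy to express in code. The sketch below (an illustration; Euclidean distance and the helper name are assumptions) counts each point's ε-neighborhood, including the point itself:

```python
import numpy as np

def core_objects(D, eps, min_pts):
    """Return indices of points whose eps-neighborhood holds >= min_pts points."""
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # A point's eps-neighborhood is taken to include the point itself here.
    neighborhood_sizes = (dist <= eps).sum(axis=1)
    return np.where(neighborhood_sizes >= min_pts)[0]
```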
• A density-based cluster is a set of density-connected
objects that is maximal with respect to density-
reachability. Every object not contained in any cluster
is considered to be noise.

Figure: Density reachability and density connectivity in density-based clustering.
• “How does DBSCAN find clusters?” Initially, all objects in a given
data set D are marked as “unvisited.”
• DBSCAN randomly selects an unvisited object p, marks p as “visited,” and checks whether the ε-neighborhood of p contains at least MinPts objects.
• If not, p is marked as a noise point. Otherwise, a new cluster C is created for p, and all the objects in the ε-neighborhood of p are added to a candidate set, N.
• DBSCAN iteratively adds to C those objects in N that do not belong to any cluster. In this process, for an object p′ in N that carries the label “unvisited,” DBSCAN marks it as “visited” and checks its ε-neighborhood.
• If the ε-neighborhood of p′ has at least MinPts objects, those objects in the ε-neighborhood of p′ are added to N.
• DBSCAN continues adding objects to C until C can no longer be
expanded, that is, N is empty. At this time, cluster C is
completed, and thus is output.
• To find the next cluster, DBSCAN randomly selects an unvisited
object from the remaining ones. The clustering process continues until all objects are visited.

DBSCAN Algorithm
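The procedure described above can be sketched in Python as follows. This is a simplified, unoptimized rendering of those steps, not a reference implementation; the label encoding and variable names are assumptions.

```python
import numpy as np

def dbscan(D, eps, min_pts):
    n = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    labels = np.full(n, -2)              # -2: unvisited, -1: noise, >=0: cluster id
    cluster_id = -1
    for p in range(n):
        if labels[p] != -2:
            continue                     # p already visited
        neighbors = list(np.where(dist[p] <= eps)[0])
        if len(neighbors) < min_pts:
            labels[p] = -1               # mark p as noise (may later join a cluster)
            continue
        cluster_id += 1                  # p is a core object: start a new cluster C
        labels[p] = cluster_id
        candidates = [q for q in neighbors if q != p]     # candidate set N
        while candidates:                # expand C until N is exhausted
            q = candidates.pop()
            if labels[q] == -1:
                labels[q] = cluster_id   # border point that was previously noise
            if labels[q] != -2:
                continue
            labels[q] = cluster_id
            q_neighbors = list(np.where(dist[q] <= eps)[0])
            if len(q_neighbors) >= min_pts:
                candidates.extend(q_neighbors)   # q is a core object: grow N
    return labels
```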
Evaluation of Clustering
• In general, cluster evaluation assesses the feasibility of
clustering analysis on a data set and the quality of the results
generated by a clustering method.
The major tasks of clustering evaluation include the following:
• Assessing clustering tendency.
• In this task, for a given data set, we assess whether a nonrandom
structure exists in the data.
• Blindly applying a clustering method on a data set will return clusters;
however, the clusters mined may be misleading.
• Clustering analysis on a data set is meaningful only when there is a
nonrandom structure in the data.
• Determining the number of clusters in a data set.
• A few algorithms, such as k-means, require the number of clusters in a
data set as the parameter.
• Moreover, the number of clusters can be regarded as an interesting
and important summary statistic of a data set.
• Therefore, it is desirable to estimate this number even before a
clustering algorithm is used to derive detailed clusters.
• Measuring clustering quality.
• After applying a clustering method on a data set, we want to assess
how good the resulting clusters are. A number of measures can be
used.
• Some methods measure how well the clusters fit the data set, while
others measure how well the clusters match the ground truth, if such
truth is available.
• There are also measures that score clusterings and thus can compare
two sets of clustering results on the same data set.
Assessing Clustering Tendency
• Clustering tendency assessment determines whether a given
data set has a non-random structure, which may lead to
meaningful clusters.
• Consider a data set that does not have any non-random
structure, such as a set of uniformly distributed points in a
data space.
• Even though a clustering algorithm may return clusters for the
data, those clusters are random and are not meaningful.
• Clustering requires nonuniform distribution of data.
• The textbook's Figure 10.21 shows a data set that is uniformly distributed in 2-D data space.
• Although a clustering algorithm may still artificially partition
the points into groups, the groups will unlikely mean anything
significant to the application due to the uniform distribution
of the data.
• “How can we assess the clustering tendency of a data set?”
Intuitively, we can try to measure the probability that the data
set is generated by a uniform data distribution.
• This can be achieved using statistical tests for spatial
randomness. To illustrate this idea, let’s look at a simple yet
effective statistic called the Hopkins Statistic.
• The Hopkins Statistic is a spatial statistic that tests the spatial
randomness of a variable as distributed in a space.
• Given a data set, D, which is regarded as a sample of a
random variable, o, we want to determine how far away o is
from being uniformly distributed in the data space. We
calculate the Hopkins Statistic as follows:
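The slide's formula is not reproduced in this text, so the sketch below shows one common formulation of the Hopkins Statistic as an assumption (textbook conventions differ in which end of the scale indicates clustering):

```python
import numpy as np

def hopkins(D, m=50, seed=0):
    """One common formulation of the Hopkins Statistic (an assumption;
    requires m <= number of points in D).  D is an (n, d) array."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    lo, hi = D.min(axis=0), D.max(axis=0)

    # u_i: distance from m points drawn uniformly in the data space
    #      to their nearest neighbor in D.
    U = rng.uniform(lo, hi, size=(m, d))
    u = np.linalg.norm(U[:, None, :] - D[None, :, :], axis=2).min(axis=1)

    # w_i: distance from m sampled data points to their nearest *other* point in D.
    idx = rng.choice(n, size=m, replace=False)
    dw = np.linalg.norm(D[idx][:, None, :] - D[None, :, :], axis=2)
    dw[np.arange(m), idx] = np.inf       # exclude each point itself
    w = dw.min(axis=1)

    # With this convention, values near 0.5 suggest a uniform (random)
    # distribution, while values near 1 suggest highly clustered data.
    return u.sum() / (u.sum() + w.sum())
```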
Determining the Number of Clusters
• Determining the “right” number of clusters in a data set is important,
not only because some clustering algorithms like k-means require
such a parameter, but also because the appropriate number of
clusters controls the proper granularity of cluster analysis.
• It can be regarded as finding a good balance between compressibility
and accuracy in cluster analysis. Consider two extreme cases. What if
you were to treat the entire data set as a cluster? This would
maximize the compression of the data, but such a cluster analysis has
no value.
• On the other hand, treating each object in a data set as a cluster gives
the finest clustering resolution (i.e., most accurate due to the zero
distance between an object and the corresponding cluster center).
• In some methods like k-means, this even achieves the best cost.
However, having one object per cluster does not enable any data
summarization.
• Determining the number of clusters is far from easy, often because the
“right” number is ambiguous.
• Figuring out what the right number of clusters should be often depends
on the distribution’s shape and scale in the data set, as well as the
clustering resolution required by the user.
• There are many possible ways to estimate the number of clusters. Here,
we briefly introduce a few simple yet popular and effective methods.
• The elbow method is based on the observation that increasing the
number of clusters can help to reduce the sum of within-cluster variance
of each cluster.
• This is because having more clusters allows one to capture finer groups
of data objects that are more similar to each other.
• However, the marginal effect of reducing the sum of within-cluster
variances may drop if too many clusters are formed, because splitting a
cohesive cluster into two gives only a small reduction.
• Consequently, a heuristic for selecting the right number of clusters is to
use the turning point in the curve of the sum of within-cluster variances
with respect to the number of clusters.
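For illustration, the elbow curve can be drawn with scikit-learn's KMeans; the data array X below is a placeholder, and the range of k values is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

ks = range(1, 11)
# inertia_ is the sum of squared distances to the nearest cluster center (SSE).
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('sum of within-cluster variances (SSE)')
plt.title('Elbow method: look for the turning point')
plt.show()
```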
Measuring Clustering Quality
• We have a few methods to choose from for measuring the quality
of a clustering.
• In general, these methods can be categorized into two groups
according to whether ground truth is available. Here, ground truth
is the ideal clustering that is often built using human experts.
• If ground truth is available, it can be used by extrinsic methods, which compare the clustering against the ground truth and measure how well they match.
• If the ground truth is unavailable, we can use intrinsic methods,
which evaluate the goodness of a clustering by considering how
well the clusters are separated.
• Ground truth can be considered as supervision in the form of
“cluster labels.” Hence, extrinsic methods are also known as
supervised methods, while intrinsic methods are unsupervised
methods.
• Extrinsic Methods
• Intrinsic Methods
• When the ground truth of a data set is not available, we have to use
an intrinsic method to assess the clustering quality.
• In general, intrinsic methods evaluate a clustering by examining how
well the clusters are separated and how compact the clusters are.
• Many intrinsic methods take advantage of a similarity metric between objects in the data set.
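As one concrete example of an intrinsic measure (an addition for illustration, not named on the slides), the silhouette coefficient combines both aspects, separation and compactness, in a single score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Silhouette values lie in [-1, 1]; higher means clusters are
# more compact and better separated.
print(silhouette_score(X, labels))
```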
