UNIT 2 DMW
Cluster Analysis:
The process of grouping a set of physical or abstract objects
into classes of similar objects is called clustering.
A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the
objects in other clusters.
A cluster of data objects can be treated collectively as one
group and so may be considered as a form of data
compression.
Cluster analysis tools based on k-means, k-medoids, and
several other methods have also been built into many statistical
analysis software packages or systems, such as S-Plus, SPSS,
and SAS.
Partitioning Methods:
A partitioning method constructs k partitions of the data,
where each partition represents a cluster and k <= n (n being the
number of data objects). That is, it classifies the data into k
groups, which together satisfy the following requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of
the given set of data objects. A hierarchical method can be
classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up
approach, starts with each object forming a separate group. It
successively merges the objects or groups that are close to one
another, until all of the groups are merged into one or until a
termination condition holds.
The divisive approach, also called the top-down approach,
starts with all of the objects in the same cluster. In each
successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in one cluster, or until a
termination condition holds.
Hierarchical methods suffer from the fact that once a step
(merge or split) is done, it can never be undone. This rigidity is
useful in that it leads to smaller computation costs by not
having to worry about a combinatorial number of different
choices.
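As an illustration, here is a minimal agglomerative (bottom-up) clustering sketch using SciPy's hierarchical clustering routines; the sample data, the choice of average linkage, and the cut at two clusters are assumptions made only for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small illustrative data set (assumed for the example)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Bottom-up decomposition: each point starts as its own cluster and the
# closest groups are merged step by step until one cluster remains.
Z = linkage(X, method='average')   # 'average' linkage is one possible choice

# Termination condition: cut the hierarchy so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)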
Density-Based Methods:
Most partitioning methods cluster objects based on the
distance between objects. Such methods can find only
spherical-shaped clusters and encounter difficulty at
discovering clusters of arbitrary shapes.
Other clustering methods have been developed based on the
notion of density. Their general idea is to continue growing the
given cluster as long as the density in the neighborhood
exceeds some threshold; that is, for each data point within a
given cluster, the neighborhood of a given radius has to contain
at least a minimum number of points. Such a method can be
used to filter out noise (outliers) and discover clusters of
arbitrary shape.
DBSCAN and its extension, OPTICS, are typical density-based
methods that grow clusters according to a density-based
connectivity analysis. DENCLUE is a method that clusters objects
based on the analysis of the value distributions of density
functions.
Grid-Based Methods:
Grid-based methods quantize the object space into a finite
number of cells that form a grid structure.
All of the clustering operations are performed on the grid
structure i.e., on the quantized space. The main advantage
of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent
only on the number of cells in each dimension in the
quantized space.
STING is a typical example of a grid-based method. Wave
Cluster applies wavelet transformation for clustering analysis
and is both grid-based and density-based.
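To make the idea of quantizing the object space concrete, the small sketch below (the data and the cell size are assumptions) maps each point to a grid cell and counts the points per cell; the clustering operations then work on these cell counts rather than on the individual objects.

import numpy as np
from collections import Counter

X = np.array([[0.1, 0.2], [0.3, 0.1], [2.5, 2.6], [2.7, 2.4], [2.6, 2.8]])
cell_size = 1.0   # assumed width of each grid cell

# Quantize each coordinate into a cell index
cells = np.floor(X / cell_size).astype(int)

# Count how many objects fall into each cell; dense cells form the clusters
counts = Counter(map(tuple, cells))
print(counts)   # e.g. {(2, 2): 3, (0, 0): 2}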
Model-Based Methods:
Model-based methods hypothesize a model for each of the
clusters and find the best fit of the data to the given model.
A model-based algorithm may locate clusters by constructing
a density function that reflects the spatial distribution of the
data points.
It also leads to a way of automatically determining the
number of clusters based on standard statistics, taking
"noise" or outliers into account and thus yielding robust
clustering methods.
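For instance, a Gaussian mixture model can be fitted with scikit-learn and the number of clusters chosen with a standard statistic such as BIC; the generated data and the candidate range of k in this sketch are assumptions for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two assumed Gaussian groups of points
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Fit mixture models for several candidate numbers of clusters and keep
# the one with the lowest BIC (a standard model-selection statistic).
best_k, best_bic = None, np.inf
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = gm.bic(X)
    if bic < best_bic:
        best_k, best_bic = k, bic
print("chosen number of clusters:", best_k)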
Constraint-Based Clustering:
It is a clustering approach that performs clustering
by incorporation of user-specified or application-
oriented constraints.
A constraint expresses a user’s expectation or
describes properties of the desired clustering
results, and provides an effective means for
communicating with the clustering process.
Various kinds of constraints can be specified, either
by a user or as per application requirements. Examples
include spatial clustering in the presence of obstacles
and clustering under user-specified constraints.
In addition, semi-supervised clustering employs
pairwise constraints in order to improve the
quality of the resulting clustering.
For k-means, clustering quality is measured by the within-cluster
sum of squared errors:
E = Σi=1..k Σp∈Ci |p − mi|²
where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
mi is the mean of cluster Ci.
For k-medoids, an absolute-error criterion is used instead:
E = Σj=1..k Σp∈Cj |p − oj|
where E is the sum of the absolute error for all objects in the data set,
p is the point in space representing a given object in cluster Cj, and
oj is the representative object (medoid) of Cj.
The initial representative objects are chosen arbitrarily. The
iterative process of replacing representative objects by
non-representative objects continues as long as the quality of the
resulting clustering is improved.
This quality is estimated using a cost function that measures
the average dissimilarity between an object and the
representative object of its cluster.
To determine whether a non-representative object, o_random,
is a good replacement for a current representative object, oj,
the following four cases are examined for each of the
non-representative objects.
The k-Medoids Algorithm:
The k-medoids algorithm partitions the data based on medoids, or
central objects.
The k-medoids method is more robust than k-means in the
presence of noise and outliers, because a medoid is less
influenced by outliers or other extreme values than a mean.
However, its processing is more costly than the k-means
method.
OR
K-Medoids and K-Means are two types of clustering mechanisms in
partition clustering. Clustering is the process of breaking down
an abstract group of data points/objects into classes of similar
objects, such that all the objects in one cluster have similar traits.
In partition clustering, a group of n objects is broken down into k
clusters based on their similarities.
Two statisticians, Leonard Kaufman and Peter J. Rousseeuw, came up
with the K-Medoids method. This section explains what K-Medoids does, its
applications, and the difference between K-Means and K-Medoids.
K-medoids is an unsupervised method that clusters unlabelled data.
It is an improved variant of the K-Means algorithm, designed mainly
to deal with K-Means' sensitivity to outlier data. Compared to
other partitioning algorithms, the algorithm is simple, fast, and easy
to implement.
The K-Means partitioning is carried out as follows (a code sketch is given after the list):
1. Choose k number of random points (Data point from the data set or some
other points). These points are also called "Centroids" or "Means".
2. Assign all the data points in the data set to the closest centroid by applying
any distance formula like Euclidian distance, Manhattan distance, etc.
3. Now, choose new centroids by calculating the mean of all the data points in
each cluster, and go to step 2.
4. Continue step 3 until no data point changes its cluster between two
iterations.
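A minimal NumPy sketch of these steps; the sample data, k = 2, the iteration cap, and the random seed are assumptions made only for illustration.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move (assignments have stabilised)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids, sep="\n")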
The problem with the K-Means algorithm is that it cannot handle
outlier data well. An outlier is a point very different from the rest of
the points. Outlier data points either end up in distorted clusters of
their own or pull the mean of the cluster they are assigned to far away
from the bulk of its points. Hence, K-Means clustering is highly affected
by outlier data.
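A tiny worked example (the numbers are made up for illustration): for the one-dimensional cluster {1, 2, 3, 100}, the mean is (1 + 2 + 3 + 100) / 4 = 26.5, which lies far from every genuine member, whereas the medoid is 2 or 3, which stays inside the bulk of the cluster. This is why a medoid is less influenced by extreme values than a mean.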
K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances
to other data points is minimal.
(or)
A Medoid is a point in the cluster from which dissimilarities with all the other
points in the clusters are minimal.
PAM (Partitioning Around Medoids) is the most powerful of the three
K-Medoids algorithms (PAM, CLARA, and CLARANS) but has the
disadvantage of high time complexity. The following K-Medoids steps are
performed using PAM. In the later parts, we'll see what CLARA and CLARANS are.
Algorithm:
Given the value of k and unlabelled data (a code sketch of these steps is given after the list):
1. Choose k number of random points from the data and assign these k points to
k number of clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid
and assign it to the cluster with the nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to
the medoids)
4. Select a random non-medoid point as the new medoid and swap it with the
previous medoid. Repeat steps 2 and 3.
5. If the total cost with the new medoid is less than that with the previous
medoid, make the new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than that with the previous
medoid, undo the swap and repeat step 4.
7. The repetitions continue until no swap of medoids changes the
classification of the data points.
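A minimal sketch of PAM-style medoid swapping in NumPy; the data, k = 2, and the greedy swap search are simplifying assumptions (a full PAM implementation examines the four reassignment cases for every swap pair).

import numpy as np

def total_cost(X, medoid_idx):
    # Step 3: sum, over all points, of the distance to the nearest medoid
    dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # step 1
    cost = total_cost(X, medoids)
    improved = True
    while improved:                                             # step 7
        improved = False
        for m in range(k):                                      # try swapping each medoid...
            for o in range(len(X)):                             # ...with each non-medoid point (step 4)
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o
                new_cost = total_cost(X, candidate)             # steps 2-3
                if new_cost < cost:                             # step 5: keep the better swap
                    medoids, cost = candidate, new_cost
                    improved = True
                # otherwise: step 6, the swap is simply not kept (undo)
    # Final assignment of every point to its nearest medoid
    dists = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return dists.argmin(axis=1), medoids

X = np.array([[2.0, 6.0], [3.0, 4.0], [3.0, 8.0], [7.0, 3.0], [8.0, 5.0], [7.0, 6.0]])
labels, medoids = pam(X, k=2)
print(labels, medoids)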
Scatter plot:
If k is given as 2, we need to break down the data points into 2 clusters.
(The original notes walk through a scatter-plot example here, computing the
total cost for successive medoid choices, e.g. (5 + 5) + 5 = 15 and
(5) + (5 + 10) = 20, undoing the costlier swap, and arriving at a final
clustering with total cost 12.)
Limitation of PAM:
Each iteration of PAM has to evaluate every possible swap of a medoid with a
non-medoid object, giving a time complexity of O(k(n − k)²) per iteration.
Hence, PAM is suitable and recommended to be used for small data sets.
CLARA (Clustering LARge Applications):
The idea is that if the sample is selected in a fairly random manner, it
correctly represents the whole dataset, and therefore the representative
objects (medoids) chosen from it will be similar to those that would have
been chosen from the whole dataset.
CLARA draws several samples and outputs the best clustering found among these
samples. CLARA can deal with larger data sets than PAM. The complexity of
each iteration now becomes O(kS² + k(n − k)), where S is the size of the
sample.
OR
It is an extension to PAM to support medoid clustering for large data sets.
This algorithm selects data samples from the data set, applies PAM on
each sample, and outputs the best clustering out of these samples.
This is more efficient than PAM for large data. We should ensure that the
selected samples aren't biased, as they affect the clustering of the whole data.
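A minimal sketch of this sampling idea, reusing the pam() and total_cost() functions from the earlier PAM sketch; the sample size and number of samples are assumptions for illustration.

import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # Draw a random sample and run PAM on it
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        _, sample_medoids = pam(X[idx], k, seed=seed)
        medoids = idx[sample_medoids]            # map back to indices in the full data set
        # Evaluate the sampled medoids on the WHOLE data set
        cost = total_cost(X, list(medoids))
        if cost < best_cost:
            best_medoids, best_cost = list(medoids), cost
    dists = np.linalg.norm(X[:, None, :] - X[best_medoids][None, :, :], axis=2)
    return dists.argmin(axis=1), best_medoids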
CLARANS:
The CLARANS algorithm combines both PAM and CLARA by searching only a
subset of the dataset, and it does not constrain itself to a single sample at
any given time. While CLARA uses a fixed sample at each phase of the search,
CLARANS draws a sample with some randomness in every phase of the
search.
OR
This algorithm selects a sample of neighbors to examine instead of selecting
samples from the data set. In every step, it examines a sample of the
neighbors of the current node. The time complexity of this algorithm is
O(n²), and it is considered the most efficient of these medoid-based algorithms.
DBSCAN
The DBSCAN algorithm can efficiently cluster densely grouped points into one
cluster. It can identify regions of high local density among large datasets.
DBSCAN can very effectively handle outliers. An advantage of DBSCAN over
the K-means algorithm is that the number of clusters need not be known
beforehand in the case of DBSCAN.
DBSCAN takes two parameters: epsilon, the radius of the neighborhood drawn
around each data point, and minPoints, the number of points required within
that radius for the data point to become a core point.
In the DBSCAN algorithm, a circle with radius epsilon is drawn around each
data point, and the data point is classified as a Core Point, Border Point, or
Noise Point. The data point is classified as a core point if it has at least
minPoints data points within its epsilon radius. If it has fewer than minPoints
points within epsilon but lies within the epsilon radius of a core point, it is a
Border Point, and if it is neither a core point nor within the epsilon radius of
a core point, it is considered a Noise Point.
In the example figure, point A has no points inside its epsilon (e)
radius; hence it is a Noise Point. Point B has minPoints (= 4) points
within its epsilon radius; thus it is a Core Point. The third point has only 1
point (fewer than minPoints) within its radius but lies near a core point;
hence it is a Border Point.
First, all the points within the epsilon radius of each point are found, and
the core points are identified as those with a number of neighbors greater
than or equal to minPoints.
Next, for each core point, if it is not already assigned to a particular
cluster, a new cluster is created for it.
All the densely connected points related to the core point are found
and assigned to the same cluster. Two points are called densely
connected if there is a neighboring point that has both of them
within its epsilon distance.
Then all the points in the data are iterated over, and the points that do not
belong to any cluster are marked as noise.
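A minimal usage sketch with scikit-learn's DBSCAN implementation; the sample data and the eps and min_samples values are assumptions for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (assumed example data)
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1], [1.2, 0.9],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1], [5.2, 4.9],
              [9.0, 0.0]])

# eps is the epsilon radius; min_samples corresponds to minPoints
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # points in the same cluster share a label; noise points get -1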
BIRCH
BIRCH uses two main data structures to represent the clusters: the Clustering
Feature (CF) and the CF-tree. A CF summarizes the
statistical properties of a set of data points, while the CF-tree is a
height-balanced tree that represents the structure of the subclusters.
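In the standard BIRCH formulation, a CF is the triple (N, LS, SS): the number of points, their linear sum, and their squared sum. Two CFs can be merged by simple addition, which is what makes the summary incremental. A tiny sketch follows; the sample points are assumptions.

import numpy as np

def clustering_feature(points):
    # CF = (N, LS, SS): count, linear sum, and sum of squared norms
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), (points ** 2).sum()

def merge(cf1, cf2):
    # Two CFs are merged by adding their components
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

cf_a = clustering_feature([[1.0, 2.0], [2.0, 2.0]])
cf_b = clustering_feature([[3.0, 4.0]])
n, ls, ss = merge(cf_a, cf_b)
print(n, ls, ss)        # 3, [6. 8.], 38.0
centroid = ls / n       # statistics such as the centroid follow directly from the CF
print(centroid)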
CURE
Procedure (a code sketch of the shrinking step is given after the list):
1. Select a target sample number 'gfg'.
2. Choose 'gfg' well-scattered points in a cluster.
3. These scattered points are shrunk towards the centroid.
4. These points are used as representatives of the clusters and are used in
the 'Dmin' cluster-merging approach. In the Dmin (minimum distance)
cluster-merging approach, the minimum distance between the scattered points
inside the sample 'gfg' and the points outside the 'gfg' sample is calculated.
The point having the least distance to a scattered point inside the sample,
when compared to other points, is considered and merged into the sample.
5. After every such merging, new sample points are selected to represent the
new cluster.
6. Cluster merging continues until the target number of clusters, say 'k', is
reached.
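A minimal sketch of steps 2-3, picking well-scattered representative points and shrinking them towards the cluster centroid; the data, the number of representatives, the farthest-point heuristic, and the shrink factor alpha are assumptions for illustration.

import numpy as np

def cure_representatives(cluster_points, n_reps=4, alpha=0.3):
    cluster_points = np.asarray(cluster_points, dtype=float)
    centroid = cluster_points.mean(axis=0)
    # Step 2: greedily pick well-scattered points (farthest-point heuristic)
    reps = [cluster_points[np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))]]
    while len(reps) < min(n_reps, len(cluster_points)):
        dists = np.min(
            [np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(dists)])
    # Step 3: shrink each representative towards the centroid by a factor alpha
    return np.array([r + alpha * (centroid - r) for r in reps])

cluster = [[0.0, 0.0], [1.0, 0.2], [0.2, 1.0], [5.0, 5.0], [4.8, 5.2]]
print(cure_representatives(cluster))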