IV Unit DM
A cluster is a group of objects that belong to the same class. In other words, similar objects
are grouped in one cluster and dissimilar objects are grouped in other clusters.
What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of n objects and the partitioning method constructs k
partitions of the data. Each partition represents a cluster and k ≤ n. This means that it
classifies the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create an
initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by
moving objects from one group to another.
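The iterative relocation idea can be sketched with k-means, one of the typical partitioning
methods. This is a minimal sketch rather than a definitive implementation: the function name
kmeans, the choice of the first k points as the initial partitioning, and the sample points are
all assumptions made for illustration.

```python
import math

def kmeans(points, k, max_iter=100):
    """Partition `points` (a list of tuples) into k groups by iterative relocation."""
    # Initial partitioning (assumption): use the first k points as centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Relocation step: move each object to the group with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no object changed group, so stop
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its group.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a group became empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical sample data: two well-separated groups of three points each.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
```

Every object ends up in exactly one group, and the loop stops as soon as a full pass relocates
nothing. The result depends on the initial partitioning, so a different starting choice can
give different groups.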
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each object forming a
separate group. The method keeps merging the objects or groups that are close to one
another, and continues until all of the groups are merged into one or until the
termination condition holds.
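The bottom-up merging loop can be sketched as follows. The notes do not say how closeness
between groups is measured, so this sketch assumes single linkage (the distance between the two
closest members); the function name agglomerative, the target_clusters stopping rule, and the
sample points are likewise illustrative assumptions.

```python
import math

def agglomerative(points, target_clusters=1):
    """Bottom-up clustering: merge the two closest groups until `target_clusters` remain."""
    # Start with each object forming its own group (groups hold point indices).
    clusters = [[i] for i in range(len(points))]

    def single_link(a, b):
        # Assumed closeness measure: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []  # record of merges, in the order they happened
    while len(clusters) > target_clusters:  # termination condition
        # Find the pair of groups that are closest to one another.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, merges = agglomerative(points, target_clusters=3)
```

A merge is never revisited once it has been recorded in merges, which is the rigidity that
hierarchical methods are known for.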
Divisive Approach
This approach is also known as the top-down approach. We start with all of the objects
in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This
is done until each object is in its own cluster or the termination condition holds. Hierarchical
clustering is rigid, i.e., once a merge or split is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration with other clustering techniques by first using a
hierarchical agglomerative algorithm to group objects into micro-clusters, and then
performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a
given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each
data point within a given cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.
Grid-based Method
In this method, the object space is quantized into a finite number of cells that form a grid
structure, and clustering is performed on the grid.
Advantages
The major advantage of this method is fast processing time.
Processing time depends only on the number of cells in each dimension of the
quantized space, not on the number of data objects.
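The quantization step can be sketched as below: each object is mapped to a cell, and later
steps only look at per-cell counts. This is a sketch under assumptions: the function name
grid_cell_counts, the uniform cell_size, the "at least 2 points" density rule, and the sample
points are all made up for illustration.

```python
from collections import Counter

def grid_cell_counts(points, cell_size):
    """Map each point to its grid cell and count the points per cell."""
    counts = Counter()
    for p in points:
        # Quantize each coordinate: the cell index is floor(coordinate / cell_size).
        cell = tuple(int(x // cell_size) for x in p)
        counts[cell] += 1
    return counts

# Hypothetical sample data in a 2-D object space.
points = [(0.2, 0.3), (0.7, 0.9), (0.4, 0.1), (5.1, 5.6), (5.9, 5.2), (9.5, 0.5)]
counts = grid_cell_counts(points, cell_size=1.0)

# Assumed rule: a cell counts as dense when it holds at least 2 points.
dense_cells = sorted(cell for cell, n in counts.items() if n >= 2)
```

After the single pass that fills counts, the work is done on cells rather than on individual
objects, which is where the speed advantage comes from.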
Model-based methods
In this method, a model is hypothesized for each cluster, and the method finds the best fit of
the data to the given model. It locates the clusters by constructing a density function that
reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outliers or noise into account. It therefore yields robust clustering
methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of desired
clustering results. Constraints provide us with an interactive way of communication with the
clustering process. Constraints can be specified by the user or the application requirement.
Types of Data Used in Cluster Analysis
Interval-Scaled variables
Binary variables
Nominal, Ordinal, and Ratio variables
Variables of mixed types
Data Matrix
This represents n objects, such as persons, with p variables (also called measurements or
attributes), such as age, height, weight, gender, race and so on.
The structure is in the form of a relational table, or n-by-p matrix (n objects x p variables).
Binary Variables
A binary variable is a variable that can take only 2 values.
For example, a gender variable is often recorded with 2 values, male and female.
Let p = a + b + c + d
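The notes do not define a, b, c and d. A common reading, assumed here, is that they are the
cells of a 2-by-2 contingency table for two objects described by p binary variables: a counts
variables that are 1 for both objects, b those that are 1 for the first and 0 for the second,
c those that are 0 for the first and 1 for the second, and d those that are 0 for both. Under
that assumption, the sketch below computes the four counts and the simple matching
dissimilarity (b + c) / p. The function names and the sample vectors are illustrative.

```python
def binary_contingency(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def simple_matching_dissimilarity(x, y):
    """Fraction of the p variables on which the two objects disagree."""
    a, b, c, d = binary_contingency(x, y)
    p = a + b + c + d  # total number of binary variables
    return (b + c) / p

# Hypothetical objects described by six binary variables.
x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 1, 0]
a, b, c, d = binary_contingency(x, y)
dissim = simple_matching_dissimilarity(x, y)
```

Here the two objects disagree on 2 of the 6 variables, so the dissimilarity is 2/6.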
Ordinal Variables
An ordinal variable can be discrete or continuous.
Ratio-Scaled Variables
Ratio-scaled variable: It is a positive measurement
on a nonlinear scale, approximately at an exponential
scale, such as Ae^(Bt) or Ae^(-Bt).
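Methods:
First option: treat them like interval-scaled variables.
This is not a good choice, because the scale can be distorted.
Second option: apply a logarithmic transformation, i.e. y = log(x),
and then treat y as an interval-scaled variable.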
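The effect of the logarithmic transformation can be seen in a short sketch. The function name
log_transform and the sample measurements are assumptions for illustration; the only step taken
from the notes is y = log(x).

```python
import math

def log_transform(values):
    """Apply y = log(x) to positive ratio-scaled measurements."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(v) for v in values]

# Hypothetical measurements that grow by a factor of 10 at each step.
x = [1.0, 10.0, 100.0, 1000.0]
y = log_transform(x)

# Gaps between consecutive transformed values.
gaps = [b - a for a, b in zip(y, y[1:])]
```

On the raw scale the gaps are 9, 90 and 900, so the largest values dominate any distance
computation. After the transformation every gap equals log(10), which is why the transformed
values are better suited to interval-scaled treatment.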
Clustering Methods
2. Partitioning Method
Suppose we are given a database of n objects. The partitioning method constructs
k partitions of the data.
Each partition represents a cluster and k ≤ n. This means that it classifies the
data into k groups, which satisfy the following requirements: each group contains
at least one object, and each object belongs to exactly one group.
Typical methods:
k-means, k-medoids, CLARANS
3. Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data
objects. There are two approaches:
o Agglomerative Approach
o Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each
object forming a separate group. The method keeps merging the objects or groups
that are close to one another, and continues until all of the groups are merged
into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. We start with all of
the objects in the same cluster. In each successive iteration, a cluster is split up
into smaller clusters. This is done until each object is in its own cluster or the
termination condition holds.
Disadvantage
This method is rigid, i.e. once a merge or split is done, it can never be undone.
BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
Major ideas
o Use links to measure similarity/proximity
o Not distance-based
o Computational complexity:
A two-phase algorithm
4. Density-based Method
Clustering based on density (local cluster criterion), such as
density-connected points
Major features:
o Discover clusters of arbitrary shape
o Handle noise
o One scan
o Need density parameters as termination condition
Two parameters:
o Eps: Maximum radius of the neighbourhood
o MinPts: Minimum number of points in an Eps-neighbourhood of that
point
Continue the process until all of the points have been processed.
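How the two parameters drive the process can be sketched as follows. The notes do not name an
algorithm; the Eps and MinPts parameters match the well-known DBSCAN procedure, so this sketch
follows that style as an assumption. The function name density_cluster, the NOISE marker and
the sample points are illustrative, and the neighbourhood search is a plain scan over all
points.

```python
import math

NOISE = -1

def density_cluster(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbours(i):
        # Eps-neighbourhood of point i (the point itself is included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not processed yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is dense too, so keep growing
                queue.extend(nbrs)
        cluster_id += 1
    return labels

# Hypothetical sample data: a group of four, a group of three, one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = density_cluster(points, eps=1.5, min_pts=3)
```

The isolated point at (50, 50) has too few neighbours within eps and stays labelled as noise,
while the two dense groups grow into separate clusters.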
5. Grid-based Method
Uses a multi-resolution grid data structure
Advantage
o The major advantage of this method is fast processing time.
o It is dependent only on the number of cells in each dimension in the
quantized space.
6. Model-based Methods
· Attempt to optimize the fit between the given data and some
mathematical model
· Based on the assumption: Data are generated by a mixture of
underlying probability distributions
o In this method a model is hypothesized for each cluster, and the method finds
the best fit of the data to the given model.
o General idea
Starts with an initial estimate of the parameter vector
o Involves a hierarchical architecture of several units (neurons)
o Neurons compete in a "winner-takes-all" fashion for the object currently
being presented
· The winner and its neighbors learn by having their weights adjusted
o SOMs (self-organizing maps) are believed to resemble processing that can occur in the brain
o Useful for visualizing high-dimensional data in 2- or 3-D space
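The winner-takes-all competition and the weight adjustment can be sketched with a small
one-dimensional map. This is only a sketch under assumptions: the function names train_som and
best_unit, the random initial weights, the linearly decaying learning rate, the fixed
neighbourhood radius and the sample data are all illustrative choices, not details taken from
the notes.

```python
import math
import random

def train_som(data, n_units, epochs=50, lr=0.5, radius=1, seed=0):
    """Train a 1-D self-organizing map; returns one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumption: weights start as random values in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # assumed linear decay of the learning rate
        for x in data:
            # Competition: the unit closest to the presented object wins.
            winner = min(range(n_units), key=lambda u: math.dist(weights[u], x))
            # Learning: the winner and its map neighbours move towards x.
            for u in range(max(0, winner - radius), min(n_units, winner + radius + 1)):
                weights[u] = [w + rate * (xi - w) for w, xi in zip(weights[u], x)]
    return weights

def best_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda u: math.dist(weights[u], x))

# Hypothetical 2-D data scaled to the unit square.
data = [(0.1, 0.1), (0.15, 0.05), (0.9, 0.9), (0.85, 0.95)]
weights = train_som(data, n_units=4)
unit_of_first = best_unit(weights, data[0])
```

Each object is then represented by the index of its best unit, which is what makes the map
usable as a low-dimensional view of the data.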
7. Constraint-based Method
o Clustering by considering user-specified or application-specific
constraints
o Proposed approach