Fundamentals of Data Science Unit 3
Clustering Methods:
1. PARTITIONING METHOD
Partitioning methods divide a data set into k non-overlapping clusters. K-means, which assigns every point to the nearest cluster centroid, is the most widely used partitioning algorithm.
Advantages of K-Means
Here are some advantages of the K-means clustering algorithm:
Scalability - K-means is a scalable algorithm that can
handle large datasets with high dimensionality, because
each iteration only requires computing the distances
between the data points and the k cluster centroids.
Speed - K-means is a relatively fast algorithm, making it
suitable for real-time or near-real-time applications. It can
handle datasets with millions of data points and typically
converges to a solution in a few iterations.
Simplicity - K-means is a simple algorithm to implement
and understand. It only requires specifying the number of
clusters and the initial centroids, and it iteratively refines
the cluster centroids until convergence.
Interpretability - K-means provides interpretable results, as
the cluster centroids represent the centre points of the
clusters. This makes the clustering results easy to interpret
and understand.
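
As a minimal sketch of K-means in practice (the toy data and parameter values below are illustrative assumptions, not part of the course material, using scikit-learn):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy 2-D data: two loose groups of points
    X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                  [8.0, 8.0], [8.5, 7.8], [7.8, 8.3]])

    # k = 2 clusters; n_init random initialisations guard against
    # a bad choice of starting centroids
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)

    print("Cluster labels:", labels)           # e.g. [0 0 0 1 1 1]
    print("Centroids:\n", kmeans.cluster_centers_)

The fitted cluster_centers_ are exactly the interpretable centre points described above.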
2. HIERARCHICAL CLUSTERING METHOD:
Hierarchical clustering is a method of cluster analysis in data
mining that creates a hierarchical representation of the clusters
in a dataset.
In its agglomerative form, the method starts by treating each
data point as a separate cluster and then iteratively merges the
closest clusters until a stopping criterion is reached.
Types of Hierarchical Clustering
There are two types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive Clustering
1. Agglomerative Clustering
Initially, every data point is considered an individual cluster,
and at every step the nearest pair of clusters is merged (it is a
bottom-up method). The merging continues at every iteration
until only one cluster remains.
The algorithm for Agglomerative Hierarchical Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other
clusters (compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each
other.
4. Recalculate the proximity matrix for the merged clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Let’s say we have six data points A, B, C, D, E, and F.
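
As a rough illustration of these steps on such a six-point example (the 2-D coordinates, the choice of single linkage, and the cut into three clusters below are all made-up assumptions, using SciPy):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Hypothetical 2-D coordinates for the six points A..F
    points = {"A": (1.0, 1.0), "B": (1.5, 1.2), "C": (5.0, 5.0),
              "D": (5.5, 5.3), "E": (9.0, 1.0), "F": (9.2, 1.4)}
    X = np.array(list(points.values()))

    # Single linkage merges the closest pair of clusters at each
    # step, following the bottom-up procedure described above
    Z = linkage(X, method="single")

    # Cut the resulting hierarchy into 3 flat clusters
    labels = fcluster(Z, t=3, criterion="maxclust")
    for name, lab in zip(points, labels):
        print(name, "-> cluster", lab)

The linkage matrix Z records each merge in order, and scipy.cluster.hierarchy.dendrogram can draw it as a tree.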
EVALUATION OF CLUSTERING:
Three important factors by which clustering can be
evaluated are:
(a) Clustering tendency
(b) Number of clusters, k
(c) Clustering quality
1. Clustering tendency
Before evaluating clustering performance, it is very
important to make sure that the data set we are working
with has a clustering tendency and does not consist of
uniformly distributed points.
If the data has no clustering tendency, the clusters
identified by even state-of-the-art clustering algorithms
may be irrelevant.
A non-uniform distribution of points in the data set is
therefore a prerequisite for meaningful clustering.
To check this, the Hopkins test, a statistical test for the
spatial randomness of a variable, can be used to measure
the probability that the data points were generated by a
uniform distribution.
Null Hypothesis (Ho): The data points are generated by a
uniform distribution (implying no meaningful clusters).
Alternate Hypothesis (Ha): The data points are not generated
by a uniform distribution (implying the presence of clusters).
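
A rough sketch of computing the Hopkins statistic behind this test (the helper name hopkins_statistic and the 10% sampling choice are assumptions for illustration, using NumPy and scikit-learn). Under one common convention, H = Σu / (Σu + Σw), where each u is the distance from a uniformly generated point to its nearest real data point and each w is the distance from a sampled real point to its nearest neighbour; H near 0.5 supports Ho (uniform data), while H near 1 supports Ha (clustering tendency):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def hopkins_statistic(X, sample_size=None, seed=0):
        # H near 0.5 -> uniform data; H near 1 -> clustering tendency
        rng = np.random.default_rng(seed)
        n, d = X.shape
        m = sample_size or max(1, n // 10)   # sample roughly 10% of the points

        nn = NearestNeighbors(n_neighbors=2).fit(X)

        # u: distances from m uniformly generated points to the nearest real point
        uniform_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
        u = nn.kneighbors(uniform_pts, n_neighbors=1)[0].ravel()

        # w: distances from m sampled real points to their nearest neighbour
        # (column 1 skips the zero distance of each point to itself)
        idx = rng.choice(n, size=m, replace=False)
        w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

        return u.sum() / (u.sum() + w.sum())

    # Uniformly random data should give H around 0.5
    X = np.random.default_rng(0).uniform(size=(200, 2))
    print(hopkins_statistic(X))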