Fundamentals of Data Science Unit 3


CLUSTERING

Clustering in data mining is a technique that groups similar data points together based on their features and characteristics. It can also be described as the process of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).

Applications of Clustering in Data Mining

Clustering is a widely used technique in data mining and has numerous applications in various fields. Some of the common applications of clustering in data mining include:
 Customer Segmentation
Clustering techniques in data mining can be used to group
customers with similar behavior, preferences, and
purchasing patterns to create more targeted marketing
campaigns.
 Image Segmentation
Clustering techniques in data mining can be used to
segment images into different regions based on their pixel
values, which can be useful for tasks such as object
recognition and image compression.
 Anomaly Detection
Clustering techniques in data mining can be used to identify outliers or anomalies in datasets that deviate significantly from normal behavior.
 Text Mining
Clustering techniques in data mining can be used to group
documents or texts with similar content, which can be
useful for tasks such as document summarization and topic
modeling.
 Biological Data Analysis
Clustering techniques in data mining can be used to group
genes or proteins with similar characteristics or expression
patterns, which can be useful for tasks such as drug
discovery and disease diagnosis.
 Recommender Systems
Clustering techniques in data mining can be used to group
users with similar interests or behavior to create more
personalized recommendations for products or services.
Advantages of Cluster Analysis:
1. It can help identify patterns and relationships within a dataset that may not be immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
3. It can be used to reduce the dimensionality of the data.
4. It can be used for anomaly detection and outlier identification.
5. It can be used for market segmentation and customer profiling.
Disadvantages of Cluster Analysis:
1. It can be sensitive to the choice of initial conditions and the number of clusters.
2. It can be sensitive to the presence of noise or outliers in the data.
3. It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
4. It can be computationally expensive for large datasets.
5. The results of the analysis can be affected by the choice of clustering algorithm used.
6. It is important to note that the success of cluster analysis depends on the data, the goals of the analysis, and the ability of the analyst to interpret the results.

Clustering Methods:

The clustering methods can be classified into the following categories:
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method

1. PARTITIONING METHOD

Partitioning methods are a widely used family of clustering algorithms in data mining that aim to partition a dataset into K clusters. These algorithms attempt to group similar data points together while maximizing the differences between the clusters.

Partitioning methods work by iteratively refining the cluster centroids until convergence is reached. These algorithms are popular for their speed and scalability in handling large datasets.

The most widely used partitioning method is the K-means algorithm. It partitions a dataset into K clusters, where K is a user-defined parameter. Let's understand the K-means algorithm in more detail.
Algorithm
Here is a high-level overview of the algorithm to implement K-
means clustering:
1. Initialize K cluster centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids' coordinates by computing the
mean of the data points assigned to each cluster.
4. Repeat steps 2 and 3 until the cluster assignments no longer
change or a maximum number of iterations is reached.
5. Return the K clusters and their respective centroids.
The flow chart of the K-means algorithm is shown in the figure below.
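To make these steps concrete, here is a minimal NumPy sketch of the K-means loop, assuming centroids are initialized from randomly chosen data points; the function name kmeans and the tolerance parameter tol are illustrative choices, not part of the original notes.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by picking K random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each data point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # Step 5: return the K clusters and their respective centroids
    return labels, centroids
```

For example, calling labels, centroids = kmeans(X, k=3) on a data matrix X partitions its rows into three clusters.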

Advantages of K-Means
Here are some advantages of the K-means clustering algorithm:
 Scalability - K-means is a scalable algorithm that can
handle large datasets with high dimensionality. This is
because it only requires calculating the distances between
data points and their assigned cluster centroids.
 Speed - K-means is a relatively fast algorithm, making it
suitable for real-time or near-real-time applications. It can
handle datasets with millions of data points and converge to
a solution in a few iterations.
 Simplicity - K-means is a simple algorithm to implement
and understand. It only requires specifying the number of
clusters and the initial centroids, and it iteratively refines
the clusters' centroids until convergence.
 Interpretability - K-means provides interpretable results, as
the clusters' centroids represent the centre points of the
clusters. This makes it easy to interpret and understand the
clustering results.
2. HIERARCHICAL CLUSTERING METHOD:
Hierarchical clustering is a method of cluster analysis in data
mining that creates a hierarchical representation of the clusters
in a dataset.
The method starts by treating each data point as a separate
cluster and then iteratively combines the closest clusters until a
stopping criterion is reached.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster, and at every step merge the nearest pair of clusters (this is a bottom-up method). At every iteration, clusters merge with other clusters until only one cluster remains.
The algorithm for Agglomerative Hierarchical Clustering is:
 Consider every data point as an individual cluster.
 Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
 Merge the clusters that are highly similar or close to each other.
 Recalculate the proximity matrix for the newly formed clusters.
 Repeat steps 3 and 4 until only a single cluster remains.
Let’s say we have six data points A, B, C, D, E, and F.

 Step-1: Consider each letter as a single cluster and calculate the distance of each cluster from all the other clusters.
 Step-2: Comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly clusters (D) and (E) are merged. We are left with the clusters [(A), (BC), (DE), (F)].
 Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters, (DE) and (F), to form the new clusters [(A), (BC), (DEF)].
 Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
 Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
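The same bottom-up merging can be reproduced with SciPy's hierarchical clustering routines. In this sketch the 2-D coordinates for the points A-F are invented purely for illustration, and single linkage (merging the closest pair of clusters) is one of several possible proximity choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for the six points A-F (illustrative only)
points = np.array([[0.0, 8.0],   # A
                   [2.0, 2.0],   # B
                   [2.5, 2.2],   # C
                   [5.0, 3.0],   # D
                   [5.2, 3.1],   # E
                   [6.5, 3.5]])  # F
labels = list("ABCDEF")

# Bottom-up merging: each point starts as its own cluster, and the
# closest pair of clusters is merged at every step (single linkage).
Z = linkage(points, method="single")

# Each row of Z records one merge: (cluster i, cluster j, distance, size)
for i, j, dist, size in Z:
    print(f"merge clusters {int(i)} and {int(j)} at distance {dist:.2f}")

# Cut the hierarchy into, e.g., 3 flat clusters
flat = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(labels, flat)))
```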

2. Divisive Hierarchical Clustering

We can say that divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start with all of the data points as a single cluster, and in every iteration we split off the data points that are not comparable with the rest of their cluster. In the end, we are left with N clusters.

3. DENSITY BASED METHOD

Density-based clustering is one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. Clusters are treated as dense regions of points, and the data points in the low-density regions that separate two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object.
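DBSCAN is the best-known density-based algorithm and builds directly on the ε-neighborhood idea: points with at least min_samples neighbors within radius eps become core points, and clusters grow out from them while sparse points are labeled as noise. A minimal scikit-learn sketch on synthetic data (the eps and min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered noise points (synthetic data)
rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=[4, 4], scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=6, size=(10, 2))
X = np.vstack([blob1, blob2, noise])

# eps is the neighborhood radius; min_samples is the density threshold
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Label -1 marks points that DBSCAN considers noise
print("cluster labels found:", set(db.labels_))
print("number of noise points:", int(np.sum(db.labels_ == -1)))
```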
4. GRID BASED METHOD:

Grid-based clustering quantizes the object space into a finite number of cells that form a grid structure, and performs all of the clustering operations on this grid rather than on the individual data points. Its main advantage is fast processing time, which typically depends on the number of grid cells rather than on the number of data objects.
Steps to perform grid-based clustering:
1. Define a set of grid cells.
2. Assign objects to the appropriate grid cells and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold.
4. Form clusters from contiguous groups of dense cells.
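These four steps map almost directly onto array operations. Below is a minimal 2-D sketch, assuming a simple histogram grid and treating dense cells that touch as one cluster; the function name grid_cluster, the cell count, and the density threshold are illustrative choices.

```python
import numpy as np
from scipy import ndimage

def grid_cluster(points, n_cells=20, min_density=3):
    # Steps 1-2: define an n_cells x n_cells grid over the bounding box,
    # assign points to cells, and compute each cell's density (point count)
    density, xedges, yedges = np.histogram2d(
        points[:, 0], points[:, 1], bins=n_cells)

    # Step 3: eliminate cells whose density is below the threshold
    dense = density >= min_density

    # Step 4: contiguous dense cells form the clusters
    cell_labels, n_clusters = ndimage.label(dense)

    # Map each point back to the label of the cell it falls in
    ix = np.clip(np.digitize(points[:, 0], xedges) - 1, 0, n_cells - 1)
    iy = np.clip(np.digitize(points[:, 1], yedges) - 1, 0, n_cells - 1)
    point_labels = cell_labels[ix, iy]  # 0 means "in no dense cell" (noise)
    return point_labels, n_clusters
```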

EVALUATION OF CLUSTERING:
Three important factors by which clustering can be evaluated are (a) clustering tendency, (b) the number of clusters k, and (c) clustering quality.
1. Clustering tendency
Before evaluating clustering performance, it is very important to make sure that the data set we are working with has a clustering tendency and does not consist of uniformly distributed points. If the data has no clustering tendency, then the clusters identified by even state-of-the-art clustering algorithms may be irrelevant; a non-uniform distribution of points in the data set is what makes clustering meaningful.
To check this, the Hopkins test, a statistical test for the spatial randomness of a variable, can be used to measure the probability that the data points were generated by a uniform distribution.
Null Hypothesis (H0): The data points are generated by a uniform distribution (implying no meaningful clusters).
Alternate Hypothesis (Ha): The data points are not generated by a uniform distribution (implying the presence of clusters).

If H > 0.5, the null hypothesis can be rejected, and it is very likely that the data contains clusters. If H is closer to 0, the data set does not have a clustering tendency.
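For reference, here is a minimal sketch of one common formulation of the Hopkins statistic, H = Σu / (Σu + Σw), where the u values are nearest-neighbor distances from uniformly generated points to the real data and the w values are nearest-neighbor distances from sampled real points to the rest of the data. Conventions differ between textbooks, so treat this as one illustrative variant.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)          # sample size, commonly ~10% of the data

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distances from m uniform random points (in the bounding box) to X
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: distances from m sampled real points to their nearest other point
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]  # skip the point itself

    return u.sum() / (u.sum() + w.sum())
```

Under this formulation, H stays near 0.5 for spatially random data and moves toward 1 as the clustering tendency grows stronger.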
2. Optimal Number of Clusters, k
Some clustering algorithms, like K-means, require the number of clusters, k, as a clustering parameter. Getting the optimal number of clusters is very significant in the analysis. If k is too high, each point will broadly start representing a cluster of its own, and if k is too low, data points will be incorrectly clustered. Finding the optimal number of clusters gives the clustering the right granularity.

There is no definitive answer for finding the right number of clusters, as it depends upon (a) the distribution shape, (b) the scale of the data set, and (c) the clustering resolution required by the user. Although finding the number of clusters is a very subjective problem, there are two major approaches to finding the optimal number of clusters:
(1) Domain knowledge
(2) Data driven approach

Domain knowledge: Domain knowledge might give some prior indication of the number of clusters. For example, when clustering the iris data set, if we have prior knowledge of the species (setosa, virginica, versicolor), then k = 3. A k value driven by domain knowledge gives more relevant insights.

Data driven approach: If domain knowledge is not available, mathematical methods help in finding the right number of clusters; one such method is sketched below.
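One widely used data-driven technique is the elbow method: run K-means for a range of k values and track the within-cluster sum of squares (inertia); the k at which the curve stops improving sharply (the "elbow") is a reasonable choice. A short scikit-learn sketch on synthetic data, with an illustrative range of k values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters, for demonstration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Inertia = within-cluster sum of squared distances to the centroids
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; look for the k where the
# improvement sharply levels off (the "elbow"), here around k = 3.
for k, inertia in enumerate(inertias, start=1):
    print(f"k={k}: inertia={inertia:.1f}")
```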
3. Clustering quality
Once clustering is done, how well the clustering has performed can be quantified by a number of metrics. Ideal clustering is characterized by minimal intra-cluster distance and maximal inter-cluster distance.
There are two major types of measures to assess clustering performance:
(i) Extrinsic measures, which require ground truth labels. Examples are the Adjusted Rand Index, Fowlkes-Mallows scores, mutual-information-based scores, Homogeneity, Completeness, and V-measure.
(ii) Intrinsic measures, which do not require ground truth labels. Examples are the Silhouette Coefficient, the Calinski-Harabasz Index, the Davies-Bouldin Index, etc.
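As a quick illustration of an intrinsic measure, the Silhouette Coefficient can be computed directly with scikit-learn. It ranges from -1 to 1, with values near 1 indicating compact, well-separated clusters; the data set and choice of k below are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 true clusters (illustrative)
X, _ = make_blobs(n_samples=400, centers=4, random_state=7)

labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

# Intrinsic measure: no ground truth labels are needed, only the data
# and the cluster assignments produced by the algorithm.
print("silhouette score:", silhouette_score(X, labels))
```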
Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, and clustering algorithms have to deal with huge databases. To handle extensive databases, the clustering algorithm should be scalable; if it is not, it may produce wrong results on large data sets.
2. High Dimensionality: The algorithm should be able to handle high-dimensional spaces as well as data sets of small size.
3. Algorithm usability with multiple data kinds: Different kinds of data can be used with clustering algorithms. The algorithm should be capable of dealing with different types of data, like discrete, categorical, interval-based, and binary data.
4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may lead to poor quality clusters. So the algorithm should be able to handle unstructured data and give it some structure by organizing it into groups of similar data objects. This makes the job of the data expert easier when processing the data and discovering new patterns.
5. Interpretability: The clustering outcomes should be
interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.
