Fundamentals of Data Science Unit 3


CLUSTERING

Clustering in data mining is a technique that groups similar data points together based on their features and characteristics. It can also be described as the process of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).

Applications of Clustering in Data Mining

Clustering is a widely used technique in data mining and has numerous applications in various fields. Some of the common applications of clustering in data mining include:
 Customer Segmentation
Clustering techniques in data mining can be used to group
customers with similar behavior, preferences, and
purchasing patterns to create more targeted marketing
campaigns.
 Image Segmentation
Clustering techniques in data mining can be used to
segment images into different regions based on their pixel
values, which can be useful for tasks such as object
recognition and image compression.
 Anomaly Detection
Clustering techniques in data mining can be used to identify outliers or anomalies in datasets that deviate significantly from normal behavior.
 Text Mining
Clustering techniques in data mining can be used to group
documents or texts with similar content, which can be
useful for tasks such as document summarization and topic
modeling.
 Biological Data Analysis
Clustering techniques in data mining can be used to group
genes or proteins with similar characteristics or expression
patterns, which can be useful for tasks such as drug
discovery and disease diagnosis.
 Recommender Systems
Clustering techniques in data mining can be used to group
users with similar interests or behavior to create more
personalized recommendations for products or services.
Advantages of Cluster Analysis:
1. It can help identify patterns and relationships within a dataset that may not be immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
3. It can be used to reduce the dimensionality of the data.
4. It can be used for anomaly detection and outlier identification.
5. It can be used for market segmentation and customer profiling.
Disadvantages of Cluster Analysis:
1. It can be sensitive to the choice of initial conditions and the number of clusters.
2. It can be sensitive to the presence of noise or outliers in the data.
3. It can be difficult to interpret the results of the analysis if the clusters are not well-defined.
4. It can be computationally expensive for large datasets.
5. The results of the analysis can be affected by the choice of clustering algorithm used.
6. It is important to note that the success of cluster analysis depends on the data, the goals of the analysis, and the ability of the analyst to interpret the results.

Clustering Methods:

The clustering methods can be classified into the following categories:
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method

1. PARTITIONING METHOD

Partitioning methods are a widely used family of clustering algorithms in data mining that aim to partition a dataset into K clusters. These algorithms attempt to group similar data points together while maximizing the differences between the clusters.

Partitioning methods work by iteratively refining the cluster centroids until convergence is reached. These algorithms are popular for their speed and scalability in handling large datasets.

The most widely used partitioning method is the K-means algorithm. It partitions a dataset into K clusters, where K is a user-defined parameter. Let's understand the K-means algorithm in more detail.
Algorithm
Here is a high-level overview of the algorithm to implement K-
means clustering:
1. Initialize K cluster centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids' coordinates by computing the
mean of the data points assigned to each cluster.
4. Repeat steps 2 and 3 until the cluster assignments no longer
change or a maximum number of iterations is reached.
5. Return the K clusters and their respective centroids.
The flow chart of the K-means algorithm is shown in the figure below.
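To make these steps concrete, here is a minimal NumPy sketch of the K-means loop, assuming centroids are initialized from randomly chosen data points; the function name kmeans and the tolerance parameter tol are illustrative choices, not part of the original notes.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by picking K random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each data point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # Step 5: return the K clusters and their respective centroids
    return labels, centroids
```

For example, calling labels, centroids = kmeans(X, k=3) on a data matrix X partitions its rows into three clusters.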

Advantages of K-Means
Here are some advantages of the K-means clustering algorithm:
 Scalability - K-means is a scalable algorithm that can
handle large datasets with high dimensionality. This is
because it only requires calculating the distances between
data points and their assigned cluster centroids.
 Speed - K-means is a relatively fast algorithm, making it
suitable for real-time or near-real-time applications. It can
handle datasets with millions of data points and converge to
a solution in a few iterations.
 Simplicity - K-means is a simple algorithm to implement
and understand. It only requires specifying the number of
clusters and the initial centroids, and it iteratively refines
the clusters' centroids until convergence.
 Interpretability - K-means provides interpretable results, as
the clusters' centroids represent the centre points of the
clusters. This makes it easy to interpret and understand the
clustering results.
2. HIERARCHICAL CLUSTERING METHOD:
Hierarchical clustering is a method of cluster analysis in data
mining that creates a hierarchical representation of the clusters
in a dataset.
The method starts by treating each data point as a separate
cluster and then iteratively combines the closest clusters until a
stopping criterion is reached.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster, and at every step merge the nearest pair of clusters (this is a bottom-up method). At every iteration, clusters merge with other clusters until only one cluster remains.
The algorithm for Agglomerative Hierarchical Clustering is:
 Consider every data point as an individual cluster.
 Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
 Merge the clusters that are highly similar or close to each other.
 Recalculate the proximity matrix for the newly formed clusters.
 Repeat steps 3 and 4 until only a single cluster remains.
Let’s say we have six data points A, B, C, D, E, and F.

 Step-1: Consider each letter as a single cluster and calculate the distance of each cluster from all the other clusters.
 Step-2: Comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly clusters (D) and (E) are merged. We are left with the clusters [(A), (BC), (DE), (F)].
 Step-3: We recalculate the proximity according to the algorithm and merge the two nearest clusters, (DE) and (F), to form the new clusters [(A), (BC), (DEF)].
 Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged together to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
 Step-5: At last, the two remaining clusters are merged together to form a single cluster [(ABCDEF)].
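The same bottom-up merging can be reproduced with SciPy's hierarchical clustering routines. In this sketch the 2-D coordinates for the points A-F are invented purely for illustration, and single linkage (merging the closest pair of clusters) is one of several possible proximity choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for the six points A-F (illustrative only)
points = np.array([[0.0, 8.0],   # A
                   [2.0, 2.0],   # B
                   [2.5, 2.2],   # C
                   [5.0, 3.0],   # D
                   [5.2, 3.1],   # E
                   [6.5, 3.5]])  # F
labels = list("ABCDEF")

# Bottom-up merging: each point starts as its own cluster, and the
# closest pair of clusters is merged at every step (single linkage).
Z = linkage(points, method="single")

# Each row of Z records one merge: (cluster i, cluster j, distance, size)
for i, j, dist, size in Z:
    print(f"merge clusters {int(i)} and {int(j)} at distance {dist:.2f}")

# Cut the hierarchy into, e.g., 3 flat clusters
flat = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(labels, flat)))
```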

2. Divisive Hierarchical Clustering

We can say that divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start with all of the data points as a single cluster, and in every iteration we split off the data points that are not comparable with the rest of their cluster. In the end, we are left with N clusters.

3. DENSITY BASED METHOD

Density-based clustering is one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. Clusters are treated as dense regions of points, and the data points in the low-density regions that separate two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object.
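DBSCAN is the best-known density-based algorithm and builds directly on the ε-neighborhood idea: points with at least min_samples neighbors within radius eps become core points, and clusters grow out from them while sparse points are labeled as noise. A minimal scikit-learn sketch on synthetic data (the eps and min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered noise points (synthetic data)
rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=[4, 4], scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=6, size=(10, 2))
X = np.vstack([blob1, blob2, noise])

# eps is the neighborhood radius; min_samples is the density threshold
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Label -1 marks points that DBSCAN considers noise
print("cluster labels found:", set(db.labels_))
print("number of noise points:", int(np.sum(db.labels_ == -1)))
```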
4. GRID BASED METHOD:

Grid-based clustering quantizes the object space into a finite number of cells that form a grid structure, and performs all of the clustering operations on this grid rather than on the individual data points. Its main advantage is fast processing time, which typically depends on the number of grid cells rather than on the number of data objects.
Steps to perform grid-based clustering:
1. Define a set of grid cells.
2. Assign objects to the appropriate grid cells and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold.
4. Form clusters from contiguous groups of dense cells.
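These four steps map almost directly onto array operations. Below is a minimal 2-D sketch, assuming a simple histogram grid and treating dense cells that touch as one cluster; the function name grid_cluster, the cell count, and the density threshold are illustrative choices.

```python
import numpy as np
from scipy import ndimage

def grid_cluster(points, n_cells=20, min_density=3):
    # Steps 1-2: define an n_cells x n_cells grid over the bounding box,
    # assign points to cells, and compute each cell's density (point count)
    density, xedges, yedges = np.histogram2d(
        points[:, 0], points[:, 1], bins=n_cells)

    # Step 3: eliminate cells whose density is below the threshold
    dense = density >= min_density

    # Step 4: contiguous dense cells form the clusters
    cell_labels, n_clusters = ndimage.label(dense)

    # Map each point back to the label of the cell it falls in
    ix = np.clip(np.digitize(points[:, 0], xedges) - 1, 0, n_cells - 1)
    iy = np.clip(np.digitize(points[:, 1], yedges) - 1, 0, n_cells - 1)
    point_labels = cell_labels[ix, iy]  # 0 means "in no dense cell" (noise)
    return point_labels, n_clusters
```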

EVALUATION OF CLUSTERING:
Three important factors by which clustering can be evaluated are (a) clustering tendency, (b) the number of clusters k, and (c) clustering quality.
1. Clustering tendency
Before evaluating clustering performance, it is very important to make sure that the data set we are working with has a clustering tendency and does not consist of uniformly distributed points. If the data has no clustering tendency, then the clusters identified by even state-of-the-art clustering algorithms may be irrelevant; a non-uniform distribution of points in the data set is what makes clustering meaningful.
To check this, the Hopkins test, a statistical test for the spatial randomness of a variable, can be used to measure the probability that the data points were generated by a uniform distribution.
Null Hypothesis (H0): The data points are generated by a uniform distribution (implying no meaningful clusters).
Alternate Hypothesis (Ha): The data points are not generated by a uniform distribution (implying the presence of clusters).

If H > 0.5, the null hypothesis can be rejected, and it is very likely that the data contains clusters. If H is closer to 0, the data set does not have a clustering tendency.
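For reference, here is a minimal sketch of one common formulation of the Hopkins statistic, H = Σu / (Σu + Σw), where the u values are nearest-neighbor distances from uniformly generated points to the real data and the w values are nearest-neighbor distances from sampled real points to the rest of the data. Conventions differ between textbooks, so treat this as one illustrative variant.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)          # sample size, commonly ~10% of the data

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distances from m uniform random points (in the bounding box) to X
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: distances from m sampled real points to their nearest other point
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]  # skip the point itself

    return u.sum() / (u.sum() + w.sum())
```

Under this formulation, H stays near 0.5 for spatially random data and moves toward 1 as the clustering tendency grows stronger.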
2. Optimal Number of Clusters, k
Some clustering algorithms, like K-means, require the number of clusters, k, as a clustering parameter. Getting the optimal number of clusters is very significant in the analysis. If k is too high, each point will broadly start representing a cluster of its own, and if k is too low, data points will be incorrectly clustered. Finding the optimal number of clusters gives the clustering the right granularity.

There is no definitive answer for finding the right number of clusters, as it depends upon (a) the distribution shape, (b) the scale of the data set, and (c) the clustering resolution required by the user. Although finding the number of clusters is a very subjective problem, there are two major approaches to finding the optimal number of clusters:
(1) Domain knowledge
(2) Data driven approach

Domain knowledge: Domain knowledge might give some prior indication of the number of clusters. For example, when clustering the iris data set, if we have prior knowledge of the species (setosa, virginica, versicolor), then k = 3. A k value driven by domain knowledge gives more relevant insights.

Data driven approach: If domain knowledge is not available, mathematical methods help in finding the right number of clusters; one such method is sketched below.
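One widely used data-driven technique is the elbow method: run K-means for a range of k values and track the within-cluster sum of squares (inertia); the k at which the curve stops improving sharply (the "elbow") is a reasonable choice. A short scikit-learn sketch on synthetic data, with an illustrative range of k values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters, for demonstration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Inertia = within-cluster sum of squared distances to the centroids
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; look for the k where the
# improvement sharply levels off (the "elbow"), here around k = 3.
for k, inertia in enumerate(inertias, start=1):
    print(f"k={k}: inertia={inertia:.1f}")
```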
3. Clustering quality
Once clustering is done, how well the clustering has performed can be quantified by a number of metrics. Ideal clustering is characterized by minimal intra-cluster distance and maximal inter-cluster distance.
There are two major types of measures to assess clustering performance:
(i) Extrinsic measures, which require ground truth labels. Examples are the Adjusted Rand Index, Fowlkes-Mallows scores, mutual-information-based scores, Homogeneity, Completeness, and V-measure.
(ii) Intrinsic measures, which do not require ground truth labels. Examples are the Silhouette Coefficient, the Calinski-Harabasz Index, the Davies-Bouldin Index, etc.
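As a quick illustration of an intrinsic measure, the Silhouette Coefficient can be computed directly with scikit-learn. It ranges from -1 to 1, with values near 1 indicating compact, well-separated clusters; the data set and choice of k below are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 true clusters (illustrative)
X, _ = make_blobs(n_samples=400, centers=4, random_state=7)

labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

# Intrinsic measure: no ground truth labels are needed, only the data
# and the cluster assignments produced by the algorithm.
print("silhouette score:", silhouette_score(X, labels))
```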
Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, and clustering algorithms have to deal with huge databases. To handle extensive databases, the clustering algorithm should be scalable; if it is not, it may produce wrong results on large data sets.
2. High Dimensionality: The algorithm should be able to handle high-dimensional spaces as well as data sets of small size.
3. Algorithm usability with multiple data kinds: Different kinds of data can be used with clustering algorithms. The algorithm should be capable of dealing with different types of data, like discrete, categorical, interval-based, and binary data.
4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may lead to poor quality clusters. So the algorithm should be able to handle unstructured data and give it some structure by organizing it into groups of similar data objects. This makes the job of the data expert easier when processing the data and discovering new patterns.
5. Interpretability: The clustering outcomes should be
interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.
