
Unit4

CLUSTERING

Introduction to Clusters:
Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data
points within each group are more similar to each other than to data points in other groups.
Clustering is an unsupervised machine learning technique that operates on unlabelled data.
Types of Clustering Methods
There are several clustering methods, each with its own advantages and applications:
❑ Partitioning Methods
❑ Hierarchical Methods
❑ Density-Based Methods
❑ Grid-Based Methods

Partitioning Methods
Partitioning clustering methods divide data into k distinct clusters, where each data point belongs to
exactly one cluster. The goal is to minimize intra-cluster distance (data points within a cluster should
be close to each other) and maximize inter-cluster distance (clusters should be well-separated).
Types of Partitioning Algorithms

• K-Means Clustering
• K-Medoids Clustering (PAM - Partitioning Around Medoids)
K-Means Clustering (the detailed algorithm is given in the class notes; refer to it)
• K-Means is a clustering algorithm used to group similar data points together. It works by
dividing a dataset into k clusters based on similarity.
Steps:
1. Select k initial centroids.
2. Assign each data point to the nearest centroid.
3. Compute new centroids by averaging points in each cluster.
4. Repeat until centroids no longer change.
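The four steps above can be sketched in plain Python (a minimal illustration on 2-D points; the function names and the toy data are my own, not from the notes):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means on 2-D points; returns (centroids, labels)."""
    random.seed(seed)
    centroids = random.sample(points, k)          # step 1: pick k initial centroids
    for _ in range(max_iter):
        # step 2: assign each point to the nearest centroid
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # step 4: stop when centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this converges in a handful of iterations, illustrating the "fast convergence with spherical clusters" advantage listed below.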
Advantages of K-Means Clustering:
• Simple & Easy to Implement – It is straightforward and widely used.
• Efficient & Scalable – Works well with large datasets.
• Fast Convergence – Typically converges in fewer iterations.
• Works Well with Spherical Clusters – Performs well when clusters are well-separated and
circular in shape.
Disadvantages of K-Means Clustering:
• Needs Predefined K – You must specify the number of clusters in advance.
• Sensitive to Outliers – Outliers can distort the cluster centers.
• Not Suitable for Arbitrary Shapes – Struggles with complex or non-spherical clusters.

K-Medoids Clustering
• A medoid is a point in the cluster whose total dissimilarity to all the other
points in the cluster is minimal. The dissimilarity between the medoid (Ci) and an object (Pi) is
calculated as E = |Pi – Ci|.
• The cost in the K-Medoids algorithm is the total dissimilarity summed over all clusters:
Cost = Σ (over clusters Ci) Σ (over points Pi in Ci) |Pi – Ci|
Advantages:
• Easy to understand and implement.
• Fast and converges in a finite number of steps.
• Less sensitive to outliers than K-Means, because medoids are actual data points rather than means.
Disadvantages:
• Not good for clustering irregularly shaped groups because it focuses on compactness rather
than connectivity.
• Can give different results each time it runs since the starting medoids are chosen randomly.
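A minimal sketch of the alternating k-medoids idea (an assignment step, then a medoid-update step that minimises the total dissimilarity E within each cluster). This is a simplified variant of PAM, not the full algorithm, and the helper names and toy data are illustrative:

```python
import random

def manhattan(a, b):
    """Dissimilarity E = |Pi - Ci|, taken coordinate-wise (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def kmedoids(points, k, max_iter=50, seed=0):
    """Minimal alternating k-medoids; returns the final medoids."""
    random.seed(seed)
    medoids = random.sample(points, k)            # random starting medoids
    for _ in range(max_iter):
        # assignment step: attach each point to its nearest medoid
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: manhattan(p, m))].append(p)
        # update step: in each cluster, pick the member minimising total dissimilarity
        new_medoids = [min(members,
                           key=lambda c: sum(manhattan(c, p) for p in members))
                       for members in clusters.values() if members]
        if set(new_medoids) == set(medoids):      # medoids stable: converged
            break
        medoids = new_medoids
    return medoids
```

Because the starting medoids are chosen randomly, different seeds can give different results, which is exactly the disadvantage noted above.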
Hierarchical Methods
Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical
representation of the clusters in a dataset. The method starts by treating each data point as a separate
cluster and then iteratively combines the closest clusters until a stopping criterion is reached.

Types of Hierarchical Clustering


Now that we understand the basics of hierarchical clustering, let’s explore the two main types of
hierarchical clustering.
Agglomerative Clustering
• It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
Unlike flat clustering hierarchical clustering provides a structured way to group data. Bottom-
up algorithms treat each data as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all clusters have been merged into a single cluster that
contains all data.

(You can define the diagram with your own words)
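The bottom-up merging can be sketched with single linkage, one common choice of inter-cluster distance (a minimal illustration; the helper name and toy data are my own):

```python
import math

def single_linkage(points, k):
    """Merge the two closest clusters until only k remain (bottom-up)."""
    clusters = [[p] for p in points]        # start: every point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(((a, b) for a in range(len(clusters))
                            for b in range(a + 1, len(clusters))),
                   key=lambda ab: min(math.dist(p, q)
                                      for p in clusters[ab[0]]
                                      for q in clusters[ab[1]]))
        clusters[i] += clusters.pop(j)      # merge the closest pair
    return clusters
```

Stopping at k = 1 would produce the full hierarchy; stopping earlier cuts the dendrogram at a chosen level.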

Divisive clustering
• It is also known as the top-down approach. This algorithm also does not require the number of
clusters to be specified in advance. Top-down clustering starts with a single cluster containing
the whole dataset and requires a method for splitting a cluster, which is applied recursively.

(You can define the diagram with your own words)


Grid-Based Methods
• Grid-based clustering is a technique that divides the data space into a finite number of grid cells
and then clusters data based on these cells. Instead of working directly with data points, it
processes groups of points within grid structures, making it efficient for large datasets.

Popular Grid-Based Clustering Algorithms:


❑ CLIQUE (Clustering In QUEst)
• Used for high-dimensional data clustering.
• Identifies dense regions in subspaces instead of the entire dataset.
• Useful for discovering patterns in datasets with many attributes.
❑ STING (STatistical INformation Grid)
• STING is a grid-based clustering technique. It uses a multiresolution grid data structure
that quantizes the space into a finite number of cells. Instead of focusing on the data points
themselves, it focuses on the value space surrounding the data points.

Difference between STING and CLIQUE
• STING works on the full data space, clustering via statistical summaries stored in a hierarchy of grid cells; CLIQUE combines grid-based and density-based ideas to find dense units in subspaces of high-dimensional data.
• STING is query-oriented and operates over multiple grid resolutions; CLIQUE is designed for subspace clustering in datasets with many attributes.

Density-Based Methods
• The density-based method considers points in dense regions to be more similar to each
other than to points in sparser regions. Density-based methods have good accuracy and the
ability to merge clusters.
Two common algorithms are DBSCAN and OPTICS.
DBSCAN (Refer class notes for algorithms)

• DBSCAN is a density-based clustering algorithm that groups data points that are closely
packed together and marks outliers as noise based on their density in the feature space.
• Unlike K-Means or hierarchical clustering, which assume clusters are compact and spherical,
DBSCAN excels in handling real-world data irregularities.

DBSCAN works by categorizing data points into three types:


• Core points, which have a sufficient number of neighbours within a specified radius (epsilon, ε)
• Border points, which lie near a core point but lack enough neighbours to be core points
themselves
• Noise points, which are neither core nor border points and do not belong to any cluster.

The DBSCAN algorithm concentrates on two parameters:


❖ ε (Epsilon) – Neighbourhood Radius:
ε defines the maximum distance within which a point is considered a neighbour of another point. Think
of it as a circle drawn around a point; any other point inside this circle is considered a neighbour.
❖ MinPts (Minimum Points) – Minimum Density Requirement:
MinPts is the minimum number of points that must lie within ε of a point for it to be considered
a core point and form a cluster.
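Putting the three point types and the two parameters together, DBSCAN can be sketched as follows (a minimal pure-Python illustration, not the formal algorithm from the class notes; names and toy data are my own):

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id 0, 1, ... or -1 for noise."""
    def neighbours(i):
        # all points within eps of point i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = {}
    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # provisionally noise (may become border)
            continue
        labels[i] = cluster_id             # i is a core point: start a new cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id     # noise reachable from a core point -> border
            if j in labels:
                continue
            labels[j] = cluster_id
            if len(neighbours(j)) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(neighbours(j))
        cluster_id += 1
    return [labels[i] for i in range(len(points))]
```

A far-away point gets label -1 (noise), while each dense blob becomes its own cluster, with no need to specify k in advance.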

Applications of DBSCAN
• Fraud Detection (Banking & Online Transactions):
Fraud Detection in Banking and Online Transactions involves using advanced technologies like
machine learning and data analytics to identify suspicious activities that may indicate fraud. In
the banking sector, millions of transactions happen daily, and it's crucial to monitor them in real
time to detect anomalies. Fraud detection systems are trained to recognize patterns in normal
user behaviour—such as regular transaction amounts, locations, and times—and flag any
activity that deviates from this pattern.
• Customer Grouping (Shopping & Marketing):
Customer Grouping in Shopping and Marketing is the process of segmenting customers into
different groups based on their behavior, preferences, and purchasing patterns. Businesses use
this technique to better understand their customers and tailor their marketing strategies
accordingly.
• Finding Popular Locations (GPS & Traffic Analysis):
Finding Popular Locations through GPS and Traffic Analysis involves using data collected from
mobile devices, vehicles, and navigation apps to identify areas with high user activity or traffic
congestion. This technique is widely used in apps like Google Maps and Uber to suggest
frequently visited places, detect traffic jams, and plan optimal routes. By analyzing patterns
such as the number of people visiting a location, the duration of their stay, and peak traffic
hours, systems can identify hotspots like shopping malls, tourist attractions, or accident-prone
zones.
• Social Media & Friends Grouping:

Social Media and Friends Grouping is a method used by platforms like Facebook,
Instagram, and LinkedIn to analyse user behaviour and connections to identify and
organize people into groups or communities. It uses algorithms to detect patterns,
suggest recommendations, and auto-tag friends, enhancing user engagement, content
recommendations, and targeted advertising.

• Medical Imaging (Detecting Tumors in Scans):


Medical Imaging (Detecting Tumors in Scans) is a powerful application of artificial intelligence
(AI) and machine learning in the healthcare industry. It involves analysing medical images such
as X-rays, CT scans, and MRIs. Traditionally, radiologists would manually examine these images,
but with the help of AI—particularly deep learning models like Convolutional Neural Networks
(CNNs)—this process has become faster, more accurate, and more efficient.
• Spam Detection (Emails & Messages):
Spam Detection (Emails & Messages) is a technique used to identify and filter out unwanted,
irrelevant, or potentially harmful content from emails and messages. With the rise of digital
communication, spam—such as promotional ads, phishing links, and fraudulent messages—
has become a major issue. Spam detection systems use machine learning and natural language
processing (NLP) to analyse the content, sender information, and patterns of messages to
determine whether they are spam or not.

What is OPTICS?
OPTICS (Ordering Points to Identify the Clustering Structure) is a clustering algorithm used to group
similar data points based on density. It is an improved version of DBSCAN and is better at finding
clusters of different densities.

Ex:
Imagine you have a group of dots (data points) on a paper. OPTICS helps you identify which dots
form clusters based on how close they are to each other.
Steps for Ordering Points in OPTICS
1. Pick a Random Point:
o Select a random unvisited point from the dataset.
2. Find Neighbours:
o Check how many points are within a given distance (ε) from this point.
o If the point has enough neighbours (MinPts), it is a core point.
3. Calculate Core & Reachability Distances:
o Compute the core distance (minimum distance to include MinPts points).
o Compute the reachability distance for each neighbour.
4. Expand Cluster in Order of Reachability Distance:
o The closest point (smallest reachability distance) is visited first.
o Repeat the process for this new point and its neighbours.
o This forms an ordered list of points, sorted by their density reachability.
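Steps 2 and 3 above rest on two distance definitions, which can be sketched as follows (a simplified illustration of the definitions, not a full OPTICS implementation; function names are my own):

```python
import math

def core_distance(points, i, eps, min_pts):
    """Distance from point i to its MinPts-th nearest neighbour, or None
    if i is not a core point (fewer than MinPts neighbours within eps)."""
    d = sorted(math.dist(points[i], p) for p in points)  # includes distance 0 to itself
    if len([x for x in d if x <= eps]) < min_pts:
        return None
    return d[min_pts - 1]

def reachability_distance(points, i, j, eps, min_pts):
    """reach-dist of j from core point i = max(core-dist(i), dist(i, j))."""
    cd = core_distance(points, i, eps, min_pts)
    if cd is None:
        return None                        # only defined from core points
    return max(cd, math.dist(points[i], points[j]))
```

The ordered list produced in step 4 is simply the points sorted by these reachability distances; plotting it (the reachability plot) reveals clusters of different densities as valleys.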

Applications of OPTICS
• Geospatial Data – Finding dense areas in maps (e.g., detecting crowded places).
• Anomaly Detection – Identifying unusual patterns in finance and fraud detection.
• Social Networks – Grouping users based on interaction levels.
• Biological Data – Identifying clusters in DNA sequences or medical data.
• Customer Segmentation – Finding customer groups based on shopping behaviour.

Explanation:
Geospatial Data – Geospatial Data refers to location-based information collected through GPS,
satellites, drones, and sensors. It is used to map and analyze physical spaces and patterns. This data
helps in urban planning, traffic management, disaster response, retail planning, and environmental
monitoring. In urban planning, it identifies population clusters and infrastructure needs. In traffic
systems, it helps optimize routes and signals. During disasters, it aids in damage assessment and rescue
planning. Businesses use it to choose store locations and target customers. In environmental studies, it
tracks deforestation, pollution, and wildlife.

Anomaly Detection – Fraud Detection uses data analysis and machine learning to identify unusual
or suspicious patterns in financial transactions. It helps detect abnormal spending behaviour,
unauthorized access, fake identities, and stock market manipulation. This technology is widely used in
banking, e-commerce, cybersecurity, and insurance to prevent financial loss, protect user data, and
reduce risks. Advanced systems monitor transactions, helping organizations take quick action against
cyber threats, money laundering, and scams.

Social Networks – Social Media & Friends Grouping involves analysing user interactions, likes,
shares, and connections to group individuals with similar behaviours or interests. These groupings help
in identifying online communities, delivering targeted advertisements, and understanding emerging
social trends. Platforms like Facebook, Instagram, and Twitter use this approach to recommend friends,
pages, or content to users based on their activity patterns.

Biological Data – Medical Imaging & Biological Data Analysis involves identifying patterns or
clusters in DNA sequences, gene expressions, and protein structures. This helps in detecting genetic
diseases, discovering new drugs, and understanding the evolutionary relationships between organisms.
It's widely used in fields like genomics, bioinformatics, and medical research to develop targeted
treatments, understand biological functions, and improve healthcare outcomes through personalized
medicine.
Customer Segmentation – Customer Segmentation involves grouping customers based on their
shopping behavior, purchase history, and preferences. This helps businesses understand different
customer types and tailor their strategies accordingly. It enables targeted marketing, personalized
product recommendations, and customized ads, improving customer experience and increasing sales.
This approach is widely used in retail, e-commerce, and Customer Relationship Management (CRM)
to boost engagement and customer loyalty.

Difference Between OPTICS and DBSCAN:
• DBSCAN uses a single, fixed ε, so it struggles when clusters have very different densities; OPTICS orders points by reachability distance and can find clusters of varying density.
• DBSCAN directly outputs clusters and noise; OPTICS outputs an ordering of points (a reachability plot) from which clusterings at many density levels can be extracted.

Evaluation of clustering:
Clustering evaluation helps assess how well a clustering algorithm has grouped data points. Since
clustering is unsupervised, we use different evaluation methods based on internal, external, and
relative measures.
➢ Internal Evaluation (Based on Data Structure)
This method checks how well clusters are formed without using any true labels. It looks at how
compact (tight) and well-separated the clusters are.
Common Metrics:
• Silhouette Score – Measures how well each data point fits in its cluster (higher is better).
• Davies-Bouldin Index (DBI) – Measures how similar clusters are to each other (lower is
better).
• Dunn Index – Compares the distance between clusters and the spread within clusters
(higher is better).
Example: Imagine you group customers based on shopping habits, but you don’t know their actual
categories. If the clusters are tight and well-separated, the Silhouette Score is high and DBI is low,
meaning the clustering is good.
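The Silhouette Score can be sketched directly from its definition, s = (b − a) / max(a, b) per point, averaged over all points (a minimal illustration, not the library implementation; names and toy data are my own):

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette: a = mean intra-cluster distance (cohesion),
    b = mean distance to the nearest other cluster (separation)."""
    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for p, l in zip(points, labels):
        own = [q for q, m in zip(points, labels) if m == l and q != p]
        if not own:
            scores.append(0.0)             # singleton cluster: silhouette is 0
            continue
        a = mean_dist(p, own)
        b = min(mean_dist(p, [q for q, m in zip(points, labels) if m == other])
                for other in set(labels) if other != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Tight, well-separated clusters score close to +1; a bad assignment that mixes the clusters drives the score toward −1.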
➢ External Evaluation (Comparing with True Labels)
This method checks how well clustering matches predefined labels (if available). It compares the
clustering result with the actual categories.

Common Metrics:
• Rand Index (RI) – Measures how much the clusters match real labels.
• Adjusted Rand Index (ARI) – Same as RI, but removes the effect of random chance.
• Normalized Mutual Information (NMI) – Measures how much information clusters share
with actual labels.
• Purity Score – Measures how many points in a cluster belong to a single correct category.
Example: Suppose you have customer labels like "Frequent Buyers" and "Occasional Shoppers." If
the clustering groups them perfectly, ARI and NMI will be high, showing strong agreement with the
true labels.
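Of these external metrics, the Purity Score is the simplest to compute: assign each cluster its majority true label and count how many points agree with it (a minimal sketch; names and the toy data are illustrative):

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of points belonging to the majority true label of their cluster."""
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)       # group true labels by cluster
    # count, per cluster, how many points carry the cluster's majority label
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(true_labels)
```

With the shopper example: if one cluster holds two "Frequent" and one "Occasional" buyer and the other holds three "Occasional" buyers, purity is (2 + 3) / 6 ≈ 0.83.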

Note:
Difference between Supervised and Unsupervised Learning
Supervised learning trains on labelled data to predict known outputs (e.g., classification, regression), whereas unsupervised learning, such as clustering, discovers structure in unlabelled data without any target labels.
