
Unit4

CLUSTERING

Introduction to Clusters:
Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data
points within each group are more similar to each other than to data points in other groups.
Clustering is an unsupervised machine learning technique that operates on unlabelled data.
Types of Clustering Methods
There are several clustering methods, each with its own advantages and applications:
❑ Partitioning Methods
❑ Hierarchical Methods
❑ Density-Based Methods
❑ Grid-Based Methods

Partitioning Methods
Partitioning clustering methods divide data into k distinct clusters, where each data point belongs to
exactly one cluster. The goal is to minimize intra-cluster distance (data points within a cluster should
be close to each other) and maximize inter-cluster distance (clusters should be well-separated).
Types of Partitioning Algorithms

• K-Means Clustering
• K-Medoids Clustering (PAM - Partitioning Around Medoids)
K-Means Clustering (the detailed algorithm is given in the class notes; refer to it)
• K-Means is a clustering algorithm used to group similar data points together. It works by
dividing a dataset into k clusters based on similarity.
Steps:
1. Select k initial centroids.
2. Assign each data point to the nearest centroid.
3. Compute new centroids by averaging points in each cluster.
4. Repeat until centroids no longer change.
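The four steps above can be sketched in plain Python (a minimal illustration on 2-D points; the function names and the toy data are my own, not from the notes):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means on 2-D points; returns (centroids, labels)."""
    random.seed(seed)
    centroids = random.sample(points, k)          # step 1: pick k initial centroids
    for _ in range(max_iter):
        # step 2: assign each point to the nearest centroid
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # step 4: stop when centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this converges in a handful of iterations, illustrating the "fast convergence with spherical clusters" advantage listed below.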
Advantages of K-Means Clustering:
• Simple & Easy to Implement – It is straightforward and widely used.
• Efficient & Scalable – Works well with large datasets.
• Fast Convergence – Typically converges in fewer iterations.
• Works Well with Spherical Clusters – Performs well when clusters are well-separated and
circular in shape.
Disadvantages of K-Means Clustering:
• Needs Predefined K – You must specify the number of clusters in advance.
• Sensitive to Outliers – Outliers can distort the cluster centers.
• Not Suitable for Arbitrary Shapes – Struggles with complex or non-spherical clusters.

K-Medoids Clustering
• A medoid is a point in the cluster whose total dissimilarity to all the other
points in the cluster is minimal. The dissimilarity between the medoid (Ci) and an object (Pi) is
calculated as E = |Pi – Ci|.
• The cost in the K-Medoids algorithm is the total dissimilarity summed over all clusters:
Cost = Σ (over clusters Ci) Σ (over points Pi in Ci) |Pi – Ci|
Advantages:
• Easy to understand and implement.
• Fast and converges in a finite number of steps.
• Less sensitive to outliers than K-Means, because medoids are actual data points rather than means.
Disadvantages:
• Not good for clustering irregularly shaped groups because it focuses on compactness rather
than connectivity.
• Can give different results each time it runs since the starting medoids are chosen randomly.
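A minimal sketch of the alternating k-medoids idea (an assignment step, then a medoid-update step that minimises the total dissimilarity E within each cluster). This is a simplified variant of PAM, not the full algorithm, and the helper names and toy data are illustrative:

```python
import random

def manhattan(a, b):
    """Dissimilarity E = |Pi - Ci|, taken coordinate-wise (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def kmedoids(points, k, max_iter=50, seed=0):
    """Minimal alternating k-medoids; returns the final medoids."""
    random.seed(seed)
    medoids = random.sample(points, k)            # random starting medoids
    for _ in range(max_iter):
        # assignment step: attach each point to its nearest medoid
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: manhattan(p, m))].append(p)
        # update step: in each cluster, pick the member minimising total dissimilarity
        new_medoids = [min(members,
                           key=lambda c: sum(manhattan(c, p) for p in members))
                       for members in clusters.values() if members]
        if set(new_medoids) == set(medoids):      # medoids stable: converged
            break
        medoids = new_medoids
    return medoids
```

Because the starting medoids are chosen randomly, different seeds can give different results, which is exactly the disadvantage noted above.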
Hierarchical Methods
Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical
representation of the clusters in a dataset. The method starts by treating each data point as a separate
cluster and then iteratively combines the closest clusters until a stopping criterion is reached.

Types of Hierarchical Clustering


Now that we understand the basics of hierarchical clustering, let’s explore the two main types of
hierarchical clustering.
Agglomerative Clustering
• It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC).
Unlike flat clustering hierarchical clustering provides a structured way to group data. Bottom-
up algorithms treat each data as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all clusters have been merged into a single cluster that
contains all data.

(You can define the diagram with your own words)
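The bottom-up merging can be sketched with single linkage, one common choice of inter-cluster distance (a minimal illustration; the helper name and toy data are my own):

```python
import math

def single_linkage(points, k):
    """Merge the two closest clusters until only k remain (bottom-up)."""
    clusters = [[p] for p in points]        # start: every point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(((a, b) for a in range(len(clusters))
                            for b in range(a + 1, len(clusters))),
                   key=lambda ab: min(math.dist(p, q)
                                      for p in clusters[ab[0]]
                                      for q in clusters[ab[1]]))
        clusters[i] += clusters.pop(j)      # merge the closest pair
    return clusters
```

Stopping at k = 1 would produce the full hierarchy; stopping earlier cuts the dendrogram at a chosen level.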

Divisive clustering
• It is also known as the top-down approach. This algorithm also does not require the number of
clusters to be specified in advance. Top-down clustering starts with a single cluster containing
the whole dataset and requires a method for splitting a cluster, which is applied recursively.

(You can define the diagram with your own words)


Grid-Based Methods
• Grid-based clustering is a technique that divides the data space into a finite number of grid cells
and then clusters data based on these cells. Instead of working directly with data points, it
processes groups of points within grid structures, making it efficient for large datasets.

Popular Grid-Based Clustering Algorithms:


❑ CLIQUE (Clustering In QUEst)
• Used for high-dimensional data clustering.
• Identifies dense regions in subspaces instead of the entire dataset.
• Useful for discovering patterns in datasets with many attributes.
❑ STING (STatistical INformation Grid)
• STING is a grid-based clustering technique. It uses a multiresolution grid data structure
that quantizes the space into a finite number of cells. Instead of focusing on the data points
themselves, it focuses on the value space surrounding the data points.

Difference between STING and CLIQUE
• STING works on the full data space, clustering via statistical summaries stored in a hierarchy of grid cells; CLIQUE combines grid-based and density-based ideas to find dense units in subspaces of high-dimensional data.
• STING is query-oriented and operates over multiple grid resolutions; CLIQUE is designed for subspace clustering in datasets with many attributes.

Density-Based Methods
• The density-based method considers points in dense regions to be more similar to each
other than to points in sparser regions. Density-based methods have good accuracy and the
ability to merge clusters.
Two common algorithms are DBSCAN and OPTICS.
DBSCAN (Refer class notes for algorithms)

• DBSCAN is a density-based clustering algorithm that groups data points that are closely
packed together and marks outliers as noise based on their density in the feature space.
• Unlike K-Means or hierarchical clustering, which assume clusters are compact and spherical,
DBSCAN excels in handling real-world data irregularities.

DBSCAN works by categorizing data points into three types:


• Core points, which have a sufficient number of neighbours within a specified radius (epsilon, ε)
• Border points, which lie near a core point but lack enough neighbours to be core points
themselves
• Noise points, which are neither core nor border points and do not belong to any cluster.

The DBSCAN algorithm concentrates on two parameters:


❖ ε (Epsilon) – Neighbourhood Radius:
ε defines the maximum distance within which a point is considered a neighbour of another point. Think
of it as a circle drawn around a point; any other point inside this circle is considered a neighbour.
❖ MinPts (Minimum Points) – Minimum Density Requirement:
MinPts is the minimum number of points that must lie within ε of a point for it to be considered
a core point and form a cluster.
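Putting the three point types and the two parameters together, DBSCAN can be sketched as follows (a minimal pure-Python illustration, not the formal algorithm from the class notes; names and toy data are my own):

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id 0, 1, ... or -1 for noise."""
    def neighbours(i):
        # all points within eps of point i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = {}
    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # provisionally noise (may become border)
            continue
        labels[i] = cluster_id             # i is a core point: start a new cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster_id     # noise reachable from a core point -> border
            if j in labels:
                continue
            labels[j] = cluster_id
            if len(neighbours(j)) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(neighbours(j))
        cluster_id += 1
    return [labels[i] for i in range(len(points))]
```

A far-away point gets label -1 (noise), while each dense blob becomes its own cluster, with no need to specify k in advance.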

Applications of DBSCAN
• Fraud Detection (Banking & Online Transactions):
Fraud Detection in Banking and Online Transactions involves using advanced technologies like
machine learning and data analytics to identify suspicious activities that may indicate fraud. In
the banking sector, millions of transactions happen daily, and it's crucial to monitor them in real
time to detect anomalies. Fraud detection systems are trained to recognize patterns in normal
user behaviour—such as regular transaction amounts, locations, and times—and flag any
activity that deviates from this pattern.
• Customer Grouping (Shopping & Marketing):
Customer Grouping in Shopping and Marketing is the process of segmenting customers into
different groups based on their behavior, preferences, and purchasing patterns. Businesses use
this technique to better understand their customers and tailor their marketing strategies
accordingly.
• Finding Popular Locations (GPS & Traffic Analysis):
Finding Popular Locations through GPS and Traffic Analysis involves using data collected from
mobile devices, vehicles, and navigation apps to identify areas with high user activity or traffic
congestion. This technique is widely used in apps like Google Maps and Uber to suggest
frequently visited places, detect traffic jams, and plan optimal routes. By analyzing patterns
such as the number of people visiting a location, the duration of their stay, and peak traffic
hours, systems can identify hotspots like shopping malls, tourist attractions, or accident-prone
zones.
• Social Media & Friends Grouping:

Social Media and Friends Grouping is a method used by platforms like Facebook,
Instagram, and LinkedIn to analyse user behaviour and connections to identify and
organize people into groups or communities. It uses algorithms to detect patterns,
suggest recommendations, and auto-tag friends, enhancing user engagement, content
recommendations, and targeted advertising.

• Medical Imaging (Detecting Tumors in Scans):


Medical Imaging (Detecting Tumors in Scans) is a powerful application of artificial intelligence
(AI) and machine learning in the healthcare industry. It involves analysing medical images such
as X-rays, CT scans, and MRIs. Traditionally, radiologists would manually examine these images,
but with the help of AI—particularly deep learning models like Convolutional Neural Networks
(CNNs)—this process has become faster, more accurate, and more efficient.
• Spam Detection (Emails & Messages):
Spam Detection (Emails & Messages) is a technique used to identify and filter out unwanted,
irrelevant, or potentially harmful content from emails and messages. With the rise of digital
communication, spam—such as promotional ads, phishing links, and fraudulent messages—
has become a major issue. Spam detection systems use machine learning and natural language
processing (NLP) to analyse the content, sender information, and patterns of messages to
determine whether they are spam or not.

What is OPTICS?
OPTICS (Ordering Points to Identify the Clustering Structure) is a clustering algorithm used to group
similar data points based on density. It is an improved version of DBSCAN and is better at finding
clusters of different densities.

Ex:
Imagine you have a group of dots (data points) on a paper. OPTICS helps you identify which dots
form clusters based on how close they are to each other.
Steps for Ordering Points in OPTICS
1. Pick a Random Point:
o Select a random unvisited point from the dataset.
2. Find Neighbours:
o Check how many points are within a given distance (ε) from this point.
o If the point has enough neighbours (MinPts), it is a core point.
3. Calculate Core & Reachability Distances:
o Compute the core distance (minimum distance to include MinPts points).
o Compute the reachability distance for each neighbour.
4. Expand Cluster in Order of Reachability Distance:
o The closest point (smallest reachability distance) is visited first.
o Repeat the process for this new point and its neighbours.
o This forms an ordered list of points, sorted by their density reachability.
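Steps 2 and 3 above rest on two distance definitions, which can be sketched as follows (a simplified illustration of the definitions, not a full OPTICS implementation; function names are my own):

```python
import math

def core_distance(points, i, eps, min_pts):
    """Distance from point i to its MinPts-th nearest neighbour, or None
    if i is not a core point (fewer than MinPts neighbours within eps)."""
    d = sorted(math.dist(points[i], p) for p in points)  # includes distance 0 to itself
    if len([x for x in d if x <= eps]) < min_pts:
        return None
    return d[min_pts - 1]

def reachability_distance(points, i, j, eps, min_pts):
    """reach-dist of j from core point i = max(core-dist(i), dist(i, j))."""
    cd = core_distance(points, i, eps, min_pts)
    if cd is None:
        return None                        # only defined from core points
    return max(cd, math.dist(points[i], points[j]))
```

The ordered list produced in step 4 is simply the points sorted by these reachability distances; plotting it (the reachability plot) reveals clusters of different densities as valleys.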

Applications of OPTICS
• Geospatial Data – Finding dense areas in maps (e.g., detecting crowded places).
• Anomaly Detection – Identifying unusual patterns in finance and fraud detection.
• Social Networks – Grouping users based on interaction levels.
• Biological Data – Identifying clusters in DNA sequences or medical data.
• Customer Segmentation – Finding customer groups based on shopping behaviour.

Explanation:
Geospatial Data – Geospatial Data refers to location-based information collected through GPS,
satellites, drones, and sensors. It is used to map and analyze physical spaces and patterns. This data
helps in urban planning, traffic management, disaster response, retail planning, and environmental
monitoring. In urban planning, it identifies population clusters and infrastructure needs. In traffic
systems, it helps optimize routes and signals. During disasters, it aids in damage assessment and rescue
planning. Businesses use it to choose store locations and target customers. In environmental studies, it
tracks deforestation, pollution, and wildlife.

Anomaly Detection – Fraud Detection uses data analysis and machine learning to identify unusual
or suspicious patterns in financial transactions. It helps detect abnormal spending behaviour,
unauthorized access, fake identities, and stock market manipulation. This technology is widely used in
banking, e-commerce, cybersecurity, and insurance to prevent financial loss, protect user data, and
reduce risks. Advanced systems monitor transactions, helping organizations take quick action against
cyber threats, money laundering, and scams.

Social Networks – Social Media & Friends Grouping involves analysing user interactions, likes,
shares, and connections to group individuals with similar behaviours or interests. These groupings help
in identifying online communities, delivering targeted advertisements, and understanding emerging
social trends. Platforms like Facebook, Instagram, and Twitter use this approach to recommend friends,
pages, or content to users based on their activity patterns.

Biological Data – Medical Imaging & Biological Data Analysis involves identifying patterns or
clusters in DNA sequences, gene expressions, and protein structures. This helps in detecting genetic
diseases, discovering new drugs, and understanding the evolutionary relationships between organisms.
It's widely used in fields like genomics, bioinformatics, and medical research to develop targeted
treatments, understand biological functions, and improve healthcare outcomes through personalized
medicine.
Customer Segmentation – Customer Segmentation involves grouping customers based on their
shopping behavior, purchase history, and preferences. This helps businesses understand different
customer types and tailor their strategies accordingly. It enables targeted marketing, personalized
product recommendations, and customized ads, improving customer experience and increasing sales.
This approach is widely used in retail, e-commerce, and Customer Relationship Management (CRM)
to boost engagement and customer loyalty.

Difference Between OPTICS and DBSCAN:
• DBSCAN uses a single, fixed ε, so it struggles when clusters have very different densities; OPTICS orders points by reachability distance and can find clusters of varying density.
• DBSCAN directly outputs clusters and noise; OPTICS outputs an ordering of points (a reachability plot) from which clusterings at many density levels can be extracted.

Evaluation of clustering:
Clustering evaluation helps assess how well a clustering algorithm has grouped data points. Since
clustering is unsupervised, we use different evaluation methods based on internal, external, and
relative measures.
➢ Internal Evaluation (Based on Data Structure)
This method checks how well clusters are formed without using any true labels. It looks at how
compact (tight) and well-separated the clusters are.
Common Metrics:
• Silhouette Score – Measures how well each data point fits in its cluster (higher is better).
• Davies-Bouldin Index (DBI) – Measures how similar clusters are to each other (lower is
better).
• Dunn Index – Compares the distance between clusters and the spread within clusters
(higher is better).
Example: Imagine you group customers based on shopping habits, but you don’t know their actual
categories. If the clusters are tight and well-separated, the Silhouette Score is high and DBI is low,
meaning the clustering is good.
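The Silhouette Score can be sketched directly from its definition, s = (b − a) / max(a, b) per point, averaged over all points (a minimal illustration, not the library implementation; names and toy data are my own):

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette: a = mean intra-cluster distance (cohesion),
    b = mean distance to the nearest other cluster (separation)."""
    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for p, l in zip(points, labels):
        own = [q for q, m in zip(points, labels) if m == l and q != p]
        if not own:
            scores.append(0.0)             # singleton cluster: silhouette is 0
            continue
        a = mean_dist(p, own)
        b = min(mean_dist(p, [q for q, m in zip(points, labels) if m == other])
                for other in set(labels) if other != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Tight, well-separated clusters score close to +1; a bad assignment that mixes the clusters drives the score toward −1.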
➢ External Evaluation (Comparing with True Labels)
This method checks how well clustering matches predefined labels (if available). It compares the
clustering result with the actual categories.

Common Metrics:
• Rand Index (RI) – Measures how much the clusters match real labels.
• Adjusted Rand Index (ARI) – Same as RI, but removes the effect of random chance.
• Normalized Mutual Information (NMI) – Measures how much information clusters share
with actual labels.
• Purity Score – Measures how many points in a cluster belong to a single correct category.
Example: Suppose you have customer labels like "Frequent Buyers" and "Occasional Shoppers." If
the clustering groups them perfectly, ARI and NMI will be high, showing strong agreement with the
true labels.
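Of these external metrics, the Purity Score is the simplest to compute: assign each cluster its majority true label and count how many points agree with it (a minimal sketch; names and the toy data are illustrative):

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of points belonging to the majority true label of their cluster."""
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)       # group true labels by cluster
    # count, per cluster, how many points carry the cluster's majority label
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(true_labels)
```

With the shopper example: if one cluster holds two "Frequent" and one "Occasional" buyer and the other holds three "Occasional" buyers, purity is (2 + 3) / 6 ≈ 0.83.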

Note:
Difference between Supervised and Unsupervised Learning
Supervised learning trains on labelled data to predict known outputs (e.g., classification, regression), whereas unsupervised learning, such as clustering, discovers structure in unlabelled data without any target labels.
