DBSCAN AND OPTICS

DENSITY-BASED ALGORITHMS

1. DBSCAN ALGORITHM
2. OPTICS ALGORITHM

For each algorithm: About, Key Concepts, Steps, Implementation, Advantages, Disadvantages.
INTRODUCTION TO DBSCAN ALGORITHM

Clustering analysis is an unsupervised learning method that organizes data points into groups based on their similarities. This technique is particularly useful for identifying patterns and structures within datasets without prior labeling. Key clustering methods include K-Means, which partitions data into a predefined number of clusters by minimizing the variance within each group, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which groups together closely packed points and identifies outliers in sparse regions.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that groups together points that are close to each other based on a distance measure and a minimum number of points within a given neighborhood. It is particularly well suited for datasets with clusters of varying shapes and sizes and is robust to noise. The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Key Concepts

Density: DBSCAN defines clusters as dense regions of points separated by areas of lower density. It identifies regions where the number of points exceeds a certain threshold: DBSCAN looks at how many points lie within a specified distance (ε) of any given point, and a region containing a sufficient number of points is considered dense.

Core Points: A point is considered a core point if it has at least a specified number (MinPts) of other points within a given radius (ε, epsilon). Core points are crucial for forming clusters: when a core point is identified, it expands the cluster by including all of its neighboring points that are either core points or border points.

Border Points: These points lie within the ε radius of a core point but do not have enough neighbors to be core points themselves. They belong to a cluster but are not dense enough to form one.

Noise Points: Points that are neither core nor border points are classified as noise. These points do not belong to any cluster and are considered outliers; they are essentially isolated points that fall outside the influence of the dense regions defined by core points.
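These definitions can be made concrete with a short sketch. The following is a minimal illustration (not part of the original slides), assuming NumPy is available and following the slides' convention that MinPts counts neighbors other than the point itself:

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' per the definitions above."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Count neighbors within eps, excluding the point itself.
    neighbor_counts = (dists <= eps).sum(axis=1) - 1
    is_core = neighbor_counts >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] and dists[i, j] <= eps for j in range(n) if j != i):
            labels.append("border")   # within eps of some core point
        else:
            labels.append("noise")
    return labels

print(classify_points([(1, 2), (2, 2), (2, 3), (3, 3), (8, 7), (8, 8), (25, 80)],
                      eps=1.5, min_pts=2))
# Expected: ['core', 'core', 'core', 'core', 'noise', 'noise', 'noise']
```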
When to Use DBSCAN vs. K-Means in Clustering

Use DBSCAN when:
• Clusters are irregularly shaped
• Data has noise or outliers
• The number of clusters is unknown
• Clusters vary in density

Use K-Means when:
• Clusters are roughly spherical
• You know the number of clusters
• Data is free of significant noise or outliers
• Clusters are similar in density

(A small comparison sketch follows these lists.)
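As a rough illustration of these guidelines (assuming scikit-learn is installed; the dataset and parameter values are only illustrative), the sketch below runs both algorithms on crescent-shaped data, where K-Means' spherical assumption breaks down but DBSCAN's density criterion does not:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: irregularly shaped, non-spherical clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-Means tends to split each moon roughly in half; DBSCAN tends to recover
# the two moons (the eps/min_samples values here may need tuning).
print("K-Means cluster sizes:", [list(kmeans_labels).count(c) for c in set(kmeans_labels)])
print("DBSCAN cluster sizes: ", [list(dbscan_labels).count(c) for c in set(dbscan_labels)])
```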
Steps

Choose parameters:
Set ε (epsilon) as the radius that defines the neighborhood around a point, and minPts as the minimum number of points required to form a dense region (a common rule is to set minPts to at least the dimensionality of the dataset plus one).

Label core points:
For each point, count how many other points fall within its ε radius. If the count is greater than or equal to minPts, mark it as a core point, indicating that the point lies in a dense region.

Form clusters:
Start with an arbitrary core point and retrieve all points within its ε neighborhood. If any neighboring points are core points, recursively retrieve their neighbors until no more core points can be found. All points reachable from the starting core point form a cluster.

Label border points:
Identify any point that is within the ε radius of a core point but does not qualify as a core point itself. These points are included in the cluster but do not contribute to expanding it.

Identify noise:
Classify points that are neither core nor border points as noise. These points do not belong to any cluster and are considered outliers.
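The steps above map fairly directly onto a compact implementation. The sketch below is a simplified, brute-force version for illustration only (function and variable names are my own, not from the slides):

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one cluster id per point, -1 meaning noise.
    min_pts follows the slides' convention: other points within the eps radius."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Label core points: enough other points inside the eps-ball.
    neighbors = [set(np.flatnonzero(dists[i] <= eps)) - {i} for i in range(n)]
    is_core = [len(neighbors[i]) >= min_pts for i in range(n)]

    labels = [-1] * n          # -1 = noise until proven otherwise
    cluster_id = -1
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Form a new cluster by expanding from an unvisited core point.
        cluster_id += 1
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id    # core or border point joins the cluster
                    if is_core[q]:
                        frontier.append(q)    # only core points expand the cluster further
    return labels
```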
Example:

Consider the following 2D points:
• (1, 2), (2, 2), (2, 3), (3, 3)
• (8, 7), (8, 8), (25, 80)

Parameters:
• Epsilon (ε): 1.5 (radius for the neighborhood search)
• minPts: 2 (minimum number of neighboring points to form a cluster)

DBSCAN Process:

1. Identify core points. For each point, count how many other points are within ε:
• (1, 2): neighbors (2, 2), (2, 3) → 2 neighbors (core point)
• (2, 2): neighbors (1, 2), (2, 3), (3, 3) → 3 neighbors (core point)
• (2, 3): neighbors (1, 2), (2, 2), (3, 3) → 3 neighbors (core point)
• (3, 3): neighbors (2, 2), (2, 3) → 2 neighbors (core point)
• (8, 7): 1 neighbor (not a core point)
• (8, 8): 1 neighbor (not a core point)
• (25, 80): 0 neighbors (not a core point)

2. Core points: (1, 2), (2, 2), (2, 3), (3, 3)

3. Form clusters. Start with (1, 2):
• Its neighbors (2, 2) and (2, 3) are included in the cluster.
• Expanding from the core points (2, 2) and (2, 3) adds (3, 3); everything else reachable is already included.

4. Cluster 1: {(1, 2), (2, 2), (2, 3), (3, 3)}

5. Process the remaining points:
• (8, 7): not a core point and not within ε of one → classify as noise.
• (8, 8): not a core point and not within ε of one → classify as noise.
• (25, 80): not a core point → classify as noise.

Final Clusters and Noise:
• Cluster 1: {(1, 2), (2, 2), (2, 3), (3, 3)}
• Noise points: (8, 7), (8, 8), (25, 80)
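As a hedged cross-check of this worked example using scikit-learn (if it is available): note that scikit-learn's min_samples counts the point itself, so the slides' minPts of 2 other points corresponds to min_samples=3 here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 2), (2, 2), (2, 3), (3, 3), (8, 7), (8, 8), (25, 80)], dtype=float)

# min_samples includes the point itself: 2 neighbors + the point = 3.
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)
print(labels)   # expected: [ 0  0  0  0 -1 -1 -1]  (-1 marks noise)
```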
Advantages

• Unlike K-Means, which requires specifying the number of clusters beforehand, DBSCAN automatically determines the number of clusters based on data density, making it suitable for exploratory data analysis.

• DBSCAN can find clusters of varying shapes and sizes. Its density-based approach allows it to discover complex cluster structures that K-Means might miss, as it relies on proximity rather than distances to centroids.

• DBSCAN effectively identifies outliers, classifying them as noise. This allows analysts to focus on meaningful data patterns without being influenced by anomalies.

• When combined with spatial indexing methods like KD-trees, DBSCAN is efficient for handling large datasets, optimizing the process of finding neighbors within the ε radius and reducing computational complexity (see the sketch after this list).

• DBSCAN is less sensitive to outliers than K-Means, as it explicitly classifies points as noise, enhancing the quality of clustering results in messy datasets.

• While it requires tuning of the parameters ε and minPts, these can be adjusted based on the dataset's characteristics, allowing for optimal clustering results.
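To illustrate the spatial-indexing point above, here is a sketch assuming scikit-learn: its DBSCAN estimator accepts an algorithm argument that selects the neighbor-search structure (the default 'auto' usually chooses a suitable index on its own, so this setting is shown only for illustration).

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# A moderately large, low-dimensional dataset, where a KD-tree index pays off.
X, _ = make_blobs(n_samples=20_000, centers=5, cluster_std=0.6, random_state=0)

# algorithm="kd_tree" makes the eps-neighborhood queries use a KD-tree index
# instead of brute-force pairwise distance computations.
labels = DBSCAN(eps=0.5, min_samples=5, algorithm="kd_tree").fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```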
Disadvantages

• The choice of ε (epsilon) and minPts (minimum points) significantly affects clustering results. Selecting these parameters is often challenging; too small an ε can classify many points as noise, while too large a value can merge distinct clusters (see the sketch after this list).

• DBSCAN struggles with clusters of varying densities. It may incorrectly merge clusters of different densities, leading to inaccurate classifications and loss of meaningful patterns.

• DBSCAN can require substantial memory for large datasets, especially dense ones. Maintaining lists of points and their neighbors can be resource-intensive, making it less suitable for very large datasets.

• The effectiveness of DBSCAN decreases in high-dimensional spaces due to the "curse of dimensionality": distances become less meaningful, making it hard to define dense regions accurately.

• Unlike some clustering algorithms with usable default settings, DBSCAN requires careful tuning of ε and minPts, often through trial and error, complicating its use for less experienced users.

• A high minPts value can overlook small clusters, as these may not meet the density requirement, limiting the algorithm's ability to identify all relevant patterns in the data.
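A small sketch of the parameter-sensitivity point in the first bullet (assuming scikit-learn; the ε values are purely illustrative): the same data is clustered with three different ε values to show how a value that is too small produces mostly noise, while one that is too large merges everything.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three well-separated blobs.
X, _ = make_blobs(n_samples=600, centers=[(0, 0), (5, 5), (10, 0)],
                  cluster_std=0.5, random_state=0)

for eps in (0.05, 0.5, 8.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps:>4}: {n_clusters} clusters, {n_noise} noise points")

# Expected pattern: eps=0.05 -> almost everything is noise, eps=0.5 -> 3 clusters,
# eps=8.0 -> the blobs merge into a single cluster.
```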
OPTICS ALGORITHM

Introduction

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm designed to identify clusters of varying densities in complex datasets. It orders data points to reveal their density-based clustering structure, rather than assigning strict cluster memberships as in traditional methods.

Unlike traditional clustering algorithms that require predefined cluster shapes and densities (such as K-Means), OPTICS is flexible and can discover clusters of different sizes and shapes, which is especially useful when clusters are not well separated.

OPTICS extends the DBSCAN algorithm by enabling the detection of clusters with different densities, overcoming DBSCAN's limitation of a single fixed density parameter. By creating a reachability plot, OPTICS visually represents clusters as valleys, allowing flexible extraction of clusters with varied densities and sizes. This makes it particularly useful for analyzing unstructured data, spatial distributions, and applications where cluster shapes are not uniform.
Key Concepts

• Density-Based Approach: OPTICS defines clusters based on the density of data points. Points in denser regions are more likely to belong to the same cluster, whereas points in sparser regions might either belong to different clusters or be considered noise.

• Core Distance: For each data point, the core distance is the minimum radius required to encompass a specified number of points, given by the parameter minPts. If a point does not have minPts neighbors within the maximum search radius, its core distance is undefined and the point is considered noise or an outlier.

• Reachability Distance: This metric reflects how "reachable" one point is from another. The reachability distance of a point p from a neighboring point o is the maximum of o's core distance and the actual distance between the two points: reachability(p, o) = max(core-distance(o), dist(o, p)). The reachability distance helps identify clusters by indicating how accessible a point is from another.
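The two distances can be written down in a few lines. The sketch below (not from the slides; names are illustrative) follows the definitions above and treats the core distance as the distance to the minPts-th nearest other point; note that conventions on whether a point counts as its own neighbor vary between implementations.

```python
import numpy as np

def core_distance(dists_from_point, min_pts):
    """Distance to the min_pts-th nearest other point: the smallest radius
    whose ball around the point contains min_pts neighbors."""
    sorted_d = np.sort(dists_from_point)   # sorted_d[0] is the point itself (0.0)
    return sorted_d[min_pts]

def reachability_distance(p, o, points, min_pts):
    """Reachability of point p from point o: max(core-distance(o), dist(o, p))."""
    points = np.asarray(points, dtype=float)
    dists_from_o = np.linalg.norm(points - points[o], axis=1)
    return max(core_distance(dists_from_o, min_pts), dists_from_o[p])

pts = [(1, 2), (2, 2), (2, 3), (3, 3), (8, 7), (8, 8)]
# Reachability of point 3, i.e. (3, 3), from point 1, i.e. (2, 2):
print(reachability_distance(p=3, o=1, points=pts, min_pts=2))   # ~1.414
```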
• Cluster Ordering: OPTICS generates an ordering of the points based on the reachability distances, allowing it to form a reachability plot. This plot can reveal clusters of varying densities without needing specific cluster assignments upfront.

• Reachability Plot: A reachability plot in OPTICS clustering visually represents the structure of a dataset by plotting each point's reachability distance in the order in which the points are processed. Peaks and valleys in the plot reveal potential clusters: valleys indicate dense regions (clusters), while peaks represent sparser areas or noise. By interpreting these patterns, one can identify clusters of varying densities without predefined boundaries, making the reachability plot a powerful tool for flexible, density-based clustering.
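scikit-learn's OPTICS estimator exposes the computed ordering and reachability distances as attributes, so a reachability plot can be drawn directly. A sketch, assuming scikit-learn and matplotlib are installed (the dataset and parameters are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Blobs of different densities, which a single DBSCAN eps would struggle with.
X, _ = make_blobs(n_samples=500, centers=[(0, 0), (6, 6), (12, 0)],
                  cluster_std=[0.3, 0.8, 1.5], random_state=0)

optics = OPTICS(min_samples=10).fit(X)

# Reachability distances in processing order: valleys = clusters, peaks = gaps/noise.
reachability = optics.reachability_[optics.ordering_]
plt.plot(reachability)
plt.xlabel("points, in OPTICS processing order")
plt.ylabel("reachability distance")
plt.title("Reachability plot")
plt.show()
```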
Steps

1. Input parameters: Define parameters such as the distance metric, minPts (the minimum number of points in a neighborhood), and the dataset.

2. Compute distances: Calculate the distances between the points in the dataset.

3. Determine core distances: For each point, calculate its core distance based on the minPts parameter.

4. Calculate reachability distances: For each point, compute the reachability distance from its neighbors, taking core distances into account.

5. Sort points: Order the points based on their reachability distances.

6. Construct the reachability plot: Plot the reachability distances to visualize clusters.

7. Identify clusters: Analyze the plot to identify and extract clusters based on valleys and peaks.

8. Assign cluster labels: Optionally, assign cluster labels to the points based on the identified clusters.
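Steps 1-8 correspond roughly to the following scikit-learn sketch, in which the cluster-extraction step is performed by cutting the reachability plot at a chosen threshold using cluster_optics_dbscan (the thresholds and data are illustrative):

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=[(0, 0), (6, 6), (12, 0)],
                  cluster_std=[0.3, 0.8, 1.5], random_state=0)

# Steps 1-6: fit OPTICS, which computes core distances, reachability
# distances, and the processing order.
optics = OPTICS(min_samples=10).fit(X)

# Steps 7-8: extract flat clusters by cutting the reachability plot at
# different eps thresholds; tighter cuts keep only the densest valleys.
for eps in (0.5, 1.5):
    labels = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"cut at eps={eps}: {n_clusters} clusters")

# OPTICS also provides its own xi-based cluster labels via optics.labels_
```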
Example: Given ε (epsilon) = 5 and minPts = 2.

[Reachability plot not reproduced here.]

From the plot we can infer that:
• C1: B, C, D
• C2: G, H, I, J, K
• C3: O, N, P, R, Q, S, T
Advantages

• Flexibility in Cluster Shape and Size: OPTICS is appropriate for datasets with complex structures and irregularly shaped clusters because it can recognize clusters of different sizes and shapes.

• Flexible Density-Based Clustering: OPTICS clusters data points based on density, similar to DBSCAN, but provides more flexibility with density variations, allowing it to handle clusters of different densities in a single dataset (see the sketch after this list).

• No Predefined Cluster Number Requirement: OPTICS does not require specifying the number of clusters beforehand, which is advantageous when the optimal number of clusters is unknown or when the data does not have a clear separation into distinct groups.

• Handles Noise: OPTICS can effectively identify and separate noise points from actual clusters, which makes it more robust for noisy datasets compared to algorithms like DBSCAN or k-means.

• Automatic Cluster Ordering: The algorithm orders points based on reachability distance, creating a structure that can be visualized and used to identify clusters at multiple density levels.

• Works with High-Dimensional Data: Although it may still be slower than simpler clustering algorithms, OPTICS can handle high-dimensional data better than many other clustering methods.
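A sketch of the varying-density point referenced above (assuming scikit-learn; all parameter values are illustrative): three blobs with very different spreads, where any single DBSCAN ε is a compromise between the tightest and the loosest blob, while OPTICS adapts to the local density with one setting. The exact counts printed will depend on the parameters chosen.

```python
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_blobs

# One tight blob, one medium blob, and one very spread-out blob.
X, _ = make_blobs(n_samples=600, centers=[(0, 0), (8, 8), (16, 0)],
                  cluster_std=[0.2, 0.6, 2.0], random_state=0)

def summarize(labels):
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return f"{n_clusters} clusters, {list(labels).count(-1)} noise points"

# A single eps must compromise: a small value tends to dissolve the loose
# blob into noise, while a much larger value is needed to keep it, which
# would be risky if a dense and a sparse cluster sat close together.
print("DBSCAN eps=0.3:", summarize(DBSCAN(eps=0.3, min_samples=10).fit_predict(X)))
print("DBSCAN eps=1.5:", summarize(DBSCAN(eps=1.5, min_samples=10).fit_predict(X)))

# OPTICS orders the points first and extracts clusters afterwards (xi method),
# so it can pick up reachability valleys of different depths, i.e. densities.
print("OPTICS:        ", summarize(OPTICS(min_samples=10, xi=0.05).fit_predict(X)))
```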
Disadvantages

• Computational Complexity: OPTICS can be computationally intensive, especially for very large datasets, with a worst-case complexity of O(n²). This makes it slower than simpler clustering methods like k-means, especially with high-dimensional data.

• Non-Deterministic Output: The reachability plot's point ordering may produce non-deterministic clustering outcomes; the final clustering result may change if the order of the input data is slightly altered.

• Less Effective for Well-Separated Clusters: For data with well-separated, distinct clusters, simpler clustering algorithms like k-means might perform just as well as or even better than OPTICS, with less computational overhead.

• Memory Intensive: OPTICS can consume significant memory, especially for large datasets, as it needs to store distances between all points to build the reachability plot.

• Not Ideal for High-Dimensional Sparse Data: Although OPTICS can handle high-dimensional data better than some algorithms, it can struggle with sparse data due to increased distance calculations, leading to reduced accuracy and efficiency in high-dimensional, sparse datasets.

• Parameter Sensitivity: OPTICS requires setting parameters such as the minimum number of points (minPts) and the maximum reachability distance (ε). Choosing appropriate values for these parameters can be challenging and may require tuning to achieve optimal results.
DBSCAN vs. OPTICS

DBSCAN: Clusters based on density with a fixed neighborhood radius (eps).
OPTICS: Creates an ordering of points based on density, allowing flexible clustering.

DBSCAN: Requires two parameters: eps (neighborhood radius) and minPts (minimum points).
OPTICS: Requires minPts; eps is optional and adjusted dynamically.

DBSCAN: Sensitive to the eps value; an improper choice may split or merge clusters.
OPTICS: More flexible; adapts to varying densities by adjusting eps dynamically.

DBSCAN: Provides clusters directly after the algorithm executes.
OPTICS: Requires post-processing to extract clusters from the reachability plot.

DBSCAN: Works well with uniform-density datasets.
OPTICS: Handles varying density effectively.

DBSCAN: Scales well for large datasets.
OPTICS: Less scalable; slower on large datasets.

DBSCAN: Ideal for datasets with uniform density.
OPTICS: Better for datasets with varying density and complex structures.
THE END
THANK YOU

BY:
R.SANJANA (RA2211027010237)
JAYASHREE S (RA2211027010255)