DBSCAN AND OPTICS
ALGORITHMS
1. DBSCAN ALGORITHM
2. OPTICS ALGORITHM
INTRODUCTION TO
DBSCAN ALGORITHM
Clustering Analysis is an unsupervised learning method that organizes data points
into groups based on their similarities. This technique is particularly useful for
identifying patterns and structures within datasets without prior labeling. Key
clustering methods include K-Means, which partitions data into a predefined number
of clusters by minimizing the variance within each group, and DBSCAN (Density-Based
Spatial Clustering of Applications with Noise), which groups together closely packed
points and identifies outliers in sparse regions.
About
DBSCAN is a popular clustering algorithm that groups together points that are close to
each other based on a distance measurement and a minimum number of points within a
given neighborhood. It is particularly well-suited for datasets with clusters of varying
shapes and sizes and is robust to noise. The key idea is that, for each point of a
cluster, the neighborhood of a given radius has to contain at least a minimum number
of points.
Key Concepts
Density: DBSCAN defines clusters as dense regions of points separated by areas
of lower density. It measures density by counting how many points lie within a
specified distance (ε) of a given point. If that count reaches a threshold
(minPts), the region is considered dense.
Core Points: A core point has at least a minimum number (minPts) of other points
within a given radius (ε, epsilon). Core points are crucial for forming clusters.
When a core point is identified, the cluster is expanded to include all of its
neighboring points, whether they are core points or border points.
Border Points: These points are within the ε radius of a core point but do not
have enough neighbors to be a core point themselves. They belong to a cluster
but are not dense enough to form one.
Noise Points: Points that are neither core nor border points are classified as
noise. These points do not belong to any cluster and are considered outliers.
Noise points are essentially isolated points that fall outside the influence of
dense regions defined by core points.
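The three point types above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name classify_points and the sample data are hypothetical, points are assumed to be distinct tuples, and min_pts counts other points within ε (the point itself is excluded), following the definition of core points above.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    min_pts counts *other* points within eps (the point itself is excluded).
    """
    # eps-neighborhood of every point, excluding the point itself
    neighbors = {
        p: [q for q in points if q != p and dist(p, q) <= eps]
        for p in points
    }
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"  # near a core point, too few neighbors itself
        else:
            labels[p] = "noise"
    return labels

# Hypothetical data: a tight group, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=2)
for p in points:
    print(p, labels[p])
```

With these assumed values, the four points of the unit square are core points, (2, 1) is a border point because its only neighbor (1, 1) is a core point, and (10, 10) is noise.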
When to Use DBSCAN vs. K-Means in Clustering
Prefer DBSCAN over K-Means when:
• Data has noise or outliers
• Number of clusters is unknown
Steps
Choose parameters:
Set ε (epsilon) as the radius that defines the neighborhood around a point, and minPts
as the minimum number of points required to form a dense region (a common rule is
to set minPts to at least the dimensionality of the dataset plus one).
Form clusters:
Start with an arbitrary core point and retrieve all points within its ε neighborhood. If
any neighboring points are core points, recursively retrieve their neighbors until no
more core points can be found. All reachable points from the starting core point form
a cluster.
Identify noise:
Classify points that are neither core nor border points as noise. These points do not
belong to any cluster and are considered outliers.
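The steps above can be sketched as a small pure-Python function. It is a sketch under stated assumptions rather than a production implementation: the name dbscan, the label convention (-1 for noise), and the sample points are hypothetical, min_pts counts other points within ε, and the recursive retrieval of neighbors is written as a queue so that deep recursion is avoided. Neighborhoods are found by brute force, which is only reasonable for small datasets.

```python
from collections import deque
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # eps-neighborhood of every point (by index), the point itself excluded.
    neighbors = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Form clusters: grow a new cluster from an unassigned core point.
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            current = queue.popleft()
            for j in neighbors[current]:
                if labels[j] is None:
                    labels[j] = cluster_id
                    if is_core[j]:  # only core points keep expanding
                        queue.append(j)
        cluster_id += 1

    # Identify noise: anything never reached from a core point.
    return [NOISE if lab is None else lab for lab in labels]

# Hypothetical data: two small groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=2)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Border points are labeled when a core point reaches them but are never added to the queue, which is what stops a cluster from growing through sparse regions.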
Example:
Parameters:
Points: (1, 2), (2, 2), (2, 3), (3, 3), (8, 7), (8, 8), (25, 80)
DBSCAN Process:
1. Identify Core Points:
   For each point, count how many other points are within ε.
   Calculations:
   1. (1, 2): Neighbors: (2, 2), (2, 3) → 2 neighbors (core point)
   2. (2, 2): Neighbors: (1, 2), (2, 3), (3, 3) → 3 neighbors (core point)
   3. (2, 3): Neighbors: (1, 2), (2, 2), (3, 3) → 3 neighbors (core point)
   4. (3, 3): Neighbors: (2, 2), (2, 3) → 2 neighbors (core point)
   5. (8, 7): 1 neighbor (not a core point)
   6. (8, 8): 1 neighbor (not a core point)
   7. (25, 80): 0 neighbors (not a core point)
2. Core Points: (1, 2), (2, 2), (2, 3), (3, 3)
3. Form Clusters:
   1. Start with (1, 2):
      1. Neighbors: (2, 2), (2, 3) → Include in cluster.
      2. Expand from core points (2, 2) and (2, 3): (1, 2) is already included,
         and their remaining neighbor (3, 3) is not yet in the cluster.
      3. Add (3, 3) to the cluster.
   2. (8, 7): Not a core point and not within ε of one → classify as noise.
   3. (8, 8): Not a core point and not within ε of one → classify as noise.
   4. (25, 80): Not a core point and not within ε of one → classify as noise.
• Clusters:
  • Cluster 1: {(1, 2), (2, 2), (2, 3), (3, 3)}
• Noise Points:
  • (8, 7), (8, 8), (25, 80)
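The neighbor counts in this example can be checked with a short script. The values EPS = 1.5 and MIN_PTS = 2 are assumptions, chosen only because they reproduce the neighbor lists above: (1, 2) and (2, 3) are about 1.41 apart and count as neighbors, while (1, 2) and (3, 3) are about 2.24 apart and do not. Any ε between those two distances gives the same result.

```python
from math import dist

points = [(1, 2), (2, 2), (2, 3), (3, 3), (8, 7), (8, 8), (25, 80)]
EPS = 1.5      # assumed value; reproduces the neighbor lists in the example
MIN_PTS = 2    # assumed: at least 2 other points within EPS

# Other points within EPS of each point
neighbors = {
    p: [q for q in points if q != p and dist(p, q) <= EPS]
    for p in points
}
core_points = [p for p in points if len(neighbors[p]) >= MIN_PTS]
# Noise: not a core point and no core point within EPS
noise = [
    p for p in points
    if p not in core_points and not any(q in core_points for q in neighbors[p])
]

for p in points:
    kind = "core point" if p in core_points else "not a core point"
    print(p, "->", len(neighbors[p]), "neighbors", f"({kind})")
print("Noise:", noise)
```

Note that (8, 7) and (8, 8) are neighbors of each other, but neither reaches the required count, so neither can start or join a cluster.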
Advantages
• Unlike K-means, which requires specifying the number of clusters beforehand,
DBSCAN automatically determines the number of clusters based on data
density, making it suitable for exploratory data analysis.
• DBSCAN can find clusters of varying shapes and sizes. Its density-based
approach allows it to discover complex cluster structures that K-means might
miss, as it relies on proximity rather than distances to centroids.
• DBSCAN effectively identifies outliers or noise points, classifying them as
noise. This allows analysts to focus on meaningful data patterns without being
influenced by anomalies.
Disadvantages
• DBSCAN struggles with clusters of varying densities. It may incorrectly merge clusters of
different densities, leading to inaccurate classifications and loss of meaningful patterns.
• DBSCAN can require substantial memory for large datasets, especially dense ones.
Maintaining lists of points and their neighbors can be resource-intensive, making it less
suitable for very large datasets.
• Unlike some clustering algorithms with default settings, DBSCAN requires careful tuning
of ε and minPts, often through trial and error, complicating its use for less experienced
users.
• A high minPts value can overlook small clusters, as these may not meet density
requirements, limiting the algorithm's ability to identify all relevant patterns in the data.
OPTICS ALGORITHM
Introduction
OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based
clustering algorithm designed to identify clusters of varying densities in
complex datasets. It orders data points to reveal their density-based clustering
structure, rather than assigning strict cluster memberships as in traditional
methods.
Unlike traditional clustering algorithms that assume particular cluster shapes
and densities (such as K-means), OPTICS is flexible and can discover clusters of
different sizes and shapes, which is especially useful when clusters are not
well-separated.
Key Concepts
• Core Distance: The core distance of a point is the minimum radius required to
encompass a specific number of points, typically defined as a parameter minPts.
If a point does not have at least minPts points within the maximum radius ε, its
core distance is undefined.
• Cluster Ordering: OPTICS generates an ordering of the points based on the
reachability distances, allowing it to form a reachability plot. This plot can
reveal clusters of varying densities without needing specific cluster
assignments upfront.
• Reachability Plot: A reachability plot in OPTICS clustering visually represents the
reachability distance of each point, with points arranged in the cluster ordering.
Steps
1. Input Parameters: Define parameters such as the distance metric, minPts
(minimum number of points in a neighborhood), and the dataset.
3. Determine Core Distances: For each point, calculate its core distance based
on the minPts parameter.
4. Calculate Reachability Distances: For each point, compute the reachability
distance from its neighbors, considering core distances.
7. Identify Clusters: Analyze the plot to identify and extract clusters based on
valleys and peaks.
8. Assign Cluster Labels: Optionally, assign cluster labels to the points based
on the identified clusters.
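The steps listed above can be sketched in pure Python. This is a simplified illustration under several assumptions, not the reference algorithm: the names optics and extract_clusters and the sample data are hypothetical, min_pts counts other points (the point itself is excluded), the reachability distance of a point from a core point is taken as the larger of that core point's core distance and the distance between the two, and clusters are extracted by cutting the reachability values at a single threshold, which is only one of several ways to read valleys and peaks.

```python
import heapq
from math import dist, inf

def optics(points, min_pts, eps=inf):
    """Return (ordering, reach): visit order of indices and reachability per index."""
    n = len(points)

    def neighbors(i):
        return [j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]

    def core_distance(i, nbrs):
        if len(nbrs) < min_pts:
            return None  # not a core point: core distance undefined
        return sorted(dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    reach = [inf] * n
    processed = [False] * n
    ordering = []

    def update(i, nbrs, cd, seeds):
        # Lower the reachability of unprocessed neighbors of core point i.
        for j in nbrs:
            if not processed[j]:
                new_reach = max(cd, dist(points[i], points[j]))
                if new_reach < reach[j]:
                    reach[j] = new_reach
                    heapq.heappush(seeds, (new_reach, j))

    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True
        ordering.append(start)
        nbrs = neighbors(start)
        cd = core_distance(start, nbrs)
        if cd is None:
            continue
        seeds = []
        update(start, nbrs, cd, seeds)
        while seeds:
            _, j = heapq.heappop(seeds)
            if processed[j]:
                continue  # stale heap entry, already visited
            processed[j] = True
            ordering.append(j)
            nbrs_j = neighbors(j)
            cd_j = core_distance(j, nbrs_j)
            if cd_j is not None:
                update(j, nbrs_j, cd_j, seeds)
    return ordering, reach

def extract_clusters(ordering, reach, threshold):
    """Cut the reachability values at threshold; each valley becomes a cluster."""
    clusters, current = [], []
    for i in ordering:
        if reach[i] > threshold:  # a peak: the previous valley ends here
            if len(current) > 1:
                clusters.append(current)
            current = [i]
        else:
            current.append(i)
    if len(current) > 1:
        clusters.append(current)
    return clusters

# Hypothetical data: a dense group, a sparser group, and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 13), (13, 10), (13, 13), (50, 50)]
ordering, reach = optics(points, min_pts=2)
print(extract_clusters(ordering, reach, threshold=5))
print(extract_clusters(ordering, reach, threshold=2))
```

On this data a threshold of 5 yields both groups, [[0, 1, 2, 3], [4, 5, 6, 7]], while a threshold of 2 keeps only the dense group. The same ordering therefore supports clusterings at more than one density level without rerunning the algorithm.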
Example: Given ε = 5 and minPts = 2, the resulting clusters are:
C1: B, C, D
C2: G, H, I, J, K
C3: O, N, P, R, Q, S, T
Advantages
• Flexibility in Cluster Shape and Size: OPTICS is appropriate for datasets with
complex structures and irregularly shaped clusters because it can recognize clusters
of different sizes and shapes.
• Flexible Density-Based Clustering: OPTICS clusters data points based on density,
similar to DBSCAN, but provides more flexibility with density variations, allowing it
to handle clusters of different densities in a single dataset.
• No Predefined Cluster Number Requirement: OPTICS does not require specifying
the number of clusters beforehand, which is advantageous when the optimal
number of clusters is unknown or when the data does not have a clear separation
into distinct groups.
• Handles Noise: OPTICS can effectively identify and separate noise points from
actual clusters. This makes it more robust for noisy datasets than algorithms
like k-means, which assign every point to a cluster.
• Automatic Cluster Ordering: The algorithm orders points based on reachability
distance, creating a structure that can be visualized and used to identify clusters at
multiple density levels.
• Works with High-Dimensional Data: Although it may still be slower than simpler
clustering algorithms, OPTICS can handle high-dimensional data better than many
other clustering methods.
Disadvantages
• Computational Complexity: OPTICS can be computationally intensive, especially for
very large datasets, as it has a complexity of O(n²) in the worst case. This makes it slower
than simpler clustering methods like k-means, especially with high-dimensional data.
• Input-Order Sensitivity: Points with equal reachability distances can be visited in a
different sequence depending on the input order, so the cluster ordering, and therefore
the final clustering result, may change if the order of the input data is altered.
• Less Effective for Well-Separated Clusters: For data with well-separated, distinct
clusters, simpler clustering algorithms like k-means might perform just as well or even
better than OPTICS, with less computational overhead.
• Memory Intensive: OPTICS can consume significant memory, especially for large
datasets, as it needs to store distances between all points to build the reachability plot.
• Not Ideal for High-Dimensional Sparse Data: Although OPTICS can handle high-
dimensional data better than some algorithms, it can struggle with sparse data due to
increased distance calculations, leading to reduced accuracy and efficiency in high-
dimensional, sparse datasets.
DBSCAN vs. OPTICS
DBSCAN | OPTICS
Works well with uniform density datasets | Handles varying density effectively
Scales well for large datasets | Less scalable; slower on large datasets
Ideal for datasets with uniform density | Better for datasets with varying density and complex structures
• DBSCAN Parameter Sensitivity: The choice of ε (epsilon) and minPts (minimum points)
significantly affects DBSCAN's clustering results. Selecting these parameters is often
challenging: too small an ε can classify many points as noise, while too large a value can
merge distinct clusters.
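The effect of ε can be seen on a tiny dataset. The sketch below is illustrative only: the helper dbscan_summary and the sample points are hypothetical, min_pts counts other points within ε, and the helper reports just the number of clusters and the number of noise points for one parameter setting.

```python
from math import dist

def dbscan_summary(points, eps, min_pts):
    """Return (number of clusters, number of noise points) for one setting."""
    n = len(points)
    nbrs = [[j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in nbrs]
    label = [None] * n
    clusters = 0
    for i in range(n):
        if label[i] is not None or not core[i]:
            continue
        label[i] = clusters
        stack = [i]
        while stack:  # expand the cluster through core points only
            cur = stack.pop()
            for j in nbrs[cur]:
                if label[j] is None:
                    label[j] = clusters
                    if core[j]:
                        stack.append(j)
        clusters += 1
    return clusters, label.count(None)

# Hypothetical data: two unit squares whose nearest corners are 3 apart.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 0), (4, 1), (5, 0), (5, 1)]
results = {eps: dbscan_summary(points, eps, min_pts=2) for eps in (0.5, 1.5, 4.0)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With these assumed points, ε = 0.5 leaves every point as noise, ε = 1.5 recovers the two groups, and ε = 4.0 merges them into a single cluster.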
THE END
THANK YOU
BY:
R.SANJANA (RA2211027010237)
JAYASHREE S (RA2211027010255)