UNIT 4
To Do:
1. Classification:
a. Definition, Data Generalization and Analytical Characterization
b. Analysis of Attribute Relevance
c. Mining Class Comparisons
d. Statistical Measures in Large Databases
e. Statistical-Based Algorithms (Naïve Bayes, Bayesian, Logistic Regression, LDA)
f. Distance-Based Algorithms (KNN, SVM, K-means, Hierarchical)
g. Decision Tree-Based Algorithms (ID3, C4.5, CART, Random Forest, scalable DT)
2. Clustering:
a. Introduction
b. Similarity and Distance Measures
c. Hierarchical and Partitional Algorithms
d. Hierarchical Clustering: CURE and CHAMELEON
e. Density-Based Methods: DBSCAN, OPTICS
f. Grid-Based Methods: STING, CLIQUE
g. Model-Based Methods: Statistical Approach
3. Association Rules:
a. Introduction
b. Large Item Sets
c. Basic Algorithms
d. Parallel and Distributed Algorithms
e. Neural Network Approach
#CLASSIFICATION
1. Definition of Classification:
Classification is a supervised learning technique in data mining where the goal is to assign a
label or class to data points based on their features. It involves building a model using
labelled training data and then using this model to classify unseen test data.
• Objective: Predict the categorical label of a given input based on learned patterns.
• Examples:
o Spam email detection: Classify emails as "Spam" or "Not Spam."
o Disease diagnosis: Classify a patient as "Healthy" or "Diseased."
• Key Steps:
o Training Phase: Use a dataset where the output (label) is already known to
build a model.
o Testing Phase: Evaluate the model on unseen data to check its accuracy.
• Common Algorithms: Decision Trees, Naive Bayes, Support Vector Machines (SVM),
Neural Networks.
2. Data Generalization:
Data generalization refers to summarizing and abstracting data to identify meaningful
patterns while reducing complexity. This is achieved by aggregating data or rolling up data
to a higher abstraction level.
• Purpose: Focus on higher-level patterns rather than specific details.
• Techniques:
o Attribute Aggregation: Combine detailed data into a summary (e.g., weekly
sales instead of daily sales).
o Concept Hierarchies: Organize data into hierarchical levels (e.g., "City" →
"State" → "Country").
o Data Cube Operations: Perform roll-up or drill-down operations in OLAP
systems to view data at different granularities.
• Applications:
o Identifying trends (e.g., seasonal sales patterns).
o Reducing noise in data for better classification results.
3. Analytical Characterization:
Analytical characterization involves summarizing and distinguishing the features of a
dataset, particularly between different classes or categories.
• Types:
o Descriptive Characterization: Summarizes the general properties of a class. For
example, "most buyers of Product A are aged 25-35."
o Comparative Characterization: Identifies differences between classes. For
instance, "Product A buyers tend to be younger than Product B buyers."
• Key Techniques:
o Use of statistical summaries (e.g., mean, variance, and distribution).
o Visualization tools like bar charts and scatter plots to identify patterns.
o Analytical tools such as correlation and regression analysis.
• Applications:
o Marketing: Profiling customers for targeted campaigns.
o Healthcare: Differentiating patients with specific conditions based on
symptoms.
4. Mining Class Comparisons:
Mining class comparisons (class discrimination) contrasts the general features of a target
class with those of one or more comparison classes.
Objective:
• Understand how two or more classes differ.
• Identify attributes or patterns that distinguish one class from another.
Steps in Mining Class Comparisons:
1. Select Classes for Comparison:
o Choose the classes or groups you want to compare. For example:
▪ In a dataset of customers, compare "High Spenders" vs. "Low Spenders."
2. Summarize Data:
o Summarize data within each class using descriptive statistics like mean,
median, mode, variance, and frequency.
o For example, calculate the average income for "High Spenders" and "Low
Spenders."
3. Attribute Relevance:
o Analyse which attributes are most relevant for distinguishing the classes.
o Techniques include statistical methods (e.g., t-tests, ANOVA) and feature
importance analysis.
4. Visualization:
o Use visualizations like histograms, bar charts, or scatter plots to see how the
attributes differ between classes.
o For example, plot the age distribution of "High Spenders" and "Low Spenders."
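As a minimal sketch of steps 2 and 3 above, assuming pandas and SciPy are available, the class summaries and a simple relevance test could look like the following. The dataset and column names (spend_class, age, income) are hypothetical.
```python
# Hypothetical sketch: summarizing and comparing two customer classes.
import pandas as pd
from scipy.stats import ttest_ind

df = pd.DataFrame({
    "spend_class": ["High", "High", "Low", "Low", "High", "Low"],
    "age":         [34, 29, 21, 19, 40, 23],
    "income":      [72000, 65000, 28000, 25000, 80000, 30000],
})

# Step 2: summarize each class with descriptive statistics.
print(df.groupby("spend_class")[["age", "income"]].agg(["mean", "median", "std"]))

# Step 3 (attribute relevance): a simple two-sample t-test per attribute.
high = df[df["spend_class"] == "High"]
low = df[df["spend_class"] == "Low"]
for col in ["age", "income"]:
    t, p = ttest_ind(high[col], low[col], equal_var=False)
    print(f"{col}: t = {t:.2f}, p = {p:.3f}")
```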
Applications:
• Marketing: Identify differences between customers who purchase a product and
those who don’t.
• Healthcare: Compare patients with different diseases or health conditions.
• Education: Compare the performance of students in different grade levels.
Example:
Scenario:
• Dataset: Customer purchases in an e-commerce store.
• Classes: "Frequent Buyers" vs. "Infrequent Buyers."
• Attributes: Age, Income, Browsing History, and Purchase Amount.
Analysis:
1. Age: "Frequent Buyers" are generally aged 25–40, while "Infrequent Buyers" are 18–
24.
2. Income: "Frequent Buyers" have a higher income range.
3. Browsing History: "Frequent Buyers" visit product pages more often.
Outcome:
This analysis helps the store design targeted campaigns for "Infrequent Buyers" to convert
them into "Frequent Buyers."
Statistical Measures
Statistical measures summarize large databases numerically: measures of central tendency
describe a typical value, while measures of dispersion describe how spread out the values are.
1. Mean:
• Definition: The mean is the sum of all values divided by the number of values.
• Example:
Dataset: 3, 7, 8, 12, 15
Mean = (3 + 7 + 8 + 12 + 15) / 5 = 9.
• Strengths:
o Uses every value in the dataset.
• Limitations:
o Affected by extreme values (outliers).
2. Median:
• Definition: The median is the middle value of an ordered dataset. If the dataset has
an even number of values, it is the average of the two middle values.
• Steps to Find Median:
1. Arrange the data in ascending order.
2. Find the middle value(s).
• Example:
Dataset: 3, 7, 8, 12, 15
Median = 8 (middle value).
Dataset (even): 2, 4, 6, 8
Median = (4 + 6) / 2 = 5.
• Strengths:
o Not affected by outliers.
o Suitable for skewed data.
• Limitations:
o Ignores much of the dataset.
3. Mode:
• Definition: The mode is the most frequently occurring value in a dataset. A dataset
can have:
o No mode (if all values are unique).
o One mode (unimodal).
o Two modes (bimodal) or more (multimodal).
• Strengths:
o Useful for categorical data.
o Simple to identify in small datasets.
• Limitations:
o Can be unstable in datasets with small frequency differences.
o May not exist or may not represent centrality well in numerical data.
4. Midrange
• Definition: It is the average of the largest and smallest values of a dataset.
• Example:
Dataset: 2,5,8,10,12
• Minimum value = 2
• Maximum value = 12
Midrange= (2+12) / 2 = 7
The midrange is 7.
Strengths:
• Simplicity: Easy to calculate and understand.
• Quick estimation: Useful when only the range of the data is known.
Limitations:
• Sensitivity to outliers: It is significantly affected by extreme values, as it depends
solely on the minimum and maximum values.
• Ignores data distribution: The midrange does not consider any other values in the
dataset besides the extremes.
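As a short sketch, the central-tendency measures above can be computed with Python's built-in statistics module; the datasets reuse the small examples from this section.
```python
# Sketch: central-tendency measures on the small example datasets above.
import statistics

data = [3, 7, 8, 12, 15]

print("Mean:", statistics.mean(data))                      # (3+7+8+12+15)/5 = 9
print("Median:", statistics.median(data))                  # middle value = 8
print("Median (even):", statistics.median([2, 4, 6, 8]))   # (4+6)/2 = 5
print("Mode:", statistics.mode([2, 4, 4, 6]))              # most frequent value = 4
print("Midrange:", (min([2, 5, 8, 10, 12]) + max([2, 5, 8, 10, 12])) / 2)  # (2+12)/2 = 7
```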
# MEASURES OF DISPERSION
Measures of dispersion are statistical tools that describe the spread or variability of a
dataset. They provide insight into how much the data points in a set differ from one another
or from the central tendency (mean, median, or mode). Dispersion is essential for
understanding the reliability and consistency of data. Here’s a detailed explanation of the
various measures of dispersion:
1. Range
• Definition: The range is the difference between the largest and smallest values in a
dataset.
• Formula: Range= Maximum value − Minimum value
• Example: For the dataset {4,7,9,15}
Range = 15−4=11
• Strengths: Easy to calculate.
• Weaknesses: Does not consider all data points, sensitive to outliers.
2. Interquartile Range (IQR)
• Definition: The IQR measures the range of the middle 50% of the data, reducing the
effect of outliers.
• Formula: IQR=Q3−Q1
Where Q1 (First Quartile) is the median of the lower half, and Q3 (Third Quartile) is
the median of the upper half.
• Example: For the dataset {1,3,5,7,9,11,13}
Q1=3, Q3=11, IQR = Q3 – Q1 = 11 – 3 = 8.
• Strengths: Reduces the impact of outliers.
• Weaknesses: Ignores data outside the middle 50%.
3. Variance
• Definition: Variance quantifies the average squared deviation of each data point from
the mean.
• Formula:
For a population: σ² = Σ(xᵢ − μ)² / N
For a sample: s² = Σ(xᵢ − x̄)² / (n − 1)
Where:
o 𝑥𝑖 : Each data point
o 𝜇: Population mean
o 𝑥̅ : Sample mean
o N: Population size
o n: Sample size
• Example: For {2, 4, 6}, mean = 4; squared deviations = 4, 0, 4; population variance
σ² = 8 / 3 ≈ 2.67.
• Strengths: Uses all data points, suitable for further statistical analysis.
• Weaknesses: Squaring exaggerates the impact of large deviations.
4. Standard Deviation
• Definition: Standard deviation is the square root of variance, representing the
average deviation from the mean.
• Formula: σ = √σ²
• Example: For {2, 4, 6}, Variance = 2.67.
Standard Deviation = √2.67 ≈ 1.63
• Strengths: Expressed in the same unit as the data, easy to interpret.
• Weaknesses: Sensitive to outliers.
6. Coefficient of Variation (CV)
• Definition: The CV expresses the standard deviation as a percentage of the mean:
CV = (σ / mean) × 100%
• Strengths: Useful for comparing variability across datasets with different units or
scales.
• Weaknesses: Requires a non-zero mean.
7. Absolute Measures vs. Relative Measures
• Absolute Measures: Depend on the unit of data (e.g., Range, Variance, Standard
Deviation, MAD).
• Relative Measures: Unit-free and expressed as percentages or ratios (e.g., CV).
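A brief sketch of these dispersion measures for the example dataset {2, 4, 6}, using Python's statistics module; the last calculation gives the coefficient of variation (CV) as a relative measure.
```python
# Sketch: dispersion measures for the example dataset {2, 4, 6}.
import statistics

data = [2, 4, 6]

data_range = max(data) - min(data)            # Range = 6 - 2 = 4
pop_var = statistics.pvariance(data)          # population variance = 8/3 ≈ 2.67
pop_std = statistics.pstdev(data)             # population standard deviation ≈ 1.63
sample_var = statistics.variance(data)        # sample variance = 8/(3-1) = 4
cv = pop_std / statistics.mean(data) * 100    # coefficient of variation in %

print(data_range, round(pop_var, 2), round(pop_std, 2), sample_var, round(cv, 1))
```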
# STATISTICAL-BASED ALGORITHMS
1. Naive Bayes Classification
Definition:
Naive Bayes is a probabilistic classifier based on Bayes' Theorem, P(C|X) = P(X|C)·P(C) / P(X),
with the simplifying ("naive") assumption that features are conditionally independent given
the class. A data point is assigned to the class with the highest posterior probability.
Applications:
• Spam detection.
• Sentiment analysis.
• Medical diagnosis.
Advantages:
• Fast and efficient.
• Works well for high-dimensional data.
• Provides probabilistic outputs.
Limitations:
• Assumes feature independence, which may not hold in practice.
• Sensitive to zero probabilities (can be mitigated with Laplace smoothing).
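A minimal sketch of Naive Bayes in practice, assuming scikit-learn is available; the feature values and labels below are purely illustrative.
```python
# Sketch (assumes scikit-learn): Gaussian Naive Bayes on toy data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])  # toy features
y = np.array([0, 0, 1, 1])                                      # toy class labels

model = GaussianNB()          # assumes features are independent given the class
model.fit(X, y)

print(model.predict([[1.1, 2.0]]))         # predicted class label
print(model.predict_proba([[1.1, 2.0]]))   # probabilistic output per class
```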
2. Bayesian Classification
Definition:
Bayesian classification is a general probabilistic framework that uses Bayes' Theorem to
calculate the probability of data belonging to a specific class. Unlike Naive Bayes, it does not
assume feature independence and considers the dependencies among features.
How It Works:
1. Calculate the joint probability distribution P(X,C), where X represents the features
and C represents the class.
2. Use Bayes' Theorem to compute the posterior probability P(C∣X).
3. Assign the data point to the class with the highest posterior probability.
Key Difference from Naive Bayes:
• Bayesian classification considers feature dependencies, making it more flexible but
computationally expensive.
Applications:
• Complex datasets where feature dependencies are important (e.g., image
recognition, speech processing).
Advantages:
• Handles correlated features better than Naive Bayes.
• Provides a strong probabilistic foundation for classification.
Limitations:
• Computationally intensive due to the need to model dependencies.
• Requires large amounts of data to estimate probabilities accurately.
3. Logistic Regression
Definition:
Logistic regression is a statistical model that predicts the probability of a binary outcome
using a logistic (sigmoid) function. Despite its name, logistic regression is widely used for
classification tasks.
Sigmoid Function:
σ(z) = 1 / (1 + e^(−z))
Where z is a linear combination of input features:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
How It Works:
1. The model computes the weighted sum of the features (z).
2. The sigmoid function converts z into a probability (P(Y=1)).
3. A threshold (commonly 0.5) is applied to classify data points:
o Probability ≥ threshold → Class 1.
o Probability < threshold → Class 0.
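A small sketch of the sigmoid-plus-threshold rule described above; the weights and bias are hand-picked, hypothetical values rather than fitted parameters.
```python
# Sketch: sigmoid probability and threshold-based classification.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    z = sum(w * x for w, x in zip(weights, features)) + bias  # weighted sum z
    p = sigmoid(z)                                            # probability P(Y = 1)
    return (1 if p >= threshold else 0), p

label, prob = predict(features=[2.0, 0.5], weights=[1.2, -0.7], bias=-1.0)
print(label, round(prob, 3))   # class 1 with probability ≈ 0.74
```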
Applications:
• Credit risk assessment.
• Fraud detection.
• Customer churn prediction.
Advantages:
• Simple and interpretable.
• Works well for linearly separable data.
• Outputs probabilities for better decision-making.
Limitations:
• Limited to linear decision boundaries.
• Sensitive to outliers.
• May struggle with non-linear relationships unless features are transformed.
4. Linear Discriminant Analysis (LDA)
Definition:
Linear Discriminant Analysis is both a dimensionality reduction and classification technique.
It finds a linear combination of features that best separates the classes by maximizing the
ratio of between-class variance to within-class variance.
Steps:
1. Compute the mean and variance of each class.
2. Calculate the scatter matrices:
o Within-class scatter matrix (Sw): Measures the spread of data points within
each class.
o Between-class scatter matrix (Sb): Measures the spread of the class means
relative to the overall mean.
3. Solve the eigenvalue problem to find the projection vector that maximizes the ratio:
J(w) = (wᵀ Sb w) / (wᵀ Sw w)
4. Project the data onto the new axis and classify based on the projected values.
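A minimal two-class sketch of these steps, assuming NumPy; for two classes the direction that maximizes J(w) reduces to w ∝ Sw⁻¹(m₁ − m₀). The sample points are illustrative.
```python
# Sketch (assumes NumPy): two-class LDA projection direction.
import numpy as np

X0 = np.array([[1.0, 2.0], [2.0, 2.5], [1.5, 3.0]])   # class 0 samples
X1 = np.array([[5.0, 6.0], [6.0, 6.5], [5.5, 7.0]])   # class 1 samples

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)             # step 1: class means

# Step 2: within-class scatter (sum of the per-class scatter matrices).
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)

# Step 3: projection vector (two-class closed form).
w = np.linalg.inv(Sw) @ (m1 - m0)

# Step 4: project the data; the classes separate along the projected axis.
print(X0 @ w, X1 @ w)
```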
Applications:
• Facial recognition.
• Text classification.
• Medical diagnosis.
Advantages:
• Effective for linearly separable classes.
• Reduces dimensionality, improving computational efficiency.
Limitations:
• Assumes Gaussian distribution of features.
• Assumes equal covariance matrices for all classes.
• Sensitive to outliers.
Comparison of Algorithms
Algorithm | Strengths | Weaknesses
Naive Bayes | Fast, simple, effective for high-dimensional data. | Assumes feature independence, sensitive to zero probabilities.
Bayesian Classification | Considers feature dependencies, more flexible. | Computationally intensive, requires large data for accurate modelling.
Logistic Regression | Interpretable, outputs probabilities, works well for binary classification. | Limited to linear relationships, sensitive to outliers.
Linear Discriminant Analysis | Effective for dimensionality reduction and linearly separable classes. | Assumes Gaussian distribution and equal covariances, sensitive to outliers.
Conclusion
These four statistical-based algorithms have unique strengths and weaknesses, making
them suitable for different types of classification problems. Understanding their underlying
assumptions and limitations is crucial for selecting the right algorithm for a specific task.
# Distance Based Algorithms
2. Support Vector Machines (SVM)
Definition:
SVM classifies data by finding the hyperplane that separates the classes with the maximum
margin; kernel functions (controlled by parameters such as γ and C) allow it to learn
non-linear decision boundaries.
Applications:
• Image classification.
• Text categorization.
• Bioinformatics (e.g., protein classification).
Advantages:
• Works well with non-linear decision boundaries.
• Effective in high-dimensional spaces.
• Robust to overfitting if 𝛾 and other parameters are tuned well.
Limitations:
• Computationally expensive for large datasets.
• Requires careful parameter tuning (𝛾 and C).
3. K-Means Clustering (for Classification Tasks)
Definition:
Though primarily a clustering algorithm, K-Means can be adapted for classification by
labelling clusters with the majority class.
How It Works:
1. Initialize k cluster centroids randomly.
2. Assign each data point to the nearest centroid based on a distance metric (e.g.,
Euclidean distance):
d(x, c) = √( Σᵢ (xᵢ − cᵢ)² )
3. Recompute each centroid as the mean of the data points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change. For classification, each
final cluster is labelled with the majority class of its members.
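A compact sketch of the loop described above, assuming NumPy; the points, k = 2, and the iteration cap are illustrative.
```python
# Minimal k-means sketch on toy 2-D points with k = 2.
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
k = 2
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random init

for _ in range(10):
    # Step 2: assign each point to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):              # step 4: convergence
        break
    centroids = new_centroids

print(labels, centroids)
```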
Conclusion
Distance-based algorithms provide versatile methods for classification and clustering. Each
algorithm has unique strengths and limitations, making them suitable for specific datasets
and use cases. The choice of algorithm depends on factors like dataset size, feature space,
and computational resources.
# DECISION TREE ALGORITHMS
DETAILED EXPLANATION
1. ID3 (Iterative Dichotomiser 3):
• Overview: ID3 is one of the earliest decision tree algorithms, developed by Ross
Quinlan in 1986. It is primarily used for classification tasks. The algorithm builds a
decision tree by selecting the feature that maximizes the information gain (IG) at
each node. It continues splitting the data until it reaches pure nodes (where all
samples belong to the same class).
• Steps in ID3:
1. Calculate Information Gain: For each feature in the dataset, calculate the
information gain, which is a measure of how well the feature separates the
data into different classes. Information gain is calculated using the concept of
entropy (a measure of uncertainty in the dataset). The feature with the highest
information gain is selected for the split.
2. Split the Dataset: Once the feature with the highest information gain is
selected, the dataset is split into subsets based on the values of that feature.
3. Recursion: This process is recursively repeated on the subsets, choosing
features that maximize information gain until one of the stopping criteria is met
(such as when all data in a subset belong to a single class, or there are no more
features to split on).
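A small sketch of the entropy and information-gain calculation from step 1, using a hypothetical single-feature dataset.
```python
# Sketch: entropy and information gain as used by ID3.
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Parent entropy minus the weighted entropy of each subset after the split."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical data: one feature ("Outlook") and binary class labels.
rows = [["Sunny"], ["Sunny"], ["Rain"], ["Rain"], ["Overcast"]]
labels = ["No", "No", "Yes", "Yes", "Yes"]
print(round(information_gain(rows, labels, 0), 3))   # ≈ 0.971 (a pure split here)
```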
• Advantages:
o Simple and easy to implement.
o Works well with categorical data.
• Disadvantages:
o Can overfit if the tree is too deep.
o Prone to errors in the presence of noise or missing data.
• In Data Warehousing: ID3 can be applied for data mining tasks within data
warehouses to classify data into distinct categories (e.g., classifying transactions as
fraudulent or legitimate). The decision tree can provide a transparent way to
understand the classification logic.
• AI Relation: ID3 is foundational in AI, as it is one of the first algorithms to generate
interpretable models for decision-making based on historical data. It is often used in
knowledge discovery, pattern recognition, and classification tasks.
2. C4.5:
• Overview: C4.5 is an extension of ID3, also developed by Ross Quinlan. It is an
improved version that addresses some of ID3’s weaknesses. It uses gain ratio instead
of information gain, handles both categorical and continuous attributes, and
incorporates pruning to prevent overfitting.
• Steps in C4.5:
1. Select the Best Feature: For each feature, C4.5 calculates the gain ratio (which
is the ratio of information gain to the intrinsic information of a feature). The
feature with the highest gain ratio is chosen for the split.
2. Handle Continuous Features: C4.5 can handle continuous attributes by finding
the best split point (e.g., choosing a threshold value).
3. Pruning: C4.5 prunes the tree after it is built to remove branches that provide
little or no predictive value. This helps reduce overfitting and improves the
model's generalization ability.
4. Create Subtrees: The algorithm recursively creates subtrees for each subset,
splitting them until all data points in a node are classified or other stopping
criteria are met.
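A minimal sketch of the gain-ratio criterion from step 1, with hypothetical subset sizes and information gain.
```python
# Sketch: gain ratio = information gain / split (intrinsic) information.
import math

def split_info(subset_sizes):
    """Intrinsic information: -sum((|Si|/|S|) * log2(|Si|/|S|))."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(information_gain, subset_sizes):
    si = split_info(subset_sizes)
    return information_gain / si if si > 0 else 0.0

# Suppose a feature splits 10 samples into subsets of sizes 4, 3 and 3
# and yields an information gain of 0.25 (hypothetical numbers).
print(round(gain_ratio(0.25, [4, 3, 3]), 3))
```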
• Advantages:
o Works with both continuous and categorical data.
o Incorporates pruning to reduce overfitting.
o Can handle missing values and noisy data.
• Disadvantages:
o Computation-intensive, especially for large datasets.
o More complex than ID3.
• In Data Warehousing: C4.5 can be applied in data warehousing tasks such as
customer segmentation, predictive analytics, and classification of complex business
data (e.g., identifying high-value customers, classifying customer behaviours).
• AI Relation: C4.5 is widely used in machine learning and AI for pattern recognition,
classification, and decision-making tasks. It is useful in applications where
interpretability is critical.
3. CART (Classification and Regression Trees):
• Overview: CART is a more general algorithm that supports both classification and
regression tasks. Unlike ID3 and C4.5, which build multi-way branches, CART
constructs binary trees where each node has exactly two branches. It uses Gini
impurity for classification and mean squared error for regression to determine the
best splits.
• Steps in CART:
1. Choose the Best Split: CART uses the Gini impurity (for classification) or mean
squared error (for regression) to choose the best feature and threshold for
splitting the data.
2. Binary Splitting: Unlike C4.5, CART builds binary trees, meaning each split
divides the dataset into two parts.
3. Recursive Splitting: This process is repeated recursively, creating branches until
stopping criteria are met (e.g., when the data in a node are pure, or the tree
reaches a predefined depth).
4. Pruning: Post-pruning is performed to reduce overfitting, cutting off branches
that do not contribute significantly to the model’s accuracy.
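A short sketch of the Gini impurity computation CART uses in step 1, evaluated on a hypothetical binary split.
```python
# Sketch: Gini impurity of a hypothetical binary split.
from collections import Counter

def gini(labels):
    """Gini impurity = 1 - sum(p_k^2) over the class proportions p_k."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

left = ["Yes", "Yes", "Yes", "No"]   # one side of the candidate split
right = ["No", "No", "No"]           # the other side

# Weighted Gini of the split: the lower, the better the split.
n = len(left) + len(right)
weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
print(round(gini(left), 3), round(gini(right), 3), round(weighted, 3))
```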
• Advantages:
o Can be used for both classification and regression.
o More flexible and easier to implement than C4.5.
o Can handle large datasets efficiently.
• Disadvantages:
o Pruning is essential to avoid overfitting.
o Sensitive to small changes in the data.
• In Data Warehousing: CART is used for tasks like customer segmentation, demand
forecasting, and risk management. It is beneficial in predictive analytics within data
warehouses, where quick and reliable decision-making models are needed.
• AI Relation: In AI, CART is utilized in applications where both classification and
regression are needed, such as in speech recognition, medical diagnosis, and
predictive maintenance.
4. Random Forest:
• Overview: Random Forest is an ensemble learning method that builds multiple
decision trees and aggregates their results to improve accuracy and reduce
overfitting. It is one of the most powerful algorithms, particularly in classification
tasks.
• Steps in Random Forest:
1. Bootstrap Aggregating (Bagging): Random Forest builds multiple decision trees
using random subsets of the training data (with replacement). Each tree is
trained independently.
2. Random Feature Selection: For each tree, a random subset of features is
selected to determine the best split at each node. This reduces correlation
between trees.
3. Voting/Averaging: Once the trees are trained, predictions from all trees are
aggregated (by majority voting for classification or averaging for regression) to
make the final prediction.
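A minimal random forest sketch, assuming scikit-learn is available; the toy data and hyperparameter values are illustrative.
```python
# Sketch (assumes scikit-learn): bagged trees with random feature selection.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 2], [2, 1], [1, 1], [8, 9], [9, 8], [9, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

# n_estimators = number of bootstrapped trees; max_features limits the random
# subset of features considered at each split (values here are illustrative).
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
forest.fit(X, y)

print(forest.predict([[2, 2], [8, 8]]))   # majority vote across the trees
```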
• Advantages:
o Handles large datasets well.
o Reduces overfitting by averaging multiple trees.
o Robust to noise and outliers.
• Disadvantages:
o Less interpretable than a single decision tree.
o Requires more computational resources.
• In Data Warehousing: Random Forest is commonly used for tasks like anomaly
detection, predictive modeling, and classification of complex data within data
warehouses. It’s particularly effective when dealing with large volumes of structured
and unstructured data.
• AI Relation: In AI, Random Forest is used in a variety of applications, including image
classification, fraud detection, and bioinformatics, where high accuracy and resilience
to overfitting are essential.
5. Scalable Decision Tree Techniques:
As datasets grow in size and complexity, traditional decision tree algorithms like ID3, C4.5,
and CART might struggle with scalability. Scalable decision tree techniques address the
challenges posed by large datasets, high-dimensionality, and distributed computing. Some
notable scalable decision tree techniques include:
• XGBoost (Extreme Gradient Boosting):
o A highly efficient, scalable implementation of gradient boosting that optimizes
decision tree ensembles for classification and regression tasks.
o It uses a gradient boosting framework, where trees are added sequentially,
with each tree correcting errors from the previous one.
o XGBoost handles large datasets well and includes regularization to prevent
overfitting.
• LightGBM (Light Gradient Boosting Machine):
o A gradient boosting framework designed for efficiency and scalability with large
datasets. It builds trees leaf-wise rather than level-wise, improving accuracy
and reducing computation time.
o LightGBM is known for its high speed and low memory usage.
• CatBoost (Categorical Boosting):
o A gradient boosting algorithm designed to handle categorical data efficiently. It
automatically deals with categorical features without requiring explicit
encoding.
o CatBoost is optimized for performance, offering faster training and better
accuracy.
• Distributed Decision Trees:
o These techniques allow decision trees to be built in a distributed computing
environment, making it possible to process large datasets spread across
multiple machines. This approach can handle very large volumes of data,
especially in big data scenarios.
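As a brief sketch, assuming the xgboost package is installed, a gradient-boosted tree ensemble could be trained as follows; all parameter values are illustrative rather than tuned.
```python
# Sketch (assumes xgboost): sequentially boosted trees with regularization.
import numpy as np
from xgboost import XGBClassifier

X = np.array([[1, 2], [2, 1], [1, 1], [8, 9], [9, 8], [9, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

model = XGBClassifier(
    n_estimators=100,    # trees are added sequentially, each correcting prior errors
    max_depth=3,
    learning_rate=0.1,
    reg_lambda=1.0,      # L2 regularization helps prevent overfitting
)
model.fit(X, y)
print(model.predict(np.array([[2, 2], [8, 8]])))
```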
• Advantages:
o Handle large and high-dimensional datasets efficiently.
o Can be used in distributed environments for big data processing.
o Often outperform traditional decision trees in terms of accuracy and
computation time.
• Disadvantages:
o May require more complex setup and tuning.
o Less interpretable than traditional decision trees.
• In Data Warehousing: Scalable decision tree techniques like XGBoost and LightGBM
are crucial for big data analytics, allowing decision trees to be built on vast amounts
of structured and unstructured data, helping businesses extract insights quickly and
efficiently.
• AI Relation: Scalable decision tree techniques are essential in modern AI applications,
particularly in large-scale machine learning tasks like fraud detection,
recommendation systems, and real-time analytics.
#CLUSTERING
Clustering algorithms fall into four main families: Hierarchical (e.g., CURE), Density-Based
(e.g., DBSCAN), Grid-Based (e.g., STING), and Model-Based (e.g., the Statistical Approach).
1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Overview:
DBSCAN groups points that lie in dense regions (points with many nearby neighbours) into
clusters and marks points in low-density regions as noise.
Key Concepts:
1. Epsilon (ε):
o A radius that defines the neighbourhood around a data point.
o If other points lie within this radius, they are considered part of the same
cluster.
2. MinPts:
o The minimum number of points required to form a dense region (cluster).
3. Core Points:
o Points that have at least MinPts neighbours within the ε-radius.
4. Border Points:
o Points that have fewer than MinPts neighbours but are within the ε-radius of a
core point.
o These points are on the edge of a cluster.
5. Noise Points:
o Points that do not belong to any cluster.
Steps in DBSCAN:
1. Start with an Unvisited Point:
o Pick a random point from the dataset. If it has at least MinPts neighbours
within ε-radius, mark it as a core point and create a new cluster.
2. Expand the Cluster:
o Add all points within ε-radius to the cluster.
o If any of these points are core points, repeat the process to include their
neighbours.
3. Mark Border Points and Noise:
o If a point is not a core point but lies within the ε-radius of a core point, mark it
as a border point.
o Points that are not reachable from any core point are considered noise.
4. Repeat:
o Continue until all points are visited.
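A minimal DBSCAN sketch, assuming scikit-learn is available; ε = 1.5 and MinPts = 2 are illustrative choices for the toy points below.
```python
# Sketch (assumes scikit-learn): DBSCAN on two dense regions plus one outlier.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],   # dense region A
              [8.0, 8.0], [8.3, 8.1], [7.9, 8.4],   # dense region B
              [25.0, 25.0]])                        # isolated point

db = DBSCAN(eps=1.5, min_samples=2).fit(X)
print(db.labels_)   # cluster id per point; -1 marks noise (the isolated point)
```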
Advantages of DBSCAN:
• Can find clusters of arbitrary shapes (e.g., circular or elongated clusters).
• Effectively handles noise and outliers.
• Does not require specifying the number of clusters (k).
Limitations:
• Sensitive to the choice of parameters ε and MinPts.
• Struggles with datasets where cluster densities vary significantly.
• Computationally expensive for very large datasets.
Applications:
• Geospatial data analysis (e.g., detecting hotspots in GPS data).
• Image segmentation.
• Identifying anomalies or outliers in datasets.
2. OPTICS (Ordering Points To Identify the Clustering Structure)
Overview:
OPTICS is an extension of DBSCAN that addresses its sensitivity to the choice of the ε
parameter. Instead of forming fixed clusters for a specific ε, OPTICS creates an ordering of
the data points, representing the clustering structure at different density levels.
Key Concepts:
1. Core Distance:
o The smallest distance at which a point becomes a core point (i.e., it has MinPts
neighbours within that distance).
2. Reachability Distance:
o The distance needed to reach a point from a core point.
o It is either the core distance or the actual distance, whichever is greater.
3. Cluster Order:
o OPTICS orders points based on their reachability distance to reveal the
clustering structure at varying densities.
Steps in OPTICS:
1. Compute Core Distance:
o For each point, calculate its core distance based on ε and MinPts.
2. Expand Clusters:
o Similar to DBSCAN, start with an unvisited point and expand the cluster by
visiting its neighbours.
o Record the reachability distances of all visited points.
3. Output Cluster Ordering:
o Once all points are processed, OPTICS produces a reachability plot, which
visually represents the clustering structure.
4. Extract Clusters:
o Analyse the reachability plot to extract clusters of different densities by
identifying valleys (low reachability distances) and separating them with peaks
(high reachability distances).
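A short OPTICS sketch, assuming scikit-learn is available; MinPts = 2 is an illustrative choice.
```python
# Sketch (assumes scikit-learn): OPTICS ordering and reachability distances.
import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1.0, 1.0], [1.2, 1.1], [1.1, 0.9],
              [8.0, 8.0], [8.2, 8.1], [8.1, 7.9],
              [25.0, 25.0]])

opt = OPTICS(min_samples=2).fit(X)
print(opt.ordering_)               # cluster ordering of the points
print(opt.reachability_.round(2))  # reachability distances (basis of the plot)
print(opt.labels_)                 # clusters extracted from the ordering; -1 = noise
```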
Advantages of OPTICS:
• Handles datasets with varying densities better than DBSCAN.
• Produces a clustering hierarchy that can be used to analyse clusters at multiple scales.
• Provides more flexibility in understanding the structure of the data.
Limitations:
• Slower than DBSCAN due to the additional computations for core and reachability
distances.
• Requires careful interpretation of the reachability plot.
Applications:
• Hierarchical clustering in spatial databases.
• Analysing customer behaviour in e-commerce.
• Identifying natural groups in datasets with mixed-density clusters.
Conclusion:
• Use DBSCAN for datasets with well-defined clusters of uniform density and when you
need a simple clustering solution.
• Use OPTICS for datasets with varying densities and when a deeper understanding of
the clustering structure is required.
3. GRID BASED CLUSTERING ALGORITHM
Grid-based clustering algorithms like STING (Statistical Information Grid) and CLIQUE
(CLustering In QUEst) are powerful techniques for analysing large datasets, especially in
spatial data mining. Here's an explanation with examples: