
UNIT 4 DWDM

To Do:
1. Classification:
a. Definition, Data Generalization and Analytical Characterization
b. Analysis of Attribute Relevance
c. Mining Class Comparisons
d. Statistical Measures in Large Databases
e. Statistical-Based Algorithms (Naïve bayes, Bayesian, logistic, LDA)
f. Distance-Based Algorithms (KNN, SVM, K-means, Hierarchical)
g. Decision Tree-Based Algorithms (ID3, C4.5, CART, Random Forest, scalable DT)
2. Clustering:
a. Introduction
b. Similarity and Distance Measures
c. Hierarchical and Partitional Algorithms
d. Hierarchical Clustering: CURE and CHAMELEON
e. Density-Based Methods: DBSCAN, OPTICS
f. Grid-Based Methods: STING, CLIQUE
g. Model-Based Methods: Statistical Approach
3. Association Rules:
a. Introduction
b. Large Item Sets
c. Basic Algorithms
d. Parallel and Distributed Algorithms
e. Neural Network Approach
#CLASSIFICATION
1. Definition of Classification:
Classification is a supervised learning technique in data mining where the goal is to assign a
label or class to data points based on their features. It involves building a model using
labelled training data and then using this model to classify unseen test data.
• Objective: Predict the categorical label of a given input based on learned patterns.
• Examples:
o Spam email detection: Classify emails as "Spam" or "Not Spam."
o Disease diagnosis: Classify a patient as "Healthy" or "Diseased."
• Key Steps:
o Training Phase: Use a dataset where the output (label) is already known to
build a model.
o Testing Phase: Evaluate the model on unseen data to check its accuracy.
• Common Algorithms: Decision Trees, Naive Bayes, Support Vector Machines (SVM),
Neural Networks.

2. Data Generalization:
Data generalization refers to summarizing and abstracting data to identify meaningful
patterns while reducing complexity. This is achieved by aggregating data or rolling up data
to a higher abstraction level.
• Purpose: Focus on higher-level patterns rather than specific details.
• Techniques:
o Attribute Aggregation: Combine detailed data into a summary (e.g., weekly
sales instead of daily sales).
o Concept Hierarchies: Organize data into hierarchical levels (e.g., "City" →
"State" → "Country").
o Data Cube Operations: Perform roll-up or drill-down operations in OLAP
systems to view data at different granularities.
• Applications:
o Identifying trends (e.g., seasonal sales patterns).
o Reducing noise in data for better classification results.
3. Analytical Characterization:
Analytical characterization involves summarizing and distinguishing the features of a
dataset, particularly between different classes or categories.
• Types:
o Descriptive Characterization: Summarizes the general properties of a class. For
example, "most buyers of Product A are aged 25-35."
o Comparative Characterization: Identifies differences between classes. For
instance, "Product A buyers tend to be younger than Product B buyers."
• Key Techniques:
o Use of statistical summaries (e.g., mean, variance, and distribution).
o Visualization tools like bar charts and scatter plots to identify patterns.
o Analytical tools such as correlation and regression analysis.
• Applications:
o Marketing: Profiling customers for targeted campaigns.
o Healthcare: Differentiating patients with specific conditions based on
symptoms.

Analysis of Attribute Relevance


Definition:
Analysis of attribute relevance involves identifying and evaluating which attributes
(features) in a dataset significantly contribute to the target variable or outcome in a
classification task. Irrelevant or redundant attributes can decrease the model's
performance, making this analysis a critical step in data preprocessing and feature selection.

Why It’s Important:


1. Improves Model Accuracy: Including only relevant attributes reduces noise and
enhances the predictive power of the model.
2. Reduces Dimensionality: Helps simplify the dataset by removing irrelevant features,
which reduces computational complexity and avoids overfitting.
3. Increases Interpretability: Models built on fewer, more relevant features are easier to
understand and interpret.
Steps
1. Statistical Evaluation:
o Use statistical tests to measure the correlation or dependency between an
attribute and the target variable.
o Examples:
▪ Chi-square test for categorical data.
▪ ANOVA (Analysis of Variance) for numerical data.
2. Information Gain:
o Measure the reduction in uncertainty (entropy) of the target variable when a
particular attribute is used for splitting.
o Attributes with higher information gain are considered more relevant.
3. Correlation Analysis:
o Compute the correlation coefficient (e.g., Pearson or Spearman) to assess the
linear or monotonic relationship between a feature and the target variable.
o Attributes with strong correlations are likely to be relevant.
4. Feature Importance in Models:
o Train a classification model (e.g., Decision Tree, Random Forest) and check the
feature importance scores provided by the model.
o Features with higher importance scores contribute more to the model's
decisions.
5. Recursive Feature Elimination (RFE):
o Iteratively train the model by removing less important attributes until the
optimal subset of features is identified.
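
A minimal Python sketch of steps 4 and 5 above, assuming scikit-learn is available; the synthetic dataset and parameter values are illustrative only, not part of the original notes.

```python
# Sketch: feature relevance via model-based importance and Recursive Feature Elimination.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, only 3 of which are informative
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

# Step 4: feature importance scores from a tree ensemble
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for i, score in enumerate(forest.feature_importances_):
    print(f"feature_{i}: importance = {score:.3f}")

# Step 5: Recursive Feature Elimination down to the 3 most relevant features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("Selected feature indices:", np.where(rfe.support_)[0])
```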
# Mining Class Comparisons
Definition:
Mining class comparisons involve contrasting and comparing two or more classes or groups
in a dataset to understand their distinguishing features. It’s used to identify how these
classes differ in terms of specific attributes or patterns.

Objective:
• Understand how two or more classes differ.
• Identify attributes or patterns that distinguish one class from another.
Steps in Mining Class Comparisons:
1. Select Classes for Comparison:
o Choose the classes or groups you want to compare. For example:
▪ In a dataset of customers, compare "High Spenders" vs. "Low Spenders."
2. Summarize Data:
o Summarize data within each class using descriptive statistics like mean,
median, mode, variance, and frequency.
o For example, calculate the average income for "High Spenders" and "Low
Spenders."
3. Attribute Relevance:
o Analyse which attributes are most relevant for distinguishing the classes.
o Techniques include statistical methods (e.g., t-tests, ANOVA) and feature
importance analysis.
4. Visualization:
o Use visualizations like histograms, bar charts, or scatter plots to see how the
attributes differ between classes.
o For example, plot the age distribution of "High Spenders" and "Low Spenders."
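
A minimal pandas sketch of steps 2 and 3 above; the DataFrame columns ("segment", "age", "income") and values are hypothetical.

```python
# Sketch: summarizing and comparing two classes ("High Spender" vs "Low Spender").
import pandas as pd

df = pd.DataFrame({
    "segment": ["High Spender", "Low Spender", "High Spender", "Low Spender"],
    "age":     [34, 21, 29, 23],
    "income":  [72000, 31000, 65000, 28000],
})

# Step 2: descriptive statistics within each class
print(df.groupby("segment")[["age", "income"]].agg(["mean", "median", "std"]))

# Step 3 (simple relevance check): difference of class means per attribute
means = df.groupby("segment")[["age", "income"]].mean()
print(means.loc["High Spender"] - means.loc["Low Spender"])
```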

Applications:
• Marketing: Identify differences between customers who purchase a product and
those who don’t.
• Healthcare: Compare patients with different diseases or health conditions.
• Education: Compare the performance of students in different grade levels.

Example:
Scenario:
• Dataset: Customer purchases in an e-commerce store.
• Classes: "Frequent Buyers" vs. "Infrequent Buyers."
• Attributes: Age, Income, Browsing History, and Purchase Amount.
Analysis:
1. Age: "Frequent Buyers" are generally aged 25–40, while "Infrequent Buyers" are 18–
24.
2. Income: "Frequent Buyers" have a higher income range.
3. Browsing History: "Frequent Buyers" visit product pages more often.
Outcome:
This analysis helps the store design targeted campaigns for "Infrequent Buyers" to convert
them into "Frequent Buyers."

# STATISTICAL MEASURES
Statistical measures fall into two groups: measures of central tendency and measures of dispersion.
# MEASURES OF CENTRAL TENDENCY

1. Mean (Arithmetic Average):


• Definition: The mean is the sum of all values in a dataset divided by the total number
of values.
• Formula: Mean = (Σ xᵢ) / N

Where xᵢ is each data value, and N is the total number of values.


• Example:
Dataset: 4, 6, 8, 10
Mean = (4 + 6 + 8 + 10) / 4 = 7.
• Strengths:
o Uses all data points, making it sensitive to changes in the dataset.
o Ideal for symmetric distributions.

• Limitations:
o Affected by extreme values (outliers).

2. Median:
• Definition: The median is the middle value of an ordered dataset. If the dataset has
an even number of values, it is the average of the two middle values.
• Steps to Find Median:
1. Arrange the data in ascending order.
2. Find the middle value(s).
• Example:
Dataset: 3, 7, 8, 12, 15
Median = 8 (middle value).
Dataset (even): 2, 4, 6, 8
Median = (4 + 6) / 2 = 5.
• Strengths:
o Not affected by outliers.
o Suitable for skewed data.
• Limitations:
o Ignores much of the dataset.

3. Mode:
• Definition: The mode is the most frequently occurring value in a dataset. A dataset
can have:
o No mode (if all values are unique).
o One mode (unimodal).
o Two modes (bimodal) or more (multimodal).

• Empirical relationship (approximation for moderately skewed distributions): Mode ≈ 3 × Median − 2 × Mean


• Example:
Dataset: 2, 3, 3, 5, 7
Mode = 3.
Dataset: 1, 1, 2, 2, 3
Modes = 1 and 2 (bimodal).

• Strengths:
o Useful for categorical data.
o Simple to identify in small datasets.
• Limitations:
o Can be unstable in datasets with small frequency differences.
o May not exist or may not represent centrality well in numerical data.

4. Midrange
• Definition: It is the average of largest and smallest value of data set.
• Example:
Dataset: 2,5,8,10,12
• Minimum value = 2
• Maximum value = 12
Midrange= (2+12) / 2 = 7
The midrange is 7.

Strengths:
• Simplicity: Easy to calculate and understand.
• Quick estimation: Useful when only the range of the data is known.

Limitations:
• Sensitivity to outliers: It is significantly affected by extreme values, as it depends
solely on the minimum and maximum values.
• Ignores data distribution: The midrange does not consider any other values in the
dataset besides the extremes.
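
A minimal Python sketch computing the four measures of central tendency for the small example datasets used above; the standard-library statistics module is assumed.

```python
# Sketch: mean, median, mode, and midrange in Python.
import statistics

data = [4, 6, 8, 10]

mean = statistics.mean(data)              # (4 + 6 + 8 + 10) / 4 = 7
median = statistics.median(data)          # average of 6 and 8 = 7
mode = statistics.mode([2, 3, 3, 5, 7])   # most frequent value = 3
midrange = (min(data) + max(data)) / 2    # (4 + 10) / 2 = 7

print(mean, median, mode, midrange)
```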

# MEASURES OF DISPERSION
Measures of dispersion are statistical tools that describe the spread or variability of a
dataset. They provide insight into how much the data points in a set differ from one another
or from the central tendency (mean, median, or mode). Dispersion is essential for
understanding the reliability and consistency of data. Here’s a detailed explanation of the
various measures of dispersion:

1. Range
• Definition: The range is the difference between the largest and smallest values in a
dataset.
• Formula: Range= Maximum value − Minimum value
• Example: For the dataset {4,7,9,15}
Range = 15−4=11
• Strengths: Easy to calculate.
• Weaknesses: Does not consider all data points, sensitive to outliers.
2. Interquartile Range (IQR)
• Definition: The IQR measures the range of the middle 50% of the data, reducing the
effect of outliers.
• Formula: IQR=Q3−Q1
Where Q1 (First Quartile) is the median of the lower half, and Q3 (Third Quartile) is
the median of the upper half.
• Example: For the dataset {1,3,5,7,9,11,13}
Q1=3, Q3=11, IQR = Q3 – Q1 = 11 – 3 = 8.
• Strengths: Reduces the impact of outliers.
• Weaknesses: Ignores data outside the middle 50%.

3. Variance
• Definition: Variance quantifies the average squared deviation of each data point from
the mean.
• Formula:
For a population: σ² = Σ(xᵢ − μ)² / N

For a sample: s² = Σ(xᵢ − x̄)² / (n − 1)

Where:
o xᵢ: Each data point
o μ: Population mean
o x̄: Sample mean
o N: Population size
o n: Sample size
• Example: For {2, 4, 6}, mean (x̄) = 4

Variance = [(2−4)² + (4−4)² + (6−4)²] / 3 = (4 + 0 + 4) / 3 ≈ 2.67.

• Strengths: Uses all data points, suitable for further statistical analysis.
• Weaknesses: Squaring exaggerates the impact of large deviations.
4. Standard Deviation
• Definition: Standard deviation is the square root of variance, representing the
average deviation from the mean.

• Formula: σ = √(σ²)
• Example: For {2, 4, 6}, Variance ≈ 2.67.
Standard Deviation = √2.67 ≈ 1.63
• Strengths: Expressed in the same unit as the data, easy to interpret.
• Weaknesses: Sensitive to outliers.

5. Mean Absolute Deviation (MAD)


• Definition: MAD is the average of the absolute deviations of data points from the
mean: MAD = Σ|xᵢ − x̄| / n.

• Strengths: Less sensitive to outliers compared to variance.


• Weaknesses: May not be as widely used in advanced statistical methods.

6. Coefficient of Variation (CV)


• Definition: The CV is a relative measure of dispersion that expresses the standard
deviation as a percentage of the mean: CV = (σ / x̄) × 100%.

• Strengths: Useful for comparing variability across datasets with different units or
scales.
• Weaknesses: Requires a non-zero mean.
7. Absolute Measures vs. Relative Measures
• Absolute Measures: Depend on the unit of data (e.g., Range, Variance, Standard
Deviation, MAD).
• Relative Measures: Unit-free and expressed as percentages or ratios (e.g., CV).

Choosing the Right Measure


• Range or IQR: Suitable for small datasets or when outliers are a concern.
• Variance or Standard Deviation: Best for statistical analysis or when all data points
are important.
• MAD: Useful when robustness against outliers is needed.
• CV: Ideal for comparing variability across datasets.
Understanding and applying the right measure of dispersion is crucial for analysing the
consistency and variability in data, enabling better decision-making and insights.
# Statistical Based Algorithms
1. Naïve Bayes Classifier
2. Bayesian Classification
3. Logistic Regression
4. Linear Discriminant Analysis (LDA)

1. Naive Bayes Classifier


Definition:
Naive Bayes is a probabilistic classifier based on Bayes' Theorem. It assumes that all
features are conditionally independent of one another given the class label. Despite the
"naive" assumption of independence, it often performs well in practice.
Bayes' Theorem:
P(C|X) = P(X|C) · P(C) / P(X)
Where:
o P(C|X): Probability of class C given the features X (posterior probability).
o P(X|C): Probability of the features X given class C (likelihood).
o P(C): Prior probability of class C.
o P(X): Prior probability of the features X (evidence).
How It Works:
1. Compute the prior probability (P(C)) for each class.
2. Calculate the likelihood P(X|C) of each feature for every class.
3. Use Bayes' theorem to compute the posterior probabilities (P(C∣X)).
4. Assign the class with the highest posterior probability.
Assumption: Features are conditionally independent:
P(X|C) = P(x₁|C) · P(x₂|C) · … · P(xₙ|C)
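
A minimal Python sketch of this workflow with scikit-learn's Gaussian Naive Bayes; the synthetic dataset is only for illustration.

```python
# Sketch: Gaussian Naive Bayes on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)        # estimates P(C) and P(x_i|C)
print("Accuracy:", model.score(X_test, y_test))   # predictions are argmax of P(C|X)
print("Posterior probabilities:", model.predict_proba(X_test[:2]))
```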

Applications:
• Spam detection.
• Sentiment analysis.
• Medical diagnosis.
Advantages:
• Fast and efficient.
• Works well for high-dimensional data.
• Provides probabilistic outputs.
Limitations:
• Assumes feature independence, which may not hold in practice.
• Sensitive to zero probabilities (can be mitigated with Laplace smoothing).

2. Bayesian Classification
Definition:
Bayesian classification is a general probabilistic framework that uses Bayes' Theorem to
calculate the probability of data belonging to a specific class. Unlike Naive Bayes, it does not
assume feature independence and considers the dependencies among features.
How It Works:
1. Calculate the joint probability distribution P(X,C), where X represents the features
and C represents the class.
2. Use Bayes' Theorem to compute the posterior probability P(C∣X).
3. Assign the data point to the class with the highest posterior probability.
Key Difference from Naive Bayes:
• Bayesian classification considers feature dependencies, making it more flexible but
computationally expensive.
Applications:
• Complex datasets where feature dependencies are important (e.g., image
recognition, speech processing).
Advantages:
• Handles correlated features better than Naive Bayes.
• Provides a strong probabilistic foundation for classification.
Limitations:
• Computationally intensive due to the need to model dependencies.
• Requires large amounts of data to estimate probabilities accurately.
3. Logistic Regression
Definition:
Logistic regression is a statistical model that predicts the probability of a binary outcome
using a logistic (sigmoid) function. Despite its name, logistic regression is widely used for
classification tasks.
Sigmoid Function:
σ(z) = 1 / (1 + e^(−z))
Where z is a linear combination of input features:
z = w₁x₁ + w₂x₂ + … + wₙxₙ + b
How It Works:
1. The model computes the weighted sum of the features (z).
2. The sigmoid function converts z into a probability (P(Y=1)).
3. A threshold (commonly 0.5) is applied to classify data points:
o Probability ≥ threshold → Class 1.
o Probability < threshold → Class 0.
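
A minimal scikit-learn sketch mirroring the steps above; the data is synthetic, and the 0.5 threshold is applied internally by predict().

```python
# Sketch: logistic regression for binary classification.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X[:3])[:, 1]   # sigmoid output = P(Y=1) for the first 3 rows
labels = clf.predict(X[:3])              # class 1 if probability >= 0.5, else class 0
print(probs, labels)
print("weights:", clf.coef_, "bias:", clf.intercept_)
```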
Applications:
• Credit risk assessment.
• Fraud detection.
• Customer churn prediction.
Advantages:
• Simple and interpretable.
• Works well for linearly separable data.
• Outputs probabilities for better decision-making.
Limitations:
• Limited to linear decision boundaries.
• Sensitive to outliers.
• May struggle with non-linear relationships unless features are transformed.
4. Linear Discriminant Analysis (LDA)
Definition:
Linear Discriminant Analysis is both a dimensionality reduction and classification technique.
It finds a linear combination of features that best separates the classes by maximizing the
ratio of between-class variance to within-class variance.
Steps:
1. Compute the mean and variance of each class.
2. Calculate the scatter matrices:
o Within-class scatter matrix (Sw): Measures the spread of data points within
each class.
o Between-class scatter matrix (Sb): Measures the spread of the class means
relative to the overall mean.
3. Solve the eigenvalue problem to find the projection vector that maximizes the ratio:
J(w) = (wᵀ Sb w) / (wᵀ Sw w)
4. Project the data onto the new axis and classify based on the projected values.
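
A minimal scikit-learn sketch showing LDA used both for projection and classification; the Iris dataset is just a convenient built-in example.

```python
# Sketch: Linear Discriminant Analysis as dimensionality reduction + classifier.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_proj = lda.transform(X)                 # data projected onto the discriminant axes
print("Projected shape:", X_proj.shape)   # (150, 2)
print("Training accuracy:", lda.score(X, y))
```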
Applications:
• Facial recognition.
• Text classification.
• Medical diagnosis.
Advantages:
• Effective for linearly separable classes.
• Reduces dimensionality, improving computational efficiency.
Limitations:
• Assumes Gaussian distribution of features.
• Assumes equal covariance matrices for all classes.
• Sensitive to outliers.
Comparison of Algorithms
| Algorithm | Strengths | Weaknesses |
| --- | --- | --- |
| Naive Bayes | Fast, simple, effective for high-dimensional data. | Assumes feature independence, sensitive to zero probabilities. |
| Bayesian Classification | Considers feature dependencies, more flexible. | Computationally intensive, requires large data for accurate modelling. |
| Logistic Regression | Interpretable, outputs probabilities, works well for binary classification. | Limited to linear relationships, sensitive to outliers. |
| Linear Discriminant Analysis | Effective for dimensionality reduction and linearly separable classes. | Assumes Gaussian distribution and equal covariances, sensitive to outliers. |

Conclusion
These four statistical-based algorithms have unique strengths and weaknesses, making
them suitable for different types of classification problems. Understanding their underlying
assumptions and limitations is crucial for selecting the right algorithm for a specific task.
# Distance Based Algorithms

1. K-Nearest Neighbours (KNN)


Definition:
KNN is a non-parametric, instance-based learning algorithm that classifies a new data point
based on the majority class of its k nearest neighbours.
How It Works:
1. Compute the distance between the new data point and all existing data points in the
dataset.

2. Identify the k closest neighbours.


3. Perform majority voting among the k nearest neighbours:
o Assign the class that occurs most frequently among the neighbours.
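
A minimal scikit-learn sketch of KNN classification; k = 5 and the scaling step are illustrative choices (KNN is sensitive to unscaled features).

```python
# Sketch: KNN with feature scaling in a pipeline.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)                      # "lazy" learner: just stores the training set
print("Accuracy:", knn.score(X_test, y_test))  # majority vote of the 5 nearest points
```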
Applications:
• Handwritten digit recognition.
• Recommender systems.
• Fraud detection.
Advantages:
• Simple to implement and intuitive.
• Works well with multi-class classification.
• No training phase required (lazy learner).
Limitations:
• Computationally expensive for large datasets.
• Sensitive to irrelevant or unscaled features.
• Requires careful tuning of k.

2. Support Vector Machine (SVM) with RBF Kernel


Definition:
While SVM is generally considered a margin-based algorithm, it can become distance-based
when combined with the Radial Basis Function (RBF) kernel. RBF transforms data into a
higher-dimensional space where the algorithm finds the optimal hyperplane to separate
classes.

Applications:
• Image classification.
• Text categorization.
• Bioinformatics (e.g., protein classification).
Advantages:
• Works well with non-linear decision boundaries.
• Effective in high-dimensional spaces.
• Robust to overfitting if 𝛾 and other parameters are tuned well.
Limitations:
• Computationally expensive for large datasets.
• Requires careful parameter tuning (𝛾 and C).
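
A minimal scikit-learn sketch of an RBF-kernel SVM; the C and gamma values are illustrative and would normally be tuned (for example with GridSearchCV).

```python
# Sketch: SVM with an RBF kernel on a non-linearly separable dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)  # two interleaved classes

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("Training accuracy:", svm.score(X, y))
print("Support vectors per class:", svm.n_support_)  # points that define the boundary
```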
3. K-Means Clustering (for Classification Tasks)
Definition:
Though primarily a clustering algorithm, K-Means can be adapted for classification by
labelling clusters with the majority class.
How It Works:
1. Initialize k cluster centroids randomly.
2. Assign each data point to the nearest centroid based on a distance metric (e.g.,
Euclidean distance):

d(x, c) = √( Σᵢ (xᵢ − cᵢ)² )

3. Recompute centroids as the mean of the points in each cluster.


4. Iterate until centroids stabilize.
5. For classification: Assign the majority class in each cluster as the class label.
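
A minimal Python sketch of K-Means adapted for classification by labelling each cluster with the majority class of its members, as described above; the blob dataset is synthetic.

```python
# Sketch: K-Means clusters mapped to class labels by majority vote.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Map each cluster id to the majority true class among its points
cluster_to_class = {c: np.bincount(y[km.labels_ == c]).argmax() for c in range(3)}
predicted = np.array([cluster_to_class[c] for c in km.labels_])
print("Agreement with true labels:", (predicted == y).mean())
```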
Applications:
• Image segmentation.
• Customer segmentation.
• Document categorization.
Advantages:
• Simple and fast for smaller datasets.
• Works well when clusters/classes are spherical and well-separated.
Limitations:
• Sensitive to outliers and initial centroid placement.
• Not suitable for non-spherical clusters.
• Requires the number of clusters (k) to be known in advance.
4. Hierarchical Clustering (for Classification)
Definition:
Hierarchical clustering can be adapted for classification by grouping data into clusters and
assigning labels based on proximity.
Types:
• Agglomerative: Start with individual data points and merge them iteratively.
• Divisive: Start with all data points in one cluster and split them iteratively.
How It Works:
1. Compute the distance between all pairs of data points (or clusters).
2. Merge or split clusters based on a linkage criterion:
o Single Linkage: Distance between closest points in two clusters.
o Complete Linkage: Distance between farthest points in two clusters.
o Average Linkage: Average distance between points in two clusters.
3. Repeat until all data points are clustered.
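
A minimal SciPy sketch of agglomerative clustering with average linkage; cutting the tree into 3 clusters is an illustrative choice.

```python
# Sketch: agglomerative hierarchical clustering with average linkage.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

Z = linkage(X, method="average")                  # pairwise merge history (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) can be used to visualize the merge tree.
```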
Applications:
• Gene expression analysis.
• Social network analysis.
• Text classification.
Advantages:
• Does not require the number of clusters in advance.
• Produces a dendrogram (tree structure) for interpretability.
Limitations:
• Computationally expensive for large datasets.
• Sensitive to noise and outliers.
• Requires manual selection of the number of clusters.
Key Comparison of Distance-Based Algorithms
| Algorithm | Distance Metric | Strengths | Weaknesses |
| --- | --- | --- | --- |
| KNN | Euclidean, Manhattan, etc. | Simple, effective for multi-class. | Sensitive to irrelevant features. |
| SVM (RBF Kernel) | Squared Euclidean (via RBF) | Non-linear boundaries, robust. | Expensive, requires tuning. |
| K-Means | Euclidean | Fast, works for spherical clusters. | Sensitive to outliers, assumes k. |
| Hierarchical | Euclidean, linkage methods | No need for k, interpretable. | Slow, sensitive to noise and outliers. |

Conclusion
Distance-based algorithms provide versatile methods for classification and clustering. Each
algorithm has unique strengths and limitations, making them suitable for specific datasets
and use cases. The choice of algorithm depends on factors like dataset size, feature space,
and computational resources.
# DECISION TREE ALGORITHMS

What is a Decision Tree?


A decision tree is a tree-like structure where:
1. Nodes represent features (attributes) in the dataset.
2. Branches represent decisions or conditions based on feature values.
3. Leaves represent the output class (label).
The goal is to build a tree that splits the dataset into pure subsets (where most or all of the
data points in a subset belong to the same class).

How Decision Trees Work


1. Root Node:
o Start with all the data at the root.
o Choose the best feature to split the data.
▪ The "best" feature is chosen based on a criterion like Information Gain,
Gini Index, or Gain Ratio.
2. Splitting:
o Partition the data based on the values of the selected feature.
o Each split reduces the impurity of the subsets.
3. Recursion:
o For each subset, repeat the process: choose the best feature and split the data
until:
▪ All data points in a subset belong to the same class (pure subset).
▪ Or a stopping condition is met (e.g., tree depth, minimum data points in
a subset).
4. Leaf Nodes:
o Assign a class label to each leaf node based on the majority class of data points
in that subset.
Splitting Criteria
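
The commonly used criteria are Information Gain (based on entropy), the Gini Index, and the Gain Ratio, which are discussed under the algorithms below. A minimal Python sketch of how entropy, Gini impurity, and information gain are computed from class labels; the tiny label arrays are illustrative only.

```python
# Sketch: entropy, Gini impurity, and information gain for a candidate split.
import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])   # a perfect split

info_gain = (entropy(parent)
             - (len(left) / len(parent)) * entropy(left)
             - (len(right) / len(parent)) * entropy(right))
print(entropy(parent), gini(parent), info_gain)   # 1.0, 0.5, 1.0
```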

Types of Decision Tree Algorithms


1. ID3 (Iterative Dichotomiser 3):
o Uses Information Gain to select the best feature for splitting.
o Works only with categorical data.
2. C4.5:
o An improvement over ID3.
o Uses Gain Ratio for splitting.
o Supports both categorical and continuous data by discretizing continuous
features.
3. CART (Classification and Regression Tree):
o Uses Gini Index for classification tasks.
o Supports binary splits (yes/no decisions).
o Can also be used for regression by minimizing variance.
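
A minimal scikit-learn sketch of a CART-style tree (DecisionTreeClassifier implements an optimized CART with binary splits); max_depth = 3 is an illustrative limit to keep the tree small and reduce overfitting.

```python
# Sketch: a small decision tree with the Gini criterion.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))                   # readable if/else view of the splits
print("Training accuracy:", tree.score(X, y))
```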

Advantages of Decision Trees


1. Interpretability:
o Decision trees are easy to understand and visualize.
o The decision-making process can be explained in simple terms.
2. Non-Linearity:
o Can model non-linear decision boundaries.
3. No Feature Scaling:
o Does not require normalization or standardization of features.
4. Handles Mixed Data Types:
o Can work with both numerical and categorical data.

Limitations of Decision Trees


1. Overfitting:
o A decision tree can become too complex and fit the noise in the training data.
o Solution: Use pruning techniques or set limits on tree depth.
2. Bias Toward Features with Many Values:
o Features with more unique values can dominate the splitting process.
o Solution: Use Gain Ratio or other splitting criteria.
3. Instability:
o Small changes in the dataset can lead to significant changes in the tree
structure.

Applications of Decision Trees


1. Medical Diagnosis:
o Predicting diseases based on symptoms.
2. Loan Approval:
o Determining creditworthiness of applicants.
3. Customer Segmentation:
o Classifying customers into segments based on behaviour.
4. Fraud Detection:
o Identifying fraudulent transactions.

Improvements to Decision Trees


1. Ensemble Methods:
o Combine multiple decision trees for better performance.
o Examples:
▪ Random Forest: Combines multiple decision trees using bagging.
▪ Gradient Boosting: Combines decision trees iteratively to reduce error.
2. Pruning:
o Removes unnecessary branches to reduce overfitting and improve
generalization.
3. Hyperparameter Tuning:
o Limit tree depth, minimum samples per leaf, or minimum samples for splitting.

DETAILED EXPLANATION
1. ID3 (Iterative Dichotomiser 3):
• Overview: ID3 is one of the earliest decision tree algorithms, developed by Ross
Quinlan in 1986. It is primarily used for classification tasks. The algorithm builds a
decision tree by selecting the feature that maximizes the information gain (IG) at
each node. It continues splitting the data until it reaches pure nodes (where all
samples belong to the same class).
• Steps in ID3:
1. Calculate Information Gain: For each feature in the dataset, calculate the
information gain, which is a measure of how well the feature separates the
data into different classes. Information gain is calculated using the concept of
entropy (a measure of uncertainty in the dataset). The feature with the highest
information gain is selected for the split.
2. Split the Dataset: Once the feature with the highest information gain is
selected, the dataset is split into subsets based on the values of that feature.
3. Recursion: This process is recursively repeated on the subsets, choosing
features that maximize information gain until one of the stopping criteria is met
(such as when all data in a subset belong to a single class, or there are no more
features to split on).
• Advantages:
o Simple and easy to implement.
o Works well with categorical data.
• Disadvantages:
o Can overfit if the tree is too deep.
o Prone to errors in the presence of noise or missing data.
• In Data Warehousing: ID3 can be applied for data mining tasks within data
warehouses to classify data into distinct categories (e.g., classifying transactions as
fraudulent or legitimate). The decision tree can provide a transparent way to
understand the classification logic.
• AI Relation: ID3 is foundational in AI, as it is one of the first algorithms to generate
interpretable models for decision-making based on historical data. It is often used in
knowledge discovery, pattern recognition, and classification tasks.
2. C4.5:
• Overview: C4.5 is an extension of ID3, also developed by Ross Quinlan. It is an
improved version that addresses some of ID3’s weaknesses. It uses gain ratio instead
of information gain, handles both categorical and continuous attributes, and
incorporates pruning to prevent overfitting.
• Steps in C4.5:
1. Select the Best Feature: For each feature, C4.5 calculates the gain ratio (which
is the ratio of information gain to the intrinsic information of a feature). The
feature with the highest gain ratio is chosen for the split.
2. Handle Continuous Features: C4.5 can handle continuous attributes by finding
the best split point (e.g., choosing a threshold value).
3. Pruning: C4.5 prunes the tree after it is built to remove branches that provide
little or no predictive value. This helps reduce overfitting and improves the
model's generalization ability.
4. Create Subtrees: The algorithm recursively creates subtrees for each subset,
splitting them until all data points in a node are classified or other stopping
criteria are met.
• Advantages:
o Works with both continuous and categorical data.
o Incorporates pruning to reduce overfitting.
o Can handle missing values and noisy data.
• Disadvantages:
o Computation-intensive, especially for large datasets.
o More complex than ID3.
• In Data Warehousing: C4.5 can be applied in data warehousing tasks such as
customer segmentation, predictive analytics, and classification of complex business
data (e.g., identifying high-value customers, classifying customer behaviours).
• AI Relation: C4.5 is widely used in machine learning and AI for pattern recognition,
classification, and decision-making tasks. It is useful in applications where
interpretability is critical.
3. CART (Classification and Regression Trees):
• Overview: CART is a more general algorithm that supports both classification and
regression tasks. Unlike ID3 and C4.5, which build multi-way branches, CART
constructs binary trees where each node has exactly two branches. It uses Gini
impurity for classification and mean squared error for regression to determine the
best splits.
• Steps in CART:
1. Choose the Best Split: CART uses the Gini impurity (for classification) or mean
squared error (for regression) to choose the best feature and threshold for
splitting the data.
2. Binary Splitting: Unlike C4.5, CART builds binary trees, meaning each split
divides the dataset into two parts.
3. Recursive Splitting: This process is repeated recursively, creating branches until
stopping criteria are met (e.g., when the data in a node are pure, or the tree
reaches a predefined depth).
4. Pruning: Post-pruning is performed to reduce overfitting, cutting off branches
that do not contribute significantly to the model’s accuracy.
• Advantages:
o Can be used for both classification and regression.
o More flexible and easier to implement than C4.5.
o Can handle large datasets efficiently.
• Disadvantages:
o Pruning is essential to avoid overfitting.
o Sensitive to small changes in the data.
• In Data Warehousing: CART is used for tasks like customer segmentation, demand
forecasting, and risk management. It is beneficial in predictive analytics within data
warehouses, where quick and reliable decision-making models are needed.
• AI Relation: In AI, CART is utilized in applications where both classification and
regression are needed, such as in speech recognition, medical diagnosis, and
predictive maintenance.
4. Random Forest:
• Overview: Random Forest is an ensemble learning method that builds multiple
decision trees and aggregates their results to improve accuracy and reduce
overfitting. It is one of the most powerful algorithms, particularly in classification
tasks.
• Steps in Random Forest:
1. Bootstrap Aggregating (Bagging): Random Forest builds multiple decision trees
using random subsets of the training data (with replacement). Each tree is
trained independently.
2. Random Feature Selection: For each tree, a random subset of features is
selected to determine the best split at each node. This reduces correlation
between trees.
3. Voting/Averaging: Once the trees are trained, predictions from all trees are
aggregated (by majority voting for classification or averaging for regression) to
make the final prediction.
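
A minimal scikit-learn sketch of these steps; 100 trees and the square-root feature subsampling follow the bagging plus random feature selection idea described above, and the synthetic data is only for illustration.

```python
# Sketch: Random Forest classification.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)                      # each tree sees a bootstrap sample
print("Accuracy:", rf.score(X_test, y_test))  # prediction = majority vote over all trees
```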
• Advantages:
o Handles large datasets well.
o Reduces overfitting by averaging multiple trees.
o Robust to noise and outliers.
• Disadvantages:
o Less interpretable than a single decision tree.
o Requires more computational resources.
• In Data Warehousing: Random Forest is commonly used for tasks like anomaly
detection, predictive modeling, and classification of complex data within data
warehouses. It’s particularly effective when dealing with large volumes of structured
and unstructured data.
• AI Relation: In AI, Random Forest is used in a variety of applications, including image
classification, fraud detection, and bioinformatics, where high accuracy and resilience
to overfitting are essential.
5. Scalable Decision Tree Techniques:
As datasets grow in size and complexity, traditional decision tree algorithms like ID3, C4.5,
and CART might struggle with scalability. Scalable decision tree techniques address the
challenges posed by large datasets, high-dimensionality, and distributed computing. Some
notable scalable decision tree techniques include:
• XGBoost (Extreme Gradient Boosting):
o A highly efficient, scalable implementation of gradient boosting that optimizes
decision tree ensembles for classification and regression tasks.
o It uses a gradient boosting framework, where trees are added sequentially,
with each tree correcting errors from the previous one.
o XGBoost handles large datasets well and includes regularization to prevent
overfitting.
• LightGBM (Light Gradient Boosting Machine):
o A gradient boosting framework designed for efficiency and scalability with large
datasets. It builds trees leaf-wise rather than level-wise, improving accuracy
and reducing computation time.
o LightGBM is known for its high speed and low memory usage.
• CatBoost (Categorical Boosting):
o A gradient boosting algorithm designed to handle categorical data efficiently. It
automatically deals with categorical features without requiring explicit
encoding.
o CatBoost is optimized for performance, offering faster training and better
accuracy.
• Distributed Decision Trees:
o These techniques allow decision trees to be built in a distributed computing
environment, making it possible to process large datasets spread across
multiple machines. This approach can handle very large volumes of data,
especially in big data scenarios.
• Advantages:
o Handle large and high-dimensional datasets efficiently.
o Can be used in distributed environments for big data processing.
o Often outperform traditional decision trees in terms of accuracy and
computation time.
• Disadvantages:
o May require more complex setup and tuning.
o Less interpretable than traditional decision trees.
• In Data Warehousing: Scalable decision tree techniques like XGBoost and LightGBM
are crucial for big data analytics, allowing decision trees to be built on vast amounts
of structured and unstructured data, helping businesses extract insights quickly and
efficiently.
• AI Relation: Scalable decision tree techniques are essential in modern AI applications,
particularly in large-scale machine learning tasks like fraud detection,
recommendation systems, and real-time analytics.
#CLUSTERING

The clustering algorithms covered in this unit fall into four families:
• Hierarchical: CURE, CHAMELEON
• Density-Based: DBSCAN, OPTICS
• Grid-Based: STING, CLIQUE
• Model-Based: Statistical Approach

1. HIERARCHICAL CLUSTERING
Hierarchical clustering algorithms like CURE (Clustering Using REpresentatives) and
Chameleon are widely used in data warehousing to uncover patterns in large datasets by
organizing them into meaningful clusters. Here's a detailed explanation of each:

1. CURE (Clustering Using REpresentatives)


Overview:
CURE improves traditional hierarchical clustering methods by addressing scalability and
cluster shape limitations. It uses a set of representative points for each cluster, which
allows it to handle clusters of arbitrary shapes and sizes.
Key Steps:
1. Random Sampling:
A random sample of data points is selected to reduce computational overhead.
2. Initial Clustering:
Apply an agglomerative clustering algorithm (like hierarchical clustering) to group the
sampled points into initial clusters.
3. Select Representative Points:
For each cluster, several representative points are chosen. These points are spread
out within the cluster and shrink closer to the cluster centroid by a predefined factor.
4. Merge Clusters:
Use the representative points to determine the distance between clusters and merge
clusters based on a threshold or linkage criterion.
Advantages:
• Handles large datasets effectively.
• Works well with clusters of arbitrary shapes and sizes.
• Robust to noise and outliers.
Example:
Consider a dataset with points distributed in circular and elongated clusters.
• Traditional hierarchical clustering might fail to merge parts of the elongated cluster.
• CURE selects representative points (e.g., the outermost points of each cluster),
shrinks them toward the centroid, and computes distances based on these
representatives. This ensures correct identification and merging of clusters, even for
non-spherical shapes.
2. Chameleon
Overview:
Chameleon is another hierarchical clustering algorithm designed to overcome the
limitations of static models that assume clusters are uniform in shape, size, and density. It
dynamically adjusts to the characteristics of the data using two phases.
Key Steps:
1. Graph Partitioning:
o Model the data as a k-nearest neighbour (k-NN) graph, where each data point
is a node, and edges represent the similarity between points.
o Partition this graph into smaller sub-clusters using a graph partitioning
algorithm like METIS.
2. Cluster Merging:
o Iteratively merge sub-clusters based on two criteria:
▪ Inter-connectivity: Measures how strongly the points in one cluster are
connected to another.
▪ Relative Closeness: Evaluates how similar the two clusters are in density
and size.
o This ensures that merged clusters preserve natural data characteristics.
Advantages:
• Adapts to complex data distributions.
• Balances inter-connectivity and relative closeness, allowing the discovery of clusters
of varying shapes, sizes, and densities.
Example:
Consider a dataset with clusters of varying densities.
• In the graph partitioning step, Chameleon divides the dataset into fine-grained sub-
clusters based on the k-NN graph.
• During merging, clusters with high inter-connectivity and similar densities are
combined. This ensures that clusters with different densities or irregular shapes are
correctly identified, unlike traditional hierarchical methods.
Comparison:
| Feature | CURE | Chameleon |
| --- | --- | --- |
| Key Focus | Representative points for arbitrary shapes. | Adaptive merging for complex characteristics. |
| Clustering Approach | Merges clusters based on representative points. | Merges based on inter-connectivity and relative closeness. |
| Dataset Size | Suitable for large datasets. | Suitable for datasets with complex structures. |
| Scalability | Efficient with sampling techniques. | Requires k-NN graph, may be less efficient for extremely large datasets. |

Use Case in Data Warehousing:


Both algorithms are used for segmenting customer data, market basket analysis, or
identifying anomalies in transactional databases.
• CURE is ideal for datasets with irregularly shaped clusters, such as geographical
clustering of customers.
• Chameleon is suitable for datasets with diverse cluster densities, such as clustering
product sales across regions.
2. DENSITY BASED CLUSTERING ALGORITHM
Density-Based Clustering Algorithms: DBSCAN and OPTICS
Density-based clustering methods are designed to identify clusters of arbitrary shapes
and sizes by finding regions in the dataset with high density of data points. These
methods are particularly effective at separating clusters from noise (outliers). Let’s delve
into two widely used density-based clustering algorithms: DBSCAN and OPTICS.

1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)


Overview:
DBSCAN groups data points into clusters based on the notion of density. It defines
clusters as regions of high data point density separated by regions of lower density
(noise). Unlike K-Means, it does not require specifying the number of clusters
beforehand.

Key Concepts:
1. Epsilon (ε):
o A radius that defines the neighbourhood around a data point.
o If other points lie within this radius, they are considered part of the same
cluster.
2. MinPts:
o The minimum number of points required to form a dense region (cluster).
3. Core Points:
o Points that have at least MinPts neighbours within the ε-radius.
4. Border Points:
o Points that have fewer than MinPts neighbours but are within the ε-radius of a
core point.
o These points are on the edge of a cluster.
5. Noise Points:
o Points that do not belong to any cluster.
Steps in DBSCAN:
1. Start with an Unvisited Point:
o Pick a random point from the dataset. If it has at least MinPts neighbours
within ε-radius, mark it as a core point and create a new cluster.
2. Expand the Cluster:
o Add all points within ε-radius to the cluster.
o If any of these points are core points, repeat the process to include their
neighbours.
3. Mark Border Points and Noise:
o If a point is not a core point but lies within the ε-radius of a core point, mark it
as a border point.
o Points that are not reachable from any core point are considered noise.
4. Repeat:
o Continue until all points are visited.
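
A minimal scikit-learn sketch of DBSCAN; the eps and min_samples values are illustrative and typically need tuning for the dataset at hand.

```python
# Sketch: DBSCAN on a non-spherical dataset; label -1 marks noise points.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("Cluster labels found:", set(db.labels_))
print("Number of noise points:", list(db.labels_).count(-1))
```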

Advantages of DBSCAN:
• Can find clusters of arbitrary shapes (e.g., circular or elongated clusters).
• Effectively handles noise and outliers.
• Does not require specifying the number of clusters (k).

Limitations:
• Sensitive to the choice of parameters ε and MinPts.
• Struggles with datasets where cluster densities vary significantly.
• Computationally expensive for very large datasets.

Applications:
• Geospatial data analysis (e.g., detecting hotspots in GPS data).
• Image segmentation.
• Identifying anomalies or outliers in datasets.
2. OPTICS (Ordering Points To Identify the Clustering Structure)
Overview:
OPTICS is an extension of DBSCAN that addresses its sensitivity to the choice of the
ε parameter. Instead of forming fixed clusters for a specific ε, OPTICS
creates an ordering of the data points, representing the clustering structure at different
density levels.

Key Concepts:
1. Core Distance:
o The smallest distance at which a point becomes a core point (i.e., it has MinPts
neighbours within that distance).
2. Reachability Distance:
o The distance needed to reach a point from a core point.
o It is either the core distance or the actual distance, whichever is greater.
3. Cluster Order:
o OPTICS orders points based on their reachability distance to reveal the
clustering structure at varying densities.

Steps in OPTICS:
1. Compute Core Distance:
o For each point, calculate its core distance based on ε and MinPts.
2. Expand Clusters:
o Similar to DBSCAN, start with an unvisited point and expand the cluster by
visiting its neighbours.
o Record the reachability distances of all visited points.
3. Output Cluster Ordering:
o Once all points are processed, OPTICS produces a reachability plot, which
visually represents the clustering structure.
4. Extract Clusters:
o Analyse the reachability plot to extract clusters of different densities by
identifying valleys (low reachability distances) and separating them with peaks
(high reachability distances).
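
A minimal scikit-learn sketch of OPTICS; min_samples and xi are illustrative values, and the varying cluster_std mimics clusters of different densities.

```python
# Sketch: OPTICS on clusters of mixed density; label -1 marks noise points.
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.5, 0.3], random_state=0)

opt = OPTICS(min_samples=10, xi=0.05).fit(X)
print("Clusters found:", set(opt.labels_))
# opt.reachability_[opt.ordering_] gives the reachability-plot values,
# whose valleys correspond to clusters of different densities.
```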
Advantages of OPTICS:
• Handles datasets with varying densities better than DBSCAN.
• Produces a clustering hierarchy that can be used to analyse clusters at multiple scales.
• Provides more flexibility in understanding the structure of the data.

Limitations:
• Slower than DBSCAN due to the additional computations for core and reachability
distances.
• Requires careful interpretation of the reachability plot.

Applications:
• Hierarchical clustering in spatial databases.
• Analysing customer behaviour in e-commerce.
• Identifying natural groups in datasets with mixed-density clusters.

Comparison: DBSCAN vs. OPTICS


| Aspect | DBSCAN | OPTICS |
| --- | --- | --- |
| Parameter Sensitivity | Sensitive to a fixed ε | Flexible with varying densities |
| Output | Fixed clusters and noise points | Reachability plot and hierarchical clusters |
| Scalability | Faster for fixed-density datasets | Slower but more versatile for mixed densities |
| Complexity | O(n log n) for efficient versions | O(n²) in worst cases |

Conclusion:
• Use DBSCAN for datasets with well-defined clusters of uniform density and when you
need a simple clustering solution.
• Use OPTICS for datasets with varying densities and when a deeper understanding of
the clustering structure is required.
3. GRID BASED CLUSTERING ALGORITHM
Grid-based clustering algorithms like STING (Statistical Information Grid) and CLIQUE
(CLustering In QUEst) are powerful techniques for analysing large datasets, especially in
spatial data mining. Here's an explanation with examples:

1. STING (Statistical Information Grid)


Overview:
STING is a grid-based clustering algorithm that divides the data space into a hierarchical
grid structure. It uses statistical information stored in these grid cells to identify clusters.
Key Steps:
1. Grid Creation: The data space is divided into a grid structure with cells at different
levels of granularity.
2. Statistical Summaries: Each cell stores statistical summaries (mean, variance, min,
max, etc.) of the data it contains.
3. Query Processing:
o If a query region intersects multiple cells, STING aggregates the statistical
information instead of accessing the raw data.
4. Cluster Formation:
o Higher-level cells combine information from lower-level cells.
o Cells with sufficient density of data points are identified as clusters.
Advantages:
• Efficient as it doesn't need to scan the entire dataset.
• Hierarchical structure allows for multi-resolution clustering.
Example:
Imagine we want to cluster geographical locations of customers based on their purchase
frequency:
• Divide the map into a grid (e.g., 10x10 cells).
• Each cell stores:
o Average purchase frequency of customers in that cell.
o Variance and range of purchase frequency.
• Identify regions with high purchase frequency (clusters) based on density thresholds.
2. CLIQUE (Clustering In QUEst)
Overview:
CLIQUE is a grid-based and subspace clustering algorithm that identifies clusters in both
high-dimensional spaces and their subspaces.
Key Steps:
1. Grid Partitioning:
o The data space is divided into non-overlapping rectangular units (grid cells).
2. Density Calculation:
o Each grid cell's density is calculated based on the number of points it contains.
3. Thresholding:
o Cells with densities above a specified threshold are considered dense.
4. Cluster Formation:
o Adjacent dense cells are combined to form clusters.
o CLIQUE identifies clusters in subspaces (not just the full space), making it useful
for high-dimensional data.
Advantages:
• Automatically finds clusters in subspaces without requiring prior knowledge.
• Scales well with large datasets and high dimensions.
Example:
Consider a dataset of customer attributes: age, income, and spending score:
1. Partition the data space into grid cells (e.g., age: 20-30, income: $40k-$50k).
2. Compute densities for each cell. For instance:
o Age 20-30 & Income $40k-$50k might have 500 customers (dense).
o Age 30-40 & Income $30k-$40k might have 50 customers (sparse).
3. Combine adjacent dense cells to form clusters.
o E.g., Age 20-30 & Income $40k-$50k could merge with Age 20-30 & Income
$50k-$60k to form a cluster.
Comparison: STING vs CLIQUE
| Feature | STING | CLIQUE |
| --- | --- | --- |
| Type | Grid-based clustering | Grid-based and subspace clustering |
| Primary Focus | Statistical summaries | Subspace clusters in high dimensions |
| Cluster Formation | Density-based | Density-based + Subspaces |
| Efficiency | Very efficient for spatial data | Scales well for high dimensions |
