Classification in Data Mining
• Classification is a core data mining technique used to
assign data instances to predefined classes or
categories based on a training dataset.
Logistic Regression
• A statistical method used for binary classification problems. It models the
probability of the target class using a logistic function.
Advantages:
• Easy to implement and interpret.
• Suitable for linearly separable data.
Example:
Predicting whether a customer will purchase a product or not.
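A minimal sketch in Python with scikit-learn (the library choice and the synthetic data are assumptions for illustration):

# Logistic regression on a synthetic binary "will purchase?" dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
# The logistic function yields a class probability, not just a label.
print("P(purchase) for first test row:", clf.predict_proba(X_test[:1])[0, 1])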
k-Nearest Neighbors (k-NN)
• A simple algorithm that classifies a data point based on the majority vote
of its k nearest neighbors.
Advantages:
• No need for training phase (lazy learning).
• Works well with low-dimensional data.
Example:
Recognizing handwritten digits.
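A minimal k-NN sketch on scikit-learn's bundled digits dataset (k = 5 is an illustrative choice):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "fit" essentially stores the training data (lazy learning; no model is built).
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))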
Support Vector Machines (SVM)
• Finds a hyperplane that best separates the classes in the feature space.
Advantages:
• Effective in high-dimensional spaces.
• Works well with non-linear boundaries using kernel functions.
Example:
Image classification.
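A minimal SVM sketch, again on the digits dataset (the RBF kernel is assumed for the non-linear boundary):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel lets the hyperplane separate classes non-linearly in feature space.
svc = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", svc.score(X_test, y_test))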
Random Forest
• An ensemble method that combines multiple decision trees to make robust
predictions.
Advantages:
• Reduces overfitting.
• Handles large datasets and high-dimensional data.
Example:
Fraud detection in banking.
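A minimal random forest sketch on synthetic, fraud-like imbalanced data (the class proportions and tree count are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ~5% positives, loosely mimicking the rarity of fraudulent transactions.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Averaging many decorrelated trees reduces overfitting relative to a single tree.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("accuracy:", rf.score(X_test, y_test))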
Neural Networks
• Uses layers of interconnected nodes (neurons) to model complex
relationships.
Advantages:
• Can handle non-linear relationships.
• Scales well with large datasets.
Example:
Speech recognition or image classification.
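A minimal feed-forward network sketch with scikit-learn's MLPClassifier (the layer sizes are illustrative; speech or image tasks would normally use a deep learning framework):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers of interconnected neurons model non-linear relationships.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print("accuracy:", mlp.score(X_test, y_test))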
Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
• Ensemble techniques that iteratively improve model performance by
minimizing errors.
Advantages:
• High accuracy for structured data.
• Handles missing data effectively.
Example:
Predicting customer churn.
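XGBoost and LightGBM expose similar fit/predict APIs; as one reasonable stand-in, scikit-learn's HistGradientBoostingClassifier is sketched below (the injected missing values demonstrate native NaN handling; all data is synthetic):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X[::10, 0] = np.nan  # inject missing values; this estimator handles NaNs natively
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round fits a new tree to the errors of the ensemble so far.
gb = HistGradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", gb.score(X_test, y_test))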
Evaluation Metrics for Classification
• Evaluating the performance of classification models is crucial.
Commonly used metrics include:
Accuracy:
Measures the proportion of correct predictions.
Accuracy = Number of Correct Predictions / Total Number of Predictions
Precision:
Focuses on the proportion of true positive predictions among all
positive predictions.
Precision = True Positives / (True Positives + False Positives)
Recall (Sensitivity):
Measures the proportion of actual positives identified.
Recall = True Positives / (True Positives + False Negatives)
F1-Score:
The harmonic mean of precision and recall.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
ROC-AUC:
Evaluates the trade-off between true positive rate
and false positive rate.
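A quick sketch computing these metrics with scikit-learn (all labels and scores below are made-up values for illustration):

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # hypothetical ground truth
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hypothetical hard predictions
y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1, 0.8, 0.3]   # hypothetical probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc-auc  :", roc_auc_score(y_true, y_score))  # needs scores, not hard labels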
Challenges in Classification
1. Class Imbalance:
When one class dominates the dataset, it can bias the model.
Solution: Use techniques like oversampling, undersampling, or class weighting.
2. Overfitting:
The model performs well on training data but poorly on testing data.
Solution: Use regularization, cross-validation, and simpler models.
3. Feature Selection:
Irrelevant or redundant features can reduce model accuracy.
Solution: Apply selection or dimensionality-reduction techniques such as LASSO or PCA.
4. Noisy Data:
Inaccurate data can mislead the model.
Solution: Perform data cleaning and outlier detection.
Comparison with Other Techniques
• Classification vs. Regression:
Classification predicts categorical outcomes, while
regression predicts continuous values.
• Classification vs. Clustering:
Classification is supervised learning, while clustering
is unsupervised and groups data based on similarity.
Clustering in Data Mining
• Clustering is an unsupervised learning technique
used in data mining to group similar data points into
clusters.
• Unlike classification, clustering does not require
labeled data.
• The objective is to partition the dataset into
meaningful groups where data points in the same
cluster are more similar to each other than to those in
other clusters.
Key Characteristics of Clustering
1. Unsupervised Learning:
No predefined labels or classes are required.
2. Similarity:
Grouping is based on similarity or distance measures such as Euclidean distance or cosine similarity.
3. Partitioning:
Clusters are often non-overlapping, but some methods allow overlapping clusters (e.g., fuzzy clustering).
4. Exploratory Analysis:
Often used to explore patterns and structures in the data.
Applications of Clustering
Clustering is widely applied in various fields, including:
1. Market Segmentation:
Group customers based on purchasing behavior.
2. Document Clustering:
Organize documents into topics.
3. Image Segmentation:
Partition an image into meaningful regions.
4. Anomaly Detection:
Identify outliers as separate clusters (e.g., fraud detection).
5. Genomics:
Group genes or proteins with similar functions.
Types of Clustering Methods
Clustering algorithms are categorized into the following main types:
Partitioning Methods
• Partition the dataset into k non-overlapping clusters, where k is
predefined.
• Example Algorithms:
• k-Means Clustering:
• Partitions data into k clusters by minimizing the within-cluster variance.
• Iterative process: assign points to the nearest cluster center, then update centers.
• Advantages:
• Simple and efficient.
• Works well with large datasets.
• Disadvantages:
• Requires the number of clusters (k) to be specified.
• Sensitive to outliers.
• k-Medoids (or PAM - Partitioning Around Medoids):
• Similar to k-means but uses medoids (actual data points) as cluster centers.
• Less sensitive to outliers than k-means.
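A minimal k-means sketch with scikit-learn (synthetic blob data; k and the other parameters are illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Iterates between assigning points to the nearest center and updating centers.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centers:", km.cluster_centers_)
print("labels :", km.labels_[:10])
print("WCSS (inertia):", km.inertia_)  # within-cluster sum of squares being minimized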
Hierarchical Methods
These methods build a hierarchy of clusters in a tree-like structure (dendrogram).
Types:
• Agglomerative (Bottom-Up):
• Starts with each data point as a single cluster.
• Merges clusters iteratively until a single cluster remains.
• Divisive (Top-Down):
• Starts with all data points in one cluster.
• Splits clusters iteratively until each point is a separate cluster.
Advantages:
• Does not require the number of clusters to be predefined.
• Provides a visual representation of cluster relationships.
Disadvantages:
• Computationally expensive for large datasets.
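A minimal agglomerative (bottom-up) sketch with scikit-learn, under the same illustrative synthetic-data assumption:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Starts from singleton clusters and merges them; Ward linkage minimizes variance growth.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_[:10])
# scipy.cluster.hierarchy.dendrogram can visualize the full merge tree if desired.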
Density-Based Methods
Clusters are formed based on areas of high data density.
Example Algorithms:
• DBSCAN (Density-Based Spatial Clustering of Applications with
Noise):
• Groups points that are closely packed together.
• Marks points in low-density regions as noise (outliers).
• Advantages:
• Handles noise and irregularly shaped clusters.
• Does not require the number of clusters to be predefined.
• Disadvantages:
• Sensitive to parameters (e.g., ε, the neighborhood radius).
• OPTICS (Ordering Points To Identify Clustering
Structure):
• Extends DBSCAN to handle varying densities.
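A minimal DBSCAN sketch with scikit-learn; make_moons produces the irregularly shaped clusters DBSCAN handles well (ε and min_samples are illustrative choices):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two crescent shapes

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("clusters found:", set(db.labels_) - {-1})      # label -1 marks noise
print("noise points  :", int(np.sum(db.labels_ == -1)))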
Grid-Based Methods
• The data space is divided into a grid structure, and clusters are
formed based on dense grid cells.
Example Algorithms:
• STING (Statistical Information Grid):
• Divides the data space into hierarchical grid cells and
aggregates statistics.
• CLIQUE (CLustering In QUEst):
• Combines grid-based and density-based approaches for
high-dimensional data.
Advantages:
• Efficient for large datasets.
Disadvantages:
• May lose information due to grid approximation.
Model-Based Methods
• Assume the data is generated by a mixture of underlying
probability distributions (e.g., Gaussian distributions).
Example Algorithms:
• Gaussian Mixture Models (GMM):
• Uses the Expectation-Maximization (EM) algorithm to
model clusters as Gaussian distributions.
• BIRCH (Balanced Iterative Reducing and Clustering Using
Hierarchies):
• Efficient for large datasets and hierarchical clustering.
Advantages:
• Can handle overlapping clusters.
Disadvantages:
• Requires assumptions about the data distribution.
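A minimal Gaussian Mixture Model sketch with scikit-learn (synthetic data; the number of components is an assumption):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit via Expectation-Maximization; each component is a Gaussian distribution.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict(X)[:10])        # hard cluster assignments
print(gmm.predict_proba(X)[:2])   # soft memberships, so clusters may overlap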
Fuzzy Clustering
Allows data points to belong to multiple clusters with varying
degrees of membership.
Example Algorithm:
• Fuzzy C-Means (FCM):
• Assigns membership probabilities to each point for all
clusters.
Advantages:
• Handles overlapping clusters.
Disadvantages:
• Computationally expensive.
Distance and Similarity Measures in Clustering
• Clustering algorithms rely on measuring similarity
between data points. Common measures include:
1. Euclidean Distance: √( Σᵢ (xᵢ − yᵢ)² )
2. Manhattan Distance: Σᵢ |xᵢ − yᵢ|
3. Cosine Similarity: (x · y) / (‖x‖ ‖y‖)
4. Jaccard Similarity: |A ∩ B| / |A ∪ B|; used for binary or set-valued data.
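These measures in a few lines of Python (NumPy assumed; the vectors and sets are arbitrary examples):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 4.0])
print("euclidean:", np.sqrt(np.sum((x - y) ** 2)))   # sqrt(1 + 4 + 1) ≈ 2.449
print("manhattan:", np.sum(np.abs(x - y)))           # 1 + 2 + 1 = 4
print("cosine   :", np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a, b = {1, 2, 3}, {2, 3, 4}                          # set-valued (binary) data
print("jaccard  :", len(a & b) / len(a | b))         # 2 / 4 = 0.5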
Evaluation Metrics for Clustering
• Unlike classification, clustering evaluation is challenging because
there are no predefined labels. Metrics include:
1. Internal Evaluation (Based on Intrinsic Properties):
• Silhouette Coefficient: Measures how similar a point is to its cluster compared to others.
• Dunn Index: Evaluates compactness and separation of clusters.
2. External Evaluation (Based on Ground Truth):
• Rand Index: Compares the clustering result with a ground truth.
• Adjusted Rand Index (ARI): Adjusts for chance groupings.
3. Cluster Validation:
• Use the Elbow Method for k-means to find the optimal number of clusters by plotting the within-cluster sum of squares (WCSS).
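A sketch combining the Elbow Method (via WCSS/inertia) and the Silhouette Coefficient with scikit-learn (synthetic data; the range of k is illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Look for the "elbow" where WCSS stops dropping sharply, and a silhouette peak.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, "WCSS:", round(km.inertia_, 1),
          "silhouette:", round(silhouette_score(X, km.labels_), 3))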
Challenges in Clustering
1. Determining the Number of Clusters:
• Many algorithms require specifying the number of clusters (e.g., k-means).
2. Scalability:
• Clustering large datasets can be computationally expensive.
3. Handling Noisy and Outlier Data:
• Outliers can distort clustering results.
4. High-Dimensional Data:
• Distance measures become less meaningful in high dimensions
("curse of dimensionality").
5. Cluster Shape:
• Algorithms like k-means struggle with non-spherical clusters.
Comparison of Clustering vs. Classification
• Classification is supervised learning that assigns data to predefined labels; clustering is unsupervised and discovers groups from similarity alone.
Association Rule Mining: A Worked Example
Consider four market-basket transactions:
TID Items
1 Bread, Butter, Milk
2 Bread, Butter
3 Bread, Milk
4 Butter, Milk
Step 1:
Frequent Itemsets
Using a support threshold of 50%, the frequent itemsets include:
• {Bread} (Support: 75%)
• {Butter} (Support: 75%)
• {Milk} (Support: 75%)
• {Bread, Butter} (Support: 50%)
Step 2:
Generate Rules
• Rule: Bread → Butter (Confidence = support({Bread, Butter}) / support({Bread}) = 50% / 75% = 66.7%)
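The same numbers can be verified in a few lines of Python (standard-library sets only, no external packages):

# Support and confidence computed directly from the four transactions above.
transactions = [
    {"Bread", "Butter", "Milk"},   # 1
    {"Bread", "Butter"},           # 2
    {"Bread", "Milk"},             # 3
    {"Butter", "Milk"},            # 4
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"Bread"}))                 # 0.75
print(support({"Bread", "Butter"}))       # 0.5
confidence = support({"Bread", "Butter"}) / support({"Bread"})
print(round(confidence, 3))               # 0.667, i.e. Bread → Butter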
Anomaly Detection
Anomaly detection is a process in data mining aimed at
identifying rare items, events or observations that deviate
significantly from the majority of the data.
These anomalies are often of significant interest as they may
indicate critical actionable insights, such as fraud detection,
fault diagnosis or security breaches.
Key Concepts in Anomaly Detection
Definition of an Anomaly:
• An anomaly (or outlier) is an observation that does not
conform to the expected pattern or other observations in
a dataset.
Example:
• Unusually high transaction amounts in a banking dataset
might indicate fraudulent activity.
Types of Anomalies:
Point Anomalies:
Single data points that are significantly different from the rest.
Example:
A temperature reading of 100°C in a dataset of room temperatures.
Contextual Anomalies:
Data points that are only anomalous within a specific context.
Example:
A temperature of 30°C might be normal in summer but anomalous
in winter.
Collective Anomalies:
A group of data points that deviate from the expected pattern, even
if individual points may not.
Example:
A sudden spike in network traffic.
Techniques for Anomaly Detection
Statistical Methods:
• Assume that normal data points follow a statistical distribution (e.g., Gaussian).
• Use measures like z-scores or Grubbs' test to identify anomalies.
• Challenges: limited by the distributional assumption; they struggle with high-dimensional data.
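A minimal z-score sketch (NumPy assumed; the data values and the threshold of 2 are illustrative):

import numpy as np

data = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 25.0, 9.7])  # one injected outlier

# Standardize: how many standard deviations each point sits from the mean.
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 2])  # flags the 25.0 reading as anomalous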
Machine Learning Methods:
• Supervised Learning: Requires labeled data with anomalies explicitly
marked.
• Examples: Decision trees, support vector machines (SVMs).
• Limitation: Labeled data is often scarce.
• Unsupervised Learning: Identifies anomalies without labeled data by
assuming that anomalies are rare and different.
• Examples: Clustering (e.g., k-means, DBSCAN), autoencoders.
• Semi-supervised Learning: Trains on a dataset containing mostly
normal data, then detects deviations.
Proximity-Based Methods:
• Detect anomalies based on their distance from other data points.
• Techniques:
• k-Nearest Neighbors (k-NN): Anomalies are points far from their
neighbors.
• Local Outlier Factor (LOF): Measures the local density deviation of
a given data point.
• Advantages: Simple to understand and implement.
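A minimal Local Outlier Factor sketch with scikit-learn (synthetic data; n_neighbors is an illustrative setting):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # dense cloud + one far point

# LOF compares each point's local density to that of its neighbors;
# fit_predict returns -1 for points deemed outliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
print("outlier indices:", np.where(labels == -1)[0])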
Density-Based Methods:
• Measure the density of data points; anomalies fall in low-density regions.
Examples:
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Deep Learning Methods:
• Suitable for complex, high-dimensional datasets.
Examples:
• Autoencoders: Neural networks trained to reconstruct input data.
• Anomalies result in high reconstruction errors.
• Generative Adversarial Networks (GANs): Can be used to model the distribution of normal data and flag deviations from it.
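As a rough illustration of the autoencoder idea, here is a sketch using scikit-learn's MLPRegressor trained to reconstruct its own input through a narrow bottleneck (a simplifying assumption; real autoencoders are usually built in a dedicated deep learning framework):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (500, 8))  # "normal" data

# Bottleneck of 2 units forces the network to learn a compressed representation.
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0).fit(X, X)
normal_err = ((ae.predict(X) - X) ** 2).mean(axis=1)

anomaly = rng.normal(6, 1, (1, 8))  # a point unlike the training data
anomaly_err = ((ae.predict(anomaly) - anomaly) ** 2).mean()
print("typical error:", normal_err.mean(), " anomaly error:", anomaly_err)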
Time Series Anomaly Detection:
• Focuses on detecting anomalies in time-dependent data.
Examples:
• ARIMA, LSTM-based models.
ARIMA stands for AutoRegressive Integrated Moving Average, a
statistical modeling technique used for analyzing and forecasting time
series data. ARIMA is widely applied in time series analysis to predict
future points by understanding patterns from past observations.
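A rough sketch of residual-based anomaly flagging with statsmodels' ARIMA (assuming statsmodels is installed; the model order, synthetic series, and threshold are all illustrative):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)
series[150] += 3.0  # inject a point anomaly

# Fit an ARIMA model, then flag observations with unusually large residuals.
result = ARIMA(series, order=(2, 0, 1)).fit()
z = (result.resid - result.resid.mean()) / result.resid.std()
print(np.where(np.abs(z) > 3)[0])  # should include index 150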
Applications of Anomaly Detection
1. Fraud Detection:
• Credit card fraud, insurance fraud, and insider trading.
2. Network Security:
• Detecting unusual login attempts, DDoS attacks, or malware activity.
3. Healthcare:
• Identifying anomalies in patient health records or medical imaging.
4. Manufacturing:
• Fault detection in equipment through sensor data.
5. Retail:
• Identifying unusual purchasing behavior to optimize inventory.
6. Finance:
• Detecting unusual market behavior or trading anomalies.
Challenges in Anomaly Detection
1. Imbalanced Data:
• Anomalies are often rare, making them difficult to identify.
2. High Dimensionality:
• Large datasets with many features can make traditional methods ineffective.
3. Concept Drift:
• Data patterns change over time, requiring models to adapt.
4. Scalability:
• Real-time anomaly detection requires scalable and efficient algorithms.
5. Interpretability:
• Explaining why a data point is flagged as anomalous can be challenging.
Best Practices for Anomaly Detection
1. Preprocessing:
• Handle missing data, normalize values, and remove noise.
2. Feature Engineering:
• Extract meaningful features to improve model performance.
3. Evaluation Metrics:
• Use precision, recall, F1-score, and ROC-AUC to evaluate anomaly detection models.
4. Hybrid Approaches:
• Combine multiple techniques (e.g., statistical and machine learning) to improve accuracy.
Applications of Anomaly Detection in Fraud Detection
and Cybersecurity
• Anomaly detection plays a critical role in fraud
detection and cybersecurity by identifying unusual
patterns or behaviors that may indicate malicious
activities.
• These anomalies often signal breaches, fraud or other
security-related concerns that require immediate
attention.
Applications in Fraud Detection
Fraud detection involves identifying deceptive practices to gain
unauthorized benefits.
Anomaly detection helps by uncovering patterns that deviate from
legitimate behavior.
Credit Card Fraud Detection
• Problem: Fraudulent transactions mimic legitimate purchases, making
them hard to detect.
• How Anomaly Detection Helps:
• Identify transactions with unusual attributes, such as abnormally high
amounts or purchases from distant locations.
• Detect patterns in spending behavior that deviate from a cardholder's typical
usage.
Example Techniques:
• Machine learning models like Random Forests or Neural Networks to classify
transactions as normal or anomalous.
Insurance Fraud Detection
• Problem: Fraudulent claims inflate costs for insurance
companies.
• How Anomaly Detection Helps:
• Analyze claim patterns to detect unusual spikes or claims
inconsistent with the policyholder's history.
• Spot repetitive claims using text analysis of claim
descriptions.
Example Techniques:
• Natural Language Processing (NLP) for textual claim
data.
• Clustering to identify suspicious groups of claims.
Online Payment Fraud
• Problem: Fraudulent activities occur in online payment systems, such
as e-wallets and payment gateways.
• How Anomaly Detection Helps:
• Detect unusually high transaction frequencies or large withdrawals.
• Identify suspicious device usage or IP addresses.
• Example Techniques:
• Behavioral analytics using unsupervised learning.
• Real-time anomaly scoring.
Identity Theft Detection
• Problem: Fraudsters impersonate users to access accounts or
services.
• How Anomaly Detection Helps:
• Monitor login attempts and flag unusual IP addresses, device types, or
geolocations.
• Detect abnormal account activity, such as simultaneous logins from different
regions.
• Example Techniques:
• Time-series analysis for account activity.
• User profiling to model normal behavior.
Applications in Cybersecurity
• Cybersecurity involves protecting systems, networks, and data from
attacks.
• Anomaly detection helps in proactively identifying potential security
threats.
Intrusion Detection Systems (IDS)
• Problem: Cyberattacks like hacking, unauthorized access, and
malware infiltration compromise system security.
• How Anomaly Detection Helps:
• Identify unusual network traffic, such as large data transfers or unexplained
connection spikes.
• Detect deviations in user behavior, like accessing restricted areas.
• Example Techniques:
• Signature-based detection for known attack patterns.
• Anomaly-based systems (e.g., k-NN, Support Vector Machines) to flag unknown threats.
Phishing Attack Detection
• Problem: Phishing attacks trick users into revealing sensitive
information.
• How Anomaly Detection Helps:
• Analyze email content and flag messages with suspicious patterns, such as
unusual URLs or misspelled domains.
• Detect anomalous user interactions with links in emails.
• Example Techniques:
• NLP for email and URL analysis.
• Feature-based anomaly detection to assess sender reputation and content features.
Ransomware and Malware Detection
• Problem: Malicious software encrypts or steals sensitive data.
• How Anomaly Detection Helps:
• Detect abnormal file access patterns, such as frequent file modifications.
• Identify unusual processes or scripts running on a system.
• Example Techniques:
• Behavioral analytics on system logs.
• Deep learning for detecting unusual program execution flows.
Distributed Denial of Service (DDoS) Attack Detection
• Problem: Flooding servers with excessive requests to render services
unavailable.
• How Anomaly Detection Helps:
• Identify abnormal spikes in incoming requests to servers.
• Detect unusual IP patterns or geographic origins of traffic.
• Example Techniques:
• Time-series anomaly detection for traffic patterns.
• Statistical methods like entropy-based analysis.
Endpoint Protection
• Problem: Malicious activities on individual devices compromise
security.
• How Anomaly Detection Helps:
• Monitor device logs for anomalous processes or unauthorized applications.
• Detect deviations in user behavior on the endpoint.
• Example Techniques:
• Host-based intrusion detection systems (HIDS).
• Machine learning models to detect anomalies in device activity.
THANK YOU.