
Classification in Data Mining
• Classification is a core data mining technique used to
assign data instances to predefined classes or
categories based on a training dataset.
• It is a type of supervised learning where the outcome
(class label) is already known for the training data.
• The goal is to build a model that can accurately
predict the class labels for new, unseen data.
Key Characteristics of Classification
1.Supervised Learning:
Requires labeled data (data with known outcomes).
2.Discrete Output:
The target variable (class label) is categorical, e.g., "Yes/No"
or "Spam/Not Spam."
3.Predictive Modeling:
Focuses on predicting the category of new data instances.
4.Feature Space:
Classification uses one or more input features to determine
the output class.
Steps in Classification Process
The classification process generally follows these steps:
1.Data Collection:
Gather labeled data for training.
2.Data Preprocessing:
1.Handle missing values.
2.Normalize or standardize features.
3.Perform feature selection to remove irrelevant features.
3.Model Building:
1.Use the training dataset to build the classification model.
2.Train the model using an appropriate algorithm (e.g.,
Decision Tree, Naïve Bayes).
4. Model Evaluation:
Test the model on unseen data (testing dataset).
Evaluate performance using metrics like accuracy,
precision, recall, and F1-score.
5. Prediction:
Apply the model to classify new data instances.
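A minimal sketch of this workflow, assuming scikit-learn is available and using its built-in Iris dataset with a decision tree as the classifier:

```python
# Sketch of the classification workflow: data, preprocessing, training,
# evaluation, and prediction (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Data collection: load a labeled dataset.
X, y = load_iris(return_X_y=True)

# 2. Preprocessing: standardize features (not strictly required for trees,
#    shown only to illustrate the step).
X = StandardScaler().fit_transform(X)

# 3. Model building: hold out a test set and train the classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# 4. Model evaluation on unseen data.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# 5. Prediction for a new, unseen instance.
print("Predicted class:", model.predict(X_test[:1]))
```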
Applications of Classification
1.Spam Filtering:
Classify emails as "Spam" or "Not Spam."
2.Fraud Detection:
Identify fraudulent transactions.
3.Medical Diagnosis:
Predict diseases based on patient data.
4.Customer Churn Prediction:
Determine if a customer is likely to leave a service.
5.Sentiment Analysis:
Classify customer reviews as "Positive," "Neutral," or
"Negative."
Popular Classification Algorithms
• Several algorithms are used for classification in data
mining; the most commonly used include:
Decision Trees
• Builds a tree-like structure where nodes represent features,
branches represent decision rules, and leaves represent
class labels.
Advantages:
• Simple to understand and interpret.
• Handles both numerical and categorical data.
Example:
Predicting loan approval based on income and credit
history.
Naïve Bayes
• A probabilistic classifier based on Bayes’ theorem, assuming that all features are
independent.
Advantages:
• Works well with small datasets.
• Efficient for high-dimensional data.
Example:
Classifying news articles into categories like "Sports" or "Politics."
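A small illustrative sketch of Naïve Bayes text classification with scikit-learn; the tiny training corpus below is made up purely for illustration:

```python
# Naïve Bayes text classification sketch (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical training corpus with known labels.
docs = ["the team won the match", "elections held next week",
        "player scores a goal", "parliament passes new law"]
labels = ["Sports", "Politics", "Sports", "Politics"]

# Convert text to word-count features; treating each word independently
# mirrors the Naïve Bayes feature-independence assumption.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

model = MultinomialNB()
model.fit(X, labels)

# Classify a new article.
new_doc = vectorizer.transform(["the match ends in a draw"])
print(model.predict(new_doc))  # expected: ['Sports']
```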

Logistic Regression
• A statistical method used for binary classification problems. It models the
probability of the target class using a logistic function.
Advantages:
• Easy to implement and interpret.
• Suitable for linearly separable data.
Example:
Predicting whether a customer will purchase a product or not.
k-Nearest Neighbors (k-NN)
• A simple algorithm that classifies a data point based on the majority vote
of its k nearest neighbors.
Advantages:
• No need for training phase (lazy learning).
• Works well with low-dimensional data.
Example:
Recognizing handwritten digits.
Support Vector Machines (SVM)
• Finds a hyperplane that best separates the classes in the feature space.
Advantages:
• Effective in high-dimensional spaces.
• Works well with non-linear boundaries using kernel functions.
Example:
Image classification.
Random Forest
• An ensemble method that combines multiple decision trees to make robust
predictions.
Advantages:
• Reduces overfitting.
• Handles large datasets and high-dimensional data.
Example:
Fraud detection in banking.
Neural Networks
• Uses layers of interconnected nodes (neurons) to model complex
relationships.
Advantages:
• Can handle non-linear relationships.
• Scales well with large datasets.
Example:
Speech recognition or image classification.
Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
• Ensemble techniques that iteratively improve model performance by
minimizing errors.
Advantages:
• High accuracy for structured data.
• Handles missing data effectively.
Example:
Predicting customer churn.
Evaluation Metrics for Classification
• Evaluating the performance of classification models is crucial.
Commonly used metrics include:
Accuracy:
Measures the proportion of correct predictions.
Accuracy = Number of Correct Predictions / Total Number of Predictions
Precision:
Focuses on the proportion of true positive predictions among all
positive predictions.
Precision = True Positives / (True Positives + False Positives)
Recall (Sensitivity):
Measures the proportion of actual positives identified.
Recall = True Positives / (True Positives + False Negatives)
F1-Score:
The harmonic mean of precision and recall.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
ROC-AUC:
Evaluates the trade-off between true positive rate
and false positive rate.
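As a quick illustration, these metrics can be computed directly from confusion-matrix counts; the counts below are made up:

```python
# Computing the metrics above from raw confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 5, 45  # illustrative counts only

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1_score  = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.850
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.889
print(f"F1-score:  {f1_score:.3f}")   # 0.842
```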
Challenges in Classification
1.Class Imbalance:
When one class dominates the dataset, it can bias the model.
Solution: Use techniques like oversampling, undersampling, or class
weighting.
2.Overfitting:
The model performs well on training data but poorly on testing data.
Solution: Use regularization, cross-validation and simpler models.
3.Feature Selection:
Irrelevant or redundant features can reduce model accuracy.
Solution: Use feature selection techniques like PCA or LASSO.
4.Noisy Data:
Inaccurate data can mislead the model.
Solution: Perform data cleaning and outlier detection.
Comparison with Other Techniques
• Classification vs. Regression:
Classification predicts categorical outcomes, while
regression predicts continuous values.
• Classification vs. Clustering:
Classification is supervised learning, while clustering
is unsupervised and groups data based on similarity.
Clustering in Data Mining
• Clustering is an unsupervised learning technique
used in data mining to group similar data points into
clusters.
• Unlike classification, clustering does not require
labeled data.
• The objective is to partition the dataset into
meaningful groups where data points in the same
cluster are more similar to each other than to those in
other clusters.
Key Characteristics of Clustering
1.Unsupervised Learning:
No predefined labels or classes are required.
2.Similarity:
Grouping is based on similarity or distance measures such
as Euclidean distance or cosine similarity.
3.Partitioning:
Clusters are often non-overlapping, but some methods
allow overlapping clusters (e.g., fuzzy clustering).
4.Exploratory Analysis:
Often used to explore patterns and structures in the data.
Applications of Clustering
Clustering is widely applied in various fields, including:
1.Market Segmentation:
Group customers based on purchasing behavior.
2.Document Clustering:
Organize documents into topics.
3.Image Segmentation:
Partition an image into meaningful regions.
4.Anomaly Detection:
Identify outliers as separate clusters (e.g., fraud detection).
5.Genomics:
Group genes or proteins with similar functions.
Types of Clustering Methods
Clustering algorithms are categorized into the following main types:
Partitioning Methods
• Partition the dataset into k non-overlapping clusters, where k is
predefined.
• Example Algorithms:
• k-Means Clustering:
• Partitions data into k clusters by minimizing the within-cluster variance.
• Iterative process: assign points to the nearest cluster center, then update centers.
• Advantages:
• Simple and efficient.
• Works well with large datasets.
• Disadvantages:
• Requires the number of clusters (k) to be specified.
• Sensitive to outliers.
• k-Medoids (or PAM - Partitioning Around Medoids):
• Similar to k-means but uses medoids (actual data points) as cluster centers.
• Less sensitive to outliers than k-means.
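A minimal k-means sketch of the partitioning approach above, assuming scikit-learn and using synthetic data (the blob parameters are arbitrary):

```python
# k-means clustering sketch (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with 3 underlying groups, purely for illustration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k must be specified up front; here k = 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centers:\n", kmeans.cluster_centers_)
print("Within-cluster sum of squares (inertia):", kmeans.inertia_)
```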
Hierarchical Methods
Builds a hierarchy of clusters in a tree-like structure (dendrogram).
Types:
• Agglomerative (Bottom-Up):
• Starts with each data point as a single cluster.
• Merges clusters iteratively until a single cluster remains.
• Divisive (Top-Down):
• Starts with all data points in one cluster.
• Splits clusters iteratively until each point is a separate cluster.
Advantages:
• Does not require the number of clusters to be predefined.
• Provides a visual representation of cluster relationships.
Disadvantages:
• Computationally expensive for large datasets.
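A short agglomerative (bottom-up) sketch using SciPy's hierarchical clustering utilities, assumed available; the small dataset is made up:

```python
# Agglomerative hierarchical clustering sketch (assumes SciPy and NumPy).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Small made-up 2-D dataset with two visually separate groups.
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8]])

# Build the merge hierarchy (Ward linkage minimizes within-cluster variance).
Z = linkage(X, method="ward")

# Cut the dendrogram to obtain a flat clustering with 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]

# dendrogram(Z) would draw the tree if matplotlib is available.
```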
Density-Based Methods
Clusters are formed based on areas of high data density.
Example Algorithms:
• DBSCAN (Density-Based Spatial Clustering of Applications with
Noise):
• Groups points that are closely packed together.
• Marks points in low-density regions as noise (outliers).
• Advantages:
• Handles noise and irregularly shaped clusters.
• Does not require the number of clusters to be predefined.
• Disadvantages:
• Sensitive to parameters (e.g., ε, the neighborhood radius).
• OPTICS (Ordering Points To Identify Clustering
Structure):
• Extends DBSCAN to handle varying densities.
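A DBSCAN sketch with scikit-learn; the eps and min_samples values are arbitrary and would need tuning for real data:

```python
# DBSCAN density-based clustering sketch (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (a likely outlier).
X = np.array([[1, 1], [1.1, 1], [0.9, 1.2],
              [5, 5], [5.1, 4.9], [4.9, 5.2],
              [10, 10]])

# eps: neighborhood radius; min_samples: points needed for a dense region.
db = DBSCAN(eps=0.5, min_samples=2)
labels = db.fit_predict(X)

# Label -1 marks noise (points in low-density regions).
print(labels)  # e.g. [0 0 0 1 1 1 -1]
```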
Grid-Based Methods
• The data space is divided into a grid structure, and clusters are
formed based on dense grid cells.
Example Algorithms:
• STING (Statistical Information Grid):
• Divides the data space into hierarchical grid cells and
aggregates statistics.
• CLIQUE (Clustering in QUEST):
• Combines grid-based and density-based approaches for
high-dimensional data.
Advantages:
• Efficient for large datasets.
Disadvantages:
• May lose information due to grid approximation.
Model-Based Methods
• Assumes the data is generated by a mixture of underlying
probability distributions (e.g., Gaussian distributions).
Example Algorithms:
• Gaussian Mixture Models (GMM):
• Uses the Expectation-Maximization (EM) algorithm to
model clusters as Gaussian distributions.
• BIRCH (Balanced Iterative Reducing and Clustering Using
Hierarchies):
• Efficient for large datasets and hierarchical clustering.
Advantages:
• Can handle overlapping clusters.
Disadvantages:
• Requires assumptions about the data distribution.
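A Gaussian Mixture Model sketch with scikit-learn; EM runs internally when fit() is called, and the soft memberships illustrate how overlapping clusters are handled:

```python
# Model-based clustering sketch with a Gaussian Mixture Model
# (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely component per point
soft_probs  = gmm.predict_proba(X)  # membership probabilities (overlap allowed)
print(hard_labels[:5])
print(soft_probs[:2].round(3))
```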
Fuzzy Clustering
Allows data points to belong to multiple clusters with varying
degrees of membership.
Example Algorithm:
• Fuzzy C-Means (FCM):
• Assigns membership probabilities to each point for all
clusters.
Advantages:
• Handles overlapping clusters.
Disadvantages:
• Computationally expensive.
Distance and Similarity Measures in Clustering
• Clustering algorithms rely on measuring similarity
between data points. Common measures include:
1.Euclidean Distance: √(∑(xᵢ − yᵢ)²)
2.Manhattan Distance: ∑|xᵢ − yᵢ|
3.Cosine Similarity: (x · y) / (‖x‖ ‖y‖)
4.Jaccard Similarity: |A ∩ B| / |A ∪ B| (used for binary data).
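These measures can be computed directly with NumPy; a small sketch on two illustrative vectors:

```python
# Common distance / similarity measures (assumes NumPy).
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 5.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))               # √∑(xᵢ − yᵢ)²
manhattan = np.sum(np.abs(x - y))                        # ∑|xᵢ − yᵢ|
cosine_sim = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Jaccard similarity on binary vectors: |A ∩ B| / |A ∪ B|.
a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)
jaccard = (a & b).sum() / (a | b).sum()

print(euclidean, manhattan, cosine_sim, jaccard)
```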
Evaluation Metrics for Clustering
• Unlike classification, clustering evaluation is challenging because
there are no predefined labels. Metrics include:
1.Internal Evaluation (Based on Intrinsic Properties):
• Silhouette Coefficient: Measures how similar a point is to its
cluster compared to others.
• Dunn Index: Evaluates compactness and separation of clusters.
2.External Evaluation (Based on Ground Truth):
• Rand Index: Compares the clustering result with a ground
truth.
• Adjusted Rand Index (ARI): Adjusts for chance groupings.
3.Cluster Validation:
• Use Elbow Method for k-means to find the optimal number of
clusters by plotting within-cluster sum of squares (WCSS).
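A sketch combining the silhouette coefficient and the elbow method with scikit-learn; the candidate range of k is arbitrary:

```python
# Cluster validation sketch: silhouette score and elbow (WCSS) curve
# (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wcss = km.inertia_                        # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)     # higher is better (max 1)
    print(f"k={k}: WCSS={wcss:.1f}, silhouette={sil:.3f}")

# The "elbow" in the WCSS values and the peak silhouette both point
# toward a reasonable choice of k (here, around 4).
```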
Challenges in Clustering
1.Determining the Number of Clusters:
• Many algorithms require specifying the number of clusters (e.g., k-
means).
2.Scalability:
• Clustering large datasets can be computationally expensive.
3.Handling Noisy and Outlier Data:
• Outliers can distort clustering results.
4.High-Dimensional Data:
• Distance measures become less meaningful in high dimensions
("curse of dimensionality").
5.Cluster Shape:
• Algorithms like k-means struggle with non-spherical clusters.
Comparison of Clustering vs. Classification

Aspect          Clustering                   Classification
Learning Type   Unsupervised                 Supervised
Output          Groups or clusters           Predefined class labels
Input Data      Unlabeled                    Labeled
Objective       Discover hidden patterns     Assign instances to classes


Association Rule Learning in Data Mining
• Association Rule Learning is a rule-based machine
learning method used to discover interesting
relationships or patterns among items in large
datasets.
• It is particularly popular in transactional databases,
such as market basket analysis, where the goal is to
identify items frequently purchased together.
Key Terminology
Itemset: A collection of one or more items (e.g., {bread, milk}).
Support: Measures how frequently an itemset appears in the dataset.
Support(A) = Number of transactions containing A / Total number of transactions
Confidence: Measures the likelihood of occurrence of itemset B given
that itemset A has occurred.
Confidence(A → B) = Support(A ∪ B) / Support(A)
Lift: Measures the strength of the association rule compared to random
co-occurrence of A and B.
Lift(A → B) = Confidence(A → B) / Support(B)
Lift > 1: A and B are positively correlated.
Lift = 1: A and B are independent.
Lift < 1: A and B are negatively correlated.
Phases of Association Rule Learning
1.Frequent Itemset Generation:
• Identify all itemsets that satisfy a minimum support
threshold.
• This reduces the search space by focusing on
frequent itemsets.
2.Association Rule Generation:
• Generate rules from the frequent itemsets that
meet the minimum confidence threshold.
Algorithms for Association Rule Learning
Apriori Algorithm
• Iteratively identifies frequent itemsets by pruning infrequent ones.
• Based on the Apriori Property: "If an itemset is frequent, all of its
subsets must also be frequent."
• Steps:
• Generate candidate itemsets of size k (Ck).
• Count their support in the dataset.
• Prune itemsets with support less than the threshold.
ECLAT (Equivalence Class Clustering and Bottom-Up Lattice Traversal)
• Uses a vertical data format where each item is represented by the list
of transactions containing it.
• Efficiently finds frequent itemsets through intersection.
FP-Growth (Frequent Pattern Growth)
• Uses a tree-based structure (FP-tree) to encode the dataset.
• Avoids candidate generation by recursively building the FP-tree.
• More memory-efficient and faster than Apriori for large datasets.
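As an illustration of the two-phase process (frequent itemsets, then rules), the mlxtend library offers an Apriori implementation; a sketch, assuming mlxtend and pandas are installed and using its commonly documented API:

```python
# Frequent itemsets and association rules with mlxtend (assumed installed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["Bread", "Butter", "Milk"],
                ["Bread", "Butter"],
                ["Bread", "Milk"],
                ["Butter", "Milk"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Phase 1: frequent itemsets above a 50% support threshold.
frequent = apriori(df, min_support=0.5, use_colnames=True)

# Phase 2: rules above a 60% confidence threshold (lift is reported too).
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```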
Applications of Association Rule Learning
1.Market Basket Analysis:
• Example: "If a customer buys bread, they are likely to buy butter."
• Used for cross-selling and promotional strategies.
2.Recommendation Systems:
• Suggest items based on associations, such as recommending movies or books.
3.Healthcare:
• Identify patterns in patient symptoms and treatments.
4.Fraud Detection:
• Spot unusual combinations of transactions that could indicate fraud.
5.Web Usage Mining:
• Analyze user behavior to improve website navigation or suggest relevant
content.
Advantages of Association Rule Learning
1.Provides interpretable and actionable rules.
2.Uncovers hidden patterns in large datasets.
3.Suitable for exploratory data analysis.
Challenges in Association Rule Learning
1.Scalability:
Large datasets can lead to an exponential number of itemsets.
Algorithms like FP-Growth address this issue.
2.Rare Item Problem:
Rules with low support may still be meaningful but can be
missed.
3.Redundancy:
Many rules may overlap, leading to unnecessary complexity.
4.Evaluation of Interestingness:
Not all high-confidence rules are useful; additional criteria like
lift or conviction are needed.
Example
• Consider a transactional dataset:
Transaction ID   Items Purchased
1                Bread, Butter, Milk
2                Bread, Butter
3                Bread, Milk
4                Butter, Milk
Step 1:
Frequent Itemsets
Using a support threshold of 50%, the frequent itemsets include:
• {Bread} (Support: 75%)
• {Butter} (Support: 75%)
• {Milk} (Support: 75%)
• {Bread, Butter} (Support: 50%)
Step 2:
Generate Rules
• Rule: Bread → Butter (Confidence: 50/75 = 66.7%)
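A plain-Python check of these numbers by direct counting (no external libraries needed):

```python
# Verifying the worked example: support, confidence, and lift by counting.
transactions = [{"Bread", "Butter", "Milk"},
                {"Bread", "Butter"},
                {"Bread", "Milk"},
                {"Butter", "Milk"}]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

print(support({"Bread"}))            # 0.75
print(support({"Bread", "Butter"}))  # 0.5

# Confidence(Bread -> Butter) = Support({Bread, Butter}) / Support({Bread})
confidence = support({"Bread", "Butter"}) / support({"Bread"})
print(round(confidence, 3))          # 0.667

# Lift(Bread -> Butter) = Confidence(Bread -> Butter) / Support({Butter})
lift = confidence / support({"Butter"})
print(round(lift, 3))                # 0.889, i.e. slightly below independence
```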
Anomaly Detection
Anomaly detection is a process in data mining aimed at
identifying rare items, events or observations that deviate
significantly from the majority of the data.
These anomalies are often of significant interest as they may
indicate critical actionable insights, such as fraud detection,
fault diagnosis or security breaches.
Key Concepts in Anomaly Detection
Definition of an Anomaly:
• An anomaly (or outlier) is an observation that does not
conform to the expected pattern or other observations in
a dataset.
Example:
• Unusually high transaction amounts in a banking dataset
might indicate fraudulent activity.
Types of Anomalies:
Point Anomalies:
Single data points that are significantly different from the rest.
Example:
A temperature reading of 100°C in a dataset of room temperatures.
Contextual Anomalies:
Data points that are only anomalous within a specific context.
Example:
A temperature of 30°C might be normal in summer but anomalous
in winter.
Collective Anomalies:
A group of data points that deviate from the expected pattern, even
if individual points may not.
Example:
A sudden spike in network traffic.
Techniques for Anomaly Detection
Statistical Methods:
• Assumes that normal data points follow a statistical distribution (e.g., Gaussian).
• Uses measures like z-scores or Grubbs' test to identify anomalies.
• Challenges: Limited by the distributional assumption; these methods also
struggle with high-dimensional data.
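A z-score sketch with NumPy; the threshold of 3 standard deviations is a common rule of thumb, not a fixed rule, and the data below is synthetic:

```python
# Statistical anomaly detection with z-scores (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
# 200 "normal" readings around 10, plus one injected anomaly.
data = np.concatenate([rng.normal(loc=10, scale=0.5, size=200), [25.0]])

z_scores = (data - data.mean()) / data.std()

# Flag points more than 3 standard deviations from the mean.
threshold = 3
anomalies = data[np.abs(z_scores) > threshold]
print(anomalies)  # the injected value 25.0
```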
Machine Learning Methods:
• Supervised Learning: Requires labeled data with anomalies explicitly
marked.
• Examples: Decision trees, support vector machines (SVMs).
• Limitation: Labeled data is often scarce.
• Unsupervised Learning: Identifies anomalies without labeled data by
assuming that anomalies are rare and different.
• Examples: Clustering (e.g., k-means, DBSCAN), autoencoders.
• Semi-supervised Learning: Trains on a dataset containing mostly
normal data, then detects deviations.
Proximity-Based Methods:
• Detect anomalies based on their distance from other data points.
• Techniques:
• k-Nearest Neighbors (k-NN): Anomalies are points far from their
neighbors.
• Local Outlier Factor (LOF): Measures the local density deviation of
a given data point.
• Advantages: Simple to understand and implement.
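A Local Outlier Factor sketch with scikit-learn; n_neighbors and the toy points are illustrative choices:

```python
# Proximity-based anomaly detection with Local Outlier Factor
# (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two tight groups of points plus one isolated point.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.95],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1], [5.0, 5.05],
              [9.0, 9.0]])

lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)           # -1 marks outliers, 1 marks inliers
print(labels)
print(lof.negative_outlier_factor_)   # more negative = more anomalous
```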
Density-Based Methods:
• Measure the density of data points; anomalies occur in low-density
regions.
Examples:
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Deep Learning Methods:
• Suitable for complex, high-dimensional datasets.
Examples:
• Autoencoders: Neural networks trained to reconstruct input data.
• Anomalies result in high reconstruction errors.
• Generative Adversarial Networks (GANs): Can be used for generating normal
data distributions and flagging deviations.
Time Series Anomaly Detection:
• Focuses on detecting anomalies in time-dependent data.
Examples:
• ARIMA, LSTM-based models.
ARIMA stands for AutoRegressive Integrated Moving Average, a
statistical modeling technique used for analyzing and forecasting time
series data. ARIMA is widely applied in time series analysis to predict
future points by understanding patterns from past observations.
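A residual-based sketch using statsmodels' ARIMA (assumed installed): fit the model, then flag observations whose forecast residuals are unusually large. The series, the (1, 1, 1) order, and the 3-sigma cutoff are all illustrative choices.

```python
# Time-series anomaly detection sketch: fit ARIMA, flag large residuals
# (assumes statsmodels and NumPy).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic series: upward drift plus noise, with one injected spike.
series = np.cumsum(rng.normal(0.5, 1.0, size=200))
series[120] += 15.0  # the anomaly we hope to recover

model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()

resid = fitted.resid
# Flag observations whose residual exceeds 3 standard deviations.
threshold = 3 * np.std(resid)
anomalies = np.where(np.abs(resid) > threshold)[0]
print(anomalies)  # expected to include index 120 (and possibly 121)
```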
Applications of Anomaly Detection
1.Fraud Detection:
• Credit card fraud, insurance fraud and insider trading.
2.Network Security:
• Detecting unusual login attempts, DDoS attacks or malware
activity.
3.Healthcare:
• Identifying anomalies in patient health records or medical imaging.
4.Manufacturing:
• Fault detection in equipment through sensor data.
5.Retail:
• Identifying unusual purchasing behavior to optimize inventory.
6.Finance:
• Detecting unusual market behavior or trading anomalies.
Challenges in Anomaly Detection
1.Imbalanced Data:
• Anomalies are often rare, making them difficult to identify.
2.High Dimensionality:
• Large datasets with many features can make traditional methods
ineffective.
3.Concept Drift:
• Data patterns change over time, requiring models to adapt.
4.Scalability:
• Real-time anomaly detection requires scalable and efficient
algorithms.
5.Interpretability:
• Explaining why a data point is flagged as anomalous can be
challenging.
Best Practices for Anomaly Detection
1.Preprocessing:
• Handle missing data, normalize values and remove noise.
2.Feature Engineering:
• Extract meaningful features to improve model
performance.
3.Evaluation Metrics:
• Use precision, recall, F1-score and ROC-AUC to evaluate
anomaly detection models.
4.Hybrid Approaches:
• Combine multiple techniques (e.g., statistical and machine
learning) to improve accuracy.
Applications of Anomaly Detection in Fraud Detection
and Cybersecurity
• Anomaly detection plays a critical role in fraud
detection and cybersecurity by identifying unusual
patterns or behaviors that may indicate malicious
activities.
• These anomalies often signal breaches, fraud or other
security-related concerns that require immediate
attention.
Applications in Fraud Detection
Fraud detection involves identifying deceptive practices to gain
unauthorized benefits.
Anomaly detection helps by uncovering patterns that deviate from
legitimate behavior.
Credit Card Fraud Detection
• Problem: Fraudulent transactions mimic legitimate purchases, making
them hard to detect.
• How Anomaly Detection Helps:
• Identify transactions with unusual attributes, such as abnormally high
amounts or purchases from distant locations.
• Detect patterns in spending behavior that deviate from a cardholder's typical
usage.
Example Techniques:
• Machine learning models like Random Forests or Neural Networks to classify
transactions as normal or anomalous.
Insurance Fraud Detection
• Problem: Fraudulent claims inflate costs for insurance
companies.
• How Anomaly Detection Helps:
• Analyze claim patterns to detect unusual spikes or claims
inconsistent with the policyholder's history.
• Spot repetitive claims using text analysis of claim
descriptions.
Example Techniques:
• Natural Language Processing (NLP) for textual claim
data.
• Clustering to identify suspicious groups of claims.
Online Payment Fraud
• Problem: Fraudulent activities occur in online payment systems, such
as e-wallets and payment gateways.
• How Anomaly Detection Helps:
• Detect unusually high transaction frequencies or large withdrawals.
• Identify suspicious device usage or IP addresses.
• Example Techniques:
• Behavioral analytics using unsupervised learning.
• Real-time anomaly scoring.
Identity Theft Detection
• Problem: Fraudsters impersonate users to access accounts or
services.
• How Anomaly Detection Helps:
• Monitor login attempts and flag unusual IP addresses, device types, or
geolocations.
• Detect abnormal account activity, such as simultaneous logins from different
regions.
• Example Techniques:
• Time-series analysis for account activity.
• User profiling to model normal behavior.
Applications in Cybersecurity
• Cybersecurity involves protecting systems, networks, and data from
attacks.
• Anomaly detection helps in proactively identifying potential security
threats.
Intrusion Detection Systems (IDS)
• Problem: Cyberattacks like hacking, unauthorized access, and
malware infiltration compromise system security.
• How Anomaly Detection Helps:
• Identify unusual network traffic, such as large data transfers or unexplained
connection spikes.
• Detect deviations in user behavior, like accessing restricted areas.
• Example Techniques:
• Signature-based detection for known attack patterns.
• Anomaly-based systems (e.g., k-NN, Support Vector Machines) to flag unknown threats.
Phishing Attack Detection
• Problem: Phishing attacks trick users into revealing sensitive
information.
• How Anomaly Detection Helps:
• Analyze email content and flag messages with suspicious patterns, such as
unusual URLs or misspelled domains.
• Detect anomalous user interactions with links in emails.
• Example Techniques:
• NLP for email and URL analysis.
• Feature-based anomaly detection to assess sender reputation and content features.
Ransomware and Malware Detection
• Problem: Malicious software encrypts or steals sensitive data.
• How Anomaly Detection Helps:
• Detect abnormal file access patterns, such as frequent file modifications.
• Identify unusual processes or scripts running on a system.
• Example Techniques:
• Behavioral analytics on system logs.
• Deep learning for detecting unusual program execution flows.
Distributed Denial of Service (DDoS) Attack Detection
• Problem: Flooding servers with excessive requests to render services
unavailable.
• How Anomaly Detection Helps:
• Identify abnormal spikes in incoming requests to servers.
• Detect unusual IP patterns or geographic origins of traffic.
• Example Techniques:
• Time-series anomaly detection for traffic patterns.
• Statistical methods like entropy-based analysis.
Endpoint Protection
• Problem: Malicious activities on individual devices compromise
security.
• How Anomaly Detection Helps:
• Monitor device logs for anomalous processes or unauthorized applications.
• Detect deviations in user behavior on the endpoint.
• Example Techniques:
• Host-based intrusion detection systems (HIDS).
• Machine learning models to detect anomalies in device activity.
THANK YOU.
