Dmaclat4 Merged
K-Means vs. K-Medoids
● Cluster center: K-Means uses centroids (the mean of the points); K-Medoids uses medoids (actual data points).
● Outliers: K-Means is sensitive to outliers because the mean is influenced by extreme values; K-Medoids is more robust since its centers are actual points.
● Efficiency: K-Means is efficient for large datasets due to simpler computations; K-Medoids is slower because it must calculate distances between all pairs of points.
● Cluster shape: K-Means works well for spherical clusters; K-Medoids works better for non-spherical and complex clusters.
● Data types: K-Means only works with numerical data (it needs a mean); K-Medoids can work with non-numeric data as well.
● Noise: K-Means is not ideal when clusters are not globular; K-Medoids is better when clusters are non-spherical or contain noise.

Strengths of K-Means:
● Fast and efficient for large datasets (O(nk))
● Works well for spherical clusters
● Lower computational cost than K-Medoids

Strengths of K-Medoids:
● More robust to outliers since it uses actual points (medoids)
● Can work with non-numeric and categorical data
● Better handles non-spherical and complex shapes

Weaknesses of K-Means:
● Sensitive to outliers (the mean is influenced by extreme values)
● Assumes spherical clusters, which is limiting

Weaknesses of K-Medoids:
● Slower due to calculating distances between all pairs of points
● Higher computational complexity (O(n²k))
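To make the contrast concrete, here is a minimal NumPy sketch of one update step of each algorithm; the two-blob data, k = 2, and the random initialization are illustrative assumptions. Note how K-Medoids needs the full pairwise distance matrix, which is exactly the source of its extra cost:

```python
import numpy as np

def kmeans_step(X, centers):
    """One K-Means iteration: assign each point to the nearest center,
    then recompute each center as the MEAN of its cluster."""
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                    else centers[j] for j in range(len(centers))])
    return new, labels

def kmedoids_step(X, medoid_idx):
    """One K-Medoids iteration: assign to the nearest medoid, then pick the
    ACTUAL POINT in each cluster minimizing total distance to its members.
    Needs all pairwise distances -- this is what makes it slower."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # n x n distance matrix
    labels = np.argmin(D[:, medoid_idx], axis=1)
    new = []
    for j, m in enumerate(medoid_idx):
        members = np.where(labels == j)[0]
        new.append(members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
                   if len(members) else m)
    return np.array(new), labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers, _ = kmeans_step(X, X[rng.choice(len(X), 2, replace=False)])
medoids, _ = kmedoids_step(X, rng.choice(len(X), 2, replace=False))
print("K-Means centers (means, need not be data points):\n", centers)
print("K-Medoids centers (indices of actual data points):", medoids)
```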
(b) Strengths and Weaknesses of K-Means/K-Medoids vs. Hierarchical Clustering (e.g., AGNES):
K-Means/K-Medoids:
● Strengths: more efficient and faster for large datasets. K-Means is the faster of the two (O(nk)); K-Medoids is the more robust to outliers.
● Weaknesses: K-Means is sensitive to the initial centroids and assumes spherical clusters; K-Medoids slows down on larger datasets.

Hierarchical Clustering (e.g., AGNES):
● Strengths: does not require the number of clusters to be predefined; provides a hierarchy (dendrogram) that shows cluster relationships.
● Weaknesses: computationally expensive (O(n²)); cannot handle large datasets efficiently; can be affected by noisy data.
Scalability comparison (DBSCAN vs. BIRCH vs. CLIQUE):
● DBSCAN: moderate scalability.
● BIRCH: highly scalable (linear time).
● CLIQUE: highly scalable (suited to high-dimensional data).
Diagram references: DBSCAN example, BIRCH CF-tree, CLIQUE grids.
Theory / Concept Questions
1. Differentiate between agglomerative and divisive hierarchical clustering.
Illustrate each method with a simple example.
Agglomerative (e.g., AGNES):
● Each object starts in its own cluster.
● Merges the closest clusters step by step; stops when all objects form one cluster.
● Does not require pre-specifying the number of clusters.
● Forms one big cluster at the end.
● Sensitive to noise and outliers.

Divisive (e.g., DIANA):
● All objects start in one cluster.
● Splits clusters step by step; stops when each object is its own cluster.
● Does not require pre-specifying the number of clusters.
● Ends with singleton clusters.
● Can separate outliers early during splits.

Example: for points {a, b, c, d, e}, agglomerative clustering first merges the closest pair, say {a, b}, then {c, d}, and keeps merging until the single cluster {a, b, c, d, e} remains; divisive clustering starts from {a, b, c, d, e} and splits off the most dissimilar group first, ending with five singletons.
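As a further illustration with library code, the sketch below runs SciPy's agglomerative linkage on five assumed 1-D points; the merge table it prints is exactly the bottom-up AGNES order, and cutting the same tree from the top mirrors the divisive view:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five assumed 1-D points: two tight pairs plus one far-away outlier.
X = np.array([[1.0], [2.0], [9.0], [10.0], [30.0]])

Z = linkage(X, method="single")  # AGNES-style bottom-up merging
print(Z)  # each row: (cluster i, cluster j, merge distance, new cluster size)
# The two tight pairs (1, 2) and (9, 10) merge first, each at distance 1,
# and merging continues until one cluster remains. A divisive method
# (e.g., DIANA) works top-down: starting from all five points, the
# outlying point 30 would be split off early.

print(fcluster(Z, t=2, criterion="maxclust"))  # cut the dendrogram into 2 clusters
```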
BIRCH vs. STING:
● BIRCH builds a CF-tree from the data points; STING divides the data space into hierarchical grids.
● BIRCH is good for numeric and relatively low-dimensional data; STING is good for spatial and high-dimensional data.
● BIRCH needs explicit thresholds (e.g., a diameter threshold); STING works with precomputed statistical measures.

Partitioning vs. Hierarchical Clustering:
● Partitioning methods optimize an objective function (e.g., SSE); hierarchical methods merge or split clusters based on distance measures.
● Partitioning is poor for clusters of arbitrary shape; hierarchical methods can capture complex cluster shapes.
● Partitioning suits applications like market segmentation; hierarchical clustering suits applications like phylogenetic tree construction.
● Partitioning has difficulty handling overlapping clusters; hierarchical clustering handles overlapping and nested clusters better.
● Partitioning allows reassignment of points during iterations; hierarchical clustering cannot undo a merge or split once made.
2. Give an application example where global outliers, contextual outliers and
collective outliers are all interesting. What are the attributes, and what
are the contextual and behavioral attributes? How is the relationship
among objects modeled in collective outlier detection?
Scenario: Intrusion Detection in a Computer Network.
● Global Outlier: A single computer sending a very unusual data packet.
● Contextual Outlier: A normal packet becomes suspicious based on the time
or network load (e.g., sending large files during maintenance hours).
● Collective Outlier: A group of computers suddenly sending denial-of-service
(DoS) packets simultaneously.
Attributes:
● Contextual Attributes: Time of access, network location.
● Behavioral Attributes: Packet size, type of activity.
Modeling Relationships in Collective Outlier Detection:
● The relationship among objects is modeled using background knowledge such as distance or similarity measures between data objects (e.g., which hosts communicate with one another); a group is flagged when its joint behavior deviates from the norm, even if each member looks normal on its own.
3. Give an application example of where the border between "normal
objects" and outliers is often unclear, so that the degree to which an
object is an outlier has to be well estimated.
Scenario: Temperature Measurement.
● A temperature of 80°F (26.67°C) could be normal during summer but an
outlier in winter.
● The distinction depends heavily on context (season, location).
● The border between normal and outlier behavior is often a gray area,
making exact thresholding difficult.
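One common way to estimate a degree of outlierness instead of a hard yes/no, sketched below, is a context-conditioned z-score and tail probability; the per-season means and standard deviations are assumed values for illustration:

```python
from math import erf, sqrt

# Assumed per-season temperature statistics (mean, std in deg F) -- illustrative only.
SEASON_STATS = {"summer": (85.0, 5.0), "winter": (35.0, 8.0)}

def outlier_degree(temp_f, season):
    """Return |z| and the two-sided tail probability under the season's
    normal model: a small probability means a high degree of outlierness."""
    mu, sigma = SEASON_STATS[season]
    z = (temp_f - mu) / sigma
    tail = 1.0 - erf(abs(z) / sqrt(2.0))  # P(|Z| >= |z|) for a standard normal
    return abs(z), tail

for season in ("summer", "winter"):
    z, p = outlier_degree(80.0, season)
    print(f"80F in {season}: |z|={z:.2f}, tail prob={p:.4f}")
# 80F is ordinary in summer (|z| = 1.0) but extreme in winter (|z| = 5.6),
# so the same value receives a very different outlier score by context.
```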
Supervised vs. Unsupervised Outlier Detection:
● Supervised: requires labeled data (normal and outliers). Unsupervised: does not require labeled data.
● Supervised: trained using past known outliers. Unsupervised: detects patterns without prior knowledge of outliers.
● Supervised: decision trees and random forests are used. Unsupervised: clustering methods like DBSCAN and K-Means are used (see the sketch after this list).
● Supervised: SVMs (Support Vector Machines) separate normal and outlier points. Unsupervised: points not fitting any cluster are detected as outliers.
● Supervised: effective when good labels are available. Unsupervised: useful when labels are unavailable.
● Supervised: can classify future instances once trained. Unsupervised: detects anomalies by measuring deviation from clusters.
● Supervised: sensitive to training data quality. Unsupervised: sensitive to the distance measure and cluster density.
● Supervised: good for applications like medical anomaly detection. Unsupervised: good for credit card fraud detection without labeled fraud.
● Supervised: requires constant retraining if new outliers appear. Unsupervised: automatically adapts as new patterns form.
● Supervised: less effective when labels are missing or incomplete. Unsupervised: suitable for dynamic, evolving datasets.
● Example: a bank uses past fraud transactions to train a decision tree (supervised); fraudulent credit card transactions are detected as those that don't fit into any spending cluster (unsupervised).
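A minimal sketch of the unsupervised side of this comparison: scikit-learn's DBSCAN labels points that fit no cluster as noise (label -1), which can be read directly as outliers. The synthetic "spending" data and the eps/min_samples settings are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
normal = rng.normal(loc=[0, 0], scale=0.5, size=(200, 2))  # dense "normal" spending
outliers = np.array([[5.0, 5.0], [-4.0, 6.0]])             # far from any cluster
X = np.vstack([normal, outliers])

labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
print("flagged as outliers (label -1):", np.where(labels == -1)[0])
```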
3. Statistical Data Mining Approaches.
● Regression Analysis: Predicts a continuous outcome and models relationships
between dependent and independent variables.
● Generalized Linear Models (GLM): Extends linear regression to allow response
variables that have error distribution models other than a normal distribution.
● Analysis of Variance (ANOVA): Tests whether there are significant differences
between the means of three or more groups.
● Mixed Effect Models: Models where data have both fixed effects (main factors)
and random effects (natural groupings like doctors, schools).
● Discriminant Analysis: A technique used for classifying data into predefined
categories and for dimensionality reduction.
● Factor Analysis: Reduces a large number of variables into fewer factors that
explain most of the variability in the data.
● Survival Analysis: Models the time until an event occurs; handles censored data (observations for which the event has not yet been observed).
● Grubbs' Test: Detects univariate outliers in a normally distributed dataset by comparing a test statistic against a critical value.
● Detection of Multivariate Outliers: Uses the Mahalanobis distance (MDist) to flag objects that deviate significantly from the mean in multi-dimensional data (see the sketch after this list).
● Mixture of Parametric Distributions: Models data with multiple distributions like
Gaussian Mixtures to detect low-probability points.
● Histogram-Based Outlier Detection (Non-Parametric): Defines outliers based on
rare, low-frequency bins in histograms.
● Scientific Data Mining: Combines scientific computing techniques with statistical
data mining for domains like healthcare and social sciences.
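Below is a minimal sketch of the Mahalanobis-distance check from the list above; the synthetic data and the cutoff (roughly the 0.999 quantile of a chi-square distribution with 2 degrees of freedom) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D data with one planted outlier that breaks the correlation.
X = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=300)
X = np.vstack([X, [[8.0, -8.0]]])

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
md2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared Mahalanobis distance

threshold = 13.8  # ~ chi-square(2 dof) 0.999 quantile, an assumed cutoff
print("outlier indices:", np.where(md2 > threshold)[0])
```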
General Steps in Statistical Data Mining:
● Collect data
● Preprocess (clean, normalize)
● Apply statistical modeling (e.g., regression, clustering)
● Interpret results.
Handling High Dimensional Data: Statistical methods like Principal Component
Analysis (PCA) and Factor Analysis help manage high dimensions.
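For example, a brief PCA sketch on assumed synthetic data: 50-dimensional points whose real structure lives in two directions are reduced to a 2-dimensional representation for downstream mining:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# 50-dimensional data generated from a 2-dimensional latent structure plus noise.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 50)) + 0.05 * rng.normal(size=(500, 50))

pca = PCA(n_components=2).fit(X)
print("variance explained:", pca.explained_variance_ratio_.sum())  # close to 1.0
X_low = pca.transform(X)  # 500 x 2 representation for downstream mining
```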
Application Examples:
● Financial Risk Modeling
● Medical Diagnosis
● Customer Behavior Analysis
● Reliability Testing.
Applications
1. Explain how data mining is applied in intrusion detection systems. Give an
example of detecting anomalies in login patterns.
● Intrusion Detection Systems (IDS) monitor network or system activities for
malicious activities.
● Data mining enhances IDS through pattern discovery techniques.
● Association rule mining is used to find relationships between system attributes
indicative of intrusions.
● Classification techniques like decision trees are used to detect known attacks.
● Clustering detects new or unknown attacks by finding unusual patterns.
● Outlier detection identifies anomalies that deviate significantly from normal
behavior.
● Data mining helps in building accurate user behavior profiles.
● Real-time intrusion detection is improved using stream data analysis.
● Distributed data mining enables analysis across multiple network locations.
● Mining frequent patterns helps identify common intrusion behaviors.
● Mining rare patterns assists in detecting previously unknown attacks.
● IDS use supervised learning for signature-based detection.
● IDS use unsupervised learning for anomaly-based detection.
● Ensemble methods combine multiple models to improve detection accuracy.
● Example: Repeated failed login attempts, unusual time of access, or login from an
unfamiliar device are detected as anomalies by mining login data.
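A sketch of that login example under assumed features and data: scikit-learn's IsolationForest (one anomaly-detection option among many) flags sessions whose failed-attempt count and login hour deviate from the bulk of the traffic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
# Assumed per-session features: [failed attempts, hour of login (0-23)].
normal = np.column_stack([rng.poisson(0.3, 500),         # almost no failures
                          rng.normal(13, 2, 500) % 24])  # daytime logins
suspicious = np.array([[25, 3.0], [18, 2.0]])            # many failures at ~3 a.m.
X = np.vstack([normal, suspicious])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = model.predict(X)  # -1 = anomaly, 1 = normal
print("anomalous sessions:", np.where(flags == -1)[0])
```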
2. Discuss the role of data mining in recommender systems. How does it
personalize suggestions based on user behavior?
● Recommender systems predict user preferences using data mining.
● Association rule mining discovers products/items frequently bought together.
● Clustering groups similar users based on preferences.
● Classification models predict if a user will like an item.
● Collaborative filtering recommends items based on user similarity.
● Content-based filtering recommends items similar to those the user liked before.
● Hybrid methods combine collaborative and content-based approaches.
● Data mining captures hidden patterns in user interactions.
● Mining clickstream data helps personalize web recommendations.
● Mining purchase history improves product recommendations.
● User demographic data is mined to predict preferences.
● Mining user feedback (ratings, reviews) helps refine recommendations.
● Trend analysis finds evolving user interests over time.
● Anomaly detection identifies unusual behaviors for fine-tuning.
● Systems personalize suggestions by matching user behaviors with discovered
patterns.
● Example: using collaborative filtering, recommend Item C to User 1 because similar users liked it (see the sketch after this list).
● Similarity measures like cosine similarity are used in collaborative filtering.
● Content features like genre, price, brand are used in content-based filtering.
● Outlier detection identifies fake ratings or abnormal behavior.
● Removing outliers (fake users, bots) improves recommendation reliability.
● Outliers skew user similarity computation; detecting them preserves accuracy.
● Final recommendation is improved by filtering anomalies before prediction.
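A minimal sketch of the Item C example above, with an assumed user-item rating matrix (treating 0 as "unrated" is a simplification): user-user cosine similarity weights the ratings that similar users gave the item User 1 has not rated:

```python
import numpy as np

# Assumed rating matrix: rows = users, cols = Items A..D; 0 = unrated.
R = np.array([[5, 4, 0, 1],   # User 1 has not rated Item C
              [5, 5, 4, 1],   # User 2 is similar to User 1 and liked Item C
              [1, 0, 5, 4]],  # User 3 has very different tastes
             dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target, item = 0, 2  # predict User 1's rating for Item C
sims = np.array([cosine(R[target], R[other]) for other in range(len(R))])
others = [u for u in range(len(R)) if u != target and R[u, item] > 0]
pred = sum(sims[u] * R[u, item] for u in others) / sum(sims[u] for u in others)
print(f"predicted rating: {pred:.2f}")  # high, driven mostly by the similar User 2
```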
4. Explain briefly how data mining is used to analyze data in the finance sector.
● Data mining is used for credit scoring and risk assessment.
● Customer segmentation is performed based on transaction behavior.
● Fraud detection is achieved by anomaly detection techniques.
● Stock market prediction uses time series analysis.
● Loan default prediction is modeled using classification algorithms (see the sketch after this list).
● Clustering helps to identify similar investment patterns.
● Rule mining finds association between different financial products.
● Text mining is used to analyze financial news sentiment.
● Data mining automates large-scale financial data processing.
● Regression models forecast revenue and financial metrics.
● Decision trees identify key factors affecting investment returns.
● Neural networks model complex financial behavior.
● Portfolio optimization is performed using predictive analytics.
● Anti-money laundering uses clustering and outlier detection.
● Bankruptcy prediction models are built using data mining techniques.
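As a concrete sketch of the loan-default point above (the features, data-generating rule, and model choice are all assumptions), a classifier trained on labeled outcomes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
# Assumed applicant features: [income (k$), debt-to-income ratio, late payments].
X = np.column_stack([rng.normal(60, 20, 1000),
                     rng.uniform(0, 0.8, 1000),
                     rng.poisson(1.0, 1000)])
# Synthetic rule: high debt ratio and many late payments raise default risk.
p = 1 / (1 + np.exp(-(-3 + 4 * X[:, 1] + 0.8 * X[:, 2] - 0.01 * X[:, 0])))
y = (rng.uniform(size=1000) < p).astype(int)  # 1 = default, 0 = repaid

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("holdout accuracy:", clf.score(X_te, y_te))
```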